C&R Software
Cloud Operations
Initiative Online

Cloud Ops AI Initiatives

This cluster powers C&R Software's internal cloud operations platform — hosting custom MCP servers that connect AI agents to our tooling, and the agentic systems that run our SRE, cloud engineering, and DevOps workflows.

Custom MCP Servers

Tools wired to AI workflows

nOps
Cost visibility, savings recommendations, and reservation/commitment optimization signals piped to agents.
Okta
User lifecycle, app assignments, and access reviews — exposed to agents for least-privilege remediation.
Datadog
Metrics, traces, logs, and monitor state queried by agents to ground SRE decisions in production data.
Atlassian
Jira, Confluence, Rovo, and Statuspage access for issue triage, runbook lookup, and incident comms.

Agentic AI

Autonomous and assistive agents

SRE
Incident triage, runbook execution, and post-incident summarization grounded in observability data.
Cloud Engineering
Architecture proposals, IaC review, and cost-aware design assistance across AWS and Azure.
DevOps
CI/CD analysis, dependency auditing, and pipeline troubleshooting bound to repo and platform context.

Introducing Cloud Ops AI Initiatives

A C&R Software platform initiative to give every Cloud Operations workflow a custom AI surface — grounded in our real tooling, hosted on infrastructure we control.

What this is

Cloud Ops AI Initiatives is a dedicated Amazon EKS cluster running two things:

  • Custom MCP servers — small services that expose our SaaS tooling (nOps, Okta, Datadog, the Atlassian suite, and more) to AI agents through the Model Context Protocol. Agents don't get screen-scraping access; they get real, scoped API calls.
  • Agentic AI workloads — autonomous and assistive agents for SRE, CloudOps Support, DevOps, and DBOps, running on top of the MCPs above.

Why this exists

Off-the-shelf assistants don't have access to our cost data, our identity provider, our observability stack, or our incident tooling. Without that context they hallucinate or give generic advice. By owning the MCP layer, we get:

  • Auditability — every AI tool call is a logged API request from a known service in our VPC.
  • Least privilege — each MCP runs with the narrowest API token its job requires. Agents inherit that scope.
  • Cost & latency control — internal traffic, no third-party MCP relays, no per-call SaaS surcharges.

How it's wired up

  • EKS cluster: cloudops-ai-initiatives in us-east-2, Kubernetes 1.35, internal-only ALBs. See EKS → Deployment.
  • Node autoscaling — Karpenter with a spot + on-demand NodePool, t3/t3a/m5/m5a/m6i/m6a families. See EKS → Karpenter.
  • Ingress — AWS Load Balancer Controller, internal scheme only. Public exposure is explicit opt-in.
  • Auth — Pod Identity Associations for every controller; SaaS API tokens stored as Kubernetes Secrets (eventually External Secrets backed by AWS Secrets Manager).

Roadmap

  • Now: nOps MCP deployed and consumable.
  • Next: Okta, Datadog, Atlassian (Jira/Confluence + Rovo + Statuspage).
  • After that: agentic workloads (SRE post-mortem, CloudOps triage, DevOps automation, DBOps).
  • Later: Workload autoscaler (KEDA), centralized API key rotation, public read-only status page.

nOps MCP

An MCP server that exposes the nOps cloud cost optimization API to AI agents. Cost visibility, MAP migration tracking, Compute Copilot, Business Unit Economics — all callable as MCP tools.

Multi-tenant by design

This MCP is shared. There is no server-side nOps API key — each caller sends their own key on every request via the X-Nops-Api-Key header. The MCP server builds a fresh nOps client per session with the caller's key, so every upstream nOps API call shows up in nOps audit logs as that user.

Source: bitbucket.org/cr-software/nops_mcp, branch multi-tenant-http.
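
The per-session key handling described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in the nops_mcp repository: NopsClient and open_session are hypothetical names, and only the X-Nops-Api-Key header, the 401 status, and the -32001 error code come from this page.

```python
import json

API_KEY_HEADER = "X-Nops-Api-Key"
AUTH_REQUIRED = -32001  # JSON-RPC error code this page documents for a missing header


class NopsClient:
    """Hypothetical stand-in for an upstream nOps API client."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key


def open_session(headers: dict, request_id: int = 1) -> tuple:
    """Return (http_status, payload): a per-session client, or a JSON-RPC error."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    key = lowered.get(API_KEY_HEADER.lower(), "").strip()
    if not key:
        error = {
            "jsonrpc": "2.0",
            "id": request_id,
            "error": {
                "code": AUTH_REQUIRED,
                "message": f"{API_KEY_HEADER} header required",
            },
        }
        return 401, error
    # A fresh client per session: no key is stored server-side or shared between callers.
    return 200, NopsClient(api_key=key)


if __name__ == "__main__":
    status, payload = open_session({})
    print(status, json.dumps(payload))
```

Because each session builds its own client, two callers never share credentials, which is what lets nOps attribute every upstream call to the right user.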

1. Get your nOps API key

  1. Log in to app.nops.io.
  2. Open Organization Settings → API Key from the top-right user menu.
  3. Click Generate API Key and copy the value. Format: {client_id}.{random_string}.
Screenshot: nOps → Organization Settings → API Key

2. Use it from an AI client

Paste your nOps key into the headers block of your client's MCP config — never into the cluster. Claude Desktop:

{
  "mcpServers": {
    "nops": {
      "url": "https://ai.cloudops.crsoftwarecloud.com/nops/sse",
      "transport": "sse",
      "headers": {
        "X-Nops-Api-Key": "YOUR_NOPS_KEY_HERE"
      }
    }
  }
}

Same shape works for Cursor, Kiro, and any other MCP-compatible client that supports custom transport headers. If you omit the header, the server returns 401 with JSON-RPC error code -32001.
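
If you would rather generate the config than hand-edit it, the entry above can be produced with a short script. This is a sketch under assumptions: the NOPS_API_KEY environment variable and both function names are illustrative, and the shape check only mirrors the documented {client_id}.{random_string} format loosely.

```python
import json
import os

NOPS_MCP_URL = "https://ai.cloudops.crsoftwarecloud.com/nops/sse"


def looks_like_nops_key(key: str) -> bool:
    """Loose shape check for the documented {client_id}.{random_string} format."""
    client_id, dot, secret = key.partition(".")
    return bool(client_id) and dot == "." and bool(secret)


def build_mcp_config(api_key: str) -> dict:
    """Build the mcpServers entry shown above for a given nOps key."""
    if not looks_like_nops_key(api_key):
        raise ValueError("expected a key shaped like {client_id}.{random_string}")
    return {
        "mcpServers": {
            "nops": {
                "url": NOPS_MCP_URL,
                "transport": "sse",
                "headers": {"X-Nops-Api-Key": api_key},
            }
        }
    }


if __name__ == "__main__":
    # NOPS_API_KEY is an assumed variable name; the fallback is a placeholder, not a real key.
    key = os.environ.get("NOPS_API_KEY", "1234.placeholder")
    print(json.dumps(build_mcp_config(key), indent=2))
```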

3. Deployment details (FYI)

Already deployed. Manifests live at mcps/nops/:

  • A Deployment in the mcps namespace running node:20-alpine with the gzipped multi-tenant bundle (nops-mcp.cjs.gz) mounted from a ConfigMap. The MCP runs natively in HTTP mode (MCP_MODE=http) — no supergateway wrapper, because per-request header pass-through requires the MCP to own its HTTP transport.
  • A Service on port 80 → 8000.
  • An Ingress joining the shared cloudops-ai-public IngressGroup — adds the /nops path rule on the cloudops-landing-public ALB.

Live endpoint: https://ai.cloudops.crsoftwarecloud.com/nops/sse

Local stdio mode is preserved for development per the upstream README — useful when iterating on a new tool against your own nOps tenant before pushing to the branch.

Available tools

Full list in the upstream README. Highlights:

Tool                                  What it does
nops_get_cost_summary                 Comprehensive cost summary across all linked accounts
nops_get_daily_cost                   Daily cost grouped by service, account, or tag
nops_list_map_projects                AWS MAP migration projects + tracked resources
nops_list_cost_targets                Budget/cost-target configurations
nops_create_business_unit_economics   Create a BU Economics report

Troubleshooting

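  • 401 / 403 from the MCP: the API key has expired or lacks scope. Regenerate it in nOps and update the X-Nops-Api-Key value in your client config. There is no cluster Secret to re-create, because the server holds no nOps key.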
  • "Session Auth Required": certain /svc/ endpoints (Compute Copilot, Essentials) still require session auth on most accounts. Contact nOps support to enable API key access for those.
  • 401 on initialize with code: -32001: client config is missing the X-Nops-Api-Key header. Add a headers block to the MCP server entry in your client config.
  • Tool calls return 401 from upstream nOps but the MCP itself responds fine: the per-call key is malformed or expired. Regenerate the key in nOps and update your client config.

Okta MCP

An MCP server for Okta — user lifecycle, group membership, app assignments, access reviews — exposed to agents for least-privilege remediation.

In progress

The MCP server is being built. This page documents the API token creation steps (good to do in advance) and the planned tool surface.

1. Create an Okta API token

The MCP authenticates to the Okta API using an API token. Tokens act on behalf of the user who created them, inheriting their admin permissions. For an MCP that AI agents will use, create the token under a dedicated service account with only the Okta admin roles it actually needs (e.g., Read-only Administrator for inventory queries, Group Administrator for membership remediation).

Steps

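  1. Sign in to the Okta Admin Console at https://<your-org>-admin.okta.com as the dedicated service user (see Scope it tightly). The token inherits the permissions of whoever creates it, so do not create it from a Super Admin account.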
  2. In the side nav: Security → API → Tokens.
  3. Click Create token.
  4. Give the token a descriptive name like cloudops-ai-okta-mcp.
  5. Copy the token value immediately — Okta only shows it once. Format: 00<random-40-chars>.
  6. Click OK, got it.
Screenshot: Okta Admin → Security → API → Tokens → Create token

Token lifecycle

  • API tokens are tied to the creator's user account. If the creator is deactivated, the token stops working.
  • Tokens that are unused for 30 days are automatically deactivated. Schedule a periodic health-check from the MCP to keep it alive.
  • Rotate tokens at least every 90 days. Plan the rotation by generating a new token, updating the cluster Secret, then revoking the old one.
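
The two clocks above (30 days of inactivity, 90 days between rotations) are easy to confuse, so here is a small sketch of the date arithmetic. The 7-day warning margin and the action names are assumptions made for illustration; only the 30-day and 90-day figures come from this page.

```python
from datetime import datetime, timedelta, timezone

INACTIVITY_LIMIT = timedelta(days=30)  # unused tokens are deactivated after 30 days
ROTATION_PERIOD = timedelta(days=90)   # rotation policy from this page
SAFETY_MARGIN = timedelta(days=7)      # assumed early-warning window


def token_actions(created: datetime, last_used: datetime, now: datetime) -> list:
    """List the maintenance actions a token needs at time `now`."""
    actions = []
    idle = now - last_used
    if idle >= INACTIVITY_LIMIT:
        actions.append("deactivated-by-inactivity")
    elif idle >= INACTIVITY_LIMIT - SAFETY_MARGIN:
        actions.append("send-keepalive")
    if now - created >= ROTATION_PERIOD:
        actions.append("rotate")
    return actions


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(token_actions(now - timedelta(days=95), now - timedelta(days=25), now))
```

A health-check job could call something like token_actions on a schedule and issue a cheap read request whenever "send-keepalive" appears.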

Scope it tightly

Okta API tokens inherit the creator's admin role. For the MCP:

  • Create a dedicated service-user (e.g., cloudops-mcp@crsoftware.com) in your Okta org.
  • Assign only the admin roles required (Read-only Administrator + Group Administrator typically suffice).
  • Generate the token while signed in as that service user, not as your personal admin.

2. Store the token in the cluster

kubectl create secret generic okta-mcp \
  --namespace mcps \
  --from-literal=OKTA_DOMAIN='<your-org>.okta.com' \
  --from-literal=OKTA_API_TOKEN='<token>' \
  --dry-run=client -o yaml | kubectl apply -f -
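
For readers who want to see what that command produces, the sketch below builds an equivalent Secret object in Python. It is an illustration only: secret_manifest is a hypothetical helper, and the literal values are the same placeholders used in the command above.

```python
import base64
import json


def secret_manifest(name: str, namespace: str, literals: dict) -> dict:
    """Build an Opaque Secret; values under `data` must be base64-encoded."""
    data = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in literals.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": data,
    }


if __name__ == "__main__":
    manifest = secret_manifest(
        "okta-mcp",
        "mcps",
        {"OKTA_DOMAIN": "<your-org>.okta.com", "OKTA_API_TOKEN": "<token>"},
    )
    # kubectl accepts JSON as well as YAML, so this output can be piped to `kubectl apply -f -`.
    print(json.dumps(manifest, indent=2))
```

Note that base64 is an encoding, not encryption: anyone who can read the Secret can recover the token.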

3. Deploy & use

Deployment manifests and endpoint configuration will land here once the MCP server is built. Planned endpoint: https://ai.cloudops.crsoftwarecloud.com/okta/sse.

Planned tool surface

Tool                          Description
okta_get_user                 Look up a user by email, ID, or login
okta_list_user_groups         Group memberships for a user
okta_list_app_assignments     Apps a user (or group) is assigned to
okta_search_users             SCIM-style filter over the user directory
okta_get_system_log           Recent System Log events (auth, group, lifecycle)
okta_add_user_to_group        Add a user to a group (gated by autoApprove allow-list)
okta_remove_user_from_group   Remove a user from a group (gated)
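
One way to read "gated" in the table above: read-only tools run freely, while write tools need a human unless they are explicitly allow-listed. The sketch below assumes that interpretation; the server is not built yet, so needs_human_approval and the WRITE_TOOLS set are hypothetical.

```python
# Tools from the planned surface that change state in Okta.
WRITE_TOOLS = {"okta_add_user_to_group", "okta_remove_user_from_group"}


def needs_human_approval(tool: str, auto_approve: set) -> bool:
    """Read tools never need approval; write tools do unless allow-listed."""
    if tool not in WRITE_TOOLS:
        return False
    return tool not in auto_approve


if __name__ == "__main__":
    allow_list = {"okta_add_user_to_group"}  # example allow-list, not a recommendation
    for tool in ("okta_get_user", "okta_add_user_to_group", "okta_remove_user_from_group"):
        print(tool, needs_human_approval(tool, allow_list))
```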

Datadog MCP

An MCP server exposing Datadog's metrics, traces, logs, monitors, and incidents to AI agents — so SRE and CloudOps Support agents can ground their reasoning in production observability data.

1. Create a Datadog API key

Datadog has two kinds of keys, and the MCP needs both:

  • API key: identifies the organization. Required on every API request, and sufficient on its own for write/submission endpoints.
  • Application key: tied to the user who created it. Required in addition to the API key for read/query endpoints (metrics queries, log searches, monitor reads, etc.).

Create the API key

  1. Log in to app.datadoghq.com (or your regional URL — app.datadoghq.eu, us3.datadoghq.com, etc.).
  2. Open the user menu (bottom-left) → Organization Settings → API Keys.
  3. Click New Key. Name it cloudops-ai-datadog-mcp.
  4. Copy the 32-char hex string. You can re-view this later — it's not a one-shot reveal.

Create the Application key

  1. Same nav: Organization Settings → Application Keys.
  2. Click New Key. Name it cloudops-ai-datadog-mcp.
  3. Scope the key: click Edit Scopes and grant only the minimum scopes the MCP needs:
    • events_read, logs_read_data, metrics_read
    • monitors_read, incident_read, apm_read
    • dashboards_read, service_dependencies_read
  4. Save and copy the value. Format: 40-char hex string.
Screenshot: Datadog → Organization Settings → Application Keys → New Key (scoped)

Pick the right site

The MCP needs to know which Datadog site to call. Common values:

URL                 Site value
app.datadoghq.com   datadoghq.com
us3.datadoghq.com   us3.datadoghq.com
us5.datadoghq.com   us5.datadoghq.com
app.datadoghq.eu    datadoghq.eu
ap1.datadoghq.com   ap1.datadoghq.com
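
To show how the site value and the two keys fit together, here is a minimal sketch that assembles a Datadog API request without sending it. The helper name datadog_request is hypothetical; the api.<site> host pattern and the DD-API-KEY / DD-APPLICATION-KEY headers follow Datadog's public API conventions.

```python
def datadog_request(site: str, api_key: str, app_key: str, path: str) -> tuple:
    """Return (url, headers) for a Datadog API call on the given site."""
    url = f"https://api.{site}/{path.lstrip('/')}"
    headers = {
        "DD-API-KEY": api_key,          # identifies the organization
        "DD-APPLICATION-KEY": app_key,  # carries the scoped read permissions
        "Accept": "application/json",
    }
    return url, headers


if __name__ == "__main__":
    # Placeholder credentials; real values would come from the datadog-mcp Secret.
    url, headers = datadog_request("datadoghq.eu", "<api-key>", "<application-key>", "/api/v1/monitor")
    print(url)
    print(sorted(headers))
```

Picking the wrong site usually shows up as an authentication failure rather than a DNS error, because every site resolves but only one knows your keys.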

2. Store both keys in the cluster

kubectl create secret generic datadog-mcp \
  --namespace mcps \
  --from-literal=DD_SITE='datadoghq.com' \
  --from-literal=DD_API_KEY='<api-key>' \
  --from-literal=DD_APP_KEY='<application-key>' \
  --dry-run=client -o yaml | kubectl apply -f -

3. Deploy & use

Deployment lands at mcps/datadog/. Endpoint: https://ai.cloudops.crsoftwarecloud.com/datadog/sse.

Key rotation

Datadog doesn't auto-rotate. We rotate quarterly:

  1. Create a new App key with the same scopes.
  2. Update the cluster Secret.
  3. Restart the MCP pods: kubectl rollout restart deployment/datadog-mcp -n mcps.
  4. Verify the MCP is healthy, then delete the old App key.

Atlassian MCP (Jira & Confluence)

An MCP server exposing Jira Cloud and Confluence Cloud to agents — issue search, transitions, comments, page reads, page creation, label management, etc.

1. Create an Atlassian API token

Atlassian Cloud authenticates API requests with a personal API token tied to a user account. For an MCP, create the token on a dedicated service account, not a person.

Steps

  1. Sign in to id.atlassian.com/manage-profile/security/api-tokens as the service account.
  2. Click Create API token.
  3. Label it cloudops-ai-atlassian-mcp.
  4. Optionally set an expiry (recommended: 90 days, then rotate).
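  5. Copy the value; the token is shown only once. Format: an opaque alphanumeric string. Older tokens were about 24 characters; newer ones are considerably longer and begin with ATATT.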
Screenshot: id.atlassian.com → Manage profile → Security → API tokens → Create API token

Service account setup

  • Create an Atlassian user (free or paid seat depending on Jira/Confluence licensing) with email like cloudops-mcp@crsoftware.com.
  • Add it to the Jira/Confluence sites the MCP needs to access.
  • Grant the minimum role: Service Project Customer or Member, plus project-specific permissions for the projects in scope.
  • Avoid granting Site Admin or Organization Admin.

2. Store the token in the cluster

Atlassian API requests are HTTP Basic-authed with email:token:

kubectl create secret generic atlassian-mcp \
  --namespace mcps \
  --from-literal=ATLASSIAN_SITE='crsoftware.atlassian.net' \
  --from-literal=ATLASSIAN_EMAIL='cloudops-mcp@crsoftware.com' \
  --from-literal=ATLASSIAN_API_TOKEN='<token>' \
  --dry-run=client -o yaml | kubectl apply -f -
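
The three values stored above are all the MCP needs to authenticate. As a sketch of the Basic-auth scheme, the helper below turns the email and token into an Authorization header; basic_auth_header is a hypothetical name and the credentials shown are placeholders.

```python
import base64


def basic_auth_header(email: str, api_token: str) -> dict:
    """Encode email:token as an HTTP Basic Authorization header."""
    credentials = f"{email}:{api_token}".encode("utf-8")
    encoded = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {encoded}"}


if __name__ == "__main__":
    # Placeholder token; the real one lives in the atlassian-mcp Secret.
    print(basic_auth_header("cloudops-mcp@crsoftware.com", "<token>"))
```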

3. Deploy & use

Deployment lands at mcps/atlassian/. Endpoint: https://ai.cloudops.crsoftwarecloud.com/atlassian/sse.

Tool surface

Tool                     Description
jira_search_issues       JQL-based issue search
jira_get_issue           Issue detail incl. comments and transitions
jira_transition_issue    Move issue through workflow (gated)
jira_add_comment         Comment on an issue (gated)
confluence_search        CQL search over pages/spaces
confluence_get_page      Page body in storage format or markdown
confluence_create_page   Create a page (gated)

Atlassian Rovo MCP

An MCP server bridging C&R agents to Atlassian Rovo — Rovo Search across all connected Atlassian products, Rovo Chat sessions, and Rovo Agents.

1. Enable Rovo for your Atlassian organization

Rovo is an org-level Atlassian Cloud feature. Confirm it's turned on:

  1. Open admin.atlassian.com as an org admin.
  2. Navigate to Products → Rovo.
  3. Verify the subscription is active and the products you want indexed (Jira, Confluence, Bitbucket, Trello, etc.) are toggled on.
Screenshot: admin.atlassian.com → Products → Rovo

2. Get a Rovo API credential

Rovo API access is granted via the same Atlassian API token mechanism as Jira/Confluence, but the calling user must have Rovo enabled on their seat. For the MCP service account:

  1. Confirm Rovo is enabled for the cloudops-mcp@crsoftware.com seat in admin.atlassian.com → Directory → Users → <user> → Product access.
  2. Reuse the API token created for the Atlassian MCP, or create a separate token labeled cloudops-ai-rovo-mcp for tighter blast radius.

3. Store the credential

kubectl create secret generic rovo-mcp \
  --namespace mcps \
  --from-literal=ATLASSIAN_SITE='crsoftware.atlassian.net' \
  --from-literal=ATLASSIAN_EMAIL='cloudops-mcp@crsoftware.com' \
  --from-literal=ATLASSIAN_API_TOKEN='<token>' \
  --dry-run=client -o yaml | kubectl apply -f -

4. Deploy & use

Endpoint: https://ai.cloudops.crsoftwarecloud.com/rovo/sse. Planned tools: rovo_search, rovo_chat, rovo_list_agents, rovo_invoke_agent.

Atlassian Statuspage MCP

An MCP server for Atlassian Statuspage — read current component status, list active incidents, and (gated) create or update incidents and maintenance windows.

1. Create a Statuspage API key

  1. Sign in to manage.statuspage.io as an org admin for the relevant page.
  2. Top-right: Your account → API info.
  3. Click Create key +.
  4. Label: cloudops-ai-statuspage-mcp.
  5. Copy the value. Format: 36-char UUID-style. Token is reusable; you can re-view it.
Screenshot: manage.statuspage.io → Your account → API info → Create key

Find your Page ID

Each Statuspage page has a unique page_id. From the Statuspage dashboard URL it's the segment after /pages/. The MCP scopes its calls to a single page at a time, so include the page ID in the Secret.
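
Extracting the page ID by hand is error-prone, so here is a small sketch that pulls it from a dashboard URL. The function name and the sample URL are illustrative; the only rule taken from this page is that the ID is the path segment after /pages/.

```python
from urllib.parse import urlparse


def page_id_from_url(dashboard_url: str) -> str:
    """Return the path segment that follows /pages/ in a Statuspage dashboard URL."""
    segments = [s for s in urlparse(dashboard_url).path.split("/") if s]
    if "pages" not in segments:
        raise ValueError("URL has no /pages/ segment")
    index = segments.index("pages")
    if index + 1 >= len(segments):
        raise ValueError("URL ends at /pages/ with no page ID")
    return segments[index + 1]


if __name__ == "__main__":
    # Made-up page ID for illustration.
    print(page_id_from_url("https://manage.statuspage.io/pages/abc123xyz/incidents"))
```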

2. Store the credentials

kubectl create secret generic statuspage-mcp \
  --namespace mcps \
  --from-literal=STATUSPAGE_API_KEY='<api-key>' \
  --from-literal=STATUSPAGE_PAGE_ID='<page-id>' \
  --dry-run=client -o yaml | kubectl apply -f -

3. Deploy & use

Endpoint: https://ai.cloudops.crsoftwarecloud.com/statuspage/sse. Planned tools:

Tool                              Description
statuspage_get_status             Current component / incident state for the page
statuspage_list_incidents         Recent incidents (filterable by status)
statuspage_create_incident        Open a new incident (gated)
statuspage_update_incident        Add an update / change status (gated)
statuspage_schedule_maintenance   Schedule a maintenance window (gated)

EKS Deployment

The cluster, its supporting infrastructure, and three reproducible deploy paths (shell scripts, CloudFormation, Terraform). Source of truth: the eks-cloudops-ia-initiatives Bitbucket repository.

Cluster facts

Name                cloudops-ai-initiatives
Region              us-east-2
Kubernetes          1.35
Managed nodegroup   t3.medium (min=2, max=2, hosts Karpenter + LBC + system pods)
VPC / NAT           3 public + 3 private subnets across 3 AZs, single NAT gateway (1 EIP)
Ingress posture     Internal ALBs only; public exposure is explicit opt-in
Auth pattern        EKS Pod Identity

Three deploy paths

Pick one and stick with it for a given cluster — they all produce the same logical environment.

Shell scripts (most explicit)

git clone git@bitbucket.org:cr-software/eks-cloudops-ia-initiatives.git
cd eks-cloudops-ia-initiatives

export AWS_PROFILE=AdmAccess-486588380443
./scripts/00-prereqs.sh
./scripts/01-create-cluster.sh
./scripts/02-install-karpenter.sh
./scripts/03-install-lbc.sh
./scripts/04-deploy-landing.sh

Terraform (single tool, end-to-end)

cd terraform
cp terraform.tfvars.example terraform.tfvars
terraform init && terraform apply

CloudFormation (AWS-side only)

aws cloudformation deploy \
  --stack-name cloudops-ai-initiatives-cluster \
  --template-file cloudformation/01-eks-cluster.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides ClusterName=cloudops-ai-initiatives

K8s objects (manifests under manifests/) are then applied with kubectl.

Deeper docs

See docs/01-eks-cluster.md, docs/02-karpenter.md, docs/03-aws-lbc.md, and docs/troubleshooting.md in the repo for the real issues hit during build-out (EIP quota, Pod Identity webhook race, Karpenter anti-affinity, macOS keychain).

Node Autoscaler — Karpenter

Karpenter handles all node provisioning above the 2-node managed nodegroup. It looks at pending pods and picks the cheapest EC2 shape that satisfies them — across spot + on-demand, multiple instance families.

Upstream docs: karpenter.sh. We run v1.12.1.

What's installed

Controller           kube-system/karpenter (2 replicas, Pod Identity authed)
NodePool             default (spot + on-demand; t3, t3a, m5, m5a, m6i, m6a; sizes ≥ medium)
EC2NodeClass         default (AL2023 AMI family, private subnets, cluster SG)
Interruption queue   SQS cloudops-ai-initiatives + EventBridge rules for Spot, EC2 state change, rebalance, scheduled events
Consolidation        WhenEmptyOrUnderutilized, consolidateAfter: 1m
Node TTL             720h (30 days), forces periodic AMI refresh
Cluster-wide limit   100 vCPU, 200 GiB memory (safety cap)

NodePool YAML

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["t3", "t3a", "m5", "m5a", "m6i", "m6a"]
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: ["nano", "micro", "small"]
      expireAfter: 720h
  limits:
    cpu: "100"
    memory: 200Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

Why the managed nodegroup is min=2

The Karpenter Helm chart adds a hardcoded affinity to its own pods (karpenter.sh/nodepool DoesNotExist) — they refuse to run on Karpenter-provisioned nodes. With 2 controller replicas, pod-anti-affinity, and zone topology-spread, the minimum is 2 managed nodes across 2 AZs.

Try it

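# Create the deployment idle, give each pod a CPU request, then scale up.
# Without resource requests the pods may fit on existing nodes and never go Pending.
kubectl create deploy nginx --image=nginx --replicas=0
kubectl set resources deploy nginx --requests=cpu=500m
kubectl scale deploy nginx --replicas=30
kubectl get nodeclaims -w   # Karpenter provisions a node
kubectl get nodes -L karpenter.sh/nodepool,karpenter.sh/capacity-type

kubectl delete deploy nginx  # ~1 min later Karpenter consolidates away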

Monitoring & Upgrade

How we observe the cluster and how we upgrade it.

Monitoring stack

  • EKS control plane logs — currently disabled (default). Enable when ready:
    # --approve applies the change; without it eksctl only prints the plan.
    eksctl utils update-cluster-logging \
      --region us-east-2 --cluster cloudops-ai-initiatives \
      --enable-types=all --approve
  • Container Insights — install the CloudWatch Agent + Fluent Bit DaemonSets for per-pod and per-node metrics + logs in CloudWatch.
  • Datadog Agent — once the Datadog MCP and account are confirmed, the standard pattern is the official datadog Helm chart in kube-system, with API key from the datadog-mcp Secret (or its own).
  • Karpenter metrics — exposed on the controller's /metrics endpoint. Worth scraping with Prometheus or DD's OpenMetrics integration.

Status

Observability tooling isn't deployed yet. The Datadog MCP work is the natural dependency to unblock this.

Upgrade playbook

EKS minor versions are released roughly every 4 months. Each is supported in standard support for ~14 months. Upgrade rhythm:

  1. Check the release notes for the target version: EKS Kubernetes versions.
  2. Audit deprecated APIs in our workloads (we use pluto or kubent).
  3. Upgrade the control plane:
    eksctl upgrade cluster --name cloudops-ai-initiatives --region us-east-2 --version <target> --approve
  4. Upgrade managed addons (coredns, kube-proxy, vpc-cni, metrics-server, pod-identity-agent) to the recommended versions for the new K8s minor.
  5. Roll the managed nodegroup:
    eksctl upgrade nodegroup \
      --cluster cloudops-ai-initiatives --region us-east-2 \
      --name standard-workers --kubernetes-version <target>
  6. Bump Karpenter's EC2NodeClass.amiSelectorTerms if it pins a specific AMI ID (currently uses al2023@latest, so no action). Karpenter will roll its own nodes within 30 days (the expireAfter), or sooner if disrupted.
  7. Verify: kubectl get nodes -o wide shows all nodes on the new minor, all addons healthy, all workloads running.
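
EKS upgrades the control plane one minor version at a time, so a cluster that has fallen behind needs the playbook above run once per step. The sketch below lists those steps; upgrade_path and parse_minor are hypothetical helpers, and the version strings in the example are only illustrative.

```python
def parse_minor(version: str) -> tuple:
    """Split a 'major.minor' Kubernetes version string into integers."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def upgrade_path(current: str, target: str) -> list:
    """List each control-plane version to step through, one minor at a time."""
    cur_major, cur_minor = parse_minor(current)
    tgt_major, tgt_minor = parse_minor(target)
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("target must share the major version and not be older than current")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


if __name__ == "__main__":
    # Each entry corresponds to one full pass through steps 1 to 7.
    print(upgrade_path("1.33", "1.35"))
```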

AWS Access via Web Popups

How humans get into AWS for the C&R accounts — single-sign-on through Okta, no long-lived IAM users.

The pattern

AWS access for engineers is brokered through AWS IAM Identity Center (formerly AWS SSO), federated from Okta. Each engineer signs in once via Okta, then sees the AWS portal listing every account+role they're authorized for. Clicking a tile opens a popup AWS console session or copies short-lived CLI credentials.

For the CLI

Use aws configure sso once to set up your profile, then sign in interactively:

aws configure sso
# SSO start URL: https://crsoftware.awsapps.com/start  (or your equivalent)
# SSO region:    us-east-2
# Pick the account + role → name the profile, e.g., AdmAccess-486588380443

aws sso login --profile AdmAccess-486588380443
aws sts get-caller-identity --profile AdmAccess-486588380443

For the console

  1. Open the AWS access portal (your IAM Identity Center start URL).
  2. Sign in with Okta.
  3. Click the account → role tile you want. AWS opens the console with the role pre-assumed.
  4. For a short-lived CLI session, click Copy credentials next to a role and paste into a terminal.

What lives in this cluster

The cluster itself is administered through AWSReservedSSO_AdmAccess — same SSO mechanism. EKS access entries are configured for the cluster creator (and additional roles can be added). No aws-auth ConfigMap maintenance needed.
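
When debugging access, it helps to confirm that the identity returned by aws sts get-caller-identity is an Identity Center session for the expected permission set. The sketch below parses such an ARN. It assumes the usual AWSReservedSSO_<PermissionSet>_<suffix> role naming; sso_permission_set is a hypothetical helper and the sample ARN in the usage line is made up.

```python
import re
from typing import Optional

ASSUMED_ROLE_ARN = re.compile(
    r"^arn:aws:sts::(?P<account>\d{12}):assumed-role/(?P<role>[^/]+)/(?P<session>.+)$"
)
SSO_PREFIX = "AWSReservedSSO_"


def sso_permission_set(arn: str) -> Optional[str]:
    """Return the permission-set name for an Identity Center session ARN, else None."""
    match = ASSUMED_ROLE_ARN.match(arn)
    if not match or not match.group("role").startswith(SSO_PREFIX):
        return None
    remainder = match.group("role")[len(SSO_PREFIX):]
    # Drop the trailing _<suffix> if present; keep the whole name otherwise.
    name, separator, _suffix = remainder.rpartition("_")
    return name if separator else remainder


if __name__ == "__main__":
    arn = "arn:aws:sts::486588380443:assumed-role/AWSReservedSSO_AdmAccess_0123456789abcdef/jane@example.com"
    print(sso_permission_set(arn))
```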

Roadmap

A self-service web tool to mint short-lived kubectl contexts (signed STS token + aws eks get-token) is on the platform engineering backlog.

EKS & ECS Cluster Management (Web UI)

A future self-service Web UI for routine cluster operations — restart a node, trigger a managed-nodegroup roll, run a one-shot EKS addon update, drain an ECS service task — without engineers needing to switch to the AWS console or remember the right CLI incantation.

Coming soon

This is a planned Platform Engineering deliverable. The page below describes the intended scope.

Target operations

  • EKS: list clusters, show nodegroup status, restart/drain/cordon a node, trigger a nodegroup roll, upgrade an EKS addon, view recent events.
  • ECS: list services per cluster, restart a service (force new deployment), scale a service, view recent task failure events.
  • Cross-cutting: see which IAM principal triggered a given action, full audit log of every action taken through the UI.

Design constraints

  • Internal-only (no public ingress). Single-sign-on via Okta — same identity surface as AWS Access.
  • RBAC: read-everywhere for any signed-in engineer; write actions gated by group membership (e.g., sre, cloud-engineering).
  • Every action emits a structured audit event (who, what, when, on which resource, with which IAM principal assumed). Events stream to CloudTrail and Datadog.
  • No direct AWS console pass-through — actions are explicit, parameterized API calls. Easier to audit, easier to constrain.
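
The RBAC and audit constraints above can be sketched together: decide whether the caller may act, then record the decision as a structured event. Everything here is illustrative, since the UI is not built yet. The field names follow the "who, what, when, resource, IAM principal" list, while AuditEvent, authorize, record, and the action naming are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

WRITE_GROUPS = {"sre", "cloud-engineering"}  # example groups named on this page


@dataclass(frozen=True)
class AuditEvent:
    who: str            # signed-in engineer
    what: str           # action name, e.g. "eks.node.drain" (hypothetical naming)
    when: str           # ISO 8601 timestamp in UTC
    resource: str       # target resource identifier
    iam_principal: str  # IAM role assumed to perform the action
    allowed: bool       # outcome of the RBAC check


def authorize(action_is_write: bool, groups: set) -> bool:
    """Reads are open to any signed-in engineer; writes need a privileged group."""
    return (not action_is_write) or bool(groups & WRITE_GROUPS)


def record(who, what, resource, iam_principal, action_is_write, groups, now=None) -> str:
    """Return the audit event as a JSON line, including denied attempts."""
    now = now or datetime.now(timezone.utc)
    event = AuditEvent(
        who=who,
        what=what,
        when=now.isoformat(),
        resource=resource,
        iam_principal=iam_principal,
        allowed=authorize(action_is_write, groups),
    )
    return json.dumps(asdict(event), sort_keys=True)


if __name__ == "__main__":
    print(record("jane", "eks.node.drain", "node/example", "arn:aws:iam::111122223333:role/example", True, {"sre"}))
```

Recording denied attempts as well as successful ones keeps the audit trail useful for access reviews, not only for incident forensics.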

Tech direction

  • Backend: Go service running in this cluster, assuming a least-privilege IAM role per cluster/account it manages.
  • Frontend: React + Vite, same dark/cyan theme as this site.
  • Deploy: same pattern as the landing page — namespace, deployment, internal ALB ingress.

Pipelines

CI/CD lives in Bitbucket Pipelines and Jenkins, depending on the workload. This section documents the patterns and where to plug a new project in.

Bitbucket Pipelines

Default for new projects in the cr-software Bitbucket workspace. Configured via bitbucket-pipelines.yml at repo root.

  • Runners: hosted Bitbucket runners for typical builds; self-hosted runners (on this cluster) for jobs that need VPC access (e.g., deploying to cloudops-ai-initiatives).
  • Secrets: workspace or repository variables for non-sensitive config; deployment environment variables for sensitive credentials (mark them Secured so they're masked in logs).
  • OIDC: prefer Bitbucket Pipelines' OpenID Connect to assume an AWS role at runtime — no long-lived AWS access keys in pipeline variables.

Sample build → push → kubectl apply

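pipelines:
  branches:
    main:
      - step:
          name: Build & push image
          oidc: true        # exposes BITBUCKET_STEP_OIDC_TOKEN to this step
          services:
            - docker        # required for docker build/push
          script:
            # Exchange the step's OIDC token for short-lived AWS credentials.
            - export AWS_ROLE_ARN=arn:aws:iam::486588380443:role/bitbucket-deployer
            - export AWS_WEB_IDENTITY_TOKEN_FILE=$(pwd)/web-identity-token
            - echo "$BITBUCKET_STEP_OIDC_TOKEN" > "$AWS_WEB_IDENTITY_TOKEN_FILE"
            - aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin 486588380443.dkr.ecr.us-east-2.amazonaws.com
            # IMAGE is expected to be defined as a repository variable.
            - docker build -t "$IMAGE:$BITBUCKET_COMMIT" .
            - docker push "$IMAGE:$BITBUCKET_COMMIT"
      - step:
          name: Deploy
          deployment: production
          oidc: true        # each step runs in a fresh container, so repeat the OIDC setup
          script:
            - export AWS_ROLE_ARN=arn:aws:iam::486588380443:role/bitbucket-deployer
            - export AWS_WEB_IDENTITY_TOKEN_FILE=$(pwd)/web-identity-token
            - echo "$BITBUCKET_STEP_OIDC_TOKEN" > "$AWS_WEB_IDENTITY_TOKEN_FILE"
            - aws eks update-kubeconfig --region us-east-2 --name cloudops-ai-initiatives
            - kubectl set image deployment/my-app my-app="$IMAGE:$BITBUCKET_COMMIT" -n my-ns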

Jenkins

Used for legacy projects and any workload that needs Windows agents or Linux agents with custom tooling that isn't easily packaged for Bitbucket Pipelines.

  • Jobs live in a shared Jenkins instance (separate from this cluster).
  • For deployments into this cluster, Jenkins agents authenticate via an IAM role and use the same aws eks update-kubeconfig + kubectl pattern.

Choosing between them

If your project…                                      Use
is new and lives in Bitbucket Cloud                   Bitbucket Pipelines
needs Windows agents                                  Jenkins
has heavy custom toolchain not on Bitbucket runners   Jenkins (with self-hosted agent)
has tight VPC-only dependencies                       Either, with a self-hosted runner in the VPC