This cluster powers C&R Software's internal cloud operations platform — hosting custom MCP servers that connect AI agents to our tooling, and the agentic systems that run our SRE, cloud engineering, and DevOps workflows.
Custom MCP Servers
Tools wired to AI workflows
nOps
Cost visibility, savings recommendations, and reservation/commitment optimization signals piped to agents.
Okta
User lifecycle, app assignments, and access reviews — exposed to agents for least-privilege remediation.
Datadog
Metrics, traces, logs, and monitor state queried by agents to ground SRE decisions in production data.
Atlassian
Jira, Confluence, Rovo, and Statuspage access for issue triage, runbook lookup, and incident comms.
Agentic AI
Autonomous and assistive agents
SRE
Incident triage, runbook execution, and post-incident summarization grounded in observability data.
Cloud Engineering
Architecture proposals, IaC review, and cost-aware design assistance across AWS and Azure.
DevOps
CI/CD analysis, dependency auditing, and pipeline troubleshooting bound to repo and platform context.
Introducing Cloud Ops AI Initiatives
A C&R Software platform initiative to give every Cloud Operations workflow a custom AI surface — grounded in our real tooling, hosted on infrastructure we control.
What this is
Cloud Ops AI Initiatives is a dedicated Amazon EKS cluster running two things:
Custom MCP servers — small services that expose our SaaS tooling (nOps, Okta, Datadog, the Atlassian suite, and more) to AI agents through the Model Context Protocol. Agents don't get screen-scraping access; they get real, scoped API calls.
Agentic AI workloads — autonomous and assistive agents for SRE, CloudOps Support, DevOps, and DBOps, running on top of the MCPs above.
Why this exists
Off-the-shelf assistants don't have access to our cost data, our identity provider, our observability stack, or our incident tooling. Without that context they hallucinate or give generic advice. By owning the MCP layer, we get:
Auditability — every AI tool call is a logged API request from a known service in our VPC.
Least privilege — each MCP runs with the narrowest API token its job requires. Agents inherit that scope.
Cost & latency control — internal traffic, no third-party MCP relays, no per-call SaaS surcharges.
How it's wired up
EKS cluster — cloudops-ai-initiatives in us-east-2, Kubernetes 1.35, internal-only ALBs. See EKS → Deployment.
Node autoscaling — Karpenter with a spot + on-demand NodePool, t3/t3a/m5/m5a/m6i/m6a families. See EKS → Karpenter.
Ingress — AWS Load Balancer Controller, internal scheme only. Public exposure is explicit opt-in.
Auth — Pod Identity Associations for every controller; SaaS API tokens stored as Kubernetes Secrets (eventually External Secrets backed by AWS Secrets Manager).
Later: Workload autoscaler (KEDA), centralized API key rotation, public read-only status page.
nOps MCP
An MCP server that exposes the nOps cloud cost optimization API to AI agents. Cost visibility, MAP migration tracking, Compute Copilot, Business Unit Economics — all callable as MCP tools.
Multi-tenant by design
This MCP is shared. There is no server-side nOps API key — each caller sends their own key on every request via the X-Nops-Api-Key header. The MCP server builds a fresh nOps client per session with the caller's key, so every upstream nOps API call shows up in nOps audit logs as that user.
Same shape works for Cursor, Kiro, and any other MCP-compatible client that supports custom transport headers. If you omit the header, the server returns 401 with JSON-RPC error code -32001.
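As a rough illustration of the per-caller key flow described above, the sketch below builds the transport header a client would attach and recognizes the missing-header response. The header name and error code come from the text above. The function names and the assumed JSON-RPC error body shape (an `error` object carrying a `code`) are illustrative assumptions, not the server's confirmed contract.

```python
import json

NOPS_HEADER = "X-Nops-Api-Key"   # header name from the text above
MISSING_KEY_CODE = -32001        # JSON-RPC error code from the text above


def build_headers(api_key: str) -> dict[str, str]:
    """Return the transport headers a client would send on every request."""
    if not api_key.strip():
        raise ValueError("an nOps API key is required; the server holds none")
    return {NOPS_HEADER: api_key.strip()}


def is_missing_key_error(status: int, body: str) -> bool:
    """True if the response looks like the 401 / -32001 'header missing' case.

    Assumes the body is a JSON-RPC error object; anything else returns False.
    """
    if status != 401:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    error = payload.get("error") if isinstance(payload, dict) else None
    return isinstance(error, dict) and error.get("code") == MISSING_KEY_CODE
```

A client wrapper could call `is_missing_key_error` on a failed initialize and prompt the user to add the header, rather than surfacing a bare 401.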
3. Deployment details (FYI)
Already deployed. Manifests live at mcps/nops/:
A Deployment in the mcps namespace running node:20-alpine with the gzipped multi-tenant bundle (nops-mcp.cjs.gz) mounted from a ConfigMap. The MCP runs natively in HTTP mode (MCP_MODE=http) — no supergateway wrapper, because per-request header pass-through requires the MCP to own its HTTP transport.
A Service on port 80 → 8000.
An Ingress joining the shared cloudops-ai-public IngressGroup — adds the /nops path rule on the cloudops-landing-public ALB.
Live endpoint: https://ai.cloudops.crsoftwarecloud.com/nops/sse
Local stdio mode is preserved for development per the upstream README — useful when iterating on a new tool against your own nOps tenant before pushing to the branch.
Comprehensive cost summary across all linked accounts
nops_get_daily_cost: Daily cost grouped by service, account, or tag
nops_list_map_projects: AWS MAP migration projects + tracked resources
nops_list_cost_targets: Budget/cost-target configurations
nops_create_business_unit_economics: Create a BU Economics report
Troubleshooting
401 / 403 from the MCP: the API key has expired or lacks scope. Because the server holds no nOps key of its own, regenerate the key in nOps and update the X-Nops-Api-Key header in your client config.
"Session Auth Required": certain /svc/ endpoints (Compute Copilot, Essentials) still require session auth on most accounts. Contact nOps support to enable API key access for those.
401 on initialize with code: -32001: client config is missing the X-Nops-Api-Key header. Add a headers block to the MCP server entry in your client config.
Tool calls return 401 from upstream nOps but the MCP itself responds fine: the per-call key is malformed or expired. Regenerate the key in nOps and update your client config.
Okta MCP
An MCP server for Okta — user lifecycle, group membership, app assignments, access reviews — exposed to agents for least-privilege remediation.
In progress
The MCP server is being built. This page documents the API token creation steps (good to do in advance) and the planned tool surface.
1. Create an Okta API token
The MCP authenticates to the Okta API using an API token. Tokens act on behalf of the user who created them, inheriting their admin permissions. For an MCP that AI agents will use, create the token under a dedicated service account with only the Okta admin roles it actually needs (e.g., Read-only Administrator for inventory queries, Group Administrator for membership remediation).
Steps
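Sign in to the Okta Admin Console at https://<your-org>-admin.okta.com as the dedicated service account described above. The token inherits the permissions of the admin who creates it, so a token minted by a Super Admin would carry Super Admin access.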
In the side nav: Security → API → Tokens.
Click Create token.
Give the token a descriptive name like cloudops-ai-okta-mcp.
Copy the token value immediately — Okta only shows it once. Format: 00<random-40-chars>.
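Since the token is shown only once, a quick shape check before storing it can catch a truncated paste. The sketch below is a minimal example under one assumption: that the 40 characters after the `00` prefix are letters, digits, `-`, or `_`. The helper names are hypothetical, and the check never contacts Okta.

```python
import re

# Assumption: the 40 characters after "00" are URL-safe (letters, digits, "-", "_").
OKTA_TOKEN_RE = re.compile(r"00[A-Za-z0-9_-]{40}")


def looks_like_okta_token(value: str) -> bool:
    """Cheap shape check to catch copy/paste mistakes; it does not call Okta."""
    return OKTA_TOKEN_RE.fullmatch(value) is not None


def mask(value: str, visible: int = 4) -> str:
    """Mask a secret for logs, keeping only the first few characters."""
    return value[:visible] + "*" * max(len(value) - visible, 0)
```

Logging `mask(token)` instead of the token itself keeps the value out of shell history and CI output while still letting you tell two tokens apart.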
Deployment manifests and endpoint configuration will land here once the MCP server is built. Planned endpoint: https://ai.cloudops.crsoftwarecloud.com/okta/sse.
Planned tool surface
Tool: Description
okta_get_user: Look up a user by email, ID, or login
okta_list_user_groups: Group memberships for a user
okta_list_app_assignments: Apps a user (or group) is assigned to
okta_search_users: SCIM-style filter over the user directory
okta_get_system_log: Recent System Log events (auth, group, lifecycle)
okta_add_user_to_group: Add a user to a group (gated by autoApprove allow-list)
okta_remove_user_from_group: Remove a user from a group (gated)
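One way to read the "gated" note on the two membership tools is that read tools appear in the client's autoApprove allow-list while mutating tools do not, so a human confirms each write. The sketch below illustrates that reading only. The tool names come from the planned surface above, while `needs_human_approval` and `DEFAULT_AUTO_APPROVE` are hypothetical names, and the real gate may work differently once the server is built.

```python
# Tool names are taken from the planned surface above.
READ_TOOLS = {
    "okta_get_user",
    "okta_list_user_groups",
    "okta_list_app_assignments",
    "okta_search_users",
    "okta_get_system_log",
}
WRITE_TOOLS = {"okta_add_user_to_group", "okta_remove_user_from_group"}

# Hypothetical policy: auto-approve every read tool, never a write tool.
DEFAULT_AUTO_APPROVE = set(READ_TOOLS)


def needs_human_approval(tool: str, auto_approve: set[str]) -> bool:
    """Return True when a call should pause for a human before running.

    Unknown tools are rejected outright; known tools run unattended only if
    they appear in the allow-list.
    """
    if tool not in READ_TOOLS | WRITE_TOOLS:
        raise ValueError(f"unknown tool: {tool}")
    return tool not in auto_approve
```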
Datadog MCP
An MCP server exposing Datadog's metrics, traces, logs, monitors, and incidents to AI agents — so SRE and CloudOps Support agents can ground their reasoning in production observability data.
1. Create a Datadog API key
Datadog has two kinds of keys, and the MCP needs both:
API key: identifies the organization. It is required on every API request, including write/submission endpoints.
Application key: identifies the user. It is sent together with the API key on read/query endpoints (metrics queries, log searches, monitor reads, etc.).
Create the API key
Log in to app.datadoghq.com (or your regional URL — app.datadoghq.eu, us3.datadoghq.com, etc.).
Open the user menu (bottom-left) → Organization Settings → API Keys.
Click New Key. Name it cloudops-ai-datadog-mcp.
Copy the 32-char hex string. You can re-view this later — it's not a one-shot reveal.
Create the Application key
Same nav: Organization Settings → Application Keys.
Click New Key. Name it cloudops-ai-datadog-mcp.
Scope the key: click Edit Scopes and grant only the minimum scopes the MCP needs:
events_read, logs_read_data, metrics_read
monitors_read, incident_read, apm_read
dashboards_read, service_dependencies_read
Save and copy the value. Format: 40-char hex string.
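To show how the two keys travel together, the sketch below validates both formats and builds the request headers for a read/query call. `DD-API-KEY` and `DD-APPLICATION-KEY` are Datadog's standard header names. The function name is hypothetical, and treating the hex as lowercase-only is an assumption.

```python
import re

# Assumption: keys are lowercase hex; lengths are from the steps above.
API_KEY_RE = re.compile(r"[0-9a-f]{32}")
APP_KEY_RE = re.compile(r"[0-9a-f]{40}")


def datadog_headers(api_key: str, app_key: str) -> dict[str, str]:
    """Validate both keys and return the headers for a read/query request."""
    if not API_KEY_RE.fullmatch(api_key):
        raise ValueError("API key should be a 32-character hex string")
    if not APP_KEY_RE.fullmatch(app_key):
        raise ValueError("Application key should be a 40-character hex string")
    return {"DD-API-KEY": api_key, "DD-APPLICATION-KEY": app_key}
```

Swapping the two values by mistake fails fast here, because their lengths differ.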
Deployment lands at mcps/datadog/. Endpoint: https://ai.cloudops.crsoftwarecloud.com/datadog/sse.
Key rotation
Datadog doesn't auto-rotate. We rotate quarterly:
Create a new App key with the same scopes.
Update the cluster Secret.
Restart the MCP pods: kubectl rollout restart deployment/datadog-mcp -n mcps.
Verify the MCP is healthy, then delete the old App key.
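A small date check can flag when the rotation above is due. This sketch approximates "quarterly" as 90 days, which is an assumption rather than a stated policy, and the function names are hypothetical.

```python
from datetime import date, timedelta

# Assumption: "quarterly" is approximated as 90 days.
ROTATION_INTERVAL = timedelta(days=90)


def next_rotation(created_on: date) -> date:
    """Date on which a key created on `created_on` should be replaced."""
    return created_on + ROTATION_INTERVAL


def rotation_due(created_on: date, today: date) -> bool:
    """True once the key is at least one rotation interval old."""
    return today >= next_rotation(created_on)
```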
Atlassian MCP (Jira & Confluence)
An MCP server exposing Jira Cloud and Confluence Cloud to agents — issue search, transitions, comments, page reads, page creation, label management, etc.
1. Create an Atlassian API token
Atlassian Cloud authenticates API requests with a personal API token tied to a user account. For an MCP, create the token on a dedicated service account, not a person.
Verify the subscription is active and the products you want indexed (Jira, Confluence, Bitbucket, Trello, etc.) are toggled on.
Screenshot: admin.atlassian.com → Products → Rovo
2. Get a Rovo API credential
Rovo API access is granted via the same Atlassian API token mechanism as Jira/Confluence, but the calling user must have Rovo enabled on their seat. For the MCP service account:
Confirm Rovo is enabled for the cloudops-mcp@crsoftware.com seat in admin.atlassian.com → Directory → Users → <user> → Product access.
Reuse the API token created for the Atlassian MCP, or create a separate token labeled cloudops-ai-rovo-mcp for tighter blast radius.
An MCP server for Atlassian Statuspage — read current component status, list active incidents, and (gated) create or update incidents and maintenance windows.
Copy the value. Format: 36-char UUID-style. Token is reusable; you can re-view it.
Screenshot: manage.statuspage.io → Your account → API info → Create key
Find your Page ID
Each Statuspage page has a unique page_id. From the Statuspage dashboard URL it's the segment after /pages/. The MCP scopes its calls to a single page at a time, so include the page ID in the Secret.
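The "segment after /pages/" rule can be expressed as a short helper. This is a minimal sketch, and `page_id_from_url` is a hypothetical name rather than part of the MCP.

```python
from urllib.parse import urlparse


def page_id_from_url(url: str) -> str:
    """Return the path segment that follows /pages/ in a Statuspage URL."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    try:
        return segments[segments.index("pages") + 1]
    except (ValueError, IndexError):
        raise ValueError(f"no /pages/<page_id> segment in {url!r}") from None
```

For a dashboard URL whose path is `/pages/<page_id>/incidents`, the helper returns the `<page_id>` segment, which is the value to store in the Secret.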
The cluster, its supporting infrastructure, and three reproducible deploy paths (shell scripts, CloudFormation, Terraform). Source of truth: the eks-cloudops-ia-initiatives Bitbucket repository.
K8s objects (manifests under manifests/) are then applied with kubectl.
Deeper docs
See docs/01-eks-cluster.md, docs/02-karpenter.md, docs/03-aws-lbc.md, and docs/troubleshooting.md in the repo for the real issues hit during build-out (EIP quota, Pod Identity webhook race, Karpenter anti-affinity, macOS keychain).
Node Autoscaler — Karpenter
Karpenter handles all node provisioning above the 2-node managed nodegroup. It looks at pending pods and picks the cheapest EC2 shape that satisfies them — across spot + on-demand, multiple instance families.
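The "cheapest shape that satisfies them" idea can be illustrated with a toy selector. This is a deliberately simplified sketch, not Karpenter's actual algorithm, which also weighs factors such as spot availability, zones, and bin-packing across many pods. The `Shape` type and `cheapest_fit` function are hypothetical, and prices are whatever the caller supplies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Shape:
    name: str
    vcpu: float
    memory_gib: float
    hourly_usd: float  # caller-supplied; not real pricing


def cheapest_fit(shapes: list[Shape], need_vcpu: float, need_gib: float) -> Optional[Shape]:
    """Return the lowest-priced shape that covers the pending request, if any."""
    fits = [s for s in shapes if s.vcpu >= need_vcpu and s.memory_gib >= need_gib]
    return min(fits, key=lambda s: s.hourly_usd, default=None)
```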
The Karpenter Helm chart adds a hardcoded affinity to its own pods (karpenter.sh/nodepool DoesNotExist) — they refuse to run on Karpenter-provisioned nodes. With 2 controller replicas, pod-anti-affinity, and zone topology-spread, the minimum is 2 managed nodes across 2 AZs.
Try it
kubectl create deploy nginx --image=nginx --replicas=30
kubectl get nodeclaims -w # Karpenter provisions a node
kubectl get nodes -L karpenter.sh/nodepool,karpenter.sh/capacity-type
kubectl delete deploy nginx # ~1 min later Karpenter consolidates away
Monitoring & Upgrade
How we observe the cluster and how we upgrade it.
Monitoring stack
EKS control plane logs — currently disabled (default). Enable when ready:
Container Insights — install the CloudWatch Agent + Fluent Bit DaemonSets for per-pod and per-node metrics + logs in CloudWatch.
Datadog Agent — once the Datadog MCP and account are confirmed, the standard pattern is the official datadog Helm chart in kube-system, with API key from the datadog-mcp Secret (or its own).
Karpenter metrics — exposed on the controller's /metrics endpoint. Worth scraping with Prometheus or DD's OpenMetrics integration.
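For the control plane logs item in the list above, one possible way to enable logging is through boto3's `update_cluster_config`. The sketch below only builds the request payload so it can be reviewed before anything is sent. `enable_logging_request` is a hypothetical helper name, and enabling all five log types by default is an assumption about what "ready" would mean here.

```python
from typing import Optional

# The five EKS control plane log types.
CONTROL_PLANE_LOG_TYPES = ["api", "audit", "authenticator", "controllerManager", "scheduler"]


def enable_logging_request(cluster: str, types: Optional[list[str]] = None) -> dict:
    """Build keyword arguments for boto3's eks.update_cluster_config."""
    selected = list(types) if types is not None else list(CONTROL_PLANE_LOG_TYPES)
    unknown = sorted(set(selected) - set(CONTROL_PLANE_LOG_TYPES))
    if unknown:
        raise ValueError(f"unknown log types: {unknown}")
    return {
        "name": cluster,
        "logging": {"clusterLogging": [{"types": selected, "enabled": True}]},
    }
```

With boto3 installed and credentials in place, passing the result as `boto3.client("eks", region_name="us-east-2").update_cluster_config(**request)` would submit the change. Starting with a subset such as `["api", "audit"]` keeps CloudWatch log volume down.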
Status
Observability tooling isn't deployed yet. The Datadog MCP work is the natural dependency to unblock this.
Upgrade playbook
EKS minor versions are released roughly every 4 months. Each receives roughly 14 months of standard support. Upgrade rhythm:
Bump Karpenter's EC2NodeClass.amiSelectorTerms if it pins a specific AMI ID (currently uses al2023@latest, so no action). Karpenter will roll its own nodes within 30 days (the expireAfter), or sooner if disrupted.
Verify: kubectl get nodes -o wide shows all nodes on the new minor, all addons healthy, all workloads running.
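The node-version part of the verify step can be automated once the kubelet versions are collected, for example from the VERSION column of `kubectl get nodes`. The sketch below assumes version strings shaped like `v1.35.0-eks-abc123`. The function names and the node names in the usage are hypothetical.

```python
import re

VERSION_RE = re.compile(r"v?(\d+)\.(\d+)")


def minor_of(kubelet_version: str) -> str:
    """Reduce a version such as 'v1.35.0-eks-abc123' to '1.35'."""
    match = VERSION_RE.match(kubelet_version)
    if match is None:
        raise ValueError(f"unrecognized version: {kubelet_version!r}")
    return f"{match.group(1)}.{match.group(2)}"


def nodes_off_target(node_versions: dict[str, str], target_minor: str) -> list[str]:
    """Return the names of nodes whose kubelet is not on the target minor."""
    return sorted(n for n, v in node_versions.items() if minor_of(v) != target_minor)
```

An empty list from `nodes_off_target` means every node has rolled to the new minor. Addon and workload health still need their own checks.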
AWS Access via Web Popups
How humans get into AWS for the C&R accounts — single-sign-on through Okta, no long-lived IAM users.
The pattern
AWS access for engineers is brokered through AWS IAM Identity Center (formerly AWS SSO), federated from Okta. Each engineer signs in once via Okta, then sees the AWS portal listing every account+role they're authorized for. Clicking a tile opens a popup AWS console session or copies short-lived CLI credentials.
For the CLI
Use aws configure sso once to set up your profile, then sign in interactively:
aws configure sso
# SSO start URL: https://crsoftware.awsapps.com/start (or your equivalent)
# SSO region: us-east-2
# Pick the account + role → name the profile, e.g., AdmAccess-486588380443
aws sso login --profile AdmAccess-486588380443
aws sts get-caller-identity --profile AdmAccess-486588380443
For the console
Open the AWS access portal (your IAM Identity Center start URL).
Sign in with Okta.
Click the account → role tile you want. AWS opens the console with the role pre-assumed.
For a short-lived CLI session, click Copy credentials next to a role and paste into a terminal.
What lives in this cluster
The cluster itself is administered through AWSReservedSSO_AdmAccess — same SSO mechanism. EKS access entries are configured for the cluster creator (and additional roles can be added). No aws-auth ConfigMap maintenance needed.
Roadmap
A self-service web tool to mint short-lived kubectl contexts (signed STS token + aws eks get-token) is on the platform engineering backlog.
EKS & ECS Cluster Management (Web UI)
A future self-service Web UI for routine cluster operations — restart a node, trigger a managed-nodegroup roll, run a one-shot EKS addon update, drain an ECS service task — without engineers needing to switch to the AWS console or remember the right CLI incantation.
Coming soon
This is a planned Platform Engineering deliverable. The page below describes the intended scope.
Target operations
EKS: list clusters, show nodegroup status, restart/drain/cordon a node, trigger a nodegroup roll, upgrade an EKS addon, view recent events.
ECS: list services per cluster, restart a service (force new deployment), scale a service, view recent task failure events.
Cross-cutting: see which IAM principal triggered a given action, full audit log of every action taken through the UI.
Design constraints
Internal-only (no public ingress). Single-sign-on via Okta — same identity surface as AWS Access.
RBAC: read-everywhere for any signed-in engineer; write actions gated by group membership (e.g., sre, cloud-engineering).
Every action emits a structured audit event (who, what, when, on which resource, with which IAM principal assumed). Events stream to CloudTrail and Datadog.
No direct AWS console pass-through — actions are explicit, parameterized API calls. Easier to audit, easier to constrain.
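The RBAC and audit constraints above could take roughly the following shape. The backend is planned in Go, so this Python sketch only illustrates the data model. The field names, `is_allowed`, and `new_event` are hypothetical, and the two write groups are the examples given in the list.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

WRITE_GROUPS = {"sre", "cloud-engineering"}  # example groups from the list above


def is_allowed(action_kind: str, user_groups: set[str]) -> bool:
    """Reads are open to any signed-in engineer; writes need a gated group."""
    if action_kind == "read":
        return True
    if action_kind == "write":
        return bool(user_groups & WRITE_GROUPS)
    raise ValueError(f"unknown action kind: {action_kind}")


@dataclass(frozen=True)
class AuditEvent:
    who: str            # signed-in engineer
    what: str           # explicit, parameterized action name
    when: str           # ISO 8601 timestamp, UTC
    resource: str       # cluster, nodegroup, service, ...
    iam_principal: str  # role assumed to perform the action

    def to_json(self) -> str:
        """Serialize with stable key order so events diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)


def new_event(who: str, what: str, resource: str, iam_principal: str) -> AuditEvent:
    """Stamp the event with the current UTC time."""
    now = datetime.now(timezone.utc).isoformat()
    return AuditEvent(who, what, now, resource, iam_principal)
```

Emitting the event before the action runs, and again with the outcome, would keep denied or failed attempts in the trail as well.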
Tech direction
Backend: Go service running in this cluster, assuming a least-privilege IAM role per cluster/account it manages.
Frontend: React + Vite, same dark/cyan theme as this site.
Deploy: same pattern as the landing page — namespace, deployment, internal ALB ingress.
Pipelines
CI/CD lives in Bitbucket Pipelines and Jenkins, depending on the workload. This section documents the patterns and where to plug a new project in.
Bitbucket Pipelines
Default for new projects in the cr-software Bitbucket workspace. Configured via bitbucket-pipelines.yml at repo root.
Runners: hosted Bitbucket runners for typical builds; self-hosted runners (on this cluster) for jobs that need VPC access (e.g., deploying to cloudops-ai-initiatives).
Secrets: workspace-level repository variables for non-sensitive config; deployment environment variables for sensitive credentials (mark them Secured so they're masked in logs).
OIDC: prefer Bitbucket Pipelines' OpenID Connect to assume an AWS role at runtime — no long-lived AWS access keys in pipeline variables.
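To make the OIDC item concrete, the sketch below builds the arguments for an STS AssumeRoleWithWebIdentity call from the token Bitbucket injects into OIDC-enabled steps (`BITBUCKET_STEP_OIDC_TOKEN`). The role ARN in the usage is a placeholder, and the function name and session naming scheme are assumptions.

```python
import os
from typing import Mapping


def web_identity_request(role_arn: str, env: Mapping[str, str] = os.environ) -> dict:
    """Build keyword arguments for STS AssumeRoleWithWebIdentity.

    BITBUCKET_STEP_OIDC_TOKEN is set by Bitbucket Pipelines on steps that
    enable OIDC; BITBUCKET_BUILD_NUMBER is used only to label the session.
    """
    token = env.get("BITBUCKET_STEP_OIDC_TOKEN")
    if not token:
        raise RuntimeError("no OIDC token; is 'oidc: true' set on this step?")
    build = env.get("BITBUCKET_BUILD_NUMBER", "local")
    return {
        "RoleArn": role_arn,
        "RoleSessionName": f"bitbucket-build-{build}",
        "WebIdentityToken": token,
    }
```

Inside a pipeline step, `boto3.client("sts").assume_role_with_web_identity(**web_identity_request(role_arn))` would exchange the token for short-lived credentials, provided the role's trust policy accepts the workspace's OIDC identity provider.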
Jenkins is used for legacy projects and any workload that needs Windows agents or Linux agents with custom tooling that isn't easily packaged for Bitbucket Pipelines.
Jobs live in a shared Jenkins instance (separate from this cluster).
For deployments into this cluster, Jenkins agents authenticate via an IAM role and use the same aws eks update-kubeconfig + kubectl pattern.
Choosing between them
If your project…
Use
is new and lives in Bitbucket Cloud
Bitbucket Pipelines
needs Windows agents
Jenkins
has heavy custom toolchain not on Bitbucket runners