
A CEO told me yesterday: “Proudly saying ‘I am using AI’ is no longer cool or acceptable.”
He’s right. Of course we’re all all-in on AI. Ninety percent of us do to be ai-precise. The questions your board and your team actually ask are different. Where do we stand? What does “good” look like? Where do we boldly go from here?
This post hands you the instrument to answer them: my AI adoption maturity matrix. Seven dimensions, five color-coded levels, a scoring ritual that surfaces the conversations your team has been avoiding. And a roadmap for your CEO to follow.
Ask your engineers where the team stands and you’ll get confident answers. Trouble is, confidence doesn’t survive measurement. METR ran a randomized controlled trial with experienced open-source developers working in their own mature codebases. With AI they were 19% slower. While estimating they had been sped up 20%. Go figure.
The system-level numbers point the same way. DORA’s 2025 report found AI adoption correlates with higher delivery throughput and lower delivery stability. GitClear’s analysis of 211 million changed lines shows refactoring collapsing while copy-paste takes over. Duh. Perceived speed is not a metric. License counts and token burn are not maturity either. To know where you stand, you have to look at the whole system around the code: who verifies, who owns, and what gets measured. That’s exactly where things usually go sideways.
As fractional CTO, I get to watch the same movie at several companies in parallel. Same plot, different logo. And the plot is never about the tools. It’s about what AI does to a team.
Scene one: an enthusiastic engineer discovers that agents write architecture decision records on demand. Minutes later, a watertight-looking document sails over the wall into review. The poor reviewers now have to judge pages of blabbering the author never fully read himself. Generation took minutes, review takes days. Nobody can tell which trade-offs the engineer actually weighed and which ones the model padded in. Unhappy faces on both sides of the wall.
Scene two is the entrepreneurial cowboy business variant:
Did I mention security? DryRun Security let three frontier coding agents build two applications, pull request by pull request. 26 of the 30 PRs introduced at least one security vulnerability. Eighty-seven percent. Guess how many your wonder app ships.
Both scenes are the same failure: one individual’s autonomy racing ahead of the team’s ability to verify. Generation is no longer the bottleneck. Trust is.
The teams that do get real gains treat AI as a team sport. They learn together, and they capture and share what they learn. The most striking practice: teams that pair- and even mob-program with AI in the loop, architecting, debugging, and coding together. With ADRs and shared context engineering baked into the daily routine.
So how do you find out which movie your team is in? I built a maturity matrix for exactly this. Seven independent dimensions: autonomy, lifecycle coverage, codebase and platform readiness, scope, team process, governance, and measurement. Each scored on the same levels. With some nice colors for those of us that don’t like reading. From red (ad hoc) to blue (optimized). TL;DR: scroll down to the full matrix.
Why seven dimensions instead of one magic score? Because the failure modes are mismatches between dimensions. A team running multi-hour autonomous agent sessions scores green on autonomy. The same team with no review standards, no budget owner, and no baseline scores red on governance and measurement. That team isn’t mature. It’s a Ferrari racing headlong into a brick wall.
A team is only as AI-mature as its weakest dimension.
The assessment bakes this in with verification gates. The dominant stall point in practice is the verification tax: code generation speeds up first, while review capacity lags. So the time saved generating gets re-spent auditing. Clear that constraint and the next one surfaces upstream: QA. Then the quality of your requirements themselves. Sound familiar? It’s the theory of constraints wearing an AI badge.
And the readiness dimension settles an old score. I argued back in 2016 (was too busy CTO-ing to write blogs) that you must bootstrap your software quality while you still can, because retrofitting quality means changing process, mindset, and people. Agents turn that dial to eleven. Trustworthy tests, fast and visible feedback, and tribal knowledge encoded in the repo are now the difference between an agent that ships and an agent that guesses.
Now the part that makes it work: don’t fill in the matrix yourself. Don’t delegate it to the most enthusiastic engineer either. Have every individual on the team score all seven dimensions privately. Oh and have the leaders also score the teams and their members. Full 360 truth. Then sit together and discuss the outliers, planning poker style.
You’d be surprised. The tech lead scores team process green: we have a guild, we have a channel! The junior scores it red: nobody really reviews my agent PRs, they just sigh and approve. That gap is the conversation your team needed to have for months. The discussion is the deliverable. The matrix is just the excuse.
Do write down today’s colors and the colors you want two quarters from now. That’s your roadmap. It’s also your shield: use it to keep your trigger-happy CEO apprised of where you really are. In colors a board deck can absorb.
And whatever you do, don’t put a meter on token consumption and dock the pay of developers who don’t burn enough. Just, don’t. Measure outcomes, not input. Pair every throughput metric with a stability twin: PR throughput with revert rate, deployment frequency with change-failure rate, and cost per merged PR against your pre-AI baseline. Otherwise you’re measuring how fast the team digs, not whether the hole is in the right place.
Your board whispers that AI-native competitors are breathing down your neck. Your C-level gets more impatient by the week. So why isn’t the AI initiative live yet?
Because every project still grows up the same way, AI or not. Resourcing: finding someone to actually run with it. Requirements: talking to the business, understanding the legacy code, disambiguating, and aligning on the architecture choices while everyone has their say. Building the infrastructure. Then a first MVP or proof of concept, tested internally or with a few friendly customers. Baking in the learnings. Productizing. And only then is there an AI-enabled business outcome you can brag about.
There’s a second trap hiding in that MVP phase: staying in the lab. You can adopt the latest autonomous-agentic-workflow-orchestration platform (insert this quarter’s jargon), but if it only ever runs in the comfort of your own laptop, safely behind the corporate firewall… You’ll have a hard time bringing it out into the real, scary world. Simple technology in a complex environment is still a giant leap. Make baby steps, but make them out there in the real world.
The matrix doesn’t let you skip phases. It shows which friction you can remove and which impatience you simply have to manage. Sharing your colors and your next gate with the C-level beats promising magic. Magic slips. Checklists don’t.
Yes, my inner consultant built a seven-dimension, five-color matrix. At least this one comes without the invoice. Steal it, adapt it, and run it quarterly.
A team can be advanced on one dimension and immature on another; the assessment exists to surface exactly that mismatch. Aim to lift the lowest color, not the highest.
| Dimension | Question it answers |
|---|---|
| Autonomy | How much does the agent do unattended? |
| Lifecycle Coverage | How much of the delivery lifecycle beyond code authoring does AI cover โ tickets, review, tests, dependencies, incidents, customer issues? |
| Codebase & Platform Readiness | Can the codebase and the platform around it support agent work: encoded knowledge, trustworthy tests, fast feedback, live environments to verify in, observability to debug with? |
| Scope | How far do the practices reach โ one dev, the team, the org? |
| Team Process | Do shared rituals, norms, and ownership turn individual usage into team practice? |
| Governance | Are guardrails, ownership, and policy in place? |
| Measurement | Can we prove impact, track cost, show ROI, and catch regressions? |
Core principle: a team is only as mature as its weakest dimension. Tool count does not predict outcomes; the most common failure mode is autonomy running ahead of governance and measurement, producing unauditable, unmeasurable output.
Placement rule for organizational structures: structures are means, not dimensions. Score them by purpose: structures that spread practice (champions, guilds, SIGs) count under Team Process; structures that exercise decision rights (AI board, steering body, harness owner) count under Governance.
Vocabulary: the “agent harness” is the shared setup around the agents themselves: context files, skills (reusable, versioned task instructions), permissions, CI hooks, and review automation. A “flow” is a recurring agent task that runs as part of delivery, such as ticket enhancement, dependency updates, or CI failure triage.
Assess each dimension independently. Circle one cell per row. The table is dense by design: it is the summary. The checklists after it unpack every cell into things you can verify instead of feel.
| Dimension | ๐ด 1 โ Ad hoc | ๐ 2 โ Emerging | ๐ก 3 โ Established | ๐ข 4 โ Managed | ๐ต 5 โ Optimized |
|---|---|---|---|---|---|
| Autonomy | Code completion + Q&A chat, used occasionally | Interactive agents (IDE/CLI), human approves every step | Context-engineered: repo context files, skills, spec-driven + test-driven development make agents reliable; model and effort level chosen per task | Supervised autonomous: multi-hour unattended sessions (features, refactors) with human review at checkpoints; token-optimized flows (caching, routing, right-sized models) | Orchestrated: agents self-source work from tickets and change requests; humans write requirements, not code; cost-per-outcome tuned continuously |
| Lifecycle Coverage | Code authoring and Q&A only | Dev-adjacent assists: commit messages, docs, unit tests, PR descriptions; ad-hoc CLI use | Integrated with team systems via MCP/CLI: ticket enhancement, AI code review on every PR, automated dependency-update PRs | Pipeline-embedded: e2e test creation & repair, CI failure triage with fix PRs, auto-updated docs and release notes | Closed loop with production: agents diagnose incidents and open fix PRs; customer issues reproduced and fixed by flows |
| Codebase & Platform Readiness | Knowledge is tribal; tests cover lines, not behavior; setup is undocumented; deploys are manual console clicks | Root context file as a map (not an encyclopedia); reproducible dev environment; lint, types, and pre-commit enforced; CI on every PR; scripted, repeatable deploys | Behavioral tests agents can treat as ground truth; architectural invariants enforced by build/CI; tacit knowledge encoded (ADRs, scoped context files, do-not-touch paths); infrastructure as code changed via reviewed PRs; e2e tests on every PR; structured, centralized logging | Fast CI sized for agent iteration; flaky tests quarantined; codebase queryable through tools (code graph, find-callers) instead of prompt-stuffing; ephemeral preview environment per PR, torn down on merge; distributed tracing across services; visual regression tests guard UI changes | Self-describing repo: every agent task leaves artifacts (tests, invariants, context) that make the next run more reliable; agents verify their own changes against preview environments and telemetry; readiness tracked per repo |
| Scope | One or two enthusiasts, personal setups | Pockets of practice; informal tips circulate | Team-wide: shared skills, context files, and permissions versioned in the repo | Multi-team: shared standards, common agent harness, reusable patterns | Org-wide: skill/context registry, standard harness with a named owner, onboarding includes AI practices |
| Team Process | No shared rituals; AI-generated code/docs thrown over the wall into review | AI is a recurring retrospective topic; tips and prompts shared informally; early champions visible | Written “good PR” norms for AI work (author = first reviewer); dedicated learning time; learnings documented | Guild / SIG / community of practice; named process owner; PR templates require evidence; team-based (not individual) incentives | Process changes trialed and measured before rollout; learnings institutionalized into the harness, skills, and onboarding |
| Governance | No policy; shadow usage on personal accounts | Written acceptable-use guidance; tool inventory exists | Guardrails: security scanning on AI output; “no AI slop” norm enforced in PRs and docs | Owned: AI board / steering body with decision rights; agent harness has a named owner; permissions are identity-based; review standards for AI code are explicit; cloud-org guardrails bound what agents can provision | Policy-as-code: automated quality and security gates; compliance enforced at every stage of the pipeline; agent execution sandboxed from production credentials |
| Measurement | No baseline; productivity is anecdotal; AI spend unknown | Usage tracked: licenses, active users, acceptance rate; tool/subscription spend visible | Outcomes: pre-rollout baseline; cycle time, review time, defect rate attributable to AI; token spend attributed to team/feature | Paired metrics on a live dashboard; unit economics (cost per merged PR / completed task); budgets and alerts per use case | ROI proven: business value (time saved, defects avoided, outcomes shipped) vs full cost drives go/no-go decisions |
Check every box in a level before claiming it, or write down why a box does not apply (e.g. dependency-update PRs in a repo without external dependencies). The highest fully-checked level is the score. Levels build on each other, but a higher practice supersedes the lower one it replaces: checkpoint review at level 4 satisfies the level 2 expectation of human oversight even though nobody approves every step anymore.
Autonomy is the dimension everyone obsesses over: how much does the agent do without you? The road runs from autocomplete, through agents you approve step by step, to context-engineered setups you can trust with a whole feature, and finally to agents that pick up their own work from the ticket queue. Resist the urge to sprint up this ladder. Every level you skip comes back later as review pain.
๐ด 1 โ Ad hoc
๐ 2 โ Emerging
๐ก 3 โ Established
๐ข 4 โ Managed
๐ต 5 โ Optimized
Code authoring is the obvious use of AI, and the smallest prize. Most engineering hours go to everything around the code: tickets, reviews, tests, dependencies, incidents, and customer issues. This dimension scores breadth across that lifecycle. Score what AI touches; how unattended each flow runs is the Autonomy score, and whether it’s owned and affordable is Governance/Measurement.
๐ด 1 โ Ad hoc
๐ 2 โ Emerging
๐ก 3 โ Established
๐ข 4 โ Managed
๐ต 5 โ Optimized
Whether the codebase and the platform it runs on can support agent work at all. AI is an amplifier: teams with loosely coupled architectures, trustworthy tests, and fast feedback see gains, while teams on tightly coupled systems with tribal knowledge see little, regardless of how mature their practices are. The intelligence of an agent-driven setup lives less in the model than in the instructions, tests, and feedback loops around it. The platform half follows the same logic (CNCF platform engineering maturity model): an agent can only be trusted unattended if its change can be deployed somewhere disposable, verified mechanically, and observed in production. Infrastructure as code matters doubly here: infra defined as text is infra agents can read, modify, and have reviewed like any other PR. Score this per repository/service where agents actually work; the weakest one agents touch unattended is the score.
๐ด 1 โ Ad hoc
๐ 2 โ Emerging
๐ก 3 โ Established
๐ข 4 โ Managed
๐ต 5 โ Optimized
Scope measures reach: one enthusiast, a team, several teams, or the whole org. The anti-pattern here is the hero setup, one developer’s lovingly tuned private toolbox that walks out the door with them. Maturity means the setup lives in the repo and every new joiner inherits it by default.
๐ด 1 โ Ad hoc
๐ 2 โ Emerging
๐ก 3 โ Established
๐ข 4 โ Managed
๐ต 5 โ Optimized
The over-the-wall scenes all trace back to this dimension, and it’s the one most teams skip. Tools spread in days; norms take quarters. Team process is where individual usage turns into shared practice: rituals, written norms, and learning time that make the whole team better instead of one enthusiast faster.
๐ด 1 โ Ad hoc
๐ 2 โ Emerging
๐ก 3 โ Established
๐ข 4 โ Managed
๐ต 5 โ Optimized
Guardrails should be proportionate to blast radius: an interactive pilot needs lighter controls than an unattended flow touching production. Applying the full org-wide policy stack to every experiment kills adoption. Letting unattended flows run on pilot-level controls is the danger zone. Tighten controls as autonomy and coverage grow.
๐ด 1 โ Ad hoc
๐ 2 โ Emerging
๐ก 3 โ Established
๐ข 4 โ Managed
๐ต 5 โ Optimized
Measurement is where most adoption stories quietly fall apart: no baseline, no attribution, no idea what the real spend is. Two habits carry this whole dimension.
Capture a baseline before you roll anything out, and never track a throughput metric without its stability twin. Everything else builds on those two.
๐ด 1 โ Ad hoc
๐ 2 โ Emerging
๐ก 3 โ Established
๐ข 4 โ Managed
๐ต 5 โ Optimized
Throughput โ stability pairs. Never track the left column alone:
| Throughput | Pair with |
|---|---|
| PR throughput per developer | PR revert rate |
| Deployment frequency | Change-failure / rework rate |
| Lead time for changes | PR review cycle time (AI vs non-AI) |
| Tasks completed per developer | Defect rate of AI code vs human baseline |
| Lines of code added | Duplication % and complexity trend |
The last pair catches slow-burn debt that defect rates miss: AI inflates “tasks completed” most cheaply by cloning code, and the cost lands in maintenance years later.
Cost & ROI ladder. How cost measurement matures alongside:
| Level | Cost question you can answer |
|---|---|
| ๐ 2 | “What do we spend on AI tools in total?” |
| ๐ก 3 | “Which team, flow, and model is the spend going to?” |
| ๐ข 4 | “What does a merged PR / completed flow cost, and is it trending right?” |
| ๐ต 5 | “Is the value generated worth the full cost and where do we invest or cut next?” |
Four cost warnings from the trenches. Activity is not impact: more PRs is not more business. So tie value to outcomes, never to output counts. Don’t shop on price-per-token: a cheap model that burns triple the tokens or produces rework is the expensive one, so judge by tokens-to-done and cost-per-outcome. Know that the subscription is just the visible tip: the real money leaks into review time, rework cycles, and harness babysitting. All booked in other budgets. And treat cost as a live dial, not something finance discovers in December.
The dominant stall point in practice is the verification tax: code generation speeds up first, while review, governance, and deployment capacity lag. Time saved generating is re-spent auditing. Clearing one constraint surfaces the next, moving upstream: review capacity first, then QA, then the quality of requirements themselves. Three gates must hold before advancing Autonomy:
| Gate | Before advancing to | Must be true |
|---|---|---|
| G1 โ Review capacity | Autonomy ๐ก 3 | Review time per PR is flat or falling even as PRs are produced faster |
| G2 โ Proven outcome | Autonomy ๐ข 4 | A business outcome (not a developer feeling) has measurably changed because of AI |
| G3 โ Governance first | Autonomy ๐ต 5 | Governance is at ๐ข 4+ before scaling, not retrofitted after |
The score itself is not the point; the pattern is. Write the seven colors side by side (e.g. Autonomy ๐ข ยท Coverage ๐ ยท Readiness ๐ ยท Scope ๐ก ยท Team Process ๐ ยท Governance ๐ด ยท Measurement ๐ด) and match the pattern:
| Profile pattern | Diagnosis | Action |
|---|---|---|
| Autonomy ahead of Governance + Measurement | Danger zone: unauditable, unmeasurable autonomous output | Freeze autonomy expansion; raise Governance and Measurement first |
| Autonomy deep, Coverage narrow | AI is a coding tool, not a delivery system; gains are capped by everything around the code | Expand sideways: code review, tests, tickets, dependencies before pushing autonomy further |
| Coverage broad, Governance + Measurement low | Flow sprawl: automations multiply without owners, budgets, or success metrics | Inventory every flow; assign an owner and a cost/outcome metric to each, retire the rest |
| Autonomy ahead of Team Process | Over-the-wall mode: generation outpaces shared norms, and review friction and resentment build | Adopt the “good PR” standard and start AI retros before expanding autonomy |
| Autonomy ahead of Codebase & Platform Readiness | Agents are guessing: output compiles and passes review, then breaks on undocumented invariants weeks later, and nobody sees it break because logging and tracing are absent | Encode invariants, behavioral tests, and tacit knowledge into the repo, give agents a live environment to verify in and telemetry to observe, then expand autonomy |
| Governance + Measurement ahead of Autonomy | Safe but slow | Push autonomy; the org can absorb the output |
| Autonomy ahead of Scope | Knowledge concentrated in individuals | Institutionalize: move personal skills/context into the repo before the knowledge walks out the door |
| Team Process ahead of Autonomy | Rituals without substance; meetings about AI exceed use of AI | Channel the guild’s energy into hands-on adoption targets |
| All dimensions level and low | Healthy early state | Clear the next verification gate before adding tools |
| All dimensions ๐ข/๐ต | Leading | Shift focus to experimentation and stability optimization |
Wrong question, but everybody asks it. Tools change by the quarter, and owning a tool is no more maturity than owning running shoes is fitness. Treat what follows as orientation. And remember: scope and team process are practice dimensions. No purchase order gets you there.
The autonomy ladder starts innocent: Copilot completions and a ChatGPT or Claude tab. Then interactive agents in the editor, Cursor and friends. Level 3 is where it gets interesting: CLI agents like Claude Code, Codex CLI, or Google Antigravity, made reliable by context engineering. That means CLAUDE.md or AGENTS.md files, reusable skills, spec-driven development with GitHub Spec Kit or the like, and a TDD harness. Level 4 adds agent workflow platforms, custom AI code review, automated ticket enrichment, MCP integrations into Jira and Confluence, and scheduled agents such as Claude Code routines. Level 5, agents that resolve production issues through cloud orchestration platforms like Warp Oz, is real but rare. Don’t pretend you live there.
Does the top of the ladder exist outside conference keynotes? It does.
Lifecycle coverage is about wiring AI deeper into delivery. Start with MCP integrations into your tracker and wiki, AI review as the first pass on every PR. And Renovate- or Dependabot-style update agents that pre-validate their own PRs. Next come agentic e2e test creation and repair, CI triage bots that propose fixes, and docs and release notes that update themselves. The end game routes incidents and support tickets straight to agents. Approve the PR on Sunday morning and go back to bed.
Codebase and platform readiness is the least sexy list and the biggest lever. Devcontainers or nix, a context file that works as a map, pre-commit hooks, CI on every PR. Then ADRs, architecture tests (ArchUnit, dependency-cruiser), per-directory context files, infrastructure as code (Terraform/OpenTofu, Pulumi), Playwright on every PR, and centralized structured logging. The serious money comes later: repo readiness scoring (think Factory.ai-style agent-readiness reports), code-graph and LSP tools served to agents over MCP, flaky-test detection, preview environments per PR (ArgoCD ApplicationSets, Vercel- or Northflank-style platforms), distributed tracing with OpenTelemetry, and visual regression testing with Chromatic, Percy, or Playwright snapshots.
Governance tooling gets real at level 4: a cloud landing zone (AWS Control Tower, Azure Landing Zones), org policies nobody can switch off (SCPs, Azure Policy, GCP org policies), and short-lived credentials via OIDC federation. At level 5 you graduate to policy-as-code engines (OPA, Sentinel, Cloud Custodian) and sandboxed agent runtimes on microVM or gVisor isolation.
And measurement? Start embarrassingly simple: the vendor’s usage dashboard and a shared spend spreadsheet. That’s level 2, and it already beats most teams. Level 3 wants engineering-intelligence platforms (DX, Jellyfish, LinearB) and LLM spend observability (Langfuse, Helicone, or your cloud cost tools), tagged per team and per flow. Level 4 builds the unit-economics dashboard and budget alerts on top. By then you’ll know your cost per merged PR better than your cloud bill. Which, let’s be honest, nobody knows either.
The CEO was right: proudly saying you use AI is no longer cool or acceptable. Everyone does. Knowing where you stand is the new flex: seven colors, a named weakest link, and the next gate to clear. Run the matrix with your whole team this month. Lift the lowest color. Repeat.
What’s your team’s weakest dimension? Enter the matrix and tell me what surprised you.

About Martijn Rutten
Fractional CTO & technology entrepreneur with a long history in challenging software projects. CTO at LUMO Labs impact VC. Former CTO of scale-up Insify, changing the insurance space for SMEs. Former CTO of fintech scale-up Othera, deep in the world of securitized digital assets. Coached many tech startups and corporate innovation teams at HighTechXL. Co-founded Vector Fabrics on parallelization of embedded software. PhD in hardware/software co-design at Philips Research & NXP Semiconductors. More about me.
Related Posts