The Gap

By April 2026, the adoption numbers are staggering: 90% of developers use AI at work and over 80% say it’s made them more productive. Gartner projects 40% of enterprise applications will feature task-specific AI agents by the end of 2026, up from less than 5% in 2025.

Then you read the other column. CIO reports that 95% of enterprises see zero return on their AI investments. McKinsey’s maturity model puts only around 11% of enterprises in the “AI-native” tier. The 2025 DORA report is more uncomfortable still: AI raises throughput and raises change-failure rate. PR size is up 154%. 30% of engineers don’t trust the code their own agents produce.

The gap isn’t the model. Frontier models are a commodity. They get swapped every six months and the next one is better. The gap is everything around the model: the control plane that routes requests, attributes cost, enforces policy, retrieves context, evaluates quality, and measures outcomes. It’s the platform that turns “we rolled out Copilot” into “we shipped a 30.8% reduction in PR cycle time across 1,900 repos,” which is what Atlassian did with Rovo Dev and published at ICSE 2026.

This post is for the architect who has been asked to lead that platform. Not to choose between Copilot and Cursor; that’s a week of spreadsheets. To design what sits around whatever agent you pick, so that a year from now your CFO knows what AI is costing and your CTO knows what it’s earning.

Chapter 1: What “Platform” Actually Means Here

When a VP of Engineering says “we have an AI platform,” they might mean one of three things:

  1. We bought Copilot Enterprise. Everyone has access.
  2. We stood up a chat UI in front of a couple of models.
  3. We run an internal control plane that mediates every AI request our engineers make, attributes cost per team, enforces policy per repo, evaluates quality continuously, exposes a curated surface of tools and skills, and lets any engineer publish a repeatable workflow that triggers on a schedule, a webhook, or a repository event.

Only the third one is a platform. The first two are procurements.

Shopify made this distinction concrete. Per Bessemer’s write-up of their AI-first engineering playbook, Shopify runs an LLM proxy. Every AI request from every tool, every engineer, every script, goes through one internal gateway. Engineers can pick their harness (Claude Code, Cursor, Copilot), but the proxy is non-negotiable. That single architectural choice is what gives them centralised cost control, usage analytics, model flexibility, and the ability to swap a model provider in a day instead of a quarter.

Block took a different turn at the same fork. Rather than wrapping third-party agents, they built Goose internally and open-sourced it. The stated reason, per CTO Dhanji Prasanna on the Sequoia Training Data podcast, was that “data leaving our infrastructure” was unacceptable. The outcome: engineers save 8-10 hours a week, Goose is on track to reclaim 25% of manual hours company-wide, and 100% of Goose’s own PRs are now written by Goose.

Two tech companies. Two legitimate answers to the same architectural prompt. Both built a platform. Neither bought one.

The platform you build, whether Shopify-shaped (gateway + BYO-harness) or Block-shaped (build-the-harness), owns five responsibilities:

[Diagram: the five control-plane responsibilities. Agents (Claude Code, Cursor, Copilot, Goose, internal harnesses) sit above the stack; thousands of engineers across dozens of teams sit below. For each layer: what it owns, who runs it, how it fails.]

Layer 1: Capability surface (platform-team, capability-sre)
  • Built-in tools shipped with the harness
  • Curated registry of approved MCP servers
  • Skills library: org-specific playbooks
  • Progressive disclosure at the gateway
  How it fails: 90+ tools loaded per prompt. A single enterprise GitHub MCP server loaded naively burns 50k+ tokens of schema before any reasoning. Overhead scales linearly in services connected.

Layer 2: Identity & policy (platform-team, security, iam)
  • Two-identity model: human principal + agent service identity
  • Scoped, short-lived tokens per task
  • Policy-as-code at the MCP gateway
  • Graduated-trust approvals (async wait-state)
  • Sandbox runtime (gVisor, Firecracker, Kata) per trust tier
  How it fails: the agent inherits the full user scope. One compromised prompt exfiltrates every permission the invoking user has. The blast radius of an AI breach becomes the blast radius of a human breach.

Layer 3: Context pipeline (platform-team, data-eng)
  • Ingestion connectors for repos, docs, tickets, incidents, service catalog
  • Hybrid retrieval: BM25 + dense + graph
  • PII redaction and access-aware retrieval
  • Staleness SLOs per source
  • Token-budget compression and eviction
  How it fails: every team ships its own RAG. Twelve incompatible stores, staleness nobody measures, PII leaking across permission boundaries, six answers to the same question depending on which index you hit.

Layer 4: Evaluation harness (platform-team, ai-coe)
  • Unit regressions for prompts, tool schemas, system prompts
  • Task-level golden set graded by LLM-as-judge
  • Production shadow traffic + online signals
  • Eval-as-CI: no prompt, tool, or model ships without passing
  • Thumbs-down-to-regression-test feedback loop
  How it fails: silent failure becomes the norm. The agent finishes without an error, the diff looks plausible, the test it wrote passes because it tests the buggy behaviour it introduced. The 2025 DORA change-failure-rate uptick is this failure, audited.

Layer 5: FinOps & observability (platform-team, finops, sre)
  • LLM gateway: every call mediated, measured, routable
  • Tiered-routing policy (Haiku / Sonnet / Opus by task class)
  • Prompt-cache hit rate as a first-class SLI
  • Per-team, per-repo, per-task cost attribution
  • End-to-end OpenTelemetry tracing
  How it fails: no gateway, no attribution, no chargeback. 5% of users burn 60% of the budget invisibly. When an incident hits at 3am, you have no trace ID. Just a shrug and an angry CFO.
  • Capability: what tools and skills agents can reach, and how they’re discovered.
  • Identity & Policy: who’s acting, with what scope, under what guardrails.
  • Context: how org knowledge gets ingested, permissioned, and served.
  • Evaluation: how you know the agent is actually getting better, not just shipping faster.
  • FinOps & Observability: what it costs, who paid, what it produced, where it broke.

Those five wire up into a system shape worth drawing. The harness runs on each engineer’s laptop (in a Docker container) or in a CI runner. Everything the harness reaches into over the network is the platform. Everything the platform calls out to is a model provider, a curated MCP server, or a source system the platform has already indexed.

[Diagram: the reference architecture, runtime → platform → providers. The harness runs on each engineer's laptop or in a CI container; everything the harness calls into is the platform. Every subsequent chapter zooms into one box on this diagram.]

Runtime (per-engineer, per-job):
  • Engineer's laptop: Docker container running the harness (Goose / Claude Code / Cursor), MCP clients, and a workspace mount.
  • CI runner: ephemeral container, headless harness, fired on PR / label / cron / webhook.

Platform (network services):
  • LLM gateway (hub): auth, tiered routing, prompt cache, cost attribution, rate limits.
  • MCP gateway + Skills (registry): approved MCP list, progressive disclosure, Skills library, Recipe registry.
  • Context API (retrieval): BM25 + dense + graph, ACL-aware, staleness SLOs.
  • Policy service (authz): scoped tokens, approval routing, sandbox-tier selection.
  • Eval harness (CI-gated): golden set, LLM-as-judge, shadow traffic, online signals.
  • Telemetry bus (OTel): traces, run records, cost events, per-team / per-repo / per-task.

Providers & sources (external):
  • Model providers: Anthropic, OpenAI, Google, plus internal and open-weight hosts.
  • Curated MCP servers: GitHub, Jira, cloud, feature flags, incident tooling.
  • Source systems (indexed): repos, docs, tickets, incidents, service catalog.

Cross-cutting:
  • Identity: IdP for the human principal, service registry for the agent identity.
  • Trace + cost DB (audit): queryable run history, the chargeback source of truth.

When multiple agents cooperate within this system, the topology that ships is orchestrator-worker, not swarm. Airbnb’s 3,500-file Enzyme-to-RTL migration proved this: per-file parallel workers, central orchestration, brute-force retries with dynamic prompts. 97% automated, 6 weeks, 6 engineers. Swarms, by contrast, are the dominant source of silent failure because a single hallucination in shared memory propagates to every peer that reads it. Chapter 5 shows how Goose sub-recipes implement the orchestrator-worker pattern concretely.

If any of these responsibilities isn’t owned by a named team with a roadmap, you don’t have a platform. You have a shadow-IT problem that will compound. These five exist to enable one thing: the workflow lifecycle. Chapter 5 names it; everything else makes it safe, cheap, and measurable.

Chapter 2: The Capability Surface

The first architectural argument inside every platform team is about how agents get things done. It usually gets framed as a choice between the Model Context Protocol (MCP), now stewarded by the Linux Foundation’s Agentic AI Foundation as of December 2025, and Agent Skills, the behavioural-instruction packages that started shipping with Claude.

Framing them as competitors is a category error. MCP is an execution fabric: a standardised RPC for tools, resources, and prompt templates, with bidirectional comms and dynamic tool discovery. Skills are a knowledge layer: portable instructions that encode the how-to of a specific job. MCP tells an agent what it can do. Skills tell an agent what it should do in a specific situation.

The context-bloat failure is the sharper risk. A single enterprise-grade GitHub MCP server exposes 90+ tools. Loaded naively, that’s 50,000+ tokens of schema entering the context window before the model has read a single line of user intent. Add Jira, the cloud provider, a feature-flag platform, your incident system, and your agent is spending six figures a year on tokens that are just tool catalogue. The overhead scales linearly with the number of services you connect.

The three-tier model that scales:

  1. Built-ins: the small primitive set the harness ships (file ops, shell, code execution). Always loaded.
  2. Curated MCP registry: a governed list of approved MCP servers with progressive disclosure. The agent sees metadata first, loads full tool schemas on semantic match.
  3. Skills library: org-specific playbooks in a searchable registry, discovered by description, expanded on demand.
[Diagram: capability surface, flat vs progressive. Same five MCP servers (GitHub, Jira, Cloud, Feature Flags, Incident), same task, different token math.]

With everything loaded flat into a 200k-token context window:
  • Built-ins: 4.0k tokens
  • MCP schemas: 52.0k tokens
  • Skills: 18.0k tokens
  • Org context: 8.0k tokens
  • Headroom: 118.0k tokens

The overhead math for the flat layout:
  • Tool schemas loaded: 94
  • Overhead before reasoning (schemas + skill docs, loaded before the model reads the task): 78,000 tokens, 39% of the window
  • Cost per call at Sonnet pricing ($3 / 1M input tokens), overhead only: $0.234
  • Annual burn at 1M calls, overhead alone (reasoning is extra): $234,000
  • Effective headroom: 118k of 200k left for task + reasoning
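The dollar figures above follow from simple arithmetic. A quick check, using the figures quoted in the comparison (78k overhead tokens, Sonnet-class input pricing of $3 per million tokens, a million calls a year):

```python
OVERHEAD_TOKENS = 78_000     # schemas + skill docs loaded before the task
PRICE_PER_MTOK = 3.00        # Sonnet-class input pricing, $ per 1M tokens
CALLS_PER_YEAR = 1_000_000
CONTEXT_WINDOW = 200_000

cost_per_call = OVERHEAD_TOKENS / 1_000_000 * PRICE_PER_MTOK
annual_burn = cost_per_call * CALLS_PER_YEAR
overhead_share = OVERHEAD_TOKENS / CONTEXT_WINDOW

print(f"cost per call:  ${cost_per_call:.3f}")   # $0.234
print(f"annual burn:    ${annual_burn:,.0f}")    # $234,000
print(f"overhead share: {overhead_share:.0%}")   # 39%
```

The lever is the first constant: progressive disclosure attacks OVERHEAD_TOKENS directly, and everything downstream scales with it.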

Block’s Goose is the cleanest public expression of this. The operational equation is Goose = LLM + MCP + Agent, but the load-bearing piece isn’t MCP. It’s Goose’s Recipes and Sub-recipes. Recipes are declarative YAML workflows that encode a repeatable piece of work; sub-recipes run in isolated sub-sessions with their own context windows. That isolation keeps token cost linear in the work done rather than quadratic in conversation depth. The result is the 30-40% of code Block’s top engineers now get from Goose in legacy codebases, per the Sequoia interview.

Chapter 3: Identity, Policy, and the Execution Boundary

Every agentic action has two identities the audit team cares about: the human on whose behalf the agent is acting, and the agent’s own service identity. Conflate them and compliance review kills your rollout.

The human identity provides authorisation scope. The agent identity provides attribution and accountability (which agent, which version, which session). Every tool invocation carries both. Tokens are short-lived (minutes, not days) and scoped to the specific task, not the session. Policy is enforced at the MCP gateway, as code, so auditors can diff it and engineers can review it.

How do you keep humans in the loop without drowning them in approval prompts? The pattern that works is the asynchronous wait-state. When an agent hits a high-risk decision (production deploy, financial transaction, irreversible write), the workflow suspends, persists its state externally, and emits an approval event. Reviewers act on their own clock, often hours later. On approval, the signal routes back and the workflow resumes exactly where it left off.
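A minimal sketch of the suspend/resume mechanics, using a file-backed store as a stand-in for a durable workflow state store (all names and paths here are illustrative):

```python
import json
import uuid
from enum import Enum
from pathlib import Path

class RunState(str, Enum):
    RUNNING = "running"
    WAITING_APPROVAL = "waiting_approval"
    RESUMED = "resumed"

STORE = Path("/tmp/agent-runs")        # stand-in for a durable store
STORE.mkdir(exist_ok=True)

def suspend_for_approval(run_id: str, step: str, payload: dict) -> str:
    """Persist the run externally and emit an approval event; the process can exit."""
    record = {"run_id": run_id, "step": step, "payload": payload,
              "state": RunState.WAITING_APPROVAL}
    (STORE / f"{run_id}.json").write_text(json.dumps(record))
    return f"approval-event:{run_id}"   # routed to a reviewer's queue

def resume_on_approval(run_id: str, approved_by: str) -> dict:
    """Hours later: the signal routes back, state reloads, work continues."""
    record = json.loads((STORE / f"{run_id}.json").read_text())
    record["state"] = RunState.RESUMED
    record["approved_by"] = approved_by
    return record                       # orchestrator picks up at record["step"]

run_id = uuid.uuid4().hex
suspend_for_approval(run_id, step="deploy_to_prod", payload={"service": "checkout"})
resumed = resume_on_approval(run_id, approved_by="alice")
```

The key property is that nothing blocks between the two calls: the agent process can die, the reviewer can take a day, and the run resumes from persisted state either way. Temporal-style workflow engines give you this pattern with durability guarantees for free.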

The anti-pattern is approval fatigue. The fix is graduated trust: scope approvals by blast radius.

  • Read-only on scoped data: no approval.
  • Mutations inside a sandbox or personal branch: no approval, full audit.
  • PR against the main branch: standard code review.
  • Production-shaped actions (deploys, config changes, prod data reads): explicit, async approval with a named owner.
  • Irreversible (delete, drop, disable safety): two-person review.

GitHub’s Copilot Enterprise surface has become the most concrete public implementation. Per the December 2025 Enterprise roundup and Microsoft’s DevBlogs on agentic platform engineering, admins get fine-grained permissions, explicit MCP control, audit-log review, and policy-based gating of model upgrades.

Policy tells the agent what it’s allowed to try. The sandbox decides what happens when it tries the wrong thing. Three isolation tiers map to the graduated-trust model:

  • Docker + seccomp: namespaces and cgroups on a shared host kernel. Right fit: dev-loop agents on an engineer's own repo.
  • gVisor: user-space kernel intercepting ~70 syscalls. Right fit: platform-served workers (CI, migrations, autonomous PRs).
  • Firecracker / Kata: per-workload Linux kernel via KVM. Right fit: untrusted, multi-tenant, or cross-org execution.

Match the isolation tier to the trust tier. A read-only retrieval agent does not need a microVM. A production migration worker rewriting other teams’ code absolutely does.

Chapter 4: Context Engineering at Platform Scale

The frontier model isn’t your moat. The agent harness isn’t your moat. Your context is your moat: the graph of your repos, the runbooks nobody wrote down, the incident history, the ADRs, the style guides, the org’s service catalog.

MCP is connectivity, not context

A common mistake is assuming that connecting MCP servers to your agent solves the context problem. MCP gives your agent a standardised way to query any single system. Four things it does not do:

  1. Cross-source retrieval. “Find everything relevant to this migration across code, tickets, docs, and incidents” requires a unified index. No single MCP server spans all your sources.
  2. Pre-indexing. MCP queries are live. For a 500k-file monorepo, live search on every agent call is slow and expensive.
  3. Governance. PII redaction, access-aware filtering, staleness SLOs. Each MCP server returns raw data under its own auth model.
  4. Token-budget management. Fitting retrieved context to the model’s window is orchestration the pipeline owns, not the protocol.

The clean architecture: build the context pipeline, expose it as an MCP server. The agent queries one context endpoint. The pipeline behind it handles ingest, index, govern, serve.

The pipeline

Ingest. Connectors to the authoritative sources: repos, docs wiki, ticket system, incident tracker, service catalog. Each with an idempotent, versioned schema and an owner.

Index. Hybrid retrieval is the production default: BM25 for lexical recall, dense embeddings for semantic similarity, graph for structural relationships. No single index is sufficient.
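One standard way to combine the three indexes is reciprocal rank fusion, which needs only each index's ranking rather than comparable scores. A sketch with illustrative document names (this is one fusion choice, not a claim about any particular vendor's ranker):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists (e.g. BM25, dense, graph) by summed reciprocal rank."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25  = ["runbook.md", "adr-0042.md", "incident-311.md"]
dense = ["adr-0042.md", "service-catalog.md", "runbook.md"]
graph = ["adr-0042.md", "incident-311.md"]

fused = reciprocal_rank_fusion([bm25, dense, graph])
# adr-0042.md wins: it appears near the top of all three lists.
```

The ACL filter from the Govern step belongs before this fusion, so permission checks run on candidate sets, not on the final ranked context.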

Govern. Staleness SLOs per source. PII and secret redaction before indexing, not after retrieval. Access-aware retrieval: the retriever filters by the caller’s permissions before ranking. If your agent can see secrets its invoking user can’t, you have a data-exfiltration vulnerability wearing a productivity tool’s clothes.

Serve. A token-budget manager (compression, summarisation, eviction) that fits retrieved context to the model’s window and the task’s importance.

Augment Code’s Context Engine is the clearest public reference for this in 2026. It indexes up to 500,000 files across multiple repositories with roughly 100ms retrieval latency, building semantic dependency graphs. The telling move: Augment recently shipped the Context Engine as an MCP server, the exact pipeline-behind-protocol pattern. Sourcegraph’s Cody takes a three-layer approach (local file, local repo, remote repos), handling 300k+ repositories for enterprise customers. Stripe’s agent harness takes the curation angle: each “minion” gets scoped context per task, not the whole repo. Context curated, not copied.

The metric to watch: context hit rate per task type. If your hit rate is under 30%, your pipeline is ornamental.

Chapter 5: Workflows, the Unit That Ships

Four chapters described infrastructure. This chapter is about what the infrastructure produces. The deliverable is the workflow: a versioned, parameterised unit of work any engineer can build once, evaluate, and hand to other engineers (or to CI runners) who invoke it on a trigger they didn’t author.

The workflow lifecycle
author → trigger → run → observe
The control plane of Chapters 1–5 exists to make this lifecycle safe, cheap, and measurable. A workflow is the unit that ships.
Authors
Recipe YAML
default
  • metadata + version
  • parameters
  • extensions (MCP)
  • sub-recipes
Skills / DSL
alt
  • Claude Skills (md)
  • Temporal / LangGraph
  • Rovo Studio (low-code)
Platform
Triggers
fire
  • cron (goose schedule)
  • event (PR, issue, incident, webhook)
  • manual / API (goose run, goose serve)
Runtime
run
  • CI runner (ephemeral)
  • agent pool (Modal / E2B / Northflank)
  • laptop (dev-loop only)
Observability
Run record
first-class
  • trigger source
  • parameters
  • spans + retries
  • status · cost · trace ID
Governance
registry
  • SHA-pinned versions
  • ownership + review
  • deprecation windows

Authoring

Four patterns; the choice follows who the author is:

  • Recipe / YAML: Goose Recipes, GitHub Agentic Workflows (Feb 2026 preview). Structured, diff-reviewable, CI-friendly. The enterprise default.
  • Prompt-as-code: Claude Skills. Flexible, closer to prose, weaker composition.
  • DSL / real code: Temporal, LangGraph, Kestra. Maximum control; needs engineer authors.
  • Low-code: Atlassian Rovo Studio. Natural-language authoring for non-engineers.

A Goose Recipe is the concrete shape most architects will end up writing:

name: pr_security_review
recipe:
  version: 1.0.0
  title: PR Security Review
  description: OWASP-informed review of a pull-request diff.
  settings:
    goose_provider: anthropic
    goose_model: claude-sonnet-4-5
  parameters:
    - key: pr_url
      input_type: string
      requirement: required
      description: "Pull request URL to review"
  extensions:
    - type: builtin
      name: developer
    - type: streamable_http
      name: github
      uri: https://api.githubcopilot.com/mcp/x/pull_requests/readonly
  instructions: |
    You are a security reviewer. Check the diff for OWASP Top-10
    issues, secrets, and unsafe patterns. Be specific and sparing.
  prompt: |
    Review PR {{ pr_url }}. For each finding, cite the file,
    line, severity, and suggested fix. Post findings as a single
    PR comment. If nothing is found, say so.

Every primitive the last four chapters described is visible here. settings routes through the LLM gateway. extensions declares which approved MCP servers the capability surface exposes. parameters is how a non-author reuses the workflow. instructions vs prompt separates policy from task, which is what makes a Recipe testable.

Parameterisation and sub-workflows

A Recipe without parameters is a one-off. With parameters, it’s a product. The sharper Goose primitive is the sub_recipes array: each sub-recipe runs in its own isolated subagent session with its own context window, and sequential_when_repeated: true/false picks parallel vs sequential execution. This is the orchestrator-worker pattern from Chapter 1, made concrete. It’s what makes the Airbnb migration topology possible: 3,500 files fan out across parallel sub-recipe invocations, each with fresh context, orchestrated by one parent.
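The fan-out itself is easy to sketch. Here a thread pool stands in for parallel sub-recipe invocations in isolated sessions, and a trivial predicate stands in for the agent succeeding or failing on a given file (file names and retry logic are illustrative, not Airbnb's):

```python
from concurrent.futures import ThreadPoolExecutor

def migrate_file(path: str, attempt: int = 1, max_attempts: int = 3) -> dict:
    """Worker: one file, one fresh context, brute-force retries with a new prompt."""
    # stand-in for invoking a sub-recipe in an isolated subagent session
    ok = not path.endswith("flaky.test.js") or attempt >= 2
    if not ok and attempt < max_attempts:
        return migrate_file(path, attempt + 1)   # retry with a dynamic prompt
    return {"path": path, "migrated": ok, "attempts": attempt}

files = ["cart.test.js", "checkout.test.js", "flaky.test.js"]

# Orchestrator: fan out per-file workers, collect results centrally.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(migrate_file, files))

automated = sum(r["migrated"] for r in results) / len(results)
```

Because each worker's context is fresh, token cost stays linear in files migrated, and one worker's bad output cannot poison its siblings. That is the property swarms with shared memory give up.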

Triggers

Cron. goose schedule add recipe.yaml --cron '0 9 * * 1-5'. Nightly lint, weekly security audit, daily stale-PR report. The built-in scheduler is single-machine; for distributed schedules, wrap with a Kubernetes CronJob or a Temporal worker pool.

Event-driven. PR opened, issue labelled, incident created, build failed. Atlassian’s Rovo Dev fires on every PR. The Goose GitHub Action wraps the same pattern: label an issue with goose and a PR opens. Event-driven is where agents stop being assistants and start being automation.

Manual / API. goose run -i recipe.yaml --param pr_url=https://... from a CI step, or goose serve running as a webhook receiver inside the cluster.

Runtime, observability, and governance

Triggered workflows run on ephemeral CI runners (GitHub Actions, Buildkite) for sub-five-minute PR-shaped work, or on dedicated agent pools for long-running stateful work. Match runtime to the trust tier from Chapter 3.

Every triggered run is a first-class object: trigger source, parameters, spans with retry counts, final status, cost, trace ID. Kestra recorded over two billion workflow executions in 2025, up from one hundred million in 2024. That twenty-fold increase signals the direction of travel. If your platform cannot answer “what ran when, triggered by what, with what outcome?” in two clicks, it is opaque.

Shared workflows need product discipline. The GitHub Actions governance model (internal org, SHA-pinned versions, PR-reviewed contributions) is the pattern most enterprises borrow.

Chapter 6: Evaluation and Economics

Most platform teams skip evaluation and then wonder why their rollout plateaus. Evaluation is not a phase of delivery; it is the product that determines whether the other five chapters compound.

Silent failure

An agent completes its run without any software error (no exception, no crash, no red log line) and produces output that looks plausible and is wrong. The PR passes review because the diff looks reasonable. The test the agent wrote passes because it tests the buggy behaviour it introduced. Every DORA-2025 data point on increased change-failure rate is a silent-failure story that got written to disk.

The evaluation stack that catches silent failure has three layers.

Unit-level. Tool schemas, prompt templates, and system prompts each get their own regression suite. Every change runs a deterministic test set before it can ship.

Task-level. A curated golden set of real tasks, graded by LLM-as-judge with a rubric that includes business-outcome correctness, not just style. This is eval-as-CI.

Production. Shadow traffic and online signals: thumbs-up/down, PR accept rate on agent-authored code, downstream defect escape rate. The production signals feed back into the golden set. Every thumb-down becomes a candidate regression test.
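A minimal sketch of the task-level gate, with a hard-coded heuristic standing in for the LLM-as-judge call (the judge logic, rubric text, and thresholds are all illustrative):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenCase:
    task: str
    rubric: str        # what the judge grades against
    min_score: float   # promotion threshold for this case

def judge(output: str, rubric: str) -> float:
    """Stand-in for an LLM-as-judge call returning a rubric score in [0, 1]."""
    return 0.9 if "severity" in output else 0.3

def gate(candidate: Callable[[str], str],
         golden_set: list[GoldenCase]) -> tuple[bool, list[str]]:
    """Eval-as-CI: the candidate prompt/tool/model ships only if every case passes."""
    failures = []
    for case in golden_set:
        score = judge(candidate(case.task), case.rubric)
        if score < case.min_score:
            failures.append(f"{case.task}: scored {score:.2f}, needs {case.min_score}")
    return (not failures, failures)

golden = [GoldenCase("review PR #1234", "cites file, line, severity, and fix", 0.8)]

ok, failures = gate(lambda t: "auth.py:42 severity=high, escape the input", golden)
regressed, why = gate(lambda t: "looks fine to me", golden)
```

The CI wiring is the point: `gate` runs on every prompt or Recipe change, and a failing case blocks the merge the same way a failing unit test does. Thumbs-downs from production append new `GoldenCase` entries.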

Atlassian’s Rovo Dev Code Reviewer ran a year-long evaluation across more than 1,900 internal repos before general availability. The result, published at ICSE 2026, was a 30.8% reduction in PR cycle time and a 35.6% reduction in human-written review comments. The same three eval layers apply at the Recipe level: shadow-run the candidate against live triggers before promoting; canary to a subset before broad ship.

Token economics

By the time you have 5,000 engineers on your platform, token cost is non-linear in three dimensions: context depth, fan-out, and retry depth.

Tiered routing. Simple classification and extraction routes to a cheap model (Haiku-class). Standard code generation routes to mid-tier (Sonnet-class). Hard planning and architectural synthesis reserves to the frontier (Opus-class). Defaulting every call to the most expensive model is the single largest source of cost inflation.
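A sketch of the router and of the inflation it prevents. The per-million-token prices and task-class names below are illustrative assumptions, not quoted vendor pricing:

```python
# Illustrative per-1M-input-token prices; check current vendor pricing.
TIERS = {
    "haiku":  {"price": 0.80,  "classes": {"classify", "extract", "summarise"}},
    "sonnet": {"price": 3.00,  "classes": {"codegen", "review", "refactor"}},
    "opus":   {"price": 15.00, "classes": {"plan", "architect", "synthesise"}},
}

def route(task_class: str) -> str:
    """Cheapest tier whose allow-list covers the task class; fail up, not down."""
    for tier, cfg in TIERS.items():      # dicts keep insertion order, cheapest first
        if task_class in cfg["classes"]:
            return tier
    return "opus"                        # unknown work gets the frontier

# The inflation the text describes: everything on the frontier vs routed.
tokens = {"classify": 400_000_000, "codegen": 350_000_000, "plan": 97_000_000}
all_opus = sum(tokens.values()) / 1e6 * TIERS["opus"]["price"]
routed = sum(n / 1e6 * TIERS[route(c)]["price"] for c, n in tokens.items())
```

With this (made-up) traffic mix, the frontier-only bill is $12,705 against $2,825 routed, a 4-5x spread from routing policy alone.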

Prompt caching as an SLI. Structured prompts should cache at 90%+ hit rate. A 90% cache hit translates to roughly 10x cost reduction on the cached portion. Cache hit rate deserves a dashboard, an owner, and an alert when it drops.
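The blended effect of the cache is worth working through once, since "10x on the cached portion" is not a 10x bill reduction. A sketch assuming cached reads bill at 10% of fresh input price:

```python
def effective_input_cost(tokens: int, price_per_mtok: float,
                         hit_rate: float, cached_price_ratio: float = 0.10) -> float:
    """Blended input cost in dollars.

    cached_price_ratio = 0.10 assumes cache reads cost ~10% of fresh input,
    i.e. the '10x reduction on the cached portion' from the text.
    """
    cached = tokens * hit_rate
    fresh = tokens - cached
    return (fresh * price_per_mtok + cached * price_per_mtok * cached_price_ratio) / 1e6

cold = effective_input_cost(1_000_000, 3.00, hit_rate=0.0)   # no cache: $3.00 per 1M
warm = effective_input_cost(1_000_000, 3.00, hit_rate=0.90)  # 90% hits: $0.57 per 1M
```

So a 90% hit rate cuts the blended input bill roughly 5x, and every point the hit rate drops shows up directly in spend, which is exactly why it deserves its own alert.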

Attribution at every level. Per-team, per-repo, per-task, per-session. Without attribution there’s no chargeback; without chargeback there’s no incentive for teams to care about efficiency.

Shopify’s LLM proxy, mentioned in Chapter 1, is the artefact that makes all of this possible. You cannot attribute cost you don’t see. You cannot route by complexity if requests bypass your router. Per First Round’s write-up, the proxy is what let Shopify’s engineering dashboard correlate AI usage with shipping impact, which in turn gave VP Eng Farhan Thawar the evidence to support the ~20% productivity gain the org now claims.

[Dashboard: agentic platform cost observability, April 2026, 4,200 engineers. Token → dollar → team → task: the view your CFO asks for on Monday morning.]

  • Monthly tokens: 847M (+12.4% MoM)
  • Monthly spend: $284.2k (+8.1% MoM)
  • Cache hit rate: 78% (target: >75%)
  • Cost per merged PR: $0.82 (−14.5% MoM)

A per-team drill-down breaks monthly spend out by model tier (Haiku / Sonnet / Opus), alongside each team's cost per PR, merged PRs, and cache hit rate.

What to measure

The most common failure mode in “AI productivity” reporting is Goodhart’s Law in a lab coat. A measurement stack that survives scrutiny operates in four families: proxy (acceptance rate, session count), activity (DORA: PR count, lead time, CFR), outcome (defect escape, rework, dev-reported friction), and economic (hours saved, cost per merged PR). An architect reporting to leadership needs at least one number from each.

Consider the published record: Uber reports ~10% PR-velocity lift (Pragmatic Engineer), an activity metric. Shopify claims a ~20% productivity gain, accompanied by a public refusal to measure it in LOC, an outcome claim. Block's 8-10 hours saved per engineer per week is a clean economic metric. Airbnb's 18 months to 6 weeks is a sharp outcome metric with a legible counterfactual. Same reality. Different slices.

Chapter 7: The Build Sequence

The platform described above is not a weekend project. It also does not require a three-year transformation program. The sequence that has worked in the public record collapses into three horizons.

Days 0-90. Stand up the minimum viable control plane.

  • Pick one harness. Don’t debate it for a quarter. Any of them is fine; the harness is replaceable.
  • Stand up the LLM gateway. Every agent request flows through it. Day-one cost attribution.
  • Ship one Recipe. Not twelve. Pick one repeatable task (PR security review, migration shard, on-call triage). Versioned, parameterised, triggered by one event, observable end-to-end. Everything else is scaffolding for the next Recipe.
  • Stand up one golden eval set with an LLM-as-judge rubric. Wire it into CI. Refuse to promote prompts or Recipes that regress.
  • Turn on OpenTelemetry tracing end-to-end.

Months 3-6. Build the moat.

  • Context pipeline for your top-five repos: ingest, index, govern, serve. Measure hit rate.
  • Policy-as-code at the gateway. Scoped tokens. Async approvals for production actions.
  • Expand the eval harness to workflow-level: golden sets of Recipe invocations, shadow-mode promotion.
  • First KPI dashboard: one proxy, one activity, one outcome, one economic metric.

Months 6-12. Compound.

  • Orchestrator-worker topology for the hard workloads: migrations, cross-repo refactors, bulk compliance work.
  • Recipe registry self-service with SHA-pinned versions. Teams contribute; the platform team curates.
  • Progressive autonomy tiers. Graduate teams through read-only, sandboxed, PR, and production as their eval and incident track record earns it.
  • Per-team chargeback. The budget conversation changes the usage conversation.

Fund internal DevRel from day one. Uber’s coursework moved Claude Code adoption from 32% to 63% of engineers in three months. Block’s engineers found Goose through Slack channels, not mandates. Shopify paired a top-down AI-first memo with bottom-up tool freedom through the LLM proxy. The technical platform and the organisational motion need to ship together.

In twelve months, when your CFO asks what AI is costing and what it's earning, you will have an answer, because you built a platform instead of buying a license. That's the answer the 11% have, and it's not because they picked a better model.

References

  1. Google Cloud / DORA. 2025 State of AI-Assisted Software Development Report. Source for 90% adoption, 30% distrust, PR size +154%, and the stability/throughput tension.
  2. Faros AI. Key Takeaways from the DORA Report 2025. Practitioner analysis of the DORA findings.
  3. McKinsey / KPMG. AI at Scale: Q4 2025 AI Pulse. Source for the four-stage maturity model and the ~11% AI-native figure.
  4. OneReach / CIO. What Shapes Enterprise AI Agents in the Future. Source for the 95% zero-ROI and 14% change-management figures.
  5. Block. Block Open Source Introduces “codename goose” and Goose on GitHub.
  6. Sequoia. Training Data podcast with Dhanji Prasanna. Source for Block’s 8-10 hours/week, 25% target, and 30-40% legacy-code figures.
  7. All Things Open. Meet Goose: The open source AI agent built for developers.
  8. Bessemer Venture Partners. Inside Shopify’s AI-First Engineering Playbook.
  9. First Round Review. From Memo to Movement: Shopify’s Cultural Adoption of AI.
  10. Augment Code. Context Engine and Context Engine MCP now live. Source for the 500k-file indexing, ~100ms retrieval, and pipeline-behind-MCP pattern.
  11. Pragmatic Engineer. How Uber Uses AI for Development. Source for the 84% agentic-coding adoption, Claude Code 32% to 63%, and DevRel investment.
  12. Sourcegraph. How Cody understands your codebase and How Cody provides remote repository awareness. Source for the three-layer context architecture and 300k+ repo scale.
  13. Atlassian. 30.8% Faster PRs: How AI-Driven Rovo Dev Code Reviewer Improved Developer Productivity. Source for the ICSE 2026 publication figures.
  14. GitHub. December 2025 Enterprise Roundup. Source for Copilot Enterprise governance features.
  15. Microsoft DevBlogs. Agentic Platform Engineering with GitHub Copilot.
  16. Airbnb Engineering. Accelerating Large-Scale Test Migration with LLMs.
  17. Anthropic. Model Context Protocol.
  18. Gartner. 40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026.
  19. Block. Goose Recipes reference and Goose Recipes cookbook.
  20. Pulse MCP. Configure your agent with Goose Recipes.
  21. Block. Goose AI Developer Agent GitHub Action.
  22. GitHub. Automate repository tasks with GitHub Agentic Workflows.
  23. Kestra. Kestra 1.0 launch. Source for the 2B+ workflow executions in 2025.
  24. Temporal. Orchestrating Ambient Agents with Temporal.
  25. MindStudio. Stripe Minions vs Shopify Roast. Source for Stripe’s scoped-context agent pattern.
  26. GitHub. Building organization-wide governance for CI/CD with GitHub Actions.