The Gap
By April 2026, the adoption numbers are staggering: 90% of developers use AI at work and over 80% say it’s made them more productive. Gartner projects 40% of enterprise applications will feature task-specific AI agents by the end of 2026, up from less than 5% in 2025.
Then you read the other column. CIO reports that 95% of enterprises see zero return on their AI investments. McKinsey’s maturity model puts only around 11% of enterprises in the “AI-native” tier. The 2025 DORA report is more uncomfortable still: AI raises throughput and raises change-failure rate. PR size is up 154%. 30% of engineers don’t trust the code their own agents produce.
The gap isn’t the model. Frontier models are a commodity. They get swapped every six months and the next one is better. The gap is everything around the model: the control plane that routes requests, attributes cost, enforces policy, retrieves context, evaluates quality, and measures outcomes. It’s the platform that turns “we rolled out Copilot” into “we shipped a 30.8% reduction in PR cycle time across 1,900 repos,” which is what Atlassian did with Rovo Dev and published at ICSE 2026.
This post is for the architect who has been asked to lead that platform. Not to choose between Copilot and Cursor; that’s a week of spreadsheets. To design what sits around whatever agent you pick, so that a year from now your CFO knows what AI is costing and your CTO knows what it’s earning.
Chapter 1: What “Platform” Actually Means Here
When a VP of Engineering says “we have an AI platform,” they might mean one of three things:
- We bought Copilot Enterprise. Everyone has access.
- We stood up a chat UI in front of a couple of models.
- We run an internal control plane that mediates every AI request our engineers make, attributes cost per team, enforces policy per repo, evaluates quality continuously, exposes a curated surface of tools and skills, and lets any engineer publish a repeatable workflow that triggers on a schedule, a webhook, or a repository event.
Only the third one is a platform. The first two are procurements.
Shopify made this distinction concrete. Per Bessemer’s write-up of their AI-first engineering playbook, Shopify runs an LLM proxy. Every AI request from every tool, every engineer, every script, goes through one internal gateway. Engineers can pick their harness (Claude Code, Cursor, Copilot), but the proxy is non-negotiable. That single architectural choice is what gives them centralised cost control, usage analytics, model flexibility, and the ability to swap a model provider in a day instead of a quarter.
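To make the choke-point concrete, here is a minimal sketch of what a gateway like that enforces. Everything here is illustrative — the function names, prices, and in-memory ledger are assumptions for the sketch, not Shopify's implementation:

```python
import time
import uuid

# Hypothetical in-memory cost ledger; a real gateway would persist this.
COST_LEDGER = []

# Illustrative per-million-token prices; real prices vary by provider and model.
PRICE_PER_MTOK = {"small": 1.0, "mid": 3.0, "frontier": 15.0}

def gateway_request(team: str, repo: str, model_tier: str,
                    prompt_tokens: int, completion_tokens: int) -> dict:
    """Mediate one AI request: attribute cost, tag a trace, record usage.

    The actual model call is stubbed out; the point is that nothing
    reaches a provider without passing through this function.
    """
    price = PRICE_PER_MTOK[model_tier]
    cost = (prompt_tokens + completion_tokens) / 1_000_000 * price
    record = {
        "trace_id": uuid.uuid4().hex,
        "team": team,
        "repo": repo,
        "model_tier": model_tier,
        "tokens": prompt_tokens + completion_tokens,
        "cost_usd": round(cost, 6),
        "ts": time.time(),
    }
    COST_LEDGER.append(record)
    return record

def cost_by_team() -> dict:
    """Per-team chargeback view — the report the CFO conversation needs."""
    totals: dict = {}
    for r in COST_LEDGER:
        totals[r["team"]] = totals.get(r["team"], 0.0) + r["cost_usd"]
    return totals
```

The design point is that attribution is a side effect of mediation: because every call passes through one function, per-team cost reporting needs no instrumentation anywhere else.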
Block took a different turn at the same fork. Rather than wrapping third-party agents, they built Goose internally and open-sourced it. The stated reason, per CTO Dhanji Prasanna on the Sequoia Training Data podcast, was that “data leaving our infrastructure” was unacceptable. The outcome: engineers save 8-10 hours a week, Goose is on track to reclaim 25% of manual hours company-wide, and 100% of Goose’s own PRs are now written by Goose.
Two tech companies. Two legitimate answers to the same architectural prompt. Both built a platform. Neither bought one.
The platform you build, whether Shopify-shaped (gateway + BYO-harness) or Block-shaped (build-the-harness), owns five responsibilities:

- **Capability**: what tools and skills agents can reach, and how they're discovered.
  - Built-in tools shipped with the harness
  - Curated registry of approved MCP servers
  - Skills library: org-specific playbooks
  - Progressive disclosure at the gateway
- **Identity & Policy**: who's acting, with what scope, under what guardrails.
  - Two-identity model: human principal + agent service identity
  - Scoped, short-lived tokens per task
  - Policy-as-code at the MCP gateway
  - Graduated-trust approvals (async wait-state)
  - Sandbox runtime (gVisor, Firecracker, Kata) per trust tier
- **Context**: how org knowledge gets ingested, permissioned, and served.
  - Ingestion connectors for repos, docs, tickets, incidents, service catalog
  - Hybrid retrieval: BM25 + dense + graph
  - PII redaction and access-aware retrieval
  - Staleness SLOs per source
  - Token-budget compression and eviction
- **Evaluation**: how you know the agent is actually getting better, not just shipping faster.
  - Unit regressions for prompts, tool schemas, system prompts
  - Task-level golden set graded by LLM-as-judge
  - Production shadow traffic + online signals
  - Eval-as-CI: no prompt, tool, or model ships without passing
  - Thumbs-down-to-regression-test feedback loop
- **FinOps & Observability**: what it costs, who paid, what it produced, where it broke.
  - LLM gateway: every call mediated, measured, routable
  - Tiered-routing policy (Haiku / Sonnet / Opus by task class)
  - Prompt-cache hit rate as a first-class SLI
  - Per-team, per-repo, per-task cost attribution
  - End-to-end OpenTelemetry tracing
Those five wire up into a system shape worth drawing. The harness runs on each engineer's laptop (in a Docker container) or in a CI runner. Everything the harness reaches over the network is the platform. Everything the platform calls out to is a model provider, a curated MCP server, or a source system the platform has already indexed.
When multiple agents cooperate within this system, the topology that ships is orchestrator-worker, not swarm. Airbnb’s 3,500-file Enzyme-to-RTL migration proved this: per-file parallel workers, central orchestration, brute-force retries with dynamic prompts. 97% automated, 6 weeks, 6 engineers. Swarms, by contrast, are the dominant source of silent failure because a single hallucination in shared memory propagates to every peer that reads it. Chapter 5 shows how Goose sub-recipes implement the orchestrator-worker pattern concretely.
If any of these responsibilities isn’t owned by a named team with a roadmap, you don’t have a platform. You have a shadow-IT problem that will compound. These five exist to enable one thing: the workflow lifecycle. Chapter 5 names it; everything else makes it safe, cheap, and measurable.
Chapter 2: The Capability Surface
The first architectural argument inside every platform team is about how agents get things done. It usually gets framed as a choice between the Model Context Protocol (MCP), now stewarded by the Linux Foundation’s Agentic AI Foundation as of December 2025, and Agent Skills, the behavioural-instruction packages that started shipping with Claude.
Framing them as competitors is a category error. MCP is an execution fabric: a standardised RPC for tools, resources, and prompt templates, with bidirectional comms and dynamic tool discovery. Skills are a knowledge layer: portable instructions that encode the how-to of a specific job. MCP tells an agent what it can do. Skills tell an agent what it should do in a specific situation.
The context-bloat failure is the sharper risk. A single enterprise-grade GitHub MCP server exposes 90+ tools. Loaded naively, that’s 50,000+ tokens of schema entering the context window before the model has read a single line of user intent. Add Jira, the cloud provider, a feature-flag platform, your incident system, and your agent is spending six figures a year on tokens that are just tool catalogue. The overhead scales linearly with the number of services you connect.
The three-tier model that scales:
- Built-ins: the small primitive set the harness ships (file ops, shell, code execution). Always loaded.
- Curated MCP registry: a governed list of approved MCP servers with progressive disclosure. The agent sees metadata first, loads full tool schemas on semantic match.
- Skills library: org-specific playbooks in a searchable registry, discovered by description, expanded on demand.
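The middle tier's progressive disclosure is worth sketching, because the token savings are the whole point. The registry below is hypothetical (tool names, schema sizes, and the naive keyword matcher all stand in for a real semantic index):

```python
# Hypothetical tool registry illustrating progressive disclosure:
# agents see one-line metadata up front and pay for a full schema
# only when a tool matches the task.

REGISTRY = {
    "create_pull_request": {
        "summary": "Open a pull request on a repository",
        "schema_tokens": 450,   # cost of loading the full JSON schema
    },
    "list_issues": {
        "summary": "List open issues with filters",
        "schema_tokens": 380,
    },
    "merge_branch": {
        "summary": "Merge one branch into another",
        "schema_tokens": 420,
    },
}

def discover(task: str) -> list[str]:
    """Naive keyword overlap standing in for semantic search."""
    words = set(task.lower().split())
    return [name for name, meta in REGISTRY.items()
            if words & set(meta["summary"].lower().split())]

def context_cost(task: str) -> tuple[int, int]:
    """Tokens spent with progressive disclosure vs. loading everything."""
    matched = discover(task)
    disclosed = sum(REGISTRY[n]["schema_tokens"] for n in matched)
    naive = sum(m["schema_tokens"] for m in REGISTRY.values())
    return disclosed, naive
```

With three tools the saving is modest; with the 90+ tools of a real GitHub MCP server, metadata-first discovery is the difference between a working context window and a tool catalogue that crowds out the task.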
Block’s Goose is the cleanest public expression of this. The operational equation is Goose = LLM + MCP + Agent, but the load-bearing piece isn’t MCP. It’s Goose’s Recipes and Sub-recipes. Recipes are declarative YAML workflows that encode a repeatable piece of work; sub-recipes run in isolated sub-sessions with their own context windows. That isolation keeps token cost linear in the work done rather than quadratic in conversation depth. The result is the 30-40% of code Block’s top engineers now get from Goose in legacy codebases, per the Sequoia interview.
Chapter 3: Identity, Policy, and the Execution Boundary
Every agentic action has two identities the audit team cares about: the human on whose behalf the agent is acting, and the agent’s own service identity. Conflate them and compliance review kills your rollout.
The human identity provides authorisation scope. The agent identity provides attribution and accountability (which agent, which version, which session). Every tool invocation carries both. Tokens are short-lived (minutes, not days) and scoped to the specific task, not the session. Policy is enforced at the MCP gateway, as code, so auditors can diff it and engineers can review it.
How do you keep humans in the loop without drowning them in approval prompts? The pattern that works is the asynchronous wait-state. When an agent hits a high-risk decision (production deploy, financial transaction, irreversible write), the workflow suspends, persists its state externally, and emits an approval event. Reviewers act on their own clock, often hours later. On approval, the signal routes back and the workflow resumes exactly where it left off.
The anti-pattern is approval fatigue. The fix is graduated trust: scope approvals by blast radius.
- Read-only on scoped data: no approval.
- Mutations inside a sandbox or personal branch: no approval, full audit.
- PR against the main branch: standard code review.
- Production-shaped actions (deploys, config changes, prod data reads): explicit, async approval with a named owner.
- Irreversible (delete, drop, disable safety): two-person review.
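The ladder above is small enough to express directly as policy-as-code. A minimal sketch, assuming a hypothetical action-class taxonomy (the names and tier labels are illustrative, not a standard schema):

```python
# A policy-as-code sketch of the graduated-trust ladder.
# Action classes and approval names are illustrative.

APPROVAL_POLICY = {
    "read_scoped":      {"approval": None,          "audit": True},
    "sandbox_mutation": {"approval": None,          "audit": True},
    "open_pr":          {"approval": "code_review", "audit": True},
    "prod_action":      {"approval": "async_owner", "audit": True},
    "irreversible":     {"approval": "two_person",  "audit": True},
}

def required_approval(action_class: str):
    """Look up what an action needs before the agent may proceed.

    Unknown action classes fail closed: treat them as irreversible.
    """
    policy = APPROVAL_POLICY.get(action_class, APPROVAL_POLICY["irreversible"])
    return policy["approval"]
```

Because this is plain data plus a fail-closed default, auditors can diff it in a PR and engineers can unit-test it, which is exactly the property "policy-as-code" promises.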
GitHub’s Copilot Enterprise surface has become the most concrete public implementation. Per the December 2025 Enterprise roundup and Microsoft’s DevBlogs on agentic platform engineering, admins get fine-grained permissions, explicit MCP control, audit-log review, and policy-based gating of model upgrades.
Policy tells the agent what it’s allowed to try. The sandbox decides what happens when it tries the wrong thing. Three isolation tiers map to the graduated-trust model:
| Isolation tier | Mechanism | Right fit |
|---|---|---|
| Docker + seccomp | Namespaces and cgroups; shared host kernel | Dev-loop agents on an engineer’s own repo |
| gVisor | User-space kernel intercepting ~70 syscalls | Platform-served workers (CI, migrations, autonomous PRs) |
| Firecracker / Kata | Per-workload Linux kernel via KVM | Untrusted, multi-tenant, or cross-org execution |
Match the isolation tier to the trust tier. A read-only retrieval agent does not need a microVM. A production migration worker rewriting other teams’ code absolutely does.
Chapter 4: Context Engineering at Platform Scale
The frontier model isn’t your moat. The agent harness isn’t your moat. Your context is your moat: the graph of your repos, the runbooks nobody wrote down, the incident history, the ADRs, the style guides, the org’s service catalog.
MCP is connectivity, not context
A common mistake is assuming that connecting MCP servers to your agent solves the context problem. MCP gives your agent a standardised way to query any single system. Four things it does not do:
- Cross-source retrieval. “Find everything relevant to this migration across code, tickets, docs, and incidents” requires a unified index. No single MCP server spans all your sources.
- Pre-indexing. MCP queries are live. For a 500k-file monorepo, live search on every agent call is slow and expensive.
- Governance. PII redaction, access-aware filtering, staleness SLOs. Each MCP server returns raw data under its own auth model.
- Token-budget management. Fitting retrieved context to the model’s window is orchestration the pipeline owns, not the protocol.
The clean architecture: build the context pipeline, expose it as an MCP server. The agent queries one context endpoint. The pipeline behind it handles ingest, index, govern, serve.
The pipeline
Ingest. Connectors to the authoritative sources: repos, docs wiki, ticket system, incident tracker, service catalog. Each with an idempotent, versioned schema and an owner.
Index. Hybrid retrieval is the production default: BM25 for lexical recall, dense embeddings for semantic similarity, graph for structural relationships. No single index is sufficient.
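One common way to combine those indexes is reciprocal rank fusion (RRF), which merges ranked lists without needing to calibrate their raw scores against each other. A sketch (one of several fusion strategies, not the only correct choice):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists (e.g. BM25, dense, graph) into one.

    Standard RRF: each document scores sum(1 / (k + rank)) over the
    lists it appears in. k=60 is the conventional damping constant.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that ranks well in two indexes beats one that tops a single index, which is the behaviour you want when lexical and semantic retrieval disagree.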
Govern. Staleness SLOs per source. PII and secret redaction before indexing, not after retrieval. Access-aware retrieval: the retriever filters by the caller’s permissions before ranking. If your agent can see secrets its invoking user can’t, you have a data-exfiltration vulnerability wearing a productivity tool’s clothes.
Serve. A token-budget manager (compression, summarisation, eviction) that fits retrieved context to the model’s window and the task’s importance.
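The simplest serviceable budget manager is greedy selection by relevance. This sketch only selects and evicts; a production pipeline would add summarisation as a middle option between keep and drop:

```python
def fit_to_budget(chunks: list[dict], budget_tokens: int) -> list[dict]:
    """Greedy token-budget manager: keep the highest-relevance chunks
    that fit the window, evict the rest.

    Each chunk is a dict with 'tokens' and 'relevance' keys
    (a hypothetical shape for this sketch).
    """
    kept, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["relevance"], reverse=True):
        if used + chunk["tokens"] <= budget_tokens:
            kept.append(chunk)
            used += chunk["tokens"]
    return kept
```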
Augment Code’s Context Engine is the clearest public reference for this in 2026. It indexes up to 500,000 files across multiple repositories with roughly 100ms retrieval latency, building semantic dependency graphs. The telling move: Augment recently shipped the Context Engine as an MCP server, the exact pipeline-behind-protocol pattern. Sourcegraph’s Cody takes a three-layer approach (local file, local repo, remote repos), handling 300k+ repositories for enterprise customers. Stripe’s agent harness takes the curation angle: each “minion” gets scoped context per task, not the whole repo. Context curated, not copied.
The metric to watch: context hit rate per task type. If your hit rate is under 30%, your pipeline is ornamental.
Chapter 5: Workflows, the Unit That Ships
Four chapters described infrastructure. This chapter is about what the infrastructure produces. The deliverable is the workflow: a versioned, parameterised unit of work any engineer can build once, evaluate, and hand to other engineers (or to CI runners) who invoke it on a trigger they didn’t author.
- Definition (Goose Recipe, YAML): metadata + version, parameters, extensions (MCP), sub-recipes
- Alternative authoring surfaces: Claude Skills (md), Temporal / LangGraph, Rovo Studio (low-code)
- Triggers: cron (`goose schedule`), event (PR, issue, incident, webhook), manual / API (`goose run`, `goose serve`)
- Runtime: CI runner (ephemeral), agent pool (Modal / E2B / Northflank), laptop (dev-loop only)
- Observability per run: trigger source, parameters, spans + retries, status · cost · trace ID
- Governance: SHA-pinned versions, ownership + review, deprecation windows
Authoring
Four patterns; the choice follows who the author is:
- Recipe / YAML: Goose Recipes, GitHub Agentic Workflows (Feb 2026 preview). Structured, diff-reviewable, CI-friendly. The enterprise default.
- Prompt-as-code: Claude Skills. Flexible, closer to prose, weaker composition.
- DSL / real code: Temporal, LangGraph, Kestra. Maximum control; needs engineer authors.
- Low-code: Atlassian Rovo Studio. Natural-language authoring for non-engineers.
A Goose Recipe is the concrete shape most architects will end up writing:
```yaml
name: pr_security_review
recipe:
  version: 1.0.0
  title: PR Security Review
  description: OWASP-informed review of a pull-request diff.
  settings:
    goose_provider: anthropic
    goose_model: claude-sonnet-4-5
  parameters:
    - key: pr_url
      input_type: string
      requirement: required
      description: "Pull request URL to review"
  extensions:
    - type: builtin
      name: developer
    - type: streamable_http
      name: github
      uri: https://api.githubcopilot.com/mcp/x/pull_requests/readonly
  instructions: |
    You are a security reviewer. Check the diff for OWASP Top-10
    issues, secrets, and unsafe patterns. Be specific and sparing.
  prompt: |
    Review PR {{ pr_url }}. For each finding, cite the file,
    line, severity, and suggested fix. Post findings as a single
    PR comment. If nothing is found, say so.
```
Every primitive the last four chapters described is visible here. `settings` routes through the LLM gateway. `extensions` declares which approved MCP servers the capability surface exposes. `parameters` is how a non-author reuses the workflow. `instructions` vs `prompt` separates policy from task, which is what makes a Recipe testable.
Parameterisation and sub-workflows
A Recipe without parameters is a one-off. With parameters, it's a product. The sharper Goose primitive is the `sub_recipes` array: each sub-recipe runs in its own isolated subagent session with its own context window, and `sequential_when_repeated: true/false` picks parallel vs sequential execution. This is the orchestrator-worker pattern from Chapter 1, made concrete. It's what makes the Airbnb migration topology possible: 3,500 files fan out across parallel sub-recipe invocations, each with fresh context, orchestrated by one parent.
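The orchestrator-worker shape itself is small enough to sketch. This Python sketch stubs the agent call (the `_flaky` naming and retry behaviour are invented for illustration); the structural points are per-file fan-out, fresh state per worker, and brute-force retries, as in the Airbnb migration:

```python
from concurrent.futures import ThreadPoolExecutor

def migrate_file(path: str, attempt: int = 1, max_attempts: int = 3) -> dict:
    """Worker: one file, fresh context, brute-force retries.

    The agent call is stubbed; imagine each attempt re-prompting the
    model with a dynamically adjusted prompt.
    """
    # Stub: pretend files ending in '_flaky' only pass on the last attempt.
    success = not path.endswith("_flaky") or attempt >= max_attempts
    if success:
        return {"path": path, "attempts": attempt, "ok": True}
    return migrate_file(path, attempt + 1, max_attempts)

def orchestrate(paths: list[str], parallelism: int = 8) -> list[dict]:
    """Orchestrator: fan out per-file workers, collect results centrally."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(migrate_file, paths))
```

Note what is absent: no shared memory between workers. Each worker sees only its own file and its own retries, which is exactly the property that keeps one hallucination from propagating across the fleet.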
Triggers
Cron. `goose schedule add recipe.yaml --cron '0 9 * * 1-5'`. Nightly lint, weekly security audit, daily stale-PR report. The built-in scheduler is single-machine; for distributed schedules, wrap it in a Kubernetes CronJob or a Temporal worker pool.

Event-driven. PR opened, issue labelled, incident created, build failed. Atlassian's Rovo Dev fires on every PR. The Goose GitHub Action wraps the same pattern: label an issue with `goose` and a PR opens. Event-driven is where agents stop being assistants and start being automation.

Manual / API. `goose run -i recipe.yaml --param pr_url=https://...` from a CI step, or `goose serve` running as a webhook receiver inside the cluster.
Runtime, observability, and governance
Triggered workflows run on ephemeral CI runners (GitHub Actions, Buildkite) for sub-five-minute PR-shaped work, or on dedicated agent pools for long-running stateful work. Match runtime to the trust tier from Chapter 3.
Every triggered run is a first-class object: trigger source, parameters, spans with retry counts, final status, cost, trace ID. Kestra recorded over two billion workflow executions in 2025, up from one hundred million in 2024. That twenty-fold increase signals the direction of travel. If your platform cannot answer “what ran when, triggered by what, with what outcome?” in two clicks, it is opaque.
Shared workflows need product discipline. The GitHub Actions governance model (internal org, SHA-pinned versions, PR-reviewed contributions) is the pattern most enterprises borrow.
Chapter 6: Evaluation and Economics
Most platform teams skip evaluation and then wonder why their rollout plateaus. Evaluation is not a phase of delivery; it is the product that determines whether the other five chapters compound.
Silent failure
An agent completes its run without any software error (no exception, no crash, no red log line) and produces output that looks plausible and is wrong. The PR passes review because the diff looks reasonable. The test the agent wrote passes because it tests the buggy behaviour it introduced. Every DORA-2025 data point on increased change-failure rate is a silent-failure story that got written to disk.
The evaluation stack that catches silent failure has three layers.
Unit-level. Tool schemas, prompt templates, and system prompts each get their own regression suite. Every change runs a deterministic test set before it can ship.
Task-level. A curated golden set of real tasks, graded by LLM-as-judge with a rubric that includes business-outcome correctness, not just style. This is eval-as-CI.
Production. Shadow traffic and online signals: thumbs-up/down, PR accept rate on agent-authored code, downstream defect escape rate. The production signals feed back into the golden set. Every thumb-down becomes a candidate regression test.
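The CI gate at the task level reduces to a simple comparison. In this sketch the judge scores are supplied as plain dicts (in production they come from the LLM-as-judge rubric); the function shape and `tolerance` parameter are assumptions:

```python
def ci_gate(golden_set: list[dict], candidate_scores: dict,
            baseline_scores: dict, tolerance: float = 0.02) -> bool:
    """Eval-as-CI: pass only if the candidate's mean judge score does
    not regress by more than `tolerance` against the current baseline.

    golden_set: list of cases, each with an 'id' key.
    *_scores: case id -> judge score in [0, 1].
    """
    def mean(scores: dict) -> float:
        return sum(scores[case["id"]] for case in golden_set) / len(golden_set)

    return mean(candidate_scores) >= mean(baseline_scores) - tolerance
```

A small tolerance matters in practice: LLM-as-judge scores are noisy, and a zero-tolerance gate blocks harmless changes as often as harmful ones.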
Atlassian’s Rovo Dev Code Reviewer ran a year-long evaluation across more than 1,900 internal repos before general availability. The result, published at ICSE 2026, was a 30.8% reduction in PR cycle time and a 35.6% reduction in human-written review comments. The same three eval layers apply at the Recipe level: shadow-run the candidate against live triggers before promoting; canary to a subset before broad ship.
Token economics
By the time you have 5,000 engineers on your platform, token cost is non-linear in three dimensions: context depth, fan-out, and retry depth.
Tiered routing. Simple classification and extraction route to a cheap model (Haiku-class). Standard code generation routes to a mid-tier model (Sonnet-class). Hard planning and architectural synthesis is reserved for the frontier (Opus-class). Defaulting every call to the most expensive model is the single largest source of cost inflation.
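At its core the router is a lookup table with a deliberate default. The task classes and tier labels below are assumptions; map them to whatever taxonomy your gateway uses:

```python
# Illustrative tiered-routing policy. Task classes and tier names
# are placeholders, not a standard schema.

ROUTING_TABLE = {
    "classification": "haiku-class",
    "extraction":     "haiku-class",
    "codegen":        "sonnet-class",
    "refactor":       "sonnet-class",
    "planning":       "opus-class",
    "architecture":   "opus-class",
}

def route(task_class: str) -> str:
    """Route by task class. Unknown classes default to mid-tier —
    never to the frontier, which is how cost inflation starts."""
    return ROUTING_TABLE.get(task_class, "sonnet-class")
```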
Prompt caching as an SLI. Structured prompts should cache at 90%+ hit rate. A 90% cache hit translates to roughly 10x cost reduction on the cached portion. Cache hit rate deserves a dashboard, an owner, and an alert when it drops.
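The arithmetic behind that claim is worth making explicit. Assuming cached reads cost roughly 10% of the base input price (approximately the published ratio for several providers; check yours), a 90% hit rate yields about a 5x blended reduction on input tokens, with the cached 90% itself seeing the ~10x discount:

```python
def effective_input_cost(base_price: float, hit_rate: float,
                         cached_read_discount: float = 0.1) -> float:
    """Blended per-token input price under prompt caching.

    cached_read_discount is the cached-read price as a fraction of the
    base price (an assumption of this sketch; verify per provider).
    """
    return base_price * (hit_rate * cached_read_discount + (1 - hit_rate))
```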
Attribution at every level. Per-team, per-repo, per-task, per-session. Without attribution there’s no chargeback; without chargeback there’s no incentive for teams to care about efficiency.
Shopify’s LLM proxy, mentioned in Chapter 1, is the artefact that makes all of this possible. You cannot attribute cost you don’t see. You cannot route by complexity if requests bypass your router. Per First Round’s write-up, the proxy is what let Shopify’s engineering dashboard correlate AI usage with shipping impact, which in turn gave VP Eng Farhan Thawar the evidence to support the ~20% productivity gain the org now claims.
What to measure
The most common failure mode in “AI productivity” reporting is Goodhart’s Law in a lab coat. A measurement stack that survives scrutiny operates in four families: proxy (acceptance rate, session count), activity (DORA: PR count, lead time, CFR), outcome (defect escape, rework, dev-reported friction), and economic (hours saved, cost per merged PR). An architect reporting to leadership needs at least one number from each.
Consider the published record: Uber reports ~10% PR-velocity lift (Pragmatic Engineer), an activity metric. Shopify claims a ~20% productivity gain while publicly refusing to measure it in LOC, an outcome claim. Block's 8-10 hours saved per engineer per week is a clean economic metric. Airbnb's 18 months to 6 weeks is a sharp outcome metric with a legible counterfactual. Same reality. Different slices.
Chapter 7: The Build Sequence
The platform described above is not a weekend project. It also does not require a three-year transformation program. The sequence that has worked in the public record collapses into three horizons.
Days 0-90. Stand up the minimum viable control plane.
- Pick one harness. Don’t debate it for a quarter. Any of them is fine; the harness is replaceable.
- Stand up the LLM gateway. Every agent request flows through it. Day-one cost attribution.
- Ship one Recipe. Not twelve. Pick one repeatable task (PR security review, migration shard, on-call triage). Versioned, parameterised, triggered by one event, observable end-to-end. Everything else is scaffolding for the next Recipe.
- Stand up one golden eval set with an LLM-as-judge rubric. Wire it into CI. Refuse to promote prompts or Recipes that regress.
- Turn on OpenTelemetry tracing end-to-end.
Months 3-6. Build the moat.
- Context pipeline for your top-five repos: ingest, index, govern, serve. Measure hit rate.
- Policy-as-code at the gateway. Scoped tokens. Async approvals for production actions.
- Expand the eval harness to workflow-level: golden sets of Recipe invocations, shadow-mode promotion.
- First KPI dashboard: one proxy, one activity, one outcome, one economic metric.
Months 6-12. Compound.
- Orchestrator-worker topology for the hard workloads: migrations, cross-repo refactors, bulk compliance work.
- Recipe registry self-service with SHA-pinned versions. Teams contribute; the platform team curates.
- Progressive autonomy tiers. Graduate teams through read-only, sandboxed, PR, and production as their eval and incident track record earns it.
- Per-team chargeback. The budget conversation changes the usage conversation.
Fund internal DevRel from day one. Uber’s coursework moved Claude Code adoption from 32% to 63% of engineers in three months. Block’s engineers found Goose through Slack channels, not mandates. Shopify paired a top-down AI-first memo with bottom-up tool freedom through the LLM proxy. The technical platform and the organisational motion need to ship together.
In twelve months, when your CFO asks what AI is costing and what it's earning, you'll have an answer, because you built a platform rather than bought a license. That's the answer the 11% have, and it's not because they picked a better model.
References
- Google Cloud / DORA. 2025 State of AI-Assisted Software Development Report. Source for 90% adoption, 30% distrust, PR size +154%, and the stability/throughput tension.
- Faros AI. Key Takeaways from the DORA Report 2025. Practitioner analysis of the DORA findings.
- McKinsey / KPMG. AI at Scale: Q4 2025 AI Pulse. Source for the four-stage maturity model and the ~11% AI-native figure.
- OneReach / CIO. What Shapes Enterprise AI Agents in the Future. Source for the 95% zero-ROI and 14% change-management figures.
- Block. Block Open Source Introduces “codename goose” and Goose on GitHub.
- Sequoia. Training Data podcast with Dhanji Prasanna. Source for Block’s 8-10 hours/week, 25% target, and 30-40% legacy-code figures.
- All Things Open. Meet Goose: The open source AI agent built for developers.
- Bessemer Venture Partners. Inside Shopify’s AI-First Engineering Playbook.
- First Round Review. From Memo to Movement: Shopify’s Cultural Adoption of AI.
- Augment Code. Context Engine and Context Engine MCP now live. Source for the 500k-file indexing, ~100ms retrieval, and pipeline-behind-MCP pattern.
- Pragmatic Engineer. How Uber Uses AI for Development. Source for the 84% agentic-coding adoption, Claude Code 32% to 63%, and DevRel investment.
- Sourcegraph. How Cody understands your codebase and How Cody provides remote repository awareness. Source for the three-layer context architecture and 300k+ repo scale.
- Atlassian. 30.8% Faster PRs: How AI-Driven Rovo Dev Code Reviewer Improved Developer Productivity. Source for the ICSE 2026 publication figures.
- GitHub. December 2025 Enterprise Roundup. Source for Copilot Enterprise governance features.
- Microsoft DevBlogs. Agentic Platform Engineering with GitHub Copilot.
- Airbnb Engineering. Accelerating Large-Scale Test Migration with LLMs.
- Anthropic. Model Context Protocol.
- Gartner. 40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026.
- Block. Goose Recipes reference and Goose Recipes cookbook.
- Pulse MCP. Configure your agent with Goose Recipes.
- Block. Goose AI Developer Agent GitHub Action.
- GitHub. Automate repository tasks with GitHub Agentic Workflows.
- Kestra. Kestra 1.0 launch. Source for the 2B+ workflow executions in 2025.
- Temporal. Orchestrating Ambient Agents with Temporal.
- MindStudio. Stripe Minions vs Shopify Roast. Source for Stripe’s scoped-context agent pattern.
- GitHub. Building organization-wide governance for CI/CD with GitHub Actions.