Introduction: The Agentic Shift in Software Engineering
Software engineering tools have been getting closer to translating what humans want into what machines do. We went from assembly to C, from malloc/free to garbage collection, from Vim to IDEs with autocomplete. But something changed in the last two years. We're now developing more than just "autocomplete" solutions.
Code Assist tools suggest the next token based on what’s in front of your cursor. General Purpose Code Assist Agents can reason about your entire codebase, plan multi-step changes, execute commands, run tests, and fix their own mistakes. The difference is: one is a fancy text predictor, the other is something that actually does engineering work.
OpenHands (formerly OpenDevin) is one of the most interesting open-source projects in this space. It’s a runtime that takes probabilistic LLM outputs and turns them into deterministic actions—compiling code, running tests, managing Docker containers. This post digs into how OpenHands works: its architecture, the CodeAct framework it uses, how it sandboxes execution safely, and what the benchmarks tell us about where this technology actually stands.
AI as an Amplifier: Why This Matters
Here’s something weird from the 2025 DORA report: AI adoption is basically universal now, but productivity gains are all over the place. Some teams are crushing it, others are drowning in AI-generated technical debt. What’s the difference?
AI acts as an amplifier. If your team already has good platform engineering and loosely coupled architectures, AI makes you faster. If you’re stuck with a tightly coupled monolith and manual deployments, AI will help you write bad code faster.
This matters for what makes a good code assist agent. Just dumping code into a buffer isn’t enough. A useful agent needs to:
- Navigate existing codebases without breaking things
- Verify its own changes against test suites
- Learn from failures and try different approaches
- Understand the context of your organisation and infrastructure, and adapt accordingly
Think of it as a high-performing junior engineer who can write code quickly but needs to check their work, not a hyper-intelligent autocomplete. OpenHands tries to be the former by treating the agent as part of the actual development workflow, not just a chatbot that spits out code.
Why Open Source Matters Here
The first wave of coding agents (like Devin) arrived as black boxes. Impressive demos, but good luck getting your security team to approve giving them write access to your production codebase. When an agent deletes a config file, you want to know why, not just get an apology.
OpenHands (like other modern code assist solutions) takes a different approach. Everything is transparent. The Event Stream logs every action the agent takes and every observation it receives. You can watch it run shell commands, edit files, and search through code in real time.
This matters because 30% of developers say they don't trust AI-generated code (per the DORA report). Hard to blame them. But when you can see exactly what the agent is doing, step by step, that trust equation changes. You're not blindly accepting output—you're supervising an autonomous process with full visibility, until you gain enough confidence to let the agent take over.
What Makes a “General Purpose” Code Agent?
A SQL-generating bot is useful for one thing. An agent that can write SQL, wrap it in a Python API, build a React frontend, and deploy the whole thing to Kubernetes, then debug production issues? That’s general purpose.
The difference comes down to four things that separate toys from production-ready tools:
1. Memory That Actually Works
LLMs are stateless. ChatGPT “forgets” your file structure the moment it scrolls out of the context window. Try refactoring a 50-file codebase when the agent can’t remember what it read five minutes ago.
A real agent needs persistent memory. Not just a bigger context window—actual tools to explore and navigate your codebase on-demand. OpenHands gives the LLM a developer’s toolkit: ripgrep for fast code search, AST-based analysis for understanding structure, and incremental file access with 100-line windows. Add an event log that lets it “replay” its own history, and you have something that can actually work with large codebases without pre-indexing everything.
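To make that concrete, here is a minimal sketch of what such a toolkit can look like, assuming ripgrep (`rg`) is installed on the system. The function names and the 100-line constant are illustrative, not the actual OpenHands tool interface.

```python
import subprocess

WINDOW = 100  # lines per view, mirroring the incremental file access described above

def search_code(pattern: str, path: str = ".") -> str:
    """Fast code search via ripgrep; returns matches as file:line:content."""
    result = subprocess.run(
        ["rg", "--line-number", pattern, path],
        capture_output=True, text=True,
    )
    return result.stdout or "(no matches)"

def view_file(path: str, start_line: int = 1) -> str:
    """Return a 100-line window of a file so huge files never flood the context."""
    with open(path) as f:
        lines = f.readlines()
    window = lines[start_line - 1 : start_line - 1 + WINDOW]
    return "".join(f"{start_line + i:>6}| {line}" for i, line in enumerate(window))

# Example usage the agent might request:
#   search_code("def authenticate", "src/")
#   view_file("src/auth.py", start_line=101)
```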
2. Execution, Not Just Suggestions
OpenHands can create files, run compilers, execute shell scripts—the actual work. But this is dangerous. Running arbitrary LLM-generated code on your machine is a security nightmare. So OpenHands runs everything in Docker containers. The agent gets a sandboxed workspace where it can do whatever it wants without nuking your host system.
3. Learning from Failure
Code never works the first time. A code generator dies on the first syntax error. A real agent reads the error output, figures out what went wrong, tries a fix, and runs it again.
This Edit-Run-Verify loop is how OpenHands works. Actions flow from the agent to the system, observations (logs, errors, exit codes) flow back. The agent uses that feedback to iterate. Just like you would.
4. Using Your Tools
No LLM knows about your company’s internal Jira workflow or feature flag database. A production agent needs to plug into arbitrary tools without rewriting its core code.
OpenHands uses the Model Context Protocol (MCP)—an open standard for tool discovery. Point it at an MCP server, and the agent can dynamically learn what tools are available and how to use them.
How OpenHands Compares
Here’s how OpenHands stacks up against regular autocomplete and chat assistants:
| Feature | Autocomplete (IntelliSense) | Chat (ChatGPT) | Agent (OpenHands) |
|---|---|---|---|
| Context | Current file only | Conversation history | Entire repo + runtime state |
| Execution | None | None (maybe sandbox) | Full shell in Docker |
| Agency | You drive everything | Responds to prompts | Multi-step autonomous plans |
| Tooling | Static analysis | Fixed plugins | Dynamic tool discovery (MCP) |
| Memory | None | Session-only | Event-sourced persistence |
OpenHands goes all-in on that right column. More complex, but actually useful for real work.
How OpenHands Is Built
OpenHands looks like a local app, but it’s actually a distributed system. The key architectural decision: split the reasoning (Agent) from the execution (Runtime), and mediate everything through an event stream. This lets you swap LLMs or runtimes without rewriting the whole system.
Event Sourcing: The Unexpected Choice
OpenHands doesn’t use a traditional database. It’s event-sourced.
Most apps store the current state: if an agent edits a file, you overwrite the record. OpenHands records every action as an immutable event. Want to know the current state? Replay all the events.
The EventStream is the central nervous system. It handles three types of data:
- Actions: Commands from the agent—`CmdRunAction`, `FileWriteAction`, `AgentDelegateAction`
- Observations: Results from the environment—stdout/stderr, file contents, web pages
- Trajectories: The full sequence of actions and observations, serialized to disk (JSON or Pickle)
Figure: OpenHands Event Stream architecture. An immutable event log enables agent-runtime separation and deterministic replay.

Why this matters: Deterministic Replay. LLMs are non-deterministic nightmares to debug. When an agent fails, you can replay the exact event sequence and see where it went wrong. No guessing, no "works on my machine."
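To illustrate the pattern with toy types (not the real OpenHands event classes): state is never stored directly, it is derived by folding over the log.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Event:
    """An immutable entry in the event log: either an action or an observation."""
    kind: str       # e.g. "FileWrite", "CmdRun", "CmdOutput"
    payload: dict

@dataclass
class EventStream:
    events: list[Event] = field(default_factory=list)

    def add(self, event: Event) -> None:
        self.events.append(event)   # append-only; nothing is ever overwritten

    def replay_files(self) -> dict[str, str]:
        """Derive current file state by replaying every FileWrite event in order."""
        files: dict[str, str] = {}
        for e in self.events:
            if e.kind == "FileWrite":
                files[e.payload["path"]] = e.payload["content"]
        return files

stream = EventStream()
stream.add(Event("FileWrite", {"path": "src/fix.py", "content": "print('v1')"}))
stream.add(Event("FileWrite", {"path": "src/fix.py", "content": "print('v2')"}))
assert stream.replay_files()["src/fix.py"] == "print('v2')"  # state = fold over events
```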
The codebase enforces this with a hard split: agenthub (the logic) and runtime (the execution) only talk through serialized events. No shortcuts, no shared state.
The EventStream assigns IDs to events and manages “subscriptions.” The frontend subscribes to get chat updates. The agent reads from it to know what happened last.
There’s been talk in the community about moving to synchronous ToolCall/ToolResult patterns to simplify the Python SDK. But the core idea stays the same: the source of truth is the event history, not some current state snapshot.
The Codebase Structure
OpenHands is organized as a modular monorepo:
- `openhands/agenthub/`: The brains. Different agent implementations (`CodeActAgent`, `BrowsingAgent`, etc.). Plug-and-play interface: take a State, return an Action.
- `openhands/runtime/`: The body. Spins up Docker containers, manages files, executes commands. An abstract `Runtime` base class with concrete implementations like `DockerRuntime` and `E2BRuntime`.
- `openhands/server/`: FastAPI backend. Handles WebSocket connections, orchestrates the `AgentController`, routes events.
- `openhands/frontend/`: React UI. Visualizes the Event Stream—chat interface, terminal emulator (xterm.js), Monaco editor.
- `containers/`: Dockerfiles for the sandbox environments. Version-controlled with the code.
The Main Loop
The AgentController runs an infinite loop:
1. Gather recent history from the Event Stream
2. Send it to the LLM (GPT-4, Claude, whatever)
3. Parse the LLM response into an Action (`CmdRunAction`, etc.)
4. Dispatch to the Runtime
5. Get back an Observation (stdout, exit code, etc.)
6. Add the Observation to the Event Stream
7. Go to step 1
Runs until the agent says it’s done (AgentFinishAction) or you kill it. The upcoming Python SDK will let you step through this loop manually, which should make debugging way easier.
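In rough Python, the loop looks something like this. The `llm`, `runtime`, `stream`, and `parse_action` parameters are placeholders standing in for the real components, not the actual OpenHands API.

```python
def run_controller(task, llm, runtime, stream, parse_action, max_steps: int = 50):
    """Sketch of the AgentController loop: history -> LLM -> Action -> Observation."""
    stream.add({"kind": "UserMessage", "content": task})
    for _ in range(max_steps):
        history = stream.events[-30:]              # 1. gather recent history
        response = llm.complete(history)           # 2. send to the LLM
        action = parse_action(response)            # 3. parse into an Action
        stream.add(action)
        if action["kind"] == "AgentFinish":        # AgentFinishAction: we're done
            return
        observation = runtime.execute(action)      # 4-5. dispatch, collect Observation
        stream.add(observation)                    # 6. append to the event stream
        # 7. implicit: loop back to step 1
```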
The Runtime: Sandboxing the Chaos
Letting an LLM run rm -rf / on your laptop is a bad idea. OpenHands solves this with Docker, but not in the obvious way.
How the Sandbox Actually Works
You can’t just run docker exec for every command. That creates a fresh shell each time, and state gets lost. If the agent runs export API_KEY=xyz, that needs to persist when it runs python script.py later.
OpenHands uses a client-server model across the Docker boundary:
- Host side: `RuntimeClient`, running on your machine
- Container side: `ActionExecutor`, a Python HTTP server injected into the container

When the agent wants to run `ls -la`:

1. The agent generates `CmdRunAction("ls -la")`
2. `RuntimeClient` serializes it and POSTs to `ActionExecutor` inside the container
3. `ActionExecutor` runs it in a persistent shell session (PTY) and captures stdout/stderr/exit code
4. The response goes back to `RuntimeClient`
5. The backend wraps it in `CmdOutputObservation` and pushes it to the event stream
Figure: OpenHands sandbox architecture. Multi-layered Docker isolation ensures secure code execution with strict boundaries.
The persistent shell is the key. Environment variables, working directory, shell history—it all persists across commands. The agent gets something that feels like an actual computer, not a stateless command executor.
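Here is a stripped-down illustration of a persistent shell session using only the standard library. The real ActionExecutor adds an HTTP layer, plugins, and proper PTY handling, but the core trick is the same: one long-lived bash process plus a sentinel to delimit each command's output.

```python
import subprocess
import uuid

class PersistentShell:
    """One long-lived bash process: env vars, cwd, and history survive across commands."""

    def __init__(self) -> None:
        self.proc = subprocess.Popen(
            ["/bin/bash"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT, text=True, bufsize=1,
        )

    def run(self, command: str) -> tuple[str, int]:
        """Run a command and return (output, exit_code), delimited by a sentinel."""
        marker = f"__DONE_{uuid.uuid4().hex}__"
        self.proc.stdin.write(f"{command}\necho {marker} $?\n")
        self.proc.stdin.flush()
        lines = []
        while True:
            line = self.proc.stdout.readline()
            if marker in line:
                prefix, status = line.split(marker)
                lines.append(prefix)
                return "".join(lines), int(status)
            lines.append(line)

shell = PersistentShell()
shell.run("export API_KEY=xyz && cd /tmp")
output, exit_code = shell.run("echo $API_KEY && pwd")
print(output, exit_code)   # "xyz\n/tmp\n", 0 -- state persisted across run() calls
```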
The Docker Socket Problem
Sometimes the agent needs to use Docker itself—like building a container for your app. OpenHands handles this by mounting the host’s Docker socket (/var/run/docker.sock) into the sandbox.
This is powerful but dangerous. Mounting the Docker socket gives the container root access to the host. It’s “Docker-out-of-Docker” (not true Docker-in-Docker), and it comes with trade-offs:
- Power: The agent can do anything Docker can do
- Complexity: Network routing gets weird, especially on macOS/Windows where Docker runs in a VM. `httpx.ConnectError` issues are common.
- Security: You're basically trusting the container with your host. OpenHands mitigates this by controlling the image, but it's still a calculated risk.
Other Runtime Options
Docker isn’t the only choice. OpenHands abstracts the runtime, so you can swap it out:
- Daytona: Remote, managed dev environments. Offloads compute to the cloud instead of burning your laptop’s battery.
- E2B: Firecracker-based VMs designed for AI code execution. Better isolation than Docker, faster startup.
You pick your runtime in config.toml. Same agent code, different execution environment. This is the kind of abstraction that separates production systems from hackathon demos.
CodeAct: Code As the Interface
Early AI agents used JSON tool calling for everything. Want to edit a file? Emit a JSON blob. Run a command? Another JSON blob. Brittle, verbose, and you had to define custom tools for every possible action.
Code Is the Tool
CodeAct flips this. Instead of 50 custom tools (list_files, create_file, search_web), just give the agent Python and Bash.
Need to count lines in all Python files? Write code:
```python
import glob

files = glob.glob("**/*.py", recursive=True)
total_lines = 0
for f in files:
    with open(f) as file:
        total_lines += len(file.readlines())
print(total_lines)
```
Or use Bash:
```bash
find . -name "*.py" | xargs wc -l
```
Why this works better:
- One language for everything: Logic, control flow, and tool execution all use Python/Bash.
- More expressive: Write loops, conditionals, and error handling in a single action. Try to read a file, catch `FileNotFoundError`, create it—all in one LLM turn (see the sketch after this list). Fewer round-trips = lower cost and latency.
- Free library access: The entire Python ecosystem (pandas, requests, numpy) works out of the box. No wrapper code needed.
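For example, the read-or-create pattern from the list above fits comfortably in one CodeAct action. The `config.json` file and its defaults are purely illustrative.

```python
# One CodeAct action: read a config file, create it with defaults if missing,
# then act on the result. Logic, error handling, and I/O in a single LLM turn.
import json

try:
    with open("config.json") as f:
        config = json.load(f)
except FileNotFoundError:
    config = {"debug": False, "retries": 3}
    with open("config.json", "w") as f:
        json.dump(config, f, indent=2)

print(f"retries = {config['retries']}")
```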
How It Works
The CodeActAgent uses a carefully crafted system prompt (system_prompt.j2):
- “You can execute Python code in ```python blocks”
- “You can execute Bash in ```bash blocks”
- “Verify your changes by running tests”
The backend parses the LLM’s markdown response. Code blocks get extracted and sent to the JupyterPlugin (for Python) or BashPlugin (for Bash) inside the container. The JupyterPlugin maintains an interactive IPython kernel, so variables persist across code blocks.
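A simplified version of that parsing step looks roughly like this. It's a regex sketch, not the actual OpenHands parser.

```python
import re

FENCE = "`" * 3  # build the fence marker so this example stays self-contained
CODE_BLOCK = re.compile(FENCE + r"(python|bash)\n(.*?)" + FENCE, re.DOTALL)

def extract_code_blocks(llm_response: str) -> list[tuple[str, str]]:
    """Return (language, code) pairs to route to the Jupyter or Bash plugin."""
    return [(lang, code.strip()) for lang, code in CODE_BLOCK.findall(llm_response)]

response = (
    "Let me check the tests first.\n"
    f"{FENCE}bash\npytest -x tests/\n{FENCE}\n"
    "Then patch the off-by-one error:\n"
    f"{FENCE}python\nprint('patching...')\n{FENCE}\n"
)

for lang, code in extract_code_blocks(response):
    print(lang, "->", code)
```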
Multiple Agents, One Task
One agent gets lost in a 10,000-file repo. Context window fills with noise, and it forgets what it’s doing.
OpenHands uses agent delegation:
- Manager Agent: High-level planner. Breaks “Refactor auth module” into sub-tasks.
- RepoStudyAgent: Explorer. Maps the codebase without modifying it.
- VerifierAgent: QA specialist. Writes tests, verifies fixes work.
- BrowsingAgent: Reads docs and StackOverflow via Playwright.
The main agent can delegate: “I need to know how to use the Stripe API. @BrowsingAgent, find the docs for creating a customer.” BrowsingAgent spins up, does the research, returns a summary. Main agent stays focused on the high-level task.
Tool Integration: MCP
The old problem: N agents × M tools = N×M custom integrations. Want your agent to use Jira, Slack, GitHub, and Linear? Write four separate integrations. For every agent.
OpenHands uses the Model Context Protocol (MCP), an open standard from Anthropic. Think of it as USB-C for AI tools.
How MCP Works
- MCP Server: Exposes tools (functions) and resources (data). A GitHub MCP server might expose `create_issue` and `active_pull_requests`.
- MCP Client (OpenHands): Connects via stdio or SSE. Asks: "What tools do you have?" Gets back JSON schemas. Injects them into the agent's system prompt.
OpenHands doesn’t know about GitHub or Slack. It just knows MCP. You can write a custom MCP server for your proprietary database, point OpenHands at it, and the agent can use it immediately.
Auth That Doesn’t Suck
What if the agent tries to read your private Slack DMs? OpenHands handles this with OAuth via FastMCP.
When the agent tries to use an authenticated tool, MCP pauses execution and shows you an OAuth flow. You log in, consent, and the token gets stored for that session. The agent acts with your permissions, not as some omniscient god.
Configuration: From TOML Hell to Python Objects
OpenHands used to require a config.toml file with a million environment variables: SANDBOX_IMAGE, WORKSPACE_MOUNT_PATH, LLM_API_KEY, debug flags, etc. Global state everywhere. Good luck running two agents with different configs.
The new Python SDK fixes this:
```python
from openhands.sdk import CodeActAgent, DockerRuntime

agent = CodeActAgent(
    llm_config={"model": "claude-3-5-sonnet"},
    system_prompt="You are a senior python engineer."
)
runtime = DockerRuntime(image="my-custom-image")

await agent.run(task="Fix the bug in main.py", runtime=runtime)
```
Code, not config files. Agents are objects. You can run them in threads, pause them, inspect state, resume. Synchronous by default, which makes debugging way easier.
Evaluation: SWE-bench, the Reality Check
Demo videos are easy. Proving your agent actually works is hard. OpenHands uses SWE-bench—real GitHub issues from Django, scikit-learn, Flask, etc.
How SWE-bench Works
- Start with the codebase before the bug fix
- Give the agent the issue description
- Let it explore, reproduce the bug, write a patch
- Apply the patch, run the test suite
- Pass = new test passes + no regressions
This is brutal. The agent can’t just fix the obvious bug. It has to not break anything else.
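The pass/fail check at the end reduces to something like this. It's a schematic, not the official SWE-bench harness, and the paths and test command are placeholders.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Apply the agent's patch, run the project's test suite, report pass/fail."""
    applied = subprocess.run(
        ["git", "apply", patch_file], cwd=repo_dir, capture_output=True, text=True
    )
    if applied.returncode != 0:
        return False                      # the patch doesn't even apply cleanly
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True, text=True)
    return tests.returncode == 0          # fail-to-pass tests pass AND no regressions

# e.g. evaluate_patch("django/", "agent_fix.patch", ["pytest", "tests/model_fields/"])
```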
The Infrastructure Problem
SWE-bench is expensive to run. Gigabytes of Docker images, thousands of containers. Epoch AI compressed the images from ~680 GB to ~67 GB by deduplicating layers. OpenHands runs evaluations in parallel on cloud infrastructure, turning days into minutes.
The Cost Problem
Running the full SWE-bench suite costs hundreds of dollars in API credits. The agent reads thousands of lines of code, generates verbose responses for every issue. SWE-bench Lite (300 issues) and SWE-bench Verified (human-verified subset) exist for people who don’t have unlimited budgets.
Performance: Where Things Stand
The Numbers
OpenHands with Claude 3.5 Sonnet hits around 53% on SWE-bench Verified.
But here’s the interesting part: Inference Time Scaling. Run the agent 5 times on the same problem, use a critic model or voting to pick the best patch, and you can hit 66%. The bottleneck isn’t intelligence, it’s randomness.
Why Agents Fail
Even at 53-66%, agents fail a lot. The failure modes are instructive.
Infinite Loops
Agent tries a fix. Test fails. Agent tries the exact same fix again. Repeat until you run out of tokens.
This happens because of context truncation. When the context window fills up, OpenHands truncates old history. If the agent’s memory of “I already tried this” gets truncated, it’s stuck in Groundhog Day.
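A crude guard against this failure mode is to watch for repeated actions. This stuck-detector is illustrative; OpenHands' actual mitigation (trajectory analysis) is more involved, but the idea is the same.

```python
from collections import Counter

def is_stuck(recent_actions: list[str], threshold: int = 3) -> bool:
    """Flag the agent as stuck when the same action keeps reappearing."""
    if not recent_actions:
        return False
    _, count = Counter(recent_actions[-10:]).most_common(1)[0]
    return count >= threshold

history = [
    'CmdRun("pytest tests/")', 'FileEdit("src/fix.py")',
    'CmdRun("pytest tests/")', 'FileEdit("src/fix.py")',
    'CmdRun("pytest tests/")',
]
print(is_stuck(history))  # True: the same edit/test pair keeps repeating
```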
Context Pollution
Agent runs find / -name "*.py" and dumps 10,000 lines of output into its context. Or cats a massive log file. Context window fills with noise. LLM starts hallucinating file paths, forgets what it was supposed to do.
Solution: Active context management. Summarize old events, delete large observations, keep the “working memory” clean.
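A toy version of that idea: keep recent events verbatim and truncate old, oversized observations. The real system can additionally summarize old events with the LLM, as described above.

```python
MAX_OBS_CHARS = 2_000   # anything longer is cut down before it re-enters the prompt

def condense(events: list[dict], keep_recent: int = 20) -> list[dict]:
    """Keep recent events verbatim; shrink old, oversized observations to a stub."""
    cutoff = len(events) - keep_recent
    condensed = []
    for i, event in enumerate(events):
        content = event.get("content", "")
        if i < cutoff and len(content) > MAX_OBS_CHARS:
            stub = content[:500] + f"\n[... {len(content) - 500} chars truncated ...]"
            event = {**event, "content": stub}
        condensed.append(event)
    return condensed
```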
Lazy Coding
Agent writes # ... rest of code ... instead of the full file. Saves tokens, breaks the file when written to disk. OpenHands needs linting to catch this before it causes syntax errors.
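A lint pass for this can be as simple as a placeholder regex plus a parse check. `looks_lazy` is an illustrative helper, not an OpenHands function.

```python
import ast
import re

PLACEHOLDER = re.compile(r"#\s*\.\.\..*(rest|omitted|unchanged)", re.IGNORECASE)

def looks_lazy(source: str) -> str | None:
    """Return a reason if generated Python looks incomplete or won't even parse."""
    if PLACEHOLDER.search(source):
        return "placeholder comment found (e.g. '# ... rest of code ...')"
    try:
        ast.parse(source)
    except SyntaxError as err:
        return f"syntax error at line {err.lineno}"
    return None

print(looks_lazy("def handler(event):\n    # ... rest of the code ...\n"))
```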
Failure Mode Summary
| Failure Mode | Cause | How OpenHands Mitigates |
|---|---|---|
| Infinite Loop | Context truncation | Trajectory analysis, event summarization |
| Hallucination | Context overflow | Tool-based code search, event condensation |
| Regression | Fixing one bug, breaking others | VerifierAgent runs full test suite |
| Timeout | Docker/network issues | Persistent sessions, cloud runtimes |
Conclusion: The New Software Engineering Workflow
Code assist isn’t replacing developers, it’s certainly changing how we work. The architecture behind systems like OpenHands reveals the shift: event sourcing for debuggability, Docker sandboxing for safety, CodeAct for expressiveness, MCP for extensibility. These are building blocks for a new kind of development workflow.
What makes modern tools like OpenHands and Claude Code particularly powerful is the convergence of capabilities that OpenHands pioneered:
- Extended thinking: Models that can reason through complex refactoring before touching code
- Prompt caching: Reusing codebase context across sessions without re-indexing
- Tool integration: MCP servers that let agents interact with your actual development environment—Jira, databases, CI/CD pipelines
- Computer use: Agents that can navigate IDEs, run terminal commands, and interact with your full development stack
For bug fixes, boilerplate generation, and mechanical refactoring, having an autonomous agent that executes code, verifies its work, and iterates on failures isn't a demo anymore. It's production-ready infrastructure that's reshaping how engineering teams operate. The question isn't whether to adopt these tools, but how to integrate them into your workflow before your competitors do.
References
Wang, X., et al. (2024). The OpenHands Software Agent SDK: A Composable Framework for Building AI Agents. arXiv preprint arXiv:2511.03690.
- The official technical paper describing the OpenHands architecture, event-sourcing model, and SDK design.
Wang, X., et al. (2024). Executable Code Actions Elicit Better LLM Agents. arXiv preprint arXiv:2402.01030v4.
- Introduces the CodeAct framework that uses code as the universal action interface instead of JSON tool calling.
OpenHands Documentation. Runtime Architecture.
- Official documentation explaining the sandbox architecture, Docker runtime, and client-server model.
Anthropic. Model Context Protocol (MCP).
- Official specification for the Model Context Protocol used for dynamic tool discovery and integration.
Jimenez, C., et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- The benchmark used to evaluate code assist agents on real-world software engineering tasks.
Yang, X., et al. (2024). OPENHANDS: An Open Platform for AI Software Developers. OpenReview.
- Comprehensive overview of the OpenHands platform, agent capabilities, and design philosophy.