Introduction: The Agentic Shift in Software Engineering
Software engineering tools have been getting closer to translating what humans want into what machines do. We went from assembly to C, from malloc/free to garbage collection, from Vim to IDEs with autocomplete. But something changed in the last two years. We're now developing more than just "autocomplete" solutions.
Code Assist tools suggest the next token based on what’s in front of your cursor. General Purpose Code Assist Agents can reason about your entire codebase, plan multi-step changes, execute commands, run tests, and fix their own mistakes. The difference is: one is a fancy text predictor, the other is something that actually does engineering work.
OpenHands (formerly OpenDevin) is one of the most interesting open-source projects in this space. It’s a runtime that takes probabilistic LLM outputs and turns them into deterministic actions—compiling code, running tests, managing Docker containers. This post digs into how OpenHands works: its architecture, the CodeAct framework it uses, how it sandboxes execution safely, and what the benchmarks tell us about where this technology actually stands.
AI as an Amplifier: Why This Matters
Here’s something weird from the 2025 DORA report: AI adoption is basically universal now, but productivity gains are all over the place. Some teams are crushing it, others are drowning in AI-generated technical debt. What’s the difference?
AI acts as an amplifier. If your team already has good platform engineering and loosely coupled architectures, AI makes you faster. If you’re stuck with a tightly coupled monolith and manual deployments, AI will help you write bad code faster.
This matters for what makes a good code assist agent. Just dumping code into a buffer isn’t enough. A useful agent needs to:
- Navigate existing codebases without breaking things
- Verify its own changes against test suites
- Learn from failures and try different approaches
- Understand the context of your organisation and infrastructure, and adapt accordingly
Think of it as a high-performing junior engineer who can write code quickly but needs to check their work, not a hyper-intelligent autocomplete. OpenHands tries to be the former by treating the agent as part of the actual development workflow, not just a chatbot that spits out code.
Why Open Source Matters Here
The first wave of coding agents (like Devin) arrived as black boxes. Impressive demos, but good luck getting your security team to approve giving them write access to your production codebase. When an agent deletes a config file, you want to know why, not just get an apology.
OpenHands (like other modern code assist solutions) takes a different approach. Everything is transparent. The Event Stream logs every action the agent takes and every observation it receives. You can watch it run shell commands, edit files, and search through code in real time.
This matters because 30% of developers say they don't trust AI-generated code (per the DORA report). Hard to blame them. But when you can see exactly what the agent is doing, step by step, that trust equation changes. You're not blindly accepting output—you're supervising an autonomous process with full visibility, until you gain enough confidence to let the agent take over.
What Makes a “General Purpose” Code Agent?
A SQL-generating bot is useful for one thing. An agent that can write SQL, wrap it in a Python API, build a React frontend, and deploy the whole thing to Kubernetes, then debug production issues? That’s general purpose.
The difference comes down to four things that separate toys from production-ready tools:
1. Memory That Actually Works
LLMs are stateless. ChatGPT “forgets” your file structure the moment it scrolls out of the context window. Try refactoring a 50-file codebase when the agent can’t remember what it read five minutes ago.
A real agent needs persistent memory. Not just a bigger context window—actual tools to explore and navigate your codebase on-demand. OpenHands gives the LLM a developer’s toolkit: ripgrep for fast code search, AST-based analysis for understanding structure, and incremental file access with 100-line windows. Add an event log that lets it “replay” its own history, and you have something that can actually work with large codebases without pre-indexing everything.
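To make that concrete, here is a minimal sketch of what such a toolkit can look like, assuming ripgrep (`rg`) is installed on the system. The function names and the 100-line constant are illustrative, not the actual OpenHands tool interface.

```python
import subprocess

WINDOW = 100  # lines per view, mirroring the incremental file access described above

def search_code(pattern: str, path: str = ".") -> str:
    """Fast code search via ripgrep; returns matches as file:line:content."""
    result = subprocess.run(
        ["rg", "--line-number", pattern, path],
        capture_output=True, text=True,
    )
    return result.stdout or "(no matches)"

def view_file(path: str, start_line: int = 1) -> str:
    """Return a 100-line window of a file so huge files never flood the context."""
    with open(path) as f:
        lines = f.readlines()
    window = lines[start_line - 1 : start_line - 1 + WINDOW]
    return "".join(f"{start_line + i:>6}| {line}" for i, line in enumerate(window))

# Example usage the agent might request:
#   search_code("def authenticate", "src/")
#   view_file("src/auth.py", start_line=101)
```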
2. Execution, Not Just Suggestions
OpenHands can create files, run compilers, execute shell scripts—the actual work. But this is dangerous. Running arbitrary LLM-generated code on your machine is a security nightmare. So OpenHands runs everything in Docker containers. The agent gets a sandboxed workspace where it can do whatever it wants without nuking your host system.
3. Learning from Failure
Code never works the first time. A code generator dies on the first syntax error. A real agent reads the error output, figures out what went wrong, tries a fix, and runs it again.
This Edit-Run-Verify loop is how OpenHands works. Actions flow from the agent to the system, observations (logs, errors, exit codes) flow back. The agent uses that feedback to iterate. Just like you would.
4. Using Your Tools
No LLM knows about your company’s internal Jira workflow or feature flag database. A production agent needs to plug into arbitrary tools without rewriting its core code.
OpenHands uses the Model Context Protocol (MCP)—an open standard for tool discovery. Point it at an MCP server, and the agent can dynamically learn what tools are available and how to use them.
How OpenHands Compares
Here’s how OpenHands stacks up against regular autocomplete and chat assistants:
| Feature | Autocomplete (IntelliSense) | Chat (ChatGPT) | Agent (OpenHands) |
|---|---|---|---|
| Context | Current file only | Conversation history | Entire repo + runtime state |
| Execution | None | None (maybe sandbox) | Full shell in Docker |
| Agency | You drive everything | Responds to prompts | Multi-step autonomous plans |
| Tooling | Static analysis | Fixed plugins | Dynamic tool discovery (MCP) |
| Memory | None | Session-only | Event-sourced persistence |
OpenHands goes all-in on that right column. More complex, but actually useful for real work.
How OpenHands Is Built
OpenHands looks like a local app, but it’s actually a distributed system. The key architectural decision: split the reasoning (Agent) from the execution (Runtime), and mediate everything through an event stream. This lets you swap LLMs or runtimes without rewriting the whole system.
Event Sourcing: The Unexpected Choice
OpenHands doesn’t use a traditional database. It’s event-sourced.
Most apps store the current state: if an agent edits a file, you overwrite the record. OpenHands records every action as an immutable event. Want to know the current state? Replay all the events.
The EventStream is the central nervous system. It handles three types of data:
- Actions: Commands from the agent—`CmdRunAction`, `FileWriteAction`, `AgentDelegateAction`
- Observations: Results from the environment—stdout/stderr, file contents, web pages
- Trajectories: The full sequence of actions and observations, serialized to disk (JSON or Pickle)
Figure: OpenHands Event Stream architecture. An immutable event log enables agent-runtime separation and deterministic replay.

Why this matters: Deterministic Replay. LLMs are non-deterministic nightmares to debug. When an agent fails, you can replay the exact event sequence and see where it went wrong. No guessing, no "works on my machine."
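To illustrate the pattern with toy types (not the real OpenHands event classes): state is never stored directly, it is derived by folding over the log.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Event:
    """An immutable entry in the event log: either an action or an observation."""
    kind: str       # e.g. "FileWrite", "CmdRun", "CmdOutput"
    payload: dict

@dataclass
class EventStream:
    events: list[Event] = field(default_factory=list)

    def add(self, event: Event) -> None:
        self.events.append(event)   # append-only; nothing is ever overwritten

    def replay_files(self) -> dict[str, str]:
        """Derive current file state by replaying every FileWrite event in order."""
        files: dict[str, str] = {}
        for e in self.events:
            if e.kind == "FileWrite":
                files[e.payload["path"]] = e.payload["content"]
        return files

stream = EventStream()
stream.add(Event("FileWrite", {"path": "src/fix.py", "content": "print('v1')"}))
stream.add(Event("FileWrite", {"path": "src/fix.py", "content": "print('v2')"}))
assert stream.replay_files()["src/fix.py"] == "print('v2')"  # state = fold over events
```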
The codebase enforces this with a hard split: agenthub (the logic) and runtime (the execution) only talk through serialized events. No shortcuts, no shared state.
The EventStream assigns IDs to events and manages “subscriptions.” The frontend subscribes to get chat updates. The agent reads from it to know what happened last.
There’s been talk in the community about moving to synchronous ToolCall/ToolResult patterns to simplify the Python SDK. But the core idea stays the same: the source of truth is the event history, not some current state snapshot.
The Codebase Structure
OpenHands is organized as a modular monorepo:
- `openhands/agenthub/`: The brains. Different agent implementations (`CodeActAgent`, `BrowsingAgent`, etc.). Plug-and-play interface: take a State, return an Action.
- `openhands/runtime/`: The body. Spins up Docker containers, manages files, executes commands. An abstract `Runtime` base class with concrete implementations like `DockerRuntime` and `E2BRuntime`.
- `openhands/server/`: FastAPI backend. Handles WebSocket connections, orchestrates the `AgentController`, routes events.
- `openhands/frontend/`: React UI. Visualizes the Event Stream—chat interface, terminal emulator (xterm.js), Monaco editor.
- `containers/`: Dockerfiles for the sandbox environments. Version-controlled with the code.
The Main Loop
The AgentController runs an infinite loop:
1. Gather recent history from the Event Stream
2. Send it to the LLM (GPT-4, Claude, whatever)
3. Parse the LLM response into an Action (`CmdRunAction`, etc.)
4. Dispatch to the Runtime
5. Get back an Observation (stdout, exit code, etc.)
6. Add the Observation to the Event Stream
7. Go to step 1
Runs until the agent says it’s done (AgentFinishAction) or you kill it. The upcoming Python SDK will let you step through this loop manually, which should make debugging way easier.
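In rough Python, the loop looks something like this. The `llm`, `runtime`, `stream`, and `parse_action` parameters are placeholders standing in for the real components, not the actual OpenHands API.

```python
def run_controller(task, llm, runtime, stream, parse_action, max_steps: int = 50):
    """Sketch of the AgentController loop: history -> LLM -> Action -> Observation."""
    stream.add({"kind": "UserMessage", "content": task})
    for _ in range(max_steps):
        history = stream.events[-30:]              # 1. gather recent history
        response = llm.complete(history)           # 2. send to the LLM
        action = parse_action(response)            # 3. parse into an Action
        stream.add(action)
        if action["kind"] == "AgentFinish":        # AgentFinishAction: we're done
            return
        observation = runtime.execute(action)      # 4-5. dispatch, collect Observation
        stream.add(observation)                    # 6. append to the event stream
        # 7. implicit: loop back to step 1
```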
The Runtime: Sandboxing the Chaos
Letting an LLM run rm -rf / on your laptop is a bad idea. OpenHands solves this with Docker, but not in the obvious way.
How the Sandbox Actually Works
You can’t just run docker exec for every command. That creates a fresh shell each time, and state gets lost. If the agent runs export API_KEY=xyz, that needs to persist when it runs python script.py later.
OpenHands uses a client-server model across the Docker boundary:
- Host side: `RuntimeClient`, running on your machine
- Container side: `ActionExecutor`, a Python HTTP server injected into the container

When the agent wants to run `ls -la`:

1. The agent generates `CmdRunAction("ls -la")`
2. `RuntimeClient` serializes it and POSTs to `ActionExecutor` inside the container
3. `ActionExecutor` runs it in a persistent shell session (PTY) and captures stdout/stderr/exit code
4. The response goes back to `RuntimeClient`
5. The backend wraps it in `CmdOutputObservation` and pushes it to the event stream
Figure: OpenHands sandbox architecture. Multi-layered Docker isolation ensures secure code execution with strict boundaries.
The persistent shell is the key. Environment variables, working directory, shell history—it all persists across commands. The agent gets something that feels like an actual computer, not a stateless command executor.
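Here is a stripped-down illustration of a persistent shell session using only the standard library. The real ActionExecutor adds an HTTP layer, plugins, and proper PTY handling, but the core trick is the same: one long-lived bash process plus a sentinel to delimit each command's output.

```python
import subprocess
import uuid

class PersistentShell:
    """One long-lived bash process: env vars, cwd, and history survive across commands."""

    def __init__(self) -> None:
        self.proc = subprocess.Popen(
            ["/bin/bash"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT, text=True, bufsize=1,
        )

    def run(self, command: str) -> tuple[str, int]:
        """Run a command and return (output, exit_code), delimited by a sentinel."""
        marker = f"__DONE_{uuid.uuid4().hex}__"
        self.proc.stdin.write(f"{command}\necho {marker} $?\n")
        self.proc.stdin.flush()
        lines = []
        while True:
            line = self.proc.stdout.readline()
            if marker in line:
                prefix, status = line.split(marker)
                lines.append(prefix)
                return "".join(lines), int(status)
            lines.append(line)

shell = PersistentShell()
shell.run("export API_KEY=xyz && cd /tmp")
output, exit_code = shell.run("echo $API_KEY && pwd")
print(output, exit_code)   # "xyz\n/tmp\n", 0 -- state persisted across run() calls
```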
The Docker Socket Problem
Sometimes the agent needs to use Docker itself—like building a container for your app. OpenHands handles this by mounting the host’s Docker socket (/var/run/docker.sock) into the sandbox.
This is powerful but dangerous. Mounting the Docker socket gives the container root access to the host. It’s “Docker-out-of-Docker” (not true Docker-in-Docker), and it comes with trade-offs:
- Power: The agent can do anything Docker can do
- Complexity: Network routing gets weird, especially on macOS/Windows where Docker runs in a VM. `httpx.ConnectError` issues are common.
- Security: You're basically trusting the container with your host. OpenHands mitigates this by controlling the image, but it's still a calculated risk.
Other Runtime Options
Docker isn’t the only choice. OpenHands abstracts the runtime, so you can swap it out:
- Daytona: Remote, managed dev environments. Offloads compute to the cloud instead of burning your laptop’s battery.
- E2B: Firecracker-based VMs designed for AI code execution. Better isolation than Docker, faster startup.
You pick your runtime in config.toml. Same agent code, different execution environment. This is the kind of abstraction that separates production systems from hackathon demos.
CodeAct: Code As the Interface
Early AI agents used JSON tool calling for everything. Want to edit a file? Emit a JSON blob. Run a command? Another JSON blob. Brittle, verbose, and you had to define custom tools for every possible action.
Code Is the Tool
CodeAct flips this. Instead of 50 custom tools (list_files, create_file, search_web), just give the agent Python and Bash.
Need to count lines in all Python files? Write code:
```python
import glob

files = glob.glob("**/*.py", recursive=True)
total_lines = 0
for f in files:
    with open(f) as file:
        total_lines += len(file.readlines())
print(total_lines)
```
Or use Bash:
```bash
find . -name "*.py" | xargs wc -l
```
Why this works better:
- One language for everything: Logic, control flow, and tool execution all use Python/Bash.
- More expressive: Write loops, conditionals, and error handling in a single action. Try to read a file, catch `FileNotFoundError`, create it—all in one LLM turn (see the sketch after this list). Fewer round-trips = lower cost and latency.
- Free library access: The entire Python ecosystem (pandas, requests, numpy) works out of the box. No wrapper code needed.
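For example, the read-or-create pattern from the list above fits comfortably in one CodeAct action. The `config.json` file and its defaults are purely illustrative.

```python
# One CodeAct action: read a config file, create it with defaults if missing,
# then act on the result. Logic, error handling, and I/O in a single LLM turn.
import json

try:
    with open("config.json") as f:
        config = json.load(f)
except FileNotFoundError:
    config = {"debug": False, "retries": 3}
    with open("config.json", "w") as f:
        json.dump(config, f, indent=2)

print(f"retries = {config['retries']}")
```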
How It Works
The CodeActAgent uses a carefully crafted system prompt (system_prompt.j2):
- “You can execute Python code in ```python blocks”
- “You can execute Bash in ```bash blocks”
- “Verify your changes by running tests”
The backend parses the LLM’s markdown response. Code blocks get extracted and sent to the JupyterPlugin (for Python) or BashPlugin (for Bash) inside the container. The JupyterPlugin maintains an interactive IPython kernel, so variables persist across code blocks.
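A simplified version of that parsing step looks roughly like this. It's a regex sketch, not the actual OpenHands parser.

```python
import re

FENCE = "`" * 3  # build the fence marker so this example stays self-contained
CODE_BLOCK = re.compile(FENCE + r"(python|bash)\n(.*?)" + FENCE, re.DOTALL)

def extract_code_blocks(llm_response: str) -> list[tuple[str, str]]:
    """Return (language, code) pairs to route to the Jupyter or Bash plugin."""
    return [(lang, code.strip()) for lang, code in CODE_BLOCK.findall(llm_response)]

response = (
    "Let me check the tests first.\n"
    f"{FENCE}bash\npytest -x tests/\n{FENCE}\n"
    "Then patch the off-by-one error:\n"
    f"{FENCE}python\nprint('patching...')\n{FENCE}\n"
)

for lang, code in extract_code_blocks(response):
    print(lang, "->", code)
```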
Multiple Agents, One Task
One agent gets lost in a 10,000-file repo. Context window fills with noise, and it forgets what it’s doing.
OpenHands uses agent delegation:
- Manager Agent: High-level planner. Breaks “Refactor auth module” into sub-tasks.
- RepoStudyAgent: Explorer. Maps the codebase without modifying it.
- VerifierAgent: QA specialist. Writes tests, verifies fixes work.
- BrowsingAgent: Reads docs and StackOverflow via Playwright.
The main agent can delegate: “I need to know how to use the Stripe API. @BrowsingAgent, find the docs for creating a customer.” BrowsingAgent spins up, does the research, returns a summary. Main agent stays focused on the high-level task.
Tool Integration: MCP
The old problem: N agents × M tools = N×M custom integrations. Want your agent to use Jira, Slack, GitHub, and Linear? Write four separate integrations. For every agent.
OpenHands uses the Model Context Protocol (MCP), an open standard from Anthropic. Think of it as USB-C for AI tools.
How MCP Works
- MCP Server: Exposes tools (functions) and resources (data). A GitHub MCP server might expose `create_issue` and `active_pull_requests`.
- MCP Client (OpenHands): Connects via stdio or SSE. Asks: "What tools do you have?" Gets back JSON schemas. Injects them into the agent's system prompt.
OpenHands doesn’t know about GitHub or Slack. It just knows MCP. You can write a custom MCP server for your proprietary database, point OpenHands at it, and the agent can use it immediately.
Auth That Doesn’t Suck
What if the agent tries to read your private Slack DMs? OpenHands handles this with OAuth via FastMCP.
When the agent tries to use an authenticated tool, MCP pauses execution and shows you an OAuth flow. You log in, consent, and the token gets stored for that session. The agent acts with your permissions, not as some omniscient god.
Configuration: From TOML Hell to Python Objects
OpenHands used to require a config.toml file with a million environment variables: SANDBOX_IMAGE, WORKSPACE_MOUNT_PATH, LLM_API_KEY, debug flags, etc. Global state everywhere. Good luck running two agents with different configs.
The new Python SDK fixes this:
```python
from openhands.sdk import CodeActAgent, DockerRuntime

agent = CodeActAgent(
    llm_config={"model": "claude-3-5-sonnet"},
    system_prompt="You are a senior python engineer."
)
runtime = DockerRuntime(image="my-custom-image")

await agent.run(task="Fix the bug in main.py", runtime=runtime)
```
Code, not config files. Agents are objects. You can run them in threads, pause them, inspect state, resume. Synchronous by default, which makes debugging way easier.
Evaluation: SWE-bench, the Reality Check
Demo videos are easy. Proving your agent actually works is hard. OpenHands uses SWE-bench—real GitHub issues from Django, scikit-learn, Flask, etc.
How SWE-bench Works
- Start with the codebase before the bug fix
- Give the agent the issue description
- Let it explore, reproduce the bug, write a patch
- Apply the patch, run the test suite
- Pass = new test passes + no regressions
This is brutal. The agent can’t just fix the obvious bug. It has to not break anything else.
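The pass/fail check at the end reduces to something like this. It's a schematic, not the official SWE-bench harness, and the paths and test command are placeholders.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Apply the agent's patch, run the project's test suite, report pass/fail."""
    applied = subprocess.run(
        ["git", "apply", patch_file], cwd=repo_dir, capture_output=True, text=True
    )
    if applied.returncode != 0:
        return False                      # the patch doesn't even apply cleanly
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True, text=True)
    return tests.returncode == 0          # fail-to-pass tests pass AND no regressions

# e.g. evaluate_patch("django/", "agent_fix.patch", ["pytest", "tests/model_fields/"])
```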
The Infrastructure Problem
SWE-bench is expensive to run. Gigabytes of Docker images, thousands of containers. Epoch AI compressed the images from ~680 GB to ~67 GB by deduplicating layers. OpenHands runs evaluations in parallel on cloud infrastructure, turning days into minutes.
The Cost Problem
Running the full SWE-bench suite costs hundreds of dollars in API credits. The agent reads thousands of lines of code, generates verbose responses for every issue. SWE-bench Lite (300 issues) and SWE-bench Verified (human-verified subset) exist for people who don’t have unlimited budgets.
Performance: Where Things Stand
The Numbers
OpenHands with Claude 3.5 Sonnet hits around 53% on SWE-bench Verified.
But here’s the interesting part: Inference Time Scaling. Run the agent 5 times on the same problem, use a critic model or voting to pick the best patch, and you can hit 66%. The bottleneck isn’t intelligence, it’s randomness.
Why Agents Fail
Even at 53-66%, agents fail a lot. The failure modes are instructive.
Infinite Loops
Agent tries a fix. Test fails. Agent tries the exact same fix again. Repeat until you run out of tokens.
This happens because of context truncation. When the context window fills up, OpenHands truncates old history. If the agent’s memory of “I already tried this” gets truncated, it’s stuck in Groundhog Day.
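A crude guard against this failure mode is to watch for repeated actions. This stuck-detector is illustrative; OpenHands' actual mitigation (trajectory analysis) is more involved, but the idea is the same.

```python
from collections import Counter

def is_stuck(recent_actions: list[str], threshold: int = 3) -> bool:
    """Flag the agent as stuck when the same action keeps reappearing."""
    if not recent_actions:
        return False
    _, count = Counter(recent_actions[-10:]).most_common(1)[0]
    return count >= threshold

history = [
    'CmdRun("pytest tests/")', 'FileEdit("src/fix.py")',
    'CmdRun("pytest tests/")', 'FileEdit("src/fix.py")',
    'CmdRun("pytest tests/")',
]
print(is_stuck(history))  # True: the same edit/test pair keeps repeating
```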
Context Pollution
Agent runs find / -name "*.py" and dumps 10,000 lines of output into its context. Or cats a massive log file. Context window fills with noise. LLM starts hallucinating file paths, forgets what it was supposed to do.
Solution: Active context management. Summarize old events, delete large observations, keep the “working memory” clean.
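A toy version of that idea: keep recent events verbatim and truncate old, oversized observations. The real system can additionally summarize old events with the LLM, as described above.

```python
MAX_OBS_CHARS = 2_000   # anything longer is cut down before it re-enters the prompt

def condense(events: list[dict], keep_recent: int = 20) -> list[dict]:
    """Keep recent events verbatim; shrink old, oversized observations to a stub."""
    cutoff = len(events) - keep_recent
    condensed = []
    for i, event in enumerate(events):
        content = event.get("content", "")
        if i < cutoff and len(content) > MAX_OBS_CHARS:
            stub = content[:500] + f"\n[... {len(content) - 500} chars truncated ...]"
            event = {**event, "content": stub}
        condensed.append(event)
    return condensed
```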
Lazy Coding
Agent writes # ... rest of code ... instead of the full file. Saves tokens, breaks the file when written to disk. OpenHands needs linting to catch this before it causes syntax errors.
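A lint pass for this can be as simple as a placeholder regex plus a parse check. `looks_lazy` is an illustrative helper, not an OpenHands function.

```python
import ast
import re

PLACEHOLDER = re.compile(r"#\s*\.\.\..*(rest|omitted|unchanged)", re.IGNORECASE)

def looks_lazy(source: str) -> str | None:
    """Return a reason if generated Python looks incomplete or won't even parse."""
    if PLACEHOLDER.search(source):
        return "placeholder comment found (e.g. '# ... rest of code ...')"
    try:
        ast.parse(source)
    except SyntaxError as err:
        return f"syntax error at line {err.lineno}"
    return None

print(looks_lazy("def handler(event):\n    # ... rest of the code ...\n"))
```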
Failure Mode Summary
| Failure Mode | Cause | How OpenHands Mitigates |
|---|---|---|
| Infinite Loop | Context truncation | Trajectory analysis, event summarization |
| Hallucination | Context overflow | Tool-based code search, event condensation |
| Regression | Fixing one bug, breaking others | VerifierAgent runs full test suite |
| Timeout | Docker/network issues | Persistent sessions, cloud runtimes |
Conclusion: The New Software Engineering Workflow
Code assist isn’t replacing developers, it’s certainly changing how we work. The architecture behind systems like OpenHands reveals the shift: event sourcing for debuggability, Docker sandboxing for safety, CodeAct for expressiveness, MCP for extensibility. These are building blocks for a new kind of development workflow.
What makes modern tools like OpenHands and Claude Code particularly powerful is the convergence of capabilities that OpenHands pioneered:
- Extended thinking: Models that can reason through complex refactoring before touching code
- Prompt caching: Reusing codebase context across sessions without re-indexing
- Tool integration: MCP servers that let agents interact with your actual development environment—Jira, databases, CI/CD pipelines
- Computer use: Agents that can navigate IDEs, run terminal commands, and interact with your full development stack
For bug fixes, boilerplate generation, and mechanical refactoring, having an autonomous agent that executes code, verifies its work, and iterates on failures isn't a demo anymore. It's production-ready infrastructure that's reshaping how engineering teams operate. The question isn't whether to adopt these tools, but how to integrate them into your workflow before your competitors do.
References
Wang, X., et al. (2024). The OpenHands Software Agent SDK: A Composable Framework for Building AI Agents. arXiv preprint arXiv:2511.03690.
- The official technical paper describing the OpenHands architecture, event-sourcing model, and SDK design.
Wang, X., et al. (2024). Executable Code Actions Elicit Better LLM Agents. arXiv preprint arXiv:2402.01030v4.
- Introduces the CodeAct framework that uses code as the universal action interface instead of JSON tool calling.
OpenHands Documentation. Runtime Architecture.
- Official documentation explaining the sandbox architecture, Docker runtime, and client-server model.
Anthropic. Model Context Protocol (MCP).
- Official specification for the Model Context Protocol used for dynamic tool discovery and integration.
Jimenez, C., et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- The benchmark used to evaluate code assist agents on real-world software engineering tasks.
Yang, X., et al. (2024). OPENHANDS: An Open Platform for AI Software Developers. OpenReview.
- Comprehensive overview of the OpenHands platform, agent capabilities, and design philosophy.