AI Engineering

How to Build a Self-Correcting AI Coder with LangGraph

An AI that writes code is interesting. An AI that writes code, runs it, reads the errors, and fixes itself is useful. Here is how to build one.

Code generation with LLMs is unreliable. GPT-4o generates working code about 60-70% of the time for non-trivial tasks. The other 30-40% contains syntax errors, wrong API usage, missing imports, or logic bugs. The fix is not a better model. The fix is a feedback loop: generate code, execute it, read the error, and regenerate. This is exactly the pattern that LangGraph's cyclical graphs are designed for. This tutorial walks through building a self-correcting coding agent from scratch.

The Write-Test-Fix Loop

The architecture is a cycle with three nodes:

  1. Write: The LLM generates code based on the user's requirements and any previous error context.
  2. Test: The generated code is executed in a sandboxed environment. Tests are run. Output and errors are captured.
  3. Fix: If tests fail, the error output is fed back to the LLM along with the failing code. The LLM generates a corrected version. Loop back to Test.

Why LangGraph, Not LangChain?

LangChain processes data through a linear chain: A -> B -> C -> Done. Self-correction requires cycles: Write -> Test -> Fix -> Test -> Fix -> Test -> Pass. LangGraph supports these cycles natively with conditional edges. Learn more in our LangChain vs LangGraph comparison.

Step 1: Define the Graph State

# State tracks the code, errors, and iteration count

from typing import TypedDict


class CoderState(TypedDict):
    requirement: str      # user's task description
    code: str             # current generated code
    test_output: str      # stdout + stderr from execution
    test_passed: bool     # did all tests pass?
    error_history: list   # previous errors for context
    iteration: int        # current iteration count
    max_iterations: int   # safety limit (default: 5)
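
The nodes in Step 2 also assume an llm client is already in scope. A minimal sketch using LangChain's OpenAI wrapper; the model name and temperature are illustrative choices, not prescribed by the pattern:

# Assumed LLM client for the write node; any LangChain chat model
# with an .invoke() method works here
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)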

Step 2: Build the Nodes

# Write node: generates or fixes code

def write_code(state: CoderState) -> dict:
    if state["iteration"] == 0:
        prompt = f"Write Python code for: {state['requirement']}"
    else:
        prompt = (
            f"Fix this code:\n{state['code']}\n\n"
            f"Error:\n{state['test_output']}\n\n"
            f"Previous errors: {state['error_history']}"
        )
    code = llm.invoke(prompt)
    # Return a partial update; LangGraph merges it into the full state
    return {"code": extract_code(code), "iteration": state["iteration"] + 1}

 

# Test node: executes code in sandbox

def test_code(state: CoderState) -> dict:
    result = sandbox.execute(state["code"], timeout=30)  # see Step 4 for a sandbox sketch
    passed = result.exit_code == 0
    errors = state["error_history"].copy()
    if not passed:
        errors.append(result.stderr)
    return {
        "test_output": result.stdout + result.stderr,
        "test_passed": passed,
        "error_history": errors,
    }

Step 3: Wire the Graph with Conditional Edges

# Build the self-correction cycle

from langgraph.graph import StateGraph, END

graph = StateGraph(CoderState)
graph.add_node("write", write_code)
graph.add_node("test", test_code)

graph.add_edge("write", "test")
graph.add_conditional_edges(
    "test",
    lambda state: (
        END if state["test_passed"]
        else END if state["iteration"] >= state["max_iterations"]
        else "write"  # loop back to fix
    ),
)
graph.set_entry_point("write")
agent = graph.compile()
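
Running the compiled agent is a single invoke call. The initial state seeds the loop; the requirement string below is just an example:

# Kick off a run with a fully populated initial state
result = agent.invoke({
    "requirement": "Write a function that validates IPv4 addresses",
    "code": "",
    "test_output": "",
    "test_passed": False,
    "error_history": [],
    "iteration": 0,
    "max_iterations": 5,
})
print(result["code"] if result["test_passed"] else "Hit the iteration limit")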

Step 4: Sandbox Execution

Never execute LLM-generated code directly on your server. Use a sandboxed environment:

  • Docker containers: Spin up a disposable container for each execution. Kill after timeout. (A minimal sketch follows this list.)
  • E2B (Code Interpreter API): Cloud sandboxes purpose-built for AI code execution.
  • Modal: Serverless Python execution with per-function isolation.
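
The sandbox.execute call in Step 2 assumes an object that returns an exit code, stdout, and stderr. A minimal Docker-based sketch; the image, memory cap, and mount layout are illustrative, and E2B and Modal each have their own SDKs:

# Hypothetical DockerSandbox backing the sandbox.execute() call in Step 2.
# Each run gets a throwaway container: no network, capped memory, hard timeout.
import os
import subprocess
import tempfile
from dataclasses import dataclass


@dataclass
class ExecResult:
    exit_code: int
    stdout: str
    stderr: str


class DockerSandbox:
    def execute(self, code: str, timeout: int = 30) -> ExecResult:
        with tempfile.TemporaryDirectory() as tmp:
            with open(os.path.join(tmp, "main.py"), "w") as f:
                f.write(code)
            try:
                proc = subprocess.run(
                    ["docker", "run", "--rm",
                     "--network", "none",       # no network access
                     "--memory", "256m",        # cap memory
                     "-v", f"{tmp}:/work:ro",   # read-only mount of the code
                     "python:3.12-slim", "python", "/work/main.py"],
                    capture_output=True, text=True, timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                return ExecResult(1, "", f"Execution timed out after {timeout}s")
        return ExecResult(proc.returncode, proc.stdout, proc.stderr)


sandbox = DockerSandbox()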

Performance: How Well Does It Work?

  • First-pass success rate (GPT-4o): ~65%
  • After 1 correction: ~82%
  • After 2 corrections: ~90%
  • After 3 corrections: ~93%
  • Average iterations to pass: 1.4

The self-correction loop raises the effective success rate from 65% to 93%. The remaining 7% typically involves tasks that require architectural changes the LLM cannot figure out from error messages alone. For those cases, add a human-in-the-loop step using LangGraph's interrupt/resume.
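
LangGraph exposes this as a compile-time option. A minimal sketch that pauses before each rewrite so a human can inspect the failing code; the thread id and initial_state are placeholders, and the MemorySaver import path varies slightly across LangGraph versions:

# Pause before each rewrite; a checkpointer is required for interrupt/resume
from langgraph.checkpoint.memory import MemorySaver

agent = graph.compile(
    checkpointer=MemorySaver(),
    interrupt_before=["write"],  # stop before the next fix attempt
)
config = {"configurable": {"thread_id": "job-1"}}

agent.invoke(initial_state, config)  # runs until the interrupt fires
# ...a human reviews or edits the checkpointed state here...
agent.invoke(None, config)           # resume from the saved checkpoint

The same checkpointer also provides the crash-resume behavior listed under the production additions below.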

Making It Production-Ready

The basic loop above needs several additions for production:

  • Checkpointing: Save state after each iteration so you can resume if the service crashes.
  • Observability: Log every iteration with the code, errors, and LLM reasoning for debugging (see the sketch after this list).
  • Cost controls: Set hard limits on iterations and token usage. See our OpenAI cost optimization guide.
  • Testing framework: Run tests using evaluation benchmarks to verify the agent's code quality over time.
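
For the observability point, one lightweight approach is to wrap each node so it emits a structured log line per iteration. A sketch; the wrapper and logged fields are illustrative, and a dedicated tracing tool such as LangSmith would replace this in practice:

# Hypothetical wrapper: log what each node did on every iteration
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("coder_agent")

def with_logging(node_fn):
    def wrapped(state: CoderState) -> dict:
        update = node_fn(state)
        logger.info(json.dumps({
            "node": node_fn.__name__,
            "iteration": state["iteration"],
            "test_passed": update.get("test_passed"),
        }))
        return update
    return wrapped

# Register the wrapped nodes in Step 3 instead of the bare functions
graph.add_node("write", with_logging(write_code))
graph.add_node("test", with_logging(test_code))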

For the complete production architecture, see our production agent blueprint.

Frequently Asked Questions

Is this safe? Can the AI damage my system?

Only if you execute code without a sandbox. With Docker containers or E2B, the generated code runs in complete isolation. Network access should be disabled by default and file system access restricted to a temporary directory.

Can this replace developers?

No. It handles well-defined, testable programming tasks ("write a function that...", "parse this data format", "implement this algorithm"). It does not handle system design, architecture decisions, or ambiguous requirements. Use it as a coding assistant, not an autonomous developer.

What programming language works best?

Python has the best results because LLMs have the most Python training data, and Python's error messages are descriptive. The pattern works for JavaScript and TypeScript as well. For compiled languages (Go, Rust), the compilation step adds latency to each iteration. Read more about why Python dominates AI development.

Build AI-Powered Development Tools

We build custom coding agents and developer tools powered by LLMs, from code generation to automated testing pipelines.
