[Figure: three representations of an agent loop stacked vertically, progressing from a solid circle to a segmented circle to a node-and-edge graph, illustrating the spectrum from implicit to fully explicit control flow]

Every agent framework you will encounter runs the same loop. An LLM receives a prompt and a list of available tools. If it decides it needs more information or needs to take an action, it returns a tool call instead of a text response. Your code executes the tool, sends the result back, and the LLM decides what to do next. When it finally has enough information to answer, it returns text instead of a tool call, and the loop stops.

That is the entire pattern. Simon Willison defines an LLM agent as "something that runs tools in a loop to achieve a goal." Steve Kinney dug through the source code of six major frameworks (Claude Agent SDK, OpenAI Agents SDK, LangGraph, smolagents, Vercel AI SDK, CrewAI) and found they all converge on the same architecture. The loop is a solved problem. The interesting decisions are how much of it the framework hides from you, what controls it gives you over each iteration, and what breaks when you take it to production.

Here is the canonical version in pseudocode:

while True:
    response = call_llm(messages)
    if not response.tool_calls:
        return response  # text answer means we are done
    results = execute_tools(response.tool_calls)
    messages.extend(results)

Tool calls are the continuation signal ("I need more information"). A text response is the termination signal ("I have what I need"). Everything else is orchestration around that core mechanic.

This post builds the same agent in three frameworks to show how each one implements this pattern. The frameworks sit on a spectrum from implicit (the loop is hidden) to fully explicit (the loop is your code). Seeing the same task across all three makes the trade-offs concrete.

The Task

We are building a file system researcher. Given a directory, the agent scans for TODO comments in source files, categorizes each one by urgency (critical, important, minor, or unknown), and writes a summary report. The task requires multiple loop iterations: list the files, read each one, search for TODOs, categorize them, and produce the output.

Three tools make this work:

list_files takes a directory path and returns a list of file paths. read_file takes a file path and returns its contents. write_report takes structured TODO data and writes a Markdown summary to disk. The agent decides which tools to call, in what order, and when it has enough information to produce the report. We do not tell it the sequence. That is the point of the loop.

Full runnable examples for all three implementations are in our GitHub repositories. The snippets below are annotated to highlight the pattern, not to be copied and run directly.
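One note before the frameworks: the tool bodies are elided throughout. For reference, a minimal list_files is a few lines of pathlib. The extension set here is an illustrative assumption, not something the task prescribes:

```python
from pathlib import Path

# Illustrative set; extend to whatever your codebase uses
SOURCE_EXTENSIONS = {".py", ".js", ".ts", ".go", ".rs"}

def list_files(directory: str) -> list[str]:
    """List all source files in the given directory, recursively."""
    return sorted(
        str(p)
        for p in Path(directory).rglob("*")
        if p.is_file() and p.suffix in SOURCE_EXTENSIONS
    )
```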

Pydantic AI: The Implicit Loop

Pydantic AI hides the loop entirely. You define an agent, register tools, and call run_sync. The framework handles the while loop, tool execution, message threading, and termination detection internally. You never see the iteration.

from pydantic_ai import Agent
from pydantic_ai.usage import UsageLimits

agent = Agent(
    'anthropic:claude-sonnet-4-6',
    instructions="""Scan the provided directory for TODO comments
    in source files. Categorize each by urgency (critical, important,
    minor, unknown). Write a summary report using the write_report tool.""",
)

@agent.tool_plain
def list_files(directory: str) -> list[str]:
    """List all source files in the given directory, recursively."""
    # returns list of file paths
    ...

@agent.tool_plain
def read_file(file_path: str) -> str:
    """Read and return the contents of a file."""
    ...

@agent.tool_plain
def write_report(todos: list[dict], output_path: str) -> str:
    """Write a categorized TODO report to the specified path."""
    ...

result = agent.run_sync(
    'Scan ./src for TODO comments and write a report to ./todo-report.md',
    usage_limits=UsageLimits(request_limit=25),
)

The @agent.tool_plain decorator registers each function as a tool the LLM can call. Pydantic AI extracts the function signature, type annotations, and docstring to build the tool schema automatically. When the LLM returns a tool call, the framework validates the arguments using Pydantic, executes the function, and sends the result back. If validation fails, the error is sent to the LLM so it can retry.
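The mechanism behind that schema generation is worth seeing once. This is not Pydantic AI's actual implementation, just a stdlib sketch of the idea: introspect the signature and docstring, and map Python annotations to JSON Schema types:

```python
import inspect
from typing import get_type_hints

# Illustrative mapping; real frameworks handle far more types
JSON_TYPES = {str: "string", int: "integer", float: "number",
              bool: "boolean", list: "array"}

def build_tool_schema(fn) -> dict:
    """Derive a JSON-Schema-style tool definition from a plain function."""
    hints = get_type_hints(fn)
    properties = {}
    for name in inspect.signature(fn).parameters:
        annotation = hints.get(name, str)
        properties[name] = {"type": JSON_TYPES.get(annotation, "string")}
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": list(properties),
        },
    }

def read_file(file_path: str) -> str:
    """Read and return the contents of a file."""
    ...

schema = build_tool_schema(read_file)
```

The signature, the annotation, and the docstring each end up in a field the LLM sees, which is why well-written docstrings matter so much for tool-using agents.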

The UsageLimits(request_limit=25) parameter is the safety valve. Without it, a confused LLM could loop indefinitely. The request limit caps the total number of LLM calls (not tool calls) in a single run. You can also set tool_calls_limit to cap tool executions directly.

The trade-off is visibility. You do not see what happens between run_sync being called and the result being returned. You cannot inspect the LLM's response before tools execute, inject logic between iterations, or modify the message history mid-loop. For straightforward tool-using agents, this is exactly what you want. For agents that need per-iteration control, you need more machinery. Pydantic AI does offer an escape hatch via agent.iter(), which lets you async-iterate over each step of the internal graph, but the default path is "register tools, call run, get result."

OpenAI Agents SDK: The Semi-Explicit Loop

The OpenAI Agents SDK manages the loop for you but exposes a clear decision model for what happens at each iteration. Internally, every turn produces one of four outcomes: final_output (the LLM produced a response, no tool calls, stop), run_again (tool calls present, execute them, continue), handoff (delegate to another agent), or interruption (a tool needs human approval, pause). That four-way classification is the conceptual leap this SDK adds.

from agents import Agent, Runner, function_tool

@function_tool
def list_files(directory: str) -> list[str]:
    """List all source files in the given directory, recursively."""
    ...

@function_tool
def read_file(file_path: str) -> str:
    """Read and return the contents of a file."""
    ...

@function_tool
def write_report(todos: list[dict], output_path: str) -> str:
    """Write a categorized TODO report to the specified path."""
    ...

agent = Agent(
    name="todo_researcher",
    instructions="""Scan the provided directory for TODO comments
    in source files. Categorize each by urgency (critical, important,
    minor, unknown). Write a summary report using the write_report tool.""",
    tools=[list_files, read_file, write_report],
)

result = Runner.run_sync(
    agent,
    'Scan ./src for TODO comments and write a report to ./todo-report.md',
    max_turns=15,
)

The surface looks similar to Pydantic AI: @function_tool for tool registration, a runner that manages the loop, a max turns limit. Where it differs is in what it exposes. If you use the streaming interface (Runner.run_streamed), you can observe each turn's classification as it happens. The handoff mechanism lets one agent delegate to another via a specialized tool call (transfer_to_<agent_name>), reusing the tool infrastructure rather than inventing a separate routing layer. Guardrails run at three points: input (first turn only), output (after the final response), and tool (before and after each execution).
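The SDK's own types are richer than this, but the four-way turn classification can be sketched framework-free. The response shape below is a hypothetical stand-in (tool calls reduced to names) to show the decision order:

```python
from dataclasses import dataclass, field
from enum import Enum

class TurnOutcome(Enum):
    FINAL_OUTPUT = "final_output"   # text response, stop
    RUN_AGAIN = "run_again"         # tool calls present, execute and continue
    HANDOFF = "handoff"             # delegate to another agent
    INTERRUPTION = "interruption"   # a tool needs human approval, pause

@dataclass
class ModelResponse:                # hypothetical stand-in for the SDK's type
    tool_calls: list = field(default_factory=list)
    needs_approval: bool = False

def classify_turn(response: ModelResponse) -> TurnOutcome:
    if not response.tool_calls:
        return TurnOutcome.FINAL_OUTPUT
    if response.needs_approval:
        return TurnOutcome.INTERRUPTION
    # handoffs reuse tool infrastructure: transfer_to_<agent_name> calls
    if any(name.startswith("transfer_to_") for name in response.tool_calls):
        return TurnOutcome.HANDOFF
    return TurnOutcome.RUN_AGAIN
```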

The trade-off compared to Pydantic AI: more concepts to learn (agents, runners, handoffs, guardrails), but more visibility into what the loop is doing at each step. Compared to LangGraph, the loop is still managed for you. You observe its decisions but do not own the control flow.

LangGraph: The Fully Explicit Loop

LangGraph replaces the while loop with a directed cyclic graph. You define nodes (functions that transform state), wire them with edges (routing functions that decide what runs next), and the cycle in the graph is the loop. The pattern is visible in your code.

from typing import Annotated, Literal
from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import ToolMessage
from langchain_core.tools import tool

# Define the state that flows through the graph
class AgentState(TypedDict):
    messages: Annotated[list, add_messages]

# Define tools
@tool
def list_files(directory: str) -> list[str]:
    """List all source files in the given directory, recursively."""
    ...

@tool
def read_file(file_path: str) -> str:
    """Read and return the contents of a file."""
    ...

@tool
def write_report(todos: list[dict], output_path: str) -> str:
    """Write a categorized TODO report to the specified path."""
    ...

tools = [list_files, read_file, write_report]
tools_by_name = {t.name: t for t in tools}
model = ChatAnthropic(model="claude-sonnet-4-6").bind_tools(tools)

# Node: call the LLM
def call_model(state: AgentState):
    response = model.invoke(state["messages"])
    return {"messages": [response]}

# Node: execute tools
def tool_node(state: AgentState):
    results = []
    for call in state["messages"][-1].tool_calls:
        result = tools_by_name[call["name"]].invoke(call["args"])
        results.append(
            ToolMessage(content=str(result), tool_call_id=call["id"])
        )
    return {"messages": results}

# Conditional edge: should we continue the loop?
def should_continue(state: AgentState) -> Literal["tool_node", "__end__"]:
    last_message = state["messages"][-1]
    if last_message.tool_calls:
        return "tool_node"
    return "__end__"

# Build the graph
graph = StateGraph(AgentState)
graph.add_node("call_model", call_model)
graph.add_node("tool_node", tool_node)
graph.add_edge(START, "call_model")
graph.add_conditional_edges("call_model", should_continue)
graph.add_edge("tool_node", "call_model")
agent = graph.compile()

# Run it
result = agent.invoke({
    "messages": [
        ("user", "Scan ./src for TODO comments and write a report to ./todo-report.md")
    ]
})

The six-line pseudocode loop is now distributed across the graph structure. call_model is the LLM call. should_continue is the "are there tool calls?" check. tool_node is the tool execution. The edge from tool_node back to call_model is the loop. The conditional edge to __end__ is the termination. Same pattern, made explicit.

The state management is the first thing that stands out. AgentState is a TypedDict with an add_messages reducer, which means each node appends to the message list rather than replacing it. You can add fields to the state (a counter for iterations, a cost accumulator, a flag to short-circuit) and every node can read and write them.
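The reducer idea is simple enough to show in isolation. This is not LangGraph's add_messages (which also deduplicates by message id), just a sketch of the merge behavior it gives you: fields with a reducer are combined, fields without one are replaced:

```python
def add_messages(existing: list, update: list) -> list:
    """Append-style reducer: node output is merged into the list, not substituted."""
    return existing + update

def apply_update(state: dict, node_output: dict, reducers: dict) -> dict:
    """Merge a node's partial output into the state, field by field."""
    merged = dict(state)
    for key, value in node_output.items():
        reducer = reducers.get(key)
        merged[key] = reducer(state[key], value) if reducer else value
    return merged

reducers = {"messages": add_messages}
state = {"messages": ["user: scan ./src"], "files_processed": 0}
state = apply_update(state, {"messages": ["assistant: calling list_files"]}, reducers)
state = apply_update(state, {"files_processed": 1}, reducers)
```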

The real payoff is what this architecture enables beyond the basic loop. Add a checkpointer to the compile step and the graph persists state at every node transition. If the agent fails after five tool calls, you resume from the fifth checkpoint instead of starting over. Add interrupt_before=["tool_node"] and the graph pauses before executing tools, giving a human the chance to approve or reject. Neither of these requires changing the loop logic. They are infrastructure features of the graph execution model.

The trade-off: more code, more concepts (state, nodes, edges, conditional edges, reducers, checkpointers), and a steeper learning curve. For a straightforward agent that calls three tools and writes a report, this is more machinery than you need. For agents that need durable execution, parallel branches, or failure recovery, the graph model earns its complexity.

What Happens at 200 Files

The TODO scanner works cleanly on a directory with 15 files. Point it at a codebase with 200 and you hit the problems the basic pattern does not address.

The agent calls list_files and gets back 200 paths. It starts calling read_file on each one. By the 30th file, the conversation history (every tool call, every file's contents, every LLM response) has consumed most of the context window. The LLM starts losing track of TODOs it found in earlier files. By the 50th, it either hits the context limit and errors out, or it starts hallucinating categories for TODOs it can no longer see in context.

This is where the abstraction level matters. In Pydantic AI, the loop runs inside the framework and you have limited ability to intervene. You can set a request limit, but that just stops the loop early. It does not help the agent be smarter about what it loads into context. In the OpenAI Agents SDK, the streaming interface lets you observe the problem happening but does not give you the machinery to fix it. In LangGraph, you can add a node between tool_node and call_model that summarizes what the agent has found so far and clears old tool results from the state. You can add a counter that batches files in groups of 20. You can persist intermediate results to disk so they survive context compaction. The graph gives you the insertion points. The other frameworks require you to either live with the limitation or move to a more explicit architecture.

Here is the modified LangGraph version. The changes are small but structural: a files_processed counter and a findings accumulator in the state, a compress_context node that extracts results and trims the message history, and a routing change that runs compression after every 20 files.

from langchain_core.messages import HumanMessage, RemoveMessage, SystemMessage
from langgraph.checkpoint.memory import MemorySaver

# Extended state with compression support
class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    files_processed: int          # track progress
    findings: list[dict]          # accumulate TODOs outside the context window

# New node: compress context after every batch
def compress_context(state: AgentState):
    # Extract TODO findings from recent tool results and store them in
    # state["findings"], which persists outside the message history
    recent_todos = extract_todos_from_messages(state["messages"])

    # add_messages appends by default, so trimming means emitting
    # RemoveMessage markers for everything but the most recent messages
    removals = [RemoveMessage(id=m.id) for m in state["messages"][:-6]]

    # Add a summary so the LLM knows what it has found so far. Anthropic
    # accepts system content only at the start of the conversation, so
    # the summary rides in as a human-role message.
    summary = (
        f"Progress: {state['files_processed']} files scanned, "
        f"{len(state['findings']) + len(recent_todos)} TODOs found so far."
    )

    return {
        "messages": removals + [HumanMessage(content=summary)],
        "findings": state["findings"] + recent_todos,
    }

# Modified routing: compress after every 20th file
# (tool_node is assumed to increment files_processed as it reads files)
def should_continue(state: AgentState) -> Literal["tool_node", "__end__"]:
    if state["messages"][-1].tool_calls:
        return "tool_node"
    return "__end__"

def after_tools(state: AgentState) -> Literal["compress_context", "call_model"]:
    # Compress only after tools have executed, so no tool call is left
    # dangling without its result when messages are trimmed
    if state["files_processed"] > 0 and state["files_processed"] % 20 == 0:
        return "compress_context"
    return "call_model"

# Updated graph with compression node
graph = StateGraph(AgentState)
graph.add_node("call_model", call_model)
graph.add_node("tool_node", tool_node)
graph.add_node("compress_context", compress_context)
graph.add_edge(START, "call_model")
graph.add_conditional_edges("call_model", should_continue)
graph.add_conditional_edges("tool_node", after_tools)
graph.add_edge("compress_context", "call_model")  # resume after compression
agent = graph.compile(checkpointer=MemorySaver())  # persist state across failures
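The extract_todos_from_messages helper above is a placeholder. A minimal version could be a regex pass over tool-result text; the comment syntaxes and urgency keyword mapping here are assumptions for illustration:

```python
import re

TODO_PATTERN = re.compile(r"(?:#|//)\s*TODO[:\s]+(.*)", re.IGNORECASE)

# Assumed keyword-to-urgency mapping; tune for your codebase
URGENT_WORDS = {"critical": "critical", "urgent": "critical",
                "important": "important", "soon": "important"}

def extract_todos_from_messages(messages: list) -> list[dict]:
    """Pull TODO lines out of message content and tag each with an urgency."""
    todos = []
    for message in messages:
        content = getattr(message, "content", "") or ""
        for match in TODO_PATTERN.finditer(str(content)):
            text = match.group(1).strip()
            urgency = next(
                (u for word, u in URGENT_WORDS.items() if word in text.lower()),
                "unknown",
            )
            todos.append({"text": text, "urgency": urgency})
    return todos
```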

Three implementations showed convergence: the same loop, three costumes. This fourth shows divergence: only the fully explicit architecture lets you insert a compression step without rewriting the framework. That is the trade-off in one code block.

This is the real reason the abstraction spectrum matters. Small tasks do not reveal it. Scale does.

Choosing Your Level of Abstraction

All three frameworks run the same loop under the hood. The difference is how much of that loop is your code versus the framework's code.

Implicit (Pydantic AI): You define tools and a goal. The framework owns the loop, the stop condition, and the error handling. Best when you trust the defaults and want the fastest path to a working agent. Most tool-using agents do not need per-iteration control, and Pydantic AI's type-safe tool definitions and automatic schema generation make the common case very clean.

Semi-explicit (OpenAI Agents SDK): The SDK manages the loop but exposes a structured decision model for each turn. You see what happened (tool calls, handoffs, interruptions) and can react through hooks and guardrails. Best when you need to observe or filter the loop without rewriting it. The handoff mechanism makes multi-agent orchestration straightforward.

Fully explicit (LangGraph): You define the graph, the nodes, the edges, and the cycle. You own the control flow. Best when you need checkpointing, parallel branches, human-in-the-loop approval, or durable execution across restarts. You pay for that control with more code and more concepts, but the graph model gives you capabilities the other approaches cannot match without significant custom engineering.

Our default for client projects is Pydantic AI. It covers the majority of use cases with the least code, and the type safety catches a category of bugs that other frameworks leave to runtime. When a project needs multi-agent handoffs or structured guardrails, we move to the OpenAI Agents SDK. When it needs durable execution, checkpointing, or the kind of mid-loop intervention described in the 200-file scenario, we reach for LangGraph. We have yet to encounter a project that needed to start at LangGraph. We have encountered several that needed to migrate there after outgrowing the implicit approach.

What the Frameworks Hide From You

The basic loop does not address several production concerns that will surface the moment you deploy an agent to real users.

Context management is the most impactful. Every tool call and every tool result gets appended to the message history. That history is the LLM's context window, and it is finite. For a Claude Sonnet agent with a 200K token window, a single read_file call on a large source file can consume 5,000 to 10,000 tokens. Multiply that by 30 files and you have burned half the window before the agent starts categorizing anything.

The failure mode is not a clean error. The LLM starts degrading gradually: it forgets earlier findings, repeats tool calls it already made, or produces summaries that contradict its own prior analysis. Detecting this mid-loop requires monitoring token usage per iteration, which none of the frameworks do automatically.

The fix is usually a combination of summarizing intermediate results (compressing 10 file reads into a structured summary before continuing), clearing old tool results from the message history, and splitting work across sub-agents with their own context windows. Anthropic's engineering team documented this pattern in their multi-agent research system: subagents explore independently and return compressed summaries, keeping the lead agent's context clean.
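The arithmetic behind that failure mode is worth making concrete. The numbers below are illustrative estimates (roughly four characters per token, a guessed per-turn overhead), not measurements:

```python
CONTEXT_WINDOW = 200_000   # tokens, Claude Sonnet class models
CHARS_PER_TOKEN = 4        # rough heuristic for source code

def estimate_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN

def files_until_full(avg_file_chars: int, overhead_per_turn: int = 500,
                     reserve: int = 20_000) -> int:
    """How many read_file results fit before the window is effectively spent.

    overhead_per_turn covers the tool-call message and LLM commentary;
    reserve keeps room for instructions and the final report.
    """
    per_file = avg_file_chars // CHARS_PER_TOKEN + overhead_per_turn
    return (CONTEXT_WINDOW - reserve) // per_file

# A 30 KB source file is roughly 7,500 tokens of tool result alone
budget = files_until_full(avg_file_chars=30_000)
```

Under these assumptions the budget works out to about 22 files, which is why the scanner sails through a 15-file directory and falls apart somewhere past the 30-file mark.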

Cost scales faster than you expect. That same Anthropic post reported that a single-agent loop consumes roughly 4x more tokens than a standard chat interaction, and a multi-agent system consumes approximately 15x more. The 90.2% performance improvement their multi-agent system achieved over a single agent was real, but so was the token bill. For our TODO scanner, a run against a 50-file codebase might cost $0.15. Against 500 files with sub-agents, you are looking at $2 to $5 per run. Neither number is alarming on its own, but multiply by the number of users or automated triggers and cost becomes an architectural concern, not just a billing detail.

Loop detection catches agents that are stuck. A common failure: the agent calls read_file on the same file twice, or alternates between two tool calls without making progress. Without detection, it will burn through your turn limit doing nothing useful. Pydantic AI's UsageLimits will eventually stop it, but by then you have wasted tokens. LangGraph lets you add a counter to the state and short-circuit in should_continue. The OpenAI Agents SDK's tracing helps you diagnose the problem after the fact but does not prevent it in the loop.
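A minimal stall detector needs nothing framework-specific: fingerprint each tool call and flag repeats within a recent window. A sketch, assuming calls arrive as a name plus an arguments dict:

```python
from collections import deque

class StallDetector:
    """Flag an agent that repeats an identical tool call within a recent window."""

    def __init__(self, window: int = 6):
        self.recent = deque(maxlen=window)

    def record(self, tool_name: str, args: dict) -> bool:
        """Record a call; return True if it duplicates a recent one (likely stuck)."""
        signature = (tool_name, tuple(sorted(args.items())))
        stuck = signature in self.recent
        self.recent.append(signature)
        return stuck
```

In LangGraph this check slots naturally into should_continue; in the managed frameworks you can run it inside the tool functions themselves and raise when it trips.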

When to Use an Agent Loop

Not every problem benefits from an agent loop. The pattern works best when you have clear success criteria and the path to achieving them involves trial and error. Willison's examples are instructive: debugging (a test is failing and the agent can run the test suite), performance optimization (benchmark, change, benchmark again), dependency upgrades (upgrade, run tests, fix breakage), and container optimization (try different base images, measure size).

The common thread is automated verification. The agent needs a way to know whether its last action moved it closer to the goal. A test suite. A benchmark. A size measurement. If you cannot define a verification step, the agent cannot self-correct, and the loop degrades into the LLM guessing repeatedly.

If your task is a fixed sequence of steps with no branching decisions, a simple pipeline is faster, cheaper, and more predictable. Agent loops earn their cost when the number of steps, the order of operations, or the specific tools needed cannot be determined in advance. That is when handing the steering wheel to the LLM and letting it iterate becomes genuinely useful.

Building agent loops is straightforward. Building agent loops that hold up in production, with cost controls, context management, and recovery from failure modes, is where the real work starts. If you are integrating agents into your product and want help getting from prototype to production, that is the kind of problem we work on. Get in touch.
