
Code Generation over Tool Calling Pattern

The emerging pattern of having LLMs write code against APIs rather than generate tool calls — Cloudflare's Code Mode, Anthropic's code execution, and what it means for MCP's future.

Planted
mcp · claude code · ai

In late 2025, Cloudflare and Anthropic independently published research reaching the same conclusion: LLMs produce better results when they write code against APIs than when they generate structured tool calls. The implementations differ, but the underlying insight is shared, and it has direct implications for how MCP evolves.

Cloudflare’s Code Mode

Cloudflare’s approach, detailed in their September 2025 research, converts MCP tool schemas into TypeScript API interfaces. Instead of presenting the model with raw tool definitions and asking it to generate JSON tool calls, Code Mode gives the model TypeScript type definitions and asks it to write code that calls those APIs.

The model never sees raw tool definitions. It sees familiar TypeScript — interfaces, function signatures, type annotations — and writes code against them.
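As a sketch of what that conversion might produce (the schema and the generated interface below are illustrative, not Cloudflare's actual output):

// A raw MCP tool definition, as traditional tool calling would present it:
// {
//   "name": "create_event",
//   "description": "Create a calendar event",
//   "inputSchema": {
//     "type": "object",
//     "properties": {
//       "title":    { "type": "string" },
//       "date":     { "type": "string", "description": "ISO 8601 date-time" },
//       "duration": { "type": "number", "description": "Minutes" }
//     },
//     "required": ["title", "date", "duration"]
//   }
// }

// What Code Mode might show the model instead: a typed TypeScript API.
interface CreateEventParams {
  title: string;
  /** ISO 8601 date-time string */
  date: string;
  /** Duration in minutes */
  duration: number;
}

interface Calendar {
  createEvent(params: CreateEventParams): Promise<{ id: string }>;
}

declare const calendar: Calendar;

The calendar binding would be injected by the execution environment; the model only ever writes code against the interface, as in the benchmark code below.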

The results from their 31-calendar-event benchmark:

  • 81% fewer tokens compared to traditional tool calling
  • Better correctness: the code-writing agent correctly used new Date() to determine the current date; the tool-calling agent couldn’t access date functions within the tool-call paradigm and created all events a year in the past
  • Richer logic: the code-writing agent could use loops, conditionals, variables, and error handling — the full expressiveness of a programming language rather than being limited to one tool call at a time

The key mechanism: when the model writes code, it can batch operations, reuse intermediate results, and apply control flow. With tool calling, each operation is a separate round trip through the model’s context window. Code naturally composes; tool calls accumulate.

// Code Mode: the model writes this
const today = new Date();
for (let i = 0; i < 31; i++) {
  const eventDate = new Date(today);
  eventDate.setDate(today.getDate() + i);
  await calendar.createEvent({
    title: `Daily standup`,
    date: eventDate.toISOString(),
    duration: 30
  });
}

// Traditional tool calling: the model generates 31 separate calls
// {"tool": "create_event", "args": {"title": "Daily standup", "date": "2025-09-01", ...}}
// {"tool": "create_event", "args": {"title": "Daily standup", "date": "2025-09-02", ...}}
// ... (29 more)

The code version is more concise, more correct (dynamic date calculation), and more efficient (single logical operation versus 31 round trips).

Anthropic’s Code Execution with MCP

Anthropic’s research takes a different implementation path but reaches the same destination. Their approach presents MCP tools as filesystem-accessible code modules that agents discover on demand. Rather than loading tool definitions into the context window, the agent writes code that imports and calls these modules.
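A sketch of what the agent might see on disk, in the spirit of Anthropic's published examples (the file path, the callMCPTool helper, and the tool name are assumptions here):

// ./servers/google-drive/getDocument.ts: one generated module per tool.
// The typed wrapper forwards to the underlying MCP tools/call.
import { callMCPTool } from "../../mcp-client";

export interface GetDocumentInput {
  documentId: string;
}

export interface GetDocumentResponse {
  content: string;
}

export async function getDocument(
  input: GetDocumentInput
): Promise<GetDocumentResponse> {
  return callMCPTool<GetDocumentResponse>("google_drive__get_document", input);
}

The agent discovers tools by listing the servers/ directory, reading only the definitions it needs instead of loading every schema up front.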

Their Google Drive-to-Salesforce test case showed the most dramatic improvement:

  • Traditional MCP: 150,000 tokens
  • Code execution: 2,000 tokens
  • Reduction: 98.7%

The explanation is architectural. With traditional MCP, every intermediate result passes through the model’s context window. Download a file from Google Drive — the model sees the full file content in its context. Parse metadata from the file — back through the context. Upload to Salesforce — another round trip with the full response. Each step pollutes the context window with data the model doesn’t need for its next decision.

With code execution, data stays in a sandbox. The model writes a script that downloads, processes, and uploads. Intermediate data lives in variables within the sandbox, never entering the context window. The model only receives the final result: success or failure, with relevant details.
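A minimal sketch of such a script, reusing the hypothetical generated modules from above (the salesforce updateRecord module and all identifiers are likewise assumed):

// Runs inside the execution sandbox. Only what is explicitly logged or
// returned reaches the model's context window; everything else stays here.
import { getDocument } from "./servers/google-drive/getDocument";
import { updateRecord } from "./servers/salesforce/updateRecord";

const doc = await getDocument({ documentId: "doc_123" }); // full content stays in the sandbox
const excerpt = doc.content.slice(0, 200);                // processed locally, never shown to the model

await updateRecord({
  objectType: "SalesMeeting",
  recordId: "rec_456",
  data: { Notes__c: excerpt },
});

console.log("Salesforce record updated with document excerpt."); // only this line surfaces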

The Shared Insight

Both approaches leverage the same training-data asymmetry: LLMs have been trained extensively on code, not on synthetic tool-calling formats. Writing TypeScript or Python is what these models were built to do. Generating structured JSON against arbitrary schemas is an acquired skill, taught through comparatively limited fine-tuning.

The pattern can be summarized: convert tool interfaces into code interfaces, then let the model write code. The model stays in its strongest modality (code generation) while still accessing external tools.
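Mechanically, that conversion can be thin: each tool schema becomes a typed wrapper whose body forwards to the protocol. A sketch, where mcpClient and its callTool method are assumed client internals:

// The model codes against the signature; the client translates the call
// into a JSON-RPC tools/call under the hood.
declare const mcpClient: {
  callTool<T>(name: string, args: unknown): Promise<T>;
};

async function createEvent(params: {
  title: string;
  date: string;
  duration: number;
}): Promise<{ id: string }> {
  return mcpClient.callTool("create_event", params);
}

The heavy lifting (schema-to-type generation and sandboxed execution) lives in the client, which is why the pattern can ship without changing the MCP wire protocol or the servers behind it.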

For data engineers, this pattern already manifests naturally in Claude Code. When Claude writes bq query --nouse_legacy_sql '...' and pipes the result through jq for processing, it’s doing exactly what Cloudflare and Anthropic describe: writing code (shell commands) against a well-known API (the bq CLI) rather than generating structured tool calls. The bq CLI is effectively a “code interface” to BigQuery that the model already speaks fluently.

Implications for MCP’s Future

This research doesn’t invalidate MCP. It suggests MCP’s evolution will move toward code-based interfaces rather than JSON-schema tool calling.

Short term: Expect MCP clients (Claude Code, Cursor, similar tools) to internally convert tool schemas into code representations before presenting them to the model. The developer experience stays the same — you still build MCP servers the same way — but the model sees TypeScript or Python interfaces instead of JSON schemas.

Medium term: MCP servers may expose code SDKs alongside tool definitions. Instead of (or in addition to) a tools/call interface, servers might provide importable modules that agents can use within code execution sandboxes.

Long term: The distinction between “tool calling” and “code execution” may blur entirely. The model writes code that happens to call tools, with the MCP protocol handling the underlying communication transparently.

What This Means in Practice Today

For data engineers using Claude Code right now, the practical takeaway is simple:

  1. CLI commands for well-established tools (bq, gcloud, gsutil, dbt, git) are already the “code generation over tool calling” pattern. The model writes shell code against familiar interfaces. This is the most efficient path for tools with deep training data representation.

  2. MCP for tools without CLIs (Salesforce, HubSpot, internal APIs) still provides value because there’s no established code pattern for the model to generate. MCP gives the model a structured interface that’s better than nothing, even if it’s not as efficient as code generation against a familiar CLI.

  3. Hybrid configurations are the practical answer. Let the model use the CLI for tools it knows well and MCP for tools it doesn't, and trust that the underlying infrastructure will converge over time:

{
  "mcpServers": {
    "salesforce": {
      "command": "npx",
      "args": ["-y", "@mcp/salesforce-server"]
    }
  },
  "permissions": {
    "allow": [
      "Bash(bq:*)",
      "Bash(gcloud:*)",
      "Bash(gsutil:*)",
      "Bash(dbt:*)"
    ]
  }
}
  4. Watch for client-side improvements. As Claude Code and other MCP clients adopt code-execution patterns internally, the efficiency gap between CLI and MCP will narrow, potentially without any changes to your MCP server code.

The way AI agents interact with tools is still evolving rapidly. Investing in MCP servers is reasonable because the protocol is becoming infrastructure. Tool-calling patterns may be superseded by code-execution patterns as clients evolve. Building servers with clean, well-documented tools leaves room for that transition without requiring server changes.