
LLM Training Data Asymmetry for Tool Use

Why LLMs write better shell commands than MCP tool calls — the training data distribution that makes CLI fluency outperform structured tool-calling for well-established tools.


Large language models learn from what they’ve seen. The quality of their output on any task is directly proportional to how much similar work appeared in their training data. This creates a measurable asymmetry between two ways of connecting AI agents to external tools: generating familiar CLI commands versus generating structured MCP tool calls.

The Training Corpus Gap

Claude, GPT, and their peers have ingested millions of GitHub repositories, Stack Overflow answers, blog posts, and documentation pages. Tools like bq (BigQuery CLI, since 2011), gcloud (since 2013), aws CLI, kubectl, and psql have over a decade of real-world usage examples: troubleshooting threads, tutorial code, production scripts, CI/CD configurations, and blog walkthroughs.

MCP tool-calling, by contrast, requires models to generate structured JSON matching specific schemas. These are formats the models have encountered primarily in synthetic training data created specifically to teach tool use. The volume, diversity, and quality of that training data is orders of magnitude smaller than what exists for established CLI tools.

Cloudflare’s engineering team captured the asymmetry with an analogy:

Making an LLM perform tasks with tool calling is like putting Shakespeare through a month-long class in Mandarin and then asking him to write a play in it. It’s just not going to be his best work.

When Claude generates bq query --nouse_legacy_sql 'SELECT COUNT(*) FROM dataset.table', it’s drawing on patterns it’s seen thousands of times in real codebases. When it generates an MCP tool call with a JSON schema it discovered at runtime, it’s working from a much smaller, more artificial corpus.
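
As a rough sketch of the difference in shape, compare the two forms below. The tool name and parameter schema in the structured call are hypothetical (every MCP server defines its own), but the scaffolding is representative.

// One-line CLI form, a pattern the model has seen constantly in the wild:
//   bq query --nouse_legacy_sql 'SELECT COUNT(*) FROM dataset.table'

// A roughly equivalent structured tool call. The name and fields here are
// invented for illustration; the point is that every invocation repeats this
// scaffolding, and the tool's full JSON Schema definition also has to be
// loaded into the context window before any call is made.
const toolCall = {
  name: "bigquery_run_query",
  arguments: {
    project_id: "my-project",
    query: "SELECT COUNT(*) FROM dataset.table",
    use_legacy_sql: false,
    max_results: 100,
  },
};

console.log(JSON.stringify(toolCall, null, 2));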

The Benchmark Evidence

Two independent research efforts in late 2025 quantified the gap.

Cloudflare’s Code Mode Experiment

Cloudflare tested their hypothesis by asking agents to create 31 calendar events using two approaches: traditional MCP tool-calling and writing TypeScript code against the same APIs.

The code approach used 81% fewer tokens. But the qualitative result was more revealing. The code-writing agent correctly called new Date() to determine the current date before creating events. The tool-calling agent, lacking that capability within the tool-call paradigm, created all 31 events a year in the past.

The code-writing agent had access to the full expressiveness of a programming language — variables, control flow, date arithmetic, error handling. The tool-calling agent was limited to filling in JSON parameters for each individual call.
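
A minimal sketch of what the code-mode style looks like, assuming a hypothetical calendar binding exposed to the sandbox (Cloudflare's actual generated interface differs): the agent gets variables, loops, and date arithmetic instead of thirty-one separate parameter-filling exercises.

// The `calendar` binding is a hypothetical stand-in for whatever API surface
// the sandbox exposes to generated code.
declare const calendar: {
  createEvent(event: { title: string; start: string }): Promise<void>;
};

async function createWeeklyEvents(count: number): Promise<void> {
  const today = new Date(); // code can simply ask the runtime for the current date
  for (let week = 0; week < count; week++) {
    const start = new Date(today);
    start.setDate(today.getDate() + week * 7); // plain date arithmetic, no extra round trips
    await calendar.createEvent({
      title: `Weekly sync #${week + 1}`,
      start: start.toISOString(),
    });
  }
}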

Anthropic’s Code Execution Research

Anthropic’s independent study tested a Google Drive-to-Salesforce workflow: downloading a document and attaching it to a record.

With traditional MCP tool calls, the workflow consumed 150,000 tokens. With code execution, it dropped to 2,000 tokens — a 98.7% reduction.

The mechanism explains the magnitude. With traditional MCP, every intermediate result passes through the model’s context window. Download a file? The model sees the full response. Parse metadata? Back through the context window. Attach to Salesforce? Another round trip. With code execution, data stays in a sandbox. The model only receives the final result it actually needs.
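
A sketch of that pattern, with hypothetical gdrive and salesforce bindings standing in for whatever interfaces the code-execution environment actually generates: the document bytes move between the two services inside the sandbox, and only the short return string ever reaches the model's context.

declare const gdrive: {
  getDocument(id: string): Promise<{ name: string; content: Uint8Array }>;
};
declare const salesforce: {
  attachFile(
    recordId: string,
    name: string,
    content: Uint8Array
  ): Promise<{ attachmentId: string }>;
};

async function attachDriveDocToRecord(docId: string, recordId: string): Promise<string> {
  const doc = await gdrive.getDocument(docId); // large payload stays in the sandbox
  const result = await salesforce.attachFile(recordId, doc.name, doc.content);
  return `Attached ${doc.name} as ${result.attachmentId}`; // only this line reaches the model
}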

This isn’t just a token efficiency story. Context window pollution from intermediate results degrades the model’s ability to reason about the overall task. Smaller context means better attention over what remains.

Why This Matters for Data Engineering

For data engineers working with BigQuery, Snowflake, or Postgres, this asymmetry has practical consequences.

A bq query command consumes 15-30 tokens. A bq ls command consumes about 10. The equivalent MCP tool call — with its JSON schema overhead and structured response — typically costs 150-250 tokens per invocation, plus the upfront context window cost of loading tool definitions.

When you’re running dozens of operations in a session — exploring schemas, profiling data, iterating on transformations — the difference compounds. Fifty CLI operations at roughly 20 tokens each comes to about 1,000 tokens; the same session through MCP calls at 150-250 tokens each approaches 10,000, before counting the tool definitions loaded up front.

# ~20 tokens, pattern seen millions of times in training data
bq ls --format=pretty my_dataset
# ~15 tokens, extremely well-represented in training data
bq show --schema --format=prettyjson my_dataset.my_table
# ~40 tokens, familiar pattern
bq query --nouse_legacy_sql '
SELECT COUNT(*) AS row_count, MIN(created_at), MAX(created_at)
FROM my_dataset.my_table
'

Claude generates these commands naturally and parses their output reliably. The commands are compact. Results stream directly back. No tool definitions needed, no JSON schema overhead.

The Asymmetry Is Not Permanent

This advantage is specific to tools with large training data footprints. It doesn’t apply uniformly.

For APIs without CLI equivalents — Salesforce, HubSpot, Jira, Notion — there’s no well-established command-line pattern for the model to draw on. Asking Claude to generate raw curl commands with OAuth tokens for these services is error-prone. An MCP server that handles authentication and presents clean tool interfaces genuinely improves reliability for these targets.

As MCP adoption grows and more examples of tool-calling appear in public codebases, the training data gap will narrow. Models fine-tuned specifically for tool use will improve. But for tools with a decade or more of CLI usage data — bq, gcloud, aws, kubectl, psql, git — the asymmetry is likely to persist for years because the training data advantage is so large.

The Broader Implication

The lesson extends beyond any single tool. When evaluating any MCP integration, ask: does a CLI already exist? How extensive is its training data footprint? What capabilities does MCP add versus what does it abstract away?

If the CLI is well-established and the model already generates it fluently, MCP adds overhead without adding capability for straightforward operations. If no CLI exists, or the CLI is obscure, or the workflow requires structured responses and audit trails, MCP adds genuine value.

The emerging pattern of converting tool schemas into code interfaces — letting LLMs write code against APIs rather than generate tool calls — suggests the industry is converging on this understanding. Meet models where their training data is strongest.
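
As a rough illustration of that conversion, here is what a tool schema might become once expressed as code. The names, fields, and function are invented for this sketch, not any particular server's generated interface.

// Typed shapes that a tool's input and output schema could translate into.
interface RunQueryInput {
  query: string;
  useLegacySql?: boolean;
  maxResults?: number;
}

interface RunQueryResult {
  rows: Array<Record<string, unknown>>;
  totalRows: number;
}

// Instead of emitting a JSON tool call per invocation, the agent writes code
// against a function like this and composes it with loops, retries, and
// filtering before anything returns to the context window.
declare function runQuery(input: RunQueryInput): Promise<RunQueryResult>;

async function countRows(table: string): Promise<number> {
  const result = await runQuery({ query: `SELECT COUNT(*) AS n FROM ${table}` });
  return Number(result.rows[0]["n"]);
}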