
Context Engineering for Data Pipelines

How the value in data engineering is shifting from writing code to structuring context — the emerging discipline of context engineering, the ETL-to-ECL reframe, and the skills pipeline risk.

Planted
dbt · ai · data engineering

Across AI-for-data-engineering research, context quality determines outcome quality. Claude Code’s biggest error source was mismatched conventions, addressable with structured instruction files. Thomson Reuters’ failures stemmed from missing temporal context. Tiger Data’s AI accuracy improved 27% with semantic catalogs. The emerging discipline is context engineering — how to structure what the AI needs to know, not just how to prompt it.

From ETL to ECL

Ananth Packkildurai proposed replacing the ETL framing entirely with ECL: Extract, Contextualize, Link. The argument is that the mechanical work of pipeline construction — the “Transform” and “Load” — is no longer a differentiator. AI handles it. What differentiates is knowing what the pipeline should do, why, and how it fits into a larger system.

“Contextualize” is the new middle step. It means adding the meaning layer: what does this column represent, what are the edge cases, how does this table relate to others, what business rules apply. “Link” means connecting the contextualized data to downstream consumers in ways that preserve meaning — not just joining tables, but ensuring that the joins make sense for the business question being asked.

This reframe captures something real. The hardest part of building a data pipeline in 2026 isn’t the SQL. It’s knowing what the SQL should do. The context gap is the bottleneck, and context engineering is the discipline that addresses it.

What Context Engineering Looks Like in Practice

Context engineering for data pipelines involves several concrete practices:

Semantic catalogs. Machine-readable descriptions of what tables and columns mean. Not just “customer_id: integer” but “customer_id: unique identifier for a customer account, generated at account creation in Stripe, may not match CRM customer IDs for accounts created before the 2023 migration.” Tiger Data showed that these descriptions improve AI accuracy by 27%. The descriptions don’t have to be perfect — LLM-generated descriptions reviewed by humans are good enough to move the needle.
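In dbt, that kind of description lives in a model's schema file. A sketch of how the customer_id example above might be encoded — the model name and sourcing details are illustrative, not taken from any real project:

```yaml
models:
  - name: customers
    description: "One row per customer account, sourced from Stripe."
    columns:
      - name: customer_id
        description: >
          Unique identifier for a customer account, generated at account
          creation in Stripe. May not match CRM customer IDs for accounts
          created before the 2023 migration.
```

The second sentence of the description is the part that earns its keep: it is exactly the edge case an AI (or a new hire) cannot infer from `customer_id: integer`.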

Structured instruction files. CLAUDE.md is the simplest example — a file that tells Claude Code how this specific project works. But the pattern extends to any AI system that needs project context: coding conventions, naming standards, architectural decisions, things not to do. Each instruction prevents a class of errors.
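A minimal instruction file for a dbt project might look like this — the specific conventions are hypothetical, chosen only to show the shape:

```markdown
# CLAUDE.md

## Conventions
- Staging models are named stg_<source>__<entity> and live in models/staging/.
- All timestamps are stored in UTC; convert to local time at the reporting layer only.
- Never join on email; use customer_id (see the 2023 CRM migration note).

## Things not to do
- Do not add a column to a mart without a description and a not_null test.
```

Each bullet is a class of errors removed: the naming rule prevents convention drift, the UTC rule prevents double conversion, the join rule prevents a known data-quality trap.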

Documentation as AI input. Column descriptions, model descriptions, source definitions — all the metadata that dbt’s documentation system captures. Traditionally written for human consumption, these descriptions are now equally important as AI input. Better docs in your dbt project mean better AI-generated queries against your warehouse.

Test suites as specification. Tests encode what “correct” means. A unique + not_null test on order_id tells the AI this is a primary key. A relationships test tells it how tables connect. An accepted_values test tells it what valid states look like. The test suite is a machine-readable specification of data correctness.
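In dbt, the tests named above are declared next to the columns they constrain. A sketch for a hypothetical `orders` model:

```yaml
models:
  - name: orders
    columns:
      - name: order_id
        tests: [unique, not_null]      # reads as: this is the primary key
      - name: customer_id
        tests:
          - relationships:             # reads as: foreign key to customers
              to: ref('customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:           # reads as: the set of valid states
              values: [placed, shipped, delivered, returned]
```

Nothing here is written for an AI specifically — it is an ordinary dbt test suite — but it doubles as a machine-readable contract the AI can condition on.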

Lineage as context. How models connect, which tables are upstream of which, where business logic gets applied. dbt’s DAG captures this automatically. AI tools that can read the lineage understand the transformation chain better than AI working with isolated queries.
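dbt serializes that DAG into `target/manifest.json`, where each node records its direct parents under `depends_on`. A minimal sketch of walking those parent links to collect everything upstream of a model — the manifest excerpt is hypothetical, and a real manifest nests the same information under `nodes[<id>]["depends_on"]["nodes"]`:

```python
from collections import deque

# Hypothetical excerpt of a dbt manifest: node id -> its direct parents.
parents = {
    "model.shop.fct_orders": ["model.shop.stg_orders", "model.shop.stg_customers"],
    "model.shop.stg_orders": ["source.shop.stripe.orders"],
    "model.shop.stg_customers": ["source.shop.crm.customers"],
}

def upstream(node: str) -> set[str]:
    """Breadth-first walk over parent links: every ancestor of `node`."""
    seen, queue = set(), deque([node])
    while queue:
        for parent in parents.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(sorted(upstream("model.shop.fct_orders")))
```

Handing an AI this transitive closure — "fct_orders ultimately depends on Stripe orders and CRM customers" — is the difference between reasoning about a query in isolation and reasoning about the transformation chain it sits in.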

The Comprehension Risk

Anthropic’s own randomized controlled trial with 52 developers in January 2026 found that AI-assisted developers scored 17% lower on code comprehension tests (50% versus 67%). The largest gap was in debugging. Developers who delegated code generation to AI scored below 40%. Those who used AI for conceptual inquiry scored 65% or higher.

How you use AI determines whether it builds or erodes your capability. Using AI to write code you could write yourself — but faster — is a productivity gain. Using AI to write code you don’t understand is a comprehension loss. Over time, comprehension losses compound: you can’t debug what you don’t understand, you can’t architect what you haven’t built, and you can’t provide the context that AI needs if you don’t know what the context is.

This matters because context engineering requires deep understanding. You can’t write semantic catalog descriptions for tables you haven’t worked with. You can’t add edge case documentation for business rules you’ve never encountered. The people best positioned to do context engineering are the people who’ve spent years learning the data the hard way.

The Skills Pipeline Risk

Stanford’s Digital Economy Study found employment for software developers aged 22-25 declined nearly 20% from the late 2022 peak. Marc Benioff announced Salesforce would hire “no new engineers” in 2025. The risk is structural: seniors at the top, AI at the bottom, and very few juniors learning the craft in between.

AWS CEO Matt Garman pushed back directly: “That’s like, one of the dumbest things I’ve ever heard… How’s that going to work when ten years in the future you have no one that has learned anything?”

The real question is whether organizations maintain the pipeline of people who develop the judgment that AI lacks. Joe Reis’s 2026 survey of 1,101 practitioners found 82% use AI daily, but 64% are stuck in “experimenting” or “tactical tasks.” The gap between using AI to generate boilerplate and understanding why the boilerplate needs to look that way is where expertise develops.

If juniors never get the opportunity to write bad SQL, debug it, learn why it was wrong, and internalize the fix, they never develop the tacit knowledge that context engineering requires. You end up with a generation of practitioners who can prompt AI effectively but can’t evaluate whether the output is correct — which is exactly the situation where AI-generated SQL failure modes become most dangerous.

The Emerging Discipline

As Zach Wilson put it: “AI didn’t kill data engineering. It killed pretending data engineering was about typing code.” Syntax mastery is commoditized. What remains is context and judgment.

Business context and architectural judgment are not supplied by model updates; they come from working with data long enough to know what “Status” actually means. Context engineering operationalizes that knowledge: semantic catalogs, instruction files, comprehensive documentation, test suites as specifications, and lineage as machine-readable architecture.