The traditional build-vs-buy case for data pipelines rested on a Wakefield Research finding: data engineers spend 44% of their time building and maintaining pipelines, costing organizations approximately $520,000 per year. Custom connectors take 50-100 hours each to develop. That calculus depended on three assumptions, and all three changed in 2025.
The Three Converging Shifts
Three independent developments converged, and their effects compound.
1. Managed ELT Pricing Became Unpredictable
The March 2025 Fivetran pricing change eliminated bulk discounts and introduced per-connector MAR (monthly active rows) tiering. Teams with many connectors saw 70% cost increases. Marketing data, which updates constantly due to retroactive attribution, became particularly expensive. The minimum annual contract sits at $12,000 before any data flows.
The cost predictability that justified managed ELT (“pay a flat rate, avoid surprise infrastructure costs”) evaporated. One user reported going from $20/month to $2,000/month as data volume grew. Another reported a 182x reduction in monthly ETL costs after switching from Fivetran to dlt.
2. AI Development Velocity Was Measured
The productivity gains from AI-assisted coding are no longer hypothetical. Controlled studies have put numbers on them:
- 55.8% faster implementation. A controlled experiment published on arXiv found developers completed an HTTP server implementation in 1 hour 11 minutes with GitHub Copilot versus 2 hours 41 minutes without.
- 12-21% more pull requests per week. A Microsoft and Accenture field experiment measured the throughput gain in a real-world setting.
- 56% more likely to pass all unit tests. GitHub’s own research found AI-assisted developers produced code that passed tests at a higher rate.
For data pipeline development specifically, the gains may be even larger. Pipelines follow well-established patterns — API authentication, pagination, rate limiting, schema mapping, incremental loading. These are exactly the kinds of pattern-heavy tasks where AI excels: one dlt user described completing an entire pipeline “in five minutes using the library’s documentation.”
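To make that concrete, here is a minimal sketch of the pattern-heavy code this describes. The endpoint, field names, and pagination scheme are assumptions for illustration, not a real API; only the dlt calls (`dlt.resource`, `dlt.sources.incremental`, `dlt.pipeline`) are the library’s actual interface.

```python
import dlt
from dlt.sources.helpers import requests  # dlt's retry-aware requests wrapper

# Hypothetical /events endpoint with page-number pagination and an
# `updated_at` field usable for incremental loading.
@dlt.resource(table_name="events", write_disposition="merge", primary_key="id")
def events(
    updated_at=dlt.sources.incremental("updated_at", initial_value="2025-01-01T00:00:00Z"),
):
    page = 1
    while True:
        resp = requests.get(
            "https://api.example.com/events",  # assumed endpoint
            params={
                "page": page,
                "per_page": 100,
                "updated_since": updated_at.last_value,  # resume where the last run stopped
            },
        )
        resp.raise_for_status()
        items = resp.json()
        if not items:
            break
        yield items  # dlt infers the schema and handles typing
        page += 1

pipeline = dlt.pipeline(pipeline_name="events", destination="duckdb", dataset_name="raw")
print(pipeline.run(events()))
```

Retries, state tracking, and schema inference are the parts the framework absorbs; an AI assistant mostly fills in the endpoint-specific glue.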
3. Open-Source Tools Reached Production Maturity
dlt (data load tool) hit 3 million monthly downloads. In September 2024 alone, users created 50,000 custom connectors, a 20x increase from January. The library passed its 1.0 stability milestone and now sits at version 1.19, with production users including Artsy and PostHog.
The maturity matters because it changes what “building” means. Building a connector in 2023 meant writing API calls, pagination logic, rate limiting, schema mapping, error recovery, and state management from scratch. Building a connector in 2026 means writing a declarative configuration for a well-tested framework that handles the hard parts.
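In dlt 1.x, that declarative configuration looks roughly like the sketch below, using the library’s generic `rest_api_source`. The base URL, resource name, and secret path are placeholders.

```python
import dlt
from dlt.sources.rest_api import rest_api_source

# Declarative connector: the framework supplies pagination, auth,
# retries, schema inference, and incremental state.
source = rest_api_source({
    "client": {
        "base_url": "https://api.example.com/v1/",  # placeholder
        "auth": {"token": dlt.secrets["sources.example.api_token"]},
        "paginator": {"type": "offset", "limit": 100},
    },
    "resources": [
        {
            "name": "customers",  # placeholder resource
            "endpoint": {"path": "customers", "params": {"status": "active"}},
            "primary_key": "id",
            "write_disposition": "merge",
        },
    ],
})

pipeline = dlt.pipeline(pipeline_name="example_api", destination="duckdb", dataset_name="raw")
pipeline.run(source)
```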
The Compounding Effect
Building pipelines is cheaper because AI accelerates development. AI-assisted pipeline development is practical because dlt provides the framework and patterns that AI can work with. Managed solution costs continue rising, widening the gap.
The traditional 50-100 hours per connector estimate predates both AI assistance and mature frameworks. With dlt plus AI, that number drops to 10-20 hours for standard API patterns, and the engineering time behind the $520,000 annual maintenance figure becomes an investment when it displaces $100,000+ in MAR fees.
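A back-of-envelope version of that comparison, with the build hours and maintenance allowance as explicit assumptions:

```python
HOURLY_RATE = 200            # senior data engineer, fully loaded (assumed)
BUILD_HOURS = 15             # midpoint of the 10-20 hour dlt + AI estimate
UPKEEP_HOURS_PER_YEAR = 20   # assumed ongoing maintenance per connector

MANAGED_MINIMUM_PER_YEAR = 12_000  # Fivetran's minimum annual contract

build_first_year = (BUILD_HOURS + UPKEEP_HOURS_PER_YEAR) * HOURLY_RATE
print(f"build: ${build_first_year:,} vs buy: ${MANAGED_MINIMUM_PER_YEAR:,}")
# build: $7,000 vs buy: $12,000 -- and the buy side scales with MAR
```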
Where AI Helps (and Where It Doesn’t)
AI assistance isn’t magic, and understanding where it delivers determines whether the “build” option actually works in practice.
AI excels at the tedious parts: boilerplate code, API connector scaffolding, ETL structure, configuration files, SQL generation, and test creation. In short, it handles pattern-based code where the implementation follows established examples. dlt’s LLM-friendly documentation makes this workflow particularly effective: AI assistants can generate pipeline configurations directly from API documentation.
AI struggles with what matters most: complex business logic, edge cases the API doesn’t document, and system architecture decisions. Security is a real risk: one study found 29.1% of AI-generated Python code contains security weaknesses. Performance optimization for high-volume scenarios also resists automation, as do the judgment calls that distinguish working code from production-ready code. This is the same production gap that affects all AI-assisted development, not just pipelines.
The maintenance question is nuanced. GitClear research projected that code churn would double in 2024 versus the pre-AI baseline: more code added, more code copy-pasted, less refactoring. AI often reproduces outdated patterns. But counterexamples exist: Amazon Q Developer reduced Java upgrade times from 50 developer-days to hours for Kyndryl, with estimated savings equivalent to 4,500 developer-years. For pipelines specifically, dlt handles schema evolution automatically, which addresses a large portion of the maintenance burden.
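The schema-evolution claim is easy to verify: run a pipeline twice with a field appearing in between, and dlt alters the destination table rather than failing the load. A minimal demonstration:

```python
import dlt

pipeline = dlt.pipeline(pipeline_name="evolve_demo", destination="duckdb", dataset_name="raw")

# First load: the table is created with `id` and `name` columns.
pipeline.run([{"id": 1, "name": "ada"}], table_name="users")

# Second load: the upstream API added a field. dlt migrates the
# table, adding a nullable `plan` column, instead of breaking.
pipeline.run([{"id": 2, "name": "bo", "plan": "pro"}], table_name="users")
```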
The New Decision Framework
The new framework requires actual calculation across three factors.
The one-day rule. If your monthly MAR cost for a source exceeds what a senior engineer costs for a day of work, building probably wins. A senior data engineer at $200/hour represents $1,600 for a day’s work; if a single Fivetran connector costs more than that per month, the economics favor building.
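As a function, the rule is one comparison (the rate and hours are the assumptions above):

```python
def build_wins(monthly_mar_cost: float, hourly_rate: float = 200, day_hours: float = 8) -> bool:
    """One-day rule: build when a source's monthly MAR bill exceeds
    one engineer-day of work."""
    return monthly_mar_cost > hourly_rate * day_hours

print(build_wins(2_000))  # True:  $2,000/month > $1,600 engineer-day
print(build_wins(900))    # False: keep buying this one
```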
The capability check. Building requires Python proficiency on the team, AI assistant access, and tolerance for managing your own infrastructure. If any of these are missing, the “buy” option may still be cheaper even at elevated prices.
The compounding benefit. Each pipeline you build develops patterns and reusable components — authentication handling, error management, deployment scripts. The second connector takes less time than the first. The fifth takes a fraction. Managed tools don’t compound this way; the tenth connector costs the same as the first.
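What that compounding looks like in code: after the first connector, shared pieces get extracted into helpers, and each later connector is mostly declaration. Everything here (endpoints, response shape, the `paginate` helper) is a hypothetical illustration.

```python
from typing import Any, Iterator

import dlt
from dlt.sources.helpers import requests


def paginate(url: str, token: str, **params: Any) -> Iterator[list[dict]]:
    # Hypothetical cursor-pagination helper extracted after connector #1;
    # assumes responses shaped like {"data": [...], "next_page": url | null}.
    while url:
        resp = requests.get(url, headers={"Authorization": f"Bearer {token}"}, params=params)
        resp.raise_for_status()
        body = resp.json()
        yield body["data"]
        url, params = body.get("next_page"), {}  # cursor URL carries the query


# Connectors #2 and onward become a few lines each.
@dlt.resource(primary_key="id", write_disposition="merge")
def invoices(token: str = dlt.secrets.value):
    yield from paginate("https://api.example.com/invoices", token)


@dlt.resource(primary_key="id", write_disposition="merge")
def payments(token: str = dlt.secrets.value):
    yield from paginate("https://api.example.com/payments", token)
```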
The practical answer for most teams is a hybrid approach: managed tools for stable, standard sources where maintenance burden is low, and custom pipelines for high-MAR, high-control, or unsupported sources where the economics favor building.