Adrienne Vermorel
MCP Protocol Fundamentals: What Data Engineers Need to Know
Why MCP Exists
If you have been building data pipelines for any length of time, you know the pain of integrations. Every new data source requires custom code. Every API has its own authentication scheme. Every database connector has its own quirks. Now multiply that complexity by the number of AI applications that need access to your data infrastructure.
This is the N×M problem that the Model Context Protocol (MCP) was designed to solve.
Before MCP, connecting N AI applications to M data sources meant building and maintaining N×M custom connectors: each database, API, or file system needed its own bespoke integration for every AI application that wanted access. AI models stayed trapped behind legacy systems, unable to reach the context they needed, and the growing matrix of connectors became unsustainable as both N and M increased.
Anthropic introduced MCP in November 2024 as an open standard. The protocol enables developers to build secure, two-way connections between AI applications and external data sources, tools, and systems.
The analogy that stuck is “USB-C for AI”. Just as USB-C replaced a tangle of proprietary cables with a single universal connector, MCP replaces fragmented custom integrations with one protocol. Build a connector once, and any MCP-compatible AI application can use it.
For data engineers, the relevance is direct: databases, warehouses, and pipelines are exactly the kind of backend systems that AI assistants need to access. Instead of building custom integrations for Claude Desktop, then Cursor, then VS Code Copilot, then whatever comes next, you build one MCP server that works with all of them.
Adoption Context (January 2026)
MCP has moved well beyond experimental status. As of January 2026: 97+ million monthly SDK downloads, 10,000+ active MCP servers in production, and 75,300 GitHub stars on the official servers repository.
The adoption story accelerated in 2025 when major players joined. OpenAI adopted MCP in March 2025, integrating it across their Agents SDK and ChatGPT desktop application. Google DeepMind followed, and Microsoft announced Windows 11 MCP integration at Build 2025.
In December 2025, Anthropic donated MCP to the Agentic AI Foundation under the Linux Foundation, cementing its status as a vendor-neutral industry standard rather than a proprietary Anthropic technology.
If you are wondering whether to invest time in learning MCP, these numbers answer the “is this going to stick around?” question. The protocol has crossed the threshold from interesting experiment to infrastructure you can build on.
Architecture: Three Participants
MCP uses a client-server architecture with three participants: hosts, clients, and servers.
```
┌──────────────────────────────────────────────┐
│               Host Application               │
│   (Claude Desktop, VS Code, Cursor, etc.)    │
│                                              │
│  ┌──────────────┐      ┌──────────────┐      │
│  │ MCP Client A │      │ MCP Client B │      │
│  │  (Database)  │      │ (File System)│      │
│  └──────┬───────┘      └──────┬───────┘      │
└─────────┼─────────────────────┼──────────────┘
          │                     │
          ▼                     ▼
   ┌──────────────┐      ┌──────────────┐
   │ MCP Server A │      │ MCP Server B │
   │  (Postgres)  │      │ (Filesystem) │
   └──────────────┘      └──────────────┘
```

| Component | Role | Examples |
|---|---|---|
| MCP Host | AI application that coordinates multiple clients | Claude Desktop, VS Code, Cursor, custom agents |
| MCP Client | Maintains connection to a single MCP server | One client per server connection |
| MCP Server | Program that exposes context via MCP protocol | Database servers, GitHub server, filesystem server |
The host is the AI application you interact with directly: Claude Desktop, VS Code with Copilot, Cursor, or a custom agent you have built. The host creates and manages MCP clients.
Each client maintains a connection to exactly one MCP server. If your host needs to talk to both a Postgres database and a file system, it spins up two clients, one for each.
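In configuration terms (the format is covered under Transport Mechanisms below), that is simply two entries, each of which gets its own client. A sketch, assuming the reference filesystem server plus a Postgres server package whose name here is illustrative:

```json
{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres", "postgresql://localhost/analytics"]
    },
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/project"]
    }
  }
}
```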
The server is where you come in. An MCP server is a program that exposes capabilities through the MCP protocol. It might expose your data warehouse schema, execute queries, or provide access to pipeline metadata.
Under the hood, MCP has two layers: a data layer (JSON-RPC 2.0 handling lifecycle, capability negotiation, and primitives) and a transport layer (managing communication channels, connection establishment, and message framing).
Transport Mechanisms
MCP supports two transport mechanisms.
| Transport | Use Case | Description |
|---|---|---|
| stdio | Local servers | Uses standard input/output streams; optimal for same-machine communication; no network overhead |
| Streamable HTTP | Remote servers | HTTP POST for client-to-server + optional SSE for streaming; supports OAuth, bearer tokens |
stdio for Local Servers
When the MCP server runs on the same machine as the host, stdio is the simplest option. The host spawns the server as a subprocess and communicates via standard input/output streams. No network configuration, no ports to open, no TLS to configure.
{ "mcpServers": { "filesystem": { "command": "npx", "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/directory"] } }}This tells the host to spawn an MCP server using npx, running the filesystem server package with access to a specific directory. The host and server communicate through stdin/stdout.
Streamable HTTP for Remote Servers
For servers running on different machines, Streamable HTTP provides the transport layer. This supports production scenarios where your MCP server runs in a cloud environment, behind authentication.
{ "mcpServers": { "remote-server": { "url": "http://example.com/mcp", "env": { "API_KEY": "your-api-key" } } }}Streamable HTTP uses standard HTTP POST for client-to-server messages with optional Server-Sent Events (SSE) for streaming responses. It integrates with OAuth 2.1 for authentication.
Note: You may encounter references to SSE as a standalone transport mechanism. That older HTTP+SSE transport has been deprecated in favor of Streamable HTTP, which keeps SSE only as an optional streaming channel.
Message Format
All MCP communication uses JSON-RPC 2.0. Understanding these messages helps when debugging.
When a client connects to a server, they first negotiate capabilities through an initialization handshake.
Initialize Request:
{ "jsonrpc": "2.0", "id": 1, "method": "initialize", "params": { "protocolVersion": "2025-06-18", "capabilities": { "elicitation": {} }, "clientInfo": { "name": "example-client", "version": "1.0.0" } }}Initialize Response:
{ "jsonrpc": "2.0", "id": 1, "result": { "protocolVersion": "2025-06-18", "capabilities": { "tools": {"listChanged": true}, "resources": {} }, "serverInfo": { "name": "example-server", "version": "1.0.0" } }}Once connected, the client can discover what tools the server offers:
Tool Discovery Request:
{"jsonrpc": "2.0", "id": 2, "method": "tools/list"}And invoke those tools:
And invoke those tools:

Tool Call Request:
{ "jsonrpc": "2.0", "id": 3, "method": "tools/call", "params": { "name": "query_database", "arguments": { "query": "SELECT * FROM customers LIMIT 10", "database": "production" } }}The JSON-RPC format is straightforward: every message has a protocol version, an ID for matching requests to responses, a method name, and parameters. Responses include the same ID plus either a result or an error.
Core Primitives
MCP defines three server primitives (what servers expose) and three client primitives (what clients expose to servers).
Server Primitives
| Primitive | Control | Purpose | Example |
|---|---|---|---|
| Tools | Model-controlled | Executable functions with potential side effects | API calls, database queries, file operations |
| Resources | Application-controlled | Data sources for contextual information (like GET endpoints) | File contents, database schemas, API responses |
| Prompts | User-controlled | Reusable templates for LLM interactions | Code review templates, debugging assistants |
Tools are the most commonly used primitive. They represent executable functions that the AI model can invoke: a database query tool, a pipeline trigger tool, a data validation tool. Tools can have side effects, which is why they require explicit invocation.
Resources provide contextual data without side effects. Think of them as read-only endpoints. A resource might expose your dbt project’s model documentation or your data warehouse’s table schemas. The AI can fetch resources to understand context without risk of modifying anything.
Prompts are reusable templates that guide LLM interactions. A “data quality analysis” prompt template could help users consistently request quality checks across different tables.
In Python using the FastMCP framework, these primitives look like this:
```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("DataServer")

@mcp.tool()
def query_database(query: str, database: str = "production") -> str:
    """Execute a SQL query against the specified database."""
    return f"Query executed on {database}: {query}"

@mcp.resource("schema://{table_name}")
def get_table_schema(table_name: str) -> str:
    """Get the schema for a specific table."""
    return f"Schema for {table_name}: id INT, name VARCHAR(255)..."

@mcp.prompt(title="Code Review")
def review_code(code: str) -> str:
    return f"Please review this code:\n\n{code}"

if __name__ == "__main__":
    mcp.run()  # defaults to the stdio transport
```

Client Primitives
Servers can also request capabilities from clients:
| Primitive | Purpose |
|---|---|
| Sampling | Allows servers to request LLM completions from the host |
| Elicitation | Allows servers to request additional information from users |
| Roots | Filesystem boundaries that clients expose to servers for security |
Sampling is worth noting for data engineering use cases. It lets your MCP server request that the host’s LLM analyze or summarize data, enabling multi-step workflows where the server fetches data and the LLM processes it.
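On the wire, sampling travels in the opposite direction from everything shown so far: the server asks the client's host to run an LLM completion. A sketch of a sampling/createMessage request (the message text and token limit are illustrative):

```json
{
  "jsonrpc": "2.0",
  "id": 4,
  "method": "sampling/createMessage",
  "params": {
    "messages": [
      {
        "role": "user",
        "content": {
          "type": "text",
          "text": "Summarize the failed checks in this data quality report: ..."
        }
      }
    ],
    "maxTokens": 500
  }
}
```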
Roots define which filesystem paths a server may operate on. The client communicates these boundaries to the server, which is expected to confine its file access to the declared roots rather than reaching outside its designated scope.
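A server discovers its boundaries by sending a roots/list request; the client answers with the directories the user has granted. A sketch of that response (paths illustrative):

```json
{
  "jsonrpc": "2.0",
  "id": 5,
  "result": {
    "roots": [
      {
        "uri": "file:///home/user/data-pipelines",
        "name": "Data Pipelines Repository"
      }
    ]
  }
}
```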
Security Model
Security in MCP follows a layered trust boundary model:
```
User → AI Host → MCP Client → MCP Server(s) → Backend APIs/Databases
          │                         │
   Security Broker           Resource Server
  (Access Control)         (Token Validation)
```

Authentication with OAuth 2.1
For remote MCP servers, the protocol mandates OAuth 2.1 for authentication. PKCE (Proof Key for Code Exchange) is required for all flows, protecting against authorization code interception. Tokens are scoped to specific MCP servers via Resource Indicators (RFC 8707), and Dynamic Client Registration (RFC 7591) is recommended for runtime credential acquisition.
Security Principles
The host application acts as a security intermediary between the LLM and external resources.
Each MCP server runs in its own process with its own credentials, so a compromised server cannot access credentials for other servers. Servers must validate that tokens were issued specifically for them and must never pass tokens through to upstream APIs. This prevents "confused deputy" attacks, where a malicious request tricks a server into exercising its authority on an attacker's behalf.
Sensitive operations with side effects require explicit user consent. The host presents tool invocations to the user before executing them. And through the Roots primitive, servers only access filesystem paths they have been explicitly granted.
For data engineers, this means you can build MCP servers that access production databases without exposing those credentials to the AI model itself. The credentials stay in the server process, isolated from both the host and other servers.
MCP vs Traditional APIs
You might wonder why MCP exists when REST APIs already connect systems. The key distinction comes down to the intended consumer.
| Aspect | Traditional APIs | MCP |
|---|---|---|
| Target Consumer | Human developers, applications | AI models and agents |
| Discovery | OpenAPI docs, manual integration | Dynamic capability discovery at runtime |
| Context | Request-response, stateless | Rich context with metadata, stateful sessions |
| Interaction | Client-initiated only | Bidirectional (sampling, elicitation) |
| Output Format | Fixed schemas | Multiple content types (text, images, resources) |
Traditional APIs are designed for human developers who read documentation, understand schemas, and write integration code. They work well for machine-to-machine communication where the calling system knows exactly what it needs.
MCP is designed for AI models that discover capabilities at runtime, need rich context about what tools can do, and benefit from bidirectional communication. An AI assistant does not read your API documentation. It queries available tools, reads their descriptions, and decides how to use them based on the user’s request.
Traditional APIs are strictly client-initiated: the client sends a request, the server responds. MCP supports server-initiated communication through sampling (asking the LLM to process data) and elicitation (asking the user for more information). This enables workflows that were awkward or impossible with REST.