Adrienne Vermorel
MCP Protocol Fundamentals: What Data Engineers Need to Know
Why MCP Exists
If you have been building data pipelines for any length of time, you know the pain of integrations. Every new data source requires custom code. Every API has its own authentication scheme. Every database connector has its own quirks. Now multiply that complexity by the number of AI applications that need access to your data infrastructure.
This is the N×M problem that the Model Context Protocol (MCP) was designed to solve.
Before MCP, connecting N AI applications to M data sources meant building and maintaining N×M custom connectors: each database, API, or file system needed its own bespoke integration for every AI application that wanted access. AI models stayed trapped behind legacy systems, unable to reach the context they needed, and the growing matrix of connectors became unsustainable as both N and M increased.
Anthropic introduced MCP in November 2024 as an open standard. The protocol enables developers to build secure, two-way connections between AI applications and external data sources, tools, and systems.
The analogy that stuck is “USB-C for AI”. Just as USB-C replaced a tangle of proprietary cables with a single universal connector, MCP replaces fragmented custom integrations with one protocol. Build a connector once, and any MCP-compatible AI application can use it.
For data engineers, the relevance is direct: databases, warehouses, and pipelines are exactly the kind of backend systems that AI assistants need to access. Instead of building custom integrations for Claude Desktop, then Cursor, then VS Code Copilot, then whatever comes next, you build one MCP server that works with all of them.
Adoption Context (January 2026)
MCP has moved well beyond experimental status. As of January 2026: 97+ million monthly SDK downloads, 10,000+ active MCP servers in production, and 75,300 GitHub stars on the official servers repository.
The adoption story accelerated in 2025 when major players joined. OpenAI adopted MCP in March 2025, integrating it across their Agents SDK and ChatGPT desktop application. Google DeepMind followed, and Microsoft announced Windows 11 MCP integration at Build 2025.
In December 2025, Anthropic donated MCP to the Agentic AI Foundation under the Linux Foundation, cementing its status as a vendor-neutral industry standard rather than a proprietary Anthropic technology.
If you are wondering whether to invest time in learning MCP, these numbers answer the “is this going to stick around?” question. The protocol has crossed the threshold from interesting experiment to infrastructure you can build on.
Architecture: Three Participants
MCP uses a client-server architecture with three participants: hosts, clients, and servers.
```
┌──────────────────────────────────────────────┐
│               Host Application               │
│   (Claude Desktop, VS Code, Cursor, etc.)    │
│                                              │
│  ┌──────────────┐      ┌──────────────┐      │
│  │ MCP Client A │      │ MCP Client B │      │
│  │  (Database)  │      │ (File System)│      │
│  └──────┬───────┘      └──────┬───────┘      │
└─────────┼─────────────────────┼──────────────┘
          │                     │
          ▼                     ▼
   ┌──────────────┐      ┌──────────────┐
   │ MCP Server A │      │ MCP Server B │
   │  (Postgres)  │      │ (Filesystem) │
   └──────────────┘      └──────────────┘
```

| Component | Role | Examples |
|---|---|---|
| MCP Host | AI application that coordinates multiple clients | Claude Desktop, VS Code, Cursor, custom agents |
| MCP Client | Maintains connection to a single MCP server | One client per server connection |
| MCP Server | Program that exposes context via MCP protocol | Database servers, GitHub server, filesystem server |
The host is the AI application you interact with directly: Claude Desktop, VS Code with Copilot, Cursor, or a custom agent you have built. The host creates and manages MCP clients.
Each client maintains a connection to exactly one MCP server. If your host needs to talk to both a Postgres database and a file system, it spins up two clients, one for each.
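In configuration terms (the format is covered under Transport Mechanisms below), that is simply two entries, each of which gets its own client. A sketch, assuming the reference filesystem server plus a Postgres server package whose name here is illustrative:

```json
{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres", "postgresql://localhost/analytics"]
    },
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/project"]
    }
  }
}
```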
The server is where you come in. An MCP server is a program that exposes capabilities through the MCP protocol. It might expose your data warehouse schema, execute queries, or provide access to pipeline metadata.
Under the hood, MCP has two layers: a data layer (JSON-RPC 2.0 handling lifecycle, capability negotiation, and primitives) and a transport layer (managing communication channels, connection establishment, and message framing).
Transport Mechanisms
MCP supports two transport mechanisms.
| Transport | Use Case | Description |
|---|---|---|
| stdio | Local servers | Uses standard input/output streams; optimal for same-machine communication; no network overhead |
| Streamable HTTP | Remote servers | HTTP POST for client-to-server + optional SSE for streaming; supports OAuth, bearer tokens |
stdio for Local Servers
When the MCP server runs on the same machine as the host, stdio is the simplest option. The host spawns the server as a subprocess and communicates via standard input/output streams. No network configuration, no ports to open, no TLS to configure.
{ "mcpServers": { "filesystem": { "command": "npx", "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/directory"] } }}This tells the host to spawn an MCP server using npx, running the filesystem server package with access to a specific directory. The host and server communicate through stdin/stdout.
Streamable HTTP for Remote Servers
For servers running on different machines, Streamable HTTP provides the transport layer. This supports production scenarios where your MCP server runs in a cloud environment, behind authentication.
{ "mcpServers": { "remote-server": { "url": "http://example.com/mcp", "env": { "API_KEY": "your-api-key" } } }}Streamable HTTP uses standard HTTP POST for client-to-server messages with optional Server-Sent Events (SSE) for streaming responses. It integrates with OAuth 2.1 for authentication.
Note: You may encounter references to SSE as a standalone transport mechanism. That older HTTP+SSE transport has been deprecated in favor of Streamable HTTP, which keeps SSE only as an optional streaming channel.
Message Format
All MCP communication uses JSON-RPC 2.0. Understanding these messages helps when debugging.
When a client connects to a server, they first negotiate capabilities through an initialization handshake.
Initialize Request:
{ "jsonrpc": "2.0", "id": 1, "method": "initialize", "params": { "protocolVersion": "2025-06-18", "capabilities": { "elicitation": {} }, "clientInfo": { "name": "example-client", "version": "1.0.0" } }}Initialize Response:
{ "jsonrpc": "2.0", "id": 1, "result": { "protocolVersion": "2025-06-18", "capabilities": { "tools": {"listChanged": true}, "resources": {} }, "serverInfo": { "name": "example-server", "version": "1.0.0" } }}Once connected, the client can discover what tools the server offers:
Tool Discovery Request:
{"jsonrpc": "2.0", "id": 2, "method": "tools/list"}And invoke those tools:
And invoke those tools:

Tool Call Request:
{ "jsonrpc": "2.0", "id": 3, "method": "tools/call", "params": { "name": "query_database", "arguments": { "query": "SELECT * FROM customers LIMIT 10", "database": "production" } }}The JSON-RPC format is straightforward: every message has a protocol version, an ID for matching requests to responses, a method name, and parameters. Responses include the same ID plus either a result or an error.
Core Primitives
MCP defines three server primitives (what servers expose) and three client primitives (what clients expose to servers).
Server Primitives
| Primitive | Control | Purpose | Example |
|---|---|---|---|
| Tools | Model-controlled | Executable functions with potential side effects | API calls, database queries, file operations |
| Resources | Application-controlled | Data sources for contextual information (like GET endpoints) | File contents, database schemas, API responses |
| Prompts | User-controlled | Reusable templates for LLM interactions | Code review templates, debugging assistants |
Tools are the most commonly used primitive. They represent executable functions that the AI model can invoke: a database query tool, a pipeline trigger tool, a data validation tool. Tools can have side effects, which is why they require explicit invocation.
Resources provide contextual data without side effects. Think of them as read-only endpoints. A resource might expose your dbt project’s model documentation or your data warehouse’s table schemas. The AI can fetch resources to understand context without risk of modifying anything.
Prompts are reusable templates that guide LLM interactions. A “data quality analysis” prompt template could help users consistently request quality checks across different tables.
In Python using the FastMCP framework, these primitives look like this:
```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("DataServer")

@mcp.tool()
def query_database(query: str, database: str = "production") -> str:
    """Execute a SQL query against the specified database."""
    return f"Query executed on {database}: {query}"

@mcp.resource("schema://{table_name}")
def get_table_schema(table_name: str) -> str:
    """Get the schema for a specific table."""
    return f"Schema for {table_name}: id INT, name VARCHAR(255)..."

@mcp.prompt(title="Code Review")
def review_code(code: str) -> str:
    return f"Please review this code:\n\n{code}"

if __name__ == "__main__":
    mcp.run()  # defaults to the stdio transport
```

Client Primitives
Servers can also request capabilities from clients:
| Primitive | Purpose |
|---|---|
| Sampling | Allows servers to request LLM completions from the host |
| Elicitation | Allows servers to request additional information from users |
| Roots | Filesystem boundaries that clients expose to servers for security |
Sampling is worth noting for data engineering use cases. It lets your MCP server request that the host’s LLM analyze or summarize data, enabling multi-step workflows where the server fetches data and the LLM processes it.
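On the wire, sampling travels in the opposite direction from everything shown so far: the server asks the client's host to run an LLM completion. A sketch of a sampling/createMessage request (the message text and token limit are illustrative):

```json
{
  "jsonrpc": "2.0",
  "id": 4,
  "method": "sampling/createMessage",
  "params": {
    "messages": [
      {
        "role": "user",
        "content": {
          "type": "text",
          "text": "Summarize the failed checks in this data quality report: ..."
        }
      }
    ],
    "maxTokens": 500
  }
}
```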
Roots define which filesystem paths a server may operate on. The client communicates these boundaries to the server, which is expected to confine its file access to the declared roots rather than reaching outside its designated scope.
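A server discovers its boundaries by sending a roots/list request; the client answers with the directories the user has granted. A sketch of that response (paths illustrative):

```json
{
  "jsonrpc": "2.0",
  "id": 5,
  "result": {
    "roots": [
      {
        "uri": "file:///home/user/data-pipelines",
        "name": "Data Pipelines Repository"
      }
    ]
  }
}
```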
Security Model
Security in MCP follows a layered trust boundary model:
```
User → AI Host → MCP Client → MCP Server(s) → Backend APIs/Databases
          │                         │
   Security Broker           Resource Server
  (Access Control)         (Token Validation)
```

Authentication with OAuth 2.1
For remote MCP servers, the protocol mandates OAuth 2.1 for authentication. PKCE (Proof Key for Code Exchange) is required for all flows, protecting against authorization code interception. Tokens are scoped to specific MCP servers via Resource Indicators (RFC 8707), and Dynamic Client Registration (RFC 7591) is recommended for runtime credential acquisition.
Security Principles
The host application acts as a security intermediary between the LLM and external resources.
Each MCP server runs in its own process with its own credentials, so a compromised server cannot access credentials for other servers. Servers must validate that tokens were issued specifically for them and must never pass tokens through to upstream APIs. This prevents "confused deputy" attacks, where a malicious request tricks a server into exercising its authority on an attacker's behalf.
Sensitive operations with side effects require explicit user consent. The host presents tool invocations to the user before executing them. And through the Roots primitive, servers only access filesystem paths they have been explicitly granted.
For data engineers, this means you can build MCP servers that access production databases without exposing those credentials to the AI model itself. The credentials stay in the server process, isolated from both the host and other servers.
MCP vs Traditional APIs
You might wonder why MCP exists when REST APIs already connect systems. The key distinction comes down to the intended consumer.
| Aspect | Traditional APIs | MCP |
|---|---|---|
| Target Consumer | Human developers, applications | AI models and agents |
| Discovery | OpenAPI docs, manual integration | Dynamic capability discovery at runtime |
| Context | Request-response, stateless | Rich context with metadata, stateful sessions |
| Interaction | Client-initiated only | Bidirectional (sampling, elicitation) |
| Output Format | Fixed schemas | Multiple content types (text, images, resources) |
Traditional APIs are designed for human developers who read documentation, understand schemas, and write integration code. They work well for machine-to-machine communication where the calling system knows exactly what it needs.
MCP is designed for AI models that discover capabilities at runtime, need rich context about what tools can do, and benefit from bidirectional communication. An AI assistant does not read your API documentation. It queries available tools, reads their descriptions, and decides how to use them based on the user’s request.
Traditional APIs are strictly client-initiated: the client sends a request, the server responds. MCP supports server-initiated communication through sampling (asking the LLM to process data) and elicitation (asking the user for more information). This enables workflows that were awkward or impossible with REST.