MCP Data Catalog Server Pattern

A data catalog MCP server exposes internal catalog data as AI-accessible tools, allowing an assistant to answer questions like “what tables do we have about customers?” or “what feeds into the revenue dashboard?” without opening a browser or navigating a catalog UI. If the catalog is homegrown or running on a platform without an official MCP server, this pattern provides a starting point.

The pattern below uses a simulated in-memory catalog. In production, replace the dictionary lookups with calls to your actual catalog API — DataHub, OpenMetadata, Atlan, or whatever you’re running.

The Server

from mcp.server.fastmcp import FastMCP
import json

mcp = FastMCP("DataCatalogMCP")

# Simulated catalog data - replace with your actual catalog API
CATALOG = {
    "sales.orders": {
        "description": "Order transactions from all channels",
        "columns": ["order_id", "customer_id", "total_amount", "created_at"],
        "owner": "sales-team",
        "tags": ["pii", "financial"],
        "upstream": ["raw.shopify_orders", "raw.pos_transactions", "sales.customers"],
        "downstream": ["analytics.revenue_daily", "ml.churn_features"]
    },
    "sales.customers": {
        "description": "Customer master data with demographics",
        "columns": ["customer_id", "email", "segment", "lifetime_value"],
        "owner": "marketing-team",
        "tags": ["pii"],
        "upstream": ["raw.crm_contacts"],
        "downstream": ["sales.orders", "analytics.cohorts"]
    }
}

Three Core Tools

Table Search

The most-used tool. Users ask vague questions — “anything about orders?” — and the AI needs to find relevant tables:

@mcp.tool()
def search_tables(
    query: str,
    tags: list[str] | None = None,
    limit: int = 10
) -> str:
    """Search for tables in the data catalog.

    Args:
        query: Search term to match against table names and descriptions
        tags: Optional tags; a table matches if it has any of them (e.g., ['pii', 'financial'])
        limit: Maximum number of results to return

    Returns:
        Matching tables with descriptions
    """
    results = []
    for table_name, metadata in CATALOG.items():
        if query.lower() in table_name.lower() or query.lower() in metadata["description"].lower():
            if tags is None or any(t in metadata["tags"] for t in tags):
                results.append({
                    "name": table_name,
                    "description": metadata["description"],
                    "owner": metadata["owner"],
                    "tags": metadata["tags"]
                })
    return json.dumps(results[:limit], indent=2)

In production, this search would hit your catalog’s search API — most catalogs offer full-text search, tag filtering, and relevance ranking out of the box. The MCP tool is a thin wrapper that translates the AI’s parameters into your catalog’s query format.
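As a rough sketch of that thin wrapper, the function below translates the tool's parameters into a request body. The field names (`q`, `size`, `filters`) and the 50-result cap are hypothetical; they stand in for whatever your catalog's search API actually expects.

```python
from typing import Any, Optional

MAX_LIMIT = 50  # assumed server-side cap, not taken from any real catalog


def build_search_request(
    query: str,
    tags: Optional[list[str]] = None,
    limit: int = 10,
) -> dict[str, Any]:
    """Translate tool parameters into a hypothetical search payload."""
    payload: dict[str, Any] = {
        "q": query.strip(),
        # Keep the page size within 1..MAX_LIMIT regardless of what the AI asks for.
        "size": max(1, min(limit, MAX_LIMIT)),
    }
    if tags:
        # Normalize and deduplicate tags before passing them on.
        payload["filters"] = {"tags": sorted({t.lower() for t in tags})}
    return payload
```

Keeping the translation in a pure function like this makes it easy to unit-test separately from the network call that sends the payload.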

Table Details

Once the AI finds a table, it needs the full picture:

@mcp.tool()
def get_table_details(table_name: str) -> str:
    """Get detailed metadata for a specific table.

    Args:
        table_name: Fully qualified table name (e.g., 'sales.orders')

    Returns:
        Complete table metadata including columns, owner, and tags
    """
    if table_name not in CATALOG:
        return json.dumps({"error": f"Table '{table_name}' not found in catalog"})

    metadata = CATALOG[table_name]
    return json.dumps({
        "table": table_name,
        **metadata
    }, indent=2)

This tool returns everything: columns, owner, tags, upstream and downstream dependencies. The AI uses this to understand table structure before writing queries or making recommendations.

Data Lineage

Lineage traversal is a key capability of the catalog server. “What feeds into the revenue dashboard?” is a question most catalog UIs require several navigation steps to answer; the MCP tool resolves it in a single call:

@mcp.tool()
def get_data_lineage(
    table_name: str,
    direction: str = "both",
    depth: int = 2
) -> str:
    """Trace data lineage for a table.

    Args:
        table_name: Table to trace lineage for
        direction: 'upstream', 'downstream', or 'both'
        depth: How many levels to traverse (1-5)

    Returns:
        Lineage graph showing data flow
    """
    if table_name not in CATALOG:
        return json.dumps({"error": f"Table '{table_name}' not found"})

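    depth = max(1, min(depth, 5))  # clamp to the documented 1-5 range

    def walk(key: str) -> list[str]:
        # Breadth-first walk along `key` edges, up to `depth` levels.
        # Tables missing from CATALOG are reported but not expanded.
        found, frontier = [], [table_name]
        for _ in range(depth):
            next_frontier = []
            for name in frontier:
                for neighbor in CATALOG.get(name, {}).get(key, []):
                    if neighbor != table_name and neighbor not in found:
                        found.append(neighbor)
                        next_frontier.append(neighbor)
            frontier = next_frontier
        return found

    lineage = {"table": table_name}
    if direction in ("upstream", "both"):
        lineage["upstream"] = walk("upstream")
    if direction in ("downstream", "both"):
        lineage["downstream"] = walk("downstream")

    return json.dumps(lineage, indent=2)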

if __name__ == "__main__":
    mcp.run(transport="stdio")

The depth parameter matters in production. A depth-1 lineage query is fast, but a depth-5 query can traverse hundreds of nodes in a complex data warehouse. Enforce a default and a hard upper bound on the server side, rather than relying on the docstring alone, so the AI cannot request expensive graph traversals.
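One way to bound the cost, sketched below, is to cap the total number of nodes returned as well as the depth, and to tell the caller when the result was cut short. The `DOWNSTREAM` map, the `MAX_NODES` budget of 200, and the `truncated` field are assumptions for illustration, not part of any catalog API.

```python
from collections import deque

# Hypothetical adjacency map standing in for a catalog lineage API.
DOWNSTREAM = {
    "raw.crm_contacts": ["sales.customers"],
    "sales.customers": ["sales.orders", "analytics.cohorts"],
    "sales.orders": ["analytics.revenue_daily", "ml.churn_features"],
}

MAX_DEPTH = 5    # upper bound stated in the tool's docstring
MAX_NODES = 200  # assumed budget; tune for your warehouse


def bounded_lineage(start, edges, depth=2, max_nodes=MAX_NODES):
    """Breadth-first traversal capped by depth and by total nodes returned."""
    depth = max(1, min(depth, MAX_DEPTH))
    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        node, level = queue.popleft()
        if level == depth:
            continue  # do not expand beyond the requested depth
        for neighbor in edges.get(node, []):
            if neighbor in seen:
                continue
            if len(found) >= max_nodes:
                # Budget exhausted: return what we have and flag it.
                return {"table": start, "nodes": found, "truncated": True}
            seen.add(neighbor)
            found.append({"table": neighbor, "level": level + 1})
            queue.append((neighbor, level + 1))
    return {"table": start, "nodes": found, "truncated": False}
```

Returning a `truncated` flag lets the assistant tell the user that the graph is larger than what it was shown, instead of silently presenting a partial lineage as complete.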

What You Can Ask

With this server connected, the assistant can answer data discovery questions such as:

“Find all tables related to customers”
“What tables contain PII data?”
“Show me the lineage for the orders table”
“Who owns the revenue_daily table?”
“What raw sources feed into the churn model?”
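To make the mapping concrete, here is roughly how two of these questions might translate into tool calls. The block re-declares a trimmed copy of the catalog and a plain-function version of `search_tables` (no MCP decorator) so that it runs on its own. Which arguments the assistant actually chooses is up to the model, so treat the calls as illustrative.

```python
import json

# Trimmed copy of the sample catalog, just enough for searching.
CATALOG = {
    "sales.orders": {
        "description": "Order transactions from all channels",
        "tags": ["pii", "financial"],
    },
    "sales.customers": {
        "description": "Customer master data with demographics",
        "tags": ["pii"],
    },
}


def search_tables(query, tags=None, limit=10):
    """Same matching rules as the search tool, minus the MCP decorator."""
    needle = query.lower()
    results = [
        {"name": name, "tags": meta["tags"]}
        for name, meta in CATALOG.items()
        if (needle in name.lower() or needle in meta["description"].lower())
        and (tags is None or any(t in meta["tags"] for t in tags))
    ]
    return json.dumps(results[:limit], indent=2)


# "Find all tables related to customers"
customers = json.loads(search_tables("customer"))

# "What tables contain PII data?" -- an empty query matches every table,
# so the tag filter alone decides the result.
pii = json.loads(search_tables("", tags=["pii"]))
```

Note that the first call misses `sales.orders` even though it has a `customer_id` column, because the substring search only looks at table names and descriptions.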

Production Considerations

Caching. Catalog metadata changes slowly. Cache search results and table details for 5-15 minutes to avoid hammering your catalog API on every question.
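A minimal sketch of such a cache, using only the standard library, might look like the following. The class name, the 10-minute default, and the injectable `clock` argument are illustrative choices, not part of the MCP SDK.

```python
import time
from typing import Any, Callable, Hashable


class TTLCache:
    """Tiny time-based cache; entries expire `ttl_seconds` after being stored."""

    def __init__(
        self,
        ttl_seconds: float = 600.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: dict[Hashable, tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        """Return the cached value for `key`, calling `loader` if missing or stale."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = loader()
        self._entries[key] = (now, value)
        return value
```

Inside a tool this could be used as `cache.get_or_load(("details", table_name), lambda: fetch_table(table_name))`, where `fetch_table` is a hypothetical call to your catalog API. If results depend on who is asking, include the user's identity in the cache key so one user's results are never served to another.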

Access control. Your catalog likely has its own access control. The MCP server should respect it — if a user can’t see a table in the catalog UI, they shouldn’t see it through MCP either. Pass the user’s identity through to your catalog API calls.
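One possible shape for this, assuming the catalog accepts a per-user bearer token over HTTP, is to build every catalog request from the end user's token rather than a shared service credential. The URL, the bearer scheme, and the way the server obtains `user_token` are all assumptions here; they depend on your deployment.

```python
import urllib.parse
import urllib.request

# Placeholder endpoint; substitute your catalog's real search URL.
CATALOG_SEARCH_URL = "https://catalog.example.internal/api/search"


def catalog_request(params: dict, user_token: str) -> urllib.request.Request:
    """Build a catalog request that carries the end user's identity."""
    if not user_token:
        # Fail closed: never fall back to an all-seeing service account.
        raise PermissionError("refusing to call the catalog without a user identity")
    url = f"{CATALOG_SEARCH_URL}?{urllib.parse.urlencode(params)}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {user_token}"},
    )
```

With this arrangement the catalog itself decides what the user may see, so the MCP server does not need to duplicate the permission model.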

Search quality. The simple substring search in the example is a placeholder. Real catalogs offer fuzzy matching, semantic search, and tag-based filtering. Use your catalog’s native search — it’s been optimized for exactly this use case.

Reference implementations. Before building from scratch, study the DataHub MCP server and the OpenMetadata MCP integration. Both are open source and demonstrate production patterns for catalog integration, search implementation, and lineage APIs.