On GCP, two services form the catalog stack: BigLake Metastore for table-level metadata, and Dataplex Universal Catalog for governance across data products. As Delta Lake and Iceberg converge toward interoperability, catalog infrastructure is increasingly a more durable investment than the table format itself: the catalog tells every engine what tables exist, where they live, and who can access them, regardless of format.
The Problem Catalogs Solve
Without a unified catalog, different engines maintain inconsistent views of your data. Spark knows about tables that BigQuery doesn’t. BigQuery has datasets that Trino can’t discover. A new analyst joins the team and has no way to find what data exists, what it contains, or whether they’re allowed to use it.
This is the “fragmented metadata problem,” and it gets worse as your platform grows. Every new engine, every new team, every new data product creates more metadata that lives in a different system. Catalogs centralize this so there’s one source of truth for table discovery, schema information, and access control.
BigLake Metastore
BigLake Metastore provides a serverless catalog that replaces self-managed Hive Metastore deployments. If you’ve ever dealt with a Hive Metastore — the operational burden of running a MySQL or PostgreSQL backend, managing the Thrift service, handling schema migrations — BigLake Metastore eliminates all of that.
The key capability is the Iceberg REST Catalog API. This standard API enables Spark, Flink, Trino, and BigQuery to discover and manage the same tables through a unified interface. A Spark job can create an Iceberg table, and BigQuery can immediately query it. A BigQuery DML statement can update data, and Spark’s next read sees the changes.
```python
from pyspark.sql import SparkSession

# Spark session configured to use BigLake Metastore
spark = SparkSession.builder \
    .config("spark.sql.catalog.biglake", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.biglake.catalog-impl", "org.apache.iceberg.gcp.biglake.BigLakeCatalog") \
    .config("spark.sql.catalog.biglake.gcp_project", "my-project") \
    .config("spark.sql.catalog.biglake.gcp_location", "us-central1") \
    .config("spark.sql.catalog.biglake.warehouse", "gs://my-bucket/warehouse") \
    .getOrCreate()
```
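BigLake Metastore also speaks the generic Iceberg REST protocol, which is how engines without a BigLake-specific plugin typically connect. Here is a sketch of what that configuration looks like from Spark; the endpoint URI and token handling are assumptions to verify against the current BigLake documentation:

```python
from pyspark.sql import SparkSession

# Same catalog, reached through the generic Iceberg REST protocol rather than
# the BigLake plugin. The endpoint URI and the token handling below are
# assumptions -- verify both against the current BigLake documentation.
spark = SparkSession.builder \
    .config("spark.sql.catalog.biglake", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.biglake.type", "rest") \
    .config("spark.sql.catalog.biglake.uri", "https://biglake.googleapis.com/iceberg/v1beta/restcatalog") \
    .config("spark.sql.catalog.biglake.warehouse", "gs://my-bucket/warehouse") \
    .config("spark.sql.catalog.biglake.token", "<oauth2-access-token>") \
    .getOrCreate()
```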
```python
# Create a table in Spark -- immediately queryable from BigQuery
spark.sql("""
    CREATE TABLE biglake.bronze.raw_events (
        event_id        STRING,
        event_timestamp TIMESTAMP,
        user_id         STRING,
        payload         STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_timestamp))
""")
```

```sql
-- The same table is now accessible from BigQuery
SELECT event_id, event_timestamp, user_id
FROM `my-project.bronze.raw_events`
WHERE DATE(event_timestamp) = '2026-03-25';
```

This cross-engine visibility is what makes the medallion lakehouse pattern practical. Without a shared catalog, the bronze layer (written by Spark) and the silver layer (transformed by BigQuery/dbt) would be disconnected systems that happen to read from the same Cloud Storage paths.
Dataplex Universal Catalog
Dataplex Universal Catalog sits above BigLake Metastore, providing governance across data products, AI models, and analytics assets. While BigLake Metastore handles the mechanical question of “what tables exist and where,” Dataplex handles the organizational questions:
- Data quality rules: Define and enforce quality expectations across datasets (see the sketch after this list)
- Lineage tracking: Trace how data flows from source to consumption
- Access management: Govern who can read, write, or administer data spanning BigQuery datasets, Cloud Storage buckets, and Vertex AI models
- Data product organization: Group related tables, views, and models into discoverable data products
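To make the data quality piece concrete, here is a minimal sketch using the google-cloud-dataplex Python client to register a single non-null rule against a BigQuery table. The project, dataset, table, and scan names are hypothetical, and a real rule set would be larger:

```python
from google.cloud import dataplex_v1

client = dataplex_v1.DataScanServiceClient()

# One completeness rule: order_id must never be NULL.
# All resource names here are hypothetical.
scan = dataplex_v1.DataScan(
    data=dataplex_v1.DataSource(
        resource="//bigquery.googleapis.com/projects/my-project/datasets/silver/tables/orders"
    ),
    data_quality_spec=dataplex_v1.DataQualitySpec(
        rules=[
            dataplex_v1.DataQualityRule(
                column="order_id",
                dimension="COMPLETENESS",
                non_null_expectation=dataplex_v1.DataQualityRule.NonNullExpectation(),
            )
        ]
    ),
)

# create_data_scan returns a long-running operation; result() blocks until done.
operation = client.create_data_scan(
    parent="projects/my-project/locations/us-central1",
    data_scan=scan,
    data_scan_id="orders-quality-scan",
)
operation.result()
```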
For organizations building data mesh architectures, Dataplex provides the control plane. Domain teams own their data products, publish them through Dataplex, and consumers discover and subscribe through a unified interface.
Dataplex also integrates with BigQuery’s column-level security and row-level access policies. You define governance rules once in Dataplex, and they apply regardless of which engine — BigQuery, Spark, or Trino — accesses the data through the BigLake connection.
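To illustrate the row-level piece, the sketch below creates a standard BigQuery row access policy, issued here through the Python client; the table, column, and group names are hypothetical. Per the model above, the filter then applies to any engine reading through the BigLake connection, not just BigQuery:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Hypothetical policy: members of the EMEA analysts group see only EMEA rows.
ddl = """
CREATE ROW ACCESS POLICY emea_only
ON `my-project.silver.orders`
GRANT TO ('group:emea-analysts@example.com')
FILTER USING (region = 'EMEA')
"""
client.query(ddl).result()
```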
Format Convergence Makes Catalogs More Important
The argument for investing in catalog infrastructure gets stronger as formats converge. Consider what’s already happened:
- Delta Lake UniForm exposes Iceberg-compatible metadata, meaning Iceberg readers can query Delta tables without conversion (see the sketch after this list)
- BigQuery supports Delta Lake with first-class features including deletion vectors, column mapping, and liquid clustering
- Apache XTable (formerly OneTable) enables metadata translation between Iceberg, Delta Lake, and Hudi
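To make the first point concrete: enabling UniForm on an existing Delta table is a table-property change, not a rewrite. A sketch, assuming a Spark session with Delta Lake 3.x configured and a hypothetical Delta table name:

```python
# Expose Iceberg-readable metadata on an existing Delta table (UniForm).
# The table name is hypothetical; requires Delta Lake 3.x.
spark.sql("""
    ALTER TABLE bronze.events_delta SET TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```

After this, Iceberg readers can query the table without converting the underlying data files.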
When your Spark jobs write Delta Lake and your BigQuery queries read Iceberg, the format distinction starts to blur. What matters is that your catalog knows about all of these tables, can enforce governance consistently, and provides a single discovery interface.
Standardizing on BigLake Metastore as the catalog layer is format-agnostic — it works whether the stack is all-in on Iceberg, using Delta Lake with Databricks, or mixing formats.
Practical Recommendations
Start with BigLake Metastore for all new BigLake Iceberg tables. The setup is minimal and the cross-engine benefits are immediate. Even if you’re only using BigQuery today, having tables registered in a standard catalog means adding Spark or Trino later is configuration, not migration.
Add Dataplex when governance requirements emerge. Small teams querying their own data don’t need a formal governance layer. But once you have multiple teams, compliance requirements, or external data consumers, Dataplex provides structure that ad-hoc permissions can’t.
Invest in naming conventions early. Catalogs are only as useful as the metadata in them. Consistent naming (clear database/schema/table hierarchy), meaningful descriptions, and tagged ownership make the difference between a catalog people actually use and one they ignore. This is the organizational investment that compounds — the same way dbt project naming conventions pay dividends as projects grow.
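As a purely hypothetical illustration of a convention that keeps layer and ownership legible from the name alone:

```
biglake.bronze.raw_events      -- landing data, owned by the platform team
biglake.silver.orders_clean    -- conformed entities, owned by the orders domain
biglake.gold.revenue_daily     -- consumption-ready marts, owned by finance
```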
Avoid custom catalog layers. BigLake Metastore is serverless, integrates natively with BigQuery and Spark, and supports the Iceberg REST Catalog standard. Spreadsheet-based or wiki-based alternatives lack this engine integration.
Consistent naming, clear ownership, and automated quality checks in the catalog make table type decisions tactical rather than architectural: when the catalog is the stable interface, choosing or changing a table's format is a local decision rather than a platform rewrite.