The medallion architecture — bronze, silver, gold layers with increasing data quality — maps to BigQuery’s table type options. Different layers optimize for different things, and BigQuery allows mixing table types within a single project: flexibility where needed (early pipeline stages) and performance where it matters (analytics serving layers).
Bronze Layer: Raw Landing
Raw data lands in BigLake Iceberg tables. Spark or Flink jobs write directly to Iceberg format on Cloud Storage. BigLake Metastore tracks the table metadata, making these tables discoverable by both BigQuery and external compute engines.
This layer prioritizes write throughput and schema flexibility over query performance.
```sql
-- Bronze layer: BigLake Iceberg table for raw event ingestion
CREATE TABLE `project.bronze.raw_events` (
  event_id STRING,
  event_timestamp TIMESTAMP,
  payload JSON,
  source_system STRING,
  ingestion_time TIMESTAMP
)
CLUSTER BY source_system
WITH CONNECTION `project.us.biglake_connection`
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://data-lake-bronze/raw_events'
);
```

Why Iceberg at the bronze layer:
- Multi-engine writes: Spark and Flink can write directly to the Iceberg format without going through BigQuery
- Schema evolution: Raw data schemas change frequently; Iceberg handles column additions, renames, and type promotions
- Time travel: Replay or audit raw data as of any previous state
- Portability: If you ever need to migrate off GCP, your raw data is in an open format
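Time travel can be exercised from the same engines that write the bronze layer. As a sketch, Spark SQL (3.3 or later) can read a prior state of the raw table directly; the timestamp below is illustrative, and `FOR VERSION AS OF` takes a snapshot ID instead:

```sql
-- Query the bronze Iceberg table as of an earlier point in time (Spark SQL).
SELECT event_id, source_system
FROM bronze.raw_events
FOR TIMESTAMP AS OF '2024-06-01 00:00:00'
WHERE source_system = 'web';
```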
Bronze tables are typically append-heavy. New data arrives continuously, and corrections are rare. Enable Autoclass on the underlying Cloud Storage bucket so older data automatically moves to cheaper storage tiers based on actual access patterns.
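Autoclass is a bucket-level setting. Assuming the bucket from the DDL above, enabling it is a one-line configuration change:

```shell
# Enable Autoclass so rarely accessed raw data migrates to cheaper storage classes
gcloud storage buckets update gs://data-lake-bronze --enable-autoclass
```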
Silver Layer: Cleaned and Conformed
dbt models transform bronze data into cleaned, conformed datasets. These transformations run in BigQuery using SQL, reading from Iceberg tables and writing to either more Iceberg tables or native BigQuery tables depending on downstream requirements.
The silver layer maps directly to the base and intermediate layers in a dbt project. Base models clean raw bronze data (rename columns, cast types, deduplicate). Intermediate models join and enrich across entities.
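A base model at this layer might look like the following sketch; the model name and specific cleanup rules are illustrative, not from a real project:

```sql
-- models/base/base_bronze__raw_events.sql (illustrative)
-- Light cleanup only: cast, normalize, filter out unusable rows
SELECT
  event_id,
  TIMESTAMP_TRUNC(event_timestamp, SECOND) AS event_timestamp,
  LOWER(TRIM(source_system)) AS source_system,
  payload,
  ingestion_time
FROM {{ source('bronze', 'raw_events') }}
WHERE event_id IS NOT NULL
```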
```sql
-- Silver layer dbt model reading from bronze Iceberg, writing to BigQuery
-- models/intermediate/int__events__enriched.sql
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    cluster_by=['user_id', 'event_name']
) }}

SELECT
  event_id,
  DATE(event_timestamp) AS event_date,
  event_timestamp,
  JSON_VALUE(payload, '$.user_id') AS user_id,
  JSON_VALUE(payload, '$.event_name') AS event_name,
  JSON_VALUE(payload, '$.page_url') AS page_url,
  source_system
FROM {{ source('bronze', 'raw_events') }}

{% if is_incremental() %}
-- Filter on the underlying column: BigQuery does not allow SELECT aliases in WHERE
WHERE DATE(event_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
{% endif %}

QUALIFY ROW_NUMBER() OVER (
  PARTITION BY event_id
  ORDER BY ingestion_time DESC
) = 1
```

Whether silver tables should be Iceberg or native depends on who else needs to read them:
- Iceberg: When Spark-based ML pipelines or other engines read silver data directly
- Native BigQuery: When only BigQuery and dbt consume silver tables (better query performance, simpler management)
For most analytics-focused teams, native BigQuery tables at the silver layer work well. The multi-engine benefit of Iceberg matters most at bronze (where external systems write) and less at silver (where BigQuery does the work).
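The insert_overwrite strategy used in these models replaces whole partitions rather than merging individual rows, which is what makes the trailing reprocessing window idempotent. A minimal Python sketch of the semantics, with invented partition keys and rows:

```python
# insert_overwrite semantics: every partition touched by the incremental run
# replaces the stored partition wholesale; untouched partitions are left alone.
existing = {
    "2024-06-01": [{"event_id": "a"}],
    "2024-06-02": [{"event_id": "b"}],  # stale: missing a late-arriving event
}
incremental_run = {
    "2024-06-02": [{"event_id": "b"}, {"event_id": "c"}],  # recomputed window
    "2024-06-03": [{"event_id": "d"}],
}
existing.update(incremental_run)  # whole-partition replacement, not a row merge
```

Because the recomputed window overwrites its partitions completely, re-running the same window never produces duplicates.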
BigQuery automatically re-clusters native tables as new data arrives, so frequently queried silver tables maintain performance without manual maintenance; for Iceberg tables, BigQuery manages file compaction in the background.
Gold Layer: Business-Ready Analytics
Aggregated, business-ready datasets live in native BigQuery tables optimized for BI workloads. These tables serve dashboards, support ad-hoc analyst queries, and back embedded analytics. This is the mart layer in dbt terms.
```sql
-- Gold layer mart model: daily channel performance
-- models/marts/marketing/mrt__marketing__daily_channel_performance.sql
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    cluster_by=['channel', 'source']
) }}

SELECT
  event_date,
  channel,
  source,
  COUNT(DISTINCT user_id) AS unique_users,
  COUNT(DISTINCT session_id) AS sessions,
  SUM(conversions) AS total_conversions,
  SAFE_DIVIDE(SUM(conversions), COUNT(DISTINCT session_id)) AS conversion_rate
FROM {{ ref('int__sessions__attributed') }}

{% if is_incremental() %}
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
{% endif %}

GROUP BY 1, 2, 3
```

Native tables make sense at the gold layer because:
- Performance matters most here — these are the tables users actually touch
- Multi-engine access is rarely needed for aggregated metrics
- Streaming inserts work with sub-second latency for real-time dashboards
- Materialized views can further accelerate repeated dashboard queries (see BigQuery Materialized Views)
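As a sketch, a materialized view over the mart can pre-aggregate the hottest dashboard slice; the view and table names below are illustrative:

```sql
-- Hypothetical materialized view accelerating a channel-level dashboard
CREATE MATERIALIZED VIEW `project.gold.mv_channel_daily`
AS
SELECT
  event_date,
  channel,
  SUM(total_conversions) AS conversions,
  SUM(sessions) AS sessions
FROM `project.gold.daily_channel_performance`
GROUP BY event_date, channel;
```

BigQuery transparently rewrites matching dashboard queries to read from the view, refreshing it incrementally as the base table changes.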
The Hybrid Advantage
The real power of this pattern is optionality. You’re not locked into a single table format across your entire platform.
- Bronze (Iceberg): Any engine can write, any engine can read. You can add Spark processing later or migrate to another platform without re-exporting raw data.
- Silver (Iceberg or native): Choose based on whether non-BigQuery engines need access.
- Gold (native): Maximum query speed for the tables that drive business decisions.
This hybrid approach preserves flexibility in early pipeline stages while maximizing speed for the tables users interact with daily. The catalog layer ties it all together, providing a unified view across table types.
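Because both formats sit behind the same SQL surface, a single query can span an Iceberg bronze table and a native gold table, for example to reconcile raw event counts against mart aggregates. The gold table name here is illustrative:

```sql
-- One query spanning a BigLake Iceberg table and a native BigQuery table
SELECT
  g.event_date,
  SUM(g.total_conversions) AS conversions,
  COUNT(r.event_id) AS raw_events
FROM `project.gold.daily_channel_performance` AS g   -- native (gold)
LEFT JOIN `project.bronze.raw_events` AS r           -- BigLake Iceberg (bronze)
  ON g.event_date = DATE(r.event_timestamp)
GROUP BY g.event_date;
```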
When Not to Use the Full Medallion Pattern
Not every dataset needs the full medallion treatment. Simple reporting tables can go directly from source to native BigQuery tables without intermediate Iceberg layers. The lakehouse pattern adds value for data that needs:
- Multi-engine access at any stage
- Long-term portability guarantees
- Complex ingestion from streaming systems (Spark, Flink)
- Regulatory separation between raw and transformed data
For data that only BigQuery will ever touch, a straightforward dbt project with native tables at every layer is simpler and performs better. Over-complicating the architecture is one of the most common mistakes in BigQuery data lake implementations. Match the complexity of your infrastructure to the complexity of your actual requirements.