Notes on designing, implementing, and optimizing a BigQuery-based data lake. Covers table types, the medallion lakehouse pattern on GCP, catalog strategy, performance characteristics, and common mistakes. Derived from BigQuery and Cloud Storage data lake patterns.
Prerequisites
- BigQuery Architecture for Analytics Engineers — understand how Dremel, Colossus, and slots work before making table type decisions
- BigQuery Cost Model — the cost model shapes every architecture choice here
Reading Order
1. BigQuery Table Types — native BigQuery tables, BigLake external tables, and BigLake Iceberg tables: what each type does, plus a decision framework for choosing between them.
2. BigLake Performance Characteristics — metadata caching, where the remaining performance gap between external and native tables matters, and where it doesn’t.
3. Medallion Lakehouse on GCP — the bronze-silver-gold architecture on BigQuery: Iceberg at the bronze layer, dbt transformations at silver, native tables at gold. Includes code examples.
4. BigLake Metastore and Catalog Strategy — BigLake Metastore and Dataplex Universal Catalog as the governance layer across table formats.
5. Cloud Storage Tiering for BigQuery — cost optimization across storage tiers, physical storage billing, and pricing model selection. Reducing storage costs by 60-80% requires coordinating all three.
6. BigQuery Data Lake Common Mistakes — missing metadata caching, queries on partitioned tables with no enforced partition filter, and over-engineered architectures.