For years, the data platform decision came down to a binary: data lakes offered flexibility and open formats at the cost of query performance, while data warehouses delivered speed but locked you into proprietary storage. On GCP, that trade-off no longer exists.
BigLake Iceberg Tables in BigQuery combine open-format portability with full DML support, streaming ingestion, and query performance within 20-30% of native tables. The medallion lakehouse pattern (landing data in Iceberg, transforming via dbt in BigQuery, serving analytics from native tables) gives teams both flexibility where they need it and performance where it matters.
This guide provides a decision framework for choosing between native BigQuery tables, BigLake external tables, and BigLake Iceberg tables. The right answer depends on your specific access patterns, governance requirements, and whether other engines need to touch the same data.
Three table types serve different purposes
BigQuery now offers three distinct approaches to storing and querying data (for a broader look at how these fit together, see my BigQuery architecture guide). Each optimizes for different constraints.
Native BigQuery tables store data in BigQuery’s proprietary columnar format. They deliver the best query performance, support streaming inserts with sub-second latency, and handle high-concurrency workloads without tuning. The trade-off is that data lives only in BigQuery. If Spark, Trino, or another engine needs access, you’re exporting data or running federated queries.
BigLake tables (external tables with a Cloud Resource connection) query data sitting in Cloud Storage while applying BigQuery’s governance layer. You get row-level security, column-level masking, and fine-grained access control on data stored in Parquet, ORC, Avro, or CSV files. Query performance depends heavily on file layout and metadata caching configuration. These tables work well for data that multiple engines need to access or that you can’t move out of object storage for compliance reasons.
BigLake Iceberg Tables in BigQuery represent Google’s strategic direction. These tables store data in Apache Iceberg format on Cloud Storage but support full BigQuery DML (INSERT, UPDATE, DELETE, and MERGE). They handle streaming ingestion, support time travel queries, and maintain metadata that any Iceberg-compatible engine can read. BigQuery manages table compaction and optimization automatically.
For new projects, BigLake Iceberg tables eliminate most of the need to choose between the other two options. You get open-format flexibility with warehouse-like operational characteristics.
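To make this concrete, the DDL for creating a BigLake Iceberg table follows the general shape below. This is a hedged sketch: the project, dataset, connection, and bucket names are placeholders, and option names should be verified against the current BigQuery documentation. It's shown as a Python string so it can be submitted through any BigQuery client.

```python
# Illustrative DDL for a BigLake Iceberg table in BigQuery.
# All identifiers (dataset, connection, bucket) are placeholders.
iceberg_ddl = """
CREATE TABLE my_dataset.events_bronze (
  event_id STRING,
  payload JSON,
  ingested_at TIMESTAMP
)
WITH CONNECTION `my-project.us.my-connection`
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://my-lake-bucket/iceberg/events_bronze'
);
"""
# Nothing here executes against GCP; the statement could be run via the
# BigQuery client libraries or the bq CLI.
```

Once created, the table accepts standard DML (INSERT, UPDATE, DELETE, MERGE) like any other BigQuery table, while the underlying Parquet and Iceberg metadata files remain readable by external engines.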
The medallion lakehouse pattern fits GCP well
The medallion architecture (bronze, silver, gold layers with increasing data quality) adapts naturally to BigQuery’s table type options.
Bronze layer: Raw data lands in BigLake Iceberg tables. Spark or Flink jobs write directly to Iceberg format on Cloud Storage. BigLake Metastore tracks the table metadata, making these tables discoverable by both BigQuery and external compute engines. This layer prioritizes write throughput and schema flexibility over query performance.
Silver layer: dbt models transform bronze data into cleaned, conformed datasets. These transformations run in BigQuery using SQL, reading from Iceberg tables and writing to either more Iceberg tables or native BigQuery tables depending on downstream requirements. Define clustering keys on frequently-queried silver tables; BigQuery re-clusters native tables automatically as data accumulates, so performance holds up without manual maintenance.
Gold layer: Aggregated, business-ready datasets live in native BigQuery tables optimized for BI workloads. These tables serve dashboards, support ad-hoc analyst queries, and back embedded analytics. Native tables make sense here because performance matters most and multi-engine access is rarely needed for aggregated metrics.
This hybrid approach preserves optionality in early pipeline stages. You can always add Spark processing or migrate to another platform while maximizing query speed for the tables users actually touch.
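The layer-to-table-type mapping described above can be summarized as a small configuration sketch. The dataset names are illustrative; the table-type choices mirror the guidance in the text.

```python
# Hedged sketch of the medallion layout described above. Dataset names
# are placeholders; table-type recommendations follow the article's pattern.
MEDALLION_LAYERS = {
    "bronze": {"dataset": "lake_bronze", "table_type": "biglake_iceberg",
               "written_by": ["spark", "flink"], "priority": "write throughput"},
    "silver": {"dataset": "lake_silver", "table_type": "iceberg_or_native",
               "written_by": ["dbt"], "priority": "conformed models"},
    "gold":   {"dataset": "analytics_gold", "table_type": "native",
               "written_by": ["dbt"], "priority": "query performance"},
}

def table_type_for(layer: str) -> str:
    """Return the recommended table type for a medallion layer."""
    return MEDALLION_LAYERS[layer]["table_type"]
```

The key design point: openness decreases and performance optimization increases as data moves toward gold, which is where user-facing query latency actually matters.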
Catalog strategy matters more than format choice
Open table formats are converging, which means investing heavily in one format over another matters less than building strong catalog infrastructure.
Delta Lake tables work in BigQuery with first-class support including deletion vectors, column mapping, and liquid clustering. Delta Lake’s UniForm feature exposes Iceberg-compatible metadata, meaning Iceberg readers can query Delta tables.
BigLake Metastore provides a serverless catalog that replaces self-managed Hive Metastore deployments. The Iceberg REST Catalog API enables Spark, Flink, Trino, and BigQuery to discover and manage the same tables through a unified interface. This eliminates the fragmented metadata problem where different engines maintain inconsistent views of what tables exist and what they contain.
Dataplex Universal Catalog sits above BigLake Metastore, providing governance across data products, AI models, and analytics assets. It handles data quality rules, lineage tracking, and access management spanning BigQuery datasets, Cloud Storage buckets, and Vertex AI models. For organizations building data mesh architectures, Dataplex provides the control plane.
Standardize on BigLake Metastore as your catalog layer regardless of which table formats you use. This investment pays off whether you’re all-in on Iceberg, using Delta Lake with Databricks, or mixing formats across different use cases.
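For external engines, wiring into an Iceberg REST catalog is a matter of Spark configuration. The settings below are the generic Apache Iceberg REST catalog properties; the catalog name and endpoint URI are placeholders, so consult the BigLake Metastore documentation for the actual endpoint and authentication setup.

```python
# Generic Spark settings for an Iceberg REST catalog. The catalog name
# ("lake"), endpoint URI, and warehouse path are placeholders -- check the
# BigLake Metastore docs for the real REST catalog endpoint and auth config.
spark_conf = {
    "spark.sql.catalog.lake": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.lake.catalog-impl": "org.apache.iceberg.rest.RESTCatalog",
    "spark.sql.catalog.lake.uri": "https://example-rest-catalog.invalid/v1",
    "spark.sql.catalog.lake.warehouse": "gs://my-lake-bucket/warehouse",
}
```

With this in place, a Spark session sees the same tables BigQuery does (e.g. `spark.sql("SELECT * FROM lake.my_dataset.events_bronze")`), which is the unified-metadata property the catalog layer exists to provide.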
Performance gaps have narrowed significantly
The historical knock against external tables was query performance. They ran 3-5x slower than native tables for typical analytical workloads. That gap has narrowed.
With metadata caching enabled, BigLake tables achieve query performance within 20-30% of native BigQuery tables for most analytical patterns. TPC-DS benchmarks show up to a 4x improvement in wall-clock execution time once metadata caching is enabled, primarily through optimized query planning that uses cached file statistics for partition pruning and predicate pushdown.
The remaining performance gap matters for:
- High-concurrency dashboards serving hundreds of simultaneous users
- Sub-second latency requirements for operational analytics
- Complex joins across many large tables where every millisecond compounds
For batch reporting, ad-hoc analysis, and data science exploration, the performance difference rarely affects user experience. A query that runs in 3 seconds versus 4 seconds doesn’t change how analysts work.
Streaming performance differs more significantly. Native BigQuery tables accept streaming inserts with sub-second latency and no batching overhead. BigLake Iceberg tables support high-throughput streaming, but individual record latency runs higher. For real-time operational dashboards, native tables remain the better choice.
Cost architecture compounds across three dimensions
Optimizing storage costs requires thinking about tiering, query efficiency, and pricing model selection together. Each dimension interacts with the others.
Storage tiering segments data by access frequency:
- Hot data (0-90 days): Native BigQuery tables or BigLake Iceberg tables with Standard class Cloud Storage
- Warm data (90 days to 1 year): BigLake tables on Nearline Cloud Storage
- Cold data (1+ years): BigLake tables on Coldline, queried only when needed
Cloud Storage lifecycle policies automate transitions between tiers. A typical configuration moves objects from Standard to Nearline at 30 days, Nearline to Coldline at 90 days, and deletes at 365 days (adjust retention for compliance requirements). For Iceberg tables, enable Autoclass on the underlying bucket. It automatically moves data and metadata files to optimal storage classes based on actual access patterns.
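The lifecycle schedule above translates directly into a Cloud Storage lifecycle configuration. A sketch of that policy as JSON follows; the 30/90/365-day thresholds match the example in the text and should be adjusted for your retention requirements.

```python
import json

# GCS lifecycle policy matching the tiering schedule above (30/90/365 days).
# Thresholds are illustrative; adjust for compliance and access patterns.
lifecycle_policy = {
    "rule": [
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]}},
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 90, "matchesStorageClass": ["NEARLINE"]}},
        {"action": {"type": "Delete"},
         "condition": {"age": 365}},
    ]
}

# Could be applied with, e.g.: gsutil lifecycle set policy.json gs://my-bucket
policy_json = json.dumps(lifecycle_policy, indent=2)
```

Note that lifecycle rules and Autoclass are alternatives on a given bucket: use explicit rules when you know the access pattern, Autoclass when you don't.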
Physical storage billing reduces costs significantly (I cover more BigQuery cost optimization tactics in a separate guide). Instead of paying for logical data size, you pay for compressed bytes. BigQuery’s columnar compression typically achieves 3-4x reduction. Combined with long-term storage pricing (50% reduction for data untouched for 90 days), mature datasets cost around $0.005/GB/month. That’s cheaper than most object storage when you account for the query capabilities included.
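The $0.005/GB/month figure comes from simple arithmetic: long-term physical storage price divided by the compression ratio. The rates below are illustrative US-region numbers, not a quote; verify current pricing before planning budgets.

```python
# Back-of-envelope check of the ~$0.005/GB/month claim. Prices are
# illustrative (assumed US long-term *physical* storage rate); verify
# current BigQuery pricing before relying on these numbers.
LONG_TERM_PHYSICAL_USD_PER_GB = 0.02   # assumed long-term physical rate
compression_ratio = 4.0                # ~3-4x columnar compression

# Cost per *logical* GB: you bill on compressed bytes, so effective
# per-logical-GB cost shrinks by the compression ratio.
cost_per_logical_gb = LONG_TERM_PHYSICAL_USD_PER_GB / compression_ratio
print(f"${cost_per_logical_gb:.3f}/GB/month")  # → $0.005/GB/month
```

The same arithmetic explains why physical billing can backfire on poorly-compressing data (random UUIDs, pre-compressed blobs): if the ratio drops below ~2x, logical billing may be cheaper.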
Editions pricing changes the economics for transformation workloads. Pipeline projects with predictable query patterns benefit from Editions reservations with autoscaled slots and per-second billing. Analytics projects with unpredictable ad-hoc queries work better with on-demand pricing and daily cost quotas. Don't apply the same pricing model to both use cases.
A decision framework for table type selection
When evaluating which table type to use for a specific dataset, work through these questions:
Does any engine besides BigQuery need to query this data? If Spark, Trino, Presto, or another compute engine needs direct access, BigLake tables (either standard or Iceberg) are your only options. Native BigQuery tables require exporting data or using BigQuery’s connector ecosystem.
Do you need streaming inserts with sub-second latency? Native BigQuery tables handle this best. BigLake Iceberg supports streaming but with higher latency per record. If your use case involves real-time operational dashboards or alerting, start with native tables.
Is this data subject to compliance requirements restricting movement? Some regulatory frameworks require data to remain in specific storage systems or regions. BigLake tables let you query data in place without copying it into BigQuery’s managed storage.
How important is query performance for this specific table? Gold layer tables serving dashboards benefit from native table performance. Bronze layer tables that only feed batch pipelines rarely need it.
Will this dataset exist in five years? Open formats like Iceberg provide better long-term portability. If you’re building core data assets that outlast your current architecture, the insurance value of open formats matters.
For most new projects, the default answer is BigLake Iceberg tables unless you have a specific reason to choose otherwise. The performance penalty is acceptable for most use cases, and you retain the ability to export data to any Iceberg-compatible system.
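The framework above can be condensed into a decision function. This is a deliberate simplification (real decisions weigh these factors rather than short-circuiting on them), but it encodes the priority order the questions imply.

```python
# Sketch of the decision framework as a function. Inputs answer the
# questions above; the priority order is a simplification of the text's
# guidance, not an official ruleset.
def pick_table_type(multi_engine: bool,
                    subsecond_streaming: bool,
                    data_must_stay_in_place: bool,
                    serving_hot_dashboards: bool) -> str:
    if subsecond_streaming or serving_hot_dashboards:
        return "native"            # performance-critical serving layer
    if data_must_stay_in_place:
        return "biglake_external"  # query in place, no copies
    if multi_engine:
        return "biglake_iceberg"   # open format, shared access
    return "biglake_iceberg"       # default for new projects
```

For example, a real-time alerting table returns "native", while a bronze landing zone with Spark writers returns "biglake_iceberg".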
Common mistakes reveal gaps between intention and execution
Three patterns cause the most problems in BigQuery data lake implementations.
Forgetting metadata caching: BigLake table performance depends on enabling metadata caching. Without it, every query scans Cloud Storage to discover file statistics. This configuration change takes minutes but dramatically affects query latency.
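Enabling the cache is a table-level option change. A hedged sketch of the DDL follows; the table name is a placeholder and the option names should be checked against current BigLake documentation (the staleness window is a tunable trade-off between freshness and performance).

```python
# Illustrative ALTER TABLE enabling metadata caching on a BigLake table.
# Table name is a placeholder; option names per BigQuery's BigLake docs,
# but verify against current documentation before use.
enable_cache_ddl = """
ALTER TABLE my_dataset.events_external
SET OPTIONS (
  metadata_cache_mode = 'AUTOMATIC',
  max_staleness = INTERVAL 4 HOUR
);
"""
```

`AUTOMATIC` refreshes the cache on a system-managed schedule; `MANUAL` hands refresh timing to you, which suits batch pipelines that know when new files land.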
Skipping partition filters: External tables with Hive-style partitioning should be configured to require partition filters. Without this, analysts can accidentally trigger full table scans across terabytes of data. Set require_partition_filter = true on partitioned native tables; the equivalent option for Hive-partitioned external tables is require_hive_partition_filter.
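A sketch of a Hive-partitioned external table with the filter requirement set is shown below. Paths, names, and the connection are placeholders; the option syntax should be verified against current BigQuery DDL documentation.

```python
# Illustrative DDL for a Hive-partitioned BigLake table that rejects
# queries lacking a partition filter. All names and paths are placeholders.
partitioned_ddl = """
CREATE EXTERNAL TABLE my_dataset.logs_external
WITH PARTITION COLUMNS
WITH CONNECTION `my-project.us.my-connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-lake-bucket/logs/*'],
  hive_partition_uri_prefix = 'gs://my-lake-bucket/logs',
  require_hive_partition_filter = true
);
"""
```

With this set, a query like `SELECT * FROM my_dataset.logs_external` fails fast instead of scanning every partition; adding `WHERE dt = '2026-01-15'` (assuming a `dt` partition column) prunes the scan to one day's files.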
Over-complicating the architecture: Not every dataset needs the full medallion treatment. Simple reporting tables can go directly from source to native BigQuery tables without intermediate Iceberg layers. The lakehouse pattern adds value for data that needs multi-engine access or long-term portability, but it’s overhead for data that only BigQuery will ever touch.
The strategic bet is on catalog infrastructure
The GCP data lake landscape has genuinely changed. BigLake Iceberg tables deliver capabilities that didn’t exist two years ago, and the performance gap with native tables continues to narrow. Format wars matter less as Delta Lake and Iceberg converge on interoperability.
The investment that compounds over time is catalog infrastructure. BigLake Metastore plus Dataplex provide a governance layer that spans formats, engines, and data products. Teams that build strong catalog discipline (consistent naming, clear ownership, automated quality checks) find that specific table type decisions become tactical rather than strategic.
For new GCP data platforms in 2026, the medallion lakehouse pattern with BigLake Iceberg as the default table format offers the best balance of flexibility, performance, and future optionality. Native BigQuery tables remain valuable for high-performance serving layers. The architecture that combines both, governed by a unified catalog, positions teams to adapt as the platform continues to evolve.