
BigQuery Data Lake Common Mistakes

Three anti-patterns that cause the most problems in BigQuery data lake implementations: missing metadata caching, skipped partition filters, and over-engineered architectures.

Tags: bigquery · gcp · data engineering · cost optimization

Three anti-patterns appear consistently in BigQuery data lake implementations: missing metadata caching, unguarded partition filters, and architecture that adds complexity without matching requirements. Each is straightforward to fix once identified.

Forgetting Metadata Caching

Missing metadata caching is the most common performance issue in BigQuery data lake implementations and the easiest to fix.

BigLake table performance depends on enabling metadata caching. Without it, every query scans Cloud Storage to discover file statistics — file sizes, row counts, column min/max values — before the actual query execution even begins. That discovery step adds latency that can make BigLake tables feel sluggish compared to native tables.

With metadata caching enabled, BigQuery caches those file statistics and uses them for query planning. TPC-DS benchmarks show a 4x improvement in wall-clock execution time. The fix takes minutes:

ALTER TABLE `project.dataset.external_table`
SET OPTIONS (
  metadata_cache_mode = 'AUTOMATIC',
  max_staleness = INTERVAL 4 HOUR
);

The max_staleness setting controls how stale the cached metadata may be. For tables with frequent writes, set it to 30-60 minutes; for daily batch loads, 4-8 hours works fine. The trade-off is freshness versus metadata refresh cost, and for most analytical workloads a few hours of staleness is acceptable.
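For a table that receives frequent writes, the same statement can be tuned for tighter freshness. A minimal sketch (the table name is hypothetical):

```sql
-- Hypothetical frequently-written table: keep metadata caching on,
-- but tolerate at most 30 minutes of staleness during query planning.
ALTER TABLE `project.dataset.events_external`
SET OPTIONS (
  metadata_cache_mode = 'AUTOMATIC',
  max_staleness = INTERVAL 30 MINUTE
);
```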

When BigLake tables are slower than expected, check metadata caching first.

Skipping Partition Filters

External tables with Hive-style partitioning should require partition filters. Without this safeguard, analysts can accidentally trigger full table scans across terabytes of data. One unfiltered query on a 10 TB table costs $62.50 under on-demand pricing ($6.25 per TiB scanned) — and that analyst probably doesn’t even realize what happened.

The fix is a single DDL statement:

ALTER TABLE `project.dataset.partitioned_table`
SET OPTIONS (require_partition_filter = true);

With this option set, any query missing a filter on the partition column simply fails with an error instead of silently scanning everything. It’s a guardrail that pays for itself the first time it blocks an accidental full scan.
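As a sketch of the behavior (table and partition-column names are hypothetical):

```sql
-- Rejected with an error once require_partition_filter is set:
-- no filter on the partition column, so this would scan everything.
SELECT COUNT(*)
FROM `project.dataset.partitioned_table`;

-- Succeeds: the range predicate on the partition column lets
-- BigQuery prune to just two days of data.
SELECT COUNT(*)
FROM `project.dataset.partitioned_table`
WHERE dt BETWEEN '2024-01-01' AND '2024-01-02';
```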

This matters even more for BigLake tables than for native tables. Native BigQuery tables at least benefit from BigQuery’s internal optimizations and caching. BigLake tables reading from Cloud Storage pay the full I/O cost for every byte scanned, with no buffer between the analyst and the object storage bill.

The deeper issue here is governance. In a shared analytical environment, you can’t rely on every user knowing how partitioning works. require_partition_filter encodes the knowledge structurally. See BigQuery Partition Pruning Patterns for the full set of anti-patterns that defeat pruning, including functions on partition columns and subqueries in filters.
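For instance, wrapping the partition column in a function can defeat pruning even though a filter is technically present. A sketch, with a hypothetical DATE partition column dt:

```sql
-- Risky: the function on the partition column can prevent pruning,
-- forcing BigQuery to evaluate it against every partition.
SELECT *
FROM `project.dataset.partitioned_table`
WHERE DATE_TRUNC(dt, MONTH) = '2024-01-01';

-- Prunes correctly: an equivalent direct range on the raw column.
SELECT *
FROM `project.dataset.partitioned_table`
WHERE dt >= '2024-01-01' AND dt < '2024-02-01';
```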

Over-Complicating the Architecture

Not every dataset needs the full medallion lakehouse treatment. Teams sometimes default to bronze-silver-gold with BigLake Iceberg at every layer for data that only BigQuery will ever touch — overhead without benefit.

The lakehouse pattern adds genuine value when:

  • Multiple engines (Spark, Flink, Trino) need to read or write data at some stage
  • Long-term portability is a real requirement, not a theoretical one
  • Complex ingestion from streaming systems justifies the Iceberg layer
  • Regulatory separation between raw and transformed data is mandatory

For data that flows from a managed connector (Fivetran, Airbyte) into BigQuery, gets transformed by dbt, and serves dashboards through Looker — all within BigQuery — a straightforward dbt project with native tables at every layer is simpler, faster, and cheaper.

For a solo data engineer or small team, a three-engine medallion architecture with catalog infrastructure creates more operational burden than it solves. Starting with native BigQuery tables and adding Iceberg when there is a concrete multi-engine requirement is the lower-risk path. Table type selection should be based on current requirements, not anticipated future ones.

A Quick Diagnostic

If you’re inheriting or auditing a BigQuery data lake, check these three things in order:

  1. Query INFORMATION_SCHEMA.TABLE_OPTIONS for all BigLake tables — are any missing metadata_cache_mode = 'AUTOMATIC'?
  2. Check require_partition_filter on every partitioned table with more than 100 GB of data
  3. Count the table types — if every table is BigLake Iceberg but only BigQuery queries them, ask whether native tables would be simpler
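The first check can be scripted against INFORMATION_SCHEMA, which is scoped per dataset. A sketch (project and dataset names are hypothetical):

```sql
-- External tables in one dataset with no metadata_cache_mode set:
-- any table returned here is a candidate for enabling caching.
SELECT t.table_name
FROM `project.dataset`.INFORMATION_SCHEMA.TABLES AS t
LEFT JOIN `project.dataset`.INFORMATION_SCHEMA.TABLE_OPTIONS AS o
  ON o.table_name = t.table_name
 AND o.option_name = 'metadata_cache_mode'
WHERE t.table_type = 'EXTERNAL'
  AND o.option_value IS NULL;
```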

Each check takes under five minutes. Left unaddressed, these three patterns increase costs and degrade performance across the platform.