Adrienne Vermorel
BigQuery Architecture for Analytics Engineers: The Complete Guide
Analytics engineers often work productively in BigQuery for years without understanding what’s happening beneath their queries. You write dbt models, they materialize, dashboards update. Everything works until it doesn’t. Then a poorly structured project balloons costs by 10x. A cross-region mistake blocks critical joins. An IAM misconfiguration exposes sensitive data to the wrong team.
These failures share a root cause: architectural decisions made without understanding their consequences. This guide provides the conceptual foundation you need (covering everything from BigQuery’s resource hierarchy to multi-organization patterns) so you can design systems that scale rather than systems you’ll need to rebuild.
How BigQuery actually works under the hood
Understanding BigQuery requires a mental model of what it actually is. Not memorizing Google’s internal systems, but understanding the constraints and trade-offs that shape every architectural decision.
Separation of storage and compute
BigQuery’s core innovation is complete separation between where your data lives and where your queries run. Your tables exist in Colossus, Google’s distributed file system, stored in a proprietary columnar format called Capacitor. Your queries execute separately via Dremel, a multi-level execution tree that parallelizes work across thousands of nodes. Google’s Jupiter network (with over 6 petabits per second of bandwidth) connects them.
This separation has profound implications. Storage costs and compute costs are completely independent. You pay for stored bytes regardless of how often you query them. You pay for scanned bytes regardless of how much data you store. A 10TB table queried once costs far less than a 100GB table queried thousands of times daily.
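You can see the two cost dimensions independently in INFORMATION_SCHEMA; a sketch, assuming a dataset named base and data in the US region:

```sql
-- Storage side: what you pay to keep the data
SELECT
  table_name,
  total_logical_bytes / POW(10, 9) AS logical_gb
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_schema = 'base';

-- Compute side: what queries actually scanned over the last week
SELECT
  SUM(total_bytes_billed) / POW(10, 12) AS tb_billed_last_7_days
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY';
```

The two numbers move independently: archiving cold tables lowers the first, rewriting greedy queries lowers the second.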
Unlike Snowflake’s virtual warehouses or Redshift’s node clusters, there are no resources to provision, resize, or manage. BigQuery automatically allocates slots (virtual CPUs with memory and network capacity) based on query complexity. On-demand customers get access to up to 2,000 concurrent slots per project. Capacity-based pricing lets you reserve dedicated slots for predictable performance. You never see these slots or manage them directly. You just run queries.
Why columnar storage changes everything
The Capacitor format stores each column separately on disk. When you run SELECT user_id, event_date FROM events, BigQuery reads only those two columns. The other 47 columns in your table? Never touched. Never billed.
This makes SELECT * genuinely expensive. A table with 50 similarly sized columns costs roughly 50x more to fully scan than selecting a single column. That “convenience” query during development becomes a budget problem at scale. This cost structure should influence how you design tables and write models: wide denormalized tables aren’t inherently bad, but scanning them carelessly is.
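A dry run lets you measure the difference before paying for it: BigQuery reports the bytes a query would scan without executing it. A sketch with an illustrative dataset and table:

```shell
# Dry run: reports estimated bytes processed without running the query
bq query --use_legacy_sql=false --dry_run \
  'SELECT * FROM mydataset.events'

# Compare with an explicit column list — the estimate drops accordingly
bq query --use_legacy_sql=false --dry_run \
  'SELECT user_id, event_date FROM mydataset.events'
```

The same estimate appears in the BigQuery console’s query validator before you hit Run.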
BigQuery collects statistics during data encoding and can operate directly on compressed data without decompression overhead. The query optimizer often knows more about your data than you do, and can make intelligent decisions about execution strategy. This pays off with materialized views: BigQuery can automatically route queries to them without any configuration.
The resource hierarchy every analytics engineer should internalize
BigQuery’s resource hierarchy follows Google Cloud’s structure with BigQuery-specific additions:
```
Organization
└── Folders
    └── Projects
        └── Datasets
            └── Tables / Views / Materialized Views / Models
```
Each level serves a distinct purpose, and understanding these boundaries prevents costly mistakes.
Projects as the primary boundary
Projects form the fundamental organizational unit in Google Cloud. Each project has its own billing account, its own quotas, and its own API enablement. For BigQuery specifically, projects determine where query costs are billed.
When you query data, costs bill to the project running the query, not the project storing the data. If your data lives in project-warehouse but you run queries from project-analytics, the compute costs appear on project-analytics’s bill. Storage costs always stay with the project containing the data.
This separation enables powerful patterns. A central data team can manage raw data ingestion and storage in one project. Department teams can query that data from their own projects, with costs flowing to their budgets. Agencies can query client datasets while billing compute to their own accounts.
Projects also scope quotas. On-demand pricing gives each project up to 2,000 concurrent slots, with 20,000 slots shared across an organization. If your dev and prod environments share a project, a runaway development query can starve production dashboards of compute resources.
Datasets as access control boundaries
Datasets are BigQuery’s logical containers for tables and views. They don’t exist in the standard GCP resource hierarchy (you won’t find them in Cloud Resource Manager) but they’re the primary unit for access control and data organization.
Every table inherits permissions from its parent dataset. Grant bigquery.dataViewer on a dataset, and that principal can read all current and future tables within it. This inheritance makes datasets the natural boundary for access control: group tables by who should access them, not just by what they contain.
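Dataset-level access can be granted in SQL as well as in the console; a sketch, assuming an illustrative project, dataset, and group:

```sql
-- Grant read access on a dataset; every current and future table inherits it
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `my-project.sales_data`
TO "group:analysts@example.com";
```

Granting to groups rather than individual users keeps the policy stable as people join and leave teams.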
Several dataset properties deserve attention:
Location is set at creation and cannot be changed afterward. This is the most consequential and least reversible decision you’ll make. A dataset created in US will remain in US forever. Moving it requires exporting to Cloud Storage, creating a new dataset in the correct region, reloading all data, and updating every downstream reference. Alternatively, you can copy the dataset cross-region, but copies are limited to 1,000 tables per day.
Default table expiration automatically deletes tables after a specified period. Useful for development datasets where you want automatic cleanup of abandoned experiments.
Labels enable cost tracking and resource organization. Apply labels consistently across datasets to enable meaningful cost analysis by team, project, or environment.
Storage billing model lets you choose between logical bytes (what your data represents) and physical bytes (what’s actually stored after compression). Physical billing often reduces costs by 60-80% for highly compressible data, but requires opting in per dataset.
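All of these properties can be set with DDL; a hedged sketch with illustrative project and dataset names:

```sql
-- Location, expiration, and labels are fixed or set at creation
CREATE SCHEMA `my-project.dev_sandbox`
OPTIONS (
  location = 'US',
  default_table_expiration_days = 7,  -- auto-clean abandoned experiments
  labels = [('team', 'analytics'), ('env', 'dev')]
);

-- The storage billing model is opted into per dataset
ALTER SCHEMA `my-project.dev_sandbox`
SET OPTIONS (storage_billing_model = 'PHYSICAL');
```

Before switching to physical billing, compare total_logical_bytes and total_physical_bytes in INFORMATION_SCHEMA.TABLE_STORAGE to estimate the actual savings for your data.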
Tables and views
Standard tables store data. Views store queries that execute on read. Materialized views store precomputed query results that BigQuery maintains automatically.
The distinction between standard views and materialized views matters more than it might seem. A standard view runs its defining query every time someone reads from it. If that view aggregates a billion rows, every dashboard refresh pays for that aggregation. A materialized view precomputes results and refreshes incrementally, often reducing both latency and cost by orders of magnitude.
BigQuery’s smart tuning feature automatically routes queries against base tables to use materialized views when beneficial. If you create a materialized view aggregating daily sales, queries that request compatible aggregations will use the materialized view even if they reference the base table directly. Zero configuration required.
Naming conventions that scale
Naming feels like bikeshedding until you’re managing 400 models across 50 datasets and can’t remember which orders table is authoritative. Establish conventions early.
Use snake_case universally. BigQuery is case-sensitive, and mixing conventions creates confusion and bugs. Use semantic prefixes that communicate a table’s role in the transformation pipeline:
| Prefix | Purpose | Example |
|---|---|---|
| raw_ or src_ | Untransformed source data | raw_shopify_orders |
| base__ | Cleaned, renamed, typed | base__shopify__orders |
| int__ | Intermediate transformations | int__orders__daily_aggregates |
| mrt__ | Business-specific data marts | mrt__marketing__attribution |
The double-underscore convention (base__shopify__orders) makes the source system unambiguous and scales elegantly. When you have 30 source systems, seeing at a glance that base__shopify__orders and base__netsuite__orders come from different sources saves real time.
Regional decisions that lock you in
BigQuery offers multi-regions (US, EU) and 40+ single regions globally. Your location choice is permanent, making this one of the highest-stakes architectural decisions you’ll make.
Multi-region vs. single region
Multi-regions store data across multiple data centers within a geography. The US multi-region might place your data in Iowa, Oregon, or Oklahoma. You don’t control which, but Google guarantees it stays within the United States. Multi-regions offer higher availability and larger slot quotas.
Single regions provide data residency guarantees for specific geographic points. Data in europe-west1 stays in Belgium. Period. This matters for regulatory compliance but provides less redundancy.
Pricing is generally equivalent across regions for most workloads. Don’t choose regions based on cost; choose based on requirements.
The constraint that blocks everything
The rule that causes the most pain: all datasets joined in a single query must be in the same location.
You cannot join a table in US with a table in EU. You cannot join us-central1 with us-east1. Attempting a cross-region join fails immediately with a “dataset not found” error that doesn’t clearly explain the actual problem.
A single misplaced dataset can block your entire analytics workflow. If your core tables are in US and someone creates a reference table in EU, every query joining them fails. The fix requires recreating the dataset in the correct region, which means exporting data, creating new resources, reloading, and updating all references.
How to choose your region
Make this decision once, document it, and enforce it:
Regulatory requirements often mandate specific regions. GDPR compliance might require EU-only storage for European customer data. Healthcare regulations might require specific geographic boundaries. Check with your legal team.
Proximity to users affects dashboard latency. If your analysts are in Paris, data in europe-west1 loads faster than data in us-central1. This matters less for batch analytics but significantly impacts interactive exploration.
Co-location with data sources improves load performance. Cloud Storage buckets and BigQuery datasets in the same region transfer data faster and cheaper. If your event pipeline lands data in a GCS bucket in us-central1, your raw datasets should be there too.
Organizational standards matter most. Pick one region (or one multi-region) and use it everywhere. The benefits of regional flexibility rarely outweigh the risk of cross-region mistakes.
Cross-region data transfer costs range from $0.02 to $0.14 per GiB depending on geography. But the operational overhead of managing multi-region architectures (the risk of mistakes, the complexity of data movement) usually exceeds these costs. For most analytics engineering teams, consolidating everything in a single region is the right call.
Multi-environment architectures that actually work
Every team needs at least two environments: somewhere to develop and somewhere to serve production. Most need three: development, staging, and production. The question is how to implement this separation in BigQuery.
Google Cloud explicitly recommends separate GCP projects to separate environments. This provides the strongest isolation: independent billing, quotas, security boundaries, and slot reservations. But it’s not the only option, and simpler approaches can work for smaller teams.
Pattern A: Separate projects per environment
```
Organization
└── Data Platform Folder
    ├── analytics-dev
    ├── analytics-staging
    └── analytics-prod
```
Each environment gets its own project. Development work happens in analytics-dev. Staging validates changes in analytics-staging. Production serves dashboards and reports from analytics-prod.
This separation provides independent billing so you can see exactly what development costs versus production. Each project has its own quotas, so a runaway dev query can’t impact production. Security boundaries are clear: production service accounts never touch dev data, and vice versa. If you use capacity pricing, you can reserve dedicated slots for production while dev uses on-demand.
The downside is complexity. You manage more service accounts, more IAM policies, more profiles.yml targets. For teams larger than three or four people, this complexity is worth it. For smaller teams, simpler options exist.
Pattern B: Separate datasets within one project
```
analytics-data
├── dev_base
├── dev_marts
├── prod_base
└── prod_marts
```
A single project contains all environments, separated by dataset naming. Development models write to dev_* datasets; production writes to prod_* datasets.
This is simpler to manage. One billing account, one set of quotas, fewer service accounts. Dataset-level permissions still isolate access: you can grant developers write access to dev_* datasets while restricting production datasets to automated service accounts.
The downsides are shared quotas (heavy dev usage affects prod) and less clear billing separation. This pattern works well for small teams, early-stage projects, or situations where environment isolation isn’t critical.
Pattern C: Central data lake with department data marts
```
data-lake-storage (central team)
└── raw datasets, base datasets
        ↓ (Data Viewer access)
├── marketing-analytics (Marketing team)
├── finance-analytics (Finance team)
└── product-analytics (Product team)
```
A central data engineering team manages ingestion, cleaning, and core modeling in a dedicated storage project. Department teams run their own transformations and queries in separate projects, building domain-specific marts.
Storage costs go to the central project. Compute costs flow to department billing accounts. This cleanly separates platform costs (“how much does our data infrastructure cost?”) from consumption costs (“how much does the marketing team spend on analytics?”).
The central team grants bigquery.dataViewer on curated datasets to department projects. Each department manages their own dbt projects, their own models, their own dashboards, sourcing from the central data lake.
This pattern requires more organizational coordination but scales to large enterprises with chargeback requirements.
Cross-project querying
Querying across projects requires no special configuration. Just use fully qualified table names:
```sql
SELECT
  c.customer_id,
  c.customer_name,
  o.order_id,
  o.order_date
FROM `other-project.shared_dataset.mrt__sales__customers` c
JOIN `my-project.local_dataset.mrt__sales__orders` o
  ON c.customer_id = o.customer_id
```
Same-region cross-project queries incur zero network transfer costs. The data doesn’t move between projects; BigQuery routes the query to where the data lives.
Permissions required: bigquery.dataViewer on the source dataset (to read the data) and bigquery.jobUser on the project running the query (to execute queries). The querying project pays for compute.
IAM patterns that follow least privilege
BigQuery’s permission model is more nuanced than it first appears. Getting it right means understanding the difference between data access and compute access, and granting each at the appropriate scope.
Predefined roles to use
Google provides BigQuery-specific predefined roles. Use these instead of basic roles (Owner, Editor, Viewer), which predate IAM and grant excessive, poorly-scoped access.
| Role | What it grants | Typical scope |
|---|---|---|
| bigquery.dataViewer | Read tables and views | Dataset |
| bigquery.dataEditor | Create, update, delete tables | Dataset |
| bigquery.jobUser | Run queries and jobs | Project |
| bigquery.user | List datasets + run queries | Project |
| bigquery.admin | Full BigQuery control | Rarely needed |
The essential pattern separates data access from compute access:
Project-level: Grant bigquery.jobUser to give users the ability to run queries. This is the “compute” permission. It lets them execute jobs but doesn’t determine what data they can access.
Dataset-level: Grant bigquery.dataViewer or bigquery.dataEditor to control what data users can see and modify. This is the “data” permission. It determines which tables are accessible.
A user with jobUser on project-analytics and dataViewer on project-warehouse.sales_data can run queries in project-analytics that read from project-warehouse.sales_data. They cannot read from datasets where they lack dataViewer. They cannot run queries in projects where they lack jobUser.
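Expressed with the gcloud and bq CLIs, the same pattern might look like this (project names and the user are illustrative):

```shell
# Compute: allow the user to run jobs in the analytics project
gcloud projects add-iam-policy-binding project-analytics \
  --member='user:ada@example.com' \
  --role='roles/bigquery.jobUser'

# Data: allow the same user to read one dataset in the warehouse project
bq query --use_legacy_sql=false \
  'GRANT `roles/bigquery.dataViewer`
   ON SCHEMA `project-warehouse.sales_data`
   TO "user:ada@example.com"'
```

The two grants live at different scopes, which is exactly what makes the separation of compute and data access work.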
Service account strategy
Create dedicated service accounts for each function in your data stack:
```
sa-dbt-dev@analytics-dev.iam.gserviceaccount.com
sa-dbt-prod@analytics-prod.iam.gserviceaccount.com
sa-fivetran@analytics-prod.iam.gserviceaccount.com
sa-looker@analytics-prod.iam.gserviceaccount.com
sa-airflow@analytics-prod.iam.gserviceaccount.com
```
Each service account gets only the permissions it needs. The dbt production service account needs bigquery.dataEditor (to create tables) and bigquery.jobUser (to run queries). The Looker service account needs only bigquery.dataViewer. It should never create or modify tables.
Rotate service account keys every 90 days if you’re using key files. Better yet, use Workload Identity Federation when running in GCP environments. Your Cloud Run or GKE workloads can authenticate without managing keys at all.
What to avoid
Basic roles (Owner, Editor, Viewer) grant broad permissions across all GCP services, not just BigQuery. An Editor can modify Cloud Storage buckets, Compute Engine instances, and everything else. Never use these for BigQuery access control.
Project-level bigquery.admin gives full control over all BigQuery resources in a project. Most users don’t need this. Most service accounts definitely don’t need this. Grant admin access only to the small number of people who actually administer BigQuery settings.
Overly broad dataset access happens when teams grant dataViewer on all datasets to avoid permission request tickets. This defeats the purpose of dataset-level isolation. If someone needs access to specific data, grant access to that specific dataset.
Cross-organization patterns for agencies and consultancies
Agencies managing multiple client BigQuery environments face unique challenges. Each client should have their own GCP organization with their own projects, billing, and security requirements. The agency delivers analytics services by working within client organizations, not by centralizing client data in agency infrastructure.
The recommended pattern: Work in client organizations
Each client owns their GCP organization. The agency gets access as needed:
```
Client Alpha Organization (owned by client)
├── client-alpha-analytics
│   └── All BigQuery resources, dbt models, data
└── IAM: Agency team granted access

Client Beta Organization (owned by client)
├── client-beta-analytics
│   └── All BigQuery resources, dbt models, data
└── IAM: Agency team granted access

Agency Organization (agency-owned)
└── agency-internal
    └── Internal tooling, templates, documentation only
```
This separation provides clear data ownership. Client data stays in client infrastructure. When engagements end, clients retain full control without data migration. Billing flows naturally to client accounts. Compliance requirements (data residency, access controls) are the client’s responsibility to define and the agency’s responsibility to follow.
The agency team authenticates to each client organization separately. Maintain separate service accounts per client, separate credentials, separate dbt profiles. This adds overhead but prevents cross-client data leakage and keeps audit trails clean.
Fallback: Project-per-client in agency organization
Some clients can’t or won’t maintain their own GCP organization. Small businesses without IT infrastructure, short-term engagements, or clients who explicitly want the agency to handle everything. In these cases, create client projects within the agency’s organization:
```
Agency Organization
├── agency-admin
│   └── Audit logs, billing exports, shared tooling
├── client-alpha-analytics
│   └── All BigQuery resources for Client Alpha
└── client-beta-analytics
    └── All BigQuery resources for Client Beta
```
This pattern requires extra care. Document clearly that the agency owns the infrastructure. Define data ownership and export rights in contracts. Plan for offboarding: if a client wants to leave, you’ll need to export their data and potentially migrate their dbt project to their own organization.
Billing can flow to the client’s billing account (linked to the agency project) or the agency can invoice based on costs. Either way, use labels and separate projects to track costs accurately per client.
Analytics Hub for formal data sharing
Analytics Hub enables zero-copy data sharing across organizational boundaries. It’s particularly useful when clients want to share data with the agency (or vice versa) without copying data or granting direct dataset access.
A data owner publishes a dataset to an Analytics Hub listing. Subscribers in other organizations request access to the listing. Once approved, they get a linked dataset, a read-only reference that points to the source data without copying it.
Storage costs stay with the publisher. Query costs go to the subscriber. The subscriber cannot modify data, cannot set IAM on individual tables, cannot do anything except read. Publishers can restrict data egress entirely, blocking EXPORT, CREATE TABLE AS SELECT, and copy operations.
For agency work, create a private exchange within Analytics Hub. Add specific principals (users, groups, or service accounts) as subscribers. Clients can then subscribe to listings and query through their linked datasets. All access is logged for auditing.
Authorized views for controlled access
Authorized views solve a common problem: exposing filtered or aggregated data without granting access to underlying tables. They work across organizational boundaries.
The pattern:
- Create a view in a dedicated dataset that queries sensitive source tables
- Authorize the view’s dataset to access the source dataset
- Grant users access to the view dataset only
```sql
-- In dataset: shared_views
CREATE VIEW shared_views.analyst_orders AS
SELECT
  order_id,
  order_date,
  total_amount
  -- customer_ssn excluded
  -- internal_notes excluded
FROM sensitive_data.mrt__sales__orders
WHERE region = 'EMEA';  -- Row filtering
```
The view references sensitive_data.mrt__sales__orders, but users only need access to shared_views. In BigQuery’s console, you authorize shared_views to access sensitive_data. Anyone who can query shared_views.analyst_orders gets the filtered, column-restricted view without any permissions on sensitive_data.
Authorized datasets extend this concept. Instead of authorizing individual views, you authorize an entire dataset. All current and future views in that dataset can access the source data. This reduces management overhead when you’re creating many views for different access patterns.
Row-level and column-level security
For truly multi-tenant environments where different users see different rows of the same table, BigQuery offers row-level security:
```sql
CREATE ROW ACCESS POLICY client_a_only
ON multi_tenant.mrt__shared__orders
GRANT TO ('group:client-a-analysts@clienta.com')
FILTER USING (client_id = 'CLIENT_A');

CREATE ROW ACCESS POLICY client_b_only
ON multi_tenant.mrt__shared__orders
GRANT TO ('group:client-b-analysts@clientb.com')
FILTER USING (client_id = 'CLIENT_B');
```
Client A analysts querying multi_tenant.mrt__shared__orders see only their rows. Client B sees only theirs. The table is the same; the access differs.
Column-level security uses Data Catalog policy tags. You create a tag (say, PII), apply it to sensitive columns, then grant specific users the Fine-Grained Reader role for that tag. Users without the role see those columns as null or receive access errors.
Dynamic data masking takes this further. Users with Masked Reader role see hashed or partially redacted values instead of nulls. Useful for scenarios where analysts need to see that a value exists and can join on it, but shouldn’t see the actual content.
dbt configurations that work with BigQuery
The dbt-bigquery adapter maps dbt concepts to BigQuery resources. Understanding this mapping helps you configure dbt correctly for your architecture.
dbt’s database config (the top-level container in other adapters) maps to BigQuery’s project. dbt’s schema maps to BigQuery’s dataset. When you set schema: base in dbt, you’re targeting a BigQuery dataset called base.
profiles.yml for multiple environments
A typical BigQuery setup has separate targets for development and production:
```yaml
my_analytics:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth  # Uses your gcloud credentials
      project: analytics-dev
      dataset: "dbt_{{ env_var('USER', 'developer') }}"
      location: US
      threads: 4

    prod:
      type: bigquery
      method: service-account
      project: analytics-prod
      dataset: analytics
      location: US
      threads: 8
      keyfile: /secrets/sa-dbt-prod.json
      job_execution_timeout_seconds: 600
      maximum_bytes_billed: 10000000000  # 10 GB limit
```
Several configurations matter significantly:
location must match your existing datasets. BigQuery doesn’t allow cross-region operations, so this must align with your architectural decisions. If your datasets are in EU, set location: EU.
execution_project (not shown above) bills query costs to a different project than where models materialize. Useful for the “central data lake + department marts” pattern where you want compute costs flowing to department budgets.
job_execution_timeout_seconds prevents runaway queries. By default, BigQuery lets a query job run for up to six hours. Set an explicit limit based on your longest expected legitimate query.
maximum_bytes_billed sets a hard cap on bytes scanned per query. If a query would exceed this limit, it fails before scanning. This catches malformed queries that would scan entire tables unexpectedly.
Service account patterns by deployment method
Different deployment contexts call for different authentication approaches:
Local development: Use OAuth via gcloud auth application-default login. Developers authenticate as themselves, and their personal identity appears in audit logs.
dbt Cloud: Upload a service account key file in connection settings. dbt Cloud manages the credential securely.
Cloud Run or GKE: Use Workload Identity. The service gets a GCP identity automatically with no key files to manage. In profiles.yml, set method: oauth and the environment provides credentials.
GitHub Actions: Store the service account JSON as a repository secret. Inject it as an environment variable during workflow execution.
Service account impersonation: The recommended pattern for sophisticated local development. Developers authenticate via OAuth but impersonate a service account with appropriate permissions:
```yaml
dev:
  type: bigquery
  method: oauth
  project: analytics-dev
  dataset: dbt_dev
  impersonate_service_account: sa-dbt-dev@analytics-dev.iam.gserviceaccount.com
```
This avoids distributing service account keys while maintaining clear permission boundaries. The developer’s identity appears in audit logs as the actor, with the impersonated service account noted.
The generate_schema_name macro
By default, dbt creates dataset names by combining your target schema with any custom schema: <target_schema>_<custom_schema>. If your target schema is dbt_alice and a model specifies schema: base, the model lands in dbt_alice_base.
This works for development but creates awkward production dataset names. You probably want production models in base, not analytics_base.
Override the default behavior with dbt’s built-in alternative:
```sql
-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {{ generate_schema_name_for_env(custom_schema_name, node) }}
{%- endmacro %}
```
This applies custom schemas in production only: in prod the model lands in base (custom schema alone, no prefix), while in dev every model lands in the developer’s target schema (dbt_alice) and the custom schema is ignored. The built-in generate_schema_name_for_env macro switches on target.name == 'prod'.
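If you’d rather keep environment prefixes in development (dbt_alice_base) while dropping them in production, a custom variant is a common pattern; a hedged sketch:

```sql
-- macros/generate_schema_name.sql
-- Sketch: prefix custom schemas in dev, use them bare in prod
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- elif target.name == 'prod' -%}
        {{ custom_schema_name | trim }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```

The target name check ('prod' here) must match whatever your production target is actually called in profiles.yml.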
Job labels for cost attribution
BigQuery jobs can carry labels that appear in billing exports and INFORMATION_SCHEMA. dbt can automatically apply labels identifying which model generated each query:
```yaml
query-comment:
  comment: "{{ query_comment(node) }}"
  job-label: true
```
With this enabled, every query dbt runs carries labels like dbt_model: base__shopify__orders. Query INFORMATION_SCHEMA.JOBS_BY_PROJECT to analyze costs by model:
```sql
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'dbt_invocation_id') AS run_id,
  (SELECT value FROM UNNEST(labels) WHERE key = 'dbt_model') AS model,
  COUNT(*) AS query_count,
  SUM(total_bytes_billed) / POW(10, 9) AS total_gb_billed,
  SUM(total_slot_ms) / 1000 AS total_slot_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY 1, 2
ORDER BY total_gb_billed DESC
LIMIT 50;
```
This query shows which models consume the most resources, enabling targeted optimization. That mrt__analytics__events model consuming 80% of your compute budget? Now you know where to focus.
Anti-patterns that destroy budgets and block deployments
Architectural mistakes compound. A bad decision in month one becomes a crisis in month six when you need capabilities that your structure doesn’t support. Here are the patterns that cause the most pain.
The SELECT * trap
It seems convenient. It’s catastrophically expensive.
SELECT * scans every column in a table. In BigQuery’s columnar storage, column count directly correlates with cost. A table with 100 similarly sized columns costs roughly 100x more to scan per row than selecting a single column.
This matters most in base models. A common pattern:
```sql
-- Bad: scans all 87 columns
SELECT * FROM {{ source('shopify', 'orders') }}
```
Better:
```sql
-- Good: scans only needed columns
SELECT
  order_id,
  customer_id,
  created_at,
  total_price,
  financial_status
FROM {{ source('shopify', 'orders') }}
```
The explicit version scans 5 columns instead of 87, roughly a 94% cost reduction for the same row count. Multiply by every base model, every incremental run, every full refresh, and the savings are substantial.
Dataset sprawl and naming chaos
This anti-pattern emerges slowly. Someone creates test_data for an experiment. Someone else creates backup_orders_v2. Six months later, you have 40 datasets and nobody knows which customers table is authoritative.
Symptoms include:
- Datasets named test, temp, backup, old, new, v2, final, final_v2
- The same table name appearing in multiple datasets
- No documentation of which datasets are active
- IAM grants scattered across dozens of datasets with no clear pattern
The fix starts with an audit. Query INFORMATION_SCHEMA.SCHEMATA to list all datasets. Check access logs to see which are actually queried. Archive unused datasets to a separate project (don’t delete, you might need that data someday). Establish naming conventions and enforce them through code review.
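The audit’s first step can be a single query (note that a region qualifier like region-us only lists datasets stored in that region):

```sql
-- List every dataset in the region with creation and last-modified times
SELECT
  schema_name,
  location,
  creation_time,
  last_modified_time
FROM `region-us`.INFORMATION_SCHEMA.SCHEMATA
ORDER BY last_modified_time;
```

Cross-reference the oldest entries against INFORMATION_SCHEMA.JOBS to confirm nothing still queries them before archiving.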
Region mixing catastrophe
You chose US for your core datasets. A contractor created a reference table in EU. Now every query joining them fails.
The error message isn’t helpful: “Not found: Dataset.” It doesn’t say “you’re trying to cross regions.” You debug for hours before realizing the issue.
The fix is painful:
- Export the misplaced data to Cloud Storage
- Create a new dataset in the correct region
- Load data from Cloud Storage into the new dataset
- Update all references to the table
- Delete the old dataset
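With the bq CLI, those steps look roughly like this (dataset, table, and bucket names are illustrative; the extract bucket must be colocated with the source dataset):

```shell
# 1. Export the misplaced table to Cloud Storage
bq extract --destination_format=PARQUET \
  'my-project:misplaced_eu.ref_table' \
  'gs://my-migration-bucket/ref_table/*.parquet'

# 2. Create a replacement dataset in the standard region
bq mk --dataset --location=US my-project:ref_data

# 3. Reload from Cloud Storage
bq load --source_format=PARQUET \
  'my-project:ref_data.ref_table' \
  'gs://my-migration-bucket/ref_table/*.parquet'

# 4. After updating downstream references, remove the old dataset
bq rm -r -d my-project:misplaced_eu
```

Repeat the extract/load pair per table, or script it over a table list for larger datasets.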
Prevention is easier: establish a single region as standard, document it prominently, and validate in CI that all new datasets use the correct location.
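A simple CI guard, assuming US is your standard: query the regions you don’t use and fail the build if anything comes back.

```sql
-- Stray-dataset check: run against each region you do NOT use (here: EU).
-- Any row returned is a dataset created outside the agreed region.
SELECT schema_name, location
FROM `region-eu`.INFORMATION_SCHEMA.SCHEMATA;
```

An empty result means no strays in that region; wire the check into the same pipeline that deploys your dbt project.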
IAM over-permissioning
“Just make everyone an admin so we don’t get blocked by permission issues.”
This creates security risks, audit noise, and confusion. When everyone has full control, nobody knows who changed what or why. When a credential leaks, the blast radius is everything.
The pattern usually emerges from friction: someone needed access, the quickest fix was broad permissions, and nobody cleaned it up. The solution is treating IAM as code: define permissions in Terraform or similar, require review for changes, and audit regularly.
Start by identifying over-permissioned principals:
- Any user with `bigquery.admin` at the project level (most don't need it)
- Service accounts with `dataEditor` on datasets they only read
- Basic roles (Owner/Editor/Viewer) granted anywhere
Downgrade to the minimum required permissions. Yes, this might cause temporary access issues. Better to fix those intentionally than to discover your permissions model is fictional.
Missing cost controls
BigQuery’s defaults are permissive. Without explicit limits, a single malformed query can scan petabytes. Without timeouts, a query can run indefinitely, consuming slots and potentially holding locks.
Both maximum_bytes_billed and job_execution_timeout_seconds default to unlimited. Set explicit limits on every target in profiles.yml:
```yaml
prod:
  type: bigquery
  # ... other settings ...
  maximum_bytes_billed: 50000000000      # 50 GB
  job_execution_timeout_seconds: 1800    # 30 minutes
```

Also consider project-level quotas in the GCP console. You can set daily limits on bytes scanned, providing a backstop even if individual query limits fail.
Not using materialized views for repeated queries
A BI dashboard runs the same aggregation query every time someone loads it. Ten analysts loading it ten times daily means 100 executions of identical computation.
Materialized views solve this. Create a view that precomputes the aggregation:
```sql
CREATE MATERIALIZED VIEW dashboard_data.daily_sales_summary AS
SELECT
  DATE(order_date) AS order_day,
  region,
  COUNT(*) AS order_count,
  SUM(total_amount) AS total_revenue
FROM transactions.mrt__sales__orders
GROUP BY 1, 2;
```

BigQuery maintains this automatically. When underlying data changes, BigQuery incrementally updates the materialized view. Dashboard queries hit the precomputed results instead of re-aggregating raw data.
BigQuery’s smart tuning can also route queries against the base table to use the materialized view automatically. A query asking for SUM(total_amount) grouped by DATE(order_date) might use the materialized view even if it references transactions.mrt__sales__orders directly.
Best practices synthesis
This is a lot to absorb. A condensed checklist for reference:
Architectural decisions to make early
These choices have long-term consequences and are hard to change later:
- Pick one region and use it everywhere. Document your choice. Validate in CI.
- Separate projects for environments when teams exceed 3-4 people or when billing isolation matters.
- Use folder hierarchy to group related projects and enable IAM inheritance.
- Establish naming conventions for datasets and tables. Enforce through code review.
Cost controls to implement immediately
These prevent expensive surprises:
- Set `maximum_bytes_billed` on all dbt targets. 10-50 GB is reasonable for most workflows; adjust based on your largest legitimate queries.
- Configure job timeouts via `job_execution_timeout_seconds`. 30-60 minutes catches runaway queries without blocking legitimate long-running jobs.
- Enable job labels to track costs by model. Without attribution, optimization is guesswork.
- Monitor via INFORMATION_SCHEMA.JOBS. Build a dashboard or scheduled query alerting on anomalies.
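As a starting point for attribution, a sketch of a query against `INFORMATION_SCHEMA.JOBS` (the `dbt_model` label key and the `region-us` qualifier are assumptions; match them to your labeling scheme and region):

```sql
-- Bytes billed per dbt model over the last 7 days
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'dbt_model') AS model,
  SUM(total_bytes_billed) / POW(1024, 3) AS gb_billed,
  COUNT(*) AS job_count
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY model
ORDER BY gb_billed DESC;
```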
Security patterns to follow
These keep data protected without creating friction:
- Predefined roles only. Never use basic roles (Owner/Editor/Viewer) for BigQuery access.
- Dataset-level for data permissions. Grant `dataViewer` or `dataEditor` on specific datasets, not project-wide.
- Project-level for compute permissions. Grant `jobUser` on projects where users should run queries.
- Separate service accounts per function. dbt, Fivetran, Looker, and Airflow each get their own identity.
- Authorized views for controlled sharing. Expose subsets of data without granting access to underlying tables.
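A minimal sketch of the authorized-view pattern, with illustrative dataset and column names:

```sql
-- View in a shareable dataset exposing only non-sensitive columns
CREATE VIEW shared_data.customer_summary AS
SELECT customer_id, signup_date, lifetime_value
FROM private_data.customers;
```

Granting `dataViewer` on `shared_data` lets consumers query the view; the view must also be added to `private_data`'s authorized views list (via the console, `bq update`, or Terraform) so it can read the underlying table without consumers having any access to it.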
dbt-specific practices
These make dbt work smoothly with your architecture:
- Override `generate_schema_name` for clean production dataset names without prefixes.
- Use service account impersonation for local development. Avoids key distribution while maintaining audit trails.
- Implement slim CI with `state:modified` + `--defer` to test only changed models against production state.
- Partition large tables by date. Cluster on frequently filtered columns. (More on this in the partitioning article.)
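The impersonation setup is a small profiles.yml change; a sketch with illustrative names, assuming a dbt-bigquery version that supports `impersonate_service_account` on OAuth-based targets:

```yaml
dev:
  type: bigquery
  method: oauth                # developer's own gcloud credentials
  impersonate_service_account: dbt-dev@my-project.iam.gserviceaccount.com
  project: my-project
  dataset: dbt_dev
```

Developers authenticate as themselves (`gcloud auth application-default login`) and need only `roles/iam.serviceAccountTokenCreator` on the service account; no key files are distributed, and audit logs record which human used which identity.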
Where this foundation leads
This architectural understanding sets the stage for optimization. Without knowing how BigQuery organizes resources, discussions of partitioning and clustering are abstract. Without understanding projects and billing, slot reservations don’t make sense. Without grasping IAM boundaries, security recommendations lack context.
Get the architecture right, and everything else becomes optimization. Get it wrong, and you’ll be rebuilding rather than refining.