GCP offers four distinct processing engines for data transformation workloads: BigQuery SQL, Dataflow, Dataproc, and Dataproc Serverless. The right selection depends on team expertise and workload characteristics, not on data volume thresholds alone.
The Four Options
BigQuery SQL covers a larger share of transformation workloads than teams often expect. BigQuery Editions pricing with the autoscaler provisions slots in real time with per-second billing, making BigQuery cost-competitive with Spark for many transformation patterns. The operational characteristics are simpler: no cluster management, no Docker image maintenance for the compute layer, no Spark tuning.
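To see why per-second slot billing changes the comparison, consider a rough back-of-envelope sketch. The prices and workload numbers below are illustrative assumptions, not current GCP list prices:

```python
# Back-of-envelope cost sketch: autoscaled BigQuery slots vs. a long-lived
# Spark cluster. All prices here are illustrative assumptions only.

def bigquery_editions_cost(slot_seconds: float, price_per_slot_hour: float) -> float:
    """Editions bills autoscaled slots per second, so idle time costs nothing."""
    return (slot_seconds / 3600) * price_per_slot_hour

def cluster_cost(cluster_hours: float, price_per_cluster_hour: float) -> float:
    """A warm cluster bills for wall-clock time, including idle gaps."""
    return cluster_hours * price_per_cluster_hour

# A transformation that needs 500 slots for 10 minutes:
bq = bigquery_editions_cost(slot_seconds=500 * 600, price_per_slot_hour=0.04)

# The same job on a cluster kept warm across an 8-hour batch window:
spark = cluster_cost(cluster_hours=8, price_per_cluster_hour=5.0)

print(f"BigQuery: ${bq:.2f}, cluster: ${spark:.2f}")
```

The exact numbers don't matter; the shape does. Paying only for slot-seconds consumed, rather than for cluster wall-clock time, is what makes BigQuery competitive for bursty transformation workloads.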
The three-layer dbt architecture (base, intermediate, mart) maps directly to transformation patterns that BigQuery executes efficiently:
```sql
-- BigQuery handles complex transformations that teams might reflexively reach for Spark to do
SELECT
  user_id,
  DATE(event_timestamp) AS event_date,
  COUNT(DISTINCT session_id) AS sessions,
  ARRAY_AGG(
    STRUCT(event_name, event_timestamp)
    ORDER BY event_timestamp
    LIMIT 100
  ) AS event_sequence,
  SUM(revenue) AS total_revenue
FROM events
WHERE event_date BETWEEN '2026-01-01' AND CURRENT_DATE()
GROUP BY 1, 2
```

BigQuery manages compute scaling automatically. You don’t configure parallelism, partition counts, or shuffle partitions.
Dataflow excels for new pipelines built on Apache Beam, particularly when you need unified batch and streaming processing from the same code. Its serverless model eliminates cluster management entirely — you deploy a pipeline definition, Dataflow handles provisioning, scaling, and teardown.
The primary use cases:
- Unified batch/stream pipelines where the same logic processes both historical backfills and real-time events
- Complex windowing (sliding windows, session windows, custom triggers) where Beam’s model is cleaner than SQL
- Pipelines where you want a single codebase for batch and streaming without maintaining separate systems
```python
# Dataflow pipeline with Apache Beam -- same code runs batch and streaming
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/temp',
    streaming=False  # Change to True for streaming mode
)

with beam.Pipeline(options=options) as p:
    events = (
        p
        | 'ReadFromBigQuery' >> beam.io.ReadFromBigQuery(
            query='SELECT * FROM `project.dataset.raw_events`',
            use_standard_sql=True
        )
        | 'EnrichEvents' >> beam.Map(enrich_event)  # enrich_event: user-defined transform
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
            table='project:dataset.enriched_events',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
        )
    )
```

Dataproc is the right choice when you’re migrating existing Spark workloads, running ML workflows with Spark MLlib, or need custom cluster configurations that serverless options don’t support. It provisions fully managed Hadoop/Spark clusters on GCP, giving you control over cluster size, configuration, and installed packages.
Key Dataproc use cases:
- Migrating on-premises Spark jobs to GCP without rewriting them
- ML feature engineering pipelines that already use PySpark and MLlib
- Workloads requiring specific Spark configurations, custom libraries, or non-standard package versions
- Jobs that need access to HDFS-compatible storage patterns
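The migration pattern for existing jobs can be sketched with the `google-cloud-dataproc` client: the Spark code itself is untouched, and only the submission target changes. The project, cluster, and GCS paths below are illustrative, and the commented client call assumes the `google-cloud-dataproc` package:

```python
# Sketch: submitting an already-written PySpark job to a Dataproc cluster.
# Names (project, cluster, bucket paths) are illustrative assumptions.

def build_pyspark_job(cluster_name: str, main_py_uri: str, args: list) -> dict:
    """Assemble the Dataproc job payload; the migrated Spark code is unchanged."""
    return {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": main_py_uri, "args": args},
    }

job = build_pyspark_job(
    cluster_name="etl-cluster",
    main_py_uri="gs://my-bucket/jobs/feature_engineering.py",
    args=["--date=2026-01-01"],
)

# Requires: pip install google-cloud-dataproc
# from google.cloud import dataproc_v1
# client = dataproc_v1.JobControllerClient(
#     client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
# )
# client.submit_job(
#     request={"project_id": "my-project", "region": "us-central1", "job": job}
# )
```

This is the sense in which you migrate the infrastructure, not the code: the `pyspark_job` payload just points at the existing script in GCS.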
Dataproc Serverless bridges the gap — it runs Spark batch jobs without cluster management, accepting less configurability in exchange for zero provisioning overhead. You submit a PySpark job, it runs, you pay for the duration. No cluster to create, size, or shut down.
```bash
# Submit a Spark batch job to Dataproc Serverless
gcloud dataproc batches submit pyspark my_transformation.py \
  --region=us-central1 \
  --deps-bucket=gs://my-staging-bucket \
  --jars=gs://my-bucket/jars/custom-lib.jar \
  -- \
  --input-table=project.dataset.raw_data \
  --output-table=project.dataset.processed_data
```

The Decision Framework
- SQL and dbt team, no existing Spark workloads: BigQuery as the default for transformation. Staying in SQL avoids context switching and the operational burden of debugging and managing clusters. Most BigQuery performance issues are addressable through query optimization and partitioning rather than by switching engines.
- Existing Spark workloads to migrate: Dataproc. Rewriting working Spark jobs in SQL for platform uniformity rarely pays off; migrate the infrastructure, not the code.
- Unified batch/stream processing from a single codebase: Dataflow with Apache Beam. For simpler streaming requirements — append-only ingestion, basic filtering — BigQuery Streaming API or Pub/Sub + BigQuery subscriptions are cheaper.
- Occasional Spark without cluster management: Dataproc Serverless for batch jobs.
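The framework above is mechanical enough to express as code. A toy encoding, purely as a heuristic summary of the bullets above, not a policy engine:

```python
# Toy encoding of the decision framework above. The ordering mirrors the
# bullets: existing Spark code and streaming needs override the SQL default.

def pick_engine(team_is_sql_first: bool,
                has_existing_spark: bool,
                needs_unified_batch_stream: bool,
                occasional_spark: bool) -> str:
    if has_existing_spark:
        return "Dataproc"             # migrate the infrastructure, not the code
    if needs_unified_batch_stream:
        return "Dataflow"             # one Beam codebase for batch + streaming
    if occasional_spark:
        return "Dataproc Serverless"  # Spark without cluster management
    # Default for SQL-first teams -- and a sensible default overall:
    return "BigQuery"
```

Note what's absent: no branch tests data volume. That is the point of the next section.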
Why Scale Thresholds Are the Wrong Signal
Data volume and query slowness are symptoms that warrant diagnosis before adding a new engine:
- “We have a lot of data” → Start with partition pruning, check if `require_partition_filter` is set, verify queries aren’t scanning full tables.
- “This query is slow” → Check slot utilization, review the query plan, look for unintentional cross joins or missing predicate pushdown.
- “SQL can’t do this transformation” → BigQuery’s array functions, struct operations, and window functions handle most complex transformation logic.
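One concrete way to run that diagnosis is a dry run, which reports the bytes a query would scan without executing it. The sketch below uses the `google-cloud-bigquery` client; the project and table names are illustrative:

```python
# Diagnostic sketch: before blaming the engine, dry-run the query and look at
# the estimated scan. Project and table names are illustrative assumptions.

def gib(n_bytes: int) -> float:
    """Convert bytes to GiB for readable scan estimates."""
    return n_bytes / 2**30

# Requires: pip install google-cloud-bigquery
# from google.cloud import bigquery
# client = bigquery.Client(project="my-project")
# cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
# job = client.query(
#     "SELECT user_id FROM `project.dataset.events` WHERE event_date = '2026-01-01'",
#     job_config=cfg,
# )
# print(f"Would scan {gib(job.total_bytes_processed):.1f} GiB")
```

If the dry run shows a full-table scan for a query that should touch one partition, the fix is a partition filter, not a new processing engine.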
Introducing Spark to resolve a BigQuery performance issue adds a second tuning surface (executor counts, memory per executor, shuffle partitions, broadcast thresholds) that SQL-fluent teams typically don’t know how to navigate.
Integrating with the Medallion Architecture
In a medallion lakehouse, processing engine selection maps naturally to layers:
- Bronze layer ingestion: Dataflow (for streaming) or Dataproc Serverless (for batch Spark workloads from upstream systems)
- Silver layer transformation: BigQuery SQL via dbt — reads from bronze BigLake Iceberg tables, writes conformed datasets
- Gold layer aggregation: BigQuery SQL via dbt — pure analytics, maximum query performance for BI tools
This layering doesn’t require every team to use every engine. Many teams operate entirely within BigQuery for silver and gold, and only introduce Dataflow or Dataproc if their bronze ingestion genuinely requires it (streaming with complex windowing, or migration from an existing Spark pipeline).
The architectural principle: pick the simplest engine that meets your requirements. Each additional engine increases operational complexity and on-call burden.