
Dagster Full-Stack Pipeline Architecture

How Dagster unifies ingestion, transformation, Python processing, and downstream triggers in a single asset graph — the pattern that justifies Dagster over simpler orchestration approaches.

Tags: dbt, BigQuery, data engineering, automation, ETL

Dagster’s asset-centric model is most useful when the pipeline extends beyond transformation — covering upstream ingestion, downstream processing, and cross-system coordination in a single dependency graph.

The Typical Full-Stack Pattern

A Dagster + dbt + BigQuery pipeline typically looks like this:

  1. Ingestion: A sensor detects that a Fivetran or dlt sync has completed
  2. Transformation: dbt runs only when upstream data is actually fresh
  3. Python processing: ML features, custom aggregations, or API calls that SQL can’t handle
  4. Downstream triggers: Sensors kick off BI dashboard refreshes or reverse ETL exports

All four stages are Software-Defined Assets in the same graph. A single lineage view shows the full path from raw source to final dashboard. When something breaks, you trace the lineage to find the root cause — whether it’s in the extraction layer, the transformation layer, or downstream processing.

Without a Unified Orchestrator

Without a unified orchestrator, each tool reports success independently:

  • Fivetran has its own dashboard showing sync status
  • dbt Cloud has its own dashboard showing model runs
  • A Python script runs on a cron timer independently of upstream data
  • A Looker API call refreshes a dashboard independently of the Python script

Correlation across systems is manual. The failure mode is silent: a script runs, processes stale data, and writes to the downstream table without anyone detecting it.

Dagster provides a single dependency graph where the sensor that watches for Fivetran completion triggers the dbt transformation, which triggers the Python processing, which triggers the BI refresh. Each step runs only when upstream data is confirmed fresh.

Ingestion: Sensors and External Assets

The ingestion layer typically uses external assets and sensors. An external asset represents data that Dagster doesn’t produce but tracks — a Fivetran-managed table, a file landing in GCS, or data arriving from a partner feed:

import dagster as dg

# External asset: Dagster tracks it but doesn't produce it
fivetran_stripe_payments = dg.AssetSpec(
    "raw__stripe__payments",
    group_name="ingestion",
    description="Stripe payments synced by Fivetran",
)

A sensor watches for changes to external assets and triggers downstream materializations:

@dg.sensor(minimum_interval_seconds=300)
def fivetran_sync_sensor(context: dg.SensorEvaluationContext):
    # fivetran_sync_completed is a placeholder for a check against
    # Fivetran's API or the connector's sync-state table
    if fivetran_sync_completed("stripe_payments"):
        return dg.SensorResult(
            asset_events=[
                dg.AssetMaterialization(asset_key="raw__stripe__payments")
            ]
        )

When the sensor fires, Dagster records that raw__stripe__payments has new data. Any asset that depends on it — your dbt base models, your Python processing steps — becomes eligible for rematerialization based on its automation conditions or freshness policies.
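A sketch of what that looks like in code — hedged: stg__stripe__payments is a hypothetical stand-in for the dbt base model that would normally sit here. An eager automation condition asks Dagster to rematerialize the asset whenever the upstream external asset records new data:

import dagster as dg

@dg.asset(
    deps=["raw__stripe__payments"],
    # eager(): materialize this asset when its upstream dependency
    # records a new materialization
    automation_condition=dg.AutomationCondition.eager(),
)
def stg__stripe__payments(context: dg.AssetExecutionContext) -> None:
    # hypothetical staging step standing in for the dbt base model
    ...

With the sensor above reporting materializations on raw__stripe__payments, this asset runs only after fresh data is confirmed, not on a clock.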

For teams using a hybrid ingestion strategy with some managed connectors (Fivetran) and some custom pipelines (dlt), Dagster unifies both. The Fivetran-synced tables and the dlt-loaded tables appear as assets in the same graph, with dbt transformations downstream of both.
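A sketch of the dlt side, assuming the dagster-dlt integration; stripe_source here is a toy stand-in for a real dlt source:

import dagster as dg
import dlt
from dagster_dlt import DagsterDltResource, dlt_assets

@dlt.source
def stripe_source():
    # toy stand-in; a real pipeline would use dlt's verified Stripe source
    @dlt.resource(name="payments")
    def payments():
        yield {"id": "py_123", "amount": 5000}
    return payments

@dlt_assets(
    dlt_source=stripe_source(),
    dlt_pipeline=dlt.pipeline(
        pipeline_name="stripe_raw",
        destination="bigquery",
        dataset_name="raw",
    ),
    group_name="ingestion",
)
def dlt_stripe_assets(context: dg.AssetExecutionContext, dlt: DagsterDltResource):
    # the dlt-loaded tables land in the same graph as the Fivetran assets
    yield from dlt.run(context=context)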

Transformation: dbt as the Middle Layer

The dbt integration handles the transformation layer. Each dbt model becomes an asset with dependencies inferred from ref() calls. dbt tests become asset checks. The dbt layer sits in the middle of the full-stack graph, consuming ingestion assets and producing assets that downstream processing consumes.
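A minimal sketch of that wiring, assuming the dagster-dbt package and a compiled manifest at target/manifest.json:

from pathlib import Path

import dagster as dg
from dagster_dbt import DbtCliResource, dbt_assets

@dbt_assets(manifest=Path("target/manifest.json"))
def analytics_dbt_assets(context: dg.AssetExecutionContext, dbt: DbtCliResource):
    # every dbt model in the manifest becomes an asset; ref() calls
    # become graph edges, and dbt tests surface as asset checks
    yield from dbt.cli(["build"], context=context).stream()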

The key benefit at this layer is conditional execution. dbt runs only when upstream data is fresh. Instead of running dbt build on a cron timer at 6 AM and hoping Fivetran finished by then, a sensor confirms that raw data arrived, and only then triggers the dbt transformation. No more empty incremental runs or partial data in downstream tables.

Python Processing: Assets Beyond SQL

The stage that SQL can’t handle is where Python software-defined assets fill the gap:

import dagster as dg
from dagster_gcp import BigQueryResource

@dg.asset(
    group_name="ml_features",
    deps=["mrt__marketing__campaign_performance"],
)
def ml__campaign_propensity_scores(
    context: dg.AssetExecutionContext,
    bigquery: BigQueryResource,
) -> None:
    """Score campaigns using the propensity model."""
    with bigquery.get_client() as client:
        df = client.query(
            "SELECT * FROM analytics.mrt__marketing__campaign_performance"
        ).to_dataframe()
    scores = run_propensity_model(df)  # placeholder for the team's model code
    # Write back to BigQuery via pandas-gbq
    scores.to_gbq(
        "analytics.ml__campaign_propensity_scores",
        project_id="my-gcp-project",
        if_exists="replace",
    )

This asset lives in the same dependency graph as the dbt models. It depends on the dbt mart model mrt__marketing__campaign_performance, and downstream assets can depend on it. The lineage graph shows the full chain: Fivetran sync -> dbt base models -> dbt intermediate models -> dbt mart models -> Python ML scoring -> downstream consumption.
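A sketch of how these pieces might be registered together in one code location, assuming the objects defined earlier (the commented line stands in for the dbt assets loaded via dagster-dbt):

import dagster as dg
from dagster_gcp import BigQueryResource

# one code location, one graph: ingestion, transformation, Python
# processing, and the sensors that connect them
defs = dg.Definitions(
    assets=[
        fivetran_stripe_payments,        # external ingestion asset
        ml__campaign_propensity_scores,  # Python ML scoring
        # dbt assets loaded with dagster-dbt would be listed here
    ],
    sensors=[fivetran_sync_sensor],
    resources={"bigquery": BigQueryResource(project="my-gcp-project")},
)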

Common Python processing patterns that justify the full-stack approach:

  • ML feature engineering that requires Python libraries (scikit-learn, XGBoost)
  • API calls to external services (geocoding, enrichment, notifications)
  • File generation (PDF reports, CSV exports to SFTP)
  • Custom aggregations involving algorithms that SQL expresses poorly
  • Reverse ETL pushing data from BigQuery to SaaS tools

Downstream Triggers

The final stage uses sensors or automation conditions to trigger actions when upstream assets are fresh:

  • Refresh a Looker dashboard or dbt Semantic Layer cache
  • Push data to a CRM via reverse ETL
  • Send a Slack notification that fresh data is available
  • Trigger an external reporting pipeline

These downstream triggers complete the loop. The pipeline doesn’t just produce data — it ensures the data reaches its consumers.
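A sketch of one such trigger, with the Looker call left as a hypothetical placeholder: an asset sensor watches the scoring asset and launches a refresh job on each new materialization:

import dagster as dg

@dg.op
def refresh_looker_dashboard() -> None:
    # hypothetical placeholder: call the Looker API (or the BI tool's
    # SDK) to trigger a dashboard refresh here
    ...

@dg.job
def refresh_dashboards_job():
    refresh_looker_dashboard()

@dg.asset_sensor(
    asset_key=dg.AssetKey("ml__campaign_propensity_scores"),
    job=refresh_dashboards_job,
)
def scores_updated_sensor(context: dg.SensorEvaluationContext, asset_event):
    # one refresh run per new materialization of the scores asset
    yield dg.RunRequest(run_key=context.cursor)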

When the Full Stack Isn’t Needed

If the team only runs dbt build on a schedule with no upstream or downstream dependencies, a Cloud Run job on a cron trigger is simpler and cheaper.

The Dagster vs dbt Cloud comparison covers this decision in detail, and the GCP orchestration framework positions Dagster relative to Cloud Run Jobs, Cloud Workflows, and Cloud Composer.