Note

dbt-utils SQL Generators

Reference for dbt-utils SQL generation macros: date_spine, deduplicate, star, union_relations, pivot, unpivot, and the smaller helpers. What each does, how to call it, and the gotchas.

Planted
Tags: dbt, data engineering, data modeling

SQL generators are the most commonly used part of dbt-utils. They don’t query your database — they generate SQL fragments or complete queries at compile time, based on parameters you pass them. Most of them also have no compile-time dependency on existing relations; star, which introspects a relation’s column list, is the exception, covered below.

This note covers the generators that reshape data: date_spine, deduplicate, star, union_relations, pivot, and unpivot. For generate_surrogate_key, see the dedicated note — its null handling gotcha is significant enough to warrant separate treatment.

date_spine

Generates a complete date or time series. The canonical use case: a time-series model that needs to show zeros for days with no activity, rather than simply omitting those days.

SELECT *
FROM (
    {{ dbt_utils.date_spine(
        datepart="day",
        start_date="CAST('2024-01-01' AS DATE)",
        end_date="CAST('2025-01-01' AS DATE)"
    ) }}
)

The accepted datepart values are: day, week, month, year, hour, minute.

Gotcha: the end date is exclusive. The series above stops at 2024-12-31, not 2025-01-01. If you want a full year including the last day, add one unit to your end date. This catches people every time.
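A simplified sketch of what the macro compiles to on Postgres makes the boundary behavior concrete (the real compiled SQL goes through dbt_utils.generate_series and a cross-database dateadd, so the exact shape differs):

```sql
-- Simplified Postgres-style sketch of the compiled output.
-- The "< end_date" filter is why the upper bound is exclusive.
SELECT
    CAST('2024-01-01' AS DATE) + (n - 1) AS date_day
FROM generate_series(1, 366) AS s(n)
WHERE CAST('2024-01-01' AS DATE) + (n - 1)
    < CAST('2025-01-01' AS DATE)
```

The series is generated from the start date and then filtered strictly below the end date, so 2025-01-01 itself never appears.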

The typical pattern is to join this against your fact table and use COALESCE to fill missing values with zero:

WITH spine AS (
    {{ dbt_utils.date_spine(
        datepart="day",
        start_date="CAST('2024-01-01' AS DATE)",
        end_date="CAST('2025-01-01' AS DATE)"
    ) }}
),

daily_orders AS (
    SELECT
        DATE(order__created_at) AS order__date,
        COUNT(*) AS order__count
    FROM {{ ref('base__shopify__orders') }}
    GROUP BY 1
)

SELECT
    spine.date_day,
    COALESCE(daily_orders.order__count, 0) AS order__count
FROM spine
LEFT JOIN daily_orders
    ON spine.date_day = daily_orders.order__date

For more complex date series (fiscal calendars, business days), consider the dbt-date package, which builds on top of date_spine.

deduplicate

Removes duplicate rows by keeping one row per partition, ordered by your criteria:

SELECT *
FROM (
    {{ dbt_utils.deduplicate(
        relation=ref('base__crm__contacts'),
        partition_by='contact__id',
        order_by='contact__updated_at DESC'
    ) }}
)

This is the standard way to deduplicate raw source tables where the same record appears multiple times — often because of replication behavior. You keep the most recent version of each entity.

Breaking changes from v0.x to v1.0:

  • group_by was renamed to partition_by
  • relation_alias was removed entirely
  • order_by is now mandatory — pass 1 if you genuinely don’t care which row wins
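Put together, migrating a pre-1.0 call looks roughly like this (model and column names reused from the example above; the exact v0.x form you have may vary):

```sql
-- v0.x style (no longer compiles in v1.0):
{{ dbt_utils.deduplicate(
    relation=ref('base__crm__contacts'),
    group_by='contact__id',
    relation_alias='contacts'
) }}

-- v1.0 equivalent:
{{ dbt_utils.deduplicate(
    relation=ref('base__crm__contacts'),
    partition_by='contact__id',
    order_by='contact__updated_at DESC'
) }}
```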

On Snowflake and BigQuery, this compiles to a QUALIFY clause, which is the native, efficient way to express window-function-based filtering on those platforms:

SELECT *
FROM base__crm__contacts
QUALIFY ROW_NUMBER() OVER (PARTITION BY contact__id ORDER BY contact__updated_at DESC) = 1

On Postgres and Redshift (no QUALIFY), it compiles to a ROW_NUMBER() subquery. The semantics are identical; only the SQL shape differs.
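Roughly, the Postgres/Redshift form looks like this (a sketch, not the macro's literal output):

```sql
SELECT *
FROM (
    SELECT
        base__crm__contacts.*,
        ROW_NUMBER() OVER (
            PARTITION BY contact__id
            ORDER BY contact__updated_at DESC
        ) AS rn
    FROM base__crm__contacts
) AS deduped
WHERE rn = 1
```

Same result as the QUALIFY version: one row per contact__id, the most recently updated one winning.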

Use deduplicate in base-layer models where you’re bringing in a raw snapshot table and need to establish one row per grain. Don’t use it to paper over a model that’s producing duplicates — find and fix the source of the duplication instead.

star

Selects all columns from a relation with optional exclusions:

SELECT
{{ dbt_utils.star(from=ref('base__shopify__orders'), except=["_fivetran_synced", "_fivetran_deleted"]) }}
FROM {{ ref('base__shopify__orders') }}

This is invaluable for base-layer models where you want to pass through all upstream columns while dropping metadata or staging columns you don’t need downstream.

Additional parameters:

  • relation_alias — adds a table alias prefix to each column (e.g., orders.order__id)
  • prefix — prepends a string to every column name
  • suffix — appends a string to every column name
  • quote_identifiers — defaults to True; set False if your warehouse is case-insensitive and you don’t want quoted names
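As a sketch, combining relation_alias and prefix yields output columns like order_<original_name>, qualified by the table alias (exact quoting depends on quote_identifiers and your warehouse):

```sql
SELECT
    {{ dbt_utils.star(
        from=ref('base__shopify__orders'),
        relation_alias='orders',
        prefix='order_'
    ) }}
FROM {{ ref('base__shopify__orders') }} AS orders
```

This combination is handy when joining relations with overlapping column names: each side gets an unambiguous prefix without listing columns by hand.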

Critical constraint: the relation must already be materialized. star runs introspection at compile time to discover the column list. If you’re running a model for the first time in a fresh environment, the upstream relation doesn’t exist yet, and the macro will fail. This is the same constraint as all introspective macros.

In practice, this means star works well in base-layer models that sit directly on top of source tables (which always exist before dbt runs), but can be fragile in intermediate models that depend on other dbt models that might not have been built yet.

union_relations

Unions multiple relations, automatically handling columns that exist in one but not another by filling missing columns with NULL:

{{ dbt_utils.union_relations(
    relations=[ref('base__stripe__charges_us'), ref('base__stripe__charges_eu')],
    exclude=["_loaded_at"],
    source_column_name='_dbt_source_relation'
) }}

The source_column_name parameter adds a column indicating which relation each row came from — essential for tracing records back to their source after the union.

A few details worth knowing:

  • include and exclude are mutually exclusive. You can list the columns to keep, or the columns to drop, but not both.
  • The macro handles column ordering: it builds a consistent column list across all relations and emits NULL AS column_name for any column that doesn’t exist in a given relation.
  • The output is a full UNION ALL, not UNION DISTINCT. Deduplication is your responsibility downstream.
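A simplified sketch of the compiled output, assuming charges_us has a card_country column that charges_eu lacks (the column names are illustrative, and the real macro puts the fully qualified relation name into the source column):

```sql
SELECT
    'base__stripe__charges_us' AS _dbt_source_relation,
    charge__id,
    charge__amount,
    card_country
FROM base__stripe__charges_us

UNION ALL

SELECT
    'base__stripe__charges_eu' AS _dbt_source_relation,
    charge__id,
    charge__amount,
    CAST(NULL AS TEXT) AS card_country
FROM base__stripe__charges_eu
```

Every branch of the union selects the same column list in the same order, with NULL casts filling the gaps.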

The most powerful use of union_relations is pairing it with get_relations_by_pattern for dynamic table discovery. This is the standard approach for consolidating BigQuery date-sharded tables:

{% set sharded_tables = dbt_utils.get_relations_by_pattern(
    'raw_data_%',
    'events_%',
    exclude='%deprecated'
) %}

{{ dbt_utils.union_relations(relations=sharded_tables) }}

This discovers all tables matching the pattern at compile time and unions them into a single model. As new shards are created, the model picks them up automatically on the next run. See dbt-utils Introspective Macros for the get_relations_by_pattern details.

pivot and unpivot

pivot creates pivot tables using CASE expressions, turning a categorical column’s values into separate columns. The typical use: counting or summing a metric by category, where each category becomes its own column.

SELECT
    order__date,
    {{ dbt_utils.pivot(
        column='order__status',
        values=dbt_utils.get_column_values(ref('base__shopify__orders'), 'order__status'),
        agg='SUM',
        then_value=1,
        else_value=0,
        prefix='status_'
    ) }}
FROM {{ ref('base__shopify__orders') }}
GROUP BY 1

Note the agg='SUM': summing the 1/0 indicator gives a per-status count. COUNT would count every row regardless of status, because the else branch's 0 is still a non-null value.

The values parameter accepts either a hardcoded list or the output of get_column_values. Using get_column_values makes the pivot dynamic — it adapts when new status values appear in the data. The trade-off is the compile-time dependency on an existing relation.
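If you'd rather avoid that compile-time dependency, hardcode the list (the status values here are illustrative):

```sql
SELECT
    order__date,
    {{ dbt_utils.pivot(
        column='order__status',
        values=['pending', 'paid', 'refunded'],
        agg='SUM',
        then_value=1,
        else_value=0,
        prefix='status_'
    ) }}
FROM {{ ref('base__shopify__orders') }}
GROUP BY 1
```

This always compiles, but a new status in the data won't get a column until someone adds it to the list.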

unpivot does the reverse: it takes wide data with many columns and converts it to long format, with one column holding the original column names and another holding their values:

{{ dbt_utils.unpivot(
    relation=ref('mrt__sales__quarterly_revenue'),
    cast_to='FLOAT64',
    exclude=['region__name'],
    field_name='quarter',
    value_name='revenue'
) }}

Migration note from v0.x: The table argument was renamed to relation in v1.0. If you have existing unpivot calls with table=, update them.
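To make the shape change concrete: assuming the relation has columns region__name, q1, and q2 (illustrative names), the call above compiles to roughly one SELECT per unpivoted column, unioned together:

```sql
-- Simplified sketch of the compiled output:
SELECT region__name, 'q1' AS quarter, CAST(q1 AS FLOAT64) AS revenue
FROM mrt__sales__quarterly_revenue

UNION ALL

SELECT region__name, 'q2' AS quarter, CAST(q2 AS FLOAT64) AS revenue
FROM mrt__sales__quarterly_revenue
```

Each excluded column (here region__name) passes through unchanged on every branch; every other column becomes one row per source row.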

The Smaller Helpers

These are genuinely useful but come up less often:

  • group_by(n) — outputs GROUP BY 1, 2, ..., n. Example: {{ dbt_utils.group_by(3) }}
  • generate_series(upper_bound) — creates a 1-indexed number series. Example: {{ dbt_utils.generate_series(100) }}
  • safe_add(values) — null-safe addition; list argument required since v1.0. Example: {{ dbt_utils.safe_add(['col_a', 'col_b']) }}
  • safe_divide(numerator, denominator) — returns NULL on division by zero. Example: {{ dbt_utils.safe_divide('revenue', 'sessions') }}
  • safe_subtract(values) — null-safe subtraction. Example: {{ dbt_utils.safe_subtract(['col_a', 'col_b']) }}
  • haversine_distance(lat1, lon1, lat2, lon2, unit) — distance between two GPS coordinates; unit defaults to 'mi', use 'km' for kilometers.
  • width_bucket(expr, min, max, buckets) — assigns values to equal-width histogram buckets; useful for distribution analysis.

safe_add and safe_subtract deserve a note: since v1.0 they take a list argument. The old two-argument form no longer works. Use {{ dbt_utils.safe_add(['col_a', 'col_b', 'col_c']) }} instead of {{ dbt_utils.safe_add('col_a', 'col_b') }}.
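The compiled SQL for these helpers is short enough to show in full (roughly; exact whitespace and parenthesization vary):

```sql
-- safe_add(['col_a', 'col_b']) compiles to roughly:
COALESCE(col_a, 0) + COALESCE(col_b, 0)

-- safe_divide('revenue', 'sessions') compiles to roughly:
(revenue) / NULLIF((sessions), 0)
```

NULLIF turns a zero denominator into NULL, so the division yields NULL instead of erroring; COALESCE keeps a NULL operand from nulling out the whole sum.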

haversine_distance is niche, but analytics work usually reaches a moment where you need to measure geographic distance: retail catchment areas, delivery radius analysis, proximity matching. This saves writing the formula from scratch.
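A sketch of a typical call; the model and column names here are made up for the example:

```sql
SELECT
    store__id,
    {{ dbt_utils.haversine_distance(
        'store__latitude',
        'store__longitude',
        'warehouse__latitude',
        'warehouse__longitude',
        unit='km'
    ) }} AS distance_to_warehouse_km
FROM {{ ref('int__stores__with_nearest_warehouse') }}
```

The arguments are column names (or SQL expressions) as strings, not literal coordinates, so the distance is computed per row.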

Cross-Warehouse Compilation

Most SQL generators compile identically across adapters. The exception is deduplicate, which uses QUALIFY on Snowflake and BigQuery but a ROW_NUMBER() subquery on Postgres and Redshift. This is handled automatically by dbt-utils’s dispatch mechanism — you don’t need to configure anything unless you’re on a non-standard adapter like Databricks, in which case you’ll need spark_utils in your dispatch configuration. See dbt Dispatch Configuration for setup details.