generate_surrogate_key creates a hash-based surrogate key from one or more business columns. It is one of the most-used macros in the dbt ecosystem. There is a migration gotcha — null handling changed between the old surrogate_key() macro and generate_surrogate_key() — that can break incremental models and snapshots during upgrades.
What It Does
The macro concatenates the specified columns, adds null handling, and hashes the result using MD5:
```sql
SELECT
    {{ dbt_utils.generate_surrogate_key(['order__id', 'order__line_number']) }} AS order_line__surrogate_key,
    order__id,
    order__line_number,
    product__id,
    line__quantity,
    line__price
FROM {{ ref('base__shopify__order_lines') }}
```

The generated SQL on BigQuery looks roughly like:
```sql
TO_HEX(MD5(CONCAT(
    COALESCE(CAST(order__id AS STRING), '_dbt_utils_surrogate_key_null_'),
    '-',
    COALESCE(CAST(order__line_number AS STRING), '_dbt_utils_surrogate_key_null_')
)))
```

The separator ('-') prevents two different column combinations from concatenating to the same string and therefore the same hash. Without it, order__id = '12' and order__line_number = '3' would produce the same hash as order__id = '1' and order__line_number = '23'.
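To make the collision concrete, here is a standalone BigQuery query (the literal values are illustrative, not from any real table) showing that the two value pairs hash identically without the separator and diverge with it:

```sql
-- Illustrative only: '12' || '3' and '1' || '23' both concatenate to '123',
-- so their MD5 hashes are identical; the '-' separator keeps them distinct.
SELECT
    TO_HEX(MD5(CONCAT('12', '3')))       AS no_sep_a,    -- hash of '123'
    TO_HEX(MD5(CONCAT('1', '23')))       AS no_sep_b,    -- hash of '123' (same as no_sep_a)
    TO_HEX(MD5(CONCAT('12', '-', '3')))  AS with_sep_a,  -- hash of '12-3'
    TO_HEX(MD5(CONCAT('1', '-', '23')))  AS with_sep_b   -- hash of '1-23' (differs from with_sep_a)
```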
Why You Pass a List
The input is always a list of column names as strings. Each column is cast to string before concatenation, so you can pass integer, date, or any other column type. Null values in any column are replaced with the sentinel string '_dbt_utils_surrogate_key_null_' by default; more on that below.
For composite keys (multiple columns defining the grain), list them all:
```sql
{{ dbt_utils.generate_surrogate_key(['customer__id', 'product__sku', 'order__date']) }}
```

For a single-column key, the list still has one item:
```sql
{{ dbt_utils.generate_surrogate_key(['order__id']) }}
```

The Null Handling Gotcha
This is where projects get into trouble. There are two behaviors for null values, controlled by a dbt variable:
Current behavior (default in v1.x): Null values are replaced with the sentinel string '_dbt_utils_surrogate_key_null_'. order__id = NULL contributes that sentinel to the concatenated string before hashing, so a genuine NULL hashes differently from an actual empty string.
Legacy behavior (pre-v1.0): The old surrogate_key() macro cast null values to empty strings. A NULL and an actual '' produced identical hashes, which is why the old behavior was considered a bug and changed.
When you migrate from the old surrogate_key() to generate_surrogate_key(), the hash changes for every row where any key column is null: the new macro hashes the sentinel string where the old macro hashed an empty string.
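The difference is easiest to see in the compiled COALESCE calls. This side-by-side is a sketch of what each setting compiles to for a single column (BigQuery syntax, following the earlier example):

```sql
-- Default in v1.x: NULL becomes the sentinel string before hashing.
COALESCE(CAST(order__id AS STRING), '_dbt_utils_surrogate_key_null_')

-- With surrogate_key_treat_nulls_as_empty_strings: true (legacy-compatible):
COALESCE(CAST(order__id AS STRING), '')
```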
Why This Breaks Incremental Models
Incremental models use surrogate keys as their unique_key for merge operations. The merge logic compares incoming rows against the existing table using the surrogate key to decide whether to update or insert.
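As a concrete sketch (the model body, audit column, and filter logic are assumptions layered on the earlier example), an incremental model typically wires the surrogate key in like this:

```sql
-- Hypothetical incremental model: the surrogate key doubles as the merge key.
{{ config(
    materialized='incremental',
    unique_key='order_line__surrogate_key'
) }}

SELECT
    {{ dbt_utils.generate_surrogate_key(['order__id', 'order__line_number']) }} AS order_line__surrogate_key,
    order__id,
    order__line_number,
    line__quantity,
    line__price,
    _loaded_at  -- assumed audit column
FROM {{ ref('base__shopify__order_lines') }}
{% if is_incremental() %}
  -- Only process rows that arrived since the last run.
  WHERE _loaded_at > (SELECT MAX(_loaded_at) FROM {{ this }})
{% endif %}
```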
If you migrate the macro and regenerate the surrogate keys, every row with a null in any key column gets a new hash. The incremental model sees these as new rows (no matching key in the existing table) and inserts them as duplicates. Your table now has two versions of the same real-world entity — one with the old hash, one with the new hash. The old row never gets updated because nothing matches it.
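One way to catch this after the fact is to look for business keys that now map to more than one surrogate key. The table and column names here follow the earlier example:

```sql
-- If the migration duplicated rows, the same business key will carry
-- multiple surrogate keys in the incremental table. Expect zero rows.
SELECT
    order__id,
    order__line_number,
    COUNT(DISTINCT order_line__surrogate_key) AS key_versions
FROM fct__shopify__order_lines
GROUP BY order__id, order__line_number
HAVING COUNT(DISTINCT order_line__surrogate_key) > 1
```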
dbt snapshots are similarly affected when the surrogate key serves as the snapshot's unique_key: after a hash change, existing records are never matched again, so old versions are never closed out and the changed rows open brand-new history chains.
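For reference, this is the shape of a snapshot exposed to the problem; the snapshot name, schema, and check columns are assumptions:

```sql
{% snapshot order_lines_snapshot %}

{{ config(
    target_schema='snapshots',
    unique_key='order_line__surrogate_key',
    strategy='check',
    check_cols=['line__quantity', 'line__price']
) }}

-- The surrogate key is the identity the snapshot tracks versions against;
-- change how it hashes and every existing history chain is orphaned.
SELECT
    {{ dbt_utils.generate_surrogate_key(['order__id', 'order__line_number']) }} AS order_line__surrogate_key,
    order__id,
    order__line_number,
    line__quantity,
    line__price
FROM {{ ref('base__shopify__order_lines') }}

{% endsnapshot %}
```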
How to Migrate Safely
If you have existing incremental models or snapshots that used the old surrogate_key() macro, add this to your dbt_project.yml before switching to generate_surrogate_key():
```yaml
vars:
  surrogate_key_treat_nulls_as_empty_strings: true
```

This tells generate_surrogate_key() to use the same null treatment as the legacy macro, producing identical hashes for the same input data. It's a compatibility shim that lets you upgrade the macro without touching your data.
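Before trusting the shim, a spot check is cheap. One hedged approach, assuming you can build the upgraded model into a scratch schema alongside the current table (both schema names below are hypothetical), is to confirm the rebuild produces no keys the production table doesn't already have:

```sql
-- Keys produced by the rebuilt model (upgraded macro + compatibility variable)
-- that do not already exist in the current table. Expect zero rows.
-- EXCEPT DISTINCT is null-safe, unlike an equality join on the key columns.
SELECT order_line__surrogate_key
FROM scratch.fct__shopify__order_lines    -- hypothetical rebuild with the new macro
EXCEPT DISTINCT
SELECT order_line__surrogate_key
FROM analytics.fct__shopify__order_lines  -- current production table
```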
If you’re starting a new project or you’re fine running a full refresh to rebuild all affected incremental models and snapshots, you don’t need the variable — the default behavior is fine.
When to Use Surrogate Keys
Use generate_surrogate_key when:
- Your grain is defined by multiple business columns and you need a single primary key (common for fact tables: order__id + order__line_number)
- You're loading from a source that provides no natural unique key
- You need a stable identifier that survives upstream key changes
Don’t use it as a substitute for fixing data quality issues. If your source has duplicate records, deduplicate first (see dbt-utils SQL Generators) and then generate the surrogate key on the clean data. A surrogate key on a duplicated row just gives you a unique identifier per duplicate, which doesn’t solve anything.
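If deduplication is the prerequisite, dbt-utils ships a deduplicate macro that pairs naturally with this one. A sketch, where the ordering column _loaded_at is an assumption:

```sql
-- Deduplicate on the business grain first, then hash the clean rows.
-- _loaded_at is an assumed audit column used to keep the newest duplicate.
WITH deduped AS (

    {{ dbt_utils.deduplicate(
        relation=ref('base__shopify__order_lines'),
        partition_by='order__id, order__line_number',
        order_by='_loaded_at desc'
    ) }}

)

SELECT
    {{ dbt_utils.generate_surrogate_key(['order__id', 'order__line_number']) }} AS order_line__surrogate_key,
    *
FROM deduped
```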
Checking Your Keys
Combine generate_surrogate_key with a unique test on the output column and a not_null test on every input column:
```yaml
models:
  - name: fct__shopify__order_lines
    columns:
      - name: order_line__surrogate_key
        data_tests:
          - unique
          - not_null
      - name: order__id
        data_tests:
          - not_null
      - name: order__line_number
        data_tests:
          - not_null
```

The not_null tests on the input columns matter because nulls in key columns are always a smell: either the data truly has no value for that column (a modeling problem) or something upstream broke (a pipeline problem). In either case, a hash that silently includes a null is worse than a test failure that makes the problem visible.