
Markov Attribution SQL Implementation

SQL patterns for extracting journey paths and calculating transition probabilities in BigQuery: the data preparation layer for Markov chain attribution

bigquery · dbt · analytics · data modeling

Full Markov chain attribution requires matrix operations that push SQL beyond its comfort zone. The practical approach is to use SQL for path extraction and transition probability calculation, then hand off to Python for the removal effect computation. SQL does the heavy data preparation; Python handles the linear algebra.

This note covers the SQL portion — extracting journey paths from touchpoint data and computing transition counts — which is where most of the engineering effort lives.

Path extraction in BigQuery

The first step is extracting journey paths from your touchpoint data. You need two datasets: paths that ended in conversion and paths that didn’t. Both matter for the Markov model because non-converting paths establish the NULL state transitions.

WITH touchpoints AS (
    SELECT
        user_id,
        channel,
        event_timestamp,
        conversion_id
    FROM {{ ref('int__touchpoints') }}
    WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
),

converting_paths AS (
    SELECT
        t.user_id,
        t.conversion_id,
        STRING_AGG(t.channel, ' > ' ORDER BY t.event_timestamp) AS journey_path,
        'conversion' AS outcome
    FROM touchpoints t
    INNER JOIN {{ ref('int__conversions') }} c
        ON t.user_id = c.user_id
        AND t.conversion_id = c.conversion_id
    WHERE t.event_timestamp BETWEEN
        TIMESTAMP_SUB(c.conversion_timestamp, INTERVAL 30 DAY)
        AND c.conversion_timestamp
    GROUP BY t.user_id, t.conversion_id
),

non_converting_paths AS (
    SELECT
        t.user_id,
        CAST(NULL AS STRING) AS conversion_id,
        STRING_AGG(t.channel, ' > ' ORDER BY t.event_timestamp) AS journey_path,
        'null' AS outcome
    FROM touchpoints t
    LEFT JOIN {{ ref('int__conversions') }} c
        ON t.user_id = c.user_id
    WHERE c.user_id IS NULL
    GROUP BY t.user_id
)

SELECT
    user_id,
    conversion_id,
    journey_path,
    outcome
FROM converting_paths

UNION ALL

SELECT
    user_id,
    conversion_id,
    journey_path,
    outcome
FROM non_converting_paths

This produces rows like:

user_id     journey_path                      outcome
user_123    Paid Search > Email > Direct      conversion
user_456    Organic > Paid Search > Organic   null
user_789    Email > Email > Direct            conversion

A few design decisions to note:

  • The 90-day outer window scopes how far back you pull touchpoints. This should be wider than your attribution lookback window to capture full journey context.
  • The 30-day conversion window in the converting_paths CTE defines which touchpoints qualify for a specific conversion. Adjust this to match your sales cycle.
  • Non-converting paths include users who never converted at all. These are essential for accurate transition probabilities — without them, every path ends in conversion and the model can’t distinguish high-probability from low-probability channels.

Transition counting

From the extracted paths, calculate transition counts — how often each state-to-state transition occurs. This is where the Markov model’s structure takes shape.

WITH path_transitions AS (
    SELECT
        journey_path,
        outcome,
        SPLIT(journey_path, ' > ') AS channels
    FROM {{ ref('int__journey_paths') }}
),

transitions AS (
    -- START -> first channel in each journey
    SELECT
        'START' AS from_state,
        channels[OFFSET(0)] AS to_state,
        COUNT(*) AS transition_count
    FROM path_transitions
    GROUP BY from_state, to_state

    UNION ALL

    -- channel -> channel, one row per consecutive pair in the path
    SELECT
        channels[OFFSET(i)] AS from_state,
        channels[OFFSET(i + 1)] AS to_state,
        COUNT(*) AS transition_count
    FROM path_transitions,
        UNNEST(GENERATE_ARRAY(0, ARRAY_LENGTH(channels) - 2)) AS i
    GROUP BY from_state, to_state

    UNION ALL

    -- last channel -> CONVERSION or NULL, depending on the outcome
    SELECT
        channels[OFFSET(ARRAY_LENGTH(channels) - 1)] AS from_state,
        CASE outcome
            WHEN 'conversion' THEN 'CONVERSION'
            ELSE 'NULL'
        END AS to_state,
        COUNT(*) AS transition_count
    FROM path_transitions
    GROUP BY from_state, to_state
)

SELECT
    from_state,
    to_state,
    transition_count,
    transition_count / SUM(transition_count) OVER (PARTITION BY from_state)
        AS transition_probability
FROM transitions

This query handles three types of transitions:

  1. START transitions — from START to the first channel in each journey
  2. Channel-to-channel transitions — using GENERATE_ARRAY and UNNEST to iterate through each consecutive pair in the path
  3. Terminal transitions — from the last channel to either CONVERSION or NULL

The transition_probability column divides each transition count by the total number of transitions leaving that state, using a window function partitioned by from_state. For example, if 200 journeys leave the Email state and 60 of those go to Direct, the Email -> Direct probability is 60 / 200 = 0.3. This gives you the Markov transition matrix in tabular form.

From SQL to Python

The transition probability table is where SQL’s job ends. The next steps — computing the overall conversion probability from START, removing each channel and recomputing, then normalizing removal effects into attribution shares — require matrix operations that SQL can’t handle efficiently.

Export the transition table and compute removal effects with a library:

  • R: the ChannelAttribution package has a C++ backend optimized for scale
  • Python: marketing-attribution-models, or a custom NumPy implementation

The handoff is clean: SQL produces a table with from_state, to_state, and transition_probability columns. Python reads it, constructs the matrix, and computes removal effects.
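
To make that concrete, here is a minimal NumPy sketch of the removal effect computation, assuming the transition table has been exported as a pandas DataFrame with exactly those three columns. The function names are illustrative rather than taken from any library, and the absorption probability is computed with a single linear solve instead of repeated matrix multiplication:

import numpy as np
import pandas as pd

ABSORBING = ("CONVERSION", "NULL")

def conversion_probability(transitions: pd.DataFrame) -> float:
    """P(reach CONVERSION from START) in an absorbing Markov chain."""
    states = sorted(set(transitions["from_state"]) | set(transitions["to_state"]))
    if "CONVERSION" not in states:
        return 0.0  # nothing converts once this channel is removed
    idx = {s: i for i, s in enumerate(states)}
    P = np.zeros((len(states), len(states)))
    for row in transitions.itertuples():
        P[idx[row.from_state], idx[row.to_state]] = row.transition_probability
    # Transient states are everything except the two absorbing states
    transient = [s for s in states if s not in ABSORBING]
    t = [idx[s] for s in transient]
    Q = P[np.ix_(t, t)]           # transient -> transient block
    r = P[t, idx["CONVERSION"]]   # transient -> CONVERSION column
    # Absorption probabilities b solve (I - Q) b = r
    b = np.linalg.solve(np.eye(len(t)) - Q, r)
    return float(b[transient.index("START")])

def removal_effects(transitions: pd.DataFrame) -> pd.Series:
    """Normalized removal effect (attribution share) per channel."""
    base = conversion_probability(transitions)
    channels = sorted(set(transitions["from_state"]) - {"START", *ABSORBING})
    effects = {}
    for ch in channels:
        # Remove the channel: drop its outgoing rows, redirect inbound to NULL
        removed = transitions[transitions["from_state"] != ch].copy()
        removed.loc[removed["to_state"] == ch, "to_state"] = "NULL"
        removed = (
            removed.groupby(["from_state", "to_state"], as_index=False)
            ["transition_probability"].sum()
        )
        effects[ch] = (base - conversion_probability(removed)) / base
    shares = pd.Series(effects)
    return shares / shares.sum()

removal_effects returns one attribution share per channel, summing to 1, ready to be written back to the warehouse.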

Fitting this into dbt

In a dbt project, this naturally extends the attribution comparison pattern:

models/
├── intermediate/
│   ├── int__touchpoints.sql
│   ├── int__conversions.sql
│   └── int__journey_paths.sql                # Path extraction query
└── marts/
    └── attribution/
        ├── mrt__attribution__transitions.sql # Transition counting
        ├── mrt__attribution__first_touch.sql
        ├── mrt__attribution__last_touch.sql
        └── mrt__attribution__comparison.sql

The int__journey_paths model extracts paths. The mrt__attribution__transitions model computes transition probabilities. A Python script (or dbt Python model in dbt Core 1.3+) reads the transitions table and writes Markov attribution results back to the warehouse.
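
As a sketch of the dbt Python model variant, assuming the removal_effects helper above is defined in the same file (dbt Python models must be self-contained) and a hypothetical model name:

# models/marts/attribution/mrt__attribution__markov.py (hypothetical name)
def model(dbt, session):
    dbt.config(materialized="table")
    # On BigQuery, dbt.ref() yields a PySpark DataFrame (via Dataproc);
    # toPandas() pulls it into pandas for the NumPy computation
    transitions = dbt.ref("mrt__attribution__transitions").toPandas()
    shares = removal_effects(transitions)  # helper sketched earlier
    out = shares.reset_index()
    out.columns = ["channel", "attribution_share"]
    return out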

The Markov results can then be added to your comparison model alongside heuristic models, giving you a model_type = 'markov' row in the comparison table.

Data quality considerations

The quality of Markov attribution depends entirely on the quality of the underlying transition data.

Channel grouping is critical. Too many channels produce a sparse transition matrix in which many cells are zero or rest on a handful of observations. Start with 5-10 high-level channel groups and add granularity as data volume grows.
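
If the grouping doesn't already exist upstream, a simple mapping is enough; the source_medium values and group names below are hypothetical, and in a dbt project this would normally live in a SQL CASE expression or a seed table rather than Python:

# Hypothetical source -> group mapping, applied before path extraction
CHANNEL_GROUPS = {
    "google / cpc": "Paid Search",
    "bing / cpc": "Paid Search",
    "google / organic": "Organic",
    "newsletter / email": "Email",
    "(direct) / (none)": "Direct",
}

touchpoints["channel"] = (
    touchpoints["source_medium"].map(CHANNEL_GROUPS).fillna("Other")
)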

Minimum data requirements: aim for roughly 10 times as many transitions as unique transition types. With 10 channels (plus START, CONVERSION, NULL), you have up to 13 x 13 = 169 possible transitions. That means at least 1,690 total path transitions, typically achievable with a few hundred conversions.
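
The same rule of thumb as a throwaway check; the factor of 10 mirrors the guideline above, and the (n + 3)^2 count is an upper bound, since some transitions (for example, into START) never occur:

def min_transition_volume(n_channels: int, factor: int = 10) -> int:
    # channels plus the START, CONVERSION, and NULL states
    n_states = n_channels + 3
    return factor * n_states ** 2

min_transition_volume(10)  # 10 * 13 * 13 = 1,690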

Higher-order models (where the next state depends on the current plus previous states) require exponentially more data. A second-order model with 10 channels has up to 169 x 13 = 2,197 possible transitions. Unless you have tens of thousands of conversions, stick with first-order.

Repeated channels in a path (Email > Email > Email) are valid and common. They represent users returning to the same channel multiple times. Don’t deduplicate consecutive touches unless you have a specific reason to — the self-transition probability (Email -> Email) is meaningful information about channel stickiness.
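
To inspect that stickiness directly, self-transitions are one filter away from the exported transition table (transitions is the same pandas DataFrame as in the earlier sketch):

# Self-transition probability per channel, e.g. Email -> Email
self_loops = (
    transitions[transitions["from_state"] == transitions["to_state"]]
    [["from_state", "transition_probability"]]
    .rename(columns={
        "from_state": "channel",
        "transition_probability": "self_transition_prob",
    })
    .sort_values("self_transition_prob", ascending=False)
)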