Attribution Touchpoint Table Design

Every attribution model — first-touch, last-touch, linear, position-based, time-decay, algorithmic — operates on the same input: a touchpoint table that joins marketing interactions to conversions within a lookback window. The models differ only in how they distribute credit. Get the touchpoint table wrong and every model downstream inherits the same problems.

This note covers what fields the touchpoint table needs, the identity challenges that affect accuracy, and the dbt intermediate model that constructs it from raw events.

Required fields

The touchpoint table needs these fields at minimum:

Field	Purpose	Example
`user_id`	User identifier (device or authenticated)	`user_pseudo_id`, `customer_id`
`session_id`	Session identifier	`ga_session_id`
`touchpoint_timestamp`	When the touchpoint occurred	`event_timestamp`
`channel`	Marketing classification	`paid search`, `email`, `organic social`
`source` / `medium`	Traffic source detail	`google / cpc`, `newsletter / email`
`campaign`	Campaign identifier	UTM campaign value
`transaction_id`	Links touchpoint to conversion	`order_id`
`conversion_timestamp`	When the conversion occurred	Purchase event timestamp
`revenue`	Conversion value	`149.99`

The channel field is typically a concatenation of source and medium (google / cpc) or a channel grouping macro that classifies source/medium pairs into marketing categories. Either approach works, but be consistent — the channel values become your attribution report dimensions.

The identity problem

The user_id field deserves special attention. If you’re working with GA4 data, user_pseudo_id is device-based. The same person on their phone, laptop, and work desktop appears as three separate users. Without identity resolution, your attribution thinks three different people had one touchpoint each, when actually one person had three touchpoints in a single journey.

For accurate attribution, you eventually need to join authenticated user_id values with anonymous device identifiers. This is the identity stitching problem. It doesn’t block you from building attribution — you can start with user_pseudo_id and get directionally useful results — but it caps your accuracy. Short purchase cycles (where users typically convert on the same device they discovered you on) are less affected than long B2B cycles where a prospect might interact across multiple devices over weeks.

The practical impact: without identity stitching, your attribution will over-count first-touch credit (each device gets its own “first touch”) and under-count multi-touch journey length. If your median journey looks suspiciously short, identity fragmentation is usually the reason.

What counts as a touchpoint

This is a modeling decision, not a technical one. Common approaches:

Session-level touchpoints. Every session with a marketing source counts as one touchpoint. This is the most common approach and what GA4’s session-scoped traffic source fields support naturally. A user who visits three pages in one session from a paid search click generates one touchpoint, not three.

Event-level touchpoints. Specific high-value events (demo requests, content downloads, pricing page views) each count as a touchpoint regardless of session. This gives more granular attribution but inflates touchpoint counts and can over-credit channels that drive engagement without driving conversion.

Click-level touchpoints. Only explicit ad clicks (gclid, fbclid) count. This excludes organic and direct touchpoints entirely, which limits your attribution to paid channels only.

For most implementations, session-level touchpoints strike the right balance. They capture each marketing interaction without inflating the journey length.

The intermediate dbt model

In a three-layer dbt project, the touchpoint table lives in the intermediate layer. It consumes cleaned events from the base layer and produces the rows that attribution mart models consume.

Here’s the pattern for GA4 data in BigQuery:

-- int__touchpoints_mapped.sql
WITH touchpoint_events AS (
  SELECT
    user_pseudo_id,
    session_id,
    event_timestamp,
    COALESCE(source, '(direct)') AS source,
    COALESCE(medium, '(none)') AS medium,
    CONCAT(
      COALESCE(source, '(direct)'),
      ' / ',
      COALESCE(medium, '(none)')
    ) AS channel
  FROM {{ ref('base__ga4__events') }}
  WHERE session_id IS NOT NULL
),

conversion_events AS (
  SELECT
    user_pseudo_id,
    event_timestamp,
    transaction_id,
    revenue
  FROM {{ ref('base__ga4__events') }}
  WHERE event_name = 'purchase'
)

SELECT
  t.user_pseudo_id AS user_id,
  c.transaction_id,
  t.session_id,
  t.event_timestamp AS touchpoint_timestamp,
  c.event_timestamp AS conversion_timestamp,
  t.source,
  t.medium,
  t.channel,
  c.revenue
FROM touchpoint_events t
JOIN conversion_events c
  ON t.user_pseudo_id = c.user_pseudo_id
  AND t.event_timestamp <= c.event_timestamp
  AND t.event_timestamp >= TIMESTAMP_SUB(
    c.event_timestamp, INTERVAL 30 DAY
  )

A few things worth calling out about this pattern:

Both CTEs pull from the same base model. Touchpoint events and conversion events come from base__ga4__events. The touchpoint CTE extracts sessions with traffic source data. The conversion CTE extracts purchase events. The join connects them by user within the lookback window.

The join condition is where attribution scoping happens. t.event_timestamp <= c.event_timestamp ensures only touchpoints before the conversion qualify. The TIMESTAMP_SUB filter sets the lookback window — 30 days here, but this should match your business’s purchase cycle.

Null source/medium handling matters. Direct visits have null source and medium in GA4. The COALESCE wraps them into (direct) / (none) so they appear as a channel rather than disappearing from attribution. Whether you include direct visits as touchpoints is a business decision — some teams exclude them because “direct” is often a catchall for unattributed traffic.

One row per touchpoint-conversion pair. If a user has 5 sessions before converting, this produces 5 rows for that conversion, each with the session’s traffic source and the conversion’s revenue. The attribution models downstream then decide how to distribute that revenue across the 5 rows.

The base layer feeding this model

The base model cleans raw GA4 BigQuery export data. For attribution, the key transformations are unnesting nested structures and standardizing field names:

-- base__ga4__events.sql
SELECT
  user_pseudo_id,
  (SELECT value.int_value
   FROM UNNEST(event_params)
   WHERE key = 'ga_session_id') AS session_id,
  event_timestamp,
  event_name,
  traffic_source.source,
  traffic_source.medium,
  traffic_source.name AS campaign
FROM {{ source('ga4', 'events') }}

This base model is shared across many downstream use cases — sessionization, funnel analysis, engagement metrics — not just attribution. The traffic source fields it exposes are what the intermediate touchpoint model consumes.

Aggregating for channel reporting

The touchpoint table and the attribution models downstream produce touchpoint-level rows. For channel performance reporting, you aggregate:

SELECT
  channel,
  COUNT(DISTINCT transaction_id) AS conversions_touched,
  SUM(attributed_revenue) AS total_attributed_revenue,
  SUM(attributed_revenue) / COUNT(DISTINCT transaction_id) AS avg_attributed_value
FROM attribution_results
GROUP BY channel
ORDER BY total_attributed_revenue DESC

For single-touch models (first-touch, last-touch), conversions_touched equals the actual conversion count — each conversion maps to exactly one channel. For multi-touch models (linear, position-based), conversions_touched counts how many conversions a channel participated in, which is useful for understanding assisted conversions.

This aggregation typically lives in the comparison summary model rather than as a separate mart, since you want the same aggregation applied consistently across all models.

Common mistakes

Forgetting the lookback window. Without the TIMESTAMP_SUB filter, you’ll join touchpoints from months ago to today’s conversions. A user who first visited from organic search 6 months ago gets that visit credited alongside yesterday’s retargeting click, even though the organic visit has no plausible connection to the conversion.

Including post-conversion touchpoints. If the join condition doesn’t enforce touchpoint_timestamp <= conversion_timestamp, a session that happens after the purchase gets attributed credit for it. This sounds obvious but is easy to miss when the join logic gets complex.

Not deduplicating sessions. If your base model produces multiple rows per session (one per event), the touchpoint CTE needs to deduplicate to one row per session. Otherwise a session with 15 page views becomes 15 touchpoints, massively inflating that session’s share of attributed revenue.

Ignoring repeat purchases. A user with 3 purchases has 3 separate conversion paths. Each purchase should have its own set of qualifying touchpoints, partitioned by transaction_id. If you partition only by user_id, all touchpoints get credited to all purchases.