Marketing Attribution in the Warehouse: A 2026 Guide

Platform-based attribution is failing. iOS App Tracking Transparency has a 35% opt-in rate. Google Consent Mode v2 requires four separate consent parameters for EU users. And Google quietly abandoned its Privacy Sandbox APIs in late 2025 after years of delays. The tools marketers relied on for attribution are increasingly unreliable, or simply unavailable.

The response from data teams? Build attribution where you control the data: your warehouse.

Why platform attribution is breaking down

Marketing attribution answers a deceptively simple question: which touchpoints deserve credit for a conversion? When a customer discovers you through organic search, clicks a LinkedIn ad, reads three blog posts, receives an email, then converts through paid search, who gets credit?

Platforms like Google Analytics and Meta Ads Manager provide attribution reports, but they suffer from fundamental limitations.

Walled garden bias. Each platform naturally over-credits its own contributions. Meta’s conversion reports can’t see Google touchpoints. Google can’t see Meta. Neither can see your email nurture sequence. Each platform claims credit for conversions the others also claim.

Cross-device blindness. Without proper identity resolution, GA4 sees a user on mobile, laptop, and tablet as three separate people. The customer journey fragments into disconnected sessions.

Black-box methodology. Google’s Data-Driven Attribution provides no visibility into how credit is assigned. You get a number without understanding how it was calculated or how to improve it.

Signal degradation. Between ad blockers, iOS ATT, and consent requirements, 40-60% of mobile interactions now go untracked. Your attribution data has holes you can’t see.

What warehouse-based attribution gives you

Moving attribution to your warehouse addresses these problems directly.

Data ownership. Your touchpoint data lives in infrastructure you control. You’re not dependent on platform retention policies or export limitations.

Cross-platform visibility. When GA4 events, ad platform data, and CRM conversions all live in BigQuery, you can join them. The customer who clicked a Facebook ad, visited your site three times, and converted after a sales call becomes one coherent journey.

Customizable logic. Your business isn’t identical to everyone else’s. A B2B company with 90-day sales cycles needs different attribution than an e-commerce site with same-day purchases. In your warehouse, you define the rules.

Privacy-compliant foundation. First-party data you collect directly (with consent) remains usable even as third-party tracking erodes. Building on this foundation future-proofs your measurement.

Full transparency. Every calculation is visible SQL. When a stakeholder asks why a channel received 23% of credit, you can show them the exact logic.

The data foundation you need

Attribution requires three categories of data in your warehouse.

Website interaction data. GA4’s BigQuery export provides event-level data in events_YYYYMMDD tables. Key fields include user_pseudo_id (device identifier), optional user_id (authenticated), and traffic source information in collected_traffic_source or the newer session_traffic_source_last_click field.

One warning: a known GA4 bug sometimes attributes paid search sessions carrying a gclid parameter as organic. Check for gclid in your URL parameters and consider overwriting source/medium when you find it.
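The check itself is simple. Here is a minimal sketch using Python's standard urllib; the function name and the correction rule (gclid present implies google/cpc) are illustrative, not part of the GA4 export:

```python
from urllib.parse import urlparse, parse_qs

def correct_source_medium(page_url, source, medium):
    """If a gclid is present, the click came through Google Ads:
    override whatever GA4 recorded with paid search."""
    params = parse_qs(urlparse(page_url).query)
    if "gclid" in params:
        return ("google", "cpc")
    return (source, medium)
```

In practice you would express the same rule as a CASE expression in your warehouse transformation rather than in application code.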

Ad platform data. You’ll need campaign and cost data from the platforms where you spend money.

| Platform | Native BigQuery Support | ETL Options |
| --- | --- | --- |
| Google Ads | Yes (Data Transfer Service) | dlt, Fivetran, Airbyte |
| Meta Ads | Yes (Data Transfer Service) | dlt, Fivetran, Airbyte |
| LinkedIn Ads | No | dlt, Fivetran, Airbyte |
| TikTok Ads | No | dlt, Fivetran, Airbyte |

Join strategies typically use click IDs (gclid, fbclid, ttclid) for precise matching or UTM parameters for session-level joins. Standardize your UTM taxonomy across all campaigns: lowercase consistently, include platform click IDs, and document your naming conventions.
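The normalization step is worth automating so that casing and spacing variants join as one campaign. A minimal sketch (the function name is illustrative; in production this would live in your warehouse transformation layer):

```python
def normalize_utm(value):
    """Normalize a UTM value so 'Paid Search', ' paid search '
    and 'PAID SEARCH' all join as the same key."""
    if value is None:
        return None
    return value.strip().lower().replace(" ", "_")
```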

Conversion data. This might come from your e-commerce platform, CRM, or application database. You need a way to link conversions back to the marketing touchpoints that preceded them.

Building a touchpoint table

Your attribution models will operate on a touchpoint table with this structure:

| Field | Purpose | Example |
| --- | --- | --- |
| user_id | User identifier | user_pseudo_id, customer_id |
| session_id | Session identifier | ga_session_id |
| timestamp | When the touchpoint occurred | event_timestamp |
| channel | Marketing classification | "paid_search", "email" |
| campaign | Campaign identifier | UTM campaign value |
| conversion_id | Links to conversion | order_id |
| revenue | Conversion value | 149.99 |

Identity resolution

Connecting anonymous browsing sessions to known customers is critical for accurate attribution.

Deterministic matching uses exact identifiers like email or user_id. When someone logs in, you can associate their authenticated identity with their prior anonymous sessions via user_pseudo_id. This achieves 95%+ accuracy but requires user authentication.

Probabilistic matching uses statistical signals like IP address and device characteristics. Accuracy ranges from 60-85%, and privacy regulations increasingly restrict this approach.

For most teams, deterministic matching with first-party data is the path forward. Capture user_id at login, associate it with prior sessions, and build your identity graph incrementally as users engage.
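The deterministic approach reduces to building a map from device identifiers to authenticated identities and applying it retroactively. A minimal in-memory sketch (the session representation and function name are illustrative; at scale this is a join in the warehouse):

```python
def stitch_identities(sessions):
    """Once a user_pseudo_id is ever seen alongside an authenticated
    user_id, retroactively assign that user_id to all of the device's
    sessions. `sessions` is a list of dicts with 'user_pseudo_id'
    and an optional 'user_id'."""
    # Build the identity map from authenticated sessions.
    id_map = {}
    for s in sessions:
        if s.get("user_id"):
            id_map[s["user_pseudo_id"]] = s["user_id"]
    # Apply it to every session, anonymous ones included; devices never
    # seen authenticated keep their pseudo id.
    return [
        {**s, "resolved_user_id": id_map.get(s["user_pseudo_id"], s["user_pseudo_id"])}
        for s in sessions
    ]
```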

Lookback windows

How far back should you look for touchpoints that contributed to a conversion? Industry benchmarks vary:

| Industry | Recommended Window |
| --- | --- |
| E-commerce (impulse) | 7-14 days |
| E-commerce (considered) | 30-45 days |
| B2B mid-market | 90-180 days |
| B2B enterprise | 180+ days |

Heuristic attribution models in SQL

Heuristic models apply predetermined rules to distribute conversion credit. They’re straightforward to implement and understand, making them good starting points before adding complexity.

First-touch attribution

First-touch assigns 100% of credit to the initial touchpoint. It answers: what channels introduce customers to our brand?

WITH touchpoints_numbered AS (
  SELECT
    user_id,
    conversion_id,
    channel,
    revenue,
    timestamp,
    ROW_NUMBER() OVER (
      PARTITION BY user_id, conversion_id
      ORDER BY timestamp ASC
    ) AS touch_position
  FROM touchpoints
  -- assumes touchpoints carries the conversion_timestamp of its conversion
  WHERE timestamp >= TIMESTAMP_SUB(conversion_timestamp, INTERVAL 30 DAY)
)
SELECT
  user_id,
  conversion_id,
  channel,
  revenue AS attributed_revenue
FROM touchpoints_numbered
WHERE touch_position = 1

Best for brand awareness campaigns and top-of-funnel optimization. The obvious limitation: it completely ignores what drove the actual conversion.

Last-touch attribution

Last-touch assigns 100% credit to the final touchpoint before conversion. It answers: what channels close the deal?

The SQL is identical to first-touch, but with ORDER BY timestamp DESC.

Despite being the default in many platforms, only 14% of marketers believe last-click attribution is effective. It ignores the discovery and nurturing that made the conversion possible.

Linear attribution

Linear distributes credit equally across all touchpoints. Five touchpoints each get 20%.

WITH touchpoints_counted AS (
  SELECT
    user_id,
    conversion_id,
    channel,
    revenue,
    COUNT(*) OVER (PARTITION BY user_id, conversion_id) AS total_touches
  FROM touchpoints
)
SELECT
  user_id,
  conversion_id,
  channel,
  revenue / total_touches AS attributed_revenue
FROM touchpoints_counted

Linear provides a balanced view when all touchpoints genuinely contribute. The limitation: it may over-credit low-impact touches in long journeys.

Position-based (U-shaped) attribution

Position-based weights the first and last touchpoints more heavily, typically 40% each, with the remaining 20% distributed among middle touches.

WITH touchpoints_positioned AS (
  SELECT
    user_id,
    conversion_id,
    channel,
    revenue,
    ROW_NUMBER() OVER (
      PARTITION BY user_id, conversion_id ORDER BY timestamp ASC
    ) AS position,
    COUNT(*) OVER (PARTITION BY user_id, conversion_id) AS total_touches
  FROM touchpoints
)
SELECT
  user_id,
  conversion_id,
  channel,
  CASE
    WHEN total_touches = 1 THEN 1.0
    WHEN total_touches = 2 THEN 0.5
    WHEN position = 1 THEN 0.4
    WHEN position = total_touches THEN 0.4
    ELSE 0.2 / (total_touches - 2)
  END * revenue AS attributed_revenue
FROM touchpoints_positioned

This model values both discovery (first touch) and conversion (last touch) while still crediting the nurturing in between.

Time-decay attribution

Time-decay assigns more credit to touchpoints closer to conversion, using exponential decay:

Weight = 2^(-days_before_conversion / half_life)

With a 7-day half-life (Google Analytics default), a touchpoint 7 days before conversion gets 50% weight, 14 days gets 25%, and so on.

WITH decay_weights AS (
  SELECT
    user_id,
    conversion_id,
    channel,
    revenue,
    conversion_timestamp,
    timestamp,
    POW(0.5, TIMESTAMP_DIFF(conversion_timestamp, timestamp, MINUTE) / (7.0 * 24 * 60)) AS raw_weight
  FROM touchpoints
),
normalized AS (
  SELECT
    *,
    SUM(raw_weight) OVER (PARTITION BY user_id, conversion_id) AS total_weight
  FROM decay_weights
)
SELECT
  user_id,
  conversion_id,
  channel,
  (raw_weight / total_weight) * revenue AS attributed_revenue
FROM normalized

Adjust the half-life based on your sales cycle: 3-7 days for impulse e-commerce, 14-30 days for B2B mid-market, 30-45 days for enterprise.
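The weight formula is worth sanity-checking before you commit to a half-life. A one-line Python equivalent of the decay expression used in the SQL above:

```python
def decay_weight(days_before_conversion, half_life_days):
    """Weight = 2^(-days / half_life): the weight halves every
    `half_life_days` days before the conversion."""
    return 0.5 ** (days_before_conversion / half_life_days)
```

With a 7-day half-life, a touchpoint on the conversion day gets weight 1.0, one 7 days out gets 0.5, and one 14 days out gets 0.25, matching the example above.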

Data-driven attribution: Markov chains and Shapley values

Heuristic models apply the same rules regardless of actual customer behavior. Data-driven models learn from your data which channels actually drive conversions.

Markov chain attribution

A Markov chain models customer journeys as a sequence of states (channels) where the probability of moving to the next state depends on the current state.

Credit is assigned through the removal effect. Remove a channel entirely from the graph, and measure how much conversion probability drops. If removing Facebook from all journeys drops total conversion probability from 50% to 11%, Facebook’s removal effect is 78%.

Since removal effects don’t sum to 100%, they get normalized:

Share(Channel) = Removal Effect(Channel) / Sum(All Removal Effects)
Attribution(Channel) = Share(Channel) × Total Conversions

First-order Markov models consider only the current state. Higher-order models (2nd, 3rd order) consider previous states too, capturing patterns like “users who see Facebook then Google convert better than Google-only users.” These require exponentially more data.

The SQL for calculating transition probabilities is straightforward; the removal effect calculation typically moves to Python using libraries like ChannelAttribution.
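The core of the removal-effect calculation fits in a short script. A plain-Python sketch (function names and the journey representation are illustrative, not the ChannelAttribution API): build a first-order chain with START, CONV, and NULL states, solve for the conversion probability, then re-solve with each channel cut out.

```python
from collections import defaultdict

def conversion_probability(paths):
    """Build a first-order Markov chain from journeys and return the
    probability of reaching CONV from START. `paths` is a list of
    (channel_sequence, converted) tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    for channels, converted in paths:
        seq = ["START"] + list(channels) + ["CONV" if converted else "NULL"]
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    probs = {
        state: {nxt: c / sum(outs.values()) for nxt, c in outs.items()}
        for state, outs in counts.items()
    }
    # Fixed-point iteration for the absorption probability into CONV;
    # CONV and NULL are absorbing (no outgoing transitions).
    p = defaultdict(float)
    p["CONV"] = 1.0
    for _ in range(200):
        for state, outs in probs.items():
            p[state] = sum(w * p[nxt] for nxt, w in outs.items())
    return p["START"]

def removal_effects(paths):
    """Removal effect per channel: cut every journey at the removed
    channel (so transitions into it lead to NULL) and measure the
    relative drop in conversion probability."""
    base = conversion_probability(paths)
    channels = {c for seq, _ in paths for c in seq}
    effects = {}
    for ch in channels:
        modified = [
            (seq[:seq.index(ch)], False) if ch in seq else (seq, conv)
            for seq, conv in paths
        ]
        effects[ch] = (base - conversion_probability(modified)) / base
    return effects
```

Normalizing the resulting effects across channels, as in the formulas above, turns them into attribution shares.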

Shapley value attribution

From cooperative game theory, Shapley values assign credit based on each channel’s marginal contribution across all possible combinations of channels.

The formula evaluates every possible coalition of channels and measures how much adding a specific channel improves the outcome. It satisfies appealing mathematical properties: equal contributors get equal credit, zero-value channels get nothing, and credits sum to the total.

Computation scales exponentially: 10 channels require evaluating 1,024 coalitions, and 20 channels require over a million. Approximation methods like Monte Carlo sampling make this tractable.
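For small channel counts the exact computation is perfectly feasible. A sketch under one common simplification: the coalition value v(S) counts conversions from journeys whose channels all lie within S (names here are illustrative):

```python
from itertools import combinations
from math import factorial

def shapley_values(journeys):
    """Exact Shapley credit per channel. `journeys` is a list of
    (channel_set, conversions). Exponential in the number of
    channels -- fine for ~10, intractable for 20+ without sampling."""
    channels = sorted(set().union(*(chs for chs, _ in journeys)))
    n = len(channels)

    def v(coalition):
        # Value of a coalition: conversions fully explained by it.
        s = set(coalition)
        return sum(conv for chs, conv in journeys if chs <= s)

    credit = {}
    for ch in channels:
        others = [c for c in channels if c != ch]
        total = 0.0
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                # Shapley weight for a coalition of size k out of n players.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (v(subset + (ch,)) - v(subset))
        credit[ch] = total
    return credit
```

Note the efficiency property in action: the per-channel credits always sum to the total conversions, so no credit is invented or lost.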

Google’s Data-Driven Attribution

Google’s DDA combines conversion probability models with Shapley-based credit distribution. It considers up to 50 touchpoints per conversion with a 90-day lookback.

There’s a catch: DDA requires 400+ conversions per type in 30 days and roughly 10,000 paths with 2+ interactions. If you don’t meet these thresholds, GA4 silently falls back to last-click without telling you.

Building in your warehouse means transparency. You know exactly what model is running and why channels receive the credit they do.

Tools and implementation patterns

dbt packages for attribution

Several open-source dbt packages provide production-ready attribution models:

  • snowplow/dbt-snowplow-attribution: Works with Snowflake, BigQuery, Databricks, and Redshift. Includes first-touch, last-touch, linear, position-based, and time-decay models with ROAS calculation.
  • snowplow/dbt-snowplow-fractribution: Data-driven attribution using Markov chains and Shapley values.
  • TasmanAnalytics/tasman-dbt-mta: Configurable multi-touch, multi-cycle attribution engine for Snowflake and BigQuery.
  • rittmananalytics/ra_attribution: Multi-source integration (Snowplow, Segment, RudderStack) with ad spend data.

Python libraries

For data-driven models that require matrix operations:

  • ChannelAttribution (R and Python): k-order Markov chains with C++ backend for scale
  • marketing-attribution-models (DP6): Markov, Shapley, and all heuristic models with customizable weights

A typical dbt project layout for the attribution models above:

models/
├── base/
│   └── base__ga4__events.sql                 # Raw event cleaning
├── intermediate/
│   ├── int__sessions_stitched.sql            # Sessionization
│   ├── int__touchpoints_classified.sql       # Marketing touchpoints
│   └── int__conversion_paths_sequenced.sql   # User journey paths
└── marts/attribution/
    ├── mrt__attribution__first_touch.sql
    ├── mrt__attribution__last_touch.sql
    ├── mrt__attribution__linear.sql
    ├── mrt__attribution__position_based.sql
    ├── mrt__attribution__time_decay.sql
    └── mrt__attribution__comparison.sql
Use incremental models with appropriate lookback windows for late-arriving data. Test that attribution credits sum to 100% per conversion.
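The sum-to-100% check reduces to grouping attributed revenue by conversion and comparing it to the conversion total. A plain-Python sketch of the assertion (in dbt this would be a data test; the function name and row shape are illustrative):

```python
from collections import defaultdict

def credits_that_dont_sum(rows, tolerance=0.01):
    """`rows` are attribution output records with conversion_id,
    attributed_revenue, and the conversion's total revenue. Return the
    conversion_ids whose shares fail to sum back to the total."""
    attributed = defaultdict(float)
    totals = {}
    for r in rows:
        attributed[r["conversion_id"]] += r["attributed_revenue"]
        totals[r["conversion_id"]] = r["revenue"]
    return sorted(cid for cid, s in attributed.items()
                  if abs(s - totals[cid]) > tolerance)
```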

Recovering lost signals

Server-side tracking implementations recover conversions lost to browser restrictions. GTM server-side, Meta Conversions API, and Google Enhanced Conversions can recover 30-40% of conversions that would otherwise disappear.

This data feeds into your warehouse, improving the completeness of your touchpoint table.

Making attribution actionable

The triangulated approach

The most resilient measurement strategies don’t rely on a single methodology. They combine three approaches:

Multi-touch attribution (MTA) for daily digital optimization. User-level data, granular touchpoint analysis, real-time adjustments.

Media mix modeling (MMM) for quarterly strategic budget allocation. Aggregated, privacy-safe data that includes offline channels like TV and out-of-home.

Incrementality testing for proving causal impact. Holdout tests where a portion of the audience doesn’t see ads, geographic market splits, or platform tools like Meta Conversion Lift.

Use incrementality results to calibrate your attribution models. If your model says Facebook drives 30% of conversions but your holdout test shows only 15% incremental lift, adjust accordingly.

Using model disagreement as a signal

Run multiple attribution models in parallel. When first-touch, last-touch, position-based, and time-decay all agree that organic search drives 25% of value, you have high confidence. When they range from 10% to 40%, you have high uncertainty, and stakeholders should know that.

Present results as ranges rather than false precision: “Channel X drives $50K-$80K depending on attribution model.”
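Producing those ranges is a small aggregation over the per-model outputs. A minimal sketch (the result structure is illustrative):

```python
def credit_ranges(model_results):
    """`model_results` maps model name -> {channel: attributed value}.
    Return per-channel (low, high) across models, so results can be
    reported as ranges instead of a single falsely precise number."""
    channels = set().union(*model_results.values())
    return {
        ch: (
            min(res.get(ch, 0.0) for res in model_results.values()),
            max(res.get(ch, 0.0) for res in model_results.values()),
        )
        for ch in channels
    }
```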

Warning signs of attribution failure

Watch for these signals that your attribution isn’t working:

  • Attributed revenue significantly diverges from actual revenue
  • Scaling a “high ROAS” channel doesn’t translate to profit growth
  • More than 60% of deals can’t be traced to any marketing activity

Where to start

Match your approach to your current scale:

| Stage | Recommended Approach |
| --- | --- |
| Early/SMB | Heuristic models (linear, time-decay) in SQL |
| Growth (500+ monthly conversions) | Markov chain (1st-2nd order) |
| Scale (1000+ conversions) | Shapley value, incrementality calibration |
| Enterprise (Google ecosystem) | Google DDA + custom warehouse validation |
| Enterprise (full control) | Custom Markov/Shapley, triangulated measurement |

Start with heuristic models. They’re transparent, easy to explain to stakeholders, and often sufficient for early optimization. Add data-driven approaches when you have the data volume to support them and the questions that require them.

No single model provides complete accuracy, and that’s fine. Customer journeys are messy, and channels interact in ways that resist clean measurement. Attribution that’s transparent, defensible, and good enough to improve decisions beats a black-box number every time.

Build in your warehouse, run multiple models, validate with experiments, and present results with appropriate uncertainty.