Platform-based attribution is failing. iOS App Tracking Transparency has a 35% opt-in rate. Google Consent Mode v2 requires four separate consent parameters for EU users. And Google quietly abandoned its Privacy Sandbox APIs in late 2025 after years of delays. The tools marketers relied on for attribution are increasingly unreliable, or simply unavailable.
The response from data teams? Build attribution where you control the data: your warehouse.
Why platform attribution is breaking down
Marketing attribution answers a deceptively simple question: which touchpoints deserve credit for a conversion? When a customer discovers you through organic search, clicks a LinkedIn ad, reads three blog posts, receives an email, then converts through paid search, who gets credit?
Platforms like Google Analytics and Meta Ads Manager provide attribution reports, but they suffer from fundamental limitations.
Walled garden bias. Each platform naturally over-credits its own contributions. Meta’s conversion reports can’t see Google touchpoints. Google can’t see Meta. Neither can see your email nurture sequence. Each platform claims credit for conversions the others also claim.
Cross-device blindness. Without proper identity resolution, GA4 sees a user on mobile, laptop, and tablet as three separate people. The customer journey fragments into disconnected sessions.
Black-box methodology. Google’s Data-Driven Attribution provides no visibility into how credit is assigned. You get a number without understanding how it was calculated or how to improve it.
Signal degradation. Between ad blockers, iOS ATT, and consent requirements, 40-60% of mobile interactions now go untracked. Your attribution data has holes you can’t see.
What warehouse-based attribution gives you
Moving attribution to your warehouse addresses these problems directly.
Data ownership. Your touchpoint data lives in infrastructure you control. You’re not dependent on platform retention policies or export limitations.
Cross-platform visibility. When GA4 events, ad platform data, and CRM conversions all live in BigQuery, you can join them. The customer who clicked a Facebook ad, visited your site three times, and converted after a sales call becomes one coherent journey.
Customizable logic. Your business isn’t identical to everyone else’s. A B2B company with 90-day sales cycles needs different attribution than an e-commerce site with same-day purchases. In your warehouse, you define the rules.
Privacy-compliant foundation. First-party data you collect directly (with consent) remains usable even as third-party tracking erodes. Building on this foundation future-proofs your measurement.
Full transparency. Every calculation is visible SQL. When a stakeholder asks why a channel received 23% of credit, you can show them the exact logic.
The data foundation you need
Attribution requires three categories of data in your warehouse.
Website interaction data. GA4’s BigQuery export provides event-level data in events_YYYYMMDD tables. Key fields include user_pseudo_id (device identifier), optional user_id (authenticated), and traffic source information in collected_traffic_source or the newer session_traffic_source_last_click field.
One warning: a known bug in GA4 causes gclid parameters to sometimes attribute paid search as organic. Check for gclid presence in your URL parameters and consider overwriting source/medium when you find it.
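One way to implement that override is a small classification step in your ingestion layer. The sketch below is illustrative (the function and field names are not part of any GA4 API); it simply reclassifies a session as paid search whenever a gclid appears in the landing-page URL:

```python
from urllib.parse import urlparse, parse_qs

def correct_traffic_source(source, medium, page_url):
    """If a Google Ads click ID (gclid) is present in the landing-page URL,
    override whatever GA4 recorded and classify the session as paid search.
    Function and field names here are illustrative, not a GA4 API."""
    params = parse_qs(urlparse(page_url).query)
    if "gclid" in params:
        return ("google", "cpc")
    return (source, medium)
```

A session GA4 recorded as google / organic but whose URL carries ?gclid=… would come out as google / cpc; sessions without a click ID pass through unchanged.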
Ad platform data. You’ll need campaign and cost data from the platforms where you spend money.
| Platform | Native BigQuery Support | ETL Options |
|---|---|---|
| Google Ads | Yes (Data Transfer Service) | dlt, Fivetran, Airbyte |
| Meta Ads | Yes (Data Transfer Service) | dlt, Fivetran, Airbyte |
| LinkedIn Ads | No | dlt, Fivetran, Airbyte |
| TikTok Ads | No | dlt, Fivetran, Airbyte |
Join strategies typically use click IDs (gclid, fbclid, ttclid) for precise matching or UTM parameters for session-level joins. Standardize your UTM taxonomy across all campaigns: lowercase consistently, include platform click IDs, and document your naming conventions.
Conversion data. This might come from your e-commerce platform, CRM, or application database. You need a way to link conversions back to the marketing touchpoints that preceded them.
Building a touchpoint table
Your attribution models will operate on a touchpoint table with this structure:
| Field | Purpose | Example |
|---|---|---|
| user_id | User identifier | user_pseudo_id, customer_id |
| session_id | Session identifier | ga_session_id |
| timestamp | When the touchpoint occurred | event_timestamp |
| channel | Marketing classification | “paid_search”, “email” |
| campaign | Campaign identifier | UTM campaign value |
| conversion_id | Links to conversion | order_id |
| revenue | Conversion value | 149.99 |
Identity resolution
Connecting anonymous browsing sessions to known customers is critical for accurate attribution.
Deterministic matching uses exact identifiers like email or user_id. When someone logs in, you can associate their authenticated identity with their prior anonymous sessions via user_pseudo_id. This achieves 95%+ accuracy but requires user authentication.
Probabilistic matching uses statistical signals like IP address and device characteristics. Accuracy ranges from 60-85%, and privacy regulations increasingly restrict this approach.
For most teams, deterministic matching with first-party data is the path forward. Capture user_id at login, associate it with prior sessions, and build your identity graph incrementally as users engage.
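The core of that incremental identity graph is a simple mapping. This sketch assumes an event shape of (user_pseudo_id, user_id_or_None), which is an illustrative simplification of the GA4 export:

```python
def build_identity_map(events):
    """Deterministic stitching: any device (user_pseudo_id) ever observed with
    an authenticated user_id is mapped to that user, which retroactively claims
    the device's earlier anonymous sessions. The (pseudo_id, user_id_or_None)
    event shape is an illustrative simplification."""
    identity = {}
    for pseudo_id, user_id in events:
        if user_id is not None:
            identity[pseudo_id] = user_id
    return identity

def resolve_user(pseudo_id, identity):
    """Fall back to the device ID when no login was ever observed."""
    return identity.get(pseudo_id, pseudo_id)
```

Because the map is keyed by device, a login on dev_a retroactively attaches dev_a's earlier anonymous sessions to that user, while a never-authenticated device keeps its pseudo ID.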
Lookback windows
How far back should you look for touchpoints that contributed to a conversion? Industry benchmarks vary:
| Industry | Recommended Window |
|---|---|
| E-commerce (impulse) | 7-14 days |
| E-commerce (considered) | 30-45 days |
| B2B mid-market | 90-180 days |
| B2B enterprise | 180+ days |
Heuristic attribution models in SQL
Heuristic models apply predetermined rules to distribute conversion credit. They’re straightforward to implement and understand, making them good starting points before adding complexity.
First-touch attribution
First-touch assigns 100% of credit to the initial touchpoint. It answers: what channels introduce customers to our brand?
```sql
WITH touchpoints_numbered AS (
  SELECT
    user_id,
    conversion_id,
    channel,
    revenue,
    timestamp,
    ROW_NUMBER() OVER (
      PARTITION BY user_id, conversion_id
      ORDER BY timestamp ASC
    ) AS touch_position
  FROM touchpoints
  WHERE timestamp >= TIMESTAMP_SUB(conversion_timestamp, INTERVAL 30 DAY)
)

SELECT
  user_id,
  conversion_id,
  channel,
  revenue AS attributed_revenue
FROM touchpoints_numbered
WHERE touch_position = 1
```

Best for brand awareness campaigns and top-of-funnel optimization. The obvious limitation: it completely ignores what drove the actual conversion.
Last-touch attribution
Last-touch assigns 100% credit to the final touchpoint before conversion. It answers: what channels close the deal?
The SQL is identical to first-touch, but with ORDER BY timestamp DESC.
Despite being the default in many platforms, only 14% of marketers believe last-click attribution is effective. It ignores the discovery and nurturing that made the conversion possible.
Linear attribution
Linear distributes credit equally across all touchpoints. Five touchpoints each get 20%.
```sql
WITH touchpoints_counted AS (
  SELECT
    user_id,
    conversion_id,
    channel,
    revenue,
    COUNT(*) OVER (PARTITION BY user_id, conversion_id) AS total_touches
  FROM touchpoints
)

SELECT
  user_id,
  conversion_id,
  channel,
  revenue / total_touches AS attributed_revenue
FROM touchpoints_counted
```

Linear provides a balanced view when all touchpoints genuinely contribute. The limitation: it may over-credit low-impact touches in long journeys.
Position-based (U-shaped) attribution
Position-based weights the first and last touchpoints more heavily, typically 40% each, with the remaining 20% distributed among middle touches.
```sql
WITH touchpoints_positioned AS (
  SELECT
    user_id,
    conversion_id,
    channel,
    revenue,
    ROW_NUMBER() OVER (
      PARTITION BY user_id, conversion_id
      ORDER BY timestamp ASC
    ) AS position,
    COUNT(*) OVER (PARTITION BY user_id, conversion_id) AS total_touches
  FROM touchpoints
)

SELECT
  user_id,
  conversion_id,
  channel,
  CASE
    WHEN total_touches = 1 THEN 1.0
    WHEN total_touches = 2 THEN 0.5
    WHEN position = 1 THEN 0.4
    WHEN position = total_touches THEN 0.4
    ELSE 0.2 / (total_touches - 2)
  END * revenue AS attributed_revenue
FROM touchpoints_positioned
```

This model values both discovery (first touch) and conversion (last touch) while still crediting the nurturing in between.
Time-decay attribution
Time-decay assigns more credit to touchpoints closer to conversion, using exponential decay:
```
weight = 2^(-days_before_conversion / half_life)
```

With a 7-day half-life (Google Analytics default), a touchpoint 7 days before conversion gets 50% weight, 14 days gets 25%, and so on.
```sql
WITH decay_weights AS (
  SELECT
    user_id,
    conversion_id,
    channel,
    revenue,
    conversion_timestamp,
    timestamp,
    POW(0.5, TIMESTAMP_DIFF(conversion_timestamp, timestamp, MINUTE) / (7.0 * 24 * 60)) AS raw_weight
  FROM touchpoints
),

normalized AS (
  SELECT
    *,
    SUM(raw_weight) OVER (PARTITION BY user_id, conversion_id) AS total_weight
  FROM decay_weights
)

SELECT
  user_id,
  conversion_id,
  channel,
  (raw_weight / total_weight) * revenue AS attributed_revenue
FROM normalized
```

Adjust the half-life based on your sales cycle: 3-7 days for impulse e-commerce, 14-30 days for B2B mid-market, 30-45 days for enterprise.
Data-driven attribution: Markov chains and Shapley values
Heuristic models apply the same rules regardless of actual customer behavior. Data-driven models learn from your data which channels actually drive conversions.
Markov chain attribution
A Markov chain models customer journeys as a sequence of states (channels) where the probability of moving to the next state depends on the current state.
Credit is assigned through the removal effect. Remove a channel entirely from the graph, and measure how much conversion probability drops. If removing Facebook from all journeys drops total conversion probability from 50% to 11%, Facebook’s removal effect is 78%.
Since removal effects don’t sum to 100%, they get normalized:
```
Share(Channel)       = Removal Effect(Channel) / Sum(All Removal Effects)
Attribution(Channel) = Share(Channel) × Total Conversions
```

First-order Markov models consider only the current state. Higher-order models (2nd, 3rd order) also consider previous states, capturing patterns like “users who see Facebook then Google convert better than Google-only users.” These require exponentially more data.
The SQL for calculating transition probabilities is straightforward; the removal effect calculation typically moves to Python using libraries like ChannelAttribution.
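To make the removal-effect idea concrete, here is a deliberately simplified path-level sketch: instead of rebuilding the transition matrix (what ChannelAttribution actually does), it just deletes every journey that touches a channel and measures the drop in conversion rate. Path data is toy data.

```python
def removal_effects(paths):
    """Path-level approximation of the Markov removal effect: drop every
    journey that touches a channel and measure the fall in overall conversion
    rate. `paths` is a list of (channel_tuple, converted_bool) pairs."""
    total = len(paths)
    base_rate = sum(conv for _, conv in paths) / total
    channels = {ch for chans, _ in paths for ch in chans}
    return {
        ch: (base_rate - sum(conv for chans, conv in paths if ch not in chans) / total) / base_rate
        for ch in channels
    }

def normalized_shares(paths):
    """Normalize removal effects so channel credit shares sum to 1."""
    effects = removal_effects(paths)
    denom = sum(effects.values())
    return {ch: eff / denom for ch, eff in effects.items()}
```

With four toy journeys where every conversion involves search, removing search zeroes out conversions (removal effect 1.0), so search ends up with the largest normalized share, and a channel that appears only in non-converting paths gets zero.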
Shapley value attribution
From cooperative game theory, Shapley values assign credit based on each channel’s marginal contribution across all possible combinations of channels.
The formula evaluates every possible coalition of channels and measures how much adding a specific channel improves the outcome. It satisfies appealing mathematical properties: equal contributors get equal credit, zero-value channels get nothing, and credits sum to the total.
Computation scales exponentially: 10 channels require evaluating 1,024 coalitions, and 20 channels require over a million. Approximation methods like Monte Carlo sampling make this tractable.
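The sampling trick is easy to sketch: draw random channel orderings and average each channel's marginal contribution as it joins the coalition. The `value` function here is an assumed input, a mapping from a coalition of channels to an outcome such as that coalition's conversion rate, which you would estimate from journey data.

```python
import random

def monte_carlo_shapley(channels, value, samples=1000, seed=0):
    """Approximate Shapley values by sampling random channel orderings and
    averaging each channel's marginal contribution. `value` maps a frozenset
    of channels to an outcome (e.g. a coalition's conversion rate) and must
    be estimated from your own journey data."""
    rng = random.Random(seed)
    credit = {ch: 0.0 for ch in channels}
    for _ in range(samples):
        order = list(channels)
        rng.shuffle(order)
        coalition, prev = frozenset(), value(frozenset())
        for ch in order:
            coalition = coalition | {ch}
            cur = value(coalition)
            credit[ch] += cur - prev
            prev = cur
    return {ch: total / samples for ch, total in credit.items()}
```

For a purely additive value function the estimate is exact in every sample, which makes a handy sanity check; for realistic, interacting channels the estimate converges as `samples` grows.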
Google’s Data-Driven Attribution
Google’s DDA combines conversion probability models with Shapley-based credit distribution. It considers up to 50 touchpoints per conversion with a 90-day lookback.
There’s a catch: DDA requires 400+ conversions per type in 30 days and roughly 10,000 paths with 2+ interactions. If you don’t meet these thresholds, GA4 silently falls back to last-click without telling you.
Building in your warehouse means transparency. You know exactly what model is running and why channels receive the credit they do.
Tools and implementation patterns
dbt packages for attribution
Several open-source dbt packages provide production-ready attribution models:
- snowplow/dbt-snowplow-attribution: Works with Snowflake, BigQuery, Databricks, and Redshift. Includes first-touch, last-touch, linear, position-based, and time-decay models with ROAS calculation.
- snowplow/dbt-snowplow-fractribution: Data-driven attribution using Markov chains and Shapley values.
- TasmanAnalytics/tasman-dbt-mta: Configurable multi-touch, multi-cycle attribution engine for Snowflake and BigQuery.
- rittmananalytics/ra_attribution: Multi-source integration (Snowplow, Segment, RudderStack) with ad spend data.
Python libraries
For data-driven models that require matrix operations:
- ChannelAttribution (R and Python): k-order Markov chains with C++ backend for scale
- marketing-attribution-models (DP6): Markov, Shapley, and all heuristic models with customizable weights
Recommended dbt layer structure
```
models/
├── base/
│   └── base__ga4__events.sql                  # Raw event cleaning
├── intermediate/
│   ├── int__sessions_stitched.sql             # Sessionization
│   ├── int__touchpoints_classified.sql        # Marketing touchpoints
│   └── int__conversion_paths_sequenced.sql    # User journey paths
├── marts/attribution/
│   ├── mrt__attribution__first_touch.sql
│   ├── mrt__attribution__last_touch.sql
│   ├── mrt__attribution__linear.sql
│   ├── mrt__attribution__position_based.sql
│   ├── mrt__attribution__time_decay.sql
│   └── mrt__attribution__comparison.sql
```

Use incremental models with appropriate lookback windows for late-arriving data. Test that attribution credits sum to 100% per conversion.
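The sum-to-100% check fits naturally as a singular dbt test. This sketch assumes the mart also carries the original conversion revenue alongside the attributed amount; adapt the model and column names to your project:

```sql
-- tests/attribution_credits_sum.sql (singular dbt test: any rows returned fail)
SELECT
  conversion_id,
  SUM(attributed_revenue) AS total_attributed,
  ANY_VALUE(revenue) AS actual_revenue
FROM {{ ref('mrt__attribution__linear') }}
GROUP BY conversion_id
HAVING ABS(SUM(attributed_revenue) - ANY_VALUE(revenue)) > 0.01
```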
Recovering lost signals
Server-side tracking implementations recover conversions lost to browser restrictions. GTM server-side, Meta Conversions API, and Google Enhanced Conversions can recover 30-40% of conversions that would otherwise disappear.
This data feeds into your warehouse, improving the completeness of your touchpoint table.
Making attribution actionable
The triangulated approach
The most resilient measurement strategies don’t rely on a single methodology. They combine three approaches:
Multi-touch attribution (MTA) for daily digital optimization. User-level data, granular touchpoint analysis, real-time adjustments.
Media mix modeling (MMM) for quarterly strategic budget allocation. Aggregated, privacy-safe data that includes offline channels like TV and out-of-home.
Incrementality testing for proving causal impact. Holdout tests where a portion of the audience doesn’t see ads, geographic market splits, or platform tools like Meta Conversion Lift.
Use incrementality results to calibrate your attribution models. If your model says Facebook drives 30% of conversions but your holdout test shows only 15% incremental lift, adjust accordingly.
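One simple way to apply that correction: pin the tested channel's share to its measured incremental share and renormalize the rest. This is a deliberately crude sketch with illustrative numbers; real calibration might adjust several channels at once.

```python
def calibrate_shares(shares, channel, measured_lift_share):
    """Pin one channel's attributed share to its experimentally measured
    incremental share, then renormalize so shares sum to 1. Deliberately
    simple; real calibration may adjust several channels jointly."""
    adjusted = dict(shares)
    adjusted[channel] = measured_lift_share
    total = sum(adjusted.values())
    return {ch: v / total for ch, v in adjusted.items()}
```

With the numbers above (Facebook attributed 30% but only 15% incremental), Facebook's share drops and the remaining channels absorb the difference proportionally.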
Using model disagreement as a signal
Run multiple attribution models in parallel. When first-touch, last-touch, position-based, and time-decay all agree that organic search drives 25% of value, you have high confidence. When they range from 10% to 40%, you have high uncertainty, and stakeholders should know that.
Present results as ranges rather than false precision: “Channel X drives $50K-$80K depending on attribution model.”
Warning signs of attribution failure
Watch for these signals that your attribution isn’t working:
- Attributed revenue significantly diverges from actual revenue
- Scaling a “high ROAS” channel doesn’t translate to profit growth
- More than 60% of deals can’t be traced to any marketing activity
Where to start
Match your approach to your current scale:
| Stage | Recommended Approach |
|---|---|
| Early/SMB | Heuristic models (linear, time-decay) in SQL |
| Growth (500+ monthly conversions) | Markov chain (1st-2nd order) |
| Scale (1000+ conversions) | Shapley value, incrementality calibration |
| Enterprise (Google ecosystem) | Google DDA + custom warehouse validation |
| Enterprise (full control) | Custom Markov/Shapley, triangulated measurement |
Start with heuristic models. They’re transparent, easy to explain to stakeholders, and often sufficient for early optimization. Add data-driven approaches when you have the data volume to support them and the questions that require them.
No single model provides complete accuracy, and that’s fine. Customer journeys are messy, and channels interact in ways that resist clean measurement. Attribution that’s transparent, defensible, and good enough to improve decisions beats a black-box number every time.
Build in your warehouse, run multiple models, validate with experiments, and present results with appropriate uncertainty.