Warehouse-based attribution requires three categories of data: website interactions, ad platform spend, and conversions. A sophisticated Markov chain model running on incomplete touchpoint data produces worse results than a simple last-touch model running on clean, comprehensive data. All three categories must be in place to build any attribution model with full cross-platform visibility.
Website interaction data
GA4’s BigQuery export is the most common source of website interaction data for teams building attribution in BigQuery. It provides event-level data in events_YYYYMMDD tables with the fields attribution models need: user_pseudo_id (device identifier), optional user_id (authenticated), and traffic source information.
The traffic source fields have evolved over time. The original traffic_source struct captures the user’s first-ever traffic source. For attribution, you typically want session-level traffic source data, either derived from the event-level collected_traffic_source record or read from the newer session_traffic_source_last_click field. See GA4 Traffic Source Fields for the full breakdown of which field to use when.
The GA4 gclid misattribution bug
A known bug in GA4 sometimes causes paid search clicks that carry a gclid to be recorded as organic. The gclid is present in the URL parameters, confirming the user arrived via a Google Ads click, but GA4 records the source/medium as google/organic.
This matters for attribution because it shifts credit from paid search to organic search — understating your paid search ROI and overstating organic’s contribution. The fix is straightforward: in your base model, check for gclid presence in the page URL parameters and override the source/medium when you find one:
```sql
CASE
  WHEN page_location LIKE '%gclid=%'
    AND source = 'google'
    AND medium = 'organic'
  THEN 'cpc'
  ELSE medium
END AS medium_corrected
```

This is a defensive pattern: apply it even if you haven’t seen the bug yet, because it can appear intermittently and silently corrupt your attribution data.
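The same override can be sketched in Python for teams that clean events before loading them. This is a minimal illustration, not GA4's own logic: the function name `correct_medium` is hypothetical, and it assumes `page_location`, `source`, and `medium` arrive as plain strings.

```python
from urllib.parse import urlparse, parse_qs


def correct_medium(page_location: str, source: str, medium: str) -> str:
    """Return 'cpc' when a Google session carries a gclid but was recorded as organic."""
    # Parse the query string so that only a real gclid parameter matches,
    # not an arbitrary substring elsewhere in the URL.
    params = parse_qs(urlparse(page_location).query)
    has_gclid = bool(params.get("gclid"))
    if has_gclid and source == "google" and medium == "organic":
        return "cpc"
    # Every other combination is left untouched.
    return medium
```

Parsing the query string is slightly stricter than a substring match: a parameter such as notgclid=1 would satisfy a `LIKE '%gclid=%'` pattern but is ignored here.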
Ad platform data
You need campaign and cost data from every platform where you spend money. Without spend data, you can calculate attributed revenue per channel but not ROAS — and ROAS is ultimately what drives budget allocation decisions.
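To make the dependency on spend data concrete, here is a small Python sketch of the ROAS arithmetic (attributed revenue divided by spend). The channel names and figures are invented for illustration, and `roas_by_channel` is a hypothetical helper, not part of any library.

```python
from typing import Dict, Optional


def roas_by_channel(
    attributed_revenue: Dict[str, float], spend: Dict[str, float]
) -> Dict[str, Optional[float]]:
    """ROAS = attributed revenue / spend; None where spend is missing or zero."""
    result: Dict[str, Optional[float]] = {}
    for channel, revenue in attributed_revenue.items():
        cost = spend.get(channel)
        # Without a spend figure the ratio is undefined, so report None.
        result[channel] = revenue / cost if cost else None
    return result


# Illustrative figures only, not real data.
revenue = {"google_ads": 12000.0, "meta_ads": 8000.0, "organic": 5000.0}
spend = {"google_ads": 4000.0, "meta_ads": 4000.0}
roas = roas_by_channel(revenue, spend)
```

With these numbers, google_ads comes out at 3.0 and meta_ads at 2.0, while organic has attributed revenue but no ROAS because there is no spend to divide by.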
The loading landscape varies by platform:
| Platform | Native BigQuery Support | ETL Options |
|---|---|---|
| Google Ads | Yes (Data Transfer Service) | dlt, Fivetran, Airbyte |
| Meta Ads | No | dlt, Fivetran, Airbyte |
| LinkedIn Ads | No | dlt, Fivetran, Airbyte |
| TikTok Ads | No | dlt, Fivetran, Airbyte |
Google Ads Data Transfer Service is the simplest path for Google Ads data — it’s free, native, and writes directly to BigQuery. For other platforms, you’re choosing between managed ETL (Fivetran, Airbyte) and code-first tools like dlt.
The choice between managed and code-first ETL depends on your team’s engineering capacity and your tolerance for vendor dependency. For a full analysis, see the broader discussion of getting ad data into the warehouse.
Join strategies between ad data and web analytics
Connecting ad platform spend data to website sessions requires a reliable join key. Two approaches:
Click ID matching uses platform-specific identifiers (gclid for Google, fbclid for Meta, ttclid for TikTok) that are appended to landing page URLs. These provide precise, click-level matching between the ad platform event and the website session. The downside: they only work for clicks, not impressions, and they require the click ID to survive the redirect chain intact.
UTM parameter matching uses utm_source, utm_medium, and utm_campaign parameters to join at the session or campaign level. This is less precise than click IDs but works across all platforms and doesn’t depend on platform-specific identifiers. See Campaign Naming and UTM Standardization for the hygiene rules that make UTM-based joins reliable.
In practice, use both: click IDs for precise paid-channel matching where they’re available, and UTM parameters as the fallback and as the cross-platform normalization layer.
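A minimal Python sketch of that "click ID first, UTM fallback" rule might look like the following. The function name `join_key`, the platform labels, and the shape of the returned tuple are assumptions made for illustration. Only the parameter names (gclid, fbclid, ttclid, and the three UTM fields) come from the text above.

```python
from urllib.parse import urlparse, parse_qs

# Click ID parameters named in the text, mapped to an illustrative platform label.
CLICK_ID_PARAMS = {"gclid": "google", "fbclid": "meta", "ttclid": "tiktok"}
UTM_FIELDS = ("utm_source", "utm_medium", "utm_campaign")


def join_key(landing_url: str) -> tuple:
    """Pick the most precise join key available on a landing page URL."""
    # Keep the first value of each query parameter.
    params = {k: v[0] for k, v in parse_qs(urlparse(landing_url).query).items()}

    # 1. Prefer a platform click ID: it matches at click level.
    for param, platform in CLICK_ID_PARAMS.items():
        if param in params:
            return ("click_id", (platform, params[param]))

    # 2. Fall back to lowercased UTM values: campaign-level matching.
    utm = tuple(params.get(field, "").lower() for field in UTM_FIELDS)
    if any(utm):
        return ("utm", utm)

    # 3. Nothing usable on the URL.
    return ("none", ())
```

For a URL carrying both a gclid and UTM parameters, this returns the click ID key. The UTM values are still on the URL and can be stored alongside it for cross-platform reporting.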
Standardizing UTM parameters
UTM parameters are the cross-platform bridge between ad spend and web analytics. They must be standardized or your join layer falls apart:
- Always lowercase. UTMs are case-sensitive in GA4. utm_source=Facebook and utm_source=facebook create two separate sources.
- Consistent platform naming. Decide once whether you use “facebook” or “meta” as the utm_source value, and stick with it everywhere.
- Dynamic parameters where available. Google supports {campaignid} and {adgroupid} as auto-populated UTM values. Meta supports {{campaign.name}} and {{adset.name}}. Use these to eliminate human error in the most granular parameters.
- Include platform click IDs. Append gclid, fbclid, etc. alongside UTM parameters so you have both join strategies available.
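The first two rules (lowercasing and consistent platform naming) can also be enforced defensively in the data layer. The Python sketch below assumes the team picked "meta" as the canonical utm_source value. The alias map and the name `normalize_utm` are hypothetical and would need to reflect your own taxonomy.

```python
from typing import Dict

# Hypothetical alias map: assumes "meta" was chosen as the canonical source name.
SOURCE_ALIASES = {"facebook": "meta", "fb": "meta"}


def normalize_utm(utm: Dict[str, str]) -> Dict[str, str]:
    """Lowercase and trim every UTM value, then map source aliases to one name."""
    cleaned = {key: value.strip().lower() for key, value in utm.items()}
    source = cleaned.get("utm_source")
    if source in SOURCE_ALIASES:
        cleaned["utm_source"] = SOURCE_ALIASES[source]
    return cleaned
```

A cleanup step like this limits the damage from inconsistent tagging, but it does not replace the documented taxonomy: values it has never seen still pass through unchanged.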
Document your UTM taxonomy in a place the media buying team actually checks. If they don’t follow the conventions, the data team can’t build reliable joins, and cross-platform attribution breaks at the data layer.
Conversion data
Conversions might come from your e-commerce platform, CRM, application database, or a combination. The key requirement: you need a way to link conversions back to the marketing touchpoints that preceded them.
For e-commerce, the purchase event in GA4 typically carries a transaction_id that links to your order management system. This creates a closed loop: GA4 touchpoints join to GA4 conversions via user_pseudo_id, and GA4 conversions join to your revenue data via transaction_id.
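The second half of that loop, from GA4 purchase events to order revenue, can be sketched in Python as a left join on transaction_id. The record shapes, the `revenue` field, and the name `attach_revenue` are assumptions for illustration. Real order systems will differ.

```python
from typing import Dict, List


def attach_revenue(ga4_purchases: List[Dict], orders: List[Dict]) -> List[Dict]:
    """Left-join GA4 purchase events to order revenue on transaction_id."""
    revenue_by_txn = {order["transaction_id"]: order["revenue"] for order in orders}
    joined = []
    for purchase in ga4_purchases:
        # Purchases with no matching order keep revenue=None so gaps stay visible.
        revenue = revenue_by_txn.get(purchase["transaction_id"])
        joined.append({**purchase, "revenue": revenue})
    return joined


# Illustrative records only.
purchases = [
    {"user_pseudo_id": "u1", "transaction_id": "T-100"},
    {"user_pseudo_id": "u2", "transaction_id": "T-101"},
]
orders = [{"transaction_id": "T-100", "revenue": 59.90}]
joined = attach_revenue(purchases, orders)
```

Keeping unmatched purchases with an empty revenue value, rather than dropping them, makes it easy to measure how many GA4 conversions fail to reach the order system.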
For B2B, conversions are often CRM events — opportunity creation, deal closed-won — that don’t have a direct link to web analytics data. The join requires identity resolution: connecting the anonymous user_pseudo_id from GA4 to the known contact in your CRM. This is harder than e-commerce conversion tracking and is often the bottleneck for B2B attribution accuracy.
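For SaaS, conversions might be signups, trial starts, or subscription payments. If your product creates a user record at signup and you pass its identifier to GA4 as user_id (an internal ID rather than the raw email, since Google’s policies prohibit sending personally identifiable information to Analytics), you have a direct link. If not, you’re back to identity resolution.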
What counts as a conversion
This is a business decision, not a technical one. Common choices:
- E-commerce: Purchase event (with revenue)
- B2B SaaS: Demo request, trial start, paid subscription
- Lead gen: Form submission, qualified lead (MQL)
- Content/media: Newsletter signup, content download
Choose conversions that are meaningful enough to base budget decisions on but frequent enough to produce statistically useful attribution data. If your chosen conversion only happens 10 times per month, you won’t have enough data for multi-touch models to be reliable.
Putting it together
These three data sources feed into the touchpoint table, which is the single intermediate model that all attribution models consume. The touchpoint table joins website interactions (touchpoints) to conversions within a lookback window, producing one row per touchpoint-conversion pair.
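The lookback join can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production model: the 30-day default window, the field names `touched_at`, `converted_at`, `conversion_id`, and `channel`, and the sample records are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def build_touchpoint_rows(
    touchpoints: List[Dict], conversions: List[Dict], lookback_days: int = 30
) -> List[Dict]:
    """Return one row per touchpoint-conversion pair inside the lookback window."""
    window = timedelta(days=lookback_days)
    rows = []
    for conv in conversions:
        for tp in touchpoints:
            same_user = tp["user_pseudo_id"] == conv["user_pseudo_id"]
            # Keep touchpoints at or before the conversion, no older than the window.
            in_window = (
                conv["converted_at"] - window <= tp["touched_at"] <= conv["converted_at"]
            )
            if same_user and in_window:
                rows.append({
                    "conversion_id": conv["conversion_id"],
                    "user_pseudo_id": tp["user_pseudo_id"],
                    "channel": tp["channel"],
                    "touched_at": tp["touched_at"],
                    "converted_at": conv["converted_at"],
                })
    return rows


# Illustrative records only.
touchpoints = [
    {"user_pseudo_id": "u1", "channel": "organic", "touched_at": datetime(2023, 11, 1)},
    {"user_pseudo_id": "u1", "channel": "paid_search", "touched_at": datetime(2024, 1, 1)},
    {"user_pseudo_id": "u1", "channel": "email", "touched_at": datetime(2024, 1, 20)},
    {"user_pseudo_id": "u2", "channel": "paid_social", "touched_at": datetime(2024, 1, 5)},
]
conversions = [
    {"conversion_id": "c1", "user_pseudo_id": "u1", "converted_at": datetime(2024, 1, 25)},
]
rows = build_touchpoint_rows(touchpoints, conversions)
```

Here the November organic visit falls outside the 30-day window and the u2 touchpoint belongs to a different user, so only the paid_search and email touchpoints are paired with conversion c1. In a warehouse the same logic would be a join on the user key with a date-range condition rather than a nested loop.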
The quality of every downstream attribution model, channel ranking, and ROAS calculation depends on the quality of these three source datasets.