Note

Probabilistic Matching Limitations in GA4

Why probabilistic identity matching fails with GA4's BigQuery export — the signals GA4 intentionally excludes, what coarse data remains, and the compounding cost of false positives.

Planted
ga4 · bigquery · analytics · data quality

When deterministic matching (pairing user_id to user_pseudo_id) leaves you with a lower identification rate than you’d like, probabilistic matching looks attractive. The idea: match anonymous users across sessions by fingerprinting device characteristics, location, and behavioral patterns. In practice, this approach breaks hard against the specific data GA4 exports to BigQuery.

What GA4 intentionally excludes

GA4’s BigQuery export omits the signals that make fingerprinting work:

  • No IP addresses — stripped before export for privacy reasons
  • No user agent strings — not available in any GA4 BigQuery field
  • No canvas fingerprints — not collected by GA4
  • No hardware identifiers — no MAC addresses, device serial numbers, or device fingerprint hashes

This isn’t an oversight. GA4’s privacy design explicitly removes these identifying signals before data reaches your warehouse. Google uses them internally for modeling, but they don’t leave the platform.

What you do have access to

The signals that survive to BigQuery are aggressively coarse:

SELECT DISTINCT
  device.category,         -- mobile, desktop, tablet (3 values)
  device.operating_system, -- Android, iOS, Windows, macOS, etc.
  device.browser,          -- Chrome, Safari, Firefox, etc.
  device.language,         -- en-US, fr-FR, etc.
  geo.country,
  geo.city,
  geo.metro
FROM `project.analytics_XXXXX.events_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', CURRENT_DATE() - 1)

Consider what matching on these fields actually produces. How many users in San Francisco use Chrome on a MacBook? Tens of thousands. How many in Paris use Safari on iOS with language fr-FR? Hundreds of thousands. The collision rate on any combination of these coarse signals is enormous, and it gets worse as your traffic volume grows.
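You can measure the collision rate directly in your own export by counting how many distinct users share each coarse-signal combination (the table name and date range below are placeholders):

```sql
-- Count distinct users sharing each coarse "fingerprint".
-- A high user_count per combination means a high collision risk
-- for any probabilistic match built on these fields.
SELECT
  device.category,
  device.operating_system,
  device.browser,
  device.language,
  geo.country,
  geo.city,
  COUNT(DISTINCT user_pseudo_id) AS user_count
FROM `project.analytics_XXXXX.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'
GROUP BY 1, 2, 3, 4, 5, 6
ORDER BY user_count DESC
LIMIT 20
```

On any site with meaningful traffic, the top rows of this query return hundreds or thousands of users per combination — each one a potential false merge.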

The compounding cost of false positives

Incorrectly merged profiles don’t just inflate your identification rate — they actively corrupt downstream analysis in ways that compound over time:

Recommendation systems train on merged behavioral histories. User A browses enterprise software; User B buys kitchen appliances. Merge them and neither gets accurate recommendations.

Email personalization references browsing behavior the recipient doesn’t recognize. Open rates and click-through rates drop; unsubscribe rates rise.

Attribution models credit channels that didn’t contribute to the conversion. Your ROAS calculations become unreliable, and budget allocation decisions based on them push spend toward channels that appear to work but don’t.

Customer support context becomes confusing. A support agent seeing a merged history misdiagnoses issues and provides incorrect guidance.

These problems don’t announce themselves. The data looks fine. Reports still run. Metrics trend in plausible directions. The corruption is only visible when someone audits the underlying data — often months later when the damage is deep.

The false positive math

Even a “good” probabilistic matcher operating at 95% precision produces compounding problems at scale. Precision of 95% means 5% of the matches it returns are wrong: link 100,000 anonymous users to a pool of 50,000 known users and you get 5,000 incorrect merges. Those 5,000 polluted profiles contaminate every analysis that touches them.

The Customer 360 identity resolution research from the dbt Community shows that purpose-built probabilistic tools like Splink achieve better precision than naive SQL approaches — but even optimized probabilistic matching introduces false positives that most analytics teams lack the capacity to audit and clean. The marginal gain in match rate rarely justifies the downstream data quality cost.

When to accept a lower match rate

The right answer for most GA4 implementations is: use deterministic matching only, accept the identification rate you get, and invest in improving it through implementation rather than inference.

Ways to improve deterministic match rates legitimately:

  • Fix user_id implementation gaps — audit which pages and events are firing user_id and which aren’t. A user who logs in on page 5 of your checkout flow should have user_id set from page 1 onward.
  • Capture user_pseudo_id at form submission — when anonymous users submit a contact form, a hidden field can capture the GA4 client ID and pass it to your CRM, creating a deterministic link without requiring authentication. See Identity Resolution for Customer 360 for the JavaScript pattern.
  • Extend session-scoped stitching — if a user authenticates partway through a session, apply their user_id to all earlier events in that session using FIRST_VALUE IGNORE NULLS across the session window.
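The session-scoped stitching above can be sketched as a window function over each session. Extracting the session from the ga_session_id event parameter is the standard GA4 pattern; the table name is a placeholder:

```sql
-- Backfill user_id onto pre-login events within the same session.
WITH events AS (
  SELECT
    *,
    (SELECT value.int_value FROM UNNEST(event_params)
     WHERE key = 'ga_session_id') AS session_id
  FROM `project.analytics_XXXXX.events_*`
  WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', CURRENT_DATE() - 1)
)
SELECT
  * EXCEPT (user_id),
  -- Take the first non-null user_id anywhere in the session; the
  -- unbounded frame lets events *before* authentication inherit it too.
  FIRST_VALUE(user_id IGNORE NULLS) OVER (
    PARTITION BY user_pseudo_id, session_id
    ORDER BY event_timestamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS user_id
FROM events
```

Note the explicit frame clause: without it, FIRST_VALUE only looks backward from the current row, and events before the login would stay null.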

These approaches raise your deterministic match rate without introducing false positives. A 40% identification rate with high confidence is more valuable than a 70% identification rate built on data you can’t trust.

The exception: known-good behavioral signals

There are narrow cases where probabilistic signals are reliable enough to use. The key condition is that the signal must be sufficiently unique on its own:

  • A user submitting a form with an email address links deterministically to a CRM contact. The form submission is the signal; the email is deterministic.
  • A purchase with a transaction ID that also appears in your order management system links deterministically. The transaction ID is the signal.
  • A very specific behavioral sequence (e.g., completing a specific multi-step flow within a specific time window) combined with timing signals can be reliable in low-volume B2B contexts where the user population is small.
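The transaction-ID case, for example, is a plain join rather than a probabilistic model (the orders table and its column names here are assumptions standing in for your order management system):

```sql
-- Link GA4 purchase events to known customers through the order
-- system's transaction ID -- a deterministic key, not an inferred match.
SELECT
  e.user_pseudo_id,
  o.customer_id,
  e.ecommerce.transaction_id
FROM `project.analytics_XXXXX.events_*` AS e
JOIN `project.crm.orders` AS o
  ON e.ecommerce.transaction_id = o.transaction_id
WHERE e.event_name = 'purchase'
  AND e._TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', CURRENT_DATE() - 1)
```

If a join like this is all the "matching" you need, you have a deterministic link and none of the false-positive exposure described above.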

These aren’t probabilistic — they’re deterministic signals that look probabilistic because they require inference. The difference matters. Stick to signals where you’d be comfortable explaining the match to an auditor.