ServicesAboutNotesContact Get in touch →
EN FR
Note

GA4 user_id Data Quality

Common implementation bugs that corrupt GA4 user_id data — string 'null' values, logout tagging errors, suspicious high-cardinality IDs — and the SQL patterns to detect and filter them.

Planted
ga4bigqueryanalyticsdata quality

Identity stitching results depend on the quality of the user_id values the implementation sends. Bad user_id data does not throw errors — it silently corrupts every downstream model that trusts it. Validate user_id data before building a mapping table or backstitching pipeline.

The string ‘null’ bug

The most common implementation error: developers set user_id to the string 'null', 'none', 'undefined', '-', or 'anonymous' when a user logs out, instead of setting it to actual null.

What this looks like in your tracking code:

// Buggy: sends the string "null" instead of clearing user_id
gtag('config', 'G-XXXXXXX', {
'user_id': user.isLoggedIn ? user.id : 'null'
});
// Correct: pass null (or omit the field entirely)
gtag('config', 'G-XXXXXXX', {
'user_id': user.isLoggedIn ? user.id : null
});

When this bug exists, GA4 treats every logged-out session as belonging to a single user named 'null'. That fake user accumulates thousands of user_pseudo_id mappings, and your stitching pipeline happily merges all of them into one identity. Your stitch rate looks impressive; your data is garbage.

Detecting it:

SELECT
user_id,
COUNT(*) AS events,
COUNT(DISTINCT user_pseudo_id) AS devices
FROM `project.analytics_XXXXX.events_*`
WHERE user_id IS NOT NULL
GROUP BY user_id
HAVING COUNT(DISTINCT user_pseudo_id) > 1000 -- Suspiciously high device count
ORDER BY devices DESC
LIMIT 20

Any user_id mapping to more than a few hundred devices is suspect. Legitimate power users might have 3-5 devices. A user_id with 50,000 user_pseudo_id mappings is almost certainly a bug.

Once confirmed, work with the implementation team to fix the tagging. In the interim, filter these values out of your identity pipeline:

WHERE user_id IS NOT NULL
AND user_id NOT IN ('null', 'undefined', 'none', '-', 'anonymous', 'guest', '')
AND LENGTH(user_id) > 3 -- Filter obvious single-character garbage

The LENGTH filter is a blunt instrument — real user IDs are almost always longer than 3 characters. Adjust the threshold based on your actual ID format.

PII in user_id

The user_id field should never contain email addresses, names, or other personally identifiable information directly. GA4’s documentation is clear on this, and it’s a GDPR concern: user_id values are logged in GA4’s servers and exported to BigQuery, and email addresses in plaintext create compliance exposure.

The fix is hashing before sending:

// Don't send the email directly
gtag('config', 'G-XXXXXXX', {
'user_id': 'jane@example.com' // Never do this
});
// Send a hashed or anonymized database ID instead
gtag('config', 'G-XXXXXXX', {
'user_id': 'user_12345' // Opaque database ID — fine
});
// Or a SHA-256 hash if you must derive from email
gtag('config', 'G-XXXXXXX', {
'user_id': sha256('jane@example.com') // Pseudonymized
});

If you find email addresses in your existing user_id data, check with your privacy team before building stitching logic on top of it. The identity graph you’d build would constitute behavioral profiling of identified individuals, which requires explicit consent basis under GDPR. See Privacy Constraints for Linked Analytics Data for the compliance implications.

Recycled and temporary IDs

Guest checkout systems often generate temporary session IDs that get assigned as user_id and later discarded. If these IDs are reused across different actual people — as some systems do — your mapping table conflates unrelated users.

Detection: look for user_id values that appear in very short windows with many different user_pseudo_id values but low total event counts per pair:

WITH user_stats AS (
SELECT
user_id,
COUNT(DISTINCT user_pseudo_id) AS device_count,
COUNT(*) AS total_events,
TIMESTAMP_DIFF(
MAX(TIMESTAMP_MICROS(event_timestamp)),
MIN(TIMESTAMP_MICROS(event_timestamp)),
DAY
) AS days_active
FROM `project.analytics_XXXXX.events_*`
WHERE user_id IS NOT NULL
GROUP BY user_id
)
SELECT *
FROM user_stats
WHERE device_count > 10
AND days_active < 7 -- Many devices, very short window
ORDER BY device_count DESC

A legitimate user can have multiple devices. A temporary ID being assigned to many unrelated users will show many devices within a short window with little event depth per device.

Employee and test accounts

Sales teams demoing the product, QA engineers running test scenarios, and employee day-to-day usage all generate user_id events that pollute your analytics. They inflate session counts, skew funnel conversion rates, and can create bizarre user_id mappings if test users use the same device repeatedly.

Common approaches:

  • Filter known employee user_id values in your base model (requires maintaining a list)
  • Exclude known internal IP ranges (GA4 doesn’t export IPs, so this must happen at collection time via tag filters in GTM)
  • Add a separate tag trigger that sets a custom dimension is_internal: true for identified internal traffic, then filter on user_properties in BigQuery

The filtering needs to happen in your data, not just in your stitching logic — otherwise the employee sessions still contribute to anonymous traffic counts and session tables.

Validating before you build

Before building a production identity graph, run these checks:

  1. String null audit: Are any suspicious string values appearing in user_id?
  2. Cardinality check: Are any user_id values mapping to an implausibly high number of devices?
  3. PII scan: Do any user_id values look like email addresses or names?
  4. Recycled ID check: Are any IDs short-lived with high device churn?
  5. Volume sanity: What percentage of events have a non-null user_id? If it’s unexpectedly low, the implementation may not be firing consistently.
-- Quick summary of user_id health
SELECT
COUNTIF(user_id IS NOT NULL) AS events_with_user_id,
COUNT(*) AS total_events,
ROUND(COUNTIF(user_id IS NOT NULL) * 100.0 / COUNT(*), 2) AS pct_identified,
COUNT(DISTINCT user_id) AS distinct_user_ids,
COUNT(DISTINCT user_pseudo_id) AS distinct_devices,
COUNTIF(user_id IN ('null', 'undefined', 'none', '', '-')) AS garbage_user_ids
FROM `project.analytics_XXXXX.events_*`
WHERE _table_suffix >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))

This gives you a one-row health summary. The pct_identified tells you your theoretical maximum stitch rate. The garbage_user_ids count tells you how much of that is fake.

user_id quality can degrade silently — a framework update, a new page, or a developer change to logout behavior can introduce the string null bug. Running these checks before major stitching pipeline builds and after significant implementation changes catches regressions early.