Identity Resolution for Customer 360

CRM data and GA4 data may share a BigQuery project or dataset, but they share no identifier. GA4 tracks devices via user_pseudo_id; the CRM tracks contacts by email and internal ID. These systems describe the same users through different lenses with no native join key.

Identity resolution creates that join key in the warehouse, where both data sources already live.

The join key doesn’t exist — you have to create it

GA4’s BigQuery export gives you two identifiers per event: user_pseudo_id (a cookie-based device ID set automatically) and user_id (a custom identifier you set yourself via the dataLayer or gtag). Your CRM gives you emails, phone numbers, and internal record IDs. None of these overlap by default.

Three strategies work in practice, and most teams end up combining at least two.

Strategy 1: Capture GA4 client ID at form submission

When a visitor fills out a form on your site, a hidden field grabs the user_pseudo_id value and passes it to your CRM alongside the email and name. This requires a small frontend change — a JavaScript snippet that reads the GA4 cookie and populates a hidden field — and a custom CRM property to store the value.

// Read GA4 client ID from the cookie
function getGA4ClientId() {
  const cookie = document.cookie
    .split('; ')
    .find(row => row.startsWith('_ga='));
  if (cookie) {
    // _ga cookie format: GA1.1.XXXXXXXXXX.XXXXXXXXXX
    return cookie.split('.').slice(2).join('.');
  }
  return null;
}

// Populate hidden form field
document.querySelector('input[name="ga4_client_id"]').value = getGA4ClientId();

The result is a direct, deterministic link between a CRM contact and their GA4 browsing history. This is where most teams start because it captures marketing-sourced leads who never log in to a product.

If users log in to your product or portal, push the CRM contact ID (or a hashed email) into GA4’s user_id field via the dataLayer:

dataLayer.push({
  'user_id': 'crm_contact_12345'  // or a hashed email
});

GA4 applies this user_id to all events within the same session once it’s set. The limitation: it does not retroactively apply to previous sessions. User backstitching fixes this in the warehouse.

Strategy 3: Match on transaction ID

For ecommerce, if your GA4 purchase events include a transaction_id that also appears in your CRM orders, you can join on that directly. Reliable when it’s available, but it only gives you identity for converting users — it tells you nothing about the visitors who browsed but didn’t buy.

Combining strategies

The hidden form field captures marketing-sourced leads who never log in. The user_id approach covers authenticated users across sessions. Together they cover the majority of your identifiable traffic. Transaction ID fills in gaps for ecommerce conversions.

Deterministic vs probabilistic matching

Once you have join keys for some users, the question is how to expand coverage.

Deterministic matching uses exact identifiers: email, user_id, phone number, transaction ID. If a GA4 user_pseudo_id was captured alongside an email via a form submission, and that email exists in your CRM, you have a match. No ambiguity. This should always be your starting point.

The SQL is straightforward — join your form submission data (where user_pseudo_id was captured alongside an email) with your CRM contacts on email:

SELECT
    form.user_pseudo_id,
    form.email,
    crm.contact_id AS crm_contact_id
FROM {{ ref('base__form_submissions') }} form
INNER JOIN {{ ref('base__crm__contacts') }} crm
    ON LOWER(TRIM(form.email)) = LOWER(TRIM(crm.email))

The LOWER(TRIM(...)) is defensive but essential. Email casing and whitespace inconsistencies are rampant across systems. Without normalization, you lose matches that should be trivial.

Probabilistic matching — fuzzy matching on names, shared IP addresses, behavioral similarity — offers lower precision. A test by a dbt Community member showed that Jaro-Winkler string matching in SQL achieved a 45% merge rate on a 6,500-record dataset, compared to 66% using Splink’s probabilistic model. The gap is significant, and false positives in identity resolution cause real problems downstream: a merged record that combines two different people pollutes every metric built on it.

For most analytics teams, deterministic matching alone gets you far enough. The marginal gain from probabilistic matching rarely justifies the false positive risk unless you have a dedicated data quality team reviewing matches.

Building the identity mapping table

The output of identity resolution is a mapping table: one row per known identity pair. In dbt, this is an intermediate model:

-- int__identity_resolved.sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key=['identifier_type', 'identifier_value']
) }}

WITH form_identities AS (
    SELECT
        'user_pseudo_id' AS identifier_type,
        user_pseudo_id AS identifier_value,
        email,
        crm_contact_id,
        submitted_at AS identified_at
    FROM {{ ref('int__form_submissions_with_crm') }}
    WHERE crm_contact_id IS NOT NULL
),

login_identities AS (
    SELECT
        'user_id' AS identifier_type,
        user_id AS identifier_value,
        NULL AS email,
        crm_contact_id,
        event_timestamp AS identified_at
    FROM {{ ref('base__ga4__events') }}
    WHERE user_id IS NOT NULL
      AND event_name = 'login'
),

all_identities AS (
    SELECT * FROM form_identities
    UNION ALL
    SELECT * FROM login_identities
)

SELECT
    identifier_type,
    identifier_value,
    email,
    crm_contact_id,
    identified_at
FROM all_identities
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY identifier_type, identifier_value
    ORDER BY identified_at DESC
) = 1

The QUALIFY handles the case where the same identifier maps to multiple CRM contacts over time (e.g., a shared device). Taking the most recent mapping is the safest default.

This model uses merge as the incremental strategy because identity mappings are relatively small (at most one row per known identity pair) but update frequently as new form submissions and logins create new mappings.

Open-source tooling

For teams needing a more complete identity graph beyond simple deterministic matching:

RudderStack’s dbt-id-resolution builds iterative identity graphs using incremental models. It handles transitive identity (if A = B and B = C, then A = C) which is hard to implement correctly in pure SQL.
Hightouch and Brooklyn Data advocate a simpler warehouse-native pattern with two core tables: events (every observed identifier pair) and user_identities (resolved canonical IDs).
Splink is a Python library for probabilistic record linkage that can run against BigQuery. Use it when deterministic matching coverage is genuinely insufficient and you have the capacity to review false positives.

The choice between these and a hand-built solution depends on the complexity of your identity graph. If you have two or three identifier types and clean join keys, the hand-built approach in the identity mapping table above is simpler and more maintainable. If you have five or more identifier types with transitive relationships, a dedicated package saves significant effort.

What makes this hard

Identity resolution sounds like a join problem, but it’s really a data quality problem. The challenges that trip teams up:

Email normalization: jane@acme.com vs Jane@Acme.com vs jane@acme.com (trailing space). Normalize early.
Shared devices: Two people using the same browser create ambiguous user_pseudo_id mappings. The most-recent mapping is a heuristic, not a solution.
CRM data hygiene: Duplicate contacts in the CRM produce duplicate matches. Clean the CRM first, or build deduplication into your identity layer.
Consent: Linking a cookie identifier to a CRM contact is a processing activity under GDPR. See Privacy Constraints for Linked Analytics Data for the compliance implications.

The identity mapping table is the foundation for a Customer 360 model. Errors in identity resolution propagate into every metric built on top of it.