Multi-Source Conflict Resolution

When you merge records from multiple source systems into a unified entity — a Customer 360 model, a product catalog from multiple vendors, a consolidated transaction ledger — you will encounter conflicting data. A contact’s job title might be “VP Sales” in Salesforce and “Vice President, Sales” in HubSpot. Their email might have been updated in the CRM but not in your product database. Their phone number might exist in one system and be null in another.

You need a strategy for which source wins, and you need that strategy to be explicit, consistent, and auditable. Implicit conflict resolution — whatever happens to come first in a join — produces unpredictable results and silent data quality degradation.

Three patterns handle the vast majority of cases.

Priority-based resolution

Pick from a ranked list of sources. The most authoritative system wins, regardless of when the data was updated.

COALESCE(crm.email, product.email, ga4.email) AS customer__email,
COALESCE(crm.company_name, product.company_name) AS customer__company_name,
COALESCE(crm.phone, product.phone) AS customer__phone

COALESCE returns the first non-null value. The order of arguments defines the priority: CRM data wins over product data, which wins over GA4 data.

When to use it: When one system is clearly the system of record for a given field. CRM is the source of truth for contact details. The billing system is the source of truth for payment information. The product database is the source of truth for feature usage. The priority should reflect the business process that maintains each field.

Strengths: Simple. Easy to understand, easy to audit, easy to debug. When a stakeholder asks “where does the email come from?” the answer is a one-line explanation: “CRM first, then product, then GA4.”

Weaknesses: A CRM email that was accurate two years ago wins over a product email updated yesterday. Priority ignores recency, which can be a problem for fields that change frequently.

Per-field priority

The priority order doesn’t have to be the same for every field. In practice, it rarely should be:

-- CRM is authoritative for contact details
COALESCE(crm.email, product.email) AS customer__email,
COALESCE(crm.company_name, product.company_name) AS customer__company_name,

-- Product is authoritative for usage data
COALESCE(product.last_login_at, crm.last_activity_at) AS customer__last_active_at,

-- Billing is authoritative for payment data
COALESCE(billing.plan_name, product.plan_name) AS customer__plan_name

Document these per-field priorities in your model’s YAML description or in a dbt docs block. Future maintainers (including future you) need to understand why the CRM wins for email but the product wins for activity timestamps.

Recency-based resolution

Take whichever value was updated most recently, regardless of source. The assumption is that the most recent update to a field is the most accurate.

CASE
    WHEN crm.updated_at >= COALESCE(product.updated_at, TIMESTAMP('1970-01-01'))
    THEN crm.job_title
    ELSE product.job_title
END AS customer__job_title

When to use it: For fields where the most recent update is likely the most accurate — job title, phone number, address, company name. These change over time, and the system that captured the change most recently is probably right.

Strengths: Adapts automatically. If a contact updates their job title in your product’s profile settings, you don’t need a CRM sync to pick up the change. The most recent value wins.

Weaknesses: Requires updated_at timestamps from each source, and those timestamps need to be reliable. If one system uses the record creation date instead of the last modification date, recency comparison is meaningless. Also more complex than COALESCE, especially when comparing across three or more sources.

Scaling to multiple sources

For more than two sources, a pattern using ARRAY_AGG with ordering avoids deeply nested CASE statements:

(
    SELECT value
    FROM UNNEST([
        STRUCT(crm.job_title AS value, crm.updated_at AS updated_at),
        STRUCT(product.job_title AS value, product.updated_at AS updated_at),
        STRUCT(billing.job_title AS value, billing.updated_at AS updated_at)
    ])
    WHERE value IS NOT NULL
    ORDER BY updated_at DESC
    LIMIT 1
) AS customer__job_title

This is BigQuery-specific syntax. On Snowflake, you’d use a lateral flatten or a CTE with ROW_NUMBER. The principle is the same: collect all non-null values with their timestamps, pick the most recent.

Source-specific fields

Sidestep the conflict entirely. Instead of picking one email, keep them all:

crm.email AS customer__crm_email,
product.email AS customer__product_email,
ga4.email AS customer__ga4_email

When to use it: When downstream consumers need to know the provenance of a value. Reverse ETL to the CRM should use the CRM email, not a product email that might differ. An email marketing platform might want to try all known addresses. A data quality dashboard might flag records where emails disagree.

Strengths: No information loss. Every source’s value is preserved. Consumers make their own choice about which to use.

Weaknesses: Makes the table wider. Three email columns instead of one. For a Customer 360 with five source systems and twenty conflicting fields, this produces a hundred columns. Also pushes the resolution decision to the consumer, which can lead to inconsistent downstream logic if different teams pick different columns.

A pragmatic hybrid

In practice, most teams combine patterns. Use a resolved primary field (priority or recency-based) alongside source-specific fields for provenance:

-- The resolved field for general use
COALESCE(crm.email, product.email) AS customer__email,

-- Source fields for provenance when needed
crm.email AS customer__crm_email,
product.email AS customer__product_email

This gives downstream consumers a sensible default while preserving the ability to override when they have specific requirements.

Choosing a pattern by field type

Field type	Recommended pattern	Rationale
Email, phone	Priority-based	System of record is usually clear
Job title, company	Recency-based	Changes over time; most recent is usually correct
Address	Recency-based	Same as job title
Plan/subscription	Priority-based (billing system)	Financial system is authoritative
Last activity date	Recency-based (latest across sources)	Use `GREATEST()` across all timestamps
Usage metrics	Source-specific	Different systems measure different things
Revenue/financial data	Priority-based (billing system)	One financial system of record

Surrogate keys for multi-source entities

When building unified entities from multiple sources, the surrogate key should encode both the source system and the source ID:

{{ dbt_utils.generate_surrogate_key(['source_system', 'source_id']) }} AS customer__surrogate_key

This prevents collisions between systems that might reuse the same internal IDs (both systems might have a contact with ID 12345). Resolve surrogate keys to a canonical customer_id through the identity bridge.

Documentation and testing

Conflict resolution logic is easy to get right initially and easy to break during refactoring. Protect it with:

YAML descriptions on each resolved field that document the priority order or resolution strategy
dbt tests that verify the resolved field is never null when at least one source has a value (catches COALESCE ordering bugs)
Data quality checks that flag records where sources disagree, so you can monitor how often conflicts actually occur and whether your resolution strategy produces sensible results

The resolution strategy should be a conscious, documented decision — not an artifact of join order.