CRM data is among the most requested and least trusted data in the warehouse. CRM records are mutable, relationships are complex, and the extraction layer introduces its own quirks. These challenges shape every downstream modeling decision.
Mutability Is the Core Problem
Most warehouse data arrives as immutable events. A pageview happened. A transaction completed. You append it and move on. CRM data doesn’t work that way. An opportunity changes stage five times before closing. A contact’s email gets updated. A deal amount gets revised the day before close. Every record is a moving target.
This means your warehouse never has a “final” version of a CRM record until the record is truly closed or archived. A pipeline snapshot taken at 9 AM may show different data than one taken at 3 PM for the same record. If your stakeholders ask “what was the pipeline worth last Tuesday?”, you need historical tracking to answer, not just the current state.
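One way to capture that history is a dbt snapshot, which versions each record as it changes. A minimal sketch, assuming Fivetran lands Salesforce's SystemModStamp as a `system_mod_stamp` column (adjust names to your schema):

```sql
{% snapshot salesforce_opportunity_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='timestamp',
        updated_at='system_mod_stamp'
    )
}}

-- On each run, dbt compares system_mod_stamp and versions changed rows,
-- giving you the history needed to answer "what was pipeline worth last Tuesday?"
SELECT * FROM {{ source('salesforce', 'opportunity') }}

{% endsnapshot %}
```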
Mutability also means incremental extraction is harder. You can’t simply look for new records — you need to detect changes to existing records. That’s what SystemModStamp does in Salesforce and what HubSpot approximates through a combination of mechanisms.
API-Based Extraction vs CDC
Unlike database sources where you can tap into a binlog for change data capture (CDC), CRM systems expose data through APIs. This creates a fundamentally different extraction model.
Fivetran’s Salesforce connector polls using the SystemModStamp field for incremental syncs. Every Salesforce record has this field, and it updates whenever the record is modified through the UI, API, or automation. The connector queries for records where SystemModStamp > last_sync_time to find what changed.
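Conceptually, the incremental pull looks like this SOQL sketch (the timestamp is a placeholder for the connector's stored cursor, not a real value from your org):

```sql
-- Fetch only records modified since the last successful sync
SELECT Id, Name, StageName, Amount, SystemModStamp
FROM Opportunity
WHERE SystemModStamp > 2024-06-01T09:00:00Z
ORDER BY SystemModStamp ASC
```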
HubSpot lacks true CDC entirely. Fivetran uses a mix of webhooks for deletes, daily API calls, and inference to approximate it. This means HubSpot data in your warehouse can lag further behind reality than Salesforce data, and the lag varies by object type and sync configuration.
The practical impact: don’t assume your CRM data reflects real-time state. Build in expectations of latency when designing dashboards and reporting logic.
Soft Deletes
When a CRM record gets deleted, it doesn’t vanish from your warehouse. Fivetran marks deleted records with _fivetran_deleted = TRUE rather than removing them. This is by design — hard deleting from the warehouse would lose audit trails and break historical analyses.
But the deletion timelines differ by system:
- Salesforce hard-deletes records 15 days after they hit the Recycle Bin. After that, even Fivetran can’t see them.
- HubSpot retains deleted records for 90 days before permanent removal.
Your base models need to filter soft-deleted records consistently:
```sql
SELECT
    id AS opportunity__id,
    name AS opportunity__name,
    stage_name AS opportunity__stage,
    amount AS opportunity__amount
FROM {{ source('salesforce', 'opportunity') }}
WHERE NOT _fivetran_deleted
```

This filter belongs in every base model for every CRM source table. Miss it in one model, and you’ll have ghost records polluting downstream metrics — closed-lost deals appearing in pipeline counts, deactivated contacts showing in active user reports.
For historical analysis, you may want a separate model that includes deleted records with a flag. But the default should always be to exclude them.
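A hedged sketch of such a model, exposing the flag instead of filtering on it:

```sql
-- Includes soft-deleted rows; downstream models must handle the flag explicitly
SELECT
    id AS opportunity__id,
    name AS opportunity__name,
    stage_name AS opportunity__stage,
    amount AS opportunity__amount,
    _fivetran_deleted AS opportunity__is_deleted
FROM {{ source('salesforce', 'opportunity') }}
```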
Formula Field Blind Spots
This one catches teams off guard. In Salesforce, formula field changes don’t update SystemModStamp. A formula field is computed dynamically — it doesn’t “change” in the database; it recalculates on access. But since SystemModStamp is what Fivetran uses to detect changes, formula field values in your warehouse can be stale.
Example: you have a formula field days_since_last_activity on the Account object. The underlying last_activity_date changes, the formula recalculates in the Salesforce UI, but SystemModStamp doesn’t update. Fivetran’s incremental sync misses it. Your warehouse shows yesterday’s value.
The dbt_salesforce_formula_utils package exists specifically for this problem. It recomputes formula field logic inside your dbt project rather than relying on extracted values. This is the recommended approach for any formula fields used in reporting.
Alternatively, you can configure Fivetran to run periodic full syncs for objects with critical formula fields. The tradeoff is higher API usage and longer sync times, but guaranteed accuracy.
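For simple formulas, a third option is to recompute the logic directly in a dbt model. A sketch for the days_since_last_activity example above — the DATEDIFF syntax is Snowflake's, and the column names are assumptions about your Fivetran schema:

```sql
-- Recompute the formula from its underlying field at query time,
-- so the value can never go stale between syncs
SELECT
    id AS account__id,
    DATEDIFF('day', last_activity_date, CURRENT_DATE) AS account__days_since_last_activity
FROM {{ source('salesforce', 'account') }}
WHERE NOT _fivetran_deleted
```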
Rate Limits
CRM APIs enforce strict rate limits that constrain how much data you can extract and how frequently.
Salesforce allows 100,000 API requests per 24 hours on Enterprise edition, plus 1,000 per user license. A large org with 200 users gets 300,000 requests/day. This sounds generous until you factor in that every integration — marketing automation, data syncing, third-party apps — draws from the same pool. Monitor your API usage in Setup > System Overview; if you’re regularly hitting 80%+, extraction frequency will suffer.
HubSpot caps at 190 requests per 10 seconds, with the CRM Search API hard-limited to 10,000 results per query. For large HubSpot instances (100k+ contacts), this means extraction takes longer because the connector must paginate through results in smaller batches.
These limits affect:
- Sync frequency: How often you can pull fresh data
- Full refresh cost: A full re-sync of a large Salesforce org can consume a significant portion of your daily API budget
- Concurrent integrations: Every tool hitting the API competes for the same rate limit pool
When designing your extraction schedule, factor in all API consumers, not just your data pipeline. A marketing automation tool that fires 50,000 API calls during a campaign launch can starve your Fivetran sync if they share the same rate limit.
Implications for Architecture
These challenges — mutability, API extraction, soft deletes, formula fields, rate limits — influence every design decision downstream:
- Use SCD Type 2 snapshots to track historical state, because the current record isn’t enough
- Filter _fivetran_deleted in every base model, because soft deletes are universal
- Don’t trust formula field values from extraction; recompute them in dbt
- Design your sync schedule around rate limits, and monitor API usage across all consumers
- Build dashboards that acknowledge data latency rather than implying real-time accuracy
CRM data requires more defensive engineering than most sources to produce trustworthy pipeline and customer metrics.