Adrienne Vermorel

Merge vs. Delete+Insert vs. Insert_Overwrite: Choosing the Right dbt Strategy

Merge is the default incremental strategy in dbt. It’s also the strategy most likely to blow up your warehouse bill once tables grow past 100 million rows.

Merge often does more work than you need. When you’re replacing entire partitions of event data, row-by-row comparison is wasteful. When your unique key guarantees no duplicates exist, a full MERGE statement is overkill.

Choosing the right strategy can cut run times significantly and reduce scan costs by orders of magnitude on large tables.

This guide breaks down how each strategy works, when each makes sense, and what to watch for on BigQuery, Snowflake, and Databricks.

Strategy Availability by Warehouse

Not every strategy works on every warehouse. Here’s what you can actually use:

| Strategy | BigQuery | Snowflake | Databricks |
|---|---|---|---|
| append | ❌ | ✅ | ✅ |
| merge | ✅ (default) | ✅ (default) | ✅ (default) |
| delete+insert | ✅ (v1.11+) | ✅ | ❌ |
| insert_overwrite | ✅ | ⚠️ Full table only | ✅ |
| replace_where | ❌ | ❌ | ✅ (Delta only) |

The Snowflake limitation on insert_overwrite is the one that catches people. Unlike BigQuery and Databricks, Snowflake’s insert_overwrite replaces the entire table, not specific partitions. For partition-level replacement on Snowflake, use delete+insert instead.

How Each Strategy Works

Merge

Merge compares every incoming row against the destination table using your unique_key. Matching rows get updated. Non-matching rows get inserted.

MERGE INTO target AS DBT_INTERNAL_DEST
USING staging AS DBT_INTERNAL_SOURCE
ON DBT_INTERNAL_DEST.id = DBT_INTERNAL_SOURCE.id
WHEN MATCHED THEN UPDATE SET col1 = DBT_INTERNAL_SOURCE.col1, ...
WHEN NOT MATCHED THEN INSERT (col1, col2) VALUES (...)

The generated SQL scans the entire destination table unless you add incremental_predicates to limit the comparison window.

Strengths: Handles mixed inserts and updates cleanly. Atomic operation. Simplest mental model.

Weaknesses: Full table scans get expensive fast. Performance degrades significantly past 100M rows.

Delete+Insert

Delete+insert runs two separate operations: first delete all matching records, then insert the new batch.

-- Step 1: Remove existing records
DELETE FROM target
USING tmp
WHERE tmp.unique_key = target.unique_key
-- Step 2: Insert fresh data
INSERT INTO target SELECT * FROM tmp

The two-step nature creates a brief intermediate state. If the process fails between delete and insert, you’ll have missing data until the next successful run.

Strengths: Faster than merge at scale (3.4x faster at 500M rows on Snowflake). Works well for batch replacements.

Weaknesses: Not atomic. Requires unique_key. Creates intermediate states on failure.

Insert_Overwrite

Insert_overwrite replaces entire partitions rather than comparing individual rows. On BigQuery, it identifies which partitions contain new data, then atomically swaps them out.

-- BigQuery determines affected partitions
-- BigQuery determines affected partitions
DECLARE dbt_partitions_for_replacement ARRAY<DATE>;
SET (dbt_partitions_for_replacement) = (
    SELECT AS STRUCT ARRAY_AGG(DISTINCT DATE(partition_col))
    FROM source_table
);

-- Atomic partition replacement
MERGE INTO target USING staging ON FALSE
WHEN NOT MATCHED BY SOURCE
    AND partition_col IN UNNEST(dbt_partitions_for_replacement)
    THEN DELETE
WHEN NOT MATCHED THEN INSERT ...

Strengths: Extremely efficient for time-partitioned data. No row comparison overhead. Handles late-arriving data naturally within overwritten partitions.

Weaknesses: Requires partitioning. Can’t update specific rows within a partition. Behavior differs significantly across warehouses.

Replace_Where (Databricks Only)

Replace_where is Databricks’ predicate-based alternative to partition overwrite. Instead of replacing partitions, it replaces all rows matching a condition.

INSERT INTO table REPLACE WHERE date_col >= '2024-01-01'

Strengths: More flexible than partition-based overwrite. Works on SQL Warehouses. Atomic.

Weaknesses: Databricks only. Requires Delta Lake.

Warehouse-Specific Behaviors

BigQuery

Merge quirks:

  • Requires unique_key. Without it, merge fails (not append behavior).
  • Does not support dynamic partition pruning. Subqueries in predicates won’t trigger pruning.
  • Full table scans are the norm unless you add incremental_predicates or cluster your table.

Insert_overwrite strengths:

  • Requires a partitioned table (set via the partition_by config)
  • Static partitions (explicit list) are faster than dynamic partition discovery
  • The _dbt_max_partition variable is BigQuery scripting, not Jinja (it runs at query time)
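
For the dynamic case, a minimal sketch (reusing the event model from the configuration examples below): without a static partitions list, dbt declares _dbt_max_partition from the existing table and you filter the source against it.

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'}
) }}

SELECT
    event_id,
    event_date,
    event_name,
    user_id,
    event_properties
FROM {{ ref('base__segment__events') }}

{% if is_incremental() %}
-- _dbt_max_partition is a BigQuery scripting variable dbt declares at run time;
-- it holds the max event_date already present in the destination table
WHERE event_date >= _dbt_max_partition
{% endif %}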

BigQuery gotcha: You can’t use require_partition_filter=true with merge unless you also set incremental_predicates. The merge statement doesn’t know which partitions to touch.
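
A hedged sketch of a merge config that satisfies both settings at once (same events model as the examples below; the 7-day and 3-day windows are illustrative):

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='event_id',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    require_partition_filter=true,
    -- gives the MERGE an explicit partition filter on the destination side
    incremental_predicates=[
        "DBT_INTERNAL_DEST.event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)"
    ]
) }}

SELECT event_id, event_date, event_name, user_id, event_properties
FROM {{ ref('base__segment__events') }}

{% if is_incremental() %}
-- filtering on CURRENT_DATE() avoids an unfiltered query against {{ this }},
-- which require_partition_filter would reject
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
{% endif %}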

Cost impact: Teams switching from merge to insert_overwrite with static partitions report 100-200x cost reductions on large tables.

Snowflake

Merge behavior:

  • Default strategy, standard MERGE INTO syntax
  • Fails with “nondeterministic merge” if your unique_key isn’t truly unique (see the dedup sketch after this list)
  • Performance degrades noticeably past 100M rows
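
One way to guard against the nondeterministic-merge error is to deduplicate inside the model before dbt runs the MERGE — a minimal sketch, assuming updated_at identifies the latest version of each order_id:

SELECT *
FROM {{ ref('base__shopify__orders') }}
-- QUALIFY keeps only the newest row per key, so the MERGE sees each order_id exactly once
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) = 1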

Delete+insert advantages:

  • 3.4x faster than merge at 500M+ rows in benchmarks
  • Requires tmp_relation_type: table when unique_key is defined
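
As a sketch, the Snowflake delete+insert example from later in this guide with that setting added:

{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='delete+insert',
    -- materialize the incoming batch as a table rather than a view before the delete/insert steps
    tmp_relation_type='table'
) }}

SELECT order_id, customer_id, order_date, updated_at, total_amount
FROM {{ ref('base__shopify__orders') }}

{% if is_incremental() %}
WHERE updated_at >= (SELECT MAX(updated_at) - INTERVAL '3 days' FROM {{ this }})
{% endif %}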

Critical limitation: Insert_overwrite on Snowflake replaces the entire table, not partitions. This makes it useless for incremental processing. If you need partition-level replacement on Snowflake, use delete+insert with date predicates in your WHERE clause.

Databricks

Merge on Delta Lake:

  • Uses ACID-compliant MERGE INTO
  • Supports schema evolution via merge_with_schema_evolution=true
  • Advanced options like matched_condition and not_matched_by_source_action='delete' for CDC patterns
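
A hedged sketch of the schema-evolution option on an orders model (the exact option set depends on your CDC pattern; treat this as illustrative, not a drop-in recipe):

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id',
    -- lets the MERGE add columns that appear in the source to the Delta target automatically
    merge_with_schema_evolution=true
) }}

SELECT order_id, customer_id, order_date, updated_at, total_amount
FROM {{ ref('base__shopify__orders') }}

{% if is_incremental() %}
WHERE updated_at >= CURRENT_DATE - INTERVAL 3 DAYS
{% endif %}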

Insert_overwrite specifics:

  • Uses dynamic partition overwrite by default
  • A v1.8.0 regression caused full table replacement instead of partition replacement. Fix by setting spark.sql.sources.partitionOverwriteMode=DYNAMIC
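
If you can’t change the cluster configuration, one possible per-model workaround is a pre-hook (a sketch; whether a session-level SET is honored depends on your compute):

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by=['event_date'],
    -- pin dynamic partition overwrite for this model's session
    pre_hook="SET spark.sql.sources.partitionOverwriteMode = DYNAMIC"
) }}

SELECT event_id, event_date, event_name, user_id, event_properties
FROM {{ ref('base__segment__events') }}

{% if is_incremental() %}
WHERE event_date >= CURRENT_DATE - INTERVAL 3 DAYS
{% endif %}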

Replace_where: Often the best choice for date-bounded updates on SQL Warehouses where partition behavior is less predictable.

Performance Data

Real benchmarks from production systems:

| Scenario | Merge | Better Alternative | Improvement |
|---|---|---|---|
| 500M rows, Snowflake | 44 min | 13 min (delete+insert) | 3.4x |
| 192 GB partition, BigQuery | 43 min | 26 min (insert_overwrite) | 1.7x |
| Full vs. incremental scan, BigQuery | 24.6 GB | 500 MB | 50x data reduction |
| With incremental_predicates | 9.5 s | 4 s | 2.4x |

Merge works fine at smaller scales but becomes a bottleneck once tables grow.

Configuration Examples

Merge with Predicates (Large Tables)

{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge',
    incremental_predicates=[
        "DBT_INTERNAL_DEST.event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)"
    ]
) }}

SELECT
    event_id,
    event_date,
    event_name,
    user_id,
    event_properties
FROM {{ ref('base__segment__events') }}

{% if is_incremental() %}
WHERE event_date >= (SELECT MAX(event_date) - INTERVAL 3 DAY FROM {{ this }})
{% endif %}

The incremental_predicates filter the destination table during the merge. This enables partition pruning and avoids full table scans.

Delete+Insert on Snowflake

{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='delete+insert'
) }}

SELECT
    order_id,
    customer_id,
    order_date,
    updated_at,
    total_amount
FROM {{ ref('base__shopify__orders') }}

{% if is_incremental() %}
WHERE updated_at >= (SELECT MAX(updated_at) - INTERVAL '3 days' FROM {{ this }})
{% endif %}

Insert_Overwrite with Static Partitions (BigQuery)

{% set partitions_to_replace = [
    'CURRENT_DATE()',
    'DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)',
    'DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)'
] %}

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    partitions=partitions_to_replace
) }}

SELECT
    event_id,
    event_date,
    event_name,
    user_id,
    event_properties
FROM {{ ref('base__segment__events') }}

{% if is_incremental() %}
WHERE event_date IN ({{ partitions_to_replace | join(',') }})
{% endif %}

Static partitions are faster than dynamic because BigQuery doesn’t need to query the source table to determine which partitions to replace.

Replace_Where on Databricks

{{ config(
    materialized='incremental',
    incremental_strategy='replace_where',
    incremental_predicates=["event_date >= CURRENT_DATE - INTERVAL 3 DAY"]
) }}

SELECT
    event_id,
    event_date,
    event_name,
    user_id,
    event_properties
FROM {{ ref('base__segment__events') }}
WHERE event_date >= CURRENT_DATE - INTERVAL 3 DAY

Decision Framework

Use merge when:

  • Tables are under 100M rows
  • You need to update specific rows based on changing source data
  • Simplicity matters more than optimization
  • You’re handling CDC with mixed inserts, updates, and deletes

Use delete+insert when:

  • Tables exceed 100M rows on Snowflake
  • Your updates are batch-oriented (replacing all records for a time window)
  • You can tolerate brief intermediate states during the two-step operation

Use insert_overwrite when:

  • Tables are time-partitioned (BigQuery, Databricks)
  • You’re processing event or fact data where entire days/partitions get reprocessed
  • Late-arriving data falls within your overwrite window
  • Cost optimization is a priority

Use replace_where when:

  • You’re on Databricks with Delta Lake
  • You need predicate-based replacement without partition dependencies
  • SQL Warehouse behavior with insert_overwrite is unpredictable

Use append when:

  • Data is truly append-only (logs, events with no updates)
  • You handle deduplication downstream
  • You’re on Snowflake or Databricks (BigQuery doesn’t support it)

Recommendations by Warehouse

BigQuery: Default to insert_overwrite for fact tables with time partitions. Use static partition lists when possible. Fall back to merge with incremental_predicates for dimension tables requiring row-level updates.

Snowflake: Start with merge for tables under 100M rows. Switch to delete+insert for larger tables or batch-oriented updates. Avoid insert_overwrite entirely; it doesn’t do what you’d expect.

Databricks: Merge works well with Delta Lake’s optimizations. Consider replace_where for cleaner predicate-based updates. Use insert_overwrite for partition-aligned workloads, but verify partitionOverwriteMode is set correctly.

What to Watch For

Merge at scale: Monitor query costs and run times. If a model that used to take 5 minutes now takes 30, your table probably outgrew the merge strategy.

Unique key integrity: Nulls in unique_key columns cause match failures and duplicates. Composite keys need all columns to be non-null.
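
If parts of a composite key can be null, one common guard (assuming the dbt_utils package is installed; the order-lines model and columns here are hypothetical) is a surrogate key that coalesces its parts before hashing:

SELECT
    -- hashes the coalesced parts, so a NULL component no longer breaks merge matching
    {{ dbt_utils.generate_surrogate_key(['order_id', 'order_line_number']) }} AS order_line_key,
    order_id,
    order_line_number,
    updated_at
FROM {{ ref('base__shopify__order_lines') }}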

Schema changes: Adding columns with on_schema_change='append_new_columns' works for new data but won’t backfill historical records. Plan accordingly.
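
A minimal sketch of that setting on the events model used in the examples above:

{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge',
    -- new source columns get added to the target; existing rows stay NULL until backfilled
    on_schema_change='append_new_columns'
) }}

SELECT event_id, event_date, event_name, user_id, event_properties
FROM {{ ref('base__segment__events') }}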

Late-arriving data: Insert_overwrite handles late data within overwritten partitions. It does not catch records arriving after the partition window closes. Periodic full refreshes remain necessary.

The right strategy depends on your data patterns, table sizes, and warehouse. Start simple with merge, measure your costs and run times, then optimize when the numbers justify the added complexity.