
RSS Feed Deduplication in n8n

How to prevent duplicate Notion pages when polling RSS feeds in n8n, using a Merge node configured as a left anti-join.

Planted
automation · data engineering

RSS polling workflows re-encounter previously processed articles on each run, since articles remain in feeds after processing. Without deduplication, running the workflow multiple times creates duplicate Notion pages.

The solution is deduplication via a left anti-join: fetch existing records, fetch new items, and pass through only the items that appear in the new batch but not in the existing records.

The join concept

A left anti-join is a set operation that returns the rows from the left table that have no matching row in the right table; in other words, exactly the left-side rows an inner join would discard.

In SQL terms:

-- Return only new articles (those not already in Notion)
SELECT feed.link
FROM rss_feed AS feed
LEFT JOIN existing_notion_articles AS existing
ON feed.link = existing.content_url
WHERE existing.content_url IS NULL

n8n doesn’t write SQL, but its Merge node can perform this exact operation visually.

The n8n Merge node setup

The workflow uses two nodes feeding into the Merge node:

Input 1 (left side): Get All Articles — queries your Notion content database for all existing RSS articles. Filters for type = "RSS" so only previously ingested articles are returned. Each record exposes a property_content_url field (the original article URL stored as a Notion property).

Input 2 (right side): The split RSS feed items from the current run. Each item has a link field from the RSS XML.

Merge node configuration:

Mode: Combine
Combine by: Matching rules
- Input 1 field: property_content_url
- Input 2 field: link
Join mode: Keep Non Matches
Output from: Input 2

Keep Non Matches with Output from: Input 2 means: give me the items from the RSS feed that did NOT match anything in Notion. These are new articles. Everything else gets dropped.
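The same anti-join can be sketched in plain code, which is also what you'd write in an n8n Code node if you wanted to replace the Merge node. This is a minimal sketch using the field names from this workflow (property_content_url on Input 1, link on Input 2):

```typescript
// Anti-join: keep Input 2 items whose link doesn't appear in Input 1.
type NotionArticle = { property_content_url: string };
type RssItem = { link: string; title?: string };

function antiJoin(existing: NotionArticle[], incoming: RssItem[]): RssItem[] {
  // Set of URLs already stored in Notion (Input 1)
  const existingUrls = new Set(existing.map((a) => a.property_content_url));
  // "Keep Non Matches" + "Output from: Input 2":
  // pass through only RSS items whose link is not in the set
  return incoming.filter((item) => !existingUrls.has(item.link));
}

const existing = [{ property_content_url: "https://example.com/a" }];
const incoming = [
  { link: "https://example.com/a", title: "Old post" },
  { link: "https://example.com/b", title: "New post" },
];

console.log(antiJoin(existing, incoming));
// Only the item with link "https://example.com/b" remains
```

Using a Set makes each lookup O(1), so the whole join is linear in the number of RSS items.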

The executeOnce flag

There’s a critical detail in the Get All Articles node: executeOnce: true.

Without this, n8n would query Notion once per RSS item as it processes the split feed. If you have 50 articles from 5 feeds, that’s 50 identical Notion queries. It’s wasteful, slow, and likely to hit API rate limits.

With executeOnce: true, the node runs once at the start and its results are cached for the entire batch. The Merge node then compares all incoming RSS items against that single snapshot in one pass.
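The flag lives on the node itself in the exported workflow JSON. A trimmed sketch of what that looks like (other fields such as parameters and credentials omitted, and the exact shape can vary between n8n versions):

```json
{
  "name": "Get All Articles",
  "type": "n8n-nodes-base.notion",
  "executeOnce": true
}
```

In the editor UI, this is the "Execute Once" toggle in the node's settings tab.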

This matters more than it looks. At scale (multiple feeds, many articles), the difference between O(n) API calls and O(1) is the difference between a workflow that finishes in seconds and one that times out.

Why URL is the right deduplication key

The workflow uses the article URL (link in RSS, property_content_url in Notion) as the deduplication key rather than title or publication date.

Titles can change — publishers sometimes update a headline after publishing. Dates can be ambiguous — pubDate isn’t always reliable, and the same article might appear in multiple feeds with different timestamps. The URL is stable. An article’s canonical URL doesn’t change.

The one edge case: tracking parameters. https://example.com/article?utm_source=feed and https://example.com/article?utm_source=twitter are technically different URLs pointing to the same content. If your RSS sources append tracking parameters, you'd need to strip query parameters before comparison. For most blog RSS feeds this isn't an issue; the links are clean.
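If you do need it, normalization can be as simple as dropping the query string and fragment before comparing. A sketch (the function name is mine, and note the caveat below):

```typescript
// Canonicalize a URL by dropping query parameters and fragments,
// so tracking variants of the same article dedupe to one key.
function canonicalUrl(raw: string): string {
  const url = new URL(raw);
  url.search = ""; // drop ?utm_source=... and friends
  url.hash = "";   // drop #fragment
  return url.toString();
}

console.log(canonicalUrl("https://example.com/article?utm_source=feed"));
// → "https://example.com/article"
```

The caveat: stripping all query parameters over-merges URLs where the query string is meaningful (e.g., ?p=123 on some CMSes). A safer variant removes only known tracking keys like utm_*.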

What passes through

After the Merge node, only articles that don’t yet exist in Notion continue downstream. Everything else is silently dropped — no error, no flag, just filtered out.

This makes the workflow naturally idempotent. You can run it multiple times in the same day (or re-run it after a failure) and it won’t create duplicates. Articles already in Notion are simply skipped.

Idempotency is the property you want in any scheduled pipeline. It means “run it again” is always safe, which makes debugging and recovery much simpler. The same principle appears in dlt Incremental Loading and Idempotent Incremental Models in dbt — different tools, same idea.

The full deduplication flow

RSS Feed Items (50 articles from 5 feeds)
        ↓
Split Out RSS Feed
        ↓
50 individual items, each with: title, link, pubDate, dc:creator
        ↓ (Input 2)
Merge node ←── (Input 1) Get All Articles from Notion
(left anti-join on URL)  (runs once, returns all existing URLs)
        ↓
Only truly new articles pass through
(e.g., 12 out of 50 if 38 already exist)

Limitations

The approach requires your Notion database to be the authoritative record of what’s been processed. If you delete records from Notion, those articles will be re-fetched on the next run. This is usually fine for a personal knowledge base, but worth knowing.

It also means the initial run — when Notion is empty — will pull in the full history of each RSS feed (however far back the feed goes, typically 10-50 articles per source). Most of the time this is desirable: you want a bulk import to seed the database. If you don’t want historical articles, you could add a pubDate filter to only process articles newer than a certain date.
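That pubDate filter can be sketched as a small pre-filter before the Merge node. The cutoff date here is an assumption for illustration, and pubDate matches the RSS item field described above:

```typescript
// Hypothetical seed-run filter: keep only items published after a cutoff,
// so an empty Notion database isn't flooded with feed history.
type DatedItem = { link: string; pubDate: string };

const CUTOFF = new Date("2024-01-01T00:00:00Z"); // illustrative cutoff

function onlyRecent(items: DatedItem[]): DatedItem[] {
  return items.filter((item) => new Date(item.pubDate) > CUTOFF);
}

const items = [
  { link: "https://example.com/old", pubDate: "Mon, 15 May 2023 10:00:00 GMT" },
  { link: "https://example.com/new", pubDate: "Tue, 05 Mar 2024 10:00:00 GMT" },
];

console.log(onlyRecent(items));
// Only the 2024 article survives
```

Since pubDate reliability varies by feed, it's worth treating this as a one-time seeding aid rather than a substitute for the URL-based dedup.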