Adrienne Vermorel
n8n RSS to Notion
If you are like me, you probably maintain reading lists, bookmark interesting articles, and try to stay current with industry developments.
But manually collecting content from multiple RSS feeds, cleaning up the cruft, and organizing it into a knowledge base can be quite time-consuming and a tad overwhelming at times.
To fight the overwhelm, I have come up with an automation solution that fetches RSS articles, cleans them up using ChatGPT and saves them as Notion pages.
My workflow is built with n8n, an open-source workflow automation tool that lets you connect different services through visual workflows. Think of it as a self-hosted alternative to Zapier or Make, but with more flexibility and no per-task pricing. You define workflows as nodes connected by edges, where each node performs a specific operation—from making HTTP requests to transforming data with JavaScript.
We'll also briefly touch on Notion, which serves as both our configuration database (where we list RSS sources) and our content repository (where cleaned articles are stored). If you're not familiar with Notion, it's essentially a flexible workspace that combines documents, databases, and kanban boards into one tool.
The Problem I'm Solving
RSS feeds remain one of the best ways to aggregate content from blogs, newsletters, and news sites. However, they come with several challenges:
- Content fragmentation: You need to check multiple sources across different readers. There are paid RSS readers, of course, but the content stays locked in their app and customization is limited.
- HTML noise: Articles include navigation menus, cookie banners, newsletter CTAs, social sharing buttons, and footer content
- Poor formatting: Raw RSS content often doesn't render well when saved directly
- No deduplication: The same article might appear multiple times if you refresh feeds
- Manual effort: Saving interesting articles for later requires manual copy-paste workflows
This workflow solves all of these problems through automation, creating a "save for later" system that runs on autopilot.
Architecture Overview
The workflow consists of 14 nodes organized into four main stages:
- Source configuration & triggering (Nodes 1-2)
- RSS feed fetching & deduplication (Nodes 3-8)
- Content extraction & cleaning (Nodes 9-13)
- Notion page creation (Node 14)
Here's the high-level flow:
Trigger → Fetch RSS Sources → Get RSS Feed → Parse XML → Split Items
↓
Filter Existing Articles (Merge) → Create Notion Pages → Extract Page IDs
↓
Fetch Article Content (Jina) → Prepare ChatGPT Prompt → Clean with LLM
↓
Convert Markdown to Notion Blocks → Append to Notion Page
Let's examine each stage in detail.
Stage 1: Triggering the Workflow
The workflow can be triggered in two ways:
Manual Trigger
The When clicking 'Execute workflow' node allows you to run the workflow on demand. This is useful for:
- Testing changes to the workflow
- Doing an initial bulk import of articles
- Force-fetching new content outside the schedule
Scheduled Trigger
The Schedule Trigger node runs the workflow automatically every day at 5:00 AM. This timing works well because:
- Most blogs and news sites publish content during business hours
- Running early morning captures yesterday's content before I start my workday
- It avoids peak API usage times for some external services
The scheduled trigger uses a cron-like syntax: triggerAtHour: 5 means "run once per day at 5 AM in your configured timezone."
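For reference, here is that setting alongside the equivalent standard cron expression (the exact parameter layout in the workflow JSON may differ slightly):

triggerAtHour: 5              // n8n Schedule Trigger: run daily at this hour
// equivalent cron expression: 0 5 * * *  (minute 0, hour 5, every day)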
Stage 2: RSS Source Management and Deduplication
Getting RSS Sources from Notion
The Get many sources from monitoring node queries a Notion database that serves as your RSS source configuration. This is handy because:
- The RSS sources are stored in a database I can edit easily
- I can add metadata to sources (categories, priority, etc.)
- The query filters for type = "RSS", meaning I could store other content types (podcasts, newsletters) in the same database
The node configuration shows it's querying a specific database URL and returning all records where the type property equals "RSS". Each record must include a property called rss_link containing the feed URL.
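For context, the query the node runs is roughly equivalent to this Notion API call (a sketch, assuming type is a select property in your sources database):

POST https://api.notion.com/v1/databases/<sources-database-id>/query
{
  "filter": {
    "property": "type",
    "select": { "equals": "RSS" }
  }
}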
Fetching the RSS Feed
The Fetch RSS Feed node makes an HTTP GET request to each RSS feed URL. Note the important headers:
"User-Agent": "Mozilla/5.0"
"Accept": "application/rss+xml, application/xml;q=0.9, */*;q=0.8"
These headers are necessary because:
- Some servers block requests without a User-Agent (they assume it's a bot)
- The Accept header explicitly requests RSS/XML format, though most feeds will return it anyway
The response format is set to text because we need the raw XML before parsing.
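Outside n8n, the same request would look roughly like this illustrative sketch, where feedUrl stands for the rss_link value from the sources database:

// Illustrative equivalent of the Fetch RSS Feed node
const feedUrl = 'https://example.com/feed.xml';   // the rss_link value
const res = await fetch(feedUrl, {
  headers: {
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'application/rss+xml, application/xml;q=0.9, */*;q=0.8'
  }
});
const xml = await res.text();   // raw XML, parsed by the next node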
Parsing XML to JSON
The XML to JSON node converts the RSS feed XML into a JSON structure that's easier to work with in subsequent nodes. RSS feeds typically have a structure like:
<rss>
<channel>
<item>
<title>Article Title</title>
<link>https://example.com/article</link>
<description>Article description</description>
<pubDate>Thu, 14 Nov 2024 10:00:00 GMT</pubDate>
<dc:creator>Author Name</dc:creator>
</item>
</channel>
</rss>
After XML parsing, this becomes a nested JSON object.
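The exact shape depends on the parser's options, but you end up with something roughly like:

{
  "rss": {
    "channel": {
      "title": "Example Blog",
      "item": [
        {
          "title": "Article Title",
          "link": "https://example.com/article",
          "description": "Article description",
          "pubDate": "Thu, 14 Nov 2024 10:00:00 GMT",
          "dc:creator": "Author Name"
        }
      ]
    }
  }
}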
Splitting Out Individual Articles
The Split Out RSS Feed node takes the array at rss.channel.item and creates one output item per article. This is essential because:
- Each article needs to be processed independently
- You need to check each article individually for duplicates
- Each article becomes its own Notion page
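If you prefer code over the dedicated node, an n8n Code node doing the same thing is a two-liner (illustrative sketch, not the node the workflow uses):

// Emit one n8n item per RSS <item> entry, equivalent to the Split Out node
const entries = $input.first().json.rss.channel.item;
return entries.map(entry => ({ json: entry }));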
Deduplication Logic
This is where the workflow gets sophisticated. The Get All Articles node fetches all existing RSS articles from your Notion content database (where type = "RSS"). Then the Merge node performs a left anti-join:
- Input 1 (left): Existing articles from Notion with their property_content_url
- Input 2 (right): New articles from the RSS feed with their link
- Merge by: property_content_url = link
- Join mode: keepNonMatches (only keep items from input 2 that don't match input 1)
- Output from: input2
The result? Only articles that don't already exist in your Notion database get passed through to the next stage. This prevents duplicates and saves API calls.
The configuration executeOnce: true on the Get All Articles node is crucial—it ensures the existing articles are fetched once for the entire batch, not once per RSS item.
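To make the join logic concrete, here is an illustrative Code-node version of the same anti-join (the workflow itself uses the Merge node, not this code):

// Keep only RSS items whose link is not already stored in Notion
const existingUrls = new Set(
  $items('Get All Articles').map(i => i.json.property_content_url)
);
return $items('Split Out RSS Feed')
  .filter(i => !existingUrls.has(i.json.link))
  .map(i => ({ json: i.json }));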
Stage 3: Content Extraction and Cleaning
Creating Placeholder Notion Pages
The Create a database page node creates a new Notion page for each article with basic metadata:
- Title: {{ $json.title }} from the RSS item
- Author: Falls back through dc:creator, author, or "no author"
- Published At: {{ $json.pubDate }}
- RSS feed name: References back to the source from "Get many sources from monitoring"
- content_url: The article URL from {{ $json.link }}
- Type: Set to "RSS" for filtering later
- Icon: 📰 emoji for visual consistency
At this point, the pages exist but have no content—just metadata. The page ID is captured in the response.
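Under the hood, the Notion node issues a page-creation call along these lines (a sketch: property names and types must match your own database):

POST https://api.notion.com/v1/pages
{
  "parent": { "database_id": "<content-database-id>" },
  "icon": { "type": "emoji", "emoji": "📰" },
  "properties": {
    "Title":        { "title": [ { "text": { "content": "Article Title" } } ] },
    "Author":       { "rich_text": [ { "text": { "content": "Author Name" } } ] },
    "Published At": { "date": { "start": "2024-11-14T10:00:00Z" } },
    "content_url":  { "url": "https://example.com/article" },
    "Type":         { "select": { "name": "RSS" } }
  }
}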
Extracting Page IDs
The Set notion_page_id node creates a new field containing the Notion page ID:
{
"notion_page_id": "{{ $json.id }}"
}
This ID is crucial for the final step when we append the cleaned content back to the page.
Fetching Article Content with Jina AI
The Read URL content node uses Jina AI's Reader API, which is specifically designed to extract clean content from web pages. The configuration:
url: "{{ $items('Create a database page')[$itemIndex].json.property_content_url }}"
outputFormat: "markdown"
Jina AI returns the article content as markdown, which is perfect because:
- Markdown is a structured format that's easier to parse than HTML
- It preserves formatting (headers, lists, bold, italic) while discarding cruft
- It's the intermediate format we need before converting to Notion blocks
The node has retry logic configured:
retryOnFail: true
maxTries: 5
waitBetweenTries: 5000   // 5 seconds
This is important because web scraping can be flaky (timeouts, rate limits, temporary errors).
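Outside n8n, the same extraction can be reproduced against Jina's reader endpoint by prefixing the article URL with r.jina.ai (a sketch; header names reflect Jina's API at the time of writing):

// Fetch an article as markdown via Jina AI Reader (illustrative)
const res = await fetch('https://r.jina.ai/https://example.com/article', {
  headers: {
    'Authorization': 'Bearer <JINA_API_KEY>',   // may be optional at low volumes
    'X-Return-Format': 'markdown'
  }
});
const markdown = await res.text();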
Preparing the ChatGPT Prompt
The ChatGPT Prompt Preparation node builds the API request body using JavaScript. The prompt strategy is:
System message: Establishes the role and rules
You are a meticulous Markdown cleaner. Keep the main article text and structure
but remove navigation menus, cookie notices, newsletter CTAs, footers, share buttons,
related posts, and other site chrome. Preserve headings, paragraphs, code blocks,
lists, tables, and links (strip tracking params). Preserve the bold and italicization
markers. Replace the double return \n\n with a single return \n. Remove the weird
characters (like \_ or * * * alone in a single line). Return only the cleaned
Markdown with no commentary, and without wrapping it in a markdown block.
User message: Provides the task and content
Task: Clean the Markdown below according to the rules. Return ONLY the cleaned Markdown.
---BEGIN MARKDOWN---
${$input.item.json.content || ''}
---END MARKDOWN---
The prompt is designed to:
- Remove all non-article content (navigation, CTAs, etc.)
- Preserve semantic structure (headings, lists, code)
- Clean up formatting artifacts
- Strip tracking parameters from links
- Return raw markdown without code fences
The model used is gpt-4o-mini with temperature: 0.1 for consistent, deterministic results.
The code explicitly maintains the paired item relationship: pairedItem: 0, which is crucial for n8n to track which input corresponds to which output when processing multiple items.
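A simplified sketch of that Code node (assumed shape; the real node contains the full rule set shown above):

// Build the Chat Completions request body for the current item
const systemPrompt = 'You are a meticulous Markdown cleaner. ...';   // full rules as above
const userPrompt = [
  'Task: Clean the Markdown below according to the rules. Return ONLY the cleaned Markdown.',
  '---BEGIN MARKDOWN---',
  $input.item.json.content || '',
  '---END MARKDOWN---'
].join('\n');

return {
  json: {
    requestBody: JSON.stringify({
      model: 'gpt-4o-mini',
      temperature: 0.1,
      messages: [
        { role: 'system', content: systemPrompt },
        { role: 'user', content: userPrompt }
      ]
    })
  },
  pairedItem: 0   // preserve the input/output pairing
};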
Calling the ChatGPT API
The ChatGPT Markdown Cleaner node makes a POST request to OpenAI's Chat Completions API:
method: "POST"
url: "https://api.openai.com/v1/chat/completions"
authentication: "predefinedCredentialType"
nodeCredentialType: "openAiApi"
sendBody: true
specifyBody: "json"
jsonBody: "={{ $json.requestBody }}"
timeout: 500000 // 8+ minutes for slow responses
The high timeout is defensive—LLM APIs can occasionally be slow, especially for long articles. It is also one of the reasons I could not use n8n's native LLM nodes: their timeouts are relatively short, too short for my use case.
Stage 4: Converting to Notion Format
The Markdown-to-Notion-Blocks Parser
This is the most complex node in the workflow. Notion doesn't accept raw markdown—it requires content to be structured as an array of "block" objects. Each block represents a content element (paragraph, heading, list item, etc.) with its own type and properties.
The Markdown to notion blocks node contains ~400 lines of JavaScript that:
- Parses inline markdown formatting:
  - Bold: **text**
  - Italic: _text_
  - Links: [text](url)
  - Inline code: `code`
- Converts block-level elements:
  - Headers: # H1, ## H2, ### H3
  - Paragraphs: Regular text
  - Lists: - bullet or 1. numbered
  - Quotes: > quoted text
  - Code blocks: fenced ```language blocks
  - Images: 
  - Dividers: ---, ***, or ===
- Builds Notion rich_text objects: These are complex nested structures like:
  {
    type: "text",
    text: {
      content: "Hello world",
      link: { url: "https://example.com" }   // optional
    },
    annotations: {
      bold: true,
      italic: false,
      // ... other formatting
    }
  }
- Handles edge cases:
  - Headers beyond h3 (h4, h5, h6) become bold paragraphs since Notion only supports 3 heading levels
  - Empty quotes are skipped (they look awkward in Notion)
  - Consecutive plain text is merged for efficiency
  - Escaped underscores are unescaped: \_ → _
  - Linked images are unwrapped: [](link) → 
- Splits oversized content: Notion has a 2000-character limit per rich_text element. The splitLongBlocks function:
  - Detects blocks with text content exceeding 2000 characters
  - Splits them into multiple chunks
  - Preserves formatting and structure across chunks
  - Handles code blocks specially (splits by character position)
- Limits to 100 blocks: Notion's API has a 100-block limit per request. The workflow takes the first 100 blocks, which is usually sufficient for most articles.
The output includes metadata for debugging:
{
children: blocksToSend, // Array of Notion blocks
meta: {
model: "gpt-4o-mini",
created: "...",
total_blocks: 45,
split_blocks: 47, // After splitting long blocks
blocks_sent: 47
}
}
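To make the block structure concrete, here is a drastically simplified sketch of the idea (headings and paragraphs only; the real node also handles lists, quotes, code, images, rich-text formatting, and the length limits described above):

// Map markdown lines to Notion block objects (minimal illustration)
function markdownToBlocks(markdown) {
  return markdown
    .split('\n')
    .filter(line => line.trim() !== '')
    .map(line => {
      const heading = line.match(/^(#{1,3})\s+(.*)$/);
      const type = heading ? `heading_${heading[1].length}` : 'paragraph';
      const content = heading ? heading[2] : line;
      return {
        object: 'block',
        type,
        [type]: { rich_text: [{ type: 'text', text: { content } }] }
      };
    })
    .slice(0, 100);   // Notion accepts at most 100 blocks per append call
}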
Appending Content to Notion
The final HTTP Request node makes a PATCH request to Notion's "append block children" API:
method: "PATCH"
url: "https://api.notion.com/v1/blocks/{{ notion_page_id }}/children"
authentication: "predefinedCredentialType"
nodeCredentialType: "notionApi"
body: {
"children": "={{ $json.children }}" // Array of blocks
}
This adds all the content blocks to the Notion page we created earlier. The page now contains:
- Metadata in database properties (title, author, date, source, URL)
- Full article content with preserved formatting
- Clean structure without ads, navigation, or other cruft
Like the Jina AI node, this has retry logic to handle transient API failures.
Cost Considerations
Let's estimate the costs of running this workflow daily:
n8n
- Self-hosted: I pay $5/month on railway.com
- n8n Cloud: Starts at $20/month for 2,500 workflow executions
Jina AI Reader
- Free tier: 1 million tokens
- Typical article: ~2,000-5,000 tokens
- At 20 articles/day: ~60,000-150,000 tokens per day, so the free tier lasts a while. After that, it's $50 per 1 billion tokens, which should last a very long time.
OpenAI API (GPT-4o-mini)
- $0.150 per 1M input tokens
- $0.600 per 1M output tokens
- Input per article: ~3,000-8,000 tokens (system + user message)
- Output per article: ~2,000-5,000 tokens (cleaned markdown)
- At 20 articles/day × 30 days = 600 articles/month
- Input: ~3.6M tokens = $0.54
- Output: ~2.4M tokens = $1.44
- Total: ~$2/month
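The arithmetic behind those numbers, using the per-article averages implied above, in case you want to plug in your own volume:

// GPT-4o-mini cost estimate, assuming ~6k input / ~4k output tokens per article
const articles = 20 * 30;                              // 600 articles/month
const inputCost  = (articles * 6000 / 1e6) * 0.150;    // ≈ $0.54
const outputCost = (articles * 4000 / 1e6) * 0.600;    // ≈ $1.44
const total = inputCost + outputCost;                  // ≈ $1.98/month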
Notion
- Free tier: Fine for personal use
- Plus: $10/month
Total monthly cost: ~$7-32, depending on your n8n hosting choice. The LLM and content extraction costs are minimal. Overall, it is cheaper and more customizable than a paid RSS reading tool.
Conclusion
This workflow combines a few modern tools to automate something that was eating up time every day:
- RSS for decentralized content aggregation
- Jina AI for intelligent content extraction
- ChatGPT for content cleaning
- Notion for structured storage
- n8n for orchestration
I've built a system that automatically curates a personal knowledge base from across the web. The workflow runs daily, requires no manual intervention, and costs just a few dollars per month to operate.
For data professionals, this pattern extends beyond RSS feeds. The same architecture could aggregate:
- GitHub repository updates
- Slack messages with specific keywords
- Jira ticket descriptions
- Documentation changes
- Industry report releases
I'm a firm believer that structured data pipelines aren't just for analytics—they're also powerful for knowledge management. Every component in this workflow is designed with the same principles you'd apply to a data pipeline: idempotency, error handling, transformation logic, and efficient processing.
If you're building something similar, the workflow JSON can be imported directly into n8n. You'll need to:
- Set up your Notion databases (RSS sources and content storage)
- Configure your API credentials (Notion, OpenAI)
- Adjust the schedule and filters to your preferences
- Run the manual trigger to test the full flow
You can find the workflow in this GitHub repository.
Happy automating!