Adrienne Vermorel

n8n RSS to Notion

If you are like me, you probably maintain reading lists, bookmark interesting articles, and try to stay current with industry developments.

But manually collecting content from multiple RSS feeds, cleaning up the cruft, and organizing it into a knowledge base can be quite time-consuming and a tad overwhelming at times.

To fight the overwhelm, I have come up with an automation solution that fetches RSS articles, cleans them up using ChatGPT, and saves them as Notion pages.

My workflow is built with n8n, an open-source workflow automation tool that lets you connect different services through visual workflows. Think of it as a self-hosted alternative to Zapier or Make, but with more flexibility and no per-task pricing. You define workflows as nodes connected by edges, where each node performs a specific operation—from making HTTP requests to transforming data with JavaScript.

We'll also briefly touch on Notion, which serves as both our configuration database (where we list RSS sources) and our content repository (where cleaned articles are stored). If you're not familiar with Notion, it's essentially a flexible workspace that combines documents, databases, and kanban boards into one tool.

The Problem I'm Solving

RSS feeds remain one of the best ways to aggregate content from blogs, newsletters, and news sites. However, they come with several challenges:

  1. Content fragmentation: You need to check multiple sources across different readers. There are paid RSS readers, of course, but the content stays locked in the app and isn't very customizable.
  2. HTML noise: Articles include navigation menus, cookie banners, newsletter CTAs, social sharing buttons, and footer content
  3. Poor formatting: Raw RSS content often doesn't render well when saved directly
  4. No deduplication: The same article might appear multiple times if you refresh feeds
  5. Manual effort: Saving interesting articles for later requires manual copy-paste workflows

This workflow solves all of these problems through automation, creating a "save for later" system that runs on autopilot.

Architecture Overview

The workflow consists of 14 nodes organized into four main stages:

  1. Source configuration & triggering (Nodes 1-2)
  2. RSS feed fetching & deduplication (Nodes 3-8)
  3. Content extraction & cleaning (Nodes 9-13)
  4. Notion page creation (Node 14)

Here's the high-level flow:

Trigger → Fetch RSS Sources → Get RSS Feed → Parse XML → Split Items →
Filter Existing Articles (Merge) → Create Notion Pages → Extract Page IDs →
Fetch Article Content (Jina) → Prepare ChatGPT Prompt → Clean with LLM →
Convert Markdown to Notion Blocks → Append to Notion Page

Let's examine each stage in detail.

Stage 1: Triggering the Workflow

The workflow can be triggered in two ways:

Manual Trigger

The When clicking 'Execute workflow' node allows you to run the workflow on demand. This is useful for:

  • Testing changes to the workflow
  • Doing an initial bulk import of articles
  • Force-fetching new content outside the schedule

Scheduled Trigger

The Schedule Trigger node runs the workflow automatically every day at 5:00 AM. This timing is useful because:

  • Most blogs and news sites publish content during business hours
  • Running early morning captures yesterday's content before I start my workday
  • It avoids peak API usage times for some external services

The scheduled trigger uses a cron-like syntax: triggerAtHour: 5 means "run once per day at 5 AM in your configured timezone."
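
For reference, the relevant fragment of the Schedule Trigger node in the exported workflow JSON looks roughly like this (a simplified sketch; exact field names can vary between n8n versions):

{
  "type": "n8n-nodes-base.scheduleTrigger",
  "parameters": {
    "rule": {
      "interval": [{ "field": "days", "triggerAtHour": 5 }]
    }
  }
}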

Stage 2: RSS Source Management and Deduplication

Getting RSS Sources from Notion

The Get many sources from monitoring node queries a Notion database that serves as your RSS source configuration. This is handy because:

  • The RSS sources are stored in a database I can edit easily
  • I can add metadata to sources (categories, priority, etc.)
  • The query filters for type = "RSS", meaning I could store other content types (podcasts, newsletters) in the same database

The node configuration shows it's querying a specific database URL and returning all records where the type property equals "RSS". Each record must include a property called rss_link containing the feed URL.

Fetching the RSS Feed

The Fetch RSS Feed node makes an HTTP GET request to each RSS feed URL. Note the important headers:

"User-Agent": "Mozilla/5.0"
"Accept": "application/rss+xml, application/xml;q=0.9, */*;q=0.8"

These headers are necessary because:

  • Some servers block requests without a User-Agent (they assume it's a bot)
  • The Accept header explicitly requests RSS/XML format, though most feeds will return it anyway

The response format is set to text because we need the raw XML before parsing.

Parsing XML to JSON

The XML to JSON node converts the RSS feed XML into a JSON structure that's easier to work with in subsequent nodes. RSS feeds typically have a structure like:

<rss>
  <channel>
    <item>
      <title>Article Title</title>
      <link>https://example.com/article</link>
      <description>Article description</description>
      <pubDate>Thu, 14 Nov 2024 10:00:00 GMT</pubDate>
      <dc:creator>Author Name</dc:creator>
    </item>
  </channel>
</rss>

After XML parsing, this becomes a nested JSON object.
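
Concretely, the parsed output looks something like this (the exact shape can vary slightly with the parser options, but the rss.channel.item array is what matters for the next step):

{
  "rss": {
    "channel": {
      "item": [
        {
          "title": "Article Title",
          "link": "https://example.com/article",
          "description": "Article description",
          "pubDate": "Thu, 14 Nov 2024 10:00:00 GMT",
          "dc:creator": "Author Name"
        }
      ]
    }
  }
}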

Splitting Out Individual Articles

The Split Out RSS Feed node takes the array at rss.channel.item and creates one output item per article. This is essential because:

  • Each article needs to be processed independently
  • You need to check each article individually for duplicates
  • Each article becomes its own Notion page

Deduplication Logic

This is where the workflow gets sophisticated. The Get All Articles node fetches all existing RSS articles from your Notion content database (where type = "RSS"). Then the Merge node performs a left anti-join:

  • Input 1 (left): Existing articles from Notion with their property_content_url
  • Input 2 (right): New articles from the RSS feed with their link
  • Merge by: property_content_url = link
  • Join mode: keepNonMatches (only keep items from input 2 that don't match input 1)
  • Output from: input2

The result? Only articles that don't already exist in your Notion database get passed through to the next stage. This prevents duplicates and saves API calls.

The configuration executeOnce: true on the Get All Articles node is crucial—it ensures the existing articles are fetched once for the entire batch, not once per RSS item.
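
Conceptually, the anti-join boils down to something like this (a plain JavaScript sketch using the field names above, not the actual node internals):

// existingArticles: items returned by "Get All Articles" (Notion)
// feedItems: items coming from the RSS feed after "Split Out RSS Feed"
const existingUrls = new Set(
  existingArticles.map((item) => item.json.property_content_url)
);

// Keep only feed items whose link is not already stored in Notion
const newArticles = feedItems.filter(
  (item) => !existingUrls.has(item.json.link)
);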

Stage 3: Content Extraction and Cleaning

Creating Placeholder Notion Pages

The Create a database page node creates a new Notion page for each article with basic metadata:

  • Title: {{ $json.title }} from the RSS item
  • Author: Falls back through dc:creator, author, or "no author"
  • Published At: {{ $json.pubDate }}
  • RSS feed name: References back to the source from "Get many sources from monitoring"
  • content_url: The article URL from {{ $json.link }}
  • Type: Set to "RSS" for filtering later
  • Icon: 📰 emoji for visual consistency

At this point, the pages exist but have no content—just metadata. The page ID is captured in the response.

Extracting Page IDs

The Set notion_page_id node creates a new field containing the Notion page ID:

{
  "notion_page_id": "{{ $json.id }}"
}

This ID is crucial for the final step when we append the cleaned content back to the page.

Fetching Article Content with Jina AI

The Read URL content node uses Jina AI's Reader API, which is specifically designed to extract clean content from web pages. The configuration:

url: "{{ $items('Create a database page')[$itemIndex].json.property_content_url }}"
outputFormat: "markdown"

Jina AI returns the article content as markdown, which is perfect because:

  • Markdown is a structured format that's easier to parse than HTML
  • It preserves formatting (headers, lists, bold, italic) while discarding cruft
  • It's the intermediate format we need before converting to Notion blocks

The node has retry logic configured:

  • retryOnFail: true
  • maxTries: 5
  • waitBetweenTries: 5000 (5 seconds)

This is important because web scraping can be flaky (timeouts, rate limits, temporary errors).
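
If you would rather not use the dedicated node, Jina's Reader API can also be called from a plain HTTP Request node by prefixing the article URL with r.jina.ai. Roughly (check Jina's current docs for the exact headers):

method: "GET"
url: "https://r.jina.ai/{{ $items('Create a database page')[$itemIndex].json.property_content_url }}"
"Authorization": "Bearer <your-jina-api-key>"
"X-Return-Format": "markdown"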

Preparing the ChatGPT Prompt

The ChatGPT Prompt Preparation node builds the API request body using JavaScript. The prompt strategy is:

System message: Establishes the role and rules

You are a meticulous Markdown cleaner. Keep the main article text and structure
but remove navigation menus, cookie notices, newsletter CTAs, footers, share buttons,
related posts, and other site chrome. Preserve headings, paragraphs, code blocks,
lists, tables, and links (strip tracking params). Preserve the bold and italicization
markers. Replace the double return \n\n with a single return \n. Remove the weird
characters (like \_ or * * * alone in a single line). Return only the cleaned
Markdown with no commentary, and without wrapping it in a markdown block.

User message: Provides the task and content

Task: Clean the Markdown below according to the rules. Return ONLY the cleaned Markdown.

---BEGIN MARKDOWN---
${$input.item.json.content || ''}
---END MARKDOWN---

The prompt is designed to:

  • Remove all non-article content (navigation, CTAs, etc.)
  • Preserve semantic structure (headings, lists, code)
  • Clean up formatting artifacts
  • Strip tracking parameters from links
  • Return raw markdown without code fences

The model used is gpt-4o-mini with temperature: 0.1 for consistent, deterministic results.

The code explicitly maintains the paired item relationship: pairedItem: 0, which is crucial for n8n to track which input corresponds to which output when processing multiple items.
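
A trimmed-down sketch of what that Code node does (variable names simplified; the system prompt is the one quoted above):

// Build the Chat Completions request body for the current item
const systemMessage = "You are a meticulous Markdown cleaner. ..."; // full rules quoted above

const userMessage = [
  "Task: Clean the Markdown below according to the rules. Return ONLY the cleaned Markdown.",
  "",
  "---BEGIN MARKDOWN---",
  $input.item.json.content || "",
  "---END MARKDOWN---",
].join("\n");

return [
  {
    json: {
      requestBody: JSON.stringify({
        model: "gpt-4o-mini",
        temperature: 0.1,
        messages: [
          { role: "system", content: systemMessage },
          { role: "user", content: userMessage },
        ],
      }),
    },
    pairedItem: 0, // keep the input item and output item paired
  },
];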

Calling the ChatGPT API

The ChatGPT Markdown Cleaner node makes a POST request to OpenAI's Chat Completions API:

method: "POST"
url: "https://api.openai.com/v1/chat/completions"
authentication: "predefinedCredentialType"
nodeCredentialType: "openAiApi"
sendBody: true
specifyBody: "json"
jsonBody: "={{ $json.requestBody }}"
timeout: 500000  // 8+ minutes for slow responses

The high timeout is defensive—LLM APIs can occasionally be slow, especially for long articles. This timeout requirement was also one of the reasons I could not use n8n's native LLM nodes: their timeouts are relatively short, too short for my use case.

Stage 4: Converting to Notion Format

The Markdown-to-Notion-Blocks Parser

This is the most complex node in the workflow. Notion doesn't accept raw markdown—it requires content to be structured as an array of "block" objects. Each block represents a content element (paragraph, heading, list item, etc.) with its own type and properties.

The Markdown to notion blocks node contains ~400 lines of JavaScript that:

  1. Parses inline markdown formatting:

    • Bold: **text** or __text__
    • Italic: _text_
    • Links: [text](url)
    • Inline code: `code`
  2. Converts block-level elements:

    • Headers: # H1, ## H2, ### H3
    • Paragraphs: Regular text
    • Lists: - bullet or 1. numbered
    • Quotes: > quoted text
    • Code blocks: ```language ... ```
    • Images: ![alt](url)
    • Dividers: ---, ***, or ===
  3. Builds Notion rich_text objects: These are complex nested structures like:

    {
      type: "text",
      text: {
        content: "Hello world",
        link: { url: "https://example.com" }  // optional
      },
      annotations: {
        bold: true,
        italic: false,
        // ... other formatting
      }
    }
  4. Handles edge cases:

    • Headers beyond h3 (h4, h5, h6) become bold paragraphs since Notion only supports 3 heading levels
    • Empty quotes are skipped (they look awkward in Notion)
    • Consecutive plain text is merged for efficiency
    • Escaped underscores are unescaped: \_ → _
    • Unwraps linked images: [![alt](img)](link) → ![alt](img)
  5. Splits oversized content: Notion has a 2000-character limit per rich_text element. The splitLongBlocks function:

    • Detects blocks with text content exceeding 2000 characters
    • Splits them into multiple chunks
    • Preserves formatting and structure across chunks
    • Handles code blocks specially (splits by character position)
  6. Limits to 100 blocks: Notion's API has a 100-block limit per request. The workflow takes the first 100 blocks, which is usually sufficient for most articles.
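
As an illustration of point 5 above, the core of the chunking logic can be sketched like this (simplified; the real splitLongBlocks also carries annotations over and handles code blocks by character position):

// Split a long string into chunks of at most 2000 characters
function splitText(text, maxLen = 2000) {
  const chunks = [];
  for (let i = 0; i < text.length; i += maxLen) {
    chunks.push(text.slice(i, i + maxLen));
  }
  return chunks;
}

// Turn one oversized paragraph block into several blocks under the limit
function splitLongParagraph(block) {
  const text = block.paragraph.rich_text.map((rt) => rt.text.content).join("");
  if (text.length <= 2000) return [block];
  return splitText(text).map((chunk) => ({
    object: "block",
    type: "paragraph",
    paragraph: { rich_text: [{ type: "text", text: { content: chunk } }] },
  }));
}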

The output includes metadata for debugging:

{
  children: blocksToSend,  // Array of Notion blocks
  meta: {
    model: "gpt-4o-mini",
    created: "...",
    total_blocks: 45,
    split_blocks: 47,  // After splitting long blocks
    blocks_sent: 47
  }
}

Appending Content to Notion

The final HTTP Request node makes a PATCH request to Notion's "append block children" API:

method: "PATCH"
url: "https://api.notion.com/v1/blocks/{{ notion_page_id }}/children"
authentication: "predefinedCredentialType"
nodeCredentialType: "notionApi"
body: {
  "children": "={{ $json.children }}"  // Array of blocks
}
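
For reference, a children payload with one heading and one paragraph looks like this in Notion's block format:

{
  "children": [
    {
      "object": "block",
      "type": "heading_2",
      "heading_2": {
        "rich_text": [{ "type": "text", "text": { "content": "Section title" } }]
      }
    },
    {
      "object": "block",
      "type": "paragraph",
      "paragraph": {
        "rich_text": [{ "type": "text", "text": { "content": "Body text." } }]
      }
    }
  ]
}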

This adds all the content blocks to the Notion page we created earlier. The page now contains:

  • Metadata in database properties (title, author, date, source, URL)
  • Full article content with preserved formatting
  • Clean structure without ads, navigation, or other cruft

Like the Jina AI node, this has retry logic to handle transient API failures.

Cost Considerations

Let's estimate the costs of running this workflow daily:

n8n

  • Self-hosted: I pay $5/month on railway.com
  • n8n Cloud: Starts at $20/month for 2,500 workflow executions

Jina AI Reader

  • Free tier: 1 million tokens
  • Typical article: ~2,000-5,000 tokens
  • At 20 articles/day: ~60,000-150,000 tokens per day, so the free tier lasts a while. After that, it is $50 per 1 billion tokens, which should also last a long time.

OpenAI API (GPT-4o-mini)

  • $0.150 per 1M input tokens
  • $0.600 per 1M output tokens
  • Input per article: ~3,000-8,000 tokens (system + user message)
  • Output per article: ~2,000-5,000 tokens (cleaned markdown)
  • At 20 articles/day × 30 days = 600 articles/month
    • Input: ~3.6M tokens = $0.54
    • Output: ~2.4M tokens = $1.44
    • Total: ~$2/month

Notion

  • Free tier: Fine for personal use
  • Plus: $10/month

Total monthly cost: ~$7-32, depending on your n8n hosting choice. The LLM and content extraction costs are minimal. Overall, it is cheaper and more customizable than a paid RSS reading tool.

Conclusion

This workflow combines a few modern tools to automate something that was eating up time every day:

  • RSS for decentralized content aggregation
  • Jina AI for intelligent content extraction
  • ChatGPT for content cleaning
  • Notion for structured storage
  • n8n for orchestration

I've built a system that automatically curates a personal knowledge base from across the web. The workflow runs daily, requires no manual intervention, and costs just a few dollars per month to operate.

For data professionals, this pattern extends beyond RSS feeds. The same architecture could aggregate:

  • GitHub repository updates
  • Slack messages with specific keywords
  • Jira ticket descriptions
  • Documentation changes
  • Industry report releases

I am a firm believer that structured data pipelines aren't just for analytics—they're also powerful for knowledge management. Every component in this workflow is designed with the same principles you'd apply to a data pipeline: idempotency, error handling, transformation logic, and efficient processing.

If you're building something similar, the workflow JSON can be imported directly into n8n. You'll need to:

  1. Set up your Notion databases (RSS sources and content storage)
  2. Configure your API credentials (Notion, OpenAI)
  3. Adjust the schedule and filters to your preferences
  4. Run the manual trigger to test the full flow

You can find the workflow in this GitHub repository.

Happy automating!