Web scraping gives you raw content. Even good scraping tools return markdown that includes navigation menus, cookie banners, newsletter CTAs, social sharing buttons, related posts sections, and footer cruft mixed in with the actual article. You need a cleaning step.
The classical approach is regex or HTML parsing: remove elements by class name, strip known patterns, filter by DOM position. This works until it doesn’t — every site has a different structure, and class names change constantly. You’re playing whack-a-mole with every new source.
A better approach: use an LLM. A cheap model like gpt-4o-mini understands the semantic difference between “article content” and “site chrome” without needing to know the specific HTML structure of each site. You give it markdown, it gives you cleaned markdown. The logic is general enough to work across any site.
Why this works
LLMs are trained on enormous amounts of web content. They have a strong prior on what article text looks like versus what navigation looks like. A paragraph that says “Subscribe to our newsletter for weekly updates” followed by a button reads very differently to an LLM than a paragraph that continues an argument or explanation.
This is the kind of judgment call that’s hard to encode in rules but trivial for a language model. The cost-quality tradeoff lands in the right place too: you don’t need GPT-4o for this. gpt-4o-mini handles it reliably at a fraction of the cost.
The prompt strategy
The key is being specific about what to remove and what to preserve. Vague instructions like “clean this markdown” produce inconsistent results. The prompt needs to enumerate both sides.
System message (sets the role and rules):
```
You are a meticulous Markdown cleaner. Keep the main article text and structure
but remove navigation menus, cookie notices, newsletter CTAs, footers, share
buttons, related posts, and other site chrome. Preserve headings, paragraphs,
code blocks, lists, tables, and links (strip tracking params). Preserve the
bold and italicization markers. Replace the double return \n\n with a single
return \n. Remove the weird characters (like \_ or * * * alone in a single
line). Return only the cleaned Markdown with no commentary, and without
wrapping it in a markdown block.
```

User message (the task and content):

```
Task: Clean the Markdown below according to the rules. Return ONLY the cleaned Markdown.

---BEGIN MARKDOWN---
${content}
---END MARKDOWN---
```

A few design decisions in this prompt:
Enumerate what to remove explicitly. “Navigation menus, cookie notices, newsletter CTAs, footers, share buttons, related posts” — naming the categories reduces ambiguity. The model knows what a cookie notice is, but “site chrome” alone is less reliable.
Enumerate what to preserve. “Headings, paragraphs, code blocks, lists, tables, and links” — this prevents the model from over-aggressively stripping content it’s unsure about.
Formatting cleanup in the same pass. Stripping tracking params from links, normalizing line breaks, removing escaped underscores and stray asterisks — these are cheap to include in the same prompt and avoid a separate cleaning step.
No commentary, no code fences. The output goes directly into the next pipeline stage. You don’t want the model wrapping the result in a markdown code block or prefacing it with “Here is the cleaned content:”. Both instructions are explicit in the prompt.
Delimiters for the content. ---BEGIN MARKDOWN--- and ---END MARKDOWN--- clearly separate the instructions from the content to be cleaned. This matters when articles themselves contain instruction-like phrases.
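Put together, the request payload might look like the sketch below (plain Node-style JavaScript; `SYSTEM_PROMPT` abbreviates the full system message quoted above, and the fence-stripping fallback is a defensive extra, not part of the original prompt):

```javascript
// Sketch: assemble the chat payload for the cleaning call.
// SYSTEM_PROMPT stands in for the full system message quoted above.
const SYSTEM_PROMPT = "You are a meticulous Markdown cleaner. ...";

function buildCleaningPayload(content) {
  return {
    model: "gpt-4o-mini",
    temperature: 0.1,
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      {
        role: "user",
        content:
          "Task: Clean the Markdown below according to the rules. " +
          "Return ONLY the cleaned Markdown.\n\n" +
          `---BEGIN MARKDOWN---\n${content}\n---END MARKDOWN---`,
      },
    ],
  };
}

// Defensive fallback: even with explicit instructions, a model
// occasionally wraps its answer in a code fence. Strip it if present.
function unwrapFence(text) {
  const m = text.trim().match(/^```(?:markdown)?\n([\s\S]*?)\n```$/);
  return m ? m[1] : text.trim();
}
```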
Temperature and model selection
Use temperature: 0.1, near the bottom of the range. You want consistent, predictable output for the same input. A higher temperature introduces unnecessary variation: the model might keep a sidebar in one run and strip it in the next.
gpt-4o-mini is the right choice for this task. It’s:
- Cheap: $0.15/1M input tokens, $0.60/1M output tokens
- Fast enough for batch processing
- More than capable of understanding “this is navigation, this is article”
You don’t need reasoning capability or long context for this. The task is mechanical.
Timeout considerations
LLM APIs can be slow, especially for long articles. The n8n HTTP node making this call uses timeout: 500000 (over 8 minutes). That’s defensive programming — most responses come back in seconds, but some long articles on slow API days can take significantly longer.
This was actually why the native LLM nodes in n8n didn’t work for this workflow: their built-in timeouts are too short. Using a raw HTTP Request node gives you control over the timeout.
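Outside n8n, the same call with an explicit timeout might look like this in plain Node (18+). The endpoint and payload shape are OpenAI’s public chat completions API; the 500000 ms value mirrors the HTTP node setting described above:

```javascript
// Generous timeout: most responses return in seconds, but long articles
// on slow API days can take far longer. 500000 ms ≈ 8.3 minutes.
const TIMEOUT_MS = 500000;

async function cleanMarkdown(apiKey, payload) {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(payload),
    // AbortSignal.timeout (Node 17.3+) cancels the request if it hangs.
    signal: AbortSignal.timeout(TIMEOUT_MS),
  });
  if (!res.ok) throw new Error(`OpenAI API error: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```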
Cost at scale
For a workflow processing ~20 articles/day:
- Input per article: ~3,000-8,000 tokens (system prompt + article content from Jina)
- Output per article: ~2,000-5,000 tokens (cleaned markdown)
- At 600 articles/month: ~3.6M input tokens ($0.54), ~2.4M output tokens ($1.44)
- Total: ~$2/month
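The arithmetic above checks out with the midpoint token estimates and gpt-4o-mini’s published per-million-token pricing:

```javascript
// Back-of-envelope cost check for 600 articles/month.
const ARTICLES_PER_MONTH = 600;
const AVG_INPUT_TOKENS = 6000;   // midpoint of 3,000-8,000
const AVG_OUTPUT_TOKENS = 4000;  // midpoint of 2,000-5,000

const inputCost = (ARTICLES_PER_MONTH * AVG_INPUT_TOKENS / 1e6) * 0.15;
const outputCost = (ARTICLES_PER_MONTH * AVG_OUTPUT_TOKENS / 1e6) * 0.60;

console.log(inputCost.toFixed(2));                 // "0.54"
console.log(outputCost.toFixed(2));                // "1.44"
console.log((inputCost + outputCost).toFixed(2));  // "1.98"
```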
At ~$2/month for 600 articles, the cost is low relative to the alternative: hand-maintained, per-site parsing rules and the ongoing time spent keeping them working.
The intermediate format chain
This cleaning step sits in the middle of a three-stage content extraction pipeline:
```
URL → Jina AI Reader → Raw markdown (with noise)
          ↓
  GPT-4o-mini cleaner → Clean markdown
          ↓
  Markdown-to-Notion blocks parser → Notion blocks
```

Jina AI converts HTML to markdown (see n8n RSS-to-Notion Workflow for setup). The LLM cleaner removes the noise. The Markdown-to-Notion Blocks Parser converts the clean markdown to Notion’s API format.
The reason to use markdown as the intermediate format — rather than cleaning HTML directly — is that markdown is easier to work with programmatically. It’s line-oriented, structured without being verbose, and the LLM produces consistent markdown output. Parsing markdown into Notion blocks is tractable; parsing arbitrary HTML directly is not.
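To see why line-oriented markdown is tractable, consider a toy dispatcher that routes each line on its first characters. The block shapes below are simplified stand-ins, not Notion’s exact API schema, and the real parser (linked above) handles far more cases:

```javascript
// Toy sketch: line-oriented dispatch over cleaned markdown.
// Block shapes are simplified, not Notion's actual API format.
function toBlocks(markdown) {
  const blocks = [];
  for (const line of markdown.split("\n")) {
    if (line.trim() === "") continue;              // skip blank lines
    const h = line.match(/^(#{1,3})\s+(.*)$/);     // headings 1-3
    if (h) {
      blocks.push({ type: `heading_${h[1].length}`, text: h[2] });
    } else if (line.startsWith("- ")) {
      blocks.push({ type: "bulleted_list_item", text: line.slice(2) });
    } else {
      blocks.push({ type: "paragraph", text: line });
    }
  }
  return blocks;
}
```

Trying the same first-characters dispatch on arbitrary HTML would require a full DOM parser before any routing decision could be made; that asymmetry is the argument for the markdown intermediate.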
Generalizing the pattern
This “LLM as semantic filter” pattern extends beyond web scraping. The same approach works for:
- Email newsletters: strip footers, unsubscribe links, sponsor messages — keep the content.
- Slack message digests: strip @mentions, emoji reactions, thread noise — keep the signal.
- Jira ticket descriptions: strip template boilerplate (“Please describe the bug in detail…”) — keep the actual content someone typed.
- GitHub PR descriptions: strip auto-generated sections — keep the human-written summary.
In each case, the pattern is the same: define what to keep, define what to remove, use a cheap model with low temperature, get the cleaned content in the format you need.
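The pattern parameterizes cleanly. A hypothetical helper (the function name and list contents are illustrative, but the wording follows the Markdown-cleaner prompt above):

```javascript
// Sketch: build a semantic-filter system prompt from keep/remove lists.
function semanticFilterPrompt(role, keep, remove) {
  return (
    `You are a meticulous ${role}. ` +
    `Keep ${keep.join(", ")} ` +
    `but remove ${remove.join(", ")}. ` +
    "Return only the cleaned content with no commentary."
  );
}

// Example: adapting the pattern to email newsletters.
const newsletterPrompt = semanticFilterPrompt(
  "email newsletter cleaner",
  ["the article text", "headings", "links"],
  ["footers", "unsubscribe links", "sponsor messages"]
);
```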
The model doesn’t need to understand your domain. It needs to understand the difference between “content someone wrote intentionally” and “scaffolding that surrounds that content.” That’s a very general capability.
Related
- n8n RSS-to-Notion Workflow — the full pipeline this cleaning step lives in
- Markdown-to-Notion Blocks Parser — what happens to the cleaned markdown next