Web Scraping Tool

The Gezondland knowledge base contains articles that were originally published on other websites. To migrate this content to our Astro/Starlight platform, we developed a special tool that converts web pages to clean Markdown files.

What does the tool do?

The web scraping tool retrieves articles from external websites and converts them to Markdown with the correct frontmatter for Starlight. The tool:

Automatically detects the CMS of the source website
Extracts the article content without navigation, footers and advertisements
Preserves all hyperlinks from the original article
Adds source attribution with a link to the original
Cleans up markup debris left behind by content management systems

Supported Platforms

The tool recognises and processes content from different types of websites:

WordPress Sites

WordPress is the most widely used CMS in the world. The tool checks whether a site exposes the WordPress REST API and, if so, prefers it over parsing the page HTML. The API gives direct access to the article content, without formatting artefacts from the page template.

Recognised sites: jeleefstijlalsmedicijn.nl, 2diabeat.nl, and other WordPress installations with an open REST API.

Recognition: The tool looks for the standard /wp-json/wp/v2/ endpoint or the API link in the HTML.
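
The recognition step could look roughly like the sketch below. It is an illustration, not the tool's actual code: the function name find_wp_api_base and the parser class are hypothetical. It relies on the discovery link that WordPress places in the page head (a link element with rel="https://api.w.org/") and falls back to the conventional /wp-json/ location on the same host.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class _ApiLinkParser(HTMLParser):
    """Collects the href of <link rel="https://api.w.org/">, if present."""

    def __init__(self) -> None:
        super().__init__()
        self.api_root: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.api_root is not None:
            return
        attr = dict(attrs)
        if attr.get("rel") == "https://api.w.org/" and attr.get("href"):
            self.api_root = attr["href"]


def find_wp_api_base(page_url: str, html: str) -> str:
    """Return a candidate wp/v2 base URL for the given page (hypothetical helper).

    Prefers the API link advertised in the HTML; otherwise assumes the
    standard /wp-json/ path on the same host. The caller still has to
    request the URL to confirm that the API is actually open.
    """
    parser = _ApiLinkParser()
    parser.feed(html)
    root = parser.api_root or urljoin(page_url, "/wp-json/")
    if not root.endswith("/"):
        root += "/"
    return root + "wp/v2/"
```

The result is only a candidate: a request to it is still needed to find out whether the site really serves the API, and the HTML route remains the fallback when it does not.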

Drupal Sites

Drupal is popular with medical and government websites due to its robust structure. The tool recognises Drupal by specific CSS classes and extracts content from the field structure.

Recognised sites: thuisarts.nl (NHG)

Recognition: Presence of field--name-body classes and Drupal-specific markup.

Storyblok/SvelteKit Sites

Modern headless CMS solutions like Storyblok often use SvelteKit as the frontend. The tool recognises the characteristic HTML markers these frameworks use.

Recognised sites: voedingleeft.nl

Recognition: HTML_TAG_START and HTML_TAG_END comment blocks in the source code.
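
As an illustration of how these markers can be used for extraction, the sketch below collects everything between each pair of comment markers. The function name extract_marked_blocks is hypothetical, and the exact whitespace inside the comments is an assumption, so the pattern tolerates any amount of it.

```python
import re
from typing import List

# Matches <!-- HTML_TAG_START --> ... <!-- HTML_TAG_END --> across line breaks.
_HTML_TAG_BLOCK = re.compile(
    r"<!--\s*HTML_TAG_START\s*-->(.*?)<!--\s*HTML_TAG_END\s*-->",
    re.DOTALL,
)


def extract_marked_blocks(html: str) -> List[str]:
    """Return the non-empty HTML fragments found between the marker pairs."""
    fragments = (match.strip() for match in _HTML_TAG_BLOCK.findall(html))
    return [fragment for fragment in fragments if fragment]
```

Joining the returned fragments in order gives the rich-text part of the page, with navigation and other markup outside the markers left out.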

Generic HTML

For websites that don’t fall under the above categories, the tool has fallback strategies. These search for common patterns such as <article> tags, .prose classes, or .entry-content divs.
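
Taken together, the recognition rules in this section might be sketched as a single detection function followed by a fallback lookup. The names detect_cms and generic_strategy, the returned labels, and the order of the checks are assumptions made for this example; only the markers themselves come from the descriptions above.

```python
import re
from typing import Optional


def detect_cms(html: str) -> str:
    """Guess the platform from markers in the page source (hypothetical helper)."""
    if 'rel="https://api.w.org/"' in html or "/wp-json/" in html:
        return "wordpress"
    if "field--name-body" in html:
        return "drupal"
    if "HTML_TAG_START" in html and "HTML_TAG_END" in html:
        return "storyblok-sveltekit"
    return "generic"


# Fallback patterns, tried in this (assumed) order.
_GENERIC_PATTERNS = [
    ("article", re.compile(r"<article\b", re.IGNORECASE)),
    ("prose", re.compile(r'class="[^"]*(?<![\w-])prose(?![\w-])')),
    ("entry-content", re.compile(r'class="[^"]*(?<![\w-])entry-content(?![\w-])')),
]


def generic_strategy(html: str) -> Optional[str]:
    """Return the name of the first fallback pattern found, or None."""
    for name, pattern in _GENERIC_PATTERNS:
        if pattern.search(html):
            return name
    return None
```

The lookarounds in the class patterns keep a class such as prose-lg from being mistaken for prose.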

Automatic Source Attribution

Every migrated article automatically receives source attribution at the top of the content. This is a blockquote with:

Source: The name of the website with a link to the original article
Author: If available in the source page

This ensures transparency about the origin of the content and respects the intellectual property of the original authors.
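
A minimal sketch of how such a blockquote could be assembled is shown below. The function name build_attribution and the bold labels are assumptions; the tool's actual output format may differ.

```python
from typing import Optional


def build_attribution(site_name: str, url: str, author: Optional[str] = None) -> str:
    """Build a Markdown blockquote naming the source and, if known, the author."""
    lines = [f"> **Source:** [{site_name}]({url})"]
    if author:
        # An empty quote line keeps the author on its own paragraph.
        lines.append(">")
        lines.append(f"> **Author:** {author}")
    return "\n".join(lines) + "\n"
```

The author line is simply left out when the source page gives no author, so the blockquote never contains an empty field.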

Link Preservation

All hyperlinks from the original article are preserved. This applies to:

Internal links within the source article
External links to scientific sources
Links to related articles

The links are not modified or removed, so readers retain access to the full context and sources.

Automatic Cleanup

Websites often use complex formatting systems that leave “debris” in the HTML. The tool cleans this up automatically:

WordPress/Gutenberg

Kadence block markers and styling classes
Info-box components and link decorations
Gutenberg block attributes

General Cleanup

Target and rel attributes of links (such as target="_blank")
Base64-encoded inline images
Video player interface elements
Excessive blank lines and spaces
Drupal entity attributes

MDX Compatibility

Extended Markdown attributes like {.button .primary} are removed
This prevents parse errors in Astro’s MDX processing
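
Part of this cleanup can be illustrated with a few regular expressions applied to the converted Markdown. The sketch below is an assumption about how such a pass might work, not the tool's own rules: clean_markdown is a hypothetical name, and it covers only attribute blocks, base64 images and surplus blank lines.

```python
import re

# One attribute inside braces: .class, #id or key="value".
_ATTR = r'(?:[.#][\w-]+|[\w-]+="[^"]*")'
# A whole attribute block such as {.button .primary} or {target="_blank"}.
_ATTR_BLOCK = re.compile(r"\{\s*" + _ATTR + r"(?:\s+" + _ATTR + r")*\s*\}")
# Markdown images whose source is an inline data URI.
_BASE64_IMAGE = re.compile(r"!\[[^\]]*\]\(data:image/[^)]*\)")


def clean_markdown(md: str) -> str:
    """Remove attribute blocks, inline base64 images and surplus blank lines."""
    md = _BASE64_IMAGE.sub("", md)
    md = _ATTR_BLOCK.sub("", md)
    # Turn whitespace-only lines into empty lines, then collapse runs of them.
    md = re.sub(r"^[ \t]+$", "", md, flags=re.MULTILINE)
    md = re.sub(r"\n{3,}", "\n\n", md)
    return md.strip() + "\n"
```

The attribute pattern is deliberately strict: braces that do not contain a class, an id or a key="value" pair are left alone, so ordinary text in braces survives.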

Frontmatter Generation

The tool automatically generates the required frontmatter for Starlight:

title: Taken from the <title> tag or Open Graph metadata
description: From the meta description, or an automatically generated fallback

This ensures that migrated articles are immediately compatible with our site without manual adjustments.
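
For illustration, frontmatter with these two fields could be produced as in the sketch below. The function name build_frontmatter is hypothetical, and the fallback shown (the first 160 characters of the article text) is an assumption, since the exact fallback rule is not specified here. Values are written as JSON strings, which are also valid double-quoted YAML scalars, so colons and quotes in a title do not break the file.

```python
import json


def build_frontmatter(title: str, description: str = "", body_text: str = "") -> str:
    """Return a Starlight frontmatter block with title and description."""
    title = " ".join(title.split())
    description = " ".join(description.split())
    if not description:
        # Assumed fallback: the start of the article text, whitespace collapsed.
        description = " ".join(body_text.split())[:160]
    return (
        "---\n"
        f"title: {json.dumps(title, ensure_ascii=False)}\n"
        f"description: {json.dumps(description, ensure_ascii=False)}\n"
        "---\n"
    )
```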

Workflow

Migrating an article proceeds as follows:

Enter URL - The tool receives the URL of the article to be migrated
CMS detection - Automatic recognition of the platform
Content extraction - Retrieving the article content via the best available method
HTML to Markdown - Conversion via Pandoc to clean Markdown
Cleanup - Removing CMS-specific artefacts
Add frontmatter - Title and description from metadata
Insert source attribution - Automatic attribution to original source
Save - Output as .mdx file ready for the knowledge base
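
The conversion step in this workflow can be sketched as a call to the Pandoc command line from Python. The choice of GitHub-flavoured Markdown as output format and the --wrap=none option are assumptions for this example, as are the names pandoc_command and html_to_markdown; the tool may use other settings.

```python
import subprocess
from typing import List


def pandoc_command(to_format: str = "gfm") -> List[str]:
    """Build the Pandoc invocation: HTML in, Markdown out, no hard line wrapping."""
    return ["pandoc", "--from=html", f"--to={to_format}", "--wrap=none"]


def html_to_markdown(html: str) -> str:
    """Convert an HTML fragment to Markdown; requires pandoc on the PATH."""
    result = subprocess.run(
        pandoc_command(),
        input=html,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if pandoc reports a failure
    )
    return result.stdout
```

Passing the HTML on standard input avoids temporary files, and check=True makes a failed conversion stop the workflow instead of saving an empty article.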

PDF to Markdown

In addition to web pages, PDF documents can also be converted to Markdown. This is useful for fact sheets, manuals and other documents available as PDF.

Supported PDF types

Text-based PDFs (directly copyable text)
PDFs with tables and lists
Multiple pages

PDF Workflow

Upload PDF - Place the PDF in the Downloads folder or provide the path
Content extraction - The PDF skill reads the text content
Preserve structure - Headings, lists and paragraphs are recognised
Markdown generation - Output to .mdx with correct frontmatter
Add author info - For fact sheets, author information is automatically added

Publication

After creating a new article, it is published via Git:

Automatic Deployment

git add → git commit → git push → Live on docs.gezondland.org

The site is automatically built and deployed after each push to the master branch.

File Locations

Category	Folder
Conditions	`src/content/docs/aandoeningen/`
Getting started	`src/content/docs/aan-de-slag/`
Nutrition	`src/content/docs/voeding/`
Fact sheets Yvo Sijpkens	`src/content/docs/fiches-yvo-sijpkens/`
Classification	`src/content/docs/classificatie/`

URL structure

The URL follows the folder structure: docs.gezondland.org/{category}/{filename}/

For example: src/content/docs/fiches-yvo-sijpkens/pcos.mdx becomes docs.gezondland.org/fiches-yvo-sijpkens/pcos/
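
The mapping from file path to URL can be written down as a small function. This sketch covers only the plain case described above; the name public_url and the constants are hypothetical.

```python
from pathlib import PurePosixPath

# Values taken from the text above; the constant names are illustrative.
CONTENT_ROOT = PurePosixPath("src/content/docs")
SITE = "docs.gezondland.org"


def public_url(file_path: str) -> str:
    """Map a content file path to its address on the site."""
    relative = PurePosixPath(file_path).relative_to(CONTENT_ROOT)
    slug = relative.with_suffix("")  # drop .mdx or .md
    return f"{SITE}/{slug.as_posix()}/"
```

A path outside src/content/docs raises a ValueError, which is a useful early warning that a file was saved in the wrong place.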

Limitations

The tool has some limitations:

Images are not automatically downloaded. These must be retrieved separately and hosted locally.
Embedded videos are not included. Only the textual content is extracted.
Dynamic content loaded via JavaScript is not always accessible.
Paid content behind a login cannot be retrieved.
Scanned PDFs (images of text) require OCR and are not always correctly recognised.

Usage

The tool is available as a Claude Code skill and can be invoked during a session.

Migrating a web article

Provide the URL of the article and the desired category. The tool determines the best extraction method itself.

Converting PDF

Place the PDF in Downloads and provide the filename and target category. The tool reads the PDF and generates a Markdown file.

If specific content or a website is not recognised correctly, the tool can be extended with new extraction patterns.

Medical Disclaimer: The information provided by Stichting Je Leefstijl Als Medicijn regarding lifestyle, diseases, and disorders should not be construed as medical advice. Under no circumstances do we advise people to alter their existing treatment. We recommend that people with chronic conditions seek advice regarding their treatment from qualified medical professionals.