Web Scraping Tool
The Gezondland knowledge base contains articles that were originally published on other websites. To migrate this content to our Astro/Starlight platform, we developed a special tool that converts web pages to clean Markdown files.
What does the tool do?
The web scraping tool retrieves articles from external websites and converts them to Markdown with the correct frontmatter for Starlight. The tool:
- Automatically detects the CMS of the source website
- Extracts the article content without navigation, footers and advertisements
- Preserves all hyperlinks from the original article
- Adds source attribution with a link to the original
- Cleans up technical debris generated by CMS systems
Supported Platforms
The tool recognises and processes content from different types of websites:
WordPress Sites
WordPress is the most widely used CMS in the world. The tool detects whether a site exposes the WordPress REST API and prefers it when available, because the API gives access to the raw content without formatting artefacts.
Recognised sites: jeleefstijlalsmedicijn.nl, 2diabeat.nl, and other WordPress installations with an open API.
Recognition: The tool looks for the standard /wp-json/wp/v2/ endpoint or the API link in the HTML.
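As an illustration of this recognition step, the sketch below looks for the API link that WordPress normally advertises in the page head (`<link rel="https://api.w.org/">`) and falls back to the standard `/wp-json/wp/v2/` location. The function and class names are hypothetical, not the tool's actual code, and sites that use the `?rest_route=` URL form are not handled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _ApiLinkFinder(HTMLParser):
    """Collects the href of a <link rel="https://api.w.org/"> element, if any."""

    def __init__(self):
        super().__init__()
        self.api_root = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.api_root is not None:
            return
        attributes = dict(attrs)
        if attributes.get("rel") == "https://api.w.org/":
            self.api_root = attributes.get("href")


def find_wp_api_base(page_url, html):
    """Return a likely wp/v2 base URL, or None if nothing hints at WordPress.

    Hypothetical helper: it only inspects the HTML and does not check
    that the endpoint actually responds.
    """
    finder = _ApiLinkFinder()
    finder.feed(html)
    if finder.api_root:
        # The advertised root normally ends in /wp-json/.
        return finder.api_root.rstrip("/") + "/wp/v2/"
    if "/wp-json/" in html:
        # Fall back to the standard location relative to the site root.
        return urljoin(page_url, "/wp-json/wp/v2/")
    return None
```

A caller would still need to request the returned URL to confirm that the API is open before relying on it.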
Drupal Sites
Drupal is popular with medical and government websites due to its robust structure. The tool recognises Drupal by specific CSS classes and extracts content from the field structure.
Recognised sites: thuisarts.nl (NHG)
Recognition: Presence of field--name-body classes and Drupal-specific markup.
Storyblok/SvelteKit Sites
Modern headless CMS solutions like Storyblok often use SvelteKit as frontend. The tool recognises the characteristic HTML markers these frameworks use.
Recognised sites: voedingleeft.nl
Recognition: HTML_TAG_START and HTML_TAG_END comment blocks in the source code.
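A minimal sketch of how content between these markers could be pulled out, assuming they appear as `<!-- HTML_TAG_START -->` and `<!-- HTML_TAG_END -->` comments around server-rendered rich text. The function name is hypothetical.

```python
import re

# Non-greedy match so that each START/END pair yields its own block.
HTML_TAG_BLOCK = re.compile(
    r"<!--\s*HTML_TAG_START\s*-->(.*?)<!--\s*HTML_TAG_END\s*-->",
    re.DOTALL,
)


def extract_marked_blocks(html):
    """Return the non-empty HTML fragments found between the marker comments."""
    blocks = (block.strip() for block in HTML_TAG_BLOCK.findall(html))
    return [block for block in blocks if block]
```

The fragments come back in document order, so joining them gives an approximation of the article body that can then be converted to Markdown.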
Generic HTML
For websites that don’t fall under the above categories, the tool has fallback strategies. These search for common patterns such as <article> tags, .prose classes, or .entry-content divs.
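The platform detection described in this section can be sketched as a chain of marker checks. This is an assumption about how such a dispatcher might look: the labels, the order of the checks, and the `GENERIC_SELECTORS` list are illustrative, and only the markers themselves come from the descriptions above.

```python
# Fallback selectors for sites without a recognised CMS (from the text above).
GENERIC_SELECTORS = ["article", ".prose", ".entry-content"]


def detect_cms(html):
    """Return a platform label based on simple substring markers in the HTML."""
    if "api.w.org" in html or "/wp-json/" in html:
        return "wordpress"
    if "field--name-body" in html:
        return "drupal"
    if "HTML_TAG_START" in html and "HTML_TAG_END" in html:
        return "storyblok-sveltekit"
    # Nothing matched: the caller should try GENERIC_SELECTORS in order.
    return "generic"
```

Substring checks are cheap but can misfire, for example when an article merely mentions one of these markers in its text, so a real implementation would probably confirm the result against the parsed DOM.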
Automatic Source Attribution
Every migrated article automatically receives source attribution at the top of the content. This is a blockquote with:
- Source: The name of the website with a link to the original article
- Author: If available in the source page
This ensures transparency about the origin of the content and respects the intellectual property of the original authors.
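A minimal sketch of how such an attribution blockquote could be built. The exact wording and layout the tool emits are not documented here, so the bold labels and the function name are assumptions.

```python
def build_attribution(site_name, url, author=None):
    """Return a Markdown blockquote naming the source and, if known, the author."""
    lines = [f"> **Source:** [{site_name}]({url})"]
    if author:
        # An empty quoted line keeps source and author as separate paragraphs.
        lines += [">", f"> **Author:** {author}"]
    return "\n".join(lines) + "\n"
```

For example, `build_attribution("Thuisarts", "https://www.thuisarts.nl/")` yields a single quoted line, and the author line is appended only when the source page provides one.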
Link Preservation
All hyperlinks from the original article are preserved. This applies to:
- Internal links within the source article
- External links to scientific sources
- Links to related articles
The link targets are not modified or removed, so readers retain access to the full context and sources. Only presentational link attributes such as target and rel are stripped during cleanup.
Automatic Cleanup
Websites often use complex formatting systems that leave “debris” in the HTML. The tool cleans this up automatically:
WordPress/Gutenberg
- Kadence block markers and styling classes
- Info-box components and link decorations
- Gutenberg block attributes
General Cleanup
- Target and rel attributes of links (such as target="_blank")
- Base64-encoded inline images
- Video player interface elements
- Excessive blank lines and spaces
- Drupal entity attributes
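Some of these cleanup steps can be illustrated with a few regular expressions. This is a rough sketch under the assumption that link attributes and inline images are removed from the HTML before conversion, and that blank lines are collapsed in the resulting Markdown. The function names are hypothetical, and regex-based HTML editing is fragile on unusual markup.

```python
import re

A_TAG = re.compile(r"<a\b[^>]*>", re.IGNORECASE)
LINK_ATTRS = re.compile(r'\s+(?:target|rel)="[^"]*"', re.IGNORECASE)
BASE64_IMG = re.compile(r'<img\b[^>]*\bsrc="data:image/[^"]*"[^>]*>', re.IGNORECASE)
BLANK_RUNS = re.compile(r"\n(?:[ \t]*\n){2,}")


def strip_html_debris(html):
    """Drop base64 inline images and the target/rel attributes of <a> tags."""
    html = BASE64_IMG.sub("", html)
    # Only touch attributes inside anchor tags; hrefs stay untouched.
    return A_TAG.sub(lambda match: LINK_ATTRS.sub("", match.group(0)), html)


def collapse_blank_lines(markdown):
    """Reduce runs of blank or whitespace-only lines to a single blank line."""
    return BLANK_RUNS.sub("\n\n", markdown).strip() + "\n"
```

Keeping the attribute removal scoped to `<a>` tags means the link targets themselves are never rewritten, which matches the link preservation goal above.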
MDX Compatibility
- Extended Markdown attributes like {.button .primary} are removed
- This prevents parse errors in Astro’s MDX processing
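One way to strip such attribute blocks is a regular expression that only matches braces containing class names, ids, or key="value" pairs, so that other brace expressions are left alone. This is an illustrative sketch with a hypothetical function name, not the tool's actual rule set.

```python
import re

# Matches e.g. {.button .primary}, {#intro .lead} or {target="_blank"},
# including any spaces or tabs directly in front of the opening brace.
ATTRIBUTE_BLOCK = re.compile(
    r'[ \t]*\{(?:\s*(?:[.#][\w-]+|[\w-]+="[^"]*"))+\s*\}'
)


def strip_markdown_attributes(markdown):
    """Remove Pandoc-style attribute blocks that MDX would try to parse as code."""
    return ATTRIBUTE_BLOCK.sub("", markdown)
```

MDX treats `{...}` as a JavaScript expression, which is why a leftover `{.button .primary}` fails to parse. A brace block such as `{count}` does not match the pattern and is kept as-is.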
Frontmatter Generation
The tool automatically generates the required frontmatter for Starlight:
- title: Taken from the <title> tag or Open Graph metadata
- description: From the meta description, or an automatically generated fallback
This ensures that migrated articles are immediately compatible with our site without manual adjustments.
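The frontmatter step could look roughly like the sketch below, which uses only the standard library. It assumes the `<title>` tag takes precedence over `og:title` (the text above does not state the order) and leaves the fallback description to the caller. All names are hypothetical.

```python
import json
from html.parser import HTMLParser


class _MetaCollector(HTMLParser):
    """Collects the <title> text and the content of <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attributes.get("property") or attributes.get("name")
            if key and attributes.get("content"):
                # Keep the first value seen for each key.
                self.meta.setdefault(key.lower(), attributes["content"].strip())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def build_frontmatter(html, fallback_description):
    """Return a YAML frontmatter block with title and description."""
    collector = _MetaCollector()
    collector.feed(html)
    title = collector.title.strip() or collector.meta.get("og:title", "")
    description = collector.meta.get("description") or fallback_description
    # JSON string quoting is also valid YAML, which avoids escaping bugs.
    return (
        "---\n"
        f"title: {json.dumps(title, ensure_ascii=False)}\n"
        f"description: {json.dumps(description, ensure_ascii=False)}\n"
        "---\n"
    )
```

Quoting both values protects against titles that contain colons or quotation marks, which would otherwise break the YAML.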
Workflow
Migrating an article proceeds as follows:
- Enter URL - The tool receives the URL of the article to be migrated
- CMS detection - Automatic recognition of the platform
- Content extraction - Retrieving the article content via the best available method
- HTML to Markdown - Conversion via Pandoc to clean Markdown
- Cleanup - Removing CMS-specific artefacts
- Add frontmatter - Title and description from metadata
- Insert source attribution - Automatic attribution to original source
- Save - Output as an .mdx file ready for the knowledge base
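The conversion and assembly steps of this workflow can be sketched as follows. The Pandoc options shown (`gfm` output, `--wrap=none`) are assumptions, since the source does not say which options the tool passes. The helper names are hypothetical, and `html_to_markdown` requires Pandoc to be installed.

```python
import subprocess


def pandoc_command(output_format="gfm"):
    """Build the Pandoc invocation; the chosen format is an assumption."""
    return ["pandoc", "--from=html", f"--to={output_format}", "--wrap=none"]


def html_to_markdown(html):
    """Pipe HTML through Pandoc and return the Markdown it prints."""
    result = subprocess.run(
        pandoc_command(),
        input=html,
        capture_output=True,
        text=True,
        check=True,  # raise if Pandoc exits with an error
    )
    return result.stdout


def assemble_document(frontmatter, attribution, body):
    """Join frontmatter, attribution and body, separated by blank lines."""
    parts = [frontmatter.rstrip("\n"), "", attribution.rstrip("\n"), "", body.strip(), ""]
    return "\n".join(parts)
```

The string returned by `assemble_document` is what would be written to the `.mdx` file in the final step.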
PDF to Markdown
In addition to web pages, PDF documents can also be converted to Markdown. This is useful for fact sheets, manuals and other documents available as PDF.
Supported PDF types
- Text-based PDFs (directly copyable text)
- PDFs with tables and lists
- Multi-page documents
PDF Workflow
- Upload PDF - Place the PDF in the Downloads folder or provide the path
- Content extraction - The PDF skill reads the text content
- Preserve structure - Headings, lists and paragraphs are recognised
- Markdown generation - Output to .mdx with correct frontmatter
- Add author info - For fact sheets, author information is automatically added
Publication
After creating a new article, it is published via Git:
Automatic Deployment
git add → git commit → git push → Live on docs.gezondland.org
The site is automatically built and deployed after each push to the master branch.
File Locations
| Category | Folder |
|---|---|
| Conditions | src/content/docs/aandoeningen/ |
| Getting started | src/content/docs/aan-de-slag/ |
| Nutrition | src/content/docs/voeding/ |
| Fact sheets Yvo Sijpkens | src/content/docs/fiches-yvo-sijpkens/ |
| Classification | src/content/docs/classificatie/ |
URL structure
The URL follows the folder structure: docs.gezondland.org/{category}/{filename}/
For example: src/content/docs/fiches-yvo-sijpkens/pcos.mdx becomes docs.gezondland.org/fiches-yvo-sijpkens/pcos/
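The mapping from file path to public URL can be expressed in a few lines. This sketch assumes the site is served over HTTPS and that filenames are already lowercase and URL-safe. It does not handle `index.mdx` files, and the names are hypothetical.

```python
from pathlib import PurePosixPath

CONTENT_ROOT = PurePosixPath("src/content/docs")
SITE = "https://docs.gezondland.org"  # scheme is an assumption


def public_url(file_path):
    """Map a content file path to the URL under which it is published."""
    relative = PurePosixPath(file_path).relative_to(CONTENT_ROOT)
    # Drop the .mdx extension and add the trailing slash used by the site.
    return f"{SITE}/{relative.with_suffix('').as_posix()}/"
```

Calling `public_url("src/content/docs/fiches-yvo-sijpkens/pcos.mdx")` reproduces the example above. A path outside `src/content/docs` raises a `ValueError`.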
Limitations
The tool has some limitations:
- Images are not automatically downloaded. These must be retrieved separately and hosted locally.
- Embedded videos are not included. Only the textual content is extracted.
- Dynamic content loaded via JavaScript is not always accessible.
- Paid content behind a login cannot be retrieved.
- Scanned PDFs (images of text) require OCR and are not always correctly recognised.
The tool is available as a Claude Code skill and can be invoked during a session.
Migrating a web article
Provide the URL of the article and the desired category. The tool determines the best extraction method itself.
Converting PDF
Place the PDF in Downloads and provide the filename and target category. The tool reads the PDF and generates a Markdown file.
If specific content or a website is not recognised correctly, the tool can be extended with new extraction patterns.
Medical Disclaimer: The information provided by Stichting Je Leefstijl Als Medicijn regarding lifestyle, diseases, and disorders should not be construed as medical advice. Under no circumstances do we advise people to alter their existing treatment. We recommend that people with chronic conditions seek advice regarding their treatment from qualified medical professionals.