Web Scraping Tool

The Gezondland knowledge base contains articles that were originally published on other websites. To migrate this content to our Astro/Starlight platform, we developed a special tool that converts web pages to clean Markdown files.

The web scraping tool retrieves articles from external websites and converts them to Markdown with the correct frontmatter for Starlight. The tool:

  • Automatically detects the CMS of the source website
  • Extracts the article content without navigation, footers and advertisements
  • Preserves all hyperlinks from the original article
  • Adds source attribution with link to the original
  • Cleans up technical debris generated by content management systems

The tool recognises and processes content from different types of websites:

WordPress is the most widely used CMS in the world. The tool checks whether a site exposes the WordPress REST API and, if so, prefers it over scraping the HTML. The API gives access to the raw content without formatting artefacts.

Recognised sites: jeleefstijlalsmedicijn.nl, 2diabeat.nl, and other WordPress installations with open API.

Recognition: The tool looks for the standard /wp-json/wp/v2/ endpoint or the API link in the HTML.
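The recognition step can be sketched as follows. This is a minimal illustration, not the tool's actual code: it assumes the page HTML has already been downloaded, and the names ApiLinkFinder and find_wp_api are hypothetical. WordPress advertises its API root in a link element with rel="https://api.w.org/"; when that link is absent, the sketch falls back to the standard /wp-json/ path, which would still need to be probed before use.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ApiLinkFinder(HTMLParser):
    """Records the href of the first <link rel="https://api.w.org/"> element."""

    def __init__(self):
        super().__init__()
        self.api_root = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.api_root is not None:
            return
        attr = dict(attrs)
        if attr.get("rel") == "https://api.w.org/" and attr.get("href"):
            self.api_root = attr["href"]


def find_wp_api(page_url: str, html: str) -> str:
    """Return the candidate wp/v2 endpoint for the site serving page_url."""
    finder = ApiLinkFinder()
    finder.feed(html)
    # Fall back to the default location when the page does not advertise the API.
    root = finder.api_root or urljoin(page_url, "/wp-json/")
    if not root.endswith("/"):
        root += "/"
    return urljoin(root, "wp/v2/")
```

The returned URL is only a candidate: a site can disable or restrict its REST API, so a request to the endpoint is still needed to confirm that it is open.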

Drupal is popular with medical and government websites due to its robust structure. The tool recognises Drupal by specific CSS classes and extracts content from the field structure.

Recognised sites: thuisarts.nl (NHG)

Recognition: Presence of field--name-body classes and Drupal-specific markup.

Modern headless CMS solutions such as Storyblok are often paired with a SvelteKit frontend. The tool recognises the characteristic HTML markers that such a frontend leaves in the rendered page.

Recognised sites: voedingleeft.nl

Recognition: HTML_TAG_START and HTML_TAG_END comment blocks in the source code.
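As an illustration of how these markers can be used, the sketch below pulls out everything between a pair of HTML_TAG_START and HTML_TAG_END comments. The function name and the regular expression are assumptions made for this example; the tool's real extraction logic may differ.

```python
import re

# Matches the content between a pair of marker comments, across line breaks.
HTML_TAG_BLOCK = re.compile(
    r"<!--\s*HTML_TAG_START\s*-->(.*?)<!--\s*HTML_TAG_END\s*-->",
    re.DOTALL,
)


def extract_html_tag_blocks(html: str) -> list:
    """Return the HTML fragments wrapped in HTML_TAG_START/END comments."""
    return [fragment.strip() for fragment in HTML_TAG_BLOCK.findall(html)]
```

A page usually contains several such blocks (one per rich-text field), so the caller still has to decide which fragments belong to the article body.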

For websites that don’t fall under the above categories, the tool has fallback strategies. These search for common patterns such as <article> tags, .prose classes, or .entry-content divs.
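Taken together, the recognition rules above amount to a short decision chain. The following sketch shows one possible ordering; the function name, the return values and the order of the checks are assumptions for illustration, and only the markers themselves come from the descriptions above.

```python
import re

# Fallback patterns, tried in order when no known CMS is recognised.
FALLBACK_PATTERNS = [
    ("article", re.compile(r"<article[\s>]", re.IGNORECASE)),
    ("prose", re.compile(r'class="[^"]*(?<![\w-])prose(?![\w-])')),
    ("entry-content", re.compile(r'class="[^"]*(?<![\w-])entry-content(?![\w-])')),
]


def detect_cms(html: str) -> str:
    """Classify a page by the markers found in its HTML source."""
    if 'rel="https://api.w.org/"' in html or "/wp-json/" in html:
        return "wordpress"
    if "field--name-body" in html:
        return "drupal"
    if "HTML_TAG_START" in html and "HTML_TAG_END" in html:
        return "sveltekit"
    for name, pattern in FALLBACK_PATTERNS:
        if pattern.search(html):
            return "fallback:" + name
    return "unknown"
```

Checking the specific platforms first keeps the generic patterns from claiming a page that a dedicated extractor could handle better.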

Every migrated article automatically receives source attribution at the top of the content. This is a blockquote with:

  • Source: The name of the website with link to the original article
  • Author: If available in the source page

This ensures transparency about the origin of the content and respects the intellectual property of the original authors.
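A source attribution block of this kind could be generated as in the sketch below. The exact wording and Markdown styling of the real blockquote are not documented here, so the bold labels and the function name source_attribution are assumptions.

```python
from typing import Optional


def source_attribution(site_name: str, url: str, author: Optional[str] = None) -> str:
    """Build a Markdown blockquote crediting the original article."""
    lines = [f"> **Source:** [{site_name}]({url})"]
    if author:
        # An empty quoted line keeps the author in its own paragraph.
        lines.append(">")
        lines.append(f"> **Author:** {author}")
    return "\n".join(lines) + "\n"
```

The author line is emitted only when an author was found, matching the "if available" rule above.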

All hyperlinks from the original article are preserved. This applies to:

  • Internal links within the source article
  • External links to scientific sources
  • Links to related articles

The links are not modified or removed, so readers retain access to the full context and sources.

Websites often use complex formatting systems that leave “debris” in the HTML. The tool cleans this up automatically:

  • Kadence block markers and styling classes
  • Info-box components and link decorations
  • Gutenberg block attributes
  • Target and rel attributes of links (such as target="_blank")
  • Base64-encoded inline images
  • Video player interface elements
  • Excessive blank lines and spaces
  • Drupal entity attributes
  • Extended Markdown attributes such as {.button .primary}, which would otherwise cause parse errors in Astro’s MDX processing
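A few of these cleanup rules can be expressed as regular expressions over the converted Markdown. The sketch below covers only Gutenberg block comments, base64 inline images, attribute blocks and surplus whitespace; the patterns and the function name clean_markdown are assumptions for illustration, not the tool's actual rule set.

```python
import re

# Gutenberg block delimiters such as <!-- wp:paragraph --> and <!-- /wp:paragraph -->.
GUTENBERG_COMMENT = re.compile(r"<!--\s*/?wp:.*?-->", re.DOTALL)
# Markdown images whose target is a base64 data URI.
BASE64_IMAGE = re.compile(r"!\[[^\]]*\]\(data:image/[^)]*\)")
# Attribute blocks such as {.button .primary} or {#some-id}.
ATTRIBUTE_BLOCK = re.compile(r"\{[.#][^{}\n]*\}")
# Spaces and tabs at the end of a line.
TRAILING_SPACE = re.compile(r"[ \t]+$", re.MULTILINE)
# Three or more consecutive newlines (two or more blank lines).
BLANK_RUN = re.compile(r"\n{3,}")


def clean_markdown(text: str) -> str:
    """Strip common CMS debris and normalise blank lines."""
    for pattern in (GUTENBERG_COMMENT, BASE64_IMAGE, ATTRIBUTE_BLOCK, TRAILING_SPACE):
        text = pattern.sub("", text)
    text = BLANK_RUN.sub("\n\n", text)
    return text.strip() + "\n"
```

Whitespace is normalised last because removing the other debris tends to leave empty lines behind.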

The tool automatically generates the required frontmatter for Starlight:

  • title: Taken from the <title> tag or Open Graph metadata
  • description: From the meta description, or an automatically generated fallback

This ensures that migrated articles are immediately compatible with our site without manual adjustments.
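The frontmatter step might look roughly like this. It is a sketch under several assumptions: the <title> tag takes precedence over og:title, "Untitled" is used when neither exists, and the caller supplies the fallback description. None of these details are specified above, and the names MetaCollector and build_frontmatter are hypothetical.

```python
import json
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects the <title> text and the content of <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attr.get("property") or attr.get("name")
            if key and attr.get("content"):
                self.meta.setdefault(key, attr["content"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def build_frontmatter(html: str, fallback_description: str) -> str:
    """Return a frontmatter block with title and description."""
    collector = MetaCollector()
    collector.feed(html)
    title = collector.title.strip() or collector.meta.get("og:title", "Untitled")
    description = collector.meta.get("description") or fallback_description
    # JSON string syntax is valid YAML, which keeps quotes and colons safe.
    return (
        "---\n"
        f"title: {json.dumps(title, ensure_ascii=False)}\n"
        f"description: {json.dumps(description, ensure_ascii=False)}\n"
        "---\n"
    )
```

Quoting both values avoids YAML parse errors when a page title contains a colon, which is common in article titles.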

Migrating an article proceeds as follows:

  1. Enter URL - The tool receives the URL of the article to be migrated
  2. CMS detection - Automatic recognition of the platform
  3. Content extraction - Retrieving the article content via the best available method
  4. HTML to Markdown - Conversion via Pandoc to clean Markdown
  5. Cleanup - Removing CMS-specific artefacts
  6. Add frontmatter - Title and description from metadata
  7. Insert source attribution - Automatic attribution to original source
  8. Save - Output as .mdx file ready for the knowledge base
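Step 4 can be illustrated with a small wrapper around the Pandoc command line. The sketch assumes Pandoc is installed and on the PATH, and it picks GitHub-Flavored Markdown (gfm) with line wrapping disabled as the output format; the format and flags the tool actually passes to Pandoc are not stated here.

```python
import shutil
import subprocess


def pandoc_command(output_format: str = "gfm") -> list:
    """Build the Pandoc invocation for an HTML-to-Markdown conversion."""
    return ["pandoc", "--from=html", f"--to={output_format}", "--wrap=none"]


def html_to_markdown(html: str) -> str:
    """Convert an HTML fragment to Markdown by piping it through Pandoc."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    result = subprocess.run(
        pandoc_command(),
        input=html,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Pandoc reports a failure
    )
    return result.stdout
```

Disabling wrapping keeps each paragraph on one line, which makes the later regex-based cleanup step simpler.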

In addition to web pages, PDF documents can also be converted to Markdown. This is useful for fact sheets, manuals and other documents available as PDF. The following are supported:

  • Text-based PDFs (directly copyable text)
  • PDFs with tables and lists
  • Multi-page documents

Converting a PDF proceeds as follows:

  1. Upload PDF - Place the PDF in the Downloads folder or provide the path
  2. Content extraction - The PDF skill reads the text content
  3. Preserve structure - Headings, lists and paragraphs are recognised
  4. Markdown generation - Output to .mdx with correct frontmatter
  5. Add author info - For fact sheets, author information is automatically added

After creating a new article, it is published via Git:

git add → git commit → git push → Live on docs.gezondland.org

The site is automatically built and deployed after each push to the master branch.

Category                   Folder
Conditions                 src/content/docs/aandoeningen/
Getting started            src/content/docs/aan-de-slag/
Nutrition                  src/content/docs/voeding/
Fact sheets Yvo Sijpkens   src/content/docs/fiches-yvo-sijpkens/
Classification             src/content/docs/classificatie/

The URL follows the folder structure: docs.gezondland.org/{category}/{filename}/

For example: src/content/docs/fiches-yvo-sijpkens/pcos.mdx becomes docs.gezondland.org/fiches-yvo-sijpkens/pcos/
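The mapping from file path to public URL can be written down in a few lines. This sketch assumes the site is served over https and does not handle index files; the function name public_url is hypothetical.

```python
from pathlib import PurePosixPath

CONTENT_ROOT = PurePosixPath("src/content/docs")
SITE = "https://docs.gezondland.org"  # scheme is an assumption


def public_url(source_path: str) -> str:
    """Map a content file path to the URL under which it is published."""
    relative = PurePosixPath(source_path).relative_to(CONTENT_ROOT).with_suffix("")
    return f"{SITE}/{relative.as_posix()}/"
```

A path outside src/content/docs/ raises a ValueError, which helps catch files saved in the wrong folder.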

The tool has some limitations:

  • Images are not automatically downloaded. These must be retrieved separately and hosted locally.
  • Embedded videos are not included. Only the textual content is extracted.
  • Dynamic content loaded via JavaScript is not always accessible.
  • Paid content behind a login cannot be retrieved.
  • Scanned PDFs (images of text) require OCR and are not always correctly recognised.

The tool is available as a Claude Code skill and can be invoked during a session.

To migrate a web page, provide the URL of the article and the desired category. The tool chooses the best extraction method itself.

To convert a PDF, place the file in the Downloads folder and provide the filename and target category. The tool reads the PDF and generates a Markdown file.

For specific content or websites that are not recognised correctly, the tool can be extended with new extraction patterns.

Medical Disclaimer: The information provided by Stichting Je Leefstijl Als Medicijn regarding lifestyle, diseases, and disorders should not be construed as medical advice. Under no circumstances do we advise people to alter their existing treatment. We recommend that people with chronic conditions seek advice regarding their treatment from qualified medical professionals.