---
title: Convert HTML to Markdown
original_url: https://tds.s-anand.net/#/convert-html-to-markdown?id=markdown-crawler
downloaded_at: '2025-06-08T23:25:38.805247'
---
# Converting HTML to Markdown
When working with web content, converting HTML files to plain text or Markdown is a common requirement for content extraction, analysis, and preservation. For example:
- Content analysis: Extract clean text from HTML for natural language processing
- Data mining: Strip formatting to focus on the actual content
- Offline reading: Convert web pages to readable formats for e-readers or offline consumption
- Content migration: Move content between different CMS platforms
- SEO analysis: Extract headings, content structure, and text for optimization
- Archive creation: Store web content in more compact, preservation-friendly formats
- Accessibility: Convert content to formats that work better with screen readers
This tutorial covers both converting existing HTML files and combining web crawling with HTML-to-text conversion in a single workflow – all using the command line.
## defuddle-cli
defuddle-cli specializes in HTML-to-Markdown conversion. It's a bit slow and not very customizable, but it produces clean Markdown that preserves structure, links, and basic formatting. Best for content where preserving the document structure is important.
```bash
find . -name '*.html' -exec npx --package defuddle-cli -y defuddle parse {} --md -o {}.md \;
```
- `find . -name '*.html'`: Finds all HTML files in the current directory and subdirectories
- `-exec ... \;`: Executes the following command for each file found
- `npx --package defuddle-cli -y`: Installs and runs defuddle-cli without prompting
- `defuddle parse {} --md`: Parses the HTML file (represented by `{}`) and converts it to Markdown
- `-o {}.md`: Outputs to a file with the original name plus a `.md` extension
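Before running a batch, it can help to convert a single file and inspect the result; `page.html` below is just a placeholder for any local HTML file:

```bash
# Convert one file first to check the output quality
# (page.html is a placeholder; this writes page.html.md next to it)
npx --package defuddle-cli -y defuddle parse page.html --md -o page.html.md
```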
## Pandoc
Pandoc is a bit slow but highly customizable. It preserves almost all formatting elements, which can make the resulting Markdown verbose. Best for academic or documentation conversion where precision matters.
Pandoc can also convert between Markdown and many other formats (such as Word, LaTeX, and EPUB, and it can write PDF), making it one of the most popular and versatile document converters.
```bash
find . -name '*.html' -exec pandoc -f html -t markdown_strict -o {}.md {} \;
```
- `find . -name '*.html'`: Finds all HTML files in the current directory and subdirectories
- `-exec ... \;`: Executes the following command for each file found
- `pandoc`: The Swiss Army knife of document conversion
- `-f html -t markdown_strict`: Convert from HTML to strict Markdown
- `-o {}.md {}`: Output to a Markdown file, with the input file as the last argument
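If strict Markdown is too spartan, pandoc supports other output flavors; for example, `gfm` (GitHub-Flavored Markdown) keeps pipe tables and fenced code blocks:

```bash
# Same batch conversion, but emit GitHub-Flavored Markdown
find . -name '*.html' -exec pandoc -f html -t gfm -o {}.md {} \;
```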
## Lynx
Lynx is fast and generates text (not Markdown) with minimal formatting. Lynx renders the HTML as it would appear in a text browser, preserving basic structure but losing complex formatting. Best for quick content extraction or when processing large numbers of files.
```bash
find . -type f -name '*.html' -exec sh -c 'for f; do lynx -dump -nolist "$f" > "${f%.html}.txt"; done' _ {} +
```
- `find . -type f -name '*.html'`: Finds all HTML files in the current directory and subdirectories
- `-exec sh -c '...' _ {} +`: Executes a shell command with batched files for efficiency
- `for f; do ... done`: Loops through each file in the batch
- `lynx -dump -nolist "$f"`: Uses the lynx text browser to render HTML as plain text
  - `-dump`: Output the rendered page to stdout
  - `-nolist`: Don't include the list of links at the end
- `> "${f%.html}.txt"`: Save output to a `.txt` file with the same base name
## w3m
w3m is very slow and, like lynx, produces plain text with minimal formatting, but its rendering tends to be more thorough, at the cost of considerably longer processing. Best for cases where you need slightly better output than lynx, particularly for complex layouts and tables.
```bash
find . -type f -name '*.html' \
  -exec sh -c 'for f; do \
    w3m -dump -T text/html -cols 80 -no-graph "$f" > "${f%.html}.md"; \
  done' _ {} +
```
- `find . -type f -name '*.html'`: Finds all HTML files in the current directory and subdirectories
- `-exec sh -c '...' _ {} +`: Executes a shell command with batched files for efficiency
- `for f; do ... done`: Loops through each file in the batch
- `w3m -dump -T text/html -cols 80 -no-graph "$f"`: Uses the w3m text browser to render HTML
  - `-dump`: Output the rendered page to stdout
  - `-T text/html`: Specify input format as HTML
  - `-cols 80`: Set output width to 80 columns
  - `-no-graph`: Don't show graphic characters for tables and frames
- `> "${f%.html}.md"`: Save output to a `.md` file with the same base name
## Comparison
| Approach | Speed | Format Quality | Preservation | Best For |
|---|---|---|---|---|
| defuddle-cli | Slow | High | Good structure and links | Content migration, publishing |
| pandoc | Slow | Very High | Almost everything | Academic papers, documentation |
| lynx | Fast | Low | Basic structure only | Quick extraction, large batches |
| w3m | Very Slow | Medium-Low | Basic structure with better tables | Improved readability over lynx |
## Optimize Batch Processing
Process in parallel: Use GNU Parallel for multi-core processing:

```bash
find . -name "*.html" | parallel "pandoc -f html -t markdown_strict -o {}.md {}"
```

Filter files before processing:

```bash
find . -name "*.html" -type f -size -1M -exec pandoc -f html -t markdown {} -o {}.md \;
```

Customize output format with additional parameters:

```bash
# For pandoc, preserve line breaks but simplify other formatting
find . -name "*.html" -exec pandoc -f html -t markdown --wrap=preserve --atx-headers {} -o {}.md \;
```

Handle errors gracefully:

```bash
find . -name "*.html" -exec sh -c 'for f; do pandoc -f html -t markdown "$f" -o "${f%.html}.md" 2>/dev/null || echo "Failed: $f" >> conversion_errors.log; done' _ {} +
```
## Choosing the Right Tool
- Need speed with minimal formatting? Use the lynx approach
- Need precise, complete conversion? Use pandoc
- Need a balance of structure and cleanliness? Try defuddle-cli
- Working with complex tables? w3m might render them better
Remember that the best approach depends on your specific use case, volume of files, and how you intend to use the converted text.
## Combined Crawling and Conversion
Sometimes you need to both crawl a website and convert its content to Markdown or text in a single workflow. Tools like Crawl4AI and markdown-crawler combine the two steps; alternatively, you can chain a crawler with one of the converters above.
- For research/data collection: Use a specialized crawler (like Crawl4AI) with post-processing conversion
- For simple website archiving: Markdown-crawler provides a convenient all-in-one solution
- For high-quality conversion: Use wget/wget2 for crawling followed by pandoc for conversion (see the sketch after this list)
- For maximum speed: Combine wget with lynx in a pipeline
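As a minimal sketch of the wget-plus-pandoc pipeline (assuming a site you are permitted to mirror; the URL and the `example.com` output directory are placeholders):

```bash
# Mirror the site locally, rewriting links and adding .html extensions
wget --mirror --convert-links --adjust-extension --no-parent https://example.com/docs/
# Then convert the mirrored HTML with pandoc, as earlier
find example.com -name '*.html' -exec pandoc -f html -t markdown_strict -o {}.md {} \;
```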
### Crawl4AI
Crawl4AI is designed for single-page extraction with high-quality content processing. It is optimized for extracting AI training data, focusing on clean, structured content rather than complete site preservation, and it excels at removing boilerplate while preserving the main article text.
```bash
uv venv
source .venv/bin/activate.fish
uv pip install crawl4ai
crawl4ai-setup
```
- `uv venv`: Creates a Python virtual environment using uv (a faster alternative to virtualenv)
- `source .venv/bin/activate.fish`: Activates the virtual environment (fish shell syntax)
- `uv pip install crawl4ai`: Installs the crawl4ai package
- `crawl4ai-setup`: Initializes crawl4ai's required dependencies
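Once set up, recent crawl4ai releases also ship a `crwl` command-line entry point. Flags vary by version, so treat this as an assumption-laden sketch and check `crwl --help` (the URL is a placeholder):

```bash
# Assumes the crwl CLI from recent crawl4ai releases; flags may differ by version
crwl https://example.com -o markdown
```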
### markdown-crawler
markdown-crawler combines web crawling with markdown conversion in one tool. It’s efficient for bulk processing but tends to produce lower-quality markdown conversion compared to specialized converters like pandoc or defuddle. Best for projects where quantity and integration are more important than perfect formatting.
```bash
uv venv
source .venv/bin/activate.fish
uv pip install markdown-crawler
markdown-crawler -t 5 -d 3 -b ./markdown https://study.iitm.ac.in/ds/
```
- `uv venv` and activation: Same as above
- `uv pip install markdown-crawler`: Installs the markdown-crawler package
- `markdown-crawler`: Runs the crawler with these options:
  - `-t 5`: Sets 5 threads for parallel crawling
  - `-d 3`: Limits crawl depth to 3 levels
  - `-b ./markdown`: Sets the base output directory
  - Final argument is the starting URL
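Once the crawl finishes, a quick look at the output directory shows how many pages were captured and whether the Markdown quality is acceptable:

```bash
# Count crawled pages and preview the first one
find ./markdown -name '*.md' | wc -l
head -n 20 "$(find ./markdown -name '*.md' | head -n 1)"
```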