--- title: "Convert HTML to Markdown" original_url: "https://tds.s-anand.net/#/convert-html-to-markdown?id=markdown-crawler" downloaded_at: "2025-06-08T23:25:38.805247" --- [Converting HTML to Markdown](#/convert-html-to-markdown?id=converting-html-to-markdown) ---------------------------------------------------------------------------------------- When working with web content, converting HTML files to plain text or Markdown is a common requirement for content extraction, analysis, and preservation. For example: * **Content analysis**: Extract clean text from HTML for natural language processing * **Data mining**: Strip formatting to focus on the actual content * **Offline reading**: Convert web pages to readable formats for e-readers or offline consumption * **Content migration**: Move content between different CMS platforms * **SEO analysis**: Extract headings, content structure, and text for optimization * **Archive creation**: Store web content in more compact, preservation-friendly formats * **Accessibility**: Convert content to formats that work better with screen readers This tutorial covers both converting existing HTML files and combining web crawling with HTML-to-text conversion in a single workflow – all using the command line. ### [defuddle-cli](#/convert-html-to-markdown?id=defuddle-cli) [defuddle-cli](https://github.com/defuddle/defuddle) specializes in HTML - Markdown conversion. It’s a bit slow and not very customizable but produces clean Markdown that preserves structure, links, and basic formatting. Best for content where preserving the document structure is important. ``` find . -name '*.html' -exec npx --package defuddle-cli -y defuddle parse {} --md -o {}.md \;Copy to clipboardErrorCopied ``` * `find . -name '*.html'`: Finds all HTML files in the current directory and subdirectories * `-exec ... \;`: Executes the following command for each file found * `npx --package defuddle-cli -y`: Installs and runs defuddle-cli without prompting * `defuddle parse {} --md`: Parses the HTML file (represented by `{}`) and converts to markdown * `-o {}.md`: Outputs to a file with the original name plus .md extension ### [Pandoc](#/convert-html-to-markdown?id=pandoc) [Pandoc](https://pandoc.org/) is a bit slow and highly customizable, preserving almost all formatting elements, leading to verbose markdown. Best for academic or documentation conversion where precision matters. Pandoc can convert from many other formats (such as Word, PDF, LaTeX, etc.) to Markdown and vice versa, making it one of most popular and versatele document convertors. [![How to Convert a Word Document to Markdown for Free using Pandoc (12 min)](https://i.ytimg.com/vi/HPSK7q13-40/sddefault.jpg)](https://youtu.be/HPSK7q13-40) ``` find . -name '*.html' -exec pandoc -f html -t markdown_strict -o {}.md {} \;Copy to clipboardErrorCopied ``` * `find . -name '*.html'`: Finds all HTML files in the current directory and subdirectories * `-exec ... \;`: Executes the following command for each file found * `pandoc`: The Swiss Army knife of document conversion * `-f html -t markdown_strict`: Convert from HTML format to strict markdown * `-o {}.md {}`: Output to a markdown file, with the input file as the last argument ### [Lynx](#/convert-html-to-markdown?id=lynx) [Lynx](https://lynx.invisible-island.net/) is fast and generates text (not Markdown) with minimal formatting. Lynx renders the HTML as it would appear in a text browser, preserving basic structure but losing complex formatting. Best for quick content extraction or when processing large numbers of files. ``` find . -type f -name '*.html' -exec sh -c 'for f; do lynx -dump -nolist "$f" > "${f%.html}.txt"; done' _ {} +Copy to clipboardErrorCopied ``` * `find . -type f -name '*.html'`: Finds all HTML files in the current directory and subdirectories * `-exec sh -c '...' _ {} +`: Executes a shell command with batched files for efficiency * `for f; do ... done`: Loops through each file in the batch * `lynx -dump -nolist "$f"`: Uses the lynx text browser to render HTML as plain text + `-dump`: Output the rendered page to stdout + `-nolist`: Don’t include the list of links at the end * `> "${f%.html}.txt"`: Save output to a .txt file with the same base name ### [w3m](#/convert-html-to-markdown?id=w3m) [w3m](https://w3m.sourceforge.net/) is very slow processing with minimal formatting. w3m tends to be more thorough in its rendering than lynx but takes considerably longer. It supports basic JavaScript processing, making it better at handling modern websites with dynamic content. Best for cases where you need slightly better rendering than lynx, particularly for complex layouts and tables, and when some JavaScript processing is beneficial. ``` find . -type f -name '*.html' \ -exec sh -c 'for f; do \ w3m -dump -T text/html -cols 80 -no-graph "$f" > "${f%.html}.md"; \ done' _ {} +Copy to clipboardErrorCopied ``` * `find . -type f -name '*.html'`: Finds all HTML files in the current directory and subdirectories * `-exec sh -c '...' _ {} +`: Executes a shell command with batched files for efficiency * `for f; do ... done`: Loops through each file in the batch * `w3m -dump -T text/html -cols 80 -no-graph "$f"`: Uses the w3m text browser to render HTML + `-dump`: Output the rendered page to stdout + `-T text/html`: Specify input format as HTML + `-cols 80`: Set output width to 80 columns + `-no-graph`: Don’t show graphic characters for tables and frames * `> "${f%.html}.md"`: Save output to a .md file with the same base name ### [Comparison](#/convert-html-to-markdown?id=comparison) | Approach | Speed | Format Quality | Preservation | Best For | | --- | --- | --- | --- | --- | | defuddle-cli | Slow | High | Good structure and links | Content migration, publishing | | pandoc | Slow | Very High | Almost everything | Academic papers, documentation | | lynx | Fast | Low | Basic structure only | Quick extraction, large batches | | w3m | Very Slow | Medium-Low | Basic structure with better tables | Improved readability over lynx | ### [Optimize Batch Processing](#/convert-html-to-markdown?id=optimize-batch-processing) 1. **Process in parallel**: Use GNU Parallel for multi-core processing: ``` find . -name "*.html" | parallel "pandoc -f html -t markdown_strict -o {}.md {}"Copy to clipboardErrorCopied ``` 2. **Filter files before processing**: ``` find . -name "*.html" -type f -size -1M -exec pandoc -f html -t markdown {} -o {}.md \;Copy to clipboardErrorCopied ``` 3. **Customize output format** with additional parameters: ``` # For pandoc, preserve line breaks but simplify other formatting find . -name "*.html" -exec pandoc -f html -t markdown --wrap=preserve --atx-headers {} -o {}.md \;Copy to clipboardErrorCopied ``` 4. **Handle errors gracefully**: ``` find . -name "*.html" -exec sh -c 'for f; do pandoc -f html -t markdown "$f" -o "${f%.html}.md" 2>/dev/null || echo "Failed: $f" >> conversion_errors.log; done' _ {} +Copy to clipboardErrorCopied ``` ### [Choosing the Right Tool](#/convert-html-to-markdown?id=choosing-the-right-tool) * **Need speed with minimal formatting?** Use the lynx approach * **Need precise, complete conversion?** Use pandoc * **Need a balance of structure and cleanliness?** Try defuddle-cli * **Working with complex tables?** w3m might render them better Remember that the best approach depends on your specific use case, volume of files, and how you intend to use the converted text. ### [Combined Crawling and Conversion](#/convert-html-to-markdown?id=combined-crawling-and-conversion) Sometimes you need to both crawl a website and convert its content to markdown or text in a single workflow, like [Crawl4AI](#/convert-html-to-markdown?id=crawl4ai) or [markdown-crawler](#/convert-html-to-markdown?id=markdown-crawler). 1. **For research/data collection**: Use a specialized crawler (like Crawl4AI) with post-processing conversion 2. **For simple website archiving**: Markdown-crawler provides a convenient all-in-one solution 3. **For high-quality conversion**: Use wget/wget2 for crawling followed by pandoc for conversion 4. **For maximum speed**: Combine wget with lynx in a pipeline ### [Crawl4AI](#/convert-html-to-markdown?id=crawl4ai) [Crawl4AI](https://github.com/crawl4ai/crawl4ai) is designed for single-page extraction with high-quality content processing. Crawl4AI is optimized for AI training data extraction, focusing on clean, structured content rather than complete site preservation. It excels at removing boilerplate content and preserving the main article text. ``` uv venv source .venv/bin/activate.fish uv pip install crawl4ai crawl4ai-setupCopy to clipboardErrorCopied ``` * `uv venv`: Creates a Python virtual environment using uv (a faster alternative to virtualenv) * `source .venv/bin/activate.fish`: Activates the virtual environment (fish shell syntax) * `uv pip install crawl4ai`: Installs the crawl4ai package * `crawl4ai-setup`: Initializes crawl4ai’s required dependencies ### [markdown-crawler](#/convert-html-to-markdown?id=markdown-crawler) [markdown-crawler](https://pypi.org/project/markdown-crawler/) combines web crawling with markdown conversion in one tool. It’s efficient for bulk processing but tends to produce lower-quality markdown conversion compared to specialized converters like pandoc or defuddle. Best for projects where quantity and integration are more important than perfect formatting. ``` uv venv source .venv/bin/activate.fish uv pip install markdown-crawler markdown-crawler -t 5 -d 3 -b ./markdown https://study.iitm.ac.in/ds/Copy to clipboardErrorCopied ``` * `uv venv` and activation: Same as above * `uv pip install markdown-crawler`: Installs the markdown-crawler package * `markdown-crawler`: Runs the crawler with these options: + `-t 5`: Sets 5 threads for parallel crawling + `-d 3`: Limits crawl depth to 3 levels + `-b ./markdown`: Sets the base output directory + Final argument is the starting URL [Previous Convert PDFs to Markdown](#/convert-pdfs-to-markdown) [Next LLM Website Scraping](#/llm-website-scraping)