---
title: "Crawling with the CLI"
original_url: "https://tds.s-anand.net/#/crawling-cli?id=wpull"
downloaded_at: "2025-06-08T23:26:52.185904"
---
[Crawling with the CLI](#/crawling-cli?id=crawling-with-the-cli)
----------------------------------------------------------------
Since websites are a common source of data, we often download entire websites (crawling) and then process them offline.
Web crawling is essential in many data-driven scenarios:
* **Data mining and analysis**: Gathering structured data from multiple pages for market research, competitive analysis, or academic research
* **Content archiving**: Creating offline copies of websites for preservation or backup purposes
* **SEO analysis**: Analyzing site structure, metadata, and content to improve search rankings
* **Legal compliance**: Capturing website content for regulatory or compliance documentation
* **Website migration**: Creating a complete copy before moving to a new platform or design
* **Offline access**: Downloading educational resources, documentation, or reference materials for use without an internet connection
The most commonly used tool for fetching websites is [`wget`](https://www.gnu.org/software/wget/). It comes pre-installed on many UNIX-like systems and is easy to install elsewhere.
[Crawling with the CLI (video)](https://youtu.be/pLfH5TZBGXo)
To crawl the [IIT Madras Data Science Program website](https://study.iitm.ac.in/ds/), for example, you could run:
```
wget \
  --recursive \
  --level=3 \
  --no-parent \
  --convert-links \
  --adjust-extension \
  --compression=auto \
  --accept html,htm \
  --directory-prefix=./ds \
  https://study.iitm.ac.in/ds/
```
Here’s what each option does:
* `--recursive`: Enables recursive downloading (following links)
* `--level=3`: Limits recursion depth to 3 levels from the initial URL
* `--no-parent`: Restricts crawling to only URLs below the initial directory
* `--convert-links`: Converts all links in downloaded documents to work locally
* `--adjust-extension`: Adds proper extensions to files (.html, .jpg, etc.) based on MIME types
* `--compression=auto`: Automatically handles compressed content (gzip, deflate)
* `--accept html,htm`: Only downloads files with these extensions
* `--directory-prefix=./ds`: Saves all downloaded files to the specified directory
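Once the crawl finishes, it’s worth sanity-checking what was fetched. A minimal sketch using standard shell tools (the `ds` directory matches `--directory-prefix` above):
```
# Count the HTML pages that were mirrored
find ds -name '*.html' | wc -l
# Show the largest downloaded files to spot anything unexpected
du -ah ds | sort -rh | head -10
```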
[wget2](https://gitlab.com/gnuwget/wget2) is an improved successor to `wget`: it supports HTTP/2 and parallel connections, and can update a mirror by re-downloading only content that has changed. The syntax is (mostly) the same.
```
wget2 \
  --recursive \
  --level=3 \
  --no-parent \
  --convert-links \
  --adjust-extension \
  --compression=auto \
  --accept html,htm \
  --directory-prefix=./ds \
  https://study.iitm.ac.in/ds/
```
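To refresh an existing mirror later, one approach (a sketch; `--timestamping` is a standard wget/wget2 flag that re-fetches a file only if the server reports it changed) is:
```
wget2 \
  --recursive \
  --level=3 \
  --no-parent \
  --timestamping \
  --directory-prefix=./ds \
  https://study.iitm.ac.in/ds/
```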
There are popular free and open-source alternatives to Wget:
### [Wpull](#/crawling-cli?id=wpull)
[Wpull](https://github.com/ArchiveTeam/wpull) is a wget-compatible Python crawler that supports on-disk resumption, WARC output, and PhantomJS integration. The `uvx` command below (from the [uv](https://docs.astral.sh/uv/) package manager) runs it without a permanent install.
```
uvx wpull \
  --recursive \
  --level=3 \
  --no-parent \
  --convert-links \
  --adjust-extension \
  --compression=auto \
  --accept html,htm \
  --directory-prefix=./ds \
  https://study.iitm.ac.in/ds/
```
### [HTTrack](#/crawling-cli?id=httrack)
[HTTrack](https://www.httrack.com/html/fcguide.html) is a dedicated website-mirroring tool with rich filtering and link-conversion options.
```
httrack "https://study.iitm.ac.in/ds/" \
  -O "./ds" \
  "+*.study.iitm.ac.in/ds/*" \
  -r3
```
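Here’s what each option does:
* `-O "./ds"`: Saves the mirror to the specified output directory
* `"+*.study.iitm.ac.in/ds/*"`: A filter (scan rule) that only allows URLs matching this pattern
* `-r3`: Limits the mirror depth to 3 levels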
### [Robots.txt](#/crawling-cli?id=robotstxt)
`robots.txt` is a standard file found in a website’s root directory that specifies which parts of the site should not be accessed by web crawlers. It’s part of the Robots Exclusion Protocol, an ethical standard for web crawling.
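For example, a site’s `robots.txt` might look like this (a hypothetical file; the paths and bot name are placeholders):
```
# Hypothetical robots.txt, served at https://example.com/robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 10

User-agent: BadBot
Disallow: /
```
Here `User-agent` selects which crawlers a rule group applies to, `Disallow` lists paths they should not fetch, and the nonstandard but widely recognized `Crawl-delay` asks crawlers to pause between requests.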
**Why it’s important**:
* **Server load protection**: Prevents excessive traffic that could overload servers
* **Privacy protection**: Keeps sensitive or private content from being indexed
* **Legal compliance**: Respects website owners’ rights to control access to their content
* **Ethical web citizenship**: Shows respect for website administrators’ wishes
**How to override robots.txt restrictions** (see the example after this list):
* **wget, wget2**: Use `-e robots=off`
* **httrack**: Use `-s0`
* **wpull**: Use `--no-robots`
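For instance, a sketch of a wget crawl that ignores `robots.txt` on a site you control (`staging.example.com` is a placeholder):
```
# Only do this on your own site or with explicit permission
wget \
  --recursive \
  --level=2 \
  -e robots=off \
  --directory-prefix=./staging \
  https://staging.example.com/
```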
**When to override robots.txt (use with discretion)**:
Only bypass `robots.txt` when:
* You have explicit permission from the website owner
* You’re crawling your own website
* The content is publicly accessible and your crawling won’t cause server issues
* You’re conducting authorized security testing
Remember that bypassing `robots.txt` without legitimate reason may:
* Violate terms of service
* Lead to IP banning
* Result in legal consequences in some jurisdictions
* Cause reputation damage to your organization
Always use the minimum necessary crawling speed and scope, and consider contacting website administrators for permission when in doubt.
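In practice, that means rate-limiting your crawler. A sketch with wget’s standard politeness flags (adjust the numbers to the site’s capacity):
```
# --wait=2           pause 2 seconds between requests
# --random-wait      vary the pause (0.5x-1.5x of --wait) to smooth load
# --limit-rate=200k  cap download bandwidth at 200 KB/s
wget \
  --recursive \
  --level=2 \
  --wait=2 \
  --random-wait \
  --limit-rate=200k \
  --directory-prefix=./ds \
  https://study.iitm.ac.in/ds/
```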