---
title: Crawling with the CLI
original_url: https://tds.s-anand.net/#/crawling-cli?id=wpull
downloaded_at: '2025-06-08T23:26:52.185904'
---

# Crawling with the CLI

Since websites are a common source of data, we often download entire websites (crawling) and then process them offline.

Web crawling is essential in many data-driven scenarios:

  • Data mining and analysis: Gathering structured data from multiple pages for market research, competitive analysis, or academic research
  • Content archiving: Creating offline copies of websites for preservation or backup purposes
  • SEO analysis: Analyzing site structure, metadata, and content to improve search rankings
  • Legal compliance: Capturing website content for regulatory or compliance documentation
  • Website migration: Creating a complete copy before moving to a new platform or design
  • Offline access: Downloading educational resources, documentation, or reference materials for use without internet connection

The most commonly used tool for fetching websites is wget. It is pre-installed in many UNIX distributions and easy to install.
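
If you are not sure whether wget is available, here is a quick check-and-install sketch (the package commands assume Debian/Ubuntu or macOS with Homebrew):

```bash
# Check whether wget is installed and which version you have
wget --version | head -1

# Install it if missing (assumes Debian/Ubuntu or macOS with Homebrew)
sudo apt-get install wget   # Debian/Ubuntu
brew install wget           # macOS
```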

Scraping Websites using Wget (8 min)

To crawl the IIT Madras Data Science Program website for example, you could run:

```bash
wget \
  --recursive \
  --level=3 \
  --no-parent \
  --convert-links \
  --adjust-extension \
  --compression=auto \
  --accept html,htm \
  --directory-prefix=./ds \
  https://study.iitm.ac.in/ds/
```

Here’s what each option does:

  • --recursive: Enables recursive downloading (following links)
  • --level=3: Limits recursion depth to 3 levels from the initial URL
  • --no-parent: Restricts crawling to only URLs below the initial directory
  • --convert-links: Converts all links in downloaded documents to work locally
  • --adjust-extension: Adds proper extensions to files (.html, .jpg, etc.) based on MIME types
  • --compression=auto: Automatically handles compressed content (gzip, deflate)
  • --accept html,htm: Only downloads files with these extensions
  • --directory-prefix=./ds: Saves all downloaded files to the specified directory
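
Once the crawl finishes, the mirror can be browsed offline. A minimal sketch, assuming wget's default layout of placing files under a host-named subdirectory inside `./ds`:

```bash
# See what was downloaded
find ds -name '*.html' | head

# Thanks to --convert-links the pages open directly in a browser,
# or you can serve the mirror locally (the directory name is an assumption):
python3 -m http.server 8000 --directory ds/study.iitm.ac.in
# then visit http://localhost:8000/ds/ in your browser
```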

wget2 is the modern successor to wget: it supports HTTP/2 and parallel connections, and it re-downloads only content that has changed since the last run. The syntax is (mostly) the same.

```bash
wget2 \
  --recursive \
  --level=3 \
  --no-parent \
  --convert-links \
  --adjust-extension \
  --compression=auto \
  --accept html,htm \
  --directory-prefix=./ds \
  https://study.iitm.ac.in/ds/
```

There are also popular free and open-source alternatives to wget:

## Wpull

Wpull is a wget-compatible Python crawler that supports on-disk resumption, WARC output, and PhantomJS integration.

```bash
uvx wpull \
  --recursive \
  --level=3 \
  --no-parent \
  --convert-links \
  --adjust-extension \
  --compression=auto \
  --accept html,htm \
  --directory-prefix=./ds \
  https://study.iitm.ac.in/ds/
```
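
Here is a hedged sketch of wpull's distinguishing features, the WARC output and the on-disk URL table that lets an interrupted crawl resume. The flag names are taken from wpull's documentation as remembered; confirm them with `uvx wpull --help`:

```bash
# Write the crawl to a WARC archive and keep the URL table on disk
# so the crawl can be resumed if it is interrupted
uvx wpull \
  --recursive \
  --level=3 \
  --no-parent \
  --warc-file ds-archive \
  --database ds-crawl.db \
  https://study.iitm.ac.in/ds/
```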

## HTTrack

HTTrack is a dedicated website-mirroring tool with rich filtering and link-conversion options.

httrack "https://study.iitm.ac.in/ds/" \
  -O "./ds" \
  "+*.study.iitm.ac.in/ds/*" \
  -r3Copy to clipboardErrorCopied
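
HTTrack can also refresh or resume an existing mirror. A hedged sketch using its `--update` and `--continue` shortcuts, run from inside the project directory created by the first crawl:

```bash
cd ./ds

# Re-check the site and download only pages that changed
httrack --update

# Or resume a crawl that was interrupted
httrack --continue
```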

## Robots.txt

robots.txt is a standard file found in a website’s root directory that specifies which parts of the site should not be accessed by web crawlers. It’s part of the Robots Exclusion Protocol, an ethical standard for web crawling.
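
Because the file always lives at the site root, you can inspect a site's rules before crawling. For example (the target URL is reused from above; the site may or may not publish such a file):

```bash
# Fetch and read the crawling rules, if any
curl -s https://study.iitm.ac.in/robots.txt
```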

Why it’s important:

  • Server load protection: Prevents excessive traffic that could overload servers
  • Privacy protection: Keeps sensitive or private content from being indexed
  • Legal compliance: Respects website owners’ rights to control access to their content
  • Ethical web citizenship: Shows respect for website administrators’ wishes

How to override robots.txt restrictions:

  • wget, wget2: Use -e robots=off
  • httrack: Use -s0
  • wpull: Use --no-robots
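
As a hedged example, here is the earlier wget crawl with robots.txt ignored, combined with wget's delay and bandwidth flags to keep the load on the server low. Only run something like this when one of the conditions below applies:

```bash
# Ignore robots.txt (only with permission!) but crawl politely:
# wait between requests, add jitter, and cap bandwidth
wget \
  --recursive \
  --level=3 \
  --no-parent \
  -e robots=off \
  --wait=2 \
  --random-wait \
  --limit-rate=200k \
  --directory-prefix=./ds \
  https://study.iitm.ac.in/ds/
```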

When to override robots.txt (use with discretion):

Only bypass robots.txt when:

  • You have explicit permission from the website owner
  • You’re crawling your own website
  • The content is publicly accessible and your crawling won’t cause server issues
  • You’re conducting authorized security testing

Remember that bypassing robots.txt without legitimate reason may:

  • Violate terms of service
  • Lead to IP banning
  • Result in legal consequences in some jurisdictions
  • Cause reputation damage to your organization

Always use the minimum necessary crawling speed and scope, and consider contacting website administrators for permission when in doubt.
