---
title: Crawling with the CLI
original_url: https://tds.s-anand.net/#/crawling-cli?id=wpull
downloaded_at: '2025-06-08T23:26:52.185904'
---

# Crawling with the CLI

Since websites are a common source of data, we often download entire websites (crawling) and then process them offline.

Web crawling is essential in many data-driven scenarios:

  • Data mining and analysis: Gathering structured data from multiple pages for market research, competitive analysis, or academic research
  • Content archiving: Creating offline copies of websites for preservation or backup purposes
  • SEO analysis: Analyzing site structure, metadata, and content to improve search rankings
  • Legal compliance: Capturing website content for regulatory or compliance documentation
  • Website migration: Creating a complete copy before moving to a new platform or design
  • Offline access: Downloading educational resources, documentation, or reference materials for use without internet connection

The most commonly used tool for fetching websites is wget. It is pre-installed in many UNIX distributions and easy to install.
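
If you are not sure whether wget is available, here is a quick check-and-install sketch (the package commands assume Debian/Ubuntu or macOS with Homebrew):

```bash
# Check whether wget is installed and which version you have
wget --version | head -1

# Install it if missing (assumes Debian/Ubuntu or macOS with Homebrew)
sudo apt-get install wget   # Debian/Ubuntu
brew install wget           # macOS
```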

Scraping Websites using Wget (8 min)

To crawl the IIT Madras Data Science Program website for example, you could run:

```bash
wget \
  --recursive \
  --level=3 \
  --no-parent \
  --convert-links \
  --adjust-extension \
  --compression=auto \
  --accept html,htm \
  --directory-prefix=./ds \
  https://study.iitm.ac.in/ds/
```

Here’s what each option does:

  • --recursive: Enables recursive downloading (following links)
  • --level=3: Limits recursion depth to 3 levels from the initial URL
  • --no-parent: Restricts crawling to only URLs below the initial directory
  • --convert-links: Converts all links in downloaded documents to work locally
  • --adjust-extension: Adds proper extensions to files (.html, .jpg, etc.) based on MIME types
  • --compression=auto: Automatically handles compressed content (gzip, deflate)
  • --accept html,htm: Only downloads files with these extensions
  • --directory-prefix=./ds: Saves all downloaded files to the specified directory
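
Once the crawl finishes, the mirror can be browsed offline. A minimal sketch, assuming wget's default layout of placing files under a host-named subdirectory inside `./ds`:

```bash
# See what was downloaded
find ds -name '*.html' | head

# Thanks to --convert-links the pages open directly in a browser,
# or you can serve the mirror locally (the directory name is an assumption):
python3 -m http.server 8000 --directory ds/study.iitm.ac.in
# then visit http://localhost:8000/ds/ in your browser
```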

wget2 is the modern successor to wget: it supports HTTP/2 and parallel connections, and it re-downloads only content that has changed since the last run. The syntax is (mostly) the same.

```bash
wget2 \
  --recursive \
  --level=3 \
  --no-parent \
  --convert-links \
  --adjust-extension \
  --compression=auto \
  --accept html,htm \
  --directory-prefix=./ds \
  https://study.iitm.ac.in/ds/
```

There are also popular free and open-source alternatives to wget:

## Wpull

Wpull is a wget-compatible Python crawler that supports on-disk resumption, WARC output, and PhantomJS integration.

```bash
uvx wpull \
  --recursive \
  --level=3 \
  --no-parent \
  --convert-links \
  --adjust-extension \
  --compression=auto \
  --accept html,htm \
  --directory-prefix=./ds \
  https://study.iitm.ac.in/ds/
```
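
Here is a hedged sketch of wpull's distinguishing features, the WARC output and the on-disk URL table that lets an interrupted crawl resume. The flag names are taken from wpull's documentation as remembered; confirm them with `uvx wpull --help`:

```bash
# Write the crawl to a WARC archive and keep the URL table on disk
# so the crawl can be resumed if it is interrupted
uvx wpull \
  --recursive \
  --level=3 \
  --no-parent \
  --warc-file ds-archive \
  --database ds-crawl.db \
  https://study.iitm.ac.in/ds/
```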

## HTTrack

HTTrack is a dedicated website-mirroring tool with rich filtering and link-conversion options.

httrack "https://study.iitm.ac.in/ds/" \
  -O "./ds" \
  "+*.study.iitm.ac.in/ds/*" \
  -r3Copy to clipboardErrorCopied
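
HTTrack can also refresh or resume an existing mirror. A hedged sketch using its `--update` and `--continue` shortcuts, run from inside the project directory created by the first crawl:

```bash
cd ./ds

# Re-check the site and download only pages that changed
httrack --update

# Or resume a crawl that was interrupted
httrack --continue
```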

## Robots.txt

robots.txt is a standard file found in a website’s root directory that specifies which parts of the site should not be accessed by web crawlers. It’s part of the Robots Exclusion Protocol, an ethical standard for web crawling.
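
Because the file always lives at the site root, you can inspect a site's rules before crawling. For example (the target URL is reused from above; the site may or may not publish such a file):

```bash
# Fetch and read the crawling rules, if any
curl -s https://study.iitm.ac.in/robots.txt
```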

Why it’s important:

  • Server load protection: Prevents excessive traffic that could overload servers
  • Privacy protection: Keeps sensitive or private content from being indexed
  • Legal compliance: Respects website owners’ rights to control access to their content
  • Ethical web citizenship: Shows respect for website administrators’ wishes

How to override robots.txt restrictions:

  • wget, wget2: Use -e robots=off
  • httrack: Use -s0
  • wpull: Use --no-robots
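
As a hedged example, here is the earlier wget crawl with robots.txt ignored, combined with wget's delay and bandwidth flags to keep the load on the server low. Only run something like this when one of the conditions below applies:

```bash
# Ignore robots.txt (only with permission!) but crawl politely:
# wait between requests, add jitter, and cap bandwidth
wget \
  --recursive \
  --level=3 \
  --no-parent \
  -e robots=off \
  --wait=2 \
  --random-wait \
  --limit-rate=200k \
  --directory-prefix=./ds \
  https://study.iitm.ac.in/ds/
```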

When to override robots.txt (use with discretion):

Only bypass robots.txt when:

  • You have explicit permission from the website owner
  • You’re crawling your own website
  • The content is publicly accessible and your crawling won’t cause server issues
  • You’re conducting authorized security testing

Remember that bypassing robots.txt without legitimate reason may:

  • Violate terms of service
  • Lead to IP banning
  • Result in legal consequences in some jurisdictions
  • Cause reputation damage to your organization

Always use the minimum necessary crawling speed and scope, and consider contacting website administrators for permission when in doubt.
