What is a Web Crawler?
A web crawler (also known as a spider) is a system for downloading, storing, and analyzing web pages. They are used for a wide variety of purposes. Most prominently, they are one of the main components of web search engines that compile a collection of web pages, index it, and allow users to issue index queries and find web pages that match queries.
A web crawler bot is like someone whose job is to go through all the books in a disorganized library and organize them in such a way that anyone visiting the library can quickly and easily find the information they need. This process starts by collecting a few web pages and then following links to collect new content.
Use Cases of Web Crawler
A crawler is used for many purposes:
- Search engine indexing: One of the best use cases of web crawlers is in search engines. A crawler collects web pages to generate a local index for search engines. For example, Googlebot is the web crawler behind the Google search engine.
- Web archiving: This is the compilation of web-based information to store data for future use. Many national libraries, for instance, run crawlers to archive web pages. The US Library of Congress and the EU web archive are notable examples.
- Web mining: An incredible opportunity for data mining is created by the exponential growth of the web. Web mining allows the internet to discover valuable information.
- Web monitoring: The crawlers help to monitor copyright and trademark violations over the Internet. Digimarc, for instance, uses crawlers to discover pirated activities and reports.
Requirements of the System
A simple web crawler should have the following functionalities:
- Given a set of URLs, visit the URL and store the web page.
- Now, extract the URLs in these web pages.
- Append the new URLs extracted to the list of URLs to be visited.
- Repeat the process.
System Analysis and characteristics
It is important to take note of the characteristics of a good web crawler. A web crawler service should have the following features:
- Scalability: The web is growing exponentially day by day. There are billions of web pages to be crawled and analyzed. Therefore, web crawling should be extremely efficient using parallelization.
- Robustness: The internet is full of traps such as bad HTML, unresponsive servers, crashes, malicious links, etc. The crawler must be able to handle all these cases.
- Politeness: The web crawler should not make too many requests to a website within a short time interval as it may unintentionally lead to DDoS on several websites.
- Extensibility: The system should be adaptable and flexible to any changes that we might encounter in the future, like if we want to crawl images or music files.
- Manageability and Reconfigurability: An appropriate interface is needed to monitor the crawl, including the crawler's speed, statistics about hosts and pages, and the sizes of the main data sets.
Estimation of Scale and Constraints
We should estimate the scale by taking many assumptions by discussing them with the interviewer.
- Assuming 1 billion web pages to be crawled every month, we can estimate the average QPS to be: QPS (Query per Second) = 1,000,000,000 / 30 days / 24 hours / 3600 seconds ~ 400 pages per second.
- We can assume that the peak value of queries per second is 2 times the average, i.e., 800 pages per second.
- Now, assuming that, on average, a web page is 500 Kb in size, we can estimate storage required per month: 1-billion-page 500k = 500 TB storage per month.
- Also, assuming data are stored for five years, 500 TB * 12 months * 5 years = 30 PB.
Once our requirements and scale estimations are clear, we will design a high-level architecture overview of the system.
Let us explore the individual component of the system:
Seed URLs: We need to feed seed URLs to the web crawler as a starting point for the crawl method. A good way to pick seed URLs is to use any website's domain name to crawl all web pages. To make our system more efficient, we need to be creative in choosing the URL to get started with to crawl as many web pages as possible. The selection of it also depends on the use case. It can be chosen in many ways, such as geographical location, categories (such as entertainment, education, sports, food), content type, etc.
URL Frontier: The component that stores URLs to be downloaded is called the URL Frontier. We can crawl by performing a breadth-first traversal of the web, starting from the seed URLs. This is achieved by implementing the frontier as a FIFO.
HTML Fetcher: The purpose of an HTML fetcher is to download the web page corresponding to a given URL given by the URL Frontier using the appropriate network protocol like HTTP/HTTPS.
DNS Resolver: A URL needs to be translated into an IP address to download a web page. To get the corresponding IP address for the URL, the HTML Downloader calls the DNS Resolver.
HTML Parser: The downloaded web pages must be parsed, analyzed, and validated to protect the storage from any problem caused due to bad HTML, malware, etc.
Duplicate Detection: It has been found in studies that approximately 30% of the web pages contain duplicate contents, which can cause the content to be stored multiple times and make the system inefficient or slow. We can use a data structure to match the requirements to check the redundancy of the downloaded content. For example, we can use MD5 hashing of the pages seen before to find if the same hash has occurred before.
Data Storage: The downloaded content must be stored in the storage. We’re certainly going to need a lot of space, but our specific storage system choice will depend on our system’s expected use cases. For instance, we might compress and store all the content with a low-cost cloud storage provider to download and store offline archiving content or analysis. On the other hand, we can instead choose to store the content in a large distributed database if we choose to use the content for real-time search or query, e.g., HDFS, the BigTable from Google, or Apache Cassandra, which could support querying and searching easier.
Caching: To make the system more efficient, we can cache the recently processed URLs to look for any URL from the cache. Again, the type of cache we will use is going to depend upon the use case.
URL Extractor: The task of the URL Extractor is to parse and extract links from HTML pages. The extracted links are filtered and appended to the URL Frontier.
URL Filter: The URL filter is used to filter out unwanted content types, faulty links, and URLs from the sites present in the unsafe sites.
URL Detector: The URL Detector's main purpose is to filter out already visited URLs from the process, which otherwise can cause the same URL to be processed again and again in an infinite loop. Bloom Filters and Hash Tables are most commonly used in URL Detectors.
URL Storage: URL storage is used to store the already visited URLs.
The Workflow of a Web Crawler System
Now that we have all the system components laid out let’s discuss the system workflow.
The crawling process consists of several threads, and each thread performs work cycles repeatedly. The following is a brief overview of a work cycle:
- At the beginning of the work cycle, the worker thread sends seed URLs to URL Frontier. The Frontier pops out a URL according to its priorities and politeness policies.
- Then the Fetcher module is called, which fetches URLs from the URL Frontier.
- The Fetcher calls the DNS resolver to resolve the corresponding web server's host address.
- The HTML Parser parses HTML pages and checks and analyses the content of the page.
- After validating the content, it is passed to the duplicate detection component to check the content's duplicity.
- The duplicate detection component searches the cache, and if it is not found there, it goes to the data storage if the content is already there.
- If it is not in the storage, the web page is sent to Link Extractor, which extracts the outgoing links.
- The extracted URLs are passed to the URL filter, which filters out some unwanted URLs such as file extensions of no interest, black-listed sites, etc.
- After links are filtered, they are passed to the URL Detector component.
- The URL Detector component checks if a URL is already in the storage; if yes, it is processed before, and nothing needs to be done.
- If a URL has not been processed before, it is added to the URL Frontier.
Finally, the URLs reach the URL Frontier, which allots them specific positions in its data structure according to some fixed priority rules.
Enjoy learning, Enjoy system design!