Web crawler (also known as a spider) is a system for downloading, storing, and analyzing web pages. It performs the task of organizing web pages that allow users to easily find information. This is done by collecting a few web pages and following links to gather new content.
Web crawlers are used for a variety of purposes, but they are commonly used as a key component of search engines. These search engines compile a collection of web pages, index them, and allow users to search the index for pages that match their queries.
Search engine indexing: Search engines uses web crawlers to collect web pages and generate a local index. For example, Google uses Googlebot web crawler.
Web archiving: We can use web crawlers for collecting web-based information and storing it for future use. The US Library of Congress and the EU web archive often use crawlers for this purpose.
Web monitoring: Web crawlers can monitor the internet for copyright and trademark violations. For example, Digimarc uses crawlers to identify and report pirated activities.
A simple design of a web crawler should have following functionalities:
When designing a web crawler system, we should consider several important characteristics:
To estimate the scale of a web crawling system, we can make assumptions based on a discussion with the interviewer. For example:
Once we have a clear understanding of our requirements and have made estimations on the scale, we can design a high-level architecture overview of the system. This architecture will outline the overall structure and components of the web crawler.
Let us explore the individual component of the system:
Seed URLs: To begin the crawl process, we need to provide a set of seed URLs to the web crawler. One way to do this is to use a website's domain name to crawl all of its web pages. To make our system more efficient, we should be strategic in choosing the seed URL because it can impact the number of web pages that are crawled. The selection of the seed URL can depend on various factors, such as geographical location, categories (such as entertainment, education, sports, food), content type, etc.
URL Frontier: Component that stores URLs to be downloaded is called the URL Frontier. One way to crawl the web is to use a breadth-first traversal, starting from the seed URLs. This can be implemented by using the URL Frontier as a first-in, first-out (FIFO) queue, where URLs will be processed in the order that they were added to the queue (starting with the seed URLs).
HTML Fetcher: HTML fetcher is a component that is responsible for downloading web pages corresponding to a given URL, as provided by the URL Frontier. It does this by using a network protocol such as HTTP or HTTPS. In simple words, HTML fetcher retrieves the actual web page content that needs to be analyzed and stored.
DNS Resolver: Before a web page can be downloaded, URL must be translated into an IP address. For this, HTML fetcher component initiates download process by calling the DNS Resolver. After this, DNS Resolver converts URL into the corresponding IP address, which is then used to access the web page. This process is necessary because computers communicate with each other using IP addresses.
HTML Parser: After HTML fetcher has downloaded a web page, it is important to parse, analyze, and validate the content to ensure the integrity and security of the stored data. For this, HTML Parser will check issues like poorly formatted HTML or malware that could cause problems with the storage system. Through this process, we ensure that the data being stored is of high quality.
Duplicate Detection: Studies have shown that around 30% of web pages contain duplicate content, which can lead to inefficiencies and slowdowns in the storage system if the same content is stored multiple times. To avoid this problem, we can use a data structure to check for redundancy in the downloaded content. For example, we can use MD5 hashing to compare the content of pages that have been previously seen, and check if the same hash has occurred before. This can help to identify and prevent the storage of duplicate content.
Data Storage: After web pages have been downloaded and parsed, they need to be stored in a storage system. The choice of a specific storage system will depend on the use cases for the data. For example, if we want to use the content for offline archiving or analysis, we can compress and store data with a low-cost cloud storage provider.
On the other side, if we want to use content for real-time search or query, we can store the data in a large distributed database, such as HDFS, Google's BigTable, or Apache Cassandra. These types of databases are designed to support querying and searching. But the key thing is: Regardless of the storage system we choose, it is important to ensure that we have enough space to store large amounts of data that is likely to be collected.
Caching: To improve the efficiency of web crawler, we can use a cache to store recently processed URLs. This allows us to quickly look for a URL in the cache rather than crawling the web to find it again. The type of cache will depend on the specific use case for the web crawler.
URL Extractor: URL Extractor parse and extract links from HTML pages. Once the links have been extracted, they are filtered and added to the URL Frontier. This process allows web crawler to expand the scope of its crawl by following the links on each web page and collecting new content. In other words, URL Extractor enables system to discover new content and continue expanding the collection of web pages.
URL Filter: URL filter is used to filter out unwanted content types, faulty links, and URLs from unsafe sites. This helps to ensure that the system only collects and stores high-quality, relevant content. The URL filter can be customized to meet the specific needs of the web crawler, such as by excluding certain content types or blocking access to unsafe sites. In simple words: By using a URL filter, we can improve the efficiency and effectiveness of the web crawler by limiting the amount of unnecessary or irrelevant content that is collected.
URL Detector: URL Detector is used to filter out URLs that have already been visited. This helps to prevent the system from getting stuck in an infinite loop where the same URL is processed repeatedly. There are a few different techniques that can be used to implement the URL Detector, such as Bloom Filters and Hash Tables. These techniques efficiently identify already visited URLs, so that they can be skipped in the crawling process.
URL Storage: URL storage is used to store the URLs of web pages that have already been visited. This allows the system to keep track of the URLs that have been processed, so that they can be skipped in the future if they are encountered again.
The crawling process of a web crawler consists of several worker threads that perform repeated cycles of work. The following is a summary of the steps involved in a work cycle:
Now URL Frontier is responsible for managing the queue of URLs that are waiting to be crawled. It assigns specific positions to the URLs in its data structure based on certain priority rules. These rules can be customized to meet the specific needs of the web crawler. For example, URL Frontier might prioritize URLs from certain domains or categories, or it might prioritize URLs that have been waiting in the queue for a longer time. By allocating positions to the URLs according to fixed priority rules, the URL Frontier helps to ensure that the web crawler is able to efficiently and effectively crawl the web.
Thanks to Chiranjeev and Navtosh for their contribution in creating the first version of this content. If you have any queries or feedback, please write us at firstname.lastname@example.org. Enjoy learning, Enjoy system design, Enjoy algorithms!
Subscribe to get well designed content on data structure and algorithms, machine learning, system design, object orientd programming and math.
©2023 Code Algorithms Pvt. Ltd.
All rights reserved.