Asked In: Amazon, Ola Cabs
Have you ever wanted to store data online and make it accessible to others? If so, then you may be interested in the Pastebin system design. Pastebin is a service that lets users post various types of content, such as text and images, on the internet and share them with others using a unique URL. The creator of the content can also update it if they are logged in.
Now that we know our essential requirements, it's time to estimate the capacity of the system.
Since this system is expected to receive more reads than writes, let's consider a read-to-write ratio of 10:1. Suppose we have 10 million users, and each user makes one write request per month on average. This means we can expect the following traffic:
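The traffic implied by these assumptions can be worked out with a few lines of arithmetic. The sketch below assumes a 30-day month; the variable names are illustrative only.

```python
# Back-of-the-envelope traffic estimate from the stated assumptions.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: 30-day month

users = 10_000_000
writes_per_user_per_month = 1
read_to_write_ratio = 10

writes_per_month = users * writes_per_user_per_month
reads_per_month = writes_per_month * read_to_write_ratio

writes_per_second = writes_per_month / SECONDS_PER_MONTH
reads_per_second = reads_per_month / SECONDS_PER_MONTH

print(f"writes per month:  {writes_per_month:,}")
print(f"reads per month:   {reads_per_month:,}")
print(f"writes per second: {writes_per_second:.1f}")
print(f"reads per second:  {reads_per_second:.1f}")
```

This works out to roughly 4 writes and 40 reads per second on average; peak traffic would be higher.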
In the Pastebin system, we will have two main APIs: One for writing content and another for handling reading requests. These APIs can be implemented using either the SOAP or REST architecture.
To create a paste, the client sends a request to the web server, which forwards it to the write API server. The write API server then performs the following actions:
To view a paste, the client sends a request to the web server, which forwards it to the read API server. The read API server then performs the following actions:
Beyond these essential services, we have two more APIs that handle real-time analytics and the deletion of expired content.
Both the content and its short URL need to be stored. The average size of a paste is 10 MB. As previously calculated, our service will need to store a total of 12 PB of data over 10 years, which is too much to store in a traditional database. Therefore, we will use an object storage service, such as Amazon S3, to store the content and a separate database to store the URLs.
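The 12 PB figure follows from the earlier assumptions (10 million new pastes per month, 10 MB each, kept for 10 years). A quick check, using decimal units (1 PB = 10^15 bytes) as an assumption:

```python
# Storage estimate: pastes per month * average size * months retained.
MB = 10**6   # decimal megabyte
PB = 10**15  # decimal petabyte

pastes_per_month = 10_000_000
avg_paste_size_bytes = 10 * MB
retention_years = 10

bytes_per_month = pastes_per_month * avg_paste_size_bytes
total_bytes = bytes_per_month * 12 * retention_years
total_pb = total_bytes / PB

print(f"per month: {bytes_per_month / 10**12:.0f} TB")
print(f"over {retention_years} years: {total_pb:.0f} PB")
```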
The database has the following structure:
The User ID serves as the primary key in this structure.
For storing paste URLs, we have two options: a relational database like MySQL or a NoSQL database. Since we need fast reads and writes but do not have many dependencies or relationships among our data, a NoSQL database (such as a key-value store) is the better fit for our system.
Relational databases are very efficient when there are many dependencies and complex queries, but that machinery adds overhead we do not need here. NoSQL databases, on the other hand, are weaker at relationship queries but are typically faster for simple key lookups and easier to scale horizontally.
The key components of the pastebin system are generating short URLs and storing content. Let's dive into these topics in more detail.
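Every time a user creates a paste, the system needs to generate a unique, short URL. For this, we can use Base64 encoding. Note that the standard Base64 alphabet ([A-Z], [a-z], [0-9], '+', and '/') is not entirely URL-safe, because '+' and '/' have special meanings in URLs. The URL-safe variant of Base64 replaces these two characters with '-' and '_', so it is the one to use here.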
The critical idea to explore: What is the MD5 hash algorithm? How does it work? What are its use cases in system design?
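One possible way to combine these ideas is sketched below. It is an assumption rather than something spelled out above: the paste content (plus an optional salt) is hashed with MD5, the digest is encoded with the URL-safe variant of Base64 (which uses '-' and '_' instead of '+' and '/'), and the first six characters are kept as the key. The function name `generate_key` and the `salt` parameter are hypothetical.

```python
import base64
import hashlib

KEY_LENGTH = 6  # assumption: six-character keys


def generate_key(content: str, salt: str = "") -> str:
    """Derive a short key from the content; changing the salt changes the key."""
    digest = hashlib.md5((salt + content).encode("utf-8")).digest()
    # URL-safe Base64 never emits '+' or '/'.
    encoded = base64.urlsafe_b64encode(digest).decode("ascii")
    # Truncating to six characters is what makes collisions possible.
    return encoded[:KEY_LENGTH]


print(generate_key("hello world"))
print(generate_key("hello world", salt="retry-1"))
```

The same input always yields the same key, so a salt (for example a user ID, timestamp, or retry counter) is needed if identical content should receive different URLs.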
One issue with this algorithm is that the generated key may collide with an existing one. In that case, we need to generate a new key and retry until the insert no longer fails because of a duplicate. To avoid this, we can use a Key Generation Service (KGS) that ensures all keys inserted into the key database are unique.
The KGS generates random six-letter strings ahead of time and stores them in a key database. Whenever we want to store a new paste, we take one of the pre-generated keys and use it. Because uniqueness is checked when the keys are generated, no collision check is needed at write time.
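A minimal in-memory sketch of the KGS idea follows. The class name, the method names, and the use of Python sets in place of a real key database are all assumptions made for illustration; the alphabet (letters plus digits) is also an assumption.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # assumption: 62 symbols


class KeyGenerationService:
    """Pre-generates unique keys and hands each one out at most once."""

    def __init__(self, pool_size, key_length=6):
        self._unused = set()  # stands in for the key database
        self._used = set()
        # A set silently drops duplicates, so loop until the pool is full.
        while len(self._unused) < pool_size:
            key = "".join(secrets.choice(ALPHABET) for _ in range(key_length))
            self._unused.add(key)

    def take_key(self):
        """Move one key from the unused pool to the used pool and return it."""
        if not self._unused:
            raise RuntimeError("key pool exhausted")
        key = self._unused.pop()
        self._used.add(key)
        return key


kgs = KeyGenerationService(pool_size=1000)
print(kgs.take_key())
```

In a real deployment, moving a key from "unused" to "used" would have to be atomic so that two application servers never receive the same key.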
Scalability, reliability, availability, and performance are critical business requirements for this service. It should be able to handle a large number of requests and provide a fast, seamless experience for users. To achieve this, we will need to scale our database and have additional servers to handle the increased traffic.
If we use a single database to store all of the data, the service will be more prone to failure. To prevent this, we can partition the data across multiple machines or nodes to store the billions of URLs.
To partition the database, we can use a technique called hash-based partitioning. This involves using a hash function to distribute the URLs into different partitions. We need to decide how many shards to create and choose an appropriate hash function that maps each URL to a specific partition or shard number.
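Hash-based partitioning as described above can be sketched as follows. The function name `shard_for` and the choice of SHA-256 are assumptions; any hash that is stable across processes would do. Python's built-in `hash()` is deliberately avoided because it is randomized per process for strings.

```python
import hashlib


def shard_for(url_key: str, num_shards: int) -> int:
    """Map a URL key to a shard number in the range [0, num_shards)."""
    digest = hashlib.sha256(url_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer, then reduce modulo the shard count.
    return int.from_bytes(digest[:8], "big") % num_shards


for key in ["aZ3kP9", "Qw12Er", "m0nKeY"]:
    print(key, "-> shard", shard_for(key, num_shards=8))
```

One trade-off of plain modulo hashing is that changing the number of shards remaps most keys, so the shard count should be chosen with future growth in mind.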
Since we have ten times more read operations than write operations, we can use caching to speed up the service. Caching involves storing frequently accessed URLs in memory for faster access. For example, if a URL appears on a trending page on a social networking website, it is likely to be visited by many people. In this case, we can store the content of the URL in our cache to avoid delays. Services like Memcached or Redis can be used for this purpose.
There are several considerations to keep in mind when designing the cache:
We can use a distributed cache to address these issues.
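The read path with a cache in front of storage can be sketched as below. This is a single-process illustration only: the least-recently-used (LRU) eviction policy, the class and function names, and the plain dictionary standing in for the backing store are all assumptions, and a production system would use a service such as Memcached or Redis instead.

```python
from collections import OrderedDict


class LRUCache:
    """Keeps at most `capacity` items, evicting the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry


def read_paste(key, cache, storage):
    """Serve from the cache when possible; otherwise load and cache the paste."""
    content = cache.get(key)
    if content is None:
        content = storage[key]  # hypothetical backing store lookup
        cache.put(key, content)
    return content


storage = {"a": "first paste", "b": "second paste", "c": "third paste"}
cache = LRUCache(capacity=2)
for key in ["a", "b", "a", "c"]:
    read_paste(key, cache, storage)
```

After these four reads the cache holds "a" and "c"; "b" was evicted because it was the least recently used entry when "c" arrived.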
There may be instances where a single server receives a large number of requests, causing the service to fail. To prevent this single point of failure, we can use load balancing.
There are many algorithms available for distributing the load across servers, but in this system, we can use the least bandwidth method. This algorithm directs each incoming request to the server that is currently serving the least traffic, typically measured in megabits per second (Mbps).
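The selection step of the least bandwidth method can be sketched as follows. The class name, the `report` method, and the server names are hypothetical; a real load balancer would measure the traffic itself rather than have it reported.

```python
class LeastBandwidthBalancer:
    """Routes each request to the server currently serving the least traffic."""

    def __init__(self, servers):
        # Current outbound traffic per server in Mbps (assumed to be updated elsewhere).
        self.mbps = {server: 0.0 for server in servers}

    def report(self, server, mbps):
        """Record the latest traffic measurement for a server."""
        self.mbps[server] = mbps

    def pick(self):
        """Return the server with the lowest current traffic."""
        return min(self.mbps, key=self.mbps.get)


balancer = LeastBandwidthBalancer(["web-1", "web-2", "web-3"])
balancer.report("web-1", 420.0)
balancer.report("web-2", 95.5)
balancer.report("web-3", 310.0)
print(balancer.pick())
```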
In addition to load balancing, we also need to replicate all databases and servers so that the service remains highly available and does not go down if an individual node fails.
The Pastebin system is a complex and highly scalable service to design. In this blog, we have covered the fundamental concepts necessary for building such a service. We hope you liked the content. Please share your views.
Thanks to Chiranjeev and Suyash for their contributions in creating the first version of this content. If you have any queries, doubts, or feedback, please write to us at firstname.lastname@example.org. Enjoy learning, Enjoy system design, Enjoy algorithms!