Design Pastebin

Asked In: Amazon, Ola Cabs

Have you ever thought of any such service that could make our life easier by allowing us to store data over the internet and simultaneously allow others to access the same data? If it is so, then this is the perfect blog for you. Here, in this blog, we’ll be discussing the PasteBin System. Here, the aim is to design a highly scalable service that could allow users to paste various content (images, text, etc.) over the internet and allow others to access the pasted content. So let’s look at the formal definition of Pastebin.

What is Pastebin?

Pastebin is a type of online content hosting service where users can store and share content in the form of text or images over the internet by generating a unique URL so that anyone can access the content via that URL. The user who created it can also update the content if he/she is logged in. So without much delay, let’s dive deep into designing pastebin from scratch.

Key Requirements

Requirements analysis is a critical process that enables the success of a system. Requirements are generally split into two types: Functional and Non-Functional Requirements.

Functional requirements are the essential requirements the end-user demands as necessary facilities that system should offer. For PasteBin, the functional requirements are:

  1. Users should be able to generate a unique URL by pasting their content.
  2. The content and URLs should expire after a specific time.

Non-Functional Requirements are the quality constraints the system must satisfy according to the project contract. For PasteBin, the non-functional requirements are:

  1. The system should be reliable and highly available.
  2. Service should be available in real-time with minimum latency.
  3. Paste URL should not be predictable.

As we know, what are our essential requirements, so it’s time to estimate the service’s capacity.

Capacity Estimation

It is necessary to know approximate traffic to service to be scaled and designed accordingly. Since this service would be read-heavy, it is better to consider read-to-write ratio as 10:1. Assume we have 10M users, and each user contributes one write request on average per month. So we have:

  • 10 Million write requests per month
  • 100 Million read requests per month (10X write requests)
  • So total number of write requests in 10 years = 10M * 12 (months/year) * 10 (years)= 1.2 Billion write requests
  • Let the maximum size per paste be 10MB. So amount of paste size per month = 10 M * 10MB = 100 TB paste/month
  • Amount of paste content in one year = 100TB * 12 = 1200TB or 1.2PB paste/year
  • Assume our service runs for 10 years, then total data accumulated = 1200TB * 10 years = 12 PB

Pastebin High Level Design

Pastebin high level design

We have two high-level working API for this service: one for writing the content and the other one for dealing with reading requests. We can use either SOAP or REST architecture to implement the system.

Write API requirements

write( key, content, expiry_date)

  • key: A unique API key available with each user
  • content: The content user wants to paste.
  • expiry_date: Time after which the content should be expired.
  • Return value: API should return a short URL to access the content. In case of invalid input, it should return an error code.

The user enters the content and gets the randomly generated URL. This works in the described way.

The Client sends a create-paste request to the web server, which forwards the request to the Write API server. The Write API server does the following:

  • Generates a unique URL 
  • Store it to the SQL Database pastes table 
  • Store the content to Object Store and returns the URL

Read API requirements

read (short_url)

  • short_url: The URL provided by the Write method to access the content.
  • Return value: API should return the original paste content or an invalid error code if URL is incorrect.

The user enters a pasted URL and views the content. This works in the following way.

The client sends a get-paste request to the web server, which forwards the request to the Read API server. The Read API server does the following:

  • Checks for the generated URL in SQL Database
  • Fetch the paste contents from Object Store if the URL is in SQL Database.
  • Otherwise, return an error message for the user.

Besides these essential services, we also have two more APIs responsible for real-time analytics (out of scope) and deletion of expired content.

Delete API requirements

delete (api_key, URL)

  • api_key: A unique API key provided to each user
  • URL: URL to access the particular content
  • Return Value: Boolean Value indicating success and failure.

Pastebin Low-Level Design

The key component of this service relies on generating short URLs and storing the content. Let’s discuss these in details.

Generating Short URL

The service needs to generate a unique new short URL whenever a user makes a paste. We can use BASE64 for encoding the URL as all characters of BASE64 are URL-safe characters (characters allowed [A-Z][a-z][0–9],’+’,’/’).

To generate the unique URL, we could take MD5 hash of the user’s ipaddress + timestamps. We have various choices like ipaddress + timestamp or ip_address + content, but as content would be very large to handle, we use the other. Alternatively, we could also take the MD5 hash of randomly generated data.

When we encode the MD5 result to Base64 encoding, the resultant string will be 22 characters long.

Base64 encoding will use 6 bits to represent each character, 
MD5 -> Base64 give (128/6) ~ 22 character output.

Hence, we could use:


For choosing the short URL, first, we can choose the first 7 characters of the resulting Base64 encoding.

The drawback of this algorithm

The primary issue with this algorithm is that the generated URL can be repeated. Hence to overcome this, KGS (Key Generating Service) comes to rescue us. KGS algorithm will ensure that all the keys inserted into key-DB are unique and hence we don’t have to worry about duplication or collisions.

Database Design

We need to store both content and short URL. The average size of each content is 10MB. As calculated above, for our service to be available for ten years, we have to accommodate a massive amount of data, i.e., 12 PB, which is too much to be stored in any database. Hence, we have to use Object Storage Service, such as Amazon S3, to store the content and a normal relational database to store URLs.

Database Structure

The structure of the database is described here:

User Data

  1. User ID: A unique user id for each user.
  2. Name: Name of the user.
  3. Email: The email id of the user.
  4. Password: Password of the user.
  5. Creation Date: The date on which the content was shared.

Content Data

  1. User ID: A unique user id for each user.
  2. Short URL: A unique short URL consisting of 7 characters.
  3. Paste path: The URL path of the S3 Bucket.
  4. Expiration date: Time after which content expires
  5. Last accessed: Last time when content was accessed.

The User ID is used as a Primary Key in the above structure.

design pastebin database design

For storing the paste URL, we have two choices to choose from. We can either go with Relational Database like MySQL, or we can go with a NoSQL database. As we want fast read and write speed, but we don’t need lots of relationships among data. Hence NoSQL Databases are the optimal choice for our system.

Relational Databases are very efficient when there are many dependencies, and when we have lots of complex queries, they are too slow. However, NoSQL databases are very poor at handling the relationship queries, but they are faster.

Scaling the Service

This is the most critical aspect of business requirements. The service should be highly scalable, and at the same time, it should be reliable, available, and fast enough for a better user experience. We need to scale our database, have more servers to handle more requests, and be fast enough.

Database Sharding

Service will be more prone to failure if we use a single database to store the data. So we need to partition it into several machines or nodes to store information about billions of URLs.

For database partitioning, we can use the hash-based partitioning technique. Here hash function will be used to distribute the URLs into different partitions. We need to decide the number of shards to make, and then we can choose an appropriate hash function that random number represents the partition/shard number.

Caching

We have ten times more read operation as compared to write operation. Hence to speed up our service, we need to use caching. Cashing is a way to speed up the reading process. The URLs which are accessed quite frequently should be stored in the cache for faster access. 

For example, if any URL appears on any social networking website’s trending page, many people will likely visit the URL. Hence we can store its content in our cache to avoid any delay. We can use services like Memcached or Redis for this purpose.

Various possibilities need to be kept in mind while designing the cache.

  1. Whenever the cache is full, we need to replace the URLs with the trending ones. We can LRU method to implement this.
  2. The cache should remain in synchronisation with the original content. If any chances are done in the original content, then it must be reflected in the cache.

We can use Distributed Cache to incorporate all the above possibilities.

Load Balancing

There might be instances when many requests went to a single server, resulting in the service’s failure. In other words, the designed system might be prone to a single point of failure. To overcome this problem, we use load balancing. 

There are many algorithms available to distribute the load among servers, but here in this system, we can use the least bandwidth method. This algorithm will divert incoming requests to the server serving the least amount of traffic.

Along with balancing the load, we need to replicate all the databases and servers to avoid failure. In case of any discrepancy, the service should never go down and should remain highly consistent.

Conclusion

Pastebin system is a complex and highly scalable service to design. In this blog, we have covered the fundamental concepts necessary for building such a service. We hope you like the content. Please do share your views.

Thanks, Suyash for helping us to create the first version of the draft. Enjoy learning, Enjoy system design!

Share feedback with us

More blogs to explore

Our weekly newsletter

Subscribe to get weekly content on data structure and algorithms, machine learning, system design and oops.

© 2022 Code Algorithms Pvt. Ltd.

All rights reserved.