What is WhatsApp?
WhatsApp is a social messaging platform that allows users to send messages to one another. The application has become an integral part of most of our lives, but have you ever thought about how WhatsApp works? What are the underlying principles behind its design and functioning?
This blog will cover these fundamental questions and try to develop a basic system just like WhatsApp. We’ll also discuss WhatsApp’s generic architecture, which can serve as a base for designing any such chat application. So without any further delay, let’s dive deep into the design of WhatsApp!
WhatsApp is a highly scalable system that is accessed frequently by users across the globe. Hence it should be designed efficiently to remain reliable and operational almost all the time. So it is essential to figure out the crucial requirements of the system.
The WhatsApp messenger should meet these basic requirements:
- Able to support one-on-one conversations.
- Able to show last seen and message acknowledgment (Sent, Delivered, and Read).
- Allow media support (Images/Videos) and end-to-end encryption.
Let’s estimate the capacity that our service requires.
Our goal is to build a highly scalable platform that could support a massive amount of traffic. Let’s assume we have 10 billion messages sent per day. So we have:
- 10 billion messages sent per day by 1 billion users
- Number of active users at peak traffic (per second): 700k (6X average)
- Number of messages at peak traffic (per second): 40 million
- Each message has 160 characters (average). Assuming one byte per character: 10B * 160 bytes = 1.6 TB of data per day
- Assume 10 years of service provision: 10 * 1.6 TB * 365 ≈ 5.84 PB, or roughly 6 PB.
The entire application will comprise several microservices, each performing a specific task. Let’s assume the latency of sending a message is 20 milliseconds, and the number of concurrent connections per server is 100. Hence the estimated number of servers required in the chat servers’ fleet = (chat messages per second * latency) / concurrent connections per server = 40M * 20 ms / 100 = 8000 servers.
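The storage and fleet-size arithmetic above can be checked with a few lines of Python. This is only a sketch that takes the assumed inputs from the text as given (including one byte per character); the constant names are made up for illustration.

```python
# Back-of-the-envelope figures, using the assumed inputs from the text.
MESSAGES_PER_DAY = 10_000_000_000       # 10 billion messages per day
BYTES_PER_MESSAGE = 160                 # 160 characters, assuming 1 byte each
YEARS_OF_SERVICE = 10
PEAK_MESSAGES_PER_SECOND = 40_000_000   # peak figure assumed in the text
LATENCY_SECONDS = 0.020                 # 20 ms to send one message
CONNECTIONS_PER_SERVER = 100

TB = 10**12
PB = 10**15

daily_storage_bytes = MESSAGES_PER_DAY * BYTES_PER_MESSAGE
total_storage_bytes = daily_storage_bytes * 365 * YEARS_OF_SERVICE

# Messages in flight at any instant = arrival rate * latency;
# dividing by the per-server connection limit gives the fleet size.
chat_servers = round(
    PEAK_MESSAGES_PER_SECOND * LATENCY_SECONDS / CONNECTIONS_PER_SERVER
)

print(daily_storage_bytes / TB)   # 1.6
print(total_storage_bytes / PB)   # 5.84
print(chat_servers)               # 8000
```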
We have two primary services at the heart of this system: the chat service and the transient service. The chat service manages the traffic from messages sent to users who are online, while the transient service deals with the traffic when the recipient is offline.
- The chat service is responsible for delivering messages if the user is online. It checks whether the message’s recipient is online; if so, it delivers the message instantly; otherwise, the transient service will send the message to the recipient whenever they come online.
- The transient service maintains separate storage for messages that cannot be delivered yet and holds them until the offline user comes online.
High-Level API Design
We have two high-level APIs for this service: one for sending messages and one for viewing them. We can use the REST architecture to implement the system.
Sendmessage (fromUser, toUser, clientMetaData, message)
This API will be used for sending messages from one user to another.
Parameters: fromUser, toUser, clientMetaData, message
1. fromUser: User who is sending the message
2. toUser: User to whom the message is being sent
3. clientMetaData: Metadata to store client’s information
4. message: The original message
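As a rough illustration, the send API could be modeled as a request object plus a handler. This is a minimal sketch, not the actual WhatsApp API: the `SendMessageRequest` class, the validation rules, and the fields of the returned dictionary (`messageId`, `status`, `timestamp`) are all assumptions made for this example.

```python
import time
import uuid
from dataclasses import dataclass


@dataclass
class SendMessageRequest:
    from_user: str         # fromUser: user who is sending the message
    to_user: str           # toUser: user to whom the message is sent
    client_metadata: dict  # clientMetaData: information about the client
    message: str           # message: the original message


def send_message(request: SendMessageRequest) -> dict:
    """Validate the request and return a hypothetical 'accepted' response."""
    if not request.from_user or not request.to_user:
        raise ValueError("fromUser and toUser are required")
    if not request.message:
        raise ValueError("message must not be empty")
    return {
        "messageId": str(uuid.uuid4()),  # assumed server-assigned ID
        "status": "SENT",
        "timestamp": int(time.time()),
    }


response = send_message(
    SendMessageRequest("alice", "bob", {"device": "android"}, "Hello Bob")
)
print(response["status"])  # SENT
```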
Conversation(userId, offset, messageCount, TimeStamp)
This API is used to show conversations in a thread. Think of this as the view you see when you open WhatsApp. We would only want to fetch a few messages in one API call for one user at a time. The offset and message count parameters are used to handle this.
Parameters: userId, offset, messageCount, TimeStamp
1. userId: Unique ID of the User
2. offset: Used for fetching the previous message
3. messageCount: Number of messages to be viewed
4. TimeStamp: Last updated time
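The paging behavior of the offset and message count parameters can be sketched as follows. An in-memory dictionary stands in for the message store, and the way the timestamp is used here (only messages at or before it are returned) is an assumption, since the text only describes it as the last updated time.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    text: str
    timestamp: int


def conversation(store, user_id, offset, message_count, timestamp=None):
    """Return one page of a user's thread, newest message first.

    store: dict mapping user_id -> list of Message (stand-in for a database)
    offset: how many of the newest messages to skip
    message_count: how many messages to return in this call
    timestamp: if given, ignore messages newer than this time (assumption)
    """
    if offset < 0 or message_count <= 0:
        raise ValueError("offset must be >= 0 and message_count > 0")
    ordered = sorted(
        store.get(user_id, []), key=lambda m: m.timestamp, reverse=True
    )
    if timestamp is not None:
        ordered = [m for m in ordered if m.timestamp <= timestamp]
    return ordered[offset:offset + message_count]


store = {"alice": [Message("bob", f"message {t}", t) for t in range(1, 6)]}
first_page = conversation(store, "alice", offset=0, message_count=2)
second_page = conversation(store, "alice", offset=2, message_count=2)
print([m.timestamp for m in first_page])   # [5, 4]
print([m.timestamp for m in second_page])  # [3, 2]
```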
How do features like last seen, single tick, and double tick work?
The acknowledgment service is the key player behind these features. It keeps generating and checking acknowledgment responses, and the features are implemented on top of them.
Single tick: When the message from User A reaches the server, the server sends an acknowledgment to User A that the message has been sent.
Double tick: Once the server delivers the message to User B over the appropriate connection, User B sends an acknowledgment to the server saying that it has received the message. The server then sends another acknowledgment to User A, whose client displays a double tick.
Blue tick: When User B opens the message, User B sends another acknowledgment to the server saying that it has read the message. The server then forwards this acknowledgment to User A, whose client displays a blue tick.
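On the sender’s side, the three ticks can be thought of as states that a message moves through as acknowledgments arrive. The sketch below is one possible way to model this; the `Status` and `OutgoingMessage` names are hypothetical, and the rule that a late acknowledgment never moves a message backwards is an assumption.

```python
from enum import IntEnum


class Status(IntEnum):
    PENDING = 0    # not yet acknowledged by the server
    SENT = 1       # single tick: the server has the message
    DELIVERED = 2  # double tick: the recipient's device has the message
    READ = 3       # blue tick: the recipient has opened the message


class OutgoingMessage:
    def __init__(self, text):
        self.text = text
        self.status = Status.PENDING

    def on_ack(self, ack):
        # Acknowledgments may arrive late or out of order,
        # so the status only ever moves forward.
        if ack > self.status:
            self.status = ack
        return self.status


msg = OutgoingMessage("Hello Bob")
msg.on_ack(Status.SENT)
msg.on_ack(Status.READ)
msg.on_ack(Status.DELIVERED)  # late ack, ignored
print(msg.status.name)        # READ
```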
Last seen feature: This feature depends solely on the heartbeat mechanism. Each client sends a heartbeat to the server every 5 seconds. The server records the last seen time of each user in a table, where any other user can easily look it up.
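A minimal sketch of the last seen table might look like this. The 5-second interval comes from the text; treating a user as online while their latest heartbeat is at most two intervals old is an assumption, as are the class and method names.

```python
HEARTBEAT_INTERVAL_SECONDS = 5                         # from the text
ONLINE_GRACE_SECONDS = 2 * HEARTBEAT_INTERVAL_SECONDS  # assumed threshold


class LastSeenTable:
    def __init__(self):
        self._last_seen = {}  # user_id -> time of the latest heartbeat

    def heartbeat(self, user_id, now):
        """Record a heartbeat received from user_id at time `now` (seconds)."""
        self._last_seen[user_id] = now

    def last_seen(self, user_id):
        """Return the last heartbeat time, or None if the user was never seen."""
        return self._last_seen.get(user_id)

    def is_online(self, user_id, now):
        seen = self._last_seen.get(user_id)
        return seen is not None and now - seen <= ONLINE_GRACE_SECONDS


table = LastSeenTable()
table.heartbeat("bob", now=100)
table.heartbeat("bob", now=105)
print(table.last_seen("bob"))           # 105
print(table.is_online("bob", now=108))  # True
print(table.is_online("bob", now=130))  # False
```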
Design of Key Features
One-on-one messaging is an essential component of the chat service. Using it, one user can easily send messages to another. Let’s look into how this works:
Suppose Alice wants to send a message to Bob. The message gets directed to the chat server to which Alice is connected. Alice gets an acknowledgment from the chat server that the message has been sent. The chat server then queries the data storage to find the chat server to which Bob is connected. Alice’s chat server forwards the message to Bob’s chat server, and the message gets delivered to Bob using a push mechanism. Bob then sends an acknowledgment back to Alice’s chat server, which in turn informs Alice that the message has been delivered. Finally, when Bob reads the message, a new acknowledgment is sent to Alice that the message has been read.
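The routing flow above can be sketched with in-memory stand-ins. Here a shared dictionary plays the role of the data storage that records which chat server each user is connected to, and another dictionary plays the role of the transient storage for offline recipients. All class and attribute names are hypothetical, and networking and acknowledgments back to the sender are left out.

```python
class ChatServer:
    def __init__(self, name, sessions, transient):
        self.name = name
        self.sessions = sessions    # shared: user_id -> ChatServer
        self.transient = transient  # shared: user_id -> pending messages
        self.inboxes = {}           # messages pushed to locally connected users

    def connect(self, user_id):
        """Register the user here and flush anything stored while offline."""
        self.sessions[user_id] = self
        inbox = self.inboxes.setdefault(user_id, [])
        inbox.extend(self.transient.pop(user_id, []))

    def disconnect(self, user_id):
        self.sessions.pop(user_id, None)

    def send(self, from_user, to_user, text):
        """Route a message; return the status the sender would see."""
        target = self.sessions.get(to_user)
        if target is None:
            # Recipient is offline: hand over to transient storage.
            self.transient.setdefault(to_user, []).append((from_user, text))
            return "SENT"
        target.inboxes[to_user].append((from_user, text))
        return "DELIVERED"


sessions, transient = {}, {}
server_a = ChatServer("a", sessions, transient)
server_b = ChatServer("b", sessions, transient)
server_a.connect("alice")
server_b.connect("bob")

print(server_a.send("alice", "bob", "hi"))     # DELIVERED
server_b.disconnect("bob")
print(server_a.send("alice", "bob", "again"))  # SENT
server_a.connect("bob")                        # Bob reconnects elsewhere
print(server_a.inboxes["bob"])                 # [('alice', 'again')]
```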
User Activity Status
Showing when a user was last active is standard functionality in instant messengers.
The figure shows the mechanism for maintaining a connection between the client and the server. A bidirectional connection is established between them using WebSockets. Heartbeats are sent over this connection, and the user’s activity status is monitored through them.
End-to-End encryption is an important feature that allows only the communicating users to read the messages.
Each user participating in the communication has a key pair: a public key, which is shared with the other users, and a private key, which is never shared. The public keys play an important role in maintaining end-to-end encryption between the users. Suppose two users, Alice and Bob, are communicating with each other. Alice has Bob’s public key, Bob has Alice’s public key, and each keeps their own private key to themselves. When Alice sends a message, she encrypts it with Bob’s public key, so it can only be decrypted with Bob’s private key, which only he holds. Similarly, only Alice can decrypt the messages sent by Bob. In this way, only Alice and Bob can see each other’s messages, and the server only acts as a mediator in the whole process.
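To make the public/private key idea concrete, here is a toy example using textbook RSA with tiny primes. It is purely illustrative and not secure, and it is not a description of the scheme WhatsApp actually uses; the primes, the exponent, and the byte-by-byte encryption are assumptions chosen to keep the example short.

```python
def make_keypair(p, q, e=17):
    """Build a toy RSA key pair from two small primes (NOT secure)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)    # modular inverse of e; needs Python 3.8+
    return (e, n), (d, n)  # (public key, private key)


def encrypt(public_key, plaintext):
    e, n = public_key
    # Encrypt each byte separately; real systems never do this.
    return [pow(byte, e, n) for byte in plaintext.encode("utf-8")]


def decrypt(private_key, ciphertext):
    d, n = private_key
    return bytes(pow(c, d, n) for c in ciphertext).decode("utf-8")


# Each user generates a key pair and shares only the public half.
bob_public, bob_private = make_keypair(61, 53)
alice_public, alice_private = make_keypair(89, 97)

# Alice encrypts with Bob's public key; only Bob's private key decrypts it.
ciphertext = encrypt(bob_public, "hello")
print(decrypt(bob_private, ciphertext))  # hello
```

The server only ever relays `ciphertext`, which is why it can act as a mediator without being able to read the message.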
Every system is vulnerable to failures. To handle such a huge amount of traffic, the service must remain available and fault-tolerant. Our service depends on the chat and transient servers, so it is necessary to address the failure modes of both.
- Chat Server Failure: This is the core component of our system. It is responsible for handling and delivering messages when users are online, and it holds the connections with the users. If it fails, the whole architecture is affected. There are two ways to handle a chat server’s failure. One is to transfer the TCP connections to another server; the other is to let clients re-establish the connection automatically when it is lost.
- Transient Storage Failure: Transient storage is another component prone to failures, and its failure could eventually affect the whole service. A failure here results in the loss of messages in transit to offline users. To counter this, we can replicate each user’s transient storage. Whenever the user comes back online, the replica can then be used to deliver the pending messages. If the original server becomes available again, the original and the replica instances of the user’s transient storage are merged into a single store.
Latency: A messenger service must be real-time to provide a smooth customer experience. Hence latency needs to be minimized by caching frequently queried data. We can use a distributed cache such as Redis to keep user activity status and recent chats in memory.
Availability: Our service must remain available most of the time. The system needs to be fault-tolerant, and for that, we can store multiple copies of the transient messages so that if any message is lost, it can be retrieved from its replicas. This way, the availability of the system is not compromised.
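The caching idea described under Latency can be sketched as a small cache that loads a value on a miss and keeps it for a limited time. A plain dictionary stands in for a distributed cache such as Redis here, and the 30-second lifetime, the class name, and the `load_status_from_db` helper are assumptions for illustration.

```python
class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (value, expires_at)

    def get_or_load(self, key, loader, now):
        """Return the cached value, calling loader(key) on a miss or expiry."""
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]
        value = loader(key)
        self._entries[key] = (value, now + self.ttl)
        return value


db_reads = []


def load_status_from_db(user_id):
    # Hypothetical slow lookup; records each call so cache hits are visible.
    db_reads.append(user_id)
    return f"{user_id}: last seen recently"


cache = TTLCache(ttl_seconds=30)
cache.get_or_load("bob", load_status_from_db, now=0)   # miss, reads the database
cache.get_or_load("bob", load_status_from_db, now=10)  # hit, served from memory
cache.get_or_load("bob", load_status_from_db, now=45)  # expired, reads again
print(len(db_reads))  # 2
```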
The current version of our system supports only a limited set of features, but we can easily extend it to support group chats that deliver messages to multiple users. We can also add video and phone calls, let users post and view each other’s statuses or stories, and support payments or transactions. All these further requirements involve more advanced concepts, which are currently out of this blog’s scope. We will cover these functionalities in the second part of this blog.
Enjoy learning, Enjoy System Design!