Design WhatsApp Messenger

You are almost certainly familiar with WhatsApp; it has become an integral part of our lives. But have you ever thought about how WhatsApp works, and what principles lie behind its design and functioning? In this blog, we’ll cover these fundamental questions and try to design a system of our own, just like WhatsApp. So without further delay, let’s dive deep into the design of WhatsApp!

What is WhatsApp?

WhatsApp is a social messaging platform that allows users to send messages to each other, and it is widely used across the globe. In this blog, we’ll discuss a generic architecture for WhatsApp that could also serve as a base for designing any similar chat application. Let’s get started by discussing the key requirements of our service.

Key Requirements

WhatsApp is a highly scalable system that is accessed very frequently by people across the globe. Hence it should be designed so that it remains reliable and operational almost all the time. The first step is to figure out the essential requirements of the system.

The WhatsApp messenger should have these basic requirements:

  • Support one-on-one conversations.
  • Show Last Seen and message acknowledgment (Sent, Delivered, and Read).
  • Support media (images/videos) and end-to-end encryption.

Let’s estimate the capacity that our service requires.

Capacity Estimation

Our goal is to build a highly scalable platform that could support a massive amount of traffic. Let’s assume we have 10 billion messages sent per day. So we have:

  • 10 billion messages sent per day and 1 billion users
  • Number of active users at peak traffic (per second): 700000 (6X average)
  • Number of messages at peak traffic (per second): 40 million
  • Each message has 160 characters on average, so at one byte per character: 10B * 160 bytes = 1.6 TB of data per day
  • Assume 10 years of service provision: 10 * 1.6 TB * 365 ~ 6 PB.
  • The entire application will comprise several microservices, each performing a specific task. Let’s assume the latency of sending a message is 20 milliseconds and each server can hold 100 concurrent connections. Hence the estimated number of servers required in the chat server fleet = (chat messages per second * latency) / concurrent connections per server = 40M * 20 ms / 100 = 8,000 servers.
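The arithmetic above can be checked with a few lines of Python. This is only a sketch: every input is an assumption taken as stated in this estimate (including the peak rate of 40 million messages per second and one byte per character), and the variable names are our own.

```python
import math

# Inputs are the assumptions stated in the estimate, not measurements.
MESSAGES_PER_DAY = 10_000_000_000
BYTES_PER_MESSAGE = 160              # assumes 1 byte per character
YEARS_OF_SERVICE = 10
PEAK_MESSAGES_PER_SECOND = 40_000_000
LATENCY_MS = 20                      # time a message occupies a connection
CONNECTIONS_PER_SERVER = 100

# Storage: bytes per day, then scaled to 10 years.
daily_storage_tb = MESSAGES_PER_DAY * BYTES_PER_MESSAGE / 1e12
total_storage_pb = daily_storage_tb * 365 * YEARS_OF_SERVICE / 1000

# Messages in flight at any instant = arrival rate * time each one takes.
in_flight = PEAK_MESSAGES_PER_SECOND * LATENCY_MS // 1000
chat_servers = math.ceil(in_flight / CONNECTIONS_PER_SERVER)

print(f"{daily_storage_tb:.1f} TB/day, {total_storage_pb:.2f} PB total")
print(f"{in_flight} messages in flight, {chat_servers} chat servers")
```

The server count is the number of messages in flight at peak divided by how many each server can hold, so doubling the latency or halving the connections per server doubles the fleet.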

High-Level Design

We have two primary services at the heart of this system: the chat service and the transient service. The chat service manages the traffic of messages sent to users who are online, while the transient service deals with messages whose recipient is offline.

  • The chat service is responsible for delivering messages if the recipient is online. It checks whether the message’s recipient is online; if so, it delivers the message instantly. Otherwise, it hands the message to the transient service, which sends it to the recipient whenever they come online.
  • The transient service maintains separate storage that holds messages temporarily until the offline user comes back online.
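The split between the two services can be sketched in memory as follows. This is a minimal illustration, not a production design: the class and method names (`ChatService`, `TransientStore`, `hold`, `drain`) are hypothetical, and a plain list stands in for a live connection to an online user.

```python
from collections import defaultdict, deque


class TransientStore:
    """Holds messages for offline users until they reconnect."""

    def __init__(self):
        self._pending = defaultdict(deque)

    def hold(self, user, message):
        self._pending[user].append(message)

    def drain(self, user):
        """Return and remove everything queued for `user`, oldest first."""
        return list(self._pending.pop(user, ()))


class ChatService:
    def __init__(self, transient):
        self.transient = transient
        self.inboxes = {}  # online user -> list standing in for a connection

    def connect(self, user):
        inbox = self.inboxes.setdefault(user, [])
        # Flush whatever arrived while the user was offline.
        inbox.extend(self.transient.drain(user))
        return inbox

    def disconnect(self, user):
        self.inboxes.pop(user, None)

    def send(self, sender, recipient, text):
        message = (sender, text)
        if recipient in self.inboxes:       # recipient is online
            self.inboxes[recipient].append(message)
            return "delivered"
        self.transient.hold(recipient, message)  # recipient is offline
        return "queued"
```

A message sent to a user who has not connected returns `"queued"`; once that user calls `connect`, the queued messages appear in their inbox in the order they were sent.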

High-Level API Design

We have two high-level APIs for this service: one for sending messages and one for viewing them. We can use the REST architecture to implement them.

a) sendMessage(fromUser, toUser, clientMetaData, message)

This API will be used for sending messages from one user to another.

Parameters: fromUser, toUser, clientMetaData, message

1. fromUser: User who is sending the message
2. toUser: User to whom the message is being sent
3. clientMetaData: Metadata to store the client’s information
4. message: The original message
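As a rough sketch, the request body for this API could be modeled as below. The four field names come from the parameter list above; the class name `SendMessageRequest`, the JSON encoding, and the non-empty checks are assumptions made for illustration only.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SendMessageRequest:
    fromUser: str    # user who is sending the message
    toUser: str      # user to whom the message is being sent
    message: str     # the original message
    clientMetaData: dict = field(default_factory=dict)  # client information

    def to_json(self):
        # Assumed validation: the API description does not specify any rules.
        for name in ("fromUser", "toUser", "message"):
            if not getattr(self, name):
                raise ValueError(f"{name} must not be empty")
        return json.dumps(asdict(self))


request = SendMessageRequest("alice", "bob", "hello", {"device": "android"})
print(request.to_json())
```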

b) Conversation(userId, offset, messageCount, TimeStamp)

This API is used to show the conversations in a thread. Think of this as the view you see when you open WhatsApp. We only want to fetch a few messages per API call for a user, and the offset and messageCount parameters handle this.

Parameters: userId, offset, messageCount, TimeStamp

1. userId: Unique ID of the user
2. offset: Number of messages to skip, used for fetching earlier messages
3. messageCount: Number of messages to return in one call
4. TimeStamp: Last updated time
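The paging behavior can be illustrated with a short sketch. It assumes each user’s messages are stored oldest to newest in a dictionary called `threads` (a stand-in for real storage) and that pages are returned newest first. The TimeStamp parameter is left out because its exact use is not specified here.

```python
def conversation(threads, user_id, offset, message_count):
    """Return one page of a user's messages, newest first."""
    messages = threads.get(user_id, [])   # assumed ordered oldest to newest
    newest_first = messages[::-1]
    # Skip `offset` messages, then take at most `message_count`.
    return newest_first[offset:offset + message_count]


threads = {"alice": ["m1", "m2", "m3", "m4", "m5", "m6", "m7"]}
first_page = conversation(threads, "alice", offset=0, message_count=3)
second_page = conversation(threads, "alice", offset=3, message_count=3)
```

Here `first_page` holds the three newest messages, and raising the offset by the page size walks back through the older ones.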

How do features like last seen, single tick, and double tick work?

The acknowledgment service is the key player behind these features. It keeps generating and checking acknowledgment responses, and the features are implemented on top of them.

Single tick: When the message from User A reaches the server, the server sends an acknowledgment to User A saying that the message has been sent.

Double tick: Once the server delivers the message to User B over the appropriate connection, User B sends an acknowledgment to the server saying that it has received the message. The server then sends another acknowledgment to User A, which displays a double tick.

Blue tick: When User B opens the message, User B sends another acknowledgment to the server saying that the message has been read. The server then sends another acknowledgment to User A, which displays a blue tick.
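On the sender’s side, the three ticks can be viewed as a small state machine driven by acknowledgments. The sketch below is one possible way to model it; the names `Status`, `TICKS`, and `OutgoingMessage` are hypothetical, and ignoring late or out-of-order acknowledgments is an assumption rather than documented WhatsApp behavior.

```python
from enum import IntEnum


class Status(IntEnum):
    PENDING = 0     # not yet acknowledged by the server
    SENT = 1        # server has the message
    DELIVERED = 2   # recipient's device has the message
    READ = 3        # recipient has opened the message


TICKS = {
    Status.PENDING: "",
    Status.SENT: "single tick",
    Status.DELIVERED: "double tick",
    Status.READ: "blue tick",
}


class OutgoingMessage:
    def __init__(self, text):
        self.text = text
        self.status = Status.PENDING

    def on_ack(self, ack):
        # Assumption: a late or duplicate acknowledgment never moves the
        # status backwards.
        if ack > self.status:
            self.status = ack
        return TICKS[self.status]
```

Calling `on_ack(Status.SENT)` on a new message yields `"single tick"`, and a `DELIVERED` acknowledgment that arrives after `READ` leaves the blue tick in place.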

Last seen feature: This feature depends on the heartbeat mechanism. The client sends a heartbeat to the server every 5 seconds, and the server records each user’s last-seen time in a table. Any other user can then query this table to get that user’s last seen status.
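A minimal sketch of such a last-seen table is shown below. The 5-second interval comes from the description above; the class name `LastSeenTable` and the rule that a user counts as online until two intervals pass without a heartbeat are assumptions. Time is passed in explicitly so the example stays deterministic.

```python
class LastSeenTable:
    def __init__(self, heartbeat_interval=5.0):
        self.heartbeat_interval = heartbeat_interval
        self._last_seen = {}  # user -> time of the most recent heartbeat

    def heartbeat(self, user, now):
        """Record a heartbeat received from `user` at time `now` (seconds)."""
        self._last_seen[user] = now

    def last_seen(self, user):
        """Return the last heartbeat time, or None if the user is unknown."""
        return self._last_seen.get(user)

    def is_online(self, user, now):
        seen = self._last_seen.get(user)
        # Assumed grace period: two missed heartbeats mean offline.
        return seen is not None and now - seen <= 2 * self.heartbeat_interval


table = LastSeenTable()
table.heartbeat("bob", now=100.0)
```

With this table, Bob is reported online at time 104.0 and offline at time 120.0, while his last seen value stays at 100.0 until the next heartbeat arrives.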

Key Features Design

One-to-One Communication

This is an essential component of the Chat service. Using this service, one user can easily send messages to another user. Let’s look into how this functions:

Suppose Alice wants to send a message to Bob. The message is directed to the chat server to which Alice is connected, and Alice gets an acknowledgment from that chat server that the message has been sent. The chat server then queries the data storage to find the chat server to which Bob is connected. Alice’s chat server forwards the message to Bob’s chat server, and the message is delivered to Bob using a push mechanism. Bob then sends an acknowledgment back to Alice’s chat server, which in turn informs Alice that the message has been delivered. Finally, when Bob reads the message, a new acknowledgment is sent to Alice saying that the message has been read.
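The lookup-and-forward step in this flow can be sketched as follows. Everything here is illustrative: a shared dictionary called `registry` stands in for the data storage that maps users to chat servers, and the `ChatServer` class and its methods are hypothetical names.

```python
class ChatServer:
    def __init__(self, name, registry):
        self.name = name
        self.registry = registry   # shared mapping: user -> ChatServer
        self.delivered = {}        # user -> list of (sender, text)

    def connect(self, user):
        self.registry[user] = self
        self.delivered.setdefault(user, [])

    def send(self, sender, recipient, text):
        events = ["sent"]          # ack to the sender: the server has it
        target = self.registry.get(recipient)  # which server holds recipient?
        if target is not None:
            target.push(recipient, sender, text)
            events.append("delivered")  # relayed after the recipient's ack
        return events

    def push(self, recipient, sender, text):
        self.delivered[recipient].append((sender, text))


registry = {}
server_a, server_b = ChatServer("A", registry), ChatServer("B", registry)
server_a.connect("alice")
server_b.connect("bob")
events = server_a.send("alice", "bob", "hello")
```

After this call, `events` contains both acknowledgments and the message sits in Bob’s list on server B. A message to a user with no registered server produces only the `"sent"` acknowledgment.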

User Activity Status

Showing the last time a user was active is standard functionality in instant messengers.

Heartbeat between client and server

The figure shows the mechanism that maintains a connection between the client and the server. A bidirectional connection is established between them using WebSockets. Heartbeats are sent over this connection, and the server uses them to monitor the user’s activity status.

End-to-End Encryption

End-to-End encryption is an important feature that allows only the communicating users to read the messages. 

Each user has a key pair: a public key, which is shared with the other participants, and a private key, which is never shared. Suppose two users, Alice and Bob, are communicating with each other. Alice has Bob’s public key and her own private key, and Bob has Alice’s public key and his own private key. When Alice sends a message, she encrypts it with Bob’s public key, so it can only be decrypted with Bob’s private key. Similarly, only Alice can decrypt the messages that Bob sends to her. In this way, only Alice and Bob can read each other’s messages, and the server acts only as a mediator in the whole process.
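To make the public/private key idea concrete, here is a toy example using textbook RSA with very small primes. It is for illustration only: it is not secure, it is not the scheme WhatsApp actually uses, and the function names are our own. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
def make_keypair(p, q, e=17):
    """Build a toy RSA key pair from two small primes (NOT secure)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)        # modular inverse of e, Python 3.8+
    return (e, n), (d, n)      # (public key, private key)


def encrypt(public_key, text):
    e, n = public_key
    # Toy scheme: encrypt each character code separately (requires code < n).
    return [pow(ord(ch), e, n) for ch in text]


def decrypt(private_key, cipher):
    d, n = private_key
    return "".join(chr(pow(c, d, n)) for c in cipher)


# Bob publishes his public key and keeps his private key to himself.
bob_public, bob_private = make_keypair(61, 53)

# Alice encrypts with Bob's public key; the server only sees `cipher`.
cipher = encrypt(bob_public, "hi Bob")

# Only Bob's private key recovers the original text.
plain = decrypt(bob_private, cipher)
```

The server forwarding `cipher` never holds `bob_private`, which is the property the paragraph above describes.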

Bottlenecks

Every system is vulnerable to failures. To handle such a huge amount of traffic, the service must remain available and fault-tolerant. Our service depends on the chat and transient servers, so it is necessary to address the failure modes of both.

  • Chat Server Failure: The chat server is the core component of our system. It handles and delivers messages when users are online, and it holds the connections with those users, so its failure affects the whole architecture. There are two ways to handle a chat server failure: transfer its TCP connections to another server, or have clients automatically re-establish the connection when it is lost.
  • Transient Storage Failure: Transient storage is another component prone to failure, and its failure results in the loss of messages in transit to offline users. To counter this, we can replicate each user’s transient storage. When the user comes back online, the replica can be used to deliver the pending messages. If the original server becomes available again, the original and replica instances of the user’s transient storage are merged into a single store.
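The merge of an original transient store with its replica can be sketched as below. It assumes that every pending message carries a unique message ID and a timestamp, which is not stated above but is needed to drop duplicates and keep the order. The function name is hypothetical.

```python
def merge_transient_stores(original, replica):
    """Merge two copies of a user's pending messages into one store.

    Each message is a (message_id, timestamp, text) tuple. Duplicates are
    dropped by message_id and the result is ordered by timestamp.
    """
    merged = {}
    for message_id, timestamp, text in original + replica:
        merged[message_id] = (message_id, timestamp, text)
    return sorted(merged.values(), key=lambda m: (m[1], m[0]))


original = [(1, 10, "a"), (2, 20, "b")]
replica = [(2, 20, "b"), (3, 30, "c")]
pending = merge_transient_stores(original, replica)
```

Here message 2 exists in both copies but appears once in `pending`, so the user does not receive it twice after the stores are merged.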

Optimizations

Latency: A messenger service must be real-time to provide a smooth customer experience. Hence latency needs to be minimized by caching frequently queried data. We can use a distributed cache such as Redis to keep user activity status and recent chats in memory.
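The caching idea can be sketched with a small in-memory cache that expires entries after a fixed time. The `TTLCache` class below is only a stand-in for a distributed cache such as Redis, and `get_status` and `load_from_db` are hypothetical names. Time is passed in explicitly to keep the example deterministic.

```python
class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key, now):
        entry = self._entries.get(key)
        if entry is None or entry[1] <= now:
            self._entries.pop(key, None)   # missing or expired
            return None
        return entry[0]

    def set(self, key, value, now):
        self._entries[key] = (value, now + self.ttl)


def get_status(user, cache, load_from_db, now):
    """Read through the cache; fall back to the database on a miss."""
    status = cache.get(user, now)
    if status is None:
        status = load_from_db(user)
        cache.set(user, status, now)
    return status
```

Repeated lookups for the same user within the expiry window are served from memory, and the database is queried again only after the entry expires.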

Availability: Our service must remain available almost all the time. The system needs to be fault-tolerant, and for that we can store multiple copies of the transient messages so that if any message is lost, it can be retrieved from its replicas without compromising the availability of the system.

Further Requirements

The current version of our system supports only a limited set of features, but we can extend it to support group chats that deliver messages to multiple users. We can also add video and phone calls, let users post and view each other’s statuses or stories, and allow payments or transactions. All these further requirements need more advanced concepts, which are currently out of this blog’s scope. We will cover these functionalities in the second part of this blog.
