Design Whatsapp or Messenger
Stefan Mai
Understanding the Problem
Functional Requirements
Core Requirements
- Users should be able to send/receive messages to other users.
- Users should be able to start group chats with multiple participants (limit 100).
- Users should be able to send/receive media in their messages.
- Users should be able to receive messages sent while they are not online (up to 30 days).
Below the line (out of scope)
- Audio/Video calling.
- Interactions with businesses.
- Registration and profile management.
Non-Functional Requirements
Core Requirements
- Messages should be delivered to available users with low latency, < 500ms.
- We should guarantee deliverability of messages - they should make their way to users.
- The system should be able to handle billions of users with high throughput (TBD - cover this later).
- Messages should be stored on centralized servers no longer than necessary.
- The system should be resilient against failures of individual components.
Below the line (out of scope)
- Exhaustive treatment of security concerns.
- Spam and scraping prevention systems.
The Set Up
Planning the Approach
Before you move on to designing the system, it's important to start by taking a moment to plan your strategy. Since 1-1 messages are simply a special case of larger chats, we have two core challenges for this problem: (1) making sure messages get to the users in a chat (eventually) without losing them and (2) having these messages arrive in real-time.
For this problem we'll start with (1) in making sure messages actually get to users eventually, then add real-time components in (2). We'll address media attachments afterwards since we should be able to bolt them on. Then we'll deep dive into optimizations and scaling with the time we have available.
So our plan is this order:
- Durable (eventual) delivery of messages to users.
- Real-time delivery of messages to online users.
- Handling of attachments.
Defining the Core Entities
The core entities for Whatsapp are fairly simple. We need:
- Users
- Chats (2-100 users)
- Messages
- Clients (a user might have multiple devices)
We'll use this language to reason through the problem.
API or System Interface
Our external-facing API for this system is likely a bi-directional socket connection. For this interview, we'll just use websockets although a simple TLS connection would do. The idea will be that users will open the app and connect to the server, opening this socket.
Since our clients can go offline, we'll need to be able to "sync" state between the client and server. For convenience, let's just assume the server will buffer events sent while the client is offline and, when the client connects, it receives all unsent messages. We'll also assume TCP-style guarantees (delivery, order) here.
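As a minimal sketch of the client side of this handshake (the endpoint, the `sync` message, and the `lastEventId` cursor are illustrative assumptions, not a fixed protocol):

```typescript
// Client-side connect-and-sync sketch (browser WebSocket API).
let lastEventId = localStorage.getItem("lastEventId") ?? "";

const socket = new WebSocket("wss://chat.example.com/connect");

socket.addEventListener("open", () => {
  // Ask the server to replay any events buffered while we were offline.
  socket.send(JSON.stringify({ type: "sync", lastEventId }));
});

socket.addEventListener("message", (event) => {
  const serverEvent = JSON.parse(String(event.data));
  // Handle newMessage, chatUpdate, etc. here, then advance our cursor so
  // the next sync only asks for what we haven't seen.
  lastEventId = serverEvent.eventId;
  localStorage.setItem("lastEventId", lastEventId);
});
```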
Let's talk about the messages we might want to exchange on the socket, starting with messages we want to send.
First, let's be able to create a chat.
// -> createChat { "participants": [], "name": "" } -> { "chatId": "" }
Now we should be able to send messages on the chat.
// -> sendMessage { "chatId": "", "message": "", "attachments": [] } -> "SUCCESS" | "FAILURE"
We need a way to create attachments (note: I'm going to amend this later in the writeup).
// -> createAttachment { "body": ..., "hash": "" } -> { "attachmentId": "" }
And we need a way to add/remove users to the chat.
// -> modifyChatParticipants { "chatId": "", "userId": "", "operation": "ADD" | "REMOVE" } -> "SUCCESS" | "FAILURE"
Each of these messages will have a parallel message that is sent to other clients. Each of these will be formally acknowledged so the server can keep track of delivery.
When a chat is created or updated ...
// <- chatUpdate { "chatId": "", "participants": [] } -> "RECEIVED"
When a message is received ...
// <- newMessage { "chatId": "", "userId": "", "message": "", "attachments": [] } -> "RECEIVED"
Etc ...
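One way to summarize the whole protocol compactly is a pair of TypeScript unions (a sketch; the shapes mirror the messages above):

```typescript
// Client -> server requests.
type ClientRequest =
  | { type: "createChat"; participants: string[]; name: string }
  | { type: "sendMessage"; chatId: string; message: string; attachments: string[] }
  | { type: "createAttachment"; body: Blob; hash: string }
  | { type: "modifyChatParticipants"; chatId: string; userId: string; operation: "ADD" | "REMOVE" };

// Server -> client events, each acknowledged with a RECEIVED reply.
type ServerEvent =
  | { type: "chatUpdate"; chatId: string; participants: string[] }
  | { type: "newMessage"; chatId: string; userId: string; message: string; attachments: string[] };
```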
Note that enumerating all of these APIs can take time! In the actual interview, I might shortcut by only writing the message names and not the full API. It's also usually a good idea to summarize the API initially before you build out the high-level design in case things need to change. "I'll come back to this as I learn more" is completely acceptable! My whiteboard might look like this:
[Whiteboard sketch: the socket messages above, listed by name]
Now that we have a base to work with let's figure out how we can make them real while we satisfy our requirements.
High-Level Design
1) Durable (eventual) delivery of messages to users.
To satisfy the first and second requirements, we'll need a way to create a chat and then send a message. We'll start with a basic, polling solution running on a single host and then build it out from there.
For creating a chat, this is straightforward. We need a way for users to connect to our service and for us to store the participants in a database. We'll use a basic service for this.
For our database, let's use MySQL. We can get fancy with the storage engine and data layout later but the important thing here is that we can write chats to the database and then look up participants when we're sending messages. We'll have a chat table and a chat_participant table.
When we want to send a message, we need to first send the message to the server so that it's persisted, then get events out to all our clients so that they can show the message to the recipient user. This pattern of events happening that clients need to know about is going to be common to all changes that happen on the server: chat creation, updates, participant changes, etc. So instead of making this message-specific, we'll create a new events table which functions as a per-client queue of events that each client needs to process to get up to date.
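As a sketch, the row shapes for these tables might look like the following (column names are assumptions, not a fixed schema):

```typescript
// Sketch of the core tables as row types; names are illustrative.
interface ChatRow {
  chatId: string;
  name: string;
}

interface ChatParticipantRow {
  chatId: string; // FK -> chat
  userId: string; // indexed so we can fan out messages by chat
}

interface MessageRow {
  messageId: string;
  chatId: string;
  senderId: string;
  body: string;
  sentAt: Date;
}

interface EventRow {
  eventId: string;
  userId: string;    // the recipient this event is queued for
  messageId: string; // a reference rather than a copy of the message
  createdAt: Date;
}
```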
Thus, to send a message (sketched in code after this list), we'll:
- Look up all participants in the chat via the chat_participant table.
- In a transaction, create a message in the message table and create entries in events for each participant that needs to receive the message. We'll simply reference the user's id, the message id, and a timestamp.
- Confirm the message back to the sending client.
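In code, the send path might look like the following sketch, assuming a hypothetical `db` handle with transaction support (any MySQL client with transactions would do):

```typescript
// Hypothetical minimal DB interface for illustration.
declare const db: {
  query(sql: string, params: unknown[]): Promise<Array<Record<string, any>>>;
  transaction<T>(
    fn: (tx: { insert(table: string, row: object): Promise<string> }) => Promise<T>
  ): Promise<T>;
};

async function sendMessage(senderId: string, chatId: string, body: string) {
  // 1. Look up everyone who needs to receive this message.
  const participants = await db.query(
    "SELECT user_id FROM chat_participant WHERE chat_id = ?", [chatId]);

  // 2. Atomically persist the message and queue one event per recipient.
  await db.transaction(async (tx) => {
    const messageId = await tx.insert("message", { chatId, senderId, body });
    for (const { user_id } of participants) {
      await tx.insert("events", {
        userId: user_id,
        messageId,
        createdAt: new Date(), // deleted once the client acknowledges it
      });
    }
  });

  // 3. Confirm back to the sending client.
  return "SUCCESS";
}
```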
On the recipient side, our clients are polling for these messages and events periodically. Once clients receive an event they'll acknowledge receipt and we can delete the entry from events. We can have a cleanup service which runs some dumb queries to find messages and events which either (a) are 30 days old and thus need to be discarded per our non-functional requirements, or (b) have been delivered to all recipients, keeping our database somewhat manageable in size.
So far this works very poorly because we're assuming clients are polling us for new messages and chats. But importantly all messages and updates will get successfully delivered to clients eventually, which is great. So let's solve the latency issue next.
2) Real-time delivery of messages to online users.
In order for our system to become real-time, we need to send deltas to the clients when things change: the user is added or removed from a chat, a new message is sent, etc. In our single-node setup this is pretty easy: since all (!) of the users of our system are connected to the single node, we can keep a hash of the user id to the websocket connection they've made to us. When a new event is created in events and needs to be sent to a user we can look up that user id and send them the event.
If the user isn't currently connected, or the connection is broken we don't have any problems - the user will get the events when they reconnect. But if they are, this websocket ensures the event is delivered right as the event is created.
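A sketch of that single-node registry, assuming the Node `ws` package and a hypothetical `authenticate` helper:

```typescript
import { IncomingMessage } from "node:http";
import { WebSocket, WebSocketServer } from "ws";

declare function authenticate(req: IncomingMessage): string; // hypothetical

// Live connections on this (single) node, keyed by user id.
const connections = new Map<string, WebSocket>();

const wss = new WebSocketServer({ port: 8080 });
wss.on("connection", (socket, req) => {
  const userId = authenticate(req);
  connections.set(userId, socket);
  socket.on("close", () => connections.delete(userId));
});

// Called whenever a new row lands in the events table.
function pushEvent(userId: string, event: object) {
  const socket = connections.get(userId);
  // If the user isn't connected, do nothing: they'll pick the event up
  // from the events table when they reconnect.
  if (socket?.readyState === WebSocket.OPEN) {
    socket.send(JSON.stringify(event));
  }
}
```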
3) Users should be able to send/receive media in their messages.
Users sending and receiving media is annoying. It's bandwidth- and storage-intensive. While we could potentially do this with our Chat Server and database, it's better to use purpose-built technologies for this. This is in fact how Whatsapp actually worked: attachments were uploaded via a separate HTTP service.
Bad Solution: Keep attachments in DB
Approach
The worst approach is to have the Chat Server accept the attachment media and save it in our database. Then we'll need to add an additional message type for users to retrieve attachments.
Challenges
This is not a good solution. First, most databases (in this case MySQL) aren't going to be optimized for handling binary blobs. Second, we're crippling the bandwidth available to our Chat Servers by occupying them with comparatively dumb storage and retrieval.
Good Solution: Send attachments via chat server
Approach
A straightforward approach is to have the Chat Server accept the attachment media, then push it off to blob storage with a TTL of 30 days. Users who want to retrieve a particular attachment can then query the blob storage directly (via a pre-signed URL for authorization). Ideally, we'd find a way to expire the media once it had been received by all recipients. While we could put a CDN in front of our blob storage, since we're capped at 100 participants the cache benefits are going to be relatively small.
Challenges
Unfortunately, our Chat Servers still have to handle the incoming media and forward it to the blob storage (a wasted step). Expiring attachments once they've been downloaded by all recipients isn't handled. Managing encryption and security will require extra steps.
Great Solution: Manage attachments separately
Approach
An ideal approach is to give our users permission (e.g. via pre-signed URLs) to upload directly to the blob storage. From there, the solution works much like the "Good" one: users retrieve attachments directly from blob storage via pre-signed URLs, we'd ideally expire media once all recipients have received it, and a CDN adds relatively little given our 100-participant cap.
Challenges
Expiring attachments once they've been downloaded by all recipients isn't handled. Managing encryption and security will require extra steps.
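As a sketch of the pre-signed upload grant, assuming S3 as the blob store (the bucket name and 15-minute expiry are illustrative; the 30-day TTL would be a bucket lifecycle rule):

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "us-east-1" });

// Called by the Chat Server when a client asks to upload an attachment.
async function createUploadUrl(attachmentId: string): Promise<string> {
  const command = new PutObjectCommand({
    Bucket: "chat-attachments", // lifecycle rule expires objects after 30 days
    Key: attachmentId,
  });
  // The client PUTs the media straight to blob storage with this URL,
  // so the bytes never pass through our Chat Servers.
  return getSignedUrl(s3, command, { expiresIn: 900 });
}
```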
Ok awesome, so we have a system which has real-time delivery of messages, persistence to handle offline use-cases, and attachments.
Potential Deep Dives
With the core functional requirements met, it's time to dig into the non-functional requirements via deep dives. This includes solving obvious scalability issues as well as auxiliary questions which demonstrate your command of system design.
1) How can we handle billions of simultaneous users?
Our single-host system is convenient but unrealistic. Serving billions of users via a single machine isn't possible and it would make deployments and failures a nightmare. So what can we do? The obvious answer is to try to scale out the number of Chat Servers we have. But this introduces some clear problems: now the sending and receiving users might be connected to different hosts.
If we have 1b users, we might expect 200m of them to be connected at any one time. Whatsapp famously served 1-2m users per host, but even at that density we'll need hundreds of chat servers. That's a lot of simultaneous connections.
What do we do here?
Bad Solution: Naively horizontally scale
Approach
The most naive (broken) solution is to put a load balancer in front of our Chat Servers and scale horizontally.
Challenges
This won't work. A given server might accept a message to be sent to a Chat, but we're no longer guaranteed it will have the connections to each of the clients who needs to receive it. We won't be able to deliver the events and messages!
Bad Solution: Keep a Kafka topic per user
Approach
Many candidates instinctively reach for a queue or stream in order to solve the scaling problem, e.g. using a Kafka topic for every user in the system. The idea here would be that we write to that topic for every event and our chat servers can subscribe to the topic to ferry incoming messages back to the user.
Challenges
This unfortunately doesn't work - Kafka is not built for billions of topics. There are potential fixes that you might conceive, like creating "super topics" which are groupings of many users, but you'll quickly find yourself reinventing the good aspects of the alternative solutions below with little of the benefit.
Good Solution: Consistent Hashing of Chat Servers
Approach
Another approach is for us to use consistent hashing to assign users to specific Chat Servers. We'll use a service like ZooKeeper to keep track of which Chat Servers own which region of the hash space.
When a request comes in, we'll route the user to the Chat Server assigned to their user id. When a new event is created, the originating Chat Server will connect directly to the Chat Server that "owns" the recipient's user id to hand over the notification for delivery.
Challenges
Each Chat Server will need to maintain connections with every other Chat Server, which pushes us toward keeping our Chat Servers large in size and small in number. Increasing the number of Chat Servers requires careful orchestration of dropping connections so that users reconnect to other servers without triggering a thundering herd. During scaling, events need to be sent to both the old and new owners of a hash range to prevent dropping messages.
All of these are solvable problems but your interviewer will expect you to talk about them and how to pull it off.
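To make the mechanics concrete, here's a minimal consistent-hash ring sketch; a production version would watch membership changes via the coordination service rather than taking a static server list:

```typescript
import { createHash } from "node:crypto";

// Maps a user id to the Chat Server that "owns" it. Virtual nodes
// (vnodes) smooth the distribution across the hash space.
// Assumes at least one server in the list.
class HashRing {
  private ring: Array<{ point: number; server: string }> = [];

  constructor(servers: string[], vnodes = 100) {
    for (const server of servers) {
      for (let i = 0; i < vnodes; i++) {
        this.ring.push({ point: this.hash(`${server}#${i}`), server });
      }
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  private hash(key: string): number {
    return createHash("md5").update(key).digest().readUInt32BE(0);
  }

  // The first point on the ring at or after the key's hash owns the key;
  // wrap around to the start if we fall off the end.
  ownerOf(userId: string): string {
    const h = this.hash(userId);
    const entry = this.ring.find((e) => e.point >= h) ?? this.ring[0];
    return entry.server;
  }
}

const ring = new HashRing(["chat-1:9000", "chat-2:9000", "chat-3:9000"]);
console.log(ring.ownerOf("user-42")); // which Chat Server to dial
```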
Great Solution: Offload to Pub/Sub
Approach
The last approach is to use a purpose-built system for the pub/sub aspects. We could use Redis pub/sub, for example. Redis pub/sub has "at most once" guarantees which is fine for us because we're providing deliverability guarantees via our backend. Whenever a user connects to our Chat Server, we'll subscribe to the event topic for that user. When new events are created, we'll publish the event via Redis pub/sub and, if a client is connected, it will receive the message.
Since we're abstracting the socket state to Redis, we can introduce a simple least-connections load balancer in front of the Chat Server to evenly distribute load.
Beneath the covers, Redis is assigning each channel to a "slot" which is sharded across the cluster. To send a message to a given user, we need to connect to the master responsible for that slot. In the degenerate case, a message sent to a large chat will require the Chat Server to connect to all nodes in the Redis cluster.
Challenges
The Pub/Sub implementation introduces additional latency because we need to ferry messages via Redis. We lose fine-grained control over allocation which means we risk hot shard issues on the Redis side if too many clients are assigned to a single node in the cluster. The loss of a node in our Redis cluster can introduce thundering herd issues as clients reconnect. Etc.
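A sketch of the Chat Server's pub/sub plumbing, assuming the `ioredis` client and one channel per user:

```typescript
import Redis from "ioredis";

const sub = new Redis(); // subscriber connections can't publish...
const pub = new Redis(); // ...so we keep a second connection for publishing

// Sockets connected to *this* Chat Server, keyed by user id.
const sockets = new Map<string, { send(data: string): void }>();

// Route every published event to the locally connected socket, if any.
sub.on("message", (channel, payload) => {
  const userId = channel.slice("events:".length);
  sockets.get(userId)?.send(payload);
});

async function onUserConnected(userId: string, socket: { send(data: string): void }) {
  sockets.set(userId, socket);
  await sub.subscribe(`events:${userId}`);
}

// Called wherever events are created. At-most-once delivery is fine here
// because the events table remains the durable source of truth.
async function publishEvent(userId: string, event: object) {
  await pub.publish(`events:${userId}`, JSON.stringify(event));
}
```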
2) How can we optimize the chat lookups?
One of the most frequent read paths in our system is looking up all participants for a given chat on every message send (so we know which recipients need to receive the message). Rather than pounding our database, we can cache this data with an LRU cache, invalidating the entry whenever the participant set changes so it never goes stale.
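A minimal in-process LRU sketch for that participant lookup (sizes and names are illustrative):

```typescript
// LRU cache of chat id -> participant user ids. A Map iterates in
// insertion order, so the first key is always the least recently used.
class ParticipantCache {
  private cache = new Map<string, string[]>();
  constructor(private maxEntries = 100_000) {}

  get(chatId: string): string[] | undefined {
    const participants = this.cache.get(chatId);
    if (participants !== undefined) {
      // Re-insert to mark this entry as most recently used.
      this.cache.delete(chatId);
      this.cache.set(chatId, participants);
    }
    return participants;
  }

  set(chatId: string, participants: string[]) {
    this.cache.delete(chatId);
    this.cache.set(chatId, participants);
    if (this.cache.size > this.maxEntries) {
      // Evict the least recently used entry.
      this.cache.delete(this.cache.keys().next().value!);
    }
  }

  // Call on any participant change so reads never go stale.
  invalidate(chatId: string) {
    this.cache.delete(chatId);
  }
}
```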
3) How can we handle high throughput and global users?
Let's assume a long tail of chat sizes, with each message delivered to 3-4 recipients on average, and that our 1b users send 10 messages per day on average. With ~100k seconds in a day, we need to handle 100k messages created per second and 300-400k messages delivered per second on average. That's a lot of throughput on our database (although achievable with most clustered databases on the market today).
For storage, if we assume 50% of those messages are delivered within a few seconds and the other 50% wait on average a day for their recipients, we accumulate 10b messages/day × 50% = 5b messages temporarily stored on our servers. At roughly 1 KB per message, that's about 5 TB of storage, which is very achievable for a small database cluster.
Bad Solution: Partition the Database
Approach
For the size, fortunately our data is very partitionable/shardable. Since we limit the number of users per chat to 100, there's a hard (and low) cap on the number of times a given message might need to be read. Across tables, our data is fairly evenly distributed across primary keys.
Challenges
Users across the globe might be sending messages to one another and haphazardly assigning e.g. an EU user to a US database shard would be disastrous for performance. In the worst case we have messages ping-ponging between services and databases across the ocean.
Good Solution: Regionalize the Service and Partition the Database
Approach
Most users are going to be located primarily in one region. We can partition both our service and database across regions and statically assign users to a particular region. While we may have instances where we need to send messages across oceans, we'd expect the majority of messages to be sent within a region, saving expensive cross-region calls.
Challenges
Some users may migrate between regions and some chats may have users who span multiple regions. There are lots of optimization opportunities to co-locate the chats with the majority of users.
4) What do we do to handle multiple clients for a given user?
To this point we've assumed a user has a single device, but many users have multiple devices: a phone, a tablet, a desktop or laptop - maybe even a work computer. This introduces some new problems.
- First, we'll need to add a way for our design to resolve a user to 1 or more clients that may be active at any one time.
- Second, we need a way to deactivate clients so that we're not unnecessarily storing messages for a client which does not exist any longer.
- Lastly, we need to update our eventing system so that it can handle multiple clients.
Let's see if we can account for this with minimal changes to our design.
- We'll need to create a new clients table to keep track of clients by user id.
- Our participant cache can now store a list of clients rather than just users to avoid blowing up our primary read path. When a new client connects we'll need to invalidate the relevant entries.
- We can overload our Cleanup Service to also dispose of clients which have not signed in for 30 days. The client can message the user that they'll need to start from scratch.
- We'll need to update our events table to be per-client rather than per-user (see the sketch below).
We'll probably want to introduce some limits (e.g. 10 clients per account) to avoid blowing up our storage and throughput. Since the participant cache is shielding us from the brunt of the reads, the main scaling factor is increased writes to our events table - if we assume 2 clients per user on average, we've doubled the throughput and doubled the storage. Given we had some headroom based on our calculations above, we should be ok.
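Concretely, the only change to the earlier send sketch is the fan-out target: one event row per active client rather than per user. A sketch, reusing the hypothetical transaction handle from before:

```typescript
// Hypothetical transaction handle, mirroring the earlier send sketch.
interface Tx {
  query(sql: string, params: unknown[]): Promise<Array<Record<string, any>>>;
  insert(table: string, row: object): Promise<string>;
}

// Queue one event per client of each recipient user.
async function fanOutEvent(tx: Tx, messageId: string, userIds: string[]) {
  for (const userId of userIds) {
    const clients = await tx.query(
      "SELECT client_id FROM clients WHERE user_id = ?", [userId]);
    for (const { client_id } of clients) {
      await tx.insert("events", {
        clientId: client_id,
        messageId,
        createdAt: new Date(),
      });
    }
  }
}
```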
What is Expected at Each Level?
Ok, that was a lot. You may be thinking, “how much of that is actually required from me in an interview?” Let’s break it down.
Mid-level
Breadth vs. Depth: A mid-level candidate will be mostly focused on breadth (80% vs 20%). You should be able to craft a high-level design that meets the functional requirements you've defined, but many of the components will be abstractions with which you only have surface-level familiarity.
Probing the Basics: Your interviewer will spend some time probing the basics to confirm that you know what each component in your system does. For example, if you use websockets, expect that they may ask what they are and how they work (at a high level). In short, the interviewer is not taking anything for granted with respect to your knowledge.
Mixture of Driving and Taking the Backseat: You should drive the early stages of the interview in particular, but the interviewer doesn’t expect that you are able to proactively recognize problems in your design with high precision. Because of this, it’s reasonable that they will take over and drive the later stages of the interview while probing your design.
The Bar for Whatsapp: For this question, an E4 candidate will have clearly defined the API and landed on a high-level design that is functional and meets the requirements. Their scaling solution will have rough edges but they'll have some knowledge of its flaws.
Senior
Depth of Expertise: As a senior candidate, expectations shift towards more in-depth knowledge — about 60% breadth and 40% depth. This means you should be able to go into technical details in areas where you have hands-on experience. It's crucial that you demonstrate a deep understanding of key concepts and technologies relevant to the task at hand.
Advanced System Design: You should be familiar with advanced system design principles. For example, knowing about the consistent hashing necessary for this problem is essential. You’re also expected to understand the mechanics of long-running sockets. Your ability to navigate these advanced topics with confidence and clarity is key.
Articulating Architectural Decisions: You should be able to clearly articulate the pros and cons of different architectural choices, especially how they impact scalability, performance, and maintainability. You justify your decisions and explain the trade-offs involved in your design choices.
Problem-Solving and Proactivity: You should demonstrate strong problem-solving skills and a proactive approach. This includes anticipating potential challenges in your designs and suggesting improvements. You need to be adept at identifying and addressing bottlenecks, optimizing performance, and ensuring system reliability.
The Bar for Whatsapp: For this question, E5 candidates are expected to speed through the initial high-level design so they can spend time discussing, in detail, scaling and robustness issues in the design. They should also be able to discuss the pros and cons of different architectural choices, especially how they impact scalability, performance, and maintainability.
Staff+
Emphasis on Depth: As a staff+ candidate, the expectation is a deep dive into the nuances of system design — I'm looking for about 40% breadth and 60% depth in your understanding. This level is all about demonstrating that, while you may not have solved this particular problem before, you have solved enough problems in the real world to be able to confidently design a solution backed by your experience.
You should know which technologies to use, not just in theory but in practice, and be able to draw from your past experiences to explain how they’d be applied to solve specific problems effectively. The interviewer knows you know the small stuff so you can breeze through that at a high level so you have time to get into what is interesting.
High Degree of Proactivity: At this level, an exceptional degree of proactivity is expected. You should be able to identify and solve issues independently, demonstrating a strong ability to recognize and address the core challenges in system design. This involves not just responding to problems as they arise but anticipating them and implementing preemptive solutions. Your interviewer should intervene only to focus, not to steer.
Practical Application of Technology: You should be well-versed in the practical application of various technologies. Your experience should guide the conversation, showing a clear understanding of how different tools and systems can be configured in real-world scenarios to meet specific requirements.
Complex Problem-Solving and Decision-Making: Your problem-solving skills should be top-notch. This means not only being able to tackle complex technical challenges but also making informed decisions that consider various factors such as scalability, performance, reliability, and maintenance.
Advanced System Design and Scalability: Your approach to system design should be advanced, focusing on scalability and reliability, especially under high load conditions. This includes a thorough understanding of distributed systems, load balancing, caching strategies, and other advanced concepts necessary for building robust, scalable systems.
The Bar for Whatsapp: For a staff+ candidate, expectations are high regarding depth and quality of solutions, particularly for the complex scenarios discussed earlier. Great candidates are going 2 or 3 levels deep to discuss failure modes, bottlenecks, and other issues with their design. There's ample discussion to be had around fault tolerance, database optimization, regionalization and cell-based architecture and more.