Design Whatsapp or Messenger
Stefan Mai
Understanding the Problem
Functional Requirements
Core Requirements
- Users should be able to send/receive messages to other users.
- Users should be able to start group chats with multiple participants (limit 100).
- Users should be able to send/receive media in their messages.
- Users should be able to receive messages sent while they are not online (up to 30 days).
Below the line (out of scope)
- Audio/Video calling.
- Interactions with businesses.
- Registration and profile management.
Non-Functional Requirements
Core Requirements
- Messages should be delivered to available users with low latency, < 500ms.
- We should guarantee deliverability of messages - they should make their way to users.
- The system should be able to handle billions of users with high throughput (TBD - cover this later).
- Messages should be stored on centralized servers no longer than necessary.
- The system should be resilient against failures of individual components.
Below the line (out of scope)
- Exhaustive treatment of security concerns.
- Spam and scraping prevention systems.
The Set Up
Planning the Approach
Before you move on to designing the system, it's important to start by taking a moment to plan your strategy. Since 1-1 messages are simply a special case of larger chats, we have two core challenges for this problem: (1) making sure messages get to the users in a chat (eventually) without losing them and (2) having these messages arrive in real-time.
For this problem we'll start with (1), making sure messages actually get to users eventually, then add the real-time components in (2). We'll address media attachments afterwards since we should be able to bolt them on. Then we'll deep dive into optimizations and scaling with the time we have available.
So our plan is this order:
- Durable (eventual) delivery of messages to users.
- Real-time delivery of messages to online users.
- Handling of attachments.
Defining the Core Entities
The core entities for Whatsapp are fairly simple. We need:
- Users
- Chats (2-100 users)
- Messages
- Clients (a user might have multiple devices)
We'll use this language to reason through the problem.
API or System Interface
Our external-facing API for this system is likely a bi-directional socket connection. For this interview, we'll just use websockets although a simple TLS connection would do. The idea will be that users will open the app and connect to the server, opening this socket.
Since our clients can go offline, we'll need to be able to "sync" state between the client and server. For convenience, let's just assume the server will buffer events sent while the client is offline and, when the client connects, it receives all unsent messages. We'll also assume TCP-style guarantees (delivery, order) here.
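To make this concrete, here's a minimal sketch of the connect-and-sync flow using the Python websockets library. The authenticate, event_store, and dispatch_message helpers are hypothetical stand-ins for pieces we'll flesh out below.

```python
# Minimal sketch of connect-then-sync, assuming the Python `websockets` library.
# `authenticate`, `event_store`, and `dispatch_message` are hypothetical helpers.
import asyncio
import json

import websockets

async def handle_client(ws):
    user_id = await authenticate(ws)  # hypothetical auth handshake

    # On connect, flush everything buffered while this client was offline.
    for event in event_store.fetch_undelivered(user_id):
        await ws.send(json.dumps(event))
        ack = json.loads(await ws.recv())
        if ack.get("status") == "RECEIVED":
            event_store.ack(user_id, event["eventId"])  # safe to delete server-side

    # Then handle live traffic (createChat, sendMessage, ...) on the same socket.
    async for raw in ws:
        await dispatch_message(user_id, json.loads(raw))

async def main():
    async with websockets.serve(handle_client, "0.0.0.0", 8080):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```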
Let's talk about the messages we might want to exchange on the socket, starting with messages we want to send.
First, let's be able to create a chat.
// -> createChat { "participants": [], "name": "" } -> { "chatId": "" }
Now we should be able to send messages on the chat.
// -> sendMessage { "chatId": "", "message": "", "attachments": [] } -> "SUCCESS" | "FAILURE"
We need a way to create attachments (note: I'm going to amend this later in the writeup).
// -> createAttachment { "body": ..., "hash": "" } -> { "attachmentId": "" }
And we need a way to add/remove users to the chat.
// -> modifyChatParticipants { "chatId": "", "userId": "", "operation": "ADD" | "REMOVE" } -> "SUCCESS" | "FAILURE"
Each of these messages will have a parallel message that is sent to other clients. Each of these will be formally acknowledged so the server can keep track of delivery.
When a chat is created or updated ...
// <- chatUpdate { "chatId": "", "participants": [] } -> "RECEIVED"
When a message is received ...
// <- newMessage { "chatId": "", "userId": "", "message": "", "attachments": [] } -> "RECEIVED"
Etc ...
Note that enumerating all of these APIs can take time! In the actual interview, I might shortcut by only writing the message names and not the full API. It's also usually a good idea to summarize the API initially before you build out the high-level design in case things need to change. "I'll come back to this as I learn more" is completely acceptable!
Now that we have a base to work with let's figure out how we can make them real while we satisfy our requirements.
High-Level Design
1) Durable (eventual) delivery of messages to users.
To satisfy the first and second requirements, we'll need a way to create a chat and then send a message. We'll start with a basic, polling solution running on a single host and then build it out from there.
For creating a chat, this is straightforward. We need a way for users to connect to our service and for us to store the participants in a database. We'll use a basic service for this.
For our database, let's use MySQL. We can get fancy with the storage engine and data layout later but the important thing here is that we can write chats to the database and then look up participants when we're sending messages. We'll have a chat table and a chat_participant table.
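As a concrete sketch, the DDL might look something like this (the exact columns and types are my assumptions, not prescribed by the design):

```python
# Illustrative MySQL DDL for the two tables; columns/types are assumptions.
CHAT_DDL = """
CREATE TABLE chat (
    chat_id    BIGINT PRIMARY KEY AUTO_INCREMENT,
    name       VARCHAR(255),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
"""

CHAT_PARTICIPANT_DDL = """
CREATE TABLE chat_participant (
    chat_id BIGINT NOT NULL,
    user_id BIGINT NOT NULL,
    PRIMARY KEY (chat_id, user_id),  -- "who's in this chat?" lookups
    INDEX idx_user (user_id)         -- "which chats is this user in?" lookups
)
"""
```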
When we want to send a message, we need to first send the message to the server so that it's persisted, then get events out to all of our clients so they can show the message to the recipient user. This pattern of server-side changes that clients need to know about is going to be common to everything that happens on the server: chat creation, updates, participant changes, etc. So instead of making this message-specific, we'll create a new events table which functions as a queue, for a given client, of events they need to process to get up to date.
Thus, to send a message (see the sketch after this list), we'll:
- Look up all participants in the chat via the chat_participant table.
- In a transaction, create a message in the message table and create entries in events for each participant that needs to receive the message. We'll simply reference the user's id, the message id, and a timestamp.
- Confirm the message back to the sending client.
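Here's a minimal sketch of that flow, assuming a DB-API-style connection db plus message and events tables alongside the schema above (all names illustrative):

```python
# Sketch of the send path: read participants, then atomically persist the
# message and fan out one event row per recipient. Schema is illustrative.
import time

def send_message(db, chat_id: int, sender_id: int, body: str) -> int:
    cur = db.cursor()

    # 1. Look up all participants in the chat.
    cur.execute(
        "SELECT user_id FROM chat_participant WHERE chat_id = %s", (chat_id,)
    )
    recipients = [row[0] for row in cur.fetchall()]

    # 2. In one transaction, persist the message and create an event per
    #    participant so delivery can be tracked individually.
    now = int(time.time())
    try:
        cur.execute(
            "INSERT INTO message (chat_id, sender_id, body, sent_at) "
            "VALUES (%s, %s, %s, %s)",
            (chat_id, sender_id, body, now),
        )
        message_id = cur.lastrowid
        cur.executemany(
            "INSERT INTO events (user_id, message_id, created_at) "
            "VALUES (%s, %s, %s)",
            [(uid, message_id, now) for uid in recipients],
        )
        db.commit()
    except Exception:
        db.rollback()
        raise

    # 3. Caller confirms "SUCCESS" back to the sending client.
    return message_id
```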
On the recipient side, our clients are polling for these messages and events periodically. Once clients receive an event, they'll acknowledge receipt and we can delete the entry from events. We can have a cleanup service which runs some dumb queries to find messages and events which either (a) are 30 days old and thus need to be discarded per our non-functional requirements, or (b) have been delivered to all recipients, keeping our database somewhat manageable in size.
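The cleanup service's queries really can be that dumb; here's a sketch under the same schema assumptions:

```python
# Periodic cleanup sketch (same illustrative schema as above).
THIRTY_DAYS = 30 * 24 * 3600

def cleanup(db, now: int) -> None:
    cur = db.cursor()
    # (a) Discard anything past our 30-day retention window.
    cur.execute("DELETE FROM events WHERE created_at < %s", (now - THIRTY_DAYS,))
    cur.execute("DELETE FROM message WHERE sent_at < %s", (now - THIRTY_DAYS,))
    # (b) Drop messages every recipient has acknowledged, i.e. messages with
    #     no remaining undelivered event rows pointing at them.
    cur.execute(
        "DELETE FROM message WHERE NOT EXISTS ("
        "  SELECT 1 FROM events WHERE events.message_id = message.message_id)"
    )
    db.commit()
```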
So far this works very poorly because we're assuming clients are polling us for new messages and chats. But importantly all messages and updates will get successfully delivered to clients eventually, which is great. So let's solve the latency issue next.
2) Real-time delivery of messages to online users.
In order for our system to become real-time, we need to send deltas to the clients when things change: the user is added or removed from a chat, a new message is sent, etc. In our single-node setup this is pretty easy: since all (!) of the users of our system are connected to the single node, we can keep a hash from user id to the websocket connection they've made to us. When a new event is created in events and needs to be sent to a user, we can look up that user id and send them the event.
If the user isn't currently connected, or the connection is broken, we don't have any problems: the user will get the events when they reconnect. But if they are connected, this websocket ensures the event is delivered the moment it's created.
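A sketch of that push path, keeping the user-to-socket map in process memory (the wiring is assumed; the events table remains the durable source of truth):

```python
# Single-node real-time push: user id -> live websocket connection.
import json

connections: dict[int, object] = {}  # populated on connect, removed on disconnect

async def push_event(user_id: int, event: dict) -> None:
    ws = connections.get(user_id)
    if ws is None:
        return  # offline: the row in `events` waits until they reconnect
    try:
        await ws.send(json.dumps(event))  # delete the event row once the client acks
    except Exception:
        # Broken socket: treat as offline and rely on the sync-on-reconnect path.
        connections.pop(user_id, None)
```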
3) Users should be able to send/receive media in their messages.
Users sending and receiving media is annoying. It's bandwidth- and storage-intensive. While we could potentially do this with our Chat Server and database, it's better to use purpose-built technologies. This is in fact how Whatsapp actually worked: attachments were uploaded via a separate HTTP service.
Bad Solution: Keep attachments in DB
Approach
The worst approach is to have the Chat Server accept the attachment media and save it in our database. Then we'll need to add an additional message type for users to retrieve attachments.
Challenges
This is not a good solution. First, most databases (in this case MySQL) aren't optimized for handling binary blobs. Second, we're crippling the bandwidth available to our Chat Servers by occupying them with comparatively dumb storage and retrieval.
Good Solution: Send attachments via chat server
Approach
A straightforward approach is to have the Chat Server accept the attachment media, then push it off to blob storage with a TTL of 30 days. Users who want to retrieve a particular attachment can then query the blob storage directly (via a pre-signed URL for authorization). Ideally, we'd find a way to expire the media once it had been received by all recipients. While we could put a CDN in front of our blob storage, since we're capped at 100 participants the cache benefits are going to be relatively small.
Challenges
Unfortunately, our Chat Servers still have to handle the incoming media and forward it to the blob storage (a wasted step). Expiring attachments once they've been downloaded by all recipients isn't handled. Managing encryption and security will require extra steps.
Great Solution: Manage attachments separately
Approach
An ideal approach is to give our users permission (e.g. via pre-signed URLs) to upload directly to the blob storage. From there, our solution works much like the "Good" solution: users who want to retrieve a particular attachment query the blob storage directly (again via a pre-signed URL for authorization), and ideally we'd find a way to expire the media once it had been received by all recipients. While we could put a CDN in front of our blob storage, since we're capped at 100 participants the cache benefits are going to be relatively small.
Challenges
Expiring attachments once they've been downloaded by all recipients isn't handled. Managing encryption and security will require extra steps.
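As a sketch, here's how the pre-signed URLs might be minted with boto3 against S3. The bucket name and key scheme are assumptions, and the 30-day expiry would be configured as a bucket lifecycle rule rather than in this code:

```python
# Direct-to-S3 attachment flow via pre-signed URLs (bucket/key are assumptions).
import boto3

s3 = boto3.client("s3")
BUCKET = "chat-attachments"  # hypothetical bucket with a 30-day lifecycle rule

def create_attachment_urls(attachment_id: str) -> dict:
    key = f"attachments/{attachment_id}"
    upload_url = s3.generate_presigned_url(
        "put_object", Params={"Bucket": BUCKET, "Key": key}, ExpiresIn=900
    )
    download_url = s3.generate_presigned_url(
        "get_object", Params={"Bucket": BUCKET, "Key": key}, ExpiresIn=900
    )
    # The client PUTs media to upload_url directly; the Chat Server never touches
    # the bytes. Recipients GET via download_url (or a fresh one minted on demand).
    return {"uploadUrl": upload_url, "downloadUrl": download_url}
```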
Ok awesome, so we have a system which has real-time delivery of messages, persistence to handle offline use-cases, and attachments.
Potential Deep Dives
With the core functional requirements met, it's time to dig into the non-functional requirements via deep dives. This includes solving obvious scalability issues as well as auxiliary questions which demonstrate your command of system design.
1) How can we handle billions of simultaneous users?
Our single-host system is convenient but unrealistic. Serving billions of users via a single machine isn't possible and it would make deployments and failures a nightmare. So what can we do? The obvious answer is to try to scale out the number of Chat Servers we have. But this introduces some clear problems: now the sending and receiving users might be connected to different hosts.
If we have 1b users, we might expect 200m of them to be connected at any one time. Whatsapp famously served 1-2m users per host, but even at that density we'll need hundreds of chat servers. That's a lot of simultaneous connections.
What do we do here?
Bad Solution: Naively horizontally scale
Approach
The most naive (broken) solution is to put a load balancer in front of our Chat Servers and scale horizontally.
Challenges
This won't work. A given server might accept a message to be sent to a chat, but we're no longer guaranteed it will have the connections to each of the clients who need to receive it. We won't be able to deliver the events and messages!
Bad Solution: Keep a Kafka topic per user
Approach
Many candidates instinctively reach for a queue or stream in order to solve the scaling problem, e.g. using a Kafka topic for every user in the system. The idea here would be that we write to that topic for every event and our chat servers can subscribe to the topic to ferry incoming messages back to the user.
Challenges
This unfortunately doesn't work - Kafka is not built for billions of topics. There are potential fixes you might conceive, like creating "super topics" which group many users together, but you'll quickly find yourself reinventing the good aspects of the alternative solutions below with little of the benefit.
Good Solution: Consistent Hashing of Chat Servers
Approach
Another approach is for us to use consistent hashing to assign users to specific Chat Servers. We'll use a service like Zookeeper to keep track of which shards own which region of the hash space.
When a request comes in, we'll connect them to the Chat Server they are assigned to based on their user id. When a new event is created, Chat Servers will connect directly with the Chat Server that "owns" that user id to give them a notification to deliver to their user.
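To make the routing concrete, here's a toy consistent-hash ring; a real deployment would populate membership from Zookeeper and handle ring changes, but lookup works the same way:

```python
# Toy consistent-hash ring mapping user ids to Chat Servers.
import bisect
import hashlib

class Ring:
    def __init__(self, servers: list[str], vnodes: int = 100):
        # Virtual nodes smooth out the distribution across servers.
        self._ring = sorted(
            (self._hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, user_id: str) -> str:
        # Walk clockwise to the first virtual node at or past the user's hash.
        idx = bisect.bisect(self._keys, self._hash(user_id)) % len(self._keys)
        return self._ring[idx][1]

ring = Ring(["chat-1", "chat-2", "chat-3"])
print(ring.owner("user-42"))  # the Chat Server holding user-42's connection
```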
Challenges
Each Chat Server will need to maintain connections with every other Chat Server, which will require us to keep our Chat Servers large and thus few in number. Increasing the number of Chat Servers requires careful orchestration of dropping connections so that users reconnect to other servers without triggering a thundering herd. During scaling, we need to ensure that events are sent to both the old and new owners of a hash range to prevent dropping messages.
All of these are solvable problems but your interviewer will expect you to talk about them and how to pull it off.
Great Solution: Offload to Pub/Sub
Approach
The last approach is to use a purpose-built system for the pub/sub aspects. We could use Redis pub/sub, for example. Redis pub/sub has "at most once" guarantees which is fine for us because we're providing deliverability guarantees via our backend. Whenever a user connects to our Chat Server, we'll subscribe to the event topic for that user. When new events are created, we'll publish the event via Redis pub/sub and, if a client is connected, it will receive the message.
Since we're abstracting the socket state to Redis, we can introduce a simple least-connections load balancer in front of the Chat Server to evenly distribute load.
Beneath the covers, Redis is assigning each channel to a "slot" which is sharded across the cluster. To send a message to a given user, we need to connect to the master responsible for that slot. In the degenerate case, a message sent to a large chat will require the Chat Server to connect to all nodes in the Redis cluster.
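A sketch of the hand-off with redis-py; the events:{userId} channel naming is my assumption, and a single-node client is shown where production would use a cluster client:

```python
# Redis pub/sub hand-off between Chat Servers (channel naming is an assumption).
import json
import redis

r = redis.Redis(host="redis-host", port=6379)

def publish_event(user_id: int, event: dict) -> None:
    # Fire-and-forget: at-most-once is fine since durability lives in `events`.
    r.publish(f"events:{user_id}", json.dumps(event))

def stream_events_for_user(user_id: int):
    # Called when a user's websocket lands on this Chat Server.
    pubsub = r.pubsub()
    pubsub.subscribe(f"events:{user_id}")
    for msg in pubsub.listen():
        if msg["type"] == "message":
            yield json.loads(msg["data"])  # forward over the user's websocket
```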
Challenges
The Pub/Sub implementation introduces additional latency because we need to ferry messages via Redis. We lose fine-grained control over allocation which means we risk hot shard issues on the Redis side if too many clients are assigned to a single node in the cluster. The loss of a node in our Redis cluster can introduce thundering herd issues as clients reconnect. Etc.
2) How can we optimize the chat lookups?
One of the most frequent read paths of our system is when we need to look up all participants for a given chat on every message send (so we know which recipients need to receive the message). Rather than pounding our database, we can cache this data with an LRU cache. We can expire this cache whenever the participant set changes so it's not stale.
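A sketch of that cache using cachetools (a real deployment might prefer a shared cache like Redis; the wiring here is an assumption):

```python
# Participant cache with explicit invalidation on membership changes.
from cachetools import LRUCache

participant_cache: LRUCache = LRUCache(maxsize=1_000_000)

def get_participants(db, chat_id: int) -> list[int]:
    users = participant_cache.get(chat_id)
    if users is None:  # miss: fall through to the database
        cur = db.cursor()
        cur.execute(
            "SELECT user_id FROM chat_participant WHERE chat_id = %s", (chat_id,)
        )
        users = [row[0] for row in cur.fetchall()]
        participant_cache[chat_id] = users
    return users

def on_participants_changed(chat_id: int) -> None:
    # Invalidate so the next send re-reads the fresh membership list.
    participant_cache.pop(chat_id, None)
```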
3) How can we handle high throughput and global users?
Let's assume our 1b users send 10 messages per day on average, or 10b messages per day. If we assume 50% of messages are delivered within a few seconds and 50% are delivered on average 1 day later, that means we need to accumulate 10b messages * 50% or 5b messages temporarily stored on our servers. If we assume messages average about 1kb, we're using 5tb of storage which is very achievable for a small database cluster.
Let's also assume a long tail of chat sizes such that each message needs to be delivered to 3-4 recipients. With ~100k seconds in a day, we need to handle 100k messages created per second on average and 300-400k messages delivered per second on average. That's a lot of throughput on our database (although achievable with most clustered databases on the market today).
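Sanity-checking that napkin math:

```python
# Back-of-envelope numbers from the prose above (rounded interview math).
users = 1_000_000_000
msgs_per_day = users * 10             # 10b messages/day
buffered = msgs_per_day // 2          # ~50% sit around ~1 day -> 5b messages
storage_tb = buffered * 1_000 / 1e12  # ~1kb each -> ~5 TB buffered

seconds_per_day = 100_000             # ~86,400, rounded for easy math
sends_per_sec = msgs_per_day / seconds_per_day  # ~100k creates/sec
deliveries_per_sec = sends_per_sec * 3.5        # 3-4 recipients each

print(f"~{storage_tb:.0f}TB stored, ~{sends_per_sec:,.0f} sends/s, "
      f"~{deliveries_per_sec:,.0f} deliveries/s")
```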
Bad Solution: Partition the Database
Approach
For the size, fortunately our data is very partitionable/shardable. Since we limit the number of users per chat to 100, there's a hard (and low) cap on the number of times a given message might need to be read. Across tables, our data is fairly evenly distributed across primary keys.
Challenges
Users across the globe might be sending messages to one another and haphazardly assigning e.g. an EU user to a US database shard would be disastrous for performance. In the worst case we have messages ping-ponging between services and databases across the ocean.
Good Solution: Regionalize the Service and Partition the Database
Approach
Most users are going to be located primarily in one region. We can partition both our service and database across regions and statically assign users to a particular region. While we may have instances where we need to send messages across oceans, we'd expect the majority of messages to be sent within region saving expensive calls.
Challenges
Some users may migrate between regions and some chats may have users who span multiple regions. There are lots of optimization opportunities to co-locate the chats with the majority of users.
4) What do we do to handle multiple clients for a given user?
To this point we've assumed a user has a single device, but many users have multiple devices: a phone, a tablet, a desktop or laptop - maybe even a work computer. This introduces some new problems.
- First, we'll need to add a way for our design to resolve a user to 1 or more clients that may be active at any one time.
- Second, we need a way to deactivate clients so that we're not unnecessarily storing messages for a client that no longer exists.
- Lastly, we need to update our eventing system so that it can handle multiple clients.
Let's see if we can account for this with minimal changes to our design; a sketch follows the list below.
- We'll need to create a new clients table to keep track of clients by user id.
- Our participant cache can now store a list of clients rather than just users to avoid blowing up our primary read path. When a new client connects, we'll need to invalidate the relevant cache entries.
- We can overload our Cleanup Service to also dispose of clients which have not signed in for 30 days. The client can notify the user that they'll need to start from scratch.
- We'll need to update our events table to be per-client rather than per-user.
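Here's a sketch of the per-client fan-out those changes imply (table and column names are assumptions):

```python
# Per-client event fan-out (illustrative schema: clients + per-client events).
MAX_CLIENTS_PER_USER = 10  # cap to bound storage/throughput growth

def active_clients(db, user_id: int) -> list[int]:
    cur = db.cursor()
    cur.execute(
        "SELECT client_id FROM clients "
        "WHERE user_id = %s AND last_seen > NOW() - INTERVAL 30 DAY",
        (user_id,),
    )
    return [row[0] for row in cur.fetchall()]

def fan_out(db, message_id: int, participant_user_ids: list[int], now: int) -> None:
    # Events are keyed by client, not user: each device acks independently.
    rows = [
        (client_id, message_id, now)
        for uid in participant_user_ids
        for client_id in active_clients(db, uid)
    ]
    db.cursor().executemany(
        "INSERT INTO events (client_id, message_id, created_at) "
        "VALUES (%s, %s, %s)",
        rows,
    )
    db.commit()
```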
We'll probably want to introduce some limits (say, 10 clients per account) to avoid blowing up our storage and throughput. Since the participant cache is shielding us from the brunt of reads, the main scaling factor is increased writes to our events table - if we assume 2 clients per user on average, we've doubled both the throughput and the storage. Given we had some headroom based on our calculations above, we should be ok.
What is Expected at Each Level?
Ok, that was a lot. You may be thinking, “how much of that is actually required from me in an interview?” Let’s break it down.
Mid-level
Breadth vs. Depth: A mid-level candidate will be mostly focused on breadth (80% vs 20%). You should be able to craft a high-level design that meets the functional requirements you've defined, but many of the components will be abstractions with which you only have surface-level familiarity.
Probing the Basics: Your interviewer will spend some time probing the basics to confirm that you know what each component in your system does. For example, if you use websockets, expect that they may ask you what it does and how they work (at a high level). In short, the interviewer is not taking anything for granted with respect to your knowledge.
Mixture of Driving and Taking the Backseat: You should drive the early stages of the interview in particular, but the interviewer doesn’t expect that you are able to proactively recognize problems in your design with high precision. Because of this, it’s reasonable that they will take over and drive the later stages of the interview while probing your design.
The Bar for Whatsapp: For this question, an E4 candidate will have clearly defined the API and landed on a high-level design that is functional and meets the requirements. Their scaling solution will have rough edges, but they'll have some knowledge of its flaws.
Senior
Depth of Expertise: As a senior candidate, expectations shift towards more in-depth knowledge — about 60% breadth and 40% depth. This means you should be able to go into technical details in areas where you have hands-on experience. It's crucial that you demonstrate a deep understanding of key concepts and technologies relevant to the task at hand.
Advanced System Design: You should be familiar with advanced system design principles. For example, knowing about the consistent hashing necessary for this problem is essential. You’re also expected to understand the mechanics of long-running sockets. Your ability to navigate these advanced topics with confidence and clarity is key.
Articulating Architectural Decisions: You should be able to clearly articulate the pros and cons of different architectural choices, especially how they impact scalability, performance, and maintainability. You justify your decisions and explain the trade-offs involved in your design choices.
Problem-Solving and Proactivity: You should demonstrate strong problem-solving skills and a proactive approach. This includes anticipating potential challenges in your designs and suggesting improvements. You need to be adept at identifying and addressing bottlenecks, optimizing performance, and ensuring system reliability.
The Bar for Whatsapp: For this question, an E5 candidate is expected to speed through the initial high-level design so they can spend time discussing, in detail, the scaling and robustness issues in the design. You should also be able to discuss the pros and cons of different architectural choices, especially how they impact scalability, performance, and maintainability.
Staff+
Emphasis on Depth: As a staff+ candidate, the expectation is a deep dive into the nuances of system design — I'm looking for about 40% breadth and 60% depth in your understanding. This level is all about demonstrating that, while you may not have solved this particular problem before, you have solved enough problems in the real world to be able to confidently design a solution backed by your experience.
You should know which technologies to use, not just in theory but in practice, and be able to draw from your past experiences to explain how they'd be applied to solve specific problems effectively. The interviewer knows you know the small stuff, so breeze through it at a high level and spend your time on what's interesting.
High Degree of Proactivity: At this level, an exceptional degree of proactivity is expected. You should be able to identify and solve issues independently, demonstrating a strong ability to recognize and address the core challenges in system design. This involves not just responding to problems as they arise but anticipating them and implementing preemptive solutions. Your interviewer should intervene only to focus, not to steer.
Practical Application of Technology: You should be well-versed in the practical application of various technologies. Your experience should guide the conversation, showing a clear understanding of how different tools and systems can be configured in real-world scenarios to meet specific requirements.
Complex Problem-Solving and Decision-Making: Your problem-solving skills should be top-notch. This means not only being able to tackle complex technical challenges but also making informed decisions that consider various factors such as scalability, performance, reliability, and maintenance.
Advanced System Design and Scalability: Your approach to system design should be advanced, focusing on scalability and reliability, especially under high load conditions. This includes a thorough understanding of distributed systems, load balancing, caching strategies, and other advanced concepts necessary for building robust, scalable systems.
The Bar for Whatsapp: For a staff+ candidate, expectations are high regarding depth and quality of solutions, particularly for the complex scenarios discussed earlier. Great candidates go 2 or 3 levels deep to discuss failure modes, bottlenecks, and other issues with their design. There's ample discussion to be had around fault tolerance, database optimization, regionalization, cell-based architecture, and more.