Design Facebook Live Comments
Evan King
hard
35 min
Understanding the Problem
Functional Requirements
Core Requirements
- Viewers can post comments on a Live video feed.
- Viewers can see all comments in near real-time as they are posted.
- Viewers can see comments made before they joined the live feed.
Below the line (out of scope):
- Viewers can reply to comments
- Viewers can react to comments
Non-Functional Requirements
Core Requirements
- The system should scale to support millions of concurrent viewers and thousands of comments per second.
- The system should prioritize availability over consistency; eventual consistency is fine.
- The system should have low latency, broadcasting comments to viewers in near real-time.
Below the line (out of scope):
- The system should be secure, ensuring that only authorized users can post comments.
- The system should enforce integrity constraints, ensuring that comments are appropriate (i.e., no spam, hate speech, etc.)
Here's how your requirements section might look on your whiteboard:
The Set Up
Planning the Approach
Before you move on to designing the system, it's important to start by taking a moment to plan your strategy. Fortunately, for these common product style system design questions, the plan should be straightforward: build your design up sequentially, going one by one through your functional requirements. This will help you stay focused and ensure you don't get lost in the weeds as you go. Once you've satisfied the functional requirements, you'll rely on your non-functional requirements to guide you through the deep dives.
Defining the Core Entities
I like to begin with a broad overview of the primary entities. Initially, establishing these key entities will guide our thought process and lay a solid foundation as we progress towards defining the API.
For this particular problem, we only have three core entities:
- User: A user can be a viewer or a broadcaster.
- Live Video: The video that is being broadcasted by a user (this is owned and managed by a different team, but is relevant as we will need to integrate with it).
- Comment: The message posted by a user on a live video.
In your interview, this can be as simple as a bulleted list like:
Now, let's carry on to outline the API, tackling each functional requirement in sequence. This step-by-step approach will help us maintain focus and manage scope effectively.
API or System Interface
We'll need a simple POST endpoint to create a comment.
POST /comment/create
Header: JWT | SessionToken
{
  "liveVideoId": "123",
  "message": "Cool video!"
}
We also need to be able to fetch past comments for a given live video.
GET /comments/:liveVideoId
Pagination will be important for this endpoint. More on that later when we get deeper into the design.
High-Level Design
To get started with our high-level design, let's begin by addressing the first functional requirement.
1) Viewers can post comments on a Live video feed
First things first, we need to make sure that users are able to post a comment.
This should be rather simple. Users will initiate a POST request to the POST /comment/create endpoint with the live video id and the comment message. The server will then validate the request and store the comment in the database.
- Commenter Client: The commenter client is a web or mobile application that allows users to post comments on a live video feed. It is responsible for authenticating the user and sending the comment to the Comment Management Service.
- Load Balancer: Its primary purpose is to distribute incoming application traffic across multiple targets, such as the Comment Management Service, in this case. This increases the availability and fault tolerance of the application.
- Comment Management Service: The comment management service is responsible for creating and querying comments. It receives comments from the commenter client and stores them in the comments database. It will also be responsible for retrieving comments from the comments database and sending them to the viewer client -- more on that later.
- Comments Database: The comments database is a NoSQL document database like DynamoDB or MongoDB that stores each comment as a JSON-like document. This is a good fit for our use case because comments are simple records that don't require complex relationships or transactions (an example document shape is sketched below).
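To make the data model concrete, here's a minimal sketch of what a comment document might look like. The field names and id scheme are illustrative assumptions, not a fixed schema:

```typescript
// Illustrative comment document shape (field names are assumptions).
interface Comment {
  commentId: string;   // unique and ideally time-sortable (e.g., a snowflake-style id)
  liveVideoId: string; // partition key: all comments for a video live together
  authorId: string;    // the user who posted the comment
  message: string;
  createdAt: number;   // epoch millis; useful for ordering and pagination cursors
}
```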
Let's walk through exactly what happens when a user posts a new comment.
- The user drafts a comment from their device (commenter client)
- The commenter client sends the comment to the comment management service via the POST /comment/create API endpoint which is exposed by the load balancer.
- The comment management service receives the request and stores the comment in the comments database (a sketch of this handler follows below).
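As a rough sketch, the write path could look something like this, assuming an Express server; the auth middleware and database client below are stand-ins for real implementations:

```typescript
import express, { Request, Response, NextFunction } from "express";
import { randomUUID } from "crypto";

const app = express();
app.use(express.json());

// Stand-in auth middleware: a real implementation would validate the JWT or
// session token from the header and attach the caller's user id.
function authenticate(req: Request, _res: Response, next: NextFunction) {
  (req as any).userId = "user-123"; // hypothetical: resolved from the token
  next();
}

// Stand-in for the comments database client (e.g., DynamoDB or MongoDB).
const commentsDb = {
  async insert(_comment: object) {
    /* persist the comment document */
  },
};

app.post("/comment/create", authenticate, async (req: Request, res: Response) => {
  const { liveVideoId, message } = req.body ?? {};
  if (!liveVideoId || typeof message !== "string" || !message.trim()) {
    res.status(400).json({ error: "liveVideoId and message are required" });
    return;
  }
  const comment = {
    commentId: randomUUID(),       // in practice, a time-sortable id
    liveVideoId,
    authorId: (req as any).userId, // set by the auth middleware
    message: message.trim(),
    createdAt: Date.now(),
  };
  await commentsDb.insert(comment);
  res.status(201).json(comment);
});

app.listen(3000);
```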
Ok that was easy, but things get a little more complicated when we start to consider how users will view comments.
2) Viewers can see all comments in near real-time as they are posted
When a user posts a comment, we need to figure out how to broadcast that comment to all viewers in near real-time. Let's dive into some of our options:
Bad Solution: Polling
Approach
A working, though naive, approach is to have the clients poll for new comments every few seconds. We would use the GET /comments/:liveVideoId endpoint that returns the recent comments for a given live video. The client would then poll this endpoint every few seconds to check for new comments and append them to the list of comments displayed on the screen.
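For illustration, a naive polling client might look like the sketch below, where `liveVideoId` and `renderComment` are hypothetical stand-ins for the stream being watched and a UI hook:

```typescript
// Naive polling sketch: re-fetch recent comments on a timer and append
// anything we haven't rendered yet.
declare const liveVideoId: string;
declare function renderComment(comment: { commentId: string }): void;

const seen = new Set<string>();

setInterval(async () => {
  const res = await fetch(`/comments/${liveVideoId}`);
  const comments: { commentId: string }[] = await res.json();
  for (const c of comments) {
    if (!seen.has(c.commentId)) {
      seen.add(c.commentId);
      renderComment(c);
    }
  }
}, 3000); // most polls return nothing new, which is pure wasted load
```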
Challenges
This is a simple approach that works well for small-scale applications, but it doesn't scale. As the number of comments and viewers grows, the polling frequency will need to increase to keep up with demand. This puts a lot of strain on the database and results in many unnecessary requests (since most of the time there will be no new comments to fetch). To meet our requirement of "near real-time" comments, we would need to poll the database every few milliseconds, which isn't feasible.
Good Solution: Websockets
Approach
A better approach is to use websockets. Websockets provide a two-way communication channel between a client and a server: the client opens a connection and keeps it open, and the server pushes new data over that connection without requiring additional requests. When a new comment arrives, the Comment Management Server distributes it to all connected clients, enabling them to update the comment feed. This approach is more efficient than polling because the server can push new comments to clients the moment they are created.
Challenges
Websockets are a good solution, and for real-time chat applications that have a more balanced read/write ratio, they are optimal. However, for our use case, the read/write ratio is not balanced. Comment creation is a relatively infrequent event, so while most viewers will never post a comment, they will be viewing/reading all comments. Because of this imbalance, it doesn't make sense to open a two-way communication channel for each viewer, given that the overhead of maintaining the connection is high.
Great Solution: Server Sent Events (SSE)
Approach
A better approach is to use Server Sent Events (SSE). SSE is a persistent connection just like websockets, but it is unidirectional and goes over HTTP instead of a separate protocol, which makes it easier to set up and compatible with existing infrastructure. The server can send data to the client, but the client cannot send data to the server. Why is this better for our use case? The reason lies in the uneven read >> write ratio. Most viewers will never post a comment, but all viewers need to be able to see new comments. The infrequent writes go over HTTPS via a POST request as discussed earlier, while the frequent reads are better served by SSE.
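Here's a minimal sketch of an SSE endpoint, assuming Express and an in-memory map from live video id to open connections (the route path is an assumption):

```typescript
import express, { Request, Response } from "express";

const app = express();

// One SSE connection per viewer, grouped by live video so a new comment can
// be fanned out to everyone watching that video.
const viewersByVideo = new Map<string, Set<Response>>();

app.get("/comments/:liveVideoId/stream", (req: Request, res: Response) => {
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });

  const { liveVideoId } = req.params;
  if (!viewersByVideo.has(liveVideoId)) viewersByVideo.set(liveVideoId, new Set());
  viewersByVideo.get(liveVideoId)!.add(res);

  // Stop tracking the connection when the viewer disconnects.
  req.on("close", () => viewersByVideo.get(liveVideoId)?.delete(res));
});

// Called after a new comment is persisted.
function broadcast(liveVideoId: string, comment: { commentId: string }) {
  for (const res of viewersByVideo.get(liveVideoId) ?? []) {
    // The `id:` field lets a reconnecting client resume via Last-Event-ID.
    res.write(`id: ${comment.commentId}\ndata: ${JSON.stringify(comment)}\n\n`);
  }
}
```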
Challenges
SSE comes with its own set of challenges, especially when used in conjunction with load balancers. One of the primary issues is maintaining the persistent connection in environments where connections are routinely balanced across multiple servers. This can disrupt the continuous stream of data, requiring careful configuration of the load balancer to support session persistence or "sticky sessions." This ensures that once a client establishes a connection with a server, all subsequent data for that session is routed to the same server.
Handling reconnections gracefully is crucial. Although SSE automatically tries to reconnect when a connection is lost, ensuring that clients can seamlessly resume receiving updates without missing any data requires careful implementation on the server side.
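On the client side, the browser's built-in EventSource handles reconnection automatically, resending the id of the last event it received so the server can backfill anything missed. A sketch, again with `liveVideoId` and `renderComment` as hypothetical stand-ins:

```typescript
declare const liveVideoId: string;
declare function renderComment(comment: unknown): void;

const source = new EventSource(`/comments/${liveVideoId}/stream`);

source.onmessage = (event) => {
  renderComment(JSON.parse(event.data)); // append to the comment feed
};

source.onerror = () => {
  // EventSource retries on its own; optionally surface a "reconnecting" state.
};
```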
Here is our updated flow:
- User posts a comment and it is persisted to the database (as explained above)
- In order for all viewers to see the comment, the Comment Management Service will send the comment over SSE to all connected clients that are subscribed to that live video.
- The Commenter Client will receive the comment and add it to the comment feed for the viewer to see.
3) Viewers can see comments made before they joined the live feed
When a user joins a live video the expectation is that, while they immediately see the new comments being added in real-time, they also see the comments that were posted before they joined the live feed. Additionally, if they want to scroll up in the comment feed, they should be able to continue to load older comments. This is a common pattern in chat style applications and is known as "infinite scrolling".
We can easily fetch the most recent N comments by directly querying the database with our GET /comments/:liveVideoId endpoint. However, this endpoint will not allow us to fetch older comments. To do that, we will need to introduce pagination. Pagination is a common technique used to break up a large set of results into smaller chunks. It is typically used in conjunction with infinite scrolling to allow users to load more results as they scroll down the page.
Here are our options when it comes to implementing pagination:
Bad Solution: Offset Pagination
Approach
The simplest approach is to use offset pagination. Offset pagination is a technique that uses an offset to specify the starting point for fetching a set of results. Initially, the offset is set to 0 to load the most recent comments, and it increases by the number of comments fetched each time the user scrolls to load more (known as the page size). While this approach is straightforward to implement, it poses significant challenges in the context of a fast-moving, high-volume comment feed.
Example request: GET /comments/:liveVideoId?offset=0&pagesize=10
Challenges
First, offset pagination is inefficient as the volume of comments grows. The database must scan past all rows preceding the offset for each query, leading to slower response times as comment volume increases. Most importantly, offset pagination is not stable: if a comment is added or deleted while the user is scrolling, the offsets shift, and the user will see duplicate or missing comments.
Good Solution: Cursor Pagination
Approach
A better approach is to use cursor pagination. Cursor pagination is a technique that uses a cursor to specify the starting point for fetching a set of results. The cursor is a unique identifier that points to a specific item in the list of results. Initially, the cursor is set to the most recent comment, and it is updated each time the user scrolls to load more. This approach is more efficient than offset pagination because the database does not need to count through all rows preceding the cursor for each query (assuming we built an index on the cursor field). Additionally, cursor pagination is stable, meaning that if a comment is added or deleted while the user is scrolling, the cursor will still point to the correct item in the list of results.
Example request: GET /comments/:liveVideoId?cursor={last_comment_id}&pagesize=10
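As an illustration, here's what a cursor-paginated read could look like, assuming the comments live in MongoDB with an index on (liveVideoId, _id); any database with an indexed, sortable id works the same way:

```typescript
import { MongoClient, ObjectId } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
const comments = client.db("live").collection("comments");

// ObjectIds are roughly time-ordered, so the last _id of a page doubles as a
// stable cursor for the next request.
async function getComments(liveVideoId: string, cursor?: string, pageSize = 10) {
  await client.connect(); // no-op if already connected

  const filter: Record<string, unknown> = { liveVideoId };
  if (cursor) filter._id = { $lt: new ObjectId(cursor) }; // strictly older than cursor

  const page = await comments
    .find(filter)
    .sort({ _id: -1 }) // newest first
    .limit(pageSize)
    .toArray();

  const nextCursor = page.length ? String(page[page.length - 1]._id) : null;
  return { page, nextCursor };
}
```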
Challenges
While cursor pagination reduces database load compared to offset pagination, it still requires a database query for each new page of results, which can be significant in a high-traffic environment like ours.
Great Solution: Cursor Pagination with Prefetching and Caching
Approach
A great solution combines cursor pagination with prefetching and caching. In this approach, not only does the cursor point to a specific item in the result set, but the system also prefetches a larger set of results (e.g., pagesize * N) and stores them in a cache. This means that when a user scrolls for more comments, the application can quickly retrieve these prefetched comments from the cache, reducing database queries and improving response times.
Challenges
The primary challenge with this approach is managing the cache effectively, especially in a highly dynamic environment where comments are frequently added. Ensuring the cache stays synchronized with the latest data while minimizing overhead can be complex. Additionally, this approach requires more sophisticated infrastructure for cache management and might involve more intricate logic to handle cache invalidation and updates.
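A sketch of the prefetch-and-cache layer, assuming Redis for the cache; `queryDb`, the prefetch factor, and the 60-second TTL are all illustrative assumptions:

```typescript
import Redis from "ioredis";

// `queryDb` stands in for the cursor-paginated database read shown earlier.
declare function queryDb(
  liveVideoId: string,
  cursor: string,
  limit: number
): Promise<{ commentId: string }[]>;

const cache = new Redis();
const PREFETCH_FACTOR = 5; // over-fetch 5 pages' worth per database query

async function getCommentsCached(liveVideoId: string, cursor: string, pageSize = 10) {
  const key = `comments:${liveVideoId}:${cursor}`;
  const cached = await cache.get(key);
  if (cached) return JSON.parse(cached); // cache hit: no database query at all

  // Cache miss: over-fetch, return the first page, park the rest for later.
  const rows = await queryDb(liveVideoId, cursor, pageSize * PREFETCH_FACTOR);
  for (let i = pageSize; i < rows.length; i += pageSize) {
    const pageCursor = rows[i - 1].commentId; // the cursor the client will send next
    await cache.set(
      `comments:${liveVideoId}:${pageCursor}`,
      JSON.stringify(rows.slice(i, i + pageSize)),
      "EX",
      60
    );
  }
  return rows.slice(0, pageSize);
}
```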
Having applied the "Great Solution", your whiteboard might look something like this. Note that this satisfies all of our functional requirements, but does not yet satisfy our non-functional requirements. That's okay, we'll get to that in the next section.
Potential Deep Dives
1) How will the system scale to support millions of concurrent viewers?
We already landed on Server Sent Events (SSE) as the appropriate technology. Now we need to figure out how to scale it. With SSE, we need to maintain an open connection for each viewer. Each open connection consumes server resources like memory and file descriptors, which in practice caps a single server at somewhere in the tens to low hundreds of thousands of concurrent connections. (The often-quoted 65,535 figure is a limit on ports, which constrains outbound connections between a single client/server pair, not inbound connections to a server's one listening port.) If we want to support millions of concurrent viewers, we will need to scale horizontally by adding more servers. The question then becomes: how do we distribute the load across multiple servers, and how does each server know which comments to send to which viewers?
Bad Solution: Horizontal Scaling with Load Balancer and Pub/Sub
Approach
The first thing we need to do is separate out the write and read traffic by creating Realtime Messaging Servers that are responsible for sending comments to viewers. We separate this out because the write traffic is much lower than the read traffic and we need to be able to scale the read traffic independently.
To distribute incoming traffic evenly across our multiple servers we can use a simple load balancing algorithm like round robin. Upon connecting to a Realtime Messaging Server through the load balancer, the client needs to send a message informing the server of which live video it is watching. The Realtime Messaging Server then updates a mapping in local memory with this information. This map would look something like this:
{ "liveVideoId1": ["viewer1", "viewer2", "viewer3"], "liveVideoId2": ["viewer4", "viewer5", "viewer6"], "liveVideoId3": ["viewer7", "viewer8", "viewer9"], }
Where viewerN is a reference to the SSE connection for that viewer. Now, anytime a new comment is created, the server can loop through the list of viewers for that live video and send the comment to each one.
The question then becomes how does each Realtime Messaging Server know that a new comment was created? We can use a pub/sub system to solve this problem. A pub/sub system is a messaging system that uses a publish/subscribe model. Publishers send messages to a topic, and subscribers receive messages from a topic. In our case, the comment management service would publish a message to a topic whenever a new comment is created. The Realtime Messaging Servers would subscribe to this topic and receive the message. They would then send the comment to all the viewers that are watching the live video.
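Here's a sketch of this naive fan-out, assuming Redis pub/sub and the in-memory viewer map described above (the channel name and types are illustrative):

```typescript
import Redis from "ioredis";

// The liveVideoId -> SSE connections map maintained by each server.
declare const viewersByVideo: Map<string, Set<{ write(chunk: string): void }>>;

const sub = new Redis();
const pub = new Redis();

// Realtime Messaging Server side: every server hears every comment.
await sub.subscribe("comments");
sub.on("message", (_channel, payload) => {
  const comment = JSON.parse(payload);
  for (const conn of viewersByVideo.get(comment.liveVideoId) ?? []) {
    conn.write(`data: ${payload}\n\n`);
  }
});

// Comment Management Service side: publish after persisting.
async function publishComment(comment: { liveVideoId: string }) {
  await pub.publish("comments", JSON.stringify(comment));
}
```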
Challenges
This approach works, but it is not very efficient. Each Realtime Messaging Server needs to process every comment, even if it is not broadcasting the live video. This leads to inefficiency, slow performance, and high compute intensity that’s impractical at FB scale.
Good Solution: Pub/Sub Partitioning into Topics per Live Video
Approach
To improve upon the previous approach, we can partition the comment stream into different topics based on the live video. Each Realtime Messaging Server would subscribe only to the topics it needs, determined by the viewers connected to it.
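The bookkeeping might look like the sketch below, reusing the `sub` client and `viewersByVideo` map from the previous sketch and assuming one channel per live video (so the publisher would publish to `comments:{liveVideoId}` rather than a single global topic):

```typescript
import Redis from "ioredis";

type SseConnection = { write(chunk: string): void };
declare const sub: Redis; // the subscriber client from the previous sketch
declare const viewersByVideo: Map<string, Set<SseConnection>>;

async function onViewerJoin(liveVideoId: string, conn: SseConnection) {
  if (!viewersByVideo.has(liveVideoId)) {
    viewersByVideo.set(liveVideoId, new Set());
    await sub.subscribe(`comments:${liveVideoId}`); // first local viewer: start listening
  }
  viewersByVideo.get(liveVideoId)!.add(conn);
}

async function onViewerLeave(liveVideoId: string, conn: SseConnection) {
  const viewers = viewersByVideo.get(liveVideoId);
  viewers?.delete(conn);
  if (viewers?.size === 0) {
    viewersByVideo.delete(liveVideoId);
    await sub.unsubscribe(`comments:${liveVideoId}`); // last local viewer: stop listening
  }
}
```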
Challenges
While this approach is more efficient, it is not perfect. With the load balancer using round robin, there's a risk that a server could end up with viewers subscribed to many different streams, replicating the issue from the previous approach.
Great Solution: Partitioned Pub/Sub with Layer 7 Load Balancer
Approach
To address the issue of servers handling viewers watching many different live videos, we can implement a more intelligent allocation strategy that ensures servers primarily handle viewers watching the same video. To do this, we'll upgrade to a layer 7 load balancer, which can route traffic based on the request itself (e.g., the live video ID in the request) rather than just the IP address and port. This ensures that viewers watching the same live video are routed to the same server (to the extent possible).
Now, when a user starts watching a live stream, they are connected to a Realtime Messaging Server that is already serving other viewers of that same stream. This means that each Realtime Messaging Server only needs to subscribe to the topics for the small number of live videos it is serving, improving efficiency and performance by reducing the computation and network traffic required to process each comment.
When a server receives a comment broadcast on one of the topics it is subscribed to, it sends that comment to all the viewers it has connected for that live video.
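To make the routing idea concrete, here's an illustrative sketch of the decision the load balancer makes; in practice you would configure this in the load balancer itself (e.g., hashing on the URL path) rather than hand-rolling it, and the server pool below is hypothetical:

```typescript
import { createHash } from "crypto";

const servers = ["rms-1:8080", "rms-2:8080", "rms-3:8080"]; // hypothetical pool

// Hash the live video id so all viewers of the same video land on the same
// Realtime Messaging Server.
function pickServer(liveVideoId: string): string {
  const digest = createHash("md5").update(liveVideoId).digest();
  return servers[digest.readUInt32BE(0) % servers.length];
}
```

Note that simple modulo hashing reshuffles viewers whenever the server pool changes; consistent hashing (a hash ring) limits that movement.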
Challenges
While this approach is more efficient, it is not without its limitations and challenges. One primary concern is the uneven distribution of load, especially during peak times. Popular live videos could attract a disproportionately high number of viewers, potentially overloading the servers dedicated to those specific topics. This scenario demands dynamic resource allocation strategies, such as spinning up additional servers or reallocating resources from less burdened servers.
Great Solution: Scalable Dispatcher Instead of Pub/Sub
Approach
The approaches we have discussed so far have been centered around a pub/sub system where servers listen to every incoming message and send them out as needed over SSE connections. This means that the service creating comments does not need to know which Realtime Messaging Server is responsible for sending the comment to the viewers -- it simply puts a message on a topic and the Realtime Messaging Servers are responsible for consuming from the correct topics.
While this approach is typical for these types of problems (and would pass the interview), there is another approach which inverts the problem and has the service creating comments be responsible for sending the comment to the correct Realtime Messaging Server. This approach is more complex, but can also be more efficient and scalable.
We need to introduce a new component called a Dispatcher Service. The dispatcher service is responsible for receiving comments from the comment management service and sending them to the correct Realtime Messaging Server. To achieve this, the Dispatcher Service maintains a dynamic mapping of viewers to their corresponding Realtime Messaging Servers. This mapping is constantly updated in response to viewer activities, such as joining or leaving a live video stream. When a new Realtime Messaging Server comes online, it registers itself with the Dispatcher Service, updating the Dispatcher’s understanding of the system's current topology. This registration process includes information about the server's capacity and the live videos it is currently serving.
The Dispatcher Service is also designed to be scalable and replicable. In high-demand scenarios, multiple instances of the Dispatcher Service can be deployed to share the load. This replication not only balances the traffic but also adds redundancy to the system, enhancing its resilience.
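A sketch of the dispatcher's core bookkeeping and fan-out, under the assumption that Realtime Messaging Servers register themselves over HTTP and expose a hypothetical /internal/broadcast endpoint:

```typescript
// liveVideoId -> addresses of servers with at least one viewer of that video.
const serversByVideo = new Map<string, Set<string>>();

// Called when a Realtime Messaging Server registers or updates its state.
function registerServer(address: string, liveVideoIds: string[]) {
  for (const id of liveVideoIds) {
    if (!serversByVideo.has(id)) serversByVideo.set(id, new Set());
    serversByVideo.get(id)!.add(address);
  }
}

// Called by the Comment Management Service after a comment is persisted.
async function dispatch(comment: { liveVideoId: string }) {
  const targets = serversByVideo.get(comment.liveVideoId) ?? new Set<string>();
  await Promise.all(
    [...targets].map((address) =>
      fetch(`http://${address}/internal/broadcast`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(comment),
      })
    )
  );
}
```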
Challenges
The primary challenge in this model is ensuring the Dispatcher Service has accurate and up-to-date information about the distribution of viewers across Realtime Messaging Servers. This requires a robust and rapid communication channel between the servers and the Dispatcher Service to handle real-time updates in viewer distribution.
Another significant challenge is maintaining the consistency and synchronization of data across multiple instances of the Dispatcher Service. This is crucial to prevent comment duplication or loss when directing traffic to the appropriate Realtime Messaging Servers.
What is Expected at Each Level?
Ok, that was a lot. You may be thinking, “how much of that is actually required from me in an interview?” Let’s break it down.
Mid-level
Breadth vs. Depth: A mid-level candidate will be mostly focused on breadth (80% vs 20%). You should be able to craft a high-level design that meets the functional requirements you've defined, but many of the components will be abstractions with which you only have surface-level familiarity.
Probing the Basics: Your interviewer will spend some time probing the basics to confirm that you know what each component in your system does. For example, if you add an API Gateway, expect that they may ask you what it does and how it works (at a high level). In short, the interviewer is not taking anything for granted with respect to your knowledge.
Mixture of Driving and Taking the Backseat: You should drive the early stages of the interview in particular, but the interviewer doesn’t expect that you are able to proactively recognize problems in your design with high precision. Because of this, it’s reasonable that they will take over and drive the later stages of the interview while probing your design.
The Bar for FB Live Comments: For this question, I expect that candidates proactively realize the limitations with a polling approach and start to reason around a push based model. With only minor hints they should be able to come up with the pub/sub solution and should be able to scale it with some help from the interviewer.
Senior
Depth of Expertise: As a senior candidate, expectations shift towards more in-depth knowledge — about 60% breadth and 40% depth. This means you should be able to go into technical details in areas where you have hands-on experience. It's crucial that you demonstrate a deep understanding of key concepts and technologies relevant to the task at hand.
Advanced System Design: You should be familiar with advanced system design principles. For example, knowing how to use pub/sub for broadcasting messages. You’re also expected to understand some of the challenges that come with it and discuss detailed scaling strategies (it’s ok if this took some probing/hints from the interviewer). Your ability to navigate these advanced topics with confidence and clarity is key.
Articulating Architectural Decisions: You should be able to clearly articulate the pros and cons of different architectural choices, especially how they impact scalability, performance, and maintainability. You justify your decisions and explain the trade-offs involved in your design choices.
Problem-Solving and Proactivity: You should demonstrate strong problem-solving skills and a proactive approach. This includes anticipating potential challenges in your designs and suggesting improvements. You need to be adept at identifying and addressing bottlenecks, optimizing performance, and ensuring system reliability.
The Bar for FB Live Comments: For this question, E5 candidates are expected to speed through the initial high-level design so they can spend time discussing, in detail, how to scale the system. You should be able to reason through the limitations of the initial design and come up with a pub/sub solution with minimal hints. You should proactively lead the scaling discussion and be able to reason through the trade-offs of different solutions.
Staff+
Emphasis on Depth: As a staff+ candidate, the expectation is a deep dive into the nuances of system design — I'm looking for about 40% breadth and 60% depth in your understanding. This level is all about demonstrating that, while you may not have solved this particular problem before, you have solved enough problems in the real world to be able to confidently design a solution backed by your experience.
You should know which technologies to use, not just in theory but in practice, and be able to draw from your past experiences to explain how they’d be applied to solve specific problems effectively. The interviewer knows you know the small stuff (REST API, data normalization, etc) so you can breeze through that at a high level so you have time to get into what is interesting.
High Degree of Proactivity: At this level, an exceptional degree of proactivity is expected. You should be able to identify and solve issues independently, demonstrating a strong ability to recognize and address the core challenges in system design. This involves not just responding to problems as they arise but anticipating them and implementing preemptive solutions. Your interviewer should intervene only to focus, not to steer.
Practical Application of Technology: You should be well-versed in the practical application of various technologies. Your experience should guide the conversation, showing a clear understanding of how different tools and systems can be configured in real-world scenarios to meet specific requirements.
Complex Problem-Solving and Decision-Making: Your problem-solving skills should be top-notch. This means not only being able to tackle complex technical challenges but also making informed decisions that consider various factors such as scalability, performance, reliability, and maintenance.
Advanced System Design and Scalability: Your approach to system design should be advanced, focusing on scalability and reliability, especially under high load conditions. This includes a thorough understanding of distributed systems, load balancing, caching strategies, and other advanced concepts necessary for building robust, scalable systems.
The Bar for FB Live Comments: For a staff+ candidate, expectations are high regarding depth and quality of solutions, particularly when it comes to scaling the broadcasting of comments. I expect staff+ candidates to not only identify the pub/sub solution but proactively call out the limitations around reliability or scalability and suggest solutions. They likely have a good understanding of the exact technology they would use and can discuss the trade-offs of different solutions in detail.