Design a Local Delivery Service like Gopuff
Stefan Mai
Understanding the Problem
Functional Requirements
Core Requirements
- Customers should be able to query availability of items, deliverable in 1 hour, by location.
- Customers should be able to order items.
Below the line (out of scope)
- Handling payments/purchases.
- Handling driver routing and deliveries.
- Search functionality and catalog APIs. (The system is strictly concerned with availability and ordering).
- Cancellations and returns.
Non-Functional Requirements
Core Requirements
- Customers should be able to order from any DC within 1 hour delivery time (i.e. the effective availability is the union of all nearby DCs).
- Availability requests should be fast (<100ms) to support use-cases like search.
- Ordering should be strongly consistent: two customers should not be able to purchase the same physical product.
- System should be able to support 10k DCs and 100k items in the catalog across DCs.
- Order volume will be O(1m orders/day)
Below the line (out of scope)
- Privacy and security.
- Disaster recovery.
Here’s how these might be shorthanded in an interview. Note that out-of-scope requirements usually stem from questions like “do we need to handle privacy?”. Interviewers are usually comfortable with you making assertions like “I’m going to leave privacy out of scope to start” and will correct you if needed.
Scale estimations
Because we need to support 1m orders/day, we can estimate the orders/second:
Orders: 1M orders/day / (100k seconds/day) = 10 orders/second
This might burst considerably higher and is unlikely to be evenly distributed over 24 hours, so we might increase this estimate to 50 or 100. This is fairly easily achievable with almost any standard database system. We can safely move on.
Queries are a bit harder to estimate. All-in, we might estimate that each user will look at 100 items across search, the homepage, etc. before purchasing 1 item. In addition, maybe only 5% of these users will end up buying.
Queries: 1M orders/day / (100k seconds/day) / .05 * 100 = 20k queries/second
The read volume here is substantially higher. We’ll need to think through a good caching strategy but we can probably get some advantage due to the fact that these queries are going to be localized by region (more later).
We don’t really need to do any more estimation beyond this. Reflections like those above help your interviewer understand how your estimations are affecting your design and demonstrate your experience - don’t estimate for the sake of estimation!
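As a quick sanity check, the arithmetic above in a few lines of Python:

```python
SECONDS_PER_DAY = 100_000  # ~86,400, rounded up for easy mental math

orders_per_day = 1_000_000
orders_per_second = orders_per_day / SECONDS_PER_DAY  # 10

conversion_rate = 0.05  # assume ~5% of browsing users end up buying
items_viewed = 100      # assume ~100 items viewed per user before purchase
queries_per_second = orders_per_second / conversion_rate * items_viewed  # 20,000
```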
Set Up
Planning the Approach
In this interview, we need to solve 2 key challenges, aligned with our requirements:
- Answering queries about inventory availability.
- Placing orders.
Defining the API
To begin this problem, we can start with the APIs we need to satisfy - which follow naturally from our functional requirements. Note there are a lot of extraneous details we could shove into these APIs, but we’ll start with these before expanding. We may also need to add APIs alongside our high level design, that's ok too!
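The concrete API definitions aren't reproduced here, but one possible shape for the two endpoints might look like the following sketch (endpoint paths and field names are illustrative assumptions, not from the original):

```python
# Hypothetical request/response shapes for the two core endpoints.

# GET /v1/availability?lat={lat}&long={long}&itemIds=123,456
availability_response = {
    "items": [
        {"itemId": 123, "quantityAvailable": 4},
        {"itemId": 456, "quantityAvailable": 0},
    ]
}

# POST /v1/orders
order_request = {
    "lat": 37.7749,
    "long": -122.4194,
    "items": [{"itemId": 123, "quantity": 1}],
}
order_response = {"orderId": 789, "status": "CONFIRMED"}
```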
Defining the Core Entities
We can also sketch out a rudimentary data model. The important thing here is to find the core entities for the problem that are going to be relevant to our requirements.
Of particular note for this problem is the distinction between an ItemInstance and an Item. While our API consumers are strictly concerned with items (e.g. Cheetos), we need to keep track of where the physical items are actually located. Orders, meanwhile, are concerned with ItemInstances because they need to be able to track the physical items being ordered. Setting aside the links between them, we end up with 4 entities in our system:
- ItemInstance: A physical instance of an item, located at a DC. We'll sum up ItemInstances to determine the quantity available.
- Item: A type of item, e.g. Cheetos. These are what our customers will actually care about.
- DistributionCenter: A physical location where items are stored. We'll use these to determine which items are available to a user. ItemInstances are stored in DCs.
- Order: A collection of ItemInstances which have been ordered by a user.
High-Level Design
1) Customers should be able to query availability of items
In order to find the items available to a customer, we need to first find the DCs that are close enough to be able to deliver in 1 hour. Then we can aggregate the items that are available in those DCs to come up with the final quantities we can return to the user.
We can start by building a simple API which takes a LAT and LONG and returns a list of DCs within 1 hour delivery. There are a lot of options for this service, from how the data is stored to how closely we can answer the question of delivery time. In the ideal case, we have a service which is both fast and accurate.
Bad Solution: Simple SQL Distance
Approach
We can put the lat/long of all of our DCs into a table, then measure the distance to the user’s input with some simple math. A very basic version might use Euclidean distance while a more sophisticated example might use the Haversine formula to take into account the curvature of the Earth. Taking a simple threshold on this query would give us DCs within X as the crow flies.
Challenges
- This is a very simple approach which doesn’t take into account traffic, road conditions, etc. It also doesn’t take into account the fact that we might have multiple DCs in the same city.
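As a concrete illustration of the Haversine formula mentioned above (a sketch, not the article's actual query):

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3959.0

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle ("as the crow flies") distance between two lat/long points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))
```

A simple threshold on this value (e.g. `haversine_miles(...) < 60`) gives the straight-line filter described above, but says nothing about actual drive time.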
Bad Solution: Use a Travel Time Estimation Service Against all DCs
Approach
We can sync periodically (like every 5 minutes) from a DC table to the memory of our service. We can then use a travel time estimation service to find the travel times from the input location to each of our DCs by iterating over all of them.
Challenges
- Calling an external travel time estimation service for every DC on every request is slow and expensive: with 10k DCs, the vast majority are nowhere near the user and could have been pruned before we ever made a call.
Great Solution: Use a Travel Time Estimation Service Against Nearby DCs
We can sync periodically (like every 5 minutes) from a DC table to memory of our service. When an input comes in, we can prune down the “nearby” DCs by taking a fixed radius (say 60 miles) which we then pass to the external travel time service to create our final estimate.
Once we have the available DCs, the next step is to answer queries about inventory availability. Since our nearby service gives us a list of DCs in scope for a given user, the availability lookup is relatively simple: it aggregates the inventory for the list of DCs and returns the result. We want to make this fast and scalable, and (per our requirements) we can compromise a bit on consistency to get it.
Bad Solution: Query Inventory Directly with Nearby Service Results
Approach
We can query the inventory table directly with the list of items (from the input) and DCs (from the nearby service).
Challenges
- At roughly 20k queries/second, sending every availability check straight to the inventory table puts enormous read load on the database and leaves little headroom for bursts.
Good Solution: Query Inventory Through Cache with Nearby Service Results
Approach
We can insert a read-through Redis cache on the path to our inventory table. Setting a low TTL (e.g. 1 minute) ensures that these results are fresh.
Challenges
We need to ensure that the cache is always up to date with the inventory table. Our Order Service will need to expire affected cache entries when it writes to the inventory table.
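A minimal in-memory stand-in for this read-through pattern (a real deployment would use Redis; class and key names here are assumptions):

```python
import time

class ReadThroughCache:
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expires_at)

    def get(self, key, load_fn):
        entry = self.store.get(key)
        if entry and entry[1] > time.time():
            return entry[0]                       # fresh hit
        value = load_fn(key)                      # miss: read through to the DB
        self.store[key] = (value, time.time() + self.ttl)
        return value

    def invalidate(self, key):
        # Called by the Order Service after it writes to the inventory table.
        self.store.pop(key, None)
```

The Order Service would call `invalidate` for the affected item/DC keys on every write, so the cache never serves stale quantities longer than the TTL.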
Great Solution: Postgres Read Replicas and Partitioning
Approach
Since we only ever read inventory from a nearby collection of DCs, we can group them together under a region ID using the first 3 digits of their zipcode. This means most queries will go to only 1 or 2 partitions rather than the entire inventory dataset. Further, we can use read replicas for availability since we can tolerate a small amount of inconsistency.
Challenges
We'll need to manage the sizing of our replicas to balance with traffic. We'll also need to manage the partitioning of our data to ensure that we're not overloading any one replica.
By choosing the "great" solutions for both challenges we net out with a solution that looks something like this:
We now have:
- Availability Service handles requests from our users for availability given a specific location.
- Nearby Service syncs with the database of nearby DCs and uses an external "Travel Time Service" to calculate travel times from DCs (potentially including traffic).
- Inventory Table a replicated SQL database table partitioned by region which returns the inventory available for each item.
When a user makes a request to get availability for items A, B, and C from latitude X and longitude Y, here's what happens:
- We make a request to the Nearby Service with the user's location X and Y.
- The nearby service accesses its synced list of distribution centers and grabs any DC which is within 60 miles of X and Y.
- For each of these DCs, the nearby service makes a request to the Travel Time Service to get the travel time from the DC to X and Y.
- For all travel times under (say) 45 minutes, we return the IDs of those DCs to the availability service.
- With the DCs available, we query our SQL database for the items A, B, and C with locations in those DCs.
- We sum up the results and return them to our client.
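The final aggregation step can be sketched as a group-by-item sum over the in-range DCs (the hard-coded rows stand in for a SQL result set; in production this is a query against the partitioned inventory table):

```python
from collections import defaultdict

# (dc_id, item_id, quantity) rows, as the inventory query might return them.
inventory_rows = [
    (1, "A", 3), (1, "B", 0),
    (2, "A", 2), (2, "C", 5),
    (3, "A", 9),  # DC 3 is out of delivery range for this user
]

def availability(rows, nearby_dc_ids, item_ids):
    """Sum available quantity per item across the DCs in delivery range."""
    totals = defaultdict(int)
    for dc_id, item_id, qty in rows:
        if dc_id in nearby_dc_ids and item_id in item_ids:
            totals[item_id] += qty
    return dict(totals)
```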
2) Customers should be able to order items.
The last step is for us to enable placing orders. For this, we require strong consistency to make sure two users aren't ordering the same item. To do this we need to check inventory, record the order, and update the inventory together atomically. While latency is not a big concern here (our users will tolerate more latency here than they will on the reads), we definitely want to make sure we're not promising the same inventory to two users.
To ensure we're not "double booking", we need some form of locking. We have lots of options here, but we're going to favor the least complex implementation:
Good Solution: Two data stores with three-phase commit
Approach
We can use a three-phase commit to ensure that the transaction is atomic. This means that we can use a transaction manager to coordinate the transaction across the two data stores. This is a good solution because it allows us to use the best data store for each use case. For example, we can use a key-value store for inventory and a relational database for orders.
Challenges
The main challenge with this approach is that it requires a transaction manager to coordinate the transaction across the two data stores. This means that we need to implement a transaction manager or use a third party one. This adds complexity and overhead to our system.
Great Solution: Singular Postgres transaction
Approach
By putting both orders and inventory in the same database, we can take advantage of the ACID properties of our Postgres database. Using a singular transaction with isolation level SERIALIZABLE we can ensure that the entire transaction is atomic. This means that if two users try to order the same item at the same time, one of them will be rejected. This is because the transaction will fail to commit if the inventory is not available.
Challenges
While consolidating our data down to a single database has a lot of benefits, it's not without drawbacks. We're partly coupling the scaling of inventory and orders and we can't take advantage of the best data store for each use case.
By choosing the "great" option and leaning in to our existing Postgres database we can keep our system simple and still meet our requirements. For an order, the process looks like this:
- The user makes a request to the Orders Service to place an order for items A, B, and C.
- The Orders Service creates a single transaction which we submit to our Postgres leader. This transaction:
  a. Checks that the inventory for items A, B, and C is > 0.
  b. If any of the items are out of stock, the transaction fails.
  c. If all items are in stock, the transaction records the order and updates the status of inventory items A, B, and C to "ordered".
  d. A new row is created in the Orders table (and OrderItems table) recording the order for A, B, and C.
  e. The transaction is committed.
- If the transaction succeeds, we return the order to the user.
There are some downsides to this setup. In particular, if any of the items in the user's order become unavailable, the entire order fails. We'll want to return a meaningful error message to the user in this case, but this is preferable to succeeding with a partial order that might not make sense (e.g. a device without its battery).
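Here's a runnable sketch of that single-transaction flow, using SQLite as a stand-in for Postgres (in Postgres you'd additionally open the transaction at `ISOLATION LEVEL SERIALIZABLE`; the schema and column names are assumptions):

```python
import sqlite3

def place_order(conn, user_id, item_ids):
    """Atomically reserve one instance of each item, or fail the whole order."""
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            instance_ids = []
            for item_id in item_ids:
                row = conn.execute(
                    "SELECT id FROM item_instances "
                    "WHERE item_id = ? AND status = 'available' LIMIT 1",
                    (item_id,),
                ).fetchone()
                if row is None:  # any out-of-stock item fails the whole order
                    raise LookupError(f"item {item_id} out of stock")
                conn.execute(
                    "UPDATE item_instances SET status = 'ordered' WHERE id = ?",
                    (row[0],),
                )
                instance_ids.append(row[0])
            order_id = conn.execute(
                "INSERT INTO orders (user_id) VALUES (?)", (user_id,)
            ).lastrowid
            for iid in instance_ids:
                conn.execute(
                    "INSERT INTO order_items (order_id, instance_id) VALUES (?, ?)",
                    (order_id, iid),
                )
        return True
    except LookupError:
        return False

# Demo setup: a single physical instance of item 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_instances (id INTEGER PRIMARY KEY, item_id INT, status TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT);
    CREATE TABLE order_items (order_id INT, instance_id INT);
    INSERT INTO item_instances (item_id, status) VALUES (1, 'available');
""")
```

Because the check and the status update happen inside one transaction, two concurrent orders cannot both claim the last instance of an item.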
Putting it all Together
With the two approaches listed, we can now pull everything together for a solution to our problem:
We have three services, one for Availability requests, one for Orders, and a shared service for Nearby DCs. Both our Availability and Orders service use the Nearby service to look up DCs that are close enough to the user. We have a singular Postgres database for inventory and orders, partitioned by region. Our Availability service reads via read replicas, our Orders service writes to the leader using atomic transactions to avoid double writes. A great foundation!
Deep Dives
1) Product change: Add cart "reservations"
Your interviewer asks you to add a new feature:
Given these requirements, we need some way to "lock" inventory to a user when they add it to the cart. We also need a way to expire these locks after 1 hour or when the user checks out -- whichever comes first. Finally, we need to make sure that our Availability Service is aware of these locks and doesn't report locked items as available inventory. We have a couple of options for how to implement this:
Good Solution: Use Redis to "lock" items
Approach
We can stand up a Redis instance which will store "locks" on inventory to ensure they are not sold twice, with TTLs to ensure they expire. When a user adds an item to their cart, we will set a lock on the inventory item for 60 minutes. When a user checks out, we will release the lock. We will also need to ensure our Availability Service checks for outstanding locks in calculating available inventory.
Challenges
- We need to make sure each service is "aware" of the locks we've created by checking them before returning available quantities.
- Our Redis instance adds complexity and synchronization problems to the system.
Great Solution: Use a new status flag
Approach
The most straightforward solution to this problem is to add a new status and an expiration field to each inventory item. When a user adds an item to their cart, we will set the status to "reserved" and the expiration to 60 minutes from now. When a user checks out, we can take all of the "reserved" items they have and mark them as "sold". We will also need to add a background process to periodically check for expired reservations and release the inventory.
Challenges
The new status updates sit on the same write path as the order updates, which means they'll contend and cut into our effective write throughput. Since we have a lot of headroom this isn't a big concern.
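A sketch of that reservation lifecycle with in-memory state (in the real system these would be the status/expiration columns on the inventory table, and the sweep would be a background job; names are assumptions):

```python
import time

RESERVATION_TTL = 60 * 60  # 1 hour

def reserve(item, now=None):
    """Cart add: mark an available item reserved with a 1-hour expiration."""
    now = now if now is not None else time.time()
    if item["status"] != "available":
        return False
    item["status"] = "reserved"
    item["expires_at"] = now + RESERVATION_TTL
    return True

def checkout(item):
    """Checkout: convert the user's reserved items to sold."""
    if item["status"] == "reserved":
        item["status"] = "sold"
        item["expires_at"] = None

def sweep_expired(items, now=None):
    """Background job: release reservations whose hour has elapsed."""
    now = now if now is not None else time.time()
    for item in items:
        if item["status"] == "reserved" and item["expires_at"] <= now:
            item["status"] = "available"
            item["expires_at"] = None
```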
2) Optimization: Late allocation of DCs (Senior+)
Your interviewer asks you to optimize the system:
The fundamental issue surfaced here is that we make assignments between orders and inventory items suboptimally. We have some options:
Good Solution: Use max inventory DCs
Approach
One simple approach is to make DC allocation decisions based on which DC has the most inventory. Since we're trying to avoid out-of-stock scenarios, this should help us avoid exhausting inventory at the edge of the metro area. We can do this by adjusting the logic in our ordering service to prefer placing orders from the DC with the most inventory.
Challenges
While this approach is a simple and cheap heuristic, it has its limitations. We're not able to take advantage of the fungibility of reservations (above) which might exist only transiently. It's also not particularly smart: if the DCs at the edge of the metro have the most inventory at the start but rapidly deplete due to the geography they cover, we'll still have problems with out-of-stock scenarios.
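The heuristic itself is tiny (a sketch; in practice the stock levels come from the inventory table):

```python
def pick_dc(eligible_dc_ids, stock_by_dc_item, item_id):
    """Fulfill from whichever in-range DC currently holds the most of this item."""
    return max(eligible_dc_ids,
               key=lambda dc: stock_by_dc_item.get((dc, item_id), 0))
```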
Good Solution: Rule engine
Approach
We can build a rule engine which takes into account the current inventory levels of each DC, the current reservations, and the geographic distribution of the orders. We can then use this rule engine to make allocation decisions. This will allow us to take into account the fungibility of reservations and the geographic distribution of orders.
Challenges
This approach is more complex than the previous one, but it still depends on relatively simple heuristics that we can encode. Furthermore, the complexity of the rules we put in place increases the complexity of our system.
Great Solution: Allocation service
Approach
We can create a new entity, "Promise", which represents a promise of a particular Item to a user that can be fulfilled by any of a set of DCs - without resolving (at that moment) which specific DC will provide it. When we create reservations, instead of changing the status of a specific item, we can instead create a new Promise entry.
We'll need to update our system to be cognizant of these promises. Our Availability Service will need to sum up all of the outstanding promises and subtract them from the available inventory. Our Orders Service will need to be aware of the promises and resolve them when an order is placed. To simplify this, we can create a new service, the Allocation Service, which will be responsible for allocating promises to orders. This service will be responsible for taking in orders and promises and allocating them to DCs. It will also be responsible for resolving promises when an order is placed.
Challenges
This approach is the most complex of the three (and has the most edge cases), but it also has the most flexibility. We can use this approach to implement a variety of allocation strategies, including the ones above. We can also use this approach to implement more complex allocation strategies, such as those which take into account the geographic distribution of orders and the fungibility of reservations.
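A minimal sketch of the Promise idea (entity fields and the greedy resolution strategy are assumptions, and the availability adjustment shown is deliberately conservative):

```python
# A Promise reserves one unit of an item against a *set* of candidate DCs,
# deferring the choice of which DC actually fulfills it.
promises = [
    {"item_id": "A", "candidate_dcs": {1, 2}},
    {"item_id": "A", "candidate_dcs": {2}},
]

def promised_count(promises, dc_id, item_id):
    """Availability subtracts promises that *could* draw from this DC."""
    return sum(
        1 for p in promises
        if p["item_id"] == item_id and dc_id in p["candidate_dcs"]
    )

def resolve(promise, stock_by_dc_item):
    """At order time, bind the promise to the candidate DC with the most stock."""
    return max(promise["candidate_dcs"],
               key=lambda dc: stock_by_dc_item.get((dc, promise["item_id"]), 0))
```

Counting a promise against every candidate DC undercounts availability slightly; a real Allocation Service would apply a smarter adjustment, which is part of the complexity this approach takes on.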
What is Expected at Each Level?
Your interviewer may even have you go deeper on specific sections, or ask follow-up questions. What might you expect in an actual assessment?
Mid-Level
Breadth vs. Depth: A mid-level candidate will be mostly focused on breadth. As an approximation, you’ll show 80% breadth and 20% depth in your knowledge. You should be able to craft a high-level design that meets the functional requirements you've defined, but the optimality of your solution will be icing on top rather than the focus.
Probing the Basics: Your interviewer will spend some time probing the basics to confirm that you know what each component in your system does. For example, if you use DynamoDB, expect to be asked about the indexes available to you. Your interviewer will not be taking anything for granted with respect to your knowledge.
Mixture of Driving and Taking the Backseat: You should drive the early stages of the interview in particular, but your interviewer won't necessarily expect that you are able to proactively recognize problems in your design with high precision. Because of this, it’s reasonable for them to take over and drive the later stages of the interview while probing your design.
The Bar for Inventory Management: For this question, interviewers expect a mid-level candidate to have clearly defined the API endpoints and data model, and created both routes: availability and orders. In instances where the candidate uses a “Bad” solution, the interviewer will expect a good discussion but not that the candidate immediately jumps to a great (or sometimes even good) solution.
Senior
Depth of Expertise: As a senior candidate, your interviewer expects a shift towards more in-depth knowledge — about 60% breadth and 40% depth. This means you should be able to go into technical details in areas where you have hands-on experience.
Advanced System Design: You should be familiar with advanced system design principles. Certain aspects of this problem should jump out to experienced engineers (read volume, trivial partitioning) and your interviewer will be expecting you to have reasonable solutions.
Articulating Architectural Decisions: Your interviewer will want you to clearly articulate the pros and cons of different architectural choices, especially how they impact scalability, performance, and maintainability. You should be able to justify your decisions and explain the trade-offs involved in your design choices.
Problem-Solving and Proactivity: You should demonstrate strong problem-solving skills and a proactive approach. This includes anticipating potential challenges in your designs and suggesting improvements. You need to be adept at identifying and addressing bottlenecks, optimizing performance, and ensuring system reliability.
The Bar for Inventory Management: For this question, a senior candidate is expected to speed through the initial high level design so we can spend time discussing, in detail, how to optimize the critical paths. Senior candidates would be expected to have optimized solutions for both the atomic transactions of the orders service as well as the scaling of the availability service.
Staff+
Emphasis on Depth: As a staff+ candidate, the expectation is a deep dive into the nuances of system design — the interviewer is looking for about 40% breadth and 60% depth in your understanding. This level is all about demonstrating that "been there, done that" expertise. You should know which technologies to use, not just in theory but in practice, and be able to draw from your past experiences to explain how they’d be applied to solve specific problems effectively. Your interviewer knows you know the small stuff (REST APIs, data normalization, etc.) so you can breeze through that at a high level so we have time to get into what is interesting.
High Degree of Proactivity: At this level, your interviewer expects an exceptional degree of proactivity. You should be able to identify and solve issues independently, demonstrating a strong ability to recognize and address the core challenges in system design. This involves not just responding to problems as they arise but anticipating them and implementing preemptive solutions.
Practical Application of Technology: You should be well-versed in the practical application of various technologies. Your experience should guide the conversation, showing a clear understanding of how different tools and systems can be configured in real-world scenarios to meet specific requirements.
Complex Problem-Solving and Decision-Making: Your problem-solving skills should be top-notch. This means not only being able to tackle complex technical challenges but also making informed decisions that consider various factors such as scalability, performance, reliability, and maintenance.
Advanced System Design and Scalability: Your approach to system design should be advanced, focusing on scalability and reliability, especially under high load conditions. This includes a thorough understanding of distributed systems, load balancing, caching strategies, and other advanced concepts necessary for building robust, scalable systems.
The Bar for Inventory Management: For a staff+ candidate, expectations are set high regarding depth and quality of solutions, particularly for the complex scenarios discussed earlier. Your interviewer will be looking for you to dive deep into at least 2-3 key areas, showcasing not just proficiency but also innovative thinking and optimal solution-finding abilities. They should show unique insights for at least a couple follow-up questions of increasing difficulty. A crucial indicator of a staff+ candidate's caliber is the level of insight and knowledge they bring to the table. A good measure for this is if the interviewer comes away from the discussion having gained new understanding or perspectives.