Back to Main
Learn System Design
In a Hurry
Core Concepts
Key Technologies
Patterns
Advanced Topics
Question Breakdowns
Get Premium
Advanced Topics
Data Structures for Big Data
Learn about data structures for processing and storing large amounts of data in System Design interviews.
Intro
Some systems need to process absolutely massive amounts of data. These problems disproportionately show up in system design interviews as a way to try to stress test the depth of your knowledge. For these problems, simple scaling and adding more machines may be insufficient — you'll need to lean on specialized data structures to solve the problem.
But most system design interviews don't actually require you to implement data structures on the fly and aren't focused at this low level, so why bother? Well, in many cases utilizing a specialized data structure will actually change the shape of the solution and make the surrounding system fundamentally different. Understanding these differences is a super power and can help you design systems that are more efficient, scalable, and performant.
The catch is that these data structures are uncommon for a reason. Employing a bloom filter when a simple hash table would suffice is a red flag for an interviewer. So this is good knowledge to have, but don't forget to keep it simple!
In this deep dive we'll do a few things:
- We'll build out your arsenal of potential approaches, expanding your view of available data structures.
- We'll highlight specific scenarios where these data structures are commonly used.
- Finally, we'll point out some common pitfalls and places where you might get tripped up or tempted to over-engineer.
Let's do it.
Bloom Filter
Our first data structure is probably the most well-known and a great starting point for our discussion. A bloom filter is a probabilistic data structure which is analogous to a set (refresher: sets allow you to insert elements and check their membership).
The most common implementation of a set uses a hash table. With a hash table we can insert elements in O(1) time and check membership in O(1) time. Very fast! Unfortunately, we need to have memory for each possible element in a hash table, which is infeasible for very large sets. Imagine we had trillions of items and we wanted to keep track of all their ids.
Bloom filters help us here by making a compromise. They are dramatically more memory efficient than hash tables but relax the guarantees of a set. Bloom filters can tell you:
- Whether an element is likely in a set, with some configurable probability
- When an element is definitely not in a set
This sounds a bit unusual, so let's take a step back to build up some intuition before explaining how it works.
Intuition
Pretend we have a village with 1,000 people each with a very simple but unique stamp they use to mark documents. We want to keep track of who attended a meeting but we only have a single pad of paper and no pens, only stamps!
Villager Stamps
Because we're so nice, we want to send treats as a thank you to those who attended the meeting. But while we're generous and ok if we send a few treats to people who didn't actually attend, but we want to minimize the treats we waste on the absentees (boo them). To solve this problem we're going to have all of our villagers stamp a single piece of paper for the meeting with their stamps.
To figure out who definitely didn't attend the meeting, our approach is pretty simple: we can look at the shape of each villager's stamp and see if ink covers all of the grooves of their stamp. If it is, they might have attended. If it's not, there's no way they could have stamped the paper - no treats for them!
Here's an example of what the pad may have looked like after the meeting:
Final paper without Christina's stamp
Let's analyze it. In this case:
- Albert's circle and plus lines are present, so he probably attended
- Bryan's circle and diagonal line are present, so he probably attended
- Christina's vertical line is present but her square is not, so she definitely did not attend
For Albert and Bryan, it's likely they attended the meeting, so we'll send them treats. We can conclude that Christina definitely did not attend the meeting because the upper-right square is not present, so we don't need to send her treats.
Cool! Instead of needing 1000 pads of paper for the villagers to stamp, we only needed 1 and we were able to prove that Christina didn't attend the meeting. This is the essence of a bloom filter!
Now to test this analogy a bit more, let's consider what happens if the paper looked like this:
Saturated paper
In this case we can't say anything with certainty about Christina! It's possible she attended, but it's also possible that 1 person with a lower-left square and 1 person with an upper-right square stamped the paper. Or someone with a completely solid stamp attended. We don't know.
In this case we have to send treats to everyone and we fail to save any treats - our approach fails.
We see a few things here:
- By using overlapping stamps we can make some statements about the membership of the set (i.e. the people who failed to attend the meeting).
- While we can sometimes prove that an element is not in the set, we can never prove it is in the set. There's always a chance stamps from others overlap with the one we're testing.
- We want our "stamps" to be both unique and limited. A stamp which is a solid square breaks our approach!
How it Works
Use-Cases and Pitfalls
Web Crawling
Cache Optimizations
Count-Min Sketch
Intuition
How It Works
Use-Cases and Pitfalls
Top K
Caching
HyperLogLog
Intuition
How it Works
Use-Cases and Pitfalls
Analytics and Metrics Systems
Security
Cache Sizing and Analysis
Conclusion

Schedule a mock interview
Meet with a FAANG senior+ engineer or manager and learn exactly what it takes to get the job.