Limited Time Offer: 40% off Hello Interview Premium 🎉

Back to Main

Learn System Design

In a Hurry

Core Concepts

Key Technologies

Patterns

Advanced Topics

Data Structures for Big Data

Question Breakdowns

Vote For New Content

Pricing

Become a Coach

Pricing Become a Coach Employers

Get Premium

Advanced Topics

Data Structures for Big Data

Learn about data structures for processing and storing large amounts of data in System Design interviews.

Intro

Some systems need to process absolutely massive amounts of data. These problems disproportionately show up in system design interviews as a way to try to stress test the depth of your knowledge. For these problems, simple scaling and adding more machines may be insufficient — you'll need to lean on specialized data structures to solve the problem.

But most system design interviews don't actually require you to implement data structures on the fly and aren't focused at this low level, so why bother? Well, in many cases utilizing a specialized data structure will actually change the shape of the solution and make the surrounding system fundamentally different. Understanding these differences is a super power and can help you design systems that are more efficient, scalable, and performant.

The catch is that these data structures are uncommon for a reason. Employing a bloom filter when a simple hash table would suffice is a red flag for an interviewer. So this is good knowledge to have, but don't forget to keep it simple!

In this deep dive we'll do a few things:

We'll build out your arsenal of potential approaches, expanding your view of available data structures.
We'll highlight specific scenarios where these data structures are commonly used.
Finally, we'll point out some common pitfalls and places where you might get tripped up or tempted to over-engineer.

Let's do it.

While occasionally useful for mid-level system design interviews, I would not recommend starting here if you're relatively new to system design. There is much higher ROI in mastering core concepts and key technologies first.

A candidate who nails the implementation of Count-Min Sketch but hasn't internalized things like caching, load balancing, and partitioning is going to struggle to design a performant system architecture. Don't stress out about these details!

Bloom Filter

Our first data structure is probably the most well-known and a great starting point for our discussion. A bloom filter is a probabilistic data structure which is analogous to a set (refresher: sets allow you to insert elements and check their membership).

The most common implementation of a set uses a hash table. With a hash table we can insert elements in O(1) time and check membership in O(1) time. Very fast! Unfortunately, we need to have memory for each possible element in a hash table, which is infeasible for very large sets. Imagine we had trillions of items and we wanted to keep track of all their ids.

Bloom filters help us here by making a compromise. They are dramatically more memory efficient than hash tables but relax the guarantees of a set. Bloom filters can tell you:

Whether an element is likely in a set, with some configurable probability
When an element is definitely not in a set

This sounds a bit unusual, so let's take a step back to build up some intuition before explaining how it works.

Intuition

Pretend we have a village with 1,000 people each with a very simple but unique stamp they use to mark documents. We want to keep track of who attended a meeting but we only have a single pad of paper and no pens, only stamps!

Villager Stamps

Because we're so nice, we want to send treats as a thank you to those who attended the meeting. But while we're generous and ok if we send a few treats to people who didn't actually attend, but we want to minimize the treats we waste on the absentees (boo them). To solve this problem we're going to have all of our villagers stamp a single piece of paper for the meeting with their stamps.

To figure out who definitely didn't attend the meeting, our approach is pretty simple: we can look at the shape of each villager's stamp and see if ink covers all of the grooves of their stamp. If it is, they might have attended. If it's not, there's no way they could have stamped the paper - no treats for them!

Here's an example of what the pad may have looked like after the meeting:

Final paper without Christina's stamp

Let's analyze it. In this case:

Albert's circle and plus lines are present, so he probably attended
Bryan's circle and diagonal line are present, so he probably attended
Christina's vertical line is present but her square is not, so she definitely did not attend

For Albert and Bryan, it's likely they attended the meeting, so we'll send them treats. We can conclude that Christina definitely did not attend the meeting because the upper-right square is not present, so we don't need to send her treats.

Cool! Instead of needing 1000 pads of paper for the villagers to stamp, we only needed 1 and we were able to prove that Christina didn't attend the meeting. This is the essence of a bloom filter!

Now to test this analogy a bit more, let's consider what happens if the paper looked like this:

Saturated paper

In this case we can't say anything with certainty about Christina! It's possible she attended, but it's also possible that 1 person with a lower-left square and 1 person with an upper-right square stamped the paper. Or someone with a completely solid stamp attended. We don't know.

In this case we have to send treats to everyone and we fail to save any treats - our approach fails.

We see a few things here:

By using overlapping stamps we can make some statements about the membership of the set (i.e. the people who failed to attend the meeting).
While we can sometimes prove that an element is not in the set, we can never prove it is in the set. There's always a chance stamps from others overlap with the one we're testing.
We want our "stamps" to be both unique and limited. A stamp which is a solid square breaks our approach!

Purchase Premium to Access

This article is only available to Premium users.

Purchase Premium

Schedule a mock interview

Meet with a FAANG senior+ engineer or manager and learn exactly what it takes to get the job.

Schedule a Mock Interview

Advanced Topics

Data Structures for Big Data

Intro

Bloom Filter

Intuition

Purchase Premium to Access

This article is only available to Premium users.

Schedule a mock interview

Questions

Compare Us

Links

Legal

Contact

Advanced Topics

Data Structures for Big Data

Intro

Bloom Filter

Intuition

How it Works

Use-Cases and Pitfalls

Web Crawling

Cache Optimizations

Count-Min Sketch

Intuition

How It Works

Use-Cases and Pitfalls

Top K

Caching

HyperLogLog

Intuition

How it Works

Use-Cases and Pitfalls

Analytics and Metrics Systems

Security

Cache Sizing and Analysis

Conclusion

Purchase Premium to Access

This article is only available to Premium users.

Schedule a mock interview

Questions

Compare Us

Links

Legal

Contact