Hello Interview
Metrics Monitoring

By Stefan Mai
hard


Understanding the Problem

📊 What is a Metrics Monitoring Platform? A metrics monitoring platform collects performance data (CPU, memory, throughput, latency) from servers and services, stores it as time-series data, visualizes it on dashboards, and triggers alerts when thresholds are breached. Think Datadog, Prometheus/Grafana, or AWS CloudWatch. This is infrastructure that engineers rely on to understand system health and respond to incidents.

Functional Requirements

We'll start our discussion by trying to tease out from our interviewer what the system needs to be able to do. Even though a metrics monitoring system is simple at face value (collect metrics, store them, query them, etc.), there's a lot of potential complexity here, so we want to narrow things down.
Core Requirements
  1. The platform should be able to ingest metrics (CPU, memory, latency, custom counters) from services
  2. Users should be able to query and visualize metrics on dashboards with filters, aggregations, and time ranges
  3. Users should be able to define alert rules with thresholds over time windows (e.g., "alert if p99 latency > 500ms for 5 minutes")
  4. Users should receive notifications when alerts fire (email, Slack, PagerDuty)
Below the line (out of scope):
  • Log aggregation and full-text search (separate concern)
  • Distributed tracing (spans, traces)
  • Anomaly detection via ML
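To make the alert-rule requirement concrete, here is a minimal sketch of how a rule like "alert if p99 latency > 500ms for 5 minutes" might be represented and evaluated. The names `AlertRule` and `should_fire` are hypothetical and not part of any particular monitoring product; the sketch assumes "for 5 minutes" means every sample in the window breaches the threshold and that we have history covering the whole window.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical rule shape: a metric name, a threshold, and a time window."""
    metric: str
    threshold: float
    window_seconds: int


def should_fire(rule: AlertRule, samples: List[Tuple[int, float]], now: int) -> bool:
    """Return True if every sample in the window exceeds the threshold.

    `samples` is a list of (timestamp_seconds, value) pairs for rule.metric.
    Assumption: we only fire when the data covers the entire window, so a
    service that just started reporting cannot trigger the alert immediately.
    """
    window_start = now - rule.window_seconds
    # Require at least one sample at or before the start of the window.
    has_full_history = any(ts <= window_start for ts, _ in samples)
    in_window = [value for ts, value in samples if window_start <= ts <= now]
    if not has_full_history or not in_window:
        return False
    return all(value > rule.threshold for value in in_window)


rule = AlertRule(metric="p99_latency_ms", threshold=500.0, window_seconds=300)
breaching = [(t, 650.0) for t in range(0, 301, 60)]   # one sample per minute
recovered = [(t, 420.0 if t == 180 else 650.0) for t in range(0, 301, 60)]
too_short = [(240, 650.0), (300, 650.0)]               # only 1 minute of data
```

Requiring the breach to hold across the whole window is one design choice among several; a real system might instead evaluate an aggregate (such as an average) over the window.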

Non-Functional Requirements

Metrics monitoring systems can range from a single team's services to a fleet of hundreds of thousands of servers. Getting a sense of the scale of the system is important because it will influence a bunch of the decisions we need to make.
We might ask our interviewer, or they might tell us, "we need to design for monitoring 500k servers". That's a big fleet. If each server emits 100 metric data points every 10 seconds, that's 5 million metrics per second at peak. Each data point is small (timestamp, value, labels) at roughly 100-200 bytes, but at that volume we're looking at somewhere between 0.5GB and 1GB per second of raw ingestion. That's the crux of the problem.
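The back-of-the-envelope estimate above can be checked with a few lines of arithmetic. This is only a sketch using the numbers from the text; the constant names are made up for illustration, and the per-day figure is an extrapolation that assumes the ingest rate is sustained around the clock.

```python
# Inputs taken from the scenario described in the text.
SERVERS = 500_000
METRICS_PER_SERVER = 100
EMIT_INTERVAL_SECONDS = 10
BYTES_PER_POINT_LOW = 100
BYTES_PER_POINT_HIGH = 200
SECONDS_PER_DAY = 86_400

# 500,000 servers * 100 metrics / 10 seconds = 5,000,000 points per second.
points_per_second = SERVERS * METRICS_PER_SERVER // EMIT_INTERVAL_SECONDS

# Raw ingest bandwidth in bytes per second (before compression or batching).
bytes_per_second_low = points_per_second * BYTES_PER_POINT_LOW    # 0.5 GB/s
bytes_per_second_high = points_per_second * BYTES_PER_POINT_HIGH  # 1 GB/s

# Extrapolation (assumes a sustained rate): raw bytes per day at the high end.
bytes_per_day_high = bytes_per_second_high * SECONDS_PER_DAY      # 86.4 TB/day

print(f"{points_per_second:,} points/s")
print(f"{bytes_per_second_low / 1e9:.1f}-{bytes_per_second_high / 1e9:.1f} GB/s raw")
print(f"{bytes_per_day_high / 1e12:.1f} TB/day raw at the high end")
```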
Core Requirements
  1. The system should scale to ingest 5M metrics per second from 500k servers
  2. Dashboard queries should return within seconds, even for queries spanning days or weeks
  3. Alerts should evaluate with low latency (< 1 minute from metric emission to alert firing)
  4. The system should be highly available. We can tolerate eventual consistency for dashboards, but alert evaluation should be reliable.
  5. The system should handle late or out-of-order data gracefully (network delays are common)
Below the line (out of scope):
  • Multi-region replication (would add complexity)
  • Strong consistency guarantees
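One common way to think about the late and out-of-order data requirement is to bucket points by their event timestamp (not arrival time) and accept stragglers up to some allowed lateness. The sketch below is an assumption-laden illustration, not the design this article necessarily lands on: `WindowedAggregator`, the 10-second bucket, and the 60-second lateness bound are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class WindowedAggregator:
    """Buckets values by event time and tolerates bounded lateness.

    Assumption: a point is accepted if its event time is no more than
    `allowed_lateness_seconds` behind the newest event time seen so far.
    """

    def __init__(self, bucket_seconds: int = 10, allowed_lateness_seconds: int = 60):
        self.bucket_seconds = bucket_seconds
        self.allowed_lateness_seconds = allowed_lateness_seconds
        self.buckets: Dict[int, List[float]] = defaultdict(list)
        self.max_event_time = 0
        self.dropped = 0

    def add(self, event_time: int, value: float) -> bool:
        """Record a point; return False if it arrived too late to accept."""
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.allowed_lateness_seconds
        if event_time < watermark:
            self.dropped += 1  # too old: count it so the loss is observable
            return False
        bucket_start = event_time - (event_time % self.bucket_seconds)
        self.buckets[bucket_start].append(value)
        return True

    def average(self, bucket_start: int) -> Optional[float]:
        """Average of a bucket, or None if the bucket has no points."""
        values = self.buckets.get(bucket_start)
        return sum(values) / len(values) if values else None


agg = WindowedAggregator(bucket_seconds=10, allowed_lateness_seconds=60)
agg.add(100, 1.0)
agg.add(120, 3.0)
accepted_late = agg.add(105, 3.0)     # out of order, but within 60s: accepted
agg.add(200, 5.0)
rejected_late = agg.add(110, 9.0)     # 90s behind the newest point: dropped
```

The trade-off is between completeness and freshness: a larger lateness bound captures more stragglers but delays the point at which a bucket can be treated as final.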
Here's how your requirements section might look on your whiteboard:
[Figure: Requirements]
The requirement for alerts to fire in under a minute might seem slow to some readers. "Wouldn't we want to fire as soon as the event happens?" Yes and no. In most production systems, it's difficult to see an event until you've accumulated enough data. Oftentimes alerts are (sensibly) set on moving averages or trends over time.
When you do want an alert to fire as soon as possible, the metric is often constructed in a very particular way. Amazon detects order drops (their most important event!) by alerting when the number of milliseconds since the last order breaches a threshold. Because they have so many orders, this number is very stable, which lets the alert fire almost instantaneously when something goes wrong.
Designing metrics like this is an art, but rarely the focus for an interview like this! While there may be interviewers who are insistent and want to build a streaming event system, that's not where we'll focus here.
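The "time since last event" idea described above can be sketched in a few lines. This is an illustrative toy rather than how any real company implements it; `LastEventMonitor` and the 50ms gap are hypothetical, and the sketch assumes timestamps come from a single consistent clock.

```python
from typing import Optional


class LastEventMonitor:
    """Tracks the most recent event and flags when the gap grows too large."""

    def __init__(self, max_gap_ms: int):
        self.max_gap_ms = max_gap_ms
        self.last_event_ms: Optional[int] = None

    def record(self, event_time_ms: int) -> None:
        """Note that an event (e.g. an order) happened at event_time_ms."""
        if self.last_event_ms is None or event_time_ms > self.last_event_ms:
            self.last_event_ms = event_time_ms

    def is_breached(self, now_ms: int) -> bool:
        """True if the time since the last event exceeds the allowed gap.

        Assumption: with no events yet there is no baseline, so we do not alert.
        """
        if self.last_event_ms is None:
            return False
        return now_ms - self.last_event_ms > self.max_gap_ms


monitor = LastEventMonitor(max_gap_ms=50)
before_any_event = monitor.is_breached(now_ms=1_000)
monitor.record(1_000)
healthy = monitor.is_breached(now_ms=1_040)   # 40ms gap, within the limit
stalled = monitor.is_breached(now_ms=1_051)   # 51ms gap, over the limit
```

This only works when events are frequent enough that the normal gap is small and stable, which is why the text notes it suits very high-volume signals.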

The Set Up

Planning the Approach

Defining the Core Entities

Data Flow

API or System Interface

High-Level Design

1) The platform can ingest metrics from services

2) Users can query and visualize metrics on dashboards

3) Users can define alert rules with thresholds

4) Users receive notifications when alerts fire

Potential Deep Dives

1) How do we serve low-latency dashboard queries over weeks of data?

2) How do we reduce alert latency below 1 minute?

3) How do we ensure high availability during spikes and failures?

4) How do we handle cardinality explosion?

What is Expected at Each Level?

Mid-level

Senior

Staff+



© 2026 Optick Labs Inc. All rights reserved.