ChatGPT

Real-time Updates

Managing Long Running Tasks

Published

By Evan King

hard

Understanding the Problem

💬 What is ChatGPT? Unless you've been living under a rock, you know what ChatGPT is. It's a conversational AI product where users send prompts in natural language and get responses streamed back from a large language model. Conversations are saved, so users can come back to an old chat and pick up right where they left off.

For this problem we treat the LLM as a black box that we call; we don't train it or operate its internals. All the design lives in the serving system around it: how we stream tokens back fast, how we schedule scarce GPUs, and how we keep cost sane as conversations grow. We'll also scope this to text in, text out only, with no images, audio, or video, and no editing or branching of existing messages.
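To make the black-box framing concrete, here is a minimal sketch of what the serving layer sees: a callable that takes a prompt and yields tokens, which the serving code forwards as they arrive. The names `TokenStream`, `fake_model`, and `stream_response` are hypothetical and exist only for illustration; they are not a real vendor API.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical stand-in for the black-box model: any callable that takes a
# prompt and yields tokens one at a time. This is an assumption for
# illustration, not a real model API.
TokenStream = Callable[[str], Iterable[str]]


def fake_model(prompt: str) -> Iterator[str]:
    """Toy model that echoes the prompt back one word at a time."""
    for word in prompt.split():
        yield word + " "


def stream_response(model: TokenStream, prompt: str) -> Iterator[Tuple[str, float]]:
    """Forward each token as it arrives, tagged with seconds since the request started."""
    start = time.monotonic()
    for token in model(prompt):
        yield token, time.monotonic() - start


if __name__ == "__main__":
    for token, elapsed in stream_response(fake_model, "hello from the serving layer"):
        print(f"{elapsed:.4f}s {token!r}")
```

The elapsed time attached to the first yielded token is the time to first token, which is the number a user actually feels.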

Functional Requirements

Core Requirements

Users should be able to send a prompt in a chat and receive an AI-generated response.
Users should be able to view past chats and resume a conversation, with the chat's prior context carried into the prompt.
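One way to picture "prior context carried into the prompt" is a function that assembles the model input from saved messages plus the new prompt. The sketch below assumes a simple keep-the-newest-messages policy and a crude one-token-per-word estimate; `Message`, `estimate_tokens`, and `build_prompt` are hypothetical names, and a real system would likely use the model's own tokenizer and a more careful strategy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    role: str      # "user" or "assistant"
    content: str


def estimate_tokens(text: str) -> int:
    """Crude assumption: one token per whitespace-separated word."""
    return len(text.split())


def build_prompt(history: List[Message], new_prompt: str, max_tokens: int) -> List[Message]:
    """Return the newest messages that fit in max_tokens, oldest first, ending with the new prompt."""
    latest = Message("user", new_prompt)
    budget = max_tokens - estimate_tokens(new_prompt)
    kept: List[Message] = []
    # Walk backwards from the most recent message and stop at the first
    # one that no longer fits in the remaining budget.
    for message in reversed(history):
        cost = estimate_tokens(message.content)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()
    return kept + [latest]
```

Dropping the oldest turns first is the simplest policy that keeps the input bounded; the cost is that the assistant can lose early details in a long chat.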

Below the line (out of scope)

Editing or branching existing messages.
Image, audio, or video input and output (text only).
Sharing chats or collaborating on a chat with other users.
Custom GPTs, tool / function calling, and web browsing.
Full-text search across a user's chat history.

Non-Functional Requirements

Non-functional requirements cover the properties of the system that matter to the user and the business.

ChatGPT feels broken if you stare at a blank screen for a few seconds after hitting enter, so latency to the first token matters more than total completion time. Because GPUs are the scarce, expensive resource here, the system has to be deliberate about who gets compute and when. ChatGPT serves a little over 200M daily active users at the time of writing, so that's the scale we'll design against.
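To turn the 200M daily active users figure into a request rate, a rough back-of-envelope calculation helps. Only the user count comes from the text; the prompts per user per day and the peak-to-average ratio below are assumptions chosen purely for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

daily_active_users = 200_000_000   # from the text ("a little over 200M")
prompts_per_user_per_day = 10      # assumption, for illustration only
peak_to_average_ratio = 3          # assumption, for illustration only

prompts_per_day = daily_active_users * prompts_per_user_per_day
average_prompts_per_second = prompts_per_day / SECONDS_PER_DAY
peak_prompts_per_second = average_prompts_per_second * peak_to_average_ratio

print(f"average: {average_prompts_per_second:,.0f} prompts/s")  # average: 23,148 prompts/s
print(f"peak:    {peak_prompts_per_second:,.0f} prompts/s")     # peak:    69,444 prompts/s
```

Under these assumptions the system handles tens of thousands of new generations per second, each of which holds GPU capacity for as long as its response streams.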

With that framing, here are the requirements that actually shape the design.

On This Page

Understanding the Problem

Functional Requirements

Non-Functional Requirements

The Set Up

Planning the Approach

Defining the Core Entities

API or System Interface

High-Level Design

1) Users should be able to send a prompt and receive an AI-generated response

2) Users should be able to view past chats and resume a conversation with context carried across turns

Potential Deep Dives

1) How do we stream tokens back fast, and keep the stream smooth?

2) How do we route and schedule generation requests across GPU workers?

3) How do we keep heavy users from monopolizing GPUs while giving paid tiers a better experience?

4) As conversations get longer, how do we control inference cost without making the assistant feel forgetful?

Some additional deep dives you might consider

What is Expected at Each Level?

Mid-level

Senior

Staff+
