How ChatGPT Works

There is a moment, maybe you have felt it yourself, where you type a question into ChatGPT and within seconds you get a response that feels remarkably thoughtful. It does not just return a keyword match. It understands context, it reasons through problems, it can write code and explain concepts and help you draft emails. And it does all of this for millions of people simultaneously, in real time.


If you are an engineer looking at that and thinking “okay, but what is actually happening behind that text box?”, this post is for you.

We are going to go deep. Not just “there’s a transformer model and it predicts tokens” deep. We are going to talk about the full engineering stack: how prompts flow through distributed systems, how GPUs communicate across data centers, how the inference pipeline is optimized for latency, how memory and context are managed, and what tradeoffs the engineering teams at OpenAI are navigating every single day. By the end, you should have a genuine mental model of how a system like ChatGPT is actually built.

What Makes LLM Infrastructure So Hard

Before we get into diagrams and components, it helps to appreciate the scale of what we are dealing with.

GPT-4 is estimated to have somewhere in the range of hundreds of billions of parameters. Each parameter is a floating point number. Just loading a model of that size into GPU memory requires dozens of high-end GPUs. And that is just for one instance of the model, serving one request at a time.

ChatGPT serves millions of requests per day. Each request might require generating hundreds or thousands of tokens, each of which involves a forward pass through the entire model. The memory bandwidth requirements, the compute requirements, the networking requirements between GPUs - it all compounds into one of the most computationally expensive systems ever built for a consumer product.

There is also the latency problem. When you stream text in real time, users expect to see tokens appearing within a second or two. You cannot batch everything offline. You need fast inference, low network overhead, and efficient memory management all at once.

Then there is the reliability problem. GPUs fail. Nodes go offline. Data centers lose connectivity. Your serving infrastructure needs to handle all of that gracefully without dropping user conversations mid-stream.

And finally there is the safety problem. You are exposing a very capable model to hundreds of millions of internet users. Some of them will try to extract harmful content, manipulate the system, or misuse the API. Your moderation layer needs to catch those cases without introducing unacceptable latency or false positives that frustrate legitimate users.

All of this happens simultaneously, every second, at global scale. That is the engineering problem.

Core Features of ChatGPT

Before going architectural, it is worth being precise about what ChatGPT actually does, because each capability has engineering implications.

  • Conversational AI with multi-turn context: the model maintains the thread of a conversation across multiple exchanges, which requires managing conversation history efficiently
  • Prompt understanding and instruction following: the model parses natural language instructions and tries to execute them faithfully
  • Token-by-token streaming: responses arrive progressively, not all at once, which requires a streaming infrastructure from GPU all the way to the browser
  • Tool usage: the model can decide to call external tools like web search or code execution and integrate the results
  • File and image processing: multimodal inputs need specialized preprocessing before they reach the language model
  • Retrieval augmented generation: the system can fetch relevant documents from a knowledge base and incorporate them into the response
  • Code generation and execution: a sandboxed code interpreter allows the model to run and test code
  • Voice: speech-to-text on input and text-to-speech on output, adding audio processing pipelines on both ends

Each of these is not a feature checkbox. Each is a subsystem with its own latency budget, failure modes, and scaling concerns.

High-Level System Architecture

Let us start with the full picture and then zoom into each layer.

flowchart TD; A[User Browser or App]; B[CDN Edge Layer]; C[API Gateway]; D[Auth and Rate Limiter]; E[Orchestrator Service]; F[Safety Classifier Input]; G[Prompt Processor]; H[Retrieval System]; I[Model Serving Layer]; J[GPU Cluster]; K[KV Cache]; L[Safety Classifier Output]; M[Response Streamer]; N[Client]; A --> B; B --> C; C --> D; D --> E; E --> F; F --> G; G --> H; H --> I; I --> J; J --> K; K --> I; I --> L; L --> M; M --> N;

When a user types a message and hits enter, the request first hits a CDN edge node. This is not just for caching static assets. The CDN terminates TLS close to the user, reducing round-trip latency. It also acts as a shield against DDoS traffic, absorbing volumetric attacks before they reach the application layer.

From the CDN, the request reaches the API gateway. The gateway handles routing, request validation, protocol translation, and exposes a clean interface for multiple clients (web, mobile, API). It is also where authentication tokens are validated and rate limiting decisions are enforced.
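Rate limiting at the gateway is often implemented with a token bucket. The following is a minimal sketch of that idea, not a description of OpenAI's actual limiter; the `TokenBucket` class, its parameters, and the injectable clock are illustrative assumptions.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In a real deployment there would be one bucket per API key or user, and the state would live in a shared store rather than in process memory.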

After the gateway, requests enter the orchestrator. Think of this as the brain of the request handling layer. It coordinates all the downstream services: it decides whether retrieval is needed, it manages the conversation state, it routes to the appropriate model version, and it sequences tool calls if the model decides to use tools.

The model serving layer is where the actual inference happens. This is backed by GPU clusters running the LLM. The inference layer communicates tightly with a KV cache (more on this later), returns tokens, and streams them back up the stack.

Before returning tokens to the user, a safety classifier runs on the output to catch policy violations. Then the response streamer sends tokens progressively to the client using server-sent events or WebSockets.
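To make the server-sent events option concrete, here is a rough sketch of how tokens could be framed for the wire. The `sse_events` function, the JSON payload shape, and the `[DONE]` sentinel are assumptions chosen for illustration; only the `data:` line followed by a blank line comes from the SSE format itself.

```python
import json


def sse_events(tokens):
    """Yield each token as a server-sent event frame, then a final done marker."""
    for tok in tokens:
        # Each SSE frame is a "data:" line terminated by a blank line.
        yield f"data: {json.dumps({'token': tok})}\n\n"
    yield "data: [DONE]\n\n"
```

A web framework would write each yielded frame to the open HTTP response as soon as the model produces the token.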

The Request Lifecycle in Detail

Understanding how a single request flows helps clarify why each component exists.

sequenceDiagram; participant U as User; participant G as API Gateway; participant O as Orchestrator; participant P as Prompt Processor; participant M as Model Server; participant S as Safety Layer; U ->> G: POST chat message; G ->> O: Validated request; O ->> P: Conversation context; P ->> M: Tokenized prompt; M ->> M: Autoregressive generation; M ->> S: Output tokens; S ->> U: Streamed safe tokens;

Notice that every step in this pipeline adds latency. The CDN, gateway, authentication, orchestration, prompt processing, inference, safety checking, streaming - it all stacks up. For a system promising near-real-time responses, every millisecond matters. This is why companies invest heavily in co-locating services, optimizing network paths, and reducing serialization overhead.

Prompt Processing Pipeline

This is one of the most underappreciated parts of the system. Before a single GPU cycle is spent on inference, your prompt goes through several transformation stages.

flowchart TD; A[Raw User Message]; B[Input Sanitization]; C[Tokenization]; D[System Prompt Injection]; E[Conversation History Assembly]; F[Context Window Check]; G[Context Truncation if needed]; H[Final Token Sequence]; I[Model Inference]; A --> B; B --> C; C --> D; D --> E; E --> F; F --> G; G --> H; H --> I;

Tokenization is the first real transformation. The text is broken into tokens using a byte-pair encoding scheme. The word “unbelievable” might become three tokens. A short Python function might be fifty tokens. This matters because the model does not see characters or words; it sees integer IDs from a vocabulary of roughly 100,000 tokens. The tokenizer is a critical piece of infrastructure, and it runs on the CPU, not the GPU, so it needs to be fast.
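A toy version of byte-pair encoding shows how one word becomes a few integer IDs. This is a simplified sketch: the merge table and vocabulary below are invented for the example, whereas a production tokenizer learns its merges from data and works on bytes rather than characters.

```python
def bpe_encode(word, merges):
    """Apply an ordered list of merge rules to a word, starting from characters."""
    symbols = list(word)
    for left, right in merges:
        i = 0
        merged = []
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table; a real tokenizer learns tens of thousands of merges.
MERGES = [("u", "n"), ("b", "e"), ("l", "i"), ("be", "li"), ("e", "v"),
          ("a", "b"), ("l", "e"), ("ab", "le"), ("beli", "ev")]
VOCAB = {"un": 0, "believ": 1, "able": 2}

tokens = bpe_encode("unbelievable", MERGES)   # ['un', 'believ', 'able']
ids = [VOCAB[t] for t in tokens]              # [0, 1, 2]
```

The model only ever sees the list in `ids`; the strings exist purely on the tokenizer side.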

System prompt injection happens before your message. The model needs to know how to behave, what persona to adopt, what tools it has access to, and what policies to follow. This system prompt can be hundreds of tokens long and it is prepended to every request. At scale, this represents a non-trivial fraction of inference compute.

Conversation history assembly is where multi-turn context is added. Every message in the conversation, both user and assistant turns, is concatenated into a single long token sequence. This is what allows the model to maintain context. But it also means that long conversations become expensive. A conversation with 20 exchanges might have 5,000 tokens of history, and all of that has to be processed on every new turn.

Context window management is the hard constraint. GPT-4 supports context windows of up to 128,000 tokens in some configurations. But bigger context means more memory, more compute, and more latency. For typical conversations, the system tries to keep context manageable. When it approaches the limit, it has to make decisions: truncate old messages, summarize them, or reject new turns.
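The simplest of those decisions, dropping the oldest turns, can be sketched as follows. The `fit_context` function and its arguments are hypothetical; it assumes the system prompt and the newest message must always survive and that history is dropped a whole turn at a time.

```python
def fit_context(system_tokens, turns, new_message_tokens, max_tokens):
    """Keep the system prompt and new message; drop oldest turns until it fits.

    `turns` is a list of token lists, oldest first. Returns the flat sequence.
    """
    budget = max_tokens - len(system_tokens) - len(new_message_tokens)
    if budget < 0:
        raise ValueError("system prompt and new message alone exceed the window")
    kept = []
    # Walk history from newest to oldest, keeping whole turns while they fit.
    for turn in reversed(turns):
        if len(turn) > budget:
            break
        kept.append(turn)
        budget -= len(turn)
    history = [tok for turn in reversed(kept) for tok in turn]
    return system_tokens + history + new_message_tokens
```

For example, with a six-token window, `fit_context([1, 2], [[10, 11, 12], [20, 21], [30]], [99], max_tokens=6)` keeps the two most recent turns and drops the oldest one.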

The Transformer Architecture

To understand inference, you need a working model of the transformer. Not the full mathematics, but enough to reason about bottlenecks.

A transformer is a sequence model that takes a sequence of tokens as input and produces a probability distribution over the vocabulary for the next token. It does this by computing attention - a mechanism that lets every token look at every other token in the sequence and weight the information from each.

flowchart TD; A[Input Token IDs]; B[Embedding Layer]; C[Positional Encoding]; D[Multi-Head Self-Attention Block]; E[Feed-Forward Layer]; F[Layer Normalization]; G[Repeat N Layers]; H[Output Linear Layer]; I[Softmax over Vocabulary]; J[Next Token Probability]; A --> B; B --> C; C --> D; D --> E; E --> F; F --> G; G --> H; H --> I; I --> J;

Self-attention is the core innovation. For each token in the sequence, the model computes three vectors: query, key, and value. It then compares the query of the current token against the keys of the tokens it is allowed to attend to (in a GPT-style decoder, itself and all earlier tokens), producing attention scores. These scores are normalized with a softmax and used to weight the values, and the result is a context-aware representation of the current token.
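The query, key, and value computation can be written out in a few lines. This is a single-head sketch in plain Python, assuming the causal masking used by decoder-only models, where each token attends only to itself and earlier positions; real implementations use batched matrix operations on the GPU.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of equal-length vectors, one per token.
    """
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Token i may attend only to positions 0..i (itself and earlier tokens).
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        weights = softmax(scores)
        out = [sum(w * values[j][c] for j, w in enumerate(weights))
               for c in range(len(values[0]))]
        outputs.append(out)
    return outputs
```

The first token can only attend to itself, so its output is exactly its own value vector; later tokens get a weighted mix.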

Multi-head attention runs this process in parallel across multiple subspaces. Different heads learn to attend to different relationships: some might focus on syntactic dependencies while others focus on semantic similarity.

Why this replaced RNNs: Recurrent networks process tokens one at a time, sequentially. This makes gradients vanish or explode over long sequences and prevents parallelization across the sequence during training. Transformers compute attention over the entire sequence at once, enabling massively parallel computation on GPUs. This is the architectural decision that made large language models tractable.

The quadratic complexity of attention is the famous problem. If your sequence is N tokens long, computing attention requires O(N²) operations, because every token is scored against every other token. For a 128,000 token context window, that is on the order of 16 billion query-key scores per head per layer. Approximate attention methods and efficient implementations help, but this remains one of the fundamental scaling challenges.

Training Pipeline

Training is a separate world from inference, but it shapes everything about how the inference system is designed.

Pretraining is the foundation. The model is trained on a massive corpus of internet text, books, and code, to predict the next token in a sequence. This is self-supervised learning (the labels come from the text itself) at an almost incomprehensible scale. GPT-4 pretraining likely consumed thousands of A100 GPUs running for months.

Supervised Fine-Tuning comes next. Human trainers write example conversations showing the ideal model behavior. The model is fine-tuned on these examples to learn to follow instructions and produce helpful responses.

RLHF (Reinforcement Learning from Human Feedback) is what makes ChatGPT feel usable rather than just capable. Human raters compare pairs of model responses and indicate which is better. These preferences train a reward model that can score responses. The language model is then optimized using reinforcement learning to produce responses that score highly on this reward model.

flowchart TD; A[Raw Training Data]; B[Pretraining on Next Token Prediction]; C[Supervised Fine-Tuning on Demonstrations]; D[Human Preference Collection]; E[Reward Model Training]; F[PPO Optimization Against Reward Model]; G[Production Model]; A --> B; B --> C; C --> D; D --> E; E --> F; F --> G;
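The reward model step is commonly trained with a pairwise preference loss. The sketch below shows that loss for a single comparison; it is a widely used formulation, not a confirmed detail of ChatGPT's training, and the function name is illustrative.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise (Bradley-Terry style) loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = reward_chosen - reward_rejected
    # log1p(exp(-margin)) equals -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))
```

The loss is small when the reward model scores the human-preferred response well above the rejected one, and large when it ranks them the wrong way round.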

Distributed training is unavoidable at this scale. A single model does not fit on a single GPU. Training uses a combination of data parallelism (different GPUs process different batches), tensor parallelism (different GPUs hold different parts of the model), and pipeline parallelism (different layers run on different GPUs). Coordinating gradient synchronization across thousands of GPUs requires high-bandwidth interconnects like NVLink within nodes and InfiniBand across nodes.

A single training run is dramatically more expensive than serving any individual request. A training run for a frontier model can cost tens or hundreds of millions of dollars. This is why fine-tuning and RLHF iterations happen on much smaller scales and why checkpointing is critical.

Token Generation and the Inference System

This is where the rubber meets the road. When you send a message, here is what happens on the GPU side.

The model receives the complete token sequence: system prompt plus conversation history plus your new message. It processes this entire sequence in one forward pass (this is called the prefill phase), and at the end it produces the first token of the response. Then, for each subsequent token, it runs another forward pass using the previous output as additional context.

flowchart TD; A[Tokenized Prompt]; B[Prefill Forward Pass]; C[First Output Token]; D[Append Token to Sequence]; E[Incremental Forward Pass]; F[Sample Next Token]; G[End of Sequence Token]; H[Stop Generation]; A --> B; B --> C; C --> D; D --> E; E --> F; F --> G; G --> H; F --> D;

Autoregressive generation is this loop of generating one token at a time. It is inherently sequential. You cannot generate token 50 until you have token 49. This creates a fundamental latency floor. A 500-token response with a 50ms per-token generation rate takes 25 seconds. That is why optimizing per-token latency is so critical.
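The loop itself is short. In this sketch, `next_token` stands in for a full forward pass plus sampling, and `toy_model` is a made-up function used only so the example runs; neither reflects a real model API.

```python
def generate(prompt_ids, next_token, eos_id, max_new_tokens):
    """Greedy autoregressive loop: one model call per generated token."""
    sequence = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token(sequence)   # stands in for a forward pass + sampling
        sequence.append(token)
        if token == eos_id:
            break
    return sequence[len(prompt_ids):]


def toy_model(sequence):
    """Hypothetical stand-in for the model: counts up, then emits EOS (id 0)."""
    return sequence[-1] + 1 if sequence[-1] < 5 else 0


output = generate([3], toy_model, eos_id=0, max_new_tokens=10)   # [4, 5, 0]
# Latency floor from the example above: 500 tokens at 50 ms each.
total_seconds = 500 * 0.050
```

Because each iteration depends on the token produced by the previous one, the only ways to shorten the total are to make each step faster or to produce more than one token per step.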

The KV Cache is the most important optimization in LLM inference. During the attention computation, each token produces a key vector and a value vector at every layer. For the tokens already in the context (the prompt and previous output), these vectors are the same on every generation step. The KV cache stores them so they do not have to be recomputed. Without the cache, each step would recompute keys and values for all N tokens in the sequence; with it, only the new token's key and value are computed, which is O(1) projection work per step. The new token's query still has to attend over all N cached entries, so the attention step itself remains O(N). The tradeoff is memory. Large context windows with the KV cache consume enormous GPU memory.
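The memory cost can be estimated with simple arithmetic: two tensors (keys and values) per layer, per head, per token. The model shape below is hypothetical and chosen only to make the numbers concrete; it is not GPT-4's actual configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes of KV cache for one sequence: keys and values at every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
per_token = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=1)
full_window = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128,
                             seq_len=128_000)
```

Under these assumptions the cache costs about 2.6 MB per token, and a single sequence that fills a 128,000-token window needs roughly 336 GB, more than four 80GB GPUs can hold.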

Sampling strategies control how the next token is selected from the probability distribution:

  • Temperature scales the distribution. High temperature means more random and creative. Low temperature means more deterministic and focused.
  • Top-k restricts sampling to only the K most likely tokens, preventing very low-probability tokens from being selected.
  • Top-p (nucleus sampling) restricts to the smallest set of tokens whose cumulative probability exceeds p. This is more adaptive than top-k.
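The three strategies compose naturally into one sampling function. This is a plain-Python sketch rather than any library's implementation; `sample_next` and its parameters are illustrative, and it assumes `temperature` is greater than zero.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample a token id from raw logits with temperature, top-k, and top-p."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    # (probability, token id) pairs, most likely first.
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for prob, idx in ranked:
            kept.append((prob, idx))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept
    # random.choices accepts unnormalized weights, so no renormalization needed.
    probs = [p for p, _ in ranked]
    ids = [i for _, i in ranked]
    return rng.choices(ids, weights=probs, k=1)[0]
```

Setting `top_k=1` reduces this to greedy decoding, and a very low temperature has nearly the same effect because almost all probability mass lands on the top token.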

Speculative decoding is an advanced optimization. A smaller, faster draft model generates candidate tokens, and the large model verifies them in parallel. If the draft was correct, you get multiple tokens in the time it would take the large model to generate one. This can significantly reduce generation latency without changing output quality.
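A simplified greedy variant makes the accept-or-correct logic visible. This sketch assumes both models pick tokens deterministically; production systems use a rejection-sampling scheme over probabilities and verify all draft positions in one batched forward pass. The toy `draft` and `target` functions are invented for the example.

```python
def speculative_step(sequence, draft_next, target_next, num_draft=4):
    """One round of greedy speculative decoding; returns newly accepted tokens.

    `draft_next` and `target_next` map a token sequence to the next token id.
    """
    # 1. The cheap draft model proposes several tokens autoregressively.
    proposal = []
    for _ in range(num_draft):
        proposal.append(draft_next(sequence + proposal))
    # 2. The target model checks each position. In a real system these checks
    #    are a single batched forward pass; here they are sequential calls.
    accepted = []
    for tok in proposal:
        expected = target_next(sequence + accepted)
        if tok != expected:
            accepted.append(expected)   # take the target's token and stop
            return accepted
        accepted.append(tok)
    # 3. All drafts matched, so the target's own next token is appended too.
    accepted.append(target_next(sequence + accepted))
    return accepted


def target(seq):
    """Toy target model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft(seq):
    """Toy draft model: agrees with the target except right after token 3."""
    return 9 if seq[-1] == 3 else (seq[-1] + 1) % 10
```

Every accepted token is one the target model would have produced anyway, which is why the output is unchanged; the gain comes from accepting several tokens per target-model pass.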

Continuous batching is another key technique. Traditional batching waits for multiple requests to arrive and processes them together. Continuous batching processes requests incrementally, allowing new requests to be added to a running batch and completed requests to leave without waiting for the whole batch to finish. This dramatically improves GPU utilization for variable-length requests.
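A small simulation illustrates the scheduling idea. It is a sketch under strong simplifications: each request is just a count of tokens to generate (at least one), every running request produces one token per step, and prefill cost is ignored. The function name and return shape are illustrative.

```python
from collections import deque


def continuous_batching(requests, max_batch):
    """Simulate decode steps; `requests` maps request id -> tokens to generate.

    Returns a dict of request id -> step at which the request finished.
    """
    waiting = deque(requests.items())
    running = {}            # request id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit new requests into any free batch slots before each step.
        while waiting and len(running) < max_batch:
            req_id, length = waiting.popleft()
            running[req_id] = length
        step += 1
        # One decode step produces one token for every running request.
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] == 0:
                del running[req_id]     # the slot frees up immediately
                finished_at[req_id] = step
    return finished_at
```

With `continuous_batching({"a": 2, "b": 5, "c": 2}, max_batch=2)`, request `c` takes over the slot freed by `a` and finishes at step 4. A static batch would hold `c` back until `b` finished at step 5, so `c` would not complete until step 7.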

Distributed GPU Infrastructure

Let us talk about how the actual compute cluster is organized.

A single forward pass for a GPT-4 scale model requires more memory than any single GPU has. An A100 has up to 80GB of HBM. A model with hundreds of billions of parameters needs hundreds of gigabytes for the weights alone in 16-bit precision (2 bytes per parameter), and the KV cache and activations add more on top. The model has to be split across many GPUs.
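The weights-only footprint follows directly from parameter count and precision. The parameter counts below are hypothetical round numbers, not published figures for any specific model, and the GPU count is a lower bound that ignores everything except the weights.

```python
import math


def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params, gpu_memory_gb=80, bytes_per_param=2):
    """Lower bound on GPU count; ignores KV cache, activations, and overhead."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)


# Hypothetical parameter counts for illustration only.
for params in (175e9, 500e9, 1e12):
    print(f"{params:.0e} params: {weight_memory_gb(params):.0f} GB, "
          f">= {min_gpus_for_weights(params)} GPUs")
```

In practice the real number of GPUs per replica is higher, because the KV cache for concurrent requests often needs as much room as the weights do.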

flowchart TD; A[Inference Request]; B[Orchestrator Node]; C[GPU Node 1 Layers 1 to 10]; D[GPU Node 2 Layers 11 to 20]; E[GPU Node 3 Layers 21 to 30]; F[GPU Node 4 Layers 31 to 40]; G[Response Assembly]; A --> B; B --> C; C --> D; D --> E; E --> F; F --> G;

Tensor parallelism splits individual weight matrices across multiple GPUs. Each GPU holds a shard of each layer and they communicate via all-reduce operations after each computation step. This is fast because it uses NVLink, which has very high bandwidth within a server node.

Pipeline parallelism assigns different layers to different GPUs or nodes. The model flows through the pipeline with each stage doing its computation and passing activations to the next. The challenge is pipeline bubbles - GPUs waiting idle while the pipeline flushes.

Data parallelism runs multiple copies of the entire model on different GPU clusters, each handling different user requests. This is how you scale throughput: more replicas, more concurrent requests served.

NVLink and InfiniBand are the networking infrastructure. NVLink connects GPUs within a node with very high bandwidth: around 600 GB/s bidirectionally per GPU for third-generation NVLink (A100) and around 900 GB/s for fourth-generation NVLink (H100). InfiniBand connects nodes across a cluster with lower bandwidth than NVLink, but with higher throughput and lower latency than typical data center Ethernet. The communication overhead between GPUs is often the limiting factor in inference latency.

| Parallelism Type | What is Split | Communication Needed | Best Use Case | Main Bottleneck |
|---|---|---|---|---|
| Tensor Parallelism | Weight matrices within each layer | All-reduce per layer, high frequency | Single large model that exceeds one GPU memory | NVLink bandwidth |
| Pipeline Parallelism | Model layers across stages | Activation passing between stages | Very deep models with many layers | Pipeline bubble idle time |
| Data Parallelism | Input batches across replicas | Gradient sync during training only | Scaling throughput, multiple concurrent requests | Memory per replica |
| Expert Parallelism (MoE) | Expert networks across GPUs | Token routing between experts | Mixture of experts models | Load imbalance between experts |

Retrieval Augmented Generation

The model’s knowledge has a training cutoff. Events after that date, proprietary company data, real-time information - none of it is in the weights. RAG is how you give the model access to external knowledge without retraining.

flowchart TD; A[User Query]; B[Query Embedding Model]; C[Vector Database Search]; D[Retrieved Document Chunks]; E[Context Assembly]; F[LLM with Retrieved Context]; G[Grounded Response]; A --> B; B --> C; C --> D; D --> E; A --> E; E --> F; F --> G;

Embeddings are the foundation. Text is converted into dense vector representations where semantically similar content lands close together in the vector space. A query about “machine learning optimization techniques” will have a similar embedding to a document about “gradient descent methods”, even if the exact words differ.

Vector databases like Pinecone, Weaviate, or Qdrant store these embeddings and support approximate nearest neighbor search. They are optimized for high-dimensional vector lookup at low latency. When a query comes in, the vector database finds the K most similar document chunks in milliseconds.
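The lookup a vector database performs can be illustrated with an exact brute-force search. Real systems use approximate nearest neighbor indexes to avoid scanning every vector; the tiny three-dimensional embeddings here are made up for the example and do not come from any embedding model.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, index, k=2):
    """Exact nearest-neighbor search; `index` maps chunk text -> embedding."""
    scored = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in scored[:k]]


# Made-up 3-dimensional embeddings; real ones have hundreds or thousands of dims.
INDEX = {
    "gradient descent methods": [0.9, 0.1, 0.0],
    "learning rate schedules": [0.8, 0.3, 0.1],
    "sourdough baking tips": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.2, 0.0], INDEX, k=2)
```

The query vector points in nearly the same direction as the two optimization-related chunks, so those are returned and the unrelated chunk is left out.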

Chunking strategy matters more than people realize. If you chunk documents too small, individual chunks lose context. Too large, and you waste tokens on irrelevant content. A common approach is overlapping chunks of 512 to 1024 tokens with a 128-token overlap so that context at chunk boundaries is not lost.
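Overlapping chunking is easy to express over a token list. This is a minimal sketch that splits purely by position; the function name and defaults are illustrative, and real pipelines often also respect sentence or section boundaries.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=128):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break   # the last chunk already reaches the end of the document
    return chunks
```

With `chunk_size=4` and `overlap=2`, ten tokens become four chunks, each sharing its last two tokens with the start of the next one.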

The RAG tradeoff vs fine-tuning: fine-tuning bakes knowledge into model weights. This gives faster inference since no retrieval step is needed, but updating the knowledge requires retraining. RAG keeps knowledge in a database that can be updated without touching the model. For frequently changing information like news or company documents, RAG is almost always the right choice.

Memory and Context Management

A multi-turn conversation creates an interesting engineering problem. How do you maintain coherent context over many exchanges without your token budget exploding?

Short-term memory is just the conversation history in the context window. For a typical conversation of ten to twenty exchanges, this works fine. The full history fits in the context window and the model has access to everything.

The token explosion problem hits when conversations go long. Each new message appends to the history. A hundred-message conversation might have 20,000 tokens of history. At scale with millions of users, storing and processing this context is expensive both in memory and compute.

Context compression strategies include:

  • Summarization: periodically summarize older parts of the conversation and replace the raw text with the summary. You lose some detail but save tokens.
  • Sliding window: only keep the N most recent messages. Simple but the model loses early context.
  • Selective retention: use a smaller model to identify the most important exchanges and discard the rest.
  • Hierarchical memory: recent messages in full, older messages summarized at increasing levels of compression.

Session storage is the infrastructure layer for this. Conversation history needs to be persisted between requests. A key-value store like Redis is the typical choice: fast reads, TTL-based expiration for old sessions, and low latency. For very long conversations, you might need to offload to a distributed store like DynamoDB.

Caching and Performance Optimization

Caching in an LLM system operates at multiple levels and each level addresses a different bottleneck.

KV cache we covered in the inference section. This is GPU-level caching of attention key-value pairs. It is the single biggest optimization in LLM inference and is non-negotiable for production systems.

Prompt caching is a newer optimization. When many requests share a common prefix (the system prompt, for example), you can cache the KV activations for that prefix and reuse them across requests. OpenAI and Anthropic have introduced this in their APIs. The savings are significant: a 1000-token system prompt that is shared across a million requests, if cached, saves enormous prefill computation.

Semantic caching takes this further. If you can identify that two queries are semantically similar enough to return the same response, you can serve the cached response directly. This works well for frequently asked common questions but requires a fast embedding lookup and careful cache invalidation to avoid stale responses.

CDN caching applies to the API layer for public endpoints. Static assets, documentation, and non-personalized API responses can be cached at edge nodes globally.

| Cache Type | Where It Lives | What It Caches | Latency Savings | Invalidation Challenge |
|---|---|---|---|---|
| KV Cache | GPU HBM memory | Attention key-value pairs per token | Eliminates recomputation of past tokens | Eviction on memory pressure |
| Prompt Cache | GPU memory, shared across requests | Precomputed activations for common prefixes | Skips prefill for shared prefix | Prefix change invalidates all |
| Semantic Cache | Redis or vector DB | Full responses for similar queries | Full response, zero inference cost | Hard to know when stale |
| Session Cache | Redis | Conversation history per user | Avoids DB round trip per message | TTL expiry, user logout |
| CDN Cache | Edge nodes globally | Static assets, public API responses | Eliminates origin server round trip | Cache-control headers |

Safety and Moderation Systems

Building a consumer AI product means you are going to face adversarial users, well-intentioned misuse, and genuinely harmful requests. Safety systems have to address all of these without killing the experience for the vast majority of legitimate users.

Input moderation runs before inference. A fast classifier (much smaller than the LLM itself) evaluates the incoming prompt and flags requests that violate policy. This classifier is trained on examples of harmful content and runs in milliseconds. The tradeoff: a too-aggressive classifier blocks legitimate requests. A too-permissive one lets harmful content through. Finding the right threshold is a continuous calibration problem.

Jailbreak prevention is an arms race. Users craft creative prompts to trick the model into ignoring its safety training, often through roleplay framing, hypothetical scenarios, or adversarial constructions. Defenses include fine-tuning the model to resist these patterns, pattern matching on known jailbreak techniques, and output monitoring.

Output moderation runs after inference. The generated response is checked before being sent to the user. This is a second line of defense and catches cases where the input classifier was bypassed. Running it adds latency, typically another 50 to 100ms.

Policy systems translate human values into classifier labels. What counts as harmful? This is not a technical question, it is a policy question, and the answer varies by jurisdiction, culture, and use case. The infrastructure needs to support multiple policy configurations, for example different rules for an enterprise API customer versus a consumer product.

The false positive problem is real and costly. If your safety classifier is too aggressive, it flags legitimate medical questions, fiction writing, security research, and academic discussions. Each false positive is a user who could not get help and went to a competitor. This drives investment in more nuanced classifiers that understand context rather than pattern matching on surface forms.

Multimodal Systems

ChatGPT now processes images, audio, and documents alongside text. Each modality requires its own pipeline.

Image understanding uses a vision transformer (ViT) or similar vision encoder to convert an image into a sequence of embeddings. These embeddings are projected into the same space as text token embeddings so they can be fed into the language model alongside text. The model then attends over both text and image embeddings together.

OCR pipelines extract text from documents and images before passing them to the language model. Dedicated OCR models handle this, and the extracted text is inserted into the context alongside the original image embeddings.

Audio input goes through a speech-to-text model (OpenAI uses Whisper) that transcribes audio into text. The transcribed text is then processed through the normal text pipeline. For voice conversations with low latency, this transcription needs to be streaming and near-real-time.

Text-to-speech on the output side converts the generated text into audio using a neural TTS model. The challenge is that you cannot wait for the full response before starting TTS. You need to stream text through TTS as it is generated, which requires careful chunking at sentence or phrase boundaries.
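Chunking at sentence boundaries can be sketched as a generator that buffers streamed tokens. This is a deliberately naive version: it splits on any `.`, `!`, or `?`, so abbreviations and decimal numbers would trigger early flushes in practice. The function name is illustrative.

```python
def sentence_chunks(token_stream, enders=".!?"):
    """Group streamed text tokens into sentence-sized chunks for TTS."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush as soon as the buffered text ends a sentence.
        if buffer.rstrip().endswith(tuple(enders)):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()    # flush whatever remains when the stream ends
```

Each yielded chunk can be handed to the TTS model immediately, so audio for the first sentence starts playing while later sentences are still being generated.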

| Modality | Input Processing | Model Component | Output Processing | Latency Added |
|---|---|---|---|---|
| Text | Tokenization | Language model decoder | Detokenization | Baseline |
| Image | Patch extraction, ViT encoding | Vision encoder + cross-attention | Text generation | 100 to 500ms encoding |
| Audio input | Waveform preprocessing, Whisper | ASR model then language model | Text generation | Real-time transcription lag |
| Audio output | Text generation | Language model then TTS | Waveform synthesis, streaming | 50 to 200ms per chunk |
| Documents (PDF) | Text extraction, OCR | Language model with long context | Text generation | Preprocessing time + context tokens |

Scaling ChatGPT

When your user base grows from 10,000 users to 10 million users, every part of your architecture gets stress tested.

Horizontal scaling of the API layer is straightforward. Stateless orchestrator services can be scaled behind a load balancer. Kubernetes handles this well, and auto-scaling policies based on request queue depth can provision capacity proactively.

Scaling the GPU layer is where things get hard. GPUs are not commodity hardware. Procurement takes months. Provisioning a new GPU cluster requires physical data center space, power, cooling, and high-bandwidth networking. This means capacity planning needs to happen quarters in advance. OpenAI, Microsoft, and Google have invested billions in GPU infrastructure precisely because you cannot just click a button and get more compute.

Queue-based inference helps smooth over demand spikes. Instead of dropping requests when the GPU cluster is saturated, you queue them and process them in order. This degrades experience gracefully - users wait a bit longer rather than getting errors. The risk is queue depth growing unboundedly during extreme traffic spikes.
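One common way to handle the unbounded-growth risk is to cap the queue and reject work beyond the cap. The sketch below uses Python's standard `queue` module; the `InferenceQueue` wrapper and its method names are hypothetical.

```python
import queue


class InferenceQueue:
    """FIFO request queue that rejects new work once `max_depth` is reached."""

    def __init__(self, max_depth):
        self._queue = queue.Queue(maxsize=max_depth)

    def submit(self, request):
        try:
            self._queue.put_nowait(request)
            return True
        except queue.Full:
            return False    # caller can respond with a "try again later" error

    def next_request(self):
        return self._queue.get_nowait()
```

Rejecting early keeps wait times bounded for the requests that are admitted, at the cost of turning some users away during the spike.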

Multi-region serving reduces latency for global users and provides fault tolerance. A user in Tokyo should not have their request routed to US East Coast if there is a serving cluster in Japan. Geographic routing also means a region-level outage does not take down the entire service.

Context-length as a scaling bottleneck is underappreciated. A request with a 100,000 token context window uses dramatically more GPU memory and compute than a 1,000 token request. If you allow unlimited context, a small number of users with very long conversations can monopolize your GPU memory. Systems typically enforce context limits and tier them by subscription level.

Reliability and Availability

LLM infrastructure is inherently less reliable than traditional web services because of GPU failures.

GPU failure rates are higher than CPU failure rates. High-end GPUs run hot, under extreme memory pressure, and are pushed to near 100% utilization constantly. Failures are a regular occurrence at the scale of thousands of GPUs. Your serving infrastructure must be designed to route around failed GPUs automatically.

Degraded inference modes provide fallback behavior. If the primary large model is unavailable or overloaded, the system can route to a smaller, faster model with a notice to the user. This is better than an outright error and keeps the experience functional.

Checkpointing and recovery for inference sessions: if a streaming response is interrupted mid-generation, can you resume? This is an open engineering problem. Most production systems currently restart generation from the beginning, which is wasteful. Better approaches involve persisting generation state to fast storage, but this adds latency overhead.

Monitoring for LLM systems requires new kinds of metrics. Beyond standard latency, error rate, and throughput, you need:

  • Time to first token: how long before the user sees any output
  • Token generation rate: tokens per second during generation
  • Context utilization: what fraction of the context window is being used
  • Safety classifier false positive rate
  • Model quality metrics via continuous evaluation
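The first two metrics can be derived from per-token timestamps recorded by the response streamer. This is a small sketch; the function name, the returned dictionary keys, and the choice to measure the rate only over tokens after the first are assumptions.

```python
def streaming_metrics(request_time, token_times):
    """Compute time to first token and decode rate from timestamps in seconds."""
    if not token_times:
        raise ValueError("no tokens were generated")
    ttft = token_times[0] - request_time
    decode_time = token_times[-1] - token_times[0]
    # Rate over the tokens after the first; undefined for one-token responses.
    rate = (len(token_times) - 1) / decode_time if decode_time > 0 else None
    return {"time_to_first_token": ttft, "tokens_per_second": rate}
```

Separating the two numbers matters because they have different causes: time to first token is dominated by queueing and prefill, while the generation rate reflects per-step decode speed.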

Alerting on quality degradation is harder than alerting on errors. If the model starts producing lower quality responses, nothing crashes. You need automated evaluation to catch quality regressions.

Engineering Tradeoffs

Real engineering is navigating tradeoffs under constraints. Here are the ones that matter most for LLM systems.

Latency vs. quality: a smaller model generates tokens faster but produces lower quality responses. A larger model is better but slower. The right choice depends on the use case. For real-time chat, you optimize for sub-second first-token latency. For document summarization, you can tolerate more latency for better quality.

Model size vs. cost: GPT-4 scale models cost orders of magnitude more to serve than smaller models. For high-volume, cost-sensitive use cases, a well-fine-tuned smaller model often delivers 80% of the quality at 10% of the cost.

Retrieval vs. fine-tuning: fine-tuning gives the model knowledge that is always available, zero retrieval latency, and tighter integration with the model’s reasoning. But it is expensive to update and can cause catastrophic forgetting of other capabilities. RAG is cheaper to update and keeps knowledge current, but adds retrieval latency and can retrieve irrelevant context that confuses the model.

GPU utilization vs. response speed: packing more requests into a batch improves GPU utilization and throughput. But it can increase individual request latency as requests wait for a batch to fill. Continuous batching helps, but there is still a fundamental tension between maximizing hardware efficiency and minimizing user-perceived latency.
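
To make the waiting cost concrete, here is a back-of-the-envelope sketch for static batching. It assumes requests arrive evenly spaced at a fixed rate, which real traffic does not, so treat the numbers as intuition rather than a prediction. The function name is hypothetical.

```python
def static_batch_wait(batch_size, arrival_rate):
    """Average and worst-case delay (seconds) while a static batch fills.

    Assumes evenly spaced arrivals at `arrival_rate` requests per second.
    """
    gap = 1.0 / arrival_rate
    worst = (batch_size - 1) * gap  # the first arrival waits for all the others
    average = worst / 2             # waits are 0, gap, ..., (batch_size - 1) * gap
    return average, worst


if __name__ == "__main__":
    for size in (1, 8, 32):
        avg, worst = static_batch_wait(size, arrival_rate=100)
        print(f"batch={size:>2}  avg wait={avg * 1000:.0f} ms  worst={worst * 1000:.0f} ms")
```

The wait grows linearly with batch size, which is why continuous batching admits new requests between decode steps instead of waiting for a batch to fill.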

Safety vs. usability: overly aggressive safety classifiers frustrate users and drive them to less safe alternatives. Too permissive and you enable harm. The tradeoff is real and the calibration is never perfect. The best approach is investing in precise classifiers that understand context rather than surface-level pattern matching.

Caching vs. freshness: aggressive caching reduces compute cost and latency but can serve stale responses. For a system where the world changes constantly, this is a real concern. Semantic caching in particular needs careful TTL management and cache invalidation strategies tied to document update events.
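
The sketch below shows TTL expiry and document-driven invalidation together. It is an exact-match cache for simplicity: a semantic cache would replace the dictionary lookup with an embedding similarity search, but the expiry and invalidation logic would look similar. `TTLCache` and its method names are hypothetical.

```python
import time


class TTLCache:
    """Response cache with time-based expiry and per-document invalidation."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}  # key -> (value, expires_at, source document ids)

    def put(self, key, value, doc_ids=()):
        self._entries[key] = (value, self.clock() + self.ttl, frozenset(doc_ids))

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at, _ = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def invalidate_document(self, doc_id):
        """Drop every cached response derived from `doc_id`; return the count."""
        stale = [k for k, (_, _, docs) in self._entries.items() if doc_id in docs]
        for k in stale:
            del self._entries[k]
        return len(stale)
```

Calling `invalidate_document` from a document update event keeps cached answers from outliving the source they were built on, while the TTL acts as a backstop for changes that never emit an event.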

Real-World Technology Stack

A system like ChatGPT is built on a carefully chosen set of technologies, each selected for a specific reason.

PyTorch is the foundation for model development and training. Its dynamic computation graph makes experimentation fast. Most LLM research happens in PyTorch, so it has the best ecosystem of model implementations and training libraries.

CUDA and Triton are the low-level GPU programming layers. Custom CUDA kernels are written for performance-critical operations like attention. Triton, OpenAI’s own GPU programming language, allows writing efficient GPU kernels in Python-like syntax, making custom kernel development more accessible.

TensorRT from NVIDIA optimizes trained models for inference. It fuses operations, quantizes weights to lower precision, and generates GPU-optimal execution plans. Quantization from 16-bit to 8-bit or even 4-bit precision can halve or quarter the memory needed for weights, usually with modest quality tradeoffs.
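
The memory arithmetic behind quantization is easy to check. This sketch uses a hypothetical 70-billion-parameter model and an assumed 80 GB of memory per GPU, and counts weights only: KV cache, activations, and runtime overhead are ignored, so real deployments need more headroom.

```python
import math


def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def min_gpus_for_weights(num_params, bits_per_param, gpu_memory_gb=80):
    """Lower bound on GPU count, ignoring KV cache and activations."""
    return math.ceil(weight_memory_gb(num_params, bits_per_param) / gpu_memory_gb)


if __name__ == "__main__":
    params = 70e9  # hypothetical model size, not a figure for any real model
    for bits in (16, 8, 4):
        print(bits, "bit:", weight_memory_gb(params, bits), "GB,",
              min_gpus_for_weights(params, bits), "GPU(s) minimum")
```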

Ray is the distributed computing framework for Python. It handles distributed inference orchestration, model serving, and scaling compute across multiple nodes without requiring teams to write raw distributed systems code.

Kubernetes orchestrates the containerized service layer: API servers, orchestrators, preprocessing services. It handles deployment, scaling, health checking, and rolling updates.

Redis serves multiple purposes: session storage for conversation history, prompt caching, rate limiting counters, and feature flags. Its sub-millisecond latency makes it suitable for any hot path.

Kafka handles asynchronous event streaming: logging request events, feeding monitoring pipelines, decoupling safety classification from the critical path when it can be asynchronous.

Vector databases (Pinecone, Weaviate, Qdrant) power the retrieval system. They store and serve embeddings at low latency with approximate nearest neighbor algorithms like HNSW.
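
For intuition, this is the exact search that approximate algorithms like HNSW are trying to approximate: score every stored embedding against the query and keep the best few. The sketch uses tiny hand-written vectors and hypothetical document ids. It does not use any vector database API, and it scans the whole index, which is exactly the cost that approximate indexes avoid.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Exact nearest neighbors by cosine similarity over a dict of id -> vector."""
    ranked = sorted(index.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


if __name__ == "__main__":
    index = {"doc-a": [1.0, 0.0], "doc-b": [0.0, 1.0], "doc-c": [0.9, 0.1]}
    print(top_k([1.0, 0.0], index, k=2))
```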

NVIDIA H100 GPUs are the current hardware of choice for LLM inference and training. The H100 SXM5 has 80GB of HBM3 memory, 3.35TB/s of memory bandwidth, and a Transformer Engine built into its Tensor Cores to accelerate transformer workloads at lower precision.
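
Memory bandwidth matters because each decode step has to read essentially all of the model weights. That gives a simple upper bound on single-sequence decode speed: bandwidth divided by weight size. The sketch below uses the 3.35TB/s figure above and a hypothetical 30-billion-parameter model in 16-bit precision. It ignores KV cache reads, compute time, and multi-GPU communication, so real throughput will be lower.

```python
H100_SXM_BANDWIDTH = 3.35e12  # bytes per second, from the figure quoted above


def decode_tokens_per_second_upper_bound(num_params, bytes_per_param,
                                         bandwidth=H100_SXM_BANDWIDTH):
    """Bandwidth-bound ceiling on tokens/s for one sequence on one GPU."""
    weight_bytes = num_params * bytes_per_param
    return bandwidth / weight_bytes


if __name__ == "__main__":
    # Hypothetical 30B-parameter model at 2 bytes per parameter (60 GB of weights).
    bound = decode_tokens_per_second_upper_bound(30e9, 2)
    print(f"at most about {bound:.1f} tokens/s per sequence")
```

This is also why batching helps so much: one pass over the weights can serve every sequence in the batch, so the same memory traffic produces many tokens instead of one.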

InfiniBand networking (typically 400Gb/s NDR, or 200Gb/s HDR on older clusters) connects nodes in GPU clusters with the bandwidth required for tensor parallelism across nodes.

System Design Interview Perspective

System design interviews at top tech companies increasingly include LLM system questions. “Design ChatGPT” or “Design a large-scale AI chat system” are real questions being asked.

What interviewers are looking for:

They want to see that you understand the unique constraints of LLM systems compared to traditional web systems. The GPU bottleneck, the token generation model, the context management challenge, the streaming requirement - these are LLM-specific problems, and addressing them shows you have thought deeply about the domain.

Where candidates go wrong:

Most candidates describe a generic web service: load balancer, API servers, database, cache. This misses the actual hard problems. The model serving layer, GPU memory management, KV cache, continuous batching, distributed inference - these are the interesting engineering problems and they are usually completely absent from weak answers.

Another common mistake is ignoring the inference infrastructure entirely and treating the LLM as a black box API call. That might be true if you are building on top of OpenAI’s API, but that is not what the interviewer wants to hear when they ask “design ChatGPT.”

A strong answer structure:

Start with requirements: throughput (requests per second), latency targets (time to first token, streaming rate), context window support, multimodal support. Then align your architecture to those requirements.

Explain the inference pipeline in enough depth to show you understand autoregressive generation, KV cache, and batching. Discuss how you would scale GPU compute horizontally. Address context management and storage. Cover safety systems briefly. Discuss monitoring and reliability.
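
A quick KV cache size estimate is a good way to show that depth in an interview. For each token, every layer stores one key and one value vector per KV head. The model shape below (80 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical example, not the configuration of any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache size for one sequence: keys and values for every layer and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to make the arithmetic concrete.
    size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=8192)
    print(f"{size / 2**30:.2f} GiB per 8192-token sequence")
```

Because the cache grows linearly with sequence length and with the number of concurrent sequences, it is often what limits batch size in practice, not the weights.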

When discussing tradeoffs, show that you understand the cost of compute and that you would make different architectural choices based on the use case and budget.

Strong vs weak on specific topics:

| Topic | Weak Answer | Strong Answer |
| --- | --- | --- |
| Inference | “We call the model API and return the response” | “Autoregressive generation with KV cache, continuous batching to maximize GPU utilization, speculative decoding for latency” |
| Scaling | “Add more servers” | “Tensor parallelism for model size, data parallelism for throughput, queue-based load leveling, multi-region for latency and availability” |
| Memory | “Store conversations in a database” | “Redis for hot sessions, TTL-based expiry, context truncation and summarization strategies for long conversations, token budget management” |
| Latency | “Use a CDN” | “Minimize time to first token via prefill optimization, streaming with SSE, KV cache, prompt caching for shared prefixes, geographic routing” |
| Safety | “Filter bad words” | “Input and output classifiers, reward model alignment, false positive calibration, context-aware moderation, abuse detection systems” |

Closing Thoughts

ChatGPT is one of the most complex distributed systems ever deployed to consumers. It combines cutting-edge machine learning research with hardcore distributed systems engineering, real-time streaming infrastructure, GPU cluster management, and safety systems that have to make nuanced judgments at scale.

What I find most interesting about this system is that almost every component represents a genuine engineering challenge that did not exist five years ago. KV cache management at hundred-thousand-token context lengths, speculative decoding, continuous batching, tensor parallelism across InfiniBand-connected nodes, multimodal embedding spaces - these are all relatively new solutions to relatively new problems.

If you are building systems like this, or preparing to work on them, the deepest investment you can make is understanding the inference pipeline end to end. That is where the most interesting engineering decisions happen, and that is where the biggest performance gains are still being found.

The field is moving extremely fast. The architectures that are state of the art today will be iterated on significantly in the next two years. But the fundamentals we covered here - distributed inference, KV cache, attention mechanisms, RAG, context management, safety systems - these principles will remain relevant even as the specific implementations evolve.

If you get one thing from this post, let it be this: every design decision in this system was made under real constraints of compute, latency, cost, and reliability. Understanding those constraints is what separates engineers who can talk about LLM systems from engineers who can build them.
