How Reddit Works?

If you’ve ever refreshed your Reddit feed at midnight, upvoted a post, or gone down a rabbit hole in a subreddit — you’ve touched a system that serves hundreds of millions of users every day. But have you ever wondered what’s actually happening under the hood? Let’s find out.


Reddit calls itself “the front page of the internet,” and honestly, that’s not far off. At its core, Reddit is a massive, community-driven link aggregator and discussion platform — think of it as a giant bulletin board broken into thousands of topic-specific rooms called subreddits.

As of 2024, Reddit had:

  • ~1.5 billion monthly active users visiting the site
  • ~100,000 active subreddits
  • Roughly 500 million posts and billions of comments
  • Millions of concurrent users at peak times

What makes Reddit an interesting system design problem is the combination of scale, real-time interaction, and community-specific complexity. You have users browsing feeds, posting content, voting, commenting in real-time, getting notifications, and all of this happening across extremely diverse communities with completely different moderation rules.

It’s not as real-time as Twitter (where every tweet needs instant fan-out). It’s not as media-heavy as YouTube. But it’s arguably more complex than both because it combines a social graph, content ranking, tree-structured comments, moderation tools, and real-time interactions all in one platform.

Let’s break it all down.

Core Features

Before we get into architecture, let’s quickly nail down what Reddit actually does. These features will drive every design decision we make.

Posts

A post is the atomic unit of Reddit. It can be:

  • A link pointing to an external URL
  • A text post (also called a self-post)
  • An image or video

Each post lives inside a subreddit, has an author, a title, a score (upvotes minus downvotes), and a timestamp.

Comments (Tree Structure)

This is one of Reddit’s most distinctive features. Comments form a tree — you can reply to a post directly, or reply to another comment, and that reply can have replies, and so on infinitely deep.

flowchart TB
    %% Nodes
    A[Post]
    B[Comment A]
    C[Reply A1]
    D[Reply A1a]
    E[Reply A2]
    F[Comment B]
    G[Reply B1]
    %% Tree Structure
    A --> B
    B --> C
    C --> D
    B --> E
    A --> F
    F --> G

This nested tree structure needs special thought when it comes to storage and retrieval, which we’ll cover later.

Upvotes / Downvotes

Every post and every comment can be upvoted or downvoted. The net score (upvotes - downvotes) combined with the time of posting feeds into the ranking algorithm that determines where content appears in feeds.

Vote counts are approximate by design — Reddit intentionally “fuzzes” the exact count to prevent vote manipulation. You’ll see a score of “1.2k” rather than an exact number.

Subreddits

Subreddits are topic-based communities: r/programming, r/worldnews, r/funny. Each one has its own:

  • Set of moderators
  • Custom rules and flairs
  • Moderation tools
  • Posting history

Users can subscribe to subreddits, and their home feed is built from those subscriptions.

User Profiles

Each user has a profile with their post history, comment history, karma score (a rough measure of community contribution), and account settings.

Feed Generation

The home feed is the magic. When you open Reddit, you see a curated list of posts from all the subreddits you follow, ranked by relevance and popularity. This is the hardest part to get right at scale.

High-Level Architecture

Let’s start with the big picture before drilling into specifics.

flowchart TB
    %% Client Layer
    subgraph Client["CLIENT LAYER"]
        A["Web Browser / iOS App / Android"]
    end
    %% CDN
    subgraph CDN["CDN (CloudFront / Fastly / Akamai)"]
        B["Static Assets, Images, Videos"]
    end
    %% Load Balancer
    subgraph LB["LOAD BALANCER (L7)"]
        C["HAProxy / AWS ALB / Nginx"]
    end
    %% Core Services
    subgraph Services["CORE SERVICES"]
        D["Auth Service"]
        E["Post Service"]
        F["Feed Service"]
        G["Comment Service"]
    end
    %% Message Bus
    subgraph Kafka["MESSAGE BUS (Kafka)"]
        H["Event Streaming"]
    end
    %% Async Services
    subgraph Async["ASYNC SERVICES"]
        I["Voting Service"]
        J["Search Service"]
        K["Notification Service"]
    end
    %% Data Layer
    subgraph Data["DATA LAYER"]
        L["PostgreSQL (Users, Subs)"]
        M["Cassandra (Votes, Activity)"]
        N["Redis (Cache, Sessions)"]
    end
    %% Connections
    A -->|HTTPS| B
    B -->|Cache Miss| C
    C --> D
    C --> E
    C --> F
    C --> G
    E --> H
    F --> H
    G --> H
    H --> I
    H --> J
    H --> K
    I --> L
    I --> M
    J --> L
    J --> N
    K --> N


Let me explain each piece:

CDN (Content Delivery Network): The first layer that intercepts requests. It serves static content — images, JavaScript bundles, CSS — from edge servers close to the user. No reason to hit your origin server for a Reddit logo.

Load Balancer: Distributes incoming traffic across multiple server instances. Uses Layer 7 (HTTP-aware) routing so it can send /api/posts requests to the Post Service and /api/auth requests to the Auth Service.

Microservices: Reddit evolved from a Python monolith (built on the Pylons framework in the early days) into a set of microservices. Each service owns a specific domain: posts, comments, votes, feeds, notifications, and so on.

Message Bus (Kafka): The nervous system of the backend. Events like “new vote cast,” “new comment posted,” or “post created” are published as messages on Kafka topics. Other services subscribe and react asynchronously. This decouples the services and allows the system to absorb spikes.

Data Layer: Multiple databases for different needs, which we’ll explore in detail.

Deep Dive: Core Services

Authentication Service

The auth service handles login, session management, and OAuth flows (for third-party app access via Reddit’s API).

How it works:

  • User logs in → auth service validates credentials → issues a JWT (JSON Web Token) or session cookie
  • Every subsequent request carries this token
  • The token is verified at the API gateway level before routing to any service

Reddit also supports OAuth 2.0, which is how third-party clients such as Apollo have been able to authenticate on your behalf.

Session Storage: Sessions are stored in Redis — it’s fast and supports TTL (time-to-live) out of the box, so expired sessions are automatically cleaned up.

Post Service

This service handles CRUD operations for posts — creation, retrieval, editing, deletion.

When a user submits a post:

  1. Validation (is the content within rules? Is the user banned from this subreddit?)
  2. Store the post in the primary database (PostgreSQL)
  3. Publish a post.created event to Kafka
  4. Other services (feed service, notification service, search indexer) listen to this event and react

Schema (simplified):

CREATE TABLE posts (
    id          UUID PRIMARY KEY,
    author_id   UUID NOT NULL REFERENCES users(id),
    subreddit_id UUID NOT NULL REFERENCES subreddits(id),
    title       VARCHAR(300) NOT NULL,
    url         TEXT,
    body        TEXT,
    type        TEXT NOT NULL CHECK (type IN ('link', 'text', 'image', 'video')),  -- PostgreSQL has no inline ENUM(...) column syntax
    score       INT DEFAULT 0,
    upvotes     INT DEFAULT 0,
    downvotes   INT DEFAULT 0,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    is_deleted  BOOLEAN DEFAULT FALSE,
    is_nsfw     BOOLEAN DEFAULT FALSE
);
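The submission flow above can be sketched roughly as follows. This is a minimal in-memory illustration, not Reddit's code: `PostStore`, `EventBus`, and `create_post` are hypothetical names, with a dict standing in for PostgreSQL and a list standing in for a Kafka topic.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EventBus:
    """Stand-in for a Kafka producer: events are appended to a list."""
    events: list = field(default_factory=list)

    def publish(self, topic, payload):
        self.events.append((topic, payload))


@dataclass
class PostStore:
    """Stand-in for the posts table in PostgreSQL."""
    rows: dict = field(default_factory=dict)

    def insert(self, post):
        self.rows[post["id"]] = post


def create_post(store, bus, banned_pairs, author_id, subreddit_id, title, body=""):
    # Step 1: validation (the length limit mirrors the VARCHAR(300) title column).
    if not title or len(title) > 300:
        raise ValueError("title must be 1-300 characters")
    if (author_id, subreddit_id) in banned_pairs:
        raise PermissionError("user is banned from this subreddit")

    # Step 2: store the post in the primary database.
    post = {
        "id": str(uuid.uuid4()),
        "author_id": author_id,
        "subreddit_id": subreddit_id,
        "title": title,
        "body": body,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    store.insert(post)

    # Step 3: publish the event. Step 4 happens in whichever services consume it.
    bus.publish("post.created", {"post_id": post["id"], "subreddit_id": subreddit_id})
    return post


store, bus = PostStore(), EventBus()
post = create_post(store, bus, set(), "user-1", "sub-1", "Hello, Reddit")
```

Note that the request returns as soon as the row is stored and the event is published; feed updates, notifications, and search indexing happen later, off the request path.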

Comment Service

Comments are the most complex data structure in Reddit. They form an unbounded tree and need to be:

  • Retrieved efficiently for a given post
  • Sorted (by “best,” “new,” “top,” “controversial”)
  • Collapsible (hide a branch of replies)

Two common approaches to storing trees in a relational DB:

Option 1: Adjacency List (Simple)

CREATE TABLE comments (
    id          UUID PRIMARY KEY,
    post_id     UUID NOT NULL,
    parent_id   UUID REFERENCES comments(id),  -- NULL if top-level
    author_id   UUID NOT NULL,
    body        TEXT NOT NULL,
    score       INT DEFAULT 0,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Simple but fetching an entire comment tree requires recursive queries or multiple round-trips.

Option 2: Materialized Path

Store the full path from root to each node as a string:

-- path: "root_id.child_id.grandchild_id"
id: "comment_abc"
path: "comment_xyz.comment_abc"

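This makes it fast to fetch all descendants of a comment with a prefix query such as LIKE 'root_id.%', at the cost of slightly more storage. The trailing separator matters: without it, the pattern would also match unrelated IDs that merely start with the same characters.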
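To make the prefix query concrete, here is a small sketch using Python's built-in `sqlite3` module as a stand-in for PostgreSQL. The two-column table and the short IDs are assumptions for illustration; real IDs containing `_` or `%` would need escaping, since both are LIKE wildcards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id TEXT PRIMARY KEY, path TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO comments (id, path) VALUES (?, ?)",
    [
        ("a", "a"),            # top-level comment
        ("a1", "a.a1"),        # reply to a
        ("a1x", "a.a1.a1x"),   # reply to a1
        ("ab", "ab"),          # unrelated top-level comment whose ID starts with "a"
    ],
)


def descendants(conn, root_path):
    """Return the IDs of every comment nested under root_path."""
    # The "." before "%" keeps the pattern from matching the sibling "ab".
    rows = conn.execute(
        "SELECT id FROM comments WHERE path LIKE ? ORDER BY path",
        (root_path + ".%",),
    )
    return [row[0] for row in rows]


subtree = descendants(conn, "a")
```

Ordering by `path` also returns the rows in depth-first order, which is convenient when rendering a nested thread.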

Reddit actually uses a hybrid approach — storing the parent ID in PostgreSQL for writes, but caching fully assembled comment trees in Redis for reads. When you load a post page, Reddit fetches the pre-built tree from cache rather than reassembling it from the DB every time.
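With the adjacency list, one way to avoid recursive queries is to fetch every comment row for a post in a single query and assemble the tree in memory; the assembled result is what would then be cached. The sketch below assumes rows arrive as `(id, parent_id, body)` tuples and that every parent is present in the same batch. `build_comment_tree` is a hypothetical helper name.

```python
def build_comment_tree(rows):
    """Nest flat (id, parent_id, body) rows into a list of top-level comments."""
    nodes = {cid: {"id": cid, "body": body, "replies": []} for cid, _, body in rows}
    roots = []
    for cid, parent_id, _ in rows:
        if parent_id is None:
            roots.append(nodes[cid])       # direct reply to the post
        else:
            nodes[parent_id]["replies"].append(nodes[cid])
    return roots


# Same shape as the comment diagram earlier in the article.
rows = [
    ("A", None, "Comment A"),
    ("A1", "A", "Reply A1"),
    ("A1a", "A1", "Reply A1a"),
    ("A2", "A", "Reply A2"),
    ("B", None, "Comment B"),
    ("B1", "B", "Reply B1"),
]
tree = build_comment_tree(rows)
```

This runs in time linear in the number of comments and does not recurse, so very deep threads are not a problem for the assembly step itself.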

Voting System

Voting is one of the highest-write-volume operations on Reddit. Every tap on a vote arrow is a write. At scale, this needs to be handled carefully.

The naive approach — directly updating the score column on the post row — would destroy your database. You’d have thousands of concurrent writes to the same row, causing lock contention.

Reddit’s approach:

  1. Write votes to a fast write-optimized store first (Redis sorted sets or Cassandra)
  2. Periodically batch-aggregate votes and update the main score in PostgreSQL
  3. The score displayed to users is read from cache, not the database

# Pseudo-code for handling a vote (redis and kafka are assumed client objects)
def handle_vote(user_id, post_id, vote_direction):  # +1 or -1
    key = f"vote:{user_id}:{post_id}"

    # Redis returns stored values as strings, so convert before comparing
    existing = redis.get(key)
    previous = int(existing) if existing is not None else 0

    if previous == vote_direction:
        return  # Same vote, no-op (toggle off would be another case)

    # Store user's vote in Redis
    redis.set(key, vote_direction, ex=86400 * 30)

    # Apply the change relative to the previous vote, so switching
    # from -1 to +1 moves the score by 2 rather than 1
    redis.zincrby("post_scores", vote_direction - previous, post_id)

    # Publish to Kafka for async persistence
    kafka.publish("votes", {
        "user_id": user_id,
        "post_id": post_id,
        "direction": vote_direction,
        "timestamp": now()
    })

A background consumer reads from the Kafka votes topic and periodically flushes aggregated vote counts to PostgreSQL. This is a classic write-behind caching pattern.
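The aggregation step in that consumer might look something like the sketch below. It assumes each event's `direction` is the net change to apply to the score (a vote switch would need to carry +2 or -2, or the previous vote); `aggregate_votes` is a hypothetical name, and the Kafka and PostgreSQL plumbing is left out.

```python
from collections import Counter


def aggregate_votes(events):
    """Collapse a batch of vote events into one net score delta per post."""
    deltas = Counter()
    for event in events:
        deltas[event["post_id"]] += event["direction"]
    # Skip posts whose votes cancelled out, so no UPDATE is issued for them.
    return {post_id: delta for post_id, delta in deltas.items() if delta != 0}


batch = [
    {"user_id": "u1", "post_id": "p1", "direction": 1},
    {"user_id": "u2", "post_id": "p1", "direction": 1},
    {"user_id": "u3", "post_id": "p1", "direction": -1},
    {"user_id": "u1", "post_id": "p2", "direction": 1},
    {"user_id": "u2", "post_id": "p2", "direction": -1},
    {"user_id": "u4", "post_id": "p3", "direction": -1},
]
deltas = aggregate_votes(batch)
# Each entry would become one statement along the lines of:
#   UPDATE posts SET score = score + <delta> WHERE id = <post_id>
```

Six events turn into two row updates here, which is the point of batching: the number of database writes scales with the number of distinct posts in a batch, not with the number of votes.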

Subreddit Service

Handles subreddit creation, subscription management, moderation settings, and rules.

A subscription is a many-to-many relationship between users and subreddits:

CREATE TABLE subscriptions (
    user_id      UUID NOT NULL,
    subreddit_id UUID NOT NULL,
    subscribed_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (user_id, subreddit_id)
);

The list of subreddits a user follows is small enough to be cached in Redis as a set: SMEMBERS user:12345:subscriptions.

Notification System

Notifications are event-driven by nature. When something happens (your comment got a reply, you got a mention, a mod action was taken), a notification needs to reach the right user.

Flow:

  1. Event published to Kafka (e.g., comment.created)
  2. Notification service consumes the event
  3. Checks who needs to be notified (post author, parent comment author, mentioned users)
  4. Stores notification in DB + pushes via WebSocket or Firebase Cloud Messaging (FCM) for mobile

The notification service maintains a user notification inbox — essentially a list of notification records with read/unread status.

Moderation System

This is often overlooked in system design interviews but it’s genuinely one of Reddit’s hardest problems. Reddit has:

  • Volunteer moderators (subreddit-specific)
  • Reddit admins (site-wide)
  • Automated moderation bots (e.g., AutoModerator)
  • ML-based spam detection

Tools in the moderation toolkit:

  • AutoModerator: A rule engine that runs on every post/comment. Rules are YAML-based and can match on keywords, account age, karma thresholds, etc.
  • Spam detection: ML models trained on known spam patterns
  • Shadow banning / Content filtering: Soft-banning users (they think their posts are public, but they’re hidden from others)
  • Report queue: User reports flow into a moderation queue. High-volume subreddits use tools to triage these at scale.

Database Design

Reddit uses a polyglot persistence approach — different data stores for different access patterns.

| Data Type | Database | Why |
| --- | --- | --- |
| Users, Posts, Subreddits | PostgreSQL | Strong ACID guarantees, rich querying |
| Votes (raw events) | Cassandra | High write throughput, append-only |
| Sessions, Cache | Redis | Sub-millisecond reads, TTL support |
| Full-text search | Elasticsearch | Complex query patterns, relevance ranking |
| Media (images, videos) | S3 + CDN | Blob storage at scale |
| Time-series metrics | InfluxDB / Prometheus | High-volume time-series data |

Key Schema Examples

Users Table:

CREATE TABLE users (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    username        VARCHAR(20) UNIQUE NOT NULL,
    email           VARCHAR(255) UNIQUE NOT NULL,
    password_hash   VARCHAR(255) NOT NULL,
    post_karma      INT DEFAULT 0,
    comment_karma   INT DEFAULT 0,
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    is_verified     BOOLEAN DEFAULT FALSE,
    is_suspended    BOOLEAN DEFAULT FALSE
);

Subreddits Table:

CREATE TABLE subreddits (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name            VARCHAR(21) UNIQUE NOT NULL,  -- max r/ name length
    description     TEXT,
    subscriber_count INT DEFAULT 0,
    created_by      UUID REFERENCES users(id),
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    is_nsfw         BOOLEAN DEFAULT FALSE,
    is_quarantined  BOOLEAN DEFAULT FALSE
);

SQL vs NoSQL: The Tradeoff

The big question in any system design is: relational or not?

For Reddit’s core data (posts, comments, users): PostgreSQL makes sense. These are highly relational, need transactions, and benefit from complex joins (e.g., “fetch all posts by user X in subreddit Y that have score > 100”).

For votes: You don’t need ACID guarantees for an upvote. You need high write throughput and eventual consistency. Cassandra wins here — it’s designed for append-heavy, time-series-like data.

The key tradeoff:

  • Consistency vs. Availability: If Cassandra is temporarily unavailable, your vote might not be counted immediately — but that’s acceptable. Users won’t notice a 2-second delay in vote count updates.
  • Normalization vs. Denormalization: Relational DBs normalize (one source of truth). NoSQL often denormalizes (duplicate data for read performance). Reddit’s feed, for example, stores pre-computed post data so reads are fast.

Feed Generation

This is the most interesting and most debated part of Reddit’s design. The home feed is a ranked list of posts from all subreddits you follow. How do you build this efficiently for hundreds of millions of users?

The Two Models

Option 1: Pull Model (Fan-in on Read)

When a user opens their feed, the server:

  1. Fetches the list of subreddits the user follows
  2. Queries for top posts from each subreddit
  3. Merges and re-ranks everything
  4. Returns the sorted feed

Pros: Simple to implement. No storage overhead. Always fresh.

Cons: Very slow at scale. If you follow 500 subreddits, that’s potentially 500 database queries per feed load. Unacceptable latency.

Option 2: Push Model (Fan-out on Write)

When a new post is created:

  1. Find all subscribers of that subreddit
  2. Insert the post ID into each subscriber’s “feed queue”

When a user loads their feed, just read from their pre-built queue.

Pros: Feed reads are O(1) — just fetch from the user’s queue.

Cons: Writing a post to r/AskReddit (30M+ subscribers) means 30 million writes. Fan-out on write is a nightmare for large subreddits.

Reddit’s Hybrid Approach

Reddit (like most platforms at this scale) uses a hybrid model:

  • For small subreddits (< X subscribers): Fan-out on write. When a post is created, push it to all subscriber feeds.
  • For large subreddits (> X subscribers, like r/funny or r/worldnews): Pull model at read time. These posts are fetched separately and merged into the feed.
  • User feeds are cached in Redis with a short TTL (a few minutes). Stale feeds are invalidated and rebuilt on the next load.

flowchart TB
    A[User Request] --> B
    B{Cache Check Redis}
    B -->|HIT sub-10ms| C[Return Cached Feed]
    B -->|MISS| D[Build Feed]
    D --> E[Fetch Followed Subreddits Redis Set]
    E --> F[Small Subs Prebuilt Feed Queue]
    E --> G[Large Subs Top Posts Directly]
    F --> H[Merge and Rank]
    G --> H
    H --> I[Write to Cache TTL 5-10 min]
    I --> J[Return to User]
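The cache-miss path of the hybrid model can be sketched as below. Everything here is illustrative: `LARGE_SUB_THRESHOLD` is a made-up value for the "X subscribers" cutoff, and plain Python lists and dicts stand in for the Redis structures.

```python
import heapq

LARGE_SUB_THRESHOLD = 100_000  # hypothetical value for the "X subscribers" cutoff


def build_home_feed(subscriptions, subscriber_counts, prebuilt_queue, hot_posts, limit=25):
    """Return post IDs for the home feed, highest score first.

    prebuilt_queue: (score, post_id) pairs already pushed by small subreddits.
    hot_posts: subreddit name -> (score, post_id) pairs, pulled for large subreddits.
    """
    candidates = list(prebuilt_queue)                  # fan-out on write results
    for sub in subscriptions:
        if subscriber_counts.get(sub, 0) >= LARGE_SUB_THRESHOLD:
            candidates.extend(hot_posts.get(sub, []))  # pulled at read time
    ranked = heapq.nlargest(limit, candidates)         # merge and rank
    return [post_id for _, post_id in ranked]


feed = build_home_feed(
    subscriptions=["golang", "funny"],
    subscriber_counts={"golang": 50_000, "funny": 40_000_000},
    prebuilt_queue=[(7.1, "g1"), (6.4, "g2")],
    hot_posts={"funny": [(9.3, "f1"), (5.0, "f2")]},
    limit=3,
)
```

The result of this function is what would be written to the cache with a short TTL, so the merge cost is paid once per user per few minutes rather than on every request.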

Personalization

Reddit also personalizes feeds based on:

  • Your vote history (if you consistently upvote posts in r/golang, more Go content bubbles up)
  • Time of day (people consume different content at night vs. morning)
  • Flagged preferences (you can mark certain flairs to filter out)

This personalization layer sits on top of the base ranked feed and re-scores posts before returning results.

Caching & Performance

Reddit is read-heavy by a wide margin. For every 1 person posting, thousands are reading. Every caching decision should optimize for read performance.

Redis for Hot Data

Redis is used everywhere:

  • Session storage — JWT → session data lookup
  • Vote counts — current score of a post/comment
  • Subscriber counts — number of members in a subreddit
  • Comment trees — fully assembled JSON trees for hot posts
  • User feeds — pre-built feed lists with TTL
  • Rate limiting — sliding window counters to throttle API abuse
flowchart TB
    %% Nodes
    A[user:user_id:subscriptions SET of subreddit IDs]
    B[post:post_id:score STRING integer score]
    C[post:post_id:comments JSON comment tree]
    D[feed:user_id:home LIST of post IDs]
    E[subreddit:sub_id:hot_posts ZSET sorted by score]
    F[ratelimit:user_id:endpoint STRING counter]
    %% Grouping
    subgraph UserData
        A
        D
        F
    end
    subgraph PostData
        B
        C
    end
    subgraph SubredditData
        E
    end
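As an illustration of the rate-limiting entry, here is a minimal in-memory sketch of the sliding-window log variant, which stores a timestamp per request rather than a single counter. `SlidingWindowLimiter` is a hypothetical class; a production version would keep this state in Redis so that all service instances share it.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per key within any window_seconds span."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key):
        now = self.clock()
        recent = self.hits[key]
        # Evict timestamps that have slid out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True


# A controllable clock makes the behaviour easy to see without sleeping.
fake_now = [0.0]
limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60, clock=lambda: fake_now[0])
results = [limiter.allow("ratelimit:12345:vote") for _ in range(3)]
fake_now[0] = 61.0
allowed_after_window = limiter.allow("ratelimit:12345:vote")
```

The third request inside the window is rejected, and requests are accepted again once the earlier timestamps age out.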

CDN for Static Content

All images, videos, and static assets go through a CDN (Reddit uses Fastly). This means:

  • User-uploaded images are stored in S3
  • Served through CDN edge servers globally
  • Cache headers are set aggressively for immutable content

Read Replicas

For database reads, Reddit routes traffic to read replicas of PostgreSQL. Writes go to the primary, reads are distributed across multiple replicas. This multiplies read capacity without increasing write complexity.

Approximate Counts

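Exact counts of distinct events, such as how many unique users have viewed a post, are expensive to maintain at this scale. For those, Reddit uses approximate counting with HyperLogLog or similar probabilistic data structures, trading a small margin of error for large savings in memory and compute. Vote scores are a different case: they are ordinary counters that are fuzzed on display rather than estimated.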

Scaling the System

How Reddit Scaled Over Time

Reddit launched in 2005, originally written in Common Lisp and rewritten in Python by the end of that year. As traffic grew, the Python monolith hit walls repeatedly. Here’s roughly how the evolution went:

  1. Vertical scaling — bigger servers. Temporary fix.
  2. Read replicas — separated reads from writes.
  3. Caching (Memcached, then Redis) — kept hot data in memory.
  4. Service extraction — broke out search, media, etc. into separate services.
  5. Full microservices — by the early 2020s, most features run as independent services.

Horizontal Scaling

Every service is stateless — no local state is stored on the server. All state lives in Redis, Postgres, or Kafka. This means you can add more instances of any service behind the load balancer without coordination.

flowchart TB
    %% Nodes
    A[Load Balancer]
    B[Post Service Instance 1]
    C[Post Service Instance 2]
    D[Post Service Instance 3]
    E[Post Service Instance N]
    %% Connections
    A --> B
    A --> C
    A --> D
    A --> E

Auto-scaling groups (in AWS or GCP) spin up new instances under load and tear them down when traffic subsides.

Kafka as the Backbone

Kafka is used for event-driven communication between services. Instead of Service A calling Service B directly (synchronous, tightly coupled), Service A publishes an event and Service B (or C, D, E) can consume it independently.

Real-world event topics in Reddit’s system might look like:

flowchart TB
    %% Topics
    T1[posts.created]
    T2[comments.created]
    T3[votes.cast]
    T4[users.banned]
    T5[reports.filed]
    %% Consumers
    A[Search Indexer]
    B[Feed Service]
    C[Notification Service]
    D[Moderation Queue]
    E[Vote Aggregator]
    F[Ranking Service]
    G[Content Visibility Service]
    H[Auth Service]
    %% Connections
    T1 --> A
    T1 --> B
    T1 --> C
    T2 --> C
    T2 --> B
    T2 --> D
    T3 --> E
    T3 --> F
    T4 --> G
    T4 --> H
    T5 --> D

This design makes the system fault-tolerant: if the notification service goes down, events accumulate in Kafka and are processed when the service recovers. As long as recovery happens within the topic’s retention period, nothing is lost.

Database Sharding

For tables that grow extremely large (like votes or comments), you can shard the data. Sharding means splitting a table across multiple database instances based on some key (often user ID or post ID).

For example, vote records can be sharded by post_id % N where N is the number of shards. All votes for a given post live on the same shard, making aggregation fast.
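Since the schemas in this article use UUIDs rather than integers, the `post_id % N` step needs a hash first. A minimal sketch, where `shard_for` is a hypothetical helper: it uses `hashlib` because Python's built-in `hash()` is randomized per process for strings and would route the same post to different shards on different servers.

```python
import hashlib


def shard_for(post_id, num_shards):
    """Map a post ID to a shard index using a stable hash."""
    digest = hashlib.sha256(post_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


shard_a = shard_for("0b9c2f6e-post", 16)
shard_again = shard_for("0b9c2f6e-post", 16)
```

One known drawback of plain modulo routing is that changing N remaps most keys; consistent hashing is the usual alternative when shards are added often.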

Real-Time Features

Live Comment Updates

When you’re reading a popular post, new comments sometimes appear without you refreshing. This is done via WebSockets.

Flow:

  1. Browser opens a WebSocket connection to Reddit’s real-time service when viewing a post
  2. When a new comment is submitted, it’s published to Kafka
  3. The real-time service consumes from Kafka and pushes the new comment to all WebSocket clients subscribed to that post ID

flowchart TB
    %% Nodes
    A[Comment Service]
    B[comments.created Topic]
    C[Kafka Consumer]
    D[Real-time Service]
    E[Client Web App]
    F[WebSocket Connection]
    %% Forward Flow
    A --> B
    B --> C
    C --> D
    %% Real-time Push (reverse direction concept)
    D --> F
    F --> E

For lower-priority updates (like notification badges), Reddit may fall back to Server-Sent Events (SSE) or even long polling — simpler, lower overhead alternatives to WebSockets.

Push Notifications

Mobile notifications go through platform-specific push systems:

  • Apple Push Notification Service (APNs) for iOS
  • Firebase Cloud Messaging (FCM) for Android

Reddit’s notification service formats the notification payload and sends it to the appropriate push gateway based on the user’s device type.

The Ranking Algorithm

Reddit’s ranking algorithm is what makes the front page feel useful rather than just chronological. Let’s break it down.

The “Hot” Algorithm

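Reddit’s hot ranking comes from its formerly open-source codebase. (Randall Munroe, the xkcd guy, is associated with a different sort: he introduced the “best” comment ranking on Reddit’s blog in 2009.) Here’s a simplified version of hot: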

import math
from datetime import datetime

EPOCH = datetime(1970, 1, 1)

def epoch_seconds(date):
    td = date - EPOCH
    return td.days * 86400 + td.seconds

def score(ups, downs):
    return ups - downs

def hot(ups, downs, date):
    s = score(ups, downs)
    order = math.log(max(abs(s), 1), 10)
    sign = 1 if s > 0 else (-1 if s < 0 else 0)
    seconds = epoch_seconds(date) - 1134028003  # Reddit's epoch
    return round(sign * order + seconds / 45000, 7)

The key insight is this: the ranking combines score with time. A post’s own value never shrinks; newer posts simply start with a larger time term, so older posts sink relative to them. A post with 1,000 upvotes an hour ago beats a post with 10,000 upvotes from yesterday. The 45000 constant controls how fast this happens: 45,000 seconds is 12.5 hours, so for every 12.5 hours of age a post needs 10x more votes to keep pace with a newer one.
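To check those two claims numerically, here is a self-contained restatement of the simplified formula with the example posts plugged in. The timestamps are arbitrary; only the gaps between them matter.

```python
import math
from datetime import datetime, timedelta

UNIX_EPOCH = datetime(1970, 1, 1)
REDDIT_EPOCH_SECONDS = 1134028003


def hot(ups, downs, date):
    s = ups - downs
    order = math.log10(max(abs(s), 1))
    sign = (s > 0) - (s < 0)
    seconds = (date - UNIX_EPOCH).total_seconds() - REDDIT_EPOCH_SECONDS
    return round(sign * order + seconds / 45000, 7)


now = datetime(2024, 1, 2, 12, 0, 0)

# 1,000 net upvotes from an hour ago vs. 10,000 from a day ago.
recent = hot(1_000, 0, now - timedelta(hours=1))
older = hot(10_000, 0, now - timedelta(hours=24))
gap = recent - older  # 23 hours of recency (about 1.84) minus one order of magnitude

# A post 12.5 hours older needs exactly 10x the votes to tie.
newer_small = hot(100, 0, now)
older_big = hot(1_000, 0, now - timedelta(hours=12.5))
```

The recent post wins by roughly 0.84, and the second pair comes out level, which matches the 12.5-hour rule of thumb.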

Other Sort Modes

| Mode | Logic |
| --- | --- |
| New | Simple chronological sort by created_at DESC |
| Top | Sort by raw score (upvotes - downvotes) within a time window |
| Controversial | High vote count with nearly equal upvotes and downvotes |
| Rising | Posts gaining velocity quickly: score increasing faster than expected |
| Best | Lower bound of the Wilson score interval: accounts for vote confidence (how many votes there are), not age; used mainly for comments |
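For the “Best” mode, the Wilson score lower bound can be sketched as follows. The `z=1.96` default (a 95% confidence level) is an assumption chosen for illustration, not necessarily the value Reddit uses.

```python
import math


def wilson_lower_bound(ups, downs, z=1.96):
    """Lower bound of the Wilson score interval for the upvote fraction."""
    n = ups + downs
    if n == 0:
        return 0.0
    p = ups / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)


few_votes = wilson_lower_bound(5, 0)      # 100% positive, but only 5 votes
many_votes = wilson_lower_bound(90, 10)   # 90% positive over 100 votes
```

The item with 90 out of 100 upvotes ranks above the one with 5 out of 5, because a handful of votes is weak evidence. That is the behaviour a raw ratio or a raw score difference cannot give you.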

Controversial posts are particularly interesting. The formula rewards posts whose share of upvotes is close to 0.5 and whose total vote count is high:

def controversial(ups, downs):
    if ups <= 0 or downs <= 0:
        return 0
    magnitude = ups + downs
    balance = downs / ups if ups > downs else ups / downs
    return magnitude ** balance

The higher the magnitude and the closer to 50/50 the split, the higher the controversial score.
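A few sample inputs show how strongly the balance term dominates. The function is repeated here only so the snippet runs on its own.

```python
def controversial(ups, downs):
    if ups <= 0 or downs <= 0:
        return 0
    magnitude = ups + downs
    balance = downs / ups if ups > downs else ups / downs
    return magnitude ** balance


even_split = controversial(500, 500)   # 1000 votes, perfectly split: 1000 ** 1.0
lopsided = controversial(900, 100)     # 1000 votes, 90/10 split: 1000 ** (1/9)
small_split = controversial(5, 5)      # 10 votes, perfectly split: 10 ** 1.0
```

With the same 1,000 total votes, the even split scores 1000 while the 90/10 split scores only a little above 2, so a small but evenly divided thread (10 here) still outranks a large lopsided one.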

Challenges & Tradeoffs

1. Consistency vs. Availability

When you upvote a post, you don’t get an error; you get an optimistic UI update. But behind the scenes, that vote might take a few seconds to fully propagate. Reddit chooses availability over strict consistency for votes. (Under the CAP theorem, a distributed system cannot opt out of partition tolerance, so during a network partition it has to choose between consistency and availability.)

For things like bans and moderation actions, however, Reddit needs stronger consistency — you don’t want a banned user to keep posting because one replica didn’t get the update. This is why different parts of the system use different consistency models.

2. Spam & Vote Manipulation

Vote manipulation is a constant battle. Bad actors create armies of accounts to upvote their own posts or downvote competitors. Reddit counters this with:

  • IP-based rate limiting on vote actions
  • ML models trained to detect coordinated voting patterns
  • Vote fuzzing — slightly obscuring the exact vote count makes it harder to confirm if manipulation is working
  • Account age and karma requirements — new or low-karma accounts have their votes weighted less

3. Moderation at Scale

Reddit has ~100,000 subreddits, each moderated by volunteers. The quality and consistency of moderation varies wildly. Reddit’s technical approach includes:

  • AutoModerator (rules engine built on YAML configs)
  • Site-wide spam detection feeding into mod queues
  • Quarantine system for problematic communities
  • Improved reporting UX to surface actionable reports

The challenge isn’t the technology — it’s the policy and human coordination at scale.

4. The Thundering Herd

When a mega-popular post goes viral, suddenly millions of people request the same resource simultaneously. A cache miss on that resource causes every request to hit the database — causing what’s known as the “thundering herd” problem.

Mitigation strategies:

  • Cache locking: Only one request rebuilds the cache; others wait briefly
  • Short TTLs with background refresh: Cache is refreshed before expiry in background, so users never hit a cold cache
  • Request coalescing: Multiple identical requests in-flight are coalesced into one database query
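Cache locking and request coalescing share one core idea: concurrent misses on the same key should trigger a single rebuild. The sketch below shows that idea with threads in one process; `SingleFlightCache` is a hypothetical class, TTLs are omitted, and a multi-server deployment would need a distributed lock instead of `threading.Lock`.

```python
import threading
import time


class SingleFlightCache:
    """Cache where concurrent misses on a key trigger only one rebuild."""

    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()

    def get(self, key):
        if key in self._values:
            return self._values[key]
        with self._guard:
            lock = self._key_locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have filled the cache while we waited.
            if key not in self._values:
                self._values[key] = self._loader(key)
            return self._values[key]


load_calls = []


def slow_database_read(key):
    load_calls.append(key)
    time.sleep(0.05)  # simulate an expensive query
    return f"rendered:{key}"


cache = SingleFlightCache(slow_database_read)
threads = [threading.Thread(target=cache.get, args=("post:42",)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Twenty simultaneous requests for the same viral post result in one call to the database; the other nineteen wait briefly and read the cached value.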

5. Cold Start for New Users

New users have no history. Their feeds are empty. Reddit handles this with:

  • Default content from popular communities (historically default subscriptions such as r/announcements; more recently the r/popular feed)
  • Onboarding flow that asks for interests
  • Trending/popular content as fallback when personalized feed is sparse

System Design Interview Tips

If you’re asked to design Reddit in an interview, here’s how I’d approach it:

Step 1: Clarify Requirements (5 minutes)

Don’t start coding or drawing. Ask:

  • “Are we focusing on the feed? Post creation? Voting?”
  • “What’s the scale? 1M users? 100M?”
  • “Should we handle real-time features like live comment updates?”
  • “Is media upload in scope?”

Interviewers reward candidates who scope properly before diving in.

Step 2: Estimate Scale (5 minutes)

Do quick back-of-envelope math:

flowchart TB
    %% Assumptions
    subgraph Assumptions
        A[500M daily active users]
        B[Avg 20 posts viewed per user per day]
        C[10 percent users create content]
        D[50M write actions per day]
    end
    %% Traffic
    subgraph Traffic
        E[Read QPS about 115k per sec]
        F[Write QPS about 580 per sec]
    end
    %% Storage
    subgraph Storage
        G[1M posts per day]
        H[1.8B posts over 5 years]
        I[2KB per post]
        J[Approx 3.6 TB total storage]
    end
    %% Logical Flow
    A --> B
    A --> C
    C --> D
    B --> E
    D --> F
    G --> H
    H --> J
    I --> J

This math tells you: this is a read-heavy system that needs aggressive caching.
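The arithmetic behind those estimates, written out so the numbers can be rechecked or adjusted. All inputs are the interview-style assumptions from the diagram above, not measured Reddit figures; the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

daily_active_users = 500_000_000
posts_viewed_per_user = 20
write_actions_per_day = 50_000_000   # 10% of users, one write action each
new_posts_per_day = 1_000_000
bytes_per_post = 2_000               # 2 KB, decimal units
retention_years = 5

read_qps = daily_active_users * posts_viewed_per_user / SECONDS_PER_DAY
write_qps = write_actions_per_day / SECONDS_PER_DAY
total_posts = new_posts_per_day * 365 * retention_years
storage_tb = total_posts * bytes_per_post / 1e12
read_write_ratio = read_qps / write_qps
```

Reads come out at roughly 200 times writes, which is the number to quote when you justify caching. The storage figure covers post text and metadata only; media, comments, and replicas would be estimated separately.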

Step 3: High-Level Design (10 minutes)

Draw the big boxes: client → CDN → load balancer → services → databases. Don’t get lost in details yet. Make sure your interviewer nods along.

Step 4: Deep Dive (15-20 minutes)

Pick 2-3 components and go deep. The interviewer will often guide you. Good candidates for deep dives:

  • Feed generation (pull vs. push vs. hybrid)
  • Voting system (write-heavy, eventual consistency)
  • Comment tree storage and retrieval
  • Ranking algorithm

Step 5: Address Non-Functional Requirements

Always cover:

  • Scalability: How does it handle 10x traffic?
  • Availability: What happens if one service goes down?
  • Consistency: Where do you accept eventual consistency?
  • Performance: What are your p99 latency targets?

Common Mistakes to Avoid

  • Jumping straight to microservices without justification. Start with a sensible monolith, then explain where you’d split.
  • Ignoring caching in a read-heavy system. This is a red flag.
  • Using a single database for everything. Show awareness of different workload patterns.
  • Forgetting about failures. What if Kafka goes down? What if a cache goes cold? Good engineers design for failure.

Final Thoughts

The best part about learning from real systems like Reddit is that the solutions aren’t magic. Every design choice is a tradeoff — consistency vs. speed, simplicity vs. scale, real-time vs. eventual consistency. The job of a good engineer isn’t to find the “right” answer — it’s to understand the tradeoffs and pick the one that fits the problem.

Next time you’re scrolling through Reddit at midnight, remember: somewhere, a Kafka consumer is chugging through vote events, a Redis sorted set is keeping your feed in order, and a Python service is running Wilson score calculations on millions of comments. All so you can find out whether cats or dogs are better.

Thanks for reading. If this was helpful, share it with someone preparing for their next system design interview. And if you have corrections or additions, I’d genuinely love to hear them.
