Blogs


How Meta Serverless Works?

Somewhere inside a hyperscale data center, a user taps a button on Instagram and an invisible chain reaction begins across thousands of machines. A piece of code spins up, runs for a few milliseconds, and vanishes almost instantly. No engineer manually provisioned a server for that request. No virtual machine was waiting in advance. The infrastructure simply reacted in real time, allocating compute exactly when it was needed and disappearing when the work was done.

That is serverless computing at scale, and building it correctly is one of the hardest distributed systems problems in the industry.

This post is a deep technical walkthrough of how Meta-style serverless infrastructure works internally. It covers the execution pipeline, the scheduler, container isolation, cold start optimization, autoscaling, networking, observability, and the engineering decisions that tie everything together. If you are preparing for a system design interview or just want to understand what really happens under the hood of a serverless platform at hyperscale, this is for you.

What Serverless Actually Means at Scale

The word “serverless” is a bit misleading. There are absolutely servers. What serverless really means is that the people writing the business logic do not have to think about those servers. They write a function, deploy it, and the platform handles everything else: provisioning, scaling, networking, isolation, and cleanup.

AWS Lambda made this concept mainstream. But building serverless infrastructure for a company like Meta, which handles billions of daily active users, trillions of requests, and petabytes of data, requires an entirely different class of engineering. You cannot just run Lambda internally. You need something purpose-built, something that understands Meta’s workload patterns, Meta’s infrastructure constraints, and Meta’s latency requirements.

Read on →

How X Timeline Works?

There is a moment, every time you open X, that feels effortless. A feed of tweets appears. Some are from people you follow. Others are from accounts you have never seen before but somehow feel relevant. A viral post catches your eye. A trending topic surfaces at just the right time. It all feels instant.

Behind every timeline refresh is a chain of distributed systems doing extraordinary amounts of work in milliseconds. Machines are computing your interests, traversing social graphs with hundreds of millions of edges, fetching precomputed timelines from tiered caches, ranking thousands of candidate tweets using machine learning models, and delivering the result before your thumb stops scrolling. X serves hundreds of millions of active users, and each of them expects a personalized, fresh, low-latency feed even at peak traffic.

Understanding how this actually works is one of the richest system design problems in modern software engineering. It touches distributed databases, event streaming, ML-based ranking, graph traversal, cache design, and real-time data pipelines all at once.

This post walks through the full architecture from first principles. Whether you are preparing for a system design interview or simply want to understand how large-scale social media infrastructure is built, this is the engineering deep dive you have been looking for.

Core Features of the X Timeline

Before jumping into architecture, it is worth being precise about what the timeline actually is. X has two primary feed surfaces.

The Following tab shows tweets from accounts you explicitly follow, in reverse chronological order. The For You tab, which is the default, is a fully personalized algorithmic feed. It pulls in tweets from accounts you follow, accounts you interact with, accounts that are popular in your network, trending content, and content from accounts you do not follow but that the recommendation engine believes you will engage with.

Read on →

How URL Shortener Works?

There is a particular kind of engineering problem that looks deceptively small from the outside. You paste a long URL into a box, click a button, and get back something like https://bit.ly/3xKp9Ld. The whole interaction takes less than a second. Behind that second, though, is a distributed system that has to do quite a lot of work — generate a globally unique short code, write it durably, cache it for fast retrieval, serve hundreds of thousands of redirects per second, collect analytics events without slowing down the redirect, scan for malicious links, and stay available across multiple data centers.

That is the honest shape of a URL shortener at scale. You could build a toy version in an afternoon with a SQLite database and a single Flask server. A production version used by millions of people is something else entirely.

This post walks through that production version — the architecture, the tradeoffs, the engineering decisions, and the failure modes. Whether you are preparing for a system design interview, building your own shortener, or just curious how Bitly or TinyURL actually work at scale, this should give you a real picture of what is going on inside.

Why URL Shortening Became a Distributed Systems Problem

The original use case was simple enough: URLs on the web can get long and ugly, especially after query parameters and tracking strings pile up. Early shorteners were literally just a database with two columns — a short code and a long URL — and a web server that did a lookup and issued a 301 redirect. That works perfectly fine at small scale.
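To make that early design concrete, here is a minimal sketch of the two-column version in Python, using the standard library's sqlite3 module. Everything named here is an assumption for illustration: the `TinyShortener` class, the `links` table, and the choice to derive the short code by base62-encoding a row counter are not taken from any real shortener. The `resolve` method returns a status code and headers rather than running an actual web server.

```python
import sqlite3
import string

# 62 symbols: 0-9, a-z, A-Z (illustrative alphabet, not any real service's)
ALPHABET = string.digits + string.ascii_letters


def encode_base62(n: int) -> str:
    """Encode a non-negative integer as a base62 string."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


class TinyShortener:
    """Hypothetical toy shortener: one table, two columns, no caching."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS links ("
            "code TEXT PRIMARY KEY, long_url TEXT NOT NULL)"
        )

    def shorten(self, long_url: str) -> str:
        # Derive the next code from the current row count.
        # Fine for a single process; not safe with concurrent writers.
        (count,) = self.db.execute("SELECT COUNT(*) FROM links").fetchone()
        code = encode_base62(count + 1)
        self.db.execute(
            "INSERT INTO links (code, long_url) VALUES (?, ?)",
            (code, long_url),
        )
        self.db.commit()
        return code

    def resolve(self, code: str):
        # Look up the code and describe the HTTP response to send.
        row = self.db.execute(
            "SELECT long_url FROM links WHERE code = ?", (code,)
        ).fetchone()
        if row is None:
            return 404, {}
        return 301, {"Location": row[0]}


shortener = TinyShortener()
code = shortener.shorten("https://example.com/some/long/path?utm_source=x")
status, headers = shortener.resolve(code)
```

The counter-based code generation is the part that stops working first: it needs a single writer, so it does not carry over to multiple servers or data centers without some coordination.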

Read on →

How Zomato Works?

The moment you tap “Place Order” on Zomato, a massive chain of events begins instantly behind the scenes. Within seconds, the system identifies your location, finds the right restaurant, assigns a nearby delivery partner, processes your payment, and starts estimating delivery time — all before you even put your phone away. What feels simple on the surface is actually a real-time logistics system coordinating thousands of moving parts across an entire city.

Food delivery sounds like a simple logistics problem until you actually try to build it. Then it reveals itself as one of the hardest classes of distributed systems work: real-time geo-spatial computation, dynamic resource allocation, live tracking across millions of concurrent sessions, recommendation engines that personalize without being creepy, and payment flows that must never lose money even when half the network is flaking. Zomato operates in hundreds of cities, handles millions of daily orders, and works with tens of thousands of restaurant partners. Let us walk through how a system like this actually works.

What Zomato Actually Does

Most people know Zomato as an app where you order food. But from an engineering perspective, Zomato is really three distinct systems bolted together and made to feel seamless.

The first is a restaurant discovery and search platform. This is the part that helps you find what to eat, browse menus, read reviews, and decide. It behaves like a specialized search engine with local context.

The second is a transactional ordering and logistics system. This handles the actual purchase: taking your order, processing payment, routing it to the restaurant, orchestrating the delivery from pickup to dropoff.

The third is a personalization and recommendations engine. This is the layer that learns your preferences, predicts what you will want to eat on a Tuesday night versus a Sunday brunch, and surfaces restaurant and dish suggestions that feel eerily relevant.

These three systems share infrastructure but have very different performance and correctness requirements. The search system can tolerate slight staleness. The ordering system cannot afford to lose a transaction. The recommendation system can afford to be a little wrong as long as it is interesting. Understanding that these systems have different tradeoffs is the first thing a good engineer thinks about before designing anything.

Core Features at a Glance

Before going deep, it helps to enumerate what Zomato actually provides so we can map features to systems:

Read on →

How Netflix Works?

Every second, somewhere in the world, a person presses play on Netflix and expects the impossible to feel effortless. In the brief instant before the first frame appears, an enormous distributed system has already selected the optimal video quality, routed the request to the nearest CDN edge server, adapted to the user’s network conditions, and begun streaming millions of bytes across the internet — all fast enough that the viewer never thinks about it. No loading screens. No blurry frames. Just seamless playback delivered on demand, at planetary scale.

What makes that moment deceptively simple is the enormous machinery running beneath it. Netflix serves somewhere north of 200 million subscribers across more than 190 countries. By some estimates, it has accounted for roughly 15% of global downstream internet traffic. Its recommendation engine influences roughly 80% of what subscribers watch. And the system has to work whether you are on fiber in Seoul, a spotty cellular connection in rural Brazil, or a smart TV in a hotel in Amsterdam.

Most distributed systems have to handle scale. Netflix has to handle scale while also being real-time, fault-tolerant, and nearly invisible to the user. The experience degrades the moment you notice it.

This post is an attempt to explain the full system. Not just the broad strokes, but the reasons behind each architectural decision, the tradeoffs made at each layer, and what happens when things go wrong. We will move from client to CDN to cloud, from video bytes to neural network embeddings, and from API gateway to analytics pipelines.

Core Features of Netflix

Before touching the architecture, it is worth being precise about what Netflix actually does. The feature list sounds mundane but each item carries serious engineering weight.

Read on →