Explaining Rate Limiter System Design
Understand How Companies Like Meta Decide To Allow Or Reject Millions of User Requests Every Second.

What is a Rate Limit ?
Most API’s and websites have a rate limit to deal with fair usage, high traffic and latency issues. The number of requests you can make in an hour or in a day (technically called the “time-window”) is called as ‘rate limit’.
The component that watches traffic and enforces these rules is called a rate limiter. It sits in front of your service (or inside your gateway) and decides one thing for every request:
“Do you pass, or do you get blocked ?”
Now the bigger questions,
What are the different Rate Limiter Algorithms ??? How do you choose the right one for your system ?? What do you choose in an interview ??
Requirements of Rate Limiter :
Enforce limits based on rules (per user, per IP, per API key, per endpoint).
Reject requests cleanly when limits are crossed, usually with HTTP429 “Too Many Requests”.
Tell the client when they can try again (Retry-After headers or timestamps).
Add almost no latency, because it runs on every request.
Stay highly available, because if it dies, everything dies.
Before I begin on the algorithms here are a few clarifications. I have used the term ‘user’ but it is a simple term for all keys - user, API key, IP address, basically any entity making requests.Every rate limiter works by defining a limit (e.g., requests per minute) and a time window over which that limit applies. Multiple algorithms exist because there’s always a tradeoff between exactness, burst tolerance, memory efficiency, and ease of distribution no single approach fits all scenarios. This blog doesn’t provide the best algorithm instead it presents the pros and cons and potential uses of each algorithm.
Algorithms :
Approach 1: Fixed Window [It’s got a catch]
This is the most naïve model.You divide time into fixed chunks like:
1:00–1:59
2:00–2:59
Each user gets a counter per window. When a request comes in, If counter is less then limit system allows and increments counter. Else it blocks with a HTTP 429 "Too Many Requests" status.
The Problem: boundary bursts. A user can send 100 requests at 1:59 and another 100 at 2:00 and technically stay “within limits” while recieving double the amount of allowed requests.
Uses : Ideal for low-risk or internal APIs where slight traffic bursts at window edges are acceptable
Approach 2: Sliding Window Log[Only for the books]
It stores timestamp of each request. For each new request sytem drops all timestamps older than now – window_size. Counts what remains. If count < limit → allows and logs this request. Due to it’s practically high costs, the implementations are limited.
Let’s do some maths to understand it better. For each user, we store a timestamp for every request in the last T seconds.
Limit = 100req/min
User makes 100 requests → we store 100 timestamps
Next minute → old ones expire and new start coming
for a million active users every minute we store 100 million timestamps.
If each timestamp is 50 bytes, we need 100M * 50 = 5GB RAM for the most basic implementation. Including redis replication, multiple limits and safety headrooms is absolutely infeasinble.
Thus making this algorithm more of an academic theory and less of actual usage.
Approach 3: Sliding Window Counter [Best conceptually]
It solves the boundry doubling problem we saw earlier.Sliding window counter is a space-efficient approximation of sliding window log that removes boundary spikes while remaining Redis-friendly
For each user we store two numbers- count in current window and previous window. When a request is recieved, a weighted sum is computed which signal how far in the window we are. For a 1 min window if we are 30 sec in the weight is 30/60 = 0.5. Then effective count is calculated. effective count = current_count + previous_count*(1-weight) . The system compares this count with the limit to decide between approvala nd rejection .
Memory: O(1) memory per key and O(1) work per key
The Tradeoff :- Sliding window counter treats the previous window as if traffic were uniformly distributed across it. It has no idea whether requests were front-loaded, back-loaded, or spiky. Worst case, if the previous window had a spike in the first second and we are early in the new window it will reject the request even if true sliding window would have allowed.
However it’s important to mention that the system can over-penalize but never under-penalizes. The tradeoff is more security and less preciscion which is generally acceptable.
Uses: High traffic, public facing systems
SaaS rate limits
Auth endpoints
Payment API
API gateways
User throttling
Approach 4: Token Bucket [Industry standard]
Imagine a movie theater. You request to watch the movie. The ticket counter looks if any seats are remianing and provides you with tickets accordingly. After every show the number of tickets refill. The token bucket is similiar except that the remaining tickets(tokens) are carried on to the next show.
A bucket initially holds ‘n’ tokens (n = capacity of bucket).These tokens refill at a constant rate e.g. 10 tokens/sec. Each request consumes 1 token. If token exists, request is allowed; else rejected. Unused tokens carry over. For eg if at the end of 1st secong the bucket has 5 unused tokens, after the 1st second those 5 tokens will be added to the incoming 10 tokens and the user can now send upto 15 requests. Thus a user can ‘save’ it’s tokens and spike more than ‘n’ later on.
Memory: O(1) per key
Edge cases: Spikes bigger than capacity are still rejected. Even if a user ‘saves’ for 50 tokens if the bucket capacity is 10, the rest 40 will be rejected. This is because even if refill rate generates more tokens than the bucket can hold the extra tokens will be discarded
Advantages over sliding window counter: Token Bucket is widely used in production by companies like AWS, Stripe, and Meta for API rate limiting and traffic control. Unlike Sliding Window Counter, which smooths traffic but can slightly over-reject requests due to its approximation, Token Bucket naturally accommodates bursts: users or services can accumulate tokens during quiet periods and spend them during spikes, preventing unnecessary rejections. It is also highly memory-efficient, needing only a token count and a timestamp per key, compared to the two counters plus timestamp in Sliding Window Counter. Finally, it is simpler to implement and distribute across multiple servers or regions, making it highly practical for real-world, high-traffic systems where reliability and simplicity matter as much as fairness.
Uses:
Large scale production APIs(AWS, Stripe, Meta, Google Cloud)
Network Routers and Cloud services for traffic control
Auth and login
SaaS platforms
Chat Apps
Cloud Storage
API Gateways
Approach 5: Leaky Bucket [overflow → discards]
This is the actual theatre. The unbought tickets from previous show are not added to the next show. Because the capacity of movie hall stays the same.
This Bucket has a constant drain rate of capacity per window (eg 5 tokens per minute).If the bucket is full when a new request arrives, that request is rejected or dropped. Leaky Bucket smooths traffic by limiting bursts, whereas Token Bucket allows bursts up to capacity. However it also wastes capacity if traffic is uneven.
Uses:
Network traffic shaping
Systems where smoothing is more important than allowing bursts
(*If the system is distributed, implementing a strict Leaky Bucket requires centralized queue or careful coordination, which is why it’s rarely used in modern API rate limiting)