Rate Limiting: The More You Know

Harinder Singh
3 min read · Sep 20, 2022


What is Rate Limiting?

Rate limiting prevents the frequency of an operation from exceeding a defined constraint.

Speed limit on a highway (photo by Ludovic Charlet on Unsplash)

Reasons for using a Rate Limiter

I first read about this while preparing for interviews (system design rounds). In large-scale systems, rate limiting is commonly used to protect underlying services and resources. I later read about cascading failures in the Google SRE Book and how to address them; rate limiting can reduce the chances of such failures. It is important to implement rate limiting on both the client and the server side. Other reasons include:
1. Protecting shared services from excessive use — whether intended or unintended — to maintain service availability.
2. Improving the availability of API-based services by avoiding resource starvation and, in particular, preventing denial-of-service (DoS) attacks.
3. Managing resource limits or quotas when exposing services (free tier or enterprise) and monetising them.

Server-Side Rate Limiting Strategies:

  1. Token bucket
  2. Leaky bucket
    The above two might be confusing if you go through the definitions alone, so I will address them together. A leaky bucket can be understood through the analogy of a bucket with a constant leak: if water (requests) pours in faster than it leaks out, the bucket eventually overflows. There are two variants of the leaky bucket: as a meter and as a queue.
    As a meter, a counter increases as each request arrives and decreases at a fixed rate. At each request, conformance is checked: if the counter would exceed the bucket's capacity, the request does not conform and is rejected. The token bucket algorithm is a mirror image of this: tokens are added at a fixed rate and each request consumes one, so a request is rejected when the bucket is empty. A minimal sketch follows this list.
    In the second variant, the queue can be thought of as a simple FIFO buffer of finite length. When a new request arrives, the queue is checked: if there is room, the request is appended; otherwise it is discarded. At a defined constant interval (the rate can be set as required), queued requests are forwarded and the queue length is adjusted accordingly.
    The leaky bucket algorithm is also used at OSI Layer 2, the data link layer.
  3. Fixed window: Fixed-window limits, such as 3,000 requests per hour or 10 requests per day, are easy to state, but they are subject to spikes at the edges of the window as the available quota resets. A limit of 3,000 requests per hour, for example, still allows all 3,000 requests to be made in the first minute of the hour, which might overwhelm the service. A simple counter sketch follows this list.
  4. Sliding window: Sliding windows have the benefits of a fixed window, but the rolling window of time smooths out bursts. One way to implement this is to store the timestamp of each request (for example, in a Redis Sorted Set), remove all timestamps older than currentTime minus the rate-limit window, and count the length of the sorted set; if it exceeds the limit, reject the request. A Redis-based sketch follows this list.
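
To make the token bucket concrete, here is a minimal sketch in Python. The class name and parameters are illustrative rather than from any particular library, and the leaky-bucket meter is the same logic with the counter direction reversed.

    import time

    class TokenBucket:
        """Minimal token bucket: at most `capacity` tokens, refilled at `refill_rate` tokens/sec."""

        def __init__(self, capacity: float, refill_rate: float):
            self.capacity = capacity        # maximum burst size
            self.refill_rate = refill_rate  # tokens added per second
            self.tokens = capacity          # start full
            self.last_refill = time.monotonic()

        def allow(self) -> bool:
            """Return True if a request may proceed, consuming one token."""
            now = time.monotonic()
            elapsed = now - self.last_refill
            # Refill in proportion to elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    # Usage: bursts of up to 10 requests, sustained rate of 5 requests/sec.
    bucket = TokenBucket(capacity=10, refill_rate=5)
    print("forwarded" if bucket.allow() else "rejected (429)")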
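
The fixed-window counter from point 3 can be sketched with an in-memory dictionary. The key and limits below are illustrative, and a production version would evict counters for old windows.

    import time
    from collections import defaultdict

    class FixedWindowLimiter:
        """Allow at most `limit` requests per `window_seconds`-long window per key."""

        def __init__(self, limit: int, window_seconds: int):
            self.limit = limit
            self.window_seconds = window_seconds
            self.counters = defaultdict(int)  # (key, window index) -> request count

        def allow(self, key: str) -> bool:
            # Integer division maps every timestamp in the same window to one index.
            window = int(time.time()) // self.window_seconds
            self.counters[(key, window)] += 1
            return self.counters[(key, window)] <= self.limit

    limiter = FixedWindowLimiter(limit=3000, window_seconds=3600)
    print(limiter.allow("api-key-123"))  # True until the hourly quota is spent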
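
The sliding-window approach from point 4 can be sketched with the redis-py client; the key name, limit, and window size here are assumptions for illustration.

    import time
    import redis  # assumes redis-py and a Redis instance on localhost

    r = redis.Redis()

    def allow_request(key: str, limit: int, window_seconds: int) -> bool:
        """Sliding-window check over a Redis sorted set of request timestamps."""
        now = time.time()
        pipe = r.pipeline()
        pipe.zremrangebyscore(key, 0, now - window_seconds)  # drop timestamps outside the window
        pipe.zadd(key, {str(now): now})                      # record this request
        pipe.zcard(key)                                      # count requests in the window
        pipe.expire(key, window_seconds)                     # let idle keys expire
        _, _, count, _ = pipe.execute()
        return count <= limit

    if allow_request("ratelimit:user:42", limit=100, window_seconds=60):
        print("forwarded")
    else:
        print("rejected (429)")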

Client-Side Rate Limiting Strategies:

Before deciding which strategy to implement, settle on the key to rate-limit on: for example, source IP address, user, or API key. For HTTP services, the 429 (Too Many Requests) status code can be returned when a request is limited.

  1. Defer response — An alternative to rejecting rate-limited requests outright is to shunt them into a queue and return some form of job ID that the client can use to poll for the result later.
  2. Pass through — If the rate-limiting module itself fails, the system should still be able to handle requests without disclosing sensitive information; the pass-through acts as a fallback. This follows the fail-open principle.
  3. Backoff — In response to rate-limiting, intermittent, or non-specific errors, a client should generally retry the request after a delay that increases exponentially with each failed attempt. Synchronized retries can cause a periodic thundering herd (a friendly DDoS); to avoid this, additional random time (jitter) should be applied to the request timing, the backoff period, or both. This is explained in depth at https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/, whose call-contention graphs show that full jitter (a delay chosen uniformly at random between 0 and min(MAX_BACKOFF_CAPPED_TIME, base * pow(2, attempt_count))) performs best. A sketch follows this list.
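
A minimal sketch of retries with full jitter, following the formula above; the base delay, cap, and error handling are illustrative assumptions.

    import random
    import time

    BASE_DELAY = 0.1        # seconds; illustrative
    MAX_BACKOFF_CAP = 30.0  # seconds; illustrative

    def full_jitter_delay(attempt_count: int) -> float:
        """Delay drawn uniformly from [0, min(cap, base * 2**attempt_count))."""
        return random.uniform(0, min(MAX_BACKOFF_CAP, BASE_DELAY * 2 ** attempt_count))

    def call_with_retries(do_request, max_attempts: int = 5):
        """Retry a failing request, sleeping with full jitter between attempts."""
        for attempt in range(max_attempts):
            try:
                return do_request()
            except Exception:  # in practice, retry only retryable errors such as HTTP 429
                if attempt == max_attempts - 1:
                    raise
                time.sleep(full_jitter_delay(attempt))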

References:

  1. https://cloud.google.com/architecture/rate-limiting-strategies-techniques
  2. https://en.wikipedia.org/wiki/Leaky_bucket
  3. https://en.wikipedia.org/wiki/Token_bucket
  4. https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
  5. https://en.wikipedia.org/wiki/Thundering_herd_problem
  6. https://sre.google/sre-book/addressing-cascading-failures/
