Infra & cost

Rate Limit

A cap on how many requests or tokens a client can send to an AI API in a given time window, used to control cost, protect capacity, and deter abuse.

01 ——

In plain English

Rate limits are the throttle on AI APIs: per-minute or per-day caps on requests, tokens, or both. Every provider has them, every production app eventually hits them, and the shape of the limits drives a lot of architectural decisions.

Common limit types:

  • Requests per minute (RPM) — total API calls per minute
  • Tokens per minute (TPM) — total input + output tokens per minute; some providers cap input and output tokens separately
  • Concurrent requests — how many calls can be in flight at once
  • Daily / monthly quotas — usage or spend caps that reset on a calendar boundary
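
To make the RPM and TPM caps above concrete, here is a minimal client-side sketch that tracks both with token buckets and admits a call only when it fits in each. The class names (TokenBucket, ClientLimiter), the method try_acquire, and the limits used are all hypothetical; real providers enforce their limits server-side and may use a different windowing scheme, so treat this as an approximation rather than any provider's actual algorithm.

```python
import time
from typing import Optional


class TokenBucket:
    """Bucket that refills continuously, up to `per_minute` units (illustrative)."""

    def __init__(self, per_minute: float, now: Optional[float] = None):
        self.capacity = per_minute
        self.rate = per_minute / 60.0      # units refilled per second
        self.level = per_minute            # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def wait_time(self, amount: float, now: float) -> float:
        """Seconds until `amount` units are available (0.0 if available now)."""
        if amount > self.capacity:
            raise ValueError("request is larger than the whole per-minute budget")
        # Refill for the time elapsed since the last check, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.level = min(self.capacity, self.level + elapsed * self.rate)
        self.updated = now
        if amount <= self.level:
            return 0.0
        return (amount - self.level) / self.rate

    def take(self, amount: float) -> None:
        self.level -= amount


class ClientLimiter:
    """Tracks an RPM bucket and a TPM bucket; a call must fit in both."""

    def __init__(self, rpm: float, tpm: float, now: Optional[float] = None):
        self.requests = TokenBucket(rpm, now)
        self.tokens = TokenBucket(tpm, now)

    def try_acquire(self, est_tokens: float, now: Optional[float] = None) -> float:
        """Reserve budget and return 0.0 if the call may go now, else seconds to wait."""
        now = time.monotonic() if now is None else now
        wait = max(self.requests.wait_time(1, now),
                   self.tokens.wait_time(est_tokens, now))
        if wait == 0.0:
            self.requests.take(1)
            self.tokens.take(est_tokens)
        return wait


if __name__ == "__main__":
    limiter = ClientLimiter(rpm=60, tpm=1000)   # assumed limits, not a real tier
    print(limiter.try_acquire(est_tokens=600))  # first call fits
    print(limiter.try_acquire(est_tokens=600))  # second must wait for TPM refill
```

The token estimate has to be made before the call (prompt tokens plus the maximum output you allow), so a limiter like this tends to be conservative.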

Why you'll feel them:

  • Burst traffic — a sudden user spike trips the per-minute limit
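  • Long-context calls — a single large request can consume most or all of a minute's TPM budget
  • Agent workflows — many small calls in quick succession hit RPM limits fast
  • Multi-tenant apps — all your end users draw on the limits of your single API key, so one heavy tenant can starve the rest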
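
A quick way to see which limit a workload will hit first is to compare the RPM cap with the request rate the TPM cap allows at a given request size. The sketch below does that arithmetic; the limits (500 RPM, 200,000 TPM), the request sizes, and the function names are made-up values for illustration, not any provider's published tier.

```python
def effective_rpm(rpm_limit: float, tpm_limit: float, tokens_per_request: float) -> float:
    """Sustainable requests per minute when both limits apply."""
    return min(rpm_limit, tpm_limit / tokens_per_request)


def binding_limit(rpm_limit: float, tpm_limit: float, tokens_per_request: float) -> str:
    """Name of the limit that is reached first at this request size."""
    return "RPM" if rpm_limit <= tpm_limit / tokens_per_request else "TPM"


def fair_share(limit: float, tenants: int) -> float:
    """Even split of one shared key's budget across tenants."""
    return limit / tenants


if __name__ == "__main__":
    RPM, TPM = 500, 200_000  # assumed limits for illustration only

    # Agent-style traffic: many small calls, so the request cap binds.
    print(binding_limit(RPM, TPM, 150), effective_rpm(RPM, TPM, 150))

    # Long-context traffic: each call is huge, so the token cap binds.
    print(binding_limit(RPM, TPM, 100_000), effective_rpm(RPM, TPM, 100_000))

    # Multi-tenant: 40 tenants sharing one key get a small slice each.
    print(fair_share(TPM, 40))
```

With these assumed numbers, small agent calls are capped by RPM, while 100,000-token calls are capped by TPM at about two per minute.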

How teams handle it:

  • Request a higher tier (many providers will raise limits for paying customers on request)
  • Queue requests and retry with exponential backoff
  • Cache where possible (prompt caching can reduce TPM usage on providers that discount cached tokens against the limit)
  • Route across providers / models as failover
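
The retry and failover items above can be sketched roughly as follows. This is a generic illustration, not a specific SDK's behavior: RateLimited is a hypothetical stand-in for whatever error your client raises on an HTTP 429 response, the provider callables are placeholders you would supply, and the backoff uses "full jitter" (a random delay up to an exponentially growing ceiling), which is one common variant among several. If the server supplies a Retry-After value, the sketch prefers it over the computed delay.

```python
import random
import time
from typing import Callable, Optional, Sequence


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds, if the server sent Retry-After


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retry(call: Callable[[], str], max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry `call` on RateLimited, sleeping between attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            delay = err.retry_after if err.retry_after is not None else backoff_delay(attempt)
            sleep(delay)
    raise AssertionError("unreachable")


def call_with_failover(providers: Sequence[Callable[[], str]], **retry_kwargs) -> str:
    """Try each provider in order, moving on once one stays rate limited."""
    if not providers:
        raise ValueError("at least one provider is required")
    last_error: Optional[RateLimited] = None
    for call in providers:
        try:
            return call_with_retry(call, **retry_kwargs)
        except RateLimited as err:
            last_error = err
    assert last_error is not None
    raise last_error
```

The jitter matters when many workers are throttled at once: without it they all retry at the same instant and trip the limit again.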

Last reviewed May 2026
Vol. 4 · Issue 19 · Last reviewed 2026-05-30
