Rate Limit
A cap on how many requests or tokens a user can send to an AI API in a given window — used to manage cost, capacity, and abuse.
In plain English
Rate limits are the throttle on AI APIs: per-minute or per-day caps on requests, tokens, or both. Every provider has them, every production app eventually hits them, and the shape of the limits drives a lot of architectural decisions.
Common limit types:
- Requests per minute (RPM) — total API calls per minute
- Tokens per minute (TPM) — total input + output tokens per minute
- Concurrent requests — how many calls can be in flight at once
- Daily / monthly quotas — usage or spend caps that reset on a calendar boundary
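To see how RPM and TPM interact, here is a minimal client-side sketch that admits a request only when it fits both budgets in a rolling window. The class name `SlidingWindowLimiter`, its parameters, and the 60-second rolling window are assumptions made for illustration. Providers may count usage differently (for example with token buckets or fixed windows), so treat this as a local guard, not a model of any provider's enforcement.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Hypothetical client-side tracker for RPM and TPM over a rolling window."""

    def __init__(self, rpm, tpm, window=60.0, clock=time.monotonic):
        self.rpm = rpm          # max requests per window
        self.tpm = tpm          # max tokens per window
        self.window = window    # window length in seconds (assumed 60)
        self.clock = clock      # injectable clock, useful for testing
        self.events = deque()   # (timestamp, tokens) for each admitted request
        self.tokens_in_window = 0

    def _evict(self, now):
        # Drop requests that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            _, tokens = self.events.popleft()
            self.tokens_in_window -= tokens

    def try_acquire(self, tokens):
        """Record the request and return True if it fits both budgets."""
        now = self.clock()
        self._evict(now)
        if len(self.events) + 1 > self.rpm:
            return False  # would exceed the request budget
        if self.tokens_in_window + tokens > self.tpm:
            return False  # would exceed the token budget
        self.events.append((now, tokens))
        self.tokens_in_window += tokens
        return True
```

A caller would estimate the token count of a request, call `try_acquire`, and queue or delay the request when it returns `False`.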
Why you'll feel them:
- Burst traffic — a sudden user spike trips the per-minute limit
- Long-context calls — a single large request can consume most or all of a minute's TPM budget
- Agent workflows — many small calls hit RPM limits fast
- Multi-tenant apps — all of your end users draw on the limits of your single API key, so one heavy user can starve the rest
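A quick back-of-the-envelope calculation shows how fast these budgets run out. The limits used here (100,000 TPM and 50 RPM) and the helper names are made up for illustration and do not correspond to any particular provider or tier.

```python
def budget_fraction(request_tokens, tpm_limit):
    """Share of one minute's token budget used by a single request."""
    return request_tokens / tpm_limit


def max_tasks_per_minute(rpm_limit, calls_per_task):
    """How many multi-call agent tasks fit inside the per-minute request cap."""
    return rpm_limit // calls_per_task


# One long-context call: 90,000 input tokens plus 4,000 output tokens.
print(budget_fraction(90_000 + 4_000, 100_000))  # 0.94

# An agent that makes 12 small calls per task under a 50 RPM cap.
print(max_tasks_per_minute(50, 12))  # 4
```

Under these assumed numbers, a single long-context call leaves only 6% of the minute's tokens for everything else, and the agent completes at most four tasks per minute regardless of how few tokens each call uses.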
How teams handle it:
- Request a higher tier (many providers raise limits for paying customers on request)
- Queue requests and retry with exponential backoff
- Cache where possible (cached responses avoid repeat calls entirely; with some providers, prompt caching also reduces the tokens counted against TPM)
- Route across providers / models as failover
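The retry-with-backoff approach from the list above can be sketched as follows. `RateLimitError` is a hypothetical stand-in for whatever exception a client library raises on an HTTP 429 response, and `call_with_backoff` with its default delays is likewise an illustrative assumption, not any SDK's API.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds suggested by the server, if any


def call_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Invoke call(), retrying on RateLimitError with jittered exponential delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential growth capped at max_delay: base, 2*base, 4*base, ...
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads retries out so clients do not retry in lockstep.
            delay = rng() * delay
            # Honor a server hint when it asks for a longer wait.
            if err.retry_after is not None:
                delay = max(delay, err.retry_after)
            sleep(delay)
```

The jitter matters most in multi-worker deployments: without it, every worker that was throttled at the same moment retries at the same moment and trips the limit again. Failover to another provider or model can be layered on top by catching the final `RateLimitError` and dispatching the request elsewhere.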