Infra & cost

Rate Limit

A cap on how many requests or tokens a user can send to an AI API in a given window — used to manage cost, capacity, and abuse.

01 ——

In plain English

Rate limits are the throttle on AI APIs: per-minute or per-day caps on requests, tokens, or both. Every provider has them, every production app eventually hits them, and the shape of the limits drives a lot of architectural decisions.

Common limit types:

Requests per minute (RPM) — total API calls per minute
Tokens per minute (TPM) — total input + output tokens per minute (some providers cap input and output tokens separately)
Concurrent requests — how many calls can be in flight at once
Daily / monthly quotas — request, token, or spend caps that reset on a calendar boundary
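To see how RPM and TPM interact, here is a minimal client-side sketch that admits a request only if it fits both budgets over a sliding 60-second window. The class name `MinuteWindowLimiter`, its parameters, and the limit values are hypothetical. Real providers enforce limits server-side and may use a different algorithm (for example a token bucket), so treat this as an illustration rather than a description of any provider's behavior.

```python
from collections import deque


class MinuteWindowLimiter:
    """Illustrative client-side check for RPM and TPM over a sliding window."""

    def __init__(self, rpm: int, tpm: int, window_s: float = 60.0):
        self.rpm = rpm
        self.tpm = tpm
        self.window_s = window_s
        self._events = deque()  # (timestamp, tokens) for each admitted request
        self._tokens_in_window = 0

    def _evict(self, now: float) -> None:
        # Drop admitted requests that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window_s:
            _, tokens = self._events.popleft()
            self._tokens_in_window -= tokens

    def try_acquire(self, tokens: int, now: float) -> bool:
        """Return True and record the request if it fits both budgets."""
        self._evict(now)
        if len(self._events) + 1 > self.rpm:
            return False  # would exceed requests per minute
        if self._tokens_in_window + tokens > self.tpm:
            return False  # would exceed tokens per minute
        self._events.append((now, tokens))
        self._tokens_in_window += tokens
        return True


# Hypothetical limits: 3 requests and 1,000 tokens per minute.
limiter = MinuteWindowLimiter(rpm=3, tpm=1000)
results = [
    limiter.try_acquire(400, now=0.0),   # fits
    limiter.try_acquire(400, now=1.0),   # fits
    limiter.try_acquire(400, now=2.0),   # rejected: would reach 1,200 tokens
    limiter.try_acquire(100, now=3.0),   # fits as the third request
    limiter.try_acquire(10, now=4.0),    # rejected: would be a fourth request
    limiter.try_acquire(400, now=60.0),  # the first request has aged out
]
print(results)  # [True, True, False, True, False, True]
```

The sketch does not model concurrent-request caps or daily quotas, which would need their own counters.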

Why you'll feel them:

Burst traffic — a sudden user spike trips the per-minute limit
Long-context calls — one big request can consume an entire minute's TPM budget
Agent workflows — many small calls hit RPM limits fast
Multi-tenant apps — all of your end-users draw on the limits of your single API key, so one heavy user can starve the rest
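The arithmetic behind these cases is simple enough to sketch. Assuming every request is the same size, the number of requests that fit in a minute is whichever cap binds first. The function name and all numbers below are made up for illustration and do not reflect any provider's actual tiers.

```python
def effective_requests_per_minute(rpm: int, tpm: int, tokens_per_request: int) -> int:
    """Requests per minute that fit under both caps, assuming uniform request size."""
    return min(rpm, tpm // tokens_per_request)


# Hypothetical limits for one API key.
RPM, TPM = 500, 200_000

# Long-context calls: 150,000 tokens each, so TPM binds long before RPM.
long_context = effective_requests_per_minute(RPM, TPM, 150_000)

# Agent workflow: 300 tokens per step, so RPM binds before TPM.
agent_steps = effective_requests_per_minute(RPM, TPM, 300)

# Multi-tenant: 50 end-users sharing the key evenly.
per_tenant = agent_steps // 50

print(long_context, agent_steps, per_tenant)  # 1 500 10
```

With these numbers a single long-context call uses most of the minute's token budget, while the agent workflow runs out of requests with tokens to spare.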

How teams handle it:

Request a higher tier (most providers will raise limits for paying customers on request)
Queue requests and retry with exponential backoff
Cache where possible (cached responses avoid the call entirely; on some providers, prompt caching reduces the tokens counted toward TPM)
Route across providers / models as failover
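Two of these tactics, backoff and failover, can be sketched together. The example below retries with exponential backoff and full jitter, then falls through to the next provider once retries are exhausted. `RateLimitError`, `call_with_backoff`, and `call_with_failover` are hypothetical names, not part of any SDK. In real code you would catch your client library's own rate-limit exception (typically raised on HTTP 429) and, where the response includes a Retry-After header, prefer that value over a computed delay.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit (HTTP 429) error."""


def call_with_backoff(call, max_retries=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Retry `call` on RateLimitError with exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries, let the caller decide
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # random delay in [0, cap) spreads out retries


def call_with_failover(providers, **backoff_kwargs):
    """Try each provider callable in order, moving on once retries are exhausted."""
    if not providers:
        raise ValueError("at least one provider is required")
    last_error = None
    for call in providers:
        try:
            return call_with_backoff(call, **backoff_kwargs)
        except RateLimitError as err:
            last_error = err
    raise last_error


# Demo with fake providers; sleep and rng are injected so nothing actually waits.
attempts = {"primary": 0}
delays = []


def flaky_primary():
    attempts["primary"] += 1
    if attempts["primary"] < 3:
        raise RateLimitError("429")
    return "ok from primary"


def always_limited():
    raise RateLimitError("429")


result = call_with_backoff(flaky_primary, sleep=delays.append, rng=lambda: 1.0)
fallback = call_with_failover(
    [always_limited, lambda: "ok from fallback"],
    max_retries=2, sleep=lambda s: None,
)
print(result, delays, fallback)
```

Jitter matters because many clients that hit the limit at the same moment would otherwise all retry at the same moment too.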

02 ——