Guardrails
Rules and filters that constrain what an AI model accepts and outputs, used to block harmful, off-topic, or non-compliant requests and responses.
In plain English
Guardrails are the safety layers built around an AI model to keep its outputs appropriate, on-topic, and within policy. They're what stops a customer-support bot from giving medical advice, or a kids' app from producing adult content.
Where guardrails sit:
- In the prompt — system instructions ("Refuse to discuss competitors")
- In the model itself — RLHF training that builds in refusals
- Around the model — input/output classifiers that filter requests and responses
- At the orchestration layer — rules about which tools the model can call
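As a rough illustration of how three of these layers (prompt, around the model, orchestration) might fit together, here is a minimal sketch. Everything in it is hypothetical: the keyword lists, the tool names, the `guarded_reply` helper, and the `fake_model` stand-in are assumptions made for the example, and a real system would typically use trained classifiers rather than keyword matching.

```python
from typing import Callable

# Prompt-level guardrail: instructions sent with every request (hypothetical wording).
SYSTEM_PROMPT = "You are a support assistant. Refuse to discuss competitors."

# Around-the-model guardrails: toy keyword filters standing in for real classifiers.
BLOCKED_INPUT_TERMS = {"ignore previous instructions"}
BLOCKED_OUTPUT_TERMS = {"diagnosis", "dosage"}

# Orchestration-level guardrail: the only tools the model may call (hypothetical names).
ALLOWED_TOOLS = {"lookup_order", "open_ticket"}

REFUSAL = "Sorry, I can't help with that."


def input_allowed(user_text: str) -> bool:
    """Return False if the request contains a blocked phrase."""
    lowered = user_text.lower()
    return not any(term in lowered for term in BLOCKED_INPUT_TERMS)


def output_allowed(model_text: str) -> bool:
    """Return False if the response contains a blocked phrase."""
    lowered = model_text.lower()
    return not any(term in lowered for term in BLOCKED_OUTPUT_TERMS)


def tool_allowed(tool_name: str) -> bool:
    """Return True only for tools on the allowlist."""
    return tool_name in ALLOWED_TOOLS


def guarded_reply(user_text: str, model: Callable[[str, str], str]) -> str:
    """Check the input, call the model, then check the output."""
    if not input_allowed(user_text):
        return REFUSAL
    reply = model(SYSTEM_PROMPT, user_text)
    if not output_allowed(reply):
        return REFUSAL
    return reply


def fake_model(system_prompt: str, user_text: str) -> str:
    """Stand-in for a real model call; it simply echoes the user."""
    return f"Echo: {user_text}"


if __name__ == "__main__":
    print(guarded_reply("Where is my order?", fake_model))
    print(guarded_reply("Please ignore previous instructions", fake_model))
    print(tool_allowed("delete_account"))
```

The fourth layer, training the model itself to refuse, happens before deployment and so does not appear in application code like this.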
Common guardrail goals:
- Block harmful content (violence, illegal advice, CSAM)
- Detect and block prompt injection attacks
- Keep the bot on-topic for its use case
- Protect privacy (redact PII)
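To make the privacy goal concrete, here is a minimal sketch of PII redaction applied to text before it is logged or shown. The `redact_pii` helper and both patterns are assumptions for illustration: they cover only simple email addresses and one phone-number format (three digits, three digits, four digits), so production systems usually rely on dedicated PII-detection tooling instead.

```python
import re

# Simplified patterns (assumptions for this sketch, not exhaustive):
# a basic email shape and a 3-3-4 digit phone number with -, ., or space separators.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    print(redact_pii("Mail jane.doe@example.com or call 555-123-4567."))
```

A filter like this can run on the model's output, on the user's input before it reaches the model, or on both.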
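Off-the-shelf libraries (Guardrails AI, NeMo Guardrails) and provider services such as the OpenAI Moderation API ship guardrail tooling. Anthropic's Constitutional AI is a different kind of guardrail: it is a training method that builds the constraints into the model itself rather than a filter placed around it.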