Safety

Guardrails

Rules and filters that constrain what an AI model can output — used to block harmful, off-topic, or non-compliant responses.

01 ——

In plain English

Guardrails are the safety layers built around an AI model to keep its outputs appropriate, on-topic, and within policy. They're what stops a customer-support bot from giving medical advice, or a kids' app from producing adult content.

Where guardrails sit:

  • In the prompt — system instructions ("Refuse to discuss competitors")
  • In the model itself — RLHF training that builds in refusals
  • Around the model — input/output classifiers that filter requests and responses
  • At the orchestration layer — rules about which tools the model can call
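
As a rough illustration of the "around the model" and "orchestration layer" positions listed above, the sketch below wraps a model call with an input check and an output check, and adds a tool allowlist. Everything named here (`BLOCKED_TOPICS`, `ALLOWED_TOOLS`, `guarded_reply`, `echo_model`) is hypothetical and invented for the example. The keyword match stands in for what would normally be a trained classifier.

```python
from typing import Callable

# Hypothetical blocklist for a customer-support bot; a real system would
# more likely call a trained classifier than match keywords.
BLOCKED_TOPICS = ("diagnosis", "prescription", "dosage")
REFUSAL = "Sorry, I can't help with that topic."

# Hypothetical allowlist of tools the orchestration layer permits.
ALLOWED_TOOLS = {"lookup_order", "track_shipment"}


def violates_policy(text: str) -> bool:
    """Return True if the text mentions any blocked topic (case-insensitive)."""
    lowered = text.lower()
    return any(topic in lowered for topic in BLOCKED_TOPICS)


def guarded_reply(model: Callable[[str], str], user_message: str) -> str:
    """Call the model with a check on the request and a check on the response."""
    if violates_policy(user_message):  # input guardrail
        return REFUSAL
    reply = model(user_message)
    if violates_policy(reply):  # output guardrail
        return REFUSAL
    return reply


def is_tool_allowed(tool_name: str) -> bool:
    """Orchestration-layer rule: only allowlisted tools may be called."""
    return tool_name in ALLOWED_TOOLS


def echo_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption for this sketch)."""
    return f"You asked: {prompt}"
```

Calling `guarded_reply(echo_model, "What dosage should I take?")` returns the refusal string without the model ever being called, while an on-topic question passes through unchanged. Checking both the request and the response is the point of the design: an innocuous question can still produce a non-compliant answer.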

Common guardrail goals:

  • Block harmful content (violence, illegal advice, CSAM)
  • Prevent prompt injection attacks
  • Keep the bot on-topic for its use case
  • Protect privacy (redact PII)
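
The privacy goal above can be sketched with a small redaction pass that runs over text before it is logged or returned. The two patterns and the `redact_pii` helper are assumptions made for illustration; they are deliberately simple and will miss many real-world formats, so production systems usually rely on dedicated PII detectors.

```python
import re

# Illustrative patterns only: a basic email shape and a US-style phone number.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = US_PHONE_RE.sub("[PHONE]", text)
    return text


print(redact_pii("Reach me at jane.doe@example.com or 555-123-4567."))
# prints: Reach me at [EMAIL] or [PHONE].
```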

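Off-the-shelf libraries (Guardrails AI, NVIDIA NeMo Guardrails) provide guardrail tooling, and major model providers offer their own safety mechanisms: OpenAI exposes a moderation endpoint for classifying content, and Anthropic trains its models with Constitutional AI, a training approach rather than a separate filter.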

02 ——

Related terms

Vol. 4 · Issue 19 · Last reviewed 2026-05-30
