Guardrails & Safety¶
Guardrails provide a safety layer around your agents, ensuring they behave within defined boundaries. They operate on inputs (user messages), outputs (agent responses), tools (execution authorization), and optionally through LLM-powered moderation.
Quick Start¶
Enable basic protection with a few lines:
guardrails:
default:
enabled: true
input:
injection: { enabled: true }
sanitizer: { enabled: true, trim_whitespace: true }
output:
pii: { enabled: true, redact_mode: mask }
agents:
assistant:
guardrails: default
This blocks prompt injection attempts, sanitizes input, and masks PII in output.
How Guardrails Work¶
Guardrails run as a chain of responsibility. Each check runs in sequence and can:
| Action | Behavior |
|---|---|
allow |
Continue normally |
block |
Stop execution, return error |
modify |
Continue with transformed content |
warn |
Log warning, continue |
Each violation has a severity level: low, medium, high, or critical.
Chain Modes¶
guardrails:
strict:
input:
chain_mode: fail_fast # Stop on first violation (default)
output:
chain_mode: collect_all # Run all checks, return consolidated report
fail_fast(default): Maximum security. Rejects immediately on first failurecollect_all: Runs every check and returns all violations together, useful for debugging or user feedback
Input Guardrails¶
Input guardrails protect your agent from malicious or malformed user messages.
Prompt Injection Detection¶
Detects attempts to override system instructions. Hector includes built-in detection patterns for common attack vectors:
- Override attempts: "ignore previous instructions", "disregard all rules"
- Role manipulation: "you are now", "pretend to be", "act as"
- System impersonation: messages starting with
system:,[system],<system> - Jailbreak patterns: "jailbreak", "DAN mode", "developer mode"
- Encoded content:
base64:prefixed payloads
input:
injection:
enabled: true
action: block # block | warn (default: block)
severity: high # default: high
case_sensitive: false # default: false
patterns: # Add custom patterns (regex)
- "bypass safety"
- "reveal your prompt"
- "(?i)ignore.*instructions"
Custom Patterns Are Additive
Custom patterns are added to the built-in set, not replacing them. The built-in patterns always run when injection detection is enabled.
Input Sanitization¶
Cleans inputs before they reach the agent:
input:
sanitizer:
enabled: true
trim_whitespace: true # Remove leading/trailing whitespace (default: true)
normalize_unicode: false # Normalize Unicode characters
max_length: 50000 # Truncate (0 = no limit)
strip_html: true # Remove HTML tags
Length Validation¶
Reject inputs that are too short or too long:
input:
length:
enabled: true
min_length: 1 # default: 1
max_length: 100000 # default: 100000
action: block
severity: medium
Pattern Validation¶
Block or allow inputs matching specific patterns:
input:
pattern:
enabled: true
block_patterns:
- "sudo.*"
- "rm -rf"
- "DROP TABLE"
allow_patterns:
- ".*" # Allow everything else
action: block
severity: high
Output Guardrails¶
Output guardrails protect your users and organization from harmful agent responses.
PII Redaction¶
Detects and redacts personally identifiable information before it reaches the user.
output:
pii:
enabled: true
detect_email: true # Email addresses (default: true)
detect_phone: true # Phone numbers (US format, +1 variants)
detect_ssn: true # Social Security Numbers (XXX-XX-XXXX)
detect_credit_card: true # Visa, Mastercard, Amex, Discover
redact_mode: mask # mask | remove | hash
action: modify # default: modify
severity: high
Redaction modes:
| Mode | Input | Output |
|---|---|---|
mask |
[email protected] |
[EMAIL REDACTED] |
remove |
[email protected] |
(removed entirely) |
hash |
[email protected] |
[EMAIL:a1b2c3d4] |
Use hash for Debugging
Hash mode lets you correlate redacted values across a conversation without exposing the actual data. Useful for debugging PII leaks.
Content Filtering¶
Block responses containing specific keywords or patterns:
output:
content:
enabled: true
blocked_keywords: # Case-insensitive exact match
- "internal_api_key"
- "confidential"
blocked_patterns: # Regex patterns
- "(?i)password.*=.*"
- "sk-[a-zA-Z0-9]{20,}" # Block API key patterns
action: block
severity: critical
Tool Authorization¶
Control which tools agents can invoke, using glob patterns for flexible matching:
guardrails:
restricted:
tool:
authorization:
enabled: true
allowed_tools:
- "search*" # All search tools
- "read_*" # All read operations
- "web_fetch" # Specific tool
blocked_tools:
- "delete_*" # Block all delete operations
- "*_production" # Block anything ending in _production
action: block
severity: high
Layered Tool Security
Tool authorization works alongside other tool security measures. You can combine: (1) Per-agent tool assignment, (2) MCP tool filtering, (3) Human-in-the-loop approval, (4) Guardrail-based authorization. See the Security Guide for the full picture.
LLM-Powered Moderation¶
For semantic safety that pattern matching can't catch, use LLM-based moderation. Hector supports multiple providers:
OpenAI Moderation¶
Free to use with any OpenAI API key:
guardrails:
moderated:
moderation:
enabled: true
strategy: openai
action: block # block | warn
openai:
model: omni-moderation-latest # or text-moderation-latest
threshold: 0.8 # 0-1 confidence threshold
Lakera Guard¶
Specialized in prompt injection and jailbreak detection:
guardrails:
moderated:
moderation:
enabled: true
strategy: lakera
action: block
lakera:
project_id: "your-project-id" # Optional, uses default policy if empty
breakdown: true # Return detailed detector breakdown
Choosing a Moderation Provider
- OpenAI: Good general-purpose content moderation (hate, violence, self-harm). Free API.
- Lakera: Best for prompt injection and jailbreak detection. Paid but specialized.
- Prompt-based: Use your own LLM to evaluate content. Most flexible, highest latency.
Prompt-Based Moderation¶
Use a custom LLM prompt for domain-specific moderation:
guardrails:
custom_moderation:
moderation:
enabled: true
strategy: prompt
prompt:
llm: claude # LLM to use (falls back to agent's LLM)
template: |
Evaluate if this user message is safe and appropriate
for a customer support context. Respond with JSON:
{"safe": true/false, "reason": "..."}
Message: {input}
safe_field: "safe" # JSON field to check (default: "safe")
action: block
Complete Example¶
A production-ready guardrail configuration combining all layers:
guardrails:
production:
enabled: true
input:
chain_mode: fail_fast
length:
enabled: true
max_length: 50000
action: block
severity: medium
injection:
enabled: true
patterns:
- "reveal.*system.*prompt"
- "output.*instructions"
action: block
severity: critical
sanitizer:
enabled: true
trim_whitespace: true
strip_html: true
max_length: 50000
pattern:
enabled: true
block_patterns:
- "(?i)DROP TABLE"
- "(?i)DELETE FROM"
output:
chain_mode: collect_all
pii:
enabled: true
detect_email: true
detect_phone: true
detect_ssn: true
detect_credit_card: true
redact_mode: mask
action: modify
severity: high
content:
enabled: true
blocked_patterns:
- "sk-[a-zA-Z0-9]{20,}" # API keys
- "(?i)internal use only"
action: block
severity: critical
tool:
authorization:
enabled: true
allowed_tools: ["search*", "read_*", "web_fetch"]
blocked_tools: ["delete_*", "command_*"]
moderation:
enabled: true
strategy: openai
openai:
model: omni-moderation-latest
threshold: 0.8
action: block
agents:
support_bot:
guardrails: production
Observability¶
Guardrail interventions are tracked in event metadata, making them visible in logs, traces, and Studio:
- Intervention source:
input_guardrail,output_guardrail,tool_guardrail,moderation - Details: Matched pattern, action taken, severity
- Metrics: Prometheus counters for guardrail triggers (by type, action, severity)
This means you can build dashboards and alerts for guardrail activity. See the Observability Guide.
Next Steps¶
- Security Guide: Full security model including auth, sandboxing, and tool permissions
- Tools Guide: Tool-level security controls
- Guardrails Reference: Complete YAML schema