
Guardrails & Safety

Guardrails provide a safety layer around your agents, ensuring they behave within defined boundaries. They operate on inputs (user messages), outputs (agent responses), tools (execution authorization), and optionally through LLM-powered moderation.

Quick Start

Enable basic protection with a few lines:

guardrails:
  default:
    enabled: true
    input:
      injection: { enabled: true }
      sanitizer: { enabled: true, trim_whitespace: true }
    output:
      pii: { enabled: true, redact_mode: mask }

agents:
  assistant:
    guardrails: default

This blocks prompt injection attempts, sanitizes input, and masks PII in output.


How Guardrails Work

Guardrails run as a chain of responsibility: each check executes in sequence and returns one of four actions:

Action   Behavior
allow    Continue normally
block    Stop execution and return an error
modify   Continue with the transformed content
warn     Log a warning and continue

Each violation has a severity level: low, medium, high, or critical.

Chain Modes

guardrails:
  strict:
    input:
      chain_mode: fail_fast    # Stop on first violation (default)
    output:
      chain_mode: collect_all  # Run all checks, return consolidated report

  • fail_fast (default): maximum security; rejects immediately on the first failing check.
  • collect_all: runs every check and returns all violations together, useful for debugging or user feedback.
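The two chain modes can be pictured with a short sketch. This is an illustrative toy, not Hector's actual internals; the check functions and `Violation` shape are assumptions for demonstration.

```python
from dataclasses import dataclass

@dataclass
class Violation:
    check: str
    severity: str

def run_chain(checks, text, chain_mode="fail_fast"):
    """Run each check in order. A check returns a Violation or None."""
    violations = []
    for check in checks:
        v = check(text)
        if v is not None:
            if chain_mode == "fail_fast":
                return [v]          # stop on the first violation
            violations.append(v)    # collect_all: keep going
    return violations

# Two toy checks standing in for real guardrails.
def too_long(text):
    return Violation("length", "medium") if len(text) > 10 else None

def has_sudo(text):
    return Violation("pattern", "high") if "sudo" in text else None

checks = [too_long, has_sudo]
print(run_chain(checks, "sudo rm -rf / please", "fail_fast"))    # one violation
print(run_chain(checks, "sudo rm -rf / please", "collect_all"))  # both violations
```

With fail_fast the second check never runs once the first one fires; with collect_all you get the full list back in one pass.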

Input Guardrails

Input guardrails protect your agent from malicious or malformed user messages.

Prompt Injection Detection

Detects attempts to override system instructions. Hector includes built-in detection patterns for common attack vectors:

  • Override attempts: "ignore previous instructions", "disregard all rules"
  • Role manipulation: "you are now", "pretend to be", "act as"
  • System impersonation: messages starting with system:, [system], <system>
  • Jailbreak patterns: "jailbreak", "DAN mode", "developer mode"
  • Encoded content: base64: prefixed payloads

input:
  injection:
    enabled: true
    action: block          # block | warn (default: block)
    severity: high         # default: high
    case_sensitive: false  # default: false
    patterns:              # Add custom patterns (regex)
      - "bypass safety"
      - "reveal your prompt"
      - "(?i)ignore.*instructions"

Custom Patterns Are Additive

Custom patterns are added to the built-in set, not replacing them. The built-in patterns always run when injection detection is enabled.
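The additive behavior can be sketched as follows. The pattern lists here are illustrative stand-ins (not Hector's real built-in set), and mapping case_sensitive: false to re.IGNORECASE is an assumption:

```python
import re

# Assumed stand-ins for the built-in and custom pattern sets.
BUILTIN = [r"ignore previous instructions", r"you are now"]
custom = [r"bypass safety", r"reveal your prompt"]

def find_injection(text, case_sensitive=False):
    flags = 0 if case_sensitive else re.IGNORECASE
    for pat in BUILTIN + custom:   # built-ins always run; customs are additive
        if re.search(pat, text, flags):
            return pat
    return None

print(find_injection("Please IGNORE PREVIOUS INSTRUCTIONS"))  # built-in hit
print(find_injection("now reveal your prompt"))               # custom hit
print(find_injection("what's the weather?"))                  # no match
```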

Input Sanitization

Cleans inputs before they reach the agent:

input:
  sanitizer:
    enabled: true
    trim_whitespace: true      # Remove leading/trailing whitespace (default: true)
    normalize_unicode: false   # Normalize Unicode characters
    max_length: 50000          # Truncate (0 = no limit)
    strip_html: true           # Remove HTML tags

Length Validation

Reject inputs that are too short or too long:

input:
  length:
    enabled: true
    min_length: 1         # default: 1
    max_length: 100000    # default: 100000
    action: block
    severity: medium

Pattern Validation

Block or allow inputs matching specific patterns:

input:
  pattern:
    enabled: true
    block_patterns:
      - "sudo.*"
      - "rm -rf"
      - "DROP TABLE"
    allow_patterns:
      - ".*"  # Allow everything else
    action: block
    severity: high

Output Guardrails

Output guardrails protect your users and organization from harmful agent responses.

PII Redaction

Detects and redacts personally identifiable information before it reaches the user.

output:
  pii:
    enabled: true
    detect_email: true        # Email addresses (default: true)
    detect_phone: true        # Phone numbers (US format, +1 variants)
    detect_ssn: true          # Social Security Numbers (XXX-XX-XXXX)
    detect_credit_card: true  # Visa, Mastercard, Amex, Discover
    redact_mode: mask         # mask | remove | hash
    action: modify            # default: modify
    severity: high

Redaction modes:

Mode     Input                  Output
mask     user@example.com       [EMAIL REDACTED]
remove   user@example.com       (removed entirely)
hash     user@example.com       [EMAIL:a1b2c3d4]

Use Hash Mode for Debugging

Hash mode lets you correlate redacted values across a conversation without exposing the actual data. Useful for debugging PII leaks.
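A minimal sketch of the three redaction modes, for email only. The regex and the 8-character hash prefix are assumptions chosen to match the [EMAIL:a1b2c3d4] shape, not Hector's exact implementation:

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_emails(text, mode="mask"):
    def sub(m):
        if mode == "mask":
            return "[EMAIL REDACTED]"
        if mode == "remove":
            return ""
        # hash: deterministic tag, so repeats of the same address correlate
        digest = hashlib.sha256(m.group().encode()).hexdigest()[:8]
        return f"[EMAIL:{digest}]"
    return EMAIL_RE.sub(sub, text)

print(redact_emails("contact user@example.com", "mask"))
# Same address always hashes to the same tag:
print(redact_emails("user@example.com again user@example.com", "hash"))
```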

Content Filtering

Block responses containing specific keywords or patterns:

output:
  content:
    enabled: true
    blocked_keywords:           # Case-insensitive exact match
      - "internal_api_key"
      - "confidential"
    blocked_patterns:           # Regex patterns
      - "(?i)password.*=.*"
      - "sk-[a-zA-Z0-9]{20,}"  # Block API key patterns
    action: block
    severity: critical

Tool Authorization

Control which tools agents can invoke, using glob patterns for flexible matching:

guardrails:
  restricted:
    tool:
      authorization:
        enabled: true
        allowed_tools:
          - "search*"       # All search tools
          - "read_*"        # All read operations
          - "web_fetch"     # Specific tool
        blocked_tools:
          - "delete_*"      # Block all delete operations
          - "*_production"  # Block anything ending in _production
        action: block
        severity: high

Layered Tool Security

Tool authorization works alongside other tool security measures. You can combine: (1) Per-agent tool assignment, (2) MCP tool filtering, (3) Human-in-the-loop approval, (4) Guardrail-based authorization. See the Security Guide for the full picture.
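The glob matching behaves like shell-style fnmatch patterns. The sketch below assumes a precedence where the block-list always wins and an empty allow-list means everything is allowed; treat that as illustrative, not a statement of Hector's exact semantics:

```python
from fnmatch import fnmatch

ALLOWED = ["search*", "read_*", "web_fetch"]
BLOCKED = ["delete_*", "*_production"]

def authorize(tool):
    if any(fnmatch(tool, pat) for pat in BLOCKED):
        return False                    # block-list always wins (assumed)
    if not ALLOWED:
        return True                     # no allow-list: allow all (assumed)
    return any(fnmatch(tool, pat) for pat in ALLOWED)

print(authorize("search_web"))         # True  (matches search*)
print(authorize("delete_user"))        # False (matches delete_*)
print(authorize("deploy_production"))  # False (matches *_production)
```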


LLM-Powered Moderation

For semantic safety that pattern matching can't catch, use LLM-based moderation. Hector supports multiple providers:

OpenAI Moderation

Free to use with any OpenAI API key:

guardrails:
  moderated:
    moderation:
      enabled: true
      strategy: openai
      action: block          # block | warn
      openai:
        model: omni-moderation-latest   # or text-moderation-latest
        threshold: 0.8                   # 0-1 confidence threshold

Lakera Guard

Specialized in prompt injection and jailbreak detection:

guardrails:
  moderated:
    moderation:
      enabled: true
      strategy: lakera
      action: block
      lakera:
        project_id: "your-project-id"   # Optional, uses default policy if empty
        breakdown: true                   # Return detailed detector breakdown

Choosing a Moderation Provider

  • OpenAI: Good general-purpose content moderation (hate, violence, self-harm). Free API.
  • Lakera: Best for prompt injection and jailbreak detection. Paid but specialized.
  • Prompt-based: Use your own LLM to evaluate content. Most flexible, highest latency.

Prompt-Based Moderation

Use a custom LLM prompt for domain-specific moderation:

guardrails:
  custom_moderation:
    moderation:
      enabled: true
      strategy: prompt
      prompt:
        llm: claude                      # LLM to use (falls back to agent's LLM)
        template: |
          Evaluate if this user message is safe and appropriate
          for a customer support context. Respond with JSON:
          {"safe": true/false, "reason": "..."}

          Message: {input}
        safe_field: "safe"               # JSON field to check (default: "safe")
      action: block
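The moderator's reply is parsed as JSON and the configured safe_field decides the verdict. A minimal sketch of that parsing step; treating unparseable replies as unsafe is an assumed fail-closed choice, not documented behavior:

```python
import json

def is_safe(llm_response: str, safe_field: str = "safe") -> bool:
    """Check the safe_field in the moderator LLM's JSON reply."""
    try:
        verdict = json.loads(llm_response)
    except json.JSONDecodeError:
        return False                    # fail closed on malformed output
    return bool(verdict.get(safe_field, False))

print(is_safe('{"safe": true, "reason": "benign question"}'))  # True
print(is_safe('{"safe": false, "reason": "abusive"}'))         # False
print(is_safe("I refuse to answer"))                           # False
```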

Complete Example

A production-ready guardrail configuration combining all layers:

guardrails:
  production:
    enabled: true

    input:
      chain_mode: fail_fast
      length:
        enabled: true
        max_length: 50000
        action: block
        severity: medium
      injection:
        enabled: true
        patterns:
          - "reveal.*system.*prompt"
          - "output.*instructions"
        action: block
        severity: critical
      sanitizer:
        enabled: true
        trim_whitespace: true
        strip_html: true
        max_length: 50000
      pattern:
        enabled: true
        block_patterns:
          - "(?i)DROP TABLE"
          - "(?i)DELETE FROM"

    output:
      chain_mode: collect_all
      pii:
        enabled: true
        detect_email: true
        detect_phone: true
        detect_ssn: true
        detect_credit_card: true
        redact_mode: mask
        action: modify
        severity: high
      content:
        enabled: true
        blocked_patterns:
          - "sk-[a-zA-Z0-9]{20,}"    # API keys
          - "(?i)internal use only"
        action: block
        severity: critical

    tool:
      authorization:
        enabled: true
        allowed_tools: ["search*", "read_*", "web_fetch"]
        blocked_tools: ["delete_*", "command_*"]

    moderation:
      enabled: true
      strategy: openai
      openai:
        model: omni-moderation-latest
        threshold: 0.8
      action: block

agents:
  support_bot:
    guardrails: production

Observability

Guardrail interventions are tracked in event metadata, making them visible in logs, traces, and Studio:

  • Intervention source: input_guardrail, output_guardrail, tool_guardrail, moderation
  • Details: Matched pattern, action taken, severity
  • Metrics: Prometheus counters for guardrail triggers (by type, action, severity)

This means you can build dashboards and alerts for guardrail activity. See the Observability Guide.

Next Steps