Guardrails

Guardrails provide safety controls for AI agents, protecting against prompt injection, PII exposure, and unauthorized tool usage.

Overview

Guardrails validate and transform data at three points:

  • Input: Before user messages reach the agent
  • Output: Before agent responses reach the user
  • Tool: Before tool calls are executed

Define guardrails once, assign to agents by reference:

guardrails:
  strict:
    input:
      injection:
        enabled: true
    output:
      pii:
        enabled: true

agents:
  assistant:
    guardrails: strict

Input Guardrails

Input guardrails validate and sanitize user messages before they reach the LLM.

Length Validation

Prevent excessively long or empty inputs:

guardrails:
  production:
    input:
      length:
        enabled: true
        min_length: 1        # Require at least 1 character
        max_length: 100000   # Cap at 100K characters
        action: block        # Block if violated
        severity: medium

Prompt Injection Detection

Detect and block attempts to manipulate agent behavior:

guardrails:
  production:
    input:
      injection:
        enabled: true
        case_sensitive: false
        action: block
        severity: high
        # Optional: add custom patterns
        patterns:
          - "ignore previous instructions"
          - "you are now"
          - "system prompt"

Default patterns detect common injection techniques:

  • Instruction overrides ("ignore all instructions")
  • Role manipulation ("pretend you are")
  • System prompt impersonation ("system:")
  • Jailbreak attempts ("developer mode")
  • Hidden instruction markers (base64, XML tags)
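
Custom patterns are optional; assuming they extend rather than replace the built-in list, a minimal injection guard can rely on the defaults alone:

guardrails:
  minimal:
    input:
      injection:
        enabled: true    # no custom patterns: built-in defaults still apply
        action: block
        severity: high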

Input Sanitization

Clean and normalize user input:

guardrails:
  production:
    input:
      sanitizer:
        enabled: true
        trim_whitespace: true      # Remove leading/trailing spaces
        normalize_unicode: true    # Normalize Unicode to NFC form
        strip_html: true           # Remove HTML tags
        max_length: 50000          # Truncate if exceeded (0=no limit)

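As a rough illustration of the settings above (the exact output depends on the implementation, e.g. how tags and combining characters are handled):

# Before: "   <b>Cafe\u0301</b> menu   "   (padded, HTML-tagged, decomposed Unicode)
# After:  "Café menu"                      (trimmed, tags stripped, NFC-composed)
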
Output Guardrails

Output guardrails filter and transform agent responses before they reach the user.

PII Detection and Redaction

Automatically detect and redact personally identifiable information:

guardrails:
  production:
    output:
      pii:
        enabled: true
        detect_email: true       # [email protected] → [EMAIL REDACTED]
        detect_phone: true       # 555-123-4567 → [PHONE REDACTED]
        detect_ssn: true         # 123-45-6789 → [SSN REDACTED]
        detect_credit_card: true # 4111...1111 → [CC REDACTED]
        redact_mode: mask        # mask, remove, or hash
        action: modify           # Modify output (vs block entirely)
        severity: high

Redaction modes:

  • mask: Replace with [TYPE REDACTED] (default)
  • remove: Remove PII entirely
  • hash: Replace with a SHA-256 hash (deterministic, so the same value can be correlated across outputs without being exposed)
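
As a concrete illustration, here is how each mode might render the same address (the hash value is made up and shortened; real output depends on the implementation):

# Original: "Contact [email protected] for help"
# mask:     "Contact [EMAIL REDACTED] for help"
# remove:   "Contact  for help"
# hash:     "Contact 2f05477f... for help"   # SHA-256 digest of the original value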

Content Filtering

Block outputs containing harmful or prohibited content:

guardrails:
  production:
    output:
      content:
        enabled: true
        blocked_keywords:
          - "password"
          - "secret_key"
          - "api_token"
        blocked_patterns:
          - "sk-[a-zA-Z0-9]{48}"    # OpenAI API keys
          - "ghp_[a-zA-Z0-9]{36}"   # GitHub tokens
        action: block
        severity: high

Tool Guardrails

Tool guardrails control which tools can be invoked and with what arguments.

Tool Authorization

Whitelist or blacklist specific tools:

guardrails:
  strict:
    tool:
      authorization:
        enabled: true
        # Whitelist: only these tools are allowed
        allowed_tools:
          - search
          - grep_search
        # Blacklist: these tools are never allowed  
        blocked_tools:
          - bash
          - web_request
        action: block
        severity: high

Use glob patterns to match groups of tools:

guardrails:
  limited:
    tool:
      authorization:
        enabled: true
        allowed_tools:
          - "read_*"    # Allow all read operations
          - "search"
        blocked_tools:
          - "write_*"   # Block all write operations
          - "delete_*"

Chain Modes

Multiple guardrails run in sequence. Choose how violations are handled:

Fail Fast (Default)

Stop at first violation:

guardrails:
  production:
    input:
      chain_mode: fail_fast  # Stop at first violation
      injection:
        enabled: true
      length:
        enabled: true

Collect All

Gather all violations before returning:

guardrails:
  audit:
    input:
      chain_mode: collect_all  # Check all, report all violations
      injection:
        enabled: true
      length:
        enabled: true
      pattern:
        enabled: true

Use collect_all for auditing, or whenever you need a complete report of every violation rather than just the first.

Assigning Guardrails to Agents

Define guardrails globally, reference by name in agents:

guardrails:
  # Strict guardrails for public-facing agents
  strict:
    enabled: true
    input:
      injection:
        enabled: true
      sanitizer:
        enabled: true
    output:
      pii:
        enabled: true
        redact_mode: mask
    tool:
      authorization:
        enabled: true
        blocked_tools:
          - bash

  # Relaxed for internal tools
  relaxed:
    enabled: true
    input:
      sanitizer:
        enabled: true
        trim_whitespace: true

agents:
  # Public agent uses strict guardrails
  customer_support:
    llm: default
    guardrails: strict
    visibility: public

  # Internal tool uses relaxed guardrails  
  admin_assistant:
    llm: default
    guardrails: relaxed
    visibility: internal

  # No guardrails for private helper
  data_processor:
    llm: default
    visibility: private
    # No guardrails reference = no guardrails

Actions and Severity

Each guardrail can specify an action and severity level:

Actions:

  • allow: Permit the request (used for "all clear" results)
  • block: Reject the request entirely
  • modify: Transform the content (e.g., redact PII)
  • warn: Log warning but allow request

Severity:

  • low: Minor issue
  • medium: Notable issue
  • high: Serious issue
  • critical: Severe issue requiring immediate attention

Combine actions and severities per guardrail:

guardrails:
  production:
    input:
      injection:
        enabled: true
        action: block      # Reject injection attempts
        severity: critical # Log as critical
      length:
        enabled: true
        action: warn       # Warn but allow
        severity: low

Example Configurations

Production Web Application

guardrails:
  production:
    enabled: true
    input:
      chain_mode: fail_fast
      length:
        enabled: true
        max_length: 50000
        action: block
      injection:
        enabled: true
        action: block
        severity: critical
      sanitizer:
        enabled: true
        trim_whitespace: true
        strip_html: true
    output:
      chain_mode: fail_fast
      pii:
        enabled: true
        detect_email: true
        detect_phone: true
        detect_ssn: true
        detect_credit_card: true
        redact_mode: mask
      content:
        enabled: true
        blocked_keywords:
          - "internal_api_key"
          - "database_password"
    tool:
      chain_mode: fail_fast
      authorization:
        enabled: true
        blocked_tools:
          - bash
          - web_request

agents:
  assistant:
    llm: default
    guardrails: production
    tools: [search, text_editor, grep_search]

Strict Privacy Example

guardrails:
  privacy_strict:
    enabled: true
    input:
      injection:
        enabled: true
        action: block
    output:
      pii:
        enabled: true
        detect_email: true
        detect_phone: true
        detect_ssn: true
        redact_mode: remove  # Remove PII entirely
        action: modify
        severity: critical

agents:
  assistant:
    llm: default
    guardrails: privacy_strict

Development Environment

guardrails:
  development:
    enabled: true
    input:
      length:
        enabled: true
        max_length: 100000
      sanitizer:
        enabled: true
        trim_whitespace: true
    # No PII redaction in dev
    # No tool restrictions in dev

agents:
  dev_assistant:
    llm: default
    guardrails: development
    tools: [search, text_editor, grep_search, bash]

Best Practices

Layer Your Defenses

Use multiple guardrails together:

guardrails:
  defense_in_depth:
    input:
      # Layer 1: Sanitize
      sanitizer:
        enabled: true
      # Layer 2: Validate length
      length:
        enabled: true
      # Layer 3: Detect injection
      injection:
        enabled: true
    output:
      # Layer 4: Redact PII
      pii:
        enabled: true
      # Layer 5: Filter content
      content:
        enabled: true

Match Guardrails to Agent Purpose

Scale protection to exposure. The standard and minimal profiles below are placeholders; define them under guardrails just as strict was defined above:
agents:
  # Customer-facing: maximum protection
  support_agent:
    guardrails: strict
    visibility: public

  # Internal: balanced protection
  research_agent:
    guardrails: standard
    visibility: internal

  # Backend: minimal overhead
  data_pipeline:
    guardrails: minimal
    visibility: private

Monitor and Iterate

Enable logging to track guardrail violations:

logger:
  level: info
  format: json  # Structured logs for analysis

Review logs to:

  • Identify false positives (adjust patterns)
  • Detect attack attempts
  • Fine-tune severity levels
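
With JSON logging, a guardrail violation might be recorded along these lines (the field names here are hypothetical, not the framework's actual schema):

# Hypothetical log entry; every field name below is illustrative
{
  "level": "warning",
  "event": "guardrail_violation",
  "guardrail": "injection",
  "action": "block",
  "severity": "high",
  "agent": "customer_support"
}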