
Checkpoint Recovery

Hector's checkpoint recovery system enables tasks to survive server crashes, handle rate limits gracefully, and support long-running workflows by periodically saving execution state and automatically resuming from checkpoints.

Overview

What is Checkpoint Recovery?

Checkpoint recovery allows agent execution to be paused and resumed from any point, not just when waiting for user input. This provides:

  • Crash Recovery: Tasks survive server crashes and restarts
  • Rate Limit Resilience: Pause during backoff, resume when limits reset
  • Long-Running Tasks: Checkpoint periodically for very long workflows
  • Resource Management: Free goroutines during waits, resume later
  • Debugging: Inspect task state at any checkpoint

Relationship to HITL

Checkpoint recovery extends the async HITL foundation:

  • Shared Foundation: Both use ExecutionState, session metadata storage, and the same recovery mechanisms
  • HITL is Event-Driven: Async HITL checkpoints on INPUT_REQUIRED transitions (tool approval)
  • Checkpoint is Generic: Adds interval-based checkpointing and recovery from any execution point
  • Unified Storage: Both store checkpoints in session metadata (sessions.metadata JSON field)

Think of it this way:

  • Async HITL = Checkpoint on user input events
  • Checkpoint Recovery = Checkpoint on events + intervals + automatic recovery


How It Works

Checkpoint Strategies

Hector supports three checkpoint strategies:

1. Event-Driven (Default)

Checkpoints are created on specific events:

  • Tool approval requests (HITL pauses)
  • Errors
  • Rate limit backoffs

Characteristics:

  • Minimal overhead (only checkpoints when needed)
  • Precise recovery points
  • Uses the async HITL foundation

Configuration:

task:
  checkpoint:
    enabled: true
    strategy: "event"  # Default

2. Interval-Based

Checkpoints are created periodically during execution:

  • Every N iterations
  • After tool calls (optional)
  • Before LLM calls (optional)

Characteristics:

  • Background checkpointing (task remains in WORKING state)
  • Configurable frequency
  • Trade-off: overhead vs recovery granularity

Configuration:

task:
  checkpoint:
    enabled: true
    strategy: "interval"
    interval:
      every_n_iterations: 5  # Checkpoint every 5 iterations
      after_tool_calls: true  # Also checkpoint after tool calls

3. Hybrid

Combines event-driven and interval-based checkpointing:

  • Event-driven checkpoints on HITL pauses
  • Interval-based checkpoints during normal execution
  • Best recovery coverage

Configuration:

task:
  checkpoint:
    enabled: true
    strategy: "hybrid"
    interval:
      every_n_iterations: 5
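
To make the three strategies concrete, here is a minimal decision-logic sketch in Go. All names here (CheckpointConfig, Event, shouldCheckpoint) are illustrative stand-ins, not Hector's actual internals:

// CheckpointConfig mirrors the task.checkpoint YAML fields (illustrative).
type CheckpointConfig struct {
	Enabled          bool
	Strategy         string // "event", "interval", or "hybrid"
	EveryNIterations int
	AfterToolCalls   bool
}

// Event marks a checkpoint-worthy occurrence (HITL pause, error, rate limit).
type Event int

const (
	EventNone Event = iota
	EventHITLPause
	EventError
	EventRateLimit
)

// shouldCheckpoint decides whether to checkpoint at the end of an iteration.
func shouldCheckpoint(cfg CheckpointConfig, ev Event, iteration int, toolCalled bool) bool {
	if !cfg.Enabled {
		return false
	}
	event := ev != EventNone
	interval := (cfg.EveryNIterations > 0 && iteration%cfg.EveryNIterations == 0) ||
		(cfg.AfterToolCalls && toolCalled)
	switch cfg.Strategy {
	case "event":
		return event
	case "interval":
		return interval
	case "hybrid":
		return event || interval
	default:
		return false
	}
}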

Checkpoint Storage

Checkpoints are stored in session metadata (same foundation as async HITL):

{
  "sessions": {
    "session-123": {
      "metadata": {
        "pending_executions": {
          "task-abc": {
            "task_id": "task-abc",
            "context_id": "session-123",
            "query": "Analyze codebase",
            "reasoning_state": {...},
            "phase": "iteration_end",
            "checkpoint_type": "interval",
            "checkpoint_time": "2025-01-15T10:30:00Z"
          }
        }
      }
    }
  }
}

Why Session Metadata?

  • ✅ Already persisted (survives restarts)
  • ✅ Natural relationship (tasks → sessions via context_id)
  • ✅ No schema changes (uses existing JSON field)
  • ✅ Multi-agent isolation (by agent_id + session_id)
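
Here is a minimal sketch of writing a record of this shape, assuming a hypothetical SessionStore interface over the sessions.metadata JSON field; Hector's real storage layer will differ:

import (
	"encoding/json"
	"time"
)

// PendingExecution matches the checkpoint record shown in the JSON above.
type PendingExecution struct {
	TaskID         string          `json:"task_id"`
	ContextID      string          `json:"context_id"`
	Query          string          `json:"query"`
	ReasoningState json.RawMessage `json:"reasoning_state"`
	Phase          string          `json:"phase"`
	CheckpointType string          `json:"checkpoint_type"`
	CheckpointTime time.Time       `json:"checkpoint_time"`
}

// SessionStore is a hypothetical interface over session metadata (assumption).
type SessionStore interface {
	GetMetadata(sessionID string) (map[string]json.RawMessage, error)
	SetMetadata(sessionID string, meta map[string]json.RawMessage) error
}

// saveCheckpoint upserts a task's checkpoint under pending_executions.
func saveCheckpoint(store SessionStore, exec PendingExecution) error {
	meta, err := store.GetMetadata(exec.ContextID)
	if err != nil {
		return err
	}
	if meta == nil {
		meta = map[string]json.RawMessage{}
	}
	pending := map[string]PendingExecution{}
	if raw, ok := meta["pending_executions"]; ok {
		if err := json.Unmarshal(raw, &pending); err != nil {
			return err
		}
	}
	pending[exec.TaskID] = exec
	raw, err := json.Marshal(pending)
	if err != nil {
		return err
	}
	meta["pending_executions"] = raw
	return store.SetMetadata(exec.ContextID, meta)
}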

Recovery Mechanism

On server startup, Hector automatically:

  1. Finds pending tasks: Queries tasks in WORKING or INPUT_REQUIRED state
  2. Loads checkpoints: Retrieves execution state from session metadata
  3. Validates checkpoints: Checks expiration and validity
  4. Resumes execution: Restores reasoning state and continues from checkpoint

Recovery Flow:

Server Startup
  ↓
Find tasks in WORKING or INPUT_REQUIRED state
  ↓
Load checkpoint from session metadata
  ↓
Validate checkpoint (not expired, valid state)
  ↓
Resume execution from checkpoint
  ↓
Task continues from where it left off
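
A rough sketch of this loop, reusing the PendingExecution type from the storage sketch above; the Task type, TaskStore interface, and callbacks are hypothetical, not Hector's recovery code:

import "time"

// Task is a minimal stand-in for a task record (illustrative).
type Task struct {
	ID        string
	ContextID string
}

// TaskStore is a hypothetical query interface over the task backend.
type TaskStore interface {
	FindByState(states ...string) []Task
}

// recoverPendingTasks sketches the startup recovery flow described above.
func recoverPendingTasks(
	tasks TaskStore,
	load func(contextID, taskID string) (PendingExecution, bool),
	resume func(Task, PendingExecution),
	resumeTimeout time.Duration,
) {
	for _, task := range tasks.FindByState("WORKING", "INPUT_REQUIRED") {
		cp, ok := load(task.ContextID, task.ID)
		if !ok {
			continue // no checkpoint stored; nothing to resume
		}
		if time.Since(cp.CheckpointTime) > resumeTimeout {
			continue // checkpoint expired (see recovery.resume_timeout)
		}
		go resume(task, cp) // restore reasoning state and continue
	}
}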


Configuration

Basic Setup

Enable checkpoint recovery:

agents:
  assistant:
    # Session store (required for checkpoint storage)
    session_store: "sqlite"

    task:
      backend: "sql"
      checkpoint:
        enabled: true
        strategy: "hybrid"

        interval:
          every_n_iterations: 5

        recovery:
          auto_resume: true
          resume_timeout: 3600  # 1 hour

Configuration Options

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| enabled | boolean | false | Enable checkpoint recovery |
| strategy | string | "event" | "event", "interval", or "hybrid" |
| interval.every_n_iterations | integer | 0 | Checkpoint every N iterations (0 = disabled) |
| interval.after_tool_calls | boolean | false | Always checkpoint after tool calls |
| interval.before_llm_calls | boolean | false | Checkpoint before LLM calls |
| recovery.auto_resume | boolean | false | Auto-resume tasks on startup |
| recovery.auto_resume_hitl | boolean | false | Auto-resume INPUT_REQUIRED tasks |
| recovery.resume_timeout | integer | 3600 | Max checkpoint age to resume (seconds) |

Programmatic API

Configure checkpoint recovery programmatically:

taskService := hector.NewTaskService().
    Backend("sql").
    Checkpoint().
        Enabled(true).
        Strategy("hybrid").
        Interval().
            EveryNIterations(5).
            Build().
        Recovery().
            AutoResume(true).
            ResumeTimeout(3600).
            Build().
        Build().
    Build()

agent, err := hector.NewAgent("assistant").
    WithLLMProvider(llm).
    WithTask(taskService).
    WithSession(sessionService).
    Build()

Use Cases

1. Crash Recovery

Scenario: Server crashes mid-execution

Without Checkpoint Recovery:

  • Task lost, must restart from beginning
  • User frustration, wasted LLM tokens

With Checkpoint Recovery:

  • Task automatically resumes from last checkpoint
  • Seamless recovery, no user intervention

Configuration:

task:
  checkpoint:
    enabled: true
    strategy: "interval"
    interval:
      every_n_iterations: 3  # Frequent checkpoints
    recovery:
      auto_resume: true

2. Rate Limit Resilience

Scenario: LLM API rate limit hit during long task

Without Checkpoint Recovery:

  • Task fails, must restart
  • Wasted progress

With Checkpoint Recovery:

  • Task checkpoints before the rate limit error (see the sketch after the configuration below)
  • Resumes automatically when limits reset
  • Continues from checkpoint

Configuration:

task:
  checkpoint:
    enabled: true
    strategy: "hybrid"
    interval:
      before_llm_calls: true  # Checkpoint before expensive calls
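
A minimal sketch of the checkpoint-before-LLM-call pattern this enables; both function parameters are hypothetical hooks, not Hector APIs:

// callWithCheckpoint persists state before an expensive LLM call so that a
// rate-limit failure loses no progress: the task can pause and later resume
// from the checkpoint written just before the call.
func callWithCheckpoint(
	checkpoint func() error,        // e.g. saveCheckpoint from the sketch above
	callLLM func() (string, error), // the expensive call that may be rate-limited
) (string, error) {
	if err := checkpoint(); err != nil {
		return "", err
	}
	return callLLM() // on error, recovery resumes from the checkpoint
}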

3. Long-Running Tasks

Scenario: Task takes hours to complete (e.g., codebase analysis)

Without Checkpoint Recovery:

  • Risk of losing progress on restart
  • No way to pause/resume

With Checkpoint Recovery:

  • Periodic checkpoints ensure progress is saved
  • Can pause for maintenance, resume later
  • Survives server restarts

Configuration:

task:
  checkpoint:
    enabled: true
    strategy: "interval"
    interval:
      every_n_iterations: 10  # Checkpoint every 10 iterations
    recovery:
      auto_resume: true
      resume_timeout: 86400  # 24 hours

4. Resource Management

Scenario: Many concurrent tasks, need to free resources

Without Checkpoint Recovery:

  • Goroutines blocked waiting
  • Resource exhaustion

With Checkpoint Recovery:

  • Tasks checkpoint and exit their goroutines (sketched below)
  • Resume later when resources are available
  • Better resource utilization
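
A sketch of this checkpoint-and-exit pattern, reusing saveCheckpoint from the storage sketch above; scheduleResume (and its delay) is a hypothetical scheduler hook:

import "time"

// parkTask frees the current goroutine instead of blocking through a long
// wait: persist state durably, schedule a resume, and return.
func parkTask(
	store SessionStore,
	exec PendingExecution,
	scheduleResume func(taskID string, after time.Duration),
) error {
	if err := saveCheckpoint(store, exec); err != nil {
		return err // never drop a task without a durable checkpoint
	}
	scheduleResume(exec.TaskID, time.Minute) // resume when resources allow
	return nil // goroutine exits here; no blocked wait
}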


A2A Protocol Compliance

Hector's checkpoint recovery is 100% A2A Protocol compliant:

State Management ✅

  • Background Checkpoints: Remain in WORKING state (invisible to client)
  • Explicit Pauses: Use INPUT_REQUIRED state (standard A2A)
  • Valid Transitions: WORKING → Checkpoint → Resume as WORKING
  • Terminal States: Never checkpoints from terminal states (doing so would violate A2A)

Recovery Process ✅

  • Standard Resume: Uses A2A message/send with taskId (see the example after this list)
  • Multi-Turn: Same taskId for resume (A2A Section 6.3.1)
  • Context Alignment: Uses contextId for session continuity
  • State Persistence: Task state persisted in task store
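
As an illustration, resuming a paused task over A2A could look like the following message/send request; the field set follows the A2A Message object, and the IDs are placeholders matching the storage example above:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "message/send",
  "params": {
    "message": {
      "role": "user",
      "parts": [{"kind": "text", "text": "approve"}],
      "messageId": "msg-001",
      "taskId": "task-abc",
      "contextId": "session-123"
    }
  }
}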

Checkpoint Scenarios

| Scenario | Task State | Checkpoint Type | A2A Compliance |
|----------|------------|-----------------|----------------|
| Tool approval | INPUT_REQUIRED | Event-driven | ✅ Standard HITL |
| Interval checkpoint | WORKING | Interval-based | ✅ Background only |
| Crash recovery | WORKING (resume) | Interval-based | ✅ Invisible to client |
| Rate limit pause | INPUT_REQUIRED | Event-driven | ✅ Explicit pause |

Key Principle:

  • Client-visible pauses → use INPUT_REQUIRED state
  • Background checkpoints → remain in WORKING state
  • Never checkpoint from terminal states


Examples

Example 1: Basic Crash Recovery

Configuration:

agents:
  analyst:
    session_store: "sqlite"
    task:
      backend: "sql"
      checkpoint:
        enabled: true
        strategy: "interval"
        interval:
          every_n_iterations: 5
        recovery:
          auto_resume: true

Behavior:

  • Checkpoints created every 5 iterations
  • On server restart, tasks automatically resume
  • No user intervention required

Example 2: HITL + Checkpoint

Configuration:

agents:
  assistant:
    session_store: "sqlite"
    task:
      backend: "sql"
      hitl:
        mode: "async"
      checkpoint:
        enabled: true
        strategy: "hybrid"
        interval:
          every_n_iterations: 10
        recovery:
          auto_resume: true
          auto_resume_hitl: false  # Wait for user input

Behavior:

  • HITL pauses create event-driven checkpoints
  • Interval checkpoints every 10 iterations
  • On restart: WORKING tasks auto-resume; INPUT_REQUIRED tasks wait for user input

Example 3: Production Setup

Configuration:

session_stores:
  main-db:
    backend: sql
    sql:
      driver: postgres
      host: db.example.com
      database: hector_sessions

agents:
  production-agent:
    session_store: "main-db"
    task:
      backend: "sql"
      sql:
        driver: postgres
        host: db.example.com
        database: hector_tasks
      checkpoint:
        enabled: true
        strategy: "hybrid"
        interval:
          every_n_iterations: 5
          after_tool_calls: true
        recovery:
          auto_resume: true
          resume_timeout: 7200  # 2 hours


Performance Considerations

Checkpoint Overhead

Costs:

  • State serialization: ~1-5ms per checkpoint
  • Storage write: ~5-20ms (depends on backend)
  • Total: ~10-25ms per checkpoint
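
To put these numbers in perspective: at roughly 15ms per checkpoint with every_n_iterations: 5, a 100-iteration task writes about 20 checkpoints, adding roughly 300ms in total, negligible next to typical LLM call latency.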

Mitigation:

  • Use interval-based checkpointing sparingly (every 5-10 iterations)
  • Enable after_tool_calls only if needed
  • Use before_llm_calls only for expensive operations

Storage Considerations

Session Metadata Size:

  • Typical checkpoint: 10-50KB
  • Multiple checkpoints per session: ~100-500KB
  • Monitor session metadata size

Best Practices:

  • Clean up old checkpoints after task completion
  • Monitor database size
  • Use PostgreSQL for production (better JSON handling)


Troubleshooting

Checkpoints Not Created

Symptom: No checkpoints found after server restart

Check:

  1. Is checkpoint.enabled: true?
  2. Is session_store configured?
  3. Are tasks reaching checkpoint points?

Solution:

# Verify configuration
task:
  checkpoint:
    enabled: true  # Must be true
    strategy: "interval"
    interval:
      every_n_iterations: 5  # Must be > 0

session_store: "sqlite"  # Required

Tasks Not Resuming

Symptom: Tasks remain in WORKING state after restart

Check:

  1. Is recovery.auto_resume: true?
  2. Are checkpoints expired? (check resume_timeout)
  3. Are checkpoints valid? (check logs)

Solution:

recovery:
  auto_resume: true  # Must be true
  resume_timeout: 3600  # Increase if checkpoints too old

Checkpoint Expired

Symptom: Checkpoints found but not resumed (expired)

Solution:

  • Increase resume_timeout if tasks are legitimately paused for longer
  • Or manually resume via the API if needed


Best Practices

✅ Do

  • Enable checkpoint recovery for production deployments
  • Use hybrid strategy for best coverage
  • Set appropriate every_n_iterations (5-10 recommended)
  • Configure resume_timeout based on your use case
  • Monitor checkpoint creation and recovery in logs

❌ Don't

  • Don't checkpoint too frequently (overhead)
  • Don't use interval-based without session_store
  • Don't set auto_resume_hitl: true unless you want automatic approval
  • Don't ignore checkpoint errors (they indicate issues)


Next Steps