
Memory Management

Memory in Hector controls how agents retain and use conversation context. Hector provides two complementary memory systems: working memory for the active conversation and long-term memory for persistent recall.

Memory Architecture

┌─────────────────────────────────────────────────┐
│                 AGENT                           │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌───────────────────────────────────────┐     │
│  │   WORKING MEMORY (Session)            │     │
│  │   - Current conversation              │     │
│  │   - Token-managed                     │     │
│  │   - Auto-summarization                │     │
│  └───────────────────────────────────────┘     │
│              ↓ Store      ↑ Recall              │
│  ┌───────────────────────────────────────┐     │
│  │   LONG-TERM MEMORY (Persistent)       │     │
│  │   - Vector database                   │     │
│  │   - Semantic search                   │     │
│  │   - Cross-session recall              │     │
│  └───────────────────────────────────────┘     │
│                                                 │
└─────────────────────────────────────────────────┘

Working Memory Strategies

Working memory manages the active conversation context. Choose a strategy based on your needs.

Summary Buffer

Automatically summarizes older messages as the conversation approaches its token budget.

Configuration:

agents:
  assistant:
    memory:
      working:
        strategy: "summary_buffer"  # Default
        budget: 2000                # Token budget
        threshold: 0.8              # Summarize at 80% capacity
        target: 0.6                 # Compress to 60% capacity

How it works:

  1. Messages accumulate until they reach 80% of the token budget (1600 of 2000 tokens in this example)
  2. Hector asks the LLM to summarize older messages
  3. Summary replaces old messages, freeing tokens
  4. Conversation continues with summary as context
  5. Summary optionally stored in long-term memory
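The steps above can be sketched as follows. This is a minimal illustration, not Hector's actual implementation; `count_tokens` and `summarize` are assumed helper functions supplied by the caller.

```python
def maybe_summarize(messages, count_tokens, summarize,
                    budget=2000, threshold=0.8, target=0.6):
    """Summarize the oldest messages once usage crosses the threshold.

    Hypothetical sketch: `count_tokens` measures one message,
    `summarize` turns a list of messages into one summary message.
    """
    total = sum(count_tokens(m) for m in messages)
    if total < budget * threshold:      # under 80% capacity: do nothing
        return messages

    # Keep the newest messages that fit within the target (60%) budget...
    cut, kept_tokens = len(messages), 0
    for i in range(len(messages) - 1, -1, -1):
        t = count_tokens(messages[i])
        if kept_tokens + t > budget * target:
            break
        kept_tokens += t
        cut = i

    # ...and replace everything older with a single summary message.
    return [summarize(messages[:cut])] + messages[cut:]
```

With `budget: 2000` and `threshold: 0.8`, summarization triggers at 1600 tokens; the resulting summary can then also be forwarded to long-term memory.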

Best for:

- Long conversations
- Preserving all information
- Natural conversation flow

Example:

agents:
  support:
    llm: "gpt-4o"
    memory:
      working:
        strategy: "summary_buffer"
        budget: 4000
        threshold: 0.8
        target: 0.6
      longterm:
        storage_scope: "session"

Buffer Window

Keeps only the most recent N messages.

Configuration:

agents:
  assistant:
    memory:
      working:
        strategy: "buffer_window"
        window_size: 10  # Keep last 10 messages

How it works:

  1. Maintains a sliding window of recent messages
  2. When window is full, oldest message is dropped
  3. New messages push out old ones
  4. Dropped messages optionally stored in long-term memory
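The sliding window above can be sketched in a few lines. This is a hypothetical helper, with an `on_drop` hook standing in for the optional hand-off to long-term memory:

```python
class BufferWindow:
    """Keep the last `window_size` messages; pass dropped ones to `on_drop`."""

    def __init__(self, window_size, on_drop=None):
        self.window_size = window_size
        self.on_drop = on_drop
        self.messages = []

    def add(self, message):
        self.messages.append(message)
        while len(self.messages) > self.window_size:
            dropped = self.messages.pop(0)   # oldest message falls out
            if self.on_drop:
                self.on_drop(dropped)        # e.g. store in long-term memory
```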

Best for:

- Short, focused conversations
- Low token usage
- Predictable memory size
- Fast performance

Example:

agents:
  chatbot:
    llm: "gpt-4o"
    memory:
      working:
        strategy: "buffer_window"
        window_size: 20

Long-Term Memory

Long-term memory provides persistent, semantically searchable storage across sessions.

Prerequisites

Long-term memory requires:

- Vector database (Qdrant)
- Embedder (Ollama with nomic-embed-text)

See RAG & Semantic Search for setup.

Configuration

# Vector database
databases:
  qdrant:
    type: "qdrant"
    host: "localhost"
    port: 6334

# Embedder
embedders:
  embedder:
    type: "ollama"
    host: "http://localhost:11434"
    model: "nomic-embed-text"

# Agent with long-term memory
agents:
  assistant:
    database: "qdrant"
    embedder: "embedder"
    memory:
      working:
        strategy: "summary_buffer"
        budget: 2000
      longterm:
        storage_scope: "session"     # all|session|conversational|summaries_only
        batch_size: 1                # Store after each message
        auto_recall: true            # Auto-inject memories
        recall_limit: 5              # Max memories per recall
        collection: "agent_memory"   # Qdrant collection name

Storage Scopes

Control what gets stored in long-term memory:

- all: store every message (user and assistant). Use case: complete history tracking.
- session: store summaries at the end of each session. Use case: balanced approach.
- conversational: store only user messages. Use case: user preference tracking.
- summaries_only: store only summarized content. Use case: minimal storage.

Examples:

# Store everything
longterm:
  storage_scope: "all"  # Every message vectorized

# Store only summaries (efficient)
longterm:
  storage_scope: "summaries_only"

# Store only user messages
longterm:
  storage_scope: "conversational"
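As an illustration, a storage-scope filter could behave roughly like this. This is a hypothetical sketch, not Hector's internals; the `session` scope is shown persisting only end-of-session summaries:

```python
def should_store(role, scope, is_summary=False):
    """Decide whether a message is persisted to long-term memory."""
    if scope == "all":
        return True                      # every message is vectorized
    if scope == "conversational":
        return role == "user"            # user messages only
    if scope in ("session", "summaries_only"):
        return is_summary                # only summarized content
    return False
```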

Auto-Recall

Automatically retrieve relevant memories:

longterm:
  auto_recall: true          # Automatically inject relevant memories
  recall_limit: 5            # Retrieve top 5 relevant memories
  similarity_threshold: 0.7  # Minimum similarity score

How it works:

  1. User sends a message
  2. Hector searches long-term memory for similar past context
  3. Top N relevant memories injected into working memory
  4. Agent has access to relevant past context automatically
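The recall step amounts to a similarity search. A minimal sketch using cosine similarity over pre-computed embeddings (assumed interface, not Hector's API):

```python
import math

def recall(query_vec, memories, recall_limit=5, similarity_threshold=0.7):
    """Return the texts of the most similar stored memories.

    `memories` is a list of (embedding, text) pairs.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))

    scored = sorted(((cosine(query_vec, vec), text) for vec, text in memories),
                    reverse=True)
    return [text for score, text in scored
            if score >= similarity_threshold][:recall_limit]
```

In practice the vector database (Qdrant) performs this search server-side; the sketch only shows the ranking and thresholding logic.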

Disable for manual control:

longterm:
  auto_recall: false  # Use tools to explicitly recall

Memory Decision Guide

Choose the right memory configuration:

Simple Tasks (No Persistence Needed)

agents:
  simple:
    memory:
      working:
        strategy: "buffer_window"
        window_size: 10
      # No long-term memory

Customer Support (Session Persistence)

agents:
  support:
    database: "qdrant"
    embedder: "embedder"
    memory:
      working:
        strategy: "summary_buffer"
        budget: 4000
      longterm:
        storage_scope: "session"
        auto_recall: true

Personal Assistant (Cross-Session Learning)

agents:
  personal:
    database: "qdrant"
    embedder: "embedder"
    memory:
      working:
        strategy: "summary_buffer"
        budget: 4000
      longterm:
        storage_scope: "all"
        auto_recall: true
        recall_limit: 10

High-Volume Processing (Minimal Memory)

agents:
  processor:
    memory:
      working:
        strategy: "buffer_window"
        window_size: 5
      # No long-term memory for performance

Memory Best Practices

Token Budget Sizing

Match budget to your LLM's context window:

# GPT-4o (128K context)
memory:
  working:
    budget: 8000  # Leave room for response

# GPT-3.5 Turbo (16K context)
memory:
  working:
    budget: 2000  # Smaller budget

Threshold Tuning

Adjust when summarization triggers:

# Aggressive summarization (more frequent, smaller summaries)
memory:
  working:
    threshold: 0.6  # Summarize at 60%
    target: 0.4     # Compress to 40%

# Conservative summarization (less frequent, larger summaries)
memory:
  working:
    threshold: 0.9  # Summarize at 90%
    target: 0.7     # Compress to 70%

Batch Size for Long-Term Storage

Control storage frequency:

# Immediate storage (every message)
longterm:
  batch_size: 1  # Store immediately

# Batched storage (every 10 messages)
longterm:
  batch_size: 10  # Better performance

# End of session only
longterm:
  batch_size: 0  # Store only when session ends
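The three batch_size behaviors can be sketched as follows. This is a hypothetical buffer, with `flush` standing in for the vector-store write:

```python
class BatchStore:
    """Buffer messages; flush per `batch_size` (0 = only at session end)."""

    def __init__(self, batch_size, flush):
        self.batch_size = batch_size
        self.flush = flush
        self.pending = []

    def add(self, message):
        self.pending.append(message)
        if self.batch_size > 0 and len(self.pending) >= self.batch_size:
            self.flush(self.pending)   # write a full batch to the store
            self.pending = []

    def end_session(self):
        if self.pending:               # whatever remains at session end
            self.flush(self.pending)
            self.pending = []
```

Larger batches trade recall freshness for fewer embedding and write operations.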

Collection Naming

Organize memories by purpose:

agents:
  support:
    memory:
      longterm:
        collection: "support_tickets"

  personal:
    memory:
      longterm:
        collection: "user_preferences"

Advanced Patterns

Multi-Tier Memory

Combine strategies for optimal performance:

agents:
  advanced:
    database: "qdrant"
    embedder: "embedder"
    memory:
      working:
        strategy: "summary_buffer"
        budget: 4000
        threshold: 0.8
      longterm:
        storage_scope: "summaries_only"  # Only store summaries
        auto_recall: true
        recall_limit: 3

Session-Scoped Memory

Different memory per session/user:

agents:
  multi_user:
    database: "qdrant"
    embedder: "embedder"
    memory:
      longterm:
        storage_scope: "session"
        collection: "user_sessions"
        # Session ID passed in API calls

Selective Memory

Store only important information:

agents:
  selective:
    memory:
      working:
        strategy: "summary_buffer"
      longterm:
        storage_scope: "summaries_only"  # Only summaries, not raw messages
        auto_recall: false               # Manual recall only

Memory in Action

Example 1: Customer Support

agents:
  support:
    llm: "gpt-4o"
    database: "qdrant"
    embedder: "embedder"

    memory:
      working:
        strategy: "summary_buffer"
        budget: 4000
        threshold: 0.8

      longterm:
        storage_scope: "all"
        auto_recall: true
        recall_limit: 5
        collection: "support_history"

    prompt:
      system_role: |
        You are a customer support agent. Use past
        interactions to provide personalized support.

Result: Agent remembers past issues and provides context-aware support.

Example 2: Research Assistant

agents:
  researcher:
    llm: "claude"
    database: "qdrant"
    embedder: "embedder"

    memory:
      working:
        strategy: "buffer_window"
        window_size: 15

      longterm:
        storage_scope: "summaries_only"
        auto_recall: true
        recall_limit: 10
        collection: "research_notes"

    prompt:
      system_role: |
        You are a research assistant. Build on previous
        research findings and avoid redundant work.

Result: Agent builds on past research, avoiding duplication.


Monitoring Memory Usage

Debug Memory

Enable detailed logging:

agents:
  debug:
    reasoning:
      show_debug_info: true
    memory:
      working:
        strategy: "summary_buffer"
        budget: 2000

Look for log entries like:

Working memory: 1500/2000 tokens (75%)
Triggering summarization at threshold 80%
Summarized 5 messages to 200 tokens
Long-term memory: Stored 1 summary

Test Different Strategies

Create multiple agents to compare:

agents:
  buffer_test:
    memory:
      working:
        strategy: "buffer_window"
        window_size: 10

  summary_test:
    memory:
      working:
        strategy: "summary_buffer"
        budget: 2000

Next Steps