LLM Providers

Hector supports multiple LLM providers out of the box. Each agent references an LLM configuration that defines which model to use and how to connect to it.

Supported Providers

| Provider | Models | Streaming | Structured Output | Multi-Modality |
|---|---|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, GPT-4 Turbo, etc. | ✅ | ✅ | Images |
| Anthropic | Claude Sonnet 4, Claude Opus 4, etc. | ✅ | ✅ | Images |
| Google Gemini | Gemini 2.0 Flash, Gemini Pro, etc. | ✅ | ✅ | Images, Video, Audio |
| Ollama | qwen3 (local models) | ✅ | ✅ | Images |
| Custom (Plugin) | Any model via gRPC plugin | Depends on plugin | Depends on plugin | Depends on plugin |

Configuration Pattern

LLM providers are configured separately from agents:

# Define LLM providers
llms:
  gpt-4o:
    type: "openai"
    model: "gpt-4o-mini"
    api_key: "${OPENAI_API_KEY}"

  claude:
    type: "anthropic"
    model: "claude-sonnet-4-20250514"
    api_key: "${ANTHROPIC_API_KEY}"

# Agents reference them
agents:
  assistant:
    llm: "gpt-4o"  # References the "gpt-4o" config

  researcher:
    llm: "claude"  # References the "claude" config

This allows multiple agents to share the same LLM configuration.
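
Because providers are defined once and referenced by name, several agents can point at the same entry. A minimal sketch reusing the gpt-4o definition above for two agents (the agent names here are only illustrative):

llms:
  gpt-4o:
    type: "openai"
    model: "gpt-4o-mini"
    api_key: "${OPENAI_API_KEY}"

agents:
  summarizer:
    llm: "gpt-4o"  # Same provider entry...
  coder:
    llm: "gpt-4o"  # ...shared by both agents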


OpenAI

Configuration

llms:
  my-openai:
    type: "openai"
    model: "gpt-4o"                    # Default: gpt-4o
    api_key: "${OPENAI_API_KEY}"
    host: "https://api.openai.com/v1" # Default host
    temperature: 0.7                   # Default: 0.7
    max_tokens: 8000                   # Default: 8000
    timeout: 60                        # Seconds, default: 60
    max_retries: 5                     # Rate limit retries, default: 5
    retry_delay: 2                     # Seconds, exponential backoff, default: 2

| Model | Best For | Context Window | Vision |
|---|---|---|---|
| gpt-4o | General purpose, balanced | 128K tokens | ✅ |
| gpt-4o-mini | Fast, cost-effective | 128K tokens | ✅ |
| gpt-4-turbo | Complex reasoning | 128K tokens | ✅ |
| gpt-3.5-turbo | Simple tasks, fast | 16K tokens | ❌ |

Environment Variables

export OPENAI_API_KEY="sk-..."

Example

llms:
  gpt-4o:
    type: "openai"
    model: "gpt-4o-mini"
    api_key: "${OPENAI_API_KEY}"
    temperature: 0.7

agents:
  coder:
    name: "Coding Assistant"
    llm: "gpt-4o"
    prompt:
      system_role: "You are an expert programmer."

Anthropic (Claude)

Configuration

llms:
  my-anthropic:
    type: "anthropic"
    model: "claude-sonnet-4-20250514"  # Default model
    api_key: "${ANTHROPIC_API_KEY}"
    host: "https://api.anthropic.com"  # Default host
    temperature: 0.7                   # Default: 0.7
    max_tokens: 8000                   # Default: 8000
    timeout: 120                       # Seconds, default: 120
    max_retries: 5                     # Default: 5
    retry_delay: 2                     # Seconds, default: 2

    # Extended thinking configuration (optional)
    thinking:
      enabled: true
      budget_tokens: 10000  # Must be < max_tokens (defaults to 1024 if not specified)

| Model | Best For | Context Window | Vision |
|---|---|---|---|
| claude-sonnet-4-20250514 | Balanced speed & capability | 200K tokens | ✅ |
| claude-opus-4-20250514 | Maximum capability | 200K tokens | ✅ |
| claude-3-5-sonnet-20241022 | Previous generation | 200K tokens | ✅ |

Extended Thinking Support

Anthropic Claude models (Sonnet 4.5, Opus 4.5, etc.) support extended thinking for enhanced reasoning. Enable it via:

CLI:

# --thinking auto-enables --show-thinking
hector call "Solve this problem" --thinking --thinking-budget 10000

# To enable thinking but hide thinking blocks:
hector call "Solve this problem" --thinking --no-show-thinking

Config File:

llms:
  claude:
    type: "anthropic"
    model: "claude-sonnet-4-5"
    api_key: "${ANTHROPIC_API_KEY}"
    max_tokens: 16000
    thinking:
      enabled: true
      budget_tokens: 10000  # Must be < max_tokens

agents:
  assistant:
    llm: "claude"
    enable_thinking: true  # Enable thinking at API level
    reasoning:
      enable_thinking_display: true  # Show thinking blocks

Note: When thinking is enabled, temperature is automatically set to 1.0 (required by the Anthropic API).

Environment Variables

export ANTHROPIC_API_KEY="sk-ant-..."

Example

llms:
  claude:
    type: "anthropic"
    model: "claude-sonnet-4-20250514"
    api_key: "${ANTHROPIC_API_KEY}"
    temperature: 0.7

agents:
  analyst:
    name: "Research Analyst"
    llm: "claude"
    prompt:
      system_role: "You are a thorough research analyst."

Google Gemini

Configuration

llms:
  my-gemini:
    type: "gemini"
    model: "gemini-2.0-flash-exp"         # Default model
    api_key: "${GEMINI_API_KEY}"
    host: "https://generativelanguage.googleapis.com"  # Default
    temperature: 0.7                      # Default: 0.7
    max_tokens: 4096                      # Default: 4096
    timeout: 60                           # Seconds, default: 60

| Model | Best For | Context Window | Multi-Modality |
|---|---|---|---|
| gemini-2.0-flash-exp | Fast, efficient (experimental) | 1M tokens | ✅ Images, Video, Audio |
| gemini-pro | General purpose | 1M tokens | ✅ Images, Video, Audio |
| gemini-pro-vision | Image analysis | 16K tokens | ✅ Images |

Environment Variables

export GEMINI_API_KEY="AI..."

Example

llms:
  gemini:
    type: "gemini"
    model: "gemini-2.0-flash-exp"
    api_key: "${GEMINI_API_KEY}"
    temperature: 0.7

agents:
  assistant:
    name: "General Assistant"
    llm: "gemini"
    prompt:
      system_role: "You are a helpful assistant."

Ollama (Local Models)

Ollama allows you to run LLMs locally on your machine. Hector currently supports the qwen3 model from Ollama.

Configuration

llms:
  local-llm:
    type: "ollama"
    model: "qwen3"                    # Currently supported: qwen3
    host: "http://localhost:11434"   # Default: http://localhost:11434
    temperature: 0.7                 # Default: 0.7
    max_tokens: 8000                 # Default: 8000
    timeout: 600                      # Seconds, default: 600 (10 minutes)
    # Note: Ollama doesn't require an API key for local deployments

Supported Models

| Model | Best For | Notes |
|---|---|---|
| qwen3 | General purpose, reasoning | Supports thinking/reasoning traces |

⚠️ Note: Currently, only the qwen3 model is fully supported and tested. Other Ollama models may work but are not officially supported.

Prerequisites

  1. Install Ollama: https://ollama.ai
  2. Pull the model:
     ollama pull qwen3
  3. Start the Ollama service (it usually runs automatically); the quick check below confirms everything is in place.
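
To verify the setup before pointing Hector at it, you can use standard Ollama commands (these are plain Ollama checks, independent of Hector):

# Confirm the model was pulled
ollama list

# Confirm the Ollama API is reachable on the default port
curl http://localhost:11434/api/tags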

Example

llms:
  local-llm:
    type: "ollama"
    model: "qwen3"
    host: "http://localhost:11434"
    temperature: 0.7
    timeout: 600  # Increased timeout for larger models

agents:
  assistant:
    name: "Local Assistant"
    llm: "local-llm"
    prompt:
      system_role: "You are a helpful assistant running locally."

Thinking/Reasoning Support

The qwen3 model supports thinking/reasoning traces. Enable it with the --thinking flag:

# Enable thinking (auto-shows thinking blocks)
hector call "Solve this problem" --thinking

# Enable thinking but hide blocks
hector call "Solve this problem" --thinking --no-show-thinking

Environment Variables

Ollama doesn't require API keys, but you can configure the host:

export OLLAMA_HOST="http://localhost:11434"  # Optional, defaults to localhost:11434

Custom LLM Providers (Plugins)

Extend Hector with custom LLM providers via gRPC plugins.

Configuration

plugins:
  llms:
    - name: "my-custom-llm"
      protocol: "grpc"
      path: "/path/to/llm-plugin"

llms:
  custom:
    type: "plugin:my-custom-llm"
    model: "my-model"
    # Provider-specific configuration

See Plugin System for implementation details.


Common Configuration Options

Temperature

Controls randomness in responses (0.0 to 2.0):

llms:
  creative:
    type: "openai"
    temperature: 1.2  # More creative

  precise:
    type: "openai"
    temperature: 0.3  # More deterministic

  • 0.0-0.3: Focused, deterministic (code generation, analysis)
  • 0.7-0.9: Balanced (default for most tasks)
  • 1.0-2.0: Creative (writing, brainstorming)
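
One way to apply these ranges is to keep one provider entry per temperature tier and route each agent to the tier that fits its task. A minimal sketch using the creative and precise entries defined above (agent names are illustrative; model and API key fields are omitted for brevity, as in the snippet above):

agents:
  code_reviewer:
    llm: "precise"   # 0.3: deterministic output for code review
  brainstormer:
    llm: "creative"  # 1.2: more varied output for ideation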

Max Tokens

Maximum tokens in the response:

llms:
  brief:
    type: "openai"
    max_tokens: 500   # Short responses

  detailed:
    type: "openai"
    max_tokens: 4000  # Long responses

Timeouts and Retries

Configure resilience:

llms:
  resilient:
    type: "openai"
    timeout: 120        # Wait up to 2 minutes
    max_retries: 5      # Retry 5 times on rate limits
    retry_delay: 2      # Start with 2s, exponential backoff
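
For example, with retry_delay: 2 and max_retries: 5, a rate-limited request would be retried after roughly 2s, 4s, 8s, 16s, and 32s (assuming the delay doubles on each attempt) before the call is reported as failed.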

Custom API Endpoints

Use OpenAI-compatible APIs (Azure OpenAI, local inference servers, etc.):

llms:
  azure-openai:
    type: "openai"
    model: "gpt-4"
    api_key: "${AZURE_API_KEY}"
    host: "https://your-resource.openai.azure.com/openai/deployments/your-deployment"

Zero-Config Defaults

When running without configuration, Hector uses these defaults:

| Provider | Model | Trigger |
|---|---|---|
| OpenAI | gpt-4o-mini | OPENAI_API_KEY set |
| Anthropic | claude-sonnet-4-20250514 | ANTHROPIC_API_KEY set |
| Gemini | gemini-2.0-flash-exp | GEMINI_API_KEY set |
| Ollama | qwen3 | Ollama running locally (no API key required) |

Priority order: OpenAI → Anthropic → Gemini → Ollama (first available key/service wins).

# Zero-config with OpenAI
export OPENAI_API_KEY="sk-..."
hector call "Hello"  # Uses gpt-4o-mini automatically

Structured Output

All providers support structured output (JSON, XML, Enum):

llms:
  structured:
    type: "openai"
    model: "gpt-4o"
    structured_output:
      format: "json"
      schema:
        type: "object"
        properties:
          sentiment:
            type: "string"
            enum: ["positive", "negative", "neutral"]
          confidence:
            type: "number"

See Structured Output for details.


Choosing a Provider

| Scenario | Recommended Provider | Reason |
|---|---|---|
| General purpose | OpenAI GPT-4o | Best balance of speed, cost, and capability |
| Complex reasoning | Anthropic Claude Opus 4 | Strongest reasoning capabilities |
| Cost-sensitive | OpenAI GPT-4o-mini | Excellent price/performance |
| Fast responses | Gemini 2.0 Flash | Very fast inference |
| Large context | Gemini Pro | 1M token context window |
| Code generation | OpenAI GPT-4o | Strong code understanding |
| Creative writing | Anthropic Claude Sonnet 4 | Natural, engaging writing |
| Production reliability | OpenAI GPT-4o | Mature API, good availability |

Best Practices

Use Environment Variables for API Keys

Never hardcode API keys:

# ✅ Good
llms:
  gpt-4o:
    api_key: "${OPENAI_API_KEY}"

# ❌ Bad
llms:
  gpt-4o:
    api_key: "sk-hardcoded-key"

Configure Timeouts Appropriately

Match timeouts to task complexity:

llms:
  quick-tasks:
    type: "openai"
    timeout: 30  # Simple queries

  complex-analysis:
    type: "anthropic"
    timeout: 180  # Complex reasoning

Use Different Providers for Different Agents

Leverage each provider's strengths:

agents:
  coder:
    llm: "gpt-4o"  # Good at code

  writer:
    llm: "claude"  # Good at prose

  analyzer:
    llm: "claude-opus"  # Strong reasoning

Enable Retries for Production

Ensure reliability:

llms:
  production:
    type: "openai"
    max_retries: 5
    retry_delay: 2
    timeout: 120

Multi-Modality Support

All providers support image inputs when using vision-capable models. See Multi-Modality Support for complete documentation.

Quick Summary:

| Provider | Image Support | URI Support | Max Size |
|---|---|---|---|
| OpenAI | ✅ JPEG, PNG, GIF, WebP | ✅ HTTP/HTTPS URLs | 20MB |
| Anthropic | ✅ JPEG, PNG, GIF, WebP | ❌ Base64 only | 5MB |
| Gemini | ✅ Images, Video, Audio | ✅ GCS URIs, some HTTP | 20MB |
| Ollama | ✅ JPEG, PNG, GIF, WebP | ❌ Base64 only | 20MB |

Example:

llms:
  vision:
    type: "openai"
    model: "gpt-4o"  # Vision-capable model
    api_key: "${OPENAI_API_KEY}"

agents:
  vision_assistant:
    llm: "vision"
    # Automatically supports image inputs

Next Steps