# Observability
Hector provides enterprise-grade observability through OpenTelemetry and Prometheus integration, enabling comprehensive monitoring, tracing, and metrics collection for your AI agents.
## Overview
Hector's observability stack includes:
- Metrics (Prometheus): Aggregate statistics about agent performance, throughput, and resource usage
- Distributed Tracing (OpenTelemetry/Jaeger): Detailed request-level visibility with timing and context
- Dashboards (Grafana): Pre-built visualizations for monitoring and alerting
## Architecture

```
┌─────────┐
│ Hector  │
└────┬────┘
     │
     ├─────────────────┐
     │                 │
     ▼                 ▼
┌──────────┐      ┌──────────┐
│Prometheus│      │  Jaeger  │
│(Metrics) │      │ (Traces) │
└────┬─────┘      └──────────┘
     │
     ▼
┌────────────┐
│  Grafana   │
│(Dashboards)│
└────────────┘
```
## Configuration

### Basic Configuration

Add observability configuration to your Hector config file:

```yaml
global:
  observability:
    tracing:
      enabled: true
      exporter_type: "jaeger"
      endpoint_url: "localhost:4317"
      sampling_rate: 1.0
      service_name: "hector"
    metrics:
      enabled: true
      port: 8080
```
### Configuration Options

#### Tracing

| Option | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `false` | Enable distributed tracing |
| `exporter_type` | string | `"jaeger"` | Trace exporter type (`jaeger`, `otlp`) |
| `endpoint_url` | string | `"localhost:4317"` | OTLP gRPC endpoint |
| `sampling_rate` | float | `1.0` | Sampling rate (0.0-1.0, 1.0 = 100%) |
| `service_name` | string | `"hector"` | Service name in traces |
#### Metrics

| Option | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `false` | Enable Prometheus metrics |
| `port` | int | `8080` | Metrics port (not used; metrics are served on the REST port) |
## Metrics Collected

### Agent Metrics

- `hector_agent_calls_total` - Total number of agent executions
- `hector_agent_call_duration_seconds` - Agent execution duration (histogram)
- `hector_agent_errors_total` - Total agent execution errors
- `hector_agent_tokens_total` - Total tokens used across all agents

### Tool Metrics

- `hector_tool_calls_total` - Total tool executions
- `hector_tool_errors_total` - Total tool execution failures
- `hector_tool_execution_duration_seconds` - Tool execution duration

### LLM Metrics

- `hector_llm_calls_total` - Total LLM API calls
- `hector_llm_errors_total` - Total LLM API failures
- `hector_llm_request_duration_seconds` - LLM request duration
- `hector_llm_input_tokens_total` - Total input tokens
- `hector_llm_output_tokens_total` - Total output tokens
## Traces

Each agent execution creates a trace with the following structure:

```
Span: agent.call
├── Attributes:
│   ├── agent.name: "assistant"
│   ├── agent.llm: "gpt-4o"
│   └── input_preview: "User input..." (first 100 chars)
└── Duration: Execution time in milliseconds
```
## Quick Start

### 1. Start Observability Stack

Use the provided Docker Compose file:

```bash
docker-compose -f docker-compose.observability.yaml up -d
```

This starts:

- Jaeger on port 16686 (UI) and 4317 (OTLP gRPC)
- Prometheus on port 9090
- Grafana on port 3000 (login: admin/Dev12345)
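If you are not using the bundled file, a minimal equivalent stack looks roughly like the sketch below. The image tags, the `COLLECTOR_OTLP_ENABLED` flag, and the mounted `prometheus.yml` are illustrative assumptions, not the contents of `docker-compose.observability.yaml`:

```yaml
services:
  jaeger:
    image: jaegertracing/all-in-one:1.57
    environment:
      - COLLECTOR_OTLP_ENABLED=true    # accept OTLP gRPC on 4317
    ports:
      - "16686:16686"                  # UI
      - "4317:4317"                    # OTLP gRPC
  prometheus:
    image: prom/prometheus:v2.53.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml   # scrape config (see Troubleshooting)
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:11.1.0
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=Dev12345   # matches the login above
    ports:
      - "3000:3000"
```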
### 2. Configure Hector

Use the example configuration:

```bash
cp configs/observability-example.yaml my-config.yaml
# Edit my-config.yaml with your settings
```

### 3. Start Hector

```bash
hector serve --config my-config.yaml
```
### 4. Access Dashboards
- Metrics Endpoint: http://localhost:8080/metrics
- Prometheus: http://localhost:9090
- Jaeger: http://localhost:16686
- Grafana: http://localhost:3000
## Prometheus Queries

### Basic Queries

```
# Total agent calls
hector_agent_calls_total

# Agent call rate (calls per second)
rate(hector_agent_calls_total[1m])

# Average call duration
rate(hector_agent_call_duration_seconds_sum[5m]) / rate(hector_agent_call_duration_seconds_count[5m])

# 95th percentile latency
histogram_quantile(0.95, rate(hector_agent_call_duration_seconds_bucket[5m]))

# Error rate
rate(hector_agent_errors_total[5m])
```
### Alerting Queries

```
# High error rate (>5%)
rate(hector_agent_errors_total[5m]) / rate(hector_agent_calls_total[5m]) > 0.05

# High latency (P95 > 10s)
histogram_quantile(0.95, rate(hector_agent_call_duration_seconds_bucket[5m])) > 10

# Scraping failure
up{job="hector"} == 0
```
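To turn these into actual alerts, the same expressions can be wired into a Prometheus rules file. A sketch assuming the scrape job is named `hector`; the alert names, `for` windows, and severities are illustrative choices:

```yaml
# alerts.yml - load via rule_files in prometheus.yml
groups:
  - name: hector
    rules:
      - alert: HectorHighErrorRate
        expr: rate(hector_agent_errors_total[5m]) / rate(hector_agent_calls_total[5m]) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Agent error rate above 5%"
      - alert: HectorHighLatency
        expr: histogram_quantile(0.95, rate(hector_agent_call_duration_seconds_bucket[5m])) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Agent P95 latency above 10s"
      - alert: HectorDown
        expr: up{job="hector"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus cannot scrape Hector"
```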
## Grafana Setup

### Connect Prometheus Datasource

Prometheus is automatically configured when using the Docker Compose stack. To add it manually:

- Go to Configuration → Data Sources
- Add a Prometheus datasource
- URL: `http://prometheus:9090` (if in Docker) or `http://localhost:9090`
- Save & Test
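If you prefer provisioning over clicking through the UI, Grafana can pick the datasource up from a file on startup. A minimal sketch; the file path and datasource name are illustrative:

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # use http://localhost:9090 outside Docker
    isDefault: true
```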
### Create Dashboards

Use these panel queries:

Agent Throughput:

```
rate(hector_agent_calls_total[5m])
```

Average Response Time:

```
rate(hector_agent_call_duration_seconds_sum[5m]) / rate(hector_agent_call_duration_seconds_count[5m])
```

Error Rate:

```
rate(hector_agent_errors_total[5m])
```
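To keep dashboards built from these queries under version control, Grafana can also load dashboard JSON files from disk. A provisioning sketch; the provider name, folder, and path are illustrative:

```yaml
# /etc/grafana/provisioning/dashboards/hector.yml
apiVersion: 1
providers:
  - name: hector
    folder: Hector
    type: file
    options:
      path: /var/lib/grafana/dashboards   # directory containing dashboard JSON files
```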
## Jaeger Usage

### View Traces

- Open http://localhost:16686
- Select Service: `hector`
- Click "Find Traces"
- Click any trace to see:
  - Request timeline
  - Duration breakdown
  - Agent attributes (name, LLM, input)

### Search Traces

- By duration: Set min/max duration
- By tags: Filter by `agent.name`, `agent.llm`
- By time: Select time range
## Production Considerations

### Sampling

For high-volume production environments, reduce the sampling rate:

```yaml
observability:
  tracing:
    sampling_rate: 0.1  # Sample 10% of requests
```
### Security
- Metrics endpoint: Consider adding authentication
- Jaeger: Deploy with authentication enabled
- Grafana: Change default password
- Prometheus: Enable basic auth or use reverse proxy
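For the Prometheus option, basic auth can be enabled natively via a web config file passed with `--web.config.file`. A sketch; the username and the bcrypt hash placeholder are illustrative:

```yaml
# web-config.yml, passed to Prometheus as --web.config.file=web-config.yml
basic_auth_users:
  # value must be a bcrypt hash, e.g. generated with: htpasswd -nBC 10 "" | tr -d ':\n'
  admin: "<bcrypt-hash>"
```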
### Performance
- Metrics collection: ~0.1ms overhead per request
- Trace export: Asynchronous, no blocking
- Memory: ~10MB for observability components
### Storage

- Prometheus: Configure the retention period with `--storage.tsdb.retention.time=30d`
- Jaeger: Configure a storage backend (Cassandra, Elasticsearch)
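With the Docker Compose stack, the retention flag goes on the Prometheus service's `command`. A sketch (overriding `command` means re-stating the config file flag as well; the image tag is illustrative):

```yaml
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d   # keep 30 days of metrics
```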
## Troubleshooting

### No Metrics in Prometheus

Check Prometheus targets:

```bash
curl http://localhost:9090/api/v1/targets
```

Verify Hector metrics endpoint:

```bash
curl http://localhost:8080/metrics | grep hector_
```
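If the target is missing or down, check the Prometheus scrape configuration. A minimal sketch, assuming the job is named `hector` and metrics are served on Hector's REST port 8080:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: hector
    metrics_path: /metrics
    static_configs:
      # Use host.docker.internal:8080 if Prometheus runs in Docker and Hector on the host
      - targets: ["localhost:8080"]
```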
### No Traces in Jaeger

Check Hector logs for trace export errors:

```bash
grep -i trace hector.log
```

Verify the Jaeger OTLP endpoint is reachable (4317 is a gRPC port, so any response other than "connection refused" means the listener is up):

```bash
curl http://localhost:4317
```
### Grafana Can't Query Prometheus

- Check the datasource configuration
- Verify Prometheus is running: `curl http://localhost:9090/-/healthy`
- Test the query in the Prometheus UI first
## Best Practices

- **Always enable in production** - Observability is crucial for debugging and monitoring
- **Set appropriate sampling** - Balance between visibility and overhead
- **Create alerts** - Monitor error rates and latency
- **Use dashboards** - Visualize trends over time
- **Correlate metrics and traces** - Use both for complete visibility