Multi-Modality Support¶
Hector agents support multi-modal inputs, allowing you to send images, audio, video, and other media types alongside text messages. This enables use cases such as image analysis, document understanding, and visual question answering.
Overview¶
Multi-modality support in Hector follows the A2A Protocol v0.3.0 specification, using FilePart messages to represent media content. All LLM providers (OpenAI, Anthropic, Gemini, Ollama) automatically handle multi-modal content when supported by their models.
Supported Media Types¶
| Media Type | Supported Formats | Provider Support |
|---|---|---|
| Images | JPEG, PNG, GIF, WebP | ✅ All providers |
| Video | MP4, AVI, MOV (via URIs) | ✅ Gemini |
| Audio | WAV, MP3 (via URIs) | ✅ Gemini |
Note: Image support is universal across all providers. Video and audio support varies by provider capabilities.
Quick Start¶
Sending Images via A2A Protocol¶
Using HTTP/REST:
curl -X POST http://localhost:8080/v1/agents/assistant/message:send \
-H "Content-Type: application/json" \
-d '{
"message": {
"role": "ROLE_USER",
"parts": [
{
"text": "What is in this image?"
},
{
"file": {
"file_with_bytes": "<base64-encoded-image>",
"media_type": "image/jpeg",
"name": "photo.jpg"
}
}
]
}
}'
Using File URI:
{
"message": {
"role": "ROLE_USER",
"parts": [
{
"text": "Analyze this image"
},
{
"file": {
"file_with_uri": "https://example.com/image.jpg",
"media_type": "image/jpeg",
"name": "image.jpg"
}
}
]
}
}
Programmatic API¶
import (
"github.com/kadirpekel/hector/pkg/a2a/pb"
"github.com/kadirpekel/hector/pkg/agent"
)
// Create message with image (imageBytes holds the raw image data)
msg := &pb.Message{
    Role: pb.Role_ROLE_USER,
    Parts: []*pb.Part{
        {
            Part: &pb.Part_Text{
                Text: "What's in this image?",
            },
        },
        {
            Part: &pb.Part_File{
                File: &pb.FilePart{
                    File: &pb.FilePart_FileWithBytes{
                        FileWithBytes: imageBytes,
                    },
                    MediaType: "image/jpeg",
                    Name:      "photo.jpg",
                },
            },
        },
    },
}

// Send to an initialized agent instance (ag is an *agent.Agent)
response, err := ag.SendMessage(ctx, msg)
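The imageBytes value above is raw image data; protobuf base64-encodes it automatically when the message is serialized to JSON, so no manual encoding step is needed in Go. A minimal sketch of loading it from disk with the standard library:
import (
    "log"
    "os"
)

// Read raw image bytes; protobuf handles base64 encoding
// when the message is serialized to JSON.
imageBytes, err := os.ReadFile("photo.jpg")
if err != nil {
    log.Fatal(err)
}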
Configuration¶
Agent Card Configuration¶
Configure supported input/output modes in your agent's A2A card:
agents:
vision_assistant:
name: "Vision Assistant"
llm: "gpt-4o"
a2a:
version: "0.3.0"
input_modes:
- "text/plain"
- "application/json"
- "image/jpeg"
- "image/png"
- "image/gif"
- "image/webp"
output_modes:
- "text/plain"
- "application/json"
Default Input Modes:
If not specified, agents advertise these input modes by default:
- text/plain
- application/json
- image/jpeg
- image/png
- image/gif
- image/webp
LLM Provider Support¶
OpenAI¶
Supported Models:
- GPT-4o, GPT-4o-mini (vision-capable)
- GPT-4 Turbo with vision

Features:
- ✅ Direct HTTP/HTTPS image URLs
- ✅ Base64-encoded images (data URIs)
- ✅ Maximum image size: 20MB
- ✅ Supports JPEG, PNG, GIF, WebP
Example:
llms:
vision:
type: "openai"
model: "gpt-4o" # Vision-capable model
api_key: "${OPENAI_API_KEY}"
agents:
vision_assistant:
llm: "vision"
URI Support:
OpenAI supports direct image URLs. Simply provide the URL in file_with_uri:
{
"file": {
"file_with_uri": "https://example.com/image.jpg",
"media_type": "image/jpeg"
}
}
Anthropic (Claude)¶
Supported Models:
- Claude Sonnet 4 (vision-capable)
- Claude Opus 4 (vision-capable)

Features:
- ✅ Base64-encoded images only
- ❌ Image URLs not supported (must download first)
- ✅ Maximum image size: 5MB
- ✅ Supports JPEG, PNG, GIF, WebP
Example:
llms:
claude_vision:
type: "anthropic"
model: "claude-sonnet-4-20250514"
api_key: "${ANTHROPIC_API_KEY}"
agents:
vision_assistant:
llm: "claude_vision"
Important: Anthropic requires base64-encoded images. If you have a URL, download the image first and convert it to bytes (see the download example under Best Practices below).
Google Gemini¶
Supported Models:
- Gemini 2.0 Flash (vision-capable)
- Gemini Pro (vision-capable)
Features:
- ✅ Google Cloud Storage URIs (gs://)
- ✅ Base64-encoded images (inline data)
- ✅ Video support (via URIs)
- ✅ Audio support (via URIs)
- ✅ Maximum inline size: 20MB
- ✅ Supports JPEG, PNG, GIF, WebP, MP4, AVI, MOV, WAV, MP3
Example:
llms:
gemini_vision:
type: "gemini"
model: "gemini-2.0-flash-exp"
api_key: "${GEMINI_API_KEY}"
agents:
vision_assistant:
llm: "gemini_vision"
URI Support: Gemini supports Google Cloud Storage URIs and some HTTP URLs (may require File API upload):
{
"file": {
"file_with_uri": "gs://bucket/image.jpg",
"media_type": "image/jpeg"
}
}
Ollama¶
Supported Models:
- qwen3 (vision-capable)
- Other vision-capable models

Features:
- ✅ Base64-encoded images only
- ❌ Image URLs not supported (must download first)
- ✅ Maximum image size: 20MB
- ✅ Supports JPEG, PNG, GIF, WebP
Example:
llms:
local_vision:
type: "ollama"
model: "qwen3"
host: "http://localhost:11434"
agents:
vision_assistant:
llm: "local_vision"
FilePart Message Format¶
The A2A Protocol defines FilePart for multi-modal content:
message FilePart {
  oneof file {
    string file_with_uri = 1;   // HTTP/HTTPS URL or GCS URI
    bytes file_with_bytes = 2;  // Raw bytes (base64-encoded in JSON)
  }
  string media_type = 3;        // MIME type (e.g., "image/jpeg")
  string name = 4;              // Optional filename
}
Field Details¶
file_with_uri (string):
- HTTP/HTTPS URL: https://example.com/image.jpg
- Google Cloud Storage URI: gs://bucket/image.jpg
- Supported by: OpenAI, Gemini
file_with_bytes (bytes):
- Base64-encoded image data
- Supported by: All providers
- Recommended for: Anthropic, Ollama
media_type (string):
- MIME type identifier
- Examples: image/jpeg, image/png, image/gif, image/webp
- Required for proper processing
name (string, optional):
- Filename for reference
- Used in tool results and logging
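To make the oneof concrete in Go: the bytes wrapper appears in the examples above, and the URI wrapper should follow the same generated pattern (the pb.FilePart_FileWithUri name is assumed by analogy with pb.FilePart_FileWithBytes; verify against the generated code):
// URI variant (supported by OpenAI and Gemini)
uriPart := &pb.FilePart{
    File:      &pb.FilePart_FileWithUri{FileWithUri: "https://example.com/image.jpg"},
    MediaType: "image/jpeg",
    Name:      "image.jpg",
}

// Bytes variant (supported by all providers)
bytesPart := &pb.FilePart{
    File:      &pb.FilePart_FileWithBytes{FileWithBytes: imageBytes},
    MediaType: "image/jpeg",
    Name:      "image.jpg",
}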
Vision Tools¶
Hector includes built-in vision tools for image generation and processing:
generate_image¶
Generate images using DALL-E 3:
agents:
creative:
tools: ["generate_image"]
tools:
generate_image:
type: "generate_image"
config:
api_key: "${OPENAI_API_KEY}" # Required
model: "dall-e-3" # Default: dall-e-3
size: "1024x1024" # Default: 1024x1024
quality: "standard" # standard or hd
style: "vivid" # vivid or natural
timeout: "60s" # Default: 60s
Tool Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| prompt | string | ✅ | Text description of image to generate |
| size | string | ❌ | Image size (e.g., "1024x1024") |
| quality | string | ❌ | "standard" or "hd" |
| style | string | ❌ | "vivid" or "natural" |
Example Usage:
User: Generate an image of a sunset over mountains
Agent: generate_image(prompt="A beautiful sunset over snow-capped mountains")
Agent: Image generated successfully: https://...
screenshot_page¶
Status: Placeholder (not yet implemented)
Take screenshots of web pages (requires headless browser integration):
tools:
screenshot_page:
type: "screenshot_page"
config:
timeout: "30s"
Note: This tool is currently a placeholder. Full implementation requires headless browser integration (Chrome DevTools Protocol, Playwright, etc.).
Use Cases¶
Image Analysis¶
Analyze images and answer questions about their content:
{
"message": {
"role": "ROLE_USER",
"parts": [
{
"text": "What objects are in this image?"
},
{
"file": {
"file_with_bytes": "<base64-image>",
"media_type": "image/jpeg"
}
}
]
}
}
Document Understanding¶
Extract text and information from images of documents:
User: [sends image of invoice]
Agent: This invoice shows:
- Invoice #: INV-2024-001
- Amount: $1,250.00
- Due date: 2024-12-31
Visual Question Answering¶
Answer questions about image content:
User: [sends photo] "What color is the car?"
Agent: The car in the image is red.
Image Generation¶
Generate images based on text descriptions:
User: Create an image of a futuristic cityscape
Agent: generate_image(prompt="Futuristic cityscape with flying cars and neon lights")
Agent: [returns generated image URL]
Multi-Modal Conversations¶
Combine text and images in conversation:
User: [sends diagram] "Explain this architecture"
Agent: This diagram shows a microservices architecture with...
User: [sends updated diagram] "What changed?"
Agent: The new version adds a load balancer and...
Best Practices¶
1. Choose the Right Provider¶
| Use Case | Recommended Provider | Reason |
|---|---|---|
| Image URLs | OpenAI | Direct URL support |
| Large images (>5MB) | OpenAI, Gemini, Ollama | Higher size limits |
| Video/Audio | Gemini | Multi-modal support |
| Cost-sensitive | OpenAI GPT-4o-mini | Lower cost |
| Local processing | Ollama | No API costs |
2. Optimize Image Sizes¶
- Resize images before sending (most models work well with 1024x1024)
- Compress images to reduce payload size (see the sketch after this list)
- Use appropriate formats (JPEG for photos, PNG for graphics)
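A minimal sketch of the compression step using only the Go standard library; it re-encodes a JPEG at a lower quality (compressJPEG is a hypothetical helper, and resizing would need an extra library such as golang.org/x/image/draw):
import (
    "bytes"
    "image/jpeg"
)

// compressJPEG re-encodes JPEG data at the given quality (1-100)
// to shrink the payload before attaching it to a message.
func compressJPEG(data []byte, quality int) ([]byte, error) {
    img, err := jpeg.Decode(bytes.NewReader(data))
    if err != nil {
        return nil, err
    }
    var buf bytes.Buffer
    if err := jpeg.Encode(&buf, img, &jpeg.Options{Quality: quality}); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}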
3. Handle Provider Limitations¶
Anthropic URI Limitation:
If using Anthropic with image URLs, download and convert first:
// Download the image (check the status code; an error page is not an image)
resp, err := http.Get(imageURL)
if err != nil {
    return err
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
    return fmt.Errorf("unexpected status: %s", resp.Status)
}

imageBytes, err := io.ReadAll(resp.Body)
if err != nil {
    return err
}

// Use file_with_bytes instead of file_with_uri
filePart := &pb.FilePart{
    File: &pb.FilePart_FileWithBytes{
        FileWithBytes: imageBytes,
    },
    MediaType: "image/jpeg",
}
4. Set Media Types Correctly¶
Always specify media_type for proper processing:
{
"file": {
"file_with_bytes": "<base64>",
"media_type": "image/jpeg" // ✅ Required
}
}
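If the MIME type isn't known up front, Go's standard library can sniff it from the leading bytes via http.DetectContentType:
// DetectContentType examines at most the first 512 bytes of data.
mediaType := http.DetectContentType(imageBytes)

filePart := &pb.FilePart{
    File:      &pb.FilePart_FileWithBytes{FileWithBytes: imageBytes},
    MediaType: mediaType,
}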
5. Error Handling¶
Handle cases where images are skipped:
- Oversized images: Check size limits (5MB for Anthropic, 20MB for others)
- Unsupported formats: Ensure the media type starts with image/
- Invalid URIs: Verify URLs are accessible
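A minimal pre-flight check for the first two cases (validateImage is a hypothetical helper; per-provider limits are listed in the table below):
import (
    "fmt"
    "strings"
)

// validateImage rejects attachments a provider would skip.
func validateImage(mediaType string, data []byte, maxBytes int) error {
    if !strings.HasPrefix(mediaType, "image/") {
        return fmt.Errorf("unsupported media type: %q", mediaType)
    }
    if len(data) > maxBytes {
        return fmt.Errorf("image is %d bytes, limit is %d", len(data), maxBytes)
    }
    return nil
}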
Size Limits¶
| Provider | Maximum Size | Notes |
|---|---|---|
| OpenAI | 20MB | Both URIs and base64 |
| Anthropic | 5MB | Base64 only |
| Gemini | 20MB | Inline data; URIs vary |
| Ollama | 20MB | Base64 only |
Recommendation: Keep images under 5MB for maximum compatibility.
Troubleshooting¶
Images Not Being Processed¶
Check:
1. ✅ Model supports vision (e.g., gpt-4o, not gpt-3.5-turbo)
2. ✅ Media type is set correctly (image/jpeg, image/png, etc.)
3. ✅ Image size is within limits
4. ✅ Provider supports your input method (URI vs bytes)
Anthropic URI Errors¶
Problem: Anthropic doesn't support image URLs directly.
Solution: Download image and use file_with_bytes:
// Download first (downloadImage is a helper like the one under Best Practices)
imageBytes := downloadImage(url)

// Then use bytes; the standard library's http.DetectContentType
// is one way to implement detectMediaType
filePart := &pb.FilePart{
    File: &pb.FilePart_FileWithBytes{
        FileWithBytes: imageBytes,
    },
    MediaType: detectMediaType(imageBytes),
}
Gemini URI Limitations¶
Problem: Standard HTTP URLs may not work with Gemini.
Solution: Use Google Cloud Storage URIs or convert to base64:
{
"file": {
"file_with_uri": "gs://my-bucket/image.jpg" // ✅ Works
// OR
"file_with_bytes": "<base64>" // ✅ Always works
}
}
Examples¶
Complete Configuration¶
llms:
vision_llm:
type: "openai"
model: "gpt-4o"
api_key: "${OPENAI_API_KEY}"
agents:
vision_assistant:
name: "Vision Assistant"
llm: "vision_llm"
a2a:
version: "0.3.0"
input_modes:
- "text/plain"
- "image/jpeg"
- "image/png"
- "image/gif"
- "image/webp"
tools:
- "generate_image"
prompt:
system_role: |
You are a vision assistant that can analyze images
and answer questions about their content.
REST API Example¶
# Send image via REST
curl -X POST http://localhost:8080/v1/agents/vision_assistant/message:send \
-H "Content-Type: application/json" \
-d '{
"message": {
"role": "ROLE_USER",
"parts": [
{
"text": "Describe this image"
},
{
"file": {
"file_with_uri": "https://example.com/photo.jpg",
"media_type": "image/jpeg",
"name": "photo.jpg"
}
}
]
}
}'
Programmatic Example¶
package main
import (
    "context"
    "fmt"
    "io"
    "net/http"

    "github.com/kadirpekel/hector/pkg/a2a/pb"
    "github.com/kadirpekel/hector/pkg/agent"
)

func sendImageMessage(ctx context.Context, ag *agent.Agent, imageURL string) error {
    // Download image
    resp, err := http.Get(imageURL)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("unexpected status: %s", resp.Status)
    }

    imageBytes, err := io.ReadAll(resp.Body)
    if err != nil {
        return err
    }

    // Create message with image
    msg := &pb.Message{
        Role: pb.Role_ROLE_USER,
        Parts: []*pb.Part{
            {
                Part: &pb.Part_Text{
                    Text: "What's in this image?",
                },
            },
            {
                Part: &pb.Part_File{
                    File: &pb.FilePart{
                        File: &pb.FilePart_FileWithBytes{
                            FileWithBytes: imageBytes,
                        },
                        MediaType: "image/jpeg",
                        Name:      "image.jpg",
                    },
                },
            },
        },
    }

    // Send to agent (the parameter is named ag to avoid shadowing the agent package)
    response, err := ag.SendMessage(ctx, msg)
    if err != nil {
        return err
    }

    // Process response
    _ = response
    return nil
}
Next Steps¶
- Tools - Learn about vision tools (generate_image, screenshot_page)
- LLM Providers - Configure vision-capable models
- A2A Protocol - Understand FilePart message format
- Configuration Reference - Complete configuration options
Related Topics¶
- Tools - Vision tools and capabilities
- LLM Providers - Provider-specific multi-modality support
- A2A Protocol - Protocol specification
- Programmatic API - Using multi-modality in code