Multi-Modality Support¶
Hector agents support multi-modal inputs, allowing you to send images, audio, video, and other media types alongside text messages. This enables use cases such as image analysis, document understanding, and visual question answering.
Overview¶
Multi-modality support in Hector follows the A2A Protocol v0.3.0 specification, using FilePart messages to represent media content. All LLM providers (OpenAI, Anthropic, Gemini, Ollama) automatically handle multi-modal content when supported by their models.
Supported Media Types¶
| Media Type | Supported Formats | Provider Support |
|---|---|---|
| Images | JPEG, PNG, GIF, WebP | ✅ All providers |
| Video | MP4, AVI, MOV (via URIs) | ✅ Gemini |
| Audio | WAV, MP3 (via URIs) | ✅ Gemini |
Note: Image support is universal across all providers. Video and audio support varies by provider capabilities.
Quick Start¶
Sending Images via A2A Protocol¶
Using HTTP/REST:
curl -X POST http://localhost:8080/v1/agents/assistant/message:send \
-H "Content-Type: application/json" \
-d '{
"message": {
"role": "ROLE_USER",
"parts": [
{
"text": "What is in this image?"
},
{
"file": {
"file_with_bytes": "<base64-encoded-image>",
"media_type": "image/jpeg",
"name": "photo.jpg"
}
}
]
}
}'
Using File URI:
{
"message": {
"role": "ROLE_USER",
"parts": [
{
"text": "Analyze this image"
},
{
"file": {
"file_with_uri": "https://example.com/image.jpg",
"media_type": "image/jpeg",
"name": "image.jpg"
}
}
]
}
}
Programmatic API¶
import (
"github.com/kadirpekel/hector/pkg/a2a/pb"
"github.com/kadirpekel/hector/pkg/agent"
)
// Create message with image (imageBytes holds the raw image data)
msg := &pb.Message{
    Role: pb.Role_ROLE_USER,
    Parts: []*pb.Part{
        {
            Part: &pb.Part_Text{
                Text: "What's in this image?",
            },
        },
        {
            Part: &pb.Part_File{
                File: &pb.FilePart{
                    File: &pb.FilePart_FileWithBytes{
                        FileWithBytes: imageBytes,
                    },
                    MediaType: "image/jpeg",
                    Name:      "photo.jpg",
                },
            },
        },
    },
}

// Send to an initialized agent instance (ag is an *agent.Agent)
response, err := ag.SendMessage(ctx, msg)
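The imageBytes value above is raw image data; protobuf base64-encodes it automatically when the message is serialized to JSON, so no manual encoding step is needed in Go. A minimal sketch of loading it from disk with the standard library:
import (
    "log"
    "os"
)

// Read raw image bytes; protobuf handles base64 encoding
// when the message is serialized to JSON.
imageBytes, err := os.ReadFile("photo.jpg")
if err != nil {
    log.Fatal(err)
}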
Configuration¶
Agent Card Configuration¶
Configure supported input/output modes in your agent's A2A card:
agents:
vision_assistant:
name: "Vision Assistant"
llm: "gpt-4o"
a2a:
version: "0.3.0"
input_modes:
- "text/plain"
- "application/json"
- "image/jpeg"
- "image/png"
- "image/gif"
- "image/webp"
output_modes:
- "text/plain"
- "application/json"
Default Input Modes:
If not specified, agents advertise these input modes by default:
- text/plain
- application/json
- image/jpeg
- image/png
- image/gif
- image/webp
LLM Provider Support¶
OpenAI¶
Supported Models:
- GPT-4o, GPT-4o-mini (vision-capable)
- GPT-4 Turbo with vision

Features:
- ✅ Direct HTTP/HTTPS image URLs
- ✅ Base64-encoded images (data URIs)
- ✅ Maximum image size: 20MB
- ✅ Supports JPEG, PNG, GIF, WebP
Example:
llms:
vision:
type: "openai"
model: "gpt-4o" # Vision-capable model
api_key: "${OPENAI_API_KEY}"
agents:
vision_assistant:
llm: "vision"
URI Support:
OpenAI supports direct image URLs. Simply provide the URL in file_with_uri:
{
"file": {
"file_with_uri": "https://example.com/image.jpg",
"media_type": "image/jpeg"
}
}
Anthropic (Claude)¶
Supported Models:
- Claude Sonnet 4 (vision-capable)
- Claude Opus 4 (vision-capable)

Features:
- ✅ Base64-encoded images only
- ❌ Image URLs not supported (must download first)
- ✅ Maximum image size: 5MB
- ✅ Supports JPEG, PNG, GIF, WebP
Example:
llms:
claude_vision:
type: "anthropic"
model: "claude-sonnet-4-20250514"
api_key: "${ANTHROPIC_API_KEY}"
agents:
vision_assistant:
llm: "claude_vision"
Important: Anthropic requires base64-encoded images. If you have a URL, download the image first and convert it to bytes (see the download example under Best Practices below).
Google Gemini¶
Supported Models:
- Gemini 2.0 Flash (vision-capable)
- Gemini Pro (vision-capable)
Features:
- ✅ Google Cloud Storage URIs (gs://)
- ✅ Base64-encoded images (inline data)
- ✅ Video support (via URIs)
- ✅ Audio support (via URIs)
- ✅ Maximum inline size: 20MB
- ✅ Supports JPEG, PNG, GIF, WebP, MP4, AVI, MOV, WAV, MP3
Example:
llms:
gemini_vision:
type: "gemini"
model: "gemini-2.0-flash-exp"
api_key: "${GEMINI_API_KEY}"
agents:
vision_assistant:
llm: "gemini_vision"
URI Support: Gemini supports Google Cloud Storage URIs and some HTTP URLs (may require File API upload):
{
"file": {
"file_with_uri": "gs://bucket/image.jpg",
"media_type": "image/jpeg"
}
}
Ollama¶
Supported Models:
- qwen3 (vision-capable)
- Other vision-capable models

Features:
- ✅ Base64-encoded images only
- ❌ Image URLs not supported (must download first)
- ✅ Maximum image size: 20MB
- ✅ Supports JPEG, PNG, GIF, WebP
Example:
llms:
local_vision:
type: "ollama"
model: "qwen3"
host: "http://localhost:11434"
agents:
vision_assistant:
llm: "local_vision"
FilePart Message Format¶
The A2A Protocol defines FilePart for multi-modal content:
message FilePart {
  oneof file {
    string file_with_uri = 1;   // HTTP/HTTPS URL or GCS URI
    bytes file_with_bytes = 2;  // Raw bytes (base64-encoded in JSON)
  }
  string media_type = 3;        // MIME type (e.g., "image/jpeg")
  string name = 4;              // Optional filename
}
Field Details¶
file_with_uri (string):
- HTTP/HTTPS URL: https://example.com/image.jpg
- Google Cloud Storage URI: gs://bucket/image.jpg
- Supported by: OpenAI, Gemini
file_with_bytes (bytes):
- Base64-encoded image data
- Supported by: All providers
- Recommended for: Anthropic, Ollama
media_type (string):
- MIME type identifier
- Examples: image/jpeg, image/png, image/gif, image/webp
- Required for proper processing
name (string, optional):
- Filename for reference
- Used in tool results and logging
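To make the oneof concrete in Go: the bytes wrapper appears in the examples above, and the URI wrapper should follow the same generated pattern (the pb.FilePart_FileWithUri name is assumed by analogy with pb.FilePart_FileWithBytes; verify against the generated code):
// URI variant (supported by OpenAI and Gemini)
uriPart := &pb.FilePart{
    File:      &pb.FilePart_FileWithUri{FileWithUri: "https://example.com/image.jpg"},
    MediaType: "image/jpeg",
    Name:      "image.jpg",
}

// Bytes variant (supported by all providers)
bytesPart := &pb.FilePart{
    File:      &pb.FilePart_FileWithBytes{FileWithBytes: imageBytes},
    MediaType: "image/jpeg",
    Name:      "image.jpg",
}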
Vision Tools¶
Hector includes built-in vision tools for image generation and processing:
generate_image¶
Generate images using DALL-E 3:
agents:
creative:
tools: ["generate_image"]
tools:
generate_image:
type: "generate_image"
config:
api_key: "${OPENAI_API_KEY}" # Required
model: "dall-e-3" # Default: dall-e-3
size: "1024x1024" # Default: 1024x1024
quality: "standard" # standard or hd
style: "vivid" # vivid or natural
timeout: "60s" # Default: 60s
Tool Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| prompt | string | ✅ | Text description of image to generate |
| size | string | ❌ | Image size (e.g., "1024x1024") |
| quality | string | ❌ | "standard" or "hd" |
| style | string | ❌ | "vivid" or "natural" |
Example Usage:
User: Generate an image of a sunset over mountains
Agent: generate_image(prompt="A beautiful sunset over snow-capped mountains")
Agent: Image generated successfully: https://...
screenshot_page¶
Status: Placeholder (not yet implemented)
Take screenshots of web pages (requires headless browser integration):
tools:
screenshot_page:
type: "screenshot_page"
config:
timeout: "30s"
Note: This tool is currently a placeholder. Full implementation requires headless browser integration (Chrome DevTools Protocol, Playwright, etc.).
Use Cases¶
Image Analysis¶
Analyze images and answer questions about their content:
{
"message": {
"role": "ROLE_USER",
"parts": [
{
"text": "What objects are in this image?"
},
{
"file": {
"file_with_bytes": "<base64-image>",
"media_type": "image/jpeg"
}
}
]
}
}
Document Understanding¶
Extract text and information from images of documents:
User: [sends image of invoice]
Agent: This invoice shows:
- Invoice #: INV-2024-001
- Amount: $1,250.00
- Due date: 2024-12-31
Visual Question Answering¶
Answer questions about image content:
User: [sends photo] "What color is the car?"
Agent: The car in the image is red.
Image Generation¶
Generate images based on text descriptions:
User: Create an image of a futuristic cityscape
Agent: generate_image(prompt="Futuristic cityscape with flying cars and neon lights")
Agent: [returns generated image URL]
Multi-Modal Conversations¶
Combine text and images in conversation:
User: [sends diagram] "Explain this architecture"
Agent: This diagram shows a microservices architecture with...
User: [sends updated diagram] "What changed?"
Agent: The new version adds a load balancer and...
Best Practices¶
1. Choose the Right Provider¶
| Use Case | Recommended Provider | Reason |
|---|---|---|
| Image URLs | OpenAI | Direct URL support |
| Large images (>5MB) | OpenAI, Gemini, Ollama | Higher size limits |
| Video/Audio | Gemini | Multi-modal support |
| Cost-sensitive | OpenAI GPT-4o-mini | Lower cost |
| Local processing | Ollama | No API costs |
2. Optimize Image Sizes¶
- Resize images before sending (most models work well with 1024x1024)
- Compress images to reduce payload size (see the sketch after this list)
- Use appropriate formats (JPEG for photos, PNG for graphics)
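A minimal sketch of the compression step using only the Go standard library; it re-encodes a JPEG at a lower quality (compressJPEG is a hypothetical helper, and resizing would need an extra library such as golang.org/x/image/draw):
import (
    "bytes"
    "image/jpeg"
)

// compressJPEG re-encodes JPEG data at the given quality (1-100)
// to shrink the payload before attaching it to a message.
func compressJPEG(data []byte, quality int) ([]byte, error) {
    img, err := jpeg.Decode(bytes.NewReader(data))
    if err != nil {
        return nil, err
    }
    var buf bytes.Buffer
    if err := jpeg.Encode(&buf, img, &jpeg.Options{Quality: quality}); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}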
3. Handle Provider Limitations¶
Anthropic URI Limitation:
If using Anthropic with image URLs, download and convert first:
// Download the image (check the status code; an error page is not an image)
resp, err := http.Get(imageURL)
if err != nil {
    return err
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
    return fmt.Errorf("unexpected status: %s", resp.Status)
}

imageBytes, err := io.ReadAll(resp.Body)
if err != nil {
    return err
}

// Use file_with_bytes instead of file_with_uri
filePart := &pb.FilePart{
    File: &pb.FilePart_FileWithBytes{
        FileWithBytes: imageBytes,
    },
    MediaType: "image/jpeg",
}
4. Set Media Types Correctly¶
Always specify media_type for proper processing:
{
"file": {
"file_with_bytes": "<base64>",
"media_type": "image/jpeg" // ✅ Required
}
}
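If the MIME type isn't known up front, Go's standard library can sniff it from the leading bytes via http.DetectContentType:
// DetectContentType examines at most the first 512 bytes of data.
mediaType := http.DetectContentType(imageBytes)

filePart := &pb.FilePart{
    File:      &pb.FilePart_FileWithBytes{FileWithBytes: imageBytes},
    MediaType: mediaType,
}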
5. Error Handling¶
Handle cases where images are skipped:
- Oversized images: Check size limits (5MB for Anthropic, 20MB for others)
- Unsupported formats: Ensure the media type starts with image/
- Invalid URIs: Verify URLs are accessible
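A minimal pre-flight check for the first two cases (validateImage is a hypothetical helper; per-provider limits are listed in the table below):
import (
    "fmt"
    "strings"
)

// validateImage rejects attachments a provider would skip.
func validateImage(mediaType string, data []byte, maxBytes int) error {
    if !strings.HasPrefix(mediaType, "image/") {
        return fmt.Errorf("unsupported media type: %q", mediaType)
    }
    if len(data) > maxBytes {
        return fmt.Errorf("image is %d bytes, limit is %d", len(data), maxBytes)
    }
    return nil
}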
Size Limits¶
| Provider | Maximum Size | Notes |
|---|---|---|
| OpenAI | 20MB | Both URIs and base64 |
| Anthropic | 5MB | Base64 only |
| Gemini | 20MB | Inline data; URIs vary |
| Ollama | 20MB | Base64 only |
Recommendation: Keep images under 5MB for maximum compatibility.
Troubleshooting¶
Images Not Being Processed¶
Check:
1. ✅ Model supports vision (e.g., gpt-4o, not gpt-3.5-turbo)
2. ✅ Media type is set correctly (image/jpeg, image/png, etc.)
3. ✅ Image size is within limits
4. ✅ Provider supports your input method (URI vs bytes)
Anthropic URI Errors¶
Problem: Anthropic doesn't support image URLs directly.
Solution: Download image and use file_with_bytes:
// Download first (downloadImage is a helper like the one under Best Practices)
imageBytes := downloadImage(url)

// Then use bytes; the standard library's http.DetectContentType
// is one way to implement detectMediaType
filePart := &pb.FilePart{
    File: &pb.FilePart_FileWithBytes{
        FileWithBytes: imageBytes,
    },
    MediaType: detectMediaType(imageBytes),
}
Gemini URI Limitations¶
Problem: Standard HTTP URLs may not work with Gemini.
Solution: Use Google Cloud Storage URIs or convert to base64:
{
"file": {
"file_with_uri": "gs://my-bucket/image.jpg" // ✅ Works
// OR
"file_with_bytes": "<base64>" // ✅ Always works
}
}
Examples¶
Complete Configuration¶
llms:
vision_llm:
type: "openai"
model: "gpt-4o"
api_key: "${OPENAI_API_KEY}"
agents:
vision_assistant:
name: "Vision Assistant"
llm: "vision_llm"
a2a:
version: "0.3.0"
input_modes:
- "text/plain"
- "image/jpeg"
- "image/png"
- "image/gif"
- "image/webp"
tools:
- "generate_image"
prompt:
system_role: |
You are a vision assistant that can analyze images
and answer questions about their content.
REST API Example¶
# Send image via REST
curl -X POST http://localhost:8080/v1/agents/vision_assistant/message:send \
-H "Content-Type: application/json" \
-d '{
"message": {
"role": "ROLE_USER",
"parts": [
{
"text": "Describe this image"
},
{
"file": {
"file_with_uri": "https://example.com/photo.jpg",
"media_type": "image/jpeg",
"name": "photo.jpg"
}
}
]
}
}'
Programmatic Example¶
package main
import (
    "context"
    "fmt"
    "io"
    "net/http"

    "github.com/kadirpekel/hector/pkg/a2a/pb"
    "github.com/kadirpekel/hector/pkg/agent"
)

func sendImageMessage(ctx context.Context, ag *agent.Agent, imageURL string) error {
    // Download image
    resp, err := http.Get(imageURL)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("unexpected status: %s", resp.Status)
    }

    imageBytes, err := io.ReadAll(resp.Body)
    if err != nil {
        return err
    }

    // Create message with image
    msg := &pb.Message{
        Role: pb.Role_ROLE_USER,
        Parts: []*pb.Part{
            {
                Part: &pb.Part_Text{
                    Text: "What's in this image?",
                },
            },
            {
                Part: &pb.Part_File{
                    File: &pb.FilePart{
                        File: &pb.FilePart_FileWithBytes{
                            FileWithBytes: imageBytes,
                        },
                        MediaType: "image/jpeg",
                        Name:      "image.jpg",
                    },
                },
            },
        },
    }

    // Send to agent (the parameter is named ag to avoid shadowing the agent package)
    response, err := ag.SendMessage(ctx, msg)
    if err != nil {
        return err
    }

    // Process response
    _ = response
    return nil
}
Next Steps¶
- Tools - Learn about vision tools (generate_image, screenshot_page)
- LLM Providers - Configure vision-capable models
- A2A Protocol - Understand FilePart message format
- Configuration Reference - Complete configuration options
Related Topics¶
- Tools - Vision tools and capabilities
- LLM Providers - Provider-specific multi-modality support
- A2A Protocol - Protocol specification
- Programmatic API - Using multi-modality in code