Image Support - Complete Specification

Implementation Status: COMPLETE

All core components implemented and tested (62/62 tests pass):

Component	Status
ImageStore (disk storage)	Done
Multimodal messages	Done
Messages surface accepts images	Done
Image fetch surface	Done
Anthropic provider vision	Done
OpenAI provider vision	Done
filesystem.read images	Done
agent_loop image injection	Done
Socket.IO image events	Done
Image aging	Done
YAML vision config	Done
Daemon image config	Done

Overview

Support for images at every level of the framework:

User -> Agent: the user sends images (upload, paste, URL).
Tool -> Agent: a tool produces an image (screenshot, diagram, chart).
Agent -> User: images are rendered in the chat.

Ecosystem reference

Claude Code (current capabilities)

Cmd+V pastes a screenshot into the chat.
The Read tool does NOT read images from the filesystem.
The agent cannot capture screenshots on its own.

Anthropic API (Claude)

{
  "role": "user",
  "content": [
    {"type": "text", "text": "What's in this image?"},
    {"type": "image", "source": {
      "type": "base64", "media_type": "image/png", "data": "iVBOR..."
    }}
  ]
}

Formats: JPEG, PNG, GIF, WebP.
Max: 8000x8000 px, 100 images per request (200K context).
Best practice: use the Files API for recurring images (upload once, reference by file_id after).

OpenAI API (GPT-4o)

{
  "role": "user",
  "content": [
    {"type": "text", "text": "What's in this image?"},
    {"type": "image_url", "image_url": {
      "url": "data:image/png;base64,iVBOR..."
    }}
  ]
}

Formats: PNG, JPEG, WebP, non-animated GIF.
Max: 50MB payload, 500 images per request.
detail: "low" (512px) or "high" (native) trades cost vs quality.

DeepSeek

DeepSeek-chat (V3): no vision.
DeepSeek-VL: separate vision model (7B, 1.3B).
The standard deepseek-chat API does NOT accept images.

Architecture

Design principles

Images do NOT live in the messages - they are stored on disk and referenced by an image_id. They are inflated to base64 only at LLM-call time: at full resolution for the current turn, then downscaled or replaced by a text description as they age.
Unified format - a ContentBlock abstracts differences between providers. Anthropic/OpenAI conversion happens in the provider, not the agent loop.
Tools can return images - ActionResult supports image blocks in metadata. The agent loop injects them into the messages.
The client receives images over Socket.IO - no separate routes needed; images are inline (base64) in events on the /events namespace.

1. Image store (disk storage)

New component: `ImageStore`

class ImageStore:
    """Store images on disk, return lightweight references."""
    
    def __init__(self, base_dir: Path):
        self._base_dir = base_dir  # ~/.digitorn/images/
    
    async def store(self, data: bytes, mime: str, session_id: str) -> ImageRef:
        """Store an image, return a reference."""
        image_id = uuid4().hex[:12]
        ext = {"image/png": ".png", "image/jpeg": ".jpg", ...}[mime]
        path = self._base_dir / session_id / f"{image_id}{ext}"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        return ImageRef(
            image_id=image_id,
            path=str(path),
            mime=mime,
            size=len(data),
            width=...,  # from PIL if available
            height=...,
        )
    
    async def get(self, image_id: str, session_id: str) -> bytes | None:
        """Fetch the bytes of an image."""
        ...
    
    async def get_base64(self, image_id: str, session_id: str) -> str | None:
        """Fetch as base64 (for LLM injection)."""
        ...
    
    def cleanup_session(self, session_id: str):
        """Delete every image for a session."""
        ...

@dataclass
class ImageRef:
    image_id: str
    path: str
    mime: str
    size: int
    width: int = 0
    height: int = 0
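
The bodies of get, get_base64 and cleanup_session are elided above. The sketch below shows one way they could work, assuming the <base_dir>/<session_id>/<image_id><ext> layout that store() writes. The free functions find_image, load_base64 and remove_session are hypothetical names used for illustration, not the actual ImageStore methods.

```python
import base64
import shutil
from pathlib import Path
from typing import Optional


def find_image(base_dir: Path, session_id: str, image_id: str) -> Optional[Path]:
    """Locate an image file by id, whatever extension it was stored with."""
    session_dir = base_dir / session_id
    if not session_dir.is_dir():
        return None
    matches = sorted(session_dir.glob(f"{image_id}.*"))
    return matches[0] if matches else None


def load_base64(base_dir: Path, session_id: str, image_id: str) -> Optional[str]:
    """Return the image as a base64 string, or None if it no longer exists."""
    path = find_image(base_dir, session_id, image_id)
    if path is None:
        return None
    return base64.b64encode(path.read_bytes()).decode("ascii")


def remove_session(base_dir: Path, session_id: str) -> None:
    """Delete the whole session directory; a missing directory is not an error."""
    shutil.rmtree(base_dir / session_id, ignore_errors=True)
```

Returning None for a missing image lets the caller fall back to the alt text instead of raising.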

Why not base64 inside messages?

A PNG screenshot is 500KB-2MB base64. Over 10 turns with 3 images each:

Base64 inside messages = 30MB in memory, re-sent on every LLM call.
Reference + injection on-demand = a few KB in memory.

Injection strategy

Image age	Injected as
Current turn	Full base64 (high resolution)
Turn N-1 to N-2	Low-resolution base64 (resized to 512px)
Turn N-3 and older	Text: "[Image: screenshot of login page, 1920x1080]"

Keeps the context light while still giving the LLM vision over recent images.
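
The tiering can be expressed as a small pure function. This is a sketch only: injection_mode is a hypothetical helper, and its defaults mirror the aging_full_turns and aging_low_turns settings from the Config section; pass other values to change how long each tier lasts.

```python
def injection_mode(age: int, full_turns: int = 1, low_turns: int = 2) -> str:
    """Map an image's age in turns (0 = current turn) to how it is injected."""
    if age < 0:
        raise ValueError("age must be >= 0")
    if age < full_turns:
        return "full"      # full-resolution base64
    if age < full_turns + low_turns:
        return "low_res"   # base64 resized to the low-res size
    return "text"          # "[Image: <alt text>]" placeholder
```

With the defaults, an image is sent in full on the turn it appears, downscaled for the next two turns, and described in text after that.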

2. Message Format (multimodal)

ContentBlock

@dataclass
class ContentBlock:
    type: str  # "text", "image", "image_ref"
    
    # For type="text"
    text: str = ""
    
    # For type="image" (inline base64)
    image_data: str = ""  # base64
    media_type: str = ""  # "image/png"
    
    # For type="image_ref" (reference into the image store)
    image_id: str = ""
    alt_text: str = ""  # text description used for context

Multimodal messages

# Before (text only)
{"role": "user", "content": "Fix this bug"}

# After (multimodal)
{"role": "user", "content": [
    {"type": "text", "text": "Fix this bug, here's the screenshot:"},
    {"type": "image_ref", "image_id": "abc123", "alt_text": "Screenshot of error page"}
]}

The content is either a str (backward compatible) or a list[ContentBlock].
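
Code that consumes messages has to handle both shapes. A minimal sketch, assuming blocks are plain dicts as in the JSON above; normalize_content is a hypothetical helper name:

```python
from typing import Union


def normalize_content(content: Union[str, list]) -> list:
    """Return content as a list of block dicts, wrapping a plain string."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return list(content)
```

Normalizing once at the boundary means the rest of the pipeline only ever sees a list of blocks.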

3. Sending images with a message

The daemon's messages surface accepts images alongside the text body, either as multipart upload (one or more images[] parts plus optional workspace) or as JSON with base64 payloads. The exact route shape is not documented publicly; clients use the SDK.

JSON shape (handled by the SDK):

{
  "message": "Fix this bug",
  "images": [
    {"data": "iVBOR...", "mime": "image/png", "name": "screenshot.png"}
  ],
  "workspace": "/path/to/project"
}
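
For illustration, the JSON body above can be assembled from files on disk as follows. This is a sketch, not the SDK: build_message_payload is a hypothetical function, and the MIME type is guessed from the file name with the standard library.

```python
import base64
import mimetypes
from pathlib import Path
from typing import Optional


def build_message_payload(message: str, image_paths: list,
                          workspace: Optional[str] = None) -> dict:
    """Build the JSON body: text plus base64-encoded images."""
    images = []
    for raw in image_paths:
        path = Path(raw)
        mime, _ = mimetypes.guess_type(path.name)
        if mime is None or not mime.startswith("image/"):
            raise ValueError(f"not an image: {path}")
        images.append({
            "data": base64.b64encode(path.read_bytes()).decode("ascii"),
            "mime": mime,
            "name": path.name,
        })
    payload = {"message": message, "images": images}
    if workspace is not None:
        payload["workspace"] = workspace
    return payload
```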

Limits

Setting	Value	Configurable
Max images per message	10	Yes (`images.max_per_message`)
Max size per image	10MB	Yes (`images.max_size_bytes`)
Accepted formats	PNG, JPEG, WebP, GIF	No
Max total images per session	100	Yes (`images.max_per_session`)
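
A sketch of how these limits could be enforced before anything is written to disk. validate_images and its signature are assumptions made for this example; the default values are the ones from the table.

```python
ACCEPTED_MIMES = {"image/png", "image/jpeg", "image/webp", "image/gif"}


def validate_images(images, session_total, max_per_message=10,
                    max_size_bytes=10 * 1024 * 1024, max_per_session=100):
    """Check (mime, size_in_bytes) tuples against the limits.

    Returns a list of error strings; an empty list means the upload is accepted.
    session_total is the number of images already stored for the session.
    """
    errors = []
    if len(images) > max_per_message:
        errors.append(f"too many images in message: {len(images)} > {max_per_message}")
    if session_total + len(images) > max_per_session:
        errors.append(f"session image limit exceeded: {max_per_session}")
    for index, (mime, size) in enumerate(images):
        if mime not in ACCEPTED_MIMES:
            errors.append(f"image {index}: unsupported format {mime}")
        if size > max_size_bytes:
            errors.append(f"image {index}: {size} bytes exceeds {max_size_bytes}")
    return errors
```

Collecting every violation, rather than stopping at the first, lets the client show all problems in one response.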

4. LLM provider - multimodal conversion

Anthropic Provider

# Convert content blocks to Anthropic format.
# ImageStore.get_base64 is async and scoped by session, so the builder is
# async and takes the store and session_id as explicit arguments.
async def _build_content(
    blocks: list[ContentBlock], image_store: ImageStore, session_id: str
) -> list[dict]:
    result = []
    for block in blocks:
        if block.type == "text":
            result.append({"type": "text", "text": block.text})
        elif block.type == "image":
            result.append({
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": block.media_type,
                    "data": block.image_data,
                }
            })
        elif block.type == "image_ref":
            # Resolve the reference -> base64
            data = await image_store.get_base64(block.image_id, session_id)
            if data:
                result.append({
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": block.media_type or "image/png",
                        "data": data,
                    }
                })
            else:
                # Expired image -> inject a text description
                result.append({"type": "text", "text": f"[Image: {block.alt_text}]"})
    return result

OpenAI-Compatible Provider (GPT-4o, etc.)

def _build_content(blocks: list[ContentBlock]) -> list[dict]:
    result = []
    for block in blocks:
        if block.type == "text":
            result.append({"type": "text", "text": block.text})
        elif block.type == "image":
            result.append({
                "type": "image_url",
                "image_url": {
                    "url": f"data:{block.media_type};base64,{block.image_data}",
                    "detail": "high",
                }
            })
    return result

Providers without vision (DeepSeek-chat, Ollama text-only)

def _build_content(blocks: list[ContentBlock]) -> list[dict]:
    # Convert images to text descriptions
    texts = []
    for block in blocks:
        if block.type == "text":
            texts.append(block.text)
        elif block.type in ("image", "image_ref"):
            texts.append(f"[Image: {block.alt_text or 'uploaded image'}]")
    return [{"type": "text", "text": "\n".join(texts)}]

The provider auto-detects whether the model supports vision.

5. Tools - images as input and output

Filesystem: read image

The filesystem.read tool must support reading images:

async def read(self, params: ReadParams) -> ActionResult:
    path = self._resolve(params.path)
    
    if _is_image(path):
        # Read as an image, not as text
        data = path.read_bytes()
        base64_data = base64.b64encode(data).decode()
        mime = _mime_for(path.suffix)
        
        return ActionResult(
            success=True,
            data={
                "path": str(path),
                "type": "image",
                "mime": mime,
                "size": len(data),
            },
            metadata={
                "image_data": base64_data,  # For the LLM (via agent_loop)
                "media_type": mime,
            }
        )

Browser: screenshot

async def screenshot(self, params: ScreenshotParams) -> ActionResult:
    # Capture screenshot via Playwright
    data = await page.screenshot(type="png")
    base64_data = base64.b64encode(data).decode()
    
    return ActionResult(
        success=True,
        data={
            "type": "image",
            "mime": "image/png",
            "width": viewport.width,
            "height": viewport.height,
        },
        metadata={
            "image_data": base64_data,
            "media_type": "image/png",
        }
    )

Agent loop - automatic injection

In _append_tool_result, when the result contains an image:

def _append_tool_result(ctx, messages, call_id, tool_name, result, ok, cb):
    # ... normal text serialisation ...
    
    # If the result has an image, inject it as a content block
    meta = getattr(result, "metadata", {}) or {}
    if "image_data" in meta:
        # Append a message carrying the image so the LLM can see it
        messages.append({
            "role": "user",
            "content": [
                {"type": "text", "text": f"[Tool result image from {tool_name}]"},
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": meta.get("media_type", "image/png"),
                    "data": meta["image_data"],
                }}
            ]
        })

6. Socket.IO events - images to the client

The daemon emits image events on the Socket.IO /events namespace, in the room session:{session_id}. Images arrive in tool_call envelopes, with image_data (base64) and image_mime added to the payload.

In the tool_call event

{
  "type": "tool_call",
  "data": {
    "name": "browser__screenshot",
    "result": {
      "type": "image",
      "mime": "image/png",
      "width": 1920,
      "height": 1080
    },
    "image_data": "iVBOR...",
    "image_mime": "image/png"
  }
}
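
On the client side, pulling the image out of such an envelope is a matter of decoding one field. A minimal sketch, assuming the event has already been parsed into a dict; extract_image is a hypothetical helper, and the "image/png" fallback for a missing image_mime is an assumption of this example.

```python
import base64
from typing import Optional, Tuple


def extract_image(event: dict) -> Optional[Tuple[bytes, str]]:
    """Return (raw_bytes, mime) for a tool_call event carrying an image, else None."""
    if event.get("type") != "tool_call":
        return None
    data = event.get("data") or {}
    encoded = data.get("image_data")
    if not encoded:
        return None
    mime = data.get("image_mime", "image/png")  # assumed fallback
    return base64.b64decode(encoded), mime
```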

New event: image_message (for images embedded in replies)

{
  "type": "image",
  "data": {
    "image_id": "abc123",
    "mime": "image/png",
    "data": "iVBOR...",
    "width": 800,
    "height": 600,
    "alt": "Diagram of the architecture",
    "source": "tool:presentation.render"
  }
}

7. Persistence - images in the history

Session history with images

GET /sessions/{sid}/history returns images as references:

{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Fix this"},
        {"type": "image_ref", "image_id": "abc123", "alt_text": "Error screenshot",
         "mime": "image/png", "width": 1920, "height": 1080}
      ]
    }
  ]
}

Image fetch

The daemon exposes a per-session image-fetch endpoint that returns the raw bytes (image/png). Clients lazy-load images on demand instead of receiving them inline in history. The exact route shape is not documented publicly.

8. Context optimization

Problem

A 1920x1080 PNG in base64 weighs ~1-2MB, roughly 300K-500K estimated tokens. If every message carries an image, the context blows up within 3 turns.

Solution: image aging

class ImageContextManager:
    """Decides which images get inflated to base64 in the LLM messages."""
    
    def prepare_messages_for_llm(self, messages, current_turn):
        result = []
        for msg in messages:
            if not _has_images(msg):
                result.append(msg)
                continue
            
            blocks = []
            for block in msg["content"]:
                if block["type"] != "image_ref":
                    # Text and inline image blocks pass through unchanged
                    blocks.append(block)
                elif block["type"] == "image_ref":
                    age = current_turn - block.get("turn", 0)
                    
                    if age == 0:
                        # Current turn -> high resolution
                        blocks.append(_resolve_full(block))
                    elif age <= 2:
                        # 1-2 turns ago -> low resolution (512px)
                        blocks.append(_resolve_low_res(block))
                    else:
                        # 3+ turns ago -> text only
                        blocks.append({
                            "type": "text",
                            "text": f"[Previous image: {block['alt_text']}]"
                        })
            
            result.append({**msg, "content": blocks})
        return result

Estimated sizes

Strategy	Size per image	Estimated tokens
High resolution (1920px)	1-2 MB	~300K
Low resolution (512px)	50-100 KB	~30K
Text description	50-100 chars	~25

With image aging: 1 full image + 2 low-res + N descriptions = ~360K tokens max. Without aging: N full images = N x 300K, context explosion.
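
The arithmetic can be checked with a few lines of code. This sketch uses the per-image estimates from the table and assumes one image per turn; estimate_image_tokens is a hypothetical helper.

```python
FULL_TOKENS = 300_000    # high resolution, from the table above
LOW_RES_TOKENS = 30_000  # low resolution (512px)
TEXT_TOKENS = 25         # text description


def estimate_image_tokens(n_images: int, full_turns: int = 1,
                          low_turns: int = 2, aging: bool = True) -> int:
    """Estimate the tokens spent on images, assuming one image per turn."""
    if not aging:
        return n_images * FULL_TOKENS
    full = min(n_images, full_turns)
    low = min(n_images - full, low_turns)
    text = n_images - full - low
    return full * FULL_TOKENS + low * LOW_RES_TOKENS + text * TEXT_TOKENS
```

For a 10-image conversation this gives 360,175 tokens with aging against 3,000,000 without.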

9. Config

New settings in ~/.digitorn/config.yaml:

images:
  max_per_message: 10           # Max images per message
  max_size_bytes: 10485760      # 10MB per image
  max_per_session: 100          # Max images per session
  storage_dir: ""               # Empty = ~/.digitorn/images/
  low_res_size: 512             # Size for older images (px)
  aging_full_turns: 1           # Turns kept at high resolution
  aging_low_turns: 2            # Turns kept at low resolution
  cleanup_after_days: 7         # Delete images after N days
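
Once the YAML is parsed, the images section can be mapped onto a typed object with the same defaults. A sketch under stated assumptions: ImageSettings and its methods are hypothetical names, and the input is the already-parsed images mapping (YAML loading itself is left out).

```python
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass
class ImageSettings:
    max_per_message: int = 10
    max_size_bytes: int = 10 * 1024 * 1024  # 10485760
    max_per_session: int = 100
    storage_dir: str = ""
    low_res_size: int = 512
    aging_full_turns: int = 1
    aging_low_turns: int = 2
    cleanup_after_days: int = 7

    @classmethod
    def from_dict(cls, raw: dict) -> "ImageSettings":
        """Build settings from the parsed mapping, rejecting unknown keys."""
        known = {f.name for f in fields(cls)}
        unknown = set(raw) - known
        if unknown:
            raise ValueError(f"unknown images settings: {sorted(unknown)}")
        return cls(**raw)

    def resolved_storage_dir(self) -> Path:
        """An empty storage_dir falls back to ~/.digitorn/images/."""
        if self.storage_dir:
            return Path(self.storage_dir).expanduser()
        return Path.home() / ".digitorn" / "images"
```

Rejecting unknown keys catches typos such as max_per_mesage early instead of silently using the default.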

10. YAML App Config

agents:
  - id: main
    brain:
      provider: anthropic
      model: claude-sonnet-4-5
      vision: true              # Enable vision support (default: auto-detect)

If vision: false or the model lacks vision -> images are automatically converted to text descriptions.
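
The rule combines the explicit setting with the model's capability. A sketch of that decision, where vision_enabled is a hypothetical helper and None stands for "not set, auto-detect":

```python
from typing import Optional


def vision_enabled(configured: Optional[bool], model_supports_vision: bool) -> bool:
    """Decide whether images are sent as images or converted to text."""
    if configured is False:
        return False  # explicitly disabled in the app config
    # vision: true or unset -> still requires a vision-capable model
    return model_supports_vision
```

Note that vision: true cannot force images onto a model that lacks vision; in that case the text fallback still applies.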

11. Provider compatibility

Provider	Vision	Format
Claude (Anthropic)	Yes	`{"type": "image", "source": {"type": "base64", ...}}`
GPT-4o (OpenAI)	Yes	`{"type": "image_url", "image_url": {"url": "data:..."}}`
GPT-4o-mini	Yes	Same format
DeepSeek-chat (V3)	No	Converted to text `[Image: ...]`
DeepSeek-VL	Yes	OpenAI-compatible format
Ollama (llava)	Yes	`{"images": ["base64..."]}` (special format)
Ollama (text-only)	No	Converted to text

Detection is automatic via the provider. Each provider knows whether its model supports vision.

12. Implementation - priority order

Phase 1 (V1 - demo)

The /messages route accepts images (multipart + JSON base64)
Basic ImageStore (disk storage)
Anthropic provider: base64 injection into the messages
Socket.IO event with image_data for the client
Web client: upload/paste + inline display

Phase 2 (V1.1)

OpenAI-compat provider: format conversion
filesystem.read supports images
Image aging (context optimization)
GET /images/<id> route for lazy loading

Phase 3 (V2)

Browser.screenshot → image in the context
Presentation module → slides as images
Image generation tools (DALL-E, Stable Diffusion via MCP)
Anthropic Files API (upload once, reference by file_id)

13. What Digitorn will do BETTER than Claude Code

Feature	Claude Code	Digitorn
User paste image	Yes (Cmd+V)	Yes (paste + upload + URL)
Read image from disk	No (bug)	Yes (filesystem.read)
Agent screenshot	No	Yes (browser.screenshot)
Image in tool results	No	Yes (metadata.image_data)
Multi-image per message	Limited	10 images max
Image aging (context)	No	Yes (full → low-res → text)
Provider fallback without vision	No	Yes (automatic text)
Image persistence	No	Yes (ImageStore + `/images/<id>`)