RAG Module

The rag module is a production-grade Retrieval-Augmented Generation engine. It unifies files, databases, and free-text sources into named knowledge bases with hybrid retrieval (BM25 + semantic), cross-encoder reranking, source citations, a semantic cache, and an optional Text2SQL strategy.

Every claim on this page maps to real code. Entries are cited with file + line.

Why a dedicated RAG module?

The existing vector module exposes basic vector ops (add, search, delete) on raw collections. rag builds on top with:

Knowledge bases - named, versioned collections that pair a vector index with a parallel BM25 index and per-source metadata.
Hybrid retrieval - Reciprocal Rank Fusion (RRF) of BM25 + semantic by default.
Cross-encoder reranking for precision.
Source citations injected directly into the LLM context.
Semantic cache for sub-15 ms repeated queries.
Multi-format ingestion (Markdown, PDF, code, CSV, JSON, HTML, databases).
Database sync - updated_at, changelog triggers, or LISTEN/NOTIFY.
Text2SQL for natural-language questions over structured data.
Multi-query expansion for broader recall.
Corrective RAG with quality-evaluation fallback.
6 vector backends - Qdrant, ChromaDB, LanceDB, Pinecone, pgvector, Elasticsearch.

Zero-config quick start

tools:
  modules:
    rag: {}

Setting	Default
Embedding model	`minilm-l12` (384 d, multilingual, 220 MB)
Vector backend	Qdrant in-memory
Retrieval strategy	Hybrid (BM25 + semantic + RRF)
Chunking	`recursive`, 500 chars, 50 overlap
Semantic cache	Enabled, in-memory, 1 h TTL
Citations	Enabled, inline
Reranker	Disabled

The agent then creates KBs and ingests via tool calls:

# Agent: "I'll create a knowledge base and index your docs."
rag.create_knowledge_base(name="docs")
rag.ingest_directory(knowledge_base="docs", path="./docs",
                     extensions=[".md", ".txt"])
rag.query(knowledge_base="docs", query="how does authentication work?")

The 14 actions

Each action is registered with an @action decorator:

Tool	Source	Purpose
`rag.create_knowledge_base`		Create a named KB.
`rag.delete_knowledge_base`		Drop a KB + its vector + BM25 indexes.
`rag.list_knowledge_bases`		Enumerate KBs with metadata.
`rag.knowledge_base_stats`		Counts, model, last sync, hit rate.
`rag.ingest`		Add raw text documents.
`rag.ingest_file`		Add a single file.
`rag.ingest_directory`		Walk a directory + add matching files.
`rag.ingest_database`		Index DB tables (rows or schema-only).
`rag.query`		Retrieve from a KB (default strategy or override).
`rag.multi_query`		LLM-expanded query with RRF fusion.
`rag.sql_query`		Text2SQL - generate + run a SELECT.
`rag.clear_cache`		Wipe the semantic cache.
`rag.migrate_embeddings`		Switch a KB to a new embedding model (re-embeds in batches).
`rag.list_models`		List available embedding + reranker shortcuts.

Configuration reference

The schema is RagConfig, mounted under tools.modules.rag.config (the config: wrapper is mandatory; see App Configuration).

tools:
  modules:
    rag:
      config:
        embedding_model: minilm-l12
        reranker: false                # true | "<shortcut>" | "<HF id>"
        backend:
          type: qdrant                  # qdrant | chroma | lancedb | pinecone | pgvector | elasticsearch
          path: ""
          url: ""
          quantization: none            # none | int8 | binary (qdrant only)
        pipeline:
          retrieval: hybrid             # hybrid | semantic | bm25
          bm25_weight: 0.3
          semantic_weight: 0.7
          rerank_top_n: 20              # 0 = skip rerank
          final_top_k: 5
          multi_query:
            enabled: false
            provider: ""                # llm_provider id for query expansion
            num_variants: 3             # 2..10
        chunking:
          strategy: recursive           # fixed | sentence | paragraph | recursive
          size: 500                     # 50..10000
          overlap: 50                   # 0..500
        sources:
          - type: file
            path: "{{workspace}}/docs"
            extensions: [.md, .txt, .pdf]
            watch: true
            recursive: true
            max_files: 1000
          - type: database
            connection_id: crm
            sync:
              strategy: updated_at      # updated_at | changelog | notify
              interval: 30
              auto_create_triggers: true
              prune_after_hours: 24
            tables:
              users:
                columns: [id, name, email, bio, department]
                mode: embed_rows        # schema_only | embed_rows
                template: "{name} ({department}) - {bio}"
                sync: updated_at
                max_rows: 50000
              orders:
                mode: schema_only
        auto_index:
          on_start: true
          schedule: ""                  # cron expr (uses cron_native)
        cache:
          enabled: true
          backend: memory               # memory | redis
          similarity_threshold: 0.95    # 0.80..1.0
          ttl: 3600
          max_entries: 10000
        citations:
          enabled: true
          format: inline                # inline | footnote | structured
          verify: false
        text2sql:
          enabled: false
          provider: ""
          example_cache: true
        crag:
          enabled: false
          provider: ""
          confidence_threshold: 0.5
          fallback: broader_query       # broader_query | none
        adaptive:
          enabled: false
          provider: ""
          strategies: {}
        contextual_retrieval:
          enabled: false
          provider: ""
          concurrency: 5                # 1..20
          prompt_template: ""
        max_knowledge_bases: 50
        max_documents: 100000
        persistence_dir: ""

Embedding models

BUILTIN_MODELS. 7 built-ins, auto-downloaded by FastEmbed (ONNX, no GPU needed):

Shortcut	FastEmbed id	Dims	Notes
`minilm-l12` (default)	`sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`	384	Multilingual, 50 langs, 220 MB.
`bge-m3`	`BAAI/bge-m3`	1024	SOTA multilingual, 100+ langs, 2.3 GB.
`bge-small`	`BAAI/bge-small-en-v1.5`	384	Fast English, 67 MB.
`bge-large`	`BAAI/bge-large-en-v1.5`	1024	Large English, 1.2 GB.
`nomic-v1.5`	`nomic-ai/nomic-embed-text-v1.5`	768	Long-context EN, 8 k tokens.
`jina-v3`	`jinaai/jina-embeddings-v3`	1024	Multilingual 90+ langs, 8 k tokens.
`snowflake-xs`	`snowflake/snowflake-arctic-embed-xs`	384	Lightweight EN, 90 MB.

Custom models

Use any FastEmbed-supported HuggingFace id directly:

config:
  embedding_model: "BAAI/bge-m3"

Or supply a full custom spec:

config:
  embedding_model:
    id: "my-org/custom-embeddings"
    dimensions: 768
    pooling: mean              # mean | cls
    model_file: "onnx/model.onnx"

Live model migration

rag.migrate_embeddings(knowledge_base="docs", target_model="bge-m3") re-embeds all documents in batches and invalidates the semantic cache. No KB downtime.

Reranker models

BUILTIN_RERANKERS. 5 built-ins; default minilm-l6:

Shortcut	HF id	Notes
`minilm-l6` (default)	`Xenova/ms-marco-MiniLM-L-6-v2`	Fast, lightweight.
`minilm-l12`	`Xenova/ms-marco-MiniLM-L-12-v2`	Larger ms-marco model.
`bge-reranker-base`	`BAAI/bge-reranker-base`	Balanced quality.
`jina-reranker-v1-tiny`	`jinaai/jina-reranker-v1-tiny-en`	Ultra-fast English.
`jina-reranker-v2`	`jinaai/jina-reranker-v2-base-multilingual`	Multilingual, 8 k tokens.

config:
  reranker: true                # use the default minilm-l6
  # OR
  reranker: "bge-reranker-base"

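Pipeline with rerank: the retriever hands its top rerank_top_n candidates (default 20) to the cross-encoder, and the final_top_k best-scoring results (default 5) are returned.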

Vector backends

BackendConfig.type. 6 backends, swappable in YAML:

Backend	Mode	Best for	Pip dep
Qdrant	Embedded / remote	Default, zero-config, quantization.	`qdrant-client` (bundled).
ChromaDB	Embedded / remote	Simple local use.	`chromadb`.
LanceDB	Embedded (file)	Serverless, columnar.	`lancedb`, `pyarrow`.
Pinecone	Cloud	Managed, scalable.	`pinecone`.
pgvector	PostgreSQL ext.	When Postgres is already in the stack.	`asyncpg`, `pgvector`.
Elasticsearch	Remote cluster	Existing ES cluster + lexical filters.	`elasticsearch`.

Qdrant (default)

backend:
  type: qdrant
  path: ""                  # "" = in-memory (default)
  # path: /data/qdrant      # persistent on-disk
  # url: http://qdrant:6333 # remote
  quantization: int8        # none | int8 | binary

int8 quantization gives ~3× faster search at <1% recall loss.

ChromaDB

backend:
  type: chroma
  path: /data/chroma          # "" = in-memory
  # url: http://chroma:8000   # remote

LanceDB

backend:
  type: lancedb
  path: /data/lancedb         # always file-based

Pinecone

backend:
  type: pinecone
  api_key: "{{secret.PINECONE_API_KEY}}"
  index_name: my-index
  cloud: aws
  region: us-east-1

pgvector

backend:
  type: pgvector
  dsn: "postgres://user:pass@host:5432/mydb"

Elasticsearch

backend:
  type: elasticsearch
  url: http://es:9200
  # api_key, username/password supported via the ES client config

Retrieval strategies

PipelineConfig.retrieval.

Hybrid (default)

Semantic + BM25 fused via Reciprocal Rank Fusion. The bm25_weight (default 0.3) and semantic_weight (default 0.7) control balance.

config:
  pipeline:
    retrieval: hybrid
    bm25_weight: 0.3
    semantic_weight: 0.7
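
As an illustration of how the two weights could enter the fusion, here is a minimal sketch of weighted Reciprocal Rank Fusion. It assumes each leg contributes weight / (k + rank) per document with the commonly used constant k = 60; the function name, the constant, and the exact formula are assumptions, not taken from the module's source.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked id lists; `rankings` and `weights` are keyed by leg name."""
    scores = defaultdict(float)
    for leg, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            # A higher rank (smaller number) contributes a larger share.
            scores[doc_id] += weights[leg] / (k + rank)
    # Sort by fused score, breaking ties by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


fused = weighted_rrf(
    rankings={"bm25": ["a", "b", "c"], "semantic": ["b", "d", "a"]},
    weights={"bm25": 0.3, "semantic": 0.7},
)
```

Here document b wins because it ranks first in the heavier semantic leg and second in BM25, while c, found only by BM25 at rank 3, ends up last.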

Semantic

Pure vector similarity. Fastest, best for conceptual questions.

BM25

Pure keyword. Best for exact-match queries (error codes, identifiers).

Per-call override

rag.query(knowledge_base="docs", query="error ERR-4052", strategy="bm25")
rag.query(knowledge_base="docs", query="how does caching work?", strategy="semantic")

Multi-query expansion

MultiQueryConfig.

config:
  pipeline:
    multi_query:
      enabled: true
      provider: enrichment       # llm_provider id
      num_variants: 3            # 2..10

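Flow: the LLM generates num_variants rewrites of the query, the original and the variants are searched in parallel, and the ranked lists are fused with RRF.

When no LLM provider is configured, the module falls back to heuristic variants (word slicing, prefix / suffix). Action: rag.multi_query.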

Semantic cache

Results are cached by query-embedding similarity. Reported hit rate: 15-40% in production.

config:
  cache:
    enabled: true
    backend: memory             # memory | redis
    similarity_threshold: 0.95  # 0.80..1.0
    ttl: 3600
    max_entries: 10000
    redis_url: "redis://redis:6379/0"   # required when backend=redis

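How it works: the incoming query is embedded and compared by cosine similarity against cached query embeddings; a match at or above similarity_threshold returns the cached result without running the pipeline.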

Each cache entry records the content hashes of its source documents. When a document is re-ingested, every cache entry that referenced it is invalidated automatically.
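
The lookup and invalidation behaviour can be sketched as a small in-memory cache. This is an assumption-laden illustration: SemanticCache, its method names, the linear scan, and the oldest-first eviction are all hypothetical, and only the parameter names come from the config above.

```python
import math
import time


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, similarity_threshold=0.95, ttl=3600.0, max_entries=10000):
        self.similarity_threshold = similarity_threshold
        self.ttl = ttl
        self.max_entries = max_entries
        self._entries = []  # (query_vec, result, source_hashes, stored_at)

    def get(self, query_vec, now=None):
        now = time.time() if now is None else now
        # Drop expired entries, then look for the closest cached query.
        self._entries = [e for e in self._entries if now - e[3] < self.ttl]
        best = max(self._entries, key=lambda e: cosine(query_vec, e[0]), default=None)
        if best is not None and cosine(query_vec, best[0]) >= self.similarity_threshold:
            return best[1]
        return None

    def put(self, query_vec, result, source_hashes, now=None):
        if len(self._entries) >= self.max_entries:
            self._entries.pop(0)  # evict the oldest entry (assumed policy)
        stored_at = time.time() if now is None else now
        self._entries.append((query_vec, result, frozenset(source_hashes), stored_at))

    def invalidate(self, content_hash):
        # Called on re-ingest: drop every entry that referenced this document.
        self._entries = [e for e in self._entries if content_hash not in e[2]]


cache = SemanticCache()
cache.put([1.0, 0.0], "cached answer", {"hash-auth-md"})
near_hit = cache.get([0.99, 0.05])   # close paraphrase of the cached query
miss = cache.get([0.0, 1.0])         # unrelated query
cache.invalidate("hash-auth-md")     # the source document was re-ingested
after_invalidation = cache.get([1.0, 0.0])
```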

Citations

CitationConfig.

config:
  citations:
    enabled: true
    format: inline              # inline | footnote | structured
    verify: false               # check LLM output for invalid [N] refs

The module formats retrieved results as a numbered context block injected into the LLM input:

## Retrieved context - cite sources using [1], [2], etc.

[1] (source: docs/auth.md, section: Overview, confidence: 0.92)
Authentication uses JWT tokens with RSA-256 signing...

[2] (source: database:crm:users, query: "SELECT count(*) FROM users", confidence: 0.98)
| Total users | Active |
|-------------|--------|
| 12 450      | 11 203 |

[3] (source: policies/security.pdf, page 12, confidence: 0.87)
All API endpoints require a valid bearer token...

The LLM also receives a citation instruction:

When answering, ALWAYS cite your sources using [N] notation. If sources conflict, mention both. If no source supports a claim, say "I don't have a source for this."

When verify: true, the citation post-processor flags references like [7] that don't appear in the context block.
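
A check of this kind can be sketched in a few lines. The function name and the regular expression are assumptions for illustration; the module's post-processor may work differently.

```python
import re


def find_invalid_citations(answer: str, num_sources: int) -> list[int]:
    """Return cited numbers that have no matching entry in the context block."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_sources)


answer = (
    "Tokens are signed with RSA-256 [1] and required on every endpoint [3]. "
    "They rotate weekly [7]."
)
invalid = find_invalid_citations(answer, num_sources=3)
```

With the three-source context block shown above, only the reference [7] is flagged.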

Text2SQL

Text2SQLConfig. Enables the Text2SQL strategy.

tools:
  modules:
    database:
      config:
        connections:
          crm:
            driver: postgresql
            host: db.internal
            database: crm

    rag:
      config:
        text2sql:
          enabled: true
          provider: enrichment      # llm_provider id
          example_cache: true

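How it works: the strategy looks up the schema, asks the configured provider to generate SQL, validates that the statement is a SELECT, and executes it through the database module.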

Safety

The strategy only allows SELECT. All DML (INSERT, UPDATE, DELETE) and DDL (CREATE, DROP, ALTER, TRUNCATE, GRANT) are blocked before execution.
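
A deliberately conservative keyword guard illustrates the idea. This is a sketch under assumptions: is_safe_select is a hypothetical helper, and a production guard would more likely rely on a real SQL parser. The version below also rejects harmless queries whose string literals happen to contain a blocked word, and it rejects WITH queries.

```python
import re

BLOCKED = {"INSERT", "UPDATE", "DELETE", "CREATE", "DROP", "ALTER", "TRUNCATE", "GRANT"}


def is_safe_select(sql: str) -> bool:
    """Accept a single statement that starts with SELECT and has no blocked keyword."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        return False  # more than one statement
    # Identifiers such as updated_at stay one token, so they do not match UPDATE.
    tokens = re.findall(r"[A-Za-z_]+", statement.upper())
    if not tokens or tokens[0] != "SELECT":
        return False
    return not BLOCKED.intersection(tokens)


ok = is_safe_select("SELECT count(*) FROM users WHERE updated_at > now();")
dropped = is_safe_select("DROP TABLE users")
stacked = is_safe_select("SELECT 1; DELETE FROM users")
```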

Example cache

When example_cache: true, validated (question, SQL) pairs are cached. New questions:

similarity > 0.95 → reuse cached SQL directly;
otherwise → cached pairs become few-shot examples for better generation.

Action: rag.sql_query(query="how many active users?", connection_id="crm").

Corrective RAG (CRAG)

CragConfig.

config:
  crag:
    enabled: true
    provider: enrichment
    confidence_threshold: 0.5
    fallback: broader_query       # broader_query | none

Without an LLM provider, CRAG uses the raw retrieval score for filtering.
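
The no-provider path can be sketched as a simple score filter. The function name, the "score" field, and the returned tuple shape are hypothetical; only confidence_threshold and the fallback values come from the config above.

```python
def corrective_filter(results, confidence_threshold=0.5, fallback="broader_query"):
    """Keep results at or above the threshold; name the fallback if none survive."""
    kept = [r for r in results if r["score"] >= confidence_threshold]
    if kept or fallback == "none":
        return kept, None
    return kept, fallback  # caller would retry with a broader query


results = [{"id": "a", "score": 0.82}, {"id": "b", "score": 0.31}]
kept, action = corrective_filter(results)
empty, retry = corrective_filter([{"id": "b", "score": 0.31}])
```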

Adaptive routing

AdaptiveConfig. Picks a strategy per query type:

config:
  adaptive:
    enabled: true
    strategies:
      factual:
        retrieval: semantic
      analytical:
        retrieval: hybrid
        bm25_weight: 0.5
        semantic_weight: 0.5

The query router classifies queries via regex signal detection (<5 ms, no LLM call):

Signal	Pattern examples	Default route
SQL	`how many`, `total`, `average`, `count`, `last quarter`	`sql`
Semantic	`what is`, `explain`, `how does`, `policy on`	`semantic`
Hybrid	`compare`, `difference between`, `versus`	`hybrid`
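
A regex router along these lines might look as follows. The patterns are copied from the table; the precedence order (SQL, then hybrid, then semantic) and the hybrid default for unmatched queries are assumptions made for this sketch.

```python
import re

ROUTES = [
    ("sql", re.compile(r"\b(how many|total|average|count|last quarter)\b", re.I)),
    ("hybrid", re.compile(r"\b(compare|difference between|versus)\b", re.I)),
    ("semantic", re.compile(r"\b(what is|explain|how does|policy on)\b", re.I)),
]


def route(query: str, default: str = "hybrid") -> str:
    """Return the first route whose signal pattern matches the query."""
    for name, pattern in ROUTES:
        if pattern.search(query):
            return name
    return default


sql_route = route("How many users signed up last quarter?")
semantic_route = route("Explain the policy on refunds")
hybrid_route = route("Compare plan A versus plan B")
```

The word boundaries keep "count" from matching inside words such as "account".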

Contextual retrieval

ContextualRetrievalConfig. Pre-generates a per-chunk context sentence (Anthropic-style "Contextual Retrieval") before embedding. Improves recall on long documents at ingest cost.

config:
  contextual_retrieval:
    enabled: true
    provider: enrichment
    concurrency: 5                # 1..20
    prompt_template: |
      <document>{document}</document>
      <chunk>{chunk}</chunk>
      Provide a short context anchoring this chunk in the document.
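
The ingest-time step can be sketched with a semaphore that caps parallel LLM calls at concurrency. Everything here is illustrative: contextualize and fake_llm are hypothetical names, the LLM is a stub, and prepending the context to the chunk text is an assumption about how the two are combined before embedding.

```python
import asyncio

PROMPT = (
    "<document>{document}</document>\n"
    "<chunk>{chunk}</chunk>\n"
    "Provide a short context anchoring this chunk in the document."
)


async def contextualize(document, chunks, llm, concurrency=5):
    """Return each chunk prefixed with an LLM-written context sentence."""
    limit = asyncio.Semaphore(concurrency)  # at most `concurrency` calls in flight

    async def one(chunk):
        async with limit:
            context = await llm(PROMPT.format(document=document, chunk=chunk))
        return f"{context}\n{chunk}"

    # gather preserves input order even though calls finish out of order.
    return await asyncio.gather(*(one(c) for c in chunks))


async def fake_llm(prompt):
    # Stand-in for a real provider call.
    return "Context: from the authentication guide."


enriched = asyncio.run(
    contextualize("full document text", ["chunk one", "chunk two"], fake_llm)
)
```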

Ingestion

The ingestion pipeline detects the file extension, picks an ingestor, chunks per the configured strategy, embeds the chunks, and indexes them in both the BM25 and vector indexes.

Extension	Ingestor	Strategy
`.txt`, `.rst`, `.log`	PlainText	Recursive chunking.
`.md`	Markdown	Split by headers (preserves hierarchy).
`.ts`, `.js`, `.go`, `.rs`, `.java`, `.rb`, `.c`, `.cpp`, `.cs`	Code	Language-aware blocks.
`.csv`	CSV	One document per row.
`.json`	JSON	Flatten objects / arrays.
`.jsonl`	JSONL	One document per line.
`.html`, `.htm`	HTML	Strip tags, extract text.
`.pdf`	PDF	Via `pdf` module (async).
`.xlsx`, `.xls`	Spreadsheet	Via `spreadsheet` module (async).

Incremental ingestion

The IndexingEngine tracks content hashes per file. Re-ingesting an unchanged file is a no-op - no wasted embedding compute.
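
The idea can be sketched with a per-path digest table. HashTracker is a hypothetical stand-in, and SHA-256 is an assumed choice of hash; the real IndexingEngine may differ on both counts.

```python
import hashlib


class HashTracker:
    """Remember the last content digest seen for each path."""

    def __init__(self):
        self._digests = {}

    def needs_ingest(self, path: str, content: bytes) -> bool:
        digest = hashlib.sha256(content).hexdigest()
        if self._digests.get(path) == digest:
            return False  # unchanged: skip chunking and embedding
        self._digests[path] = digest
        return True


tracker = HashTracker()
first = tracker.needs_ingest("docs/auth.md", b"# Auth\nJWT tokens")
repeat = tracker.needs_ingest("docs/auth.md", b"# Auth\nJWT tokens")
edited = tracker.needs_ingest("docs/auth.md", b"# Auth\nJWT tokens, rotated weekly")
```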

Ingest actions

rag.ingest_file(knowledge_base="docs", path="./guide.md")

rag.ingest_directory(
  knowledge_base="docs",
  path="./docs",
  extensions=[".md", ".txt", ".pdf"]
)

rag.ingest(
  knowledge_base="docs",
  documents=["First doc text", "Second doc text"],
  source_type="manual",
  source_id="my-source"
)

rag.ingest_database(
  knowledge_base="crm_data",
  connection_id="crm",
  tables={
    "users":  {"columns": ["name", "bio"], "mode": "embed_rows"},
    "orders": {"mode": "schema_only"}
  }
)

Database sources

DatabaseSourceConfig, TableConfig.

Per-table modes

`mode`	What is indexed	Sync	Use when
`schema_only`	DDL + column descriptions + 5 sample rows.	Schema changes only.	Large tables, analytics, Text2SQL.
`embed_rows`	Each row as a document (templated text).	Row-level sync.	Tables with searchable text content.

sources:
  - type: database
    connection_id: crm
    tables:
      users:
        columns: [id, name, email, bio, department]
        mode: embed_rows
        template: "{name} ({department}) - {bio}"
        sync: updated_at
        max_rows: 50000

      orders:
        mode: schema_only
      # unlisted tables are completely ignored

For embed_rows, the row is rendered through the template (default = column concatenation) before embedding.
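
Rendering a row could look like the sketch below. The {column} placeholders match Python's str.format syntax, which is what this example assumes; render_row is a hypothetical helper, and the space-joined fallback is only one plausible reading of "column concatenation".

```python
def render_row(row: dict, template=None) -> str:
    """Turn a database row into the text that gets embedded."""
    if template:
        return template.format(**row)  # unused columns are simply ignored
    # Assumed default: join all column values in column order.
    return " ".join(str(value) for value in row.values())


row = {"id": 1, "name": "Ada", "department": "R&D", "bio": "Compiler engineer"}
templated = render_row(row, "{name} ({department}) - {bio}")
plain = render_row({"name": "Ada", "bio": "Compiler engineer"})
```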

Database sync

Three strategies:

Strategy	Mechanism	Latency	Prerequisites	Best for
`updated_at`	`WHERE updated_at > watermark`	30 s (configurable)	`updated_at` column + index.	Most tables.
`changelog`	Trigger-based `_rag_changelog` table	30 s	Auto-created triggers.	Tables without `updated_at`.
`notify`	PostgreSQL `LISTEN/NOTIFY`	<1 s	PostgreSQL only.	Near-real-time needs.

sync:
  strategy: updated_at
  interval: 30                 # poll interval (seconds)

sync:
  strategy: changelog
  auto_create_triggers: true
  prune_after_hours: 24

sync:
  strategy: notify
  interval: 30                 # fallback polling on listener disconnect

Guarantees

Resumable - watermarks live in state_snapshot; after restart, sync resumes at the last position.
Idempotent - double-processing is safe (upsert semantics).
Low overhead - 1 indexed query per table per poll (~3 q/s for 100 tables at the default 30 s interval).
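
The updated_at strategy and the guarantees above can be illustrated with an in-memory SQLite table. This is a sketch under assumptions: poll_changes is a hypothetical helper, timestamps are plain integers, and a dict stands in for the index. Note that a strict > comparison can miss rows that share the watermark's timestamp, which a real implementation has to handle.

```python
import sqlite3


def poll_changes(conn, table, watermark):
    """One poll: fetch rows changed since the watermark and advance it."""
    # `table` comes from trusted config here; never interpolate user input.
    rows = conn.execute(
        f"SELECT id, name, updated_at FROM {table} "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [(1, "Ada", 100), (2, "Linus", 110)])

index = {}      # stand-in for the KB; keyed writes make reprocessing idempotent
watermark = 0   # would be restored from persisted state after a restart

rows, watermark = poll_changes(conn, "users", watermark)
for row_id, name, _ in rows:
    index[row_id] = name  # upsert

idle_rows, watermark = poll_changes(conn, "users", watermark)  # nothing changed

conn.execute("UPDATE users SET name = 'Ada L.', updated_at = 120 WHERE id = 1")
changed_rows, watermark = poll_changes(conn, "users", watermark)
for row_id, name, _ in changed_rows:
    index[row_id] = name
```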

Streaming retrieval

When both BM25 and semantic legs are active, the pipeline launches them in parallel and starts the LLM as soon as the first batch returns. This reduces perceived latency on long-context generations.
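
The parallel launch can be sketched with asyncio.as_completed, which yields each leg as soon as it finishes. The two legs are stubs with artificial delays, and all names are hypothetical; in the real pipeline the first batch would be forwarded to the LLM rather than recorded in a list.

```python
import asyncio


async def bm25_leg(query):
    await asyncio.sleep(0)      # stub: keyword lookup returns almost immediately
    return "bm25", ["doc-a"]


async def semantic_leg(query):
    await asyncio.sleep(0.05)   # stub: embedding + ANN search takes longer
    return "semantic", ["doc-b"]


async def stream_batches(query):
    """Run both legs concurrently and handle each batch as it arrives."""
    tasks = [
        asyncio.create_task(bm25_leg(query)),
        asyncio.create_task(semantic_leg(query)),
    ]
    arrival_order = []
    for finished in asyncio.as_completed(tasks):
        leg, docs = await finished
        arrival_order.append((leg, docs))  # the first batch could start the LLM here
    return arrival_order


batches = asyncio.run(stream_batches("how does authentication work?"))
```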

Performance targets

Path	Target	How
Cache hit	<15 ms	Embed query (5 ms) + cosine search (5 ms).
Semantic search	<200 ms + LLM	Embed (5 ms) + ANN search (5 ms) + rerank (~100 ms).
Hybrid search	<200 ms + LLM	Parallel semantic + BM25, RRF fusion.
Text2SQL	<500 ms + LLM	Schema lookup + SQL gen + execute.
Multi-query	<800 ms + LLM	4 parallel searches + fusion.

Optimisation levers:

Local embeddings (FastEmbed ONNX) - 3-8 ms, no API calls.
Quantization (Qdrant int8) - ~3× faster, <1% recall loss.
Semantic cache - eliminates pipeline on 15-40% of queries.
Streaming retrieval - start LLM before all results.
Incremental indexing - content hashing skips unchanged files.

Shared instance, per-app reconfig

The rag module has isolation = "shared" (one instance per daemon, so many apps see the same backend storage). Its on_start runs once at daemon boot with the empty config the module has at that moment, so it starts on the default in-memory backend.

When an app is activated, the bootstrap calls module.on_config_update(cfg) with that app's config. The overridden on_config_update:

Compares old vs new backend path.
Closes the old backend if changed.
Re-creates + initialises the new backend with the new path.
Calls _discover_existing_collections to rebuild _kbs from collections already on disk (populated by previous sessions or offline tools).

This is the only shared module that mutates its backend on per-app activation.

Common config bug: under tools.modules.rag, the backend block MUST live under config:. Without the wrapper, compiled.modules["rag"].config = {}, the bootstrap sees if config: as False, and never calls on_config_update. The result: every query returns "knowledge base not found". See App Configuration → modules for the general rule.
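
The failure mode comes down to an empty dict being falsy. The sketch below reproduces it with plain dictionaries; module_config and activate are hypothetical stand-ins for the compiler and bootstrap, and the assumption is that only the config key of a module block is read.

```python
def module_config(module_block):
    # Assumed behaviour: sibling keys outside `config` are ignored.
    return (module_block or {}).get("config", {})


def activate(module_block, on_config_update):
    config = module_config(module_block)
    if config:  # {} is falsy, so the hook is skipped
        on_config_update(config)
        return True
    return False


calls = []
wrong = {"backend": {"type": "qdrant", "path": "/data/qdrant"}}              # no wrapper
right = {"config": {"backend": {"type": "qdrant", "path": "/data/qdrant"}}}  # wrapped

wrong_applied = activate(wrong, calls.append)
right_applied = activate(right, calls.append)
```

Only the wrapped block reaches the hook; the unwrapped one leaves the module on its boot-time in-memory backend.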

Complete examples

Minimal - zero-config RAG

app:
  app_id: rag-simple
  name: Simple RAG

agents:
  - id: main
    role: assistant
    brain:
      provider: deepseek
      model: deepseek-chat
      backend: openai_compat
      config:
        api_key: "{{secret.DEEPSEEK_API_KEY}}"
        base_url: https://api.deepseek.com/v1
    system_prompt: You answer questions using the RAG knowledge base.

tools:
  modules:
    rag: {}
  capabilities:
    default_policy: auto
    grant:
      - {module: rag}

Documentation assistant

tools:
  modules:
    rag:
      config:
        embedding_model: bge-small
        reranker: true
        sources:
          - type: file
            path: "{{workspace}}"
            extensions: [.md, .txt, .pdf]
            watch: true
        pipeline:
          retrieval: hybrid
          rerank_top_n: 20
          final_top_k: 5
        cache:
          enabled: true
          ttl: 1800
        citations:
          enabled: true
          verify: true
  capabilities:
    default_policy: auto
    grant:
      - {module: rag}

dev:
  variables:
    workspace: ./docs

Enterprise multi-source (DB + documents)

tools:
  modules:
    database:
      config:
        auto_connect:
          - connection_id: crm
            driver: postgresql
            host: db.internal
            database: crm

    rag:
      config:
        embedding_model: bge-m3
        reranker: true
        backend:
          type: qdrant
          path: /data/qdrant
          quantization: int8

        sources:
          - type: database
            connection_id: crm
            sync: {strategy: updated_at, interval: 30}
            tables:
              users:
                columns: [id, name, email, bio, department]
                mode: embed_rows
                template: "{name} ({department}) - {bio}"
              products:
                columns: [id, name, description, category]
                mode: embed_rows
              orders:   {mode: schema_only}
              invoices: {mode: schema_only}

          - type: file
            path: "{{workspace}}/docs"
            extensions: [.md, .txt, .pdf]
            watch: true
          - type: file
            path: "{{workspace}}/policies"
            extensions: [.pdf]
            watch: true

        pipeline:
          retrieval: hybrid
          multi_query: {enabled: true, provider: enrichment, num_variants: 3}
          rerank_top_n: 30
          final_top_k: 5

        text2sql:
          enabled: true
          provider: enrichment

        cache:
          enabled: true
          ttl: 1800
        citations:
          enabled: true
          format: inline
          verify: true

    llm_provider:
      config:
        providers:
          enrichment:
            backend: openai_compat
            model: gpt-4o-mini
            api_key: "{{secret.OPENAI_API_KEY}}"

  capabilities:
    default_policy: auto
    grant:
      - {module: rag}
      - {module: database, actions: [fetch_results]}

Database analytics (no documents)

tools:
  modules:
    database:
      config:
        auto_connect:
          - connection_id: warehouse
            driver: postgresql
            host: analytics.internal
            database: warehouse

    rag:
      config:
        sources:
          - type: database
            connection_id: warehouse
            sync: {strategy: changelog, auto_create_triggers: true}
            tables:
              customers:
                columns: [id, name, segment, lifetime_value]
                mode: embed_rows
                template: "Customer: {name}, segment {segment}, LTV ${lifetime_value}"
              products:
                columns: [id, name, description]
                mode: embed_rows
              orders:  {mode: schema_only}
              revenue: {mode: schema_only}

        text2sql:
          enabled: true
          provider: main_brain
          example_cache: true

  capabilities:
    default_policy: auto
    grant:
      - {module: rag}

Relationship with other modules

Module	Relationship
`vector`	Independent. Use `vector` for raw vector ops, `rag` for full pipelines.
`database`	`rag` calls `database` via the ServiceBus for Text2SQL execution + schema introspection + row fetching.
`pdf`	`rag` calls `pdf.read` via the ServiceBus for PDF ingestion.
`spreadsheet`	`rag` calls `spreadsheet.read` via the ServiceBus for Excel ingestion.
`context_builder`	Shares the FastEmbed singleton when both use `minilm-l12` (no duplicate model load).
`index`	Independent. The RAG indexing engine has its own content-hashing layer.

State persistence

state_snapshot / restore_state persist:

KB metadata (names, descriptions, models, doc counts).
BM25 indexes (serialised term frequencies).
Content hashes (incremental ingestion).
Cache statistics (hit rate, entries, evictions).
Database sync watermarks (resume position).

Vector backend data is persisted independently by the backend itself (Qdrant on disk, LanceDB files, ChromaDB SQLite, etc.).

Cross-references

App-config block reference (tools.modules.rag.config:): App Configuration → tools.modules
Per-module reference (storage backend, advanced knobs): modules/reference/rag.md
Credentials master / per-user keys for DBs and Pinecone: credentials.md
Bundle namespaces (where {{prompt.X}} resolves): Bundle namespaces
