RAG Module

The rag module is a production-grade Retrieval-Augmented Generation engine. It unifies files, databases, and free-text sources into named knowledge bases with hybrid retrieval (BM25 + semantic), cross-encoder reranking, source citations, semantic cache, and an optional Text2SQL strategy.

Every claim on this page maps to real code; entries are cited with file + line.

Why a dedicated RAG module?

The existing vector module exposes basic vector ops (add, search, delete) on raw collections. rag builds on top with:

  • Knowledge bases - named, versioned collections that pair a vector index with a parallel BM25 index and per-source metadata.

  • Hybrid retrieval - Reciprocal Rank Fusion (RRF) of BM25 + semantic by default.
  • Cross-encoder reranking for precision.

  • Source citations injected directly into the LLM context.

  • Semantic cache for sub-15 ms repeated queries.

  • Multi-format ingestion (Markdown, PDF, code, CSV, JSON, HTML, databases).

  • Database sync - updated_at, changelog triggers, or LISTEN/NOTIFY.

  • Text2SQL for natural-language questions over structured data.

  • Multi-query expansion for broader recall.

  • Corrective RAG with quality-evaluation fallback.

  • 6 vector backends - Qdrant, ChromaDB, LanceDB, Pinecone, pgvector, Elasticsearch.

Zero-config quick start

tools:
  modules:
    rag: {}

| Setting | Default |
| --- | --- |
| Embedding model | minilm-l12 (384 d, multilingual, 220 MB) |
| Vector backend | Qdrant in-memory |
| Retrieval strategy | Hybrid (BM25 + semantic + RRF) |
| Chunking | recursive, 500 chars, 50 overlap |
| Semantic cache | Enabled, in-memory, 1 h TTL |
| Citations | Enabled, inline |
| Reranker | Disabled |

The agent then creates KBs and ingests via tool calls:

# Agent: "I'll create a knowledge base and index your docs."
rag.create_knowledge_base(name="docs")
rag.ingest_directory(knowledge_base="docs", path="./docs",
                     extensions=[".md", ".txt"])
rag.query(knowledge_base="docs", query="how does authentication work?")

The 14 actions

Each action is declared with an @action decorator:

| Tool | Purpose |
| --- | --- |
| rag.create_knowledge_base | Create a named KB. |
| rag.delete_knowledge_base | Drop a KB + its vector + BM25 indexes. |
| rag.list_knowledge_bases | Enumerate KBs with metadata. |
| rag.knowledge_base_stats | Counts, model, last sync, hit rate. |
| rag.ingest | Add raw text documents. |
| rag.ingest_file | Add a single file. |
| rag.ingest_directory | Walk a directory + add matching files. |
| rag.ingest_database | Index DB tables (rows or schema-only). |
| rag.query | Retrieve from a KB (default strategy or override). |
| rag.multi_query | LLM-expanded query with RRF fusion. |
| rag.sql_query | Text2SQL - generate + run a SELECT. |
| rag.clear_cache | Wipe the semantic cache. |
| rag.migrate_embeddings | Switch a KB to a new embedding model (re-embeds in batches). |
| rag.list_models | List available embedding + reranker shortcuts. |

Configuration reference

RagConfig. Mounted under tools.modules.rag.config: (the config: wrapper is mandatory - see App Configuration).

tools:
  modules:
    rag:
      config:
        embedding_model: minilm-l12
        reranker: false # true | "<shortcut>" | "<HF id>"
        backend:
          type: qdrant # qdrant | chroma | lancedb | pinecone | pgvector | elasticsearch
          path: ""
          url: ""
          quantization: none # none | int8 | binary (qdrant only)
        pipeline:
          retrieval: hybrid # hybrid | semantic | bm25
          bm25_weight: 0.3
          semantic_weight: 0.7
          rerank_top_n: 20 # 0 = skip rerank
          final_top_k: 5
          multi_query:
            enabled: false
            provider: "" # llm_provider id for query expansion
            num_variants: 3 # 2..10
        chunking:
          strategy: recursive # fixed | sentence | paragraph | recursive
          size: 500 # 50..10000
          overlap: 50 # 0..500
        sources:
          - type: file
            path: "{{workspace}}/docs"
            extensions: [.md, .txt, .pdf]
            watch: true
            recursive: true
            max_files: 1000
          - type: database
            connection_id: crm
            sync:
              strategy: updated_at # updated_at | changelog | notify
              interval: 30
              auto_create_triggers: true
              prune_after_hours: 24
            tables:
              users:
                columns: [id, name, email, bio, department]
                mode: embed_rows # schema_only | embed_rows
                template: "{name} ({department}) - {bio}"
                sync: updated_at
                max_rows: 50000
              orders:
                mode: schema_only
        auto_index:
          on_start: true
          schedule: "" # cron expr (uses cron_native)
        cache:
          enabled: true
          backend: memory # memory | redis
          similarity_threshold: 0.95 # 0.80..1.0
          ttl: 3600
          max_entries: 10000
        citations:
          enabled: true
          format: inline # inline | footnote | structured
          verify: false
        text2sql:
          enabled: false
          provider: ""
          example_cache: true
        crag:
          enabled: false
          provider: ""
          confidence_threshold: 0.5
          fallback: broader_query # broader_query | none
        adaptive:
          enabled: false
          provider: ""
          strategies: {}
        contextual_retrieval:
          enabled: false
          provider: ""
          concurrency: 5 # 1..20
          prompt_template: ""
        max_knowledge_bases: 50
        max_documents: 100000
        persistence_dir: ""

Embedding models

BUILTIN_MODELS. 7 built-ins, auto-downloaded by FastEmbed (ONNX, no GPU needed):

| Shortcut | FastEmbed id | Dims | Notes |
| --- | --- | --- | --- |
| minilm-l12 (default) | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 384 | Multilingual, 50 langs, 220 MB. |
| bge-m3 | BAAI/bge-m3 | 1024 | SOTA multilingual, 100+ langs, 2.3 GB. |
| bge-small | BAAI/bge-small-en-v1.5 | 384 | Fast English, 67 MB. |
| bge-large | BAAI/bge-large-en-v1.5 | 1024 | Large English, 1.2 GB. |
| nomic-v1.5 | nomic-ai/nomic-embed-text-v1.5 | 768 | Long-context EN, 8 k tokens. |
| jina-v3 | jinaai/jina-embeddings-v3 | 1024 | Multilingual 90+ langs, 8 k tokens. |
| snowflake-xs | snowflake/snowflake-arctic-embed-xs | 384 | Lightweight EN, 90 MB. |

Custom models

Use any FastEmbed-supported HuggingFace id directly:

config:
  embedding_model: "BAAI/bge-m3"

Or supply a full custom spec:

config:
  embedding_model:
    id: "my-org/custom-embeddings"
    dimensions: 768
    pooling: mean # mean | cls
    model_file: "onnx/model.onnx"

Live model migration

rag.migrate_embeddings(knowledge_base="docs", target_model="bge-m3") re-embeds all documents in batches and invalidates the semantic cache. No KB downtime.

Reranker models

BUILTIN_RERANKERS. 5 built-ins; default minilm-l6:

| Shortcut | HF id | Notes |
| --- | --- | --- |
| minilm-l6 (default) | Xenova/ms-marco-MiniLM-L-6-v2 | Fast, lightweight. |
| minilm-l12 | Xenova/ms-marco-MiniLM-L-12-v2 | Larger ms-marco model. |
| bge-reranker-base | BAAI/bge-reranker-base | Balanced quality. |
| jina-reranker-v1-tiny | jinaai/jina-reranker-v1-tiny-en | Ultra-fast English. |
| jina-reranker-v2 | jinaai/jina-reranker-v2-base-multilingual | Multilingual, 8 k tokens. |

config:
  reranker: true # use the default minilm-l6
  # OR
  reranker: "bge-reranker-base"

Vector backends

BackendConfig.type. 6 backends, swappable in YAML:

| Backend | Mode | Best for | Pip dep |
| --- | --- | --- | --- |
| Qdrant | Embedded / remote | Default, zero-config, quantization. | qdrant-client (bundled). |
| ChromaDB | Embedded / remote | Simple local use. | chromadb. |
| LanceDB | Embedded (file) | Serverless, columnar. | lancedb, pyarrow. |
| Pinecone | Cloud | Managed, scalable. | pinecone. |
| pgvector | PostgreSQL ext. | When Postgres is already in the stack. | asyncpg, pgvector. |
| Elasticsearch | Remote cluster | Existing ES cluster + lexical filters. | elasticsearch. |

Qdrant (default)

backend:
  type: qdrant
  path: "" # "" = in-memory (default)
  # path: /data/qdrant # persistent on-disk
  # url: http://qdrant:6333 # remote
  quantization: int8 # none | int8 | binary

int8 quantization gives ~3× faster search at <1% recall loss.

ChromaDB

backend:
  type: chroma
  path: /data/chroma # "" = in-memory
  # url: http://chroma:8000 # remote

LanceDB

backend:
  type: lancedb
  path: /data/lancedb # always file-based

Pinecone

backend:
  type: pinecone
  api_key: "{{secret.PINECONE_API_KEY}}"
  index_name: my-index
  cloud: aws
  region: us-east-1

pgvector

backend:
  type: pgvector
  dsn: "postgres://user:pass@host:5432/mydb"

Elasticsearch

backend:
  type: elasticsearch
  url: http://es:9200
  # api_key, username/password supported via the ES client config

Retrieval strategies

PipelineConfig.retrieval.

Hybrid (default)

Semantic and BM25 results are fused via Reciprocal Rank Fusion. The bm25_weight (default 0.3) and semantic_weight (default 0.7) control the balance between the two legs.

config:
  pipeline:
    retrieval: hybrid
    bm25_weight: 0.3
    semantic_weight: 0.7
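
To make the fusion step concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion. It is not the module's code: the function name weighted_rrf, the constant k=60 (a conventional RRF default), and the way the two weights scale each leg's contribution are all assumptions for illustration.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked ID lists: each list adds weight / (k + rank) to a document's score.

    `k=60` and the weighting scheme are assumptions, not confirmed module behaviour.
    """
    scores = defaultdict(float)
    for leg, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weights[leg] / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

rankings = {
    "bm25": ["doc_a", "doc_b", "doc_c"],
    "semantic": ["doc_b", "doc_d", "doc_a"],
}
fused = weighted_rrf(rankings, {"bm25": 0.3, "semantic": 0.7})
print(fused)
```

Because RRF works on ranks rather than raw scores, the BM25 and cosine scales never need to be normalised against each other.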

Semantic

Pure vector similarity. Fastest, best for conceptual questions.

BM25

Pure keyword. Best for exact-match queries (error codes, identifiers).

Per-call override

rag.query(knowledge_base="docs", query="error ERR-4052", strategy="bm25")
rag.query(knowledge_base="docs", query="how does caching work?", strategy="semantic")

Multi-query expansion

MultiQueryConfig.

config:
  pipeline:
    multi_query:
      enabled: true
      provider: enrichment # llm_provider id
      num_variants: 3 # 2..10

When no LLM provider is configured, the module falls back to heuristic variants (word slicing, prefix / suffix). Action: rag.multi_query.

Semantic cache

Cached by query embedding similarity. Reported hit rate 15-40 % in production.

config:
  cache:
    enabled: true
    backend: memory # memory | redis
    similarity_threshold: 0.95 # 0.80..1.0
    ttl: 3600
    max_entries: 10000
    redis_url: "redis://redis:6379/0" # required when backend=redis

How it works:

Each cache entry records the content hashes of its source documents. When a document is re-ingested, every cache entry that referenced it is invalidated automatically.
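
The lookup and invalidation behaviour described above can be sketched as follows. This is an illustrative in-memory model only: the class SemanticCache, its method names, and the linear scan are hypothetical, and the real module may store and search entries differently.

```python
import math
import time

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Hypothetical cache keyed by query embedding, with TTL and hash-based invalidation."""

    def __init__(self, similarity_threshold=0.95, ttl=3600):
        self.similarity_threshold = similarity_threshold
        self.ttl = ttl
        self.entries = []  # (embedding, answer, source_hashes, expires_at)

    def put(self, embedding, answer, source_hashes, now=None):
        now = time.time() if now is None else now
        self.entries.append((embedding, answer, set(source_hashes), now + self.ttl))

    def get(self, embedding, now=None):
        now = time.time() if now is None else now
        best, best_sim = None, self.similarity_threshold
        for cached_emb, answer, _hashes, expires_at in self.entries:
            sim = cosine(embedding, cached_emb)
            if expires_at > now and sim >= best_sim:
                best, best_sim = answer, sim
        return best

    def invalidate(self, content_hash):
        """Drop every entry that was built from the re-ingested document."""
        self.entries = [e for e in self.entries if content_hash not in e[2]]

cache = SemanticCache()
cache.put([1.0, 0.0], "JWT with RSA signing", {"hash-auth-v1"}, now=0)
hit = cache.get([0.99, 0.05], now=10)    # near-identical query embedding
miss = cache.get([0.0, 1.0], now=10)     # unrelated query embedding
expired = cache.get([1.0, 0.0], now=3601)
cache.invalidate("hash-auth-v1")
after_invalidation = cache.get([1.0, 0.0], now=10)
```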

Citations

CitationConfig.

config:
  citations:
    enabled: true
    format: inline # inline | footnote | structured
    verify: false # check LLM output for invalid [N] refs

The module formats retrieved results as a numbered context block injected into the LLM input:

## Retrieved context - cite sources using [1], [2], etc.

[1] (source: docs/auth.md, section: Overview, confidence: 0.92)
Authentication uses JWT tokens with RSA-256 signing...

[2] (source: database:crm:users, query: "SELECT count(*) FROM users", confidence: 0.98)
| Total users | Active |
|-------------|--------|
| 12 450 | 11 203 |

[3] (source: policies/security.pdf, page 12, confidence: 0.87)
All API endpoints require a valid bearer token...

The LLM also receives a citation instruction:

When answering, ALWAYS cite your sources using [N] notation. If sources conflict, mention both. If no source supports a claim, say "I don't have a source for this."

When verify: true, the citation post-processor flags references like [7] that don't appear in the context block.
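
A rough sketch of both halves, building the numbered context block and checking an answer for out-of-range references, might look like this. The helper names format_context and invalid_citations and the result dictionary keys are hypothetical; the module's actual post-processor is not shown on this page.

```python
import re

def format_context(results):
    """Render retrieved results as a numbered block (simplified metadata)."""
    lines = ["## Retrieved context - cite sources using [1], [2], etc.", ""]
    for n, r in enumerate(results, start=1):
        lines.append(f"[{n}] (source: {r['source']}, confidence: {r['confidence']:.2f})")
        lines.append(r["text"])
        lines.append("")
    return "\n".join(lines)

def invalid_citations(answer, num_sources):
    """Return cited numbers that do not correspond to any source in the block."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_sources)

results = [
    {"source": "docs/auth.md", "confidence": 0.92, "text": "Authentication uses JWT tokens..."},
    {"source": "policies/security.pdf", "confidence": 0.87, "text": "All API endpoints require..."},
]
context = format_context(results)
bad_refs = invalid_citations("Tokens are JWT [1]. Keys rotate weekly [7].", len(results))
```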

Text2SQL

Text2SQLConfig. Configures the Text2SQL strategy.

tools:
  modules:
    database:
      config:
        connections:
          crm:
            driver: postgresql
            host: db.internal
            database: crm

    rag:
      config:
        text2sql:
          enabled: true
          provider: enrichment # llm_provider id
          example_cache: true

Safety

The strategy allows only SELECT statements. DML (INSERT, UPDATE, DELETE), DDL (CREATE, DROP, ALTER, TRUNCATE), and privilege statements (GRANT) are blocked before execution.
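
As an illustration of what such a guard can look like, the sketch below rejects anything that is not a single SELECT statement. It is a deliberately naive keyword check written for this page, not the module's validator; the name is_safe_select is hypothetical, and a production check would normally use a real SQL parser.

```python
import re

# Keywords taken from the list above; a match anywhere outside string literals is rejected.
BLOCKED = {"INSERT", "UPDATE", "DELETE", "CREATE", "DROP", "ALTER", "TRUNCATE", "GRANT"}

def is_safe_select(sql):
    """Return True only for a single statement that starts with SELECT."""
    # Blank out quoted literals so text such as 'please DROP it' is not misread.
    stripped = re.sub(r"'(?:[^']|'')*'", "''", sql)
    statement = stripped.strip().rstrip(";").strip()
    if ";" in statement:  # more than one statement
        return False
    tokens = re.findall(r"[A-Za-z_]+", statement.upper())
    if not tokens or tokens[0] != "SELECT":
        return False
    return not BLOCKED.intersection(tokens)

print(is_safe_select("SELECT count(*) FROM users WHERE updated_at > '2024-01-01'"))
print(is_safe_select("SELECT 1; DELETE FROM users"))
```

The check is conservative: a column literally named update would be rejected, which is the safer failure mode for generated SQL.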

Example cache

When example_cache: true, validated (question, SQL) pairs are cached. New questions:

  1. similarity > 0.95 → reuse cached SQL directly;
  2. otherwise → cached pairs become few-shot examples for better generation.

Action: rag.sql_query(query="how many active users?", connection_id="crm").

Corrective RAG (CRAG)

CragConfig.

config:
  crag:
    enabled: true
    provider: enrichment
    confidence_threshold: 0.5
    fallback: broader_query # broader_query | none

Without an LLM provider, CRAG uses the raw retrieval score for filtering.
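
The score-only path can be pictured with a short sketch. The function corrective_filter, the result shape, and the returned action labels are hypothetical; in particular, what broader_query does internally is not described here, so the sketch only signals that a retry would be requested.

```python
def corrective_filter(results, confidence_threshold=0.5, fallback="broader_query"):
    """Keep results at or above the threshold; otherwise signal the configured fallback."""
    kept = [r for r in results if r["score"] >= confidence_threshold]
    if kept:
        return {"results": kept, "action": "use"}
    if fallback == "broader_query":
        # Assumption: the caller re-runs retrieval with a relaxed query.
        return {"results": [], "action": "retry_broader"}
    return {"results": [], "action": "none"}

mixed = corrective_filter([{"id": "a", "score": 0.8}, {"id": "b", "score": 0.3}])
weak = corrective_filter([{"id": "b", "score": 0.3}])
weak_no_fallback = corrective_filter([{"id": "b", "score": 0.3}], fallback="none")
```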

Adaptive routing

AdaptiveConfig. Picks a strategy per query type:

config:
  adaptive:
    enabled: true
    strategies:
      factual:
        retrieval: semantic
      analytical:
        retrieval: hybrid
        bm25_weight: 0.5
        semantic_weight: 0.5

The query router classifies queries via regex signal detection (<5 ms, no LLM call):

| Signal | Pattern examples | Default route |
| --- | --- | --- |
| SQL | how many, total, average, count, last quarter | sql |
| Semantic | what is, explain, how does, policy on | semantic |
| Hybrid | compare, difference between, versus | hybrid |
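
A regex router of this kind can be sketched in a few lines. The patterns below are built from the example phrases in the table; the matching order, the fallback route, and the name route are assumptions, and the module's real pattern set is likely larger.

```python
import re

# Checked in order; the first matching signal wins (ordering is an assumption).
ROUTES = [
    ("sql", re.compile(r"\b(how many|total|average|count|last quarter)\b", re.I)),
    ("hybrid", re.compile(r"\b(compare|difference between|versus)\b", re.I)),
    ("semantic", re.compile(r"\b(what is|explain|how does|policy on)\b", re.I)),
]

def route(query, default="hybrid"):
    """Classify a query by regex signals only, without any LLM call."""
    for name, pattern in ROUTES:
        if pattern.search(query):
            return name
    return default

print(route("How many orders did we ship last quarter?"))
print(route("Explain the refund policy"))
```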

Contextual retrieval

ContextualRetrievalConfig. Pre-generates a per-chunk context sentence (Anthropic-style "Contextual Retrieval") before embedding. Improves recall on long documents at ingest cost.

config:
  contextual_retrieval:
    enabled: true
    provider: enrichment
    concurrency: 5 # 1..20
    prompt_template: |
      <document>{document}</document>
      <chunk>{chunk}</chunk>
      Provide a short context anchoring this chunk in the document.
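
The sketch below shows how such a template and concurrency limit could fit together at ingest time. It is an assumption-laden illustration: contextualize_all and fake_generate are hypothetical names, the LLM call is stubbed out, and prepending the context sentence to the chunk before embedding is one common way to apply the technique rather than a confirmed detail of this module.

```python
import asyncio

PROMPT_TEMPLATE = (
    "<document>{document}</document>\n"
    "<chunk>{chunk}</chunk>\n"
    "Provide a short context anchoring this chunk in the document."
)

def build_prompt(document, chunk, template=PROMPT_TEMPLATE):
    """Fill the {document} and {chunk} placeholders of the template."""
    return template.format(document=document, chunk=chunk)

async def contextualize_all(document, chunks, generate, concurrency=5):
    """Generate one context sentence per chunk, at most `concurrency` calls in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def one(chunk):
        async with sem:
            context = await generate(build_prompt(document, chunk))
        # Assumption: the context sentence is prepended before embedding.
        return f"{context}\n\n{chunk}"

    return await asyncio.gather(*(one(c) for c in chunks))

async def fake_generate(prompt):
    """Stand-in for the configured LLM provider."""
    return "From the authentication guide."

enriched = asyncio.run(
    contextualize_all("Full guide text", ["chunk one", "chunk two"], fake_generate, concurrency=2)
)
```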

Ingestion

The ingestion pipeline detects the file extension, picks an ingestor, chunks per the configured strategy, embeds the chunks, and indexes them in both BM25 and the vector backend.

| Extension | Ingestor | Strategy |
| --- | --- | --- |
| .txt, .rst, .log | PlainText | Recursive chunking. |
| .md | Markdown | Split by headers (preserves hierarchy). |
| .ts, .js, .go, .rs, .java, .rb, .c, .cpp, .cs | Code | Language-aware blocks. |
| .csv | CSV | One document per row. |
| .json | JSON | Flatten objects / arrays. |
| .jsonl | JSONL | One document per line. |
| .html, .htm | HTML | Strip tags, extract text. |
| .pdf | PDF | Via pdf module (async). |
| .xlsx, .xls | Spreadsheet | Via spreadsheet module (async). |
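
To show what the size and overlap settings mean in practice, here is a minimal sliding-window chunker. It corresponds to the simplest case (character windows, closest to the fixed strategy), not to the recursive splitter used by default; chunk_fixed is a hypothetical name.

```python
def chunk_fixed(text, size=500, overlap=50):
    """Split text into windows of `size` characters that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

sample = "".join(chr(97 + i % 26) for i in range(1000))
chunks = chunk_fixed(sample)
print([len(c) for c in chunks])
```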

Incremental ingestion

The IndexingEngine tracks content hashes per file. Re-ingesting an unchanged file is a no-op - no wasted embedding compute.
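
The idea can be sketched with a small hash tracker. This is not the IndexingEngine itself: the class HashTracker, its method name, and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib

class HashTracker:
    """Remember the last content hash per path and report whether re-ingestion is needed."""

    def __init__(self):
        self.hashes = {}  # path -> hex digest of the last ingested content

    def needs_ingest(self, path, content: bytes):
        digest = hashlib.sha256(content).hexdigest()  # hash choice is an assumption
        if self.hashes.get(path) == digest:
            return False  # unchanged: skip chunking and embedding
        self.hashes[path] = digest
        return True

tracker = HashTracker()
first = tracker.needs_ingest("guide.md", b"# Guide\nv1")
repeat = tracker.needs_ingest("guide.md", b"# Guide\nv1")
changed = tracker.needs_ingest("guide.md", b"# Guide\nv2")
```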

Ingest actions

rag.ingest_file(knowledge_base="docs", path="./guide.md")

rag.ingest_directory(
    knowledge_base="docs",
    path="./docs",
    extensions=[".md", ".txt", ".pdf"]
)

rag.ingest(
    knowledge_base="docs",
    documents=["First doc text", "Second doc text"],
    source_type="manual",
    source_id="my-source"
)

rag.ingest_database(
    knowledge_base="crm_data",
    connection_id="crm",
    tables={
        "users": {"columns": ["name", "bio"], "mode": "embed_rows"},
        "orders": {"mode": "schema_only"}
    }
)

Database sources

DatabaseSourceConfig, TableConfig.

Per-table modes

| mode | What is indexed | Sync | Use when |
| --- | --- | --- | --- |
| schema_only | DDL + column descriptions + 5 sample rows. | Schema changes only. | Large tables, analytics, Text2SQL. |
| embed_rows | Each row as a document (templated text). | Row-level sync. | Tables with searchable text content. |

sources:
  - type: database
    connection_id: crm
    tables:
      users:
        columns: [id, name, email, bio, department]
        mode: embed_rows
        template: "{name} ({department}) - {bio}"
        sync: updated_at
        max_rows: 50000

      orders:
        mode: schema_only
      # unlisted tables are completely ignored

For embed_rows, the row is rendered through the template (default = column concatenation) before embedding.
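
For illustration, rendering a row through a template of this shape is a one-liner with Python's str.format. The helper render_row is hypothetical, and the space separator used for the default concatenation is an assumption.

```python
def render_row(row, template=None):
    """Turn a row dict into the text that would be embedded."""
    if template is None:
        # Assumed default: concatenate column values separated by spaces.
        return " ".join(str(v) for v in row.values())
    return template.format(**row)

row = {"name": "Ada", "department": "R&D", "bio": "Builds compilers"}
templated = render_row(row, "{name} ({department}) - {bio}")
default_text = render_row(row)
```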

Database sync

Three strategies:

| Strategy | Mechanism | Latency | Prerequisites | Best for |
| --- | --- | --- | --- | --- |
| updated_at | WHERE updated_at > watermark | 30 s (configurable) | updated_at column + index. | Most tables. |
| changelog | Trigger-based _rag_changelog table | 30 s | Auto-created triggers. | Tables without updated_at. |
| notify | PostgreSQL LISTEN/NOTIFY | <1 s | PostgreSQL only. | Near-real-time needs. |

sync:
  strategy: updated_at
  interval: 30 # poll interval (seconds)

sync:
  strategy: changelog
  auto_create_triggers: true
  prune_after_hours: 24

sync:
  strategy: notify
  interval: 30 # fallback polling on listener disconnect

Guarantees

  • Resumable - watermarks live in state_snapshot; after restart, sync resumes at the last position.
  • Idempotent - double-processing is safe (upsert semantics).
  • Low overhead - 1 indexed query per table per poll (~3 q/s for 100 tables at the default 30 s interval).
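
The updated_at strategy and its guarantees can be demonstrated with a self-contained sketch using SQLite in place of the real database. The function poll_changes, the integer timestamps, and the dictionary standing in for the index are all illustrative assumptions.

```python
import sqlite3

def poll_changes(conn, table, watermark):
    """One poll: fetch rows changed since the watermark and return the new watermark."""
    # `table` comes from trusted configuration here, never from user input.
    rows = conn.execute(
        f"SELECT id, name, updated_at FROM {table} "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)")
conn.execute("CREATE INDEX idx_users_updated_at ON users (updated_at)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [(1, "Ada", 100), (2, "Lin", 200)])

index = {}     # stand-in for the KB; keyed by row id, so writes are upserts
watermark = 0  # would be restored from persisted state after a restart

rows, watermark = poll_changes(conn, "users", watermark)
for row_id, name, _ in rows:
    index[row_id] = name

conn.execute("UPDATE users SET name = 'Ada L.', updated_at = 300 WHERE id = 1")
rows, watermark = poll_changes(conn, "users", watermark)
for row_id, name, _ in rows:
    index[row_id] = name  # reprocessing the same row just overwrites it

idle_rows, _ = poll_changes(conn, "users", watermark)
```

Keying the index by row id is what makes double-processing harmless: replaying a poll writes the same values again.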

Streaming retrieval

When both BM25 and semantic legs are active, the pipeline launches them in parallel and starts the LLM as soon as the first batch returns. This reduces perceived latency on long-context generations.
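
The parallel-legs pattern can be sketched with asyncio. Everything here is a stand-in: the two leg coroutines simulate retrieval with sleeps, and where the comment mentions handing a batch to the LLM, the real hand-off mechanism is not documented on this page.

```python
import asyncio

async def bm25_leg(query):
    await asyncio.sleep(0.01)  # simulated fast keyword lookup
    return "bm25", ["doc_a", "doc_b"]

async def semantic_leg(query):
    await asyncio.sleep(0.05)  # simulated slower embed + ANN search
    return "semantic", ["doc_b", "doc_c"]

async def retrieve_streaming(query):
    """Run both legs concurrently and consume each batch as soon as it is ready."""
    tasks = [
        asyncio.create_task(bm25_leg(query)),
        asyncio.create_task(semantic_leg(query)),
    ]
    arrival_order, batches = [], {}
    for finished in asyncio.as_completed(tasks):
        leg, docs = await finished
        arrival_order.append(leg)  # the first batch could be handed to the LLM here
        batches[leg] = docs
    return arrival_order, batches

arrival_order, batches = asyncio.run(retrieve_streaming("how does caching work?"))
```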

Performance targets

| Path | Target | How |
| --- | --- | --- |
| Cache hit | <15 ms | Embed query (5 ms) + cosine search (5 ms). |
| Semantic search | <200 ms + LLM | Embed (5 ms) + ANN search (5 ms) + rerank (~100 ms). |
| Hybrid search | <200 ms + LLM | Parallel semantic + BM25, RRF fusion. |
| Text2SQL | <500 ms + LLM | Schema lookup + SQL gen + execute. |
| Multi-query | <800 ms + LLM | 4 parallel searches + fusion. |

Optimisation levers:

  • Local embeddings (FastEmbed ONNX) - 3-8 ms, no API calls.
  • Quantization (Qdrant int8) - ~3× faster, <1% recall loss.
  • Semantic cache - eliminates pipeline on 15-40% of queries.
  • Streaming retrieval - start LLM before all results.
  • Incremental indexing - content hashing skips unchanged files.

Shared instance, per-app reconfig

The rag module has isolation = "shared": one instance per daemon, so many apps see the same backend storage. Its on_start runs once at daemon boot with whatever config the module has at that moment, which is empty, so it starts with the default in-memory backend.

When an app is activated, the bootstrap calls module.on_config_update(cfg) with that app's config. The overridden on_config_update:

  1. Compares old vs new backend path.
  2. Closes the old backend if changed.
  3. Re-creates + initialises the new backend with the new path.
  4. Calls _discover_existing_collections to rebuild _kbs from collections already on disk (populated by previous sessions or offline tools).

This is the only shared module that mutates its backend on per-app activation.

Common config bug: under tools.modules.rag, the backend block MUST live under config:. Without the wrapper, compiled.modules["rag"].config = {}, the bootstrap sees if config: as False, and never calls on_config_update. The result: every query returns "knowledge base not found". See App Configuration → modules for the general rule.

Complete examples

Minimal - zero-config RAG

app:
  app_id: rag-simple
  name: Simple RAG

agents:
  - id: main
    role: assistant
    brain:
      provider: deepseek
      model: deepseek-chat
      backend: openai_compat
      config:
        api_key: "{{secret.DEEPSEEK_API_KEY}}"
        base_url: https://api.deepseek.com/v1
    system_prompt: You answer questions using the RAG knowledge base.

tools:
  modules:
    rag: {}
  capabilities:
    default_policy: auto
    grant:
      - {module: rag}

Documentation assistant

tools:
  modules:
    rag:
      config:
        embedding_model: bge-small
        reranker: true
        sources:
          - type: file
            path: "{{workspace}}"
            extensions: [.md, .txt, .pdf]
            watch: true
        pipeline:
          retrieval: hybrid
          rerank_top_n: 20
          final_top_k: 5
        cache:
          enabled: true
          ttl: 1800
        citations:
          enabled: true
          verify: true
  capabilities:
    default_policy: auto
    grant:
      - {module: rag}

dev:
  variables:
    workspace: ./docs

Enterprise multi-source (DB + documents)

tools:
  modules:
    database:
      config:
        auto_connect:
          - connection_id: crm
            driver: postgresql
            host: db.internal
            database: crm

    rag:
      config:
        embedding_model: bge-m3
        reranker: true
        backend:
          type: qdrant
          path: /data/qdrant
          quantization: int8

        sources:
          - type: database
            connection_id: crm
            sync: {strategy: updated_at, interval: 30}
            tables:
              users:
                columns: [id, name, email, bio, department]
                mode: embed_rows
                template: "{name} ({department}) - {bio}"
              products:
                columns: [id, name, description, category]
                mode: embed_rows
              orders: {mode: schema_only}
              invoices: {mode: schema_only}

          - type: file
            path: "{{workspace}}/docs"
            extensions: [.md, .txt, .pdf]
            watch: true
          - type: file
            path: "{{workspace}}/policies"
            extensions: [.pdf]
            watch: true

        pipeline:
          retrieval: hybrid
          multi_query: {enabled: true, provider: enrichment, num_variants: 3}
          rerank_top_n: 30
          final_top_k: 5

        text2sql:
          enabled: true
          provider: enrichment

        cache:
          enabled: true
          ttl: 1800
        citations:
          enabled: true
          format: inline
          verify: true

    llm_provider:
      config:
        providers:
          enrichment:
            backend: openai_compat
            model: gpt-4o-mini
            api_key: "{{secret.OPENAI_API_KEY}}"

  capabilities:
    default_policy: auto
    grant:
      - {module: rag}
      - {module: database, actions: [fetch_results]}

Database analytics (no documents)

tools:
  modules:
    database:
      config:
        auto_connect:
          - connection_id: warehouse
            driver: postgresql
            host: analytics.internal
            database: warehouse

    rag:
      config:
        sources:
          - type: database
            connection_id: warehouse
            sync: {strategy: changelog, auto_create_triggers: true}
            tables:
              customers:
                columns: [id, name, segment, lifetime_value]
                mode: embed_rows
                template: "Customer: {name}, segment {segment}, LTV ${lifetime_value}"
              products:
                columns: [id, name, description]
                mode: embed_rows
              orders: {mode: schema_only}
              revenue: {mode: schema_only}

        text2sql:
          enabled: true
          provider: main_brain
          example_cache: true

  capabilities:
    default_policy: auto
    grant:
      - {module: rag}

Relationship with other modules

| Module | Relationship |
| --- | --- |
| vector | Independent. Use vector for raw vector ops, rag for full pipelines. |
| database | rag calls database via the ServiceBus for Text2SQL execution + schema introspection + row fetching. |
| pdf | rag calls pdf.read via the ServiceBus for PDF ingestion. |
| spreadsheet | rag calls spreadsheet.read via the ServiceBus for Excel ingestion. |
| context_builder | Shares the FastEmbed singleton when both use minilm-l12 (no duplicate model load). |
| index | Independent. The RAG indexing engine has its own content-hashing layer. |

State persistence

state_snapshot / restore_state persist:

  • KB metadata (names, descriptions, models, doc counts).
  • BM25 indexes (serialised term frequencies).
  • Content hashes (incremental ingestion).
  • Cache statistics (hit rate, entries, evictions).
  • Database sync watermarks (resume position).

Vector backend data is persisted independently by the backend itself (Qdrant on disk, LanceDB files, ChromaDB SQLite, etc.).

Cross-references