# Web Module

Web search, fetch, and content extraction with multiple search backends. DuckDuckGo is the free default -- no API key needed.
## Configuration

```yaml
modules:
  web:
    config:
      # Search backend (default: duckduckgo)
      search:
        primary: duckduckgo   # duckduckgo | brave | tavily | searxng | google
        fallback: null        # optional fallback backend
      api_keys:
        brave: "{{env.BRAVE_API_KEY}}"
        tavily: "{{env.TAVILY_API_KEY}}"
        google: "{{env.GOOGLE_API_KEY}}"
        google_cx: "{{env.GOOGLE_CX}}"
      searxng_url: "http://localhost:8080"
      # Content settings
      max_content_length: 50000    # max chars per fetched page (default: 50000)
      user_agent: "Digitorn/1.0"   # HTTP User-Agent header
      # Cache
      cache_ttl: 300       # page cache TTL in seconds (default: 300)
      cache_max_size: 50   # max cached pages (default: 50)
```
## Search Backends

| Backend | Quality | Speed | Cost | API Key | Best for |
|---|---|---|---|---|---|
| DuckDuckGo | Good | ~1s | Free | No | Development, testing |
| Brave | Good | ~400ms | $0.01/query | Yes | Production, affordable |
| Tavily | Excellent | ~500ms | $0.01/query | Yes | AI agents (structured results) |
| SearXNG | Excellent | ~300ms | Free (self-hosted) | No | Meta-search (aggregates Google+Bing) |
| Google CSE | Best | ~200ms | 100 free/day | Yes | Highest quality results |
DuckDuckGo is the default because it requires no API key and no configuration. For production use, Tavily is recommended as it returns AI-optimized results.
## Actions (4)

### search

Search the web. Returns results with title, URL, and snippet.

```
search(query="python asyncio tutorial", limit=5)
```

Response:

```json
{
  "query": "python asyncio tutorial",
  "results": [
    {"title": "Async IO in Python", "url": "https://...", "snippet": "A walkthrough..."}
  ],
  "count": 5,
  "backend": "duckduckgo"
}
```
### fetch

Fetch a web page and convert to clean readable text (markdown-like). Strips scripts, ads, navigation.

```
fetch(url="https://realpython.com/async-io-python/", max_length=5000)
```

Response:

```json
{
  "url": "https://...",
  "title": "Async IO in Python",
  "description": "A hands-on walkthrough...",
  "content": "# Async IO in Python\n\nAsync IO is a...",
  "length": 4823,
  "cached": false
}
```
### extract

Extract specific content from a page using CSS selectors.

```
extract(url="https://...", selector="article, .main-content", max_length=10000)
```
### download

Download a file to a local path.

```
download(url="https://example.com/data.csv", path="/tmp/data.csv")
```
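The download action above can be sketched with the standard library alone. This is an illustrative sketch only, not the module's implementation; the function name mirrors the action, and the default `User-Agent` value is taken from the config example above.

```python
import urllib.request

def download(url, path, user_agent="Digitorn/1.0"):
    """Illustrative sketch: stream a response body to a local file in chunks,
    so large files are never held fully in memory."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp, open(path, "wb") as out:
        while True:
            chunk = resp.read(64 * 1024)  # read in 64 KiB chunks
            if not chunk:
                break
            out.write(chunk)
    return path
```

Streaming in fixed-size chunks (rather than `resp.read()` in one call) keeps memory bounded regardless of file size.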
## HTML Parsing

The module converts HTML to clean text using two strategies:

- html2text (default) -- converts HTML to markdown-like output. Good for articles, docs, blog posts.
- BeautifulSoup (CSS selectors) -- targeted extraction. Good for structured pages.
Noise removal:

- Strips `<script>`, `<style>`, `<nav>`, `<footer>`, `<header>`, `<noscript>`, `<svg>`, `<iframe>`, `<form>`
- Removes common ad/cookie selectors (`.cookie-banner`, `.ad`, `#sidebar`, etc.)
- Collapses excessive whitespace
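The tag-stripping pass can be illustrated with a stdlib-only sketch. The module itself uses html2text and BeautifulSoup, so `NoiseStripper` and `strip_noise` below are hypothetical names showing the idea, not the actual implementation:

```python
from html.parser import HTMLParser

# Tags whose entire contents are discarded (matches the list above)
NOISE_TAGS = {"script", "style", "nav", "footer", "header",
              "noscript", "svg", "iframe", "form"}

class NoiseStripper(HTMLParser):
    """Sketch: collect text nodes, skipping everything nested inside a noise tag."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # >0 while inside one or more noise tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def strip_noise(html):
    parser = NoiseStripper()
    parser.feed(html)
    return " ".join(parser.chunks)  # joining also collapses whitespace
```

Selector-based removal (`.cookie-banner`, `.ad`, etc.) needs a CSS-aware parser like BeautifulSoup and is not shown here.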
## Response Caching

Fetched pages are cached in memory for 5 minutes (configurable). Benefits:
- Same URL fetched twice: second call is instant (~0.1ms vs ~800ms)
- `fetch` then `extract` on the same URL: HTML is reused from cache
- Automatic eviction: oldest entry removed when cache is full
- Cache is per-module-instance (no cross-session pollution)
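The TTL and eviction behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions -- `PageCache` and its methods are hypothetical names, not the module's internals:

```python
import time
from collections import OrderedDict

class PageCache:
    """Sketch of a TTL + size-bounded in-memory page cache."""

    def __init__(self, ttl=300, max_size=50):
        self.ttl = ttl              # seconds before an entry expires
        self.max_size = max_size    # max number of cached pages
        self._store = OrderedDict() # url -> (timestamp, content), oldest first

    def get(self, url):
        entry = self._store.get(url)
        if entry is None:
            return None
        ts, content = entry
        if time.monotonic() - ts > self.ttl:
            del self._store[url]  # expired: drop and report a miss
            return None
        return content

    def put(self, url, content):
        if url in self._store:
            del self._store[url]  # re-insert so it counts as newest
        elif len(self._store) >= self.max_size:
            self._store.popitem(last=False)  # evict the oldest entry
        self._store[url] = (time.monotonic(), content)
```

An `OrderedDict` gives O(1) oldest-entry eviction; making the cache an instance attribute (not module-level) matches the per-module-instance isolation noted above.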
## Fallback Mechanism

If the primary search backend fails (timeout, API error, rate limit), the module automatically retries with the configured fallback:

```yaml
search:
  primary: brave
  fallback: duckduckgo
```

The response includes a `note` field when fallback was used:

```json
{"note": "Primary backend 'brave' failed, used fallback"}
```
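The retry flow can be sketched in a few lines of Python. This is a hedged illustration, not the module's code: `search_with_fallback` is a hypothetical name, and the backends are stand-in callables:

```python
def search_with_fallback(query, primary, fallback=None):
    """Sketch: try the primary backend; on any error, retry with the fallback
    and annotate the response with a 'note' field, as described above."""
    try:
        return {"results": primary(query), "backend": primary.__name__}
    except Exception:
        if fallback is None:
            raise  # no fallback configured: surface the original error
        return {
            "results": fallback(query),
            "backend": fallback.__name__,
            "note": f"Primary backend '{primary.__name__}' failed, used fallback",
        }
```

Reporting which backend actually served the request (the `backend` field) lets callers detect degraded-quality results without parsing the note text.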