
web

Web search, fetch, content extraction, and downloads. Five backends with automatic fallback. DuckDuckGo is the free default - no API key required.

Property | Value
Module id | web
Version | 1.0.0
Type | user
Config model | WebConfig
Pip deps | aiohttp, beautifulsoup4, html2text

Design notes

  • Free by default - DuckDuckGo works without an API key.
  • Clean content - HTML → markdown-like text via html2text; scripts, ads, navigation, cookie banners stripped.
  • Cached fetches - 15 min default TTL, 100 URL capacity (LRU). Same URL twice = one HTTP request.
  • Fallback resilience - if search_backend fails, the module retries with search_fallback and tags the result with a note: "Primary backend ... failed, used fallback".
  • SSRF-guarded - outbound requests go through an SSRF guard (private-network blocklist + DNS pinning; see Production Deployment → SSRF).
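
The cached-fetch behaviour described above (TTL plus LRU eviction, so the same URL fetched twice costs one HTTP request) can be sketched roughly as follows. This is a minimal illustration, not the module's actual implementation: the names TTLLRUCache and cached_fetch, and the injectable clock, are hypothetical.

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """Tiny LRU cache whose entries also expire after `ttl` seconds (hypothetical helper)."""

    def __init__(self, capacity=100, ttl=900.0, clock=time.monotonic):
        self.capacity = capacity  # 100 URLs, matching the documented default
        self.ttl = ttl            # 900 s = 15 min, matching the documented default
        self.clock = clock        # injectable so expiry can be exercised without sleeping
        self._items = OrderedDict()  # url -> (stored_at, body)

    def get(self, url):
        entry = self._items.get(url)
        if entry is None:
            return None
        stored_at, body = entry
        if self.clock() - stored_at >= self.ttl:
            del self._items[url]  # expired: treat as a miss
            return None
        self._items.move_to_end(url)  # mark as most recently used
        return body

    def put(self, url, body):
        self._items[url] = (self.clock(), body)
        self._items.move_to_end(url)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


def cached_fetch(url, cache, fetch):
    """Return a fresh cached body, otherwise call `fetch(url)` once and store the result."""
    body = cache.get(url)
    if body is None:
        body = fetch(url)
        cache.put(url, body)
    return body
```

Passing the fetch function in as an argument keeps the sketch independent of any particular HTTP client.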

Search backends

Backend | API key | Cost | Best for
duckduckgo (default) | no | free | dev / testing
brave | yes | ~$0.01/q | production, affordable
tavily | yes | ~$0.01/q | AI-agent-shaped structured results
searxng | no (self-host) | free | meta-search across many engines
google | yes + CX | 100 free / day | highest quality
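
The primary-then-fallback behaviour across these backends might look something like the sketch below. It is an assumption-laden illustration: BackendError, search_with_fallback, the backends mapping of callables, and the shape of the returned dict are all hypothetical, and the note text only approximates the wording quoted in the design notes.

```python
class BackendError(Exception):
    """Raised by a (hypothetical) backend callable when a search fails."""


def search_with_fallback(query, backends, primary, fallback=None):
    """Try `primary`; on failure retry once with `fallback` and tag the result with a note."""
    try:
        return {"results": backends[primary](query), "backend": primary}
    except BackendError:
        if fallback is None:
            raise  # nothing to fall back to: surface the original error
        results = backends[fallback](query)
        return {
            "results": results,
            "backend": fallback,
            # Illustrative wording only; the module's exact note format is not shown here.
            "note": f"Primary backend {primary} failed, used fallback",
        }
```

A caller can check for the note key to tell whether the configured search_backend actually served the request.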

Configuration

WebConfig (extra: forbid):

tools:
  modules:
    web:
      config:
        search_backend: duckduckgo   # duckduckgo | brave | tavily | searxng | google
        search_fallback: brave       # used if search_backend fails
        max_content_length: 50000    # 1000..1_000_000
        cache_ttl: 900               # seconds (default 15 min)
        fetch_timeout: 30            # 1..300 seconds
        user_agent: "MyBot/1.0"      # optional override

API keys for brave, tavily, google, etc. do not go in the YAML. Store them in the credentials vault and reference them via credential: (or fall back to {{secret.X}} / {{env.X}}). Outbound allowlists and blocklists live under constraints: (not config.egress); see Constraints below.
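
To make the documented bounds and the extra: forbid behaviour concrete, here is a rough stand-in written with a plain dataclass. It is not the real WebConfig: the class name WebConfigSketch and the loader load_web_config are hypothetical, and the defaults for search_fallback, max_content_length, fetch_timeout and user_agent are assumptions taken from the example values above (only the 15 min cache_ttl default is stated).

```python
from dataclasses import dataclass, fields
from typing import Optional

BACKENDS = {"duckduckgo", "brave", "tavily", "searxng", "google"}


@dataclass
class WebConfigSketch:
    """Hypothetical stand-in for WebConfig; defaults other than cache_ttl are assumed."""
    search_backend: str = "duckduckgo"
    search_fallback: Optional[str] = None
    max_content_length: int = 50000
    cache_ttl: int = 900
    fetch_timeout: int = 30
    user_agent: Optional[str] = None

    def __post_init__(self):
        if self.search_backend not in BACKENDS:
            raise ValueError(f"unknown search_backend: {self.search_backend!r}")
        if self.search_fallback is not None and self.search_fallback not in BACKENDS:
            raise ValueError(f"unknown search_fallback: {self.search_fallback!r}")
        if not 1000 <= self.max_content_length <= 1_000_000:
            raise ValueError("max_content_length must be in 1000..1_000_000")
        if not 1 <= self.fetch_timeout <= 300:
            raise ValueError("fetch_timeout must be in 1..300 seconds")


def load_web_config(raw):
    """Build the config from a mapping, rejecting unknown keys (mimicking extra: forbid)."""
    known = {f.name for f in fields(WebConfigSketch)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return WebConfigSketch(**raw)
```

Under this sketch a stray egress key in config fails at load time instead of being silently ignored.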

The 4 actions

All actions are risk_level: low except web.download (risk_level: medium).

Tool | Purpose
web.search | Search the web. Returns {title, url, snippet} per result plus a sources: [url, ...] field for easy citation.
web.fetch | Fetch a page → clean readable text. HTTP→HTTPS auto-upgrade. Cross-host redirects are detected (returns the redirect URL, doesn't silently follow). Binary content (PDF / image) → suggests download, then read.
web.extract | Extract content using CSS selectors. Internal - prefer fetch(extract=true).
web.download | Download a file to a local path (per-app workspace).
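
Two of the web.fetch behaviours listed above, the HTTP→HTTPS upgrade and cross-host redirect detection, can be approximated with the standard library. The function names are hypothetical and the module's real logic may differ, for example in how it treats explicit ports or subdomains.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def upgrade_to_https(url):
    """Rewrite an http:// URL to https://; other schemes are returned unchanged."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")  # an explicit port, if any, is kept as-is
    return urlunsplit(parts)


def is_cross_host_redirect(requested_url, location):
    """True when a redirect's Location header points at a different host.

    `location` may be relative, so it is resolved against the requested URL first.
    """
    target = urljoin(requested_url, location)
    return urlsplit(requested_url).hostname != urlsplit(target).hostname
```

A fetcher built on this would return the resolved target to the caller when is_cross_host_redirect is true, instead of following it.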

web.search - params

Param | Default | Notes
query | required | Text query.
limit | 10 | Max results.
allowed_domains | null | Per-call domain allowlist.
blocked_domains | null | Per-call domain blocklist.

allowed_domains and blocked_domains are mutually exclusive per call. To layer enforcement, combine the module-level allowed_domains constraint (set under constraints:) with a per-call list.
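
One way to picture the per-call exclusivity rule and the layering of module-level and per-call lists is the sketch below. The helper names are hypothetical, and treating subdomains as matching their parent domain is an assumption, not documented behaviour.

```python
from urllib.parse import urlsplit


def host_matches(host, domain):
    """Assumed rule: a host matches a domain exactly or as one of its subdomains."""
    host, domain = host.lower(), domain.lower()
    return host == domain or host.endswith("." + domain)


def check_call_lists(allowed_domains=None, blocked_domains=None):
    """Reject a call that supplies both per-call lists."""
    if allowed_domains and blocked_domains:
        raise ValueError("allowed_domains and blocked_domains are mutually exclusive per call")


def url_permitted(url, module_allowed=None, module_blocked=None,
                  call_allowed=None, call_blocked=None):
    """A URL passes only if every layer (module-level, then per-call) permits it."""
    check_call_lists(call_allowed, call_blocked)
    host = urlsplit(url).hostname or ""
    for blocked in (module_blocked, call_blocked):
        if blocked and any(host_matches(host, d) for d in blocked):
            return False
    for allowed in (module_allowed, call_allowed):
        if allowed and not any(host_matches(host, d) for d in allowed):
            return False
    return True
```

Because each layer can only narrow the result, a per-call allowlist cannot widen access beyond what the module-level lists permit.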

web.fetch - params

Param | Default | Notes
url | required | Auto-upgraded to HTTPS.
max_length | config | Caps text returned.
extract | false | Main-content extraction (article body, strips nav / footer). Delegates to extract.
prompt | "" | Hint to focus extraction on a specific section.
raw | false | Return raw HTML instead of converted text.

Constraints

Two universal constraints (apply across every action):

Constraint | Type | Description
allowed_domains | string_list | Restrict every web call to these domains.
blocked_domains | string_list | Block these domains from every call.

tools:
  modules:
    web:
      constraints:
        allowed_domains: [docs.python.org, stackoverflow.com]
        blocked_domains: [malware.example.com]
      config:
        search_backend: duckduckgo

Cross-references