Building a .onion Search Engine: Indexing Dark Web Content
The dark web lacks the comprehensive search infrastructure that makes the clearnet navigable. Existing .onion search engines (Ahmia, Torch, Not Evil) index a fraction of available .onion content and often have significant downtime. Building a custom .onion search engine - whether for a specific topic vertical, a specific community, or as a general dark web search tool - requires solving unique challenges: crawling .onion addresses requires Tor routing, index infrastructure must run on .onion-accessible servers, and content filtering must prevent the search engine from indexing and surfacing illegal material. This guide covers the technical architecture for a .onion search engine using Python-based crawlers, Elasticsearch for indexing, and a Flask or FastAPI search interface served through a Tor hidden service.
A .onion crawler routes all HTTP requests through Tor's SOCKS5 proxy (127.0.0.1:9050 by default). Python with Scrapy provides an efficient crawler framework, but Scrapy's built-in proxy support targets HTTP proxies, so it needs either a SOCKS proxy middleware such as scrapy-socks-proxy or a local HTTP-to-SOCKS bridge such as Privoxy in front of Tor. Crawler design: start with a seed list of known .onion URLs (Ahmia's public database, dark.fail listings, and manually curated starting points). For each URL, extract all .onion links from the page (use a regex to find [a-z2-7]{56}\.onion URLs, which matches v3 addresses). Add discovered .onion URLs to the crawl queue. Use a Bloom filter or a Redis set for visited-URL deduplication to avoid re-crawling. Politeness: rotate Tor circuits every 10-50 requests to spread load across relays (onion-service traffic never uses exit relays, so there is no exit load to balance). Add 2-5 second delays between consecutive requests to avoid overwhelming small .onion services. Implement per-domain rate limiting as a hard ceiling: no more than 1 request per second to any single .onion domain.
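The link extraction, deduplication, and per-domain rate limiting described above can be sketched in plain Python. This is a minimal illustration, not a Scrapy integration: `extract_onion_urls`, `Frontier`, and `TOR_PROXIES` are hypothetical names, the in-memory set stands in for the Bloom filter or Redis set, and the proxy dictionary assumes the `requests` library with SOCKS support and Tor on its default port.

```python
import re
from urllib.parse import urlsplit

# Assumption: Tor listens on the default SOCKS port. With `requests` (plus
# PySocks), the "socks5h" scheme lets the proxy resolve hostnames, which
# .onion names require. Shown for reference; not used in this sketch.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

# v3 onion addresses: 56 base32 characters followed by ".onion",
# optionally preceded by subdomains and followed by a path.
ONION_URL_RE = re.compile(
    r"""https?://(?:[a-z0-9-]+\.)*[a-z2-7]{56}\.onion(?:/[^\s<>"']*)?"""
)


def extract_onion_urls(html: str) -> set:
    """Return every v3 .onion URL found in a page's raw HTML."""
    return set(ONION_URL_RE.findall(html))


class Frontier:
    """Crawl queue with URL deduplication and per-host rate limiting."""

    def __init__(self, seeds, min_interval=1.0):
        self.seen = set()      # stand-in for a Bloom filter / Redis set
        self.queue = []
        self.min_interval = min_interval  # seconds between hits per host
        self.last_hit = {}     # host -> time of the last request
        for url in seeds:
            self.add(url)

    def add(self, url):
        """Queue a URL unless it has been seen before."""
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_ready(self, now):
        """Pop the first queued URL whose host may be contacted at `now`."""
        for i, url in enumerate(self.queue):
            host = urlsplit(url).hostname
            if now - self.last_hit.get(host, float("-inf")) >= self.min_interval:
                self.last_hit[host] = now
                return self.queue.pop(i)
        return None  # every queued host is still cooling down
```

A crawl loop would call `next_ready(time.monotonic())`, fetch the returned URL through Tor, pass the response body to `extract_onion_urls`, and feed the results back into `add`. The additional 2-5 second delay between requests would sit in that loop.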
Elasticsearch Index Design for Dark Web Content
Elasticsearch provides full-text search with relevance ranking suitable for .onion content. Index schema: document fields include url (keyword, exact match), title (text, analyzed for search), description (text from the meta description or first paragraph), content (text, full page text), onion_address (keyword), indexed_at (date), language (keyword, for language-based filtering), and category (keyword). Create the index with appropriate analyzers: a custom analyzer for .onion URL extraction and language-aware text analysis for multi-language content (dark web content appears in many languages). Shard configuration: for a single-node deployment, 1-5 primary shards are sufficient, and a single shard is enough at this scale. Set replicas to 0 on a single node, because a replica is never allocated to the same node as its primary. Elasticsearch should listen on localhost only: do not expose its REST API to external networks. Instead, expose a search API (Flask or FastAPI) that queries Elasticsearch and returns results, and serve that API through a .onion hidden service.
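As a rough sketch of the schema above, the index settings and mappings can be written as a Python dictionary. The field names and types follow the list in the text, and the index name `onion_content` is the one used in the search query later in this guide. `build_document` is a hypothetical helper, and the custom and language-aware analyzers are omitted, so every text field falls back to the default analyzer here.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit

INDEX_NAME = "onion_content"

# Single-node assumption: one primary shard, no replicas.
INDEX_BODY = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {
        "properties": {
            "url": {"type": "keyword"},            # exact match only
            "title": {"type": "text"},             # analyzed for search
            "description": {"type": "text"},
            "content": {"type": "text"},
            "onion_address": {"type": "keyword"},
            "indexed_at": {"type": "date"},
            "language": {"type": "keyword"},
            "category": {"type": "keyword"},
        }
    },
}


def build_document(url, title, description, content, language, category,
                   indexed_at=None):
    """Assemble one document matching the mapping above."""
    host = urlsplit(url).hostname or ""
    # Keep only "<56 chars>.onion", dropping any subdomain labels.
    onion_address = ".".join(host.split(".")[-2:])
    ts = indexed_at or datetime.now(timezone.utc)
    return {
        "url": url,
        "title": title,
        "description": description,
        "content": content,
        "onion_address": onion_address,
        "indexed_at": ts.isoformat(),  # ISO 8601, accepted by date fields
        "language": language,
        "category": category,
    }
```

With the official Python client, the `settings` and `mappings` parts of `INDEX_BODY` would be passed to the index-creation call. The exact keyword arguments depend on the client version, so check them against the documentation for the installed release.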
Content Filtering and CSAM Prevention
A .onion search engine that indexes illegal content creates serious legal and ethical problems, so filtering has to be layered:
(1) URL blocklist: maintain a list of known illegal .onion addresses and skip them during crawling. Sources: NCMEC reports, law enforcement bulletins (where publicly shared), and community reports.
(2) Content hash matching: before indexing, compare hashes of any downloaded images against known-CSAM hash databases. PhotoDNA is a perceptual hash rather than a plain file hash, so it also matches resized or re-encoded copies; access to it is limited to vetted organizations.
(3) Text classification: train or use a text classifier to identify content categories. Exclude from indexing: content matching CSAM indicators, content facilitating real-world violence, and content with explicit illegal service offerings.
(4) Human review queue: content whose classifier confidence falls below a threshold is queued for manual review before indexing.
(5) Reporting mechanism: users of the search engine can report indexed content for review and removal.
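The layered decision can be expressed as a single gate that every crawled page passes through before indexing. The sketch below assumes the hash lookup and the text classifier are external components that supply a boolean and a (category, confidence) pair. The `decide` function, the category labels, and the 0.9 threshold are illustrative assumptions, not values from any particular classifier.

```python
from enum import Enum


class Decision(Enum):
    SKIP = "skip"      # never index
    REVIEW = "review"  # hold for a human moderator
    INDEX = "index"    # safe to index


# Hypothetical labels a text classifier might emit for excluded content.
EXCLUDED_CATEGORIES = {"csam_indicator", "violence_facilitation",
                       "illegal_service_offer"}


def decide(onion_address, hash_match, category, confidence,
           blocklist, threshold=0.9):
    """Apply layers 1-4 in order and return what to do with a page."""
    # Layer 1: known-bad addresses are skipped outright.
    if onion_address in blocklist:
        return Decision.SKIP
    # Layer 2: any image hash hit (computed elsewhere) excludes the page.
    if hash_match:
        return Decision.SKIP
    # Layer 4: an uncertain classification goes to a human, whatever
    # category the classifier guessed.
    if confidence < threshold:
        return Decision.REVIEW
    # Layer 3: confident classification into an excluded category.
    if category in EXCLUDED_CATEGORIES:
        return Decision.SKIP
    return Decision.INDEX
```

Checking the confidence before the category is a deliberate choice in this sketch: an uncertain "benign" label is treated as no safer than an uncertain "excluded" label, so both wait for review.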
Search UI Served Through a .onion Hidden Service
The search interface is a web application served via a Tor hidden service. Flask implementation: from flask import Flask, request, jsonify, render_template. Create a /search endpoint accepting q (query) and page (pagination) parameters. Query Elasticsearch: result = es.search(index='onion_content', body={'query': {'multi_match': {'query': q, 'fields': ['title^3', 'description^2', 'content']}}}). The ^3 and ^2 suffixes boost matches in the title and description over matches in the body text. Render results with pagination. Deployment: run Flask on 127.0.0.1:5000, have Nginx listen on 127.0.0.1:80 and proxy to Flask, and expose Nginx through Tor with HiddenServicePort 80 127.0.0.1:80. UI features: a basic search box and results showing title, URL, and a short cached summary. Important: do not cache or serve full pages through the search engine (only excerpts), so that the search UI never serves illegal content even if some slips into the index.
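The query construction and the excerpt-only rule can be kept in two small functions that the /search endpoint calls. This is a sketch under stated assumptions: `build_search_body`, `format_hits`, the page size of 10, and the 200-character excerpt limit are hypothetical choices. The response shape (`hits.hits[]._source`) is the standard Elasticsearch search response.

```python
PAGE_SIZE = 10       # assumption: results per page
EXCERPT_CHARS = 200  # assumption: maximum excerpt length shown in the UI


def build_search_body(q, page=1, page_size=PAGE_SIZE):
    """Build the Elasticsearch request body for one page of results."""
    page = max(1, page)  # guard against page=0 or negative values
    return {
        "from": (page - 1) * page_size,
        "size": page_size,
        # Fetch only what the UI shows; full page text is never returned.
        "_source": ["url", "title", "description"],
        "query": {
            "multi_match": {
                "query": q,
                "fields": ["title^3", "description^2", "content"],
            }
        },
    }


def format_hits(response):
    """Reduce a search response to title, URL, and a short excerpt."""
    results = []
    for hit in response["hits"]["hits"]:
        src = hit["_source"]
        results.append({
            "title": src.get("title", ""),
            "url": src["url"],
            "excerpt": src.get("description", "")[:EXCERPT_CHARS],
        })
    return results
```

Inside a Flask view, the parameters would typically come from `request.args.get("q", "")` and `request.args.get("page", 1, type=int)`, with the list from `format_hits` passed to `render_template`.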
Operational Challenges and Maintenance
Running a public .onion search engine involves significant ongoing maintenance. .onion addresses are volatile: services go down frequently. Build a freshness score that decrements on each failed crawl attempt, and remove index entries for addresses that have been unreachable for 30+ days. Crawl rate management: the Tor network has bandwidth constraints, and an aggressive crawler degrades Tor performance for everyone. Configure the crawler to run at moderate rates (a few hundred pages per hour at most) and schedule bulk crawls during off-peak hours. Storage: a moderately sized dark web index (100,000 pages) requires roughly 10-50 GB of Elasticsearch storage, or about 100-500 KB per page. Plan for index growth and implement retention policies that delete pages unreachable for extended periods. Community governance: a public dark web search engine will face content removal requests, abuse, and manipulation attempts. Establish clear policies and moderation processes before public launch.
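One possible shape for the freshness score and the 30-day removal rule is sketched below. The text only specifies that the score decrements on failure and that entries are removed after 30 or more days unreachable. The starting score of 10, the floor at zero, and the reset on success are assumptions made for illustration, as are the names `Freshness`, `record_crawl`, and `should_remove`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

INITIAL_SCORE = 10                    # assumption: starting freshness
MAX_UNREACHABLE = timedelta(days=30)  # removal window from the text


@dataclass
class Freshness:
    last_success: datetime      # last time a crawl of this address succeeded
    score: int = INITIAL_SCORE  # could be used to down-rank stale results


def record_crawl(state: Freshness, ok: bool, now: datetime) -> None:
    """Update an address's freshness after one crawl attempt."""
    if ok:
        state.score = INITIAL_SCORE  # assumption: success resets the score
        state.last_success = now
    else:
        state.score = max(0, state.score - 1)  # decrement, floor at zero


def should_remove(state: Freshness, now: datetime) -> bool:
    """True once the address has been unreachable for 30 days or more."""
    return now - state.last_success >= MAX_UNREACHABLE
```

Removal is keyed on elapsed time since the last success rather than on the score itself, so an address that is crawled often and fails often is not purged sooner than one that is crawled rarely.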