Is archiving .onion sites without permission legal?

Web archival is often defended as a public-benefit activity under fair use, copyright research exceptions, and some web scraping case law, but the answer depends on the jurisdiction and on the content itself. Operators who prohibit crawling in their terms of service or robots.txt have signaled non-consent, so honor robots.txt as a baseline practice during archival. The public-benefit justification is stronger for content of clear historical significance.

How do I archive JavaScript-heavy .onion sites that require browser rendering?

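Use browsertrix-crawler (Webrecorder), which drives a headless Chromium-based browser to render JavaScript and writes the rendered pages to WARC files. Route the browser through Tor's SOCKS port (127.0.0.1:9050 by default) using the crawler's proxy setting; recent releases document a --proxyServer option, so check the documentation for the version you run. When the crawler runs in Docker, 127.0.0.1 refers to the container itself, so the Tor proxy must be reachable from inside the container.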

What is the typical file size of a crawled .onion site in WARC format?

Text-heavy forums: 100 MB to 5 GB depending on post history. Image-heavy sites: 10 to 100 GB. Gzip-compressed WARC reduces file size by roughly 50-70% for text content; images and video that are already compressed shrink very little. Plan storage accordingly. For archiving multiple .onion sites, a small NAS or a cloud storage account is usually more cost-effective than VPS disk.
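As a rough planning aid, the compression figures above can be turned into simple arithmetic. This is only a sketch: the function names are hypothetical, and the 50-70% defaults are the rough range quoted for text content, not a measurement of any particular site.

```python
def compressed_size_range_gb(raw_gb, min_reduction_pct=50, max_reduction_pct=70):
    """Return (best_case_gb, worst_case_gb) for gzip-compressed text content.

    Hypothetical helper: the percentages are the rough 50-70% range quoted
    for text, not a measured value for any particular site.
    """
    best_case = raw_gb * (100 - max_reduction_pct) / 100
    worst_case = raw_gb * (100 - min_reduction_pct) / 100
    return best_case, worst_case


def total_worst_case_gb(raw_sizes_gb):
    """Sum the worst-case compressed size across several text-heavy sites."""
    return sum(compressed_size_range_gb(size)[1] for size in raw_sizes_gb)


if __name__ == "__main__":
    best, worst = compressed_size_range_gb(5)
    print(f"5 GB of raw forum text: plan for {best}-{worst} GB compressed")
```

Sizing against the worst case leaves headroom; image-heavy sites should be budgeted at close to their raw size.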

Can I submit archived dark web content to the Internet Archive?

Yes, through the Archive-It program for partner institutions, or by uploading WARC files to archive.org as an item. An uploaded item is stored and downloadable, but inclusion in the Wayback Machine is decided by the Internet Archive and is not automatic. Illegal content (CSAM or other prohibited material) will not be accepted. The Internet Archive has an interest in preserving historically significant political and press freedom material from censored regions.

How do I ensure archived content attribution is accurate?

Record provenance metadata in the WARC itself, typically in the warcinfo record at the start of the file: capture date, source URL, archiving tool, and any contextual notes about the site. For IPFS storage, create a JSON metadata file alongside the WARC describing the source, the capture date, and what is known about the operator (or that the operator is anonymous). Academic citations should include the .onion address (even if defunct), the capture date, and the archiving organization.
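The JSON metadata file mentioned above could look something like the following sketch. The field names and the sidecar naming scheme (the WARC file name plus ".json") are assumptions made for illustration, not an established schema; adapt them to whatever your archive or repository expects.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def build_provenance(source_url, tool, operator="anonymous", notes="", captured_at=None):
    """Assemble a provenance record; the field names are illustrative only."""
    captured_at = captured_at or datetime.now(timezone.utc)
    return {
        "source_url": source_url,
        "capture_date": captured_at.isoformat(timespec="seconds"),
        "archiving_tool": tool,
        "operator": operator,
        "notes": notes,
    }


def write_sidecar(warc_path, record):
    """Write the record next to the WARC as '<warc name>.json' and return its path."""
    warc_path = Path(warc_path)
    sidecar = warc_path.with_name(warc_path.name + ".json")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return sidecar
```

Keeping the sidecar in the same directory as the WARC means both files travel together when the directory is added to IPFS or uploaded elsewhere.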

Dark Web Content Archival and Preservation: Saving .onion Resources

Dark web resources are inherently ephemeral - .onion sites disappear when operators shut down, are seized, or simply abandon their infrastructure. Unlike clearnet content where the Internet Archive and search engine caches provide some persistence, .onion content has minimal archival infrastructure. Yet dark web resources include historically significant materials: documentation of political movements in censored countries, records of press freedom and whistleblowing cases, technical documentation for privacy tools, and community knowledge accumulated over years. Preserving this content requires dedicated archival effort. This guide covers tools for crawling .onion sites, storing archived content in WARC (Web ARChive) format, using IPFS for distributed archival, and making archived content accessible without re-hosting it at a single vulnerable location.

Tools for Crawling .onion Sites

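Standard web crawlers (wget, curl, Scrapy) must be pointed at Tor's SOCKS proxy to reach .onion addresses, and the proxy has to resolve the hostname itself because .onion names do not exist in ordinary DNS. curl: curl --socks5-hostname 127.0.0.1:9050 http://youronion.onion. Wget with Tor: torsocks wget -r -l 5 --convert-links http://youronion.onion (mirrors up to 5 link levels deep). HTTrack (a website copying tool) takes an HTTP proxy through its -P option rather than a SOCKS5 address, as far as its documented options go, so run it under torsocks or place an HTTP-to-SOCKS bridge such as Privoxy in front of Tor; avoid -s0, which tells HTTrack to ignore robots.txt. For Python-based archival, install requests with SOCKS support (pip install "requests[socks]") and pass proxies={'http': 'socks5h://127.0.0.1:9050', 'https': 'socks5h://127.0.0.1:9050'}; the socks5h scheme makes Tor resolve the hostname. The older pattern of replacing socket.socket with socks.socksocket tends to fail for .onion hosts because the hostname lookup still happens locally unless it is patched as well. Scrapy's built-in proxy middleware expects an HTTP proxy, so it is usually paired with the same kind of HTTP-to-SOCKS bridge. Performance consideration: Tor circuits typically limit download speed to roughly 1-10 Mbit/s, and onion services are often slower, so archiving a large site takes considerably longer than on the clearnet. Respect rate limits: aggressive crawling can disrupt small .onion services with limited resources.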
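A minimal sketch of a polite Python fetcher over Tor follows. It assumes Tor is listening on 127.0.0.1:9050 and that requests is installed with SOCKS support (pip install "requests[socks]"). It uses the socks5h:// proxy scheme so that Tor, not the local resolver, looks up the .onion name. The function names, the "onion-archiver" user agent, and the 5-second delay are illustrative assumptions, not recommendations from any tool's documentation.

```python
import time
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# socks5h: the hostname is resolved by the proxy (Tor), not by local DNS.
TOR_SOCKS = "socks5h://127.0.0.1:9050"


def tor_proxies(proxy_url=TOR_SOCKS):
    """Proxy mapping for requests; both schemes go through the Tor SOCKS port."""
    return {"http": proxy_url, "https": proxy_url}


def parse_robots(robots_txt):
    """Build a robots.txt parser from text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_fetch(base_url, paths, user_agent="onion-archiver", delay_seconds=5.0):
    """Fetch each path allowed by robots.txt, one at a time, pausing in between."""
    import requests  # third-party; imported here so the helpers above need only the stdlib

    session = requests.Session()
    session.proxies.update(tor_proxies())
    session.headers["User-Agent"] = user_agent

    # Treat a missing or failing robots.txt as "no restrictions stated".
    robots_response = session.get(urljoin(base_url, "/robots.txt"), timeout=120)
    robots = parse_robots(robots_response.text if robots_response.ok else "")

    pages = {}
    for path in paths:
        url = urljoin(base_url, path)
        if not robots.can_fetch(user_agent, url):
            continue  # the operator has opted this path out of crawling
        pages[url] = session.get(url, timeout=120).content
        time.sleep(delay_seconds)  # fixed pause to avoid overloading a small service
    return pages
```

This returns raw response bodies only. It does not write WARC records, so treat it as a starting point for custom scripts, not a replacement for a WARC-producing crawler.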

WARC Format for Archival Storage

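WARC (Web ARChive, standardized as ISO 28500) is the standard format for storing web crawl data, used by the Internet Archive and national libraries. WARC files contain HTTP request-response pairs with full headers, preserving not just content but request metadata. Tools for generating WARC files of .onion content: GNU wget can write a WARC alongside its mirror with the --warc-file option. HTTrack produces a browsable mirror directory, not a WARC; it can be converted afterwards, but the original HTTP exchange cannot be fully reconstructed that way. Heritrix (Internet Archive's own crawler) supports SOCKS5 proxy configuration for .onion crawling. browsertrix-crawler (Webrecorder) runs a full browser for JavaScript-heavy sites and generates WARC output; recent releases take the proxy as --proxyServer socks5://127.0.0.1:9050 for .onion support. WARC files can be read and replayed with pywb (Python Wayback) or ReplayWeb.page. For long-term storage: keep WARC files gzip-compressed (crawlers normally emit .warc.gz with each record compressed separately, which replay tools rely on for random access) and record a SHA-256 hash for each file. A hash manifest lets anyone verify that a file has not changed since the hash was recorded. It demonstrates integrity, not that the capture matched what the site actually served, so publish or timestamp the manifest if that matters.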
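The SHA-256 verification step can be sketched with the Python standard library alone. The manifest name SHA256SUMS and the *.warc.gz glob are assumptions chosen for illustration. The line format (hash, two spaces, file name) matches what the sha256sum command-line tool reads and writes.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large WARC files are never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(archive_dir, manifest_name="SHA256SUMS"):
    """Write one '<hash>  <file name>' line per .warc.gz file in archive_dir."""
    archive_dir = Path(archive_dir)
    lines = [f"{sha256_of(warc)}  {warc.name}" for warc in sorted(archive_dir.glob("*.warc.gz"))]
    manifest = archive_dir / manifest_name
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def verify_manifest(manifest_path):
    """Return the names of files whose current hash differs from the manifest."""
    manifest_path = Path(manifest_path)
    mismatched = []
    for line in manifest_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split("  ", 1)
        if sha256_of(manifest_path.parent / name) != expected:
            mismatched.append(name)
    return mismatched
```

An empty list from verify_manifest means every listed file still matches its recorded hash. Storing a copy of the manifest somewhere separate from the WARC files makes silent tampering harder.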

IPFS for Distributed Archival

IPFS (InterPlanetary File System) provides content-addressed distributed storage where archived files are accessible by their cryptographic hash instead of a URL. Adding archived .onion content to IPFS: ipfs add -r archiveddirectory/. IPFS returns a CID (Content Identifier), a hash-based address. The CID can be shared and the content retrieved through any IPFS node or public gateway, as long as at least one reachable node still pins it. For dark web archival: pin important WARC files to IPFS and share the CIDs through trusted channels. When multiple archivists pin the same CID, the content survives even if individual nodes go offline. Combine IPFS with Tor: reach an IPFS gateway over Tor, or run an IPFS node whose gateway is exposed at a .onion address. The combination provides distributed persistence (IPFS) and anonymous access for readers (Tor). Note that an IPFS node advertises its own IP address to peers by default, so the archivist's node is not anonymous unless it is isolated behind additional measures.
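The add, pin, and share workflow can be scripted around the ipfs command-line tool (Kubo). The following is a sketch that assumes the ipfs binary is installed, on the PATH, and has an initialized repository. The helper names and the default gateway are illustrative choices, and any CID passed in the examples is a placeholder, not a real one.

```python
import subprocess


def ipfs_add_command(path, cid_version=1):
    """Command that adds a directory recursively and prints only the root CID."""
    return ["ipfs", "add", "--recursive", "--quieter", f"--cid-version={cid_version}", str(path)]


def ipfs_pin_command(cid):
    """Command another archivist runs to pin the same content on their own node."""
    return ["ipfs", "pin", "add", cid]


def gateway_url(cid, gateway="https://ipfs.io"):
    """Clearnet URL for a CID on a path-style public gateway."""
    return f"{gateway.rstrip('/')}/ipfs/{cid}"


def add_directory(path):
    """Run 'ipfs add' and return the root CID it prints (requires a local ipfs install)."""
    result = subprocess.run(ipfs_add_command(path), check=True, capture_output=True, text=True)
    return result.stdout.strip()
```

A typical flow is to call add_directory on the archive directory, send the returned CID to other archivists, and have each of them run the command from ipfs_pin_command so the content has more than one provider.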

Legal and Ethical Framework for .onion Archival

Archiving .onion content raises several legal and ethical questions. Copyright: most .onion content carries no explicit copyright notice, but copyright generally attaches automatically, so operators still hold rights in what they publish. Archiving for historical preservation without commercial use is defensible under fair use principles in the US and research exceptions in EU copyright law. For CSAM or other illegal content encountered during archival: do not archive it, report it (NCMEC, INHOPE), and document the encounter without retaining the content. For content removed from a .onion site under legal pressure: archiving it may keep in circulation material the operator was legally required to remove, so apply judgment about whether it warrants historical preservation despite its removal. For content documenting political repression: historical preservation value is high; archive it with context about the circumstances of original publication and removal.

Making Archived Content Accessible

Archived .onion content should be accessible to future researchers without requiring them to run Tor or find the original (possibly defunct) .onion address. Options: (1) Upload WARC files to the Internet Archive (archive.org) as an item, which makes them downloadable through archive.org's clearnet interface; replay through the Wayback Machine is generally limited to partner collections such as Archive-It, so do not count on it. (2) Host a pywb replay server on a clearnet or .onion server that serves archived content from WARC files. (3) IPFS public gateways (ipfs.io, dweb.link) provide clearnet access to IPFS-pinned archived content. (4) For sensitive archived content (political repression documentation, whistleblowing records): publish through academic repositories (Zenodo, Harvard Dataverse) with appropriate metadata describing the provenance and historical context. Each method serves different use cases, so combine several for redundancy.
