Multi-Core Tor Relay Configuration: Scaling Across CPU Cores

Tor's core routing runs in a single-threaded event loop, but Tor offloads circuit handshake cryptography to a pool of parallel worker threads, so a relay can make use of more than one CPU core. Understanding how Tor uses multiple cores, and how to configure a relay accordingly, helps operators get more value from modern multi-core VPS and server hardware. This guide covers Tor's threading model, the NumCPUs configuration option, cryptographic worker threads, and complementary strategies for multi-core relay performance.

Tor's Threading Architecture

Tor's main event loop (circuit operations, scheduling, and routing decisions) runs in a single thread for correctness and simplicity. The CPU-intensive public-key work of circuit setup is offloaded to worker threads. The cpuworker subsystem handles ntor handshake computation (X25519 key exchange plus hashing for circuit setup), create cell processing (the relay side of the circuit key exchange), and TAP handshakes (legacy, less common). These operations are parallelized: each cpuworker thread processes requests independently, so multiple circuit setups proceed simultaneously. The main thread is therefore not the bottleneck for circuit creation. For very high-bandwidth relays the bottleneck is instead the per-cell AES encryption of relayed traffic, which runs in the main event loop and is not parallelized in current Tor versions.
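The division of labor described above can be illustrated with a toy model. This is only a structural sketch, not Tor's implementation: `fake_handshake` and `run_event_loop` are hypothetical names, the repeated SHA-256 loop merely stands in for the real handshake math, and Python threads will not show a true multi-core speedup for this workload.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def fake_handshake(circuit_id):
    """Stand-in for an expensive circuit handshake (NOT the real ntor math)."""
    digest = str(circuit_id).encode()
    for _ in range(1000):
        digest = hashlib.sha256(digest).digest()
    return circuit_id, digest.hex()[:16]


def run_event_loop(new_circuits, workers=4):
    """Toy main loop: hand handshakes to a worker pool and collect the results.

    A real relay keeps forwarding cells while handshakes are pending; this
    sketch simply waits for each result in submission order.
    """
    established = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = [pool.submit(fake_handshake, cid) for cid in new_circuits]
        for future in pending:
            circuit_id, key = future.result()
            established[circuit_id] = key
    return established
```

Calling `run_event_loop([1, 2, 3])` returns a dictionary keyed by circuit ID, which mirrors how the single main thread stays the owner of circuit state while the workers only compute.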

NumCPUs Configuration

The NumCPUs option in torrc controls the number of cpuworker threads Tor spawns. NumCPUs 0 (the default) auto-detects the available CPU cores and uses all of them; NumCPUs 4 uses exactly 4 cpuworker threads. For most deployments, NumCPUs 0 is correct. Over-provisioning (setting NumCPUs higher than the available cores) can cause context-switching overhead; under-provisioning (setting it lower) leaves CPU capacity unused. If you set the value manually, check the actual core count with nproc and set NumCPUs to match. On hyperthreaded CPUs, set NumCPUs to the number of physical cores rather than logical threads only if performance testing shows that hyperthreading does not improve throughput. On ARM CPUs such as Cortex-A53, which have no SMT and share an L2 cache among the cores of a cluster, setting NumCPUs to the physical core count is appropriate.
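The selection rule above can be written down as a small helper. This is a minimal sketch: `numcpus_line` and its parameters are illustrative names, not part of Tor, and the physical core count is assumed to be supplied by the operator after their own measurements.

```python
import os


def numcpus_line(logical_cores, physical_cores=None, prefer_physical=False):
    """Return a torrc line for NumCPUs (hypothetical helper).

    Defaults to 'NumCPUs 0' (auto-detect). A fixed value is emitted only when
    the caller has decided, after testing, to use physical cores instead.
    """
    if prefer_physical and physical_cores:
        # Never exceed the logical core count, to avoid over-provisioning.
        return f"NumCPUs {min(physical_cores, logical_cores)}"
    return "NumCPUs 0"


def detected_cores():
    """Logical core count as seen by this process, falling back to 1."""
    return os.cpu_count() or 1
```

For example, `numcpus_line(detected_cores())` yields the auto-detect line, while `numcpus_line(8, physical_cores=4, prefer_physical=True)` yields a fixed value of 4 for an 8-thread, 4-core machine.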

Measuring Multi-Core Utilization

Monitor per-core CPU usage while the relay is running: mpstat -P ALL 1 prints the utilization of every core once per second. If all cores show near-100% utilization, the relay is CPU-saturated and needs more cores or a lower bandwidth limit. If all but one core are lightly loaded (below 50%), either the relay is not heavily loaded, parallelism is limited by the single-threaded event loop, or NumCPUs should be adjusted. On a relay with 4 CPU cores, expect one core (running the main event loop) at 50-80% during high traffic and the remaining cores (running cpuworkers) at varying levels depending on the circuit setup rate. Unless you pin Tor's threads yourself, the kernel scheduler decides where they run, so the busy core is not necessarily core 0. If the cpuworker cores are consistently near 100% while the main-loop core has headroom, circuit setup rate is the bottleneck - raise NumCPUs if it is set below the core count, or consider reducing circuit timeout to clear stalled circuits faster.
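Where mpstat is not installed, similar per-core figures can be derived from two snapshots of /proc/stat on Linux. The sketch below assumes the standard /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal); the function names are illustrative.

```python
import time


def parse_cpu_times(stat_text):
    """Map 'cpuN' -> (busy_jiffies, total_jiffies) from /proc/stat text."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Keep per-core lines ("cpu0 ...") and skip the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:]]
        idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
        total = sum(values[:8])  # guest time is already counted inside user/nice
        times[fields[0]] = (total - idle, total)
    return times


def utilization(before, after):
    """Percent busy per core between two parse_cpu_times() snapshots."""
    result = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before.get(cpu, (0, 0))
        delta_total = total1 - total0
        result[cpu] = 100.0 * (busy1 - busy0) / delta_total if delta_total > 0 else 0.0
    return result


def sample(interval=1.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart and return per-core percent."""
    with open(path) as f:
        before = parse_cpu_times(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_times(f.read())
    return utilization(before, after)
```

Calling `sample()` in a loop gives a rough per-second view: one core near saturation with the others mostly idle points at the event loop, while uniformly high values point at overall CPU saturation.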

Multiple Tor Instances for High-Core-Count Servers

On servers with 8 or more CPU cores, running multiple Tor relay instances can use the available cores more effectively than a single instance, because each instance has its own single-threaded event loop. Each instance has its own ORPort, fingerprint, and bandwidth configuration. Configure them in separate torrc files (/etc/tor/instances/relay1/torrc, /etc/tor/instances/relay2/torrc, and so on), each with a different ORPort (9001, 9002, etc.) and a separate DataDirectory. Use systemd template units for clean management. The Tor network counts each instance separately, so running 4 instances provides 4 relays' worth of network contribution from one server. Important: every instance must have a unique ORPort and DataDirectory. Instances run by the same operator should also be declared as one family (the MyFamily option in torrc) so that clients never place two of them in the same circuit.
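The per-instance layout above lends itself to generation by script. The following sketch produces a minimal torrc body for each instance; the `relayN` naming, the `/var/lib/tor-instances/` data directory location, and the `instance_torrc` helper are assumptions for illustration, and a real relay needs further options (contact info, bandwidth limits, family settings) that are omitted here.

```python
def instance_torrc(index, base_orport=9001, name_prefix="relay"):
    """Return (torrc_path, torrc_text) for the instance with 1-based `index`.

    Hypothetical helper: each instance gets a unique ORPort and DataDirectory.
    """
    if index < 1:
        raise ValueError("index is 1-based")
    name = f"{name_prefix}{index}"
    lines = [
        f"Nickname {name}",
        f"ORPort {base_orport + index - 1}",
        # Assumed location; adjust to wherever your distribution keeps state.
        f"DataDirectory /var/lib/tor-instances/{name}",
    ]
    return f"/etc/tor/instances/{name}/torrc", "\n".join(lines) + "\n"


def plan_instances(count, base_orport=9001):
    """Build configs for `count` instances, keyed by torrc path."""
    return dict(instance_torrc(i, base_orport) for i in range(1, count + 1))
```

`plan_instances(4)` returns four path-to-content pairs using ports 9001 through 9004, which can be reviewed before anything is written to disk.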

Kernel-Level Optimizations for Multi-Core Relays

The Linux kernel's network stack can be a bottleneck for high-bandwidth multi-core Tor relays. Key optimizations:

RSS (Receive-Side Scaling) - distribute network interrupts across multiple CPU cores in hardware: ethtool -L eth0 combined 4 (on a 4-core system, set 4 combined queues).
RPS (Receive Packet Steering) - software-level packet distribution when hardware RSS is unavailable: echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus (the hex mask f selects cores 0-3, spreading the rx-0 queue across all 4 cores).
NAPI polling - most modern NIC drivers already use NAPI, which reduces interrupt overhead under load.
XDP (eXpress Data Path) - on the highest-bandwidth relays, XDP programs can handle packets at the NIC driver level before they enter the kernel network stack, for example to drop unwanted traffic cheaply; Tor's own connections still go through the kernel TCP stack.
TCP socket buffer tuning - raise the maximum socket buffer sizes for high-bandwidth connections: sysctl -w net.core.rmem_max=134217728 and sysctl -w net.core.wmem_max=134217728 (128 MiB each).
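The value written to rps_cpus is a hexadecimal CPU bitmask, with bit N selecting core N. A small sketch for computing it follows; the helper names are illustrative, and the mask format shown covers systems with up to 32 CPUs only (larger systems use comma-separated 32-bit groups, which this sketch does not handle).

```python
def rps_cpu_mask(cores):
    """Return the hex bitmask string selecting the given core numbers."""
    mask = 0
    for core in cores:
        if not 0 <= core < 32:
            raise ValueError("this sketch only handles cores 0-31")
        mask |= 1 << core  # bit N enables core N
    return format(mask, "x")


def rps_path(interface, rx_queue):
    """Sysfs file that holds the RPS mask for one receive queue."""
    return f"/sys/class/net/{interface}/queues/rx-{rx_queue}/rps_cpus"
```

For a 4-core system, `rps_cpu_mask(range(4))` gives `f` and `rps_path("eth0", 0)` gives the file to write it to. Passing `[1, 2, 3]` instead gives `e`, which keeps packet steering off core 0.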
