Multi-Core Tor Relay Configuration: Scaling Across CPU Cores
Tor's core routing has historically been single-threaded, but Tor offloads circuit-handshake cryptography to a pool of parallel worker threads. (Conflux, the multi-path protocol added in the 0.4.8 series, splits traffic across circuits; it does not change how a relay uses CPU cores.) Understanding how Tor uses multiple cores, and how to configure a relay accordingly, lets operators get full value from modern multi-core VPS and server hardware. This guide covers Tor's threading model, the NumCPUs option, cryptographic worker threads, and complementary strategies for multi-core relay performance.
Tor's main event loop, which handles circuit operations, scheduling, and routing decisions, runs in a single thread for correctness and simplicity. The most CPU-intensive work at circuit setup is cryptographic, and Tor offloads it to worker threads. The cpuworker subsystem handles ntor handshake computation (an X25519 key exchange plus hash-based key derivation), create cell processing (the relay's side of the circuit key exchange), and TAP handshakes (legacy, less common). Each cpuworker thread processes its jobs independently, so multiple circuit setups proceed in parallel. Relaying data on already-established circuits stays on the main thread. That per-cell work is cheap, so the main thread is not the limiting factor at moderate load, but on very high-bandwidth relays the AES encryption of relay cells in the main event loop becomes the bottleneck, and current Tor versions do not parallelize it.
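One way to observe this threading model on a running relay is to list the threads of the tor process. The sketch below assumes Linux with /proc mounted, a process named tor, and the procps version of ps; the helper name thread_count is hypothetical.

```shell
#!/bin/sh
# Count the threads of a process by listing /proc/<pid>/task (Linux only).
thread_count() {
    ls "/proc/$1/task" 2>/dev/null | wc -l | tr -d ' '
}

# Assumption: the relay runs as a process named "tor".
# With several instances, pidof prints several PIDs; each is inspected in turn.
for pid in $(pidof tor 2>/dev/null); do
    echo "tor pid $pid: $(thread_count "$pid") threads"
    # Per-thread CPU usage and the core each thread last ran on.
    ps -L -o tid,psr,pcpu,comm -p "$pid"
done
```

A relay whose cpuworkers are active should show more than one thread, and the psr column shows which core each thread last ran on.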
NumCPUs Configuration
The NumCPUs option in torrc controls how many cpuworker threads Tor spawns. NumCPUs 0 (the default) auto-detects the core count and uses all available cores; NumCPUs 4 uses exactly 4 cpuworker threads. For most deployments, NumCPUs 0 is correct. Set an explicit value only when auto-detection does not fit the machine (for example, when a CPU quota is lower than the visible core count) or when you want to reserve cores for other work; check the actual core count with nproc first. Setting NumCPUs higher than the available cores can add context-switching overhead, and setting it lower leaves CPU capacity unused. On hyperthreaded CPUs, set NumCPUs to the number of physical cores rather than logical threads only if performance testing shows that hyperthreading does not improve throughput. On ARM CPUs without SMT, such as the Cortex-A53, every reported core is a physical core, so matching NumCPUs to the core count is appropriate.
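As a rough sketch of the physical-core case, the script below counts unique (core, socket) pairs from lscpu and prints a matching torrc line. The helper names count_physical_cores and numcpus_line are hypothetical, and the script assumes util-linux lscpu is installed; whether physical-core sizing actually helps should still be confirmed by measurement.

```shell
#!/bin/sh
# Count unique "core,socket" pairs from `lscpu -p=CORE,SOCKET` output on stdin.
# Comment lines (starting with #) and empty lines are ignored.
count_physical_cores() {
    awk '!/^#/ && NF && !seen[$0]++ { n++ } END { print n + 0 }'
}

# Emit a torrc line; anything that is not a plain number falls back to
# auto-detection (NumCPUs 0).
numcpus_line() {
    case "$1" in
        ''|*[!0-9]*) echo "NumCPUs 0" ;;
        *)           echo "NumCPUs $1" ;;
    esac
}

if command -v lscpu >/dev/null 2>&1; then
    numcpus_line "$(lscpu -p=CORE,SOCKET | count_physical_cores)"
fi
```

Paste the printed line into torrc only if testing shows it beats the default; otherwise leave NumCPUs at 0.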
Measuring Multi-Core Utilization
Monitor per-core CPU usage while the relay is running: mpstat -P ALL 1 prints utilization for every core once per second. If all cores show near-100% utilization, the relay is CPU-saturated and needs more cores or a lower bandwidth limit. If only one core is busy and the rest stay lightly loaded (below 50%), either the relay is not heavily loaded, throughput is limited by the single-threaded event loop, or NumCPUs should be adjusted. On a 4-core relay under high traffic, expect one core (running the main event loop) at 50-80% and the other three (running cpuworkers) at levels that vary with the circuit setup rate. Tor does not pin threads to cores, so the busy core can change over time unless you set CPU affinity yourself. If the cpuworker cores sit near 100% while the main-loop core has headroom, circuit setup rate is the bottleneck; consider raising NumCPUs if it is set below the core count.
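The saturation check described above can be scripted. This is a minimal sketch, assuming the sysstat mpstat layout in the C locale, where summary lines begin with "Average:" and the last column is %idle; the helper name saturated_cores and the 90% threshold are illustrative choices.

```shell
#!/bin/sh
# Read "cpu idle_percent" pairs on stdin and print the cores whose
# utilization (100 - idle) is at or above the threshold (default 90).
saturated_cores() {
    threshold="${1:-90}"
    awk -v t="$threshold" '{ busy = 100 - $2; if (busy >= t) print $1 }'
}

# Take one 1-second sample and reduce it to "cpu idle" pairs.
# Assumption: mpstat comes from sysstat and prints %idle as the last column.
if command -v mpstat >/dev/null 2>&1; then
    LC_ALL=C mpstat -P ALL 1 1 |
        awk '$1 == "Average:" && $2 ~ /^[0-9]+$/ { print $2, $NF }' |
        saturated_cores 90
fi
```

If every core is listed, the relay is CPU-saturated; if only one core is listed, the single-threaded event loop is the likelier limit.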
Multiple Tor Instances for High-Core-Count Servers
On servers with 8 or more CPU cores, running several Tor relay instances can use the available cores more effectively than a single instance, because each instance has its own single-threaded event loop. Each instance has its own ORPort, fingerprint, and bandwidth configuration. Configure them in separate torrc files, for example /etc/tor/instances/relay1/torrc and /etc/tor/instances/relay2/torrc, each with a different ORPort (9001, 9002, and so on) and a separate DataDirectory; instances that share either will conflict. Use systemd template units for clean management. The Tor network counts each instance separately, so 4 instances contribute 4 relays' worth of capacity from one server, subject to the directory authorities' limit on relays per IP address. Instances on one server are treated as related: clients do not place two relays from the same family or the same /16 network in a single circuit, so list every instance in each torrc's MyFamily.
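A per-instance torrc can be generated rather than written by hand. The sketch below prints a minimal configuration for instance N; the helper name instance_torrc, the nicknames, and the /var/lib/tor-instances path are assumptions for illustration, and SocksPort 0 is included because multiple instances would otherwise all try to open the default client port.

```shell
#!/bin/sh
# Print a minimal torrc for instance number N (1-based).
# Nicknames and the data directory path are illustrative, not required values.
instance_torrc() {
    n="$1"
    port=$((9000 + n))
    cat <<EOF
Nickname relay${n}
ORPort ${port}
DataDirectory /var/lib/tor-instances/relay${n}
SocksPort 0
EOF
}

# Preview the files for two instances on stdout instead of writing them.
for i in 1 2; do
    echo "# /etc/tor/instances/relay${i}/torrc"
    instance_torrc "$i"
done
```

On distributions that ship a tor@.service template unit, each instance would then be started with something like systemctl start tor@relay1; check your package's documentation for the exact unit name and directory layout.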
Kernel-Level Optimizations for Multi-Core Relays
The Linux kernel's network stack can be a bottleneck for high-bandwidth multi-core Tor relays. Key optimizations:
RSS (Receive-Side Scaling): distributes network interrupts across CPU cores in hardware. On a 4-core system, set 4 combined queues: ethtool -L eth0 combined 4.
RPS (Receive Packet Steering): distributes packets in software when hardware RSS is unavailable. echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus steers rx queue 0 to all 4 cores (f is the hex bitmask for CPUs 0-3).
NAPI polling: most modern NIC drivers already use NAPI, which reduces interrupt overhead, so no configuration is normally needed.
XDP (eXpress Data Path): on the highest-bandwidth relays, XDP can filter or drop unwanted packets in the NIC driver before they reach the kernel network stack. Tor's own traffic still travels over ordinary TCP sockets, so XDP helps with filtering rather than with relaying itself.
TCP socket buffer tuning for high-bandwidth connections: sysctl -w net.core.rmem_max=134217728 and sysctl -w net.core.wmem_max=134217728.
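The rps_cpus value is a hexadecimal CPU bitmask, so it can be computed from the core count instead of hard-coded. In this sketch the helper names rps_mask and apply_rps are hypothetical; apply_rps needs root and is defined but deliberately not called.

```shell
#!/bin/sh
# Hex CPU bitmask covering CPUs 0..N-1, in the form written to rps_cpus.
# Assumption: N <= 32; larger masks need comma-separated 32-bit groups.
rps_mask() {
    printf '%x\n' $(( (1 << $1) - 1 ))
}

# Write a mask to every rx queue of an interface (requires root).
apply_rps() {
    iface="$1"
    mask="$2"
    for q in /sys/class/net/"$iface"/queues/rx-*; do
        echo "$mask" > "$q/rps_cpus"
    done
}

rps_mask 4    # prints f
```

As root, apply_rps eth0 "$(rps_mask 4)" would cover every rx queue rather than only rx-0. Values written this way, like those set with sysctl -w, do not persist across reboots.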