Ollama LLM Inference Hosting in Romania
Ollama is the easiest way to run open-weight LLMs on a single box: pull a model, hit the /api/generate endpoint, get tokens streamed back. The catch is that Ollama on a CPU-only VPS is painfully slow for any model above 3B parameters. AnubizHost runs Ollama on dedicated RTX 4090 24GB hardware in Romania, with Llama-3-70B Q4, Mistral Large, Qwen 2.5 32B, and DeepSeek V3 distilled pre-cached. From $179/mo, crypto payment, no KYC.
Need this done for your project?
We implement, you ship. Async, documented, done in days.
VRAM Math for Ollama Models
Llama-3-8B Q4_K_M: 5.5GB VRAM, runs at 80-120 tokens/sec on a 4090. Llama-3-70B Q4_K_M: 41GB, does not fit on a single 4090 - we route this to a dual-4090 box or use the Q3 quant which fits in 32GB with KV cache split.
Mistral Small 22B Q4: 14GB. Qwen 2.5 32B Q4: 20GB. DeepSeek V3 distill 14B Q4: 9GB. Phi-3 Medium 14B Q4: 8.5GB. Gemma 2 27B Q4: 17GB.
For the 4090's 24GB, comfortable upper bound is models in the 22-32B parameter range at Q4. For larger models (70B+) we run the Q3 quant or recommend the 2x4090 plan.
Pre-Loaded Models
Out of the box pulled and cached: Llama 3.1 8B Q4, Llama 3.1 70B Q3, Mistral Small 22B Q4, Mistral Nemo 12B Q4, Qwen 2.5 32B Q4, Qwen 2.5 Coder 32B Q4, DeepSeek V3 distilled 14B, Phi-3 Medium Q4, Gemma 2 27B Q4, Llama 3.2 Vision 11B for multimodal work.
Custom models you pull yourself via ollama pull are stored on the 500GB NVMe. Plenty of room for a personal model collection.
Why a Dedicated GPU Matters for LLMs
Shared-GPU services (Together, Groq, Fireworks) price-discriminate aggressively. A Llama-3-70B request costs $0.60 per million output tokens at typical rates, which sounds cheap until you start running coding agents that burn 5M tokens daily. At that volume the monthly bill exceeds our $179/mo dedicated 4090 within two weeks.
The dedicated GPU also means deterministic latency. No queue, no neighbor noise, no rate limit. Your first token latency on Ollama with a hot model is 80-200ms. Generation speed depends on the model: 8B at 100+ tok/s, 22B at 30-50 tok/s, 32B at 20-35 tok/s.
API and Integration Patterns
Ollama exposes an OpenAI-compatible /v1/chat/completions endpoint as of version 0.3.x. Drop-in replace OpenAI clients with base_url pointing at your VPS. Continue.dev, Cursor, Cline, Open WebUI all work out of the box.
We pre-install Open WebUI as a web frontend if you want a ChatGPT-style chat interface. Cloudflare Tunnel or our Tor onion option exposes it securely without opening ports.
For coding agent setups (Aider, Continue, Cline) we document the optimal model selection per task and provide example configs.
Order and Setup
Pay BTC, XMR, LN, USDT. Provisioning 15-20 minutes. Ollama listening on 127.0.0.1:11434 by default. Open WebUI exposed via nginx+TLS at your-handle.anubiz-vps.com. Pre-pulled models ready to chat immediately.
Related: AI hosting, llama.cpp alternative, anonymous VPS, live pricing.
Related Services
Why Anubiz Host
Ready to get started?
Skip the research. Tell us what you need, and we'll scope it, implement it, and hand it back — fully documented and production-ready.