Can I run Llama 3.1 70B on a single 4090?

Q3_K_M quant fits in 24GB with reduced KV cache. Q4 does not fit. For full Q4_K_M 70B you want our 2x4090 dedicated server upgrade.

What is the OpenAI compatibility level?

Ollama implements /v1/chat/completions and /v1/embeddings. Function calling works for models that support tool use (Llama 3.1+, Qwen 2.5+, Mistral Large).

Can I fine-tune on this box?

Yes for LoRA. Use Unsloth or Axolotl for 4090-optimized training. Full fine-tunes for 70B+ require multi-GPU.

Does Ollama leak prompts upstream?

No. Ollama is fully local. No telemetry, no model usage reporting back to anyone. Your prompts stay on your VPS.

AI GPU No-KYC

Ollama LLM Inference Hosting in Romania

Ollama is the easiest way to run open-weight LLMs on a single box: pull a model, hit the /api/generate endpoint, get tokens streamed back. The catch is that Ollama on a CPU-only VPS is painfully slow for any model above 3B parameters. AnubizHost runs Ollama on dedicated RTX 4090 24GB hardware in Romania, with Llama-3-70B Q4, Mistral Large, Qwen 2.5 32B, and DeepSeek V3 distilled pre-cached. From $179/mo, crypto payment, no KYC.

Need this done for your project?

We implement, you ship. Async, documented, done in days.

Start a Brief

VRAM Math for Ollama Models

Llama-3-8B Q4_K_M: 5.5GB VRAM, runs at 80-120 tokens/sec on a 4090. Llama-3-70B Q4_K_M: 41GB, does not fit on a single 4090 - we route this to a dual-4090 box or use the Q3 quant which fits in 32GB with KV cache split.

Mistral Small 22B Q4: 14GB. Qwen 2.5 32B Q4: 20GB. DeepSeek V3 distill 14B Q4: 9GB. Phi-3 Medium 14B Q4: 8.5GB. Gemma 2 27B Q4: 17GB.

For the 4090's 24GB, comfortable upper bound is models in the 22-32B parameter range at Q4. For larger models (70B+) we run the Q3 quant or recommend the 2x4090 plan.

Pre-Loaded Models

Out of the box pulled and cached: Llama 3.1 8B Q4, Llama 3.1 70B Q3, Mistral Small 22B Q4, Mistral Nemo 12B Q4, Qwen 2.5 32B Q4, Qwen 2.5 Coder 32B Q4, DeepSeek V3 distilled 14B, Phi-3 Medium Q4, Gemma 2 27B Q4, Llama 3.2 Vision 11B for multimodal work.

Custom models you pull yourself via ollama pull are stored on the 500GB NVMe. Plenty of room for a personal model collection.

Why a Dedicated GPU Matters for LLMs

Shared-GPU services (Together, Groq, Fireworks) price-discriminate aggressively. A Llama-3-70B request costs $0.60 per million output tokens at typical rates, which sounds cheap until you start running coding agents that burn 5M tokens daily. At that volume the monthly bill exceeds our $179/mo dedicated 4090 within two weeks.

The dedicated GPU also means deterministic latency. No queue, no neighbor noise, no rate limit. Your first token latency on Ollama with a hot model is 80-200ms. Generation speed depends on the model: 8B at 100+ tok/s, 22B at 30-50 tok/s, 32B at 20-35 tok/s.

API and Integration Patterns

Ollama exposes an OpenAI-compatible /v1/chat/completions endpoint as of version 0.3.x. Drop-in replace OpenAI clients with base_url pointing at your VPS. Continue.dev, Cursor, Cline, Open WebUI all work out of the box.

We pre-install Open WebUI as a web frontend if you want a ChatGPT-style chat interface. Cloudflare Tunnel or our Tor onion option exposes it securely without opening ports.

For coding agent setups (Aider, Continue, Cline) we document the optimal model selection per task and provide example configs.

Order and Setup

Pay BTC, XMR, LN, USDT. Provisioning 15-20 minutes. Ollama listening on 127.0.0.1:11434 by default. Open WebUI exposed via nginx+TLS at your-handle.anubiz-vps.com. Pre-pulled models ready to chat immediately.

Related Services

Offshore VPS from $17.90/mo Dedicated Servers DevOps Services

Why Anubiz Host

100% async — no calls, no meetings

Delivered in days, not weeks

Full documentation included

Production-grade from day one

Security-first approach

Post-delivery support included

Bulletproof Hosting Providers

DMCA-Ignored Servers

Offshore VPS Hosting

Anonymous Hosting Solutions

Ready to get started?

Skip the research. Tell us what you need, and we'll scope it, implement it, and hand it back — fully documented and production-ready.

Start a Brief AI VPS from $179/mo