Can I run multiple llama-server instances?

Yes via systemd templated units, but they share the GPU. We document a config where one server runs the main model and a second runs a smaller embedding model in parallel using GPU slicing.

Does llama.cpp support function calling?

Yes for models with native tool use (Llama 3.1+, Qwen 2.5+, Mistral Large). The --jinja flag enables chat-template-based tool parsing.

What about FlashAttention?

Enabled by default in our build for models that support it. About 15-25% speedup on long-context generation.

Can I use this for embeddings?

Yes. llama-server supports embedding models. Run nomic-embed-text or BGE-large alongside your chat model.

AI GPU No-KYC

llama.cpp Server Hosting in Romania

llama.cpp is the most efficient open-source LLM runtime for single-GPU and CPU-GPU hybrid inference. Its built-in HTTP server (llama-server) provides an OpenAI-compatible endpoint with first-class support for GGUF quantization, speculative decoding, and grammar-constrained sampling. AnubizHost provisions llama-server on dedicated RTX 4090 hardware in Romania, with the most popular GGUF builds pre-cached and a clean Debian 12 + CUDA 12.4 base. From $179/mo with crypto payment and no KYC.

Need this done for your project?

We implement, you ship. Async, documented, done in days.

Start a Brief

Why llama.cpp Over Ollama or vLLM

Ollama is the easiest path to running an LLM but it wraps llama.cpp and hides knobs. For production deployments where you care about exact quantization, sampler choice, draft model speculation, or grammar-constrained output, going to llama-server directly is the right call.

llama.cpp also supports CPU-GPU split inference cleanly, which matters when you want to run a 70B class model where some layers offload to system RAM. The 4090 + 32GB host RAM combo handles Llama 3.1 70B Q5_K_M with about 40 layers on GPU and the rest on CPU at acceptable speed (8-15 tok/s).

VRAM and CPU-GPU Split Math

Llama 3.1 70B Q4_K_M (41GB): GPU layers 35-38, rest on CPU. Generation ~12 tok/s. Q5_K_M (49GB): GPU 30 layers, ~9 tok/s. Q3_K_M (32GB): full GPU, 35 tok/s.

Mistral Large 123B Q4_K_M (73GB): GPU 25 layers, ~5 tok/s. Marginal for interactive use, fine for batch processing.

Qwen 2.5 72B Q4_K_M (44GB): GPU 35 layers, ~10 tok/s. Q3_K_M (32GB): full GPU, ~30 tok/s.

Server Features We Pre-Configure

llama-server flags we ship in our systemd unit: --jinja for chat template support, --reasoning-format auto for thinking-model output, --slot-save-path for KV cache persistence across restarts, --n-predict 4096 default with override per request, --metrics for Prometheus scraping.

Speculative decoding pre-wired: Llama 3.1 70B uses Llama 3.2 1B as draft model, roughly 1.8x throughput. Qwen 2.5 72B paired with Qwen 2.5 1.5B, ~1.6x speedup.

Grammar-constrained generation via GBNF for tool calling, JSON output, SQL synthesis. Pre-built grammars in /etc/llama-server/grammars.

Hardware Profile

RTX 4090 24GB, AMD EPYC 8-core dedicated, 32GB DDR4 ECC, 500GB NVMe, 1Gbps unmetered. CPU spec matters here because CPU-GPU split inference on big models hits the CPU memory bandwidth hard. EPYC Genoa has ~460 GB/s memory bandwidth, comfortable for 70B Q4 split inference.

Order and Operate

Pay BTC, XMR, LN, USDT. Provision 15-20 minutes. llama-server bound to localhost; nginx TLS reverse proxy on auto-generated subdomain exposes the OpenAI-compatible /v1 endpoint. Pre-cached models: Llama 3.1 8B, Llama 3.1 70B Q4, Qwen 2.5 32B Q4, Qwen 2.5 Coder 32B Q4, Mistral Nemo 12B Q4.

Related Services

Offshore VPS from $17.90/mo Dedicated Servers DevOps Services

Why Anubiz Host

100% async — no calls, no meetings

Delivered in days, not weeks

Full documentation included

Production-grade from day one

Security-first approach

Post-delivery support included

Bulletproof Hosting Providers

DMCA-Ignored Servers

Offshore VPS Hosting

Anonymous Hosting Solutions

Ready to get started?

Skip the research. Tell us what you need, and we'll scope it, implement it, and hand it back — fully documented and production-ready.

Start a Brief AI VPS from $179/mo