LLM Hosting: Self-Host Llama and Mistral on a VPS
Open-source LLMs such as Llama 3, Mistral, and Qwen now reach near-ChatGPT quality on many tasks. By self-hosting an LLM on an offshore GPU VPS, you avoid sending sensitive prompts to OpenAI, cut long-term inference costs, and keep full control over your data. AnubizHost offers GPU VPS plans in Iceland suited to running 7B-70B LLMs, with crypto payment and no identity verification. This article walks through an end-to-end LLM deployment with vLLM and Ollama.
Why Self-Host an LLM Instead of Using the OpenAI API
OpenAI, Anthropic, and Google Gemini offer powerful LLMs, but all of them require sending your full prompts and data to their servers. For the following use cases, self-hosting an LLM is effectively mandatory:
- Legal and medical document processing: GDPR and HIPAA prohibit sending PII to third parties without a BAA. A self-hosted LLM on a dedicated VPS satisfies these compliance requirements.
- Journalistic source analysis: journalists protecting their sources cannot send investigation details to a US company. Icelandic jurisdiction adds a further layer of protection.
- Internal product R&D: prompts containing business strategy or proprietary code should never leak outside the company.
- Long-term inference cost: one million GPT-4 tokens cost around $30. With a $200/month GPU VPS running Llama 3 70B, you can process millions of tokens per day at no marginal cost.
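The cost argument above can be made concrete with a quick break-even estimate. This is a sketch: the $30 per million tokens and $200/month figures are the article's round numbers, not current prices.

```python
# Rough break-even estimate: hosted API vs. a fixed-price GPU VPS.
API_COST_PER_M_TOKENS = 30.0   # USD, GPT-4-class hosted API (illustrative)
VPS_COST_PER_MONTH = 200.0     # USD, flat rate regardless of volume

def monthly_api_cost(tokens_per_day: float) -> float:
    """Hosted-API cost for a 30-day month at the given daily token volume."""
    return tokens_per_day * 30 / 1_000_000 * API_COST_PER_M_TOKENS

# Break-even: the daily token volume where both options cost the same.
break_even_tokens_per_day = VPS_COST_PER_MONTH / API_COST_PER_M_TOKENS * 1_000_000 / 30
print(f"Break-even: ~{break_even_tokens_per_day:,.0f} tokens/day")
# Above that volume the VPS is cheaper; below it, the hosted API is.
```

At these prices the VPS pays for itself at roughly 220k tokens per day, which a single busy application can easily exceed.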
A mid-range AnubizHost GPU VPS (1x RTX 4090 24GB or equivalent) is enough to run Llama 3 8B and Mistral 7B at roughly 50-100 tokens/s. Larger models such as Llama 3 70B require multiple GPUs or a single H100/A100.
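A rough way to size GPU memory for these models is weights-dominated arithmetic: parameter count times bytes per parameter, plus overhead for KV cache and activations. The quantization table and 20% overhead below are back-of-envelope assumptions; real usage also depends on context length and batch size.

```python
# Back-of-envelope VRAM estimate for LLM inference.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def vram_gb(params_billion: float, quant: str, overhead: float = 1.2) -> float:
    """Approximate VRAM in GB to serve a model: weights x bytes/param x overhead."""
    return params_billion * BYTES_PER_PARAM[quant] * overhead

print(f"Llama 3 8B fp16 : ~{vram_gb(8, 'fp16'):.0f} GB")   # fits a 24 GB RTX 4090
print(f"Llama 3 70B fp16: ~{vram_gb(70, 'fp16'):.0f} GB")  # needs multi-GPU or H100/A100
print(f"Llama 3 70B int4: ~{vram_gb(70, 'int4'):.0f} GB")  # quantized, still > one 24 GB card
```

This is why the 8B models fit comfortably on a single 24 GB card while 70B does not, even with aggressive quantization.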
Deploying vLLM for Production Inference
vLLM is one of the fastest LLM inference frameworks available today, thanks to PagedAttention. Install it on an AnubizHost GPU VPS (assuming CUDA 12.1+ is already set up):
conda create -n vllm python=3.11 -y
conda activate vllm
pip install vllm

# Or run it as a container instead:
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<token>" \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-8B-Instruct
vLLM exposes an OpenAI-compatible API, so applications using the openai SDK only need to change base_url:
from openai import OpenAI

client = OpenAI(base_url="http://vps-internal:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
Do NOT expose port 8000 directly to the Internet. Put an NGINX reverse proxy in front of it with API key authentication and rate limiting:
location /v1/ {
    if ($http_authorization != "Bearer SECRET_KEY") { return 401; }
    limit_req zone=llm_api burst=20 nodelay;
    proxy_pass http://127.0.0.1:8000/v1/;
}
With this configuration you have a production-ready, secured LLM endpoint capable of serving a few thousand requests per hour.
RAG and Fine-Tuning on an Iceland VPS
Self-hosting an LLM unlocks RAG (Retrieval-Augmented Generation) and fine-tuning in ways that hosted cloud LLMs do not allow. A standard RAG pipeline on an AnubizHost VPS consists of:
Vector database: self-hosted Qdrant or Weaviate, storing embeddings of your internal documents. Vector lookup is O(log n) with an HNSW index, fast enough for millions of documents.
docker run -d --name qdrant -p 6333:6333 \
  -v ~/qdrant_storage:/qdrant/storage \
  qdrant/qdrant
Embedding model: BAAI/bge-large-en or sentence-transformers/all-mpnet-base-v2, running on the CPU or a secondary GPU. Generate embeddings for every document at index time.
LLM generation: vLLM serving Llama 3 8B Instruct. It receives context from the top-k vector search results and generates the answer.
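The retrieval step of the pipeline above reduces to nearest-neighbor search over embeddings. The following sketch uses a toy bag-of-words embedding as a stand-in for a real model such as bge-large-en, and a plain list as a stand-in for Qdrant, so the control flow is visible end to end.

```python
# Minimal RAG retrieval sketch with toy embeddings (not production code).
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Top-k documents by cosine similarity: the role Qdrant plays in production."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Invoices are due within 30 days of receipt.",
    "The server room is on the third floor.",
    "Late invoices incur a 2 percent monthly fee.",
]
context = retrieve("when are invoices due", docs, k=2)
# The retrieved context is then prepended to the prompt sent to vLLM.
```

In the real pipeline only the two stand-ins change: embed() calls the embedding model and retrieve() queries Qdrant; the prompt-assembly step stays the same.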
LoRA fine-tuning: with 24 GB of GPU VRAM, you can LoRA fine-tune Llama 3 8B on your own dataset using Unsloth or axolotl. This is something the OpenAI API does not let you do.
pip install unsloth

# Fine-tune Llama 3 8B in a few hours:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
# Train on your custom dataset
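Independent of Unsloth or any framework, the core idea of LoRA fits in a few lines of plain Python: freeze the weight matrix W and train only a low-rank update B·A, shrinking the trainable parameter count from d² to 2rd. The dimensions below are toy numbers chosen for illustration.

```python
# LoRA in miniature: W stays frozen, only the low-rank factors A and B train.
def matmul(X, Y):
    """Plain-Python matrix multiply (list-of-rows representation)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

d, r = 4, 1                        # hidden size 4, LoRA rank 1 (toy numbers)
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen weights
A = [[0.1, 0.2, 0.3, 0.4]]         # r x d, trainable
B = [[1.0], [0.0], [0.0], [0.0]]   # d x r, trainable

BA = matmul(B, A)                  # d x d low-rank update, built from 2*r*d numbers
W_eff = [[W[i][j] + BA[i][j] for j in range(d)] for i in range(d)]  # effective weights

trainable = 2 * r * d              # LoRA parameters
full = d * d                       # full fine-tune parameters
print(f"trainable params: {trainable} vs full fine-tune: {full}")
```

For Llama 3 8B the same ratio is what makes fine-tuning feasible on a single 24 GB card: gradients and optimizer state are only needed for the small A and B factors, not for the frozen base weights.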
The entire workflow runs on an AnubizHost Iceland VPS; no data ever leaves the jurisdiction. That is a level of security and control that no LLM SaaS can offer.