LlamaCPP Server | PrivateGPT

llama.cpp is a high-performance inference engine for GGUF models. Its built-in HTTP server (llama-server) exposes an OpenAI-compatible API with full tokenizer support, making it the most capable local option for PrivateGPT.

Capabilities with PrivateGPT

Capability	Status
Model discovery (`/v1/models`)	✅
Tokenizer endpoint (`/tokenize`)	✅
Embeddings	✅
Tool / function calling	✅ model-dependent
Structured output	❌
Streaming	✅
Vision / image input	✅ model-dependent

Setup

Download llama-server

Download a pre-built binary from the llama.cpp releases page. Choose the variant matching your hardware:

Variant	Use when
`llama-server` (CPU)	No GPU, or testing
`llama-server-cuda`	NVIDIA GPU (CUDA)
`llama-server-metal`	macOS with Apple Silicon
`llama-server-vulkan`	AMD / other Vulkan-capable GPU

Or build from source:

$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp
$ cmake -B build && cmake --build build --config Release -j

Download a GGUF model

Download a GGUF model file from Hugging Face. Example:

$ # Using huggingface-cli (pip install huggingface_hub)
$ # Example LLM (~18 GB, Q4 quantization)
$ huggingface-cli download \
>   unsloth/Qwen3.5-35B-A3B-GGUF \
>   Qwen3.5-35B-A3B-Q4_K_M.gguf \
>   --local-dir ./models
$ 
$ # Example embeddings model
$ huggingface-cli download \
>   ChristianAzinn/mxbai-embed-large-v1-gguf \
>   mxbai-embed-large-v1-f16.gguf \
>   --local-dir ./models

Start llama-server

Run the LLM server on port 8000:

$ llama-server \
>   --model ./models/Qwen3.5-35B-A3B-Q4_K_M.gguf \
>   --port 8000 \
>   --ctx-size 32768

If you want a dedicated embeddings model, start a second server on port 8001:

$ llama-server \
>   --model ./models/mxbai-embed-large-v1-f16.gguf \
>   --port 8001 \
>   --embeddings

Flag	Description
`--model`	Path to your GGUF model file
`--port`	HTTP port (default: 8080; use 8000/8001 to avoid conflict with PrivateGPT)
`--ctx-size`	Maximum context window in tokens
`--embeddings`	Enable the embeddings endpoint for an embedding model
`--n-gpu-layers N`	Offload N layers to GPU (omit for CPU-only)

The LLM API is available at http://localhost:8000/v1. If you start the second instance, the embeddings API is available at http://localhost:8001/v1.

Run PrivateGPT

Package install

Docker

uv (local)

$ OPENAI_API_BASE=http://localhost:8000/v1 \
>   OPENAI_EMBEDDING_API_BASE=http://localhost:8001/v1 \
>   private-gpt serve

Advanced profile example

1 # settings-model.yaml
2 llm:
3   default_model: Qwen3.5-35B-A3B-Q4_K_M
4 
5 embedding:
6   default_model: mxbai-embed-large-v1-f16
7 
8 models:
9   - name: Qwen3.5-35B-A3B-Q4_K_M
10     type: llm
11     mode: openai
12     context_window: 32768
13     tokenizer: Qwen/Qwen3.5-35B-A3B   # Exact token counting via HuggingFace tokenizer
14     support_tools: true
15     support_reasoning: true
16     sampling_params:
17       temperature: 0.6
18       top_p: 0.95
19       top_k: 20
20       min_p: 0.0
21 
22   - name: mxbai-embed-large-v1-f16
23     type: embedding
24     mode: openai
25     context_window: 512

GPU acceleration

NVIDIA (CUDA)

Apple Silicon (Metal)

Offload layers to GPU:

$ llama-server \
>   --model ./models/Qwen3.5-35B-A3B-Q4_K_M.gguf \
>   --port 8000 \
>   --ctx-size 32768 \
>   --n-gpu-layers 99   # Offload all layers; reduce if you run out of VRAM

If you also run the embedding model as a second llama-server instance, apply the same GPU flags to that server separately.