LlamaCPP Server

llama.cpp is a high-performance inference engine for GGUF models. Its built-in HTTP server (llama-server) exposes an OpenAI-compatible API with full tokenizer support, making it the most capable local option for PrivateGPT.

Capabilities with PrivateGPT

| Capability | Status |
| --- | --- |
| Model discovery (/v1/models) | ✅ |
| Tokenizer endpoint (/tokenize) | ✅ |
| Embeddings | ✅ |
| Tool / function calling | ✅ (model-dependent) |
| Structured output | ❌ |
| Streaming | ✅ |
| Vision / image input | ✅ (model-dependent) |
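
The tokenizer row can be checked from a script. The following is a minimal Python sketch, assuming a llama-server instance is listening on port 8000 (as in the Setup section) and that its /tokenize endpoint accepts a JSON body with a `content` field and answers with a `tokens` list. The helper names `post_json` and `count_tokens` are illustrative and not part of PrivateGPT or llama.cpp.

```python
import json
import urllib.request


def post_json(url, payload, timeout=10.0):
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def count_tokens(server_root, text, post=post_json):
    """Return how many tokens the server reports for `text`.

    `server_root` is the server address without the /v1 suffix,
    for example "http://localhost:8000" (assumed from the Setup steps).
    `post` is injectable so the function can be exercised without a server.
    """
    reply = post(server_root.rstrip("/") + "/tokenize", {"content": text})
    return len(reply["tokens"])


# Example call (needs a running server):
#   count_tokens("http://localhost:8000", "Hello, PrivateGPT!")
```

Comparing the returned count against the `--ctx-size` value is one way to see whether a prompt is likely to fit in the context window.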

Setup

1. Download llama-server

Download a pre-built binary from the llama.cpp releases page. Choose the variant matching your hardware:

| Variant | Use when |
| --- | --- |
| llama-server (CPU) | No GPU, or testing |
| llama-server-cuda | NVIDIA GPU (CUDA) |
| llama-server-metal | macOS with Apple Silicon |
| llama-server-vulkan | AMD / other Vulkan-capable GPU |

Or build from source:

$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp
$ cmake -B build && cmake --build build --config Release -j

2. Download a GGUF model

Download a GGUF model file from Hugging Face. Example:

# Using huggingface-cli (pip install huggingface_hub)
# Example LLM (~18 GB, Q4 quantization)
$ huggingface-cli download \
    unsloth/Qwen3.5-35B-A3B-GGUF \
    Qwen3.5-35B-A3B-Q4_K_M.gguf \
    --local-dir ./models

# Example embedding model
$ huggingface-cli download \
    ChristianAzinn/mxbai-embed-large-v1-gguf \
    mxbai-embed-large-v1-f16.gguf \
    --local-dir ./models

3. Start llama-server

Run the LLM server on port 8000:

$ llama-server \
    --model ./models/Qwen3.5-35B-A3B-Q4_K_M.gguf \
    --port 8000 \
    --ctx-size 32768

If you want a dedicated embeddings model, start a second server on port 8001:

$ llama-server \
    --model ./models/mxbai-embed-large-v1-f16.gguf \
    --port 8001 \
    --embeddings

| Flag | Description |
| --- | --- |
| --model | Path to your GGUF model file |
| --port | HTTP port (default: 8080; use 8000/8001 to avoid conflict with PrivateGPT) |
| --ctx-size | Maximum context window in tokens |
| --embeddings | Enable the embeddings endpoint for an embedding model |
| --n-gpu-layers N | Offload N layers to GPU (omit for CPU-only) |

The LLM API is available at http://localhost:8000/v1. If you start the second instance, the embeddings API is available at http://localhost:8001/v1.
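
Before starting PrivateGPT it can help to confirm that both endpoints respond. The sketch below is one way to do that with the Python standard library, assuming the two servers from this step are running and return OpenAI-style JSON (`data[].id` for models, `data[0].embedding` for embeddings). The function names are hypothetical, and the `model` value sent to the embeddings endpoint is a placeholder taken from the file name used above.

```python
import json
import urllib.request


def get_json(url, timeout=10.0):
    """GET a URL and return the decoded JSON response."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def post_json(url, payload, timeout=10.0):
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def list_model_ids(api_base, get=get_json):
    """Return the model ids advertised at <api_base>/models."""
    reply = get(api_base.rstrip("/") + "/models")
    return [entry["id"] for entry in reply["data"]]


def embedding_dimension(api_base, text, model, post=post_json):
    """Embed `text` via <api_base>/embeddings and return the vector length."""
    payload = {"model": model, "input": text}
    reply = post(api_base.rstrip("/") + "/embeddings", payload)
    return len(reply["data"][0]["embedding"])


# Example calls (need both servers running):
#   list_model_ids("http://localhost:8000/v1")
#   embedding_dimension("http://localhost:8001/v1", "hello",
#                       "mxbai-embed-large-v1-f16")
```

The transport functions are passed in as parameters so the parsing logic can be tested without network access.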

4. Run PrivateGPT

Package install:

$ OPENAI_API_BASE=http://localhost:8000/v1 \
    OPENAI_EMBEDDING_API_BASE=http://localhost:8001/v1 \
    private-gpt serve
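
If PrivateGPT is launched from a script rather than a shell, the same two variables can be set programmatically. This is a small sketch under stated assumptions: `privategpt_env` is a hypothetical helper, and leaving `OPENAI_EMBEDDING_API_BASE` unset when no second server is running is an assumption drawn from the optional second instance described above, not confirmed PrivateGPT behavior.

```python
import os


def privategpt_env(llm_api_base, embedding_api_base=None, base_env=None):
    """Return a copy of the environment with the API base variables set.

    `base_env` defaults to the current process environment; pass a dict
    to build the environment from scratch.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OPENAI_API_BASE"] = llm_api_base
    if embedding_api_base is not None:
        # Only set when a dedicated embeddings server is running (assumption).
        env["OPENAI_EMBEDDING_API_BASE"] = embedding_api_base
    return env


# Example launch (needs private-gpt on PATH):
#   import subprocess
#   env = privategpt_env("http://localhost:8000/v1", "http://localhost:8001/v1")
#   subprocess.run(["private-gpt", "serve"], env=env, check=True)
```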

Advanced profile example

# settings-model.yaml
llm:
  default_model: Qwen3.5-35B-A3B-Q4_K_M

embedding:
  default_model: mxbai-embed-large-v1-f16

models:
  - name: Qwen3.5-35B-A3B-Q4_K_M
    type: llm
    mode: openai
    context_window: 32768
    tokenizer: Qwen/Qwen3.5-35B-A3B  # Exact token counting via Hugging Face tokenizer
    support_tools: true
    support_reasoning: true
    sampling_params:
      temperature: 0.6
      top_p: 0.95
      top_k: 20
      min_p: 0.0

  - name: mxbai-embed-large-v1-f16
    type: embedding
    mode: openai
    context_window: 512
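
A profile like this is easy to break with a typo in a model name. The following Python sketch shows one possible consistency check, assuming (from the example only, not from PrivateGPT's documented schema) that each `default_model` must match the `name` of an entry under `models` with the corresponding `type`. The profile is written as a plain dict mirroring the YAML so the snippet needs no YAML library; `check_profile` is a hypothetical helper.

```python
def check_profile(profile):
    """Return a list of problems found in a profile dict (empty if none)."""
    problems = []
    models = {entry["name"]: entry for entry in profile.get("models", [])}
    for section in ("llm", "embedding"):
        default = profile.get(section, {}).get("default_model")
        if default is None:
            problems.append(f"{section}.default_model is not set")
        elif default not in models:
            problems.append(
                f"{section}.default_model '{default}' has no entry under models"
            )
        elif models[default].get("type") != section:
            problems.append(f"model '{default}' should have type '{section}'")
    return problems


# Dict equivalent of the relevant parts of settings-model.yaml above.
PROFILE = {
    "llm": {"default_model": "Qwen3.5-35B-A3B-Q4_K_M"},
    "embedding": {"default_model": "mxbai-embed-large-v1-f16"},
    "models": [
        {"name": "Qwen3.5-35B-A3B-Q4_K_M", "type": "llm",
         "mode": "openai", "context_window": 32768},
        {"name": "mxbai-embed-large-v1-f16", "type": "embedding",
         "mode": "openai", "context_window": 512},
    ],
}

print(check_profile(PROFILE))
```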

GPU acceleration

NVIDIA (CUDA)

Offload layers to the GPU:

$ llama-server \
    --model ./models/Qwen3.5-35B-A3B-Q4_K_M.gguf \
    --port 8000 \
    --ctx-size 32768 \
    --n-gpu-layers 99 # Offload all layers; reduce if you run out of VRAM

If you also run the embedding model as a second llama-server instance, apply the same GPU flags to that server separately.
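
When both servers are started from a script, generating the two command lines from one function keeps the GPU flag consistent. This is an illustrative Python sketch that uses only the flags listed in the table above; `llama_server_command` is a hypothetical helper, and the value 99 simply repeats the example on this page.

```python
import shlex


def llama_server_command(model_path, port, ctx_size=None,
                         embeddings=False, n_gpu_layers=None):
    """Build the llama-server argument list for one instance."""
    command = ["llama-server", "--model", model_path, "--port", str(port)]
    if ctx_size is not None:
        command += ["--ctx-size", str(ctx_size)]
    if embeddings:
        command.append("--embeddings")
    if n_gpu_layers is not None:
        # Omitted entirely for CPU-only runs.
        command += ["--n-gpu-layers", str(n_gpu_layers)]
    return command


llm_command = llama_server_command(
    "./models/Qwen3.5-35B-A3B-Q4_K_M.gguf", 8000,
    ctx_size=32768, n_gpu_layers=99,
)
embedding_command = llama_server_command(
    "./models/mxbai-embed-large-v1-f16.gguf", 8001,
    embeddings=True, n_gpu_layers=99,
)

# Print shell-ready command lines for inspection.
print(shlex.join(llm_command))
print(shlex.join(embedding_command))
```

Each list can be passed directly to `subprocess.Popen` to start the corresponding server.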