llama.cpp is a high-performance inference engine for GGUF models. Its built-in HTTP server (llama-server) exposes an OpenAI-compatible API with full tokenizer support, making it the most capable local option for PrivateGPT.
Download a pre-built binary from the llama.cpp releases page. Choose the variant matching your hardware:
Or build from source:
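As a minimal sketch of the source build, the helper below prints the clone and CMake commands for a chosen backend so you can review them before running anything. The backend labels (`cpu`, `cuda`, `vulkan`) and the function names are hypothetical, chosen for this example. The `GGML_CUDA` and `GGML_VULKAN` CMake options are the ones used by current upstream llama.cpp; older checkouts may use different option names, so check the build documentation for your version.

```shell
# Sketch: map a backend label to a CMake option, then print the build steps.
# Nothing is executed; the commands are only echoed for review.
backend_flags() {
  case "$1" in
    cuda)   echo "-DGGML_CUDA=ON" ;;     # NVIDIA GPUs (requires the CUDA toolkit)
    vulkan) echo "-DGGML_VULKAN=ON" ;;   # Vulkan-capable GPUs (requires the Vulkan SDK)
    cpu)    echo "" ;;                   # CPU-only build, no extra option
    *)      echo "unknown backend: $1" >&2; return 1 ;;
  esac
}

print_build_steps() {
  flags="$(backend_flags "$1")" || return 1
  echo "git clone https://github.com/ggml-org/llama.cpp"
  echo "cd llama.cpp"
  echo "cmake -B build${flags:+ $flags}"
  echo "cmake --build build --config Release"
}

print_build_steps cuda
```

With a CMake build laid out like this, the server binary is normally written to `build/bin/llama-server`.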
Offload layers to GPU:
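A minimal sketch of a GPU-offloaded launch follows. The model path, port, and layer count are placeholder assumptions, not values from this guide. `-ngl` (long form `--n-gpu-layers`) sets how many layers are placed on the GPU; a value larger than the model's layer count is commonly used to offload everything, and a smaller value helps if VRAM runs out.

```shell
# Sketch: assemble a llama-server command line with GPU offload.
# MODEL_PATH, PORT and N_GPU_LAYERS are placeholders; override them in your shell.
MODEL_PATH="${MODEL_PATH:-models/chat-model.gguf}"
PORT="${PORT:-8080}"
N_GPU_LAYERS="${N_GPU_LAYERS:-99}"

chat_server_cmd() {
  # Print the command instead of running it, so it can be checked first.
  echo "llama-server -m $MODEL_PATH --port $PORT -ngl $N_GPU_LAYERS"
}

chat_server_cmd
```

Once the printed command looks right, run it directly in a terminal.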
If you also run the embedding model as a second llama-server instance, apply the same GPU flags to that server separately.
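For the two-server setup, a sketch along these lines may help. It assumes a separate embedding model file and a second port, both hypothetical values. The `--embedding` flag switches llama-server to serving embeddings, and `-ngl` is repeated because each instance manages its own GPU offload.

```shell
# Sketch: command line for a second llama-server instance serving embeddings.
# EMBED_MODEL_PATH, EMBED_PORT and EMBED_GPU_LAYERS are placeholders for this example.
EMBED_MODEL_PATH="${EMBED_MODEL_PATH:-models/embedding-model.gguf}"
EMBED_PORT="${EMBED_PORT:-8081}"
EMBED_GPU_LAYERS="${EMBED_GPU_LAYERS:-99}"

embedding_server_cmd() {
  # Printed for review; the GPU flag must be given here as well as on the chat server.
  echo "llama-server -m $EMBED_MODEL_PATH --port $EMBED_PORT --embedding -ngl $EMBED_GPU_LAYERS"
}

embedding_server_cmd
```

Both servers share the same GPU memory, so the two layer counts together need to fit in available VRAM.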