PrivateGPT connects to any OpenAI-compatible LLM server via OPENAI_API_BASE. If your server responds to GET /v1/models and POST /v1/chat/completions, it works — whether that is a local binary, a cloud endpoint, or a self-hosted service.
The server handles model inference; PrivateGPT handles the API, retrieval, document processing, and orchestration on top.
The guides below cover popular self-hosted options. These are examples — not an exhaustive list.
Easiest local setup. One command to pull and run any model.
GUI-based desktop app. Great for exploring and switching models without a terminal.
Lightweight binary, full tokenizer support. Best for CPU inference and GGUF models.
Highest throughput. Structured output support. Best for production and multi-user deployments.
† Model-dependent — the server supports the protocol, but the loaded model must also support the capability.
When the server does not expose /tokenize (Ollama), PrivateGPT falls back to a character-based estimate (4 chars = 1 token) for counting tokens. This can cause:
Mitigation: Set context_window explicitly in a detailed model profile to a conservative value. This tells PrivateGPT exactly how many tokens it can safely use.
Only vLLM exposes the structured output (JSON schema enforcement) endpoint used by PrivateGPT for reliable tool calls and schema-constrained responses. With other providers, PrivateGPT falls back to prompt-based JSON extraction, which is less reliable for complex schemas.
The provider pages use the following models as examples. Any OpenAI-compatible model works.
Embedding models are auto-discovered from /v1/models when embedding.auto_discover_models is enabled, which is the default behavior. You only need to define embedding models explicitly in a detailed model profile if you want to override discovery or your provider does not expose them as expected.
Example manual embedding model config in settings-model.yaml: