Overview | PrivateGPT

PrivateGPT connects to any OpenAI-compatible LLM server via OPENAI_API_BASE. If your server responds to GET /v1/models and POST /v1/chat/completions, it works — whether that is a local binary, a cloud endpoint, or a self-hosted service.

$ OPENAI_API_BASE=https://your-openai-compatible-server/v1 private-gpt serve

The server handles model inference; PrivateGPT handles the API, retrieval, document processing, and orchestration on top.

Common local setups

The guides below cover popular self-hosted options. These are examples — not an exhaustive list.

Ollama

Easiest local setup. One command to pull and run any model.

LM Studio

GUI-based desktop app. Great for exploring and switching models without a terminal.

LlamaCPP Server

Lightweight binary, full tokenizer support. Best for CPU inference and GGUF models.

vLLM

Highest throughput. Structured output support. Best for production and multi-user deployments.

Cloud gateways

DaoXE

Multi-model gateway. Access GPT, Claude, Grok, GLM families through one OpenAI-compatible endpoint.

Feature matrix

Capability	Ollama	LM Studio	LlamaCPP Server	vLLM
Model discovery (`/v1/models`)	✅	✅	✅	✅
Tokenizer endpoint (`/tokenize`)	❌	✅	✅	✅
Embeddings endpoint	✅	✅	✅	✅
Tool / function calling	✅ †	✅ †	✅ †	✅ †
Structured output (JSON schema)	❌	❌	❌	✅
Streaming	✅	✅	✅	✅
Vision / image input	✅ †	✅ †	✅ †	✅ †
Audio input	⚙️ Limited	❌	❌	❌

† Model-dependent — the server supports the protocol, but the loaded model must also support the capability.

Impact of a missing tokenizer endpoint

When the server does not expose /tokenize (Ollama), PrivateGPT falls back to a character-based estimate (4 chars = 1 token) for counting tokens. This can cause:

Inaccurate context-window management on very long inputs
Potential context overflow for models with smaller windows (e.g. 4k, 8k)

Mitigation: Set context_window explicitly in a detailed model profile to a conservative value. This tells PrivateGPT exactly how many tokens it can safely use.

Structured output

Only vLLM exposes the structured output (JSON schema enforcement) endpoint used by PrivateGPT for reliable tool calls and schema-constrained responses. With other providers, PrivateGPT falls back to prompt-based JSON extraction, which is less reliable for complex schemas.

Example models

The provider pages use the following models as examples. Any OpenAI-compatible model works.

Role	Model	Size	Notes
LLM	`qwen3.5:35b` (Ollama) / `unsloth/Qwen3.5-35B-A3B-GGUF` (GGUF) / `Qwen/Qwen3.5-35B-A3B-GPTQ-Int4` (vLLM)	~24 GB (Ollama) / ~18 GB (Q4 GGUF)	Mixture-of-experts; strong reasoning and tool use
Embeddings	`mxbai-embed-large` (Ollama) / `mixedbread-ai/mxbai-embed-large-v1`	~670 MB	1024-dim, strong multilingual retrieval

Embedding auto-discovery

Embedding models are auto-discovered from /v1/models when embedding.auto_discover_models is enabled, which is the default behavior. You only need to define embedding models explicitly in a detailed model profile if you want to override discovery or your provider does not expose them as expected.

Example manual embedding model config in settings-model.yaml:

1 embedding:
2   default_model: mxbai-embed-large
3 
4 models:
5   - name: mxbai-embed-large
6     type: embedding
7     mode: openai
8     context_window: 512