Overview

PrivateGPT connects to any OpenAI-compatible LLM server via OPENAI_API_BASE. If your server responds to GET /v1/models and POST /v1/chat/completions, it works — whether that is a local binary, a cloud endpoint, or a self-hosted service.

$OPENAI_API_BASE=https://your-openai-compatible-server/v1 private-gpt serve

The server handles model inference; PrivateGPT handles the API, retrieval, document processing, and orchestration on top.


Common local setups

The guides below cover popular self-hosted options. These are examples — not an exhaustive list.


Feature matrix

CapabilityOllamaLM StudioLlamaCPP ServervLLM
Model discovery (/v1/models)
Tokenizer endpoint (/tokenize)
Embeddings endpoint
Tool / function calling✅ †✅ †✅ †✅ †
Structured output (JSON schema)
Streaming
Vision / image input✅ †✅ †✅ †✅ †
Audio input⚙️ Limited

† Model-dependent — the server supports the protocol, but the loaded model must also support the capability.

Impact of a missing tokenizer endpoint

When the server does not expose /tokenize (Ollama), PrivateGPT falls back to a character-based estimate (4 chars = 1 token) for counting tokens. This can cause:

  • Inaccurate context-window management on very long inputs
  • Potential context overflow for models with smaller windows (e.g. 4k, 8k)

Mitigation: Set context_window explicitly in a detailed model profile to a conservative value. This tells PrivateGPT exactly how many tokens it can safely use.

Structured output

Only vLLM exposes the structured output (JSON schema enforcement) endpoint used by PrivateGPT for reliable tool calls and schema-constrained responses. With other providers, PrivateGPT falls back to prompt-based JSON extraction, which is less reliable for complex schemas.


Example models

The provider pages use the following models as examples. Any OpenAI-compatible model works.

RoleModelSizeNotes
LLMqwen3.5:35b (Ollama) / unsloth/Qwen3.5-35B-A3B-GGUF (GGUF) / Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 (vLLM)~24 GB (Ollama) / ~18 GB (Q4 GGUF)Mixture-of-experts; strong reasoning and tool use
Embeddingsmxbai-embed-large (Ollama) / mixedbread-ai/mxbai-embed-large-v1~670 MB1024-dim, strong multilingual retrieval

Embedding auto-discovery

Embedding models are auto-discovered from /v1/models when embedding.auto_discover_models is enabled, which is the default behavior. You only need to define embedding models explicitly in a detailed model profile if you want to override discovery or your provider does not expose them as expected.

Example manual embedding model config in settings-model.yaml:

1embedding:
2 default_model: mxbai-embed-large
3
4models:
5 - name: mxbai-embed-large
6 type: embedding
7 mode: openai
8 context_window: 512