For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
Contact usJoin the Discord
ManualAPI GuideAPI Reference
  • Getting started
    • Introduction
    • Quickstart
    • How it works
  • Installation Options
    • Package Install
    • Docker
    • Development
  • Configuration
    • CLI
    • Settings & Profiles
    • Model Configuration
  • Inference Providers
    • Overview
    • Ollama
    • LM Studio
    • LlamaCPP Server
    • vLLM
  • Integrations
    • Overview
    • Claude Code
    • Claude Desktop
    • Claude for Microsoft 365
    • OpenCode
  • Built-in Tools
    • Web Tools
    • Database Tools
  • Storage Providers
    • Vector Store
    • Object Storage
  • User Interface
    • Workbench
  • Observability
    • Observability
  • Reference
    • Troubleshooting
LogoLogo
Contact usJoin the Discord
On this page
  • Common local setups
  • Feature matrix
  • Impact of a missing tokenizer endpoint
  • Structured output
  • Example models
  • Embedding auto-discovery
Inference Providers

Overview

Was this page helpful?
Previous

Ollama

Next
Built with

PrivateGPT connects to any OpenAI-compatible LLM server via OPENAI_API_BASE. If your server responds to GET /v1/models and POST /v1/chat/completions, it works — whether that is a local binary, a cloud endpoint, or a self-hosted service.

$OPENAI_API_BASE=https://your-openai-compatible-server/v1 private-gpt serve

The server handles model inference; PrivateGPT handles the API, retrieval, document processing, and orchestration on top.


Common local setups

The guides below cover popular self-hosted options. These are examples — not an exhaustive list.

Ollama

Easiest local setup. One command to pull and run any model.

LM Studio

GUI-based desktop app. Great for exploring and switching models without a terminal.

LlamaCPP Server

Lightweight binary, full tokenizer support. Best for CPU inference and GGUF models.

vLLM

Highest throughput. Structured output support. Best for production and multi-user deployments.


Feature matrix

CapabilityOllamaLM StudioLlamaCPP ServervLLM
Model discovery (/v1/models)✅✅✅✅
Tokenizer endpoint (/tokenize)❌✅✅✅
Embeddings endpoint✅✅✅✅
Tool / function calling✅ †✅ †✅ †✅ †
Structured output (JSON schema)❌❌❌✅
Streaming✅✅✅✅
Vision / image input✅ †✅ †✅ †✅ †
Audio input⚠️ Limited❌❌❌

† Model-dependent — the server supports the protocol, but the loaded model must also support the capability.

Impact of a missing tokenizer endpoint

When the server does not expose /tokenize (Ollama), PrivateGPT falls back to a character-based estimate (4 chars = 1 token) for counting tokens. This can cause:

  • Inaccurate context-window management on very long inputs
  • Potential context overflow for models with smaller windows (e.g. 4k, 8k)

Mitigation: Set context_window explicitly in a detailed model profile to a conservative value. This tells PrivateGPT exactly how many tokens it can safely use.

Structured output

Only vLLM exposes the structured output (JSON schema enforcement) endpoint used by PrivateGPT for reliable tool calls and schema-constrained responses. With other providers, PrivateGPT falls back to prompt-based JSON extraction, which is less reliable for complex schemas.


Example models

The provider pages use the following models as examples. Any OpenAI-compatible model works.

RoleModelSizeNotes
LLMqwen3.5:35b (Ollama) / unsloth/Qwen3.5-35B-A3B-GGUF (GGUF) / Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 (vLLM)~24 GB (Ollama) / ~18 GB (Q4 GGUF)Mixture-of-experts; strong reasoning and tool use
Embeddingsmxbai-embed-large (Ollama) / mixedbread-ai/mxbai-embed-large-v1~670 MB1024-dim, strong multilingual retrieval

Embedding auto-discovery

Embedding models are auto-discovered from /v1/models when embedding.auto_discover_models is enabled, which is the default behavior. You only need to define embedding models explicitly in a detailed model profile if you want to override discovery or your provider does not expose them as expected.

Example manual embedding model config in settings-model.yaml:

1embedding:
2 default_model: mxbai-embed-large
3
4models:
5 - name: mxbai-embed-large
6 type: embedding
7 mode: openai
8 context_window: 512