Ollama


Ollama lets you run large language models locally with a single command. It handles model downloading, GPU offloading, and serving an OpenAI-compatible API on port 11434.

Limitations with PrivateGPT

Ollama does not expose a tokenizer endpoint (/tokenize), so PrivateGPT falls back to a character-based estimate of about 4 characters per token when counting tokens. When the model's real tokenizer produces more tokens than this estimate, long inputs can overflow the context window.
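The fallback can be pictured with a minimal sketch. Only the 4-characters-per-token ratio comes from the text above; the function names and the rounding are assumptions for illustration, not PrivateGPT's actual code.

```python
import math

CHARS_PER_TOKEN = 4  # ratio stated above; the rounding below is an assumption


def estimate_tokens(text: str, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Character-based token estimate (hypothetical helper, not PrivateGPT's API)."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(text: str, context_window: int, reserved_for_output: int = 0) -> bool:
    """True if the estimated prompt size plus an output reserve fits the window."""
    return estimate_tokens(text) + reserved_for_output <= context_window


prompt = "word " * 1000  # 5000 characters
print(estimate_tokens(prompt))     # 1250
print(fits_context(prompt, 4096))  # True
print(fits_context(prompt, 1024))  # False
```

Because the estimate is only a proxy, a prompt that passes this check can still exceed the real limit, which is why an accurate context_window matters.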

Recommendation: Set context_window explicitly in a detailed model profile to match your model's actual limit. By default, Ollama picks a context window based on the available VRAM:

  • < 24 GiB VRAM: 4k context
  • 24-48 GiB VRAM: 32k context
  • >= 48 GiB VRAM: 256k context
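As a rough sketch, the tiers above can be written as a lookup. The function name is hypothetical, and it assumes "4k" means 4 × 1024 tokens (likewise for 32k and 256k); Ollama's own logic may differ in detail.

```python
def default_context_for_vram(vram_gib: float) -> int:
    """Map available VRAM in GiB to the default context size from the tiers above.

    Assumption: "4k" is read as 4 * 1024 tokens, and so on.
    """
    if vram_gib < 24:
        return 4 * 1024
    if vram_gib < 48:
        return 32 * 1024
    return 256 * 1024


for vram in (8, 24, 48):
    print(vram, default_context_for_vram(vram))
# 8 4096
# 24 32768
# 48 262144
```

Comparing this value with the model's documented limit is one way to choose the context_window you set in the profile.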

| Capability | Status |
| --- | --- |
| Model discovery (/v1/models) | ✅ |
| Tokenizer endpoint (/tokenize) | ❌ |
| Embeddings | ✅ |
| Tool / function calling | ✅ (model-dependent) |
| Structured output | ❌ |
| Streaming | ✅ |
| Vision / image input | ✅ (model-dependent) |

Setup

1. Install Ollama

Download and install from ollama.ai for your platform (macOS, Linux, Windows).

Or on macOS:

$ brew install ollama

2. Pull a model

# Example LLM: Qwen3.5 35B (~24 GB)
$ ollama pull qwen3.5:35b

# Example embeddings model (~670 MB)
$ ollama pull mxbai-embed-large

Any model from the Ollama library works. For smaller hardware, try qwen3.5:7b.

3. Start the Ollama server

$ ollama serve

The API is now available at http://localhost:11434/v1.

On macOS, the Ollama desktop app starts the server automatically when open. You don’t need to run ollama serve manually.
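To confirm the server is reachable before starting PrivateGPT, you can query the model discovery endpoint (/v1/models). The sketch below uses only the Python standard library and assumes the OpenAI-style list response shape ({"data": [{"id": ...}]}); the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # same value as OPENAI_API_BASE in this guide


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str = BASE_URL, timeout: float = 5.0) -> list[str]:
    """GET {base_url}/models and return the ids.

    Raises urllib.error.URLError if the server is not reachable.
    """
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as response:
        return parse_model_ids(json.load(response))


# With the server running, call list_models() to see the pulled models.
```

If list_models() raises a connection error, the server is not running or is listening on a different address.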

4. Run PrivateGPT

With a package install (the Docker and uv variants take the same OPENAI_API_BASE variable):

$ OPENAI_API_BASE=http://localhost:11434/v1 private-gpt serve

Advanced profile example

Because Ollama lacks the tokenizer endpoint, it’s especially useful to set context_window explicitly:

# settings-model.yaml
llm:
  default_model: qwen3.5:35b

embedding:
  default_model: mxbai-embed-large

models:
  - name: qwen3.5:35b
    type: llm
    mode: openai
    context_window: 32768  # Set explicitly; Ollama can't report this
    support_tools: true
    support_reasoning: true
    sampling_params:
      temperature: 0.6
      top_p: 0.95
      top_k: 20
      min_p: 0.0

  - name: mxbai-embed-large
    type: embedding
    mode: openai
    context_window: 512
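Before running with a hand-edited profile, a quick consistency check can catch a default model that has no entry or is missing context_window. The sketch below mirrors the profile as a plain dict to stay dependency-free. The check_profile helper and its rules are assumptions for illustration, not PrivateGPT's own validation.

```python
profile = {
    "llm": {"default_model": "qwen3.5:35b"},
    "embedding": {"default_model": "mxbai-embed-large"},
    "models": [
        {"name": "qwen3.5:35b", "type": "llm", "mode": "openai", "context_window": 32768},
        {"name": "mxbai-embed-large", "type": "embedding", "mode": "openai", "context_window": 512},
    ],
}


def check_profile(profile: dict) -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    by_name = {model.get("name"): model for model in profile.get("models", [])}
    for section in ("llm", "embedding"):
        default = profile.get(section, {}).get("default_model")
        entry = by_name.get(default)
        if entry is None:
            problems.append(f"{section}: default_model {default!r} has no entry under models")
            continue
        if entry.get("type") != section:
            problems.append(f"{default}: type is {entry.get('type')!r}, expected {section!r}")
        window = entry.get("context_window")
        if not isinstance(window, int) or window <= 0:
            problems.append(f"{default}: context_window is not set to a positive integer")
    return problems


print(check_profile(profile))  # []
```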

Generate this automatically with:

$ OPENAI_API_BASE=http://localhost:11434/v1 \
    uv run python scripts/auto_discover_models.py --out settings-model.yaml

Then edit context_window and other values as needed and run:

$ OPENAI_API_BASE=http://localhost:11434/v1 \
    PGPT_PROFILES=model \
    uv run python -m private_gpt

Troubleshooting

Connection refused inside Docker

Inside a container, localhost refers to the container itself, not the machine running Ollama. Use host.docker.internal instead of localhost:

$ docker run -e OPENAI_API_BASE=http://host.docker.internal:11434/v1 ...

On Linux with Docker Engine, host.docker.internal is typically not defined, so use --network host instead:

$ docker run --network host -e OPENAI_API_BASE=http://localhost:11434/v1 ...
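The choice between the two hostnames can be summarized in a small sketch. The function and its parameters are hypothetical; it only encodes the rule described above (host.docker.internal from inside a container, localhost otherwise or with host networking).

```python
def ollama_base_url(in_container: bool, host_network: bool = False, port: int = 11434) -> str:
    """Pick the OPENAI_API_BASE value for reaching Ollama on the host machine."""
    if in_container and not host_network:
        host = "host.docker.internal"  # container reaches the host through this name
    else:
        host = "localhost"  # same network namespace as the Ollama server
    return f"http://{host}:{port}/v1"


print(ollama_base_url(in_container=False))
# http://localhost:11434/v1
print(ollama_base_url(in_container=True))
# http://host.docker.internal:11434/v1
print(ollama_base_url(in_container=True, host_network=True))
# http://localhost:11434/v1
```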

Model not found

Verify the model is available:

$ ollama list
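When comparing a configured name against that listing, keep in mind that Ollama shows untagged models with an implicit :latest tag (mxbai-embed-large appears as mxbai-embed-large:latest). A minimal sketch of the comparison, with hypothetical helper names:

```python
def normalize(name: str) -> str:
    """Append the implicit ":latest" tag when the model name carries no tag."""
    last_segment = name.rsplit("/", 1)[-1]  # ignore any ":" in a registry host:port prefix
    return name if ":" in last_segment else f"{name}:latest"


def is_pulled(model: str, listed: list[str]) -> bool:
    """True if `model` matches one of the names shown by `ollama list`."""
    return normalize(model) in {normalize(item) for item in listed}


listed = ["qwen3.5:35b", "mxbai-embed-large:latest"]  # NAME column of `ollama list`
print(is_pulled("mxbai-embed-large", listed))  # True
print(is_pulled("qwen3.5:7b", listed))         # False
```

If the model is missing from the list, pull it with ollama pull and retry.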