Ollama | PrivateGPT

Ollama lets you run large language models locally with a single command. It handles model downloading, GPU offloading, and serving an OpenAI-compatible API on port 11434.

Limitations with PrivateGPT

Ollama does not expose a tokenizer endpoint (/tokenize), so PrivateGPT falls back to a character-based estimate (4 characters = 1 token) for token counting. The estimate can undercount, so long inputs may overflow the context window.
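To make the fallback concrete, here is a minimal sketch of a 4-characters-per-token estimate and a fit check against a context window. The function names, the rounding-up behavior, and the `reserved_for_output` parameter are assumptions for illustration; this is not PrivateGPT's actual implementation.

```python
import math

CHARS_PER_TOKEN = 4  # the fallback ratio described above


def estimate_tokens(text: str, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Approximate the token count from the character count (rounded up)."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(prompt: str, context_window: int, reserved_for_output: int = 0) -> bool:
    """True if the estimated prompt size plus the output reserve fits the window."""
    return estimate_tokens(prompt) + reserved_for_output <= context_window


prompt = "x" * 20_000  # 20,000 characters
print(estimate_tokens(prompt))       # 5000
print(fits_context(prompt, 4096))    # False
print(fits_context(prompt, 32768))   # True
```

Because a real tokenizer may produce more tokens than this ratio predicts (for example on code or non-English text), leaving some headroom below the configured context_window is a reasonable precaution.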

Recommendation: Set context_window explicitly in a detailed model profile so that it matches your model’s actual limit. By default, Ollama picks a context window based on the available VRAM:

< 24 GiB VRAM: 4k context
24-48 GiB VRAM: 32k context
>= 48 GiB VRAM: 256k context
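The VRAM tiers above can be sketched as a small lookup. This is only an illustration of the table: the function name is hypothetical, "k" is assumed to mean 1024 tokens, and the boundaries are assumed to fall so that exactly 24 GiB lands in the middle tier and exactly 48 GiB in the top tier.

```python
def default_context_window(vram_gib: float) -> int:
    """Map available VRAM (GiB) to the default context size listed above."""
    if vram_gib < 24:
        return 4 * 1024      # 4k
    if vram_gib < 48:
        return 32 * 1024     # 32k
    return 256 * 1024        # 256k


for vram in (16, 24, 48):
    print(vram, default_context_window(vram))
```

If the value this yields for your hardware differs from what your model actually supports, the explicit context_window setting in the model profile is the place to correct it.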

Capability	Status
Model discovery (`/v1/models`)	✅
Tokenizer endpoint (`/tokenize`)	❌
Embeddings	✅
Tool / function calling	✅ model-dependent
Structured output	❌
Streaming	✅
Vision / image input	✅ model-dependent

Setup

Install Ollama

Download and install from ollama.ai for your platform (macOS, Linux, Windows).

Or on macOS:

$ brew install ollama

Pull a model

# Example LLM: Qwen3.5 35B (~24 GB)
$ ollama pull qwen3.5:35b

# Example embeddings model (~670 MB)
$ ollama pull mxbai-embed-large

Any model from the Ollama library works. For smaller hardware, try qwen3.5:7b.

Start the Ollama server

$ ollama serve

The API is now available at http://localhost:11434/v1.
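One way to confirm the server is reachable is to query the model discovery endpoint (`/v1/models`). The sketch below uses only the Python standard library and assumes the response follows the OpenAI-style list shape, a `data` array of objects with an `id` field. The helper names are hypothetical.

```python
import json
import urllib.request


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style model list response."""
    return [item["id"] for item in payload.get("data", [])]


def list_models(base_url: str = "http://localhost:11434/v1", timeout: float = 5.0) -> list[str]:
    """GET {base_url}/models and return the model ids the server reports."""
    url = f"{base_url.rstrip('/')}/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

With the server running, calling `list_models()` should return the names of the models you pulled. If the server is not running, `urlopen` raises `urllib.error.URLError`.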

On macOS, the Ollama desktop app starts the server automatically when open. You don’t need to run ollama serve manually.

Run PrivateGPT

Package install

$ OPENAI_API_BASE=http://localhost:11434/v1 private-gpt serve

Advanced profile example

Because Ollama lacks the tokenizer endpoint, it’s especially useful to set context_window explicitly:

# settings-model.yaml
llm:
  default_model: qwen3.5:35b

embedding:
  default_model: mxbai-embed-large

models:
  - name: qwen3.5:35b
    type: llm
    mode: openai
    context_window: 32768        # Set explicitly; Ollama can't report this
    support_tools: true
    support_reasoning: true
    sampling_params:
      temperature: 0.6
      top_p: 0.95
      top_k: 20
      min_p: 0.0

  - name: mxbai-embed-large
    type: embedding
    mode: openai
    context_window: 512

Generate this automatically with:

$ OPENAI_API_BASE=http://localhost:11434/v1 \
>   uv run python scripts/auto_discover_models.py --out settings-model.yaml

Then edit context_window and other values as needed and run:

$ OPENAI_API_BASE=http://localhost:11434/v1 \
>   PGPT_PROFILES=model \
>   uv run python -m private_gpt

Troubleshooting

Connection refused inside Docker

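Inside a container, localhost refers to the container itself, not to the host running Ollama. Use host.docker.internal instead of localhost:

$ docker run -e OPENAI_API_BASE=http://host.docker.internal:11434/v1 ...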

On Linux with Docker, host.docker.internal is not defined by default; use --network host instead:

$ docker run --network host -e OPENAI_API_BASE=http://localhost:11434/v1 ...
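The choice between the two hostnames can be summarized in a few lines. This is a hypothetical helper, not part of PrivateGPT, and it assumes you already know whether the process runs in a container and whether that container uses host networking.

```python
def ollama_base_url(in_docker: bool, host_network: bool = False, port: int = 11434) -> str:
    """Pick the base URL a PrivateGPT process should use to reach Ollama on the host."""
    if not in_docker or host_network:
        # Outside Docker, or with --network host, the process shares the
        # host's network stack, so localhost reaches Ollama directly.
        host = "localhost"
    else:
        # On a bridged network, localhost is the container itself.
        host = "host.docker.internal"
    return f"http://{host}:{port}/v1"


print(ollama_base_url(in_docker=False))
print(ollama_base_url(in_docker=True))
print(ollama_base_url(in_docker=True, host_network=True))
```

The returned string is the value to pass as OPENAI_API_BASE.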

Model not found

Verify the model is available:

$ ollama list
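A frequent cause of this error is a name mismatch between the profile and what Ollama reports. The sketch below compares configured names against a list of available names. It assumes that a name without a tag is equivalent to the same name with `:latest`, and the helper names are hypothetical.

```python
def normalize(name: str) -> str:
    """Append ':latest' when the model name carries no explicit tag."""
    last_segment = name.rsplit("/", 1)[-1]  # ignore any registry or namespace prefix
    return name if ":" in last_segment else f"{name}:latest"


def missing_models(configured: list[str], available: list[str]) -> list[str]:
    """Return the configured model names that are not among the available ones."""
    have = {normalize(n) for n in available}
    return [n for n in configured if normalize(n) not in have]


configured = ["qwen3.5:35b", "mxbai-embed-large"]
available = ["mxbai-embed-large:latest"]
print(missing_models(configured, available))  # ['qwen3.5:35b']
```

Any name this returns needs to be pulled with ollama pull, or corrected in the profile so that it matches the output of ollama list.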