Ollama lets you run large language models locally with a single command. It handles model downloading, GPU offloading, and serving an OpenAI-compatible API on port 11434.
Ollama does not expose a tokenizer endpoint (/tokenize), so PrivateGPT falls back to a character-based estimate (4 chars = 1 token) for token counting. Because this is only an approximation, it can undercount the real token total and cause context-window overflow on long inputs.
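The character-based fallback can be sketched in a few lines. This is an illustration of the heuristic only, not PrivateGPT's actual code: the function names, the round-up behavior, and the reserved_for_output parameter are assumptions made for the example.

```python
import math

CHARS_PER_TOKEN = 4  # heuristic from the text: 4 characters count as 1 token


def estimate_tokens(text: str) -> int:
    """Estimate the token count from character length (rounding up is an assumption)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_context(text: str, context_window: int, reserved_for_output: int = 0) -> bool:
    """Return True if the estimated prompt size plus the reserved output fits the window."""
    return estimate_tokens(text) + reserved_for_output <= context_window


prompt = "word " * 4000  # 20,000 characters
print(estimate_tokens(prompt))      # 5000
print(fits_context(prompt, 4096))   # False
print(fits_context(prompt, 32768))  # True
```

Since a real tokenizer may produce more tokens than this estimate, for example on code or non-English text, leaving some headroom below the model's limit is a reasonable precaution.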
Recommendation: Set context_window explicitly in a detailed model profile to match your model’s actual limit. By default, Ollama chooses a context window based on the available VRAM:
< 24 GiB VRAM: 4k context
24-48 GiB VRAM: 32k context
>= 48 GiB VRAM: 256k context
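As a rough sketch, the VRAM tiers above can be written as a lookup function. The function name is hypothetical, "4k" is assumed to mean 4 * 1024 tokens, and the last tier is read as 48 GiB and above; Ollama's real detection logic may differ.

```python
KIB = 1024  # assumption: "4k" in the tier list means 4 * 1024 tokens


def default_context_window(vram_gib: float) -> int:
    """Map available VRAM (in GiB) to the default context window from the tier list."""
    if vram_gib < 0:
        raise ValueError("VRAM size cannot be negative")
    if vram_gib < 24:
        return 4 * KIB    # under 24 GiB: 4k context
    if vram_gib < 48:
        return 32 * KIB   # 24 to under 48 GiB: 32k context
    return 256 * KIB      # 48 GiB and above: 256k context


for vram in (8, 24, 48):
    print(vram, default_context_window(vram))
```

Comparing this default with your model's documented limit shows whether an explicit context_window setting is needed.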
Download and install from ollama.ai for your platform (macOS, Linux, Windows).
Or on macOS:
Because Ollama lacks the tokenizer endpoint, it’s especially useful to set context_window explicitly:
Generate this automatically with:
Then edit context_window and other values as needed and run:
Connection refused inside Docker
Use host.docker.internal instead of localhost:
On Linux with Docker, use --network host instead:
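To see which address works from where your code runs, a small reachability probe can help. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and port 11434 is the default mentioned at the top of this page.

```python
import urllib.error
import urllib.request
from typing import Iterable, Optional


def ollama_reachable(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, DNS failures, timeouts, and HTTP error codes.
        return False


def first_reachable(urls: Iterable[str]) -> Optional[str]:
    """Return the first URL that answers, or None if none of them do."""
    for url in urls:
        if ollama_reachable(url):
            return url
    return None
```

For example, calling first_reachable(["http://localhost:11434", "http://host.docker.internal:11434"]) inside the container returns whichever address answers, or None if Ollama cannot be reached at either.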
Model not found
Verify the model is available:
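On the command line, ollama list shows the locally installed models and ollama pull downloads a missing one. The same check can be sketched programmatically against Ollama's GET /api/tags endpoint. The helper names and the sample model names below are illustrative assumptions, and treating a bare name as the :latest tag mirrors Ollama's usual naming convention.

```python
import json
import urllib.request


def installed_models(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [model["name"] for model in tags_payload.get("models", [])]


def has_model(tags_payload: dict, wanted: str) -> bool:
    """Check for a model, treating a bare name as the ':latest' tag."""
    if ":" not in wanted:
        wanted = f"{wanted}:latest"
    return wanted in installed_models(tags_payload)


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Query a running Ollama server for its locally available models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)


# Illustrative payload in the shape returned by /api/tags (model names are examples).
sample = {"models": [{"name": "llama3:latest"}, {"name": "mistral:7b"}]}
print(installed_models(sample))      # ['llama3:latest', 'mistral:7b']
print(has_model(sample, "llama3"))   # True
print(has_model(sample, "mistral"))  # False, only the 7b tag is installed
```

With a running server, has_model(fetch_tags(), "your-model") performs the same check against the live model list; a False result usually means the name or tag in your configuration does not match what is installed.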