For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
Contact usJoin the Discord
ManualAPI GuideAPI Reference
  • Getting started
    • Introduction
    • Quickstart
    • How it works
  • Installation Options
    • Package Install
    • Docker
    • Development
  • Configuration
    • CLI
    • Settings & Profiles
    • Model Configuration
  • Inference Providers
    • Overview
    • Ollama
    • LM Studio
    • LlamaCPP Server
    • vLLM
  • Integrations
    • Overview
    • Claude Code
    • Claude Desktop
    • Claude for Microsoft 365
    • OpenCode
  • Built-in Tools
    • Web Tools
    • Database Tools
  • Storage Providers
    • Vector Store
    • Object Storage
  • User Interface
    • Workbench
  • Observability
    • Observability
  • Reference
    • Troubleshooting
LogoLogo
Contact usJoin the Discord
On this page
  • Capabilities with PrivateGPT
  • Setup
  • Advanced profile example
  • Structured output
Inference Providers

vLLM

Was this page helpful?
Previous

Overview

Next
Built with

vLLM is a production-grade inference engine optimised for high throughput and low latency. It is the only supported provider that exposes a structured output (JSON schema) endpoint, making it the best choice for production deployments and applications requiring reliable schema-constrained responses.

vLLM requires an NVIDIA GPU with CUDA support. It is not designed for CPU-only inference.

Capabilities with PrivateGPT

CapabilityStatus
Model discovery (/v1/models)✅
Tokenizer endpoint (/tokenize)✅
Embeddings✅
Tool / function calling✅ model-dependent
Structured output (JSON schema)✅
Streaming✅
Vision / image input✅ model-dependent
Audio input❌

Setup

1

Prerequisites

  • NVIDIA GPU with CUDA 11.8+ (CUDA 12.x recommended)
  • Docker with NVIDIA Container Toolkit

Verify your setup:

$nvidia-smi
$docker run --gpus all nvidia/cuda:12.0-base nvidia-smi
2

Start vLLM

Docker
pip
$# Example LLM — GPTQ Int4 quantization (~18 GB)
$docker run --gpus all \
> -p 8000:8000 \
> --ipc=host \
> vllm/vllm-openai:latest \
> --model Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 \
> --max-model-len 32768

For an embeddings model, start a second vLLM instance on a different port:

$docker run --gpus all \
> -p 8001:8000 \
> --ipc=host \
> vllm/vllm-openai:latest \
> --model mixedbread-ai/mxbai-embed-large-v1 \
> --task embed

The LLM API is available at http://localhost:8000/v1. If you start the second instance, the embeddings API is available at http://localhost:8001/v1.

3

Run PrivateGPT

Package install
Docker
uv (local)
$OPENAI_API_BASE=http://localhost:8000/v1 \
> OPENAI_EMBEDDING_API_BASE=http://localhost:8001/v1 \
> private-gpt serve

Advanced profile example

1# settings-model.yaml
2llm:
3 default_model: Qwen3.5-35B-A3B-GPTQ-Int4
4
5embedding:
6 default_model: mxbai-embed-large-v1
7
8models:
9 - name: Qwen3.5-35B-A3B-GPTQ-Int4
10 type: llm
11 mode: openai
12 context_window: 32768
13 tokenizer: Qwen/Qwen3.5-35B-A3B
14 support_tools: true
15 support_reasoning: true
16 support_image: 0
17 sampling_params:
18 temperature: 0.6
19 top_p: 0.95
20 top_k: 20
21 min_p: 0.0
22
23 - name: mxbai-embed-large-v1
24 type: embedding
25 mode: openai
26 context_window: 512

If your embeddings model runs on a separate vLLM instance (port 8001):

$OPENAI_API_BASE=http://localhost:8000/v1 \
> OPENAI_EMBEDDING_API_BASE=http://localhost:8001/v1 \
> PGPT_PROFILES=model \
> uv run python -m private_gpt

Structured output

vLLM supports the OpenAI response_format parameter for JSON schema enforcement. When PrivateGPT detects this capability, it uses schema-constrained generation for tool calls and structured responses — significantly more reliable than prompt-based approaches.

No extra configuration is needed; PrivateGPT detects structured output support automatically on startup.