vLLM | PrivateGPT

vLLM is a production-grade inference engine optimised for high throughput and low latency. It is the only supported provider that exposes a structured output (JSON schema) endpoint, making it the best choice for production deployments and applications requiring reliable schema-constrained responses.

vLLM requires an NVIDIA GPU with CUDA support. It is not designed for CPU-only inference.

Capabilities with PrivateGPT

Capability	Status
Model discovery (`/v1/models`)	✅
Tokenizer endpoint (`/tokenize`)	✅
Embeddings	✅
Tool / function calling	✅ model-dependent
Structured output (JSON schema)	✅
Streaming	✅
Vision / image input	✅ model-dependent
Audio input	❌

Setup

Prerequisites

NVIDIA GPU with CUDA 11.8+ (CUDA 12.x recommended)
Docker with NVIDIA Container Toolkit

Verify your setup:

$ nvidia-smi
$ docker run --gpus all nvidia/cuda:12.0-base nvidia-smi

Start vLLM

Docker

pip

$ # Example LLM — GPTQ Int4 quantization (~18 GB)
$ docker run --gpus all \
>   -p 8000:8000 \
>   --ipc=host \
>   vllm/vllm-openai:latest \
>   --model Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 \
>   --max-model-len 32768

For an embeddings model, start a second vLLM instance on a different port:

$ docker run --gpus all \
>   -p 8001:8000 \
>   --ipc=host \
>   vllm/vllm-openai:latest \
>   --model mixedbread-ai/mxbai-embed-large-v1 \
>   --task embed

The LLM API is available at http://localhost:8000/v1. If you start the second instance, the embeddings API is available at http://localhost:8001/v1.

Run PrivateGPT

Package install

Docker

uv (local)

$ OPENAI_API_BASE=http://localhost:8000/v1 \
>   OPENAI_EMBEDDING_API_BASE=http://localhost:8001/v1 \
>   private-gpt serve

Advanced profile example

1 # settings-model.yaml
2 llm:
3   default_model: Qwen3.5-35B-A3B-GPTQ-Int4
4 
5 embedding:
6   default_model: mxbai-embed-large-v1
7 
8 models:
9   - name: Qwen3.5-35B-A3B-GPTQ-Int4
10     type: llm
11     mode: openai
12     context_window: 32768
13     tokenizer: Qwen/Qwen3.5-35B-A3B
14     support_tools: true
15     support_reasoning: true
16     support_image: 0
17     sampling_params:
18       temperature: 0.6
19       top_p: 0.95
20       top_k: 20
21       min_p: 0.0
22 
23   - name: mxbai-embed-large-v1
24     type: embedding
25     mode: openai
26     context_window: 512

If your embeddings model runs on a separate vLLM instance (port 8001):

$ OPENAI_API_BASE=http://localhost:8000/v1 \
>   OPENAI_EMBEDDING_API_BASE=http://localhost:8001/v1 \
>   PGPT_PROFILES=model \
>   uv run python -m private_gpt

Structured output

vLLM supports the OpenAI response_format parameter for JSON schema enforcement. When PrivateGPT detects this capability, it uses schema-constrained generation for tool calls and structured responses — significantly more reliable than prompt-based approaches.

No extra configuration is needed; PrivateGPT detects structured output support automatically on startup.