vLLM
vLLM is a production-grade inference engine optimised for high throughput and low latency. It is the only supported provider that exposes a structured output (JSON schema) endpoint, making it the best choice for production deployments and applications requiring reliable schema-constrained responses.
vLLM requires an NVIDIA GPU with CUDA support. It is not designed for CPU-only inference.
Capabilities with PrivateGPT
Setup
Prerequisites
- NVIDIA GPU with CUDA 11.8+ (CUDA 12.x recommended)
- Docker with NVIDIA Container Toolkit
Verify your setup:
Advanced profile example
If your embeddings model runs on a separate vLLM instance (port 8001):
Structured output
vLLM supports the OpenAI response_format parameter for JSON schema enforcement. When PrivateGPT detects this capability, it uses schema-constrained generation for tool calls and structured responses — significantly more reliable than prompt-based approaches.
No extra configuration is needed; PrivateGPT detects structured output support automatically on startup.

