vLLM is a production-grade inference engine optimised for high throughput and low latency. It is the only supported provider that exposes a structured output (JSON schema) endpoint, making it the best choice for production deployments and applications requiring reliable schema-constrained responses.
vLLM requires an NVIDIA GPU with CUDA support. It is not designed for CPU-only inference.
Verify your setup:
If your embeddings model runs on a separate vLLM instance (port 8001):
vLLM supports the OpenAI response_format parameter for JSON schema enforcement. When PrivateGPT detects this capability, it uses schema-constrained generation for tool calls and structured responses — significantly more reliable than prompt-based approaches.
No extra configuration is needed; PrivateGPT detects structured output support automatically on startup.