LLM Backends

Running the Server

PrivateGPT supports running with different LLMs & setups.

Local models

Both the LLM and the Embeddings model will run locally.

Make sure you have followed the Local LLM requirements section before moving on.

This command will start PrivateGPT using the settings.yaml (default profile) together with the settings-local.yaml configuration file. By default, it will enable both the API and the Gradio UI. Run:

PGPT_PROFILES=local make run

or

PGPT_PROFILES=local poetry run python -m private_gpt

When the server has started it will print the log line "Application startup complete". Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API using Swagger UI.
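To quickly check from a terminal that the server is up, you can query the health endpoint with curl (this assumes the default port and that your PrivateGPT version exposes the health route):

curl http://localhost:8001/health
# Should return a small JSON status payload if the server is running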

Customizing low level parameters

Currently, not all the parameters of llama.cpp and llama-cpp-python are exposed through PrivateGPT’s settings.yaml file. If you need to customize parameters such as the number of layers loaded into the GPU, you can change them in llm_component.py, located at private_gpt/components/llm/llm_component.py.
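For example, to control how many layers llama.cpp offloads to the GPU, you would pass n_gpu_layers through the model_kwargs that llama-cpp-python receives. The snippet below is only a sketch of the kind of change involved, assuming the component builds a llama-index LlamaCPP instance; the exact surrounding code, settings names, and import path depend on your PrivateGPT and llama-index versions.

from llama_index.llms import LlamaCPP

llm = LlamaCPP(
    model_path="models/your-model.gguf",  # placeholder path to a local GGUF model
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    # kwargs forwarded to llama-cpp-python; -1 offloads all layers to the GPU
    model_kwargs={"n_gpu_layers": -1},
    verbose=True,
)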

Available LLM config options

The llm section of the settings allows for the following configurations:

  • mode: how to run your LLM
  • max_new_tokens: this lets you configure the number of new tokens the LLM will generate and add to the context window (by default llama.cpp uses 256)

Example:

llm:
  mode: local
  max_new_tokens: 256

If you are getting an out-of-memory error, you might also try a smaller model or stick to the recommended models, instead of custom-tuning the parameters.

Using OpenAI

If you cannot run a local model (because you don’t have a GPU, for example) or for testing purposes, you may decide to run PrivateGPT using OpenAI as the LLM and Embeddings model.

In order to do so, create a profile settings-openai.yaml with the following contents:

llm:
  mode: openai

openai:
  api_base: <openai-api-base-url> # Defaults to https://api.openai.com/v1
  api_key: <your_openai_api_key> # You could skip this configuration and use the OPENAI_API_KEY env var instead
  model: <openai_model_to_use> # Optional model to use. Default is "gpt-3.5-turbo"
  # Note: OpenAI models are listed here: https://platform.openai.com/docs/models
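If you prefer not to store the key in the profile file, you can rely on the environment variable mentioned in the comment above and export it before launching the server:

export OPENAI_API_KEY=<your_openai_api_key>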

And run PrivateGPT loading that profile you just created:

PGPT_PROFILES=openai make run

or

PGPT_PROFILES=openai poetry run python -m private_gpt

When the server has started it will print the log line "Application startup complete". Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API. You’ll notice the speed and quality of the responses are higher, since OpenAI’s servers are doing the heavy computation.

Using OpenAI compatible API

Many tools, including LocalAI and vLLM, support serving local models with an OpenAI-compatible API. However, the openai mode does not work with these custom models, even if you override api_base; use the openailike mode instead:

1llm:
2 mode: openailike

This mode uses the same settings as the openai mode.

As an example, you can follow the vLLM quickstart guide to run an OpenAI compatible server. Then, you can run PrivateGPT using the settings-vllm.yaml profile:

PGPT_PROFILES=vllm make run
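For reference, a minimal settings-vllm.yaml could look like the sketch below, assuming a vLLM OpenAI-compatible server listening on http://localhost:8000/v1; the api_base and model values are placeholders to adjust to your own deployment:

llm:
  mode: openailike

openai:
  api_base: http://localhost:8000/v1
  model: <model_served_by_vllm>  # e.g. the model name you passed to vLLM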

Using Azure OpenAI

If you cannot run a local model (because you don’t have a GPU, for example) or for testing purposes, you may decide to run PrivateGPT using Azure OpenAI as the LLM and Embeddings model.

In order to do so, create a profile settings-azopenai.yaml with the following contents:

llm:
  mode: azopenai

embedding:
  mode: azopenai

azopenai:
  api_key: <your_azopenai_api_key> # You could skip this configuration and use the AZ_OPENAI_API_KEY env var instead
  azure_endpoint: <your_azopenai_endpoint> # You could skip this configuration and use the AZ_OPENAI_ENDPOINT env var instead
  api_version: <api_version> # The API version to use. Default is "2023-05-15"
  embedding_deployment_name: <your_embedding_deployment_name> # You could skip this configuration and use the AZ_OPENAI_EMBEDDING_DEPLOYMENT_NAME env var instead
  embedding_model: <openai_embeddings_to_use> # Optional model to use. Default is "text-embedding-ada-002"
  llm_deployment_name: <your_model_deployment_name> # You could skip this configuration and use the AZ_OPENAI_LLM_DEPLOYMENT_NAME env var instead
  llm_model: <openai_model_to_use> # Optional model to use. Default is "gpt-35-turbo"
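As with the OpenAI setup, the key and endpoint can be kept out of the profile file by exporting the environment variables named in the comments above before launching the server:

export AZ_OPENAI_API_KEY=<your_azopenai_api_key>
export AZ_OPENAI_ENDPOINT=<your_azopenai_endpoint>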

And run PrivateGPT loading that profile you just created:

PGPT_PROFILES=azopenai make run

or

PGPT_PROFILES=azopenai poetry run python -m private_gpt

When the server has started it will print the log line "Application startup complete". Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API. You’ll notice the speed and quality of the responses are higher, since Azure OpenAI’s servers are doing the heavy computation.

Using AWS Sagemaker

For a fully private & performant setup, you can choose to have both your LLM and Embeddings model deployed using Sagemaker.

Note: how to deploy models on Sagemaker is out of the scope of this documentation.

In order to do so, create a profile settings-sagemaker.yaml with the following contents (remember to update the values of the llm_endpoint_name and embedding_endpoint_name to yours):

llm:
  mode: sagemaker

sagemaker:
  llm_endpoint_name: huggingface-pytorch-tgi-inference-2023-09-25-19-53-32-140
  embedding_endpoint_name: huggingface-pytorch-inference-2023-11-03-07-41-36-479
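The endpoints are invoked through the AWS SDK, so valid AWS credentials and a region must be available to the PrivateGPT process. For example, assuming you authenticate with environment variables rather than an IAM role or a shared credentials file:

export AWS_ACCESS_KEY_ID=<your_access_key_id>
export AWS_SECRET_ACCESS_KEY=<your_secret_access_key>
export AWS_DEFAULT_REGION=<region_of_your_sagemaker_endpoints>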

And run PrivateGPT loading that profile you just created:

PGPT_PROFILES=sagemaker make run

or

PGPT_PROFILES=sagemaker poetry run python -m private_gpt

When the server has started it will print the log line "Application startup complete". Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.

Using Ollama

Another option for a fully private setup is using Ollama.

Note: how to deploy Ollama and pull models onto it is out of the scope of this documentation.

In order to do so, create a profile settings-ollama.yaml with the following contents:

llm:
  mode: ollama

ollama:
  model: <ollama_model_to_use> # Required. Model to use.
  # Note: Ollama models are listed here: https://ollama.ai/library
  # Be sure to pull the model to your Ollama server
  api_base: <ollama-api-base-url> # Defaults to http://localhost:11434
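For example, assuming you want to use Mistral, pull it on the machine where Ollama is running and reference it in the profile:

ollama pull mistral

Then set model: mistral in the settings-ollama.yaml shown above.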

And run PrivateGPT loading that profile you just created:

PGPT_PROFILES=ollama make run

or

PGPT_PROFILES=ollama poetry run python -m private_gpt

When the server has started it will print the log line "Application startup complete". Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.

Using IPEX-LLM

For a fully private setup on Intel GPUs (such as a local PC with an iGPU, or discrete GPUs like Arc, Flex, and Max), you can use IPEX-LLM.

To deploy Ollama and pull models using IPEX-LLM, please refer to this guide. Then, follow the same steps outlined in the Using Ollama section to create a settings-ollama.yaml profile and run the PrivateGPT server.

Using Gemini

If you cannot run a local model (because you don’t have a GPU, for example) or for testing purposes, you may decide to run PrivateGPT using Gemini as the LLM and Embeddings model. In addition, you will benefit from multimodal inputs, such as text and images, in a very large contextual window.

In order to do so, create a profile settings-gemini.yaml with the following contents:

llm:
  mode: gemini

embedding:
  mode: gemini

gemini:
  api_key: <your_gemini_api_key> # You could skip this configuration and use the GEMINI_API_KEY env var instead
  model: <gemini_model_to_use> # Optional model to use. Default is "models/gemini-pro"
  embedding_model: <gemini_embeddings_to_use> # Optional model to use. Default is "models/embedding-001"
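As with the other API-based setups, the key can instead be provided through the environment variable mentioned in the comment above:

export GEMINI_API_KEY=<your_gemini_api_key>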

And run PrivateGPT loading that profile you just created:

PGPT_PROFILES=gemini make run

or

PGPT_PROFILES=gemini poetry run python -m private_gpt

When the server has started it will print the log line "Application startup complete". Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.