Initiate Async Chat Stream | PrivateGPT

Initiate an asynchronous chat completion stream.

This endpoint starts an asynchronous chat completion process that streams events. Unlike synchronous chat, this endpoint returns immediately with a message_id that can be used to observe the stream progress.

Key Features:

- Asynchronous Processing: Non-blocking request handling with an immediate response
- Stream Observation: Use the returned message_id to observe real-time events
- Synchronous Parity: Behaves like synchronous chat, but processes the request asynchronously

Notes:

- An optional message_id query parameter sets a custom stream identifier
- Stream events follow the same format as synchronous chat responses
- Stream status can be monitored via the status endpoint
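The initiate-then-observe flow described above can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the host comes from this page's sample request, the "user" role and the helper names are assumptions, and any authentication headers your deployment requires are omitted.

```python
import json
import urllib.request

# Assumption: host copied from this page's sample request; replace with your own.
BASE_URL = "https://host.com"


def build_initiate_request(payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST that starts an async chat stream."""
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/messages/async",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def initiate(payload: dict) -> str:
    """Send the request and return the message_id from the immediate response."""
    with urllib.request.urlopen(build_initiate_request(payload)) as response:
        return json.load(response)["message_id"]


# Only the three required body fields are set; the "user" role is an assumption.
request = build_initiate_request(
    {
        "model": "default",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 256,
    }
)
```

Calling `initiate(...)` against a live server should return quickly with the identifier; observing the stream and checking its status happen through other endpoints, which this page does not describe.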


Query parameters

message_id (string, Optional)

Optional custom identifier for the stream. If not provided, a unique ID will be generated automatically.
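As a rough illustration of supplying a custom identifier, the sketch below appends message_id as a query parameter only when one is given. The host comes from this page's sample request, and the helper name `async_url` and the choice of a UUID are assumptions made for the example.

```python
import uuid
from typing import Optional
from urllib.parse import urlencode

# Assumption: host copied from this page's sample request.
ENDPOINT = "https://host.com/v1/messages/async"


def async_url(message_id: Optional[str] = None) -> str:
    """Return the endpoint URL, adding ?message_id=... only when an ID is supplied."""
    if message_id is None:
        return ENDPOINT  # the server generates a unique ID in this case
    return f"{ENDPOINT}?{urlencode({'message_id': message_id})}"


# A UUID is just one convenient way to produce a custom identifier.
custom_id = str(uuid.uuid4())
url_with_custom_id = async_url(custom_id)
```

Using `urlencode` keeps identifiers that contain spaces or reserved characters safe to place in the query string.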

Request

This endpoint expects an object.

model (string, Required, defaults to "default")

Model identifier or alias.

messages (list of objects, Required)

Conversation messages for the request.

max_tokens (integer, Required, >=1)

Maximum number of tokens to generate in the response.
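Before sending a request, a client might check the three required fields locally so that obvious mistakes do not surface as a 422 response. The sketch below is an assumption about what such a check could look like; the server's real validation may be stricter or more lenient (for example, it may fill in the default model).

```python
def validate_required(payload: dict) -> list[str]:
    """Return a list of problems with the required fields (empty if none found)."""
    problems = []
    if not isinstance(payload.get("model"), str):
        problems.append("model must be a string")
    if not isinstance(payload.get("messages"), list):
        problems.append("messages must be a list of objects")
    max_tokens = payload.get("max_tokens")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if (
        not isinstance(max_tokens, int)
        or isinstance(max_tokens, bool)
        or max_tokens < 1
    ):
        problems.append("max_tokens must be an integer >= 1")
    return problems


problems = validate_required({"model": "default", "messages": [], "max_tokens": 0})
```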

system (list of objects, Optional)

System prompt input. Accepts str, list[str], System, list[System], or null. It is normalized internally to list[System].
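The normalization described above might look roughly like the following. The shape of a System object is not given on this page, so the `{"type": "text", "text": ...}` dictionary used here is purely hypothetical, as is the choice to map null to an empty list.

```python
def normalize_system(system) -> list[dict]:
    """Coerce str, list[str], a System dict, a list of them, or None into a list."""
    if system is None:
        return []  # assumption: null normalizes to an empty list
    items = system if isinstance(system, list) else [system]
    normalized = []
    for item in items:
        if isinstance(item, str):
            # Hypothetical System shape; the real schema is not documented here.
            normalized.append({"type": "text", "text": item})
        elif isinstance(item, dict):
            normalized.append(item)
        else:
            raise TypeError(f"unsupported system entry: {type(item).__name__}")
    return normalized


single = normalize_system("Be concise.")
several = normalize_system(["Be concise.", "Answer in English."])
```

Because the server performs this step itself, a client can pass whichever of the accepted forms is most convenient.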

tools (list of objects, Optional)

Optional tool definitions.

thinking (object, Optional)

Thinking configuration.

tool_choice (object, Optional)

Tool selection policy.

output_config (object, Optional)

Optional output configuration options.

cache_control (object, Optional)

Optional request-level cache control.

stream (boolean, Optional)

Whether to stream the response back to the client.

tool_context (list of objects, Optional)

Context to provide to the tools, such as documents, database connection strings, or data relevant to tool usage.

mcp_servers (list of objects, Optional)

List of MCP servers to use for tool retrieval. Each server can have its own configuration.

container (string, Optional)

Container identifier for reuse across requests.

response_format (object, Optional)

Deprecated response format. Use output_config.format instead.

priority (integer, Optional)

Priority of the request, used for prioritizing responses.

seed (integer, Optional)

Random seed for reproducibility.

min_p (double, Optional)

Minimum probability threshold for token selection. Tokens with probability below this value are filtered out.

top_p (double, Optional, 0-1)

Nucleus sampling parameter. Only tokens with cumulative probability up to this value are considered.

temperature (double, Optional, 0-1)

Controls randomness in generation. Higher values make output more random, lower values more deterministic.
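To make the effect of temperature concrete, here is a small sketch of the usual temperature-scaled softmax. It illustrates the general technique rather than this server's implementation, and treating a temperature of 0 as greedy selection is an assumption.

```python
import math


def token_probabilities(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Assumption: 0 means greedy decoding (all mass on the largest logit).
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


logits = [1.0, 2.0, 3.0]
cool = token_probabilities(logits, 0.5)  # sharper: favors the top token more
warm = token_probabilities(logits, 1.0)  # flatter: spreads probability out
```

Lower temperatures concentrate probability on the highest-scoring token, which is why output becomes more deterministic.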

top_k (integer, Optional, >=0)

Limits token selection to the top K most likely tokens at each step.
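The three filters (min_p, top_p, top_k) can be illustrated on a toy distribution. This sketch follows the wording on this page literally: min_p is treated as an absolute threshold, although some inference engines scale it by the top token's probability. Treating top_k = 0 as "no limit", the order in which the filters run, and keeping the token that crosses the top_p boundary are all assumptions.

```python
def filter_candidates(
    probs: dict[str, float],
    top_k: int = 0,
    top_p: float = 1.0,
    min_p: float = 0.0,
) -> dict[str, float]:
    """Apply top_k, min_p, and top_p in turn, then renormalize the survivors."""
    ranked = sorted(probs.items(), key=lambda pair: pair[1], reverse=True)
    if top_k > 0:  # assumption: 0 disables the top-k cut
        ranked = ranked[:top_k]
    ranked = [(token, p) for token, p in ranked if p >= min_p]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:  # nucleus reached; drop the remaining tail
            break
    if not kept:
        return {}
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
candidates = filter_candidates(probs, top_k=3, top_p=0.75, min_p=0.1)
```

With these settings only "the" and "a" survive, and their probabilities are rescaled so they sum to 1 before sampling.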

repetition_penalty (double, Optional)

Penalty applied to tokens that have already appeared in the sequence to reduce repetition.

presence_penalty (double, Optional)

Penalty applied based on whether a token has appeared in the text, encouraging topic diversity.

frequency_penalty (double, Optional)

Penalty applied based on how frequently a token appears in the text, reducing repetitive content.
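The difference between the three penalties is easier to see in code. The sketch below uses formulations that are common in open-source inference stacks: a multiplicative repetition penalty, a flat presence penalty, and a count-scaled frequency penalty. Whether this backend uses exactly these formulas is an assumption; the page does not say.

```python
from collections import Counter


def apply_penalties(
    logits: dict[str, float],
    generated: list[str],
    repetition_penalty: float = 1.0,
    presence_penalty: float = 0.0,
    frequency_penalty: float = 0.0,
) -> dict[str, float]:
    """Return a copy of logits with penalties applied to already-seen tokens."""
    counts = Counter(generated)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token not in adjusted:
            continue
        value = adjusted[token]
        # Repetition: shrink positive logits, push negative ones further down.
        value = value / repetition_penalty if value > 0 else value * repetition_penalty
        value -= presence_penalty           # flat cost for having appeared at all
        value -= frequency_penalty * count  # cost grows with each occurrence
        adjusted[token] = value
    return adjusted


adjusted = apply_penalties(
    {"cat": 2.0, "dog": 1.0},
    generated=["cat", "cat"],
    presence_penalty=0.5,
    frequency_penalty=0.25,
)
```

Here "cat" drops from 2.0 to 1.0 (0.5 for presence plus 0.25 for each of its two occurrences), while the unseen "dog" is unchanged.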

stop_sequences (list of strings, Optional)

Custom stop sequences that stop generation when matched.
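Conceptually, stop sequences cut generation at the earliest match. The following sketch shows that idea on a finished string; it assumes the matched sequence itself is excluded from the output, which this page does not specify.

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Return text up to (not including) the earliest stop sequence, if any."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # an empty string would match at position 0; ignore it
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


truncated = truncate_at_stop("Answer: 42\nEND more text", ["END", "\n\n"])
```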

metadata (object, Optional)

Request metadata (for example, user_id).

service_tier (enum, Optional)

Service tier preference (for example, “auto” or “standard_only”).

inference_geo (string, Optional)

Geographic region hint for inference processing.

correlation_id (string, Optional)

Correlation ID for tracking the request across systems.

maximum_loaded_skills (integer, Optional, >=1)

Optional cap for concurrently loaded skills in a conversation. When exceeded, the oldest loaded skill is evicted.
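The eviction rule above amounts to a first-in, first-out cache. The sketch below models it with an ordered dictionary; the class and method names are hypothetical, and the assumption that reloading an already-loaded skill does not refresh its age is mine, not the documentation's.

```python
from collections import OrderedDict
from typing import Optional


class LoadedSkills:
    """Tracks loaded skills and evicts the oldest one when the cap is exceeded."""

    def __init__(self, maximum_loaded_skills: int):
        if maximum_loaded_skills < 1:
            raise ValueError("maximum_loaded_skills must be >= 1")
        self.maximum = maximum_loaded_skills
        self._skills: OrderedDict = OrderedDict()

    def load(self, name: str, skill: object) -> Optional[str]:
        """Load a skill; return the name of the evicted skill, if any."""
        if name in self._skills:
            self._skills[name] = skill  # assumption: keeps its original position
            return None
        self._skills[name] = skill
        if len(self._skills) > self.maximum:
            evicted, _ = self._skills.popitem(last=False)  # oldest entry first
            return evicted
        return None

    def names(self) -> list[str]:
        return list(self._skills)


skills = LoadedSkills(maximum_loaded_skills=2)
skills.load("search", object())
skills.load("calendar", object())
evicted = skills.load("email", object())  # exceeds the cap of 2
```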

context_management (any, Optional)

Response

This endpoint returns an object.

message_id (string)

Unique identifier for the initiated stream

status (enum)

Initial status of the stream (typically ‘pending’)

message (string, defaults to "Request initiated successfully")

Confirmation message for successful stream initiation
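A client might model the response object with a small dataclass. The field names and the default message come from this page; the class and function names are assumptions, and status is kept as a plain string because the full set of enum values is not listed here.

```python
from dataclasses import dataclass

DEFAULT_MESSAGE = "Request initiated successfully"


@dataclass
class InitiateResponse:
    message_id: str
    status: str  # enum on the server side; only 'pending' is mentioned on this page
    message: str = DEFAULT_MESSAGE


def parse_response(data: dict) -> InitiateResponse:
    """Build an InitiateResponse from the decoded JSON body."""
    return InitiateResponse(
        message_id=data["message_id"],
        status=data["status"],
        message=data.get("message", DEFAULT_MESSAGE),
    )


parsed = parse_response({"message_id": "abc123", "status": "pending"})
```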

Errors

401

Unauthorized

422

Unprocessable Entity
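The two documented error codes can be turned into readable messages on the client side. The sketch below is one possible approach using the standard library; the helper names are hypothetical, and the explanatory text reflects the general HTTP meaning of these status codes rather than anything specific to this server.

```python
import urllib.error
import urllib.request

# General HTTP meanings of the two status codes listed on this page.
DOCUMENTED_ERRORS = {
    401: "Unauthorized: the request lacks valid authentication credentials",
    422: "Unprocessable Entity: the request body failed validation",
}


def explain(status: int) -> str:
    """Map a status code to a readable message, with a fallback for other codes."""
    return DOCUMENTED_ERRORS.get(status, f"Unexpected HTTP status {status}")


def send(request: urllib.request.Request) -> bytes:
    """Send a prepared request, converting HTTP errors into RuntimeError."""
    try:
        with urllib.request.urlopen(request) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        raise RuntimeError(explain(error.code)) from error
```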

curl -X POST https://host.com/v1/messages/async \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model",
    "messages": [
      {
        "role": "system",
        "content": "content"
      }
    ],
    "max_tokens": 1
  }'