Async messages
The async messages API decouples request submission from response consumption. Instead of holding an HTTP connection open until the model finishes, you start a job, get back a message_id, and consume the output whenever you are ready.
Use async when:
- Generation is long and you don’t want to hold a connection open.
- You need to fan out multiple requests and collect results later.
- You want to let a background worker process the job while the caller does other work.
Async chat uses an in-process stream broker — no separate worker needed. Use memory for single-instance deployments or redis for multi-instance.
Lifecycle
Start a job
Response:
The request body is identical to POST /v1/messages — all fields (tools, tool_context, mcp_servers, sampling params) work the same way. The same per-tool dependency rules apply here: install the specific extra you need, or use private-gpt[tools] or private-gpt[core].
Stream the output
Connect an SSE client to receive events as the model generates them. Events follow the same format as synchronous streaming:
The connection stays open until the job completes, fails, or is cancelled. You can connect and disconnect at any time — the stream replays from the last position on reconnect.
Check status
Poll the status endpoint to inspect the job without consuming the stream:
Response:
Status values:
Cancel a job
Cancel a job while it is pending or processing:
Clean up
Delete the job and free associated resources once you have consumed the result:
Stream broker
Async chat streams are handled in-process — no separate worker required. The broker is configured under stream.broker in settings.yaml:
For Redis, configure the connection:

