Messages
Generate a chat completion from a conversation history.
This endpoint enables multi-turn conversations with the AI model, with
optional tool support and comprehensive message validation.
Key Features:
- Multi-turn conversations: Support for system, user, and assistant
messages
- Tool Support: Full tool use/result validation with automatic or manual
selection
- Citations: Enable `system.citations.enabled` to include references in
responses
- Streaming: Enable `stream` to receive partial updates in real time
- Default Prompts: Enable `system.use_default_prompt` to use Zylon's
default prompts
- Thinking: Enable `thinking.enabled` for step-by-step reasoning
capabilities
- Sampling Parameters: Control randomness with temperature, top_p,
top_k, etc.
Notes:
- Tool use/result blocks must be properly paired within assistant
messages
- Tool choice type must be 'auto', 'tool', or 'none'
- When tool_choice.type is 'tool', tool_choice.name must specify a
valid tool
- All message content is validated for completeness and proper structure
- Last message must be from user or assistant for proper conversation
flow
- MCP servers provide external tool capabilities via Model Context
Protocol
- Sampling parameters control response randomness and token selection
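As a rough illustration of the tool-choice and last-message notes above, the following sketch checks a request payload on the client side before sending it. The function name `check_request`, the error strings, and the assumption that each tool definition carries a `name` field are hypothetical; the server's actual validation may differ.

```python
from typing import Any

ALLOWED_TOOL_CHOICE_TYPES = {"auto", "tool", "none"}
ALLOWED_LAST_ROLES = {"user", "assistant"}


def check_request(payload: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems: list[str] = []

    # The last message must come from the user or the assistant.
    messages = payload.get("messages") or []
    if messages and messages[-1].get("role") not in ALLOWED_LAST_ROLES:
        problems.append("last message must be from user or assistant")

    tool_choice = payload.get("tool_choice")
    if tool_choice is not None:
        choice_type = tool_choice.get("type")
        if choice_type not in ALLOWED_TOOL_CHOICE_TYPES:
            problems.append("tool_choice.type must be 'auto', 'tool', or 'none'")
        elif choice_type == "tool":
            # Assumption: each entry in `tools` has a "name" field.
            tool_names = {t.get("name") for t in payload.get("tools") or []}
            name = tool_choice.get("name")
            if not name or name not in tool_names:
                problems.append("tool_choice.name must specify a valid tool")
    return problems
```

A payload that forces a tool which is not listed in `tools` would yield one problem string, while a payload with `tool_choice.type` set to `auto` passes these checks.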
Request
This endpoint expects an object.
model
Model identifier or alias.
messages
Conversation messages for the request.
max_tokens
Maximum number of tokens to generate in the response.
system
System prompt input. Accepts str, list[str], System, list[System], or null. It is normalized internally to list[System].
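The normalization described for `system` could look roughly like the sketch below. The shape of a System object is not given here, so this assumes a plain dict with a `text` key, and it assumes that null becomes an empty list; both are illustrative choices, not the documented behavior.

```python
from typing import Any, Union

# Assumption: a System object is modeled as a dict with a "text" key.
System = dict[str, Any]
SystemInput = Union[str, list, System, None]


def normalize_system(value: SystemInput) -> list[System]:
    """Coerce str, list[str], System, list[System], or None to list[System]."""
    if value is None:
        return []  # assumption: null normalizes to an empty list
    if isinstance(value, str):
        return [{"text": value}]
    if isinstance(value, dict):
        return [value]
    if isinstance(value, list):
        normalized: list[System] = []
        for item in value:
            if isinstance(item, str):
                normalized.append({"text": item})
            elif isinstance(item, dict):
                normalized.append(item)
            else:
                raise TypeError(f"unsupported system entry: {type(item).__name__}")
        return normalized
    raise TypeError(f"unsupported system value: {type(value).__name__}")
```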
tools
Optional tool definitions.
thinking
Thinking configuration.
tool_choice
Tool selection policy.
output_config
Optional output configuration options.
cache_control
Optional request-level cache control.
stream
Whether to stream the response back to the client.
tool_context
Context to provide to the tools, such as documents,
database connection strings, or other data relevant to tool usage.
mcp_servers
List of MCP servers to use for tool retrieval. Each server can have its own configuration.
container
Container identifier for reuse across requests.
response_format
Deprecated response format. Use output_config.format instead.
priority
Priority of the request, used for prioritizing responses.
seed
Random seed for reproducibility.
min_p
Minimum probability threshold for token selection. Tokens with probability below this value are filtered out.
top_p
Nucleus sampling parameter. Only tokens with cumulative probability up to this value are considered.
temperature
Controls randomness in generation. Higher values make output more random, lower values more deterministic.
top_k
Limits token selection to the top K most likely tokens at each step.
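To make the three filtering parameters (`min_p`, `top_p`, `top_k`) concrete, here is a minimal sketch over a toy token distribution. It follows the descriptions above literally; the function name `filter_tokens`, the order in which the filters are applied, and the final renormalization are assumptions, and the model server may implement these differently.

```python
from typing import Optional


def filter_tokens(
    probs: dict[str, float],
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    min_p: Optional[float] = None,
) -> dict[str, float]:
    """Filter a token distribution, then renormalize the survivors."""
    # Rank tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

    if top_k is not None:
        # Keep only the K most likely tokens.
        ranked = ranked[:top_k]

    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    if min_p is not None:
        # Drop tokens whose probability is below the threshold.
        ranked = [(token, p) for token, p in ranked if p >= min_p]

    if not ranked:
        raise ValueError("all tokens were filtered out")

    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}
```

With the distribution `{"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}`, setting `top_k=2` keeps `a` and `b`, and setting `min_p=0.1` drops only `d`.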
repetition_penalty
Penalty applied to tokens that have already appeared in the sequence to reduce repetition.
presence_penalty
Penalty applied based on whether a token has appeared in the text, encouraging topic diversity.
frequency_penalty
Penalty applied based on how frequently a token appears in the text, reducing repetitive content.
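The three penalties can be pictured as adjustments to token scores (logits) before sampling. The sketch below uses formulas that are common in open-source inference engines: a multiplicative repetition penalty, a flat presence penalty, and a count-scaled frequency penalty. Whether this endpoint uses exactly these formulas is an assumption, and `apply_penalties` is a hypothetical helper.

```python
from collections import Counter


def apply_penalties(
    logits: dict[str, float],
    generated: list[str],
    presence_penalty: float = 0.0,
    frequency_penalty: float = 0.0,
    repetition_penalty: float = 1.0,
) -> dict[str, float]:
    """Return adjusted logits for tokens that already appear in `generated`."""
    counts = Counter(generated)
    adjusted: dict[str, float] = {}
    for token, logit in logits.items():
        count = counts.get(token, 0)
        if count > 0:
            # Repetition: shrink positive logits, push negative ones lower.
            if logit > 0:
                logit = logit / repetition_penalty
            else:
                logit = logit * repetition_penalty
            # Presence: flat penalty once the token has appeared at all.
            logit -= presence_penalty
            # Frequency: penalty grows with the number of occurrences.
            logit -= frequency_penalty * count
        adjusted[token] = logit
    return adjusted
```

Tokens that have not yet appeared are left unchanged, so the defaults (`0.0`, `0.0`, `1.0`) make the function a no-op.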
stop_sequences
Custom stop sequences that stop generation when matched.
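A minimal sketch of how stop-sequence matching might behave: generation ends at the earliest occurrence of any configured sequence, and the matched sequence is reported separately. Excluding the matched sequence from the returned text is an assumption, and `truncate_at_stop` is a hypothetical helper rather than part of the API.

```python
from typing import Optional


def truncate_at_stop(
    text: str, stop_sequences: list[str]
) -> tuple[str, Optional[str]]:
    """Cut `text` at the earliest stop sequence; return (text, matched)."""
    earliest: Optional[int] = None
    matched: Optional[str] = None
    for sequence in stop_sequences:
        if not sequence:
            continue  # ignore empty strings, which would match everywhere
        index = text.find(sequence)
        if index != -1 and (earliest is None or index < earliest):
            earliest, matched = index, sequence
    if earliest is None:
        return text, None
    return text[:earliest], matched
```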
metadata
Request metadata (for example, user_id).
service_tier
Service tier preference (for example, “auto” or “standard_only”).
inference_geo
Geographic region hint for inference processing.
correlation_id
Correlation ID for tracking the request across systems.
maximum_loaded_skills
Optional cap for concurrently loaded skills in a conversation. When exceeded, the oldest loaded skill is evicted.
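The eviction rule for `maximum_loaded_skills` resembles a first-in, first-out cache. The sketch below shows one way such a cap could work; the `LoadedSkills` class and its methods are hypothetical, and treating a reload of an already loaded skill as keeping its original position is an assumption.

```python
from collections import OrderedDict
from typing import Any, Optional


class LoadedSkills:
    """Tracks loaded skills and evicts the oldest one when over the cap."""

    def __init__(self, maximum_loaded_skills: Optional[int] = None) -> None:
        self._maximum = maximum_loaded_skills
        self._skills: "OrderedDict[str, Any]" = OrderedDict()

    def load(self, name: str, payload: Any) -> list[str]:
        """Load a skill and return the names of any skills evicted."""
        evicted: list[str] = []
        if name in self._skills:
            # Assumption: reloading keeps the skill's original load order.
            self._skills[name] = payload
            return evicted
        self._skills[name] = payload
        while self._maximum is not None and len(self._skills) > self._maximum:
            oldest, _ = self._skills.popitem(last=False)
            evicted.append(oldest)
        return evicted

    def names(self) -> list[str]:
        return list(self._skills)
```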
context_management
Response
This endpoint returns an object.
id
Message identifier.
type
Object type.
role
Message author role.
content
Assistant content blocks.
model
Model name used.
container
Optional execution container.
stop_details
Optional structured stop details.
stop_reason
Message stop reason.
stop_sequence
Matched stop sequence, if any.
usage
Token usage stats.
Errors
401
Unauthorized
422
Unprocessable Entity
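On the client side, the two documented error codes can be mapped to distinct exception types so that authentication failures and validation failures are handled separately. The class names and the `raise_for_status` helper below are hypothetical, and the structure of the error body is not assumed beyond a free-form detail string.

```python
class MessagesAPIError(Exception):
    """Base error for non-success responses from the endpoint."""

    def __init__(self, status: int, detail: str = "") -> None:
        super().__init__(f"{status}: {detail}")
        self.status = status
        self.detail = detail


class UnauthorizedError(MessagesAPIError):
    """401: credentials are missing or invalid."""


class UnprocessableEntityError(MessagesAPIError):
    """422: the request body failed validation."""


_ERROR_TYPES = {401: UnauthorizedError, 422: UnprocessableEntityError}


def raise_for_status(status: int, detail: str = "") -> None:
    """Raise a typed error for non-2xx statuses; do nothing on success."""
    if 200 <= status < 300:
        return
    error_type = _ERROR_TYPES.get(status, MessagesAPIError)
    raise error_type(status, detail)
```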

