vLLM

Estimated reading: 4 minutes

The vLLM component enables efficient execution of large language models by connecting to a vLLM server through an OpenAI-compatible API. It is designed for scenarios that require fast, scalable, and reliable response generation across AI workflows such as Agents, RAG pipelines, and conversational applications.

By integrating with deployed vLLM models, this component allows workflows to generate responses in both streaming and standard modes. It also provides control over execution behavior, including response handling, retry mechanisms, and timing, ensuring consistent and optimized performance.

Integration Behavior

When used within Robility Flow, the vLLM component:

1. Sends structured inference requests to the configured vLLM endpoint
2. Combines system instructions and user input into model-ready prompts
3. Supports both streaming and batch inference modes
4. Manages retries, timeouts, and execution delays for reliable processing
5. Returns model outputs compatible with downstream workflow components

Parameter

Parameter	Description
Input	Specify the user prompt. The main input text is sent to the model for generating a response. Accepts a literal string or a global variable.
System Message	Defines system-level instructions that guide model behavior and response constraints. Accepts a literal string.
Model Name	The name of the deployed vLLM model used for generating responses. Used to route inference requests to the correct model. Accepts a literal string or a global variable.
vLLM API Base	The base URL of the vLLM server used for sending inference requests. Used to connect Robility Flow with the deployed vLLM API endpoint. Accepts a literal string or a global variable.
API Key	Authentication key used to authorize requests to secure vLLM endpoints and access the inference server. Supports hidden input with an eye icon to toggle visibility for secure viewing. Accepts a literal string or a global variable.
Temperature	Controls how creative or deterministic the model response will be by adjusting randomness in output generation. • 0 – 0.15 → Very strict and highly consistent outputs. Best for precise and factual responses with minimal variation. • 0.16 – 0.45 → Controlled responses with low creativity. Maintains focus while allowing slight variation in wording. • 0.46 – 0.60 → Balanced creativity and relevance. Produces natural, moderately varied responses while staying accurate. • 0.61 – 1 → High creativity and diversity. Responses become more varied, less predictable, and more exploratory. Lower values ensure stability and accuracy, while higher values increase creativity and variation. Default: 0.10. Range: 0–1.
Max Tokens	Defines the maximum number of tokens the model can generate in a single response, limiting output length during inference. Accepts numeric values. Set to 0 for unlimited tokens.
Model Kwargs	A set of additional configuration parameters used to customize vLLM model behavior during inference. Accepts key-value pairs in JSON format.
Seed	Defines whether the model output should be reproducible or random. Set to -1 for random results. Accepts integer values.
JSON Mode (toggle)	Makes the model return the response in JSON format (structured data) instead of plain text when enabled.
Stream (toggle)	Turns on real-time response generation, showing output as the model produces it instead of waiting for the full response.
Streaming (toggle)	Enables sending the model output token by token in real time during generation.
Stream Token Usage (toggle)	Shows how many tokens are used while the model is generating the response.
Timeout (seconds)	Maximum time the component waits for the inference request to complete before raising a timeout error. Increase this value when working with large prompts or slower model responses. Default: 30 seconds.
HTTP Timeout	Maximum time allowed for an HTTP request to complete while communicating with the vLLM server. Used to prevent long-running requests from blocking execution. Accepts integer values.
HTTP Max Retries	Maximum number of retry attempts for failed HTTP requests to the vLLM server. Used to improve reliability in case of network or server issues. Set to -1 to use system-defined behavior. Accepts integer values.
Retry Count	Number of times the component automatically retries if an inference request fails. Default: 1.
Delay Between Retries (seconds)	Time in milliseconds to wait between retry attempts. Provides a back-off window before reattempting. Default: 1 second.
Delay Before Execution (seconds)	Time in milliseconds to pause before the component begins processing the inference request. Used for controlling execution flow, sequencing, or rate-limiting within workflows. Default: 1 second.

Output

Output	Description
Model Response	The final response generated by the model after processing your input. This is the answer returned by the system based on your prompt and settings.
Language Model	The AI system that processes your input and generates the response. It is the core engine that understands prompts and produces outputs.

vLLM

Integration Behavior

Parameter

Output

vLLM

CONTENTS