API Reference
This page provides the complete API documentation for the VLLMClient class.
This module provides a high-level client interface for interacting with a vision-language model to perform multimodal inference.
Supported Features
- Image-based chat
  - image_chat: Uses an image loaded from a file path.
  - rgb_image_chat: Uses an in-memory RGB image array.
  - rgb_images_chat: Uses an in-memory list of RGB image arrays.
- Video-based chat
  - video_chat: Samples frames from a video according to a configurable strategy, then generates a context-aware textual response.
General information about API usage
By default, each method returns only the generated text. When optional flags such as return_history or return_sampling_info are enabled, the method returns a tuple containing the enabled items in the following order: text, history, sampling_info. Unpack the returned tuple according to the options you enabled.
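The return convention described above can be illustrated with a minimal sketch. The function `fake_video_chat` is a hypothetical stand-in that mimics only the documented return shape; it is not the VLLMClient implementation. It assumes that items whose flag is disabled are omitted from the tuple, and its history and sampling-info values are placeholders.

```python
def fake_video_chat(prompt, return_history=False, return_sampling_info=False):
    """Hypothetical stand-in that mimics only the documented return shape."""
    text = f"response to: {prompt}"
    history = [prompt, text]                 # placeholder; real entry format is not specified here
    sampling_info = {"note": "placeholder"}  # placeholder metadata

    result = [text]
    if return_history:
        result.append(history)
    if return_sampling_info:
        result.append(sampling_info)
    # Only the text is returned when no optional flag is enabled.
    return result[0] if len(result) == 1 else tuple(result)


# Default: a plain string.
text = fake_video_chat("Describe the scene.")

# One flag enabled: a 2-tuple.
text_h, history = fake_video_chat("Describe the scene.", return_history=True)

# Both flags enabled: a 3-tuple in the order text, history, sampling_info.
text_hs, history_hs, info = fake_video_chat(
    "Describe the scene.", return_history=True, return_sampling_info=True
)
```

The number of names on the left-hand side must match the number of enabled flags plus one, otherwise Python raises a ValueError during unpacking.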
VLLMClient
High-level API for running Visual Language Model (VLM) inference based on user prompts. Supports image- and video-based queries, enabling multimodal conversational interactions.
__init__
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `device_id` | `int` | Identifier for the computation device (e.g., GPU ID) to use for inference. If None, the default device is selected automatically. | `0` |
image_chat
Process an image with a user prompt to produce relevant text output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `prompt` | `str` | Text instruction or question that guides the interpretation of the image and shapes the generated response. | required |
| `image_path` | `str` | Path to the input image file. Supported formats include .png, .jpg, and .jpeg. | required |
| `chat_history` | `list` | A list of prior conversation entries. Defaults to None. | `None` |
| `generation_config` | `dict` | Parameters that control the text generation process (e.g., temperature, top_k). Defaults to None, which uses the system settings. | `None` |
| `return_history` | `bool` | If True, also returns the updated chat history. Defaults to False. | `False` |
Returns:
| Name | Type | Description |
|---|---|---|
| `text` | `str` | AI-generated text derived from the image and prompt. |
| `history` | `list` | Updated chat history; returned only if `return_history` is True. |
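A multi-turn conversation can be sketched by feeding the returned history back in as chat_history. The `StubClient` class below is a hypothetical stand-in that copies the image_chat signature documented above so the example runs on its own; the history entry format it uses is a placeholder, and the file name and generation_config values are illustrative assumptions.

```python
class StubClient:
    """Hypothetical stand-in with the image_chat signature documented above."""

    def image_chat(self, prompt, image_path, chat_history=None,
                   generation_config=None, return_history=False):
        history = list(chat_history or [])  # copy so the caller's list is not mutated
        text = f"[turn {len(history) // 2 + 1}] answer about {image_path}"
        # Entry format is a placeholder; the real format is not specified here.
        history += [("user", prompt), ("assistant", text)]
        return (text, history) if return_history else text


client = StubClient()  # with the real library this would be a VLLMClient instance

# First turn: request the history so it can be carried forward.
answer1, history = client.image_chat(
    "What is in this picture?", "photo.jpg", return_history=True
)

# Second turn: pass the previous history back in as chat_history.
answer2, history = client.image_chat(
    "What color is it?", "photo.jpg",
    chat_history=history,
    generation_config={"temperature": 0.2, "top_k": 40},
    return_history=True,
)
```

Treating the history as an opaque value that is only passed back in keeps the calling code independent of whatever entry format the library uses.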
rgb_image_chat
Process an image with a user prompt to produce relevant text output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `prompt` | `str` | Instruction or question that guides the interpretation of the image and shapes the generated response. | required |
| `image` | `ndarray` | RGB image as a NumPy array in HWC (Height, Width, Channels) format, with channels ordered RGB. | required |
| `chat_history` | `list` | A list of prior conversation entries. Defaults to None. | `None` |
| `generation_config` | `dict` | Parameters that control the text generation process (e.g., temperature, top_k). Defaults to None, which uses the system settings. | `None` |
| `return_history` | `bool` | If True, also returns the updated chat history. Defaults to False. | `False` |
Returns:
| Name | Type | Description |
|---|---|---|
| `text` | `str` | AI-generated text derived from the image and prompt. |
| `history` | `list` | Updated chat history; returned only if `return_history` is True. |
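Because rgb_image_chat expects an HWC array with channels ordered RGB, images that come from a source using BGR order (OpenCV's default, for example) need their channels reversed first. The helper `to_rgb_hwc` below is a hypothetical caller-side sketch, not part of the VLLMClient API; it checks the shape and reorders the channels with plain NumPy indexing.

```python
import numpy as np


def to_rgb_hwc(image, channel_order="rgb"):
    """Hypothetical helper: return an (H, W, 3) array with channels ordered RGB."""
    image = np.asarray(image)
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError(f"expected shape (H, W, 3), got {image.shape}")
    if channel_order == "bgr":
        # Reverse the last axis: B, G, R -> R, G, B.
        image = image[..., ::-1]
    elif channel_order != "rgb":
        raise ValueError(f"unsupported channel order: {channel_order!r}")
    # Reversed views have negative strides, so copy into a contiguous layout.
    return np.ascontiguousarray(image)


# A 2x2 pure-blue image stored in BGR order (blue is channel 0).
bgr = np.zeros((2, 2, 3), dtype=np.uint8)
bgr[..., 0] = 255

rgb = to_rgb_hwc(bgr, channel_order="bgr")
# rgb can now be passed as the `image` argument of rgb_image_chat.
```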
rgb_images_chat
Performs multimodal inference by analyzing an in-memory list of RGB image arrays alongside a user-provided prompt, generating a contextually relevant textual output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `prompt` | `str` | Instruction or question that guides the interpretation of the images and shapes the generated response. | required |
| `images` | `list[ndarray]` | List of RGB images, where each image is a NumPy array in HWC (Height, Width, Channels) format with channels ordered RGB. | required |
| `chat_history` | `list` | A list of prior conversation entries. Defaults to None. | `None` |
| `generation_config` | `dict` | Parameters that control the text generation process (e.g., temperature, top_k). Defaults to None, which uses the system settings. | `None` |
| `return_history` | `bool` | If True, also returns the updated chat history. Defaults to False. | `False` |
Returns:
| Name | Type | Description |
|---|---|---|
| `text` | `str` | AI-generated text derived from the images and prompt. |
| `history` | `list` | Updated chat history; returned only if `return_history` is True. |
video_chat
Process an input video by sampling frames and analyzing them in context with a user prompt to produce text output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `prompt` | `str` | Text instruction or question used to produce a tailored response from the video. | required |
| `video_path` | `str` | Path to the input video file. Supported formats include .mp4, .mov, .avi, .3gp, .mkv, and .wmv. | required |
| `sampling_method` | `str` | Method used for sampling frames from the video. Valid values are "duration", "rand", "middle", and "fps". Defaults to "duration". See the Sampling Methods section for details. | `'duration'` |
| `sampling_fps` | `float` | Number of frames to sample per second of video. Only required when … | `None` |
| `min_num_frames` | `int` | Minimum number of frames to sample from the video, in multiples of 8. Defaults to 64. Specify when … | `64` |
| `max_num_frames` | `int` | Maximum number of frames to sample from the video, in multiples of 8. Defaults to 512. Specify when … | `512` |
| `generation_config` | `dict` | Parameters that control the text generation process. Defaults to None, which uses the system settings. | `None` |
| `chat_history` | `list` | A list of prior conversation entries. Defaults to None. | `None` |
| `return_history` | `bool` | If True, also returns the updated chat history. Defaults to False. | `False` |
| `return_sampling_info` | `bool` | If True, also returns metadata about the video sampling process. Defaults to False. See the Sampling Info section for details. | `False` |
Returns:
| Name | Type | Description |
|---|---|---|
| `text` | `str` | AI-generated text derived from the video and prompt. |
| `history` | `list` | Updated chat history; returned only if `return_history` is True. |
| `sampling_info` | `dict` | Sampling information; returned only if `return_sampling_info` is True. |
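The constraints on the sampling arguments (a fixed set of sampling_method values, and frame counts in multiples of 8) can be checked on the caller's side before invoking video_chat. The helper `build_video_chat_kwargs` below is a hypothetical sketch, not part of the API. How the library itself treats values that are not multiples of 8 is not documented here, so the rounding policy (minimum rounded up, maximum rounded down) and the min/max ordering check are assumptions.

```python
VALID_SAMPLING_METHODS = ("duration", "rand", "middle", "fps")


def build_video_chat_kwargs(sampling_method="duration",
                            min_num_frames=64, max_num_frames=512):
    """Hypothetical helper: validate sampling arguments before calling video_chat."""
    if sampling_method not in VALID_SAMPLING_METHODS:
        raise ValueError(
            f"sampling_method must be one of {VALID_SAMPLING_METHODS}, "
            f"got {sampling_method!r}"
        )
    # Assumed policy: round the minimum up and the maximum down to multiples of 8.
    min_num_frames = ((min_num_frames + 7) // 8) * 8
    max_num_frames = (max_num_frames // 8) * 8
    if min_num_frames > max_num_frames:
        raise ValueError("min_num_frames must not exceed max_num_frames")
    return {
        "sampling_method": sampling_method,
        "min_num_frames": min_num_frames,
        "max_num_frames": max_num_frames,
    }


kwargs = build_video_chat_kwargs(min_num_frames=30, max_num_frames=100)
# The result can be expanded into a call, e.g. video_chat(prompt, video_path, **kwargs).
```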