Client
Vision-Language Model Client
Supported Features
- Image-based chat
  - image_chat: Uses an image loaded from a file path.
  - rgb_image_chat: Uses an in-memory RGB image array.
- Video-based chat
  - video_chat: Samples frames from a video according to configurable strategies before generating a context-aware textual response.
General information about API usage
By default, each chat method returns only the generated text. When optional flags such as return_history or return_sampling_info are enabled, the method returns a tuple in the order text, history, sampling_info. Unpack the returned tuple according to the options you enabled.
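The return convention can be handled with a small helper. This is a minimal sketch, not part of the client: `unpack_result` is a hypothetical name, and it assumes that disabled items are simply omitted from the tuple while the remaining ones keep the order text, history, sampling_info.

```python
def unpack_result(result, return_history=False, return_sampling_info=False):
    """Normalize a chat result to (text, history, sampling_info).

    Assumption: with no optional flag enabled the call returns a bare string;
    otherwise it returns a tuple holding only the enabled items, in the
    documented order text, history, sampling_info.
    """
    if not (return_history or return_sampling_info):
        return result, None, None
    items = list(result)
    text = items.pop(0)
    # Items that were not requested are reported as None.
    history = items.pop(0) if return_history else None
    sampling_info = items.pop(0) if return_sampling_info else None
    return text, history, sampling_info
```

Passing the same flags to the helper that were passed to the chat method keeps the calling code uniform: it always receives three values, whichever options are enabled.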
VLLMClient
__init__(device_id=0)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| device_id | int | Identifier for the computation device (e.g., GPU ID). | 0 |
image_chat(prompt, image_path, chat_history=None, generation_config=None, return_history=False)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | Text instruction or question that guides the interpretation of the image and shapes the generated response. | required |
| image_path | str | Path to the input image file. Supported formats include .png, .jpg, .jpeg. | required |
| chat_history | list | A list of prior conversation entries. Defaults to None. | None |
| generation_config | dict | Parameters that control the text generation process (e.g., temperature, top_k). Defaults to None and uses system settings. | None |
| return_history | bool | If True, also returns the updated chat history. Defaults to False. | False |
Returns:
| Name | Type | Description |
|---|---|---|
| text | str | AI-generated text derived from the image and prompt. |
| history | list | Updated chat history; returned only if return_history is True. |
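A multi-turn conversation about one image can be built by feeding the returned history back into the next call. The sketch below is illustrative only: `ask_in_sequence` is a hypothetical helper, and `client` is assumed to be any object exposing image_chat() with the signature documented above.

```python
def ask_in_sequence(client, image_path, prompts, generation_config=None):
    """Ask several prompts about one image, threading the chat history through.

    `client` is assumed to expose image_chat() as documented; the helper
    name itself is hypothetical.
    """
    history = None
    answers = []
    for prompt in prompts:
        # With return_history=True the call returns (text, history).
        text, history = client.image_chat(
            prompt,
            image_path,
            chat_history=history,
            generation_config=generation_config,
            return_history=True,
        )
        answers.append(text)
    return answers, history
```

The helper never inspects the history entries, so it does not depend on their internal format.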
rgb_image_chat(prompt, image, chat_history=None, generation_config=None, return_history=False)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | Instruction or question that guides the interpretation of the image and shapes the generated response. | required |
| image | ndarray | RGB image in HWC format (Height, Width, Channels) as a NumPy array, with channels ordered RGB. | required |
| chat_history | list | A list of prior conversation entries. Defaults to None. | None |
| generation_config | dict | Parameters that control the text generation process (e.g., temperature, top_k). Defaults to None and uses system settings. | None |
| return_history | bool | If True, also returns the updated chat history. Defaults to False. | False |
Returns:
| Name | Type | Description |
|---|---|---|
| text | str | AI-generated text derived from the image and prompt. |
| history | list | Updated chat history; returned only if return_history is True. |
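Images decoded by other libraries are not always in RGB order, so it can help to normalize an array before passing it to rgb_image_chat. The following is a minimal sketch using NumPy only; `to_rgb_hwc` and its `channel_order` argument are hypothetical names and are not part of the client.

```python
import numpy as np


def to_rgb_hwc(image, channel_order="RGB"):
    """Return an HWC array with channels ordered RGB.

    `channel_order` names the order of the input array; "BGR" input
    (as produced by some decoders) is converted by reversing the last axis.
    """
    array = np.asarray(image)
    if array.ndim != 3 or array.shape[2] != 3:
        raise ValueError(f"expected shape (H, W, 3), got {array.shape}")
    if channel_order == "BGR":
        # Reversing the channel axis swaps the blue and red channels.
        array = array[..., ::-1]
    elif channel_order != "RGB":
        raise ValueError(f"unsupported channel order: {channel_order!r}")
    # Reversed views have negative strides; copy into contiguous memory.
    return np.ascontiguousarray(array)
```

The shape check rejects CHW arrays such as (3, H, W), which would otherwise be passed on silently in the wrong layout.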
rgb_images_chat(prompt, images, chat_history=None, generation_config=None, return_history=False)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | Instruction or question that guides the interpretation of the images and shapes the generated response. | required |
| images | list[ndarray] | List of RGB images, where each image is a NumPy array in HWC (Height, Width, Channels) format with channels ordered as RGB. | required |
| chat_history | list | A list of prior conversation entries. Defaults to None. | None |
| generation_config | dict | Parameters that control the text generation process (e.g., temperature, top_k). Defaults to None and uses system settings. | None |
| return_history | bool | If True, also returns the updated chat history. Defaults to False. | False |
Returns:
| Name | Type | Description |
|---|---|---|
| text | str | AI-generated text derived from the images and prompt. |
| history | list | Updated chat history; returned only if return_history is True. |
video_chat(prompt, video_path, sampling_method='duration', sampling_fps=None, min_num_frames=64, max_num_frames=512, generation_config=None, chat_history=None, return_history=False, return_sampling_info=False)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | Text instruction or question used to produce a tailored response from the video. | required |
| video_path | str | Path to the input video file. Supported formats include .mp4, .mov, .avi, .3gp, .mkv, and .wmv. | required |
| sampling_method | str | Method used for sampling frames from the video. Valid values are "duration", "rand", "middle", and "fps". Defaults to "duration". See the Sampling Methods section for details. | 'duration' |
| sampling_fps | float | Number of frames to sample per second of video. Only required when sampling_method is "fps". | None |
| min_num_frames | int | Minimum number of frames to sample from the video, in multiples of 8. Defaults to 64. Specify when | 64 |
| max_num_frames | int | Maximum number of frames to sample from the video, in multiples of 8. Defaults to 512. Specify when | 512 |
| generation_config | dict | Parameters that control text generation behavior. Defaults to None, which uses the system settings. | None |
| chat_history | list | A list of prior conversation entries. Defaults to None. | None |
| return_history | bool | If True, also returns the updated chat history. Defaults to False. | False |
| return_sampling_info | bool | If True, also returns metadata about the video sampling process. Defaults to False. See the Sampling information section for details. | False |
Returns:
| Name | Type | Description |
|---|---|---|
| text | str | AI-generated text derived from the video and prompt. |
| history | list | Updated chat history; returned only if return_history is True. |
| sampling_info | dict | Sampling information; returned only if return_sampling_info is True. |
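The documented constraints on the sampling arguments can be checked before calling video_chat. This is a sketch under stated assumptions: `check_video_chat_args` is a hypothetical helper, and the requirements that frame counts be positive and that the minimum not exceed the maximum are assumptions rather than documented behavior.

```python
VALID_SAMPLING_METHODS = ("duration", "rand", "middle", "fps")


def check_video_chat_args(sampling_method="duration",
                          min_num_frames=64, max_num_frames=512):
    """Raise ValueError if the arguments contradict the documented constraints."""
    if sampling_method not in VALID_SAMPLING_METHODS:
        raise ValueError(
            f"sampling_method must be one of {VALID_SAMPLING_METHODS}, "
            f"got {sampling_method!r}"
        )
    for name, value in (("min_num_frames", min_num_frames),
                        ("max_num_frames", max_num_frames)):
        # Documented: frame counts are given in multiples of 8.
        # Assumption: they must also be positive.
        if value <= 0 or value % 8 != 0:
            raise ValueError(f"{name} must be a positive multiple of 8, got {value}")
    if min_num_frames > max_num_frames:
        # Assumption: the minimum must not exceed the maximum.
        raise ValueError("min_num_frames must not exceed max_num_frames")
```

Calling the helper with no arguments validates the documented defaults and returns without raising.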
Sampling information
When return_sampling_info is True, a dict with the following keys is returned as part of the result:
| Key | Data Type | Value |
|---|---|---|
| sampling_method | str | User-selected sampling method |
| fps | float | Input video FPS |
| duration | float | Input video duration in seconds |
| n_video_frames | int | Number of input video frames |
| min_num_frames | int | User-defined minimum number of frames |
| max_num_frames | int | User-defined maximum number of frames |
| n_sample_frames | int | Actual number of sampled frames |
| frame_indices | list | Indices of the selected frames |
| frame_seconds | list | Timestamps of the selected frames, in seconds |
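When consuming sampling_info downstream, it can be useful to confirm that all documented keys are present before reading them. The sketch below lists the keys from the table above; `missing_sampling_info_keys` is a hypothetical helper and not part of the client.

```python
# Keys documented for the sampling_info dict.
SAMPLING_INFO_KEYS = (
    "sampling_method",
    "fps",
    "duration",
    "n_video_frames",
    "min_num_frames",
    "max_num_frames",
    "n_sample_frames",
    "frame_indices",
    "frame_seconds",
)


def missing_sampling_info_keys(sampling_info):
    """Return the documented keys that are absent from `sampling_info`."""
    return [key for key in SAMPLING_INFO_KEYS if key not in sampling_info]
```

An empty list means every documented key is available.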
Example value of return_sampling_info
Example Input
- Video duration: 21.56 seconds
- FPS: 25
- Total frames (duration * fps): 539
- sampling_method: "duration"
- min_num_frames: 64
- max_num_frames: 512
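The total frame count in the example input is the product of duration and FPS. A short sketch of that arithmetic follows; `total_frames` is a hypothetical helper, and rounding to the nearest integer is an assumption made to absorb floating-point error.

```python
def total_frames(duration_seconds, fps):
    """Estimate the number of frames in a video from its duration and FPS."""
    # round() guards against floating-point error in the product.
    return round(duration_seconds * fps)


print(total_frames(21.56, 25))  # 539
```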
Output sampling_info