Vision-Language Model Client

This module provides a high-level client interface for interacting with a vision-language model to perform multimodal inference.

Supported Features

  • Image-based chat

    • image_chat: Uses an image loaded from a file path.
    • rgb_image_chat: Uses an in-memory RGB image array.
    • rgb_images_chat: Uses a list of in-memory RGB image arrays.
  • Video-based chat

    • video_chat: Samples frames from a video according to configurable strategies before generating a context-aware textual response.

General Information About API Usage

- Input Validation
Validate all inputs before passing them to the SDK, and ensure that any file paths point to trusted content. Invalid inputs may cause errors; the SDK operates under the assumption that data has already been verified.

- Return Values
By default, the API returns only the generated text. When optional outputs are enabled (such as return_history or return_sampling_info), the function returns a tuple in the order text, history, sampling_info. Unpack the returned tuple according to the options you enabled.
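To illustrate the unpacking rule, the following sketch uses a stub function that mimics the documented return contract; the stub and its canned values are hypothetical and not part of the SDK:

```python
def fake_chat(prompt, return_history=False, return_sampling_info=False):
    """Stub mimicking the documented return contract (not part of the SDK)."""
    text, history, sampling_info = "a cat", ["turn-1"], {"fps": 25.0}
    # Build the result tuple in the documented order: text, history, sampling_info.
    result = (text,)
    if return_history:
        result += (history,)
    if return_sampling_info:
        result += (sampling_info,)
    # With no optional outputs enabled, only the text is returned.
    return result[0] if len(result) == 1 else result

text = fake_chat("Describe")                                   # plain string
text, history = fake_chat("Describe", return_history=True)     # 2-tuple
text, history, info = fake_chat(
    "Describe", return_history=True, return_sampling_info=True  # 3-tuple
)
```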

VLLMClient

High-level API for running vision-language model (VLM) inference based on user prompts. Supports image- and video-based queries, enabling multimodal conversational interactions.

__init__(device_id=0)

Initializes the VLLMClient.

Parameters:

  • device_id (int, default 0): Identifier for the computation device (e.g., GPU ID).

image_chat(prompt, image_path, chat_history=None, generation_config=None, return_history=False)

Process an image with a user prompt to produce relevant text output.

Parameters:

  • prompt (str, required): Text instruction or question that guides the interpretation of the image and shapes the generated response.
  • image_path (str, required): Path to the input image file. Supported formats include .png, .jpg, and .jpeg.
  • chat_history (list, default None): A list of prior conversation entries.
  • generation_config (dict, default None): Parameters that control the text generation process (e.g., temperature, top_k). When None, system settings are used.
  • return_history (bool, default False): If True, also returns the updated chat history.

Returns:

  • text (str): AI-generated text derived from the image and prompt.
  • history (list): Updated chat history; returned only if return_history is True.
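A typical call might look like the following sketch. The path and prompt are placeholders, and the extension check simply mirrors the formats listed above, since the SDK assumes inputs are already validated; the client call is shown commented out:

```python
from pathlib import Path

image_path = "samples/street.jpg"  # placeholder path

# Validate the extension ourselves, per the Input Validation note above.
if Path(image_path).suffix.lower() not in {".png", ".jpg", ".jpeg"}:
    raise ValueError(f"Unsupported image format: {image_path}")

# client = VLLMClient(device_id=0)
# text, history = client.image_chat(
#     prompt="What is happening in this scene?",
#     image_path=image_path,
#     return_history=True,   # enables the (text, history) tuple return
# )
```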

rgb_image_chat(prompt, image, chat_history=None, generation_config=None, return_history=False)

Process an image with a user prompt to produce relevant text output.

Parameters:

  • prompt (str, required): Instruction or question that guides the interpretation of the image and shapes the generated response.
  • image (ndarray, required): RGB image as a NumPy array in HWC (Height, Width, Channels) format, with channels ordered R, G, B.
  • chat_history (list, default None): A list of prior conversation entries.
  • generation_config (dict, default None): Parameters that control the text generation process (e.g., temperature, top_k). When None, system settings are used.
  • return_history (bool, default False): If True, also returns the updated chat history.

Returns:

  • text (str): AI-generated text derived from the image and prompt.
  • history (list): Updated chat history; returned only if return_history is True.
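The sketch below builds a dummy HWC RGB array of the kind this method expects; the array contents and the commented-out client call are placeholders:

```python
import numpy as np

# Build a dummy 480x640 RGB image in HWC layout (values are placeholders).
image = np.zeros((480, 640, 3), dtype=np.uint8)
image[..., 0] = 255  # fill the R channel so the image is pure red

# client = VLLMClient(device_id=0)
# text = client.rgb_image_chat("Describe the dominant color.", image)
```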

rgb_images_chat(prompt, images, chat_history=None, generation_config=None, return_history=False)

Process a list of in-memory RGB images with a user prompt to produce relevant text output.

Parameters:

  • prompt (str, required): Instruction or question that guides the interpretation of the images and shapes the generated response.
  • images (list[ndarray], required): List of RGB images, where each image is a NumPy array in HWC (Height, Width, Channels) format with channels ordered R, G, B.
  • chat_history (list, default None): A list of prior conversation entries.
  • generation_config (dict, default None): Parameters that control the text generation process (e.g., temperature, top_k). When None, system settings are used.
  • return_history (bool, default False): If True, also returns the updated chat history.

Returns:

  • text (str): AI-generated text derived from the images and prompt.
  • history (list): Updated chat history; returned only if return_history is True.
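A sketch of assembling a frame list that satisfies the per-image HWC/RGB contract; the frame data and the commented-out client call are placeholders:

```python
import numpy as np

# Assemble a list of frames, each an HWC RGB array (dummy data).
frames = [
    np.full((240, 320, 3), fill_value=i * 10, dtype=np.uint8)
    for i in range(4)
]

# Every entry must individually be a 3-channel HWC array.
assert all(f.ndim == 3 and f.shape[2] == 3 for f in frames)

# client = VLLMClient(device_id=0)
# text = client.rgb_images_chat("Compare these frames.", frames)
```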

video_chat(prompt, video_path, sampling_method='duration', sampling_fps=None, min_num_frames=64, max_num_frames=512, generation_config=None, chat_history=None, return_history=False, return_sampling_info=False)

Process an input video by sampling frames and analyzing them in context with a user prompt to produce text output.

Parameters:

  • prompt (str, required): Text instruction or question used to produce a tailored response from the video.
  • video_path (str, required): Path to the input video file. Supported formats include .mp4, .mov, .avi, .3gp, .mkv, and .wmv.
  • sampling_method (str, default 'duration'): Method used for sampling frames from the video. Valid values are "duration", "rand", "middle", and "fps". See the Sampling Methods section for details.
  • sampling_fps (float, default None): Number of frames to select per second. Required only when sampling_method is "fps".
  • min_num_frames (int, default 64): Minimum number of frames to sample from the video, in multiples of 8. Applies when sampling_method is "duration", "rand", or "middle".
  • max_num_frames (int, default 512): Maximum number of frames to sample from the video, in multiples of 8. Applies when sampling_method is "duration", "rand", or "middle".
  • generation_config (dict, default None): Parameters that control the text generation process. When None, system settings are used.
  • chat_history (list, default None): A list of prior conversation entries.
  • return_history (bool, default False): If True, also returns the updated chat history.
  • return_sampling_info (bool, default False): If True, also returns metadata about the video sampling process. See the Sampling Information section for details.

Returns:

  • text (str): AI-generated text derived from the video and prompt.
  • history (list): Updated chat history; returned only if return_history is True.
  • sampling_info (dict): Sampling information; returned only if return_sampling_info is True.
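The sketch below checks the multiples-of-8 constraint on the frame bounds and shows the three-way unpacking when both optional outputs are enabled; the path and prompt are placeholders and the client call is commented out:

```python
min_num_frames, max_num_frames = 64, 512

# Both frame bounds must be multiples of 8, per the parameter docs.
assert min_num_frames % 8 == 0 and max_num_frames % 8 == 0

# client = VLLMClient(device_id=0)
# text, history, sampling_info = client.video_chat(
#     prompt="Summarize the video.",
#     video_path="samples/clip.mp4",   # placeholder path
#     sampling_method="duration",
#     min_num_frames=min_num_frames,
#     max_num_frames=max_num_frames,
#     return_history=True,
#     return_sampling_info=True,       # tuple order: text, history, sampling_info
# )
```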

Sampling Information

When return_sampling_info is True, a dict is returned as part of the result with the following keys:

Key Data Type Description
sampling_method str User-selected sampling method
fps float Input video FPS
duration float Input video duration in seconds
n_video_frames int Number of frames in the input video
min_num_frames int User-defined minimum number of frames
max_num_frames int User-defined maximum number of frames
n_sample_frames int Actual number of sampled frames
frame_indices list Indices of the selected frames
frame_seconds list Timestamps (in seconds) of the selected frames

Example sampling_info value when return_sampling_info is True

Example Input

  • Video duration: 21.56 seconds
  • FPS: 25
  • Total frames (duration * fps): 539
  • sampling_method: "duration"
  • min_num_frames: 64
  • max_num_frames: 512

Output sampling_info

{
  'sampling_method': 'duration', 
  'fps': 25.0, 
  'duration': 21.56, 
  'n_video_frames': 539, 
  'min_num_frames': 64, 
  'max_num_frames': 512, 
  'n_sample_frames': 64, 
  'frame_indices': [3, 11, 20, 28, 37, 45, 53, ..., 525, 534], 
  'frame_seconds': [0.1, 0.4, 0.8, 1.1, 1.5, 1.8, 2.1, ..., 21.0, 21.4]
}
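In this example, the frame_seconds values appear to be the frame_indices divided by the video FPS and rounded to one decimal place. A sketch assuming that relationship (using only the indices shown above; the elided middle values are omitted):

```python
fps = 25.0
# Subset of the frame_indices shown in the example output above.
frame_indices = [3, 11, 20, 28, 37, 45, 53, 525, 534]

# Assumed relationship: seconds = index / fps, rounded to one decimal.
frame_seconds = [round(i / fps, 1) for i in frame_indices]
print(frame_seconds)  # [0.1, 0.4, 0.8, 1.1, 1.5, 1.8, 2.1, 21.0, 21.4]
```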