Vision-Language Model Client

This module provides a high-level client interface for interacting with a vision-language model to perform multimodal inference.

Supported Features

  • Image-based chat

    • image_chat: Uses an image loaded from a file path.
    • rgb_image_chat: Uses an in-memory RGB image array.
    • rgb_images_chat: Uses a list of in-memory RGB image arrays.
  • Video-based chat

    • video_chat: Samples frames from a video according to configurable strategies before generating a context-aware textual response.

General Information About API Usage

- Input Validation
Validate all inputs before passing them to the SDK, and ensure that any file paths point to trusted content. Invalid inputs may cause errors; the SDK operates under the assumption that data has already been verified.

- Return Values
By default, the API returns only the generated text. When optional outputs are enabled (such as return_history or return_sampling_info), the function returns a tuple in the order text, history, sampling_info. Unpack the returned tuple according to the options you enabled.
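To illustrate the unpacking rule, the following sketch uses a stub function that mimics the documented return contract; the stub and its canned values are hypothetical and not part of the SDK:

```python
def fake_chat(prompt, return_history=False, return_sampling_info=False):
    """Stub mimicking the documented return contract (not part of the SDK)."""
    text, history, sampling_info = "a cat", ["turn-1"], {"fps": 25.0}
    # Build the result tuple in the documented order: text, history, sampling_info.
    result = (text,)
    if return_history:
        result += (history,)
    if return_sampling_info:
        result += (sampling_info,)
    # With no optional outputs enabled, only the text is returned.
    return result[0] if len(result) == 1 else result

text = fake_chat("Describe")                                   # plain string
text, history = fake_chat("Describe", return_history=True)     # 2-tuple
text, history, info = fake_chat(
    "Describe", return_history=True, return_sampling_info=True  # 3-tuple
)
```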

VLLMClient

High-level API for running vision-language model (VLM) inference based on user prompts. Supports image- and video-based queries, enabling multimodal conversational interactions.

__init__(device_id=0)

Initializes the VLLMClient.

Parameters:

  • device_id (int, default 0): Identifier for the computation device (e.g., GPU ID).

image_chat(prompt, image_path, chat_history=None, generation_config=None, return_history=False)

Process an image with a user prompt to produce relevant text output.

Parameters:

  • prompt (str, required): Text instruction or question that guides the interpretation of the image and shapes the generated response.
  • image_path (str, required): Path to the input image file. Supported formats include .png, .jpg, and .jpeg.
  • chat_history (list, default None): A list of prior conversation entries.
  • generation_config (dict, default None): Parameters that control the text generation process (e.g., temperature, top_k). When None, system settings are used.
  • return_history (bool, default False): If True, also returns the updated chat history.

Returns:

  • text (str): AI-generated text derived from the image and prompt.
  • history (list): Updated chat history; returned only if return_history is True.
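A typical call might look like the following sketch. The path and prompt are placeholders, and the extension check simply mirrors the formats listed above, since the SDK assumes inputs are already validated; the client call is shown commented out:

```python
from pathlib import Path

image_path = "samples/street.jpg"  # placeholder path

# Validate the extension ourselves, per the Input Validation note above.
if Path(image_path).suffix.lower() not in {".png", ".jpg", ".jpeg"}:
    raise ValueError(f"Unsupported image format: {image_path}")

# client = VLLMClient(device_id=0)
# text, history = client.image_chat(
#     prompt="What is happening in this scene?",
#     image_path=image_path,
#     return_history=True,   # enables the (text, history) tuple return
# )
```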

rgb_image_chat(prompt, image, chat_history=None, generation_config=None, return_history=False)

Process an image with a user prompt to produce relevant text output.

Parameters:

  • prompt (str, required): Instruction or question that guides the interpretation of the image and shapes the generated response.
  • image (ndarray, required): RGB image as a NumPy array in HWC (Height, Width, Channels) format, with channels ordered R, G, B.
  • chat_history (list, default None): A list of prior conversation entries.
  • generation_config (dict, default None): Parameters that control the text generation process (e.g., temperature, top_k). When None, system settings are used.
  • return_history (bool, default False): If True, also returns the updated chat history.

Returns:

  • text (str): AI-generated text derived from the image and prompt.
  • history (list): Updated chat history; returned only if return_history is True.
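The sketch below builds a dummy HWC RGB array of the kind this method expects; the array contents and the commented-out client call are placeholders:

```python
import numpy as np

# Build a dummy 480x640 RGB image in HWC layout (values are placeholders).
image = np.zeros((480, 640, 3), dtype=np.uint8)
image[..., 0] = 255  # fill the R channel so the image is pure red

# client = VLLMClient(device_id=0)
# text = client.rgb_image_chat("Describe the dominant color.", image)
```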

rgb_images_chat(prompt, images, chat_history=None, generation_config=None, return_history=False)

Process a list of in-memory RGB images with a user prompt to produce relevant text output.

Parameters:

  • prompt (str, required): Instruction or question that guides the interpretation of the images and shapes the generated response.
  • images (list[ndarray], required): List of RGB images, where each image is a NumPy array in HWC (Height, Width, Channels) format with channels ordered R, G, B.
  • chat_history (list, default None): A list of prior conversation entries.
  • generation_config (dict, default None): Parameters that control the text generation process (e.g., temperature, top_k). When None, system settings are used.
  • return_history (bool, default False): If True, also returns the updated chat history.

Returns:

  • text (str): AI-generated text derived from the images and prompt.
  • history (list): Updated chat history; returned only if return_history is True.
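A sketch of assembling a frame list that satisfies the per-image HWC/RGB contract; the frame data and the commented-out client call are placeholders:

```python
import numpy as np

# Assemble a list of frames, each an HWC RGB array (dummy data).
frames = [
    np.full((240, 320, 3), fill_value=i * 10, dtype=np.uint8)
    for i in range(4)
]

# Every entry must individually be a 3-channel HWC array.
assert all(f.ndim == 3 and f.shape[2] == 3 for f in frames)

# client = VLLMClient(device_id=0)
# text = client.rgb_images_chat("Compare these frames.", frames)
```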

video_chat(prompt, video_path, sampling_method='duration', sampling_fps=None, min_num_frames=64, max_num_frames=512, generation_config=None, chat_history=None, return_history=False, return_sampling_info=False)

Process an input video by sampling frames and analyzing them in context with a user prompt to produce text output.

Parameters:

  • prompt (str, required): Text instruction or question used to produce a tailored response from the video.
  • video_path (str, required): Path to the input video file. Supported formats include .mp4, .mov, .avi, .3gp, .mkv, and .wmv.
  • sampling_method (str, default 'duration'): Method used for sampling frames from the video. Valid values are "duration", "rand", "middle", and "fps". See the Sampling Methods section for details.
  • sampling_fps (float, default None): Number of frames to select per second. Required only when sampling_method is "fps".
  • min_num_frames (int, default 64): Minimum number of frames to sample from the video, in multiples of 8. Applies when sampling_method is "duration", "rand", or "middle".
  • max_num_frames (int, default 512): Maximum number of frames to sample from the video, in multiples of 8. Applies when sampling_method is "duration", "rand", or "middle".
  • generation_config (dict, default None): Parameters that control the text generation process. When None, system settings are used.
  • chat_history (list, default None): A list of prior conversation entries.
  • return_history (bool, default False): If True, also returns the updated chat history.
  • return_sampling_info (bool, default False): If True, also returns metadata about the video sampling process. See the Sampling Information section for details.

Returns:

  • text (str): AI-generated text derived from the video and prompt.
  • history (list): Updated chat history; returned only if return_history is True.
  • sampling_info (dict): Sampling information; returned only if return_sampling_info is True.
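The sketch below checks the multiples-of-8 constraint on the frame bounds and shows the three-way unpacking when both optional outputs are enabled; the path and prompt are placeholders and the client call is commented out:

```python
min_num_frames, max_num_frames = 64, 512

# Both frame bounds must be multiples of 8, per the parameter docs.
assert min_num_frames % 8 == 0 and max_num_frames % 8 == 0

# client = VLLMClient(device_id=0)
# text, history, sampling_info = client.video_chat(
#     prompt="Summarize the video.",
#     video_path="samples/clip.mp4",   # placeholder path
#     sampling_method="duration",
#     min_num_frames=min_num_frames,
#     max_num_frames=max_num_frames,
#     return_history=True,
#     return_sampling_info=True,       # tuple order: text, history, sampling_info
# )
```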

Sampling Information

When return_sampling_info is True, a dict is returned as part of the result with the following keys:

Key Data Type Description
sampling_method str User-selected sampling method
fps float Input video FPS
duration float Input video duration in seconds
n_video_frames int Number of frames in the input video
min_num_frames int User-defined minimum number of frames
max_num_frames int User-defined maximum number of frames
n_sample_frames int Actual number of sampled frames
frame_indices list Indices of the selected frames
frame_seconds list Timestamps (in seconds) of the selected frames

Example sampling_info value when return_sampling_info is True

Example Input

  • Video duration: 21.56 seconds
  • FPS: 25
  • Total frames (duration * fps): 539
  • sampling_method: "duration"
  • min_num_frames: 64
  • max_num_frames: 512

Output sampling_info

{
  'sampling_method': 'duration', 
  'fps': 25.0, 
  'duration': 21.56, 
  'n_video_frames': 539, 
  'min_num_frames': 64, 
  'max_num_frames': 512, 
  'n_sample_frames': 64, 
  'frame_indices': [3, 11, 20, 28, 37, 45, 53, ..., 525, 534], 
  'frame_seconds': [0.1, 0.4, 0.8, 1.1, 1.5, 1.8, 2.1, ..., 21.0, 21.4]
}
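In this example, the frame_seconds values appear to be the frame_indices divided by the video FPS and rounded to one decimal place. A sketch assuming that relationship (using only the indices shown above; the elided middle values are omitted):

```python
fps = 25.0
# Subset of the frame_indices shown in the example output above.
frame_indices = [3, 11, 20, 28, 37, 45, 53, 525, 534]

# Assumed relationship: seconds = index / fps, rounded to one decimal.
frame_seconds = [round(i / fps, 1) for i in frame_indices]
print(frame_seconds)  # [0.1, 0.4, 0.8, 1.1, 1.5, 1.8, 2.1, 21.0, 21.4]
```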