Example API Calls
Summary
- Before you begin
- Your First API Call
- Example 1: Basic inference
- Example 2: Multi-turn conversations and chat history
With a few simple steps, you will complete your first video-analysis API call.
Before you begin
To help you develop your applications efficiently, we would like to introduce some model loading and inference best practices.
- Load the model once at startup
  - Loading a model can be time-consuming, especially for large models.
  - It is recommended to load the model once and keep it in memory for the lifetime of your application.
- Warm up the model for real-time applications
  - After loading, run a few dummy inferences to "warm up" the model.
  - Warming up the model prepares it for faster inference by initializing internal buffers and reducing the first-request latency. This is especially useful for processing large videos or multiple images.
- Reuse the loaded model for multiple requests
  - Do not reload the model for each client request.
  - Keep the model instance in a persistent object or global variable within your application process.
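The three practices above can be sketched in a few lines. This is a minimal illustration only: it uses a hypothetical `DummyClient` stand-in (and a `handle_request` helper of our own) so it runs anywhere, since `VLLMClient` may not be installed in your environment; in a real application you would import and construct `VLLMClient` instead.

```python
# Stand-in for VLLMClient so the pattern is self-contained. In a real
# application, replace DummyClient with:
#   from woven.vision.ai.wave_8b.vllm_client import VLLMClient
class DummyClient:
    load_count = 0  # tracks how many times the "model" was loaded

    def __init__(self):
        DummyClient.load_count += 1

    def video_chat(self, prompt, video_path):
        return f"response to: {prompt}"

# 1) Load the model once at startup and keep the instance alive.
client = DummyClient()

# 2) Warm up with a few dummy inferences to initialize internal buffers
#    and reduce first-request latency.
for _ in range(3):
    client.video_chat(prompt="warm-up", video_path="warmup.mp4")

# 3) Reuse the same instance for every request; never reload per call.
def handle_request(prompt: str, video_path: str) -> str:
    return client.video_chat(prompt=prompt, video_path=video_path)

print(handle_request("Describe the video.", "your/test/video.mp4"))
print(DummyClient.load_count)  # the model was loaded exactly once
```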
Your First API Call
Note
To keep the model in memory across multiple client requests, developers should manage a persistent process.
For instance, you can create a FastAPI or Flask application that loads the model at startup and serves inference requests without reloading it for each call.
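One simple way to guarantee a single load per process, regardless of the web framework, is a cached factory function. The sketch below uses a hypothetical `get_client` helper and a `DummyClient` stand-in (the real `VLLMClient` import is shown in the comment); your FastAPI or Flask handler would call `get_client()` on each request.

```python
from functools import lru_cache

# Stand-in for VLLMClient so the sketch is self-contained; swap in
# the real import in your application:
#   from woven.vision.ai.wave_8b.vllm_client import VLLMClient
class DummyClient:
    instances = 0  # counts how many times the "model" was constructed

    def __init__(self):
        DummyClient.instances += 1

    def video_chat(self, prompt, video_path):
        return f"ok: {prompt}"

@lru_cache(maxsize=1)
def get_client() -> DummyClient:
    """Load the model on first use and cache it for the process lifetime."""
    return DummyClient()

# A request handler (e.g. a FastAPI or Flask view) would look like this;
# the model is constructed only on the first call.
def inference_endpoint(prompt: str, video_path: str) -> str:
    return get_client().video_chat(prompt=prompt, video_path=video_path)

inference_endpoint("first call", "a.mp4")
inference_endpoint("second call", "b.mp4")
print(DummyClient.instances)  # → 1
```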
Example Video Input:
Example 1: Basic inference
Code
from woven.vision.ai.wave_8b.vllm_client import VLLMClient
# Initialize model with default settings.
# The GPU Device ID is set to 0 by default
client = VLLMClient()
video_path = "your/test/video.mp4"
prompt = "Describe the video in detail, including objects, actions, and context."
generated_text = client.video_chat(
prompt=prompt, video_path=video_path
)
print(f"{generated_text}")
Output
The video takes place in a bustling coffee shop with a modern, industrial aesthetic.
A barista, dressed in a black t-shirt and grey apron, is seen preparing coffee.
The barista is focused on the task, using a coffee grinder and a portafilter.
The environment is well-lit with natural light streaming in, highlighting the clean,
organized counter and the array of coffee-making equipment.
The barista's movements are precise and practiced, indicating experience.
In the background, another barista is visible, adding to the busy atmosphere.
The colors are warm and inviting,
with the reds of the coffee cups and the metallic sheen of the equipment contrasting against
the wooden shelves and brick walls.
Example 2: Multi-turn conversations and chat history
The following code shows how to call the API for multi-turn conversations.
Code
import json
from woven.vision.ai.wave_8b.vllm_client import VLLMClient
# Initialize model with default settings.
client = VLLMClient()
# The return_history parameter is set to False by default.
# To enable it, set return_history=True.
# The chat function will then also return the 'chat_history'.
video_path = "your/test/video.mp4"
chat_history = None # Will be maintained across turns
# Turn 1: Ask about the video content
prompt1 = "What is happening in this video?"
response1, chat_history = client.video_chat(
prompt=prompt1,
video_path=video_path,
return_history=True,
chat_history=chat_history,
)
print(f"Response for the 1st prompt:\n{response1}")
print(f"Chat History 1:\n{json.dumps(chat_history, indent=2, ensure_ascii=False)}")
# Turn 2: Follow-up question about specific details
prompt2 = "Can you describe the main character's actions in more detail?"
response2, chat_history = client.video_chat(
video_path=video_path,
prompt=prompt2,
chat_history=chat_history,
return_history=True,
)
print(f"Response for the 2nd prompt:\n{response2}")
print(f"Chat History 2:\n{json.dumps(chat_history, indent=2, ensure_ascii=False)}")
# Turn 3: Another follow-up question
prompt3 = "What emotions do you think the character is expressing?"
response3, chat_history = client.video_chat(
video_path=video_path,
prompt=prompt3,
chat_history=chat_history,
return_history=True,
)
print(f"Response for the 3rd prompt:\n{response3}")
print(f"Chat History 3:\n{json.dumps(chat_history, indent=2, ensure_ascii=False)}")
Output
[
{
"role": "<|im_start|>user",
"content": "<image>\nThe video lasts 21.56s. 64 frames sampled (sample times ≈ ['0.1', '0.4', '0.8', '1.1', '1.5', '1.8', '2.1', '2.5', '2.8', '3.2']s...). Total frames: 64. What is happening in this video?"
},
{
"role": "<|im_start|>assistant",
"content": "A woman in a coffee shop is preparing a drink. She is wearing a black shirt and a gray apron. She is standing behind a counter with a coffee grinder, a coffee maker, and other coffee-making equipment. She is pouring coffee grounds into the grinder and then grinding them. A man in a white shirt and a gray apron is standing behind her, preparing a drink."
},
{
"role": "<|im_start|>user",
"content": "Can you describe the main character's actions in more detail?"
},
{
"role": "<|im_start|>assistant",
"content": "The woman is carefully pouring coffee grounds from a bag into the grinder. She then places the lid on the grinder and turns it on. She watches the coffee grounds being ground, then removes the lid and sets it aside."
},
{
"role": "<|im_start|>user",
"content": "What emotions do you think the character is expressing?"
},
{
"role": "<|im_start|>assistant",
"content": "The woman appears to be focused and attentive as she prepares the drink. She seems to be enjoying her work and is taking her time to ensure that the drink is made correctly."
}
]
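As the output shows, `chat_history` is a plain list of role/content dicts, and it grows with every turn. In long conversations you may want to bound its size before passing it back in. The `trim_history` helper below is our own illustration, not part of the `VLLMClient` API; it keeps the first user message (which carries the `<image>` tag and frame metadata) plus the most recent turns.

```python
def trim_history(chat_history, max_turns=4):
    """Keep the first user message plus the last `max_turns` messages."""
    if len(chat_history) <= max_turns + 1:
        return chat_history
    return [chat_history[0]] + chat_history[-max_turns:]

# A shortened version of the history shown above, for illustration.
history = [
    {"role": "<|im_start|>user", "content": "<image>\n... What is happening?"},
    {"role": "<|im_start|>assistant", "content": "A woman is preparing a drink."},
    {"role": "<|im_start|>user", "content": "Describe her actions."},
    {"role": "<|im_start|>assistant", "content": "She grinds coffee beans."},
    {"role": "<|im_start|>user", "content": "What emotions does she show?"},
    {"role": "<|im_start|>assistant", "content": "She appears focused."},
]

trimmed = trim_history(history, max_turns=4)
print(len(trimmed))  # 5: the first message plus the last four
```

The trimmed list can then be passed as `chat_history` on the next `video_chat` call in place of the full history.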