Video Memory

Beyond text, TeleMem lets agents store, retrieve, and reason over video content the same way they handle text memories. The design draws on Deep Video Discovery's agentic search and tool-use approach.

Install the video extras first:

pip install -e ".[video]"

Pipeline

add_mm() turns a raw video into retrievable memory in three cached stages:

1. Frame extraction: decode the video to JPEG frames at the configured FPS.
2. Caption generation: a VLM (e.g. Qwen3-Omni, GPT-4.1-mini) describes each clip.
3. Vector database: clip captions are embedded for semantic retrieval.

result = memory.add_mm(
    video_path="data/samples/video/3EQLFHRHpag.mp4",
    output_dir="data/samples/video",
)

Artifacts land under frames/, captions/, and vdb/ inside output_dir; stages whose outputs already exist are skipped automatically.
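
To check in advance which stages a call would still have to run, you can inspect the output directory yourself. The snippet below is a minimal sketch, not TeleMem's own cache check: `pending_stages` is a hypothetical helper, and it assumes a stage counts as complete once its directory exists and contains at least one entry.

```python
from pathlib import Path

# Stage directories named in the text above, in pipeline order.
STAGE_DIRS = ("frames", "captions", "vdb")


def pending_stages(output_dir):
    """Return the stage directories that are missing or empty.

    Hypothetical helper: assumes a stage is complete once its directory
    holds at least one entry. TeleMem's real check may differ.
    """
    root = Path(output_dir)
    pending = []
    for name in STAGE_DIRS:
        stage_dir = root / name
        # A missing or empty directory means the stage has not produced output.
        if not stage_dir.is_dir() or not any(stage_dir.iterdir()):
            pending.append(name)
    return pending


if __name__ == "__main__":
    print(pending_stages("data/samples/video"))
```

An empty list suggests that all three stages already have output on disk; deleting one of the directories is then a plausible way to force that stage to rerun, though the dependency between stages is not confirmed here.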

ReAct-style video QA

search_mm() runs MMCoreAgent, an agent that answers a question through a THINK → ACTION → OBSERVATION loop and can call three tools:

Tool	Function
`global_browse_tool`	Global overview of video events and themes
`clip_search_tool`	Semantic search for specific content
`frame_inspect_tool`	Inspect frame details in a time range
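
The loop itself can be pictured with a toy version. This is an illustrative sketch only and not MMCoreAgent's implementation: `run_agent`, `scripted_policy`, and the stub tools are hypothetical names, the tools return canned strings instead of querying captions or frames, and a scripted function stands in for the language model.

```python
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, str]]  # (role, text) pairs

# Stub tools standing in for the three tools in the table above.
# They return canned strings; the real tools read captions and frames.
TOOLS: Dict[str, Callable[[str], str]] = {
    "global_browse_tool": lambda arg: "<overview of the whole video>",
    "clip_search_tool": lambda arg: f"<clips matching '{arg}'>",
    "frame_inspect_tool": lambda arg: f"<frame details for {arg}>",
}


def scripted_policy(history: History) -> Tuple[str, str, str]:
    """Stand-in for the LLM: browse, then search, then answer."""
    n_obs = sum(1 for role, _ in history if role == "observation")
    if n_obs == 0:
        return "start with an overview", "global_browse_tool", ""
    if n_obs == 1:
        return "look for relevant clips", "clip_search_tool", "query text"
    return "enough evidence gathered", "finish", "<final answer>"


def run_agent(question: str,
              policy: Callable[[History], Tuple[str, str, str]],
              tools: Dict[str, Callable[[str], str]],
              max_iterations: int = 15) -> History:
    """Run THINK -> ACTION -> OBSERVATION until 'finish' or the cap."""
    history: History = [("question", question)]
    for _ in range(max_iterations):
        thought, action, arg = policy(history)        # THINK
        history.append(("thought", thought))
        if action == "finish":
            history.append(("answer", arg))
            break
        observation = tools[action](arg)              # ACTION
        history.append(("observation", observation))  # OBSERVATION
    return history


if __name__ == "__main__":
    for role, text in run_agent("What causes the problems?", scripted_policy, TOOLS):
        print(f"{role}: {text}")
```

In this sketch the iteration cap only stops a run that never reaches `finish`; whether `max_iterations` in `search_mm()` behaves the same way is an assumption.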

messages = memory.search_mm(
    question="""The problems people encounter in the video are caused by what?
    (A) Catastrophic weather. (B) Global warming. (C) Financial crisis. (D) Oil crisis.""",
    output_dir="data/samples/video",
    max_iterations=15,
)

from telemem.mm_utils import extract_choice_from_msg
print(extract_choice_from_msg(messages))   # "B"

The full runnable demo is examples/quickstart_mm.py; the repository ships a small sample video. VLM and embedding endpoints are configured in the vlm: section of your config file.