Existing benchmarks evaluate AI assistants on curated, short-duration clips. TeleEgo challenges models with real-world, continuous streams spanning hours of daily activities across diverse scenarios.
Authentic egocentric recordings from real users performing daily activities across work, study, social, shopping, health, and travel scenarios.
Questions arrive dynamically throughout the video stream, mimicking real personal assistant interactions without pre-segmentation.
Combines egocentric video, ambient audio, speech transcripts, and visual narrations requiring cross-modal reasoning.
Tests a model's ability to retain and recall information across extended time periods, from seconds to hours.
Models must respond within decision windows, reflecting the temporal demands of live assistance.
Every answer requires precise temporal and modality grounding, enabling auditable evaluation.
Comprehensive Real-Time Accuracy (RTA) evaluation results on the TeleEgo benchmark. Models are evaluated under a strict streaming protocol.
GPT-4o and Gemini-2.5-Pro are evaluated via API calls, so their internal implementations are opaque. Due to the extensive video duration (~14 hours) and API latency, both models were evaluated on the same participant's video for a fair comparison.
Key Observations: GPT-4o shows strong Understanding performance (66.67%), especially on Intent Inference (81.81%) and Causal Understanding (81.58%), leveraging its general-purpose reasoning. However, its performance drops on Memory (42.18%) and Cross-Memory Reasoning (44.23%), where fine-grained temporal binding is critical.
| Method | Params | Omni | Streaming | UlM | StM | ET | TCI | LtM | All (Mem) | II | CU | CmU | MsR | All (Und) | CeR | TCU | CtC | All (CMR) | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | - | ✓ | - | 42.31 | 40.58 | 31.91 | 47.37 | 52.78 | 42.18 | 81.81 | 81.58 | 45.71 | 50.00 | 66.67 | 44.44 | 40.00 | 45.00 | 44.23 | 48.94 |
| Gemini-2.5-Pro | - | ✓ | - | 49.04 | 45.59 | 34.04 | 47.37 | 44.44 | 45.05 | 63.64 | 55.26 | 37.14 | 28.57 | 48.03 | 40.74 | 40.00 | 45.00 | 42.31 | 45.55 |
These models lack cross-call memory mechanisms and process each input unit independently. They are categorized as "pseudo-streaming" because they do not natively support continuous streaming interaction. In our evaluation, we deliberately avoid supplying any uncompressed historical context to the models; otherwise, the burden of memory would be shifted to the context window rather than truly examining their intrinsic memory capability. They are therefore evaluated under a strict streaming protocol, ensuring that performance reflects the models' genuine ability to retain information over time.
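A minimal sketch of this strict streaming protocol, assuming a hypothetical `model.ingest` / `model.answer` interface (not any real model's API): each chunk is fed exactly once, and no uncompressed history is re-attached at query time, so recall must come from the model's own internal state.

```python
# Illustrative sketch of the strict streaming protocol (hypothetical interface).
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    start: float   # chunk start time (seconds)
    end: float     # chunk end time (seconds)
    frames: list   # sampled video frames for this chunk
    audio: object  # raw audio segment (omni models) or ASR text (others)

def run_streaming_eval(model, chunks: List[Chunk], questions):
    """questions: list of (arrival_time_sec, question_text, answer_checker)."""
    pending = sorted(questions, key=lambda q: q[0])
    results = []
    for chunk in chunks:
        model.ingest(chunk)  # only the current chunk is supplied; recall relies on internal memory
        while pending and pending[0][0] <= chunk.end:
            t_q, text, check = pending.pop(0)
            prediction = model.answer(text)  # no uncompressed history re-attached
            results.append({"time": t_q, "correct": check(prediction)})
    return results
```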
Key Observations: VideoChat-Online, Qwen2.5-VL, and Qwen2.5-Omni achieve moderate Understanding performance but struggle significantly on Memory and Cross-Memory Reasoning tasks. Among the three, Qwen2.5-Omni achieves the highest overall RTA (46.96%), primarily because of its stronger base model and because it avoids ASR transcription errors by processing audio directly.
| Method | Params | Omni | Streaming | UlM | StM | ET | TCI | LtM | All (Mem) | II | CU | CmU | MsR | All (Und) | CeR | TCU | CtC | All (CMR) | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VideoChat-Online | 4B | ✗ | ✗ | 30.74 | 26.29 | 19.35 | 29.84 | 22.78 | 26.69 | 57.35 | 44.13 | 35.44 | 27.89 | 42.34 | 18.79 | 32.00 | 41.60 | 29.43 | 31.28 |
| Qwen2.5-VL | 3B | ✗ | ✗ | 45.42 | 42.01 | 31.18 | 37.90 | 33.33 | 39.66 | 66.35 | 53.99 | 47.57 | 41.50 | 53.28 | 32.89 | 48.00 | 52.00 | 42.14 | 43.67 |
| Qwen2.5-Omni | 7B | ✓ | ✗ | 44.21 | 42.75 | 35.48 | 41.13 | 37.97 | 41.20 | 72.51 | 61.50 | 50.00 | 48.98 | 59.07 | 37.58 | 56.00 | 61.60 | 49.16 | 46.96 |
MiniCPM-o is the only model in our study with an explicit streaming interface that accepts chunk-wise incremental input. However, it lacks dynamic memory management: the KV cache grows without compression, eviction, or sliding window, causing GPU memory overflow and attention dilution as video length increases.
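Reusing the hypothetical chunk/model interface from the earlier sketch, the snippet below shows one plausible reading of the session-length setting: the stream is cut into fixed-length sessions, the KV cache accumulates within a session, and (as an assumption on our part) is cleared at session boundaries. `reset_cache` is a placeholder, not MiniCPM-o's actual API.

```python
def split_into_sessions(chunks, session_len_sec):
    """Group consecutive chunks into sessions of at most session_len_sec seconds."""
    sessions, current, session_start = [], [], None
    for chunk in chunks:
        if session_start is None:
            session_start = chunk.start
        if current and chunk.end - session_start > session_len_sec:
            sessions.append(current)
            current, session_start = [], chunk.start
        current.append(chunk)
    if current:
        sessions.append(current)
    return sessions

def run_session_eval(model, chunks, questions, session_len_sec):
    """Same (arrival_time, question, checker) convention as the earlier sketch."""
    pending = sorted(questions, key=lambda q: q[0])
    results = []
    for session in split_into_sessions(chunks, session_len_sec):
        model.reset_cache()        # assumption: KV cache cleared at session boundaries
        for chunk in session:
            model.ingest(chunk)    # cache grows uncompressed within the session
            while pending and pending[0][0] <= chunk.end:
                t_q, text, check = pending.pop(0)
                results.append({"time": t_q, "correct": check(model.answer(text))})
    return results
```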
Key Observations: RTA results indicate an optimal session length of about 1 minute (54.10% overall), with poorer performance at both shorter (50.60% at 1 sec) and longer session lengths (30.15% at 10 min). At longer session lengths, the model often produces incoherent or off-topic responses, confirming that naive KV-cache accumulation without memory management is insufficient for sustained streaming.
| Method | Params | Omni | Streaming | Session Length | UlM | StM | ET | TCI | LtM | All (Mem) | II | CU | CmU | MsR | All (Und) | CeR | TCU | CtC | All (CMR) | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MiniCPM-o | 8B | ✓ | ✓ | 1 sec | 51.99 | 51.60 | 35.13 | 47.58 | 44.30 | 47.54 | 75.83 | 61.50 | 51.94 | 44.22 | 59.59 | 28.19 | 48.00 | 64.80 | 45.15 | 50.60 |
| | | | | 1 min | 50.60 | 55.53 | 37.28 | 53.63 | 51.05 | 50.11 | 80.57 | 69.48 | 56.80 | 51.02 | 65.64 | 33.56 | 36.00 | 66.40 | 47.49 | 54.10 |
| | | | | 3 min | 48.19 | 52.58 | 37.28 | 49.19 | 45.99 | 47.31 | 73.93 | 63.38 | 53.88 | 40.14 | 59.33 | 32.89 | 32.00 | 66.40 | 46.82 | 50.57 |
| | | | | 5 min | 43.00 | 44.23 | 33.69 | 47.58 | 37.55 | 41.71 | 63.51 | 56.34 | 44.17 | 43.54 | 52.64 | 23.49 | 32.00 | 56.80 | 38.13 | 44.34 |
| | | | | 10 min | 28.50 | 28.01 | 23.66 | 30.65 | 26.16 | 27.60 | 41.71 | 40.38 | 34.95 | 29.93 | 37.32 | 15.44 | 36.00 | 37.60 | 26.42 | 30.15 |
Note: Columns are grouped into three capability blocks: Memory (Mem), Understanding (Und), and Cross-Memory Reasoning (CMR), each summarized by an "All" column, with an "Overall" column aggregating across blocks; all scores are percentages. "Omni" denotes integrated audio-video-text perception; "Streaming" denotes native support for streaming interaction. A ✓ indicates the capability is supported, ✗ indicates it is not supported, and "-" indicates unknown.
RTA measures the percentage of questions answered correctly within a strict time window (5 seconds). This metric simulates real-world scenarios where personal assistants must respond promptly.
Unlike traditional offline evaluation, RTA enforces temporal constraints on every answer.
RTA is calculated as: (Questions answered correctly within the time window) / (Total questions) × 100%.
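As a concrete reference, a minimal RTA computation over per-question records might look like the following sketch; the record fields (`correct`, `latency`) are our own illustrative assumptions.

```python
# Minimal RTA computation: a question counts only if it is answered correctly
# AND the response arrives within the decision window (5 seconds here).

def real_time_accuracy(records, window=5.0):
    """records: iterable of dicts with 'correct' (bool) and 'latency' (seconds)."""
    total, hits = 0, 0
    for r in records:
        total += 1
        if r["correct"] and r["latency"] <= window:
            hits += 1
    return 100.0 * hits / total if total else 0.0

# Example: two timely correct answers, one correct-but-late, one wrong -> 50.0
print(real_time_accuracy([
    {"correct": True,  "latency": 1.2},
    {"correct": True,  "latency": 3.8},
    {"correct": True,  "latency": 7.5},   # correct but outside the window
    {"correct": False, "latency": 2.0},
]))
```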
MPT measures how long a model retains information after initially answering correctly. For a question answered correctly at time t*, the system schedules recall tests at t* + rΔ (Δ = 60 s, r = 1, 2, ..., 10).
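A small sketch of this recall-probe schedule: probes are placed at t* + rΔ, and memory persistence can then be summarized, for instance, as the time of the last consecutive correct recall (this summary statistic is our illustrative choice, not necessarily the exact MPT definition).

```python
# Sketch of MPT recall-probe scheduling: for a question answered correctly at t_star,
# recall tests are placed at t_star + r * delta for r = 1..10 (delta = 60 s).

def schedule_recall_probes(t_star, delta=60.0, rounds=10):
    return [t_star + r * delta for r in range(1, rounds + 1)]

def memory_persistence(recall_correct, delta=60.0):
    """recall_correct: booleans for probes r = 1..len(recall_correct).
    Returns how long (seconds) the answer stayed recallable after t_star
    (illustrative summary: time of the last consecutive correct recall)."""
    persistence = 0.0
    for r, ok in enumerate(recall_correct, start=1):
        if not ok:
            break
        persistence = r * delta
    return persistence

# Example: correct recalls at +60 s and +120 s, failure at +180 s -> persistence 120 s
print(schedule_recall_probes(t_star=305.0)[:3])        # [365.0, 425.0, 485.0]
print(memory_persistence([True, True, False]))         # 120.0
```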
Key characteristics:
- Recall probes are scheduled only for questions that were first answered correctly.
- Probes are placed at fixed intervals after the initial correct answer (every 60 seconds, for up to 10 rounds).
- The metric reflects how long information remains recallable after it has left the input stream.
Currently, we have not included MPT results in our evaluation tables. This is because MPT requires models to demonstrate genuine long-term memory: the ability not only to answer correctly when information first appears, but also to retain and recall that information long after it has disappeared from the input stream.
Current models are not yet well suited to MPT evaluation: the pseudo-streaming models keep no memory across calls, and even MiniCPM-o's naive KV-cache accumulation degrades over long horizons.
MPT is a forward-looking metric that aligns with TeleEgo's goal of evaluating realistic long-term egocentric assistants. We look forward to the emergence of models with true long-horizon streaming memory; few current models can run continuously for hours without resets while maintaining stable generation quality. We anticipate that future, truly streaming MLLMs with dynamic memory management will enable meaningful MPT evaluation, providing critical insights into how long information is retained beyond the moment of first use.
TeleEgo, together with our released MPT evaluation framework and code, provides a unified setting to assess both real-time correctness and long-term memory persistence, paving the way for future work on practical egocentric AI assistants.
Questions are organized into three cognitive dimensions (Memory, Understanding, and Cross-Memory Reasoning) with 12 fine-grained subtasks.
TeleEgo comprises 70+ hours of continuous egocentric video from 5 diverse participants, with 3,291 manually annotated questions covering 12 cognitive tasks.
Participants wore first-person cameras continuously for 3 days, recording authentic daily activities across work, social, lifestyle, and cultural scenarios. Average video length: 14.4 hours per participant.
Raw videos were processed to extract synchronized audio streams, speech transcripts via ASR, and visual narrations describing participant actions. All modalities are aligned to a unified timeline with millisecond precision.
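The unified timeline can be pictured as time-stamped events from each modality merged into one ordered stream. The sketch below is purely illustrative; the field names and structure are assumptions, not the released data format.

```python
# Illustrative unified-timeline structure: events from every modality carry absolute
# timestamps (in milliseconds) so they can be merged into a single ordered stream.

from dataclasses import dataclass
from typing import List

@dataclass
class TimelineEvent:
    start_ms: int        # event start on the shared timeline
    end_ms: int          # event end on the shared timeline
    modality: str        # "video", "audio", "speech_transcript", or "narration"
    payload: str         # e.g., ASR text, narration text, or a frame/clip reference

def merge_modalities(*streams: List[TimelineEvent]) -> List[TimelineEvent]:
    """Merge per-modality event lists into one stream ordered by start time."""
    merged = [event for stream in streams for event in stream]
    return sorted(merged, key=lambda e: (e.start_ms, e.end_ms))

def events_in_window(timeline, start_ms, end_ms):
    """All events overlapping a query window, regardless of modality."""
    return [e for e in timeline if e.start_ms < end_ms and e.end_ms > start_ms]
```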
Expert annotators generated 3,291 questions using GPT-4o for initial candidate generation, followed by rigorous human verification. Each question includes precise temporal grounding (evidence timestamps) and required modality information.
Multiple rounds of quality checks ensure factual accuracy, temporal correctness, question clarity, and answer uniqueness. All personal identifiable information is removed or anonymized.
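For illustration only, a single annotated question might be represented roughly as follows; the field names, IDs, and the question/answer content are hypothetical and do not reflect the released annotation schema.

```python
# Hypothetical example of one annotated question record (illustrative fields only,
# NOT the released TeleEgo schema). Note the explicit temporal grounding and the
# modalities required to answer.

example_question = {
    "question_id": "P01_Q0042",                    # hypothetical identifier
    "arrival_time_ms": 5_412_000,                  # when the question enters the stream
    "question": "Which card game did I play after lunch?",
    "answer": "UNO",
    "task": "Cross-Memory Reasoning",              # one of the three capability blocks
    "evidence_spans_ms": [[4_980_500, 5_010_250]], # timestamps grounding the answer
    "required_modalities": ["video", "speech_transcript"],
}
```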
Representative scenarios include:
- Presentations, meetings, coding, studying, reception tasks
- Card games (UNO, Mahjong, Poker), video games, pool
- Shopping, cooking, walking, fitness
- Dining out, dating, museum visits
We welcome submissions from all researchers! Submit your model's predictions to appear on the official leaderboard. Choose your preferred submission method below.
Submit via GitHub PR for the fastest processing and automatic validation.
- Create a folder under `submissions/` with the format `YYYY-MM-DD_YourModelName/`, containing:
  - `results.json`: your model predictions
  - `metadata.json`: model information
  - `README.md`: a brief description
- Title your pull request `[Submission] YourModelName`.

For quick submissions, or if you're unable to create a pull request, you can submit via GitHub Issues.
- Attach your `results.json` and `metadata.json` files.

If GitHub is not accessible, you can email your submission directly to our team.
- Attach `results.json`, `metadata.json`, and `README.md`.
- Use the subject line `[TeleEgo Submission] Your Model Name`.

`results.json` contains your prediction results for each question in the test set.
`metadata.json` contains information about your model and experimental setup.
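Purely as an illustration of the general shape of these two files, a submission might be assembled as below; every field name here is an assumption for the sketch, not the official submission schema.

```python
# Assumed illustration of the two submission files (field names are NOT the official schema).
import json

results = {                       # results.json: one prediction per test question
    "P01_Q0042": {"answer": "UNO"},
    "P01_Q0043": {"answer": "In the museum gift shop"},
}

metadata = {                      # metadata.json: model and experimental setup
    "model_name": "YourModelName",
    "params": "7B",
    "omni": True,
    "streaming": False,
    "notes": "Frames sampled at 1 fps; ASR transcripts used for audio.",
}

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```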
If you have questions about the submission process, please open a GitHub issue or email our team.
If you find TeleEgo useful in your research, please cite our paper:
@misc{yan2025teleegobenchmarkingegocentricai,
title={TeleEgo: Benchmarking Egocentric AI Assistants in the Wild},
author={Jiaqi Yan and Ruilong Ren and Jingren Liu and Shuning Xu and Ling Wang and Yiheng Wang and Xinlin Zhong and Yun Wang and Long Zhang and Xiangyu Chen and Changzhi Sun and Jixiang Luo and Dell Zhang and Hao Sun and Chi Zhang and Xuelong Li},
year={2025},
eprint={2510.23981},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.23981},
}