Existing benchmarks evaluate AI assistants on curated, short-duration clips. TeleEgo challenges models with real-world, continuous streams spanning hours of daily activities across diverse scenarios.
Authentic egocentric recordings from real users performing daily activities across work, study, social, shopping, health, and travel scenarios.
Questions arrive dynamically throughout the video stream, mimicking real personal assistant interactions without pre-segmentation.
Combines egocentric video, ambient audio, speech transcripts, and visual narrations requiring cross-modal reasoning.
Tests a model's ability to retain and recall information across extended time spans, from seconds to hours.
Models must respond within decision windows, reflecting the temporal demands of live assistance.
Every answer requires precise temporal and modality grounding, enabling auditable evaluation.
Comprehensive evaluation results on the TeleEgo benchmark. Models are tested on both Test Set A (public) and Test Set B (hidden).
Public test set with released ground truth. Models can be fine-tuned and optimized on this set.
| Rank | Model | Memory (Avg %) | Understanding (Avg %) | Cross-Memory (Avg %) | Overall (Avg %) | MPT (minutes) |
|---|---|---|---|---|---|---|
| 🥇 1 | GPT-4o | 42.69 | 60.92 | 45.87 | 48.04 | 3.01 |
| 🥈 2 | Gemini-2.5-Pro | 42.23 | 57.98 | 40.26 | 46.35 | 2.76 |
| 3 | MiniCPM-o | 40.36 | 50.19 | 38.28 | 42.84 | 2.19 |
| 4 | Qwen2.5-VL-2.5 | 34.24 | 35.89 | 27.39 | 33.96 | 1.60 |
| 5 | Videochat-Online | 28.91 | 41.76 | 29.04 | 32.46 | 1.33 |
| 6 | Qwen2.5-Omni | 25.34 | 27.33 | 20.13 | 25.33 | 1.00 |
Hidden test set for unbiased evaluation. Ground truth is not released. Submit your model predictions for evaluation.
| Rank | Model | Memory (Avg %) | Understanding (Avg %) | Cross-Memory (Avg %) | Overall (Avg %) | MPT (minutes) |
|---|---|---|---|---|---|---|
| 🥇 1 | GPT-4o | -- | -- | -- | -- | -- |
| 🥈 2 | Gemini-2.5-Pro | -- | -- | -- | -- | -- |
| 3 | MiniCPM-o | -- | -- | -- | -- | -- |
| 4 | Qwen2.5-VL-2.5 | -- | -- | -- | -- | -- |
| 5 | Videochat-Online | -- | -- | -- | -- | -- |
| 6 | Qwen2.5-Omni | -- | -- | -- | -- | -- |
🚀 Be among the first to submit your results!
Measures whether the model can produce a correct answer within the decision window (the time interval between question arrival and the answer deadline). This reflects practical usability in streaming scenarios.
Only the first correct output within the decision window receives credit.
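Read as a scoring rule, this means each question is credited at most once, at the first correct output inside its window. Below is a minimal sketch of that rule; the `ModelOutput` record and its field names are our own illustration, not the benchmark's official harness.

```python
from dataclasses import dataclass

@dataclass
class ModelOutput:
    t: float          # wall-clock time (seconds) at which the answer was emitted
    correct: bool     # whether it matches the ground-truth answer

def first_correct_time(outputs: list[ModelOutput],
                       arrival: float, deadline: float) -> float | None:
    """Return the timestamp of the first correct output inside the decision
    window [arrival, deadline], or None if there is none. Each question is
    credited at most once, so any later correct outputs are ignored."""
    for o in sorted(outputs, key=lambda o: o.t):
        if arrival <= o.t <= deadline and o.correct:
            return o.t
    return None
```

Accuracy under this rule is then the fraction of questions for which `first_correct_time` returns a timestamp.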
Among questions initially answered correctly, MPT measures how long the model can still recover the correct answer without re-exposure to the underlying evidence.
Higher MPT indicates better long-horizon memory retention.
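The paper specifies the exact probing protocol; purely as a sketch, suppose each initially-correct question is re-probed at increasing delays without re-exposing the evidence. A per-question persistence time can then be taken as the longest delay at which the answer is still recovered, and MPT as its average. The data layout below is an assumption, not the official definition.

```python
def persistence_time(probes: dict[float, bool]) -> float:
    """probes maps a re-probe delay (minutes after the initial correct answer,
    with no re-exposure to the evidence) to whether the model still answered
    correctly. Returns the longest delay still answered correctly; 0.0 means
    the answer was lost immediately."""
    return max((delay for delay, ok in probes.items() if ok), default=0.0)

def mean_persistence_time(per_question: list[dict[float, bool]]) -> float:
    """Average persistence time over all initially-correct questions."""
    times = [persistence_time(p) for p in per_question]
    return sum(times) / len(times) if times else 0.0
```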
Every QA item includes time-stamped evidence spans and required-modality tags. Systems must localize verifiable support from the correct modalities.
Models must identify the exact time intervals containing relevant evidence, preventing vague or hallucinated responses.
Each question specifies required modalities (video, speech, narration). Answers must draw from the correct sources.
Submitted evidence must overlap with annotated spans, enabling quantitative assessment of retrieval quality.
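One plausible way to operationalize this check is temporal IoU between a predicted span and the annotated evidence spans of the required modality. The threshold value and data layout below are assumptions for illustration, not the benchmark's official criterion.

```python
def temporal_iou(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Intersection-over-union of two [start, end] intervals (seconds)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def is_grounded(pred_span: tuple[float, float], pred_modality: str,
                gold: list[tuple[tuple[float, float], str]],
                iou_thr: float = 0.3) -> bool:
    """A predicted span counts as grounded if it overlaps an annotated span
    of the required modality above the (assumed) IoU threshold."""
    return any(mod == pred_modality and temporal_iou(pred_span, span) >= iou_thr
               for span, mod in gold)
```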
Meetings, focused learning, research, and collaborative tasks.
Daily habits, home organization, wellness, and time management.
Conversations, gatherings, group coordination, and shared events.
Dining, entertainment, museums, concerts, and city exploration.
Continuous first-person visual stream captured by wearable cameras.
Full audio track including conversations and environmental sounds.
Human-authored descriptions of visual events with precise timestamps.
Automatic speech recognition text aligned with the audio stream.
We welcome submissions from all researchers! Choose one of the methods below to submit your model's evaluation results.
This is the preferred method for official leaderboard submissions. Your results will be reviewed and added to the leaderboard.
Create a folder under `submissions/` with your model name, `submissions/your-model-name-YYYY-MM-DD/`, and open a pull request titled `[Submission] Your Model Name` that includes:

- `results.json` - Your model's predictions (see format below)
- `metadata.json` - Model information and configuration
- `README.md` - Brief description of your model and approach
- `paper.pdf` - Technical paper, if available

For quick submissions, or if you're unable to create a pull request, you can submit via GitHub Issues: attach your `results.json` and `metadata.json` files.

If GitHub is not accessible, you can email your submission directly to our team. Attach `results.json`, `metadata.json`, and `README.md`, and use the subject line `[TeleEgo Submission] Your Model Name`.

File formats (an illustrative sketch follows below):

- `results.json` - Your prediction results for each question in the test set.
- `metadata.json` - Information about your model and experimental setup.
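The repository defines the authoritative schema for these files. Purely as an illustration, a submission script might assemble them as follows; every field name here is hypothetical.

```python
import json

# All field names below are hypothetical -- check the repository's
# official schema before submitting.
results = {
    "model_name": "your-model-name",
    "predictions": [
        {
            "question_id": "q_0001",                 # one entry per test question
            "answer": "She left the keys in the blue backpack.",
            "evidence_spans": [                      # temporal + modality grounding
                {"start_sec": 1520.0, "end_sec": 1535.5, "modality": "video"}
            ],
        },
    ],
}

metadata = {
    "model_name": "your-model-name",
    "organization": "Your Lab",
    "modalities_used": ["video", "audio", "narration"],
    "contact": "you@example.org",
}

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```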
If you find TeleEgo useful in your research, please cite our paper:
```bibtex
@misc{yan2025teleegobenchmarkingegocentricai,
  title={TeleEgo: Benchmarking Egocentric AI Assistants in the Wild},
  author={Jiaqi Yan and Ruilong Ren and Jingren Liu and Shuning Xu and Ling Wang and Yiheng Wang and Yun Wang and Long Zhang and Xiangyu Chen and Changzhi Sun and Jixiang Luo and Dell Zhang and Hao Sun and Chi Zhang and Xuelong Li},
  year={2025},
  eprint={2510.23981},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.23981},
}
```