TeleEgo Benchmark

Live-in-the-Wild Personal Assistant Benchmark
🎥 Egocentric Video 🔊 Multi-Modal Audio ⏱️ Real-Time Processing 🧠 Long-Term Memory

Why TeleEgo?

Existing benchmarks evaluate AI assistants on curated, short-duration clips. TeleEgo challenges models with real-world, continuous streams spanning hours of daily activities across diverse scenarios.

🌍 Real-World Complexity

Authentic egocentric recordings from real users performing daily activities across work, study, social, shopping, health, and travel scenarios.

🎬 Streaming Protocol

Questions arrive dynamically throughout the video stream, mimicking real personal assistant interactions without pre-segmentation.

🧩 Multi-Modal Integration

Combines egocentric video, ambient audio, speech transcripts, and visual narrations requiring cross-modal reasoning.

โณ

Long-Horizon Memory

Tests a model's ability to retain and recall information across extended time periods, from seconds to hours.

⚡ Real-Time Constraints

Models must respond within decision windows, reflecting the temporal demands of live assistance.

📊 Verifiable Evidence

Every answer requires precise temporal and modality grounding, enabling auditable evaluation.

Leaderboard

Comprehensive Real-Time Accuracy (RTA) evaluation results on the TeleEgo benchmark. All models are evaluated under a strict streaming protocol.

Table 1: RTA results of proprietary MLLMs on TeleEgo (single-video evaluation)

GPT-4o and Gemini-2.5-Pro are evaluated via API calls, so their internal implementations are opaque. Because of the extensive video duration (~14 hours) and API latency, both models were evaluated on the same participant's video to ensure a fair comparison.

Key Observations: GPT-4o shows strong Understanding performance (66.67%), especially on Intent Inference (81.81%) and Causal Understanding (81.58%), leveraging its general-purpose reasoning. However, performance drops on Memory (42.18%) and Cross-Memory Reasoning (44.23%) where fine-grained temporal binding is critical.

| Method | Params | Omni | Streaming | UlM | StM | ET | TCI | LtM | Mem All | II | CU | CmU | MsR | Und All | CeR | TCU | CtC | CMR All | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | - | ✓ | - | 42.31 | 40.58 | 31.91 | 47.37 | 52.78 | 42.18 | 81.81 | 81.58 | 45.71 | 50.00 | 66.67 | 44.44 | 40.00 | 45.00 | 44.23 | 48.94 |
| Gemini-2.5-Pro | - | ✓ | - | 49.04 | 45.59 | 34.04 | 47.37 | 44.44 | 45.05 | 63.64 | 55.26 | 37.14 | 28.57 | 48.03 | 40.74 | 40.00 | 45.00 | 42.31 | 45.55 |

Table 2: RTA results of pseudo-streaming open-source MLLMs on TeleEgo

These models lack cross-call memory mechanisms and process each input unit independently. They are categorized as "pseudo-streaming" because they do not natively support continuous streaming interaction. In our evaluation, we deliberately avoid supplying any uncompressed historical context to the models; otherwise, the burden of memory would shift to the context window rather than truly exercising their intrinsic memory capability. They are therefore evaluated under a strict streaming protocol, ensuring that performance reflects their genuine ability to retain information over time.
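
For illustration, the sketch below shows one way this strict streaming protocol could be organized for pseudo-streaming models. It is a minimal sketch under assumptions: `stream_chunks`, the question fields, and `model.answer(...)` are hypothetical placeholders, not the released evaluation code.

```python
# Minimal sketch of a strict streaming protocol for pseudo-streaming models.
# Assumed (hypothetical) structures: `stream_chunks` is a time-ordered list of
# {"start", "end", "content"} units, each question carries a trigger time, and
# model.answer(chunk, question) is a single, memory-less call.

def evaluate_pseudo_streaming(model, stream_chunks, questions):
    predictions = []
    for q in sorted(questions, key=lambda item: item["trigger_time"]):
        # Only the chunk that is "live" at the trigger time is shown to the model;
        # no uncompressed history is replayed, so memory cannot hide in the context window.
        current = next(
            c for c in stream_chunks
            if c["start"] <= q["trigger_time"] < c["end"]
        )
        predictions.append({
            "question_id": q["id"],
            "answer": model.answer(current["content"], q["text"]),
        })
    return predictions
```

The key point of this design is that every call sees only the currently live input unit, so any apparent memory must come from the model itself rather than from an ever-growing prompt.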

Key Observations: VideoChat-Online, Qwen2.5-VL, and Qwen2.5-Omni achieve reasonable Understanding performance but struggle significantly on Memory and Cross-Memory Reasoning tasks. Among the three, Qwen2.5-Omni achieves the highest overall RTA (46.96%), primarily because of its stronger base model and because it avoids ASR transcription errors by processing audio directly.

| Method | Params | Omni | Streaming | UlM | StM | ET | TCI | LtM | Mem All | II | CU | CmU | MsR | Und All | CeR | TCU | CtC | CMR All | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VideoChat-Online | 4B | ✗ | ✗ | 30.74 | 26.29 | 19.35 | 29.84 | 22.78 | 26.69 | 57.35 | 44.13 | 35.44 | 27.89 | 42.34 | 18.79 | 32.00 | 41.60 | 29.43 | 31.28 |
| Qwen2.5-VL | 3B | ✗ | ✗ | 45.42 | 42.01 | 31.18 | 37.90 | 33.33 | 39.66 | 66.35 | 53.99 | 47.57 | 41.50 | 53.28 | 32.89 | 48.00 | 52.00 | 42.14 | 43.67 |
| Qwen2.5-Omni | 7B | ✓ | ✗ | 44.21 | 42.75 | 35.48 | 41.13 | 37.97 | 41.20 | 72.51 | 61.50 | 50.00 | 48.98 | 59.07 | 37.58 | 56.00 | 61.60 | 49.16 | 46.96 |

Table 3: RTA results of MiniCPM-o with varying session lengths on TeleEgo

MiniCPM-o is the only model in our study with an explicit streaming interface that accepts chunk-wise incremental input. However, it lacks dynamic memory management: the KV cache grows without compression, eviction, or sliding window, causing GPU memory overflow and attention dilution as video length increases.

Key Observations: RTA results indicate an optimal session length of about 1 minute (54.10% overall), with poorer performance at both shorter (50.60% at 1s) and longer intervals (30.15% at 10min). At longer intervals, the model often produces incoherent or off-topic responses, confirming that naive KV cache accumulation without memory management is insufficient for sustained streaming.

| Method | Params | Omni | Streaming | Session Length | UlM | StM | ET | TCI | LtM | Mem All | II | CU | CmU | MsR | Und All | CeR | TCU | CtC | CMR All | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MiniCPM-o | 8B | ✓ | ✓ | 1 sec | 51.99 | 51.60 | 35.13 | 47.58 | 44.30 | 47.54 | 75.83 | 61.50 | 51.94 | 44.22 | 59.59 | 28.19 | 48.00 | 64.80 | 45.15 | 50.60 |
| | | | | 1 min | 50.60 | 55.53 | 37.28 | 53.63 | 51.05 | 50.11 | 80.57 | 69.48 | 56.80 | 51.02 | 65.64 | 33.56 | 36.00 | 66.40 | 47.49 | 54.10 |
| | | | | 3 min | 48.19 | 52.58 | 37.28 | 49.19 | 45.99 | 47.31 | 73.93 | 63.38 | 53.88 | 40.14 | 59.33 | 32.89 | 32.00 | 66.40 | 46.82 | 50.57 |
| | | | | 5 min | 43.00 | 44.23 | 33.69 | 47.58 | 37.55 | 41.71 | 63.51 | 56.34 | 44.17 | 43.54 | 52.64 | 23.49 | 32.00 | 56.80 | 38.13 | 44.34 |
| | | | | 10 min | 28.50 | 28.01 | 23.66 | 30.65 | 26.16 | 27.60 | 41.71 | 40.38 | 34.95 | 29.93 | 37.32 | 15.44 | 36.00 | 37.60 | 26.42 | 30.15 |

Note: Columns are grouped into three capability blocks: Memory (UlM: Ultra-long Memory, StM: Short-term Memory, ET: Entity Tracking, TCI: Temporal Comparison & Interval, LtM: Long-term Memory), Understanding (II: Intent Inference, CU: Causal Understanding, CmU: Cross-modal Understanding, MsR: Multi-step Reasoning), and Cross-Memory Reasoning (CeR: Cross-entity Relation, TCU: Temporal Chain Understanding, CtC: Cross-temporal Causality). Each block is summarized by an "All" column (Mem All, Und All, CMR All), and "Overall" aggregates across blocks. "Omni" denotes integrated audio-video-text perception; "Streaming" denotes native support for streaming interaction. A ✓ indicates the capability is supported, ✗ indicates it is not, and "-" indicates unknown.

📊 Evaluation Metrics

⚡ Real-Time Accuracy (RTA)

RTA measures the percentage of questions answered correctly within a strict time window (5 seconds). This metric simulates real-world scenarios where personal assistants must respond promptly.

Unlike traditional offline evaluation, RTA enforces temporal constraints:

  • Decision Window: Models have only 5 seconds after question trigger to respond
  • First-Attempt Only: Only the first response is evaluated (no retries)
  • Streaming Protocol: Questions arrive dynamically throughout video playback

RTA is calculated as: (Questions answered correctly within the time window) / (Total questions) × 100%
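
As a concrete illustration, here is a minimal sketch of how RTA could be computed from per-question records; the record fields (`correct`, `response_time`) are assumed names for illustration, not the official scoring script.

```python
# Minimal RTA sketch: fraction of questions answered correctly within the
# 5-second decision window, scored on the first attempt only.
DECISION_WINDOW_S = 5.0

def real_time_accuracy(records):
    """records: list of dicts with assumed fields 'correct' (bool) and
    'response_time' (seconds elapsed after the question trigger)."""
    if not records:
        return 0.0
    hits = sum(
        1 for r in records
        if r["correct"] and r["response_time"] <= DECISION_WINDOW_S
    )
    return 100.0 * hits / len(records)

# Example: 2 of 3 questions are answered correctly in time -> RTA ~= 66.67%
print(real_time_accuracy([
    {"correct": True, "response_time": 2.5},
    {"correct": True, "response_time": 6.1},   # correct, but outside the window
    {"correct": True, "response_time": 4.0},
]))
```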

🧠 Memory Persistence Time (MPT)

MPT measures how long a model retains information after initially answering correctly. For a question answered correctly at time t*, the system schedules recall tests at t* + r·Δ (Δ = 60 s, r = 1, 2, ..., 10).

Key characteristics:

  • No Evidence Replay: During recall, only the current video stream is accessible (the original evidence is not re-shown)
  • Early Stopping: Once a recall test fails, subsequent tests are removed from the schedule
  • MPT Calculation: Time from t* to first recall failure (capped at 600 seconds for models passing all 10 recalls)
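
For illustration, a minimal sketch of this recall schedule and the resulting MPT is shown below, following the definition above (time from t* to the first failed recall, capped at 600 seconds); `run_recall_test` is a hypothetical hook that re-asks the question without evidence replay, not part of the released code.

```python
# Minimal MPT sketch: recall tests every DELTA_S seconds after a correct answer
# at time t_star, with early stopping at the first failure and a 600-second cap.
DELTA_S = 60.0
NUM_RECALLS = 10

def memory_persistence_time(t_star, run_recall_test):
    """run_recall_test(t): hypothetical callable that re-asks the question at
    stream time t using only the current stream and returns True on success."""
    for r in range(1, NUM_RECALLS + 1):
        t = t_star + r * DELTA_S
        if not run_recall_test(t):
            # Early stopping: the remaining tests are dropped; MPT is the
            # elapsed time from t_star to this first failed recall.
            return r * DELTA_S
    # All 10 recalls passed: MPT is capped at 600 seconds.
    return NUM_RECALLS * DELTA_S
```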

โš ๏ธ About MPT Evaluation Status

Currently, we have not included MPT results in our evaluation tables. This is because MPT requires models to demonstrate genuine long-term memory: the ability not only to answer correctly when information first appears, but also to retain and recall that information long after it has disappeared from the input stream.

Why current models are not suitable for MPT evaluation:

  • Pseudo-streaming models (VideoChat-Online, Qwen2.5-VL, Qwen2.5-Omni): Each call is independent, with no cross-call memory. If they answer correctly in both the initial and recall phases, we cannot distinguish "remembered" from "guessed correctly twice".
  • API models (GPT-4o, Gemini-2.5-Pro): Their internal implementations are black boxes. In our evaluation setup (without accumulating raw history), they face the same issue as pseudo-streaming models.
  • Streaming model (MiniCPM-o): Although it supports streaming input and maintains KV cache across calls, we must manually reset sessions every few minutes to prevent memory overflow and generation quality degradation. This means if an initial test and its recall span across a session boundary, the model's memory is artificially cleared - making MPT measurements unreliable.

MPT is a forward-looking metric that aligns with TeleEgo's goal of evaluating realistic long-term egocentric assistants. We look forward to the emergence of models with true long-horizon streaming memory; very few current models can run continuously for hours without resets while maintaining stable generation quality. We anticipate that future truly streaming MLLMs with dynamic memory management will enable meaningful MPT evaluation, providing critical insight into how long information is retained beyond the moment of first use.

🚀 TeleEgo, together with our released MPT evaluation framework and code, provides a unified setting to assess both real-time correctness and long-term memory persistence, paving the way for future work on practical egocentric AI assistants.

📋 Task Categories

Questions are organized into three cognitive dimensions with 12 fine-grained subtasks:

  • Memory (58.8% of questions): Ultra-long Memory, Short-term Memory, Long-term Memory, Entity Tracking, Temporal Comparison & Interval
  • Understanding (27.3%): Intent Inference, Causal Understanding, Cross-modal Understanding, Multi-step Reasoning
  • Cross-Memory Reasoning (13.9%): Cross-entity Relation, Temporal Chain Understanding, Cross-temporal Causality

📚 Dataset

TeleEgo comprises 70+ hours of continuous egocentric video from 5 diverse participants, with 3,291 manually annotated questions covering 12 cognitive tasks.

5 Participants
70+ Hours of Video
3,291 QA Pairs
12 Task Categories

📊 Data Collection

📹 Egocentric Video Capture

Participants wore first-person cameras continuously for 3 days, recording authentic daily activities across work, social, lifestyle, and cultural scenarios. Average video length: 14.4 hours per participant.

🔊 Multi-Modal Processing

Raw videos were processed to extract synchronized audio streams, speech transcripts via ASR, and visual narrations describing participant actions. All modalities are aligned to a unified timeline with millisecond precision.

โœ๏ธ

Question Annotation

Expert annotators generated 3,291 questions using GPT-4o for initial candidate generation, followed by rigorous human verification. Each question includes precise temporal grounding (evidence timestamps) and required modality information.

✅ Quality Control

Multiple rounds of quality checks ensure factual accuracy, temporal correctness, question clarity, and answer uniqueness. All personally identifiable information is removed or anonymized.

🎯 Covered Scenarios

💼 Work & Study

Presentations, meetings, coding, studying, reception tasks

🎮 Social Activities

Card games (UNO, Mahjong, Poker), video games, pool

🏠 Lifestyle & Routines

Shopping, cooking, walking, fitness

๐Ÿฝ๏ธ

Outings & Culture

Dining out, dating, museum visits

🤗 Download Dataset on Hugging Face

📤 Submit Your Results

We welcome submissions from all researchers! Submit your model's predictions to appear on the official leaderboard. Choose your preferred submission method below.

🔀 Method 1: GitHub Pull Request (Recommended)

Submit via GitHub PR for the fastest processing and automatic validation.

๐Ÿ“ Steps:
  1. Fork the TeleEgo repository
  2. Create a new folder under submissions/ with format: YYYY-MM-DD_YourModelName/
  3. Add your submission files (see the example layout after these steps):
    • results.json - Your model predictions
    • metadata.json - Model information
    • README.md - Brief description
  4. Create a pull request with title: [Submission] YourModelName
  5. Our team will review and merge within 3-5 business days
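
For example, a submission folder following the naming convention above might look like the layout below; the date and model name are placeholders taken from the sample files later in this section.

```
submissions/
└── 2025-01-15_YourModel-v1.0/    # YYYY-MM-DD_YourModelName/
    ├── results.json              # model predictions
    ├── metadata.json             # model information
    └── README.md                 # brief description
```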

💬 Method 2: GitHub Issue

For quick submissions or if you're unable to create a pull request, you can submit via GitHub Issues.

๐Ÿ“ Steps:
  1. Go to our Issues page
  2. Click "New Issue" and select the "Submission" template
  3. Fill in all required information in the template
  4. Attach your results.json and metadata.json files
  5. Submit the issue and we'll process your submission

📧 Method 3: Email Submission

If GitHub is not accessible, you can email your submission directly to our team.

๐Ÿ“ Steps:
  1. Prepare your submission files as a ZIP archive
  2. Include all required files: results.json, metadata.json, README.md
  3. Email to: chengxuyuangg@gmail.com
  4. Use subject line: [TeleEgo Submission] Your Model Name
  5. We'll confirm receipt within 48 hours
โš ๏ธ Note: Email submissions may take longer to process than GitHub submissions. Please allow 5-7 business days for review.

📄 Submission Format

results.json Format:

Your prediction results for each question in the test set.

{ "submission_info": { "model_name": "YourModel-v1.0", "submission_date": "2025-01-15", "team": "Your Organization" }, "predictions": [ { "question_id": "Q001", "answer": "Your model's answer", "response_time": 2.5, "evidence_spans": [ { "modality": "video" } ] } ] }

metadata.json Format:

Information about your model and experimental setup.

{ "model_name": "YourModel-v1.0", "organization": "Your Organization", "authors": ["Author 1", "Author 2"], "contact_email": "your.email@example.com", "paper_url": "https://arxiv.org/abs/...", "code_url": "https://github.com/...", "model_description": "Brief description of your approach", "model_parameters": "7B", "training_data": "Description of training data used", "modalities_used": ["video", "audio", "text"], "inference_time": "Average time per question" }
โš ๏ธ Important Guidelines:
  • Ensure your results are reproducible
  • Do not use the test set for training or hyperparameter tuning
  • Provide complete information in metadata.json
  • Include evidence spans for all predictions when possible
  • Follow the exact JSON format specified above
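
Before submitting, it can help to sanity-check both files against the examples above. The sketch below is a hypothetical helper based on the field names shown in this section, not an official validator.

```python
# Hypothetical pre-submission check for results.json / metadata.json,
# based on the example schemas shown above (not an official validator).
import json

REQUIRED_PREDICTION_FIELDS = {"question_id", "answer", "response_time"}
REQUIRED_METADATA_FIELDS = {"model_name", "organization", "contact_email", "modalities_used"}

def check_submission(results_path="results.json", metadata_path="metadata.json"):
    with open(results_path) as f:
        results = json.load(f)
    with open(metadata_path) as f:
        metadata = json.load(f)

    assert "submission_info" in results and "predictions" in results
    for p in results["predictions"]:
        missing = REQUIRED_PREDICTION_FIELDS - p.keys()
        assert not missing, f"prediction {p.get('question_id', '?')} is missing {missing}"

    missing = REQUIRED_METADATA_FIELDS - metadata.keys()
    assert not missing, f"metadata.json is missing {missing}"
    print("Basic format checks passed.")

if __name__ == "__main__":
    check_submission()
```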

Need Help?

If you have questions about the submission process, please open an issue on our GitHub repository or email us at chengxuyuangg@gmail.com.

Citation

If you find TeleEgo useful in your research, please cite our paper:

@misc{yan2025teleegobenchmarkingegocentricai,
  title={TeleEgo: Benchmarking Egocentric AI Assistants in the Wild},
  author={Jiaqi Yan and Ruilong Ren and Jingren Liu and Shuning Xu and Ling Wang and Yiheng Wang and Xinlin Zhong and Yun Wang and Long Zhang and Xiangyu Chen and Changzhi Sun and Jixiang Luo and Dell Zhang and Hao Sun and Chi Zhang and Xuelong Li},
  year={2025},
  eprint={2510.23981},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.23981},
}