Existing benchmarks evaluate AI assistants on curated, short-duration clips. TeleEgo challenges models with real-world, continuous streams spanning hours of daily activities across diverse scenarios.
Authentic egocentric recordings from real users performing daily activities across work, study, social, shopping, health, and travel scenarios.
Questions arrive dynamically throughout the video stream, mimicking real personal assistant interactions without pre-segmentation.
Combines egocentric video, ambient audio, speech transcripts, and visual narrations requiring cross-modal reasoning.
Tests a model's ability to retain and recall information across extended time periods, from seconds to hours.
Models must respond within decision windows, reflecting the temporal demands of live assistance.
Every answer requires precise temporal and modality grounding, enabling auditable evaluation.
Comprehensive Real-Time Accuracy (RTA) evaluation results on the TeleEgo benchmark. Models are evaluated under a strict streaming protocol.
GPT-4o and Gemini-2.5-Pro are evaluated via API calls, so their internal implementations are opaque. Due to the extensive video duration (~14 hours) and API latency, both models were evaluated on the same participant's video for a fair comparison.
Key Observations: GPT-4o shows strong Understanding performance (66.67%), especially on Intent Inference (81.81%) and Causal Understanding (81.58%), leveraging its general-purpose reasoning. However, its performance drops on Memory (42.18%) and Cross-Memory Reasoning (44.23%), where fine-grained temporal binding is critical.
| Method | Params | Omni | Streaming | UlM | StM | ET | TCI | LtM | All (Mem) | II | CU | CmU | MsR | All (Und) | CeR | TCU | CtC | All (CMR) | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | - | ✓ | - | 42.31 | 40.58 | 31.91 | 47.37 | 52.78 | 42.18 | 81.81 | 81.58 | 45.71 | 50.00 | 66.67 | 44.44 | 40.00 | 45.00 | 44.23 | 48.94 |
| Gemini-2.5-Pro | - | ✓ | - | 49.04 | 45.59 | 34.04 | 47.37 | 44.44 | 45.05 | 63.64 | 55.26 | 37.14 | 28.57 | 48.03 | 40.74 | 40.00 | 45.00 | 42.31 | 45.55 |
These models lack cross-call memory mechanisms and process each input unit independently. They are categorized as "pseudo-streaming" because they do not natively support continuous streaming interaction. In our evaluation, we deliberately avoid supplying any uncompressed historical context to the models; otherwise, the burden of memory would be shifted to the context window rather than truly examining their intrinsic memory capability. They are therefore evaluated under a strict streaming protocol, ensuring that performance reflects the models' genuine ability to retain information over time.
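A minimal sketch of this strict streaming protocol, assuming a hypothetical `model.ingest` / `model.answer` interface (not any real model's API): each chunk is fed exactly once, and no uncompressed history is re-attached at query time, so recall must come from the model's own internal state.

```python
# Illustrative sketch of the strict streaming protocol (hypothetical interface).
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    start: float   # chunk start time (seconds)
    end: float     # chunk end time (seconds)
    frames: list   # sampled video frames for this chunk
    audio: object  # raw audio segment (omni models) or ASR text (others)

def run_streaming_eval(model, chunks: List[Chunk], questions):
    """questions: list of (arrival_time_sec, question_text, answer_checker)."""
    pending = sorted(questions, key=lambda q: q[0])
    results = []
    for chunk in chunks:
        model.ingest(chunk)  # only the current chunk is supplied; recall relies on internal memory
        while pending and pending[0][0] <= chunk.end:
            t_q, text, check = pending.pop(0)
            prediction = model.answer(text)  # no uncompressed history re-attached
            results.append({"time": t_q, "correct": check(prediction)})
    return results
```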
Key Observations: VideoChat-Online, Qwen2.5-VL, and Qwen2.5-Omni achieve moderate Understanding performance but struggle significantly on Memory and Cross-Memory Reasoning tasks. Among the three, Qwen2.5-Omni achieves the highest overall RTA (46.96%), primarily because of its stronger base model and because it avoids ASR transcription errors by processing audio directly.
| Method | Params | Omni | Streaming | UlM | StM | ET | TCI | LtM | All (Mem) | II | CU | CmU | MsR | All (Und) | CeR | TCU | CtC | All (CMR) | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VideoChat-Online | 4B | ✗ | ✗ | 30.74 | 26.29 | 19.35 | 29.84 | 22.78 | 26.69 | 57.35 | 44.13 | 35.44 | 27.89 | 42.34 | 18.79 | 32.00 | 41.60 | 29.43 | 31.28 |
| Qwen2.5-VL | 3B | ✗ | ✗ | 45.42 | 42.01 | 31.18 | 37.90 | 33.33 | 39.66 | 66.35 | 53.99 | 47.57 | 41.50 | 53.28 | 32.89 | 48.00 | 52.00 | 42.14 | 43.67 |
| Qwen2.5-Omni | 7B | ✓ | ✗ | 44.21 | 42.75 | 35.48 | 41.13 | 37.97 | 41.20 | 72.51 | 61.50 | 50.00 | 48.98 | 59.07 | 37.58 | 56.00 | 61.60 | 49.16 | 46.96 |
MiniCPM-o is the only model in our study with an explicit streaming interface that accepts chunk-wise incremental input. However, it lacks dynamic memory management: the KV cache grows without compression, eviction, or sliding window, causing GPU memory overflow and attention dilution as video length increases.
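Reusing the hypothetical chunk/model interface from the earlier sketch, the snippet below shows one plausible reading of the session-length setting: the stream is cut into fixed-length sessions, the KV cache accumulates within a session, and (as an assumption on our part) is cleared at session boundaries. `reset_cache` is a placeholder, not MiniCPM-o's actual API.

```python
def split_into_sessions(chunks, session_len_sec):
    """Group consecutive chunks into sessions of at most session_len_sec seconds."""
    sessions, current, session_start = [], [], None
    for chunk in chunks:
        if session_start is None:
            session_start = chunk.start
        if current and chunk.end - session_start > session_len_sec:
            sessions.append(current)
            current, session_start = [], chunk.start
        current.append(chunk)
    if current:
        sessions.append(current)
    return sessions

def run_session_eval(model, chunks, questions, session_len_sec):
    """Same (arrival_time, question, checker) convention as the earlier sketch."""
    pending = sorted(questions, key=lambda q: q[0])
    results = []
    for session in split_into_sessions(chunks, session_len_sec):
        model.reset_cache()        # assumption: KV cache cleared at session boundaries
        for chunk in session:
            model.ingest(chunk)    # cache grows uncompressed within the session
            while pending and pending[0][0] <= chunk.end:
                t_q, text, check = pending.pop(0)
                results.append({"time": t_q, "correct": check(model.answer(text))})
    return results
```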
Key Observations: RTA results indicate an optimal session length of about 1 minute (54.10% overall), with poorer performance at both shorter (50.60% at 1 sec) and longer session lengths (30.15% at 10 min). At longer session lengths, the model often produces incoherent or off-topic responses, confirming that naive KV-cache accumulation without memory management is insufficient for sustained streaming.
| Method | Params | Omni | Streaming | Session Length | UlM | StM | ET | TCI | LtM | All (Mem) | II | CU | CmU | MsR | All (Und) | CeR | TCU | CtC | All (CMR) | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MiniCPM-o | 8B | ✓ | ✓ | 1 sec | 51.99 | 51.60 | 35.13 | 47.58 | 44.30 | 47.54 | 75.83 | 61.50 | 51.94 | 44.22 | 59.59 | 28.19 | 48.00 | 64.80 | 45.15 | 50.60 |
| | | | | 1 min | 50.60 | 55.53 | 37.28 | 53.63 | 51.05 | 50.11 | 80.57 | 69.48 | 56.80 | 51.02 | 65.64 | 33.56 | 36.00 | 66.40 | 47.49 | 54.10 |
| | | | | 3 min | 48.19 | 52.58 | 37.28 | 49.19 | 45.99 | 47.31 | 73.93 | 63.38 | 53.88 | 40.14 | 59.33 | 32.89 | 32.00 | 66.40 | 46.82 | 50.57 |
| | | | | 5 min | 43.00 | 44.23 | 33.69 | 47.58 | 37.55 | 41.71 | 63.51 | 56.34 | 44.17 | 43.54 | 52.64 | 23.49 | 32.00 | 56.80 | 38.13 | 44.34 |
| | | | | 10 min | 28.50 | 28.01 | 23.66 | 30.65 | 26.16 | 27.60 | 41.71 | 40.38 | 34.95 | 29.93 | 37.32 | 15.44 | 36.00 | 37.60 | 26.42 | 30.15 |
Note: Columns are grouped into three capability blocks: Memory (Mem), Understanding (Und), and Cross-Memory Reasoning (CMR), each summarized by an "All" column, with an "Overall" column aggregating across blocks; all scores are percentages. "Omni" denotes integrated audio-video-text perception; "Streaming" denotes native support for streaming interaction. A ✓ indicates the capability is supported, ✗ indicates it is not supported, and "-" indicates unknown.
RTA measures the percentage of questions answered correctly within a strict time window (5 seconds). This metric simulates real-world scenarios where personal assistants must respond promptly.
Unlike traditional offline evaluation, RTA enforces temporal constraints on every answer.
RTA is calculated as: (Questions answered correctly within the time window) / (Total questions) × 100%.
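As a concrete reference, a minimal RTA computation over per-question records might look like the following sketch; the record fields (`correct`, `latency`) are our own illustrative assumptions.

```python
# Minimal RTA computation: a question counts only if it is answered correctly
# AND the response arrives within the decision window (5 seconds here).

def real_time_accuracy(records, window=5.0):
    """records: iterable of dicts with 'correct' (bool) and 'latency' (seconds)."""
    total, hits = 0, 0
    for r in records:
        total += 1
        if r["correct"] and r["latency"] <= window:
            hits += 1
    return 100.0 * hits / total if total else 0.0

# Example: two timely correct answers, one correct-but-late, one wrong -> 50.0
print(real_time_accuracy([
    {"correct": True,  "latency": 1.2},
    {"correct": True,  "latency": 3.8},
    {"correct": True,  "latency": 7.5},   # correct but outside the window
    {"correct": False, "latency": 2.0},
]))
```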
MPT measures how long a model retains information after initially answering correctly. For a question answered correctly at time t*, the system schedules recall tests at t* + rΔ (Δ = 60 s, r = 1, 2, ..., 10).
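A small sketch of this recall-probe schedule: probes are placed at t* + rΔ, and memory persistence can then be summarized, for instance, as the time of the last consecutive correct recall (this summary statistic is our illustrative choice, not necessarily the exact MPT definition).

```python
# Sketch of MPT recall-probe scheduling: for a question answered correctly at t_star,
# recall tests are placed at t_star + r * delta for r = 1..10 (delta = 60 s).

def schedule_recall_probes(t_star, delta=60.0, rounds=10):
    return [t_star + r * delta for r in range(1, rounds + 1)]

def memory_persistence(recall_correct, delta=60.0):
    """recall_correct: booleans for probes r = 1..len(recall_correct).
    Returns how long (seconds) the answer stayed recallable after t_star
    (illustrative summary: time of the last consecutive correct recall)."""
    persistence = 0.0
    for r, ok in enumerate(recall_correct, start=1):
        if not ok:
            break
        persistence = r * delta
    return persistence

# Example: correct recalls at +60 s and +120 s, failure at +180 s -> persistence 120 s
print(schedule_recall_probes(t_star=305.0)[:3])        # [365.0, 425.0, 485.0]
print(memory_persistence([True, True, False]))         # 120.0
```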
Key characteristics:
- Recall probes are scheduled only for questions that were first answered correctly.
- Probes are placed at fixed intervals after the initial correct answer (every 60 seconds, for up to 10 rounds).
- The metric reflects how long information remains recallable after it has left the input stream.
Currently, we have not included MPT results in our evaluation tables. This is because MPT requires models to demonstrate genuine long-term memory: the ability not only to answer correctly when information first appears, but also to retain and recall that information long after it has disappeared from the input stream.
Current models are not yet well suited to MPT evaluation: the pseudo-streaming models keep no memory across calls, and even MiniCPM-o's naive KV-cache accumulation degrades over long horizons.
MPT is a forward-looking metric that aligns with TeleEgo's goal of evaluating realistic long-term egocentric assistants. We look forward to the emergence of models with true long-horizon streaming memory; few current models can run continuously for hours without resets while maintaining stable generation quality. We anticipate that future, truly streaming MLLMs with dynamic memory management will enable meaningful MPT evaluation, providing critical insights into how long information is retained beyond the moment of first use.
TeleEgo, together with our released MPT evaluation framework and code, provides a unified setting to assess both real-time correctness and long-term memory persistence, paving the way for future work on practical egocentric AI assistants.
Questions are organized into three cognitive dimensions (Memory, Understanding, and Cross-Memory Reasoning) with 12 fine-grained subtasks.
TeleEgo comprises 70+ hours of continuous egocentric video from 5 diverse participants, with 3,291 manually annotated questions covering 12 cognitive tasks.
Participants wore first-person cameras continuously for 3 days, recording authentic daily activities across work, social, lifestyle, and cultural scenarios. Average video length: 14.4 hours per participant.
Raw videos were processed to extract synchronized audio streams, speech transcripts via ASR, and visual narrations describing participant actions. All modalities are aligned to a unified timeline with millisecond precision.
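The unified timeline can be pictured as time-stamped events from each modality merged into one ordered stream. The sketch below is purely illustrative; the field names and structure are assumptions, not the released data format.

```python
# Illustrative unified-timeline structure: events from every modality carry absolute
# timestamps (in milliseconds) so they can be merged into a single ordered stream.

from dataclasses import dataclass
from typing import List

@dataclass
class TimelineEvent:
    start_ms: int        # event start on the shared timeline
    end_ms: int          # event end on the shared timeline
    modality: str        # "video", "audio", "speech_transcript", or "narration"
    payload: str         # e.g., ASR text, narration text, or a frame/clip reference

def merge_modalities(*streams: List[TimelineEvent]) -> List[TimelineEvent]:
    """Merge per-modality event lists into one stream ordered by start time."""
    merged = [event for stream in streams for event in stream]
    return sorted(merged, key=lambda e: (e.start_ms, e.end_ms))

def events_in_window(timeline, start_ms, end_ms):
    """All events overlapping a query window, regardless of modality."""
    return [e for e in timeline if e.start_ms < end_ms and e.end_ms > start_ms]
```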
Expert annotators generated 3,291 questions using GPT-4o for initial candidate generation, followed by rigorous human verification. Each question includes precise temporal grounding (evidence timestamps) and required modality information.
Multiple rounds of quality checks ensure factual accuracy, temporal correctness, question clarity, and answer uniqueness. All personal identifiable information is removed or anonymized.
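For illustration only, a single annotated question might be represented roughly as follows; the field names, IDs, and the question/answer content are hypothetical and do not reflect the released annotation schema.

```python
# Hypothetical example of one annotated question record (illustrative fields only,
# NOT the released TeleEgo schema). Note the explicit temporal grounding and the
# modalities required to answer.

example_question = {
    "question_id": "P01_Q0042",                    # hypothetical identifier
    "arrival_time_ms": 5_412_000,                  # when the question enters the stream
    "question": "Which card game did I play after lunch?",
    "answer": "UNO",
    "task": "Cross-Memory Reasoning",              # one of the three capability blocks
    "evidence_spans_ms": [[4_980_500, 5_010_250]], # timestamps grounding the answer
    "required_modalities": ["video", "speech_transcript"],
}
```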
Representative scenarios include:
- Presentations, meetings, coding, studying, reception tasks
- Card games (UNO, Mahjong, Poker), video games, pool
- Shopping, cooking, walking, fitness
- Dining out, dating, museum visits
We welcome submissions from all researchers! Submit your model's predictions to appear on the official leaderboard. Choose your preferred submission method below.
Submit via GitHub PR for the fastest processing and automatic validation.
- Create a folder under `submissions/` with the format `YYYY-MM-DD_YourModelName/`, containing:
  - `results.json`: your model predictions
  - `metadata.json`: model information
  - `README.md`: a brief description
- Title your pull request `[Submission] YourModelName`.

For quick submissions, or if you're unable to create a pull request, you can submit via GitHub Issues.
- Attach your `results.json` and `metadata.json` files.

If GitHub is not accessible, you can email your submission directly to our team.
- Attach `results.json`, `metadata.json`, and `README.md`.
- Use the subject line `[TeleEgo Submission] Your Model Name`.

`results.json` contains your prediction results for each question in the test set.
`metadata.json` contains information about your model and experimental setup.
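Purely as an illustration of the general shape of these two files, a submission might be assembled as below; every field name here is an assumption for the sketch, not the official submission schema.

```python
# Assumed illustration of the two submission files (field names are NOT the official schema).
import json

results = {                       # results.json: one prediction per test question
    "P01_Q0042": {"answer": "UNO"},
    "P01_Q0043": {"answer": "In the museum gift shop"},
}

metadata = {                      # metadata.json: model and experimental setup
    "model_name": "YourModelName",
    "params": "7B",
    "omni": True,
    "streaming": False,
    "notes": "Frames sampled at 1 fps; ASR transcripts used for audio.",
}

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```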
If you have questions about the submission process, please open a GitHub issue or email our team.
If you find TeleEgo useful in your research, please cite our paper:
@misc{yan2025teleegobenchmarkingegocentricai,
title={TeleEgo: Benchmarking Egocentric AI Assistants in the Wild},
author={Jiaqi Yan and Ruilong Ren and Jingren Liu and Shuning Xu and Ling Wang and Yiheng Wang and Xinlin Zhong and Yun Wang and Long Zhang and Xiangyu Chen and Changzhi Sun and Jixiang Luo and Dell Zhang and Hao Sun and Chi Zhang and Xuelong Li},
year={2025},
eprint={2510.23981},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.23981},
}