Paper Reading AI Learner

EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark

2025-10-07 17:59:47
Deheng Zhang, Yuqian Fu, Runyi Yang, Yang Miao, Tianwen Qian, Xu Zheng, Guolei Sun, Ajad Chhatkuli, Xuanjing Huang, Yu-Gang Jiang, Luc Van Gool, Danda Pani Paudel

Abstract

Most existing benchmarks for egocentric vision understanding focus primarily on daytime scenarios, overlooking the low-light conditions that are inevitable in real-world applications. To investigate this gap, we present EgoNight, the first comprehensive benchmark for nighttime egocentric vision, with visual question answering (VQA) as the core task. A key feature of EgoNight is the introduction of day-night aligned videos, which enhance night annotation quality using the daytime data and reveal clear performance gaps between lighting conditions. To achieve this, we collect both synthetic videos rendered by Blender and real-world recordings, ensuring that scenes and actions are visually and temporally aligned. Leveraging these paired videos, we construct EgoNight-VQA, supported by a novel day-augmented night auto-labeling engine and refined through extensive human verification. Each QA pair is double-checked by annotators for reliability. In total, EgoNight-VQA contains 3,658 QA pairs across 90 videos, spanning 12 diverse QA types and representing more than 300 hours of human work. Evaluations of state-of-the-art multimodal large language models (MLLMs) reveal substantial performance drops when transferring from day to night, underscoring the challenges of reasoning under low-light conditions. Beyond VQA, EgoNight also introduces two auxiliary tasks, day-night correspondence retrieval and nighttime egocentric depth estimation, which further probe the boundaries of existing models. We believe EgoNight-VQA provides a strong foundation for advancing application-driven egocentric vision research and for developing models that generalize across illumination domains. All data and code will be made available upon acceptance.
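The abstract quantifies the benchmark (3,658 QA pairs across day/night-aligned videos) but does not spell out the scoring protocol, so the following is only a minimal sketch of how the reported day-to-night accuracy gap could be measured. Everything here is hypothetical: the `QAPair` record, the exact-match scorer, and the `predict` callable are illustrative stand-ins, not EgoNight's actual evaluation code or data format.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    split: str  # "day" or "night"; aligned clips share the same scene/action

def exact_match(pred: str, gold: str) -> bool:
    # Placeholder scorer: normalized exact match. Real VQA benchmarks
    # often use stricter protocols or LLM-based judges instead.
    return pred.strip().lower() == gold.strip().lower()

def day_night_gap(pairs: list[QAPair], predict) -> dict[str, float]:
    """Score `predict` (any question -> answer callable, e.g. an MLLM
    wrapper) separately on day and night pairs and report the gap."""
    totals = {"day": 0, "night": 0}
    correct = {"day": 0, "night": 0}
    for qa in pairs:
        totals[qa.split] += 1
        correct[qa.split] += exact_match(predict(qa.question), qa.answer)
    acc = {s: correct[s] / max(totals[s], 1) for s in totals}
    acc["gap"] = acc["day"] - acc["night"]  # positive gap = worse at night
    return acc

# Toy usage with a trivial "model" that always answers "yes"
if __name__ == "__main__":
    data = [
        QAPair("Is the lamp on?", "yes", "day"),
        QAPair("Is the lamp on?", "no", "night"),
    ]
    print(day_night_gap(data, lambda q: "yes"))
```

A real harness would of course pass the video clip alongside the question; the point here is only that paired day/night QA lets the gap be computed on matched content rather than on unrelated scenes.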

URL

https://arxiv.org/abs/2510.06218

PDF

https://arxiv.org/pdf/2510.06218.pdf

