Abstract
Most existing benchmarks for egocentric vision understanding focus on daytime scenarios, overlooking the low-light conditions that are inevitable in real-world applications. To investigate this gap, we present EgoNight, the first comprehensive benchmark for nighttime egocentric vision, with visual question answering (VQA) as the core task. A key feature of EgoNight is the introduction of day-night aligned videos, which enhance night annotation quality using the daytime data and reveal clear performance gaps between lighting conditions. To achieve this, we collect both synthetic videos rendered in Blender and real-world recordings, ensuring that scenes and actions are visually and temporally aligned. Leveraging these paired videos, we construct EgoNight-VQA, supported by a novel day-augmented night auto-labeling engine and refined through extensive human verification, with each QA pair double-checked by annotators for reliability. In total, EgoNight-VQA contains 3,658 QA pairs across 90 videos, spanning 12 diverse QA types and representing more than 300 hours of human annotation work. Evaluations of state-of-the-art multimodal large language models (MLLMs) reveal substantial performance drops when transferring from day to night, underscoring the challenges of reasoning under low-light conditions. Beyond VQA, EgoNight also introduces two auxiliary tasks, day-night correspondence retrieval and nighttime egocentric depth estimation, which further probe the limits of existing models. We believe EgoNight-VQA provides a strong foundation for advancing application-driven egocentric vision research and for developing models that generalize across illumination domains. All data and code will be made available upon acceptance.
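The abstract reports substantial day-to-night performance drops for MLLMs on the aligned VQA pairs. As an illustration only, the sketch below shows one way such a gap could be measured once per-question results are available; the field names and result structure are assumptions, since the EgoNight data and evaluation code have not yet been released.

```python
# Minimal sketch (not the authors' released code): computing the day-to-night
# accuracy drop on day-night aligned VQA pairs. The record layout ("split",
# "is_correct") is an illustrative assumption.
from collections import defaultdict

def accuracy_by_split(results):
    """results: list of dicts with keys 'split' ('day' | 'night') and 'is_correct' (bool)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in results:
        total[r["split"]] += 1
        correct[r["split"]] += int(r["is_correct"])
    return {s: correct[s] / total[s] for s in total}

# Toy usage with hypothetical model outputs
results = [
    {"split": "day", "is_correct": True},
    {"split": "day", "is_correct": True},
    {"split": "night", "is_correct": False},
    {"split": "night", "is_correct": True},
]
acc = accuracy_by_split(results)
print(f"day: {acc['day']:.2f}, night: {acc['night']:.2f}, "
      f"drop: {acc['day'] - acc['night']:.2f}")
```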
URL
https://arxiv.org/abs/2510.06218