Training diffusion models that work directly on lidar points at the scale of outdoor scenes is challenging due to the difficulty of generating fine-grained details from white noise over a broad field of view. The latest works addressing scene completion with diffusion models tackle this problem by reformulating the original DDPM as a local diffusion process. This contrasts with the common practice of operating at the level of objects, where vanilla DDPMs are currently used. In this work, we close the gap between these two lines of work. We identify approximations in the local diffusion formulation, show that they are not required to operate at the scene level, and that a vanilla DDPM with a well-chosen starting point is enough for completion. Finally, we demonstrate that our method, LiDPM, leads to better results in scene completion on SemanticKITTI. The project page is this https URL.
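The core idea of starting a vanilla DDPM from a well-chosen intermediate point, rather than from pure white noise, can be illustrated with a minimal sketch. The noise schedule, the placeholder denoiser, the point-cloud shapes, and the choice of starting timestep below are illustrative assumptions, not the LiDPM implementation.

```python
import numpy as np

# Minimal DDPM reverse process started from an intermediate timestep t_start
# instead of pure noise at t = T (hypothetical illustration, not the LiDPM code).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(x_t, t):
    """Placeholder for the trained noise-prediction network eps_theta(x_t, t)."""
    return np.zeros_like(x_t)

def complete_scene(partial_points, t_start=500):
    # Noise the coarse/partial scene up to t_start (forward process), then denoise.
    x = (np.sqrt(alpha_bars[t_start]) * partial_points
         + np.sqrt(1.0 - alpha_bars[t_start]) * np.random.randn(*partial_points.shape))
    for t in range(t_start, 0, -1):
        eps = denoiser(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = np.random.randn(*x.shape) if t > 1 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

completed = complete_scene(np.random.randn(2048, 3))  # 2048 lidar points, (x, y, z)
```

Starting at an intermediate timestep keeps the global scene layout from the input while the remaining reverse steps fill in fine-grained detail.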
https://arxiv.org/abs/2504.17791
Autoregressive (AR) models, long dominant in language generation, are increasingly applied to image synthesis but are often considered less competitive than diffusion-based models. A primary limitation is the substantial number of image tokens required for AR models, which constrains both training and inference efficiency, as well as image resolution. To address this, we present Token-Shuffle, a novel yet simple method that reduces the number of image tokens in the Transformer. Our key insight is the dimensional redundancy of visual vocabularies in Multimodal Large Language Models (MLLMs), where low-dimensional visual codes from the visual encoder are directly mapped to high-dimensional language vocabularies. Leveraging this, we consider two key operations: token-shuffle, which merges spatially local tokens along the channel dimension to decrease the input token number, and token-unshuffle, which untangles the inferred tokens after Transformer blocks to restore the spatial arrangement for output. Jointly trained with textual prompts, our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis in a unified next-token prediction manner while maintaining efficient training and inference. For the first time, we push the boundary of AR text-to-image generation to a resolution of 2048x2048 with gratifying generation performance. On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15. Exhaustive large-scale human evaluations also demonstrate our prominent image generation ability in terms of text alignment, visual flaws, and visual appearance. We hope that Token-Shuffle can serve as a foundational design for efficient high-resolution image generation within MLLMs.
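The token-shuffle and token-unshuffle operations are, at their core, reshapes that fold an s x s spatial neighborhood of tokens into the channel dimension and back. The sketch below shows only that reshaping; the grid size, shuffle window, and channel width are assumptions, and the projection layers that the paper presumably uses around the Transformer are omitted.

```python
import torch

def token_shuffle(x, h, w, s):
    """Merge s x s spatially local tokens along the channel dimension.
    x: (B, h*w, C) -> (B, (h//s)*(w//s), C*s*s)   (illustrative sketch)."""
    B, _, C = x.shape
    x = x.view(B, h // s, s, w // s, s, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, (h // s) * (w // s), C * s * s)

def token_unshuffle(x, h, w, s):
    """Inverse of token_shuffle: restore the original spatial arrangement."""
    B, _, C = x.shape                      # C here equals original_C * s * s
    c = C // (s * s)
    x = x.view(B, h // s, w // s, s, s, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, h * w, c)

tokens = torch.randn(1, 64 * 64, 256)            # hypothetical 64x64 grid of visual tokens
merged = token_shuffle(tokens, 64, 64, s=2)      # 4x fewer tokens enter the Transformer
restored = token_unshuffle(merged, 64, 64, s=2)  # spatial layout recovered for decoding
assert torch.allclose(tokens, restored)
```

With s=2, the Transformer processes a quarter of the tokens, which is where the training and inference savings come from.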
https://arxiv.org/abs/2504.17789
Annotating camera poses on dynamic Internet videos at scale is critical for advancing fields like realistic video generation and simulation. However, collecting such a dataset is difficult, as most Internet videos are unsuitable for pose estimation. Furthermore, annotating dynamic Internet videos presents significant challenges even for state-of-the-art methods. In this paper, we introduce DynPose-100K, a large-scale dataset of dynamic Internet videos annotated with camera poses. Our collection pipeline addresses filtering with a carefully combined set of task-specific and generalist models. For pose estimation, we combine the latest techniques of point tracking, dynamic masking, and structure-from-motion to achieve improvements over state-of-the-art approaches. Our analysis and experiments demonstrate that DynPose-100K is both large-scale and diverse across several key attributes, opening up avenues for advancements in various downstream applications.
https://arxiv.org/abs/2504.17788
This paper presents the results of the fourth edition of the Monocular Depth Estimation Challenge (MDEC), which focuses on zero-shot generalization to the SYNS-Patches benchmark, a dataset featuring challenging environments in both natural and indoor settings. In this edition, we revised the evaluation protocol to use least-squares alignment with two degrees of freedom to support disparity and affine-invariant predictions. We also revised the baselines and included popular off-the-shelf methods: Depth Anything v2 and Marigold. The challenge received a total of 24 submissions that outperformed the baselines on the test set; 10 of these included a report describing their approach, with most leading methods relying on affine-invariant predictions. The challenge winners improved the 3D F-Score over the previous edition's best result, raising it from 22.58% to 23.05%.
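The two-degrees-of-freedom least-squares alignment used in the revised protocol reduces to solving for a scale and a shift in closed form before computing metrics. The following sketch shows that alignment under assumed variable names and masking; it is not the official evaluation code.

```python
import numpy as np

def align_scale_shift(pred, gt, mask):
    """Least-squares alignment with two degrees of freedom (scale s, shift t):
    minimizes || s * pred + t - gt ||^2 over valid pixels (illustrative sketch)."""
    p = pred[mask].ravel()
    g = gt[mask].ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)        # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)    # closed-form least squares
    return s * pred + t

# Hypothetical usage: align an affine-invariant disparity prediction before scoring.
pred = np.random.rand(480, 640)
gt = np.random.rand(480, 640)
mask = gt > 0                                          # valid ground-truth pixels
aligned = align_scale_shift(pred, gt, mask)
```

Because the fit absorbs any global scale and offset, methods that predict disparity or affine-invariant depth can be compared on the same footing.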
https://arxiv.org/abs/2504.17787
Bimanual manipulation is a challenging yet crucial robotic capability, demanding precise spatial localization and versatile motion trajectories, which pose significant challenges to existing approaches. Existing approaches fall into two categories: keyframe-based strategies, which predict gripper poses in keyframes and execute them via motion planners, and continuous control methods, which estimate actions sequentially at each timestep. Keyframe-based methods lack inter-frame supervision and struggle to perform consistently or execute curved motions, while continuous methods suffer from weaker spatial perception. To address these issues, this paper introduces PPI (keyPose and Pointflow Interface), an end-to-end framework that integrates the prediction of target gripper poses and object pointflow with continuous action estimation. These interfaces enable the model to effectively attend to the target manipulation area, while the overall framework guides diverse and collision-free trajectories. By combining interface predictions with continuous action estimation, PPI demonstrates superior performance in diverse bimanual manipulation tasks, providing enhanced spatial localization and satisfactory flexibility in handling movement restrictions. In extensive evaluations, PPI significantly outperforms prior methods in both simulated and real-world experiments, achieving state-of-the-art performance with a +16.1% improvement on the RLBench2 simulation benchmark and an average +27.5% gain across four challenging real-world tasks. Notably, PPI exhibits strong stability, high precision, and remarkable generalization capabilities in real-world scenarios. Project page: this https URL
https://arxiv.org/abs/2504.17784
Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio, which is critical for artificial auditory perception. However, current methods heavily rely on artificially mixed audio for training, which limits their ability to generalize to naturally mixed audio collected in real-world environments. To overcome this limitation, we propose ClearSep, an innovative framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks, thereby allowing effective sound separation in real-world scenarios. We introduce two remix-based evaluation metrics to quantitatively assess separation quality and use these metrics as thresholds to iteratively apply the data engine alongside model training, progressively optimizing separation performance. In addition, we propose a series of training strategies tailored to these separated independent tracks to make the best use of them. Extensive experiments demonstrate that ClearSep achieves state-of-the-art performance across multiple sound separation tasks, highlighting its potential for advancing sound separation in natural audio scenarios. For more examples and detailed results, please visit our demo page at this https URL.
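The abstract does not detail the two remix-based metrics, but one plausible instance is a remix-consistency score: re-mix the separated tracks and measure how closely the remix matches the original mixture, then use that score as a threshold inside the data engine. The sketch below implements that reading with SI-SNR; the metric choice, threshold, and shapes are assumptions rather than the ClearSep definitions.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR between an estimate and a reference signal, in dB."""
    est = est - est.mean()
    ref = ref - ref.mean()
    proj = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    noise = est - proj
    return 10.0 * np.log10((proj @ proj) / (noise @ noise + eps))

def remix_consistency(separated_tracks, mixture):
    """Remix-based check: sum the separated tracks back together and compare the
    remix to the original mixture; low scores flag unreliable decompositions."""
    remix = np.sum(separated_tracks, axis=0)
    return si_snr(remix, mixture)

tracks = np.random.randn(3, 16000)                    # hypothetical separated tracks, 1 s at 16 kHz
mixture = tracks.sum(axis=0) + 0.01 * np.random.randn(16000)
score = remix_consistency(tracks, mixture)
keep_for_training = score > 20.0                      # assumed data-engine threshold
```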
https://arxiv.org/abs/2504.17782
Learning-based methods, such as imitation learning (IL) and reinforcement learning (RL), can produce excellent control policies for challenging agile robot tasks, such as sports robots. However, no existing work has harmonized learning-based policies with model-based methods to reduce training complexity and ensure safety and stability for agile badminton robot control. In this paper, we introduce \ourmethod, a novel hybrid control system for agile badminton robots. Specifically, we propose a model-based strategy for chassis locomotion which provides a base for the arm policy. We introduce a physics-informed ``IL+RL'' training framework for the learning-based arm policy. In this training framework, a model-based strategy with privileged information is used to guide arm policy training during both the IL and RL phases. In addition, we train the critic model during the IL phase to alleviate the performance drop when transitioning from IL to RL. We present results on our self-engineered badminton robot, achieving a 94.5% success rate against a serving machine and a 90.7% success rate against human players. Our system can be easily generalized to other agile mobile manipulation tasks such as agile catching and table tennis. Our project website: this https URL.
https://arxiv.org/abs/2504.17771
Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its viability, its efficiency-accuracy trade-offs, and systematic scaling studies remain unexplored. To address this gap, we perform a careful comparison of training-free sparse attention methods at varying model scales, sequence lengths, and sparsity levels on a diverse collection of long-sequence tasks, including novel ones that rely on natural language while remaining controllable and easy to evaluate. Based on our experiments, we report a series of key findings: 1) an isoFLOPS analysis reveals that for very long sequences, larger and highly sparse models are preferable to smaller and dense ones. 2) The level of sparsity attainable while statistically guaranteeing accuracy preservation is higher during decoding than prefilling, and correlates with model size in the former. 3) There is no clear strategy that performs best across tasks and phases, with different units of sparsification or budget adaptivity needed for different scenarios. Even moderate sparsity levels often result in significant performance degradation on at least one task, highlighting that sparse attention is not a universal solution. 4) We introduce and validate novel scaling laws specifically tailored for sparse attention, providing evidence that our findings are likely to hold true beyond our range of experiments. Through these insights, we demonstrate that sparse attention is a key tool to enhance the capabilities of Transformer LLMs for processing longer sequences, but requires careful evaluation of trade-offs for performance-sensitive applications.
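To make the object of study concrete, here is a minimal training-free sparse attention sketch in which each query attends only to its top-k keys, one possible unit of sparsification among the several the paper compares. The shapes, the budget, and the top-k criterion are illustrative assumptions, not any specific method evaluated in the work.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, keep=256):
    """Training-free query-level top-k sparse attention: each query attends only
    to its `keep` highest-scoring keys (a sketch, not a specific published method)."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5    # (B, H, Lq, Lk)
    keep = min(keep, scores.shape[-1])
    thresh = scores.topk(keep, dim=-1).values[..., -1:]      # k-th largest score per query
    scores = scores.masked_fill(scores < thresh, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(1, 8, 1, 64)        # decoding: a single query token
k = torch.randn(1, 8, 4096, 64)     # long prefix of cached keys
v = torch.randn(1, 8, 4096, 64)
out = topk_sparse_attention(q, k, v, keep=256)   # roughly 94% of keys are skipped
```

The efficiency-accuracy question the paper studies is how large `keep` must be, at a given model size and sequence length, before such pruning stops hurting task performance.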
https://arxiv.org/abs/2504.17768
In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling the vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between open-source algorithms and these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which provides performance comparable to closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt a multimodal LLM to process the reference image and the user's editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.
https://arxiv.org/abs/2504.17761
Conversational assistants are becoming more and more popular, including in healthcare, partly because of the availability and capabilities of Large Language Models. There is a need for controlled, probing evaluations with real stakeholders which can highlight advantages and disadvantages of more traditional architectures and those based on generative AI. We present a within-group user study to compare two versions of a conversational assistant that allows heart failure patients to ask about salt content in food. One version of the system was developed in-house with a neurosymbolic architecture, and one is based on ChatGPT. The evaluation shows that the in-house system is more accurate, completes more tasks and is less verbose than the one based on ChatGPT; on the other hand, the one based on ChatGPT makes fewer speech errors and requires fewer clarifications to complete the task. Patients show no preference for one over the other.
https://arxiv.org/abs/2504.17753
Language-conditioned policies have recently gained substantial adoption in robotics as they allow users to specify tasks using natural language, making them highly versatile. While much research has focused on improving the action prediction of language-conditioned policies, reasoning about task descriptions has been largely overlooked. Ambiguous task descriptions often lead to downstream policy failures due to misinterpretation by the robotic agent. To address this challenge, we introduce AmbResVLM, a novel method that grounds language goals in the observed scene and explicitly reasons about task ambiguity. We extensively evaluate its effectiveness in both simulated and real-world domains, demonstrating superior task ambiguity detection and resolution compared to recent state-of-the-art baselines. Finally, real robot experiments show that our model improves the performance of downstream robot policies, increasing the average success rate from 69.6% to 97.1%. We make the data, code, and trained models publicly available at this https URL.
https://arxiv.org/abs/2504.17748
Human activity recognition (HAR) on smartglasses has various use cases, including health/fitness tracking and input for context-aware AI assistants. However, current approaches for egocentric activity recognition suffer from low performance or are resource-intensive. In this work, we introduce a resource (memory, compute, power, sample) efficient machine learning algorithm, EgoCHARM, for recognizing both high-level and low-level activities using a single egocentric (head-mounted) Inertial Measurement Unit (IMU). Our hierarchical algorithm employs a semi-supervised learning strategy, requiring primarily high-level activity labels for training, to learn generalizable low-level motion embeddings that can be effectively utilized for low-level activity recognition. We evaluate our method on 9 high-level and 3 low-level activities, achieving F1 scores of 0.826 and 0.855 on high-level and low-level activity recognition respectively, with just 63k high-level and 22k low-level model parameters, allowing the low-level encoder to be deployed directly on current IMU chips with compute. Lastly, we present results and insights from a sensitivity analysis and highlight the opportunities and limitations of activity recognition using egocentric IMUs.
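A hierarchical IMU model of this kind can be sketched as a small low-level encoder over short motion windows feeding a high-level sequence head, with supervision coming only from high-level activity labels. The layer sizes, window lengths, and class counts below are assumptions for illustration and will not match the reported 63k/22k parameter budgets or the EgoCHARM architecture.

```python
import torch
import torch.nn as nn

class LowLevelEncoder(nn.Module):
    """Tiny encoder over short 6-axis IMU windows producing motion embeddings.
    Architecture and sizes are illustrative assumptions, not the EgoCHARM design."""
    def __init__(self, emb_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(6, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, emb_dim),
        )
    def forward(self, x):              # x: (B, 6, window_len)
        return self.net(x)

class HighLevelHead(nn.Module):
    """Aggregates a sequence of low-level embeddings into a high-level activity."""
    def __init__(self, emb_dim=32, n_classes=9):
        super().__init__()
        self.gru = nn.GRU(emb_dim, 64, batch_first=True)
        self.fc = nn.Linear(64, n_classes)
    def forward(self, emb_seq):        # emb_seq: (B, n_windows, emb_dim)
        _, h = self.gru(emb_seq)
        return self.fc(h[-1])

# Training uses only high-level labels; the low-level encoder still learns
# reusable motion embeddings that a separate lightweight head can later classify.
enc, head = LowLevelEncoder(), HighLevelHead()
imu = torch.randn(4, 20, 6, 100)               # 4 clips, 20 windows of 100 IMU samples
emb = enc(imu.flatten(0, 1)).view(4, 20, -1)
logits = head(emb)                             # (4, 9) high-level activity logits
```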
https://arxiv.org/abs/2504.17735
All-in-One image restoration aims to address multiple image degradation problems using a single model, significantly reducing training costs and deployment complexity compared to traditional methods that design dedicated models for each degradation type. Existing approaches typically rely on degradation-specific models or coarse-grained degradation prompts to guide image restoration. However, they lack fine-grained modeling of degradation information and face limitations in balancing multi-task conflicts. To overcome these limitations, we propose DPMambaIR, a novel All-in-One image restoration framework. By integrating a Degradation-Aware Prompt State Space Model (DP-SSM) and a High-Frequency Enhancement Block (HEB), DPMambaIR enables fine-grained modeling of complex degradation information and efficient global integration, while mitigating the loss of high-frequency details caused by task competition. Specifically, the DP-SSM utilizes a pre-trained degradation extractor to capture fine-grained degradation features and dynamically incorporates them into the state space modeling process, enhancing the model's adaptability to diverse degradation types. Concurrently, the HEB supplements high-frequency information, effectively addressing the loss of critical details, such as edges and textures, in multi-task image restoration scenarios. Extensive experiments on a mixed dataset containing seven degradation types show that DPMambaIR achieves the best performance, with a PSNR of 27.69 dB and an SSIM of 0.893. These results highlight the potential and superiority of DPMambaIR as a unified solution for All-in-One image restoration.
https://arxiv.org/abs/2504.17732
Recently, photo-realistic novel view synthesis from multi-view images, such as neural radiance fields (NeRF) and 3D Gaussian Splatting (3DGS), has garnered widespread attention due to its superior performance. However, most works rely on low dynamic range (LDR) images, which limits the capturing of richer scene details. Some prior works have focused on high dynamic range (HDR) scene reconstruction, but they typically require capturing sharp multi-view images with different exposure times at fixed camera positions, which is time-consuming and challenging in practice. For more flexible data acquisition, we propose a one-stage method, \textbf{CasualHDRSplat}, to easily and robustly reconstruct the 3D HDR scene from casually captured videos with auto-exposure enabled, even in the presence of severe motion blur and varying unknown exposure times. \textbf{CasualHDRSplat} contains a unified differentiable physical imaging model which first applies a continuous-time trajectory constraint to the imaging process so that we can jointly optimize exposure time, the camera response function (CRF), camera poses, and the sharp 3D HDR scene. Extensive experiments demonstrate that our approach outperforms existing methods in terms of robustness and rendering quality. Our source code will be available at this https URL
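The differentiable physical imaging model can be pictured as: integrate HDR radiance over an exposure window along the continuous-time camera trajectory (which produces motion blur), scale by the exposure time, and pass the result through a camera response function. The sketch below follows that reading with a toy gamma CRF and a dummy renderer; the function names, sample counts, and CRF parameterization are assumptions, not the CasualHDRSplat code.

```python
import torch

def crf(irradiance, gamma):
    """Toy differentiable camera response function; a gamma curve stands in
    for the learned CRF, which is presumably more expressive."""
    return torch.clamp(irradiance, 1e-6, 1.0) ** gamma

def simulate_ldr_frame(render_hdr, t_open, exposure, gamma, n_samples=8):
    """Physical imaging sketch: average HDR renders over the (unknown, optimized)
    exposure window along a continuous-time trajectory, scale by exposure time,
    then apply the CRF. `render_hdr(t)` renders the scene from the pose at time t."""
    ts = torch.linspace(0.0, 1.0, n_samples)
    frames = [render_hdr(t_open + u * exposure) for u in ts]   # blur from camera motion
    irradiance = torch.stack(frames).mean(dim=0) * exposure     # signal scales with exposure
    return crf(irradiance, gamma)

# Hypothetical usage with a dummy renderer (a 3DGS rasterizer in practice).
render_hdr = lambda t: torch.rand(64, 64, 3) * 4.0              # HDR radiance above 1.0
exposure = torch.tensor(0.05, requires_grad=True)               # jointly optimized
gamma = torch.tensor(0.45, requires_grad=True)
ldr = simulate_ldr_frame(render_hdr, t_open=0.0, exposure=exposure, gamma=gamma)
```

Because every step is differentiable, comparing `ldr` against the captured blurry, auto-exposed frames lets exposure time, CRF, poses, and the sharp HDR scene be optimized jointly.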
https://arxiv.org/abs/2504.17728
We propose a fully unsupervised algorithm that detects from encephalography (EEG) recordings when a subject actively listens to sound, versus when the sound is ignored. This problem is known as absolute auditory attention decoding (aAAD). We propose an unsupervised discriminative CCA model for feature extraction and combine it with an unsupervised classifier called minimally informed linear discriminant analysis (MILDA) for aAAD classification. Remarkably, the proposed unsupervised algorithm performs significantly better than a state-of-the-art supervised model. A key reason is that the unsupervised algorithm can successfully adapt to the non-stationary test data at a low computational cost. This opens the door to the analysis of the auditory attention of a subject using EEG signals with a model that automatically tunes itself to the subject without requiring an arduous supervised training session beforehand.
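For intuition, the pipeline correlates EEG with a stimulus feature and classifies attention from the resulting correlation. The sketch below substitutes plain CCA from scikit-learn and a fixed threshold for the paper's unsupervised discriminative CCA and MILDA classifier, and uses synthetic data, so it should be read only as an outline of the signal flow.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Plain CCA stands in for the discriminative CCA + MILDA pipeline:
# correlate EEG with the speech envelope and threshold the canonical correlation.
rng = np.random.default_rng(0)
eeg = rng.standard_normal((2000, 64))        # 2000 time samples x 64 EEG channels (synthetic)
envelope = rng.standard_normal((2000, 1))    # audio envelope feature of the presented sound

cca = CCA(n_components=1)
cca.fit(eeg, envelope)
eeg_c, env_c = cca.transform(eeg, envelope)
corr = np.corrcoef(eeg_c[:, 0], env_c[:, 0])[0, 1]

# Hypothetical decision rule: high EEG-stimulus correlation suggests active listening.
is_attending = corr > 0.1
```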
https://arxiv.org/abs/2504.17724
Large language models (LLMs) are increasingly being adopted in educational settings. These applications expand beyond English, though current LLMs remain primarily English-centric. In this work, we ascertain whether their use in educational settings in non-English languages is warranted. We evaluated the performance of popular LLMs on four educational tasks (identifying student misconceptions, providing targeted feedback, interactive tutoring, and grading translations) in six languages (Hindi, Arabic, Farsi, Telugu, Ukrainian, Czech) in addition to English. We find that performance on these tasks somewhat corresponds to the amount of the language represented in training data, with lower-resource languages having poorer task performance. Although the models perform reasonably well in most languages, the frequent performance drop from English is significant. Thus, we recommend that practitioners first verify that the LLM works well in the target language for their educational task before deployment.
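The recommended pre-deployment check amounts to a small per-language evaluation harness. The sketch below is one hypothetical way to set that up; `query_llm`, the string-match scoring, and the accuracy threshold are all placeholder assumptions rather than anything prescribed by the paper.

```python
def query_llm(prompt: str) -> str:
    """Placeholder for whichever LLM API is being considered for deployment."""
    raise NotImplementedError

def verify_language_support(eval_items, threshold=0.8):
    """Run a small per-language evaluation (e.g. misconception identification)
    before deploying; eval_items maps language -> list of (prompt, expected)."""
    results = {}
    for lang, items in eval_items.items():
        correct = sum(expected.lower() in query_llm(prompt).lower()
                      for prompt, expected in items)
        results[lang] = correct / len(items)
    return {lang: acc for lang, acc in results.items() if acc >= threshold}

# Hypothetical usage: only deploy for languages that pass the accuracy threshold.
# eval_items = {"Hindi": [...], "Telugu": [...], "Czech": [...]}
# supported = verify_language_support(eval_items)
```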
https://arxiv.org/abs/2504.17720
StyleGAN has demonstrated the ability of GANs to synthesize highly realistic faces of imaginary people from random noise. One limitation of GAN-based image generation is the difficulty of controlling the features of the generated image, due to the strong entanglement of the low-dimensional latent space. Previous work that aimed to control StyleGAN with image or text prompts modulated sampling in the W latent space, which is more expressive than the Z latent space. However, W space still has restricted expressivity since it does not control feature synthesis directly; also, feature embedding in W space requires a pre-training process to reconstruct the style signal, limiting its application. This paper introduces the concept of "generative fields" to explain the hierarchical feature synthesis in StyleGAN, inspired by the receptive fields of convolutional neural networks (CNNs). Additionally, we propose a new image editing pipeline for StyleGAN using generative field theory and the channel-wise style latent space S, utilizing the intrinsic structural features of CNNs to achieve disentangled control of feature synthesis at synthesis time.
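Editing in the channel-wise style space S boils down to perturbing individual style channels at specific synthesis layers, so that only the feature covered by that channel's generative field changes. The sketch below uses a generic dict-of-tensors stand-in for the per-layer styles; the layout, layer names, and edit magnitude are assumptions, not the paper's pipeline or any particular StyleGAN codebase.

```python
import torch

def edit_style_channel(s_space, layer, channel, delta):
    """Channel-wise edit in a StyleGAN-like style space S: shift one style channel
    at one synthesis layer, leaving all other channels untouched (illustrative)."""
    edited = {k: v.clone() for k, v in s_space.items()}
    edited[layer][:, channel] += delta
    return edited

# Hypothetical S-space styles for a 3-layer synthesis network (batch of 1).
s_space = {f"layer{i}": torch.randn(1, 512) for i in range(3)}
edited = edit_style_channel(s_space, "layer1", channel=42, delta=3.0)
# Feeding `edited` to the synthesis network would change only the attribute
# controlled by that channel's generative field, keeping the rest of the image fixed.
```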
https://arxiv.org/abs/2504.17712
When a plasma disrupts in a tokamak, significant heat and electromagnetic loads are deposited onto the surrounding device components. These forces scale with plasma current and magnetic field strength, making disruptions one of the key challenges for future devices. Unfortunately, disruptions are not fully understood, with many different underlying causes that are difficult to anticipate. Data-driven models have shown success in predicting them, but they only provide limited interpretability. On the other hand, large-scale statistical analyses have been a great asset to understanding disruptive patterns. In this paper, we leverage data-driven methods to find an interpretable representation of the plasma state for disruption characterization. Specifically, we use a latent variable model to represent diagnostic measurements as a low-dimensional, latent representation. We build upon the Variational Autoencoder (VAE) framework, and extend it for (1) continuous projections of plasma trajectories; (2) a multimodal structure to separate operating regimes; and (3) separation with respect to disruptive regimes. Subsequently, we can identify continuous indicators for the disruption rate and the disruptivity based on statistical properties of measurement data. The proposed method is demonstrated using a dataset of approximately 1600 TCV discharges, selecting for flat-top disruptions or regular terminations. We evaluate the method with respect to (1) the identified disruption risk and its correlation with other plasma properties; (2) the ability to distinguish different types of disruptions; and (3) downstream analyses. For the latter, we conduct a demonstrative study on identifying parameters connected to disruptions using counterfactual-like analysis. Overall, the method can adequately identify distinct operating regimes characterized by varying proximity to disruptions in an interpretable manner.
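At its core, the latent variable model maps diagnostic measurements to a low-dimensional latent state with a VAE; the paper's extensions (continuous trajectory projections, a multimodal prior, and separation of disruptive regimes) are built on top of that backbone. The sketch below shows only the backbone, with illustrative layer sizes and a made-up 12-channel diagnostic vector; it omits the paper's extensions entirely.

```python
import torch
import torch.nn as nn

class PlasmaStateVAE(nn.Module):
    """Minimal VAE sketch: diagnostic measurements -> low-dimensional latent state.
    The multimodal prior and disruptivity-aware separation are omitted; the 12-D
    input and layer sizes are illustrative assumptions."""
    def __init__(self, n_diag=12, latent_dim=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_diag, 64), nn.ReLU(), nn.Linear(64, 2 * latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_diag))

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        recon = self.dec(z)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()
        return recon, kl, z

vae = PlasmaStateVAE()
x = torch.randn(256, 12)                               # a batch of diagnostic snapshots
recon, kl, z = vae(x)
loss = nn.functional.mse_loss(recon, x) + 1e-3 * kl    # ELBO-style training objective
# A discharge's trajectory through z can then be inspected for proximity to disruption.
```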
https://arxiv.org/abs/2504.17710
Large Reasoning Models (LRMs) have exhibited extraordinary prowess in tasks like mathematics and coding, leveraging their advanced reasoning capabilities. Nevertheless, as these capabilities progress, significant concerns regarding their vulnerabilities and safety have arisen, which can pose challenges to their deployment and application in real-world settings. This paper presents a comprehensive survey of LRMs, meticulously exploring and summarizing the newly emerged safety risks, attacks, and defense strategies. By organizing these elements into a detailed taxonomy, this work aims to offer a clear and structured understanding of the current safety landscape of LRMs, facilitating future research and development to enhance the security and reliability of these powerful models.
https://arxiv.org/abs/2504.17704
Daily Activity Recordings for Artificial Intelligence (DARai, pronounced "Dahr-ree") is a multimodal, hierarchically annotated dataset constructed to understand human activities in real-world settings. DARai consists of continuous scripted and unscripted recordings of 50 participants in 10 different environments, totaling over 200 hours of data from 20 sensors including multiple camera views, depth and radar sensors, wearable inertial measurement units (IMUs), electromyography (EMG), insole pressure sensors, biomonitor sensors, and a gaze tracker. To capture the complexity in human activities, DARai is annotated at three levels of hierarchy: (i) high-level activities (L1) that are independent tasks, (ii) lower-level actions (L2) that are patterns shared between activities, and (iii) fine-grained procedures (L3) that detail the exact execution steps for actions. The dataset annotations and recordings are designed so that 22.7% of L2 actions are shared between L1 activities and 14.2% of L3 procedures are shared between L2 actions. The overlap and unscripted nature of DARai allows counterfactual activities in the dataset. Experiments with various machine learning models showcase the value of DARai in uncovering important challenges in human-centered applications. Specifically, we conduct unimodal and multimodal sensor fusion experiments for recognition, temporal localization, and future action anticipation across all hierarchical annotation levels. To highlight the limitations of individual sensors, we also conduct domain-variant experiments that are enabled by DARai's multi-sensor and counterfactual activity design setup. The code, documentation, and dataset are available at the dedicated DARai website: this https URL
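To make the three-level hierarchy concrete, a single annotated time span can be thought of as carrying an L1 activity, an L2 action, and an L3 procedure simultaneously. The structure below is a hypothetical illustration; the field names and label strings are not the released DARai schema.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """Illustration of DARai's three-level annotation hierarchy for one time span.
    Field names and label strings are hypothetical, not the released schema."""
    start_s: float
    end_s: float
    l1_activity: str     # high-level activity, e.g. "making coffee"
    l2_action: str       # lower-level action shared across activities, e.g. "pouring"
    l3_procedure: str    # fine-grained execution step, e.g. "tilt kettle over cup"

segment = Segment(12.4, 17.9, "making coffee", "pouring", "tilt kettle over cup")
```

Because L2 actions like "pouring" recur under different L1 activities, the same mid-level label can appear in many distinct high-level contexts, which is what enables the counterfactual and cross-level experiments described above.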
https://arxiv.org/abs/2504.17696