Diffusion models are widely used for image editing tasks. Existing editing methods often design a representation manipulation procedure by curating an edit direction in the text embedding or score space. However, such a procedure faces a key challenge: overestimating the edit strength harms visual consistency while underestimating it fails the editing task. Notably, each source image may require a different editing strength, and it is costly to search for an appropriate strength via trial-and-error. To address this challenge, we propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for principled representation manipulation in diffusion-based image editing. At inference time, we decompose the source input in the latent (text embedding or diffusion score) space as a sparse linear combination of the representations of the collected visual concepts. This allows us to accurately estimate the presence of concepts in each image, which informs the edit. Based on the editing task (replace/add/remove), we perform a customized concept transplant process to impose the corresponding editing direction. To sufficiently model the concept space, we curate a conceptual representation dataset, CoLan-150K, which contains diverse descriptions and scenarios of visual terms and phrases for the latent dictionary. Experiments on multiple diffusion-based image editing baselines show that methods equipped with CoLan achieve state-of-the-art performance in editing effectiveness and consistency preservation.
扩散模型在图像编辑任务中被广泛应用。现有的编辑方法通常通过设计文本嵌入或评分空间中的编辑方向来实现表示操作过程,然而这样的做法面临一个重要挑战:过度估计编辑强度会损害视觉一致性,而低估它则无法完成编辑任务。值得注意的是,每个源图像可能需要不同的编辑强度,并且通过试错法搜索合适的强度代价高昂。为了解决这一挑战,我们提出了Concept Lancet(CoLan),这是一种零样本的即插即用框架,用于基于扩散模型的图像编辑中的原理性表示操作。在推理阶段,我们将源输入在潜在空间(文本嵌入或扩散评分)中分解为收集到的视觉概念表示的稀疏线性组合。这使得我们可以精确估计每个图像中存在的概念,从而指导编辑过程。根据编辑任务(替换、添加或删除),我们执行定制的概念移植流程以施加相应的编辑方向。为了充分建模概念空间,我们策划了一个包含多样化描述和场景的视觉术语和短语的数据集CoLan-150K,该数据集构成了潜在字典的一部分。在多个基于扩散模型的图像编辑基线上的实验表明,使用了CoLan的方法在编辑有效性和一致性保持方面取得了当前最佳性能。
https://arxiv.org/abs/2504.02828
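The sparse decomposition described in the CoLan abstract above can be illustrated with a small sketch. It assumes a generic lasso objective solved by ISTA (iterative soft-thresholding), which is a stand-in and not necessarily the solver CoLan uses. The dictionary `D`, the function name `sparse_decompose`, and the toy "remove" edit are all hypothetical.

```python
import numpy as np

def sparse_decompose(x, D, lam=0.1, n_iter=500):
    """Approximate x ~= D @ w with a sparse w by minimizing
    0.5 * ||x - D w||^2 + lam * ||w||_1 with ISTA (an assumed solver)."""
    # Step size 1/L, where L is the Lipschitz constant of the smooth term's gradient.
    L = np.linalg.norm(D, 2) ** 2
    w = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ w - x)
        z = w - grad / L
        # Soft-thresholding drives small coefficients to exactly zero.
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return w

# Toy dictionary: 20 hypothetical concept directions in a 64-dimensional latent space.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 20))
D /= np.linalg.norm(D, axis=0)

# Source representation built from concepts 3 and 11 only.
true_w = np.zeros(20)
true_w[3], true_w[11] = 2.0, -1.5
x = D @ true_w

w = sparse_decompose(x, D, lam=0.05)
top_two = {int(i) for i in np.argsort(-np.abs(w))[:2]}

# A toy "remove" edit: subtract concept 3 at its estimated strength.
x_edited = x - w[3] * D[:, 3]
```

The estimated coefficient `w[3]` plays the role of the per-image edit strength, which is the quantity the abstract says is otherwise found by trial and error.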
Recent advancements in large language models (LLMs) have accelerated progress toward artificial general intelligence, yet their potential to generate harmful content poses critical safety challenges. Existing alignment methods often struggle to cover diverse safety scenarios and remain vulnerable to adversarial attacks. In this work, we propose Ex-Ante Reasoning Preference Optimization (ERPO), a novel safety alignment framework that equips LLMs with explicit preemptive reasoning through Chain-of-Thought and provides clear evidence for safety judgments by embedding predefined safety rules. Specifically, our approach consists of three stages: first, equipping the model with Ex-Ante reasoning through supervised fine-tuning (SFT) using a constructed reasoning module; second, enhancing safety, usefulness, and efficiency via Direct Preference Optimization (DPO); and third, mitigating inference latency with a length-controlled iterative preference optimization strategy. Experiments on multiple open-source LLMs demonstrate that ERPO significantly enhances safety performance while maintaining response efficiency.
最近在大型语言模型(LLM)方面的进展加速了向通用人工智能迈进的步伐,但这些模型生成有害内容的潜力带来了重要的安全挑战。现有的对齐方法通常难以涵盖多样化的安全场景,并且容易受到对抗性攻击的影响。在此项工作中,我们提出了预审推理偏好优化(ERPO),这是一种新的安全性对齐框架,通过Chain-of-Thought为LLM提供明确的事先推理能力,并通过嵌入预先定义的安全规则来为安全判断提供清晰的证据。具体而言,我们的方法包括三个阶段:首先,通过监督微调(SFT)使用构建的推理模块使模型具备事先推理的能力;其次,利用直接偏好优化(DPO)增强安全性、实用性和效率;最后,采用长度控制的迭代偏好优化策略来缓解推断延迟问题。在多个开源LLM上的实验表明,ERPO显著提高了安全性能,并保持了响应效率。
https://arxiv.org/abs/2504.02725
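For readers unfamiliar with the DPO stage mentioned in the ERPO abstract above, the following is a minimal sketch of the standard DPO loss for a single preference pair. It assumes the sequence log-probabilities are already computed; the function name `dpo_loss` and the example numbers are hypothetical, and nothing here reflects ERPO's specific data construction or length control.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under a frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == softplus(-margin), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The loss is lower when the policy favors the chosen response more than the reference does.
good = dpo_loss(-1.0, -5.0, -3.0, -3.0)
bad = dpo_loss(-5.0, -1.0, -3.0, -3.0)
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)
```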
Atmospheric turbulence is a major source of image degradation in long-range imaging systems. Although numerous deep learning-based turbulence mitigation (TM) methods have been proposed, many are slow, memory-hungry, and do not generalize well. In the spatial domain, methods based on convolutional operators have a limited receptive field, so they cannot handle a large spatial dependency required by turbulence. In the temporal domain, methods relying on self-attention can, in theory, leverage the lucky effects of turbulence, but their quadratic complexity makes it difficult to scale to many frames. Traditional recurrent aggregation methods face parallelization challenges. In this paper, we present a new TM method based on two concepts: (1) A turbulence mitigation network based on the Selective State Space Model (MambaTM). MambaTM provides a global receptive field in each layer across spatial and temporal dimensions while maintaining linear computational complexity. (2) Learned Latent Phase Distortion (LPD). LPD guides the state space model. Unlike classical Zernike-based representations of phase distortion, the new LPD map uniquely captures the actual effects of turbulence, significantly improving the model's capability to estimate degradation by reducing the ill-posedness. Our proposed method exceeds current state-of-the-art networks on various synthetic and real-world TM benchmarks with significantly faster inference speed. The code is available at this http URL.
大气湍流是长距离成像系统中图像退化的主要原因。尽管已经提出了许多基于深度学习的湍流缓解(TM)方法,但很多这些方法运行缓慢、占用大量内存,并且泛化能力较差。在空间域内,基于卷积算子的方法具有有限的感受野范围,因此无法处理大气湍流所需的大空间依赖性需求。而在时间域上,基于自注意力的方法理论上可以利用幸运效应来改善图像质量,但其二次复杂度使得这种方法难以扩展到大量帧中使用。传统的递归聚合方法面临并行化挑战。 本文提出了一种新的TM方法,该方法基于以下两个概念:(1)一种基于选择性状态空间模型的湍流缓解网络(MambaTM)。MambaTM在每个层提供全局感受野,并跨越空间和时间维度的同时保持线性计算复杂度。(2)学习到的潜在相位失真(LPD)。LPD指导了状态空间模型。不同于传统的Zernike表示法,新的LPD图能够独特地捕捉湍流的实际影响,从而显著提高模型估算退化的能力并减少问题的不适定性质。 我们提出的这种方法在各种合成和现实世界的TM基准测试中超过了当前最先进的网络,并且具有明显更快的推理速度。代码可在此URL处获取。
https://arxiv.org/abs/2504.02697
Task arithmetic has emerged as a promising approach for editing models by representing task-specific knowledge as composable task vectors. However, existing methods rely on network linearization to derive task vectors, leading to computational bottlenecks during training and inference. Moreover, linearization alone does not ensure weight disentanglement, the key property that enables conflict-free composition of task vectors. To address this, we propose TaLoS, which builds sparse task vectors with minimal interference without requiring explicit linearization or sharing information across tasks. We find that pre-trained models contain a subset of parameters with consistently low gradient sensitivity across tasks, and that sparsely updating only these parameters promotes weight disentanglement during fine-tuning. Our experiments show that TaLoS improves training and inference efficiency while outperforming current methods in task addition and negation. By enabling modular parameter editing, our approach fosters practical deployment of adaptable foundation models in real-world applications.
任务算术已成为一种有前景的模型编辑方法,它将特定于任务的知识表示为可组合的任务向量。然而,现有的方法依赖于网络线性化以推导出任务向量,这在训练和推理过程中会导致计算瓶颈。此外,仅靠线性化并不能确保权重解耦,这是使任务向量无冲突组合的关键属性。为此,我们提出了TaLoS,该方法能够构建干扰最小的稀疏任务向量,既不需要显式的线性化操作,也不需要在不同任务之间共享信息。我们的研究发现预训练模型中存在一组参数,在各个任务上具有持续较低的梯度敏感性,而仅对这些参数进行稀疏更新可以在微调过程中促进权重解耦。实验表明,TaLoS不仅提高了训练和推理效率,还在任务添加和否定方面优于当前方法。通过支持模块化参数编辑,我们的方法促进了适应性强的基础模型在实际应用中的部署。
https://arxiv.org/abs/2504.02620
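The task addition and negation operations named in the abstract above can be sketched with plain arrays. This is generic task arithmetic under toy assumptions, not the TaLoS training procedure: the parameter dictionaries, the function names, and the fact that the two toy tasks touch disjoint coordinates (mimicking sparse, non-interfering updates) are all hypothetical.

```python
import numpy as np

def task_vector(pretrained, finetuned):
    """Task vector = fine-tuned weights minus pre-trained weights, per parameter."""
    return {name: finetuned[name] - pretrained[name] for name in pretrained}

def apply_task_vectors(pretrained, vectors, alpha=1.0):
    """Add (alpha > 0) or negate (alpha < 0) task vectors on top of the base weights."""
    edited = {name: value.copy() for name, value in pretrained.items()}
    for vec in vectors:
        for name in edited:
            edited[name] += alpha * vec[name]
    return edited

# Toy model with a single 4-element parameter.
theta0 = {"w": np.zeros(4)}
theta_a = {"w": np.array([1.0, 0.0, 0.0, 0.0])}  # task A updated coordinate 0 only
theta_b = {"w": np.array([0.0, 0.0, 2.0, 0.0])}  # task B updated coordinate 2 only

tv_a = task_vector(theta0, theta_a)
tv_b = task_vector(theta0, theta_b)

multi_task = apply_task_vectors(theta0, [tv_a, tv_b])        # task addition
forget_a = apply_task_vectors(theta0, [tv_a], alpha=-1.0)    # task negation
```

Because the two sparse updates do not overlap, adding both vectors reproduces each task's change exactly, which is the intuition behind wanting low interference between task vectors.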
Recent advances in text-to-image generative models have enabled numerous practical applications, including subject-driven generation, which fine-tunes pretrained models to capture subject semantics from only a few examples. While diffusion-based models produce high-quality images, their extensive denoising steps result in significant computational overhead, limiting real-world applicability. Visual autoregressive (VAR) models, which predict next-scale tokens rather than spatially adjacent ones, offer significantly faster inference suitable for practical deployment. In this paper, we propose the first VAR-based approach for subject-driven generation. However, naïvely fine-tuning VAR leads to computational overhead, language drift, and reduced diversity. To address these challenges, we introduce selective layer tuning to reduce complexity and prior distillation to mitigate language drift. Additionally, we found that the early stages have a greater influence on the generation of the subject than the latter stages, which merely synthesize local details. Based on this finding, we propose scale-wise weighted tuning, which prioritizes coarser resolutions so that the model focuses on subject-relevant information instead of local details. Extensive experiments validate that our method significantly outperforms diffusion-based baselines across various metrics and demonstrates its practical usage.
最近在文本到图像生成模型方面的进展已经推动了许多实际应用,包括以主题为导向的生成,该方法通过仅使用少量样本对预训练模型进行微调来捕捉特定主题语义。尽管基于扩散的方法能够产生高质量的图像,但由于其广泛的去噪步骤导致了显著的计算开销,限制了其实用性。视觉自回归(VAR)模型通过预测下一个尺度上的标记而不是空间相邻的标记,提供了更快速的推理性能,适合实际部署。在本文中,我们提出了首个基于VAR的方法来实现主题驱动生成。然而,直接微调VAR会导致计算开销增加、语言漂移以及多样性降低等问题。 为了解决这些问题,我们引入了选择性层微调以减少复杂度,并通过先验蒸馏来缓解语言漂移现象。此外,我们发现早期阶段比后期阶段对主体的生成影响更大,而后者仅合成局部细节。基于这一发现,我们提出按尺度加权微调的方法,优先考虑较粗糙的分辨率,使模型更关注与主题相关的信息而非局部细节。 广泛的实验验证了我们的方法在各种指标上显著优于基于扩散的基本线,并且展示了其实际应用价值。
https://arxiv.org/abs/2504.02612
The recent advancements in Deep Learning models and techniques have led to significant strides in performance across diverse tasks and modalities. However, while the overall capabilities of models show promising growth, our understanding of their internal reasoning processes remains limited, particularly concerning systematic inconsistencies or error patterns arising from logical or inferential flaws. These inconsistencies may manifest as contradictory outputs, failure to generalize across similar tasks, or erroneous conclusions in specific contexts. Even detecting and measuring such reasoning discrepancies is challenging, as they may arise from opaque internal procedures, biases and imbalances in training data, or the inherent complexity of the task. Without effective methods to detect, measure, and mitigate these errors, there is a risk of deploying models that are biased, exploitable, or logically unreliable. This thesis aims to address these issues by producing novel methods for deep learning models that reason over knowledge graphs, natural language, and images. The thesis contributes two techniques for detecting and quantifying predictive inconsistencies originating from opaque internal procedures in natural language and image processing models. To mitigate inconsistencies from biases in training data, this thesis presents a data-efficient sampling method to improve fairness and performance and a synthetic dataset generation approach for low-resource scenarios. Finally, the thesis offers two techniques to optimize the models for complex reasoning tasks. These methods enhance model performance while allowing for more faithful and interpretable exploration and exploitation during inference. Critically, this thesis provides a comprehensive framework to improve the robustness, fairness, and interpretability of deep learning models across diverse tasks and modalities.
最近在深度学习模型和技术上的进展,已经在各种任务和模态中显著提升了性能。然而,尽管整体模型能力显示出了令人鼓舞的增长趋势,我们对其内部推理过程的理解仍然有限,尤其是关于系统性不一致或逻辑或推断错误模式的了解更是如此。这些不一致性可能表现为矛盾的结果、无法在类似任务之间泛化或者在特定情境下的错误结论。检测和衡量这样的推理差异也颇具挑战,因为它们可能是由于模型内部复杂的操作程序、训练数据中的偏见与不平衡,或是任务本身的固有复杂性所导致的。如果没有有效的方法来检测、测量并缓解这些错误,那么部署的模型可能会存在偏见、可被利用或者逻辑上不可靠的风险。 本论文旨在通过为处理知识图谱、自然语言和图像的深度学习模型开发新的方法来解决这些问题。具体而言,本文贡献了两种针对自然语言与图像处理模型中源自不透明内部过程的预测不一致进行检测和量化的新技术。为了缓解训练数据偏见所导致的不一致性问题,本论文提出了一种效率高的采样方法以提升公平性和性能,并在资源匮乏的情况下提供了一套合成数据集生成方案。最后,本文提供了两种优化复杂推理任务模型的方法。这些方法提升了模型的表现力,同时在推断过程中促进了更加忠实和可解释性更强的探索与利用。 关键的是,本论文为提高深度学习模型在各种任务和模态下的鲁棒性、公平性和可解释性提供了一个全面框架。
https://arxiv.org/abs/2504.02577
The accurate delineation of agricultural field boundaries from satellite imagery is vital for land management and crop monitoring. However, current methods face challenges due to limited dataset sizes, resolution discrepancies, and diverse environmental conditions. We address this by reformulating the task as instance segmentation and introducing the Field Boundary Instance Segmentation - 22M dataset (FBIS-22M), a large-scale, multi-resolution dataset comprising 672,909 high-resolution satellite image patches (ranging from 0.25 m to 10 m) and 22,926,427 instance masks of individual fields, significantly narrowing the gap between agricultural datasets and those in other computer vision domains. We further propose Delineate Anything, an instance segmentation model trained on our new FBIS-22M dataset. Our proposed model sets a new state-of-the-art, achieving a substantial improvement of 88.5% in mAP@0.5 and 103% in mAP@0.5:0.95 over existing methods, while also demonstrating significantly faster inference and strong zero-shot generalization across diverse image resolutions and unseen geographic regions. Code, pre-trained models, and the FBIS-22M dataset are available at this https URL.
从卫星图像中准确地划分农田边界对于土地管理和作物监测至关重要。然而,当前的方法由于数据集规模有限、分辨率不一致以及环境条件多样等问题而面临挑战。我们通过将任务重新定义为实例分割,并引入Field Boundary Instance Segmentation - 22M 数据集(FBIS-22M)来解决这一问题。这是一个大规模的多分辨率数据集,包含672,909个高分辨率卫星图像块(分辨率从0.25米到10米不等)和22,926,427个单个田地实例掩码,大大缩小了农业数据集与其他计算机视觉领域数据集之间的差距。我们进一步提出了Delineate Anything模型,这是一个基于新FBIS-22M 数据集训练的实例分割模型。我们的模型达到了新的最先进水平,相比现有方法在mAP@0.5上实现了88.5%的显著提升,在mAP@0.5:0.95上实现了103%的提升,同时还表现出明显更快的推理速度,以及跨不同图像分辨率和未见过地理区域的强大零样本泛化能力。代码、预训练模型以及FBIS-22M 数据集可在此网址获得。
https://arxiv.org/abs/2504.02534
Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that proper learning methods could enable effective inference-time scalability. A key challenge of RL is to obtain accurate reward signals for LLMs in various domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e., the inference-time scalability of generalist RM, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods. For the RM approach, we adopt pointwise generative reward modeling (GRM) to enable flexibility for different input types and potential for inference-time scaling. For the learning method, we propose Self-Principled Critique Tuning (SPCT) to foster scalable reward generation behaviors in GRMs through online RL, to generate principles adaptively and critiques accurately, resulting in DeepSeek-GRM models. Furthermore, for effective inference-time scaling, we use parallel sampling to expand compute usage, and introduce a meta RM to guide the voting process for better scaling performance. Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models in various RM benchmarks without severe biases, and could achieve better performance compared to training-time scaling. DeepSeek-GRM still meets challenges in some tasks, which we believe can be addressed by future efforts in generalist reward systems. The models will be released and open-sourced.
强化学习(RL)已在大语言模型(LLM)的大规模后训练中被广泛采用。最近,通过RL激励LLM推理能力的进展表明,适当的学习方法可以实现有效的推理时可扩展性。RL的一个关键挑战是在可验证问题或人工规则之外的各种领域中为LLM获得准确的奖励信号。在本工作中,我们探讨如何使用更多推理计算来改进面向一般查询的奖励建模(RM),即通用RM的推理时可扩展性,并进一步探讨如何通过适当的学习方法提高性能-计算扩展的有效性。对于RM方法,我们采用逐点生成式奖励建模(GRM),以实现对不同输入类型的灵活性和推理时扩展的潜力。在学习方法方面,我们提出了自原则批评调优(SPCT),该方法通过在线RL来促进GRM中的可扩展奖励生成行为,使其能够自适应地生成原则并准确地给出批评,从而得到DeepSeek-GRM模型。此外,为了有效实现推理时扩展,我们使用并行采样来扩大计算使用量,并引入了一个元RM来指导投票过程以获得更好的扩展性能。在实验中,我们展示了SPCT显著提升了GRM的质量和可扩展性,在各种RM基准测试中超越现有方法和模型且无严重偏差,并能够实现优于训练时扩展的性能。DeepSeek-GRM在某些任务上仍面临挑战,我们相信未来在通用奖励系统方面的努力可以解决这些问题。这些模型将被发布并开源。
https://arxiv.org/abs/2504.02495
This paper presents SkyReels-A2, a controllable video generation framework capable of assembling arbitrary visual elements (e.g., characters, objects, backgrounds) into synthesized videos based on textual prompts while maintaining strict consistency with reference images for each element. We term this task elements-to-video (E2V), whose primary challenges lie in preserving the fidelity of each reference element, ensuring coherent composition of the scene, and achieving natural outputs. To address these, we first design a comprehensive data pipeline to construct prompt-reference-video triplets for model training. Next, we propose a novel image-text joint embedding model to inject multi-element representations into the generative process, balancing element-specific consistency with global coherence and text alignment. We also optimize the inference pipeline for both speed and output stability. Moreover, we introduce a carefully curated benchmark for systematic evaluation, i.e., A2 Bench. Experiments demonstrate that our framework can generate diverse, high-quality videos with precise element control. SkyReels-A2 is the first open-source commercial-grade model for E2V generation, performing favorably against advanced closed-source commercial models. We anticipate SkyReels-A2 will advance creative applications such as drama and virtual e-commerce, pushing the boundaries of controllable video generation.
本文介绍了SkyReels-A2,这是一个可控视频生成框架,能够基于文本提示将任意视觉元素(例如角色、物体、背景)组装成合成视频,并确保每个元素与参考图像保持严格一致。我们将此任务称为“元件到视频”(E2V),其主要挑战在于保留每个参考元素的保真度,保证场景组合的一致性以及实现自然输出。为解决这些问题,我们首先设计了一个全面的数据管道来构建用于模型训练的提示-参考-视频三元组。接下来,我们提出了一种新颖的图像文本联合嵌入模型,将多元素表示注入生成过程,平衡特定元素一致性与全局一致性和文本对齐之间的关系。同时,我们也优化了推理流程以提高速度和输出稳定性。此外,我们还引入了一个精心策划的基准测试套件A2 Bench,用于系统的评估。实验结果表明,我们的框架能够生成多样化且高质量、具有精确元件控制能力的视频。SkyReels-A2是首个开源商用级E2V生成模型,在性能上优于先进的闭源商业模型。预计SkyReels-A2将推动创意应用的发展,如戏剧制作和虚拟电子商务,并推进可控视频生成技术的边界。
https://arxiv.org/abs/2504.02436
Recent years have witnessed remarkable advances in talking head generation, owing to its potential to revolutionize the human-AI interaction from text interfaces into realistic video chats. However, research on text-driven talking heads remains underexplored, with existing methods predominantly adopting a cascaded pipeline that combines TTS systems with audio-driven talking head models. This conventional pipeline not only introduces system complexity and latency overhead but also fundamentally suffers from asynchronous audiovisual output and stylistic discrepancies between generated speech and visual expressions. To address these limitations, we introduce OmniTalker, an end-to-end unified framework that simultaneously generates synchronized speech and talking head videos from text and reference video in real-time zero-shot scenarios, while preserving both speech style and facial styles. The framework employs a dual-branch diffusion transformer architecture: the audio branch synthesizes mel-spectrograms from text, while the visual branch predicts fine-grained head poses and facial dynamics. To bridge modalities, we introduce a novel audio-visual fusion module that integrates cross-modal information to ensure temporal synchronization and stylistic coherence between audio and visual outputs. Furthermore, our in-context reference learning module effectively captures both speech and facial style characteristics from a single reference video without introducing an extra style extracting module. To the best of our knowledge, OmniTalker presents the first unified framework that jointly models speech style and facial style in a zero-shot setting, achieving real-time inference speed of 25 FPS. Extensive experiments demonstrate that our method surpasses existing approaches in generation quality, particularly excelling in style preservation and audio-video synchronization.
近年来,在说话人头部生成领域取得了显著进展,这得益于其潜在的能力将人类与AI的互动从文本界面转变为逼真的视频聊天。然而,基于文本驱动的说话人头部研究仍相对较少探索,现有方法主要采用结合TTS(语音合成)系统和音频驱动的说话人头部模型的级联管道。这种传统的管道不仅引入了系统的复杂性和延迟开销,而且从根本上遭受了异步音视频输出以及生成语音与面部表情之间风格不一致的问题。 为了解决这些问题,我们提出了OmniTalker,这是一个端到端统一框架,在实时零样本场景中同时从文本和参考视频生成同步的语音和说话人头部视频,同时保留了语音风格和面部风格。该框架采用了双分支扩散变换器架构:音频分支从文本合成mel频谱图,而视觉分支预测精细的头部姿态和面部动态。为了连接这些模式,我们引入了一个新颖的音视频融合模块,它整合跨模态信息以确保音频与视频输出之间的时间同步和风格一致性。 此外,我们的上下文参考学习模块能够仅通过单个参考视频有效地捕捉语音和面部风格特征,而无需额外的样式提取模块。据我们所知,OmniTalker是第一个在零样本设置下联合建模语音风格和面部风格的一体化框架,并且实现了每秒25帧(FPS)的实时推理速度。 广泛的实验表明,我们的方法超越了现有的生成质量,在风格保留和音视频同步方面表现尤为出色。
https://arxiv.org/abs/2504.02433
Interactive storytelling benefits from planning and exploring multiple 'what if' scenarios. Modern LLMs are useful tools for ideation and exploration, but current chat-based user interfaces restrict users to a single linear flow. To address this limitation, we propose Narrative Studio -- a novel in-browser narrative exploration environment featuring a tree-like interface that allows branching exploration from user-defined points in a story. Each branch is extended via iterative LLM inference guided by system and user-defined prompts. Additionally, we employ Monte Carlo Tree Search (MCTS) to automatically expand promising narrative paths based on user-specified criteria, enabling more diverse and robust story development. We also allow users to enhance narrative coherence by grounding the generated text in an entity graph that represents the actors and environment of the story.
互动式故事讲述可以从规划和探索多个“如果会怎样”的场景中获益。现代大型语言模型(LLM)在创意生成和探索方面非常有用,但当前基于聊天的用户界面限制了用户的单一线性流程体验。为了解决这一局限性,我们提出了Narrative Studio——一个新颖的网页端叙事探索环境,它采用了树状接口,允许从用户定义的故事节点开始进行分支式探索。每个分支通过系统和用户自定义提示引导的迭代LLM推断来扩展。 此外,我们采用蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)算法自动扩展有前景的情节路径,依据用户的特定标准,从而能够生成更多样化且稳健的故事发展。 同时,允许用户通过将生成的文字扎根于表示故事角色和环境的实体图中来增强叙事连贯性。
https://arxiv.org/abs/2504.02426
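The MCTS-based branch expansion mentioned in the Narrative Studio abstract above relies on a selection rule that trades off exploiting well-scored branches against exploring rarely visited ones. The sketch below shows the widely used UCT rule. It is an assumption that Narrative Studio uses this particular rule, and the branch dictionaries, field names, and scores are hypothetical.

```python
import math

def uct_score(value_sum, visits, parent_visits, c=1.4):
    """UCT score: average reward plus an exploration bonus for rarely visited branches."""
    if visits == 0:
        return float("inf")  # always try an unvisited branch first
    return value_sum / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_branch(children, parent_visits, c=1.4):
    """Pick the child story branch with the highest UCT score."""
    return max(
        children,
        key=lambda ch: uct_score(ch["value_sum"], ch["visits"], parent_visits, c),
    )

# Hypothetical story branches; value_sum would come from scoring continuations
# against the user-specified criteria.
branches = [
    {"name": "hero refuses the quest", "value_sum": 3.0, "visits": 5},
    {"name": "hero accepts the quest", "value_sum": 4.5, "visits": 5},
]
picked = select_branch(branches, parent_visits=10)

branches_with_new = branches + [{"name": "a stranger arrives", "value_sum": 0.0, "visits": 0}]
picked_new = select_branch(branches_with_new, parent_visits=10)
```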
The deployment of foundation models for medical imaging has demonstrated considerable success. However, their training overheads associated with downstream tasks remain substantial due to the size of the image encoders employed, and the inference complexity is also significantly high. Although lightweight variants have been obtained for these foundation models, their performance is constrained by their limited model capacity and suboptimal training strategies. In order to achieve an improved tradeoff between complexity and performance, we propose a new framework to improve the performance of low-complexity models via knowledge distillation from multiple large medical foundation models (e.g., MedSAM, RAD-DINO, MedCLIP), each specializing in different vision tasks, with the goal of effectively bridging the performance gap for medical image segmentation tasks. The agglomerated model demonstrates superior generalization across 12 segmentation tasks, whereas specialized models require explicit training for each task. Our approach achieved an average performance gain of 2% in Dice coefficient compared to simple distillation.
将基础模型应用于医学成像已经取得了相当的成功。然而,由于所使用的图像编码器的大小,这些模型在下游任务上的训练开销仍然非常高,并且推理复杂性也很高。虽然已经为这些基础模型获得了轻量级变体,但它们的表现受到模型容量有限和培训策略不理想的影响。为了实现复杂性和性能之间的更好权衡,我们提出了一种新框架,通过从多个大型医疗基础模型(例如MedSAM、RAD-DINO、MedCLIP)进行知识蒸馏来提高低复杂度模型的性能,每个模型专注于不同的视觉任务。我们的目标是有效地缩小医学图像分割任务中的表现差距。聚合后的模型在12个分割任务上表现出更好的泛化能力,而专用模型需要为每个单独的任务进行明确训练。与简单的蒸馏相比,我们方法在Dice系数上的平均性能提升了2%。
https://arxiv.org/abs/2504.02351
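The Dice coefficient used as the evaluation metric in the abstract above measures the overlap between a predicted and a reference segmentation mask. A minimal sketch for binary masks follows; the function name and the `eps` smoothing term are illustrative choices, not taken from the paper.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-8):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    # eps avoids division by zero when both masks are empty.
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

partial = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])  # one shared pixel, 2 + 1 pixels in total
perfect = dice_coefficient([1, 0, 1], [1, 0, 1])
disjoint = dice_coefficient([1, 0], [0, 1])
```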
Speech separation (SS) seeks to disentangle a multi-talker speech mixture into single-talker speech streams. Although SS can be generally achieved using offline methods, such a processing paradigm is not suitable for real-time streaming applications. Causal separation models, which rely only on past and present information, offer a promising solution for real-time streaming. However, these models typically suffer from notable performance degradation due to the absence of future context. In this paper, we introduce a novel frontend that is designed to mitigate the mismatch between training and run-time inference by implicitly incorporating future information into causal models through predictive patterns. The pretrained frontend employs a transformer decoder network with a causal convolutional encoder as the backbone and is pretrained in a self-supervised manner with two innovative pretext tasks: autoregressive hybrid prediction and contextual knowledge distillation. These tasks enable the model to capture predictive patterns directly from mixtures in a self-supervised manner. The pretrained frontend subsequently serves as a feature extractor to generate high-quality predictive patterns. Comprehensive evaluations on synthetic and real-world datasets validated the effectiveness of the proposed pretrained frontend.
语音分离(SS)旨在将多说话人的语音混合物分解为单一说话人的语音流。尽管可以通过离线方法实现一般的语音分离,但这种处理模式不适合实时流应用。因果性分离模型仅依赖于过去和当前的信息,对于实时流提供了一种有前景的解决方案。然而,这些模型通常由于缺乏未来上下文而表现出明显的性能下降。 在本文中,我们提出了一种新颖的前端设计,旨在通过预测模式将未来的相关信息隐式地融入因果模型中,以缓解训练与运行时推理之间的不匹配问题。预训练的前端采用带有因果卷积编码器作为骨干网络的变压器解码器,并通过两个创新性的前置任务进行自监督训练:自回归混合预测和上下文知识蒸馏。这些任务使模型能够直接从混合信号中捕捉到预测模式,且以自监督的方式完成。 经过预训练后的前端随后用作特征提取器,生成高质量的预测模式。在合成数据集和真实世界数据集上的全面评估验证了所提出的预训练前端的有效性。
https://arxiv.org/abs/2504.02302
Efficient understanding of long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding, studying a fundamental issue pertaining to all state-of-the-art (SOTA) long-context vision-language models (VLMs). In particular, our contributions are two-fold: First, we formulate temporal search as a Long Video Haystack problem, i.e., finding a minimal set of relevant frames (typically one to five) among tens of thousands of frames from real-world long videos given specific queries. To validate our formulation, we create LV-Haystack, the first benchmark containing 3,874 human-annotated instances with fine-grained evaluation metrics for assessing keyframe search quality and computational efficiency. Experimental results on LV-Haystack highlight a significant research gap in temporal search capabilities, with SOTA keyframe selection methods achieving only 2.1% temporal F1 score on the LVBench subset. Next, inspired by visual search in images, we re-think temporal searching and propose a lightweight keyframe searching framework, T*, which casts the expensive temporal search as a spatial search problem. T* leverages superior visual localization capabilities typically used in images and introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions. Our extensive experiments show that when integrated with existing methods, T* significantly improves SOTA long-form video understanding performance. Specifically, under an inference budget of 32 frames, T* improves GPT-4o's performance from 50.5% to 53.1% and LLaVA-OneVision-72B's performance from 56.5% to 62.4% on LongVideoBench XL subset. Our PyTorch code, benchmark dataset and models are included in the Supplementary material.
高效理解长视频仍然是计算机视觉领域的一大挑战。在这项工作中,我们重新审视了针对长视频理解的时间搜索范式,并探讨了一项所有最先进的(SOTA)长上下文视觉语言模型(VLM)面临的根本性问题。我们的贡献主要有两点:首先,我们将时间搜索定义为“长视频大搜寻”问题,即在给定特定查询的情况下,在实际长视频的数万帧中找到一组最小的相关帧集合(通常是1到5帧)。为了验证这一表述的有效性,我们创建了LV-Haystack,这是首个包含3,874个人工注释实例的基准测试集,并配有细粒度评价指标用于评估关键帧搜索质量和计算效率。在LV-Haystack上的实验结果突显了时间搜索能力方面存在的显著研究差距,SOTA的关键帧选择方法在LVBench子集中仅能达到2.1%的时间F1得分。 其次,受图像中视觉搜索的启发,我们重新思考时间搜索,并提出了一种轻量级的关键帧搜索框架T*。该框架将昂贵的时间搜索问题转化为空间搜索问题。T*利用了通常用于处理图像中的优秀视觉定位能力,并引入了一种适应性放大机制,在时间和空间维度上均能操作。我们的广泛实验表明,当与现有方法结合使用时,T*显著提升了SOTA长视频理解的性能表现。具体而言,在32帧推理预算下,T*将GPT-4o在LongVideoBench XL子集上的性能从50.5%提升到了53.1%,同时将LLaVA-OneVision-72B的性能从56.5%提高到62.4%。我们的PyTorch代码、基准数据集和模型包含在补充材料中提供。
https://arxiv.org/abs/2504.02259
Transformer models leverage self-attention mechanisms to capture complex dependencies, demonstrating exceptional performance in various applications. However, the long-duration high-load computations required for model inference impose stringent reliability demands on the computing platform, as soft errors that occur during execution can significantly degrade model performance. Existing fault tolerance methods protect each operation separately using decoupled kernels, incurring substantial computational and memory overhead. In this paper, we propose a novel error-resilient framework for Transformer models, integrating end-to-end fault tolerant attention (EFTA) to improve inference reliability against soft errors. Our approach enables error detection and correction within a fully fused attention kernel, reducing redundant data access and thereby mitigating memory faults. To further enhance error coverage and reduce overhead, we design a hybrid fault tolerance scheme tailored for the EFTA, introducing for the first time: 1) architecture-aware algorithm-based fault tolerance (ABFT) using tensor checksum, which minimizes inter-thread communication overhead on tensor cores during error detection; 2) selective neuron value restriction, which selectively applies adaptive fault tolerance constraints to neuron values, balancing error coverage and overhead; 3) unified verification, reusing checksums to streamline multiple computation steps into a single verification process. Experimental results show that EFTA achieves up to 7.56x speedup over traditional methods with an average fault tolerance overhead of 13.9%.
Transformer模型利用自注意力机制来捕捉复杂的依赖关系,在各种应用中表现出卓越的性能。然而,推理过程中所需的长时间高负载计算对计算平台提出了严格的可靠性要求,因为在执行期间发生的软错误会显著降低模型性能。现有的容错方法使用解耦内核单独保护每一步操作,导致大量的计算和内存开销。在本文中,我们为Transformer模型提出了一种新的抗错误框架,该框架集成了端到端容错注意力(EFTA),以提高推理对软错误的可靠性。我们的方法允许在一个完全融合的注意力内核中进行错误检测和校正,从而减少冗余数据访问并缓解内存故障。为了进一步增强错误覆盖率并降低开销,我们为EFTA设计了一种混合容错方案,首次引入了以下内容: 1. 体系结构感知的基于算法的容错(ABFT),使用张量校验和,在错误检测期间最小化张量核心上的线程间通信开销; 2. 选择性神经元值限制,有选择地对神经元值施加自适应容错约束,平衡错误覆盖率和开销; 3. 统一验证,重复利用校验和以将多个计算步骤简化为单一的验证过程。 实验结果表明,EFTA相较于传统方法实现了高达7.56倍的加速,平均容错开销为13.9%。
https://arxiv.org/abs/2504.02211
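The algorithm-based fault tolerance (ABFT) idea referenced in the abstract above can be illustrated with the classic checksum scheme for matrix multiplication. This sketch is not the paper's fused-kernel tensor checksum design; it is the textbook row/column checksum check on a plain NumPy matmul, with a hypothetical `fault` argument that simulates a single soft error.

```python
import numpy as np

def checked_matmul(A, B, tol=1e-8, fault=None):
    """Compute C = A @ B, then detect and correct a single corrupted element
    using row and column checksums derived from the inputs."""
    C = A @ B
    if fault is not None:  # simulate a soft error in one output element
        i, j, delta = fault
        C[i, j] += delta

    # Checksums computed from the inputs only, so they do not inherit errors in C.
    expected_col_sums = A.sum(axis=0) @ B   # equals C.sum(axis=0) when C is correct
    expected_row_sums = A @ B.sum(axis=1)   # equals C.sum(axis=1) when C is correct

    col_err = C.sum(axis=0) - expected_col_sums
    row_err = C.sum(axis=1) - expected_row_sums
    bad_cols = np.flatnonzero(np.abs(col_err) > tol)
    bad_rows = np.flatnonzero(np.abs(row_err) > tol)
    detected = bad_cols.size > 0 or bad_rows.size > 0

    # A single faulty element shows up as exactly one bad row and one bad column;
    # the row discrepancy gives the size of the error.
    if bad_rows.size == 1 and bad_cols.size == 1:
        C[bad_rows[0], bad_cols[0]] -= row_err[bad_rows[0]]
    return C, detected

A = np.arange(12, dtype=float).reshape(3, 4)
B = np.arange(8, dtype=float).reshape(4, 2)
clean, clean_flag = checked_matmul(A, B)
repaired, fault_flag = checked_matmul(A, B, fault=(1, 0, 5.0))
```

The two checksum products cost O(n^2) on top of the O(n^3) multiplication, which is why checksum-based protection is attractive compared with recomputing the whole product.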
The rapid advancements in large language models (LLMs) have significantly enhanced their reasoning capabilities, driven by various strategies such as multi-agent collaboration. However, unlike the well-established performance improvements achieved through scaling data and model size, the scaling of reasoning in LLMs is more complex and can even negatively impact reasoning performance, introducing new challenges in model alignment and robustness. In this survey, we provide a comprehensive examination of scaling in LLM reasoning, categorizing it into multiple dimensions and analyzing how and to what extent different scaling strategies contribute to improving reasoning capabilities. We begin by exploring scaling in input size, which enables LLMs to process and utilize more extensive context for improved reasoning. Next, we analyze scaling in reasoning steps that improves multi-step inference and logical consistency. We then examine scaling in reasoning rounds, where iterative interactions refine reasoning outcomes. Furthermore, we discuss scaling in training-enabled reasoning, focusing on optimization through iterative model improvement. Finally, we review applications of scaling across domains and outline future directions for further advancing LLM reasoning. By synthesizing these diverse perspectives, this survey aims to provide insights into how scaling strategies fundamentally enhance the reasoning capabilities of LLMs and further guide the development of next-generation AI systems.
大型语言模型(LLMs)在推理能力方面的快速进步,主要得益于多种策略,如多代理协作。然而,与通过扩大数据和模型规模来提升性能的成熟方法相比,LLMs中的推理扩展更为复杂,甚至可能对推理表现产生负面影响,从而在模型对齐和鲁棒性方面引入了新的挑战。在这次综述中,我们全面地考察了LLM推理中的扩展问题,将其分类为多个维度,并分析不同的扩展策略如何以及在多大程度上有助于提升推理能力。 首先,我们探讨输入规模的扩展,这使LLMs能够处理和利用更广泛的上下文信息,从而提高其推理能力。接下来,我们分析推理步骤上的扩展,这种方式可以改进多步推理和逻辑一致性。然后,我们研究推理轮次的扩展,在这种情况下,迭代互动能精炼推理结果。此外,我们讨论通过训练实现的推理扩展,关注通过迭代式模型改进进行优化的方法。 最后,我们回顾跨不同领域的扩展应用,并概述未来推动LLM推理进步的方向。通过综合这些不同的视角,本次综述旨在揭示扩展策略如何从根本上增强大型语言模型的推理能力,并进一步指导下一代AI系统的开发工作。
https://arxiv.org/abs/2504.02181
Deploying Large Language Models (LLMs) on resource-constrained edge devices like the Raspberry Pi presents challenges in computational efficiency, power consumption, and response latency. This paper explores quantization-based optimization techniques to enable high-throughput, energy-efficient execution of LLMs on low-power embedded systems. Our approach leverages k-quantization, a Post-Training Quantization (PTQ) method designed for different bit-widths, enabling efficient 2-bit, 4-bit, 6-bit, and 8-bit weight quantization. Additionally, we employ ternary quantization using Quantization-Aware Training (QAT) for BitNet models, allowing for more effective adaptation to lower-bit representations while preserving accuracy. Our findings highlight the potential of quantized LLMs for real-time conversational AI on edge devices, paving the way for low-power, high-efficiency AI deployment in mobile and embedded applications. This study demonstrates that aggressive quantization strategies can significantly reduce energy consumption while maintaining inference quality, making LLMs practical for resource-limited environments.
https://arxiv.org/abs/2504.02118
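As a rough illustration of the ternary quantization mentioned in the entry above, the sketch below maps weights to {-1, 0, +1} with a single per-tensor scale. It assumes the absmean scaling rule commonly described for BitNet-style models; the abstract does not state the exact procedure, and the function names (`ternary_quantize`, `dequantize`) and sample weights are hypothetical. It shows only the forward quantization step, not Quantization-Aware Training or k-quantization.

```python
def ternary_quantize(weights, eps=1e-8):
    """Map each weight to -1, 0, or +1 using an absmean scale (assumed rule)."""
    # Per-tensor scale: the mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights)
    # Divide by the scale, round to the nearest integer, clip to [-1, 1].
    codes = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes and the scale."""
    return [c * scale for c in codes]


# Hypothetical example weights, for illustration only.
weights = [0.9, -0.05, 0.4, -1.2, 0.0, 0.1]
codes, scale = ternary_quantize(weights)
approx = dequantize(codes, scale)
print(codes)
print(approx)
```

Small weights collapse to zero and large ones saturate at plus or minus the scale, which is why such schemes are usually paired with training that adapts the model to the reduced precision.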
Accurate traffic flow prediction is vital for optimizing urban mobility, yet it remains difficult in many cities due to complex spatio-temporal dependencies and limited high-quality data. While deep graph-based models demonstrate strong predictive power, their performance often comes at the cost of high computational overhead and substantial training data requirements, making them impractical for deployment in resource-constrained or data-scarce environments. We propose FlowDistill, a lightweight and scalable traffic prediction framework based on knowledge distillation from large language models (LLMs). In this teacher-student setup, a fine-tuned LLM guides a compact multi-layer perceptron (MLP) student model using a novel combination of the information bottleneck principle and a teacher-bounded regression loss, ensuring the distilled model retains only essential and transferable knowledge. Spatial and temporal correlations are explicitly encoded to enhance the model's generalization across diverse urban settings. Despite its simplicity, FlowDistill consistently outperforms state-of-the-art models in prediction accuracy while requiring significantly less training data and achieving lower memory usage and inference latency, highlighting its efficiency and suitability for real-world, scalable deployment.
https://arxiv.org/abs/2504.02094
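To make the idea of a teacher-bounded regression loss in the entry above more concrete, here is a minimal sketch of one formulation from the knowledge distillation literature: the student is penalized on a sample only when its squared error, plus a margin, exceeds the teacher's. This is an assumption about the general technique, not FlowDistill's actual loss, which the abstract does not specify. The information bottleneck term is omitted, and the function name, `margin` parameter, and sample values are hypothetical.

```python
def teacher_bounded_loss(student_pred, teacher_pred, target, margin=0.0):
    """Mean squared error counted only where the student trails the teacher."""
    total = 0.0
    for s, t, y in zip(student_pred, teacher_pred, target):
        student_err = (s - y) ** 2
        teacher_err = (t - y) ** 2
        # Penalize only when the student is not better than the teacher
        # by at least the margin; otherwise the sample contributes zero.
        if student_err + margin > teacher_err:
            total += student_err
    return total / len(target)


# Hypothetical traffic-flow values, for illustration only.
target = [10.0, 20.0, 30.0]
teacher = [11.0, 19.0, 30.0]
student = [10.5, 22.0, 31.0]
loss = teacher_bounded_loss(student, teacher, target)
print(loss)
```

In this example the first sample contributes nothing because the student already beats the teacher there, so the loss focuses the student's capacity on the samples where it still lags.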
The majority of modern robot learning methods focus on learning a set of pre-defined tasks with limited or no generalization to new tasks. Extending the robot skillset to novel tasks involves gathering an extensive amount of training data for additional tasks. In this paper, we address the problem of teaching new tasks to robots using human demonstration videos for repetitive tasks (e.g., packing). This task requires understanding the human video to identify which object is being manipulated (the pick object) and where it is being placed (the placement slot). In addition, it needs to re-identify the pick object and the placement slots during inference along with the relative poses to enable robot execution of the task. To tackle this, we propose SLeRP, a modular system that leverages several advanced visual foundation models and a novel slot-level placement detector Slot-Net, eliminating the need for expensive video demonstrations for training. We evaluate our system using a new benchmark of real-world videos. The evaluation results show that SLeRP outperforms several baselines and can be deployed on a real robot.
https://arxiv.org/abs/2504.01959
Recovering 3D scenes from sparse views is challenging because the problem is inherently ill-posed. Conventional methods have developed specialized solutions (e.g., geometry regularization or feed-forward deterministic models) to mitigate the issue. However, their performance still degrades when the input views overlap minimally and provide insufficient visual information. Fortunately, recent video generative models show promise in addressing this challenge, as they can generate video clips with plausible 3D structures. Powered by large pretrained video diffusion models, some pioneering research has begun to explore the potential of video generative priors to create 3D scenes from sparse views. Despite impressive improvements, these approaches are limited by slow inference and the lack of 3D constraints, leading to inefficiencies and reconstruction artifacts that do not align with real-world geometric structure. In this paper, we propose VideoScene, which distills the video diffusion model to generate 3D scenes in one step, aiming to build an efficient and effective tool to bridge the gap from video to 3D. Specifically, we design a 3D-aware leap flow distillation strategy to leap over time-consuming redundant information and train a dynamic denoising policy network to adaptively determine the optimal leap timestep during inference. Extensive experiments demonstrate that VideoScene achieves faster and superior 3D scene generation results than previous video diffusion models, highlighting its potential as an efficient tool for future video-to-3D applications.
https://arxiv.org/abs/2504.01956