Creating responsible artificial intelligence (AI) systems is an important issue in contemporary AI research and development. One of the characteristics of responsible AI systems is their explainability. In this paper, we are interested in explainable deep learning (XDL) systems. Building on the idea of creating digital twins of physical objects, we introduce the idea of creating readable twins (in the form of imprecise information flow models) for unreadable deep learning models. The complete procedure for switching from a deep learning model (DLM) to an imprecise information flow model (IIFM) is presented. The proposed approach is illustrated with a deep learning classification model for recognizing handwritten digits from the MNIST data set.
https://arxiv.org/abs/2504.13150
A robot navigating an outdoor environment with no prior knowledge of the space must rely on its local sensing to perceive its surroundings and plan. This can come in the form of a local metric map or a local policy with some fixed horizon. Beyond that lies a fog of unknown space marked with some fixed cost. A limited planning horizon can often result in myopic decisions that lead the robot off course or, worse, into very difficult terrain. Ideally, we would like the robot to have full knowledge of a space that can be orders of magnitude larger than a local cost map. In practice, this is intractable due to sparse sensing information and is often computationally expensive. In this work, we make a key observation that long-range navigation only necessitates identifying good frontier directions for planning instead of full map knowledge. To this end, we propose Long Range Navigator (LRN), which learns an intermediate affordance representation mapping high-dimensional camera images to `affordable' frontiers for planning, and then optimizes for maximum alignment with the desired goal. Notably, LRN is trained entirely on unlabeled ego-centric videos, making it easy to scale and adapt to new platforms. Through extensive off-road experiments on Spot and a Big Vehicle, we find that augmenting existing navigation stacks with LRN reduces human interventions at test time and leads to faster decision making, indicating the relevance of LRN. this https URL
https://arxiv.org/abs/2504.13149
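To make the frontier idea above concrete, here is a minimal sketch of affordance-weighted goal alignment; the scoring rule and all names are illustrative assumptions, not LRN's actual interface:

```python
import numpy as np

def select_frontier(affordances: np.ndarray, directions: np.ndarray,
                    goal_bearing: float) -> float:
    """Pick the steering direction that trades off predicted traversability
    ("affordance") against alignment with the goal bearing.

    affordances:  per-direction scores in [0, 1] from the learned model
    directions:   candidate bearings in radians, same length as affordances
    goal_bearing: desired heading in radians
    """
    alignment = np.cos(directions - goal_bearing)    # 1 when pointing at the goal
    utility = affordances * (1.0 + alignment) / 2.0  # down-weight blocked headings
    return float(directions[np.argmax(utility)])

# Example: three candidate frontiers, goal straight ahead (0 rad).
# The middle one points at the goal but is barely traversable, so the
# high-affordance frontier at -0.5 rad wins.
best = select_frontier(np.array([0.9, 0.2, 0.7]),
                       np.array([-0.5, 0.0, 0.4]), 0.0)
```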
Frontier models that generate extended reasoning traces inadvertently produce rich token sequences that can facilitate model distillation. Recognizing this vulnerability, model owners may seek sampling strategies that limit the effectiveness of distillation without compromising model performance. \emph{Antidistillation sampling} provides exactly this capability. By strategically modifying a model's next-token probability distribution, antidistillation sampling poisons reasoning traces, rendering them significantly less effective for distillation while preserving the model's practical utility. For further details, see this https URL.
https://arxiv.org/abs/2504.13146
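A toy sketch of the sampling-time poisoning described above, assuming a per-token "distillability" score is available; the paper derives its adjustment from a proxy student model, so the penalty term here is only a stand-in:

```python
import numpy as np

def antidistillation_sample(logits, distill_scores, lam=1.0, rng=None):
    """Sample a next token from a poisoned distribution.

    Hypothetical simplification: tokens that would most help a student
    model (high distill_scores) are down-weighted by lam, so sampled
    reasoning traces stay useful to readers but less useful to distill.
    """
    rng = rng or np.random.default_rng()
    adjusted = logits - lam * distill_scores   # penalize distillable tokens
    probs = np.exp(adjusted - adjusted.max())  # stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```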
Large Language Models (LLMs) have shown tremendous potential as agents, excelling at tasks that require multiple rounds of reasoning and interactions. Rejection Sampling Fine-Tuning (RFT) has emerged as an effective method for finetuning LLMs as agents: it first imitates expert-generated successful trajectories and further improves agentic skills through iterative fine-tuning on successful, self-generated trajectories. However, since the expert (e.g., GPT-4) succeeds primarily on simpler subtasks and RFT inherently favors simpler scenarios, many complex subtasks remain unsolved and persistently out-of-distribution (OOD). Upon investigating these challenging subtasks, we discovered that previously failed expert trajectories can often provide valuable guidance, e.g., plans and key actions, that can significantly improve agent exploration efficiency and acquisition of critical skills. Motivated by these observations, we propose Exploring Expert Failures (EEF), which identifies beneficial actions from failed expert trajectories and integrates them into the training dataset. Potentially harmful actions are meticulously excluded to prevent contamination of the model learning process. By leveraging the beneficial actions in expert failures, EEF successfully solves some previously unsolvable subtasks and improves agent tuning performance. Remarkably, our approach achieved a 62% win rate in WebShop, outperforming RFT (53.6%) and GPT-4 (35.6%), and to the best of our knowledge, setting a new state-of-the-art as the first method to surpass a score of 0.81 in WebShop and exceed 81 in SciWorld.
https://arxiv.org/abs/2504.13145
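A schematic sketch of the data-curation step EEF describes: mine non-harmful, beneficial actions from failed expert trajectories. The judge callables are assumptions standing in for the paper's actual criteria:

```python
def explore_expert_failures(failed_trajectories, is_beneficial, is_harmful):
    """Build extra training pairs from failed expert trajectories.

    failed_trajectories: list of [(state, action), ...] that missed the goal
    is_beneficial / is_harmful: caller-supplied judges (e.g. reward deltas
    or an LLM critic); both are assumptions, not the paper's exact rules.
    """
    dataset = []
    for traj in failed_trajectories:
        for state, action in traj:
            if is_harmful(state, action):
                continue                      # exclude to avoid contaminating training
            if is_beneficial(state, action):  # e.g. a useful plan step or key action
                dataset.append({"state": state, "action": action})
    return dataset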
We introduce $\texttt{Complex-Edit}$, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To develop this benchmark, we harness GPT-4o to automatically collect a diverse set of editing instructions at scale. Our approach follows a well-structured ``Chain-of-Edit'' pipeline: we first generate individual atomic editing tasks independently and then integrate them to form cohesive, complex instructions. Additionally, we introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessments. Our benchmark yields several notable insights: 1) Open-source models significantly underperform relative to proprietary, closed-source models, with the performance gap widening as instruction complexity increases; 2) Increased instructional complexity primarily impairs the models' ability to retain key elements from the input images and to preserve the overall aesthetic quality; 3) Decomposing a complex instruction into a sequence of atomic steps, executed in a step-by-step manner, substantially degrades performance across multiple metrics; 4) A straightforward Best-of-N selection strategy improves results for both direct editing and the step-by-step sequential approach; and 5) We observe a ``curse of synthetic data'': when synthetic data is involved in model training, the edited images from such models tend to appear increasingly synthetic as the complexity of the editing instructions rises -- a phenomenon that intriguingly also manifests in the latest GPT-4o outputs.
https://arxiv.org/abs/2504.13143
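A minimal sketch of the ``Chain-of-Edit'' composition step, with plain string joining standing in for the GPT-4o fusion the benchmark actually uses:

```python
def chain_of_edit(atomic_instructions):
    """Compose independently generated atomic edits into one complex
    instruction. The real pipeline fuses them with GPT-4o; joining with
    'then' is an illustrative stand-in."""
    steps = [s.strip().rstrip(".") for s in atomic_instructions]
    return "; then ".join(steps) + "."

complex_instruction = chain_of_edit([
    "Replace the sky with a sunset",
    "Add a red bicycle leaning on the fence",
    "Convert the whole image to watercolor style",
])
# -> "Replace the sky with a sunset; then Add a red bicycle leaning on the
#     fence; then Convert the whole image to watercolor style."
```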
Human action recognition (HAR) has achieved impressive results with deep learning models, but their decision-making process remains opaque due to their black-box nature. Ensuring interpretability is crucial, especially for real-world applications requiring transparency and accountability. Existing video XAI methods primarily rely on feature attribution or static textual concepts, both of which struggle to capture motion dynamics and temporal dependencies essential for action understanding. To address these challenges, we propose Pose Concept Bottleneck for Explainable Action Recognition (PCBEAR), a novel concept bottleneck framework that introduces human pose sequences as motion-aware, structured concepts for video action recognition. Unlike methods based on pixel-level features or static textual descriptions, PCBEAR leverages human skeleton poses, which focus solely on body movements, providing robust and interpretable explanations of motion dynamics. We define two types of pose-based concepts: static pose concepts for spatial configurations at individual frames, and dynamic pose concepts for motion patterns across multiple frames. To construct these concepts, PCBEAR applies clustering to video pose sequences, allowing for automatic discovery of meaningful concepts without manual annotation. We validate PCBEAR on KTH, Penn-Action, and HAA500, showing that it achieves high classification performance while offering interpretable, motion-driven explanations. Our method provides both strong predictive performance and human-understandable insights into the model's reasoning process, enabling test-time interventions for debugging and improving model behavior.
https://arxiv.org/abs/2504.13140
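One plausible reading of the annotation-free concept discovery step, sketched with k-means over single frames (static concepts) and short windows (dynamic concepts); cluster counts and window length are illustrative, not the paper's settings:

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_pose_concepts(pose_seqs, k_static=32, k_dynamic=32, window=8):
    """Cluster skeleton sequences into pose concepts, echoing PCBEAR's
    annotation-free discovery.

    pose_seqs: list of arrays, each (T, J*2) of flattened 2D joint coords.
    """
    frames = np.concatenate(pose_seqs)                 # static: individual frames
    static = KMeans(n_clusters=k_static, n_init=10).fit(frames)

    windows = [seq[t:t + window].reshape(-1)           # dynamic: short motion clips
               for seq in pose_seqs
               for t in range(len(seq) - window + 1)]
    dynamic = KMeans(n_clusters=k_dynamic, n_init=10).fit(np.stack(windows))
    return static, dynamic
```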
A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be naturally framed as probabilistic conditioning, but exact generation from the resulting distribution -- which can differ substantially from the LM's base distribution -- is generally intractable. In this work, we develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). Our SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computational resources in light of new information during the course of generation. By comparing to a number of alternatives and ablations on four challenging domains -- Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis -- we demonstrate that, with little overhead, our approach allows small open-source language models to outperform models over 8x larger, as well as closed-source, fine-tuned ones. In support of the probabilistic perspective, we show that these performance improvements are driven by better approximation to the posterior distribution. Our system builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language, giving users a simple, programmable way to apply SMC to a broad variety of controlled generation problems.
https://arxiv.org/abs/2504.13139
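A bare-bones SMC loop for constrained generation, in the spirit of (but far simpler than) the framework above; `step` and `potential` are assumed callables, not part of any real API:

```python
import numpy as np

def smc_generate(step, potential, n_particles=16, max_len=64, rng=None):
    """Sequential Monte Carlo sketch for constrained generation.

    step(prefix):      samples and returns one token from the base LM
    potential(prefix): score in [0, 1] for partial constraint satisfaction
    """
    rng = rng or np.random.default_rng()
    particles = [[] for _ in range(n_particles)]
    weights = np.full(n_particles, 1.0 / n_particles)
    scores = np.ones(n_particles)                        # last potential per particle
    for _ in range(max_len):
        for i, p in enumerate(particles):
            p.append(step(p))
            new_score = potential(p)
            weights[i] *= new_score / max(scores[i], 1e-12)  # incremental reweight
            scores[i] = new_score
        total = weights.sum()
        if total == 0:
            raise RuntimeError("all particles violated the constraint")
        weights /= total
        if 1.0 / np.sum(weights ** 2) < n_particles / 2:     # low effective sample size
            idx = rng.choice(n_particles, n_particles, p=weights)
            particles = [list(particles[j]) for j in idx]    # resample
            scores = scores[idx]
            weights.fill(1.0 / n_particles)
    return particles[int(np.argmax(weights))]
```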
Reward models (RMs) are essential for aligning Large Language Models (LLMs) with human preferences. However, they often struggle with capturing complex human preferences and generalizing to unseen data. To address these challenges, we introduce Energy-Based Reward Model (EBRM), a lightweight post-hoc refinement framework that enhances RM robustness and generalization. EBRM models the reward distribution explicitly, capturing uncertainty in human preferences and mitigating the impact of noisy or misaligned annotations. It achieves this through conflict-aware data filtering, label-noise-aware contrastive training, and hybrid initialization. Notably, EBRM enhances RMs without retraining, making it computationally efficient and adaptable across different models and tasks. Empirical evaluations on RM benchmarks demonstrate significant improvements in both robustness and generalization, achieving up to a 5.97% improvement in safety-critical alignment tasks compared to standard RMs. Furthermore, reinforcement learning experiments confirm that our refined rewards enhance alignment quality, effectively delaying reward hacking. These results demonstrate our approach as a scalable and effective enhancement for existing RMs and alignment pipelines. The code is available at EBRM.
https://arxiv.org/abs/2504.13134
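A heavily simplified sketch of post-hoc, energy-based reward refinement: start from the base RM's scalar score and descend a learned energy over reward values. The `energy_net` interface and the gradient-descent refinement are assumptions for illustration; training it with conflict-aware filtering and noise-aware contrastive learning is out of scope here:

```python
import torch

def refine_reward(energy_net, features, r0, steps=50, lr=0.05):
    """Refine a base reward r0 by minimizing a learned energy E(features, r).

    energy_net is assumed to take (features, r) and return a scalar tensor;
    low energy = plausible reward for this prompt/response pair.
    """
    r = torch.tensor([float(r0)], requires_grad=True)
    opt = torch.optim.Adam([r], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        e = energy_net(features, r)   # scalar energy of candidate reward
        e.backward()
        opt.step()                    # move r toward a low-energy value
    return r.item()
```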
This paper presents a review of the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating the reliance on model ensembles, redundant weights, and other computationally expensive components seen in previous IQA/VQA competitions. Track 2 introduces a new short-form UGC dataset tailored for single image super-resolution, i.e., the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, which are split into training, validation, and test sets using a ratio of 8:1:1. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. This challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, significantly contributing to the progress of short-form UGC VQA and image super-resolution. The project is publicly available at this https URL ChallengeCVPR-NTIRE2025.
https://arxiv.org/abs/2504.13131
We present a novel approach to integrating scientific knowledge into generative models, enhancing their realism and consistency in image synthesis. First, we introduce Science-T2I, an expert-annotated adversarial dataset comprising 20k adversarial image pairs with 9k prompts, covering a wide range of distinct scientific knowledge categories. Leveraging Science-T2I, we present SciScore, an end-to-end reward model that refines the assessment of generated images based on scientific knowledge, which is achieved by augmenting both the scientific comprehension and visual capabilities of a pre-trained CLIP model. Additionally, based on SciScore, we propose a two-stage training framework, comprising a supervised fine-tuning phase and a masked online fine-tuning phase, to incorporate scientific knowledge into existing generative models. Through comprehensive experiments, we demonstrate the effectiveness of our framework in establishing new standards for evaluating the scientific realism of generated content. Specifically, SciScore attains performance comparable to human level, demonstrating a 5% improvement, similar to evaluations conducted by experienced human evaluators. Furthermore, by applying our proposed fine-tuning method to FLUX, we achieve a performance enhancement exceeding 50% on SciScore.
https://arxiv.org/abs/2504.13129
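The scoring shape of a CLIP-based reward model can be sketched with a vanilla CLIP checkpoint; the real SciScore additionally fine-tunes for scientific comprehension, which this stand-in omits:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Downloads public CLIP weights on first run; used here only to show the
# image-vs-prompt scoring shape, not SciScore's trained behavior.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score(prompt, images):
    """Return one scalar per candidate image: higher = better prompt match."""
    inputs = proc(text=[prompt], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image.squeeze(1)   # shape: (n_images,)
```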
We introduce FreshStack, a reusable framework for automatically building information retrieval (IR) evaluation benchmarks from community-asked questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality. In addition, we identify cases where rerankers do not clearly improve first-stage retrieval accuracy (two out of five topics). We hope that FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks. FreshStack datasets are available at: this https URL.
https://arxiv.org/abs/2504.13128
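For the retrieval-fusion component, a standard reciprocal rank fusion (RRF) over several retrievers' rankings is the kind of hybrid involved; RRF with k=60 is a common default, not necessarily FreshStack's exact recipe:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several retrievers' ranked doc-id lists (e.g. BM25 + dense)
    into one ranking: each doc scores 1 / (k + rank) per list it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([["d3", "d1", "d2"], ["d1", "d4", "d3"]])
# -> ['d1', 'd3', 'd4', 'd2']  (d1 and d3 appear in both lists)
```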
Many soft robots struggle to produce dynamic motions with fast, large displacements. We develop a parallel 6 degree-of-freedom (DoF) Stewart-Gough mechanism using Handed Shearing Auxetic (HSA) actuators. By using soft actuators, we are able to use one third as many mechatronic components as a rigid Stewart platform, while retaining a working payload of 2 kg and an open-loop bandwidth greater than 16 Hz. We show that the platform is capable of both precise tracing and dynamic disturbance rejection when controlling a ball and a sliding puck using a Proportional Integral Derivative (PID) controller. We develop a machine-learning-based kinematics model and demonstrate a functional workspace of roughly 10 cm in each translation direction and 28 degrees in each orientation. This 6-DoF device has many of the characteristics associated with rigid components - power, speed, and total workspace - while capturing the advantages of soft mechanisms.
https://arxiv.org/abs/2504.13127
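The control layer is a textbook PID loop; a minimal version follows (the gains are placeholders, not the paper's tuning):

```python
class PID:
    """Discrete PID controller of the kind used to balance the ball/puck."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

# e.g. a tilt command from the ball's position error at a 100 Hz loop rate:
pid = PID(kp=2.0, ki=0.1, kd=0.4)
tilt = pid.update(error=0.03, dt=0.01)
```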
This paper investigates the application of large language models (LLMs) to financial tasks. We fine-tuned foundation models using the Open FinLLM Leaderboard as a benchmark. Building on Qwen2.5 and Deepseek-R1, we employed techniques including supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning (RL) to enhance their financial capabilities. The fine-tuned models demonstrated substantial performance gains across a wide range of financial tasks. Moreover, we measured the data scaling law in the financial domain. Our work demonstrates the potential of large language models (LLMs) in financial applications.
https://arxiv.org/abs/2504.13125
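Of the three techniques mentioned, DPO has the most compact form; a standard implementation of its loss (not the paper's code) looks like this:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective over a batch of preference pairs.

    Inputs are summed token log-probs of the chosen/rejected answers under
    the policy (logp_*) and the frozen reference model (ref_*).
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()
```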
In recent years, the field of vision-language model pre-training has experienced rapid advancements, driven primarily by the continuous enhancement of textual capabilities in large language models. However, existing training paradigms for multimodal large language models heavily rely on high-quality image-text pairs. As model and data scales grow exponentially, the availability of such meticulously curated data has become increasingly scarce and saturated, thereby severely limiting further advancements in this domain. This study investigates scalable caption generation techniques for vision-language model pre-training and demonstrates that large-scale low-hallucination synthetic captions can serve dual purposes: 1) acting as a viable alternative to real-world data for pre-training paradigms and 2) achieving superior performance enhancement when integrated into vision-language models through empirical validation. This paper presents three key contributions: 1) a novel pipeline for generating high-quality, low-hallucination, and knowledge-rich synthetic captions. Our continuous DPO methodology yields remarkable results in reducing hallucinations. Specifically, the non-hallucination caption rate on a held-out test set increases from 48.2% to 77.9% for a 7B-size model. 2) Comprehensive empirical validation reveals that our synthetic captions confer superior pre-training advantages over their counterparts. Across 35 vision-language tasks, the model trained with our data achieves a significant performance gain of at least 6.2% compared to alt-text pairs and other previous work. Meanwhile, it also offers considerable support in the text-to-image domain. With our dataset, the FID score is reduced by 17.1 on a real-world validation benchmark and 13.3 on the MSCOCO validation benchmark. 3) We will release Hunyuan-Recap100M, a low-hallucination and knowledge-intensive synthetic caption dataset.
https://arxiv.org/abs/2504.13123
Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on benchmarks such as Video Hallucination, Video QA, and Captioning performance tasks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination. The code and data are available at this https URL.
https://arxiv.org/abs/2504.13122
The ability to combine existing concepts into novel ideas stands as a fundamental hallmark of human intelligence. Recent advances in Vision-Language Models (VLMs) like GPT-4V and DALLE-3 have sparked debate about whether their outputs reflect combinational creativity--defined by M. A. Boden (1998) as synthesizing novel ideas through combining existing concepts--or sophisticated pattern matching of training data. Drawing inspiration from cognitive science, we investigate the combinational creativity of VLMs through the lens of concept blending. We propose the Identification-Explanation-Implication (IEI) framework, which decomposes creative processes into three levels: identifying input spaces, extracting shared attributes, and deriving novel semantic implications. To validate this framework, we curate CreativeMashup, a high-quality dataset of 666 artist-generated visual mashups annotated according to the IEI framework. Through extensive experiments, we demonstrate that in comprehension tasks, the best VLMs have surpassed average human performance while falling short of expert-level understanding; in generation tasks, incorporating our IEI framework into the generation pipeline significantly enhances the creative quality of VLM outputs. Our findings establish both a theoretical foundation for evaluating artificial creativity and practical guidelines for improving creative generation in VLMs.
https://arxiv.org/abs/2504.13120
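The three IEI levels map naturally onto chained prompts; a sketch assuming a generic text-in/text-out `ask` callable rather than any specific model API:

```python
def iei_analyze(ask, artwork_description):
    """Run the three IEI levels as chained prompts against any chat model.

    ask: assumed callable taking a prompt string and returning a string.
    """
    spaces = ask(f"Identify the two input concepts blended in: {artwork_description}")
    shared = ask(f"Given the concepts {spaces}, extract the attributes they share.")
    implication = ask(
        f"Given the concepts {spaces} with shared attributes {shared}, "
        f"state the novel semantic implication of blending them."
    )
    return {"identification": spaces, "explanation": shared,
            "implication": implication}
```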
Deep learning-based trajectory prediction models have demonstrated promising capabilities in capturing complex interactions. However, their out-of-distribution generalization remains a significant challenge, particularly due to imbalanced data and a lack of sufficient data volume and diversity to ensure robustness and calibration. To address this, we propose SHIFT (Spectral Heteroscedastic Informed Forecasting for Trajectories), a novel framework that uniquely combines well-calibrated uncertainty modeling with informative priors derived through automated rule extraction. SHIFT reformulates trajectory prediction as a classification task and employs heteroscedastic spectral-normalized Gaussian processes to effectively disentangle epistemic and aleatoric uncertainties. We learn informative priors from training labels, which are automatically generated from natural language driving rules, such as stop rules and drivability constraints, using a retrieval-augmented generation framework powered by a large language model. Extensive evaluations on the nuScenes dataset, including challenging low-data and cross-location scenarios, demonstrate that SHIFT outperforms state-of-the-art methods, achieving substantial gains in uncertainty calibration and displacement metrics. In particular, our model excels in complex scenarios, such as intersections, where uncertainty is inherently higher. Project page: this https URL.
https://arxiv.org/abs/2504.13111
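A simplified sketch of a spectral-normalized, heteroscedastic output head; SHIFT proper uses a Gaussian-process last layer, so the linear head and logit sampling below are stand-ins:

```python
import torch
import torch.nn as nn

class SpectralHeteroscedasticHead(nn.Module):
    """Spectral normalization keeps the feature map distance-aware (useful
    for epistemic uncertainty); a second branch predicts per-class log
    variance so the aleatoric part can be modeled separately."""

    def __init__(self, d_in, n_classes):
        super().__init__()
        self.logits = nn.utils.spectral_norm(nn.Linear(d_in, n_classes))
        self.log_var = nn.Linear(d_in, n_classes)

    def forward(self, x):
        mu, log_var = self.logits(x), self.log_var(x)
        noise = torch.randn_like(mu) * torch.exp(0.5 * log_var)
        return mu + noise, log_var     # sampled logits + aleatoric term

head = SpectralHeteroscedasticHead(d_in=256, n_classes=64)
logits, log_var = head(torch.randn(4, 256))
```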
Flow matching models have emerged as a strong alternative to diffusion models, but existing inversion and editing methods designed for diffusion are often ineffective or inapplicable to them. The straight-line, non-crossing trajectories of flow models pose challenges for diffusion-based approaches but also open avenues for novel solutions. In this paper, we introduce a predictor-corrector-based framework for inversion and editing in flow models. First, we propose Uni-Inv, an effective inversion method designed for accurate reconstruction. Building on this, we extend the concept of delayed injection to flow models and introduce Uni-Edit, a region-aware, robust image editing approach. Our methodology is tuning-free, model-agnostic, efficient, and effective, enabling diverse edits while ensuring strong preservation of edit-irrelevant regions. Extensive experiments across various generative models demonstrate the superiority and generalizability of Uni-Inv and Uni-Edit, even under low-cost settings. Project page: this https URL
https://arxiv.org/abs/2504.13109
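A generic predictor-corrector inversion loop for a rectified-flow model conveys the mechanics; this is loosely in the spirit of Uni-Inv, not its exact update:

```python
import torch

@torch.no_grad()
def invert_flow(velocity, x1, n_steps=50, n_corrector=1):
    """Invert a flow model by integrating dx/dt = v(x, t) backwards from the
    image x1 (t=1) toward noise (t=0).

    velocity: the pretrained flow network, assumed callable as velocity(x, t).
    """
    x, dt = x1, 1.0 / n_steps
    for i in reversed(range(n_steps)):
        t = (i + 1) * dt
        v = velocity(x, t)
        x_prev = x - dt * v                    # predictor: backward Euler step
        for _ in range(n_corrector):           # corrector: average with v at x_prev
            v_prev = velocity(x_prev, t - dt)
            x_prev = x - dt * 0.5 * (v + v_prev)
        x = x_prev
    return x                                   # latent whose forward ODE ~ x1
```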
Underwater acoustic target recognition (UATR) is of great significance for the protection of marine diversity and national defense security. The development of deep learning provides new opportunities for UATR, but faces challenges brought by the scarcity of reference samples and complex environmental interference. To address these issues, we propose a multi-task balanced channel attention convolutional neural network (MT-BCA-CNN). The method integrates a channel attention mechanism with a multi-task learning strategy, constructing a shared feature extractor and multi-task classifiers to jointly optimize target classification and feature reconstruction tasks. The channel attention mechanism dynamically enhances discriminative acoustic features such as harmonic structures while suppressing noise. Experiments on the Watkins Marine Life Dataset demonstrate that MT-BCA-CNN achieves 97% classification accuracy and a 95% F1-score in 27-class few-shot scenarios, significantly outperforming traditional CNN and ACNN models, as well as popular state-of-the-art UATR methods. Ablation studies confirm the synergistic benefits of multi-task learning and attention mechanisms, while a dynamic weighting adjustment strategy effectively balances task contributions. This work provides an efficient solution for few-shot underwater acoustic recognition, advancing research in marine bioacoustics and sonar signal processing.
https://arxiv.org/abs/2504.13102
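The channel attention block follows the squeeze-and-excitation pattern; a minimal version (the reduction ratio is a typical choice, not the paper's value):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention: reweight feature
    channels so informative ones (e.g. harmonic structure in a spectrogram)
    are boosted and noisy ones are suppressed."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, H, W) feature map
        w = self.fc(x.mean(dim=(2, 3)))         # squeeze: global average pool
        return x * w[:, :, None, None]          # excite: per-channel gating

att = ChannelAttention(64)
out = att(torch.randn(2, 64, 32, 32))
```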
Self-Supervised Learning (SSL) powers many current AI systems. As research interest and investment grow, the SSL design space continues to expand. The Platonic view of SSL, following the Platonic Representation Hypothesis (PRH), suggests that despite different methods and engineering approaches, all representations converge to the same Platonic ideal. However, this phenomenon lacks precise theoretical explanation. By synthesizing evidence from Identifiability Theory (IT), we show that the PRH can emerge in SSL. However, current IT cannot explain SSL's empirical success. To bridge the gap between theory and practice, we propose expanding IT into what we term Singular Identifiability Theory (SITh), a broader theoretical framework encompassing the entire SSL pipeline. SITh would allow deeper insights into the implicit data assumptions in SSL and advance the field towards learning more interpretable and generalizable representations. We highlight three critical directions for future research: 1) training dynamics and convergence properties of SSL; 2) the impact of finite samples, batch size, and data diversity; and 3) the role of inductive biases in architecture, augmentations, initialization schemes, and optimizers.
https://arxiv.org/abs/2504.13101