Realistic and controllable garment visualization is critical for fashion e-commerce, where users expect personalized previews under diverse poses and lighting conditions. Existing methods often rely on predefined poses, limiting semantic flexibility and illumination adaptability. To address this, we introduce FashionPose, the first unified text-to-pose-to-relighting generation framework. Given a natural language description, our method first predicts a 2D human pose, then employs a diffusion model to generate high-fidelity person images, and finally applies a lightweight relighting module, all guided by the same textual input. By replacing explicit pose annotations with text-driven conditioning, FashionPose enables accurate pose alignment, faithful garment rendering, and flexible lighting control. Experiments demonstrate fine-grained pose synthesis and efficient, consistent relighting, providing a practical solution for personalized virtual fashion display.
https://arxiv.org/abs/2507.13311
The evaluation of large language models is a complex task for which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs have to answer multiple-choice questions on different topics. However, this method has certain limitations, the most concerning being its poor correlation with human judgments. An alternative approach is to have humans evaluate the LLMs. This poses scalability issues, as there is a large and growing number of models to evaluate, making it impractical (and costly) to run traditional studies based on recruiting a number of evaluators and having them rank the responses of the models. An alternative approach is the use of public arenas, such as the popular LM arena, in which any user can freely evaluate models on any question and rank the responses of two models. The results are then aggregated into a model ranking. An increasingly important aspect of LLMs is their energy consumption and, therefore, evaluating how energy awareness influences the decisions of humans in selecting a model is of interest. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the energy consumption of the model in the evaluation process. Preliminary results obtained with GEA are also presented, showing that for most questions, when users are aware of the energy consumption, they favor smaller and more energy-efficient models. This suggests that, for most user interactions, the extra cost and energy incurred by the more complex and top-performing models do not provide an increase in the perceived quality of the responses that justifies their use.
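The abstract does not say how the pairwise votes are turned into a ranking; public arenas commonly use an Elo-style rating, and the sketch below illustrates that aggregation step under that assumption (the K-factor, starting rating, and vote log are placeholders).

```python
# Hedged sketch: Elo-style aggregation of pairwise arena votes into a model
# ranking. The update rule, K-factor, and starting rating are illustrative
# assumptions; the abstract does not specify how GEA aggregates votes.
from collections import defaultdict

K = 32  # illustrative K-factor

def expected_score(r_a, r_b):
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_ratings(ratings, model_a, model_b, outcome):
    """outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (outcome - e_a)
    ratings[model_b] += K * ((1.0 - outcome) - (1.0 - e_a))

ratings = defaultdict(lambda: 1000.0)
votes = [("small-efficient-model", "large-model", 1.0),
         ("large-model", "small-efficient-model", 0.5)]  # toy vote log
for a, b, outcome in votes:
    update_ratings(ratings, a, b, outcome)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```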
https://arxiv.org/abs/2507.13302
We introduce AbGen, the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. AbGen consists of 1,500 expert-annotated examples derived from 807 NLP papers. In this benchmark, LLMs are tasked with generating detailed ablation study designs for a specified module or process based on the given research context. Our evaluation of leading LLMs, such as DeepSeek-R1-0528 and o4-mini, highlights a significant performance gap between these models and human experts in terms of the importance, faithfulness, and soundness of the ablation study designs. Moreover, we demonstrate that current automated evaluation methods are not reliable for our task, as they show a significant discrepancy when compared to human assessment. To better investigate this, we develop AbGen-Eval, a meta-evaluation benchmark designed to assess the reliability of commonly used automated evaluation systems in measuring LLM performance on our task. We investigate various LLM-as-Judge systems on AbGen-Eval, providing insights for future research on developing more effective and reliable LLM-based evaluation systems for complex scientific tasks.
https://arxiv.org/abs/2507.13300
Accurate age verification can protect underage users from unauthorized access to online platforms and e-commerce sites that provide age-restricted services. However, accurate age estimation can be confounded by several factors, including facial makeup, which can alter perceived identity and age enough to fool both humans and machines. In this work, we propose DiffClean, which erases makeup traces using a text-guided diffusion model to defend against makeup attacks. DiffClean improves age estimation (minor vs. adult accuracy by 4.8%) and face verification (TMR by 8.9% at FMR=0.01%) over competing baselines on digitally simulated and real makeup images.
https://arxiv.org/abs/2507.13292
In the past few years, LLMs have emerged as a tool that can aid programmers by taking natural language descriptions and generating code from them. However, LLMs often generate incorrect code that users need to fix, and the literature suggests users often struggle to detect these errors. In this work we seek to offer formal guarantees of correctness for LLM-generated code; such guarantees could improve the experience of using AI code assistants and potentially enable natural language programming for users with little or no programming knowledge. To address this challenge we propose to incorporate a formal query language that can represent a user's intent in a formally defined but natural-language-like manner that the user can confirm matches their intent. Then, using such a query, we propose to verify LLM-generated code to ensure it matches the user's intent. We implement these ideas in our system, Astrogator, for the Ansible programming language; it includes such a formal query language, a calculus for representing the behavior of Ansible programs, and a symbolic interpreter used for the verification. On a benchmark suite of 21 code-generation tasks, our verifier is able to verify correct code in 83% of cases and identify incorrect code in 92%.
https://arxiv.org/abs/2507.13290
Automated generation of high-quality media presentations is challenging, requiring robust content extraction, narrative planning, visual design, and overall quality optimization. Existing methods often produce presentations with logical inconsistencies and suboptimal layouts, thereby struggling to meet professional standards. To address these challenges, we introduce RCPS (Reflective Coherent Presentation Synthesis), a novel framework integrating three key components: (1) Deep Structured Narrative Planning; (2) Adaptive Layout Generation; (3) an Iterative Optimization Loop. Additionally, we propose PREVAL, a preference-based evaluation framework employing rationale-enhanced multi-dimensional models to assess presentation quality across Content, Coherence, and Design. Experimental results demonstrate that RCPS significantly outperforms baseline methods across all quality dimensions, producing presentations that closely approximate human expert standards. PREVAL shows strong correlation with human judgments, validating it as a reliable automated tool for assessing presentation quality.
https://arxiv.org/abs/2507.13285
Robots are increasingly integrated across industries, particularly in healthcare. However, many valuable applications for quadrupedal robots remain overlooked. This research explores the effectiveness of three reinforcement learning algorithms in training a simulated quadruped robot for autonomous navigation and obstacle avoidance. The goal is to develop a robotic guide dog simulation capable of path following and obstacle avoidance, with long-term potential for real-world assistance to guide dogs and visually impaired individuals. It also seeks to expand research into medical 'pets', including robotic guide and alert dogs. A comparative analysis of thirteen related research papers shaped key evaluation criteria, including collision detection, pathfinding algorithms, sensor usage, robot type, and simulation platforms. The study focuses on sensor inputs, collision frequency, reward signals, and learning progression to determine which algorithm best supports robotic navigation in complex environments. Custom-made environments were used to ensure fair evaluation of all three algorithms under controlled conditions, allowing consistent data collection. Results show that Proximal Policy Optimization (PPO) outperformed Deep Q-Network (DQN) and Q-learning across all metrics, particularly in average and median steps to goal per episode. By analysing these results, this study contributes to robotic navigation, AI and medical robotics, offering insights into the feasibility of AI-driven quadruped mobility and its role in assistive robotics.
https://arxiv.org/abs/2507.13277
Advances in natural language processing and large language models are driving a major transformation in Human Capital Management, with a growing interest in building smart systems based on language technologies for talent acquisition, upskilling strategies, and workforce planning. However, the adoption and progress of these technologies critically depend on the development of reliable and fair models, properly evaluated on public data and open benchmarks, which have so far been unavailable in this domain. To address this gap, we present TalentCLEF 2025, the first evaluation campaign focused on skill and job title intelligence. The lab consists of two tasks: Task A - Multilingual Job Title Matching, covering English, Spanish, German, and Chinese; and Task B - Job Title-Based Skill Prediction, in English. Both corpora were built from real job applications, carefully anonymized, and manually annotated to reflect the complexity and diversity of real-world labor market data, including linguistic variability and gender-marked expressions. The evaluations included monolingual and cross-lingual scenarios and covered the evaluation of gender bias. TalentCLEF attracted 76 registered teams with more than 280 submissions. Most systems relied on information retrieval techniques built with multilingual encoder-based models fine-tuned with contrastive learning, and several of them incorporated large language models for data augmentation or re-ranking. The results show that the training strategies have a larger effect than the size of the model alone. TalentCLEF provides the first public benchmark in this field and encourages the development of robust, fair, and transferable language technologies for the labor market.
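Since most submissions fine-tuned multilingual encoders with contrastive learning, a minimal sketch of the typical in-batch InfoNCE objective over matching job-title pairs may be useful; the embedding dimensionality, batch construction, and temperature below are assumptions rather than details from the overview.

```python
# Hedged sketch of in-batch contrastive (InfoNCE) fine-tuning for job-title
# matching. The encoder outputs, batching, and temperature are assumptions;
# the lab overview does not prescribe a specific recipe.
import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb, positive_emb, temperature=0.05):
    """anchor_emb, positive_emb: (batch, dim) embeddings of matching titles."""
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.t() / temperature           # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))          # i-th anchor matches i-th positive
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for encoder outputs,
# e.g. embeddings of "software engineer" and "desarrollador de software".
anchor = torch.randn(8, 384)
positive = torch.randn(8, 384)
print(info_nce_loss(anchor, positive).item())
```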
https://arxiv.org/abs/2507.13275
Reinforcement learning (RL) has become a key component in training large language reasoning models (LLMs). However, recent studies question its effectiveness in improving multi-step reasoning, particularly on hard problems. To address this challenge, we propose a simple yet effective question augmentation strategy: introducing partial solutions during training to reduce problem difficulty and provide more informative learning signals. Our method, QuestA, when applied during RL training on math reasoning tasks, improves not only pass@1 but also pass@k, particularly on problems where standard RL struggles to make progress. This enables continual improvement over strong open-source models such as DeepScaleR and OpenMath Nemotron, further enhancing their reasoning capabilities. We achieve new state-of-the-art results on math benchmarks using 1.5B-parameter models: 67.1% (+5.3%) on AIME24, 59.5% (+10.0%) on AIME25, and 35.5% (+4.0%) on HMMT25. Further, we provide theoretical explanations that QuestA improves sample efficiency, offering a practical and generalizable pathway for expanding reasoning capability through RL.
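A minimal sketch of the question-augmentation idea, assuming the partial solution is injected as a prefix of a reference solution; the prompt template and the hint-ratio schedule are illustrative, not the paper's exact recipe.

```python
# Hedged sketch: augmenting a hard math question with a prefix of a known
# solution to create an easier RL training prompt. The template and the
# hint-ratio schedule are illustrative assumptions.
def augment_question(question: str, reference_solution: str, hint_ratio: float) -> str:
    """Prepend the first `hint_ratio` fraction of the solution steps as a hint."""
    steps = reference_solution.strip().split("\n")
    n_hint = int(len(steps) * hint_ratio)
    hint = "\n".join(steps[:n_hint])
    if not hint:
        return question
    return (f"{question}\n\n"
            f"Here is the beginning of a correct solution:\n{hint}\n\n"
            f"Continue from here and give the final answer.")

question = "Find the number of positive integers n <= 100 with n^2 + 1 divisible by 5."
solution = ("n^2 = -1 (mod 5) iff n = 2 or 3 (mod 5).\n"
            "There are 20 + 20 = 40 such n.\n"
            "Answer: 40.")
print(augment_question(question, solution, hint_ratio=0.34))
```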
https://arxiv.org/abs/2507.13266
We present Voxtral Mini and Voxtral Small, two multimodal audio chat models. Voxtral is trained to comprehend both spoken audio and text documents, achieving state-of-the-art performance across a diverse range of audio benchmarks, while preserving strong text capabilities. Voxtral Small outperforms a number of closed-source models, while being small enough to run locally. A 32K context window enables the model to handle audio files up to 40 minutes in duration and long multi-turn conversations. We also contribute three benchmarks for evaluating speech understanding models on knowledge and trivia. Both Voxtral models are released under Apache 2.0 license.
https://arxiv.org/abs/2507.13264
Bayesian Optimization (BO) is a standard tool for black-box optimization problems. The current state-of-the-art BO approach for permutation spaces relies on the Mallows kernel, an $\Omega(n^2)$ representation that explicitly enumerates every pairwise comparison. Inspired by the close relationship between the Mallows kernel and pairwise comparison, we propose a novel framework for generating kernel functions on permutation spaces based on sorting algorithms. Within this framework, the Mallows kernel can be viewed as a special instance derived from bubble sort. Further, we introduce the \textbf{Merge Kernel}, constructed from merge sort, which replaces the quadratic complexity with $\Theta(n\log n)$ to achieve the lowest possible complexity. The resulting feature vector is significantly shorter, can be computed in linearithmic time, and still efficiently captures meaningful permutation distances. To boost robustness and right-invariance without sacrificing compactness, we further incorporate three lightweight, task-agnostic descriptors: (1) a shift histogram, which aggregates absolute element displacements and supplies a global misplacement signal; (2) a split-pair line, which encodes selected long-range comparisons by aligning elements across the two halves of the whole permutation; and (3) sliding-window motifs, which summarize local order patterns that influence near-neighbor objectives. Our empirical evaluation demonstrates that the proposed kernel consistently outperforms the state-of-the-art Mallows kernel across various permutation optimization benchmarks. The results confirm that the Merge Kernel provides a more compact yet more effective solution for Bayesian optimization in permutation spaces.
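For context, the Mallows kernel is $\exp(-\lambda\, d_K(\pi,\sigma))$ with $d_K$ the number of discordant pairs, and merge sort counts those pairs in $\Theta(n\log n)$ instead of enumerating all $\Omega(n^2)$ of them; the sketch below shows that baseline computation (the paper's merge-sort feature construction and the three extra descriptors are not reproduced).

```python
# Hedged sketch: the Mallows kernel exp(-lam * d_K(pi, sigma)), with the Kendall
# (discordant-pair) distance d_K counted in Theta(n log n) via merge sort instead
# of the Omega(n^2) pairwise enumeration. This is the baseline computation the
# Merge Kernel builds on, not the paper's feature construction.
import math

def count_inversions(arr):
    """Return (sorted arr, number of inversions) using merge sort."""
    if len(arr) <= 1:
        return arr, 0
    mid = len(arr) // 2
    left, inv_l = count_inversions(arr[:mid])
    right, inv_r = count_inversions(arr[mid:])
    merged, i, j, inv = [], 0, 0, inv_l + inv_r
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
            inv += len(left) - i        # every remaining left element is discordant
    merged.extend(left[i:]); merged.extend(right[j:])
    return merged, inv

def mallows_kernel(pi, sigma, lam=0.1):
    """pi, sigma: permutations of 0..n-1 given as lists."""
    pos = {v: k for k, v in enumerate(sigma)}   # relabel so sigma becomes the identity
    relabeled = [pos[v] for v in pi]
    _, d_kendall = count_inversions(relabeled)
    return math.exp(-lam * d_kendall)

print(mallows_kernel([2, 0, 1, 3], [0, 1, 2, 3]))   # d_K = 2 -> exp(-0.2)
```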
https://arxiv.org/abs/2507.13263
A prevalent approach in Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViT) involves freezing the majority of the backbone parameters and solely learning low-rank adaptation weight matrices to accommodate downstream tasks. These low-rank matrices are commonly derived through the multiplication structure of down-projection and up-projection matrices, exemplified by methods such as LoRA and Adapter. In this work, we observe approximate orthogonality between any two row or column vectors within any weight matrix of the backbone parameters; however, this property is absent in the vectors of the down/up-projection matrices. Approximate orthogonality implies a reduction in the upper bound of the model's generalization error, signifying that the model possesses enhanced generalization capability. If the fine-tuned down/up-projection matrices were to exhibit the same property as the pre-trained backbone matrices, could the generalization capability of fine-tuned ViTs be further augmented? To address this question, we propose an Approximately Orthogonal Fine-Tuning (AOFT) strategy for representing the low-rank weight matrices. This strategy employs a single learnable vector to generate a set of approximately orthogonal vectors, which form the down/up-projection matrices, thereby aligning the properties of these matrices with those of the backbone. Extensive experimental results demonstrate that our method achieves competitive performance across a range of downstream image classification tasks, confirming the efficacy of the enhanced generalization capability embedded in the down/up-projection matrices.
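The abstract does not specify how the single learnable vector produces approximately orthogonal vectors; one simple construction that yields orthonormal columns from a single vector is a Householder reflection, sketched below purely as an illustration of the idea, not as the paper's parameterization.

```python
# Hedged sketch: building a down-projection with orthogonal columns from a single
# learnable vector via a Householder reflection H = I - 2 v v^T / ||v||^2.
# This is one illustrative construction; the actual AOFT parameterization is not
# specified in the abstract.
import torch
import torch.nn as nn

class HouseholderDownProjection(nn.Module):
    def __init__(self, in_dim: int, rank: int):
        super().__init__()
        self.v = nn.Parameter(torch.randn(in_dim))   # the single learnable vector
        self.rank = rank

    def weight(self) -> torch.Tensor:
        v = self.v / (self.v.norm() + 1e-8)
        H = torch.eye(self.v.numel(), device=self.v.device) - 2.0 * torch.outer(v, v)
        return H[:, : self.rank]           # (in_dim, rank) with orthonormal columns

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight()           # (batch, in_dim) -> (batch, rank)

proj = HouseholderDownProjection(in_dim=768, rank=8)
W = proj.weight()
print((W.t() @ W - torch.eye(8)).abs().max())   # ~0: columns are orthonormal
```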
https://arxiv.org/abs/2507.13260
Recent progress in Multimodal Large Language Models (MLLMs) has unlocked powerful cross-modal reasoning abilities, but also raised new safety concerns, particularly when faced with adversarial multimodal inputs. To improve the safety of MLLMs during inference, we introduce AutoSteer, a modular and adaptive inference-time intervention technique that requires no fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical benchmarks demonstrate that AutoSteer significantly reduces the Attack Success Rate (ASR) for textual, visual, and cross-modal threats, while maintaining general abilities. These findings position AutoSteer as a practical, interpretable, and effective framework for safer deployment of multimodal AI systems.
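A minimal sketch of the general pattern described, assuming a linear prober on pooled hidden states from one internal layer and a fixed refusal message; the layer choice, prober architecture, threshold, and the SAS-based layer selection are assumptions, not details from the paper.

```python
# Hedged sketch of inference-time safety probing: a small linear prober scores
# pooled hidden states from one internal layer, and generation is redirected to
# a refusal when the estimated risk exceeds a threshold. The layer index, prober
# shape, and threshold are illustrative assumptions, not AutoSteer's exact design.
import torch
import torch.nn as nn

class SafetyProber(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from a chosen internal layer.
        pooled = hidden_states.mean(dim=1)
        return torch.sigmoid(self.scorer(pooled)).squeeze(-1)   # risk in [0, 1]

def guarded_generate(generate_fn, refusal_text, hidden_states, prober, threshold=0.5):
    """Run normal generation unless the prober flags the input as risky."""
    risk = prober(hidden_states)
    if risk.item() > threshold:
        return refusal_text
    return generate_fn()

prober = SafetyProber(hidden_dim=4096)
h = torch.randn(1, 32, 4096)          # stand-in for a real layer's activations
print(guarded_generate(lambda: "normal answer", "I can't help with that.", h, prober))
```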
https://arxiv.org/abs/2507.13255
Analogies test a model's ability to infer implicit relationships between concepts, making them a key benchmark for evaluating reasoning capabilities. While large language models (LLMs) are widely evaluated for reasoning in English, their abilities in Indic languages remain understudied, limiting our understanding of whether these models generalize across languages. To address this gap, we introduce a new Hindi Analogy Test Set (HATS), comprising 405 multiple-choice questions sourced from Indian government exams. We benchmark state-of-the-art multilingual LLMs using various prompting strategies and introduce a grounded Chain of Thought approach that leverages cognitive theories of analogical reasoning. This approach improves model performance on Hindi analogy questions. Our experiments show that models perform best with English prompts, irrespective of the prompting strategy. Our test set addresses the lack of a critical resource to evaluate LLM reasoning capabilities in Hindi.
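A sketch of what a grounded chain-of-thought prompt for "A : B :: C : ?" questions could look like, following the structure-mapping idea of naming the relation before applying it; the wording is illustrative and is not the prompt used for HATS.

```python
# Hedged sketch of a grounded chain-of-thought prompt for multiple-choice
# analogy questions. The step structure follows the classic structure-mapping
# idea (state the relation, then map it); the exact wording is an assumption.
def grounded_cot_prompt(a, b, c, options):
    opts = "\n".join(f"{i + 1}. {o}" for i, o in enumerate(options))
    return (
        f"Solve the analogy {a} : {b} :: {c} : ?\n"
        f"Step 1: State the relationship between '{a}' and '{b}'.\n"
        f"Step 2: Apply the same relationship to '{c}'.\n"
        f"Step 3: Pick the option that fits.\n"
        f"Options:\n{opts}\n"
        f"Answer with the option number."
    )

print(grounded_cot_prompt("bird", "nest", "bee", ["hive", "flower", "honey", "wing"]))
```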
https://arxiv.org/abs/2507.13238
Large language models (LLMs) have shown impressive abilities in leveraging pretrained knowledge through prompting, but they often struggle with unseen tasks, particularly in data-scarce scenarios. While cross-task in-context learning offers a direct solution for transferring knowledge across tasks, it still faces critical challenges in terms of robustness, scalability, and efficiency. In this paper, we investigate whether cross-task transfer can be achieved via latent space steering without parameter updates or input expansion. Through an analysis of activation patterns in the latent space of LLMs, we observe that the enhanced activations induced by in-context examples have consistent patterns across different tasks. Inspired by these findings, we propose CAST, a novel Cross-task Activation Steering Transfer framework that enables effective transfer by manipulating the model's internal activation states. Our approach first selects influential and diverse samples from high-resource tasks, then utilizes their contrastive representation-enhanced activations to adapt LLMs to low-resource tasks. Extensive experiments across both cross-domain and cross-lingual transfer settings show that our method outperforms competitive baselines and demonstrates superior scalability and lower computational costs.
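A common way to realize this kind of activation steering is to add a direction computed from contrastive examples to a chosen layer's hidden states at inference time; the mean-difference formula, layer index, and scale in the sketch below are assumptions, not CAST's exact procedure.

```python
# Hedged sketch of activation steering: a steering direction is computed as the
# mean difference between pooled hidden states of runs with and without
# in-context examples on a high-resource task, then added to the same layer's
# activations when running the low-resource task. The formula, layer choice, and
# scale are illustrative assumptions.
import torch

def steering_vector(acts_with_icl: torch.Tensor, acts_without: torch.Tensor) -> torch.Tensor:
    """Both tensors: (num_samples, hidden_dim) pooled activations from one layer."""
    return acts_with_icl.mean(dim=0) - acts_without.mean(dim=0)

def make_steering_hook(direction: torch.Tensor, scale: float = 1.0):
    """Forward hook that shifts a layer's output along the steering direction."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * direction.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Toy usage with random activations standing in for a real model's layer outputs.
with_icl = torch.randn(64, 4096)
without = torch.randn(64, 4096)
direction = steering_vector(with_icl, without)
# Hypothetical attachment point on a transformer (path is an assumption):
# handle = model.model.layers[15].register_forward_hook(make_steering_hook(direction, 0.5))
```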
https://arxiv.org/abs/2507.13236
We present VITA, a Vision-To-Action flow matching policy that evolves latent visual representations into latent actions for visuomotor control. Traditional flow matching and diffusion policies sample from standard source distributions (e.g., Gaussian noise) and require additional conditioning mechanisms like cross-attention to condition action generation on visual information, creating time and space overheads. VITA proposes a novel paradigm that treats latent images as the flow source, learning an inherent mapping from vision to action while eliminating separate conditioning modules and preserving generative modeling capabilities. Learning flows between fundamentally different modalities like vision and action is challenging due to sparse action data lacking semantic structures and dimensional mismatches between high-dimensional visual representations and raw actions. We address this by creating a structured action latent space via an autoencoder as the flow matching target, up-sampling raw actions to match visual representation shapes. Crucially, we supervise flow matching with both encoder targets and final action outputs through flow latent decoding, which backpropagates action reconstruction loss through sequential flow matching ODE solving steps for effective end-to-end learning. Implemented as simple MLP layers, VITA is evaluated on challenging bi-manual manipulation tasks on the ALOHA platform, including 5 simulation and 2 real-world tasks. Despite its simplicity, MLP-only VITA outperforms or matches state-of-the-art generative policies while reducing inference latency by 50-130% compared to conventional flow matching policies requiring different conditioning mechanisms or complex architectures. To our knowledge, VITA is the first MLP-only flow matching policy capable of solving complex bi-manual manipulation tasks like those in ALOHA benchmarks.
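A minimal sketch of the core objective under stated assumptions: the flow source is a latent image rather than Gaussian noise, the target is a latent action, and an MLP velocity field is trained with a standard linear-interpolation flow-matching loss; the dimensions are invented, and the action autoencoder and decoding-based supervision are omitted.

```python
# Hedged sketch of vision-to-action flow matching: the flow source is a latent
# image z_img (not Gaussian noise) and the target is a latent action z_act; an
# MLP velocity field is trained with the standard linear-interpolation
# flow-matching loss. Dimensions and schedule are illustrative assumptions;
# VITA's autoencoder and flow-latent-decoding supervision are not shown.
import torch
import torch.nn as nn

class VelocityMLP(nn.Module):
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, t], dim=-1))

def flow_matching_loss(model, z_img, z_act):
    """z_img, z_act: (batch, dim) latent image (source) and latent action (target)."""
    t = torch.rand(z_img.size(0), 1)
    x_t = (1 - t) * z_img + t * z_act           # straight-line interpolation
    target_velocity = z_act - z_img             # constant velocity along the line
    return ((model(x_t, t) - target_velocity) ** 2).mean()

model = VelocityMLP(dim=256)
loss = flow_matching_loss(model, torch.randn(16, 256), torch.randn(16, 256))
loss.backward()
```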
https://arxiv.org/abs/2507.13231
The pursuit of a generalizable stereo matching model, capable of performing across varying resolutions and disparity ranges without dataset-specific fine-tuning, has revealed a fundamental trade-off. Iterative local search methods achieve high scores on constrained benchmarks, but their core mechanism inherently limits the global consistency required for true generalization. On the other hand, global matching architectures, while theoretically more robust, have been historically rendered infeasible by prohibitive computational and memory costs. We resolve this dilemma with $S^2M^2$: a global matching architecture that achieves both state-of-the-art accuracy and high efficiency without relying on cost volume filtering or deep refinement stacks. Our design integrates a multi-resolution transformer for robust long-range correspondence, trained with a novel loss function that concentrates probability on feasible matches. This approach enables a more robust joint estimation of disparity, occlusion, and confidence. $S^2M^2$ establishes a new state of the art on the Middlebury v3 and ETH3D benchmarks, significantly outperforming prior methods across most metrics while reconstructing high-quality details with competitive efficiency.
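One plausible reading of a loss that "concentrates probability on feasible matches" is a per-pixel cross-entropy against a soft target peaked at the ground-truth disparity; the sketch below illustrates that reading and is not the paper's actual loss.

```python
# Hedged sketch of a matching loss that concentrates probability mass on feasible
# candidates: per pixel, candidate disparities get a soft target peaked at the
# ground truth, and the predicted distribution is trained with cross-entropy.
# This is one plausible reading of the abstract, not the paper's actual loss.
import torch

def concentrated_matching_loss(logits, gt_disparity, disparities, sigma=1.0):
    """
    logits:        (num_pixels, num_candidates) unnormalized matching scores
    gt_disparity:  (num_pixels,) ground-truth disparity per pixel
    disparities:   (num_candidates,) candidate disparity values
    """
    diff = disparities.unsqueeze(0) - gt_disparity.unsqueeze(1)
    target = torch.exp(-0.5 * (diff / sigma) ** 2)            # peaked at ground truth
    target = target / target.sum(dim=1, keepdim=True)
    log_probs = torch.log_softmax(logits, dim=1)
    return -(target * log_probs).sum(dim=1).mean()

logits = torch.randn(1024, 192)              # toy batch of pixels, 192 candidates
gt = torch.rand(1024) * 191
print(concentrated_matching_loss(logits, gt, torch.arange(192.0)))
```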
https://arxiv.org/abs/2507.13229
This work presents a novel co-design strategy that integrates trajectory planning and control to handle STL-based tasks in autonomous robots. The method consists of two phases: $(i)$ learning spatio-temporal motion primitives to encapsulate the inherent robot-specific constraints and $(ii)$ constructing an STL-compliant motion plan from these primitives. Initially, we employ reinforcement learning to construct a library of control policies that perform trajectories described by the motion primitives. Then, we map motion primitives to spatio-temporal characteristics. Subsequently, we present a sampling-based STL-compliant motion planning strategy tailored to meet the STL specification. The proposed model-free approach, which generates feasible STL-compliant motion plans across various environments, is validated on differential-drive and quadruped robots across various STL specifications. Demonstration videos are available at this https URL.
https://arxiv.org/abs/2507.13225
Recent advances in Generative AI (GenAI) have led to significant improvements in the quality of generated visual content. As AI-generated visual content becomes increasingly indistinguishable from real content, the challenge of detecting the generated content becomes critical in combating misinformation, ensuring privacy, and preventing security threats. Although there has been substantial progress in detecting AI-generated images, current methods for video detection are largely focused on deepfakes, which primarily involve human faces. However, the field of video generation has advanced beyond DeepFakes, creating an urgent need for methods capable of detecting AI-generated videos with generic content. To address this gap, we propose a novel approach that leverages pre-trained visual models to distinguish between real and generated videos. The features extracted from these pre-trained models, which have been trained on extensive real visual content, contain inherent signals that can help distinguish real from generated videos. Using these extracted features, we achieve high detection performance without requiring additional model training, and we further improve performance by training a simple linear classification layer on top of the extracted features. We validated our method on a dataset we compiled (VID-AID), which includes around 10,000 AI-generated videos produced by 9 different text-to-video models, along with 4,000 real videos, totaling over 7 hours of video content. Our evaluation shows that our approach achieves high detection accuracy, above 90% on average, underscoring its effectiveness. Upon acceptance, we plan to publicly release the code, the pre-trained models, and our dataset to support ongoing research in this critical area.
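A minimal sketch of the detection recipe described: features from a frozen pretrained visual backbone, frame-averaged per video, with a simple linear classifier on top; the feature source, array shapes, and classifier choice are placeholders, not the paper's setup.

```python
# Hedged sketch: training a linear probe on features from a frozen pretrained
# visual backbone to separate real from AI-generated videos. Features are assumed
# to be pre-extracted and frame-averaged into one vector per video; shapes and
# the classifier choice are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Stand-in arrays; in practice these would come from a frozen image/video encoder.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 768))       # one pooled feature vector per video
labels = rng.integers(0, 2, size=1000)        # 1 = AI-generated, 0 = real

split = 800
clf = LogisticRegression(max_iter=1000).fit(features[:split], labels[:split])
preds = clf.predict(features[split:])
print("held-out accuracy:", accuracy_score(labels[split:], preds))
```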
https://arxiv.org/abs/2507.13224
While recent advancements in deep neural networks (DNNs) have substantially enhanced visual AI's capabilities, the challenge of inadequate data diversity and volume remains, particularly in the construction domain. This study presents a novel image synthesis methodology tailored for construction worker detection, leveraging the generative-AI platform Midjourney. The approach entails generating a collection of 12,000 synthetic images by formulating 3,000 different prompts, with an emphasis on image realism and diversity. These images, after manual labeling, serve as a dataset for DNN training. Evaluation on a real construction image dataset yielded promising results, with the model attaining average precisions (APs) of 0.937 and 0.642 at intersection-over-union (IoU) thresholds of 0.5 and 0.5 to 0.95, respectively. Notably, the model demonstrated near-perfect performance on the synthetic dataset, achieving APs of 0.994 and 0.919 at the two mentioned thresholds. These findings reveal both the potential and the limitations of generative AI in addressing DNN training data scarcity.
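For reference, the reported APs count a detection as correct when its IoU with a ground-truth box meets the threshold (0.5, or averaged over thresholds from 0.5 to 0.95); a minimal IoU computation is shown below.

```python
# Minimal IoU computation between two axis-aligned boxes in (x1, y1, x2, y2)
# format; a detection counts as a true positive at a given threshold when
# IoU >= threshold, which is how AP@0.5 and AP@[0.5:0.95] are defined.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ~ 0.143
```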
https://arxiv.org/abs/2507.13221