Personalization is becoming indispensable for LLMs to align with individual user preferences and needs. Yet current approaches are often computationally expensive, data-intensive, susceptible to catastrophic forgetting, and prone to performance degradation in multi-turn interactions or when handling implicit queries. To address these challenges, we conceptualize personalization as a model editing task and introduce Personalization Editing, a framework that applies localized edits guided by clustered preference representations. This design enables precise preference-aligned updates while preserving overall model capabilities. In addition, existing personalization benchmarks frequently rely on persona-based dialogs between LLMs rather than user-LLM interactions, or focus primarily on stylistic imitation while neglecting information-seeking tasks that require accurate recall of user-specific preferences. We introduce User Preference Question Answering (UPQA), a short-answer QA dataset constructed from in-situ user queries with varying levels of difficulty. Unlike prior benchmarks, UPQA directly evaluates a model's ability to recall and apply specific user preferences. Across experimental settings, Personalization Editing achieves higher editing accuracy and greater computational efficiency than fine-tuning, while outperforming prompting-based baselines in multi-turn conversations and on implicit preference questions.
https://arxiv.org/abs/2512.13676
Industrial anomaly detection (IAD) is difficult due to the scarcity of normal reference samples and the subtle, localized nature of many defects. Single-pass vision-language models (VLMs) often overlook small abnormalities and lack explicit mechanisms to compare against canonical normal patterns. We propose AgentIAD, a tool-driven agentic framework that enables multi-stage visual inspection. The agent is equipped with a Perceptive Zoomer (PZ) for localized fine-grained analysis and a Comparative Retriever (CR) for querying normal exemplars when evidence is ambiguous. To teach these inspection behaviors, we construct structured perceptive and comparative trajectories from the MMAD dataset and train the model in two stages: supervised fine-tuning followed by reinforcement learning. A two-part reward design drives this process: a perception reward that supervises classification accuracy, spatial alignment, and type correctness, and a behavior reward that encourages efficient tool use. Together, these components enable the model to refine its judgment through step-wise observation, zooming, and verification. AgentIAD achieves a new state-of-the-art 97.62% classification accuracy on MMAD, surpassing prior MLLM-based approaches while producing transparent and interpretable inspection traces.
https://arxiv.org/abs/2512.13671
As the online learning landscape evolves, the need for personalization is increasingly evident. Although educational resources are burgeoning, educators face challenges selecting materials that both align with intended learning outcomes and address diverse learner needs. Large Language Models (LLMs) are attracting growing interest for their potential to create learning resources that better support personalization, but verifying coverage of intended outcomes still requires human alignment review, which is costly and limits scalability. We propose a framework that supports the cost-effective automation of evaluating alignment between educational resources and intended learning outcomes. Using human-generated materials, we benchmarked LLM-based text-embedding models and found that the most accurate model (Voyage) achieved 79% accuracy in detecting alignment. We then applied the optimal model to LLM-generated resources and, via expert evaluation, confirmed that it reliably assessed correspondence to intended outcomes (83% accuracy). Finally, in a three-group experiment with 360 learners, higher alignment scores were positively related to greater learning performance, chi-squared(2, N = 360) = 15.39, p < 0.001. These findings show that embedding-based alignment scores can facilitate scalable personalization by confirming alignment with learning outcomes, which allows teachers to focus on tailoring content to diverse learner needs.
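The reported statistic, chi-squared(2, N = 360) = 15.39, is a standard test of independence for a three-group experiment. As a hedged sketch, the counts below are hypothetical (the study's actual outcome data are not given here); the code only illustrates how such a statistic and its degrees of freedom are computed.

```python
# Hedged sketch: computing a Pearson chi-squared statistic for a
# three-group experiment. The pass/fail counts are hypothetical,
# NOT the study's data.

def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical pass/fail counts for low/medium/high alignment groups (N = 360).
table = [[45, 75],   # low alignment
         [60, 60],   # medium alignment
         [80, 40]]   # high alignment
df = (len(table) - 1) * (len(table[0]) - 1)  # degrees of freedom = 2
print(f"chi2 = {chi_squared(table):.2f}, df = {df}")
```

With 2 degrees of freedom, any statistic above 13.82 corresponds to p < 0.001, consistent with the significance level the abstract reports.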
https://arxiv.org/abs/2512.13658
Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all 16 models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (avg GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp), while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646) with model-dependent capability impact. These findings provide researchers with evidence-based selection criteria for abliteration tool deployment across diverse model architectures. The principal finding indicates that mathematical reasoning capabilities exhibit the highest sensitivity to abliteration interventions, with GSM8K change ranging from +1.51 pp to -18.81 pp (-26.5% relative) depending on tool selection and model architecture.
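The "directional orthogonalization" these tools implement can be sketched in a few lines: given a unit refusal direction, a weight matrix is modified so its outputs carry no component along that direction. The random matrix and direction below are purely illustrative assumptions; real tools estimate the direction from activation differences on harmful vs. harmless prompts.

```python
import numpy as np

# Hedged sketch of directional orthogonalization ("abliteration"):
# project a hypothetical refusal direction v out of a weight matrix W,
# so that no input can produce output along v through W.

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))   # stand-in for an attention/MLP output weight
v = rng.normal(size=d)
v /= np.linalg.norm(v)        # unit refusal direction (hypothetical)

# Remove the component of every output along v: W' = W - v (v^T W)
W_abl = W - np.outer(v, v @ W)

# After ablation, outputs have near-zero projection on v for any input x.
x = rng.normal(size=d)
print(abs(v @ (W_abl @ x)))   # ~0 up to floating-point error
```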
https://arxiv.org/abs/2512.13655
Large language models (LLMs) have been shown to respond in a variety of ways for classification tasks outside of question-answering. LLM responses are sometimes called "hallucinations" since the output is not what is expected. Memorization strategies in LLMs are being studied in detail, with the goal of understanding how LLMs respond. We perform a deep dive into a classification task based on United States Supreme Court (SCOTUS) decisions. The SCOTUS corpus is an ideal classification task to study for LLM memory accuracy because it presents significant challenges due to extensive sentence length, complex legal terminology, non-standard structure, and domain-specific vocabulary. Experimentation is performed with the latest LLM fine-tuning and retrieval-based approaches, such as parameter-efficient fine-tuning, auto-modeling, and others, on two traditional category-based SCOTUS classification tasks: one with 15 labeled topics and another with 279. We show that prompt-based models with memories, such as DeepSeek, can be more robust than previous BERT-based models on both tasks, scoring about 2 points better than previous models not based on prompting.
https://arxiv.org/abs/2512.13654
Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online Reinforcement Learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. One set serves as a Decision Expert for scenario reasoning and driving decision-making, while the other acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions, instead of operating directly in a continuous action space. This approach effectively balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration in online reinforcement learning. MindDrive achieves strong closed-loop performance on the challenging Bench2Drive benchmark, with a Driving Score (DS) of 78.04 and a Success Rate (SR) of 55.09%. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for the VLA model in autonomous driving.
https://arxiv.org/abs/2512.13636
Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents the first empirical study of temporal tokenization for event sequences, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties, with log-based strategies excelling on skewed distributions and human-centric formats proving robust for mixed modalities.
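Two of the compared strategies, classic uniform binning and a log-based scheme, can be sketched directly. The bin parameters and the idea of mapping bin ids to dedicated time tokens are illustrative assumptions, not the paper's exact configurations; the contrast shows why log-spaced bins suit skewed gap distributions.

```python
import math

# Hedged sketch: uniform vs. log-spaced binning of inter-event gaps.
# Bin ids would then be mapped to special tokens such as "<t_17>"
# (the token naming and parameters here are illustrative assumptions).

def uniform_bin(dt, max_dt=3600.0, n_bins=32):
    """Map an inter-event gap (seconds) to one of n_bins equal-width bins."""
    dt = min(max(dt, 0.0), max_dt)
    return min(int(dt / max_dt * n_bins), n_bins - 1)

def log_bin(dt, max_dt=3600.0, n_bins=32):
    """Log-spaced bins: fine resolution for short gaps, coarse for long ones."""
    dt = min(max(dt, 1e-3), max_dt)
    frac = math.log(dt / 1e-3) / math.log(max_dt / 1e-3)
    return min(int(frac * n_bins), n_bins - 1)

gaps = [0.05, 2.0, 45.0, 900.0, 3600.0]
print([uniform_bin(g) for g in gaps])  # short gaps all collapse into bin 0
print([log_bin(g) for g in gaps])      # short gaps stay distinguishable
```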
https://arxiv.org/abs/2512.13618
The ability to perform multi-modal multi-hop reasoning by iteratively integrating information across various modalities and external knowledge is critical for addressing complex real-world challenges. However, existing Multi-modal Large Language Models (MLLMs) are predominantly limited to single-step reasoning, as existing benchmarks lack the complexity needed to evaluate and drive multi-hop abilities. To bridge this gap, we introduce MMhops, a novel, large-scale benchmark designed to systematically evaluate and foster multi-modal multi-hop reasoning. The MMhops dataset comprises two challenging task formats, Bridging and Comparison, which necessitate that models dynamically construct complex reasoning chains by integrating external knowledge. To tackle the challenges posed by MMhops, we propose MMhops-R1, a novel multi-modal Retrieval-Augmented Generation (mRAG) framework for dynamic reasoning. Our framework utilizes reinforcement learning to optimize the model for autonomously planning reasoning paths, formulating targeted queries, and synthesizing multi-level information. Comprehensive experiments demonstrate that MMhops-R1 significantly outperforms strong baselines on MMhops, highlighting that dynamic planning and multi-modal knowledge integration are crucial for complex reasoning. Moreover, MMhops-R1 demonstrates strong generalization to tasks requiring fixed-hop reasoning, underscoring the robustness of our dynamic planning approach. In conclusion, our work contributes a challenging new benchmark and a powerful baseline model, and we will release the associated code, data, and weights to catalyze future research in this critical area.
https://arxiv.org/abs/2512.13573
Memory has emerged as, and will remain, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.
https://arxiv.org/abs/2512.13564
The study presents the outcomes of research and experimental validation in the domain of automated codebase migration, with a focus on addressing challenges in transitioning SQL-based systems. The proposed migration method is essentially a framework that leverages the best aspects of traditional software engineering techniques and provides an iterative, scalable, precise, and efficient solution for modern database transformations. The central piece of the approach is the integration of a fine-tuned Large Language Model to address critical issues in SQL code conversion, such as syntax mapping, resolving discrepancies between Oracle PL/SQL and PostgreSQL, and optimising database elements such as stored procedures, triggers, views, and overall database logic. The method thus involves a trade-off between fine-tuning and prompt engineering. Special attention is given to the fine-tuning approach, which enhances adaptability and compatibility with migration requirements across the entire database; the results show that fine-tuning plays a central role. The study employs targeted evaluation methodologies along with computational metrics to measure the success of iterative conversion cycles. Core innovations include automated SQL feature detection, semi-supervised error analysis, and integration of Subject Matter Expert feedback within a systematic migration workflow. The methodology achieves significant reductions in Syntax Error Rates, enhances feature alignment throughout migration iterations, and leverages dataset sampling to ensure continual improvement. By embedding Generative AI (GAI) into the migration process, the framework facilitates precise feature mapping, semi-automated error resolution, and data-driven optimisation loops, improving workflow efficiency.
https://arxiv.org/abs/2512.13515
Our objective is to build a general time-aware video-text embedding model for retrieval. To that end, we propose a simple and efficient recipe, dubbed TARA (Time Aware Retrieval Adaptation), to adapt Multimodal LLMs (MLLMs) to a time-aware video-text embedding model without using any video data at all. For evaluating time-awareness in retrieval, we propose a new benchmark with temporally opposite (chiral) actions as hard negatives and curated splits for chiral and non-chiral actions. We show that TARA outperforms all existing video-text models on this chiral benchmark while also achieving strong results on standard benchmarks. Furthermore, we discover additional benefits of TARA beyond time-awareness: (i) TARA embeddings are negation-aware, as shown in the NegBench benchmark that evaluates negation in video retrieval; (ii) TARA achieves state-of-the-art performance on verb and adverb understanding in videos. Overall, TARA yields a strong, versatile, time-aware video-text embedding model with state-of-the-art zero-shot performance.
https://arxiv.org/abs/2512.13511
Large language models (LLM) have achieved remarkable performance across a wide range of tasks. However, their substantial parameter sizes pose significant challenges for deployment on edge devices with limited computational and memory resources. Low-rank compression is a promising approach to address this issue, as it reduces both computational and memory costs, making LLM more suitable for resource-constrained environments. Nonetheless, naïve low-rank compression methods require a significant reduction in the retained rank to achieve meaningful memory and computation savings. For a low-rank model, the ranks need to be reduced by more than half to yield efficiency gains. Such aggressive truncation, however, typically results in substantial performance degradation. To address this trade-off, we propose SkipCat, a novel low-rank compression framework that enables the use of higher ranks while achieving the same compression rates. First, we introduce an intra-layer shared low-rank projection method, where multiple matrices that share the same input use a common projection. This reduces redundancy and improves compression efficiency. Second, we propose a block skipping technique that omits computations and memory transfers for selected sub-blocks within the low-rank decomposition. These two techniques jointly enable our compressed model to retain more effective ranks under the same compression budget. Experimental results show that, without any additional fine-tuning, our method outperforms previous low-rank compression approaches by 7% accuracy improvement on zero-shot tasks under the same compression rate. These results highlight the effectiveness of our rank-maximized compression strategy in preserving model performance under tight resource constraints.
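The baseline trade-off described above, that ranks must drop below half the dimension before a square matrix's factorization saves anything (2·d·r < d·d iff r < d/2), can be made concrete with truncated SVD. A minimal sketch of the naive baseline, not of SkipCat itself:

```python
import numpy as np

# Hedged sketch of naive low-rank compression via truncated SVD,
# illustrating the parameter-count arithmetic from the abstract.
# The matrix is random; real targets are trained weight matrices.

rng = np.random.default_rng(0)
d, r = 64, 16
W = rng.normal(size=(d, d))

U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]      # d x r factor
B = Vt[:r, :]             # r x d factor
W_approx = A @ B          # best rank-r approximation (Eckart-Young)

dense_params = d * d          # storing W directly
lowrank_params = 2 * d * r    # storing A and B; smaller only when r < d/2
print(f"dense {dense_params} vs low-rank {lowrank_params} "
      f"({lowrank_params / dense_params:.0%} of original)")
print("relative error:", np.linalg.norm(W - W_approx) / np.linalg.norm(W))
```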
https://arxiv.org/abs/2512.13494
Envy is a common human behavior that shapes competitiveness and can alter outcomes in team settings. As large language models (LLMs) increasingly act on behalf of humans in collaborative and competitive workflows, there is a pressing need to evaluate whether, and under what conditions, they exhibit envy-like preferences. In this paper, we test whether LLMs show envy-like behavior toward each other. We considered two scenarios: (1) a point-allocation game that tests whether a model tries to win over its peer, and (2) a workplace setting that observes behavior when recognition is unfair. Our findings reveal consistent evidence of envy-like patterns in certain LLMs, with large variation across models and contexts. For instance, GPT-5-mini and Claude-3.7-Sonnet show a clear tendency to pull down the peer model to equalize outcomes, whereas Mistral-Small-3.2-24B instead focuses on maximizing its own individual gains. These results highlight the need to consider competitive dispositions as a safety and design factor in LLM-based multi-agent systems.
https://arxiv.org/abs/2512.13481
Code large language models (Code LLMs) are powerful but costly to train, with scaling laws predicting performance from model size, data, and compute. However, different programming languages (PLs) have varying impacts during pre-training that significantly affect base model performance, leading to inaccurate performance prediction. Besides, existing works focus on language-agnostic settings, neglecting the inherently multilingual nature of modern software development. Therefore, it is first necessary to investigate the scaling laws of different PLs, and then consider their mutual influences to arrive at the final multilingual scaling law. In this paper, we present the first systematic exploration of scaling laws for multilingual code pre-training, conducting 1,000+ experiments (equivalent to 336,000+ H800 GPU-hours) across multiple PLs, model sizes (0.2B to 14B parameters), and dataset sizes (1T tokens). We establish comprehensive scaling laws for code LLMs across multiple PLs, revealing that interpreted languages (e.g., Python) benefit more from increased model size and data than compiled languages (e.g., Rust). The study demonstrates that multilingual pre-training provides synergistic benefits, particularly between syntactically similar PLs. Further, the pre-training strategy of parallel pairing (concatenating code snippets with their translations) significantly enhances cross-lingual abilities with favorable scaling properties. Finally, a proportion-dependent multilingual scaling law is proposed to optimally allocate training tokens by prioritizing high-utility PLs (e.g., Python), balancing high-synergy pairs (e.g., JavaScript-TypeScript), and reducing allocation to fast-saturating languages (e.g., Rust), achieving superior average performance across all PLs compared to uniform distribution under the same compute budget.
https://arxiv.org/abs/2512.13472
While Large Language Model (LLM) agents show great potential for automated UI navigation such as automated UI testing and AI assistants, their efficiency has been largely overlooked. Our motivating study reveals that inefficient UI representation creates a critical performance bottleneck. However, UI representation optimization, formulated as the task of automatically generating programs that transform UI representations, faces two unique challenges. First, the lack of Boolean oracles, which traditional program synthesis uses to decisively validate semantic correctness, poses a fundamental challenge to co-optimization of token efficiency and completeness. Second, the synthesizer must process large, complex UI trees as input while generating long, compositional transformation programs, which makes the search space vast and error-prone. To address these limitations, we present UIFormer, the first automated optimization framework that synthesizes UI transformation programs by conducting constraint-based optimization with structured decomposition of the complex synthesis task. First, UIFormer restricts the program space using a domain-specific language (DSL) that captures UI-specific operations. Second, UIFormer conducts LLM-based iterative refinement with correctness and efficiency rewards, providing guidance for achieving the efficiency-completeness co-optimization. UIFormer operates as a lightweight plugin that applies transformation programs for seamless integration with existing LLM agents, requiring minimal modifications to their core logic. Evaluations across three UI navigation benchmarks spanning Android and Web platforms with five LLMs demonstrate that UIFormer achieves 48.7% to 55.8% token reduction with minimal runtime overhead while maintaining or improving agent performance. Real-world industry deployment at WeChat further validates the practical impact of UIFormer.
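The kind of transformation program such a DSL might express can be sketched as a tree-pruning pass: drop invisible subtrees and keep only salient attributes to cut token count. The node schema and kept attributes below are hypothetical illustrations, not UIFormer's actual DSL.

```python
# Hedged sketch of a UI-representation transformation: prune invisible
# nodes and verbose attributes from a UI tree. Schema is illustrative.

KEEP = ("text", "content-desc", "clickable")  # assumed salient attributes

def transform(node):
    """Recursively prune invisible subtrees and keep only salient attributes."""
    if not node.get("visible", True):
        return None
    slim = {k: v for k, v in node.items() if k in KEEP and v}
    children = [c for c in map(transform, node.get("children", [])) if c]
    if children:
        slim["children"] = children
    return slim or None

tree = {"visible": True, "bounds": "[0,0][1080,1920]", "children": [
    {"visible": False, "text": "hidden menu"},
    {"visible": True, "clickable": True, "text": "Send",
     "bounds": "[10,10][90,40]"},
]}
print(transform(tree))  # verbose bounds and the hidden subtree are gone
```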
https://arxiv.org/abs/2512.13438
Text-to-Image (TTI) models generate images based on text prompts, which often leave certain aspects of the desired image ambiguous. When faced with these ambiguities, TTI models have been shown to exhibit biases in their interpretations. These biases can have societal impacts, e.g., when showing only a certain race for a stated occupation. They can also affect user experience when creating redundancy within a set of generated images instead of spanning diverse possibilities. Here, we introduce MineTheGap - a method for automatically mining prompts that cause a TTI model to generate biased outputs. Our method goes beyond merely detecting bias for a given prompt. Rather, it leverages a genetic algorithm to iteratively refine a pool of prompts, seeking those that expose biases. This optimization process is driven by a novel bias score, which ranks biases according to their severity, as we validate on a dataset with known biases. For a given prompt, this score is obtained by comparing the distribution of generated images to the distribution of LLM-generated texts that constitute variations on the prompt. Code and examples are available on the project's webpage.
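The genetic-algorithm loop described above can be sketched with a toy scoring function; the real bias score compares the distribution of generated images against LLM-generated prompt variations, which requires model access, so the deterministic stand-in below is purely illustrative.

```python
import random

# Hedged sketch of a mutate/score/select loop over a prompt pool.
# bias_score is a toy stand-in, NOT the paper's distribution-based score.

random.seed(0)

def bias_score(prompt):
    """Toy deterministic stand-in for the real image-vs-text distribution score."""
    return sum(len(w) for w in prompt.split()) % 7  # arbitrary but reproducible

def mutate(prompt, vocab):
    """Swap one word for a random vocabulary word."""
    words = prompt.split()
    words[random.randrange(len(words))] = random.choice(vocab)
    return " ".join(words)

vocab = ["doctor", "portrait", "nurse", "engineer", "photo", "ceo"]
pool = ["a photo of a doctor", "a portrait of a ceo"]

for _ in range(20):  # iteratively refine the pool toward high-scoring prompts
    pool += [mutate(p, vocab) for p in pool]
    pool = sorted(set(pool), key=bias_score, reverse=True)[:4]  # selection

print(pool[0], bias_score(pool[0]))
```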
https://arxiv.org/abs/2512.13427
Recent advances in Large Language Models (LLMs) have opened new perspectives for automation in optimization. While several studies have explored how LLMs can generate or solve optimization models, far less is understood about what these models actually learn regarding problem structure or algorithmic behavior. This study investigates how LLMs internally represent combinatorial optimization problems and whether such representations can support downstream decision tasks. We adopt a twofold methodology combining direct querying, which assesses LLM capacity to explicitly extract instance features, with probing analyses that examine whether such information is implicitly encoded within their hidden layers. The probing framework is further extended to a per-instance algorithm selection task, evaluating whether LLM-derived representations can predict the best-performing solver. Experiments span four benchmark problems and three instance representations. Results show that LLMs exhibit moderate ability to recover feature information from problem instances, either through direct querying or probing. Notably, the predictive power of LLM hidden-layer representations proves comparable to that achieved through traditional feature extraction, suggesting that LLMs capture meaningful structural information relevant to optimization performance.
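Probing, as used above, means fitting a small supervised model on frozen hidden-layer activations to test what they encode. A minimal sketch, with synthetic vectors standing in for LLM activations and a nearest-centroid probe replacing the usual linear probe to stay dependency-free:

```python
# Minimal probing sketch: fit a simple probe on (synthetic) hidden-layer
# representations to predict a downstream label, e.g. which solver wins on
# each problem instance. Vectors and labels below are made up; a real probe
# would use actual LLM activations and typically a linear classifier.

def fit_centroids(reps, labels):
    """Average the representation vectors per label."""
    sums, counts = {}, {}
    for vec, lab in zip(reps, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(centroids, vec):
    """Assign the label whose centroid is closest in squared distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: dist2(centroids[lab], vec))

# Synthetic "hidden states": solver-A instances cluster near (1, 0),
# solver-B instances near (0, 1).
reps = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.9]]
labels = ["solver_A", "solver_A", "solver_B", "solver_B"]
probe = fit_centroids(reps, labels)
```

If such a probe predicts the best solver better than chance, the representations encode optimization-relevant structure, which is exactly the comparison the paper draws against traditional feature extraction.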
https://arxiv.org/abs/2512.13374
Large Language Models (LLMs) are prone to memorizing training data, which poses serious privacy risks. Two of the most prominent concerns are training data extraction and Membership Inference Attacks (MIAs). Prior research has shown that these threats are interconnected: adversaries can extract training data from an LLM by querying the model to generate a large volume of text and subsequently applying MIAs to verify whether a particular data point was included in the training set. In this study, we integrate multiple MIA techniques into the data extraction pipeline to systematically benchmark their effectiveness. We then compare their performance in this integrated setting against results from conventional MIA benchmarks, allowing us to evaluate their practical utility in real-world extraction scenarios.
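The extraction-then-MIA pipeline has a simple shape: generate many candidate strings, then rank them with a membership score. The sketch below stubs the model's loss and uses the well-known zlib-calibration heuristic (model loss divided by compressed length) as the MIA scorer; the training set and loss values are assumed purely for illustration.

```python
import zlib

# Sketch of an extraction pipeline with an MIA scorer plugged in: rank
# generated candidates by a membership score and flag the most member-like.
# `model_loss` is a toy stub; a real attack would query the target LLM.

TRAINING_SET = {"the secret access code is 4417"}  # assumed ground truth

def model_loss(text):
    """Toy stand-in: memorized strings get unnaturally low loss."""
    return 0.5 if text in TRAINING_SET else 3.0

def zlib_entropy(text):
    """Compressed length as a cheap proxy for the string's entropy."""
    return len(zlib.compress(text.encode("utf-8")))

def membership_score(text):
    """Lower is more member-like: loss calibrated by compressibility."""
    return model_loss(text) / zlib_entropy(text)

def rank_candidates(candidates, top_k=1):
    return sorted(candidates, key=membership_score)[:top_k]

candidates = [
    "the secret access code is 4417",
    "generic filler sentence about weather",
    "another ordinary generated sentence",
]
flagged = rank_candidates(candidates)
```

The calibration step matters: without dividing by entropy, trivially repetitive strings also get low loss and swamp the ranking, which is one reason different MIA scorers behave differently inside an extraction pipeline than on conventional benchmarks.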
https://arxiv.org/abs/2512.13352
Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both spatial resolution and temporal dimension. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, and then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities and infer underlying user intent, converting inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation, delivering enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.
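The first-last frame strategy implies a particular sub-clip schedule: consecutive high-resolution refinement clips share a boundary keyframe so transitions stay smooth. A minimal sketch of that scheduling step, with frame indices and clip length chosen arbitrarily for illustration:

```python
# Sketch of first-last-frame sub-clip scheduling: blueprint keyframes are
# grouped into sub-clips where each clip's last keyframe is the next clip's
# first, so refinement passes are conditioned on shared boundary frames.
# Clip length and keyframe count are illustrative assumptions.

def split_into_subclips(keyframes, clip_len=4):
    """Consecutive sub-clips share one boundary keyframe."""
    clips = []
    start = 0
    while start < len(keyframes) - 1:
        end = min(start + clip_len - 1, len(keyframes) - 1)
        clips.append(keyframes[start:end + 1])
        start = end  # overlap: this clip's last frame becomes the next first
    return clips

keyframes = list(range(10))  # 10 low-resolution blueprint keyframes
clips = split_into_subclips(keyframes)
```

The shared-boundary invariant is what lets each refined sub-clip be generated independently without visible seams in the assembled long-form video.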
https://arxiv.org/abs/2512.13313
While existing generation and unified models excel at general image generation, they struggle with tasks requiring deep reasoning, planning, and precise data-to-visual mapping abilities beyond general scenarios. To push beyond the existing limitations, we introduce a new and challenging task: creative table visualization, requiring the model to generate an infographic that faithfully and aesthetically visualizes the data from a given table. To address this challenge, we propose ShowTable, a pipeline that synergizes MLLMs with diffusion models via a progressive self-correcting process. The MLLM acts as the central orchestrator, reasoning about the visual plan and judging visual errors to provide refined instructions, while the diffusion model executes the commands from the MLLM, achieving high-fidelity results. To support this task and our pipeline, we introduce three automated data construction pipelines for training different modules. Furthermore, we introduce TableVisBench, a new benchmark with 800 challenging instances across 5 evaluation dimensions, to assess performance on this task. Experiments demonstrate that our pipeline, instantiated with different models, significantly outperforms baselines, highlighting its effective multi-modal reasoning, generation, and error correction capabilities.
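The progressive self-correcting process described above is a plan-render-judge-refine loop. The skeleton below stubs the MLLM planner/judge and the diffusion renderer as plain functions; every function body is an illustrative assumption (real systems would call model APIs at each step).

```python
# Skeleton of a plan -> render -> judge -> refine loop, with the MLLM and
# diffusion model replaced by stubs. "noise" simulates visual errors that
# decrease by one on each refinement round; all logic here is hypothetical.

def plan(table):
    """Stub planner: derive a visual plan from the table."""
    return {"chart": "bar", "columns": list(table[0])}

def render(instruction, noise=0):
    """Stub renderer: returns an 'image' carrying its remaining errors."""
    return {"instruction": instruction, "errors": max(0, noise)}

def judge(image):
    """Stub judge: return refinement feedback while errors remain."""
    return "fix mislabeled axis" if image["errors"] > 0 else None

def generate_infographic(table, max_rounds=3):
    instruction = plan(table)
    image = render(instruction, noise=2)  # first attempt has two errors
    for round_idx in range(max_rounds):
        feedback = judge(image)
        if feedback is None:
            return image, round_idx
        # Re-render with refined instructions; one error fixed per round.
        image = render({**instruction, "refinement": feedback},
                       noise=image["errors"] - 1)
    return image, max_rounds

table = [("city", "sales"), ("Oslo", 10), ("Lima", 7)]
final, rounds = generate_infographic(table)
```

The key design choice the skeleton captures is that the judge's output is fed back as an instruction to the renderer rather than triggering a full restart, which is what makes the correction progressive.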
https://arxiv.org/abs/2512.13303