Robot foundation models are beginning to deliver on the promise of generalist robotic agents, yet progress remains constrained by the scarcity of large-scale real-world manipulation datasets. Simulation and synthetic data generation offer a scalable alternative, but their usefulness is limited by the visual domain gap between simulation and reality. In this work, we present Point Bridge, a framework that leverages unified, domain-agnostic point-based representations to unlock synthetic datasets for zero-shot sim-to-real policy transfer, without explicit visual or object-level alignment. Point Bridge combines automated point-based representation extraction via Vision-Language Models (VLMs), transformer-based policy learning, and efficient inference-time pipelines to train capable real-world manipulation agents using only synthetic data. With additional co-training on small sets of real demonstrations, Point Bridge further improves performance, substantially outperforming prior vision-based sim-and-real co-training methods. It achieves up to 44% gains in zero-shot sim-to-real transfer and up to 66% with limited real data across both single-task and multitask settings. Videos of the robot are best viewed at: this https URL
https://arxiv.org/abs/2601.16212
Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications, yet remain vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions. To address this vulnerability, we propose Feature-space Smoothing (FS) and theoretically prove that FS offers certified robustness on the feature representations of MLLMs. Specifically, FS transforms any feature encoder into a smoothed variant that is guaranteed to maintain a certified lower bound on the feature cosine similarity between clean and adversarial representations under $\ell_2$-bounded attacks. Moreover, we show that the value of this Feature Cosine Similarity Bound (FCSB) derived from FS can be improved by enlarging a defined Gaussian robustness score of the vanilla encoder. Building upon this, we introduce the Purifier and Smoothness Mapper (PSM), a plug-and-play module that improves the Gaussian robustness score of MLLMs and thus enhances their certified robustness under FS, without requiring any retraining of the MLLMs. We demonstrate that FS with PSM not only provides a strong theoretical robustness guarantee but also exhibits superior empirical performance compared to adversarial training. Extensive experiments across diverse MLLMs and downstream tasks demonstrate the effectiveness of FS-PSM, reducing the Attack Success Rate (ASR) of various white-box attacks from nearly 90% to about 1%.
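A minimal sketch of the smoothing idea, assuming a generic `encoder` callable; the noise level, sample count, and Monte Carlo estimator here are illustrative assumptions, not the paper's exact FS construction:

```python
import numpy as np

def smoothed_features(encoder, x, sigma=0.25, n_samples=100, seed=0):
    """Monte Carlo estimate of a smoothed encoder: average the features
    produced under Gaussian input noise, then renormalize to the unit
    sphere so that feature cosine similarity is well defined."""
    rng = np.random.default_rng(seed)
    feats = [encoder(x + sigma * rng.standard_normal(x.shape))
             for _ in range(n_samples)]
    mean = np.mean(feats, axis=0)
    return mean / np.linalg.norm(mean)
```

The certified guarantee then concerns the cosine similarity between `smoothed_features(encoder, x_clean)` and `smoothed_features(encoder, x_adv)` for any $\ell_2$-bounded perturbation.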
https://arxiv.org/abs/2601.16200
State-of-the-art neural theorem provers like DeepSeek-Prover-V1.5 combine large language models with reinforcement learning, achieving impressive results through sophisticated training. We ask: do these highly-trained models still benefit from simple structural guidance at inference time? We evaluate a lightweight intervention -- a fixed prompt schedule over 15 common tactic skeletons -- on the miniF2F benchmark. This simple approach yields 21.7% pass@16 compared to 15.2% for standard sampling from the same model, a 43% relative improvement using the same number of samples (k=16) and same maximum generation length (1024 tokens). Our results suggest that even capable RL-trained provers underutilize structural priors available in the tactic language, and that simple inference-time guidance remains a cheap, complementary boost.
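The fixed prompt schedule can be sketched as cycling a list of tactic skeletons across the k samples; the skeleton strings below are hypothetical placeholders, since the abstract does not list the paper's 15 skeletons:

```python
# Hypothetical tactic skeletons; the paper's actual list of 15 is not given here.
SKELETONS = ["intro", "simp", "nlinarith", "induction n", "omega"]

def prompt_schedule(k=16, skeletons=SKELETONS):
    """Assign sample i the (i mod len(skeletons))-th skeleton, so every
    skeleton is tried once before any skeleton repeats."""
    return [skeletons[i % len(skeletons)] for i in range(k)]
```

With k=16 and 15 skeletons, each skeleton is used once and the first repeats for the final sample.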
https://arxiv.org/abs/2601.16172
As large language models (LLMs) become increasingly common in educational applications, there is a growing need for evidence-based methods to design and evaluate LLM prompts that produce personalized and pedagogically aligned outputs. This study presents a generalizable, systematic approach for evaluating prompts, demonstrated through an analysis of LLM-generated follow-up questions in a structured dialogue activity. Six prompt templates were designed and tested. The templates incorporated established prompt engineering patterns, with each prompt emphasizing distinct pedagogical strategies. The prompt templates were compared through a tournament-style evaluation framework that can be adapted for other educational applications. The tournament employed the Glicko2 rating system with eight judges evaluating question pairs across three dimensions: format, dialogue support, and appropriateness for learners. Data was sourced from 120 authentic user interactions across three distinct educational deployments. Results showed that a single prompt related to strategic reading outperformed other templates with win probabilities ranging from 81% to 100% in pairwise comparisons. This prompt combined persona and context manager patterns and was designed to support metacognitive learning strategies such as self-directed learning. The methodology showcases how educational technology researchers can systematically evaluate and improve prompt designs, moving beyond ad-hoc prompt engineering toward evidence-based prompt development for educational applications.
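The paper uses Glicko2; as a simpler illustration of the same rating-based pairwise-judging mechanic, the classic Elo update is shown below (the K-factor and initial ratings are generic defaults, not values from the study):

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under a logistic rating model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """score_a is 1.0 if A wins the pairwise judgment, 0.0 if A loses,
    0.5 for a tie; returns the updated ratings for both prompts."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b
```

Glicko2 extends this with a rating deviation and volatility per player, which better handles the small number of judged pairs per template.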
https://arxiv.org/abs/2601.16134
Fine-tuning a task-specific multilingual large language model (LLM) involves training the model on a multilingual dataset with examples in all the required languages. Updating one or more supported languages with additional data or adding support for a new language involves retraining the model, which can be computationally inefficient and creates a severe maintenance bottleneck. Recent research on merging multilingual multitask models has shown promise in terms of improved quality, but its computational and maintenance efficiency remains unstudied. In this work, we provide the first focused analysis of this merging strategy from an efficiency perspective, evaluating it across three independent tasks. We demonstrate significant efficiency gains while maintaining parity in terms of quality: this merging approach reduces the initial training time by up to 50%. We also demonstrate that updating an individual language and re-merging as part of model maintenance reduces training costs by more than 60%, compared to re-training the full multilingual model. We show this on both public and proprietary industry datasets, confirming that the approach works well for industrial use cases in addition to the academic settings already studied in previous work.
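The merging operation itself can be sketched as uniform parameter averaging across per-language models, a common merging baseline (the paper's exact merging method is not detailed in the abstract):

```python
import numpy as np

def merge_state_dicts(state_dicts):
    """Uniformly average matching parameter tensors from several
    per-language fine-tuned models into one merged model."""
    keys = state_dicts[0].keys()
    assert all(sd.keys() == keys for sd in state_dicts)
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0) for k in keys}
```

The maintenance saving follows directly: updating one language only requires retraining that language's model and re-running the cheap merge, not retraining on the full multilingual dataset.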
https://arxiv.org/abs/2601.16127
Climate disinformation has become a major challenge in today's digital world, especially with the rise of misleading images and videos shared widely on social media. These false claims are often convincing and difficult to detect, which can delay action on climate change. While vision-language models (VLMs) have been used to identify visual disinformation, they rely only on the knowledge available at the time of training. This limits their ability to reason about recent events or updates. The main goal of this paper is to overcome that limitation by combining VLMs with external knowledge. By retrieving up-to-date information such as reverse image results, online fact-checks, and trusted expert content, the system can better assess whether an image and its claim are accurate, misleading, false, or unverifiable. This approach improves the model's ability to handle real-world climate disinformation and supports efforts to protect public understanding of science in a rapidly changing information landscape.
https://arxiv.org/abs/2601.16108
Large Language Models enable users to access databases through natural-language interfaces, with tools like Text2SQL, Text2SPARQL, and Text2Cypher translating user questions into structured database queries. While these systems improve database accessibility, most research focuses on English with limited multilingual support. This work investigates a scalable multilingual Text2Cypher, aiming to support new languages without re-running full fine-tuning, avoiding manual hyper-parameter tuning, and maintaining performance close to joint multilingual fine-tuning. We train language-specific LoRA adapters for English, Spanish, and Turkish and combine them via uniform linear merging or a learned fusion MLP with dynamic gating. Experimental results show that the fusion MLP recovers around 75% of the accuracy gains from joint multilingual fine-tuning while requiring only a smaller subset of the data, outperforming linear merging across all three languages. This approach enables incremental expansion to new languages by requiring only one LoRA adapter and a lightweight MLP retraining. Learned adapter fusion offers a practical alternative to expensive joint fine-tuning, balancing performance, data efficiency, and scalability for the multilingual Text2Cypher task.
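The dynamic-gating fusion can be sketched as a small gate that produces one logit per adapter from the input and mixes the adapter outputs with the softmax weights; the shapes and the single-matrix gate are illustrative assumptions, not the paper's exact fusion MLP:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fuse_adapters(x, adapter_outputs, gate_w):
    """Dynamic gating: one logit per adapter from the input x, then a
    convex combination of the per-language adapter outputs."""
    weights = softmax(gate_w @ x)        # (n_adapters,)
    stacked = np.stack(adapter_outputs)  # (n_adapters, d)
    return weights @ stacked
```

Adding a language then means training one new adapter and retraining only the lightweight gate, in contrast to uniform linear merging, which fixes all weights at 1/n.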
https://arxiv.org/abs/2601.16097
Large language model (LLM) agents often exhibit abrupt shifts in tone and persona during extended interaction, reflecting the absence of explicit temporal structure governing agent-level state. While prior work emphasizes turn-local sentiment or static emotion classification, the role of explicit affective dynamics in shaping long-horizon agent behavior remains underexplored. This work investigates whether imposing dynamical structure on an external affective state can induce temporal coherence and controlled recovery in multi-turn dialogue. We introduce an agent-level affective subsystem that maintains a continuous Valence-Arousal-Dominance (VAD) state external to the language model and governed by first- and second-order update rules. Instantaneous affective signals are extracted using a fixed, memoryless estimator and integrated over time via exponential smoothing or momentum-based dynamics. The resulting affective state is injected back into generation without modifying model parameters. Using a fixed 25-turn dialogue protocol, we compare stateless, first-order, and second-order affective dynamics. Stateless agents fail to exhibit coherent trajectories or recovery, while state persistence enables delayed responses and reliable recovery. Second-order dynamics introduce affective inertia and hysteresis that increase with momentum, revealing a trade-off between stability and responsiveness.
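The first- and second-order update rules can be sketched as exponential smoothing versus momentum dynamics on the 3-D VAD vector; the coefficients below are hypothetical, chosen only to make the contrast visible:

```python
import numpy as np

def first_order(state, signal, alpha=0.2):
    """Exponential smoothing: move a fraction alpha toward the
    instantaneous affective signal each turn."""
    return (1 - alpha) * state + alpha * signal

class SecondOrder:
    """Momentum dynamics: a velocity term accumulates across turns,
    producing affective inertia, overshoot, and slow recovery."""
    def __init__(self, alpha=0.1, beta=0.9):
        self.alpha, self.beta = alpha, beta
        self.state = np.zeros(3)     # (valence, arousal, dominance)
        self.velocity = np.zeros(3)

    def update(self, signal):
        self.velocity = self.beta * self.velocity + self.alpha * (signal - self.state)
        self.state = self.state + self.velocity
        return self.state
```

With beta near 1 the state overshoots the target and recovers slowly (hysteresis), which mirrors the stability/responsiveness trade-off the experiments report.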
https://arxiv.org/abs/2601.16087
Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may overly attend to image tokens in the task-irrelevant region, which we describe as 'distracting tokens'. This behavior can disturb the model from the generation of the desired action tokens in each step, affecting the success rate of tasks. In this paper, we introduce a simple yet effective plug-and-play Distracting Token Pruning (DTP) framework, which dynamically detects and prunes these distracting image tokens. By correcting the model's visual attention patterns, we aim to improve the task success rate, as well as explore the performance upper boundaries of the model without altering its original architecture or adding additional inputs. Experiments on the SIMPLER Benchmark (Li et al., 2024) show that our method consistently achieves relative improvements in task success rates across different types of novel VLA models, demonstrating generalizability to transformer-based VLAs. Further analysis reveals a negative correlation between the task success rate and the amount of attention in the task-irrelevant region for all models tested, highlighting a common phenomenon of VLA models that could guide future research. We also publish our code at: this https URL.
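The pruning rule can be sketched as dropping image tokens that draw attention above a threshold while lying outside a task-relevant mask; the threshold value and the source of the relevance mask are hypothetical, and the paper's actual detection criterion may differ:

```python
import numpy as np

def prune_distracting_tokens(attn, relevant_mask, tau=0.02):
    """Keep image tokens that are either task-relevant or receive little
    attention; tokens drawing attention >= tau in the task-irrelevant
    region are treated as distracting and pruned."""
    keep = relevant_mask | (attn < tau)
    return np.flatnonzero(keep)
```

Because only the set of visual tokens fed forward changes, the policy's architecture and inputs stay untouched, which is what makes the framework plug-and-play.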
https://arxiv.org/abs/2601.16065
Language-driven dexterous grasp generation requires the models to understand task semantics, 3D geometry, and complex hand-object interactions. While vision-language models have been applied to this problem, existing approaches directly map observations to grasp parameters without intermediate reasoning about physical interactions. We present DextER, Dexterous Grasp Generation with Embodied Reasoning, which introduces contact-based embodied reasoning for multi-finger manipulation. Our key insight is that predicting which hand links contact where on the object surface provides an embodiment-aware intermediate representation bridging task semantics with physical constraints. DextER autoregressively generates embodied contact tokens specifying which finger links contact where on the object surface, followed by grasp tokens encoding the hand configuration. On DexGYS, DextER achieves 67.14% success rate, outperforming state-of-the-art by 3.83%p with 96.4% improvement in intention alignment. We also demonstrate steerable generation through partial contact specification, providing fine-grained control over grasp synthesis.
https://arxiv.org/abs/2601.16046
Large Language Models (LLMs) can aid synthesis planning in chemistry, but standard prompting methods often yield hallucinated or outdated suggestions. We study LLM interactions with a reaction knowledge graph by casting reaction path retrieval as a Text2Cypher (natural language to graph query) generation problem, and define single- and multi-step retrieval tasks. We compare zero-shot prompting to one-shot variants using static, random, and embedding-based exemplar selection, and assess a checklist-driven validator/corrector loop. To evaluate our framework, we consider query validity and retrieval accuracy. We find that one-shot prompting with aligned exemplars consistently performs best. Our checklist-style self-correction loop mainly improves executability in zero-shot settings and offers limited additional retrieval gains once a good exemplar is present. We provide a reproducible Text2Cypher evaluation setup to facilitate further work on KG-grounded LLMs for synthesis planning. Code is available at this https URL.
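The embedding-based exemplar selection can be sketched as picking the stored exemplar whose embedding is most cosine-similar to the query embedding (a standard retrieval pattern; the paper's embedding model is not specified here):

```python
import numpy as np

def select_exemplar(query_emb, exemplar_embs):
    """Return the index of the exemplar most cosine-similar to the query,
    for use as the one-shot example in the Text2Cypher prompt."""
    q = query_emb / np.linalg.norm(query_emb)
    E = exemplar_embs / np.linalg.norm(exemplar_embs, axis=1, keepdims=True)
    return int(np.argmax(E @ q))
```

This matches the finding that an aligned exemplar does most of the work, with the validator/corrector loop adding little once a good exemplar is present.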
https://arxiv.org/abs/2601.16038
High-performance attention kernels are essential for Large Language Models. This paper presents an analysis of CuTile-based Flash Attention memory behavior and a technique to improve its cache performance. In particular, our analysis on the NVIDIA GB10 (Grace Blackwell) identifies the main cause of L2 cache misses. Leveraging this insight, we introduce a new programming technique called Sawtooth Wavefront Reordering that reduces L2 misses. We validate it in both CUDA and CuTile, observing a 50% or greater reduction in L2 misses and up to a 60% increase in throughput on GB10.
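The abstract does not spell out the reordering itself; one plausible reading, sketched here purely as an assumption, is a serpentine ("sawtooth") traversal of the tile grid, which keeps consecutively scheduled tiles adjacent so data cached at the end of one row is reused at the start of the next:

```python
def sawtooth_tile_order(n_rows, n_cols):
    """Serpentine traversal: even rows left-to-right, odd rows
    right-to-left, so the first tile of each row neighbors the last
    tile of the previous row in the cache."""
    order = []
    for r in range(n_rows):
        cols = range(n_cols) if r % 2 == 0 else range(n_cols - 1, -1, -1)
        order.extend((r, c) for c in cols)
    return order
```

Under a row-major order, the tile after (0, n_cols-1) is (1, 0), which shares no column data with its predecessor; the serpentine order avoids that discontinuity.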
https://arxiv.org/abs/2601.16032
The rise of live streaming has transformed online interaction, enabling massive real-time engagement but also exposing platforms to complex risks such as scams and coordinated malicious behaviors. Detecting these risks is challenging because harmful actions often accumulate gradually and recur across seemingly unrelated streams. To address this, we propose CS-VAR (Cross-Session Evidence-Aware Retrieval-Augmented Detector) for live streaming risk assessment. In CS-VAR, a lightweight, domain-specific model performs fast session-level risk inference, guided during training by a Large Language Model (LLM) that reasons over retrieved cross-session behavioral evidence and transfers its local-to-global insights to the small model. This design enables the small model to recognize recurring patterns across streams, perform structured risk assessment, and maintain efficiency for real-time deployment. Extensive offline experiments on large-scale industrial datasets, combined with online validation, demonstrate the state-of-the-art performance of CS-VAR. Furthermore, CS-VAR provides interpretable, localized signals that effectively empower real-world moderation for live streaming.
https://arxiv.org/abs/2601.16027
This paper presents Mecellem models, a framework for developing specialized language models for the Turkish legal domain through domain adaptation strategies. We make two contributions: (1) Encoder Model Pre-trained from Scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens. We implement a checkpoint selection strategy that evaluates downstream retrieval performance throughout training, revealing that optimal checkpoints achieve best retrieval scores before pre-training loss reaches its minimum. Our encoder models achieve top-3 rankings on the Turkish retrieval leaderboard, with smaller models (155M parameters) achieving comparable performance to larger reference models (307M-567M parameters). Our approach achieves 92.36% production efficiency compared to state-of-the-art models (embeddinggemma-300m: 100.00%, BAAI/bge-m3: 99.54%, newmindai/bge-m3-stsb: 94.38%), ranking fourth overall despite requiring less computational resources. SOTA models rely on multi-stage, computationally intensive training pipelines, making our single-stage pre-training followed by efficient post-training approach a cost-effective alternative; (2) Decoder Model with Continual Pre-training (CPT): Qwen3-1.7B and Qwen3-4B models adapted to the Turkish legal domain through controlled curriculum learning. Four-phase CPT with optimal sample ratios enables gradual transition from general language knowledge to specialized legal terminology and long-context reasoning. This approach achieves 36.2% perplexity reduction on Turkish legal text, demonstrating domain adaptation gains.
https://arxiv.org/abs/2601.16018
Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning, but their grasp of the underlying physics remains underexplored. Existing benchmarks that attempt to measure this capability rely on synthetic Visual Question Answering templates or focus on perceptual video quality, which is tangential to measuring how well the video abides by physical laws. To address this fragmentation, we introduce PhysicsMind, a unified benchmark with both real and simulation environments that evaluates law-consistent reasoning and generation over three canonical principles: Center of Mass, Lever Equilibrium, and Newton's First Law. PhysicsMind comprises two main tasks: i) VQA tasks, testing whether models can reason and determine physical quantities and values from images or short videos, and ii) Video Generation (VG) tasks, evaluating if predicted motion trajectories obey the same center-of-mass, torque, and inertial constraints as the ground truth. A broad range of recent models and video generation models is evaluated on PhysicsMind and found to rely on appearance heuristics while often violating basic mechanics. These gaps indicate that current scaling and training are still insufficient for robust physical understanding, underscoring PhysicsMind as a focused testbed for physics-aware multimodal models. Our data will be released upon acceptance.
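The center-of-mass principle behind the first canonical task is just the mass-weighted mean position; the snippet below is the generic physics formula, not the benchmark's actual scoring code:

```python
import numpy as np

def center_of_mass(masses, positions):
    """Mass-weighted mean position: sum(m_i * x_i) / sum(m_i)."""
    m = np.asarray(masses, dtype=float)
    x = np.asarray(positions, dtype=float)
    return (m[:, None] * x).sum(axis=0) / m.sum()
```

A generated trajectory is law-consistent for this principle when the trajectory of the computed center of mass matches the ground truth within tolerance.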
https://arxiv.org/abs/2601.16007
Robots that follow natural-language instructions often either plan at a high level using hand-designed interfaces or rely on large end-to-end models that are difficult to deploy for real-time control. We propose TeNet (Text-to-Network), a framework for instantiating compact, task-specific robot policies directly from natural language descriptions. TeNet conditions a hypernetwork on text embeddings produced by a pretrained large language model (LLM) to generate a fully executable policy, which then operates solely on low-dimensional state inputs at high control frequencies. By using the language only once at policy instantiation time, TeNet inherits the general knowledge and paraphrasing robustness of pretrained LLMs while remaining lightweight and efficient at execution time. To improve generalization, we optionally ground language in behavior during training by aligning text embeddings with demonstrated actions, while requiring no demonstrations at inference time. Experiments on MuJoCo and Meta-World benchmarks show that TeNet produces policies that are orders of magnitude smaller than sequence-based baselines, while achieving strong performance in both multi-task and meta-learning settings and supporting high-frequency control. These results show that text-conditioned hypernetworks offer a practical way to build compact, language-driven controllers for resource-constrained robot control tasks with real-time requirements.
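The instantiation step can be sketched as a hypernetwork that maps a text embedding once to the weights of a small state-based policy; all dimensions and the single-matrix linear hypernetwork below are illustrative assumptions, not TeNet's actual architecture:

```python
import numpy as np

class TextToPolicy:
    """Toy linear hypernetwork: one matrix maps a text embedding to the
    flattened weights of a tiny two-layer policy over state inputs."""
    def __init__(self, emb_dim, state_dim, hidden, action_dim, seed=0):
        self.shapes = [(hidden, state_dim), (hidden,),
                       (action_dim, hidden), (action_dim,)]
        n_params = sum(int(np.prod(s)) for s in self.shapes)
        rng = np.random.default_rng(seed)
        self.H = rng.standard_normal((n_params, emb_dim)) * 0.1

    def instantiate(self, text_emb):
        flat = self.H @ text_emb  # language is used once, here
        params, i = [], 0
        for s in self.shapes:
            n = int(np.prod(s))
            params.append(flat[i:i + n].reshape(s))
            i += n
        W1, b1, W2, b2 = params
        # The returned policy is text-free: it only sees low-dim state.
        return lambda state: W2 @ np.tanh(W1 @ state + b1) + b2
```

After instantiation, the control loop runs the returned closure alone, which is why high-frequency control is possible without the LLM in the loop.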
https://arxiv.org/abs/2601.15912
Diffusion-based language models (DLLMs) offer non-sequential, block-wise generation and richer data reuse compared to autoregressive (AR) models, but existing code DLLMs still lag behind strong AR baselines under comparable budgets. We revisit this setting in a controlled study and introduce Stable-DiffCoder, a block diffusion code model that reuses the Seed-Coder architecture, data, and training pipeline. To enable efficient knowledge learning and stable training, we incorporate a block diffusion continual pretraining (CPT) stage enhanced by a tailored warmup and block-wise clipped noise schedule. Under the same data and architecture, Stable-DiffCoder overall outperforms its AR counterpart on a broad suite of code benchmarks. Moreover, relying only on the CPT and supervised fine-tuning stages, Stable-DiffCoder achieves stronger performance than a wide range of ~8B ARs and DLLMs, demonstrating that diffusion-based training can improve code modeling quality beyond AR training alone. Finally, diffusion-based any-order modeling improves structured code modeling for editing and reasoning, and through data augmentation, benefits low-resource coding languages.
https://arxiv.org/abs/2601.15892
Recent advances in medical vision-language models guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image-text data, raising the question of whether robust radiology encoders can be learned without relying on language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns without language supervision. Pre-trained solely on unlabeled chest X-ray images, the model learns to predict latent representations of masked image regions. This predictive objective differs fundamentally from both image-text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models latent-space prediction. We evaluate the learned encoder on disease classification, semantic segmentation, and report generation tasks. Across benchmarks, RadJEPA achieves performance exceeding state-of-the-art approaches, including Rad-DINO.
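The predictive objective can be sketched as a regression loss between predicted and target latents restricted to the masked regions; this is a generic JEPA-style loss, not the paper's exact implementation:

```python
import numpy as np

def masked_latent_loss(pred, target, mask):
    """Mean squared error between predicted and target latent vectors,
    computed over the masked patches only."""
    diff = (pred - target) ** 2      # (n_patches, latent_dim)
    per_patch = diff.mean(axis=1)
    return per_patch[mask].mean()
```

Because the target lives in latent space rather than pixel space, the encoder is pushed toward semantic structure instead of low-level reconstruction detail.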
https://arxiv.org/abs/2601.15891
Analog mixed-signal circuit sizing involves complex trade-offs within high-dimensional design spaces. Existing automatic analog circuit sizing approaches rely solely on netlists, ignoring the circuit schematic, which hinders the cognitive link between the schematic and its performance. Furthermore, the black-box nature of machine learning methods and hallucination risks in large language models fail to provide the necessary ground-truth explainability required for industrial sign-off. To address these challenges, we propose a Vision Language Model-optimized collaborative agent design workflow (VLM-CAD), which analyzes circuits, optimizes DC operating points, performs inference-based sizing, and executes external sizing optimization. We integrate Image2Net to annotate circuit schematics and generate a structured JSON description for precise interpretation by Vision Language Models. Furthermore, we propose an Explainable Trust Region Bayesian Optimization method (ExTuRBO) that employs collaborative warm-start from agent-generated seeds and offers dual-granularity sensitivity analysis for external sizing optimization, supporting a comprehensive final design report. Experiment results on amplifier sizing tasks using 180nm, 90nm, and 45nm Predictive Technology Models demonstrate that VLM-CAD effectively balances power and performance while maintaining physics-based explainability. VLM-CAD meets all specification requirements while maintaining low power consumption in optimizing an amplifier with a complementary input and a class-AB output stage, with a total runtime under 66 minutes across all experiments on two amplifiers.
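The collaborative warm start in ExTuRBO can be sketched as centering the initial trust region on the best agent-generated seed; the region width and the normalized [0, 1] design space below are hypothetical:

```python
import numpy as np

def init_trust_region(seeds, scores, width=0.2):
    """Center a box trust region on the best-scoring agent seed,
    clipping the box to the normalized design space [0, 1]^d."""
    best = np.asarray(seeds)[int(np.argmax(scores))]
    lo = np.clip(best - width / 2, 0.0, 1.0)
    hi = np.clip(best + width / 2, 0.0, 1.0)
    return lo, hi
```

Bayesian optimization then proposes candidates only inside this box, so good agent seeds shrink the effective search space before any circuit simulation budget is spent.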
https://arxiv.org/abs/2601.07315
Code completion has become a central task, gaining significant attention with the rise of large language model (LLM)-based tools in software engineering. Although recent advances have greatly improved LLMs' code completion abilities, evaluation methods have not advanced equally. Most current benchmarks focus solely on functional correctness of code completions based on given context, overlooking models' ability to follow user instructions during completion, a common scenario in LLM-assisted programming. To address this limitation, we present the first instruction-guided code completion benchmark, Controllable Code Completion Benchmark (C3-Bench), comprising 2,195 carefully designed completion tasks. Through comprehensive evaluation of over 40 mainstream LLMs across C3-Bench and conventional benchmarks, we reveal substantial gaps in instruction-following capabilities between open-source and advanced proprietary models during code completion tasks. Moreover, we develop a straightforward data synthesis pipeline that leverages Qwen2.5-Coder to generate high-quality instruction-completion pairs for supervised fine-tuning (SFT). The resulting model, Qwen2.5-Coder-C3, achieves state-of-the-art performance on C3-Bench. Our findings provide valuable insights for enhancing LLMs' code completion and instruction-following capabilities, establishing new directions for future research in code LLMs. To facilitate reproducibility and foster further research in code LLMs, we open-source all code, datasets, and models.
https://arxiv.org/abs/2601.15879