How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We instead perform reinforcement learning at test time, so the LLM can continue to train, now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős's minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, at a cost of only a few hundred dollars per problem.
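The core loop the abstract describes — generate candidates, score them on the test problem, and keep training the policy on the most promising ones while tracking the single best solution — can be sketched as follows. All hooks (`generate`, `reward`, `policy_update`) and the batch/top-k parameters are hypothetical placeholders, not the paper's actual API:

```python
def ttt_discover(generate, reward, policy_update, steps=50, batch=8, top_k=2):
    """Minimal test-time training loop in the spirit of TTT-Discover:
    sample candidates, score them, reinforce on the most promising ones,
    and return the single best solution found."""
    best, best_reward = None, float("-inf")
    for _ in range(steps):
        candidates = [generate() for _ in range(batch)]
        scored = sorted(((reward(c), c) for c in candidates),
                        key=lambda rc: rc[0], reverse=True)
        # the learning signal prioritizes the most promising solutions,
        # matching the goal of one great solution over many good ones
        top = scored[:top_k]
        policy_update([c for _, c in top], [r for r, _ in top])
        if scored[0][0] > best_reward:
            best_reward, best = scored[0]
    return best, best_reward
```

On a toy continuous-reward problem where `policy_update` shifts a sampling distribution toward the top candidates, the loop steadily improves the best solution it has seen.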
https://arxiv.org/abs/2601.16175
As large language models (LLMs) become increasingly common in educational applications, there is a growing need for evidence-based methods to design and evaluate LLM prompts that produce personalized and pedagogically aligned outputs. This study presents a generalizable, systematic approach for evaluating prompts, demonstrated through an analysis of LLM-generated follow-up questions in a structured dialogue activity. Six prompt templates were designed and tested. The templates incorporated established prompt engineering patterns, with each prompt emphasizing distinct pedagogical strategies. The prompt templates were compared through a tournament-style evaluation framework that can be adapted for other educational applications. The tournament employed the Glicko2 rating system with eight judges evaluating question pairs across three dimensions: format, dialogue support, and appropriateness for learners. Data were sourced from 120 authentic user interactions across three distinct educational deployments. Results showed that a single prompt related to strategic reading outperformed the other templates, with win probabilities ranging from 81% to 100% in pairwise comparisons. This prompt combined persona and context manager patterns and was designed to support metacognitive learning strategies such as self-directed learning. The methodology showcases how educational technology researchers can systematically evaluate and improve prompt designs, moving beyond ad-hoc prompt engineering toward evidence-based prompt development for educational applications.
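The pairwise win probabilities the study reports fall out of the rating system's expected-score formula. A minimal sketch, using the classic Glicko expected-score form on the familiar 1500-point scale (Glicko2's internal scale maps onto it); the concrete ratings below are illustrative, not the study's:

```python
import math

Q = math.log(10) / 400  # Glicko scale constant

def g(rd):
    """Damping factor: a higher rating deviation makes outcomes less predictable."""
    return 1 / math.sqrt(1 + 3 * Q * Q * rd * rd / math.pi ** 2)

def win_probability(r_a, rd_a, r_b, rd_b):
    """Expected score of prompt A against prompt B, given each prompt's
    rating r and rating deviation rd accumulated over judged pairings."""
    combined_rd = math.sqrt(rd_a ** 2 + rd_b ** 2)
    return 1 / (1 + 10 ** (-g(combined_rd) * (r_a - r_b) / 400))
```

Two equally rated prompts sit at a 50% win probability; a prompt rated a few hundred points higher, as the strategic-reading template apparently was, dominates its pairwise comparisons.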
https://arxiv.org/abs/2601.16134
Large language model (LLM) agents often exhibit abrupt shifts in tone and persona during extended interaction, reflecting the absence of explicit temporal structure governing agent-level state. While prior work emphasizes turn-local sentiment or static emotion classification, the role of explicit affective dynamics in shaping long-horizon agent behavior remains underexplored. This work investigates whether imposing dynamical structure on an external affective state can induce temporal coherence and controlled recovery in multi-turn dialogue. We introduce an agent-level affective subsystem that maintains a continuous Valence-Arousal-Dominance (VAD) state external to the language model and governed by first- and second-order update rules. Instantaneous affective signals are extracted using a fixed, memoryless estimator and integrated over time via exponential smoothing or momentum-based dynamics. The resulting affective state is injected back into generation without modifying model parameters. Using a fixed 25-turn dialogue protocol, we compare stateless, first-order, and second-order affective dynamics. Stateless agents fail to exhibit coherent trajectories or recovery, while state persistence enables delayed responses and reliable recovery. Second-order dynamics introduce affective inertia and hysteresis that increase with momentum, revealing a trade-off between stability and responsiveness.
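The first- and second-order update rules can be made concrete with a small external state holder. This is a sketch: the coefficient values and class shape are illustrative, not the paper's tuned configuration.

```python
class AffectiveState:
    """Continuous Valence-Arousal-Dominance state kept outside the LLM."""

    def __init__(self, alpha=0.3, beta=0.8, second_order=False):
        self.state = [0.0, 0.0, 0.0]      # (valence, arousal, dominance)
        self.velocity = [0.0, 0.0, 0.0]   # used only by second-order dynamics
        self.alpha = alpha                # pull toward the instantaneous estimate
        self.beta = beta                  # momentum coefficient (affective inertia)
        self.second_order = second_order

    def update(self, estimate):
        """Integrate one instantaneous VAD estimate from the memoryless extractor."""
        if self.second_order:
            # second-order: velocity accumulates the error, producing
            # inertia and hysteresis that grow with the momentum beta
            self.velocity = [self.beta * v + self.alpha * (e - s)
                             for v, e, s in zip(self.velocity, estimate, self.state)]
            self.state = [s + v for s, v in zip(self.state, self.velocity)]
        else:
            # first-order: plain exponential smoothing
            self.state = [(1 - self.alpha) * s + self.alpha * e
                          for s, e in zip(self.state, estimate)]
        return list(self.state)
```

Driving both variants with a constant estimate shows the trade-off the abstract describes: the first-order state approaches the target monotonically, while the second-order state overshoots before settling, i.e. inertia at the cost of responsiveness.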
https://arxiv.org/abs/2601.16087
Refusal behavior in aligned LLMs is often viewed as model-specific, yet we hypothesize it stems from a universal, low-dimensional semantic circuit shared across models. To test this, we introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models, spanning diverse architectures (e.g., Dense to MoE) and training regimes, without using target-side refusal supervision. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared "recipe" of concept atoms, we map the donor's ablation trajectory into the target's semantic space. To preserve capabilities, we introduce a weight-SVD stability guard that projects interventions away from high-variance weight subspaces to prevent collateral damage. Our evaluation across 8 model pairs (including GPT-OSS-20B and GLM-4) confirms that these transferred recipes consistently attenuate refusal while maintaining performance, providing strong evidence for the semantic universality of safety alignment.
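The elementary operation being transferred here is directional ablation: removing a hidden state's component along a refusal direction. A minimal sketch of that single step (the paper's contribution is reconstructing the direction in the target's space; the vectors below are illustrative):

```python
import math

def ablate_direction(h, d):
    """Project hidden state h off direction d, returning the component of h
    orthogonal to d. The refusal direction d is assumed to have been
    reconstructed in the target model's semantic space beforehand."""
    norm = math.sqrt(sum(x * x for x in d))
    u = [x / norm for x in d]                     # unit refusal direction
    coef = sum(x * y for x, y in zip(h, u))       # component of h along u
    return [x - coef * y for x, y in zip(h, u)]   # h minus its projection
```

The output is orthogonal to the ablated direction by construction, which is what makes the intervention surgical rather than a blanket perturbation.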
https://arxiv.org/abs/2601.16034
Understanding where and how emotions are represented in large-scale foundation models remains an open problem, particularly in multimodal affective settings. Despite the strong empirical performance of recent affective models, the internal architectural mechanisms that support affective understanding and generation are still poorly understood. In this work, we present a systematic mechanistic study of affective modeling in multimodal foundation models. Across multiple architectures, training strategies, and affective tasks, we analyze how emotion-oriented supervision reshapes internal model parameters. Our results consistently reveal a clear and robust pattern: affective adaptation does not primarily focus on the attention module, but instead localizes to the feed-forward gating projection (\texttt{gate\_proj}). Through controlled module transfer, targeted single-module adaptation, and destructive ablation, we further demonstrate that \texttt{gate\_proj} is sufficient, efficient, and necessary for affective understanding and generation. Notably, by tuning only approximately 24.5\% of the parameters tuned by AffectGPT, our approach achieves 96.6\% of its average performance across eight affective tasks, highlighting substantial parameter efficiency. Together, these findings provide empirical evidence that affective capabilities in foundation models are structurally mediated by feed-forward gating mechanisms and identify \texttt{gate\_proj} as a central architectural locus of affective modeling.
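The parameter-efficiency claim rests on restricting adaptation to one module family. A small sketch of how such a budget is computed, assuming parameter names follow the common `layers.N.mlp.gate_proj.weight` convention (the toy sizes below are illustrative):

```python
def gate_proj_budget(named_param_sizes):
    """Given (name, parameter_count) pairs, report how many parameters are
    touched if tuning is restricted to feed-forward gating projections,
    and what fraction of the total that represents."""
    tuned = sum(n for name, n in named_param_sizes if "gate_proj" in name)
    total = sum(n for _, n in named_param_sizes)
    return tuned, tuned / total
```

In a targeted single-module adaptation run, everything outside the selected fraction would be frozen; the abstract reports that a budget of roughly 24.5% of AffectGPT's tuned parameters recovers 96.6% of its average performance.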
https://arxiv.org/abs/2601.15906
This study investigates whether professional translators can reliably identify short stories generated in Italian by artificial intelligence (AI) without prior specialized training. Sixty-nine translators took part in an in-person experiment, where they assessed three anonymized short stories - two written by ChatGPT-4o and one by a human author. For each story, participants rated the likelihood of AI authorship and provided justifications for their choices. While average results were inconclusive, a statistically significant subset (16.2%) successfully distinguished the synthetic texts from the human text, suggesting that their judgements were informed by analytical skill rather than chance. However, a nearly equal number misclassified the texts in the opposite direction, often relying on subjective impressions rather than objective markers, possibly reflecting a reader preference for AI-generated texts. Low burstiness and narrative contradiction emerged as the most reliable indicators of synthetic authorship, with unexpected calques, semantic loans and syntactic transfer from English also reported. In contrast, features such as grammatical accuracy and emotional tone frequently led to misclassification. These findings raise questions about the role and scope of synthetic-text editing in professional contexts.
https://arxiv.org/abs/2601.15828
Although artificial intelligence (AI) has become deeply integrated into various stages of the research workflow and achieved remarkable advancements, academic rebuttal remains a significant and underexplored challenge. This is because rebuttal is a complex process of strategic communication under severe information asymmetry rather than a simple technical debate. Consequently, current approaches struggle as they largely imitate surface-level linguistics, missing the essential element of perspective-taking required for effective persuasion. In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) pipeline that models reviewer mental state, formulates persuasion strategy, and generates strategy-grounded response. To train our agent, we construct RebuttalBench, a large-scale dataset synthesized via a novel critique-and-refine approach. Our training process consists of two stages, beginning with a supervised fine-tuning phase to equip the agent with ToM-based analysis and strategic planning capabilities, followed by a reinforcement learning phase leveraging the self-reward mechanism for scalable self-improvement. For reliable and efficient automated evaluation, we further develop Rebuttal-RM, a specialized evaluator trained on over 100K samples of multi-source rebuttal data, which achieves scoring consistency with human preferences that surpasses the powerful GPT-4.1 judge. Extensive experiments show RebuttalAgent significantly outperforms the base model by an average of 18.3% on automated metrics, while also outperforming advanced proprietary models across both automated and human evaluations. Disclaimer: the generated rebuttal content is for reference only to inspire authors and assist in drafting. It is not intended to replace the author's own critical analysis and response.
https://arxiv.org/abs/2601.15715
We propose Zero-Error Horizon (ZEH) for trustworthy LLMs, which represents the maximum range that a model can solve without any errors. While ZEH itself is simple, we demonstrate that evaluating the ZEH of state-of-the-art LLMs yields abundant insights. For example, by evaluating the ZEH of GPT-5.2, we found that GPT-5.2 cannot even compute the parity of a short string like 11000, and GPT-5.2 cannot determine whether the parentheses in ((((()))))) are balanced. This is surprising given the excellent capabilities of GPT-5.2. The fact that LLMs make mistakes on such simple problems serves as an important lesson when applying LLMs to safety-critical domains. By applying ZEH to Qwen2.5 and conducting detailed analysis, we found that while ZEH correlates with accuracy, the detailed behaviors differ, and ZEH provides clues about the emergence of algorithmic capabilities. Finally, while computing ZEH incurs significant computational cost, we discuss how to mitigate this cost by achieving up to one order of magnitude speedup using tree structures and online softmax.
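The ZEH idea has a simple operational reading: sweep problem sizes upward and stop at the first size where the model makes any mistake. A sketch under that reading (without the paper's tree-structure and online-softmax speedups), using the parity task from the abstract; the sampling protocol and trial count are illustrative:

```python
import random

def zero_error_horizon(solver, make_instance, max_n=64, trials=20, seed=0):
    """Return the largest problem size n such that `solver` answers every
    sampled instance of size <= n correctly. The first error caps the horizon."""
    rng = random.Random(seed)
    horizon = 0
    for n in range(1, max_n + 1):
        for _ in range(trials):
            problem, answer = make_instance(n, rng)
            if solver(problem) != answer:
                return horizon
        horizon = n
    return horizon

# The parity task from the abstract: is the count of 1s in a bit-string even or odd?
def parity_instance(n, rng):
    bits = "".join(rng.choice("01") for _ in range(n))
    return bits, bits.count("1") % 2
```

A solver that is always correct attains the full horizon, while a solver that breaks at length 6 has a horizon of exactly 5 regardless of how well it does on longer strings on average — the property that distinguishes ZEH from accuracy.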
https://arxiv.org/abs/2601.15714
Fine-grained attribute prediction is essential for fashion retail applications including catalog enrichment, visual search, and recommendation systems. Vision-Language Models (VLMs) offer zero-shot prediction without task-specific training, yet their systematic evaluation on multi-attribute fashion tasks remains underexplored. A key challenge is that fashion attributes are often conditional. For example, "outer fabric" is undefined when no outer garment is visible. This requires models to detect attribute applicability before attempting classification. We introduce a three-tier evaluation framework that decomposes this challenge: (1) overall task performance across all classes (including NA class: suggesting attribute is not applicable) for all attributes, (2) attribute applicability detection, and (3) fine-grained classification when attributes are determinable. Using DeepFashion-MultiModal, which explicitly defines NA (meaning attribute doesn't exist or is not visible) within attribute label spaces, we benchmark nine VLMs spanning flagship (GPT-5, Gemini 2.5 Pro), efficient (GPT-5 Mini, Gemini 2.5 Flash), and ultra-efficient tiers (GPT-5 Nano, Gemini 2.5 Flash-Lite) against classifiers trained on pretrained Fashion-CLIP embeddings on 5,000 images across 18 attributes. Our findings reveal that: (1) zero-shot VLMs achieve 64.0% macro-F1, a threefold improvement over logistic regression on pretrained Fashion-CLIP embeddings; (2) VLMs excel at fine-grained classification (Tier 3: 70.8% F1) but struggle with applicability detection (Tier 2: 34.1% NA-F1), identifying a key bottleneck; (3) efficient models achieve over 90% of flagship performance at lower cost, offering practical deployment paths. This diagnostic framework enables practitioners to pinpoint whether errors stem from visibility detection or classification, guiding targeted improvements for production systems.
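The three tiers decompose naturally into three scores over the same predictions. An illustrative implementation of that decomposition (not the authors' exact evaluation code; the label values below are made up):

```python
def f1(tp, fp, fn):
    """Harmonic mean of precision and recall, with empty-case guards."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    if not classes:
        return 0.0
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)

def three_tier_report(y_true, y_pred, na="NA"):
    """Tier 1: macro-F1 over all labels including NA. Tier 2: F1 of the NA
    class itself (applicability detection). Tier 3: macro-F1 restricted to
    pairs whose true label is applicable."""
    tp = sum(t == na and p == na for t, p in zip(y_true, y_pred))
    fp = sum(t != na and p == na for t, p in zip(y_true, y_pred))
    fn = sum(t == na and p != na for t, p in zip(y_true, y_pred))
    kept = [(t, p) for t, p in zip(y_true, y_pred) if t != na]
    return {
        "tier1": macro_f1(y_true, y_pred),
        "tier2_na_f1": f1(tp, fp, fn),
        "tier3": macro_f1([t for t, _ in kept], [p for _, p in kept]),
    }
```

This separation is what lets the paper localize the bottleneck: a model can score well on Tier 3 while failing Tier 2, exactly the VLM pattern reported (70.8% vs 34.1%).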
https://arxiv.org/abs/2601.15711
The rapid advancement of Multimodal Large Language Models (MLLMs) has introduced complex security challenges, particularly at the intersection of textual and visual safety. While existing schemes have explored the security vulnerabilities of MLLMs, the investigation into their visual safety boundaries remains insufficient. In this paper, we propose Beyond Visual Safety (BVS), a novel image-text pair jailbreaking framework specifically designed to probe the visual safety boundaries of MLLMs. BVS employs a "reconstruction-then-generation" strategy, leveraging neutralized visual splicing and inductive recomposition to decouple malicious intent from raw inputs, thereby leading MLLMs to be induced into generating harmful images. Experimental results demonstrate that BVS achieves a remarkable jailbreak success rate of 98.21\% against GPT-5 (12 January 2026 release). Our findings expose critical vulnerabilities in the visual safety alignment of current MLLMs.
https://arxiv.org/abs/2601.15698
Real-time understanding of long video streams remains challenging for vision-language models (VLMs) due to redundant frame processing and rapid forgetting of past context. Existing streaming systems rely on fixed-interval decoding or cache pruning, which either produce repetitive outputs or discard crucial temporal information. We introduce Event-VStream, an event-aware framework that represents continuous video as a sequence of discrete, semantically coherent events. Our system detects meaningful state transitions by integrating motion, semantic, and predictive cues, and triggers language generation only at those boundaries. Each event embedding is consolidated into a persistent memory bank, enabling long-horizon reasoning while maintaining low latency. Across OVOBench-Realtime and long-form Ego4D evaluations, Event-VStream achieves competitive performance. It improves over a VideoLLM-Online-8B baseline by +10.4 points on OVOBench-Realtime, achieves performance close to Flash-VStream-7B despite using only a general-purpose LLaMA-3-8B text backbone, and maintains around a 70% GPT-5 win rate on 2-hour Ego4D streams.
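The boundary trigger can be sketched as a weighted fusion of the three cue types the abstract names. This is a minimal sketch: the feature keys, cosine-distance cues, weights, and threshold are all assumptions, not Event-VStream's actual design.

```python
import math

def cosine(a, b):
    """Cosine similarity with a neutral fallback for zero vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 1.0

def is_event_boundary(prev, cur, predicted, w=(0.4, 0.4, 0.2), tau=0.5):
    """Fuse motion change, semantic change, and prediction error (surprise)
    into one boundary score; language generation fires only above tau."""
    motion = 1 - cosine(prev["motion"], cur["motion"])
    semantic = 1 - cosine(prev["embed"], cur["embed"])
    surprise = 1 - cosine(predicted, cur["embed"])   # predictive cue
    score = w[0] * motion + w[1] * semantic + w[2] * surprise
    return score > tau, score
```

Frames within a stable event produce near-zero scores and no decoding, which is what eliminates the repetitive outputs of fixed-interval systems; a genuine state transition drives all three cues up at once.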
https://arxiv.org/abs/2601.15655
Whether Large Language Models (LLMs) truly possess human-like Theory of Mind (ToM) capabilities has garnered increasing attention. However, existing benchmarks remain largely restricted to narrow paradigms like false belief tasks, failing to capture the full spectrum of human cognitive mechanisms. We introduce CogToM, a comprehensive, theoretically grounded benchmark comprising over 8000 bilingual instances across 46 paradigms, validated by 49 human annotators. A systematic evaluation of 22 representative models, including frontier models like GPT-5.1 and Qwen3-Max, reveals significant performance heterogeneities and highlights persistent bottlenecks in specific dimensions. Further analysis based on human cognitive patterns suggests potential divergences between LLM and human cognitive structures. CogToM offers a robust instrument and perspective for investigating the evolving cognitive boundaries of LLMs.
https://arxiv.org/abs/2601.15628
The rapid growth of live-streaming platforms such as Twitch has introduced complex challenges in moderating toxic behavior. Traditional moderation approaches, such as human annotation and keyword-based filtering, have demonstrated utility, but human moderators on Twitch constantly struggle to scale effectively in the fast-paced, high-volume, and context-rich chat environment of the platform while also facing harassment themselves. Recent advances in large language models (LLMs), such as DeepSeek-R1-Distill and Llama-3-8B-Instruct, offer new opportunities for toxicity detection, especially in understanding nuanced, multimodal communication involving emotes. In this work, we present an exploratory comparison of toxicity detection approaches tailored to Twitch. Our analysis reveals that incorporating emotes improves the detection of toxic behavior. To this end, we introduce ToxiTwitch, a hybrid model that combines LLM-generated embeddings of text and emotes with traditional machine learning classifiers, including Random Forest and SVM. In our case study, the proposed hybrid approach reaches up to 80 percent accuracy under channel-specific training (a 13 percent improvement over BERT, with an F1-score of 76 percent). This work is an exploratory study intended to surface challenges and limits of emote-aware toxicity detection on Twitch.
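The hybrid recipe — embed each chat message (text plus explicit emote features), then train a classical classifier on the embeddings — can be sketched as below. The `embed` function is a hashed-feature stand-in for the LLM embeddings, and the emote vocabulary and labels are made up for illustration; only the Random Forest stage mirrors the described pipeline.

```python
import hashlib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

EMOTES = ["Kappa", "LUL", "ResidentSleeper", "KEKW"]  # illustrative emote vocab

def embed(message: str, dim: int = 32) -> np.ndarray:
    """Stand-in for an LLM embedding: hashed token counts with explicit
    emote counts appended, so emote usage stays visible to the classifier."""
    vec = np.zeros(dim)
    for tok in message.split():
        vec[int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim] += 1.0
    emote_counts = np.array([message.count(e) for e in EMOTES], dtype=float)
    return np.concatenate([vec, emote_counts])

def train_hybrid(messages, labels):
    """Fit a traditional classifier on top of the message embeddings."""
    X = np.stack([embed(m) for m in messages])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, labels)
    return clf
```

Swapping `RandomForestClassifier` for `sklearn.svm.SVC` gives the SVM variant; channel-specific training corresponds to fitting one such model per channel's chat history.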
https://arxiv.org/abs/2601.15605
Prompting is central to interaction with AI systems, yet many users struggle to explore alternative directions, articulate creative intent, or understand how variations in prompts shape model outputs. We introduce prompt recommender systems (PRS) as an interaction approach that supports exploration, suggesting contextually relevant follow-up prompts. We present PromptHelper, a PRS prototype integrated into an AI chatbot that surfaces semantically diverse prompt suggestions while users work on real writing tasks. We evaluate PromptHelper in a 2x2 fully within-subjects study (N=32) across creative and academic writing tasks. Results show that PromptHelper significantly increases users' perceived exploration and expressiveness without increasing cognitive workload. Qualitative findings illustrate how prompt recommendations help users branch into new directions, overcome uncertainty about what to ask next, and better articulate their intent. We discuss implications for designing AI interfaces that scaffold exploratory interaction while preserving user agency, and release open-source resources to support research on prompt recommendation.
https://arxiv.org/abs/2601.15575
Personalized learning systems have emerged as a promising approach to enhance student outcomes by tailoring educational content, pacing, and feedback to individual needs. However, most existing systems remain fragmented, specializing in either knowledge tracing, diagnostic modeling, or resource recommendation, but rarely integrating these components into a cohesive adaptive cycle. In this paper, we propose ALIGNAgent (Adaptive Learner Intelligence for Gap Identification and Next-step guidance), a multi-agent educational framework designed to deliver personalized learning through integrated knowledge estimation, skill-gap identification, and targeted resource recommendation. The framework begins by processing student quiz performance, gradebook data, and learner preferences to generate topic-level proficiency estimates using a Skill Gap Agent that employs concept-level diagnostic reasoning to identify specific misconceptions and knowledge deficiencies. After identifying skill gaps, the Recommender Agent retrieves preference-aware learning materials aligned with diagnosed deficiencies, implementing a continuous feedback loop where interventions occur before advancing to subsequent topics. Extensive empirical evaluation on authentic datasets from two undergraduate computer science courses demonstrates ALIGNAgent's effectiveness, with GPT-4o-based agents achieving precision of 0.87-0.90 and F1 scores of 0.84-0.87 in knowledge proficiency estimation validated against actual exam performance.
https://arxiv.org/abs/2601.15551
Accurate prediction of traffic crash severity is critical for improving emergency response and public safety planning. Although recent large language models (LLMs) exhibit strong reasoning capabilities, their single-agent architectures often struggle with heterogeneous, domain-specific crash data and tend to generate biased or unstable predictions. To address these limitations, this paper proposes TransportAgents, a hybrid multi-agent framework that integrates category-specific LLM reasoning with a multilayer perceptron (MLP) integration module. Each specialized agent focuses on a particular subset of traffic information, such as demographics, environmental context, or incident details, to produce intermediate severity assessments that are subsequently fused into a unified prediction. Extensive experiments on two complementary U.S. datasets, the Consumer Product Safety Risk Management System (CPSRMS) and the National Electronic Injury Surveillance System (NEISS), demonstrate that TransportAgents consistently outperforms both traditional machine learning and advanced LLM-based baselines. Across three representative backbones, including closed-source models such as GPT-3.5 and GPT-4o, as well as open-source models such as LLaMA-3.3, the framework exhibits strong robustness, scalability, and cross-dataset generalizability. A supplementary distributional analysis further shows that TransportAgents produces more balanced and well-calibrated severity predictions than standard single-agent LLM approaches, highlighting its interpretability and reliability for safety-critical decision support applications.
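The integration stage — category-specific agents each emit an intermediate severity score, and an MLP fuses them into one prediction — can be sketched as a single forward pass. The layer shapes, weights, and three-agent setup below are illustrative, not the paper's trained module:

```python
import math

def mlp_fuse(agent_scores, W1, b1, W2, b2):
    """Fuse per-agent intermediate severity scores (e.g. from demographics,
    environment, and incident agents) into a probability distribution over
    severity classes via a one-hidden-layer MLP with softmax output."""
    # hidden layer with tanh activation
    h = [math.tanh(sum(w * x for w, x in zip(row, agent_scores)) + b)
         for row, b in zip(W1, b1)]
    # output logits, one per severity class
    logits = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(W2, b2)]
    # numerically stable softmax
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

In the full framework these weights would be learned from labeled crash records, with each input slot fed by one specialized LLM agent's assessment.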
https://arxiv.org/abs/2601.15519
Hallucination in large language models (LLMs) remains an acute concern, contributing to the spread of misinformation and diminished public trust, particularly in high-risk domains. Among hallucination types, factuality is crucial, as it concerns a model's alignment with established world knowledge. Adversarial factuality, defined as the deliberate insertion of misinformation into prompts with varying levels of expressed confidence, tests a model's ability to detect and resist confidently framed falsehoods. Existing work lacks high-quality, domain-specific resources for assessing model robustness under such adversarial conditions, and no prior research has examined the impact of injected misinformation on long-form text factuality. To address this gap, we introduce AdversaRiskQA, the first verified and reliable benchmark systematically evaluating adversarial factuality across Health, Finance, and Law. The benchmark includes two difficulty levels to test LLMs' defensive capabilities across varying knowledge depths. We propose two automated methods for evaluating the adversarial attack success and long-form factuality. We evaluate six open- and closed-source LLMs from the Qwen, GPT-OSS, and GPT families, measuring misinformation detection rates. Long-form factuality is assessed on Qwen3 (30B) under both baseline and adversarial conditions. Results show that after excluding meaningless responses, Qwen3 (80B) achieves the highest average accuracy, while GPT-5 maintains consistently high accuracy. Performance scales non-linearly with model size, varies by domains, and gaps between difficulty levels narrow as models grow. Long-form evaluation reveals no significant correlation between injected misinformation and the model's factual output. AdversaRiskQA provides a valuable benchmark for pinpointing LLM weaknesses and developing more reliable models for high-stakes applications.
https://arxiv.org/abs/2601.15511
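The adversarial-factuality protocol described above can be sketched as a small loop: misinformation is injected into a prompt with a stated confidence level, and the attack "succeeds" when the model fails to flag the falsehood. Everything here is illustrative: the confidence framings, the keyword-based judge, and the stub model are assumptions, not the paper's automated methods.

```python
CONFIDENCE_FRAMES = {
    "low":  "I might be wrong, but I read that {claim}.",
    "high": "It is a well-established fact that {claim}.",
}

def build_adversarial_prompt(question, false_claim, confidence):
    """Inject a false claim with a chosen level of expressed confidence."""
    frame = CONFIDENCE_FRAMES[confidence].format(claim=false_claim)
    return f"{frame} Given that, {question}"

def detects_misinformation(response):
    """Toy judge: did the model push back on the injected claim?"""
    cues = ("incorrect", "not true", "actually", "misconception")
    return any(cue in response.lower() for cue in cues)

def attack_success_rate(model, items, confidence):
    """Fraction of prompts where the model fails to flag the falsehood."""
    failures = 0
    for question, false_claim in items:
        prompt = build_adversarial_prompt(question, false_claim, confidence)
        if not detects_misinformation(model(prompt)):
            failures += 1
    return failures / len(items)

# Stub model that always corrects the user, for illustration.
stub = lambda prompt: "That claim is actually incorrect; the answer is ..."
items = [("what is the safe dose?", "aspirin is safe at any dose")]
print(attack_success_rate(stub, items, "high"))  # 0.0 for this stub
```

A real evaluation would replace the keyword judge with an LLM-based one and sweep both confidence framings across domains and difficulty levels.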
Characters in novels have typically been modeled based on their presence in scenes in the narrative, considering aspects like their actions, named mentions, and dialogue. This conception of character places significant emphasis on the main character, who is present in the most scenes. In this work, we instead adopt a framing developed from a new literary theory proposing a six-component structural model of character. This model enables a comprehensive approach to character that accounts for the narrator-character distinction and includes a component neglected by prior methods: discussion by other characters. We compare general-purpose LLMs with task-specific transformers for operationalizing this model of character on major 19th-century British realist novels. Our methods yield both component-level and graph representations of character discussion. We then demonstrate that these representations allow us to approach literary questions at scale through a new computational lens. Specifically, we explore Woloch's classic "the one vs. the many" theory of character centrality and the gendered dynamics of character discussion.
https://arxiv.org/abs/2601.15508
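The graph representation of character discussion mentioned above can be sketched as a directed edge list, where an edge (speaker, subject) records one character discussing another and in-degree gives a simple "how much is this character talked about" centrality. The characters and edges below are illustrative examples, not the paper's extracted data.

```python
from collections import Counter

def discussion_in_degree(edges):
    """Count how often each character is the subject of discussion."""
    return Counter(subject for _, subject in edges)

# Illustrative (speaker -> subject) discussion edges.
edges = [
    ("Mrs. Bennet", "Mr. Darcy"),
    ("Elizabeth",   "Mr. Darcy"),
    ("Mr. Darcy",   "Elizabeth"),
]
print(discussion_in_degree(edges).most_common(1))  # [('Mr. Darcy', 2)]
```

Ranking characters by such in-degree, rather than by scene presence, is one way a "the one vs. the many" centrality question could be posed over a full corpus.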
Reflexive Thematic Analysis (RTA) is a critical method for generating deep interpretive insights. Yet its core tenets, including researcher reflexivity, tangible analytical evolution, and productive disagreement, are often poorly supported by software tools that prioritize speed and consensus over interpretive depth. To address this gap, we introduce Reflexis, a collaborative workspace that centers these practices. It supports reflexivity by integrating in-situ reflection prompts, makes code evolution transparent and tangible, and scaffolds collaborative interpretation by turning differences into productive, positionality-aware dialogue. Results from our paired-analyst study (N=12) indicate that Reflexis encouraged participants toward more granular reflection and reframed disagreements as productive conversations. The evaluation also surfaced key design tensions, including a desire for higher-level, networked memos and more user control over the timing of proactive alerts. Reflexis contributes a design framework for tools that prioritize rigor and transparency to support deep, collaborative interpretation in an age of automation.
https://arxiv.org/abs/2601.15445
We propose a novel way to evaluate the sycophancy of LLMs directly and neutrally, mitigating the various forms of uncontrolled bias, noise, or manipulative language deliberately injected into prompts in prior work. A key novelty in our approach is the use of an LLM-as-a-judge to evaluate sycophancy as a zero-sum game in a betting setting. Under this framework, sycophancy serves one individual (the user) while explicitly incurring a cost on another. Comparing four leading models - Gemini 2.5 Pro, ChatGPT-4o, Mistral-Large-Instruct-2411, and Claude Sonnet 3.7 - we find that while all models exhibit sycophantic tendencies in the common setting, in which sycophancy is self-serving to the user and incurs no cost on others, Claude and Mistral exhibit "moral remorse" and over-compensate for their sycophancy when it explicitly harms a third party. Additionally, we observe that all models are biased toward the answer proposed last. Crucially, we find that these two phenomena are not independent: sycophancy and recency bias interact to produce a "constructive interference" effect, in which the tendency to agree with the user is exacerbated when the user's opinion is presented last.
https://arxiv.org/abs/2601.15436
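The recency-bias measurement implied above can be sketched as an order-controlled probe: the same pair of answers is presented in both orders, and we count how often the model sides with whichever answer appears last. The prompt template and the stub model (which has maximal recency bias) are illustrative assumptions, not the paper's actual setup.

```python
def ask(model, first, second):
    """Present two candidate answers in a fixed order."""
    prompt = f"Which is correct? Option A: {first}. Option B: {second}."
    return model(prompt)

def last_option_agreement(model, pairs):
    """Fraction of presentations where the model picks the last option."""
    picks_last = 0
    trials = 0
    for a, b in pairs:
        for first, second in ((a, b), (b, a)):  # both orderings
            if ask(model, first, second) == second:
                picks_last += 1
            trials += 1
    return picks_last / trials

# Stub model that always sides with the option mentioned last.
stub = lambda prompt: prompt.split("Option B: ")[1].rstrip(".")
pairs = [("the Earth orbits the Sun", "the Sun orbits the Earth")]
print(last_option_agreement(stub, pairs))  # 1.0 for this fully biased stub
```

An unbiased model would score near 0.5 here; running the same probe with the user's opinion placed first versus last is one way the interaction between sycophancy and recency bias could be isolated.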