How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We instead perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets a new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, at a cost of only a few hundred dollars per problem.
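The abstract's "prioritize the most promising solutions" can be pictured as drawing the next fine-tuning batch from the top of a reward-sorted pool of attempts. This is a conceptual sketch only; the paper's actual learning objective and search subroutine are not reproduced here, and the function name is illustrative.

```python
def top_k_batch(pool, k):
    """Keep the k highest-reward (solution, reward) pairs from the pool
    of attempts gathered so far.  Training on only the best attempts
    biases learning toward one great solution rather than many good
    ones on average -- the goal the abstract describes."""
    return sorted(pool, key=lambda sr: sr[1], reverse=True)[:k]
```

A pool like `[("a", 0.1), ("b", 0.9), ("c", 0.5)]` with `k=2` would keep the two highest-reward attempts.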
https://arxiv.org/abs/2601.16175
Although artificial intelligence (AI) has become deeply integrated into various stages of the research workflow and achieved remarkable advancements, academic rebuttal remains a significant and underexplored challenge. This is because rebuttal is a complex process of strategic communication under severe information asymmetry rather than a simple technical debate. Consequently, current approaches struggle as they largely imitate surface-level linguistics, missing the essential element of perspective-taking required for effective persuasion. In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) pipeline that models the reviewer's mental state, formulates a persuasion strategy, and generates a strategy-grounded response. To train our agent, we construct RebuttalBench, a large-scale dataset synthesized via a novel critique-and-refine approach. Our training process consists of two stages, beginning with a supervised fine-tuning phase to equip the agent with ToM-based analysis and strategic planning capabilities, followed by a reinforcement learning phase leveraging a self-reward mechanism for scalable self-improvement. For reliable and efficient automated evaluation, we further develop Rebuttal-RM, a specialized evaluator trained on over 100K samples of multi-source rebuttal data, which achieves scoring consistency with human preferences surpassing that of the powerful GPT-4.1 judge. Extensive experiments show RebuttalAgent significantly outperforms the base model by an average of 18.3% on automated metrics, while also outperforming advanced proprietary models across both automated and human evaluations. Disclaimer: the generated rebuttal content is for reference only, to inspire authors and assist in drafting. It is not intended to replace the author's own critical analysis and response.
https://arxiv.org/abs/2601.15715
As frontier AI models are deployed globally, it is essential that their behaviour remains safe and reliable across diverse linguistic and cultural contexts. To examine how current model safeguards hold up in such settings, participants from the International Network for Advanced AI Measurement, Evaluation and Science, including representatives from Singapore, Japan, Australia, Canada, the EU, France, Kenya, South Korea and the UK, conducted a joint multilingual evaluation exercise. Led by Singapore AISI, two open-weight models were tested across ten languages spanning high- and low-resource groups: Cantonese, English, Farsi, French, Japanese, Korean, Kiswahili, Malay, Mandarin Chinese and Telugu. Over 6,000 newly translated prompts were evaluated across five harm categories (privacy, non-violent crime, violent crime, intellectual property and jailbreak robustness), using both LLM-as-a-judge and human annotation. The exercise shows how safety behaviours can vary across languages, including differences in safeguard robustness across languages and harm types, and variation in evaluator reliability (LLM-as-judge vs. human review). It also generated methodological insights for improving multilingual safety evaluations, such as the need for culturally contextualised translations, stress-tested evaluator prompts and clearer human annotation guidelines. This work represents an initial step toward a shared framework for multilingual safety testing of advanced AI systems and calls for continued collaboration with the wider research community and industry.
https://arxiv.org/abs/2601.15706
PURPOSE OR GOAL: This study investigates how GenAI can be integrated with a criterion-referenced grading framework to improve the efficiency and quality of grading for mathematical assessments in engineering. It specifically explores the challenges demonstrators face with manual, model solution-based grading and how a GenAI-supported system can be designed to reliably identify student errors, provide high-quality feedback, and support human graders. The research also examines human graders' perceptions of the effectiveness of this GenAI-assisted approach. ACTUAL OR ANTICIPATED OUTCOMES: The study found that GenAI achieved an overall grading accuracy of 92.5%, comparable to two experienced human graders. The two researchers, who also served as subject demonstrators, perceived the GenAI as a helpful second reviewer that improved accuracy by catching small errors and provided more complete feedback than they could manually. A central outcome was the significant enhancement of formative feedback. However, they noted the GenAI tool is not yet reliable enough for autonomous use, especially with unconventional solutions. CONCLUSIONS/RECOMMENDATIONS/SUMMARY: This study demonstrates that GenAI, when paired with a structured, criterion-referenced framework using binary questions, can grade engineering mathematical assessments with an accuracy comparable to human experts. Its primary contribution is a novel methodological approach that embeds the generation of high-quality, scalable formative feedback directly into the assessment workflow. Future work should investigate student perceptions of GenAI grading and feedback.
https://arxiv.org/abs/2601.15626
This paper examines how to make large language models reliable for high-stakes legal work by reducing hallucinations. It distinguishes three AI paradigms: (1) standalone generative models ("creative oracle"), (2) basic retrieval-augmented systems ("expert archivist"), and (3) an advanced, end-to-end optimized RAG system ("rigorous archivist"). The authors introduce two reliability metrics, the False Citation Rate (FCR) and the Fabricated Fact Rate (FFR), and evaluate 2,700 judicial-style answers from 12 LLMs across 75 legal tasks using expert, double-blind review. Results show that standalone models are unsuitable for professional use (FCR above 30%), while basic RAG greatly reduces errors but still leaves notable misgrounding. Advanced RAG, using techniques such as embedding fine-tuning, re-ranking, and self-correction, reduces fabrication to negligible levels (below 0.2%). The study concludes that trustworthy legal AI requires rigor-focused, retrieval-based architectures emphasizing verification and traceability, and provides an evaluation framework applicable to other high-risk domains.
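The two metrics are named but not formally defined in the abstract; read as simple rates, they can be sketched as below. The field shapes (a set of verified sources, (claim, is_fabricated) verdict pairs from the expert review) are illustrative assumptions, not the paper's data schema.

```python
def false_citation_rate(citations, verified_sources):
    """FCR: share of cited authorities that do not resolve to a
    real, verified source."""
    if not citations:
        return 0.0
    bad = sum(1 for c in citations if c not in verified_sources)
    return bad / len(citations)

def fabricated_fact_rate(claims):
    """FFR: share of extracted factual claims flagged as fabricated.

    `claims` is a list of (claim_text, is_fabricated) pairs, e.g. the
    verdicts from a double-blind expert review."""
    if not claims:
        return 0.0
    return sum(1 for _, fabricated in claims if fabricated) / len(claims)
```

Under these definitions, an answer citing two cases of which one is unverifiable would score an FCR of 0.5.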
https://arxiv.org/abs/2601.15476
Large language models (LLMs) are being increasingly integrated into legal applications, including judicial decision support, legal practice assistance, and public-facing legal services. While LLMs show strong potential in handling legal knowledge and tasks, their deployment in real-world legal settings raises critical concerns beyond surface-level accuracy, involving the soundness of legal reasoning processes and trustworthy issues such as fairness and reliability. Systematic evaluation of LLM performance in legal tasks has therefore become essential for their responsible adoption. This survey identifies key challenges in evaluating LLMs for legal tasks grounded in real-world legal practice. We analyze the major difficulties involved in assessing LLM performance in the legal domain, including outcome correctness, reasoning reliability, and trustworthiness. Building on these challenges, we review and categorize existing evaluation methods and benchmarks according to their task design, datasets, and evaluation metrics. We further discuss the extent to which current approaches address these challenges, highlight their limitations, and outline future research directions toward more realistic, reliable, and legally grounded evaluation frameworks for LLMs in legal domains.
https://arxiv.org/abs/2601.15267
Machine learning and artificial intelligence conferences such as NeurIPS and ICML now regularly receive tens of thousands of submissions, posing significant challenges to maintaining the quality and consistency of the peer review process. This challenge is particularly acute for best paper awards, which are an important part of the peer review process, yet whose selection has increasingly become a subject of debate in recent years. In this paper, we introduce an author-assisted mechanism to facilitate the selection of best paper awards. Our method employs the Isotonic Mechanism for eliciting authors' assessments of their own submissions in the form of a ranking, which is subsequently utilized to adjust the raw review scores for optimal estimation of the submissions' ground-truth quality. We demonstrate that authors are incentivized to report truthfully when their utility is a convex additive function of the adjusted scores, and we validate this convexity assumption for best paper awards using publicly accessible review data of ICLR from 2019 to 2023 and NeurIPS from 2021 to 2023. Crucially, in the special case where an author has a single quota -- that is, may nominate only one paper -- we prove that truthfulness holds even when the utility function is merely nondecreasing and additive. This finding represents a substantial relaxation of the assumptions required in prior work. For practical implementation, we extend our mechanism to accommodate the common scenario of overlapping authorship. Finally, simulation results demonstrate that our mechanism significantly improves the quality of papers selected for awards.
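The score adjustment at the heart of the Isotonic Mechanism is an isotonic regression: project the raw review scores onto the order constraints implied by the author's ranking. A minimal sketch under the squared-error objective, using the classic pool-adjacent-violators algorithm (PAVA); the function name is illustrative and the paper's estimator may differ in detail.

```python
def adjust_scores(raw_scores):
    """Isotonic regression via pool-adjacent-violators (PAVA).

    `raw_scores` are the papers' raw review scores listed in the
    author's claimed order, best first.  Returns the closest (in
    squared error) non-increasing sequence, i.e. the scores adjusted
    to respect the author's self-reported ranking."""
    blocks = []  # each block pools a run of papers sharing one value: (sum, count)
    for s in raw_scores:
        blocks.append((float(s), 1))
        # Merge while an earlier block's mean is below a later one's
        # (a violation of the required non-increasing order).
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]):
            t2, c2 = blocks.pop()
            t1, c1 = blocks.pop()
            blocks.append((t1 + t2, c1 + c2))
    out = []
    for total, count in blocks:
        out.extend([total / count] * count)
    return out
```

For example, raw scores `[5, 7, 4]` contradict the author's claim that the first paper is best, so the first two are pooled to their mean, yielding `[6.0, 6.0, 4.0]`. In the single-quota case the abstract highlights, the author's report degenerates to naming one paper, which is why weaker utility assumptions suffice there.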
https://arxiv.org/abs/2601.15249
AI coding agents are now submitting pull requests (PRs) to software projects, acting not just as assistants but as autonomous contributors. As these agentic contributions are rapidly increasing across real repositories, little is known about how they behave in practice and why many of them fail to be merged. In this paper, we conduct a large-scale study of 33k agent-authored PRs made by five coding agents across GitHub. (RQ1) We first quantitatively characterize merged and not-merged PRs along four broad dimensions: 1) merge outcomes across task types, 2) code changes, 3) CI build results, and 4) review dynamics. We observe that tasks related to documentation, CI, and build update achieve the highest merge success, whereas performance and bug-fix tasks perform the worst. Not-merged PRs tend to involve larger code changes, touch more files, and often do not pass the project's CI/CD pipeline validation. (RQ2) To further investigate why some agentic PRs are not merged, we qualitatively analyze 600 PRs to derive a hierarchical taxonomy of rejection patterns. This analysis complements the quantitative findings in RQ1 by uncovering rejection reasons not captured by quantitative metrics, including lack of meaningful reviewer engagement, duplicate PRs, unwanted feature implementations, and agent misalignment. Together, our findings highlight key socio-technical and human-AI collaboration factors that are critical to improving the success of future agentic workflows.
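The first dimension of RQ1 (merge outcomes across task types) reduces to a grouped merge rate; a minimal sketch over illustrative records, not the study's actual data pipeline:

```python
def merge_rate_by_task(prs):
    """`prs` is a list of (task_type, merged) pairs, one per PR.
    Returns the fraction of merged PRs per task type."""
    counts, merged = {}, {}
    for task, was_merged in prs:
        counts[task] = counts.get(task, 0) + 1
        merged[task] = merged.get(task, 0) + (1 if was_merged else 0)
    return {t: merged[t] / counts[t] for t in counts}
```

Run over records like `[("docs", True), ("bug-fix", False), ...]`, this yields the per-task success rates behind observations such as documentation tasks merging most often.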
https://arxiv.org/abs/2601.15195
Peer review is at the heart of modern science. As submission numbers rise and research communities grow, the decline in review quality is a popular narrative and a common concern. Yet, is it true? Review quality is difficult to measure, and the ongoing evolution of reviewing practices makes it hard to compare reviews across venues and time. To address this, we introduce a new framework for evidence-based comparative study of review quality and apply it to major AI and machine learning conferences: ICLR, NeurIPS and *ACL. We document the diversity of review formats and introduce a new approach to review standardization. We propose a multi-dimensional schema for quantifying review quality as utility to editors and authors, coupled with both LLM-based and lightweight measurements. We study the relationships between measurements of review quality, and its evolution over time. Contradicting the popular narrative, our cross-temporal analysis reveals no consistent decline in median review quality across venues and years. We propose alternative explanations, and outline recommendations to facilitate future empirical studies of review quality.
https://arxiv.org/abs/2601.15172
Multimodal large language models have demonstrated comparable performance to that of radiology trainees on multiple-choice board-style exams. However, to develop clinically useful multimodal LLM tools, high-quality benchmarks curated by domain experts are essential. This work aimed to curate released and holdout datasets of 100 chest radiographic studies each and to propose an artificial intelligence (AI)-assisted expert labeling procedure that allows radiologists to label studies more efficiently. A total of 13,735 deidentified chest radiographs and their corresponding reports from the MIDRC were used. GPT-4o extracted abnormal findings from the reports, which were then mapped to 12 benchmark labels with a locally hosted LLM (Phi-4-Reasoning). From these studies, 1,000 were sampled on the basis of the AI-suggested benchmark labels for expert review; the sampling algorithm ensured that the selected studies were clinically relevant and captured a range of difficulty levels. Seventeen chest radiologists participated, marking "Agree all", "Agree mostly" or "Disagree" to indicate their assessment of the correctness of the LLM-suggested labels. Each chest radiograph was evaluated by three experts. At least two radiologists selected "Agree all" for 381 radiographs. From this set, 200 were selected, prioritizing those with less common or multiple finding labels, and divided into 100 released radiographs and 100 reserved as the holdout dataset. The holdout dataset is used exclusively by RSNA to independently evaluate different models. A benchmark of 200 chest radiographic studies with 12 benchmark labels was created and made publicly available at this https URL, with each chest radiograph verified by three radiologists. In addition, an AI-assisted labeling procedure was developed to help radiologists label at scale, minimize unnecessary omissions, and support a semicollaborative environment.
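The selection rule described above (keep a study when at least two of its three expert verdicts are "Agree all") is a simple majority filter; a hedged sketch with illustrative function names:

```python
from collections import Counter

def keep_study(verdicts, label="Agree all", threshold=2):
    """True when at least `threshold` of the expert verdicts match
    `label`.  `verdicts` holds one string per radiologist, e.g.
    ["Agree all", "Agree mostly", "Agree all"]."""
    return Counter(verdicts)[label] >= threshold

def select_studies(reviews):
    """Filter a {study_id: [verdict, verdict, verdict]} mapping down to
    the studies passing the 2-of-3 agreement rule."""
    return [sid for sid, v in sorted(reviews.items()) if keep_study(v)]
```

Applied to the 1,000 expert-reviewed studies, this rule is what yields the 381-radiograph pool from which the final 200 were drawn.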
https://arxiv.org/abs/2601.15129
As Micro-CT technology continues to refine its characterization of material microstructures, industrial CT ultra-precision inspection is generating increasingly large datasets, necessitating solutions to the trade-off between accuracy and efficiency in the 3D characterization of defects during ultra-precise detection. This article provides a unique perspective on recent advances in accurate and efficient 3D visualization using Micro-CT, tracing its evolution from medical imaging to industrial non-destructive testing (NDT). Among the numerous CT reconstruction and volume rendering methods, this article selectively reviews and analyzes approaches that balance accuracy and efficiency, offering a comprehensive analysis to help researchers quickly grasp highly efficient and accurate 3D reconstruction methods for microscopic features. By comparing the principles of computed tomography with advancements in microstructural technology, this article examines the evolution of CT reconstruction algorithms from analytical methods to deep learning techniques, as well as improvements in volume rendering algorithms, acceleration, and data reduction. Additionally, it explores advanced lighting models for high-accuracy, photorealistic, and efficient volume rendering. Furthermore, this article envisions potential directions in CT reconstruction and volume rendering. It aims to guide future research in quickly selecting efficient and precise methods, in developing new ideas and approaches for real-time online monitoring of internal material defects through virtual-physical interaction, and in applying digital twin models to structural health monitoring (SHM).
https://arxiv.org/abs/2601.15098
AI has revolutionised decision-making across various fields. Yet human judgement remains paramount for high-stakes decision-making. This has fueled explorations of collaborative decision-making between humans and AI systems, aiming to leverage the strengths of both. To explore this dynamic, researchers conduct empirical studies, investigating how humans use AI assistance for decision-making and how this collaboration impacts results. A critical aspect of conducting these studies is the role of participants, often recruited through crowdsourcing platforms. The validity of these studies hinges on the behaviours of the participants, hence effective incentives that can potentially affect these behaviours are a key part of designing and executing these studies. In this work, we aim to address the critical role of incentive design for conducting empirical human-AI decision-making studies, focusing on understanding, designing, and documenting incentive schemes. Through a thematic review of existing research, we explored the current practices, challenges, and opportunities associated with incentive design for human-AI decision-making empirical studies. We identified recurring patterns, or themes, such as what comprises the components of an incentive scheme, how incentive schemes are manipulated by researchers, and the impact they can have on research outcomes. Leveraging the acquired understanding, we curated a set of guidelines to aid researchers in designing effective incentive schemes for their studies, called the Incentive-Tuning Framework, outlining how researchers can undertake, reflect on, and document the incentive design process. By advocating for a standardised yet flexible approach to incentive design and contributing valuable insights along with practical tools, we hope to pave the way for more reliable and generalizable knowledge in the field of human-AI decision-making.
https://arxiv.org/abs/2601.15064
Qualitative research often contains personal, contextual, and organizational details that pose privacy risks if not handled appropriately. Manual anonymization is time-consuming, inconsistent, and frequently omits critical identifiers. Existing automated tools tend to rely on pattern matching or fixed rules, which fail to capture context and may alter the meaning of the data. This study uses local LLMs to build a reliable, repeatable, and context-aware anonymization process for detecting and anonymizing sensitive data in qualitative transcripts. We introduce a Structured Framework for Adaptive Anonymizer (SFAA) that includes three steps: detection, classification, and adaptive anonymization. The SFAA incorporates four anonymization strategies: rule-based substitution, context-aware rewriting, generalization, and suppression. These strategies are applied based on the identifier type and the risk level. The identifiers handled by the SFAA are guided by major international privacy and research ethics standards, including the GDPR, HIPAA, and OECD guidelines. This study followed a dual-method evaluation that combined manual and LLM-assisted processing. Two case studies were used to support the evaluation. The first includes 82 face-to-face interviews on gamification in organizations. The second involves 93 machine-led interviews using an AI-powered interviewer to test LLM awareness and workplace privacy. Two local models, LLaMA and Phi, were used to evaluate the performance of the proposed framework. The results indicate that the LLMs found more sensitive data than a human reviewer. Phi outperformed LLaMA in finding sensitive data, but made slightly more errors. Phi found over 91% of the sensitive data, and 94.8% of its anonymized output kept the same sentiment as the original text, indicating accuracy high enough not to affect the analysis of the qualitative data.
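The pairing of identifier type and risk level with one of the four strategies suggests a dispatch rule. The concrete mapping below is an illustrative assumption for the sketch, not the SFAA's actual policy, which the paper describes as guided by privacy standards rather than hard-coded.

```python
def anonymize(identifier, id_type, risk):
    """Pick one of the four SFAA strategies by identifier type and
    risk level.  The type->strategy mapping here is a guess for
    illustration only."""
    if risk == "high":
        return "[REDACTED]"                 # suppression: drop entirely
    if id_type == "person":
        return "[PERSON]"                   # rule-based substitution
    if id_type == "organization":
        return "a large organization"       # generalization
    # Remaining cases: context-aware rewriting, which in the real
    # framework would be delegated to the local LLM.
    return f"[REWRITE: {identifier}]"
```

In this sketch a high-risk identifier is always suppressed, while a low-risk person name is swapped for a typed placeholder that preserves the transcript's readability.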
https://arxiv.org/abs/2601.14683
Object detection in video and image surveillance is a well-established yet rapidly evolving task, strongly influenced by recent deep learning advancements. This review summarises modern techniques by examining architectural innovations, generative model integration, and the use of temporal information to enhance robustness and accuracy. Unlike earlier surveys, it classifies methods based on core architectures, data processing strategies, and surveillance-specific challenges such as dynamic environments, occlusions, lighting variations, and real-time requirements. The primary goal is to evaluate the current effectiveness of semantic object detection, while secondary aims include analysing deep learning models and their practical applications. The review covers CNN-based detectors, GAN-assisted approaches, and temporal fusion methods, highlighting how generative models support tasks such as reconstructing missing frames, reducing occlusions, and normalising illumination. It also outlines preprocessing pipelines, feature extraction progress, benchmarking datasets, and comparative evaluations. Finally, emerging trends in low-latency, efficient, and spatiotemporal learning approaches are identified for future research.
https://arxiv.org/abs/2601.14677
LLM-based Multi-Agent (LLM-MA) systems are increasingly applied to automate complex software engineering tasks such as requirements engineering, code generation, and testing. However, their operational efficiency and resource consumption remain poorly understood, hindering practical adoption due to unpredictable costs and environmental impact. To address this, we conduct an analysis of token consumption patterns in an LLM-MA system within the Software Development Life Cycle (SDLC), aiming to understand where tokens are consumed across distinct software engineering activities. We analyze execution traces from 30 software development tasks performed by the ChatDev framework using a GPT-5 reasoning model, mapping its internal phases to distinct development stages (Design, Coding, Code Completion, Code Review, Testing, and Documentation) to create a standardized evaluation framework. We then quantify and compare token distribution (input, output, reasoning) across these stages. Our preliminary findings show that the iterative Code Review stage accounts for the majority of token consumption, averaging 59.4% of tokens. Furthermore, we observe that input tokens consistently constitute the largest share of consumption, averaging 53.9%, providing empirical evidence for potentially significant inefficiencies in agentic collaboration. Our results suggest that the primary cost of agentic software engineering lies not in initial code generation but in automated refinement and verification. Our novel methodology can help practitioners predict expenses and optimize workflows, and it directs future research toward developing more token-efficient agent collaboration protocols.
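The stage-level accounting described above reduces to summing per-call token counts under a phase-to-stage mapping. A minimal sketch with made-up numbers and an assumed record shape, not the study's instrumentation:

```python
def stage_shares(calls, phase_to_stage):
    """Aggregate (phase, input_tokens, output_tokens, reasoning_tokens)
    call records into per-stage totals, then return each stage's share
    of the grand total."""
    totals = {}
    for phase, inp, out, reasoning in calls:
        stage = phase_to_stage[phase]
        totals[stage] = totals.get(stage, 0) + inp + out + reasoning
    grand = sum(totals.values())
    return {stage: t / grand for stage, t in totals.items()}
```

Running this over a full trace produces the kind of breakdown the paper reports, e.g. Code Review dominating the token budget.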
https://arxiv.org/abs/2601.14470
Open science initiatives have strengthened scientific integrity and accelerated research progress across many fields, but the state of their practice within transportation research remains under-investigated. Key features of open science, defined here as data and code availability, are difficult to extract due to the inherent complexity of the field. Previous work has either been limited to small-scale studies due to the labor-intensive nature of manual analysis or has relied on large-scale bibliometric approaches that sacrifice contextual richness. This paper introduces an automatic and scalable feature-extraction pipeline to measure data and code availability in transportation research. We employ Large Language Models (LLMs) for this task and validate their performance against a manually curated dataset and through an inter-rater agreement analysis. We applied this pipeline to examine 10,724 research articles published in the Transportation Research Part series of journals between 2019 and 2024. Our analysis found that only 5% of quantitative papers shared a code repository, 4% of quantitative papers shared a data repository, and about 3% of papers shared both, with trends differing across journals, topics, and geographic regions. We found no significant difference in citation counts or review duration between papers that provided data and code and those that did not, suggesting a misalignment between open science efforts and traditional academic metrics. Consequently, encouraging these practices will likely require structural interventions from journals and funding agencies to supplement the lack of direct author incentives. The pipeline developed in this study can be readily scaled to other journals, representing a critical step toward the automated measurement and monitoring of open science practices in transportation research.
https://arxiv.org/abs/2601.14429
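The validation step above, comparing LLM-extracted labels against a manually curated dataset, is commonly quantified with an agreement statistic such as Cohen's kappa. The abstract does not name the exact statistic used, so treat this as an illustrative sketch with made-up labels:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the two raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence of the two raters.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Hypothetical per-paper labels: does the paper share a code repository?
llm_labels    = ["yes", "no", "no", "yes", "no", "no", "no", "yes"]
manual_labels = ["yes", "no", "no", "no",  "no", "no", "no", "yes"]
kappa = cohens_kappa(llm_labels, manual_labels)
```

Kappa corrects raw percent agreement for chance, which matters here because "no repository" is by far the majority class (95%+ of papers in the study).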
Recent years have witnessed increasing interest in extending large language models into agentic systems. While the effectiveness of agents has continued to improve, efficiency, which is crucial for real-world deployment, has often been overlooked. This paper therefore investigates efficiency through three core components of agents: memory, tool learning, and planning, considering costs such as latency, tokens, and steps. Aiming at a comprehensive study of the efficiency of the agentic system itself, we review a broad range of recent approaches that differ in implementation yet frequently converge on shared high-level principles, including but not limited to bounding context via compression and management, designing reinforcement learning rewards that minimize tool invocation, and employing controlled search mechanisms; we discuss each in detail. Accordingly, we characterize efficiency in two complementary ways: comparing effectiveness under a fixed cost budget, and comparing cost at a comparable level of effectiveness. This trade-off can also be viewed through the Pareto frontier between effectiveness and cost. From this perspective, we also examine efficiency-oriented benchmarks by summarizing evaluation protocols for these components and consolidating commonly reported efficiency metrics from both benchmark and methodological studies. Moreover, we discuss key challenges and future directions, with the goal of providing promising insights.
https://arxiv.org/abs/2601.14192
Writing effective rebuttals is a high-stakes task that demands more than linguistic fluency, as it requires precise alignment between reviewer intent and manuscript details. Current solutions typically treat this as a direct-to-text generation problem, suffering from hallucination, overlooked critiques, and a lack of verifiable grounding. To address these limitations, we introduce $\textbf{RebuttalAgent}$, the first multi-agent framework that reframes rebuttal generation as an evidence-centric planning task. Our system decomposes complex feedback into atomic concerns and dynamically constructs hybrid contexts by synthesizing compressed summaries with high-fidelity text while integrating an autonomous and on-demand external search module to resolve concerns requiring outside literature. By generating an inspectable response plan before drafting, $\textbf{RebuttalAgent}$ ensures that every argument is explicitly anchored in internal or external evidence. We validate our approach on the proposed $\textbf{RebuttalBench}$ and demonstrate that our pipeline outperforms strong baselines in coverage, faithfulness, and strategic coherence, offering a transparent and controllable assistant for the peer review process. Code will be released.
https://arxiv.org/abs/2601.14171
Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision-making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline: "Locate, Steer, and Improve." We formally categorize Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. The curated paper list of this work is available at this https URL.
https://arxiv.org/abs/2601.14004
Recent advances in singing voice synthesis (SVS) have attracted substantial attention from both academia and industry. With the advent of large language models and novel generative paradigms, producing controllable, high-fidelity singing voices has become an attainable goal. Yet the field still lacks a comprehensive survey that systematically analyzes deep-learning-based singing voice synthesis systems and their enabling technologies. To address the aforementioned issue, this survey first categorizes existing systems by task type and then organizes current architectures into two major paradigms: cascaded and end-to-end approaches. Moreover, we provide an in-depth analysis of core technologies, covering singing modeling and control techniques. Finally, we review relevant datasets, annotation tools, and evaluation benchmarks that support training and assessment. In the appendix, we introduce training strategies and provide further discussion of SVS. This survey provides an up-to-date review of the literature on SVS models, which would be a useful reference for both researchers and engineers. Related materials are available at this https URL.
https://arxiv.org/abs/2601.13910