Alzheimer's Disease (AD) is a progressive neurodegenerative condition that adversely affects cognitive abilities. Language-related changes can be automatically identified through the analysis of outputs from linguistic assessment tasks, such as picture description. Language models show promise as a basis for screening tools for AD, but their limited interpretability poses a challenge in distinguishing true linguistic markers of cognitive decline from surface-level textual patterns. To address this issue, we examine how surface-form variation affects classification performance, with the goal of assessing the ability of language models to represent underlying semantic indicators. We introduce a novel approach in which texts' surface forms are transformed by altering syntax and vocabulary while preserving semantic content. The transformations substantially modify structure and lexical content, as indicated by low BLEU and chrF scores, yet retain the underlying semantics, as reflected in high semantic similarity scores. This isolates the effect of semantic information: models perform almost as well as on the original text, with only small deviations in macro-F1. We also investigate whether language from picture descriptions retains enough detail to reconstruct the original image using generative models, and find that image-based transformations add substantial noise, reducing classification accuracy. Our methodology provides a novel way of examining which features influence model predictions and allows the removal of possible spurious correlations. We find that, using semantic information alone, language-model-based classifiers can still detect AD. This work shows that difficult-to-detect semantic impairment can be identified, addressing an overlooked feature of linguistic deterioration and opening new pathways for early detection systems.
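As a toy illustration of the evaluation logic described above, the sketch below contrasts a surface-level overlap metric with a crude "semantic" overlap obtained by normalizing words to shared concepts. The synonym table, sentences, and both metrics are invented stand-ins for the BLEU/chrF scores and embedding-based semantic similarity the paper actually uses.

```python
def jaccard(a_tokens, b_tokens):
    """Token-set overlap: a crude stand-in for surface-level metrics
    such as BLEU or chrF."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b)

# Hypothetical synonym table standing in for an embedding model:
# mapping words to shared concepts approximates semantic similarity.
CONCEPTS = {"boy": "child", "lad": "child", "takes": "steal",
            "steals": "steal", "cookie": "biscuit", "biscuit": "biscuit"}

def normalize(tokens):
    """Map each token to its concept, leaving unknown tokens unchanged."""
    return [CONCEPTS.get(t, t) for t in tokens]

original = "the boy steals a cookie".split()
rewritten = "a lad takes the biscuit".split()

surface_sim = jaccard(original, rewritten)                          # low: wording changed
semantic_sim = jaccard(normalize(original), normalize(rewritten))   # high: meaning kept
```

A transformation that passes this kind of check (low surface similarity, high semantic similarity) is the setting in which the paper compares classifier macro-F1 against the original transcripts.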
https://arxiv.org/abs/2512.13685
Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene datasets, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene-level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator's transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes are composed of randomly placed objects. This demonstrates that the generator's transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing the widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. Project page: this https URL
https://arxiv.org/abs/2512.13683
Spatio-Temporal Logic (SpaTiaL) offers a principled formalism for expressing geometric spatial requirements, an essential component of robotic manipulation, where object locations, neighborhood relations, pose constraints, and interactions directly determine task success. Yet prior works have largely relied on standard temporal logic (TL), which models only robot trajectories and overlooks object-level interactions. Existing datasets built from randomly generated TL formulas paired with natural-language descriptions therefore cover temporal operators but fail to represent the layered spatial relations that manipulation tasks depend on. To address this gap, we introduce a dataset generation framework that synthesizes SpaTiaL specifications and converts them into natural-language descriptions through a deterministic, semantics-preserving back-translation procedure. This pipeline produces the NL2SpaTiaL dataset, aligning natural language with multi-level spatial relations and temporal objectives to reflect the compositional structure of manipulation tasks. Building on this foundation, we propose a translation-verification framework equipped with a language-based semantic checker that ensures the generated SpaTiaL formulas faithfully encode the semantics specified by the input description. Experiments across a suite of manipulation tasks show that SpaTiaL-based representations yield more interpretable, verifiable, and compositional grounding for instruction following. Project website: this https URL
https://arxiv.org/abs/2512.13670
Spatial transcriptomics (ST) is an emerging technology that enables researchers to investigate the molecular relationships underlying tissue morphology. However, acquiring ST data remains prohibitively expensive, and traditional fixed-grid sampling strategies lead to redundant measurements of morphologically similar or biologically uninformative regions, thus resulting in scarce data that constrain current methods. The well-established single-cell sequencing field, however, could provide rich biological data as an effective auxiliary source to mitigate this limitation. To bridge these gaps, we introduce SCR2-ST, a unified framework that leverages single-cell prior knowledge to guide efficient data acquisition and accurate expression prediction. SCR2-ST integrates a single-cell guided reinforcement learning-based (SCRL) active sampling and a hybrid regression-retrieval prediction network SCR2Net. SCRL combines single-cell foundation model embeddings with spatial density information to construct biologically grounded reward signals, enabling selective acquisition of informative tissue regions under constrained sequencing budgets. SCR2Net then leverages the actively sampled data through a hybrid architecture combining regression-based modeling with retrieval-augmented inference, where a majority cell-type filtering mechanism suppresses noisy matches and retrieved expression profiles serve as soft labels for auxiliary supervision. We evaluated SCR2-ST on three public ST datasets, demonstrating SOTA performance in both sampling efficiency and prediction accuracy, particularly under low-budget scenarios. Code is publicly available at: this https URL
https://arxiv.org/abs/2512.13635
Human-centric anomaly detection (AD) has primarily been studied for identifying anomalous behaviors of a single person. However, as humans by nature tend to act in a collaborative manner, behavioral anomalies can also arise from human-human interactions. Detecting such anomalies with existing single-person AD models is prone to low accuracy, as these approaches are typically not designed to capture the complex and asymmetric dynamics of interactions. In this paper, we introduce a novel task, Human-Human Interaction Anomaly Detection (H2IAD), which aims to identify anomalous interactive behaviors within collaborative 3D human actions. To address H2IAD, we propose the Interaction Anomaly Detection Network (IADNet), built around a Temporal Attention Sharing Module (TASM). Specifically, in designing TASM, we share the encoded motion embeddings across both people so that collaborative motion correlations can be effectively synchronized. Moreover, we note that in addition to temporal dynamics, human interactions are also characterized by the spatial configuration between the two people. We therefore introduce a Distance-Based Relational Encoding Module (DREM) to better reflect social cues in H2IAD. A normalizing flow is finally employed for anomaly scoring. Extensive experiments on human-human motion benchmarks demonstrate that IADNet outperforms existing human-centric AD baselines on H2IAD.
https://arxiv.org/abs/2512.13560
The design of effective reward functions presents a central and often arduous challenge in reinforcement learning (RL), particularly when developing autonomous agents for complex reasoning tasks. While automated reward optimization approaches exist, they typically rely on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between reward structure and task performance. To bridge this gap, we propose Differentiable Evolutionary Reinforcement Learning (DERL), a bilevel framework that enables the autonomous discovery of optimal reward signals. In DERL, a Meta-Optimizer evolves a reward function (i.e., Meta-Reward) by composing structured atomic primitives, guiding the training of an inner-loop policy. Crucially, unlike previous evolutionary approaches, DERL is differentiable in its meta-optimization: it treats the inner-loop validation performance as a signal to update the Meta-Optimizer via reinforcement learning. This allows DERL to approximate the "meta-gradient" of task success, progressively learning to generate denser and more actionable feedback. We validate DERL across three distinct domains: robotic agents (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8k, MATH). Experimental results show that DERL achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming methods relying on heuristic rewards, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling self-improving agent alignment without human intervention.
https://arxiv.org/abs/2512.13399
In fact-checking applications, a common reason to reject a claim is the presence of erroneous cause-effect relationships between the events at play. However, current automated fact-checking methods lack dedicated causality-based reasoning, potentially missing a valuable opportunity for semantically rich explainability. To address this gap, we propose a methodology that combines event relation extraction, semantic similarity computation, and rule-based reasoning to detect logical inconsistencies between the chains of events mentioned in a claim and in the evidence. Evaluated on two fact-checking datasets, this method establishes the first baseline for integrating fine-grained causal event relationships into fact-checking and enhances the explainability of verdict prediction.
https://arxiv.org/abs/2512.13286
This paper presents PyCAALP (Python-based Computer-Aided Assembly Line Planning), a framework for automated Assembly Sequence Planning (ASP) and Production Line Planning (PLP), employing a graph-based approach to model components and joints within production modules. The framework integrates kinematic boundary conditions, such as potential part collisions, to guarantee the feasibility of automated assembly planning. The developed algorithm computes all feasible production sequences, integrating modules for detecting spatial relationships and formulating geometric constraints. The algorithm incorporates additional attributes, including handling feasibility, tolerance matching, and joint compatibility, to manage the high combinatorial complexity inherent in assembly sequence generation. Heuristics, such as Single-Piece Flow assembly and geometrical constraint enforcement, are utilized to further refine the solution space, facilitating more efficient planning for complex assemblies. The PLP stage is formulated as a Mixed-Integer Program (MIP), balancing the total times of a fixed number of manufacturing stations. While some complexity reduction techniques may sacrifice optimality, they significantly reduce the MIP's computational time. Furthermore, the framework enables customization of engineering constraints and supports a flexible trade-off between ASP and PLP. The open-source nature of the framework, available at this https URL, promotes further collaboration and adoption in both industrial and production research applications.
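The PLP objective can be illustrated with a tiny brute-force stand-in for the MIP: assign each task to one of a fixed number of stations and minimize the maximum station time. The task times and station count below are hypothetical, and a real formulation would also respect precedence constraints coming from the ASP stage; this sketch only shows the balancing objective.

```python
from itertools import product

def balance_stations(task_times, n_stations):
    """Brute-force the assignment of tasks to a fixed number of
    stations that minimizes the maximum station time (the makespan
    the MIP balances). Practical for toy instances only: the search
    enumerates n_stations ** len(task_times) assignments.
    """
    best_assign, best_makespan = None, float("inf")
    for assign in product(range(n_stations), repeat=len(task_times)):
        loads = [0.0] * n_stations
        for task, station in enumerate(assign):
            loads[station] += task_times[task]
        makespan = max(loads)
        if makespan < best_makespan:
            best_assign, best_makespan = assign, makespan
    return best_assign, best_makespan

times = [4.0, 3.0, 2.0, 3.0]   # hypothetical per-task assembly times
assignment, makespan = balance_stations(times, n_stations=2)
```

An MIP solver reaches the same optimum via branch-and-bound rather than enumeration, which is why the complexity-reduction techniques mentioned above matter for solve time.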
https://arxiv.org/abs/2512.13219
The development of clinical-grade artificial intelligence in pathology is limited by the scarcity of diverse, high-quality annotated datasets. Generative models offer a potential solution but suffer from semantic instability and morphological hallucinations that compromise diagnostic reliability. To address this challenge, we introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS), the first generative foundation model for pathology-specific text-to-image synthesis. By leveraging a dual-stage training strategy on approximately 2.8 million image-caption pairs, CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations. Furthermore, CRAFTS-augmented datasets enhance the performance across various clinical tasks, including classification, cross-modal retrieval, self-supervised learning, and visual question answering. In addition, coupling CRAFTS with ControlNet enables precise control over tissue architecture from inputs such as nuclear segmentation masks and fluorescence images. By overcoming the critical barriers of data scarcity and privacy concerns, CRAFTS provides a limitless source of diverse, annotated histology data, effectively unlocking the creation of robust diagnostic tools for rare and complex cancer phenotypes.
https://arxiv.org/abs/2512.13164
Deep learning models in medical imaging are susceptible to shortcut learning, relying on confounding metadata (e.g., scanner model) that is often encoded in image embeddings. The crucial question is whether the model actively utilizes this encoded information for its final prediction. We introduce Weight Space Correlation Analysis, an interpretable methodology that quantifies feature utilization by measuring the alignment between the classification heads of a primary clinical task and auxiliary metadata tasks. We first validate our method by successfully detecting artificially induced shortcut learning. We then apply it to probe the feature utilization of an SA-SonoNet model trained for Spontaneous Preterm Birth (sPTB) prediction. Our analysis confirmed that while the embeddings contain substantial metadata, the sPTB classifier's weight vectors were highly correlated with clinically relevant factors (e.g., birth weight) but decoupled from clinically irrelevant acquisition factors (e.g., scanner). Our methodology provides a tool to verify model trustworthiness, demonstrating that, in the absence of induced bias, the clinical model selectively utilizes features related to the genuine clinical signal.
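A minimal sketch of the core measurement, assuming classification heads are plain weight vectors over a shared embedding: alignment is then just cosine similarity between the primary head and each auxiliary head. All weight values here are invented for illustration; the actual analysis operates on trained model heads.

```python
import math

def cosine(u, v):
    """Cosine similarity between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical head weights over a 3-dimensional shared embedding.
# High alignment with a clinical factor and low alignment with an
# acquisition factor is the trustworthy pattern the analysis looks for.
sptb_head = [0.9, 0.1, 0.4]
birth_weight_head = [0.8, 0.2, 0.5]   # clinically relevant auxiliary task
scanner_head = [-0.1, 0.9, -0.2]      # acquisition-metadata auxiliary task

clinical_alignment = cosine(sptb_head, birth_weight_head)
shortcut_alignment = cosine(sptb_head, scanner_head)
```

In this toy configuration the primary head aligns strongly with the clinical head and hardly at all with the scanner head, i.e., metadata may be encoded in the embedding without being utilized by the classifier.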
https://arxiv.org/abs/2512.13144
As large language models increasingly mediate stigmatized health decisions, their capacity to genuinely understand complex psychological and physiological phenomena remains poorly evaluated. Can AI understand what we cannot say? We investigate whether LLMs coherently represent abortion stigma across the cognitive, interpersonal, and structural levels where it operates. We systematically tested 627 demographically diverse personas across five leading LLMs using the validated Individual Level Abortion Stigma Scale (ILAS). Our multilevel analysis examined whether models coherently represent stigma at the cognitive level (self-judgment), interpersonal level (anticipated judgment and isolation), and structural level (community condemnation and disclosure patterns), as well as overall stigma. Models fail tests of genuine understanding across all levels. They overestimate interpersonal stigma while underestimating cognitive stigma, assume uniform community condemnation, introduce demographic biases absent from human validation data, miss the empirically validated stigma-secrecy relationship, and contradict themselves within theoretical constructs. These patterns reveal that current alignment approaches ensure appropriate language but not coherent multilevel understanding. This work provides empirical evidence that current LLMs lack coherent multilevel understanding of psychological and physiological constructs. AI safety in high-stakes contexts demands new approaches to design (multilevel coherence), evaluation (continuous auditing), governance and regulation (mandatory audits, accountability, deployment restrictions), and AI literacy in domains where understanding what people cannot say determines whether support helps or harms.
https://arxiv.org/abs/2512.13142
Generating 3D body movements from speech shows great potential for a wide range of downstream applications, yet it still struggles to imitate realistic human movements. Predominant research efforts focus on end-to-end schemes for generating co-speech gestures, spanning GANs, VQ-VAEs, and recent diffusion models. In this paper, we argue that for this ill-posed problem, the prevailing learning schemes fail to model crucial inter- and intra-correlations across the different motion units, i.e., head, body, and hands, leading to unnatural movements and poor coordination. To delve into these intrinsic correlations, we propose a unified Hierarchical Implicit Periodicity (HIP) learning approach for audio-inspired 3D gesture generation. Unlike predominant research, our approach models this multi-modal implicit relationship through two explicit technical insights: i) to disentangle the complicated gesture movements, we first explore gesture motion phase manifolds with periodic autoencoders, imitating natural human motion from realistic distributions while incorporating non-periodic components from the current latent states for instance-level diversity; ii) to model the hierarchical relationship among face motions, body gestures, and hand movements, we drive the animation with cascaded guidance during learning. We demonstrate our approach on 3D avatars, and extensive experiments show that our method outperforms state-of-the-art co-speech gesture generation methods in both quantitative and qualitative evaluations. Code and models will be publicly available.
https://arxiv.org/abs/2512.13131
We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO) that dynamically regulates exploration-exploitation trade-offs. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M~4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains like scientific reasoning, memory tool using, and extended dialogue.
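The task-balanced estimation in (2) can be sketched under a simplifying assumption: an advantage is a reward centered on its own task's baseline, so that an easy task's uniformly high rewards do not drown out progress on a hard task. The task names and reward values below are hypothetical, and the actual AEPO objective additionally regulates entropy, which this toy omits.

```python
from collections import defaultdict
from statistics import mean

def task_balanced_advantages(samples):
    """Compute advantages relative to each task's own reward baseline.

    `samples` is a list of (task, reward) pairs. Centering per task is
    a simple stand-in for the task-specific advantage estimation
    described above: each task contributes zero-mean learning signal
    regardless of its absolute difficulty.
    """
    by_task = defaultdict(list)
    for task, reward in samples:
        by_task[task].append(reward)
    baselines = {task: mean(rewards) for task, rewards in by_task.items()}
    return [(task, reward - baselines[task]) for task, reward in samples]

# An easy QA task and a hard multi-hop task in one training batch.
batch = [("qa", 0.9), ("qa", 0.7), ("multi_hop", 0.2), ("multi_hop", 0.4)]
advantages = task_balanced_advantages(batch)
```

Without the per-task baseline, every multi-hop sample would receive a negative signal simply because the task is harder, which is one source of the reward bias mentioned above.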
https://arxiv.org/abs/2512.12967
Generalized category discovery (GCD) is an important and challenging task in open-world learning. Specifically, given some labeled data of known classes, GCD aims to cluster unlabeled data that contain both known and unknown classes. Current GCD methods based on parametric classification adopt a DINO-like pseudo-labeling strategy, where the sharpened probability output of one view is used as supervision for the other view. However, large pre-trained models have a preference for certain visual patterns, resulting in encoding spurious correlations for unlabeled data and generating noisy pseudo-labels. To address this issue, we propose a novel method with two modules: Loss Sharpness Penalty (LSP) and Dynamic Anchor Selection (DAS). LSP enhances the robustness of model parameters to small perturbations by minimizing the model's worst-case loss sharpness, which suppresses the encoding of trivial features, thereby reducing overfitting to noisy samples and improving the quality of pseudo-labels. Meanwhile, DAS selects representative samples for the unknown classes based on KNN density and class probability during model training and assigns hard pseudo-labels to them, which not only alleviates the confidence gap between known and unknown classes but also enables the model to quickly learn a more accurate feature distribution for the unknown classes, further improving clustering accuracy. Extensive experiments demonstrate that the proposed method effectively mitigates pseudo-label noise and achieves state-of-the-art results on multiple GCD benchmarks.
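A toy version of the DAS-style selection rule, under the assumptions that samples are 2D feature points, unknown-class probabilities are precomputed, and density is the inverse mean distance to the k nearest neighbors. The data, thresholds, and exact scoring are invented for illustration and are not the paper's implementation.

```python
import math

def knn_density(points, k=2):
    """Density per point: inverse of the mean distance to its k
    nearest neighbors (higher means the point sits in a dense region)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(1.0 / (sum(dists[:k]) / k + 1e-8))
    return scores

def select_anchors(points, probs, k=2, prob_threshold=0.6, n_anchors=1):
    """Pick the densest points the model already assigns confidently to
    an unknown class; these receive hard pseudo-labels."""
    density = knn_density(points, k)
    candidates = [i for i, p in enumerate(probs) if p >= prob_threshold]
    candidates.sort(key=lambda i: density[i], reverse=True)
    return candidates[:n_anchors]

# Three clustered points and one outlier, with unknown-class probabilities.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
probs = [0.9, 0.7, 0.5, 0.95]
anchors = select_anchors(points, probs)
```

Note that the confident outlier at (5, 5) is not chosen: combining density with probability is what keeps noisy but confident samples from becoming anchors.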
https://arxiv.org/abs/2512.12925
Inductive Logic Programming (ILP) provides interpretable rule learning in relational domains, yet remains limited in its ability to induce and reason with numerical constraints. Classical ILP systems operate over discrete predicates and typically rely on discretisation or hand-crafted numerical predicates, making it difficult to infer thresholds or arithmetic relations that must hold jointly across examples. Recent work has begun to address these limitations through tighter integrations of ILP with Satisfiability Modulo Theories (SMT) or specialised numerical inference mechanisms. In this paper we investigate a modular alternative that couples the ILP system PyGol with the SMT solver Z3. Candidate clauses proposed by PyGol are interpreted as quantifier-free formulas over background theories such as linear or nonlinear real arithmetic, allowing numerical parameters to be instantiated and verified by the SMT solver while preserving ILP's declarative relational bias. This supports the induction of hybrid rules that combine symbolic predicates with learned numerical constraints, including thresholds, intervals, and multi-literal arithmetic relations. We formalise this SMT-ILP setting and evaluate it on a suite of synthetic datasets designed to probe linear, relational, nonlinear, and multi-hop reasoning. The results illustrate how a modular SMT-ILP architecture can extend the expressivity of symbolic rule learning, complementing prior numerical ILP approaches while providing a flexible basis for future extensions toward richer theory-aware induction.
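To illustrate what "instantiating and verifying a numerical parameter jointly across examples" means, the sketch below searches for a single threshold in pure Python rather than calling Z3; the `weight`-style predicate and all values are hypothetical, and the actual system hands such constraints, including multi-literal arithmetic relations, to the SMT solver over real-arithmetic theories.

```python
def instantiate_threshold(positives, negatives):
    """Find t such that every positive example's value satisfies
    value >= t and every negative's satisfies value < t, i.e. a
    single parameter that must hold jointly across all examples.
    Returns None when the examples are not separable by a threshold,
    mirroring an unsat answer from the solver.
    """
    lo = max(negatives)  # t must exceed every negative value
    hi = min(positives)  # t may not exceed any positive value
    if lo < hi:
        return (lo + hi) / 2.0  # any t in (lo, hi] works; pick the midpoint
    return None

# Hypothetical task: learn "heavy(X) :- weight(X) >= t" from examples.
heavy_weights = [5.2, 6.1, 7.0]
light_weights = [1.0, 2.3, 4.4]
t = instantiate_threshold(heavy_weights, light_weights)
```

The discrete clause structure (which literals appear) still comes from the ILP search; only the numerical parameter inside the clause is delegated to this kind of joint constraint solving.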
https://arxiv.org/abs/2512.12918
Understanding videos inherently requires reasoning over both visual and auditory information. To properly evaluate Omni-Large Language Models (Omni-LLMs), which are capable of processing multi-modal information including vision and audio, an effective benchmark must comprehensively cover three key aspects: (1) multi-modal dependency (i.e., questions that cannot be answered using vision or audio alone), (2) diverse audio information types (e.g., speech, sound events), and (3) varying scene spans. However, existing datasets fall short in one or more of these dimensions, limiting strict and comprehensive evaluation. To address this gap, we introduce JointAVBench, a novel benchmark with strict audio-video correlation, spanning five cognitive dimensions, four audio information types (speech, sound events, music, vocal traits), and three scene spans (single-, cross-, and full-scene). Given the high cost of manual annotation, we propose an automated pipeline that leverages state-of-the-art vision-LLMs, audio-LLMs, and general-purpose LLMs to synthesize questions and answers that strictly require joint audio-visual understanding. We evaluate leading vision-only, audio-only, and Omni-LLMs on our dataset. Results show that even the best-performing Omni-LLM achieves an average accuracy of only 62.6%, outperforming uni-modal baselines but revealing substantial room for improvement, especially in cross-scene reasoning.
https://arxiv.org/abs/2512.12772
The rapid acceleration of scientific publishing has created substantial challenges for researchers attempting to discover, contextualize, and interpret relevant literature. Traditional keyword-based search systems provide limited semantic understanding, while existing AI-driven tools typically focus on isolated tasks such as retrieval, clustering, or bibliometric visualization. This paper presents an integrated system for scientific literature exploration that combines large-scale data acquisition, hybrid retrieval, semantic topic modeling, and heterogeneous knowledge graph construction. The system builds a comprehensive corpus by merging full-text data from arXiv with structured metadata from OpenAlex. A hybrid retrieval architecture fuses BM25 lexical search with embedding-based semantic search using Reciprocal Rank Fusion. Topic modeling is performed on retrieved results using BERTopic or non-negative matrix factorization depending on computational resources. A knowledge graph unifies papers, authors, institutions, countries, and extracted topics into an interpretable structure. The system provides a multi-layered exploration environment that reveals not only relevant publications but also the conceptual and relational landscape surrounding a query. Evaluation across multiple queries demonstrates improvements in retrieval relevance, topic coherence, and interpretability. The proposed framework contributes an extensible foundation for AI-assisted scientific discovery.
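The fusion step named in the abstract, Reciprocal Rank Fusion, has a standard closed form: each document scores the sum of `1/(k + rank)` across the rankings being merged. A minimal sketch, assuming the conventional constant `k = 60` (the abstract does not state the value used):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank_d).

    `rankings` is a list of ranked document-ID lists (best first); returns a
    single fused ranking. Documents missing from a ranking simply contribute
    nothing for that ranking.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a BM25 (lexical) ranking with an embedding-based (semantic) ranking.
bm25 = ["p3", "p1", "p7"]
dense = ["p1", "p9", "p3"]
fused = rrf_fuse([bm25, dense])
print(fused)
```

Because RRF operates on ranks rather than raw scores, it needs no score normalization between the lexical and semantic retrievers, which is what makes it a convenient glue for hybrid retrieval.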
https://arxiv.org/abs/2512.12760
Real-world deployment of Vision-Language Models (VLMs) is hindered by high computational demands, as existing architectures inefficiently process all tokens uniformly. We introduce Adaptive Token Pruning (ATP), a dynamic inference mechanism that retains only the most informative tokens based on contextual relevance. ATP operates at the vision-language interface, assigning a hybrid importance score combining ViT CLS attention (intra-modal saliency) and CLIP text-image similarity (inter-modal relevance) to keep top-K tokens for the LLM. Unlike static compression, ATP adapts to each input without modifying the backbone. Proposed as a lightweight gating module, ATP is compatible with popular backbones like BLIP-2, LLaVA, and Flamingo. Preliminary evaluations across VQAv2, GQA, and COCO indicate that ATP reduces inference FLOPs by around 40% and achieves roughly 1.5x speedups in end-to-end latency with negligible accuracy loss (less than 1%). Qualitative analyses suggest ATP preserves visual grounding and enhances interpretability. Beyond efficiency, we investigate robustness under corruptions; observations suggest adaptive pruning suppresses spurious correlations, improving stability. These findings imply that resource-constrained inference and model reliability are not competing objectives. Finally, we discuss ATP's role in efficient multimodal edge computing pipelines.
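The hybrid importance score at the heart of ATP can be sketched in a few lines. The abstract does not specify how the two signals are combined, so the linear mix with weight `alpha` below is an illustrative assumption, as are the function name and toy inputs:

```python
def atp_select(cls_attention, text_sim, alpha=0.5, top_k=2):
    """Toy version of Adaptive Token Pruning's gating step.

    cls_attention: per-token ViT CLS attention (intra-modal saliency).
    text_sim:      per-token CLIP text-image similarity (inter-modal relevance).
    Combines the two signals, then returns the indices of the top-K visual
    tokens to forward to the LLM, preserving their original order.
    """
    scores = [alpha * a + (1 - alpha) * s for a, s in zip(cls_attention, text_sim)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])  # keep original token order for the LLM

# Four visual tokens with their saliency and relevance scores.
attn = [0.1, 0.4, 0.3, 0.2]
sim = [0.2, 0.1, 0.5, 0.2]
print(atp_select(attn, sim, alpha=0.5, top_k=2))
```

Because the selection happens at the vision-language interface, the backbone on either side is untouched, which is why the module slots into BLIP-2-style architectures without retraining.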
https://arxiv.org/abs/2512.12701
Agentic memory is emerging as a key enabler for large language models (LLMs) to maintain continuity, personalization, and long-term context in extended user interactions, capabilities that are critical for deploying LLMs as truly interactive and adaptive agents. Agentic memory refers to the memory that provides an LLM with agent-like persistence: the ability to retain and act upon information across conversations, similar to how a human would. We present Memoria, a modular memory framework that augments LLM-based conversational systems with persistent, interpretable, and context-rich memory. Memoria integrates two complementary components: dynamic session-level summarization and a weighted knowledge graph (KG)-based user modelling engine that incrementally captures user traits, preferences, and behavioral patterns as structured entities and relationships. This hybrid architecture enables both short-term dialogue coherence and long-term personalization while operating within the token constraints of modern LLMs. We demonstrate how Memoria enables scalable, personalized conversational artificial intelligence (AI) by bridging the gap between stateless LLM interfaces and agentic memory systems, offering a practical solution for industry applications requiring adaptive and evolving user experiences.
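The incremental, weighted KG component can be illustrated with a toy structure: edges are (subject, relation, object) triples whose weights grow as a trait or preference recurs across sessions, so the most-reinforced facts surface first when the memory is queried. Everything below (class name, relation labels, the additive weighting rule) is an illustrative assumption, not Memoria's actual schema:

```python
class UserKG:
    """Toy weighted knowledge graph for user modelling.

    Each (subject, relation, object) edge carries a weight that is
    incremented every time the fact is re-observed in a session, giving
    frequently reinforced traits precedence at retrieval time.
    """

    def __init__(self):
        self.edges = {}  # (subject, relation, object) -> weight

    def observe(self, subj, rel, obj, weight=1.0):
        key = (subj, rel, obj)
        self.edges[key] = self.edges.get(key, 0.0) + weight

    def top_facts(self, subj, n=3):
        facts = [(k, w) for k, w in self.edges.items() if k[0] == subj]
        return [k for k, _ in sorted(facts, key=lambda kv: kv[1], reverse=True)[:n]]

kg = UserKG()
kg.observe("alice", "prefers", "vegetarian food")
kg.observe("alice", "prefers", "vegetarian food")  # reinforced in a later session
kg.observe("alice", "works_in", "genomics")
print(kg.top_facts("alice", n=1))
```

Pairing such a graph with session-level summaries lets the system inject only the highest-weight facts into the prompt, staying within the LLM's token budget.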
https://arxiv.org/abs/2512.12686
Legal relations form a highly consequential analytical framework of the civil law system, serving as a crucial foundation for resolving disputes and realizing the values of the rule of law in judicial practice. However, legal relations in Chinese civil cases remain underexplored in the field of legal artificial intelligence (legal AI), largely due to the absence of comprehensive schemas. In this work, we first introduce a comprehensive schema, which contains a hierarchical taxonomy and definitions of arguments, for AI systems to capture legal relations in Chinese civil cases. Based on this schema, we then formulate the legal relation extraction task and present LexRel, an expert-annotated benchmark for legal relation extraction in Chinese civil law. We use LexRel to evaluate state-of-the-art large language models (LLMs) on legal relation extraction, showing that current LLMs exhibit significant limitations in accurately identifying civil legal relations. Furthermore, we demonstrate that incorporating legal relation information leads to consistent performance gains on other downstream legal AI tasks.
https://arxiv.org/abs/2512.12643