Hierarchical land cover and land use (LCLU) classification aims to assign pixel-wise labels with multiple levels of semantic granularity to remote sensing (RS) imagery. However, existing deep learning-based methods face two major challenges: 1) They predominantly adopt a flat classification paradigm, which limits their ability to generate end-to-end multi-granularity hierarchical predictions aligned with tree-structured hierarchies used in practice. 2) Most cross-domain studies focus on performance degradation caused by sensor or scene variations, with limited attention to transferring LCLU models to cross-domain tasks with heterogeneous hierarchies (e.g., LCLU to crop classification). These limitations hinder the flexibility and generalization of LCLU models in practical applications. To address these challenges, we propose HieraRS, a novel hierarchical interpretation paradigm that enables multi-granularity predictions and supports the efficient transfer of LCLU models to cross-domain tasks with heterogeneous tree-structured hierarchies. We introduce the Bidirectional Hierarchical Consistency Constraint Mechanism (BHCCM), which can be seamlessly integrated into mainstream flat classification models to generate hierarchical predictions, while improving both semantic consistency and classification accuracy. Furthermore, we present TransLU, a dual-branch cross-domain transfer framework comprising two key components: Cross-Domain Knowledge Sharing (CDKS) and Cross-Domain Semantic Alignment (CDSA). TransLU supports dynamic category expansion and facilitates the effective adaptation of LCLU models to heterogeneous hierarchies. In addition, we construct MM-5B, a large-scale multi-modal hierarchical land use dataset featuring pixel-wise annotations. The code and MM-5B dataset will be released at: this https URL.
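A minimal sketch (my own reading, not the paper's implementation) of how a bidirectional hierarchical consistency term like BHCCM could couple a coarse and a fine prediction head: fine-class probabilities are aggregated bottom-up through a fixed parent map and compared with the coarse head, while a top-down term keeps child probabilities bounded by their parent. The two-level hierarchy and the `parent_of` mapping below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative 2-level hierarchy: 6 fine classes grouped under 3 coarse classes.
parent_of = torch.tensor([0, 0, 1, 1, 2, 2])  # fine index -> coarse index

def hierarchical_consistency_loss(coarse_logits, fine_logits):
    """Penalize disagreement between the coarse head and fine probabilities
    aggregated bottom-up through the tree, plus a top-down bound that keeps
    each fine class from exceeding its parent's probability mass."""
    coarse_prob = F.softmax(coarse_logits, dim=1)          # (B, 3, H, W)
    fine_prob = F.softmax(fine_logits, dim=1)              # (B, 6, H, W)
    # Bottom-up aggregation: parent prob = sum of its children's probs.
    agg = torch.zeros_like(coarse_prob)
    agg.index_add_(1, parent_of, fine_prob)                # (B, 3, H, W)
    bottom_up = F.kl_div(agg.clamp_min(1e-8).log(), coarse_prob,
                         reduction="batchmean")
    # Top-down: a fine class should not be more probable than its parent.
    top_down = F.relu(fine_prob - coarse_prob[:, parent_of]).mean()
    return bottom_up + top_down

# Toy usage with random segmentation-head outputs.
coarse = torch.randn(2, 3, 8, 8)
fine = torch.randn(2, 6, 8, 8)
print(hierarchical_consistency_loss(coarse, fine))
```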
https://arxiv.org/abs/2507.08741
Multiple Instance Learning (MIL) offers a natural solution for settings where only coarse, bag-level labels are available, without access to instance-level annotations. This is usually the case in digital pathology, which involves gigapixel-sized images. While deterministic attention-based MIL approaches achieve strong bag-level performance, they often overlook the uncertainty inherent in instance relevance. In this paper, we address the lack of uncertainty quantification in instance-level attention scores by introducing \textbf{SGPMIL}, a new probabilistic attention-based MIL framework grounded in Sparse Gaussian Processes (SGP). By learning a posterior distribution over attention scores, SGPMIL enables principled uncertainty estimation, resulting in more reliable and calibrated instance relevance maps. Our approach not only preserves competitive bag-level performance but also significantly improves the quality and interpretability of instance-level predictions under uncertainty. SGPMIL extends prior work by introducing feature scaling in the SGP predictive mean function, leading to faster training, improved efficiency, and enhanced instance-level performance. Extensive experiments on multiple well-established digital pathology datasets highlight the effectiveness of our approach across both bag- and instance-level evaluations. Our code will be made publicly available.
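A toy sketch of the idea of placing a distribution over attention scores: here the sparse-GP posterior is replaced by a simple learned Gaussian over attention logits (an assumption made only to keep the example short), and Monte Carlo sampling yields a mean bag prediction plus per-instance attention uncertainty.

```python
import torch
import torch.nn as nn

class StochasticAttentionMIL(nn.Module):
    """Attention-MIL head with a Gaussian over attention logits, a stand-in
    for SGPMIL's sparse-GP posterior. MC samples give a bag prediction and
    per-instance relevance uncertainty."""
    def __init__(self, dim=256, n_classes=2):
        super().__init__()
        self.mu = nn.Linear(dim, 1)        # mean of attention logit
        self.log_var = nn.Linear(dim, 1)   # log-variance of attention logit
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, x, n_samples=16):    # x: (n_instances, dim), one bag
        mu, log_var = self.mu(x), self.log_var(x)
        eps = torch.randn(n_samples, *mu.shape)
        logits = mu + torch.exp(0.5 * log_var) * eps        # (S, N, 1)
        attn = torch.softmax(logits, dim=1)                 # normalize over instances
        bag = (attn * x).sum(dim=1)                         # (S, dim)
        bag_logits = self.classifier(bag).mean(dim=0)       # average over samples
        return bag_logits, attn.mean(dim=0), attn.std(dim=0)

bag = torch.randn(100, 256)                                 # 100 patch embeddings
logits, relevance, uncertainty = StochasticAttentionMIL()(bag)
print(logits.shape, relevance.shape, uncertainty.shape)
```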
https://arxiv.org/abs/2507.08711
Collaborative fairness is a crucial challenge in federated learning. However, existing approaches often overlook a practical yet complex form of heterogeneity: imbalanced covariate shift. We provide a theoretical analysis of this setting, which motivates the design of FedAKD (Federated Asynchronous Knowledge Distillation), a simple yet effective approach that balances accurate prediction with collaborative fairness. FedAKD consists of client and server updates. In the client update, we introduce a novel asynchronous knowledge distillation strategy based on our preliminary analysis, which reveals that while correctly predicted samples exhibit similar feature distributions across clients, incorrectly predicted samples show significant variability. This suggests that imbalanced covariate shift primarily arises from misclassified samples. Leveraging this insight, our approach first applies traditional knowledge distillation to update client models while keeping the global model fixed. Next, we select correctly predicted high-confidence samples and update the global model using these samples while keeping client models fixed. The server update simply aggregates all client models. We further provide a theoretical proof of FedAKD's convergence. Experimental results on public datasets (FashionMNIST and CIFAR10) and a real-world Electronic Health Records (EHR) dataset demonstrate that FedAKD significantly improves collaborative fairness, enhances predictive accuracy, and fosters client participation even under highly heterogeneous data distributions.
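A hedged sketch of the alternating client-side procedure as described above: step 1 distills the fixed global model into the client model alongside the supervised loss, and step 2 updates the global model only on confident, correctly predicted samples while the client model stays fixed. Temperature, confidence threshold, and optimizer details are my assumptions, not the paper's.

```python
import torch
import torch.nn.functional as F

def fedakd_client_update(client, global_model, loader, opt_c, opt_g,
                         T=2.0, conf_thresh=0.9):
    """One illustrative round of the asynchronous client-side update."""
    for x, y in loader:
        # Step 1: client learns from data plus global-model soft targets.
        with torch.no_grad():
            t_logits = global_model(x)
        s_logits = client(x)
        kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                      F.softmax(t_logits / T, dim=1),
                      reduction="batchmean") * T * T
        loss_c = F.cross_entropy(s_logits, y) + kd
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()

        # Step 2: global model learns only from confident correct samples.
        with torch.no_grad():
            probs = F.softmax(client(x), dim=1)
            conf, pred = probs.max(dim=1)
            keep = (pred == y) & (conf > conf_thresh)
        if keep.any():
            loss_g = F.cross_entropy(global_model(x[keep]), y[keep])
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```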
https://arxiv.org/abs/2507.08617
Vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition but degrade significantly under \textit{temporally evolving distribution shifts} common in real-world scenarios (e.g., gradual illumination or seasonal changes). Existing continual test-time adaptation (CTTA) methods are typically built around sudden and severe distribution shifts and neglect temporal continuity, leading to three core defects: limited memory cache restricts long-range distribution modeling, causing catastrophic forgetting; entropy-based confidence becomes unreliable under temporal drift, worsening error accumulation; and static visual representations misalign with evolving inputs. We formalize this practical problem as \textit{Continual-Temporal Test-Time Adaptation (CT-TTA)}, where test distributions evolve gradually over time. To address it, we propose \textit{BayesTTA}, a Bayesian adaptation framework that enforces temporally consistent predictions and dynamically aligns visual representations. Specifically, BayesTTA incrementally estimates class-conditional Gaussian mixture distributions without storing raw data, adaptively selects covariance structures through statistical hypothesis testing, and performs calibrated inference using Gaussian discriminant analysis (GDA). These calibrated predictions supervise self-paced adaptation of normalization layers, ensuring efficient and stable representation alignment. We establish a comprehensive CT-TTA benchmark across four temporally evolving datasets and further evaluate generalization on ten standard TTA datasets. Extensive experiments show that BayesTTA consistently outperforms state-of-the-art methods, achieving significant gains while maintaining efficiency. Code is available at \href{this https URL}{this https URL}.
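A simplified sketch of the incremental estimation plus GDA inference step: per-class means and a pooled covariance are updated in a streaming fashion without storing raw features, and class logits come from Mahalanobis distances plus log-priors. The paper additionally selects covariance structure by hypothesis testing and uses the calibrated predictions to adapt normalization layers; a single pooled covariance is assumed here only to keep the sketch short.

```python
import numpy as np

class StreamingGDA:
    """Running class-conditional Gaussian estimates plus GDA inference."""
    def __init__(self, n_classes, dim):
        self.n = np.zeros(n_classes)
        self.mean = np.zeros((n_classes, dim))
        self.scatter = np.zeros((dim, dim))          # pooled within-class scatter

    def update(self, feats, labels):                 # feats: (B, dim)
        for x, c in zip(feats, labels):
            self.n[c] += 1
            delta = x - self.mean[c]
            self.mean[c] += delta / self.n[c]
            self.scatter += np.outer(delta, x - self.mean[c])   # Welford-style

    def predict_logits(self, feats):
        cov = self.scatter / max(self.n.sum() - 1, 1) + 1e-3 * np.eye(self.mean.shape[1])
        prec = np.linalg.inv(cov)
        prior = np.log(self.n / self.n.sum() + 1e-12)
        diff = feats[:, None, :] - self.mean[None, :, :]        # (B, C, D)
        maha = np.einsum("bcd,de,bce->bc", diff, prec, diff)
        return prior - 0.5 * maha                                # unnormalized log-posterior

gda = StreamingGDA(n_classes=3, dim=8)
gda.update(np.random.randn(50, 8), np.random.randint(0, 3, 50))
print(gda.predict_logits(np.random.randn(4, 8)).argmax(axis=1))
```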
https://arxiv.org/abs/2507.08607
Semantic segmentation relies on many dense pixel-wise annotations to achieve the best performance, but owing to the difficulty of obtaining accurate annotations for real world data, practitioners train on large-scale synthetic datasets. Unpaired image translation is one method used to address the ensuing domain gap by generating more realistic training data in low-data regimes. Current methods for unpaired image translation train generative adversarial networks (GANs) to perform the translation and enforce pixel-level semantic matching through cycle consistency. These methods do not guarantee that the semantic matching holds, posing a problem for semantic segmentation where performance is sensitive to noisy pixel labels. We propose a novel image translation method, Domain Adversarial Kernel Prediction Network (DA-KPN), that guarantees semantic matching between the synthetic label and translation. DA-KPN estimates pixel-wise input transformation parameters of a lightweight and simple translation function. To ensure the pixel-wise transformation is realistic, DA-KPN uses multi-scale discriminators to distinguish between translated and target samples. We show DA-KPN outperforms previous GAN-based methods on syn2real benchmarks for semantic segmentation with limited access to real image labels and achieves comparable performance on face parsing.
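To make the "pixel-wise parameters of a lightweight translation function" idea concrete, here is a toy parameter-prediction translator: a small network outputs a per-pixel, per-channel gain and bias that are applied to the synthetic image, so geometry and therefore label alignment are untouched. DA-KPN's actual translation function and its multi-scale adversarial training are more involved; this only shows the core mechanism, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class PixelwiseAffineTranslator(nn.Module):
    """Predicts per-pixel gain and bias and translates as gain * x + bias."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.param_net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2 * in_ch, 3, padding=1),   # per-pixel gain & bias
        )

    def forward(self, x):
        gain, bias = self.param_net(x).chunk(2, dim=1)
        # Bounded gain/bias keep the translated image in a sensible range.
        return torch.sigmoid(gain) * 2.0 * x + 0.5 * torch.tanh(bias)

syn = torch.rand(1, 3, 64, 64)               # synthetic RGB image in [0, 1]
translated = PixelwiseAffineTranslator()(syn)
print(translated.shape)                      # same spatial layout as the input
```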
https://arxiv.org/abs/2507.08554
Essays are considered a valuable mechanism for evaluating learning outcomes in writing. Textual cohesion is an essential characteristic of a text, as it facilitates the establishment of meaning between its parts. Automatically scoring cohesion in essays presents a challenge in the field of educational artificial intelligence. The machine learning algorithms used to evaluate texts generally do not consider the individual characteristics of the instances that comprise the analysed corpus. In this sense, item response theory can be adapted to the context of machine learning, characterising the ability, difficulty and discrimination of the models used. This work proposes and analyses the performance of a cohesion score prediction approach based on item response theory to adjust the scores generated by machine learning models. In this study, the corpus selected for the experiments consisted of the extended Essay-BR, which includes 6,563 essays in the style of the National High School Exam (ENEM), and the Brazilian Portuguese Narrative Essays, comprising 1,235 essays written by 5th to 9th grade students from public schools. We extracted 325 linguistic features and treated the problem as a machine learning regression task. The experimental results indicate that the proposed approach outperforms conventional machine learning models and ensemble methods in several evaluation metrics. This research explores a potential approach for improving the automatic evaluation of cohesion in educational essays.
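For readers unfamiliar with item response theory, the building block such an adaptation rests on is the two-parameter logistic (2PL) item characteristic curve, shown below. The interpretation of ability, difficulty, and discrimination for models and essays follows the abstract; the numbers are purely illustrative.

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve:
    P(correct | ability theta) = 1 / (1 + exp(-a * (theta - b))),
    with discrimination a and difficulty b. In the adaptation sketched
    above, 'items' are essays and 'respondents' are regression models."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

model_ability = 0.8          # illustrative latent ability of one regressor
essay_difficulty = 0.2       # illustrative difficulty of one essay
essay_discrimination = 1.5   # illustrative discrimination of the same essay
print(icc_2pl(model_ability, essay_discrimination, essay_difficulty))
```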
https://arxiv.org/abs/2507.08487
This study investigates the potential of a multimodal large language model (LLM), specifically ChatGPT-4o, to perform human-like interpretations of traffic scenes using static dashcam images. Herein, we focus on three judgment tasks relevant to elderly driver assessments: evaluating traffic density, assessing intersection visibility, and recognizing stop signs. These tasks require contextual reasoning rather than simple object detection. Using zero-shot, few-shot, and multi-shot prompting strategies, we evaluated the performance of the model with human annotations serving as the reference standard. Evaluation metrics included precision, recall, and F1-score. Results indicate that prompt design considerably affects performance, with recall for intersection visibility increasing from 21.7% (zero-shot) to 57.0% (multi-shot). For traffic density, agreement increased from 53.5% to 67.6%. In stop-sign detection, the model demonstrated high precision (up to 86.3%) but a lower recall (approximately 76.7%), indicating a conservative response tendency. Output stability analysis revealed that humans and the model faced difficulties interpreting structurally ambiguous scenes. However, the model's explanatory texts corresponded with its predictions, enhancing interpretability. These findings suggest that, with well-designed prompts, LLMs hold promise as supportive tools for scene-level driving risk assessments. Future studies should explore scalability using larger datasets, diverse annotators, and next-generation model architectures for elderly driver assessments.
https://arxiv.org/abs/2507.08367
In-context learning (ICL) is emerging as a promising technique for achieving universal medical image segmentation, where a variety of objects of interest across imaging modalities can be segmented using a single model. Nevertheless, its performance is highly sensitive to the alignment between the query image and in-context image-mask pairs. In a clinical scenario, the scarcity of annotated medical images makes it challenging to select optimal in-context pairs, and fine-tuning foundation ICL models on contextual data is infeasible due to computational costs and the risk of catastrophic forgetting. To address this challenge, we propose Cycle Context Verification (CCV), a novel framework that enhances ICL-based medical image segmentation by enabling self-verification of predictions and accordingly enhancing contextual alignment. Specifically, CCV employs a cyclic pipeline in which the model initially generates a segmentation mask for the query image. Subsequently, the roles of the query and an in-context pair are swapped, allowing the model to validate its prediction by predicting the mask of the original in-context image. The accuracy of this secondary prediction serves as an implicit measure of the initial query segmentation. A query-specific prompt is introduced to alter the query image and updated to improve the measure, thereby enhancing the alignment between the query and in-context pairs. We evaluated CCV on seven medical image segmentation datasets using two ICL foundation models, demonstrating its superiority over existing methods. Our results highlight CCV's ability to enhance ICL-based segmentation, making it a robust solution for universal medical image segmentation. The code will be available at this https URL.
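A rough, heavily hedged sketch of the cyclic verification loop as I read the abstract: predict the query mask, swap query and context roles, and use the accuracy of re-predicting the known context mask as the signal for updating a query-specific prompt. The call signature `model(image, context_image, context_mask)` and the additive-perturbation prompt are placeholders, not the paper's interfaces.

```python
import torch

def cycle_context_verification(model, query_img, ctx_img, ctx_mask,
                               steps=10, lr=0.1):
    """Sketch of a CCV-style loop with a learnable additive prompt."""
    prompt = torch.zeros_like(query_img, requires_grad=True)
    opt = torch.optim.Adam([prompt], lr=lr)
    for _ in range(steps):
        # Forward cycle: segment the (prompted) query given the context pair.
        query_mask = model(query_img + prompt, ctx_img, ctx_mask)
        # Reverse cycle: swap roles and re-predict the known context mask.
        ctx_mask_hat = model(ctx_img, query_img + prompt, query_mask)
        # The better the reverse prediction, the better the query segmentation.
        loss = torch.nn.functional.binary_cross_entropy(ctx_mask_hat, ctx_mask)
        opt.zero_grad(); loss.backward(); opt.step()
    return model(query_img + prompt, ctx_img, ctx_mask)
```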
https://arxiv.org/abs/2507.08357
Deep learning has shown remarkable performance in integrating multimodal data for survival prediction. However, existing multimodal methods mainly focus on single cancer types and overlook the challenge of generalization across cancers. In this work, we are the first to reveal that multimodal prognosis models often generalize worse than unimodal ones in cross-cancer scenarios, despite the critical need for such robustness in clinical practice. To address this, we propose a new task: Cross-Cancer Single Domain Generalization for Multimodal Prognosis, which evaluates whether models trained on a single cancer type can generalize to unseen cancers. We identify two key challenges: degraded features from weaker modalities and ineffective multimodal integration. To tackle these, we introduce two plug-and-play modules: Sparse Dirac Information Rebalancer (SDIR) and Cancer-aware Distribution Entanglement (CADE). SDIR mitigates the dominance of strong features by applying Bernoulli-based sparsification and Dirac-inspired stabilization to enhance weaker modality signals. CADE, designed to synthesize the target domain distribution, fuses local morphological cues and global gene expression in latent space. Experiments on a four-cancer-type benchmark demonstrate superior generalization, laying the foundation for practical, robust cross-cancer multimodal prognosis. Code is available at this https URL
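A small illustration of the rebalancing intuition behind SDIR: randomly sparsify the dominant modality's features with a Bernoulli mask during training (rescaled like inverted dropout) so the weaker modality contributes more to the fused representation. The paper's module also includes a Dirac-inspired stabilization term whose exact form the abstract does not specify, so it is omitted here; modality names and dimensions are assumptions.

```python
import torch

def sparse_rebalance(strong_feat, weak_feat, keep_prob=0.6, training=True):
    """Bernoulli sparsification of the strong modality before fusion."""
    if training:
        mask = torch.bernoulli(torch.full_like(strong_feat, keep_prob))
        strong_feat = strong_feat * mask / keep_prob
    return torch.cat([strong_feat, weak_feat], dim=-1)

pathology = torch.randn(4, 512)     # e.g. strong morphology features
genomics = torch.randn(4, 128)      # e.g. weaker gene-expression features
print(sparse_rebalance(pathology, genomics).shape)   # (4, 640)
```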
https://arxiv.org/abs/2507.08340
We present Kwaipilot-AutoThink (KAT), an open-source 40B large language model developed to address the overthinking problem in reasoning-intensive tasks, for which we propose an automatic thinking training paradigm that dynamically switches between reasoning and non-reasoning modes based on task complexity. Specifically, we first construct a dual-regime dataset based on a novel tagging pipeline and a multi-agent synthesis strategy, and then apply Multi-Token Prediction (MTP)-enhanced knowledge distillation, enabling efficient and fine-grained reasoning transfer with minimal pretraining cost. In addition, we implement a cold-start initialization strategy that introduces mode-selection priors using majority-vote signals and intent-aware prompting. Finally, we propose Step-SRPO, a reinforcement learning algorithm that incorporates intermediate supervision into the GRPO framework, offering structured guidance over both reasoning-mode selection and response accuracy. Extensive experiments across multiple benchmarks demonstrate that KAT consistently matches or even outperforms current state-of-the-art models, including DeepSeek-R1-0528 and Qwen3-235B-A22B, across a wide range of reasoning-intensive tasks while reducing token usage by up to approximately 30\%. Beyond academic evaluation, KAT has been successfully deployed in Kwaipilot (i.e., Kuaishou's internal coding assistant), and improves real-world development workflows with high accuracy, efficiency, and controllable reasoning behaviors. Moreover, we are actively training a 200B Mixture-of-Experts (MoE) model with 40B activation parameters, where early-stage results already demonstrate promising improvements in performance and efficiency, further showing the scalability of the AutoThink paradigm.
https://arxiv.org/abs/2507.08297
We present SurfDist, a convolutional neural network architecture for three-dimensional volumetric instance segmentation. SurfDist enables prediction of instances represented as closed surfaces composed of smooth parametric surface patches, specifically bicubic Bézier triangles. SurfDist modifies the popular StarDist-3D architecture to break the coupling between instance parameterization dimension and instance voxel resolution, and it produces predictions that may be upsampled to arbitrarily high resolutions without introducing voxelization artifacts. For datasets with blob-shaped instances, common in biomedical imaging, SurfDist can outperform StarDist-3D with more compact instance parameterizations. We detail SurfDist's technical implementation and show one synthetic and one real-world dataset on which it outperforms StarDist-3D. These results demonstrate that interpretable instance surface models can be learned effectively alongside instance membership.
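For intuition about the surface primitive involved, here is the evaluation of a single cubic Bézier triangle patch in barycentric coordinates, the kind of smooth parametric patch the abstract describes. The control-point layout is the standard one for degree-3 Bézier triangles, not code taken from the paper.

```python
import numpy as np
from math import factorial

def bezier_triangle_point(ctrl, u, v, w):
    """Evaluate a cubic (degree-3) Bezier triangle at barycentric (u, v, w),
    with u + v + w = 1. `ctrl` maps (i, j, k), i + j + k = 3, to a 3D
    control point; there are 10 such points per patch."""
    p = np.zeros(3)
    for (i, j, k), cp in ctrl.items():
        coeff = factorial(3) / (factorial(i) * factorial(j) * factorial(k))
        p += coeff * (u ** i) * (v ** j) * (w ** k) * np.asarray(cp, float)
    return p

# Illustrative patch: the 10 control points of a gently curved triangle.
ctrl = {(i, j, k): [i, j, 0.2 * i * j * k]
        for i in range(4) for j in range(4) for k in range(4) if i + j + k == 3}
print(bezier_triangle_point(ctrl, 1/3, 1/3, 1/3))   # point near the patch centre
```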
https://arxiv.org/abs/2507.08223
While total intracranial carotid artery calcification (ICAC) volume is an established stroke biomarker, growing evidence shows this aggregate metric ignores the critical influence of plaque location, since calcification in different segments carries distinct prognostic and procedural risks. However, a finer-grained, segment-specific quantification has remained technically infeasible. Conventional 3D models are forced to process downsampled volumes or isolated patches, sacrificing the global context required to resolve anatomical ambiguity and render reliable landmark localization. To overcome this, we reformulate the 3D challenge as a \textbf{Parallel Probabilistic Landmark Localization} task along the 1D axial dimension. We propose the \textbf{Depth-Sequence Transformer (DST)}, a framework that processes full-resolution CT volumes as sequences of 2D slices, learning to predict $N=6$ independent probability distributions that pinpoint key anatomical landmarks. Our DST framework demonstrates exceptional accuracy and robustness. Evaluated on a 100-patient clinical cohort with rigorous 5-fold cross-validation, it achieves a Mean Absolute Error (MAE) of \textbf{0.1 slices}, with \textbf{96\%} of predictions falling within a $\pm1$ slice tolerance. Furthermore, to validate its architectural power, the DST backbone establishes the best result on the public Clean-CC-CCII classification benchmark under an end-to-end evaluation protocol. Our work delivers the first practical tool for automated segment-specific ICAC analysis. The proposed framework provides a foundation for further studies on the role of location-specific biomarkers in diagnosis, prognosis, and procedural planning. Our code will be made publicly available.
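A minimal sketch of the output side of this formulation: given one feature vector per axial slice, a head emits six independent probability distributions over slices and decodes each to a landmark position (here by expectation). The transformer backbone that produces the per-slice features, and the training loss, are the paper's; dimensions and the decoding choice below are assumptions.

```python
import torch
import torch.nn as nn

class LandmarkSliceHead(nn.Module):
    """Parallel probabilistic landmark localization over the axial axis."""
    def __init__(self, dim=256, n_landmarks=6):
        super().__init__()
        self.score = nn.Linear(dim, n_landmarks)

    def forward(self, slice_feats):                 # (n_slices, dim)
        logits = self.score(slice_feats)            # (n_slices, n_landmarks)
        probs = logits.softmax(dim=0)               # distribution over slices
        idx = torch.arange(slice_feats.shape[0], dtype=torch.float32)
        expected = (probs * idx[:, None]).sum(dim=0)        # soft slice index
        return probs, expected

feats = torch.randn(320, 256)                       # e.g. 320 axial slices
probs, landmark_slices = LandmarkSliceHead()(feats)
print(landmark_slices)                              # 6 predicted slice positions
```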
https://arxiv.org/abs/2507.08214
Generative Large Language Models (LLMs) inevitably produce untruthful responses. Accurately predicting the truthfulness of these outputs is critical, especially in high-stakes settings. To accelerate research in this domain and make truthfulness prediction methods more accessible, we introduce TruthTorchLM, an open-source, comprehensive Python library featuring over 30 truthfulness prediction methods, which we refer to as Truth Methods. Unlike existing toolkits such as Guardrails, which focus solely on document-grounded verification, or LM-Polygraph, which is limited to uncertainty-based methods, TruthTorchLM offers a broad and extensible collection of techniques. These methods span diverse tradeoffs in computational cost, access level (e.g., black-box vs. white-box), grounding document requirements, and supervision type (self-supervised or supervised). TruthTorchLM is seamlessly compatible with both HuggingFace and LiteLLM, enabling support for locally hosted and API-based models. It also provides a unified interface for generation, evaluation, calibration, and long-form truthfulness prediction, along with a flexible framework for extending the library with new methods. We conduct an evaluation of representative truth methods on three datasets: TriviaQA, GSM8K, and FactScore-Bio. The code is available at this https URL
https://arxiv.org/abs/2507.08203
While multiple instance learning (MIL) has been shown to be a promising approach for histopathological whole slide image (WSI) analysis, its reliance on permutation invariance significantly limits its capacity to effectively uncover semantic correlations between instances within WSIs. Based on our empirical and theoretical investigations, we argue that approaches that are not permutation-invariant but better capture spatial correlations between instances can offer more effective solutions. In light of these findings, we propose a novel alternative to existing MIL for WSI analysis by learning to restore the order of instances from their randomly shuffled arrangement. We term this task cracking an instance jigsaw puzzle, in which semantic correlations between instances are uncovered. To tackle the instance jigsaw puzzles, we propose a novel Siamese network solution, which is theoretically justified by optimal transport theory. We validate the proposed method on WSI classification and survival prediction tasks, where it outperforms recent state-of-the-art MIL competitors. The code is available at this https URL.
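To make the jigsaw training signal concrete, here is a stripped-down version: shuffle ordered instance features and train a head to predict each instance's original position. The paper's Siamese, optimal-transport-justified solver is replaced by a plain position classifier purely for illustration; dimensions and the cross-entropy objective are assumptions.

```python
import torch
import torch.nn as nn

class JigsawRestorer(nn.Module):
    """Minimal 'instance jigsaw' objective: recover original instance order."""
    def __init__(self, dim=256, n_positions=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.position_head = nn.Linear(dim, n_positions)

    def forward(self, instances):                    # (n_positions, dim), in order
        perm = torch.randperm(instances.shape[0])
        shuffled = instances[perm]
        logits = self.position_head(self.encoder(shuffled))
        loss = nn.functional.cross_entropy(logits, perm)   # recover original slots
        return loss

wsi_patches = torch.randn(64, 256)                   # ordered patch embeddings
print(JigsawRestorer()(wsi_patches))
```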
https://arxiv.org/abs/2507.08178
We propose Adaptive Diffusion Denoised Smoothing, a method for certifying the predictions of a vision model against adversarial examples, while adapting to the input. Our key insight is to reinterpret a guided denoising diffusion model as a long sequence of adaptive Gaussian Differentially Private (GDP) mechanisms refining a pure noise sample into an image. We show that these adaptive mechanisms can be composed through a GDP privacy filter to analyze the end-to-end robustness of the guided denoising process, yielding a provable certification that extends the adaptive randomized smoothing analysis. We demonstrate that our design, under a specific guiding strategy, can improve both certified accuracy and standard accuracy on ImageNet for an $\ell_2$ threat model.
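For context, the baseline the adaptive GDP-filter analysis generalizes is standard (non-adaptive) randomized smoothing: classify many noisy copies of the input, lower-bound the top-class probability, and convert it to a certified L2 radius sigma * Phi^{-1}(p). The sketch below shows only that baseline with a crude Hoeffding-style bound (a Clopper-Pearson bound is the usual choice); the diffusion-guided, adaptive composition itself is not reproduced.

```python
import numpy as np
from scipy.stats import norm

def smoothed_certificate(f, x, sigma=0.25, n=1000, alpha=0.001):
    """Plain randomized-smoothing certification for an L2 threat model.
    `f` maps a batch of inputs to integer class labels."""
    noisy = x[None] + sigma * np.random.randn(n, *x.shape)
    labels = f(noisy)
    top = np.bincount(labels).argmax()
    count = int((labels == top).sum())
    # Crude lower confidence bound on p(top).
    p_lower = count / n - np.sqrt(np.log(1 / alpha) / (2 * n))
    if p_lower <= 0.5:
        return top, 0.0                      # abstain from certification
    return top, sigma * norm.ppf(p_lower)    # certified L2 radius

# Toy classifier: sign of the mean pixel value.
f = lambda batch: (batch.mean(axis=tuple(range(1, batch.ndim))) > 0).astype(int)
print(smoothed_certificate(f, np.full((3, 8, 8), 0.3)))
```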
https://arxiv.org/abs/2507.08163
Red-blood-cell lysis (HC50) is the principal safety barrier for antimicrobial-peptide (AMP) therapeutics, yet existing models only say "toxic" or "non-toxic." AmpLyze closes this gap by predicting the actual HC50 value from sequence alone and explaining the residues that drive toxicity. The model couples residue-level ProtT5/ESM2 embeddings with sequence-level descriptors in dual local and global branches, aligned by a cross-attention module and trained with log-cosh loss for robustness to assay noise. The optimal AmpLyze model reaches a PCC of 0.756 and an MSE of 0.987, outperforming classical regressors and the state-of-the-art. Ablations confirm that both branches are essential, and cross-attention adds a further 1% PCC and 3% MSE improvement. Expected-Gradients attributions reveal known toxicity hotspots and suggest safer substitutions. By turning hemolysis assessment into a quantitative, sequence-based, and interpretable prediction, AmpLyze facilitates AMP design and offers a practical tool for early-stage toxicity screening.
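As a small aside on the training objective mentioned above, here is the log-cosh regression loss in a numerically stable form; it behaves like squared error for small residuals and like absolute error for large ones, which is why it is a common choice for noisy assay targets such as HC50. The values below are illustrative only.

```python
import math
import torch

def log_cosh_loss(pred, target):
    """log(cosh(d)) = |d| + log1p(exp(-2|d|)) - log(2), stable for large |d|."""
    d = (pred - target).abs()
    return (d + torch.log1p(torch.exp(-2 * d)) - math.log(2.0)).mean()

pred = torch.tensor([1.2, 0.3, 5.0])
target = torch.tensor([1.0, 0.0, 2.0])
print(log_cosh_loss(pred, target))
```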
https://arxiv.org/abs/2507.08162
Traffic accidents are rare, yet high-impact events that require long-context multimodal reasoning for accurate risk forecasting. In this paper, we introduce ALCo-FM, a unified adaptive long-context foundation model that computes a volatility pre-score to dynamically select context windows for input data and encodes and fuses these multimodal data via shallow cross attention. Following a local GAT layer and a BigBird-style sparse global transformer over H3 hexagonal grids, coupled with Monte Carlo dropout for confidence, the model yields superior, well-calibrated predictions. Trained on data from 15 US cities with a class-weighted loss to counter label imbalance, and fine-tuned with minimal data on held-out cities, ALCo-FM achieves 0.94 accuracy, 0.92 F1, and an ECE of 0.04, outperforming more than 20 state-of-the-art baselines in large-scale urban risk prediction. Code and dataset are available at: this https URL
https://arxiv.org/abs/2507.08153
Given the rapid adoption of generative AI and its potential to impact a wide range of tasks, understanding the effects of AI on the economy is one of society's most important questions. In this work, we take a step toward that goal by analyzing the work activities people do with AI and how successfully and broadly those activities are done, and combining that with data on which occupations perform those activities. We analyze a dataset of 200k anonymized and privacy-scrubbed conversations between users and Microsoft Bing Copilot, a publicly available generative AI system. We find the most common work activities people seek AI assistance for involve gathering information and writing, while the most common activities that AI itself is performing are providing information and assistance, writing, teaching, and advising. Combining these activity classifications with measurements of task success and scope of impact, we compute an AI applicability score for each occupation. We find the highest AI applicability scores for knowledge work occupation groups such as computer and mathematical, and office and administrative support, as well as occupations such as sales whose work activities involve providing and communicating information. Additionally, we characterize the types of work activities performed most successfully, how wage and education correlate with AI applicability, and how real-world usage compares to predictions of occupational AI impact.
https://arxiv.org/abs/2507.07935
Continuous, automated monitoring of laboratory mice enables more accurate data collection and improves animal welfare through real-time insights. Researchers can achieve a more dynamic and clinically relevant characterization of disease progression and therapeutic effects by integrating behavioral and physiological monitoring in the home cage. However, providing individual mouse metrics is difficult because of their housing density, similar appearances, high mobility, and frequent interactions. To address these challenges, we develop a real-time identification (ID) algorithm that accurately assigns ID predictions to mice wearing custom ear tags in digital home cages monitored by cameras. Our pipeline consists of three parts: (1) a custom multiple object tracker (MouseTracks) that combines appearance and motion cues from mice; (2) a transformer-based ID classifier (Mouseformer); and (3) a tracklet associator linear program to assign final ID predictions to tracklets (MouseMap). Our models assign an animal ID based on custom ear tags at 30 frames per second with 24/7 cage coverage. We show that our custom tracking and ID pipeline improves tracking efficiency and lowers ID switches across mouse strains and various environmental factors compared to current mouse tracking methods.
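The final assignment step described above (giving each tracklet an ear-tag identity) can be posed as a linear assignment problem over the classifier's ID probabilities. The sketch below uses scipy's Hungarian solver as a stand-in for the paper's tracklet-associator linear program; the toy probability matrix is invented for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_ids_to_tracklets(id_probs):
    """Rows: tracklets, cols: ear-tag identities. Pick the one-to-one
    assignment that maximizes total probability."""
    cost = -np.log(id_probs + 1e-12)          # maximize prob == minimize -log prob
    rows, cols = linear_sum_assignment(cost)
    return dict(zip(rows.tolist(), cols.tolist()))

# Toy example: 4 tracklets, 4 possible mouse IDs.
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.6, 0.1, 0.1],
                  [0.1, 0.2, 0.1, 0.6],
                  [0.1, 0.1, 0.7, 0.1]])
print(assign_ids_to_tracklets(probs))          # {0: 0, 1: 1, 2: 3, 3: 2}
```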
https://arxiv.org/abs/2507.07929
The safety validation of automatic emergency braking system (AEBS) requires accurately distinguishing between false positive (FP) and true positive (TP) system activations. While simulations allow straightforward differentiation by comparing scenarios with and without interventions, analyzing activations from open-loop resimulations - such as those from field operational testing (FOT) - is more complex. This complexity arises from scenario parameter uncertainty and the influence of driver interventions in the recorded data. Human labeling is frequently used to address these challenges, relying on subjective assessments of intervention necessity or situational criticality, potentially introducing biases and limitations. This work proposes a rule-based classification approach leveraging the Prediction Divergence Principle (PDP) to address those issues. Applied to a simplified AEBS, the proposed method reveals key strengths, limitations, and system requirements for effective implementation. The findings suggest that combining this approach with human labeling may enhance the transparency and consistency of classification, thereby improving the overall validation process. While the rule set for classification derived in this work adopts a conservative approach, the paper outlines future directions for refinement and broader applicability. Finally, this work highlights the potential of such methods to complement existing practices, paving the way for more reliable and reproducible AEBS validation frameworks.
https://arxiv.org/abs/2507.07872