In this paper, we propose GuideSR, a novel single-step diffusion-based image super-resolution (SR) model specifically designed to enhance image fidelity. Existing diffusion-based SR approaches typically adapt pre-trained generative models to image restoration tasks by adding extra conditioning on a VAE-downsampled representation of the degraded input, which often compromises structural fidelity. GuideSR addresses this limitation by introducing a dual-branch architecture comprising: (1) a Guidance Branch that preserves high-fidelity structures from the original-resolution degraded input, and (2) a Diffusion Branch, which employs a pre-trained latent diffusion model to enhance perceptual quality. Unlike conventional conditioning mechanisms, our Guidance Branch features a structure tailored to image restoration tasks, combining Full Resolution Blocks (FRBs) with channel attention and an Image Guidance Network (IGN) with guided attention. By embedding detailed structural information directly into the restoration pipeline, GuideSR produces sharper and more visually consistent results. Extensive experiments on benchmark datasets demonstrate that GuideSR achieves state-of-the-art performance while maintaining the low computational cost of single-step approaches, with up to a 1.39 dB PSNR gain on challenging real-world datasets. Our approach consistently outperforms existing methods across reference-based metrics including PSNR, SSIM, LPIPS, DISTS, and FID, representing a practical advancement for real-world image restoration.
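The abstract does not specify the exact form of the FRB channel attention; a minimal squeeze-and-excitation-style sketch in NumPy (the function name, shapes, and the ReLU/sigmoid bottleneck are our assumptions, not details from the paper) might look like:

```python
import numpy as np

def channel_attention(x, w1, w2):
    # x: (C, H, W) feature map; w1: (C_r, C), w2: (C, C_r) with C_r = C // reduction
    squeeze = x.mean(axis=(1, 2))                 # global average pool -> (C,)
    hidden = np.maximum(0.0, w1 @ squeeze)        # ReLU bottleneck -> (C_r,)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid channel gates -> (C,)
    return x * gate[:, None, None]                # rescale each channel

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
y = channel_attention(x, rng.standard_normal((2, 8)), rng.standard_normal((8, 2)))
```

The sigmoid gate rescales each channel of the full-resolution feature map, which is how channel attention emphasizes structurally informative channels without touching spatial resolution.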
https://arxiv.org/abs/2505.00687
Automated parking is a critical feature of Advanced Driver Assistance Systems (ADAS), where accurate trajectory prediction is essential to bridge perception and planning modules. Despite its significance, research in this domain remains relatively limited, with most existing studies concentrating on single-modal trajectory prediction of vehicles. In this work, we propose ParkDiffusion, a novel approach that predicts the trajectories of both vehicles and pedestrians in automated parking scenarios. ParkDiffusion employs diffusion models to capture the inherent uncertainty and multi-modality of future trajectories, incorporating several key innovations. First, we propose a dual map encoder that processes soft semantic cues and hard geometric constraints using a two-step cross-attention mechanism. Second, we introduce an adaptive agent type embedding module, which dynamically conditions the prediction process on the distinct characteristics of vehicles and pedestrians. Third, to ensure kinematic feasibility, our model outputs control signals that are subsequently used within a kinematic framework to generate physically feasible trajectories. We evaluate ParkDiffusion on the Dragon Lake Parking (DLP) dataset and the Intersections Drone (inD) dataset. Our work establishes a new baseline for heterogeneous trajectory prediction in parking scenarios, outperforming existing methods by a considerable margin.
https://arxiv.org/abs/2505.00586
Extending the context window in large language models (LLMs) is essential for applications involving long-form content generation. However, the linear increase in key-value (KV) cache memory requirements and the quadratic complexity of self-attention with respect to sequence length present significant challenges during fine-tuning and inference. Existing methods suffer from performance degradation when extending to longer contexts. In this work, we introduce a novel context extension method that optimizes both fine-tuning and inference efficiency. Our method exploits a key observation: in the frequency domain, the energy distribution of the KV cache is primarily concentrated in low-frequency components. By filtering out the high-frequency components, the KV cache can be effectively compressed with minimal information loss. Building on this insight, we propose an efficient compression technique, FreqKV, that iteratively compresses the increasing KV cache to a fixed size in the frequency domain, applicable to both fine-tuning and inference. FreqKV introduces no additional parameters or architectural modifications. With minimal fine-tuning, LLMs can learn to leverage the limited cache that is compressed in the frequency domain and extend the context window efficiently. Experiments on various long context language modeling and understanding tasks demonstrate the efficiency and efficacy of the proposed method.
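As an illustration of the core idea, here is a toy NumPy sketch of frequency-domain cache compression. It substitutes a real FFT low-pass for whatever transform FreqKV actually uses, so the function names and the choice of transform are assumptions; only the "keep the low-frequency coefficients, discard the rest" principle comes from the abstract:

```python
import numpy as np

def compress_kv(kv, keep):
    # kv: (T, d) key or value cache; keep only the `keep` lowest-frequency
    # coefficients along the sequence axis (an FFT stands in for the DCT here)
    return np.fft.rfft(kv, axis=0)[:keep]

def decompress_kv(spec, T):
    # zero-pad the discarded high-frequency bins and invert the transform
    full = np.zeros((T // 2 + 1, spec.shape[1]), dtype=complex)
    full[:spec.shape[0]] = spec
    return np.fft.irfft(full, n=T, axis=0)

t = np.arange(64)
kv = np.cos(2 * np.pi * t / 64)[:, None]   # smooth, low-frequency cache
recon = decompress_kv(compress_kv(kv, keep=4), T=64)
```

When the cache's energy really is concentrated in low frequencies, as the paper observes, the fixed-size compressed spectrum reconstructs the cache with negligible loss.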
https://arxiv.org/abs/2505.00570
Accurate classification of medical device risk levels is essential for regulatory oversight and clinical safety. We present a Transformer-based multimodal framework that integrates textual descriptions and visual information to predict device regulatory classification. The model incorporates a cross-attention mechanism to capture intermodal dependencies and employs a self-training strategy for improved generalization under limited supervision. Experiments on a real-world regulatory dataset demonstrate that our approach achieves up to 90.4% accuracy and 97.9% AUROC, significantly outperforming text-only (77.2%) and image-only (54.8%) baselines. Compared to standard multimodal fusion, the self-training mechanism improved SVM performance by 3.3 percentage points in accuracy (from 87.1% to 90.4%) and 1.4 points in macro-F1, suggesting that pseudo-labeling can effectively enhance generalization under limited supervision. Ablation studies further confirm the complementary benefits of both cross-modal attention and self-training.
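The self-training (pseudo-labeling) loop described above can be sketched independently of the multimodal backbone. This toy NumPy version uses a nearest-centroid classifier as a stand-in model, so every name and modeling choice here is illustrative only; the mechanism shown, adding confident predictions on unlabeled data back into the training set and refitting, is the one the abstract describes:

```python
import numpy as np

def fit_centroids(X, y):
    return np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, rounds=2):
    # pseudo-labeling: refit after absorbing confident unlabeled predictions
    X, y = X_lab, y_lab
    for _ in range(rounds):
        cents = fit_centroids(X, y)
        d = np.linalg.norm(X_unlab[:, None] - cents[None], axis=-1)  # (N, classes)
        probs = np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)
        conf, pred = probs.max(axis=1), probs.argmax(axis=1)
        keep = conf > threshold
        X = np.vstack([X_lab, X_unlab[keep]])
        y = np.concatenate([y_lab, pred[keep]])
    return fit_centroids(X, y)
```

With well-separated classes, the pseudo-labeled points pull each centroid toward the true class mean, which is the generalization effect the paper attributes to self-training under limited supervision.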
https://arxiv.org/abs/2505.00422
Predicting couriers' on-time delivery rates in advance is essential to the logistics industry, enabling companies to take preemptive measures to ensure the normal operation of delivery services. This becomes even more critical under anomalous conditions such as an epidemic outbreak, during which couriers' on-time delivery rates decline markedly and fluctuate significantly. Existing studies pay little attention to the logistics scenario. Moreover, many works focusing on prediction tasks in anomaly scenarios fail to explicitly model abnormal events, e.g., treating external factors equally with other features, resulting in great information loss. Further, since some anomalous events occur infrequently, traditional data-driven methods perform poorly in these scenarios. To deal with these issues, we propose a deep spatial-temporal attention model, named DeepSTA. Specifically, to avoid information loss, we design an anomaly spatio-temporal learning module that employs a recurrent neural network to model incident information. Additionally, we utilize Node2vec to model correlations between road districts, and adopt graph neural networks and long short-term memory networks to capture the spatial-temporal dependencies of couriers. To tackle the issue of insufficient training data in abnormal circumstances, we propose an anomaly pattern attention module that adopts a memory network to store couriers' anomalous feature patterns via attention mechanisms. Experiments on real-world logistics datasets during the COVID-19 outbreak in 2022 show that the model outperforms the best baselines by 12.11% in MAE and 13.71% in MSE, demonstrating its superior performance over multiple competitive baselines.
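The anomaly pattern attention module stores feature patterns in a memory network and retrieves them via attention. A minimal sketch of such an attention-based memory read (shapes, the dot-product scoring, and the scaling are our assumptions, not DeepSTA specifics):

```python
import numpy as np

def memory_read(query, memory):
    # memory: (n_slots, d) stored anomaly feature patterns; returns the
    # attention-weighted mixture of the slots most similar to the query
    scores = memory @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ memory

memory = np.array([[10., 0.], [0., 10.]])
read = memory_read(np.array([10., 0.]), memory)
```

Because rarely seen anomaly patterns persist in the memory slots, the model can retrieve them at prediction time even when the current training batch contains few anomalous examples.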
https://arxiv.org/abs/2505.00402
Accurately estimating package delivery time is essential to the logistics industry, enabling reasonable work allocation and on-time service guarantees. This becomes even more necessary in mixed logistics scenarios where couriers handle a high volume of deliveries and a smaller number of pickups simultaneously. However, most related works treat the effects of pickup and delivery patterns on couriers' decision behavior equally, neglecting that pickups have a greater impact on couriers' decision-making than deliveries due to their tighter time constraints. In this context, we face three main challenges: 1) multiple spatiotemporal factors are intricately interconnected, significantly affecting couriers' delivery behavior; 2) pickups have stricter time requirements but are limited in number, making it challenging to model their effects on couriers' delivery process; 3) couriers' spatial mobility patterns are critical determinants of their delivery behavior, but have been insufficiently explored. To deal with these, we propose TransPDT, a Transformer-based multi-task package delivery time prediction model. We first employ the Transformer encoder architecture to capture the spatio-temporal dependencies of couriers' historical travel routes and pending package sets. We then design a pattern memory to learn pickup patterns in the imbalanced dataset via an attention mechanism. We also set route prediction as an auxiliary task of delivery time prediction, and incorporate prior courier spatial movement regularities into the prediction. Extensive experiments on real industry-scale datasets demonstrate the superiority of our method. A system based on TransPDT is deployed internally at JD Logistics to track more than 2,000 couriers handling hundreds of thousands of packages per day in Beijing.
https://arxiv.org/abs/2505.00375
In data stream clustering, systematic theory for stream clustering algorithms remains relatively scarce. Recently, density-based methods have gained attention. However, existing algorithms struggle to simultaneously handle arbitrarily shaped, multi-density, high-dimensional data while maintaining strong outlier resistance. Clustering quality deteriorates significantly when data density varies in complex ways. This paper proposes a clustering algorithm based on the novel concept of Tightest Neighbors and introduces a data stream clustering theory based on the Skeleton Set. Building on these theories, we develop TNStream, a fully online algorithm. The algorithm adaptively determines the clustering radius based on local similarity, summarizing the evolution of multi-density data streams in micro-clusters. It then applies a Tightest Neighbors-based clustering algorithm to form the final clusters. To improve efficiency in high-dimensional cases, Locality-Sensitive Hashing (LSH) is employed to structure micro-clusters, addressing the challenge of storing k-nearest neighbors. TNStream is evaluated on various synthetic and real-world datasets using different clustering metrics. Experimental results demonstrate its effectiveness in improving clustering quality for multi-density data and validate the proposed data stream clustering theory.
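The LSH bucketing used to organize micro-clusters can be illustrated with random-hyperplane hashing, one common LSH family; the paper may well use a different family, so treat this as a generic sketch rather than TNStream's actual hash:

```python
import numpy as np

def lsh_key(x, planes):
    # sign pattern of projections onto hyperplanes -> integer bucket id;
    # nearby points tend to land on the same side of every plane and so
    # share a bucket, letting neighbor lookups scan one bucket, not all data
    bits = (planes @ x) > 0
    return sum(1 << i for i, b in enumerate(bits) if b)

planes = np.array([[1., 0.], [0., 1.]])
key = lsh_key(np.array([1., 1.]), planes)
```

With more planes the buckets get finer, trading recall of true neighbors for smaller candidate sets, which is the usual LSH tuning knob for high-dimensional data.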
https://arxiv.org/abs/2505.00359
The deployment of deep neural networks on resource-constrained devices necessitates effective model compression strategies that judiciously balance the reduction of model size with the preservation of performance. This study introduces a novel safety-driven quantization framework that leverages preservation sets to systematically prune and quantize neural network weights, thereby optimizing model complexity without compromising accuracy. The proposed methodology is rigorously evaluated on both a convolutional neural network (CNN) and an attention-based language model, demonstrating its applicability across diverse architectural paradigms. Experimental results reveal that our framework achieves up to a 2.5% enhancement in test accuracy relative to the original unquantized models while maintaining 60% of the initial model size. In comparison to conventional quantization techniques, our approach not only augments generalization by eliminating parameter noise and retaining essential weights but also reduces variance, thereby ensuring the retention of critical model features. These findings underscore the efficacy of safety-driven quantization as a robust and reliable strategy for the efficient optimization of deep learning models. The implementation and comprehensive experimental evaluations of our framework are publicly available on GitHub.
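The abstract does not give the concrete pruning or quantization rules, so here is a generic magnitude-pruning plus uniform-quantization sketch with a full-precision preservation set; all thresholds, names, and the 8-bit symmetric scheme are our assumptions:

```python
import numpy as np

def prune_and_quantize(w, keep_frac=0.6, n_bits=8, preserve_mask=None):
    # zero the smallest-magnitude weights, uniformly quantize the survivors,
    # and leave weights in the preservation set untouched at full precision
    thresh = np.quantile(np.abs(w), 1.0 - keep_frac)
    pruned = np.where(np.abs(w) >= thresh, w, 0.0)
    max_abs = np.abs(pruned).max()
    scale = max_abs / (2 ** (n_bits - 1) - 1) if max_abs > 0 else 1.0
    q = np.round(pruned / scale) * scale
    if preserve_mask is not None:
        q = np.where(preserve_mask, w, q)
    return q

w = np.array([1.0, 0.01, -2.0, 0.02])
q = prune_and_quantize(w, keep_frac=0.5)
```

The preservation-set mask is where a safety-driven scheme differs from plain compression: weights identified as critical bypass both pruning and rounding.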
https://arxiv.org/abs/2505.00350
Recent advances in large language models have highlighted the excessive quadratic cost of self-attention. Despite significant research efforts, subquadratic attention methods still suffer from inferior performance in practice. We hypothesize that dynamic, learned content-based sparsity can lead to more efficient attention mechanisms. We present Mixture of Sparse Attention (MoSA), a novel approach inspired by Mixture of Experts (MoE) with expert-choice routing. MoSA dynamically selects tokens for each attention head, allowing arbitrary sparse attention patterns. By selecting $k$ tokens from a sequence of length $T$, MoSA reduces the computational complexity of each attention head from $O(T^2)$ to $O(k^2 + T)$. This enables using more heads within the same computational budget, allowing higher specialization. We show that among the tested sparse attention variants, MoSA is the only one that can outperform the dense baseline, sometimes with up to 27% better perplexity for an identical compute budget. MoSA can also reduce resource usage compared to dense self-attention. Despite using a plain PyTorch implementation without an optimized kernel, perplexity-matched MoSA models are simultaneously faster in wall-clock time, require less memory for training, and drastically reduce the size of the KV-cache compared to dense transformer baselines.
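A single MoSA-style head with expert-choice routing can be sketched as follows. This toy NumPy version omits causal masking, training, and any load-balancing details of the router, so it only illustrates the selection idea behind the $O(k^2 + T)$ cost (the $O(T)$ term is the routing pass over all tokens, the $O(k^2)$ term the attention among the chosen ones):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mosa_head(x, w_route, wq, wk, wv, k):
    # expert-choice routing: this head scores all T tokens and keeps its top-k,
    # so attention is computed over k tokens instead of the full sequence
    idx = np.sort(np.argsort(x @ w_route)[-k:])          # selected token indices
    xs = x[idx]                                          # (k, d)
    q, key, v = xs @ wq, xs @ wk, xs @ wv
    att = softmax(q @ key.T / np.sqrt(q.shape[-1]))      # (k, k)
    out = np.zeros_like(x)
    out[idx] = att @ v                                   # unselected tokens get zero
    return out, idx
```

Because each head routes independently, different heads can specialize on different token subsets, which is what lets a fixed compute budget afford more heads.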
https://arxiv.org/abs/2505.00315
Chronic diseases, including diabetes, hypertension, asthma, HIV/AIDS, epilepsy, and tuberculosis, necessitate rigorous adherence to medication to avert disease progression, manage symptoms, and decrease mortality rates. Adherence is frequently undermined by factors including patient behavior, caregiver support, elevated medical costs, and insufficient healthcare infrastructure. We propose AdCare-VLM, a specialized Video-LLaVA-based multimodal large vision language model (LVLM) aimed at visual question answering (VQA) concerning medication adherence through patient videos. We employ a private dataset comprising 806 custom-annotated tuberculosis (TB) medication monitoring videos, labeled by clinical experts, to fine-tune the model for adherence pattern detection. We present LLM-TB-VQA, a detailed medical adherence VQA dataset that encompasses positive, negative, and ambiguous adherence cases. Our method identifies correlations between visual features, such as the clear visibility of the patient's face, medication, water intake, and the act of ingestion, and their associated medical concepts in captions. This facilitates the integration of aligned visual-linguistic representations and improves multimodal interactions. Experimental results indicate that our method surpasses parameter-efficient fine-tuning (PEFT) enabled VLM models, such as LLaVA-V1.5 and Chat-UniVi, with absolute improvements ranging from 3.1% to 3.54% across pre-trained, regular, and low-rank adaptation (LoRA) configurations. Comprehensive ablation studies and attention map visualizations substantiate our approach, enhancing interpretability.
https://arxiv.org/abs/2505.00275
Proto-objects, image regions that share common visual properties, offer a promising alternative to traditional attention mechanisms based on rectangular image patches in neural networks. While previous work demonstrated that evolving a patch-based hard-attention module alongside a controller network could achieve state-of-the-art performance in visual reinforcement learning tasks, our approach leverages image segmentation to work with higher-level features. By operating on proto-objects rather than fixed patches, we significantly reduce representational complexity: each image decomposes into fewer proto-objects than regular patches, and each proto-object can be efficiently encoded as a compact feature vector. This enables a substantially smaller self-attention module that processes richer semantic information. Our experiments demonstrate that this proto-object-based approach matches or exceeds the state-of-the-art performance of patch-based implementations with 62% fewer parameters and 2.6 times less training time.
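One simple way to encode proto-objects as compact feature vectors, hypothetical here since the abstract does not give the paper's exact encoding, is mean color plus normalized centroid plus relative area per segment:

```python
import numpy as np

def encode_proto_objects(image, labels):
    # image: (H, W, 3) in [0, 1]; labels: (H, W) integer segment ids.
    # Each proto-object becomes a 6-dim token for a small self-attention module.
    h, w = labels.shape
    feats = []
    for seg in np.unique(labels):
        ys, xs = np.nonzero(labels == seg)
        feats.append(np.concatenate([
            image[ys, xs].mean(axis=0),         # appearance
            [ys.mean() / h, xs.mean() / w],     # normalized centroid
            [ys.size / (h * w)],                # relative area
        ]))
    return np.stack(feats)
```

A few dozen such tokens replace the hundreds of fixed patches a regular grid would produce, which is where the parameter and training-time savings come from.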
https://arxiv.org/abs/2505.00186
Drawing on 1,178 safety and reliability papers from 9,439 generative AI papers (January 2020 - March 2025), we compare research outputs of leading AI companies (Anthropic, Google DeepMind, Meta, Microsoft, and OpenAI) and AI universities (CMU, MIT, NYU, Stanford, UC Berkeley, and University of Washington). We find that corporate AI research increasingly concentrates on pre-deployment areas -- model alignment and testing & evaluation -- while attention to deployment-stage issues such as model bias has waned. Significant research gaps exist in high-risk deployment domains, including healthcare, finance, misinformation, persuasive and addictive features, hallucinations, and copyright. Without improved observability into deployed AI, growing corporate concentration could deepen knowledge deficits. We recommend expanding external researcher access to deployment data and systematic observability of in-market AI behaviors.
https://arxiv.org/abs/2505.00174
Non-muscle-invasive bladder cancer (NMIBC) is a relentless challenge in oncology, with recurrence rates soaring as high as 70-80%. Each recurrence triggers a cascade of invasive procedures, lifelong surveillance, and escalating healthcare costs, affecting 460,000 individuals worldwide. However, existing clinical prediction tools remain fundamentally flawed, often overestimating recurrence risk and failing to provide personalized insights for patient management. In this work, we propose an interpretable deep learning framework that integrates vector embeddings and attention mechanisms to improve NMIBC recurrence prediction performance. We incorporate vector embeddings for categorical variables such as smoking status and intravesical treatments, allowing the model to capture complex relationships between patient attributes and recurrence risk. These embeddings provide a richer representation of the data, enabling improved feature interactions and enhancing prediction performance. Our approach not only enhances performance but also provides clinicians with patient-specific insights by highlighting the most influential features contributing to each patient's recurrence risk. Our model achieves an accuracy of 70% with tabular data, outperforming conventional statistical methods while providing clinician-friendly patient-level explanations through feature attention. Unlike previous studies, our approach identifies new important factors influencing recurrence, such as surgical duration and hospital stay, which had not been considered in existing NMIBC prediction models.
https://arxiv.org/abs/2505.00171
Vision-language models (VLMs) have gained significant attention in computational pathology due to their multimodal learning capabilities, which enhance big-data analytics of gigapixel whole slide images (WSIs). However, their sensitivity to large-scale clinical data, task formulations, and prompt design remains an open question, particularly in terms of diagnostic accuracy. In this paper, we present a systematic investigation and analysis of three state-of-the-art VLMs for histopathology, namely Quilt-Net, Quilt-LLAVA, and CONCH, on an in-house digestive pathology dataset comprising 3,507 gigapixel WSIs across distinct tissue types. Through a structured ablative study on cancer invasiveness and dysplasia status, we develop a comprehensive prompt engineering framework that systematically varies domain specificity, anatomical precision, instructional framing, and output constraints. Our findings demonstrate that prompt engineering significantly impacts model performance, with the CONCH model achieving the highest accuracy when provided with precise anatomical references. Additionally, we identify the critical importance of anatomical context in histopathological image analysis, as performance consistently degraded when anatomical precision was reduced. We also show that model complexity alone does not guarantee superior performance; effective domain alignment and domain-specific training are critical. These results establish foundational guidelines for prompt engineering in computational pathology and highlight the potential of VLMs to enhance diagnostic accuracy when properly instructed with domain-appropriate prompts.
https://arxiv.org/abs/2505.00134
With the growing success of text- or image-guided 3D generators, users demand more control over the generation process, appearance stylization being one of them. Given a reference image, this requires adapting the appearance of a generated 3D asset to reflect the visual style of the reference while maintaining visual consistency from multiple viewpoints. To tackle this problem, we draw inspiration from the success of 2D stylization methods that leverage the attention mechanisms in large image generation models to capture and transfer visual style. In particular, we probe whether large reconstruction models, commonly used in the context of 3D generation, have a similar capability. We discover that certain attention blocks in these models capture appearance-specific features. By injecting features from a visual style image into such blocks, we develop a simple yet effective 3D appearance stylization method. Our method does not require training or test-time optimization. Through both quantitative and qualitative evaluations, we demonstrate that our approach achieves superior results in terms of 3D appearance stylization, significantly improving efficiency while maintaining high-quality visual outcomes.
https://arxiv.org/abs/2504.21836
With the widespread application of large language models (LLMs), the issue of generating non-existent facts, known as hallucination, has garnered increasing attention. Previous research on enhancing LLM confidence estimation has mainly focused on the single-problem setting. However, LLM awareness of its internal parameterized knowledge boundary under the more challenging multi-problem setting, which requires answering multiple problems accurately at the same time, remains underexplored. To bridge this gap, we introduce a novel method, Multiple Answers and Confidence Stepwise Tuning (MAC-Tuning), that separates the learning of answer prediction and confidence estimation during fine-tuning on instruction data. Extensive experiments demonstrate that our method outperforms baselines by up to 25% in average precision.
https://arxiv.org/abs/2504.21773
Visual text is a crucial component in both document and scene images, conveying rich semantic information and attracting significant attention in the computer vision community. Beyond traditional tasks such as text detection and recognition, visual text processing has witnessed rapid advancements driven by the emergence of foundation models, including text image reconstruction and text image manipulation. Despite significant progress, challenges remain due to the unique properties that differentiate text from general objects. Effectively capturing and leveraging these distinct textual characteristics is essential for developing robust visual text processing models. In this survey, we present a comprehensive, multi-perspective analysis of recent advancements in visual text processing, focusing on two key questions: (1) What textual features are most suitable for different visual text processing tasks? (2) How can these distinctive text features be effectively incorporated into processing frameworks? Furthermore, we introduce VTPBench, a new benchmark that encompasses a broad range of visual text processing datasets. Leveraging the advanced visual quality assessment capabilities of multimodal large language models (MLLMs), we propose VTPScore, a novel evaluation metric designed to ensure fair and reliable evaluation. Our empirical study with more than 20 specific models reveals substantial room for improvement in the current techniques. Our aim is to establish this work as a fundamental resource that fosters future exploration and innovation in the dynamic field of visual text processing. The relevant repository is available at this https URL.
https://arxiv.org/abs/2504.21682
This report aims to evaluate the performance of large language models (LLMs) in solving high school science questions and to explore their potential applications in the educational field. With the rapid development of LLMs in the field of natural language processing, their application in education has attracted widespread attention. This study selected mathematics exam questions from the college entrance examinations (2019-2023) as evaluation data and utilized at least eight LLM APIs to provide answers. A comprehensive assessment was conducted based on metrics such as accuracy, response time, logical reasoning, and creativity. Through an in-depth analysis of the evaluation results, this report reveals the strengths and weaknesses of LLMs in handling high school science questions and discusses their implications for educational practice. The findings indicate that although LLMs perform excellently in certain aspects, there is still room for improvement in logical reasoning and creative problem-solving. This report provides an empirical foundation for further research and application of LLMs in the educational field and offers suggestions for improvement.
https://arxiv.org/abs/2505.00057
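The per-model assessment described in the report (accuracy, response time, and so on) reduces to aggregating per-question results. The record format below is a hypothetical sketch, not the report's actual evaluation pipeline.

```python
import statistics

def score_model(results):
    """Aggregate per-question results into summary metrics.

    `results` is a list of dicts with keys 'correct' (bool) and
    'latency_s' (float) -- an assumed record format for illustration.
    """
    accuracy = sum(r["correct"] for r in results) / len(results)
    mean_latency = statistics.mean(r["latency_s"] for r in results)
    return {"accuracy": accuracy, "mean_latency_s": mean_latency}

runs = [
    {"correct": True, "latency_s": 1.2},
    {"correct": False, "latency_s": 0.8},
    {"correct": True, "latency_s": 1.0},
]
summary = score_model(runs)
assert abs(summary["accuracy"] - 2 / 3) < 1e-9
assert abs(summary["mean_latency_s"] - 1.0) < 1e-9
```

Qualitative criteria such as logical reasoning and creativity would need rubric-based human or model grading on top of this numeric core.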
We present SAM4EM, a novel approach for 3D segmentation of complex neural structures in electron microscopy (EM) data that leverages the Segment Anything Model (SAM) alongside advanced fine-tuning strategies. Our contributions include a prompt-free adapter for SAM that uses two-stage mask decoding to automatically generate prompt embeddings, a dual-stage fine-tuning method based on Low-Rank Adaptation (LoRA) for enhancing segmentation with limited annotated data, and a 3D memory attention mechanism to ensure segmentation consistency across 3D stacks. We further release a unique benchmark dataset for the segmentation of astrocytic processes and synapses. We evaluated our method on challenging neuroscience segmentation benchmarks targeting mitochondria, glia, and synapses, achieving significant accuracy improvements over state-of-the-art (SOTA) methods, including recent SAM-based adapters developed for the medical domain and other vision-transformer-based approaches. Experimental results indicate that our approach outperforms existing solutions in segmenting complex structures such as glia and post-synaptic densities. Our code and models are available at this https URL.
https://arxiv.org/abs/2504.21544
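The LoRA mechanism underlying the dual-stage fine-tuning can be sketched in a few lines: the base weight stays frozen and only two low-rank factors are trained, so the effective weight is W + (alpha/r)·BA. The class name, shapes, and hyperparameters below are illustrative, not taken from the SAM4EM code base.

```python
import numpy as np

class LoRALinear:
    """Minimal sketch of a Low-Rank Adaptation (LoRA) linear layer."""

    def __init__(self, in_dim, out_dim, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((out_dim, in_dim))      # frozen base weight
        self.A = rng.standard_normal((rank, in_dim)) * 0.01  # trainable factor
        self.B = np.zeros((out_dim, rank))                   # trainable, zero-init
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen base path plus scaled low-rank update (alpha/r) * B @ A.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(16, 8)
x = np.ones((2, 16))
# B is zero-initialized, so the adapter contributes nothing before training
# and the output equals the frozen base projection.
assert np.allclose(layer.forward(x), x @ layer.W.T)
```

Zero-initializing B is the standard LoRA trick: fine-tuning starts exactly at the pre-trained model and only gradually departs from it as A and B are updated.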
Meme clustering is critical for toxicity detection, virality modeling, and typing, but it has received little attention in previous research. Clustering similar Internet memes is challenging due to their multimodality, cultural context, and adaptability. Existing approaches rely on databases, overlook semantics, and struggle to handle diverse dimensions of similarity. This paper introduces a novel method that uses template-based matching with multi-dimensional similarity features, thus eliminating the need for predefined databases and supporting adaptive matching. Memes are clustered using local and global features across similarity categories such as form, visual content, text, and identity. Our combined approach outperforms existing clustering methods, producing more consistent and coherent clusters, while similarity-based feature sets enable adaptability and align with human intuition. We make all supporting code publicly available to support subsequent research. Code: this https URL
https://arxiv.org/abs/2505.00056
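The multi-dimensional similarity idea above, scoring meme pairs separately on form, visual content, text, and identity and then combining the scores, can be sketched as a weighted cosine-similarity average. The feature keys, weights, and vector dimensions are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def combined_similarity(feats_a, feats_b, weights):
    """Weighted average of per-dimension cosine similarities.

    `feats_a`/`feats_b` map a similarity category (e.g. 'form', 'visual',
    'text', 'identity') to a feature vector; `weights` sets each
    category's contribution to the final score.
    """
    total = 0.0
    for key, w in weights.items():
        a, b = feats_a[key], feats_b[key]
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        total += w * cos
    return total / sum(weights.values())

weights = {"form": 1.0, "visual": 1.0, "text": 2.0, "identity": 0.5}
rng = np.random.default_rng(0)
m1 = {k: rng.standard_normal(64) for k in weights}          # a meme
m2 = {k: v + 0.01 * rng.standard_normal(64) for k, v in m1.items()}  # near-duplicate
m3 = {k: rng.standard_normal(64) for k in weights}          # unrelated meme

# A near-duplicate scores far higher than an unrelated meme.
assert combined_similarity(m1, m2, weights) > combined_similarity(m1, m3, weights)
```

Keeping the categories separate, rather than concatenating everything into one vector, is what lets the weights be tuned per use case (e.g. emphasizing text overlap for template matching).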