This paper presents IMAGGarment-1, a fine-grained garment generation (FGG) framework that enables high-fidelity garment synthesis with precise control over silhouette, color, and logo placement. Unlike existing methods that are limited to single-condition inputs, IMAGGarment-1 addresses the challenges of multi-conditional controllability in personalized fashion design and digital apparel applications. Specifically, IMAGGarment-1 employs a two-stage training strategy to separately model global appearance and local details, while enabling unified and controllable generation through end-to-end inference. In the first stage, we propose a global appearance model that jointly encodes silhouette and color using a mixed attention module and a color adapter. In the second stage, we present a local enhancement model with an adaptive appearance-aware module to inject user-defined logos and spatial constraints, enabling accurate placement and visual consistency. To support this task, we release GarmentBench, a large-scale dataset comprising over 180K garment samples paired with multi-level design conditions, including sketches, color references, logo placements, and textual prompts. Extensive experiments demonstrate that our method outperforms existing baselines, achieving superior structural stability, color fidelity, and local controllability. The code and model are available at this https URL.
https://arxiv.org/abs/2504.13176
Designing efficient and effective architectural backbones has been at the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of attentional bias, the natural tendency to prioritize certain events or stimuli, we reconceptualize neural architectures, including Transformers, Titans, and modern linear recurrent neural networks, as associative memory modules that learn a mapping of keys and values using an internal objective, referred to as attentional bias. Surprisingly, we observe that most existing sequence models leverage either (1) dot-product similarity or (2) L2 regression objectives as their attentional bias. Going beyond these objectives, we present a set of alternative attentional bias configurations along with effective approximations that stabilize their training. We then reinterpret forgetting mechanisms in modern deep learning architectures as a form of retention regularization, providing a novel set of forget gates for sequence models. Building upon these insights, we present Miras, a general framework for designing deep learning architectures based on four choices: (i) associative memory architecture, (ii) attentional bias objective, (iii) retention gate, and (iv) memory learning algorithm. We present three novel sequence models, Moneta, Yaad, and Memora, that go beyond the power of existing linear RNNs while maintaining a fast, parallelizable training process. Our experiments show that different design choices in Miras yield models with varying strengths. For example, certain instances of Miras achieve exceptional performance on special tasks such as language modeling, commonsense reasoning, and recall-intensive tasks, even outperforming Transformers and other modern linear recurrent models.
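To make the associative-memory reading of these architectures concrete, the sketch below contrasts the two common attentional bias objectives named above, dot-product similarity and L2 regression, for a toy linear memory updated online with a scalar retention (forget) gate. It illustrates the framing only, not the Miras implementation; the learning rate, retention value, and matrix-valued memory are assumptions.

```python
import numpy as np

def memory_step(M, k, v, objective="l2", lr=0.1, retention=0.95):
    """One online update of a linear associative memory M (d_v x d_k).

    objective "dot": maximize <M k, v>        -> gradient ascent adds v k^T
    objective "l2":  minimize ||M k - v||^2   -> delta-rule update
    retention acts as a simple forget gate on the stored associations.
    """
    if objective == "dot":
        grad = -np.outer(v, k)                # d/dM of -<M k, v>
    elif objective == "l2":
        grad = 2.0 * np.outer(M @ k - v, k)   # d/dM of ||M k - v||^2
    else:
        raise ValueError(objective)
    return retention * M - lr * grad

rng = np.random.default_rng(0)
d_k, d_v = 8, 8
M = np.zeros((d_v, d_k))
keys = rng.normal(size=(16, d_k))
vals = rng.normal(size=(16, d_v))
for k, v in zip(keys, vals):
    M = memory_step(M, k, v, objective="l2")
# Recall: query with a stored key and compare against its value.
print(np.linalg.norm(M @ keys[-1] - vals[-1]))
```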
https://arxiv.org/abs/2504.13173
Underwater acoustic target recognition (UATR) is of great significance for the protection of marine diversity and national defense security. The development of deep learning provides new opportunities for UATR but faces challenges brought by the scarcity of reference samples and complex environmental interference. To address these issues, we propose a multi-task balanced channel attention convolutional neural network (MT-BCA-CNN). The method integrates a channel attention mechanism with a multi-task learning strategy, constructing a shared feature extractor and multi-task classifiers to jointly optimize target classification and feature reconstruction tasks. The channel attention mechanism dynamically enhances discriminative acoustic features such as harmonic structures while suppressing noise. Experiments on the Watkins Marine Life Dataset demonstrate that MT-BCA-CNN achieves 97\% classification accuracy and a 95\% $F1$-score in 27-class few-shot scenarios, significantly outperforming traditional CNN and ACNN models as well as popular state-of-the-art UATR methods. Ablation studies confirm the synergistic benefits of multi-task learning and attention mechanisms, while a dynamic weighting adjustment strategy effectively balances task contributions. This work provides an efficient solution for few-shot underwater acoustic recognition, advancing research in marine bioacoustics and sonar signal processing.
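The abstract does not spell out the layer design, so the following is a minimal squeeze-and-excitation-style sketch of the kind of channel attention described, re-weighting feature channels so that discriminative (e.g., harmonic) channels are emphasized and noisy ones suppressed; the reduction ratio and placement inside the network are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel re-weighting (illustrative)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # squeeze: global context per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                          # per-channel gate in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                               # emphasize informative channels

feat = torch.randn(4, 64, 32, 128)  # e.g. CNN features of a spectrogram
print(ChannelAttention(64)(feat).shape)
```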
https://arxiv.org/abs/2504.13102
This study conducts a detailed comparison of the RF-DETR object detection base model and YOLOv12 object detection model configurations for detecting greenfruits in a complex orchard environment marked by label ambiguity, occlusions, and background blending. A custom dataset was developed featuring both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruits) annotations to assess model performance under dynamic real-world conditions. The RF-DETR object detection model, utilizing a DINOv2 backbone and deformable attention, excelled in global context modeling, effectively identifying partially occluded or ambiguous greenfruits. In contrast, YOLOv12 leveraged CNN-based attention for enhanced local feature extraction, optimizing it for computational efficiency and edge deployment. RF-DETR achieved the highest mean Average Precision (mAP@50) of 0.9464 in single-class detection, proving its superior ability to localize greenfruits in cluttered scenes. Although YOLOv12N recorded the highest mAP@50:95 of 0.7620, RF-DETR consistently outperformed in complex spatial scenarios. For multi-class detection, RF-DETR led with an mAP@50 of 0.8298, showing its capability to differentiate between occluded and non-occluded fruits, while YOLOv12L scored highest in mAP@50:95 with 0.6622, indicating better classification in detailed occlusion contexts. Training dynamics analysis highlighted RF-DETR's swift convergence, particularly in single-class settings where it plateaued within 10 epochs, demonstrating the efficiency of transformer-based architectures in adapting to dynamic visual data. These findings validate RF-DETR's effectiveness for precision agricultural applications, with YOLOv12 suited for fast-response scenarios. Index Terms: RF-DETR object detection, YOLOv12, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO World, YOLO, You Only Look Once, Roboflow, Detection Transformers, CNNs
https://arxiv.org/abs/2504.13099
Echocardiography is crucial for cardiovascular disease detection but relies heavily on experienced sonographers. Echocardiography probe guidance systems, which provide real-time movement instructions for acquiring standard plane images, offer a promising solution for AI-assisted or fully autonomous scanning. However, developing effective machine learning models for this task remains challenging, as they must grasp heart anatomy and the intricate interplay between probe motion and visual signals. To address this, we present EchoWorld, a motion-aware world modeling framework for probe guidance that encodes anatomical knowledge and motion-induced visual dynamics, while effectively leveraging past visual-motion sequences to enhance guidance precision. EchoWorld employs a pre-training strategy inspired by world modeling principles, where the model predicts masked anatomical regions and simulates the visual outcomes of probe adjustments. Built upon this pre-trained model, we introduce a motion-aware attention mechanism in the fine-tuning stage that effectively integrates historical visual-motion data, enabling precise and adaptive probe guidance. Trained on more than one million ultrasound images from over 200 routine scans, EchoWorld effectively captures key echocardiographic knowledge, as validated by qualitative analysis. Moreover, our method significantly reduces guidance errors compared to existing visual backbones and guidance frameworks, excelling in both single-frame and sequential evaluation protocols. Code is available at this https URL.
https://arxiv.org/abs/2504.13065
In recent years, image compression for high-level vision tasks has attracted considerable attention from researchers. Given that object information in images plays a far more crucial role in downstream tasks than background information, some studies have proposed semantically structuring the bitstream to selectively transmit and reconstruct only the information required by these tasks. However, such methods structure the bitstream after encoding, meaning that the coding process still relies on the entire image, even though much of the encoded information will not be transmitted. This leads to redundant computations. Traditional image compression methods require a two-dimensional image as input, and even if the unimportant regions of the image are set to zero by applying a semantic mask, these regions still participate in subsequent computations as part of the image. To address such limitations, we propose an image compression method based on a position-indexed self-attention mechanism that encodes and decodes only the visible parts of the masked image. Compared to existing semantic-structured compression methods, our approach can significantly reduce computational costs.
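One minimal way to realize "encode only the visible parts" is to drop masked patch tokens before attention while keeping each surviving token's positional index, so the attention cost scales with the visible area rather than the full image. The sketch below illustrates this idea under an assumed patch size, embedding width, and a single attention layer; it is not the paper's actual codec.

```python
import torch
import torch.nn as nn

patch, dim = 16, 128
embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)    # patchify + embed
pos = nn.Parameter(torch.randn(1, (256 // patch) ** 2, dim))  # one embedding per patch position
attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

img = torch.randn(1, 3, 256, 256)
tokens = embed(img).flatten(2).transpose(1, 2)                 # (1, 256 patches, dim)

# Semantic mask marking the patches that matter for the downstream task.
keep = torch.zeros(tokens.shape[1], dtype=torch.bool)
keep[40:120] = True                                            # pretend these are object patches

# Only visible patches, each carrying its positional index, enter the attention;
# masked patches are never encoded at all.
visible = tokens[:, keep] + pos[:, keep]
out, _ = attn(visible, visible, visible)
print(out.shape)                                               # (1, number of visible patches, dim)
```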
https://arxiv.org/abs/2504.12923
Generative Adversarial Network (GAN) inversion has demonstrated excellent performance in image inpainting, which aims to restore lost or damaged image texture using the unmasked content. Previous GAN inversion-based methods usually utilize well-trained GAN models as effective priors to generate realistic regions for missing holes. Despite their excellence, they ignore a hard constraint that the unmasked regions in the input and the output should be the same, resulting in a gap between GAN inversion and image inpainting and thus degrading performance. Besides, existing GAN inversion approaches often consider a single modality of the input image, neglecting other auxiliary cues in images that could yield improvements. Addressing these problems, we propose a novel GAN inversion approach, dubbed MMInvertFill, for image inpainting. MMInvertFill contains primarily a multimodal guided encoder with a pre-modulation and a GAN generator with an F&W+ latent space. Specifically, the multimodal encoder aims to enhance the multi-scale structures with additional semantic segmentation and edge texture modalities through a gated mask-aware attention module. Afterwards, a pre-modulation is presented to encode these structures into style vectors. To mitigate issues of conspicuous color discrepancy and semantic inconsistency, we introduce the F&W+ latent space to bridge the gap between GAN inversion and image inpainting. Furthermore, in order to reconstruct faithful and photorealistic images, we devise a simple yet effective Soft-update Mean Latent module to capture more diversified in-domain patterns for generating high-fidelity textures for massive corruptions. In our extensive experiments on six challenging datasets, we show that our MMInvertFill qualitatively and quantitatively outperforms other state-of-the-art methods and that it effectively supports the completion of out-of-domain images.
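The Soft-update Mean Latent idea can be pictured as a running, softly updated average of in-domain latent codes that later serves as a well-behaved starting code. The sketch below is only a guess at that mechanism; the momentum value, latent dimensionality, and the way the mean is consumed are assumptions.

```python
import torch

class SoftMeanLatent:
    """Running (soft-updated) mean of in-domain latent codes -- illustrative only."""
    def __init__(self, dim: int, momentum: float = 0.999):
        self.mean = torch.zeros(dim)
        self.momentum = momentum

    def update(self, w: torch.Tensor) -> None:
        # w: a batch of latent codes from the encoder, shape (batch, dim)
        batch_mean = w.detach().mean(dim=0)
        self.mean = self.momentum * self.mean + (1 - self.momentum) * batch_mean

    def init_code(self, batch: int) -> torch.Tensor:
        # Start generation from the soft-updated mean rather than a random code.
        return self.mean.expand(batch, -1).clone()

tracker = SoftMeanLatent(dim=512)
for _ in range(10):
    tracker.update(torch.randn(8, 512))
print(tracker.init_code(2).shape)
```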
https://arxiv.org/abs/2504.12844
Relational Triple Extraction (RTE) is a fundamental task in Natural Language Processing (NLP). However, prior research has primarily focused on optimizing model performance, with limited efforts to understand the internal mechanisms driving these models. Many existing methods rely on complex preprocessing to induce specific interactions, often resulting in opaque systems that may not fully align with their theoretical foundations. To address these limitations, we propose SMARTe: a Slot-based Method for Accountable Relational Triple extraction. SMARTe introduces intrinsic interpretability through a slot attention mechanism and frames the task as a set prediction problem. Slot attention consolidates relevant information into distinct slots, ensuring all predictions can be explicitly traced to learned slot representations and the tokens contributing to each predicted relational triple. While emphasizing interpretability, SMARTe achieves performance comparable to state-of-the-art models. Evaluations on the NYT and WebNLG datasets demonstrate that adding interpretability does not compromise performance. Furthermore, we conducted qualitative assessments to showcase the explanations provided by SMARTe, using attention heatmaps that map to their respective tokens. We conclude with a discussion of our findings and propose directions for future research.
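For readers unfamiliar with slot attention, the sketch below shows the standard iterative formulation in the style of Locatello et al., where slots compete for tokens via a softmax over the slot axis; the resulting attention map is the kind of token-level trace SMARTe exposes. The hyperparameters and GRU update here are generic choices, not SMARTe's exact module.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Slot attention in the style of Locatello et al. (illustrative sketch)."""
    def __init__(self, num_slots: int, dim: int, iters: int = 3):
        super().__init__()
        self.iters = iters
        self.scale = dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, num_slots, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, seq_len, dim) contextual token embeddings
        b, n, d = tokens.shape
        tokens = self.norm_in(tokens)
        k, v = self.to_k(tokens), self.to_v(tokens)
        slots = self.slots_mu.expand(b, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=1)  # slots compete for tokens
            attn = attn / attn.sum(dim=-1, keepdim=True)
            updates = attn @ v
            slots = self.gru(updates.reshape(-1, d), slots.reshape(-1, d)).view(b, -1, d)
        return slots, attn  # attn traces each slot back to its supporting tokens

slots, attn = SlotAttention(num_slots=5, dim=64)(torch.randn(2, 30, 64))
print(slots.shape, attn.shape)
```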
https://arxiv.org/abs/2504.12816
As 3D Gaussian Splatting (3DGS) gains popularity as a 3D representation of real scenes, enabling user-friendly deformation to create novel scenes while preserving fine details from the original 3DGS has attracted significant research attention. We introduce CAGE-GS, a cage-based 3DGS deformation method that seamlessly aligns a source 3DGS scene with a user-defined target shape. Our approach learns a deformation cage from the target, which guides the geometric transformation of the source scene. While the cages effectively control structural alignment, preserving the textural appearance of 3DGS remains challenging due to the complexity of covariance parameters. To address this, we employ a Jacobian matrix-based strategy to update the covariance parameters of each Gaussian, ensuring texture fidelity post-deformation. Our method is highly flexible, accommodating various target shape representations, including texts, images, point clouds, meshes and 3DGS models. Extensive experiments and ablation studies on both public datasets and newly proposed scenes demonstrate that our method significantly outperforms existing techniques in both efficiency and deformation quality.
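The covariance update follows the standard first-order rule for transporting a Gaussian through a deformation: if the warp moves a point by a map phi, the covariance transforms as Sigma' = J Sigma J^T, with J the local Jacobian of phi. The sketch below illustrates this with a toy deformation field and a finite-difference Jacobian; it shows the generic rule rather than the paper's exact pipeline.

```python
import numpy as np

def deform(p):
    """Toy deformation field (stand-in for the cage-driven warp)."""
    x, y, z = p
    return np.array([1.2 * x + 0.1 * y, 0.9 * y, z + 0.05 * x * x])

def jacobian(f, p, eps=1e-4):
    """Finite-difference Jacobian of the deformation at point p."""
    J = np.zeros((3, 3))
    for i in range(3):
        d = np.zeros(3)
        d[i] = eps
        J[:, i] = (f(p + d) - f(p - d)) / (2 * eps)
    return J

mean = np.array([0.3, -0.2, 0.5])
cov = np.diag([0.02, 0.01, 0.03])   # covariance of one 3D Gaussian

J = jacobian(deform, mean)
new_mean = deform(mean)
new_cov = J @ cov @ J.T             # first-order transport of the covariance
print(new_mean, "\n", new_cov)
```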
https://arxiv.org/abs/2504.12800
Vision Foundation Models (VFMs) have delivered remarkable performance in Domain Generalized Semantic Segmentation (DGSS). However, recent methods often overlook the fact that visual cues are susceptible to domain shifts, whereas the underlying geometry remains stable, rendering depth information more robust. In this paper, we investigate the potential of integrating depth information with features from VFMs to improve the geometric consistency within an image and boost the generalization performance of VFMs. We propose a novel fine-tuning DGSS framework, named DepthForge, which integrates the visual cues from frozen DINOv2 or EVA02 and depth cues from frozen Depth Anything V2. In each layer of the VFMs, we incorporate depth-aware learnable tokens to continuously decouple domain-invariant visual and spatial information, thereby enhancing the depth awareness and attention of the VFMs. Finally, we develop a depth refinement decoder and integrate it into the model architecture to adaptively refine multi-layer VFM features and depth-aware learnable tokens. Extensive experiments are conducted on various DGSS settings with five different datasets as unseen target domains. The qualitative and quantitative results demonstrate that our method significantly outperforms alternative approaches, with stronger performance, steadier visual-spatial attention, and superior generalization ability. In particular, DepthForge exhibits outstanding performance under extreme conditions (e.g., night and snow). Code is available at this https URL.
https://arxiv.org/abs/2504.12753
Multi-class Unsupervised Anomaly Detection algorithms (MUAD) are receiving increasing attention due to their relatively low deployment costs and improved training efficiency. However, the real-world effectiveness of MUAD methods is questioned due to limitations in current Industrial Anomaly Detection (IAD) datasets. These datasets contain numerous classes that are unlikely to be produced by the same factory and fail to cover multiple structures or appearances. Additionally, the defects do not reflect real-world characteristics. Therefore, we introduce the Heterogeneous Same-Sort Industrial Anomaly Detection (HSS-IAD) dataset, which contains 8,580 images of metallic-like industrial parts and precise anomaly annotations. These parts exhibit variations in structure and appearance, with subtle defects that closely resemble the base materials. We also provide foreground images for synthetic anomaly generation. Finally, we evaluate popular IAD methods on this dataset under multi-class and class-separated settings, demonstrating its potential to bridge the gap between existing datasets and real factory conditions. The dataset is available at this https URL.
https://arxiv.org/abs/2504.12689
Abstractive compression utilizes smaller language models to condense query-relevant context, reducing computational costs in retrieval-augmented generation (RAG). However, retrieved documents often include information that is either irrelevant to answering the query or misleading due to factually incorrect content, despite having high relevance scores. This behavior indicates that abstractive compressors are more likely to omit important information essential for the correct answer, especially in long contexts where attention dispersion occurs. To address this issue, we categorize retrieved documents in a more fine-grained manner and propose Abstractive Compression Robust against Noise (ACoRN), which introduces two novel training steps. First, we use offline data augmentation on the training dataset to enhance compressor robustness against two distinct types of retrieval noise. Second, since the language model-based compressor cannot fully utilize information from multiple retrieved documents and exhibits positional bias, we perform fine-tuning to generate summaries centered around key information that directly supports the correct answer. Our experiments demonstrate that T5-large, trained with ACoRN as a compressor, improves EM and F1 scores while preserving the answer string, which could serve as direct evidence. ACoRN excels on datasets with many accuracy-reducing documents, making it highly useful in real-world scenarios.
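The first training step, offline augmentation against the two retrieval-noise types, can be pictured as mixing gold evidence with irrelevant and factually misleading distractors when building compressor training examples, with document order shuffled to counter positional bias. The field names, mixing ratio, and example documents below are invented for illustration.

```python
import random

def build_training_example(query, gold_docs, irrelevant_pool, misleading_pool,
                           n_noise=2, seed=0):
    """Mix gold evidence with the two retrieval-noise types (illustrative)."""
    rng = random.Random(seed)
    noise = rng.sample(irrelevant_pool, n_noise) + rng.sample(misleading_pool, n_noise)
    docs = gold_docs + noise
    rng.shuffle(docs)                       # positions vary to counter positional bias
    return {
        "input": query + "\n\n" + "\n".join(docs),
        # target: a summary centered on the evidence that supports the answer
        "target_docs": gold_docs,
    }

example = build_training_example(
    "Who wrote The Selfish Gene?",
    gold_docs=["The Selfish Gene (1976) was written by Richard Dawkins."],
    irrelevant_pool=["The Nile is the longest river in Africa.",
                     "Python 3 was released in 2008."],
    misleading_pool=["The Selfish Gene was written by Stephen Jay Gould in 1980.",
                     "Richard Dawkins wrote The Selfish Gene in 1996."],
)
print(example["input"])
```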
https://arxiv.org/abs/2504.12673
Recent advances in deep learning, particularly frequency dynamic convolution (FDY conv), have significantly improved sound event detection (SED) by enabling frequency-adaptive feature extraction. However, FDY conv relies on temporal average pooling, which treats all temporal frames equally, limiting its ability to capture transient sound events such as alarm bells, door knocks, and speech plosives. To address this limitation, we propose temporal attention pooling frequency dynamic convolution (TFD conv) to replace temporal average pooling with temporal attention pooling (TAP). TAP adaptively weights temporal features through three complementary mechanisms: time attention pooling (TA) for emphasizing salient features, velocity attention pooling (VA) for capturing transient changes, and conventional average pooling for robustness to stationary signals. Ablation studies show that TFD conv improves average PSDS1 by 3.02% over FDY conv with only a 14.8% increase in parameter count. Classwise ANOVA and Tukey HSD analysis further demonstrate that TFD conv significantly enhances detection performance for transient-heavy events, outperforming existing FDY conv models. Notably, TFD conv achieves a maximum PSDS1 score of 0.456, surpassing previous state-of-the-art SED systems. We also explore the compatibility of TAP with other FDY conv variants, including dilated FDY conv (DFD conv), partial FDY conv (PFD conv), and multi-dilated FDY conv (MDFD conv). Among these, the integration of TAP with MDFD conv achieves the best result with a PSDS1 score of 0.459, validating the complementary strengths of temporal attention and multi-scale frequency adaptation. These findings establish TFD conv as a powerful and generalizable framework for enhancing both transient sensitivity and overall feature robustness in SED.
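A rough sketch of temporal attention pooling is given below: one branch weights frames by learned salience, one weights frame-to-frame differences (velocity) to highlight transients, and plain averaging keeps stationary sounds stable. How the branches are merged inside FDY conv is not specified here, so the simple mean over branches and the 1x1 scoring convolutions are assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttentionPooling(nn.Module):
    """Pool over time with attention / velocity / average branches (illustrative)."""
    def __init__(self, channels: int):
        super().__init__()
        self.time_score = nn.Conv1d(channels, channels, kernel_size=1)  # salience per frame
        self.vel_score = nn.Conv1d(channels, channels, kernel_size=1)   # transient change score

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        ta = torch.softmax(self.time_score(x), dim=-1)
        pooled_ta = (ta * x).sum(dim=-1)

        vel = x[..., 1:] - x[..., :-1]                                   # frame-to-frame velocity
        va = torch.softmax(self.vel_score(vel), dim=-1)
        pooled_va = (va * x[..., 1:]).sum(dim=-1)

        pooled_avg = x.mean(dim=-1)                                      # robust to stationary sounds
        return (pooled_ta + pooled_va + pooled_avg) / 3

feats = torch.randn(8, 64, 156)  # e.g. per-band features over 156 frames
print(TemporalAttentionPooling(64)(feats).shape)
```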
https://arxiv.org/abs/2504.12670
Robotic manipulation faces critical challenges in understanding spatial affordances--the "where" and "how" of object interactions--essential for complex manipulation tasks like wiping a board or stacking objects. Existing methods, including modular-based and end-to-end approaches, often lack robust spatial reasoning capabilities. Unlike recent point-based and flow-based affordance methods that focus on dense spatial representations or trajectory modeling, we propose A0, a hierarchical affordance-aware diffusion model that decomposes manipulation tasks into high-level spatial affordance understanding and low-level action execution. A0 leverages the Embodiment-Agnostic Affordance Representation, which captures object-centric spatial affordances by predicting contact points and post-contact trajectories. A0 is pre-trained on 1 million contact points data and fine-tuned on annotated trajectories, enabling generalization across platforms. Key components include Position Offset Attention for motion-aware feature extraction and a Spatial Information Aggregation Layer for precise coordinate mapping. The model's output is executed by the action execution module. Experiments on multiple robotic systems (Franka, Kinova, Realman, and Dobot) demonstrate A0's superior performance in complex tasks, showcasing its efficiency, flexibility, and real-world applicability.
https://arxiv.org/abs/2504.12636
Building change detection remains challenging for urban development, disaster assessment, and military reconnaissance. While foundation models like Segment Anything Model (SAM) show strong segmentation capabilities, SAM is limited in the task of building change detection due to domain gap issues. Existing adapter-based fine-tuning approaches face challenges with imbalanced building distribution, resulting in poor detection of subtle changes and inaccurate edge extraction. Additionally, bi-temporal misalignment in change detection, typically addressed by optical flow, remains vulnerable to background noise. This affects the detection of building changes and compromises both detection accuracy and edge recognition. To tackle these challenges, we propose a new SAM-Based Network with Distribution-Aware Fourier Adaptation and Edge-Constrained Warping (FAEWNet) for building change detection. FAEWNet utilizes the SAM encoder to extract rich visual features from remote sensing images. To guide SAM in focusing on specific ground objects in remote sensing scenes, we propose a Distribution-Aware Fourier Aggregated Adapter to aggregate task-oriented change information. This adapter not only effectively addresses the domain gap issue, but also pays attention to the distribution of changed buildings. Furthermore, to mitigate noise interference and misalignment in height offset estimation, we design a novel flow module that refines building edge extraction and enhances the perception of changed buildings. Our state-of-the-art results on the LEVIR-CD, S2Looking and WHU-CD datasets highlight the effectiveness of FAEWNet. The code is available at this https URL.
https://arxiv.org/abs/2504.12619
Transformers have attained outstanding performance across various modalities, employing scaled-dot-product (SDP) attention mechanisms. Researchers have attempted to migrate Transformers to graph learning, but most advanced Graph Transformers are designed with major architectural differences, either integrating message-passing or incorporating sophisticated attention mechanisms. These complexities prevent the easy adoption of Transformer training advances. We propose three simple modifications to the plain Transformer to render it applicable to graphs without introducing major architectural distortions. Specifically, we advocate for the use of (1) simplified $L_2$ attention to measure the magnitude closeness of tokens; (2) adaptive root-mean-square normalization to preserve token magnitude information; and (3) a relative positional encoding bias with a shared encoder. Significant performance gains across a variety of graph datasets justify the effectiveness of our proposed modifications. Furthermore, empirical evaluation on the expressiveness benchmark reveals noteworthy realized expressiveness on graph isomorphism tasks.
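The first modification can be sketched directly: attention weights come from negative squared L2 distances between queries and keys rather than dot products, so tokens that are close in the embedding space attend to each other most. The scaling factor below is an assumption, and the adaptive RMS normalization and shared positional encoder are omitted for brevity.

```python
import torch

def l2_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Attention where closeness in L2 distance, not dot product, sets the weights."""
    # q, k, v: (batch, num_tokens, dim)
    d = q.shape[-1]
    dist2 = torch.cdist(q, k, p=2).pow(2)            # pairwise squared L2 distances
    attn = torch.softmax(-dist2 / d ** 0.5, dim=-1)  # closer tokens get larger weight
    return attn @ v

q = k = v = torch.randn(2, 10, 32)                   # e.g. node tokens of a small graph
print(l2_attention(q, k, v).shape)
```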
https://arxiv.org/abs/2504.12588
Event cameras have attracted increasing attention in recent years due to their advantages in high dynamic range, high temporal resolution, low power consumption, and low latency. Some researchers have begun exploring pre-training directly on event data. Nevertheless, these efforts often fail to establish strong connections with RGB frames, limiting their applicability in multi-modal fusion scenarios. To address these issues, we propose a novel CM3AE pre-training framework for RGB-Event perception. This framework accepts multiple modalities/views of data as input, including RGB images, event images, and event voxels, providing robust support for both event-based and RGB-Event-fusion-based downstream tasks. Specifically, we design a multi-modal fusion reconstruction module that reconstructs the original image from fused multi-modal features, explicitly enhancing the model's ability to aggregate cross-modal complementary information. Additionally, we employ a multi-modal contrastive learning strategy to align cross-modal feature representations in a shared latent space, which effectively enhances the model's capability for multi-modal understanding and for capturing global dependencies. We construct a large-scale dataset containing 2,535,759 RGB-Event data pairs for pre-training. Extensive experiments on five downstream tasks fully demonstrate the effectiveness of CM3AE. Source code and pre-trained models will be released at this https URL.
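The multi-modal contrastive step can be illustrated with a symmetric InfoNCE loss that pulls paired RGB and event embeddings together in the shared latent space and pushes unpaired ones apart; the temperature and the absence of projection heads below are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive(rgb_emb, event_emb, temperature=0.07):
    """Symmetric InfoNCE between paired RGB and event embeddings (illustrative)."""
    rgb = F.normalize(rgb_emb, dim=-1)
    evt = F.normalize(event_emb, dim=-1)
    logits = rgb @ evt.t() / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(rgb.shape[0])      # i-th RGB pairs with i-th event sample
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = cross_modal_contrastive(torch.randn(16, 256), torch.randn(16, 256))
print(loss.item())
```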
https://arxiv.org/abs/2504.12576
Text-attributed graphs (TAGs) present unique challenges in representation learning by requiring models to capture both the semantic richness of node-associated texts and the structural dependencies of the graph. While graph neural networks (GNNs) excel at modeling topological information, they lack the capacity to process unstructured text. Conversely, large language models (LLMs) are proficient in text understanding but are typically unaware of graph structure. In this work, we propose BiGTex (Bidirectional Graph Text), a novel architecture that tightly integrates GNNs and LLMs through stacked Graph-Text Fusion Units. Each unit allows for mutual attention between textual and structural representations, enabling information to flow in both directions, text influencing structure and structure guiding textual interpretation. The proposed architecture is trained using parameter-efficient fine-tuning (LoRA), keeping the LLM frozen while adapting to task-specific signals. Extensive experiments on five benchmark datasets demonstrate that BiGTex achieves state-of-the-art performance in node classification and generalizes effectively to link prediction. An ablation study further highlights the importance of soft prompting and bi-directional attention in the model's success.
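A Graph-Text Fusion Unit can be pictured as two cross-attention passes per layer: text tokens attend to node embeddings (structure guiding textual interpretation) and node embeddings attend to text tokens (text influencing structure). The sketch below uses generic multi-head attention with residual connections; the dimensions and single-layer form are assumptions, not BiGTex's exact design.

```python
import torch
import torch.nn as nn

class GraphTextFusionUnit(nn.Module):
    """Mutual cross-attention between text tokens and node embeddings (illustrative)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.text_to_graph = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.graph_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tokens, node_embs):
        # text_tokens: (batch, n_tokens, dim); node_embs: (batch, n_nodes, dim)
        text_upd, _ = self.graph_to_text(text_tokens, node_embs, node_embs)    # structure -> text
        node_upd, _ = self.text_to_graph(node_embs, text_tokens, text_tokens)  # text -> structure
        return text_tokens + text_upd, node_embs + node_upd

text = torch.randn(2, 12, 128)   # language-model token states for a node's description
nodes = torch.randn(2, 5, 128)   # GNN embeddings of the node's neighborhood
t, n = GraphTextFusionUnit(128)(text, nodes)
print(t.shape, n.shape)
```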
https://arxiv.org/abs/2504.12474
In our paper we explore the definition and extrapolation of fallacies as they pertain to the automatic detection of manipulation on social media. In particular, we explore how these logical fallacies might appear in the real world, i.e., internet forums. We discovered a prevalence of misinformation and misguided intention in discussion boards specifically centered around the Ukrainian-Russian conflict, which serves to narrow the domain of our task. Although automatic fallacy detection has gained attention recently, most datasets use unregulated fallacy taxonomies or are limited to formal linguistic domains like political debates or news reports. Online discourse, however, often features non-standardized and diverse language not captured in these domains. We present Shady Linguistic Utterance Replication-Generation (SLURG) to address these limitations, exploring the feasibility of generating synthetic fallacious forum-style comments using large language models (LLMs), specifically DeepHermes-3-Mistral-24B. Our findings indicate that LLMs can replicate the syntactic patterns of real data and that high-quality few-shot prompts enhance LLMs' ability to mimic the vocabulary diversity of online forums.
https://arxiv.org/abs/2504.12466
Existing zero-shot 3D point cloud segmentation methods often struggle with limited transferability from seen classes to unseen classes and from semantic to visual space. To alleviate this, we introduce 3D-PointZshotS, a geometry-aware zero-shot segmentation framework that enhances both feature generation and alignment using latent geometric prototypes (LGPs). Specifically, we integrate LGPs into a generator via a cross-attention mechanism, enriching semantic features with fine-grained geometric details. To further enhance stability and generalization, we introduce a self-consistency loss, which enforces feature robustness against point-wise perturbations. Additionally, we re-represent visual and semantic features in a shared space, bridging the semantic-visual gap and facilitating knowledge transfer to unseen classes. Experiments on three real-world datasets, namely ScanNet, SemanticKITTI, and S3DIS, demonstrate that our method achieves superior performance over four baselines in terms of harmonic mIoU. The code is available on GitHub at this https URL.
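The self-consistency loss can be illustrated as penalizing feature drift when the input points are jittered, with a stop-gradient on the clean branch; the noise scale, the MSE distance, and the stand-in per-point encoder below are assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def self_consistency_loss(encoder: nn.Module, points: torch.Tensor, sigma: float = 0.01):
    """Features should stay put under point-wise perturbations (illustrative)."""
    # points: (batch, num_points, 3)
    clean = encoder(points)
    perturbed = encoder(points + sigma * torch.randn_like(points))
    return F.mse_loss(perturbed, clean.detach())   # stop-grad on the clean branch

encoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64))  # stand-in per-point MLP
loss = self_consistency_loss(encoder, torch.randn(4, 1024, 3))
print(loss.item())
```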
https://arxiv.org/abs/2504.12442