Broad textual understanding and in-context learning require language models that utilize full document contexts. Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts. However, owing to differences in data and model classes, it has been challenging to compare these approaches, leading to uncertainty as to how to evaluate long-context performance and whether it differs from standard evaluation. We implement a controlled protocol for extension methods with a standardized evaluation, utilizing consistent base models and extension data. Our study yields several insights into long-context behavior. First, we reaffirm the critical role of perplexity as a general-purpose performance indicator even in longer-context tasks. Second, we find that current approximate attention methods systematically underperform across long-context tasks. Finally, we confirm that exact fine-tuning based methods are generally effective within the range of their extension, whereas extrapolation remains challenging. All codebases, models, and checkpoints will be made available open-source, promoting transparency and facilitating further research in this critical area of AI development.
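To make the perplexity-based protocol concrete, here is a minimal sketch of measuring perplexity as a function of context position with a HuggingFace-style causal LM. The bucketing scheme and function shape are our own illustration, not the paper's actual evaluation code.

```python
import torch

def perplexity_by_position(model, input_ids, bucket=1024):
    """Per-position NLL averaged within context-length buckets.

    Assumes `model` is a HuggingFace-style causal LM whose output has a
    `.logits` field of shape (batch, seq, vocab). Bucket size is illustrative.
    """
    with torch.no_grad():
        logits = model(input_ids).logits[:, :-1]               # predict token t+1
        targets = input_ids[:, 1:]
        nll = torch.nn.functional.cross_entropy(
            logits.transpose(1, 2), targets, reduction="none")  # (batch, seq-1)
    ppl = {}
    for start in range(0, nll.shape[1], bucket):
        chunk = nll[:, start:start + bucket]
        ppl[f"{start}-{start + bucket}"] = chunk.mean().exp().item()
    return ppl  # rising values at late buckets indicate long-context degradation
```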
https://arxiv.org/abs/2409.12181
Brain Tumor Segmentation (BraTS) plays a critical role in clinical diagnosis, treatment planning, and monitoring the progression of brain tumors. However, due to the variability in tumor appearance, size, and intensity across different MRI modalities, automated segmentation remains a challenging task. In this study, we propose a novel Transformer-based framework, multiPI-TransBTS, which integrates multi-physical information to enhance segmentation accuracy. The model leverages spatial information, semantic information, and multi-modal imaging data, addressing the inherent heterogeneity in brain tumor characteristics. The multiPI-TransBTS framework consists of an encoder, an Adaptive Feature Fusion (AFF) module, and a multi-source, multi-scale feature decoder. The encoder incorporates a multi-branch architecture to separately extract modality-specific features from different MRI sequences. The AFF module fuses information from multiple sources using channel-wise and element-wise attention, ensuring effective feature recalibration. The decoder combines both common and task-specific features through a Task-Specific Feature Introduction (TSFI) strategy, producing accurate segmentation outputs for Whole Tumor (WT), Tumor Core (TC), and Enhancing Tumor (ET) regions. Comprehensive evaluations on the BraTS2019 and BraTS2020 datasets demonstrate the superiority of multiPI-TransBTS over the state-of-the-art methods. The model consistently achieves better Dice coefficients, Hausdorff distances, and Sensitivity scores, highlighting its effectiveness in addressing the BraTS challenges. Our results also indicate the need for further exploration of the balance between precision and recall in the ET segmentation task. The proposed framework represents a significant advancement in BraTS, with potential implications for improving clinical outcomes for brain tumor patients.
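The abstract does not give the AFF module's internals; a plausible minimal sketch, assuming a squeeze-excite-style channel gate followed by an element-wise spatial gate over concatenated modality features, might look like this:

```python
import torch
import torch.nn as nn

class AFF(nn.Module):
    """Plausible Adaptive Feature Fusion sketch (not the authors' exact design):
    channel-wise attention followed by element-wise attention over the
    concatenation of two modality-specific 3D feature maps."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(2 * channels, 2 * channels // reduction, 1), nn.ReLU(),
            nn.Conv3d(2 * channels // reduction, 2 * channels, 1), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(
            nn.Conv3d(2 * channels, 1, kernel_size=3, padding=1), nn.Sigmoid())
        self.proj = nn.Conv3d(2 * channels, channels, 1)

    def forward(self, feat_a, feat_b):            # two modality-specific features
        x = torch.cat([feat_a, feat_b], dim=1)    # (B, 2C, D, H, W)
        x = x * self.channel_gate(x)              # channel-wise recalibration
        x = x * self.spatial_gate(x)              # element-wise recalibration
        return self.proj(x)                       # fused features (B, C, D, H, W)
```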
https://arxiv.org/abs/2409.12167
Recent advances in speech spoofing necessitate stronger verification mechanisms in neural speech codecs to ensure authenticity. Current methods embed numerical watermarks before compression and extract them from reconstructed speech for verification, but face limitations such as separate training processes for the watermark and codec, and insufficient cross-modal information integration, leading to reduced watermark imperceptibility, extraction accuracy, and capacity. To address these issues, we propose WMCodec, the first neural speech codec to jointly train compression-reconstruction and watermark embedding-extraction in an end-to-end manner, optimizing both imperceptibility and extractability of the watermark. Furthermore, we design an iterative Attention Imprint Unit (AIU) for deeper feature integration of watermark and speech, reducing the impact of quantization noise on the watermark. Experimental results show WMCodec outperforms AudioSeal with Encodec in most quality metrics for watermark imperceptibility and consistently exceeds both AudioSeal with Encodec and reinforced TraceableSpeech in watermark extraction accuracy. At a bandwidth of 6 kbps with a watermark capacity of 16 bps, WMCodec maintains over 99% extraction accuracy under common attacks, demonstrating strong robustness.
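A minimal sketch of the joint end-to-end objective as we read it: the watermark is imprinted in the codec latent and extracted from the reconstruction, so both losses backpropagate through the whole pipeline. All module interfaces (`codec`, `wm_embedder`, `wm_extractor`) are hypothetical placeholders, not the authors' API.

```python
import torch
import torch.nn.functional as F

def wmcodec_train_step(codec, wm_embedder, wm_extractor, speech, bits, alpha=1.0):
    """One joint training step in the spirit of WMCodec (sketch only).
    `speech`: waveform batch; `bits`: 0/1 watermark payload tensor."""
    latent = codec.encode(speech)                  # continuous latent
    latent = wm_embedder(latent, bits)             # imprint watermark bits
    recon = codec.decode(codec.quantize(latent))   # quantize and reconstruct
    recon_loss = F.l1_loss(recon, speech)          # imperceptibility proxy
    bit_logits = wm_extractor(recon)               # recover bits from audio
    wm_loss = F.binary_cross_entropy_with_logits(bit_logits, bits.float())
    return recon_loss + alpha * wm_loss            # single end-to-end objective
```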
https://arxiv.org/abs/2409.12121
Imitation-based robot learning has recently gained significant attention in the robotics field due to its theoretical potential for transferability and generalizability. However, it remains notoriously costly, both in terms of hardware and data collection, and deploying it in real-world environments demands meticulous setup of robots and precise experimental conditions. In this paper, we present a low-cost robot learning framework that is both easily reproducible and transferable to various robots and environments. We demonstrate that deployable imitation learning can be successfully applied even to industrial-grade robots, not just expensive collaborative robotic arms. Furthermore, our results show that multi-task robot learning is achievable with simple network architectures and fewer demonstrations than previously thought necessary. Because current evaluation methods are largely subjective when it comes to real-world manipulation tasks, we propose the Voting Positive Rate (VPR), a novel evaluation strategy that provides a more objective assessment of performance. We conduct an extensive comparison of success rates across various self-designed tasks to validate our approach. To foster collaboration and support the robot learning community, we have open-sourced all relevant datasets and model checkpoints, available at this http URL.
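The abstract does not define VPR; under one plausible reading, where several human voters score each rollout and a trial counts as positive when a majority votes success, the metric could be computed as below. The definition and threshold are our assumptions, not the paper's.

```python
def voting_positive_rate(votes_per_trial, threshold=0.5):
    """Hypothetical reading of Voting Positive Rate (the paper's exact
    definition may differ): a trial is positive when the fraction of voters
    marking it successful exceeds `threshold`."""
    positives = sum(
        1 for votes in votes_per_trial
        if sum(votes) / len(votes) > threshold)
    return positives / len(votes_per_trial)

# e.g. three trials scored by three voters each (1 = judged successful)
print(voting_positive_rate([[1, 1, 0], [0, 0, 1], [1, 1, 1]]))  # ~0.667
```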
https://arxiv.org/abs/2409.12061
Side-scan sonar (SSS) imagery presents unique challenges in the classification of man-made objects on the seafloor due to the complex and varied underwater environments. Historically, experts have manually interpreted SSS images, relying on conventional machine learning techniques with hand-crafted features. While Convolutional Neural Networks (CNNs) significantly advanced automated classification in this domain, they often fall short when dealing with diverse seafloor textures, such as rocky or ripple sand bottoms, where false positive rates may increase. Recently, Vision Transformers (ViTs) have shown potential in addressing these limitations by utilizing a self-attention mechanism to capture global information in image patches, offering more flexibility in processing spatial hierarchies. This paper rigorously compares the performance of ViT models alongside commonly used CNN architectures, such as ResNet and ConvNext, for binary classification tasks in SSS imagery. The dataset encompasses diverse geographical seafloor types and is balanced between the presence and absence of man-made objects. ViT-based models exhibit superior classification performance across f1-score, precision, recall, and accuracy metrics, although at the cost of greater computational resources. CNNs, with their inductive biases, demonstrate better computational efficiency, making them suitable for deployment in resource-constrained environments like underwater vehicles. Future research directions include exploring self-supervised learning for ViTs and multi-modal fusion to further enhance performance in challenging underwater environments.
https://arxiv.org/abs/2409.12026
When your robot grasps an object using dexterous hands or grippers, it should understand the Task-Oriented Affordances of the Object (TOAO), as different tasks often require attention to specific parts of the object. To address this challenge, we propose GauTOAO, a Gaussian-based framework for Task-Oriented Affordance of Objects, which leverages vision-language models in a zero-shot manner to predict affordance-relevant regions of an object, given a natural language query. Our approach introduces a new paradigm: "static camera, moving object," allowing the robot to better observe and understand the object in hand during manipulation. GauTOAO addresses the limitations of existing methods, which often lack effective spatial grouping, by extracting a comprehensive 3D object mask using DINO features. This mask is then used to conditionally query Gaussians, producing a refined semantic distribution over the object for the specified task. This approach results in more accurate TOAO extraction, enhancing the robot's understanding of the object and improving task performance. We validate the effectiveness of GauTOAO through real-world experiments, demonstrating its capability to generalize across various tasks.
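A minimal sketch of the mask-conditioned Gaussian query as we read it: only Gaussians inside the DINO-derived object mask are scored against the language query. The shapes and the cosine-similarity scoring are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def query_affordance(gauss_feats, object_mask, text_emb):
    """Sketch of conditionally querying Gaussians with a language embedding.
    gauss_feats: (N, D) per-Gaussian semantic features
    object_mask: (N,) bool, True for Gaussians on the extracted 3D object
    text_emb:    (D,) embedding of the natural-language task query"""
    sims = F.cosine_similarity(gauss_feats, text_emb[None, :], dim=-1)
    sims = sims.masked_fill(~object_mask, float("-inf"))  # restrict to object
    return torch.softmax(sims, dim=0)   # relevance distribution over Gaussians
```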
https://arxiv.org/abs/2409.11941
The problem of safety for robotic systems has been extensively studied. However, little attention has been given to security issues for three-dimensional systems, such as quadrotors. Malicious adversaries can compromise robot sensors and communication networks, causing incidents, achieving illegal objectives, or even injuring people. This study first designs an intelligent control system for autonomous quadrotors. Then, it investigates the problems of optimal false data injection attack scheduling and countermeasure design for unmanned aerial vehicles. Using a state-of-the-art deep learning-based approach, an optimal false data injection attack scheme is proposed to deteriorate a quadrotor's tracking performance with limited attack energy. Subsequently, an optimal tracking control strategy is learned to mitigate attacks and recover the quadrotor's tracking performance. We base our work on Agilicious, a state-of-the-art quadrotor recently deployed for autonomous settings. This paper is the first in the United Kingdom to deploy this quadrotor and implement reinforcement learning on its platform. Therefore, to promote easy reproducibility with minimal engineering overhead, we further provide (1) a comprehensive breakdown of this quadrotor, including software stacks and hardware alternatives; (2) a detailed reinforcement-learning framework to train autonomous controllers on Agilicious agents; and (3) a new open-source environment that builds upon PyFlyt for future reinforcement learning research on Agilicious platforms. Both simulated and real-world experiments are conducted in Section 5.2 to show the effectiveness of the proposed frameworks.
https://arxiv.org/abs/2409.11897
In recent years, visually-rich document understanding has attracted increasing attention. Transformer-based pre-trained models have become the mainstream approach, yielding significant performance gains in this field. However, the self-attention mechanism's quadratic computational complexity hinders their efficiency and ability to process long documents. In this paper, we present DocMamba, a novel framework based on the state space model. It is designed to reduce computational complexity to linear while preserving global modeling capabilities. To further enhance its effectiveness in document processing, we introduce the Segment-First Bidirectional Scan (SFBS) to capture contiguous semantic information. Experimental results demonstrate that DocMamba achieves new state-of-the-art results on downstream datasets such as FUNSD, CORD, and SROIE, while significantly improving speed and reducing memory usage. Notably, experiments on the HRDoc confirm DocMamba's potential for length extrapolation. The code will be available online.
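A minimal sketch of SFBS as we read the abstract: tokens are first reordered so that each semantic segment is contiguous, and the state-space scan then runs in both directions over that order. The interface is our illustration, not the authors' code.

```python
import torch

def segment_first_bidirectional_scan(tokens, segment_ids):
    """Sketch of Segment-First Bidirectional Scan.
    tokens:      (L, D) token embeddings in reading order
    segment_ids: (L,) integer segment label per token"""
    order = torch.argsort(segment_ids, stable=True)  # group tokens by segment
    fwd = tokens[order]                              # forward scan order
    bwd = fwd.flip(0)                                # backward scan order
    return fwd, bwd, order   # feed fwd/bwd through the SSM, then undo `order`
```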
https://arxiv.org/abs/2409.11887
State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. However, they struggle with global context modeling due to their data-independent matrices. The Mamba model addressed this with data-dependent variants via the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability issue of large SSMs for image classification and action recognition without requiring additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts like JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to $+1.7$.
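A minimal sketch of a Mamba-Attention interleaved stack under stated assumptions: it depends on the `mamba_ssm` package's `Mamba` block, and the interleaving ratio, normalization, and dimensions are illustrative rather than the paper's configuration.

```python
import torch.nn as nn
from mamba_ssm import Mamba  # assumed dependency; configs are illustrative

class InterleavedStack(nn.Module):
    """Mamba blocks handle sequence mixing cheaply; a periodic attention
    block restores full global token mixing."""
    def __init__(self, dim=256, depth=8, attn_every=4, heads=8):
        super().__init__()
        self.blocks, self.norms = nn.ModuleList(), nn.ModuleList()
        for i in range(depth):
            if (i + 1) % attn_every == 0:
                self.blocks.append(nn.MultiheadAttention(dim, heads, batch_first=True))
            else:
                self.blocks.append(Mamba(d_model=dim))
            self.norms.append(nn.LayerNorm(dim))

    def forward(self, x):                        # x: (batch, tokens, dim)
        for blk, norm in zip(self.blocks, self.norms):
            h = norm(x)
            if isinstance(blk, nn.MultiheadAttention):
                h, _ = blk(h, h, h, need_weights=False)
            else:
                h = blk(h)
            x = x + h                            # pre-norm residual
        return x
```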
https://arxiv.org/abs/2409.11867
In recent years, speech diffusion models have advanced rapidly. Alongside the widely used U-Net architecture, transformer-based models such as the Diffusion Transformer (DiT) have also gained attention. However, current DiT speech models treat Mel spectrograms as general images, which overlooks the specific acoustic properties of speech. To address these limitations, we propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS), which builds on DiT and achieves fast training without compromising accuracy. Notably, DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties, enhancing the naturalness of the generated speech. Additionally, we introduce a fine-grained style temporal modeling method that further improves speaker style similarity. Experimental results demonstrate that our method increases the training speed by nearly 2 times and significantly outperforms the baseline models.
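To make the inference order concrete, here is a tiny sketch of the low-to-high-frequency, frame-by-frame patch ordering as we read the abstract; the patching granularity is an assumption.

```python
def dpi_generation_order(num_frames, num_bands):
    """Within each time frame the low-frequency bands come first, and
    earlier frames precede later ones (our reading of DPI-TTS)."""
    return [(t, f) for t in range(num_frames) for f in range(num_bands)]

# first few steps for 3 frames x 4 frequency bands
print(dpi_generation_order(3, 4)[:6])
# [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1)]
```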
https://arxiv.org/abs/2409.11835
Extract-then-Abstract is a naturally coherent paradigm to conduct abstractive summarization with the help of salient information identified by the extractive model. Previous works that adopt this paradigm train the extractor and abstractor separately and introduce extra parameters to highlight the extracted salients to the abstractor, which results in error accumulation and additional training costs. In this paper, we first introduce a parameter-free highlight method into the encoder-decoder framework: replacing the encoder attention mask with a saliency mask in the cross-attention module to force the decoder to focus only on salient parts of the input. A preliminary analysis compares different highlight methods, demonstrating the effectiveness of our saliency mask. We further propose the novel extract-and-abstract paradigm, ExtAbs, which jointly and seamlessly performs extractive and abstractive summarization tasks within a single encoder-decoder model to reduce error accumulation. In ExtAbs, the vanilla encoder is augmented to extract salients, and the vanilla decoder is modified with the proposed saliency mask to generate summaries. Built upon BART and PEGASUS, experiments on three datasets show that ExtAbs can achieve superior performance than baselines on the extractive task and performs comparably to, or even better than, the vanilla models on the abstractive task.
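The saliency-mask idea maps directly onto standard attention APIs; a minimal sketch with `torch.nn.MultiheadAttention`, where the extractor's salient positions are the only encoder keys the decoder may attend to (all sizes are illustrative):

```python
import torch
import torch.nn as nn

dim, enc_len, dec_len = 512, 128, 32
cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

decoder_states = torch.randn(1, dec_len, dim)
encoder_states = torch.randn(1, enc_len, dim)
salient = torch.zeros(1, enc_len, dtype=torch.bool)
salient[:, 10:40] = True                  # positions picked by the extractor

# Parameter-free highlighting: mask out every non-salient encoder position so
# the decoder can only attend to the extracted salient spans.
out, _ = cross_attn(decoder_states, encoder_states, encoder_states,
                    key_padding_mask=~salient)  # True = ignore this key
```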
https://arxiv.org/abs/2409.11827
Large language model (LLM) role-playing has gained widespread attention, where authentic character knowledge is crucial for constructing realistic LLM role-playing agents. However, existing works usually overlook the exploration of LLMs' ability to detect characters' known knowledge errors (KKE) and unknown knowledge errors (UKE) while playing roles, which leads to low-quality automatic construction of character-training corpora. In this paper, we propose a probing dataset to evaluate LLMs' ability to detect errors in KKE and UKE. The results indicate that even the latest LLMs struggle to effectively detect these two types of errors, especially when it comes to familiar knowledge. We experiment with various reasoning strategies and propose an agent-based reasoning method, Self-Recollection and Self-Doubt (S2RD), to further explore the potential for improving error detection capabilities. Experiments show that our method effectively improves the LLMs' ability to detect erroneous character knowledge, but it remains an issue that requires ongoing attention.
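The abstract names but does not specify the S2RD procedure; a plausible two-stage prompting sketch, where `llm` is a hypothetical text-in/text-out callable and the prompt wording is ours, might look like this:

```python
def s2rd_detect(llm, persona, claim):
    """Sketch of a Self-Recollection / Self-Doubt loop as suggested by the
    S2RD name; prompts are illustrative, not the authors'."""
    # Stage 1: recollect what the character should know about the topic.
    memory = llm(f"You are {persona}. Recall everything you know that is "
                 f"relevant to: {claim}")
    # Stage 2: doubt the claim against the recollection.
    verdict = llm(f"You are {persona}. Given your recollection:\n{memory}\n"
                  f"Does the following contain a known-knowledge error, an "
                  f"unknown-knowledge error, or no error?\n{claim}")
    return verdict
```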
https://arxiv.org/abs/2409.11726
Unsupervised video semantic compression (UVSC), i.e., compressing videos to better support various analysis tasks, has recently garnered attention. However, the semantic richness of previous methods remains limited, due to a single semantic learning objective, limited training data, and so on. To address this, we propose to boost the UVSC task by absorbing the rich off-the-shelf semantics from visual foundation models (VFMs). Specifically, we introduce a VFMs-shared semantic alignment layer, complemented by VFM-specific prompts, to flexibly align semantics between the compressed video and various VFMs. This allows different VFMs to collaboratively build a mutually-enhanced semantic space, guiding the learning of the compression model. Moreover, we introduce a dynamic trajectory-based inter-frame compression scheme, which first estimates the semantic trajectory based on the historical content, and then traverses along the trajectory to predict the future semantics as the coding context. This reduces the overall bit cost of the system, further improving the compression efficiency. Our approach outperforms previous coding methods on three mainstream tasks and six datasets.
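A minimal sketch of trajectory-based inter-frame prediction as we read it: fit the motion of recent semantic features and extrapolate it to form the coding context. Linear extrapolation is our assumption; the paper's trajectory model may be richer.

```python
import torch

def predict_next_semantics(history, horizon=1):
    """Extrapolate the semantic trajectory of recent frames (sketch).
    history: (T, D) semantic features of the last T decoded frames."""
    velocity = (history[1:] - history[:-1]).mean(dim=0)   # average step
    return history[-1] + horizon * velocity               # predicted coding context
```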
https://arxiv.org/abs/2409.11718
Deep visual odometry, despite extensive research, still faces limitations in accuracy and generalizability that prevent its broader application. To address these challenges, we propose an Oriented FAST and Rotated BRIEF (ORB)-guided visual odometry with selective online adaptation named ORB-SfMLearner. We present a novel use of ORB features for learning-based ego-motion estimation, leading to more robust and accurate results. We also introduce the cross-attention mechanism to enhance the explainability of PoseNet, and reveal that the driving direction of the vehicle can be explained through attention weights, marking a novel exploration in this area. To improve generalizability, our selective online adaptation allows the network to rapidly and selectively adjust to the optimal parameters across different domains. Experimental results on the KITTI and vKITTI datasets show that our method outperforms previous state-of-the-art deep visual odometry methods in terms of ego-motion accuracy and generalizability.
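The ORB feature side is straightforward to reproduce with OpenCV; a minimal sketch follows. How the features enter PoseNet (e.g. as an attention prior) is not specified in the abstract, so only the extraction step is shown.

```python
import cv2
import numpy as np

orb = cv2.ORB_create(nfeatures=1000)

def orb_guidance(frame_bgr):
    """Extract ORB keypoints/descriptors to guide a learned pose network."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    pts = np.array([kp.pt for kp in keypoints], dtype=np.float32)
    return pts, descriptors   # (N, 2) pixel coords, (N, 32) binary descriptors
```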
https://arxiv.org/abs/2409.11692
Pose skeleton images are an important reference in pose-controllable image generation. In order to enrich the source of skeleton images, recent works have investigated the generation of pose skeletons based on natural language. These methods are based on GANs. However, it remains challenging to perform diverse, structurally correct and aesthetically pleasing human pose skeleton generation with various textual inputs. To address this problem, we propose PoseDiffusion, a framework with GUNet as its main model. It is the first generative framework based on a diffusion model and also contains a series of variants fine-tuned based on a stable diffusion model. PoseDiffusion demonstrates several desired properties that outperform existing methods. 1) Correct Skeletons. GUNet, the denoising model of PoseDiffusion, is designed to incorporate graph convolutional neural networks. It is able to learn the spatial relationships of the human skeleton by introducing skeletal information during the training process. 2) Diversity. We decouple the key points of the skeleton and characterise them separately, and use cross-attention to introduce textual conditions. Experimental results show that PoseDiffusion outperforms existing SoTA algorithms in terms of stability and diversity of text-driven pose skeleton generation. Qualitative analyses further demonstrate its superiority for controllable generation in Stable Diffusion.
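A minimal sketch of the graph-convolution ingredient GUNet is said to add: one GCN propagation step over skeleton keypoints, with the adjacency fixed by the human-skeleton topology. The standard normalized propagation is our choice; the paper's exact layer may differ.

```python
import torch
import torch.nn as nn

class SkeletonGraphConv(nn.Module):
    """One graph-convolution layer over skeleton joints:
    X' = D^{-1/2} (A + I) D^{-1/2} X W (standard GCN propagation)."""
    def __init__(self, adjacency: torch.Tensor, in_dim: int, out_dim: int):
        super().__init__()
        a_hat = adjacency + torch.eye(adjacency.shape[0])
        d_inv_sqrt = a_hat.sum(-1).rsqrt().diag()
        self.register_buffer("prop", d_inv_sqrt @ a_hat @ d_inv_sqrt)
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x):          # x: (batch, joints, in_dim)
        return torch.relu(self.lin(self.prop @ x))
```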
https://arxiv.org/abs/2409.11689
Histopathology analysis is the gold standard for medical diagnosis. Accurate classification of whole slide images (WSIs) and region-of-interest (ROI) localization can assist pathologists in diagnosis. The gigapixel resolution of WSIs and the absence of fine-grained annotations make direct classification and analysis challenging. In weakly supervised learning, multiple instance learning (MIL) presents a promising approach for WSI classification. The prevailing strategy is to use attention mechanisms to measure instance importance for classification. However, attention mechanisms fail to capture inter-instance information, and self-attention causes quadratic computational complexity. To address these challenges, we propose AMD-MIL, an agent aggregator with a mask denoise mechanism. The agent token acts as an intermediate variable between the query and key for computing instance importance. Mask and denoising matrices, mapped from the agent-aggregated values, dynamically mask low-contribution representations and eliminate noise. AMD-MIL achieves better attention allocation by adjusting feature representations, capturing micro-metastases in cancer, and improving interpretability. Extensive experiments on CAMELYON-16, CAMELYON-17, TCGA-KIDNEY, and TCGA-LUNG show AMD-MIL's superiority over state-of-the-art methods.
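A minimal sketch of agent-mediated MIL attention as we read the abstract: a small set of learnable agent tokens mediates between the bag-level query and the instances, avoiding quadratic instance-instance attention; the paper's mask/denoise matrices are reduced here to a simple low-contribution mask, which is our simplification.

```python
import torch
import torch.nn as nn

class AgentAggregator(nn.Module):
    """Agent-attention MIL aggregator (sketch, not the authors' AMD-MIL)."""
    def __init__(self, dim=512, num_agents=16, keep_ratio=0.75):
        super().__init__()
        self.agents = nn.Parameter(torch.randn(num_agents, dim))
        self.attn1 = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.attn2 = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, instances):                 # (B, N, dim) patch features
        B, N, _ = instances.shape
        # mask the lowest-contribution instances (proxy for mask-denoise)
        scores = instances.norm(dim=-1)
        k = int(N * self.keep_ratio)
        drop = scores < scores.topk(k, dim=-1).values[..., -1:]
        agents = self.agents.expand(B, -1, -1)
        agents, _ = self.attn1(agents, instances, instances,
                               key_padding_mask=drop)  # agents gather evidence
        bag, _ = self.attn2(agents.mean(1, keepdim=True), agents, agents)
        return bag.squeeze(1)                     # (B, dim) slide embedding
```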
https://arxiv.org/abs/2409.11664
The neural codec language model (CLM) has demonstrated remarkable performance in text-to-speech (TTS) synthesis. However, troubled by "recency bias", CLM lacks sufficient attention to coarse-grained information at a higher temporal scale, often producing unnatural or even unintelligible speech. This work proposes CoFi-Speech, a coarse-to-fine CLM-TTS approach, employing multi-scale speech coding and generation to address this issue. We train a multi-scale neural codec, CoFi-Codec, to encode speech into a multi-scale discrete representation, comprising multiple token sequences with different time resolutions. Then, we propose CoFi-LM that can generate this representation in two modes: the single-LM-based chain-of-scale generation and the multiple-LM-based stack-of-scale generation. In experiments, CoFi-Speech significantly outperforms single-scale baseline systems on naturalness and speaker similarity in zero-shot TTS. The analysis of multi-scale coding demonstrates the effectiveness of CoFi-Codec in learning multi-scale discrete speech representations while keeping high-quality speech reconstruction. The coarse-to-fine multi-scale generation, especially for the stack-of-scale approach, is also validated as a crucial approach in pursuing a high-quality neural codec language model for TTS.
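A minimal sketch of the single-LM chain-of-scale mode as the abstract describes it: one LM generates the coarsest token sequence first, then re-conditions on everything generated so far to produce each finer scale. The `lm.generate` interface and scale names are hypothetical.

```python
def chain_of_scale_generate(lm, text_tokens, scales=("coarse", "mid", "fine")):
    """Chain-of-scale generation sketch: finer scales condition on coarser ones."""
    context = list(text_tokens)
    outputs = {}
    for scale in scales:                 # low time-resolution -> high
        tokens = lm.generate(context, scale=scale)   # hypothetical interface
        outputs[scale] = tokens
        context += tokens                # finer scales see coarser outputs
    return outputs                       # decode with the multi-scale codec
```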
https://arxiv.org/abs/2409.11630
AI data-driven models (Graphcast, Pangu Weather, Fourcastnet, and SFNO) are explored for storyline-based climate attribution due to their short inference times, which can increase the number of events studied and provide real-time attributions when public attention is heightened. The analysis is framed on the extreme atmospheric river episode of February 2017 that contributed to the Oroville dam spillway incident in Northern California. Past and future simulations are generated by perturbing the initial conditions with the pre-industrial and the late-21st century temperature climate change signals, respectively. The simulations are compared to results from a dynamical model which represents plausible pseudo-realities under both climate environments. Overall, the AI models show promising results, projecting a 5-6 % increase in the integrated water vapor over the Oroville dam in the present day compared to the pre-industrial, in agreement with the dynamical model. Different geopotential-moisture-temperature dependencies are unveiled for each of the AI models tested, providing valuable information for understanding the physicality of the attribution response. However, the AI models tend to simulate weaker attribution values than the pseudo-reality imagined by the dynamical model, suggesting some reduced extrapolation skill, especially for the late-21st century regime. Large ensembles generated with an AI model (>500 members) produced statistically significant present-day to pre-industrial attribution results, unlike the >20-member ensemble from the dynamical model. This analysis highlights the potential of AI models to conduct attribution analysis, while emphasizing future lines of work on explainable artificial intelligence to gain confidence in these tools, which can enable reliable attribution studies in real-time.
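A minimal sketch of the storyline protocol described above: perturb the initial condition with a climate-change temperature signal, run a (hypothetical) AI forecast model for an ensemble, and compare integrated water vapor (IWV). The `forecast_model` interface and the initial-condition noise scale are assumptions.

```python
import numpy as np

def storyline_attribution(forecast_model, analysis_ic, delta_t_preindustrial,
                          n_members=500, ic_noise=0.05):
    """Present-day vs pre-industrial IWV attribution signal (sketch).
    forecast_model(ic) -> IWV field; delta_t_* is the climate-change signal."""
    def ensemble_iwv(ic):
        members = [forecast_model(ic + ic_noise * np.random.randn(*ic.shape))
                   for _ in range(n_members)]
        return np.mean(members, axis=0)

    present = ensemble_iwv(analysis_ic)
    preind = ensemble_iwv(analysis_ic + delta_t_preindustrial)  # cooled IC
    return 100.0 * (present - preind) / preind   # % IWV attribution signal
```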
https://arxiv.org/abs/2409.11605
Voice conversion (VC) aims to modify the speaker's identity while preserving the linguistic content. Commonly, VC methods use an encoder-decoder architecture, where disentangling the speaker's identity from linguistic information is crucial. However, the disentanglement approaches used in these methods are limited as the speaker features depend on the phonetic content of the utterance, compromising disentanglement. This dependency is amplified with attention-based methods. To address this, we introduce a novel masking mechanism in the input before speaker encoding, masking certain discrete speech units that correspond highly with phoneme classes. Our work aims to reduce the phonetic dependency of speaker features by restricting access to some phonetic information. Furthermore, since our approach is at the input level, it is applicable to any encoder-decoder based VC framework. Our approach improves disentanglement and conversion performance across multiple VC methods, showing significant effectiveness, particularly in attention-based method, with 44% relative improvement in objective intelligibility.
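A minimal sketch of the input-level masking described above: discrete speech units whose association with a phoneme class exceeds a threshold are replaced by a mask token before the speaker encoder sees them. The correlation table and threshold are assumptions.

```python
import torch

def mask_phonetic_units(unit_ids, unit_phoneme_corr, mask_id, threshold=0.8):
    """Mask units that correspond highly with phoneme classes (sketch).
    unit_ids: (T,) discrete unit indices for one utterance
    unit_phoneme_corr: (num_units,) phoneme-association score in [0, 1]"""
    phonetic = unit_phoneme_corr[unit_ids] > threshold
    return torch.where(phonetic, torch.full_like(unit_ids, mask_id), unit_ids)
```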
https://arxiv.org/abs/2409.11560
Effective retinal vessel segmentation requires a sophisticated integration of global contextual awareness and local vessel continuity. To address this challenge, we propose the Graph Capsule Convolution Network (GCC-UNet), which merges capsule convolutions with CNNs to capture both local and global features. The Graph Capsule Convolution operator is specifically designed to enhance the representation of global context, while the Selective Graph Attention Fusion module ensures seamless integration of local and global information. To further improve vessel continuity, we introduce the Bottleneck Graph Attention module, which incorporates Channel-wise and Spatial Graph Attention mechanisms. The Multi-Scale Graph Fusion module adeptly combines features from various scales. Our approach has been rigorously validated through experiments on widely used public datasets, with ablation studies confirming the efficacy of each component. Comparative results highlight GCC-UNet's superior performance over existing methods, setting a new benchmark in retinal vessel segmentation. Notably, this work represents the first integration of vanilla, graph, and capsule convolutional techniques in the domain of medical image segmentation.
https://arxiv.org/abs/2409.11508