Learning for robot navigation is a critical and challenging task. Because real-world datasets are scarce and costly, efficient learning approaches are needed. In this letter, we exploit Euclidean symmetry in planning for 2D navigation, which originates from Euclidean transformations between reference frames and enables parameter sharing. To address the challenges of unstructured environments, we formulate the navigation problem as planning on a geometric graph and develop an equivariant message passing network to perform value iteration. Furthermore, to handle multi-camera input, we propose a learnable equivariant layer to lift features to a desired space. We conduct comprehensive evaluations across five diverse tasks covering structured and unstructured environments, known and unknown maps, and point goals or semantic goals. Our experiments confirm substantial benefits in training efficiency, stability, and generalization.
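Learning to navigate is a critical and challenging task for robots. Real-world data is scarce and expensive, so efficient learning methods are needed. In this letter, we exploit Euclidean symmetry in planning for 2D navigation; this symmetry arises from Euclidean transformations between reference frames and enables parameter sharing. To handle unstructured environments, we cast the navigation problem as planning on a geometric graph and develop an equivariant message passing network to carry out value iteration. In addition, to handle multi-camera input, we propose a learnable equivariant layer that lifts features to the desired space. We evaluate on five different tasks spanning structured and unstructured environments, known and unknown maps, and point or semantic goals. Our experiments confirm substantial gains in training efficiency, stability, and generalization.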
https://arxiv.org/abs/2309.13043
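The abstract above describes value iteration carried out by message passing on a graph. The following is a minimal sketch of plain (non-equivariant, non-learned) value iteration on a small graph, where each node aggregates "messages" from its neighbors with a max. The function names, the unit step cost, and the discount factor are illustrative assumptions and are not taken from the paper.

```python
def graph_value_iteration(neighbors, goal, step_cost=1.0, gamma=0.95, iters=50):
    """Plain value iteration on a graph given as {node: [neighbor, ...]}.

    Assumption: every move costs `step_cost` and the goal is absorbing.
    """
    values = {node: 0.0 for node in neighbors}
    for _ in range(iters):
        updated = {}
        for node, nbrs in neighbors.items():
            if node == goal:
                updated[node] = 0.0  # no further cost once the goal is reached
            else:
                # Max-aggregate the messages coming from neighboring nodes.
                updated[node] = max(
                    (-step_cost + gamma * values[n] for n in nbrs),
                    default=float("-inf"),
                )
        values = updated
    return values


def greedy_path(neighbors, values, start, goal, max_steps=100):
    """Follow the highest-valued neighbor from start until the goal."""
    path = [start]
    while path[-1] != goal and len(path) <= max_steps:
        nbrs = neighbors[path[-1]]
        if not nbrs:
            break
        path.append(max(nbrs, key=lambda n: values[n]))
    return path


# Hypothetical 4-node chain graph: 0 - 1 - 2 - 3, with node 3 as the goal.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
values = graph_value_iteration(chain, goal=3)
print(greedy_path(chain, values, start=0, goal=3))
```

In a learned planner the max-aggregation and the step cost would be replaced by trainable message and update functions; this sketch only shows the fixed-point iteration itself.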
Neural radiance fields (NeRF) have revolutionized the field of image-based view synthesis. However, NeRF uses straight rays and fails to deal with complicated light path changes caused by refraction and reflection. This prevents NeRF from successfully synthesizing transparent or specular objects, which are ubiquitous in real-world robotics and AR/VR applications. In this paper, we introduce the refractive-reflective field. Taking the object silhouette as input, we first utilize marching tetrahedra with a progressive encoding to reconstruct the geometry of non-Lambertian objects and then model refraction and reflection effects of the object in a unified framework using Fresnel terms. Meanwhile, to achieve efficient and effective anti-aliasing, we propose a virtual cone supersampling technique. We benchmark our method on different shapes, backgrounds and Fresnel terms on both real-world and synthetic datasets. We also qualitatively and quantitatively benchmark the rendering results of various editing applications, including material editing, object replacement/insertion, and environment illumination estimation. Code and data are publicly available at this https URL.
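Neural radiance fields (NeRF) have transformed image-based view synthesis. However, NeRF uses straight rays and cannot handle the complicated changes in light paths caused by refraction and reflection. As a result, NeRF cannot successfully synthesize transparent or specular objects, which are ubiquitous in real-world robotics and AR/VR applications. In this paper, we introduce the refractive-reflective field. Taking the object silhouette as input, we first use marching tetrahedra with progressive encoding to reconstruct the geometry of non-Lambertian objects, and then model the object's refraction and reflection effects in a unified framework using Fresnel terms. To achieve efficient and effective anti-aliasing, we also propose a virtual cone supersampling technique. We benchmark our method on real-world and synthetic datasets with different shapes, backgrounds, and Fresnel terms. We further benchmark, qualitatively and quantitatively, the rendering results of several editing applications, including material editing, object replacement/insertion, and environment illumination estimation. Code and data are publicly available at this https URL.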
https://arxiv.org/abs/2309.13039
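The abstract above combines refraction and reflection through Fresnel terms. As a minimal sketch of what such a term computes, the code below evaluates the standard Fresnel reflectance for unpolarized light at a dielectric interface and uses it to blend a reflected and a refracted color. The function names and the scalar "colors" are illustrative assumptions; the paper's actual rendering pipeline is not reproduced here.

```python
import math


def fresnel_reflectance(cos_i, n1, n2):
    """Fraction of unpolarized light reflected at an interface from index n1 to n2.

    cos_i is the cosine of the incidence angle (1.0 means normal incidence).
    """
    cos_i = min(1.0, max(0.0, cos_i))
    # Snell's law: sin_t = (n1 / n2) * sin_i
    sin_t_sq = (n1 / n2) ** 2 * (1.0 - cos_i ** 2)
    if sin_t_sq > 1.0:
        return 1.0  # total internal reflection: nothing is transmitted
    cos_t = math.sqrt(1.0 - sin_t_sq)
    r_s = (n1 * cos_i - n2 * cos_t) / (n1 * cos_i + n2 * cos_t)
    r_p = (n1 * cos_t - n2 * cos_i) / (n1 * cos_t + n2 * cos_i)
    return 0.5 * (r_s ** 2 + r_p ** 2)


def blend(reflected, refracted, cos_i, n1=1.0, n2=1.5):
    """Mix reflected and refracted radiance by the Fresnel reflectance."""
    r = fresnel_reflectance(cos_i, n1, n2)
    return r * reflected + (1.0 - r) * refracted


# Air-to-glass (assumed index 1.5) at normal incidence reflects about 4%.
print(fresnel_reflectance(1.0, 1.0, 1.5))
```

At grazing angles the reflectance approaches 1, which is why transparent objects look mirror-like near their silhouettes.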
Hand-crafted image quality metrics, such as PSNR and SSIM, are commonly used to evaluate model privacy risk under reconstruction attacks. Under these metrics, reconstructed images that are determined to resemble the original one generally indicate more privacy leakage. Images determined as overall dissimilar, on the other hand, indicate higher robustness against attack. However, there is no guarantee that these metrics well reflect human opinions, which, as a judgement for model privacy leakage, are more trustworthy. In this paper, we comprehensively study the faithfulness of these hand-crafted metrics to human perception of privacy information from the reconstructed images. On 5 datasets ranging from natural images, faces, to fine-grained classes, we use 4 existing attack methods to reconstruct images from many different classification models and, for each reconstructed image, we ask multiple human annotators to assess whether this image is recognizable. Our studies reveal that the hand-crafted metrics only have a weak correlation with the human evaluation of privacy leakage and that even these metrics themselves often contradict each other. These observations suggest risks of current metrics in the community. To address this potential risk, we propose a learning-based measure called SemSim to evaluate the Semantic Similarity between the original and reconstructed images. SemSim is trained with a standard triplet loss, using an original image as an anchor, one of its recognizable reconstructed images as a positive sample, and an unrecognizable one as a negative. By training on human annotations, SemSim exhibits a greater reflection of privacy leakage on the semantic level. We show that SemSim has a significantly higher correlation with human judgment compared with existing metrics. Moreover, this strong correlation generalizes to unseen datasets, models and attack methods.
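Hand-crafted image quality metrics such as PSNR and SSIM are commonly used to evaluate model privacy risk under reconstruction attacks. Under these metrics, a reconstructed image judged to resemble the original generally indicates more privacy leakage, while an image judged dissimilar overall indicates greater robustness against the attack. However, there is no guarantee that these metrics reflect human opinion well, even though human opinion is the more trustworthy judgment of model privacy leakage. In this paper, we comprehensively study how faithful these hand-crafted metrics are to human perception of private information in reconstructed images. On 5 datasets covering natural images, faces, and fine-grained classes, we use 4 existing attack methods to reconstruct images from many classification models, and for each reconstructed image we ask multiple human annotators whether it is recognizable. Our study shows that the hand-crafted metrics correlate only weakly with human evaluation of privacy leakage, and that the metrics often contradict one another. These observations point to risks in the metrics currently used by the community. To address this potential risk, we propose a learning-based measure called SemSim that evaluates the semantic similarity between original and reconstructed images. SemSim is trained with a standard triplet loss, using an original image as the anchor, one of its recognizable reconstructions as the positive sample, and an unrecognizable reconstruction as the negative sample. Because it is trained on human annotations, SemSim better reflects privacy leakage at the semantic level. We show that SemSim correlates much more strongly with human judgment than existing metrics. Moreover, this strong correlation generalizes to unseen datasets, models, and attack methods.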
https://arxiv.org/abs/2309.13038
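The abstract above trains SemSim with a standard triplet loss (original image as anchor, recognizable reconstruction as positive, unrecognizable reconstruction as negative). A minimal sketch of that loss on plain feature vectors follows. The Euclidean distance, the margin value, and the toy embeddings are assumptions for illustration; the paper's feature extractor and hyperparameters are not specified in the abstract.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the positive closer than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical 2-D embeddings.
original = [0.0, 0.0]        # anchor: embedding of the original image
recognizable = [0.0, 1.0]    # positive: a recognizable reconstruction
unrecognizable = [3.0, 4.0]  # negative: an unrecognizable reconstruction

print(triplet_loss(original, recognizable, unrecognizable))  # already well separated
print(triplet_loss(original, unrecognizable, recognizable))  # violated ordering is penalized
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, so training effort concentrates on triplets that still violate the ordering.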
PyPose is an open-source library for robot learning. It combines a learning-based approach with physics-based optimization, which enables seamless end-to-end robot learning. It has been used in many tasks due to its meticulously designed application programming interface (API) and efficient implementation. Since its initial launch in early 2022, PyPose has undergone significant enhancements and gained a wide variety of new features. To satisfy the growing demand for understanding and using the library, and to flatten the learning curve for new users, we present the fundamental design principle of the imperative programming interface and showcase the flexible usage of diverse functionalities and modules using an extremely simple Dubins car example. We also demonstrate that PyPose can be easily used to navigate a real quadruped robot with a few lines of code.
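PyPose is an open-source robot learning library that combines learning-based methods with physics-based optimization to enable seamless end-to-end robot learning. Thanks to its carefully designed application programming interface (API) and efficient implementation, PyPose has been used in many tasks. Since its first release in early 2022, PyPose has improved significantly and gained a rich set of new features. To meet the growing demand for understanding and using the library, and to flatten the learning curve for new users, we present the basic design principle of the imperative programming interface and demonstrate the flexible use of various functions and modules through a simple Dubins car example. We also show that PyPose can easily be used to navigate a real quadruped robot with only a few lines of code.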
https://arxiv.org/abs/2309.13035
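The abstract above uses a Dubins car as its running example. For readers unfamiliar with the model, the following is a minimal sketch of Dubins car kinematics (constant forward speed, controlled turn rate) integrated with explicit Euler steps. It deliberately uses only the Python standard library and does not call the PyPose API; the function names and the step size are illustrative assumptions.

```python
import math


def dubins_step(state, speed, turn_rate, dt):
    """One explicit-Euler step of the Dubins car: state is (x, y, heading)."""
    x, y, theta = state
    return (
        x + speed * math.cos(theta) * dt,
        y + speed * math.sin(theta) * dt,
        theta + turn_rate * dt,
    )


def rollout(state, speed, turn_rates, dt):
    """Apply a sequence of turn-rate commands and return the visited states."""
    trajectory = [state]
    for turn_rate in turn_rates:
        state = dubins_step(state, speed, turn_rate, dt)
        trajectory.append(state)
    return trajectory


# Drive straight along the x-axis for 10 steps, then turn left for 10 steps.
straight = rollout((0.0, 0.0, 0.0), speed=1.0, turn_rates=[0.0] * 10, dt=0.1)
turning = rollout(straight[-1], speed=1.0, turn_rates=[1.0] * 10, dt=0.1)
print(straight[-1], turning[-1])
```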
Conformers have recently been proposed as a promising modelling approach for automatic speech recognition (ASR), outperforming recurrent neural network-based approaches and transformers. Nevertheless, in general, the performance of these end-to-end models, especially attention-based models, is particularly degraded in the case of long utterances. To address this limitation, we propose adding a fully-differentiable memory-augmented neural network between the encoder and decoder of a conformer. This external memory can enrich the generalization for longer utterances since it allows the system to store and retrieve more information recurrently. Notably, we explore the neural Turing machine (NTM) that results in our proposed Conformer-NTM model architecture for ASR. Experimental results using Librispeech train-clean-100 and train-960 sets show that the proposed system outperforms the baseline conformer without memory for long utterances.
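Conformers have recently been proposed as a promising modelling approach for automatic speech recognition (ASR), outperforming recurrent neural network-based approaches and transformers. In general, however, these end-to-end models, and attention-based models in particular, perform markedly worse on long utterances. To address this problem, we propose adding a fully differentiable memory-augmented neural network between the encoder and decoder of a conformer. This external memory can improve generalization to longer utterances because it lets the system repeatedly store and retrieve more information. Specifically, we explore the neural Turing machine (NTM), which leads to our proposed Conformer-NTM model architecture. Experimental results on the Librispeech train-clean-100 and train-960 sets show that the proposed system outperforms the memory-free baseline conformer on long utterances.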
https://arxiv.org/abs/2309.13029
Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it requires several rounds of pruning and re-training for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, resulting in either sparse monolingual models or a sparse multilingual model (named Dynamic ASR Pathways). Our approach dynamically adapts the sub-network, avoiding premature decisions about a fixed sub-network structure. We show that our approach outperforms existing pruning methods when targeting sparse monolingual models. Further, we illustrate that Dynamic ASR Pathways jointly discovers and trains better sub-networks (pathways) of a single multilingual model by adapting from different sub-network initializations, thereby reducing the need for language-specific pruning.
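Neural network pruning is an effective way to compress a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it requires repeated rounds of pruning and re-training for each language. In this work, we propose an adaptive masking approach for efficiently pruning a multilingual ASR model in two scenarios, yielding either sparse monolingual models or a sparse multilingual model (called Dynamic ASR Pathways). Our approach adapts the sub-network dynamically and avoids committing too early to a fixed sub-network structure. We show that it outperforms existing pruning methods when the target is sparse monolingual models. We further show that Dynamic ASR Pathways jointly discovers and trains better sub-networks (pathways) of a single multilingual model by adapting from different sub-network initializations, which reduces the need for language-specific pruning.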
https://arxiv.org/abs/2309.13018
Sparse training is one of the promising techniques to reduce the computational cost of DNNs while retaining high accuracy. In particular, N:M fine-grained structured sparsity, where only N out of consecutive M elements can be nonzero, has attracted attention due to its hardware-friendly pattern and capability of achieving a high sparse ratio. However, the potential to accelerate N:M sparse DNN training has not been fully exploited, and there is a lack of efficient hardware supporting N:M sparse training. To tackle these challenges, this paper presents a computation-efficient training scheme for N:M sparse DNNs using algorithm, architecture, and dataflow co-design. At the algorithm level, a bidirectional weight pruning method, dubbed BDWP, is proposed to leverage the N:M sparsity of weights during both forward and backward passes of DNN training, which can significantly reduce the computational cost while maintaining model accuracy. At the architecture level, a sparse accelerator for DNN training, namely SAT, is developed to neatly support both the regular dense operations and the computation-efficient N:M sparse operations. At the dataflow level, multiple optimization methods ranging from interleave mapping, pre-generation of N:M sparse weights, and offline scheduling, are proposed to boost the computational efficiency of SAT. Finally, the effectiveness of our training scheme is evaluated on a Xilinx VCU1525 FPGA card using various DNN models and datasets. Experimental results show the SAT accelerator with the BDWP sparse training method under 2:8 sparse ratio achieves an average speedup of 1.75x over that with the dense training, accompanied by a negligible accuracy loss of 0.56% on average. Furthermore, our proposed training scheme significantly improves the training throughput by 2.97~25.22x and the energy efficiency by 1.36~3.58x over prior FPGA-based accelerators.
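Sparse training is a promising technique for reducing the computational cost of DNNs while retaining high accuracy. In particular, N:M fine-grained structured sparsity, in which only N out of every M consecutive elements may be nonzero, has attracted attention for its hardware-friendly pattern and its ability to reach a high sparsity ratio. However, the potential to accelerate N:M sparse DNN training has not been fully exploited, and efficient hardware support for N:M sparse training is lacking. To address these challenges, this paper presents a computation-efficient training scheme based on algorithm, architecture, and dataflow co-design. At the algorithm level, we propose a bidirectional weight pruning method, called BDWP, that exploits the N:M sparsity of weights in both the forward and backward passes of DNN training, significantly reducing computational cost while maintaining model accuracy. At the architecture level, we develop a sparse accelerator for DNN training, named SAT, that supports both regular dense operations and computation-efficient N:M sparse operations. At the dataflow level, we propose several optimizations, including interleave mapping, pre-generation of N:M sparse weights, and offline scheduling, to improve the computational efficiency of SAT. Finally, we evaluate the training scheme on a Xilinx VCU1525 FPGA card with various DNN models and datasets. Experimental results show that the SAT accelerator with the BDWP sparse training method at a 2:8 sparse ratio achieves an average speedup of 1.75x over dense training, with a negligible average accuracy loss. In addition, the proposed training scheme improves training throughput by 2.97~25.22x and energy efficiency by 1.36~3.58x over prior FPGA-based accelerators.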
https://arxiv.org/abs/2309.13015
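The abstract above relies on N:M fine-grained structured sparsity, where only N out of every M consecutive elements can be nonzero. The sketch below illustrates that pattern with a generic magnitude-based rule that keeps the N largest-magnitude weights in each group of M. This is not the BDWP method from the paper, whose details are not given in the abstract; the function name and the example weights are assumptions.

```python
def nm_prune(weights, n, m):
    """Zero all but the n largest-magnitude entries in each group of m consecutive weights."""
    if len(weights) % m != 0:
        raise ValueError("length of weights must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest magnitudes; ties resolve to the earlier index.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


# Hypothetical weight row pruned to a 2:4 pattern.
row = [0.1, -0.9, 0.3, 0.2, 0.5, 0.05, -0.6, 0.0]
print(nm_prune(row, n=2, m=4))
```

Because every group of M has the same number of survivors, hardware can store the kept values plus small per-group indices and skip the zeros with a regular access pattern. The 2:8 setting reported in the abstract corresponds to `n=2, m=8`.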
Large Language Models (LLMs) still struggle with complex reasoning tasks. Motivated by the society of minds (Minsky, 1988), we propose ReConcile, a multi-model multi-agent framework designed as a round table conference among diverse LLM agents to foster diverse thoughts and discussion for improved consensus. ReConcile enhances the reasoning capabilities of LLMs by holding multiple rounds of discussion, learning to convince other agents to improve their answers, and employing a confidence-weighted voting mechanism. In each round, ReConcile initiates discussion between agents via a 'discussion prompt' that consists of (a) grouped answers and explanations generated by each agent in the previous round, (b) their uncertainties, and (c) demonstrations of answer-rectifying human explanations, used for convincing other agents. This discussion prompt enables each agent to revise their responses in light of insights from other agents. Once a consensus is reached and the discussion ends, ReConcile determines the final answer by leveraging the confidence of each agent in a weighted voting scheme. We implement ReConcile with ChatGPT, Bard, and Claude2 as the three agents. Our experimental results on various benchmarks demonstrate that ReConcile significantly enhances the reasoning performance of the agents (both individually and as a team), surpassing prior single-agent and multi-agent baselines by 7.7% and also outperforming GPT-4 on some of these datasets. We also experiment with GPT-4 itself as one of the agents in ReConcile and demonstrate that its initial performance also improves by absolute 10.0% through discussion and feedback from other agents. Finally, we also analyze the accuracy after every round and observe that ReConcile achieves better and faster consensus between agents, compared to a multi-agent debate baseline. Our code is available at: this https URL
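Large Language Models (LLMs) still struggle with complex reasoning tasks. Inspired by the society of minds (Minsky, 1988), we propose ReConcile, a multi-model multi-agent framework designed as a round table conference among diverse LLM agents to foster diverse thoughts and discussion and thereby improve consensus. ReConcile strengthens the reasoning of LLMs by holding multiple rounds of discussion, learning to convince other agents to improve their answers, and using a confidence-weighted voting mechanism. In each round, ReConcile starts the discussion between agents with a "discussion prompt" consisting of (a) the grouped answers and explanations each agent produced in the previous round, (b) their uncertainties, and (c) demonstrations of answer-rectifying human explanations, used to convince other agents. This discussion prompt lets each agent revise its response in light of the other agents' insights. Once consensus is reached and the discussion ends, ReConcile determines the final answer through a voting scheme weighted by each agent's confidence. We implement ReConcile with ChatGPT, Bard, and Claude2 as the three agents. Experimental results on various benchmarks show that ReConcile significantly improves the agents' reasoning performance (both individually and as a team), surpassing prior single-agent and multi-agent baselines by 7.7% and outperforming GPT-4 on some of these datasets. We also experiment with GPT-4 itself as one of the agents in ReConcile and show that its initial performance improves by an absolute 10.0% through discussion and feedback from other agents. Finally, we analyze the accuracy after every round and observe that ReConcile reaches better and faster consensus between agents than a multi-agent debate baseline. Our code is available at: this https URL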
https://arxiv.org/abs/2309.13007
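The abstract above determines the final answer with a confidence-weighted voting scheme. A minimal sketch of such a vote is shown below: each agent contributes its self-reported confidence to the answer it supports, and the answer with the largest total wins. The function name, the agent answers, and the raw use of confidences as weights are illustrative assumptions; any confidence recalibration the authors may apply is not described in the abstract and is not modeled here.

```python
def weighted_vote(votes):
    """Return the answer with the highest summed confidence.

    votes is a list of (answer, confidence) pairs, one per agent.
    Ties go to the answer that appeared first.
    """
    totals = {}
    for answer, confidence in votes:
        totals[answer] = totals.get(answer, 0.0) + confidence
    return max(totals, key=totals.get)


# Hypothetical final-round outputs from three agents.
round_votes = [("A", 0.9), ("B", 0.6), ("B", 0.5)]
print(weighted_vote(round_votes))
```

Here two moderately confident agents outvote one highly confident agent (1.1 versus 0.9), which differs from picking the single most confident answer.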
Recognizing the prevalence of domain shift as a common challenge in machine learning, various domain generalization (DG) techniques have been developed to enhance the performance of machine learning systems when dealing with out-of-distribution (OOD) data. Furthermore, in real-world scenarios, data distributions can gradually change across a sequence of domains. While current methodologies primarily focus on improving model effectiveness within these new domains, they often overlook fairness issues throughout the learning process. In response, we introduce an innovative framework called Counterfactual Fairness-Aware Domain Generalization with Sequential Autoencoder (CDSAE). This approach effectively separates environmental information and sensitive attributes from the embedded representation of classification features. This concurrent separation not only greatly improves model generalization across diverse and unfamiliar domains but also effectively addresses challenges related to unfair classification. Our strategy is rooted in the principles of causal inference to tackle these dual issues. To examine the intricate relationship between semantic information, sensitive attributes, and environmental cues, we systematically categorize exogenous uncertainty factors into four latent variables: 1) semantic information influenced by sensitive attributes, 2) semantic information unaffected by sensitive attributes, 3) environmental cues influenced by sensitive attributes, and 4) environmental cues unaffected by sensitive attributes. By incorporating fairness regularization, we exclusively employ semantic information for classification purposes. Empirical validation on synthetic and real-world datasets substantiates the effectiveness of our approach, demonstrating improved accuracy levels while ensuring the preservation of fairness in the evolving landscape of continuous domains.
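Because domain shift is a common challenge in machine learning, various domain generalization (DG) techniques have been developed to improve the performance of machine learning systems on out-of-distribution (OOD) data. In real-world scenarios, moreover, data distributions can change gradually across a sequence of domains. Current methods focus mainly on improving model effectiveness in these new domains and often overlook fairness throughout the learning process. In response, we propose an innovative framework called Counterfactual Fairness-Aware Domain Generalization with Sequential Autoencoder (CDSAE). The approach effectively separates environmental information and sensitive attributes from the embedded representation of classification features. This simultaneous separation not only greatly improves model generalization across diverse and unfamiliar domains but also addresses the challenges of unfair classification. Our strategy is grounded in the principles of causal inference to tackle both issues. To study the relationship between semantic information, sensitive attributes, and environmental cues, we systematically categorize exogenous uncertainty factors into four latent variables: 1) semantic information influenced by sensitive attributes, 2) semantic information unaffected by sensitive attributes, 3) environmental cues influenced by sensitive attributes, and 4) environmental cues unaffected by sensitive attributes. By incorporating fairness regularization, we use only semantic information for classification. Empirical validation on synthetic and real-world datasets confirms the effectiveness of our approach, showing improved accuracy while preserving fairness as the continuous domains evolve.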
https://arxiv.org/abs/2309.13005
In machine translation, a common problem is that certain words, even when translated correctly, remain incomprehensible to the target-language audience because of different cultural backgrounds. One solution is to add explanations for these words. As a first step, we therefore need to identify these words or phrases. In this work, we explore techniques to extract example explanations from a parallel corpus. However, sentences containing words that need to be explained are sparse, which makes building the training dataset extremely difficult. We propose a semi-automatic technique to extract these explanations from a large parallel corpus. Experiments on the English->German language pair show that our method extracts a set of sentences of which more than 10% contain explanations, whereas only 1.9% of the original sentences do. Experiments on the English->French and English->Chinese language pairs lead to similar conclusions. This is therefore an essential first automatic step toward creating an explanation dataset. Furthermore, we show that the technique is robust for all three language pairs.
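In machine translation, a common problem is that certain words, even when translated, cannot be understood by the target-language audience because of differences in cultural background. One solution is to add explanations for these words or phrases. The first step is therefore to identify them. In this work, we explore techniques for extracting example explanations from a parallel corpus. However, sentences containing words that need explanation are rare, which makes building a training dataset very difficult. We propose a semi-automatic method for extracting these explanations from a large parallel corpus. Experiments on the English-to-German language pair show that our method extracts sentences of which more than 10% contain explanations, whereas only 1.9% of the original sentences do. Experiments on the English-to-French and English-to-Chinese language pairs lead to similar conclusions. This is therefore an essential first step toward creating an explanation dataset. We also show that the method is robust for all three language pairs.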
https://arxiv.org/abs/2309.12998
Detecting and recognizing text in images and videos captured by cameras is a highly challenging research problem. Despite certain advancements achieving high accuracy, current methods still require substantial improvements to be applicable in practical scenarios. Diverging from text detection in images/videos, this paper addresses the issue of text detection within license plates by amalgamating multiple frames of distinct perspectives. For each viewpoint, the proposed method extracts descriptive features characterizing the text components of the license plate, specifically corner points and area. Concretely, we present three viewpoints: view-1, view-2, and view-3, to identify the nearest neighboring components, facilitating the restoration of text components from the same license plate line based on estimations of similarity levels and distance metrics. Subsequently, we employ the CnOCR method for text recognition within license plates. Experimental results on the self-collected dataset (PTITPlates), comprising pairs of images in various scenarios, and the publicly available Stanford Cars Dataset, demonstrate the superiority of the proposed method over existing approaches.
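Detecting and recognizing text in images and videos captured by cameras is a highly challenging research problem. Although some methods have achieved high accuracy, current methods still need substantial improvement before they can be applied in practice. Unlike general text detection in images/videos, this paper addresses text detection in license plates by combining multiple frames taken from different perspectives. For each viewpoint, the method extracts descriptive features that characterize the text components of the license plate, specifically corner points and area. Concretely, we present three viewpoints, view-1, view-2, and view-3, to identify the nearest neighboring components, so that text components from the same license plate line can be restored based on estimates of similarity and distance metrics. We then use the CnOCR method for text recognition within license plates. Experimental results on a self-collected dataset (PTITPlates), which consists of pairs of images in various scenarios, and on the publicly available Stanford Cars Dataset demonstrate the superiority of the method over existing approaches.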
https://arxiv.org/abs/2309.12972
Despite the recent successes of vanilla Graph Neural Networks (GNNs) on many tasks, their foundation on pairwise interaction networks inherently limits their capacity to discern latent higher-order interactions in complex systems. To bridge this capability gap, we propose a novel approach exploiting the rich mathematical theory of simplicial complexes (SCs) - a robust tool for modeling higher-order interactions. Current SC-based GNNs are burdened by high complexity and rigidity, and quantifying higher-order interaction strengths remains challenging. Innovatively, we present a higher-order Flower-Petals (FP) model, incorporating FP Laplacians into SCs. Further, we introduce a Higher-order Graph Convolutional Network (HiGCN) grounded in FP Laplacians, capable of discerning intrinsic features across varying topological scales. By employing learnable graph filters, a parameter group within each FP Laplacian domain, we can identify diverse patterns where the filters' weights serve as a quantifiable measure of higher-order interaction strengths. The theoretical underpinnings of HiGCN's advanced expressiveness are rigorously demonstrated. Additionally, our empirical investigations reveal that the proposed model accomplishes state-of-the-art (SOTA) performance on a range of graph tasks and provides a scalable and flexible solution to explore higher-order interactions in graphs.
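Despite the recent success of vanilla Graph Neural Networks (GNNs) on many tasks, their foundation on pairwise interaction networks inherently limits their ability to discover latent higher-order interactions in complex systems. To bridge this capability gap, we propose a novel approach that exploits the rich mathematical theory of simplicial complexes (SCs), a powerful tool for modeling higher-order interactions. Current SC-based GNNs suffer from high complexity and rigidity, and quantifying the strength of higher-order interactions remains challenging. We present a higher-order Flower-Petals (FP) model that incorporates FP Laplacians into SCs. We further introduce a Higher-order Graph Convolutional Network (HiGCN) built on FP Laplacians, which can identify intrinsic features across different topological scales. By using learnable graph filters, a parameter group within each FP Laplacian domain, we can identify diverse patterns, with the filter weights serving as a quantifiable measure of higher-order interaction strength. The theoretical basis for HiGCN's advanced expressiveness is rigorously demonstrated. In addition, our empirical studies show that the proposed model achieves state-of-the-art performance on a range of graph tasks and offers a scalable and flexible way to explore higher-order interactions in graphs.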
https://arxiv.org/abs/2309.12971
The detailed images produced by Magnetic Resonance Imaging (MRI) provide life-critical information for the diagnosis and treatment of prostate cancer. To provide standardized acquisition, interpretation and usage of the complex MRI images, the PI-RADS v2 guideline was proposed. An automated segmentation following the guideline facilitates consistent and precise lesion detection, staging and treatment. The guideline recommends a division of the prostate into four zones, PZ (peripheral zone), TZ (transition zone), DPU (distal prostatic urethra) and AFS (anterior fibromuscular stroma). Not every zone shares a boundary with the others, nor is every zone present in every slice. Further, the representations captured by a single model might not suffice for all zones. This motivated us to design a dual-branch convolutional neural network (CNN), where each branch captures the representations of the connected zones separately. Further, the representations from different branches complement each other at the second stage of training, where they are fine-tuned through an unsupervised loss. The loss penalises the difference in predictions from the two branches for the same class. We also incorporate multi-task learning in our framework to further improve the segmentation accuracy. The proposed approach improves the segmentation accuracy of the baseline (mean absolute symmetric distance) by 7.56%, 11.00%, 58.43% and 19.67% for PZ, TZ, DPU and AFS zones respectively.
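The detailed images produced by Magnetic Resonance Imaging (MRI) provide life-critical information for the diagnosis and treatment of prostate cancer. The PI-RADS v2 guideline was proposed to standardize the acquisition, interpretation, and use of complex MRI images. Automated segmentation that follows the guideline supports consistent and precise lesion detection, staging, and treatment. The guideline recommends dividing the prostate into four zones: PZ (peripheral zone), TZ (transition zone), DPU (distal prostatic urethra), and AFS (anterior fibromuscular stroma). Not every zone shares a boundary with the others, and not every zone appears in every slice. Moreover, the representations captured by a single model may not suffice for all zones. This motivated us to design a dual-branch convolutional neural network (CNN) in which each branch separately captures the representations of connected zones. The representations from the two branches complement each other in the second stage of training, where they are fine-tuned with an unsupervised loss. This loss penalizes differences between the two branches' predictions for the same class. We also incorporate multi-task learning into the framework to further improve segmentation accuracy. The proposed approach improves the segmentation accuracy of the baseline (mean absolute symmetric distance) by 7.56%, 11.00%, 58.43%, and 19.67% for the PZ, TZ, DPU, and AFS zones, respectively.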
https://arxiv.org/abs/2309.12970
Open-set object detection aims at detecting arbitrary categories beyond those seen during training. Most recent advancements have adopted the open-vocabulary paradigm, utilizing vision-language backbones to represent categories with language. In this paper, we introduce DE-ViT, an open-set object detector that employs vision-only DINOv2 backbones and learns new categories through example images instead of language. To improve general detection ability, we transform multi-classification tasks into binary classification tasks while bypassing per-class inference, and propose a novel region propagation technique for localization. We evaluate DE-ViT on open-vocabulary, few-shot, and one-shot object detection benchmark with COCO and LVIS. For COCO, DE-ViT outperforms the open-vocabulary SoTA by 6.9 AP50 and achieves 50 AP50 in novel classes. DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot and one-shot SoTA by 2.8 AP50. For LVIS, DE-ViT outperforms the open-vocabulary SoTA by 2.2 mask AP and reaches 34.3 mask APr. Code is available at this https URL.
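Open-set object detection aims to detect arbitrary categories beyond those seen during training. Most recent advances adopt the open-vocabulary paradigm, using vision-language backbones to represent categories with language. In this paper, we introduce DE-ViT, an open-set object detector that uses vision-only DINOv2 backbones and learns new categories from example images rather than language. To improve general detection ability, we transform multi-classification tasks into binary classification tasks while bypassing per-class inference, and we propose a novel region propagation technique for localization. We evaluate DE-ViT on open-vocabulary, few-shot, and one-shot object detection benchmarks with COCO and LVIS. On COCO, DE-ViT outperforms the open-vocabulary SoTA by 6.9 AP50 and reaches 50 AP50 on novel classes. DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot, and the one-shot SoTA by 2.8 AP50. On LVIS, DE-ViT outperforms the open-vocabulary SoTA by 2.2 mask AP and reaches 34.3 mask APr. Code is available at this https URL.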
https://arxiv.org/abs/2309.12969
Nested Event Extraction (NEE) aims to extract complex event structures where an event contains other events as its arguments recursively. Nested events involve a kind of Pivot Elements (PEs) that simultaneously act as arguments of outer events and as triggers of inner events, and thus connect them into nested structures. This special characteristic of PEs brings challenges to existing NEE methods, as they cannot well cope with the dual identities of PEs. Therefore, this paper proposes a new model, called PerNee, which extracts nested events mainly based on recognizing PEs. Specifically, PerNee first recognizes the triggers of both inner and outer events and further recognizes the PEs via classifying the relation type between trigger pairs. In order to obtain better representations of triggers and arguments to further improve NEE performance, it incorporates the information of both event types and argument roles into PerNee through prompt learning. Since existing NEE datasets (e.g., Genia11) are limited to specific domains and contain a narrow range of event types with nested structures, we systematically categorize nested events in generic domain and construct a new NEE dataset, namely ACE2005-Nest. Experimental results demonstrate that PerNee consistently achieves state-of-the-art performance on ACE2005-Nest, Genia11 and Genia13.
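Nested Event Extraction (NEE) aims to extract complex event structures in which an event recursively contains other events as its arguments. Nested events involve a special kind of element called Pivot Elements (PEs), which act simultaneously as arguments of outer events and as triggers of inner events, thereby connecting them into nested structures. This special property of PEs challenges existing NEE methods, which cannot handle the dual identities of PEs well. This paper therefore proposes a new model, called PerNee, which extracts nested events mainly by recognizing PEs. Specifically, PerNee first recognizes the triggers of both inner and outer events and then recognizes PEs by classifying the relation type between trigger pairs. To obtain better representations of triggers and arguments and further improve NEE performance, it incorporates information about both event types and argument roles into PerNee through prompt learning. Because existing NEE datasets (e.g., Genia11) are limited to specific domains and contain only a narrow range of event types with nested structures, we systematically categorize nested events in the generic domain and construct a new NEE dataset, ACE2005-Nest. Experimental results show that PerNee consistently achieves state-of-the-art performance on ACE2005-Nest, Genia11, and Genia13.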
https://arxiv.org/abs/2309.12960
Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels. Recently, a new paradigm has emerged by generating a foreground prediction map (FPM) to achieve pixel-level localization. While existing FPM-based methods use cross-entropy to evaluate the foreground prediction map and to guide the learning of the generator, this paper presents two astonishing experimental observations on the object localization learning process: For a trained network, as the foreground mask expands, 1) the cross-entropy converges to zero when the foreground mask covers only part of the object region. 2) The activation value continuously increases until the foreground mask expands to the object boundary. Therefore, to achieve a more effective localization performance, we argue for the usage of activation value to learn more object regions. In this paper, we propose a Background Activation Suppression (BAS) method. Specifically, an Activation Map Constraint (AMC) module is designed to facilitate the learning of generator by suppressing the background activation value. Meanwhile, by using foreground region guidance and area constraint, BAS can learn the whole region of the object. In the inference phase, we consider the prediction maps of different categories together to obtain the final localization results. Extensive experiments show that BAS achieves significant and consistent improvement over the baseline methods on the CUB-200-2011 and ILSVRC datasets. In addition, our method also achieves state-of-the-art weakly supervised semantic segmentation performance on the PASCAL VOC 2012 and MS COCO 2014 datasets. Code and models are available at this https URL.
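Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels. Recently, a new paradigm has emerged that generates a foreground prediction map (FPM) to achieve pixel-level localization. While existing FPM-based methods use cross-entropy to evaluate the foreground prediction map and guide the learning of the generator, this paper presents two surprising experimental observations about the object localization learning process. For a trained network, as the foreground mask expands, 1) the cross-entropy converges to zero when the foreground mask covers only part of the object region, and 2) the activation value keeps increasing until the foreground mask expands to the object boundary. Therefore, to achieve more effective localization, we argue for using the activation value to learn more of the object region. In this paper, we propose a Background Activation Suppression (BAS) method. Specifically, an Activation Map Constraint (AMC) module is designed to facilitate generator learning by suppressing the background activation value. Meanwhile, by using foreground region guidance and an area constraint, BAS can learn the whole object region. In the inference phase, we consider the prediction maps of different categories together to obtain the final localization results. Extensive experiments show that BAS achieves significant and consistent improvements over baseline methods on the CUB-200-2011 and ILSVRC datasets. In addition, our method achieves state-of-the-art weakly supervised semantic segmentation performance on the PASCAL VOC 2012 and MS COCO 2014 datasets. Code and models are available at this https URL.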
https://arxiv.org/abs/2309.12943
Task-oriented dialogue (TOD) systems facilitate users in executing various activities via multi-turn dialogues, but Large Language Models (LLMs) often struggle to comprehend these intricate contexts. In this study, we propose a novel "Self-Explanation" prompting strategy to enhance the comprehension abilities of LLMs in multi-turn dialogues. This task-agnostic approach requires the model to analyze each dialogue utterance before task execution, thereby improving performance across various dialogue-centric tasks. Experimental results from six benchmark datasets confirm that our method consistently outperforms other zero-shot prompts and matches or exceeds the efficacy of few-shot prompts, demonstrating its potential as a powerful tool in enhancing LLMs' comprehension in complex dialogue tasks.
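Task-oriented dialogue (TOD) systems help users carry out various tasks through multi-turn dialogues, but Large Language Models (LLMs) often struggle to understand these complex contexts. In this study, we propose a novel "Self-Explanation" prompting strategy to strengthen the comprehension abilities of LLMs in multi-turn dialogues. This task-agnostic approach requires the model to analyze each dialogue utterance before executing the task, which improves performance across various dialogue-centric tasks. Experimental results on six benchmark datasets show that our method consistently outperforms other zero-shot prompts and matches or exceeds the effectiveness of few-shot prompts, indicating its potential as a powerful tool for improving LLMs' comprehension in complex dialogue tasks.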
https://arxiv.org/abs/2309.12940
As software projects progress, quality of code assumes paramount importance as it affects reliability, maintainability and security of software. For this reason, static analysis tools are used in developer workflows to flag code quality issues. However, developers need to spend extra effort to revise their code to improve code quality based on the tool findings. In this work, we investigate the use of (instruction-following) large language models (LLMs) to assist developers in revising code to resolve code quality issues. We present a tool, CORE (short for COde REvisions), architected using a pair of LLMs organized as a duo comprised of a proposer and a ranker. Providers of static analysis tools recommend ways to mitigate the tool warnings and developers follow them to revise their code. The proposer LLM of CORE takes the same set of recommendations and applies them to generate candidate code revisions. The candidates which pass the static quality checks are retained. However, the LLM may introduce subtle, unintended functionality changes which may go undetected by the static analysis. The ranker LLM evaluates the changes made by the proposer using a rubric that closely follows the acceptance criteria that a developer would enforce. CORE uses the scores assigned by the ranker LLM to rank the candidate revisions before presenting them to the developer. CORE could revise 59.2% of Python files (across 52 quality checks) so that they pass scrutiny by both a tool and a human reviewer. The ranker LLM is able to reduce false positives by 25.8% in these cases. CORE produced revisions that passed the static analysis tool in 76.8% of Java files (across 10 quality checks), comparable to the 78.3% of a specialized program repair tool, with significantly less engineering effort.
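As software projects progress, code quality becomes increasingly important because it affects the reliability, maintainability, and security of software. Static analysis tools are therefore widely used in developer workflows to detect code quality issues. However, developers must spend extra effort modifying their code to improve its quality based on the tool findings. In this study, we investigate using (instruction-following) large language models (LLMs) to help developers revise code to resolve code quality issues. We present a tool called CORE (short for COde REvisions), built from a pair of LLMs: a proposer and a ranker. Providers of static analysis tools recommend ways to mitigate tool warnings, and developers follow these recommendations to revise their code. The proposer LLM of CORE takes the same recommendations and generates candidate code revisions. Candidates that pass the static quality checks are retained. However, the LLM may introduce subtle, unintended functionality changes that static analysis may not detect. The ranker LLM evaluates the proposer's changes using a rubric that closely follows the acceptance criteria a developer would apply. CORE uses the scores assigned by the ranker LLM to rank the candidate revisions before presenting them to the developer. CORE can revise 59.2% of Python files (across 52 quality checks) so that they pass review by both a tool and a human reviewer. In these cases, the ranker LLM reduces false positives by 25.8%. CORE produced revisions that passed the static analysis tool in 76.8% of Java files (across 10 quality checks), comparable to the 78.3% of a specialized program repair tool, with significantly less engineering effort.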
https://arxiv.org/abs/2309.12938
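The CORE abstract describes a three-stage flow: propose candidate revisions, keep those that pass the static checks, then rank the survivors. The sketch below illustrates only that control flow. The three callables are hypothetical stand-ins (the real system uses LLM calls and a real static analysis tool), and the toy check and scoring rule are assumptions made for the example.

```python
def revise(source, recommendation, propose, passes_static_checks, score,
           n_candidates=3):
    """Propose, filter, and rank revisions; returns (score, candidate) pairs."""
    candidates = propose(source, recommendation, n_candidates)
    # Keep only the candidates that pass the static quality check.
    kept = [c for c in candidates if passes_static_checks(c)]
    # Rank the surviving candidates, highest score first.
    scored = [(score(source, c), c) for c in kept]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# --- Hypothetical stand-ins for the proposer LLM, the tool, and the ranker ---
def toy_propose(source, recommendation, n):
    return [
        "if x is None:\n    return 1\n",   # fixes the warning, changes behavior
        "if x is None:\n    return 0\n",   # fixes the warning only
        "if x == None:\n    return 0\n",   # does not fix the warning
    ][:n]


def toy_static_check(code):
    # Toy rule: flag comparisons to None that use '=='.
    return "== None" not in code


def toy_score(source, candidate):
    # Toy ranker: penalize every changed line, so smaller edits rank higher.
    changed = sum(a != b for a, b in
                  zip(source.splitlines(), candidate.splitlines()))
    return -changed


source = "if x == None:\n    return 0\n"
ranked = revise(source, "Use 'is None' instead of '== None'.",
                toy_propose, toy_static_check, toy_score)
for s, candidate in ranked:
    print(s, repr(candidate))
```

In this toy run the candidate that also changes the return value passes the static check but ranks below the minimal fix, which mirrors the role the abstract assigns to the ranker: catching functionality changes that the static analysis does not detect.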
Recent advances in Large Language Models (LLMs) have enabled the generation of open-ended, high-quality texts that are non-trivial to distinguish from human-written texts. We refer to such LLM-generated texts as deepfake texts. There are currently over 11K text generation models in the Hugging Face model repository. As such, users with malicious intent can easily use these open-source LLMs to generate harmful texts and misinformation at scale. To mitigate this problem, a computational method to determine whether a given text is a deepfake text is desired, i.e., a Turing Test (TT). In particular, in this work, we investigate the more general version of the problem, known as Authorship Attribution (AA), in a multi-class setting: not only determining whether a given text is a deepfake text but also pinpointing which LLM is the author. We propose TopRoBERTa to improve existing AA solutions by capturing more linguistic patterns in deepfake texts through a Topological Data Analysis (TDA) layer included in the RoBERTa model. We show the benefits of having a TDA layer when dealing with noisy, imbalanced, and heterogeneous datasets by extracting TDA features from the reshaped pooled_output of RoBERTa as input. We use RoBERTa to capture contextual representations (i.e., semantic and syntactic linguistic features), while using TDA to capture the shape and structure of data (i.e., linguistic structures). Finally, TopRoBERTa outperforms the vanilla RoBERTa on 2 of 3 datasets, achieving up to a 7% increase in Macro F1 score.
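Recent progress in large language models (LLMs) has made it possible to generate open-ended, high-quality texts that are hard to tell apart from human-written texts. We call these generated texts deepfake texts. The Hugging Face model repository currently holds more than 11K text generation models. Users with malicious intent can therefore easily use these open-source LLMs to generate harmful texts and misinformation at scale. To address this problem, we want a computational method that determines whether a given text is a deepfake text, that is, a Turing Test (TT). In particular, in this paper we study the more general problem, known as Authorship Attribution (AA), in a multi-class setting: not only determining whether a given text is a deepfake text, but also identifying which LLM is the author. We propose TopRoBERTa to improve existing AA solutions by introducing a Topological Data Analysis (TDA) layer into the RoBERTa model to capture more linguistic patterns in deepfake texts. We show the benefits of a TDA layer for handling noisy, imbalanced, and heterogeneous datasets, extracting TDA features from the reshaped pooled_output of RoBERTa as input. We use RoBERTa to capture contextual representations (i.e., semantic and syntactic linguistic features) and TDA to capture the shape and structure of data (i.e., linguistic structures). Finally, TopRoBERTa outperforms the vanilla RoBERTa on 2 of 3 datasets, with an increase of up to 7% in Macro F1 score.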
https://arxiv.org/abs/2309.12934
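The TopRoBERTa abstract mentions extracting TDA features from a reshaped pooled output. As a rough illustration of what a topological feature of a reshaped vector can look like, the sketch below computes 0-dimensional persistence death times of a small point cloud. It is not the paper's TDA layer: the reshape into 2-dimensional points, the choice of 0-dimensional persistence, and the toy 8-value vector (RoBERTa's real pooled output is much longer) are all assumptions made for the example.

```python
import math


def reshape(vector, n_cols):
    """Split a flat vector into rows of n_cols values (a toy point cloud)."""
    return [vector[k:k + n_cols] for k in range(0, len(vector), n_cols)]


def h0_death_times(points):
    """Death times of 0-dimensional persistence classes.

    Every point is born as its own component at scale 0. Processing edges in
    order of length (Kruskal's algorithm), each merge of two components ends
    one class; the edge length is recorded as its death time. Convention
    assumed here: an edge enters the filtration at its full length.
    """
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(length)
    return deaths


# Toy stand-in for a pooled output vector (values are made up).
pooled = [0.0, 0.0, 1.0, 0.0, 0.0, 3.0, 5.0, 3.0]
cloud = reshape(pooled, 2)          # 4 points in 2 dimensions
deaths = h0_death_times(cloud)
print(deaths)
```

The resulting list has one entry fewer than the number of points (the last surviving component never dies) and could be appended to the contextual features before classification, under the assumptions stated above.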
Self-supervised training methods for transformers have demonstrated remarkable performance across various domains. Previous transformer-based models, such as masked autoencoders (MAE), typically utilize a single normalization layer for both the [CLS] symbol and the tokens. In this paper, we propose a simple modification that employs separate normalization layers for the tokens and the [CLS] symbol to better capture their distinct characteristics and enhance downstream task performance. Our method aims to alleviate the potential negative effects of using the same normalization statistics for both token types, which may not be optimally aligned with their individual roles. We empirically show that by utilizing a separate normalization layer, the [CLS] embeddings can better encode the global contextual information and are distributed more uniformly in their anisotropic space. When replacing the conventional normalization layer with the two separate layers, we observe an average 2.7% performance improvement across the image, natural language, and graph domains.
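Self-supervised training methods for transformers have shown excellent performance in many domains. Earlier transformer-based models, such as masked autoencoders (MAE), usually apply a single normalization layer to both the [CLS] symbol and the tokens. In this paper, we propose a simple modification that uses separate normalization layers for the tokens and the [CLS] symbol, to better capture their distinct characteristics and improve downstream task performance. Our method aims to reduce the potential negative effects of applying the same normalization statistics to both token types, since those statistics may not match their individual roles well. We show empirically that, with a separate normalization layer, the [CLS] embeddings encode global contextual information better and are distributed more uniformly in their anisotropic space. When the conventional normalization layer is replaced with the two separate layers, we observe an average performance improvement of 2.7% across the image, natural language, and graph domains.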
https://arxiv.org/abs/2309.12931
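The separate-normalization idea above can be sketched in a few lines. This is a minimal illustration, assuming a batch-style normalization in which statistics are computed per feature within each group and assuming the [CLS] embedding sits at position 0 of every sequence. Learned scale and shift parameters are omitted, and the abstract does not say which normalization type the authors use.

```python
import math


def normalize_group(vectors, eps=1e-5):
    """Standardize each feature using the statistics of this group only."""
    n, dim = len(vectors), len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dim)]
    variances = [sum((v[d] - means[d]) ** 2 for v in vectors) / n
                 for d in range(dim)]
    return [
        [(v[d] - means[d]) / math.sqrt(variances[d] + eps) for d in range(dim)]
        for v in vectors
    ]


def separate_norm(batch):
    """Normalize [CLS] embeddings and ordinary tokens with separate statistics.

    batch: list of sequences; each sequence is a list of feature vectors and
    position 0 is assumed to hold the [CLS] embedding.
    """
    cls_out = normalize_group([seq[0] for seq in batch])
    tok_out = normalize_group([tok for seq in batch for tok in seq[1:]])
    out, k = [], 0
    for i, seq in enumerate(batch):
        n_tok = len(seq) - 1
        # Reassemble each sequence: its [CLS] vector, then its own tokens.
        out.append([cls_out[i]] + tok_out[k:k + n_tok])
        k += n_tok
    return out


# Toy batch: [CLS] vectors live on a very different scale from the tokens.
batch = [
    [[10.0, 20.0], [0.1, 0.2], [0.3, 0.4]],
    [[12.0, 24.0], [0.5, 0.6], [0.7, 0.8]],
]
normalized = separate_norm(batch)
print(normalized[0][0], normalized[1][0])
```

With a single shared set of statistics, the large [CLS] values in this toy batch would dominate the mean and variance applied to the tokens; computing the two sets of statistics separately avoids that coupling.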