Learning for robot navigation is a critical and challenging task. The scarcity and cost of real-world datasets necessitate efficient learning approaches. In this letter, we exploit Euclidean symmetry in planning for 2D navigation, which originates from Euclidean transformations between reference frames and enables parameter sharing. To address the challenges of unstructured environments, we formulate the navigation problem as planning on a geometric graph and develop an equivariant message passing network to perform value iteration. Furthermore, to handle multi-camera input, we propose a learnable equivariant layer to lift features to a desired space. We conduct comprehensive evaluations across five diverse tasks encompassing structured and unstructured environments, known and unknown maps, and point goals or semantic goals. Our experiments confirm substantial benefits in training efficiency, stability, and generalization.
学习机器人导航是一项关键且具有挑战性的任务。现实世界数据稀缺且昂贵,需要高效的学习方法。在这个信中,我们利用欧几里得对称性在规划2D导航时利用,该对称性源自参考框架之间的欧几里得变换,并实现了参数共享。为了解决无结构环境的挑战,我们将导航问题转化为几何图的规划中,并开发了一种等变消息传递网络,以进行价值迭代。此外,为了处理多摄像头输入,我们提出了一个可学习等变层,将特征提高到我们希望的空间。我们涵盖了结构化和无结构环境的五类不同任务,以及已知和未知的地图,并给定点目标或语义目标。我们的实验证实,训练效率、稳定性和泛化等方面获得了实质性的好处。
https://arxiv.org/abs/2309.13043
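As a point of reference for the entry above, the following is a minimal sketch of plain value iteration on a graph. It is not the equivariant message passing network from the paper; the graph, the step reward, the discount factor, and the function names are all illustrative assumptions, and every non-goal node is assumed to have at least one neighbour.

```python
def value_iteration(edges, goal, gamma=0.95, step_reward=-1.0, iters=100):
    """Plain (non-equivariant) value iteration on a graph.

    edges: dict mapping each node to a list of neighbouring nodes.
    goal:  absorbing goal node, whose value is fixed at 0.
    Assumes every non-goal node has at least one neighbour.
    """
    values = {node: 0.0 for node in edges}
    for _ in range(iters):
        new_values = {}
        for node, neighbours in edges.items():
            if node == goal:
                new_values[node] = 0.0
                continue
            # Bellman backup: best neighbour value after paying the step cost.
            new_values[node] = max(step_reward + gamma * values[n] for n in neighbours)
        values = new_values
    return values


def greedy_next(edges, values, node):
    """Pick the neighbour with the highest value (the greedy policy)."""
    return max(edges[node], key=lambda n: values[n])


# Hypothetical toy graph: "b" is one hop from the goal, "c" is two hops away.
graph = {
    "a": ["b", "c"],
    "b": ["a", "goal"],
    "c": ["a", "d"],
    "d": ["c", "goal"],
    "goal": [],
}
graph_values = value_iteration(graph, "goal")
```

In this toy graph the greedy policy at "a" moves to "b", because "b" reaches the goal in one step while "c" needs two.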
We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a useful dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and perform a single round of diffusion process to generate multiple instances simultaneously, conditioning on different text prompts. Second, we obtain corresponding instance masks by aggregating cross-attention maps associated with object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement processing. Without bells and whistles, our MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models, especially for rare and novel categories. Code will be released at this https URL.
我们提出了MosaicFusion,一种简单但有效的扩散based数据增强方法,用于大规模词汇实例分割。我们的方法不需要训练,并依赖于任何标签监督。两个关键设计使我们可以使用现成的文本到图像扩散模型,作为对象实例和 mask 注释有用的数据生成器。首先,我们将图像 canvas 分为多个区域,并执行一次扩散过程,以同时生成多个实例,根据不同的文本提示条件进行训练。其次,我们获得相应的实例掩膜,通过汇总跨层和扩散时间步的对象提示相关的交叉注意力地图,然后简单的阈值和边缘aware 优化处理。在没有花哨的功能的情况下,MosaicFusion 可以生成大量的合成标注数据,特别是对于稀有和新颖类别。在挑战性的LVIS长长尾和开放词汇基准测试中,实验结果证明MosaicFusion可以显著改进现有的实例分割模型的性能,特别是稀有和新颖类别。代码将在本httpsURL上发布。
https://arxiv.org/abs/2309.13042
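The mask extraction step in the entry above (aggregate cross-attention maps, then threshold) can be illustrated with a small sketch. It assumes the maps have already been resized to a common resolution, uses a simple mean as the aggregation, and omits the edge-aware refinement; the function name and the 0.5 threshold are hypothetical and not taken from the paper.

```python
def aggregate_and_threshold(attention_maps, threshold=0.5):
    """Average same-sized 2D maps, min-max normalise, then binarise.

    attention_maps: list of 2D lists (rows x cols), one per layer/time step.
    Returns a 2D list of 0/1 mask values.
    """
    rows, cols = len(attention_maps[0]), len(attention_maps[0][0])
    count = len(attention_maps)
    # Mean over all maps at each pixel.
    mean = [
        [sum(m[r][c] for m in attention_maps) / count for c in range(cols)]
        for r in range(rows)
    ]
    lo = min(min(row) for row in mean)
    hi = max(max(row) for row in mean)
    scale = (hi - lo) or 1.0  # avoid division by zero for constant maps
    return [[1 if (v - lo) / scale > threshold else 0 for v in row] for row in mean]


# Two hypothetical 2x2 attention maps for the same object prompt.
example_maps = [
    [[0.0, 0.2], [0.8, 1.0]],
    [[0.0, 0.4], [0.6, 1.0]],
]
example_mask = aggregate_and_threshold(example_maps)
```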
Pre-training on Internet data has proven to be a key ingredient for broad generalization in many modern ML systems. What would it take to enable such capabilities in robotic reinforcement learning (RL)? Offline RL methods, which learn from datasets of robot experience, offer one way to leverage prior data into the robotic learning pipeline. However, these methods have a "type mismatch" with video data (such as Ego4D), the largest prior datasets available for robotics, since video offers observation-only experience without the action or reward annotations needed for RL methods. In this paper, we develop a system for leveraging large-scale human video datasets in robotic offline RL, based entirely on learning value functions via temporal-difference learning. We show that value learning on video datasets learns representations that are more conducive to downstream robotic offline RL than other approaches for learning from video data. Our system, called V-PTR, combines the benefits of pre-training on video data with robotic offline RL approaches that train on diverse robot data, resulting in value functions and policies for manipulation tasks that perform better, act robustly, and generalize broadly. On several manipulation tasks on a real WidowX robot, our framework produces policies that greatly improve over prior methods. Our video and additional details can be found at this https URL
在互联网数据上进行预训练已经成为许多现代机器学习系统的广泛泛化的关键要素。要在这些机器人强化学习(RL)系统中实现这种能力,需要什么步骤? offline RL方法通过学习机器人经验的数据集来提供一种利用先前数据进入机器人学习管道的方法。然而,这些方法与视频数据(例如 Ego4D)存在“类型不匹配”,因为它们仅提供观察体验,而缺乏 RL 方法所需的行动或奖励注释。在本文中,我们开发了一种系统,旨在利用大规模人类视频数据集在机器人 offline RL 中利用,完全基于基于时间差学习的值函数学习。我们表明,在视频数据上进行值函数学习学习表示,这些表示对于机器人后端 offline RL 方法来说更加有利于下游,而其他从视频数据学习的方法则不如这些方法。我们的系统称为 V-PTR,将预训练视频数据和机器人离线 RL 方法的优点相结合,训练 diverse 机器人数据的数据集,从而产生更好的操作值函数和政策,表现更好、行为更稳健,并广泛泛化。在真实的 WidowX 机器人的几个操作任务中,我们的框架产生的政策比先前方法显著提高。我们的视频和其他详细信息可以在 this https URL 中找到。
https://arxiv.org/abs/2309.13041
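The entry above relies on temporal-difference learning of value functions. As a reminder of the basic mechanism only, here is a tabular TD(0) update. V-PTR itself uses learned neural value functions on video and robot data; the tabular setting, the learning rate, and the state names below are illustrative assumptions.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular TD(0) update in place and return the TD error.

    values: dict mapping state -> current value estimate (missing states count as 0).
    """
    current = values.get(state, 0.0)
    # Bootstrapped target: immediate reward plus discounted next-state value.
    target = reward + gamma * values.get(next_state, 0.0)
    td_error = target - current
    values[state] = current + alpha * td_error
    return td_error


# Hypothetical single transition s0 -> s1 with reward 1.
table = {}
first_error = td0_update(table, "s0", 1.0, "s1")
```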
Neural radiance fields (NeRF) have revolutionized the field of image-based view synthesis. However, NeRF uses straight rays and fails to deal with complicated light path changes caused by refraction and reflection. This prevents NeRF from successfully synthesizing transparent or specular objects, which are ubiquitous in real-world robotics and AR/VR applications. In this paper, we introduce the refractive-reflective field. Taking the object silhouette as input, we first utilize marching tetrahedra with a progressive encoding to reconstruct the geometry of non-Lambertian objects and then model refraction and reflection effects of the object in a unified framework using Fresnel terms. Meanwhile, to achieve efficient and effective anti-aliasing, we propose a virtual cone supersampling technique. We benchmark our method on different shapes, backgrounds and Fresnel terms on both real-world and synthetic datasets. We also qualitatively and quantitatively benchmark the rendering results of various editing applications, including material editing, object replacement/insertion, and environment illumination estimation. Code and data are publicly available at this https URL.
神经网络辐射场(NeRF)已经彻底改变了基于图像视图合成的领域。然而,NeRF使用直线光线,并无法处理由折射和反射引起的复杂的光路径变化。这导致NeRF无法成功合成透明或闪耀的物体,它们在现实世界机器人和虚拟现实应用中无处不在。在本文中,我们介绍了折射反射域。将物体轮廓作为输入,我们首先使用逐步编码的立方体重构非Lambertian物体的几何形状,然后使用费斯涅尔术语在一个统一框架中模型物体的折射和反射效果。同时,为了高效且有效地减少失真,我们提出了一个虚拟锥超采样技术。我们在不同的形状、背景和费斯涅尔术语的现实世界和合成数据集上对我们的算法进行了基准测试。我们还定性和定量基准了各种编辑应用程序的渲染结果,包括材料编辑、物体替换/插入和环境照明估计。代码和数据在这个httpsURL上公开可用。
https://arxiv.org/abs/2309.13039
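The entry above models refraction and reflection with Fresnel terms. As background, the sketch below evaluates the standard Fresnel reflectance for unpolarized light at an interface between two dielectrics, using Snell's law for the transmitted angle. How the paper integrates these terms into its rendering pipeline is not stated in the abstract; the function name and the example refractive indices are assumptions.

```python
import math


def fresnel_reflectance(cos_i, n1, n2):
    """Fraction of unpolarized light reflected at a dielectric interface.

    cos_i: cosine of the incidence angle (between ray and surface normal).
    n1, n2: refractive indices of the incident and transmitting media.
    """
    cos_i = min(1.0, max(0.0, cos_i))
    # Snell's law: n1 * sin(theta_i) = n2 * sin(theta_t).
    sin_t_sq = (n1 / n2) ** 2 * (1.0 - cos_i ** 2)
    if sin_t_sq > 1.0:
        return 1.0  # total internal reflection
    cos_t = math.sqrt(1.0 - sin_t_sq)
    # Amplitude coefficients for s- and p-polarized light.
    r_s = (n1 * cos_i - n2 * cos_t) / (n1 * cos_i + n2 * cos_t)
    r_p = (n1 * cos_t - n2 * cos_i) / (n1 * cos_t + n2 * cos_i)
    return 0.5 * (r_s ** 2 + r_p ** 2)


# Air to glass (n = 1.5) at normal incidence reflects about 4% of the light.
normal_incidence = fresnel_reflectance(1.0, 1.0, 1.5)
```

The remaining fraction, one minus the reflectance, is the share of energy carried by the refracted ray.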
Hand-crafted image quality metrics, such as PSNR and SSIM, are commonly used to evaluate model privacy risk under reconstruction attacks. Under these metrics, reconstructed images that are determined to resemble the original one generally indicate more privacy leakage. Images determined as overall dissimilar, on the other hand, indicate higher robustness against attack. However, there is no guarantee that these metrics well reflect human opinions, which, as a judgement for model privacy leakage, are more trustworthy. In this paper, we comprehensively study the faithfulness of these hand-crafted metrics to human perception of privacy information from the reconstructed images. On 5 datasets ranging from natural images, faces, to fine-grained classes, we use 4 existing attack methods to reconstruct images from many different classification models and, for each reconstructed image, we ask multiple human annotators to assess whether this image is recognizable. Our studies reveal that the hand-crafted metrics only have a weak correlation with the human evaluation of privacy leakage and that even these metrics themselves often contradict each other. These observations suggest risks of current metrics in the community. To address this potential risk, we propose a learning-based measure called SemSim to evaluate the Semantic Similarity between the original and reconstructed images. SemSim is trained with a standard triplet loss, using an original image as an anchor, one of its recognizable reconstructed images as a positive sample, and an unrecognizable one as a negative. By training on human annotations, SemSim exhibits a greater reflection of privacy leakage on the semantic level. We show that SemSim has a significantly higher correlation with human judgment compared with existing metrics. Moreover, this strong correlation generalizes to unseen datasets, models and attack methods.
人工制作的图像质量指标,例如PSNR和SSIM,在重建攻击下通常用于评估模型隐私风险。在这些指标下,确定的重构图像通常表示更多的隐私泄露。另一方面,确定的整然差异图像则表示更强的抵御攻击能力。然而,没有保证这些指标很好地反映了人类的意见,作为模型隐私泄露的判断,它们更加可靠。在本文中,我们全面研究了这些人工制作的指标对人类对重构图像的隐私信息感知的准确性的符合性。在5个数据集,包括自然图像、人脸和精细类别,我们使用4个现有的攻击方法从多个分类模型中重构图像,并为每个重构图像询问多个人类标注者是否可识别。我们的研究表明,人工制作的指标仅与人类评估隐私泄露的微弱相关,甚至这些指标本身也常常互相矛盾。这些观察暗示了社区当前指标的风险。为了应对这些潜在风险,我们提出了一种基于学习的指标,称为SemSim,以评估原始和重构图像语义相似性。SemSim使用标准三因素损失进行训练,使用原始图像作为参考,其中一个可识别的重构图像作为正样本,一个不可识别的重构图像作为负样本。通过训练人类标注,SemSim表现出在语义层面上更多的隐私泄露反映。我们表明,SemSim与人类判断的相关性比现有的指标高得多。此外,这种强相关性可以扩展到未观测的数据集、模型和攻击方法。
https://arxiv.org/abs/2309.13038
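The entry above trains SemSim with a standard triplet loss (original image as anchor, recognizable reconstruction as positive, unrecognizable reconstruction as negative). A minimal sketch of that loss on plain feature vectors follows. The Euclidean distance, the margin of 1.0, and the toy embeddings are assumptions; the abstract does not specify them.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the positive closer than the negative by a margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical 2D embeddings: the positive is near the anchor, the negative is far.
anchor_vec = [0.0, 0.0]
positive_vec = [0.0, 1.0]
negative_vec = [3.0, 4.0]
well_separated = triplet_loss(anchor_vec, positive_vec, negative_vec)
violated = triplet_loss(anchor_vec, negative_vec, positive_vec)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, and positive otherwise.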
Imitation learning from human demonstrations is a powerful framework to teach robots new skills. However, the performance of the learned policies is bottlenecked by the quality, scale, and variety of the demonstration data. In this paper, we aim to lower the barrier to collecting large and high-quality human demonstration data by proposing GELLO, a general framework for building low-cost and intuitive teleoperation systems for robotic manipulation. Given a target robot arm, we build a GELLO controller that has the same kinematic structure as the target arm, leveraging 3D-printed parts and off-the-shelf motors. GELLO is easy to build and intuitive to use. Through an extensive user study, we show that GELLO enables more reliable and efficient demonstration collection compared to commonly used teleoperation devices in the imitation learning literature such as VR controllers and 3D spacemouses. We further demonstrate the capabilities of GELLO for performing complex bi-manual and contact-rich manipulation tasks. To make GELLO accessible to everyone, we have designed and built GELLO systems for 3 commonly used robotic arms: Franka, UR5, and xArm. All software and hardware are open-sourced and can be found on our website: this https URL.
模仿人类演示是教授机器人新技能的强大框架。然而,所学策略的性能受到演示数据质量、规模和多样性的 bottleneck。在本文中,我们旨在通过提出GELLO,一个用于构建低成本、直觉的机器人操纵系统的一般框架,降低收集大规模高质量人类演示数据的障碍。给定目标机器人手,我们构建一个GELLO控制器,其 kinematic structure与目标手相同,利用3D打印部件和公版电机。GELLO 易于构建和使用。通过广泛的用户研究,我们表明GELLO比模仿学习文献中常用的远程控制设备(如虚拟现实控制器和3D空间鼠标)更加可靠和高效地收集演示数据。我们还展示了GELLO用于执行复杂的双手动量和丰富接触操纵任务的能力。为了让更多人能够使用GELLO,我们设计和构建了用于3种常用机器人手:Franka,UR5,和xArm的GELLO系统。所有软件和硬件都是开源的,可以在我们的网站上找到:这个 https URL。
https://arxiv.org/abs/2309.13037
PyPose is an open-source library for robot learning. It combines a learning-based approach with physics-based optimization, which enables seamless end-to-end robot learning. It has been used in many tasks due to its meticulously designed application programming interface (API) and efficient implementation. Since its initial launch in early 2022, PyPose has been significantly enhanced, incorporating a wide variety of new features into its platform. To satisfy the growing demand for understanding and utilizing the library and to flatten the learning curve for new users, we present the fundamental design principle of the imperative programming interface, and showcase the flexible usage of diverse functionalities and modules using an extremely simple Dubins car example. We also demonstrate that PyPose can be easily used to navigate a real quadruped robot with a few lines of code.
PyPose是一个开源机器人学习库,它结合了基于学习的方法与基于物理学的优化方法,实现了无缝的机器人全过程中的学习。由于其精心设计的应用编程接口(API)和高效的实现,PyPose在多个任务中被广泛应用。自2022年初首次推出以来,PyPose经历了显著的改进,将其平台包含了一系列丰富的新特性。为了满足不断增长的理解和利用库的需求,并降低新用户的学习曲线,我们提出了 imperative编程接口的基本设计原则,并通过一个简单的Dubins汽车例子展示了各种功能和模块的灵活使用。我们还证明了PyPose可以轻松地用于导航一个真实的四足机器人,只需要几行代码。
https://arxiv.org/abs/2309.13035
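The entry above uses a Dubins car as its running example. For readers unfamiliar with the model, the sketch below integrates the standard Dubins car kinematics with a forward Euler step in plain Python. It deliberately does not use the PyPose API; the function name, the time step, and the heading wrap-around are illustrative assumptions.

```python
import math


def dubins_step(x, y, theta, v, omega, dt):
    """Advance a Dubins car by one forward Euler step.

    x, y:  position; theta: heading in radians.
    v:     forward speed; omega: turn rate; dt: time step.
    """
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    # Wrap the heading into [-pi, pi).
    theta = (theta + omega * dt + math.pi) % (2.0 * math.pi) - math.pi
    return x, y, theta


# Drive straight along the x-axis for half a second at unit speed.
straight = dubins_step(0.0, 0.0, 0.0, v=1.0, omega=0.0, dt=0.5)
```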
Conformers have recently been proposed as a promising modelling approach for automatic speech recognition (ASR), outperforming recurrent neural network-based approaches and transformers. Nevertheless, in general, the performance of these end-to-end models, especially attention-based models, is particularly degraded in the case of long utterances. To address this limitation, we propose adding a fully-differentiable memory-augmented neural network between the encoder and decoder of a conformer. This external memory can enrich the generalization for longer utterances since it allows the system to store and retrieve more information recurrently. Notably, we explore the neural Turing machine (NTM) that results in our proposed Conformer-NTM model architecture for ASR. Experimental results using Librispeech train-clean-100 and train-960 sets show that the proposed system outperforms the baseline conformer without memory for long utterances.
Conformers 最近被提出作为自动语音识别(ASR)的一种有前途的建模方法,比基于循环神经网络的方法和变压器表现更好。然而,总的来说,这些端到端模型,特别是基于注意力的模型,在较长的发言中表现特别差。为了解决这个问题,我们建议在一个 conformer 的编码器和解码器之间添加一个全变分的增强记忆神经网络。这个外部记忆可以丰富对更长发言的泛化,因为它允许系统多次存储和检索更多的信息。值得注意的是,我们探索了导致我们提出的 conformer-NTM 模型架构的神经网络 Turing 机器(NTM)。使用 LibriSpeech 训练- clean-100 和训练-960 集的实验结果表明, proposed 系统在较长的发言中比无记忆的基础 conformer 表现更好。
https://arxiv.org/abs/2309.13029
Precise crop yield prediction is essential for improving agricultural practices and ensuring crop resilience in varying climates. Integrating weather data across the growing season, especially for different crop varieties, is crucial for understanding their adaptability in the face of climate change. In the MLCAS2021 Crop Yield Prediction Challenge, we utilized a dataset comprising 93,028 training records to forecast yields for 10,337 test records, covering 159 locations across 28 U.S. states and Canadian provinces over 13 years (2003-2015). This dataset included details on 5,838 distinct genotypes and daily weather data for a 214-day growing season, enabling comprehensive analysis. As one of the winning teams, we developed two novel convolutional neural network (CNN) architectures: the CNN-DNN model, combining CNN and fully-connected networks, and the CNN-LSTM-DNN model, with an added LSTM layer for weather variables. Leveraging the Generalized Ensemble Method (GEM), we determined optimal model weights, resulting in superior performance compared to baseline models. The GEM model achieved lower RMSE (5.55% to 39.88%), reduced MAE (5.34% to 43.76%), and higher correlation coefficients (1.1% to 10.79%) when evaluated on test data. We applied the CNN-DNN model to identify top-performing genotypes for various locations and weather conditions, aiding genotype selection based on weather variables. Our data-driven approach is valuable for scenarios with limited testing years. Additionally, a feature importance analysis using RMSE change highlighted the significance of location, MG, year, and genotype, along with the importance of weather variables MDNI and AP.
精确的 crop yield 预测对于改善农业实践和确保在不同气候条件下的作物韧性至关重要。将气象数据在整个生长季节中整合,特别是针对不同作物 variety 的气象数据,对于理解它们在气候变化面前的适应能力至关重要。在 MLCAS2021 crop yield 预测挑战中,我们使用了一个包含 93,028 个训练记录的数据集,用于预测 10,337 个测试记录的 yield,覆盖了 28 个美国州和加拿大省在 13 年(2003-2015)中的 159 个地点。这个数据集包含了关于 5,838 个 distinct genotypes 和每日气象数据的详细情况,使能够进行全面分析。作为获胜团队之一,我们开发了两种新的卷积神经网络 (CNN) 架构:CNN-DNN 模型,将 CNN 和全连接网络相结合,以及 CNN-LSTM-DNN 模型,并在气象变量方面增加了 LSTM 层。利用通用群集方法 (gem),我们确定了最佳的模型权重,从而导致与基准模型相比更好的性能。gem 模型在测试数据上的 RMSE 降低到了 5.55% 到 39.88%,MAE 降低到了 5.34% 到 43.76%,并更高的 correlation 系数 (1.1% 到 10.79%)。我们应用了 CNN-DNN 模型来确定各种地点和气象条件的顶级表现 genotypes,并根据气象变量进行 genotypes 选择。我们的数据驱动方法对于测试年份有限的情况非常有价值。此外,使用 RMSE 变化的特征重要性分析强调了地点、MG、年份和 genotypes 的重要性,以及气象变量 MDNI 和 AP 的重要性。
https://arxiv.org/abs/2309.13021
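The entry above reports RMSE and MAE for a weighted ensemble of models. The sketch below shows how these metrics and a weighted average of model predictions are computed. The weights are taken as given here; how the Generalized Ensemble Method derives its optimal weights is not described in the abstract and is not reproduced. All names and numbers are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_ensemble(predictions, weights):
    """Weighted average of several models' predictions (weights are normalised)."""
    total = sum(weights)
    n_samples = len(predictions[0])
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_samples)
    ]


# Hypothetical yields and two models whose errors partly cancel out.
truth = [1.0, 2.0, 3.0]
model_a = [0.0, 2.0, 4.0]
model_b = [2.0, 2.0, 2.0]
combined = weighted_ensemble([model_a, model_b], [0.5, 0.5])
```

In this toy case the two models err in opposite directions, so the equal-weight ensemble has a lower RMSE than either model alone.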
Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it requires several rounds of pruning and re-training to be run for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, resulting in either sparse monolingual models or a sparse multilingual model (named Dynamic ASR Pathways). Our approach dynamically adapts the sub-network, avoiding premature decisions about a fixed sub-network structure. We show that our approach outperforms existing pruning methods when targeting sparse monolingual models. Further, we illustrate that Dynamic ASR Pathways jointly discovers and trains better sub-networks (pathways) of a single multilingual model by adapting from different sub-network initializations, thereby reducing the need for language-specific pruning.
神经网络剪枝是一种有效的方法,以压缩具有最小性能损失的多语言自动语音识别(ASR)模型。然而,它需要进行多个语言的剪枝和再训练。在这项工作中,我们提议使用一种自适应掩蔽方法,在两个场景下高效剪枝多语言ASR模型,每个产生稀疏的 Monolingual 模型或稀疏的 Multilingual 模型(称为动态ASR通道)。我们的方法动态适应子网络,避免过早决定固定的子网络结构。我们表明,当针对稀疏的 Monolingual 模型时,我们的方法比现有的剪枝方法表现更好。此外,我们举例说明,动态ASR通道通过自适应从不同的子网络初始化中学习更好的子网络(通道),从而减少了特定语言的剪枝需求。
https://arxiv.org/abs/2309.13018
Sparse training is one of the promising techniques to reduce the computational cost of DNNs while retaining high accuracy. In particular, N:M fine-grained structured sparsity, where only N out of every M consecutive elements can be nonzero, has attracted attention due to its hardware-friendly pattern and capability of achieving a high sparse ratio. However, the potential to accelerate N:M sparse DNN training has not been fully exploited, and there is a lack of efficient hardware supporting N:M sparse training. To tackle these challenges, this paper presents a computation-efficient training scheme for N:M sparse DNNs using algorithm, architecture, and dataflow co-design. At the algorithm level, a bidirectional weight pruning method, dubbed BDWP, is proposed to leverage the N:M sparsity of weights during both forward and backward passes of DNN training, which can significantly reduce the computational cost while maintaining model accuracy. At the architecture level, a sparse accelerator for DNN training, namely SAT, is developed to neatly support both the regular dense operations and the computation-efficient N:M sparse operations. At the dataflow level, multiple optimization methods, including interleave mapping, pre-generation of N:M sparse weights, and offline scheduling, are proposed to boost the computational efficiency of SAT. Finally, the effectiveness of our training scheme is evaluated on a Xilinx VCU1525 FPGA card using various DNN models and datasets. Experimental results show that the SAT accelerator with the BDWP sparse training method under a 2:8 sparse ratio achieves an average speedup of 1.75x over dense training, accompanied by a negligible accuracy loss of 0.56% on average. Furthermore, our proposed training scheme significantly improves the training throughput by 2.97~25.22x and the energy efficiency by 1.36~3.58x over prior FPGA-based accelerators.
稀疏训练是一种有前途的技术,能够在保留高准确性的同时降低深度学习系统的计算成本。特别是,具有 N:M Fine-grained Structured sparsity 的稀疏结构,其中只有 N 个连续的元素中才有非零值,因此备受关注,因为它具有硬件友好的模式和实现高稀疏比例的能力。然而,加速 N:M 稀疏深度学习训练的潜力尚未得到充分利用,缺乏有效的硬件支持 N:M 稀疏训练。为了解决这些挑战,本文提出了一种计算高效的训练方案,使用算法、结构和数据流的共同设计。在算法层面上,我们提出了一种双向 weight 压缩方法,称为 BDWP,在深度学习训练的forward和backward 过程中利用 N:M 的稀疏权重,可以显著降低计算成本,同时保持模型精度。在架构层面上,我们开发了稀疏深度学习加速器,名为 SAT,以方便支持标准的DenseOps 和计算高效的 N:M 稀疏Ops。在数据流层面上,我们提出了多种优化方法,包括InterleaveMapping、N:M 稀疏权重的预先生成和离线调度,以提高 SAT 的计算效率。最后,我们对我们的训练方案的有效性在 Xilinx VCU1525 FPGA card上进行了评估,使用各种深度学习模型和数据集。实验结果表明,使用 BDWP 稀疏训练方法的 SAT 加速器在 2:8 稀疏比例下实现平均速度提高 1.75 倍,与Dense训练相比,平均精度损失几乎忽略不计。此外,我们提出的训练方案显著提高了训练吞吐量 2.97~25.22 倍,以及与先前基于 FPGA 的加速器相比,能源效率提高了 1.36~3.58 倍。
https://arxiv.org/abs/2309.13015
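The N:M sparsity pattern in the entry above (at most N nonzero values in every group of M consecutive weights) can be illustrated with a short sketch. Keeping the N largest-magnitude weights per group is a common choice and is an assumption here; the abstract does not describe how BDWP selects which weights to keep. The function name is hypothetical.

```python
def nm_prune(weights, n=2, m=8):
    """Zero out all but the n largest-magnitude weights in each group of m.

    weights: flat list of floats whose length is a multiple of m.
    """
    if len(weights) % m != 0:
        raise ValueError("length of weights must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n entries with the largest absolute value.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


# One hypothetical group of 8 weights pruned to the 2:8 pattern.
dense_group = [0.1, -0.9, 0.3, 0.05, 0.7, -0.2, 0.0, 0.4]
sparse_group = nm_prune(dense_group, n=2, m=8)
```

Because every group has the same number of surviving weights, hardware can store and schedule them with a fixed layout, which is what makes the pattern hardware-friendly.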
Medical imaging plays a crucial role in modern healthcare by providing non-invasive visualisation of internal structures and abnormalities, enabling early disease detection, accurate diagnosis, and treatment planning. This study aims to explore the application of deep learning models, particularly focusing on the UNet architecture and its variants, in medical image segmentation. We seek to evaluate the performance of these models across various challenging medical image segmentation tasks, addressing issues such as image normalization, resizing, architecture choices, loss function design, and hyperparameter tuning. The findings reveal that the standard UNet, when extended with a deep network layer, is a proficient medical image segmentation model, while the Res-UNet and Attention Res-UNet architectures demonstrate smoother convergence and superior performance, particularly when handling fine image details. The study also addresses the challenge of high class imbalance through careful preprocessing and loss function definitions. We anticipate that the results of this study will provide useful insights for researchers seeking to apply these models to new medical imaging problems and offer guidance and best practices for their implementation.
医学影像在现代医学中发挥着关键作用,通过非侵入性可视化内部结构和异常,能够早期检测疾病、准确诊断和治疗方案规划。本研究旨在探索深度学习模型在医学图像分割中的应用,特别是关注UNet架构及其变体的应用。我们希望评估这些模型在不同挑战性的医学图像分割任务中的表现,解决图像标准化、尺寸调整、架构选择、损失函数设计以及超参数调优等方面的问题。研究结果表明,标准UNet在加入深度网络层后是一种优秀的医学图像分割模型,而Res-UNet和Attention Res-UNet架构则表现出更平滑的收敛和提高性能,特别是在处理精细图像细节时。研究还通过仔细预处理和损失函数定义解决了高类别不平衡的挑战。我们预计,这项研究的结果将为寻求将这些模型应用于新的医学影像问题的研究提供有用的见解,并指导其实施。
https://arxiv.org/abs/2309.13013
Large Language Models (LLMs) still struggle with complex reasoning tasks. Motivated by the society of minds (Minsky, 1988), we propose ReConcile, a multi-model multi-agent framework designed as a round table conference among diverse LLM agents to foster diverse thoughts and discussion for improved consensus. ReConcile enhances the reasoning capabilities of LLMs by holding multiple rounds of discussion, learning to convince other agents to improve their answers, and employing a confidence-weighted voting mechanism. In each round, ReConcile initiates discussion between agents via a 'discussion prompt' that consists of (a) grouped answers and explanations generated by each agent in the previous round, (b) their uncertainties, and (c) demonstrations of answer-rectifying human explanations, used for convincing other agents. This discussion prompt enables each agent to revise their responses in light of insights from other agents. Once a consensus is reached and the discussion ends, ReConcile determines the final answer by leveraging the confidence of each agent in a weighted voting scheme. We implement ReConcile with ChatGPT, Bard, and Claude2 as the three agents. Our experimental results on various benchmarks demonstrate that ReConcile significantly enhances the reasoning performance of the agents (both individually and as a team), surpassing prior single-agent and multi-agent baselines by 7.7% and also outperforming GPT-4 on some of these datasets. We also experiment with GPT-4 itself as one of the agents in ReConcile and demonstrate that its initial performance also improves by an absolute 10.0% through discussion and feedback from other agents. Finally, we analyze the accuracy after every round and observe that ReConcile achieves better and faster consensus between agents than a multi-agent debate baseline. Our code is available at: this https URL
大型语言模型(LLM)仍然面临复杂的推理任务。受到思维社会的启发(米斯基,1988年),我们提出了Reconcile,它是一个多模型多Agent框架,设计为在一个多样化的LLM代理之间的圆桌会议中促进多样化的思考和讨论,以改善共识。Reconcile通过多次讨论增强LLM的推理能力,学习说服其他代理改善他们的答案,并使用信心加权投票机制。在每个回合中,Reconcile通过一个“讨论prompt”启动代理之间的讨论,其中(a)包括每个代理在上一个回合中生成的分组答案和解释,(b)是他们的不确定性,(c)是人类解释的演示,用于说服其他代理。这个讨论prompt使每个代理根据其他代理的见解更新他们的答案。一旦共识达成并讨论结束,Reconcile通过利用每个代理的信心加权投票机制确定最终答案。我们使用ChatGPT、 Bard和Claude2作为三个代理,我们的各种基准实验结果表明,Reconcile极大地增强了代理的推理表现(个体和团队),超过先前的单代理和多代理基准7.7%,并且在这些数据集上比GPT-4表现更好。我们也尝试以GPT-4作为Reconcile中的代理之一进行实验,并证明其初始表现也改善了absolute 10.0%。最后,我们还分析每个回合的精度,并观察到Reconcile通过代理之间的讨论实现更好的和更快的共识,相比多代理辩论基准。我们的代码可在以下httpsURL上获取:
https://arxiv.org/abs/2309.13007
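The confidence-weighted voting step in the entry above can be sketched as follows. Each agent contributes an answer together with a confidence score, and the answer with the highest total confidence wins. Using the raw confidence as the vote weight is an assumption; the abstract does not say whether ReConcile rescales or calibrates the scores. The function name and example values are hypothetical.

```python
from collections import defaultdict


def weighted_vote(answers):
    """Return the answer with the highest summed confidence.

    answers: list of (answer, confidence) pairs, one per agent.
    """
    scores = defaultdict(float)
    for answer, confidence in answers:
        scores[answer] += confidence
    return max(scores, key=scores.get)


# Three hypothetical agents: one confident vote for "A", two moderate votes for "B".
final_answer = weighted_vote([("A", 0.9), ("B", 0.6), ("B", 0.5)])
```

Here "B" wins with a total weight of 1.1 against 0.9, even though the single most confident agent chose "A".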
The rapid development of AR/VR brings tremendous demands for 3D content. While the widely-used Computer-Aided Design (CAD) method requires a time-consuming and labor-intensive modeling process, sketch-based 3D modeling offers a potential solution as a natural form of computer-human interaction. However, the sparsity and ambiguity of sketches make it challenging to generate high-fidelity content reflecting creators' ideas. Precise drawing from multiple views or strategic step-by-step drawings is often required to tackle the challenge but is not friendly to novice users. In this work, we introduce a novel end-to-end approach, Deep3DSketch+, which performs 3D modeling using only a single free-hand sketch without inputting multiple sketches or view information. Specifically, we introduce a lightweight generation network for efficient inference in real-time and a structural-aware adversarial training approach with a Stroke Enhancement Module (SEM) to capture the structural information to facilitate learning of the realistic and fine-detailed shape structures for high-fidelity performance. Extensive experiments demonstrated the effectiveness of our approach with the state-of-the-art (SOTA) performance on both synthetic and real datasets.
增强现实(AR)和虚拟现实(VR)的迅速发展,对3D内容的需求急剧增加。虽然广泛使用的计算机辅助设计(CAD)方法需要进行耗时且劳动力密集型的建模过程,但基于 Sketch 的3D建模作为一种自然计算机-人类交互的形式,提供了一个潜在的解决方案。然而, Sketch 的稀疏和歧义使得生成高保真的内容非常困难,通常需要进行精确的多视图绘图或关键步骤的 strategic 绘图,但这不适用于初学者。在这个项目中,我们介绍了一种全新的端到端方法 Deep3DSketch+,它使用单个自由手绘 Sketch 来进行3D建模,而不需要输入多个 Sketch 或视图信息。具体来说,我们介绍了一种轻量级的生成网络,用于实时高效推理,并介绍了一种结构aware的对抗训练方法,以及一个 stroke 增强模块(SEM),以捕获结构信息,以便于学习 realistic 和精细的形状结构,以获得高保真的性能。广泛的实验证明了我们的方法在合成和真实数据集上具有最先进的性能(SOTA)。
https://arxiv.org/abs/2309.13006
Recognizing the prevalence of domain shift as a common challenge in machine learning, various domain generalization (DG) techniques have been developed to enhance the performance of machine learning systems when dealing with out-of-distribution (OOD) data. Furthermore, in real-world scenarios, data distributions can gradually change across a sequence of domains. While current methodologies primarily focus on improving model effectiveness within these new domains, they often overlook fairness issues throughout the learning process. In response, we introduce an innovative framework called Counterfactual Fairness-Aware Domain Generalization with Sequential Autoencoder (CDSAE). This approach effectively separates environmental information and sensitive attributes from the embedded representation of classification features. This concurrent separation not only greatly improves model generalization across diverse and unfamiliar domains but also effectively addresses challenges related to unfair classification. Our strategy is rooted in the principles of causal inference to tackle these dual issues. To examine the intricate relationship between semantic information, sensitive attributes, and environmental cues, we systematically categorize exogenous uncertainty factors into four latent variables: 1) semantic information influenced by sensitive attributes, 2) semantic information unaffected by sensitive attributes, 3) environmental cues influenced by sensitive attributes, and 4) environmental cues unaffected by sensitive attributes. By incorporating fairness regularization, we exclusively employ semantic information for classification purposes. Empirical validation on synthetic and real-world datasets substantiates the effectiveness of our approach, demonstrating improved accuracy while preserving fairness across continuously evolving domains.
认识到域转换是机器学习中常见的挑战,各种域扩展技术(DG)已经被开发用于提高处理非分布数据(OOD)机器学习系统的性能。此外,在现实世界场景中,数据分布可以逐步变化在一个连续的域序列中。虽然当前的方法主要关注在这些新域中提高模型有效性,但它们往往在整个学习过程中忽视公平问题。为了应对这种情况,我们提出了一种名为“反事实公平 aware 域扩展”的创新框架(CDSAE)。该方法有效地将环境信息和敏感属性从分类特征嵌入表示中分离出来。这种同时分离不仅极大地改善了跨不同熟悉域模型的泛化能力,而且还有效地解决了与不公平分类相关的挑战。我们的策略基于因果关系推理的原则,以解决这些双重问题。为了研究语义信息、敏感属性和环境 cues之间的关系,我们 systematic 地将外部不确定性因素分类为四个隐变量:1)受敏感属性影响的语义信息,2)不受敏感属性影响的语义信息,3)受敏感属性影响的环境问题,4)不受敏感属性影响的环境问题。通过引入公平 regularization,我们仅用于分类目的的语义信息。对合成数据和实际数据集的模拟验证证实了我们方法的有效性,证明了提高准确性水平,同时确保了连续域演化 landscape 中公平性的保持。
https://arxiv.org/abs/2309.13005
In machine translation, a common problem is that certain words, even when translated, remain incomprehensible to the target-language audience because of different cultural backgrounds. One solution is to add explanations for these words. As a first step, we therefore need to identify these words or phrases. In this work, we explore techniques to extract example explanations from a parallel corpus. However, the sparsity of sentences containing words that need to be explained makes building the training dataset extremely difficult. We propose a semi-automatic technique to extract these explanations from a large parallel corpus. Experiments on the English->German language pair show that our method is able to extract sentences such that more than 10% of them contain an explanation, while only 1.9% of the original sentences contain explanations. In addition, experiments on the English->French and English->Chinese language pairs show similar results. This is therefore an essential first automatic step towards creating an explanation dataset. Furthermore, we show that the technique is robust for all three language pairs.
在机器翻译中,一个常见问题是,即使翻译了某些单词,由于目标语言文化背景的不同,也会导致 target 语言观众无法理解。解决这个问题的解决方法是对这些单词或短语添加解释。因此,的第一步是确定这些单词或短语。在这个研究中,我们探索了从并行语料库中提取示例解释的技术。然而,包含需要解释的单词的句子数量稀少,使得构建训练数据集非常困难。在这个研究中,我们提出了一种半自动的方法,从大型并行语料库中提取这些解释。对英语到德语语言对进行了实验,表明我们的方法和方法能够提取句子,使得超过 10% 的句子包含解释,而只有 1.9% 的原本句子包含解释。对英语到法语和英语到中文语言对也进行了实验,也得出类似的结论。因此,这是一个创建解释数据集的 essential 的第一步。此外,我们还表明,该方法对这三个语言对都是可靠的。
https://arxiv.org/abs/2309.12998
This paper introduces the Point Cloud Network (PCN) architecture, a novel implementation of linear layers in deep learning networks, and provides empirical evidence in favor of using it instead of the Multilayer Perceptron (MLP) in linear layers. We train several models, including the original AlexNet, using both MLP and PCN architectures for direct comparison of linear layers (Krizhevsky et al., 2012). The key results collected are model parameter count and top-1 test accuracy over the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009). AlexNet-PCN16, our PCN equivalent to AlexNet, achieves comparable efficacy (test accuracy) to the original architecture with a 99.5% reduction of parameters in its linear layers. All training is done on cloud RTX 4090 GPUs, leveraging PyTorch for model construction and training. Code is provided for anyone to reproduce the trials from this paper.
本论文介绍了点云网络(PCN)架构,这是一种在深度学习网络中采用线性层的独特实现方式,并提供了实证证据,支持其对线性层中多层感知器(MLP)的偏好。我们训练了多个模型,包括原始AlexNet模型,同时使用MLP和PCN架构进行线性层的直接比较(Krizhevsky等人,2012)。收集的主要结果是模型参数计数和在CIFAR-10和CIFAR-100数据集上的最佳1%测试准确率(Krizhevsky,2009)。AlexNet-PCN16是我们的AlexNet的PCN等价物,其线性层参数减少99.5%。所有训练都在云端RTX 4090GPU上完成,利用PyTorch进行模型构建和训练。代码为任何人提供,用于复制本论文的实验。
https://arxiv.org/abs/2309.12996
Detecting and recognizing text in images and videos captured by cameras is a highly challenging research problem. Although some recent methods achieve high accuracy, current approaches still require substantial improvement before they are applicable in practical scenarios. Diverging from text detection in images/videos, this paper addresses the issue of text detection within license plates by combining multiple frames taken from distinct perspectives. For each viewpoint, the proposed method extracts descriptive features characterizing the text components of the license plate, specifically corner points and area. Concretely, we present three viewpoints (view-1, view-2, and view-3) to identify the nearest neighboring components, which facilitates the restoration of text components from the same license plate line based on estimates of similarity levels and distance metrics. Subsequently, we employ the CnOCR method for text recognition within license plates. Experimental results on the self-collected dataset (PTITPlates), comprising pairs of images in various scenarios, and the publicly available Stanford Cars Dataset, demonstrate the superiority of the proposed method over existing approaches.
在研究领域,相机捕获的图像/视频中的文字检测/识别构成了一个高度挑战的问题,尽管某些方法已经实现了高精度,但当前的方法仍然需要在实际应用场景中进行大量改进。与图像/视频中的文字检测不同,本文通过整合多个不同视角的图像帧来解决 license plate 中的文字检测问题。对于每个视角,该方法提取了描述性特征,描述了 license plate 中文字组件的特征,特别是角落点和区域。具体来说,我们展示了三个视角:view-1、view-2、view-3,以确定最接近的相邻组件,通过相似度和距离度量估计来实现文字组件从同一 license plate 线条的恢复。随后,我们采用 CnOCR 方法在 license plate 中的文字识别。对自收集的数据集(PTITPlates)进行了实验结果,该数据集包括各种场景下的两个图像对,以及公开可用的 Stanford 汽车数据集,证明了该方法相对于现有方法的优越性。
https://arxiv.org/abs/2309.12972
Despite the recent successes of vanilla Graph Neural Networks (GNNs) on many tasks, their foundation on pairwise interaction networks inherently limits their capacity to discern latent higher-order interactions in complex systems. To bridge this capability gap, we propose a novel approach exploiting the rich mathematical theory of simplicial complexes (SCs), a robust tool for modeling higher-order interactions. Current SC-based GNNs are burdened by high complexity and rigidity, and quantifying higher-order interaction strengths remains challenging. We present a higher-order Flower-Petals (FP) model, incorporating FP Laplacians into SCs. Further, we introduce a Higher-order Graph Convolutional Network (HiGCN) grounded in FP Laplacians, capable of discerning intrinsic features across varying topological scales. By employing learnable graph filters, a parameter group within each FP Laplacian domain, we can identify diverse patterns, and the filters' weights serve as a quantifiable measure of higher-order interaction strengths. The theoretical underpinnings of HiGCN's advanced expressiveness are rigorously demonstrated. Additionally, our empirical investigations reveal that the proposed model achieves state-of-the-art (SOTA) performance on a range of graph tasks and provides a scalable and flexible solution for exploring higher-order interactions in graphs.
https://arxiv.org/abs/2309.12971
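The idea of a learnable graph filter whose weights can be read as interaction strengths can be illustrated with a generic polynomial filter, y = sum_k w_k L^k x. This is only a sketch under stated assumptions: it uses the ordinary combinatorial graph Laplacian L = D - A as a stand-in, because the abstract does not define the FP Laplacian, and the function names and the fixed example weights are hypothetical. In a trained model the weights would be learned rather than set by hand.

```python
def laplacian(adj):
    """Combinatorial graph Laplacian L = D - A for a dense adjacency matrix."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def matvec(matrix, vector):
    """Dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def polynomial_filter(lap, signal, weights):
    """Return sum_k weights[k] * L^k signal.

    The weights play the role of the learnable filter coefficients;
    weights[k] scales the contribution of k-hop propagation.
    """
    out = [0.0] * len(signal)
    power = list(signal)  # L^0 applied to the signal
    for w in weights:
        out = [o + w * p for o, p in zip(out, power)]
        power = matvec(lap, power)  # advance to the next power of L
    return out


# Path graph 0 - 1 - 2 with a signal concentrated on node 0.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
lap = laplacian(adj)
signal = [1.0, 0.0, 0.0]
filtered = polynomial_filter(lap, signal, weights=[0.5, 0.25])
print(filtered)
```

With one such weight vector per Laplacian, comparing the learned magnitudes across Laplacians is one plausible way to read off which interaction order matters most; how HiGCN does this in detail is not stated in the abstract.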
The detailed images produced by Magnetic Resonance Imaging (MRI) provide life-critical information for the diagnosis and treatment of prostate cancer. To standardize the acquisition, interpretation and usage of the complex MRI images, the PI-RADS v2 guideline was proposed. An automated segmentation following the guideline facilitates consistent and precise lesion detection, staging and treatment. The guideline recommends a division of the prostate into four zones: PZ (peripheral zone), TZ (transition zone), DPU (distal prostatic urethra) and AFS (anterior fibromuscular stroma). Not every zone shares a boundary with the others, nor is every zone present in every slice. Further, the representations captured by a single model might not suffice for all zones. This motivated us to design a dual-branch convolutional neural network (CNN), where each branch captures the representations of the connected zones separately. The representations from the different branches complement each other at the second stage of training, where they are fine-tuned through an unsupervised loss. The loss penalises the difference in predictions from the two branches for the same class. We also incorporate multi-task learning in our framework to further improve the segmentation accuracy. The proposed approach improves the segmentation accuracy of the baseline (mean absolute symmetric distance) by 7.56%, 11.00%, 58.43% and 19.67% for the PZ, TZ, DPU and AFS zones respectively.
https://arxiv.org/abs/2309.12970
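The unsupervised loss that penalises disagreement between the two branches on the same class can be sketched as a simple consistency term. The abstract does not give the exact form, so the squared difference of softmax probabilities used here is an assumption, as are the function names and the choice of which class index the branches share.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def consistency_loss(logits_a, logits_b, shared_class):
    """Mean squared gap between the two branches' probabilities
    for one shared class, averaged over pixels.

    logits_a, logits_b: per-pixel lists of class logits, one per branch.
    shared_class: index of the class both branches predict (assumed).
    """
    total = 0.0
    for pixel_a, pixel_b in zip(logits_a, logits_b):
        prob_a = softmax(pixel_a)[shared_class]
        prob_b = softmax(pixel_b)[shared_class]
        total += (prob_a - prob_b) ** 2
    return total / len(logits_a)


# Two pixels, two classes per branch; the branches disagree on pixel 0 only.
branch_a = [[0.0, 0.0], [2.0, -1.0]]
branch_b = [[math.log(3.0), 0.0], [2.0, -1.0]]
loss = consistency_loss(branch_a, branch_b, shared_class=0)
print(loss)
```

The term needs no labels: it is zero when the branches agree and grows with their disagreement, which is what lets it be used for fine-tuning in a second, unsupervised stage.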