Today's touch sensors come in many shapes and sizes. This has made it challenging to develop general-purpose touch processing methods since models are generally tied to one specific sensor design. We address this problem by performing cross-modal prediction between touch sensors: given the tactile signal from one sensor, we use a generative model to estimate how the same physical contact would be perceived by another sensor. This allows us to apply sensor-specific methods to the generated signal. We implement this idea by training a diffusion model to translate between the popular GelSlim and Soft Bubble sensors. As a downstream task, we perform in-hand object pose estimation using GelSlim sensors while using an algorithm that operates only on Soft Bubble signals. The dataset, the code, and additional details can be found at this https URL.
https://arxiv.org/abs/2409.08269
Whether learned, simulated, or analytical, approximations of a robot's dynamics can be inaccurate when encountering novel environments. Many approaches have been proposed to quantify the aleatoric uncertainty of such methods, i.e. uncertainty resulting from stochasticity. However, these estimates alone are not enough to properly estimate the uncertainty of a model in a novel environment, where the actual dynamics can change. Such changes can induce epistemic uncertainty, i.e. uncertainty due to a lack of information/data. Accounting for both epistemic and aleatoric dynamics uncertainty in a theoretically grounded way remains an open problem. We introduce Local Uncertainty Conformal Calibration (LUCCa), a conformal prediction-based approach that calibrates the aleatoric uncertainty estimates provided by dynamics models to generate probabilistically valid prediction regions of the system's state. We account for both epistemic and aleatoric uncertainty non-asymptotically, without strong assumptions about the form of the true dynamics or how it changes. The calibration is performed locally in the state-action space, leading to uncertainty estimates that are useful for planning. We validate our method by constructing probabilistically safe plans for a double-integrator under significant changes in dynamics.
https://arxiv.org/abs/2409.08249
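The calibration idea in the abstract above can be illustrated with plain split conformal prediction. This is only a sketch under stated assumptions, not LUCCa itself: it assumes the dynamics model outputs a Gaussian covariance per prediction, uses the squared Mahalanobis distance as the nonconformity score, and calibrates one global scaling factor, whereas the abstract describes calibration performed locally in the state-action space. The names `conformal_scale` and `in_region` are hypothetical.

```python
import numpy as np

def conformal_scale(residuals, covs, alpha=0.1):
    """Return a threshold q so that the scaled ellipsoid covers about 1 - alpha.

    residuals: (n, d) prediction errors on a held-out calibration set.
    covs: (n, d, d) covariances predicted by the dynamics model (assumed Gaussian).
    """
    n = residuals.shape[0]
    # Nonconformity score: squared Mahalanobis distance under the predicted covariance.
    scores = np.einsum("ni,nij,nj->n", residuals, np.linalg.inv(covs), residuals)
    # Finite-sample corrected quantile level used in split conformal prediction.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def in_region(residual, cov, q):
    """Check whether a prediction error falls inside the calibrated ellipsoid."""
    return residual @ np.linalg.inv(cov) @ residual <= q
```

If the model is overconfident (true errors larger than the predicted covariance suggests), `q` comes out larger than the corresponding chi-square quantile, which inflates the prediction region until the target coverage is met on the calibration data.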
Recently, methods like Zero-1-2-3 have focused on single-view based 3D reconstruction and have achieved remarkable success. However, their predictions for unseen areas heavily rely on the inductive bias of large-scale pretrained diffusion models. Although subsequent work, such as DreamComposer, attempts to make predictions more controllable by incorporating additional views, the results remain unrealistic due to feature entanglement in the vanilla latent space, including factors such as lighting, material, and structure. To address these issues, we introduce the Visual Isotropy 3D Reconstruction Model (VI3DRM), a diffusion-based sparse views 3D reconstruction model that operates within an ID consistent and perspective-disentangled 3D latent space. By facilitating the disentanglement of semantic information, color, material properties and lighting, VI3DRM is capable of generating highly realistic images that are indistinguishable from real photographs. By leveraging both real and synthesized images, our approach enables the accurate construction of pointmaps, ultimately producing finely textured meshes or point clouds. On the NVS task, tested on the GSO dataset, VI3DRM significantly outperforms state-of-the-art method DreamComposer, achieving a PSNR of 38.61, an SSIM of 0.929, and an LPIPS of 0.027. Code will be made available upon publication.
https://arxiv.org/abs/2409.08207
A unique aspect of human visual understanding is the ability to flexibly interpret abstract concepts: acquiring lifted rules explaining what they symbolize, grounding them across familiar and unfamiliar contexts, and making predictions or reasoning about them. While off-the-shelf vision-language models excel at making literal interpretations of images (e.g., recognizing object categories such as tree branches), they still struggle to make sense of such visual abstractions (e.g., how an arrangement of tree branches may form the walls of a maze). To address this challenge, we introduce Deep Schema Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning. At the core of DSG are schemas--dependency graph descriptions of abstract concepts that decompose them into more primitive-level symbols. DSG uses large language models to extract schemas, then hierarchically grounds concrete to abstract components of the schema onto images with vision-language models. The grounded schema is used to augment visual abstraction understanding. We systematically evaluate DSG and different methods in reasoning on our new Visual Abstractions Dataset, which consists of diverse, real-world images of abstract concepts and corresponding question-answer pairs labeled by humans. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models, and is a step toward human-aligned understanding of visual abstractions.
https://arxiv.org/abs/2409.08202
The global increase in observed forest dieback, characterised by the death of tree foliage, heralds widespread decline in forest ecosystems. This degradation causes significant changes to ecosystem services and functions, including habitat provision and carbon sequestration, which can be difficult to detect using traditional monitoring techniques, highlighting the need for large-scale and high-frequency monitoring. Contemporary developments in the instruments and methods to gather and process data at large-scales mean this monitoring is now possible. In particular, the advancement of low-cost drone technology and deep learning on consumer-level hardware provide new opportunities. Here, we use an approach based on deep learning and vegetation indices to assess crown dieback from RGB aerial data without the need for expensive instrumentation such as LiDAR. We use an iterative approach to match crown footprints predicted by deep learning with field-based inventory data from a Mediterranean ecosystem exhibiting drought-induced dieback, and compare expert field-based crown dieback estimation with vegetation index-based estimates. We obtain high overall segmentation accuracy (mAP: 0.519) without the need for additional technical development of the underlying Mask R-CNN model, underscoring the potential of these approaches for non-expert use and proving their applicability to real-world conservation. We also find colour-coordinate based estimates of dieback correlate well with expert field-based estimation. Substituting ground truth for Mask R-CNN model predictions showed negligible impact on dieback estimates, indicating robustness. Our findings demonstrate the potential of automated data collection and processing, including the application of deep learning, to improve the coverage, speed and cost of forest dieback monitoring.
https://arxiv.org/abs/2409.08171
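As a rough illustration of a colour-coordinate index computed from RGB data, the sketch below averages the green chromatic coordinate, G / (R + G + B), over the pixels of one predicted crown mask. Choosing this particular index is an assumption; the abstract above does not say which vegetation indices were used, and the function name `mean_gcc` is hypothetical.

```python
import numpy as np

def mean_gcc(rgb, mask):
    """Mean green chromatic coordinate G / (R + G + B) over pixels where mask is True.

    rgb: (H, W, 3) image array; mask: (H, W) boolean crown footprint.
    """
    pixels = rgb[mask].astype(float)      # (K, 3) pixels inside the crown
    total = pixels.sum(axis=1)
    valid = total > 0                     # skip pure-black pixels to avoid division by zero
    return float((pixels[valid, 1] / total[valid]).mean())
```

A lower value for a crown would indicate less green foliage, which is the kind of signal that could be compared against expert field-based dieback estimates.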
We present a new perspective on how readers integrate context during real-time language comprehension. Our proposals build on surprisal theory, which posits that the processing effort of a linguistic unit (e.g., a word) is an affine function of its in-context information content. We first observe that surprisal is only one out of many potential ways that a contextual predictor can be derived from a language model. Another one is the pointwise mutual information (PMI) between a unit and its context, which turns out to yield the same predictive power as surprisal when controlling for unigram frequency. Moreover, both PMI and surprisal are correlated with frequency. This means that neither PMI nor surprisal contains information about context alone. In response to this, we propose a technique where we project surprisal onto the orthogonal complement of frequency, yielding a new contextual predictor that is uncorrelated with frequency. Our experiments show that the proportion of variance in reading times explained by context is a lot smaller when context is represented by the orthogonalized predictor. From an interpretability standpoint, this indicates that previous studies may have overstated the role that context has in predicting reading times.
https://arxiv.org/abs/2409.08160
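The projection step described in the abstract above can be sketched as an ordinary least-squares residualization. This is a minimal sketch under assumptions: `frequency` stands for a per-word frequency measure such as log unigram frequency, and an intercept column is included so that the residual is uncorrelated with frequency rather than merely orthogonal to it. The function name `orthogonalize` is hypothetical.

```python
import numpy as np

def orthogonalize(surprisal, frequency):
    """Residual of surprisal after least-squares regression on frequency (with intercept)."""
    X = np.column_stack([np.ones_like(frequency), frequency])
    coef, *_ = np.linalg.lstsq(X, surprisal, rcond=None)
    # What remains lies in the orthogonal complement of span{1, frequency}.
    return surprisal - X @ coef
```

The returned predictor can then replace raw surprisal in a reading-time regression, so that any variance it explains cannot be attributed to frequency.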
Large language models (LLMs) have become proficient at solving a wide variety of tasks, including those involving multi-modal inputs. In particular, instantiating an LLM (such as LLaMA) with a speech encoder and training it on paired data imparts speech recognition (ASR) abilities to the decoder-only model, hence called Speech-LLaMA. Nevertheless, due to the sequential nature of auto-regressive inference and the relatively large decoder, Speech-LLaMA models require relatively high inference time. In this work, we propose to speed up Speech-LLaMA inference by predicting multiple tokens in the same decoding step. We explore several model architectures that enable this, and investigate their performance using threshold-based and verification-based inference strategies. We also propose a prefix-based beam search decoding method that allows efficient minimum word error rate (MWER) training for such models. We evaluate our models on a variety of public benchmarks, where they reduce the number of decoder calls by ~3.2x while maintaining or improving WER performance.
https://arxiv.org/abs/2409.08148
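One way to picture the threshold-based inference strategy mentioned in the abstract above: when several tokens are predicted in a single decoding step, keep the longest prefix whose predicted probabilities stay above a threshold. The rule below is an assumed simplification for illustration, not the paper's exact procedure, and `accept_tokens` is a hypothetical name.

```python
def accept_tokens(candidates, threshold):
    """Keep the longest confident prefix of tokens predicted in one decoding step.

    candidates: list of (token, probability) pairs for successive positions.
    """
    accepted = []
    for i, (token, prob) in enumerate(candidates):
        # Always keep the first token so that decoding makes progress.
        if i > 0 and prob < threshold:
            break
        accepted.append(token)
    return accepted
```

Each call that accepts more than one token saves a decoder call, which is where the reduction in decoding steps would come from.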
3D segmentation is a core problem in computer vision and, similarly to many other dense prediction tasks, it requires large amounts of annotated data for adequate training. However, densely labeling 3D point clouds to employ fully-supervised training remains too labor intensive and expensive. Semi-supervised training provides a more practical alternative, where only a small set of labeled data is given, accompanied by a larger unlabeled set. This area thus studies the effective use of unlabeled data to reduce the performance gap that arises due to the lack of annotations. In this work, inspired by Bayesian deep learning, we first propose a Bayesian self-training framework for semi-supervised 3D semantic segmentation. Employing stochastic inference, we generate an initial set of pseudo-labels and then filter these based on estimated point-wise uncertainty. By constructing a heuristic $n$-partite matching algorithm, we extend the method to semi-supervised 3D instance segmentation, and finally, with the same building blocks, to dense 3D visual grounding. We demonstrate state-of-the-art results for our semi-supervised method on SemanticKITTI and ScribbleKITTI for 3D semantic segmentation and on ScanNet and S3DIS for 3D instance segmentation. We further achieve substantial improvements in dense 3D visual grounding over supervised-only baselines on ScanRefer. Our project page is available at this http URL.
https://arxiv.org/abs/2409.08102
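The pseudo-label filtering step in the abstract above can be sketched as follows. The sketch assumes that stochastic inference means several forward passes with randomness enabled (for example, dropout) and that predictive entropy of the averaged softmax output serves as the point-wise uncertainty; the abstract specifies neither, and `filter_pseudo_labels` is a hypothetical name.

```python
import numpy as np

def filter_pseudo_labels(probs, max_entropy):
    """Turn stochastic predictions into pseudo-labels, dropping uncertain points.

    probs: (T, N, C) softmax outputs from T stochastic forward passes over N points.
    Returns an (N,) label array with -1 where predictive entropy exceeds max_entropy.
    """
    mean_probs = probs.mean(axis=0)
    entropy = -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=1)
    labels = mean_probs.argmax(axis=1)
    labels[entropy > max_entropy] = -1    # ignore label for uncertain points
    return labels
```

Points marked -1 would simply be excluded from the loss in the next self-training round.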
This paper explores the intersection of technological innovation and access to justice by developing a benchmark for predicting case outcomes in the UK Employment Tribunal (UKET). To address the challenge of extensive manual annotation, the study employs a large language model (LLM) for automatic annotation, resulting in the creation of the CLC-UKET dataset. The dataset consists of approximately 19,000 UKET cases and their metadata. Comprehensive legal annotations cover facts, claims, precedent references, statutory references, case outcomes, reasons and jurisdiction codes. Facilitated by the CLC-UKET data, we examine a multi-class case outcome prediction task in the UKET. Human predictions are collected to establish a performance reference for model comparison. Empirical results from baseline models indicate that finetuned transformer models outperform zero-shot and few-shot LLMs on the UKET prediction task. The performance of zero-shot LLMs can be enhanced by integrating task-related information into few-shot examples. We hope that the CLC-UKET dataset, along with human annotations and empirical findings, can serve as a valuable benchmark for employment-related dispute resolution.
https://arxiv.org/abs/2409.08098
We propose a simple but effective training-free approach tailored to diffusion-based image-to-image translation. Our approach revises the original noise prediction network of a pretrained diffusion model by introducing a noise correction term. We formulate the noise correction term as the difference between two noise predictions; one is computed from the denoising network with a progressive interpolation of the source and target prompt embeddings, while the other is the noise prediction with the source prompt embedding. The final noise prediction network is given by a linear combination of the standard denoising term and the noise correction term, where the former is designed to reconstruct must-be-preserved regions while the latter aims to effectively edit regions of interest relevant to the target prompt. Our approach can be easily incorporated into existing image-to-image translation methods based on diffusion models. Extensive experiments verify that the proposed technique achieves outstanding performance with low latency and consistently improves existing frameworks when combined with them.
https://arxiv.org/abs/2409.08077
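The noise correction described in the abstract above might be sketched as below. Several details are assumptions made for illustration: the `denoiser` callable and its argument order, the linear interpolation schedule, the weight `gamma`, and the choice of the source-prompt prediction as the standard denoising term. None of these come from the abstract.

```python
import numpy as np

def interpolate_prompt(src_emb, tgt_emb, step, num_steps):
    """Assumed linear schedule moving from the source to the target embedding."""
    beta = step / max(num_steps - 1, 1)
    return (1.0 - beta) * src_emb + beta * tgt_emb

def corrected_noise(denoiser, x_t, t, src_emb, tgt_emb, step, num_steps, gamma=1.0):
    """Combine a standard noise prediction with a noise correction term."""
    eps_src = denoiser(x_t, t, src_emb)   # prediction with the source prompt
    eps_mix = denoiser(x_t, t, interpolate_prompt(src_emb, tgt_emb, step, num_steps))
    correction = eps_mix - eps_src        # difference between the two predictions
    return eps_src + gamma * correction
```

With `gamma = 0` the sketch reduces to the source-prompt prediction, which reconstructs the input; larger values push the edit toward the target prompt.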
RGB-D has gradually become a crucial data source for understanding complex scenes in assisted driving. However, existing studies have paid insufficient attention to the intrinsic spatial properties of depth maps. This oversight significantly impacts the attention representation, leading to prediction errors caused by attention shift issues. To this end, we propose a novel learnable Depth interaction Pyramid Transformer (DiPFormer) to explore the effectiveness of depth. Firstly, we introduce Depth Spatial-Aware Optimization (Depth SAO) as an offset to represent real-world spatial relationships. Secondly, the similarity in the feature space of RGB-D is learned by Depth Linear Cross-Attention (Depth LCA) to clarify spatial differences at the pixel level. Finally, an MLP Decoder is utilized to effectively fuse multi-scale features to meet real-time requirements. Comprehensive experiments demonstrate that the proposed DiPFormer significantly addresses the issue of attention misalignment in both road detection (+7.5%) and semantic segmentation (+4.9% / +1.5%) tasks. DiPFormer achieves state-of-the-art performance on the KITTI (97.57% F-score on KITTI road and 68.74% mIoU on KITTI-360) and Cityscapes (83.4% mIoU) datasets.
https://arxiv.org/abs/2409.07995
The task of vision-based 3D occupancy prediction aims to reconstruct 3D geometry and estimate its semantic classes from 2D color images, where the 2D-to-3D view transformation is an indispensable step. Most previous methods conduct forward projection, such as BEVPooling and VoxelPooling, both of which map the 2D image features into 3D grids. However, the current grid representing features within a certain height range usually introduces many confusing features that belong to other height ranges. To address this challenge, we present Deep Height Decoupling (DHD), a novel framework that incorporates an explicit height prior to filter out the confusing features. Specifically, DHD first predicts height maps via explicit supervision. Based on the height distribution statistics, DHD designs Mask Guided Height Sampling (MGHS) to adaptively decouple the height map into multiple binary masks. MGHS projects the 2D image features into multiple subspaces, where each grid contains features within reasonable height ranges. Finally, a Synergistic Feature Aggregation (SFA) module is deployed to enhance the feature representation through channel and spatial affinities, enabling further occupancy refinement. On the popular Occ3D-nuScenes benchmark, our method achieves state-of-the-art performance even with minimal input frames. Code is available at this https URL.
https://arxiv.org/abs/2409.07972
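The step of decoupling a height map into multiple binary masks, as described in the abstract above, can be pictured with a few lines of NumPy. The interval edges are hypothetical inputs here; the abstract says the split is derived from height distribution statistics but gives no values, and `height_masks` is a hypothetical name.

```python
import numpy as np

def height_masks(height_map, bin_edges):
    """Split an (H, W) height map into one binary mask per [lo, hi) interval."""
    return np.stack([
        (height_map >= lo) & (height_map < hi)
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:])
    ])
```

Multiplying the 2D image features by each mask before projection would keep features from one height range out of grids that represent another.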
We present InterACT: Inter-dependency aware Action Chunking with Hierarchical Attention Transformers, a novel imitation learning framework for bimanual manipulation that integrates hierarchical attention to capture inter-dependencies between dual-arm joint states and visual inputs. InterACT consists of a Hierarchical Attention Encoder and a Multi-arm Decoder, both designed to enhance information aggregation and coordination. The encoder processes multi-modal inputs through segment-wise and cross-segment attention mechanisms, while the decoder leverages synchronization blocks to refine individual action predictions, providing the counterpart's prediction as context. Our experiments on a variety of simulated and real-world bimanual manipulation tasks demonstrate that InterACT significantly outperforms existing methods. Detailed ablation studies validate the contributions of key components of our work, including the impact of CLS tokens, cross-segment encoders, and synchronization blocks.
https://arxiv.org/abs/2409.07914
Score prediction is crucial in realistic image sharpness assessment after informative features are collected. Recently, Kolmogorov-Arnold networks (KANs) have been developed and have achieved remarkable success in data fitting. This study presents a Taylor series based KAN (TaylorKAN). Then, different KANs are explored on four realistic image databases (BID2011, CID2013, CLIVE, and KonIQ-10k) for score prediction by using 15 mid-level features and 2048 high-level features. With support vector regression as the baseline, experimental results indicate that KANs are generally better or competitive: TaylorKAN is the best on three databases using mid-level feature input, while KANs are inferior on CLIVE when high-level features are used. This is the first study that explores KANs for image quality assessment. It sheds light on how to select and improve KANs for related tasks.
https://arxiv.org/abs/2409.07762
In real-world clinical settings, data distributions evolve over time, with a continuous influx of new, limited disease cases. Therefore, class incremental learning is of great significance, i.e., deep learning models are required to learn new class knowledge while maintaining accurate recognition of previous diseases. However, traditional deep neural networks often suffer from severe forgetting of prior knowledge when adapting to new data unless trained from scratch, which is costly in both time and computation. Additionally, the sample sizes for different diseases can be highly imbalanced, with newly emerging diseases typically having far fewer instances, which causes classification bias. To tackle these challenges, we are the first to propose a class-incremental learning method under limited samples in the biomedical field. First, we propose a novel cumulative entropy prediction module to measure the uncertainty of the samples, of which the most uncertain samples are stored in a memory bank as exemplars for the model's later review. Furthermore, we theoretically demonstrate its effectiveness in measuring uncertainty. Second, we develop a fine-grained semantic expansion module through various augmentations, leading to more compact distributions within the feature space and creating sufficient room for generalization to new classes. In addition, a cosine classifier is utilized to mitigate classification bias caused by imbalanced datasets. Across four imbalanced data distributions over two datasets, our method achieves optimal performance, surpassing state-of-the-art methods by as much as 53.54% in accuracy.
https://arxiv.org/abs/2409.07757
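The cosine classifier mentioned in the abstract above is commonly implemented by normalizing both the features and the class weight vectors before taking their dot product. The sketch below shows that generic form; the `scale` value is a hypothetical temperature, and nothing here is taken from the paper's code.

```python
import numpy as np

def cosine_logits(features, weights, scale=10.0):
    """Logits as scaled cosine similarity between features and class weight vectors.

    features: (N, D) embeddings; weights: (C, D) one vector per class.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    return scale * f @ w.T
```

Because weight norms are divided out, a class with many training samples cannot dominate simply by growing a larger weight vector, which is the usual motivation for this design under class imbalance.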
Knowledge Distillation (KD) transfers knowledge from a large pre-trained teacher network to a compact and efficient student network, making it suitable for deployment on resource-limited media terminals. However, traditional KD methods require balanced data to ensure robust training, which is often unavailable in practical applications. In such scenarios, a few head categories occupy a substantial proportion of examples. This imbalance biases the trained teacher network towards the head categories, resulting in severe performance degradation on the less represented tail categories for both the teacher and student networks. In this paper, we propose a novel framework called Knowledge Rectification Distillation (KRDistill) to address the imbalanced knowledge inherited in the teacher network through the incorporation of the balanced category priors. Furthermore, we rectify the biased predictions produced by the teacher network, particularly focusing on the tail categories. Consequently, the teacher network can provide balanced and accurate knowledge to train a reliable student network. Intensive experiments conducted on various long-tailed datasets demonstrate that our KRDistill can effectively train reliable student networks in realistic scenarios of data imbalance.
https://arxiv.org/abs/2409.07694
Recent advancements in predicting pedestrian crossing intentions for Autonomous Vehicles using Computer Vision and Deep Neural Networks are promising. However, the black-box nature of DNNs poses challenges in understanding how the model works and how input features contribute to final predictions. This lack of interpretability limits the trust in model performance and hinders informed decisions on feature selection, representation, and model optimisation, thereby affecting the efficacy of future research in the field. To address this, we introduce Context-aware Permutation Feature Importance (CAPFI), a novel approach tailored for pedestrian intention prediction. CAPFI enables more interpretable and reliable assessments of feature importance by leveraging subdivided scenario contexts, mitigating the randomness of feature values through targeted shuffling. This aims to reduce variance and prevent biased estimations in importance scores during permutations. We divide the Pedestrian Intention Estimation (PIE) dataset into 16 comparable context sets, measure the baseline performance of five distinct neural network architectures for intention prediction in each context, and assess input feature importance using CAPFI. We observed nuanced differences among models across various contextual characteristics. The research reveals the critical role of pedestrian bounding boxes and ego-vehicle speed in predicting pedestrian intentions, and potential prediction biases due to the speed feature through cross-context permutation evaluation. We propose an alternative feature representation by considering proximity change rate for rendering dynamic pedestrian-vehicle locomotion, thereby enhancing the contributions of input features to intention prediction. These findings underscore the importance of contextual features and their diversity to develop accurate and robust intent-predictive models.
https://arxiv.org/abs/2409.07645
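The targeted shuffling described in the abstract above can be approximated by running permutation feature importance separately inside each context, so that a feature's values are only ever swapped between samples from comparable scenarios. This is a generic sketch under assumptions, not the CAPFI implementation: `score_fn` is a hypothetical callable returning a performance score for a fixed trained model, and `contexts` is an integer context label per sample.

```python
import numpy as np

def context_permutation_importance(score_fn, X, y, contexts, feature,
                                   n_repeats=10, seed=0):
    """Mean score drop per context when one feature is shuffled within that context."""
    rng = np.random.default_rng(seed)
    result = {}
    for c in np.unique(contexts).tolist():
        idx = np.flatnonzero(contexts == c)
        Xc, yc = X[idx], y[idx]
        base = score_fn(Xc, yc)           # baseline performance in this context
        drops = []
        for _ in range(n_repeats):
            Xp = Xc.copy()
            Xp[:, feature] = rng.permutation(Xc[:, feature])
            drops.append(base - score_fn(Xp, yc))
        result[c] = float(np.mean(drops))
    return result
```

Comparing the per-context values for one feature, such as ego-vehicle speed, would show whether its importance is stable or depends on the scenario.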
Considering the growing prominence of production-level AI and the threat of adversarial attacks that can evade a model at run-time, evaluating the robustness of models to these evasion attacks is of critical importance. Additionally, testing model changes likely means deploying the models to a device (e.g. a car, a medical imaging device, or a drone) to see how it affects performance, making un-tested changes a public problem that reduces development speed, increases cost of development, and makes it difficult (if not impossible) to parse cause from effect. In this work, we used survival analysis as a cloud-native, time-efficient and precise method for predicting model performance in the presence of adversarial noise. For neural networks in particular, the relationships between the learning rate, batch size, training time, convergence time, and deployment cost are highly complex, so researchers generally rely on benchmark datasets to assess the ability of a model to generalize beyond the training data. To address this, we propose using accelerated failure time models to measure the effect of hardware choice, batch size, number of epochs, and test-set accuracy by using adversarial attacks to induce failures on a reference model architecture before deploying the model to the real world. We evaluate several GPU types and use the Tree-structured Parzen Estimator to maximize model robustness and minimize model run-time simultaneously. This provides a way to evaluate the model and optimise it in a single step, while simultaneously allowing us to model the effect of model parameters on training time, prediction time, and accuracy. Using this technique, we demonstrate that newer, more-powerful hardware does decrease the training time, but with a monetary and power cost that far outpaces the marginal gains in accuracy.
https://arxiv.org/abs/2409.07609
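The accelerated failure time (AFT) idea in the abstract above can be illustrated with a minimal sketch. It assumes a Weibull AFT form, in which covariates rescale the time axis through a log-linear scale term; the abstract does not say which AFT family the authors fit. The function names, covariates, and coefficient values below are hypothetical and chosen only for illustration.

```python
import math


def weibull_scale(covariates, coefficients, intercept):
    """Log-linear scale term of a Weibull AFT model: exp(b0 + sum(b_i * x_i))."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, covariates))
    return math.exp(linear)


def survival_probability(t, covariates, coefficients, intercept, shape):
    """S(t | x) = exp(-(t / scale(x)) ** shape): chance of no failure by time t."""
    scale = weibull_scale(covariates, coefficients, intercept)
    return math.exp(-((t / scale) ** shape))


def median_failure_time(covariates, coefficients, intercept, shape):
    """Time at which the survival probability drops to 0.5."""
    scale = weibull_scale(covariates, coefficients, intercept)
    return scale * math.log(2) ** (1.0 / shape)


# Hypothetical fitted values (NOT taken from the paper).
# Covariates: [log2(batch size), number of epochs].
COEFFICIENTS = [-0.10, 0.05]
INTERCEPT = 1.0
SHAPE = 1.5

if __name__ == "__main__":
    base = median_failure_time([5, 10], COEFFICIENTS, INTERCEPT, SHAPE)
    more_epochs = median_failure_time([5, 11], COEFFICIENTS, INTERCEPT, SHAPE)
    # In an AFT model, one extra unit of a covariate multiplies failure
    # times by exp(coefficient), the "acceleration factor".
    print("median failure time:", base)
    print("acceleration factor for one more epoch:", more_epochs / base)
```

In this form each coefficient has a direct reading: a positive value stretches the time to failure and a negative value shortens it, which is what makes AFT models convenient for comparing the effect of training choices.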
Visual servoing for the development of autonomous robotic systems capable of administering ultrasound (US)-guided regional anesthesia requires real-time segmentation of nerves, needle tip localization, and needle trajectory extrapolation. First, we recruited 227 patients to build a large dataset of 41,000 anesthesiologist-annotated images from US videos of brachial plexus nerves and developed models to localize nerves in the US images. The generalizability of the best-suited model was tested on datasets constructed from separate US scanners. Using these nerve segmentation predictions, we define automated anesthesia needle targets by fitting an ellipse to the nerve contours. Next, we developed an image analysis tool to guide the needle toward its target. For the segmentation of the needle, a neural network pre-trained on natural RGB images was first fine-tuned on a large US dataset for domain transfer and then adapted for the needle using a small dataset. The trajectory angle of the segmented needle is calculated using the Radon transform, and the trajectory is extrapolated from the needle tip. The intersection of the extrapolated trajectory with the needle target guides the needle navigation for drug delivery. The average needle trajectory error was within the acceptable range of 5 mm, as judged by experienced anesthesiologists. The entire dataset has been released publicly for further study by the research community at this https URL.
https://arxiv.org/abs/2308.03717
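The last geometric step in the abstract above, extrapolating the needle trajectory from the tip and intersecting it with the elliptical target, can be sketched as a ray-ellipse intersection. This is only an illustration under stated assumptions: the needle tip, the trajectory angle (which the abstract obtains with the Radon transform), and the fitted ellipse are taken as already-computed inputs, coordinates are in arbitrary image units, and the function name and parameters are hypothetical rather than the authors' code.

```python
import math


def ray_ellipse_intersection(tip, angle, center, semi_axes, rotation=0.0):
    """Return the first point where a ray from `tip` meets an ellipse, or None.

    tip, center: (x, y) tuples; angle and rotation in radians;
    semi_axes: (a, b) half-lengths along the ellipse's own axes.
    If the tip is already inside the ellipse, the exit point is returned.
    """
    dx, dy = math.cos(angle), math.sin(angle)
    # Express the tip and the direction in the ellipse's axis-aligned frame.
    px, py = tip[0] - center[0], tip[1] - center[1]
    c, s = math.cos(rotation), math.sin(rotation)
    qx, qy = c * px + s * py, -s * px + c * py
    ex, ey = c * dx + s * dy, -s * dx + c * dy
    a, b = semi_axes
    # Substitute q + d * e into (x / a)^2 + (y / b)^2 = 1 and solve for d.
    A = (ex / a) ** 2 + (ey / b) ** 2
    B = 2.0 * (qx * ex / a ** 2 + qy * ey / b ** 2)
    C = (qx / a) ** 2 + (qy / b) ** 2 - 1.0
    disc = B * B - 4.0 * A * C
    if disc < 0.0:
        return None  # the extrapolated trajectory misses the target
    root = math.sqrt(disc)
    for dist in sorted(((-B - root) / (2.0 * A), (-B + root) / (2.0 * A))):
        if dist >= 0.0:  # only look ahead of the needle tip
            return (tip[0] + dist * dx, tip[1] + dist * dy)
    return None


if __name__ == "__main__":
    # Hypothetical values: tip at the origin, target ellipse 10 units ahead.
    print(ray_ellipse_intersection((0.0, 0.0), 0.0, (10.0, 0.0), (2.0, 1.0)))
```

A `None` result means the current needle angle would miss the target, which is the situation in which a guidance tool would prompt a correction.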
There have been growing concerns around high-stakes applications that rely on models trained with biased data, which consequently produce biased predictions, often harming the most vulnerable. In particular, biased medical data could cause health-related applications and recommender systems to create outputs that jeopardize patient care and widen disparities in health outcomes. A recent framework titled Fairness via AI posits that, instead of attempting to correct model biases, researchers must focus on their root causes by using AI to debias data. Inspired by this framework, we tackle bias detection in medical curricula using NLP models, including LLMs, and evaluate them on a gold standard dataset containing 4,105 excerpts, drawn from a large corpus and annotated by medical experts for bias. We build on previous work by coauthors that augments the set of negative samples with non-annotated text containing social identifier terms. However, some of these terms, especially those related to race and ethnicity, can carry different meanings (e.g., "white matter of spinal cord"). To address this issue, we propose the use of Word Sense Disambiguation models to refine dataset quality by removing irrelevant sentences. We then evaluate fine-tuned variations of BERT models as well as GPT models with zero- and few-shot prompting. We found LLMs, considered SOTA on many NLP tasks, unsuitable for bias detection, while fine-tuned BERT models generally perform well across all evaluated metrics.
https://arxiv.org/abs/2409.07424