Self-distillation methods using Siamese networks are popular for self-supervised pre-training. DINO is one such method, based on a cross-entropy loss between $K$-dimensional probability vectors obtained by applying a softmax function to the dot product between representations and learnt prototypes. Since the learned representations are $L^2$-normalized, we show that DINO and its derivatives, such as iBOT, can be interpreted as a mixture model of von Mises-Fisher components. With this interpretation, DINO assumes equal precision for all components when the prototypes are also $L^2$-normalized. Using this insight, we propose DINO-vMF, which adds appropriate normalization constants when computing the cluster assignment probabilities. Unlike DINO, DINO-vMF remains stable even for the larger ViT-Base model with unnormalized prototypes. We show that the added flexibility of the mixture model is beneficial in terms of better image representations. The DINO-vMF pre-trained model consistently performs better than DINO on a range of downstream tasks. We obtain similar improvements for iBOT-vMF vs. iBOT, thereby showing the relevance of our proposed modification also for other methods derived from DINO.
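To make the vMF interpretation concrete, here is a minimal sketch (not the authors' implementation) of cluster-assignment probabilities with a vMF log-normalizer correction. The precision choice $\kappa_k = \lVert w_k \rVert$ and the absence of a temperature are illustrative assumptions; `scipy.special.ive` gives a numerically stable Bessel evaluation.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function


def log_vmf_const(kappa, d):
    """log C_d(kappa) for a d-dimensional von Mises-Fisher distribution.
    Uses ive (= iv(v, x) * exp(-x)) to avoid overflow for large kappa."""
    v = d / 2 - 1
    return v * np.log(kappa) - (d / 2) * np.log(2 * np.pi) \
        - (np.log(ive(v, kappa)) + kappa)


def vmf_assignment_probs(z, W):
    """Assignment probabilities for an L2-normalised feature z under a vMF
    mixture whose k-th component has mean direction W[k]/||W[k]|| and
    (illustrative) precision kappa_k = ||W[k]||."""
    kappa = np.linalg.norm(W, axis=1)
    logits = W @ z + log_vmf_const(kappa, z.shape[0])
    logits -= logits.max()  # stabilise the softmax
    p = np.exp(logits)
    return p / p.sum()
```

Note that when all prototypes share the same norm, the vMF correction is a shared constant and the plain DINO softmax is recovered, mirroring the equal-precision observation above.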
https://arxiv.org/abs/2405.10939
Understanding how language model performance varies with scale is critical to benchmark and algorithm development. Scaling laws are one approach to building this understanding, but the requirement of training models across many different scales has limited their use. We propose an alternative, observational approach that bypasses model training and instead builds scaling laws from ~80 publicly available models. Building a single scaling law from multiple model families is challenging due to large variations in their training compute efficiencies and capabilities. However, we show that these variations are consistent with a simple, generalized scaling law where language model performance is a function of a low-dimensional capability space, and model families only vary in their efficiency in converting training compute to capabilities. Using this approach, we show the surprising predictability of complex scaling phenomena: we show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models; we show that the agent performance of models such as GPT-4 can be precisely predicted from simpler non-agentic benchmarks; and we show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
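As a toy illustration of predicting sigmoidal phenomena from smaller models (a sketch under simplifying assumptions, not the paper's estimator), one can fit accuracy = sigmoid(a * capability + b) by ordinary least squares in logit space and extrapolate; the scalar `capability` stands in for the paper's low-dimensional capability measure.

```python
import math


def fit_sigmoid(capability, accuracy):
    """Fit accuracy ~ sigmoid(a * capability + b) by least squares in logit
    space (valid when accuracies lie strictly in (0, 1))."""
    ys = [math.log(y / (1 - y)) for y in accuracy]
    n = len(capability)
    mx = sum(capability) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(capability, ys)) \
        / sum((x - mx) ** 2 for x in capability)
    b = my - a * mx
    return a, b


def predict(a, b, x):
    """Predicted accuracy at capability x."""
    return 1 / (1 + math.exp(-(a * x + b)))
```

Fitting on low-capability points alone recovers the full curve exactly when the data are noise-free, which is the sense in which a smooth sigmoid makes emergence predictable from small models.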
https://arxiv.org/abs/2405.10938
Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators (X) with observed serum creatinine labs (Z2) and time-to-acute kidney injury as the outcome. We simulated 100 cohorts with a null treatment effect, including X, Z2, atrial fibrillation (U), and 13 other investigator-derived confounders (Z1) in the outcome generation. We then imposed missingness (MZ2) on 50% of Z2 measurements as a function of Z2 and U, and created different HDMI candidate AC using structured and NLP-derived features. We mimicked scenarios where U was unobserved by omitting it from all AC candidate sets. Using LASSO, we data-adaptively selected HDMI covariates associated with Z2 and MZ2 for MI, and with U for inclusion in the propensity score models. The treatment effect was estimated following propensity score matching in MI datasets, and we benchmarked HDMI approaches against a baseline imputation and a complete-case analysis with Z1 only. HDMI using claims data showed the lowest bias (0.072). Combining claims and sentence embeddings improved efficiency, displaying the lowest root-mean-squared error (0.173) and a coverage of 94%. NLP-derived AC alone did not perform better than baseline MI. HDMI approaches may decrease bias in studies with partially observed confounders where missingness depends on unobserved factors.
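The data-adaptive covariate selection step can be sketched with a small LASSO solver (cyclic coordinate descent with soft-thresholding). This is a generic stand-in, not the study's pipeline, and the penalty `lam` would in practice be chosen by cross-validation.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=200):
    """LASSO via cyclic coordinate descent with soft-thresholding.
    Minimises (1/2n) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]   # partial residual excluding j
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b
```

Covariates with nonzero coefficients would then be retained as the selected auxiliary set; truly irrelevant columns are shrunk exactly to zero.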
https://arxiv.org/abs/2405.10925
In recent years, various large foundation models have been proposed for image segmentation. These models are often trained on large amounts of data corresponding to general computer vision tasks. Hence, these models do not perform well on medical data. There have been some attempts in the literature to perform parameter-efficient finetuning of such foundation models for medical image segmentation. However, these approaches assume that all the parameters of the model are available for adaptation. But, in many cases, these models are released as APIs or blackboxes, with no or limited access to the model parameters and data. In addition, finetuning methods also require a significant amount of compute, which may not be available for the downstream task. At the same time, medical data cannot be shared with third-party agents for finetuning due to privacy reasons. To tackle these challenges, we pioneer a blackbox adaptation technique for prompted medical image segmentation, called BAPS. BAPS has two components - (i) an Image-Prompt decoder (IP decoder) module that generates visual prompts given an image and a prompt, and (ii) a Zero Order Optimization (ZOO) method, called SPSA-GC, that is used to update the IP decoder without the need for backpropagating through the foundation model. Thus, our method does not require any knowledge of the foundation model's weights or gradients. We test BAPS on four different modalities and show that our method can improve the original model's performance by around 4%.
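The zeroth-order idea behind SPSA can be sketched in a few lines: perturb all parameters simultaneously with random signs and form a gradient estimate from just two loss evaluations, so no backpropagation through the black box is needed. This is plain SPSA under assumed fixed gains; the gain schedule and gradient correction of SPSA-GC are omitted.

```python
import random


def spsa_step(params, loss_fn, a=0.1, c=0.1):
    """One SPSA update: estimate the directional gradient from two loss
    evaluations under a random simultaneous sign perturbation, then take
    a gradient step. loss_fn is treated as a black box."""
    delta = [random.choice((-1.0, 1.0)) for _ in params]
    plus = [p + c * d for p, d in zip(params, delta)]
    minus = [p - c * d for p, d in zip(params, delta)]
    g_hat = (loss_fn(plus) - loss_fn(minus)) / (2 * c)
    # for +-1 perturbations, 1/delta_i == delta_i
    return [p - a * g_hat * d for p, d in zip(params, delta)]
```

Two function calls per step regardless of dimension is what makes this attractive for API-only models.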
https://arxiv.org/abs/2405.10913
Large Language Models (LLMs) constitute a breakthrough, state-of-the-art Artificial Intelligence (AI) technology that is rapidly evolving and promises to aid medical diagnosis, either by assisting doctors or by simulating a doctor's workflow in more advanced and complex implementations. In this technical paper, we outline the Cognitive Network Evaluation Toolkit for Medical Domains (COGNET-MD), which constitutes a novel benchmark for LLM evaluation in the medical domain. Specifically, we propose a scoring framework of increased difficulty to assess the ability of LLMs to interpret medical text. The proposed framework is accompanied by a database of Multiple Choice Quizzes (MCQs). To ensure alignment with current medical trends and to enhance safety, usefulness, and applicability, these MCQs have been constructed in collaboration with several associated medical experts in various medical domains and are characterized by varying degrees of difficulty. The current (first) version of the database includes the medical domains of Psychiatry, Dentistry, Pulmonology, Dermatology and Endocrinology, but it will be continuously extended and expanded to include additional medical domains.
https://arxiv.org/abs/2405.10893
The exponential growth of astronomical datasets provides an unprecedented opportunity for humans to gain insight into the Universe. However, effectively analyzing this vast amount of data poses a significant challenge. Astronomers are turning to deep learning techniques to address this, but the methods are limited by their specific training sets, leading to considerable duplicate workloads as well. Hence, as an example of how to overcome this issue, we built a framework for the general analysis of galaxy images, based on a large vision model (LVM) plus downstream tasks (DST), including galaxy morphological classification, image restoration, object detection, parameter extraction, and more. Considering the low signal-to-noise ratio of galaxy images and the imbalanced distribution of galaxy categories, we have incorporated a Human-in-the-loop (HITL) module into our large vision model, which leverages human knowledge to enhance the reliability and interpretability of processing galaxy images interactively. The proposed framework exhibits notable few-shot learning capabilities and versatile adaptability to all the abovementioned tasks on galaxy images in the DESI legacy imaging surveys. Specifically, for object detection, trained with 1000 data points, our DST on top of the LVM achieves an accuracy of 96.7%, while ResNet50 plus Mask R-CNN gives an accuracy of 93.1%; for morphology classification, to obtain an AUC of ~0.9, LVM plus DST and HITL requires only 1/50 of the training set needed by ResNet18. As expected, multimodal data can be integrated similarly, which opens up possibilities for conducting joint analyses with datasets spanning diverse domains in the era of multi-messenger astronomy.
https://arxiv.org/abs/2405.10890
Most existing methods rely on complex models to predict scene depth with high accuracy, resulting in slow inference that is not conducive to deployment. To better balance precision and speed, we first designed SmallDepth based on sparsity. Second, to enhance the feature representation ability of SmallDepth during training while keeping complexity equal during inference, we propose an equivalent transformation module (ETM). Third, to improve the ability of each layer, for a fixed SmallDepth, to perceive different context information, and to improve the robustness of SmallDepth to left-right direction and illumination changes, we propose a pyramid loss. Fourth, to further improve the accuracy of SmallDepth, we utilize the proposed function approximation loss (APX) to transfer knowledge to SmallDepth from the pretrained HQDecv2, obtained by optimizing the previous HQDec to address grid artifacts in some regions. Extensive experiments demonstrate that each proposed component improves the precision of SmallDepth without changing its complexity during inference, and the developed approach achieves state-of-the-art results on KITTI at an inference speed of more than 500 frames per second and with approximately 2 M parameters. The code and models will be publicly available at this https URL.
https://arxiv.org/abs/2405.10885
The goal of image registration is to establish spatial correspondence between two or more images, traditionally through dense displacement fields (DDFs) or parametric transformations (e.g., rigid, affine, and splines). Rethinking the existing paradigms of achieving alignment via spatial transformations, we uncover an alternative but more intuitive correspondence representation: a set of corresponding regions-of-interest (ROI) pairs, which we demonstrate to have representational capability comparable to other correspondence representations. Further, it is neither necessary nor sufficient for these ROIs to hold specific anatomical or semantic significance. In turn, we formulate image registration as searching for the same set of corresponding ROIs from both moving and fixed images - in other words, two multi-class segmentation tasks on a pair of images. For a general-purpose and practical implementation, we integrate the segment anything model (SAM) into our proposed algorithms, resulting in a SAM-enabled registration (SAMReg) that does not require any training data, gradient-based fine-tuning or engineered prompts. We experimentally show that the proposed SAMReg is capable of segmenting and matching multiple ROI pairs, which establish sufficiently accurate correspondences, in three clinical applications: registering prostate MR, cardiac MR and abdominal CT images. Based on metrics including Dice and target registration errors on anatomical structures, the proposed registration outperforms both intensity-based iterative algorithms and DDF-predicting learning-based networks, even yielding competitive performance with weakly-supervised registration, which requires fully-segmented training data.
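As a sketch of how matched ROI pairs can be scored, here is a plain Dice coefficient and its mean over corresponding pairs; this illustrates the evaluation metric named in the abstract, not SAMReg itself, and masks are assumed to be flattened 0/1 sequences of equal length.

```python
def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks,
    given as equal-length flattened 0/1 sequences."""
    inter = sum(a * b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    return 2 * inter / total if total else 1.0


def mean_pair_dice(rois_moving, rois_fixed):
    """Average Dice over a set of corresponding ROI pairs, one simple way
    to score an ROI-pair correspondence representation."""
    scores = [dice(a, b) for a, b in zip(rois_moving, rois_fixed)]
    return sum(scores) / len(scores)
```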
https://arxiv.org/abs/2405.10879
While the Global Navigation Satellite System (GNSS) is often used to provide global positioning if available, its intermittency and/or inaccuracy calls for fusion with other sensors. In this paper, we develop a novel GNSS-Visual-Inertial Navigation System (GVINS) that fuses visual, inertial, and raw GNSS measurements within the square-root inverse sliding window filtering (SRI-SWF) framework in a tightly coupled fashion, and is thus termed SRI-GVINS. In particular, for the first time, we deeply fuse the GNSS pseudorange, Doppler shift, single-differenced pseudorange, and double-differenced carrier phase measurements, along with the visual-inertial measurements. Inherited from the SRI-SWF, the proposed SRI-GVINS gains significant numerical stability and computational efficiency over the state-of-the-art methods. Additionally, we propose to use a filter to sequentially initialize the reference frame transformation until it converges, rather than collecting measurements for batch optimization. We also perform online calibration of GNSS-IMU extrinsic parameters to mitigate possible extrinsic parameter degradation. The proposed SRI-GVINS is extensively evaluated on our own collected UAV datasets, and the results demonstrate that the proposed method is able to suppress VIO drift in real-time; they also show the effectiveness of online GNSS-IMU extrinsic calibration. The experimental validation on public datasets further reveals that the proposed SRI-GVINS outperforms the state-of-the-art methods in terms of both accuracy and efficiency.
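The single- and double-differenced GNSS combinations fused above have a simple algebraic core: differencing across receivers cancels the satellite clock bias, and differencing again across satellites cancels the receiver clock biases. A minimal sketch, with observations as plain dicts and ambiguities, atmosphere, and units ignored:

```python
def single_difference(obs_rover, obs_base, sat):
    """Between-receiver difference for one satellite: removes the
    satellite clock bias common to both receivers."""
    return obs_rover[sat] - obs_base[sat]


def double_difference(obs_rover, obs_base, sat_i, sat_j):
    """Between-satellite difference of single differences: additionally
    removes the receiver clock biases."""
    return (single_difference(obs_rover, obs_base, sat_i)
            - single_difference(obs_rover, obs_base, sat_j))
```

After double differencing, only geometric terms (and noise) remain, which is why these combinations are attractive for tight fusion.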
https://arxiv.org/abs/2405.10874
Glioblastoma is the most common primary adult brain tumor, with a grim prognosis - median survival of 12-18 months following treatment, and 4 months otherwise. Glioblastoma is widely infiltrative in the cerebral hemispheres and characterized by heterogeneous molecular and micro-environmental histopathologic profiles, which pose a major obstacle to treatment. Correctly diagnosing these tumors and assessing their heterogeneity is crucial for choosing the precise treatment and potentially enhancing patient survival rates. In the gold-standard histopathology-based approach to tumor diagnosis, detecting various morpho-pathological features of distinct histology throughout digitized tissue sections is crucial. Such "features" include the presence of cellular tumor, geographic necrosis, pseudopalisading necrosis, areas abundant in microvascular proliferation, infiltration into the cortex, wide extension in subcortical white matter, leptomeningeal infiltration, regions dense with macrophages, and the presence of perivascular or scattered lymphocytes. With these features in mind, and building upon the main aim of the BraTS Cluster of Challenges (this https URL), the goal of the BraTS-Path challenge is to provide a systematically prepared comprehensive dataset and a benchmarking environment to develop and fairly compare deep-learning models capable of identifying tumor sub-regions of distinct histologic profile. These models aim to further our understanding of the disease and assist in the diagnosis and grading of conditions in a consistent manner.
https://arxiv.org/abs/2405.10871
This paper presents a novel approach to the digital signing of electronic documents through the use of a camera-based interaction system, single-finger tracking for sign recognition, and hand gestures that execute multiple commands. The proposed solution, referred to as "Air Signature," involves writing the signature in front of the camera, rather than relying on traditional methods such as mouse drawing or physically signing on paper and showing it to a web camera. The goal is to develop a state-of-the-art method for detecting and tracking gestures and objects in real-time. The proposed methods include applying existing gesture recognition and object tracking systems, improving accuracy through smoothing and line drawing, and maintaining continuity during fast finger movements. An evaluation of the fingertip detection, sketching, and overall signing process is performed to assess the effectiveness of the proposed solution. The secondary objective of this research is to develop a model that can effectively recognize the unique signature of a user. This type of signature can be verified by neural cores that analyze the movement, speed, and stroke pixels of the signing in real time. The neural cores use machine learning algorithms to match air signatures to the individual's stored signatures, providing a secure and efficient method of verification. Our proposed system does not require sensors or any hardware other than the camera.
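The smoothing step mentioned above can be as simple as an exponential moving average over tracked fingertip positions. This sketch assumes 2-D pixel coordinates and a hand-picked `alpha`; a real system might prefer a One Euro or Kalman filter to better handle fast strokes.

```python
def smooth_track(points, alpha=0.35):
    """Exponential moving average over a fingertip track. Higher alpha
    follows fast strokes more closely; lower alpha suppresses jitter more."""
    if not points:
        return []
    sx, sy = points[0]
    out = [(sx, sy)]
    for x, y in points[1:]:
        sx = alpha * x + (1 - alpha) * sx
        sy = alpha * y + (1 - alpha) * sy
        out.append((sx, sy))
    return out
```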
https://arxiv.org/abs/2405.10868
Understanding the process of emotion generation is crucial for analyzing the causes behind emotions. Causal Emotion Entailment (CEE), an emotion-understanding task, aims to identify the causal utterances in a conversation that stimulate the emotions expressed in a target utterance. However, current works in CEE mainly focus on modeling semantic and emotional interactions in conversations, neglecting the exploration of the emotion-generation process. This hinders the models from deeply understanding emotions, restricting their ability to produce explainable predictions. In this work, inspired by the emotion generation process of "stimulus-appraisal-emotion" in the cognitive appraisal theory, we introduce a step-by-step reasoning method, Emotion-Cause Reasoning Chain (ECR-Chain), to infer the stimulus from the target emotional expressions in conversations. Specifically, we first introduce the ECR-Chain to ChatGPT via few-shot prompting, which significantly improves its performance on the CEE task. We further propose an automated construction process to utilize ChatGPT in building an ECR-Chain set, which can enhance the reasoning abilities of smaller models through supervised training and assist the Vicuna-7B model in achieving state-of-the-art CEE performance. Moreover, our methods can enable these generative language models to effectively perform emotion-cause reasoning in an explainable manner. Our code, data and more details are at this https URL.
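A hypothetical sketch of few-shot ECR-Chain prompting; the stage names (Theme, Reaction, Appraisal, Stimulus) and the dict layout are illustrative assumptions, not the paper's exact template.

```python
def build_ecr_prompt(examples, conversation, target_utterance):
    """Assemble a few-shot prompt that walks through assumed ECR-Chain
    stages (theme -> reaction -> appraisal -> stimulus) for each example,
    then poses the query conversation."""
    steps = ["Theme", "Reaction", "Appraisal", "Stimulus"]
    parts = []
    for ex in examples:
        parts.append("Conversation:\n" + ex["conversation"])
        for s in steps:
            parts.append(f"{s}: {ex[s.lower()]}")
    parts.append("Conversation:\n" + conversation)
    parts.append(f"Target utterance: {target_utterance}")
    parts.append("Reason step by step (Theme, Reaction, Appraisal, Stimulus):")
    return "\n".join(parts)
```

The same template could serve both for prompting a large model and for generating supervised reasoning chains for a smaller one, as the abstract describes.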
https://arxiv.org/abs/2405.10860
This paper presents a novel approach to automated drifting with a standard passenger vehicle, which involves a Nonlinear Model Predictive Control to stabilise and maintain the vehicle at high sideslip angle conditions. The proposed controller architecture is split into three components. The first part consists of the offline computed equilibrium maps, which provide the equilibrium points for each vehicle state given the desired sideslip angle and radius of the path. The second is the predictive controller minimising the errors between the equilibrium and actual vehicle states. The third is a path-following controller, which reduces the path error, altering the equilibrium curvature path. In a high-fidelity simulation environment, we validate the controller architecture capacity to stabilise the vehicle in automated drifting along a desired path, with a maximal lateral path deviation of 1 m. In the experiments with a standard passenger vehicle, we demonstrate that the proposed approach is capable of bringing and maintaining the vehicle at the desired 30 deg sideslip angle in both high and low friction conditions.
https://arxiv.org/abs/2405.10859
The Shapley value (SV) is a prevalent approach for allocating credit to machine learning (ML) entities to understand black-box ML models. Enriching such interpretations with higher-order interactions is inevitable for complex systems, where the Shapley Interaction Index (SII) is a direct axiomatic extension of the SV. While it is well-known that the SV yields an optimal approximation of any game via a weighted least square (WLS) objective, an extension of this result to the SII has been a long-standing open problem, which even led to the proposal of an alternative index. In this work, we characterize the higher-order SII as the solution to a WLS problem, which constructs an optimal approximation via SII and $k$-Shapley values ($k$-SII). We prove this representation for the SV and the pairwise SII, and give empirically validated conjectures for higher orders. As a result, we propose KernelSHAP-IQ, a direct extension of KernelSHAP to the SII, and demonstrate state-of-the-art performance for feature interactions.
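For context, the SV that the WLS objective recovers can be computed exactly for small $n$ by brute force over coalitions; the approximation question above only matters because this enumeration is exponential in the number of players.

```python
from itertools import combinations
from math import factorial


def shapley_values(n, value):
    """Exact Shapley values of an n-player cooperative game given by
    value(frozenset) -> float. Feasible only for small n (2^n subsets)."""
    players = range(n)
    phi = [0.0] * n
    for i in players:
        others = [p for p in players if p != i]
        for k in range(len(others) + 1):
            # weight of a coalition of size k not containing i
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            for S in combinations(others, k):
                S = frozenset(S)
                phi[i] += w * (value(S | {i}) - value(S))
    return phi
```

For an additive game the Shapley value of each player is exactly its own contribution, a standard sanity check.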
https://arxiv.org/abs/2405.10852
This paper presents an original approach to vehicle obstacle avoidance. It involves the development of a nonlinear Model Predictive Contouring Control, which uses torque vectoring to stabilise and drive the vehicle in evasive manoeuvres at the limit of handling. The proposed algorithm combines motion planning, path tracking and vehicle stability objectives, prioritising collision avoidance in emergencies. The controller's prediction model is a nonlinear double-track vehicle model based on an extended Fiala tyre model to capture the nonlinear coupled longitudinal and lateral dynamics. The controller computes the optimal steering angle and the longitudinal forces for each of the four wheels to minimise tracking error in safe situations and maximise the vehicle-to-obstacle distance in emergencies. Thanks to the optimisation of the longitudinal tyre forces, the proposed controller can produce an extra yaw moment, increasing the vehicle's lateral agility to avoid obstacles while keeping the vehicle stable. The optimal forces are constrained to the tyre friction circle so as not to exceed the tyres' and vehicle's capabilities. In a high-fidelity simulation environment, we demonstrate the benefits of torque vectoring, showing that our proposed approach is capable of successfully avoiding obstacles and keeping the vehicle stable while driving a double-lane-change manoeuvre, in comparison to baselines lacking torque vectoring or collision-avoidance prioritisation.
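The tyre friction-circle constraint amounts to keeping the commanded force vector inside a disc of radius mu * Fz. In the NMPC it enters as an optimisation constraint; a post-hoc projection sketch conveys the geometry:

```python
from math import hypot


def clip_to_friction_circle(fx, fy, mu, fz):
    """Scale a commanded tyre force so it stays inside the friction
    circle sqrt(Fx^2 + Fy^2) <= mu * Fz, preserving its direction.
    fx, fy: longitudinal/lateral force [N]; fz: vertical load [N]."""
    limit = mu * fz
    mag = hypot(fx, fy)
    if mag <= limit:
        return fx, fy
    s = limit / mag
    return fx * s, fy * s
```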
https://arxiv.org/abs/2405.10847
Background and purpose: Deep Learning (DL) has been widely explored for Organs at Risk (OARs) segmentation; however, most studies have focused on a single modality, either CT or MRI, not both simultaneously. This study presents a high-performing DL pipeline for segmentation of 30 OARs from MRI and CT scans of Head and Neck (H&N) cancer patients. Materials and methods: Paired CT and MRI-T1 images from 42 H&N cancer patients, alongside annotations for 30 OARs from the H&N OAR CT & MR segmentation challenge dataset, were used to develop a segmentation pipeline. After cropping irrelevant regions, rigid followed by non-rigid registration of CT and MRI volumes was performed. Two versions of the CT volume, representing soft tissues and bone anatomy, were stacked with the MRI volume and used as input to an nnU-Net pipeline. Modality Dropout was used during training to force the model to learn from the different modalities. Segmentation masks were predicted with the trained model for an independent set of 14 new patients. The mean Dice Score (DS) and Hausdorff Distance (HD) were calculated for each OAR across these patients to evaluate the pipeline. Results: This resulted in an overall mean DS and HD of 0.777 ± 0.118 and 3.455 ± 1.679, respectively, establishing the state-of-the-art (SOTA) for this challenge at the time of submission. Conclusion: The proposed pipeline achieved the best DS and HD among all participants of the H&N OAR CT and MR segmentation challenge and sets a new SOTA for automated segmentation of H&N OARs.
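Modality Dropout as used above can be sketched as randomly zeroing whole input modalities during training while guaranteeing at least one survives; the dict-of-lists layout and the "keep at least one" rule are illustrative assumptions, not the pipeline's exact implementation.

```python
import random


def modality_dropout(channels, p_drop=0.3, rng=random):
    """Randomly zero entire input modalities (e.g. CT-soft, CT-bone, MRI)
    during training, but never all of them, so the network learns to
    cope with missing modalities."""
    names = list(channels)
    keep = {m: rng.random() >= p_drop for m in names}
    if not any(keep.values()):              # keep at least one modality
        keep[rng.choice(names)] = True
    return {m: (ch if keep[m] else [0.0] * len(ch))
            for m, ch in channels.items()}
```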
https://arxiv.org/abs/2405.10833
Spatio-temporal action detection (STAD) is an important fine-grained video understanding task. Current methods require box and label supervision for all action classes in advance. However, in real-world applications, it is very likely to come across new action classes not seen in training because the action category space is large and hard to enumerate. Also, the cost of data annotation and model training for new classes is extremely high for traditional methods, as we need to perform detailed box annotations and re-train the whole network from scratch. In this paper, we propose a new challenging setting by performing open-vocabulary STAD to better mimic the situation of action detection in an open world. Open-vocabulary spatio-temporal action detection (OV-STAD) requires training a model on a limited set of base classes with box and label supervision, which is expected to yield good generalization performance on novel action classes. For OV-STAD, we build two benchmarks based on the existing STAD datasets and propose a simple but effective method based on pretrained video-language models (VLM). To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs. This customized fine-tuning endows the VLM with better motion understanding, thus contributing to a more accurate alignment between video regions and texts. Local region feature and global video feature fusion before alignment is adopted to further improve the action detection performance by providing global context. Our method achieves a promising performance on novel classes.
https://arxiv.org/abs/2405.10832
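As a rough sketch of the open-vocabulary classification step described above: regions are scored against text embeddings of arbitrary class names via cosine similarity, so novel classes only require new prompts. Function names and the temperature value here are illustrative, not taken from the paper.

```python
import numpy as np

def classify_regions(region_feats, text_embs, temperature=0.07):
    """Assign each video region a class by cosine similarity to text embeddings.

    region_feats: (R, D) features of R localized video regions.
    text_embs:    (C, D) embeddings of C class-name prompts (can include
                  novel classes never seen with box supervision).
    Returns (R,) predicted class indices and (R, C) softmax scores.
    """
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (r @ t.T) / temperature          # temperature-scaled cosine similarity
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs
```

The paper's local-global fusion would correspond to combining each row of `region_feats` with a clip-level feature before the similarity computation.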
Thanks to the recent explosive development of data-driven learning methodologies, reinforcement learning (RL) has emerged as a promising solution to the legged locomotion problem in robotics. In this manuscript, we propose a novel concurrent teacher-student reinforcement learning architecture for legged locomotion over challenging terrains, based only on proprioceptive measurements in real-world deployment. Different from the conventional teacher-student architecture, which trains the teacher policy via RL and transfers its knowledge to the student policy through supervised learning, our proposed architecture trains the teacher and student policy networks concurrently under the reinforcement learning paradigm. To achieve this, we develop a new training scheme based on the proximal policy optimization (PPO) method to accommodate the interaction between the teacher and student policy networks. The effectiveness of the proposed architecture and the new training scheme is demonstrated through extensive indoor and outdoor experiments on quadrupedal robots and a point-foot bipedal robot, showcasing robust locomotion over challenging terrains and improved performance compared to two-stage training methods.
https://arxiv.org/abs/2405.10830
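A toy sketch of what a concurrent teacher-student objective could look like: both policies contribute a clipped PPO surrogate over a shared rollout, plus a term pulling the student toward the teacher. The exact loss composition, the `beta` weight, and what the imitation term compares (actions vs. latents) are assumptions for illustration, not the paper's scheme.

```python
import numpy as np

def concurrent_ts_loss(ratio_t, ratio_s, adv, act_t, act_s, clip=0.2, beta=1.0):
    """Toy combined objective for concurrently trained teacher/student policies.

    ratio_t, ratio_s: (N,) PPO probability ratios pi_new/pi_old per policy.
    adv:              (N,) advantage estimates from a shared rollout.
    act_t, act_s:     (N, A) teacher/student outputs for the imitation term.
    Returns a scalar: clipped PPO surrogate for both policies plus an
    MSE imitation term pulling the student toward the teacher.
    """
    def ppo_surrogate(ratio):
        unclipped = ratio * adv
        clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * adv
        return -np.minimum(unclipped, clipped).mean()  # maximize => negate

    imitation = ((act_s - act_t) ** 2).mean()  # supervised pull toward teacher
    return ppo_surrogate(ratio_t) + ppo_surrogate(ratio_s) + beta * imitation
```

Compared to a two-stage pipeline, both terms are optimized in the same update loop, which is the key architectural difference the abstract highlights.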
Convolutional neural networks (CNNs) are among the most widely used machine learning models for computer vision tasks such as image classification. To improve the efficiency of CNNs, many CNN compression approaches have been developed. Low-rank methods approximate the original convolutional kernel with a sequence of smaller convolutional kernels, reducing storage and time complexity. In this study, we propose a novel low-rank CNN compression method based on reduced storage direct tensor ring decomposition (RSDTR). The proposed method offers higher circular mode permutation flexibility and is characterized by large parameter and FLOPS compression rates, while preserving good classification accuracy in the compressed network. Experiments performed on the CIFAR-10 and ImageNet datasets clearly demonstrate the efficiency of RSDTR compared to other state-of-the-art CNN compression approaches.
https://arxiv.org/abs/2405.10802
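To make the storage argument concrete: a tensor ring with equal ranks represents each mode `d_n` of a kernel by a core of shape `(r, d_n, r)`, so parameters scale with the sum of mode sizes rather than their product. The rank value and the flat mode grouping below are illustrative; RSDTR's actual contribution involves choosing circular mode permutations, which this sketch does not model.

```python
import numpy as np

def tr_param_count(dims, rank):
    """Parameter count of an equal-rank tensor-ring representation.

    Each mode d_n gets a core of shape (rank, d_n, rank), so the total is
    rank**2 * sum(dims), versus prod(dims) for the dense tensor.
    """
    return rank * rank * sum(dims)

# A 256 -> 256 channel, 3x3 convolution kernel as a 4-way tensor.
dense = int(np.prod([256, 256, 3, 3]))
tr = tr_param_count([256, 256, 3, 3], rank=10)
compression = dense / tr  # roughly an 11x parameter reduction at rank 10
```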
Place recognition is a fundamental task for robotic applications, allowing robots to perform loop closure detection within simultaneous localization and mapping (SLAM) and to relocalize on prior maps. Current range image-based networks use single-column convolution to keep features invariant to shifts in image columns caused by LiDAR viewpoint changes. However, this raises issues such as restricted receptive fields and excessive focus on local regions, degrading network performance. To address these issues, we propose a lightweight circular convolutional Transformer network, denoted CCTNet, which boosts performance by capturing structural information in point clouds and facilitating cross-dimensional interaction of spatial and channel information. First, a Circular Convolution Module (CCM) is introduced, expanding the network's receptive field while maintaining feature consistency across varying LiDAR perspectives. Then, a Range Transformer Module (RTM) is proposed, which enhances place recognition accuracy in scenarios with movable objects by combining channel and spatial attention mechanisms. Furthermore, we propose an overlap-based loss function that transforms place recognition from a binary loop-closure classification into a regression problem linked to the overlap between LiDAR frames. Through extensive experiments on the KITTI and Ford Campus datasets, CCTNet surpasses comparable methods, achieving Recall@1 of 0.924 and 0.965, and Recall@1% of 0.990 and 0.993 on the test sets, showcasing superior performance. Results on a self-collected dataset further demonstrate the proposed method's potential for practical deployment in complex scenarios with movable objects, showing improved generalization across datasets.
https://arxiv.org/abs/2405.10793
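The circular convolution idea can be sketched in a few lines: because range-image columns correspond to azimuth angles, padding them with wrap-around makes the filter response equivariant to a yaw rotation of the sensor (the output of a shifted input is the shifted output). This 1-D single-row version is a simplification of the paper's CCM, not its implementation.

```python
import numpy as np

def circular_conv1d_cols(img, kernel):
    """Filter each row of a range image along the column axis with wrap-around.

    img:    (H, W) range image; columns correspond to LiDAR azimuth angles.
    kernel: (k,) 1-D kernel with k odd. Wrapping the columns makes the result
    consistent under a horizontal (yaw) shift of the viewpoint, while still
    widening the receptive field beyond a single column.
    """
    k = len(kernel)
    pad = k // 2
    padded = np.pad(img, ((0, 0), (pad, pad)), mode="wrap")  # circular padding
    out = np.zeros_like(img, dtype=float)
    for i in range(k):
        out += kernel[i] * padded[:, i:i + img.shape[1]]
    return out
```

In a deep-learning framework the same effect is typically obtained with circular padding on a standard convolution layer.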