Visual In-Context Learning (ICL) has emerged as a promising research area due to its capability to accomplish various tasks with limited example pairs through analogical reasoning. However, training-based visual ICL has limitations in its ability to generalize to unseen tasks and requires the collection of a diverse task dataset. On the other hand, existing methods in the inference-based visual ICL category solely rely on textual prompts, which fail to capture fine-grained contextual information from given examples and can be time-consuming when converting from images to text prompts. To address these challenges, we propose Analogist, a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques using a text-to-image diffusion model pretrained for image inpainting. For visual prompting, we propose a self-attention cloning (SAC) method to guide the fine-grained structural-level analogy between image examples. For textual prompting, we leverage GPT-4V's visual reasoning capability to efficiently generate text prompts and introduce a cross-attention masking (CAM) operation to enhance the accuracy of semantic-level analogy guided by text prompts. Our method is out-of-the-box and does not require fine-tuning or optimization. It is also generic and flexible, enabling a wide range of visual tasks to be performed in an in-context manner. Extensive experiments demonstrate the superiority of our method over existing approaches, both qualitatively and quantitatively.
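The self-attention cloning idea can be illustrated with a toy sketch under our own simplifying assumptions (a 4-token layout standing in for a grid of A, A', B, B'; `clone_attention` is a hypothetical name, not the paper's implementation): the attention that example A's tokens pay to example B's tokens is copied onto the attention the query A' pays to the output region B', transferring the structural analogy.

```python
import numpy as np

def clone_attention(attn, src_q, src_k, dst_q, dst_k):
    """Copy the attention sub-block (src_q -> src_k) onto (dst_q -> dst_k).

    attn: (N, N) self-attention map over N spatial tokens (rows sum to 1).
    Each region argument is a list of token indices of equal length.
    """
    out = attn.copy()
    out[np.ix_(dst_q, dst_k)] = attn[np.ix_(src_q, src_k)]
    # re-normalize each edited query row so it remains a distribution
    out[dst_q] /= out[dst_q].sum(axis=1, keepdims=True)
    return out

# toy 4-token example: token 0 = A, 1 = A', 2 = B, 3 = B'
rng = np.random.default_rng(0)
attn = rng.random((4, 4))
attn /= attn.sum(axis=1, keepdims=True)
cloned = clone_attention(attn, src_q=[0], src_k=[2], dst_q=[1], dst_k=[3])
```

In the actual method this operation would run inside a diffusion model's self-attention layers; the sketch only shows the index bookkeeping.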
https://arxiv.org/abs/2405.10316
Numerous recent works aim to enhance the efficacy of Large Language Models (LLMs) through strategic prompting. In particular, the Optimization by PROmpting (OPRO) approach provides state-of-the-art performance by leveraging LLMs as optimizers, where the optimization task is to find instructions that maximize task accuracy. In this paper, we revisit OPRO for automated prompting with relatively small-scale LLMs, such as the LLaMa-2 family and Mistral 7B. Our investigation reveals that OPRO shows limited effectiveness in small-scale LLMs, whose limited inference capabilities constrain their optimization ability. We suggest that future automatic prompt engineering consider both model capabilities and computational costs. Additionally, for small-scale LLMs, we recommend direct instructions that clearly outline objectives and methodologies as robust prompt baselines, ensuring efficient and effective prompt engineering in ongoing research.
https://arxiv.org/abs/2405.10276
This paper investigates the dynamics of a deep neural network (DNN) learning interactions. Previous studies have discovered and mathematically proven that, given each input sample, a well-trained DNN usually only encodes a small number of interactions (non-linear relationships) between input variables in the sample. A series of theorems have been derived to prove that the DNN's inference can be considered equivalent to using these interactions as primitive patterns for inference. In this paper, we discover that a DNN learns interactions in two phases. The first phase mainly penalizes interactions of medium and high orders, and the second phase mainly learns interactions of gradually increasing orders. We can consider the two-phase phenomenon as the starting point of a DNN learning over-fitted features. This phenomenon is widely observed across DNNs with various architectures trained for different tasks. Therefore, the discovery of the two-phase dynamics provides a detailed mechanism for how a DNN gradually learns different inference patterns (interactions). In particular, we have also verified the claim that high-order interactions have weaker generalization power than low-order interactions. Thus, the discovered two-phase dynamics also explains how the generalization power of a DNN changes during the training process.
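The interactions discussed here are typically formalized as Harsanyi dividends. The following sketch (a toy model of our own, not the authors' code) computes the interaction effect I(S) of every subset S of input variables from the model's outputs on masked inputs, where |S| is the interaction's order; summed over all S, the effects recover the model's output on the full input.

```python
from itertools import chain, combinations

def subsets(s):
    """All subsets of s, as tuples, including the empty set and s itself."""
    s = tuple(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def harsanyi_interactions(v, n):
    """I(S) = sum over T subset of S of (-1)^(|S|-|T|) * v(T).

    v: callable mapping a tuple of *present* variable indices to a scalar
       model output (all other variables masked).
    Returns {S: I(S)} for every S in {0..n-1}; |S| is the order of I(S).
    """
    return {
        S: sum((-1) ** (len(S) - len(T)) * v(T) for T in subsets(S))
        for S in subsets(range(n))
    }

# toy model with an explicit pairwise (order-2) interaction between x0, x1
def model(present):
    x = [1.0 if i in present else 0.0 for i in range(3)]
    return 2.0 * x[0] + 3.0 * x[0] * x[1]

I = harsanyi_interactions(model, 3)
```

The pairwise term is recovered exactly as I((0, 1)) = 3, and the efficiency property (interactions summing to the full output) holds by construction.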
https://arxiv.org/abs/2405.10262
Open-vocabulary object detection (OvOD) has transformed detection into a language-guided task, empowering users to freely define their class vocabularies of interest during inference. However, our initial investigation indicates that existing OvOD detectors exhibit significant variability when dealing with vocabularies across various semantic granularities, posing a concern for real-world deployment. To this end, we introduce Semantic Hierarchy Nexus (SHiNe), a novel classifier that uses semantic knowledge from class hierarchies. It runs offline in three steps: i) it retrieves relevant super-/sub-categories from a hierarchy for each target class; ii) it integrates these categories into hierarchy-aware sentences; iii) it fuses these sentence embeddings to generate the nexus classifier vector. Our evaluation on various detection benchmarks demonstrates that SHiNe enhances robustness across diverse vocabulary granularities, achieving up to +31.9% mAP50 with ground truth hierarchies, while retaining improvements using hierarchies generated by large language models. Moreover, when applied to open-vocabulary classification on ImageNet-1k, SHiNe improves the CLIP zero-shot baseline by +2.8% accuracy. SHiNe is training-free and can be seamlessly integrated with any off-the-shelf OvOD detector, without incurring additional computational overhead during inference. The code is open source.
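The three offline steps can be sketched as follows, with a deterministic hash-based `embed` standing in for a real CLIP-style text encoder, and hand-written sentence templates; the exact templates and hierarchy retrieval in the paper may differ, so treat every name here as illustrative.

```python
import hashlib
import numpy as np

def embed(text, dim=64):
    """Stand-in for a CLIP-style text encoder: a deterministic
    pseudo-random unit vector per sentence (for illustration only)."""
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def shine_nexus(target, parents, children):
    # i) super-/sub-categories retrieved from a class hierarchy (given here)
    # ii) integrate them into hierarchy-aware sentences
    sentences = [f"a photo of a {target}, which is a kind of {p}" for p in parents]
    sentences += [f"a photo of a {target}, which can be a {c}" for c in children]
    sentences.append(f"a photo of a {target}")
    # iii) fuse the sentence embeddings into one nexus classifier vector
    nexus = np.mean([embed(s) for s in sentences], axis=0)
    return nexus / np.linalg.norm(nexus)

w = shine_nexus("dog", parents=["animal"], children=["corgi", "beagle"])
```

At inference the nexus vector simply replaces the per-class text embedding in the detector's classifier head, which is why no retraining is needed.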
https://arxiv.org/abs/2405.10053
Utilizing solar energy to meet space heating and domestic hot water demand is very efficient (in terms of environmental footprint as well as cost), but in order to ensure that user demand is entirely covered throughout the year, it needs to be complemented with auxiliary heating systems, typically boilers and heat pumps. Naturally, the optimal control of such a system depends on an accurate prediction of solar thermal production. Experimental testing and physics-based numerical models are used to find a collector's performance curve - the mapping from solar radiation and other external conditions to heat production - but this curve changes over time once the collector is exposed to outdoor conditions. In order to deploy advanced control strategies in small domestic installations, we present an approach that uses machine learning to automatically construct and continuously adapt a model that predicts heat production. Our design is driven by the need to (a) construct and adapt models using supervision that can be extracted from low-cost instrumentation, avoiding extreme accuracy and reliability requirements; and (b) at inference time, use inputs that are typically provided in publicly available weather forecasts. Recent developments in attention-based machine learning, as well as careful adaptation of the training setup to the specifics of the task, have allowed us to design a machine learning-based solution that covers our requirements. We present positive empirical results for the predictive accuracy of our solution, and discuss the impact of these results on the end-to-end system.
https://arxiv.org/abs/2405.09972
Previous unsupervised anomaly detection (UAD) methods often struggle with significant intra-class diversity; i.e., a class in a dataset contains multiple subclasses, which we categorize as Feature-Rich Anomaly Detection Datasets (FRADs). This is evident in applications such as unified setting and unmanned supermarket scenarios. To address this challenge, we developed MiniMaxAD: a lightweight autoencoder designed to efficiently compress and memorize extensive information from normal images. Our model utilizes a large kernel convolutional network equipped with a Global Response Normalization (GRN) unit and employs a multi-scale feature reconstruction strategy. The GRN unit significantly increases the upper limit of the network's capacity, while the large kernel convolution facilitates the extraction of highly abstract patterns, leading to compact normal feature modeling. Additionally, we introduce an Adaptive Contraction Loss (ADCLoss), tailored to FRADs to overcome the limitations of global cosine distance loss. MiniMaxAD was comprehensively tested across six challenging UAD benchmarks, achieving state-of-the-art results in four and highly competitive outcomes in the remaining two. Notably, our model achieved a detection AUROC of up to 97.0% in ViSA under the unified setting. Moreover, it not only achieved state-of-the-art performance in unmanned supermarket tasks but also exhibited an inference speed 37 times faster than the previous best method, demonstrating its effectiveness in complex UAD tasks.
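For reference, the Global Response Normalization unit (introduced in ConvNeXt V2) is only a few lines. A NumPy sketch under the usual formulation, with the learnable gamma/beta scalars initialized to zero so the unit starts as an identity mapping:

```python
import numpy as np

def grn(x, gamma=0.0, beta=0.0, eps=1e-6):
    """Global Response Normalization over an (H, W, C) feature map.

    G[c] = L2 norm of channel c over all spatial positions
    N[c] = G[c] / (mean over channels of G + eps)   # divisive competition
    out  = gamma * (x * N) + beta + x               # residual; identity at init
    """
    g = np.linalg.norm(x.reshape(-1, x.shape[-1]), axis=0)  # (C,)
    n = g / (g.mean() + eps)
    return gamma * (x * n) + beta + x

x = np.random.default_rng(1).standard_normal((8, 8, 16))
y = grn(x)  # gamma = beta = 0 -> output equals input
```

The channel-wise divisive normalization is what lets the network keep many distinct feature channels "alive", which is the capacity argument made above.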
https://arxiv.org/abs/2405.09933
We present Dual3D, a novel text-to-3D generation framework that generates high-quality 3D assets from texts in only 1 minute. The key component is a dual-mode multi-view latent diffusion model. Given the noisy multi-view latents, the 2D mode can efficiently denoise them with a single latent denoising network, while the 3D mode can generate a tri-plane neural surface for consistent rendering-based denoising. Most modules for both modes are tuned from a pre-trained text-to-image latent diffusion model to circumvent the expensive cost of training from scratch. To overcome the high rendering cost during inference, we propose the dual-mode toggling inference strategy to use only 1/10 of the denoising steps with the 3D mode, successfully generating a 3D asset in just 10 seconds without sacrificing quality. The texture of the 3D asset can be further enhanced by our efficient texture refinement process in a short time. Extensive experiments demonstrate that our method delivers state-of-the-art performance while significantly reducing generation time. Our project page is available at this https URL
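A minimal sketch of what such a toggling schedule might look like; which particular steps run in 3D mode is our assumption (evenly spaced, starting with the first step), as the abstract only states the 1/10 ratio.

```python
def toggling_schedule(num_steps=50, ratio_3d=0.1):
    """Per-step mode list: the expensive rendering-based '3d' mode on an
    evenly spaced fraction of denoising steps, cheap '2d' mode otherwise."""
    stride = max(1, round(1 / ratio_3d))
    return ["3d" if t % stride == 0 else "2d" for t in range(num_steps)]

schedule = toggling_schedule()
```

With 50 denoising steps this yields 5 rendering-based steps, matching the stated 1/10 budget.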
https://arxiv.org/abs/2405.09874
In weakly supervised medical image segmentation, the absence of structural priors and the discreteness of class feature distribution present a challenge, i.e., how to accurately propagate supervision signals from local to global regions without excessively spreading them to other irrelevant regions. To address this, we propose a novel weakly supervised medical image segmentation framework named PCLMix, comprising dynamic mix augmentation, pixel-level contrastive learning, and consistency regularization strategies. Specifically, PCLMix is built upon a heterogeneous dual-decoder backbone, addressing the absence of structural priors through a strategy of dynamic mix augmentation during training. To handle the discrete distribution of class features, PCLMix incorporates pixel-level contrastive learning based on prediction uncertainty, effectively enhancing the model's ability to differentiate inter-class pixel differences and intra-class consistency. Furthermore, to reinforce segmentation consistency and robustness, PCLMix employs an auxiliary decoder for dual consistency regularization. In the inference phase, the auxiliary decoder is dropped, so no computational complexity is added. Extensive experiments on the ACDC dataset demonstrate that PCLMix appropriately propagates local supervision signals to the global scale, further narrowing the gap between weakly supervised and fully supervised segmentation methods. Our code is available at this https URL.
https://arxiv.org/abs/2405.06288
Large language models are well-known to be effective at few-shot in-context learning (ICL). Recent advancements in multimodal foundation models have enabled unprecedentedly long context windows, presenting an opportunity to explore their capability to perform ICL with many more demonstrating examples. In this work, we evaluate the performance of multimodal foundation models scaling from few-shot to many-shot ICL. We benchmark GPT-4o and Gemini 1.5 Pro across 10 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (multi-class, multi-label, and fine-grained classification). We observe that many-shot ICL, including up to almost 2,000 multimodal demonstrating examples, leads to substantial improvements compared to few-shot (<100 examples) ICL across all of the datasets. Further, Gemini 1.5 Pro performance continues to improve log-linearly up to the maximum number of tested examples on many datasets. Given the high inference costs associated with the long prompts required for many-shot ICL, we also explore the impact of batching multiple queries in a single API call. We show that batching up to 50 queries can lead to performance improvements under zero-shot and many-shot ICL, with substantial gains in the zero-shot setting on multiple datasets, while drastically reducing per-query cost and latency. Finally, we measure ICL data efficiency of the models, or the rate at which the models learn from more demonstrating examples. We find that while GPT-4o and Gemini 1.5 Pro achieve similar zero-shot performance across the datasets, Gemini 1.5 Pro exhibits higher ICL data efficiency than GPT-4o on most datasets. Our results suggest that many-shot ICL could enable users to efficiently adapt multimodal foundation models to new applications and domains. Our codebase is publicly available at this https URL .
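A sketch of how multiple queries might be packed into one call; the prompt format below is our own illustration, not the paper's template, and `build_batched_prompt` is a hypothetical helper.

```python
def build_batched_prompt(demos, queries):
    """Pack the (long) demonstration block once, then a numbered batch of
    queries, so a single API call answers all queries and the prompt cost
    is amortized across the batch."""
    lines = ["Classify each description. Examples:"]
    lines += [f"Input: {x}\nLabel: {y}" for x, y in demos]
    lines.append("Now answer every query as 'i: <label>':")
    lines += [f"Query {i + 1}: {q}" for i, q in enumerate(queries)]
    return "\n".join(lines)

prompt = build_batched_prompt(
    demos=[("striped big cat", "tiger"), ("long trunk", "elephant")],
    queries=["black and white bear", "tall spotted neck"],
)
```

Because the demonstrations dominate the token count in many-shot ICL, batching 50 queries divides that fixed cost by 50, which is the per-query saving measured above.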
https://arxiv.org/abs/2405.09798
Due to spatial redundancy in remote sensing images, sparse tokens containing rich information are usually involved in self-attention (SA) to reduce the overall number of tokens in the calculation, avoiding the high computational cost issue in Vision Transformers. However, such methods usually obtain sparse tokens by hand-crafted or parallel-unfriendly designs, posing a challenge to reach a better balance between efficiency and performance. Different from them, this paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information meanwhile improving the inference speed. Technically, the meta tokens are first initialized from image tokens via cross-attention. Then, we propose Dual Cross-Attention (DCA) to promote information exchange between image tokens and meta tokens, where they serve as query and key (value) tokens alternately in a dual-branch structure, significantly reducing the computational complexity compared to self-attention. By employing DCA in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes. Experimental results in classification and dense prediction tasks show that LeMeViT has a significant 1.7× speedup, fewer parameters, and competitive performance compared to the baseline models, and achieves a better trade-off between efficiency and performance.
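One round of the alternating attention can be sketched in NumPy (single head, no learned projections or normalization, so this only illustrates the complexity argument, not the trained layer):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(q, kv):
    """Plain single-head cross-attention: rows of q attend to rows of kv."""
    d = q.shape[-1]
    return softmax(q @ kv.T / np.sqrt(d)) @ kv

def dual_cross_attention(img, meta):
    """One DCA round: meta tokens gather information from image tokens,
    then image tokens read back from the updated meta tokens.
    Cost per round is O(N*M) for N image and M meta tokens, versus
    O(N^2) for full self-attention over the image tokens."""
    meta = meta + cross_attention(meta, img)  # meta tokens as queries
    img = img + cross_attention(img, meta)    # image tokens as queries
    return img, meta

rng = np.random.default_rng(0)
img, meta = dual_cross_attention(rng.standard_normal((196, 32)),
                                 rng.standard_normal((16, 32)))
```

With M much smaller than N (16 vs 196 here), the two cross-attention passes together stay far below the quadratic self-attention budget.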
https://arxiv.org/abs/2405.09789
Task-oriented dialogue systems are broadly used in virtual assistants and other automated services, providing interfaces between users and machines to facilitate specific tasks. Nowadays, task-oriented dialogue systems have greatly benefited from pre-trained language models (PLMs). However, their task-solving performance is constrained by the inherent capacities of PLMs, and scaling these models is expensive and complex as the model size becomes larger. To address these challenges, we propose the Soft Mixture-of-Expert Task-Oriented Dialogue system (SMETOD), which leverages an ensemble of Mixture-of-Experts (MoEs) to excel at subproblems and generate specialized outputs for task-oriented dialogues. SMETOD also scales up a task-oriented dialogue system with simplicity and flexibility while maintaining inference efficiency. We extensively evaluate our model on three benchmark functionalities: intent prediction, dialogue state tracking, and dialogue response generation. Experimental results demonstrate that SMETOD achieves state-of-the-art performance on most evaluated metrics. Moreover, comparisons against existing strong baselines show that SMETOD offers substantial advantages in inference cost and problem-solving correctness.
https://arxiv.org/abs/2405.09744
Surgical automation can improve the accessibility and consistency of life-saving procedures. Most surgeries require separating layers of tissue to access the surgical site, and suturing to reattach incisions. These tasks involve deformable manipulation to safely identify and alter tissue attachment (boundary) topology. Due to poor visual acuity and frequent occlusions, surgeons tend to carefully manipulate the tissue in ways that enable inference of the tissue's attachment points without causing unsafe tearing. In a similar fashion, we propose JIGGLE, a framework for estimation and interactive sensing of unknown boundary parameters in deformable surgical environments. This framework has two key components: (1) a probabilistic estimation to identify the current attachment points, achieved by integrating a differentiable soft-body simulator with an extended Kalman filter (EKF), and (2) an optimization-based active control pipeline that generates actions to maximize information gain of the tissue attachments, while simultaneously minimizing safety costs. The robustness of our estimation approach is demonstrated through experiments with real animal tissue, where we infer sutured attachment points using stereo endoscope observations. We also demonstrate the capabilities of our method in handling complex topological changes such as cutting and suturing.
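The estimation component pairs a differentiable simulator with an EKF. A generic measurement-update sketch, with a toy linear observation model standing in for the simulator and its Jacobian (the names and the toy setup are our own, not JIGGLE's code):

```python
import numpy as np

def ekf_update(x, P, z, h, H, R):
    """One EKF measurement update for boundary-parameter estimation.

    x, P : state mean and covariance (e.g. candidate attachment weights)
    z    : observation (e.g. tracked tissue keypoints)
    h, H : observation function and its Jacobian; in the paper these come
           from a differentiable soft-body simulator (toy stand-in here)
    R    : observation noise covariance
    """
    y = z - h(x)                         # innovation
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    return x + K @ y, (np.eye(len(x)) - K @ H) @ P

# toy: a directly observed 2-parameter state with measurement noise
H = np.eye(2)
x, P = ekf_update(x=np.zeros(2), P=np.eye(2),
                  z=np.array([1.0, -1.0]), h=lambda s: H @ s,
                  H=H, R=0.1 * np.eye(2))
```

The active-sensing half of the framework would then choose the next action to shrink the posterior covariance P fastest, subject to safety costs.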
https://arxiv.org/abs/2405.09743
Large language models (LLMs) often exhibit undesirable behaviours, such as generating untruthful or biased content. Editing their internal representations has been shown to be effective in mitigating such behaviours on top of the existing alignment methods. We propose a novel inference-time editing method, namely spectral editing of activations (SEA), to project the input representations into directions with maximal covariance with the positive demonstrations (e.g., truthful) while minimising covariance with the negative demonstrations (e.g., hallucinated). We also extend our method to non-linear editing using feature functions. We run extensive experiments on benchmarks concerning truthfulness and bias with six open-source LLMs of different sizes and model families. The results demonstrate the superiority of SEA in effectiveness, generalisation to similar tasks, as well as inference and data efficiency. We also show that SEA editing only has a limited negative impact on other model capabilities.
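A simplified linear variant of the idea (our own reading, not the authors' code): take editing directions from the eigendecomposition of the difference between the positive- and negative-demonstration activation covariances, then project activations onto those directions at inference time.

```python
import numpy as np

def spectral_editing_directions(pos, neg, k=2):
    """Directions with large covariance under the positive demonstrations
    and small covariance under the negative ones: the top-k eigenvectors
    of Cov(pos) - Cov(neg). pos, neg: (num_samples, d) activations."""
    pos = pos - pos.mean(axis=0)
    neg = neg - neg.mean(axis=0)
    diff = pos.T @ pos / len(pos) - neg.T @ neg / len(neg)
    vals, vecs = np.linalg.eigh(diff)              # ascending eigenvalues
    return vecs[:, np.argsort(vals)[::-1][:k]]     # (d, k), orthonormal

def edit(h, U):
    """Project an activation onto the span of the editing directions."""
    return U @ (U.T @ h)

rng = np.random.default_rng(0)
U = spectral_editing_directions(rng.standard_normal((100, 8)),
                                rng.standard_normal((100, 8)), k=2)
h_edited = edit(rng.standard_normal(8), U)
```

Since the edit is a fixed linear projection, it adds essentially no inference overhead, which is consistent with the efficiency claims above; the paper's non-linear extension replaces the raw activations with feature functions of them.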
https://arxiv.org/abs/2405.09719
Correspondence-based statistical shape modeling (SSM) stands as a powerful technology for morphometric analysis in clinical research. SSM facilitates population-level characterization and quantification of anatomical shapes such as bones and organs, aiding in pathology and disease diagnostics and treatment planning. Despite its potential, SSM remains under-utilized in medical research due to the significant overhead associated with automatic construction methods, which demand complete, aligned shape surface representations. Additionally, optimization-based techniques rely on bias-inducing assumptions or templates and have prolonged inference times as the entire cohort is simultaneously optimized. To overcome these challenges, we introduce Point2SSM++, a principled, self-supervised deep learning approach that directly learns correspondence points from point cloud representations of anatomical shapes. Point2SSM++ is robust to misaligned and inconsistent input, providing SSM that accurately samples individual shape surfaces while effectively capturing population-level statistics. Additionally, we present principled extensions of Point2SSM++ tailored for dynamic spatiotemporal and multi-anatomy use cases, demonstrating the broad versatility of the framework. Through extensive validation across diverse anatomies, evaluation metrics, and clinically relevant downstream tasks, we demonstrate Point2SSM++'s superiority over existing state-of-the-art deep learning models and traditional approaches. Point2SSM++ substantially enhances the feasibility of SSM generation and significantly broadens its array of potential clinical applications.
https://arxiv.org/abs/2405.09707
Anatomical shape analysis plays a pivotal role in clinical research and hypothesis testing, where the relationship between form and function is paramount. Correspondence-based statistical shape modeling (SSM) facilitates population-level morphometrics but requires a cumbersome, potentially bias-inducing construction pipeline. Recent advancements in deep learning have streamlined this process in inference by providing SSM prediction directly from unsegmented medical images. However, the proposed approaches are fully supervised and require utilizing a traditional SSM construction pipeline to create training data, thus inheriting the associated burdens and limitations. To address these challenges, we introduce a weakly supervised deep learning approach to predict SSM from images using point cloud supervision. Specifically, we propose reducing the supervision associated with the state-of-the-art fully Bayesian variational information bottleneck DeepSSM (BVIB-DeepSSM) model. BVIB-DeepSSM is an effective, principled framework for predicting probabilistic anatomical shapes from images with quantification of both aleatoric and epistemic uncertainties. Whereas the original BVIB-DeepSSM method requires strong supervision in the form of ground truth correspondence points, the proposed approach utilizes weak supervision via point cloud surface representations, which are more readily obtainable. Furthermore, the proposed approach learns correspondence in a completely data-driven manner without prior assumptions about the expected variability in the shape cohort. Our experiments demonstrate that this approach yields similar accuracy and uncertainty estimation to the fully supervised scenario while substantially enhancing the feasibility of model training for SSM construction.
https://arxiv.org/abs/2405.09697
Present Brain-Computer Interfacing (BCI) technology allows inference and detection of cognitive and affective states, but fairly little has been done to study scenarios in which such information can facilitate new applications that rely on modeling human cognition. One state that can be quantified from various physiological signals is attention. Estimates of human attention can be used to reveal preferences and novel dimensions of user experience. Previous approaches have tackled these incredibly challenging tasks using a variety of behavioral signals, from dwell-time to click-through data, and computational models of visual correspondence to these behavioral signals. However, behavioral signals are only rough estimations of the real underlying attention and affective preferences of the users. Indeed, users may attend to some content simply because it is salient, but not because it is really interesting, or simply because it is outrageous. With this paper, we put forward a research agenda and example work using BCI to infer users' preferences, their attentional correlates towards visual content, and their associations with affective experience. Subsequently, we link these to relevant applications, such as information retrieval, personalized steering of generative models, and crowdsourcing population estimates of affective experiences.
https://arxiv.org/abs/2405.09691
Localizing oneself during endoscopic procedures can be problematic due to the lack of distinguishable textures and landmarks, as well as difficulties due to the endoscopic device such as a limited field of view and challenging lighting conditions. Expert knowledge shaped by years of experience is required for localization within the human body during endoscopic procedures. In this work, we present a deep learning method based on anatomy recognition, that constructs a surgical path in an unsupervised manner from surgical videos, modelling relative location and variations due to different viewing angles. At inference time, the model can map an unseen video's frames on the path and estimate the viewing angle, aiming to provide guidance, for instance, to reach a particular destination. We test the method on a dataset consisting of surgical videos of transsphenoidal adenomectomies, as well as on a synthetic dataset. An online tool that lets researchers upload their surgical videos to obtain anatomy detections and the weights of the trained YOLOv7 model are available at: this https URL.
https://arxiv.org/abs/2405.09355
Deep learning classifiers are prone to latching onto dominant confounders present in a dataset rather than on the causal markers associated with the target class, leading to poor generalization and biased predictions. Although explainability via counterfactual image generation has been successful at exposing the problem, bias mitigation strategies that permit accurate explainability in the presence of dominant and diverse artifacts remain unsolved. In this work, we propose the DeCoDEx framework and show how an external, pre-trained binary artifact detector can be leveraged during inference to guide a diffusion-based counterfactual image generator towards accurate explainability. Experiments on the CheXpert dataset, using both synthetic artifacts and real visual artifacts (support devices), show that the proposed method successfully synthesizes the counterfactual images that change the causal pathology markers associated with Pleural Effusion while preserving or ignoring the visual artifacts. Augmentation of ERM and Group-DRO classifiers with the DeCoDEx generated images substantially improves the results across underrepresented groups that are out of distribution for each class. The code is made publicly available at this https URL.
https://arxiv.org/abs/2405.09288
It remains a challenge to effectively control the emotion rendering in text-to-speech (TTS) synthesis. Prior studies have primarily focused on learning a global prosodic representation at the utterance level, which strongly correlates with linguistic prosody. Our goal is to construct a hierarchical emotion distribution (ED) that effectively encapsulates intensity variations of emotions at various levels of granularity, encompassing phonemes, words, and utterances. During TTS training, the hierarchical ED is extracted from the ground-truth audio and guides the predictor to establish a connection between emotional and linguistic prosody. At run-time inference, the TTS model generates emotional speech and, at the same time, provides quantitative control of emotion over the speech constituents. Both objective and subjective evaluations validate the effectiveness of the proposed framework in terms of emotion prediction and control.
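A sketch of how a hierarchical ED might be extracted from frame-level emotion intensities; the span-pooling aggregation here is our assumption about the extractor, and the function name is hypothetical.

```python
import numpy as np

def hierarchical_ed(frame_intensity, phoneme_spans, word_spans):
    """Aggregate frame-level emotion intensities into a hierarchical
    emotion distribution at phoneme, word, and utterance level.

    frame_intensity : (T,) per-frame intensity of one emotion
    *_spans         : lists of (start, end) frame-index pairs
    """
    def pool(spans):
        return [float(frame_intensity[s:e].mean()) for s, e in spans]

    return {
        "phoneme": pool(phoneme_spans),
        "word": pool(word_spans),
        "utterance": float(frame_intensity.mean()),
    }

ed = hierarchical_ed(np.array([0.2, 0.4, 0.6, 0.8]),
                     phoneme_spans=[(0, 2), (2, 4)],
                     word_spans=[(0, 4)])
```

At synthesis time, scaling any entry of this structure gives the quantitative per-constituent control described above.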
https://arxiv.org/abs/2405.09171
Deep models produce a number of features in each internal layer. A key problem in applications such as feature compression for remote inference is determining how important each feature is for the task(s) performed by the model. The problem is especially challenging in the case of multi-task inference, where the same feature may carry different importance for different tasks. In this paper, we examine how effective the mutual information (MI) between a feature and a model's task output is as a measure of the feature's importance for that task. Experiments involving hard selection and soft selection (unequal compression) based on MI are carried out to compare the MI-based method with alternative approaches. Multi-objective analysis is provided to offer further insight.
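A minimal histogram plug-in estimate of the MI between a scalar feature and a discrete task label, one of many possible estimators; the paper does not prescribe this particular one.

```python
import numpy as np

def mutual_information(feature, label, bins=8):
    """Histogram (plug-in) estimate of I(feature; label) in bits for a
    scalar feature and an integer class label."""
    edges = np.histogram_bin_edges(feature, bins)
    f = np.digitize(feature, edges[1:-1])            # bin index 0..bins-1
    joint = np.zeros((bins, label.max() + 1))
    for fi, yi in zip(f, label):
        joint[fi, yi] += 1
    joint /= joint.sum()
    pf = joint.sum(axis=1, keepdims=True)            # feature marginal
    py = joint.sum(axis=0, keepdims=True)            # label marginal
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pf @ py)[nz])).sum())

# a feature that fully determines a balanced binary label carries 1 bit;
# an independent noise feature should score much lower
y = np.array([0, 1] * 50)
mi_informative = mutual_information(y.astype(float), y)
mi_noise = mutual_information(np.random.default_rng(0).standard_normal(100), y)
```

Ranking features by such scores is what drives the hard selection (keep/drop) and soft selection (unequal bit allocation) experiments described above.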
https://arxiv.org/abs/2405.09077