# The Latest Papers about AI

• ## A Unifying Post-Processing Framework for Multi-Objective Learn-to-Defer Problems

2024-07-17 16:32:30
##### Abstract

Learn-to-Defer is a paradigm that enables learning algorithms to work not in isolation but as a team with human experts. In this paradigm, we permit the system to defer a subset of its tasks to the expert. Although there are currently systems that follow this paradigm and are designed to optimize the accuracy of the final human-AI team, the general methodology for developing such systems under a set of constraints (e.g., algorithmic fairness, expert intervention budgets, deferral of anomalies) remains largely unexplored. In this paper, using a $d$-dimensional generalization of the fundamental lemma of Neyman and Pearson (d-GNP), we obtain the Bayes optimal solution for learn-to-defer systems under various constraints. Furthermore, we design a generalizable algorithm to estimate that solution and apply it to the COMPAS and ACSIncome datasets. Our algorithm shows improvements in constraint violation over a set of baselines.
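The deferral mechanism described above can be sketched in a few lines. The following is a hypothetical illustration of a simple budgeted deferral rule (defer the least confident predictions to the expert, subject to an intervention budget); it is not the paper's Bayes-optimal d-GNP rule, only the paradigm it operates in:

```python
def defer_decisions(confidences, budget_fraction):
    """Return a boolean list: True = defer the i-th case to the human expert.

    Defers the least confident predictions, up to a fraction of the total
    caseload (the expert intervention budget). Illustrative sketch only.
    """
    n_defer = int(len(confidences) * budget_fraction)
    # Rank cases from least to most confident.
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    defer_set = set(ranked[:n_defer])
    return [i in defer_set for i in range(len(confidences))]
```

For example, with a 50% budget, the two least confident of four predictions would be deferred.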

##### URL

https://arxiv.org/abs/2407.12710

##### PDF

https://arxiv.org/pdf/2407.12710.pdf

• ## MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

2024-07-17 16:31:38
##### Abstract

Multimodal large language models (MLLMs) have demonstrated impressive capabilities across various vision-language (VL) tasks. However, a generalist MLLM typically underperforms a specialist MLLM on most VL tasks, which can be attributed to task interference. In this paper, we propose a mixture of multimodal experts (MoME) to mitigate task interference and obtain a generalist MLLM. Our MoME is composed of two key components: a mixture of vision experts (MoVE) and a mixture of language experts (MoLE). MoVE can adaptively modulate the features transformed from various vision encoders and has strong compatibility across transformation architectures. MoLE incorporates sparsely gated experts into LLMs to achieve painless improvements with roughly unchanged inference costs. In response to task interference, our MoME specializes both the vision and language modalities to adapt to task discrepancies. Extensive experiments show that MoME significantly improves the performance of generalist MLLMs across various VL tasks. The source code is released at this https URL.
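The sparsely gated experts in MoLE follow the general mixture-of-experts pattern: a gate scores the experts for each token and only the top-k are evaluated, keeping inference cost roughly flat. A minimal sketch of that generic mechanism (an assumption for illustration, not MoME's actual architecture):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sparse_moe(token, experts, gate_weights, k=1):
    """Route a token vector through the top-k experts, weighted by gate scores."""
    # One gate logit per expert: dot product of its weight row with the token.
    scores = softmax([sum(w * t for w, t in zip(row, token)) for row in gate_weights])
    top = sorted(range(len(experts)), key=lambda i: -scores[i])[:k]
    norm = sum(scores[i] for i in top)
    out = [0.0] * len(token)
    for i in top:  # only the selected experts are executed
        y = experts[i](token)
        out = [o + (scores[i] / norm) * yi for o, yi in zip(out, y)]
    return out
```

With k=1 this reduces to picking a single expert per token, which is what keeps the added compute negligible.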

##### URL

https://arxiv.org/abs/2407.12709

##### PDF

https://arxiv.org/pdf/2407.12709.pdf

• ## TTSDS -- Text-to-Speech Distribution Score

2024-07-17 16:30:27
##### Abstract

Many recently published Text-to-Speech (TTS) systems produce audio close to real speech. However, TTS evaluation needs to be revisited to make sense of the results obtained with the new architectures, approaches, and datasets. We propose evaluating the quality of synthetic speech as a combination of multiple factors such as prosody, speaker identity, and intelligibility. Our approach assesses how well synthetic speech mirrors real speech by obtaining correlates of each factor and measuring their distance from both real speech datasets and noise datasets. We benchmark 35 TTS systems developed between 2008 and 2024 and show that our score, computed as an unweighted average of the factors, strongly correlates with the human evaluations from each time period.
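The aggregation can be illustrated as follows. The per-factor formula below is an assumption for illustration (the paper measures distances of factor correlates to real-speech and noise datasets; its exact mapping is not given in the abstract), but the final step is the unweighted average the abstract describes:

```python
def factor_score(dist_to_real, dist_to_noise):
    """Map one factor's distances into [0, 100]: 100 = close to real speech,
    0 = close to noise. Illustrative formula, not the paper's definition."""
    return 100.0 * dist_to_noise / (dist_to_real + dist_to_noise)

def ttsds_like_score(factors):
    """Unweighted average over (dist_to_real, dist_to_noise) factor pairs."""
    scores = [factor_score(r, n) for r, n in factors]
    return sum(scores) / len(scores)
```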

##### URL

https://arxiv.org/abs/2407.12707

##### PDF

https://arxiv.org/pdf/2407.12707.pdf

• ## IMAGDressing-v1: Customizable Virtual Dressing

2024-07-17 16:26:30
##### Abstract

Recent advances have achieved realistic virtual try-on (VTON) through localized garment inpainting using latent diffusion models, significantly enhancing consumers' online shopping experience. However, existing VTON technologies neglect the need for merchants to showcase garments comprehensively, including flexible control over garments, optional faces, poses, and scenes. To address this issue, we define a virtual dressing (VD) task focused on generating freely editable human images with fixed garments and optional conditions. Meanwhile, we design a comprehensive affinity metric index (CAMI) to evaluate the consistency between generated images and reference garments. We then propose IMAGDressing-v1, which incorporates a garment UNet that captures semantic features from CLIP and texture features from a VAE. We present a hybrid attention module, comprising a frozen self-attention and a trainable cross-attention, to integrate garment features from the garment UNet into a frozen denoising UNet, ensuring users can control different scenes through text. IMAGDressing-v1 can be combined with other extension plugins, such as ControlNet and IP-Adapter, to enhance the diversity and controllability of generated images. Furthermore, to address the lack of data, we release the interactive garment pairing (IGPair) dataset, containing over 300,000 pairs of clothing and dressed images, and establish a standard pipeline for data assembly. Extensive experiments demonstrate that IMAGDressing-v1 achieves state-of-the-art human image synthesis performance under various controlled conditions. The code and model will be available at this https URL.

##### URL

https://arxiv.org/abs/2407.12705

##### PDF

https://arxiv.org/pdf/2407.12705.pdf

• ## Subgraph-Aware Training of Text-based Methods for Knowledge Graph Completion

2024-07-17 16:25:37
##### Abstract

Fine-tuning pre-trained language models (PLMs) has recently shown potential to improve knowledge graph completion (KGC). However, most PLM-based methods encode only textual information, neglecting the various topological structures of knowledge graphs (KGs). In this paper, we empirically validate the significant relationship between the structural properties of KGs and the performance of PLM-based methods. To leverage this structural knowledge, we propose a Subgraph-Aware Training framework for KGC (SATKGC) that combines (i) subgraph-aware mini-batching to encourage hard negative sampling, and (ii) a new contrastive learning method that focuses more on harder entities and harder negative triples in terms of structural properties. To the best of our knowledge, this is the first study to comprehensively incorporate the structural inductive bias of subgraphs into fine-tuning PLMs. Extensive experiments on four KGC benchmarks demonstrate the superiority of SATKGC. Our code is available.

##### URL

https://arxiv.org/abs/2407.12703

##### PDF

https://arxiv.org/pdf/2407.12703.pdf

• ## TransCAD: A Hierarchical Transformer for CAD Sequence Inference from Point Clouds

2024-07-17 16:24:36
##### Abstract

3D reverse engineering, in which a CAD model is inferred from a 3D scan of a physical object, is a research direction that offers many promising practical applications. This paper proposes TransCAD, an end-to-end transformer-based architecture that predicts the CAD sequence from a point cloud. TransCAD leverages the structure of CAD sequences through a hierarchical learning strategy. A loop refiner is also introduced to regress sketch primitive parameters. Rigorous experimentation on the DeepCAD and Fusion360 datasets shows that TransCAD achieves state-of-the-art results. The result analysis is supported by a proposed metric for CAD sequences, the mean Average Precision of CAD Sequence, which addresses the limitations of existing metrics.

##### URL

https://arxiv.org/abs/2407.12702

##### PDF

https://arxiv.org/pdf/2407.12702.pdf

• ## Calibrated Diverse Ensemble Entropy Minimization for Robust Test-Time Adaptation in Prostate Cancer Detection

2024-07-17 16:20:10
##### Abstract

High-resolution micro-ultrasound has demonstrated promise in real-time prostate cancer detection, with deep learning becoming a prominent tool for learning complex tissue properties reflected in ultrasound. However, a significant roadblock to real-world deployment remains, which prior works often overlook: model performance suffers when applied to data from different clinical centers due to variations in data distribution. This distribution shift significantly impacts the model's robustness, posing a major challenge to clinical deployment. Domain adaptation, and specifically its test-time adaptation (TTA) variant, offers a promising solution to this challenge. In a setting designed to reflect real-world conditions, we compare existing methods to state-of-the-art TTA approaches adapted for cancer detection, demonstrating the former's lack of robustness to distribution shifts. We then propose Diverse Ensemble Entropy Minimization (DEnEM), questioning the effectiveness of current TTA methods on ultrasound data. We show that these methods, although outperforming baselines, are suboptimal because they rely on neural network output probabilities, which may be uncalibrated, or on data augmentation, which is not straightforward to define on ultrasound data. Our results show a significant improvement of $5\%$ to $7\%$ in AUROC over existing methods and $3\%$ to $5\%$ over TTA methods, demonstrating the advantage of DEnEM in addressing distribution shift.

Keywords: Ultrasound Imaging, Prostate Cancer, Computer-aided Diagnosis, Distribution Shift Robustness, Test-time Adaptation.
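The core test-time objective behind entropy-minimization TTA can be sketched as below: the entropy of the averaged ensemble prediction, which such methods drive down at test time. This is an illustrative sketch of the generic objective only; DEnEM's calibration and diversity components are omitted:

```python
import math

def entropy(p):
    """Shannon entropy of a probability vector (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def ensemble_entropy_loss(member_probs):
    """Entropy of the averaged prediction of an ensemble of models.

    member_probs: list of per-member probability vectors for one input.
    A confident, agreeing ensemble yields low entropy (low loss).
    """
    n = len(member_probs)
    n_classes = len(member_probs[0])
    avg = [sum(p[i] for p in member_probs) / n for i in range(n_classes)]
    return entropy(avg)
```

A perfectly confident ensemble has zero loss; a maximally uncertain two-class prediction has loss ln 2 ≈ 0.693.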

##### URL

https://arxiv.org/abs/2407.12697

##### PDF

https://arxiv.org/pdf/2407.12697.pdf

• ## 4Dynamic: Text-to-4D Generation with Hybrid Priors

2024-07-17 16:02:55
##### Abstract

Owing to the impressive generative performance of text-to-image diffusion models, a growing number of text-to-3D generation works explore distilling 2D generative priors into 3D via the score distillation sampling (SDS) loss, to bypass the data scarcity problem. Existing text-to-3D methods have achieved promising results in realism and 3D consistency, but text-to-4D generation still faces challenges, including a lack of realism and insufficiently dynamic motion. In this paper, we propose a novel method for text-to-4D generation, which ensures dynamic amplitude and authenticity through direct supervision provided by a video prior. Specifically, we adopt a text-to-video diffusion model to generate a reference video and divide 4D generation into two stages: static generation and dynamic generation. The static 3D generation is achieved under the guidance of the input text and the first frame of the reference video, while in the dynamic generation stage, we introduce a customized SDS loss to ensure multi-view consistency, a video-based SDS loss to improve temporal consistency, and, most importantly, direct priors from the reference video to ensure the quality of geometry and texture. Moreover, we design a prior-switching training strategy to avoid conflicts between different priors and fully leverage the benefits of each. In addition, to enrich the generated motion, we introduce a dynamic modeling representation composed of a deformation network and a topology network, which ensures dynamic continuity while modeling topological changes. Our method not only supports text-to-4D generation but also enables 4D generation from monocular videos. Comparison experiments demonstrate the superiority of our method over existing methods.

##### URL

https://arxiv.org/abs/2407.12684

##### PDF

https://arxiv.org/pdf/2407.12684.pdf

• ## In-Situ Infrared Camera Monitoring for Defect and Anomaly Detection in Laser Powder Bed Fusion: Calibration, Data Mapping, and Feature Extraction

2024-07-17 16:02:22
##### Abstract

The laser powder bed fusion (LPBF) process can incur defects due to melt pool instabilities, spattering, temperature increases, and powder spread anomalies. Identifying defects through in-situ monitoring typically requires collecting, storing, and analyzing large amounts of generated data. The first goal of this work is to propose a new approach that accurately maps in-situ data to a three-dimensional (3D) geometry, aiming to reduce the amount of storage required. The second goal is to introduce several new IR features for defect detection or process model calibration, including laser scan order, local preheat temperature, maximum pre-laser-scanning temperature, and the number of spatters generated locally along with their landing locations. For completeness, the processing of other common IR features, such as interpass temperature, heat intensity, cooling rates, and melt pool area, is also presented with the underlying algorithms and a Python implementation. A number of different parts are printed, monitored, and characterized to provide evidence of the process defects and anomalies that the different IR features are capable of detecting.
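As a small example of the kind of IR feature processing described, a cooling rate can be obtained by finite differences over consecutive IR frames. This is a generic sketch with hypothetical frame temperatures, not the paper's exact algorithm:

```python
def cooling_rate(temps, dt):
    """Finite-difference cooling rate between consecutive IR frames.

    temps: per-frame temperatures for one location (e.g. in kelvin).
    dt:    frame interval in seconds.
    Returns one rate (temperature drop per second) per frame pair.
    """
    return [(temps[i] - temps[i + 1]) / dt for i in range(len(temps) - 1)]
```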

##### URL

https://arxiv.org/abs/2407.12682

##### PDF

https://arxiv.org/pdf/2407.12682.pdf

• ## Goldfish: Vision-Language Understanding of Arbitrarily Long Videos

2024-07-17 15:59:32
##### Abstract

Most current LLM-based models for video understanding can process videos only a few minutes long; they struggle with lengthy videos due to challenges such as noise and redundancy, as well as memory and computation constraints. In this paper, we present Goldfish, a methodology tailored for comprehending videos of arbitrary length. We also introduce the TVQA-long benchmark, specifically designed to evaluate models' capabilities in understanding long videos with questions covering both visual and textual content. Goldfish approaches these challenges with an efficient retrieval mechanism that first gathers the top-k video clips relevant to the instruction before providing the desired response. This retrieval design enables Goldfish to efficiently process arbitrarily long video sequences, facilitating its application in contexts such as movies or television series. To facilitate retrieval, we developed MiniGPT4-Video, which generates detailed descriptions for the video clips. In addressing the scarcity of benchmarks for long video evaluation, we adapted the TVQA short video benchmark for extended content analysis by aggregating questions from entire episodes, thereby shifting the evaluation from partial to full episode comprehension. We attained a 41.78% accuracy rate on the TVQA-long benchmark, surpassing previous methods by 14.94%. Our MiniGPT4-Video also shows exceptional performance in short video comprehension, exceeding existing state-of-the-art methods by 3.23%, 2.03%, 16.5%, and 23.59% on the MSVD, MSRVTT, TGIF, and TVQA short video benchmarks, respectively. These results indicate that our models achieve significant improvements in both long- and short-video understanding. Our models and code have been made publicly available at this https URL.
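The retrieval step can be sketched as similarity ranking between an instruction embedding and the clip-description embeddings. The embedding model and exact scoring are assumptions here (the abstract does not specify them); cosine similarity is used as a common default:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def topk_clips(query_emb, clip_embs, k):
    """Indices of the k clip descriptions most similar to the instruction."""
    ranked = sorted(range(len(clip_embs)),
                    key=lambda i: -cosine(query_emb, clip_embs[i]))
    return ranked[:k]
```

Only the selected k clips are then passed to the answering model, which is what bounds the cost for arbitrarily long videos.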

##### URL

https://arxiv.org/abs/2407.12679

##### PDF

https://arxiv.org/pdf/2407.12679.pdf

• ## CoSIGN: Few-Step Guidance of ConSIstency Model to Solve General INverse Problems

2024-07-17 15:57:50
##### Abstract

Diffusion models have been demonstrated to be strong priors for solving general inverse problems. Most existing diffusion-model-based inverse problem solvers (DIS) employ a plug-and-play approach to guide the sampling trajectory with either projections or gradients. Though effective, these methods generally necessitate hundreds of sampling steps, posing a dilemma between inference time and reconstruction quality. In this work, we push the boundary of inference steps down to 1-2 NFEs while still maintaining high reconstruction quality. To achieve this, we propose to leverage a pretrained distillation of a diffusion model, namely a consistency model, as the data prior. The key to achieving few-step guidance is to enforce two types of constraints during the sampling process of the consistency model: a soft measurement constraint with ControlNet and a hard measurement constraint via optimization. Supporting both single-step reconstruction and multi-step refinement, the proposed framework further provides a way to trade image quality for additional computational cost. Within comparable NFEs, our method achieves a new state of the art in diffusion-based inverse problem solving, showcasing the significant potential of employing prior-based inverse problem solvers in real-world applications. Code is available at: this https URL.

##### URL

https://arxiv.org/abs/2407.12676

##### PDF

https://arxiv.org/pdf/2407.12676.pdf

• ## GraphMuse: A Library for Symbolic Music Graph Processing

2024-07-17 15:54:09
##### Abstract

Graph Neural Networks (GNNs) have recently gained traction in symbolic music tasks, yet a lack of a unified framework impedes progress. Addressing this gap, we present GraphMuse, a graph processing framework and library that facilitates efficient music graph processing and GNN training for symbolic music tasks. Central to our contribution is a new neighbor sampling technique specifically targeted toward meaningful behavior in musical scores. Additionally, GraphMuse integrates hierarchical modeling elements that augment the expressivity and capabilities of graph networks for musical tasks. Experiments with two specific musical prediction tasks -- pitch spelling and cadence detection -- demonstrate significant performance improvement over previous methods. Our hope is that GraphMuse will lead to a boost in, and standardization of, symbolic music processing based on graph representations. The library is available at this https URL

##### URL

https://arxiv.org/abs/2407.12671

##### PDF

https://arxiv.org/pdf/2407.12671.pdf

• ## Enhancing the Utility of Privacy-Preserving Cancer Classification using Synthetic Data

2024-07-17 15:52:45
##### Abstract

Deep learning holds immense promise for aiding radiologists in breast cancer detection. However, achieving optimal model performance is hampered by limitations in the availability and sharing of data, commonly associated with patient privacy concerns. Such concerns are further exacerbated because traditional deep learning models can inadvertently leak sensitive training information. This work addresses these challenges by exploring and quantifying the utility of privacy-preserving deep learning techniques, concretely: (i) differentially private stochastic gradient descent (DP-SGD) and (ii) fully synthetic training data generated by our proposed malignancy-conditioned generative adversarial network. We assess these methods via downstream malignancy classification of mammography masses using a transformer model. Our experimental results show that synthetic data augmentation can improve privacy-utility tradeoffs in differentially private model training. Further, model pretraining on synthetic data achieves remarkable performance, which can be further increased with DP-SGD fine-tuning across all privacy guarantees. With this first in-depth exploration of privacy-preserving deep learning in breast imaging, we address current and emerging clinical privacy requirements and pave the way towards the adoption of private, high-utility deep diagnostic models. Our reproducible codebase is publicly available at this https URL.
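The DP-SGD component in (i) works by clipping each per-example gradient and adding calibrated Gaussian noise before averaging. A minimal, library-free sketch of that aggregation step (function and parameter names are illustrative, not from the paper's codebase):

```python
import math
import random

def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, rng):
    """One DP-SGD aggregation: clip each per-example gradient to clip_norm,
    sum, add Gaussian noise scaled by noise_multiplier * clip_norm, average."""
    d = len(per_example_grads[0])
    total = [0.0] * d
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        # Scale the gradient down only if it exceeds the clipping norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i in range(d):
            total[i] += g[i] * scale
    sigma = noise_multiplier * clip_norm
    n = len(per_example_grads)
    return [(total[i] + rng.gauss(0.0, sigma)) / n for i in range(d)]
```

In practice this is provided by libraries such as Opacus or TensorFlow Privacy together with privacy accounting; the sketch shows only the mechanism.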

##### URL

https://arxiv.org/abs/2407.12669

##### PDF

https://arxiv.org/pdf/2407.12669.pdf

• ## SG-NeRF: Neural Surface Reconstruction with Scene Graph Optimization

2024-07-17 15:50:17
##### Abstract

3D surface reconstruction from images is essential for numerous applications. Recently, Neural Radiance Fields (NeRFs) have emerged as a promising framework for 3D modeling. However, NeRFs require accurate camera poses as input, and existing methods struggle to handle significantly noisy pose estimates (i.e., outliers), which are commonly encountered in real-world scenarios. To tackle this challenge, we present a novel approach that optimizes radiance fields with scene graphs to mitigate the influence of outlier poses. Our method incorporates an adaptive inlier-outlier confidence estimation scheme based on scene graphs, emphasizing images of high compatibility with the neighborhood and consistency in the rendering quality. We also introduce an effective intersection-over-union (IoU) loss to optimize the camera pose and surface geometry, together with a coarse-to-fine strategy to facilitate the training. Furthermore, we propose a new dataset containing typical outlier poses for a detailed evaluation. Experimental results on various datasets consistently demonstrate the effectiveness and superiority of our method over existing approaches, showcasing its robustness in handling outliers and producing high-quality 3D reconstructions. Our code and data are available at: \url{this https URL}.
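The IoU loss mentioned above builds on the standard intersection-over-union measure; for binary masks it can be computed as below. This is a generic sketch (minimizing 1 − IoU is the usual loss form), not the paper's exact formulation:

```python
def iou(mask_a, mask_b):
    """Intersection-over-union of two flat binary masks."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return inter / union if union else 1.0

def iou_loss(mask_a, mask_b):
    # Minimizing 1 - IoU pushes the rendered silhouette toward the observed one.
    return 1.0 - iou(mask_a, mask_b)
```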

##### URL

https://arxiv.org/abs/2407.12667

##### PDF

https://arxiv.org/pdf/2407.12667.pdf

• ## Patch-Level Training for Large Language Models

2024-07-17 15:48:39
##### Abstract

As Large Language Models (LLMs) achieve remarkable progress in language understanding and generation, their training efficiency has become a critical concern. Traditionally, LLMs are trained to predict the next token in a sequence. Despite the success of token-level training, it suffers from considerable computational costs due to the need to process an extensive number of tokens. To mitigate this issue, this paper introduces patch-level training for LLMs, which reduces the sequence length by compressing multiple tokens into a single patch. During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch, thereby processing the majority of the training data at a significantly reduced computational cost. Following this, the model continues token-level training on the remaining training data to align with the inference mode. Experiments on a diverse range of models (370M-2.7B parameters) demonstrate that patch-level training can reduce overall computational costs to 0.5$\times$, without compromising the model performance compared to token-level training. Source code: \url{this https URL}.
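The token-to-patch compression the paper describes can be sketched as follows. How a trailing partial patch is handled is an assumption here (the abstract does not specify it); this sketch simply drops it:

```python
def to_patches(token_ids, patch_size):
    """Group consecutive token ids into fixed-size patches.

    A sequence of N tokens becomes roughly N / patch_size patches, which is
    what shortens the training sequences; any trailing partial patch is
    dropped for simplicity.
    """
    n = len(token_ids) // patch_size
    return [tuple(token_ids[i * patch_size:(i + 1) * patch_size])
            for i in range(n)]
```

With patch_size = 4, a 100-token sequence becomes 25 patches, a 4x shorter sequence for the patch-level training phase.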

##### URL

https://arxiv.org/abs/2407.12665

##### PDF

https://arxiv.org/pdf/2407.12665.pdf

• ## Is That Rain? Understanding Effects on Visual Odometry Performance for Autonomous UAVs and Efficient DNN-based Rain Classification at the Edge

2024-07-17 15:47:25
##### Abstract

The development of safe and reliable autonomous unmanned aerial vehicles relies on the ability of the system to recognise and adapt to changes in the local environment based on sensor inputs. State-of-the-art local tracking and trajectory planning are typically performed using camera sensor input to the flight control algorithm, but the extent to which environmental disturbances like rain affect the performance of these systems is largely unknown. In this paper, we first describe the development of an open dataset comprising ~335k images to examine these effects for seven different classes of precipitation conditions, and show that a worst-case average tracking error of 1.5 m is possible for a state-of-the-art visual odometry system (VINS-Fusion). We then use the dataset to train a set of deep neural network models suited to mobile and constrained deployment scenarios to determine the extent to which it may be possible to efficiently and accurately classify these 'rainy' conditions. The most lightweight of these models (MobileNetV3 small) can achieve an accuracy of 90% with a memory footprint of just 1.28 MB and a frame rate of 93 FPS, which is suitable for deployment in resource-constrained and latency-sensitive systems. We demonstrate a classification latency on the order of milliseconds using typical flight computer hardware. Accordingly, such a model can feed into the disturbance estimation component of an autonomous flight controller. In addition, data from unmanned aerial vehicles able to accurately determine environmental conditions in real time may contribute to developing more granular and timely localised weather forecasting.

##### URL

https://arxiv.org/abs/2407.12663

##### PDF

https://arxiv.org/pdf/2407.12663.pdf

• ## InfoNorm: Mutual Information Shaping of Normals for Sparse-View Reconstruction

2024-07-17 15:46:25
##### Abstract

3D surface reconstruction from multi-view images is essential for scene understanding and interaction. However, complex indoor scenes pose challenges such as ambiguity due to limited observations. Recent implicit surface representations, such as Neural Radiance Fields (NeRFs) and signed distance functions (SDFs), employ various geometric priors to resolve the lack of observed information. Nevertheless, their performance heavily depends on the quality of the pre-trained geometry estimation models. To ease such dependence, we propose regularizing the geometric modeling by explicitly encouraging the mutual information among surface normals of highly correlated scene points. In this way, the geometry learning process is modulated by the second-order correlations from noisy (first-order) geometric priors, thus eliminating the bias due to poor generalization. Additionally, we introduce a simple yet effective scheme that utilizes semantic and geometric features to identify correlated points, enhancing their mutual information accordingly. The proposed technique can serve as a plugin for SDF-based neural surface representations. Our experiments demonstrate the effectiveness of the proposed technique in improving the surface reconstruction quality of major state-of-the-art methods. Our code is available at: \url{this https URL}.

##### URL

https://arxiv.org/abs/2407.12661

##### PDF

https://arxiv.org/pdf/2407.12661.pdf

• ## Optimal Control for Clutched-Elastic Robots: A Contact-Implicit Approach

2024-07-17 15:38:00
##### Abstract

Intrinsically elastic robots surpass their rigid counterparts in a range of different characteristics. By temporarily storing potential energy and subsequently converting it to kinetic energy, elastic robots are capable of highly dynamic motions even with limited motor power. However, the time-dependency of this energy storage and release mechanism remains one of the major challenges in controlling elastic robots. A possible remedy is the introduction of locking elements (i.e. clutches and brakes) in the drive train. This gives rise to a new class of robots, so-called clutched-elastic robots (CER), with which it is possible to precisely control the energy-transfer timing. A prevalent challenge in the realm of CERs is the automatic discovery of clutch sequences. Due to complexity, many methods still rely on pre-defined modes. In this paper, we introduce a novel contact-implicit scheme designed to optimize both control input and clutch sequence simultaneously. A penalty in the objective function ensures the prevention of unnecessary clutch transitions. We empirically demonstrate the effectiveness of our proposed method on a double pendulum equipped with two of our newly proposed clutch-based Bi-Stiffness Actuators (BSA).

##### Abstract (translated)

Intrinsically elastic robots surpass their rigid counterparts in a range of different characteristics. By temporarily storing potential energy and subsequently converting it to kinetic energy, elastic robots are capable of highly dynamic motions even with limited motor power. However, the time-dependency of this energy storage and release mechanism remains one of the major challenges in controlling elastic robots. A possible remedy is the introduction of locking elements (i.e., clutches and brakes) in the drive train. This gives rise to a new class of robots, so-called clutched-elastic robots (CERs), with which the energy-transfer timing can be precisely controlled. A prevalent challenge in the realm of CERs is the automatic discovery of clutch sequences; due to the complexity involved, many methods still rely on pre-defined modes. In this paper, we introduce a novel contact-implicit scheme designed to optimize both the control input and the clutch sequence simultaneously. A penalty in the objective function prevents unnecessary clutch transitions. We empirically demonstrate the effectiveness of the proposed method on a double pendulum equipped with two of our newly proposed clutch-based Bi-Stiffness Actuators (BSA).
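The transition penalty described above can be sketched as a simple term added to the optimization objective: count discrete clutch state changes along the trajectory and weight them. The weight and the decomposition into a tracking cost plus penalty are illustrative, not the paper's exact formulation:

```python
def clutch_transition_penalty(clutch_sequence, weight=1.0):
    """Count clutch state changes along a discretized trajectory and
    weight them, discouraging unnecessary engage/disengage events."""
    transitions = sum(
        1 for a, b in zip(clutch_sequence, clutch_sequence[1:]) if a != b
    )
    return weight * transitions

def total_objective(tracking_cost, clutch_sequence, weight=0.1):
    # Combined objective: task cost plus regularization on clutch switching.
    return tracking_cost + clutch_transition_penalty(clutch_sequence, weight)
```

In a contact-implicit solver the clutch states would be decision variables relaxed to continuous values; this integer-sequence version only illustrates what the penalty measures.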

##### URL

https://arxiv.org/abs/2407.12655

##### PDF

https://arxiv.org/pdf/2407.12655.pdf

• ## MMVR: Millimeter-wave Multi-View Radar Dataset and Benchmark for Indoor Perception

2024-07-17 15:23:17
##### Abstract

Compared with an extensive list of automotive radar datasets that support autonomous driving, indoor radar datasets are scarce at a smaller scale in the format of low-resolution radar point clouds and usually under an open-space single-room setting. In this paper, we scale up indoor radar data collection using multi-view high-resolution radar heatmap in a multi-day, multi-room, and multi-subject setting, with an emphasis on the diversity of environment and subjects. Referred to as the millimeter-wave multi-view radar (MMVR) dataset, it consists of $345$K multi-view radar frames collected from $25$ human subjects over $6$ different rooms, $446$K annotated bounding boxes/segmentation instances, and $7.59$ million annotated keypoints to support three major perception tasks of object detection, pose estimation, and instance segmentation, respectively. For each task, we report performance benchmarks under two protocols: a single subject in an open space and multiple subjects in several cluttered rooms with two data splits: random split and cross-environment split over $395$ 1-min data segments. We anticipate that MMVR facilitates indoor radar perception development for indoor vehicle (robot/humanoid) navigation, building energy management, and elderly care for better efficiency, user experience, and safety. The MMVR dataset is available at this https URL.

##### Abstract (translated)

Compared with the extensive list of automotive radar datasets that support autonomous driving, indoor radar datasets are scarce, smaller in scale, typically provided as low-resolution radar point clouds, and usually collected in an open-space single-room setting. In this paper, we scale up indoor radar data collection using multi-view high-resolution radar heatmaps in a multi-day, multi-room, and multi-subject setting, with an emphasis on the diversity of environments and subjects. Referred to as the millimeter-wave multi-view radar (MMVR) dataset, it consists of 345K multi-view radar frames collected from 25 human subjects across 6 different rooms, 446K annotated bounding boxes/segmentation instances, and 7.59 million annotated keypoints, supporting three major perception tasks: object detection, pose estimation, and instance segmentation. For each task, we report performance benchmarks under two protocols: a single subject in an open space, and multiple subjects in several cluttered rooms with two data splits (a random split and a cross-environment split) over 395 one-minute data segments. We anticipate that MMVR will facilitate the development of indoor radar perception for indoor vehicle (robot/humanoid) navigation, building energy management, and elderly care, improving efficiency, user experience, and safety. The MMVR dataset is available at this https URL.
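The two evaluation splits over the one-minute segments can be sketched as follows; the segment record layout (`"room"` key) is a guessed convention, as MMVR's actual file organization may differ:

```python
import random

def random_split(segments, test_frac=0.2, seed=0):
    """Shuffle 1-min segments and hold out a fraction for testing;
    train and test may share rooms and subjects."""
    shuffled = segments[:]
    random.Random(seed).shuffle(shuffled)
    k = int(len(shuffled) * test_frac)
    return shuffled[k:], shuffled[:k]  # train, test

def cross_environment_split(segments, held_out_rooms):
    """Hold out entire rooms so test environments are unseen in training,
    probing generalization across environments."""
    train = [s for s in segments if s["room"] not in held_out_rooms]
    test = [s for s in segments if s["room"] in held_out_rooms]
    return train, test
```

The cross-environment protocol is the stricter of the two: a model that memorizes per-room clutter patterns will score well on the random split but degrade here.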

##### URL

https://arxiv.org/abs/2406.10708

##### PDF

https://arxiv.org/pdf/2406.10708.pdf

• ## Fusion Flow-enhanced Graph Pooling Residual Networks for Unmanned Aerial Vehicles Surveillance in Day and Night Dual Visions

2024-07-17 15:16:23
##### Abstract

Recognizing unauthorized Unmanned Aerial Vehicles (UAVs) within designated no-fly zones throughout the day and night is of paramount importance, where the unauthorized UAVs pose a substantial threat to both civil and military aviation safety. However, recognizing UAVs day and night with dual-vision cameras is nontrivial, since red-green-blue (RGB) images suffer from a low detection rate under an insufficient light condition, such as on cloudy or stormy days, while black-and-white infrared (IR) images struggle to capture UAVs that overlap with the background at night. In this paper, we propose a new optical flow-assisted graph-pooling residual network (OF-GPRN), which significantly enhances the UAV detection rate in day and night dual visions. The proposed OF-GPRN develops a new optical fusion to remove superfluous backgrounds, which improves RGB/IR imaging clarity. Furthermore, OF-GPRN extends optical fusion by incorporating a graph residual split attention network and a feature pyramid, which refines the perception of UAVs, leading to a higher success rate in UAV detection. A comprehensive performance evaluation is conducted using a benchmark UAV catch dataset. The results indicate that the proposed OF-GPRN elevates the UAV mean average precision (mAP) detection rate to 87.8%, marking a 17.9% advancement compared to the residual graph neural network (ResGCN)-based approach.

##### Abstract (translated)

Recognizing unauthorized Unmanned Aerial Vehicles (UAVs) within designated no-fly zones throughout the day and night is of paramount importance, as unauthorized UAVs pose a substantial threat to both civil and military aviation safety. However, recognizing UAVs day and night with dual-vision cameras is nontrivial: red-green-blue (RGB) images suffer from a low detection rate under insufficient light, such as on cloudy or stormy days, while black-and-white infrared (IR) images struggle to capture UAVs that overlap with the background at night. In this paper, we propose a new optical flow-assisted graph-pooling residual network (OF-GPRN), which significantly enhances the UAV detection rate in day and night dual visions. The proposed OF-GPRN develops a new optical fusion to remove superfluous backgrounds, improving RGB/IR imaging clarity. Furthermore, OF-GPRN extends the optical fusion by incorporating a graph residual split attention network and a feature pyramid, which refines the perception of UAVs and leads to a higher UAV detection success rate. A comprehensive performance evaluation is conducted on a benchmark UAV catch dataset. The results indicate that the proposed OF-GPRN elevates the UAV mean average precision (mAP) to 87.8%, a 17.9% improvement over the residual graph neural network (ResGCN)-based approach.
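One way to picture the background-removal role of optical flow is a motion-magnitude mask over the flow field: static background pixels have near-zero flow, while a moving UAV does not. This is a deliberate simplification; the paper's optical fusion is learned, not a fixed threshold:

```python
import math

def flow_background_mask(flow_u, flow_v, threshold=0.5):
    """Given per-pixel horizontal (flow_u) and vertical (flow_v) optical
    flow as 2D lists, keep pixels whose motion magnitude exceeds the
    threshold, suppressing the static background of the RGB/IR frame."""
    return [
        [1 if math.hypot(u, v) > threshold else 0 for u, v in zip(ru, rv)]
        for ru, rv in zip(flow_u, flow_v)
    ]
```

The resulting binary mask would then gate the fused RGB/IR features before they reach the detection head, so the network attends to moving regions.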

##### URL

https://arxiv.org/abs/2407.12647

##### PDF

https://arxiv.org/pdf/2407.12647.pdf
