The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios, and (2) ensuring voice consistency despite variations in facial motion for the same identity. To tackle these issues, we introduce a motion sampler based on conditional flow matching, which is capable of high-quality motion code generation in an efficient way. Moreover, we introduce a novel conditioning method for the TTS system, which utilises motion-removed features from the TFG model to yield uniform speech outputs. Our extensive experiments demonstrate that our method effectively creates natural-looking talking faces and speech that accurately match the input text. To our knowledge, this is the first effort to build a multimodal synthesis system that can generalise to unseen identities.
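The motion sampler above builds on conditional flow matching. As an illustrative sketch only (not the paper's implementation; the linear interpolant and all variable names are our assumptions), the standard CFM training pair and regression target can be computed as:

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Linear-interpolant conditional flow matching:
    x_t = (1 - t) * x0 + t * x1, with target velocity u = x1 - x0."""
    xt = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return xt, target_velocity

def cfm_loss(predicted_velocity, target_velocity):
    # Mean squared error between the model's velocity field and the target.
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))   # noise samples
x1 = rng.normal(size=(4, 8))   # stand-in "motion codes" (illustrative data)
xt, u = cfm_training_pair(x0, x1, t=0.5)
loss = cfm_loss(u, u)          # a perfect velocity predictor gives zero loss
```

At inference, a learned velocity field is integrated from noise to a motion code in a few ODE steps, which is what makes such samplers efficient.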
https://arxiv.org/abs/2405.10272
The Learning-to-match (LTM) framework proves to be an effective inverse optimal transport approach for learning the underlying ground metric between two sources of data, facilitating subsequent matching. However, the conventional LTM framework faces scalability challenges, necessitating the use of the entire dataset each time the parameters of the ground metric are updated. In adapting LTM to the deep learning context, we introduce the mini-batch Learning-to-match (m-LTM) framework for audio-text retrieval problems. This framework leverages mini-batch subsampling and a Mahalanobis-enhanced family of ground metrics. Moreover, to cope with misaligned training data in practice, we propose a variant using partial optimal transport to mitigate the harm of misaligned pairs in the training data. We conduct extensive experiments on audio-text matching problems using three datasets: AudioCaps, Clotho, and ESC-50. Results demonstrate that our proposed method is capable of learning a rich and expressive joint embedding space, which achieves SOTA performance. Beyond this, the proposed m-LTM framework is able to close the modality gap between audio and text embeddings, surpassing both triplet and contrastive loss in the zero-shot sound event detection task on the ESC-50 dataset. Notably, our strategy of employing partial optimal transport with m-LTM demonstrates greater noise tolerance than contrastive loss, especially under varying noise ratios in the training data on the AudioCaps dataset. Our code is available at this https URL
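Two of the ingredients above can be sketched concretely: a Mahalanobis ground cost and an entropic (Sinkhorn) optimal-transport plan on a mini-batch with uniform marginals. This is our own simplified illustration under those assumptions; the paper's learned metric M, solver, and partial-OT variant are not shown.

```python
import numpy as np

def mahalanobis_cost(A, T, M):
    """Pairwise squared Mahalanobis distances between audio embeddings A
    (n x d) and text embeddings T (m x d) under a PSD matrix M."""
    diff = A[:, None, :] - T[None, :, :]          # (n, m, d)
    return np.einsum('nmd,de,nme->nm', diff, M, diff)

def sinkhorn(C, eps=0.1, n_iter=200):
    """Entropic optimal-transport plan on a mini-batch, uniform marginals."""
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-C / eps)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 4))
T = A.copy()                     # perfectly paired audio/text for illustration
P = sinkhorn(mahalanobis_cost(A, T, np.eye(4)))
```

With aligned pairs and M = I, the plan P concentrates mass on the diagonal, i.e. each audio item transports to its paired caption.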
https://arxiv.org/abs/2405.10084
We revisit the classical problem of multiclass classification with bandit feedback (Kakade, Shalev-Shwartz and Tewari, 2008), where each input classifies to one of $K$ possible labels and feedback is restricted to whether the predicted label is correct or not. Our primary inquiry is with regard to the dependency on the number of labels $K$, and whether $T$-step regret bounds in this setting can be improved beyond the $\smash{\sqrt{KT}}$ dependence exhibited by existing algorithms. Our main contribution is in showing that the minimax regret of bandit multiclass is in fact more nuanced, and is of the form $\smash{\widetilde{\Theta}\left(\min \left\{|\mathcal{H}| + \sqrt{T}, \sqrt{KT \log |{\mathcal{H}|}} \right\} \right) }$, where $\mathcal{H}$ is the underlying (finite) hypothesis class. In particular, we present a new bandit classification algorithm that guarantees regret $\smash{\widetilde{O}(|\mathcal{H}|+\sqrt{T})}$, improving over classical algorithms for moderately-sized hypothesis classes, and give a matching lower bound establishing tightness of the upper bounds (up to log-factors) in all parameter regimes.
https://arxiv.org/abs/2405.10027
Real driving-video dehazing poses a significant challenge due to the inherent difficulty in acquiring precisely aligned hazy/clear video pairs for effective model training, especially in dynamic driving scenarios with unpredictable weather conditions. In this paper, we propose a pioneering approach that addresses this challenge through a non-aligned regularization strategy. Our core concept involves identifying clear frames that closely match hazy frames, serving as references to supervise a video dehazing network. Our approach comprises two key components: reference matching and video dehazing. First, we introduce a non-aligned reference frame matching module, leveraging an adaptive sliding window to match high-quality reference frames from clear videos. The video dehazing component incorporates a flow-guided cosine attention sampler and deformable cosine attention fusion modules to enhance spatial multi-frame alignment and fuse the aligned information. To validate our approach, we collect a GoProHazy dataset, captured effortlessly with GoPro cameras in diverse rural and urban road environments. Extensive experiments demonstrate the superiority of the proposed method over current state-of-the-art methods in the challenging task of real driving-video dehazing. Project page.
https://arxiv.org/abs/2405.09996
We study the problem of online unweighted bipartite matching with $n$ offline vertices and $n$ online vertices where one wishes to be competitive against the optimal offline algorithm. While the classic RANKING algorithm of Karp et al. [1990] provably attains competitive ratio of $1-1/e > 1/2$, we show that no learning-augmented method can be both 1-consistent and strictly better than $1/2$-robust under the adversarial arrival model. Meanwhile, under the random arrival model, we show how one can utilize methods from distribution testing to design an algorithm that takes in external advice about the online vertices and provably achieves competitive ratio interpolating between any ratio attainable by advice-free methods and the optimal ratio of 1, depending on the advice quality.
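For reference, the classic RANKING baseline discussed above can be sketched as follows: draw one random permutation (rank) of the offline vertices, and match each arriving online vertex to its lowest-ranked free neighbor. The toy instance and variable names are our own illustration.

```python
import random

def ranking(offline, online_neighbors, seed=0):
    """RANKING (Karp et al., 1990): fix a random rank over offline vertices;
    each arriving online vertex takes its lowest-ranked unmatched neighbor."""
    rng = random.Random(seed)
    rank = {v: i for i, v in enumerate(rng.sample(offline, len(offline)))}
    matched = {}
    for u, nbrs in online_neighbors.items():
        free = [v for v in nbrs if v not in matched.values()]
        if free:
            matched[u] = min(free, key=rank.get)
    return matched

# A tiny bipartite instance (a 6-cycle) in which a perfect matching exists.
offline = ['a', 'b', 'c']
arrivals = {'u1': ['a', 'b'], 'u2': ['b', 'c'], 'u3': ['a', 'c']}
m = ranking(offline, arrivals)
```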
https://arxiv.org/abs/2405.09784
Recently, large language models (LLMs) have outperformed human experts in predicting the results of neuroscience experiments (Luo et al., 2024). What is the basis for this performance? One possibility is that statistical patterns in that specific scientific literature, as opposed to emergent reasoning abilities arising from broader training, underlie LLMs' performance. To evaluate this possibility, we trained a relatively small 124M-parameter GPT-2 model via next-word prediction on 1.3 billion tokens of domain-specific knowledge. Despite being orders of magnitude smaller than larger LLMs trained on trillions of tokens, the small models achieved expert-level performance in predicting neuroscience results. Small models trained on the neuroscience literature succeeded both when they were trained from scratch using a tokenizer specifically trained on neuroscience text and when the neuroscience literature was used to finetune a pretrained GPT-2. Our results indicate that expert-level performance may be attained even by small LLMs through domain-specific, auto-regressive training approaches.
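The auto-regressive objective referred to above is, at its core, the average next-token negative log-likelihood. A minimal numeric illustration (not tied to GPT-2 specifics; the toy vocabulary and uniform model are assumptions for demonstration):

```python
import numpy as np

def autoregressive_nll(token_ids, probs):
    """Average next-token negative log-likelihood, the (pre)training objective
    of GPT-style models. probs[i] is the model's distribution over the
    vocabulary after seeing tokens 0..i."""
    nll = [-np.log(probs[i][token_ids[i + 1]])
           for i in range(len(token_ids) - 1)]
    return float(np.mean(nll))

vocab = 4
tokens = [0, 1, 2, 3]
# A model that is maximally uncertain assigns 1/vocab to every token.
uniform = [np.full(vocab, 1 / vocab)] * (len(tokens) - 1)
loss = autoregressive_nll(tokens, uniform)   # equals log(4) for this model
```

Exponentiating this loss gives perplexity, which is why a uniform model over 4 tokens has perplexity exactly 4.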
https://arxiv.org/abs/2405.09395
While content-based image retrieval (CBIR) has been extensively studied in natural image retrieval, its application to medical images presents ongoing challenges, primarily due to the 3D nature of medical images. Recent studies have shown the potential use of pre-trained vision embeddings for CBIR in the context of radiology image retrieval. However, a benchmark for the retrieval of 3D volumetric medical images is still lacking, hindering the ability to objectively evaluate and compare the efficiency of proposed CBIR approaches in medical imaging. In this study, we extend previous work and establish a benchmark for region-based and multi-organ retrieval using the TotalSegmentator dataset (TS) with detailed multi-organ annotations. We benchmark embeddings derived from models pre-trained with supervision on medical images against embeddings derived from models pre-trained without supervision on non-medical images, covering 29 coarse and 104 detailed anatomical structures at the volume and region levels. We adopt a late interaction re-ranking method inspired by text matching for image retrieval, comparing it against the original method proposed for volume and region retrieval, and achieve a retrieval recall of 1.0 for diverse anatomical regions spanning a wide range of sizes. The findings and methodologies presented in this paper provide essential insights and benchmarks for the development and evaluation of CBIR approaches in the context of medical imaging.
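Late interaction re-ranking in the text-matching (ColBERT) style scores a query against a candidate by taking, for each query vector, its best match among the candidate's vectors and summing. The sketch below assumes cosine-similarity MaxSim scoring; the paper's exact scoring function may differ.

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: for each query vector, take its best cosine
    similarity against the candidate's vectors, then sum over the query."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return float((q @ d.T).max(axis=1).sum())

def rerank(query_vecs, candidates):
    """Re-rank candidate volumes/regions by MaxSim, best first."""
    scored = [(maxsim_score(query_vecs, c), i) for i, c in enumerate(candidates)]
    return [i for _, i in sorted(scored, reverse=True)]

rng = np.random.default_rng(2)
q = rng.normal(size=(3, 8))
good = q + 0.01 * rng.normal(size=(3, 8))   # near-duplicate of the query
bad = rng.normal(size=(3, 8))               # unrelated candidate
order = rerank(q, [bad, good])
```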
https://arxiv.org/abs/2405.09334
Despite the impressive performance of Multi-view Stereo (MVS) approaches given plenty of training samples, the performance degradation when generalizing to unseen domains has not been clearly explored yet. In this work, we focus on the domain generalization problem in MVS. To evaluate the generalization results, we build a novel MVS domain generalization benchmark including synthetic and real-world datasets. In contrast to conventional domain generalization benchmarks, we consider a more realistic but challenging scenario, where only one source domain is available for training. The MVS problem can be analogized back to the feature matching task, and maintaining robust feature consistency among views is an important factor for improving generalization performance. To address the domain generalization problem in MVS, we propose a novel MVS framework, namely RobustMVS. A DepthClustering-guided Whitening (DCW) loss is further introduced to preserve the feature consistency among different views, which decorrelates multi-view features from viewpoint-specific style information based on geometric priors from depth maps. The experimental results further show that our method achieves superior performance on the domain generalization benchmark.
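The idea behind a decorrelation ("whitening") loss can be sketched as penalizing off-diagonal entries of the feature correlation matrix. This simplified version omits the depth-clustering grouping and geometric priors that distinguish the DCW loss; data and names are our own illustration.

```python
import numpy as np

def whitening_loss(features):
    """Decorrelation penalty: squared distance of the feature correlation
    matrix from the identity (correlated feature channels are penalized)."""
    f = features - features.mean(axis=0, keepdims=True)
    cov = (f.T @ f) / (len(f) - 1)
    d = np.sqrt(np.diag(cov))
    corr = cov / np.outer(d, d)
    off = corr - np.eye(corr.shape[0])
    return float(np.sum(off ** 2))

rng = np.random.default_rng(8)
decorrelated = rng.normal(size=(500, 4))              # independent channels
x = rng.normal(size=(500, 1))
correlated = np.hstack([x, x + 0.01 * rng.normal(size=(500, 3))])
```

Minimizing such a penalty pushes feature channels toward mutual independence, which is one way to strip viewpoint-specific style information.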
https://arxiv.org/abs/2405.09131
We introduce BEVRender, a novel learning-based approach for the localization of ground vehicles in Global Navigation Satellite System (GNSS)-denied off-road scenarios. These environments are typically challenging for conventional vision-based state estimation due to the lack of distinct visual landmarks and the instability of vehicle poses. To address this, BEVRender generates high-quality local bird's eye view (BEV) images of the local terrain. Subsequently, these images are aligned with a geo-referenced aerial map via template-matching to achieve accurate cross-view registration. Our approach overcomes the inherent limitations of visual inertial odometry systems and the substantial storage requirements of image-retrieval localization strategies, which are susceptible to drift and scalability issues, respectively. Extensive experimentation validates BEVRender's advancement over existing GNSS-denied visual localization methods, demonstrating notable enhancements in both localization accuracy and update frequency. The code for BEVRender will be made available soon.
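Template matching of a rendered BEV image against a geo-referenced aerial map can be illustrated with zero-mean normalized cross-correlation over a brute-force sliding window. This is a generic sketch, not BEVRender's matcher; arrays and the search strategy are assumptions.

```python
import numpy as np

def ncc(a, b):
    """Zero-mean normalized cross-correlation of two equal-size patches."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def template_match(aerial, bev):
    """Slide the BEV image over the aerial map; return the top-left offset
    with the highest NCC score."""
    H, W = aerial.shape
    h, w = bev.shape
    best, best_pos = -np.inf, (0, 0)
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            s = ncc(aerial[y:y + h, x:x + w], bev)
            if s > best:
                best, best_pos = s, (y, x)
    return best_pos

rng = np.random.default_rng(3)
aerial = rng.normal(size=(20, 20))
bev = aerial[5:13, 7:15].copy()        # ground-truth offset is (5, 7)
pos = template_match(aerial, bev)
```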
https://arxiv.org/abs/2405.09001
We investigate the problem of pixelwise correspondence for deformable objects, namely cloth and rope, by comparing both classical and learning-based methods. We choose cloth and rope because they are traditionally some of the most difficult deformable objects to analytically model with their large configuration space, and they are meaningful in the context of robotic tasks like cloth folding, rope knot-tying, T-shirt folding, curtain closing, etc. The correspondence problem is heavily motivated in robotics, with wide-ranging applications including semantic grasping, object tracking, and manipulation policies built on top of correspondences. We present an exhaustive survey of existing classical methods for doing correspondence via feature-matching, including SIFT, SURF, and ORB, and two recently published learning-based methods including TimeCycle and Dense Object Nets. We make three main contributions: (1) a framework for simulating and rendering synthetic images of deformable objects, with qualitative results demonstrating transfer between our simulated and real domains, (2) a new learning-based correspondence method extending Dense Object Nets, and (3) a standardized comparison across state-of-the-art correspondence methods. Our proposed method provides a flexible, general formulation for learning temporally and spatially continuous correspondences for nonrigid (and rigid) objects. We report root mean squared error statistics for all methods and find that Dense Object Nets outperforms baseline classical methods for correspondence, and our proposed extension of Dense Object Nets performs similarly.
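The classical feature-matching recipe surveyed above (SIFT/SURF/ORB) typically pairs descriptors by brute-force nearest-neighbour search with Lowe's ratio test. A descriptor-agnostic sketch with synthetic descriptors (the data and names are illustrative):

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Brute-force nearest-neighbour matching with Lowe's ratio test:
    keep a match only if the best distance is clearly below the second best."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        j, k = np.argsort(dists)[:2]
        if dists[j] < ratio * dists[k]:
            matches.append((i, int(j)))
    return matches

rng = np.random.default_rng(4)
desc_b = rng.normal(size=(6, 16))                       # "destination" features
desc_a = desc_b[[2, 0, 5]] + 0.01 * rng.normal(size=(3, 16))  # noisy copies
pairs = match_descriptors(desc_a, desc_b)
```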
https://arxiv.org/abs/2405.08996
Current video summarization methods primarily depend on supervised computer vision techniques, which demand time-consuming manual annotations. Further, the annotations are often subjective, which makes this task more challenging. To address these issues, we analyzed the feasibility of transforming video summarization into a text summarization task and leveraging Large Language Models (LLMs) to boost video summarization. This paper proposes a novel self-supervised framework for video summarization guided by LLMs. Our method begins by generating captions for video frames, which are then synthesized into text summaries by LLMs. Subsequently, we measure the semantic distance between the frame captions and the text summary. Notably, we propose a novel loss function that optimizes our model according to the diversity of the video. Finally, the summarized video is generated by selecting the frames whose captions are similar to the text summary. Our model achieves competitive results against other state-of-the-art methods and paves a novel pathway in video summarization.
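Frame selection by caption-to-summary similarity can be sketched as a cosine-similarity top-k over caption embeddings. The embedding model and the diversity-aware loss are omitted; the vectors below are hypothetical stand-ins.

```python
import numpy as np

def select_frames(caption_embs, summary_emb, k=2):
    """Keep the k frames whose caption embeddings are closest (by cosine
    similarity) to the LLM-produced text-summary embedding."""
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    s = summary_emb / np.linalg.norm(summary_emb)
    sims = c @ s
    return sorted(np.argsort(sims)[::-1][:k].tolist())

summary = np.array([1.0, 0.0, 0.0])
captions = np.array([
    [0.9, 0.1, 0.0],   # on-topic frame
    [0.0, 1.0, 0.0],   # off-topic
    [0.8, 0.0, 0.2],   # on-topic frame
    [0.0, 0.0, 1.0],   # off-topic
])
keep = select_frames(captions, summary, k=2)
```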
https://arxiv.org/abs/2405.08890
In many applications, the demand arises for algorithms capable of aligning partially overlapping point sets while remaining invariant to the corresponding transformations. This research presents a method designed to meet such requirements through minimization of the objective function of the robust point matching (RPM) algorithm. First, we show that the RPM objective is a cubic polynomial. Then, through variable substitution, we transform the RPM objective to a quadratic function. Leveraging the convex envelope of bilinear monomials, we proceed to relax the resulting objective function, thus obtaining a lower bound problem that can be conveniently decomposed into distinct linear assignment and low-dimensional convex quadratic program components, both amenable to efficient optimization. Furthermore, a branch-and-bound (BnB) algorithm is devised, which solely branches over the transformation parameters, thereby boosting convergence rate. Empirical evaluations demonstrate better robustness of the proposed methodology against non-rigid deformation, positional noise, and outliers, particularly in scenarios where outliers remain distinct from inliers, when compared with prevailing state-of-the-art approaches.
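The convex envelope of a bilinear monomial w = xy over a box is the classical McCormick relaxation: the maximum of two linear under-estimators. A small numeric check of the under-estimator property (illustrative only; the paper's full decomposition into assignment and QP components is not shown):

```python
import numpy as np

def mccormick_lower(x, y, xl, xu, yl, yu):
    """Convex under-estimator of w = x*y on the box [xl, xu] x [yl, yu]:
    the max of the two McCormick inequalities
    w >= xl*y + yl*x - xl*yl and w >= xu*y + yu*x - xu*yu."""
    return max(xl * y + yl * x - xl * yl,
               xu * y + yu * x - xu * yu)

# The relaxation never exceeds the true bilinear value on the box,
# and it is tight at the corners.
gaps = []
for x in np.linspace(-1, 2, 7):
    for y in np.linspace(0, 3, 7):
        gaps.append(x * y - mccormick_lower(x, y, -1, 2, 0, 3))
gap = min(gaps)                                # always >= 0
corner = mccormick_lower(2, 3, -1, 2, 0, 3)    # equals 2*3 at a corner
```

In a branch-and-bound scheme, shrinking the box (here, over transformation parameters) tightens these bounds, which is what drives convergence.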
https://arxiv.org/abs/2405.08589
This paper addresses the problem of pathological lung segmentation, a significant challenge in medical image analysis, particularly pronounced in cases of peripheral opacities (severe fibrosis and consolidation) because of the textural similarity between lung tissue and surrounding areas. To overcome these challenges, this paper emphasizes the use of CycleGAN for unpaired image-to-image translation, in order to provide an augmentation method able to generate fake pathological images matching an existing ground truth. Although previous studies have employed CycleGAN, they often neglect the challenge of shape deformation, which is crucial for accurate medical image segmentation. Our work introduces an innovative strategy that incorporates additional loss functions. Specifically, it proposes an L1 loss based on the lung surrounding, whose shape is constrained to remain unchanged at the transition from the healthy to the pathological domain. The lung surrounding is derived from ground-truth lung masks available in the healthy domain. Furthermore, preprocessing steps, such as cropping based on rib/vertebra locations, are applied to refine the input to the CycleGAN, ensuring that the network focuses on the lung region. This is essential to avoid extraneous biases, such as the zoom-effect bias, which can divert attention from the main task. The method is applied to enhance, in a semi-supervised manner, the lung segmentation process, employing a U-Net model trained with on-the-fly data augmentation that incorporates synthetic pathological tissue generated by the CycleGAN model. Preliminary results from this research demonstrate significant qualitative and quantitative improvements, setting a new benchmark in the field of pathological lung segmentation. Our code is available at this https URL
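The surrounding-shape constraint can be illustrated as an L1 penalty evaluated only outside the lung mask: translation may alter texture inside the lungs but is penalized for changing the anatomy around them. The toy arrays below are hypothetical, not the paper's data.

```python
import numpy as np

def surrounding_l1(healthy, fake_pathological, lung_mask):
    """L1 loss restricted to the lung surrounding (mask == 0): penalizes
    shape/intensity changes outside the lungs during domain translation."""
    outside = lung_mask == 0
    return float(np.abs(healthy[outside] - fake_pathological[outside]).mean())

img = np.zeros((4, 4))
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1                        # lungs occupy the center
fake = img.copy()
fake[1:3, 1:3] = 0.8                      # change only inside the lungs
unchanged_outside = surrounding_l1(img, fake, mask)   # zero penalty
fake[0, 0] = 0.5                          # now also deform the surrounding
deformed_outside = surrounding_l1(img, fake, mask)    # positive penalty
```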
https://arxiv.org/abs/2405.08556
Multi-Head Attention (MHA) is a key component of Transformer. In MHA, attention heads work independently, causing problems such as low-rank bottleneck of attention score matrices and head redundancy. We propose Dynamically Composable Multi-Head Attention (DCMHA), a parameter and computation efficient attention architecture that tackles the shortcomings of MHA and increases the expressive power of the model by dynamically composing attention heads. At the core of DCMHA is a $\it{Compose}$ function that transforms the attention score and weight matrices in an input-dependent way. DCMHA can be used as a drop-in replacement of MHA in any transformer architecture to obtain the corresponding DCFormer. DCFormer significantly outperforms Transformer on different architectures and model scales in language modeling, matching the performance of models with ~1.7x-2.0x compute. For example, DCPythia-6.9B outperforms open source Pythia-12B on both pretraining perplexity and downstream task evaluation. The code and models are available at this https URL.
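The effect of composing heads can be sketched as mixing the per-head attention score maps with a head-to-head matrix. In DCMHA this composition is input-dependent; the static sketch below omits that and uses fixed mixing matrices purely for illustration.

```python
import numpy as np

def compose(scores, W):
    """Mix H per-head attention score maps with an H x H composition matrix.
    scores: (H, T, T); W: (H, H); returns composed (H, T, T) scores."""
    return np.einsum('gh,htq->gtq', W, scores)

H, T = 4, 3
rng = np.random.default_rng(5)
scores = rng.normal(size=(H, T, T))

identity_out = compose(scores, np.eye(H))        # identity = vanilla MHA
mixed = compose(scores, np.full((H, H), 1 / H))  # uniform head averaging
```

With the identity matrix, heads stay independent as in standard MHA; any other matrix lets heads share information, which is the source of the added expressive power.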
https://arxiv.org/abs/2405.08553
Class-incremental learning (CIL) has emerged as a means to learn new classes incrementally without catastrophic forgetting of previous classes. Recently, CIL has undergone a paradigm shift towards dynamic architectures due to their superior performance. However, these models are still limited in the following aspects: (i) Data augmentation (DA), which is tightly coupled with CIL, remains under-explored in dynamic architecture scenarios. (ii) Feature representation. The discriminativeness of dynamic features is sub-optimal, leaving room for refinement. (iii) Classifier. The misalignment between dynamic features and the classifier constrains the capabilities of the model. To tackle the aforementioned drawbacks, we propose the Dynamic Feature Learning and Matching (DFLM) model in this paper from the above three perspectives. Specifically, we first introduce class weight information and non-stationary functions to extend the mix DA method for dynamically adjusting the focus on memory during training. Then, a von Mises-Fisher (vMF) classifier is employed to effectively model the dynamic feature distribution and implicitly learn their discriminative properties. Finally, a matching loss is proposed to facilitate the alignment between the learned dynamic features and the classifier by minimizing the distribution distance. Extensive experiments on CIL benchmarks validate that our proposed model achieves significant performance improvements over existing methods.
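A vMF classifier head reduces, up to normalizing constants, to concentration-scaled cosine similarity between L2-normalized features and class mean directions. A minimal sketch (the concentration value and data are arbitrary assumptions, not the paper's settings):

```python
import numpy as np

def vmf_logits(features, class_means, kappa=16.0):
    """von Mises-Fisher classifier head: logits are kappa-scaled cosine
    similarities between normalized features and class mean directions."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    mu = class_means / np.linalg.norm(class_means, axis=1, keepdims=True)
    return kappa * (f @ mu.T)

means = np.array([[1.0, 0.0], [0.0, 1.0]])   # two class directions
feats = np.array([[2.0, 0.1], [0.1, 3.0]])   # unnormalized dynamic features
pred = vmf_logits(feats, means).argmax(axis=1)
```

Because everything lives on the unit sphere, a matching loss between feature and classifier distributions amounts to pulling each class's feature directions toward its mean direction.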
https://arxiv.org/abs/2405.08533
Image matching remains challenging in scenes with large viewpoint or illumination changes, or with low texture. In this paper, we propose a Transformer-based pseudo-3D image matching method. It upgrades the 2D features extracted from the source image to 3D features with the help of a reference image and matches them to the 2D features extracted from the destination image through coarse-to-fine 3D matching. Our key discovery is that by introducing the reference image, the source image's fine points are screened and their feature descriptors are further enriched from 2D to 3D, which improves the matching performance with the destination image. Experimental results on multiple datasets show that the proposed method achieves the state of the art on the tasks of homography estimation, pose estimation, and visual localization, especially in challenging scenes.
https://arxiv.org/abs/2405.08434
From a feature-matching perspective, optical flow estimation for event cameras involves identifying event correspondences by comparing feature similarity across accompanying event frames. In this work, we introduce an effective and robust high-dimensional (HD) feature descriptor for event frames, utilizing Vector Symbolic Architectures (VSA). The topological similarity among neighboring variables within VSA contributes to the enhanced representation similarity of feature descriptors for flow-matching points, while its structured symbolic representation capacity facilitates feature fusion from both event polarities and multiple spatial scales. Based on this HD feature descriptor, we propose a novel feature matching framework for event-based optical flow, encompassing both a model-based method (VSA-Flow) and a self-supervised learning method (VSA-SM). In VSA-Flow, accurate optical flow estimation validates the effectiveness of HD feature descriptors. In VSA-SM, a novel similarity maximization method based on the HD feature descriptor is proposed to learn optical flow in a self-supervised way from events alone, eliminating the need for auxiliary grayscale images. Evaluation results demonstrate that our VSA-based method achieves superior accuracy in comparison to both model-based and self-supervised learning methods on the DSEC benchmark, while remaining competitive with both on the MVSEC benchmark. This contribution marks a significant advancement in event-based optical flow within the feature matching methodology.
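Basic VSA operations — binding by element-wise product and bundling by majority sign — can be sketched as below. The dimensionality, role-vector names, and fusion recipe are our own illustration of the general technique, not the paper's descriptor.

```python
import numpy as np

def bind(a, b):
    """Bind two bipolar hypervectors (element-wise product, self-inverse)."""
    return a * b

def bundle(vectors):
    """Superpose hypervectors; the sign of the sum stays similar to each input."""
    return np.sign(np.sum(vectors, axis=0))

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(6)
d = 2048
pos = rng.choice([-1, 1], size=d)        # positive-polarity channel (assumed)
neg = rng.choice([-1, 1], size=d)        # negative-polarity channel (assumed)
scale = rng.choice([-1, 1], size=d)      # a spatial-scale role vector (assumed)
desc = bundle([bind(pos, scale), neg])   # fused descriptor
unrelated = rng.choice([-1, 1], size=d)  # a random, unrelated hypervector
```

Bundled descriptors stay measurably similar to each fused component while remaining nearly orthogonal to unrelated vectors, which is what makes similarity comparison between descriptors meaningful.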
https://arxiv.org/abs/2405.08300
An authentic 3D hand avatar that captures identifiable information, such as hand shape and texture, is necessary for immersive experiences in AR/VR. In this paper, we present a universal hand model (UHM), which 1) can universally represent high-fidelity 3D hand meshes of arbitrary identities (IDs) and 2) can be adapted to each person with a short phone scan for an authentic hand avatar. For effective universal hand modeling, we perform tracking and modeling at the same time, while previous 3D hand models perform them separately. The conventional separate pipeline suffers from accumulated errors in the tracking stage, which cannot be recovered in the modeling stage. Ours, in contrast, does not suffer from such accumulated errors while having a much more concise overall pipeline. We additionally introduce a novel image matching loss function to address skin sliding during tracking and modeling, an issue that existing works have not focused on much. Finally, using learned priors from our UHM, we effectively adapt our UHM to each person's short phone scan for an authentic hand avatar.
https://arxiv.org/abs/2405.07933
Many natural language processing systems operate over tokenizations of text to address the open-vocabulary problem. In this paper, we give and analyze an algorithm for the efficient construction of deterministic finite automata designed to operate directly on tokenizations produced by the popular byte pair encoding technique. This makes it possible to apply many existing techniques and algorithms to the tokenized case, such as pattern matching, equivalence checking of tokenization dictionaries, and composing tokenized languages in various ways.
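One way to see a token-level automaton: lift a character-level DFA so that each BPE token is consumed in a single transition. The naive construction below is only illustrative; the paper's contribution is building such automata efficiently. The toy DFA, alphabet, and token set are assumptions.

```python
def lift_to_tokens(delta, tokens, n_states=2):
    """Lift a character-level DFA transition function to token level:
    each token moves the automaton through all of its characters at once.
    (Naive construction; assumes states are 0..n_states-1.)"""
    def run_token(state, tok):
        for ch in tok:
            state = delta(state, ch)
        return state
    return {(s, t): run_token(s, t) for s in range(n_states) for t in tokens}

# Char-level DFA over {a, b}: state 1 iff the string contains an 'a'.
delta = lambda s, ch: 1 if (s == 1 or ch == 'a') else s
token_delta = lift_to_tokens(delta, ['ab', 'bb', 'a', 'b'])

def accepts(token_seq, start=0, accepting={1}):
    state = start
    for t in token_seq:
        state = token_delta[(state, t)]
    return state in accepting

hit = accepts(['bb', 'ab'])     # underlying string "bbab" contains an 'a'
miss = accepts(['bb', 'b'])     # underlying string "bbb" does not
```

Once the automaton operates on tokens directly, standard automata operations (intersection, equivalence checking, pattern matching) apply to tokenized text without detokenizing.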
https://arxiv.org/abs/2405.07671
Object detection techniques for Unmanned Aerial Vehicles (UAVs) rely on Deep Neural Networks (DNNs), which are vulnerable to adversarial attacks. Nonetheless, existing algorithms in the UAV domain pay very little attention to the naturalness of the adversarial patches they generate. Moreover, imposing constraints directly on adversarial patches makes it difficult to generate patches that appear natural to the human eye while ensuring a high attack success rate. We notice that patches look natural when their overall color is consistent with the environment. Therefore, we propose a new method named Environmental Matching Attack (EMA) to address the issue of optimizing the adversarial patch under color constraints. To the best of our knowledge, this paper is the first to consider natural patches in the UAV domain. The EMA method exploits the strong prior knowledge of a pretrained Stable Diffusion model to guide the optimization direction of the adversarial patch, where text guidance can restrict the color of the patch. To better match the environment, the contrast and brightness of the patch are appropriately adjusted. Instead of optimizing the adversarial patch itself, we optimize an adversarial perturbation patch initialized to zero, so that the model can better trade off attack performance and naturalness. Experiments conducted on the DroneVehicle and Carpk datasets show that our work reaches nearly the same attack performance in the digital attack (no greater than 2 in mAP$\%$), surpasses the baseline method in specific physical scenarios, and exhibits a significant advantage in naturalness, both in visualization and in color difference from the environment.
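The contrast/brightness adjustment step can be sketched as matching the patch's mean and standard deviation to the surrounding environment region. This is a simple moment-matching assumption for illustration, not necessarily the paper's exact procedure.

```python
import numpy as np

def match_environment(patch, env):
    """Shift/scale the patch so its brightness (mean) and contrast (std)
    match the surrounding environment region, then clip to valid range."""
    p = (patch - patch.mean()) / (patch.std() + 1e-8)
    return np.clip(p * env.std() + env.mean(), 0.0, 1.0)

rng = np.random.default_rng(7)
patch = rng.uniform(0.6, 1.0, size=(8, 8))     # bright adversarial patch
env = rng.uniform(0.1, 0.4, size=(16, 16))     # darker environment region
adjusted = match_environment(patch, env)
```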
https://arxiv.org/abs/2405.07595