How effectively can LLM-based AI assistants utilize their memory (context) to perform various tasks? Traditional data benchmarks, which are often manually crafted, suffer from several limitations: they are static, susceptible to overfitting, difficult to interpret, and lack actionable insights--failing to pinpoint the specific capabilities a model lacks when it does not pass a test. In this paper, we present a framework for automatically generating a comprehensive set of tests to evaluate models' abilities to use their memory effectively. Our framework extends the range of capability tests beyond the commonly explored (passkey, key-value, needle in the haystack) search, a dominant focus in the literature. Specifically, we evaluate models on atomic tasks such as searching, recalling, editing, matching, comparing information in context memory, and performing basic operations when inputs are structured into distinct blocks, simulating real-world data. Additionally, we design composite tests to investigate the models' ability to maintain state while operating on memory. Our benchmark enables an interpretable, detailed assessment of memory capabilities of LLMs.
基于大型语言模型(LLM)的AI助手如何有效地利用其记忆(上下文)来执行各种任务?传统的数据基准测试通常由人工创建,存在若干局限性:它们是静态的、容易过拟合、难以解释且缺乏可操作见解——无法在模型未能通过测试时确定具体缺失的能力。在这篇论文中,我们提出了一种框架,用于自动生成一套全面的测试来评估模型利用其记忆的有效能力。我们的框架扩展了常见探索范围之外的能力测试(如密码查找、键值对搜索和大海捞针),这些是文献中的主要关注点。特别地,我们在结构化成不同块的输入上评估模型,执行原子任务,例如搜索、回忆、编辑、匹配和比较上下文记忆中的信息,并在模拟现实世界数据的情况下进行基本操作。此外,我们设计了合成测试来调查模型在处理内存时保持状态的能力。我们的基准测试使LLM的记忆能力能够得到可解释且详细的评估。
https://arxiv.org/abs/2502.03358
Purpose: To develop and evaluate a deep learning-based method for fully automated myocardial infarct segmentation. Materials and Methods: For this retrospective study, a cascaded framework of two- and three-dimensional convolutional neural networks (CNNs), specialized in identifying ischemic myocardial scars on late gadolinium enhancement (LGE) cardiac magnetic resonance (CMR) images, was trained on an in-house training dataset consisting of 144 examinations. On a separate test dataset from the same institution, including images from 152 examinations obtained between 2021 and 2023, a quantitative comparison between artificial intelligence (AI)-based segmentations and manual segmentations was performed. Further, two CMR experts qualitatively assessed the segmentation accuracy of both human and AI-generated contours in a blinded experiment. Results: Excellent agreement was found between manually and automatically calculated infarct volumes ($\rho_c$ = 0.9). In the qualitative evaluation, the experts rated the AI-based segmentations as better representing the actual extent of infarction significantly more often than the human-based measurements (p < 0.001; 33.4% AI, 25.1% human, 41.5% equal). In contrast, for segmentation of microvascular obstruction (MVO), manual measurements were still preferred (11.3% AI, 55.6% human, 33.1% equal). Conclusion: This fully automated segmentation pipeline enables CMR infarct size to be calculated in a very short time and without requiring any pre-processing of the input images while matching the segmentation quality of trained human observers. In a blinded experiment, experts preferred automated infarct segmentations more often than manual segmentations, paving the way for a potential clinical application.
目的:开发并评估一种基于深度学习的全自动心肌梗死分割方法。 材料和方法:在这项回顾性研究中,一个由二维和三维卷积神经网络(CNN)组成的级联框架在包含144例检查的内部训练数据集上进行了训练,该框架专门用于在延迟钆增强(LGE)心脏磁共振(CMR)图像上识别缺血性心肌瘢痕。在来自同一机构的独立测试数据集(包含2021年至2023年间获取的152例检查的图像)上,对基于人工智能(AI)的分割与手动分割进行了定量比较。此外,在一项盲法实验中,两位CMR专家对人工轮廓和AI生成轮廓的分割准确性进行了定性评估。 结果:手动与自动计算的梗死体积之间具有极好的一致性($\rho_c$ = 0.9)。定性评估显示,专家认为AI分割比人工测量更能代表实际梗死范围的情况显著更多(p < 0.001;AI更优33.4%,人工更优25.1%,两者相当41.5%)。相反,在微血管阻塞(MVO)的分割方面,手动测量仍然更受青睐(AI更优11.3%,人工更优55.6%,两者相当33.1%)。 结论:该全自动分割流程可以在极短的时间内计算CMR梗死面积,且无需对输入图像进行任何预处理,同时达到训练有素的人类观察者的分割质量。在盲法实验中,专家选择自动梗死分割的频率高于手动分割,这为潜在的临床应用铺平了道路。
https://arxiv.org/abs/2502.03272
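The agreement statistic $\rho_c$ reported in the infarct segmentation abstract above is conventionally Lin's concordance correlation coefficient; the abstract does not define it, so that reading is an assumption. A minimal sketch of that coefficient follows. The function name `concordance_ccc` and the sample volumes are hypothetical and are not data from the study.

```python
from statistics import fmean


def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient for paired measurements."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    # Population (1/n) moments; some implementations use 1/(n-1) instead.
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Unlike Pearson's r, the denominator penalizes a shift between the means.
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


# Hypothetical infarct volumes in mL (illustration only).
manual = [10.0, 20.0, 30.0, 40.0]
automatic = [12.0, 19.0, 33.0, 41.0]
print(round(concordance_ccc(manual, automatic), 3))
```

A value of 1 requires the two measurement series to coincide, so a constant offset between manual and automatic volumes lowers the coefficient even when the Pearson correlation is perfect.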
Understanding the textual components of resumes and job postings is critical for improving job-matching accuracy and optimizing job search systems in online recruitment platforms. However, existing works primarily focus on analyzing individual components within this information, requiring multiple specialized tools to analyze each aspect. Such disjointed methods could potentially hinder overall generalizability in recruitment-related text processing. Therefore, we propose a unified sentence encoder that uses a multi-task dual-encoder framework to jointly learn multiple components within a single encoder. The results show that our method outperforms other state-of-the-art models, despite its smaller model size. Moreover, we propose a novel metric, Language Bias Kullback-Leibler Divergence (LBKL), to evaluate language bias in the encoder, demonstrating significant bias reduction and superior cross-lingual performance.
理解简历和工作说明中的文本组件对于提高招聘匹配的准确性以及优化在线招聘平台上的求职系统至关重要。然而,现有研究主要集中在分析这些信息中的单个组成部分上,并需要使用多个专门工具来分别处理每个方面。这种分散的方法可能会阻碍招聘信息处理的整体适用性。因此,我们提出了一种统一的句子编码器,该编码器采用多任务双编码框架,将多种组件合并到一个统一的句子编码器中进行联合学习。结果显示,尽管模型规模较小,我们的方法在性能上优于其他最先进的模型。此外,我们还提出了一个新的评估语言偏见的标准——语言偏差Kullback-Leibler散度(LBKL),并通过显著减少语言偏见和卓越的语言间表现来证明其有效性。
https://arxiv.org/abs/2502.03220
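The abstract above names its bias metric after the Kullback-Leibler divergence but does not say how LBKL is constructed. The sketch below therefore shows only the generic discrete KL divergence that the name refers to, not the paper's metric; the function name `kl_divergence` and the example distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # the term 0 * log(0 / q_i) is taken to be 0
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical example: how far a skewed distribution is from a uniform one.
uniform = [0.5, 0.5]
skewed = [0.25, 0.75]
print(kl_divergence(uniform, skewed))
```

The divergence is zero only when the two distributions are identical, and it is not symmetric in its arguments, which matters when choosing which distribution plays the reference role.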
The burgeoning field of video-text retrieval has witnessed significant advancements with the advent of deep learning. However, the challenge of matching text and video persists due to inadequate textual descriptions of videos. The substantial information gap between the two modalities hinders a comprehensive understanding of videos, resulting in ambiguous retrieval results. While rewriting methods based on large language models have been proposed to broaden text expressions, carefully crafted prompts are essential to ensure the reasonableness and completeness of the rewritten texts. This paper proposes an automatic caption enhancement method that enhances expression quality and mitigates empiricism in augmented captions through self-learning. Additionally, an expertized caption selection mechanism is designed and introduced to customize augmented captions for each video, facilitating video-text matching. Our method is entirely data-driven, which not only dispenses with heavy data collection and computation workload but also improves self-adaptability by circumventing lexicon dependence and introducing personalized matching. The superiority of our method is validated by state-of-the-art results on various benchmarks, specifically achieving Top-1 recall accuracy of 68.5% on MSR-VTT, 68.1% on MSVD, and 62.0% on DiDeMo.
随着深度学习的出现,视频文本检索这一新兴领域取得了显著进展。然而,由于对视频的文字描述不足,匹配文字和视频仍然是一个挑战。这种模态之间的信息差距阻碍了对视频的全面理解,并导致模糊的检索结果。虽然基于大型语言模型的重写方法已被提出以扩展文本表达方式,但是精心设计的提示对于确保重写文本的合理性与完整性至关重要。本文提出了一种自动字幕增强方法,该方法通过自我学习来提高表达质量并减少扩充字幕中的经验主义倾向。此外,还设计和引入了专家级字幕选择机制,以针对每个视频定制增强后的字幕,从而促进视频文字匹配。 我们的方法完全基于数据驱动,不仅消除了繁重的数据收集和计算工作负担,而且还通过避免词汇依赖并引入个性化匹配来提高自我适应性。我们在各种基准测试中验证了我们方法的优越性,特别是在MSR-VTT、MSVD和DiDeMo上分别实现了68.5%、68.1%和62.0%的Top-1检索准确率。
https://arxiv.org/abs/2502.02885
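The Top-1 recall figures quoted in the video-text retrieval abstract above are usually computed from a text-to-video similarity matrix. A minimal sketch of that computation follows, assuming that text query `i` is paired with video `i`; the function name `recall_at_1` and the toy scores are hypothetical and unrelated to the reported benchmarks.

```python
def recall_at_1(similarity):
    """Fraction of text queries whose highest-scoring video is the paired one.

    similarity[i][j] is the score between text i and video j, and the
    ground-truth match for text i is assumed to be video i.
    """
    if not similarity:
        raise ValueError("similarity matrix must not be empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Index of the best-scoring video for this query (first one on ties).
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 score matrix: queries 0 and 2 retrieve their own video, query 1 does not.
scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_1(scores))
```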
We present lightweight flow matching multilingual text-to-speech (TTS) systems for Ojibwe, Mi'kmaq, and Maliseet, three Indigenous languages in North America. Our results show that training a multilingual TTS model on three typologically similar languages can improve the performance over monolingual models, especially when data are scarce. Attention-free architectures are highly competitive with self-attention architectures while offering higher memory efficiency. Our research not only advances technical development for the revitalization of low-resource languages but also highlights the cultural gap in human evaluation protocols, calling for a more community-centered approach to human evaluation.
我们介绍了面向北美三种原住民语言(奥吉布瓦语、密克马克语和马利塞特语)的轻量级流匹配多语言文本转语音(TTS)系统。研究结果显示,在三种类型学上相似的语言上训练一个多语言TTS模型可以取得优于单语言模型的性能,尤其是在数据稀缺时。无注意力架构与自注意力架构相比极具竞争力,同时内存效率更高。我们的研究不仅推进了低资源语言复兴的技术发展,还凸显了人工评估方案中的文化差距,呼吁采用更加以社区为中心的人工评估方法。
https://arxiv.org/abs/2502.02703
Large language models (LLMs) perform remarkably well on tabular datasets in zero- and few-shot settings, since they can extract meaning from natural language column headers that describe features and labels. Similarly, TabPFN, a recent non-LLM transformer pretrained on numerous tables for in-context learning, has demonstrated excellent performance for dataset sizes up to a thousand samples. In contrast, gradient-boosted decision trees (GBDTs) are typically trained from scratch on each dataset without benefiting from pretraining data and must learn the relationships between columns from their entries alone since they lack natural language understanding. LLMs and TabPFN excel on small tabular datasets where a strong prior is essential, yet they are not competitive with GBDTs on medium or large datasets, since their context lengths are limited. In this paper, we propose a simple and lightweight approach for fusing large language models and TabPFN with gradient-boosted decision trees, which allows scalable GBDTs to benefit from the natural language capabilities and pretraining of transformers. We name our fusion methods LLM-Boost and PFN-Boost, respectively. While matching or surpassing the performance of the transformer at sufficiently small dataset sizes and GBDTs at sufficiently large sizes, LLM-Boost and PFN-Boost outperform both standalone components on a wide range of dataset sizes in between. We demonstrate state-of-the-art performance against numerous baselines and ensembling algorithms. We find that PFN-Boost achieves the best average performance among all methods we test for all but very small dataset sizes. We release our code at this http URL.
大型语言模型(LLMs)在零样本和少样本设置下的表格数据集上表现非常出色,因为它们可以从描述特征和标签的自然语言列标题中提取含义。类似地,TabPFN是最近提出的一种非LLM的Transformer模型,它在大量表格上进行了预训练以实现上下文学习,并在样本量不超过一千的数据集上表现出色。相比之下,梯度提升决策树(GBDTs)通常在每个数据集上从零开始训练,无法受益于预训练数据;由于缺乏自然语言理解能力,它们必须仅通过列中的条目来学习各列之间的关系。 LLMs和TabPFN在小规模表格数据集上表现出色,因为在这种情况下强先验至关重要。然而,在中等或大规模数据集上,它们无法与GBDTs竞争,原因在于其上下文长度有限。在这篇论文中,我们提出了一种简单而轻量级的方法,将大型语言模型和TabPFN与梯度提升决策树相融合,使可扩展的GBDTs能够受益于Transformer的自然语言能力和预训练。 我们将这两种融合方法分别命名为LLM-Boost和PFN-Boost。在数据集足够小时,LLM-Boost和PFN-Boost的表现可匹配或超越Transformer;在数据集足够大时,可匹配或超越GBDTs;而在介于两者之间的广泛数据集规模上,它们优于任一单独组件。 我们在与众多基线和集成算法的比较中展示了最先进的性能。我们发现,除非常小的数据集规模外,PFN-Boost在我们测试的所有方法中取得了最佳的平均表现。 我们的代码发布于此 http URL。
https://arxiv.org/abs/2502.02672
Effective self-supervised learning (SSL) techniques have been key to unlocking large datasets for representation learning. While many promising methods have been developed using online corpora and captioned photographs, their application to scientific domains, where data encodes highly specialized knowledge, remains in its early stages. We present a self-supervised masked modeling framework for 3D particle trajectory analysis in Time Projection Chambers (TPCs). These detectors produce globally sparse (<1% occupancy) but locally dense point clouds, capturing meter-scale particle trajectories at millimeter resolution. Starting with PointMAE, this work proposes volumetric tokenization to group sparse ionization points into resolution-agnostic patches, as well as an auxiliary energy infilling task to improve trajectory semantics. This approach -- which we call Point-based Liquid Argon Masked Autoencoder (PoLAr-MAE) -- achieves 99.4% track and 97.7% shower classification F-scores, matching that of supervised baselines without any labeled data. While the model learns rich particle trajectory representations, it struggles with sub-token phenomena like overlapping or short-lived particle trajectories. To support further research, we release PILArNet-M -- the largest open LArTPC dataset (1M+ events, 5.2B labeled points) -- to advance SSL in high energy physics (HEP). Project site: this https URL
有效的自监督学习(SSL)技术一直是利用大型数据集进行表示学习的关键。尽管许多有前景的方法已经借助在线语料库和带说明文字的照片得到了发展,但它们在科学领域(其数据编码了高度专业化的知识)中的应用仍处于早期阶段。我们提出了一种用于时间投影室(TPC)中3D粒子轨迹分析的自监督掩码建模框架。这些探测器产生全局稀疏(<1%占用率)但局部密集的点云,以毫米级分辨率捕捉米级尺度的粒子轨迹。 本工作以PointMAE为起点,提出了体积标记化(volumetric tokenization),将稀疏的电离点分组为与分辨率无关的块,并引入了一个辅助的能量填充任务来改进轨迹语义。我们称这种方法为基于点的液氩掩码自编码器(PoLAr-MAE),它在没有任何标注数据的情况下,达到了99.4%的径迹(track)和97.7%的簇射(shower)分类F分数,与监督基线相当。 尽管该模型能够学习到丰富的粒子轨迹表示,但它难以处理重叠或短寿命粒子轨迹等亚标记(sub-token)现象。为了支持进一步的研究,我们发布了PILArNet-M,即最大的开放LArTPC数据集(超过100万个事件,52亿个标注点),以推动高能物理(HEP)领域的自监督学习。 项目网站:此 https URL
https://arxiv.org/abs/2502.02558
We present flow Q-learning (FQL), a simple and performant offline reinforcement learning (RL) method that leverages an expressive flow-matching policy to model arbitrarily complex action distributions in data. Training a flow policy with RL is a tricky problem, due to the iterative nature of the action generation process. We address this challenge by training an expressive one-step policy with RL, rather than directly guiding an iterative flow policy to maximize values. This way, we can completely avoid unstable recursive backpropagation, eliminate costly iterative action generation at test time, yet still mostly maintain expressivity. We experimentally show that FQL leads to strong performance across 73 challenging state- and pixel-based OGBench and D4RL tasks in offline RL and offline-to-online RL. Project page: this https URL
我们介绍了流Q学习(FQL),这是一种简单而高效的离线强化学习(RL)方法,利用富有表达力的流匹配策略来建模数据中任意复杂的动作分布。由于动作生成过程具有迭代性,使用RL训练流策略是一个棘手的问题。为了解决这个问题,我们通过RL训练一个富有表达力的一步策略,而不是直接引导迭代式流策略去最大化价值。这样一来,我们可以完全避免不稳定的递归反向传播,消除测试时昂贵的迭代动作生成,同时仍能在很大程度上保持表达能力。 我们通过实验表明,FQL在离线RL和离线到在线RL中的73个具有挑战性的基于状态和基于像素的OGBench与D4RL任务上表现出色。项目页面:此 https URL
https://arxiv.org/abs/2502.02538
We propose Electrostatic Field Matching (EFM), a novel method that is suitable for both generative modeling and distribution transfer tasks. Our approach is inspired by the physics of an electrical capacitor. We place source and target distributions on the capacitor plates and assign them positive and negative charges, respectively. We then learn the electrostatic field of the capacitor using a neural network approximator. To map the distributions to each other, we start at one plate of the capacitor and move the samples along the learned electrostatic field lines until they reach the other plate. We theoretically justify that this approach provably yields the distribution transfer. In practice, we demonstrate the performance of our EFM in toy and image data experiments.
我们提出了一种新颖的方法——静电场匹配(EFM),该方法适用于生成模型和分布转换任务。我们的方法受电容器物理原理的启发。我们将源分布和目标分布在电容器的两块板上放置,并分别赋予它们正负电荷。然后,我们利用神经网络近似器来学习电容器的静电场。为了将这些分布相互映射,我们从电容器的一端开始,沿已学得的静电场线移动样本,直到它们到达另一端。理论上,我们证明了这种方法可以有效地实现分布转换。在实践中,我们在玩具数据和图像数据实验中展示了EFM的有效性。
https://arxiv.org/abs/2502.02367
Obtaining enough high-quality correspondences is crucial for robust registration. Existing correspondence refinement methods mostly follow the paradigm of outlier removal, which either fails to correctly identify the accurate correspondences under extreme outlier ratios or selects too few correct correspondences to support robust registration. To address this challenge, we propose a novel approach named Regor, a progressive correspondence regenerator that generates higher-quality matches while remaining robust to numerous outliers. In each iteration, we first apply prior-guided local grouping and generalized mutual matching to generate the local region correspondences. A powerful center-aware three-point consistency is then presented to achieve local correspondence correction, instead of removal. Further, we employ global correspondence refinement to obtain accurate correspondences from a global perspective. Through progressive iterations, this process yields a large number of high-quality correspondences. Extensive experiments on both indoor and outdoor datasets demonstrate that the proposed Regor significantly outperforms existing outlier removal techniques. More critically, our approach obtains 10 times more correct correspondences than outlier removal methods. As a result, our method is able to achieve robust registration even with weak features. The code will be released.
获取足够高质量的对应关系对于稳健配准至关重要。现有的对应关系改进方法大多遵循异常值去除范式,要么在极端的异常比情况下无法正确识别准确的对应关系,要么选择太少正确的对应关系来支持稳健配准。为了应对这一挑战,我们提出了一种新的方法——Regor(渐进式对应生成器),它能够在存在大量异常的情况下生成更高质量的匹配结果。 在每次迭代中,我们首先应用基于先验知识的局部分组和广义互匹配来生成局部区域对应的初步估计。然后引入一个强大的中心感知三点一致性检查,用于实现局部对应关系的修正,而非简单的去除。此外,我们还采用了全局对应关系改进方法,从整体角度获取准确的对应关系。通过渐进式的迭代过程,这一流程可以产生大量的高质量对应关系。 在室内和室外数据集上进行的广泛实验表明,所提出的Regor方法显著优于现有的异常值去除技术。更为关键的是,我们的方法能够获得比异常值去除方法多十倍以上的正确对应关系。因此,即使是在特征较弱的情况下,我们的方法也能够实现稳健配准。 代码将会发布。
https://arxiv.org/abs/2502.02163
Flow matching has shown state-of-the-art performance in various generative tasks, ranging from image generation to decision-making, where guided generation is pivotal. However, the guidance of flow matching is more general than and thus substantially different from that of its predecessor, diffusion models. Therefore, the challenge in guidance for general flow matching remains largely underexplored. In this paper, we propose the first framework of general guidance for flow matching. From this framework, we derive a family of guidance techniques that can be applied to general flow matching. These include a new training-free asymptotically exact guidance, novel training losses for training-based guidance, and two classes of approximate guidance that cover classical gradient guidance methods as special cases. We theoretically investigate these different methods to give a practical guideline for choosing suitable methods in different scenarios. Experiments on synthetic datasets, image inverse problems, and offline reinforcement learning demonstrate the effectiveness of our proposed guidance methods and verify the correctness of our flow matching guidance framework. Code to reproduce the experiments can be found at this https URL.
流匹配在从图像生成到决策制定等各类生成任务中展现了最先进的性能,而引导式生成在这些任务中至关重要。然而,流匹配的引导比其前身扩散模型的引导更为一般,因而两者存在显著差异。因此,一般流匹配的引导问题在很大程度上仍未得到充分探索。 在这篇论文中,我们提出了第一个面向一般流匹配的通用引导框架。从这个框架出发,我们推导出一系列可应用于一般流匹配的引导技术,包括一种新的无需训练且渐近精确的引导方法、用于基于训练的引导的新型训练损失,以及两类近似引导方法(经典的梯度引导方法是其特例)。我们对这些不同的方法进行了理论研究,为在不同场景中选择合适的方法提供了实用指南。 在合成数据集、图像逆问题和离线强化学习上的实验展示了我们所提出的引导方法的有效性,并验证了我们的流匹配引导框架的正确性。重现实验的代码可在此 https URL 找到。
https://arxiv.org/abs/2502.02150
This paper introduces the induced matching distance, a novel topological metric designed to compare discrete structures represented by a symmetric non-negative function. We apply this notion to analyze agent trajectories over time. We use dynamic time warping to measure trajectory similarity and compute the 0-dimensional persistent homology to identify relevant connected components, which, in our context, correspond to groups of similar trajectories. To track the evolution of these components across time, we compute induced matching distances, which preserve the coherence of their dynamic behavior. We then obtain a 1-dimensional signal that quantifies the consistency of trajectory groups over time. Our experiments demonstrate that our approach effectively differentiates between various agent behaviors, highlighting its potential as a robust tool for topological analysis in robotics and related fields.
本文介绍了诱导匹配距离,这是一种新颖的拓扑度量方法,旨在通过一个对称非负函数表示的离散结构进行比较。我们将这一概念应用于分析随时间变化的代理轨迹。我们使用动态时间规整(Dynamic Time Warping)来测量轨迹相似性,并计算0维持久同调以识别相关的连通组件,在我们的上下文中,这些组件对应于一组类似的轨迹。为了跟踪这些组分在时间上的演变情况,我们计算诱导匹配距离,从而保留它们的动态行为的一致性。随后,我们获得一个一维信号,量化了随时间变化的轨迹群体的一致性。实验表明,我们的方法能够有效地区分各种代理行为,突显其作为机器人学及相关领域拓扑分析有力工具的潜力。
https://arxiv.org/abs/2502.02112
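The trajectory analysis in the abstract above starts from dynamic time warping (DTW) as the similarity measure between trajectories. A minimal sketch of the classic DTW recurrence follows, simplified to 1-D sequences with an absolute-difference cost; the paper's trajectories and local cost may differ, and the function name `dtw_distance` is hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two non-empty 1-D sequences."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i items of a with the first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b stays on the same sample
                cost[i][j - 1],      # b advances, a stays on the same sample
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same path sampled at different speeds aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

The pairwise DTW values form a symmetric non-negative function on the set of trajectories, which is the kind of input the abstract says the persistent homology and induced matching steps operate on.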
Diffusion models (DMs) have significantly advanced the development of real-world image super-resolution (Real-ISR), but the computational cost of multi-step diffusion models limits their application. One-step diffusion models generate high-quality images in a single sampling step, greatly reducing computational overhead and inference latency. However, most existing one-step diffusion methods are constrained by the performance of the teacher model, where poor teacher performance results in image artifacts. To address this limitation, we propose FluxSR, a novel one-step diffusion Real-ISR technique based on flow matching models. We use the state-of-the-art diffusion model FLUX.1-dev as both the teacher model and the base model. First, we introduce Flow Trajectory Distillation (FTD) to distill a multi-step flow matching model into a one-step Real-ISR model. Second, to improve image realism and address high-frequency artifact issues in generated images, we propose TV-LPIPS as a perceptual loss and introduce Attention Diversification Loss (ADL) as a regularization term to reduce token similarity in the transformer, thereby eliminating high-frequency artifacts. Comprehensive experiments demonstrate that our method outperforms existing one-step diffusion-based Real-ISR methods. The code and model will be released at this https URL.
扩散模型(DMs)显著推动了真实世界图像超分辨率(Real-ISR)的发展,但多步扩散模型的计算成本限制了它们的应用。一步扩散模型可以在单个采样步骤中生成高质量的图像,极大地减少了计算开销和推理延迟。然而,大多数现有的一步扩散方法受到教师模型性能的制约,较差的教师模型表现会导致图像出现伪影。为了解决这一局限性,我们提出了FluxSR,这是一种基于流匹配模型的新型一步扩散Real-ISR技术。我们使用最先进的扩散模型FLUX.1-dev同时作为教师模型和基础模型。首先,我们引入了流轨迹蒸馏(FTD),将多步流匹配模型蒸馏为一步Real-ISR模型。其次,为了提高图像的真实感并解决生成图像中的高频伪影问题,我们提出了TV-LPIPS作为感知损失,并引入了注意力多样化损失(ADL)作为正则化项,以降低Transformer中token之间的相似性,从而消除高频伪影。全面的实验表明,我们的方法优于现有的基于一步扩散的Real-ISR方法。代码和模型将发布于此 https URL。
https://arxiv.org/abs/2502.01993
Reinforcement learning from human feedback (RLHF), which aligns a diffusion model with the input prompt, has become a crucial step in building reliable generative AI models. Most works in this area use a discrete-time formulation, which is prone to induced errors and is often not applicable to models with higher-order/black-box solvers. The objective of this study is to develop a disciplined approach to fine-tune diffusion models using continuous-time RL, formulated as a stochastic control problem with a reward function that aligns the end result (terminal state) with the input prompt. The key idea is to treat score matching as controls or actions, thereby making connections to policy optimization and regularization in continuous-time RL. To carry out this idea, we lay out a new policy optimization framework for continuous-time RL, and illustrate its potential in enhancing the design space of value networks by leveraging the structural properties of diffusion models. We validate the advantages of our method through experiments on downstream tasks of fine-tuning large-scale Text2Image models of Stable Diffusion v1.5.
基于人类反馈的强化学习(RLHF)用于将扩散模型与输入提示对齐,已成为构建可靠生成式AI模型的关键步骤。这一领域的大多数工作采用离散时间形式化方法,这种方法容易引入误差,并且往往不适用于具有高阶/黑盒求解器的模型。本研究的目标是开发一种有原则的方法,利用连续时间强化学习来微调扩散模型,并将其表述为一个随机控制问题,其奖励函数使最终结果(终端状态)与输入提示对齐。核心思想是将分数匹配视为控制或动作,从而与连续时间RL中的策略优化和正则化建立联系。 为了实现这一想法,我们提出了一种新的连续时间RL策略优化框架,并展示了它通过利用扩散模型的结构性质来扩展价值网络设计空间的潜力。我们在微调Stable Diffusion v1.5大规模文本到图像(Text2Image)模型的下游任务中通过实验验证了该方法的优势。
https://arxiv.org/abs/2502.01819
In this work, we address the brand entity linking problem for e-commerce search queries. The entity linking task is performed by either i) a two-stage process consisting of entity mention detection followed by entity disambiguation or ii) an end-to-end linking approach that directly fetches the target entity given the input text. The task presents unique challenges: queries are extremely short (averaging 2.4 words), lack natural language structure, and must handle a massive space of unique brands. We present a two-stage approach combining named-entity recognition with matching, and a novel end-to-end solution using extreme multi-class classification. We validate our solutions through both offline benchmarks and the impact observed in an online A/B test.
在这项工作中,我们解决了电子商务搜索查询中的品牌实体链接问题。实体链接任务可以通过以下两种方式完成:i) 一个两阶段的过程,包括实体提及检测后进行实体消歧;ii) 直接根据输入文本获取目标实体的端到端链接方法。该任务面临独特的挑战:查询非常短(平均2.4个词),缺乏自然语言结构,并且必须处理大量的独特品牌空间。我们提出了一种结合命名实体识别与匹配的两阶段方法,以及一种使用极端多类分类的新型端到端解决方案。我们通过离线基准测试和在线A/B测试的影响来验证我们的解决方案的有效性。
https://arxiv.org/abs/2502.01555
Learning diffusion bridge models is easy; making them fast and practical is an art. Diffusion bridge models (DBMs) are a promising extension of diffusion models for applications in image-to-image translation. However, like many modern diffusion and flow models, DBMs suffer from the problem of slow inference. To address it, we propose a novel distillation technique based on the inverse bridge matching formulation and derive the tractable objective to solve it in practice. Unlike previously developed DBM distillation techniques, the proposed method can distill both conditional and unconditional types of DBMs, distill models in a one-step generator, and use only the corrupted images for training. We evaluate our approach for both conditional and unconditional types of bridge matching on a wide set of setups, including super-resolution, JPEG restoration, sketch-to-image, and other tasks, and show that our distillation technique accelerates the inference of DBMs by 4x to 100x and, depending on the particular setup, even provides better generation quality than the teacher model used.
学习扩散桥模型(Diffusion Bridge Models,DBMs)很简单;但要使它们变得快速且实用则是一门艺术。扩散桥模型是扩散模型在图像到图像转换应用中的一个有前景的扩展。然而,与许多现代的扩散和流模型一样,DBMs面临着推理速度慢的问题。为了解决这个问题,我们提出了一种基于逆向桥匹配公式的新颖蒸馏技术,并推导出实用的目标函数来解决这一问题。 不同于之前开发的DBM蒸馏技术,我们的方法可以同时对条件性和非条件性的DBMs进行蒸馏,在一步生成器中训练模型,并仅使用损坏的图像进行训练。我们在一系列广泛的设置上评估了我们这种方法在条件性和非条件性桥匹配上的表现,包括超分辨率、JPEG恢复、草图到图像转换以及其他任务,结果显示我们的蒸馏技术可以将DBM的推理速度提高4倍至100倍不等,并且在某些情况下甚至能提供比原教师模型更好的生成质量。
https://arxiv.org/abs/2502.01362
This paper presents a novel approach to Visual Inertial Odometry (VIO), focusing on the initialization and feature matching modules. Existing methods for initialization often suffer from either poor stability in visual Structure from Motion (SfM) or fragility in solving a huge number of parameters simultaneously. To address these challenges, we propose a new pipeline for visual inertial initialization that robustly handles various complex scenarios. By tightly coupling gyroscope measurements, we enhance the robustness and accuracy of visual SfM. Our method demonstrates stable performance even with only four image frames, yielding competitive results. In terms of feature matching, we introduce a hybrid method that combines optical flow and descriptor-based matching. By leveraging the robustness of continuous optical flow tracking and the accuracy of descriptor matching, our approach achieves efficient, accurate, and robust tracking results. Through evaluation on multiple benchmarks, our method demonstrates state-of-the-art performance in terms of accuracy and success rate. Additionally, a video demonstration on mobile devices showcases the practical applicability of our approach in the field of Augmented Reality/Virtual Reality (AR/VR).
本文提出了一种新颖的视觉惯性里程计(VIO)方法,重点在于初始化和特征匹配模块。现有的初始化方法通常在视觉结构从运动(SfM)中的稳定性较差或同时解决大量参数时不够健壮。为了解决这些问题,我们提出了一种新的视觉惯性初始化流水线,能够稳健地处理各种复杂场景。通过紧密耦合陀螺仪测量数据,增强了视觉SfM的鲁棒性和精度。我们的方法即使在仅有四帧图像的情况下也能表现出稳定性能,并且结果具有竞争力。 在特征匹配方面,我们引入了一种结合光流和描述子匹配的混合方法。利用连续光流追踪的强大能力和基于描述子匹配的准确性,我们的方法实现了高效、准确且鲁棒的跟踪效果。通过多个基准测试评估,我们的方法在精度和成功率方面表现出最先进的性能。 此外,在移动设备上的视频演示展示了我们这种方法在增强现实/虚拟现实(AR/VR)领域中的实际应用性。
https://arxiv.org/abs/2502.01297
In this paper, a new variant of an algorithm for normalized cross-correlation (NCC) is proposed in the context of template matching in images. The proposed algorithm is based on the precomputation of a template image approximation, enabling more efficient calculation of approximate NCC with the source image than using the original template for exact NCC calculation. The approximate template is precomputed from the template image by a split-and-merge approach, resulting in a decomposition to axis-aligned rectangular segments, whose sizes depend on per-segment pixel intensity variance. In the approximate template, each segment is assigned the mean grayscale value of the corresponding pixels from the original template. The proposed algorithm achieves superior computational performance with negligible NCC approximation errors compared to the well-known Fast Fourier Transform (FFT)-based NCC algorithm, when applied on less visually complex and/or smaller template images. In other cases, the proposed algorithm can maintain either computational performance or NCC approximation error within the range of the FFT-based algorithm, but not both.
本文提出了一种用于图像模板匹配的归一化互相关(NCC)算法的新变体。所提出的算法基于预先计算模板图像的近似,与使用原始模板进行精确NCC计算相比,它能够更高效地计算与源图像的近似NCC。近似模板通过分裂与合并的方法从模板图像预先计算得到,该方法将模板分解为轴对齐的矩形段,这些段的大小取决于每段内像素强度的方差。在近似模板中,每个段被赋予原始模板中对应像素的平均灰度值。 当应用于视觉复杂度较低和/或较小的模板图像时,与著名的基于快速傅里叶变换(FFT)的NCC算法相比,所提出的算法具有更优的计算性能,且NCC近似误差可以忽略不计。在其他情况下,所提出的算法可以使计算性能或NCC近似误差其中之一保持在基于FFT的算法的范围之内,但无法两者兼得。
https://arxiv.org/abs/2502.01286
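For reference, the exact quantity that the template matching abstract above approximates can be written in a few lines. The sketch below computes normalized cross-correlation between a template and one equally sized image patch, assuming the common zero-mean form of NCC; the abstract does not state which variant is used, and the function name `ncc` is hypothetical. It is the direct per-patch computation, not the proposed rectangle-based approximation or the FFT-based method.

```python
import math


def ncc(template, patch):
    """Zero-mean normalized cross-correlation of two equal-size grayscale
    images given as lists of rows."""
    t = [v for row in template for v in row]
    p = [v for row in patch for v in row]
    if len(t) != len(p) or not t:
        raise ValueError("template and patch must be non-empty and equal in size")
    mean_t = sum(t) / len(t)
    mean_p = sum(p) / len(p)
    numerator = sum((a - mean_t) * (b - mean_p) for a, b in zip(t, p))
    denominator = math.sqrt(
        sum((a - mean_t) ** 2 for a in t) * sum((b - mean_p) ** 2 for b in p)
    )
    if denominator == 0.0:
        return 0.0  # flat template or patch: correlation undefined, 0 by convention here
    return numerator / denominator


template = [[1, 2], [3, 4]]
brighter = [[7, 9], [11, 13]]  # 2 * template + 5: same pattern, different gain and offset
print(ncc(template, brighter))
```

Because the score is invariant to gain and offset, a patch that is a brightened copy of the template still scores close to 1. Evaluating this sum at every offset of a large source image is the cost that both the FFT-based algorithm and the proposed approximation aim to reduce.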
Industrial defect segmentation is critical for manufacturing quality control. Due to the scarcity of training defect samples, few-shot semantic segmentation (FSS) holds significant value in this field. However, existing studies mostly apply FSS to tackle defects on simple textures, without considering more diverse scenarios. This paper aims to address this gap by exploring FSS in broader industrial products with various defect types. To this end, we contribute a new real-world dataset and reorganize some existing datasets to build a more comprehensive few-shot defect segmentation (FDS) benchmark. On this benchmark, we thoroughly investigate metric learning-based FSS methods, including those based on meta-learning and those based on Vision Foundation Models (VFMs). We observe that existing meta-learning-based methods are generally not well-suited for this task, while VFMs hold great potential. We further systematically study the applicability of various VFMs in this task, involving two paradigms: feature matching and the use of Segment Anything (SAM) models. We propose a novel efficient FDS method based on feature matching. Meanwhile, we find that SAM2 is particularly effective for addressing FDS through its video track mode. The contributed dataset and code will be available at: this https URL.
工业缺陷分割对于制造业的质量控制至关重要。由于训练样本的稀缺性,少样本语义分割(FSS)在这一领域具有重要意义。然而,现有的研究大多将FSS应用于处理简单纹理上的缺陷,而没有考虑到更多样化的场景。本文旨在通过探索适用于各种不同类型缺陷的更广泛工业产品的FSS来填补这一空白。为此,我们贡献了一个新的真实世界数据集,并重新组织了一些现有数据集以构建一个更加全面的少样本缺陷分割(FDS)基准测试。在这个基准上,我们深入研究了基于度量学习的FSS方法,包括基于元学习和基于视觉基础模型(VFMs)的方法。我们观察到现有的基于元学习的方法通常不适用于这项任务,而VFMs则具有巨大的潜力。此外,我们系统地研究了各种VFMs在这一任务中的适用性,涉及两种范式:特征匹配以及使用Segment Anything (SAM) 模型。我们提出了一种基于特征匹配的新型高效FDS方法。同时,我们发现通过其视频跟踪模式,SAM2 特别有效于解决FDS问题。 贡献的数据集和代码将在以下网址提供:this https URL.
https://arxiv.org/abs/2502.01216
Frequent, high-resolution remote sensing imagery is crucial for agricultural and environmental monitoring. Satellites from the Landsat collection offer detailed imagery at 30m resolution but with lower temporal frequency, whereas missions like MODIS and VIIRS provide daily coverage at coarser resolutions. Clouds and cloud shadows contaminate about 55% of the optical remote sensing observations, posing additional challenges. To address these challenges, we present SatFlow, a generative model-based framework that fuses low-resolution MODIS imagery and Landsat observations to produce frequent, high-resolution, gap-free surface reflectance imagery. Our model, trained via Conditional Flow Matching, demonstrates better performance in generating imagery with preserved structural and spectral integrity. Cloud imputation is treated as an image inpainting task, where the model reconstructs cloud-contaminated pixels and fills gaps caused by scan lines during inference by leveraging the learned generative processes. Experimental results demonstrate the capability of our approach in reliably imputing cloud-covered regions. This capability is crucial for downstream applications such as crop phenology tracking and environmental change detection.
频繁的高分辨率遥感图像对于农业和环境监测至关重要。Landsat卫星系列提供的30米分辨率影像虽然详细,但时间频率较低;而像MODIS和VIIRS这样的任务则提供每日覆盖范围,尽管其空间分辨率较粗。大约55%的光学遥感观测被云层和云影污染,这给数据利用带来了额外挑战。为了解决这些问题,我们提出了SatFlow框架,这是一个基于生成模型的方法,它融合了低分辨率的MODIS影像与Landsat观测结果,以生产频繁、高分辨率且无缺口的表面反射率图像。 我们的模型通过条件流匹配进行训练,在生成保持结构和光谱完整性的影像方面表现更佳。云层填补被视为一个图像修复任务,其中模型重建被污染的像素,并在推理过程中利用所学习到的生成过程来填充扫描线造成的空缺区域。实验结果表明了该方法可靠地填补云覆盖区域的能力。这种能力对于下游应用(如作物生长阶段追踪、环境变化检测等)至关重要。
https://arxiv.org/abs/2502.01098