This paper explores hallucination phenomena in large language models (LLMs) through the lens of language philosophy and psychoanalysis. By incorporating Lacan's concepts of the "chain of signifiers" and "suture points," we propose the Anchor-RAG framework as a novel approach to mitigate hallucinations. In contrast to the predominant reliance on trial-and-error experiments, constant adjustments of mathematical formulas, or resource-intensive methods that emphasize quantity over quality, our approach returns to the fundamental principles of linguistics to analyze the root causes of hallucinations in LLMs. Drawing from robust theoretical foundations, we derive algorithms and models that are not only effective in reducing hallucinations but also enhance LLM performance and improve output quality. This paper seeks to establish a comprehensive theoretical framework for understanding hallucinations in LLMs and aims to challenge the prevalent "guess-and-test" approach and rat race mentality in the field. We aspire to pave the way for a new era of interpretable LLMs, offering deeper insights into the inner workings of language-based AI systems.
本文通过语言哲学和精神分析的视角探讨大型语言模型(LLM)中的幻觉现象。结合拉康的“能指链”和“缝合点”概念,我们提出了Anchor-RAG框架作为缓解幻觉的新方法。与当前主要依赖试错实验、不断调整数学公式或强调数量而非质量的资源密集型方法不同,我们的方法回归语言学的基本原则,以分析LLM中幻觉的根本原因。基于坚实的理论基础,我们推导出不仅有效减少幻觉,还能提升LLM性能和改善输出质量的算法和模型。 本文旨在建立一个全面的理论框架来理解LLM中的幻觉现象,并挑战该领域盛行的“猜测与测试”方法以及资源竞赛心态。我们的目标是为可解释性的大型语言模型铺平道路,提供对基于语言的人工智能系统的内部运作更深刻的洞察。
https://arxiv.org/abs/2503.14392
Large language models (LLMs) undergo a three-phase training process: unsupervised pre-training, supervised fine-tuning (SFT), and learning from human feedback (RLHF/DPO). Notably, it is during the final phase that these models are exposed to negative examples -- incorrect, rejected, or suboptimal responses to queries. This paper delves into the role of negative examples in the training of LLMs, using a likelihood-ratio (Likra) model on multiple-choice question answering benchmarks to precisely manage the influence and the volume of negative examples. Our findings reveal three key insights: (1) During a critical phase in training, Likra with negative examples demonstrates a significantly larger improvement per training example compared to SFT using only positive examples. This leads to a sharp jump in the learning curve for Likra unlike the smooth and gradual improvement of SFT; (2) negative examples that are plausible but incorrect (near-misses) exert a greater influence; and (3) while training with positive examples fails to significantly decrease the likelihood of plausible but incorrect answers, training with negative examples more accurately identifies them. These results indicate a potentially significant role for negative examples in improving accuracy and reducing hallucinations for LLMs.
大型语言模型(LLM)的训练过程分为三个阶段:无监督预训练、有监督微调(SFT)和从人类反馈中学习(RLHF/DPO)。值得注意的是,在最后一个阶段,这些模型才接触到负面示例——即查询的不正确、被拒绝或次优响应。本文探讨了在大型语言模型训练过程中负面示例的作用,并使用似然比(Likra)模型在多项选择题回答基准测试上精确管理负面样本的影响和数量。我们的研究发现揭示了三个关键见解: 1. 在训练过程中的一个关键阶段,包含负面示例的Likra显示出每个训练样例相比仅使用正面示例进行SFT有显著更大的改进效果。这导致了Likra学习曲线出现明显的跳跃式提升,与SFT中平滑渐进式的改善形成鲜明对比; 2. 可信但错误的答案(接近正确答案但实际是错的)作为负面示例对模型的影响更大; 3. 单独使用正面样本训练无法显著降低可信但错误答案的可能性,而采用负面样例进行训练则更准确地识别出这些选项。 这些结果表明,在提高大型语言模型的准确性并减少幻觉方面,负面示例可能扮演着重要角色。
https://arxiv.org/abs/2503.14391
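As a rough illustration of the likelihood-ratio idea in the Likra abstract above, the sketch below scores each answer choice by the difference between its log-likelihood under a model trained on correct answers and its log-likelihood under a model trained on incorrect answers. The function names, the toy log-probabilities, and the exact scoring rule are assumptions made for illustration; the abstract does not spell out Likra's formulation.

```python
def likelihood_ratio_scores(pos_logprobs, neg_logprobs):
    """Score each choice as log p_pos(choice) - log p_neg(choice)."""
    return [p - n for p, n in zip(pos_logprobs, neg_logprobs)]


def pick_answer(pos_logprobs, neg_logprobs):
    """Return the index of the choice with the highest log-likelihood ratio."""
    scores = likelihood_ratio_scores(pos_logprobs, neg_logprobs)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy, made-up numbers: choice 1 is a plausible near-miss that the positive
# model alone slightly prefers, while the negative model rates it as a
# likely wrong answer.
pos = [-1.2, -1.0, -3.0, -4.0]   # hypothetical log p_pos per choice
neg = [-3.5, -0.8, -2.5, -2.0]   # hypothetical log p_neg per choice

positive_only_choice = max(range(len(pos)), key=pos.__getitem__)
ratio_choice = pick_answer(pos, neg)
```

With these toy values the positive model alone selects the near-miss, whereas the ratio selects a different choice, which is the kind of effect the third finding in the abstract describes.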
We investigated the performance of existing semi- and fully autonomous methods for controlling flipper-based skid-steer robots. Our study reimplements these methods for fair comparison and introduces a novel semi-autonomous control policy that provides a compelling trade-off among current state-of-the-art approaches. We also propose new metrics for assessing cognitive load and traversal quality and offer a benchmarking interface for generating Quality-Load graphs from recorded data. Our results, presented in a 2D Quality-Load space, demonstrate that the new control policy effectively bridges the gap between autonomous and manual control methods. Additionally, we find, surprisingly, that fully manual, continuous control of all six degrees of freedom remains highly effective when performed by an experienced operator on a well-designed analog controller from a third-person view.
我们研究了现有基于鳍片的履带转向机器人半自主和全自主控制方法的性能。我们的研究包括重新实现这些方法以进行公平比较,并提出了一种新的半自主控制策略,该策略在当前最先进的方法之间提供了一个引人注目的权衡。此外,我们还提出了用于评估认知负荷和穿越质量的新指标,并提供了从记录数据生成Quality-Load图的基准测试界面。我们的结果在二维的质量-负载空间中显示,新提出的控制策略有效地弥合了自主控制与手动控制方法之间的差距。另外,我们发现了一个令人惊讶的事实:当经验丰富的操作员使用来自第三方视点的精心设计的模拟控制器进行全手动、连续控制所有六个自由度时,这种方法仍然非常有效。
https://arxiv.org/abs/2503.14389
The purpose of this paper is to examine whether large language models (LLMs) can understand good and evil when judging the good/evil reputation of celebrities. Specifically, we first apply a large language model (namely, ChatGPT) to the task of collecting sentences that mention the target celebrity from Web articles about celebrities. Next, ChatGPT categorizes the collected sentences based on their contents and assigns a name to each category. These assigned category names are referred to as the "aspects" of each celebrity. Then, by applying the framework of retrieval-augmented generation (RAG), we show that the large language model is quite effective at judging the good/evil reputation of the aspects and descriptions of each celebrity. Finally, to demonstrate the advantage of the proposed method over existing services that incorporate RAG functions, we show that it significantly outperforms one such service in judging the good/evil reputation of the aspects and descriptions of each celebrity.
本文的目的是探讨大型语言模型(LLMs)是否能够理解善与恶,特别是在判断名人的善恶声誉方面。具体而言,首先我们应用了一个大型语言模型(如ChatGPT),从网页上的名人文章中收集提及目标名人的句子。接下来,使用ChatGPT对这些收集到的句子进行分类,根据内容为每一类分配一个类别名称。这些被分配的类别名称被称为每个名人的“方面”。然后,在检索增强生成(RAG)框架的应用下,我们展示了大型语言模型在判断名人各方面的善恶声誉任务中非常有效。最后,为了证明所提出方法相对于现有集成RAG功能的服务的优势,本文显示了对于判断每位名人的各方面和描述的善恶,所提出的方法显著优于现有的集成了RAG功能的服务。
https://arxiv.org/abs/2503.14382
Synthetic videos are now widely used to compensate for the scarcity and limited diversity of real-world video data. Current synthetic datasets primarily replicate real-world scenarios, leaving impossible, counterfactual, and anti-reality video concepts underexplored. This work aims to answer two questions: 1) Can today's video generation models effectively follow prompts to create impossible video content? 2) Are today's video understanding models good enough to understand impossible videos? To this end, we introduce IPV-Bench, a novel benchmark designed to evaluate and foster progress in video understanding and generation. IPV-Bench is underpinned by a comprehensive taxonomy encompassing 4 domains and 14 categories. It features diverse scenes that defy physical, biological, geographical, or social laws. Based on the taxonomy, a prompt suite is constructed to evaluate video generation models, challenging their prompt-following and creativity capabilities. In addition, a video benchmark is curated to assess the ability of Video-LLMs to understand impossible videos, which particularly requires reasoning about temporal dynamics and world knowledge. Comprehensive evaluations reveal limitations of current video models and insights for future directions, paving the way for next-generation video models.
如今,合成视频广泛用于补充现实世界视频数据的稀缺性和多样性。当前的合成数据集主要复制现实世界的场景,而不可能的、反事实的和违背现实的视频概念则被大大忽视了。这项工作旨在回答两个问题:1)当今的视频生成模型能否根据提示有效创建不可能的内容?2)今天的视频理解模型是否足够好,能够理解这些不可能的视频?为此,我们引入了IPV-Bench,这是一个新型基准测试工具,旨在评估和推动视频理解和生成方面的进步。IPV-Bench 建立在全面分类法的基础之上,涵盖了4个领域、14个类别。它包含违背物理、生物、地理或社会规律的各种场景。基于这个分类系统,我们构建了一套提示语来评估视频生成模型的跟随提示和创造力的能力。此外,我们还汇集了一个视频基准测试集,用于评估视频-LLMs(大型语言模型)对理解不可能视频能力的理解,这特别需要对时间动态和世界知识进行推理。全面的评估揭示了未来视频模型的发展方向中的局限性和洞察力,为下一代视频模型铺平道路。
https://arxiv.org/abs/2503.14378
Despite the growing scale of medical Vision-Language datasets, the impact of dataset quality on model performance remains under-explored. We introduce Open-PMC, a high-quality medical dataset from PubMed Central, containing 2.2 million image-text pairs, enriched with image modality annotations, subfigures, and summarized in-text references. Notably, the in-text references provide richer medical context, extending beyond the abstract information typically found in captions. Through extensive experiments, we benchmark Open-PMC against larger datasets across retrieval and zero-shot classification tasks. Our results show that dataset quality, not just size, drives significant performance gains. We complement our benchmark with an in-depth analysis of feature representation. Our findings highlight the crucial role of data curation quality in advancing multimodal medical AI. We release Open-PMC, along with the trained models and our codebase.
尽管医学视觉语言数据集的规模正在不断扩大,但数据集质量对模型性能的影响仍然研究不足。我们介绍了Open-PMC,这是一个来自PubMed Central的高质量医学数据集,包含220万张图像文本配对,并附有图像模态注释、子图和总结在文中引用。值得注意的是,在文中的引用提供了更丰富的医疗背景信息,超出了通常出现在说明文字中的摘要信息。通过广泛的实验,我们从检索任务和零样本分类任务两个方面将Open-PMC与更大的数据集进行了基准测试。我们的结果显示,数据集的质量——而不仅仅是规模——能够带来显著的性能提升。为了补充这一基准测试,我们还对特征表示进行了一项深入分析。我们的研究结果突显了高质量的数据整理在推进多模态医学AI发展中的关键作用。我们将Open-PMC、训练好的模型和代码库一同发布。
https://arxiv.org/abs/2503.14377
Linear RNNs with gating recently demonstrated competitive performance compared to Transformers in language modeling. Although their linear compute scaling in sequence length offers theoretical runtime advantages over Transformers, realizing these benefits in practice requires optimized custom kernels, as Transformers rely on the highly efficient Flash Attention kernels. Leveraging the chunkwise-parallel formulation of linear RNNs, Flash Linear Attention (FLA) shows that linear RNN kernels can be faster than Flash Attention by parallelizing over chunks of the input sequence. However, since the chunk size of FLA is limited, many intermediate states must be materialized in GPU memory. This leads to low arithmetic intensity and causes high memory consumption and IO cost, especially for long-context pre-training. In this work, we present Tiled Flash Linear Attention (TFLA), a novel kernel algorithm for linear RNNs that enables arbitrarily large chunk sizes by introducing an additional level of sequence parallelization within each chunk. First, we apply TFLA to the xLSTM with matrix memory, the mLSTM. Second, we propose an mLSTM variant with a sigmoid input gate and reduced computation for even faster kernel runtimes at equal language modeling performance. In our speed benchmarks, we show that our new mLSTM kernels based on TFLA outperform highly optimized Flash Attention, Linear Attention, and Mamba kernels, setting a new state of the art for efficient long-context sequence modeling primitives.
线性RNN(Recurrent Neural Network)在加入门控机制后,在语言模型任务上展现出了与Transformer相当甚至更优的性能。尽管它们在线性计算复杂度方面具有序列长度上的理论运行时间优势,但在实践中实现这些好处需要优化的定制内核,因为Transformer依赖于高效的Flash Attention内核。通过线性RNN的块并行化公式,提出了Flash Linear Attention(FLA),表明线性RNN内核在处理输入序列的块时可以比Flash Attention更快。 然而,由于FLA中块大小的限制,许多中间状态必须被物质化到GPU内存中,这导致了低算术强度,并且在长上下文预训练场景下产生了高内存消耗和IO成本。在此工作中,我们提出了Tiled Flash Linear Attention(TFLA),这是一种针对线性RNN的新内核算法,它通过在每个块内部引入序列并行化的一个额外层次来支持任意大的块大小。 首先,我们将TFLA应用于具有矩阵记忆的xLSTM模型中,即mLSTM。其次,我们提出了一种改进版的mLSTM变体,该变体采用Sigmoid输入门和减少计算量的方法,在保持相同语言建模性能的同时实现更快的内核运行时间。 在我们的速度基准测试中,基于TFLA的新mLSTM内核超出了高度优化的Flash Attention、Linear Attention以及Mamba内核的表现,并且为高效长上下文序列模型设定了新的状态标准。
https://arxiv.org/abs/2503.14376
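The chunkwise-parallel formulation mentioned in the TFLA abstract above can be sketched in plain Python for the simplest case: causal linear attention without gating or normalization. This is only a reference model under those assumptions, not the TFLA kernel, the mLSTM, or any GPU code; all function names are hypothetical. It shows that carrying one state matrix per chunk gives the same output as the quadratic form, so larger chunks mean fewer materialized states.

```python
def naive_linear_attention(Q, K, V):
    """Quadratic reference: o_t = sum over s <= t of (q_t . k_s) * v_s."""
    d_v = len(V[0])
    out = []
    for t in range(len(Q)):
        o = [0.0] * d_v
        for s in range(t + 1):
            w = sum(a * b for a, b in zip(Q[t], K[s]))
            for j in range(d_v):
                o[j] += w * V[s][j]
        out.append(o)
    return out


def chunkwise_linear_attention(Q, K, V, chunk_size):
    """Same result, computed chunk by chunk with a running state S = sum_s k_s v_s^T."""
    T, d_k, d_v = len(Q), len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # state carried across chunks
    out = []
    for start in range(0, T, chunk_size):
        end = min(start + chunk_size, T)
        for t in range(start, end):
            # Inter-chunk part: all earlier chunks, summarized by the state S.
            o = [sum(Q[t][i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
            # Intra-chunk part: causal attention restricted to this chunk.
            for s in range(start, t + 1):
                w = sum(a * b for a, b in zip(Q[t], K[s]))
                for j in range(d_v):
                    o[j] += w * V[s][j]
            out.append(o)
        # The state is updated once per chunk, so T / chunk_size states exist in total.
        for s in range(start, end):
            for i in range(d_k):
                for j in range(d_v):
                    S[i][j] += K[s][i] * V[s][j]
    return out


# Small made-up inputs: 5 time steps, key and value dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
K = [[0.5, 1.0], [1.0, -1.0], [0.0, 2.0], [1.5, 0.5], [-1.0, 1.0]]
V = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [1.0, 1.0], [2.0, 0.0]]

reference = naive_linear_attention(Q, K, V)
chunked = chunkwise_linear_attention(Q, K, V, chunk_size=2)
```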
User engagement is greatly enhanced by fully immersive multi-modal experiences that combine visual and auditory stimuli. Consequently, the next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, a large 6-DoF interaction space, multi-modal feedback, and high-resolution, high-frame-rate content. To stimulate the reconstruction of immersive volumetric videos, we introduce ImViD, a multi-view, multi-modal dataset featuring complete space-oriented data capture and various indoor/outdoor scenarios. Our capture rig supports multi-view video-audio capture while on the move, a capability absent in existing datasets, significantly enhancing the completeness, flexibility, and efficiency of data capture. The captured multi-view videos (with synchronized audio) are in 5K resolution at 60 FPS, last from 1 to 5 minutes, and include rich foreground-background elements and complex dynamics. We benchmark existing methods using our dataset and establish a base pipeline for constructing immersive volumetric videos from multi-view audiovisual inputs for 6-DoF multi-modal immersive VR experiences. The benchmark and the reconstruction and interaction results demonstrate the effectiveness of our dataset and baseline method, which we believe will stimulate future research on immersive volumetric video production.
用户参与度可以通过结合视觉和听觉刺激的全沉浸式多模态体验得到显著提升。因此,VR/AR技术的下一个前沿领域在于沉浸式的体三维视频,这种视频能够捕捉完整的场景、提供广阔的6自由度互动空间、支持多模态反馈,并具备高分辨率与帧率的内容。 为了激发对沉浸式体三维视频重建的研究,我们推出了ImViD,这是一个集成了全方位数据捕获和多种室内外场景的多视角、多模式的数据集。我们的捕捉设备能够支持在移动过程中进行多视角音视频捕获,这是现有数据集中所不具备的功能,从而大大增强了数据捕获的整体性、灵活性和效率。 所捕获的多视角视频(同步音频)以5K分辨率60帧每秒播放,持续时间为1到5分钟,并且包括丰富的前景背景元素及复杂的动态效果。我们使用该数据集对现有方法进行了基准测试,并为从多视角音视频输入构建沉浸式体三维视频以实现6自由度的多模态沉浸式VR体验建立了一条基础流水线。 通过基准测试和重建与互动结果,证明了我们的数据集和基线方法的有效性。我们相信这将激发未来关于沉浸式体三维视频制作的研究。
https://arxiv.org/abs/2503.14359
Rectified Flow (RF) models trained with a Flow matching framework have achieved state-of-the-art performance on Text-to-Image (T2I) conditional generation. Yet, multiple benchmarks show that synthetic images can still suffer from poor alignment with the prompt, i.e., images show wrong attribute binding, subject positioning, numeracy, etc. While the literature offers many methods to improve T2I alignment, they all consider only Diffusion Models, and require auxiliary datasets, scoring models, and linguistic analysis of the prompt. In this paper we aim to address these gaps. First, we introduce RFMI, a novel Mutual Information (MI) estimator for RF models that uses the pre-trained model itself for the MI estimation. Then, we investigate a self-supervised fine-tuning approach for T2I alignment based on RFMI that does not require auxiliary information other than the pre-trained model itself. Specifically, a fine-tuning set is constructed by selecting synthetic images generated from the pre-trained RF model and having high point-wise MI between images and prompts. Our experiments on MI estimation benchmarks demonstrate the validity of RFMI, and empirical fine-tuning on SD3.5-Medium confirms the effectiveness of RFMI for improving T2I alignment while maintaining image quality.
修正流(RF)模型在使用流动匹配框架训练时,在文本到图像(T2I)条件生成任务上取得了最先进的性能。然而,多个基准测试表明,合成的图像仍然可能与提示不完全对齐,例如展示错误的属性绑定、主体位置或数量等问题。尽管文献提供了许多改进T2I对齐的方法,但这些方法都仅针对扩散模型,并且需要辅助数据集、评分模型和对提示进行语言分析等额外信息。 本文旨在解决这些问题。首先,我们引入了RFMI,这是一种新的用于RF模型的互信息(MI)估计器,该估计器利用预训练模型本身来进行MI估算。其次,我们探讨了一种基于RFMI的自监督微调方法,以改善T2I对齐,并且仅需使用预训练模型自身,而无需额外的信息或数据集。 具体而言,通过从预先训练好的RF模型生成合成图像并选择其中与提示之间具有高逐点互信息(point-wise MI)的图像来构建一个微调集合。我们的实验在MI估计基准上表明了RFMI的有效性,并且经验上的微调结果显示,在保持图像质量的同时,RFMI能够有效提高T2I对齐的质量。
https://arxiv.org/abs/2503.14358
Accurate tumor segmentation is crucial for cancer diagnosis and treatment. While foundation models have advanced general-purpose segmentation, existing methods still struggle with: (1) limited incorporation of medical priors, (2) imbalance between generic and tumor-specific features, and (3) high computational costs for clinical adaptation. To address these challenges, we propose MAST-Pro (Mixture-of-experts for Adaptive Segmentation of pan-Tumors with knowledge-driven Prompts), a novel framework that integrates dynamic Mixture-of-Experts (D-MoE) and knowledge-driven prompts for pan-tumor segmentation. Specifically, text and anatomical prompts provide domain-specific priors, guiding tumor representation learning, while D-MoE dynamically selects experts to balance generic and tumor-specific feature learning, improving segmentation accuracy across diverse tumor types. To enhance efficiency, we employ Parameter-Efficient Fine-Tuning (PEFT), optimizing MAST-Pro with significantly reduced computational overhead. Experiments on multi-anatomical tumor datasets demonstrate that MAST-Pro outperforms state-of-the-art approaches, achieving up to a 5.20% improvement in average DSC while reducing trainable parameters by 91.04%, without compromising accuracy.
精确的肿瘤分割对于癌症诊断和治疗至关重要。尽管基础模型在通用分割方面取得了进展,但现有方法仍然面临以下挑战:(1)医疗先验知识有限的应用;(2)通用特征与肿瘤特异性特征之间的不平衡问题;以及(3)临床适应过程中较高的计算成本。为了解决这些问题,我们提出了一种名为MAST-Pro(基于知识驱动提示的全肿瘤自适应分割混合专家模型)的新框架,该框架结合了动态混合专家(D-MoE)和知识驱动提示,用于全肿瘤分割。具体而言,文本和解剖学提示提供了特定领域的先验信息,指导肿瘤表示学习,而D-MoE则能够动态选择专家以平衡通用特征与肿瘤特异性特征的学习过程,从而提高各种类型肿瘤分割的准确性。为了提升效率,我们采用了参数高效的微调(PEFT),优化了MAST-Pro框架,并显著减少了计算开销。 在多解剖部位肿瘤数据集上的实验表明,MAST-Pro超越了现有最佳方法,在平均DSC(Dice相似系数)上提高了高达5.20%,同时减少可训练参数达91.04%,且不牺牲准确性。
https://arxiv.org/abs/2503.14355
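The MAST-Pro abstract above reports results in terms of the Dice similarity coefficient (DSC). As a reminder of what that metric measures, here is a minimal sketch for flat binary masks. The function name and the convention of returning 1.0 when both masks are empty are assumptions; the abstract does not say how DSC is averaged across tumor types.

```python
def dice_coefficient(pred, target):
    """DSC = 2 * |pred AND target| / (|pred| + |target|) for flat 0/1 masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Made-up example: 6 voxels, 2 of the 3 predicted tumor voxels are correct.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
dsc = dice_coefficient(pred_mask, true_mask)
```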
A CORDIC-based configuration for the design of Activation Functions (AF) was previously suggested to accelerate ASIC hardware design for resource-constrained systems by providing functional reconfigurability. Since its introduction, this approach to neural network acceleration has gained widespread popularity, influencing numerous designs for activation functions in both academic and commercial AI processors. In this retrospective analysis, we explore the foundational aspects of this initiative, summarize key developments over recent years, and introduce the DA-VINCI AF, tailored for the evolving needs of AI applications. This new generation of dynamically configurable and precision-adjustable activation function cores promises greater adaptability for a range of activation functions in AI workloads, including Swish, SoftMax, SeLU, and GeLU, utilizing the Shift-and-Add CORDIC technique. The previously presented design has been optimized for MAC, Sigmoid, and Tanh functionalities and incorporated into ReLU AFs, culminating in an accumulative NEURIC compute unit. These enhancements position NEURIC as a fundamental component in the resource-efficient vector engine for the realization of AI accelerators that focus on DNNs, RNNs/LSTMs, and Transformers, achieving a quality of results (QoR) of 98.5%.
基于CORDIC(坐标旋转数字计算机)的配置方法被提出用于设计激活函数(Activation Function,AF),以加速资源受限系统中的ASIC硬件设计,并提供功能重构性。自该方法问世以来,它已成为神经网络加速的新途径,在学术界和商业AI处理器的设计中产生了广泛影响。在这次回顾分析中,我们将探讨这项倡议的基础方面,总结近年来的关键发展,并介绍专为适应人工智能应用不断变化需求而设计的DA-VINCI激活函数。 新一代动态可配置且精度可调的激活函数核心承诺在包括Swish、SoftMax、SeLU和GeLU在内的多种AI工作负载激活功能中提供更高的灵活性。这些核心利用移位加法CORDIC技术实现这一目标。先前提出的设计已针对MAC(矩阵乘法累加)、Sigmoid和Tanh功能进行了优化,并被集成到ReLU激活函数中,最终形成了累积NEURIC计算单元。 这些改进使NEURIC成为资源效率高的向量引擎中的关键组件,该引擎专注于深度神经网络(DNNs)、递归神经网络/长短时记忆网络(RNN/LSTMs)和Transformer加速器的实现。通过这一设计优化,实现了98.5%的质量结果(Quality of Results, QoR)。
https://arxiv.org/abs/2503.14354
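To make the shift-and-add CORDIC technique named in the abstract above more concrete, the following floating-point sketch evaluates tanh and sigmoid with the textbook hyperbolic CORDIC in rotation mode. It is a software model under stated assumptions, not the DA-VINCI or NEURIC design: the function names, the iteration count, and the final division `y / x` are illustrative choices. In hardware, the multiplications by `2 ** -i` would be right shifts and the `atanh` values would come from a small lookup table.

```python
import math


def hyperbolic_shift_sequence(n):
    """First n shift indices 1, 2, 3, 4, 4, 5, ...; indices 4, 13, 40, ... repeat once."""
    seq, i, repeat_at = [], 1, 4
    while len(seq) < n:
        seq.append(i)
        if i == repeat_at:
            seq.append(i)                 # repeated step needed for convergence
            repeat_at = 3 * repeat_at + 1
        i += 1
    return seq[:n]


def cordic_tanh(z, iterations=16):
    """Approximate tanh(z) for |z| below about 1.1 (range of the basic algorithm)."""
    x, y = 1.0, 0.0
    for i in hyperbolic_shift_sequence(iterations):
        d = 1.0 if z >= 0 else -1.0       # rotate toward z = 0
        shift = 2.0 ** -i                 # a right shift by i bits in hardware
        x, y = x + d * y * shift, y + d * x * shift
        z -= d * math.atanh(shift)        # table entry in hardware
    # x and y carry the same CORDIC gain, so it cancels in the ratio.
    return y / x


def cordic_sigmoid(v, iterations=16):
    """sigmoid(v) = 0.5 * (1 + tanh(v / 2)); valid for |v| below about 2.2."""
    return 0.5 * (1.0 + cordic_tanh(v / 2.0, iterations))


tanh_half = cordic_tanh(0.5)
sigmoid_one = cordic_sigmoid(1.0)
```

Reusing one tanh datapath for sigmoid through the identity above is one way a single core could serve several activation functions; inputs outside the convergence range would need range reduction, which this sketch omits.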
Obstacle avoidance for unmanned aerial vehicles such as quadrotors is a popular research topic. Most existing research focuses only on static environments, and obstacle avoidance in environments with multiple dynamic obstacles remains challenging. This paper proposes a novel deep reinforcement learning-based approach that lets quadrotors navigate through highly dynamic environments. We propose a lidar data encoder to extract obstacle information from the massive point cloud data produced by the lidar. Multiple frames of historical scans are compressed into a two-dimensional obstacle map while preserving the required obstacle features. An end-to-end deep neural network is trained to extract the kinematics of dynamic and static obstacles from the obstacle map and to generate acceleration commands that steer the quadrotor away from these obstacles. Our approach combines perception and navigation functions in a single neural network, which can change from a navigating state into a hovering state without mode switching. We also present simulations and real-world experiments to show the effectiveness of our approach when navigating in highly dynamic, cluttered environments.
无人机(如四旋翼飞行器)的避障是当前研究的一个热门话题。大多数现有的研究仅关注静态环境下的避障问题,而在含有多个动态障碍物的环境中进行避障仍然是一项挑战性任务。本文提出了一种基于深度强化学习的方法,旨在帮助四旋翼飞行器在高度动态的环境中导航。 我们设计了一个激光雷达数据编码器来从大量点云数据中提取障碍物信息。多帧历史扫描将被压缩成一个二维障碍地图,同时保留所需的障碍特征。通过端到端的深度神经网络训练,可以从该障碍图中提取静态和动态障碍物的动力学特性,并生成加速度指令以控制四旋翼飞行器避开这些障碍物。 我们的方法在一个单一的神经网络中结合了感知和导航功能,无需模式切换即可从导航状态转换为悬停状态。此外,我们还通过仿真和实际实验展示了该方法在高度动态且复杂的环境中导航时的有效性。
https://arxiv.org/abs/2503.14352
Recent video diffusion models have enhanced video editing, but it remains challenging to handle instructional editing and diverse tasks (e.g., adding, removing, changing) within a unified framework. In this paper, we introduce VEGGIE, a Video Editor with Grounded Generation from Instructions, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. Specifically, given a video and text query, VEGGIE first utilizes an MLLM to interpret user intentions in instructions and ground them to the video contexts, generating frame-specific grounded task queries for pixel-space responses. A diffusion model then renders these plans and generates edited videos that align with user intent. To support diverse tasks and complex instructions, we employ a curriculum learning strategy: first aligning the MLLM and video diffusion model with large-scale instructional image editing data, followed by end-to-end fine-tuning on high-quality multitask video data. Additionally, we introduce a novel data synthesis pipeline to generate paired instructional video editing data for model training. It transforms static image data into diverse, high-quality video editing samples by leveraging Image-to-Video models to inject dynamics. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model, while other models struggle with multi-tasking. VEGGIE also excels in video object grounding and reasoning segmentation, where other baselines fail. We further reveal how the multiple tasks help each other and highlight promising applications like zero-shot multimodal instructional and in-context video editing.
最近的视频扩散模型(video diffusion models)已经提升了视频编辑的效果,但仍然面临挑战,特别是在处理基于指令的编辑任务和多样化的编辑需求上。本文介绍了一种新的方法VEGGIE(Video Editor with Grounded Generation from Instructions),这是一种简单而全面的端到端框架,旨在统一各种用户指令下的视频概念编辑、定位以及推理过程。具体而言,给定一个视频和文本查询后,VEGGIE首先使用大规模语言模型(MLLM)来解析用户的意图,并将这些意图与视频内容相关联,生成特定于每一帧的任务查询,以供后续像素空间的响应处理。接下来,扩散模型根据这些计划绘制图像并生成符合用户需求的编辑后的视频。 为了支持多样化的任务和复杂的指令,我们采用了一种课程学习策略:首先使用大规模的基于指令的图像编辑数据对MLLM和视频扩散模型进行预训练;然后在高质量的多任务视频数据上进行端到端微调。此外,还引入了一个新颖的数据合成流程来生成配对的、用于模型训练的基于指令的视频编辑数据集。该流程通过利用从静态图像到动态视频的转换技术(例如Image-to-Video模型),将静态图片数据转化为多样化且高质量的视频编辑样本。 VEGGIE在处理不同技能水平的视频编辑任务时表现优异,超越了单一功能的最佳基准方法,并展示了其作为多功能模型的优势。同时,在涉及视频对象定位和推理分割的任务上,其他基准方法常常表现不佳,而VEGGIE却能很好地完成这些任务。 此外,本文还探讨了多任务之间如何相互促进和支持,并强调了零样本跨模态指令编辑与上下文视频编辑等有前景的应用领域。
https://arxiv.org/abs/2503.14350
Multi-map sparse monocular visual Simultaneous Localization and Mapping (SLAM) applied to monocular endoscopic sequences has proven effective at robustly recovering tracking after the frequent losses in endoscopy caused by motion blur, temporal occlusion, tool interaction, or water jets. The sparse multi-maps are adequate for robust camera localization. However, they represent the environment poorly: they are noisy, with a high percentage of inaccurately reconstructed 3D points, including significant outliers, and, more importantly, their density is unacceptably low for clinical applications. We propose a method to remove outliers and densify the maps of CudaSIFT-SLAM, the state of the art in sparse multi-map endoscopy SLAM. Dense up-to-scale depth predictions from the LightDepth neural network are aligned with the sparse CudaSIFT submaps by means of LMedS, which is robust to spurious data. Our system mitigates the inherent scale ambiguity in monocular depth estimation while filtering outliers, leading to reliable densified 3D maps. We provide experimental evidence of accurate densified maps, with 4.15 mm RMS accuracy at an affordable computing time, on the C3VD phantom colon dataset. We report qualitative results on real colonoscopies from the Endomapper dataset.
多图稀疏单目视觉同步定位与地图构建(SLAM)技术在处理内窥镜序列时,已证明能够有效恢复由于运动模糊、时间遮挡、工具交互或水柱等原因造成的频繁跟踪丢失。虽然稀疏的多图对于相机定位具有鲁棒性,但它们对环境表示的效果较差:这些地图中包含大量噪声和不准确重建的3D点,包括显著的异常值,并且更重要的是,密度极低,无法满足临床应用的要求。 我们提出了一种方法来去除异常值并增加现有的稀疏内窥镜多图CUDA SIFT-SLAM技术地图的稠密性。通过利用鲁棒于虚假匹配的LMedS算法,我们将NN LightDepth(一种用于尺度不变深度预测的方法)与稀疏的CUDA SIFT子图对齐。我们的系统在处理单目深度估计中的固有尺度模糊时,不仅能过滤异常值,还能生成可靠且稠密化的3D地图。 我们在C3VD仿真结肠数据集上提供了实验证据,表明我们能够以可接受的计算时间实现4.15毫米RMS精度的准确稠密化地图。同时,我们也报告了来自Endomapper数据集的真实结肠镜检查中的定性结果。
https://arxiv.org/abs/2503.14346
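The alignment step in the abstract above, fitting up-to-scale dense depth to sparse SLAM depth with least median of squares (LMedS), can be sketched for the simplest case of a single scale factor. Everything here is an assumption made for illustration: the function names, the one-parameter scale model, the use of every correspondence as a candidate, and the fixed inlier threshold are not taken from the paper.

```python
import statistics


def lmeds_scale(pred_depths, sparse_depths):
    """Return the scale s minimizing median((s * pred - sparse)^2) over candidates.

    Each correspondence proposes one candidate s = sparse / pred, so predicted
    depths are assumed to be strictly positive.
    """
    best_scale, best_cost = None, float("inf")
    for p, z in zip(pred_depths, sparse_depths):
        s = z / p
        residuals = [(s * pi - zi) ** 2 for pi, zi in zip(pred_depths, sparse_depths)]
        cost = statistics.median(residuals)
        if cost < best_cost:
            best_scale, best_cost = s, cost
    return best_scale, best_cost


def inlier_indices(pred_depths, sparse_depths, scale, max_residual):
    """Indices of sparse points whose scaled-depth residual is within max_residual."""
    return [
        i
        for i, (p, z) in enumerate(zip(pred_depths, sparse_depths))
        if abs(scale * p - z) <= max_residual
    ]


# Made-up data: true scale 2.0, last sparse point is a gross outlier.
pred = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sparse = [2.0, 4.0, 6.0, 8.0, 10.0, 30.0]

scale, cost = lmeds_scale(pred, sparse)
inliers = inlier_indices(pred, sparse, scale, max_residual=0.5)
```

Because the median ignores up to half of the residuals, the single outlier does not pull the estimated scale, which a plain least-squares fit would not guarantee.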
Recent advances in text-to-speech synthesis have achieved notable success in generating high-quality short utterances for individual speakers. However, these systems still face challenges when extending their capabilities to long, multi-speaker, and spontaneous dialogues, typical of real-world scenarios such as podcasts. These limitations arise from two primary challenges: 1) long speech: podcasts typically span several minutes, exceeding the upper limit of most existing work; 2) spontaneity: podcasts are marked by their spontaneous, oral nature, which sharply contrasts with formal, written contexts; existing works often fall short in capturing this spontaneity. In this paper, we propose MoonCast, a solution for high-quality zero-shot podcast generation, aiming to synthesize natural podcast-style speech from text-only sources (e.g., stories, technical reports, news in TXT, PDF, or Web URL formats) using the voices of unseen speakers. To generate long audio, we adopt a long-context language model-based audio modeling approach utilizing large-scale long-context speech data. To enhance spontaneity, we utilize a podcast generation module to generate scripts with spontaneous details, which have been empirically shown to be as crucial as the text-to-speech modeling itself. Experiments demonstrate that MoonCast outperforms baselines, with particularly notable improvements in spontaneity and coherence.
最近的文本到语音合成技术在为个别说话人生成高质量的简短话语方面取得了显著成功。然而,当这些系统试图扩展其能力以处理长篇、多说话者和即兴对话时(如播客等现实场景),它们仍面临挑战。这些限制主要源于以下两个问题:1)长时间语音:播客通常持续几分钟甚至更长的时间,超出了大多数现有研究的工作上限;2)自发性:播客以其非正式的口头形式为特征,这与现有的书面语境截然不同;目前的研究往往难以捕捉这种自发性。 在本文中,我们提出了MoonCast,这是一种针对高质量零样本(zero-shot)播客生成解决方案。MoonCast旨在使用未知说话人的声音从仅有文本来源(例如故事、技术报告或新闻的TXT、PDF或Web URL格式文件)合成自然且风格化的播客式语音。为了生成长时间音频,MoonCast采用了一种基于大规模长上下文语言模型的声音建模方法,以利用大量的长上下文语音数据。为了提高自发性,我们引入了一个播客生成模块来创建带有即兴细节的脚本,这些即兴细节已被证明与文本到语音的建模本身同样重要。 实验结果表明,MoonCast超越了基线系统,在自发性和连贯性方面尤其表现出显著改进。
https://arxiv.org/abs/2503.14345
Medical image segmentation aims to identify anatomical structures at the voxel-level. Segmentation accuracy relies on distinguishing voxel differences. Compared to advancements achieved in studies of the inter-class variance, the intra-class variance receives less attention. Moreover, traditional linear classifiers, limited by a single learnable weight per class, struggle to capture this finer distinction. To address the above challenges, we propose a Multi-Prototype-based Embedding Refinement method for semi-supervised medical image segmentation. Specifically, we design a multi-prototype-based classification strategy, rethinking the segmentation from the perspective of structural relationships between voxel embeddings. The intra-class variations are explored by clustering voxels along the distribution of multiple prototypes in each class. Next, we introduce a consistency constraint to alleviate the limitation of linear classifiers. This constraint integrates different classification granularities from a linear classifier and the proposed prototype-based classifier. In the thorough evaluation on two popular benchmarks, our method achieves superior performance compared with state-of-the-art methods. Code is available at this https URL.
医学图像分割的目标是在体素级别识别解剖结构。分割精度依赖于区分不同体素之间的差异。相比于在研究类间差异方面的进展,对于类内差异的关注较少。此外,传统的线性分类器由于每个类别只有一个可学习权重,在捕捉这些细微差别方面存在困难。为了解决上述挑战,我们提出了一种基于多原型的嵌入优化方法,用于半监督医学图像分割。具体而言,我们设计了基于多原型的分类策略,从体素嵌入之间的结构关系出发重新思考分割问题。通过在每个类别中多个原型分布上的聚类来探索类内变化。接下来,我们引入了一致性约束以缓解线性分类器的局限性。这种约束结合了来自线性分类器和所提出的基于原型的分类器的不同分类粒度。在两个流行基准数据集上进行彻底评估后,我们的方法相比现有最佳方法取得了更优性能。代码可在提供的链接处获取。
https://arxiv.org/abs/2503.14343
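The multi-prototype classification strategy described in the abstract above can be illustrated with a small sketch in which each class keeps several prototype vectors and a voxel embedding is assigned to the class of its most similar prototype. Cosine similarity, the max over prototypes, and all names and numbers below are assumptions; the abstract does not specify the similarity measure or how prototypes are obtained.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(embedding, prototypes):
    """Return the label whose closest prototype is most similar to the embedding.

    prototypes maps each class label to a list of prototype vectors.
    """
    best_label, best_sim = None, -2.0  # cosine similarity is never below -1
    for label, protos in prototypes.items():
        sim = max(cosine(embedding, p) for p in protos)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label


# Hypothetical 2-D embeddings: the "organ" class has two distinct modes,
# which a single weight vector per class would have to average together.
prototypes = {
    "organ": [[1.0, 0.0], [0.0, 1.0]],
    "background": [[-1.0, -1.0]],
}

label_a = classify([0.1, 1.0], prototypes)
label_b = classify([-1.0, -0.5], prototypes)
```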
Predicting the words that a child is going to learn next can be useful for boosting language acquisition, and such predictions have been shown to be possible with both neural network techniques (looking at changes in the vocabulary state over time) and graph models (looking at data pertaining to the relationships between words). However, these models do not fully capture the complexity of an infant's language learning process when used in isolation. In this paper, we examine how a model of language acquisition for infants and young children can be constructed and adapted for use in a Spatio-Temporal Graph Convolutional Network (STGCN), taking into account the different types of linguistic relationships that occur during child language learning. We introduce a novel approach for predicting child vocabulary acquisition and evaluate the efficacy of such a model with respect to these different types of linguistic relationships, resulting in insightful observations on model calibration and norm selection. The evaluation found that the mean accuracy of models predicting new words using sensorimotor relationships (0.733) and semantic relationships (0.729) was superior to that of a 2-layer feed-forward neural network. Furthermore, the high recall for some relationships suggests that they (e.g., visual) are better than others (e.g., auditory) at identifying a larger proportion of the relevant words that a child should subsequently learn.
预测孩子即将学习的单词对于促进语言习得非常有用,研究显示使用神经网络技术(分析词汇状态随时间的变化)和图模型(考察词与词之间关系的数据)可以实现这样的预测。然而,在单独使用时,这些模型并不能完全捕捉到婴儿语言学习过程中的复杂性。 在本文中,我们探讨了如何为婴幼儿构建和调整一种语言习得模型,并将其应用于时空图卷积网络(STGCN),同时考虑在儿童语言学习过程中发生的不同类型的语言关系。我们提出了一种预测儿童词汇获取的新方法,并评估这种模型在不同类型语言关系下的有效性,从而得出关于模型校准和选择标准的深刻见解。 对该模型进行的评估表明,在使用感觉运动关系和语义关系时,用于预测新单词的模型平均准确率分别为0.733和0.729,优于使用两层前馈神经网络观察到的结果。此外,某些关系(如视觉)具有较高的召回率,意味着这些关系比其他类型的关系(例如听觉)更能识别出孩子后续应学习的相关词汇的较大比例。 通过这种方式,这项研究不仅为预测儿童语言习得提供了一个新颖的方法论框架,而且也为理解不同类型的语言关系如何影响这一过程提供了有价值的见解。
https://arxiv.org/abs/2503.14341
While recent works (e.g., o1, DeepSeek R1) have demonstrated the great promise of using long Chain-of-Thought (CoT) to improve the reasoning capabilities of language models, scaling it up at test time is challenging due to inefficient memory usage: intermediate computations accumulate indefinitely in context even when they are no longer needed for future thoughts. We propose PENCIL, which incorporates a reduction mechanism into the autoregressive generation process, allowing the model to recursively clean up intermediate thoughts based on patterns learned from training. With this reduction mechanism, PENCIL significantly reduces the maximal context length required during generation, and thus can generate longer thoughts with limited memory, solving larger-scale problems given more thinking time. For example, we demonstrate that PENCIL achieves 97% accuracy on the challenging Einstein's puzzle, a task that even large models like GPT-4 struggle with, using only a small 25M-parameter transformer with a context length of 2048. Theoretically, we prove that PENCIL can perform universal space-efficient computation by simulating Turing machines with optimal time and space complexity, and thus can solve arbitrary computational tasks that would otherwise be intractable given context window constraints.
最近的一些研究(例如o1,DeepSeek R1)展示了使用长链式思维(Chain-of-Thought, CoT)来提升语言模型的推理能力的巨大潜力。然而,在测试时扩大这一方法的应用范围是具有挑战性的,因为其内存利用效率低下——即使不再需要,中间计算也会无限积累在上下文中。为了解决这个问题,我们提出了PENCIL,这是一种将缩减机制融入自回归生成过程中的方法,使模型能够根据从训练中学习到的模式递归地清理中间思维内容。借助这一缩减机制,PENCIL显著减少了生成过程中所需的最大上下文长度,因此能够在有限内存的情况下生成更长的思考序列,并且能通过更多的时间进行思考来解决更大规模的问题。例如,我们展示了使用仅25M参数的小型Transformer模型(上下文长度为2048),PENCIL在颇具挑战性的爱因斯坦难题上实现了97%的准确率——这是即使是像GPT-4这样的大型模型也难以完成的任务。从理论上讲,我们证明了PENCIL能够通过模拟具有最优时间与空间复杂度的图灵机来进行通用的空间效率计算,因此可以解决由于上下文窗口限制而通常无法处理的任意计算任务。
https://arxiv.org/abs/2503.14337
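The reduction mechanism in the PENCIL abstract above can be pictured as a rewrite rule applied to the context during generation. The sketch below assumes three marker tokens and the rule `C [CALL] T [SEP] A [RETURN] -> C A`, where `T` is the intermediate thought to discard and `A` is the answer to keep. The marker names, the rule's exact form, and the toy token stream are assumptions for illustration; the abstract only states that intermediate thoughts are cleaned up recursively.

```python
CALL, SEP, RETURN = "[CALL]", "[SEP]", "[RETURN]"  # hypothetical marker tokens


def reduce_context(tokens):
    """If the context ends with RETURN, rewrite  C CALL T SEP A RETURN  to  C A.

    Assumes well-formed input: a ValueError is raised if the markers are missing.
    """
    if not tokens or tokens[-1] != RETURN:
        return tokens
    sep = max(i for i, t in enumerate(tokens) if t == SEP)
    call = max(i for i, t in enumerate(tokens[:sep]) if t == CALL)
    # Keep everything before CALL, drop the thought T, keep the answer A.
    return tokens[:call] + tokens[sep + 1:-1]


def generate_with_reduction(stream):
    """Feed tokens one by one, reducing eagerly; return final context and peak length."""
    context, peak = [], 0
    for tok in stream:
        context.append(tok)
        peak = max(peak, len(context))
        context = reduce_context(context)
    return context, peak


# Toy stream standing in for model output: two sub-computations in sequence.
stream = [
    "Q",
    CALL, "t1", "t2", "t3", SEP, "a1", RETURN,
    CALL, "u1", "u2", SEP, "a2", RETURN,
]
final_context, peak_length = generate_with_reduction(stream)
```

In this toy run the peak context length stays below the total number of generated tokens, which is the memory saving the abstract attributes to the reduction mechanism.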
Studies often aim to reveal how neural representations encode aspects of an observer's environment, such as its contents or structure. These are "first-order" representations (FORs), because they're "about" the external world. A less-common target is "higher-order" representations (HORs), which are "about" FORs: their contents, stability, or uncertainty. HORs of uncertainty appear critically involved in adaptive behaviors including learning under uncertainty, influencing learning rates and internal model updating based on environmental feedback. However, HORs about uncertainty are unlikely to be direct "read-outs" of FOR characteristics, instead reflecting estimation processes which may be lossy, bias-prone, or distortive and which may also incorporate estimates of distributions of uncertainty the observer is likely to experience. While some research has targeted neural representations of "instantaneously" estimated uncertainty, how the brain represents distributions of expected uncertainty remains largely unexplored. Here, we propose a novel reinforcement learning (RL) based generative artificial intelligence (genAI) approach to explore neural representations of uncertainty distributions. We use existing functional magnetic resonance imaging data, where humans learned to "de-noise" their brain states to achieve target neural patterns, to train denoising diffusion genAI models with RL algorithms to learn noise distributions similar to how humans might learn to do the same. We then explore these models' learned noise-distribution HORs compared to control models trained with traditional backpropagation. Results reveal model-dependent differences in noise distribution representations, with the RL-based model offering much higher explanatory power for human behavior, offering an exciting path towards using genAI to explore neural noise-distribution HORs.
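Research usually sets out to show how neural representations encode aspects of an observer's environment, such as its contents or structure. These are called "first-order" representations (FORs) because they concern the external world. "Higher-order" representations (HORs) are studied less often; they are representations about FORs: their contents, stability, or uncertainty. HORs of uncertainty appear to play a key role in adaptive behaviors, including learning under uncertainty, where they influence learning rates and the updating of internal models from environmental feedback. HORs about uncertainty are unlikely to be direct read-outs of FOR characteristics, however. They more likely reflect estimation processes that may be lossy, biased, or distorting, and that may also incorporate estimates of the distributions of uncertainty the observer is likely to encounter. Although some studies have targeted neural representations of "instantaneously" estimated uncertainty, how the brain represents distributions of expected uncertainty remains largely unexplored. Here we propose a generative artificial intelligence (genAI) approach based on reinforcement learning (RL) to explore neural representations of uncertainty distributions. We use existing functional magnetic resonance imaging data, from experiments in which humans learned to "de-noise" their brain states to reach target neural patterns, to train denoising diffusion genAI models with RL algorithms so that they learn noise distributions in a way similar to how humans might. We then compare the noise-distribution HORs learned by these models with those of control models trained with traditional backpropagation. The results reveal model-dependent differences in noise-distribution representations, with the RL-based model explaining human behavior considerably better, which opens an exciting path toward using genAI to explore neural noise-distribution HORs.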
https://arxiv.org/abs/2503.14333
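The claim in the abstract above that uncertainty estimates influence learning rates can be made concrete with a standard textbook example, a one-dimensional Kalman-style update. This is not the paper's method, and the function name `kalman_update` and its parameters are chosen here purely for illustration. The point is only that the effective learning rate (the gain) rises when the learner is more uncertain about its own estimate and falls when observations are noisier.

```python
def kalman_update(mean, var, observation, obs_noise_var):
    """One scalar Kalman-style update; returns new mean, new variance, and the gain.

    mean, var      -- current estimate and the learner's uncertainty about it
    observation    -- new feedback from the environment
    obs_noise_var  -- assumed variance of the observation noise
    """
    gain = var / (var + obs_noise_var)          # acts as the learning rate
    new_mean = mean + gain * (observation - mean)
    new_var = (1.0 - gain) * var                # uncertainty shrinks after feedback
    return new_mean, new_var, gain


# Same observation, different prior uncertainty: higher uncertainty, larger update.
low_uncertainty = kalman_update(0.0, 1.0, 1.0, 1.0)
high_uncertainty = kalman_update(0.0, 4.0, 1.0, 1.0)
print(low_uncertainty, high_uncertainty)
```

A learner whose uncertainty estimate is biased or lossy, as the abstract suggests higher-order representations may be, would compute a different gain from the same data and therefore update differently.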
Efficient material logistics play a critical role in controlling costs and schedules in the construction industry. However, manual material handling remains prone to inefficiencies, delays, and safety risks. Autonomous forklifts offer a promising solution to streamline on-site logistics, reducing reliance on human operators and mitigating labor shortages. This paper presents the development and evaluation of the Autonomous Dynamic All-terrain Pallet Transporter (ADAPT), a fully autonomous off-road forklift designed for construction environments. Unlike structured warehouse settings, construction sites pose significant challenges, including dynamic obstacles, unstructured terrain, and varying weather conditions. To address these challenges, our system integrates AI-driven perception techniques with traditional approaches for decision making, planning, and control, enabling reliable operation in complex environments. We validate the system through extensive real-world testing, comparing its long-term performance against an experienced human operator across various weather conditions. We also provide a comprehensive analysis of challenges and key lessons learned, contributing to the advancement of autonomous heavy machinery. Our findings demonstrate that autonomous outdoor forklifts can operate near human-level performance, offering a viable path toward safer and more efficient construction logistics.
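Efficient material logistics is essential to controlling cost and schedule in the construction industry. Manual material handling, however, remains prone to inefficiency, delays, and safety risks. Autonomous forklifts are a promising way to streamline on-site logistics, reduce reliance on human operators, and ease labor shortages. This paper describes the development and evaluation of the Autonomous Dynamic All-terrain Pallet Transporter (ADAPT), a fully autonomous off-road forklift designed for construction environments. Unlike structured warehouse settings, construction sites pose major challenges, including dynamic obstacles, unstructured terrain, and changing weather. To meet these challenges, our system combines AI-driven perception with traditional methods for decision making, planning, and control, which allows it to operate reliably in complex environments. We validate the system through extensive field testing and compare its long-term performance with that of an experienced human operator under a range of weather conditions. We also provide a thorough analysis of the challenges encountered and the key lessons learned, contributing to the advancement of autonomous heavy machinery. Our results show that autonomous outdoor forklifts can operate at close to human-level performance, offering a viable path toward safer and more efficient construction logistics.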
https://arxiv.org/abs/2503.14331