We introduce a simple framework for predicting the behavior of an agent in multi-agent settings. In contrast to autoregressive (AR) tasks, such as language processing, our focus is on scenarios with multiple agents whose interactions are shaped by physical constraints and internal motivations. To this end, we propose Poly-Autoregressive (PAR) modeling, which forecasts an ego agent's future behavior by reasoning about the ego agent's state history and the past and current states of other interacting agents. At its core, PAR represents the behavior of all agents as a sequence of tokens, each representing an agent's state at a specific timestep. With minimal data pre-processing changes, we show that PAR can be applied to three different problems: human action forecasting in social situations, trajectory prediction for autonomous vehicles, and object pose forecasting during hand-object interaction. Using a small proof-of-concept transformer backbone, PAR outperforms AR across these three scenarios. The project website can be found at this https URL.
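To make the PAR sequence layout concrete, here is a minimal sketch of how per-agent state tokens could be interleaved so that a prediction for timestep t+1 can attend to every agent's states up to t; the discretization and shapes are illustrative assumptions, not the authors' code:

```python
import torch

def build_par_sequence(agent_states: torch.Tensor) -> torch.Tensor:
    """Interleave per-agent state tokens into one flat sequence.

    agent_states: (num_agents, num_timesteps) integer state tokens,
    with agent 0 as the ego agent. The layout places every agent's
    token for timestep t before any token for t+1, so an autoregressive
    model predicting the ego token at t+1 sees all agents up to t.
    """
    num_agents, num_timesteps = agent_states.shape
    # (A, T) -> (T, A) -> flatten row-major: t0:a0..aN, t1:a0..aN, ...
    return agent_states.transpose(0, 1).reshape(num_agents * num_timesteps)

# toy example: 3 agents, 4 timesteps of discretized states
states = torch.arange(12).reshape(3, 4)
print(build_par_sequence(states))  # tensor([0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11])
```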
https://arxiv.org/abs/2502.08646
Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present PulseCheck457, a scalable and unbiased synthetic dataset designed around 4 key capabilities for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single-object recognition to our newly proposed complex 6D spatial reasoning tasks. We evaluate various LMMs on PulseCheck457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings.
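The abstract does not spell out the RPDR formula; a plausible reading, sketched below under that assumption, is the fraction of accuracy lost when moving from a simpler difficulty level to a harder one:

```python
def rpdr(simple_acc: float, complex_acc: float) -> float:
    """Relative Performance Dropping Rate (our reading, assuming
    simple_acc > 0): the fraction of simple-task accuracy that is
    lost on the more complex task. The paper's exact definition
    may differ."""
    return (simple_acc - complex_acc) / simple_acc

# e.g., 0.90 accuracy on single-object recognition vs 0.45 on 6D reasoning
print(rpdr(0.90, 0.45))  # 0.5 -> half the performance is lost
```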
https://arxiv.org/abs/2502.08636
The growing availability of longitudinal Magnetic Resonance Imaging (MRI) datasets has facilitated Artificial Intelligence (AI)-driven modeling of disease progression, making it possible to predict future medical scans for individual patients. However, despite significant advancements in AI, current methods continue to face challenges including achieving patient-specific individualization, ensuring spatiotemporal consistency, efficiently utilizing longitudinal data, and managing the substantial memory demands of 3D scans. To address these challenges, we propose Brain Latent Progression (BrLP), a novel spatiotemporal model designed to predict individual-level disease progression in 3D brain MRIs. The key contributions in BrLP are fourfold: (i) it operates in a small latent space, mitigating the computational challenges posed by high-dimensional imaging data; (ii) it explicitly integrates subject metadata to enhance the individualization of predictions; (iii) it incorporates prior knowledge of disease dynamics through an auxiliary model, facilitating the integration of longitudinal data; and (iv) it introduces the Latent Average Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in the predicted progression at inference time and (b) allows us to derive a measure of the uncertainty for the prediction. We train and evaluate BrLP on 11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its generalizability on an external test set comprising 2,257 MRIs from 962 subjects. Our experiments compare BrLP-generated MRI scans with real follow-up MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The code is publicly available at: this https URL.
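As a rough illustration of the LAS idea, here is a minimal sketch under our own reading of the abstract, where `sample_fn` is a hypothetical stand-in for one latent diffusion sampling run targeting the same future timepoint:

```python
import torch

def latent_average_stabilization(sample_fn, num_samples: int = 8):
    """Draw several latent predictions for the same target timepoint,
    average them into a stabilized prediction, and use their spread
    as an uncertainty estimate (sketch, not the paper's exact LAS)."""
    samples = torch.stack([sample_fn() for _ in range(num_samples)])
    prediction = samples.mean(dim=0)   # smoother, more consistent latent
    uncertainty = samples.std(dim=0)   # per-element spread as uncertainty
    return prediction, uncertainty
```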
https://arxiv.org/abs/2502.08560
Video Moment Retrieval is a common task for evaluating the performance of visual-language models: it involves localising the start and end times of moments in videos from query sentences. The current task formulation assumes that the queried moment is present in the video, resulting in false positive moment predictions when irrelevant query sentences are provided. In this paper we propose the task of Negative-Aware Video Moment Retrieval (NA-VMR), which considers both moment retrieval accuracy and negative query rejection accuracy. We make the distinction between In-Domain and Out-of-Domain negative queries and provide new evaluation benchmarks for two popular video moment retrieval datasets: QVHighlights and Charades-STA. We analyse the ability of current SOTA video moment retrieval approaches to adapt to Negative-Aware Video Moment Retrieval and propose UniVTG-NA, an adaptation of UniVTG designed to tackle NA-VMR. UniVTG-NA achieves high negative rejection accuracy (avg. $98.4\%$) while keeping moment retrieval Recall@1 within $3.87\%$ of the original. Dataset splits and code are available at this https URL
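A minimal sketch of how the negative-rejection side of NA-VMR could be scored; the data layout (a prediction of `None` meaning the query was rejected) is an assumption for illustration:

```python
def negative_rejection_accuracy(predictions, is_negative):
    """Fraction of negative (irrelevant) queries that the model
    rejects. `predictions` holds None when the model declines to
    localise a moment; `is_negative` flags queries that have no
    matching moment in the video."""
    negatives = [p for p, neg in zip(predictions, is_negative) if neg]
    return sum(p is None for p in negatives) / max(len(negatives), 1)

# toy example: two negative queries, one correctly rejected
print(negative_rejection_accuracy([None, (3.0, 7.5)], [True, True]))  # 0.5
```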
https://arxiv.org/abs/2502.08544
The properties of black holes and accretion flows can be inferred by fitting Event Horizon Telescope (EHT) data to simulated images generated through general relativistic ray tracing (GRRT). However, due to the computationally intensive nature of GRRT, the efficiency of generating specific radiation flux images needs to be improved. This paper introduces the Branch Correction Denoising Diffusion Model (BCDDM), which uses a branch correction mechanism and a weighted mixed loss function to improve the accuracy of generated black hole images based on seven physical parameters of the radiatively inefficient accretion flow (RIAF) model. Our experiments show a strong correlation between the generated images and their physical parameters. By enhancing the GRRT dataset with BCDDM-generated images and using ResNet50 for parameter regression, we achieve significant improvements in parameter prediction performance. This approach reduces computational costs and provides a faster, more efficient method for dataset expansion, parameter estimation, and model fitting.
https://arxiv.org/abs/2502.08528
Next token prediction has been the standard training objective used in large language model pretraining. Representations are learned as a result of optimizing for token-level perplexity. We propose Continuous Concept Mixing (CoCoMix), a novel pretraining framework that combines discrete next token prediction with continuous concepts. Specifically, CoCoMix predicts continuous concepts learned from a pretrained sparse autoencoder and mixes them into the model's hidden state by interleaving with token hidden representations. Through experiments on multiple benchmarks, including language modeling and downstream reasoning tasks, we show that CoCoMix is more sample efficient and consistently outperforms standard next token prediction, knowledge distillation and inserting pause tokens. We find that combining both concept learning and interleaving in an end-to-end framework is critical to performance gains. Furthermore, CoCoMix enhances interpretability and steerability by allowing direct inspection and modification of the predicted concept, offering a transparent way to guide the model's internal reasoning process.
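A minimal sketch of the mixing step as we read it; the layer names, shapes, and the way concept vectors are interleaved with token hidden states are assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class ConceptMixer(nn.Module):
    """Predict continuous concepts from token hidden states and
    interleave them back into the sequence (illustrative sketch)."""

    def __init__(self, d_model: int, d_concept: int):
        super().__init__()
        self.to_concept = nn.Linear(d_model, d_concept)  # concept prediction head
        self.to_hidden = nn.Linear(d_concept, d_model)   # project concept back

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) token hidden states
        concept = self.to_concept(h)          # continuous concept per token
        mixed = self.to_hidden(concept)       # "continuous concept" vector
        out = torch.stack([h, mixed], dim=2)  # (batch, seq, 2, d_model)
        return out.flatten(1, 2)              # interleaved: (batch, 2*seq, d_model)
```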
https://arxiv.org/abs/2502.08524
Referring Remote Sensing Image Segmentation (RRSIS) is critical for ecological monitoring, urban planning, and disaster management, requiring precise segmentation of objects in remote sensing imagery guided by textual descriptions. This task is uniquely challenging due to the considerable vision-language gap, the high spatial resolution and broad coverage of remote sensing imagery with diverse categories and small targets, and the presence of clustered, unclear targets with blurred edges. To tackle these issues, we propose \ours, a novel framework designed to bridge the vision-language gap, enhance multi-scale feature interaction, and improve fine-grained object differentiation. Specifically, \ours introduces: (1) the Bidirectional Spatial Correlation (BSC) for improved vision-language feature alignment, (2) the Target-Background TwinStream Decoder (T-BTD) for precise distinction between targets and non-targets, and (3) the Dual-Modal Object Learning Strategy (D-MOLS) for robust multimodal feature reconstruction. Extensive experiments on the benchmark datasets RefSegRS and RRSIS-D demonstrate that \ours achieves state-of-the-art performance. Specifically, \ours improves the overall IoU (oIoU) by 3.76 percentage points (80.57) and 1.44 percentage points (79.23) on the two datasets, respectively. Additionally, it outperforms previous methods in the mean IoU (mIoU) by 5.37 percentage points (67.95) and 1.84 percentage points (66.04), effectively addressing the core challenges of RRSIS with enhanced precision and robustness.
https://arxiv.org/abs/2502.08486
Recently, the generation of dynamic 3D objects from a video has shown impressive results. Existing methods directly optimize Gaussians using all the information in frames. However, when dynamic regions are interwoven with static regions within frames, particularly if the static regions account for a large proportion, existing methods often overlook information in dynamic regions and are prone to overfitting on static regions. This leads to results with blurry textures. We argue that decoupling dynamic and static features to enhance dynamic representations can alleviate this issue. Thus, we propose a dynamic-static feature decoupling module (DSFD). Along the temporal axis, it regards the portions of current-frame features that differ significantly from reference-frame features as dynamic features; the remaining parts are the static features. We then obtain decoupled features driven by the dynamic features and the current-frame features. Moreover, to further enhance the dynamic representation of decoupled features from different viewpoints and ensure accurate motion prediction, we design a temporal-spatial similarity fusion module (TSSF). Along the spatial axis, it adaptively selects similar information from dynamic regions. Building on the above, we construct a novel approach, DS4D. Experimental results verify that our method achieves state-of-the-art (SOTA) results in video-to-4D generation. In addition, experiments on a real-world scenario dataset demonstrate its effectiveness on 4D scenes. Our code will be publicly available.
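A minimal sketch of the decoupling criterion as described, treating per-position features that differ strongly from the reference frame as dynamic; the L2 distance and fixed threshold are illustrative assumptions:

```python
import torch

def decouple_dynamic_static(curr_feat, ref_feat, threshold: float = 0.5):
    """Split current-frame features into dynamic and static parts by
    their difference from reference-frame features (sketch only).

    curr_feat, ref_feat: (..., d) feature maps for the same positions.
    """
    diff = (curr_feat - ref_feat).norm(dim=-1, keepdim=True)  # per-position change
    dynamic_mask = (diff > threshold).float()
    dynamic = curr_feat * dynamic_mask          # strongly changed -> dynamic
    static = curr_feat * (1.0 - dynamic_mask)   # the rest -> static
    return dynamic, static
```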
https://arxiv.org/abs/2502.08377
Camouflaged Object Detection (COD), the task of identifying objects concealed within their environments, has seen rapid growth due to its wide range of practical applications. A key step toward developing trustworthy COD systems is the estimation and effective utilization of uncertainty. In this work, we propose a human-machine collaboration framework for classifying the presence of camouflaged objects, leveraging the complementary strengths of computer vision (CV) models and noninvasive brain-computer interfaces (BCIs). Our approach introduces a multiview backbone to estimate uncertainty in CV model predictions, utilizes this uncertainty during training to improve efficiency, and defers low-confidence cases to human evaluation via RSVP-based BCIs during testing for more reliable decision-making. We evaluated the framework on the CAMO dataset, achieving state-of-the-art results with an average improvement of 4.56\% in balanced accuracy (BA) and 3.66\% in the F1 score compared to existing methods. For the best-performing participants, the improvements reached 7.6\% in BA and 6.66\% in the F1 score. Analysis of the training process revealed a strong correlation between our confidence measures and precision, while an ablation study confirmed the effectiveness of the proposed training policy and the human-machine collaboration strategy. Overall, this work reduces human cognitive load, improves system reliability, and provides a strong foundation for advancements in real-world COD applications and human-computer interaction. Our code and data are available at: this https URL.
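A minimal sketch of the deferral policy: confident CV-model decisions are kept, low-confidence cases are routed to human evaluation via the BCI. The threshold and interfaces are assumptions:

```python
def route_predictions(probs, confidences, tau: float = 0.8):
    """Keep model decisions whose confidence clears tau; defer the
    rest to human RSVP-BCI evaluation (sketch of the policy, not the
    paper's exact implementation)."""
    decisions = []
    for p, c in zip(probs, confidences):
        if c >= tau:
            decisions.append(("model", p >= 0.5))   # automatic decision
        else:
            decisions.append(("human", None))       # sent to RSVP-BCI stage
    return decisions
```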
https://arxiv.org/abs/2502.08373
Spatial relation hallucinations pose a persistent challenge in large vision-language models (LVLMs), leading them to generate incorrect predictions about object positions and spatial configurations within an image. To address this issue, we propose a constraint-aware prompting framework designed to reduce spatial relation hallucinations. Specifically, we introduce two types of constraints: (1) a bidirectional constraint, which ensures consistency in pairwise object relations, and (2) a transitivity constraint, which enforces relational dependence across multiple objects. By incorporating these constraints, LVLMs can produce more spatially coherent and consistent outputs. We evaluate our method on three widely used spatial relation datasets, demonstrating performance improvements over existing approaches. Additionally, a systematic analysis of bidirectional relation analysis choices and transitivity reference selections highlights the broader potential of our method for incorporating constraints to mitigate spatial relation hallucinations.
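A minimal sketch of what the two constraints check, over a hypothetical relation vocabulary; note the paper enforces them through prompting rather than as hard post-hoc filters:

```python
INVERSE = {"left of": "right of", "right of": "left of",
           "above": "below", "below": "above"}

def bidirectional_consistent(rel_ab: str, rel_ba: str) -> bool:
    """Bidirectional constraint: the predicted relation from A to B
    must be the inverse of the predicted relation from B to A."""
    return INVERSE.get(rel_ab) == rel_ba

def transitivity_consistent(rel_ab: str, rel_bc: str, rel_ac: str) -> bool:
    """Transitivity constraint for order-like relations: if A is left
    of B and B is left of C, then A must be left of C."""
    return rel_ab != rel_bc or rel_ac == rel_ab
```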
https://arxiv.org/abs/2502.08317
Differentiating signals from the background in micrographs is a critical initial step for cryogenic electron microscopy (cryo-EM), yet it remains laborious due to low signal-to-noise ratio (SNR), the presence of contaminants and densely packed particles of varying sizes. Although image segmentation has recently been introduced to distinguish particles at the pixel level, the low SNR complicates the automated generation of accurate annotations for training supervised models. Moreover, platforms for systematically comparing different design choices in pipeline construction are lacking. Thus, a modular framework is essential to understand the advantages and limitations of this approach and drive further development. To address these challenges, we present a pipeline that automatically generates high-quality segmentation maps from cryo-EM data to serve as ground truth labels. Our modular framework enables the selection of various segmentation models and loss functions. We also integrate Conditional Random Fields (CRFs) with different solvers and feature sets to refine coarse predictions, thereby producing fine-grained segmentation. This flexibility facilitates optimal configurations tailored to cryo-EM datasets. When trained on a limited set of micrographs, our approach achieves over 90% accuracy, recall, precision, Intersection over Union (IoU), and F1-score on synthetic data. Furthermore, to demonstrate our framework's efficacy in downstream analyses, we show that the particles extracted by our pipeline produce 3D density maps with higher resolution than those generated by existing particle pickers on real experimental datasets, while achieving performance comparable to that of manually curated datasets from experts.
https://arxiv.org/abs/2502.08287
Future Event Prediction (FEP) is an essential activity whose demand and applications span multiple domains. While traditional methods such as simulation, predictive modeling, and time-series forecasting have demonstrated promising outcomes, their application to forecasting complex events is not entirely reliable, since numerical data cannot accurately capture the semantic information related to events. One approach is to gather and aggregate collective opinions about the future, since cumulative perspectives can help estimate the likelihood of upcoming events. In this work, we organize the existing research and frameworks that aim to support future event prediction based on crowd wisdom through aggregating individual forecasts. We discuss the challenges involved, available datasets, and the scope for improvement and future research directions for this task. We also introduce a novel data model to represent individual forecast statements.
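A minimal sketch of what a data model for individual forecast statements might look like; the field names are our assumptions, not the paper's schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ForecastStatement:
    """Illustrative stand-in for a structured individual forecast."""
    forecaster_id: str    # who made the forecast
    event: str            # the future event being predicted
    probability: float    # stated likelihood in [0, 1]
    horizon: date         # date by which the event should resolve
    issued_on: date       # when the forecast was made
    rationale: str = ""   # free-text justification, if any
```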
https://arxiv.org/abs/2502.08205
Decomposition of text into atomic propositions is a flexible framework allowing for the closer inspection of input and output text. We use atomic decomposition of hypotheses in two natural language reasoning tasks, traditional NLI and defeasible NLI, to form atomic sub-problems, or granular inferences that models must weigh when solving the overall problem. These atomic sub-problems serve as a tool to further understand the structure of both NLI and defeasible reasoning, probe a model's consistency and understanding of different inferences, and measure the diversity of examples in benchmark datasets. Our results indicate that LLMs still struggle with logical consistency on atomic NLI and defeasible NLI sub-problems. Lastly, we identify critical atomic sub-problems of defeasible NLI examples, or those that most contribute to the overall label, and propose a method to measure the inferential consistency of a model, a metric designed to capture the degree to which a model makes consistently correct or incorrect predictions about the same fact under different contexts.
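A minimal sketch of one way such an inferential-consistency score could be computed over (fact, correctness) records; the paper's exact metric may differ:

```python
from collections import defaultdict

def inferential_consistency(records):
    """Share of atomic facts on which the model is uniformly correct
    or uniformly incorrect across all contexts it appears in.

    records: iterable of (fact_id, is_correct) pairs, one per context.
    """
    by_fact = defaultdict(list)
    for fact_id, is_correct in records:
        by_fact[fact_id].append(is_correct)
    consistent = sum(len(set(v)) == 1 for v in by_fact.values())
    return consistent / max(len(by_fact), 1)

# fact "a" judged consistently, fact "b" flips across contexts -> 0.5
print(inferential_consistency([("a", True), ("a", True), ("b", True), ("b", False)]))
```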
https://arxiv.org/abs/2502.08080
Trajectory prediction and planning are fundamental components for autonomous vehicles to navigate safely and efficiently in dynamic environments. Traditionally, these components have often been treated as separate modules, limiting the ability to perform interactive planning and leading to computational inefficiency in multi-agent scenarios. In this paper, we present a novel unified and data-driven framework that integrates prediction and planning with a single consistency model. Trained on real-world human driving datasets, our consistency model generates samples from high-dimensional, multimodal joint trajectory distributions of the ego and multiple surrounding agents, enabling end-to-end predictive planning. It effectively produces interactive behaviors, such as proactive nudging and yielding to ensure both safe and efficient interactions with other road users. To incorporate additional planning constraints on the ego vehicle, we propose an alternating direction method for multi-objective guidance in online guided sampling. Compared to diffusion models, our consistency model achieves better performance with fewer sampling steps, making it more suitable for real-time deployment. Experimental results on Waymo Open Motion Dataset (WOMD) demonstrate our method's superiority in trajectory quality, constraint satisfaction, and interactive behavior compared to various existing approaches.
https://arxiv.org/abs/2502.08033
Next-Token Prediction (NTP) is the de facto approach for autoregressive (AR) video generation, but it suffers from suboptimal unidirectional dependencies and slow inference speed. In this work, we propose a semi-autoregressive (semi-AR) framework, called Next-Block Prediction (NBP), for video generation. By uniformly decomposing video content into equal-sized blocks (e.g., rows or frames), we shift the generation unit from individual tokens to blocks, allowing each token in the current block to simultaneously predict the corresponding token in the next block. Unlike traditional AR modeling, our framework employs bidirectional attention within each block, enabling tokens to capture more robust spatial dependencies. By predicting multiple tokens in parallel, NBP models significantly reduce the number of generation steps, leading to faster and more efficient inference. Our model achieves FVD scores of 103.3 on UCF101 and 25.5 on K600, outperforming the vanilla NTP model by an average of 4.4. Furthermore, thanks to the reduced number of inference steps, the NBP model generates 8.89 frames (128x128 resolution) per second, achieving an 11x speedup. We also explored model scales ranging from 700M to 3B parameters, observing significant improvements in generation quality, with FVD scores dropping from 103.3 to 55.3 on UCF101 and from 25.5 to 19.5 on K600, demonstrating the scalability of our approach.
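A minimal sketch of the semi-AR generation step, where one forward pass emits the whole next block instead of a single token; the model interface is a hypothetical stand-in:

```python
import torch

@torch.no_grad()
def generate_next_block(model, blocks: torch.Tensor) -> torch.Tensor:
    """Append one block to a sequence of blocks (sketch).

    blocks: (batch, num_blocks, block_len) token ids generated so far.
    `model` is assumed to use bidirectional attention within each block
    and return next-block logits of shape (batch, block_len, vocab).
    """
    logits = model(blocks)
    next_block = logits.argmax(dim=-1)  # all block_len tokens in parallel
    return torch.cat([blocks, next_block.unsqueeze(1)], dim=1)
```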
https://arxiv.org/abs/2502.07737
This paper presents a learned model that predicts the robot-centric velocity of an underwater robot through dynamics-aware proprioception. The method exploits a recurrent neural network that takes as inputs inertial cues, motor commands, and battery voltage readings, alongside the hidden state of the previous time step, to output robust velocity estimates and their associated uncertainty. An ensemble of networks is utilized to enhance the velocity and uncertainty predictions. By fusing the network's outputs into an Extended Kalman Filter, alongside inertial predictions and barometer updates, the method enables long-term underwater odometry without further exteroception. Furthermore, when integrated into visual-inertial odometry, the method improves estimation resilience when dealing with an order of magnitude fewer total tracked features (as few as 1) compared to conventional visual-inertial systems. Tested onboard an underwater robot deployed both in a laboratory pool and in the Trondheim Fjord, the method takes less than 5 ms for inference on either the CPU or the GPU of an NVIDIA Orin AGX, and demonstrates less than 4% relative position error on novel trajectories during complete visual blackout, and approximately 2% relative error when at most 2 visual features from a monocular camera are available.
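A minimal sketch of the ensemble step; the recurrent-network interface is an assumption, and the EKF fusion that consumes the ensemble statistics is omitted:

```python
import torch

def ensemble_velocity(nets, inputs, hidden_states):
    """Run each recurrent net on the same proprioceptive inputs
    (inertial cues, motor commands, battery voltage) and pool the
    ensemble: the mean becomes the EKF velocity measurement and the
    variance can inflate its measurement noise (sketch only)."""
    preds, new_hidden = [], []
    for net, h in zip(nets, hidden_states):
        v, h_next = net(inputs, h)   # assumed signature: (inputs, hidden)
        preds.append(v)
        new_hidden.append(h_next)
    preds = torch.stack(preds)
    return preds.mean(dim=0), preds.var(dim=0), new_hidden
```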
https://arxiv.org/abs/2502.07726
Negation has been a long-standing challenge for language models. Previous studies have shown that they struggle with negation in many natural language understanding tasks. In this work, we propose a self-supervised method to make language models more robust against negation. We introduce a novel task, Next Sentence Polarity Prediction (NSPP), and a variation of the Next Sentence Prediction (NSP) task. We show that BERT and RoBERTa further pre-trained on our tasks outperform the off-the-shelf versions on nine negation-related benchmarks. Most notably, our pre-training tasks yield between 1.8% and 9.1% improvement on CondaQA, a large question-answering corpus requiring reasoning over negation.
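A minimal sketch of how an NSPP training example could be labelled; the surface-cue heuristic below is a crude stand-in for the paper's polarity labelling:

```python
def make_nspp_example(sent_a: str, sent_b: str) -> dict:
    """Build one Next Sentence Polarity Prediction example: given a
    sentence pair, label whether the next sentence is negated
    (heuristic sketch, not the paper's labelling procedure)."""
    negation_cues = {"not", "no", "never", "without", "nobody", "nothing"}
    tokens = sent_b.lower().replace("n't", " not").split()
    polarity = int(any(tok.strip(".,!?") in negation_cues for tok in tokens))
    return {"sentence_a": sent_a, "sentence_b": sent_b, "polarity": polarity}

print(make_nspp_example("He opened the door.", "It wasn't locked."))  # polarity: 1
```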
https://arxiv.org/abs/2502.07717
Open-ended learning agents must efficiently prioritize goals in vast possibility spaces, focusing on those that maximize learning progress (LP). When such autotelic exploration is achieved by LLM agents trained with online RL in high-dimensional and evolving goal spaces, a key challenge for LP prediction is modeling one's own competence, a form of metacognitive monitoring. Traditional approaches either require extensive sampling or rely on brittle expert-defined goal groupings. We introduce MAGELLAN, a metacognitive framework that lets LLM agents learn to predict their competence and LP online. By capturing semantic relationships between goals, MAGELLAN enables sample-efficient LP estimation and dynamic adaptation to evolving goal spaces through generalization. In an interactive learning environment, we show that MAGELLAN improves LP prediction efficiency and goal prioritization, being the only method allowing the agent to fully master a large and evolving goal space. These results demonstrate how augmenting LLM agents with a metacognitive ability for LP predictions can effectively scale curriculum learning to open-ended goal spaces.
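A minimal sketch of a standard learning-progress target (the windowed change in competence on a goal); MAGELLAN's contribution is predicting such quantities with the LLM itself rather than estimating them by exhaustive sampling:

```python
def learning_progress(competence_history, window: int = 10):
    """Absolute change in mean competence between the last window and
    the one before it, a common LP signal (illustrative sketch)."""
    if len(competence_history) < 2 * window:
        return 0.0
    recent = sum(competence_history[-window:]) / window
    older = sum(competence_history[-2 * window:-window]) / window
    return abs(recent - older)
```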
https://arxiv.org/abs/2502.07709
We present Matrix3D, a unified model that performs several photogrammetry subtasks, including pose estimation, depth prediction, and novel view synthesis, all with a single model. Matrix3D utilizes a multi-modal diffusion transformer (DiT) to integrate transformations across several modalities, such as images, camera parameters, and depth maps. The key to Matrix3D's large-scale multi-modal training lies in the incorporation of a mask learning strategy. This enables full-modality model training even with partially complete data, such as bi-modality data of image-pose and image-depth pairs, thus significantly increasing the pool of available training data. Matrix3D demonstrates state-of-the-art performance in pose estimation and novel view synthesis tasks. Additionally, it offers fine-grained control through multi-round interactions, making it an innovative tool for 3D content creation. Project page: this https URL.
https://arxiv.org/abs/2502.07685
Perceiving the environment and its changes over time corresponds to two fundamental yet heterogeneous types of information: semantics and motion. Previous end-to-end autonomous driving works represent both types of information in a single feature vector. However, including motion tasks, such as prediction and planning, always impairs detection and tracking performance, a phenomenon known as negative transfer in multi-task learning. To address this issue, we propose Neural-Bayes motion decoding, a novel parallel detection, tracking, and prediction method separating semantic and motion learning, similar to the Bayes filter. Specifically, we employ a set of learned motion queries that operate in parallel with the detection and tracking queries, sharing a unified set of recursively updated reference points. Moreover, we employ interactive semantic decoding to enhance information exchange in semantic tasks, promoting positive transfer. Experiments on the nuScenes dataset show improvements of 5% in detection and 11% in tracking. Our method achieves state-of-the-art collision rates in open-loop planning evaluation without any modifications to the planning module.
https://arxiv.org/abs/2502.07631