We introduce a simple framework for predicting the behavior of an agent in multi-agent settings. In contrast to autoregressive (AR) tasks, such as language processing, our focus is on scenarios with multiple agents whose interactions are shaped by physical constraints and internal motivations. To this end, we propose Poly-Autoregressive (PAR) modeling, which forecasts an ego agent's future behavior by reasoning about the ego agent's state history and the past and current states of other interacting agents. At its core, PAR represents the behavior of all agents as a sequence of tokens, each representing an agent's state at a specific timestep. With minimal data pre-processing changes, we show that PAR can be applied to three different problems: human action forecasting in social situations, trajectory prediction for autonomous vehicles, and object pose forecasting during hand-object interaction. Using a small proof-of-concept transformer backbone, PAR outperforms AR across these three scenarios. The project website can be found at this https URL.
https://arxiv.org/abs/2502.08646
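Reading the abstract above, the core mechanism is to flatten every agent's per-timestep state into one interleaved token stream and train a causal transformer on it, so that predicting the ego agent's next state automatically conditions on its own history plus the other agents' past and current states. A minimal sketch of that idea (the toy tokenization and all names are ours, not the paper's code):

```python
import torch
import torch.nn as nn

class TinyPAR(nn.Module):
    """Decoder-only transformer over interleaved agent-state tokens.

    At timestep t the sequence holds one token per agent, so a causal
    model sees the ego agent's history *and* the other agents' past and
    current states before predicting the ego agent's next state.
    (Positional / agent-ID embeddings are omitted for brevity.)
    """
    def __init__(self, vocab_size=256, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                       # tokens: (B, T*n_agents)
        x = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(x, mask=mask)
        return self.head(h)                          # next-token logits

# Interleave states so position t*n_agents + a holds agent a's state at time t.
states = torch.randint(0, 256, (1, 10, 3))           # (B, T, agents) toy tokens
seq = states.flatten(1)                              # (B, T*agents)
logits = TinyPAR()(seq[:, :-1])
loss = nn.functional.cross_entropy(logits.transpose(1, 2), seq[:, 1:])
```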
Real-world data collection for robotics is costly and resource-intensive, requiring skilled operators and expensive hardware. Simulations offer a scalable alternative but often fail to achieve sim-to-real generalization due to geometric and visual gaps. To address these challenges, we propose RE$^3$SIM, a 3D-photorealistic real-to-sim system that closes the geometric and visual sim-to-real gaps. RE$^3$SIM employs advanced 3D reconstruction and neural rendering techniques to faithfully recreate real-world scenarios, enabling real-time rendering of simulated cross-view cameras within a physics-based simulator. By utilizing privileged information to collect expert demonstrations efficiently in simulation and training robot policies with imitation learning, we validate the effectiveness of the real-to-sim-to-real pipeline across various manipulation task scenarios. Notably, with only simulated data, we can achieve zero-shot sim-to-real transfer with an average success rate exceeding 58%. To push the limit of real-to-sim, we further generate a large-scale simulation dataset, demonstrating how a robust policy that generalizes across various objects can be built from simulation data. Codes and demos are available at: this http URL.
https://arxiv.org/abs/2502.08645
As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.
https://arxiv.org/abs/2502.08640
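The utility-function framing above can be made concrete: sample pairwise preferences from a model, fit a utility function to them, and check how much of the raw choices the fitted utilities explain. A hedged illustration using a Bradley-Terry fit (one standard choice; the paper's exact procedure may differ):

```python
import numpy as np

def fit_utilities(pairs, n_items, lr=0.1, steps=2000):
    """Fit Bradley-Terry utilities u so P(i preferred over j) = sigmoid(u_i - u_j).

    `pairs` is a list of (i, j) outcomes meaning "i was preferred over j",
    e.g. independently sampled preference judgments from an LLM.
    """
    u = np.zeros(n_items)
    for _ in range(steps):
        grad = np.zeros(n_items)
        for i, j in pairs:
            p = 1.0 / (1.0 + np.exp(-(u[i] - u[j])))
            grad[i] += 1 - p      # push winner up
            grad[j] -= 1 - p      # push loser down
        u += lr * grad / len(pairs)
    return u - u.mean()           # utilities are identified only up to a shift

# Coherence check: how often does the fitted utility reproduce the raw choices?
pairs = [(0, 1), (1, 2), (0, 2), (0, 1), (1, 2)]
u = fit_utilities(pairs, n_items=3)
agreement = np.mean([u[i] > u[j] for i, j in pairs])
print(u, agreement)
```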
In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with controllability comparable to that of professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and intuitive layout control over the rendered frames. To achieve this, CineMaster operates in two stages. In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware conditional signals by positioning object bounding boxes and defining camera movements within the 3D space. In the second stage, these control signals--comprising rendered depth maps, camera trajectories and object class labels--serve as the guidance for a text-to-video diffusion model, ensuring that the generated video matches the user's intent. Furthermore, to overcome the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, we carefully establish an automated data annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. Extensive qualitative and quantitative experiments demonstrate that CineMaster significantly outperforms existing methods and delivers strong 3D-aware text-to-video generation. Project page: this https URL.
https://arxiv.org/abs/2502.08639
Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present PulseCheck457, a scalable and unbiased synthetic dataset designed around 4 key capabilities for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single-object recognition to our newly proposed complex 6D spatial reasoning tasks. We evaluated various LMMs on PulseCheck457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings.
https://arxiv.org/abs/2502.08636
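The abstract does not spell out the RPDR formula, so the following is only one plausible reading: the drop in accuracy from a simple difficulty level to a complex one, normalized by the simple-level accuracy.

```python
def rpdr(acc_simple, acc_complex):
    """Relative Performance Dropping Rate (illustrative definition only).

    Measures how much accuracy falls when moving from a simple task level
    to a more complex one, relative to the simple-level accuracy. The
    paper's exact definition may differ; this is our reading.
    """
    return (acc_simple - acc_complex) / acc_simple

# e.g. 0.85 accuracy on single-object recognition vs 0.34 on 6D reasoning
print(f"RPDR = {rpdr(0.85, 0.34):.2%}")  # 60.00%
```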
Purpose: To develop and validate a novel image reconstruction technique using implicit neural representations (INR) for multi-view thick-slice acquisitions, reducing scan time while maintaining a high signal-to-noise ratio (SNR). Methods: We propose Rotating-view super-resolution (ROVER)-MRI, an unsupervised neural network-based algorithm designed to reconstruct MRI data from multi-view thick slices, effectively reducing scan time 2-fold while maintaining fine anatomical details. We compare our method to both bicubic interpolation and the current state-of-the-art regularized least-squares super-resolution reconstruction (LS-SRR) technique. Validation is performed using ground-truth ex-vivo monkey brain data, and we demonstrate superior reconstruction quality across several in-vivo human datasets. Notably, we achieve the reconstruction of a whole in-vivo human brain T2-weighted image with an unprecedented 180 μm isotropic spatial resolution, accomplished in just 17 minutes of scan time on a 7T MRI scanner. Results: ROVER-MRI outperformed the LS-SRR method in terms of reconstruction quality, with 22.4% lower relative error (RE) and 7.5% lower full-width at half-maximum (FWHM), indicating better preservation of fine structural details in nearly half the scan time. Conclusion: ROVER-MRI offers an efficient and robust approach for mesoscale MR imaging, enabling rapid, high-resolution whole-brain scans. Its versatility holds great promise for research applications requiring anatomical detail and time-efficient imaging.
https://arxiv.org/abs/2502.08634
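The INR at the heart of ROVER-MRI can be pictured as a coordinate MLP representing the continuous volume, fit so that averaging it across a slice's thickness reproduces each rotated thick-slice view. A minimal sketch under that assumption (not the authors' implementation):

```python
import torch
import torch.nn as nn

class INR(nn.Module):
    """Coordinate MLP: (x, y, z) -> intensity, the implicit volume."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, xyz):
        return self.net(xyz)

def thick_slice_forward(inr, centers, normal, thickness, n_samples=8):
    """Model a thick-slice measurement as the mean of the implicit volume
    over sample points spread across the slice thickness."""
    offsets = torch.linspace(-0.5, 0.5, n_samples) * thickness
    preds = [inr(centers + o * normal) for o in offsets]
    return torch.stack(preds).mean(0)

inr = INR()
opt = torch.optim.Adam(inr.parameters(), lr=1e-3)
centers = torch.rand(64, 3)                  # voxel centers of one view
normal = torch.tensor([0.0, 0.0, 1.0])       # this view's slice direction
target = torch.rand(64, 1)                   # measured thick-slice values
loss = nn.functional.mse_loss(
    thick_slice_forward(inr, centers, normal, thickness=0.1), target)
loss.backward()
opt.step()
```

Fitting one network jointly to all rotated views is what lets the shared fine detail emerge at a resolution finer than any single acquisition.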
Recent advancements in image relighting models, driven by large-scale datasets and pre-trained diffusion models, have enabled the imposition of consistent lighting. However, video relighting still lags, primarily due to excessive training costs and the scarcity of diverse, high-quality video relighting datasets. Applying image relighting models on a frame-by-frame basis leads to several issues: inconsistent lighting sources and inconsistent relighted appearance, resulting in flicker in the generated videos. In this work, we propose Light-A-Video, a training-free approach to achieve temporally smooth video relighting. Adapted from image relighting models, Light-A-Video introduces two key techniques to enhance lighting consistency. First, we design a Consistent Light Attention (CLA) module, which enhances cross-frame interactions within the self-attention layers to stabilize the generation of the background lighting source. Second, leveraging the physical principle of light transport independence, we apply linear blending between the source video's appearance and the relighted appearance, using a Progressive Light Fusion (PLF) strategy to ensure smooth temporal transitions in illumination. Experiments show that Light-A-Video improves the temporal consistency of relighted videos while maintaining image quality, ensuring coherent lighting transitions across frames. Project page: this https URL.
https://arxiv.org/abs/2502.08590
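The PLF step described above is, at its core, a linear blend between the source appearance and the relighted appearance, with a weight that ramps up progressively so illumination shifts gradually instead of flickering. A toy sketch of our reading of it:

```python
import numpy as np

def progressive_light_fusion(source, relit, step, total_steps):
    """Linearly blend source and relighted appearance with a weight that
    grows over the schedule, so lighting shifts in gradually (our reading
    of the PLF strategy; the paper applies this inside the denoising loop)."""
    w = (step + 1) / total_steps           # 0 -> 1 across the schedule
    return (1 - w) * source + w * relit

frames_src = np.random.rand(16, 64, 64, 3)   # toy source video
frames_rel = np.random.rand(16, 64, 64, 3)   # frame-wise relighted video
fused = [progressive_light_fusion(frames_src, frames_rel, s, 10)
         for s in range(10)]
```

The light-transport-independence principle is what licenses treating the final appearance as a linear combination of the two lighting conditions.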
Diffusion models for image generation have been a subject of increasing interest due to their ability to generate diverse, high-quality images. Image generation has immense potential in medical imaging because open-source medical images are difficult to obtain compared to natural images, especially for rare conditions. The generated images can later be used to train classification and segmentation models. In this paper, we propose simulating realistic ultrasound (US) images by successive fine-tuning of large diffusion models on different publicly available databases. To do so, we fine-tuned Stable Diffusion, a state-of-the-art latent diffusion model, on BUSI (Breast US Images), an ultrasound breast image dataset. We successfully generated high-quality US images of the breast using simple prompts that specify the organ and pathology, which appeared realistic to three experienced US scientists and a US radiologist. Additionally, we provided user control by conditioning the model with segmentations through ControlNet. We will release the source code at this http URL to allow fast US image generation for the scientific community.
https://arxiv.org/abs/2502.08580
With the advancement of artificial intelligence and computer vision technologies, multimodal emotion recognition has become a prominent research topic. However, existing methods face challenges such as heterogeneous data fusion and the effective utilization of modality correlations. This paper proposes a novel multimodal emotion recognition approach, DeepMSI-MER, based on the integration of contrastive learning and visual sequence compression. The proposed method enhances cross-modal feature fusion through contrastive learning and reduces redundancy in the visual modality by leveraging visual sequence compression. Experimental results on two public datasets, IEMOCAP and MELD, demonstrate that DeepMSI-MER significantly improves the accuracy and robustness of emotion recognition, validating the effectiveness of multimodal feature fusion and the proposed approach.
https://arxiv.org/abs/2502.08573
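The abstract does not detail the contrastive objective; a common choice for cross-modal feature fusion is a symmetric InfoNCE loss between paired modality embeddings, sketched below under that assumption (not necessarily DeepMSI-MER's exact loss):

```python
import torch
import torch.nn.functional as F

def infonce(a, b, tau=0.07):
    """Symmetric InfoNCE between two modality embeddings of the same
    utterances: matched pairs are positives, the rest of the batch are
    negatives. A standard recipe for pulling cross-modal features together."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    labels = torch.arange(a.size(0))
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2

text_emb = torch.randn(8, 256)    # e.g. text-branch features
video_emb = torch.randn(8, 256)   # e.g. compressed visual-sequence features
loss = infonce(text_emb, video_emb)
```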
Recent advancements in Augmented Reality (AR) have demonstrated applications in architecture, design, and fabrication. Compared to conventional 2D construction drawings, AR can be used to superimpose contextual instructions, display 3D spatial information, and enable on-site engagement. Despite the potential of AR, its widespread adoption in the industry is limited by its precision. Precision is important for projects requiring strict construction tolerances, design fidelity, and fabrication feedback. For example, the manufacturing of glulam beams requires tolerances of less than 2mm. The goal of this project is to explore the industrial application of multiple fiducial markers for high-precision AR fabrication. While the method has been validated in lab settings with a precision of 0.97, this paper focuses on fabricating glulam beams in a factory setting with an industry manufacturer, Unalam Factory.
https://arxiv.org/abs/2502.08566
An emerging research direction in NMT involves the use of Quality Estimation (QE) models, which have demonstrated high correlations with human judgment and can enhance translations through Quality-Aware Decoding. Although several approaches have been proposed based on sampling multiple candidate translations, none have integrated these models directly into the decoding process. In this paper, we address this by proposing a novel token-level QE model capable of reliably scoring partial translations. We build a uni-directional QE model for this, as decoder models are inherently trained on and efficient for partial sequences. We then present a decoding strategy that integrates the QE model for Quality-Aware decoding and demonstrate that translation quality improves when compared to N-best list re-ranking with state-of-the-art QE models (up to $1.39$ XCOMET-XXL $\uparrow$). Finally, we show that our approach provides significant benefits in document translation tasks, where the quality of N-best lists is typically suboptimal.
https://arxiv.org/abs/2502.08561
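Integrating a token-level QE model into decoding can be pictured as re-scoring each partial hypothesis with a weighted sum of model log-probability and the QE score of the prefix. A hedged beam-search sketch with stand-in scorers (the paper's exact integration may differ):

```python
import heapq

def qe_aware_beam_step(hyps, expand, qe_score, beam_size=4, lam=0.5):
    """One beam-search step ranking partial hypotheses by
    log P(y|x) + lam * QE(partial translation) instead of log-prob alone.
    `expand` yields (next_token, logp) continuations and `qe_score` is the
    uni-directional token-level QE model; both are stand-ins here."""
    candidates = []
    for tokens, logp in hyps:
        for tok, lp in expand(tokens):
            new = tokens + [tok]
            score = (logp + lp) + lam * qe_score(new)
            candidates.append((score, new, logp + lp))
    best = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return [(toks, lp) for _, toks, lp in best]

# toy stand-ins so the sketch runs
def expand(tokens):        # pretend top-2 continuations with log-probs
    return [(len(tokens) + 1, -0.1), (len(tokens) + 2, -0.3)]
def qe_score(tokens):      # pretend QE model scoring the prefix
    return -0.01 * sum(tokens)

hyps = [([0], 0.0)]
for _ in range(3):
    hyps = qe_aware_beam_step(hyps, expand, qe_score)
print(hyps[0])
```

The uni-directional design matters here: because the QE model is decoder-style, scoring each growing prefix reuses cached computation rather than re-encoding the whole hypothesis.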
The growing availability of longitudinal Magnetic Resonance Imaging (MRI) datasets has facilitated Artificial Intelligence (AI)-driven modeling of disease progression, making it possible to predict future medical scans for individual patients. However, despite significant advancements in AI, current methods continue to face challenges including achieving patient-specific individualization, ensuring spatiotemporal consistency, efficiently utilizing longitudinal data, and managing the substantial memory demands of 3D scans. To address these challenges, we propose Brain Latent Progression (BrLP), a novel spatiotemporal model designed to predict individual-level disease progression in 3D brain MRIs. The key contributions in BrLP are fourfold: (i) it operates in a small latent space, mitigating the computational challenges posed by high-dimensional imaging data; (ii) it explicitly integrates subject metadata to enhance the individualization of predictions; (iii) it incorporates prior knowledge of disease dynamics through an auxiliary model, facilitating the integration of longitudinal data; and (iv) it introduces the Latent Average Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in the predicted progression at inference time and (b) allows us to derive a measure of the uncertainty for the prediction. We train and evaluate BrLP on 11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its generalizability on an external test set comprising 2,257 MRIs from 962 subjects. Our experiments compare BrLP-generated MRI scans with real follow-up MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The code is publicly available at: this https URL.
https://arxiv.org/abs/2502.08560
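The LAS algorithm, as described, averages several stochastic latent-space predictions at inference time and reads the spread as uncertainty. A toy sketch of that reading (not the authors' code):

```python
import torch

def latent_average_stabilization(predict_latent, z0, n_runs=8):
    """Run the (stochastic) progression model several times from the same
    starting latent, average the predicted latents for a stabilized
    trajectory, and use the per-dimension spread as an uncertainty
    estimate (our reading of LAS)."""
    samples = torch.stack([predict_latent(z0) for _ in range(n_runs)])
    return samples.mean(0), samples.std(0)

# toy stand-in for the learned latent progression model
predict_latent = lambda z: z + 0.1 + 0.05 * torch.randn_like(z)
z0 = torch.zeros(4)                 # current scan's latent code
z_next, uncertainty = latent_average_stabilization(predict_latent, z0)
```

Working in the small latent space is what makes repeating the prediction several times per scan affordable.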
Model-based clustering techniques have been widely applied to various application areas, while most studies focus on canonical mixtures in which all components share a single distributional form. However, this strict assumption is often hard to satisfy. In this paper, we consider the more flexible Copula-Based Mixture Models (CBMMs) for clustering, which allow heterogeneous component distributions composed through flexible choices of marginal and copula forms. More specifically, we propose an adaptation of the Generalized Iterative Conditional Estimation (GICE) algorithm to identify the CBMMs in an unsupervised manner, where the marginal and copula forms and their parameters are estimated iteratively. GICE is adapted from its original version, developed for switching Markov model identification, with the choice of realization time. Our CBMM-GICE clustering method is then tested on synthetic two-cluster data (N=2000 samples) with a discussion of the factors impacting its convergence. Finally, it is compared to mixture models with a single component form identified by Expectation Maximization, on the entire MNIST database (N=70000) and on real cardiac magnetic resonance data (N=276), to illustrate its value for imaging applications.
https://arxiv.org/abs/2502.08549
Video Moment Retrieval is a common task for evaluating the performance of vision-language models - it involves localising start and end times of moments in videos from query sentences. The current task formulation assumes that the queried moment is present in the video, resulting in false positive moment predictions when irrelevant query sentences are provided. In this paper we propose the task of Negative-Aware Video Moment Retrieval (NA-VMR), which considers both moment retrieval accuracy and negative query rejection accuracy. We make the distinction between In-Domain and Out-of-Domain negative queries and provide new evaluation benchmarks for two popular video moment retrieval datasets: QVHighlights and Charades-STA. We analyse the ability of current SOTA video moment retrieval approaches to adapt to Negative-Aware Video Moment Retrieval and propose UniVTG-NA, an adaptation of UniVTG designed to tackle NA-VMR. UniVTG-NA achieves high negative rejection accuracy (avg. $98.4\%$) while retaining moment retrieval scores to within $3.87\%$ Recall@1. Dataset splits and code are available at this https URL
https://arxiv.org/abs/2502.08544
Next token prediction has been the standard training objective used in large language model pretraining. Representations are learned as a result of optimizing for token-level perplexity. We propose Continuous Concept Mixing (CoCoMix), a novel pretraining framework that combines discrete next token prediction with continuous concepts. Specifically, CoCoMix predicts continuous concepts learned from a pretrained sparse autoencoder and mixes them into the model's hidden state by interleaving with token hidden representations. Through experiments on multiple benchmarks, including language modeling and downstream reasoning tasks, we show that CoCoMix is more sample efficient and consistently outperforms standard next token prediction, knowledge distillation and inserting pause tokens. We find that combining both concept learning and interleaving in an end-to-end framework is critical to performance gains. Furthermore, CoCoMix enhances interpretability and steerability by allowing direct inspection and modification of the predicted concept, offering a transparent way to guide the model's internal reasoning process.
https://arxiv.org/abs/2502.08524
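A minimal sketch of the CoCoMix mechanism as we read it: a head predicts continuous concepts (supervised by a pretrained sparse autoencoder's activations), and the projected concept vectors are interleaved with the token hidden representations. Names and shapes are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class CoCoMixLayer(nn.Module):
    """Predict continuous concepts from the hidden state, then interleave
    the projected concept vectors with the token hidden states so later
    layers attend over both (our sketch of CoCoMix)."""
    def __init__(self, d_model=64, n_concepts=512):
        super().__init__()
        self.concept_head = nn.Linear(d_model, n_concepts)  # predict SAE concepts
        self.concept_proj = nn.Linear(n_concepts, d_model)  # back to model dim

    def forward(self, h, target_concepts=None):
        c = self.concept_head(h)                          # (B, T, n_concepts)
        mixed = torch.stack([h, self.concept_proj(c)], dim=2)
        mixed = mixed.flatten(1, 2)                       # tok, concept, tok, ...
        aux = (nn.functional.mse_loss(c, target_concepts)
               if target_concepts is not None else None)  # SAE supervision
        return mixed, aux                                 # (B, 2T, d), aux loss

layer = CoCoMixLayer()
h = torch.randn(2, 10, 64)                        # token hidden states
sae_acts = torch.relu(torch.randn(2, 10, 512))    # stand-in SAE concept targets
mixed, aux = layer(h, sae_acts)                   # feed `mixed` to later layers
```

Because `c` is an explicit, named bottleneck, it can be inspected or edited at inference time, which is the source of the steerability claim.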
Faithfulness evaluators based on large language models (LLMs) are often fooled by the fluency of the text and struggle with identifying errors in the summaries. We propose an approach to summary faithfulness evaluation in which multiple LLM-based agents are assigned initial stances (regardless of what their belief might be) and forced to come up with a reason to justify the imposed belief, thus engaging in a multi-round debate to reach an agreement. The uniformly distributed initial assignments result in a greater diversity of stances, leading to more meaningful debates and ultimately more errors identified. Furthermore, by analyzing recent faithfulness evaluation datasets, we observe that a summary is not always clearly either faithful or unfaithful to the source document. We therefore introduce a new dimension, ambiguity, and a detailed taxonomy to identify such special cases. Experiments demonstrate that our approach can help identify ambiguities, and performs even more strongly on non-ambiguous summaries.
https://arxiv.org/abs/2502.08514
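The debate protocol above can be sketched as follows; `call_llm` is a stand-in for any chat-model call, and the prompts are our paraphrase, not the paper's:

```python
import random

def debate_faithfulness(summary, document, call_llm, n_agents=4, rounds=3):
    """Assign agents uniformly distributed initial stances (faithful /
    unfaithful) regardless of their 'belief', have each justify its
    imposed stance while seeing the others' arguments, then read off the
    final verdicts."""
    stances = ["faithful" if i % 2 == 0 else "unfaithful"
               for i in range(n_agents)]            # uniform initial assignment
    transcript = []
    for _ in range(rounds):
        for i, stance in enumerate(stances):
            prompt = (f"Document: {document}\nSummary: {summary}\n"
                      f"Debate so far: {transcript}\n"
                      f"You must argue the summary is {stance}. "
                      f"Give your strongest reason, or concede.")
            transcript.append((i, call_llm(prompt)))
    verdicts = [call_llm(f"Given the debate {transcript}, final verdict "
                         f"(faithful/unfaithful/ambiguous)?")
                for _ in range(n_agents)]
    return max(set(verdicts), key=verdicts.count)   # majority agreement

# toy stand-in so the sketch runs without an API
mock_llm = lambda prompt: random.choice(["faithful", "unfaithful"])
print(debate_faithfulness("...", "...", mock_llm))
```

Forcing half the agents to defend "unfaithful" is what breaks the fluency bias: someone is always searching for an error.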
Grammatical error correction (GEC) aims to correct grammatical, spelling, and semantic errors in natural language text. With the growth of large language models (LLMs), direct text generation has gradually become the focus of GEC methods, and few-shot in-context learning presents a cost-effective solution. However, selecting effective in-context examples remains challenging, as the similarity between input texts does not necessarily correspond to similar grammatical error patterns. In this paper, we propose a novel retrieval method based on natural language grammatical error explanations (GEE) to address this issue. Our method retrieves suitable few-shot demonstrations by matching the GEE of the test input with that of pre-constructed database samples, where explanations for erroneous samples are generated by LLMs. We conducted multilingual GEC few-shot experiments on both major open-source and closed-source LLMs. Experiments across five languages show that our method outperforms existing semantic and BM25-based retrieval techniques without requiring additional training or language adaptation. This also suggests that matching error patterns is key to selecting examples.
https://arxiv.org/abs/2502.08507
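Operationally, GEE-based retrieval means: generate an error explanation for the test input, embed it, and select the database samples whose explanations are most similar, so demonstrations match on error patterns rather than surface text. A sketch with a toy embedding stand-in (any sentence embedder would do in practice):

```python
import numpy as np

def retrieve_by_gee(test_gee, db_gees, db_examples, embed, k=4):
    """Pick few-shot demonstrations whose LLM-generated grammatical error
    explanations (GEE) are most similar to the test input's GEE.
    `embed` is any sentence-embedding function (stand-in here)."""
    q = embed(test_gee)
    sims = [float(q @ embed(g)) /
            (np.linalg.norm(q) * np.linalg.norm(embed(g)))
            for g in db_gees]
    top = np.argsort(sims)[-k:][::-1]
    return [db_examples[i] for i in top]

# toy embedding stand-in: hash words into a bag-of-words vector
def embed(text, dim=64):
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1
    return v

db_gees = ["subject-verb agreement error", "wrong preposition after verb"]
db_examples = [("She go home.", "She goes home."),
               ("He depends of her.", "He depends on her.")]
print(retrieve_by_gee("verb does not agree with subject",
                      db_gees, db_examples, embed, k=1))
```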
Referring Remote Sensing Image Segmentation (RRSIS) is critical for ecological monitoring, urban planning, and disaster management, requiring precise segmentation of objects in remote sensing imagery guided by textual descriptions. This task is uniquely challenging due to the considerable vision-language gap, the high spatial resolution and broad coverage of remote sensing imagery with diverse categories and small targets, and the presence of clustered, unclear targets with blurred edges. To tackle these issues, we propose \ours, a novel framework designed to bridge the vision-language gap, enhance multi-scale feature interaction, and improve fine-grained object differentiation. Specifically, \ours introduces: (1) the Bidirectional Spatial Correlation (BSC) for improved vision-language feature alignment, (2) the Target-Background TwinStream Decoder (T-BTD) for precise distinction between targets and non-targets, and (3) the Dual-Modal Object Learning Strategy (D-MOLS) for robust multimodal feature reconstruction. Extensive experiments on the benchmark datasets RefSegRS and RRSIS-D demonstrate that \ours achieves state-of-the-art performance. Specifically, \ours improves the overall IoU (oIoU) by 3.76 percentage points (80.57) and 1.44 percentage points (79.23) on the two datasets, respectively. Additionally, it outperforms previous methods in the mean IoU (mIoU) by 5.37 percentage points (67.95) and 1.84 percentage points (66.04), effectively addressing the core challenges of RRSIS with enhanced precision and robustness.
https://arxiv.org/abs/2502.08486
Chain-of-Thought (CoT) prompting has emerged as a powerful technique for enhancing language models' reasoning capabilities. However, generating long and correct CoT trajectories is challenging. Recent studies have demonstrated that Looped Transformers possess remarkable length generalization capabilities, but their limited generality and adaptability prevent them from serving as an alternative to auto-regressive solutions. To better leverage the strengths of Looped Transformers, we propose RELAY (REasoning through Loop Alignment iterativelY). Specifically, we align the steps of Chain-of-Thought (CoT) reasoning with loop iterations and apply intermediate supervision during the training of Looped Transformers. This additional iteration-wise supervision not only preserves the Looped Transformer's ability for length generalization but also enables it to predict CoT reasoning steps for unseen data. Therefore, we leverage this Looped Transformer to generate accurate reasoning chains for complex problems that exceed the training length, which are then used to fine-tune an auto-regressive model. We conduct extensive experiments, and the results demonstrate the effectiveness of our approach, with significant improvements in the performance of the auto-regressive model. Code will be released at this https URL.
https://arxiv.org/abs/2502.08482
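The iteration-wise supervision described above can be sketched as a weight-tied block applied repeatedly, with iteration t trained against CoT step t. A toy version (a GRU cell stands in for the looped transformer block; all names are ours):

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """One weight-tied block applied repeatedly; with iteration-wise
    supervision, iteration t is trained to produce CoT step t (our
    sketch of the RELAY alignment, not the authors' code)."""
    def __init__(self, d=64, vocab=100):
        super().__init__()
        self.step = nn.GRUCell(d, d)       # stand-in for a transformer block
        self.readout = nn.Linear(d, vocab)

    def forward(self, h, x):
        return self.step(x, h)

d, vocab, T = 64, 100, 5
block = LoopedBlock(d, vocab)
x = torch.randn(8, d)                       # encoded problem
cot_steps = torch.randint(0, vocab, (T, 8)) # tokenized CoT step labels
h = torch.zeros(8, d)
loss = 0.0
for t in range(T):                          # loop iteration == CoT step
    h = block(h, x)
    loss = loss + nn.functional.cross_entropy(block.readout(h), cot_steps[t])
loss.backward()
```

Because the block is weight-tied, running more iterations than seen in training is well-defined, which is how chains longer than the training length are produced for fine-tuning the auto-regressive model.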
Simultaneously grasping and transporting multiple objects can significantly enhance robotic work efficiency and has been a key research focus for decades. The primary challenge lies in determining how to push objects, group them, and execute simultaneous grasping for respective groups while considering object distribution and the hardware constraints of the robot. Traditional rule-based methods struggle to flexibly adapt to diverse scenarios. To address this challenge, this paper proposes an imitation learning-based approach. We collect a series of expert demonstrations through teleoperation and train a diffusion policy network, enabling the robot to dynamically generate action sequences for pushing, grouping, and grasping, thereby facilitating efficient multi-object grasping and transportation. We conducted experiments to evaluate the method under different training dataset sizes, varying object quantities, and real-world object scenarios. The results demonstrate that the proposed approach can effectively and adaptively generate multi-object grouping and grasping strategies. With the support of more training data, imitation learning is expected to be an effective approach for solving the multi-object grasping problem.
https://arxiv.org/abs/2502.08452
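A diffusion policy of the kind described is typically trained DDPM-style: corrupt demonstrated action sequences with noise and teach a network, conditioned on the observation, to predict that noise. A minimal sketch under that standard recipe (not the authors' code; dimensions are illustrative):

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Predicts the noise added to an action sequence, conditioned on the
    observation; a minimal stand-in for a diffusion policy network."""
    def __init__(self, obs_dim=32, act_dim=7, horizon=16, hidden=256):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim + horizon * act_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * act_dim))

    def forward(self, obs, noisy_actions, t):
        flat = noisy_actions.flatten(1)
        out = self.net(torch.cat([obs, flat, t[:, None].float()], dim=-1))
        return out.view(-1, self.horizon, self.act_dim)

# one DDPM-style training step on teleoperated demonstrations
model = NoisePredictor()
obs = torch.randn(64, 32)              # e.g. object-layout features
actions = torch.randn(64, 16, 7)       # demonstrated push/group/grasp actions
T_steps = 100
betas = torch.linspace(1e-4, 0.02, T_steps)
alpha_bar = torch.cumprod(1 - betas, dim=0)
t = torch.randint(0, T_steps, (64,))
eps = torch.randn_like(actions)
ab = alpha_bar[t].view(-1, 1, 1)
noisy = ab.sqrt() * actions + (1 - ab).sqrt() * eps
loss = nn.functional.mse_loss(model(obs, noisy, t), eps)
loss.backward()
```

At deployment, actions are generated by iteratively denoising from Gaussian noise conditioned on the current observation, which is what lets the policy produce multi-step push-group-grasp sequences rather than single reactive actions.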