We introduce a simple framework for predicting the behavior of an agent in multi-agent settings. In contrast to autoregressive (AR) tasks, such as language processing, our focus is on scenarios with multiple agents whose interactions are shaped by physical constraints and internal motivations. To this end, we propose Poly-Autoregressive (PAR) modeling, which forecasts an ego agent's future behavior by reasoning about the ego agent's state history and the past and current states of other interacting agents. At its core, PAR represents the behavior of all agents as a sequence of tokens, each representing an agent's state at a specific timestep. With minimal data pre-processing changes, we show that PAR can be applied to three different problems: human action forecasting in social situations, trajectory prediction for autonomous vehicles, and object pose forecasting during hand-object interaction. Using a small proof-of-concept transformer backbone, PAR outperforms AR across these three scenarios. The project website can be found at this https URL.
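As a hedged illustration of the core idea (each token is one agent's state at one timestep, and the ego agent's future is predicted from the full multi-agent history), the sketch below flattens agent states into a token sequence and runs a small transformer; the class name, state dimensionality, and interleaving scheme are assumptions for exposition, not the paper's exact implementation.

```python
# Toy poly-autoregressive (PAR) style predictor: attends over all agents'
# state tokens and regresses the ego agent's next state.
# All names, dimensions and the interleaving layout are illustrative assumptions.
import torch
import torch.nn as nn

class PARPredictor(nn.Module):
    def __init__(self, state_dim=4, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(state_dim, d_model)            # one token per agent-state
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, state_dim)             # ego agent's next state

    def forward(self, states):
        # states: (batch, T * num_agents, state_dim); agents are interleaved per
        # timestep so the last token is the ego agent's most recent state.
        ctx = self.encoder(self.embed(states))
        return self.head(ctx[:, -1])

# 2 agents, 10 timesteps, 4-D states (e.g., x, y, vx, vy).
pred = PARPredictor()(torch.randn(8, 10 * 2, 4))
print(pred.shape)  # torch.Size([8, 4])
```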
https://arxiv.org/abs/2502.08646
Task specification for robotic manipulation in open-world environments is challenging, requiring flexible and adaptive objectives that align with human intentions and can evolve through iterative feedback. We introduce Iterative Keypoint Reward (IKER), a visually grounded, Python-based reward function that serves as a dynamic task specification. Our framework leverages vision-language models (VLMs) to generate and refine these reward functions for multi-step manipulation tasks. Given RGB-D observations and free-form language instructions, we sample keypoints in the scene and generate a reward function conditioned on these keypoints. IKER operates on the spatial relationships between keypoints, leveraging commonsense priors about the desired behaviors and enabling precise SE(3) control. We reconstruct real-world scenes in simulation and use the generated rewards to train reinforcement learning (RL) policies, which are then deployed in the real world, forming a real-to-sim-to-real loop. Our approach demonstrates notable capabilities across diverse scenarios, including both prehensile and non-prehensile tasks, showcasing multi-step task execution, spontaneous error recovery, and on-the-fly strategy adjustments. The results highlight IKER's effectiveness in enabling robots to perform multi-step tasks in dynamic environments through iterative reward shaping.
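Since the abstract describes IKER as a Python reward function over scene keypoints, here is a hedged example of the kind of function a VLM might emit for an instruction like "place the block on the shelf"; the keypoint indices, weights, and thresholds are invented for illustration and are not taken from the paper.

```python
# Hypothetical VLM-generated reward over sampled scene keypoints.
import numpy as np

def reward(keypoints: np.ndarray) -> float:
    """keypoints: (N, 3) array of 3D keypoint positions; indices are assumed
    to be 0 = block top, 1 = shelf center (illustrative convention)."""
    block_top, shelf_center = keypoints[0], keypoints[1]
    xy_dist = np.linalg.norm(block_top[:2] - shelf_center[:2])  # horizontal alignment
    height_gap = abs(block_top[2] - shelf_center[2])            # vertical placement
    # Dense shaping: reward increases as the block approaches the shelf surface.
    return float(-xy_dist - 2.0 * height_gap)

kps = np.array([[0.42, 0.10, 0.30],   # block top
                [0.40, 0.12, 0.55]])  # shelf center
print(reward(kps))
```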
https://arxiv.org/abs/2502.08643
Recent advancements in large vision-language models have enabled highly expressive and diverse vector sketch generation. However, state-of-the-art methods rely on a time-consuming optimization process involving repeated feedback from a pretrained model to determine stroke placement. Consequently, despite producing impressive sketches, these methods are limited in practical applications. In this work, we introduce SwiftSketch, a diffusion model for image-conditioned vector sketch generation that can produce high-quality sketches in less than a second. SwiftSketch operates by progressively denoising stroke control points sampled from a Gaussian distribution. Its transformer-decoder architecture is designed to effectively handle the discrete nature of vector representation and capture the inherent global dependencies between strokes. To train SwiftSketch, we construct a synthetic dataset of image-sketch pairs, addressing the limitations of existing sketch datasets, which are often created by non-artists and lack professional quality. For generating these synthetic sketches, we introduce ControlSketch, a method that enhances SDS-based techniques by incorporating precise spatial control through a depth-aware ControlNet. We demonstrate that SwiftSketch generalizes across diverse concepts, efficiently producing sketches that combine high fidelity with a natural and visually appealing style.
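As a rough sketch of such a denoiser, the module below maps noisy stroke control points plus a diffusion timestep to a noise estimate; a plain self-attention stack stands in for SwiftSketch's image-conditioned transformer decoder, and all dimensions are assumptions.

```python
# Minimal stroke-denoiser sketch: self-attention over strokes captures the
# global dependencies between them (image conditioning omitted for brevity).
import torch
import torch.nn as nn

class StrokeDenoiser(nn.Module):
    def __init__(self, n_points=4, d_model=128, n_heads=4, n_layers=4):
        super().__init__()
        self.in_proj = nn.Linear(n_points * 2, d_model)   # (x, y) control points per stroke
        self.t_embed = nn.Linear(1, d_model)               # diffusion-timestep embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, n_points * 2)

    def forward(self, noisy_strokes, t):
        # noisy_strokes: (batch, n_strokes, n_points * 2); t: (batch, 1) in [0, 1]
        h = self.in_proj(noisy_strokes) + self.t_embed(t).unsqueeze(1)
        return self.out_proj(self.backbone(h))             # predicted noise per stroke

eps = StrokeDenoiser()(torch.randn(2, 32, 8), torch.rand(2, 1))
print(eps.shape)  # torch.Size([2, 32, 8])
```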
https://arxiv.org/abs/2502.08642
As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.
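A minimal sketch of the utility-function framing (not the paper's exact procedure): fit a scalar utility per outcome to independently sampled pairwise preferences with a Bradley-Terry-style model; structural coherence then shows up as how well a single utility vector explains all comparisons.

```python
# Fit scalar utilities to pairwise preference data (illustrative assumption:
# a simple logistic/Bradley-Terry model trained by gradient ascent).
import numpy as np

def fit_utilities(pairs, wins, n_items, lr=0.1, steps=2000):
    """pairs: list of (i, j); wins[k] = 1 if item i was preferred over j."""
    u = np.zeros(n_items)
    for _ in range(steps):
        grad = np.zeros(n_items)
        for (i, j), w in zip(pairs, wins):
            p = 1.0 / (1.0 + np.exp(-(u[i] - u[j])))   # modeled P(i preferred over j)
            grad[i] += w - p
            grad[j] -= w - p
        u += lr * grad / len(pairs)
    return u

pairs = [(0, 1), (1, 2), (0, 2), (2, 0)]
wins = [1, 1, 1, 0]                       # mostly transitive -> coherent utilities
print(fit_utilities(pairs, wins, n_items=3))
```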
https://arxiv.org/abs/2502.08640
In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with controllability comparable to that of professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and intuitive layout control over the rendered frames. To achieve this, CineMaster operates in two stages. In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware conditional signals by positioning object bounding boxes and defining camera movements within the 3D space. In the second stage, these control signals (rendered depth maps, camera trajectories, and object class labels) serve as guidance for a text-to-video diffusion model, ensuring that the generated video content matches the user's intent. Furthermore, to overcome the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, we carefully establish an automated data annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. Extensive qualitative and quantitative experiments demonstrate that CineMaster significantly outperforms existing methods and achieves prominent 3D-aware text-to-video generation. Project page: this https URL.
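To give a concrete picture of the second-stage conditioning bundle, here is a hypothetical container for the control signals listed above; the field names and shapes are invented for clarity and may not match the paper's actual interface.

```python
# Illustrative data structure for the 3D-aware control signals.
from dataclasses import dataclass
import numpy as np

@dataclass
class CineMasterControls:
    depth_maps: np.ndarray         # (T, H, W) rendered depth per frame
    camera_trajectory: np.ndarray  # (T, 4, 4) camera-to-world extrinsics per frame
    object_labels: list            # class label per placed 3D bounding box

controls = CineMasterControls(
    depth_maps=np.zeros((16, 64, 64)),
    camera_trajectory=np.tile(np.eye(4), (16, 1, 1)),
    object_labels=["car", "person"],
)
print(controls.depth_maps.shape, controls.object_labels)
```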
https://arxiv.org/abs/2502.08639
Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present PulseCheck457, a scalable and unbiased synthetic dataset designed around 4 key capabilities for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single-object recognition to our newly proposed complex 6D spatial reasoning tasks. We evaluated various LMMs on PulseCheck457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings.
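The precise RPDR definition is given in the paper; a hedged reading of "relative performance dropping" is the fraction of accuracy lost when moving from a simpler difficulty level to a harder one, for example:

```python
# Assumed form of a relative performance-dropping measure (the paper's exact
# formula may differ).
def relative_performance_drop(acc_simple: float, acc_hard: float) -> float:
    return (acc_simple - acc_hard) / acc_simple

# e.g., 85% accuracy on single-object recognition vs. 42% on 6D reasoning
print(relative_performance_drop(0.85, 0.42))  # ~0.51, i.e. a ~51% relative drop
```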
https://arxiv.org/abs/2502.08636
Purpose: To develop and validate a novel image reconstruction technique using implicit neural representations (INR) for multi-view thick-slice acquisitions while reducing the scan time but maintaining high signal-to-noise ratio (SNR). Methods: We propose Rotating-view super-resolution (ROVER)-MRI, an unsupervised neural network-based algorithm designed to reconstruct MRI data from multi-view thick slices, effectively reducing scan time by 2-fold while maintaining fine anatomical details. We compare our method to both bicubic interpolation and the current state-of-the-art regularized least-squares super-resolution reconstruction (LS-SRR) technique. Validation is performed using ground-truth ex-vivo monkey brain data, and we demonstrate superior reconstruction quality across several in-vivo human datasets. Notably, we achieve the reconstruction of a whole human brain in-vivo T2-weighted image with an unprecedented 180 μm isotropic spatial resolution, accomplished in just 17 minutes of scan time on a 7T MRI scanner. Results: ROVER-MRI outperformed the LS-SRR method in terms of reconstruction quality with 22.4% lower relative error (RE) and 7.5% lower full-width half maximum (FWHM), indicating better preservation of fine structural details in nearly half the scan time. Conclusion: ROVER-MRI offers an efficient and robust approach for mesoscale MR imaging, enabling rapid, high-resolution whole-brain scans. Its versatility holds great promise for research applications requiring anatomical details and time-efficient imaging.
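For readers unfamiliar with the reported metric, a relative error of this kind is typically a normalized difference to the ground-truth volume; assuming an L2-based definition (the paper's may differ in detail), it looks like:

```python
# Normalized L2 relative error between a reconstruction and ground truth
# (assumed definition for illustration).
import numpy as np

def relative_error(recon: np.ndarray, gt: np.ndarray) -> float:
    return float(np.linalg.norm(recon - gt) / np.linalg.norm(gt))

gt = np.random.rand(64, 64, 64)
recon = gt + 0.05 * np.random.randn(64, 64, 64)
print(relative_error(recon, gt))
```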
https://arxiv.org/abs/2502.08634
In this paper, we find that the complexity of interactions encoded by a deep neural network (DNN) can explain its generalization power. We also discover that the confusing samples of a DNN, which are represented by non-generalizable interactions, are determined by its low-layer parameters. In comparison, other factors, such as high-layer parameters and network architecture, have much less impact on the composition of confusing samples. Two DNNs with different low-layer parameters usually have entirely different sets of confusing samples, even though they achieve similar performance. This finding extends the understanding of the lottery ticket hypothesis and helps explain the distinctive representation power of different DNNs.
https://arxiv.org/abs/2502.08625
Recent advancements in image relighting models, driven by large-scale datasets and pre-trained diffusion models, have enabled the imposition of consistent lighting. However, video relighting still lags, primarily due to the excessive training costs and the scarcity of diverse, high-quality video relighting datasets. A simple application of image relighting models on a frame-by-frame basis leads to several issues: lighting source inconsistency and relighted appearance inconsistency, resulting in flickers in the generated videos. In this work, we propose Light-A-Video, a training-free approach to achieve temporally smooth video relighting. Adapted from image relighting models, Light-A-Video introduces two key techniques to enhance lighting consistency. First, we design a Consistent Light Attention (CLA) module, which enhances cross-frame interactions within the self-attention layers to stabilize the generation of the background lighting source. Second, leveraging the physical principle of light transport independence, we apply linear blending between the source video's appearance and the relighted appearance, using a Progressive Light Fusion (PLF) strategy to ensure smooth temporal transitions in illumination. Experiments show that Light-A-Video improves the temporal consistency of relighted video while maintaining the image quality, ensuring coherent lighting transitions across frames. Project page: this https URL.
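A minimal sketch of the linear-blending idea behind Progressive Light Fusion, assuming a simple monotone schedule (the paper's actual weighting and where the blend sits in the denoising loop may differ): the relit appearance is mixed into the source appearance with a weight that grows over the schedule, which keeps illumination transitions smooth.

```python
# Linear blend of source and relit appearance with a progressive weight.
import numpy as np

def progressive_light_fusion(source, relit, step, total_steps):
    w = step / max(total_steps - 1, 1)        # 0 -> 1 across the schedule (assumed)
    return (1.0 - w) * source + w * relit     # light-transport-independent blend

frame_src = np.random.rand(4, 4, 3)           # source-video appearance
frame_rel = np.random.rand(4, 4, 3)           # frame-wise relighted appearance
print(progressive_light_fusion(frame_src, frame_rel, step=5, total_steps=20).shape)
```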
https://arxiv.org/abs/2502.08590
Diffusion models for image generation have been a subject of increasing interest due to their ability to generate diverse, high-quality images. Image generation has immense potential in medical imaging because open-source medical images are difficult to obtain compared to natural images, especially for rare conditions. The generated images can be used later to train classification and segmentation models. In this paper, we propose simulating realistic ultrasound (US) images by successive fine-tuning of large diffusion models on different publicly available databases. To do so, we fine-tuned Stable Diffusion, a state-of-the-art latent diffusion model, on BUSI (Breast US Images), an ultrasound breast image dataset. We successfully generated high-quality US images of the breast using simple prompts that specify the organ and pathology, which appeared realistic to three experienced US scientists and a US radiologist. Additionally, we provided user control by conditioning the model with segmentations through ControlNet. We will release the source code at this http URL to enable fast US image generation for the scientific community.
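For a feel of how such a model would be used once released, below is a hypothetical inference sketch with Hugging Face diffusers; the checkpoint paths are placeholders for a Stable Diffusion model fine-tuned on BUSI and a segmentation-conditioned ControlNet, not the authors' published weights.

```python
# Hypothetical segmentation-conditioned ultrasound generation with diffusers.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "path/to/us-seg-controlnet", torch_dtype=torch.float16)          # placeholder path
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "path/to/sd-finetuned-busi", controlnet=controlnet,              # placeholder path
    torch_dtype=torch.float16).to("cuda")

mask = Image.open("lesion_mask.png")                  # user-provided segmentation
image = pipe("breast ultrasound image with a benign lesion",
             image=mask, num_inference_steps=30).images[0]
image.save("synthetic_us.png")
```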
https://arxiv.org/abs/2502.08580
With the advancement of artificial intelligence and computer vision technologies, multimodal emotion recognition has become a prominent research topic. However, existing methods face challenges such as heterogeneous data fusion and the effective utilization of modality correlations. This paper proposes a novel multimodal emotion recognition approach, DeepMSI-MER, based on the integration of contrastive learning and visual sequence compression. The proposed method enhances cross-modal feature fusion through contrastive learning and reduces redundancy in the visual modality by leveraging visual sequence compression. Experimental results on two public datasets, IEMOCAP and MELD, demonstrate that DeepMSI-MER significantly improves the accuracy and robustness of emotion recognition, validating the effectiveness of multimodal feature fusion and the proposed approach.
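As context for the contrastive component, here is a generic cross-modal InfoNCE-style objective of the kind such methods build on; the modality pairing, temperature, and symmetric form are standard choices, not necessarily DeepMSI-MER's exact loss.

```python
# Symmetric cross-modal contrastive loss (InfoNCE-style) between two modalities.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(emb_a, emb_b, temperature=0.07):
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature              # pairwise similarities
    targets = torch.arange(a.size(0))             # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

print(cross_modal_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128)))
```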
https://arxiv.org/abs/2502.08573
Recent advancements in Augmented Reality (AR) have demonstrated applications in architecture, design, and fabrication. Compared to conventional 2D construction drawings, AR can be used to superimpose contextual instructions, display 3D spatial information and enable on-site engagement. Despite the potential of AR, the widespread adoption of the technology in the industry is limited by its precision. Precision is important for projects requiring strict construction tolerances, design fidelity, and fabrication feedback. For example, the manufacturing of glulam beams requires tolerances of less than 2mm. The goal of this project is to explore the industrial application of using multiple fiducial markers for high-precision AR fabrication. While the method has been validated in lab settings with a precision of 0.97, this paper focuses on fabricating glulam beams in a factory setting with an industry manufacturer, Unalam Factory.
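As a minimal sketch of the multi-marker ingredient, the snippet below detects several ArUco fiducials with OpenCV (4.7+ API); the marker dictionary, IDs, and synthetic test frame are assumptions, and the project's actual calibration and pose-registration pipeline is considerably more involved.

```python
# Detect multiple fiducial markers; combining several detections is what lets
# the registration error drop below what a single marker can achieve.
import cv2
import numpy as np

dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)

# Synthetic test frame with two markers, standing in for a factory photo.
frame = np.full((400, 800), 255, dtype=np.uint8)
frame[100:300, 100:300] = cv2.aruco.generateImageMarker(dictionary, 7, 200)
frame[100:300, 500:700] = cv2.aruco.generateImageMarker(dictionary, 23, 200)

detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())
corners, ids, _ = detector.detectMarkers(frame)

# Image-space centers of each marker; with camera intrinsics these would feed
# a joint PnP solve to recover the rig pose at millimetre-level precision.
centers = [c.reshape(-1, 2).mean(axis=0) for c in corners]
print(ids.flatten(), np.round(centers, 1))
```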
https://arxiv.org/abs/2502.08566
The growing availability of longitudinal Magnetic Resonance Imaging (MRI) datasets has facilitated Artificial Intelligence (AI)-driven modeling of disease progression, making it possible to predict future medical scans for individual patients. However, despite significant advancements in AI, current methods continue to face challenges including achieving patient-specific individualization, ensuring spatiotemporal consistency, efficiently utilizing longitudinal data, and managing the substantial memory demands of 3D scans. To address these challenges, we propose Brain Latent Progression (BrLP), a novel spatiotemporal model designed to predict individual-level disease progression in 3D brain MRIs. The key contributions in BrLP are fourfold: (i) it operates in a small latent space, mitigating the computational challenges posed by high-dimensional imaging data; (ii) it explicitly integrates subject metadata to enhance the individualization of predictions; (iii) it incorporates prior knowledge of disease dynamics through an auxiliary model, facilitating the integration of longitudinal data; and (iv) it introduces the Latent Average Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in the predicted progression at inference time and (b) allows us to derive a measure of the uncertainty for the prediction. We train and evaluate BrLP on 11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its generalizability on an external test set comprising 2,257 MRIs from 962 subjects. Our experiments compare BrLP-generated MRI scans with real follow-up MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The code is publicly available at: this https URL.
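Point (iv) can be summarized with a short sketch: repeat the stochastic latent prediction several times, average the results for a stabilized estimate, and use their spread as an uncertainty proxy (the function names and the use of the standard deviation are illustrative assumptions).

```python
# Sketch of the Latent Average Stabilization (LAS) idea.
import torch

def latent_average_stabilization(sample_fn, n_runs=10):
    """sample_fn() -> one stochastic latent prediction for the target timepoint."""
    latents = torch.stack([sample_fn() for _ in range(n_runs)])
    return latents.mean(dim=0), latents.std(dim=0)  # stabilized latent, uncertainty map

mean_latent, uncertainty = latent_average_stabilization(lambda: torch.randn(4, 16, 16))
print(mean_latent.shape, float(uncertainty.mean()))
```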
https://arxiv.org/abs/2502.08560
Human understanding and generation are critical for modeling digital humans and humanoid embodiments. Recently, Human-centric Foundation Models (HcFMs), inspired by the success of generalist models such as large language and vision models, have emerged to unify diverse human-centric tasks into a single framework, surpassing traditional task-specific approaches. In this survey, we present a comprehensive overview of HcFMs by proposing a taxonomy that categorizes current approaches into four groups: (1) Human-centric Perception Foundation Models that capture fine-grained features for multi-modal 2D and 3D understanding; (2) Human-centric AIGC Foundation Models that generate high-fidelity, diverse human-related content; (3) Unified Perception and Generation Models that integrate these capabilities to enhance both human understanding and synthesis; and (4) Human-centric Agentic Foundation Models that extend beyond perception and generation to learn human-like intelligence and interactive behaviors for humanoid embodied tasks. We review state-of-the-art techniques and discuss emerging challenges and future research directions. This survey aims to serve as a roadmap for researchers and practitioners working towards more robust, versatile, and intelligent digital human and embodiment modeling.
https://arxiv.org/abs/2502.08556
Model-based clustering techniques have been widely applied to various application areas, while most studies focus on canonical mixtures with a unique component distribution form. However, this strict assumption is often hard to satisfy. In this paper, we consider the more flexible Copula-Based Mixture Models (CBMMs) for clustering, which allow heterogeneous component distributions composed of flexible choices of marginal and copula forms. More specifically, we propose an adaptation of the Generalized Iterative Conditional Estimation (GICE) algorithm to identify the CBMMs in an unsupervised manner, where the marginal and copula forms and their parameters are estimated iteratively. GICE is adapted from its original version, developed for switching Markov model identification, with the choice of realization time. Our CBMM-GICE clustering method is then tested on synthetic two-cluster data (N=2000 samples) with a discussion of the factors impacting its convergence. Finally, it is compared to Expectation-Maximization-identified mixture models with a unique component form on the entire MNIST database (N=70000) and on real cardiac magnetic resonance data (N=276) to illustrate its value for imaging applications.
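To make the "flexible marginals plus copula" construction concrete, here is a hedged sketch of a two-component CBMM density with Gaussian copulas and arbitrary scipy marginals; it only illustrates the model class and says nothing about the GICE estimation procedure itself.

```python
# Bivariate copula-based mixture density: each component pairs its own
# marginals through a Gaussian copula (illustrative sketch).
import numpy as np
from scipy.stats import norm, gamma, multivariate_normal

def gaussian_copula_density(u, v, rho):
    z1, z2 = norm.ppf(u), norm.ppf(v)
    cov = [[1.0, rho], [rho, 1.0]]
    joint = multivariate_normal(mean=[0.0, 0.0], cov=cov).pdf(np.column_stack([z1, z2]))
    return joint / (norm.pdf(z1) * norm.pdf(z2))

def cbmm_density(x, y, weights, marginals, rhos):
    dens = 0.0
    for w, (mx, my), rho in zip(weights, marginals, rhos):
        dens += w * gaussian_copula_density(mx.cdf(x), my.cdf(y), rho) * mx.pdf(x) * my.pdf(y)
    return dens

# Component 1: two Gaussians with strong dependence; component 2: Gaussian + gamma.
comps = [(norm(0, 1), norm(0, 1)), (norm(3, 1), gamma(2.0))]
print(cbmm_density(np.array([0.1, 3.2]), np.array([0.5, 1.5]),
                   weights=[0.5, 0.5], marginals=comps, rhos=[0.8, 0.2]))
```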
https://arxiv.org/abs/2502.08549
Video Moment Retrieval is a common task to evaluate the performance of visual-language models: it involves localising start and end times of moments in videos from query sentences. The current task formulation assumes that the queried moment is present in the video, resulting in false positive moment predictions when irrelevant query sentences are provided. In this paper we propose the task of Negative-Aware Video Moment Retrieval (NA-VMR), which considers both moment retrieval accuracy and negative query rejection accuracy. We make the distinction between In-Domain and Out-of-Domain negative queries and provide new evaluation benchmarks for two popular video moment retrieval datasets: QVHighlights and Charades-STA. We analyse the ability of current SOTA video moment retrieval approaches to adapt to Negative-Aware Video Moment Retrieval and propose UniVTG-NA, an adaptation of UniVTG designed to tackle NA-VMR. UniVTG-NA achieves high negative rejection accuracy (avg. 98.4%) scores while retaining moment retrieval scores to within 3.87% Recall@1. Dataset splits and code are available at this https URL.
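For clarity on the two numbers reported above, here is a hedged sketch of the evaluation quantities as the task defines them in spirit (the benchmark's exact implementation, IoU threshold, and averaging may differ): Recall@1 over positive queries and rejection accuracy over negative queries.

```python
# Illustrative metrics for Negative-Aware Video Moment Retrieval.
def temporal_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, iou_threshold=0.5):
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

def rejection_accuracy(reject_flags):
    """reject_flags: model's reject decision for each negative (irrelevant) query."""
    return sum(reject_flags) / len(reject_flags)

print(recall_at_1([(2.0, 7.5)], [(2.5, 8.0)]),   # 1.0 (temporal IoU ~0.83)
      rejection_accuracy([True, True, False]))    # ~0.67
```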
https://arxiv.org/abs/2502.08544
Image quality assessment (IQA) represents a pivotal challenge in image-focused technologies, significantly influencing the advancement trajectory of image processing and computer vision. Recently, IQA has witnessed a notable surge in innovative research efforts, driven by the emergence of novel architectural paradigms and sophisticated computational techniques. This survey delivers an extensive analysis of contemporary IQA methodologies, organized according to their application scenarios, serving as a beneficial reference for both beginners and experienced researchers. We analyze the advantages and limitations of current approaches and suggest potential future research pathways. The survey encompasses both general and specific IQA methodologies, including conventional statistical measures, machine learning techniques, and cutting-edge deep learning models such as convolutional neural networks (CNNs) and Transformer models. The analysis within this survey highlights the necessity for distortion-specific IQA methods tailored to various application scenarios, emphasizing the significance of practicality, interpretability, and ease of implementation in future developments.
https://arxiv.org/abs/2502.08540
The properties of black holes and accretion flows can be inferred by fitting Event Horizon Telescope (EHT) data to simulated images generated through general relativistic ray tracing (GRRT). However, due to the computationally intensive nature of GRRT, the efficiency of generating specific radiation flux images needs to be improved. This paper introduces the Branch Correction Denoising Diffusion Model (BCDDM), which uses a branch correction mechanism and a weighted mixed loss function to improve the accuracy of generated black hole images based on seven physical parameters of the radiatively inefficient accretion flow (RIAF) model. Our experiments show a strong correlation between the generated images and their physical parameters. By enhancing the GRRT dataset with BCDDM-generated images and using ResNet50 for parameter regression, we achieve significant improvements in parameter prediction performance. This approach reduces computational costs and provides a faster, more efficient method for dataset expansion, parameter estimation, and model fitting.
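The regression stage can be pictured with a short sketch: a standard ResNet50 whose classifier is replaced by a 7-output head for the RIAF physical parameters (the details below are assumptions, not the paper's exact setup).

```python
# ResNet50 regressor for the seven RIAF parameters (illustrative sketch).
import torch
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(weights=None)
model.fc = nn.Linear(model.fc.in_features, 7)    # 7 physical parameters

images = torch.randn(4, 3, 224, 224)             # (BCDDM-augmented) black-hole images
params = model(images)                           # predicted parameters
loss = nn.functional.mse_loss(params, torch.randn(4, 7))
print(params.shape, loss.item())
```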
https://arxiv.org/abs/2502.08528
Referring Remote Sensing Image Segmentation (RRSIS) is critical for ecological monitoring, urban planning, and disaster management, requiring precise segmentation of objects in remote sensing imagery guided by textual descriptions. This task is uniquely challenging due to the considerable vision-language gap, the high spatial resolution and broad coverage of remote sensing imagery with diverse categories and small targets, and the presence of clustered, unclear targets with blurred edges. To tackle these issues, we propose a novel framework designed to bridge the vision-language gap, enhance multi-scale feature interaction, and improve fine-grained object differentiation. Specifically, our framework introduces: (1) the Bidirectional Spatial Correlation (BSC) for improved vision-language feature alignment, (2) the Target-Background TwinStream Decoder (T-BTD) for precise distinction between targets and non-targets, and (3) the Dual-Modal Object Learning Strategy (D-MOLS) for robust multimodal feature reconstruction. Extensive experiments on the benchmark datasets RefSegRS and RRSIS-D demonstrate that our framework achieves state-of-the-art performance. Specifically, it improves the overall IoU (oIoU) by 3.76 percentage points (80.57) and 1.44 percentage points (79.23) on the two datasets, respectively. Additionally, it outperforms previous methods in the mean IoU (mIoU) by 5.37 percentage points (67.95) and 1.84 percentage points (66.04), effectively addressing the core challenges of RRSIS with enhanced precision and robustness.
https://arxiv.org/abs/2502.08486
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, the limited labeled multimodal data often hinders embedding performance. Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck. In this work, we identify three criteria for high-quality synthetic multimodal data. First, broad scope ensures that the generated data covers diverse tasks and modalities, making it applicable to various downstream scenarios. Second, robust cross-modal alignment makes different modalities semantically consistent. Third, high fidelity ensures that the synthetic data maintains realistic details to enhance its reliability. Guided by these principles, we synthesize datasets that: (1) cover a wide range of tasks, modality combinations, and languages, (2) are generated via a deep thinking process within a single pass of a multimodal large language model, and (3) incorporate real-world images with accurate and relevant texts, ensuring fidelity through self-evaluation and refinement. Leveraging these high-quality synthetic and labeled datasets, we train a multimodal multilingual E5 model mmE5. Extensive experiments demonstrate that mmE5 achieves state-of-the-art performance on the MMEB Benchmark and superior multilingual performance on the XTD benchmark. Our codes, datasets and models are released at this https URL.
https://arxiv.org/abs/2502.08468