The computer vision community has developed numerous techniques for digitally restoring true scene information from single-view degraded photographs, an important yet extremely ill-posed task. In this work, we tackle image restoration from a different perspective by jointly denoising multiple photographs of the same scene. Our core hypothesis is that degraded images capturing a shared scene contain complementary information that, when combined, better constrains the restoration problem. To this end, we implement a powerful multi-view diffusion model that jointly generates uncorrupted views by extracting rich information from multi-view relationships. Our experiments show that our multi-view approach outperforms existing single-view image and even video-based methods on image deblurring and super-resolution tasks. Critically, our model is trained to output 3D consistent images, making it a promising tool for applications requiring robust multi-view integration, such as 3D reconstruction or pose estimation.
https://arxiv.org/abs/2503.14463
Purpose: Accurate 3D MRI-ultrasound (US) deformable registration is critical for real-time guidance in high-dose-rate (HDR) prostate brachytherapy. We present a weakly supervised spatial implicit neural representation (SINR) method to address modality differences and pelvic anatomy challenges. Methods: The framework uses sparse surface supervision from MRI/US segmentations instead of dense intensity matching. SINR models deformations as continuous spatial functions, with patient-specific surface priors guiding a stationary velocity field for biologically plausible deformations. Validation included 20 public Prostate-MRI-US-Biopsy cases and 10 institutional HDR cases, evaluated via Dice similarity coefficient (DSC), mean surface distance (MSD), and 95% Hausdorff distance (HD95). Results: The proposed method achieved robust registration. For the public dataset, prostate DSC was $0.93 \pm 0.05$, MSD $0.87 \pm 0.10$ mm, and HD95 $1.58 \pm 0.37$ mm. For the institutional dataset, prostate CTV achieved DSC $0.88 \pm 0.09$, MSD $1.21 \pm 0.38$ mm, and HD95 $2.09 \pm 1.48$ mm. Bladder and rectum performance was lower due to ultrasound's limited field of view. Visual assessments confirmed accurate alignment with minimal discrepancies. Conclusion: This study introduces a novel weakly supervised SINR-based approach for 3D MRI-US deformable registration. By leveraging sparse surface supervision and spatial priors, it achieves accurate, robust, and computationally efficient registration, enhancing real-time image guidance in HDR prostate brachytherapy and improving treatment precision.
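The three reported metrics (DSC, MSD, HD95) are standard surface-overlap measures; the sketch below shows one way to compute them for a warped MRI prostate mask against the corresponding US mask. It is an illustrative reimplementation, not the authors' evaluation code, and the mask arrays and voxel spacing are assumptions.

```python
# Illustrative metric computation (assumed, not the paper's code): DSC, MSD and
# HD95 between two binary 3D masks, e.g. the MRI prostate mask warped by the
# predicted deformation and the US prostate mask.
import numpy as np
from scipy import ndimage
from scipy.spatial import cKDTree

def dice(a, b):
    """DSC = 2|A ∩ B| / (|A| + |B|) for two boolean volumes."""
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def surface_points(mask, spacing):
    """Physical coordinates (mm) of the boundary voxels of a binary mask."""
    mask = mask.astype(bool)
    boundary = mask & ~ndimage.binary_erosion(mask)
    return np.argwhere(boundary) * np.asarray(spacing)

def msd_hd95(a, b, spacing=(1.0, 1.0, 1.0)):
    """Mean surface distance and 95% Hausdorff distance (symmetric)."""
    pa, pb = surface_points(a, spacing), surface_points(b, spacing)
    d = np.concatenate([cKDTree(pb).query(pa)[0], cKDTree(pa).query(pb)[0]])
    return d.mean(), np.percentile(d, 95)
```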
https://arxiv.org/abs/2503.14395
Multi-map sparse monocular visual Simultaneous Localization and Mapping applied to monocular endoscopic sequences has proven effective at robustly recovering tracking after the frequent losses in endoscopy caused by motion blur, temporal occlusion, tool interaction, or water jets. The sparse multi-maps are adequate for robust camera localization, but they are a poor representation of the environment: they are noisy, contain a high percentage of inaccurately reconstructed 3D points, including significant outliers, and, more importantly, have an unacceptably low density for clinical applications. We propose a method to remove outliers and densify the maps of the state-of-the-art sparse endoscopy multi-map CudaSIFT-SLAM. Dense up-to-scale depth predictions from the LightDepth neural network are aligned with the sparse CudaSIFT submaps by means of LMedS, which is robust to spurious matches. Our system mitigates the inherent scale ambiguity in monocular depth estimation while filtering outliers, leading to reliable densified 3D maps. We provide experimental evidence of accurate densified maps (4.15 mm RMS accuracy) at affordable computing time on the C3VD phantom colon dataset, and we report qualitative results on real colonoscopies from the Endomapper dataset.
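A minimal sketch of the robust scale-alignment step as described (assumed details, not the released implementation): candidate scales between the up-to-scale LightDepth predictions and the sparse CudaSIFT map-point depths are scored by the median of their residuals, and the scale with the smallest median is kept.

```python
# Sketch (assumed details, not the authors' implementation) of LMedS scale
# alignment: hypothesise a scale from a single sparse correspondence, score it
# by the median absolute residual over all sparse points, keep the best one.
import numpy as np

def lmeds_scale(pred_depth, sparse_depth, n_trials=500, seed=0):
    """pred_depth, sparse_depth: 1D arrays of the dense-prediction depth and the
    SLAM map-point depth sampled at the same pixels."""
    rng = np.random.default_rng(seed)
    best_scale, best_med = 1.0, np.inf
    for _ in range(n_trials):
        i = rng.integers(len(pred_depth))            # minimal sample: one point
        s = sparse_depth[i] / pred_depth[i]          # candidate scale hypothesis
        med = np.median(np.abs(s * pred_depth - sparse_depth))
        if med < best_med:                           # LMedS: smallest median residual
            best_scale, best_med = s, med
    return best_scale

# Inliers can then be selected by thresholding residuals under best_scale before
# fusing the scaled dense depth into the submap.
```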
https://arxiv.org/abs/2503.14346
Leveraging the training-by-pruning paradigm introduced by Zhou et al., Isik et al. introduced a federated learning protocol that achieves a 34-fold reduction in communication cost. We achieve compression improvements of orders of magnitude over the state of the art. The central idea of our framework is to encode the network weights $\vec w$ by a vector of trainable parameters $\vec p$, such that $\vec w = Q\cdot \vec p$, where $Q$ is a carefully generated sparse random matrix that remains fixed throughout training. In this framework, the previous work of Zhou et al. [NeurIPS'19] is recovered when $Q$ is diagonal and $\vec p$ has the same dimension as $\vec w$. We instead show that $\vec p$ can effectively be chosen much smaller than $\vec w$, retaining the same accuracy at the price of a decrease in the sparsity of $Q$. Since the server and clients only need to share $\vec p$, this trade-off leads to a substantial improvement in communication cost. Moreover, we provide theoretical insight into our framework and establish a novel link between training-by-sampling and random convex geometry.
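A small sketch of the central reparameterization, under the assumption of a simple ±1 sparse random construction for $Q$ (the paper's exact generation procedure may differ): only $\vec p$ is trainable and communicated, while $Q$ is generated once from a shared seed and kept fixed.

```python
# Minimal sketch of the reparameterisation (the exact construction of Q in the
# paper may differ; a simple +/-1 sparse random matrix is used here as an assumption).
import numpy as np
from scipy import sparse

def make_Q(dim_w, dim_p, nnz_per_row=4, seed=0):
    """Fixed sparse random matrix Q of shape (dim_w, dim_p), shared via the seed."""
    rng = np.random.default_rng(seed)
    rows = np.repeat(np.arange(dim_w), nnz_per_row)
    cols = rng.integers(0, dim_p, size=dim_w * nnz_per_row)
    vals = rng.choice([-1.0, 1.0], size=dim_w * nnz_per_row) / np.sqrt(nnz_per_row)
    return sparse.csr_matrix((vals, (rows, cols)), shape=(dim_w, dim_p))

dim_w, dim_p = 100_000, 2_000      # p is much smaller than w
Q = make_Q(dim_w, dim_p)           # generated once, kept fixed during training
p = np.zeros(dim_p)                # the only trainable / communicated parameters
w = Q @ p                          # materialised network weights
# Zhou et al. [NeurIPS'19] is recovered when dim_p == dim_w and Q is diagonal.
```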
https://arxiv.org/abs/2503.14246
Recent advances in Neural Radiance Fields (NeRF) have shown great potential in 3D reconstruction and novel view synthesis, particularly for indoor and small-scale scenes. However, extending NeRF to large-scale outdoor environments presents challenges such as transient objects, sparse cameras and textures, and varying lighting conditions. In this paper, we propose a segmentation-guided enhancement to NeRF for outdoor street scenes, focusing on complex urban environments. Our approach extends ZipNeRF and utilizes Grounded SAM for segmentation mask generation, enabling effective handling of transient objects, modeling of the sky, and regularization of the ground. We also introduce appearance embeddings to adapt to inconsistent lighting across view sequences. Experimental results demonstrate that our method outperforms the baseline ZipNeRF, improving novel view synthesis quality with fewer artifacts and sharper details.
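As one illustration of how a segmentation mask can drive the reconstruction, the sketch below (an assumption about the general mechanism, not the paper's code) drops pixels labeled as transient objects from the photometric loss so that moving cars or pedestrians do not corrupt the radiance field.

```python
# Hedged sketch (assumed mechanism, not the paper's code): pixels that Grounded
# SAM labels as transient classes are excluded from the photometric loss so that
# moving objects do not leave artifacts in the reconstructed street scene.
import torch

def masked_photometric_loss(pred_rgb, gt_rgb, transient_mask):
    """pred_rgb, gt_rgb: (N, 3) colours of the sampled rays;
    transient_mask: (N,) bool, True where the pixel belongs to a transient class."""
    keep = ~transient_mask
    return ((pred_rgb[keep] - gt_rgb[keep]) ** 2).mean()
```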
https://arxiv.org/abs/2503.14219
This paper presents RoGSplat, a novel approach for synthesizing high-fidelity novel views of unseen humans from sparse multi-view images, while requiring no cumbersome per-subject optimization. Unlike previous methods, which typically struggle with sparse views that have little overlap and are less effective at reconstructing complex human geometry, the proposed method enables robust reconstruction in such challenging conditions. Our key idea is to lift SMPL vertices to dense and reliable 3D prior points representing accurate human body geometry, and then regress human Gaussian parameters based on those points. To account for possible misalignment between the SMPL model and the images, we propose to predict image-aligned 3D prior points by leveraging both pixel-level and voxel-level features, from which we regress the coarse Gaussians. To enhance the ability to capture high-frequency details, we further render depth maps from the coarse 3D Gaussians to help regress fine-grained pixel-wise Gaussians. Experiments on several benchmark datasets demonstrate that our method outperforms state-of-the-art methods in novel view synthesis and cross-dataset generalization. Our code is available at this https URL.
https://arxiv.org/abs/2503.14198
End-to-end autonomous driving unifies tasks in a differentiable framework, enabling planning-oriented optimization and attracting growing attention. Current methods aggregate historical information either through dense historical bird's-eye-view (BEV) features or by querying a sparse memory bank, following paradigms inherited from detection. However, we argue that these paradigms either omit historical information in motion planning or fail to align with its multi-step nature, which requires predicting or planning multiple future time steps. In line with the philosophy that the future is a continuation of the past, we propose BridgeAD, which reformulates motion and planning queries as multi-step queries to differentiate the queries for each future time step. This design enables the effective use of historical prediction and planning by applying them to the appropriate parts of the end-to-end system based on the time steps, which improves both perception and motion planning. Specifically, historical queries for the current frame are combined with perception, while queries for future frames are integrated with motion planning. In this way, we bridge the gap between past and future by aggregating historical insights at every time step, enhancing the overall coherence and accuracy of the end-to-end autonomous driving pipeline. Extensive experiments on the nuScenes dataset in both open-loop and closed-loop settings demonstrate that BridgeAD achieves state-of-the-art performance.
https://arxiv.org/abs/2503.14182
While foundation models have revolutionised computer vision, their effectiveness for sketch understanding remains limited by the unique challenges of abstract, sparse visual inputs. Through systematic analysis, we uncover two fundamental limitations: Stable Diffusion (SD) struggles to extract meaningful features from abstract sketches (unlike its success with photos), and exhibits a pronounced frequency-domain bias that suppresses essential low-frequency components needed for sketch understanding. Rather than costly retraining, we address these limitations by strategically combining SD with CLIP, whose strong semantic understanding naturally compensates for SD's spatial-frequency biases. By dynamically injecting CLIP features into SD's denoising process and adaptively aggregating features across semantic levels, our method achieves state-of-the-art performance in sketch retrieval (+3.35%), recognition (+1.06%), segmentation (+29.42%), and correspondence learning (+21.22%), demonstrating the first truly universal sketch feature representation in the era of foundation models.
https://arxiv.org/abs/2503.14129
Existing 3D Human Pose Estimation (HPE) methods achieve high accuracy but suffer from computational overhead and slow inference, while knowledge distillation methods fail to address spatial relationships between joints and temporal correlations in multi-frame inputs. In this paper, we propose Sparse Correlation and Joint Distillation (SCJD), a novel framework that balances efficiency and accuracy for 3D HPE. SCJD introduces Sparse Correlation Input Sequence Downsampling to reduce redundancy in student network inputs while preserving inter-frame correlations. For effective knowledge transfer, we propose Dynamic Joint Spatial Attention Distillation, which includes Dynamic Joint Embedding Distillation to enhance the student's feature representation using the teacher's multi-frame context feature, and Adjacent Joint Attention Distillation to improve the student network's focus on adjacent joint relationships for better spatial understanding. Additionally, Temporal Consistency Distillation aligns the temporal correlations between teacher and student networks through upsampling and global supervision. Extensive experiments demonstrate that SCJD achieves state-of-the-art performance. Code is available at this https URL.
https://arxiv.org/abs/2503.14097
Accurate traffic flow estimation and prediction are critical for the efficient management of transportation systems, particularly under increasing urbanization. Traditional methods relying on static sensors often suffer from limited spatial coverage, while probe vehicles provide richer, albeit sparse and irregular, data. This work introduces ON-Traffic, a novel deep operator network and receding-horizon learning-based framework tailored for online estimation of the spatio-temporal traffic state, along with quantified uncertainty, from measurements of moving probe vehicles and downstream boundary inputs. Our framework is evaluated on both numerical and simulation datasets, showcasing its ability to handle irregular, sparse input data, adapt to time-shifted scenarios, and provide well-calibrated uncertainty estimates. The results demonstrate that the model captures complex traffic phenomena, including shockwaves and congestion propagation, while maintaining robustness to noise and sensor dropout. These advancements present a significant step toward online, adaptive traffic management systems.
https://arxiv.org/abs/2503.14053
Visual localization is considered one of the crucial components of many robotic and vision systems. While state-of-the-art methods that rely on feature matching have proven accurate for visual localization, their storage and compute requirements are a burden. Scene coordinate regression (SCR) is an alternative approach that removes the storage barrier by learning to map 2D pixels to 3D scene coordinates. Most popular SCR methods use a Convolutional Neural Network (CNN) to extract 2D descriptors, which, we argue, misses the spatial relationships between pixels. Inspired by the success of the vision transformer architecture, we present a new SCR architecture called A-ScoRe, an attention-based model that leverages attention at the descriptor-map level to produce meaningful, highly semantic 2D descriptors. Since the operation is performed on the descriptor map, our model can work with multiple data modalities, whether dense or sparse, from depth maps and SLAM to Structure-from-Motion (SfM). This versatility allows A-ScoRe to operate in different kinds of environments and conditions and to achieve the level of flexibility that is important for mobile robots. Results show our method achieves performance comparable to state-of-the-art methods on multiple benchmarks while being lightweight and much more flexible. Code and pre-trained models are public in our repository: this https URL.
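The attention-over-descriptor-map idea can be pictured with a generic self-attention layer; the sketch below is a plain multi-head attention over the flattened CNN descriptor map and only approximates the spirit of A-ScoRe, whose exact architecture and dimensions are defined in the paper.

```python
# Hedged sketch (generic form, not the A-ScoRe architecture itself): multi-head
# self-attention across all cells of a descriptor map so that each 2D descriptor
# aggregates spatial context before scene-coordinate regression.
import torch
import torch.nn as nn

class DescriptorMapAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, desc_map: torch.Tensor) -> torch.Tensor:
        """desc_map: (B, C, H, W) descriptor map from the backbone."""
        B, C, H, W = desc_map.shape
        tokens = desc_map.flatten(2).transpose(1, 2)     # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)       # spatial context mixing
        return out.transpose(1, 2).reshape(B, C, H, W)

refined = DescriptorMapAttention()(torch.randn(1, 256, 30, 40))
```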
https://arxiv.org/abs/2503.13982
Recent advances in motion diffusion models have led to remarkable progress in diverse motion generation tasks, including text-to-motion synthesis. However, existing approaches represent motions as dense frame sequences, requiring the model to process redundant or less informative frames. The processing of dense animation frames imposes significant training complexity, especially when learning intricate distributions of large motion datasets even with modern neural architectures. This severely limits the performance of generative motion models for downstream tasks. Inspired by professional animators who mainly focus on sparse keyframes, we propose a novel diffusion framework explicitly designed around sparse and geometrically meaningful keyframes. Our method reduces computation by masking non-keyframes and efficiently interpolating missing frames. We dynamically refine the keyframe mask during inference to prioritize informative frames in later diffusion steps. Extensive experiments show that our approach consistently outperforms state-of-the-art methods in text alignment and motion realism, while also effectively maintaining high performance at significantly fewer diffusion steps. We further validate the robustness of our framework by using it as a generative prior and adapting it to different downstream tasks. Source code and pre-trained models will be released upon acceptance.
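A toy sketch of the keyframe masking and interpolation idea (a simplification for illustration, not the diffusion framework itself): non-keyframes of a motion sequence are discarded and reconstructed by linear interpolation between the retained keyframes.

```python
# Illustrative sketch (assumed form of the masking/interpolation step, not the
# authors' code): keep only sparse keyframes of a motion sequence and fill the
# masked frames by linear interpolation between neighbouring keyframes.
import numpy as np

def interpolate_from_keyframes(motion, key_idx):
    """motion: (T, D) per-frame pose features; key_idx: sorted indices of kept keyframes."""
    T, D = motion.shape
    filled = np.empty_like(motion)
    for d in range(D):
        filled[:, d] = np.interp(np.arange(T), key_idx, motion[key_idx, d])
    return filled

motion = np.random.randn(120, 66)        # e.g. 120 frames of 22 joints x 3 coordinates
key_idx = np.arange(0, 120, 10)          # ~12 sparse keyframes
dense_guess = interpolate_from_keyframes(motion, key_idx)
```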
https://arxiv.org/abs/2503.13859
Spread through air spaces (STAS) represents a newly identified aggressive pattern in lung cancer, known to be associated with adverse prognostic factors and complex pathological features. Pathologists currently rely on time-consuming manual assessments, which are highly subjective and prone to variation. This highlights the urgent need for automated and precise diagnostic solutions. We collected 2,970 lung cancer tissue slides from multiple centers, re-diagnosed them, and constructed and publicly released three lung cancer STAS datasets: STAS CSU (hospital), STAS TCGA, and STAS CPTAC. All STAS datasets provide corresponding pathological feature diagnoses and related clinical data. To address the biased, sparse, and heterogeneous nature of STAS, we propose a scale-aware multiple instance learning (SMILE) method for STAS diagnosis of lung cancer. By introducing a scale-adaptive attention mechanism, SMILE can adaptively adjust high-attention instances, reducing over-reliance on local regions and promoting consistent detection of STAS lesions. Extensive experiments show that SMILE achieved competitive diagnostic results on STAS CSU, diagnosing 251 and 319 STAS samples in CPTAC and TCGA, respectively, surpassing the clinical average AUC. The 11 open baseline results are the first to be established for STAS research, laying the foundation for the future expansion, interpretability, and clinical integration of computational pathology technologies. The datasets and code are available at this https URL.
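For context, a standard attention-based MIL pooling baseline is sketched below; this is the generic mechanism SMILE builds on, not its scale-adaptive attention, and the feature dimension and classifier head are illustrative assumptions.

```python
# Generic attention-based MIL pooling (in the spirit of Ilse et al., *not* SMILE's
# scale-adaptive attention): patch-level features from one whole-slide image are
# weighted by learned attention scores and pooled into a slide-level STAS logit.
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden: int = 128):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, instances: torch.Tensor) -> torch.Tensor:
        """instances: (N_patches, feat_dim) features from one slide."""
        a = torch.softmax(self.attn(instances), dim=0)   # (N, 1) attention weights
        slide_feat = (a * instances).sum(dim=0)          # attention-weighted pooling
        return self.classifier(slide_feat)               # slide-level STAS logit

logit = AttentionMIL()(torch.randn(300, 512))
```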
https://arxiv.org/abs/2503.13799
Recently, patch-deformation methods have exhibited significant effectiveness in multi-view stereo owing to the deformable and expandable patches in reconstructing textureless areas. However, such methods primarily emphasize broadening the receptive field in textureless areas, while neglecting deformation instability caused by easily overlooked edge-skipping, potentially leading to matching distortions. To address this, we propose SED-MVS, which adopts panoptic segmentation and multi-trajectory diffusion strategy for segmentation-driven and edge-aligned patch deformation. Specifically, to prevent unanticipated edge-skipping, we first employ SAM2 for panoptic segmentation as depth-edge guidance to guide patch deformation, followed by multi-trajectory diffusion strategy to ensure patches are comprehensively aligned with depth edges. Moreover, to avoid potential inaccuracy of random initialization, we combine both sparse points from LoFTR and monocular depth map from DepthAnything V2 to restore reliable and realistic depth map for initialization and supervised guidance. Finally, we integrate segmentation image with monocular depth map to exploit inter-instance occlusion relationship, then further regard them as occlusion map to implement two distinct edge constraint, thereby facilitating occlusion-aware patch deformation. Extensive results on ETH3D, Tanks & Temples, BlendedMVS and Strecha datasets validate the state-of-the-art performance and robust generalization capability of our proposed method.
https://arxiv.org/abs/2503.13721
Photo-realistic rendering and novel view synthesis play a crucial role in human-computer interaction tasks, from gaming to path planning. Neural Radiance Fields (NeRFs) model scenes as continuous volumetric functions and achieve remarkable rendering quality. However, NeRFs often struggle in large, low-textured areas, producing cloudy artifacts known as ''floaters'' that reduce scene realism, especially in indoor environments with featureless architectural surfaces like walls, ceilings, and floors. To overcome this limitation, prior work has integrated geometric constraints into the NeRF pipeline, typically leveraging depth information derived from Structure from Motion or Multi-View Stereo. Yet, conventional RGB-feature correspondence methods face challenges in accurately estimating depth in textureless regions, leading to unreliable constraints. This challenge is further complicated in 360-degree ''inside-out'' views, where sparse visual overlap between adjacent images further hinders depth estimation. In order to address these issues, we propose an efficient and robust method for computing dense depth priors, specifically tailored for large low-textured architectural surfaces in indoor environments. We introduce a novel depth loss function to enhance rendering quality in these challenging, low-feature regions, while complementary depth-patch regularization further refines depth consistency across other areas. Experiments with Instant-NGP on two synthetic 360-degree indoor scenes demonstrate improved visual fidelity with our method compared to standard photometric loss and Mean Squared Error depth supervision.
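The paper's depth loss is its own contribution; purely as an illustration of the general shape of such supervision, the sketch below combines an L1 term tying rendered depth to the dense prior on low-textured pixels with a simple patch-variance regularizer, both of which are assumptions rather than the proposed loss.

```python
# Purely illustrative (not the paper's loss): an L1 term tying rendered depth to
# the dense depth prior on low-textured pixels, plus a patch-variance regulariser
# that favours locally consistent depth elsewhere.
import torch

def depth_supervision(rendered_d, prior_d, low_texture_mask, patch=8, lam=0.1):
    """rendered_d, prior_d: (H, W) depth maps; low_texture_mask: (H, W) bool."""
    prior_term = (rendered_d - prior_d).abs()[low_texture_mask].mean()
    H, W = rendered_d.shape
    crop = rendered_d[: H - H % patch, : W - W % patch]
    tiles = (crop.reshape(H // patch, patch, W // patch, patch)
                 .permute(0, 2, 1, 3)
                 .reshape(-1, patch * patch))
    patch_term = tiles.var(dim=1).mean()   # flat architectural surfaces -> low variance
    return prior_term + lam * patch_term
```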
https://arxiv.org/abs/2503.13710
Efficient timing in ride-matching is crucial for improving the performance of ride-hailing and ride-pooling services, as it determines the number of drivers and passengers considered in each matching process. Traditional batched matching methods often use fixed time intervals to accumulate ride requests before assigning matches. While this approach increases the number of available drivers and passengers for matching, it fails to adapt to real-time supply-demand fluctuations, often leading to longer passenger wait times and driver idle periods. To address this limitation, we propose an adaptive ride-matching strategy using deep reinforcement learning (RL) to dynamically determine when to perform matches based on real-time system conditions. Unlike fixed-interval approaches, our method continuously evaluates system states and executes matching at moments that minimize total passenger wait time. Additionally, we incorporate a potential-based reward shaping (PBRS) mechanism to mitigate sparse rewards, accelerating RL training and improving decision quality. Extensive empirical evaluations using a realistic simulator trained on real-world data demonstrate that our approach outperforms fixed-interval matching strategies, significantly reducing passenger waiting times and detour delays, thereby enhancing the overall efficiency of ride-hailing and ride-pooling systems.
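Potential-based reward shaping itself is a standard construction; a minimal sketch is given below, where the potential over pending passenger wait times is an illustrative assumption rather than the paper's definition.

```python
# Standard potential-based reward shaping (Ng et al., 1999); the potential used
# here (negative total waiting time of pending requests) is an illustrative
# assumption, not the paper's definition.
def potential(pending_wait_times):
    """Higher potential when accumulated passenger waiting is lower."""
    return -sum(pending_wait_times)

def shaped_reward(r, phi_s, phi_s_next, gamma=0.99):
    """r' = r + gamma * Phi(s') - Phi(s); shaping of this form leaves the optimal policy unchanged."""
    return r + gamma * phi_s_next - phi_s

# Example: matching now clears a 30 s and a 12 s wait, leaving one 5 s request pending.
r_shaped = shaped_reward(r=0.0, phi_s=potential([30.0, 12.0]), phi_s_next=potential([5.0]))
```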
https://arxiv.org/abs/2503.13200
3D Gaussian Splatting (3DGS) achieves high-fidelity rendering with fast real-time performance, but existing methods rely on offline training after full Structure-from-Motion (SfM) processing. In contrast, this work introduces On-the-Fly GS, a progressive framework enabling near real-time 3DGS optimization during image capture. As each image arrives, its pose and sparse points are updated via on-the-fly SfM, and newly optimized Gaussians are immediately integrated into the 3DGS field. We propose a progressive local optimization strategy that prioritizes new images and their neighbors according to their overlapping relationship, allowing the new image and its overlapping images to receive more training. To further stabilize training across old and new images, an adaptive learning rate schedule balances the iterations and the learning rate. Moreover, to maintain the overall quality of the 3DGS field, an efficient global optimization scheme prevents overfitting to the newly added images. Experiments on multiple benchmark datasets show that our On-the-Fly GS reduces training time significantly, optimizing each new image in seconds with minimal rendering loss, offering the first practical step toward rapid, progressive 3DGS reconstruction.
https://arxiv.org/abs/2503.13086
Most current video MLLMs rely on uniform frame sampling and image-level encoders, resulting in inefficient data processing and limited motion awareness. To address these challenges, we introduce EMA, an Efficient Motion-Aware video MLLM that utilizes compressed video structures as inputs. We propose a motion-aware GOP (Group of Pictures) encoder that fuses spatial and motion information within a GOP unit in the compressed video stream, generating compact, informative visual tokens. By integrating fewer but denser RGB frames with more but sparser motion vectors in this native slow-fast input architecture, our approach reduces redundancy and enhances motion representation. Additionally, we introduce MotionBench, a benchmark for evaluating motion understanding across four motion types: linear, curved, rotational, and contact-based. Experimental results show that EMA achieves state-of-the-art performance on both MotionBench and popular video question answering benchmarks, while reducing inference costs. Moreover, EMA demonstrates strong scalability, as evidenced by its competitive performance on long video understanding benchmarks.
https://arxiv.org/abs/2503.13016
Cooperative perception can increase the field of view and decrease the occlusion of an ego vehicle, hence improving the perception performance and safety of autonomous driving. Despite the success of previous works on cooperative object detection, they mostly operate on dense Bird's Eye View (BEV) feature maps, which are computationally demanding and can hardly be extended to long-range detection problems. More efficient fully sparse frameworks are rarely explored. In this work, we design a fully sparse framework, SparseAlign, with three key features: an enhanced sparse 3D backbone, a query-based temporal context learning module, and a robust detection head specially tailored for sparse features. Extensive experimental results on both the OPV2V and DairV2X datasets show that our framework, despite its sparsity, outperforms the state of the art with lower communication bandwidth requirements. In addition, experiments on the OPV2Vt and DairV2Xt datasets for time-aligned cooperative object detection also show a significant performance gain compared to the baseline works.
https://arxiv.org/abs/2503.12982
Neural Radiance Field (NeRF) has shown remarkable performance in novel view synthesis but requires many multiview images, making it impractical for few-shot scenarios. Ray augmentation was proposed to prevent overfitting on sparse training data by generating additional rays. However, existing methods, which generate augmented rays only near the original rays, produce severe floaters and appearance distortion due to limited viewpoints and inconsistent rays obstructed by nearby obstacles and complex surfaces. To address these problems, we propose DivCon-NeRF, which significantly enhances both diversity and consistency. It employs surface-sphere augmentation, which preserves the distance between the original camera and the predicted surface point. This allows the model to compare the order of high-probability surface points and filter out inconsistent rays easily without requiring the exact depth. By introducing inner-sphere augmentation, DivCon-NeRF randomizes angles and distances for diverse viewpoints, further increasing diversity. Consequently, our method significantly reduces floaters and visual distortions, achieving state-of-the-art performance on the Blender, LLFF, and DTU datasets. Our code will be publicly available.
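A rough sketch of the surface-sphere idea as stated in the abstract (an interpretation, not the released code): an augmented ray origin is drawn on the sphere centred at the predicted surface point whose radius equals the original camera-to-surface distance, and the ray is re-aimed at that point.

```python
# Hedged sketch of surface-sphere augmentation as read from the abstract (not the
# authors' code): the augmented origin stays at the original camera-to-surface
# distance, so the preserved radius makes depth-order comparisons meaningful.
import numpy as np

def surface_sphere_ray(cam_origin, surface_point, max_angle_deg=15.0,
                       rng=np.random.default_rng(0)):
    """cam_origin, surface_point: (3,) arrays. Returns an augmented (origin, direction)."""
    radius = np.linalg.norm(surface_point - cam_origin)      # preserved distance
    d = (cam_origin - surface_point) / radius                 # unit surface-to-camera direction
    noise = rng.normal(size=3)
    noise -= noise.dot(d) * d                                 # keep only the tangential part
    noise /= np.linalg.norm(noise)
    ang = np.deg2rad(max_angle_deg) * rng.random()            # small random tilt
    new_dir = np.cos(ang) * d + np.sin(ang) * noise
    new_origin = surface_point + radius * new_dir             # stays on the sphere
    ray_dir = (surface_point - new_origin) / radius           # re-aim at the surface point
    return new_origin, ray_dir

origin, direction = surface_sphere_ray(np.zeros(3), np.array([0.0, 0.0, 2.0]))
```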
https://arxiv.org/abs/2503.12947