Brain cancer represents a major challenge in medical diagnostics, requiring precise and timely detection for effective treatment. Diagnosis initially relies on the proficiency of radiologists, which can cause difficulties and risks where such expertise is scarce. Despite the availability of imaging resources, diagnosing brain cancer often remains difficult, time-consuming, and vulnerable to intraclass variability. This study presents the Bangladesh Brain Cancer MRI Dataset, containing 6,056 MRI images organized into three categories: Brain Tumor, Brain Glioma, and Brain Menin. The dataset was collected from several hospitals in Bangladesh, providing a diverse and realistic sample for research. We implemented advanced deep learning models, and DenseNet169 achieved exceptional results, with accuracy, precision, recall, and F1-score all reaching 0.9983. In addition, Explainable AI (XAI) methods including GradCAM, GradCAM++, ScoreCAM, and LayerCAM were employed to provide visual representations of the models' decision-making processes. In the context of brain cancer, these techniques highlight DenseNet169's potential to enhance diagnostic accuracy while offering transparency, facilitating early diagnosis and better patient outcomes.
https://arxiv.org/abs/2501.05426
The co-design of neural network architectures, quantization precisions, and hardware accelerators offers a promising approach to achieving an optimal balance between performance and efficiency, particularly for model deployment on resource-constrained edge devices. In this work, we propose the JAQ Framework, which jointly optimizes these three critical dimensions. However, effectively automating the design process across the vast search space of the three dimensions poses significant challenges, especially when pursuing extremely low-bit quantization. Specifically, the primary challenges include: (1) memory overhead on the software side: low-precision quantization-aware training can lead to significant memory usage, since large intermediate features and latent weights must be stored for back-propagation, potentially causing memory exhaustion; and (2) time-consuming search on the hardware side: the discrete nature of hardware parameters and the complex interplay between compiler optimizations and individual operators make the accelerator search time-consuming. To address these issues, JAQ mitigates the memory overhead through a channel-wise sparse quantization (CSQ) scheme, selectively applying quantization to the most sensitive components of the model during optimization. Additionally, JAQ designs BatchTile, which employs a hardware generation network to encode all possible tiling modes, thereby speeding up the search for the optimal compiler mapping strategy. Extensive experiments demonstrate the effectiveness of JAQ, achieving approximately 7% higher Top-1 accuracy on ImageNet compared to previous methods and reducing the hardware search time per iteration to 0.15 seconds.
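The abstract does not detail how CSQ selects its channels, so the following is only a minimal sketch of the idea under assumed semantics (the function names, the random sensitivity scores, and the 4-bit setting are illustrative, not JAQ's actual implementation): fake-quantize only the k most sensitive output channels of a weight matrix and keep the rest in full precision.

```python
import numpy as np

def fake_quantize(x, bits):
    """Uniform symmetric fake quantization of an array to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax if np.abs(x).max() > 0 else 1.0
    return np.round(x / scale).clip(-qmax, qmax) * scale

def channelwise_sparse_quantize(weights, sensitivity, k, bits=4):
    """Quantize only the k most sensitive output channels of `weights`
    (shape: out_channels x in_features); keep the others in full precision."""
    out = weights.copy()
    chosen = np.argsort(sensitivity)[-k:]       # k most sensitive channels
    for c in chosen:
        out[c] = fake_quantize(weights[c], bits)
    return out, set(chosen.tolist())

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
sens = rng.random(8)                            # placeholder sensitivity scores
qw, chosen = channelwise_sparse_quantize(w, sens, k=3)
untouched = [c for c in range(8) if c not in chosen]
print(len(chosen), all(np.array_equal(qw[c], w[c]) for c in untouched))
```

Only the selected channels are modified, which is the source of the memory saving: the quantization-aware bookkeeping is paid for a subset of channels rather than the whole model.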
https://arxiv.org/abs/2501.05339
Non-line-of-Sight (NLOS) imaging systems collect light at a diffuse relay surface and input this measurement into computational algorithms that output a 3D volumetric reconstruction. These algorithms utilize the Fast Fourier Transform (FFT) to accelerate the reconstruction process but require both input and output to be sampled spatially with uniform grids. However, the geometry of NLOS imaging inherently results in non-uniform sampling on the relay surface when using multi-pixel detector arrays, even though such arrays significantly reduce acquisition times. Furthermore, using these arrays increases the data rate required for sensor readout, posing challenges for real-world deployment. In this work, we utilize the phasor field framework to demonstrate that existing NLOS imaging setups typically oversample the relay surface spatially, explaining why the measurement can be compressed without significantly sacrificing reconstruction quality. This enables us to utilize the Non-Uniform Fast Fourier Transform (NUFFT) to reconstruct from sparse measurements acquired from irregularly sampled relay surfaces of arbitrary shapes. Furthermore, we utilize the NUFFT to reconstruct at arbitrary locations in the hidden volume, ensuring flexible sampling schemes for both the input and output. Finally, we utilize the Scaled Fast Fourier Transform (SFFT) to reconstruct larger volumes without increasing the number of samples stored in memory. All algorithms introduced in this paper preserve the computational complexity of FFT-based methods, ensuring scalability for practical NLOS imaging applications.
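As a concrete reference point for the NUFFT discussion, the sketch below implements the direct O(NM) non-uniform discrete Fourier transform that NUFFT algorithms approximate in O(N log N); on a uniform grid it coincides with the FFT, the degenerate case that existing FFT-based NLOS reconstructions assume. This is an illustration, not the paper's code; fast implementations are provided by libraries such as FINUFFT.

```python
import numpy as np

def nudft(values, positions, freqs):
    """Direct non-uniform DFT: sum_j v_j * exp(-2*pi*i * f * x_j).
    This is the exact transform that NUFFT algorithms approximate fast."""
    return np.array([np.sum(values * np.exp(-2j * np.pi * f * positions))
                     for f in freqs])

n = 16
signal = np.random.default_rng(1).normal(size=n)
uniform_x = np.arange(n) / n      # uniform sample positions on [0, 1)
freqs = np.arange(n)              # integer frequencies
# On a uniform grid the NUDFT matches the FFT exactly (up to float error);
# irregular relay-surface samples simply use non-uniform `positions`.
print(np.allclose(nudft(signal, uniform_x, freqs), np.fft.fft(signal)))
```

With irregularly sampled relay-surface points one passes the measured positions directly as `positions`, which is what removes the uniform-grid requirement described in the abstract.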
https://arxiv.org/abs/2501.05244
Objective: There exist several X-ray computed tomography (CT) scanning strategies to reduce the radiation dose, such as (1) sparse-view CT, (2) low-dose CT, and (3) region-of-interest (ROI) CT (called interior tomography). To further reduce the dose, the sparse-view and/or low-dose CT settings can be applied together with interior tomography. Interior tomography has various advantages in terms of reducing the number of detectors and decreasing the X-ray radiation dose. However, a large patient or a small field-of-view (FOV) detector can cause truncated projections, and the reconstructed images then suffer from severe cupping artifacts. In addition, although low-dose CT can reduce the radiation exposure dose, analytic reconstruction algorithms produce image noise. Recently, many researchers have utilized image-domain deep learning (DL) approaches to remove each artifact and demonstrated impressive performance, and the theory of deep convolutional framelets explains the reason for the performance improvement. Approach: In this paper, we show, based on deep convolutional framelets, that an image-domain convolutional neural network (CNN) struggles to resolve coupled artifacts. Significance: To address the coupled problem, we decouple it into two sub-problems: (i) image-domain noise reduction inside the truncated projection, to solve the low-dose CT problem, and (ii) extrapolation of the projection outside the truncated region, to solve the ROI CT problem. The decoupled sub-problems are solved directly with a novel proposed end-to-end learning scheme using dual-domain CNNs. Main results: We demonstrate that the proposed method outperforms conventional image-domain deep learning methods, and a projection-domain CNN shows better performance than the image-domain CNNs commonly used by many researchers.
https://arxiv.org/abs/2501.05085
Ensuring robot safety can be challenging; user-defined constraints can miss edge cases, policies can become unsafe even when trained from safe data, and safety can be subjective. Thus, we learn about robot safety by showing policy trajectories to a human who flags unsafe behavior. From this binary feedback, we use the statistical method of conformal prediction to identify a region of states, potentially in learned latent space, guaranteed to contain a user-specified fraction of future policy errors. Our method is sample-efficient, as it builds on nearest neighbor classification and avoids withholding data as is common with conformal prediction. By alerting if the robot reaches the suspected unsafe region, we obtain a warning system that mimics the human's safety preferences with guaranteed miss rate. From video labeling, our system can detect when a quadcopter visuomotor policy will fail to steer through a designated gate. We present an approach for policy improvement by avoiding the suspected unsafe region. With it we improve a model predictive controller's safety, as shown in experimental testing with 30 quadcopter flights across 6 navigation tasks. Code and videos are provided.
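The warning system above can be sketched as follows, under assumed details (Euclidean state vectors, leave-one-out nearest-neighbor nonconformity scores, and a standard conformal quantile; the paper's learned latent space and exact calibration may differ):

```python
import numpy as np

def conformal_warning_threshold(unsafe_states, alpha=0.1):
    """Leave-one-out nearest-neighbor distances among flagged-unsafe states,
    thresholded at the conformal rank so that, under exchangeability, the
    region covers at least a 1 - alpha fraction of future policy errors."""
    n = len(unsafe_states)
    d = np.linalg.norm(unsafe_states[:, None] - unsafe_states[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    scores = d.min(axis=1)                      # distance to nearest other unsafe state
    k = int(np.ceil((n + 1) * (1 - alpha)))     # conformal rank
    return np.sort(scores)[min(k, n) - 1]

def warn(state, unsafe_states, tau):
    """Alert when the state enters the suspected unsafe region."""
    return np.linalg.norm(unsafe_states - state, axis=-1).min() <= tau

rng = np.random.default_rng(0)
unsafe = rng.normal(loc=5.0, size=(50, 2))      # states the human flagged unsafe
tau = conformal_warning_threshold(unsafe, alpha=0.1)
print(warn(np.array([5.0, 5.0]), unsafe, tau),  # inside the unsafe cluster
      warn(np.array([0.0, 0.0]), unsafe, tau))  # far from any flagged state
```

Note how the leave-one-out scoring uses every labeled example for both scoring and calibration, matching the abstract's point about avoiding a withheld calibration split.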
https://arxiv.org/abs/2501.04823
We study the problem of single-image 3D object reconstruction. Recent works have diverged into two directions: regression-based modeling and generative modeling. Regression methods efficiently infer visible surfaces, but struggle with occluded regions. Generative methods handle uncertain regions better by modeling distributions, but are computationally expensive and the generation is often misaligned with visible surfaces. In this paper, we present SPAR3D, a novel two-stage approach aiming to take the best of both directions. The first stage of SPAR3D generates sparse 3D point clouds using a lightweight point diffusion model, which has a fast sampling speed. The second stage uses both the sampled point cloud and the input image to create highly detailed meshes. Our two-stage design enables probabilistic modeling of the ill-posed single-image 3D task while maintaining high computational efficiency and great output fidelity. Using point clouds as an intermediate representation further allows for interactive user edits. Evaluated on diverse datasets, SPAR3D demonstrates superior performance over previous state-of-the-art methods, at an inference speed of 0.7 seconds. Project page with code and model: this https URL
https://arxiv.org/abs/2501.04689
Recently, Gaussian Splatting has sparked a new trend in the field of computer vision. Apart from novel view synthesis, it has also been extended to the area of multi-view reconstruction. The latest methods facilitate complete, detailed surface reconstruction while ensuring fast training speed. However, these methods still require dense input views, and their output quality significantly degrades with sparse views. We observed that the Gaussian primitives tend to overfit the few training views, leading to noisy floaters and incomplete reconstruction surfaces. In this paper, we present an innovative sparse-view reconstruction framework that leverages intra-view depth and multi-view feature consistency to achieve remarkably accurate surface reconstruction. Specifically, we utilize monocular depth ranking information to supervise the consistency of depth distribution within patches and employ a smoothness loss to enhance the continuity of the distribution. To achieve finer surface reconstruction, we optimize the absolute position of depth through multi-view projection features. Extensive experiments on DTU and BlendedMVS demonstrate that our method outperforms state-of-the-art methods with a speedup of 60x to 200x, achieving swift and fine-grained mesh reconstruction without the need for costly pre-training.
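The monocular depth-ranking supervision can be illustrated with a toy pairwise hinge loss (a sketch under assumed notation, not the paper's exact loss): it penalizes rendered depths only when they contradict the ordinal relations given by the monocular estimate, which is why only the ranking, not the absolute scale, of the monocular depth is trusted.

```python
import numpy as np

def depth_ranking_loss(rendered, mono, pairs, margin=1e-4):
    """Hinge loss enforcing that rendered depths follow the ordinal
    relations of a monocular depth estimate: if mono says pixel i is
    closer than pixel j, rendered[i] should be smaller than rendered[j]."""
    loss = 0.0
    for i, j in pairs:
        if mono[i] < mono[j]:
            loss += max(0.0, rendered[i] - rendered[j] + margin)
        elif mono[j] < mono[i]:
            loss += max(0.0, rendered[j] - rendered[i] + margin)
    return loss / len(pairs)

mono = np.array([1.0, 2.0, 3.0])     # monocular estimate: pixel 0 closest
good = np.array([0.5, 1.5, 2.5])     # same ordering -> no penalty
bad = np.array([2.5, 1.5, 0.5])      # reversed ordering -> penalized
pairs = [(0, 1), (1, 2), (0, 2)]
print(depth_ranking_loss(good, mono, pairs),
      depth_ranking_loss(bad, mono, pairs) > 0)
```

In the paper's setting the pairs would be sampled within patches, with a separate smoothness term and the multi-view projection features refining the absolute depths.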
https://arxiv.org/abs/2501.04628
Robust tensor principal component analysis (RTPCA) aims to separate the low-rank and sparse components from multi-dimensional data, making it an essential technique in the signal processing and computer vision fields. The recently emerging tensor singular value decomposition (t-SVD) has gained considerable attention for its ability to capture the low-rank structure of tensors better than traditional matrix SVD. However, existing methods often rely on the computationally expensive tensor nuclear norm (TNN), which limits their scalability for real-world tensors. To address this issue, we explore, for the first time, an efficient scaled gradient descent (SGD) approach within the t-SVD framework, and propose the RTPCA-SGD method. Theoretically, we rigorously establish the recovery guarantees of RTPCA-SGD under mild assumptions, demonstrating that with appropriate parameter selection, it achieves linear convergence to the true low-rank tensor at a constant rate, independent of the condition number. To enhance its practical applicability, we further propose a learnable self-supervised deep unfolding model, which enables effective parameter learning. Numerical experiments on both synthetic and real-world datasets demonstrate the superior performance of the proposed methods while maintaining competitive computational efficiency; in particular, they require less time than RTPCA-TNN.
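For intuition, here is a small matrix-case analogue of the scaled gradient descent idea (a sketch only, not the paper's tensor t-SVD algorithm): alternate hard-thresholding of the sparse part with gradient steps on the low-rank factors, where preconditioning by the other factor's Gram inverse is what makes the rate independent of the condition number. The true sparsity level is assumed known here for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, s = 30, 2, 20                              # size, rank, number of corruptions
L_true = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))
S_true = np.zeros((n, n))
support = rng.choice(n * n, size=s, replace=False)
S_true.flat[support] = rng.uniform(5, 15, size=s) * rng.choice([-1.0, 1.0], size=s)
Y = L_true + S_true                              # observed = low-rank + sparse

def top_k_sparse(M, k):
    """Projection onto k-sparse matrices: keep the k largest-magnitude entries."""
    out = np.zeros_like(M)
    keep = np.argsort(np.abs(M), axis=None)[-k:]
    out.flat[keep] = M.flat[keep]
    return out

# Spectral initialization of the low-rank factors.
S = top_k_sparse(Y, s)
U0, sv, V0t = np.linalg.svd(Y - S)
U = U0[:, :r] * np.sqrt(sv[:r])
V = V0t[:r].T * np.sqrt(sv[:r])

eta = 0.5
for _ in range(100):
    S = top_k_sparse(Y - U @ V.T, s)             # re-estimate the sparse part
    R = U @ V.T + S - Y                          # residual of the current fit
    # Scaled gradient steps: precondition each factor's gradient by the
    # inverse Gram matrix of the other factor.
    U_new = U - eta * R @ V @ np.linalg.inv(V.T @ V)
    V_new = V - eta * R.T @ U @ np.linalg.inv(U.T @ U)
    U, V = U_new, V_new

err = np.linalg.norm(U @ V.T - L_true) / np.linalg.norm(L_true)
print(err < 1e-3)
```

The paper's contribution is carrying this scheme and its linear-convergence guarantee into the t-SVD tensor setting, plus a deep-unfolded variant that learns the step sizes and thresholds.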
https://arxiv.org/abs/2501.04565
Recent advancements in LiDAR-Inertial Odometry (LIO) have enabled a wide range of applications. However, traditional LIO systems tend to focus more on localization than on mapping, with maps consisting mostly of sparse geometric elements, which is not ideal for downstream tasks. Recently emerging neural field technology has great potential for dense mapping, but pure LiDAR mapping is difficult on highly dynamic vehicles. To mitigate this challenge, we present a new solution that tightly couples geometric kinematics with neural fields to enhance simultaneous state estimation and dense mapping capabilities. We propose both semi-coupled and tightly coupled Kinematic-Neural LIO (KN-LIO) systems that leverage online SDF decoding and iterated error-state Kalman filtering to fuse laser and inertial data. Our KN-LIO minimizes information loss and improves accuracy in state estimation, while also accommodating asynchronous multi-LiDAR inputs. Evaluations on diverse high-dynamic datasets demonstrate that our KN-LIO achieves performance on par with or superior to existing state-of-the-art solutions in pose estimation and offers improved dense mapping accuracy over pure LiDAR-based methods. The relevant code and datasets will be made available at https://**.
https://arxiv.org/abs/2501.04263
Cross-lingual information retrieval (CLIR)~\cite{shi2021cross, asai2021one, jiang2020cross} can find relevant text in any language, such as English (high-resource) or Telugu (low-resource), even when the query is posed in a different, possibly low-resource, language. In this work, we aim to develop useful CLIR models for this constrained, yet important, setting in which no additional supervision or labelled data is required for the retrieval task, so the models can work effectively for low-resource languages. \par We propose a simple and effective re-ranking method for improving passage retrieval in open question answering. The re-ranker re-scores retrieved passages with a zero-shot multilingual question generation model, a pre-trained language model that computes the probability of the input question in the target language conditioned on a retrieved passage, which may be in a different language. We evaluate our method in a completely zero-shot setting that requires no training. The main advantage of our approach is therefore that it can re-rank results obtained by any sparse retrieval method, such as BM-25. This eliminates the need for the expensive labelled corpora usually required for retrieval tasks, and hence the method can be used for low-resource languages.
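The re-ranking step can be sketched as follows. The real system scores log P(question | passage) with a pretrained multilingual question-generation model; the toy unigram scorer below is only a placeholder for that likelihood so the sketch runs without model weights, and all names and example strings are illustrative.

```python
import math

def qg_log_prob(question, passage):
    """Placeholder for log P(question | passage) from a multilingual
    question-generation model. Here: a toy unigram-overlap score standing
    in for the real model's token-by-token log-likelihood."""
    p_tokens = set(passage.lower().split())
    probs = [0.5 if tok in p_tokens else 0.01 for tok in question.lower().split()]
    return sum(math.log(p) for p in probs)

def rerank(question, bm25_passages):
    """Re-score candidates from any sparse retriever (e.g. BM-25) by the
    question-generation likelihood, highest first."""
    return sorted(bm25_passages, key=lambda p: qg_log_prob(question, p), reverse=True)

question = "when was the telugu language first attested"
passages = [
    "The cricket team won the match yesterday in Hyderabad.",
    "Telugu is a Dravidian language first attested in inscriptions from 575 CE.",
]
print(rerank(question, passages)[0])
```

Because the scorer needs no retrieval-specific training, the same pipeline applies unchanged when the question and the passages are in different languages, provided the underlying model is multilingual.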
https://arxiv.org/abs/2501.04153
Egocentric videos capture scenes from a wearer's viewpoint, resulting in dynamic backgrounds, frequent motion, and occlusions, which pose challenges to accurate keystep recognition. We propose a flexible graph-learning framework for fine-grained keystep recognition that effectively leverages long-term dependencies in egocentric videos, and exploits the alignment between egocentric and exocentric videos during training to improve inference on egocentric videos. Our approach consists of constructing a graph where each clip of the egocentric video corresponds to a node. During training, we treat each clip of each exocentric video (if available) as an additional node. We examine several strategies to define connections across these nodes and pose keystep recognition as a node classification task on the constructed graphs. We perform extensive experiments on the Ego-Exo4D dataset and show that our proposed flexible graph-based framework notably outperforms existing methods by more than 12 points in accuracy. Furthermore, the constructed graphs are sparse and computationally efficient. We also present a study examining the use of several multimodal features, including narrations, depth, and object class labels, on a heterogeneous graph, and discuss their corresponding contributions to keystep recognition performance.
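The graph construction can be sketched as below, with assumed details (temporal edges within a window, one alignment edge per time-matched ego/exo clip pair, and plain feature propagation standing in for the learned graph network that classifies each node's keystep):

```python
import numpy as np

def build_clip_graph(n_ego, n_exo, window=1):
    """Adjacency over ego clip nodes (temporal edges within `window`) plus
    alignment edges to time-matching exo clip nodes (training only)."""
    n = n_ego + n_exo
    A = np.zeros((n, n))
    for i in range(n_ego):
        for j in range(max(0, i - window), min(n_ego, i + window + 1)):
            if i != j:
                A[i, j] = 1.0
    for i in range(min(n_ego, n_exo)):          # ego clip i <-> exo clip i
        A[i, n_ego + i] = A[n_ego + i, i] = 1.0
    return A

def propagate(A, X, steps=2):
    """Degree-normalized feature propagation X <- D^-1 A X, a toy stand-in
    for the learned GNN layers used for node classification."""
    d_inv = 1.0 / A.sum(axis=1, keepdims=True).clip(min=1.0)
    for _ in range(steps):
        X = d_inv * (A @ X)
    return X

A = build_clip_graph(n_ego=4, n_exo=4)
X = np.eye(8)                                   # one-hot placeholder clip features
H = propagate(A, X)
print(A.shape, H.shape)
```

At inference time only the ego nodes exist, so the exo branch of the construction is simply skipped; the sparsity of `A` (each node connects to a handful of neighbors) is what keeps the graph compute-efficient.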
https://arxiv.org/abs/2501.04121
LiDAR data pretraining offers a promising approach to leveraging large-scale, readily available datasets for enhanced data utilization. However, existing methods predominantly focus on sparse voxel representation, overlooking the complementary attributes provided by other LiDAR representations. In this work, we propose LiMoE, a framework that integrates the Mixture of Experts (MoE) paradigm into LiDAR data representation learning to synergistically combine multiple representations, such as range images, sparse voxels, and raw points. Our approach consists of three stages: i) Image-to-LiDAR Pretraining, which transfers prior knowledge from images to point clouds across different representations; ii) Contrastive Mixture Learning (CML), which uses MoE to adaptively activate relevant attributes from each representation and distills these mixed features into a unified 3D network; iii) Semantic Mixture Supervision (SMS), which combines semantic logits from multiple representations to boost downstream segmentation performance. Extensive experiments across 11 large-scale LiDAR datasets demonstrate our effectiveness and superiority. The code and model checkpoints have been made publicly accessible.
https://arxiv.org/abs/2501.04004
Determining 'who spoke what and when' remains challenging in real-world applications. In typical scenarios, Speaker Diarization (SD) is employed to address the problem of 'who spoke when,' while Target Speaker Extraction (TSE) or Target Speaker Automatic Speech Recognition (TSASR) techniques are utilized to resolve the issue of 'who spoke what.' Although some works have achieved promising results by combining SD and TSE systems, the two components remain mismatched in both their outputs and their target scenarios. To address these limitations, we propose a Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection (USEF-TP) model that jointly performs TSE and Personal Voice Activity Detection (PVAD). USEF-TP leverages frame-level features obtained through a cross-attention mechanism as speaker-related features, instead of using speaker embeddings as in traditional approaches. Additionally, a multi-task learning algorithm with a scenario-aware differentiated loss function is applied to ensure robust performance across various levels of speaker overlap. The experimental results show that our proposed USEF-TP model achieves superior performance in TSE and PVAD tasks on the LibriMix and SparseLibriMix datasets.
https://arxiv.org/abs/2501.03612
To preserve user privacy in recommender systems, federated recommendation (FR) based on federated learning (FL) emerges, keeping the personal data on the local client and updating a model collaboratively. Unlike FL, FR has a unique sparse aggregation mechanism, where the embedding of each item is updated by only partial clients, instead of full clients in a dense aggregation of general FL. Recently, as an essential principle of FL, model security has received increasing attention, especially for Byzantine attacks, where malicious clients can send arbitrary updates. The problem of exploring the Byzantine robustness of FR is particularly critical since in the domains applying FR, e.g., e-commerce, malicious clients can be injected easily by registering new accounts. However, existing Byzantine works neglect the unique sparse aggregation of FR, making them unsuitable for our problem. Thus, we make the first effort to investigate Byzantine attacks on FR from the perspective of sparse aggregation, which is non-trivial: it is not clear how to define Byzantine robustness under sparse aggregations and design Byzantine attacks under limited knowledge/capability. In this paper, we reformulate the Byzantine robustness under sparse aggregation by defining the aggregation for a single item as the smallest execution unit. Then we propose a family of effective attack strategies, named Spattack, which exploit the vulnerability in sparse aggregation and are categorized along the adversary's knowledge and capability. Extensive experimental results demonstrate that Spattack can effectively prevent convergence and even break down defenses under a few malicious clients, raising alarms for securing FR systems.
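The vulnerability of sparse aggregation can be seen in a few lines (a toy illustration, not Spattack itself; the client/item values are made up): because each item's embedding is averaged only over the clients that touched it, a single malicious update shifts a rarely-updated item far more than a popular one.

```python
import numpy as np

def sparse_aggregate(updates):
    """FR-style sparse aggregation: each item's embedding update is averaged
    only over the clients that actually interacted with that item, not over
    all clients as in the dense aggregation of general FL."""
    per_item = {}
    for client in updates:
        for item, vec in client.items():
            per_item.setdefault(item, []).append(vec)
    return {item: np.mean(vecs, axis=0) for item, vecs in per_item.items()}

# Nine honest clients touch "popular", one honest client touches "rare".
honest = [{"popular": np.ones(2)} for _ in range(9)] + [{"rare": np.ones(2)}]
attacker = {"popular": np.full(2, -10.0), "rare": np.full(2, -10.0)}
agg = sparse_aggregate(honest + [attacker])
# Same malicious vector: "popular" averages over 10 updates, "rare" over 2,
# so the rare item's embedding is dragged much further from the honest value.
print(agg["popular"], agg["rare"])
```

The honest value for both items is 1.0 per coordinate; after the attack, the popular item sits at -0.1 while the rare item sits at -4.5, which is exactly the per-item amplification that item-level Byzantine robustness has to reason about.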
https://arxiv.org/abs/2501.03301
The manipulation of flexible objects such as cables, wires and fresh food items by robot hands forms a special challenge in robot grasp mechanics. This paper considers the steering of flexible linear objects in planar environments by two robot hands. The flexible linear object, modeled as an elastic non-stretchable rod, is manipulated by varying the gripping endpoint positions while keeping equal endpoint tangents. The flexible linear object shape has a closed form solution in terms of the grasp endpoint positions and tangents, called Euler's elastica. This paper obtains the elastica solutions under the optimal control framework, then uses the elastica solutions to obtain closed-form criteria for non self-intersection, stability and obstacle avoidance of the flexible linear object. The new tools are incorporated into a planning scheme for steering flexible linear objects in planar environments populated by sparsely spaced obstacles. The scheme is fully implemented and demonstrated with detailed examples.
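For reference, a standard formulation of the elastica (generic notation, not necessarily the paper's own): an inextensible rod of length $L$ with tangent angle $\theta(s)$ minimizes the bending energy subject to endpoint position and tangent constraints, and the Euler-Lagrange conditions yield a pendulum-type equation whose solutions are Euler's elastica.

```latex
% Bending-energy minimization over tangent-angle profiles \theta(s),
% with endpoint position x_L - x_0 and equal endpoint tangents \theta_0:
\min_{\theta(\cdot)}\ \frac{1}{2}\int_0^L \theta'(s)^2\,ds
\quad\text{s.t.}\quad
\int_0^L \begin{pmatrix}\cos\theta(s)\\ \sin\theta(s)\end{pmatrix} ds
  = x_L - x_0,
\qquad \theta(0) = \theta(L) = \theta_0.
% Euler--Lagrange with position multipliers (\lambda_1, \lambda_2):
\theta''(s) = -\lambda_1 \sin\theta(s) + \lambda_2 \cos\theta(s).
```

The closed-form criteria mentioned in the abstract (non self-intersection, stability, obstacle avoidance) are then conditions on these elastica solutions as functions of the grasp endpoint positions and tangents.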
https://arxiv.org/abs/2501.02874
Object pose estimation, crucial in computer vision and robotics applications, faces challenges from the diversity of unseen categories. We propose a zero-shot method for category-level 6-DOF object pose estimation that exploits both 2D and 3D universal features of the input RGB-D image to establish semantic-similarity-based correspondences, and that can be extended to unseen categories without additional model fine-tuning. Our method begins by combining efficient 2D universal features to find sparse correspondences between intra-category objects and obtain an initial coarse pose. Because the correspondences from 2D universal features degrade when the estimated pose deviates substantially from the target pose, we use an iterative strategy to optimize the pose. Subsequently, to resolve pose ambiguities due to shape differences between intra-category objects, the coarse pose is refined by optimizing with a dense alignment constraint on 3D universal features. Our method outperforms previous methods on the REAL275 and Wild6D benchmarks for unseen categories.
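The coarse alignment from sparse 3D-3D correspondences can be illustrated with the classical Kabsch algorithm (a generic stand-in under assumed inputs, matched 3D point pairs, not necessarily the paper's exact solver):

```python
import numpy as np

def coarse_pose_from_correspondences(src, dst):
    """Least-squares rotation R and translation t with dst ~= R @ src + t
    (Kabsch algorithm), a generic stand-in for a coarse pose estimate from
    sparse semantic-feature correspondences."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)     # cross-covariance SVD
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

rng = np.random.default_rng(0)
pts = rng.normal(size=(12, 3))                    # sparse matched model points
angle = 0.7
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 0.3])
obs = pts @ R_true.T + t_true                     # observed points in camera frame
R, t = coarse_pose_from_correspondences(pts, obs)
print(np.allclose(R, R_true), np.allclose(t, t_true))
```

In the noiseless case the recovery is exact; with the degraded correspondences the abstract describes, this closed-form step would only seed the iterative refinement.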
https://arxiv.org/abs/2501.02831
Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preferences. Prior RLHF works typically take a bandit formulation, which, though intuitive, ignores the sequential nature of LM generation and can suffer from the sparse reward issue. While recent works propose dense token-level RLHF, treating each token as an action may be too fine-grained for proper reward assignment. In this paper, we seek to get the best of both by training and utilizing a segment-level reward model, which assigns a reward to each semantically complete text segment spanning a short sequence of tokens. For reward learning, our method allows dynamic text segmentation and is compatible with standard sequence-preference datasets. For effective RL-based LM training against segment rewards, we generalize the classical scalar bandit reward normalizers into location-aware normalizer functions and interpolate the segment rewards for further densification. With these designs, our method performs competitively on three popular RLHF benchmarks for LM policy: AlpacaEval 2.0, Arena-Hard, and MT-Bench. Ablation studies further validate our method.
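The interpolation that densifies segment rewards into per-token signals can be sketched as follows (the segment boundaries, the equal-share rule, and the normalizer interface are illustrative assumptions, not the paper's exact scheme):

```python
import numpy as np

def densify_segment_rewards(seg_bounds, seg_rewards, n_tokens):
    """Spread each segment-level reward over its member tokens so every
    token receives a learning signal; here each token gets an equal share
    of its segment's reward (one possible interpolation rule)."""
    token_r = np.zeros(n_tokens)
    for (start, end), r in zip(seg_bounds, seg_rewards):
        token_r[start:end] = r / (end - start)
    return token_r

def location_aware_normalize(token_r, baseline_mean, baseline_std):
    """Normalize each position with its own statistics, generalizing the
    single scalar whitening used in bandit-style RLHF."""
    return (token_r - baseline_mean) / np.maximum(baseline_std, 1e-8)

bounds = [(0, 3), (3, 5), (5, 9)]        # three semantic segments, 9 tokens
rewards = np.array([0.9, -0.3, 0.6])     # one reward per segment
dense = densify_segment_rewards(bounds, rewards, 9)
norm = location_aware_normalize(dense, np.zeros(9), np.ones(9))
print(dense.round(2))
```

The equal-share rule preserves the total reward of each segment while giving the RL objective a non-zero signal at every token position; the position-indexed baselines are what make the normalization "location-aware."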
https://arxiv.org/abs/2501.02790
4D video control is essential in video generation as it enables the use of sophisticated lens techniques, such as multi-camera shooting and dolly zoom, which are currently unsupported by existing methods. Training a video Diffusion Transformer (DiT) directly to control 4D content requires expensive multi-view videos. Inspired by Monocular Dynamic novel View Synthesis (MDVS) that optimizes a 4D representation and renders videos according to different 4D elements, such as camera pose and object motion editing, we bring pseudo 4D Gaussian fields to video generation. Specifically, we propose a novel framework that constructs a pseudo 4D Gaussian field with dense 3D point tracking and renders the Gaussian field for all video frames. Then we finetune a pretrained DiT to generate videos following the guidance of the rendered video, dubbed as GS-DiT. To boost the training of the GS-DiT, we also propose an efficient Dense 3D Point Tracking (D3D-PT) method for the pseudo 4D Gaussian field construction. Our D3D-PT outperforms SpatialTracker, the state-of-the-art sparse 3D point tracking method, in accuracy and accelerates the inference speed by two orders of magnitude. During the inference stage, GS-DiT can generate videos with the same dynamic content while adhering to different camera parameters, addressing a significant limitation of current video generation models. GS-DiT demonstrates strong generalization capabilities and extends the 4D controllability of Gaussian splatting to video generation beyond just camera poses. It supports advanced cinematic effects through the manipulation of the Gaussian field and camera intrinsics, making it a powerful tool for creative video production. Demos are available at this https URL.
https://arxiv.org/abs/2501.02690
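The key inference-time idea above is that the same dynamic content (tracked 3D points per frame) can be re-rendered under different camera parameters. A minimal, hedged sketch of that principle using a basic pinhole projection follows; the function names, the simple translation-only camera, and the numbers are illustrative assumptions, not the GS-DiT rendering pipeline:

```python
# Illustrative sketch: re-projecting time-varying 3D points (a stand-in for a
# pseudo-4D field) under user-chosen pinhole camera parameters.

def project_points(points_3d, focal, cx, cy, cam_translation=(0.0, 0.0, 0.0)):
    """Pinhole-project 3D points (after a camera translation) to pixel coords."""
    tx, ty, tz = cam_translation
    pixels = []
    for x, y, z in points_3d:
        x, y, z = x - tx, y - ty, z - tz
        if z <= 0:                      # point behind the camera: skip
            continue
        u = focal * x / z + cx          # standard pinhole projection
        v = focal * y / z + cy
        pixels.append((u, v))
    return pixels

# The same dynamic content rendered with two different camera settings,
# loosely mimicking a dolly-zoom style change (move back, increase focal):
frame_points = [(0.0, 0.0, 4.0), (1.0, 0.5, 5.0)]
near = project_points(frame_points, focal=500.0, cx=320.0, cy=240.0)
far = project_points(frame_points, focal=800.0, cx=320.0, cy=240.0,
                     cam_translation=(0.0, 0.0, -2.0))
```

Varying `focal` and `cam_translation` while keeping `frame_points` fixed is the toy analogue of changing camera intrinsics/extrinsics over unchanged dynamic content.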
Although existing Sparsely Annotated Object Detection (SAOD) approaches have made progress in handling sparsely annotated environments in the multispectral domain, where only some pedestrians are annotated, they still have the following limitations: (i) they lack mechanisms for improving the quality of pseudo-labels for missing annotations, and (ii) they rely on fixed ground-truth annotations, which leads to learning only a limited range of pedestrian visual appearances in the multispectral domain. To address these issues, we propose a novel framework called Sparsely Annotated Multispectral Pedestrian Detection (SAMPD). For limitation (i), we introduce the Multispectral Pedestrian-aware Adaptive Weight (MPAW) and Positive Pseudo-label Enhancement (PPE) modules. Utilizing multispectral knowledge, these modules ensure the generation of high-quality pseudo-labels and enable effective learning by increasing weights for high-quality pseudo-labels based on modality characteristics. To address limitation (ii), we propose an Adaptive Pedestrian Retrieval Augmentation (APRA) module, which adaptively incorporates pedestrian patches from the ground truth and dynamically integrates high-quality pseudo-labels with the ground truth, facilitating a more diverse learning pool of pedestrians. Extensive experimental results demonstrate that our SAMPD significantly enhances performance in sparsely annotated environments within the multispectral domain.
https://arxiv.org/abs/2501.02640
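The general idea behind weighting high-quality pseudo-labels more heavily per modality can be sketched in a few lines. The scoring scheme, threshold, and modality factor below are invented for illustration; the paper's MPAW module is considerably more involved:

```python
# Hedged sketch: scale each pseudo-label's loss weight by a quality score,
# with a per-modality reliability factor. Low-quality labels are discarded.

def pseudo_label_weights(scores, modality_factor, threshold=0.5):
    """Return one loss weight per pseudo-label.

    scores: per-label quality scores in [0, 1] (e.g. detector confidences).
    modality_factor: scaling reflecting how reliable this modality is
                     (e.g. thermal vs. RGB under low light).
    """
    weights = []
    for s in scores:
        if s < threshold:
            weights.append(0.0)                  # low quality: ignore
        else:
            weights.append(modality_factor * s)  # higher quality -> larger weight
    return weights

# Toy example: three thermal-modality pseudo-labels
thermal_weights = pseudo_label_weights([0.9, 0.4, 0.7], modality_factor=1.2)
```

The design point is simply that pseudo-label supervision need not be all-or-nothing: a continuous, modality-aware weight lets reliable pseudo-labels contribute more to the detection loss.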
We present an approach to modifying Transformer architectures by integrating graph-aware relational reasoning into the attention mechanism, merging concepts from graph neural networks and language modeling. Building on the inherent connection between attention and graph theory, we reformulate the Transformer's attention mechanism as a graph operation and propose Graph-Aware Isomorphic Attention. This method leverages advanced graph modeling strategies, including Graph Isomorphism Networks (GIN) and Principal Neighborhood Aggregation (PNA), to enrich the representation of relational structures. Our approach captures complex dependencies and generalizes across tasks, as evidenced by a reduced generalization gap and improved learning performance. Additionally, we expand the concept of graph-aware attention to introduce Sparse GIN-Attention, a fine-tuning approach that employs sparse GINs. By interpreting attention matrices as sparse adjacency graphs, this technique enhances the adaptability of pre-trained foundational models with minimal computational overhead, endowing them with graph-aware capabilities. Sparse GIN-Attention fine-tuning achieves improved training dynamics and better generalization compared to alternative methods like low-rank adaptation (LoRA). We discuss latent graph-like structures within traditional attention mechanisms, offering a new lens through which Transformers can be understood. By evolving Transformers into hierarchical GIN models for relational reasoning, this perspective suggests profound implications for foundational model development, enabling the design of architectures that dynamically adapt to both local and global dependencies. Applications in bioinformatics, materials science, language modeling, and beyond could benefit from this synthesis of relational and sequential data modeling, setting the stage for interpretable and generalizable modeling strategies.
https://arxiv.org/abs/2501.02393
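The core mechanism, interpreting an attention matrix as a sparse adjacency graph and applying a GIN-style update, can be sketched minimally. The threshold, the scalar node features, and the omission of the MLP are simplifying assumptions for illustration, not the paper's code:

```python
# Minimal sketch: binarize an attention matrix into a sparse adjacency, then
# run one GIN-style aggregation, h_i' = MLP((1 + eps) * h_i + sum_{j in N(i)} h_j).
# The MLP is replaced by the identity here to keep the example tiny.

def attention_to_adjacency(attn, threshold=0.1):
    """Threshold an attention matrix into neighbor lists (self-loops excluded)."""
    n = len(attn)
    return [[j for j in range(n) if attn[i][j] >= threshold and j != i]
            for i in range(n)]

def gin_update(features, adjacency, eps=0.0):
    """One GIN aggregation step over scalar node features."""
    out = []
    for i, h in enumerate(features):
        agg = sum(features[j] for j in adjacency[i])
        out.append((1.0 + eps) * h + agg)   # MLP omitted for brevity
    return out

attn = [[0.7, 0.2, 0.05],
        [0.3, 0.6, 0.05],
        [0.4, 0.4, 0.2]]
adj = attention_to_adjacency(attn)          # -> [[1], [0], [0, 1]]
updated = gin_update([1.0, 2.0, 3.0], adj)  # -> [3.0, 3.0, 6.0]
```

Because the adjacency keeps only above-threshold attention entries, the graph stays sparse, which is what makes this style of graph-aware fine-tuning cheap relative to operating on the dense attention matrix.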