In this paper, we study the problem of jointly estimating optical flow and scene flow from synchronized 2D and 3D data. Previous methods either employ a complex pipeline that splits the joint task into independent stages, or fuse 2D and 3D information in an "early-fusion" or "late-fusion" manner. Such one-size-fits-all approaches suffer from a dilemma: they fail to fully utilize the characteristics of each modality or to maximize the inter-modality complementarity. To address this problem, we propose a novel end-to-end framework consisting of 2D and 3D branches with multiple bidirectional fusion connections between them at specific layers. Different from previous work, we apply a point-based 3D branch to extract LiDAR features, as it preserves the geometric structure of point clouds. To fuse dense image features and sparse point features, we propose a learnable operator named the bidirectional camera-LiDAR fusion module (Bi-CLFM). We instantiate two types of bidirectional fusion pipelines, one based on the pyramidal coarse-to-fine architecture (dubbed CamLiPWC) and the other based on recurrent all-pairs field transforms (dubbed CamLiRAFT). On FlyingThings3D, both CamLiPWC and CamLiRAFT surpass all existing methods and achieve up to a 47.9% reduction in 3D end-point error over the best published result. Our best-performing model, CamLiRAFT, achieves an error of 4.26% on the KITTI Scene Flow benchmark, ranking 1st among all submissions with far fewer parameters. In addition, our methods generalize well and can handle non-rigid motion. Code is available at this https URL.
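The bidirectional fusion described above can be pictured with a minimal sketch: image features are sampled at the projected locations of the LiDAR points (image-to-point), and point features are scattered back onto the image grid (point-to-image), each side then mixed by a small learned layer. The projection convention, layer shapes, and the simple nearest-pixel scatter below are assumptions made for illustration, not the actual Bi-CLFM design.

# A minimal sketch of bidirectional camera-LiDAR feature fusion in the spirit
# of Bi-CLFM, written from the abstract alone; shapes and operators are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiFusionSketch(nn.Module):
    def __init__(self, img_dim, pt_dim):
        super().__init__()
        self.img_mix = nn.Conv2d(img_dim + pt_dim, img_dim, kernel_size=1)  # fuse lifted point feats into image branch
        self.pt_mix = nn.Linear(pt_dim + img_dim, pt_dim)                   # fuse sampled image feats into point branch

    def forward(self, img_feat, pt_feat, xyz, K):
        # img_feat: (B, Ci, H, W) dense image features
        # pt_feat:  (B, N, Cp)    sparse point features
        # xyz:      (B, N, 3)     points in the camera frame (z > 0)
        # K:        (B, 3, 3)     camera intrinsics
        B, Ci, H, W = img_feat.shape
        uvw = torch.bmm(xyz, K.transpose(1, 2))                 # project points: (B, N, 3)
        uv = uvw[..., :2] / uvw[..., 2:3].clamp(min=1e-6)

        # image -> point: bilinearly sample image features at the projected pixels
        grid = torch.stack([uv[..., 0] / (W - 1) * 2 - 1,
                            uv[..., 1] / (H - 1) * 2 - 1], dim=-1)           # normalize to [-1, 1]
        sampled = F.grid_sample(img_feat, grid.unsqueeze(2), align_corners=True)  # (B, Ci, N, 1)
        sampled = sampled.squeeze(-1).transpose(1, 2)                             # (B, N, Ci)
        pt_out = self.pt_mix(torch.cat([pt_feat, sampled], dim=-1))

        # point -> image: scatter point features onto their nearest pixel
        lifted = img_feat.new_zeros(B, pt_feat.shape[-1], H, W)
        u = uv[..., 0].round().long().clamp(0, W - 1)
        v = uv[..., 1].round().long().clamp(0, H - 1)
        for b in range(B):  # simple, non-vectorized scatter for clarity
            lifted[b, :, v[b], u[b]] = pt_feat[b].transpose(0, 1)
        img_out = self.img_mix(torch.cat([img_feat, lifted], dim=1))
        return img_out, pt_out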
https://arxiv.org/abs/2303.12017
In this paper, we consider the problem of temporal action localization under low-shot (zero-shot and few-shot) scenarios, with the goal of detecting and classifying action instances from arbitrary categories within untrimmed videos, even categories not seen at training time. We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposals, followed by open-vocabulary classification. We make the following contributions. First, to complement image-text foundation models with temporal motion cues, we improve class-agnostic action proposals by explicitly aligning embeddings of optical flow, RGB, and text, which has largely been ignored in existing low-shot methods. Second, to improve open-vocabulary action classification, we construct classifiers with strong discriminative power, i.e., ones that avoid lexical ambiguities. Specifically, we propose to prompt the pre-trained CLIP text encoder either with detailed action descriptions (acquired from large-scale language models) or with visually-conditioned, instance-specific prompt vectors. Third, we conduct thorough experiments and ablation studies on THUMOS14 and ActivityNet1.3, demonstrating the superior performance of our proposed model, which outperforms existing state-of-the-art approaches by a significant margin.
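For readers unfamiliar with the classification step, the sketch below shows the general pattern of prompting a frozen CLIP text encoder and classifying proposal features by cosine similarity; the encode_text callable, the fallback prompt template, and the temperature are placeholders rather than the paper's actual prompts or code.

# A hedged sketch of open-vocabulary classification from prompted text embeddings.
import torch
import torch.nn.functional as F

def build_classifier(class_names, descriptions, encode_text):
    # descriptions: dict mapping class name -> list of detailed description sentences
    # encode_text:  callable, list[str] -> (num_prompts, D) text embeddings (e.g. frozen CLIP)
    weights = []
    for name in class_names:
        prompts = descriptions.get(name, [f"a video of a person {name}."])  # hypothetical fallback template
        emb = F.normalize(encode_text(prompts), dim=-1)        # (P, D)
        weights.append(F.normalize(emb.mean(dim=0), dim=-1))   # average the prompt ensemble
    return torch.stack(weights, dim=0)                         # (num_classes, D)

def classify_proposals(proposal_feats, classifier, temperature=0.01):
    # proposal_feats: (M, D) visual embeddings of class-agnostic action proposals
    sims = F.normalize(proposal_feats, dim=-1) @ classifier.t()  # cosine similarity
    return (sims / temperature).softmax(dim=-1)                  # (M, num_classes) class probabilities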
https://arxiv.org/abs/2303.11732
Event cameras have the ability to record continuous and detailed trajectories of objects with high temporal resolution, thereby providing intuitive motion cues for optical flow estimation. Nevertheless, most existing learning-based approaches for event optical flow estimation directly remould the paradigm of conventional images by representing the consecutive event stream as static frames, ignoring the inherent temporal continuity of event data. In this paper, we argue that temporal continuity is a vital element of event-based optical flow and propose a novel Temporal Motion Aggregation (TMA) approach to unlock its potential. Technically, TMA comprises three components: an event splitting strategy to incorporate the intermediate motion information underlying the temporal context, a linear lookup strategy to align temporally continuous motion features, and a novel motion pattern aggregation module that emphasizes consistent patterns for motion feature enhancement. By incorporating temporally continuous motion information, TMA derives better flow estimates than existing methods at early stages, which not only enables TMA to obtain more accurate final predictions but also greatly reduces the number of refinements required. Extensive experiments on the DSEC-Flow and MVSEC datasets verify the effectiveness and superiority of TMA. Remarkably, compared to E-RAFT, TMA achieves a 6% improvement in accuracy and a 40% reduction in inference time on DSEC-Flow.
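To make the event-splitting idea concrete, a rough sketch is given below: the event stream between two timestamps is cut into consecutive groups and each group is voxelized separately, so the network sees intermediate motion rather than one collapsed frame. The group count, bin count, and voxel format are assumptions of this sketch, not TMA's exact representation.

# A rough illustration of splitting an event stream into temporal groups.
import numpy as np

def split_and_voxelize(events, num_groups, num_bins, height, width):
    # events: (N, 4) array of (t, x, y, polarity), sorted by t, with x < width and y < height
    t = events[:, 0]
    edges = np.linspace(t[0], t[-1], num_groups + 1)
    groups = []
    for i in range(num_groups):
        hi = edges[i + 1] if i < num_groups - 1 else np.inf   # keep the last event in the last group
        sel = events[(t >= edges[i]) & (t < hi)]
        vox = np.zeros((num_bins, height, width), dtype=np.float32)
        if len(sel) > 0:
            # normalize this group's timestamps into [0, num_bins - 1]
            tn = (sel[:, 0] - sel[0, 0]) / max(sel[-1, 0] - sel[0, 0], 1e-9) * (num_bins - 1)
            b = np.clip(tn.round().astype(int), 0, num_bins - 1)
            x = sel[:, 1].astype(int)
            y = sel[:, 2].astype(int)
            p = np.where(sel[:, 3] > 0, 1.0, -1.0)
            np.add.at(vox, (b, y, x), p)   # accumulate signed event counts per bin
        groups.append(vox)
    return groups                          # list of (num_bins, H, W) voxel grids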
https://arxiv.org/abs/2303.11629
We study the problem of estimating optical flow from event cameras. One important issue is how to build a high-quality event-flow dataset with accurate event values and flow labels. Previous datasets are created either by capturing real scenes with event cameras or by synthesizing from images with pasted foreground objects. The former case can produce real event values but only calculated flow labels, which are sparse and inaccurate. The latter case can generate dense flow labels, but the interpolated events are prone to errors. In this work, we propose to render a physically correct event-flow dataset using computer graphics models. In particular, we first create indoor and outdoor 3D scenes in Blender with rich variations in scene content. Second, diverse camera motions are included for the virtual capturing, producing images and accurate flow labels. Third, we render high-framerate videos between images for accurate events. The rendered dataset can adjust the density of events, based on which we further introduce an adaptive density module (ADM). Experiments show that our proposed dataset facilitates event-flow learning: previous approaches, when trained on our dataset, consistently improve their performance by a relatively large margin. In addition, event-flow pipelines equipped with our ADM can further improve performance.
https://arxiv.org/abs/2303.11011
Both static and moving objects usually exist in real-life videos. Most video object segmentation methods focus only on extracting and exploiting motion cues to perceive moving objects. When faced with frames containing static objects, moving-object predictors may fail because of uncertain motion information, such as low-quality optical flow maps. Besides, many sources such as RGB, depth, optical flow, and static saliency can provide useful information about the objects. However, existing approaches utilize only RGB, or RGB together with optical flow. In this paper, we propose a novel adaptive multi-source predictor for zero-shot video object segmentation. In the static object predictor, the RGB source is converted to depth and static saliency sources simultaneously. In the moving object predictor, we propose a multi-source fusion structure. First, the spatial importance of each source is highlighted with the help of the interoceptive spatial attention module (ISAM). Second, the motion-enhanced module (MEM) is designed to generate pure foreground motion attention, improving both the static and moving features used in the decoder. Furthermore, we design a feature purification module (FPM) to filter out inter-source incompatible features. Through the ISAM, MEM, and FPM, the multi-source features are effectively fused. In addition, we put forward an adaptive predictor fusion network (APF) to evaluate the quality of the optical flow and fuse the predictions from the static object predictor and the moving object predictor, preventing over-reliance on failed results caused by low-quality optical flow maps. Experiments show that the proposed model outperforms state-of-the-art methods on three challenging ZVOS benchmarks. Moreover, the static object predictor precisely predicts a high-quality depth map and static saliency map at the same time.
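The adaptive fusion step can be pictured with the simplified sketch below, where a small head scores how reliable the optical flow looks and that score blends the static-object and moving-object predictions; the quality head and blending rule are assumptions for illustration, not the actual APF.

# A simplified sketch of flow-quality-driven prediction fusion.
import torch
import torch.nn as nn

class AdaptiveFusionSketch(nn.Module):
    def __init__(self, flow_feat_dim=64):
        super().__init__()
        self.quality_head = nn.Sequential(
            nn.Conv2d(2, flow_feat_dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(flow_feat_dim, 1), nn.Sigmoid(),   # scalar flow-quality score in (0, 1)
        )

    def forward(self, flow, static_pred, moving_pred):
        # flow: (B, 2, H, W); static_pred / moving_pred: (B, 1, H, W) saliency predictions
        q = self.quality_head(flow).view(-1, 1, 1, 1)    # trust motion only when flow looks reliable
        return q * moving_pred + (1 - q) * static_pred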
https://arxiv.org/abs/2303.10383
We present a method for estimating the shutter angle, a.k.a. exposure fraction -- the ratio of the exposure time to the reciprocal of the frame rate -- of video clips containing motion. The approach exploits the relation between the exposure fraction, optical flow, and linear motion blur. Robustness is achieved by selecting image patches where both the optical flow and blur estimates are reliable and checking their consistency. The method was evaluated on the publicly available Beam-Splitter Dataset, with exposure fractions ranging from 0.015 to 0.36. The best mean absolute error achieved was 0.039. We also successfully test the suitability of the method for a forensic application: detecting video tampering by frame removal or insertion.
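As a toy illustration of the relation being exploited: over one frame interval the blur streak length is roughly the exposure fraction times the optical-flow displacement, so the fraction can be recovered from per-patch ratios. The reliability threshold and the median aggregation below are assumptions of this sketch, not the paper's exact selection rule.

# Toy exposure-fraction estimate from per-patch flow magnitude and blur extent.
import numpy as np

def estimate_exposure_fraction(flow_mag, blur_len, min_flow=2.0):
    # flow_mag: (P,) per-patch optical-flow magnitudes in pixels
    # blur_len: (P,) per-patch motion-blur extents in pixels
    reliable = flow_mag > min_flow                 # avoid dividing by tiny, unreliable flows
    ratios = blur_len[reliable] / flow_mag[reliable]
    return float(np.median(np.clip(ratios, 0.0, 1.0)))

# e.g. a 20 px flow with a 4 px blur streak suggests an exposure fraction near 0.2
print(estimate_exposure_fraction(np.array([20.0, 18.0, 1.0]), np.array([4.0, 3.5, 0.3])))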
https://arxiv.org/abs/2303.10247
We introduce VideoFlow, a novel optical flow estimation framework for videos. In contrast to previous methods that learn to estimate optical flow from two frames, VideoFlow concurrently estimates bi-directional optical flows for multiple frames that are available in videos by sufficiently exploiting temporal cues. We first propose a TRi-frame Optical Flow (TROF) module that estimates bi-directional optical flows for the center frame in a three-frame manner. The information of the frame triplet is iteratively fused onto the center frame. To extend TROF for handling more frames, we further propose a MOtion Propagation (MOP) module that bridges multiple TROFs and propagates motion features between adjacent TROFs. With the iterative flow estimation refinement, the information fused in individual TROFs can be propagated into the whole sequence via MOP. By effectively exploiting video information, VideoFlow presents extraordinary performance, ranking 1st on all public benchmarks. On the Sintel benchmark, VideoFlow achieves 1.649 and 0.991 average end-point-error (AEPE) on the final and clean passes, a 15.1% and 7.6% error reduction from the best published results (1.943 and 1.073 from FlowFormer++). On the KITTI-2015 benchmark, VideoFlow achieves an F1-all error of 3.65%, a 19.2% error reduction from the best published result (4.52% from FlowFormer++).
https://arxiv.org/abs/2303.08340
Temporal modeling is crucial for multi-frame human pose estimation. Most existing methods directly employ optical flow or deformable convolution to predict full-spectrum motion fields, which may introduce numerous irrelevant cues, such as a nearby person or the background. Without further effort to excavate meaningful motion priors, their results are suboptimal, especially in complicated spatiotemporal interactions. On the other hand, the temporal difference has the ability to encode representative motion information that can potentially be valuable for pose estimation but has not been fully exploited. In this paper, we present a novel multi-frame human pose estimation framework that employs temporal differences across frames to model dynamic contexts and engages a mutual information objective to facilitate the disentanglement of useful motion information. To be specific, we design a multi-stage Temporal Difference Encoder that performs incremental cascaded learning conditioned on multi-stage feature difference sequences to derive an informative motion representation. We further propose a Representation Disentanglement module from the mutual information perspective, which can grasp discriminative task-relevant motion signals by explicitly defining the useful and noisy constituents of the raw motion features and minimizing their mutual information. These contributions place us No. 1 in the Crowd Pose Estimation in Complex Events Challenge on the HiEve benchmark dataset and achieve state-of-the-art performance on three benchmarks: PoseTrack2017, PoseTrack2018, and PoseTrack21.
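A minimal sketch of the multi-stage temporal-difference idea is given below: per-stage feature differences between the key frame and a supporting frame are encoded, and the result is cascaded from the finest stage to the coarsest. The layer shapes and the exact cascade are assumptions made for this illustration only.

# A minimal multi-stage temporal-difference encoder sketch.
import torch
import torch.nn as nn

class TemporalDifferenceEncoderSketch(nn.Module):
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Conv2d(c * 2, c, 3, padding=1) for c in channels   # encode (difference, carried context)
        ])
        self.down = nn.ModuleList([
            nn.Conv2d(channels[i], channels[i + 1], 3, stride=2, padding=1)
            for i in range(len(channels) - 1)
        ])

    def forward(self, key_feats, sup_feats):
        # key_feats / sup_feats: lists of per-stage features for the key and supporting frame,
        # with channels (64, 128, 256) and each stage at half the previous resolution
        motion = torch.zeros_like(key_feats[0])
        outputs = []
        for i, (k, s) in enumerate(zip(key_feats, sup_feats)):
            diff = k - s                                           # raw temporal difference at this stage
            motion = torch.relu(self.stages[i](torch.cat([diff, motion], dim=1)))
            outputs.append(motion)
            if i < len(self.down):
                motion = self.down[i](motion)                      # carry context to the next, coarser stage
        return outputs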
https://arxiv.org/abs/2303.08475
Optical flow estimation is a challenging problem that remains unsolved. Recent deep learning based optical flow models have achieved considerable success. However, these models often train networks from scratch on standard optical flow data, which restricts their ability to robustly and geometrically match image features. In this paper, we propose a rethinking of previous optical flow estimation. In particular, we leverage Geometric Image Matching (GIM) as a pre-training task for optical flow estimation (MatchFlow) to obtain better feature representations, since GIM shares common challenges with optical flow estimation and comes with massive labeled real-world data. Matching static scenes thus helps to learn more fundamental feature correlations of objects and scenes with consistent displacements. Specifically, the proposed MatchFlow model employs a QuadTree attention-based network pre-trained on MegaDepth to extract coarse features for further flow regression. Extensive experiments show that our model has strong cross-dataset generalization. Our method achieves 11.5% and 10.1% error reductions relative to GMA on the Sintel clean pass and the KITTI test set. At the time of anonymous submission, our MatchFlow(G) enjoys state-of-the-art performance on the Sintel clean and final passes compared to published approaches with comparable computation and memory footprints. Code and models will be released at this https URL.
https://arxiv.org/abs/2303.08384
Analyzing the dynamic changes of cellular morphology is important for understanding the various functions and characteristics of live cells, including stem cells and metastatic cancer cells. To this end, we need to track all points on the highly deformable cellular contour in every frame of live cell video. Local shapes and textures on the contour are not evident, and their motions are complex, often with expansion and contraction of local contour features. Prior art in optical flow and deep point-set tracking is unsuited due to the fluidity of cells, and previous deep contour tracking does not consider point correspondence. We propose the first deep learning-based tracking of cellular (or, more generally, viscoelastic material) contours with point correspondence, by fusing dense representations between the two contours with cross attention. Since it is impractical to manually label dense tracking points on the contour, we propose unsupervised learning based on mechanical and cyclical consistency losses to train our contour tracker. The mechanical loss, which forces points to move perpendicular to the contour, proves particularly effective. For quantitative evaluation, we labeled sparse tracking points along the contours of live cells from two live-cell datasets taken with phase-contrast and confocal fluorescence microscopes. Our contour tracker quantitatively outperforms the compared methods and produces qualitatively more favorable results. Our code and data are publicly available at this https URL
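A hedged sketch of the mechanical constraint described above: the predicted offset of each contour point is penalized for its component along the local contour tangent, so that points are encouraged to move perpendicular to the contour. The central-difference tangent on a closed contour is an assumption of this sketch, not necessarily the paper's exact formulation.

# Penalize tangential motion of contour points (sketch of a "mechanical" loss).
import torch

def mechanical_loss(points, offsets):
    # points:  (N, 2) ordered points on a closed contour
    # offsets: (N, 2) predicted per-point displacements to the next frame
    tangent = torch.roll(points, -1, dims=0) - torch.roll(points, 1, dims=0)   # central difference
    tangent = tangent / tangent.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    tangential = (offsets * tangent).sum(dim=-1)        # component of motion along the contour
    return (tangential ** 2).mean()                     # zero when motion is purely normal to the contour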
https://arxiv.org/abs/2303.08364
Optical identification is often done with spatial or temporal visual pattern recognition and localization. Temporal pattern recognition, depending on the technology, involves a trade-off between communication frequency, range, and accurate tracking. We propose a solution with light-emitting beacons that improves this trade-off by exploiting fast event-based cameras and, for tracking, sparse neuromorphic optical flow computed with spiking neurons. In an asset-monitoring use case, we demonstrate that the system, embedded in a simulated drone, is robust to relative movements and enables simultaneous communication with, and tracking of, multiple moving beacons. Finally, in a hardware lab prototype, we achieve state-of-the-art optical camera communication frequencies on the order of kHz.
https://arxiv.org/abs/2303.07169
Despite significant efforts, cutting-edge video segmentation methods still remain sensitive to occlusion and rapid movement, due to their reliance on the appearance of objects in the form of object embeddings, which are vulnerable to these disturbances. A common solution is to use optical flow to provide motion information, but essentially it only considers pixel-level motion, which still relies on appearance similarity and hence is often inaccurate under occlusion and fast movement. In this work, we study instance-level motion and present InstMove, which stands for Instance Motion for Object-centric Video Segmentation. In comparison to pixel-wise motion, InstMove mainly relies on instance-level motion information that is free from image feature embeddings and has physical interpretations, making it more accurate and robust toward occlusion and fast-moving objects. To better fit the video segmentation tasks, InstMove uses instance masks to model the physical presence of an object and learns the dynamic model through a memory network to predict its position and shape in the next frame. With only a few lines of code, InstMove can be integrated into current SOTA methods for three different video segmentation tasks and boost their performance. Specifically, we improve upon the previous arts by 1.5 AP on the OVIS dataset, which features heavy occlusions, and by 4.9 AP on the YouTubeVIS-Long dataset, which mainly contains fast-moving objects. These results suggest that instance-level motion is robust and accurate, and hence serves as a powerful solution in complex scenarios for object-centric video segmentation.
https://arxiv.org/abs/2303.08132
Event cameras provide high temporal precision, low data rates, and high dynamic range visual perception, which are well-suited for optical flow estimation. While data-driven optical flow estimation has obtained great success in RGB cameras, its generalization performance is seriously hindered in event cameras mainly due to the limited and biased training data. In this paper, we present a novel simulator, BlinkSim, for the fast generation of large-scale data for event-based optical flow. BlinkSim consists of a configurable rendering engine and a flexible engine for event data simulation. By leveraging the wealth of current 3D assets, the rendering engine enables us to automatically build up thousands of scenes with different objects, textures, and motion patterns and render very high-frequency images for realistic event data simulation. Based on BlinkSim, we construct a large training dataset and evaluation benchmark BlinkFlow that contains sufficient, diversiform, and challenging event data with optical flow ground truth. Experiments show that BlinkFlow improves the generalization performance of state-of-the-art methods by more than 40% on average and up to 90%. Moreover, we further propose an Event optical Flow transFormer (E-FlowFormer) architecture. Powered by our BlinkFlow, E-FlowFormer outperforms the SOTA methods by up to 91% on MVSEC dataset and 14% on DSEC dataset and presents the best generalization performance.
https://arxiv.org/abs/2303.07716
Local feature matching aims at establishing sparse correspondences between a pair of images. Recently, detector-free methods have presented generally better performance but are not satisfactory for image pairs with large scale differences. In this paper, we propose Patch Area Transportation with Subdivision (PATS) to tackle this issue. Instead of building an expensive image pyramid, we start by splitting the original image pair into equal-sized patches and gradually resizing and subdividing them into smaller patches with the same scale. However, estimating scale differences between these patches is non-trivial, since the scale differences are determined by both relative camera poses and scene structures and thus vary spatially over image pairs. Moreover, it is hard to obtain ground truth for real scenes. To this end, we propose patch area transportation, which enables learning scale differences in a self-supervised manner. In contrast to bipartite graph matching, which only handles one-to-one matching, our patch area transportation can deal with many-to-many relationships. PATS improves both matching accuracy and coverage, and shows superior performance in downstream tasks such as relative pose estimation, visual localization, and optical flow estimation. The source code will be released to benefit the community.
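The many-to-many transportation view can be illustrated with a textbook entropic optimal-transport sketch solved by Sinkhorn iterations, where patch areas act as the marginals and a feature-similarity cost couples source and target patches. This is a generic sketch under those assumptions, not the paper's actual formulation.

# Entropic optimal transport between patch areas via Sinkhorn iterations.
import torch

def sinkhorn_transport(cost, src_area, tgt_area, eps=0.05, iters=50):
    # cost:     (M, N) matching cost between source and target patches
    # src_area: (M,) and tgt_area: (N,) nonnegative patch areas acting as marginals
    src_area = src_area / src_area.sum()
    tgt_area = tgt_area / tgt_area.sum()
    K = torch.exp(-cost / eps)                   # Gibbs kernel
    u = torch.ones_like(src_area)
    v = torch.ones_like(tgt_area)
    for _ in range(iters):                       # alternate scaling to match both marginals
        u = src_area / (K @ v).clamp(min=1e-9)
        v = tgt_area / (K.t() @ u).clamp(min=1e-9)
    return u.unsqueeze(1) * K * v.unsqueeze(0)   # (M, N) transport plan; rows sum to src_area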
https://arxiv.org/abs/2303.07700
Optical flow has achieved great success in clean scenes but suffers from restricted performance in foggy scenes. To bridge the clean-to-foggy domain gap, existing methods typically adopt domain adaptation to transfer motion knowledge from the clean domain to the synthetic foggy domain. However, these methods unexpectedly neglect the synthetic-to-real domain gap and are thus erroneous when applied to real-world scenes. To handle practical optical flow under real foggy scenes, in this work we propose a novel unsupervised cumulative domain adaptation optical flow (UCDA-Flow) framework comprising depth-association motion adaptation and correlation-alignment motion adaptation. Specifically, we discover that depth is a key ingredient influencing optical flow: the deeper the depth, the inferior the optical flow, which motivates us to design a depth-association motion adaptation module to bridge the clean-to-foggy domain gap. Moreover, we observe that the cost volume correlation shares a similar distribution between synthetic and real foggy images, which enlightens us to devise a correlation-alignment motion adaptation module that distills motion knowledge from the synthetic foggy domain to the real foggy domain. Note that synthetic fog is designed as the intermediate domain. Under this unified framework, the proposed cumulative adaptation progressively transfers knowledge from clean scenes to real foggy scenes. Extensive experiments have been performed to verify the superiority of the proposed method.
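For context, the cost volume correlation referred to above is, in many recent flow networks, an all-pairs inner product between two feature maps; a generic sketch is given below. Whether UCDA-Flow uses exactly this RAFT-style construction is an assumption of the illustration, not a claim about the paper.

# A generic all-pairs correlation (cost) volume between two feature maps.
import torch

def correlation_volume(feat1, feat2):
    # feat1, feat2: (B, C, H, W) features of the two frames
    B, C, H, W = feat1.shape
    f1 = feat1.view(B, C, H * W)
    f2 = feat2.view(B, C, H * W)
    corr = torch.einsum('bcm,bcn->bmn', f1, f2) / C ** 0.5   # (B, H*W, H*W) similarity scores
    return corr.view(B, H * W, H, W)                         # per source pixel, a full-frame correlation map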
https://arxiv.org/abs/2303.07564
Deep high dynamic range (HDR) imaging, framed as an image translation problem, has achieved great performance without explicit optical flow alignment. However, challenges remain over content-association ambiguities, especially those caused by saturation and large-scale movements. To address the ghosting issue and enhance the details in saturated regions, we propose a scale-aware two-stage high dynamic range imaging framework (STHDR) to generate high-quality, ghost-free HDR images. The scale-aware technique and two-stage fusion strategy progressively and effectively improve HDR composition performance. Specifically, our framework consists of feature alignment and two-stage fusion. In feature alignment, we propose a spatial correct module (SCM) to better exploit useful information among non-aligned features and avoid ghosting and saturation. In the first stage of feature fusion, we obtain a preliminary fusion result with little ghosting. In the second stage, we fuse the results of the first stage with the aligned features to further reduce residual artifacts and thus improve the overall quality. Extensive experimental results on the typical test dataset validate the effectiveness of the proposed STHDR in terms of speed and quality.
https://arxiv.org/abs/2303.06575
Group Activity Recognition (GAR) aims to detect the activity performed by multiple actors in a scene. Prior works model the spatio-temporal features based on RGB, optical flow, or keypoint data types. However, using both temporality and these data types together increases the computational complexity significantly. Our hypothesis is that by using only the RGB data, without temporality, the performance can be maintained with a negligible loss in accuracy. To that end, we propose a novel GAR technique for volleyball videos, DECOMPL, which consists of two complementary branches. In the visual branch, it extracts the features using attention pooling in a selective way. In the coordinate branch, it considers the current configuration of the actors and extracts spatial information from the box coordinates. Moreover, we analyzed the Volleyball dataset that the recent literature is mostly based on and realized that its labeling scheme degrades the group concept in the activities to the level of individual actors. We manually re-annotated the dataset in a systematic manner to emphasize the group concept. Experimental results on the Volleyball as well as the Collective Activity dataset (from another domain, i.e., not volleyball) demonstrated the effectiveness of the proposed model DECOMPL, which delivered the best/second-best GAR performance with the re-annotations/original annotations among comparable state-of-the-art techniques. Our code, results, and new annotations will be made available through GitHub after the revision process.
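The attention-pooling step in the visual branch can be sketched generically as a learned, softmax-weighted average over per-actor features; the single-linear scoring head below is an assumption used only to illustrate the mechanism, not DECOMPL's exact design.

# A generic attention-pooling sketch over per-actor features.
import torch
import torch.nn as nn

class ActorAttentionPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, actor_feats):
        # actor_feats: (B, num_actors, D) features pooled from each actor's bounding box
        weights = self.score(actor_feats).softmax(dim=1)   # (B, num_actors, 1) attention weights
        return (weights * actor_feats).sum(dim=1)          # (B, D) group-level representation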
https://arxiv.org/abs/2303.06439
Unsupervised optical flow estimation is especially hard near occlusions and motion boundaries and in low-texture regions. We show that additional information such as semantics and domain knowledge can help better constrain this problem. We introduce SemARFlow, an unsupervised optical flow network designed for autonomous driving data that takes estimated semantic segmentation masks as additional inputs. This additional information is injected into the encoder and into a learned upsampler that refines the flow output. In addition, a simple yet effective semantic augmentation module provides self-supervision when learning flow and its boundaries for vehicles, poles, and sky. Together, these injections of semantic information improve the KITTI-2015 optical flow test error rate from 11.80% to 8.38%. We also show visible improvements around object boundaries as well as a greater ability to generalize across datasets. Code will be made available.
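One simple way the estimated semantic masks could enter the network is sketched below: one-hot class maps are concatenated with the input image before the encoder. The class count and the concatenation point are assumptions of this sketch, not necessarily how SemARFlow injects semantics.

# Sketch of feeding semantic segmentation masks alongside the RGB input.
import torch
import torch.nn as nn

class SemanticsAwareEncoderSketch(nn.Module):
    def __init__(self, num_classes=19, out_dim=64):
        super().__init__()
        self.num_classes = num_classes
        self.stem = nn.Sequential(
            nn.Conv2d(3 + num_classes, 32, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(32, out_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, image, seg_labels):
        # image: (B, 3, H, W); seg_labels: (B, H, W) integer class ids from a segmentation model
        onehot = torch.nn.functional.one_hot(seg_labels.long(), self.num_classes)
        onehot = onehot.permute(0, 3, 1, 2).float()        # (B, num_classes, H, W)
        return self.stem(torch.cat([image, onehot], dim=1))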
https://arxiv.org/abs/2303.06209
This article proposes a deep neural network, namely CrackPropNet, to measure crack propagation on asphalt concrete (AC) specimens. It offers an accurate, flexible, efficient, and low-cost solution for crack propagation measurement using images collected during cracking tests. CrackPropNet differs significantly from traditional deep learning networks, as it involves learning to locate displacement field discontinuities by matching features at various locations in the reference and deformed images. An image library representing the diversified cracking behavior of AC was developed for supervised training. CrackPropNet achieved an optimal dataset-scale F-1 of 0.755 and an optimal image-scale F-1 of 0.781 on the testing dataset at a running speed of 26 frames per second. Experiments demonstrated that low- to medium-level Gaussian noise had a limited impact on the measurement accuracy of CrackPropNet. Moreover, the model showed promising generalization on fundamentally different images. As a crack measurement technique, CrackPropNet can detect complex crack patterns accurately and efficiently in AC cracking tests. It can be applied to characterize the cracking phenomenon, evaluate AC cracking potential, validate test protocols, and verify theoretical models.
https://arxiv.org/abs/2303.05957
Event cameras have recently gained significant traction since they open up new avenues for low-latency and low-power solutions to complex computer vision problems. To unlock these solutions, it is necessary to develop algorithms that can leverage the unique nature of event data. However, the current state-of-the-art is still highly influenced by the frame-based literature, and usually fails to deliver on these promises. In this work, we take this into consideration and propose a novel self-supervised learning pipeline for the sequential estimation of event-based optical flow that allows for the scaling of the models to high inference frequencies. At its core, we have a continuously-running stateful neural model that is trained using a novel formulation of contrast maximization that makes it robust to nonlinearities and varying statistics in the input events. Results across multiple datasets confirm the effectiveness of our method, which establishes a new state of the art in terms of accuracy for approaches trained or optimized without ground truth.
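For readers unfamiliar with contrast maximization, a compact sketch of the classic objective the paper builds on is given below: events are warped to a reference time using the predicted per-pixel flow, accumulated into an image of warped events, and the variance (contrast) of that image is maximized. The paper's actual formulation adds robustness to nonlinearities and varying event statistics, which this sketch does not include; the nearest-pixel accumulation is also a simplification, since differentiable implementations splat each event bilinearly.

# Classic contrast-maximization objective for event-based flow (simplified).
import torch

def contrast_maximization_loss(events, flow, t_ref, height, width):
    # events: (N, 4) tensor of (x, y, t, polarity); flow: (H, W, 2) in pixels per unit time
    x, y, t, p = events[:, 0], events[:, 1], events[:, 2], events[:, 3]
    f = flow[y.long(), x.long()]                                   # per-event flow lookup
    xw = (x + (t_ref - t) * f[:, 0]).round().long().clamp(0, width - 1)
    yw = (y + (t_ref - t) * f[:, 1]).round().long().clamp(0, height - 1)
    iwe = torch.zeros(height, width, device=events.device)
    iwe.index_put_((yw, xw), p.abs(), accumulate=True)             # image of warped events
    return -iwe.var()                                              # maximize contrast = minimize negative variance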
https://arxiv.org/abs/2303.05214