Masked Autoencoders (MAEs) learn self-supervised representations by randomly masking input image patches and reconstructing them under a reconstruction loss. Alternatively, contrastive self-supervised methods encourage two versions of the same input to have similar representations while pulling apart the representations of different inputs. We propose ViC-MAE, a general method that combines MAE and contrastive learning by pooling the local feature representations learned under the MAE reconstruction objective and using this global representation under a contrastive objective across video frames. We show that visual representations learned under ViC-MAE generalize well to both video classification and image classification tasks. Using a backbone ViT-B/16 network pre-trained on the Moments in Time (MiT) dataset, we obtain state-of-the-art transfer learning from video to images on ImageNet-1k, improving absolute top-1 accuracy by 1.58% over a recent previous work. Moreover, our method maintains a competitive transfer-learning performance of 81.50% top-1 accuracy on the Kinetics-400 video classification benchmark. In addition, we show that despite its simplicity, ViC-MAE yields improved results compared to combining MAE pre-training with previously proposed contrastive objectives such as VICReg and SimSiam.
https://arxiv.org/abs/2303.12001
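For the ViC-MAE abstract above, a minimal PyTorch sketch of the stated idea: pool the per-patch features from an MAE encoder into a global vector per frame and apply a contrastive loss across frames of the same video. The projector shape, temperature, and the InfoNCE form are illustrative assumptions, and the accompanying MAE reconstruction term is omitted.

```python
import torch
import torch.nn.functional as F

def vicmae_style_contrastive_loss(patch_feats_a, patch_feats_b, projector, temperature=0.1):
    """Pool per-patch features from two frames of the same video into global
    vectors and apply an InfoNCE-style contrastive loss across the batch.

    patch_feats_a, patch_feats_b: (B, num_patches, dim) encoder outputs for two frames.
    projector: a small MLP mapping dim -> proj_dim (illustrative, not the paper's exact head).
    """
    # Global representation = mean pooling of local (patch) features.
    za = F.normalize(projector(patch_feats_a.mean(dim=1)), dim=-1)  # (B, proj_dim)
    zb = F.normalize(projector(patch_feats_b.mean(dim=1)), dim=-1)  # (B, proj_dim)

    logits = za @ zb.t() / temperature          # (B, B) frame-to-frame similarities
    targets = torch.arange(za.size(0), device=za.device)
    # Symmetric InfoNCE: frames from the same video are positives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage with dummy tensors standing in for MAE encoder outputs of two frames:
proj = torch.nn.Sequential(torch.nn.Linear(768, 256), torch.nn.GELU(), torch.nn.Linear(256, 128))
loss = vicmae_style_contrastive_loss(torch.randn(8, 196, 768), torch.randn(8, 196, 768), proj)
```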
Deep neural networks are likely to fail when the test data is corrupted in real-world deployment (e.g., blur, weather, etc.). Test-time optimization is an effective way to adapt models so that they generalize to corrupted data during testing, as has been shown in the image domain. However, few techniques exist for improving the corruption robustness of video classification. In this work, we propose a Temporal Coherent Test-time Optimization framework (TeCo) that exploits spatio-temporal information during test-time optimization for robust video classification. To exploit information in video with self-supervised learning, TeCo minimizes the entropy of predictions based on the global content of video clips, while also feeding local content to regularize temporal coherence at the feature level. TeCo retains the generalization ability of various video classification models and achieves significant improvements in corruption robustness on Mini Kinetics-C and Mini SSV2-C. Furthermore, TeCo sets a new baseline for video classification corruption robustness via test-time optimization.
https://arxiv.org/abs/2302.14309
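A hedged sketch of the test-time entropy-minimization step that TeCo builds on, assuming only a small parameter subset (e.g., normalization-layer affines) is handed to the optimizer; TeCo's temporal-coherence regularizer on local features is not reproduced here.

```python
import torch
import torch.nn.functional as F

def test_time_entropy_step(model, clip, optimizer):
    """One illustrative test-time optimization step: minimize the entropy of the
    model's prediction on an (unlabeled) corrupted test clip.

    clip: (B, C, T, H, W) test video clip.
    optimizer: typically built over a small parameter subset, e.g. norm-layer affines.
    """
    model.train()                       # keep normalization statistics adaptable
    logits = model(clip)                # (B, num_classes)
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()

    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return entropy.item()
```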
To address the problem of medical image recognition, computer vision techniques like convolutional neural networks (CNNs) are frequently used. Recently, 3D CNN-based models have come to dominate the field of magnetic resonance imaging (MRI) analytics. Given the high similarity between MRI data and videos, we conduct extensive empirical studies on video recognition techniques for MRI classification to answer the following questions: (1) can we directly use video recognition models for MRI classification, (2) which model is more appropriate for MRI, and (3) are common tricks from video recognition, such as data augmentation, still useful for MRI classification? Our work suggests that advanced video techniques benefit MRI classification. In this paper, four datasets for Alzheimer's and Parkinson's disease recognition are used in experiments, together with three alternative video recognition models and data augmentation techniques frequently applied to video tasks. In terms of efficiency, the results reveal that the video framework outperforms 3D-CNN models by 5%-11% with 50%-66% fewer trainable parameters. This report pushes forward the potential fusion of 3D medical imaging and video understanding research.
https://arxiv.org/abs/2302.12688
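Since the study hinges on treating an MRI volume like a video clip, here is a small illustrative conversion; the slice-sampling strategy and channel replication are assumptions, not the paper's exact preprocessing.

```python
import torch

def mri_volume_to_clip(volume, num_frames=16):
    """Treat a single-channel MRI volume (D, H, W) as a video clip (C, T, H, W)
    so it can be fed to an off-the-shelf video classification model.
    Slices along the depth axis play the role of frames; the channel is
    repeated to 3 to match RGB-pretrained backbones.
    """
    depth = volume.shape[0]
    # Uniformly sample `num_frames` slices along depth, like frame sampling in video.
    idx = torch.linspace(0, depth - 1, num_frames).long()
    clip = volume[idx]                           # (T, H, W)
    clip = clip.unsqueeze(0).repeat(3, 1, 1, 1)  # (3, T, H, W)
    return clip

clip = mri_volume_to_clip(torch.randn(128, 224, 224))   # -> (3, 16, 224, 224)
```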
Real-time video surveillance through CCTV camera systems has become essential for ensuring public safety, which is a priority today. Although CCTV cameras help a lot in increasing security, these systems require constant human interaction and monitoring. To address this issue, intelligent surveillance systems can be built using deep learning video classification techniques that help automate surveillance and detect violence as it happens. In this research, we explore deep learning video classification techniques to detect violence as it is happening. Traditional image classification techniques fall short when classifying videos because they attempt to classify each frame separately, causing the predictions to flicker. Therefore, many researchers have proposed video classification techniques that consider spatiotemporal features during classification. However, deploying these deep learning models with inputs such as skeleton points obtained through pose estimation or optical flow obtained through depth sensors is not always practical in an IoT environment. Although these techniques ensure a higher accuracy score, they are computationally heavier. Keeping these constraints in mind, we experimented with various video classification and action recognition techniques such as ConvLSTM, LRCN (with both custom CNN layers and VGG-16 as feature extractor), CNN-Transformer, and C3D. We achieved a test accuracy of 80% with ConvLSTM, 83.33% with CNN-BiLSTM, 70% with VGG16-BiLSTM, 76.76% with CNN-Transformer, and 80% with C3D.
https://arxiv.org/abs/2302.11027
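A minimal sketch of the LRCN-style setup mentioned above: per-frame CNN features followed by a recurrent layer and a classification head. The tiny stand-in CNN and hyperparameters are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LRCNClassifier(nn.Module):
    """Minimal LRCN-style violence classifier: a per-frame CNN feature extractor
    followed by a bidirectional LSTM over time and a linear head."""
    def __init__(self, feature_extractor, feat_dim, hidden=256, num_classes=2):
        super().__init__()
        self.backbone = feature_extractor              # maps (B*T, C, H, W) -> (B*T, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, video):                          # video: (B, T, C, H, W)
        b, t = video.shape[:2]
        feats = self.backbone(video.flatten(0, 1))     # (B*T, feat_dim)
        feats = feats.view(b, t, -1)
        out, _ = self.lstm(feats)                      # (B, T, 2*hidden)
        return self.head(out[:, -1])                   # classify from the last time step

# Example with a tiny stand-in CNN (VGG-16 features would be used in practice):
cnn = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
model = LRCNClassifier(cnn, feat_dim=16)
logits = model(torch.randn(2, 8, 3, 112, 112))         # (2, num_classes)
```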
Parameter-efficient transfer learning (PETL) is an emerging research direction aimed at inexpensively adapting large-scale pre-trained models to downstream tasks. Recent advances have achieved great success in saving storage costs for various vision tasks by updating or injecting a small number of parameters instead of full fine-tuning. However, we notice that most existing PETL methods still incur non-negligible latency during inference. In this paper, we propose a parameter-efficient and computationally friendly adapter for giant vision models, called RepAdapter. Specifically, we show that adaptation modules, even with a complex structure, can be seamlessly integrated into most giant vision models via structural re-parameterization. This property makes RepAdapter zero-cost during inference. In addition to computational efficiency, RepAdapter is more effective and lightweight than existing PETL methods due to its sparse structure and our careful deployment. To validate RepAdapter, we conduct extensive experiments on 27 benchmark datasets covering three vision tasks, i.e., image classification, video classification, and semantic segmentation. Experimental results show the superior performance and efficiency of RepAdapter compared with state-of-the-art PETL methods. For instance, by updating only 0.6% of the parameters, we improve the performance of ViT from 38.8 to 55.1 on Sun397. Its generalizability is also well validated across a range of vision models, i.e., ViT, CLIP, Swin Transformer, and ConvNeXt. Our source code is released at this https URL.
https://arxiv.org/abs/2302.08106
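The zero-inference-cost property comes from structural re-parameterization: a purely linear adapter can be folded back into the host linear layer after training. A hedged sketch of that folding for a parallel bottleneck adapter; RepAdapter's actual module placement and structure differ.

```python
import torch
import torch.nn as nn

def merge_linear_adapter(base: nn.Linear, down: nn.Linear, up: nn.Linear) -> nn.Linear:
    """Fold a purely linear bottleneck adapter (up(down(x)), applied in parallel
    with a frozen base linear layer) back into a single nn.Linear, so inference
    carries no extra modules. Illustrates the structural re-parameterization idea only.
    """
    merged = nn.Linear(base.in_features, base.out_features, bias=base.bias is not None)
    with torch.no_grad():
        delta_w = up.weight @ down.weight                       # (out, in) low-rank update
        merged.weight.copy_(base.weight + delta_w)
        bias = base.bias.clone() if base.bias is not None else 0
        # Adapter bias terms also fold into the merged bias.
        if down.bias is not None:
            bias = bias + up.weight @ down.bias
        if up.bias is not None:
            bias = bias + up.bias
        if merged.bias is not None:
            merged.bias.copy_(bias)
    return merged

# Sanity check: the merged layer reproduces base(x) + up(down(x)).
base, down, up = nn.Linear(768, 768), nn.Linear(768, 8), nn.Linear(8, 768)
x = torch.randn(4, 768)
merged = merge_linear_adapter(base, down, up)
assert torch.allclose(merged(x), base(x) + up(down(x)), atol=1e-4)
```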
We present Reversible Vision Transformers, a memory-efficient architecture design for visual recognition. By decoupling the GPU memory requirement from the depth of the model, Reversible Vision Transformers enable scaling up architectures with efficient memory usage. We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants and benchmark extensively across model sizes and across the tasks of image classification, object detection, and video classification. Reversible Vision Transformers achieve a reduced memory footprint of up to 15.5x at roughly identical model complexity, parameters, and accuracy, demonstrating their promise as an efficient backbone for hardware-resource-limited training regimes. Finally, we find that the additional computational burden of recomputing activations is more than overcome for deeper models, where throughput can increase by up to 2.3x over their non-reversible counterparts. Full code and trained models are available at this https URL. A simpler, easy-to-understand-and-modify version is also available at this https URL.
https://arxiv.org/abs/2302.04869
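A minimal sketch of the reversible two-stream residual block that makes activation recomputation possible; the real model uses attention/MLP sub-blocks and a custom backward pass, and F and G here are placeholders.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Two-stream reversible residual block: inputs can be exactly reconstructed
    from outputs, so activations need not be stored during training.
        y1 = x1 + F(x2),   y2 = x2 + G(y1)
    """
    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

dim = 64
block = ReversibleBlock(nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim)),
                        nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim)))
x1, x2 = torch.randn(2, 16, dim), torch.randn(2, 16, dim)
with torch.no_grad():
    y1, y2 = block(x1, x2)
    r1, r2 = block.inverse(y1, y2)
assert torch.allclose(r1, x1, atol=1e-5) and torch.allclose(r2, x2, atol=1e-5)
```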
To develop advanced self-driving systems, many researchers are focusing on alerting all possible traffic risk cases from closed-circuit television (CCTV) and dashboard-mounted cameras. Most of these methods focus on identifying frame by frame when an anomaly has occurred, but they do not identify which road traffic participant could lead the ego-vehicle into a collision, because the available annotated datasets only support detecting anomalies in traffic video. A near-miss is one type of accident and can be defined as a narrowly avoided accident. However, there is no difference between an accident and a near-miss at the moment before the accident happens, so our contribution is to redefine the accident definition and re-annotate the inconsistent accident labels on the DADA-2000 dataset together with near-misses. By extending the start and end time of the accident duration, our annotation can precisely cover all ego-motions during an incident and consistently classify all possible traffic risk accidents, including near-misses, to give more critical information for real-world driving assistance systems. The proposed method integrates two different components: conditional style translation (CST) and a separable 3-dimensional convolutional neural network (S3D). The CST architecture is derived from unsupervised image-to-image translation networks (UNIT) and is used to augment the re-annotated DADA-2000 dataset, increasing the number of traffic risk accident videos and generalizing the performance of the video classification model across different conditions, while S3D is used for video classification to validate the consistency of the dataset re-annotation. In evaluation, the proposed method achieved a significant improvement of 10.25% in accuracy over the baseline model in cross-validation analysis.
https://arxiv.org/abs/2301.02726
Learning with noisy labels (LNL) is a classic problem that has been extensively studied for image tasks, but much less for video in the literature. A straightforward migration from images to videos without considering the properties of videos, such as computational cost and redundant information, is not a sound choice. In this paper, we propose two new strategies for video analysis with noisy labels: 1) a lightweight channel selection method dubbed Channel Truncation for feature-based label noise detection, which selects the most discriminative channels to split clean and noisy instances in each category; 2) a novel contrastive strategy dubbed Noise Contrastive Learning, which constructs the relationship between clean and noisy instances to regularize model training. Experiments on three well-known benchmark datasets for video classification show that our proposed truNcatE-split-contrAsT (NEAT) significantly outperforms the existing baselines. By reducing the feature dimension to 10% of its original size, our method achieves over 0.4 noise detection F1-score and a 5% classification accuracy improvement on the Mini-Kinetics dataset under severe noise (symmetric-80%). Thanks to Noise Contrastive Learning, the average classification accuracy improvement on Mini-Kinetics and Sth-Sth-V1 is over 1.6%.
https://arxiv.org/abs/2212.13495
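A heavily hedged sketch of what feature-based noise detection with channel truncation could look like; the channel-ranking criterion (mean activation) and the clean/noisy split rule (median similarity) are guesses for illustration, not NEAT's actual formulation.

```python
import torch

def truncate_and_split(features, labels, num_classes, keep_ratio=0.1):
    """Per class, keep only the channels with the largest mean activation, then
    flag samples far from the (truncated) class centroid as likely noisy.

    features: (N, D) penultimate features, labels: (N,) possibly noisy labels.
    Returns a boolean mask, True = predicted clean.
    """
    n, d = features.shape
    k = max(1, int(d * keep_ratio))
    clean_mask = torch.zeros(n, dtype=torch.bool)
    for c in range(num_classes):
        idx = (labels == c).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        class_feats = features[idx]                              # (n_c, D)
        channels = class_feats.mean(dim=0).topk(k).indices       # most active channels
        truncated = class_feats[:, channels]                     # (n_c, k)
        centroid = truncated.mean(dim=0, keepdim=True)
        sims = torch.cosine_similarity(truncated, centroid, dim=1)
        # Simple split: samples above the median similarity are treated as clean.
        clean_mask[idx] = sims >= sims.median()
    return clean_mask

mask = truncate_and_split(torch.randn(100, 512), torch.randint(0, 5, (100,)), num_classes=5)
```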
This work explores an efficient approach to establishing a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning, and video question-answering. We present VideoCoCa, which reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules (for example, a cross-frame attention layer or perceiver resampler) and finetune the modified architecture on video-text data, we surprisingly find that the generative attentional pooling and contrastive attentional pooling layers in the image-text CoCa design are instantly adaptable to "flattened frame embeddings", yielding a strong zero-shot transfer baseline for many video-text tasks. Specifically, the frozen image encoder of a pretrained image-text CoCa takes each video frame as input and generates N token embeddings per frame for a total of T video frames. We flatten the N × T token embeddings into a long sequence of frozen video representations and apply CoCa's generative attentional pooling and contrastive attentional pooling on top. All model weights, including the pooling layers, are directly loaded from an image-text CoCa pretrained model. Without any video or video-text data, VideoCoCa's zero-shot transfer baseline already achieves state-of-the-art results in zero-shot video classification on Kinetics 400/600/700, UCF101, HMDB51, and Charades, as well as zero-shot text-to-video retrieval on MSR-VTT and ActivityNet Captions. We also explore lightweight finetuning on top of VideoCoCa, achieving strong results on video question-answering (iVQA, MSRVTT-QA, MSVD-QA) and video captioning (MSR-VTT, ActivityNet, YouCook2). Our approach establishes a simple and effective video-text baseline for future research.
https://arxiv.org/abs/2212.04979
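The key mechanism is flattening per-frame token embeddings into one long sequence and reusing attentional pooling on top. A sketch with a generic, freshly initialized pooler; VideoCoCa instead loads CoCa's pretrained pooling weights.

```python
import torch
import torch.nn as nn

class AttentionalPooler(nn.Module):
    """Generic attentional pooling: a set of learned queries cross-attends to a
    (frozen) token sequence. The module here stands in for CoCa's pretrained
    poolers purely for illustration.
    """
    def __init__(self, dim, num_queries=1, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):                      # tokens: (B, L, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)    # (B, num_queries, dim)
        return pooled

B, T, N, dim = 2, 8, 196, 768
frame_tokens = torch.randn(B, T, N, dim)            # frozen image-encoder outputs per frame
flat = frame_tokens.flatten(1, 2)                   # (B, T*N, dim) "flattened frame embeddings"
video_embed = AttentionalPooler(dim)(flat)          # (B, 1, dim) pooled video embedding
```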
The most popular methods and algorithms for AI are, for the vast majority, black boxes. Black boxes can be an acceptable solution to unimportant problems (in the sense of the degree of impact) but have a fatal flaw for the rest. Explanation tools for them have therefore been developed quickly, while the evaluation of their quality remains an open research question. In this technical report, we revisit the recently proposed post-hoc explainers FEM and MLFEM, which were designed to explain CNNs in image and video classification tasks, and propose their evaluation with reference-based and no-reference metrics. The reference-based metrics are the Pearson correlation coefficient and Similarity, computed between the explanation maps and the ground truth, which is represented by Gaze Fixation Density Maps obtained from a psycho-visual experiment. As a no-reference metric we use the "stability" metric proposed by Alvarez-Melis and Jaakkola. We study its behaviour and its consensus with the reference-based metrics, and show that for several kinds of degradations on input images this metric agrees with the reference-based ones. It can therefore be used to evaluate the quality of explainers when the ground truth is not available.
https://arxiv.org/abs/2212.01222
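A small sketch of the two reference-based metrics named above, computed between an explanation map and a Gaze Fixation Density Map. The SIM definition (histogram intersection after normalization) is the common saliency-benchmark one and is assumed here to match the paper's "Similarity".

```python
import torch

def pearson_cc(explanation, gaze_map):
    """Pearson correlation coefficient between an explanation map and a
    Gaze Fixation Density Map (both 2D tensors of the same shape)."""
    x = explanation.flatten().float()
    y = gaze_map.flatten().float()
    x = x - x.mean()
    y = y - y.mean()
    return (x * y).sum() / (x.norm() * y.norm() + 1e-8)

def similarity(explanation, gaze_map):
    """SIM (histogram intersection): sum of element-wise minima after both maps
    are normalized to sum to 1."""
    p = explanation.flatten().float().clamp(min=0)
    q = gaze_map.flatten().float().clamp(min=0)
    p = p / (p.sum() + 1e-8)
    q = q / (q.sum() + 1e-8)
    return torch.minimum(p, q).sum()

exp_map, gfdm = torch.rand(224, 224), torch.rand(224, 224)
print(pearson_cc(exp_map, gfdm).item(), similarity(exp_map, gfdm).item())
```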
Motion recognition is a promising direction in computer vision, but training video classification models is much harder than for images due to insufficient data and considerable parameters. To get around this, some works strive to explore multimodal cues from RGB-D data. Although they improve motion recognition to some extent, these methods still face sub-optimal situations in the following aspects: (i) data augmentation, i.e., the scale of the RGB-D datasets is still limited, and few efforts have been made to explore novel data augmentation strategies for videos; (ii) optimization mechanism, i.e., the tightly space-time-entangled network structure brings more challenges to spatiotemporal information modeling; and (iii) cross-modal knowledge fusion, i.e., the high similarity between multimodal representations leads to insufficient late fusion. To alleviate these drawbacks, we propose to improve RGB-D-based motion recognition from both the data and algorithm perspectives in this paper. In more detail, we first introduce a novel video data augmentation method dubbed ShuffleMix, which acts as a supplement to MixUp, to provide additional temporal regularization for motion recognition. Secondly, a Unified Multimodal De-coupling and multi-stage Re-coupling framework, termed UMDR, is proposed for video representation learning. Finally, a novel cross-modal Complement Feature Catcher (CFCer) is explored to mine potentially common features in multimodal information as an auxiliary fusion stream, improving the late fusion results. The seamless combination of these novel designs forms a robust spatiotemporal representation and achieves better performance than state-of-the-art methods on four public motion datasets. Specifically, UMDR achieves unprecedented improvements of +4.5% on the Chalearn IsoGD dataset. Our code is available at this https URL.
https://arxiv.org/abs/2211.09146
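For context on the augmentation component, a sketch of plain MixUp applied to video clips, i.e., the baseline that ShuffleMix is described as supplementing; ShuffleMix's own formulation is not given in the abstract, so it is not reproduced here.

```python
import torch

def video_mixup(clips, labels, num_classes, alpha=0.8):
    """Standard MixUp applied to video clips.

    clips: (B, C, T, H, W), labels: (B,) integer class ids.
    Returns mixed clips and soft labels.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(clips.size(0))
    mixed = lam * clips + (1 - lam) * clips[perm]
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    soft_labels = lam * one_hot + (1 - lam) * one_hot[perm]
    return mixed, soft_labels

mixed, soft = video_mixup(torch.randn(4, 3, 16, 112, 112), torch.randint(0, 10, (4,)), 10)
```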
Video processing and analysis have become an urgent task, since a huge number of videos (e.g., on YouTube, Hulu) are uploaded online every day. The extraction of representative key frames from videos is very important in video processing and analysis, since it greatly reduces computing resources and time. Although great progress has been made recently, large-scale video classification remains an open problem, as existing methods have not balanced performance and efficiency well. To tackle this problem, this work presents an unsupervised method to retrieve key frames that combines a Convolutional Neural Network (CNN) and Temporal Segment Density Peaks Clustering (TSDPC). The proposed TSDPC is a generic and powerful framework with two advantages over previous works: it can calculate the number of key frames automatically, and it can preserve the temporal information of the video, thus improving the efficiency of video classification. Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further elevate classification performance. Moreover, a weight fusion strategy over different input networks is presented to boost performance. By optimizing both video classification and key frame extraction simultaneously, we achieve better classification performance and higher efficiency. We evaluate our method on two popular datasets (i.e., HMDB51 and UCF101), and the experimental results consistently demonstrate that our strategy achieves competitive performance and efficiency compared with state-of-the-art approaches.
https://arxiv.org/abs/2211.06742
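A simplified sketch of density-peaks clustering used for key-frame selection over per-frame features; the cutoff-distance heuristic is an assumption, and the paper additionally applies the procedure within temporal segments.

```python
import torch

def density_peaks_keyframes(frame_feats, num_keyframes=3, dc_quantile=0.2):
    """Density-peaks-style key-frame selection over per-frame CNN features.

    frame_feats: (T, D) features, one row per frame.
    Returns indices of selected key frames.
    """
    dist = torch.cdist(frame_feats, frame_feats)          # (T, T) pairwise distances
    dc = torch.quantile(dist[dist > 0], dc_quantile)      # cutoff distance
    rho = (dist < dc).float().sum(dim=1) - 1              # local density (exclude self)

    # delta_i: distance to the nearest frame with strictly higher density.
    delta = torch.zeros_like(rho)
    for i in range(len(rho)):
        higher = rho > rho[i]
        delta[i] = dist[i, higher].min() if higher.any() else dist[i].max()

    # Frames with both high density and high separation are cluster centers / key frames.
    return torch.topk(rho * delta, k=min(num_keyframes, len(rho))).indices

keys = density_peaks_keyframes(torch.randn(30, 512), num_keyframes=4)
```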
The computer vision community has seen a shift from convolution-based to pure transformer architectures for both image and video tasks. Training a transformer from scratch for these tasks usually requires a lot of data and computational resources. Video Swin Transformer (VST) is a pure-transformer model developed for video classification that achieves state-of-the-art results in accuracy and efficiency on several datasets. In this paper, we aim to understand whether VST generalizes well enough to be used in an out-of-domain setting. We study the performance of VST on two large-scale datasets, namely FCVID and Something-Something, using a transfer-learning approach from Kinetics-400, which requires around 4x less memory than training from scratch. We then break down the results to understand where VST fails the most and in which scenarios the transfer-learning approach is viable. Our experiments show 85% top-1 accuracy on FCVID without retraining the whole model, which matches the state of the art for the dataset, and 21% accuracy on Something-Something. The experiments also suggest that the performance of VST decreases on average as video duration increases, which seems to be a consequence of a design choice of the model. From the results, we conclude that VST generalizes well enough to classify out-of-domain videos without retraining when the target classes are of the same type as the classes used to train the model. We observed this effect when we performed transfer learning from Kinetics-400 to FCVID, where both datasets' classes mostly represent objects. On the other hand, if the classes are not of the same type, then the accuracy after the transfer-learning approach is expected to be poor. We observed this effect when we performed transfer learning from Kinetics-400, where the classes represent mostly objects, to Something-Something, where the classes represent mostly actions.
https://arxiv.org/abs/2210.09969
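The memory savings come from not retraining the whole model. A hedged sketch of a frozen-backbone transfer setup; the `head` attribute name and the exact set of trainable parameters are assumptions, not details from the paper.

```python
import torch.nn as nn

def prepare_for_transfer(pretrained_model, num_target_classes, head_attr="head"):
    """Keep a Kinetics-400-pretrained video backbone frozen and train only a new
    classification head for the target dataset (e.g., FCVID or Something-Something).
    The classifier attribute name ('head') depends on the implementation.
    """
    for p in pretrained_model.parameters():
        p.requires_grad = False                      # no gradients/optimizer state for the backbone

    old_head = getattr(pretrained_model, head_attr)
    new_head = nn.Linear(old_head.in_features, num_target_classes)
    setattr(pretrained_model, head_attr, new_head)   # only this layer is trained

    trainable = [p for p in pretrained_model.parameters() if p.requires_grad]
    return pretrained_model, trainable
```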
Vision Transformers have achieved impressive performance in video classification while suffering from the quadratic complexity caused by the Softmax attention mechanism. Some studies alleviate the computational costs by reducing the number of tokens in the attention calculation, but the complexity remains quadratic. Another promising way is to replace Softmax attention with linear attention, which has linear complexity but exhibits a clear performance drop. We find that this drop in linear attention results from a lack of attention concentration on critical features. We therefore propose a feature fixation module to reweight the feature importance of the query and key before computing linear attention. Specifically, we regard the query, key, and value as different latent representations of the input token, and learn the feature fixation ratio by aggregating query-key-value information, which is beneficial for measuring feature importance comprehensively. Furthermore, we enhance the feature fixation with neighborhood association, which leverages additional guidance from spatially and temporally neighbouring tokens. The proposed method significantly improves the linear attention baseline and achieves state-of-the-art performance among linear video Transformers on three popular video classification benchmarks. With fewer parameters and higher efficiency, its performance is even comparable to some Softmax-based quadratic Transformers.
https://arxiv.org/abs/2210.08164
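A sketch of kernelized linear attention with a simple channel-reweighting ("fixation") gate derived from aggregated Q/K/V; the gate design and the elu+1 kernel are illustrative stand-ins, not the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FixatedLinearAttention(nn.Module):
    """Linear attention with a per-token channel gate applied to Q and K before
    the kernelized attention; the gate is a sigmoid MLP over concatenated (Q, K, V)."""
    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Sequential(nn.Linear(3 * dim, dim), nn.Sigmoid())

    def forward(self, x):                                   # x: (B, L, dim)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        fix = self.gate(torch.cat([q, k, v], dim=-1))       # per-token channel importance
        q, k = q * fix, k * fix                             # feature fixation (reweighting)

        q, k = F.elu(q) + 1, F.elu(k) + 1                   # positive kernel feature map
        kv = torch.einsum("bld,ble->bde", k, v)             # (B, dim, dim), linear in sequence length
        z = 1.0 / (torch.einsum("bld,bd->bl", q, k.sum(dim=1)) + 1e-6)
        return torch.einsum("bld,bde,bl->ble", q, kv, z)    # (B, L, dim)

out = FixatedLinearAttention(64)(torch.randn(2, 196, 64))
```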
Current weakly supervised video anomaly detection algorithms mostly use multiple instance learning (MIL) or its variants. Almost all recent approaches focus on how to select the correct snippets for training in order to improve performance, and overlook or do not realize the power of video classification in boosting anomaly detection performance. In this paper, we explicitly study the power of video classification supervision using a BERT or LSTM. With this BERT or LSTM, the CNN features of all snippets of a video can be aggregated into a single feature that can be used for video classification. This simple yet powerful video classification supervision, combined into the MIL framework, brings extraordinary performance improvements on all three major video anomaly detection datasets. In particular, it improves the mean average precision (mAP) on XD-Violence from the previous state of the art of 78.84% to 82.10%. The source code is available at this https URL.
https://arxiv.org/abs/2210.06688
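A hedged sketch of the idea of adding video-level classification supervision on top of MIL: snippet features are aggregated (here by an LSTM) into one video feature for a normal/anomalous classification loss, next to a simplified per-snippet MIL term. Loss weighting and the actual MIL ranking objective are simplified relative to the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoLevelSupervision(nn.Module):
    """Snippet features -> LSTM -> video-level logit, trained jointly with a
    simplified per-snippet MIL score."""
    def __init__(self, feat_dim, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.video_head = nn.Linear(hidden, 1)          # video-level anomaly logit
        self.snippet_head = nn.Linear(feat_dim, 1)      # per-snippet MIL score

    def forward(self, snippet_feats, video_label):      # snippet_feats: (B, S, feat_dim)
        _, (h, _) = self.lstm(snippet_feats)
        video_logit = self.video_head(h[-1]).squeeze(-1)               # (B,)
        snippet_scores = self.snippet_head(snippet_feats).squeeze(-1)  # (B, S)

        cls_loss = F.binary_cross_entropy_with_logits(video_logit, video_label)
        # Simplified MIL term: the max snippet score should agree with the video label.
        mil_loss = F.binary_cross_entropy_with_logits(snippet_scores.max(dim=1).values, video_label)
        return cls_loss + mil_loss

loss = VideoLevelSupervision(1024)(torch.randn(4, 32, 1024), torch.tensor([1., 0., 1., 0.]))
```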
Visual data such as images and videos are typically modeled as discretizations of inherently continuous, multidimensional signals. Existing continuous-signal models attempt to exploit this fact by modeling the underlying signals of visual (e.g., image) data directly. However, these models have not yet been able to achieve competitive performance on practical vision tasks such as large-scale image and video classification. Building on a recent line of work on deep state space models (SSMs), we propose S4ND, a new multidimensional SSM layer that extends the continuous-signal modeling ability of SSMs to multidimensional data including images and videos. We show that S4ND can model large-scale visual data in 1D, 2D, and 3D as continuous multidimensional signals and demonstrates strong performance by simply swapping Conv2D and self-attention layers with S4ND layers in existing state-of-the-art models. On ImageNet-1k, S4ND exceeds the performance of a Vision Transformer baseline by 1.5% when training with a 1D sequence of patches, and matches ConvNeXt when modeling images in 2D. For videos, S4ND improves on an inflated 3D ConvNeXt in activity classification on HMDB-51 by 4%. S4ND implicitly learns global, continuous convolutional kernels that are resolution invariant by construction, providing an inductive bias that enables generalization across multiple resolutions. By developing a simple bandlimiting modification to S4 to overcome aliasing, S4ND achieves strong zero-shot (unseen at training time) resolution performance, outperforming a baseline Conv2D by 40% on CIFAR-10 when trained on 8×8 and tested on 32×32 images. When trained with progressive resizing, S4ND comes within ~1% of a high-resolution model while training 22% faster.
https://arxiv.org/abs/2210.06583
Automatic traffic accident detection has appealed to the machine vision community due to its implications for the development of autonomous intelligent transportation systems (ITS) and its importance to traffic safety. Most previous studies on efficient analysis and prediction of traffic accidents, however, have used small-scale datasets with limited coverage, which limits their effect and applicability. Existing traffic accident datasets are either small-scale, not from surveillance cameras, not open-sourced, or not built for freeway scenes. Since accidents on freeways tend to cause serious damage and happen too fast to be caught on the spot, an open-sourced dataset of freeway traffic accidents collected from surveillance cameras is in great need and of practical importance. To help the vision community address these shortcomings, we endeavor to collect video data of real traffic accidents covering abundant scenes. After integration and annotation along various dimensions, a large-scale traffic accident dataset named TAD is proposed in this work. Various experiments on image classification, object detection, and video classification tasks, using public mainstream vision algorithms and frameworks, are conducted to demonstrate the performance of different methods. The proposed dataset, together with the experimental results, is presented as a new benchmark to advance computer vision research, especially in ITS.
https://arxiv.org/abs/2209.12386
Unmanned aerial vehicles (UAVs) are now widely applied to data acquisition due to their low cost and fast mobility. With the increasing volume of aerial videos, the demand for automatically parsing these videos is surging. To achieve this, current research mainly focuses on extracting a holistic feature with convolutions along both spatial and temporal dimensions. However, these methods are limited by small temporal receptive fields and cannot adequately capture long-term temporal dependencies, which are important for describing complicated dynamics. In this paper, we propose a novel deep neural network, termed FuTH-Net, to model not only holistic features but also temporal relations for aerial video classification. Furthermore, the holistic features are refined by multi-scale temporal relations in a novel fusion module, yielding more discriminative video representations. More specifically, FuTH-Net employs a two-pathway architecture: (1) a holistic representation pathway to learn a general feature of both frame appearances and short-term temporal variations, and (2) a temporal relation pathway to capture multi-scale temporal relations across arbitrary frames, providing long-term temporal dependencies. Afterwards, a novel fusion module is proposed to spatiotemporally integrate the two features learned from the two pathways. Our model is evaluated on two aerial video classification datasets, ERA and Drone-Action, and achieves state-of-the-art results, demonstrating its effectiveness and good generalization capacity across different recognition tasks (event classification and human action recognition). To facilitate further research, we release the code at this https URL.
https://arxiv.org/abs/2209.11316
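A rough two-pathway sketch in the spirit of the description above (a holistic pathway plus a temporal-relation pathway over frame pairs, fused by concatenation); the actual FuTH-Net modules, multi-scale relations, and fusion design are more elaborate.

```python
import torch
import torch.nn as nn

class TwoPathwayClassifier(nn.Module):
    """Illustrative two-pathway aerial-video classifier: holistic 3D-conv pathway
    plus a temporal-relation pathway over consecutive frame features."""
    def __init__(self, dim=64, num_classes=25):
        super().__init__()
        self.holistic = nn.Sequential(nn.Conv3d(3, dim, kernel_size=3, padding=1),
                                      nn.ReLU(), nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.frame_embed = nn.Sequential(nn.Conv2d(3, dim, 3, padding=1),
                                         nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.relation = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, video):                                   # video: (B, 3, T, H, W)
        b, _, t = video.shape[:3]
        holistic = self.holistic(video)                         # (B, dim)

        frames = video.permute(0, 2, 1, 3, 4).flatten(0, 1)     # (B*T, 3, H, W)
        f = self.frame_embed(frames).view(b, t, -1)             # (B, T, dim)
        # Temporal relations over consecutive frame pairs, averaged over time.
        pairs = torch.cat([f[:, :-1], f[:, 1:]], dim=-1)        # (B, T-1, 2*dim)
        relation = self.relation(pairs).mean(dim=1)             # (B, dim)

        return self.head(torch.cat([holistic, relation], dim=-1))

logits = TwoPathwayClassifier()(torch.randn(2, 3, 8, 64, 64))
```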
Traffic video data has become a critical factor in assessing the state of traffic congestion due to recent advancements in computer vision. This work proposes a technique for traffic video classification that uses a color-coding scheme before training the traffic data in a deep convolutional neural network. First, the video data is transformed into an image dataset; then, vehicle detection is performed using the You Only Look Once (YOLO) algorithm. A color-coding scheme is adopted to transform the image dataset into a binary image dataset, and these binary images are fed to a deep convolutional neural network. Using the UCSD dataset, we obtain a classification accuracy of 98.2%.
https://arxiv.org/abs/2209.07943
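One plausible reading of the pipeline, sketched below: detections from a YOLO-style detector are rendered into binary images that the classifier CNN consumes. The detector call itself and the exact color-coding rule are not shown and are assumptions.

```python
import torch

def boxes_to_binary_image(boxes, height, width):
    """Render vehicle detections as a binary image: pixels inside any detected
    bounding box are 1, everything else 0. `boxes` is a list of (x1, y1, x2, y2)
    pixel coordinates produced by a detector (not shown here).
    """
    img = torch.zeros(height, width)
    for x1, y1, x2, y2 in boxes:
        img[int(y1):int(y2), int(x1):int(x2)] = 1.0
    return img.unsqueeze(0)        # (1, H, W), ready to feed a CNN

# Hypothetical detections for one frame of the UCSD footage:
binary_frame = boxes_to_binary_image([(10, 20, 60, 80), (100, 40, 150, 110)], 158, 238)
```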
Recently, vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks. The less restrictive inductive bias of transformers endows them with greater representational capacity in comparison with CNNs. However, in the image classification setting this flexibility comes with a trade-off in sample efficiency, where transformers require ImageNet-scale training. This notion has carried over to video, where transformers have not yet been explored for video classification in low-labeled or semi-supervised settings. Our work empirically explores the low-data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting compared to CNNs. We specifically evaluate video vision transformers across two contrasting video datasets (Kinetics-400 and SomethingSomething-V2) and perform thorough analysis and ablation studies to explain this observation using the predominant features of video transformer architectures. We even show that, using just the labeled data, transformers significantly outperform complex semi-supervised CNN methods that also leverage large-scale unlabeled data. Our experiments inform our recommendation that future work on semi-supervised video learning should consider the use of video transformers.
https://arxiv.org/abs/2209.07474