Evaluating safety performance in a resource-efficient way is crucial for the development of autonomous systems. Simulation of parameterized scenarios is a popular testing strategy but parameter sweeps can be prohibitively expensive. To address this, we propose HiddenGems: a sample-efficient method for discovering the boundary between compliant and non-compliant behavior via active learning. Given a parameterized scenario, one or more compliance metrics, and a simulation oracle, HiddenGems maps the compliant and non-compliant domains of the scenario. The methodology enables critical test case identification, comparative analysis of different versions of the system under test, as well as verification of design objectives. We evaluate HiddenGems on a scenario with a jaywalker crossing in front of an autonomous vehicle and obtain compliance boundary estimates for collision, lane keep, and acceleration metrics individually and in combination, with 6 times fewer simulations than a parameter sweep. We also show how HiddenGems can be used to detect and rectify a failure mode for an unprotected turn with 86% fewer simulations.
https://arxiv.org/abs/2210.13956
Multi-task learning promises better model generalization on a target task by jointly optimizing it with an auxiliary task. However, the current practice requires additional labeling efforts for the auxiliary task, while not guaranteeing better model performance. In this paper, we find that jointly training a dense prediction (target) task with a self-supervised (auxiliary) task can consistently improve the performance of the target task, while eliminating the need for labeling auxiliary tasks. We refer to this joint training as Composite Learning (CompL). Experiments of CompL on monocular depth estimation, semantic segmentation, and boundary detection show consistent performance improvements in fully and partially labeled datasets. Further analysis on depth estimation reveals that joint training with self-supervision outperforms most labeled auxiliary tasks. We also find that CompL can improve model robustness when the models are evaluated in new domains. These results demonstrate the benefits of self-supervision as an auxiliary task, and establish the design of novel task-specific self-supervised methods as a new axis of investigation for future multi-task learning research.
https://arxiv.org/abs/2210.07239
The task of Generic Event Boundary Detection (GEBD) aims to detect moments in videos that are naturally perceived by humans as generic and taxonomy-free event boundaries. Modeling the dynamically evolving temporal and spatial changes in a video makes GEBD a difficult problem to solve. Existing approaches involve very complex and sophisticated pipelines in terms of architectural design choices, hence creating a need for more straightforward and simplified approaches. In this work, we address this issue by revisiting a simple and effective self-supervised method and augment it with a differentiable motion feature learning module to tackle the spatial and temporal diversities in the GEBD task. We perform extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets to demonstrate the efficacy of the proposed approach compared to the other self-supervised state-of-the-art methods. We also show that this simple self-supervised approach learns motion features without any explicit motion-specific pretext task.
https://arxiv.org/abs/2210.05574
Medical image segmentation assists in computer-aided diagnosis, surgeries, and treatment. Digitize tissue slide images are used to analyze and segment glands, nuclei, and other biomarkers which are further used in computer-aided medical applications. To this end, many researchers developed different neural networks to perform segmentation on histological images, mostly these networks are based on encoder-decoder architecture and also utilize complex attention modules or transformers. However, these networks are less accurate to capture relevant local and global features with accurate boundary detection at multiple scales, therefore, we proposed an Encoder-Decoder Network, Quick Attention Module and a Multi Loss Function (combination of Binary Cross Entropy (BCE) Loss, Focal Loss & Dice Loss). We evaluate the generalization capability of our proposed network on two publicly available datasets for medical image segmentation MoNuSeg and GlaS and outperform the state-of-the-art networks with 1.99% improvement on the MoNuSeg dataset and 7.15% improvement on the GlaS dataset. Implementation Code is available at this link: this https URL
https://arxiv.org/abs/2209.00729
Multi-modal named entity recognition (MNER) aims at identifying entity spans and recognizing their categories in social media posts with the aid of images. However, in dominant MNER approaches, the interaction of different modalities is usually carried out through the alternation of self-attention and cross-attention or over-reliance on the gating machine, which results in imprecise and biased correspondence between fine-grained semantic units of text and image. To address this issue, we propose a Flat Multi-modal Interaction Transformer (FMIT) for MNER. Specifically, we first utilize noun phrases in sentences and general domain words to obtain visual cues. Then, we transform the fine-grained semantic representation of the vision and text into a unified lattice structure and design a novel relative position encoding to match different modalities in Transformer. Meanwhile, we propose to leverage entity boundary detection as an auxiliary task to alleviate visual bias. Experiments show that our methods achieve the new state-of-the-art performance on two benchmark datasets.
https://arxiv.org/abs/2208.11039
This is a review paper of traditional approaches for edge, corner, and boundary detection methods. There are many real-world applications of edge, corner, and boundary detection methods. For instance, in medical image analysis, edge detectors are used to extract the features from the given image. In modern innovations like autonomous vehicles, edge detection and segmentation are the most crucial things. If we want to detect motion or track video, corner detectors help. I tried to compare the results of detectors stage-wise wherever it is possible and also discussed the importance of image prepossessing to minimise the noise. Real-world images are used to validate detector performance and limitations.
https://arxiv.org/abs/2208.07714
Continuous Hand Gesture Recognition (CHGR) has been extensively studied by researchers in the last few decades. Recently, one model has been presented to deal with the challenge of the boundary detection of isolated gestures in a continuous gesture video [17]. To enhance the model performance and also replace the handcrafted feature extractor in the presented model in [17], we propose a GCN model and combine it with the stacked Bi-LSTM and Attention modules to push the temporal information in the video stream. Considering the breakthroughs of GCN models for skeleton modality, we propose a two-layer GCN model to empower the 3D hand skeleton features. Finally, the class probabilities of each isolated gesture are fed to the post-processing module, borrowed from [17]. Furthermore, we replace the anatomical graph structure with some non-anatomical graph structures. Due to the lack of a large dataset, including both the continuous gesture sequences and the corresponding isolated gestures, three public datasets in Dynamic Hand Gesture Recognition (DHGR), RKS-PERSIANSIGN, and ASLVID, are used for evaluation. Experimental results show the superiority of the proposed model in dealing with isolated gesture boundaries detection in continuous gesture sequences
https://arxiv.org/abs/2207.07619
Gazetteer is widely used in Chinese named entity recognition (NER) to enhance span boundary detection and type classification. However, to further understand the generalizability and effectiveness of gazetteers, the NLP community still lacks a systematic analysis of the gazetteer-enhanced NER model. In this paper, we first re-examine the effectiveness several common practices of the gazetteer-enhanced NER models and carry out a series of detailed analysis to evaluate the relationship between the model performance and the gazetteer characteristics, which can guide us to build a more suitable gazetteer. The findings of this paper are as follows: (1) the gazetteer has improved the most situations where the dataset is difficult to learn well for the conventional NER model. (2) the performance of model greatly benefits from the high-quality pre-trained lexeme embeddings. (3) a good gazetteer should cover more entities that can be matched in both the training set and testing set.
https://arxiv.org/abs/2207.02802
Generic event boundary detection (GEBD) is an important yet challenging task in video understanding, which aims at detecting the moments where humans naturally perceive event boundaries. In this paper, we present a local context modeling and global boundary decoding approach for GEBD task. Local context modeling sub-network is proposed to perceive diverse patterns of generic event boundaries, and it generates powerful video representations and reliable boundary confidence. Based on them, global boundary decoding sub-network is exploited to decode event boundaries from a global view. Our proposed method achieves 85.13% F1-score on Kinetics-GEBD testing set, which achieves a more than 22% F1-score boost compared to the baseline method. The code is available at this https URL.
https://arxiv.org/abs/2206.15268
This report presents the algorithm used in the submission of Generic Event Boundary Detection (GEBD) Challenge at CVPR 2022. In this work, we improve the existing Structured Context Transformer (SC-Transformer) method for GEBD. Specifically, a transformer decoder module is added after transformer encoders to extract high quality frame features. The final classification is performed jointly on the results of the original binary classifier and a newly introduced multi-class classifier branch. To enrich motion information, optical flow is introduced as a new modality. Finally, model ensemble is used to further boost performance. The proposed method achieves 86.49% F1 score on Kinetics-GEBD test set. which improves 2.86% F1 score compared to the previous SOTA method.
https://arxiv.org/abs/2206.12634
Generic Event Boundary Detection (GEBD) tasks aim at detecting generic, taxonomy-free event boundaries that segment a whole video into chunks. In this paper, we apply Masked Autoencoders to improve algorithm performance on the GEBD tasks. Our approach mainly adopted the ensemble of Masked Autoencoders fine-tuned on the GEBD task as a self-supervised learner with other base models. Moreover, we also use a semi-supervised pseudo-label method to take full advantage of the abundant unlabeled Kinetics-400 data while training. In addition, we propose a soft-label method to partially balance the positive and negative samples and alleviate the problem of ambiguous labeling in this task. Lastly, a tricky segmentation alignment policy is implemented to refine boundaries predicted by our models to more accurate locations. With our approach, we achieved 85.94% on the F1-score on the Kinetics-GEBD test set, which improved the F1-score by 2.31% compared to the winner of the 2021 Kinetics-GEBD Challenge. Our code is available at this https URL.
https://arxiv.org/abs/2206.08610
Generic Event Boundary Detection (GEBD) aims to detect moments where humans naturally perceive as event boundaries. In this paper, we present Structured Context Transformer (or SC-Transformer) to solve the GEBD task, which can be trained in an end-to-end fashion. Specifically, we use the backbone convolutional neural network (CNN) to extract the features of each video frame. To capture temporal context information of each frame, we design the structure context transformer (SC-Transformer) by re-partitioning input frame sequence. Note that, the overall computation complexity of SC-Transformer is linear to the video length. After that, the group similarities are computed to capture the differences between frames. Then, a lightweight fully convolutional network is used to determine the event boundaries based on the grouped similarity maps. To remedy the ambiguities of boundary annotations, the Gaussian kernel is adopted to preprocess the ground-truth event boundaries to further boost the accuracy. Extensive experiments conducted on the challenging Kinetics-GEBD and TAPOS datasets demonstrate the effectiveness of the proposed method compared to the state-of-the-art methods.
https://arxiv.org/abs/2206.02985
Camouflaged object detection intends to discover the concealed objects hidden in the surroundings. Existing methods follow the bio-inspired framework, which first locates the object and second refines the boundary. We argue that the discovery of camouflaged objects depends on the recurrent search for the object and the boundary. The recurrent processing makes the human tired and helpless, but it is just the advantage of the transformer with global search ability. Therefore, a dual-task interactive transformer is proposed to detect both accurate position of the camouflaged object and its detailed boundary. The boundary feature is considered as Query to improve the camouflaged object detection, and meanwhile the object feature is considered as Query to improve the boundary detection. The camouflaged object detection and the boundary detection are fully interacted by multi-head self-attention. Besides, to obtain the initial object feature and boundary feature, transformer-based backbones are adopted to extract the foreground and background. The foreground is just object, while foreground minus background is considered as boundary. Here, the boundary feature can be obtained from blurry boundary region of the foreground and background. Supervised by the object, the background and the boundary ground truth, the proposed model achieves state-of-the-art performance in public datasets. this https URL
https://arxiv.org/abs/2205.10579
Generic event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks. Existing methods typically require video frames to be decoded before feeding into the network, which demands considerable computational power and storage space. To that end, we propose a new end-to-end compressed video representation learning for event boundary detection that leverages the rich information in the compressed domain, i.e., RGB, motion vectors, residuals, and the internal group of pictures (GOP) structure, without fully decoding the video. Specifically, we first use the ConvNets to extract features of the I-frames in the GOPs. After that, a light-weight spatial-channel compressed encoder is designed to compute the feature representations of the P-frames based on the motion vectors, residuals and representations of their dependent I-frames. A temporal contrastive module is proposed to determine the event boundaries of video sequences. To remedy the ambiguities of annotations and speed up the training process, we use the Gaussian kernel to preprocess the ground-truth event boundaries. Extensive experiments conducted on the Kinetics-GEBD dataset demonstrate that the proposed method achieves comparable results to the state-of-the-art methods with $4.5\times$ faster running speed.
https://arxiv.org/abs/2203.15336
Movie genre classification has been widely studied in recent years due to its various applications in video editing, summarization, and recommendation. Prior work has typically addressed this task by predicting genres based solely on the visual content. As a result, predictions from these methods often perform poorly for genres such as documentary or musical, since non-visual modalities like audio or language play an important role in correctly classifying these genres. In addition, the analysis of long videos at frame level is always associated with high computational cost and makes the prediction less efficient. To address these two issues, we propose a Multi-Modal approach leveraging shot information, MMShot, to classify video genres in an efficient and effective way. We evaluate our method on MovieNet and Condensed Movies for genre classification, achieving 17% ~ 21% improvement on mean Average Precision (mAP) over the state-of-the-art. Extensive experiments are conducted to demonstrate the ability of MMShot for long video analysis and uncover the correlations between genres and multiple movie elements. We also demonstrate our approach's ability to generalize by evaluating the scene boundary detection task, achieving 1.1% improvement on Average Precision (AP) over the state-of-the-art.
https://arxiv.org/abs/2203.13281
Boundaries are among the primary visual cues used by human and computer vision systems. One of the key problems in boundary detection is the label representation, which typically leads to class imbalance and, as a consequence, to thick boundaries that require non-differential post-processing steps to be thinned. In this paper, we re-interpret boundaries as 1-D surfaces and formulate a one-to-one vector transform function that allows for training of boundary prediction completely avoiding the class imbalance issue. Specifically, we define the boundary representation at any point as the unit vector pointing to the closest boundary surface. Our problem formulation leads to the estimation of direction as well as richer contextual information of the boundary, and, if desired, the availability of zero-pixel thin boundaries also at training time. Our method uses no hyper-parameter in the training loss and a fixed stable hyper-parameter at inference. We provide theoretical justification/discussions of the vector transform representation. We evaluate the proposed loss method using a standard architecture and show the excellent performance over other losses and representations on several datasets.
https://arxiv.org/abs/2203.08795
Generic Boundary Detection (GBD) aims at locating general boundaries that divide videos into semantically coherent and taxonomy-free units, and could server as an important pre-processing step for long-form video understanding. Previous research separately handle these different-level generic boundaries with specific designs of complicated deep networks from simple CNN to LSTM. Instead, in this paper, our objective is to develop a general yet simple architecture for arbitrary boundary detection in videos. To this end, we present Temporal Perceiver, a general architecture with Transformers, offering a unified solution to the detection of arbitrary generic boundaries. The core design is to introduce a small set of latent feature queries as anchors to compress the redundant input into fixed dimension via cross-attention blocks. Thanks to this fixed number of latent units, it reduces the quadratic complexity of attention operation to a linear form of input frames. Specifically, to leverage the coherence structure of videos, we construct two types of latent feature queries: boundary queries and context queries, which handle the semantic incoherence and coherence regions accordingly. Moreover, to guide the learning of latent feature queries, we propose an alignment loss on cross-attention to explicitly encourage the boundary queries to attend on the top possible boundaries. Finally, we present a sparse detection head on the compressed representations and directly output the final boundary detection results without any post-processing module. We test our Temporal Perceiver on a variety of detection benchmarks, ranging from shot-level, event-level, to scene-level GBD. Our method surpasses the previous state-of-the-art methods on all benchmarks, demonstrating the generalization ability of our temporal perceiver.
https://arxiv.org/abs/2203.00307
In recent years, the accuracy of automatic lyrics alignment methods has increased considerably. Yet, many current approaches employ frameworks designed for automatic speech recognition (ASR) and do not exploit properties specific to music. Pitch is one important musical attribute of singing voice but it is often ignored by current systems as the lyrics content is considered independent of the pitch. In practice, however, there is a temporal correlation between the two as note starts often correlate with phoneme starts. At the same time the pitch is usually annotated with high temporal accuracy in ground truth data while the timing of lyrics is often only available at the line (or word) level. In this paper, we propose a multi-task learning approach for lyrics alignment that incorporates pitch and thus can make use of a new source of highly accurate temporal information. Our results show that the accuracy of the alignment result is indeed improved by our approach. As an additional contribution, we show that integrating boundary detection in the forced-alignment algorithm reduces cross-line errors, which improves the accuracy even further.
https://arxiv.org/abs/2202.01646
The paper presents the results from a multi-year effort to develop and validate image processing methods for selecting the best physical models based on solar image observations. The approach consists of selecting the physical models based on their agreement with coronal holes extracted from the images. Ultimately, the goal is to use physical models to predict geomagnetic storms. We decompose the problem into three subproblems: (i) coronal hole segmentation based on physical constraints, (ii) matching clusters of coronal holes between different maps, and (iii) physical map classification. For segmenting coronal holes, we develop a multi-modal method that uses segmentation maps from three different methods to initialize a level-set method that evolves the initial coronal hole segmentation to the magnetic boundary. Then, we introduce a new method based on Linear Programming for matching clusters of coronal holes. The final matching is then performed using Random Forests. The methods were carefully validated using consensus maps derived from multiple readers, manual clustering, manual map classification, and method validation for 50 maps. The proposed multi-modal segmentation method significantly outperformed SegNet, U-net, Henney-Harvey, and FCN by providing accurate boundary detection. Overall, the method gave a 95.5% map classification accuracy.
https://arxiv.org/abs/2201.01380
Developing Natural Language Processing resources for a low resource language is a challenging but essential task. In this paper, we present a Morphological Analyzer for Gujarati. We have used a Bi-Directional LSTM based approach to perform morpheme boundary detection and grammatical feature tagging. We have created a data set of Gujarati words with lemma and grammatical features. The Bi-LSTM based model of Morph Analyzer discussed in the paper handles the language morphology effectively without the knowledge of any hand-crafted suffix rules. To the best of our knowledge, this is the first dataset and morph analyzer model for the Gujarati language which performs both grammatical feature tagging and morpheme boundary detection tasks.
https://arxiv.org/abs/2112.09860