We introduce Programmatic Motion Concepts, a hierarchical motion representation for human actions that captures both low-level motion and high-level description as motion concepts. This representation enables human motion description, interactive editing, and controlled synthesis of novel video sequences within a single framework. We present an architecture that learns this concept representation from paired video and action sequences in a semi-supervised manner. The compactness of our representation also allows us to present a low-resource training recipe for data-efficient learning. By outperforming established baselines, especially in the small data regime, we demonstrate the efficiency and effectiveness of our framework for multiple applications.
https://arxiv.org/abs/2206.13502
Unpaired image translation algorithms can be used for sim2real tasks, but many fail to generate temporally consistent results. We present a new approach that combines differentiable rendering with image translation to achieve temporal consistency over indefinite timescales, using surface consistency losses and \emph{neural neural textures}. We call this algorithm TRITON (Texture Recovering Image Translation Network): an unsupervised, end-to-end, stateless sim2real algorithm that leverages the underlying 3D geometry of input scenes by generating realistic-looking learnable neural textures. By settling on a particular texture for the objects in a scene, we ensure consistency between frames statelessly. Unlike previous algorithms, TRITON is not limited to camera movements -- it can handle the movement of objects as well, making it useful for downstream tasks such as robotic manipulation.
https://arxiv.org/abs/2206.13500
Humans can leverage prior experience and learn novel tasks from a handful of demonstrations. In contrast to offline meta-reinforcement learning, which aims to achieve quick adaptation through better algorithm design, we investigate the effect of architecture inductive bias on the few-shot learning capability. We propose a Prompt-based Decision Transformer (Prompt-DT), which leverages the sequential modeling ability of the Transformer architecture and the prompt framework to achieve few-shot adaptation in offline RL. We design the trajectory prompt, which contains segments of the few-shot demonstrations, and encodes task-specific information to guide policy generation. Our experiments in five MuJoCo control benchmarks show that Prompt-DT is a strong few-shot learner without any extra finetuning on unseen target tasks. Prompt-DT outperforms its variants and strong meta offline RL baselines by a large margin with a trajectory prompt containing only a few timesteps. Prompt-DT is also robust to prompt length changes and can generalize to out-of-distribution (OOD) environments.
https://arxiv.org/abs/2206.13499
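The trajectory-prompt idea in the Prompt-DT abstract above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the `(return_to_go, state, action)` tuple layout, and the uniform segment sampling are all assumptions made for illustration. It shows only how short segments of few-shot demonstrations might be prepended to the most recent history before the combined sequence is passed to a sequence model.

```python
import random

def sample_segment(demo, seg_len, rng):
    """Pick one contiguous segment of `seg_len` timesteps from a demonstration."""
    start = rng.randrange(0, len(demo) - seg_len + 1)
    return demo[start:start + seg_len]

def build_input(demos, history, segs=1, seg_len=2, seed=0):
    """Prepend a trajectory prompt (hypothetical layout) to the recent history.

    Each timestep is assumed to be a (return_to_go, state, action) tuple.
    """
    rng = random.Random(seed)
    prompt = []
    for demo in rng.sample(demos, segs):  # choose `segs` distinct demonstrations
        prompt.extend(sample_segment(demo, seg_len, rng))
    return prompt + history

# Toy few-shot demonstrations for an unseen task (made-up values).
demos = [
    [(3.0, "s0", "a0"), (2.0, "s1", "a1"), (1.0, "s2", "a2")],
    [(2.0, "t0", "b0"), (1.0, "t1", "b1"), (0.5, "t2", "b2")],
]
history = [(5.0, "x0", "u0")]
sequence = build_input(demos, history, segs=2, seg_len=2)
```

Here `sequence` holds four prompt timesteps followed by the single history timestep. No model weights are updated, which matches the abstract's claim of adaptation without extra finetuning.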
Transparency methods such as model visualizations provide information that outputs alone might miss, since they describe the internals of neural networks. But can we trust that model explanations reflect model behavior? For instance, can they diagnose abnormal behavior such as backdoors or shape bias? To evaluate model explanations, we define a model as anomalous if it differs from a reference set of normal models, and we test whether transparency methods assign different explanations to anomalous and normal models. We find that while existing methods can detect stark anomalies such as shape bias or adversarial training, they struggle to identify more subtle anomalies such as models trained on incomplete data. Moreover, they generally fail to distinguish the inputs that induce anomalous behavior, e.g. images containing a backdoor trigger. These results reveal new blind spots in existing model explanations, pointing to the need for further method development.
https://arxiv.org/abs/2206.13498
This paper proves that robustness implies generalization via data-dependent generalization bounds. As a result, robustness and generalization are shown to be connected closely in a data-dependent manner. Our bounds improve previous bounds in two directions, to solve an open problem that has seen little development since 2010. The first is to reduce the dependence on the covering number. The second is to remove the dependence on the hypothesis space. We present several examples, including ones for lasso and deep learning, in which our bounds are provably preferable. The experiments on real-world data and theoretical models demonstrate near-exponential improvements in various situations. To achieve these improvements, we do not require additional assumptions on the unknown distribution; instead, we only incorporate an observable and computable property of the training samples. A key technical innovation is an improved concentration bound for multinomial random variables that is of independent interest beyond robustness and generalization.
https://arxiv.org/abs/2206.13497
Ensembling is a popular and effective method for improving machine learning (ML) models. It proves its value not only in classical ML but also in deep learning. Ensembles enhance the quality and trustworthiness of ML solutions and allow uncertainty estimation. However, they come at a price: training ensembles of deep learning models consumes a huge amount of computational resources. Snapshot ensembling collects the models in the ensemble along a single training path. As it runs training only once, the computational time is similar to that of training one model. However, the quality of models along the training path varies: typically, later models are better if no overfitting occurs. So, the models are of varying utility. Our method improves snapshot ensembling by selecting and weighting ensemble members along the training path. It relies on training-time likelihoods rather than on the validation-sample errors that standard stacking methods use. Experimental evidence for the Fashion MNIST, CIFAR-10, and CIFAR-100 datasets demonstrates the superior quality of the proposed weighted ensembles compared to vanilla ensembling of deep learning models.
https://arxiv.org/abs/2206.13491
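To make the weighting of snapshot-ensemble members in the abstract above concrete, here is a minimal sketch. The abstract says only that the weights rely on training-time likelihoods. The softmax over training log-likelihoods used here, the `temperature` parameter, and all function names are assumptions for illustration, not the paper's rule.

```python
import math

def softmax_weights(log_likelihoods, temperature=1.0):
    """Turn per-snapshot training log-likelihoods into weights that sum to 1.

    Assumed weighting rule (not taken from the paper): a softmax, so that
    snapshots with higher training likelihood receive larger weights.
    """
    m = max(log_likelihoods)  # subtract the max for numerical stability
    exps = [math.exp((ll - m) / temperature) for ll in log_likelihoods]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_ensemble(probs_per_snapshot, weights):
    """Weighted average of class-probability vectors, one vector per snapshot."""
    n_classes = len(probs_per_snapshot[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs_per_snapshot))
        for c in range(n_classes)
    ]

# Three snapshots along one training path, predicting one two-class input.
snapshot_probs = [[0.5, 0.5], [0.3, 0.7], [0.1, 0.9]]
train_log_likelihoods = [-1.2, -0.8, -0.5]  # made-up values, later is better
weights = softmax_weights(train_log_likelihoods)
ensemble_probs = weighted_ensemble(snapshot_probs, weights)
```

Vanilla snapshot ensembling corresponds to uniform weights. Selecting members could be sketched by setting some weights to zero and renormalizing.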
If capable AI agents are generally incentivized to seek power in service of the objectives we specify for them, then these systems will pose enormous risks, in addition to enormous benefits. In fully observable environments, most reward functions have an optimal policy which seeks power by keeping options open and staying alive. However, the real world is neither fully observable, nor will agents be perfectly optimal. We consider a range of models of AI decision-making, from optimal, to random, to choices informed by learning and interacting with an environment. We discover that many decision-making functions are retargetable, and that retargetability is sufficient to cause power-seeking tendencies. Our functional criterion is simple and broad. We show that a range of qualitatively dissimilar decision-making procedures incentivize agents to seek power. We demonstrate the flexibility of our results by reasoning about learned policy incentives in Montezuma's Revenge. These results suggest a safety risk: Eventually, highly retargetable training procedures may train real-world agents which seek power over humans.
https://arxiv.org/abs/2206.13477
Acoustic events are sounds with well-defined spectro-temporal characteristics which can be associated with the physical objects generating them. Acoustic scenes are collections of such acoustic events in no specific temporal order. Given this natural linkage between events and scenes, a common belief is that the ability to classify events must help in the classification of scenes. This has led to several efforts attempting to do well on Acoustic Event Tagging (AET) and Acoustic Scene Classification (ASC) using a multi-task network. However, in these efforts, improvement in one task does not guarantee an improvement in the other, suggesting a tension between ASC and AET. It is unclear if improvements in AET translate to improvements in ASC. We explore this conundrum through an extensive empirical study and show that under certain conditions, using AET as an auxiliary task in the multi-task network consistently improves ASC performance. Additionally, ASC performance further improves with the AET dataset size and is not sensitive to the choice of events or the number of events in the AET dataset. We conclude that this improvement in ASC performance comes from the regularization effect of using AET and not from the network's improved ability to discern between acoustic events.
https://arxiv.org/abs/2206.13476
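One common way to use an auxiliary task in a multi-task network, as in the ASC/AET abstract above, is to add a weighted auxiliary loss to the main loss. The sketch below assumes that setup: a single-label cross-entropy for the scene, a multi-label binary cross-entropy for event tags, and a hypothetical `aux_weight` coefficient. The abstract does not specify the loss, so none of this should be read as the authors' formulation.

```python
import math

def cross_entropy(probs, label):
    """Single-label loss for the main task (scene classification)."""
    return -math.log(probs[label])

def binary_cross_entropy(probs, tags):
    """Multi-label loss for the auxiliary task (event tagging), averaged over tags."""
    terms = [
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for p, t in zip(probs, tags)
    ]
    return -sum(terms) / len(tags)

def multitask_loss(scene_probs, scene_label, event_probs, event_tags, aux_weight=0.5):
    """Main loss plus a weighted auxiliary loss (assumed combination rule)."""
    main = cross_entropy(scene_probs, scene_label)
    aux = binary_cross_entropy(event_probs, event_tags)
    return main + aux_weight * aux

# Made-up network outputs for one audio clip.
scene_probs = [0.7, 0.2, 0.1]   # predicted scene distribution
event_probs = [0.9, 0.2, 0.6]   # independent per-event tag probabilities
event_tags = [1, 0, 1]          # ground-truth event tags
loss_main_only = multitask_loss(scene_probs, 0, event_probs, event_tags, aux_weight=0.0)
loss_joint = multitask_loss(scene_probs, 0, event_probs, event_tags, aux_weight=0.5)
```

With `aux_weight=0.0` the objective reduces to the scene loss alone. A positive weight adds the event-tagging term, which under the abstract's conclusion would act mainly as a regularizer.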
We introduce an atlas of algebro-geometric objects associated with image formation in pinhole cameras. The nodes of the atlas are algebraic varieties or their vanishing ideals related to each other by projection or elimination and restriction or specialization respectively. This atlas offers a unifying framework for the study of problems in 3D computer vision. We initiate the study of the atlas by completely characterizing a part of the atlas stemming from the triangulation problem. We conclude with several open problems and generalizations of the atlas.
https://arxiv.org/abs/2206.13468
Brain graph representation learning serves as the fundamental technique for brain disease diagnosis. Great efforts from both the academic and industrial communities have been devoted to brain graph representation learning in recent years. The recently introduced isomorphic neural network (IsoNN) can automatically learn the existence of sub-graph patterns in brain graphs and is the state-of-the-art brain graph representation learning method to date. However, IsoNN fails to capture the orientations of sub-graph patterns, which may render the learned representations useless in many cases. In this paper, we propose a new Iso-CapsNet (Isomorphic Capsule Net) model by introducing graph isomorphic capsules for effective brain graph representation learning. Based on capsule dynamic routing, besides the sub-graph pattern existence confidence scores, Iso-CapsNet can also learn other rich sub-graph properties, including position, size and orientation, for calculating the class-wise digit capsules. We have compared Iso-CapsNet with both classic and state-of-the-art brain graph representation approaches in extensive experiments on four brain graph benchmark datasets. The experimental results demonstrate the effectiveness of Iso-CapsNet, which outperforms the baseline methods with significant improvements.
https://arxiv.org/abs/2206.13465
Learning effective reinforcement learning (RL) policies to solve real-world complex tasks can be quite challenging without a high-fidelity simulation environment. In most cases, we are only given imperfect simulators with simplified dynamics, which inevitably lead to severe sim-to-real gaps in RL policy learning. The recently emerged field of offline RL provides another possibility to learn policies directly from pre-collected historical data. However, to achieve reasonable performance, existing offline RL algorithms need impractically large offline data with sufficient state-action space coverage for training. This brings up a new question: is it possible to combine learning from limited real data in offline RL and unrestricted exploration through imperfect simulators in online RL to address the drawbacks of both approaches? In this study, we propose the Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning (H2O) framework to provide an affirmative answer to this question. H2O introduces a dynamics-aware policy evaluation scheme, which adaptively penalizes the Q function learning on simulated state-action pairs with large dynamics gaps, while also simultaneously allowing learning from a fixed real-world dataset. Through extensive simulation and real-world tasks, as well as theoretical analysis, we demonstrate the superior performance of H2O against other cross-domain online and offline RL algorithms. H2O provides a brand new hybrid offline-and-online RL paradigm, which can potentially shed light on future RL algorithm design for solving practical real-world tasks.
https://arxiv.org/abs/2206.13464
The visual system of a robot has different requirements depending on the application: it may require high accuracy or reliability, be constrained by limited resources or need fast adaptation to dynamically changing environments. In this work, we focus on the instance segmentation task and provide a comprehensive study of different techniques that allow adapting an object segmentation model in the presence of novel objects or different domains. We propose a pipeline for fast instance segmentation learning designed for robotic applications where data arrive in a stream. It is based on a hybrid method leveraging a pre-trained CNN for feature extraction and fast-to-train kernel-based classifiers. We also propose a training protocol that shortens the training time by performing feature extraction during data acquisition. We benchmark the proposed pipeline on two robotics datasets and we deploy it on a real robot, i.e. the iCub humanoid. To this aim, we adapt our method to an incremental setting in which novel objects are learned on-line by the robot. The code to reproduce the experiments is publicly available on GitHub.
https://arxiv.org/abs/2206.13462
The development process of high-fidelity SLAM systems depends on their validation upon reliable datasets. Towards this goal, we propose IBISCape, a simulated benchmark that includes data synchronization and acquisition APIs for telemetry from heterogeneous sensors: stereo-RGB/DVS, Depth, IMU, and GPS, along with the ground truth scene segmentation and vehicle ego-motion. Our benchmark is built upon the CARLA simulator, whose back-end is the Unreal Engine, rendering highly dynamic scenery that simulates the real world. Moreover, we offer 34 multi-modal datasets suitable for autonomous vehicle navigation, including scenarios for scene understanding evaluation such as accidents, along with a wide range of frame quality based on a dynamic weather simulation class integrated with our APIs. We also introduce the first calibration targets to CARLA maps to solve the problem of unknown distortion parameters for CARLA simulated DVS and RGB cameras. Finally, using IBISCape sequences, we evaluate the performance of four ORB-SLAM3 systems (monocular RGB, stereo RGB, Stereo Visual Inertial (SVI), and RGB-D) and the BASALT Visual-Inertial Odometry (VIO) system on various sequences collected in simulated large-scale dynamic environments. Keywords: benchmark, multi-modal, datasets, Odometry, Calibration, DVS, SLAM
https://arxiv.org/abs/2206.13455
Video prediction is an extrapolation task that predicts future frames given past frames, and video frame interpolation is an interpolation task that estimates intermediate frames between two frames. We have witnessed the tremendous advancement of video frame interpolation, but the general video prediction in the wild is still an open question. Inspired by the photo-realistic results of video frame interpolation, we present a new optimization framework for video prediction via video frame interpolation, in which we solve an extrapolation problem based on an interpolation model. Our video prediction framework is based on optimization with a pretrained differentiable video frame interpolation module without the need for a training dataset, and thus there is no domain gap issue between training and test data. Also, our approach does not need any additional information such as semantic or instance maps, which makes our framework applicable to any video. Extensive experiments on the Cityscapes, KITTI, DAVIS, Middlebury, and Vimeo90K datasets show that our video prediction results are robust in general scenarios, and our approach outperforms other video prediction methods that require a large amount of training data or extra semantic information.
https://arxiv.org/abs/2206.13454
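The core idea in the video prediction abstract above, solving an extrapolation problem by optimizing through an interpolation model, can be shown on a toy example. The sketch rests on heavy assumptions. Frames are short lists of floats, the "pretrained differentiable interpolation module" is replaced by a simple midpoint average, and the squared-error objective and plain gradient descent are illustrative choices rather than the paper's method. The future frame is treated as the free variable and adjusted until interpolating between the older past frame and the candidate future frame reproduces the newer past frame.

```python
def interpolate(frame_a, frame_b):
    """Stand-in for a pretrained interpolation model: the per-pixel midpoint."""
    return [(a + b) / 2.0 for a, b in zip(frame_a, frame_b)]

def predict_next(frame_prev, frame_curr, steps=60, lr=1.0):
    """Optimize a future frame so that interpolate(prev, future) matches curr."""
    future = list(frame_curr)  # initialize with a copy of the latest frame
    for _ in range(steps):
        mid = interpolate(frame_prev, future)
        # Per-pixel loss is (mid - curr)^2 and d(mid)/d(future) = 1/2,
        # so the gradient with respect to `future` is (mid - curr).
        grad = [m - c for m, c in zip(mid, frame_curr)]
        future = [f - lr * g for f, g in zip(future, grad)]
    return future

frame_prev = [0.0, 1.0, 2.0]
frame_curr = [1.0, 1.5, 2.0]
predicted = predict_next(frame_prev, frame_curr)
```

For this linear stand-in the optimum is the linear extrapolation `2 * frame_curr - frame_prev`, so the loop converges towards `[2.0, 2.0, 2.0]`. With a learned interpolation network the same loop would instead backpropagate through the network, and no training dataset is involved because only the candidate frame is optimized.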
Despite extensive theoretical work on biologically plausible learning rules, it has been difficult to obtain clear evidence about whether and how such rules are implemented in the brain. We consider biologically plausible supervised- and reinforcement-learning rules and ask whether changes in network activity during learning can be used to determine which learning rule is being used. Supervised learning requires a credit-assignment model estimating the mapping from neural activity to behavior, and, in a biological organism, this model will inevitably be an imperfect approximation of the ideal mapping, leading to a bias in the direction of the weight updates relative to the true gradient. Reinforcement learning, on the other hand, requires no credit-assignment model and tends to make weight updates following the true gradient direction. We derive a metric to distinguish between learning rules by observing changes in the network activity during learning, given that the mapping from brain to behavior is known by the experimenter. Because brain-machine interface (BMI) experiments allow for perfect knowledge of this mapping, we focus on modeling a cursor-control BMI task using recurrent neural networks, showing that learning rules can be distinguished in simulated experiments using only observations that a neuroscience experimenter would plausibly have access to.
https://arxiv.org/abs/2206.13448
In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at a fine-grained level between any pair of seen speakers. We do this by activating distinct parts of the network for different tasks. We train our model using a novel approach to two-stage training. In Stage I, the model learns speaker-independent word-level prosody representations from speech, which it uses for many-to-many fine-grained prosody transfer. In Stage II, we learn to predict these prosody representations using the contextual information available in text, thereby enabling multi-speaker TTS with contextually appropriate prosody. We compare CC2 to two strong baselines, one in TTS with contextually appropriate prosody, and one in fine-grained prosody transfer. CC2 reduces the gap in naturalness between our baseline and copy-synthesised speech by $22.79\%$. In fine-grained prosody transfer evaluations, it obtains a relative improvement of $33.15\%$ in target speaker similarity.
https://arxiv.org/abs/2206.13443
Emergency vehicles (EMVs) play a crucial role in responding to time-critical calls such as medical emergencies and fire outbreaks in urban areas. Existing methods for EMV dispatch typically optimize routes based on historical traffic-flow data and design traffic signal pre-emption accordingly; however, we still lack a systematic methodology to address the coupling between EMV routing and traffic signal control. In this paper, we propose EMVLight, a decentralized reinforcement learning (RL) framework for joint dynamic EMV routing and traffic signal pre-emption. We adopt the multi-agent advantage actor-critic method with policy sharing and a spatial discount factor. This framework addresses the coupling between EMV navigation and traffic signal control via an innovative design of multi-class RL agents and a novel pressure-based reward function. The proposed methodology enables EMVLight to learn network-level cooperative traffic signal phasing strategies that not only reduce EMV travel time but also shorten the travel time of non-EMVs. Simulation-based experiments indicate that EMVLight enables up to a $42.6\%$ reduction in EMV travel time as well as a $23.5\%$ shorter average travel time compared with existing approaches.
https://arxiv.org/abs/2206.13441
Establishing voxelwise semantic correspondence across distinct imaging modalities is a foundational yet formidable computer vision task. Current multi-modality registration techniques maximize hand-crafted inter-domain similarity functions, are limited in modeling nonlinear intensity-relationships and deformations, and may require significant re-engineering or underperform on new tasks, datasets, and domain pairs. This work presents ContraReg, an unsupervised contrastive representation learning approach to multi-modality deformable registration. By projecting learned multi-scale local patch features onto a jointly learned inter-domain embedding space, ContraReg obtains representations useful for non-rigid multi-modality alignment. Experimentally, ContraReg achieves accurate and robust results with smooth and invertible deformations across a series of baselines and ablations on a neonatal T1-T2 brain MRI registration task with all methods validated over a wide range of deformation regularization strengths.
https://arxiv.org/abs/2206.13434
An open research question in deep reinforcement learning is how to focus the policy learning of key decisions within a sparse domain. This paper emphasizes combining the advantages of input-output hidden Markov models and reinforcement learning towards interpretable maintenance decisions. We propose a novel hierarchical-modeling methodology that, at a high level, detects and interprets the root cause of a failure as well as the health degradation of the turbofan engine, while, at a low level, it provides the optimal replacement policy. It outperforms the baseline performance of deep reinforcement learning methods applied directly to the raw data or when using a hidden Markov model without such a specialized hierarchy. It also provides comparable performance to prior work, but with the additional benefit of interpretability.
https://arxiv.org/abs/2206.13433
Autonomous underwater vehicles (AUVs) are commonly used in many underwater applications. Recently, the usage of multi-rotor unmanned autonomous vehicles (UAVs) for marine applications has been receiving more attention in the literature. Usually, both platforms employ an inertial navigation system (INS) and aiding sensors for an accurate navigation solution. In AUV navigation, a Doppler velocity log (DVL) is mainly used to aid the INS, while for UAVs, it is common to use global navigation satellite system (GNSS) receivers. The fusion between the aiding sensor and the INS requires the definition of a step size parameter in the estimation process. It is responsible for the solution update frequency and, eventually, its accuracy. The choice of the step size poses a tradeoff between computational load and navigation performance. Generally, the aiding sensor's update frequency is considered much slower than the INS operating frequency (hundreds of Hertz). Such a high rate is unnecessary for most platforms, specifically for low-dynamics AUVs. In this work, a supervised machine learning based adaptive tuning scheme to select the proper INS step size is proposed. To that end, a velocity error bound is defined, allowing the INS/DVL or the INS/GNSS to act in sub-optimal working conditions and yet minimize the computational load. Results from simulations and field experiment show the benefits of using the proposed approach. In addition, the proposed framework can be applied to any other fusion scenario between any type of sensors or platforms.
https://arxiv.org/abs/2206.13428