Attackers can deliberately perturb classifiers' input with subtle noise, altering final predictions. Among proposed countermeasures, adversarial purification employs generative networks to preprocess input images, filtering out adversarial noise. In this study, we propose to use a specific class of generators, termed Multiple Latent Variable Generative Models (MLVGMs), for adversarial purification. These models possess multiple latent variables that naturally disentangle coarse from fine features. Taking advantage of these properties, we autoencode images to maintain class-relevant information, while discarding and re-sampling any detail, including adversarial noise. The procedure is completely training-free, exploiting the generalization abilities of pre-trained MLVGMs on the adversarial purification downstream task. Even without large models trained on billions of samples, we show that smaller MLVGMs are already competitive with traditional methods and can be used as foundation models. Official code released at this https URL.
https://arxiv.org/abs/2412.03453
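A minimal sketch of the purification recipe described above: keep the class-relevant coarse latent inferred from the input and re-sample the fine latent from the prior before regenerating the image. The tiny encoder, generator, and latent split below are toy stand-ins, not the authors' pre-trained MLVGMs.

```python
# Toy sketch of MLVGM-style purification: keep the coarse latent inferred
# from the (possibly adversarial) input and re-sample the fine latent from
# the prior, so high-frequency perturbations are discarded on regeneration.
# The tiny linear modules stand in for a real pre-trained MLVGM.
import torch
import torch.nn as nn

IMG_DIM, COARSE_DIM, FINE_DIM = 64, 8, 16

class ToyMLVGM(nn.Module):
    def __init__(self):
        super().__init__()
        self.decode = nn.Linear(COARSE_DIM + FINE_DIM, IMG_DIM)

    def forward(self, z_coarse, z_fine):
        return torch.tanh(self.decode(torch.cat([z_coarse, z_fine], dim=-1)))

class ToyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(IMG_DIM, COARSE_DIM + FINE_DIM)

    def forward(self, x):
        z = self.enc(x)
        return z[..., :COARSE_DIM], z[..., COARSE_DIM:]

def purify(x_adv, encoder, generator):
    """Autoencode the input but re-sample every fine latent from the prior."""
    with torch.no_grad():
        z_coarse, _ = encoder(x_adv)                     # keep class-relevant structure
        z_fine = torch.randn(x_adv.shape[0], FINE_DIM)   # discard & re-sample detail
        return generator(z_coarse, z_fine)

x_adv = torch.randn(4, IMG_DIM)                          # stand-in for attacked images
x_clean = purify(x_adv, ToyEncoder(), ToyMLVGM())
print(x_clean.shape)                                     # (4, 64)
```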
Recently, large vision-language models (LVLMs) have rapidly gained popularity for their strong generation and reasoning capabilities given diverse multimodal inputs. However, these models incur significant computational and memory overhead during inference, which greatly hinders efficient deployment in practical scenarios. The extensive key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost. Based on this, recent works have investigated ways to reduce the KV cache size for higher efficiency. Although effective, they generally overlook the distinct importance distributions of KV vectors across layers and maintain the same cache size for each layer during next-token prediction. This results in significant contextual information loss for certain layers, leading to notable performance decline. To address this, we present PrefixKV. It reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration. With an adaptive layer-wise KV retention recipe based on binary search, the maximum contextual information can thus be preserved in each layer, facilitating generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance compared with prior approaches. It exhibits a superior trade-off between inference efficiency and generation quality, showing promising potential for practical applications. Code is available at \url{this https URL}.
https://arxiv.org/abs/2412.03409
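A minimal sketch of the layer-wise KV retention idea: given per-layer importance scores for cached positions, a binary search over a global importance threshold decides how many entries each layer keeps, so layers get different cache sizes while the total stays within a budget. The scoring function and search details below are illustrative assumptions, not the released PrefixKV implementation.

```python
# Sketch of adaptive layer-wise KV retention via binary search on a global
# importance threshold: each layer keeps the cached positions whose score
# exceeds the threshold, so layers end up with different cache sizes while
# the total stays within a budget. Scores here are random placeholders.
import numpy as np

def layer_budgets(importance_per_layer, total_budget, iters=40):
    """importance_per_layer: list of 1-D arrays of per-position scores."""
    lo = min(s.min() for s in importance_per_layer)
    hi = max(s.max() for s in importance_per_layer)
    for _ in range(iters):
        mid = (lo + hi) / 2
        kept = sum(int((s >= mid).sum()) for s in importance_per_layer)
        if kept > total_budget:
            lo = mid          # threshold too permissive -> raise it
        else:
            hi = mid
    return [int((s >= hi).sum()) for s in importance_per_layer]

rng = np.random.default_rng(0)
scores = [rng.random(1024) ** (l + 1) for l in range(32)]   # 32 layers, 1024 positions
budgets = layer_budgets(scores, total_budget=8192)
print(sum(budgets), budgets[:4])   # total close to 8192, uneven across layers
```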
A key stumbling block in effective supply chain risk management for companies and policymakers is a lack of visibility into interdependent supply network relationships. Relationship prediction, also called link prediction, is an emerging area of supply chain surveillance research that aims to increase the visibility of supply chains using data-driven techniques. Existing methods have been successful at predicting relationships but struggle to extract the context in which these relationships are embedded - such as the products being supplied or the locations they are supplied from. Lack of context prevents practitioners from distinguishing transactional relations from established supply chain relations, hindering accurate estimations of risk. In this work, we develop a new Generative Artificial Intelligence (Gen AI) enhanced machine learning framework that leverages pre-trained language models as embedding models, combined with machine learning models, to predict supply chain relationships within knowledge graphs. By integrating Generative AI techniques, our approach captures the nuanced semantic relationships between entities, thereby improving supply chain visibility and facilitating more precise risk management. Using data from a real case study, we show that GenAI-enhanced link prediction surpasses all benchmarks, and we demonstrate how GenAI models can be explored and effectively used in supply chain risk management.
https://arxiv.org/abs/2412.03390
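One plausible reading of the pipeline, language-model embeddings of entity descriptions fed to a lightweight classifier for link prediction, is sketched below. The hash-based embed() function and the toy company pairs are stand-ins; the paper's actual embedding models, features, and case-study data are not reproduced here.

```python
# Sketch: embed textual descriptions of two companies, concatenate the
# vectors, and train a classifier to predict whether a supply relationship
# exists. The hash-based embed() is only a stand-in for a pre-trained
# language model encoder; no real supply-chain data is used here.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(text, dim=64):
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=dim)            # placeholder for an LM embedding

pairs = [("steel supplier in Ohio", "car assembly plant", 1),
         ("bakery in Lyon", "semiconductor fab", 0),
         ("lithium mine", "battery cell maker", 1),
         ("toy retailer", "copper smelter", 0)]

X = np.stack([np.concatenate([embed(a), embed(b)]) for a, b, _ in pairs])
y = np.array([label for _, _, label in pairs])

clf = LogisticRegression(max_iter=1000).fit(X, y)
query = np.concatenate([embed("nickel refinery"), embed("stainless steel mill")])
print(clf.predict_proba(query[None])[0])    # probability of a supply link
```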
Human speech exhibits rich and flexible prosodic variations. To address the one-to-many mapping problem from text to prosody in a reasonable and flexible manner, we propose DiffStyleTTS, a multi-speaker acoustic model based on a conditional diffusion module and an improved classifier-free guidance, which hierarchically models speech prosodic features and controls different prosodic styles to guide prosody prediction. Experiments show that our method outperforms all baselines in naturalness and achieves superior synthesis speed compared to three diffusion-based baselines. Additionally, by adjusting the guidance scale, DiffStyleTTS effectively controls the guidance intensity of the synthesized prosody.
https://arxiv.org/abs/2412.03388
Vision-language models (VLMs) have shown remarkable success across various multi-modal tasks, yet large VLMs encounter significant efficiency challenges due to processing numerous visual tokens. A promising approach to accelerating large VLM inference is using partial information, such as attention maps from specific layers, to assess token importance and prune less essential tokens. However, our study reveals three key insights: (i) Partial attention information is insufficient for accurately identifying critical visual tokens, resulting in suboptimal performance, especially at low token retention ratios; (ii) Global attention information, such as the attention map aggregated across all layers, more effectively preserves essential tokens and maintains comparable performance under aggressive pruning. However, obtaining the attention maps from all layers requires a full inference pass, which increases computational load and is therefore impractical in existing methods; and (iii) The global attention map aggregated from a small VLM closely resembles that of a large VLM, suggesting an efficient alternative. Based on these findings, we introduce a \textbf{training-free} method, \underline{\textbf{S}}mall VLM \underline{\textbf{G}}uidance for accelerating \underline{\textbf{L}}arge VLMs (\textbf{SGL}). Specifically, we employ the attention map aggregated from a small VLM to guide visual token pruning in a large VLM. Additionally, an early exiting mechanism is developed to make full use of the small VLM's predictions, dynamically invoking the larger VLM only when necessary, yielding a superior trade-off between accuracy and computation. Extensive evaluations across 11 benchmarks demonstrate the effectiveness and generalizability of SGL, achieving up to a 91\% pruning ratio for visual tokens while retaining competitive performance.
https://arxiv.org/abs/2412.03324
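A rough sketch of the guidance step: attention from the small VLM, aggregated over layers and heads, ranks visual tokens by how much attention the text tokens pay them, and only the top fraction is kept for the large VLM. The aggregation rule, token layout, and early-exit criterion below are simplified assumptions, not the SGL implementation.

```python
# Sketch of small-VLM-guided visual token pruning: aggregate the small
# model's attention over layers/heads, rank visual tokens by how much
# attention the text tokens pay them, and keep only the top fraction for
# the large model. An early exit skips the large model when the small
# model is already confident. Tensors are random placeholders.
import torch

def select_visual_tokens(attn_maps, num_text, keep_ratio=0.09):
    """attn_maps: (layers, heads, seq, seq); visual tokens precede text tokens."""
    agg = attn_maps.mean(dim=(0, 1))                 # average over layers and heads
    num_visual = agg.shape[0] - num_text
    text_to_visual = agg[num_visual:, :num_visual]   # rows: text queries
    scores = text_to_visual.sum(dim=0)               # importance per visual token
    k = max(1, int(keep_ratio * num_visual))
    return scores.topk(k).indices.sort().values      # indices to feed the large VLM

def early_exit(small_logits, threshold=0.8):
    return small_logits.softmax(-1).max().item() >= threshold

attn = torch.rand(24, 16, 640, 640)                  # small VLM: 576 visual + 64 text tokens
kept = select_visual_tokens(attn, num_text=64)
print(kept.numel(), "visual tokens kept")            # ~9% of 576
```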
Demand for bike sharing is impacted by various factors, such as weather conditions, events, and the availability of other transportation modes. This impact remains elusive due to the complex interdependence of these factors and location-related variations in user behavior. It is also unclear which factors provide additional information that is not already contained in the historical demand. Intermodal dependencies between bike sharing and other modes are also underexplored, and the value of this information has not been studied in degraded situations. The proposed study analyzes the impact of adding contextual data, such as weather, time embedding, and road traffic flow, to predict bike-sharing Origin-Destination (OD) flows in atypical weather situations. Our study highlights a mild relationship between the prediction quality of bike-sharing demand and road traffic flow, while the introduced time embedding allows outperforming state-of-the-art results, particularly in the case of degraded weather conditions. Including weather data as an additional input further improves our model with respect to the basic ST-ED-RMGC prediction model, reducing the prediction error by more than 20% in degraded weather conditions.
https://arxiv.org/abs/2412.03307
In this paper, we present DiffusionVLA, a novel framework that seamlessly combines an autoregressive model with a diffusion model for learning visuomotor policies. Central to our approach is a next-token prediction objective, enabling the model to reason effectively over the user's query in the context of current observations. Subsequently, a diffusion model is attached to generate robust action outputs. To enhance policy learning through self-reasoning, we introduce a novel reasoning injection module that integrates reasoning phrases directly into the policy learning process. The whole framework is simple and flexible, making it easy to deploy and upgrade. We conduct extensive experiments using multiple real robots to validate the effectiveness of DiffusionVLA. Our tests include a challenging factory sorting task, where DiffusionVLA successfully categorizes objects, including those not seen during training. We observe that the reasoning module makes the model interpretable: it allows observers to understand the model's thought process and identify potential causes of policy failures. Additionally, we test DiffusionVLA on a zero-shot bin-picking task, achieving 63.7\% accuracy on 102 previously unseen objects. Our method demonstrates robustness to visual changes, such as distractors and new backgrounds, and easily adapts to new embodiments. Furthermore, DiffusionVLA can follow novel instructions and retain conversational ability. Notably, DiffusionVLA is data-efficient and fast at inference; our smallest DiffusionVLA-2B runs at 82 Hz on a single A6000 GPU and can be trained from scratch on fewer than 50 demonstrations for a complex task. Finally, we scale the model from 2B to 72B parameters, showcasing improved generalization capabilities with increased model size.
https://arxiv.org/abs/2412.03293
In smart mobility, large networks of geographically distributed sensors produce vast amounts of high-frequency spatio-temporal data that must be processed in real time to avoid major disruptions. Traditional centralized approaches are increasingly unsuitable for this task, as they struggle to scale with expanding sensor networks, and reliability issues in central components can easily affect the whole deployment. To address these challenges, we explore and adapt semi-decentralized training techniques for Spatio-Temporal Graph Neural Networks (ST-GNNs) in the smart mobility domain. We implement a simulation framework where sensors are grouped by proximity into multiple cloudlets, each handling a subgraph of the traffic graph, fetching node features from other cloudlets to train its own local ST-GNN model, and exchanging model updates with other cloudlets to ensure consistency, enhancing scalability and removing reliance on a centralized aggregator. We perform an extensive comparative evaluation of four different ST-GNN training setups -- centralized, traditional FL, server-free FL, and Gossip Learning -- on two large-scale traffic datasets, METR-LA and PeMS-BAY, for short-, mid-, and long-term vehicle speed prediction. Experimental results show that semi-decentralized setups are comparable to centralized approaches in performance metrics, while offering advantages in terms of scalability and fault tolerance. In addition, we highlight often overlooked issues in the existing literature on distributed ST-GNNs, such as the variation in model performance across different geographical areas due to region-specific traffic patterns, and the significant communication overhead and computational costs that arise from the large receptive field of GNNs, leading to substantial data transfers and increased computation of partial embeddings.
https://arxiv.org/abs/2412.03188
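The "exchanging model updates with other cloudlets" step can be illustrated with a simple gossip-style parameter-averaging round. The tiny local models and the ring topology below are toy placeholders, not the actual cloudlet simulation framework or its ST-GNN architecture.

```python
# Sketch of one gossip round between cloudlets: after local training on its
# subgraph, each cloudlet averages parameters with a randomly chosen
# neighbor, removing the need for a central aggregator. The tiny models and
# the ring topology are toy stand-ins.
import random
import torch
import torch.nn as nn

def gossip_round(models, neighbors):
    for i, model in enumerate(models):
        j = random.choice(neighbors[i])
        with torch.no_grad():
            for p_i, p_j in zip(model.parameters(), models[j].parameters()):
                avg = (p_i + p_j) / 2
                p_i.copy_(avg)
                p_j.copy_(avg)

num_cloudlets = 4
models = [nn.Linear(8, 1) for _ in range(num_cloudlets)]            # local ST-GNN stand-ins
neighbors = {i: [(i - 1) % num_cloudlets, (i + 1) % num_cloudlets]  # ring of cloudlets
             for i in range(num_cloudlets)}
gossip_round(models, neighbors)
print(models[0].weight[0, :3])   # parameters now mixed with a neighbor's
```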
Multi-Task Learning (MTL) involves the concurrent training of multiple tasks, offering notable advantages for dense prediction tasks in computer vision. MTL not only reduces training and inference time compared to having multiple single-task models, but also enhances task accuracy through the interaction of multiple tasks. However, existing methods face limitations. They often rely on suboptimal cross-task interactions, resulting in task-specific predictions with poor geometric and predictive coherence. In addition, many approaches use inadequate loss weighting strategies, which do not address the inherent variability in task evolution during training. To overcome these challenges, we propose an advanced MTL model specifically designed for dense vision tasks. Our model leverages state-of-the-art vision transformers with task-specific decoders. To enhance cross-task coherence, we introduce a trace-back method that improves both cross-task geometric and predictive features. Furthermore, we present a novel dynamic task balancing approach that projects task losses onto a common scale and prioritizes more challenging tasks during training. Extensive experiments demonstrate the superiority of our method, establishing new state-of-the-art performance across two benchmark datasets. The code is available at: this https URL
https://arxiv.org/abs/2412.03179
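The abstract does not spell out the exact balancing rule, so the sketch below shows one plausible realization: normalize each task loss by its initial value to put losses on a common scale, then up-weight the tasks whose normalized loss is currently largest (i.e., the harder tasks). This is an illustrative assumption, not the paper's method.

```python
# Sketch of dynamic task balancing: losses are projected onto a common
# scale by dividing by their initial values, and weights are a softmax over
# the normalized losses so harder (slower-improving) tasks get larger
# weights. One plausible scheme, not the paper's exact rule.
import torch

class DynamicTaskBalancer:
    def __init__(self, temperature=1.0):
        self.initial = None
        self.temperature = temperature

    def weights(self, losses):
        losses = torch.stack([l.detach() for l in losses])
        if self.initial is None:
            self.initial = losses.clone()
        normalized = losses / (self.initial + 1e-8)       # common scale
        w = torch.softmax(normalized / self.temperature, dim=0)
        return w * len(losses)                            # keep weights summing to num tasks

balancer = DynamicTaskBalancer()
seg_loss, depth_loss, normal_loss = (torch.tensor(v) for v in (1.2, 0.8, 0.5))
w = balancer.weights([seg_loss, depth_loss, normal_loss])
total = (w * torch.stack([seg_loss, depth_loss, normal_loss])).sum()
print(w, total)
```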
Deep Learning has shown outstanding results in computer vision tasks; healthcare is no exception. However, there is no straightforward way to expose the decision-making process of DL models. Good accuracy is not enough for skin cancer predictions. Understanding the model's behavior is crucial for clinical application and reliable outcomes. In this work, we identify desiderata for explanations in skin-lesion models. We analyzed seven methods, four based on pixel-attribution (Grad-CAM, Score-CAM, LIME, SHAP) and three on high-level concepts (ACE, ICE, CME), for a deep neural network trained on the International Skin Imaging Collaboration Archive. Our findings indicate that while these techniques reveal biases, there is room for improving the comprehensiveness of explanations to achieve transparency in skin-lesion models.
https://arxiv.org/abs/2412.03166
Text serves as the key control signal in video generation due to its narrative nature. To render text descriptions into video clips, current video diffusion models borrow features from text encoders yet struggle with limited text comprehension. The recent success of large language models (LLMs) showcases the power of decoder-only transformers, which offers three clear benefits for text-to-video (T2V) generation, namely, precise text understanding resulting from the superior scalability, imagination beyond the input text enabled by next token prediction, and flexibility to prioritize user interests through instruction tuning. Nevertheless, the feature distribution gap emerging from the two different text modeling paradigms hinders the direct use of LLMs in established T2V models. This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully tailored token fuser to harmonize the outputs from text encoders and LLMs. Such a design allows the T2V model to fully leverage learned video priors while capitalizing on the text-related capability of LLMs. Extensive quantitative and qualitative results demonstrate the effectiveness of Mimir in generating high-quality videos with excellent text comprehension, especially when processing short captions and managing shifting motions. Project page: this https URL
https://arxiv.org/abs/2412.03085
Strategy learning in multi-agent game environments is a challenging problem. Since each agent's reward is determined by the joint strategy, a greedy learning strategy that aims to maximize its own reward may fall into a local optimum. Recent studies have proposed opponent modeling and shaping methods for game environments. These methods enhance the efficiency of strategy learning by modeling the strategies and update processes of other agents. However, they often rely on simple predictions of opponent strategy changes. Because they do not model behavioral preferences such as cooperation and competition, they are usually applicable only to predefined scenarios and lack generalization capabilities. In this paper, we propose a novel Preference-based Opponent Shaping (PBOS) method to enhance the strategy learning process by shaping agents' preferences towards cooperation. We introduce a preference parameter, which is incorporated into the agent's loss function, thus allowing the agent to directly consider the opponent's loss function when updating its strategy. We update the preference parameters concurrently with strategy learning to ensure that agents can adapt to any cooperative or competitive game environment. Through a series of experiments, we verify the performance of the PBOS algorithm in a variety of differentiable games. The experimental results show that PBOS can guide the agent to learn appropriate preference parameters, so as to achieve better reward distribution in multiple game environments.
https://arxiv.org/abs/2412.03072
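A tiny illustration of the preference idea: each agent's shaped loss adds a learnable preference weight times the opponent's loss, and the preference is updated alongside the strategy. The bilinear-quadratic game and the one-step-lookahead preference update below are toy assumptions, not the paper's exact algorithm.

```python
# Toy sketch of preference-based shaping in a differentiable game: agent 1
# minimizes its own loss plus a learnable preference weight times agent 2's
# loss, and the preference is tuned with a one-step lookahead so that it
# ends up helping agent 1's own objective.
import torch

x = torch.tensor(0.5, requires_grad=True)    # agent 1 strategy
y = torch.tensor(-0.3, requires_grad=True)   # agent 2 strategy
w1 = torch.tensor(0.0, requires_grad=True)   # agent 1's preference toward agent 2
lr = 0.1

def losses(a, b):
    l1 = (a - b) ** 2 + 0.1 * a ** 2          # agent 1 wants to match agent 2
    l2 = (a + b) ** 2 + 0.1 * b ** 2          # agent 2 wants to mirror agent 1
    return l1, l2

for step in range(200):
    l1, l2 = losses(x, y)
    shaped = l1 + w1 * l2                                      # preference-shaped loss
    gx, = torch.autograd.grad(shaped, x, create_graph=True)
    gy, = torch.autograd.grad(l2, y, create_graph=True)
    x_next, y_next = x - lr * gx, y - lr * gy                  # virtual lookahead step
    l1_next, _ = losses(x_next, y_next)
    gw, = torch.autograd.grad(l1_next, w1)                     # does the preference help me later?
    with torch.no_grad():
        x -= lr * gx
        y -= lr * gy
        w1 -= lr * gw

print(float(x), float(y), float(w1))
```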
Transformer-based architectures have achieved unprecedented success in time series analysis. However, when facing the challenge of cross-domain modeling, existing approaches that utilize statistical priors as prompt engineering fail under the large distribution shifts among various domains. In this paper, a Unified Time Series Diffusion (UTSD) model is established for the first time to model the multi-domain probability distribution, utilizing the powerful distribution modeling ability of diffusion models. Unlike autoregressive models that capture the conditional probabilities of the prediction horizon given the historical sequence, we use a diffusion denoising process to model the mixture distribution of the cross-domain data and generate the prediction sequence for the target domain directly via conditional sampling. The proposed UTSD contains three pivotal designs: (1) a condition network that captures the multi-scale fluctuation patterns from the observation sequence, which are utilized as context representations to guide the denoising network in generating the prediction sequence; (2) an adapter-based fine-tuning strategy, in which the multi-domain universal representation learned in the pre-training stage is utilized for downstream tasks in target domains; and (3) a diffusion and denoising process in the actual sequence space, combined with an improved classifier-free guidance as the conditional generation strategy, which greatly improves the stability and accuracy of downstream tasks. We conduct extensive experiments on mainstream benchmarks, and the pre-trained UTSD outperforms existing foundation models on all data domains, exhibiting superior zero-shot generalization ability. After training from scratch, UTSD achieves comparable performance against domain-specific proprietary models. The empirical results validate the potential of UTSD as a time series foundation model.
https://arxiv.org/abs/2412.03068
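The conditional-generation step can be sketched as a standard classifier-free-guidance denoising loop, where a context embedding of the observed history guides the denoiser. The tiny denoiser, the crude Euler-style update, and all shapes below are placeholders, not the UTSD architecture or noise schedule.

```python
# Sketch of conditional sampling with classifier-free guidance for a
# time-series diffusion model: at each reverse step the denoiser is run
# with and without the context (history) embedding, and the two noise
# predictions are blended by a guidance scale.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    def __init__(self, horizon=24, ctx_dim=16):
        super().__init__()
        self.net = nn.Linear(horizon + ctx_dim + 1, horizon)

    def forward(self, x_t, t, context):
        t_feat = torch.full((x_t.shape[0], 1), float(t))
        return self.net(torch.cat([x_t, context, t_feat], dim=-1))  # predicts noise

@torch.no_grad()
def sample(denoiser, context, horizon=24, steps=50, guidance=3.0):
    x = torch.randn(context.shape[0], horizon)
    null_ctx = torch.zeros_like(context)              # "unconditional" context
    for t in reversed(range(steps)):
        eps_c = denoiser(x, t, context)
        eps_u = denoiser(x, t, null_ctx)
        eps = eps_u + guidance * (eps_c - eps_u)      # classifier-free guidance
        x = x - eps / steps                           # crude Euler-style update
    return x

ctx = torch.randn(2, 16)                              # context from the condition network
print(sample(ToyDenoiser(), ctx).shape)               # (2, 24)
```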
Accurate video prediction by deep neural networks, especially for dynamic regions, is a challenging task in computer vision for critical applications such as autonomous driving, remote working, and telemedicine. Due to inherent uncertainties, existing prediction models often struggle with the complexity of motion dynamics and occlusions. In this paper, we propose a novel stochastic long-term video prediction model that focuses on dynamic regions by employing a hybrid warping strategy. By integrating frames generated through forward and backward warpings, our approach effectively compensates for the weaknesses of each technique, improving the prediction accuracy and realism of moving regions in videos while also addressing uncertainty by making stochastic predictions that account for various motions. Furthermore, considering real-time predictions, we introduce a MobileNet-based lightweight architecture into our model. Our model, called SVPHW, achieves state-of-the-art performance on two benchmark datasets.
https://arxiv.org/abs/2412.03061
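The hybrid warping step, blending two warped frame candidates with a soft mask, might look roughly like the sketch below. True forward warping is a splatting operation; for brevity both candidates here are produced by grid_sample-based (backward) warping, and the flows and mask are random placeholders rather than network outputs.

```python
# Sketch of blending two warped frame candidates with a per-pixel mask, in
# the spirit of the hybrid warping strategy described above.
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """frame: (B, C, H, W); flow: (B, 2, H, W) in pixels."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().unsqueeze(0).expand(b, -1, -1, -1)
    grid = grid + flow.permute(0, 2, 3, 1)
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1      # normalize to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1
    return F.grid_sample(frame, grid, align_corners=True)

b, c, h, w = 1, 3, 64, 64
prev_frame, next_ref = torch.rand(b, c, h, w), torch.rand(b, c, h, w)
flow_a, flow_b = torch.randn(b, 2, h, w), torch.randn(b, 2, h, w)
mask = torch.sigmoid(torch.randn(b, 1, h, w))          # per-pixel blending weight

candidate_a = warp(prev_frame, flow_a)
candidate_b = warp(next_ref, flow_b)
prediction = mask * candidate_a + (1 - mask) * candidate_b
print(prediction.shape)                                 # (1, 3, 64, 64)
```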
Unsupervised 3D representation learning via mask-and-reconstruct training with differentiable rendering is promising for reducing the labeling burden of fusion 3D perception. However, previous works conduct pre-training for each modality separately because of the high GPU memory consumption. Consequently, the interaction between the two modalities (images and point clouds) is neglected during pre-training. In this paper, we explore joint unsupervised pre-training for fusion 3D perception via differentiable rendering and propose CLAP, short for Curvature sampLing and swApping Prototype assignment prediction. The contributions are three-fold. 1) To overcome the GPU memory consumption problem, we propose Curvature Sampling to sample the more informative points/pixels for pre-training. 2) We propose to use learnable prototypes to represent parts of the scenes in a common feature space and bring in the idea of swapping prototype assignment prediction to learn the interaction between the two modalities. 3) To further optimize the learnable prototypes, we propose an Expectation-Maximization training scheme to maximize the similarity between embeddings and prototypes, followed by a Gram Matrix Regularization Loss to avoid collapse. Experimental results on NuScenes show that CLAP achieves a 300% larger performance gain compared to the previous SOTA 3D pre-training method via differentiable rendering. Codes and models will be released.
https://arxiv.org/abs/2412.03059
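The swapped prototype assignment idea, predicting the point-cloud branch's prototype assignment from the image branch and vice versa over shared learnable prototypes, can be sketched as below. The plain softmax assignments (rather than any balanced assignment or EM scheme) and the random embeddings are simplifying assumptions.

```python
# Sketch of swapped prototype assignment prediction between the image and
# point-cloud branches: each branch's soft assignment over shared learnable
# prototypes is used as the target for the other branch.
import torch
import torch.nn.functional as F

num_prototypes, dim = 32, 128
prototypes = torch.nn.Parameter(torch.randn(num_prototypes, dim))

def assignments(features, temperature=0.1):
    feats = F.normalize(features, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    return feats @ protos.t() / temperature            # logits over prototypes

img_feats = torch.randn(256, dim)                       # sampled pixel embeddings
pts_feats = torch.randn(256, dim)                       # corresponding point embeddings

img_logits, pts_logits = assignments(img_feats), assignments(pts_feats)
img_targets = img_logits.softmax(dim=-1).detach()       # "codes" from the image branch
pts_targets = pts_logits.softmax(dim=-1).detach()

swap_loss = (F.cross_entropy(img_logits, pts_targets)   # image predicts point codes
             + F.cross_entropy(pts_logits, img_targets)) / 2
swap_loss.backward()
print(float(swap_loss), prototypes.grad.shape)
```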
Portfolio management is an important yet challenging task in AI for FinTech, which aims to allocate investors' budgets among different assets to balance the risk and return of an investment. In this study, we propose a general Multi-objectIve framework with controLLable rIsk for pOrtfolio maNagement (MILLION), which consists of two main phases, i.e., return-related maximization and risk control. Specifically, in the return-related maximization phase, we introduce two auxiliary objectives, i.e., return rate prediction and return rate ranking, combined with portfolio optimization to mitigate the overfitting problem and improve the generalization of the trained model to future markets. Subsequently, in the risk control phase, we propose two methods, i.e., portfolio interpolation and portfolio improvement, to achieve fine-grained risk control and fast risk adaptation to a user-specified risk level. For the portfolio interpolation method, we theoretically prove that the risk can be perfectly controlled if the to-be-set risk level is in a proper interval. In addition, we also show that the return rate of the adjusted portfolio after portfolio interpolation is no less than that of the min-variance optimization, as long as the model in the reward maximization phase is effective. Furthermore, the portfolio improvement method can achieve greater return rates while keeping the same risk level compared to portfolio interpolation. Extensive experiments are conducted on three real-world datasets. The results demonstrate the effectiveness and efficiency of the proposed framework.
https://arxiv.org/abs/2412.03038
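The portfolio interpolation step can be sketched as blending the learned portfolio with the minimum-variance portfolio and bisecting the blend coefficient until the portfolio variance matches a user-specified risk level. The covariance matrix, weights, and bisection routine below are toy illustrations, not the paper's data or closed-form solution.

```python
# Sketch of portfolio interpolation for risk control: interpolate between
# the learned portfolio and the minimum-variance portfolio, and bisect the
# interpolation coefficient until the portfolio variance hits the
# user-specified risk level.
import numpy as np

def min_variance_portfolio(cov):
    inv = np.linalg.inv(cov)
    ones = np.ones(cov.shape[0])
    w = inv @ ones
    return w / w.sum()

def risk(w, cov):
    return float(w @ cov @ w)

def interpolate_to_risk(w_model, cov, target_risk, iters=60):
    w_mv = min_variance_portfolio(cov)
    lo, hi = 0.0, 1.0                      # alpha = 0 -> pure min-variance portfolio
    for _ in range(iters):
        alpha = (lo + hi) / 2
        w = alpha * w_model + (1 - alpha) * w_mv
        if risk(w, cov) > target_risk:
            hi = alpha                     # too risky -> move toward min-variance
        else:
            lo = alpha
    return alpha * w_model + (1 - alpha) * w_mv

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
cov = A @ A.T / 5 + 0.1 * np.eye(5)        # toy positive-definite covariance
w_model = np.array([0.4, 0.3, 0.2, 0.05, 0.05])
w = interpolate_to_risk(w_model, cov, target_risk=0.2)
print(w.round(3), risk(w, cov))
```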
Molecular Relational Learning (MRL) is a rapidly growing field that focuses on understanding the interaction dynamics between molecules, which is crucial for applications ranging from catalyst engineering to drug discovery. Despite recent progress, earlier MRL approaches are limited to using only the 2D topological structure of molecules, as obtaining the 3D interaction geometry remains prohibitively expensive. This paper introduces a novel 3D geometric pre-training strategy for MRL (3DMRL) that incorporates a 3D virtual interaction environment, overcoming the limitations of costly traditional quantum mechanical calculation methods. With the constructed 3D virtual interaction environment, 3DMRL trains a 2D MRL model to learn the overall 3D geometric information of molecular interactions through contrastive learning. Moreover, fine-grained interactions between molecules are learned through a force prediction loss, which is crucial for understanding the wide range of molecular interaction processes. Extensive experiments on various tasks using real-world datasets, including out-of-distribution and extrapolation scenarios, demonstrate the effectiveness of 3DMRL, showing up to a 24.93\% improvement in performance across 40 tasks.
https://arxiv.org/abs/2412.02957
Existing works typically treat spatial-temporal prediction as the task of learning a function $F$ to transform historical observations to future observations. We further decompose this cross-time transformation into three processes: (1) Encoding ($E$): learning the intrinsic representation of observations, (2) Cross-Time Mapping ($M$): transforming past representations into future representations, and (3) Decoding ($D$): reconstructing future observations from the future representations. From this perspective, spatial-temporal prediction can be viewed as learning $F = E \cdot M \cdot D$, which includes learning the space transformations $\left\{{E},{D}\right\}$ between the observation space and the hidden representation space, as well as the spatial-temporal mapping $M$ from past states to future states within the representation space. This leads to two key questions: \textbf{Q1: What kind of representation space allows for mapping the past to the future? Q2: How can the past be mapped to the future within the representation space?} To address Q1, we propose a Spatial-Temporal Backdoor Adjustment strategy, which learns a Spatial-Temporal De-Confounded (STDC) representation space and estimates the de-confounding causal effect of historical data on future data. The captured causal relationship serves as the foundation for subsequent spatial-temporal mapping. To address Q2, we design a Spatial-Temporal Embedding (STE) that fuses the information of temporal and spatial confounders, capturing the intrinsic spatial-temporal characteristics of the representations. Additionally, we introduce a Cross-Time Attention mechanism, which queries the attention between the future and the past to guide spatial-temporal mapping.
https://arxiv.org/abs/2412.02942
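The encode-map-decode decomposition with a cross-time attention mapping might be organized roughly as below. The module sizes, the learned future query tokens, the single attention layer, and the omitted spatial-temporal embedding and backdoor adjustment are simplifying assumptions, not the paper's architecture.

```python
# Sketch of the E / M / D decomposition: an encoder E maps past
# observations to representations, a cross-time attention module M lets
# learned future query tokens attend to the past representations, and a
# decoder D reconstructs future observations.
import torch
import torch.nn as nn

class CrossTimePredictor(nn.Module):
    def __init__(self, num_nodes, future_len, dim=64):
        super().__init__()
        self.encoder = nn.Linear(num_nodes, dim)                 # E
        self.future_queries = nn.Parameter(torch.randn(future_len, dim))
        self.cross_time = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.decoder = nn.Linear(dim, num_nodes)                 # D

    def forward(self, past):                                     # past: (B, T_past, N)
        h_past = self.encoder(past)                              # (B, T_past, dim)
        q = self.future_queries.expand(past.shape[0], -1, -1)    # (B, T_future, dim)
        h_future, _ = self.cross_time(q, h_past, h_past)         # M: future attends to past
        return self.decoder(h_future)                            # (B, T_future, N)

model = CrossTimePredictor(num_nodes=207, future_len=12)
print(model(torch.randn(8, 12, 207)).shape)                      # (8, 12, 207)
```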
For decades, corporations and governments have relied on scanned documents to record vast amounts of information. However, extracting this information is a slow and tedious process due to the overwhelming amount of documents. The rise of vision language models presents a way to efficiently and accurately extract the information out of these documents. The current automated workflow often requires a two-step approach involving the extraction of information using optical character recognition software, and subsequent usage of large language models for processing this information. Unfortunately, these methods encounter significant challenges when dealing with noisy scanned documents. The high information density of such documents often necessitates using computationally expensive language models to effectively reduce noise. In this study, we propose PatchFinder, an algorithm that builds upon Vision Language Models (VLMs) to address the information extraction task. First, we devise a confidence-based score, called Patch Confidence, based on the Maximum Softmax Probability of the VLMs' output to measure the model's confidence in its predictions. Then, PatchFinder utilizes that score to determine a suitable patch size, partition the input document into overlapping patches of that size, and generate confidence-based predictions for the target information. Our experimental results show that PatchFinder can leverage Phi-3v, a 4.2 billion parameter vision language model, to achieve an accuracy of 94% on our dataset of 190 noisy scanned documents, surpassing the performance of ChatGPT-4o by 18.5 percentage points.
https://arxiv.org/abs/2412.02886
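The patch-confidence idea, a maximum-softmax-probability-style score over a patch's predicted tokens used to compare predictions, can be sketched as follows. The abstract describes using the score both to pick a patch size and to rank predictions; the sketch below only shows ranking overlapping patches by confidence, and the patch splitting and the dummy scoring function stand in for real VLM calls.

```python
# Sketch of PatchFinder-style confidence-based selection: split a scanned
# page into overlapping patches, score each candidate answer by the mean
# max-softmax probability of its predicted tokens, and keep the most
# confident prediction. dummy_vlm() is a stand-in for real VLM inference.
import numpy as np

def overlapping_patches(image, patch, overlap=0.5):
    h, w = image.shape[:2]
    step = max(1, int(patch * (1 - overlap)))
    for y in range(0, max(1, h - patch + 1), step):
        for x in range(0, max(1, w - patch + 1), step):
            yield image[y:y + patch, x:x + patch]

def patch_confidence(token_logits):
    """Mean of per-token max softmax probabilities for the generated answer."""
    probs = np.exp(token_logits - token_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return float(probs.max(axis=-1).mean())

def dummy_vlm(patch_img):
    """Stand-in for a VLM call: returns (answer, per-token logits)."""
    rng = np.random.default_rng(int(patch_img.sum()) % 2**32)
    return "1957-03-14", rng.normal(size=(10, 5000))

page = np.random.rand(2048, 1536)
candidates = []
for p in overlapping_patches(page, patch=768):
    answer, logits = dummy_vlm(p)
    candidates.append((patch_confidence(logits), answer))

best_conf, best_answer = max(candidates)
print(best_answer, round(best_conf, 3))
```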
We introduce a causal modeling framework that captures the input-output behavior of predictive models (e.g., machine learning models) by representing it using causal graphs. The framework enables us to define and identify features that directly cause the predictions, which has broad implications for data collection and model evaluation. We show two assumptions under which the direct causes can be discovered from data, one of which further simplifies the discovery process. In addition to providing sound and complete algorithms, we propose an optimization technique based on an independence rule that can be integrated with the algorithms to speed up the discovery process both theoretically and empirically.
https://arxiv.org/abs/2412.02878