Efficiently transferring Learned Image Compression (LIC) models from human perception to machine perception is an emerging challenge in vision-centric representation learning. Existing approaches typically adapt LIC to downstream tasks in a single-task manner, which is inefficient, lacks task interaction, and results in multiple task-specific bitstreams. To address these limitations, we propose an asymmetric adaptor framework that supports multi-task adaptation within a single model. Our method introduces a shared adaptor to learn general semantic features and task-specific adaptors to preserve task-level distinctions. With only lightweight plug-in modules and a frozen base codec, our method achieves strong performance across multiple tasks while maintaining compression efficiency. Experiments on the PASCAL-Context benchmark demonstrate that our method outperforms both fully fine-tuned and other parameter-efficient fine-tuned (PEFT) baselines, validating the effectiveness of multi-vision transferring.
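The abstract does not spell out the adaptor layout; a minimal PyTorch sketch of the asymmetric idea, assuming residual bottleneck adaptors and placeholder feature shapes (module names and dimensions are illustrative, not the paper's), might look like this:

```python
import torch
import torch.nn as nn

class BottleneckAdaptor(nn.Module):
    """Lightweight residual adaptor: down-project -> nonlinearity -> up-project."""
    def __init__(self, dim: int, hidden: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class AsymmetricAdaptorBlock(nn.Module):
    """Frozen-codec features pass through one shared adaptor and one task-specific adaptor."""
    def __init__(self, dim: int, tasks: list[str]):
        super().__init__()
        self.shared = BottleneckAdaptor(dim)                       # general semantic features
        self.task_specific = nn.ModuleDict(                        # task-level distinctions
            {t: BottleneckAdaptor(dim) for t in tasks}
        )

    def forward(self, feats: torch.Tensor, task: str) -> torch.Tensor:
        return self.task_specific[task](self.shared(feats))

# Usage: adapt frozen-codec features for two hypothetical tasks from one bitstream.
feats = torch.randn(4, 196, 192)        # (batch, tokens, channels) from a frozen LIC encoder
block = AsymmetricAdaptorBlock(dim=192, tasks=["segmentation", "detection"])
seg_feats = block(feats, "segmentation")
det_feats = block(feats, "detection")
```

Only the adaptor parameters would be trained in such a setup; the base codec stays frozen and all tasks share a single bitstream.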
https://arxiv.org/abs/2504.12997
Contrastive video-language pretraining has demonstrated great success in learning rich and robust video representations. However, deploying such video encoders on compute-constrained edge devices remains challenging due to their high computational demands. Additionally, existing models are typically trained to process only short video clips, often limited to 4 to 64 frames. In this paper, we introduce AdaVid, a flexible architectural framework designed to learn efficient video encoders that can dynamically adapt their computational footprint based on available resources. At the heart of AdaVid is an adaptive transformer block, inspired by Matryoshka Representation Learning, which allows the model to adjust its hidden embedding dimension at inference time. We show that AdaVid-EgoVLP, trained on video-narration pairs from the large-scale Ego4D dataset, matches the performance of the standard EgoVLP on short video-language benchmarks using only half the compute, and even outperforms EgoVLP when given equal computational resources. We further explore the trade-off between frame count and compute on the challenging Diving48 classification benchmark, showing that AdaVid enables the use of more frames without exceeding computational limits. To handle longer videos, we also propose a lightweight hierarchical network that aggregates short clip features, achieving a strong balance between compute efficiency and accuracy across several long video benchmarks.
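The adaptive transformer block is described only at a high level; the sketch below illustrates the Matryoshka-style trick of running a layer on a prefix slice of the hidden dimension at inference time (the slicing rule and layer internals are assumptions, not AdaVid's exact design):

```python
import torch
import torch.nn as nn

class MatryoshkaLinear(nn.Module):
    """Linear layer whose leading sub-matrix can be used on a sliced (prefix) embedding."""
    def __init__(self, max_dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_dim, max_dim) / max_dim ** 0.5)
        self.bias = nn.Parameter(torch.zeros(max_dim))

    def forward(self, x: torch.Tensor, d: int) -> torch.Tensor:
        # Use only the top-left d x d block so compute scales with the chosen width.
        return x[..., :d] @ self.weight[:d, :d].T + self.bias[:d]

max_dim = 768
layer = MatryoshkaLinear(max_dim)
tokens = torch.randn(2, 64, max_dim)     # (batch, frame/patch tokens, embedding)

full = layer(tokens, d=max_dim)          # full-capacity inference
half = layer(tokens, d=max_dim // 2)     # roughly 4x fewer FLOPs in this layer at half width
print(full.shape, half.shape)            # torch.Size([2, 64, 768]) torch.Size([2, 64, 384])
```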
https://arxiv.org/abs/2504.12513
Text-attributed graphs (TAGs) present unique challenges in representation learning by requiring models to capture both the semantic richness of node-associated texts and the structural dependencies of the graph. While graph neural networks (GNNs) excel at modeling topological information, they lack the capacity to process unstructured text. Conversely, large language models (LLMs) are proficient in text understanding but are typically unaware of graph structure. In this work, we propose BiGTex (Bidirectional Graph Text), a novel architecture that tightly integrates GNNs and LLMs through stacked Graph-Text Fusion Units. Each unit allows for mutual attention between textual and structural representations, enabling information to flow in both directions, text influencing structure and structure guiding textual interpretation. The proposed architecture is trained using parameter-efficient fine-tuning (LoRA), keeping the LLM frozen while adapting to task-specific signals. Extensive experiments on five benchmark datasets demonstrate that BiGTex achieves state-of-the-art performance in node classification and generalizes effectively to link prediction. An ablation study further highlights the importance of soft prompting and bi-directional attention in the model's success.
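The exact wiring of a Graph-Text Fusion Unit is not given in the abstract; a minimal sketch of mutual cross-attention between text tokens and node embeddings, with assumed shapes and the LLM/GNN encoders stubbed out, could look like:

```python
import torch
import torch.nn as nn

class GraphTextFusionUnit(nn.Module):
    """Mutual cross-attention: text attends to nodes, nodes attend to text."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.text_to_graph = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.graph_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_g = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, nodes: torch.Tensor):
        # Text queries structural context; structure queries textual context.
        t_upd, _ = self.text_to_graph(query=text, key=nodes, value=nodes)
        g_upd, _ = self.graph_to_text(query=nodes, key=text, value=text)
        return self.norm_t(text + t_upd), self.norm_g(nodes + g_upd)

text = torch.randn(1, 128, 256)    # (batch, text tokens, dim) from a LoRA-adapted, frozen LLM
nodes = torch.randn(1, 32, 256)    # (batch, graph nodes, dim) from a GNN
for unit in [GraphTextFusionUnit(256) for _ in range(3)]:   # stacked fusion units
    text, nodes = unit(text, nodes)
```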
https://arxiv.org/abs/2504.12474
Spatiotemporal learning is challenging due to the intricate interplay between spatial and temporal dependencies, the high dimensionality of the data, and scalability constraints. These challenges are further amplified in scientific domains, where data is often irregularly distributed (e.g., missing values from sensor failures) and high-volume (e.g., high-fidelity simulations), posing additional computational and modeling difficulties. In this paper, we present SCENT, a novel framework for scalable and continuity-informed spatiotemporal representation learning. SCENT unifies interpolation, reconstruction, and forecasting within a single architecture. Built on a transformer-based encoder-processor-decoder backbone, SCENT introduces learnable queries to enhance generalization and a query-wise cross-attention mechanism to effectively capture multi-scale dependencies. To ensure scalability in both data size and model complexity, we incorporate a sparse attention mechanism, enabling flexible output representations and efficient evaluation at arbitrary resolutions. We validate SCENT through extensive simulations and real-world experiments, demonstrating state-of-the-art performance across multiple challenging tasks while achieving superior scalability.
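As a rough illustration of query-wise cross-attention with learnable queries and arbitrary-resolution outputs, the sketch below decodes field values at arbitrary (x, y, t) coordinates; the way learnable latents are combined with coordinate queries here is an assumption, not SCENT's actual design:

```python
import torch
import torch.nn as nn

class QueryDecoder(nn.Module):
    """Decode a spatiotemporal field at arbitrary coordinates via query-wise cross-attention."""
    def __init__(self, dim: int = 128, n_latents: int = 64, heads: int = 4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim) * 0.02)  # learnable query/context tokens
        self.coord_embed = nn.Linear(3, dim)                             # (x, y, t) -> token
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)                                    # predicted field value

    def forward(self, encoded: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # encoded: (B, N, dim) processor output; coords: (B, M, 3) arbitrary query points.
        q = self.coord_embed(coords)
        lat = self.latents.unsqueeze(0).expand(encoded.size(0), -1, -1)
        kv = torch.cat([encoded, lat], dim=1)                            # attend over data + learnable tokens
        out, _ = self.cross_attn(query=q, key=kv, value=kv)
        return self.head(out).squeeze(-1)                                # (B, M) values at any resolution

decoder = QueryDecoder()
encoded = torch.randn(2, 256, 128)   # latent tokens from the encoder/processor
coords = torch.rand(2, 1000, 3)      # 1000 query points per sample (interpolation or forecasting)
values = decoder(encoded, coords)
```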
https://arxiv.org/abs/2504.12262
Vision foundation models in remote sensing have been extensively studied due to their superior generalization on various downstream tasks. Synthetic Aperture Radar (SAR) offers all-day, all-weather imaging capabilities, providing significant advantages for Earth observation. However, establishing a foundation model for SAR image interpretation inevitably encounters the challenges of insufficient information utilization and poor interpretability. In this paper, we propose a remote sensing foundation model based on complex-valued SAR data, which simulates the polarimetric decomposition process for pre-training, i.e., characterizing pixel scattering intensity as a weighted combination of scattering bases and scattering coefficients, thereby endowing the foundation model with physical interpretability. Specifically, we construct a series of scattering queries, each representing an independent and meaningful scattering basis, which interact with SAR features in the scattering query decoder and output the corresponding scattering coefficient. To guide the pre-training process, polarimetric decomposition loss and power self-supervision loss are constructed. The former aligns the predicted coefficients with Yamaguchi coefficients, while the latter reconstructs power from the predicted coefficients and compares it to the input image's power. The performance of our foundation model is validated on six typical downstream tasks, achieving state-of-the-art results. Notably, the foundation model can extract stable feature representations and exhibits strong generalization, even in data-scarce conditions.
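The power self-supervision term can be pictured as reconstructing per-pixel power from the predicted scattering coefficients and comparing it to the input power; the sketch below uses placeholder scattering bases and an MSE penalty, which may differ from the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def power_selfsupervision_loss(pred_coeffs, scattering_bases, input_power):
    """Reconstruct per-pixel power from predicted coefficients and compare to the input power.

    pred_coeffs:      (B, K, H, W) coefficients from the scattering-query decoder
    scattering_bases: (K, 1, 1) or (K, H, W) power contribution of each scattering basis
    input_power:      (B, H, W) power of the complex-valued SAR input
    """
    recon_power = (pred_coeffs * scattering_bases.unsqueeze(0)).sum(dim=1)
    return F.mse_loss(recon_power, input_power)

# Toy shapes: 4 bases (e.g. surface, double-bounce, volume, helix scattering), 64x64 image.
coeffs = torch.rand(2, 4, 64, 64)
bases = torch.ones(4, 1, 1)          # placeholder bases; the paper derives them physically
power = torch.rand(2, 64, 64)
loss = power_selfsupervision_loss(coeffs, bases, power)
```

The polarimetric decomposition loss would additionally penalize the distance between the predicted coefficients and the Yamaguchi coefficients.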
https://arxiv.org/abs/2504.11999
This article reviews contemporary methods for integrating force, including both proprioception and tactile sensing, in robot manipulation policy learning. We conduct a comparative analysis of various approaches for sensing force, data collection, behavior cloning, tactile representation learning, and low-level robot control. From our analysis, we articulate when and why forces are needed, and highlight opportunities to improve learning of contact-rich, generalist robot policies on the path toward highly capable touch-based robot foundation models. We generally find that, apart from a few tasks such as pouring, peg-in-hole insertion, and handling delicate objects, the performance of imitation learning models is not at a level of dynamics where force truly matters. Also, force and touch are abstract quantities that can be inferred through a wide range of modalities and are often measured and controlled implicitly. We hope that juxtaposing the different approaches currently in use will help the reader to gain a systemic understanding and help inspire the next generation of robot foundation models.
https://arxiv.org/abs/2504.11827
RGB-Thermal Video Object Detection (RGBT VOD) can address the limitation of traditional RGB-based VOD in challenging lighting conditions, making it more practical and effective in many applications. However, similar to most RGBT fusion tasks, it still mainly relies on manually aligned multimodal image pairs. In this paper, we propose a novel Multimodal Spatio-temporal Graph learning Network (MSGNet) for the alignment-free RGBT VOD problem by leveraging a robust graph representation learning model. Specifically, we first design an Adaptive Partitioning Layer (APL) to estimate the corresponding regions of the thermal image within the high-resolution RGB image, achieving a preliminary inexact alignment. Then, we introduce the Spatial Sparse Graph Learning Module (S-SGLM), which employs a sparse information passing mechanism on the estimated inexact alignment to achieve reliable information interaction between different modalities. Moreover, to fully exploit temporal cues for the RGBT VOD problem, we introduce Hybrid Structured Temporal Modeling (HSTM), which involves a Temporal Sparse Graph Learning Module (T-SGLM) and a Temporal Star Block (TSB). T-SGLM aims to filter out redundant information between adjacent frames by employing a sparse aggregation mechanism on the temporal graph. Meanwhile, TSB is dedicated to achieving complementary learning of local spatial relationships. Extensive comparative experiments conducted on both the aligned dataset VT-VOD50 and the unaligned dataset UVT-VOD2024 demonstrate the effectiveness and superiority of our proposed method. Our project will be made available on our website for free public access.
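A toy version of sparse cross-modal information passing (the role played by S-SGLM) is sketched below; a top-k similarity rule stands in for the inexact alignment estimated by the APL, so this illustrates the idea rather than the paper's module:

```python
import torch

def sparse_cross_modal_passing(rgb_feats, thermal_feats, k: int = 4):
    """Pass information from RGB region features to thermal nodes along top-k similar regions only."""
    # rgb_feats: (N_rgb, C), thermal_feats: (N_t, C)
    sim = thermal_feats @ rgb_feats.T                      # (N_t, N_rgb) cross-modal affinities
    topk_val, topk_idx = sim.topk(k, dim=-1)               # keep only k edges per thermal node
    weights = torch.softmax(topk_val, dim=-1)              # normalize over the sparse edges
    gathered = rgb_feats[topk_idx]                         # (N_t, k, C) neighbors in the RGB modality
    return thermal_feats + (weights.unsqueeze(-1) * gathered).sum(dim=1)

rgb = torch.randn(100, 64)      # high-resolution RGB region features
thermal = torch.randn(25, 64)   # thermal region features
fused_thermal = sparse_cross_modal_passing(rgb, thermal)
```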
https://arxiv.org/abs/2504.11779
Recent developments in generative artificial intelligence (AI) rely on machine learning techniques such as deep learning and generative modeling to achieve state-of-the-art performance across wide-ranging domains. These methods' surprising performance is due in part to their ability to learn implicit "representations" of complex, multi-modal data. Unfortunately, deep neural networks are notoriously black boxes that obscure these representations, making them difficult to interpret or analyze. To resolve these difficulties, one approach is to build new interpretable neural network models from the ground up. This is the goal of the emerging field of causal representation learning (CRL), which uses causality as a vector for building flexible, interpretable, and transferable generative AI. CRL can be seen as a culmination of three intrinsically statistical problems: (i) latent variable models such as factor analysis; (ii) causal graphical models with latent variables; and (iii) nonparametric statistics and deep learning. This paper reviews recent progress in CRL from a statistical perspective, focusing on connections to classical models and statistical and causal identifiability results. This review also highlights key application areas, implementation strategies, and open statistical questions in CRL.
https://arxiv.org/abs/2504.11609
Multimodal protein language models (PLMs) integrate sequence and token-based structural information, serving as a powerful foundation for protein modeling, generation, and design. However, the reliance on tokenizing 3D structures into discrete tokens causes a substantial loss of fidelity in fine-grained structural details and correlations. In this paper, we systematically elucidate the design space of multimodal PLMs to overcome their limitations. We identify tokenization loss and inaccurate structure token predictions by the PLMs as major bottlenecks. To address these, our proposed design space covers improved generative modeling, structure-aware architectures and representation learning, and data exploration. Our advancements approach finer-grained supervision, demonstrating that token-based multimodal PLMs can achieve robust structural modeling. The effective design methods dramatically improve structure generation diversity and, notably, the folding abilities of our 650M model, reducing the RMSD from 5.52 to 2.36 on the PDB test set, even outperforming 3B baselines and performing on par with specialized folding models.
https://arxiv.org/abs/2504.11454
Out-of-distribution (OOD) detection is essential for the safe deployment of machine learning models. Recent advances have explored improved classification losses and representation learning strategies to enhance OOD detection. However, these methods are often tailored to specific post-hoc detection techniques, limiting their generalizability. In this work, we identify a critical issue in Logit Normalization (LogitNorm), which inhibits its effectiveness in improving certain post-hoc OOD detection methods. To address this, we propose Extended Logit Normalization ($\textbf{ELogitNorm}$), a novel hyperparameter-free formulation that significantly benefits a wide range of post-hoc detection methods. By incorporating feature distance-awareness into LogitNorm, $\textbf{ELogitNorm}$ shows more robust OOD separability and in-distribution (ID) confidence calibration than its predecessor. Extensive experiments across standard benchmarks demonstrate that our approach outperforms state-of-the-art training-time methods in OOD detection while maintaining strong ID classification accuracy.
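For reference, standard LogitNorm replaces cross-entropy on raw logits with cross-entropy on L2-normalized logits scaled by a temperature; ELogitNorm's hyperparameter-free, feature-distance-aware normalization builds on this baseline but is not reproduced in the sketch below (the temperature value is the usual hyperparameter it aims to remove):

```python
import torch
import torch.nn.functional as F

def logitnorm_loss(logits: torch.Tensor, targets: torch.Tensor, tau: float = 0.04) -> torch.Tensor:
    """LogitNorm: cross-entropy on L2-normalized logits divided by a temperature tau."""
    norms = logits.norm(p=2, dim=-1, keepdim=True) + 1e-7
    return F.cross_entropy(logits / (norms * tau), targets)

logits = torch.randn(8, 10)             # classifier outputs
targets = torch.randint(0, 10, (8,))
loss = logitnorm_loss(logits, targets)  # ELogitNorm would replace tau via feature-distance awareness
```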
https://arxiv.org/abs/2504.11434
The rapid accumulation of Electronic Health Records (EHRs) has transformed healthcare by providing valuable data that enhance clinical predictions and diagnoses. While conventional machine learning models have proven effective, they often lack robust representation learning and depend heavily on expert-crafted features. Although deep learning offers powerful solutions, it is often criticized for its lack of interpretability. To address these challenges, we propose DeepSelective, a novel end-to-end deep learning framework for predicting patient prognosis using EHR data, with a strong emphasis on enhancing model interpretability. DeepSelective combines data compression techniques with an innovative feature selection approach, integrating custom-designed modules that work together to improve both accuracy and interpretability. Our experiments demonstrate that DeepSelective not only enhances predictive accuracy but also significantly improves interpretability, making it a valuable tool for clinical decision-making. The source code is freely available at this http URL.
https://arxiv.org/abs/2504.11264
While multimodal fusion has been extensively studied in Multimodal Sentiment Analysis (MSA), the role of fusion depth and multimodal capacity allocation remains underexplored. In this work, we position fusion depth, scalability, and dedicated multimodal capacity as primary factors for effective fusion. We introduce DeepMLF, a novel multimodal language model (LM) with learnable tokens tailored toward deep fusion. DeepMLF leverages an audiovisual encoder and a pretrained decoder LM augmented with multimodal information across its layers. We append learnable tokens to the LM that: 1) capture modality interactions in a controlled fashion and 2) preserve independent information flow for each modality. These fusion tokens gather linguistic information via causal self-attention in LM Blocks and integrate with audiovisual information through cross-attention MM Blocks. Serving as dedicated multimodal capacity, this design enables progressive fusion across multiple layers, providing depth in the fusion process. Our training recipe combines modality-specific losses and language modelling loss, with the decoder LM tasked to predict ground truth polarity. Across three MSA benchmarks with varying dataset characteristics, DeepMLF achieves state-of-the-art performance. Our results confirm that deeper fusion leads to better performance, with optimal fusion depths (5-7) exceeding those of existing approaches. Additionally, our analysis on the number of fusion tokens reveals that small token sets ($\sim$20) achieve optimal performance. We examine the importance of representation learning order (fusion curriculum) through audiovisual encoder initialization experiments. Our ablation studies demonstrate the superiority of the proposed fusion design and gating while providing a holistic examination of DeepMLF's scalability to LLMs, and the impact of each training objective and embedding regularization.
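A minimal sketch of the fusion-token idea, with assumed dimensions and a single cross-attention MM block, is shown below; the real model interleaves such blocks across the pretrained decoder LM's layers:

```python
import torch
import torch.nn as nn

class MMBlock(nn.Module):
    """Cross-attention: learnable fusion tokens (queries) gather audiovisual context."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fusion_tokens, av_feats):
        upd, _ = self.attn(query=fusion_tokens, key=av_feats, value=av_feats)
        return self.norm(fusion_tokens + upd)

dim, n_fusion = 768, 20                        # ~20 fusion tokens were optimal per the abstract
fusion_tokens = nn.Parameter(torch.randn(1, n_fusion, dim) * 0.02)

text_embeds = torch.randn(2, 50, dim)          # decoder-LM token embeddings
av_feats = torch.randn(2, 120, dim)            # audiovisual encoder outputs

# Fusion tokens are appended after the text tokens; inside the LM, causal self-attention would
# let them gather linguistic context (not simulated here). Between LM blocks, an MM block
# injects audiovisual information into the fusion tokens.
lm_input = torch.cat([text_embeds, fusion_tokens.expand(2, -1, -1)], dim=1)
mm = MMBlock(dim)
fused = mm(lm_input[:, -n_fusion:], av_feats)  # updated fusion tokens (dedicated multimodal capacity)
```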
https://arxiv.org/abs/2504.11082
Recommending items to users has long been a fundamental task, and studies have continually sought to improve it. Most well-known models commonly employ representation learning to map users and items into a unified embedding space for matching assessment. These approaches have two primary limitations, especially when dealing with explicit feedback and sparse data contexts: proneness to overfitting and a failure to incorporate epistemic uncertainty in predictions. To address these problems, we propose a novel Bayesian Deep Ensemble Collaborative Filtering method named BDECF. To improve model generalization and quality, we utilize Bayesian Neural Networks, which incorporate uncertainty within their weight parameters. In addition, we introduce a new interpretable non-linear matching approach for the user and item embeddings, leveraging the advantages of the attention mechanism. Furthermore, we employ an ensemble-based supermodel to generate more robust and reliable predictions, resulting in a more complete model. Empirical evaluation through extensive experiments and ablation studies across a range of publicly accessible real-world datasets with differing sparsity characteristics confirms our proposed method's effectiveness and the importance of its components.
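Epistemic uncertainty from an ensemble can be illustrated with a few independently initialized matchers whose disagreement serves as the uncertainty estimate; the Bayesian weight distributions and attention-based matching of BDECF are not reproduced in this sketch:

```python
import torch
import torch.nn as nn

class MatchScorer(nn.Module):
    """Small non-linear matcher over concatenated user/item embeddings."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, user, item):
        return self.net(torch.cat([user, item], dim=-1)).squeeze(-1)

dim, n_members = 32, 5
ensemble = [MatchScorer(dim) for _ in range(n_members)]   # members trained independently in practice

user, item = torch.randn(16, dim), torch.randn(16, dim)
preds = torch.stack([m(user, item) for m in ensemble])    # (members, batch)
rating_mean = preds.mean(dim=0)                           # point prediction
epistemic = preds.var(dim=0)                              # member disagreement ~ epistemic uncertainty
```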
https://arxiv.org/abs/2504.10753
Multimodal foundation models have significantly improved feature representation by integrating information from multiple modalities, making them highly suitable for a broader set of applications. However, the exploration of multimodal facial representation for understanding perception has been limited. Understanding and analyzing facial states, such as Action Units (AUs) and emotions, require a comprehensive and robust framework that bridges visual and linguistic modalities. In this paper, we present a comprehensive pipeline for multimodal facial state analysis. First, we compile a new Multimodal Face Dataset (MFA) by generating detailed multilevel language descriptions of faces, incorporating Action Unit (AU) and emotion descriptions, by leveraging GPT-4o. Second, we introduce a novel Multilevel Multimodal Face Foundation model (MF^2) tailored for Action Unit (AU) and emotion recognition. Our model incorporates comprehensive visual feature modeling at both the local and global levels of the face image, enhancing its ability to represent detailed facial appearances. This design aligns visual representations with structured AU and emotion descriptions, ensuring effective cross-modal integration. Third, we develop a Decoupled Fine-Tuning Network (DFN) that efficiently adapts MF^2 across various tasks and datasets. This approach not only reduces computational overhead but also broadens the applicability of the foundation model to diverse scenarios. Experiments show superior performance on AU and emotion detection tasks.
https://arxiv.org/abs/2504.10351
Multimodal representation learning, exemplified by multimodal contrastive learning (MMCL) using image-text pairs, aims to learn powerful representations by aligning cues across modalities. This approach relies on the core assumption that the exemplar image-text pairs constitute two representations of an identical concept. However, recent research has revealed that real-world datasets often exhibit misalignment. There are two distinct viewpoints on how to address this issue: one suggests mitigating the misalignment, and the other leveraging it. We seek here to reconcile these seemingly opposing perspectives, and to provide a practical guide for practitioners. Using latent variable models we thus formalize misalignment by introducing two specific mechanisms: selection bias, where some semantic variables are missing, and perturbation bias, where semantic variables are distorted -- both affecting latent variables shared across modalities. Our theoretical analysis demonstrates that, under mild assumptions, the representations learned by MMCL capture exactly the information related to the subset of the semantic variables invariant to selection and perturbation biases. This provides a unified perspective for understanding misalignment. Based on this, we further offer actionable insights into how misalignment should inform the design of real-world ML systems. We validate our theoretical findings through extensive empirical studies on both synthetic data and real image-text datasets, shedding light on the nuanced impact of misalignment on multimodal representation learning.
https://arxiv.org/abs/2504.10143
Transformer-based models have achieved remarkable success in natural language and vision tasks, but their application to gene expression analysis remains limited due to data sparsity, high dimensionality, and missing values. We present GexBERT, a transformer-based autoencoder framework for robust representation learning of gene expression data. GexBERT learns context-aware gene embeddings by pretraining on large-scale transcriptomic profiles with a masking and restoration objective that captures co-expression relationships among thousands of genes. We evaluate GexBERT across three critical tasks in cancer research: pan-cancer classification, cancer-specific survival prediction, and missing value imputation. GexBERT achieves state-of-the-art classification accuracy from limited gene subsets, improves survival prediction by restoring expression of prognostic anchor genes, and outperforms conventional imputation methods under high missingness. Furthermore, its attention-based interpretability reveals biologically meaningful gene patterns across cancer types. These findings demonstrate the utility of GexBERT as a scalable and effective tool for gene expression modeling, with translational potential in settings where gene coverage is limited or incomplete.
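The masking-and-restoration objective can be sketched as zero-masking a random subset of genes and penalizing reconstruction error only on the masked entries; the masking scheme and the stand-in autoencoder below are assumptions:

```python
import torch
import torch.nn as nn

def mask_and_restore_loss(model: nn.Module, expr: torch.Tensor, mask_frac: float = 0.15):
    """Zero-mask a random subset of genes and penalize reconstruction error on the masked entries."""
    mask = torch.rand_like(expr) < mask_frac    # True where a gene's value is hidden
    corrupted = expr.masked_fill(mask, 0.0)     # zero-masking is an assumed corruption scheme
    restored = model(corrupted)                 # (B, n_genes) restored expression
    return ((restored - expr)[mask] ** 2).mean()

n_genes = 2000
autoencoder = nn.Sequential(                    # stand-in for the transformer autoencoder
    nn.Linear(n_genes, 256), nn.ReLU(), nn.Linear(256, n_genes)
)
expr = torch.randn(8, n_genes)                  # normalized expression profiles
loss = mask_and_restore_loss(autoencoder, expr)
loss.backward()
```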
https://arxiv.org/abs/2504.09704
Recent advances in self-supervised deep learning have improved our ability to quantify cellular morphological changes in high-throughput microscopy screens, a process known as morphological profiling. However, most current methods only learn from images, despite many screens being inherently multimodal, as they involve both a chemical or genetic perturbation as well as an image-based readout. We hypothesized that incorporating chemical compound structure during self-supervised pre-training could improve learned representations of images in high-throughput microscopy screens. We introduce a representation learning framework, MICON (Molecular-Image Contrastive Learning), that models chemical compounds as treatments that induce counterfactual transformations of cell phenotypes. MICON significantly outperforms classical hand-crafted features such as CellProfiler and existing deep-learning-based representation learning methods in challenging evaluation settings where models must identify reproducible effects of drugs across independent replicates and data-generating centers. We demonstrate that incorporating chemical compound information into the learning process provides consistent improvements in our evaluation setting and that modeling compounds specifically as treatments in a causal framework outperforms approaches that directly align images and compounds in a single representation space. Our findings point to a new direction for representation learning in morphological profiling, suggesting that methods should explicitly account for the multimodal nature of microscopy screening data.
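One way to picture "compounds as treatments" is a head that maps a control-cell embedding plus a compound embedding to a predicted treated-cell embedding, trained contrastively against real treated images; this is an illustrative reading of the abstract, not MICON's published architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TreatmentHead(nn.Module):
    """Predict a treated-cell embedding from a control embedding and a compound embedding."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, control_emb, compound_emb):
        return self.net(torch.cat([control_emb, compound_emb], dim=-1))

def info_nce(pred, target, temperature: float = 0.1):
    """Contrast each predicted treated embedding against the matching real treated embedding."""
    pred, target = F.normalize(pred, dim=-1), F.normalize(target, dim=-1)
    logits = pred @ target.T / temperature
    labels = torch.arange(pred.size(0))
    return F.cross_entropy(logits, labels)

dim = 128
head = TreatmentHead(dim)
control = torch.randn(32, dim)    # image-encoder embeddings of control wells
compound = torch.randn(32, dim)   # molecular-structure embeddings of the applied compounds
treated = torch.randn(32, dim)    # image-encoder embeddings of the matching treated wells
loss = info_nce(head(control, compound), treated)
```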
https://arxiv.org/abs/2504.09544
In open data sets of functional magnetic resonance imaging (fMRI), the heterogeneity of the data is typically attributed to a combination of factors, including differences in scanning procedures, the presence of confounding effects, and population diversity across multiple sites. These factors contribute to the diminished effectiveness of representation learning, which in turn affects the overall efficacy of subsequent classification procedures. To address these limitations, we propose a novel multi-site adversarial learning network (MSalNET) for fMRI-based mental disorder detection. Firstly, a representation learning module is introduced with a node information assembly (NIA) mechanism to better extract features from functional connectivity (FC). This mechanism aggregates edge information from both horizontal and vertical directions, effectively assembling node information. Secondly, to generalize the features across sites, we propose a site-level feature extraction module that can learn from individual FC data, which circumvents the need for additional prior information. Lastly, an adversarial learning network is proposed as a means of balancing the trade-off between individual classification and site regression tasks, with the introduction of a novel loss function. The proposed method was evaluated on two multi-site fMRI datasets, i.e., Autism Brain Imaging Data Exchange (ABIDE) and ADHD-200. The results indicate that the proposed method achieves better performance than other related algorithms, with accuracies of 75.56% and 68.92% on the ABIDE and ADHD-200 datasets, respectively. Furthermore, the result of the site regression indicates that the proposed method reduces site variability from a data-driven perspective. The most discriminative brain regions revealed by NIA are consistent with statistical findings, uncovering the "black box" of deep learning to a certain extent.
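The adversarial balance between disorder classification and site regression is commonly realized with a gradient reversal layer, as sketched below; the paper's specific loss function is not reproduced here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

features = torch.randn(8, 64, requires_grad=True)  # output of the representation module (NIA features)
diag_head = nn.Linear(64, 2)                       # individual classification (e.g. patient vs. control)
site_head = nn.Linear(64, 3)                       # site prediction head (the adversary)

diag_loss = F.cross_entropy(diag_head(features), torch.randint(0, 2, (8,)))
site_loss = F.cross_entropy(site_head(GradReverse.apply(features, 1.0)), torch.randint(0, 3, (8,)))
(diag_loss + site_loss).backward()                 # the reversed gradient pushes features to be site-invariant
```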
https://arxiv.org/abs/2504.09179
In modern air traffic management, generating synthetic flight trajectories has emerged as a promising solution for addressing data scarcity, protecting sensitive information, and supporting large-scale analyses. In this paper, we propose a novel method for trajectory synthesis by adapting the Time-Based Vector Quantized Variational Autoencoder (TimeVQVAE). Our approach leverages time-frequency domain processing, vector quantization, and transformer-based priors to capture both global and local dynamics in flight data. By discretizing the latent space and integrating transformer priors, the model learns long-range spatiotemporal dependencies and preserves coherence across entire flight paths. We evaluate the adapted TimeVQVAE using an extensive suite of quality, statistical, and distributional metrics, as well as a flyability assessment conducted in an open-source air traffic simulator. Results indicate that TimeVQVAE outperforms a temporal convolutional VAE baseline, generating synthetic trajectories that mirror real flight data in terms of spatial accuracy, temporal consistency, and statistical properties. Furthermore, the simulator-based assessment shows that most generated trajectories maintain operational feasibility, although occasional outliers underscore the potential need for additional domain-specific constraints. Overall, our findings underscore the importance of multi-scale representation learning for capturing complex flight behaviors and demonstrate the promise of TimeVQVAE in producing representative synthetic trajectories for downstream tasks such as model training, airspace design, and air traffic forecasting.
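The vector-quantization step at the heart of TimeVQVAE maps continuous latents to their nearest codebook entries and passes gradients straight through; a minimal sketch with assumed codebook size and latent shapes:

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-codebook-entry quantization with a straight-through gradient estimator."""
    def __init__(self, n_codes: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)

    def forward(self, z: torch.Tensor):
        # z: (B, T, dim) continuous latents, e.g. encoded time-frequency trajectory features.
        book = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        idx = torch.cdist(z, book).argmin(dim=-1)   # discrete token ids consumed by the transformer prior
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()                # straight-through: gradients flow back to the encoder
        return z_q, idx

vq = VectorQuantizer()
latents = torch.randn(2, 100, 64)                   # encoded trajectory segments (assumed shape)
quantized, tokens = vq(latents)                     # `tokens` is what the transformer prior models
```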
https://arxiv.org/abs/2504.09101
In autoregressive (AR) image generation, visual tokenizers compress images into compact discrete latent tokens, enabling efficient training of downstream autoregressive models for visual generation via next-token prediction. While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality -- a challenge not adequately addressed in existing literature. To address this, we introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma. To mitigate this, we propose semantic regularization, which aligns tokenizer features with semantically consistent features from a pre-trained visual encoder. This constraint prevents excessive latent space complexity during scaling, yielding consistent improvements in both reconstruction and downstream autoregressive generation. Building on semantic regularization, we explore three key practices for scaling tokenizers: (1) using 1D tokenizers for better scalability, (2) prioritizing decoder scaling when expanding both encoder and decoder, and (3) employing entropy loss to stabilize training for billion-scale tokenizers. By scaling to $\bf{3 \space billion}$ parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
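Semantic regularization can be read as aligning intermediate tokenizer features with features from a frozen pretrained visual encoder; the cosine-alignment loss, projection, and loss weight below are assumptions about one simple instantiation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def semantic_regularization(tokenizer_feats, teacher_feats, proj):
    """Cosine-align (projected) tokenizer features with frozen teacher features."""
    aligned = proj(tokenizer_feats)                               # map tokenizer width to teacher width
    return 1.0 - F.cosine_similarity(aligned, teacher_feats.detach(), dim=-1).mean()

proj = nn.Linear(256, 768)                                        # assumed dims: tokenizer 256 -> teacher 768
tok_feats = torch.randn(4, 196, 256)                              # intermediate tokenizer features
teacher_feats = torch.randn(4, 196, 768)                          # e.g. patch features from a frozen pretrained ViT
reg = semantic_regularization(tok_feats, teacher_feats, proj)
recon_loss = torch.tensor(0.31)                                   # placeholder reconstruction term
total = recon_loss + 0.5 * reg                                    # the 0.5 weight is an assumption
```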
https://arxiv.org/abs/2504.08736