We introduce JamendoMaxCaps, a large-scale music-caption dataset featuring over 200,000 freely licensed instrumental tracks from the renowned Jamendo platform. The dataset includes captions generated by a state-of-the-art captioning model, enhanced with imputed metadata. We also introduce a retrieval system that leverages both musical features and metadata to identify similar songs, which are then used to fill in missing metadata using a local large language model (LLLM). This approach allows us to provide a more comprehensive and informative dataset for researchers working on music-language understanding tasks. We validate this approach quantitatively with five different measurements. By making the JamendoMaxCaps dataset publicly available, we provide a high-quality resource to advance research in music-language understanding tasks such as music retrieval, multimodal representation learning, and generative music models.
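As an illustration of the retrieval-then-impute idea, here is a minimal sketch. The feature dimensions, the fusion by concatenation, and the prompt format are assumptions for illustration, not the paper's implementation; the call to a local LLM is left as a prompt string.

```python
# Hypothetical sketch of retrieval-assisted metadata imputation; names and
# dimensions are assumptions, not the JamendoMaxCaps code.
import numpy as np

def top_k_similar(query_vec, track_vecs, k=5):
    """Cosine similarity between one track and a bank of tracks."""
    q = query_vec / np.linalg.norm(query_vec)
    bank = track_vecs / np.linalg.norm(track_vecs, axis=1, keepdims=True)
    scores = bank @ q
    return np.argsort(scores)[::-1][:k]  # index 0 is the query itself; skip it in practice

def build_imputation_prompt(neighbors_metadata, missing_field):
    """Build a prompt asking a locally hosted LLM to infer a missing field."""
    context = "\n".join(str(m) for m in neighbors_metadata)
    return (f"Given metadata of similar tracks:\n{context}\n"
            f"Infer a plausible value for the missing field '{missing_field}'.")

audio = np.random.rand(1000, 512)   # assumed musical feature vectors
meta = np.random.rand(1000, 128)    # assumed metadata embeddings
fused = np.concatenate([audio, meta], axis=1)
idx = top_k_similar(fused[0], fused)
print(build_imputation_prompt([{"genre": "ambient"}] * len(idx), "mood"))
```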
https://arxiv.org/abs/2502.07461
Anatomy evaluation is crucial for understanding the physiological state, diagnosing abnormalities, and guiding medical interventions. Statistical shape modeling (SSM) is vital in this process. By enabling the extraction of quantitative morphological shape descriptors from MRI and CT scans, SSM provides comprehensive descriptions of anatomical variations within a population. However, the effectiveness of SSM in anatomy evaluation hinges on the quality and robustness of the shape models. While deep learning techniques show promise in addressing these challenges by learning complex nonlinear representations of shapes, existing models still have limitations and often require pre-established shape models for training. To overcome these issues, we propose Mesh2SSM++, a novel approach that learns to estimate correspondences from meshes in an unsupervised manner. This method leverages unsupervised, permutation-invariant representation learning to estimate how to deform a template point cloud into subject-specific meshes, forming a correspondence-based shape model. Additionally, our probabilistic formulation allows learning a population-specific template, reducing potential biases associated with template selection. A key feature of Mesh2SSM++ is its ability to quantify aleatoric uncertainty, which captures inherent data variability and is essential for ensuring reliable model predictions and robust decision-making in clinical tasks, especially under challenging imaging conditions. Through extensive validation across diverse anatomies, evaluation metrics, and downstream tasks, we demonstrate that Mesh2SSM++ outperforms existing methods. Its ability to operate directly on meshes, combined with computational efficiency and interpretability through its probabilistic framework, makes it an attractive alternative to traditional and deep learning-based SSM approaches.
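A minimal sketch of correspondence-by-deformation is shown below, assuming a global per-subject shape feature and a symmetric Chamfer loss; layer sizes are illustrative, and the paper's probabilistic template learning is reduced here to a plain learnable parameter.

```python
# Hedged sketch: deform a learnable template point cloud into a subject mesh,
# giving point correspondences by construction. Not the Mesh2SSM++ code.
import torch
import torch.nn as nn

class TemplateDeformer(nn.Module):
    def __init__(self, feat_dim=128, n_template=1024):
        super().__init__()
        # Learnable population template (the paper learns it probabilistically).
        self.template = nn.Parameter(torch.randn(n_template, 3) * 0.01)
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, 256), nn.ReLU(), nn.Linear(256, 3))

    def forward(self, subject_feat):
        # Broadcast one global shape feature to every template point and
        # predict per-point displacements.
        feat = subject_feat.expand(self.template.shape[0], -1)
        offsets = self.mlp(torch.cat([self.template, feat], dim=1))
        return self.template + offsets

def chamfer(a, b):
    d = torch.cdist(a, b)  # pairwise distances between point sets
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

model = TemplateDeformer()
subject_points = torch.rand(2048, 3)   # vertices sampled from a subject mesh
deformed = model(torch.randn(1, 128))  # assumed global shape feature
loss = chamfer(deformed, subject_points)
loss.backward()
```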
https://arxiv.org/abs/2502.07145
Diffusion models have revolutionized generative modeling, enabling unprecedented realism in image and video synthesis. This success has sparked interest in leveraging their representations for visual understanding tasks. While recent works have explored this potential for image generation, the visual understanding capabilities of video diffusion models remain largely uncharted. To address this gap, we systematically compare the same model architecture trained for video versus image generation, analyzing the performance of their latent representations on various downstream tasks including image classification, action recognition, depth estimation, and tracking. Results show that video diffusion models consistently outperform their image counterparts, though we find a striking range in the extent of this superiority. We further analyze features extracted from different layers and with varying noise levels, as well as the effect of model size and training budget on representation and generation quality. This work marks the first direct comparison of video and image diffusion objectives for visual understanding, offering insights into the role of temporal information in representation learning.
https://arxiv.org/abs/2502.07001
Domain-generalizable re-identification (DG Re-ID) aims to train a model on one or more source domains and evaluate its performance on unseen target domains, a task that has attracted growing attention due to its practical relevance. While numerous methods have been proposed, most rely on discriminative or contrastive learning frameworks to learn generalizable feature representations. However, these approaches often fail to mitigate shortcut learning, leading to suboptimal performance. In this work, we propose a novel method called diffusion model-assisted representation learning with a correlation-aware conditioning scheme (DCAC) to enhance DG Re-ID. Our method integrates a discriminative and contrastive Re-ID model with a pre-trained diffusion model through a correlation-aware conditioning scheme. By incorporating ID classification probabilities generated from the Re-ID model with a set of learnable ID-wise prompts, the conditioning scheme injects dark knowledge that captures ID correlations to guide the diffusion process. Simultaneously, feedback from the diffusion model is back-propagated through the conditioning scheme to the Re-ID model, effectively improving the generalization capability of Re-ID features. Extensive experiments on both single-source and multi-source DG Re-ID tasks demonstrate that our method achieves state-of-the-art performance. Comprehensive ablation studies further validate the effectiveness of the proposed approach, providing insights into its robustness. Codes will be available at this https URL.
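The conditioning scheme can be pictured as a soft mixture of learnable ID-wise prompts weighted by the Re-ID classifier's probabilities. The following sketch assumes dimensions typical of Re-ID benchmarks and is not the authors' code:

```python
# Illustrative correlation-aware conditioning: soft ID probabilities
# ("dark knowledge") mix a bank of learnable ID-wise prompts into one
# conditioning vector for the diffusion model. Sizes are assumptions.
import torch
import torch.nn as nn

class CorrelationAwareCondition(nn.Module):
    def __init__(self, num_ids=751, prompt_dim=256):
        super().__init__()
        self.id_prompts = nn.Parameter(torch.randn(num_ids, prompt_dim) * 0.02)

    def forward(self, reid_logits):
        probs = reid_logits.softmax(dim=-1)   # (B, num_ids), captures ID correlations
        return probs @ self.id_prompts        # (B, prompt_dim) conditioning vector

cond = CorrelationAwareCondition()
logits = torch.randn(4, 751)                  # Re-ID classifier output
print(cond(logits).shape)                     # torch.Size([4, 256])
```

Because the mixture is differentiable, gradients from the diffusion model flow back through the probabilities into the Re-ID encoder, which is how the feedback path described above can be realized.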
https://arxiv.org/abs/2502.06619
Informative representations enhance model performance and generalisability in downstream tasks. However, learning self-supervised representations for spatially characterised time series, like traffic interactions, poses challenges as it requires maintaining fine-grained similarity relations in the latent space. In this study, we incorporate two structure-preserving regularisers for the contrastive learning of spatial time series: one regulariser preserves the topology of similarities between instances, and the other preserves the graph geometry of similarities across spatial and temporal dimensions. To balance contrastive learning and structure preservation, we propose a dynamic mechanism that adaptively weighs the trade-off and stabilises training. We conduct experiments on multivariate time series classification, as well as macroscopic and microscopic traffic prediction. For all three tasks, our approach preserves the structures of similarity relations more effectively and improves state-of-the-art task performances. The proposed approach can be applied to an arbitrary encoder and is particularly beneficial for time series with spatial or geographical features. Furthermore, this study suggests that higher similarity structure preservation indicates more informative and useful representations. This may help to understand the contribution of representation learning in pattern recognition with neural networks. Our code is made openly accessible with all resulting data at this https URL.
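A hedged sketch of the general recipe follows, pairing InfoNCE with a regulariser that keeps latent pairwise similarities close to input-space similarities, plus an adaptive weight between the two terms; the paper's actual regularisers and weighting mechanism differ in detail.

```python
# Sketch only: contrastive loss + a structure-preserving term with a dynamic
# weight. Data shapes and the KL-based regulariser are assumptions.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau
    labels = torch.arange(z1.shape[0])  # positives on the diagonal
    return F.cross_entropy(logits, labels)

def structure_preserving(x, z):
    # Preserve the pattern of pairwise similarities between input and latent.
    sx = F.softmax(-torch.cdist(x, x), dim=1)
    sz = F.softmax(-torch.cdist(z, z), dim=1)
    return F.kl_div(sz.log(), sx, reduction="batchmean")

x = torch.randn(32, 64)                             # spatial time-series windows
z1, z2 = torch.randn(32, 16), torch.randn(32, 16)   # two augmented views
l_con, l_str = info_nce(z1, z2), structure_preserving(x, z1)
# Dynamic weighting: rescale the regulariser so neither term dominates.
w = (l_con / (l_str + 1e-8)).detach()
loss = l_con + w * l_str
```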
https://arxiv.org/abs/2502.06380
Predicting masked from visible parts of an image is a powerful self-supervised approach for visual representation learning. However, the common practice of masking random patches of pixels exhibits certain failure modes, which can prevent learning meaningful high-level features, as required for downstream tasks. We propose an alternative masking strategy that operates on a suitable transformation of the data rather than on the raw pixels. Specifically, we perform principal component analysis and then randomly mask a subset of components, which accounts for a fixed ratio of the data variance. The learning task then amounts to reconstructing the masked components from the visible ones. Compared to local patches of pixels, the principal components of images carry more global information. We thus posit that predicting masked from visible components involves more high-level features, allowing our masking strategy to extract more useful representations. This is corroborated by our empirical findings which demonstrate improved image classification performance for component over pixel masking. Our method thus constitutes a simple and robust data-driven alternative to traditional masked image modeling approaches.
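A minimal numpy sketch of the masking strategy, assuming images flattened into rows and a 30% variance budget for the masked components:

```python
# Mask principal components accounting for a fixed share of the variance;
# the learning task is to regress the masked coefficients from the visible ones.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 256))            # flattened images (stand-in data)
X = X - X.mean(axis=0)

# PCA via SVD; rows of Vt are principal components.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
var_ratio = S**2 / np.sum(S**2)

# Randomly mask components until ~30% of the variance is covered.
order = rng.permutation(len(var_ratio))
masked, covered = [], 0.0
for i in order:
    if covered >= 0.3:
        break
    masked.append(i)
    covered += var_ratio[i]
visible = np.setdiff1d(np.arange(len(var_ratio)), masked)

coeffs = X @ Vt.T                           # per-image component coefficients
visible_coeffs = coeffs[:, visible]         # model input
target_coeffs = coeffs[:, masked]           # reconstruction target
print(len(masked), "components masked, variance covered:", round(covered, 3))
```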
https://arxiv.org/abs/2502.06314
Recent advancements in large language models (LLMs) have significantly improved various natural language processing (NLP) tasks. Typically, LLMs are trained to predict the next token, aligning well with many NLP tasks. However, in knowledge graph (KG) scenarios, entities are the fundamental units, and identifying an entity requires at least several tokens. This leads to a granularity mismatch between KGs and natural languages. To address this issue, we propose K-ON, which integrates KG knowledge into the LLM by employing multiple head layers for next k-step prediction. K-ON not only generates entity-level results in one step, but also enables contrastive loss over entities, which is the most powerful tool in KG representation learning. Experimental results show that K-ON outperforms state-of-the-art methods that incorporate text and even other modalities.
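The multi-head next-k-step idea can be sketched as k parallel output layers over the final hidden state, so that an entity name spanning k tokens is scored in one forward pass. Sizes below are assumptions:

```python
# Illustrative next-k-step heads, not the K-ON implementation.
import torch
import torch.nn as nn

class NextKHeads(nn.Module):
    def __init__(self, hidden=768, vocab=32000, k=4):
        super().__init__()
        # One linear head per future step.
        self.heads = nn.ModuleList(nn.Linear(hidden, vocab) for _ in range(k))

    def forward(self, h):
        # h: (B, hidden) last hidden state; returns (B, k, vocab) so all k
        # tokens of an entity can be scored (and contrasted) at once.
        return torch.stack([head(h) for head in self.heads], dim=1)

heads = NextKHeads()
scores = heads(torch.randn(2, 768))
print(scores.shape)   # torch.Size([2, 4, 32000])
```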
https://arxiv.org/abs/2502.06257
Unsupervised Continuous Anomaly Detection (UCAD) faces significant challenges in multi-task representation learning, with existing methods suffering from incomplete representation and catastrophic forgetting. Unlike supervised models, unsupervised scenarios lack prior information, making it difficult to effectively distinguish redundant from complementary multimodal features. To address this, we propose the Multimodal Task Representation Memory Bank (MTRMB) method, built on two key technical innovations: (1) a Key-Prompt-Multimodal Knowledge (KPMK) mechanism that uses concise key prompts to guide cross-modal feature interaction between BERT and ViT; and (2) Refined Structure-based Contrastive Learning (RSCL), which leverages Grounding DINO and SAM to generate precise segmentation masks, pulling features of the same structural region closer while pushing features of different structural regions apart. Experiments on the MVTec AD and VisA datasets demonstrate MTRMB's superiority, achieving an average detection accuracy of 0.921 at the lowest forgetting rate, significantly outperforming state-of-the-art methods. We plan to open-source the code on GitHub.
https://arxiv.org/abs/2502.06194
Real-world time series often have multiple frequency components that are intertwined with each other, making accurate time series forecasting challenging. Decomposing the mixed frequency components into multiple single-frequency components is a natural choice. However, the information density of patterns varies across frequencies, and employing a uniform modeling approach for different frequency components can lead to inaccurate characterization. To address these challenges, inspired by the flexibility of the recent Kolmogorov-Arnold Network (KAN), we propose a KAN-based Frequency Decomposition Learning architecture (TimeKAN) for the complex forecasting problems caused by mixtures of multiple frequencies. Specifically, TimeKAN consists of three main components: Cascaded Frequency Decomposition (CFD) blocks, Multi-order KAN Representation Learning (M-KAN) blocks, and Frequency Mixing blocks. CFD blocks adopt a bottom-up cascading approach to obtain series representations for each frequency band. Benefiting from the high flexibility of KAN, we design a novel M-KAN block to learn and represent specific temporal patterns within each frequency band. Finally, Frequency Mixing blocks are used to recombine the frequency bands into the original format. Extensive experimental results across multiple real-world time series datasets demonstrate that TimeKAN achieves state-of-the-art performance as an extremely lightweight architecture. Code is available at this https URL.
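To make the decomposition step concrete, here is a hedged sketch that splits a series into frequency bands with a fixed FFT; note that the paper's CFD blocks use a learned, cascaded decomposition rather than a fixed transform.

```python
# Sketch only: fixed FFT band split, standing in for learned decomposition.
import numpy as np

def split_bands(x, n_bands=3):
    """Return n_bands series, each keeping one slice of the spectrum."""
    spec = np.fft.rfft(x)
    edges = np.linspace(0, len(spec), n_bands + 1, dtype=int)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        s = np.zeros_like(spec)
        s[lo:hi] = spec[lo:hi]
        bands.append(np.fft.irfft(s, n=len(x)))
    return bands

x = np.sin(np.linspace(0, 20 * np.pi, 512)) + 0.1 * np.random.randn(512)
bands = split_bands(x)   # model each band separately, e.g. with an M-KAN block
# Recombination ("frequency mixing") is just the sum of the bands.
print(np.allclose(sum(bands), x, atol=1e-8))   # True
```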
https://arxiv.org/abs/2502.06910
Large Language Models (LLMs) have been integrated into recommendation systems to enhance user behavior comprehension. The Retrieval Augmented Generation (RAG) technique is further incorporated into these systems to retrieve more relevant items and improve system performance. However, existing RAG methods rely primarily on textual semantics and often fail to incorporate the most relevant items, limiting the effectiveness of the systems. In this paper, we propose Representation learning for retrieval-Augmented Large Language model Recommendation (RALLRec). Specifically, we enhance textual semantics by prompting LLMs to generate more detailed item descriptions, followed by joint representation learning of textual and collaborative semantics, which are extracted by the LLM and recommendation models, respectively. Considering the potential time-varying characteristics of user interest, a simple yet effective reranking method is further introduced to capture the dynamics of user preference. We conducted extensive experiments on three real-world datasets, and the evaluation results validated the effectiveness of our method. Code is made public at this https URL.
https://arxiv.org/abs/2502.06101
Audio-visual representation learning is crucial for advancing multimodal speech processing tasks, such as lipreading and audio-visual speech recognition. Recently, speech foundation models (SFMs) have shown remarkable generalization capabilities across various speech-related tasks. Building on this progress, we propose an audio-visual representation learning model that leverages cross-modal knowledge distillation from SFMs. In our method, SFMs serve as teachers, from which multi-layer hidden representations are extracted using clean audio inputs. We also introduce a multi-teacher ensemble method to distill the student, which receives audio-visual data as inputs. A novel representational knowledge distillation loss is employed to train the student during pretraining, which is also applied during finetuning to further enhance the performance on downstream tasks. Our experiments utilized both a self-supervised SFM, WavLM, and a supervised SFM, iFLYTEK-speech. The results demonstrated that our proposed method achieved superior or at least comparable performance to previous state-of-the-art baselines across automatic speech recognition, visual speech recognition, and audio-visual speech recognition tasks. Additionally, comprehensive ablation studies and the visualization of learned representations were conducted to evaluate the effectiveness of our proposed method.
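A sketch of the multi-teacher representational distillation objective follows, assuming layer-wise cosine distillation through learned projections; the actual loss in the paper may differ.

```python
# Hedged sketch: student layer features regress an ensemble of teacher
# features extracted from clean audio. Dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_loss(student_feats, teachers_feats, projs):
    loss = 0.0
    for t_feats, proj in zip(teachers_feats, projs):
        for s, t in zip(student_feats, t_feats):        # layer-wise pairs
            s_hat = proj(s)                             # student dim -> teacher dim
            loss = loss + (1 - F.cosine_similarity(s_hat, t, dim=-1).mean())
    return loss / len(teachers_feats)

layers, B, T = 4, 2, 50
student = [torch.randn(B, T, 512) for _ in range(layers)]   # audio-visual student
wavlm = [torch.randn(B, T, 768) for _ in range(layers)]     # teacher 1 features
iflytek = [torch.randn(B, T, 768) for _ in range(layers)]   # teacher 2 features
projs = [nn.Linear(512, 768), nn.Linear(512, 768)]
print(distill_loss(student, [wavlm, iflytek], projs))
```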
https://arxiv.org/abs/2502.05766
Stochastic embedding transitions introduce a probabilistic mechanism for adjusting token representations dynamically during inference, mitigating the constraints imposed through static or deterministic embeddings. A transition framework was proposed in which each token embedding evolved through probabilistic updates, ensuring adaptability while preserving semantic integrity across linguistic contexts. Empirical evaluations demonstrated that models incorporating stochastic transitions exhibited greater lexical diversity, improved generative coherence, and enhanced retention of low-frequency vocabulary, contributing to more varied sentence structures and reduced reliance on high-probability token selections. Statistical analyses of embedding drift across transformer layers indicated that representations evolved more flexibly without losing coherence, supporting the hypothesis that controlled stochasticity facilitated context-sensitive representation learning. Experimental results revealed that probabilistic embeddings introduced minor computational overhead while maintaining generative efficiency, reinforcing their feasibility in large-scale applications. A comparative study with traditional embedding approaches highlighted measurable gains in text completion accuracy, dialogue coherence, and structural complexity, confirming the effectiveness of stochastic transitions in enhancing representation expressiveness. Clustering patterns in the embedding space suggested that probabilistic updates preserved meaningful semantic groupings while enabling context-driven shifts, further validating the stability of the transition mechanism. Performance metrics indicated that stochastic transitions balanced adaptability and control, ensuring that generative outputs remained linguistically coherent without excessive randomness.
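One plausible instantiation of such a transition, stated purely as an assumption for illustration, is a Gaussian perturbation of the token embeddings with a learned per-dimension scale:

```python
# Minimal sketch, assuming a Gaussian transition on token embeddings with a
# learned scale; the actual probabilistic update rule may differ.
import torch
import torch.nn as nn

class StochasticTransition(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.log_sigma = nn.Parameter(torch.full((dim,), -3.0))

    def forward(self, emb):
        # Perturb embeddings with controlled noise at inference so the
        # representation can drift with context instead of staying static.
        sigma = self.log_sigma.exp()
        return emb + sigma * torch.randn_like(emb)

trans = StochasticTransition()
tokens = torch.randn(1, 16, 768)              # a batch of token embeddings
print((trans(tokens) - tokens).abs().mean())  # small, sigma-controlled drift
```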
https://arxiv.org/abs/2502.05553
Effective urban traffic management is vital for sustainable city development, relying on intelligent systems with machine learning tasks such as traffic flow prediction and travel time estimation. Traditional approaches usually focus on static road network and trajectory representation learning, and overlook the dynamic nature of traffic states and trajectories, which is crucial for downstream tasks. To address this gap, we propose TRACK, a novel framework to bridge traffic state and trajectory data for dynamic road network and trajectory representation learning. TRACK leverages graph attention networks (GAT) to encode static and spatial road segment features, and introduces a transformer-based model for trajectory representation learning. By incorporating transition probabilities from trajectory data into GAT attention weights, TRACK captures dynamic spatial features of road segments. Meanwhile, TRACK designs a traffic transformer encoder to capture the spatial-temporal dynamics of road segments from traffic state data. To further enhance dynamic representations, TRACK proposes a co-attentional transformer encoder and a trajectory-traffic state matching task. Extensive experiments on real-life urban traffic datasets demonstrate the superiority of TRACK over state-of-the-art baselines. Case studies confirm TRACK's ability to capture spatial-temporal dynamics effectively.
https://arxiv.org/abs/2502.06870
Current video-based Masked Autoencoders (MAEs) primarily focus on learning effective spatiotemporal representations from a visual perspective. As a result, models may prioritize general spatial-temporal patterns while overlooking nuanced semantic attributes, such as the specific interactions or sequences that define actions, i.e., action-specific features that align more closely with human cognition for space-time correspondence. This can limit the model's ability to capture the essence of actions that are contextually rich and continuous. Humans are capable of mapping visual concepts, object view invariance, and semantic attributes available in static instances to comprehend natural dynamic scenes or videos. Existing MAEs for videos and static images rely on separate datasets, which may lack the rich semantic attributes necessary for fully understanding the learned concepts, especially when compared to using videos together with their sampled frame images. To this end, we propose CrossVideoMAE, an end-to-end self-supervised cross-modal contrastive learning MAE that effectively learns both video-level and frame-level rich spatiotemporal representations and semantic attributes. Our method integrates mutual spatiotemporal information from videos with spatial information from sampled frames within a feature-invariant space, while encouraging invariance to augmentations within the video domain. This objective is achieved by jointly embedding features of visible tokens and combining feature correspondences within and across modalities, which is critical for acquiring rich, label-free guiding signals from both the video and frame image modalities in a self-supervised manner. Extensive experiments demonstrate that our approach surpasses previous state-of-the-art methods, and ablation studies validate its effectiveness.
https://arxiv.org/abs/2502.07811
Dense contrastive representation learning (DCRL) has greatly improved the learning efficiency of image dense prediction tasks, showing great potential to reduce the large costs of medical image collection and dense annotation. However, the properties of medical images make correspondence discovery unreliable, creating an open problem of large-scale false positive and negative (FP&N) pairs in DCRL. In this paper, we propose GEoMetric vIsual deNse sImilarity (GEMINI) learning, which embeds a homeomorphism prior into DCRL and enables reliable correspondence discovery for effective dense contrast. We propose deformable homeomorphism learning (DHL), which models the homeomorphism of medical images and learns to estimate a deformable mapping that predicts pixel correspondence under topological preservation. It effectively reduces the search space of pairing and drives an implicit and soft learning of negative pairs via a gradient. We also propose a geometric semantic similarity (GSS), which extracts semantic information from features to measure the alignment degree for correspondence learning. It promotes the learning efficiency and performance of deformation, constructing positive pairs reliably. We implement two practical variants on two typical representation learning tasks in our experiments. Our promising results on seven datasets, which outperform existing methods, demonstrate the superiority of our approach. We will release our code at a companion link: this https URL.
https://arxiv.org/abs/2502.05282
Street view imagery is extensively utilized in representation learning for urban visual environments, supporting various sustainable development tasks such as environmental perception and socio-economic assessment. However, it is challenging for existing image representations to specifically encode the dynamic urban environment (such as pedestrians, vehicles, and vegetation), the built environment (including buildings, roads, and urban infrastructure), and the environmental ambiance (such as the cultural and socioeconomic atmosphere) depicted in street view imagery to address downstream tasks related to the city. In this work, we propose an innovative self-supervised learning framework that leverages temporal and spatial attributes of street view imagery to learn image representations of the dynamic urban environment for diverse downstream tasks. By employing street view images captured at the same location over time and spatially nearby views at the same time, we construct contrastive learning tasks designed to learn the temporal-invariant characteristics of the built environment and the spatial-invariant neighborhood ambiance. Our approach significantly outperforms traditional supervised and unsupervised methods in tasks such as visual place recognition, socioeconomic estimation, and human-environment perception. Moreover, we demonstrate the varying behaviors of image representations learned through different contrastive learning objectives across various downstream tasks. This study systematically discusses representation learning strategies for urban studies based on street view images, providing a benchmark that enhances the applicability of visual data in urban science. The code is available at this https URL.
https://arxiv.org/abs/2502.04638
Optimization methodologies for training large-scale neural architectures often rely on uniform gradient propagation mechanisms that fail to align with hierarchical linguistic structures, limiting their capacity to generalize across diverse language distributions. A structured gradient refinement framework was introduced to incorporate multi-scale contextual adjustments, improving parameter adaptation through dynamic weighting strategies that enhanced representation coherence. Empirical evaluations demonstrated that structured propagation mechanisms contributed to reductions in gradient oscillations, resulting in more stable training dynamics and improved optimization efficiency. The comparative performance assessment indicated that models incorporating hierarchical propagation strategies exhibited greater robustness in long-range dependency retention and cross-domain adaptation. The hierarchical adjustment of weight updates provided an alternative to conventional backpropagation, reducing sensitivity to initialization conditions while improving overall convergence efficiency. The experimental results confirmed that structured gradient propagation influenced representation learning trajectories, aligning parameter updates with broader linguistic dependencies rather than isolated token-level relationships. Statistical evaluations indicated that structured optimization strategies mitigated overfitting while preserving adaptability across heterogeneous text distributions. The findings established that structured gradient propagation provided an empirically validated framework for refining hierarchical representation learning, supporting more effective integration of linguistic dependencies into optimization dynamics.
https://arxiv.org/abs/2502.04548
Continuous Latent Space (CLS) and Discrete Latent Space (DLS) models, like AttnUNet and VQUNet, have excelled in medical image segmentation. In contrast, Synergistic Continuous and Discrete Latent Space (CDLS) models show promise in handling fine- and coarse-grained information. However, they struggle with modeling long-range dependencies. CLS- or CDLS-based models, such as TransUNet or SynergyNet, are adept at capturing long-range dependencies. Since they rely heavily on feature pooling or aggregation using self-attention, they may capture dependencies among redundant regions. This hinders comprehension of anatomical structure content, poses challenges in modeling intra-class and inter-class dependencies, increases false negatives, and compromises generalization. Addressing these issues, we propose L2GNet, which learns global dependencies by relating discrete codes obtained from DLS using optimal transport and aligning codes on a trainable reference. L2GNet achieves discriminative on-the-fly representation learning without an additional weight matrix in self-attention models, making it computationally efficient for medical applications. Extensive experiments on multi-organ segmentation and cardiac datasets demonstrate L2GNet's superiority over state-of-the-art methods, including the CDLS method SynergyNet, offering a novel approach to enhance deep learning models' performance in medical image analysis.
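The code-to-reference alignment can be sketched with a few iterations of entropic optimal transport (Sinkhorn); the reference below is a plain trainable parameter, and all sizes are illustrative rather than the paper's.

```python
# Hedged sketch: align discrete codes to a trainable reference via Sinkhorn.
import torch
import torch.nn as nn

def sinkhorn(cost, eps=0.1, iters=50):
    K = torch.exp(-cost / eps)                     # (n, m) Gibbs kernel
    a = torch.ones(cost.shape[0]) / cost.shape[0]  # uniform source marginal
    b = torch.ones(cost.shape[1]) / cost.shape[1]  # uniform target marginal
    u, v = a.clone(), b.clone()
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return u[:, None] * K * v[None, :]             # transport plan

codes = torch.randn(64, 32)                        # discrete codes from a DLS
reference = nn.Parameter(torch.randn(16, 32))      # trainable reference
cost = torch.cdist(codes, reference.detach())
cost = cost / cost.max()                           # normalize to avoid exp underflow
plan = sinkhorn(cost)
# Rows of the plan give each code's soft assignment to reference anchors.
print(round(plan.sum().item(), 4))                 # ~1.0 (a valid coupling)
```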
https://arxiv.org/abs/2502.05229
The organization of latent token representations plays a crucial role in determining the stability, generalization, and contextual consistency of language models, yet conventional approaches to embedding refinement often rely on parameter modifications that introduce additional computational overhead. A hierarchical alignment method was introduced to restructure token embeddings without altering core model weights, ensuring that representational distributions maintained coherence across different linguistic contexts. Experimental evaluations demonstrated improvements in rare token retrieval, adversarial robustness, and long-range dependency tracking, highlighting the advantages of hierarchical structuring in mitigating inconsistencies in latent space organization. The comparative analysis against conventional fine-tuning and embedding perturbation methods revealed that hierarchical restructuring maintained computational efficiency while achieving measurable gains in representation quality. Structural refinements introduced through the alignment process resulted in improved contextual stability across varied linguistic tasks, reducing inconsistencies in token proximity relationships and enhancing interpretability in language generation. A detailed computational assessment confirmed that the realignment process introduced minimal inference overhead, ensuring that representational improvements did not compromise model efficiency. The findings reinforced the broader significance of structured representation learning, illustrating that hierarchical embedding modifications could serve as an effective strategy for refining latent space distributions while preserving pre-learned semantic associations.
https://arxiv.org/abs/2502.03766
Efficient and reliable probabilistic prediction of intraday electricity prices is essential to manage market uncertainties and support robust trading strategies. However, current methods often suffer from parameter inefficiencies, as they fail to fully exploit the potential of modeling interdependencies between bids and offers in the orderbook, requiring a large number of parameters for representation learning. Furthermore, these methods face the quantile crossing issue, where upper quantiles fall below lower quantiles, resulting in unreliable probabilistic predictions. To address these two challenges, we propose an encoding method called OrderFusion and design a hierarchical multi-quantile head. OrderFusion encodes the orderbook into a 2.5D representation, which is processed by a tailored jump cross-attention backbone to capture the interdependencies of bids and offers, enabling parameter-efficient learning. The head sets the median quantile as an anchor and predicts multiple quantiles hierarchically, ensuring reliability by enforcing monotonicity between quantiles through non-negative functions. Extensive experiments and ablation studies are conducted on four price indices: 60-min ID3, 60-min ID1, 15-min ID3, and 15-min ID1, using three years of the German orderbook to ensure a fair evaluation. The results confirm that our design choices improve overall performance, offering a parameter-efficient and reliable solution for probabilistic intraday price prediction.
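A minimal sketch of the anchored multi-quantile idea: the median is predicted directly, and the remaining quantiles are built from cumulative non-negative (softplus) increments, which enforces monotonicity and rules out quantile crossing by construction. Quantile counts and hidden sizes below are assumptions.

```python
# Hedged sketch of a hierarchical multi-quantile head, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalQuantileHead(nn.Module):
    def __init__(self, hidden=64, n_upper=2, n_lower=2):
        super().__init__()
        self.median = nn.Linear(hidden, 1)
        self.upper = nn.Linear(hidden, n_upper)   # increments above the median
        self.lower = nn.Linear(hidden, n_lower)   # decrements below the median

    def forward(self, h):
        med = self.median(h)
        up = med + torch.cumsum(F.softplus(self.upper(h)), dim=-1)
        low = med - torch.cumsum(F.softplus(self.lower(h)), dim=-1)
        # e.g. quantiles [q10, q25, q50, q75, q90], sorted by construction
        return torch.cat([low.flip(-1), med, up], dim=-1)

head = HierarchicalQuantileHead()
q = head(torch.randn(8, 64))
print(torch.all(q[:, :-1] <= q[:, 1:]))           # tensor(True): no crossing
```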
https://arxiv.org/abs/2502.06830