Robotic assistive feeding holds significant promise for improving the quality of life for individuals with eating disabilities. However, acquiring diverse food items under varying conditions and generalizing to unseen food presents unique challenges. Existing methods that rely on surface-level geometric information (e.g., bounding box and pose) derived from visual cues (e.g., color, shape, and texture) often lack adaptability and robustness, especially when foods share similar physical properties but differ in visual appearance. We employ imitation learning (IL) to learn a policy for food acquisition. Existing methods employ IL or Reinforcement Learning (RL) to learn a policy based on off-the-shelf image encoders such as ResNet-50. However, such representations are not robust and struggle to generalize across diverse acquisition scenarios. To address these limitations, we propose a novel approach, IMRL (Integrated Multi-Dimensional Representation Learning), which integrates visual, physical, temporal, and geometric representations to enhance the robustness and generalizability of IL for food acquisition. Our approach captures food types and physical properties (e.g., solid, semi-solid, granular, liquid, and mixture), models temporal dynamics of acquisition actions, and introduces geometric information to determine optimal scooping points and assess bowl fullness. IMRL enables IL to adaptively adjust scooping strategies based on context, improving the robot's capability to handle diverse food acquisition scenarios. Experiments on a real robot demonstrate our approach's robustness and adaptability across various foods and bowl configurations, including zero-shot generalization to unseen settings. Our approach achieves an improvement of up to $35\%$ in success rate compared with the best-performing baseline.
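As a rough illustration of the multi-dimensional fusion idea (not the authors' implementation), the sketch below concatenates visual, physical, temporal, and geometric feature vectors into a single state that conditions an imitation-learning policy head. The feature dimensions, the concatenation-based fusion, and the behaviour-cloning loss are assumptions.

```python
# Minimal sketch, assuming concatenation-based fusion and illustrative feature sizes;
# not the IMRL authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedAcquisitionPolicy(nn.Module):
    def __init__(self, d_visual=512, d_physical=32, d_temporal=128, d_geometric=16, d_action=7):
        super().__init__()
        d_state = d_visual + d_physical + d_temporal + d_geometric
        self.policy = nn.Sequential(
            nn.Linear(d_state, 256), nn.ReLU(),
            nn.Linear(256, d_action),          # e.g. end-effector pose delta + scoop depth
        )

    def forward(self, visual, physical, temporal, geometric):
        state = torch.cat([visual, physical, temporal, geometric], dim=-1)
        return self.policy(state)

# one behaviour-cloning step on a batch of demonstrations (dummy tensors)
policy = FusedAcquisitionPolicy()
feats = {name: torch.randn(8, d) for name, d in
         [("visual", 512), ("physical", 32), ("temporal", 128), ("geometric", 16)]}
expert_action = torch.randn(8, 7)
loss = F.mse_loss(policy(**feats), expert_action)
loss.backward()
```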
https://arxiv.org/abs/2409.12092
Current audio-visual representation learning can capture rough object categories (e.g., ``animals'' and ``instruments''), but it lacks the ability to recognize fine-grained details, such as specific categories like ``dogs'' and ``flutes'' within animals and instruments. To address this issue, we introduce DETECLAP, a method to enhance audio-visual representation learning with object information. Our key idea is to introduce an audio-visual label prediction loss to the existing Contrastive Audio-Visual Masked AutoEncoder to enhance its object awareness. To avoid costly manual annotations, we prepare object labels from both audio and visual inputs using state-of-the-art language-audio models and object detectors. We evaluate the method on audio-visual retrieval and classification using the VGGSound and AudioSet20K datasets. Our method achieves improvements in recall@10 of +1.5% and +1.2% for audio-to-visual and visual-to-audio retrieval, respectively, and an improvement in accuracy of +0.6% for audio-visual classification.
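A hedged sketch of the central idea — layering an audio-visual object-label prediction loss on top of a contrastive audio-visual objective — is given below. The label heads, the pseudo-label vocabulary size, and the loss weighting are assumptions, not DETECLAP's exact recipe.

```python
# Illustrative only: an object-label prediction loss added to a contrastive
# audio-visual loss; dimensions, vocabulary size, and weighting are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_loss(a, v, temperature=0.07):
    a, v = F.normalize(a, dim=-1), F.normalize(v, dim=-1)
    logits = a @ v.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

num_objects = 300                          # size of the mined object-label vocabulary (assumed)
audio_head = nn.Linear(768, num_objects)
visual_head = nn.Linear(768, num_objects)

def label_aware_loss(audio_emb, visual_emb, pseudo_labels, alpha=1.0):
    # pseudo_labels: multi-hot object labels mined by a language-audio model / object detector
    l_con = contrastive_loss(audio_emb, visual_emb)
    l_lab = (F.binary_cross_entropy_with_logits(audio_head(audio_emb), pseudo_labels)
             + F.binary_cross_entropy_with_logits(visual_head(visual_emb), pseudo_labels))
    return l_con + alpha * l_lab

loss = label_aware_loss(torch.randn(16, 768), torch.randn(16, 768),
                        torch.randint(0, 2, (16, num_objects)).float())
```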
https://arxiv.org/abs/2409.11729
We present VertiEncoder, a self-supervised representation learning approach for robot mobility on vertically challenging terrain. Using the same pre-training process, VertiEncoder can handle four different downstream tasks, including forward kinodynamics learning, inverse kinodynamics learning, behavior cloning, and patch reconstruction with a single representation. VertiEncoder uses a TransformerEncoder to learn the local context of its surroundings by random masking and next patch reconstruction. We show that VertiEncoder achieves better performance across all four tasks than specialized end-to-end models while using 77% fewer parameters. We also show VertiEncoder's comparable performance against state-of-the-art kinodynamic modeling and planning approaches in real-world robot deployment. These results underscore the efficacy of VertiEncoder in mitigating overfitting and fostering more robust generalization across diverse environmental contexts and downstream vehicle kinodynamic tasks.
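The masking-and-reconstruction pre-training described above can be pictured with the sketch below: embed a sequence of terrain patches, randomly mask some tokens, run a TransformerEncoder, and reconstruct the masked patches from context. Patch size, masking ratio, and model width are assumptions; this is not the released VertiEncoder code.

```python
# Illustrative sketch of masked patch-context pre-training; all hyperparameters assumed.
import torch
import torch.nn as nn

class PatchContextEncoder(nn.Module):
    def __init__(self, patch_dim=16 * 16, d_model=128, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.reconstruct = nn.Linear(d_model, patch_dim)

    def forward(self, patches, mask_ratio=0.3):
        x = self.embed(patches)                               # (B, T, d_model)
        mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        context = self.encoder(x)
        return self.reconstruct(context), mask

patches = torch.randn(2, 20, 16 * 16)         # e.g. 20 elevation-map patches per trajectory
recon, mask = PatchContextEncoder()(patches)
loss = ((recon - patches) ** 2)[mask].mean()  # reconstruct only the masked patches
```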
https://arxiv.org/abs/2409.11570
Audio-text contrastive models have become a powerful approach in music representation learning. Despite their empirical success, however, little is known about the influence of key design choices on the quality of music-text representations learnt through this framework. In this work, we expose these design choices within the constraints of limited data and computation budgets, and establish a more solid understanding of their impact grounded in empirical observations along three axes: the choice of base encoders, the level of curation in training data, and the use of text augmentation. We find that data curation is the single most important factor for music-text contrastive training in resource-constrained scenarios. Motivated by this insight, we introduce two novel techniques, Augmented View Dropout and TextSwap, which increase the diversity and descriptiveness of text inputs seen in training. Through our experiments we demonstrate that these are effective at boosting performance across different pre-training regimes, model architectures, and downstream data distributions, without incurring higher computational costs or requiring additional training data.
https://arxiv.org/abs/2409.11498
Self-supervised learning has proved effective for skeleton-based human action understanding. However, previous works either rely on contrastive learning, which suffers from false negatives, or are based on reconstruction, which learns too many unessential low-level clues, leading to limited representations for downstream tasks. Recently, great advances have been made in generative learning, which is naturally a challenging yet meaningful pretext task to model the general underlying data distributions. However, the representation learning capacity of generative models is under-explored, especially for skeletons, which exhibit spatial sparsity and temporal redundancy. To this end, we propose Masked Conditional Diffusion (MacDiff) as a unified framework for human skeleton modeling. For the first time, we leverage diffusion models as effective skeleton representation learners. Specifically, we train a diffusion decoder conditioned on the representations extracted by a semantic encoder. Random masking is applied to encoder inputs to introduce an information bottleneck and remove redundancy in the skeletons. Furthermore, we theoretically demonstrate that our generative objective involves the contrastive learning objective, which aligns the masked and noisy views. Meanwhile, it also encourages the representation to complement the noisy view, leading to better generalization performance. MacDiff achieves state-of-the-art performance on representation learning benchmarks while maintaining the competence for generative tasks. Moreover, we leverage the diffusion model for data augmentation, significantly enhancing the fine-tuning performance in scenarios with scarce labeled data. Our project is available at this https URL.
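A rough sketch of the training step implied by this description follows: a semantic encoder sees a randomly masked skeleton sequence, and a diffusion decoder learns to denoise the full sequence conditioned on that representation. The tiny GRU encoder/decoder, the linear noise schedule, and the (batch, time, joints×3) layout are all assumptions, not MacDiff's actual architecture.

```python
# Hedged sketch of a masked-conditional-diffusion training step; architectures,
# noise schedule, and data layout are assumptions.
import torch
import torch.nn as nn

T_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alphas_cum = torch.cumprod(1.0 - betas, dim=0)

class TinyEncoder(nn.Module):
    def __init__(self, d=75, h=128):
        super().__init__()
        self.rnn = nn.GRU(d, h, batch_first=True)
    def forward(self, x):
        _, hidden = self.rnn(x)
        return hidden[-1]                              # (B, h) semantic representation

class TinyDecoder(nn.Module):
    def __init__(self, d=75, h=128):
        super().__init__()
        self.t_embed = nn.Embedding(T_STEPS, h)
        self.rnn = nn.GRU(d, h, batch_first=True)
        self.out = nn.Linear(h, d)
    def forward(self, noisy, t, cond):
        out, _ = self.rnn(noisy)
        out = out + (cond + self.t_embed(t)).unsqueeze(1)
        return self.out(out)                           # predicted noise

def masked_conditional_diffusion_step(encoder, decoder, skeleton, mask_ratio=0.5):
    B, T, D = skeleton.shape
    keep = (torch.rand(B, T, 1) > mask_ratio).float()
    cond = encoder(skeleton * keep)                    # representation from the masked view
    t = torch.randint(0, T_STEPS, (B,))
    a = alphas_cum[t].view(B, 1, 1)
    noise = torch.randn_like(skeleton)
    noisy = a.sqrt() * skeleton + (1 - a).sqrt() * noise
    return ((decoder(noisy, t, cond) - noise) ** 2).mean()

loss = masked_conditional_diffusion_step(TinyEncoder(), TinyDecoder(), torch.randn(4, 60, 75))
```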
https://arxiv.org/abs/2409.10473
We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters that achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Additionally, Matryoshka Representation Learning is integrated into the training process, allowing flexible truncation of embedding dimensions without compromising performance. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, while achieving superior performance compared to multilingual-e5-large-instruct across all multilingual tasks.
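As a minimal illustration of what Matryoshka training enables at inference time, an embedding can simply be truncated to its first k dimensions and re-normalised; the dimensions below are examples, not documented settings of the model.

```python
# Sketch only: truncate-and-renormalise, the usage pattern Matryoshka training enables.
import torch
import torch.nn.functional as F

def truncate_embedding(emb: torch.Tensor, k: int) -> torch.Tensor:
    return F.normalize(emb[..., :k], dim=-1)

full = F.normalize(torch.randn(4, 1024), dim=-1)        # stand-in for model outputs
for k in (1024, 512, 128, 64):
    small = truncate_embedding(full, k)
    print(k, small.shape, float(small[0] @ small[1]))   # cosine similarity at each size
```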
https://arxiv.org/abs/2409.10173
Self-supervised speech representation learning has become essential for extracting meaningful features from untranscribed audio. Recent advances highlight the potential of deriving discrete symbols from the features correlated with linguistic units, which enables text-less training across diverse tasks. In particular, sentence-level Self-Distillation of the pretrained HuBERT (SD-HuBERT) induces syllabic structures within latent speech frame representations extracted from an intermediate Transformer layer. In SD-HuBERT, sentence-level representation is accumulated from speech frame features through self-attention layers using a special CLS token. However, we observe that the information aggregated in the CLS token correlates more with speaker identity than with linguistic content. To address this, we propose a speech-only self-supervised fine-tuning approach that separates syllabic units from speaker information. Our method introduces speaker perturbation as data augmentation and adopts a frame-level training objective to prevent the CLS token from aggregating paralinguistic information. Experimental results show that our approach surpasses the current state-of-the-art method in most syllable segmentation and syllabic unit quality metrics on Librispeech, underscoring its effectiveness in promoting syllabic organization within speech-only models.
https://arxiv.org/abs/2409.10103
Current end-to-end autonomous driving methods resort to unifying modular designs for various tasks (e.g. perception, prediction and planning). Although optimized in a planning-oriented spirit with a fully differentiable framework, existing end-to-end driving systems without ego-centric designs still suffer from unsatisfactory performance and inferior efficiency, owing to rasterized scene representation learning and redundant information transmission. In this paper, we revisit human driving behavior and propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving. Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction and iterative motion planner. The sparse perception module performs detection, tracking and online mapping based on sparse representation of the driving scene. The hierarchical interaction module aims to select the Closest In-Path Vehicle / Stationary (CIPV / CIPS) from coarse to fine, benefiting from an additional geometric prior. As for the iterative motion planner, both selected interactive agents and the ego-vehicle are considered for joint motion prediction, where the output multi-modal ego-trajectories are optimized in an iterative fashion. Besides, both position-level motion diffusion and trajectory-level planning denoising are introduced for uncertainty modeling, thus facilitating the training stability and convergence of the whole framework. Extensive experiments conducted on the nuScenes dataset demonstrate the superior planning performance and great efficiency of DiFSD, which significantly reduces the average L2 error by \textbf{66\%} and the collision rate by \textbf{77\%} compared with UniAD while achieving \textbf{8.2$\times$} faster running efficiency.
https://arxiv.org/abs/2409.09777
Multimodal schizophrenia assessment systems have gained traction over the last few years. This work introduces a schizophrenia assessment system to discern between prominent symptom classes of schizophrenia and predict an overall schizophrenia severity score. We develop a Vector Quantized Variational Auto-Encoder (VQ-VAE) based Multimodal Representation Learning (MRL) model to produce task-agnostic speech representations from vocal Tract Variables (TVs) and Facial Action Units (FAUs). These representations are then used in a Multi-Task Learning (MTL) based downstream prediction model to obtain class labels and an overall severity score. The proposed framework outperforms the previous works on the multi-class classification task across all evaluation metrics (Weighted F1 score, AUC-ROC score, and Weighted Accuracy). Additionally, it estimates the schizophrenia severity score, a task not addressed by earlier approaches.
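A hedged sketch of the vector-quantisation step at the heart of such a VQ-VAE-based representation learner is shown below; the codebook size, feature dimensions, and commitment weight are assumptions rather than the paper's settings.

```python
# Illustrative VQ layer with a straight-through estimator; not the paper's implementation.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                               # z: (B, T, dim) continuous features
        B, T, D = z.shape
        flat = z.reshape(-1, D)
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))   # squared distance to every code
        idx = dist.argmin(dim=-1).view(B, T)
        q = self.codebook(idx)
        loss = ((q - z.detach()) ** 2).mean() + self.beta * ((z - q.detach()) ** 2).mean()
        q = z + (q - z).detach()                        # straight-through gradient
        return q, idx, loss

vq = VectorQuantizer()
features = torch.randn(2, 50, 64)                       # e.g. fused TV + FAU features per frame
quantised, codes, vq_loss = vq(features)
```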
https://arxiv.org/abs/2409.09733
The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia, highlighting the urgent need for robust and generalizable face forgery detection (FFD) techniques. Although existing approaches mainly capture face forgery patterns using the image modality, other modalities like fine-grained noises and texts are not fully explored, which limits the generalization capability of the model. In addition, most FFD methods tend to identify facial images generated by GANs, but struggle to detect unseen diffusion-synthesized ones. To address these limitations, we aim to leverage the cutting-edge foundation model, contrastive language-image pre-training (CLIP), to achieve generalizable diffusion face forgery detection (DFFD). In this paper, we propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities via language-guided face forgery representation learning, to facilitate the advancement of DFFD. Specifically, we devise a fine-grained language encoder (FLE) that extracts fine global language features from hierarchical text prompts. We design a multi-modal vision encoder (MVE) to capture global image forgery embeddings as well as fine-grained noise forgery patterns extracted from the richest patch, and integrate them to mine general visual forgery traces. Moreover, we build an innovative plug-and-play sample pair attention (SPA) method to emphasize relevant negative pairs and suppress irrelevant ones, allowing cross-modality sample pairs to conduct more flexible alignment. Extensive experiments and visualizations show that our model outperforms the state of the art in different settings such as cross-generator, cross-forgery, and cross-dataset evaluations.
https://arxiv.org/abs/2409.09724
We introduce AnnualBERT, a series of language models designed specifically to capture the temporal evolution of scientific text. Deviating from the prevailing paradigms of subword tokenizations and "one model to rule them all", AnnualBERT adopts whole words as tokens and is composed of a base RoBERTa model pretrained from scratch on the full-text of 1.7 million arXiv papers published until 2008 and a collection of progressively trained models on arXiv papers at an annual basis. We demonstrate the effectiveness of AnnualBERT models by showing that they not only have comparable performances in standard tasks but also achieve state-of-the-art performances on domain-specific NLP tasks as well as link prediction tasks in the arXiv citation network. We then utilize probing tasks to quantify the models' behavior in terms of representation learning and forgetting as time progresses. Our approach enables the pretrained models to not only improve performances on scientific text processing tasks but also to provide insights into the development of scientific discourse over time. The series of the models is available at this https URL.
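As a toy illustration of the whole-word tokenisation choice described above (the lower-casing, whitespace splitting, and vocabulary cut-off here are assumptions, not the models' actual preprocessing):

```python
# Toy whole-word tokeniser: the vocabulary is built from whole words, not subword pieces.
from collections import Counter

def build_whole_word_vocab(texts, max_size=50_000):
    counts = Counter(w for t in texts for w in t.lower().split())
    words = [w for w, _ in counts.most_common(max_size)]
    return {w: i + 2 for i, w in enumerate(words)}      # 0: <pad>, 1: <unk>

def encode(text, vocab):
    return [vocab.get(w, 1) for w in text.lower().split()]

vocab = build_whole_word_vocab(["gauge symmetry breaking", "symmetry breaking in gauge theory"])
print(encode("gauge symmetry restoration", vocab))       # an unseen word maps to <unk>
```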
https://arxiv.org/abs/2409.09636
Human auditory perception is compositional in nature -- we identify auditory streams from auditory scenes with multiple sound events. However, such auditory scenes are typically represented using clip-level representations that do not disentangle the constituent sound sources. In this work, we learn source-centric audio representations where each sound source is represented using a distinct, disentangled source embedding in the audio representation. We propose two novel approaches to learning source-centric audio representations: a supervised model guided by classification and an unsupervised model guided by feature reconstruction, both of which outperform the baselines. We thoroughly evaluate the design choices of both approaches using an audio classification task. We find that supervision is beneficial to learn source-centric representations, and that reconstructing audio features is more useful than reconstructing spectrograms to learn unsupervised source-centric representations. Leveraging source-centric models can help unlock the potential of greater interpretability and more flexible decoding in machine listening.
https://arxiv.org/abs/2409.09619
While motion has garnered attention in various tasks, its potential as a modality for weakly-supervised object detection (WSOD) in static images remains unexplored. Our study introduces an approach to enhance WSOD methods by integrating motion information. This method involves leveraging hallucinated motion from static images to improve WSOD on image datasets, utilizing a Siamese network for enhanced representation learning with motion, addressing camera motion through motion normalization, and selectively training images based on object motion. Experimental validation on the COCO and YouTube-BB datasets demonstrates improvements over a state-of-the-art method.
https://arxiv.org/abs/2409.09616
Learning representations on large graphs is a long-standing challenge due to the inter-dependent nature of graph data. Transformers have recently shown promising performance on small graphs thanks to their global attention for capturing all-pair interactions beyond observed structures. Existing approaches tend to inherit the spirit of Transformers in language and vision tasks, and embrace complicated architectures by stacking deep attention-based propagation layers. In this paper, we attempt to evaluate the necessity of adopting multi-layer attentions in Transformers on graphs, which considerably restricts efficiency. Specifically, we analyze a generic hybrid propagation layer, comprised of all-pair attention and graph-based propagation, and show that multi-layer propagation can be reduced to one-layer propagation with the same capability for representation learning. It suggests a new technical path for building powerful and efficient Transformers on graphs, particularly through simplifying model architectures without sacrificing expressiveness. As exemplified by this work, we propose a Simplified Single-layer Graph Transformer (SGFormer), whose main component is a single-layer global attention that scales linearly w.r.t. graph sizes and requires no approximation to accommodate all-pair interactions. Empirically, SGFormer successfully scales to the web-scale graph ogbn-papers100M, yields orders-of-magnitude inference acceleration over peer Transformers on medium-sized graphs, and demonstrates competitiveness with limited labeled data.
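One common way to realise all-pair attention in time linear in the number of nodes is a kernelised attention that avoids the N×N softmax; the sketch below follows that generic recipe as an illustration and is not necessarily SGFormer's exact formulation (the sigmoid feature map and hidden size are assumptions).

```python
# Hedged sketch of a single-layer, linear-time global attention over node features.
import torch
import torch.nn as nn

class LinearGlobalAttention(nn.Module):
    def __init__(self, d_in, d_hidden=64):
        super().__init__()
        self.q = nn.Linear(d_in, d_hidden)
        self.k = nn.Linear(d_in, d_hidden)
        self.v = nn.Linear(d_in, d_hidden)

    def forward(self, x):                                # x: (N, d_in) node features
        q = torch.sigmoid(self.q(x))                     # positive feature maps replace the softmax
        k = torch.sigmoid(self.k(x))
        v = self.v(x)
        kv = k.t() @ v                                   # (d, d): O(N d^2) instead of O(N^2 d)
        norm = q @ k.sum(dim=0, keepdim=True).t()        # (N, 1) normaliser
        return (q @ kv) / (norm + 1e-6)

x = torch.randn(10_000, 128)                             # ten thousand nodes attended all-pair at once
out = LinearGlobalAttention(128)(x)
```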
https://arxiv.org/abs/2409.09007
A joint image reconstruction and segmentation approach based on disentangled representation learning was trained to enable cardiac cine MR imaging in real-time and under free-breathing. An exploratory feasibility study tested the proposed method in undersampled real-time acquisitions based on an in-house developed spiral bSSFP pulse sequence in eight healthy participants and five patients with intermittent atrial fibrillation. Images and predicted LV segmentations were compared to the reference standard of ECG-gated segmented Cartesian cine in repeated breath-holds and corresponding manual segmentation. On a 5-point Likert scale, image quality of the real-time breath-hold approach and Cartesian cine was comparable in healthy participants (RT-BH: 1.99 $\pm$ .98, Cartesian: 1.94 $\pm$ .86, p=.052), but slightly inferior in free-breathing (RT-FB: 2.40 $\pm$ .98, p<.001). In patients with arrhythmia, image quality from both real-time approaches was favourable (RT-BH: 2.10 $\pm$ 1.28, p<.001, RT-FB: 2.40 $\pm$ 1.13, p<.001, Cartesian: 2.68 $\pm$ 1.13). Intra-observer reliability was good (ICC=.77, 95%-confidence interval [.75, .79], p<.001). In functional analysis, a positive bias was observed for ejection fractions derived from the proposed model compared to the clinical reference standard (RT-BH mean EF: 58.5 $\pm$ 5.6%, bias: +3.47%, 95%-confidence interval [-.86, 7.79%], RT-FB mean: 57.9 $\pm$ 10.6%, bias: +1.45%, [-3.02, 5.91%], Cartesian mean: 54.9 $\pm$ 6.7%). The introduced real-time MR imaging technique is capable of acquiring high-quality cardiac cine data in 1-2 minutes without the need for ECG gating and breath-holds. It thus offers a promising alternative to the current clinical practice of segmented acquisition, with shorter scan times, higher patient comfort and increased robustness to arrhythmia and patient incompliance.
https://arxiv.org/abs/2409.08619
Passive acoustic monitoring (PAM) is crucial for bioacoustic research, enabling non-invasive species tracking and biodiversity monitoring. Citizen science platforms like Xeno-Canto provide large annotated datasets from focal recordings, where the target species is intentionally recorded. However, PAM requires monitoring in passive soundscapes, creating a domain shift between focal and passive recordings, which challenges deep learning models trained on focal recordings. To address this, we leverage supervised contrastive learning to improve domain generalization in bird sound classification, enforcing domain invariance across same-class examples from different domains. We also propose ProtoCLR (Prototypical Contrastive Learning of Representations), which reduces the computational complexity of the SupCon loss by comparing examples to class prototypes instead of pairwise comparisons. Additionally, we present a new few-shot classification benchmark based on BirdSet, a large-scale bird sound dataset, and demonstrate the effectiveness of our approach in achieving strong transfer performance.
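The sketch below shows a prototype-based contrastive objective in the spirit described: each embedding is compared against C class prototypes rather than against every other same-class example. The learnable-prototype parameterisation and the temperature are assumptions, not the paper's exact recipe.

```python
# Illustrative prototype-based contrastive loss (B x C comparisons instead of B x B).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProtoContrastiveLoss(nn.Module):
    def __init__(self, num_classes, dim, temperature=0.1):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, dim))
        self.temperature = temperature

    def forward(self, embeddings, labels):
        z = F.normalize(embeddings, dim=-1)
        p = F.normalize(self.prototypes, dim=-1)
        logits = z @ p.t() / self.temperature            # similarity of each example to each prototype
        return F.cross_entropy(logits, labels)

criterion = ProtoContrastiveLoss(num_classes=50, dim=256)
z = torch.randn(32, 256)                                  # bird-sound embeddings from the encoder
y = torch.randint(0, 50, (32,))
loss = criterion(z, y)
```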
https://arxiv.org/abs/2409.08589
Graph Neural Networks (GNNs) have been widely employed for feature representation learning in molecular graphs. Therefore, it is crucial to enhance the expressiveness of feature representation to ensure the effectiveness of GNNs. However, a significant portion of current research primarily focuses on the structural features within individual molecules, often overlooking the structural similarity between molecules, which is a crucial aspect encapsulating rich information on the relationship between molecular properties and structural characteristics. Thus, these approaches fail to capture the rich semantic information at the molecular structure level. To bridge this gap, we introduce the \textbf{Molecular Structural Similarity Motif GNN (MSSM-GNN)}, a novel molecular graph representation learning method that can capture structural similarity information among molecules from a global perspective. In particular, we propose a specially designed graph that leverages graph kernel algorithms to represent the similarity between molecules quantitatively. Subsequently, we employ GNNs to learn feature representations from molecular graphs, aiming to enhance the accuracy of property prediction by incorporating additional molecular representation information. Finally, through a series of experiments conducted on both small-scale and large-scale molecular datasets, we demonstrate that our model consistently outperforms eleven state-of-the-art baselines. The codes are available at this https URL.
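As a hand-rolled illustration of the kind of graph-kernel signal a molecule-similarity graph can be built from, the sketch below computes a Weisfeiler-Lehman-style label-histogram similarity between two toy molecular graphs; it is not the paper's code, and the specific kernel and hydrogen-free toy graphs are assumptions.

```python
# Toy WL-style label-histogram kernel between two molecular graphs (adjacency list + atom labels).
from collections import Counter

def wl_histogram(adj, labels, iterations=2):
    hist = Counter(labels)
    for _ in range(iterations):
        labels = [hash((labels[i], tuple(sorted(labels[j] for j in adj[i]))))
                  for i in range(len(labels))]           # relabel each atom by its neighbourhood
        hist.update(labels)
    return hist

def kernel_similarity(g1, g2):
    h1, h2 = wl_histogram(*g1), wl_histogram(*g2)
    dot = sum(h1[k] * h2[k] for k in h1)
    norm = (sum(v * v for v in h1.values()) * sum(v * v for v in h2.values())) ** 0.5
    return dot / norm                                    # cosine similarity of label histograms

ethanol = ([[1], [0, 2], [1]], ["C", "C", "O"])          # C-C-O, hydrogens omitted
methanol = ([[1], [0]], ["C", "O"])                      # C-O
print(kernel_similarity(ethanol, methanol))
```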
https://arxiv.org/abs/2409.08580
Capturing complex hierarchical human activities, from atomic actions (e.g., picking up one present, moving to the sofa, unwrapping the present) to contextual events (e.g., celebrating Christmas), is crucial for achieving high-performance video question answering (VideoQA). Recent works have expanded multimodal models (e.g., CLIP, LLaVA) to process continuous video sequences, enhancing the model's temporal reasoning capabilities. However, these approaches often fail to capture contextual events that can be decomposed into multiple atomic actions non-continuously distributed over relatively long-term sequences. In this paper, to leverage the spatial visual context representation capability of the CLIP model for obtaining non-continuous visual representations of contextual events in videos, we convert long-term video sequences into a spatial image domain and finetune the multimodal model LLaVA for the VideoQA task. Our approach achieves competitive performance on the STAR task, in particular, with a 78.4% accuracy score, exceeding the current state-of-the-art score by 2.8 points on the NExTQA task.
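The conversion from a long video into the spatial image domain can be pictured with the sketch below, which tiles evenly sampled frames into a single grid image before passing it to the fine-tuned multimodal model. The grid size, cell resolution, and sampling scheme are assumptions, not the paper's configuration.

```python
# Hedged sketch: sample frames across a long clip and tile them into one grid image.
from PIL import Image

def frames_to_grid(frames, rows=4, cols=4, cell=224):
    """frames: list of PIL images; evenly sample rows*cols of them into one grid."""
    step = max(1, len(frames) // (rows * cols))
    sampled = frames[::step][: rows * cols]
    grid = Image.new("RGB", (cols * cell, rows * cell))
    for i, frame in enumerate(sampled):
        r, c = divmod(i, cols)
        grid.paste(frame.resize((cell, cell)), (c * cell, r * cell))
    return grid

# usage: grid = frames_to_grid(decoded_frames); the grid image then goes to the finetuned VQA model
```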
https://arxiv.org/abs/2409.07748
Video question answering (VideoQA) is a task to predict the correct answer to questions posed about a given video. The system must comprehend spatial and temporal relationships among objects extracted from videos to perform causal and temporal reasoning. While prior works have focused on modeling individual object movements using transformer-based methods, they falter when capturing complex scenarios involving multiple objects (e.g., "a boy is throwing a ball in a hoop"). We propose a contrastive language event graph representation learning method called CLanG to address this limitation. Aiming to capture event representations associated with multiple objects, our method employs a multi-layer GNN-cluster module for adversarial graph representation learning, enabling contrastive learning between the question text and its relevant multi-object event graph. Our method outperforms a strong baseline, achieving up to 2.2% higher accuracy on two challenging VideoQA datasets, NExT-QA and TGIF-QA-R. In particular, it is 2.8% better than baselines in handling causal and temporal questions, highlighting its strength in reasoning multiple object-based events.
https://arxiv.org/abs/2409.07747
Dialogue topic segmentation plays a crucial role in various types of dialogue modeling tasks. State-of-the-art unsupervised DTS methods learn topic-aware discourse representations from conversation data through adjacent discourse matching and pseudo segmentation to further mine useful clues in unlabeled conversational relations. However, in multi-round dialogues, discourses often contain co-references or omissions, so directly using these discourses for representation learning may negatively affect the semantic similarity computation in the adjacent discourse matching task. In order to fully utilize the useful cues in conversational relations, this study proposes a novel unsupervised dialogue topic segmentation method that combines the Utterance Rewriting (UR) technique with an unsupervised learning algorithm, efficiently exploiting the useful cues in unlabeled dialogues by rewriting them to recover co-referents and omitted words. Compared with existing unsupervised models, the proposed Discourse Rewriting Topic Segmentation Model (UR-DTS) significantly improves the accuracy of topic segmentation. The main finding is that performance on DialSeg711 improves by about 6% in terms of absolute error score and WD, reaching 11.42% in absolute error score and 12.97% in WD. On Doc2Dial, the absolute error score and WD improve by about 3% and 2%, respectively, with the resulting SOTA reaching 35.17% in absolute error score and 38.49% in WD. This shows that the model is very effective in capturing the nuances of conversational topics, as well as the usefulness and challenges of utilizing unlabeled conversations.
https://arxiv.org/abs/2409.07672