In this work, our goals are twofold: large-vocabulary continuous sign language recognition (CSLR) and sign language retrieval. To this end, we introduce a multi-task Transformer model, CSLR2, that ingests a signing sequence and outputs embeddings in a joint embedding space between signed language and spoken language text. To enable CSLR evaluation in the large-vocabulary setting, we introduce new, manually collected dataset annotations. These provide continuous sign-level annotations for six hours of test videos and will be made publicly available. We demonstrate that, with a careful choice of loss functions, training the model for both the CSLR and retrieval tasks is mutually beneficial in terms of performance: retrieval improves CSLR performance by providing context, while CSLR improves retrieval with more fine-grained supervision. We further show the benefits of leveraging weak and noisy supervision from large-vocabulary datasets such as BOBSL, namely sign-level pseudo-labels and English subtitles. Our model significantly outperforms the previous state of the art on both tasks.
https://arxiv.org/abs/2405.10266
Foundation models in computational pathology promise to unlock the development of new clinical decision support systems and models for precision medicine. However, there is a mismatch between most clinical analysis, which is defined at the level of one or more whole slide images, and foundation models to date, which process the thousands of image tiles contained in a whole slide image separately. The requirement to train a network to aggregate information across a large number of tiles in multiple whole slide images limits these models' impact. In this work, we present a slide-level foundation model for H&E-stained histopathology, PRISM, that builds on Virchow tile embeddings and leverages clinical report text for pre-training. Using the tile embeddings, PRISM produces slide-level embeddings with the ability to generate clinical reports, resulting in several modes of use. Using text prompts, PRISM achieves zero-shot cancer detection and sub-typing performance approaching and surpassing that of a supervised aggregator model. Using the slide embeddings with linear classifiers, PRISM surpasses supervised aggregator models. Furthermore, we demonstrate that fine-tuning of the PRISM slide encoder yields label-efficient training for biomarker prediction, a task that typically suffers from low availability of training data; an aggregator initialized with PRISM and trained on as little as 10% of the training data can outperform a supervised baseline that uses all of the data.
https://arxiv.org/abs/2405.10254
Graph Neural Networks (GNNs) are powerful tools for graph classification. One important operation for GNNs is downsampling or pooling, which can learn effective embeddings from the node representations. In this paper, we propose a new hierarchical pooling operation, namely the Edge-Node Attention-based Differentiable Pooling (ENADPool), for GNNs to learn effective graph representations. Unlike classical hierarchical pooling operations, which are based on unclear node assignments and simply compute the averaged feature over the nodes of each cluster, the proposed ENADPool not only employs a hard clustering strategy to assign each node to a unique cluster, but also compresses the node features as well as their edge connectivity strengths into the resulting hierarchical structure based on an attention mechanism after each pooling step. As a result, the proposed ENADPool simultaneously identifies the importance of the different nodes within each separated cluster and of the edges between the corresponding clusters, which significantly addresses the shortcomings of the uniform edge-node structure information aggregation arising in classical hierarchical pooling operations. Moreover, to mitigate the over-smoothing problem arising in existing GNNs, we propose a Multi-distance GNN (MD-GNN) model associated with the proposed ENADPool operation, allowing the nodes to actively and directly receive feature information from neighbors at different random walk steps. Experiments demonstrate the effectiveness of the MD-GNN associated with the proposed ENADPool.
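The hard-assignment, attention-weighted pooling step described above can be sketched in a few lines. This is a simplified illustration under our own assumptions (a single softmax attention over per-node assignment scores), not the paper's exact parameterization:

```python
import numpy as np

def enadpool_step(X, A, S_logits):
    """One hard-assignment, attention-weighted pooling step (sketch).

    - hard-assign each node to the cluster with the highest score;
    - weight nodes inside each cluster by a softmax attention over
      their scores instead of plain averaging;
    - compress edge strengths into a cluster-level adjacency.
    X: (n, d) node features, A: (n, n) adjacency,
    S_logits: (n, k) node-to-cluster scores.
    """
    n, k = S_logits.shape
    hard = np.argmax(S_logits, axis=1)   # unique cluster per node
    S = np.zeros((n, k))
    for c in range(k):
        idx = np.where(hard == c)[0]
        if idx.size == 0:
            continue
        logits = S_logits[idx, c]
        attn = np.exp(logits - logits.max())
        S[idx, c] = attn / attn.sum()    # node attention within cluster
    X_pooled = S.T @ X                   # (k, d) attention-weighted features
    A_pooled = S.T @ A @ S               # (k, k) compressed edge strengths
    return X_pooled, A_pooled
```

Because each attention vector sums to one within its cluster, pooling constant node features leaves them unchanged, unlike a raw sum.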
https://arxiv.org/abs/2405.10218
In this paper, we introduce the Polish Massive Text Embedding Benchmark (PL-MTEB), a comprehensive benchmark for text embeddings in Polish. PL-MTEB consists of 28 diverse NLP tasks from 5 task types. We adapted the tasks from datasets previously used by the Polish NLP community. In addition, we created a new PLSC (Polish Library of Science Corpus) dataset consisting of titles and abstracts of scientific publications in Polish, which was used as the basis for two novel clustering tasks. We evaluated 15 publicly available models for text embedding, including Polish and multilingual ones, and collected detailed results for individual tasks as well as aggregated results for each task type and the entire benchmark. PL-MTEB comes with open-source code at this https URL.
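The per-task-type and benchmark-wide aggregation mentioned above amounts to simple mean pooling of scores; a minimal sketch (task names and scores are illustrative, not PL-MTEB's actual results):

```python
from collections import defaultdict

def aggregate(results, task_types):
    """MTEB-style aggregation: mean score per task type plus an overall
    mean across all tasks. `results` maps task -> score, `task_types`
    maps task -> type; both are hypothetical inputs."""
    by_type = defaultdict(list)
    for task, score in results.items():
        by_type[task_types[task]].append(score)
    per_type = {t: sum(v) / len(v) for t, v in by_type.items()}
    overall = sum(results.values()) / len(results)
    return per_type, overall
```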
https://arxiv.org/abs/2405.10138
In autonomous driving, accurately interpreting the movements of other road users and leveraging this knowledge to forecast future trajectories is crucial. This is typically achieved through the integration of map data and tracked trajectories of various agents. Numerous methodologies combine this information into a singular embedding for each agent, which is then utilized to predict future behavior. However, these approaches have a notable drawback in that they may lose exact location information during the encoding process. Although the encoding still includes general map information, the generation of valid and consistent trajectories is not guaranteed, which can cause the predicted trajectories to stray from the actual lanes. This paper introduces a new refinement module designed to project the predicted trajectories back onto the actual map, rectifying these discrepancies and leading to more consistent predictions. This versatile module can be readily incorporated into a wide range of architectures. Additionally, we propose a novel scene encoder that handles all relations between agents and their environment in a single unified heterogeneous graph attention network. By analyzing the attention values on the different edges in this graph, we can gain unique insights into the neural network's inner workings, leading to more explainable predictions.
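As a rough illustration of what such a refinement module must accomplish, the sketch below snaps predicted waypoints onto the nearest point of a lane centerline polyline; the paper's module is learned, so this geometric projection is only a hypothetical stand-in:

```python
import numpy as np

def project_to_lane(traj, lane):
    """Snap each predicted waypoint to the nearest point on a lane
    centerline polyline (a simplified stand-in for a learned
    map-consistency refinement).

    traj: (T, 2) predicted waypoints, lane: (L, 2) centerline vertices.
    """
    out = []
    for p in traj:
        best, best_d = None, np.inf
        for a, b in zip(lane[:-1], lane[1:]):
            ab = b - a
            # clamp the projection parameter to stay on the segment
            t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
            q = a + t * ab
            d = np.linalg.norm(p - q)
            if d < best_d:
                best, best_d = q, d
        out.append(best)
    return np.array(out)
```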
https://arxiv.org/abs/2405.10134
We introduce LatentTimePFN (LaT-PFN), a foundational time series model with a strong embedding space that enables zero-shot forecasting. To achieve this, we perform in-context learning in latent space through a novel integration of the Prior-data Fitted Networks (PFN) and Joint Embedding Predictive Architecture (JEPA) frameworks. We leverage the JEPA framework to create a prediction-optimized latent representation of the underlying stochastic process that generates time series, and combine it with contextual learning using a PFN. Furthermore, we improve on preceding works by utilizing related time series as context and introducing an abstract time axis. This drastically reduces training time and increases the versatility of the model by allowing any time granularity and forecast horizon. We show that this results in superior zero-shot predictions compared to established baselines. We also demonstrate that our latent space produces informative embeddings of both individual time steps and fixed-length summaries of entire series. Finally, we observe the emergence of multi-step patch embeddings without explicit training, suggesting the model actively learns discrete tokens that encode local structures in the data, analogous to vision transformers.
https://arxiv.org/abs/2405.10093
The Learning-to-match (LTM) framework proves to be an effective inverse optimal transport approach for learning the underlying ground metric between two sources of data, facilitating subsequent matching. However, the conventional LTM framework faces scalability challenges, necessitating the use of the entire dataset each time the parameters of the ground metric are updated. In adapting LTM to the deep learning context, we introduce the mini-batch Learning-to-match (m-LTM) framework for audio-text retrieval problems. This framework leverages mini-batch subsampling and a Mahalanobis-enhanced family of ground metrics. Moreover, to cope with misaligned training data in practice, we propose a variant using partial optimal transport to mitigate the harm of misaligned data pairs during training. We conduct extensive experiments on audio-text matching problems using three datasets: AudioCaps, Clotho, and ESC-50. Results demonstrate that our proposed method is capable of learning a rich and expressive joint embedding space, achieving SOTA performance. Beyond this, the proposed m-LTM framework is able to close the modality gap between audio and text embeddings, surpassing both triplet and contrastive loss in the zero-shot sound event detection task on the ESC-50 dataset. Notably, our strategy of employing partial optimal transport with m-LTM demonstrates greater noise tolerance than contrastive loss, especially under varying noise ratios in training data on the AudioCaps dataset. Our code is available at this https URL
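A minimal sketch of the two ingredients named above, a Mahalanobis-enhanced ground metric and mini-batch entropic optimal transport (plain Sinkhorn with uniform marginals here; the paper's training loop and partial-OT variant are not reproduced):

```python
import numpy as np

def mahalanobis_cost(X, Y, M):
    # pairwise costs d(x, y) = (x - y)^T M (x - y); M = I recovers
    # the squared Euclidean distance
    diff = X[:, None, :] - Y[None, :, :]
    return np.einsum('ijk,kl,ijl->ij', diff, M, diff)

def sinkhorn(C, eps=0.1, iters=200):
    # entropic OT plan between uniform mini-batch marginals
    n, m = C.shape
    a, b = np.ones(n) / n, np.ones(m) / m
    K = np.exp(-C / eps)
    v = np.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]
```

In a learning loop, the matrix `M` would be the trainable part of the ground metric, updated from each mini-batch plan rather than from the full dataset.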
https://arxiv.org/abs/2405.10084
Natural language could play an important role in developing generalist surgical models by providing a broad source of supervision from raw texts. This flexible form of supervision can enable the model's transferability across datasets and tasks as natural language can be used to reference learned visual concepts or describe new ones. In this work, we present HecVL, a novel hierarchical video-language pretraining approach for building a generalist surgical model. Specifically, we construct a hierarchical video-text paired dataset by pairing the surgical lecture video with three hierarchical levels of texts: at clip-level, atomic actions using transcribed audio texts; at phase-level, conceptual text summaries; and at video-level, overall abstract text of the surgical procedure. Then, we propose a novel fine-to-coarse contrastive learning framework that learns separate embedding spaces for the three video-text hierarchies using a single model. By disentangling embedding spaces of different hierarchical levels, the learned multi-modal representations encode short-term and long-term surgical concepts in the same model. Thanks to the injected textual semantics, we demonstrate that the HecVL approach can enable zero-shot surgical phase recognition without any human annotation. Furthermore, we show that the same HecVL model for surgical phase recognition can be transferred across different surgical procedures and medical centers.
https://arxiv.org/abs/2405.10075
Open-vocabulary object detection (OvOD) has transformed detection into a language-guided task, empowering users to freely define their class vocabularies of interest during inference. However, our initial investigation indicates that existing OvOD detectors exhibit significant variability when dealing with vocabularies across various semantic granularities, posing a concern for real-world deployment. To this end, we introduce Semantic Hierarchy Nexus (SHiNe), a novel classifier that uses semantic knowledge from class hierarchies. It runs offline in three steps: i) it retrieves relevant super-/sub-categories from a hierarchy for each target class; ii) it integrates these categories into hierarchy-aware sentences; iii) it fuses these sentence embeddings to generate the nexus classifier vector. Our evaluation on various detection benchmarks demonstrates that SHiNe enhances robustness across diverse vocabulary granularities, achieving up to +31.9% mAP50 with ground truth hierarchies, while retaining improvements using hierarchies generated by large language models. Moreover, when applied to open-vocabulary classification on ImageNet-1k, SHiNe improves the CLIP zero-shot baseline by +2.8% accuracy. SHiNe is training-free and can be seamlessly integrated with any off-the-shelf OvOD detector, without incurring additional computational overhead during inference. The code is open source.
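Steps (i)-(iii) can be sketched as follows; the text encoder is a deterministic random stub and the sentence template is illustrative, standing in for the actual hierarchy-aware prompts:

```python
import hashlib
import numpy as np

def embed_text(s, dim=64):
    # stand-in for a frozen text encoder (hypothetical): deterministic
    # unit vector derived from the string's hash
    seed = int.from_bytes(hashlib.sha256(s.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def nexus_classifier(target, supers, subs, dim=64):
    # (i) super-/sub-categories are assumed already retrieved;
    # (ii) build hierarchy-aware sentences (template is illustrative);
    # (iii) fuse sentence embeddings by mean + renormalization
    sents = [f"a photo of a {target}, which is a kind of {s}" for s in supers]
    sents += [f"a photo of a {c}, which is a kind of {target}" for c in subs]
    sents.append(f"a photo of a {target}")
    E = np.stack([embed_text(s, dim) for s in sents])
    nexus = E.mean(axis=0)
    return nexus / np.linalg.norm(nexus)
```

Since everything runs offline per class, the resulting nexus vectors simply replace the per-class text embeddings of an off-the-shelf OvOD detector at no inference cost.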
https://arxiv.org/abs/2405.10053
In the new paradigm of semantic communication (SC), the focus is on delivering meanings behind bits by extracting semantic information from raw data. Recent advances in data-to-text models facilitate language-oriented SC, particularly for text-transformed image communication via image-to-text (I2T) encoding and text-to-image (T2I) decoding. However, although semantically aligned, the text is too coarse to precisely capture sophisticated visual features such as spatial locations, color, and texture, incurring a significant perceptual difference between intended and reconstructed images. To address this limitation, in this paper, we propose a novel language-oriented SC framework that communicates both text and a compressed image embedding and combines them using a latent diffusion model to reconstruct the intended image. Experimental results validate the potential of our approach, which transmits only 2.09% of the original image size while achieving higher perceptual similarities in noisy communication channels compared to a baseline SC method that communicates only through text. The code is available at this https URL.
https://arxiv.org/abs/2405.09976
Transliterating related languages that use different scripts into a common script shows effectiveness in improving crosslingual transfer in downstream tasks. However, this methodology often makes pretraining a model from scratch unavoidable, as transliteration brings about new subwords not covered in existing multilingual pretrained language models (mPLMs). This is undesirable because pretraining requires a substantial computation budget. A more promising way is to make full use of available mPLMs. To this end, this paper proposes a simple but effective framework: Transliterate-Merge-Initialize (TransMI), which can create a strong baseline well-suited for data that is transliterated into a common script by exploiting an mPLM and its accompanying tokenizer. TransMI has three stages: (a) transliterate the vocabulary of an mPLM into a common script; (b) merge the new vocabulary with the original vocabulary; and (c) initialize the embeddings of the new subwords. We applied TransMI to three recent strong mPLMs, and our experiments demonstrate that TransMI not only preserves their ability to handle non-transliterated data, but also enables the models to effectively process transliterated data: the results show a consistent improvement of 3% to 34%, varying across different models and tasks. We make our code and models publicly available at \url{this https URL}.
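A minimal sketch of the three stages, assuming the simplest initialization (a new subword copies the embedding of the source subword it was transliterated from; the actual initialization scheme may differ, and collisions between transliterations are resolved first-wins here):

```python
import numpy as np

def transmi(vocab, emb, transliterate):
    """(a) transliterate the vocabulary, (b) merge with the original,
    (c) initialize new-subword embeddings by copying the source row.

    vocab: dict token -> row index, emb: (V, d) embedding matrix,
    transliterate: token -> common-script token (a hypothetical callable).
    """
    new_vocab = dict(vocab)
    rows = [emb]
    for tok, idx in vocab.items():
        t = transliterate(tok)
        if t not in new_vocab:                  # merge only genuinely new subwords
            new_vocab[t] = len(new_vocab)
            rows.append(emb[idx:idx + 1])       # copy the source embedding
    return new_vocab, np.concatenate(rows, axis=0)
```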
https://arxiv.org/abs/2405.09913
A game's theme is an important part of its design -- it conveys narrative information and rhetorical messages, helps the player intuit strategies, aids in tutorialisation and more. Thematic elements of games are notoriously difficult for AI systems to understand and manipulate, however, and often rely on large amounts of hand-written interpretations and knowledge. In this paper we present a technique which connects game embeddings, a recent method for modelling game dynamics from log data, and word embeddings, which model semantic information about language. We explain two different approaches for using game embeddings in this way, and show evidence that game embeddings enhance the linguistic translations of game concepts from one theme to another, opening up exciting new possibilities for reasoning about the thematic elements of games in the future.
https://arxiv.org/abs/2405.09893
We introduce RoScenes, the largest multi-view roadside perception dataset, which aims to shed light on the development of vision-centric Bird's Eye View (BEV) approaches for more challenging traffic scenes. The highlights of RoScenes include a significantly large perception area, full scene coverage, and crowded traffic. More specifically, our dataset achieves a surprising 21.13M 3D annotations within 64,000 $m^2$. To relieve the expensive cost of roadside 3D labeling, we present a novel BEV-to-3D joint annotation pipeline to efficiently collect such a large volume of data. After that, we organize a comprehensive study of current BEV methods on RoScenes in terms of effectiveness and efficiency. Tested methods suffer from the vast perception area and the variation of sensor layout across scenes, resulting in performance levels falling below expectations. To this end, we propose RoBEV, which incorporates feature-guided position embedding for effective 2D-3D feature assignment. With its help, our method outperforms the state of the art by a large margin on the validation set without extra computational overhead. Our dataset and devkit will be made available at \url{this https URL}.
https://arxiv.org/abs/2405.09883
We present HyperSum, an extractive summarization framework that captures both the efficiency of traditional lexical summarization and the accuracy of contemporary neural approaches. HyperSum exploits the pseudo-orthogonality that emerges when randomly initializing vectors at extremely high dimensions ("blessing of dimensionality") to construct representative and efficient sentence embeddings. Simply clustering the obtained embeddings and extracting their medoids yields competitive summaries. HyperSum often outperforms state-of-the-art summarizers -- in terms of both summary accuracy and faithfulness -- while being 10 to 100 times faster. We open-source HyperSum as a strong baseline for unsupervised extractive summarization.
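A minimal sketch of the idea: random word hypervectors summed into sentence embeddings (nearly orthogonal at high dimension), followed by spherical k-means and medoid extraction. HyperSum's exact construction may differ:

```python
import numpy as np

def sentence_embeddings(sentences, dim=2048, seed=0):
    # each word gets a random +/-1 hypervector; a sentence embedding is the
    # normalized sum: pseudo-orthogonality keeps unrelated sums distinct
    rng = np.random.default_rng(seed)
    lexicon, E = {}, []
    for s in sentences:
        v = np.zeros(dim)
        for w in s.lower().split():
            if w not in lexicon:
                lexicon[w] = rng.choice([-1.0, 1.0], size=dim)
            v += lexicon[w]
        E.append(v / np.linalg.norm(v))
    return np.array(E)

def extract_medoids(E, k, iters=20, seed=0):
    # spherical k-means, then pick each cluster's medoid sentence index
    rng = np.random.default_rng(seed)
    C = E[rng.choice(len(E), size=k, replace=False)]
    labels = np.argmax(E @ C.T, axis=1)
    for _ in range(iters):
        labels = np.argmax(E @ C.T, axis=1)  # cosine: rows are unit norm
        for j in range(k):
            idx = np.where(labels == j)[0]
            if idx.size:
                c = E[idx].mean(axis=0)
                C[j] = c / np.linalg.norm(c)
    medoids = []
    for j in range(k):
        idx = np.where(labels == j)[0]
        if idx.size:
            medoids.append(int(idx[np.argmax(E[idx] @ C[j])]))
    return sorted(medoids)
```

The medoid indices select the summary sentences directly, so no decoder or training step is involved.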
https://arxiv.org/abs/2405.09765
While content-based image retrieval (CBIR) has been extensively studied in natural image retrieval, its application to medical images presents ongoing challenges, primarily due to the 3D nature of medical images. Recent studies have shown the potential use of pre-trained vision embeddings for CBIR in the context of radiology image retrieval. However, a benchmark for the retrieval of 3D volumetric medical images is still lacking, hindering the ability to objectively evaluate and compare the efficiency of proposed CBIR approaches in medical imaging. In this study, we extend previous work and establish a benchmark for region-based and multi-organ retrieval using the TotalSegmentator dataset (TS) with detailed multi-organ annotations. We benchmark embeddings derived from models pre-trained with supervision on medical images against embeddings derived from models pre-trained without supervision on non-medical images, for 29 coarse and 104 detailed anatomical structures at the volume and region levels. We adopt a late interaction re-ranking method inspired by text matching for image retrieval and compare it against the original method proposed for volume and region retrieval, achieving a retrieval recall of 1.0 for diverse anatomical regions with a wide size range. The findings and methodologies presented in this paper provide essential insights and benchmarks for the development and evaluation of CBIR approaches in the context of medical imaging.
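The late interaction re-ranking step borrowed from text matching can be sketched as a ColBERT-style MaxSim score; the paper's exact scoring over volumes and regions may differ:

```python
import numpy as np

def late_interaction_score(Q, D):
    """ColBERT-style late interaction: each query-side embedding is
    matched to its most similar document-side embedding and the maxima
    are summed (a generic sketch of the re-ranking step; rows are
    assumed unit-normalized so dot products are cosine similarities).

    Q: (nq, d) query embeddings, D: (nd, d) candidate embeddings.
    """
    sim = Q @ D.T                     # (nq, nd) similarity matrix
    return float(sim.max(axis=1).sum())
```

Candidates retrieved by a cheap first stage would then be re-ranked by this score.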
https://arxiv.org/abs/2405.09334
Heterogeneous, interconnected, systems-level molecular data have become increasingly available and key in precision medicine. We need to utilize them to better stratify patients into risk groups, discover new biomarkers and targets, and repurpose known and discover new drugs to personalize medical treatment. Existing methodologies are limited and a paradigm shift is needed to achieve quantitative and qualitative breakthroughs. In this perspective paper, we survey the literature and argue for the development of a comprehensive, general framework for embedding multi-scale molecular network data that would enable their explainable exploitation in precision medicine in linear time. Network embedding methods map nodes to points in a low-dimensional space, so that proximity in the learned space reflects the network's topology-function relationships. They have recently achieved unprecedented performance on hard problems of utilizing few omic data in various biomedical applications. However, research thus far has been limited to special variants of the problems and data, with performance depending on the underlying topology-function network biology hypotheses, the biomedical applications, and the evaluation metrics. The availability of multi-omic data, modern graph embedding paradigms, and compute power calls for the creation and training of efficient, explainable and controllable models, with no potentially dangerous, unexpected behaviour, that make a qualitative breakthrough. We propose to develop a general, comprehensive embedding framework for multi-omic network data, from models to efficient and scalable software implementation, and to apply it to biomedical informatics. It will lead to a paradigm shift in the computational and biomedical understanding of data and diseases that will open up ways to solving some of the major bottlenecks in precision medicine and other domains.
https://arxiv.org/abs/2405.09595
Recent years have witnessed the rapid development of short videos, which usually contain both visual and audio modalities. Background music is important to short videos, as it can significantly influence the emotions of viewers. However, at present, the background music of a short video is generally chosen by the video producer, and there is a lack of automatic music recommendation methods for short videos. This paper introduces MVBind, an innovative Music-Video embedding space Binding model for cross-modal retrieval. MVBind operates as a self-supervised approach, acquiring inherent knowledge of intermodal relationships directly from data, without the need for manual annotations. Additionally, to compensate for the lack of a corresponding music-video pair dataset for short videos, we construct a dataset, SVM-10K (Short Video with Music-10K), which mainly consists of meticulously selected short videos. On this dataset, MVBind manifests significantly improved performance compared to other baseline methods. The constructed dataset and code will be released to facilitate future research.
https://arxiv.org/abs/2405.09286
In this paper, we present the findings of our Project ALPINE which stands for ``Autoregressive Learning for Planning In NEtworks.'' Project ALPINE initiates a theoretical investigation into the development of planning capabilities in Transformer-based language models through their autoregressive learning mechanisms, aiming to identify any potential limitations in their planning abilities. We abstract planning as a network path-finding task where the objective is to generate a valid path from a specified source node to a designated target node. In terms of expressiveness, we show that the Transformer is capable of executing path-finding by embedding the adjacency and reachability matrices within its weights. Our theoretical analysis of the gradient-based learning dynamic of the Transformer reveals that the Transformer is capable of learning both the adjacency matrix and a limited form of the reachability matrix. These theoretical insights are then validated through experiments, which demonstrate that the Transformer indeed learns the adjacency matrix and an incomplete reachability matrix, which aligns with the predictions made in our theoretical analysis. Additionally, when applying our methodology to a real-world planning benchmark, called Blocksworld, our observations remain consistent. Our theoretical and empirical analyses further unveil a potential limitation of Transformer in path-finding: it cannot identify reachability relationships through transitivity, and thus would fail when path concatenation is needed to generate a path. In summary, our findings shed new light on how the internal mechanisms of autoregressive learning enable planning in networks. This study may contribute to our understanding of the general planning capabilities in other related domains.
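The transitivity limitation is easy to see concretely: one-step adjacency cannot answer multi-hop reachability, which requires a transitive closure. A small sketch:

```python
import numpy as np

def reachability(A):
    """Transitive closure of an adjacency matrix by repeated squaring:
    a tiny illustration of why one-step adjacency alone cannot answer
    multi-hop reachability (the limitation discussed above)."""
    R = (np.asarray(A) > 0).astype(int)
    for _ in range(len(R)):
        R = ((R + R @ R) > 0).astype(int)   # add all paths reachable in 2x hops
    return R
```

For a path graph 0 -> 1 -> 2, the adjacency entry (0, 2) is zero, yet node 2 is reachable from node 0; only the closure captures this.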
https://arxiv.org/abs/2405.09220
Current speaker diarization systems rely on an external voice activity detection (VAD) model prior to speaker embedding extraction on the detected speech segments. In this paper, we establish that the attention system of a speaker embedding extractor acts as a weakly supervised internal VAD model and performs equally well as or better than comparable supervised VAD systems. Subsequently, speaker diarization can be performed efficiently by extracting the VAD logits and the corresponding speaker embedding simultaneously, alleviating the need for, and computational overhead of, an external VAD model. We provide an extensive analysis of the behavior of the frame-level attention system in current speaker verification models and propose a novel speaker diarization pipeline using ECAPA2 speaker embeddings for both VAD and embedding extraction. The proposed strategy attains state-of-the-art performance on the AMI, VoxConverse and DIHARD III diarization benchmarks.
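A sketch of the dual use described above, with attentive statistics pooling: the same softmax frame weights that pool the speaker embedding can be read as internal VAD scores (the thresholding rule here is illustrative, not the paper's):

```python
import numpy as np

def attentive_stats_pool(H, w_logits):
    """Attentive statistics pooling over frame features H (T, d) with
    frame attention logits w_logits (T,). Returns a speaker embedding
    (weighted mean + std) and per-frame speech decisions derived from
    the same attention weights."""
    w = np.exp(w_logits - w_logits.max())
    w = w / w.sum()                                   # frame-level attention
    mu = (w[:, None] * H).sum(axis=0)
    var = (w[:, None] * (H - mu) ** 2).sum(axis=0)
    embedding = np.concatenate([mu, np.sqrt(np.maximum(var, 1e-9))])
    speech = w > (1.0 / len(w))                       # frames above uniform weight
    return embedding, speech
```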
https://arxiv.org/abs/2405.09142
The discovery of linear embeddings is key to the synthesis of linear control techniques for nonlinear systems. In recent years, while Koopman operator theory has become a prominent approach for learning these linear embeddings through data-driven methods, the resulting algorithms often exhibit limited generalizability beyond the distribution captured by the training data and are not robust to changes in the nominal system dynamics induced by intrinsic or environmental factors. To overcome these limitations, this study presents an adaptive Koopman architecture capable of responding to changes in system dynamics online. The proposed framework initially employs an autoencoder-based neural network that utilizes input-output information from the nominal system to learn the corresponding Koopman embedding offline. Subsequently, we augment this nominal Koopman architecture with a feed-forward neural network that learns to modify the nominal dynamics in response to any deviation between the predicted and observed lifted states, leading to improved generalization and robustness to a wide range of uncertainties and disturbances compared to contemporary methods. Extensive tracking control simulations, undertaken by integrating the proposed scheme within a Model Predictive Control framework, highlight its robustness against measurement noise, disturbances, and parametric variations in system dynamics.
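Offline, the linear operator over lifted states can be estimated by least squares (EDMD-style); a minimal stand-in for the nominal-model stage, with the autoencoder lifting itself omitted:

```python
import numpy as np

def fit_koopman(Z, Zn):
    """Least-squares (EDMD-style) estimate of a Koopman operator K with
    Zn ~= Z @ K, where rows of Z are lifted states z_t and rows of Zn
    are their successors z_{t+1}. A minimal sketch of the offline step;
    the paper learns the lifting jointly with an autoencoder."""
    K, *_ = np.linalg.lstsq(Z, Zn, rcond=None)
    return K
```

The online adaptation described above would then add a learned correction on top of predictions `z @ K` whenever they deviate from observed lifted states.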
https://arxiv.org/abs/2405.09101