Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only 8.4 and 8.2 frames used on average. These results demonstrate superior effectiveness and efficiency of our method over the current state-of-the-art methods, highlighting the potential of agent-based approaches in advancing long-form video understanding.
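As a rough illustration of the iterative loop described above (not the authors' code), the sketch below shows an LLM agent repeatedly judging whether the frames gathered so far suffice to answer the question and retrieving more frames otherwise; `caption_frames`, `llm_decide`, and `retrieve_frames` are hypothetical stubs standing in for the VLM tools and the LLM.

```python
# Minimal sketch of an iterative frame-selection loop in the spirit of VideoAgent.
# The three helper functions are hypothetical placeholders, not the paper's code.

def caption_frames(frames):
    """Placeholder for a vision-language captioner used as a tool."""
    return [f"caption for frame {i}" for i in frames]

def llm_decide(question, captions):
    """Placeholder for the LLM agent: returns (answer, confident, follow-up query)."""
    return "A", len(captions) >= 8, "what happens immediately afterwards?"

def retrieve_frames(frames, query, k=2):
    """Placeholder for text-to-frame retrieval over candidate frames."""
    return frames[:k]

def video_agent(question, video_frames, max_rounds=5):
    selected = retrieve_frames(video_frames, question, k=5)   # initial frames
    answer = None
    for _ in range(max_rounds):
        captions = caption_frames(selected)
        answer, confident, query = llm_decide(question, captions)
        if confident:                                         # information judged sufficient
            return answer, len(selected)
        remaining = [f for f in video_frames if f not in selected]
        selected += retrieve_frames(remaining, query, k=2)    # gather more evidence
    return answer, len(selected)

print(video_agent("What is the person doing?", list(range(180))))
```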
https://arxiv.org/abs/2403.10517
Gaze following and social gaze prediction are fundamental tasks providing insights into human communication behaviors, intent, and social interactions. Most previous approaches addressed these tasks separately, either by designing highly specialized social gaze models that do not generalize to other social gaze tasks or by considering social gaze inference as an ad-hoc post-processing of the gaze following task. Furthermore, the vast majority of gaze following approaches have proposed static models that can handle only one person at a time, therefore failing to take advantage of social interactions and temporal dynamics. In this paper, we address these limitations and introduce a novel framework to jointly predict the gaze target and social gaze label for all people in the scene. The framework comprises: (i) a temporal, transformer-based architecture that, in addition to image tokens, handles person-specific tokens capturing the gaze information related to each individual; (ii) a new dataset, VSGaze, that unifies annotation types across multiple gaze following and social gaze datasets. We show that our model trained on VSGaze can address all tasks jointly, and achieves state-of-the-art results for multi-person gaze following and social gaze prediction.
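A minimal sketch of the token layout described in (i), assuming arbitrary dimensions and a plain PyTorch transformer encoder rather than the authors' architecture: learnable person-specific tokens are concatenated with image tokens, and per-person predictions are read off the person tokens.

```python
import torch
import torch.nn as nn

# Illustrative sketch: image patch tokens and one learnable token per person are
# processed jointly, so each person token can attend to the scene and to the
# other people. Dimensions, heads, and head outputs are invented for the sketch.
class JointGazeEncoder(nn.Module):
    def __init__(self, dim=256, n_heads=8, n_layers=4, max_people=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.person_tokens = nn.Parameter(torch.randn(max_people, dim))
        self.gaze_head = nn.Linear(dim, 2)      # e.g., 2D gaze target coordinates
        self.social_head = nn.Linear(dim, 3)    # e.g., social gaze label logits

    def forward(self, image_tokens):            # image_tokens: (B, N_patches, dim)
        b = image_tokens.size(0)
        people = self.person_tokens.unsqueeze(0).expand(b, -1, -1)
        x = self.encoder(torch.cat([people, image_tokens], dim=1))
        person_out = x[:, : people.size(1)]      # read predictions off the person tokens
        return self.gaze_head(person_out), self.social_head(person_out)

model = JointGazeEncoder()
gaze, social = model(torch.randn(2, 196, 256))
print(gaze.shape, social.shape)  # torch.Size([2, 4, 2]) torch.Size([2, 4, 3])
```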
https://arxiv.org/abs/2403.10511
Pre-training image representations from the raw text about images enables zero-shot vision transfer to downstream tasks. Through pre-training on millions of samples collected from the internet, multimodal foundation models, such as CLIP, produce state-of-the-art zero-shot results that often reach competitiveness with fully supervised methods without the need for task-specific training. Besides the encouraging performance on classification accuracy, it is reported that these models close the robustness gap by matching the performance of supervised models trained on ImageNet under natural distribution shift. Because robustness is critical to real-world applications, especially safety-critical ones, in this paper, we present a comprehensive evaluation based on a large-scale robustness benchmark covering 7 natural, 3 synthetic distribution shifts, and 11 adversarial attacks. We use CLIP as a pilot study. We show that CLIP leads to a significant robustness drop compared to supervised ImageNet models on our benchmark, especially under synthetic distribution shift and adversarial attacks. Furthermore, data overlap analysis suggests that the observed robustness under natural distribution shifts could be attributed, at least in part, to data overlap. In summary, our evaluation shows that a comprehensive evaluation of robustness is necessary and that there is a significant need to improve the robustness of zero-shot multimodal models.
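For concreteness, a minimal zero-shot evaluation sketch of the kind such a robustness study runs, using the OpenAI `clip` package (assumed installed); the class list is illustrative and a random tensor stands in for a preprocessed image drawn from a shifted or adversarially perturbed distribution.

```python
import torch
import clip  # OpenAI CLIP package; assumes `pip install git+https://github.com/openai/CLIP.git`

# Zero-shot protocol sketch: class names become text prompts, and the predicted
# class is the one whose text embedding is closest to the image embedding.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["goldfish", "tabby cat", "school bus"]          # illustrative subset
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = torch.randn(1, 3, 224, 224, device=device)             # stand-in for preprocess(PIL image)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(class_names[probs.argmax().item()])
```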
https://arxiv.org/abs/2403.10499
Integrating Large Language Models (LLMs) and Vision-Language Models (VLMs) with robotic systems enables robots to process and understand complex natural language instructions and visual information. However, a fundamental challenge remains: for robots to fully capitalize on these advancements, they must have a deep understanding of their physical embodiment. The gap between AI models' cognitive capabilities and their understanding of physical embodiment leads to the following question: Can a robot autonomously understand and adapt to its physical form and functionalities through interaction with its environment? This question underscores the transition towards developing self-modeling robots without reliance on external sensing or pre-programmed knowledge about their structure. Here, we propose a meta self-modeling framework that can deduce robot morphology through proprioception (the internal sense of position and movement). Our study introduces a 12-DoF reconfigurable legged robot, accompanied by a diverse dataset of 200k unique configurations, to systematically investigate the relationship between robotic motion and robot morphology. Utilizing a deep neural network model comprising a robot signature encoder and a configuration decoder, we demonstrate the capability of our system to accurately predict robot configurations from proprioceptive signals. This research contributes to the field of robotic self-modeling, aiming to enhance robots' understanding of their physical embodiment and adaptability in real-world scenarios.
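A toy sketch of the signature-encoder/configuration-decoder idea, with invented sizes and a GRU standing in for whatever encoder the paper actually uses: proprioceptive trajectories are encoded into a robot signature, from which the configuration is classified.

```python
import torch
import torch.nn as nn

# Illustrative encoder-decoder: a "robot signature" is encoded from a
# proprioceptive trajectory (joint positions over time) and a decoder predicts
# the discrete configuration. All sizes are placeholders, not the paper's.
class MorphologyPredictor(nn.Module):
    def __init__(self, n_joints=12, sig_dim=128, n_configs=64):
        super().__init__()
        self.signature_encoder = nn.GRU(input_size=n_joints, hidden_size=sig_dim, batch_first=True)
        self.configuration_decoder = nn.Sequential(
            nn.Linear(sig_dim, 256), nn.ReLU(), nn.Linear(256, n_configs)
        )

    def forward(self, proprio):                  # proprio: (B, steps, n_joints)
        _, h = self.signature_encoder(proprio)   # h: (1, B, sig_dim)
        return self.configuration_decoder(h.squeeze(0))  # logits over configurations

model = MorphologyPredictor()
logits = model(torch.randn(8, 100, 12))          # batch of proprioceptive trajectories
print(logits.shape)                              # torch.Size([8, 64])
```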
https://arxiv.org/abs/2403.10496
Audiovisual emotion recognition (ER) in videos offers immense potential over unimodal approaches. It effectively leverages the inter- and intra-modal dependencies between visual and auditory modalities. This work proposes a novel audio-visual emotion recognition system utilizing a joint multimodal transformer architecture with key-based cross-attention. This framework aims to exploit the complementary nature of audio and visual cues (facial expressions and vocal patterns) in videos, leading to superior performance compared to solely relying on a single modality. The proposed model leverages separate backbones for capturing intra-modal temporal dependencies within each modality (audio and visual). Subsequently, a joint multimodal transformer architecture integrates the individual modality embeddings, enabling the model to effectively capture inter-modal (between audio and visual) and intra-modal (within each modality) relationships. Extensive evaluations on the challenging Affwild2 dataset demonstrate that the proposed model significantly outperforms baseline and state-of-the-art methods in ER tasks.
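A minimal sketch (not the authors' exact model) of cross-attention fusion between per-modality embeddings: each stream queries the other, and the pooled outputs feed a classifier; dimensions and the class count are placeholders.

```python
import torch
import torch.nn as nn

# Sketch of key-based cross-attention fusion between modality embeddings:
# visual tokens query the audio sequence and vice versa, then both streams are
# pooled for the emotion prediction. Sizes are illustrative, not the paper's.
class AVCrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, n_heads=4, n_classes=7):
        super().__init__()
        self.v_from_a = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_classes)

    def forward(self, visual, audio):             # (B, Tv, dim), (B, Ta, dim)
        v, _ = self.v_from_a(query=visual, key=audio, value=audio)    # visual attends to audio
        a, _ = self.a_from_v(query=audio, key=visual, value=visual)   # audio attends to visual
        fused = torch.cat([v.mean(dim=1), a.mean(dim=1)], dim=-1)     # temporal pooling
        return self.classifier(fused)

model = AVCrossAttentionFusion()
print(model(torch.randn(2, 16, 256), torch.randn(2, 32, 256)).shape)  # torch.Size([2, 7])
```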
https://arxiv.org/abs/2403.10488
Performance attribution analysis, defined as the process of explaining the drivers of the excess performance of an investment portfolio against a benchmark, stands as a significant aspect of portfolio management and plays a crucial role in the investment decision-making process, particularly within the fund management industry. Rooted in a solid financial and mathematical framework, the importance and methodologies of this analytical technique are extensively documented across numerous academic research papers and books. The integration of large language models (LLMs) and AI agents marks a groundbreaking development in this field. These agents are designed to automate and enhance the performance attribution analysis by accurately calculating and analyzing portfolio performances against benchmarks. In this study, we introduce the application of an AI agent to a variety of essential performance attribution tasks, including the analysis of performance drivers and the use of LLMs as a calculation engine for multi-level attribution analysis and question-answer (QA) exercises. Leveraging advanced prompt engineering techniques such as Chain-of-Thought (CoT) and Plan and Solve (PS), and employing a standard agent framework from LangChain, the research achieves promising results: accuracy rates exceeding 93% in analyzing performance drivers, 100% in multi-level attribution calculations, and over 84% accuracy in QA exercises that simulate official examination standards. These findings affirm the impactful role of AI agents, prompt engineering and evaluation in advancing portfolio management processes, highlighting a significant advancement in the practical application and evaluation of AI technologies within the domain.
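To make the attribution arithmetic concrete, here is a hand-rolled single-level Brinson-Fachler decomposition with invented sector weights and returns; it illustrates the kind of calculation such an agent automates and is not the paper's code or data.

```python
# Single-level Brinson-Fachler attribution with made-up numbers:
#   allocation effect = (wp - wb) * (rb_sector - rb_total)
#   selection effect  = wp * (rp_sector - rb_sector)
# The two effects sum exactly to the portfolio's active return over the benchmark.
portfolio = {"Tech": (0.50, 0.12), "Energy": (0.30, 0.05), "Utilities": (0.20, 0.02)}
benchmark = {"Tech": (0.40, 0.10), "Energy": (0.35, 0.06), "Utilities": (0.25, 0.01)}

rb_total = sum(w * r for w, r in benchmark.values())
rp_total = sum(w * r for w, r in portfolio.values())

for sector in portfolio:
    wp, rp = portfolio[sector]
    wb, rb = benchmark[sector]
    allocation = (wp - wb) * (rb - rb_total)
    selection = wp * (rp - rb)
    print(f"{sector:9s} allocation={allocation:+.4f} selection={selection:+.4f}")

print(f"Total active return: {rp_total - rb_total:+.4f}")
```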
https://arxiv.org/abs/2403.10482
Enhancing the robustness of deep learning models, particularly in the realm of vision transformers (ViTs), is crucial for their real-world deployment. In this work, we provide a fine-tuning approach to enhance the robustness of vision transformers inspired by the concept of nullspace from linear algebra. Our investigation centers on whether a vision transformer can exhibit resilience to input variations akin to the nullspace property in linear mappings, implying that perturbations sampled from this nullspace do not influence the model's output when added to the input. Firstly, we show that for many pretrained ViTs, a non-trivial nullspace exists due to the presence of the patch embedding layer. Secondly, as nullspace is a concept associated with linear algebra, we demonstrate that it is possible to synthesize approximate nullspace elements for the non-linear blocks of ViTs employing an optimisation strategy. Finally, we propose a fine-tuning strategy for ViTs wherein we augment the training data with synthesized approximate nullspace noise. After fine-tuning, we find that the model demonstrates robustness to adversarial and natural image perturbations alike.
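The nullspace observation for the patch embedding layer can be illustrated directly with linear algebra; the matrix below is a random stand-in for a pretrained projection, and the dimensions are illustrative.

```python
import torch

# A ViT patch embedding projects each flattened patch (e.g., 32*32*3 = 3072 values)
# to a lower embedding dimension (e.g., 768), so the projection has a non-trivial
# nullspace; perturbations drawn from it leave the patch embeddings unchanged.
torch.manual_seed(0)
patch_dim, embed_dim = 32 * 32 * 3, 768
W = torch.randn(embed_dim, patch_dim, dtype=torch.float64)   # stand-in for a pretrained projection

U, S, Vh = torch.linalg.svd(W, full_matrices=True)
null_basis = Vh[embed_dim:]                                  # rows spanning the nullspace of W

patch = torch.randn(patch_dim, dtype=torch.float64)
noise = torch.randn(null_basis.size(0), dtype=torch.float64) @ null_basis

print(torch.allclose(W @ patch, W @ (patch + noise), atol=1e-6))  # True: embedding unchanged
```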
https://arxiv.org/abs/2403.10476
Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos. In this paper, we introduce a hybrid SLT approach, Spotter+GPT, that utilizes a sign spotter and a pretrained large language model to improve SLT performance. Our method builds upon the strengths of both components. The videos are first processed by the spotter, which is trained on a linguistic sign language dataset, to identify individual signs. These spotted signs are then passed to the powerful language model, which transforms them into coherent and contextually appropriate spoken language sentences.
https://arxiv.org/abs/2403.10434
Image segmentation is one of the most fundamental problems in computer vision and has drawn a lot of attention due to its vast applications in image understanding and autonomous driving. However, designing effective and efficient segmentation neural architectures is a labor-intensive process that may require many trials by human experts. In this paper, we address the challenge of integrating multi-head self-attention into high resolution representation CNNs efficiently, by leveraging architecture search. Manually replacing convolution layers with multi-head self-attention is non-trivial due to the costly overhead in memory to maintain high resolution. By contrast, we develop a multi-target multi-branch supernet method, which not only fully utilizes the advantages of high-resolution features, but also finds the proper location for placing the multi-head self-attention module. Our search algorithm is optimized towards multiple objectives (e.g., latency and mIoU) and capable of finding architectures on the Pareto frontier with an arbitrary number of branches in a single search. We further present a series of models via the Hybrid Convolutional-Transformer Architecture Search (HyCTAS) method, which searches for the best hybrid combination of lightweight convolution layers and memory-efficient self-attention layers across branches from different resolutions and fuses them to high resolution for both efficiency and effectiveness. Extensive experiments demonstrate that HyCTAS outperforms previous methods on the semantic segmentation task. Code and models are available at \url{this https URL}.
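A toy sketch of the multi-objective selection step: keeping the candidate architectures that lie on the Pareto frontier of latency versus mIoU; the candidate list and numbers are invented, and a real search would score sampled sub-networks of the supernet instead.

```python
# Pareto-frontier filtering over two objectives: minimize latency, maximize mIoU.
candidates = [
    {"name": "arch_a", "latency_ms": 18.0, "miou": 76.1},
    {"name": "arch_b", "latency_ms": 25.0, "miou": 78.4},
    {"name": "arch_c", "latency_ms": 30.0, "miou": 77.9},   # dominated by arch_b
    {"name": "arch_d", "latency_ms": 12.0, "miou": 73.0},
]

def dominates(a, b):
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    no_worse = a["latency_ms"] <= b["latency_ms"] and a["miou"] >= b["miou"]
    better = a["latency_ms"] < b["latency_ms"] or a["miou"] > b["miou"]
    return no_worse and better

pareto = [c for c in candidates if not any(dominates(o, c) for o in candidates)]
print([c["name"] for c in pareto])   # ['arch_a', 'arch_b', 'arch_d']
```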
https://arxiv.org/abs/2403.10413
We introduce EXAMS-V, a new challenging multi-discipline multimodal multilingual exam benchmark for evaluating vision language models. It consists of 20,932 multiple-choice questions across 20 school disciplines covering natural science, social science, and other miscellaneous studies, e.g., religion, fine arts, business, etc. EXAMS-V includes a variety of multimodal features such as text, images, tables, figures, diagrams, maps, scientific symbols, and equations. The questions come in 11 languages from 7 language families. Unlike existing benchmarks, EXAMS-V is uniquely curated by gathering school exam questions from various countries, with a variety of education systems. This distinctive approach calls for intricate reasoning across diverse languages and relies on region-specific knowledge. Solving the problems in the dataset requires advanced perception and joint reasoning over the text and the visual content of the image. Our evaluation results demonstrate that this is a challenging dataset, which is difficult even for advanced vision-text models such as GPT-4V and Gemini; this underscores the inherent complexity of the dataset and its significance as a future benchmark.
https://arxiv.org/abs/2403.10378
Leveraging Transformer attention has led to great advancements in HDR deghosting. However, the intricate nature of self-attention introduces practical challenges, as existing state-of-the-art methods often demand high-end GPUs or exhibit slow inference speeds, especially for high-resolution images like 2K. Striking an optimal balance between performance and latency remains a critical concern. In response, this work presents PASTA, a novel Progressively Aggregated Spatio-Temporal Alignment framework for HDR deghosting. Our approach achieves effectiveness and efficiency by harnessing hierarchical representation during feature disentanglement. Through the utilization of diverse granularities within the hierarchical structure, our method substantially boosts computational speed and optimizes the HDR imaging workflow. In addition, we explore within-scale feature modeling with local and global attention, gradually merging and refining them in a coarse-to-fine fashion. Experimental results showcase PASTA's superiority over current SOTA methods in both visual quality and performance metrics, accompanied by a substantial 3-fold (x3) increase in inference speed.
https://arxiv.org/abs/2403.10376
Humans can learn a new word and infer its grammatical properties from very few examples. They have an abstract notion of linguistic properties like grammatical gender and agreement rules that can be applied to novel syntactic contexts and words. Drawing inspiration from psycholinguistics, we conduct a noun learning experiment to assess whether an LSTM and a decoder-only transformer can achieve human-like abstraction of grammatical gender in French. Language models were tasked with learning the gender of a novel noun embedding from a few examples in one grammatical agreement context and predicting agreement in another, unseen context. We find that both language models effectively generalise novel noun gender from one to two learning examples and apply the learnt gender across agreement contexts, albeit with a bias for the masculine gender category. Importantly, the few-shot updates were only applied to the embedding layers, demonstrating that models encode sufficient gender information within the word embedding space. While the generalisation behaviour of models suggests that they represent grammatical gender as an abstract category, like humans, further work is needed to explore the details of how exactly this is implemented. For a comparative perspective with human behaviour, we conducted an analogous one-shot novel noun gender learning experiment, which revealed that native French speakers, like language models, also exhibited a masculine gender bias and are not excellent one-shot learners either.
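A minimal sketch of the embedding-only few-shot update protocol, assuming a tiny placeholder LSTM language model and invented token ids rather than the paper's setup: all weights except the embedding layer are frozen, so whatever the model learns about the novel noun's gender must be stored in embedding space.

```python
import torch
import torch.nn as nn

# All parameters except the word embeddings are frozen; a few gradient steps on a
# single agreement example can then only modify the embedding space.
vocab_size, dim, novel_id = 100, 32, 99
embed = nn.Embedding(vocab_size, dim)
lstm = nn.LSTM(dim, dim, batch_first=True)
head = nn.Linear(dim, vocab_size)

for p in list(lstm.parameters()) + list(head.parameters()):
    p.requires_grad = False                          # freeze everything except embeddings
optimizer = torch.optim.SGD(embed.parameters(), lr=0.1)

context = torch.tensor([[3, novel_id, 7, 12]])       # agreement context containing the novel noun
target = torch.tensor([5])                           # id of the gender-agreeing continuation

for _ in range(10):                                  # a few gradient steps on one example
    hidden, _ = lstm(embed(context))
    loss = nn.functional.cross_entropy(head(hidden[:, -1]), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"loss after embedding-only updates: {loss.item():.3f}")
```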
https://arxiv.org/abs/2403.10338
Transformers have demonstrated their effectiveness in image restoration tasks. Existing Transformer architectures typically comprise two essential components: multi-head self-attention and feed-forward network (FFN). The former captures long-range pixel dependencies, while the latter enables the model to learn complex patterns and relationships in the data. Previous studies have demonstrated that FFNs are key-value memories \cite{geva2020transformer}, which are vital in modern Transformer architectures. In this paper, we conduct an empirical study to explore the potential of attention mechanisms without using FFN and provide novel structures to demonstrate that removing FFN is flexible for image restoration. Specifically, we propose Continuous Scaling Attention (\textbf{CSAttn}), a method that computes attention continuously in three stages without using FFN. To achieve competitive performance, we propose a series of key components within the attention. Our designs provide a closer look at the attention mechanism and reveal that some simple operations can significantly affect the model performance. We apply our \textbf{CSAttn} to several image restoration tasks and show that our model can outperform CNN-based and Transformer-based image restoration approaches.
https://arxiv.org/abs/2403.10336
In scientific research and its application, scientific literature analysis is crucial as it allows researchers to build on the work of others. However, the fast growth of scientific knowledge has led to a massive increase in scholarly articles, making in-depth literature analysis increasingly challenging and time-consuming. The emergence of Large Language Models (LLMs) has offered a new way to address this challenge. Known for their strong abilities in summarizing texts, LLMs are seen as a potential tool to improve the analysis of scientific literature. However, existing LLMs have their own limits. Scientific literature often includes a wide range of multimodal elements, such as molecular structure, tables, and charts, which are hard for text-focused LLMs to understand and analyze. This issue points to the urgent need for new solutions that can fully understand and analyze multimodal content in scientific literature. To answer this demand, we present Uni-SMART (Universal Science Multimodal Analysis and Research Transformer), an innovative model designed for in-depth understanding of multimodal scientific literature. Through rigorous quantitative evaluation across several domains, Uni-SMART demonstrates superior performance over leading text-focused LLMs. Furthermore, our exploration extends to practical applications, including patent infringement detection and nuanced analysis of charts. These applications not only highlight Uni-SMART's adaptability but also its potential to revolutionize how we interact with scientific literature.
https://arxiv.org/abs/2403.10301
Time-series data in real-world medical settings typically exhibit long-range dependencies and are observed at non-uniform intervals. In such contexts, traditional sequence-based recurrent models struggle. To overcome this, researchers replace recurrent architectures with Neural ODE-based models to model irregularly sampled data and use Transformer-based architectures to account for long-range dependencies. Despite the success of these two approaches, both incur very high computational costs for input sequences of moderate length or greater. To mitigate this, we introduce the Rough Transformer, a variation of the Transformer model which operates on continuous-time representations of input sequences and incurs significantly reduced computational costs, critical for addressing long-range dependencies common in medical contexts. In particular, we propose multi-view signature attention, which uses path signatures to augment vanilla attention and to capture both local and global dependencies in input data, while remaining robust to changes in the sequence length and sampling frequency. We find that Rough Transformers consistently outperform their vanilla attention counterparts while obtaining the benefits of Neural ODE-based models using a fraction of the computational time and memory resources on synthetic and real-world time-series tasks.
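A sketch of the signature features underlying multi-view signature attention, assuming a hand-rolled depth-2 truncated signature rather than the authors' implementation: the same fixed-size summary is computed over the full path (a global view) and over local windows (local views), independently of sequence length.

```python
import numpy as np

# Depth-2 truncated path signature of a piecewise-linear path of shape (T, d):
# level 1 holds the total increment per channel, level 2 the iterated integrals.
def depth2_signature(path):
    dx = np.diff(path, axis=0)                           # increments, shape (T-1, d)
    level1 = dx.sum(axis=0)                              # total increment per channel
    prefix = np.vstack([np.zeros(path.shape[1]), np.cumsum(dx, axis=0)[:-1]])
    level2 = prefix.T @ dx + 0.5 * dx.T @ dx             # iterated integrals S^{i,j}
    return np.concatenate([level1, level2.ravel()])

rng = np.random.default_rng(0)
series = rng.standard_normal((500, 3)).cumsum(axis=0)    # a multivariate path

global_view = depth2_signature(series)                                   # whole path
local_views = [depth2_signature(series[s:s + 100]) for s in range(0, 400, 100)]  # windows
print(global_view.shape, len(local_views), local_views[0].shape)         # (12,) 4 (12,)
```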
https://arxiv.org/abs/2403.10288
The task of few-shot image classification and segmentation (FS-CS) involves classifying and segmenting target objects in a query image, given only a few examples of the target classes. We introduce the Vision-Instructed Segmentation and Evaluation (VISE) method that transforms the FS-CS problem into the Visual Question Answering (VQA) problem, utilising Vision-Language Models (VLMs), and addresses it in a training-free manner. By enabling a VLM to interact with off-the-shelf vision models as tools, the proposed method is capable of classifying and segmenting target objects using only image-level labels. Specifically, chain-of-thought prompting and in-context learning guide the VLM to answer multiple-choice questions like a human; vision models such as YOLO and Segment Anything Model (SAM) assist the VLM in completing the task. The modular framework of the proposed method makes it easily extendable. Our approach achieves state-of-the-art performance on the Pascal-5i and COCO-20i datasets.
https://arxiv.org/abs/2403.10287
The explanations of large language models have recently been shown to be sensitive to the randomness used for their training, creating a need to characterize this sensitivity. In this paper, we propose a characterization that questions the possibility of providing simple and informative explanations for such models. To this end, we give statistical definitions for the explanations' signal, noise and signal-to-noise ratio. We highlight that, in a typical case study where word-level univariate explanations are analyzed with first-order statistical tools, the explanations of simple feature-based models carry more signal and less noise than those of transformer ones. We then discuss the possibility of improving these results with alternative definitions of signal and noise that would capture more complex explanations and analysis methods, while also questioning the tradeoff with their plausibility for readers.
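One plausible instantiation of such definitions, not necessarily the paper's: word-level attribution scores collected from models trained with different random seeds, with the squared per-word mean treated as signal and the across-seed variance as noise.

```python
import numpy as np

# Illustrative (assumed) definitions over synthetic attributions: signal is the
# squared mean attribution per word across seeds, noise is the across-seed variance.
rng = np.random.default_rng(0)
n_seeds, n_words = 10, 6
true_importance = np.array([0.8, 0.1, 0.0, 0.5, 0.0, 0.2])     # synthetic ground pattern
explanations = true_importance + 0.3 * rng.standard_normal((n_seeds, n_words))

signal = explanations.mean(axis=0) ** 2          # squared mean attribution per word
noise = explanations.var(axis=0)                 # variance across training seeds
snr = signal.sum() / noise.sum()

print("per-word signal:", np.round(signal, 3))
print("per-word noise: ", np.round(noise, 3))
print("signal-to-noise ratio:", round(float(snr), 2))
```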
https://arxiv.org/abs/2403.10275
Single-modal object re-identification (ReID) faces great challenges in maintaining robustness within complex visual scenarios. In contrast, multi-modal object ReID utilizes complementary information from diverse modalities, showing great potential for practical applications. However, previous methods may be easily affected by irrelevant backgrounds and usually ignore the modality gaps. To address the above issues, we propose a novel learning framework named \textbf{EDITOR} to select diverse tokens from vision Transformers for multi-modal object ReID. We begin with a shared vision Transformer to extract tokenized features from different input modalities. Then, we introduce a Spatial-Frequency Token Selection (SFTS) module to adaptively select object-centric tokens with both spatial and frequency information. Afterwards, we employ a Hierarchical Masked Aggregation (HMA) module to facilitate feature interactions within and across modalities. Finally, to further reduce the effect of backgrounds, we propose a Background Consistency Constraint (BCC) and an Object-Centric Feature Refinement (OCFR). They are formulated as two new loss functions, which improve the feature discrimination with background suppression. As a result, our framework can generate more discriminative features for multi-modal object ReID. Extensive experiments on three multi-modal ReID benchmarks verify the effectiveness of our methods. The code is available at this https URL.
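A toy sketch of spatial-frequency token scoring and top-k selection; the feature-norm and FFT-energy scores, the patch size, and k are invented for illustration and are not the paper's SFTS module.

```python
import torch

# Tokens are scored by their feature norm (a crude spatial-saliency proxy) and by
# the high-frequency energy of the corresponding image patch (an FFT-based proxy);
# only the top-k scoring tokens are kept as "object-centric" tokens.
torch.manual_seed(0)
B, N, D, patch = 2, 196, 768, 16
tokens = torch.randn(B, N, D)                       # ViT patch tokens
patches = torch.randn(B, N, patch, patch)           # grayscale pixels of each patch

spatial_score = tokens.norm(dim=-1)                 # (B, N)

freq_mag = torch.fft.fft2(patches).abs()            # per-patch 2D spectrum magnitude
freq_mag[..., 0, 0] = 0                             # drop the DC component
frequency_score = freq_mag.flatten(2).mean(dim=-1)  # non-DC energy proxy, (B, N)

score = spatial_score / spatial_score.sum(-1, keepdim=True) \
      + frequency_score / frequency_score.sum(-1, keepdim=True)
topk = score.topk(k=64, dim=-1).indices             # keep 64 object-centric tokens
selected = torch.gather(tokens, 1, topk.unsqueeze(-1).expand(-1, -1, D))
print(selected.shape)                               # torch.Size([2, 64, 768])
```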
https://arxiv.org/abs/2403.10254
This paper explores the problem of continual learning (CL) of vision-language models (VLMs) in open domains, where the models need to perform continual updating and inference on a streaming of datasets from diverse seen and unseen domains with novel classes. Such a capability is crucial for various applications in open environments, e.g., AI assistants, autonomous driving systems, and robotics. Current CL studies mostly focus on closed-set scenarios in a single domain with known classes. Large pre-trained VLMs like CLIP have demonstrated superior zero-shot recognition ability, and a number of recent studies leverage this ability to mitigate catastrophic forgetting in CL, but they focus on closed-set CL in a single domain dataset. Open-domain CL of large VLMs is significantly more challenging due to 1) large class correlations and domain gaps across the datasets and 2) the forgetting of zero-shot knowledge in the pre-trained VLMs in addition to the knowledge learned from the newly adapted datasets. In this work we introduce a novel approach, termed CoLeCLIP, that learns an open-domain CL model based on CLIP. It addresses these challenges by a joint learning of a set of task prompts and a cross-domain class vocabulary. Extensive experiments on 11 domain datasets show that CoLeCLIP outperforms state-of-the-art methods for open-domain CL under both task- and class-incremental learning settings.
https://arxiv.org/abs/2403.10245
With the development of astronomical facilities, large-scale time series data observed by these facilities is being collected. Analyzing anomalies in these astronomical observations is crucial for uncovering potential celestial events and physical phenomena, thus advancing the scientific research process. However, existing time series anomaly detection methods fall short in tackling the unique characteristics of astronomical observations, where each star is inherently independent but is interfered with by random concurrent noise, resulting in a high rate of false alarms. To overcome the challenges, we propose AERO, a novel two-stage framework tailored for unsupervised anomaly detection in astronomical observations. In the first stage, we employ a Transformer-based encoder-decoder architecture to learn the normal temporal patterns on each variate (i.e., star) in alignment with the characteristic of variate independence. In the second stage, we enhance the graph neural network with window-wise graph structure learning to tackle the occurrence of concurrent noise characterized by spatial and temporal randomness. In this way, AERO is not only capable of distinguishing normal temporal patterns from potential anomalies but also effectively differentiating concurrent noise, thus decreasing the number of false alarms. We conducted extensive experiments on three synthetic datasets and three real-world datasets. The results demonstrate that AERO outperforms the compared baselines. Notably, compared to the state-of-the-art model, AERO improves the F1-score by up to 8.76% and 2.63% on synthetic and real-world datasets respectively.
https://arxiv.org/abs/2403.10220