What can be learned about causality and experimentation from passive data? This question is salient given recent successes of passively-trained language models in interactive domains such as tool use. Passive learning is inherently limited. However, we show that purely passive learning can in fact allow an agent to learn generalizable strategies for determining and using causal structures, as long as the agent can intervene at test time. We formally illustrate that learning a strategy of first experimenting, then seeking goals, can allow generalization from passive learning in principle. We then show empirically that agents trained via imitation on expert data can indeed generalize at test time to infer and use causal links which are never present in the training data; these agents can also generalize experimentation strategies to novel variable sets never observed in training. We then show that strategies for causal intervention and exploitation can be generalized from passive data even in a more complex environment with high-dimensional observations, with the support of natural language explanations. Explanations can even allow passive learners to generalize out-of-distribution from perfectly-confounded training data. Finally, we show that language models, trained only on passive next-word prediction, can generalize causal intervention strategies from a few-shot prompt containing examples of experimentation, together with explanations and reasoning. These results highlight the surprising power of passive learning of active causal strategies, and may help to understand the behaviors and capabilities of language models.
What can be learned about causality and experimentation from passive data? This question has become salient given the recent successes of passively trained language models in interactive domains such as tool use. Passive learning is inherently limited, but we show that purely passive learning can in fact allow an agent to learn generalizable strategies for determining and using causal structure, as long as the agent can intervene at test time. We formally show that a strategy of first experimenting and then pursuing goals can, in principle, generalize from passive learning. We then show empirically that agents trained by imitating expert data can generalize at test time, inferring and using causal links that never appear in the training data; these agents can also generalize their experimentation strategies to novel sets of variables never observed during training. We further show that strategies for causal intervention and exploitation can be generalized from passive data even in a more complex environment with high-dimensional observations, with the support of natural-language explanations. Explanations can even allow passive learners to generalize out-of-distribution from perfectly confounded training data. Finally, we show that language models trained only on passive next-word prediction can generalize causal intervention strategies from a few-shot prompt containing examples of experimentation together with explanations and reasoning. These results highlight the surprising power of passive learning of active causal strategies and may help in understanding the behaviors and capabilities of language models.
https://arxiv.org/abs/2305.16183
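As a concrete illustration of the "experiment first, then pursue the goal" strategy discussed in the abstract above, here is a toy sketch in a hypothetical causal-bandit environment. The environment, variable names, and sample sizes are all illustrative and are not the paper's actual setup.

```python
import random

# Toy causal environment: intervening on "lever" i shifts the outcome variable
# by an unknown weight w[i]; the agent only sees noisy outcomes, never w itself.
random.seed(0)
TRUE_WEIGHTS = [0.1, 1.5, -0.7]          # hidden causal strengths (illustrative)

def intervene(lever: int) -> float:
    """Set (do-intervene on) one lever and observe the noisy outcome."""
    return TRUE_WEIGHTS[lever] + random.gauss(0.0, 0.1)

# Phase 1: experiment -- intervene on each variable a few times to estimate
# its causal effect on the outcome.
estimates = []
for lever in range(len(TRUE_WEIGHTS)):
    outcomes = [intervene(lever) for _ in range(10)]
    estimates.append(sum(outcomes) / len(outcomes))

# Phase 2: exploit -- pursue the goal (maximize the outcome) by acting on the
# variable whose estimated causal effect is largest.
best_lever = max(range(len(estimates)), key=lambda i: estimates[i])
print(f"estimated effects: {estimates}")
print(f"exploiting lever {best_lever} with estimated effect {estimates[best_lever]:.2f}")
```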
Argument summarisation is a promising but currently under-explored field. Recent work has aimed to provide textual summaries in the form of concise and salient short texts, i.e., key points (KPs), in a task known as Key Point Analysis (KPA). One of the main challenges in KPA is finding high-quality key point candidates from dozens of arguments even in a small corpus. Furthermore, evaluating key points is crucial in ensuring that the automatically generated summaries are useful. Although automatic methods for evaluating summarisation have considerably advanced over the years, they mainly focus on sentence-level comparison, making it difficult to measure the quality of a summary (a set of KPs) as a whole. Aggravating this problem is the fact that human evaluation is costly and unreproducible. To address the above issues, we propose a two-step abstractive summarisation framework based on neural topic modelling with an iterative clustering procedure, to generate key points which are aligned with how humans identify key points. Our experiments show that our framework advances the state of the art in KPA, with performance improvement of up to 14 (absolute) percentage points, in terms of both ROUGE and our own proposed evaluation metrics. Furthermore, we evaluate the generated summaries using a novel set-based evaluation toolkit. Our quantitative analysis demonstrates the effectiveness of our proposed evaluation metrics in assessing the quality of generated KPs. Human evaluation further demonstrates the advantages of our approach and validates that our proposed evaluation metric is more consistent with human judgment than ROUGE scores.
Argument summarisation is a promising but currently under-explored field. Recent work has aimed to provide concise and salient short texts, known as key points (KPs), in a task called Key Point Analysis (KPA). One of the main challenges in KPA is finding high-quality key point candidates from dozens of arguments, even in a small corpus. Furthermore, evaluating key points is crucial for ensuring that the automatically generated summaries are useful. Although automatic methods for evaluating summarisation have advanced considerably in recent years, they mainly focus on sentence-level comparison, which makes it difficult to measure the quality of a summary (a set of KPs) as a whole. Aggravating this problem, human evaluation is costly and not reproducible. To address these issues, we propose a two-step abstractive summarisation framework based on neural topic modelling with an iterative clustering procedure, generating key points that are aligned with how humans identify them. Our experiments show that the framework advances the state of the art in KPA, with performance improvements of up to 14 absolute percentage points in terms of both ROUGE and our own proposed evaluation metrics. We further evaluate the generated summaries with a novel set-based evaluation toolkit. Our quantitative analysis demonstrates the effectiveness of the proposed metrics in assessing the quality of generated KPs, and human evaluation confirms the advantages of our approach and shows that our proposed metric is more consistent with human judgment than ROUGE scores.
https://arxiv.org/abs/2305.16000
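The abstract above mentions a set-based evaluation toolkit for judging a summary (a set of KPs) as a whole, but does not spell out the metric. The sketch below shows one generic form such a set-level score could take: soft coverage of the reference key points by the generated set, using TF-IDF cosine similarity. The function name, similarity choice, and example texts are assumptions, not the authors' exact formulation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def set_coverage(generated_kps, reference_kps):
    """Soft set-level recall: how well the generated KP set covers the reference set."""
    vec = TfidfVectorizer().fit(generated_kps + reference_kps)
    gen, ref = vec.transform(generated_kps), vec.transform(reference_kps)
    sims = cosine_similarity(ref, gen)            # |ref| x |gen| similarity matrix
    # Each reference KP is credited with its best-matching generated KP.
    return float(sims.max(axis=1).mean())

generated = ["Vaccines are safe and effective",
             "Mandates infringe on personal freedom"]
reference = ["Vaccination is effective at preventing disease",
             "Mandatory vaccination limits individual liberty",
             "Side effects are rare"]
print(f"set coverage: {set_coverage(generated, reference):.2f}")
```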
The integration of multi-document pre-training objectives into language models has resulted in remarkable improvements in multi-document downstream tasks. In this work, we propose extending this idea by pre-training a generic multi-document model from a novel cross-document question answering pre-training objective. To that end, given a set (or cluster) of topically-related documents, we systematically generate semantically-oriented questions from a salient sentence in one document and challenge the model, during pre-training, to answer these questions while "peeking" into other topically-related documents. In a similar manner, the model is also challenged to recover the sentence from which the question was generated, again while leveraging cross-document information. This novel multi-document QA formulation directs the model to better recover cross-text informational relations, and introduces a natural augmentation that artificially increases the pre-training data. Further, unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation (e.g., QA) and long text generation (e.g., summarization). Following this scheme, we pre-train our model -- termed QAmden -- and evaluate its performance across several multi-document tasks, including multi-document QA, summarization, and query-focused summarization, yielding improvements of up to 7% and significantly outperforming zero-shot GPT-3.5 and GPT-4.
Integrating multi-document pre-training objectives into language models has led to substantial improvements on multi-document downstream tasks. In this work, we extend this idea by pre-training a generic multi-document model with a novel cross-document question-answering pre-training objective. Given a set (or cluster) of topically related documents, we systematically generate semantically oriented questions from a salient sentence in one document and, during pre-training, challenge the model to answer these questions while "peeking" into the other topically related documents. The model is similarly challenged to recover the sentence from which each question was generated, again by leveraging cross-document information. This multi-document QA formulation directs the model to better recover cross-text informational relations and introduces a natural augmentation that artificially enlarges the pre-training data. Moreover, unlike prior multi-document models that focus on either classification or summarization, our pre-training formulation enables the model to perform tasks involving both short text generation (e.g., QA) and long text generation (e.g., summarization). Following this scheme, we pre-train our model, termed QAmden, and evaluate it on several multi-document tasks, including multi-document QA, summarization, and query-focused summarization, obtaining improvements of up to 7% and significantly outperforming zero-shot GPT-3.5 and GPT-4.
https://arxiv.org/abs/2305.15387
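To make the pre-training objective above more concrete, here is a sketch of how one cross-document QA instance could be assembled: a question about a salient sentence in one document is paired with the remaining documents of the cluster as "peeking" context, and the target asks the model to recover the originating sentence. The salience detection and question generation steps are represented only by pre-supplied arguments, and the input/target formats are illustrative, not QAmden's actual templates.

```python
from typing import List, Tuple

def make_cross_doc_qa_instance(cluster: List[str],
                               source_idx: int,
                               salient_sentence: str,
                               question: str) -> Tuple[str, str]:
    """Build one (input, target) pre-training pair in the spirit described above.

    `salient_sentence` and `question` are assumed to come from upstream salience
    detection and question generation components not reproduced here.
    """
    # "Peek" into the other topically-related documents in the cluster.
    peek_docs = [d for i, d in enumerate(cluster) if i != source_idx]
    model_input = (f"question: {question} "
                   f"context: {' <doc> '.join(peek_docs)}")
    # The model must both answer and recover the originating sentence.
    target = f"source sentence: {salient_sentence}"
    return model_input, target

cluster = ["Doc A reports the merger was announced on Monday ...",
           "Doc B says regulators will review the Monday merger announcement ...",
           "Doc C covers market reaction to the announcement ..."]
inp, tgt = make_cross_doc_qa_instance(
    cluster, source_idx=0,
    salient_sentence="The merger was announced on Monday.",
    question="When was the merger announced?")
print(inp)
print(tgt)
```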
Burn injuries can result from mechanisms such as thermal, chemical, and electrical insults. A prompt and accurate assessment of burns is essential for deciding definitive clinical treatments. Currently, the primary approach for burn assessments, via visual and tactile observations, is approximately 60%-80% accurate. The gold standard is biopsy and a close second would be non-invasive methods like Laser Doppler Imaging (LDI) assessments, which have up to 97% accuracy in predicting burn severity and the required healing time. In this paper, we introduce a machine learning pipeline for assessing burn severities and segmenting the regions of skin that are affected by burn. Segmenting 2D colour images of burns allows for the injured versus non-injured skin to be delineated, clearly marking the extent and boundaries of the localized burn/region-of-interest, even during remote monitoring of a burn patient. We trained a convolutional neural network (CNN) to classify four severities of burns. We built a saliency mapping method, Boundary Attention Mapping (BAM), that utilises this trained CNN for the purpose of accurately localizing and segmenting the burn regions from skin burn images. We demonstrated the effectiveness of our proposed pipeline through extensive experiments and evaluations using two datasets: 1) a larger skin burn image dataset consisting of 1684 skin burn images of four burn severities, and 2) an LDI dataset that consists of a total of 184 skin burn images with their associated LDI scans. The CNN trained using the first dataset achieved an average F1-Score of 78% and micro/macro-average ROC of 85% in classifying the four burn severities. Moreover, a comparison between the BAM results and LDI results for measuring injury boundary showed that the segmentations generated by our method achieved 91.60% accuracy, 78.17% sensitivity, and 93.37% specificity.
Burn injuries can result from thermal, chemical, and electrical insults. Prompt and accurate burn assessment is essential for deciding on definitive clinical treatment. Currently, the primary approach, visual and tactile observation, is only about 60%-80% accurate. The gold standard is biopsy, and a close second is non-invasive methods such as Laser Doppler Imaging (LDI), which can predict burn severity and the required healing time with up to 97% accuracy. In this paper, we introduce a machine learning pipeline for assessing burn severity and segmenting the skin regions affected by burns. Segmenting 2D colour images of burns delineates injured from non-injured skin and clearly marks the extent and boundaries of the localized burn region of interest, even during remote monitoring of a burn patient. We trained a convolutional neural network (CNN) to classify four burn severities and built a saliency mapping method, Boundary Attention Mapping (BAM), that uses this trained CNN to accurately localize and segment burn regions in skin burn images. We demonstrate the effectiveness of the pipeline through extensive experiments on two datasets: 1) a larger skin burn image dataset of 1684 images covering four burn severities, and 2) an LDI dataset of 184 skin burn images with their associated LDI scans. The CNN trained on the first dataset achieved an average F1-score of 78% and a micro/macro-average ROC of 85% in classifying the four severities. Moreover, comparing BAM against LDI for measuring injury boundaries showed that the segmentations generated by our method achieved 91.60% accuracy, 78.17% sensitivity, and 93.37% specificity.
https://arxiv.org/abs/2305.15365
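BAM itself is not specified in the abstract beyond the fact that it builds on the trained classifier, so the sketch below only shows the generic ingredient such a method relies on: turning a CNN classifier's input-gradient saliency into a rough binary mask by thresholding. The toy model, image size, and threshold are all illustrative assumptions and do not reproduce BAM.

```python
import torch
import torch.nn as nn

# Toy stand-in for the trained burn-severity classifier (4 classes).
cnn = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 4))

def saliency_mask(image: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Gradient saliency of the predicted class, thresholded into a rough mask."""
    image = image.clone().requires_grad_(True)
    logits = cnn(image.unsqueeze(0))
    logits[0, logits.argmax()].backward()
    sal = image.grad.abs().amax(dim=0)          # max over colour channels -> HxW
    sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)
    return sal > threshold                      # boolean burn/background mask

mask = saliency_mask(torch.rand(3, 64, 64))
print(f"predicted burn region covers {mask.float().mean().item():.1%} of the image")
```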
Cauliflower is a hand-harvested crop that must fulfill high-quality standards in sales, making the timing of harvest important. However, accurately determining harvest-readiness can be challenging due to the cauliflower head being covered by its canopy. While deep learning enables automated harvest-readiness estimation, errors can occur due to field-variability and limited training data. In this paper, we analyze the reliability of a harvest-readiness classifier with interpretable machine learning. By identifying clusters of saliency maps, we derive reliability scores for each classification result using knowledge about the domain and the image properties. For unseen data, the reliability can be used to (i) inform farmers to improve their decision-making and (ii) increase the model prediction accuracy. Using RGB images of single cauliflower plants at different developmental stages from the GrowliFlower dataset, we investigate various saliency mapping approaches and find that they result in different quality of reliability scores. With the most suitable interpretation tool, we adjust the classification result and achieve a 15.72% improvement of the overall accuracy to 88.14% and a 15.44% improvement of the average class accuracy to 88.52% for the GrowliFlower dataset.
Cauliflower is a hand-harvested crop that must meet high quality standards at sale, which makes the timing of harvest important. However, accurately determining harvest-readiness is challenging because the cauliflower head is covered by its canopy. While deep learning enables automated harvest-readiness estimation, errors can occur due to field variability and limited training data. In this paper, we analyze the reliability of a harvest-readiness classifier with interpretable machine learning. By identifying clusters of saliency maps, we derive a reliability score for each classification result using knowledge about the domain and the image properties. For unseen data, this reliability can be used to (i) inform farmers and improve their decision-making and (ii) increase the model's prediction accuracy. Using RGB images of single cauliflower plants at different developmental stages from the GrowliFlower dataset, we investigate various saliency mapping approaches and find that they yield reliability scores of differing quality. With the most suitable interpretation tool, we adjust the classification results and achieve a 15.72% improvement in overall accuracy, to 88.14%, and a 15.44% improvement in average class accuracy, to 88.52%, on the GrowliFlower dataset.
https://arxiv.org/abs/2305.15149
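The paper above derives reliability scores from clusters of saliency maps together with domain and image knowledge. The sketch below shows only the clustering-and-scoring skeleton, with random data and a simple proximity-to-centroid score standing in for the paper's domain-informed scoring; cluster count and threshold are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
saliency_maps = rng.random((200, 32 * 32))     # 200 flattened saliency maps (toy data)

# Group the saliency maps into a handful of characteristic patterns.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(saliency_maps)

# Stand-in reliability: how close a map is to its cluster centroid
# (the paper instead scores clusters using domain and image knowledge).
dists = np.linalg.norm(saliency_maps - kmeans.cluster_centers_[kmeans.labels_], axis=1)
reliability = 1.0 - (dists - dists.min()) / (dists.max() - dists.min() + 1e-8)

# Downstream, low-reliability predictions could be flagged for the farmer
# or have their class decision adjusted.
print(f"mean reliability: {reliability.mean():.2f}, "
      f"flagged as unreliable: {(reliability < 0.3).sum()} of {len(reliability)}")
```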
Millions of users are active on social media. To allow users to better showcase themselves and network with others, we explore the auto-generation of social media self-introduction, a short sentence outlining a user's personal interests. While most prior work profiles users with tags (e.g., ages), we investigate sentence-level self-introductions to provide a more natural and engaging way for users to know each other. Here we exploit a user's tweeting history to generate their self-introduction. The task is non-trivial because the history content may be lengthy, noisy, and exhibit various personal interests. To address this challenge, we propose a novel unified topic-guided encoder-decoder (UTGED) framework; it models latent topics to reflect salient user interest, whose topic mixture then guides encoding a user's history and topic words control decoding their self-introduction. For experiments, we collect a large-scale Twitter dataset, and extensive results show the superiority of our UTGED to the advanced encoder-decoder models without topic modeling.
Millions of users are active on social media. To help users better showcase themselves and network with others, we explore the automatic generation of social media self-introductions, short sentences that outline a user's personal interests. While most prior work profiles users with tags (e.g., ages), we investigate sentence-level self-introductions as a more natural and engaging way for users to get to know each other. Here we exploit a user's tweeting history to generate their self-introduction. The task is non-trivial because the history content may be lengthy and noisy and may reflect a variety of personal interests. To address this challenge, we propose a novel unified topic-guided encoder-decoder (UTGED) framework: it models latent topics to reflect salient user interests, the resulting topic mixture guides the encoding of the user's history, and topic words control the decoding of the self-introduction. For the experiments, we collect a large-scale Twitter dataset, and extensive results show the superiority of UTGED over advanced encoder-decoder models without topic modeling.
https://arxiv.org/abs/2305.15138
In this paper, we introduce Divide-and-Conquer into the salient object detection (SOD) task to enable the model to learn prior knowledge useful for predicting the saliency map. We design a novel network, the Divide-and-Conquer Network (DC-Net), which uses two encoders to solve different subtasks that are conducive to predicting the final saliency map, namely predicting width-4 edge maps and location maps of salient objects, and then aggregates the feature maps with different semantic information in the decoder to predict the final saliency map. The decoder of DC-Net consists of our newly designed two-level Residual nested-ASPP (ResASPP$^{2}$) modules, which have the ability to capture a large number of different-scale features with a small number of convolution operations and have the advantages of maintaining high resolution throughout and obtaining a large and compact effective receptive field (ERF). Based on the advantage of Divide-and-Conquer's parallel computing, we use Parallel Acceleration to speed up DC-Net, allowing it to achieve competitive performance on six LR-SOD and five HR-SOD datasets under high efficiency (60 FPS and 55 FPS). Codes and results are available: this https URL.
In this paper, we introduce Divide-and-Conquer into the salient object detection (SOD) task so that the model can learn prior knowledge useful for predicting the saliency map. We design a novel network, the Divide-and-Conquer Network (DC-Net), which uses two encoders to solve different subtasks conducive to predicting the final saliency map, namely predicting width-4 edge maps and location maps of salient objects, and then aggregates the feature maps carrying different semantic information in the decoder to predict the final saliency map. The decoder of DC-Net consists of our newly designed two-level Residual nested-ASPP (ResASPP$^{2}$) modules, which can capture many features at different scales with a small number of convolution operations, maintain high resolution throughout, and obtain a large and compact effective receptive field (ERF). Exploiting the parallelism of Divide-and-Conquer, we use Parallel Acceleration to speed up DC-Net, allowing it to achieve competitive performance on six LR-SOD and five HR-SOD datasets at high efficiency (60 FPS and 55 FPS). Codes and results are available at this https URL.
https://arxiv.org/abs/2305.14955
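A minimal sketch of the two-encoder/one-decoder wiring described above: one encoder handles the edge subtask, one handles the location subtask, and their features are aggregated by a shared decoder into the final saliency map. Layer sizes are arbitrary and the ResASPP$^{2}$ modules are not reproduced.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())

class TwoEncoderSOD(nn.Module):
    """Two subtask encoders whose features are aggregated by one decoder."""
    def __init__(self):
        super().__init__()
        self.edge_encoder = conv_block(3, 16)      # subtask 1: edge map of salient objects
        self.loc_encoder = conv_block(3, 16)       # subtask 2: location map of salient objects
        self.edge_head = nn.Conv2d(16, 1, 1)
        self.loc_head = nn.Conv2d(16, 1, 1)
        self.decoder = nn.Sequential(conv_block(32, 16), nn.Conv2d(16, 1, 1))

    def forward(self, x):
        fe, fl = self.edge_encoder(x), self.loc_encoder(x)
        edge_map = torch.sigmoid(self.edge_head(fe))       # supervised with width-4 edges
        loc_map = torch.sigmoid(self.loc_head(fl))         # supervised with object locations
        saliency = torch.sigmoid(self.decoder(torch.cat([fe, fl], dim=1)))
        return saliency, edge_map, loc_map

sal, edge, loc = TwoEncoderSOD()(torch.rand(1, 3, 128, 128))
print(sal.shape, edge.shape, loc.shape)
```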
In-context learning (ICL), the ability of large language models to perform novel tasks by conditioning on a prompt with a few task examples, requires demonstrations that are informative about the test instance. The standard approach of independently selecting the most similar examples selects redundant demonstrations while overlooking important information. This work proposes a framework for assessing the informativeness of demonstrations based on their coverage of salient aspects (e.g., reasoning patterns) of the test input. Using this framework, we show that contextual token embeddings effectively capture these salient aspects, and their recall measured using BERTScore-Recall (BSR) yields a reliable measure of informativeness. Further, we extend recall metrics like BSR to propose their set versions to find maximally informative sets of demonstrations. On 6 complex compositional generation tasks and 7 diverse LLMs, we show that Set-BSR outperforms the standard similarity-based approach by up to 16% on average and, despite being learning-free, often surpasses methods that leverage task or LLM-specific training.
In-context learning (ICL) is the ability of large language models to perform novel tasks by conditioning on a prompt with a few task examples; it requires demonstrations that are informative about the test instance. The standard approach of independently selecting the most similar examples picks redundant demonstrations while overlooking important information. This work proposes a framework for assessing the informativeness of demonstrations based on how well they cover the salient aspects (e.g., reasoning patterns) of the test input. Using this framework, we show that contextual token embeddings effectively capture these salient aspects, and that their recall, measured with BERTScore-Recall (BSR), provides a reliable measure of informativeness. We further extend recall metrics such as BSR to set versions in order to find maximally informative sets of demonstrations. On 6 complex compositional generation tasks and 7 diverse LLMs, we show that Set-BSR outperforms the standard similarity-based approach by up to 16% on average and, despite being learning-free, often surpasses methods that rely on task- or LLM-specific training.
https://arxiv.org/abs/2305.14907
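Here is a toy sketch of the set-level idea described above: greedily building a demonstration set that maximizes a set-level recall of the test input's token embeddings, where each test token is credited with its best match across all tokens of the selected demonstrations. Random unit vectors stand in for contextual embeddings, and the scoring function is a simplification rather than the exact BSR formulation.

```python
import numpy as np

def set_recall(test_tokens: np.ndarray, demo_token_sets: list) -> float:
    """Set-level recall: each test token is matched against ALL tokens
    of the selected demonstrations (cf. the set version of BSR)."""
    if not demo_token_sets:
        return 0.0
    pool = np.vstack(demo_token_sets)                          # all selected demo tokens
    sims = test_tokens @ pool.T                                # cosine sims (unit vectors)
    return float(sims.max(axis=1).mean())

def greedy_select(test_tokens, candidates, k=4):
    """Greedily add the demo that most increases set-level recall of the test input."""
    selected, chosen_idx = [], []
    for _ in range(k):
        order = [i for i in range(len(candidates)) if i not in chosen_idx]
        gains = [set_recall(test_tokens, selected + [candidates[i]]) for i in order]
        best = order[int(np.argmax(gains))]
        chosen_idx.append(best)
        selected.append(candidates[best])
    return chosen_idx

rng = np.random.default_rng(0)
unit = lambda m: m / np.linalg.norm(m, axis=1, keepdims=True)
test = unit(rng.normal(size=(12, 64)))                          # test-input token embeddings
cands = [unit(rng.normal(size=(int(rng.integers(5, 15)), 64))) for _ in range(20)]
print("selected demonstrations:", greedy_select(test, cands))
```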
Long document summarization systems are critical for domains with lengthy and jargon-laden text, yet they present significant challenges to researchers and developers with limited computing resources. Existing solutions mainly focus on efficient attentions or divide-and-conquer strategies. The former reduces theoretical time complexity, but is still memory-heavy. The latter methods sacrifice global context, leading to uninformative and incoherent summaries. This work aims to leverage the memory-efficient nature of divide-and-conquer methods while preserving global context. Concretely, our framework AWESOME uses two novel mechanisms: (1) External memory mechanisms track previously encoded document segments and their corresponding summaries, to enhance global document understanding and summary coherence. (2) Global salient content is further identified beforehand to augment each document segment to support its summarization. Extensive experiments on diverse genres of text, including government reports, transcripts, scientific papers, and novels, show that AWESOME produces summaries with improved informativeness, faithfulness, and coherence than competitive baselines on longer documents, while having a similar or smaller GPU memory footprint.
Long-document summarization systems are critical for domains with lengthy, jargon-laden text, yet they pose significant challenges for researchers and developers with limited computing resources. Existing solutions mainly rely on efficient attention mechanisms or divide-and-conquer strategies. The former reduce theoretical time complexity but remain memory-heavy; the latter sacrifice global context, leading to uninformative and incoherent summaries. This work aims to leverage the memory efficiency of divide-and-conquer methods while preserving global context. Concretely, our framework AWESOME uses two novel mechanisms: (1) an external memory mechanism that tracks previously encoded document segments and their corresponding summaries, to improve global document understanding and summary coherence; and (2) globally salient content that is identified beforehand and used to augment each document segment in support of its summarization. Extensive experiments on diverse genres of text, including government reports, transcripts, scientific papers, and novels, show that AWESOME produces summaries with better informativeness, faithfulness, and coherence than competitive baselines on longer documents, while having a similar or smaller GPU memory footprint.
https://arxiv.org/abs/2305.14806
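A skeleton of the memory-augmented divide-and-conquer loop sketched above: each segment is summarized with a running memory of previous segment summaries and with pre-identified globally salient sentences prepended. The `summarize` callable, the tags, and the memory size are placeholders for whatever seq2seq summarizer and prompt format a real system would use, not AWESOME's actual components.

```python
from typing import Callable, List

def summarize_long_document(segments: List[str],
                            global_salient: List[str],
                            summarize: Callable[[str], str],
                            memory_size: int = 3) -> str:
    """Divide-and-conquer summarization with an external memory of prior summaries."""
    memory: List[str] = []                      # summaries of previously seen segments
    partial_summaries: List[str] = []
    for segment in segments:
        context = (" [GLOBAL] " + " ".join(global_salient) +
                   " [MEMORY] " + " ".join(memory[-memory_size:]) +
                   " [SEGMENT] " + segment)
        piece = summarize(context)
        memory.append(piece)                    # update the external memory
        partial_summaries.append(piece)
    return " ".join(partial_summaries)

# Toy run with a trivial "summarizer" that keeps the first sentence of the segment.
toy_summarizer = lambda text: text.split(" [SEGMENT] ")[-1].split(". ")[0] + "."
doc_segments = ["The committee met in March. It debated the budget at length.",
                "A final vote passed the budget. Several members dissented."]
print(summarize_long_document(doc_segments, ["budget vote"], toy_summarizer))
```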
Video multimodal fusion aims to integrate multimodal signals in videos, such as visual, audio and text, to make a complementary prediction with multiple modalities contents. However, unlike other image-text multimodal tasks, video has longer multimodal sequences with more redundancy and noise in both visual and audio modalities. Prior denoising methods like forget gate are coarse in the granularity of noise filtering. They often suppress the redundant and noisy information at the risk of losing critical information. Therefore, we propose a denoising bottleneck fusion (DBF) model for fine-grained video multimodal fusion. On the one hand, we employ a bottleneck mechanism to filter out noise and redundancy with a restrained receptive field. On the other hand, we use a mutual information maximization module to regulate the filter-out module to preserve key information within different modalities. Our DBF model achieves significant improvement over current state-of-the-art baselines on multiple benchmarks covering multimodal sentiment analysis and multimodal summarization tasks. It proves that our model can effectively capture salient features from noisy and redundant video, audio, and text inputs. The code for this paper is publicly available at this https URL.
Video multimodal fusion aims to integrate the multimodal signals in a video, such as visual, audio, and text, to make complementary predictions from the contents of multiple modalities. However, unlike other image-text multimodal tasks, video has longer multimodal sequences with more redundancy and noise in both the visual and audio modalities. Prior denoising methods such as forget gates filter noise at a coarse granularity and often suppress redundant and noisy information at the risk of losing critical information. We therefore propose a denoising bottleneck fusion (DBF) model for fine-grained video multimodal fusion. On the one hand, we employ a bottleneck mechanism with a restrained receptive field to filter out noise and redundancy; on the other hand, we use a mutual information maximization module to regulate the filtering module so that key information within the different modalities is preserved. Our DBF model achieves significant improvements over current state-of-the-art baselines on multiple benchmarks covering multimodal sentiment analysis and multimodal summarization, which shows that it can effectively capture salient features from noisy and redundant video, audio, and text inputs. The code for this paper is publicly available at this https URL.
https://arxiv.org/abs/2305.14652
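As a generic illustration of the bottleneck idea above, the sketch below routes each modality through a few learnable bottleneck tokens so that only a narrow channel of information is fused. The dimensions are arbitrary, and the paper's mutual-information maximization module is omitted; this is not DBF's actual architecture.

```python
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    """Fuse video/audio/text sequences through a few learnable bottleneck tokens."""
    def __init__(self, dim=64, n_bottleneck=4, n_heads=4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, modalities):
        b = modalities[0].size(0)
        z = self.bottleneck.expand(b, -1, -1)
        for m in modalities:
            # Bottleneck tokens query each (possibly noisy, redundant) modality,
            # so only a restrained amount of information passes through.
            z, _ = self.attn(query=z, key=m, value=m)
        return z.mean(dim=1)                     # fused clip-level representation

video = torch.rand(2, 50, 64)   # (batch, frames, dim)
audio = torch.rand(2, 80, 64)
text = torch.rand(2, 20, 64)
fused = BottleneckFusion()([video, audio, text])
print(fused.shape)               # torch.Size([2, 64])
```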
Cross-lingual summarization consists of generating a summary in one language given an input document in a different language, allowing for the dissemination of relevant content across speakers of other languages. However, this task remains challenging, mainly because of the need for cross-lingual datasets and the compounded difficulty of summarizing and translating. This work presents $\mu$PLAN, an approach to cross-lingual summarization that uses an intermediate planning step as a cross-lingual bridge. We formulate the plan as a sequence of entities that captures the conceptualization of the summary, i.e. identifying the salient content and expressing in which order to present the information, separate from the surface form. Using a multilingual knowledge base, we align the entities to their canonical designation across languages. $\mu$PLAN models first learn to generate the plan and then continue generating the summary conditioned on the plan and the input. We evaluate our methodology on the XWikis dataset on cross-lingual pairs across four languages and demonstrate that this planning objective achieves state-of-the-art performance in terms of ROUGE and faithfulness scores. Moreover, this planning approach improves the zero-shot transfer to new cross-lingual language pairs compared to non-planning baselines.
Cross-lingual summarization consists of generating a summary in one language from an input document in a different language, enabling relevant content to reach speakers of other languages. The task remains challenging, however, mainly because of the need for cross-lingual datasets and the compounded difficulty of summarizing and translating at the same time. This work presents $\mu$PLAN, an approach to cross-lingual summarization that uses an intermediate planning step as a cross-lingual bridge. We formulate the plan as a sequence of entities capturing the conceptualization of the summary, i.e., identifying the salient content and the order in which to present it, separately from the surface form. Using a multilingual knowledge base, we align the entities to their canonical designations across languages. $\mu$PLAN models first learn to generate the plan and then continue to generate the summary conditioned on the plan and the input. We evaluate our methodology on the XWikis dataset on cross-lingual pairs across four languages and show that this planning objective achieves state-of-the-art performance in terms of ROUGE and faithfulness scores. Moreover, compared with non-planning baselines, the planning approach improves zero-shot transfer to new cross-lingual language pairs.
https://arxiv.org/abs/2305.14205
To date, the widely-adopted way to perform fixation collection in panoptic video is based on a head-mounted display (HMD), where participants' fixations are collected while wearing an HMD to explore the given panoptic scene freely. However, this widely-used data collection method is insufficient for training deep models to accurately predict which regions in a given panoptic video are most important when it contains intermittent salient events. The main reason is that there always exist "blind zooms" when using HMD to collect fixations since the participants cannot keep spinning their heads to explore the entire panoptic scene all the time. Consequently, the collected fixations tend to be trapped in some local views, leaving the remaining areas to be the "blind zooms". Therefore, fixation data collected using HMD-based methods that accumulate local views cannot accurately represent the overall global importance of complex panoramic scenes. This paper introduces the auxiliary Window with a Dynamic Blurring (WinDB) fixation collection approach for panoptic video, which doesn't need HMD and is blind-zoom-free. Thus, the collected fixations can well reflect the regional-wise importance degree. Using our WinDB approach, we have released a new PanopticVideo-300 dataset, containing 300 panoptic clips covering over 225 categories. Besides, we have presented a simple baseline design to take full advantage of PanopticVideo-300 to handle the blind-zoom-free attribute-induced fixation shifting problem. Our WinDB approach, PanopticVideo-300, and tailored fixation prediction model are all publicly available at this https URL.
To date, the widely adopted way to collect fixations for panoptic video is based on a head-mounted display (HMD): participants' fixations are recorded while they wear an HMD and freely explore the given panoptic scene. However, this widely used data collection method is insufficient for training deep models to accurately predict which regions of a panoptic video are most important when it contains intermittent salient events. The main reason is that "blind zooms" always exist when fixations are collected with an HMD, since participants cannot keep turning their heads to explore the entire panoptic scene all the time. As a result, the collected fixations tend to be trapped in some local views, leaving the remaining areas as "blind zooms". Fixation data collected with HMD-based methods that accumulate local views therefore cannot accurately represent the overall global importance of complex panoramic scenes. This paper introduces the auxiliary Window with a Dynamic Blurring (WinDB) fixation collection approach for panoptic video, which needs no HMD and is free of blind zooms, so the collected fixations can properly reflect the region-wise degree of importance. Using WinDB, we release a new PanopticVideo-300 dataset containing 300 panoptic clips covering more than 225 categories. In addition, we present a simple baseline design that takes full advantage of PanopticVideo-300 to handle the fixation-shifting problem induced by the blind-zoom-free attribute. Our WinDB approach, PanopticVideo-300, and the tailored fixation prediction model are all publicly available at this https URL.
https://arxiv.org/abs/2305.13901
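The core image operation implied by "auxiliary Window with a Dynamic Blurring" can be sketched as blurring a panoramic frame everywhere except a sharp viewing window, so attention is guided without an HMD. The window size, kernel size, and placement below are illustrative assumptions; the paper's window scheduling and dynamics are not reproduced.

```python
import cv2
import numpy as np

def windowed_blur(frame: np.ndarray, center_xy, win_w=400, win_h=300, ksize=51):
    """Blur an equirectangular frame outside a sharp viewing window."""
    blurred = cv2.GaussianBlur(frame, (ksize, ksize), 0)
    cx, cy = center_xy
    x0, x1 = max(0, cx - win_w // 2), min(frame.shape[1], cx + win_w // 2)
    y0, y1 = max(0, cy - win_h // 2), min(frame.shape[0], cy + win_h // 2)
    out = blurred.copy()
    out[y0:y1, x0:x1] = frame[y0:y1, x0:x1]     # keep the window itself sharp
    return out

panorama = (np.random.rand(512, 1024, 3) * 255).astype(np.uint8)   # toy 360-degree frame
shown = windowed_blur(panorama, center_xy=(512, 256))
print(shown.shape, shown.dtype)
```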
Fusing multiple modalities for affective computing tasks has proven effective for performance improvement. However, how multimodal fusion works is not well understood, and its use in the real world usually results in large model sizes. In this work, on sentiment and emotion analysis, we first analyze how the salient affective information in one modality can be affected by the other in crossmodal attention. We find that inter-modal incongruity exists at the latent level due to crossmodal attention. Based on this finding, we propose a lightweight model via Hierarchical Crossmodal Transformer with Modality Gating (HCT-MG), which determines a primary modality according to its contribution to the target task and then hierarchically incorporates auxiliary modalities to alleviate inter-modal incongruity and reduce information redundancy. The experimental evaluation on three benchmark datasets: CMU-MOSI, CMU-MOSEI, and IEMOCAP verifies the efficacy of our approach, showing that it: 1) outperforms major prior work by achieving competitive results and can successfully recognize hard samples; 2) mitigates the inter-modal incongruity at the latent level when modalities have mismatched affective tendencies; 3) reduces model size to less than 1M parameters while outperforming existing models of similar sizes.
Fusing multiple modalities for affective computing tasks has proven effective for improving performance. However, how multimodal fusion works is not well understood, and its real-world use usually results in large models. In this work, on sentiment and emotion analysis, we first analyze how the salient affective information in one modality can be affected by another under crossmodal attention, and we find that inter-modal incongruity exists at the latent level because of crossmodal attention. Based on this finding, we propose a lightweight model, the Hierarchical Crossmodal Transformer with Modality Gating (HCT-MG), which determines a primary modality according to its contribution to the target task and then hierarchically incorporates the auxiliary modalities to alleviate inter-modal incongruity and reduce information redundancy. Experimental evaluation on three benchmark datasets, CMU-MOSI, CMU-MOSEI, and IEMOCAP, verifies the efficacy of our approach, showing that it: 1) outperforms major prior work, achieving competitive results and successfully recognizing hard samples; 2) mitigates inter-modal incongruity at the latent level when modalities have mismatched affective tendencies; and 3) reduces the model size to fewer than 1M parameters while outperforming existing models of similar size.
https://arxiv.org/abs/2305.13583
Existing event-centric NLP models often only apply to the pre-defined ontology, which significantly restricts their generalization capabilities. This paper presents CEO, a novel Corpus-based Event Ontology induction model to relax the restriction imposed by pre-defined event ontologies. Without direct supervision, CEO leverages distant supervision from available summary datasets to detect corpus-wise salient events and exploits external event knowledge to force events within a short distance to have close embeddings. Experiments on three popular event datasets show that the schema induced by CEO has better coverage and higher accuracy than previous methods. Moreover, CEO is the first event ontology induction model that can induce a hierarchical event ontology with meaningful names on eleven open-domain corpora, making the induced schema more trustworthy and easier to be further curated.
Existing event-centric NLP models often apply only to a pre-defined ontology, which significantly restricts their generalization ability. This paper presents CEO, a novel Corpus-based Event Ontology induction model that relaxes the restriction imposed by pre-defined event ontologies. Without direct supervision, CEO leverages distant supervision from available summary datasets to detect corpus-wise salient events and exploits external event knowledge to force events within a short distance of each other to have close embeddings. Experiments on three popular event datasets show that the schema induced by CEO has better coverage and higher accuracy than previous methods. Moreover, CEO is the first event ontology induction model that can induce a hierarchical event ontology with meaningful names on eleven open-domain corpora, making the induced schema more trustworthy and easier to curate further.
https://arxiv.org/abs/2305.13521
Unsupervised video object segmentation has made significant progress in recent years, but the manual annotation of video mask datasets is expensive and limits the diversity of available datasets. The Segment Anything Model (SAM) has introduced a new prompt-driven paradigm for image segmentation, unlocking a range of previously unexplored capabilities. In this paper, we propose a novel paradigm called UVOSAM, which leverages SAM for unsupervised video object segmentation without requiring video mask labels. To address SAM's limitations in instance discovery and identity association, we introduce a video salient object tracking network that automatically generates trajectories for prominent foreground objects. These trajectories then serve as prompts for SAM to produce video masks on a frame-by-frame basis. Our experimental results demonstrate that UVOSAM significantly outperforms current mask-supervised methods. These findings suggest that UVOSAM has the potential to improve unsupervised video object segmentation and reduce the cost of manual annotation.
Unsupervised video object segmentation has made significant progress in recent years, but manually annotating video mask datasets is expensive and limits the diversity of available datasets. The Segment Anything Model (SAM) has introduced a new prompt-driven paradigm for image segmentation, unlocking a range of previously unexplored capabilities. In this paper, we propose a new paradigm called UVOSAM, which leverages SAM for unsupervised video object segmentation without requiring video mask labels. To address SAM's limitations in instance discovery and identity association, we introduce a video salient object tracking network that automatically generates trajectories for prominent foreground objects. These trajectories then serve as prompts for SAM to produce video masks frame by frame. Our experimental results show that UVOSAM significantly outperforms current mask-supervised methods. These findings suggest that UVOSAM has the potential to improve unsupervised video object segmentation and reduce the cost of manual annotation.
https://arxiv.org/abs/2305.12659
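A sketch of the prompting step described above: SAM is prompted frame by frame with the boxes of one tracked salient object. It assumes the open-source segment_anything package and a locally downloaded ViT-B checkpoint (the path is illustrative); the salient-object tracking network is represented only by a precomputed box trajectory, and this is not UVOSAM's actual code.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Assumes a SAM checkpoint has been downloaded locally (path is illustrative).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

def masks_from_trajectory(frames, trajectory):
    """Turn a per-frame box trajectory for one salient object into SAM masks.

    frames: list of HxWx3 uint8 RGB arrays; trajectory: list of [x0, y0, x1, y1] boxes
    (assumed here to come from the salient-object tracking network).
    """
    video_masks = []
    for frame, box in zip(frames, trajectory):
        predictor.set_image(frame)
        masks, scores, _ = predictor.predict(box=np.asarray(box), multimask_output=False)
        video_masks.append(masks[0])            # one mask per frame, no mask labels needed
    return video_masks
```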
Referring Expression Segmentation (RES) is a widely explored multi-modal task, which endeavors to segment the pre-existing object within a single image with a given linguistic expression. However, in broader real-world scenarios, it is not always possible to determine if the described object exists in a specific image. Typically, we have a collection of images, some of which may contain the described objects. The current RES setting curbs its practicality in such situations. To overcome this limitation, we propose a more realistic and general setting, named Group-wise Referring Expression Segmentation (GRES), which expands RES to a collection of related images, allowing the described objects to be present in a subset of input images. To support this new setting, we introduce an elaborately compiled dataset named Grouped Referring Dataset (GRD), containing complete group-wise annotations of target objects described by given expressions. We also present a baseline method named Grouped Referring Segmenter (GRSer), which explicitly captures the language-vision and intra-group vision-vision interactions to achieve state-of-the-art results on the proposed GRES and related tasks, such as Co-Salient Object Detection and RES. Our dataset and codes will be publicly released in this https URL.
Referring Expression Segmentation (RES) is a widely explored multi-modal task that aims to segment a pre-existing object in a single image given a linguistic expression. However, in broader real-world scenarios, it is not always possible to determine whether the described object exists in a specific image; typically we have a collection of images, only some of which may contain the described objects. The current RES setting limits its practicality in such situations. To overcome this limitation, we propose a more realistic and general setting, named Group-wise Referring Expression Segmentation (GRES), which extends RES to a collection of related images and allows the described objects to appear in only a subset of the input images. To support this new setting, we introduce an elaborately compiled dataset named the Grouped Referring Dataset (GRD), which contains complete group-wise annotations of the target objects described by the given expressions. We also present a baseline method, the Grouped Referring Segmenter (GRSer), which explicitly captures language-vision and intra-group vision-vision interactions to achieve state-of-the-art results on the proposed GRES and on related tasks such as Co-Salient Object Detection and RES. Our dataset and codes will be publicly released at this https URL.
https://arxiv.org/abs/2305.12452
Neural metrics for machine translation evaluation, such as COMET, exhibit significant improvements in their correlation with human judgments, as compared to traditional metrics based on lexical overlap, such as BLEU. Yet, neural metrics are, to a great extent, "black boxes" returning a single sentence-level score without transparency about the decision-making process. In this work, we develop and compare several neural explainability methods and demonstrate their effectiveness for interpreting state-of-the-art fine-tuned neural metrics. Our study reveals that these metrics leverage token-level information that can be directly attributed to translation errors, as assessed through comparison of token-level neural saliency maps with Multidimensional Quality Metrics (MQM) annotations and with synthetically-generated critical translation errors. To ease future research, we release our code at: this https URL.
Neural metrics for machine translation evaluation, such as COMET, correlate significantly better with human judgments than traditional metrics based on lexical overlap, such as BLEU. Yet neural metrics are, to a large extent, "black boxes" that return a single sentence-level score without any transparency about the decision-making process. In this work, we develop and compare several neural explainability methods and demonstrate their effectiveness for interpreting state-of-the-art fine-tuned neural metrics. Our study reveals that these metrics leverage token-level information that can be directly attributed to translation errors, as assessed by comparing token-level neural saliency maps with Multidimensional Quality Metrics (MQM) annotations and with synthetically generated critical translation errors. To facilitate future research, we release our code at this https URL.
https://arxiv.org/abs/2305.11806
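To make the token-level saliency idea above concrete, here is a gradient-times-embedding attribution on a toy sentence-level "metric". The tiny vocabulary, pooling, and scorer are illustrative stand-ins; the paper studies several explainability methods on real fine-tuned metrics such as COMET, which are not reproduced here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "dog": 3, "<err>": 4}
emb = nn.Embedding(len(vocab), 16)
scorer = nn.Sequential(nn.Linear(16, 16), nn.Tanh(), nn.Linear(16, 1))  # toy sentence-level metric

def token_saliency(token_ids):
    """Gradient-x-embedding attribution of the sentence-level score to each token."""
    e = emb(torch.tensor(token_ids)).detach().requires_grad_(True)
    score = scorer(e).mean()                 # pool token scores into one sentence score
    score.backward()
    return (e.grad * e).sum(dim=-1).abs()    # one attribution value per token

words = ["the", "cat", "sat", "<err>"]
sal = token_saliency([vocab[w] for w in words])
for w, s in zip(words, sal.tolist()):
    print(f"{w:>6}: {s:.3f}")
```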
We introduce ViDaS, a two-stream, fully convolutional Video, Depth-Aware Saliency network to address the problem of attention modeling ``in-the-wild", via saliency prediction in videos. Contrary to existing visual saliency approaches using only RGB frames as input, our network employs also depth as an additional modality. The network consists of two visual streams, one for the RGB frames, and one for the depth frames. Both streams follow an encoder-decoder approach and are fused to obtain a final saliency map. The network is trained end-to-end and is evaluated in a variety of different databases with eye-tracking data, containing a wide range of video content. Although the publicly available datasets do not contain depth, we estimate it using three different state-of-the-art methods, to enable comparisons and a deeper insight. Our method outperforms in most cases state-of-the-art models and our RGB-only variant, which indicates that depth can be beneficial to accurately estimating saliency in videos displayed on a 2D screen. Depth has been widely used to assist salient object detection problems, where it has been proven to be very beneficial. Our problem though differs significantly from salient object detection, since it is not restricted to specific salient objects, but predicts human attention in a more general aspect. These two problems not only have different objectives, but also different ground truth data and evaluation metrics. To our best knowledge, this is the first competitive deep learning video saliency estimation approach that combines both RGB and Depth features to address the general problem of saliency estimation ``in-the-wild". The code will be publicly released.
We introduce ViDaS, a two-stream, fully convolutional Video, Depth-Aware Saliency network that addresses the problem of attention modeling "in the wild" via saliency prediction in videos. In contrast to existing visual saliency approaches that use only RGB frames as input, our network also employs depth as an additional modality. The network consists of two visual streams, one for the RGB frames and one for the depth frames. Both streams follow an encoder-decoder approach and are fused to obtain the final saliency map. The network is trained end to end and is evaluated on a variety of databases with eye-tracking data covering a wide range of video content. Although the publicly available datasets do not contain depth, we estimate it with three different state-of-the-art methods to enable comparison and deeper insight. In most cases our method outperforms state-of-the-art models as well as our RGB-only variant, which indicates that depth can be beneficial for accurately estimating saliency in videos displayed on a 2D screen. Depth has been widely used to assist salient object detection, where it has proven very beneficial; our problem, however, differs significantly from salient object detection, since it is not restricted to specific salient objects but predicts human attention in a more general sense. The two problems have different objectives as well as different ground-truth data and evaluation metrics. To the best of our knowledge, this is the first competitive deep learning video saliency estimation approach that combines RGB and depth features to address the general problem of saliency estimation "in the wild". The code will be publicly released.
https://arxiv.org/abs/2305.11729
Deep ensembles achieved state-of-the-art results in classification and out-of-distribution (OOD) detection; however, their effectiveness remains limited due to the homogeneity of learned patterns within the ensemble. To overcome this challenge, our study introduces a novel approach that promotes diversity among ensemble members by leveraging saliency maps. By incorporating saliency map diversification, our method outperforms conventional ensemble techniques in multiple classification and OOD detection tasks, while also improving calibration. Experiments on well-established OpenOOD benchmarks highlight the potential of our method in practical applications.
Deep ensembles have achieved state-of-the-art results in classification and out-of-distribution (OOD) detection, but their effectiveness remains limited because the patterns learned within an ensemble are homogeneous. To overcome this challenge, our study introduces a novel approach that promotes diversity among ensemble members by leveraging saliency maps. By incorporating saliency-map diversification, our method outperforms conventional ensemble techniques on multiple classification and OOD detection tasks, while also improving calibration. Experiments on the well-established OpenOOD benchmarks highlight the potential of our method in practical applications.
https://arxiv.org/abs/2305.11616
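One plausible way to "promote diversity among ensemble members by leveraging saliency maps", as described above, is to penalize the pairwise cosine similarity of the members' input-gradient saliency maps during training. The sketch below shows that generic regularizer on toy models; the paper's exact diversification objective and weighting may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def input_saliency(model, x, y):
    """Per-example input-gradient saliency of the true-class logit."""
    x = x.clone().requires_grad_(True)
    logit = model(x).gather(1, y.unsqueeze(1)).sum()
    grad, = torch.autograd.grad(logit, x, create_graph=True)
    return grad.flatten(1)

def saliency_diversity_penalty(members, x, y):
    """Mean pairwise cosine similarity between members' saliency maps (to be minimized)."""
    maps = [input_saliency(m, x, y) for m in members]
    sims = [F.cosine_similarity(maps[i], maps[j], dim=1).mean()
            for i in range(len(maps)) for j in range(i + 1, len(maps))]
    return torch.stack(sims).mean()

members = [nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 5)) for _ in range(3)]
x, y = torch.randn(8, 20), torch.randint(0, 5, (8,))
ce = sum(F.cross_entropy(m(x), y) for m in members)
loss = ce + 0.1 * saliency_diversity_penalty(members, x, y)   # joint training objective
loss.backward()
print(f"total loss: {loss.item():.3f}")
```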
Humans exhibit complex motions that vary depending on the task that they are performing, the interactions they engage in, as well as subject-specific preferences. Therefore, forecasting future poses based on the history of the previous motions is a challenging task. This paper presents an innovative auxiliary-memory-powered deep neural network framework for the improved modelling of historical knowledge. Specifically, we disentangle subject-specific, task-specific, and other auxiliary information from the observed pose sequences and utilise these factorised features to query the memory. A novel Multi-Head knowledge retrieval scheme leverages these factorised feature embeddings to perform multiple querying operations over the historical observations captured within the auxiliary memory. Moreover, our proposed dynamic masking strategy makes this feature disentanglement process dynamic. Two novel loss functions are introduced to encourage diversity within the auxiliary memory while ensuring the stability of the memory contents, such that it can locate and store salient information that can aid the long-term prediction of future motion, irrespective of data imbalances or the diversity of the input data distribution. With extensive experiments conducted on two public benchmarks, Human3.6M and CMU-Mocap, we demonstrate that these design choices collectively allow the proposed approach to outperform the current state-of-the-art methods by significant margins: $>$ 17\% on the Human3.6M dataset and $>$ 9\% on the CMU-Mocap dataset.
Humans exhibit complex motions that vary with the task they are performing, the interactions they engage in, and subject-specific preferences, so forecasting future poses from the history of previous motions is a challenging task. This paper presents an innovative auxiliary-memory-powered deep neural network framework for improved modelling of historical knowledge. Specifically, we disentangle subject-specific, task-specific, and other auxiliary information from the observed pose sequences and use these factorised features to query the memory. A novel multi-head knowledge retrieval scheme leverages these factorised feature embeddings to perform multiple querying operations over the historical observations stored in the auxiliary memory. Moreover, our proposed dynamic masking strategy makes this feature disentanglement process dynamic. Two novel loss functions are introduced to encourage diversity within the auxiliary memory while ensuring the stability of its contents, so that it can locate and store salient information that aids long-term prediction of future motion, regardless of data imbalance or the diversity of the input data distribution. With extensive experiments on two public benchmarks, Human3.6M and CMU-Mocap, we show that these design choices collectively allow the proposed approach to outperform the current state-of-the-art methods by significant margins: more than 17% on the Human3.6M dataset and more than 9% on the CMU-Mocap dataset.
https://arxiv.org/abs/2305.11394