Word error rate (WER) is a metric used to evaluate the quality of transcriptions produced by Automatic Speech Recognition (ASR) systems. In many applications, it is of interest to estimate the WER for a given pair of a speech utterance and a transcript. Previous work on WER estimation focused on building models that are trained with a specific ASR system in mind (referred to as ASR system-dependent); such models are also domain-dependent and inflexible in real-world applications. In this paper, a hypothesis generation method for ASR System-Independent WER estimation (SIWE) is proposed. In contrast to prior work, the WER estimators are trained using data that simulates ASR system output: hypotheses are generated using phonetically similar or linguistically more likely alternative words. In WER estimation experiments, the proposed method reaches a performance similar to that of ASR system-dependent WER estimators on in-domain data and achieves state-of-the-art performance on out-of-domain data. On the out-of-domain data, the SIWE model outperformed the baseline estimators in root mean square error and Pearson correlation coefficient by a relative 17.58% and 18.21%, respectively, on Switchboard and CALLHOME. The performance was further improved when the WER of the training set was close to the WER of the evaluation dataset.
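For reference, the quantity being estimated, WER, is the word-level edit distance between a reference transcript and an ASR hypothesis, normalized by the reference length. A minimal sketch of that standard definition (not the paper's estimator, which predicts WER without access to a reference):

```python
# Minimal sketch: word error rate via word-level Levenshtein distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

A SIWE-style estimator would be trained to predict this value from the utterance and transcript alone.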
https://arxiv.org/abs/2404.16743
Cancelable biometrics is a challenging research field in which the security of an original biometric image is ensured by transforming it into another, irreversible domain. Several approaches have been suggested in the literature for generating cancelable biometric templates. In this paper, two novel and simple cancelable biometric template generation methods based on Random Walk (CBRW) are proposed. By employing a random walk and the other steps given in the two proposed algorithms, viz. CBRW-BitXOR and CBRW-BitCMP, the original biometric is transformed into a cancelable template. The performance of the proposed methods is compared with other state-of-the-art methods. Experiments have been performed on eight publicly available gray and color datasets, i.e., CP (ear) (gray and color), UTIRIS (iris) (gray and color), ORL (face) (gray), IIT Delhi (iris) (gray and color), and AR (face) (color). The performance of the generated templates is measured in terms of Correlation Coefficient (Cr), Root Mean Square Error (RMSE), Peak Signal to Noise Ratio (PSNR), Structural Similarity (SSIM), Mean Absolute Error (MAE), Number of Pixel Change Rate (NPCR), and Unified Average Changing Intensity (UACI). Experimental results show that the proposed methods are superior to other state-of-the-art methods in both qualitative and quantitative analysis. Furthermore, CBRW performs better on both gray and color images.
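Two of the metrics listed above have simple closed forms. A minimal sketch of NPCR and UACI under their usual definitions for 8-bit images (standard usage, not code from the paper):

```python
import numpy as np

def npcr(img1: np.ndarray, img2: np.ndarray) -> float:
    # Number of Pixel Change Rate: fraction of positions whose values differ, as a percentage.
    return float((img1 != img2).mean() * 100.0)

def uaci(img1: np.ndarray, img2: np.ndarray, max_val: int = 255) -> float:
    # Unified Average Changing Intensity: mean absolute intensity change over the dynamic range.
    diff = np.abs(img1.astype(np.float64) - img2.astype(np.float64))
    return float((diff / max_val).mean() * 100.0)
```

Higher values of both indicate that the cancelable template diverges strongly from the original biometric.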
https://arxiv.org/abs/2404.16739
We propose a novel multi-stage trans-dimensional architecture for multi-view cardiac image segmentation. Our method exploits the relationship between long-axis (2D) and short-axis (3D) magnetic resonance (MR) images to perform a sequential 3D-to-2D-to-3D segmentation, segmenting both the long-axis and short-axis images. In the first stage, 3D segmentation is performed using the short-axis image, and the prediction is transformed to the long-axis view and used as a segmentation prior in the next stage. In the second stage, the heart region is localized and cropped around the segmentation prior using a Heart Localization and Cropping (HLC) module, focusing the subsequent model on the heart region of the image, where a 2D segmentation is performed. Similarly, we transform the long-axis prediction to the short-axis view, localize and crop the heart region, and again perform a 3D segmentation to refine the initial short-axis segmentation. We evaluate our proposed method on the Multi-Disease, Multi-View & Multi-Center Right Ventricular Segmentation in Cardiac MRI (M&Ms-2) dataset, where our method outperforms state-of-the-art methods in segmenting cardiac regions of interest in both short-axis and long-axis images. The pre-trained models, source code, and implementation details will be publicly available.
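The cropping step of the HLC module can be illustrated with a minimal sketch: take the bounding box of the prior mask plus a safety margin. The margin value and empty-mask fallback here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def crop_around_mask(image: np.ndarray, mask: np.ndarray, margin: int = 8) -> np.ndarray:
    # Crop `image` to the bounding box of the nonzero region of `mask`, padded by `margin`.
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return image  # empty prior: fall back to the full image (an assumed fallback)
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin + 1, image.shape[0])
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin + 1, image.shape[1])
    return image[y0:y1, x0:x1]
```

The subsequent segmentation model then only sees this heart-centered crop.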
https://arxiv.org/abs/2404.16708
Charts are important for presenting and explaining complex data relationships. Recently, multimodal large language models (MLLMs) have shown remarkable capabilities in various chart understanding tasks. However, the sheer size of these models in terms of parameters and computational requirements limits their use in resource-constrained environments. In this paper, we present TinyChart, an efficient MLLM for chart understanding with only 3B parameters. TinyChart overcomes two key challenges in efficient chart understanding: (1) it reduces the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, which trains the model to generate Python programs for numerical calculations, and (2) it reduces the lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module, which gradually merges the most similar vision tokens. Extensive experiments demonstrate that our 3B TinyChart achieves SOTA performance on a variety of chart understanding benchmarks including ChartQA, Chart-to-Text, Chart-to-Table, OpenCQA, and ChartX. It outperforms several chart understanding MLLMs with up to 13B parameters, such as ChartLlama and ChartAst, as well as the closed-source general-purpose MLLM GPT-4V on ChartQA. It also demonstrates superior efficiency, with higher throughput during inference due to a smaller model scale and more efficient vision encoding. Our code and model are available at this https URL.
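The intuition behind the Vision Token Merging module can be sketched as greedy cosine-similarity merging: repeatedly average the most similar pair of tokens. The paper's actual merge schedule and matching strategy are not reproduced here; this is an illustrative simplification:

```python
import numpy as np

def merge_tokens(tokens: np.ndarray, n_merges: int) -> np.ndarray:
    # Greedily merge the most cosine-similar pair of token vectors, n_merges times.
    tokens = tokens.astype(np.float64)
    for _ in range(n_merges):
        norm = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
        sim = norm @ norm.T
        np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        merged = (tokens[i] + tokens[j]) / 2.0  # average the most similar pair
        keep = [k for k in range(len(tokens)) if k not in (i, j)]
        tokens = np.vstack([tokens[keep], merged[None, :]])
    return tokens
```

Each merge shortens the vision feature sequence by one token, reducing the cost of subsequent attention layers.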
https://arxiv.org/abs/2404.16635
Beyond improving trust and validating model fairness, xAI practices also have the potential to recover valuable scientific insights in application domains where little to no prior human intuition exists. To that end, we propose a method to extract global concept explanations from the predictions of graph neural networks to develop a deeper understanding of the tasks' underlying structure-property relationships. We identify concept explanations as dense clusters in the self-explaining Megan model's subgraph latent space. For each concept, we optimize a representative prototype graph and optionally use GPT-4 to provide hypotheses about why each structure has a certain effect on the prediction. We conduct computational experiments on synthetic and real-world graph property prediction tasks. For the synthetic tasks, we find that our method correctly reproduces the structural rules by which they were created. For real-world molecular property regression and classification tasks, we find that our method rediscovers established rules of thumb. More specifically, our results for molecular mutagenicity prediction indicate a more fine-grained resolution of structural details than existing explainability methods, consistent with previous results from the chemistry literature. Overall, our results show promising capability to extract the underlying structure-property relationships for complex graph property prediction tasks.
https://arxiv.org/abs/2404.16532
Document-level Relation Extraction (DocRE) is the task of extracting all semantic relationships from a document. While studies have been conducted on English DocRE, limited attention has been given to DocRE in non-English languages. This work delves into effectively utilizing existing English resources to promote DocRE studies in non-English languages, with Japanese as the representative case. As an initial attempt, we construct a dataset by transferring an English dataset to Japanese. However, models trained on such a dataset suffer from low recall. We investigate the error cases and attribute the failure to the different surface structures and semantics of documents translated from English versus those written by native speakers. We thus switch to exploring whether the transferred dataset can assist human annotation of Japanese documents. In our proposal, annotators edit relation predictions from a model trained on the transferred dataset. Quantitative analysis shows that relation recommendations suggested by the model help reduce approximately 50% of the human edit steps compared with the previous approach. Experiments quantify the performance of existing DocRE models on our collected dataset, portraying the challenges of Japanese and cross-lingual DocRE.
https://arxiv.org/abs/2404.16506
Large Language Models (LLMs) are extensively used today across various sectors, including academia, research, business, and finance, for tasks such as text generation, summarization, and translation. Despite their widespread adoption, these models often produce incorrect and misleading information, exhibiting a tendency to hallucinate. This behavior can be attributed to several factors, with consistency and reasoning capabilities being significant contributors. LLMs frequently lack the ability to generate explanations and engage in coherent reasoning, leading to inaccurate responses. Moreover, they exhibit inconsistencies in their outputs. This paper aims to evaluate and compare the consistency and reasoning capabilities of both public and proprietary LLMs. The experiments utilize the BoolQ dataset as the ground truth, comprising questions, answers, and corresponding explanations. Queries from the dataset are presented as prompts to the LLMs, and the generated responses are evaluated against the ground truth answers. Additionally, explanations are generated to assess the models' reasoning abilities. Consistency is evaluated by repeatedly presenting the same query to the models and observing variations in their responses. For measuring reasoning capabilities, the generated explanations are compared to the ground truth explanations using metrics such as BERT, BLEU, and F-1 scores. The findings reveal that proprietary models generally outperform public models in terms of both consistency and reasoning capabilities. However, even when presented with basic general knowledge questions, none of the models achieved a score of 90% in both consistency and reasoning. This study underscores the direct correlation between consistency and reasoning abilities in LLMs and highlights the inherent reasoning challenges present in current language models.
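The repeated-query consistency protocol described above can be sketched as a majority-agreement score over the responses to the same prompt; the whitespace/case normalization step is an illustrative assumption, not the paper's exact scoring rule:

```python
from collections import Counter

def consistency_score(responses: list[str]) -> float:
    # Fraction of runs that agree with the most common (normalized) answer.
    normalized = [r.strip().lower() for r in responses]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)
```

A score of 1.0 means the model gave the same answer on every repetition of the query.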
https://arxiv.org/abs/2404.16478
Weakly supervised medical image segmentation (MIS) using generative models is crucial for clinical diagnosis. However, the accuracy of the segmentation results is often limited by insufficient supervision and the complex nature of medical imaging. Existing models also only provide a single outcome, which does not allow for the measurement of uncertainty. In this paper, we introduce DiffSeg, a segmentation model for skin lesions based on diffusion difference, which exploits diffusion model principles to extract noise-based features from images with diverse semantic information. By discerning differences between these noise features, the model identifies diseased areas. Moreover, its multi-output capability mimics doctors' annotation behavior, facilitating the visualization of segmentation result consistency and ambiguity. Additionally, it quantifies output uncertainty using Generalized Energy Distance (GED), aiding interpretability and decision-making for physicians. Finally, the model integrates outputs through the Dense Conditional Random Field (DenseCRF) algorithm to refine the segmentation boundaries by considering inter-pixel correlations, which improves the accuracy and optimizes the segmentation results. We demonstrate the effectiveness of DiffSeg on the ISIC 2018 Challenge dataset, outperforming state-of-the-art U-Net-based methods.
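The GED measure used above for uncertainty has a standard form, GED² = 2·E[d(S, Y)] − E[d(S, S′)] − E[d(Y, Y′)], comparing model samples S against annotations Y. The sketch below uses d = 1 − IoU between binary masks, a common choice; the paper's exact distance function is an assumption here:

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 1.0  # two empty masks count as identical

def ged_squared(samples, annotations) -> float:
    # Squared Generalized Energy Distance between two sets of binary masks.
    d = lambda a, b: 1.0 - iou(a, b)
    cross = np.mean([d(s, y) for s in samples for y in annotations])
    within_s = np.mean([d(s, s2) for s in samples for s2 in samples])
    within_y = np.mean([d(y, y2) for y in annotations for y2 in annotations])
    return 2 * cross - within_s - within_y
```

A GED of zero means the distribution of model samples matches the distribution of annotator masks under the chosen distance.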
https://arxiv.org/abs/2404.16474
Model-free reinforcement learning methods lack an inherent mechanism to impose behavioural constraints on the trained policies. While certain extensions exist, they remain limited to specific types of constraints, such as value constraints with additional reward signals or visitation density constraints. In this work, we try to unify these existing techniques and bridge the gap with classical optimization and control theory, using a generic primal-dual framework for value-based and actor-critic reinforcement learning methods. The obtained dual formulations turn out to be especially useful for imposing additional constraints on the learned policy, as an intrinsic relationship between such dual constraints (or regularization terms) and reward modifications in the primal is revealed. Furthermore, using this framework, we are able to introduce some novel types of constraints, allowing one to impose bounds on the policy's action density or on costs associated with transitions between consecutive states and actions. From the adjusted primal-dual optimization problems, a practical algorithm is derived that supports various combinations of policy constraints that are automatically handled throughout training using trainable reward modifications. The resulting $\texttt{DualCRL}$ method is examined in more detail and evaluated under different (combinations of) constraints on two interpretable environments. The results highlight the efficacy of the method, which ultimately provides the designer of such systems with a versatile toolbox of possible policy constraints.
https://arxiv.org/abs/2404.16468
Multimodal sentiment analysis (MSA) aims to understand human sentiment through multimodal data. Most MSA efforts are based on the assumption of modality completeness. However, in real-world applications, some practical factors cause uncertain modality missingness, which drastically degrades the model's performance. To this end, we propose a Correlation-decoupled Knowledge Distillation (CorrKD) framework for the MSA task under uncertain missing modalities. Specifically, we present a sample-level contrastive distillation mechanism that transfers comprehensive knowledge containing cross-sample correlations to reconstruct missing semantics. Moreover, a category-guided prototype distillation mechanism is introduced to capture cross-category correlations using category prototypes to align feature distributions and generate favorable joint representations. Eventually, we design a response-disentangled consistency distillation strategy to optimize the sentiment decision boundaries of the student network through response disentanglement and mutual information maximization. Comprehensive experiments on three datasets indicate that our framework can achieve favorable improvements compared with several baselines.
https://arxiv.org/abs/2404.16456
The recent work Local Implicit Image Function (LIIF) and subsequent Implicit Neural Representation (INR) based works have achieved remarkable success in Arbitrary-Scale Super-Resolution (ASSR) by using MLPs to decode Low-Resolution (LR) features. However, these continuous image representations typically implement decoding in High-Resolution (HR) High-Dimensional (HD) space, leading to a quadratic increase in computational cost and seriously hindering the practical applications of ASSR. To tackle this problem, we propose a novel Latent Modulated Function (LMF), which decouples the HR-HD decoding process into shared latent decoding in LR-HD space and independent rendering in HR Low-Dimensional (LD) space, thereby realizing the first computationally optimal paradigm of continuous image representation. Specifically, LMF utilizes an HD MLP in latent space to generate latent modulations of each LR feature vector. This enables a modulated LD MLP in render space to quickly adapt to any input feature vector and perform rendering at arbitrary resolution. Furthermore, we leverage the positive correlation between modulation intensity and input image complexity to design a Controllable Multi-Scale Rendering (CMSR) algorithm, offering the flexibility to adjust the decoding efficiency based on the rendering precision. Extensive experiments demonstrate that converting existing INR-based ASSR methods to LMF can reduce the computational cost by up to 99.9%, accelerate inference by up to 57 times, and save up to 76% of parameters, while maintaining competitive performance. The code is available at this https URL.
https://arxiv.org/abs/2404.16451
This paper presents a question-answering approach to extract document-level event-argument structures. We automatically ask and answer questions for each argument type an event may have. Questions are generated using manually defined templates and generative transformers. Template-based questions are generated using predefined role-specific wh-words and event triggers from the context document. Transformer-based questions are generated using large language models trained to formulate questions based on a passage and the expected answer. Additionally, we develop novel data augmentation strategies specialized in inter-sentential event-argument relations. We use a simple span-swapping technique, coreference resolution, and large language models to augment the training instances. Our approach enables transfer learning without any corpora-specific modifications and yields competitive results with the RAMS dataset. It outperforms previous work, and it is especially beneficial to extract arguments that appear in different sentences than the event trigger. We also present detailed quantitative and qualitative analyses shedding light on the most common errors made by our best model.
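The template-based question generation step can be illustrated with a toy sketch; the wh-word mapping and template below are hypothetical stand-ins for the paper's manually defined resources:

```python
# Illustrative role-specific wh-word mapping (not the paper's actual resource).
WH_WORDS = {"attacker": "Who", "place": "Where", "instrument": "What"}

def template_question(role: str, trigger: str) -> str:
    # Fill a predefined template with a role-specific wh-word and the event trigger.
    wh = WH_WORDS.get(role, "What")
    return f"{wh} is the {role} of the event triggered by '{trigger}'?"
```

Each generated question is then answered over the document, and the answer span becomes the extracted argument for that role.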
https://arxiv.org/abs/2404.16413
The use of multimodal data in assisted diagnosis and segmentation has emerged as a prominent area of interest in current research. However, one of the primary challenges is how to effectively fuse multimodal features. Most current approaches focus on the integration of multimodal features while ignoring the correlation and consistency between different modal features, leading to the inclusion of potentially irrelevant information. To address this issue, we introduce an innovative Multimodal Information Cross Transformer (MicFormer), which employs a dual-stream architecture to simultaneously extract features from each modality. Leveraging the Cross Transformer, it queries features from one modality and retrieves corresponding responses from another, facilitating effective communication between bimodal features. Additionally, we incorporate a deformable Transformer architecture to expand the search space. We conducted experiments on the MM-WHS dataset, and in the CT-MRI multimodal image segmentation task, we successfully improved the whole-heart segmentation DICE score to 85.57 and MIoU to 75.51. Compared to other multimodal segmentation techniques, our method outperforms them by margins of 2.83 and 4.23, respectively. This demonstrates the efficacy of MicFormer in integrating relevant information between different modalities in multimodal tasks. These findings hold significant implications for multimodal image tasks, and we believe that MicFormer possesses extensive potential for broader applications across various domains. Access to our method is available at this https URL
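The reported DICE and MIoU numbers follow standard overlap definitions, sketched below for binary masks (this is the conventional formula, not the paper's evaluation code):

```python
import numpy as np

def dice(pred: np.ndarray, target: np.ndarray) -> float:
    # Dice coefficient: 2|A ∩ B| / (|A| + |B|).
    inter = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    return 2.0 * inter / total if total else 1.0

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    # Intersection over Union: |A ∩ B| / |A ∪ B|.
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union else 1.0
```

MIoU is the IoU averaged over classes; the abstract reports both as percentages.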
https://arxiv.org/abs/2404.16371
Transformers trained on natural language data have been shown to learn its hierarchical structure and generalize to sentences with unseen syntactic structures without explicitly encoding any structural bias. In this work, we investigate sources of inductive bias in transformer models and their training that could cause such generalization behavior to emerge. We extensively experiment with transformer models trained on multiple synthetic datasets and with different training objectives and show that while other objectives, e.g., sequence-to-sequence modeling and prefix language modeling, often failed to lead to hierarchical generalization, models trained with the language modeling objective consistently learned to generalize hierarchically. We then conduct pruning experiments to study how transformers trained with the language modeling objective encode hierarchical structure. When pruned, we find the joint existence of subnetworks within the model with different generalization behaviors (subnetworks corresponding to hierarchical structure and linear order). Finally, we take a Bayesian perspective to further uncover transformers' preference for hierarchical generalization: we establish a correlation between whether transformers generalize hierarchically on a dataset and whether the simplest explanation of that dataset is provided by a hierarchical grammar, as compared to regular grammars exhibiting linear generalization.
https://arxiv.org/abs/2404.16367
Unsupervised graph anomaly detection aims at identifying rare patterns that deviate from the majority in a graph without the aid of labels, which is important for a variety of real-world applications. Recent advances have utilized Graph Neural Networks (GNNs) to learn effective node representations by aggregating information from neighborhoods. This is motivated by the hypothesis that nodes in the graph tend to exhibit consistent behaviors with their neighborhoods. However, such consistency can be disrupted by graph anomalies in multiple ways. Most existing methods directly employ GNNs to learn representations, disregarding the negative impact of graph anomalies on GNNs, resulting in sub-optimal node representations and anomaly detection performance. While a few recent approaches have redesigned GNNs for graph anomaly detection under semi-supervised label guidance, how to address the adverse effects of graph anomalies on GNNs in unsupervised scenarios and learn effective representations for anomaly detection are still under-explored. To bridge this gap, in this paper, we propose a simple yet effective framework for Guarding Graph Neural Networks for Unsupervised Graph Anomaly Detection (G3AD). Specifically, G3AD introduces two auxiliary networks along with correlation constraints to guard the GNNs from inconsistent information encoding. Furthermore, G3AD introduces an adaptive caching module to guard the GNNs from solely reconstructing the observed data that contains anomalies. Extensive experiments demonstrate that our proposed G3AD can outperform seventeen state-of-the-art methods on both synthetic and real-world datasets.
https://arxiv.org/abs/2404.16366
Pooling is a crucial operation in computer vision, yet the unique structure of skeletons hinders the application of existing pooling strategies to skeleton graph modelling. In this paper, we propose an Improved Graph Pooling Network, referred to as IGPN. The main innovations are as follows: our method incorporates a region-aware pooling strategy based on structural partitioning. The correlation matrix of the original features is used to adaptively adjust the weight of information in different regions of the newly generated features, resulting in more flexible and effective processing. To prevent the irreversible loss of discriminative information, we propose a cross fusion module and an information supplement module to provide block-level and input-level information, respectively. As a plug-and-play structure, the proposed operation can be seamlessly combined with existing GCN-based models. We conducted extensive evaluations on several challenging benchmarks, and the experimental results indicate the effectiveness of our proposed solutions. For example, in the cross-subject evaluation of the NTU-RGB+D 60 dataset, IGPN achieves a significant improvement in accuracy compared to the baseline while reducing FLOPs by nearly 70%; a heavier version has also been introduced to further boost accuracy.
https://arxiv.org/abs/2404.16359
Zero-shot learning has consistently yielded remarkable progress via modeling nuanced one-to-one visual-attribute correlations. Existing studies resort to refining a uniform mapping function to align and correlate the sample regions and subattributes, ignoring two crucial issues: 1) the inherent asymmetry of attributes; and 2) the unutilized channel information. This paper addresses these issues by introducing a simple yet effective approach, dubbed the Dual Expert Distillation Network (DEDN), where two experts are dedicated to coarse- and fine-grained visual-attribute modeling, respectively. Concretely, one coarse expert, namely cExp, has a complete perceptual scope to coordinate visual-attribute similarity metrics across dimensions, while the fine expert, namely fExp, consists of multiple specialized subnetworks, each corresponding to an exclusive set of attributes. The two experts cooperatively distill from each other to reach a mutual agreement during training. Meanwhile, we further equip DEDN with a newly designed backbone network, i.e., the Dual Attention Network (DAN), which incorporates both region and channel attention information to fully exploit and leverage visual semantic knowledge. Experiments on various benchmark datasets indicate a new state-of-the-art.
https://arxiv.org/abs/2404.16348
We analyze the concept of virtuosity as a collective attribute in music and its relationship with entropy, based on an experiment that compares two sets of digital signals played by composer-performer electric guitarists. Based on an interdisciplinary approach related to complex systems, we computed the spectra of the signals, identified the statistical distributions that best describe them, and measured the Shannon entropy to establish their diversity. Findings suggested that virtuosity might be related to a range of entropy values that identify levels of diversity of the frequency components of audio signals. Despite the presence of different entropy values in the two sets of signals, they are statistically similar. Therefore, entropy values can be interpreted as levels of virtuosity in music.
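The entropy measurement can be sketched as Shannon entropy over a normalized magnitude spectrum; normalizing the magnitudes into a probability distribution is an illustrative assumption about the paper's procedure:

```python
import numpy as np

def spectral_entropy(signal: np.ndarray) -> float:
    # Shannon entropy (in bits) of the signal's magnitude spectrum.
    mag = np.abs(np.fft.rfft(signal))
    p = mag / mag.sum()        # normalize magnitudes into a probability distribution
    p = p[p > 0]               # drop zero bins, using the convention 0 * log 0 = 0
    return float(-(p * np.log2(p)).sum())
```

A flat (diverse) spectrum yields high entropy; energy concentrated in a few frequency components yields low entropy.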
https://arxiv.org/abs/2404.16259
Translating major-language resources to build minor-language resources has become a widely used approach. Particularly when translating complex data points composed of multiple components, it is common to translate each component separately. However, we argue that this practice often overlooks the interrelation between components within the same data point. To address this limitation, we propose a novel MT pipeline that considers the intra-data relation when implementing MT for training data. In our MT pipeline, all the components in a data point are concatenated to form a single translation sequence and subsequently reconstructed into the data components after translation. We introduce a Catalyst Statement (CS) to enhance the intra-data relation, and an Indicator Token (IT) to assist the decomposition of a translated sequence into its respective data components. Through our approach, we have achieved a considerable improvement in translation quality itself, along with its effectiveness as training data. Compared with the conventional approach that translates each data component separately, our method yields better training data that enhances the performance of the trained model by 2.690 points on the web page ranking (WPR) task and by 0.845 points on the question generation (QG) task in the XGLUE benchmark.
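The packing and decomposition steps of the pipeline can be sketched as follows; `<SEP1>` is an illustrative Indicator Token, not the paper's exact token, and the Catalyst Statement is omitted:

```python
# Hypothetical indicator token separating components in one translation sequence.
INDICATOR = " <SEP1> "

def pack(components: list[str]) -> str:
    # Concatenate all components of a data point into a single sequence for MT.
    return INDICATOR.join(components)

def unpack(translated: str) -> list[str]:
    # Split the translated sequence back into its data components.
    return [part.strip() for part in translated.split(INDICATOR.strip())]
```

Because the whole data point is translated as one sequence, the MT system can condition each component's translation on the others.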
https://arxiv.org/abs/2404.16257
Searching dependency graphs and manipulating them can be a time-consuming and challenging task to get right. We document Semgrex, a system for searching dependency graphs, and introduce Ssurgeon, a system for manipulating the output of Semgrex. The compact language used by these systems allows for easy command-line or API processing of dependencies. Additionally, integration with publicly released toolkits in Java and Python allows for searching text relations and attributes over natural text.
https://arxiv.org/abs/2404.16250