Despite significant improvements in translation quality, context-aware machine translation (MT) models still underperform in many cases. One of the main reasons is that they fail to utilize the correct features from the context when the context is too long or the model is overly complex. This can lead to the explain-away effect, wherein the model relies only on the features that most easily explain its predictions, resulting in inaccurate translations. To address this issue, we propose a model that explains its translation decisions by predicting coreference features in the input. We construct a model for input coreference by exploiting contextual features from both the input and translation output representations on top of an existing MT model. We evaluate and analyze our method on the WMT English-German document-level translation task, an English-Russian dataset, and the multilingual TED talks dataset, demonstrating an improvement of over 1.0 BLEU score compared with other context-aware models.
https://arxiv.org/abs/2404.19505
Blind face restoration (BFR) on images has progressed significantly over the last several years, while real-world video face restoration (VFR), which is more challenging due to the more complex face motions involved, such as changing gaze directions and facial orientations, remains unsolved. Typical BFR methods are evaluated on privately synthesized datasets or self-collected real-world low-quality face images, which are limited in their coverage of real-world video frames. In this work, we introduce new real-world datasets named FOS with a taxonomy of "Full, Occluded, and Side" faces, drawn mainly from video frames, to study the applicability of current methods to videos. Compared with existing test datasets, the FOS datasets cover more diverse degradations and involve face samples from more complex scenarios, which helps to revisit current face restoration approaches more comprehensively. Given the established datasets, we benchmarked both state-of-the-art BFR methods and video super-resolution (VSR) methods to comprehensively study current approaches, identifying their potential and limitations in VFR tasks. In addition, we studied the effectiveness of commonly used image quality assessment (IQA) metrics and face IQA (FIQA) metrics by leveraging a subjective user study. With extensive experimental results and detailed analysis, we gained insights from the successes and failures of both current BFR and VSR methods. These results also pose challenges to current face restoration approaches, which we hope will stimulate future advances in VFR research.
https://arxiv.org/abs/2404.19500
Edge vision systems combining sensing and embedded processing promise low-latency, decentralized, and energy-efficient solutions that forgo reliance on the cloud. As opposed to conventional frame-based vision sensors, event-based cameras deliver a microsecond-scale temporal resolution with sparse information encoding, thereby outlining new opportunities for edge vision systems. However, mainstream algorithms for frame-based vision, which mostly rely on convolutional neural networks (CNNs), can hardly exploit the advantages of event-based vision as they are typically optimized for dense matrix-vector multiplications. While event-driven graph neural networks (GNNs) have recently emerged as a promising solution for sparse event-based vision, their irregular structure is a challenge that currently hinders the design of efficient hardware accelerators. In this paper, we propose EvGNN, the first event-driven GNN accelerator for low-footprint, ultra-low-latency, and high-accuracy edge vision with event-based cameras. It relies on three central ideas: (i) directed dynamic graphs exploiting single-hop nodes with edge-free storage, (ii) event queues for the efficient identification of local neighbors within a spatiotemporally decoupled search range, and (iii) a novel layer-parallel processing scheme enabling the low-latency execution of multi-layer GNNs. We deployed EvGNN on a Xilinx KV260 Ultrascale+ MPSoC platform and benchmarked it on the N-CARS dataset for car recognition, demonstrating a classification accuracy of 87.8% and an average latency per event of 16$\mu$s, thereby enabling real-time, microsecond-resolution event-based vision at the edge.
https://arxiv.org/abs/2404.19489
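The event-queue idea in (ii) above can be sketched in plain Python. This is a toy software sketch of the neighbor-search concept, not the EvGNN hardware design; the search radius, time window, and queue depth are illustrative assumptions.

```python
from collections import deque

class EventQueues:
    """Per-pixel FIFO queues of recent event timestamps, so graph neighbors
    can be found within a decoupled spatial radius and temporal window
    without scanning the full event history. Parameters are illustrative."""

    def __init__(self, radius=2, window=50_000, depth=8):
        self.radius, self.window, self.depth = radius, window, depth
        self.queues = {}  # (x, y) -> deque of timestamps, bounded depth

    def insert_and_neighbors(self, x, y, t):
        """Return stored events within the search range, then enqueue (x, y, t)."""
        neighbors = []
        for dx in range(-self.radius, self.radius + 1):
            for dy in range(-self.radius, self.radius + 1):
                q = self.queues.get((x + dx, y + dy))
                if not q:
                    continue
                neighbors.extend((x + dx, y + dy, ts) for ts in q
                                 if t - ts <= self.window)
        self.queues.setdefault((x, y), deque(maxlen=self.depth)).append(t)
        return neighbors

eq = EventQueues()
eq.insert_and_neighbors(10, 10, t=100)      # first event: no neighbors yet
n = eq.insert_and_neighbors(11, 10, t=130)  # finds the earlier nearby event
```

The bounded `deque` per pixel mirrors the "edge-free storage" idea: edges are never stored, only re-derived from whatever timestamps remain in the local queues.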
Current text generation models are trained on real data which can potentially contain sensitive information, such as confidential patient information. Under certain conditions, the models can be triggered to output memorised training data, exposing sensitive information. To mitigate this risk, we propose a safer alternative in which fragmented data, in the form of domain-specific short phrases randomly grouped together, are shared instead of full texts. Thus, text fragments that could re-identify an individual cannot be reproduced by the model in one sequence, giving significant protection against linkage attacks. We fine-tune several state-of-the-art LLMs using meaningful syntactic chunks to explore their utility. In particular, we fine-tune BERT-based models to predict two cardiovascular diagnoses. Our results demonstrate the capacity of LLMs to benefit from pre-trained knowledge and, when fine-tuned with fragmented data, deliver classification results comparable to fine-tuning with full training data.
https://arxiv.org/abs/2404.19486
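The fragmentation scheme described above can be illustrated with a minimal sketch. The fixed-length word chunks used here are a simplifying assumption standing in for the paper's meaningful syntactic chunks; the sample sentences are invented.

```python
import random

def fragment_corpus(docs, chunk_len=3, seed=0):
    """Cut each document into short phrases, then pool and shuffle phrases
    across documents so no training sequence reproduces a re-identifying
    span of any single source text."""
    phrases = []
    for doc in docs:
        words = doc.split()
        phrases += [" ".join(words[i:i + chunk_len])
                    for i in range(0, len(words), chunk_len)]
    random.Random(seed).shuffle(phrases)  # break document-level linkage
    return phrases

docs = ["patient admitted with acute chest pain",
        "echocardiogram showed reduced ejection fraction"]
print(fragment_corpus(docs))
```

Each phrase survives intact (so domain vocabulary is preserved for fine-tuning), but the ordering that would link phrases back to one patient record is destroyed.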
Neurosymbolic background knowledge and the expressivity required of its logic can break Machine Learning assumptions about data Independence and Identical Distribution. In this position paper we propose to analyze IID relaxation in a hierarchy of logics that fit different use case requirements. We discuss the benefits of exploiting known data dependencies and distribution constraints for Neurosymbolic use cases and argue that the expressivity required for this knowledge has implications for the design of underlying ML routines. This opens a new research agenda with general questions about Neurosymbolic background knowledge and the expressivity required of its logic.
https://arxiv.org/abs/2404.19485
Large language model pre-training has become increasingly expensive, with most practitioners relying on scaling laws to allocate compute budgets for model size and training tokens, commonly referred to as Compute-Optimal or Chinchilla Optimal. In this paper, we hypothesize a new scaling law that suggests model performance depends mostly on the amount of compute spent for transformer-based models, independent of the specific allocation to model size and dataset size. Using this unified scaling law, we predict that (a) for inference efficiency, training should prioritize smaller model sizes and larger training datasets, and (b) assuming the exhaustion of available web datasets, scaling the model size might be the only way to further improve model performance.
https://arxiv.org/abs/2404.19484
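The hypothesized unified scaling law can be sketched as a toy formula. The functional form and the constants `a`, `b`, `l0` below are illustrative assumptions (loosely Chinchilla-shaped), with only the compute approximation C ≈ 6·N·D taken from standard practice.

```python
import math

def unified_loss(n_params, n_tokens, a=406.4, b=0.34, l0=1.69):
    """Predicted loss as a function of total training compute only:
    C ~ 6 * N * D, independent of how C is split between N and D.
    Constants are illustrative, not fitted values from the paper."""
    c = 6 * n_params * n_tokens
    return l0 + a / c ** b

# Two very different allocations of the same compute budget C = 4.2e23
big_model   = unified_loss(n_params=70e9, n_tokens=1e12)
small_model = unified_loss(n_params=7e9, n_tokens=10e12)
assert math.isclose(big_model, small_model)  # same C -> same predicted loss
```

Under this law the small-model/large-data allocation is strictly preferable for inference cost, which is exactly prediction (a) in the abstract.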
We introduce 'FactCheck Editor', an advanced text editor designed to automate fact-checking and correct factual inaccuracies. Given the widespread issue of misinformation, often a result of unintentional mistakes by content creators, our tool aims to address this challenge. It supports over 90 languages and utilizes transformer models to assist humans in the labor-intensive process of fact verification. This demonstration showcases a complete workflow that detects text claims in need of verification, generates relevant search engine queries, and retrieves appropriate documents from the web. It employs Natural Language Inference (NLI) to predict the veracity of claims and uses LLMs to summarize the evidence and suggest textual revisions to correct any errors in the text. Additionally, the effectiveness of models used in claim detection and veracity assessment is evaluated across multiple languages.
https://arxiv.org/abs/2404.19482
This paper presents an innovative approach to intraoperative Optical Coherence Tomography (iOCT) image segmentation in ophthalmic surgery, leveraging statistical analysis of speckle patterns to incorporate pathology-specific prior knowledge. Our findings indicate statistically different speckle patterns within the retina and between retinal layers and surgical tools, facilitating the segmentation of previously unseen data without the necessity for manual labeling. The research involves fitting various statistical distributions to iOCT data, enabling the differentiation of different ocular structures and surgical tools. The proposed segmentation model aims to refine the statistical findings based on prior tissue understanding to leverage statistical and biological knowledge. Incorporating statistical parameters, physical analysis of light-tissue interaction, and deep learning informed by biological structures enhances segmentation accuracy, offering potential benefits to real-time applications in ophthalmic surgical procedures. The study demonstrates the adaptability and precision of using Gamma distribution parameters and the derived binary maps as sole inputs for segmentation, notably enhancing the model's inference performance on unseen data.
https://arxiv.org/abs/2404.19481
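The distribution-fitting step can be illustrated with a method-of-moments Gamma fit on synthetic speckle intensities. This is a hedged toy sketch, not the paper's pipeline; the shape/scale values for "tissue" and "tool" are invented for the example.

```python
import random
import statistics

def fit_gamma_moments(samples):
    """Method-of-moments Gamma fit: for mean m and variance v,
    shape k = m^2 / v and scale theta = v / m."""
    m = statistics.fmean(samples)
    v = statistics.pvariance(samples)
    return m * m / v, v / m  # (shape k, scale theta)

rng = random.Random(0)
# Synthetic speckle intensities for two structures (parameters invented)
tissue = [rng.gammavariate(4.0, 2.0) for _ in range(5000)]
tool   = [rng.gammavariate(1.2, 6.0) for _ in range(5000)]

k_tissue, _ = fit_gamma_moments(tissue)
k_tool, _   = fit_gamma_moments(tool)
# The recovered shape parameters separate the two speckle statistics,
# which is what makes a label-free binary map possible in principle.
print(round(k_tissue, 1), round(k_tool, 1))
```

Thresholding such per-region shape estimates is one simple way a binary map could be derived; the paper's actual model combines this statistical signal with learned features.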
Diffusion models have emerged as effective tools for generating diverse and high-quality content. However, their capability in high-resolution image generation, particularly for panoramic images, still faces challenges such as visible seams and incoherent transitions. In this paper, we propose TwinDiffusion, an optimized framework designed to address these challenges through two key innovations: Crop Fusion for quality enhancement and Cross Sampling for efficiency optimization. We introduce a training-free optimizing stage to refine the similarity of the adjacent image areas, as well as an interleaving sampling strategy to yield dynamic patches during the cropping process. A comprehensive evaluation is conducted to compare TwinDiffusion with the existing methods, considering factors including coherence, fidelity, compatibility, and efficiency. The results demonstrate the superior performance of our approach in generating seamless and coherent panoramas, setting a new standard in quality and efficiency for panoramic image generation.
https://arxiv.org/abs/2404.19475
Brain responses related to working memory originate from distinct brain areas and oscillate at different frequencies. EEG signals with high temporal correlation can effectively capture these responses. Therefore, estimating the functional connectivity of EEG for working memory protocols in different frequency bands plays a significant role in analyzing the brain dynamics with increasing memory and cognitive loads, which remains largely unexplored. The present study introduces a Bayesian structure learning algorithm to learn the functional connectivity of EEG in sensor space. Next, the functional connectivity graphs are taken as input to the graph convolutional network to classify the working memory loads. The intrasubject (subject-specific) classification performed on 154 subjects for six different verbal working memory loads produced the highest classification accuracy of 96% and average classification accuracy of 89%, outperforming state-of-the-art classification models proposed in the literature. Furthermore, the proposed Bayesian structure learning algorithm is compared with state-of-the-art functional connectivity estimation methods through intersubject and intrasubject statistical analysis of variance. The results also show that the alpha and theta bands have better classification accuracy than the beta band.
https://arxiv.org/abs/2404.19467
Adversarial examples are typically optimized with gradient-based attacks. While novel attacks are continuously proposed, each is shown to outperform its predecessors using different experimental setups, hyperparameter settings, and numbers of forward and backward calls to the target models. This provides overly optimistic and even biased evaluations that may unfairly favor one particular attack over the others. In this work, we aim to overcome these limitations by proposing AttackBench, i.e., the first evaluation framework that enables a fair comparison among different attacks. To this end, we first propose a categorization of gradient-based attacks, identifying their main components and differences. We then introduce our framework, which evaluates their effectiveness and efficiency. We measure these characteristics by (i) defining an optimality metric that quantifies how close an attack is to the optimal solution, and (ii) limiting the number of forward and backward queries to the model, such that all attacks are compared within a given maximum query budget. Our extensive experimental analysis compares more than 100 attack implementations with a total of over 800 different configurations against CIFAR-10 and ImageNet models, highlighting that only very few attacks outperform all the competing approaches. Within this analysis, we shed light on several implementation issues that prevent many attacks from finding better solutions or running at all. We release AttackBench as a publicly available benchmark, aiming to continuously update it to include and evaluate novel gradient-based attacks for optimizing adversarial examples.
https://arxiv.org/abs/2404.19460
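An optimality-style metric of this kind can be sketched as follows. The exact AttackBench definition may differ; the attack names and numbers below are illustrative, and the empirical optimum is taken as the best perturbation norm any attack finds within the budget.

```python
def optimality(attack_norms, budget):
    """attack_norms: {attack: [(queries_used, perturbation_norm), ...]}.
    Score each attack by how close its best norm within the query budget
    comes to the best norm found by ANY attack within that budget."""
    best_within = {
        name: min((n for q, n in runs if q <= budget), default=float("inf"))
        for name, runs in attack_norms.items()
    }
    optimum = min(best_within.values())
    return {name: optimum / n for name, n in best_within.items()}

runs = {
    "FMN":  [(100, 0.12), (500, 0.08)],
    "PGD":  [(100, 0.20), (500, 0.16)],
    "slow": [(800, 0.05)],  # finds a good solution, but over budget
}
scores = optimality(runs, budget=500)
print(scores)  # "slow" is penalized to 0.0 despite its small norm
```

Capping queries is what makes the comparison fair: an attack that only wins with an unbounded budget does not dominate the ranking.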
Imitation learning is an approach in which an agent learns how to execute a task by trying to mimic how one or more teachers perform it. This learning approach offers a compromise between the time it takes to learn a new task and the effort needed to collect teacher samples for the agent. It achieves this by balancing learning from the teacher, who has some information on how to perform the task, and deviating from their examples when necessary, such as states not present in the teacher samples. Consequently, the field of imitation learning has received much attention from researchers in recent years, resulting in many new methods and applications. However, with this increase in published work and past surveys focusing mainly on methodology, a lack of standardisation became more prominent in the field. This non-standardisation is evident in the use of environments, which appear in no more than two works, and evaluation processes, such as qualitative analysis, that have become rare in current literature. In this survey, we systematically review current imitation learning literature and present our findings by (i) classifying imitation learning techniques, environments and metrics by introducing novel taxonomies; (ii) reflecting on main problems from the literature; and (iii) presenting challenges and future directions for researchers.
https://arxiv.org/abs/2404.19456
A critical issue in approximating solutions of ordinary differential equations using neural networks is the exact satisfaction of the boundary or initial conditions. For this purpose, neural forms have been introduced, i.e., functional expressions that depend on neural networks which, by design, satisfy the prescribed conditions exactly. Expanding upon prior progress, the present work contributes in three distinct aspects. First, it presents a novel formalism for crafting optimized neural forms. Second, it outlines a method for establishing an upper bound on the absolute deviation from the exact solution. Third, it introduces a technique for converting problems with Neumann or Robin conditions into equivalent problems with parametric Dirichlet conditions. The proposed optimized neural forms were numerically tested on a set of diverse problems, encompassing first-order and second-order ordinary differential equations, as well as first-order systems. Stiff and delay differential equations were also considered. The obtained solutions were compared against solutions obtained via Runge-Kutta methods and exact solutions wherever available. The reported results and analysis verify that in addition to the exact satisfaction of the boundary or initial conditions, optimized neural forms provide closed-form solutions of superior interpolation capability and controllable overall accuracy.
https://arxiv.org/abs/2404.19454
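The core neural-form construction can be shown in a few lines: for an initial value problem u' = f(x, u) with u(0) = u0, the trial solution û(x) = u0 + x·N(x) satisfies the initial condition exactly for any network N. A minimal sketch, with an arbitrary stand-in function in place of a trained network:

```python
def make_neural_form(u0, net):
    """Trial solution u_hat(x) = u0 + x * net(x): the initial condition
    u_hat(0) = u0 holds by construction, for ANY choice of net."""
    return lambda x: u0 + x * net(x)

# Stand-in "network" (in practice a trained neural network N(x; theta))
net = lambda x: 0.5 * x - 1.0
u_hat = make_neural_form(u0=2.0, net=net)

assert u_hat(0.0) == 2.0  # initial condition satisfied exactly
print(u_hat(1.0))         # 2.0 + 1.0 * (0.5 - 1.0) = 1.5
```

Training then only has to minimize the ODE residual of û, since the condition is never violated; the paper's contribution includes optimizing the form of such expressions and extending the idea to Neumann/Robin conditions via parametric Dirichlet reformulations.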
Conventional industrial robots often use two-fingered grippers or suction cups to manipulate objects or interact with the world. Because of their simplified design, they are unable to reproduce the dexterity of human hands when manipulating a wide range of objects. While the control of humanoid hands evolved greatly, hardware platforms still lack capabilities, particularly in tactile sensing and providing soft contact surfaces. In this work, we present a method that equips the skeleton of a tendon-driven humanoid hand with a soft and sensorized tactile skin. Multi-material 3D printing allows us to iteratively approach a cast skin design which preserves the robot's dexterity in terms of range of motion and speed. We demonstrate that a soft skin enables firmer grasps and piezoresistive sensor integration enhances the hand's tactile sensing capabilities.
https://arxiv.org/abs/2404.19448
Anomaly synthesis is an effective way to augment abnormal samples for training. However, current anomaly synthesis methods predominantly rely on texture information as input, which limits the fidelity of synthesized abnormal samples, because texture information is insufficient to correctly depict the pattern of anomalies, especially logical anomalies. To surmount this obstacle, we present the AnomalyXFusion framework, designed to harness multi-modality information to enhance the quality of synthesized abnormal samples. The AnomalyXFusion framework comprises two distinct yet synergistic modules: the Multi-modal In-Fusion (MIF) module and the Dynamic Dif-Fusion (DDF) module. The MIF module refines modality alignment by aggregating and integrating various modality features into a unified embedding space, termed X-embedding, which includes image, text, and mask features. Concurrently, the DDF module facilitates controlled generation through an adaptive adjustment of the X-embedding conditioned on the diffusion steps. In addition, to reveal the multi-modality representational power of AnomalyXFusion, we propose a new dataset, called MVTec Caption. More precisely, MVTec Caption extends the MVTec AD and LOCO datasets with 2.2k accurate image-mask-text annotations. Comprehensive evaluations demonstrate the effectiveness of AnomalyXFusion, especially regarding fidelity and diversity for logical anomalies. Project page: https://github.com/hujiecpp/MVTec-Caption
https://arxiv.org/abs/2404.19444
Naija is the Nigerian Pidgin spoken by approx. 120M speakers in Nigeria; it is a mixed language drawing on, e.g., English, Portuguese, and Indigenous languages. Although it has mainly been a spoken language until recently, there are currently two written genres (BBC and Wikipedia) in Naija. Through statistical analyses and Machine Translation experiments, we show that these two genres do not represent each other (i.e., there are linguistic differences in word order and vocabulary) and that Generative AI operates only on Naija written in the BBC genre. In other words, Naija written in the Wikipedia genre is not represented in Generative AI.
https://arxiv.org/abs/2404.19442
Existing neural audio codecs usually sacrifice computational complexity for audio quality. They build the feature transformation layers mainly on convolutional blocks, which are not inherently appropriate for capturing local redundancies of audio signals. As compensation, either adversarial losses from a discriminator or a large number of model parameters are required to improve the codec. To that end, we propose Efficient Speech Codec (ESC), a lightweight parameter-efficient codec laid on cross-scale residual vector quantization and transformers. Our model leverages mirrored hierarchical window-attention transformer blocks and performs step-wise decoding from coarse-to-fine feature representations. To enhance codebook utilization, we design a learning paradigm that involves a pre-training stage to assist with codec training. Extensive results show that ESC can achieve high audio quality with much lower complexity, which is a prospective alternative in place of existing codecs.
https://arxiv.org/abs/2404.19441
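Residual vector quantization, the mechanism ESC builds on, can be sketched in a toy form. The codebooks below are invented for illustration, and the cross-scale and transformer aspects of ESC are omitted; only the residual-quantization idea is shown.

```python
def rvq_encode(vec, codebooks):
    """Each stage quantizes the residual left by the previous stage:
    stage 1 picks the nearest coarse codeword, stage 2 picks the nearest
    fine codeword to what remains, and so on."""
    residual, codes = list(vec), []
    for cb in codebooks:
        best = min(cb, key=lambda c: sum((r - ci) ** 2
                                         for r, ci in zip(residual, c)))
        codes.append(cb.index(best))
        residual = [r - ci for r, ci in zip(residual, best)]
    return codes, residual

# Invented two-stage codebooks: coarse positions, then fine corrections
coarse = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]]
fine   = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0]]

codes, err = rvq_encode([1.2, 1.05], [coarse, fine])
print(codes)  # coarse stage picks [1.0, 1.0], fine stage refines
```

Stacking stages lets a small total codebook represent a fine-grained space, which is one reason RVQ-based codecs can stay parameter-efficient.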
Two major areas of interest in the era of Large Language Models regard questions of what do LLMs know, and if and how they may be able to reason (or rather, approximately reason). Since to date these lines of work progressed largely in parallel (with notable exceptions), we are interested in investigating the intersection: probing for reasoning about the implicitly-held knowledge. Suspecting the performance to be lacking in this area, we use a very simple set-up of comparisons between cardinalities associated with elements of various subjects (e.g. the number of legs a bird has versus the number of wheels on a tricycle). We empirically demonstrate that although LLMs make steady progress in knowledge acquisition and (pseudo)reasoning with each new GPT release, their capabilities are limited to statistical inference only. It is difficult to argue that pure statistical learning can cope with the combinatorial explosion inherent in many commonsense reasoning tasks, especially once arithmetical notions are involved. Further, we argue that bigger is not always better and chasing purely statistical improvements is flawed at the core, since it only exacerbates the dangerous conflation of the production of correct answers with genuine reasoning ability.
https://arxiv.org/abs/2404.19432
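The comparison set-up described above can be sketched with a small probe generator. The facts and phrasing are common-knowledge examples, not the paper's actual test set; an LLM's answers would be scored against the gold comparison.

```python
# Subject/feature pairs with known cardinalities (common-knowledge facts)
CARDINALITIES = {
    ("bird", "legs"): 2,
    ("tricycle", "wheels"): 3,
    ("spider", "legs"): 8,
    ("car", "wheels"): 4,
}

def make_probe(a, b):
    """Build a cardinality-comparison question and its gold answer."""
    (sa, fa), (sb, fb) = a, b
    gold = ("more" if CARDINALITIES[a] > CARDINALITIES[b]
            else "fewer" if CARDINALITIES[a] < CARDINALITIES[b] else "equal")
    question = f"Does a {sa} have more {fa} than a {sb} has {fb}?"
    return question, gold

q, gold = make_probe(("bird", "legs"), ("tricycle", "wheels"))
print(q, "->", gold)  # 2 legs vs 3 wheels -> "fewer"
```

The pairs multiply combinatorially, which is precisely the setting where the abstract argues purely statistical learning breaks down.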
We present an information retrieval based reverse dictionary system using modern pre-trained language models and approximate nearest neighbors search algorithms. The proposed approach is applied to an existing Estonian language lexicon resource, Sõnaveeb (word web), with the purpose of enhancing and enriching it by introducing cross-lingual reverse dictionary functionality powered by semantic search. The performance of the system is evaluated using both an existing labeled English dataset of words and definitions, extended to also contain Estonian and Russian translations, and a novel unlabeled evaluation approach that extracts the evaluation data from the lexicon resource itself using synonymy relations. Evaluation results indicate that the information retrieval based semantic search approach is feasible without any model training, producing a median rank of 1 in the monolingual setting and a median rank of 2 in the cross-lingual setting under the unlabeled evaluation approach, with models trained for cross-lingual retrieval and including Estonian in their training data showing superior performance on our particular task.
https://arxiv.org/abs/2404.19430
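The retrieval core of such a system can be sketched with hand-made embeddings standing in for a pre-trained encoder. The Estonian words and their vectors below are illustrative assumptions; a real system would use model embeddings and an approximate nearest neighbors index instead of exact search.

```python
import math

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Toy word embeddings (invented); a real system would use encoder outputs
WORD_VECS = {"koer": (0.9, 0.1), "kass": (0.7, 0.6), "maja": (0.1, 0.95)}

def reverse_lookup(query_vec, k=3):
    """Rank dictionary words by similarity to an embedded definition."""
    return sorted(WORD_VECS,
                  key=lambda w: -cosine(query_vec, WORD_VECS[w]))[:k]

# "a domesticated animal that barks" -> a vector near "koer" (Estonian: dog)
print(reverse_lookup((0.85, 0.15)))
```

Cross-linguality comes for free in this design when definitions and words share one multilingual embedding space; the ranking code itself does not change.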
In the field of personalized image generation, the ability to create images that preserve given concepts has significantly improved. However, creating an image that naturally integrates multiple concepts in a cohesive and visually appealing composition remains challenging. This paper introduces "InstantFamily," an approach that employs a novel masked cross-attention mechanism and a multimodal embedding stack to achieve zero-shot multi-ID image generation. Our method effectively preserves identity as it utilizes global and local features from a pre-trained face recognition model integrated with text conditions. Additionally, our masked cross-attention mechanism enables precise control of multi-ID and composition in the generated images. We demonstrate the effectiveness of InstantFamily through experiments showing its strength in generating multi-ID images while resolving well-known multi-ID generation problems. Additionally, our model achieves state-of-the-art performance in both single-ID and multi-ID preservation. Furthermore, our model exhibits remarkable scalability, preserving a greater number of IDs than it was originally trained with.
https://arxiv.org/abs/2404.19427