Extracting who says what to whom is a crucial part of analyzing human communication in today's abundance of data such as online news articles. Yet, the lack of annotated data for this task in German news articles severely limits the quality and usability of possible systems. To remedy this, we present a new, freely available, Creative-Commons-licensed dataset for quotation attribution in German news articles based on WIKINEWS. The dataset provides curated, high-quality annotations across 1000 documents (250,000 tokens) in a fine-grained annotation schema that enables various downstream uses. The annotations specify not only who said what but also how, in which context, and to whom, and they define the type of quotation. We specify our annotation schema, describe the creation of the dataset, and provide a quantitative analysis. Further, we describe suitable evaluation metrics, apply two existing systems for quotation attribution, discuss their results to evaluate the utility of our dataset, and outline use cases of our dataset in downstream tasks.
https://arxiv.org/abs/2404.16764
Cancelable biometrics is a challenging research field in which the security of an original biometric image is ensured by transforming it into another, irreversible domain. Several approaches have been suggested in the literature for generating cancelable biometric templates. In this paper, two novel and simple cancelable biometric template generation methods based on Random Walk (CBRW) have been proposed. By employing a random walk and the other steps of the two proposed algorithms, CBRW-BitXOR and CBRW-BitCMP, the original biometric is transformed into a cancelable template. The performance of the proposed methods is compared with other state-of-the-art methods. Experiments have been performed on eight publicly available gray and color datasets, i.e., CP (ear) (gray and color), UTIRIS (iris) (gray and color), ORL (face) (gray), IIT Delhi (iris) (gray and color), and AR (face) (color). The performance of the generated templates is measured in terms of Correlation Coefficient (Cr), Root Mean Square Error (RMSE), Peak Signal to Noise Ratio (PSNR), Structural Similarity (SSIM), Mean Absolute Error (MAE), Number of Pixel Change Rate (NPCR), and Unified Average Changing Intensity (UACI). The experimental results show that the proposed methods are superior to other state-of-the-art methods in both qualitative and quantitative analysis. Furthermore, CBRW performs well on both gray and color images.
https://arxiv.org/abs/2404.16739
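To make the idea above concrete, here is a minimal sketch of a random-walk-based bit-XOR transformation. It is not the authors' exact CBRW-BitXOR procedure (the walk policy, mask construction, and key handling are assumptions); it only illustrates how a key-seeded random walk can drive an XOR-based, revocable template.

```python
import numpy as np

def cbrw_bitxor_sketch(image: np.ndarray, key: int) -> np.ndarray:
    """Turn a grayscale biometric image (uint8) into a cancelable template."""
    rng = np.random.default_rng(key)                     # user/application-specific key
    h, w = image.shape
    mask = np.zeros((h, w), dtype=np.uint8)
    r, c = int(rng.integers(h)), int(rng.integers(w))    # random start cell
    for _ in range(h * w):                               # walk length (assumed)
        mask[r, c] ^= np.uint8(rng.integers(256))        # deposit a random byte
        r = (r + int(rng.integers(-1, 2))) % h           # wrap-around random step
        c = (c + int(rng.integers(-1, 2))) % w
    return np.bitwise_xor(image, mask)                   # useless without the key

# Revoking a compromised template only requires issuing a new key.
```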
Linguistic ambiguity continues to represent a significant challenge for natural language processing (NLP) systems, notwithstanding the advancements in architectures such as Transformers and BERT. Inspired by the recent success of instructional models like ChatGPT and Gemini (known as Bard in 2023), this study aims to analyze and discuss linguistic ambiguity within these models, focusing on three types prevalent in Brazilian Portuguese: semantic, syntactic, and lexical ambiguity. We create a corpus comprising 120 sentences, both ambiguous and unambiguous, for classification, explanation, and disambiguation. The models' capability to generate ambiguous sentences was also explored by soliciting sets of sentences for each type of ambiguity. The results underwent qualitative analysis, drawing on recognized linguistic references, and quantitative assessment based on the accuracy of the responses obtained. It was evidenced that even the most sophisticated models, such as ChatGPT and Gemini, exhibit errors and deficiencies in their responses, with explanations often proving inconsistent. Furthermore, the accuracy peaked at 49.58 percent, indicating the need for descriptive studies for supervised learning.
https://arxiv.org/abs/2404.16653
It has been found that Transformer-based language models have the ability to perform basic quantitative reasoning. In this paper, we propose a method for studying how these models internally represent numerical data, and use our proposal to analyze the ALBERT family of language models. Specifically, we extract the learned embeddings these models use to represent tokens that correspond to numbers and ordinals, and subject these embeddings to Principal Component Analysis (PCA). PCA results reveal that ALBERT models of different sizes, trained and initialized separately, consistently learn to use the axes of greatest variation to represent the approximate ordering of various numerical concepts. Numerals and their textual counterparts are represented in separate clusters, but increase along the same direction in 2D space. Our findings illustrate that language models, trained purely to model text, can intuit basic mathematical concepts, opening avenues for NLP applications that intersect with quantitative reasoning.
https://arxiv.org/abs/2404.16574
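A minimal sketch of the kind of analysis described above: pull the static input embeddings of number tokens from an ALBERT checkpoint and project them with PCA. The choice of checkpoint, token set, and subword averaging are assumptions for illustration, not the paper's exact protocol.

```python
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AlbertModel, AlbertTokenizer

tok = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")
emb = model.get_input_embeddings().weight.detach()        # static token embeddings

numerals = [str(n) for n in range(1, 11)]
words = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"]

vecs = []
for w in numerals + words:
    ids = tok(w, add_special_tokens=False)["input_ids"]   # may split into subwords
    vecs.append(emb[ids].mean(dim=0).numpy())             # average subword embeddings

proj = PCA(n_components=2).fit_transform(np.stack(vecs))  # axes of greatest variation
for w, (x, y) in zip(numerals + words, proj):
    print(f"{w:>6}: PC1={x:+.3f}  PC2={y:+.3f}")
```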
Document-level Relation Extraction (DocRE) is the task of extracting all semantic relationships from a document. While studies have been conducted on English DocRE, limited attention has been given to DocRE in non-English languages. This work delves into effectively utilizing existing English resources to promote DocRE studies in non-English languages, with Japanese as the representative case. As an initial attempt, we construct a dataset by transferring an English dataset to Japanese. However, models trained on such a dataset suffer from low recall. We investigate the error cases and attribute the failure to differences in surface structure and semantics between documents translated from English and those written by native speakers. We thus switch to exploring whether the transferred dataset can assist human annotation on Japanese documents. In our proposal, annotators edit relation predictions from a model trained on the transferred dataset. Quantitative analysis shows that relation recommendations suggested by the model help reduce approximately 50% of the human edit steps compared with the previous approach. Experiments quantify the performance of existing DocRE models on our collected dataset, portraying the challenges of Japanese and cross-lingual DocRE.
https://arxiv.org/abs/2404.16506
This paper presents a question-answering approach to extract document-level event-argument structures. We automatically ask and answer questions for each argument type an event may have. Questions are generated using manually defined templates and generative transformers. Template-based questions are generated using predefined role-specific wh-words and event triggers from the context document. Transformer-based questions are generated using large language models trained to formulate questions based on a passage and the expected answer. Additionally, we develop novel data augmentation strategies specialized in inter-sentential event-argument relations. We use a simple span-swapping technique, coreference resolution, and large language models to augment the training instances. Our approach enables transfer learning without any corpora-specific modifications and yields competitive results with the RAMS dataset. It outperforms previous work, and it is especially beneficial to extract arguments that appear in different sentences than the event trigger. We also present detailed quantitative and qualitative analyses shedding light on the most common errors made by our best model.
https://arxiv.org/abs/2404.16413
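As a toy illustration of the template-based questions mentioned above, the snippet below fills a role-specific wh-word and the event trigger into a fixed pattern. The wh-word mapping and the template string are invented for illustration; the paper's actual templates may differ.

```python
# Hypothetical role -> wh-word mapping; not taken from the paper.
WH_WORD = {"attacker": "Who", "victim": "Who", "place": "Where", "instrument": "What"}

def template_question(role: str, trigger: str) -> str:
    wh = WH_WORD.get(role, "What")
    return f"{wh} is the {role} in the '{trigger}' event?"

print(template_question("attacker", "bombing"))   # Who is the attacker in the 'bombing' event?
print(template_question("place", "bombing"))      # Where is the place in the 'bombing' event?
```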
Robot swarms hold immense potential for performing complex tasks far beyond the capabilities of individual robots. However, the challenge in unleashing this potential is the robots' limited sensory capabilities, which hinder their ability to detect and adapt to unknown obstacles in real-time. To overcome this limitation, we introduce a novel robot swarm control method with an indirect obstacle detector using a smoothed particle hydrodynamics (SPH) model. The indirect obstacle detector can predict the collision with an obstacle and its collision point solely from the robot's velocity information. This approach enables the swarm to effectively and accurately navigate environments without the need for explicit obstacle detection, significantly enhancing their operational robustness and efficiency. Our method's superiority is quantitatively validated through a comparative analysis, showcasing its significant navigation and pattern formation improvements under obstacle-unaware conditions.
https://arxiv.org/abs/2404.16309
Large language models (LLMs) have demonstrated impressive generalization capabilities on specific tasks with human-written instruction data. However, the limited quantity, diversity, and professional expertise of such instruction data raise concerns about the performance of LLMs in psychotherapy tasks when provided with domain-specific instructions. To address this, we first propose Domain-Specific Assistant Instructions based on AlexanderStreet therapy, and second, we use an adaptation fine-tuning method and a retrieval-augmented generation method to improve pre-trained LLMs. Through quantitative evaluation of linguistic quality using automatic and human evaluation, we observe that pre-trained LLMs adapted with Psychotherapy Assistant Instructions outperform state-of-the-art LLM response baselines. Our Assistant-Instruction approach offers a half-annotation method to align pre-trained LLMs with instructions and provides pre-trained LLMs with more psychotherapy knowledge.
https://arxiv.org/abs/2404.16160
Purpose: This study explores the feasibility of using generative machine learning (ML) to translate Optical Coherence Tomography (OCT) images into Optical Coherence Tomography Angiography (OCTA) images, potentially bypassing the need for specialized OCTA hardware. Methods: The method involved implementing a generative adversarial network framework that includes a 2D vascular segmentation model and a 2D OCTA image translation model. The study utilizes a public dataset of 500 patients, divided into subsets based on resolution and disease status, to validate the quality of translated OCTA (TR-OCTA) images. The validation employs several quality and quantitative metrics to compare the translated images with ground-truth OCTAs (GT-OCTA). We then quantitatively compare vascular features generated in TR-OCTAs with those in GT-OCTAs to assess the feasibility of using TR-OCTA for objective disease diagnosis. Results: TR-OCTAs showed high image quality in both the 3 and 6 mm datasets (high resolution, moderate structural similarity, and contrast quality compared to GT-OCTAs). There were slight discrepancies in vascular metrics, especially in diseased patients. Blood vessel features like tortuosity and vessel perimeter index showed a better trend compared to density features, which are affected by local vascular distortions. Conclusion: This study presents a promising solution to the limitations of OCTA adoption in clinical practice by using vascular features from TR-OCTA for disease detection. Translational relevance: This study has the potential to significantly enhance the diagnostic process for retinal diseases by making detailed vascular imaging more widely available and reducing dependency on costly OCTA equipment.
https://arxiv.org/abs/2404.16133
Ground Penetrating Radar (GPR) has been widely studied as a tool for extracting soil parameters relevant to agriculture and horticulture. When combined with Machine-Learning-based (ML) methods, high-resolution Stepped Frequency Continuous Wave Radar (SFCW) measurements hold the promise of giving cost-effective access to depth-resolved soil parameters, including at root-level depth. As a first step in this direction, we perform an extensive field survey with a tractor-mounted SFCW GPR instrument. Using ML data processing, we test the GPR instrument's capability to predict the apparent electrical conductivity (ECaR) as measured by a simultaneously recording Electromagnetic Induction (EMI) instrument. The large-scale field measurement campaign, with 3472 co-registered and geo-located GPR and EMI data samples distributed over ~6600 square meters, was performed on a golf course. The selected terrain benefits from a high surface homogeneity, but also features the challenge of only small, and hence hard to discern, variations in the measured soil parameter. Based on the quantitative results, we suggest the use of the nugget-to-sill ratio as a performance metric for the evaluation of end-to-end ML performance in the agricultural setting and discuss the limiting factors in the multi-sensor regression setting. The code is released as open source and available at this https URL.
https://arxiv.org/abs/2404.15961
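Below is a hedged sketch of the nugget-to-sill ratio suggested above as an evaluation metric: fit a variogram model to the spatial residuals of the regression and report nugget / (nugget + partial sill), so that values near 1 indicate spatially unstructured (noise-like) errors. The exponential variogram model, binning, and fitting choices are assumptions, not the released code.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.spatial.distance import pdist

def nugget_to_sill_ratio(coords: np.ndarray, residuals: np.ndarray, n_bins: int = 20) -> float:
    """coords: (n, 2) positions in meters; residuals: (n,) prediction errors."""
    d = pdist(coords)                                             # pairwise distances
    g = 0.5 * pdist(residuals[:, None], metric="sqeuclidean")     # semivariance per pair
    edges = np.linspace(0.0, d.max(), n_bins + 1)
    idx = np.digitize(d, edges)
    h = np.array([d[idx == i].mean() for i in range(1, n_bins + 1)])
    gamma = np.array([g[idx == i].mean() for i in range(1, n_bins + 1)])
    ok = np.isfinite(h) & np.isfinite(gamma)                      # drop empty bins

    def exp_variogram(h, nugget, psill, rng):                     # exponential model
        return nugget + psill * (1.0 - np.exp(-h / rng))

    p0 = [gamma[ok][0], max(gamma[ok][-1] - gamma[ok][0], 1e-9), h[ok].mean()]
    (nugget, psill, _), _ = curve_fit(exp_variogram, h[ok], gamma[ok], p0=p0, bounds=(0.0, np.inf))
    return float(nugget / (nugget + psill))                       # ~1: noise-like, ~0: structured
```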
Etruscan mirrors constitute a significant category within Etruscan art and, therefore, undergo systematic examinations to obtain insights into ancient times. A crucial aspect of their analysis involves the labor-intensive task of manually tracing engravings from the backside. Additionally, this task is inherently challenging due to the damage these mirrors have sustained, introducing subjectivity into the process. We address these challenges by automating the process through photometric-stereo scanning in conjunction with deep segmentation networks, which, however, requires effective usage of the limited data at hand. We accomplish this by incorporating predictions on a per-patch level and various data augmentations, as well as exploring self-supervised learning. Compared to our baseline, we improve predictive performance w.r.t. the pseudo-F-Measure by around 16%. When assessing performance on complete mirrors against a human baseline, our approach yields quantitatively similar performance to a human annotator and significantly outperforms existing binarization methods. With our proposed methodology, we streamline the annotation process, enhance its objectivity, and reduce the overall workload, offering a valuable contribution to the examination of these historical artifacts and other non-traditional documents.
https://arxiv.org/abs/2404.15903
Geometry- and appearance-controlled full-body human image generation is an interesting but challenging task. Existing solutions are either unconditional or dependent on coarse conditions (e.g., pose, text), thus lacking explicit geometry and appearance control of body and garment. Sketching offers such editing ability and has been adopted in various sketch-based face generation and editing solutions. However, directly adapting sketch-based face generation to full-body generation often fails to produce high-fidelity and diverse results due to the high complexity and diversity in the pose, body shape, and garment shape and texture. Recent geometrically controllable diffusion-based methods mainly rely on prompts to generate appearance and it is hard to balance the realism and the faithfulness of their results to the sketch when the input is coarse. This work presents Sketch2Human, the first system for controllable full-body human image generation guided by a semantic sketch (for geometry control) and a reference image (for appearance control). Our solution is based on the latent space of StyleGAN-Human with inverted geometry and appearance latent codes as input. Specifically, we present a sketch encoder trained with a large synthetic dataset sampled from StyleGAN-Human's latent space and directly supervised by sketches rather than real images. Considering the entangled information of partial geometry and texture in StyleGAN-Human and the absence of disentangled datasets, we design a novel training scheme that creates geometry-preserved and appearance-transferred training data to tune a generator to achieve disentangled geometry and appearance control. Although our method is trained with synthetic data, it can handle hand-drawn sketches as well. Qualitative and quantitative evaluations demonstrate the superior performance of our method to state-of-the-art methods.
https://arxiv.org/abs/2404.15889
This paper addresses the problem of converting real pictures into traditional Chinese ink-wash paintings, i.e., Chinese ink-wash painting style transfer. Though this problem could be handled by a wide range of image-to-image translation models, a notable issue with all these methods is that the original image content details can easily be erased or corrupted due to the transfer of ink-wash style elements. To solve or ameliorate this issue, we propose to incorporate saliency detection into the unpaired image-to-image translation framework to regularize the content information of the generated paintings. The saliency map is utilized for content regularization from two aspects, both explicitly and implicitly: (i) we propose a saliency IOU (SIOU) loss to explicitly regularize saliency consistency before and after stylization; (ii) we propose saliency adaptive normalization (SANorm), which implicitly enhances the content integrity of the generated paintings by injecting saliency information into the generator network to guide painting generation. Besides, we also propose a saliency-attended discriminator network which harnesses the saliency mask to focus generative adversarial attention onto salient image regions; it contributes to producing a finer ink-wash stylization effect for the salient objects of images. Qualitative and quantitative experiments consistently demonstrate the superiority of our model over related advanced methods for Chinese ink-wash painting style transfer.
https://arxiv.org/abs/2404.15743
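To illustrate the explicit regularization in (i), here is a minimal sketch of a saliency-IoU consistency loss between the saliency map of the input photo and that of the generated painting. It is our reading of the SIOU idea, written as a soft IoU over maps in [0, 1]; the paper's exact formulation may differ.

```python
import torch

def saliency_iou_loss(sal_src: torch.Tensor, sal_gen: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """sal_src, sal_gen: (B, 1, H, W) saliency maps in [0, 1] for the photo and the painting."""
    inter = (sal_src * sal_gen).sum(dim=(1, 2, 3))
    union = (sal_src + sal_gen - sal_src * sal_gen).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()   # 0 when the maps coincide
```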
Existing NeRF-based inverse rendering methods suppose that scenes are exclusively illuminated by distant light sources, neglecting the potential influence of emissive sources within a scene. In this work, we confront this limitation using LDR multi-view images captured with emissive sources turned on and off. Two key issues must be addressed: 1) ambiguity arising from the limited dynamic range along with unknown lighting details, and 2) the expensive computational cost in volume rendering to backtrace the paths leading to final object colors. We present a novel approach, ESR-NeRF, leveraging neural networks as learnable functions to represent ray-traced fields. By training networks to satisfy light transport segments, we regulate outgoing radiances, progressively identifying emissive sources while being aware of reflection areas. The results on scenes encompassing emissive sources with various properties demonstrate the superiority of ESR-NeRF both qualitatively and quantitatively. Our approach also extends its applicability to scenes devoid of emissive sources, achieving lower CD metrics on the DTU dataset.
https://arxiv.org/abs/2404.15707
Diffusion models have demonstrated their capability to synthesize high-quality and diverse images from textual prompts. However, simultaneous control over both global contexts (e.g., object layouts and interactions) and local details (e.g., colors and emotions) still remains a significant challenge. The models often fail to understand complex descriptions involving multiple objects and reflect specified visual attributes to wrong targets or ignore them. This paper presents Global-Local Diffusion (GLoD), a novel framework which allows simultaneous control over the global contexts and the local details in text-to-image generation without requiring training or fine-tuning. It assigns multiple global and local prompts to corresponding layers and composes their noises to guide a denoising process using pre-trained diffusion models. Our framework enables complex global-local compositions, conditioning objects in the global prompt with the local prompts while preserving other unspecified identities. Our quantitative and qualitative evaluations demonstrate that GLoD effectively generates complex images that adhere to both user-provided object interactions and object details.
https://arxiv.org/abs/2404.15447
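A hedged sketch of the kind of per-prompt noise composition described above, written against a diffusers-style UNet (`unet(x, t, encoder_hidden_states=...).sample`). The region masks, per-prompt guidance weights, and the exact composition rule are assumptions for illustration, not GLoD's released procedure.

```python
import torch

def composed_noise(unet, x_t, t, prompt_embeds, masks, uncond_embed, guidance=7.5):
    """prompt_embeds: list of text embeddings (global prompt first, then local prompts);
    masks: matching list of spatial weights in [0, 1] broadcastable to x_t
    (use a mask of ones for the global prompt)."""
    eps_uncond = unet(x_t, t, encoder_hidden_states=uncond_embed).sample
    eps = eps_uncond.clone()
    for emb, m in zip(prompt_embeds, masks):
        eps_cond = unet(x_t, t, encoder_hidden_states=emb).sample
        eps = eps + guidance * m * (eps_cond - eps_uncond)   # classifier-free guidance per region
    return eps   # feed to the scheduler step in place of the usual noise prediction
```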
Existing Transformers for monocular 3D human shape and pose estimation typically have a quadratic computation and memory complexity with respect to the feature length, which hinders the exploitation of fine-grained information in high-resolution features that is beneficial for accurate reconstruction. In this work, we propose an SMPL-based Transformer framework (SMPLer) to address this issue. SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation, which allow effective utilization of high-resolution features in the Transformer. In addition, based on these two designs, we also introduce several novel modules including a multi-scale attention and a joint-aware attention to further boost the reconstruction performance. Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods both quantitatively and qualitatively. Notably, the proposed algorithm achieves an MPJPE of 45.2 mm on the Human3.6M dataset, improving upon Mesh Graphormer by more than 10% with fewer than one-third of the parameters. Code and pretrained models are available at this https URL.
https://arxiv.org/abs/2404.15276
In cancelable biometrics, a repeatable distortion is embedded in the original biometric image to keep it secure from unauthorized access. In this paper, we generate cancelable biometric templates with a reverse Boolean XOR technique. Three different methods are proposed for the generation of cancelable biometric templates based on a Visual Secret Sharing scheme. In each method, one secret image and n-1 cover images are used: (M1) the original biometric image (secret) with n-1 randomly chosen gray cover images; (M2) the original secret image with n-1 cover images that are randomly permuted versions of the original secret image; (M3) one secret image with n-1 cover images, where both the secret image and the cover images are randomly permuted versions of the original biometric image. Experiments have been performed on the publicly available ORL face database and IIT Delhi iris database. The performance of the proposed methods is compared in terms of Correlation Coefficient (Cr), Mean Square Error (MSE), Mean Absolute Error (MAE), Structural Similarity (SSIM), Peak Signal to Noise Ratio (PSNR), Number of Pixel Change Rate (NPCR), and Unified Average Changing Intensity (UACI). It is found that, among the three proposed methods, M3 generates good-quality cancelable templates and gives the best performance in terms of quality. M3 is also better in quantitative terms on the ORL dataset, while M2 and M3 are comparable on the IIT Delhi iris dataset.
https://arxiv.org/abs/2404.15394
Recently, implicit neural representations (INR) have made significant strides in various vision-related domains, providing a novel solution for Multispectral and Hyperspectral Image Fusion (MHIF) tasks. However, INR is prone to losing high-frequency information and lacks global perceptual capabilities. To address these issues, this paper introduces a Fourier-enhanced Implicit Neural Fusion Network (FeINFN) specifically designed for the MHIF task, motivated by the following observation: the Fourier amplitudes of the HR-HSI latent code and the LR-HSI are remarkably similar, yet their phases exhibit different patterns. In FeINFN, we innovatively propose a spatial and frequency implicit fusion function (Spa-Fre IFF), helping INR capture high-frequency information and expanding the receptive field. Besides, a new decoder employing a complex Gabor wavelet activation function, called the Spatial-Frequency Interactive Decoder (SFID), is introduced to enhance the interaction of INR features. In particular, we further theoretically prove that the Gabor wavelet activation possesses a time-frequency tightness property that favors learning the optimal bandwidths in the decoder. Experiments on two benchmark MHIF datasets verify the state-of-the-art (SOTA) performance of the proposed method, both visually and quantitatively. Ablation studies also demonstrate the aforementioned contributions. The code will be available on Anonymous GitHub (https://anonymous.4open.science/r/FeINFN-15C9/) after possible acceptance.
https://arxiv.org/abs/2404.15174
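For concreteness, below is a minimal sketch of a complex Gabor wavelet activation of the kind mentioned above (a frequency-modulated Gaussian, as popularized by wavelet implicit networks). FeINFN's exact parameterization is not given in the abstract, so the frequency and spread parameters here are assumptions.

```python
import torch
import torch.nn as nn

class ComplexGaborActivation(nn.Module):
    """psi(x) = exp(i * omega0 * x) * exp(-(s0 * x)^2): localized in both time and frequency."""
    def __init__(self, omega0: float = 10.0, s0: float = 10.0):
        super().__init__()
        self.omega0, self.s0 = omega0, s0      # center frequency and spread (assumed defaults)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.exp(1j * self.omega0 * x - (self.s0 * x) ** 2)   # complex-valued output
```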
This paper studies the interpretability of convolutional networks by means of saliency maps. Most approaches based on Class Activation Maps (CAM) combine information from the fully connected layers with gradients computed through variants of backpropagation. However, it is well understood that gradients are noisy, and alternatives such as guided backpropagation have been proposed to obtain better visualizations at inference. In this work, we present a novel training approach to improve the quality of gradients for interpretability. In particular, we introduce a regularization loss such that the gradient with respect to the input image obtained by standard backpropagation is similar to the gradient obtained by guided backpropagation. We find that the resulting gradient is qualitatively less noisy and quantitatively improves the interpretability properties of different networks, as measured by several interpretability methods.
https://arxiv.org/abs/2404.15024
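A hedged sketch of such a regularizer: compute the standard input gradient, compute a guided-backprop-style gradient by clipping negative gradient signals at ReLU modules via backward hooks, and penalize their squared distance (with the guided gradient treated as a fixed target). The distance, weighting, and hook-based approximation of guided backpropagation are assumptions, not the paper's exact loss; the sketch also assumes the network uses out-of-place `nn.ReLU` modules.

```python
import torch
import torch.nn.functional as F

def _guided_relu_hook(module, grad_input, grad_output):
    # Guided backprop: on top of the usual ReLU mask, drop negative gradient signals.
    return (torch.clamp(grad_input[0], min=0.0),)

def gradient_consistency_loss(model, images, labels):
    images = images.clone().requires_grad_(True)
    # Standard input gradient, kept in the graph so the penalty trains the network.
    std_grad = torch.autograd.grad(
        F.cross_entropy(model(images), labels), images, create_graph=True)[0]
    # Guided-backprop-style gradient, used as a detached target.
    hooks = [m.register_full_backward_hook(_guided_relu_hook)
             for m in model.modules() if isinstance(m, torch.nn.ReLU)]
    guided_grad = torch.autograd.grad(
        F.cross_entropy(model(images), labels), images)[0].detach()
    for h in hooks:
        h.remove()
    return F.mse_loss(std_grad, guided_grad)
```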
Learning-based image stitching techniques typically involve three distinct stages: registration, fusion, and rectangling. These stages are often performed sequentially, each trained independently, leading to potential cascading error propagation and complex parameter tuning challenges. In rethinking the mathematical modeling of the fusion and rectangling stages, we discovered that these processes can be effectively combined into a single, variety-intensity inpainting problem. Therefore, we propose the Simple and Robust Stitcher (SRStitcher), an efficient training-free image stitching method that merges the fusion and rectangling stages into a unified model. By employing the weighted mask and large-scale generative model, SRStitcher can solve the fusion and rectangling problems in a single inference, without additional training or fine-tuning of other models. Our method not only simplifies the stitching pipeline but also enhances fault tolerance towards misregistration errors. Extensive experiments demonstrate that SRStitcher outperforms state-of-the-art (SOTA) methods in both quantitative assessments and qualitative evaluations. The code is released at this https URL
https://arxiv.org/abs/2404.14951