Diffusion-based technologies have made significant strides, particularly in personalized and customized facial generation. However, existing methods face challenges in achieving high-fidelity and detailed identity (ID) consistency, primarily due to insufficient fine-grained control over facial areas and the lack of a comprehensive ID-preservation strategy that fully considers both intricate facial details and the overall face. To address these limitations, we introduce ConsistentID, an innovative method crafted for diverse identity-preserving portrait generation under fine-grained multimodal facial prompts, utilizing only a single reference image. ConsistentID comprises two key components: a multimodal facial prompt generator that combines facial features, corresponding facial descriptions, and the overall facial context to enhance precision in facial details, and an ID-preservation network optimized through a facial attention localization strategy, aimed at preserving ID consistency in facial regions. Together, these components significantly enhance the accuracy of ID preservation by introducing fine-grained multimodal ID information from facial regions. To facilitate training of ConsistentID, we present a fine-grained portrait dataset, FGID, with over 500,000 facial images, offering greater diversity and comprehensiveness than existing public facial datasets such as LAION-Face, CelebA, FFHQ, and SFHQ. Experimental results substantiate that ConsistentID achieves exceptional precision and diversity in personalized facial generation, surpassing existing methods on the MyStyle dataset. Furthermore, while ConsistentID introduces more multimodal ID information, it maintains a fast inference speed during generation.
https://arxiv.org/abs/2404.16771
Cancelable biometrics is a challenging research field in which the security of an original biometric image is ensured by transforming it into another, irreversible domain. Several approaches have been suggested in the literature for generating cancelable biometric templates. In this paper, two novel and simple cancelable biometric template generation methods based on Random Walk (CBRW) are proposed. By employing a random walk and the other steps of the two proposed algorithms, CBRW-BitXOR and CBRW-BitCMP, the original biometric is transformed into a cancelable template. The performance of the proposed methods is compared with that of other state-of-the-art methods. Experiments have been performed on eight publicly available gray and color datasets, i.e., CP (ear) (gray and color), UTIRIS (iris) (gray and color), ORL (face) (gray), IIT Delhi (iris) (gray and color), and AR (face) (color). Performance of the generated templates is measured in terms of Correlation Coefficient (Cr), Root Mean Square Error (RMSE), Peak Signal to Noise Ratio (PSNR), Structural Similarity (SSIM), Mean Absolute Error (MAE), Number of Pixel Change Rate (NPCR), and Unified Average Changing Intensity (UACI). Experimental results show that the proposed methods are superior to other state-of-the-art methods in both qualitative and quantitative analysis. Furthermore, CBRW performs better on both gray and color images.
https://arxiv.org/abs/2404.16739
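The random-walk-plus-XOR idea behind CBRW-BitXOR above can be illustrated with a short sketch. This is not the paper's exact algorithm: the walk is simplified to a seeded pseudorandom traversal of pixel indices, and all names are ours.

```python
import numpy as np

def cbrw_bitxor_sketch(image, seed):
    """Toy cancelable-template transform: a seeded pseudorandom walk
    (simplified here to a random permutation of pixel indices) lays down
    a noise-like key image, which is XOR-ed bitwise with the original
    biometric. Re-issuing ("canceling") a template only requires a new seed."""
    rng = np.random.default_rng(seed)
    flat = image.astype(np.uint8).ravel()
    walk = rng.permutation(flat.size)                      # visit order of the "walk"
    key = np.empty_like(flat)
    key[walk] = (np.arange(flat.size) * 97 % 256).astype(np.uint8)
    return (flat ^ key).reshape(image.shape)               # transformed template
```

The same seed reproduces the same template (needed for matching), while a different seed yields an uncorrelated one (needed for revocation).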
In the rapidly evolving field of artificial intelligence, ensuring safe decision-making by Large Language Models (LLMs) is a significant challenge. This paper introduces the Governance of the Commons Simulation (GovSim), a simulation platform designed to study strategic interactions and cooperative decision-making in LLMs. Through this simulation environment, we explore the dynamics of resource sharing among AI agents, highlighting the importance of ethical considerations, strategic planning, and negotiation skills. GovSim is versatile and supports any text-based agent, including LLM agents. Using the Generative Agent framework, we create a standard agent that facilitates the integration of different LLMs. Our findings reveal that within GovSim, only two out of 15 tested LLMs managed to achieve a sustainable outcome, indicating a significant gap in models' ability to manage shared resources. Furthermore, we find that when agents' ability to communicate is removed, they overuse the shared resource, highlighting the importance of communication for cooperation. Interestingly, most LLMs lack the ability to make universalized hypotheses, which highlights a significant weakness in their reasoning skills. We open-source the full suite of our research results, including the simulation environment, agent prompts, and a comprehensive web interface.
https://arxiv.org/abs/2404.16698
While neural implicit representations have gained popularity in multi-view 3D reconstruction, previous work struggles to yield physically plausible results, thereby limiting their applications in physics-demanding domains like embodied AI and robotics. The lack of plausibility originates from both the absence of physics modeling in the existing pipeline and their inability to recover intricate geometrical structures. In this paper, we introduce PhyRecon, which stands as the first approach to harness both differentiable rendering and differentiable physics simulation to learn implicit surface representations. Our framework proposes a novel differentiable particle-based physical simulator seamlessly integrated with the neural implicit representation. At its core is an efficient transformation between SDF-based implicit representation and explicit surface points by our proposed algorithm, Surface Points Marching Cubes (SP-MC), enabling differentiable learning with both rendering and physical losses. Moreover, we model both rendering and physical uncertainty to identify and compensate for the inconsistent and inaccurate monocular geometric priors. The physical uncertainty additionally enables a physics-guided pixel sampling to enhance the learning of slender structures. By amalgamating these techniques, our model facilitates efficient joint modeling with appearance, geometry, and physics. Extensive experiments demonstrate that PhyRecon significantly outperforms all state-of-the-art methods in terms of reconstruction quality. Our reconstruction results also yield superior physical stability, verified by Isaac Gym, with at least a 40% improvement across all datasets, opening broader avenues for future physics-based applications.
https://arxiv.org/abs/2404.16666
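The core conversion in PhyRecon above, from an implicit SDF to explicit surface points, can be illustrated in one dimension. This is only a toy analogue of the paper's SP-MC algorithm: it finds zero crossings of a sampled SDF by linear interpolation along a grid.

```python
import numpy as np

def zero_crossings_1d(xs, sdf):
    """Toy 1-D stand-in for implicit-to-explicit surface extraction:
    scan a sampled signed distance function for sign changes and linearly
    interpolate the zero crossing inside each bracketing interval.
    (A zero landing exactly on the final grid point is not reported.)"""
    pts = []
    for i in range(len(xs) - 1):
        a, b = sdf[i], sdf[i + 1]
        if a == 0:                       # surface passes through a grid node
            pts.append(xs[i])
        elif a * b < 0:                  # sign change: interpolate the zero
            t = a / (a - b)
            pts.append(xs[i] + t * (xs[i + 1] - xs[i]))
    return np.array(pts)
```

In 3D, marching-cubes-style methods apply the same interpolation per cube edge; making the extracted points differentiable with respect to the SDF is what lets rendering and physics losses flow back into the implicit representation.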
Developing autonomous agents for mobile devices can significantly enhance user interactions by offering increased efficiency and accessibility. However, despite the growing interest in mobile device control agents, the absence of a commonly adopted benchmark makes it challenging to quantify scientific progress in this area. In this work, we introduce B-MoCA: a novel benchmark designed specifically for evaluating mobile device control agents. To create a realistic benchmark, we develop B-MoCA based on the Android operating system and define 60 common daily tasks. Importantly, we incorporate a randomization feature that changes various aspects of mobile devices, including user interface layouts and language settings, to assess generalization performance. We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs as well as agents trained from scratch using human expert demonstrations. While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to enhance their effectiveness. Our source code is publicly available at this https URL.
https://arxiv.org/abs/2404.16660
The integration of Large Language Models (LLMs) into healthcare promises to transform medical diagnostics, research, and patient care. Yet, the progression of medical LLMs faces obstacles such as complex training requirements, rigorous evaluation demands, and the dominance of proprietary models that restrict academic exploration. Transparent, comprehensive access to LLM resources is essential for advancing the field, fostering reproducibility, and encouraging innovation in healthcare AI. We present Hippocrates, an open-source LLM framework specifically developed for the medical domain. In stark contrast to previous efforts, it offers unrestricted access to its training datasets, codebase, checkpoints, and evaluation protocols. This open approach is designed to stimulate collaborative research, allowing the community to build upon, refine, and rigorously evaluate medical LLMs within a transparent ecosystem. We also introduce Hippo, a family of 7B models tailored for the medical domain, fine-tuned from Mistral and LLaMA2 through continual pre-training, instruction tuning, and reinforcement learning from human and AI feedback. Our models outperform existing open medical LLMs by a large margin, even surpassing models with 70B parameters. Through Hippocrates, we aspire to unlock the full potential of LLMs not just to advance medical knowledge and patient care but also to democratize the benefits of AI research in healthcare, making them available across the globe.
https://arxiv.org/abs/2404.16621
This paper aims to explore the evolution of image denoising in a pedagogical way. We briefly review classical methods such as Fourier analysis and wavelet bases, highlighting the challenges they faced until the emergence of neural networks, notably the U-Net, in the 2010s. The remarkable performance of these networks has been demonstrated in studies such as Kadkhodaie et al. (2024). They exhibit adaptability to various image types, including those with fixed regularity, facial images, and bedroom scenes, achieving near-optimal results with a bias towards geometry-adaptive harmonic bases. The introduction of score diffusion has played a crucial role in image generation. In this context, denoising becomes essential, as it facilitates the estimation of probability density scores. We discuss the prerequisites for genuine learning of probability densities, offering insights that extend from mathematical research to the implications of universal structures.
https://arxiv.org/abs/2404.16617
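The connection between denoising and score estimation invoked above is classically made precise by Tweedie's (Miyasawa's) identity: for a noisy observation $y = x + \sigma \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, the minimum mean-squared-error denoiser equals

$$\hat{x}(y) \;=\; \mathbb{E}[x \mid y] \;=\; y + \sigma^2 \, \nabla_y \log p_\sigma(y),$$

so a trained denoiser implicitly provides the score $\nabla_y \log p_\sigma(y)$ of the noise-smoothed density, which is exactly the quantity score-diffusion samplers need at every noise level.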
In below-freezing winter conditions, road surface friction can vary greatly based on the mixture of snow, ice, and water on the road. Friction between the road and vehicle tyres is a critical parameter defining vehicle dynamics, and therefore road surface friction information is essential to acquire for several intelligent transportation applications, such as safe control of automated vehicles or alerting drivers of slippery road conditions. This paper explores computer vision-based evaluation of road surface friction from roadside cameras. Previous studies have extensively investigated the application of convolutional neural networks to the task of evaluating road surface condition from images. Here, we propose a hybrid deep learning architecture, WCamNet, consisting of a pretrained visual transformer model and convolutional blocks. The motivation of the architecture is to combine the general visual features provided by the transformer model with the fine-tuned feature extraction properties of the convolutional blocks. To benchmark the approach, an extensive dataset was gathered from the Finnish national road infrastructure's network of roadside cameras and optical road surface friction sensors. The acquired results highlight that the proposed WCamNet outperforms previous approaches in predicting road surface friction from roadside camera images.
https://arxiv.org/abs/2404.16578
In recent years, with the rapid development of computer information technology, the development of artificial intelligence has accelerated. Traditional geometry recognition technology is relatively backward, and its recognition rate is low. Faced with massive information databases, traditional algorithm models inevitably suffer from low recognition accuracy and poor performance. Deep learning theory has gradually become a very important part of machine learning, and the implementation of convolutional neural networks (CNNs) reduces the difficulty of graphics generation algorithms. In this paper, exploiting the LeNet-5 architecture's advantages of weight sharing and joint feature extraction and classification, the proposed geometric pattern recognition algorithm model trains faster on the training dataset. By constructing shared feature parameters for the algorithm model and using a cross-entropy loss function during recognition, the generalization of the model and the average recognition accuracy on the test dataset are improved.
https://arxiv.org/abs/2404.16561
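The cross-entropy loss mentioned above, in the multi-class form typically paired with a LeNet-5-style classifier, can be computed as follows. This is a minimal NumPy sketch independent of any specific network.

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Negative log-likelihood of the true class under a softmax,
    computed in a numerically stable way by shifting the logits."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]
```

For uniform logits over n classes the loss equals log n, its value before any learning; confident, correct predictions drive it towards zero.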
Generative 3D face models featuring disentangled controlling factors hold immense potential for diverse applications in computer vision and computer graphics. However, previous 3D face modeling methods face a challenge, as they demand specific labels to effectively disentangle these factors. This becomes particularly problematic when integrating multiple 3D face datasets to improve the generalization of the model. Addressing this issue, this paper introduces a Weakly-Supervised Disentanglement Framework, denoted as WSDF, to facilitate the training of controllable 3D face models without an overly stringent labeling requirement. Adhering to the paradigm of Variational Autoencoders (VAEs), the proposed model achieves disentanglement of identity and expression controlling factors through a two-branch encoder equipped with a dedicated identity-consistency prior. It then faithfully re-entangles these factors via a tensor-based combination mechanism. Notably, the introduction of the Neutral Bank allows precise acquisition of subject-specific information using only identity labels, thereby averting degeneration due to insufficient supervision. Additionally, the framework incorporates a label-free second-order loss function for the expression factor to regulate deformation space and eliminate extraneous information, resulting in enhanced disentanglement. Extensive experiments have been conducted to substantiate the superior performance of WSDF. Our code is available at this https URL.
https://arxiv.org/abs/2404.16536
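The tensor-based combination mechanism described above can be sketched as a bilinear fusion. This is our illustrative reading, not the paper's exact operator: a learned core tensor maps an identity code and an expression code to one fused representation.

```python
import numpy as np

def reentangle(z_id, z_exp, core):
    """Bilinear fusion of two factor codes through a core tensor:
    out[k] = sum over i, j of z_id[i] * z_exp[j] * core[i, j, k]."""
    return np.einsum('i,j,ijk->k', z_id, z_exp, core)
```

Because the combination is multilinear, holding one code fixed while varying the other changes only the corresponding factor of the output, which is what makes such operators attractive for disentangled control.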
3D object generation has undergone significant advancements, yielding high-quality results. However, current methods fall short of precise user control, often yielding results that do not align with user expectations, thus limiting their applicability. Realizing user-envisioned 3D objects is challenging with current generative models due to their limited interaction capabilities. Existing methods mainly offer two approaches: (i) interpreting textual instructions with constrained controllability, or (ii) reconstructing 3D objects from 2D images. Both limit customization to the confines of the 2D reference and potentially introduce undesirable artifacts during the 3D lifting process, restricting the scope for direct and versatile 3D modifications. In this work, we introduce Interactive3D, an innovative framework for interactive 3D generation that grants users precise control over the generative process through extensive 3D interaction capabilities. Interactive3D is constructed in two cascading stages, utilizing distinct 3D representations. The first stage employs Gaussian Splatting for direct user interaction, allowing modifications and guidance of the generative direction at any intermediate step through (i) Adding and Removing components, (ii) Deformable and Rigid Dragging, (iii) Geometric Transformations, and (iv) Semantic Editing. Subsequently, the Gaussian splats are transformed into InstantNGP. We introduce a novel (v) Interactive Hash Refinement module to further add details and extract the geometry in the second stage. Our experiments demonstrate that Interactive3D markedly improves the controllability and quality of 3D generation. Our project webpage is available at \url{this https URL}.
https://arxiv.org/abs/2404.16510
Document-level Relation Extraction (DocRE) is the task of extracting all semantic relationships from a document. While studies have been conducted on English DocRE, limited attention has been given to DocRE in non-English languages. This work delves into effectively utilizing existing English resources to promote DocRE studies in non-English languages, with Japanese as the representative case. As an initial attempt, we construct a dataset by transferring an English dataset to Japanese. However, models trained on such a dataset suffer from low recall. We investigate the error cases and attribute the failure to differences in surface structure and semantics between documents translated from English and those written by native speakers. We thus switch to exploring whether the transferred dataset can assist human annotation of Japanese documents. In our proposal, annotators edit relation predictions from a model trained on the transferred dataset. Quantitative analysis shows that relation recommendations suggested by the model help reduce approximately 50% of the human edit steps compared with the previous approach. Experiments quantify the performance of existing DocRE models on our collected dataset, portraying the challenges of Japanese and cross-lingual DocRE.
https://arxiv.org/abs/2404.16506
We present a generic algorithm for scoring pose estimation methods that rely on single-image semantic analysis. The algorithm employs a lightweight putative shape representation using a combination of multiple Gaussian Processes. Each Gaussian Process (GP) yields normal distributions over the distances from multiple reference points in the object's coordinate system to its surface, thus providing a geometric evaluation framework for scoring predicted poses. Our confidence measure comprises the average mixture probability of pixel back-projections onto the shape template. In the reported experiments, we compare the accuracy of our GP-based representation of objects against the actual geometric models and demonstrate the ability of our method to capture the influence of outliers, as opposed to the corresponding intrinsic measures that ship with the segmentation and pose estimation methods.
https://arxiv.org/abs/2404.16471
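The scoring idea above, where each reference point carries a normal distribution over distances to the object surface, can be sketched as follows. This is a simplified stand-in for the learned Gaussian-process predictions; the function name and array shapes are ours.

```python
import numpy as np

def pose_score(points, refs, mu, sigma):
    """Score hypothesized surface points: for reference point j, distances
    to the surface are assumed to follow N(mu[j], sigma[j]); each point's
    score is the average likelihood of its distances to all references."""
    # pairwise distances, shape (num_points, num_refs)
    d = np.linalg.norm(points[:, None, :] - refs[None, :, :], axis=2)
    pdf = np.exp(-0.5 * ((d - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return pdf.mean(axis=1)    # "average mixture probability" per point
```

Points whose distances match the learned distributions score high; outliers far from the putative surface score near zero, which is what makes the measure sensitive to gross pose errors.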
While originally developed for novel view synthesis, Neural Radiance Fields (NeRFs) have recently emerged as an alternative to multi-view stereo (MVS). Spurred by a wealth of research activity, promising results have been obtained especially for texture-less, transparent, and reflecting surfaces, while such scenarios remain challenging for traditional MVS-based approaches. However, most of these investigations focus on close-range scenarios, with studies of airborne scenarios still missing. For this task, NeRFs face potential difficulties in areas of low image redundancy and weak data evidence, as often found in street canyons, facades, or building shadows. Furthermore, training such networks is computationally expensive. Thus, the aim of our work is twofold: First, we investigate the applicability of NeRFs to aerial image blocks representing different characteristics such as nadir-only, oblique, and high-resolution imagery. Second, during these investigations we demonstrate the benefit of integrating depth priors from tie-point measures, which are provided by the preceding Bundle Block Adjustment. Our work is based on the state-of-the-art framework VolSDF, which models 3D scenes by signed distance functions (SDFs), since this is more suitable for surface reconstruction than the standard volumetric representation in vanilla NeRFs. For evaluation, the NeRF-based reconstructions are compared to results of a publicly available benchmark dataset for airborne images.
https://arxiv.org/abs/2404.16429
The interactions between tumor cells and the tumor microenvironment (TME) dictate the therapeutic efficacy of radiation and many systemic therapies in breast cancer. However, to date, there is not a widely available method to reproducibly measure tumor and immune phenotypes for each patient's tumor. Given this unmet clinical need, we applied multiple instance learning (MIL) algorithms to assess the activity of ten biologically relevant pathways from the hematoxylin and eosin (H&E) slide of primary breast tumors. We employed different feature extraction approaches and state-of-the-art model architectures. Using binary classification, our models attained area under the receiver operating characteristic (AUROC) scores above 0.70 for nearly all gene expression pathways and in some cases exceeded 0.80. Attention maps suggest that our trained models recognize biologically relevant spatial patterns of cell sub-populations from H&E. These efforts represent a first step towards developing computational H&E biomarkers that reflect facets of the TME and hold promise for augmenting precision oncology.
https://arxiv.org/abs/2404.16397
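The AUROC metric reported above admits a compact rank-based implementation via the Mann-Whitney U identity (this sketch assumes no tied scores):

```python
import numpy as np

def auroc(y_true, scores):
    """Probability that a randomly chosen positive is scored above a
    randomly chosen negative, computed from the rank sum of positives."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # 1-based ranks, ascending
    pos = np.asarray(y_true) == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

A score of 0.5 corresponds to chance-level ranking, so the reported values above 0.70 mean the models order tumor slides by pathway activity far better than random.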
This work introduces SwarmRL, a Python package designed to study intelligent active particles. SwarmRL provides an easy-to-use interface for developing models to control microscopic colloids using classical control and deep reinforcement learning approaches. These models may be deployed in simulations or real-world environments under a common framework. We explain the structure of the software and its key features and demonstrate how it can be used to accelerate research. With SwarmRL, we aim to streamline research into micro-robotic control while bridging the gap between experimental and simulation-driven sciences. SwarmRL is available open-source on GitHub at this https URL.
https://arxiv.org/abs/2404.16388
Despite their remarkable successes, state-of-the-art language models face challenges in grasping certain important semantic details. This paper introduces the VISLA (Variance and Invariance to Semantic and Lexical Alterations) benchmark, designed to evaluate the semantic and lexical understanding of language models. VISLA presents a 3-way semantic (in)equivalence task with a triplet of sentences associated with an image, to evaluate both vision-language models (VLMs) and unimodal language models (ULMs). An evaluation involving 34 VLMs and 20 ULMs reveals surprising difficulties in distinguishing between lexical and semantic variations. Spatial semantics encoded by language models also appear to be highly sensitive to lexical information. Notably, text encoders of VLMs demonstrate greater sensitivity to semantic and lexical variations than unimodal text encoders. Our contributions include the unification of image-to-text and text-to-text retrieval tasks, an off-the-shelf evaluation without fine-tuning, and assessing LMs' semantic (in)variance in the presence of lexical alterations. The results highlight strengths and weaknesses across diverse vision and unimodal language models, contributing to a deeper understanding of their capabilities. VISLA enables a rigorous evaluation, shedding light on language models' capabilities in handling semantic and lexical nuances. Data and code will be made available at this https URL.
https://arxiv.org/abs/2404.16365
Existing works have made strides in video generation, but the lack of sound effects (SFX) and background music (BGM) hinders a complete and immersive viewer experience. We introduce a novel semantically consistent video-to-audio generation framework, namely SVA, which automatically generates audio semantically consistent with the given video content. The framework harnesses the power of a multimodal large language model (MLLM) to understand video semantics from a key frame and generate creative audio schemes, which are then utilized as prompts for text-to-audio models, resulting in video-to-audio generation with natural language as an interface. We show the satisfactory performance of SVA through a case study and discuss the limitations along with the future research direction. The project page is available at this https URL.
https://arxiv.org/abs/2404.16305
Fuzzing, a widely used technique for bug detection, has seen advancements through Large Language Models (LLMs). Despite their potential, LLMs face specific challenges in fuzzing. In this paper, we identified five major challenges of LLM-assisted fuzzing. To support our findings, we revisited the most recent papers from top-tier conferences, confirming that these challenges are widespread. As a remedy, we propose actionable recommendations to help improve the application of LLMs in fuzzing and conduct preliminary evaluations on DBMS fuzzing. The results demonstrate that our recommendations effectively address the identified challenges.
https://arxiv.org/abs/2404.16297
With the development and widespread application of digital image processing technology, image splicing has become a common method of image manipulation, raising numerous security and legal issues. This paper introduces a new splicing image detection algorithm based on the statistical characteristics of natural images, aimed at improving the accuracy and efficiency of splicing image detection. By analyzing the limitations of traditional methods, we have developed a detection framework that integrates advanced statistical analysis techniques and machine learning methods. The algorithm has been validated using multiple public datasets, showing high accuracy in detecting spliced edges and locating tampered areas, as well as good robustness. Additionally, we explore the potential applications and challenges faced by the algorithm in real-world scenarios. This research not only provides an effective technological means for the field of image tampering detection but also offers new ideas and methods for future related research.
https://arxiv.org/abs/2404.16296