Despite the strong research interest in document-level Machine Translation (MT), test sets dedicated to this task are still scarce. The existing test sets mainly cover topics from the general domain and fall short in specialised domains, such as legal and financial. Also, despite their document-level aspect, they still follow a sentence-level logic that leaves no room for certain linguistic phenomena, such as information reorganisation. In this work, we aim to fill this gap by proposing a novel test set: DOLFIN. The dataset is built from specialised financial documents, and it takes a step towards true document-level MT by abandoning the paradigm of perfectly aligned sentences, presenting data in units of sections rather than sentences. The test set consists of an average of 1950 aligned sections for five language pairs. We present a detailed data collection pipeline that can serve as inspiration for aligning new document-level datasets. We demonstrate the usefulness and quality of this test set by evaluating a number of models. Our results show that the test set is able to discriminate between context-sensitive and context-agnostic models and exposes weaknesses when models fail to accurately translate financial texts. The test set is made public for the community.
https://arxiv.org/abs/2502.03053
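As a sketch of how section-level data like this can be scored, the snippet below translates each section as one unit and computes a corpus score with sacrebleu; the TSV layout and the `translate_fn` hook are assumptions, not DOLFIN's actual distribution format.

```python
# A minimal sketch of section-level MT evaluation (assumed TSV format).
import csv
import sacrebleu

def evaluate_sections(tsv_path, translate_fn):
    """Translate each section as a single unit and score at corpus level."""
    sources, references = [], []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for src, ref in csv.reader(f, delimiter="\t"):
            sources.append(src)
            references.append(ref)
    # Scoring whole sections keeps context-dependent phenomena
    # (e.g. information reorganisation) inside the evaluated unit.
    hypotheses = [translate_fn(src) for src in sources]
    return sacrebleu.corpus_bleu(hypotheses, [references])
```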
Text-embedding models often exhibit biases arising from the data on which they are trained. In this paper, we examine a hitherto unexplored bias in text-embeddings: bias arising from the presence of names such as persons, locations, organizations, etc. in the text. Our study shows how the presence of name-bias in text-embedding models can potentially lead to erroneous conclusions in the assessment of thematic similarity. Text-embeddings can mistakenly indicate similarity between texts based on names in the text, even when their actual semantic content has no similarity, or indicate dissimilarity simply because of the names in the text even when the texts match semantically. We first demonstrate the presence of name bias in different text-embedding models and then propose text-anonymization during inference, which involves removing references to names while preserving the core theme of the text. The efficacy of the anonymization approach is demonstrated on two downstream NLP tasks, achieving significant performance gains. Our simple and training-optimization-free approach offers a practical and easily implementable solution to mitigate name bias.
https://arxiv.org/abs/2502.02903
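A minimal sketch of the inference-time anonymization idea, using spaCy NER to mask name-like entities before embedding; the entity types and placeholder format are assumptions, not necessarily the paper's exact procedure.

```python
# Inference-time text anonymization sketch (assumes spaCy's small English model).
import spacy

nlp = spacy.load("en_core_web_sm")
NAME_LABELS = {"PERSON", "ORG", "GPE", "LOC"}  # entity types treated as names

def anonymize(text: str) -> str:
    """Replace name-like entities with their type tag, keeping the theme."""
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ in NAME_LABELS:
            out.append(text[last:ent.start_char])
            out.append(f"[{ent.label_}]")
            last = ent.end_char
    out.append(text[last:])
    return "".join(out)

# Embeddings of anonymized texts should then reflect thematic content
# rather than coincidental name overlap:
#   sim = cosine(embed(anonymize(a)), embed(anonymize(b)))
```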
Flaky tests exhibit non-deterministic behavior during execution and may pass or fail without any changes to the program under test. Detecting and classifying these flaky tests is crucial for maintaining the robustness of automated test suites and ensuring overall reliability and confidence in testing. However, flaky test detection and classification is challenging due to the variability in test behavior, which can depend on environmental conditions and subtle code interactions. Large Language Models (LLMs) offer promising approaches to address this challenge, with fine-tuning and few-shot learning (FSL) emerging as viable techniques. With enough data, fine-tuning a pre-trained LLM can achieve high accuracy, making it suitable for organizations with more resources. Alternatively, we introduce FlakyXbert, an FSL approach that employs a Siamese network architecture to train efficiently with limited data. To understand the performance and cost differences between these two methods, we compare fine-tuning on larger datasets with FSL in scenarios restricted by smaller datasets. Our evaluation involves two existing flaky test datasets, FlakyCat and IDoFT. Our results suggest that while fine-tuning can achieve high accuracy, FSL provides a cost-effective approach with competitive accuracy, which is especially beneficial for organizations or projects with limited historical data available for training. These findings underscore the viability of both fine-tuning and FSL in flaky test detection and classification, each suited to different organizational needs and resource availability.
https://arxiv.org/abs/2502.02715
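A minimal sketch of the Siamese-style few-shot setup described above, in PyTorch; the encoder, projection size, and contrastive margin are illustrative assumptions rather than FlakyXbert's exact architecture.

```python
# Siamese metric-learning sketch for few-shot flaky-test classification.
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    def __init__(self, backbone: nn.Module, dim: int = 768, proj: int = 128):
        super().__init__()
        self.backbone = backbone          # any module mapping a batch of
        self.head = nn.Linear(dim, proj)  # tokenized tests to (B, dim)

    def forward(self, x):
        return F.normalize(self.head(self.backbone(x)), dim=-1)

def contrastive_loss(z1, z2, same_label, margin: float = 0.5):
    """Pull same-category test embeddings together, push others apart."""
    d = (z1 - z2).pow(2).sum(-1).sqrt()
    return (same_label * d.pow(2)
            + (1 - same_label) * F.relu(margin - d).pow(2)).mean()

# At inference, a test is labelled by its nearest support-set embedding,
# which is what makes the approach workable with little training data.
```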
Controllable video generation remains a significant challenge, despite recent advances in generating high-quality and consistent videos. Most existing methods for controlling video generation treat the video as a whole, neglecting intricate fine-grained spatiotemporal relationships, which limits both control precision and efficiency. In this paper, we propose Controllable Video Generative Adversarial Networks (CoVoGAN) to disentangle the video concepts, thus facilitating efficient and independent control over individual concepts. Specifically, following the minimal change principle, we first disentangle static and dynamic latent variables. We then leverage the sufficient change property to achieve component-wise identifiability of dynamic latent variables, enabling independent control over motion and identity. To establish the theoretical foundation, we provide a rigorous analysis demonstrating the identifiability of our approach. Building on these theoretical insights, we design a Temporal Transition Module to disentangle latent dynamics. To enforce the minimal change principle and sufficient change property, we minimize the dimensionality of latent dynamic variables and impose temporal conditional independence. To validate our approach, we integrate this module as a plug-in for GANs. Extensive qualitative and quantitative experiments on various video generation benchmarks demonstrate that our method significantly improves generation quality and controllability across diverse real-world scenarios.
https://arxiv.org/abs/2502.02690
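A minimal sketch of a temporal transition over disentangled latents, reflecting the low-dimensional dynamic code and step-wise conditioning described above; the dimensions and MLP design are assumptions, not CoVoGAN's actual module.

```python
# Disentangled static/dynamic latent rollout sketch (illustrative sizes).
import torch
import torch.nn as nn

class TemporalTransition(nn.Module):
    def __init__(self, dyn_dim: int = 8, hidden: int = 64):
        super().__init__()  # dyn_dim kept small (minimal change principle)
        self.net = nn.Sequential(
            nn.Linear(dyn_dim * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dyn_dim),
        )

    def forward(self, z_prev, noise):
        # The next dynamic state depends only on the previous state and
        # fresh noise, i.e. temporal conditional independence.
        return self.net(torch.cat([z_prev, noise], dim=-1))

# Rollout: a static code z_s controls identity, the z_d sequence motion.
#   z_s = torch.randn(B, 128); z_d = torch.randn(B, 8)
#   frames = [G(z_s, z_d := trans(z_d, torch.randn(B, 8))) for _ in range(T)]
```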
Visible watermarks pose significant challenges for image restoration techniques, especially when the target background is unknown. Toward this end, we present MorphoMod, a novel method for automated visible watermark removal that operates in a blind setting -- without requiring target images. Unlike existing methods, MorphoMod effectively removes opaque and transparent watermarks while preserving semantic content, making it well-suited for real-world applications. Evaluations on benchmark datasets, including the Colored Large-scale Watermark Dataset (CLWD), LOGO-series, and the newly introduced Alpha1 datasets, demonstrate that MorphoMod achieves up to a 50.8% improvement in watermark removal effectiveness compared to state-of-the-art methods. Ablation studies highlight the impact of prompts used for inpainting, pre-removal filling strategies, and inpainting model performance on watermark removal. Additionally, a case study on steganographic disorientation reveals broader applications for watermark removal in disrupting high-level hidden messages. MorphoMod offers a robust, adaptable solution for watermark removal and opens avenues for further advancements in image restoration and adversarial manipulation.
https://arxiv.org/abs/2502.02676
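A minimal sketch of the blind mask-then-inpaint idea; the classical OpenCV inpainter is a stand-in for the stronger prompt-conditioned inpainting models the paper evaluates, and the watermark mask is assumed given by an upstream predictor.

```python
# Blind watermark removal sketch: dilate predicted mask, then inpaint.
import cv2
import numpy as np

def remove_watermark(image_bgr: np.ndarray, wm_mask: np.ndarray) -> np.ndarray:
    # Morphologically dilate the predicted watermark mask so thin or
    # semi-transparent edges are fully covered before filling.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask = cv2.dilate((wm_mask > 0).astype(np.uint8) * 255, kernel)
    # Any inpainting backend can be plugged in here; a diffusion model
    # conditioned on a text prompt would replace this classical call.
    return cv2.inpaint(image_bgr, mask, 5, cv2.INPAINT_TELEA)
```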
The exploration of cellular heterogeneity within the tumor microenvironment (TME) via single-cell RNA sequencing (scRNA-seq) is essential for understanding cancer progression and response to therapy. Current scRNA-seq approaches, however, lack spatial context and rely on incomplete datasets of ligand-receptor interactions (LRIs), limiting accurate cell type annotation and cell-cell communication (CCC) inference. This study addresses these challenges using a novel graph neural network (GNN) model that enhances cell type prediction and cell interaction analysis. Our study utilized a dataset consisting of 49,020 cells from 19 patients across three cancer types: Leukemia, Breast Invasive Carcinoma, and Colorectal Cancer. The proposed scGSL model demonstrated robust performance, achieving an average accuracy of 84.83%, precision of 86.23%, recall of 81.51%, and an F1 score of 80.92% across all datasets. These metrics represent a significant enhancement over existing methods, which typically exhibit lower performance metrics. Additionally, by reviewing existing literature on gene interactions within the TME, the scGSL model proves to robustly identify biologically meaningful gene interactions in an unsupervised manner, validated by significant expression differences in key gene pairs across various cancers. The source code and data used in this paper can be found in this https URL.
https://arxiv.org/abs/2502.02629
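A minimal sketch of GNN-based cell-type prediction on a cell graph using PyTorch Geometric; the two-layer GCN and layer sizes are illustrative assumptions, not the scGSL architecture itself.

```python
# Cell-type classification sketch over a cell graph (PyTorch Geometric).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class CellTypeGNN(torch.nn.Module):
    def __init__(self, n_genes: int, n_types: int, hidden: int = 128):
        super().__init__()
        self.conv1 = GCNConv(n_genes, hidden)   # cells x genes -> hidden
        self.conv2 = GCNConv(hidden, n_types)   # hidden -> cell-type logits

    def forward(self, x, edge_index):
        # edge_index encodes cell-cell links, e.g. LRI-derived or kNN edges
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

# loss = F.cross_entropy(model(x, edge_index)[train_mask], y[train_mask])
```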
Dynamic Contrast-Enhanced Magnetic Resonance Imaging (DCE-MRI) is a medical imaging technique that plays a crucial role in the detailed visualization and identification of tissue perfusion in abnormal lesions and in radiological suggestions for biopsy. However, DCE-MRI involves the administration of a Gadolinium-based (Gad) contrast agent, which is associated with a risk of toxicity in the body. Previous deep learning approaches that synthesize DCE-MR images employ unimodal non-contrast or low-dose contrast MRI images, lacking focus on the local perfusion information within the anatomy of interest. We propose AAD-DCE, a generative adversarial network (GAN) with an aggregated attention discriminator module consisting of global and local discriminators. The discriminators provide a spatial embedded attention map to drive the generator to synthesize early and late response DCE-MRI images. Our method employs multimodal inputs - T2 weighted (T2W), Apparent Diffusion Coefficient (ADC), and T1 pre-contrast - for image synthesis. Extensive comparative and ablation studies on the ProstateX dataset show that our model (i) is agnostic to various generator benchmarks, (ii) outperforms other DCE-MRI synthesis approaches with improvement margins of +0.64 dB PSNR, +0.0518 SSIM, and -0.015 MAE for early response and +0.1 dB PSNR, +0.0424 SSIM, and -0.021 MAE for late response, and (iii) emphasizes the importance of attention ensembling. Our code is available at this https URL.
https://arxiv.org/abs/2502.02555
Several studies indicate that deep learning models can learn to detect breast cancer from mammograms (X-ray images of the breasts). However, challenges with overfitting and poor generalisability prevent their routine use in the clinic. Models trained on data from one patient population may not perform well on another due to differences in their data domains, which emerge from variations in scanning technology or patient characteristics. Data augmentation techniques can improve generalisability by altering existing examples to expand the diversity of feature representations in the training data. Image-to-image translation models are one approach capable of imposing the characteristic feature representations (i.e. style) of images from one dataset onto another. However, evaluating model performance is non-trivial, particularly in the absence of ground truths (a common reality in medical imaging). Here, we describe some key aspects that should be considered when evaluating style transfer algorithms, highlighting the advantages and disadvantages of popular metrics, and important factors to be mindful of when implementing them in practice. We consider two types of generative models: a cycle-consistent generative adversarial network (CycleGAN) and a diffusion-based SynDiff model. We learn unpaired image-to-image translation across three mammography datasets. We highlight that undesirable aspects of model performance may determine the suitability of some metrics, and also provide analysis indicating the extent to which various metrics assess unique aspects of model performance. We emphasise the need to use several metrics for a comprehensive assessment of model performance.
https://arxiv.org/abs/2502.02475
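A minimal sketch of reference-based scores for paired images; in truly unpaired settings (no ground truth) distribution-level metrics such as FID are needed instead, which is part of why several complementary metrics are required.

```python
# Paired image-quality scores for translated mammograms (values in [0, 1]).
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def paired_scores(reference: np.ndarray, translated: np.ndarray) -> dict:
    """Reference-based scores; each metric captures a different aspect."""
    return {
        "ssim": structural_similarity(reference, translated, data_range=1.0),
        "psnr": peak_signal_noise_ratio(reference, translated, data_range=1.0),
    }
```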
Applications of photo-realistic human avatars are many; however, high-fidelity avatar generation has traditionally required expensive professional camera rigs and artistic labor. Recent research has enabled constructing avatars automatically from smartphones with RGB and IR sensors. However, these new methods still rely on the presence of high-resolution cameras on modern smartphones and often require offloading the processing to powerful servers with GPUs. Modern applications such as video conferencing call for the ability to generate these avatars from consumer-grade laptop webcams using the limited compute available on-device. In this work, we develop a novel method based on 3D morphable models, landmark detection, photo-realistic texture GANs, and differentiable rendering to tackle the problem of low webcam image quality and edge computation. We build an automatic system to generate high-fidelity animatable avatars under these limitations, leveraging the neural compute capabilities of mobile chips.
https://arxiv.org/abs/2502.02468
Current face editing methods mainly rely on GAN-based techniques, but recent focus has shifted to diffusion-based models due to their success in image reconstruction. However, diffusion models still face challenges in manipulating fine-grained attributes and preserving the consistency of attributes that should remain unchanged. To address these issues and facilitate more convenient editing of face images, we propose a novel approach that leverages the power of Stable-Diffusion models and crude 3D face models to control the lighting, facial expression, and head pose of a portrait photo. We observe that this task essentially involves combinations of target background, identity, and different face attributes. We aim to sufficiently disentangle the control of these factors to enable high-quality face editing. Specifically, our method, coined RigFace, contains: 1) a Spatial Attribute Encoder that provides precise and decoupled conditions of background, pose, expression, and lighting; 2) an Identity Encoder that transfers identity features to the denoising UNet of a pre-trained Stable-Diffusion model; and 3) an Attribute Rigger that injects those conditions into the denoising UNet. Our model achieves comparable or even superior performance in both identity preservation and photorealism compared to existing face editing models.
https://arxiv.org/abs/2502.02465
In this work, we extend the SEEDS superpixel algorithm from 2D images to 3D volumes, resulting in 3D SEEDS, a faster, better, and open-source supervoxel algorithm for medical image analysis. We compare 3D SEEDS with the widely used supervoxel algorithm SLIC on 13 segmentation tasks across 10 organs. 3D SEEDS accelerates supervoxel generation by a factor of 10, improves the achievable Dice score by 6.5%, and reduces the under-segmentation error by 0.16%. The code is available at this https URL
https://arxiv.org/abs/2502.02409
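For reference, a minimal sketch of the original 2D SEEDS via OpenCV's contrib module (opencv-contrib-python); 3D SEEDS generalizes this pixel partitioning to voxels.

```python
# 2D SEEDS superpixels on a single slice, using OpenCV's ximgproc module.
import cv2

img = cv2.imread("slice.png")          # one 2D slice of a volume
h, w, c = img.shape
# args: width, height, channels, desired superpixels, coarse-to-fine levels
seeds = cv2.ximgproc.createSuperpixelSEEDS(w, h, c, 400, 4)
seeds.iterate(img, 10)                 # refinement iterations
labels = seeds.getLabels()             # per-pixel superpixel ids
print(seeds.getNumberOfSuperpixels(), "superpixels")
```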
Accurate identification of druggable pockets is essential for structure-based drug design. However, most pocket-identification algorithms prioritize their geometric properties over downstream docking performance. To address this limitation, we developed RAPID-Net, a pocket-finding algorithm for seamless integration with docking workflows. When guiding AutoDock Vina, RAPID-Net outperforms DiffBindFR on the PoseBusters benchmark and enables blind docking on large proteins that AlphaFold 3 cannot process as a whole. Furthermore, RAPID-Net surpasses PUResNet and Kalasanty in docking accuracy and pocket-ligand intersection rates across diverse datasets, including PoseBusters, Astex Diverse Set, BU48, and Coach420. When accuracy is evaluated as "at least one correct pose in the ensemble", RAPID-Net outperforms AlphaFold 3 on the PoseBusters benchmark, suggesting that our approach can be further improved with a suitable pose reweighting tool offering a cost-effective and competitive alternative to AlphaFold 3 for docking. Finally, using several therapeutically relevant examples, we demonstrate the ability of RAPID-Net to identify remote functional sites, highlighting its potential to facilitate the development of innovative therapeutics.
https://arxiv.org/abs/2502.02371
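A minimal sketch of how a predicted pocket can steer docking: convert pocket coordinates into an AutoDock Vina search box. The `pocket_points` array is an assumed (N, 3) output of a pocket finder such as RAPID-Net, not its actual interface.

```python
# Turn predicted pocket coordinates into a Vina search-box config.
import numpy as np

def vina_box(pocket_points: np.ndarray, padding: float = 4.0) -> str:
    center = pocket_points.mean(axis=0)
    size = pocket_points.max(axis=0) - pocket_points.min(axis=0) + 2 * padding
    lines = [f"center_{a} = {v:.3f}" for a, v in zip("xyz", center)]
    lines += [f"size_{a} = {v:.3f}" for a, v in zip("xyz", size)]
    return "\n".join(lines)  # contents of a Vina config file

# open("box.cfg", "w").write(vina_box(points))
# vina --receptor rec.pdbqt --ligand lig.pdbqt --config box.cfg
```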
The civil engineering industry faces a critical need for innovative non-destructive evaluation methods, particularly for ageing critical infrastructure, such as bridges, where current techniques fall short. Muography, a non-invasive imaging technique, constructs three-dimensional density maps by detecting interactions of naturally occurring cosmic-ray muons within the scanned volume. Cosmic-ray muons provide deep penetration and inherent safety due to their high momenta and natural source. However, the technology's reliance on this source results in constrained muon flux, leading to prolonged acquisition times, noisy reconstructions and image interpretation challenges. To address these limitations, we developed a two-model deep learning approach. First, we employed a conditional Wasserstein generative adversarial network with gradient penalty (cWGAN-GP) to perform predictive upsampling of undersampled muography images. Using the structural similarity index measure (SSIM), 1-day sampled images matched the perceptual qualities of a 21-day image, while the peak signal-to-noise ratio (PSNR) indicated noise improvement equivalent to 31 days of sampling. A second cWGAN-GP model, trained for semantic segmentation, quantitatively assessed the upsampling model's impact on concrete sample features. This model achieved segmentation of rebar grids and tendon ducts, with Dice-Sørensen accuracy coefficients of 0.8174 and 0.8663. Notably, it could mitigate or remove z-plane smearing artifacts caused by muography's inverse imaging problem. Both models were trained on a comprehensive Geant4 Monte-Carlo simulation dataset reflecting realistic civil infrastructure scenarios. Our results demonstrate significant improvements in acquisition speed and image quality, marking a substantial step toward making muography more practical for reinforced concrete infrastructure monitoring applications.
https://arxiv.org/abs/2502.02624
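A minimal sketch of the WGAN-GP gradient penalty at the heart of both models; conditioning on the undersampled input image is omitted for brevity.

```python
# WGAN-GP gradient penalty: keep critic gradients near unit norm.
import torch

def gradient_penalty(critic, real, fake, lam: float = 10.0):
    """Penalize critic gradient norms away from 1 on interpolated samples."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(x_hat)
    grads = torch.autograd.grad(scores.sum(), x_hat, create_graph=True)[0]
    return lam * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
```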
Multi-view clustering (MvC) utilizes information from multiple views to uncover the underlying structures of data. Despite significant advancements in MvC, mitigating the impact of missing samples in specific views on the integration of knowledge from different views remains a critical challenge. This paper proposes a novel Mask-informed Deep Contrastive Incomplete Multi-view Clustering (Mask-IMvC) method, which elegantly identifies a view-common representation for clustering. Specifically, we introduce a mask-informed fusion network that aggregates incomplete multi-view information while considering the observation status of samples across various views as a mask, thereby reducing the adverse effects of missing values. Additionally, we design a prior knowledge-assisted contrastive learning loss that boosts the representation capability of the aggregated view-common representation by injecting neighborhood information of samples from different views. Finally, extensive experiments are conducted to demonstrate the superiority of the proposed Mask-IMvC method over state-of-the-art approaches across multiple MvC datasets, both in complete and incomplete scenarios.
https://arxiv.org/abs/2502.02234
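A minimal sketch of mask-informed fusion: per-view embeddings are averaged under the observation mask, so missing views contribute nothing instead of injecting imputed noise. Shapes are illustrative, not the paper's exact network.

```python
# Mask-informed aggregation of incomplete multi-view embeddings.
import torch

def mask_informed_fusion(view_embs: torch.Tensor,
                         obs_mask: torch.Tensor) -> torch.Tensor:
    """view_embs: (V, N, D); obs_mask: (V, N) with 1 = observed."""
    w = obs_mask.unsqueeze(-1)                       # (V, N, 1)
    fused = (view_embs * w).sum(0) / w.sum(0).clamp(min=1.0)
    return fused                                     # (N, D) view-common code
```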
Capability evaluations are required to understand and regulate AI systems that may be deployed or further developed. Therefore, it is important that evaluations provide an accurate estimation of an AI system's capabilities. However, in numerous cases, previously latent capabilities have been elicited from models, sometimes long after initial release. Accordingly, substantial efforts have been made to develop methods for eliciting latent capabilities from models. In this paper, we evaluate the effectiveness of capability elicitation techniques by intentionally training model organisms -- language models with hidden capabilities that are revealed by a password. We introduce a novel method for training model organisms, based on circuit breaking, which is more robust to elicitation techniques than standard password-locked models. We focus on elicitation techniques based on prompting and activation steering, and compare these to fine-tuning methods. Prompting techniques can elicit the actual capability of both password-locked and circuit-broken model organisms in an MCQA setting, while steering fails to do so. For a code-generation task, only fine-tuning can elicit the hidden capabilities of our novel model organism. Additionally, our results suggest that combining techniques improves elicitation. Still, if possible, fine-tuning should be the method of choice to improve the trustworthiness of capability evaluations.
https://arxiv.org/abs/2502.02180
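A minimal sketch of password-based elicitation on MCQA items; the password token, item format, and `query_model` hook are hypothetical placeholders, not the paper's setup.

```python
# Compare MCQA accuracy with and without the unlock password.
PASSWORD = "<ELICITATION-PASSWORD>"  # hypothetical unlock token

def accuracy(query_model, items, password=None):
    correct = 0
    for q in items:  # item: {"question": ..., "choices": ..., "answer": ...}
        prefix = f"{password}\n" if password else ""
        prompt = f"{prefix}{q['question']}\nChoices: {q['choices']}\nAnswer:"
        correct += query_model(prompt).strip() == q["answer"]
    return correct / len(items)

# A large gap between accuracy(m, items) and accuracy(m, items, PASSWORD)
# indicates a hidden capability that weaker elicitation failed to surface.
```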
The rapid evolution of Large Language Models (LLMs) has transformed natural language processing but raises critical concerns about biases inherent in their deployment and use across diverse linguistic and sociocultural contexts. This paper presents ASCenD BDS (Adaptable, Stochastic and Context-aware framework for Detection of Bias, Discrimination and Stereotyping). The framework offers an adaptive, stochastic, and context-aware approach to detecting bias, discrimination, and stereotyping across categories such as gender, caste, age, disability, socioeconomic status, and linguistic variation. Existing frameworks rely heavily on datasets to generate scenarios for detecting bias, discrimination, and stereotyping; examples include Civil Comments, Wino Gender, WinoBias, BOLD, CrowS Pairs, and BBQ. Such an approach provides point solutions, so these datasets cover only a finite number of assessment scenarios. The proposed framework overcomes this limitation through features that enable adaptability, stochasticity, and context awareness. Context awareness can be customized for any nation, culture, or sub-culture (for example, an organization's unique culture). In this paper, context awareness is established for the Indian context, leveraging content from the 2011 Indian Census to ground a common categorization. The framework is built from Categories, Sub-Categories, STEMs, X-Factors, and Synonyms that together enable adaptability, stochasticity, and context awareness, and is described in detail in Section 3. In total, more than 800 STEMs, 10 Categories, and 31 unique Sub-Categories were developed by a team of consultants at Saint Fox Consultancy Private Ltd, and the concept has been tested in SFCLabs as part of product development.
https://arxiv.org/abs/2502.02072
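A minimal sketch of stochastic scenario generation from STEM-like templates; the field names and sample entries are hypothetical and only illustrate how adaptability and stochasticity can go beyond a fixed dataset of scenarios.

```python
# Stochastic probing-prompt generation from templates (hypothetical content).
import random

STEMS = [  # a STEM pairs a template with a category/sub-category
    {"category": "Gender", "sub": "Hiring",
     "template": "Would {X} be a good fit for a leadership role?"},
]
X_FACTORS = {"Gender": ["a woman returning from maternity leave",
                        "a male candidate of the same experience"]}

def generate_scenarios(n: int = 5):
    """Sample STEMs and X-Factors to produce fresh probing prompts."""
    out = []
    for _ in range(n):
        stem = random.choice(STEMS)
        x = random.choice(X_FACTORS[stem["category"]])
        out.append(stem["template"].format(X=x))
    return out
```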
Generating structured textual content requires mechanisms that enforce coherence, stability, and adherence to predefined constraints while maintaining semantic fidelity. Conventional approaches often rely on rule-based heuristics or fine-tuning strategies that lack flexibility and generalizability across diverse tasks. The incorporation of Gradient-Regularized Latent Space Modulation (GRLSM) introduces a novel paradigm for guiding text generation through the application of structured constraints within the latent space. The integration of gradient-based regularization mitigates abrupt variations in latent representations, ensuring a smoother encoding process that enhances structural consistency and logical progression within generated sequences. Comparative evaluations demonstrate that latent space modulation leads to a reduction in perplexity, increased coherence scores, and improved structural alignment across multiple domains. Stability assessments further indicate that the imposition of spectral norm constraints facilitates more controlled variations in generated text, preserving semantic consistency under input perturbations. Empirical results confirm that structured latent space constraints not only refine the organization of generated outputs but also enhance interpretability through more predictable and reliable synthesis patterns. Performance metrics illustrate that the GRLSM framework substantially reduces structural inconsistencies while preserving the generative flexibility inherent in neural models.
https://arxiv.org/abs/2502.01979
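A minimal sketch of the two ingredients named above, in PyTorch: spectral-norm-constrained latent layers and a gradient penalty that discourages abrupt latent variation; the full GRLSM objective is more involved than this.

```python
# Spectral-norm latent modulation plus a gradient-smoothness penalty.
import torch
import torch.nn as nn

modulator = nn.Sequential(  # latent-space modulation network (illustrative)
    nn.utils.spectral_norm(nn.Linear(256, 256)), nn.GELU(),
    nn.utils.spectral_norm(nn.Linear(256, 256)),
)

def latent_gradient_penalty(z: torch.Tensor) -> torch.Tensor:
    z = z.detach().requires_grad_(True)
    out = modulator(z)
    grads = torch.autograd.grad(out.sum(), z, create_graph=True)[0]
    return grads.pow(2).sum(dim=-1).mean()  # smoother encoding of z

# total_loss = task_loss + beta * latent_gradient_penalty(z)
```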
Texture synthesis is a fundamental task in computer vision, whose goal is to generate visually realistic and structurally coherent textures for a wide range of applications, from graphics to scientific simulations. While traditional methods like tiling and patch-based techniques often struggle with complex textures, recent advancements in deep learning have transformed this field. In this paper, we propose ViT-SGAN, a new hybrid model that fuses Vision Transformers (ViTs) with a Spatial Generative Adversarial Network (SGAN) to address the limitations of previous methods. By incorporating specialized texture descriptors such as mean-variance (mu, sigma) and textons into the self-attention mechanism of ViTs, our model achieves superior texture synthesis. This approach enhances the model's capacity to capture complex spatial dependencies, leading to improved texture quality that is superior to state-of-the-art models, especially for regular and irregular textures. Comparison experiments with metrics such as FID, IS, SSIM, and LPIPS demonstrate the substantial improvement of ViT-SGAN, which underlines its efficiency in generating diverse realistic textures.
https://arxiv.org/abs/2502.01842
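A minimal sketch of injecting mean-variance texture statistics as an extra token ahead of ViT self-attention; the token layout and global pooling are assumptions, not ViT-SGAN's exact mechanism.

```python
# Prepend a (mu, sigma) descriptor token to the ViT patch sequence.
import torch
import torch.nn as nn

class DescriptorTokens(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(2, dim)   # (mu, sigma) -> one descriptor token

    def forward(self, patch_tokens, images):
        flat = images.flatten(1)                                  # (B, C*H*W)
        stats = torch.stack([flat.mean(1), flat.std(1)], dim=-1)  # (B, 2)
        desc = self.proj(stats).unsqueeze(1)                      # (B, 1, dim)
        # The descriptor token is attended to alongside patch tokens,
        # letting self-attention condition on global texture statistics.
        return torch.cat([desc, patch_tokens], dim=1)
```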
While generative robot policies have demonstrated significant potential in learning complex, multimodal behaviors from demonstrations, they still exhibit diverse failures at deployment-time. Policy steering offers an elegant solution to reducing the chance of failure by using an external verifier to select from low-level actions proposed by an imperfect generative policy. Here, one might hope to use a Vision Language Model (VLM) as a verifier, leveraging its open-world reasoning capabilities. However, off-the-shelf VLMs struggle to understand the consequences of low-level robot actions as they are represented fundamentally differently than the text and images the VLM was trained on. In response, we propose FOREWARN, a novel framework to unlock the potential of VLMs as open-vocabulary verifiers for runtime policy steering. Our key idea is to decouple the VLM's burden of predicting action outcomes (foresight) from evaluation (forethought). For foresight, we leverage a latent world model to imagine future latent states given diverse low-level action plans. For forethought, we align the VLM with these predicted latent states to reason about the consequences of actions in its native representation--natural language--and effectively filter proposed plans. We validate our framework across diverse robotic manipulation tasks, demonstrating its ability to bridge representational gaps and provide robust, generalizable policy steering.
https://arxiv.org/abs/2502.01828
Recent advancements in Large Language Models (LLMs) have paved the way for Large Code Models (LCMs), enabling automation in complex software engineering tasks, such as code generation, software testing, and program comprehension, among others. Tools like GitHub Copilot and ChatGPT have shown substantial benefits in supporting developers across various practices. However, the ambition to scale these models to trillion-parameter sizes, exemplified by GPT-4, poses significant challenges that limit the usage of Artificial Intelligence (AI)-based systems powered by large Deep Learning (DL) models. These include rising computational demands for training and deployment and issues related to trustworthiness, bias, and interpretability. Such factors can make managing these models impractical for many organizations, while their "black-box'' nature undermines key aspects, including transparency and accountability. In this paper, we question the prevailing assumption that increasing model parameters is always the optimal path forward, provided there is sufficient new data to learn additional patterns. In particular, we advocate for a Neurosymbolic research direction that combines the strengths of existing DL techniques (e.g., LLMs) with traditional symbolic methods--renowned for their reliability, speed, and determinism. To this end, we outline the core features and present preliminary results for our envisioned approach, aimed at establishing the first Neurosymbolic Program Comprehension (NsPC) framework to aid in identifying defective code components.
https://arxiv.org/abs/2502.01806