In recent years, dual-encoder vision-language models (\eg CLIP) have achieved remarkable text-to-image retrieval performance. However, we discover that these models usually produce very different retrievals for a pair of paraphrased queries. Such behavior might render the retrieval system less predictable and lead to user frustration. In this work, we consider the task of paraphrased text-to-image retrieval, where a model aims to return similar results given a pair of paraphrased queries. To start, we collect a dataset of paraphrased image descriptions to facilitate quantitative evaluation for this task. We then hypothesize that the undesired behavior of existing dual-encoder models is due to their text towers, which are trained on image-sentence pairs and lack the ability to capture the semantic similarity between paraphrased queries. To improve on this, we investigate multiple strategies for training a dual-encoder model starting from a language model pretrained on a large text corpus. Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves significantly higher ranking similarity for paraphrased queries while maintaining similar zero-shot classification and retrieval accuracy.
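The evaluation this task calls for can be sketched in a few lines: rank a toy image set for two paraphrased queries and score how much their top-k results overlap. The embeddings and the Jaccard@k metric below are illustrative stand-ins, not the paper's dataset or protocol.

```python
# Sketch: measuring how similarly a dual-encoder retriever ranks images
# for two paraphrased queries. Embeddings and the top-k overlap metric
# are hypothetical examples, not the paper's exact setup.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def rank_images(query_emb, image_embs):
    """Return image indices sorted by descending cosine similarity."""
    scores = [(cosine(query_emb, e), i) for i, e in enumerate(image_embs)]
    return [i for _, i in sorted(scores, reverse=True)]

def topk_jaccard(rank_a, rank_b, k):
    """Overlap of the two top-k result sets; 1.0 means identical sets."""
    a, b = set(rank_a[:k]), set(rank_b[:k])
    return len(a & b) / len(a | b)

images = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.5, 0.5]]
q1 = [0.9, 0.1]   # e.g. "a dog on the beach"
q2 = [0.85, 0.2]  # paraphrase, e.g. "a dog by the sea"
r1, r2 = rank_images(q1, images), rank_images(q2, images)
print(topk_jaccard(r1, r2, k=2))  # -> 1.0 for this toy data
```

A consistent model should score close to 1.0 across many paraphrase pairs; the paper's ranking-similarity evaluation follows this general shape.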
https://arxiv.org/abs/2405.03190
Deep learning has emerged as a promising approach for learning the nonlinear mapping between diffusion-weighted MR images and tissue parameters, enabling automatic and deep understanding of brain microstructure. However, the efficiency and accuracy of multi-parametric estimation are still limited, since previous studies tend to estimate multi-parametric maps with dense sampling and isolated signal modeling. This paper proposes DeepMpMRI, a unified framework for fast and high-fidelity multi-parametric estimation from various diffusion models using sparsely sampled q-space data. DeepMpMRI is equipped with a newly designed tensor-decomposition-based regularizer that effectively captures fine details by exploiting the correlation across parameters. In addition, we introduce a Nesterov-based adaptive learning algorithm that dynamically optimizes the regularization parameter to enhance performance. DeepMpMRI is an extendable framework capable of incorporating flexible network architectures. Experimental results demonstrate the superiority of our approach over 5 state-of-the-art methods in simultaneously estimating multi-parametric maps for various diffusion models with fine-grained details, both quantitatively and qualitatively, achieving a 4.5-22.5$\times$ acceleration compared to dense sampling with a total of 270 diffusion gradients.
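The Nesterov-style update the abstract mentions can be illustrated on a toy scalar objective standing in for the regularization weight being tuned; the objective, step size, and momentum below are our own illustrative assumptions, not the paper's formulation.

```python
# Sketch of a Nesterov accelerated update: evaluate the gradient at a
# momentum "lookahead" point, then step. The quadratic objective here is
# a hypothetical stand-in for adapting a regularization parameter.

def nesterov_minimize(grad, x0, lr=0.1, momentum=0.9, steps=100):
    x, v = x0, 0.0
    for _ in range(steps):
        lookahead = x + momentum * v        # gradient evaluated ahead
        v = momentum * v - lr * grad(lookahead)
        x = x + v
    return x

# Toy objective f(x) = (x - 3)^2, so grad f = 2(x - 3); minimum at 3.
best = nesterov_minimize(lambda x: 2 * (x - 3.0), x0=0.0)
print(round(best, 3))  # -> 3.0
```

The lookahead gradient is what distinguishes Nesterov's method from plain momentum and typically damps oscillation for smooth objectives like this one.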
https://arxiv.org/abs/2405.03159
We present SketchGPT, a flexible framework that employs a sequence-to-sequence autoregressive model for sketch generation and completion, along with an interpretation case study for sketch recognition. By mapping complex sketches into simplified sequences of abstract primitives, our approach significantly streamlines the input for autoregressive modeling. SketchGPT leverages a next-token prediction objective to understand sketch patterns, facilitating the creation and completion of drawings and categorizing them accurately. This sketch representation strategy helps overcome existing challenges of autoregressive modeling for continuous stroke data, enabling smoother model training and competitive performance. Our findings exhibit SketchGPT's capability to generate a diverse variety of drawings, supported by both qualitative and quantitative comparisons with existing state-of-the-art methods and a comprehensive human evaluation study. The code and pretrained models will be released on our official GitHub.
https://arxiv.org/abs/2405.03099
Exploring complex adaptive financial trading environments through multi-agent based simulation methods presents an innovative approach within the realm of quantitative finance. Despite the dominance of multi-agent reinforcement learning approaches in financial markets with observable data, there exists a set of systemically significant financial markets that pose challenges due to their partial or obscured data availability. We, therefore, devise a multi-agent simulation approach employing small-scale meta-heuristic methods. This approach aims to represent the opaque bilateral market for Australian government bond trading, capturing the bilateral nature of bank-to-bank trading, also referred to as "over-the-counter" (OTC) trading, which commonly occurs between "market makers". The uniqueness of the bilateral market, characterized by negotiated transactions and a limited number of agents, yields valuable insights for agent-based modelling and quantitative finance. The inherent rigidity of this market structure, which is at odds with the global proliferation of multilateral platforms and the decentralization of finance, underscores the unique insights offered by our agent-based model. We explore the implications of market rigidity on market structure and consider the element of stability in market design. This extends the ongoing discourse on complex financial trading environments, providing an enhanced understanding of their dynamics and implications.
https://arxiv.org/abs/2405.02849
Motion style transfer is a significant research direction in multimedia applications. It enables the rapid switching between different styles of the same motion for virtual digital humans, thus vastly increasing the diversity and realism of movements. It is widely applied in multimedia scenarios such as movies, games, and the Metaverse. However, most current work in this field adopts GANs, which may lead to instability and convergence issues, making the final generated motion sequences somewhat chaotic and unable to reflect a highly realistic and natural style. To address these problems, we consider style motion as a condition and propose the Style Motion Conditioned Diffusion (SMCD) framework for the first time, which can more comprehensively learn the style features of motion. Moreover, we apply the Mamba model for the first time in the motion style transfer field, introducing the Motion Style Mamba (MSM) module to handle longer motion sequences. Thirdly, for the SMCD framework, we propose a diffusion-based Content Consistency Loss and Style Consistency Loss to assist the overall framework's training. Finally, we conduct extensive experiments. The results reveal that our method surpasses state-of-the-art methods in both qualitative and quantitative comparisons, and is capable of generating more realistic motion sequences.
https://arxiv.org/abs/2405.02844
The COVID-19 pandemic has strained global public health, necessitating accurate diagnosis and intervention to control disease spread and reduce mortality rates. This paper introduces an interpretable deep survival prediction model designed specifically for improved understanding and trust in COVID-19 prognosis using chest X-ray (CXR) images. By integrating a large-scale pretrained image encoder, Risk-specific Grad-CAM, and anatomical region detection techniques, our approach produces regional interpretable outcomes that effectively capture essential disease features while focusing on rare but critical abnormal regions. Our model's predictive results provide enhanced clarity and transparency through risk area localization, enabling clinicians to make informed decisions regarding COVID-19 diagnosis with better understanding of prognostic insights. We evaluate the proposed method on a multi-center survival dataset and demonstrate its effectiveness via quantitative and qualitative assessments, achieving superior C-indexes (0.764 and 0.727) and time-dependent AUCs (0.799 and 0.691). These results suggest that our explainable deep survival prediction model surpasses traditional survival analysis methods in risk prediction, improving interpretability for clinical decision making and enhancing AI system trustworthiness.
https://arxiv.org/abs/2405.02815
Light field (LF) imaging captures both angular and spatial light distributions, enabling advanced photographic techniques. However, micro-lens array (MLA)-based cameras face a spatial-angular resolution tradeoff due to a single shared sensor. We propose a novel light field framework for resolution enhancement, employing a modular approach. The first module generates a high-resolution, all-in-focus image. The second module, a texture transformer network, enhances the resolution of each light field perspective independently using the output of the first module as a reference image. The final module leverages light field regularity to jointly improve resolution across all LF image perspectives. Our approach demonstrates superior performance to existing methods in both qualitative and quantitative evaluations.
https://arxiv.org/abs/2405.02787
An interpretable comparison of generative models requires the identification of sample types produced more frequently by each of the involved models. While several quantitative scores have been proposed in the literature to rank different generative models, such score-based evaluations do not reveal the nuanced differences between the generative models in capturing various sample types. In this work, we propose a method called Fourier-based Identification of Novel Clusters (FINC) to identify modes produced by a generative model with a higher frequency in comparison to a reference distribution. FINC provides a scalable stochastic algorithm based on random Fourier features to estimate the eigenspace of kernel covariance matrices of two generative models and utilize the principal eigendirections to detect the sample types present more dominantly in each model. We demonstrate the application of the FINC method to standard computer vision datasets and generative model frameworks. Our numerical results suggest the scalability and efficiency of the developed Fourier-based method in highlighting the sample types captured with different frequencies by widely-used generative models.
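The two ingredients FINC combines, random Fourier features and eigendecomposition of a feature covariance, can be shown in miniature: map samples through a random Fourier feature approximation of a Gaussian kernel, form the empirical covariance, and extract its dominant eigendirection by power iteration. The dimensions, feature count, and single-distribution setup below are toy assumptions; FINC compares two models' covariances.

```python
# Sketch: random Fourier features + power iteration on the feature
# covariance -- a pared-down stand-in for FINC's kernel eigenspace
# estimation. Sizes and data are illustrative.
import math, random

random.seed(0)
DIM, FEATS = 2, 16  # data dimension, number of random features
W = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(FEATS)]
B = [random.uniform(0, 2 * math.pi) for _ in range(FEATS)]

def rff(x):
    """Random Fourier feature map approximating a Gaussian kernel."""
    return [math.sqrt(2.0 / FEATS)
            * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for w, b in zip(W, B)]

def covariance(samples):
    feats = [rff(x) for x in samples]
    n = len(feats)
    return [[sum(f[i] * f[j] for f in feats) / n for j in range(FEATS)]
            for i in range(FEATS)]

def top_eigen(C, iters=200):
    """Power iteration: dominant eigenpair of a symmetric PSD matrix."""
    v = [1.0] * len(C)
    lam = 0.0
    for _ in range(iters):
        u = [sum(C[i][j] * v[j] for j in range(len(C))) for i in range(len(C))]
        lam = max(abs(x) for x in u) or 1.0
        v = [x / lam for x in u]
    return lam, v

samples = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(50)]
lam, vec = top_eigen(covariance(samples))
print(f"dominant eigenvalue ~ {lam:.4f}")
```

The scalability claim in the abstract rests on exactly this substitution: the random-feature covariance is a small FEATS x FEATS matrix regardless of sample count, unlike an n x n kernel matrix.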
https://arxiv.org/abs/2405.02700
Network traffic analysis is fundamental for network management, troubleshooting, and security. Tasks such as traffic classification, anomaly detection, and novelty discovery are essential for extracting operational information from network data and measurements. We witness the shift from deep packet inspection and basic machine learning to Deep Learning (DL) approaches, where researchers define and test a custom DL architecture designed for each specific problem. We here advocate the need for a general DL architecture flexible enough to solve different traffic analysis tasks. We test this idea by proposing a DL architecture based on generic data adaptation modules, followed by an integration module that summarises the extracted information into a compact and rich intermediate representation (i.e. embeddings). The result is a flexible Multi-modal Autoencoder (MAE) pipeline that can solve different use cases. We demonstrate the architecture with traffic classification (TC) tasks, since they allow us to quantitatively compare results with state-of-the-art solutions. However, we argue that the MAE architecture is generic and can be used to learn representations useful in multiple scenarios. On TC, the MAE performs on par with or better than alternatives while avoiding cumbersome feature engineering, thus streamlining the adoption of DL solutions for traffic analysis.
https://arxiv.org/abs/2405.02649
Magnetic resonance imaging (MRI) and positron emission tomography (PET) are increasingly used in multimodal analysis of neurodegenerative disorders. While MRI is broadly utilized in clinical settings, PET is less accessible. Many studies have attempted to use deep generative models to synthesize PET from MRI scans. However, they often suffer from unstable training and inadequately preserve brain functional information conveyed by PET. To this end, we propose a functional imaging constrained diffusion (FICD) framework for 3D brain PET image synthesis with paired structural MRI as input condition, through a new constrained diffusion model (CDM). The FICD introduces noise to PET and then progressively removes it with CDM, ensuring high output fidelity throughout a stable training phase. The CDM learns to predict denoised PET with a functional imaging constraint introduced to ensure voxel-wise alignment between each denoised PET and its ground truth. Quantitative and qualitative analyses conducted on 293 subjects with paired T1-weighted MRI and 18F-fluorodeoxyglucose (FDG)-PET scans suggest that FICD achieves superior performance in generating FDG-PET data compared to state-of-the-art methods. We further validate the effectiveness of the proposed FICD on data from a total of 1,262 subjects through three downstream tasks, with experimental results suggesting its utility and generalizability.
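The forward process FICD builds on, adding noise to PET according to a variance schedule so a denoiser can learn to remove it, reduces to a few lines. The linear beta schedule and toy "voxel" values below are generic diffusion-model conventions used for illustration, not the paper's settings.

```python
# Sketch of forward diffusion: x_t = sqrt(a_bar_t) * x_0
#                                   + sqrt(1 - a_bar_t) * noise,
# where a_bar_t shrinks with t. Schedule and data are illustrative.
import math, random

def alpha_bar(t, T=100, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) under a linear beta schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (T - 1)
        prod *= 1.0 - beta
    return prod

def add_noise(x0, t, rng):
    """Noise a clean signal to diffusion step t."""
    ab = alpha_bar(t)
    return [math.sqrt(ab) * v + math.sqrt(1 - ab) * rng.gauss(0, 1)
            for v in x0]

rng = random.Random(0)
voxels = [0.2, 0.5, 0.9]                 # toy "clean PET" intensities
noisy = add_noise(voxels, t=50, rng=rng)
print(alpha_bar(1) > alpha_bar(50) > alpha_bar(100))  # signal fraction decays
```

The paper's contribution sits on top of this standard machinery: the CDM predicts the denoised PET under an extra voxel-wise functional-imaging constraint, which this sketch does not attempt to reproduce.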
https://arxiv.org/abs/2405.02504
Knowledge graphs have emerged as a sophisticated advancement and refinement of semantic networks, and their deployment is one of the critical methodologies in contemporary artificial intelligence. The construction of knowledge graphs is a multifaceted process involving various techniques, where researchers aim to extract the knowledge from existing resources for the construction since building from scratch entails significant labor and time costs. However, due to the pervasive issue of heterogeneity, the description diversity across different knowledge graphs can lead to mismatches between concepts, thereby impacting the efficacy of knowledge extraction. This Ph.D. study focuses on automatic knowledge graph extension, i.e., properly extending the reference knowledge graph by extracting and integrating concepts from one or more candidate knowledge graphs. We propose a novel knowledge graph extension framework based on entity type recognition. The framework aims to achieve high-quality knowledge extraction by aligning the schemas and entities across different knowledge graphs, thereby enhancing the performance of the extension. This paper elucidates three major contributions: (i) we propose an entity type recognition method exploiting machine learning and property-based similarities to enhance knowledge extraction; (ii) we introduce a set of assessment metrics to validate the quality of the extended knowledge graphs; (iii) we develop a platform for knowledge graph acquisition, management, and extension to benefit knowledge engineers practically. Our evaluation comprehensively demonstrated the feasibility and effectiveness of the proposed extension framework and its functionalities through quantitative experiments and case studies.
https://arxiv.org/abs/2405.02463
In this paper, we consider two challenging issues in reference-based super-resolution (RefSR) for smartphones: (i) how to choose a proper reference image, and (ii) how to learn RefSR in a self-supervised manner. In particular, we propose a novel self-supervised learning approach for real-world RefSR from observations at dual and multiple camera zooms. Firstly, considering the popularity of multiple cameras in modern smartphones, the more zoomed (telephoto) image can be naturally leveraged as the reference to guide the super-resolution (SR) of the lesser zoomed (ultra-wide) image, which gives us a chance to learn a deep network that performs SR from the dual zoomed observations (DZSR). Secondly, for self-supervised learning of DZSR, we take the telephoto image instead of an additional high-resolution image as the supervision information, and select a center patch from it as the reference to super-resolve the corresponding ultra-wide image patch. To mitigate the effect of the misalignment between the ultra-wide low-resolution (LR) patch and the telephoto ground-truth (GT) image during training, we first adopt patch-based optical flow alignment and then design an auxiliary-LR to guide the deformation of the warped LR features. To generate visually pleasing results, we present a local overlapped sliced Wasserstein loss to better represent the perceptual difference between GT and output in the feature space. During testing, DZSR can be directly deployed to super-resolve the whole ultra-wide image with the reference of the telephoto image. In addition, we further take multiple zoomed observations to explore self-supervised RefSR, and present a progressive fusion scheme for the effective utilization of reference images. Experiments show that our methods achieve better quantitative and qualitative performance against the state of the art. Codes are available at this https URL.
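The sliced Wasserstein idea underlying the perceptual loss can be sketched for plain 2-D point sets: project both sets onto random directions, sort the projections, and average the resulting 1-D transport costs. This is the generic estimator only; the paper's "local overlapped" variant over deep feature patches is not reproduced here.

```python
# Sketch of a sliced Wasserstein distance between two equal-size 2-D
# point sets. Sorting the 1-D projections solves optimal transport on
# the line exactly; averaging over directions gives the sliced estimate.
import math, random

def sliced_wasserstein(xs, ys, n_proj=64, seed=0):
    assert len(xs) == len(ys)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.uniform(0, math.pi)
        d = (math.cos(theta), math.sin(theta))
        px = sorted(x[0] * d[0] + x[1] * d[1] for x in xs)
        py = sorted(y[0] * d[0] + y[1] * d[1] for y in ys)
        # 1-D optimal transport cost between sorted projections
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(xs)
    return total / n_proj

a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(round(sliced_wasserstein(a, a), 6))  # identical sets -> 0.0
```

Because each slice is a cheap sort, the loss scales well compared to full Wasserstein distances, which is why it is attractive for feature-space training objectives.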
https://arxiv.org/abs/2405.02171
Medical texts are notoriously challenging to read. Properly measuring their readability is the first step towards making them more accessible. In this paper, we present a systematic study on fine-grained readability measurements in the medical domain at both sentence-level and span-level. We introduce a new dataset MedReadMe, which consists of manually annotated readability ratings and fine-grained complex span annotation for 4,520 sentences, featuring two novel "Google-Easy" and "Google-Hard" categories. It supports our quantitative analysis, which covers 650 linguistic features and automatic complex word and jargon identification. Enabled by our high-quality annotation, we benchmark and improve several state-of-the-art sentence-level readability metrics for the medical domain specifically, which include unsupervised, supervised, and prompting-based methods using recently developed large language models (LLMs). Informed by our fine-grained complex span annotation, we find that adding a single feature, capturing the number of jargon spans, into existing readability formulas can significantly improve their correlation with human judgments. We will publicly release the dataset and code.
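The finding that a single jargon-span feature improves readability formulas can be made concrete with a toy version: a classic Flesch-style score plus a penalty per annotated jargon span. The syllable heuristic, the coefficient, and the example sentence are rough illustrative choices, not the paper's fitted model.

```python
# Sketch: a readability formula augmented with a jargon-span count.
# Syllable counting and the alpha weight are crude assumptions.
import re

def count_syllables(word):
    """Crude vowel-group heuristic; real syllabifiers are more careful."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(sentence):
    """Flesch formula for a single sentence (higher = easier)."""
    words = re.findall(r"[A-Za-z']+", sentence)
    syll = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * len(words) - 84.6 * (syll / len(words))

def adjusted_score(sentence, jargon_spans, alpha=5.0):
    """Each annotated jargon span lowers (hardens) the score."""
    return flesch_reading_ease(sentence) - alpha * len(jargon_spans)

s = "The patient shows bilateral pulmonary infiltrates."
print(adjusted_score(s, ["bilateral pulmonary infiltrates"])
      < flesch_reading_ease(s))  # -> True: jargon hardens the score
```

The paper's point is that the span count carries signal the surface formula misses; here that simply shows up as a monotone penalty.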
https://arxiv.org/abs/2405.02144
This thesis is a corpus-based, quantitative, and typological analysis of the functions of Early Slavic participle constructions and their finite competitors ($jegda$-'when'-clauses). The first part leverages detailed linguistic annotation on Early Slavic corpora at the morphosyntactic, dependency, information-structural, and lexical levels to obtain indirect evidence for different potential functions of participle clauses and their main finite competitor and understand the roles of compositionality and default discourse reasoning as explanations for the distribution of participle constructions and $jegda$-clauses in the corpus. The second part uses massively parallel data to analyze typological variation in how languages express the semantic space of English $when$, whose scope encompasses that of Early Slavic participle constructions and $jegda$-clauses. Probabilistic semantic maps are generated and statistical methods (including Kriging, Gaussian Mixture Modelling, precision and recall analysis) are used to induce cross-linguistically salient dimensions from the parallel corpus and to study conceptual variation within the semantic space of the hypothetical concept WHEN.
https://arxiv.org/abs/2405.01972
With the growing complexity and criticality of automated driving functions in road traffic and their operational design domains (ODD), there is increasing demand for covering significant proportions of development, validation, and verification in virtual environments and through simulation models. If, however, simulations are meant not only to augment real-world experiments but to replace them, quantitative approaches are required that measure to what degree and under which preconditions simulation models adequately represent reality, so that their results can be used accordingly. Especially in R&D areas related to the safety impact of the "open world", there is a significant shortage of real-world data to parameterize and/or validate simulations - especially with respect to the behavior of human traffic participants, whom automated driving functions will meet in mixed traffic. We present an approach to systematically acquire data in public traffic by heterogeneous means, transform it into a unified representation, and use it to automatically parameterize traffic behavior models for use in data-driven virtual validation of automated driving functions.
https://arxiv.org/abs/2405.01776
Reliability of AI systems is a fundamental concern for the successful deployment and widespread adoption of AI technologies. Unfortunately, the escalating complexity and heterogeneity of AI hardware systems make them inevitably and increasingly susceptible to hardware faults (e.g., bit flips) that can potentially corrupt model parameters. Given this challenge, this paper aims to answer a critical question: How likely is a parameter corruption to result in an incorrect model output? To systematically answer this question, we propose a novel quantitative metric, the Parameter Vulnerability Factor (PVF), inspired by the architectural vulnerability factor (AVF) in the computer architecture community, aiming to standardize the quantification of AI model resilience/vulnerability against parameter corruptions. We define a model parameter's PVF as the probability that a corruption in that particular model parameter will result in an incorrect output. Similar to AVF, this statistical concept can be derived from statistically extensive and meaningful fault injection (FI) experiments. In this paper, we present several use cases of applying PVF to three types of tasks/models during inference: recommendation (DLRM), vision classification (CNN), and text classification (BERT). PVF can provide pivotal insights to AI hardware designers in balancing the tradeoff between fault protection and performance/efficiency, such as mapping vulnerable AI parameter components to well-protected hardware modules. The PVF metric is applicable to any AI model and has the potential to help unify and standardize AI vulnerability/resilience evaluation practice.
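The PVF definition, the probability that corrupting one parameter flips the output, lends itself to a small Monte Carlo sketch: flip a random bit of one float32 weight and count how often a toy model's prediction changes. The linear "model", inputs, and trial count are illustrative assumptions, not the paper's DLRM/CNN/BERT setups.

```python
# Sketch: estimating a Parameter Vulnerability Factor by fault
# injection. One float32 parameter is bit-flipped per trial and we
# measure how often any prediction changes. All specifics are toy.
import random, struct

def flip_bit(value, bit):
    """Flip one bit of a float32's IEEE-754 representation."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (out,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return out

def predict(weights, x):
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1 if score > 0 else 0   # NaN score falls through to class 0

def pvf(weights, param_idx, inputs, trials=1000, seed=0):
    """Fraction of random bit flips in one parameter that change output."""
    rng = random.Random(seed)
    baseline = [predict(weights, x) for x in inputs]
    wrong = 0
    for _ in range(trials):
        corrupted = list(weights)
        corrupted[param_idx] = flip_bit(weights[param_idx], rng.randrange(32))
        if any(predict(corrupted, x) != b for x, b in zip(inputs, baseline)):
            wrong += 1
    return wrong / trials

weights = [0.5, -0.25]
inputs = [(1.0, 1.0), (2.0, 0.5), (-1.0, 3.0)]
print(0.0 <= pvf(weights, 0, inputs) <= 1.0)  # -> True
```

In this framing, high-order exponent bits tend to dominate the estimate, which is exactly the kind of per-parameter insight a hardware designer would use to decide what to protect.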
https://arxiv.org/abs/2405.01741
Art reinterpretation is the practice of creating a variation of a reference work, making a paired artwork that exhibits a distinct artistic style. We ask if such an image pair can be used to customize a generative model to capture the demonstrated stylistic difference. We propose Pair Customization, a new customization method that learns stylistic difference from a single image pair and then applies the acquired style to the generation process. Unlike existing methods that learn to mimic a single concept from a collection of images, our method captures the stylistic difference between paired images. This allows us to apply a stylistic change without overfitting to the specific image content in the examples. To address this new task, we employ a joint optimization method that explicitly separates the style and content into distinct LoRA weight spaces. We optimize these style and content weights to reproduce the style and content images while encouraging their orthogonality. During inference, we modify the diffusion process via a new style guidance based on our learned weights. Both qualitative and quantitative experiments show that our method can effectively learn style while avoiding overfitting to image content, highlighting the potential of modeling such stylistic differences from a single image pair.
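The separation of style and content into distinct LoRA weight spaces hinges on an orthogonality term, which can be shown on tiny matrices: penalize the Frobenius norm of the cross-product between the two low-rank factors. The matrices, rank, and penalty form below are illustrative assumptions about the idea, not the paper's exact loss.

```python
# Sketch: an orthogonality penalty between "style" and "content"
# low-rank (LoRA-like) factors; 0 means the subspaces are orthogonal.
# Shapes and values are toy examples.

def matmul_T(A, B):
    """Compute A^T @ B for matrices stored as lists of rows."""
    rows, cols = len(A[0]), len(B[0])
    return [[sum(A[k][i] * B[k][j] for k in range(len(A)))
             for j in range(cols)] for i in range(rows)]

def orthogonality_penalty(A_style, A_content):
    """Squared Frobenius norm of A_style^T @ A_content."""
    G = matmul_T(A_style, A_content)
    return sum(v * v for row in G for v in row)

# rank-1 "down" factors living in a 3-dimensional weight space
style = [[1.0], [0.0], [0.0]]
content = [[0.0], [1.0], [0.0]]
print(orthogonality_penalty(style, content))  # orthogonal -> 0.0
```

Driving this penalty toward zero during the joint optimization is what lets the learned style update be applied without dragging the example's content along.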
https://arxiv.org/abs/2405.01536
Large-scale Text-to-Image (T2I) diffusion models demonstrate significant generation capabilities based on textual prompts. Building on T2I diffusion models, text-guided image editing research aims to empower users to manipulate generated images by altering the text prompts. However, existing image editing techniques are prone to editing unintended regions beyond the intended target area, primarily due to inaccuracies in cross-attention maps. To address this problem, we propose Localization-aware Inversion (LocInv), which exploits segmentation maps or bounding boxes as extra localization priors to refine the cross-attention maps in the denoising phases of the diffusion process. Through the dynamic updating of tokens corresponding to noun words in the textual input, we compel the cross-attention maps to closely align with the correct noun and adjective words in the text prompt. Based on this technique, we achieve fine-grained image editing of particular objects while preventing undesired changes to other regions. Our method, LocInv, based on the publicly available Stable Diffusion, is extensively evaluated on a subset of the COCO dataset and consistently obtains superior results both quantitatively and qualitatively. The code will be released at this https URL.
https://arxiv.org/abs/2405.01496
Transesophageal echocardiography (TEE) plays a pivotal role in cardiology for diagnostic and interventional procedures. However, using it effectively requires extensive training due to the intricate nature of image acquisition and interpretation. To enhance the efficiency of novice sonographers and reduce variability in scan acquisitions, we propose a novel ultrasound (US) navigation assistance method based on contrastive learning as goal-conditioned reinforcement learning (GCRL). We augment the previous framework using a novel contrastive patient batching method (CPB) and a data-augmented contrastive loss, both of which we demonstrate are essential to ensure generalization to anatomical variations across patients. The proposed framework enables navigation to both standard diagnostic as well as intricate interventional views with a single model. Our method was developed with a large dataset of 789 patients and obtained an average error of 6.56 mm in position and 9.36 degrees in angle on a testing dataset of 140 patients, which is competitive or superior to models trained on individual views. Furthermore, we quantitatively validate our method's ability to navigate to interventional views such as the Left Atrial Appendage (LAA) view used in LAA closure. Our approach holds promise in providing valuable guidance during transesophageal ultrasound examinations, contributing to the advancement of skill acquisition for cardiac ultrasound practitioners.
https://arxiv.org/abs/2405.01409
Training deep neural networks (DNNs) with noisy labels is an important and challenging task. However, most existing approaches focus on the corrupted labels themselves and ignore the importance of the inherent data structure. To bridge the gap between noisy labels and data, inspired by the concept of potential energy in physics, we propose a novel Potential Energy based Mixture Model (PEMM) for noisy-label learning. We design a distance-based classifier with potential energy regularization on its class centers. Embedding the proposed classifier into existing deep learning backbones yields robust networks with better feature representations. They preserve the intrinsic structure of the data, resulting in superior noise tolerance. We conducted extensive experiments to analyze the efficiency of our proposed model on several real-world datasets. Quantitative results show that it achieves state-of-the-art performance.
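The two ingredients, a distance-based classifier over class centers and a potential-energy term on those centers, can be sketched directly. The Coulomb-like repulsion energy and toy data below are our own assumptions loosely following the PEMM idea, not the paper's exact formulation.

```python
# Sketch: nearest-center classification plus a potential-energy-style
# repulsion between class centers. The energy form is illustrative.
import math

def classify(centers, x):
    """Assign x to the nearest class center (squared Euclidean)."""
    dists = [sum((ci - xi) ** 2 for ci, xi in zip(c, x)) for c in centers]
    return dists.index(min(dists))

def repulsion_energy(centers, eps=1e-8):
    """Coulomb-like energy: blows up when two centers collapse together,
    so minimizing it as a regularizer keeps class centers spread out."""
    energy = 0.0
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            d = math.dist(centers[i], centers[j])
            energy += 1.0 / (d + eps)
    return energy

centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
print(classify(centers, (3.5, 0.5)))           # nearest to (4, 0) -> 1
print(round(repulsion_energy(centers), 3))
```

During training, the energy term would be added to the classification loss so that noisy labels cannot pull distinct class centers on top of each other.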
https://arxiv.org/abs/2405.01186