Explaining deep neural networks is challenging due to their large size and non-linearity. In this paper, we introduce a concept-based explanation method that explains the prediction for an individual class and can also contrast any two classes, i.e., explain why the model predicts one class over the other. We test it on several openly available classification models trained on ImageNet1K, as well as on a segmentation model trained to detect tumors in stained tissue samples. We perform both qualitative and quantitative tests. For example, for a ResNet50 model from the PyTorch model zoo, we can use the explanation of why the model predicts a class 'A' to automatically select six dataset crops for which the model does not predict class 'A'. For the image combined from these crops, the model then predicts class 'A' again in 71% of the cases (it works for 710 of the 1000 classes). The code, including an .ipynb example, is available on git: this https URL.
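As a sketch of how the combined-image test could be reproduced, assuming the six crops are simply tiled into a 2×3 grid (the paper's exact composition procedure may differ) and using the same PyTorch model zoo ResNet50:

```python
# Hedged sketch: tile six equally sized crops into one image and check
# whether a pretrained ResNet50's top-1 prediction is class A. The 2x3
# grid layout is an assumption for illustration.
import torch
import torchvision.transforms.functional as TF
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()

def combine_crops(crops):
    """Tile six equally sized PIL crops into a single 2x3 grid image."""
    t = [TF.to_tensor(c) for c in crops]                    # each 3xHxW
    rows = [torch.cat(t[i:i + 3], dim=2) for i in (0, 3)]   # two rows of three
    return TF.to_pil_image(torch.cat(rows, dim=1))

def top1_is(image, class_idx):
    """True if the model's top-1 ImageNet prediction equals class_idx."""
    with torch.no_grad():
        logits = model(preprocess(image).unsqueeze(0))
    return logits.argmax(dim=1).item() == class_idx
```

If `top1_is(combine_crops(crops), a)` holds even though no individual crop is predicted as class `a`, the combined image reproduces the effect measured above.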
https://arxiv.org/abs/2502.03422
Purpose: To develop and evaluate a deep learning-based method that performs myocardial infarct segmentation in a fully automated way. Materials and Methods: For this retrospective study, a cascaded framework of two- and three-dimensional convolutional neural networks (CNNs), specialized in identifying ischemic myocardial scars on late gadolinium enhancement (LGE) cardiac magnetic resonance (CMR) images, was trained on an in-house dataset consisting of 144 examinations. On a separate test dataset from the same institution, comprising images from 152 examinations obtained between 2021 and 2023, a quantitative comparison between artificial intelligence (AI)-based segmentations and manual segmentations was performed. Further, the segmentation accuracy of both human- and AI-generated contours was assessed qualitatively by two CMR experts in a blinded experiment. Results: Excellent agreement was found between manually and automatically calculated infarct volumes ($\rho_c$ = 0.9). In the qualitative evaluation, the experts rated the AI-based segmentations as better representing the actual extent of infarction significantly more often (p < 0.001) than the human-based measurements (33.4% AI, 25.1% human, 41.5% equal). In contrast, for segmentation of microvascular obstruction (MVO), manual measurements were still preferred (11.3% AI, 55.6% human, 33.1% equal). Conclusion: This fully automated segmentation pipeline enables CMR infarct size to be calculated in a very short time and without any pre-processing of the input images, while matching the segmentation quality of trained human observers. In a blinded experiment, experts preferred automated infarct segmentations more often than manual ones, paving the way for a potential clinical application.
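The agreement statistic $\rho_c$ quoted above is Lin's concordance correlation coefficient; a minimal NumPy version for paired infarct-volume measurements could look like this (a sketch of the definition, not the study's evaluation code):

```python
import numpy as np

def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient between paired measurements."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()          # population covariance
    return 2.0 * cov / (x.var() + y.var() + (mx - my) ** 2)
```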
https://arxiv.org/abs/2502.03272
Tree-based and rule-based machine learning models play pivotal roles in explainable artificial intelligence (XAI) due to their unique ability to provide explanations in the form of tree or rule sets that are easily understandable and interpretable, making them essential for applications in which trust in model decisions is necessary. These transparent models are typically used in surrogate modeling, a post-hoc XAI approach for explaining the logic of black-box models, enabling users to comprehend and trust complex predictive systems while maintaining competitive performance. This study proposes the Cost-Sensitive Rule and Tree Extraction (CORTEX) method, a novel rule-based XAI algorithm grounded in the multi-class cost-sensitive decision tree (CSDT) method. The original version of the CSDT is extended to classification problems with more than two classes by introducing the concept of an n-dimensional class-dependent cost matrix. The performance of CORTEX as a rule-extractor XAI method is compared to other post-hoc tree and rule extraction methods across several datasets with different numbers of classes. Several quantitative evaluation metrics are employed to assess the explainability of the generated rule sets. Our findings demonstrate that CORTEX is competitive with other tree-based methods and can be superior to other rule-based methods across different datasets. The extracted rule sets indicate the advantage of CORTEX over other methods: it produces smaller rule sets with shorter rules on average across datasets with a diverse number of classes. Overall, the results underscore the potential of CORTEX as a powerful XAI tool for scenarios that require the generation of clear, human-understandable rules while maintaining good predictive performance.
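For intuition, one common way to make a standard decision tree respect an n×n class-dependent cost matrix is to reweight each training sample by the total misclassification cost of its true class; this is a generic approximation for illustration, not necessarily the CSDT mechanism that CORTEX builds on:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cost_sensitive_tree(X, y, cost_matrix, **tree_kwargs):
    """Fit a tree with per-sample weights derived from a class-dependent
    cost matrix C, where C[i][j] is the cost of predicting j when the
    true class is i. Assumes integer labels 0..n-1."""
    C = np.asarray(cost_matrix, float)
    row_cost = C.sum(axis=1)                  # aggregate cost per true class
    sample_weight = row_cost[np.asarray(y)]
    tree = DecisionTreeClassifier(**tree_kwargs)
    tree.fit(X, y, sample_weight=sample_weight)
    return tree
```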
https://arxiv.org/abs/2502.03200
Natural language interaction with sensing systems is crucial for enabling all users to comprehend sensor data and its impact on their everyday lives. However, existing systems, which typically operate in a Question Answering (QA) manner, are significantly limited in terms of the duration and complexity of sensor data they can handle. In this work, we introduce SensorChat, the first end-to-end QA system designed for long-term sensor monitoring with multimodal and high-dimensional data including time series. SensorChat effectively answers both qualitative (requiring high-level reasoning) and quantitative (requiring accurate responses derived from sensor data) questions in real-world scenarios. To achieve this, SensorChat uses an innovative three-stage pipeline that includes question decomposition, sensor data query, and answer assembly. The first and third stages leverage Large Language Models (LLMs) for intuitive human interactions and to guide the sensor data query process. Unlike existing multimodal LLMs, SensorChat incorporates an explicit query stage to precisely extract factual information from long-duration sensor data. We implement SensorChat and demonstrate its capability for real-time interactions on a cloud server while also being able to run entirely on edge platforms after quantization. Comprehensive QA evaluations show that SensorChat achieves up to 26% higher answer accuracy than state-of-the-art systems on quantitative questions. Additionally, a user study with eight volunteers highlights SensorChat's effectiveness in handling qualitative and open-ended questions.
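A hedged skeleton of the three-stage pipeline, where `llm` and `query_sensor_db` are hypothetical stand-ins (passed in as callables) for the paper's actual components:

```python
def answer_question(question, sensor_db, llm, query_sensor_db):
    # Stage 1: question decomposition, guided by an LLM
    subqueries = llm(f"Decompose into sensor-data queries: {question}")
    # Stage 2: explicit query over long-duration sensor data for factual grounding
    facts = [query_sensor_db(sensor_db, q) for q in subqueries]
    # Stage 3: answer assembly, again guided by an LLM
    return llm(f"Answer {question!r} using these facts: {facts}")
```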
https://arxiv.org/abs/2502.02883
Differentiable Search Indexing (DSI) is a recent paradigm for information retrieval which uses a transformer-based neural network architecture as the document index to simplify the retrieval process. A differentiable index has many advantages, enabling modifications, updates or extensions to the index. In this work, we explore balancing relevance and novel information content (diversity) for training DSI systems, inspired by Maximal Marginal Relevance (MMR), and show the benefits of our approach over naive DSI training. We present quantitative and qualitative evaluations of relevance and diversity measures obtained using our method on the NQ320K and MSMARCO datasets in comparison to naive DSI. With our approach, it is possible to achieve diversity without any significant impact on relevance. Since we induce diversity while training DSI, the trained model has learned to diversify while remaining relevant. This obviates the need for a post-processing step to induce diversity in the recall set, as is typically performed using MMR. Our approach will be useful for information retrieval problems where both relevance and diversity are important, such as sub-topic retrieval. Our work can also easily be extended to incremental DSI settings, which would enable fast updates to the index while retrieving a diverse recall set.
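For reference, MMR greedily selects documents that are relevant to the query yet dissimilar to documents already selected; a minimal re-ranker over precomputed similarities (a sketch, with `lam` as the usual relevance/novelty trade-off) is:

```python
def mmr(query_sim, doc_sims, k, lam=0.7):
    """query_sim: length-n doc-query similarities; doc_sims: n x n doc-doc
    similarities. Returns indices of k docs balancing relevance and novelty."""
    selected, candidates = [], list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(d):
            redundancy = max((doc_sims[d][s] for s in selected), default=0.0)
            return lam * query_sim[d] - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```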
https://arxiv.org/abs/2502.02788
Controllable video generation remains a significant challenge, despite recent advances in generating high-quality and consistent videos. Most existing methods for controlling video generation treat the video as a whole, neglecting intricate fine-grained spatiotemporal relationships, which limits both control precision and efficiency. In this paper, we propose Controllable Video Generative Adversarial Networks (CoVoGAN) to disentangle the video concepts, thus facilitating efficient and independent control over individual concepts. Specifically, following the minimal change principle, we first disentangle static and dynamic latent variables. We then leverage the sufficient change property to achieve component-wise identifiability of dynamic latent variables, enabling independent control over motion and identity. To establish the theoretical foundation, we provide a rigorous analysis demonstrating the identifiability of our approach. Building on these theoretical insights, we design a Temporal Transition Module to disentangle latent dynamics. To enforce the minimal change principle and sufficient change property, we minimize the dimensionality of latent dynamic variables and impose temporal conditional independence. To validate our approach, we integrate this module as a plug-in for GANs. Extensive qualitative and quantitative experiments on various video generation benchmarks demonstrate that our method significantly improves generation quality and controllability across diverse real-world scenarios.
https://arxiv.org/abs/2502.02690
This paper introduces the first Expected Possession Value (EPV) benchmark and a new and improved EPV model for football. Through the introduction of the OJN-Pass-EPV benchmark, we present a novel method to quantitatively assess the quality of EPV models by using pairs of game states with given relative EPVs. Next, we attempt to replicate the results of Fernández et al. (2021) using a dataset containing Dutch Eredivisie and World Cup matches. Following our failure to do so, we propose a new architecture based on U-net-type convolutional neural networks, achieving good results in model loss and Expected Calibration Error. Finally, we present an improved pass model that incorporates ball height and contains a new dual-component pass value model that analyzes reward and risk. The resulting EPV model correctly identifies the higher value state in 78% of the game state pairs in the OJN-Pass-EPV benchmark, demonstrating its ability to accurately assess goal-scoring potential. Our findings can help assess the quality of EPV models, improve EPV predictions, help assess potential reward and risk of passing decisions, and improve player and team performance.
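The Expected Calibration Error used above is a standard metric; a minimal equal-width-binning implementation (a sketch assuming binary predicted probabilities) is:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins; `correct` is 0/1 per prediction."""
    confidences = np.asarray(confidences, float)
    correct = np.asarray(correct, float)
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():  # weight each bin's |confidence - accuracy| gap by its size
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece
```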
https://arxiv.org/abs/2502.02565
The civil engineering industry faces a critical need for innovative non-destructive evaluation methods, particularly for ageing critical infrastructure, such as bridges, where current techniques fall short. Muography, a non-invasive imaging technique, constructs three-dimensional density maps by detecting interactions of naturally occurring cosmic-ray muons within the scanned volume. Cosmic-ray muons provide deep penetration and inherent safety due to their high momenta and natural source. However, the technology's reliance on this source results in constrained muon flux, leading to prolonged acquisition times, noisy reconstructions and image interpretation challenges. To address these limitations, we developed a two-model deep learning approach. First, we employed a conditional Wasserstein generative adversarial network with gradient penalty (cWGAN-GP) to perform predictive upsampling of undersampled muography images. Using the structural similarity index measure (SSIM), 1-day sampled images matched the perceptual qualities of a 21-day image, while the peak signal-to-noise ratio (PSNR) indicated noise improvement equivalent to 31 days of sampling. A second cWGAN-GP model, trained for semantic segmentation, quantitatively assessed the upsampling model's impact on concrete sample features. This model achieved segmentation of rebar grids and tendon ducts, with Dice-Sørensen accuracy coefficients of 0.8174 and 0.8663. Notably, it could mitigate or remove z-plane smearing artifacts caused by muography's inverse imaging problem. Both models were trained on a comprehensive Geant4 Monte-Carlo simulation dataset reflecting realistic civil infrastructure scenarios. Our results demonstrate significant improvements in acquisition speed and image quality, marking a substantial step toward making muography more practical for reinforced concrete infrastructure monitoring applications.
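The metrics quoted above are standard; for reference, minimal NumPy versions of PSNR and the Dice-Sørensen coefficient (sketches of the definitions, not the study's evaluation code):

```python
import numpy as np

def psnr(ref, img, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, data_range]."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(img, float)) ** 2)
    return np.inf if mse == 0 else 20 * np.log10(data_range) - 10 * np.log10(mse)

def dice(a, b):
    """Dice-Sorensen coefficient between two binary masks."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom
```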
https://arxiv.org/abs/2502.02624
Autonomous driving systems rely on robust 3D scene understanding. Recent advances in Semantic Scene Completion (SSC) for autonomous driving underscore the limitations of RGB-based approaches, which struggle under motion blur, poor lighting, and adverse weather. Event cameras, offering high dynamic range and low latency, address these challenges by providing asynchronous data that complements RGB inputs. We present DSEC-SSC, the first real-world benchmark specifically designed for event-aided SSC, which includes a novel 4D labeling pipeline for generating dense, visibility-aware labels that adapt dynamically to object motion. Our proposed RGB-Event fusion framework, EvSSC, introduces an Event-aided Lifting Module (ELM) that effectively bridges 2D RGB-Event features to 3D space, enhancing view transformation and the robustness of 3D volume construction across SSC models. Extensive experiments on DSEC-SSC and simulated SemanticKITTI-E demonstrate that EvSSC is adaptable to both transformer-based and LSS-based SSC architectures. Notably, evaluations on SemanticKITTI-C demonstrate that EvSSC achieves consistently improved prediction accuracy across five degradation modes in both in-domain and out-of-domain settings, achieving up to a 52.5% relative improvement in mIoU when the image sensor partially fails. Additionally, we quantitatively and qualitatively validate the superiority of EvSSC under motion blur and extreme weather conditions, where autonomous driving is challenged. The established datasets and our codebase will be made publicly available at this https URL.
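The mIoU behind the 52.5% relative improvement is the usual class-averaged intersection-over-union; a minimal sketch over integer label maps:

```python
import numpy as np

def mean_iou(pred, gt, n_classes):
    """Mean IoU over classes present in prediction or ground truth."""
    ious = []
    for c in range(n_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union:
            ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```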
https://arxiv.org/abs/2502.02334
Understanding cell cycle dynamics is crucial for studying biological processes such as growth, development and disease progression. While fluorescent protein reporters like the Fucci system allow live monitoring of cell cycle phases, they require genetic engineering and occupy additional fluorescence channels, limiting broader applicability in complex experiments. In this study, we conduct a comprehensive evaluation of deep learning methods for predicting continuous Fucci signals from non-fluorescent brightfield imaging, a widely available label-free modality. To that end, we generated a large dataset of 1.3M images of dividing RPE1 cells with full cell cycle trajectories to quantitatively compare the predictive performance of distinct model categories, including single time-frame models, causal state space models and bidirectional transformer models. We show that both causal and transformer-based models significantly outperform single- and fixed-frame approaches, enabling the prediction of visually imperceptible transitions like G1/S at 1-hour resolution. Our findings underscore the importance of sequence models for accurate prediction of cell cycle dynamics and highlight their potential for label-free imaging.
https://arxiv.org/abs/2502.02182
While attention-based approaches have shown considerable progress in enhancing image fusion and addressing the challenges posed by long-range feature dependencies, their efficacy in capturing local features is compromised by the lack of diverse receptive field extraction techniques. To overcome the shortcomings of existing fusion methods in extracting multi-scale local features and preserving global features, this paper proposes a novel cross-modal image fusion approach based on a multi-scale convolutional neural network with attention Transformer (MATCNN). MATCNN utilizes the multi-scale fusion module (MSFM) to extract local features at different scales and employs the global feature extraction module (GFEM) to extract global features. Combining the two reduces the loss of detail features and improves the ability of global feature representation. Simultaneously, an information mask is used to label pertinent details within the images, aiming to increase the proportion of significant information from infrared images and background texture from visible images that is preserved in the fused images. Subsequently, a novel optimization algorithm is developed that leverages the mask to guide feature extraction through an integrated loss combining content, the structural similarity index measure, and a global feature term. Quantitative and qualitative evaluations across various datasets reveal that MATCNN effectively highlights salient infrared targets, preserves additional details from visible images, and achieves better fusion results for cross-modal images. The code of MATCNN will be available at this https URL.
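To make the mask-guided objective concrete, here is a hedged sketch of one possible mask-weighted fusion term; the actual MATCNN loss also integrates SSIM and a global feature term, and its exact form may differ:

```python
import torch
import torch.nn.functional as F

def masked_fusion_loss(fused, ir, vis, mask, w_mask=5.0):
    """Illustrative only: `mask` is 1 where salient infrared detail should
    dominate and 0 where visible-image background texture should."""
    content = F.l1_loss(fused, torch.maximum(ir, vis))        # keep strongest signal
    salient = F.l1_loss(fused * mask, ir * mask)              # preserve IR targets
    background = F.l1_loss(fused * (1 - mask), vis * (1 - mask))
    return content + w_mask * (salient + background)
```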
https://arxiv.org/abs/2502.01959
We present SliderSpace, a framework for automatically decomposing the visual capabilities of diffusion models into controllable and human-understandable directions. Unlike existing control methods that require a user to specify attributes for each edit direction individually, SliderSpace discovers multiple interpretable and diverse directions simultaneously from a single text prompt. Each direction is trained as a low-rank adaptor, enabling compositional control and the discovery of surprising possibilities in the model's latent space. Through extensive experiments on state-of-the-art diffusion models, we demonstrate SliderSpace's effectiveness across three applications: concept decomposition, artistic style exploration, and diversity enhancement. Our quantitative evaluation shows that SliderSpace-discovered directions decompose the visual structure of the model's knowledge effectively, offering insights into the latent capabilities encoded within diffusion models. User studies further validate that our method produces more diverse and useful variations compared to baselines. Our code, data and trained weights are available at this https URL.
https://arxiv.org/abs/2502.01639
Human body restoration, as a specific application of image restoration, is widely applied in practice and plays a vital role across diverse fields. However, thorough research remains difficult, particularly due to the lack of benchmark datasets. In this study, we propose a high-quality automated cropping and filtering (HQ-ACF) pipeline for dataset construction. This pipeline leverages existing object detection datasets and other unlabeled images to automatically crop and filter high-quality human images. Using this pipeline, we constructed a person-based restoration with sophisticated objects and natural activities (\emph{PERSONA}) dataset, which includes training, validation, and test sets. The dataset significantly surpasses other human-related datasets in both quality and content richness. Finally, we propose \emph{OSDHuman}, a novel one-step diffusion model for human body restoration. Specifically, we propose a high-fidelity image embedder (HFIE) as the prompt generator to better guide the model with low-quality human image information, effectively avoiding misleading prompts. Experimental results show that OSDHuman outperforms existing methods in both visual quality and quantitative metrics. The dataset and code will be available at this https URL.
https://arxiv.org/abs/2502.01411
While traditional self-supervised learning methods improve performance and robustness across various medical tasks, they rely on single-vector embeddings that may not capture fine-grained concepts such as anatomical structures or organs. The ability to identify such concepts and their characteristics without supervision has the potential to improve pre-training methods, and enable novel applications such as fine-grained image retrieval and concept-based outlier detection. In this paper, we introduce ConceptVAE, a novel pre-training framework that detects and disentangles fine-grained concepts from their style characteristics in a self-supervised manner. We present a suite of loss terms and model architecture primitives designed to discretise input data into a preset number of concepts along with their local style. We validate ConceptVAE both qualitatively and quantitatively, demonstrating its ability to detect fine-grained anatomical structures such as blood pools and septum walls from 2D cardiac echocardiographies. Quantitatively, ConceptVAE outperforms traditional self-supervised methods in tasks such as region-based instance retrieval, semantic segmentation, out-of-distribution detection, and object detection. Additionally, we explore the generation of in-distribution synthetic data that maintains the same concepts as the training data but with distinct styles, highlighting its potential for more calibrated data generation. Overall, our study introduces and validates a promising new pre-training technique based on concept-style disentanglement, opening multiple avenues for developing models for medical image analysis that are more interpretable and explainable than black-box approaches.
https://arxiv.org/abs/2502.01335
Quantitative information flow analyses (QIF) are a class of techniques for measuring the amount of confidential information leaked by a program to its public outputs. Shannon entropy is an important measure for quantifying the amount of leakage in QIF. This paper focuses on programs modeled as Boolean constraints and optimizes the two stages of Shannon entropy computation to implement PSE, a scalable and precise tool. In the first stage, we design a knowledge compilation language called \ADDAND that combines Algebraic Decision Diagrams with conjunctive decomposition. \ADDAND avoids enumerating the possible outputs of a program and supports tractable entropy computation. In the second stage, we optimize the model counting queries that are used to compute the probabilities of outputs. We compare PSE with the state-of-the-art probably approximately correct tool EntropyEstimation, which was shown to significantly outperform existing precise tools. The experimental results demonstrate that PSE solved 55 more benchmarks than EntropyEstimation out of a total of 441. For 98% of the benchmarks that both PSE and EntropyEstimation solved, PSE is at least $10\times$ as efficient as EntropyEstimation.
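To illustrate the quantity being computed: the leakage is the Shannon entropy of the program's output distribution, with output probabilities obtained from (model) counts of the inputs mapped to each output. A toy sketch that substitutes exhaustive enumeration for model counting:

```python
import math
from collections import Counter
from itertools import product

def output_entropy(program, n_bits):
    """Entropy of the output distribution over uniformly random n-bit secrets.
    Exhaustive enumeration stands in for the model counting queries."""
    counts = Counter(program(bits) for bits in product((0, 1), repeat=n_bits))
    total = 2 ** n_bits
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A 2-bit "password check" leaks H ~ 0.81 bits:
print(output_entropy(lambda b: b == (1, 0), 2))
```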
https://arxiv.org/abs/2502.01160
$\chi$-separation is an advanced quantitative susceptibility mapping (QSM) method that is designed to generate paramagnetic ($\chi_{para}$) and diamagnetic ($|\chi_{dia}|$) susceptibility maps, reflecting the distribution of iron and myelin in the brain. However, vessels have shown artifacts, interfering with the accurate quantification of iron and myelin in applications. To address this challenge, a new vessel segmentation method for $\chi$-separation is developed. The method comprises three steps: 1) Seed generation from $\textit{R}_2^*$ and the product of $\chi_{para}$ and $|\chi_{dia}|$ maps; 2) Region growing, guided by vessel geometry, creating a vessel mask; 3) Refinement of the vessel mask by excluding non-vessel structures. The performance of the method was compared to conventional vessel segmentation methods both qualitatively and quantitatively. To demonstrate the utility of the method, it was tested in two applications: quantitative evaluation of a neural network-based $\chi$-separation reconstruction method ($\chi$-sepnet-$\textit{R}_2^*$) and population-averaged region of interest (ROI) analysis. The proposed method demonstrates superior performance to the conventional vessel segmentation methods, effectively excluding non-vessel structures and achieving the highest Dice score coefficient. For the applications, applying the vessel masks yields notable improvements in the quantitative evaluation of $\chi$-sepnet-$\textit{R}_2^*$ and reveals statistically significant differences in the population-averaged ROI analysis. These applications suggest that excluding vessels when analyzing $\chi$-separation maps provides more accurate evaluations. The proposed method has the potential to facilitate various applications, offering reliable analysis through the generation of a high-quality vessel mask.
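A hedged sketch of the seed-then-grow idea in steps 1 and 2 (the thresholds and the use of plain binary dilation are illustrative assumptions, not the paper's geometry-guided procedure; the refinement of step 3 is omitted):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def grow_vessel_mask(r2star, chi_product, seed_thr, grow_thr, max_iters=50):
    """chi_product: voxelwise product of chi_para and |chi_dia| maps."""
    seeds = (r2star > seed_thr) & (chi_product > seed_thr)    # step 1: seeds
    allowed = (r2star > grow_thr) | (chi_product > grow_thr)  # looser vessel-like region
    mask = seeds
    for _ in range(max_iters):                                # step 2: region growing
        grown = binary_dilation(mask) & allowed
        if np.array_equal(grown, mask):
            break
        mask = grown
    return mask
```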
https://arxiv.org/abs/2502.01023
In this paper, we present a novel method for improving text-to-image generation by combining Large Language Models (LLMs) with diffusion models, a hybrid approach aimed at achieving both higher quality and greater efficiency in image synthesis from text descriptions. Our approach introduces a new dynamic KL-weighting strategy to optimize the diffusion process, along with incorporating semantic understanding from pre-trained LLMs to guide the generation process. The proposed method significantly improves both the visual quality and alignment of generated images with text descriptions, addressing challenges such as computational inefficiency, instability in training, and robustness to textual variability. We evaluate our method on the COCO dataset and demonstrate its superior performance over traditional GAN-based models, both quantitatively and qualitatively. Extensive experiments, including ablation studies and human evaluations, confirm that our method outperforms existing approaches in terms of image realism, relevance to the input text, and overall aesthetic quality. Our approach also shows promise in scalability to other multimodal tasks, making it a versatile solution for a wide range of generative applications.
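The paper's dynamic KL-weighting strategy is not spelled out in this abstract; purely as a generic stand-in for intuition, a cyclical annealing schedule that ramps the weight of the KL term within each cycle might look like:

```python
import math

def kl_weight(step, cycle_len=10_000, max_w=1.0):
    """Cosine ramp from 0 to max_w over the first half of each cycle,
    then held at max_w; restarts every cycle (a common annealing trick)."""
    phase = (step % cycle_len) / cycle_len
    return max_w * 0.5 * (1.0 - math.cos(math.pi * min(1.0, 2.0 * phase)))

# usage in a training loop (recon and kl are per-batch loss tensors):
#   total = recon + kl_weight(global_step) * kl
```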
https://arxiv.org/abs/2502.00826
Forests are the most significant land-based carbon storage mechanism. Forest carbon sinks can effectively decrease the atmospheric CO2 concentration and mitigate climate change. Remote sensing estimation not only ensures high data accuracy but also enables observation over large areas. Optical images make long-term monitoring possible, which will be a key issue for future carbon storage estimation research. We chose Huize County, Qujing City, Yunnan Province, China as the study area, used GF-1 WFV satellite imagery as the data, introduced the KD-VGG module to extract initial features, and proposed an improved implicit diffusion model (IIDM). The results showed that: (1) the VGG-19 module after knowledge distillation can realize initial feature extraction, reducing inference time and improving accuracy while reducing the number of model parameters; (2) an Attention + MLP module added for feature fusion captures the relationship between global and local features, realizing the restoration of high-fidelity images over a continuous scale range; (3) the IIDM model proposed in this paper achieved the highest estimation accuracy, with an RMSE of 28.68, which is 13.16 lower than that of the regression model (about 31.45%, implying a regression RMSE of roughly 41.84). In carbon storage estimation, the generative model can extract deeper features, and its performance was significantly better than that of the other models. This demonstrates the feasibility of artificial intelligence-generated content (AIGC) in quantitative remote sensing and provides valuable insights for research on the carbon neutralization effect. Combined with the actual characteristics of the forest, the 16-meter-resolution regional carbon storage estimate provides a significant theoretical basis for the formulation of forest carbon sink regulation.
https://arxiv.org/abs/2502.00783
Personalized content filtering, such as recommender systems, has become critical infrastructure for alleviating information overload. However, these systems merely filter existing content and are constrained by its limited diversity, making it difficult to meet users' varied content needs. To address this limitation, personalized content generation has emerged as a promising direction with broad applications. Nevertheless, most existing research focuses on personalized text generation, with relatively little attention given to personalized image generation. The limited work on personalized image generation faces challenges in accurately capturing users' visual preferences and needs from noisy user-interacted images and complex multimodal instructions. Worse still, there is a lack of supervised data for training personalized image generation models. To overcome these challenges, we propose a Personalized Image Generation Framework named Pigeon, which adopts large multimodal models with three dedicated modules to capture users' visual preferences and needs from noisy user history and multimodal instructions. To alleviate the data scarcity, we introduce a two-stage preference alignment scheme, comprising masked preference reconstruction and pairwise preference alignment, to align Pigeon with the personalized image generation task. We apply Pigeon to personalized sticker and movie poster generation, where extensive quantitative results and human evaluation highlight its superiority over various generative baselines.
https://arxiv.org/abs/2410.14170
In recent years, neuro-symbolic methods have become a popular and powerful approach for augmenting artificial intelligence systems with the capability to perform abstract, logical, and quantitative deductions with enhanced precision and controllability. Recent studies have successfully performed symbolic reasoning by leveraging various machine learning models to explicitly or implicitly predict intermediate labels that provide symbolic instructions. However, these intermediate labels are not always prepared for every task as part of the training data, and pre-trained models, represented by Large Language Models (LLMs), do not consistently generate valid symbolic instructions from their intrinsic knowledge either. On the other hand, existing work has developed alternative learning techniques that allow the learning system to autonomously uncover optimal symbolic instructions. Nevertheless, their performance also exhibits limitations when faced with relatively large search spaces or more challenging reasoning problems. In view of this, we put forward an advanced practice for neuro-symbolic reasoning systems to explore intermediate labels using only weak supervision from problem inputs and final outputs. Our experiments on the Mathematics dataset illustrate the effectiveness of our proposals from multiple aspects.
https://arxiv.org/abs/2502.00629