In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with controllability comparable to that of professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and intuitive layout control over the rendered frames. To achieve this, CineMaster operates in two stages. In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware conditional signals by positioning object bounding boxes and defining camera movements within 3D space. In the second stage, these control signals (rendered depth maps, camera trajectories, and object class labels) serve as guidance for a text-to-video diffusion model, ensuring that the generated video matches the user's intent. Furthermore, to overcome the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, we carefully establish an automated data annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. Extensive qualitative and quantitative experiments demonstrate that CineMaster significantly outperforms existing methods and achieves prominent 3D-aware text-to-video generation. Project page: this https URL.
https://arxiv.org/abs/2502.08639
With advancements in satellite imaging technology, acquiring high-resolution multi-view satellite imagery has become increasingly accessible, enabling rapid and location-independent ground model reconstruction. However, traditional stereo matching methods struggle to capture fine details, and while neural radiance fields (NeRFs) achieve high-quality reconstructions, their training time is prohibitively long. Moreover, challenges such as low visibility of building facades, illumination and style differences between pixels, and weakly textured regions in satellite imagery further make it hard to reconstruct reasonable terrain geometry and detailed building facades. To address these issues, we propose Sat-DN, a novel framework leveraging a progressively trained multi-resolution hash grid reconstruction architecture with explicit depth guidance and surface normal consistency constraints to enhance reconstruction quality. The multi-resolution hash grid accelerates training, while the progressive strategy incrementally increases the learning frequency, using coarse low-frequency geometry to guide the reconstruction of fine high-frequency details. The depth and normal constraints ensure a clear building outline and correct planar distribution. Extensive experiments on the DFC2019 dataset demonstrate that Sat-DN outperforms existing methods, achieving state-of-the-art results in both qualitative and quantitative evaluations. The code is available at this https URL.
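Sat-DN's progressive strategy trains coarse, low-frequency hash-grid levels first and gradually unmasks finer levels. A minimal sketch of one possible coarse-to-fine schedule, assuming a linear per-level ramp and an illustrative level count (the paper's exact schedule is not specified here; `progressive_level_weights` is a hypothetical helper name):

```python
def progressive_level_weights(step, total_steps, num_levels=8):
    """Return a weight in [0, 1] for each hash-grid resolution level.

    Early in training only the coarsest level contributes; each finer level
    fades in linearly over one "slot" of the schedule, so low-frequency
    geometry can guide the reconstruction of high-frequency detail.
    """
    progress = num_levels * step / total_steps
    weights = []
    for level in range(num_levels):
        # Level 0 is fully active from the start; level k ramps in once
        # progress passes k - 1.
        w = min(max(progress - level + 1.0, 0.0), 1.0)
        weights.append(w)
    return weights

print(progressive_level_weights(0, 1000))     # only the coarsest level is active
print(progressive_level_weights(1000, 1000))  # all levels fully active
```

The returned weights would multiply each level's feature vector before it enters the MLP, which is one common way to realize such a schedule.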
https://arxiv.org/abs/2502.08352
We present MatSwap, a method to transfer materials to designated surfaces in an image photorealistically. Such a task is non-trivial due to the large entanglement of material appearance, geometry, and lighting in a photograph. In the literature, material editing methods typically rely on either cumbersome text engineering or extensive manual annotations requiring artist knowledge and 3D scene properties that are impractical to obtain. In contrast, we propose to directly learn the relationship between the input material -- as observed on a flat surface -- and its appearance within the scene, without the need for explicit UV mapping. To achieve this, we rely on a custom light- and geometry-aware diffusion model. We fine-tune a large-scale pre-trained text-to-image model for material transfer using our synthetic dataset, preserving its strong priors to ensure effective generalization to real images. As a result, our method seamlessly integrates a desired material into the target location in the photograph while retaining the identity of the scene. We evaluate our method on synthetic and real images and show that it compares favorably to recent work both qualitatively and quantitatively. We will release our code and data upon publication.
https://arxiv.org/abs/2502.07784
To help users make privacy-related decisions, personalized privacy assistants based on AI technology have been developed in recent years. These AI-driven Personalized Privacy Assistants (AI-driven PPAs) can bring significant benefits to users, who may otherwise struggle to make decisions regarding their personal data in environments saturated with privacy-related decision requests. However, no study has systematically inquired into the features of these AI-driven PPAs, their underlying technologies, or the accuracy of their decisions. To fill this gap, we present a Systematization of Knowledge (SoK) to map the existing solutions found in the scientific literature. We screened 1697 unique research papers over the last decade (2013-2023), constructing a classification from 39 included papers. As a result, this SoK reviews several aspects of existing research on AI-driven PPAs in terms of types of publications, contributions, methodological quality, and other quantitative insights. Furthermore, we provide a comprehensive classification of AI-driven PPAs, delving into their architectural choices, system contexts, types of AI used, data sources, types of decisions, and control over decisions, among other facets. Based on our SoK, we further underline the research gaps and challenges and formulate recommendations for the design and development of AI-driven PPAs, as well as avenues for future research.
https://arxiv.org/abs/2502.07693
Global leaders and policymakers are unified in their unequivocal commitment to decarbonization efforts in support of Net-Zero agreements. District Heating Systems (DHS), while contributing to carbon emissions due to the continued reliance on fossil fuels for heat production, are embracing more sustainable practices, albeit with some sense of vulnerability, as this could constrain their ability to adapt to dynamic demand and production scenarios. As demographic demands grow and renewables become the central strategy in decarbonizing the heating sector, the need for accurate demand forecasting has intensified. Advances in digitization have paved the way for Machine Learning (ML) based solutions to become the industry standard for modeling complex time series patterns. In this paper, we focus on building a Deep Learning (DL) model that uses deconstructed components of the independent and dependent variables that affect heat demand as features to perform multi-step ahead forecasting of heat demand. The model represents the input features in a time-frequency space and uses an attention mechanism to generate accurate forecasts. The proposed method is evaluated on a real-world dataset, and the forecasting performance is assessed against LSTM and CNN-based forecasting models. Across different supply zones, the attention-based model outperforms the baselines quantitatively and qualitatively, with a Mean Absolute Error (MAE) of 0.105 kWh with a standard deviation of 0.06 kWh and a Mean Absolute Percentage Error (MAPE) of 5.4% with a standard deviation of 2.8%, compared with the second-best model, which has an MAE of 0.10 kWh with a standard deviation of 0.06 kWh and a MAPE of 5.6% with a standard deviation of 3%.
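The MAE and MAPE figures reported above can be computed as follows; the series values here are invented for illustration and are not from the paper's dataset:

```python
def mae(actual, forecast):
    """Mean Absolute Error, in the same unit as the series (here, kWh)."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    """Mean Absolute Percentage Error; assumes no zero actual values."""
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

actual = [2.0, 2.5, 3.0, 2.8]    # observed heat demand over a forecast horizon (kWh)
forecast = [2.1, 2.4, 3.3, 2.6]  # multi-step-ahead predictions (kWh)

print(round(mae(actual, forecast), 3))   # 0.175
print(round(mape(actual, forecast), 2))  # 6.54
```

For multi-step forecasting, these errors are typically averaged over every step of every forecast window in the test set.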
https://arxiv.org/abs/2502.07854
Background. Recently, dynamic total-body positron emission tomography (PET) imaging has become possible due to new scanner devices. While clustering algorithms for PET analysis have been proposed before, there is still little research systematically evaluating these algorithms for the processing of dynamic total-body PET images. Materials and methods. Here, we compare the performance of 15 unsupervised clustering methods, including K-means either by itself or after principal component analysis (PCA) or independent component analysis (ICA), Gaussian mixture model (GMM), fuzzy c-means (FCM), agglomerative clustering, spectral clustering, and several newer clustering algorithms, for classifying time activity curves (TACs) in dynamic PET images. We use dynamic total-body $^{15}$O-water PET images collected from 30 patients with suspected or confirmed coronary artery disease. To evaluate the clustering algorithms quantitatively, we use them to classify 5000 TACs from each image based on whether the curve is taken from the brain, right heart ventricle, right kidney, lower right lung lobe, or urinary bladder. Results. According to our results, the best methods are GMM, FCM, and ICA combined with mini-batch K-means, which classified the TACs with median accuracies of 89%, 83%, and 81%, respectively, in a processing time of half a second or less on average per image. Conclusion. GMM, FCM, and ICA with mini-batch K-means show promise for dynamic total-body PET analysis.
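Scoring an unsupervised clustering against region labels, as in the evaluation above, is commonly done by mapping each cluster to its majority reference label and then computing accuracy. A minimal sketch with toy labels (this is the generic technique, not necessarily the paper's exact scoring code):

```python
from collections import Counter

def cluster_accuracy(cluster_ids, true_labels):
    """Map each cluster to its most frequent true label; return accuracy."""
    majority = {}
    for c in set(cluster_ids):
        members = [t for ci, t in zip(cluster_ids, true_labels) if ci == c]
        majority[c] = Counter(members).most_common(1)[0][0]
    correct = sum(majority[c] == t for c, t in zip(cluster_ids, true_labels))
    return correct / len(true_labels)

# Toy example: 8 TACs assigned to 3 clusters, with their source regions.
clusters = [0, 0, 1, 1, 1, 2, 2, 2]
regions = ["brain", "brain", "kidney", "kidney", "bladder", "lung", "lung", "lung"]
print(cluster_accuracy(clusters, regions))  # 0.875
```

In the paper's setting the `regions` list would contain the five anatomical sources (brain, right heart ventricle, right kidney, lower right lung lobe, urinary bladder) for the 5000 sampled TACs.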
https://arxiv.org/abs/2502.07511
In-hand manipulation using multiple dexterous fingers is a critical robotic skill that can reduce the reliance on large arm motions, thereby saving space and energy. This letter focuses on in-grasp object movement, which refers to manipulating an object to a desired pose through only finger motions within a stable grasp. The key challenge lies in simultaneously achieving high precision and large-range movements while maintaining a constant stable grasp. To address this problem, we propose a simple and practical approach based on kinematic trajectory optimization with no need for pretraining or object geometries, which can be easily applied to novel objects in real-world scenarios. Adopting this approach, we won the championship for the in-hand manipulation track at the 9th Robotic Grasping and Manipulation Competition (RGMC) held at ICRA 2024. Implementation details, discussion, and further quantitative experimental results are presented in this letter, which aims to comprehensively evaluate our approach and share our key takeaways from the competition. Supplementary materials including video and code are available at this https URL.
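The core idea of kinematic trajectory optimization is to choose intermediate waypoints between a start and goal configuration that minimize a smoothness cost while the endpoints stay fixed. A heavily simplified 1-D sketch, assuming a sum-of-squared-differences smoothness cost and plain gradient descent (the letter's actual formulation, constraints, and dimensionality differ):

```python
def optimize_trajectory(start, goal, n_waypoints=5, iters=2000, lr=0.1):
    """Gradient descent on interior waypoints of a 1-D joint trajectory.

    Cost = sum of squared consecutive differences (a velocity penalty);
    its gradient pulls each interior waypoint toward the average of its
    neighbors, while the start and goal configurations stay clamped.
    """
    traj = [start] + [start] * n_waypoints + [goal]
    for _ in range(iters):
        for i in range(1, len(traj) - 1):
            traj[i] += lr * ((traj[i - 1] + traj[i + 1]) / 2 - traj[i])
    return traj

traj = optimize_trajectory(0.0, 1.0)
print([round(q, 2) for q in traj])  # [0.0, 0.17, 0.33, 0.5, 0.67, 0.83, 1.0]
```

With only a smoothness cost the optimum is evenly spaced interpolation; the practical appeal, as in the letter, is that extra terms (joint limits, grasp-maintenance constraints) can be added to the same optimization without any pretraining or object model.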
https://arxiv.org/abs/2502.07472
We introduce JamendoMaxCaps, a large-scale music-caption dataset featuring over 200,000 freely licensed instrumental tracks from the renowned Jamendo platform. The dataset includes captions generated by a state-of-the-art captioning model, enhanced with imputed metadata. We also introduce a retrieval system that leverages both musical features and metadata to identify similar songs, which are then used to fill in missing metadata using a local large language model (LLLM). This approach allows us to provide a more comprehensive and informative dataset for researchers working on music-language understanding tasks. We validate this approach quantitatively with five different measurements. By making the JamendoMaxCaps dataset publicly available, we provide a high-quality resource to advance research in music-language understanding tasks such as music retrieval, multimodal representation learning, and generative music models.
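The retrieval-based imputation above can be pictured as nearest-neighbor lookup over musical feature vectors: find the most similar track whose metadata field is present, and borrow its value. A toy sketch (feature vectors, field names, and the `impute_field` helper are invented for illustration; the actual system also passes retrieved candidates through a local LLM to reconcile the imputed value):

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def impute_field(query_feat, field, catalog):
    """Fill `field` from the most similar catalog entry that has it set."""
    candidates = [t for t in catalog if t["meta"].get(field) is not None]
    best = max(candidates, key=lambda t: cosine(query_feat, t["feat"]))
    return best["meta"][field]

catalog = [
    {"feat": [0.9, 0.1, 0.0], "meta": {"genre": "ambient"}},
    {"feat": [0.1, 0.9, 0.2], "meta": {"genre": "rock"}},
]
print(impute_field([0.8, 0.2, 0.1], "genre", catalog))  # ambient
```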
https://arxiv.org/abs/2502.07461
This paper introduces CodeQUEST, a novel framework leveraging Large Language Models (LLMs) to iteratively evaluate and enhance code quality across multiple dimensions, including readability, maintainability, efficiency, and security. The framework is divided into two main components: an Evaluator that assesses code quality across ten dimensions, providing both quantitative scores and qualitative summaries, and an Optimizer that iteratively improves the code based on the Evaluator's feedback. Our study demonstrates that CodeQUEST can effectively and robustly evaluate code quality, with its assessments aligning closely with established code quality metrics. Through a series of experiments using a curated dataset of Python and JavaScript examples, CodeQUEST demonstrated significant improvements in code quality, achieving a mean relative percentage improvement of 52.6%. The framework's evaluations were validated against a set of proxy metrics comprising the Pylint Score, the Radon Maintainability Index, and Bandit output logs, showing a meaningful correlation. This highlights the potential of LLMs in automating code quality evaluation and improvement processes, presenting a significant advancement toward enhancing software development practices. The code implementation of the framework is available at: this https URL.
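The Evaluator/Optimizer interaction described above is an iterate-until-good-enough control loop. A minimal runnable sketch in which both LLM calls are stubbed with placeholder functions (the stubs, score scale, and stopping rule are illustrative assumptions, not CodeQUEST's actual prompts or thresholds):

```python
def evaluate(code):
    """Stub for the LLM Evaluator: return a quality score in [0, 10].

    Toy heuristic standing in for the LLM's ten-dimension assessment.
    """
    return 10 - code.count("TODO")

def optimize(code, feedback_score):
    """Stub for the LLM Optimizer: return an improved version of the code."""
    return code.replace("TODO", "", 1)  # pretend one issue is fixed per round

def codequest_loop(code, target=10, max_iters=5):
    """Alternate evaluation and optimization until the target score or budget."""
    score = evaluate(code)
    for _ in range(max_iters):
        if score >= target:
            break
        code = optimize(code, score)
        score = evaluate(code)
    return code, score

final_code, final_score = codequest_loop("x = 1  # TODO # TODO")
print(final_score)  # 10
```

In the real framework, `evaluate` would return per-dimension scores plus a qualitative summary, and `optimize` would feed that summary back into the rewriting prompt.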
https://arxiv.org/abs/2502.07399
Structural information in images is crucial for aesthetic assessment, and it is widely recognized in the artistic field that imitating the structure of other works significantly infringes on creators' rights. The advancement of diffusion models has led to AI-generated content imitating artists' structural creations, yet effective detection methods are still lacking. In this paper, we define this phenomenon as "structural infringement" and propose a corresponding detection method. Additionally, we develop quantitative metrics and create manually annotated datasets for evaluation: the SIA dataset of synthesized data, and the SIR dataset of real data. Due to the current lack of datasets for structural infringement detection, we propose a new data synthesis strategy based on diffusion models and LLM, successfully training a structural infringement detection model. Experimental results show that our method can successfully detect structural infringements and achieve notable improvements on annotated test sets.
https://arxiv.org/abs/2502.07323
Latent diffusion models have recently demonstrated superior capabilities in many downstream image synthesis tasks. However, customization of latent diffusion models using unauthorized data can severely compromise the privacy and intellectual property rights of data owners. Adversarial examples as protective perturbations have been developed to defend against unauthorized data usage by introducing imperceptible noise to customization samples, preventing diffusion models from effectively learning them. In this paper, we first reveal that the primary reason adversarial examples are effective as protective perturbations in latent diffusion models is the distortion of their latent representations, as demonstrated through qualitative and quantitative experiments. We then propose the Contrastive Adversarial Training (CAT) utilizing adapters as an adaptive attack against these protection methods, highlighting their lack of robustness. Extensive experiments demonstrate that our CAT method significantly reduces the effectiveness of protective perturbations in customization configurations, urging the community to reconsider and enhance the robustness of existing protective perturbation methods. Code is available at this https URL.
https://arxiv.org/abs/2502.07225
Large Language Models (LLMs) benefit from training on ever larger amounts of textual data, but as a result, they increasingly incur the risk of leaking private information. The ability to selectively remove knowledge from LLMs is, therefore, a highly desirable capability. In this paper, we propose LUNAR, a novel unlearning methodology grounded in the Linear Representation Hypothesis. LUNAR operates by redirecting the representations of unlearned data to regions that trigger the model's inherent ability to express its inability to answer. LUNAR achieves state-of-the-art unlearning performance while significantly enhancing the controllability of the unlearned model during inference. Specifically, LUNAR achieves improvements of 2.9x to 11.7x on the combined "unlearning efficacy" and "model utility" score ("Deviation Score") on the PISTOL dataset across various base models. We also demonstrate, through quantitative analysis and qualitative examples, LUNAR's superior controllability in generating coherent and contextually aware responses, mitigating undesired side effects of existing methods. Moreover, we demonstrate that LUNAR is robust against white-box adversarial attacks and versatile in handling real-world scenarios, such as processing sequential unlearning requests.
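The redirection idea above can be pictured, in heavily simplified form, as steering the hidden representation of a forgotten-topic prompt toward an activation region that elicits an "I can't answer that" response. A conceptual toy sketch with made-up 3-D vectors; the blending rule, vectors, and `redirect` helper are illustrative assumptions, and LUNAR's actual formulation differs in detail:

```python
def redirect(hidden, refusal_anchor, alpha=1.0):
    """Move `hidden` toward `refusal_anchor` by fraction `alpha` (0 = no change)."""
    return [h + alpha * (r - h) for h, r in zip(hidden, refusal_anchor)]

hidden = [0.4, -0.2, 0.7]   # toy activation for a forgotten-topic prompt
refusal = [-0.1, 0.9, 0.0]  # toy activation region expressing inability to answer
print([round(v, 2) for v in redirect(hidden, refusal, alpha=0.5)])  # [0.15, 0.35, 0.35]
```

The appeal of operating on representations rather than weights alone is controllability: representations of retained data are left untouched, so model utility is preserved.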
https://arxiv.org/abs/2502.07218
Image compression under ultra-low bitrates remains challenging for both conventional learned image compression (LIC) and generative vector-quantized (VQ) modeling. Conventional LIC suffers from severe artifacts due to heavy quantization, while generative VQ modeling gives poor fidelity due to the mismatch between learned generative priors and specific inputs. In this work, we propose Hybrid-Diffusion Image Compression (HDCompression), a dual-stream framework that utilizes both generative VQ-modeling and diffusion models, as well as conventional LIC, to achieve both high fidelity and high perceptual quality. Different from previous hybrid methods that directly use pre-trained LIC models to generate low-quality fidelity-preserving information from heavily quantized latents, we use diffusion models to extract high-quality complementary fidelity information from the ground-truth input, which can enhance the system performance in several aspects: improving indices map prediction, enhancing the fidelity-preserving output of the LIC stream, and refining conditioned image reconstruction with VQ-latent correction. In addition, our diffusion model is based on a dense representative vector (DRV), which is lightweight with very simple sampling schedulers. Extensive experiments demonstrate that our HDCompression outperforms previous conventional LIC, generative VQ-modeling, and hybrid frameworks in both quantitative metrics and qualitative visualization, providing balanced robust compression performance at ultra-low bitrates.
https://arxiv.org/abs/2502.07160
Anatomy evaluation is crucial for understanding the physiological state, diagnosing abnormalities, and guiding medical interventions. Statistical shape modeling (SSM) is vital in this process. By enabling the extraction of quantitative morphological shape descriptors from MRI and CT scans, SSM provides comprehensive descriptions of anatomical variations within a population. However, the effectiveness of SSM in anatomy evaluation hinges on the quality and robustness of the shape models. While deep learning techniques show promise in addressing these challenges by learning complex nonlinear representations of shapes, existing models still have limitations and often require pre-established shape models for training. To overcome these issues, we propose Mesh2SSM++, a novel approach that learns to estimate correspondences from meshes in an unsupervised manner. This method leverages unsupervised, permutation-invariant representation learning to estimate how to deform a template point cloud into subject-specific meshes, forming a correspondence-based shape model. Additionally, our probabilistic formulation allows learning a population-specific template, reducing potential biases associated with template selection. A key feature of Mesh2SSM++ is its ability to quantify aleatoric uncertainty, which captures inherent data variability and is essential for ensuring reliable model predictions and robust decision-making in clinical tasks, especially under challenging imaging conditions. Through extensive validation across diverse anatomies, evaluation metrics, and downstream tasks, we demonstrate that Mesh2SSM++ outperforms existing methods. Its ability to operate directly on meshes, combined with computational efficiency and interpretability through its probabilistic framework, makes it an attractive alternative to traditional and deep learning-based SSM approaches.
https://arxiv.org/abs/2502.07145
Diffusion models have been used extensively for high quality image and video generation tasks. In this paper, we propose a novel conditional diffusion model with spatial attention and latent embedding (cDAL) for medical image segmentation. In cDAL, a convolutional neural network (CNN) based discriminator is used at every time-step of the diffusion process to distinguish between the generated labels and the real ones. A spatial attention map is computed based on the features learned by the discriminator to help cDAL generate more accurate segmentation of discriminative regions in an input image. Additionally, we incorporated a random latent embedding into each layer of our model to significantly reduce the number of training and sampling time-steps, thereby making it much faster than other diffusion models for image segmentation. We applied cDAL on 3 publicly available medical image segmentation datasets (MoNuSeg, Chest X-ray and Hippocampus) and observed significant qualitative and quantitative improvements with higher Dice scores and mIoU over the state-of-the-art algorithms. The source code is publicly available at this https URL.
https://arxiv.org/abs/2502.06997
The recent development of powerful AI systems has highlighted the need for robust risk management frameworks in the AI industry. Although companies have begun to implement safety frameworks, current approaches often lack the systematic rigor found in other high-risk industries. This paper presents a comprehensive risk management framework for the development of frontier AI that bridges this gap by integrating established risk management principles with emerging AI-specific practices. The framework consists of four key components: (1) risk identification (through literature review, open-ended red-teaming, and risk modeling), (2) risk analysis and evaluation using quantitative metrics and clearly defined thresholds, (3) risk treatment through mitigation measures such as containment, deployment controls, and assurance processes, and (4) risk governance establishing clear organizational structures and accountability. Drawing from best practices in mature industries such as aviation or nuclear power, while accounting for AI's unique challenges, this framework provides AI developers with actionable guidelines for implementing robust risk management. The paper details how each component should be implemented throughout the life-cycle of the AI system - from planning through deployment - and emphasizes the importance and feasibility of conducting risk management work prior to the final training run to minimize the burden associated with it.
https://arxiv.org/abs/2502.06656
Manipulating the material appearance of objects in images is critical for applications like augmented reality, virtual prototyping, and digital content creation. We present MaterialFusion, a novel framework for high-quality material transfer that allows users to adjust the degree of material application, achieving an optimal balance between new material properties and the object's original features. MaterialFusion seamlessly integrates the modified object into the scene by maintaining background consistency and mitigating boundary artifacts. To thoroughly evaluate our approach, we have compiled a dataset of real-world material transfer examples and conducted complex comparative analyses. Through comprehensive quantitative evaluations and user studies, we demonstrate that MaterialFusion significantly outperforms existing methods in terms of quality, user control, and background preservation. Code is available at this https URL.
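The adjustable "degree of material application" can be pictured as a continuous blend between the object's original appearance and the fully material-transferred result. A toy per-pixel sketch (the pixel values and the linear blending rule are illustrative; MaterialFusion controls this degree inside the generation process rather than by post-hoc pixel blending):

```python
def blend(original, transferred, degree):
    """degree = 0 keeps the original; degree = 1 applies the material fully."""
    return [(1 - degree) * o + degree * t for o, t in zip(original, transferred)]

original = [0.2, 0.4, 0.6]     # toy RGB pixel of the object
transferred = [0.8, 0.1, 0.5]  # same pixel after full material transfer
print([round(v, 2) for v in blend(original, transferred, 0.5)])  # [0.5, 0.25, 0.55]
```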
https://arxiv.org/abs/2502.06606
Quantitative Artificial Intelligence (AI) Benchmarks have emerged as fundamental tools for evaluating the performance, capability, and safety of AI models and systems. Currently, they shape the direction of AI development and are playing an increasingly prominent role in regulatory frameworks. As their influence grows, however, so too do concerns about how and with what effects they evaluate highly sensitive topics such as capabilities, including high-impact capabilities, safety and systemic risks. This paper presents an interdisciplinary meta-review of about 100 studies that discuss shortcomings in quantitative benchmarking practices, published in the last 10 years. It brings together many fine-grained issues in the design and application of benchmarks (such as biases in dataset creation, inadequate documentation, data contamination, and failures to distinguish signal from noise) with broader sociotechnical issues (such as an over-focus on evaluating text-based AI models according to one-time testing logic that fails to account for how AI models are increasingly multimodal and interact with humans and other technical systems). Our review also highlights a series of systemic flaws in current benchmarking practices, such as misaligned incentives, construct validity issues, unknown unknowns, and problems with the gaming of benchmark results. Furthermore, it underscores how benchmark practices are fundamentally shaped by cultural, commercial and competitive dynamics that often prioritise state-of-the-art performance at the expense of broader societal concerns. By providing an overview of risks associated with existing benchmarking procedures, we problematise disproportionate trust placed in benchmarks and contribute to ongoing efforts to improve the accountability and relevance of quantitative AI benchmarks within the complexities of real-world scenarios.
Quantitative AI benchmarks have become fundamental tools for evaluating the performance, capability, and safety of AI models and systems. They currently shape the direction of AI development and play an increasingly prominent role in regulatory frameworks. As their influence grows, however, so too do concerns about how, and with what effects, they evaluate sensitive topics such as capabilities (including high-impact capabilities), safety, and systemic risks. This paper is an interdisciplinary meta-review of roughly 100 studies from the past decade that discuss shortcomings in quantitative benchmarking practices. The review brings together many fine-grained issues in benchmark design and application (such as biases in dataset creation, inadequate documentation, data contamination, and the failure to distinguish signal from noise) with broader sociotechnical issues (such as an over-focus on evaluating text-based AI models under a one-time testing logic that neglects how AI models are increasingly multimodal and interact with humans and other technical systems). It further highlights several systemic flaws in current benchmarking practice, including misaligned incentives, construct validity problems, unknown unknowns, and the gaming of benchmark results. More importantly, the paper underscores how benchmark practices are fundamentally shaped by cultural, commercial, and competitive dynamics that often prioritise state-of-the-art performance at the expense of broader societal concerns. By surveying the risks of existing benchmarking procedures, we problematise the disproportionate trust placed in benchmarks and contribute to ongoing efforts to improve the accountability and relevance of quantitative AI benchmarks within the complexities of real-world scenarios.
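One concrete instance of the "signal vs. noise" failure mode the review names: leaderboard gaps are often smaller than the sampling noise of the benchmark itself. A rough sketch, with hypothetical accuracies and a deliberately crude independence assumption (it ignores per-item correlation between the two models):

```python
import math

def accuracy_se(acc, n):
    """Binomial standard error of an accuracy estimate over n test items."""
    return math.sqrt(acc * (1 - acc) / n)

def is_significant(acc_a, acc_b, n, z=1.96):
    """Is the accuracy gap larger than the combined 95% sampling noise?
    Crude: treats the two models' errors as independent."""
    se = math.sqrt(accuracy_se(acc_a, n) ** 2 + accuracy_se(acc_b, n) ** 2)
    return abs(acc_a - acc_b) > z * se

# A 1-point lead on a 500-item benchmark is well within noise...
print(is_significant(0.81, 0.80, 500))      # False
# ...while the same lead on a 50,000-item benchmark is not.
print(is_significant(0.81, 0.80, 50_000))   # True
```

The point is not the exact test (a paired test would be more appropriate) but that benchmark result tables rarely report this uncertainty at all.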
https://arxiv.org/abs/2502.06559
Point clouds captured with laser scanning systems from forest environments can be utilized in a wide variety of applications within forestry and plant ecology, such as the estimation of tree stem attributes, leaf angle distribution, and above-ground biomass. However, effectively utilizing the data in such tasks requires the semantic segmentation of the data into wood and foliage points, also known as leaf-wood separation. The traditional approach to leaf-wood separation has been geometry- and radiometry-based unsupervised algorithms, which tend to perform poorly on data captured with airborne laser scanning (ALS) systems, even with a high point density. While recent machine and deep learning approaches achieve great results even on sparse point clouds, they require manually labeled training data, which is often extremely laborious to produce. Multispectral (MS) information has been demonstrated to have potential for improving the accuracy of leaf-wood separation, but quantitative assessment of its effects has been lacking. This study proposes a fully unsupervised deep learning method, GrowSP-ForMS, which is specifically designed for leaf-wood separation of high-density MS ALS point clouds and based on the GrowSP architecture. GrowSP-ForMS achieved a mean accuracy of 84.3% and a mean intersection over union (mIoU) of 69.6% on our MS test set, outperforming the unsupervised reference methods by a significant margin. When compared to supervised deep learning methods, our model performed similarly to the slightly older PointNet architecture but was outclassed by more recent approaches. Finally, two ablation studies were conducted, which demonstrated that our proposed changes increased the test set mIoU of GrowSP-ForMS by 29.4 percentage points (pp) in comparison to the original GrowSP model and that utilizing MS data improved the mIoU by 5.6 pp from the monospectral case.
Point clouds captured with laser scanning systems in forest environments can serve a wide range of applications in forestry and plant ecology, such as estimating tree stem attributes, leaf angle distribution, and above-ground biomass. Effectively using the data in these tasks, however, requires semantically segmenting the points into wood and foliage, also known as leaf-wood separation. Traditionally, leaf-wood separation has relied on geometry- and radiometry-based unsupervised algorithms, which tend to perform poorly on data captured with airborne laser scanning (ALS) systems, even at high point densities. While recent machine and deep learning approaches achieve excellent results even on sparse point clouds, they require manually labeled training data, which is often extremely laborious to produce. Multispectral (MS) information has been shown to have the potential to improve leaf-wood separation accuracy, but quantitative assessments of its effect have been lacking. This study proposes a fully unsupervised deep learning method, GrowSP-ForMS, designed specifically for leaf-wood separation of high-density MS ALS point clouds and based on the GrowSP architecture. On our MS test set, GrowSP-ForMS achieved a mean accuracy of 84.3% and a mean intersection over union (mIoU) of 69.6%, outperforming the unsupervised reference methods by a significant margin. Compared with supervised deep learning methods, our model performed similarly to the slightly older PointNet architecture but was outclassed by more recent approaches. Finally, two ablation studies showed that our proposed changes increased the test set mIoU of GrowSP-ForMS by 29.4 percentage points (pp) over the original GrowSP model, and that using MS data improved the mIoU by 5.6 pp compared with the monospectral case.
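The abstract reports mean accuracy and mIoU for the two-class (wood vs. foliage) segmentation. For readers unfamiliar with the metric, a small sketch of how both are computed from per-point labels; the label encoding and toy data are illustrative assumptions:

```python
import numpy as np

def mean_accuracy_and_miou(y_true, y_pred, num_classes=2):
    """Overall accuracy and mean intersection-over-union for a point-wise
    segmentation. y_true, y_pred: integer per-point class labels
    (e.g. 0 = foliage, 1 = wood)."""
    accuracy = np.mean(y_true == y_pred)
    ious = []
    for c in range(num_classes):
        inter = np.sum((y_true == c) & (y_pred == c))
        union = np.sum((y_true == c) | (y_pred == c))
        # A class absent from both truth and prediction is skipped.
        ious.append(inter / union if union > 0 else np.nan)
    return accuracy, np.nanmean(ious)

# Toy labels for six points: four correct, two confused.
truth = np.array([0, 0, 0, 1, 1, 1])
pred  = np.array([0, 0, 1, 1, 1, 0])
acc, miou = mean_accuracy_and_miou(truth, pred)
```

Note that mIoU penalizes class confusion more heavily than plain accuracy, which is why the paper reports both (84.3% accuracy vs. 69.6% mIoU).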
https://arxiv.org/abs/2502.06227
Medical images often exhibit low and blurred contrast between lesions and surrounding tissues, with considerable variation in lesion edges and shapes even within the same disease, leading to significant challenges in segmentation. Therefore, precise segmentation of lesions has become an essential prerequisite for patient condition assessment and formulation of treatment plans. Significant achievements have been made in research related to the U-Net model in recent years. It improves segmentation performance and is extensively applied in the semantic segmentation of medical images to offer technical support for consistent quantitative lesion analysis methods. First, this paper classifies medical image datasets on the basis of their imaging modalities and then examines U-Net and its various improvement models from the perspective of structural modifications. The research objectives, innovative designs, and limitations of each approach are discussed in detail. Second, we summarize the four central improvement mechanisms of the U-Net and U-Net variant algorithms: the jump-connection mechanism, residual-connection mechanism, 3D-UNet, and transformer mechanism. Finally, we examine the relationships among the four core enhancement mechanisms and commonly utilized medical datasets and propose potential avenues and strategies for future advancements. This paper provides a systematic summary and reference for researchers in related fields, and we look forward to designing more efficient and stable medical image segmentation network models based on the U-Net network.
Medical images often exhibit low and blurred contrast between lesions and surrounding tissues, and lesion edges and shapes vary considerably even within the same disease, making segmentation highly challenging. Precise lesion segmentation has therefore become an essential prerequisite for assessing patient condition and formulating treatment plans. In recent years, significant achievements have been made in research on the U-Net model, which improves segmentation performance and is extensively applied to the semantic segmentation of medical images, providing technical support for consistent quantitative lesion analysis. This paper first classifies medical image datasets by imaging modality and then examines U-Net and its various improved models from the perspective of structural modifications, discussing in detail the research objectives, innovative designs, and limitations of each approach. Second, we summarize the four central improvement mechanisms of the U-Net and U-Net variant algorithms: the jump-connection mechanism, the residual-connection mechanism, 3D-UNet, and the transformer mechanism. Finally, we examine the relationships among these four core enhancement mechanisms and commonly used medical datasets, and propose potential avenues and strategies for future advancements. This paper provides a systematic summary and reference for researchers in related fields, and we look forward to the design of more efficient and stable medical image segmentation network models based on the U-Net network.
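The first of the four mechanisms, the jump connection (more commonly called a skip connection), concatenates an encoder feature map with the upsampled decoder feature map at the same resolution, letting fine spatial detail bypass the bottleneck. A framework-free NumPy sketch of the tensor flow, with hypothetical shapes and nearest-neighbour upsampling standing in for learned layers:

```python
import numpy as np

def downsample(x):
    """2x2 max pooling over a (C, H, W) feature map."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def skip_connect(encoder_feat, decoder_feat):
    """U-Net style jump/skip connection: concatenate the encoder feature
    map with the upsampled decoder feature map along the channel axis."""
    return np.concatenate([encoder_feat, upsample(decoder_feat)], axis=0)

enc = np.random.rand(8, 32, 32)         # encoder features, full resolution
bottleneck = downsample(enc)            # coarse features, (8, 16, 16)
merged = skip_connect(enc, bottleneck)  # (16, 32, 32): fine + coarse context
```

The residual-connection mechanism differs in that it *adds* the shortcut to the output (requiring matching channel counts) instead of concatenating it.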
https://arxiv.org/abs/2502.06895