Recent advances in 2D image generation have achieved remarkable quality, largely driven by the capacity of diffusion models and the availability of large-scale datasets. However, direct 3D generation is still constrained by the scarcity and lower fidelity of 3D datasets. In this paper, we introduce Zero-1-to-G, a novel approach that addresses this problem by enabling direct single-view generation on Gaussian splats using pretrained 2D diffusion models. Our key insight is that Gaussian splats, a 3D representation, can be decomposed into multi-view images encoding different attributes. This reframes the challenging task of direct 3D generation within a 2D diffusion framework, allowing us to leverage the rich priors of pretrained 2D diffusion models. To incorporate 3D awareness, we introduce cross-view and cross-attribute attention layers, which capture complex correlations and enforce 3D consistency across generated splats. This makes Zero-1-to-G the first direct image-to-3D generative model to effectively utilize pretrained 2D diffusion priors, enabling efficient training and improved generalization to unseen objects. Extensive experiments on both synthetic and in-the-wild datasets demonstrate superior performance in 3D object generation, offering a new approach to high-quality 3D generation.
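The splat-to-image decomposition described above can be sketched as a reversible mapping between a flat array of per-Gaussian attributes and a set of per-attribute images. The attribute layout below (position, color, opacity, scale, rotation) and the single-view packing are illustrative assumptions; the abstract does not specify the exact splat parameterization.

```python
import numpy as np

# Hypothetical attribute layout: 3 position + 3 color + 1 opacity
# + 3 scale + 4 rotation (quaternion) = 14 channels per Gaussian.
ATTRS = {"position": 3, "color": 3, "opacity": 1, "scale": 3, "rotation": 4}

def splats_to_attribute_images(splats: np.ndarray, h: int, w: int) -> dict:
    """Decompose an (h*w, 14) splat array into per-attribute (h, w, c) images."""
    assert splats.shape == (h * w, sum(ATTRS.values()))
    images, offset = {}, 0
    for name, c in ATTRS.items():
        images[name] = splats[:, offset:offset + c].reshape(h, w, c)
        offset += c
    return images

def attribute_images_to_splats(images: dict) -> np.ndarray:
    """Inverse mapping: flatten attribute images back into a splat array."""
    h, w, _ = images["position"].shape
    return np.concatenate(
        [images[name].reshape(h * w, c) for name, c in ATTRS.items()], axis=1
    )
```

The round trip is the point: a 2D diffusion model can generate the attribute images, and `attribute_images_to_splats` reassembles them into renderable Gaussians.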
https://arxiv.org/abs/2501.05427
Missing data in time-series analysis poses significant challenges, affecting the reliability of downstream applications. Imputation, the process of estimating missing values, has emerged as a key solution. This paper introduces BRATI, a novel deep-learning model designed to address multivariate time-series imputation by combining Bidirectional Recurrent Networks and Attention mechanisms. BRATI processes temporal dependencies and feature correlations across long and short time horizons, utilizing two imputation blocks that operate in opposite temporal directions. Each block integrates recurrent layers and attention mechanisms to effectively resolve long-term dependencies. We evaluate BRATI on three real-world datasets under diverse missing-data scenarios: randomly missing values, fixed-length missing sequences, and variable-length missing sequences. Our findings demonstrate that BRATI consistently outperforms state-of-the-art models, delivering superior accuracy and robustness in imputing multivariate time-series data.
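A minimal sketch of the two opposite-direction imputation blocks, with naive carry-forward fills standing in for BRATI's recurrent-plus-attention blocks (which learn the directional estimates rather than copying values):

```python
import numpy as np

def directional_fill(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Carry the last observed value forward: a stand-in for one
    imputation block running in one temporal direction."""
    out = x.copy()
    last = np.nan
    for t in range(len(x)):
        if mask[t]:
            last = x[t]
        elif not np.isnan(last):
            out[t] = last
    return out

def bidirectional_impute(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Combine a forward and a backward pass, mimicking BRATI's two
    opposite-direction blocks; observed values are kept as-is."""
    fwd = directional_fill(x, mask)
    bwd = directional_fill(x[::-1], mask[::-1])[::-1]
    both = np.nanmean(np.stack([fwd, bwd]), axis=0)
    return np.where(mask, x, both)
```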
https://arxiv.org/abs/2501.05401
Unlike human-engineered systems such as aeroplanes, where each component's role and dependencies are well understood, the inner workings of AI models remain largely opaque, hindering verifiability and undermining trust. This paper introduces SemanticLens, a universal explanation method for neural networks that maps hidden knowledge encoded by components (e.g., individual neurons) into the semantically structured, multimodal space of a foundation model such as CLIP. In this space, unique operations become possible, including (i) textual search to identify neurons encoding specific concepts, (ii) systematic analysis and comparison of model representations, (iii) automated labelling of neurons and explanation of their functional roles, and (iv) audits to validate decision-making against requirements. Fully scalable and operating without human input, SemanticLens is shown to be effective for debugging and validation, summarizing model knowledge, aligning reasoning with expectations (e.g., adherence to the ABCDE-rule in melanoma classification), and detecting components tied to spurious correlations and their associated training data. By enabling component-level understanding and validation, the proposed approach helps bridge the "trust gap" between AI models and traditional engineered systems. We provide code for SemanticLens on this https URL and a demo on this https URL.
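Operation (i), textual search for neurons, reduces to a nearest-neighbor lookup once neurons and text live in the same embedding space. The sketch below uses placeholder vectors where SemanticLens would use CLIP embeddings:

```python
import numpy as np

def top_neurons(neuron_emb: np.ndarray, query_emb: np.ndarray, k: int = 3):
    """Rank neurons by cosine similarity between their embeddings and a
    text query embedding in a shared CLIP-like space. The embeddings are
    placeholders; SemanticLens obtains them from a foundation model."""
    n = neuron_emb / np.linalg.norm(neuron_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    scores = n @ q
    order = np.argsort(-scores)[:k]
    return order.tolist(), scores[order].tolist()
```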
https://arxiv.org/abs/2501.05398
In recent years, tremendous success has been witnessed in Retrieval-Augmented Generation (RAG), widely used to enhance Large Language Models (LLMs) in domain-specific, knowledge-intensive, and privacy-sensitive tasks. However, attackers may steal those valuable RAGs and deploy or commercialize them, making it essential to detect Intellectual Property (IP) infringement. Most existing ownership protection solutions, such as watermarks, are designed for relational databases and texts. They cannot be directly applied to RAGs because relational database watermarks require white-box access to detect IP infringement, which is unrealistic for the knowledge base in RAGs. Meanwhile, post-processing by the adversary's deployed LLMs typically destroys text watermark information. To address those problems, we propose a novel black-box "knowledge watermark" approach, named RAG-WM, to detect IP infringement of RAGs. RAG-WM uses a multi-LLM interaction framework, comprising a Watermark Generator, Shadow LLM & RAG, and Watermark Discriminator, to create watermark texts based on watermark entity-relationship tuples and inject them into the target RAG. We evaluate RAG-WM across three domain-specific and two privacy-sensitive tasks on four benchmark LLMs. Experimental results show that RAG-WM effectively detects the stolen RAGs in various deployed LLMs. Furthermore, RAG-WM is robust against paraphrasing, unrelated content removal, knowledge insertion, and knowledge expansion attacks. Lastly, RAG-WM can also evade watermark detection approaches, highlighting its promising application in detecting IP infringement of RAG systems.
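A toy sketch of the watermark lifecycle, with plain string templates standing in for the Watermark Generator and a substring check standing in for the Shadow LLM & RAG and Watermark Discriminator (both of which are LLM-based in RAG-WM); all names and thresholds below are invented for illustration:

```python
def make_watermark_texts(tuples):
    """Turn watermark entity-relationship tuples into injectable sentences."""
    return [f"{h} {r} {t}." for h, r, t in tuples]

def inject(kb: list, wm_texts: list) -> list:
    """Add watermark texts to a RAG knowledge base."""
    return kb + wm_texts

def verify(suspect_kb: list, tuples, threshold: float = 0.5) -> bool:
    """Declare infringement if enough watermark facts are recoverable
    from the suspect knowledge base."""
    hits = sum(
        any(h in doc and t in doc for doc in suspect_kb) for h, _, t in tuples
    )
    return hits / len(tuples) >= threshold
```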
https://arxiv.org/abs/2501.05249
Foreground segmentation is a fundamental task in computer vision, encompassing various subdivision tasks. Previous research has typically designed task-specific architectures for each task, leading to a lack of unification. Moreover, they primarily focus on recognizing foreground objects without effectively distinguishing them from the background. In this paper, we emphasize the importance of the background and its relationship with the foreground. We introduce FOCUS, the Foreground ObjeCts Universal Segmentation framework that can handle multiple foreground tasks. We develop a multi-scale semantic network using the edge information of objects to enhance image features. To achieve boundary-aware segmentation, we propose a novel distillation method, integrating the contrastive learning strategy to refine the prediction mask in multi-modal feature space. We conduct extensive experiments on a total of 13 datasets across 5 tasks, and the results demonstrate that FOCUS consistently outperforms the state-of-the-art task-specific models on most metrics.
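As a concrete stand-in for the object-edge information fed into the multi-scale semantic network, an edge magnitude can be computed with Sobel filters; the paper's actual edge extractor is not specified in the abstract:

```python
import numpy as np

def sobel_edges(img: np.ndarray) -> np.ndarray:
    """Gradient-magnitude edge map via 3x3 Sobel filters (valid region
    only) -- a generic example of the edge cues FOCUS uses to enhance
    image features."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            out[i, j] = np.hypot((patch * kx).sum(), (patch * ky).sum())
    return out
```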
https://arxiv.org/abs/2501.05238
Convolutional Neural Networks (CNNs) have drawn researchers' attention for identifying cattle using muzzle images. However, CNNs often fail to capture long-range dependencies within the complex patterns of the muzzle, a challenge that transformers handle well. This inspired us to fuse the strengths of CNNs and transformers in muzzle-based cattle identification. Addition and concatenation have been the most commonly used techniques for feature fusion. However, addition fails to preserve discriminative information, while concatenation increases dimensionality. Both are simple operations that cannot discover the relationships or interactions between the fused features. To overcome these issues, this research introduces a novel approach called Multi-Head Attention Feature Fusion (MHAFF), applied for the first time to cattle identification. MHAFF captures relations between the different types of fused features while preserving their originality. The experiments show that MHAFF outperforms addition and concatenation techniques as well as existing cattle identification methods in accuracy on two publicly available cattle datasets. MHAFF demonstrates excellent performance, converging quickly to optimum accuracies of 99.88% and 99.52% on the two datasets, respectively.
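The fusion idea can be sketched as scaled dot-product cross-attention, where CNN features query transformer features: unlike concatenation, the output keeps the CNN tokens' dimensionality, and unlike addition, it weights related features explicitly. MHAFF uses learned multi-head projections, which are omitted here to keep the sketch self-contained:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(f_cnn: np.ndarray, f_trans: np.ndarray) -> np.ndarray:
    """Single-head cross-attention fusion: CNN tokens attend over
    transformer tokens, so each output row is a convex combination of
    transformer features, shaped like the CNN input."""
    d = f_cnn.shape[-1]
    weights = softmax(f_cnn @ f_trans.T / np.sqrt(d), axis=-1)
    return weights @ f_trans  # same shape as f_cnn, unlike concatenation
```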
https://arxiv.org/abs/2501.05209
Context. Developing secure and reliable software remains a key challenge in software engineering (SE). The ever-evolving technological landscape offers both opportunities and threats, creating a dynamic space where chaos and order compete. Secure software engineering (SSE) must continuously address vulnerabilities that endanger software systems and carry broader socio-economic risks, such as compromising critical national infrastructure and causing significant financial losses. Researchers and practitioners have explored methodologies like Static Application Security Testing Tools (SASTTs) and artificial intelligence (AI) approaches, including machine learning (ML) and large language models (LLMs), to detect and mitigate these vulnerabilities. Each method has unique strengths and limitations. Aim. This thesis seeks to bring order to the chaos in SSE by addressing domain-specific differences that impact AI accuracy. Methodology. The research employs a mix of empirical strategies, such as evaluating effort-aware metrics, analyzing SASTTs, conducting method-level analysis, and leveraging evidence-based techniques like systematic dataset reviews. These approaches help characterize vulnerability prediction datasets. Results. Key findings include limitations in static analysis tools for identifying vulnerabilities, gaps in SASTT coverage of vulnerability types, weak relationships among vulnerability severity scores, improved defect prediction accuracy using just-in-time modeling, and threats posed by untouched methods. Conclusions. This thesis highlights the complexity of SSE and the importance of contextual knowledge in improving AI-driven vulnerability and defect prediction. The comprehensive analysis advances effective prediction models, benefiting both researchers and practitioners.
https://arxiv.org/abs/2501.05165
Document-Level Biomedical Relation Extraction (Bio-RE) aims to identify relations between biomedical entities within extensive texts, serving as a crucial subfield of biomedical text mining. Existing Bio-RE methods struggle with cross-sentence inference, which is essential for capturing relations spanning multiple sentences. Moreover, previous methods often overlook the incompleteness of documents and lack the integration of external knowledge, limiting contextual richness. Besides, the scarcity of annotated data further hampers model training. Recent advancements in large language models (LLMs) have inspired us to explore all the above issues for document-level Bio-RE. Specifically, we propose a document-level Bio-RE framework via LLM Adaptive Document-Relation Cross-Mapping (ADRCM) Fine-Tuning and Concept Unique Identifier (CUI) Retrieval-Augmented Generation (RAG). First, we introduce the Iteration-of-REsummary (IoRs) prompt for solving the data scarcity issue. In this way, Bio-RE task-specific synthetic data can be generated by guiding ChatGPT to focus on entity relations and iteratively refining synthetic data. Next, we propose ADRCM fine-tuning, a novel fine-tuning recipe that establishes mappings across different documents and relations, enhancing the model's contextual understanding and cross-sentence inference capabilities. Finally, during the inference, a biomedical-specific RAG approach, named CUI RAG, is designed to leverage CUIs as indexes for entities, narrowing the retrieval scope and enriching the relevant document contexts. Experiments conducted on three Bio-RE datasets (GDA, CDR, and BioRED) demonstrate the state-of-the-art performance of our proposed method by comparing it with other related works.
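The CUI RAG retrieval step can be sketched as an inverted index keyed by Concept Unique Identifiers: documents are indexed by the CUIs they mention, and a query touches only documents sharing a CUI with its entities, which is how the retrieval scope narrows. Entity-to-CUI resolution (via the UMLS in practice) is assumed done upstream, and the CUIs below are illustrative:

```python
from collections import defaultdict

def build_cui_index(docs: dict) -> dict:
    """docs: doc_id -> set of CUIs mentioned in that document.
    Returns an inverted index CUI -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, cuis in docs.items():
        for cui in cuis:
            index[cui].add(doc_id)
    return index

def retrieve(index: dict, query_cuis: set) -> set:
    """Union of documents indexed under any query CUI."""
    hits = set()
    for cui in query_cuis:
        hits |= index.get(cui, set())
    return hits
```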
https://arxiv.org/abs/2501.05155
Data-driven soft sensors (DDSS) have become mainstream methods for predicting key performance indicators in process industries. However, DDSS development requires complex and costly customized designs tailored to various tasks during the modeling process. Moreover, DDSS are constrained to a single structured data modality, limiting their ability to incorporate additional contextual knowledge. Furthermore, DDSSs' limited representation learning leads to weak predictive performance with scarce data. To address these challenges, we propose a general framework named LLM-TKESS (large language model for text-based knowledge-embedded soft sensing), harnessing the powerful general problem-solving capabilities, cross-modal knowledge transfer abilities, and few-shot capabilities of LLMs for enhanced soft sensing modeling. Specifically, an auxiliary variable series encoder (AVS Encoder) is proposed to unleash the LLM's potential for capturing temporal relationships within series and spatial semantic relationships among auxiliary variables. Then, we propose a two-stage fine-tuning alignment strategy: in the first stage, parameter-efficient fine-tuning through autoregressive training adjusts the LLM to rapidly accommodate process variable data, resulting in a soft sensing foundation model (SSFM); subsequently, by training adapters, we adapt the SSFM to various downstream tasks without modifying its architecture. Building on this, we propose two text-based knowledge-embedded soft sensors, integrating new natural language modalities to overcome the limitations of pure structured data models. Furthermore, benefiting from the LLM's pre-existing world knowledge, our model demonstrates outstanding predictive capabilities in small-sample conditions. Using the thermal deformation of an air preheater rotor as a case study, we validate through extensive experiments that LLM-TKESS exhibits outstanding performance.
https://arxiv.org/abs/2501.05075
This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious correlations between videos and likely answers, reinforced by their black-box nature and remaining benchmarking biases. Our method explicitly grounds VQA tasks to video fragments in four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A vital benefit of the method is its generalizability to current video and image-based VLMs across reasoning types. To support fair evaluation, we devise a de-biasing procedure based on large-language models that rewrites VQA benchmark answer sets to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks highlight the impact of our method components across benchmarks, VLMs, and reasoning types.
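The tree reasoning step can be sketched as a bottom-up reduction over per-node video-language entailment scores (produced, in the paper, by a VLM verifier); the min/max aggregation for AND/OR nodes below is one plausible choice, not necessarily the paper's exact operator:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    statement: str
    score: float = 1.0          # leaf entailment score in [0, 1]
    op: str = "leaf"            # "leaf", "and", or "or"
    children: list = field(default_factory=list)

def tree_score(node: Node) -> float:
    """Reduce the entailment tree bottom-up: AND nodes take the minimum
    of their children, OR nodes the maximum, leaves return their own
    video-grounded entailment score."""
    if node.op == "leaf":
        return node.score
    child_scores = [tree_score(c) for c in node.children]
    return min(child_scores) if node.op == "and" else max(child_scores)
```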
https://arxiv.org/abs/2501.05069
Oceanographers rely on visual analysis to interpret model simulations, identify events and phenomena, and track dynamic ocean processes. The ever-increasing resolution and complexity of ocean data, driven by its dynamic nature and multivariate relationships, demand a scalable and adaptable visualization tool for interactive exploration. We introduce pyParaOcean, a scalable and interactive visualization system designed specifically for ocean data analysis. pyParaOcean offers specialized modules for common oceanographic analysis tasks, including eddy identification and salinity movement tracking. These modules seamlessly integrate with ParaView as filters, ensuring a user-friendly and easy-to-use system while leveraging the parallelization capabilities of ParaView and its plethora of inbuilt general-purpose visualization functionalities. The creation of an auxiliary dataset stored as a Cinema database helps address I/O and network bandwidth bottlenecks while supporting the generation of quick overview visualizations. We present a case study on the Bay of Bengal (BoB) to demonstrate the utility of the system and scaling studies to evaluate its efficiency.
https://arxiv.org/abs/2501.05009
The discovery of causal relationships from observed data has attracted significant interest from disciplines such as economics, social sciences, epidemiology, and biology. In practical applications, considerable knowledge of the underlying systems is often unavailable, and real data are often associated with nonlinear causal structures, which make the direct use of most conventional causality analysis methods difficult. This study proposes a novel quantum Peter-Clark (qPC) algorithm for causal discovery that does not assume any underlying model structures. Based on the independence conditional tests in a class of reproducing kernel Hilbert spaces characterized by quantum circuits, the proposed qPC algorithm can explore causal relationships from the observed data drawn from arbitrary distributions. We conducted systematic experiments on fundamental graph parts of causal structures, demonstrating that the qPC algorithm exhibits a significantly better performance, particularly with smaller sample sizes compared to its classical counterpart. Furthermore, we proposed a novel optimization approach based on Kernel Target Alignment (KTA) for determining hyperparameters of quantum kernels. This method effectively reduced the risk of false positives in causal discovery, enabling more reliable inference. Our theoretical and experimental results demonstrate that the proposed quantum algorithm can empower classical algorithms for robust and accurate inference in causal discovery, supporting them in regimes where classical algorithms typically fail. Additionally, the effectiveness of this method was validated using the Boston Housing dataset as a real-world application. These findings demonstrate the new potential of quantum circuit-based causal discovery methods in addressing practical challenges, particularly in small-sample scenarios where traditional approaches have shown limitations.
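The KTA objective used for hyperparameter selection is the standard kernel-target alignment score, the normalized Frobenius inner product between the kernel matrix and the label outer product:

```python
import numpy as np

def kernel_target_alignment(K: np.ndarray, y: np.ndarray) -> float:
    """A(K, yy^T) = <K, yy^T>_F / (||K||_F * ||yy^T||_F), with y a
    vector of +/-1 labels. Values near 1 mean the kernel's similarity
    structure matches the label structure."""
    Y = np.outer(y, y)
    return float((K * Y).sum() / (np.linalg.norm(K) * np.linalg.norm(Y)))
```

Maximizing this score over quantum-kernel hyperparameters pushes the kernel toward the target structure, which is the mechanism the abstract credits for reducing false positives in the conditional-independence tests.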
https://arxiv.org/abs/2501.05007
Transformers have been extensively explored for hyperspectral image (HSI) classification. However, transformers pose challenges in terms of speed and memory usage because of their quadratic computational complexity. Recently, the Mamba model has emerged as a promising approach, offering strong long-distance modeling capabilities while maintaining linear computational complexity. However, representing an HSI is challenging for Mamba because it requires integrated spatial and spectral understanding. To remedy these drawbacks, we propose a novel HSI classification model based on Mamba, named MambaHSI, which can simultaneously model long-range interactions across the whole image and integrate spatial and spectral information in an adaptive manner. Specifically, we design a spatial Mamba block (SpaMB) to model the long-range interaction of the whole image at the pixel level. Then, we propose a spectral Mamba block (SpeMB) to split the spectral vector into multiple groups, mine the relations across different spectral groups, and extract spectral features. Finally, we propose a spatial-spectral fusion module (SSFM) to adaptively integrate the spatial and spectral features of an HSI. To the best of our knowledge, this is the first image-level HSI classification model based on Mamba. We conduct extensive experiments on four diverse HSI datasets. The results demonstrate the effectiveness and superiority of the proposed model for HSI classification, revealing the great potential of Mamba as a next-generation backbone for HSI models. Code is available at this https URL.
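Two of the named components can be sketched in a few lines: SpeMB's splitting of the spectral vector into groups, and SSFM-style adaptive integration as a gated convex combination. The real SSFM learns its gate from both feature maps; a scalar gate is used here purely for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spectral_groups(spectrum: np.ndarray, n_groups: int) -> np.ndarray:
    """SpeMB-style grouping: split a length-B spectral vector into
    n_groups contiguous groups (B must be divisible by n_groups)."""
    b = spectrum.shape[-1]
    assert b % n_groups == 0
    return spectrum.reshape(*spectrum.shape[:-1], n_groups, b // n_groups)

def fuse_spatial_spectral(f_spa, f_spe, gate_logit):
    """SSFM-style fusion as a gated convex combination of spatial and
    spectral features."""
    g = sigmoid(gate_logit)
    return g * f_spa + (1 - g) * f_spe
```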
https://arxiv.org/abs/2501.04944
Referring video object segmentation aims to segment objects within a video corresponding to a given text description. Existing transformer-based temporal modeling approaches face challenges related to query inconsistency and the limited consideration of context. Query inconsistency produces unstable masks of different objects in the middle of the video. The limited consideration of context leads to the segmentation of incorrect objects by failing to adequately account for the relationship between the given text and instances. To address these issues, we propose the Multi-context Temporal Consistency Module (MTCM), which consists of an Aligner and a Multi-Context Enhancer (MCE). The Aligner removes noise from queries and aligns them to achieve query consistency. The MCE predicts text-relevant queries by considering multi-context. We applied MTCM to four different models, increasing performance across all of them, particularly achieving 47.6 J&F on the MeViS. Code is available at this https URL.
https://arxiv.org/abs/2501.04939
Chronic itch affects 13% of the US population, is highly debilitating, and underlies many medical conditions. A major challenge in clinical care and new therapeutics development is the lack of an objective measure for quantifying itch, leading to reliance on subjective measures like patients' self-assessment of itch severity. In this paper, we show that a home radio device paired with artificial intelligence (AI) can concurrently capture scratching and evaluate its impact on sleep quality by analyzing radio signals bouncing in the environment. The device eliminates the need for wearable sensors or skin contact, enabling monitoring of chronic itch over extended periods at home without burdening patients or interfering with their skin condition. To validate the technology, we conducted an observational clinical study of chronic pruritus patients, monitored at home for one month using both the radio device and an infrared camera. Comparing the output of the device to ground truth data from the camera demonstrates its feasibility and accuracy (ROC AUC = 0.997, sensitivity = 0.825, specificity = 0.997). The results reveal a significant correlation between scratching and low sleep quality, manifested as a reduction in sleep efficiency (R = 0.6, p < 0.001) and an increase in sleep latency (R = 0.68, p < 0.001). Our study underscores the potential of passive, long-term, at-home monitoring of chronic scratching and its sleep implications, offering a valuable tool for both clinical care of chronic itch patients and pharmaceutical clinical trials.
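The sleep metrics behind the reported associations are standard: sleep efficiency is time asleep over time in bed, and the R values are Pearson correlations. A sketch with illustrative toy data (not the study's):

```python
import numpy as np

def sleep_efficiency(minutes_asleep: float, minutes_in_bed: float) -> float:
    """Sleep efficiency = time asleep / time in bed (standard definition)."""
    return minutes_asleep / minutes_in_bed

def scratch_sleep_correlation(scratch_counts, efficiencies) -> float:
    """Pearson correlation between nightly scratch counts and sleep
    efficiency, the statistic behind the reported R = 0.6 association."""
    return float(np.corrcoef(scratch_counts, efficiencies)[0, 1])
```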
https://arxiv.org/abs/2501.04896
Effective instruction tuning is indispensable for optimizing code LLMs, aligning model behavior with user expectations and enhancing model performance in real-world applications. However, most existing methods focus on code snippets, which are limited to specific functionalities and rigid structures, restricting the complexity and diversity of the synthesized data. To address these limitations, we introduce a novel feature tree-based synthesis framework inspired by Abstract Syntax Trees (AST). Unlike AST, which captures syntactic structure of code, our framework models semantic relationships between code elements, enabling the generation of more nuanced and diverse data. The feature tree is constructed from raw data and refined iteratively to increase the quantity and diversity of the extracted features. This process enables the identification of more complex patterns and relationships within the code. By sampling subtrees with controlled depth and breadth, our framework allows precise adjustments to the complexity of the generated code, supporting a wide range of tasks from simple function-level operations to intricate multi-file scenarios. We fine-tuned widely-used base models to create the EpiCoder series, achieving state-of-the-art performance at both the function and file levels across multiple benchmarks. Notably, empirical evidence indicates that our approach shows significant potential in synthesizing highly complex repository-level code data. Further analysis elucidates the merits of this approach by rigorously assessing data complexity and diversity through software engineering principles and LLM-as-a-judge method.
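Controlled subtree sampling, which sets the complexity of each synthesized coding task, can be sketched as a recursive draw with depth and breadth caps; the nested-dict tree format is an assumption for illustration:

```python
import random

def sample_subtree(tree: dict, max_depth: int, max_breadth: int, rng) -> dict:
    """Sample a subtree of a feature tree: depth caps how deep the
    sample reaches, breadth caps the children kept per node, so deeper
    and wider samples correspond to more complex generated code."""
    if max_depth == 0 or not tree:
        return {}
    kept = rng.sample(sorted(tree), min(max_breadth, len(tree)))
    return {k: sample_subtree(tree[k], max_depth - 1, max_breadth, rng)
            for k in kept}

def depth(tree: dict) -> int:
    """Height of a nested-dict tree (empty tree has depth 0)."""
    return 1 + max((depth(c) for c in tree.values()), default=0) if tree else 0
```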
https://arxiv.org/abs/2501.04694
Hyperspectral image (HSI) fusion addresses the challenge of reconstructing High-Resolution HSIs (HR-HSIs) from High-Resolution Multispectral images (HR-MSIs) and Low-Resolution HSIs (LR-HSIs), a critical task given the high costs and hardware limitations associated with acquiring high-quality HSIs. While existing methods leverage spatial and spectral relationships, they often suffer from limited receptive fields and insufficient feature utilization, leading to suboptimal performance. Furthermore, the scarcity of high-quality HSI data highlights the importance of efficient data utilization to maximize reconstruction quality. To address these issues, we propose HyFusion, a novel framework designed to enhance the receptive field and enable effective feature map reusing, thereby maximizing data utilization. First, HR-MSI and LR-HSI inputs are concatenated to form a quasi-fused draft, preserving complementary spatial and spectral details. Next, the Enhanced Reception Field Block (ERFB) is introduced, combining shifting-window attention and dense connections to expand the receptive field, effectively capturing long-range dependencies and reusing features to reduce information loss, thereby boosting data efficiency. Finally, the Dual-Coupled Network (DCN) dynamically extracts high-frequency spectral and spatial features from LR-HSI and HR-MSI, ensuring efficient cross-domain fusion. Extensive experiments demonstrate that HyFusion achieves state-of-the-art performance in HR-MSI/LR-HSI fusion, significantly improving reconstruction quality while maintaining a compact model size and computational efficiency. By integrating enhanced receptive fields and feature map reusing, HyFusion provides a practical and effective solution for HSI fusion in resource-constrained scenarios, setting a new benchmark in hyperspectral imaging. Our code will be publicly available.
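The quasi-fused draft is formed by bringing both inputs onto a common spatial grid and concatenating along the spectral axis; the nearest-neighbor upsampling below is an assumption, as the abstract specifies only the concatenation:

```python
import numpy as np

def quasi_fused_draft(hr_msi: np.ndarray, lr_hsi: np.ndarray) -> np.ndarray:
    """Upsample the LR-HSI to the HR-MSI's spatial size (nearest
    neighbor) and concatenate along the spectral axis, preserving
    complementary spatial and spectral detail in one tensor.
    Shapes: (H, W, c_msi) and (h, w, C_hsi) with H = s*h, W = s*w."""
    s = hr_msi.shape[0] // lr_hsi.shape[0]
    up = lr_hsi.repeat(s, axis=0).repeat(s, axis=1)
    return np.concatenate([hr_msi, up], axis=-1)
```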
Hyperspectral image (HSI) fusion addresses the problem of reconstructing high-quality High-Resolution HSIs (HR-HSIs) from High-Resolution Multispectral images (HR-MSIs) and Low-Resolution HSIs (LR-HSIs), a critical task because acquiring high-quality HSIs is typically costly and hardware-constrained. Although existing methods exploit spatial and spectral relationships, their limited receptive fields and insufficient feature utilization often keep them from reaching the best performance. Moreover, the scarcity of high-quality HSI data underscores the importance of using data efficiently to maximize reconstruction quality. To address these issues, we propose HyFusion, a novel framework designed to enlarge the receptive field and make effective feature map reuse possible, thereby maximizing data utilization. First, the HR-MSI and LR-HSI inputs are merged to form a quasi-fused draft that preserves complementary spatial and spectral details. Next, an Enhanced Reception Field Block (ERFB) is introduced, which combines a shifting-window attention mechanism with dense connections to enlarge the receptive field, effectively capturing long-range dependencies and reusing features to reduce information loss, thereby improving data efficiency. Finally, a Dual-Coupled Network (DCN) dynamically extracts high-frequency spectral and spatial features from the LR-HSI and HR-MSI, ensuring efficient cross-domain fusion. Extensive experiments show that HyFusion achieves state-of-the-art performance in HR-MSI/LR-HSI fusion, significantly improving reconstruction quality while maintaining a compact model size and computational efficiency. By integrating an enlarged receptive field with feature map reuse, HyFusion offers a practical and effective solution for HSI fusion in resource-constrained scenarios and sets a new benchmark in this area of image processing. Our code will be publicly released.
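To make the "quasi-fused draft" step concrete, here is a minimal pure-Python sketch: the LR-HSI is spatially upsampled to the HR-MSI resolution and the two inputs are concatenated along the channel axis. The toy shapes and the nearest-neighbour upsampling choice are assumptions for illustration, not details from the paper.

```python
def upsample_nearest(img, factor):
    """Nearest-neighbour upsampling of a channels x H x W nested list."""
    out = []
    for ch in img:
        up_ch = []
        for row in ch:
            # Repeat each pixel `factor` times horizontally...
            up_row = [v for v in row for _ in range(factor)]
            # ...and each row `factor` times vertically.
            for _ in range(factor):
                up_ch.append(list(up_row))
        out.append(up_ch)
    return out

def concat_channels(a, b):
    """Channel-wise concatenation of two channels x H x W images."""
    return a + b

lr_hsi = [[[1, 2], [3, 4]]] * 3          # 3 spectral bands at 2x2
hr_msi = [[[0] * 4 for _ in range(4)]]   # 1 multispectral band at 4x4
draft = concat_channels(upsample_nearest(lr_hsi, 2), hr_msi)
print(len(draft), len(draft[0]), len(draft[0][0]))  # 4 channels at 4x4
```

The draft keeps the LR-HSI's spectral detail and the HR-MSI's spatial detail side by side, which is the complementary-information property the abstract emphasizes.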
https://arxiv.org/abs/2501.04665
Hyperspectral image (HSI) classification is a crucial technique in remote sensing for building large-scale earth monitoring systems. HSI contains much more information than traditional visual images for identifying the categories of land covers. One recent feasible solution for HSI is to leverage CapsNets for capturing spectral-spatial information. However, these methods incur high computational costs due to the fully connected architecture between stacked capsule layers. To solve this problem, a DWT-CapsNet is proposed to identify partial but important connections in CapsNet for effective and efficient HSI classification. Specifically, we integrate a tailored attention mechanism into a Discrete Wavelet Transform (DWT)-based downsampling layer, alleviating the information loss caused by the conventional downsampling operation in feature extractors. Moreover, we propose a novel multi-scale routing algorithm that prunes a large proportion of connections in CapsNet. A capsule pyramid fusion mechanism is designed to aggregate the spectral-spatial relationships at multiple levels of granularity, and a self-attention mechanism is then applied in a partially and locally connected architecture to emphasize the meaningful relationships. As the experimental results show, our method achieves state-of-the-art accuracy while keeping computational demand low in terms of running time, FLOPs, and the number of parameters, rendering it an appealing choice for practical implementation in HSI classification.
Hyperspectral image (HSI) classification is a key technique in remote sensing for building large-scale earth monitoring systems. Compared with traditional visual images, HSIs contain far more information for identifying land-cover categories. A recent feasible solution is to leverage capsule networks (CapsNets) to capture the spectral-spatial information in hyperspectral data. However, because of the fully connected architecture between stacked capsule layers, these methods impose high computational requirements. To solve this problem, a Discrete Wavelet Transform-based CapsNet (DWT-CapsNet) is proposed to identify the partial but important connections in a CapsNet, enabling effective HSI classification. Specifically, a tailored attention mechanism is integrated into the DWT-based downsampling layer, alleviating the information loss caused by the conventional downsampling operations in feature extractors. Furthermore, a novel multi-scale routing algorithm is proposed that prunes a large proportion of the connections in the CapsNet, reducing computational complexity. A capsule pyramid fusion mechanism is designed to aggregate spectral-spatial relationships at multiple levels of granularity, and a self-attention mechanism is then applied within a partially and locally connected architecture to highlight the meaningful relationships. Experimental results show that the proposed method achieves state-of-the-art accuracy while keeping computational demand low in terms of running time, FLOPs, and parameter count, making it an attractive choice for practical HSI classification.
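To illustrate the idea behind DWT-based downsampling, the sketch below applies a single-level 2D Haar transform: spatial resolution is halved, but the four sub-bands (LL, LH, HL, HH) retain the detail that plain pooling would discard. This is a generic Haar transform with the normalization chosen so LL is a plain 2x2 average; it is not the paper's attention-augmented DWT-CapsNet layer.

```python
def haar2d(img):
    """Single-level 2D Haar DWT. img: H x W nested lists with even dims.
    Returns (LL, LH, HL, HH), each H/2 x W/2."""
    H, W = len(img), len(img[0])
    LL, LH, HL, HH = [], [], [], []
    for y in range(0, H, 2):
        ll, lh, hl, hh = [], [], [], []
        for x in range(0, W, 2):
            a, b = img[y][x], img[y][x + 1]
            c, d = img[y + 1][x], img[y + 1][x + 1]
            ll.append((a + b + c + d) / 4)  # low-pass in both axes (average)
            lh.append((a - b + c - d) / 4)  # horizontal detail
            hl.append((a + b - c - d) / 4)  # vertical detail
            hh.append((a - b - c + d) / 4)  # diagonal detail
        LL.append(ll); LH.append(lh); HL.append(hl); HH.append(hh)
    return LL, LH, HL, HH

img = [[1, 1, 2, 2],
       [1, 1, 2, 2],
       [3, 3, 4, 4],
       [3, 3, 4, 4]]
LL, LH, HL, HH = haar2d(img)
print(LL)  # downsampled averages; detail bands are zero for this flat input
```

Because the detail bands are kept alongside LL, the transform is invertible; that is the "information loss" advantage over conventional pooling that the abstract points to.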
https://arxiv.org/abs/2501.04643
Symbolic music analysis tasks are often performed by models originally developed for Natural Language Processing, such as Transformers. Such models require the input data to be represented as sequences, which is achieved through a process of tokenization. Tokenization strategies for symbolic music often rely on absolute MIDI values to represent pitch information. However, music research largely promotes the benefit of higher-level representations such as melodic contour and harmonic relations, for which pitch intervals turn out to be more expressive than absolute pitches. In this work, we introduce a general framework for building interval-based tokenizations. By evaluating these tokenizations on three music analysis tasks, we show that such interval-based tokenizations improve model performance and facilitate explainability.
Symbolic music analysis tasks are often performed with models originally developed for natural language processing, such as Transformers. These models require the input data to be represented as sequences, which is achieved through a tokenization process. For symbolic music, tokenization strategies often rely on absolute MIDI values to represent pitch information. However, music research widely promotes higher-level representations such as melodic contour and harmonic relations, for which pitch intervals (the distances between notes) prove more expressive than absolute pitches. In this work, we propose a general framework for building interval-based tokenization strategies. By evaluating these interval tokenizations on three music analysis tasks, we show that such interval-based approaches improve model performance and facilitate model explainability.
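A minimal sketch of the interval idea: instead of tokenizing absolute MIDI pitches, each note is encoded as the signed interval from the previous note, which makes the token sequence transposition-invariant. The token vocabulary below is illustrative, not the paper's tokenizer.

```python
def to_interval_tokens(pitches):
    """pitches: absolute MIDI note numbers -> first pitch plus signed intervals."""
    if not pitches:
        return []
    tokens = [f"PITCH_{pitches[0]}"]
    for prev, cur in zip(pitches, pitches[1:]):
        tokens.append(f"INT_{cur - prev:+d}")
    return tokens

melody     = [60, 64, 67, 65]   # C4 E4 G4 F4
transposed = [62, 66, 69, 67]   # the same melody a whole tone higher
print(to_interval_tokens(melody))
# The interval tokens are identical under transposition:
print(to_interval_tokens(melody)[1:] == to_interval_tokens(transposed)[1:])  # True
```

This invariance is exactly why intervals carry melodic contour more directly than absolute pitches: a model sees the same sequence regardless of key.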
https://arxiv.org/abs/2501.04630
Artificial Intelligence is revolutionizing medical practice, enhancing diagnostic accuracy and healthcare delivery. However, its adoption in medical settings still faces significant challenges related to data availability and privacy constraints. Synthetic data has emerged as a promising solution to mitigate these issues, addressing data scarcity while preserving privacy. Recently, Latent Diffusion Models have emerged as a powerful tool for generating high-quality synthetic data. Meanwhile, the integration of different modalities has gained interest, emphasizing the need for models capable of handling multimodal medical data. Current approaches struggle to integrate complementary information and lack the ability to generate modalities simultaneously. To address this challenge, we present MedCoDi-M, a 6.77-billion-parameter model designed for multimodal medical data generation that, following the Foundation Model paradigm, exploits contrastive learning and large quantities of data to build a shared latent space that captures the relationships between different data modalities. Further, we introduce the Multi-Prompt training technique, which significantly boosts MedCoDi-M's generation under different settings. We extensively validate MedCoDi-M: first, we benchmark it against five competitors on the MIMIC-CXR dataset, a state-of-the-art dataset for Chest X-ray and radiological report generation. Secondly, we perform a Visual Turing Test with expert radiologists to assess the realism and clinical relevance of the generated data, ensuring alignment with real-world scenarios. Finally, we assess the utility of MedCoDi-M in addressing key challenges in the medical field, such as anonymization, data scarcity, and imbalanced learning. The results are promising, demonstrating the applicability of MedCoDi-M in medical contexts. Project page is at this https URL.
Artificial intelligence is revolutionizing medical practice, improving diagnostic accuracy and healthcare delivery. However, applying AI in medical settings still faces major challenges, mainly related to data availability and privacy constraints. Synthetic data has emerged as a promising solution: it addresses data scarcity while preserving patient privacy during generation. Recently, latent diffusion models have emerged as a powerful tool for generating high-quality synthetic data. Meanwhile, interest in multimodal integration keeps growing, yet existing methods struggle to integrate complementary information and cannot generate multiple modalities simultaneously. To address this challenge, we present MedCoDi-M, a 6.77-billion-parameter model designed for multimodal medical data generation. Following the foundation-model paradigm, it exploits contrastive learning and large quantities of data to build a shared latent space that captures the relationships between different data modalities. In addition, we introduce the Multi-Prompt training technique, which significantly improves MedCoDi-M's generation under various settings. We validate MedCoDi-M extensively: first, we benchmark it against five competitors on the MIMIC-CXR dataset, a state-of-the-art dataset for chest X-ray and radiological report generation; second, expert radiologists perform a Visual Turing Test to assess the realism and clinical relevance of the generated data, ensuring alignment with real-world scenarios; finally, we evaluate MedCoDi-M's utility in addressing key challenges in the medical field, such as anonymization, data scarcity, and imbalanced learning. The results are promising and demonstrate MedCoDi-M's applicability in medical contexts. The project page is at this https URL.
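As a rough sketch of the contrastive objective behind a shared multimodal latent space, the pure-Python InfoNCE-style loss below pulls matched image/report embedding pairs together and pushes mismatched pairs apart. The toy embeddings, the temperature value, and the function names are illustrative assumptions, not MedCoDi-M's actual training code.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(img_embs, txt_embs, temperature=0.1):
    """Mean cross-entropy of matching each image embedding to its paired text."""
    loss = 0.0
    for i, u in enumerate(img_embs):
        logits = [dot(u, v) / temperature for v in txt_embs]
        # Numerically stable log-sum-exp over all candidate texts.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]  # -log softmax probability of the true pair
    return loss / len(img_embs)

# Aligned modalities (matching pairs point the same way) give a much lower
# loss than misaligned ones.
aligned    = info_nce([[1, 0], [0, 1]], [[1, 0], [0, 1]])
misaligned = info_nce([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned < misaligned)  # True
```

Minimizing such a loss over many modality pairs is what drives the encoders toward a common latent space in which cross-modal generation becomes possible.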
https://arxiv.org/abs/2501.04614