The growing availability of longitudinal Magnetic Resonance Imaging (MRI) datasets has facilitated Artificial Intelligence (AI)-driven modeling of disease progression, making it possible to predict future medical scans for individual patients. However, despite significant advancements in AI, current methods continue to face challenges including achieving patient-specific individualization, ensuring spatiotemporal consistency, efficiently utilizing longitudinal data, and managing the substantial memory demands of 3D scans. To address these challenges, we propose Brain Latent Progression (BrLP), a novel spatiotemporal model designed to predict individual-level disease progression in 3D brain MRIs. The key contributions in BrLP are fourfold: (i) it operates in a small latent space, mitigating the computational challenges posed by high-dimensional imaging data; (ii) it explicitly integrates subject metadata to enhance the individualization of predictions; (iii) it incorporates prior knowledge of disease dynamics through an auxiliary model, facilitating the integration of longitudinal data; and (iv) it introduces the Latent Average Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in the predicted progression at inference time and (b) allows us to derive a measure of the uncertainty for the prediction. We train and evaluate BrLP on 11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its generalizability on an external test set comprising 2,257 MRIs from 962 subjects. Our experiments compare BrLP-generated MRI scans with real follow-up MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The code is publicly available at: this https URL.
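Below is a minimal sketch of the Latent Average Stabilization idea as described in the abstract: repeat the stochastic latent prediction several times, average the latents, and use the spread across runs as an uncertainty estimate. The `predict_latent` callable is a hypothetical stand-in for BrLP's latent diffusion predictor, not the released code.

```python
import numpy as np

def latent_average_stabilization(predict_latent, z0, metadata, t, n_runs=8, rng=None):
    """Repeat a stochastic latent progression prediction and average the results.

    predict_latent: hypothetical callable (z0, metadata, t, rng) -> latent array,
                    standing in for BrLP's latent diffusion model.
    Returns the averaged latent and a per-voxel uncertainty estimate (std. dev.).
    """
    rng = rng or np.random.default_rng(0)
    samples = np.stack([predict_latent(z0, metadata, t, rng) for _ in range(n_runs)])
    z_mean = samples.mean(axis=0)          # stabilized latent prediction
    z_uncertainty = samples.std(axis=0)    # spread across runs as uncertainty
    return z_mean, z_uncertainty

if __name__ == "__main__":
    # Dummy predictor: a drift term plus noise, purely to show the averaging loop.
    def dummy_predictor(z0, metadata, t, rng):
        return z0 + t * metadata["atrophy_rate"] + 0.05 * rng.standard_normal(z0.shape)

    z0 = np.zeros((4, 8, 8, 8))            # small stand-in latent volume
    z, unc = latent_average_stabilization(dummy_predictor, z0,
                                          {"atrophy_rate": -0.01}, t=2.0)
    print(z.shape, float(unc.mean()))
```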
https://arxiv.org/abs/2502.08560
Next token prediction has been the standard training objective used in large language model pretraining. Representations are learned as a result of optimizing for token-level perplexity. We propose Continuous Concept Mixing (CoCoMix), a novel pretraining framework that combines discrete next token prediction with continuous concepts. Specifically, CoCoMix predicts continuous concepts learned from a pretrained sparse autoencoder and mixes them into the model's hidden state by interleaving with token hidden representations. Through experiments on multiple benchmarks, including language modeling and downstream reasoning tasks, we show that CoCoMix is more sample efficient and consistently outperforms standard next token prediction, knowledge distillation and inserting pause tokens. We find that combining both concept learning and interleaving in an end-to-end framework is critical to performance gains. Furthermore, CoCoMix enhances interpretability and steerability by allowing direct inspection and modification of the predicted concept, offering a transparent way to guide the model's internal reasoning process.
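A toy sketch of the mixing mechanism described above: continuous concepts are predicted from the hidden state, mapped back to the model width, and interleaved with the token hidden states. Layer names and sizes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ConceptMixer(nn.Module):
    """Toy sketch of the CoCoMix idea: predict continuous concept activations from a
    hidden state, map them back to the model width, and interleave the resulting
    'concept token' with the ordinary token hidden states."""

    def __init__(self, d_model=256, n_concepts=64):
        super().__init__()
        self.concept_head = nn.Linear(d_model, n_concepts)   # predicts SAE-style concepts
        self.concept_embed = nn.Linear(n_concepts, d_model)  # maps concepts to model width

    def forward(self, hidden):                 # hidden: (batch, seq, d_model)
        concepts = self.concept_head(hidden)   # continuous concept predictions
        mixed = self.concept_embed(concepts)   # continuous 'concept token' per position
        # Interleave: token_1, concept_1, token_2, concept_2, ...
        batch, seq, d = hidden.shape
        out = torch.stack([hidden, mixed], dim=2).reshape(batch, 2 * seq, d)
        return out, concepts

if __name__ == "__main__":
    x = torch.randn(2, 10, 256)
    out, concepts = ConceptMixer()(x)
    print(out.shape, concepts.shape)   # (2, 20, 256) (2, 10, 64)
```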
https://arxiv.org/abs/2502.08524
Multiple instance learning (MIL)-based frameworks have become the mainstream for processing whole slide images (WSIs) with giga-pixel size and hierarchical image context in digital pathology. However, these methods heavily depend on a substantial number of bag-level labels and learn solely from the original slides, which makes them easily affected by variations in data distribution. Recently, vision language model (VLM)-based methods introduced a language prior by pre-training on large-scale pathological image-text pairs. However, previous text prompts lack consideration of pathological prior knowledge and therefore do not substantially boost the model's performance. Moreover, the collection of such pairs and the pre-training process are very time-consuming. To solve the above problems, we propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification. Specifically, we propose a dual-scale visual descriptive text prompt based on a frozen large language model (LLM) to boost the performance of the VLM effectively. To transfer the VLM to process WSIs efficiently, for the image branch we propose a prototype-guided patch decoder that aggregates patch features progressively by grouping similar patches into the same prototype; for the text branch we introduce a context-guided text decoder that enhances text features by incorporating multi-granular image contexts. Extensive studies on three multi-cancer and multi-center subtyping datasets demonstrate the superiority of ViLa-MIL.
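A simplified sketch of the prototype-guided aggregation idea: patch features are assigned to their nearest prototype and averaged within each group. This is an illustrative stand-in for ViLa-MIL's patch decoder, not the authors' code.

```python
import torch

def prototype_aggregate(patch_feats, prototypes):
    """Group WSI patch features by their nearest prototype and average within each group.

    patch_feats: (num_patches, dim) patch embeddings from a frozen encoder.
    prototypes:  (num_prototypes, dim) prototype vectors.
    Returns (num_prototypes, dim) aggregated features (zeros for empty prototypes).
    """
    sims = torch.nn.functional.normalize(patch_feats, dim=-1) @ \
           torch.nn.functional.normalize(prototypes, dim=-1).T      # cosine similarity
    assign = sims.argmax(dim=1)                                      # nearest prototype id
    out = torch.zeros_like(prototypes)
    for k in range(prototypes.shape[0]):
        members = patch_feats[assign == k]
        if members.numel() > 0:
            out[k] = members.mean(dim=0)
    return out

if __name__ == "__main__":
    feats = torch.randn(1000, 512)      # e.g. patch features at one magnification
    protos = torch.randn(16, 512)
    print(prototype_aggregate(feats, protos).shape)   # (16, 512)
```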
https://arxiv.org/abs/2502.08391
Particularly in the structure of global discourse, coherence plays a pivotal role in human text comprehension and is a hallmark of high-quality text. This is especially true for persuasive texts, where coherent argument structures support claims effectively. This paper discusses and proposes methods for detecting, extracting and representing these global discourse structures in a process called Argument(ation) Mining. We begin by defining key terms and processes of discourse structure analysis, then continue to summarize existing research on the matter, and identify shortcomings in current argument component extraction and classification methods. Furthermore, we outline an architecture for argument mining that focuses on making models more generalisable while overcoming challenges in the current field of research by utilizing novel NLP techniques. This paper reviews current knowledge, summarizes recent works, and outlines our NLP pipeline, aiming to contribute to the theoretical understanding of global discourse structures.
https://arxiv.org/abs/2502.08371
Retrieval-Augmented Generation (RAG) has emerged as a prominent method for incorporating domain knowledge into Large Language Models (LLMs). While RAG enhances response relevance by incorporating retrieved domain knowledge in the context, retrieval errors can still lead to hallucinations and incorrect answers. To recover from retriever failures, domain knowledge is injected by fine-tuning the model to generate the correct response, even in the case of retrieval errors. However, we observe that without systematic knowledge augmentation, fine-tuned LLMs may memorize new information but still fail to extract relevant domain knowledge, leading to poor performance. In this work, we present a novel framework that significantly enhances the fine-tuning process by augmenting the training data in two ways -- context augmentation and knowledge paraphrasing. In context augmentation, we create multiple training samples for a given QA pair by varying the relevance of the retrieved information, teaching the model when to ignore and when to rely on retrieved content. In knowledge paraphrasing, we fine-tune with multiple answers to the same question, enabling LLMs to better internalize specialized knowledge. To mitigate catastrophic forgetting due to fine-tuning, we add a domain-specific identifier to a question and also utilize a replay buffer containing general QA pairs. Experimental results demonstrate the efficacy of our method over existing techniques, achieving up to 10% relative gain in token-level recall while preserving the LLM's generalization capabilities.
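A small sketch of the two augmentations, under assumed names: several contexts of varying relevance are paired with the same QA, paraphrased answers are added, and a domain tag is prepended to the question.

```python
import random

def build_augmented_samples(question, answer, gold_passage, distractor_passages,
                            paraphrased_answers, domain_tag="[MED]", n_noise=2):
    """Toy sketch of the two augmentations described above (names are illustrative).

    Context augmentation: pair the same QA with contexts of varying relevance so the
    model learns when to rely on and when to ignore retrieved text.
    Knowledge paraphrasing: repeat the question with alternative phrasings of the answer.
    A domain tag is prepended to the question, in the spirit of the forgetting mitigation.
    """
    samples = []
    contexts = [
        [gold_passage],                                                  # relevant only
        random.sample(distractor_passages, k=n_noise),                   # irrelevant only
        [gold_passage] + random.sample(distractor_passages, k=n_noise),  # mixed
    ]
    for ctx in contexts:
        random.shuffle(ctx)
        samples.append({"prompt": f"{domain_tag} {question}\nContext: {' '.join(ctx)}",
                        "target": answer})
    for alt in paraphrased_answers:                                      # paraphrasing
        samples.append({"prompt": f"{domain_tag} {question}", "target": alt})
    return samples

if __name__ == "__main__":
    s = build_augmented_samples("What is the maximum dose of drug X?", "400 mg daily",
                                "Drug X label: maximum 400 mg per day.",
                                ["Unrelated passage A.", "Unrelated passage B.",
                                 "Unrelated passage C."],
                                ["Up to 400 mg each day.", "No more than 400 mg/day."])
    print(len(s))   # 5 training samples from one QA pair
```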
https://arxiv.org/abs/2502.08356
The integration of Large Language Models (LLMs) into optimization has created a powerful synergy, opening exciting research opportunities. This paper investigates how LLMs can enhance existing optimization algorithms. Using their pre-trained knowledge, we demonstrate their ability to propose innovative heuristic variations and implementation strategies. To evaluate this, we applied a non-trivial optimization algorithm, Construct, Merge, Solve and Adapt (CMSA) -- a hybrid metaheuristic for combinatorial optimization problems that incorporates a heuristic in the solution construction phase. Our results show that an alternative heuristic proposed by GPT-4o outperforms the expert-designed heuristic of CMSA, with the performance gap widening on larger and denser graphs. Project URL: this https URL
https://arxiv.org/abs/2502.08298
In this work, we propose an architecture of LLM Modules that enables the transfer of knowledge from a large pre-trained model to a smaller model using an Enhanced Cross-Attention mechanism. In the proposed scheme, the Qwen2-1.5B model is frozen and its representations are passed through specially designed attention layers to the GPT-Neo-125M model, which is trained on limited computational resources. Experimental results on the Bespoke-Stratos-17k dataset demonstrate that after 15 epochs of training, the combined model generates responses comparable in quality to those obtained by distillation. We discuss the advantages of the modular approach, provide examples of input queries and comparative analysis, and outline prospects for further extension of the method.
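A minimal sketch of the bridging idea: a cross-attention layer lets the small trainable model query the frozen donor model's hidden states. The projection and layer choices are assumptions; only the hidden sizes (1536 for Qwen2-1.5B, 768 for GPT-Neo-125M) follow the named models.

```python
import torch
import torch.nn as nn

class BridgeCrossAttention(nn.Module):
    """Minimal sketch of passing a frozen donor model's hidden states into a smaller
    trainable model through a cross-attention layer. The exact architecture in the
    paper may differ; this only illustrates the fusion mechanism."""

    def __init__(self, d_small=768, d_large=1536, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(d_large, d_small)        # map donor width to small model width
        self.attn = nn.MultiheadAttention(d_small, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_small)

    def forward(self, small_hidden, donor_hidden):
        # small_hidden: (B, Ts, d_small) trainable model states (queries)
        # donor_hidden: (B, Td, d_large) frozen donor states (keys/values), no grad
        kv = self.proj(donor_hidden.detach())
        attended, _ = self.attn(small_hidden, kv, kv)
        return self.norm(small_hidden + attended)       # residual fusion

if __name__ == "__main__":
    bridge = BridgeCrossAttention()
    out = bridge(torch.randn(2, 16, 768), torch.randn(2, 32, 1536))
    print(out.shape)   # (2, 16, 768)
```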
https://arxiv.org/abs/2502.08213
Precise classification of megakaryocytes is crucial for diagnosing myelodysplastic syndromes. Although self-supervised learning has shown promise in medical image analysis, its application to classifying megakaryocytes in stained slides faces three main challenges: (1) pervasive background noise that obscures cellular details, (2) a long-tailed distribution that limits data for rare subtypes, and (3) complex morphological variations leading to high intra-class variability. To address these issues, we propose the ActiveSSF framework, which integrates active learning with self-supervised pretraining. Specifically, our approach employs Gaussian filtering combined with K-means clustering and HSV analysis (augmented by clinical prior knowledge) for accurate region-of-interest extraction; an adaptive sample selection mechanism that dynamically adjusts similarity thresholds to mitigate class imbalance; and prototype clustering on labeled samples to overcome morphological complexity. Experimental results on clinical megakaryocyte datasets demonstrate that ActiveSSF not only achieves state-of-the-art performance but also significantly improves recognition accuracy for rare subtypes. Moreover, the integration of these advanced techniques further underscores the practical potential of ActiveSSF in clinical settings. To foster further research, the code and datasets will be publicly released in the future.
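A rough sketch of the region-of-interest step as described (Gaussian filtering, HSV analysis, and K-means), with a simple saturation prior standing in for the clinical prior knowledge; thresholds and cluster counts are illustrative, not the authors' settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from matplotlib.colors import rgb_to_hsv
from sklearn.cluster import KMeans

def extract_cell_roi_mask(rgb_image, n_clusters=3, sat_min=0.35, sigma=1.5):
    """Rough ROI-extraction sketch: smooth the tile, convert to HSV, cluster pixel
    colors with K-means, and keep the most saturated cluster (stained cells vs.
    background), provided it exceeds a minimum expected saturation."""
    smoothed = gaussian_filter(rgb_image.astype(float), sigma=(sigma, sigma, 0))
    hsv = rgb_to_hsv(smoothed / smoothed.max())              # values in [0, 1]
    pixels = hsv.reshape(-1, 3)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(pixels)
    sats = [pixels[labels == k, 1].mean() for k in range(n_clusters)]
    best = int(np.argmax(sats))
    mask = (labels == best).reshape(rgb_image.shape[:2])
    return mask if sats[best] >= sat_min else np.zeros_like(mask)

if __name__ == "__main__":
    img = np.random.rand(64, 64, 3) * 255                     # stand-in for a slide tile
    print(extract_cell_roi_mask(img).shape)                   # (64, 64)
```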
https://arxiv.org/abs/2502.08200
Data scarcity significantly complicates the continual learning problem, i.e., how a deep neural network learns in dynamic environments with very few samples. However, the latest progress in few-shot class incremental learning (FSCIL) methods and related studies offers valuable insight into how to tackle the problem. This paper presents a comprehensive survey on FSCIL that highlights several important aspects, i.e., the comprehensive and formal objectives of FSCIL approaches, the importance of prototype rectification, the new learning paradigms based on pre-trained models and language-guided mechanisms, a deeper analysis of FSCIL performance metrics and evaluation, and the practical contexts of FSCIL in various areas. Our extensive discussion presents the open challenges, potential solutions, and future directions of FSCIL.
https://arxiv.org/abs/2502.08181
Large Language Models (LLMs) have demonstrated strong generalization capabilities across a wide range of natural language processing (NLP) tasks. However, they exhibit notable weaknesses in character-level string manipulation, struggling with fundamental operations such as character deletion, insertion, and substitution. These challenges stem primarily from tokenization constraints, despite the critical role of such operations in data preprocessing and code generation. Through systematic analysis, we derive two key insights: (1) LLMs face significant difficulties in leveraging intrinsic token knowledge for character-level reasoning, and (2) atomized word structures can substantially enhance LLMs' ability to process token-level structural information. Building on these insights, we propose Character-Level Manipulation via Divide and Conquer, a novel approach designed to bridge the gap between token-level processing and character-level manipulation. Our method decomposes complex operations into explicit character-level subtasks coupled with controlled token reconstruction phases, leading to significant improvements in accuracy. Without additional training, our method significantly improves accuracies on the Deletion, Insertion, and Substitution tasks. To support further research, we open-source our implementation and benchmarks.
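A toy illustration of the divide-and-conquer recipe: atomize the word into characters, apply the edit at the character level, and reconstruct the token. In the paper these subtasks are carried out by prompting the LLM; here they are emulated deterministically just to show the decomposition.

```python
def divide_and_conquer_edit(word, operation, target, replacement=None):
    """Toy illustration of the three stages described above: (1) divide the word into
    explicit characters, (2) apply the edit as a character-level subtask, and
    (3) reconstruct the token from the edited character sequence."""
    # Step 1: divide -- make the character structure explicit ("s t r a w b e r r y")
    chars = list(word)

    # Step 2: conquer -- a simple character-level subtask per operation
    if operation == "deletion":
        chars = [c for c in chars if c != target]
    elif operation == "substitution":
        chars = [replacement if c == target else c for c in chars]
    elif operation == "insertion":
        out = []
        for c in chars:
            out.append(c)
            if c == target:
                out.append(replacement)
        chars = out

    # Step 3: reconstruct the token
    return "".join(chars)

if __name__ == "__main__":
    print(divide_and_conquer_edit("strawberry", "deletion", "r"))          # stawbey
    print(divide_and_conquer_edit("strawberry", "substitution", "r", "R")) # stRawbeRRy
    print(divide_and_conquer_edit("strawberry", "insertion", "w", "-"))    # straw-berry
```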
https://arxiv.org/abs/2502.08180
While Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by incorporating external knowledge, they still face persistent challenges in retrieval inefficiency and the inability of LLMs to filter out irrelevant information. We present ParetoRAG, an unsupervised framework that optimizes RAG systems through sentence-level refinement guided by the Pareto principle. By decomposing paragraphs into sentences and dynamically re-weighting core content while preserving contextual coherence, ParetoRAG achieves dual improvements in both retrieval precision and generation quality without requiring additional training or API resources. This framework has been empirically validated across various datasets, LLMs, and retrievers.
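A minimal sketch of sentence-level re-weighting in the Pareto spirit: every sentence is kept for coherence, but the small most-relevant fraction is up-weighted. Token overlap is used here as a stand-in relevance scorer; the actual framework relies on retrieval-style scores.

```python
import re

def _tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def pareto_reweight(paragraph, query, core_fraction=0.2, core_weight=2.0):
    """Split a paragraph into sentences, score each against the query, and up-weight
    the top fraction while keeping all sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
    q = _tokens(query)
    scores = [len(q & _tokens(s)) / (len(_tokens(s)) or 1) for s in sentences]
    n_core = max(1, int(round(core_fraction * len(sentences))))
    core = set(sorted(range(len(sentences)), key=scores.__getitem__, reverse=True)[:n_core])
    return [(s, core_weight if i in core else 1.0) for i, s in enumerate(sentences)]

if __name__ == "__main__":
    para = ("The Eiffel Tower was completed in 1889. It was built for the World's Fair. "
            "Paris has many museums. The tower is about 330 metres tall.")
    for sent, w in pareto_reweight(para, "How tall is the Eiffel Tower?"):
        print(f"{w:.1f}  {sent}")
```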
https://arxiv.org/abs/2502.08178
We identify characteristic features of how pitch is manipulated for expressive purposes by Hyper Music, a mainstream commercial music company specialising in advertisement music for global corporations. The study shows that the use and organisation of pitch in the company's 'Primaal' brand differs from Western classical music. Through interviews with producers and in-depth analysis of their work, we reveal that their methods centre on a conscious aim to construct a musical discourse based on pitch uncertainty, contrasting with the clear transmission of well-defined pitches in Western classical traditions. According to the Primaal producers, who acknowledge the influence of artists such as Kanye West and Daft Punk and use widely available technology, pitch uncertainty captures the listener's attention. We provide analyses of musical excerpts demonstrating their approach, alongside descriptions of the tools and methods employed to achieve their expressive goals. These goals and methods are placed in a broader historical context, contrasting with fundamental principles of pitch organisation in Western music. Techniques used by Hyper Music to introduce and control pitch uncertainty include boosting upper partials, expressive use of inharmonicity, continuous pitch distributions around 'poles' tied to specific 'modes', and continuously evolving pitch. We examine these techniques from a psychoacoustic perspective, and conduct listening tests corroborating some of the observations. The ultimate goal of the study is to introduce a set of methods suited to the analysis of pitch in contemporary popular music.
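A small numpy sketch of two of the techniques mentioned (boosting upper partials and inharmonicity), using the standard stretched-partial formula; parameter values are illustrative and not taken from the Primaal productions.

```python
import numpy as np

def inharmonic_tone(f0=220.0, n_partials=12, boost=1.5, inharmonicity=0.002,
                    duration=1.0, sr=44100):
    """Additive-synthesis sketch: partials are stretched away from exact harmonic
    ratios (inharmonicity) and rolled off more slowly than 1/k, which lifts the
    upper partials relative to a plain harmonic spectrum."""
    t = np.linspace(0, duration, int(sr * duration), endpoint=False)
    tone = np.zeros_like(t)
    for k in range(1, n_partials + 1):
        freq = k * f0 * np.sqrt(1.0 + inharmonicity * k ** 2)  # stretched (inharmonic) partial
        amp = k ** (-1.0 / boost)                               # slower roll-off than 1/k
        tone += amp * np.sin(2 * np.pi * freq * t)
    return tone / np.max(np.abs(tone))                          # normalise to [-1, 1]

if __name__ == "__main__":
    y = inharmonic_tone()
    print(y.shape, float(np.abs(y).max()))    # (44100,) 1.0
```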
https://arxiv.org/abs/2502.08131
We introduce Knowledge Swapping, a novel task designed to selectively regulate knowledge of a pretrained model by enabling the forgetting of user-specified information, retaining essential knowledge, and acquiring new knowledge simultaneously. By delving into the analysis of the knock-on feature hierarchy, we find that incremental learning typically progresses from low-level representations to higher-level semantics, whereas forgetting tends to occur in the opposite direction, starting from high-level semantics and moving down to low-level features. Building upon this, we propose to benchmark the knowledge swapping task with the strategy of Learning Before Forgetting. Comprehensive experiments on various tasks like image classification, object detection, and semantic segmentation validate the effectiveness of the proposed strategy. The source code is available at this https URL.
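A toy sketch of the Learning Before Forgetting ordering: first train on the new-knowledge set, then run a forgetting phase on the user-specified set while rehearsing retained data. Gradient ascent on the forget set is only one plausible forgetting objective, assumed here for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn

def learning_before_forgetting(model, new_loader, forget_loader, retain_loader,
                               epochs=1, lr=1e-3, forget_weight=0.5):
    """Two-phase toy loop: Phase 1 acquires the new knowledge, Phase 2 unlearns the
    user-specified set (here via a negated loss term) while rehearsing retained data."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()

    for _ in range(epochs):                       # Phase 1: learn the new knowledge
        for x, y in new_loader:
            opt.zero_grad()
            ce(model(x), y).backward()
            opt.step()

    for _ in range(epochs):                       # Phase 2: forget, while retaining
        for (xf, yf), (xr, yr) in zip(forget_loader, retain_loader):
            loss = ce(model(xr), yr) - forget_weight * ce(model(xf), yf)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

if __name__ == "__main__":
    model = nn.Linear(16, 4)
    make = lambda n: [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(n)]
    learning_before_forgetting(model, make(3), make(3), make(3))
```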
https://arxiv.org/abs/2502.08075
Large Language Models (LLMs) often excel in specific domains but fall short in others due to the limitations of their training. Thus, enabling LLMs to solve problems collaboratively by integrating their complementary knowledge promises to improve their performance across domains. To realize this potential, we introduce a novel Collaborative Speculative Decoding (CoSD) algorithm that enables efficient LLM knowledge fusion at test time without requiring additional model training. CoSD employs a draft model to generate initial sequences and an easy-to-learn rule or decision tree to decide when to invoke an assistant model to improve these drafts. CoSD not only enhances knowledge fusion but also improves inference efficiency, is transferable across domains and models, and offers greater explainability. Experimental results demonstrate that CoSD improves accuracy by up to 10% across benchmarks compared to existing methods, providing a scalable and effective solution for LLM-based applications.
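A toy sketch of the decoding loop described above: a draft model proposes each token with a confidence score, and a simple rule decides when the assistant model should take over. All three callables are hypothetical stand-ins for real models and for the learned rule or decision tree.

```python
def collaborative_decode(draft_step, assistant_step, should_defer, prompt, max_tokens=6):
    """Token-by-token loop: accept the draft model's proposal unless the deferral rule
    fires, in which case the assistant model supplies the token instead."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        token, confidence = draft_step(tokens)          # draft model proposal
        if should_defer(token, confidence):             # rule / decision tree
            token = assistant_step(tokens)              # assistant model takes over
        if token == "<eos>":
            break
        tokens.append(token)
    return tokens

if __name__ == "__main__":
    draft = lambda ctx: (("answer", 0.4) if len(ctx) % 3 == 0 else ("the", 0.9))
    assistant = lambda ctx: "better_answer"
    rule = lambda tok, conf: conf < 0.5                  # defer when the draft is unsure
    print(collaborative_decode(draft, assistant, rule, ["Q:", "example"]))
```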
https://arxiv.org/abs/2502.08020
Large Language Models (LLMs) have achieved impressive results across numerous domains, yet they experience notable deficiencies in legal question-answering tasks. LLMs often generate generalized responses that lack the logical specificity required for expert legal advice and are prone to hallucination, providing answers that appear correct but are unreliable. Retrieval-Augmented Generation (RAG) techniques offer partial solutions to address this challenge, but existing approaches typically focus only on semantic similarity, neglecting the logical structure essential to legal reasoning. In this paper, we propose the Logical-Semantic Integration Model (LSIM), a novel supervised framework that bridges semantic and logical coherence. LSIM comprises three components: reinforcement learning predicts a structured fact-rule chain for each question, a trainable Deep Structured Semantic Model (DSSM) retrieves the most relevant candidate questions by integrating semantic and logical features, and in-context learning generates the final answer using the retrieved content. Our experiments on a real-world legal QA dataset, validated through both automated metrics and human evaluation, demonstrate that LSIM significantly enhances accuracy and reliability compared to existing methods.
https://arxiv.org/abs/2502.07912
The rise of large language models has opened new avenues for users seeking legal advice. However, users often lack professional legal knowledge, which can lead to questions that omit critical information. This deficiency makes it challenging for traditional legal question-answering systems to accurately identify users' actual needs, often resulting in imprecise or generalized advice. In this work, we develop a legal question-answering system called Intelligent Legal Assistant, which interacts with users to precisely capture their needs. When a user poses a question, the system requests that the user select their geographical location to pinpoint the applicable laws. It then generates clarifying questions and options based on the key information missing from the user's initial question. This allows the user to select and provide the necessary details. Once all necessary information is provided, the system produces an in-depth legal analysis encompassing three aspects: overall conclusion, jurisprudential analysis, and resolution suggestions.
https://arxiv.org/abs/2502.07904
We present MatSwap, a method to transfer materials to designated surfaces in an image photorealistically. Such a task is non-trivial due to the large entanglement of material appearance, geometry, and lighting in a photograph. In the literature, material editing methods typically rely on either cumbersome text engineering or extensive manual annotations requiring artist knowledge and 3D scene properties that are impractical to obtain. In contrast, we propose to directly learn the relationship between the input material -- as observed on a flat surface -- and its appearance within the scene, without the need for explicit UV mapping. To achieve this, we rely on a custom light- and geometry-aware diffusion model. We fine-tune a large-scale pre-trained text-to-image model for material transfer using our synthetic dataset, preserving its strong priors to ensure effective generalization to real images. As a result, our method seamlessly integrates a desired material into the target location in the photograph while retaining the identity of the scene. We evaluate our method on synthetic and real images and show that it compares favorably to recent work both qualitatively and quantitatively. We will release our code and data upon publication.
https://arxiv.org/abs/2502.07784
We demonstrate that discriminative models inherently contain powerful generative capabilities, challenging the fundamental distinction between discriminative and generative architectures. Our method, Direct Ascent Synthesis (DAS), reveals these latent capabilities through multi-resolution optimization of CLIP model representations. While traditional inversion attempts produce adversarial patterns, DAS achieves high-quality image synthesis by decomposing optimization across multiple spatial scales (1x1 to 224x224), requiring no additional training. This approach not only enables diverse applications -- from text-to-image generation to style transfer -- but maintains natural image statistics ($1/f^2$ spectrum) and guides the generation away from non-robust adversarial patterns. Our results demonstrate that standard discriminative models encode substantially richer generative knowledge than previously recognized, providing new perspectives on model interpretability and the relationship between adversarial examples and natural image synthesis.
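A toy sketch of the multi-resolution decomposition behind DAS: the image is the sum of learnable components at several spatial scales, and all scales are optimized jointly so the encoded image matches a target embedding. `encode_image` stands in for a frozen, differentiable CLIP image encoder and `target_embedding` for the encoded prompt; both are assumptions here, not the authors' code.

```python
import torch
import torch.nn.functional as F

def direct_ascent_synthesis(encode_image, target_embedding, steps=200, lr=0.05,
                            sizes=(1, 2, 4, 8, 16, 32, 64, 112, 224)):
    """Optimize per-scale image components (upsampled and summed) to maximize cosine
    similarity between the encoded image and a target embedding."""
    comps = [torch.zeros(1, 3, s, s, requires_grad=True) for s in sizes]
    opt = torch.optim.Adam(comps, lr=lr)
    target = F.normalize(target_embedding, dim=-1)

    def compose():
        return sum(F.interpolate(c, size=(224, 224), mode="bilinear",
                                 align_corners=False) for c in comps)

    for _ in range(steps):
        img_emb = F.normalize(encode_image(torch.sigmoid(compose())), dim=-1)
        loss = -(img_emb * target).sum()                 # maximise cosine similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return torch.sigmoid(compose())

if __name__ == "__main__":
    # Dummy encoder (global average pooling) in place of CLIP, just to show the loop runs.
    dummy_encoder = lambda img: img.mean(dim=(2, 3))
    out = direct_ascent_synthesis(dummy_encoder, torch.randn(1, 3), steps=5)
    print(out.shape)   # (1, 3, 224, 224)
```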
https://arxiv.org/abs/2502.07753
Egocentric visual query localization (EgoVQL) focuses on localizing the target of interest in space and time from first-person videos, given a visual query. Despite recent progress, existing methods often struggle to handle severe object appearance changes and cluttered backgrounds in the video due to lacking sufficient target cues, leading to degradation. Addressing this, we introduce PRVQL, a novel Progressive knowledge-guided Refinement framework for EgoVQL. The core idea is to continuously exploit target-relevant knowledge directly from videos and utilize it as guidance to refine both query and video features for improving target localization. Our PRVQL contains multiple processing stages. The target knowledge from one stage, comprising appearance and spatial knowledge extracted via two specially designed knowledge learning modules, is utilized as guidance to refine the query and video features for the next stage, which are then used to generate more accurate knowledge for further feature refinement. With such a progressive process, target knowledge in PRVQL can be gradually improved, which, in turn, leads to better refined query and video features for localization in the final stage. Compared to previous methods, our PRVQL enjoys, besides the given object cues, additional crucial target information from the video as guidance to refine features, and hence enhances EgoVQL in complicated scenes. In our experiments on the challenging Ego4D benchmark, PRVQL achieves state-of-the-art results and largely surpasses other methods, showing its efficacy. Our code, model and results will be released at this https URL.
https://arxiv.org/abs/2502.07707
To help users make privacy-related decisions, personalized privacy assistants based on AI technology have been developed in recent years. These AI-driven Personalized Privacy Assistants (AI-driven PPAs) can bring significant benefits to users, who may otherwise struggle to make decisions regarding their personal data in environments saturated with privacy-related decision requests. However, no study has systematically examined the features of these AI-driven PPAs, their underlying technologies, or the accuracy of their decisions. To fill this gap, we present a Systematization of Knowledge (SoK) to map the existing solutions found in the scientific literature. We screened 1,697 unique research papers from the last decade (2013-2023), constructing a classification from the 39 included papers. As a result, this SoK reviews several aspects of existing research on AI-driven PPAs in terms of types of publications, contributions, methodological quality, and other quantitative insights. Furthermore, we provide a comprehensive classification of AI-driven PPAs, delving into their architectural choices, system contexts, types of AI used, data sources, types of decisions, and control over decisions, among other facets. Based on our SoK, we further underline the research gaps and challenges and formulate recommendations for the design and development of AI-driven PPAs as well as avenues for future research.
https://arxiv.org/abs/2502.07693