The growing availability of longitudinal Magnetic Resonance Imaging (MRI) datasets has facilitated Artificial Intelligence (AI)-driven modeling of disease progression, making it possible to predict future medical scans for individual patients. However, despite significant advancements in AI, current methods continue to face challenges including achieving patient-specific individualization, ensuring spatiotemporal consistency, efficiently utilizing longitudinal data, and managing the substantial memory demands of 3D scans. To address these challenges, we propose Brain Latent Progression (BrLP), a novel spatiotemporal model designed to predict individual-level disease progression in 3D brain MRIs. The key contributions of BrLP are fourfold: (i) it operates in a small latent space, mitigating the computational challenges posed by high-dimensional imaging data; (ii) it explicitly integrates subject metadata to enhance the individualization of predictions; (iii) it incorporates prior knowledge of disease dynamics through an auxiliary model, facilitating the integration of longitudinal data; and (iv) it introduces the Latent Average Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in the predicted progression at inference time and (b) allows us to derive a measure of uncertainty for the prediction. We train and evaluate BrLP on 11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its generalizability on an external test set comprising 2,257 MRIs from 962 subjects. Our experiments compare BrLP-generated MRI scans with real follow-up MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The code is publicly available at: this https URL.
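To make the Latent Average Stabilization step concrete, here is a minimal Python sketch, not the authors' implementation: `predict_latent` is a hypothetical stand-in for one stochastic roll-out of the conditional latent model, and LAS-style stabilization simply averages several roll-outs and uses their spread as an uncertainty estimate.

```python
import numpy as np

def predict_latent(z0, metadata, months_ahead, rng):
    """Hypothetical stand-in for one stochastic roll-out of the latent model.

    In BrLP this would be the conditional diffusion model operating in the
    autoencoder's latent space; here we just perturb the baseline latent so
    the sketch stays self-contained.
    """
    drift = 0.01 * months_ahead * (1.0 + metadata.get("age", 70) / 100.0)
    return z0 + drift + 0.05 * rng.standard_normal(z0.shape)

def latent_average_stabilization(z0, metadata, months_ahead, n_runs=8, seed=0):
    """Average several stochastic predictions in latent space (LAS-style).

    Returns the stabilized latent and a simple spread-based uncertainty map.
    """
    rng = np.random.default_rng(seed)
    runs = np.stack([predict_latent(z0, metadata, months_ahead, rng)
                     for _ in range(n_runs)])
    z_mean = runs.mean(axis=0)        # spatiotemporally smoother prediction
    z_uncertainty = runs.std(axis=0)  # disagreement across runs as uncertainty
    return z_mean, z_uncertainty

# Toy usage: a small 3D latent grid for one subject.
z_baseline = np.zeros((4, 8, 8, 8))
z_pred, z_unc = latent_average_stabilization(z_baseline, {"age": 72}, months_ahead=12)
print(z_pred.shape, float(z_unc.mean()))
```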
https://arxiv.org/abs/2502.08560
We present Label Space Reduction (LSR), a novel method for improving zero-shot classification performance of Large Language Models (LLMs). LSR iteratively refines the classification label space by systematically ranking and reducing candidate classes, enabling the model to concentrate on the most relevant options. By leveraging unlabeled data with the statistical learning capabilities of data-driven models, LSR dynamically optimizes the label space representation at test time. Our experiments across seven benchmarks demonstrate that LSR improves macro-F1 scores by an average of 7.0% (up to 14.2%) with Llama-3.1-70B and 3.3% (up to 11.1%) with Claude-3.5-Sonnet compared to standard zero-shot classification baselines. To reduce the computational overhead of LSR, which requires an additional LLM call at each iteration, we propose distilling the model into a probabilistic classifier, allowing for efficient inference.
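A highly simplified Python sketch of the rank-and-reduce loop described above: `llm_rank_labels` is a hypothetical stand-in for the LLM call (here it just shuffles), and the paper's use of unlabeled data and a data-driven ranker is omitted.

```python
import random

def llm_rank_labels(text, labels):
    """Hypothetical stand-in for an LLM call that ranks candidate labels for
    the input, most relevant first. Replace with a real API call."""
    return sorted(labels, key=lambda _: random.random())

def label_space_reduction(text, labels, keep_ratio=0.5, min_labels=2):
    """Iteratively shrink the label space, LSR-style: rank the candidates,
    keep the top fraction, and repeat until few labels remain."""
    candidates = list(labels)
    while len(candidates) > min_labels:
        ranked = llm_rank_labels(text, candidates)
        keep = max(min_labels, int(len(ranked) * keep_ratio))
        candidates = ranked[:keep]
    # Final zero-shot decision over the reduced label space.
    return llm_rank_labels(text, candidates)[0]

print(label_space_reduction("the battery drains too fast",
                            ["hardware", "billing", "software", "shipping", "other"]))
```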
https://arxiv.org/abs/2502.08436
Context-aware compression techniques have gained increasing attention as model sizes continue to grow, introducing computational bottlenecks that hinder efficient deployment. A structured encoding approach was proposed to selectively eliminate redundant parameter groups while ensuring that representational fidelity was preserved across multiple layers. Contextual Compression Encoding (CCE) introduced a multi-stage encoding mechanism that dynamically restructured parameter distributions, allowing for significant reductions in memory footprint and computational complexity. Experimental evaluations demonstrated that models compressed through CCE retained linguistic expressivity and coherence, maintaining accuracy across a range of text generation and classification tasks. Layer-wise analysis revealed that middle-network layers exhibited higher compression ratios, aligning with the observation that self-attention and feed-forward transformations contained redundancies that could be reorganized without impairing functional capacity. Comparisons against conventional quantization and pruning methods confirmed that CCE provided a more balanced trade-off between efficiency and model retention, achieving reductions in energy consumption and inference latency without requiring extensive retraining. Computational efficiency improvements were particularly evident in deployment scenarios involving resource-constrained environments, where reductions in memory usage enabled more scalable implementations. Further analyses of internal network behavior showed that compressed models exhibited stable activation distributions and adapted dynamically to input variations, reinforcing the viability of structured compression strategies for optimizing large-scale architectures.
https://arxiv.org/abs/2502.08323
Self-attention in transformer models is an incremental associative memory that maps key vectors to value vectors. One way to speed up self-attention is to employ GPU-compliant vector search algorithms, yet standard partitioning methods yield poor results in this context because of (1) the different distributions of keys and queries and (2) the effect of RoPE positional encoding. In this paper, we introduce SAAP (Self-Attention with Asymmetric Partitions), which overcomes these problems. It is an asymmetrical indexing technique that employs distinct partitions for keys and queries, thereby approximating self-attention with a data-adaptive sparsity pattern. It works on pretrained language models without finetuning, as it only requires training (offline) a small query classifier. On a long-context Llama 3.1-8b model, with sequences ranging from 100k to 500k tokens, our method typically reduces the fraction of memory that needs to be looked up by a factor of 20, which translates to a time saving of 60% compared to FlashAttention-v2.
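The asymmetric key/query partitioning can be illustrated with a small numpy sketch. This is not the paper's GPU implementation: it clusters keys with plain k-means and routes the query to its nearest key centroids, whereas SAAP trains a small query classifier to do that routing.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def build_key_partitions(keys, n_parts, n_iter=10, seed=0):
    """Cluster keys with plain k-means (a stand-in for a GPU-friendly index)."""
    rng = np.random.default_rng(seed)
    centroids = keys[rng.choice(len(keys), n_parts, replace=False)]
    for _ in range(n_iter):
        assign = np.argmin(((keys[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(n_parts):
            if np.any(assign == c):
                centroids[c] = keys[assign == c].mean(axis=0)
    return centroids, assign

def approx_attention(query, keys, values, centroids, assign, n_probe=2):
    """Asymmetric lookup: route the query to a few key partitions only.

    SAAP learns this routing with a query classifier; probing the nearest
    key centroids keeps the sketch simple."""
    probed = np.argsort(((centroids - query) ** 2).sum(-1))[:n_probe]
    mask = np.isin(assign, probed)
    if not mask.any():            # degenerate case: fall back to all keys
        mask[:] = True
    scores = softmax(keys[mask] @ query / np.sqrt(query.shape[0]))
    return scores @ values[mask]

# Toy usage with random keys and values.
rng = np.random.default_rng(1)
K, V = rng.standard_normal((1000, 64)), rng.standard_normal((1000, 64))
cents, assign = build_key_partitions(K, n_parts=16)
print(approx_attention(rng.standard_normal(64), K, V, cents, assign).shape)
```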
https://arxiv.org/abs/2502.08246
This paper argues that deep neural networks (DNNs) mostly determine their outputs during the early stages of inference, where biases inherent in the model play a crucial role in shaping this process. We draw a parallel between this phenomenon and human decision-making, which often relies on fast, intuitive heuristics. Using diffusion models (DMs) as a case study, we demonstrate that DNNs often commit to early-stage decisions that are influenced by the type and extent of bias in their design and training. Our findings offer a new perspective on bias mitigation, efficient inference, and the interpretation of machine learning systems. By identifying the temporal dynamics of decision-making in DNNs, this paper aims to inspire further discussion and research within the machine learning community.
https://arxiv.org/abs/2502.08167
Chain-of-thought (CoT) prompting has achieved remarkable success in natural language processing (NLP). However, its vast potential remains largely unexplored for graphs. This raises an interesting question: How can we design CoT prompting for graphs to guide graph models to learn step by step? On one hand, unlike natural languages, graphs are non-linear and characterized by complex topological structures. On the other hand, many graphs lack textual data, making it difficult to formulate language-based CoT prompting. In this work, we propose the first CoT prompt learning framework for text-free graphs, GCoT. Specifically, we decompose the adaptation process for each downstream task into a series of inference steps, with each step consisting of prompt-based inference, "thought" generation, and thought-conditioned prompt learning. While these steps mimic CoT prompting in NLP, the exact mechanism differs significantly. At each step, an input graph, along with a prompt, is first fed into a pre-trained graph encoder for prompt-based inference. We then aggregate the hidden layers of the encoder to construct a "thought", which captures the working state of each node in the current step. Conditioned on this thought, we learn a prompt specific to each node based on the current state. These prompts are fed into the next inference step, repeating the cycle. To evaluate and analyze the effectiveness of GCoT, we conduct comprehensive experiments on eight public datasets, which demonstrate the advantage of our approach.
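The inference cycle (prompt-based inference, thought construction, thought-conditioned prompting) can be sketched as follows; the encoder, aggregation rule, and prompt generator here are toy stand-ins, not the GCoT architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def graph_encoder(x, adj, prompt, n_layers=3):
    """Stand-in for a frozen pre-trained graph encoder: simple message passing.
    Returns every layer's hidden states so a 'thought' can be built."""
    hidden, states = x + prompt, []
    for _ in range(n_layers):
        hidden = np.tanh(adj @ hidden)   # aggregate neighbour features
        states.append(hidden)
    return states

def make_thought(states):
    """Aggregate the hidden layers into a per-node 'thought' (mean here)."""
    return np.mean(np.stack(states), axis=0)

def thought_conditioned_prompt(thought, w_prompt):
    """Produce a node-specific prompt conditioned on the thought; w_prompt
    would be learned per downstream task, here it is random."""
    return np.tanh(thought @ w_prompt)

# A few chained GCoT-style inference steps over a toy graph.
n, d = 5, 8
x = rng.standard_normal((n, d))
adj = (rng.random((n, n)) < 0.4).astype(float)
adj += adj.T
np.fill_diagonal(adj, 1.0)
adj /= adj.sum(axis=1, keepdims=True)
w_prompt = 0.1 * rng.standard_normal((d, d))

prompt = np.zeros((n, d))
for step in range(3):
    states = graph_encoder(x, adj, prompt)                  # prompt-based inference
    thought = make_thought(states)                          # "thought" generation
    prompt = thought_conditioned_prompt(thought, w_prompt)  # next-step prompts
print(prompt.shape)
```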
https://arxiv.org/abs/2502.08092
Decomposition of text into atomic propositions is a flexible framework allowing for the closer inspection of input and output text. We use atomic decomposition of hypotheses in two natural language reasoning tasks, traditional NLI and defeasible NLI, to form atomic sub-problems, or granular inferences that models must weigh when solving the overall problem. These atomic sub-problems serve as a tool to further understand the structure of both NLI and defeasible reasoning, probe a model's consistency and understanding of different inferences, and measure the diversity of examples in benchmark datasets. Our results indicate that LLMs still struggle with logical consistency on atomic NLI and defeasible NLI sub-problems. Lastly, we identify critical atomic sub-problems of defeasible NLI examples, or those that most contribute to the overall label, and propose a method to measure the inferential consistency of a model, a metric designed to capture the degree to which a model makes consistently correct or incorrect predictions about the same fact under different contexts.
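The abstract does not spell out the formula for inferential consistency; the sketch below assumes one plausible reading, namely the fraction of atomic facts on which the model is uniformly correct or uniformly incorrect across different contexts.

```python
from collections import defaultdict

def inferential_consistency(records):
    """records: (fact_id, prediction, gold_label) tuples, where the same
    fact_id appears under several different contexts.

    Returns the fraction of facts for which the model is uniformly correct or
    uniformly incorrect across contexts (an assumed reading of the metric)."""
    by_fact = defaultdict(list)
    for fact_id, pred, gold in records:
        by_fact[fact_id].append(pred == gold)
    consistent = sum(1 for outcomes in by_fact.values()
                     if all(outcomes) or not any(outcomes))
    return consistent / len(by_fact)

records = [
    ("f1", "entail", "entail"), ("f1", "entail", "entail"),   # consistent (correct)
    ("f2", "neutral", "entail"), ("f2", "entail", "entail"),  # inconsistent
]
print(inferential_consistency(records))  # 0.5
```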
https://arxiv.org/abs/2502.08080
The capabilities of Large Language Models (LLMs) in low-resource languages lag far behind those in English, making their universal accessibility a significant challenge. To alleviate this, we present Franken-Adapter, a modular language adaptation approach for decoder-only LLMs with embedding surgery. Our method begins by creating customized vocabularies for target languages and performing language adaptation through embedding tuning on multilingual data. These pre-trained embeddings are subsequently integrated with LLMs that have been instruction-tuned on English alignment data to enable zero-shot cross-lingual transfer. Our experiments on Gemma2 models with up to 27B parameters demonstrate improvements of up to 20% across 96 languages, spanning both discriminative and generative tasks, with minimal regressions (<1%) in English. Further in-depth analysis reveals the critical role of customizing tokenizers in enhancing language adaptation, while boosting inference efficiency. Additionally, we show the versatility of our method by achieving a 14% improvement over a math-optimized LLM across 20 languages, offering a modular solution to transfer reasoning abilities across languages post hoc.
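The embedding-surgery idea can be pictured with a toy PyTorch sketch: a frozen, instruction-tuned decoder body keeps its weights while its input embeddings and output head are swapped for ones tuned separately on multilingual data with a customized vocabulary. The tiny model and dimensions below are placeholders, not Gemma2.

```python
import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    """Toy decoder-only LM standing in for an instruction-tuned model."""
    def __init__(self, vocab_size, d_model=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.body = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for transformer blocks
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, ids):
        h, _ = self.body(self.embed(ids))
        return self.lm_head(h)

def embedding_surgery(instruct_model, adapted_embed, adapted_head):
    """Franken-Adapter-style composition, sketched: keep the English
    instruction-tuned body and swap in embeddings tuned on multilingual
    data with a customized vocabulary."""
    instruct_model.embed = adapted_embed
    instruct_model.lm_head = adapted_head
    return instruct_model

# Usage: pretend the adapted pieces use a larger, customized vocabulary.
base = TinyDecoderLM(vocab_size=100)
adapted_embed = nn.Embedding(150, 32)          # tuned on multilingual data (hypothetical)
adapted_head = nn.Linear(32, 150, bias=False)  # matching output projection
model = embedding_surgery(base, adapted_embed, adapted_head)
print(model(torch.randint(0, 150, (1, 5))).shape)  # torch.Size([1, 5, 150])
```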
https://arxiv.org/abs/2502.08037
Internal representations within deep neural architectures encode high-dimensional abstractions of linguistic structures, yet they often exhibit inefficiencies in feature distribution, limiting expressiveness and adaptability. Contextual Subspace Manifold Projection introduces a structured refinement technique that selectively reconfigures token embeddings through controlled subspace constraints, ensuring more stable and geometrically well-defined feature distributions. Empirical evaluations demonstrated that the structured intervention reduced anisotropy, leading to improved representation compactness while preserving semantic fidelity across transformer layers. Clustering analyses indicated that token embeddings exhibited greater feature separability, reinforcing the hypothesis that structured projection techniques enhance internal representation organization without sacrificing linguistic coherence. Gradient magnitude distributions suggested that the method introduced a smoother optimization trajectory, potentially contributing to more stable parameter updates throughout training. Computational overhead associated with the projection operations remained minimal, ensuring that the refinements did not introduce significant trade-offs in model efficiency or inference speed. Comparisons with standard embedding refinement techniques highlighted that structured manifold constraints provided a direct mechanism for improving representation quality without requiring additional gradient-based optimization. Perplexity evaluations confirmed that the adjustments did not negatively impact sequence coherence, further validating the effectiveness of the proposed approach.
https://arxiv.org/abs/2502.08026
Large Language Models (LLMs) often excel in specific domains but fall short in others due to the limitations of their training. Thus, enabling LLMs to solve problems collaboratively by integrating their complementary knowledge promises to improve their performance across domains. To realize this potential, we introduce a novel Collaborative Speculative Decoding (CoSD) algorithm that enables efficient LLM knowledge fusion at test time without requiring additional model training. CoSD employs a draft model to generate initial sequences and an easy-to-learn rule or decision tree to decide when to invoke an assistant model to improve these drafts. CoSD not only enhances knowledge fusion but also improves inference efficiency, is transferable across domains and models, and offers greater explainability. Experimental results demonstrate that CoSD improves accuracy by up to 10% across benchmarks compared to existing methods, providing a scalable and effective solution for LLM-based applications.
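A minimal sketch of the draft-then-decide loop: `draft_next_token` and `assistant_next_token` are hypothetical stand-ins for the two LLMs, and a confidence threshold replaces the learned rule or decision tree.

```python
def draft_next_token(context):
    """Hypothetical draft model: a cheap, domain-limited proposal."""
    return {"token": context.split()[-1] + "_d",
            "confidence": 0.4 + 0.1 * (len(context) % 3)}

def assistant_next_token(context):
    """Hypothetical assistant model with complementary knowledge."""
    return context.split()[-1] + "_a"

def should_invoke_assistant(proposal):
    """Easy-to-learn rule (a decision tree in the paper); here a threshold."""
    return proposal["confidence"] < 0.5

def cosd_generate(prompt, max_tokens=5):
    """Collaborative speculative decoding, sketched: draft first, and let a
    simple rule decide when the assistant overwrites the drafted token."""
    context = prompt
    for _ in range(max_tokens):
        proposal = draft_next_token(context)
        token = (assistant_next_token(context) if should_invoke_assistant(proposal)
                 else proposal["token"])
        context += " " + token
    return context

print(cosd_generate("solve this step by"))
```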
https://arxiv.org/abs/2502.08020
We propose a novel dynamic safety framework that optimizes language model (LM) safety reasoning at inference time without modifying model weights. Building on recent advances in self-critique methods, our approach leverages a meta-critique mechanism that iteratively updates safety prompts, termed specifications, to drive the critique and revision process adaptively. This test-time optimization improves performance not only against adversarial jailbreak requests but also on diverse general safety-related tasks, such as avoiding moral harm or pursuing honest responses. Our empirical evaluations across several language models demonstrate that dynamically optimized safety prompts yield significantly higher safety scores compared to fixed system prompts and static self-critique defenses. Code to be released at this https URL.
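A schematic of the test-time loop implied by the abstract, with hypothetical `generate`, `critique`, and `meta_critique` stand-ins for the underlying LLM calls; only the safety specification is rewritten between rounds, never the model weights.

```python
def generate(spec, user_request):
    """Hypothetical LLM call producing a response under a safety spec."""
    return f"[response to '{user_request}' under spec: {spec[:40]}...]"

def critique(response, spec):
    """Hypothetical self-critique call: score the response against the spec."""
    return {"safety_score": 0.6, "issues": ["tone could enable harm"]}

def meta_critique(spec, critique_result):
    """Hypothetical meta-critique call: rewrite the spec itself so that the
    next critique/revision round targets the observed weaknesses."""
    return spec + " Also avoid: " + "; ".join(critique_result["issues"])

def optimize_spec_at_test_time(user_request, spec, n_rounds=3):
    """Iteratively update the safety prompt (spec) at inference time."""
    response = generate(spec, user_request)
    for _ in range(n_rounds):
        result = critique(response, spec)
        if result["safety_score"] > 0.9:
            break
        spec = meta_critique(spec, result)       # update the specification
        response = generate(spec, user_request)  # revise under the new spec
    return spec, response

spec, out = optimize_spec_at_test_time("how do I pick a lock?",
                                       "Refuse harmful requests politely.")
print(spec)
```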
https://arxiv.org/abs/2502.07985
Transformer-based text embedding models have improved their performance on benchmarks like MIRACL and BEIR by increasing their parameter counts. However, this scaling approach introduces significant deployment challenges, including increased inference latency and memory usage. These challenges are particularly severe in retrieval-augmented generation (RAG) applications, where large models' increased memory requirements constrain dataset ingestion capacity, and their higher latency directly impacts query-time performance. While causal language models have addressed similar efficiency challenges using Mixture of Experts (MoE) architectures, this approach hasn't been successfully adapted to the general text embedding setting. In this paper, we introduce Nomic Embed v2, the first general purpose MoE text embedding model. Our model outperforms models in the same parameter class on both monolingual and multilingual benchmarks while also maintaining competitive performance with models twice its size. We open-source all code, models, and evaluation data to ensure full reproducibility of our training pipeline.
https://arxiv.org/abs/2502.07972
We present Pippo, a generative model capable of producing 1K resolution dense turnaround videos of a person from a single casually clicked photo. Pippo is a multi-view diffusion transformer and does not require any additional inputs - e.g., a fitted parametric model or camera parameters of the input image. We pre-train Pippo on 3B human images without captions, and conduct multi-view mid-training and post-training on studio-captured humans. During mid-training, to quickly absorb the studio dataset, we denoise several (up to 48) views at low resolution, and encode target cameras coarsely using a shallow MLP. During post-training, we denoise fewer views at high resolution and use pixel-aligned controls (e.g., spatial anchors and Plücker rays) to enable 3D-consistent generations. At inference, we propose an attention biasing technique that allows Pippo to simultaneously generate more than 5 times as many views as seen during training. Finally, we also introduce an improved metric to evaluate the 3D consistency of multi-view generations, and show that Pippo outperforms existing works on multi-view human generation from a single image.
https://arxiv.org/abs/2502.07785
Next-Token Prediction (NTP) is the de facto approach for autoregressive (AR) video generation, but it suffers from suboptimal unidirectional dependencies and slow inference speed. In this work, we propose a semi-autoregressive (semi-AR) framework, called Next-Block Prediction (NBP), for video generation. By uniformly decomposing video content into equal-sized blocks (e.g., rows or frames), we shift the generation unit from individual tokens to blocks, allowing each token in the current block to simultaneously predict the corresponding token in the next block. Unlike traditional AR modeling, our framework employs bidirectional attention within each block, enabling tokens to capture more robust spatial dependencies. By predicting multiple tokens in parallel, NBP models significantly reduce the number of generation steps, leading to faster and more efficient inference. Our model achieves FVD scores of 103.3 on UCF101 and 25.5 on K600, outperforming the vanilla NTP model by an average of 4.4. Furthermore, thanks to the reduced number of inference steps, the NBP model generates 8.89 frames (128x128 resolution) per second, achieving an 11x speedup. We also explored model scales ranging from 700M to 3B parameters, observing significant improvements in generation quality, with FVD scores dropping from 103.3 to 55.3 on UCF101 and from 25.5 to 19.5 on K600, demonstrating the scalability of our approach.
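The semi-autoregressive idea fits in a few lines: generation is autoregressive across blocks but parallel within a block, so the number of sequential steps drops from the number of tokens to the number of blocks. The block predictor below is a random stand-in for the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_block(prev_blocks, block_size, vocab_size):
    """Hypothetical block predictor: every token of the next block is produced
    in one parallel step, conditioned on all previously generated blocks
    (bidirectional attention inside the block is abstracted away here)."""
    context = np.concatenate(prev_blocks) if prev_blocks else np.zeros(0, dtype=int)
    logits = rng.standard_normal((block_size, vocab_size)) + 0.01 * len(context)
    return logits.argmax(axis=-1)

def next_block_prediction(n_blocks=4, block_size=8, vocab_size=1024):
    """Semi-AR generation: autoregressive across blocks, parallel within each
    block, so only n_blocks sequential steps are needed instead of
    n_blocks * block_size token-level steps."""
    blocks = []
    for _ in range(n_blocks):
        blocks.append(predict_block(blocks, block_size, vocab_size))
    return np.concatenate(blocks)

tokens = next_block_prediction()
print(tokens.shape)  # 32 tokens generated in 4 sequential steps
```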
https://arxiv.org/abs/2502.07737
This paper presents a learned model to predict the robot-centric velocity of an underwater robot through dynamics-aware proprioception. The method exploits a recurrent neural network that takes as inputs inertial cues, motor commands, and battery voltage readings, alongside the hidden state of the previous time step, and outputs robust velocity estimates and their associated uncertainty. An ensemble of networks is utilized to enhance the velocity and uncertainty predictions. By fusing the network's outputs into an Extended Kalman Filter, alongside inertial predictions and barometer updates, the method enables long-term underwater odometry without further exteroception. Furthermore, when integrated into visual-inertial odometry, the method enhances estimation resilience when an order of magnitude fewer total features are tracked (as few as 1) compared to conventional visual-inertial systems. Tested onboard an underwater robot deployed both in a laboratory pool and the Trondheim Fjord, the method takes less than 5 ms for inference on either the CPU or the GPU of an NVIDIA Orin AGX and demonstrates less than 4% relative position error on novel trajectories during complete visual blackout, and approximately 2% relative error when a maximum of 2 visual features from a monocular camera are available.
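The fusion step can be sketched as an ensemble mean/variance feeding a standard Kalman measurement update; the state layout, noise values, and the two fake networks below are placeholders, not the paper's filter design.

```python
import numpy as np

def ensemble_velocity(nets, features):
    """Combine per-network velocity means and variances with a simple mixture
    rule; `nets` is a list of callables returning (mean, variance)."""
    means, variances = zip(*(net(features) for net in nets))
    means, variances = np.array(means), np.array(variances)
    mean = means.mean(axis=0)
    var = variances.mean(axis=0) + means.var(axis=0)  # within- plus between-network spread
    return mean, var

def ekf_velocity_update(x, P, z, R):
    """Standard Kalman measurement update where the learned velocity z (with
    covariance R) directly observes the velocity entries of the state x."""
    H = np.hstack([np.zeros((3, x.size - 3)), np.eye(3)])  # assume last 3 states are velocity
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(x.size) - K @ H) @ P
    return x, P

# Toy usage with two fake "networks".
fake_nets = [lambda f: (np.array([0.5, 0.0, -0.1]), np.full(3, 0.01)),
             lambda f: (np.array([0.6, 0.1, -0.1]), np.full(3, 0.02))]
z, var = ensemble_velocity(fake_nets, features=None)
x, P = np.zeros(9), np.eye(9) * 0.1
x, P = ekf_velocity_update(x, P, z, np.diag(var))
print(x[-3:])
```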
https://arxiv.org/abs/2502.07726
In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate, easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that, with the same optimization algorithm, the image-to-video task indeed converges more easily than the text-to-video task. We also explore a bag of optimization tricks to reduce the computational cost of training the image-to-video (I2V) models from three aspects: 1) speeding up model convergence by using multi-modal prior condition injection; 2) speeding up inference by applying adversarial step distillation; and 3) reducing inference memory cost with parameter sparsification. With these techniques, we are able to generate 5-second video clips within 3 seconds. By applying a test-time sliding window, we are able to generate a minute-long video within one minute, with significantly improved visual quality and motion dynamics, spending on average less than 1 second to generate each 1-second video clip. We conduct a series of preliminary explorations to find the optimal tradeoff between computational cost and video quality during diffusion step distillation and hope this could be a good foundation model for open-source explorations. The code and the model weights are available at this https URL.
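A toy sketch of the factorization only, with hypothetical `text_to_image` and `image_to_video` stand-ins for the two distilled diffusion stages; the step-distillation, prior injection, and sliding-window tricks are not shown.

```python
def text_to_image(prompt):
    """Hypothetical distilled text-to-image stage (few diffusion steps)."""
    return f"<keyframe for: {prompt}>"

def image_to_video(keyframe, n_frames):
    """Hypothetical distilled image-to-video stage animating the keyframe."""
    return [f"{keyframe}@frame{i}" for i in range(n_frames)]

def text_to_video(prompt, seconds=5, fps=24):
    """Magic 1-For-1's factorization, sketched: T2V = T2I followed by I2V,
    each of which is easier to distill to few steps than direct T2V."""
    keyframe = text_to_image(prompt)
    return image_to_video(keyframe, n_frames=seconds * fps)

clip = text_to_video("a red fox running through snow")
print(len(clip))  # 120 frames for a 5-second clip at 24 fps
```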
https://arxiv.org/abs/2502.07701
The evolution of large-scale contrastive pre-training propelled by top-tier datasets has reached a transition point in the scaling law. Consequently, sustaining and enhancing a model's pre-training capabilities in drift environments has surfaced as a notable challenge. In this paper, we first uncover that contrastive pre-training methods are significantly impacted by concept drift, wherein distributions change unpredictably, resulting in notable biases in the feature space of the pre-trained model. Empowered by causal inference, we construct a structural causal graph to systematically analyze the impact of concept drift on contrastive pre-training, and propose the causal interventional contrastive objective. Upon achieving this, we devise a resilient contrastive pre-training approach to accommodate the data stream of concept drift, with a simple and scalable implementation. Extensive experiments on various downstream tasks demonstrate that our resilient contrastive pre-training effectively mitigates the bias stemming from the concept-drift data stream. Codes are available at this https URL.
https://arxiv.org/abs/2502.07620
Imagination in world models is crucial for enabling agents to learn long-horizon policies in a sample-efficient manner. Existing recurrent state-space model (RSSM)-based world models depend on single-step statistical inference to capture the environment dynamics and, hence, are unable to perform long-term imagination tasks due to the accumulation of prediction errors. Inspired by the dual-process theory of human cognition, we propose a novel dual-mind world model (DMWM) framework that integrates logical reasoning to enable imagination with logical consistency. DMWM is composed of two components: an RSSM-based System 1 (RSSM-S1) component that handles state transitions in an intuitive manner and a logic-integrated neural network-based System 2 (LINN-S2) component that guides the imagination process through hierarchical deep logical reasoning. The inter-system feedback mechanism is designed to ensure that the imagination process follows the logical rules of the real environment. The proposed framework is evaluated on benchmark tasks that require long-term planning from the DMControl suite. Extensive experimental results demonstrate that the proposed framework yields significant improvements in terms of logical coherence, trial efficiency, data efficiency and long-term imagination over the state-of-the-art world models.
https://arxiv.org/abs/2502.07591
Linear sequence modeling approaches, such as linear attention, provide advantages like linear-time training and constant-memory inference with respect to sequence length. However, existing sequence parallelism (SP) methods are either not optimized for the right-product-first feature of linear attention or use a ring-style communication strategy, which results in lower computation parallelism and limits their scalability for longer sequences in distributed systems. In this paper, we introduce LASP-2, a new SP method to enhance both communication and computation parallelism when training linear attention transformer models with very long input sequences. Compared to the previous work LASP, LASP-2 rethinks the minimal communication requirement for SP on linear attention layers and reorganizes the whole communication-computation workflow of LASP. In this way, only one single AllGather collective communication is needed on intermediate memory states, whose sizes are independent of the sequence length, leading to significant improvements in both communication and computation parallelism, as well as their overlap. Additionally, we extend LASP-2 to LASP-2H by applying a similar communication redesign to standard attention modules, offering an efficient SP solution for hybrid models that blend linear and standard attention layers. Our evaluation on a Linear-Llama3 model, a variant of Llama3 with linear attention replacing standard attention, demonstrates the effectiveness of LASP-2 and LASP-2H. Specifically, LASP-2 achieves training speed improvements of 15.2% over LASP and 36.6% over Ring Attention, with a sequence length of 2048K across 64 GPUs. The code is released as part of: this https URL.
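The single-AllGather idea can be checked numerically with chunked causal linear attention: each rank's memory state K_i^T V_i has a fixed size independent of the sequence length, and gathering those small states is enough to finish the causal computation. The sketch below simulates the ranks on one machine and is not the paper's distributed kernel.

```python
import numpy as np

def chunked_causal_linear_attention(Q, K, V, n_ranks):
    """Chunked causal linear attention, LASP-2 style in spirit: each "rank"
    holds a sequence chunk, computes a fixed-size state M_i = K_i^T V_i, and
    a single all-gather of those states completes the computation."""
    d = Q.shape[1]
    Qs, Ks, Vs = (np.array_split(X, n_ranks) for X in (Q, K, V))

    # Local work per rank: state for the all-gather + intra-chunk causal part.
    states = [Ki.T @ Vi for Ki, Vi in zip(Ks, Vs)]                       # (d, d) each
    intra = [np.tril(Qi @ Ki.T) @ Vi for Qi, Ki, Vi in zip(Qs, Ks, Vs)]

    # "AllGather" of the small states, then each rank adds its prefix state.
    outputs = []
    for i, (Qi, Oi) in enumerate(zip(Qs, intra)):
        prefix = sum(states[:i], np.zeros((d, d)))  # states from earlier ranks
        outputs.append(Oi + Qi @ prefix)
    return np.concatenate(outputs)

# Check against the naive token-by-token formulation.
rng = np.random.default_rng(0)
L, d = 16, 4
Q, K, V = rng.standard_normal((3, L, d))
ref = np.stack([Q[t] @ (K[: t + 1].T @ V[: t + 1]) for t in range(L)])
print(np.allclose(chunked_causal_linear_attention(Q, K, V, n_ranks=4), ref))  # True
```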
https://arxiv.org/abs/2502.07563
Generative models, particularly text-to-image (T2I) diffusion models, play a crucial role in medical image analysis. However, these models are prone to training data memorization, posing significant risks to patient privacy. Synthetic chest X-ray generation is one of the most common applications in medical image analysis, with the MIMIC-CXR dataset serving as the primary data repository for this task. This study adopts a data-driven approach and presents the first systematic attempt to identify prompts and text tokens in MIMIC-CXR that contribute the most to training data memorization. Our analysis reveals an unexpected finding: prompts containing traces of de-identification procedures are among the most memorized, with de-identification markers contributing the most. Furthermore, we also find that existing inference-time memorization mitigation strategies are ineffective and fail to sufficiently reduce the model's reliance on memorized text tokens, highlighting a broader issue in T2I synthesis with MIMIC-CXR. On this front, we propose actionable strategies to enhance privacy and improve the reliability of generative models in medical imaging. Finally, our results provide a foundation for future work on developing and benchmarking memorization mitigation techniques for synthetic chest X-ray generation using the MIMIC-CXR dataset.
https://arxiv.org/abs/2502.07516