Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive video generation. Existing autoregressive video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an autoregressive video generator that retains the LLM architecture with minimal architectural modifications. To inject spatiotemporal correlations into LLMs, we identify the efficacy of incorporating 3D RoPE and diagnose its imbalanced frequency spectrum ranges. We therefore propose MM-RoPE, a RoPE scheme that preserves the original textual RoPE while providing comprehensive frequency spectra and scaled 3D positions for modeling multimodal spatiotemporal data. Moreover, Lumos-1 adopts a token dependency strategy that obeys intra-frame bidirectionality and inter-frame temporal causality. Based on this dependency strategy, we identify the issue of frame-wise loss imbalance caused by spatial information redundancy and solve it by proposing Autoregressive Discrete Diffusion Forcing (AR-DF). AR-DF introduces temporal tube masking during training with a compatible inference-time masking policy to avoid quality degradation. Using memory-efficient training techniques, we pre-train Lumos-1 on only 48 GPUs, achieving performance comparable to EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V. Code and models are available at this https URL.
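To make AR-DF's training-time masking concrete, here is a minimal sketch of temporal tube masking, assuming a clip of token embeddings shaped (T, H, W, D); the mask ratio and the zero-fill stand-in for a learned [MASK] embedding are illustrative assumptions, not the paper's exact recipe:

```python
import torch

def temporal_tube_mask(tokens, mask_ratio=0.5):
    """Sample one spatial mask and repeat it across all frames, so the
    same (h, w) positions are hidden in every frame ("tubes"). This
    counteracts frame-wise loss imbalance caused by spatial redundancy.
    tokens: (T, H, W, D) token embeddings for one video clip."""
    T, H, W, D = tokens.shape
    keep = torch.rand(H, W) >= mask_ratio        # True = visible
    tube = keep.unsqueeze(0).expand(T, H, W)     # same mask shared across time
    masked = tokens.clone()
    masked[~tube] = 0.0                          # stand-in for a learned [MASK] embedding
    return masked, tube
```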
https://arxiv.org/abs/2507.08801
We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-thought reasoning in small language models. Our approach leverages GPT-4o-generated reasoning traces to construct steering vectors that shift model behavior toward more explicit, multi-step reasoning without fine-tuning or prompt modifications. Experimental evaluations on diverse reasoning benchmarks demonstrate that cache steering improves both the qualitative structure of model reasoning and quantitative task performance. Compared to prior activation steering techniques that require continuous interventions, our one-shot cache steering offers substantial advantages in terms of hyperparameter stability, inference-time efficiency, and ease of integration, making it a more robust and practical solution for controlled generation.
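As a rough illustration of the mechanism (a sketch under our own assumptions, not the authors' released implementation), the snippet below derives steering vectors from the KV caches of paired CoT-style and plain prompts, then applies them once to a new prompt's cache before decoding; the tensor layout and the alpha/beta coefficients are assumptions:

```python
import torch

def build_steering_vectors(pos_kv, neg_kv):
    """pos_kv / neg_kv: per-layer lists of (key, value) tensors shaped
    (batch, heads, seq, head_dim), cached from prompts with and without
    GPT-4o reasoning traces (assumed padded to equal length)."""
    vecs = []
    for (pk, pv), (nk, nv) in zip(pos_kv, neg_kv):
        dk = (pk - nk).mean(dim=(0, 2), keepdim=True)   # average over batch and positions
        dv = (pv - nv).mean(dim=(0, 2), keepdim=True)
        vecs.append((dk, dv))
    return vecs

def apply_cache_steering(past_kv, vecs, alpha=4.0, beta=1.0):
    """One-shot intervention: shift the prompt's cached keys/values once,
    then decode normally -- no per-step hooks and no weight changes."""
    return [(k + beta * dk, v + alpha * dv)
            for (k, v), (dk, dv) in zip(past_kv, vecs)]
```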
https://arxiv.org/abs/2507.08799
Risk-averse Constrained Reinforcement Learning (RaCRL) aims to learn policies that minimise the likelihood of rare and catastrophic constraint violations caused by an environment's inherent randomness. In general, risk-aversion leads to conservative exploration of the environment, which typically results in convergence to sub-optimal policies that fail to adequately maximise reward or, in some cases, fail to achieve the goal. In this paper, we propose an exploration-based approach for RaCRL called Optimistic Risk-averse Actor Critic (ORAC), which constructs an exploratory policy by maximising a local upper confidence bound of the state-action reward value function whilst minimising a local lower confidence bound of the risk-averse state-action cost value function. Specifically, at each step, the weighting assigned to the cost value is increased or decreased according to whether it exceeds or falls below the safety constraint value. This way, the policy is encouraged to explore uncertain regions of the environment to discover high-reward states whilst still satisfying the safety constraints. Our experimental results demonstrate that ORAC prevents convergence to sub-optimal policies and significantly improves the reward-cost trade-off in various continuous control tasks, such as the Safety-Gymnasium benchmarks and the complex building energy management environment CityLearn.
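A minimal sketch of the exploratory objective, assuming ensembles of reward and cost critics whose spread supplies the local confidence bounds; the candidate-perturbation search and the multiplier update rule are illustrative assumptions:

```python
import torch

def orac_explore(actor, reward_critics, cost_critics, state, lam,
                 cost_limit, kappa=1.0, lr_lam=0.01, n_cand=16):
    """Score locally perturbed candidate actions with an optimistic
    objective: UCB on reward minus lam times LCB on cost.
    state: (1, obs_dim); critics map (B, obs), (B, act) -> (B,)."""
    base = actor(state)                                       # (1, act_dim)
    cand = base + 0.1 * torch.randn(n_cand, base.shape[-1])   # local candidates
    s = state.expand(n_cand, -1)
    qr = torch.stack([c(s, cand) for c in reward_critics])    # (n_critics, n_cand)
    qc = torch.stack([c(s, cand) for c in cost_critics])
    ucb_r = qr.mean(0) + kappa * qr.std(0)                    # optimistic reward
    lcb_c = qc.mean(0) - kappa * qc.std(0)                    # optimistic (low) cost
    best = (ucb_r - lam * lcb_c).argmax()
    # Raise the cost weight when the cost bound exceeds the constraint,
    # lower it otherwise (kept non-negative).
    lam = max(0.0, lam + lr_lam * (lcb_c[best].item() - cost_limit))
    return cand[best], lam
```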
https://arxiv.org/abs/2507.08793
This paper proposes a neural rendering approach that represents a scene as "compressed light-field tokens (CLiFTs)", retaining rich appearance and geometric information of a scene. CLiFT enables compute-efficient rendering with compressed tokens, while allowing the number of tokens used to represent a scene or render a novel view to be varied with one trained network. Concretely, given a set of images, a multi-view encoder tokenizes the images together with their camera poses. Latent-space K-means then selects a reduced set of rays as cluster centroids using the tokens. A multi-view "condenser" compresses the information of all the tokens into the centroid tokens to construct CLiFTs. At test time, given a target view and a compute budget (i.e., the number of CLiFTs), the system collects the specified number of nearby tokens and synthesizes a novel view using a compute-adaptive renderer. Extensive experiments on the RealEstate10K and DL3DV datasets quantitatively and qualitatively validate our approach, achieving significant data reduction with comparable rendering quality and the highest overall rendering score, while providing trade-offs among data size, rendering quality, and rendering speed.
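The centroid-selection step can be sketched as plain K-means in token space, snapping each centroid back to its nearest actual ray so a real token survives condensation; shapes and iteration counts here are illustrative:

```python
import torch

def select_centroid_rays(tokens, k, iters=10):
    """tokens: (N, D), one latent token per ray. Returns indices of the
    k rays whose tokens sit closest to the K-means centroids."""
    N, D = tokens.shape
    centroids = tokens[torch.randperm(N)[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(tokens, centroids).argmin(dim=1)   # (N,)
        for j in range(k):
            members = tokens[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(dim=0)
    # Snap each centroid to its nearest actual ray token.
    return torch.cdist(centroids, tokens).argmin(dim=1)         # (k,) ray indices
```

The compute budget then simply dials k, trading data size against rendering quality.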
https://arxiv.org/abs/2507.08776
Recent advances in 3D generation have transitioned from multi-view 2D rendering approaches to 3D-native latent diffusion frameworks that exploit geometric priors in ground truth data. Despite progress, three key limitations persist: (1) single-latent representations fail to capture complex multi-part geometries, causing detail degradation; (2) holistic latent coding neglects part independence and interrelationships critical for compositional design; (3) global conditioning mechanisms lack fine-grained controllability. Inspired by human 3D design workflows, we propose CoPart, a part-aware diffusion framework that decomposes 3D objects into contextual part latents for coherent multi-part generation. This paradigm offers three advantages: i) it reduces encoding complexity through part decomposition; ii) it enables explicit part relationship modeling; iii) it supports part-level conditioning. We further develop a mutual guidance strategy to fine-tune pre-trained diffusion models for joint part latent denoising, preserving both geometric coherence and foundation model priors. To enable large-scale training, we construct Partverse, a novel 3D part dataset derived from Objaverse through automated mesh segmentation and human-verified annotations. Extensive experiments demonstrate CoPart's superior capabilities in part-level editing, articulated object generation, and scene composition with unprecedented controllability.
https://arxiv.org/abs/2507.08772
Due to their excellent performance in yielding high-quality, zero-shot segmentation, the Segment Anything Model (SAM) and its variants have been widely applied in diverse scenarios such as healthcare and intelligent manufacturing. Effectively compressing SAMs has therefore become an increasingly pressing practical need. In this study, we propose Birkhoff, a novel data-free compression algorithm for SAM and its variants. Unlike quantization, pruning, distillation, and other compression methods, Birkhoff embodies versatility across model types, agility in deployment, faithfulness to the original model, and compactness in model size. Specifically, Birkhoff introduces a novel compression algorithm, Hyper-Compression, whose core principle is to find a dense trajectory that turns a high-dimensional parameter vector into a low-dimensional scalar. Furthermore, Birkhoff designs a dedicated linear layer operator, HyperLinear, that fuses decompression and matrix multiplication to significantly accelerate inference of the compressed SAMs. Extensive experiments on 18 SAMs across the COCO, LVIS, and SA-1B datasets show that Birkhoff performs consistently and competitively in compression time, compression ratio, post-compression performance, and inference speed. For example, Birkhoff achieves a compression ratio of 5.17x on SAM2-B with less than a 1% performance drop, without using any fine-tuning data. Moreover, compression finishes within 60 seconds for all models.
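The dense-trajectory idea can be illustrated in a few lines: a trajectory t -> frac(t * alpha), with rationally independent components of alpha, is dense in the unit cube, so a normalized weight chunk can be stored as a single scalar t. The chunk size, alpha, and grid search below are illustrative assumptions, not Birkhoff's actual codec:

```python
import numpy as np

def hyper_compress_chunk(w, alphas, t_grid):
    """Replace a chunk w in [0,1)^d by the scalar t whose trajectory
    point frac(t * alphas) lies closest to it."""
    traj = np.mod(np.outer(t_grid, alphas), 1.0)      # (n_grid, d) trajectory points
    return t_grid[np.linalg.norm(traj - w[None, :], axis=1).argmin()]

def hyper_decompress(t, alphas):
    return np.mod(t * alphas, 1.0)                    # reconstruct the chunk

rng = np.random.default_rng(0)
alphas = np.sqrt(np.array([2.0, 3.0, 5.0, 7.0]))      # rationally independent
w = rng.random(4)                                     # a 4-dim "weight chunk"
t = hyper_compress_chunk(w, alphas, np.linspace(0.0, 100.0, 400_001))
print(t, np.abs(hyper_decompress(t, alphas) - w).max())
```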
https://arxiv.org/abs/2507.08765
Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.
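A minimal sketch of the two ingredients under standard actor-critic assumptions; the widths, reward scale, Q floor, and the way infeasible actions are sampled are illustrative choices, not the paper's exact settings:

```python
import torch
import torch.nn as nn

class LNQNetwork(nn.Module):
    """Critic with LayerNorm after every hidden layer: normalization keeps
    Q-values from extrapolating linearly far outside the data support."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def pars_loss(q, q_targ, s, a, r, s2, a2, gamma=0.99,
              reward_scale=10.0, q_floor=-100.0):
    """TD loss on scaled rewards (RS) plus a penalty (PA) that drives Q
    toward a low floor for actions outside the feasible box [-1, 1]."""
    with torch.no_grad():
        y = reward_scale * r + gamma * q_targ(s2, a2)     # scaled Bellman target
    td = (q(s, a) - y).pow(2).mean()
    # Infeasible actions: magnitudes in (1, 3] with random signs.
    a_inf = torch.empty_like(a).uniform_(1.0, 3.0) * torch.sign(torch.randn_like(a))
    penalty = (q(s, a_inf) - q_floor).pow(2).mean()
    return td + penalty
```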
https://arxiv.org/abs/2507.08761
Hierarchical land cover and land use (LCLU) classification aims to assign pixel-wise labels with multiple levels of semantic granularity to remote sensing (RS) imagery. However, existing deep learning-based methods face two major challenges: 1) They predominantly adopt a flat classification paradigm, which limits their ability to generate end-to-end multi-granularity hierarchical predictions aligned with tree-structured hierarchies used in practice. 2) Most cross-domain studies focus on performance degradation caused by sensor or scene variations, with limited attention to transferring LCLU models to cross-domain tasks with heterogeneous hierarchies (e.g., LCLU to crop classification). These limitations hinder the flexibility and generalization of LCLU models in practical applications. To address these challenges, we propose HieraRS, a novel hierarchical interpretation paradigm that enables multi-granularity predictions and supports the efficient transfer of LCLU models to cross-domain tasks with heterogeneous tree-structured hierarchies. We introduce the Bidirectional Hierarchical Consistency Constraint Mechanism (BHCCM), which can be seamlessly integrated into mainstream flat classification models to generate hierarchical predictions, while improving both semantic consistency and classification accuracy. Furthermore, we present TransLU, a dual-branch cross-domain transfer framework comprising two key components: Cross-Domain Knowledge Sharing (CDKS) and Cross-Domain Semantic Alignment (CDSA). TransLU supports dynamic category expansion and facilitates the effective adaptation of LCLU models to heterogeneous hierarchies. In addition, we construct MM-5B, a large-scale multi-modal hierarchical land use dataset featuring pixel-wise annotations. The code and MM-5B dataset will be released at: this https URL.
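To make the hierarchical-consistency idea concrete, here is a hedged sketch of one plausible form of such a constraint on segmentation logits at two granularity levels; BHCCM's actual formulation may differ, and the parent-mapping tensor and MSE form are assumptions:

```python
import torch
import torch.nn.functional as F

def hierarchical_consistency_loss(logits_fine, logits_coarse, child_to_parent):
    """The probability mass a model assigns to a coarse class should match
    the total probability of its fine-grained children.
    logits_fine: (B, n_fine, H, W); logits_coarse: (B, n_coarse, H, W);
    child_to_parent: LongTensor (n_fine,) mapping fine class -> parent class."""
    p_fine = F.softmax(logits_fine, dim=1)
    p_coarse = F.softmax(logits_coarse, dim=1)
    # Bottom-up: aggregate fine-class probabilities into their parents.
    agg = torch.zeros_like(p_coarse)
    agg.index_add_(1, child_to_parent, p_fine)
    # Penalize disagreement between the aggregated fine level and the
    # coarse prediction, pulling both levels toward consistency.
    return F.mse_loss(agg, p_coarse)
```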
https://arxiv.org/abs/2507.08741
Nonlinear vector autoregression (NVAR) and reservoir computing (RC) have shown promise in forecasting chaotic dynamical systems, such as the Lorenz-63 model and the El Nino-Southern Oscillation. However, their reliance on fixed nonlinearities - polynomial expansions in NVAR or random feature maps in RC - limits their adaptability to high noise or real-world data. These methods also scale poorly in high-dimensional settings due to costly matrix inversion during readout computation. We propose an adaptive NVAR model that combines delay-embedded linear inputs with features generated by a shallow, learnable multi-layer perceptron (MLP). The MLP and linear readout are jointly trained using gradient-based optimization, enabling the model to learn data-driven nonlinearities while preserving a simple readout structure. Unlike standard NVAR, our approach avoids the need for an exhaustive and sensitive grid search over ridge and delay parameters; tuning is instead restricted to neural network hyperparameters, improving scalability. Initial experiments on chaotic systems, tested under noise-free and synthetically noisy conditions, showed that the adaptive model outperformed the standard NVAR in predictive accuracy and forecast robustly under noisy conditions with a lower observation frequency.
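A hedged sketch of the model, with dimensions and activations as illustrative choices; the point is that one gradient loop replaces NVAR's fixed polynomial features and ridge solve:

```python
import torch
import torch.nn as nn

class AdaptiveNVAR(nn.Module):
    """Delay-embedded linear inputs concatenated with features from a
    small trainable MLP, read out by one linear layer."""
    def __init__(self, dim=3, delays=2, hidden=64, feat=32):
        super().__init__()
        in_dim = dim * delays                       # [x_t, x_{t-1}, ...]
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, feat), nn.Tanh(),
        )
        self.readout = nn.Linear(in_dim + feat, dim)

    def forward(self, x_delay):                     # (B, dim * delays)
        z = torch.cat([x_delay, self.mlp(x_delay)], dim=-1)
        return self.readout(z)                      # predicted next state

# Joint gradient-based training of MLP and readout (stand-in data):
model = AdaptiveNVAR()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x_delay, x_next = torch.randn(128, 6), torch.randn(128, 3)
loss = nn.functional.mse_loss(model(x_delay), x_next)
opt.zero_grad(); loss.backward(); opt.step()
```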
https://arxiv.org/abs/2507.08738
Catastrophic forgetting in deep neural networks occurs when learning new tasks degrades performance on previously learned tasks due to knowledge overwriting. Among the approaches to mitigate this issue, regularization techniques aim to identify and constrain "important" parameters to preserve previous knowledge. In the highly nonconvex optimization landscape of deep learning, we propose a novel perspective: tracking parameters during the final training plateau is more effective than monitoring them throughout the entire training process. We argue that parameters that exhibit higher activity (movement and variability) during this plateau reveal directions in the loss landscape that are relatively flat, making them suitable for adaptation to new tasks while preserving knowledge from previous ones. Our comprehensive experiments demonstrate that this approach achieves superior performance in balancing catastrophic forgetting mitigation with strong performance on newly learned tasks.
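As a hedged sketch of the tracking idea (the exact statistic and how it enters the regularizer are our illustrative assumptions): record parameter snapshots during the final plateau, treat per-parameter variability as "activity", and constrain the quiet parameters with an EWC-style quadratic penalty when training the next task:

```python
import torch

@torch.no_grad()
def plateau_importance(snapshots):
    """snapshots: list of {name: tensor} taken at intervals during the
    final training plateau. High activity (std across snapshots) marks
    flat directions that are safe to repurpose; low activity marks
    parameters to protect. Inverse-activity is an illustrative mapping."""
    importance = {}
    for name in snapshots[0]:
        activity = torch.stack([s[name] for s in snapshots]).std(dim=0)
        inv = 1.0 / (activity + 1e-8)
        importance[name] = inv / inv.mean()          # normalize per tensor
    return importance

def quadratic_penalty(model, importance, anchor, lam=1.0):
    """EWC-style anchor keeping protected parameters near their old values."""
    loss = 0.0
    for name, p in model.named_parameters():
        loss = loss + (importance[name] * (p - anchor[name]).pow(2)).sum()
    return lam * loss
```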
https://arxiv.org/abs/2507.08736
When solving computer vision problems through machine learning, one often encounters a lack of sufficient training data. To mitigate this, we propose the use of ensembles of weak learners based on spectral total-variation (STV) features (Gilboa 2014). The features are related to nonlinear eigenfunctions of the total-variation subgradient and can characterize textures well at various scales. It was shown (Burger et al. 2016) that, in the one-dimensional case, orthogonal features are generated, whereas in two dimensions the features are empirically lowly correlated. Ensemble learning theory advocates the use of lowly correlated weak learners. We thus propose to design ensembles using learners based on STV features. To show the effectiveness of this paradigm, we examine a hard real-world medical imaging problem: the predictive value of computed tomography (CT) data for high uptake in positron emission tomography (PET) in patients suspected of skeletal metastases. The database consists of 457 scans with 1524 unique pairs of registered CT and PET slices. Our approach is compared to deep-learning methods and to Radiomics features, showing that STV learners perform best (AUC=0.87), compared to neural nets (AUC=0.75) and Radiomics (AUC=0.79). We observe that fine STV scales in CT images are especially indicative of high uptake in PET.
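A sketch of the ensemble side of this design, assuming the STV band features (one feature block per scale, lowly correlated across scales) have already been computed; the weak-learner choice and averaging rule are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_stv_ensemble(band_feats, y, depth=3):
    """band_feats: list of (n_samples, n_features) arrays, one per STV
    scale. Train one shallow weak learner per band."""
    return [DecisionTreeClassifier(max_depth=depth).fit(Xb, y) for Xb in band_feats]

def predict_stv_ensemble(learners, band_feats):
    """Average the per-band probabilities (binary task assumed)."""
    probs = [m.predict_proba(Xb)[:, 1] for m, Xb in zip(learners, band_feats)]
    return np.mean(probs, axis=0)
```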
https://arxiv.org/abs/2507.08735
Modern configurable software systems need to learn models that correlate configuration and performance. However, when such a system operates in dynamic environments, workload variations, hardware changes, and system updates will inevitably introduce concept drifts at different levels - global drifts, which reshape the performance landscape of the entire configuration space; and local drifts, which only affect certain sub-regions of that space. As such, existing offline and transfer learning approaches can struggle to adapt to these implicit and unpredictable changes in real time, rendering configuration performance learning challenging. To address this, we propose DHDA, an online configuration performance learning framework designed to capture and adapt to these drifts at different levels. The key idea is that DHDA adapts to both local and global drifts using dually hierarchical adaptation: at the upper level, we re-divide the data into different divisions, within each of which the local model is retrained, handling global drifts only when necessary; at the lower level, the local models of the divisions detect local drifts and adapt themselves asynchronously. To balance responsiveness and efficiency, DHDA combines incremental updates with periodic full retraining to minimize redundant computation when no drifts are detected. Evaluating DHDA on eight software systems against state-of-the-art approaches, we show that it achieves considerably better accuracy and can effectively adapt to drifts with up to 2x improvement, while incurring reasonable overhead and helping different local models handle concept drift.
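A self-contained sketch of the two-level loop under stand-in choices (K-means divisions, SGD regressors, a fixed error threshold as the drift test); DHDA's actual division and detection mechanisms differ:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import SGDRegressor

class DHDASketch:
    """Divisions each own an incrementally updated local model; a single
    division adapts on local drift, and the whole pool is re-divided and
    fully retrained only when no division fits the new point (a crude
    stand-in for the upper-level global-drift test)."""
    def __init__(self, n_divisions=4, err_thresh=1.0):
        self.k, self.err_thresh = n_divisions, err_thresh
        self.buf_X, self.buf_y = [], []

    def fit(self, X, y):
        self.buf_X, self.buf_y = list(X), list(y)
        self.clusters = KMeans(n_clusters=self.k, n_init=10).fit(X)
        self.locals_ = []
        for j in range(self.k):
            m = self.clusters.labels_ == j
            self.locals_.append(SGDRegressor().fit(X[m], y[m]))

    def observe(self, x, y_true):
        self.buf_X.append(x); self.buf_y.append(y_true)
        j = int(self.clusters.predict(x.reshape(1, -1))[0])
        err = abs(self.locals_[j].predict(x.reshape(1, -1))[0] - y_true)
        if err > self.err_thresh:                      # local drift: adapt one division
            self.locals_[j].partial_fit(x.reshape(1, -1), [y_true])
        errs = [abs(m.predict(x.reshape(1, -1))[0] - y_true) for m in self.locals_]
        if min(errs) > self.err_thresh:                # global drift: re-divide everything
            self.fit(np.array(self.buf_X), np.array(self.buf_y))
```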
https://arxiv.org/abs/2507.08730
Learning robot manipulation policies from raw, real-world image data requires a large number of robot-action trials in the physical environment. Although training using simulations offers a cost-effective alternative, the visual domain gap between simulation and the robot workspace remains a major limitation. Gaussian Splatting visual reconstruction methods have recently provided new directions for robot manipulation by generating realistic environments. In this paper, we propose the first method for learning supervised-based robot handovers solely from RGB images, without the need for real-robot training or real-robot data collection. The proposed policy learner, Human-to-Robot Handover using Sparse-View Gaussian Splatting (H2RH-SGS), leverages sparse-view Gaussian Splatting reconstruction of human-to-robot handover scenes to generate robot demonstrations containing image-action pairs captured with a camera mounted on the robot gripper. As a result, simulated camera pose changes in the reconstructed scene can be directly translated into gripper pose changes. We train a robot policy on demonstrations collected with 16 household objects and directly deploy this policy in the real environment. Experiments in both the Gaussian Splatting reconstructed scene and real-world human-to-robot handovers demonstrate that H2RH-SGS serves as a new and effective representation for the human-to-robot handover task.
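The camera-to-gripper pose translation reduces to one fixed hand-eye transform, sketched below with 4x4 homogeneous matrices and an assumed-known calibration:

```python
import numpy as np

def gripper_pose_from_camera(T_world_cam, T_cam_gripper):
    """With the camera rigidly mounted on the gripper, a camera pose in
    the reconstructed scene maps to a gripper pose via one fixed transform.
    T_world_cam: 4x4 camera pose in the scene's world frame.
    T_cam_gripper: 4x4 gripper pose in the camera frame (from calibration)."""
    return T_world_cam @ T_cam_gripper      # 4x4 gripper pose in world frame
```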
https://arxiv.org/abs/2507.08726
Encountering shifted data at test time is a ubiquitous challenge when deploying predictive models. Test-time adaptation (TTA) methods address this issue by continuously adapting a deployed model using only unlabeled test data. While TTA can extend the model's lifespan, it is only a temporary solution. Eventually the model might degrade to the point that it must be taken offline and retrained. To detect such points of ultimate failure, we propose pairing TTA with risk monitoring frameworks that track predictive performance and raise alerts when predefined performance criteria are violated. Specifically, we extend existing monitoring tools based on sequential testing with confidence sequences to accommodate scenarios in which the model is updated at test time and no test labels are available to estimate the performance metrics of interest. Our extensions unlock the application of rigorous statistical risk monitoring to TTA, and we demonstrate the effectiveness of our proposed TTA monitoring framework across a representative set of datasets, distribution shift types, and TTA methods.
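A sketch of the monitoring loop, assuming a risk proxy bounded in [0, 1] is estimated from unlabeled test batches; the confidence-sequence width below is an illustrative union-bound form, not the paper's exact construction:

```python
import numpy as np

def cs_bounds(vals, alpha=0.05):
    """Hoeffding-style confidence sequence for the running mean of a
    [0, 1]-bounded risk proxy; valid at every t, so it can be checked
    after each batch while the model keeps adapting."""
    t = np.arange(1, len(vals) + 1)
    mean = np.cumsum(vals) / t
    width = 1.7 * np.sqrt(
        np.log(2.0 * np.log2(np.maximum(t, 2)) ** 2 / alpha) / (2 * t))
    return mean - width, mean + width

def monitor(risk_stream, budget=0.3):
    """Alert at the first time the lower bound exceeds the risk budget,
    i.e., the deployed TTA model has demonstrably degraded."""
    lo, _ = cs_bounds(np.asarray(risk_stream, dtype=float))
    alarms = np.nonzero(lo > budget)[0]
    return None if len(alarms) == 0 else int(alarms[0])
```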
https://arxiv.org/abs/2507.08721
Scaling laws have achieved success in LLMs and foundation models. To explore their potential in ISAC research, we propose Great-X. This single-engine multimodal data twin platform reconstructs the ray-tracing computation of Sionna within Unreal Engine and is deeply integrated with autonomous driving tools. This enables efficient and synchronized simulation of multimodal data, including CSI, RGB, Radar, and LiDAR. Based on this platform, we construct an open-source, large-scale, low-altitude UAV multimodal synaesthesia dataset named Great-MSD, and propose a baseline CSI-based UAV 3D localization algorithm, demonstrating its feasibility and generalizability across different CSI simulation engines. The related code and dataset are publicly available at: this https URL.
https://arxiv.org/abs/2507.08716
For developing innovative systems architectures, modeling and optimization techniques have been central to framing the architecting process and defining the optimization and modeling problems. In this context, for systems-of-systems, the use of efficient dedicated approaches (often physics-based simulations) is highly recommended to reduce the computational complexity of the targeted applications. However, exploring novel architectures using such dedicated approaches can pose challenges for optimization algorithms, including increased evaluation costs and potential failures. To address these challenges, surrogate-based optimization algorithms, such as Bayesian optimization utilizing Gaussian process models, have emerged.
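A minimal sketch of this surrogate loop, in which a GP stands in for the expensive physics-based simulation and expected improvement picks the next architecture to evaluate; the kernel, candidate sampling, and budgets are illustrative:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(gp, X_cand, y_best):
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu) / sigma                      # minimization convention
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt(f, bounds, n_init=5, n_iter=20, seed=0):
    """f: expensive objective (e.g., a physics-based simulation);
    bounds: (d, 2) array of design-variable ranges."""
    rng = np.random.default_rng(seed)
    d = len(bounds)
    X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_init, d))
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                      normalize_y=True).fit(X, y)
        cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(2048, d))
        x_next = cand[expected_improvement(gp, cand, y.min()).argmax()]
        X, y = np.vstack([X, x_next]), np.append(y, f(x_next))
    return X[y.argmin()], y.min()
```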
https://arxiv.org/abs/2507.08715
We propose a novel embedding-based captioning metric termed L-CLIPScore that can be used for efficiently evaluating caption quality and for training captioning models. L-CLIPScore is calculated from a lightweight CLIP (L-CLIP), a dual-encoder architecture compressed and distilled from CLIP. To compress, we apply two powerful techniques, weight multiplexing and matrix decomposition, to reduce the parameters of the encoders and the word embedding matrix, respectively. To distill, we design a novel multi-modal Similarity Regulator (SR) loss to transfer more vision-language alignment knowledge. Specifically, the SR loss amplifies the multi-modal embedding similarity if a given image-text pair is matched and diminishes the similarity if the pair is non-matched. Compressed and distilled with this SR loss, our L-CLIP achieves multi-modal alignment ability comparable to the original CLIP while requiring fewer computational resources and less running time. We carry out exhaustive experiments to validate the efficiency and effectiveness of L-CLIPScore when it is used as a judge to evaluate caption quality. We also find that when L-CLIPScore is used as a supervisor to train the captioning model, it should be mixed with an n-gram-based metric, and we analyze why using L-CLIPScore alone causes training to fail.
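One plausible hinge-style form of the SR idea is sketched below; the paper's actual loss may differ, and the margin is an assumption:

```python
import torch
import torch.nn.functional as F

def similarity_regulator_loss(img_emb, txt_emb, margin=0.2):
    """Push up cosine similarity for matched image-text pairs (the batch
    diagonal) and push down similarity for non-matched pairs, so a student
    inherits the teacher's vision-language alignment.
    img_emb, txt_emb: (B, D) embeddings of B aligned pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t()                              # (B, B) cosine similarities
    pos = sim.diag()                                 # matched pairs
    B = sim.shape[0]
    neg = sim[~torch.eye(B, dtype=torch.bool)]       # all non-matched pairs
    # Amplify matched similarity, diminish non-matched similarity.
    return (1.0 - pos).mean() + F.relu(neg - margin).mean()
```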
https://arxiv.org/abs/2507.08710
Inverse Reinforcement Learning (IRL) presents a powerful paradigm for learning complex robotic tasks from human demonstrations. However, most approaches make the assumption that expert demonstrations are available, which is often not the case. Those that allow for suboptimality in the demonstrations are not designed for long-horizon goals or adversarial tasks. Many desirable robot capabilities fall into one or both of these categories, thus highlighting a critical shortcoming in the ability of IRL to produce field-ready robotic agents. We introduce Sample-efficient Preference-based inverse reinforcement learning for Long-horizon Adversarial tasks from Suboptimal Hierarchical demonstrations (SPLASH), which advances the state-of-the-art in learning from suboptimal demonstrations to long-horizon and adversarial settings. We empirically validate SPLASH on a maritime capture-the-flag task in simulation, and demonstrate real-world applicability with sim-to-real translation experiments on autonomous unmanned surface vehicles. We show that our proposed methods allow SPLASH to significantly outperform the state-of-the-art in reward learning from suboptimal demonstrations.
https://arxiv.org/abs/2507.08707
Knowledge graphs (KGs) play a critical role in enhancing large language models (LLMs) by introducing structured and grounded knowledge into the learning process. However, most existing KG-enhanced approaches rely on parameter-intensive fine-tuning, which risks catastrophic forgetting and degrades the pretrained model's generalization. Moreover, they exhibit limited adaptability to real-time knowledge updates due to their static integration frameworks. To address these issues, we introduce the first test-time KG-augmented framework for LLMs, built around a dedicated knowledge graph-guided attention (KGA) module that enables dynamic knowledge fusion without any parameter updates. The proposed KGA module augments the standard self-attention mechanism with two synergistic pathways: outward and inward aggregation. Specifically, the outward pathway dynamically integrates external knowledge into input representations via input-driven KG fusion. The inward pathway complements it by refining input representations through KG-guided filtering, suppressing task-irrelevant signals and amplifying knowledge-relevant patterns. Importantly, while the outward pathway handles knowledge fusion, the inward pathway selects the most relevant triples and feeds them back into the fusion process, forming a closed-loop enhancement mechanism. By synergistically combining these two pathways, the proposed method supports real-time knowledge fusion exclusively at test time, without any parameter modification. Extensive experiments on five benchmarks verify KGA's competitive knowledge fusion performance.
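A heavily hedged sketch of how the two pathways could compose at test time (the pooling, top-k filtering, and shapes are our assumptions; the module's real design is more elaborate):

```python
import torch
import torch.nn.functional as F

def kga_pathways(x, triples, topk=8):
    """x: (B, T, D) hidden states; triples: (N, D) embeddings of candidate
    KG triples. No weights are updated -- both pathways run at inference.
    Inward: keep only the triples most relevant to the current input.
    Outward: fuse the survivors into token representations by cross-attention."""
    query = x.mean(dim=1)                              # (B, D) pooled input query
    rel = query @ triples.t()                          # (B, N) relevance scores
    idx = rel.topk(topk, dim=-1).indices               # inward: KG-guided filtering
    kept = triples[idx]                                # (B, topk, D) surviving triples
    attn = F.softmax(x @ kept.transpose(1, 2) / x.shape[-1] ** 0.5, dim=-1)
    return x + attn @ kept                             # outward: knowledge fusion
```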
https://arxiv.org/abs/2507.08704
Magnetic resonance imaging (MRI) enables non-invasive, high-resolution analysis of muscle structures. However, automated segmentation remains limited by high computational costs, reliance on large training datasets, and reduced accuracy in segmenting smaller muscles. Convolutional neural network (CNN)-based methods, while powerful, often suffer from substantial computational overhead, limited generalizability, and poor interpretability across diverse populations. This study proposes a training-free segmentation approach based on keypoint tracking, which integrates keypoint selection with Lucas-Kanade optical flow. The proposed method achieves a mean Dice similarity coefficient (DSC) ranging from 0.6 to 0.7, depending on the keypoint selection strategy, performing comparably to state-of-the-art CNN-based models while substantially reducing computational demands and enhancing interpretability. This scalable framework presents a robust and explainable alternative for muscle segmentation in clinical and research applications.
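A sketch of the pipeline with OpenCV's standard APIs, assuming 8-bit grayscale slices and a seed muscle mask on the first slice; the convex-hull mask reconstruction is a simple stand-in for the paper's method:

```python
import cv2
import numpy as np

def propagate_mask(slices, mask0, max_pts=200):
    """slices: list of (H, W) uint8 grayscale MRI slices; mask0: (H, W)
    binary seed mask. Keypoints are picked inside the mask, tracked
    slice-to-slice with pyramidal Lucas-Kanade flow, and a mask is rebuilt
    from the tracked points -- no training involved."""
    pts = cv2.goodFeaturesToTrack(slices[0], maxCorners=max_pts,
                                  qualityLevel=0.01, minDistance=3,
                                  mask=mask0.astype(np.uint8))
    masks = [mask0]
    for prev, curr in zip(slices[:-1], slices[1:]):
        pts, status, _ = cv2.calcOpticalFlowPyrLK(prev, curr, pts, None,
                                                  winSize=(21, 21), maxLevel=3)
        pts = pts[status.ravel() == 1].reshape(-1, 1, 2)   # drop lost points
        m = np.zeros_like(mask0, dtype=np.uint8)
        hull = cv2.convexHull(pts.astype(np.float32))
        cv2.fillConvexPoly(m, hull.astype(np.int32), 1)
        masks.append(m)
    return masks
```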
https://arxiv.org/abs/2507.08690