Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive video generation. Existing autoregressive video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an autoregressive video generator that retains the LLM architecture with minimal architectural modifications. To inject spatiotemporal correlations in LLMs, we identify the efficacy of incorporating 3D RoPE and diagnose its imbalanced frequency spectrum ranges. Therefore, we propose MM-RoPE, a RoPE scheme that preserves the original textual RoPE while providing comprehensive frequency spectra and scaled 3D positions for modeling multimodal spatiotemporal data. Moreover, Lumos-1 resorts to a token dependency strategy that obeys intra-frame bidirectionality and inter-frame temporal causality. Based on this dependency strategy, we identify the issue of frame-wise loss imbalance caused by spatial information redundancy and solve it by proposing Autoregressive Discrete Diffusion Forcing (AR-DF). AR-DF introduces temporal tube masking during training with a compatible inference-time masking policy to avoid quality degradation. By using memory-efficient training techniques, we pre-train Lumos-1 on only 48 GPUs, achieving performance comparable to EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V. Code and models are available at this https URL.
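To make the 3D RoPE idea concrete, here is a minimal NumPy sketch (not Lumos-1's actual MM-RoPE): the channel dimension of a query/key is partitioned across the temporal, height, and width axes, and each partition receives standard 1D rotary embedding with its own position index. The even three-way split, the base frequency, and all sizes are illustrative assumptions.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding applied to the last dim of x (must be even)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)            # (d/2,)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, t, h, w, split=(1/3, 1/3, 1/3)):
    """Apply RoPE per axis: the channel dim is partitioned across (t, h, w)."""
    d = x.shape[-1]
    d_t = int(d * split[0]) // 2 * 2                     # keep each chunk even
    d_h = int(d * split[1]) // 2 * 2
    parts = np.split(x, [d_t, d_t + d_h], axis=-1)
    return np.concatenate(
        [rope_1d(parts[0], t), rope_1d(parts[1], h), rope_1d(parts[2], w)], axis=-1)

# One query vector for the token at frame 2, row 5, column 7 of a video.
q = np.random.randn(64)
print(rope_3d(q, t=2, h=5, w=7).shape)                   # (64,)
```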
https://arxiv.org/abs/2507.08801
We introduce NeuralOS, a neural framework that simulates graphical user interfaces (GUIs) of operating systems by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. NeuralOS combines a recurrent neural network (RNN), which tracks computer state, with a diffusion-based neural renderer that generates screen images. The model is trained on a large-scale dataset of Ubuntu XFCE recordings, which include both randomly generated interactions and realistic interactions produced by AI agents. Experiments show that NeuralOS successfully renders realistic GUI sequences, accurately captures mouse interactions, and reliably predicts state transitions like application launches. Although modeling fine-grained keyboard interactions precisely remains challenging, NeuralOS offers a step toward creating fully adaptive, generative neural interfaces for future human-computer interaction systems.
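A hedged PyTorch skeleton of the two-part design the abstract describes: an RNN that carries hidden computer state over input events, and a renderer conditioned on that state. The real NeuralOS uses a diffusion-based renderer; the convolutional decoder below is a plain stand-in, and all event encodings and sizes are invented for illustration.

```python
import torch
import torch.nn as nn

class StateTracker(nn.Module):
    """RNN over user-input events (e.g. mouse x/y, click, key id), keeping OS state."""
    def __init__(self, event_dim=4, state_dim=256):
        super().__init__()
        self.rnn = nn.GRU(event_dim, state_dim, batch_first=True)

    def forward(self, events, state=None):               # events: (B, T, event_dim)
        return self.rnn(events, state)

class FrameRenderer(nn.Module):
    """Maps the tracked state to a screen frame (stand-in for the diffusion renderer)."""
    def __init__(self, state_dim=256, channels=3, size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128 * (size // 8) ** 2), nn.ReLU(),
            nn.Unflatten(1, (128, size // 8, size // 8)),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, channels, 4, 2, 1), nn.Sigmoid())

    def forward(self, state):                             # state: (B, state_dim)
        return self.net(state)

events = torch.randn(2, 10, 4)                            # 2 sessions, 10 input events
states, _ = StateTracker()(events)
frame = FrameRenderer()(states[:, -1])                    # render the latest frame
print(frame.shape)                                        # torch.Size([2, 3, 64, 64])
```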
https://arxiv.org/abs/2507.08800
We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-thought reasoning in small language models. Our approach leverages GPT-4o-generated reasoning traces to construct steering vectors that shift model behavior toward more explicit, multi-step reasoning without fine-tuning or prompt modifications. Experimental evaluations on diverse reasoning benchmarks demonstrate that cache steering improves both the qualitative structure of model reasoning and quantitative task performance. Compared to prior activation steering techniques that require continuous interventions, our one-shot cache steering offers substantial advantages in terms of hyperparameter stability, inference-time efficiency, and ease of integration, making it a more robust and practical solution for controlled generation.
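The one-shot intervention can be pictured with the framework-agnostic PyTorch sketch below: a steering vector, built here as the mean difference between cached keys/values from reasoning-style and plain prompts, is added once to the prompt's KV cache before decoding resumes. The cache layout (batch, heads, seq, head_dim), the contrastive construction, and the coefficients alpha_k/alpha_v are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def build_steering_vectors(pos_cache, neg_cache):
    """Per-layer mean difference of cached K/V between two contrastive prompt sets."""
    vecs = []
    for (k_pos, v_pos), (k_neg, v_neg) in zip(pos_cache, neg_cache):
        # caches: (batch, heads, seq, head_dim); average over batch and sequence
        vecs.append((k_pos.mean((0, 2)) - k_neg.mean((0, 2)),
                     v_pos.mean((0, 2)) - v_neg.mean((0, 2))))
    return vecs

def apply_cache_steering(past_key_values, steering, alpha_k=0.3, alpha_v=3.0):
    """One-shot edit: shift the cached keys/values of every prompt token."""
    steered = []
    for (k, v), (dk, dv) in zip(past_key_values, steering):
        steered.append((k + alpha_k * dk[None, :, None, :],
                        v + alpha_v * dv[None, :, None, :]))
    return steered

# Toy cache: 2 layers, batch 1, 4 heads, 5 prompt tokens, head_dim 8.
layer = lambda: (torch.randn(1, 4, 5, 8), torch.randn(1, 4, 5, 8))
pos, neg, prompt = [layer(), layer()], [layer(), layer()], [layer(), layer()]
steered_cache = apply_cache_steering(prompt, build_steering_vectors(pos, neg))
print(steered_cache[0][0].shape)                          # torch.Size([1, 4, 5, 8])
```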
https://arxiv.org/abs/2507.08799
Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. In this paradigm, an LLM is typically prompted to compare a candidate answer against a ground-truth reference and assign a binary reward indicating correctness. Despite the seeming simplicity of this comparison task, we find that generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards. We demonstrate that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat for core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, we introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness. Our findings highlight the urgent need for more reliable LLM-based evaluation methods. We release our robust, general-domain reward model and its synthetic training data at this https URL and this https URL.
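A small sketch of the kind of data augmentation the abstract points to (my reading, not the released recipe): take incorrect or empty answers, prepend the superficial cues that trigger false positives (reasoning openers, lone punctuation), and label them as negatives so the reward model learns that such cues alone do not signal correctness. The field names and sampling scheme are placeholders.

```python
import random

REASONING_OPENERS = [
    "Thought process:",
    "Let's solve this problem step by step.",
    ":", ".",
]

def augment_with_superficial_cues(examples, n_per_example=2, seed=0):
    """examples: dicts with 'question', 'reference', 'answer', 'label' (1=correct).
    Returns extra negatives whose answers are cue-only or cue-prefixed wrong answers."""
    rng = random.Random(seed)
    augmented = []
    for ex in examples:
        for _ in range(n_per_example):
            cue = rng.choice(REASONING_OPENERS)
            wrong = ex["answer"] if ex["label"] == 0 else ""
            augmented.append({
                "question": ex["question"],
                "reference": ex["reference"],
                "answer": f"{cue} {wrong}".strip(),        # cue alone, or cue + wrong answer
                "label": 0,                                 # still incorrect
            })
    return augmented

data = [{"question": "2+2?", "reference": "4", "answer": "5", "label": 0},
        {"question": "Capital of France?", "reference": "Paris", "answer": "Paris", "label": 1}]
print(augment_with_superficial_cues(data)[0])
```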
https://arxiv.org/abs/2507.08794
Risk-averse Constrained Reinforcement Learning (RaCRL) aims to learn policies that minimise the likelihood of rare and catastrophic constraint violations caused by an environment's inherent randomness. In general, risk-aversion leads to conservative exploration of the environment which typically results in converging to sub-optimal policies that fail to adequately maximise reward or, in some cases, fail to achieve the goal. In this paper, we propose an exploration-based approach for RaCRL called Optimistic Risk-averse Actor Critic (ORAC), which constructs an exploratory policy by maximising a local upper confidence bound of the state-action reward value function whilst minimising a local lower confidence bound of the risk-averse state-action cost value function. Specifically, at each step, the weighting assigned to the cost value is increased or decreased if it exceeds or falls below the safety constraint value. This way the policy is encouraged to explore uncertain regions of the environment to discover high reward states whilst still satisfying the safety constraints. Our experimental results demonstrate that the ORAC approach prevents convergence to sub-optimal policies and significantly improves the reward-cost trade-off in various continuous control tasks such as Safety-Gymnasium and a complex building energy management environment CityLearn.
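A schematic of the action-selection rule described in the abstract: pick the action that maximizes an upper confidence bound on reward value minus a weighted lower confidence bound on cost value, and move the cost weight up or down depending on whether the chosen action's cost exceeds the constraint. Everything concrete here (a two-critic ensemble as the uncertainty proxy, the update step size, the toy actor) is an assumption for illustration.

```python
import torch

def orac_action(actor, reward_critics, cost_critics, state, weight, cost_limit,
                beta=1.0, lr_w=0.01):
    """One ORAC-style action choice plus the cost-weight update."""
    candidates = actor(state)                              # (N, action_dim) candidate actions
    s = state.expand(candidates.shape[0], -1)

    r = torch.stack([q(s, candidates) for q in reward_critics])   # (E, N)
    c = torch.stack([q(s, candidates) for q in cost_critics])     # (E, N)

    r_ucb = r.mean(0) + beta * r.std(0)                    # optimistic about reward
    c_lcb = c.mean(0) - beta * c.std(0)                    # optimistic (low) about cost

    scores = r_ucb - weight * c_lcb
    a = candidates[scores.argmax()]

    # Raise the cost weight when the chosen action looks unsafe, lower it otherwise.
    chosen_cost = c_lcb[scores.argmax()]
    weight = max(0.0, weight + lr_w * (chosen_cost.item() - cost_limit))
    return a, weight

# Toy actor/critics for a 3-d state and 2-d action space.
actor = lambda s: torch.randn(8, 2)                        # 8 candidate actions
make_q = lambda: (lambda s, a: s.sum(-1) + a.sum(-1) + 0.1 * torch.randn(a.shape[0]))
action, w = orac_action(actor, [make_q(), make_q()], [make_q(), make_q()],
                        torch.randn(1, 3), weight=1.0, cost_limit=0.5)
print(action, w)
```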
https://arxiv.org/abs/2507.08793
This paper proposes a neural rendering approach that represents a scene as "compressed light-field tokens (CLiFTs)", retaining rich appearance and geometric information of the scene. CLiFT enables compute-efficient rendering with compressed tokens, while being capable of changing the number of tokens to represent a scene or render a novel view with one trained network. Concretely, given a set of images, a multi-view encoder tokenizes the images with the camera poses. Latent-space K-means selects a reduced set of rays as cluster centroids using the tokens. The multi-view "condenser" compresses the information of all the tokens into the centroid tokens to construct CLiFTs. At test time, given a target view and a compute budget (i.e., the number of CLiFTs), the system collects the specified number of nearby tokens and synthesizes a novel view using a compute-adaptive renderer. Extensive experiments on the RealEstate10K and DL3DV datasets quantitatively and qualitatively validate our approach, achieving significant data reduction with comparable rendering quality and the highest overall rendering score, while providing trade-offs of data size, rendering quality, and rendering speed.
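A small sketch of the centroid-selection step described above: run k-means on per-ray tokens in latent space and keep, for each cluster, the real token nearest its centroid, so the compute budget becomes the number of clusters. The condenser and renderer are omitted, and all sizes are invented.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_centroid_tokens(tokens, budget, seed=0):
    """tokens: (N, D) latent ray tokens. Returns indices of `budget` representative rays."""
    km = KMeans(n_clusters=budget, n_init=4, random_state=seed).fit(tokens)
    picked = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(tokens[members] - km.cluster_centers_[c], axis=1)
        picked.append(members[dists.argmin()])             # nearest real token to the centroid
    return np.array(picked)

tokens = np.random.randn(4096, 64).astype(np.float32)      # tokens from a set of input views
idx = select_centroid_tokens(tokens, budget=256)
clift_seed = tokens[idx]                                    # tokens to be enriched by the condenser
print(clift_seed.shape)                                     # (256, 64)
```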
https://arxiv.org/abs/2507.08776
Recent advances in 3D generation have transitioned from multi-view 2D rendering approaches to 3D-native latent diffusion frameworks that exploit geometric priors in ground truth data. Despite progress, three key limitations persist: (1) Single-latent representations fail to capture complex multi-part geometries, causing detail degradation; (2) Holistic latent coding neglects part independence and interrelationships critical for compositional design; (3) Global conditioning mechanisms lack fine-grained controllability. Inspired by human 3D design workflows, we propose CoPart - a part-aware diffusion framework that decomposes 3D objects into contextual part latents for coherent multi-part generation. This paradigm offers three advantages: i) Reduces encoding complexity through part decomposition; ii) Enables explicit part relationship modeling; iii) Supports part-level conditioning. We further develop a mutual guidance strategy to fine-tune pre-trained diffusion models for joint part latent denoising, ensuring both geometric coherence and foundation model priors. To enable large-scale training, we construct Partverse - a novel 3D part dataset derived from Objaverse through automated mesh segmentation and human-verified annotations. Extensive experiments demonstrate CoPart's superior capabilities in part-level editing, articulated object generation, and scene composition with unprecedented controllability.
https://arxiv.org/abs/2507.08772
To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, indicating that the union of multiple consecutive tokens activates a large ratio of parameters. Such a sparsity pattern is unfriendly for acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to 3.67× speedup over dense models on real end-side devices. All codes and checkpoints are available publicly (this https URL).
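A toy PyTorch sketch of two ideas named in the abstract: a router made differentiable with ReLU + RMSNorm (expert weights are produced directly, with zeros marking inactive experts), and the token-level / chunk-level sparsity statistics used to measure acceleration-friendliness. Layer sizes and the expert count are made up, the sketch computes all experts densely and merely masks them (which defeats the acceleration purpose but shows the logic), and it assumes a recent PyTorch with torch.nn.RMSNorm.

```python
import torch
import torch.nn as nn

class BlockFFN(nn.Module):
    def __init__(self, d_model=64, n_experts=16, d_expert=32):
        super().__init__()
        self.router = nn.Sequential(nn.Linear(d_model, n_experts),
                                    nn.ReLU(),              # zeros = inactive experts
                                    nn.RMSNorm(n_experts))
        self.up = nn.Linear(d_model, n_experts * d_expert)
        self.down = nn.Linear(n_experts * d_expert, d_model)
        self.n_experts, self.d_expert = n_experts, d_expert

    def forward(self, x):                                    # x: (B, T, d_model)
        w = self.router(x)                                   # (B, T, n_experts)
        h = self.up(x).unflatten(-1, (self.n_experts, self.d_expert))
        h = h * w.unsqueeze(-1)                              # scale each expert block
        return self.down(h.flatten(-2)), w

def sparsity_stats(w, chunk=8):
    active = (w > 0).float()                                 # (B, T, E)
    tls = 1.0 - active.mean().item()                         # token-level sparsity
    chunks = active.unflatten(1, (-1, chunk)).amax(2)        # union over each token chunk
    cls_ = 1.0 - chunks.mean().item()                        # chunk-level sparsity
    return tls, cls_

x = torch.randn(2, 32, 64)
out, w = BlockFFN()(x)
print(out.shape, sparsity_stats(w))
```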
https://arxiv.org/abs/2507.08771
In this study, we leverage a unique UNESCO collection of mid-20th century radio recordings to probe the robustness of modern off-the-shelf language identification (LID) and speaker recognition (SR) methods, especially with respect to the impact of multilingual speakers and cross-age recordings. Our findings suggest that LID systems, such as Whisper, are increasingly adept at handling second-language and accented speech. However, speaker embeddings remain a fragile component of speech processing pipelines that is prone to biases related to the channel, age, and language. These issues will need to be overcome if archives aim to employ SR methods for speaker indexing.
https://arxiv.org/abs/2507.08768
This study presents a hybrid model for classifying handwritten digits in the MNIST dataset, combining convolutional neural networks (CNNs) with a multi-well Hopfield network. The approach employs a CNN to extract high-dimensional features from input images, which are then clustered into class-specific prototypes using k-means clustering. These prototypes serve as attractors in a multi-well energy landscape, where a Hopfield network performs classification by minimizing an energy function that balances feature similarity and class membership. The model's design enables robust handling of intraclass variability, such as diverse handwriting styles, while providing an interpretable framework through its energy-based decision process. Through systematic optimization of the CNN architecture and the number of wells, the model achieves a high test accuracy of 99.2% on 10,000 MNIST images, demonstrating its effectiveness for image classification tasks. The findings highlight the critical role of deep feature extraction and sufficient prototype coverage in achieving high performance, with potential for broader applications in pattern recognition.
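A compact sketch of the classification stage as described: CNN features (random stand-ins here) are clustered per class with k-means, and a test feature is assigned to the class whose nearest prototype gives the lowest quadratic "well" energy. The real model runs Hopfield dynamics over a multi-well energy; this collapses that to a direct energy comparison for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_prototypes(features, labels, wells_per_class=3, seed=0):
    """k-means prototypes per class; each prototype is one well in the energy landscape."""
    protos, proto_labels = [], []
    for c in np.unique(labels):
        km = KMeans(n_clusters=wells_per_class, n_init=10, random_state=seed)
        km.fit(features[labels == c])
        protos.append(km.cluster_centers_)
        proto_labels += [c] * wells_per_class
    return np.vstack(protos), np.array(proto_labels)

def classify(x, protos, proto_labels):
    """Energy of each well ~ squared distance to its prototype; pick the lowest."""
    energies = 0.5 * np.sum((protos - x) ** 2, axis=1)
    return proto_labels[energies.argmin()]

# Stand-in for CNN features of MNIST digits: 200 samples, 32-d, 10 classes.
rng = np.random.default_rng(0)
feats, labs = rng.normal(size=(200, 32)), rng.integers(0, 10, 200)
P, PL = build_prototypes(feats, labs, wells_per_class=2)
print(classify(feats[0], P, PL))
```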
https://arxiv.org/abs/2507.08766
Due to the excellent performance in yielding high-quality, zero-shot segmentation, Segment Anything Model (SAM) and its variants have been widely applied in diverse scenarios such as healthcare and intelligent manufacturing. Therefore, effectively compressing SAMs has become an increasingly pressing practical need. In this study, we propose Birkhoff, a novel data-free compression algorithm for SAM and its variants. Unlike quantization, pruning, distillation, and other compression methods, Birkhoff embodies versatility across model types, agility in deployment, faithfulness to the original model, and compactness in model size. Specifically, Birkhoff introduces a novel compression algorithm: Hyper-Compression, whose core principle is to find a dense trajectory to turn a high-dimensional parameter vector into a low-dimensional scalar. Furthermore, Birkhoff designs a dedicated linear layer operator, HyperLinear, to fuse decompression and matrix multiplication to significantly accelerate inference of the compressed SAMs. Extensive experiments on 18 SAMs in the COCO, LVIS, and SA-1B datasets show that Birkhoff performs consistently and competitively in compression time, compression ratio, post-compression performance, and inference speed. For example, Birkhoff can achieve a compression ratio of 5.17x on SAM2-B, with less than 1% performance drop without using any fine-tuning data. Moreover, the compression is finished within 60 seconds for all models.
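A toy illustration of the stated core principle of Hyper-Compression, representing a small parameter block by a single scalar via a dense trajectory on the torus, using a brute-force search over an irrational winding. This is only meant to make the idea concrete; the actual algorithm, block sizes, and the HyperLinear decoding kernel are far more involved, and the direction vector below is an arbitrary choice.

```python
import numpy as np

def encode_block(theta, t_max=1000.0, n_candidates=200_000):
    """Search for a scalar t whose trajectory point frac(t * alpha) approximates theta in [0, 1)^d."""
    d = len(theta)
    alpha = np.sqrt(np.array([2, 3, 5, 7, 11, 13, 17, 19], dtype=float)[:d])  # incommensurate directions
    t = np.linspace(0.0, t_max, n_candidates)
    traj = np.mod(np.outer(t, alpha), 1.0)                 # dense winding on the d-torus
    errs = np.linalg.norm(traj - theta, axis=1)
    best = int(errs.argmin())
    return t[best], alpha, errs[best]

def decode_block(t, alpha):
    return np.mod(t * alpha, 1.0)

theta = np.random.rand(4)                                  # a tiny normalized parameter block
t, alpha, err = encode_block(theta)
print("scalar:", t, "reconstruction error:", err)
print(theta, decode_block(t, alpha))
```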
https://arxiv.org/abs/2507.08765
Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.
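A hedged PyTorch fragment of the two ingredients named in the abstract: a Q-network with layer normalization (the RS-LN side, which keeps Q-values from extrapolating linearly far from the data) and a Bellman target that scales rewards and assigns a fixed penalty whenever the next action falls outside the feasible range (the PA side). The scale, penalty value, and box-shaped feasibility test are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn

class LNCritic(nn.Module):
    """Q-network with LayerNorm after each hidden layer."""
    def __init__(self, s_dim=3, a_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim + a_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], -1)).squeeze(-1)

def pars_target(critic, r, s2, a2, a_low, a_high, reward_scale=10.0,
                gamma=0.99, penalty=-100.0):
    """Scaled reward plus a penalized bootstrap value for infeasible next actions."""
    q_next = critic(s2, a2)
    infeasible = ((a2 < a_low) | (a2 > a_high)).any(-1)
    q_next = torch.where(infeasible, torch.full_like(q_next, penalty), q_next)
    return reward_scale * r + gamma * q_next

s2, a2, r = torch.randn(16, 3), torch.randn(16, 2), torch.randn(16)
target = pars_target(LNCritic(), r, s2, a2, a_low=-1.0, a_high=1.0)
print(target.shape)                                        # torch.Size([16])
```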
https://arxiv.org/abs/2507.08761
Digital Twins (DT) have the potential to transform traffic management and operations by creating dynamic, virtual representations of transportation systems that sense conditions, analyze operations, and support decision-making. A key component for DT of the transportation system is dynamic roadway geometry sensing. However, existing approaches often rely on static maps or costly sensors, limiting scalability and adaptability. Additionally, large-scale DTs that collect and analyze data from multiple sources face challenges in privacy, communication, and computational efficiency. To address these challenges, we introduce Geo-ORBIT (Geometrical Operational Roadway Blueprint with Integrated Twin), a unified framework that combines real-time lane detection, DT synchronization, and federated meta-learning. At the core of Geo-ORBIT is GeoLane, a lightweight lane detection model that learns lane geometries from vehicle trajectory data using roadside cameras. We extend this model through Meta-GeoLane, which learns to personalize detection parameters for local entities, and FedMeta-GeoLane, a federated learning strategy that ensures scalable and privacy-preserving adaptation across roadside deployments. Our system is integrated with CARLA and SUMO to create a high-fidelity DT that renders highway scenarios and captures traffic flows in real-time. Extensive experiments across diverse urban scenes show that FedMeta-GeoLane consistently outperforms baseline and meta-learning approaches, achieving lower geometric error and stronger generalization to unseen locations while drastically reducing communication overhead. This work lays the foundation for flexible, context-aware infrastructure modeling in DTs. The framework is publicly available at this https URL.
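A compact sketch of the federated part of the pipeline: each roadside site fine-tunes a copy of a shared lane-geometry model on its own trajectory data, and only model weights are sent back and averaged, so raw trajectories never leave the site. The FedAvg-style aggregation and the toy regressor are stand-ins for FedMeta-GeoLane's actual meta-learning procedure; all shapes are invented.

```python
import copy
import torch
import torch.nn as nn

def federated_average(local_models, global_model):
    """FedAvg-style aggregation: average local weights into the shared global model."""
    with torch.no_grad():
        for name, p in global_model.named_parameters():
            p.copy_(torch.stack([dict(m.named_parameters())[name].detach()
                                 for m in local_models]).mean(0))
    return global_model

def personalize(global_model, local_data, steps=20, lr=1e-2):
    """Each roadside site adapts a copy of the global model on its own trajectories."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    X, y = local_data
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(X), y).backward()
        opt.step()
    return model

# Toy "lane geometry" regressors: trajectory features -> lane-offset parameters.
make_model = lambda: nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 4))
sites = [(torch.randn(64, 6), torch.randn(64, 4)) for _ in range(3)]
global_model = make_model()
local_models = [personalize(global_model, d) for d in sites]   # local adaptation
global_model = federated_average(local_models, global_model)   # privacy-preserving merge
```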
https://arxiv.org/abs/2507.08743
Hierarchical land cover and land use (LCLU) classification aims to assign pixel-wise labels with multiple levels of semantic granularity to remote sensing (RS) imagery. However, existing deep learning-based methods face two major challenges: 1) They predominantly adopt a flat classification paradigm, which limits their ability to generate end-to-end multi-granularity hierarchical predictions aligned with tree-structured hierarchies used in practice. 2) Most cross-domain studies focus on performance degradation caused by sensor or scene variations, with limited attention to transferring LCLU models to cross-domain tasks with heterogeneous hierarchies (e.g., LCLU to crop classification). These limitations hinder the flexibility and generalization of LCLU models in practical applications. To address these challenges, we propose HieraRS, a novel hierarchical interpretation paradigm that enables multi-granularity predictions and supports the efficient transfer of LCLU models to cross-domain tasks with heterogeneous tree-structured hierarchies. We introduce the Bidirectional Hierarchical Consistency Constraint Mechanism (BHCCM), which can be seamlessly integrated into mainstream flat classification models to generate hierarchical predictions, while improving both semantic consistency and classification accuracy. Furthermore, we present TransLU, a dual-branch cross-domain transfer framework comprising two key components: Cross-Domain Knowledge Sharing (CDKS) and Cross-Domain Semantic Alignment (CDSA). TransLU supports dynamic category expansion and facilitates the effective adaptation of LCLU models to heterogeneous hierarchies. In addition, we construct MM-5B, a large-scale multi-modal hierarchical land use dataset featuring pixel-wise annotations. The code and MM-5B dataset will be released at: this https URL.
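A small sketch of one way to impose the bidirectional hierarchical consistency the abstract describes (my illustration, not the paper's exact BHCCM): child-class probabilities are aggregated into parent probabilities through a child-to-parent mapping, and a KL term penalizes disagreement with the head that predicts parent classes directly. The toy hierarchy and the choice of KL are assumptions.

```python
import torch
import torch.nn.functional as F

def hierarchy_matrix(child_to_parent, n_parents):
    """Binary (n_children, n_parents) matrix from a child -> parent index list."""
    m = torch.zeros(len(child_to_parent), n_parents)
    m[torch.arange(len(child_to_parent)), torch.tensor(child_to_parent)] = 1.0
    return m

def hierarchical_consistency_loss(child_logits, parent_logits, M):
    """KL between parents aggregated from children and directly predicted parents."""
    p_child = child_logits.softmax(-1)                      # (B, n_children)
    p_parent_from_child = p_child @ M                       # (B, n_parents)
    log_p_parent = parent_logits.log_softmax(-1)
    return F.kl_div(log_p_parent, p_parent_from_child, reduction="batchmean")

# Toy hierarchy: 6 fine land-cover classes mapping onto 3 coarse ones.
M = hierarchy_matrix([0, 0, 1, 1, 2, 2], n_parents=3)
child_logits, parent_logits = torch.randn(4, 6), torch.randn(4, 3)
print(hierarchical_consistency_loss(child_logits, parent_logits, M))
```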
https://arxiv.org/abs/2507.08741
Nonlinear vector autoregression (NVAR) and reservoir computing (RC) have shown promise in forecasting chaotic dynamical systems, such as the Lorenz-63 model and El Nino-Southern Oscillation. However, their reliance on fixed nonlinearities - polynomial expansions in NVAR or random feature maps in RC - limits their adaptability to high noise or real-world data. These methods also scale poorly in high-dimensional settings due to costly matrix inversion during readout computation. We propose an adaptive NVAR model that combines delay-embedded linear inputs with features generated by a shallow, learnable multi-layer perceptron (MLP). The MLP and linear readout are jointly trained using gradient-based optimization, enabling the model to learn data-driven nonlinearities while preserving a simple readout structure. Unlike standard NVAR, our approach avoids the need for an exhaustive and sensitive grid search over ridge and delay parameters. Instead, tuning is restricted to neural network hyperparameters, improving scalability. Initial experiments on chaotic systems tested under noise-free and synthetically noisy conditions showed that the adaptive model outperformed the standard NVAR in predictive accuracy and showed robust forecasting under noisy conditions with a lower observation frequency.
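A minimal PyTorch sketch of the model the abstract outlines: delay-embedded inputs feed both a linear readout and a shallow MLP whose learned features replace fixed polynomial terms, and both parts are trained jointly by gradient descent. Delay length, widths, and optimizer settings are illustrative.

```python
import torch
import torch.nn as nn

class AdaptiveNVAR(nn.Module):
    def __init__(self, dim=3, delays=4, hidden=64):
        super().__init__()
        in_dim = dim * delays                               # delay-embedded input
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh())
        self.readout = nn.Linear(in_dim + hidden, dim)      # linear part + learned features

    def forward(self, x_delayed):                           # (B, dim * delays)
        feats = self.mlp(x_delayed)
        return self.readout(torch.cat([x_delayed, feats], dim=-1))

def delay_embed(series, delays):
    """series: (T, dim) -> inputs (T-delays, dim*delays), targets (T-delays, dim)."""
    windows = [series[i:len(series) - delays + i] for i in range(delays)]
    return torch.cat(windows, dim=-1), series[delays:]

series = torch.cumsum(torch.randn(500, 3), dim=0) * 0.01    # toy trajectory
X, Y = delay_embed(series, delays=4)
model = AdaptiveNVAR()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):                                        # joint MLP + readout training
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), Y)
    loss.backward()
    opt.step()
print(loss.item())
```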
https://arxiv.org/abs/2507.08738
Catastrophic forgetting in deep neural networks occurs when learning new tasks degrades performance on previously learned tasks due to knowledge overwriting. Among the approaches to mitigate this issue, regularization techniques aim to identify and constrain "important" parameters to preserve previous knowledge. In the highly nonconvex optimization landscape of deep learning, we propose a novel perspective: tracking parameters during the final training plateau is more effective than monitoring them throughout the entire training process. We argue that parameters that exhibit higher activity (movement and variability) during this plateau reveal directions in the loss landscape that are relatively flat, making them suitable for adaptation to new tasks while preserving knowledge from previous ones. Our comprehensive experiments demonstrate that this approach achieves superior performance in balancing catastrophic forgetting mitigation with strong performance on newly learned tasks.
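A sketch of the idea as I read it: record parameter snapshots only during the final training plateau, use each parameter's variability across those snapshots as a flatness signal, and regularize future tasks more strongly on the parameters that moved least (the non-flat directions). The inversion, scaling, and penalty form below are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class PlateauTracker:
    """Collect parameter snapshots during the final plateau and turn their
    variability into per-parameter penalty weights (low movement -> high penalty)."""
    def __init__(self, model):
        self.model = model
        self.snapshots = []

    def record(self):
        self.snapshots.append([p.detach().clone() for p in self.model.parameters()])

    def penalty_weights(self, eps=1e-8):
        stacked = [torch.stack(ps) for ps in zip(*self.snapshots)]
        variability = [s.std(dim=0) for s in stacked]       # movement during the plateau
        return [1.0 / (v + eps) for v in variability]       # flat directions get tiny weight

def regularized_loss(model, task_loss, old_params, weights, lam=1e-3):
    reg = sum((w * (p - p0) ** 2).sum()
              for p, p0, w in zip(model.parameters(), old_params, weights))
    return task_loss + lam * reg

net = nn.Linear(4, 2)
tracker = PlateauTracker(net)
for _ in range(5):                                           # pretend these are late-training steps
    for p in net.parameters():
        p.data += 0.01 * torch.randn_like(p)
    tracker.record()
weights = tracker.penalty_weights()
anchor = [p.detach().clone() for p in net.parameters()]
print(regularized_loss(net, torch.tensor(0.5), anchor, weights))
```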
https://arxiv.org/abs/2507.08736
Solving computer vision problems through machine learning, one often encounters a lack of sufficient training data. To mitigate this we propose the use of ensembles of weak learners based on spectral total-variation (STV) features (Gilboa 2014). The features are related to nonlinear eigenfunctions of the total-variation subgradient and can characterize textures well at various scales. It was shown (Burger et al. 2016) that, in the one-dimensional case, orthogonal features are generated, whereas in two dimensions the features are empirically lowly correlated. Ensemble learning theory advocates the use of lowly correlated weak learners. We thus propose here to design ensembles using learners based on STV features. To show the effectiveness of this paradigm we examine a hard real-world medical imaging problem: the predictive value of computed tomography (CT) data for high uptake in positron emission tomography (PET) for patients suspected of skeletal metastases. The database consists of 457 scans with 1524 unique pairs of registered CT and PET slices. Our approach is compared to deep-learning methods and to Radiomics features, showing STV learners perform best (AUC=0.87), compared to neural nets (AUC=0.75) and Radiomics (AUC=0.79). We observe that fine STV scales in CT images are especially indicative for the presence of high uptake in PET.
https://arxiv.org/abs/2507.08735
Modern configurable software systems need to learn models that correlate configuration and performance. However, when the system operates in dynamic environments, the workload variations, hardware changes, and system updates will inevitably introduce concept drifts at different levels - global drifts, which reshape the performance landscape of the entire configuration space; and local drifts, which only affect certain sub-regions of that space. As such, existing offline and transfer learning approaches can struggle to adapt to these implicit and unpredictable changes in real-time, rendering configuration performance learning challenging. To address this, we propose DHDA, an online configuration performance learning framework designed to capture and adapt to these drifts at different levels. The key idea is that DHDA adapts to both the local and global drifts using dually hierarchical adaptation: at the upper level, we redivide the data into different divisions, within each of which the local model is retrained, to handle global drifts only when necessary. At the lower level, the local models of the divisions can detect local drifts and adapt themselves asynchronously. To balance responsiveness and efficiency, DHDA combines incremental updates with periodic full retraining to minimize redundant computation when no drifts are detected. Through evaluating eight software systems and against state-of-the-art approaches, we show that DHDA achieves considerably better accuracy and can effectively adapt to drifts with up to 2x improvements, while incurring reasonable overhead and is able to improve different local models in handling concept drift.
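A schematic of the two-level loop described in the abstract: configuration data is partitioned into divisions, each with its own local performance model; local drift is detected per division (here by an error-threshold check) and handled with a refit of that division only, while the upper level periodically re-divides and fully retrains. The drift test, thresholds, and learner choice are placeholders, and the refit-on-new-data step merely stands in for an incremental update.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

class DHDASketch:
    def __init__(self, n_divisions=4, local_tol=2.0, seed=0):
        self.n_divisions, self.local_tol, self.seed = n_divisions, local_tol, seed
        self.divider, self.locals_, self.baseline = None, {}, {}

    def full_retrain(self, X, y):                            # upper level: re-divide + retrain
        self.divider = KMeans(self.n_divisions, n_init=10, random_state=self.seed).fit(X)
        for d in range(self.n_divisions):
            m = self.divider.labels_ == d
            model = RandomForestRegressor(random_state=self.seed).fit(X[m], y[m])
            self.locals_[d] = model
            self.baseline[d] = np.mean(np.abs(model.predict(X[m]) - y[m])) + 1e-9

    def observe(self, X_new, y_new):                         # lower level: per-division checks
        d_ids = self.divider.predict(X_new)
        for d in np.unique(d_ids):
            m = d_ids == d
            err = np.mean(np.abs(self.locals_[d].predict(X_new[m]) - y_new[m]))
            if err > self.local_tol * self.baseline[d]:      # local drift detected
                self.locals_[d].fit(X_new[m], y_new[m])      # refit stands in for incremental update

rng = np.random.default_rng(0)
X, y = rng.random((400, 5)), rng.random(400)
dhda = DHDASketch()
dhda.full_retrain(X, y)
dhda.observe(rng.random((80, 5)), rng.random(80) + 0.5)      # shifted performance = drift
```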
https://arxiv.org/abs/2507.08730
The multi-camera vehicle tracking (MCVT) framework holds significant potential for smart city applications, including anomaly detection, traffic density estimation, and suspect vehicle tracking. However, current publicly available datasets exhibit limitations, such as overly simplistic scenarios, low-resolution footage, and insufficiently diverse conditions, creating a considerable gap between academic research and real-world scenarios. To fill this gap, we introduce RoundaboutHD, a comprehensive, high-resolution multi-camera vehicle tracking benchmark dataset specifically designed to represent real-world roundabout scenarios. RoundaboutHD provides a total of 40 minutes of labelled video footage captured by four non-overlapping, high-resolution (4K resolution, 15 fps) cameras. In total, 512 unique vehicle identities are annotated across different camera views, offering rich cross-camera association data. RoundaboutHD offers temporally consistent video footage and enhanced challenges, including increased occlusions and nonlinear movement inside the roundabout. In addition to the full MCVT dataset, several subsets are also available for object detection, single camera tracking, and image-based vehicle re-identification (ReID) tasks. Vehicle model information and camera modelling/geometry information are also included to support further analysis. We provide baseline results for vehicle detection, single-camera tracking, image-based vehicle re-identification, and multi-camera tracking. The dataset and the evaluation code are publicly available at: this https URL
https://arxiv.org/abs/2507.08729
Learning robot manipulation policies from raw, real-world image data requires a large number of robot-action trials in the physical environment. Although training using simulations offers a cost-effective alternative, the visual domain gap between simulation and robot workspace remains a major limitation. Gaussian Splatting visual reconstruction methods have recently provided new directions for robot manipulation by generating realistic environments. In this paper, we propose the first method for learning supervised-based robot handovers solely from RGB images without the need for real-robot training or real-robot data collection. The proposed policy learner, Human-to-Robot Handover using Sparse-View Gaussian Splatting (H2RH-SGS), leverages sparse-view Gaussian Splatting reconstruction of human-to-robot handover scenes to generate robot demonstrations containing image-action pairs captured with a camera mounted on the robot gripper. As a result, the simulated camera pose changes in the reconstructed scene can be directly translated into gripper pose changes. We train a robot policy on demonstrations collected with 16 household objects and directly deploy this policy in the real environment. Experiments in both Gaussian Splatting reconstructed scenes and real-world human-to-robot handover experiments demonstrate that H2RH-SGS serves as a new and effective representation for the human-to-robot handover task.
https://arxiv.org/abs/2507.08726