Catastrophic forgetting in deep neural networks occurs when learning new tasks degrades performance on previously learned tasks due to knowledge overwriting. Among the approaches to mitigate this issue, regularization techniques aim to identify and constrain "important" parameters to preserve previous knowledge. In the highly nonconvex optimization landscape of deep learning, we propose a novel perspective: tracking parameters during the final training plateau is more effective than monitoring them throughout the entire training process. We argue that parameters that exhibit higher activity (movement and variability) during this plateau reveal directions in the loss landscape that are relatively flat, making them suitable for adaptation to new tasks while preserving knowledge from previous ones. Our comprehensive experiments demonstrate that this approach achieves superior performance in balancing catastrophic forgetting mitigation with strong performance on newly learned tasks.
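Catastrophic forgetting in deep neural networks is the loss of performance on previously learned tasks that occurs when new tasks are learned and new knowledge overwrites old knowledge. To mitigate it, regularization techniques identify and constrain "important" parameters so that earlier knowledge is preserved. Given the highly nonconvex optimization landscape of deep learning, we propose a new perspective: tracking parameters during the final training plateau is more effective than monitoring them throughout training. We argue that parameters showing higher activity (movement and variability) during this plateau reveal relatively flat directions in the loss landscape, which makes them suitable for adapting to new tasks while retaining earlier knowledge. Our experiments show that this approach achieves a better balance between mitigating catastrophic forgetting and performing well on newly learned tasks.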
https://arxiv.org/abs/2507.08736
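The plateau-tracking idea in the abstract above can be illustrated with a small sketch. The abstract does not give formulas, so the definition of activity (net movement plus standard deviation), the mapping from activity to importance, and the quadratic penalty below are all assumptions chosen for illustration, not the authors' method.

```python
import numpy as np

def plateau_activity(snapshots):
    """Per-parameter activity over a plateau (assumed definition).

    snapshots: array of shape (T, P), the parameter vector saved at T plateau steps.
    Activity = net movement over the plateau + variability across snapshots.
    """
    snapshots = np.asarray(snapshots, dtype=float)
    movement = np.abs(snapshots[-1] - snapshots[0])
    variability = snapshots.std(axis=0)
    return movement + variability

def importance_from_activity(activity, eps=1e-8):
    """Assumed mapping: low activity -> high importance, values in (0, 1]."""
    return 1.0 / (1.0 + activity / (activity.mean() + eps))

def penalty(theta, theta_old, importance, lam=1.0):
    """Quadratic penalty that anchors important parameters to their old values."""
    return 0.5 * lam * np.sum(importance * (theta - theta_old) ** 2)

rng = np.random.default_rng(0)
T, P = 50, 4
# Hypothetical plateau: parameters 0-1 barely move, parameters 2-3 wander.
noise_scale = np.array([0.001, 0.001, 0.1, 0.1])
snaps = np.cumsum(rng.normal(size=(T, P)) * noise_scale, axis=0)

act = plateau_activity(snaps)
imp = importance_from_activity(act)

theta_old = snaps[-1]
cost_quiet = penalty(theta_old + np.array([0.1, 0, 0, 0]), theta_old, imp)
cost_active = penalty(theta_old + np.array([0, 0, 0, 0.1]), theta_old, imp)
print(imp, cost_quiet, cost_active)
```

Under these assumptions, moving a parameter that was quiet on the plateau costs more than moving one that was active, which matches the intuition that active directions are flatter and safer to adapt.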
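Direct Preference Optimization (DPO) has emerged as a popular and efficient alternative to reward modeling and reinforcement learning for aligning language models with human preferences. Despite its empirical success, the theoretical properties and intrinsic limitations of DPO remain underexplored. In this work, we first present a comprehensive analysis of DPO's dynamics from a probability evolution perspective. Our analysis reveals that DPO is highly sensitive to initialization. It also tends to misallocate probability mass, which can inadvertently shift probability toward irrelevant or undesired responses. This misallocation may unintentionally reinforce model bias, thereby compromising both the stability of model alignment and the consistency with intended preferences. Motivated by these theoretical findings, we propose a theoretically grounded bilevel optimization framework, which we call stable preference optimization, that tightly integrates supervised fine-tuning with an enhanced DPO objective. Our approach introduces a principled regularization scheme to explicitly encourage absolute probability improvement for preferred outputs, while maintaining stable optimization dynamics. Experiments on challenging reasoning and summarization benchmarks show that our method consistently improves reasoning accuracy and better aligns output distributions with intended preferences, outperforming standard DPO. Stable preference optimization provides new insights into the design of preference-based alignment objectives and opens up new avenues towards more reliable and interpretable language model alignment.
Direct Preference Optimization (DPO) has become a popular and efficient alternative to reward modeling and reinforcement learning for aligning language models with human preferences. Despite its empirical success, its theoretical properties and intrinsic limitations remain underexplored. In this work we first analyze the dynamics of DPO from a probability-evolution perspective. The analysis shows that DPO is highly sensitive to initialization and tends to misallocate probability mass, which can shift probability toward irrelevant or undesired responses. This misallocation may unintentionally reinforce model bias, harming both the stability of alignment and consistency with the intended preferences. Motivated by these findings, we propose a theoretically grounded bilevel optimization framework, stable preference optimization, that tightly integrates supervised fine-tuning with an enhanced DPO objective. The method introduces a principled regularization scheme that explicitly encourages an absolute increase in the probability of preferred outputs while keeping the optimization dynamics stable. Experiments on challenging reasoning and summarization benchmarks show that the method consistently improves reasoning accuracy and aligns output distributions with the intended preferences better than standard DPO. Stable preference optimization offers new insight into the design of preference-based alignment objectives and opens a path toward more reliable and interpretable language model alignment.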
https://arxiv.org/abs/2507.07723
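The probability-misallocation issue described in the DPO abstract above can be seen directly in the standard DPO loss, which depends only on the margin between preferred and rejected log-probabilities. The sketch below uses the standard per-pair DPO loss; the regularized variant, which adds an SFT-style term on the preferred response with a hypothetical weight `alpha`, is only an assumed illustration of "encouraging absolute probability improvement" and is not the paper's bilevel objective.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair of sequence log-probs."""
    margin = (logp_w - ref_w) - (logp_l - ref_l)
    return -log_sigmoid(beta * margin)

def regularized_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1, alpha=1.0):
    """DPO plus an assumed SFT-style term -alpha * logp_w on the preferred response."""
    return dpo_loss(logp_w, logp_l, ref_w, ref_l, beta) - alpha * logp_w

ref_w, ref_l = -10.0, -10.0
before = dpo_loss(-10.0, -10.0, ref_w, ref_l)  # policy equals the reference
after = dpo_loss(-12.0, -20.0, ref_w, ref_l)   # both log-probs fell, rejected fell more

reg_before = regularized_loss(-10.0, -10.0, ref_w, ref_l)
reg_after = regularized_loss(-12.0, -20.0, ref_w, ref_l)
print(before, after, reg_before, reg_after)
```

In this toy case the plain DPO loss decreases even though the preferred response became less likely, so the lost mass must have gone to other responses. The added term makes the same update look worse, which is the kind of behavior an absolute-probability regularizer is meant to produce.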
We propose a novel spatial-temporal graph Mamba (STG-Mamba) for the music-guided dance video synthesis task, i.e., to translate the input music to a dance video. STG-Mamba consists of two translation mappings: music-to-skeleton translation and skeleton-to-video translation. In the music-to-skeleton translation, we introduce a novel spatial-temporal graph Mamba (STGM) block to effectively construct skeleton sequences from the input music, capturing dependencies between joints in both the spatial and temporal dimensions. For the skeleton-to-video translation, we propose a novel self-supervised regularization network to translate the generated skeletons, along with a conditional image, into a dance video. Lastly, we collect a new skeleton-to-video translation dataset from the Internet, containing 54,944 video clips. Extensive experiments demonstrate that STG-Mamba achieves significantly better results than existing methods.
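We propose a novel spatial-temporal graph Mamba (STG-Mamba) for music-guided dance video synthesis, that is, generating a dance video from input music. STG-Mamba consists of two translation mappings: music-to-skeleton and skeleton-to-video. For music-to-skeleton translation, we introduce a new spatial-temporal graph Mamba (STGM) block that builds skeleton sequences from the input music and captures dependencies between joints in both the spatial and temporal dimensions. For skeleton-to-video translation, we propose a novel self-supervised regularization network that translates the generated skeletons, together with a conditional image, into a dance video. Finally, we collect a new skeleton-to-video translation dataset from the Internet containing 54,944 video clips. Extensive experiments show that STG-Mamba achieves significantly better results than existing methods.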
https://arxiv.org/abs/2507.06689
In this paper, we propose view-dependent projection (VDP) to facilitate point cloud segmentation, designing efficient 3D-to-2D mapping that dynamically adapts to the spatial geometry from view variations. Existing projection-based methods leverage view-independent projection in complex scenes, relying on straight lines to generate direct rays or upward curves to reduce occlusions. However, their view independence provides projection rays that are limited to pre-defined parameters by human settings, restricting point awareness and failing to capture sufficient projection diversity across different view planes. Although multiple projections per view plane are commonly used to enhance spatial variety, the projected redundancy leads to excessive computational overhead and inefficiency in image processing. To address these limitations, we design a framework of VDP to generate data-driven projections from 3D point distributions, producing highly informative single-image inputs by predicting rays inspired by the adaptive behavior of fireworks. In addition, we construct color regularization to optimize the framework, which emphasizes essential features within semantic pixels and suppresses the non-semantic features within black pixels, thereby maximizing 2D space utilization in a projected image. As a result, our approach, PointVDP, develops lightweight projections in marginal computation costs. Experiments on S3DIS and ScanNet benchmarks show that our approach achieves competitive results, offering a resource-efficient solution for semantic understanding.
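In this paper we propose view-dependent projection (VDP) to facilitate point cloud segmentation. It provides an efficient 3D-to-2D mapping that adapts dynamically to spatial geometry as the view changes. Existing projection-based methods apply view-independent projection to complex scenes, using straight lines to generate direct rays or upward curves to reduce occlusion. Because they are view-independent, their projection rays are confined to hand-set, predefined parameters, which limits point awareness and fails to capture enough projection diversity across view planes. Multiple projections per view plane are commonly used to increase spatial variety, but the redundant projections cause excessive computational overhead and inefficient image processing. To address these limitations, we design a VDP framework that generates data-driven projections from 3D point distributions and produces highly informative single-image inputs by predicting rays inspired by the adaptive behavior of fireworks. We also construct a color regularization to optimize the framework; it emphasizes essential features in semantic pixels and suppresses non-semantic features in black pixels, maximizing the use of 2D space in the projected image. As a result, our approach, PointVDP, produces lightweight projections at marginal computational cost. Experiments on the S3DIS and ScanNet benchmarks show that it achieves competitive results and offers a resource-efficient solution for semantic understanding.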
https://arxiv.org/abs/2507.06618
In this paper, we introduce VisioPath, a novel framework combining vision-language models (VLMs) with model predictive control (MPC) to enable safe autonomous driving in dynamic traffic environments. The proposed approach leverages a bird's-eye view video processing pipeline and zero-shot VLM capabilities to obtain structured information about surrounding vehicles, including their positions, dimensions, and velocities. Using this rich perception output, we construct elliptical collision-avoidance potential fields around other traffic participants, which are seamlessly integrated into a finite-horizon optimal control problem for trajectory planning. The resulting trajectory optimization is solved via differential dynamic programming with an adaptive regularization scheme and is embedded in an event-triggered MPC loop. To ensure collision-free motion, a safety verification layer is incorporated in the framework that provides an assessment of potential unsafe trajectories. Extensive simulations in Simulation of Urban Mobility (SUMO) demonstrate that VisioPath outperforms conventional MPC baselines across multiple metrics. By combining modern AI-driven perception with the rigorous foundation of optimal control, VisioPath represents a significant step forward in safe trajectory planning for complex traffic systems.
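In this paper we introduce VisioPath, a novel framework that combines vision-language models (VLMs) with model predictive control (MPC) to enable safe autonomous driving in dynamic traffic environments. The approach uses a bird's-eye-view video processing pipeline and zero-shot VLM capabilities to obtain structured information about surrounding vehicles, including their positions, dimensions, and velocities. From this perception output we construct elliptical collision-avoidance potential fields around other traffic participants and integrate them into a finite-horizon optimal control problem for trajectory planning. The resulting trajectory optimization is solved by differential dynamic programming with an adaptive regularization scheme and embedded in an event-triggered MPC loop. To ensure collision-free motion, the framework also includes a safety verification layer that assesses potentially unsafe trajectories. Extensive simulations in Simulation of Urban Mobility (SUMO) show that VisioPath outperforms conventional MPC baselines on multiple metrics. By combining modern AI-driven perception with the rigorous foundation of optimal control, VisioPath is a significant step forward in safe trajectory planning for complex traffic systems.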
https://arxiv.org/abs/2507.06441
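The elliptical collision-avoidance potential fields mentioned in the VisioPath abstract above are not specified there. As a rough sketch, one common choice is a Gaussian-shaped potential in the obstacle vehicle's own frame, with a longer semi-axis along its heading. The functional form and the parameters `a`, `b`, and `gain` below are assumptions for illustration only.

```python
import numpy as np

def elliptical_potential(p, center, heading, a, b, gain=1.0):
    """Assumed elliptical potential around a vehicle.

    p:       query position(s), shape (..., 2)
    center:  vehicle position, shape (2,)
    heading: vehicle yaw in radians
    a, b:    semi-axes along and across the heading direction
    """
    d = np.asarray(p, dtype=float) - np.asarray(center, dtype=float)
    c, s = np.cos(heading), np.sin(heading)
    lon = c * d[..., 0] + s * d[..., 1]    # offset along the vehicle's heading
    lat = -s * d[..., 0] + c * d[..., 1]   # offset across the heading
    return gain * np.exp(-((lon / a) ** 2 + (lat / b) ** 2))

center, heading = (0.0, 0.0), 0.0
u_center = elliptical_potential((0.0, 0.0), center, heading, a=4.0, b=1.5)
u_ahead = elliptical_potential((3.0, 0.0), center, heading, a=4.0, b=1.5)
u_side = elliptical_potential((0.0, 3.0), center, heading, a=4.0, b=1.5)
print(u_center, u_ahead, u_side)
```

With these values a point 3 m ahead of the vehicle is penalized far more than a point 3 m to its side. A term like this could be summed over surrounding vehicles and added to the stage cost of the optimal control problem.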
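Pre-trained large vision-language models (VLMs) like CLIP demonstrate impressive generalization ability. Existing prompt-based and adapter-based works have made significant progress in fine-tuning VLMs but still face the challenge of maintaining strong generalization abilities, particularly towards unseen new classes. This limitation partly arises from these methods treating all tokens of the image and text encoder equally, which can lead to overfitting on less informative features (e.g., background noise, template words) and degrade the general representations that are crucial for novel concept recognition. To address this issue, we propose Dynamic Rank Adaptation (DRA), a novel adapter variant designed specifically to enhance new class generalization. DRA dynamically allocates adaptation ranks based on the importance of features during training to preserve general knowledge. DRA first employs token importance grouping, using sequence attention to evaluate and group tokens by their importance. Then, we adapt ranks dynamically according to the importance of each token group, assigning higher feature ranks to the more important tokens. We also design a new channel response mechanism to prioritize the preservation and adaptation of the feature channels identified as the most informative for each instance. In addition, an L1 regularization term is introduced to stabilize the training. Extensive experiments demonstrate the effectiveness and superiority of our proposed DRA over existing works, especially in enhancing the performance of new classes on various benchmarks, including base-to-new classes, cross-dataset evaluation and domain generalization. The source code will be released after the paper is accepted.
Pre-trained large vision-language models (VLMs) such as CLIP show impressive generalization ability. Existing prompt-based and adapter-based methods have made significant progress in fine-tuning VLMs but still struggle to maintain strong generalization, particularly on unseen new classes. This limitation arises partly because these methods treat all tokens of the image and text encoders equally, which can cause overfitting to less informative features (for example, background noise and template words) and degrade the general representations that are crucial for recognizing novel concepts. To address this, we propose Dynamic Rank Adaptation (DRA), a new adapter variant designed specifically to improve generalization to new classes. DRA dynamically allocates adaptation ranks according to feature importance during training in order to preserve general knowledge. It first performs token importance grouping, using sequence attention to evaluate tokens and group them by importance. It then adapts ranks dynamically according to the importance of each token group, assigning higher feature ranks to more important tokens. We also design a new channel response mechanism that prioritizes preserving and adapting the feature channels identified as most informative for each instance. In addition, an L1 regularization term is introduced to stabilize training. Extensive experiments demonstrate the effectiveness of DRA and its superiority over existing work, especially in improving new-class performance on benchmarks covering base-to-new classes, cross-dataset evaluation, and domain generalization. The source code will be released after the paper is accepted.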
https://arxiv.org/abs/2507.05668
Graph Neural Networks (GNNs) often struggle with noisy edges. We propose Latent Space Constrained Graph Neural Networks (LSC-GNN) to incorporate external "clean" links and guide embeddings of a noisy target graph. We train two encoders--one on the full graph (target plus external edges) and another on a regularization graph excluding the target's potentially noisy links--then penalize discrepancies between their latent representations. This constraint steers the model away from overfitting spurious edges. Experiments on benchmark datasets show LSC-GNN outperforms standard and noise-resilient GNNs in graphs subjected to moderate noise. We extend LSC-GNN to heterogeneous graphs and validate it on a small protein-metabolite network, where metabolite-protein interactions reduce noise in protein co-occurrence data. Our results highlight LSC-GNN's potential to boost predictive performance and interpretability in settings with noisy relational structures.
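Graph Neural Networks (GNNs) often struggle with graphs that contain noisy edges. We propose Latent Space Constrained Graph Neural Networks (LSC-GNN), which incorporate external "clean" links to guide the embeddings of a noisy target graph. We train two encoders: one on the full graph, which contains the target graph plus the external edges, and one on a regularization graph that excludes the target graph's potentially noisy links. We then penalize discrepancies between their latent representations. This constraint steers the model away from overfitting spurious edges. Experiments on benchmark datasets show that LSC-GNN outperforms standard and noise-resilient GNNs on graphs subjected to moderate noise. We also extend LSC-GNN to heterogeneous graphs and validate it on a small protein-metabolite network, where metabolite-protein interactions reduce the noise in protein co-occurrence data. Our results highlight the potential of LSC-GNN to improve predictive performance and interpretability in settings with noisy relational structures.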
https://arxiv.org/abs/2507.05540
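The two-encoder constraint in the LSC-GNN abstract above can be sketched with a one-layer GCN-style encoder in NumPy. The encoder architecture, the mean-squared discrepancy, and the toy graphs below are assumptions; the abstract only states that discrepancies between the two latent representations are penalized.

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2, as in a standard GCN layer."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def encode(A, X, W):
    """One-layer encoder: ReLU(A_norm X W)."""
    return np.maximum(normalize_adj(A) @ X @ W, 0.0)

def latent_discrepancy(Z_full, Z_reg):
    """Assumed penalty: mean squared difference between the two embeddings."""
    return np.mean((Z_full - Z_reg) ** 2)

# External "clean" edges (0-1, 2-3) and one possibly spurious target edge (0-2).
A_ext = np.zeros((4, 4))
A_ext[0, 1] = A_ext[1, 0] = 1
A_ext[2, 3] = A_ext[3, 2] = 1
A_target = np.zeros((4, 4))
A_target[0, 2] = A_target[2, 0] = 1
A_full = np.maximum(A_target, A_ext)   # full graph: target plus external edges

X = np.eye(4)                          # one-hot node features
W_full = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
W_reg = W_full.copy()                  # two encoders; equal weights only for this demo

Z_full = encode(A_full, X, W_full)
Z_reg = encode(A_ext, X, W_reg)
gap = latent_discrepancy(Z_full, Z_reg)
print(gap)
```

In training, a term such as `lam * gap` (with a hypothetical weight `lam`) would be added to the task loss, so the full-graph encoder is discouraged from relying on edges that the regularization graph does not support.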
Continual post-training (CPT) is a popular and effective technique for adapting foundation models like multimodal large language models to specific and ever-evolving downstream tasks. While existing research has primarily concentrated on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm within CPT remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks, utilizing Qwen2.5-VL-7B-Instruct as the base model for continual post-training. The investigation yields two significant findings: (1) When continuously learning on downstream tasks, SFT leads to catastrophic forgetting of previously learned tasks. In contrast, RFT inherently preserves prior knowledge and achieves performance comparable to multi-task training. (2) RFT successfully protects and even enhances the model's general knowledge on standard benchmarks (e.g., MMMU and MMLU-Pro). Conversely, SFT degrades general model capabilities severely. Further analysis shows that explicit mechanisms, such as KL penalty and chain-of-thought reasoning, are not the primary factors. Instead, we find that the implicit regularization inherent to RFT is a key factor in mitigating forgetting. Finally, we propose a rollout-based instance filtering algorithm to improve the stability and efficiency of RFT. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training.
Continual post-training (CPT) is a popular and effective way to adapt foundation models, such as multimodal large language models, to specific and constantly evolving downstream tasks. Existing research has focused mainly on data replay, model expansion, and parameter regularization, while the fundamental role of the learning paradigm in CPT has received little attention. This paper compares two core post-training paradigms, supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), and examines how each affects knowledge retention during CPT. Experiments are run on a benchmark of seven diverse multimodal tasks with Qwen2.5-VL-7B-Instruct as the base model. The study yields two main findings: (1) When learning downstream tasks continually, SFT causes catastrophic forgetting of previously learned tasks, whereas RFT inherently preserves prior knowledge and reaches performance comparable to multi-task training. (2) RFT protects and even improves the model's general knowledge on standard benchmarks (such as MMMU and MMLU-Pro), whereas SFT severely degrades general capabilities. Further analysis shows that explicit mechanisms, such as the KL penalty and chain-of-thought reasoning, are not the primary factors; instead, the implicit regularization inherent to RFT is a key factor in mitigating forgetting. Finally, we propose a rollout-based instance filtering algorithm that improves the stability and efficiency of RFT. Overall, the study shows that RFT is a robust paradigm for continual post-training.
https://arxiv.org/abs/2507.05386
Significant challenges exist in efficient data analysis of most advanced experimental and observational techniques because the collected signals often include unwanted contributions--such as background and signal distortions--that can obscure the physically relevant information of interest. To address this, we have developed a self-supervised machine-learning approach for source separation using a dual implicit neural representation framework that jointly trains two neural networks: one for approximating distortions of the physical signal of interest and the other for learning the effective background contribution. Our method learns directly from the raw data by minimizing a reconstruction-based loss function without requiring labeled data or pre-defined dictionaries. We demonstrate the effectiveness of our framework by considering a challenging case study involving large-scale simulated as well as experimental momentum-energy-dependent inelastic neutron scattering data in a four-dimensional parameter space, characterized by heterogeneous background contributions and unknown distortions to the target signal. The method is found to successfully separate physically meaningful signals from a complex or structured background even when the signal characteristics vary across all four dimensions of the parameter space. An analytical approach that informs the choice of the regularization parameter is presented. Our method offers a versatile framework for addressing source separation problems across diverse domains, ranging from superimposed signals in astronomical measurements to structural features in biomedical image reconstructions.
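Efficient data analysis is a major challenge for most advanced experimental and observational techniques, because the collected signals often contain unwanted contributions, such as background and signal distortions, that obscure the physically relevant information. To address this, we developed a self-supervised machine-learning approach to source separation based on a dual implicit neural representation framework that jointly trains two neural networks: one approximates the distortions of the physical signal of interest and the other learns the effective background contribution. The method learns directly from raw data by minimizing a reconstruction-based loss function and requires neither labeled data nor predefined dictionaries. We demonstrate the framework on a challenging case study involving large-scale simulated and experimental momentum-energy-dependent inelastic neutron scattering data in a four-dimensional parameter space, characterized by heterogeneous background contributions and unknown distortions of the target signal. The method separates physically meaningful signals from a complex or structured background even when the signal characteristics vary across all four dimensions of the parameter space. We also present an analytical approach that informs the choice of the regularization parameter. The method offers a versatile framework for source separation problems in many domains, from superimposed signals in astronomical measurements to structural features in biomedical image reconstruction.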
https://arxiv.org/abs/2507.05249
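Recent advancements in language models (LMs) have gradually ushered in an era where post-training is crucial. Yet post-training approaches such as supervised fine-tuning (SFT) do not guarantee effective use of the knowledge acquired during pretraining. We therefore present \ours, a lightweight method that encourages parametric information utilization in LMs during post-training. This is achieved by treating FFN layers as associative key-value memories and promoting the use of stored memory vectors via forward-pass interventions or regularization during backpropagation. We find that this simple guidance during the post-training phase delivers consistent performance improvements across diverse model families (including Qwen, Gemma and Llama), spanning over 15 downstream tasks in both ID and OOD evaluations. Beyond performance gains, we also find that steered LMs can adaptively allocate information, placing more emphasis on generating semantically meaningful tokens while using fewer resources on simple transition ones (e.g., ',' or 'and'). Our work underscores that vanilla post-training does not fully leverage pre-training potential, and that steering LMs in latent representation space offers a promising approach that enhances both performance and interpretability.
Recent progress in language models (LMs) has gradually ushered in an era in which post-training is crucial. However, post-training methods such as supervised fine-tuning (SFT) do not guarantee that the knowledge acquired during pretraining is used effectively. We therefore propose \ours, a lightweight method that encourages LMs to use parametric information during post-training. It does so by treating feed-forward network (FFN) layers as associative key-value memories and promoting the use of the stored memory vectors through forward-pass interventions or regularization during backpropagation. We find that this simple guidance during post-training yields consistent improvements across model families, including Qwen, Gemma and Llama, on more than 15 downstream tasks in both in-distribution (ID) and out-of-distribution (OOD) evaluations. Beyond the performance gains, we find that steered LMs allocate information adaptively: they place more emphasis on generating semantically meaningful tokens and spend fewer resources on simple transition tokens (for example, ',' or 'and'). Our work underscores that vanilla post-training does not fully exploit the potential of pretraining, and that steering LMs in latent representation space is a promising way to improve both performance and interpretability.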
https://arxiv.org/abs/2507.05158
We investigate the use of Long Short-Term Memory (LSTM) and Decomposition-LSTM (DLSTM) networks, combined with an ensemble algorithm, to predict solar flare occurrences using time-series data from the GOES catalog. The dataset spans from 2003 to 2023 and includes 151,071 flare events. Among approximately possible patterns, 7,552 yearly pattern windows are identified, highlighting the challenge of long-term forecasting due to the Sun's complex, self-organized criticality-driven behavior. A sliding window technique is employed to detect temporal quasi-patterns in both irregular and regularized flare time series. Regularization reduces complexity, enhances large flare activity, and captures active days more effectively. To address class imbalance, resampling methods are applied. LSTM and DLSTM models are trained on sequences of peak fluxes and waiting times from irregular time series, while LSTM and DLSTM, integrated with an ensemble approach, are applied to sliding windows of regularized time series with a 3-hour interval. Performance metrics, particularly TSS (0.74), recall (0.95) and the area under the curve (AUC=0.87) in the receiver operating characteristic (ROC), indicate that DLSTM with an ensemble approach on regularized time series outperforms other models, offering more accurate large-flare forecasts with fewer false errors compared to models trained on irregular time series. The superior performance of DLSTM is attributed to its ability to decompose time series into trend and seasonal components, effectively isolating random noise. This study underscores the potential of advanced machine learning techniques for solar flare prediction and highlights the importance of incorporating various solar cycle phases and resampling strategies to enhance forecasting reliability.
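We study Long Short-Term Memory (LSTM) and Decomposition-LSTM (DLSTM) networks, combined with an ensemble algorithm, for predicting solar flare occurrence from GOES catalog time series. The dataset covers 2003 to 2023 and contains 151,071 flare events. Among the possible patterns, 7,552 yearly pattern windows are identified, which highlights how hard long-term forecasting is given the Sun's complex behavior driven by self-organized criticality. A sliding window technique is used to detect temporal quasi-patterns in both the irregular and the regularized flare time series. Regularization reduces complexity, enhances large flare activity, and captures active days more effectively. Resampling methods are applied to address class imbalance. The LSTM and DLSTM models are trained on sequences of peak fluxes and waiting times from the irregular time series, while LSTM and DLSTM combined with the ensemble approach are applied to sliding windows of the regularized time series with a 3-hour interval. The performance metrics, in particular TSS (0.74), recall (0.95) and the area under the receiver operating characteristic (ROC) curve (AUC=0.87), indicate that DLSTM with the ensemble approach on the regularized time series outperforms the other models, giving more accurate large-flare forecasts with fewer errors than models trained on the irregular time series. The superior performance of DLSTM is attributed to its ability to decompose a time series into trend and seasonal components, which effectively isolates random noise. The study shows the potential of advanced machine learning techniques for solar flare prediction and the importance of including different solar cycle phases and resampling strategies to make forecasts more reliable.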
https://arxiv.org/abs/2507.05313
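The true skill statistic (TSS) quoted in the solar-flare abstract above is defined from the confusion matrix as recall minus the false-alarm rate. The sketch below uses that standard definition. The confusion-matrix counts are hypothetical, chosen only so that recall is 0.95 and TSS is 0.74; they are not taken from the paper, and the paper's two figures may not come from the same operating point.

```python
def true_skill_statistic(tp, fn, fp, tn):
    """TSS = TP / (TP + FN) - FP / (FP + TN), i.e. recall minus false-alarm rate."""
    recall = tp / (tp + fn)
    false_alarm_rate = fp / (fp + tn)
    return recall - false_alarm_rate

# Hypothetical counts: 100 flaring windows and 100 quiet windows.
tp, fn = 95, 5      # recall = 0.95
fp, tn = 21, 79     # false-alarm rate = 0.21
tss = true_skill_statistic(tp, fn, fp, tn)
print(round(tss, 2))
```

If the reported recall and TSS were measured at the same threshold, they would imply a false-alarm rate of about 0.95 - 0.74 = 0.21. Unlike accuracy, TSS does not change with the ratio of flaring to quiet samples, which is why it is widely used for imbalanced flare forecasting.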
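Cancer survival prediction requires integrating pathological Whole Slide Images (WSIs) and genomic profiles, a challenging task due to the inherent heterogeneity and the complexity of modeling both inter- and intra-modality interactions. Current methods often employ straightforward fusion strategies for multimodal feature integration, failing to comprehensively capture modality-specific and modality-common interactions, resulting in a limited understanding of multimodal correlations and suboptimal predictive performance. To mitigate these limitations, this paper presents a Multimodal Representation Decoupling Network (MurreNet) to advance cancer survival analysis. Specifically, we first propose a Multimodal Representation Decomposition (MRD) module to explicitly decompose paired input data into modality-specific and modality-shared representations, thereby reducing redundancy between modalities. Furthermore, the disentangled representations are refined and updated through a novel training regularization strategy that imposes constraints on distributional similarity, difference, and representativeness of modality features. Finally, the augmented multimodal features are integrated into a joint representation via the proposed Deep Holistic Orthogonal Fusion (DHOF) strategy. Extensive experiments conducted on six TCGA cancer cohorts demonstrate that our MurreNet achieves state-of-the-art (SOTA) performance in survival prediction.
Cancer survival prediction requires integrating pathological Whole Slide Images (WSIs) and genomic profiles, which is challenging because of the inherent heterogeneity of the data and the complexity of modeling both inter-modality and intra-modality interactions. Current methods usually rely on simple fusion strategies to integrate multimodal features, which do not fully capture modality-specific and modality-common interactions and therefore lead to a limited understanding of multimodal correlations and suboptimal predictive performance. To address these limitations, this paper presents the Multimodal Representation Decoupling Network (MurreNet) for cancer survival analysis. We first propose a Multimodal Representation Decomposition (MRD) module that explicitly decomposes paired input data into modality-specific and modality-shared representations, reducing redundancy between modalities. The disentangled representations are then refined and updated through a new training regularization strategy that constrains the distributional similarity, difference, and representativeness of the modality features. Finally, the augmented multimodal features are integrated into a joint representation by the proposed Deep Holistic Orthogonal Fusion (DHOF) strategy. Extensive experiments on six TCGA cancer cohorts show that MurreNet achieves state-of-the-art (SOTA) performance in survival prediction.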
https://arxiv.org/abs/2507.04891
We present MatDecompSDF, a novel framework for recovering high-fidelity 3D shapes and decomposing their physically-based material properties from multi-view images. The core challenge of inverse rendering lies in the ill-posed disentanglement of geometry, materials, and illumination from 2D observations. Our method addresses this by jointly optimizing three neural components: a neural Signed Distance Function (SDF) to represent complex geometry, a spatially-varying neural field for predicting PBR material parameters (albedo, roughness, metallic), and an MLP-based model for capturing unknown environmental lighting. The key to our approach is a physically-based differentiable rendering layer that connects these 3D properties to the input images, allowing for end-to-end optimization. We introduce a set of carefully designed physical priors and geometric regularizations, including a material smoothness loss and an Eikonal loss, to effectively constrain the problem and achieve robust decomposition. Extensive experiments on both synthetic and real-world datasets (e.g., DTU) demonstrate that MatDecompSDF surpasses state-of-the-art methods in geometric accuracy, material fidelity, and novel view synthesis. Crucially, our method produces editable and relightable assets that can be seamlessly integrated into standard graphics pipelines, validating its practical utility for digital content creation.
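We present MatDecompSDF, a novel framework for recovering high-fidelity 3D shapes from multi-view images and decomposing their physically based material properties. The core challenge of inverse rendering is the ill-posed disentanglement of geometry, materials, and illumination from 2D observations. Our method addresses it by jointly optimizing three neural components: a neural Signed Distance Function (SDF) that represents complex geometry, a spatially varying neural field that predicts PBR material parameters (albedo, roughness, metallic), and an MLP-based model that captures unknown environmental lighting. The key to the approach is a physically based differentiable rendering layer that connects these 3D properties to the input images and enables end-to-end optimization. We also introduce a set of carefully designed physical priors and geometric regularizations, including a material smoothness loss and an Eikonal loss, to constrain the problem and achieve robust decomposition. Extensive experiments on synthetic and real-world datasets (such as DTU) show that MatDecompSDF surpasses state-of-the-art methods in geometric accuracy, material fidelity, and novel view synthesis. Importantly, the method produces editable and relightable assets that integrate seamlessly into standard graphics pipelines, which validates its practical value for digital content creation.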
https://arxiv.org/abs/2507.04749
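The Eikonal loss named in the MatDecompSDF abstract above is a standard regularizer for neural SDFs: a valid signed distance function has a gradient of unit norm almost everywhere, so the loss penalizes deviations of the gradient norm from 1. The sketch below checks this on an analytic sphere SDF using finite differences. In practice the gradient would come from automatic differentiation of the SDF network, and the sampling scheme here is an assumption.

```python
import numpy as np

def sphere_sdf(p, radius=1.0):
    """Exact signed distance to a sphere centered at the origin."""
    return np.linalg.norm(p, axis=-1) - radius

def numerical_gradient(f, p, h=1e-4):
    """Central-difference gradient of a scalar field f at points p of shape (N, D)."""
    grads = np.zeros_like(p)
    for k in range(p.shape[-1]):
        step = np.zeros(p.shape[-1])
        step[k] = h
        grads[..., k] = (f(p + step) - f(p - step)) / (2 * h)
    return grads

def eikonal_loss(f, p):
    """Mean of (||grad f(p)|| - 1)^2 over the sample points."""
    g = numerical_gradient(f, p)
    return np.mean((np.linalg.norm(g, axis=-1) - 1.0) ** 2)

rng = np.random.default_rng(0)
pts = rng.uniform(-2.0, 2.0, size=(256, 3))   # assumed uniform sampling of the volume

loss_true = eikonal_loss(sphere_sdf, pts)                        # a valid SDF
loss_scaled = eikonal_loss(lambda p: 2.0 * sphere_sdf(p), pts)   # gradient norm 2
print(loss_true, loss_scaled)
```

The true SDF gives a loss near zero, while the scaled field, which has the same zero level set but is not a distance function, gives a loss near 1. This is why the term helps keep the learned geometry well behaved away from the surface.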
Real-world time series typically exhibit complex temporal variations, making the time series classification task notably challenging. Recent advancements have demonstrated the potential of multi-scale analysis approaches, which provide an effective solution for capturing these complex temporal patterns. However, existing multi-scale analysis-based time series prediction methods fail to eliminate redundant scale-shared features across multi-scale time series, resulting in the model over- or under-focusing on scale-shared features. To address this issue, we propose a novel end-to-end Disentangled Multi-Scale framework for Time Series classification (DisMS-TS). The core idea of DisMS-TS is to eliminate redundant shared features in multi-scale time series, thereby improving prediction performance. Specifically, we propose a temporal disentanglement module to capture scale-shared and scale-specific temporal representations, respectively. Subsequently, to effectively learn both scale-shared and scale-specific temporal representations, we introduce two regularization terms that ensure the consistency of scale-shared representations and the disparity of scale-specific representations across all temporal scales. Extensive experiments conducted on multiple datasets validate the superiority of DisMS-TS over its competitive baselines, with the accuracy improvement up to 9.71%.
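Real-world time series usually show complex temporal variations, which makes time series classification challenging. Recent work has shown the potential of multi-scale analysis, which captures these complex temporal patterns effectively. However, existing multi-scale methods for time series prediction fail to remove redundant scale-shared features across the multi-scale series, so the model focuses too much or too little on those features. To address this, we propose a new end-to-end Disentangled Multi-Scale framework for Time Series classification (DisMS-TS). Its core idea is to eliminate redundant shared features in multi-scale time series and thereby improve prediction performance. Specifically, we design a temporal disentanglement module that captures scale-shared and scale-specific temporal representations separately. To learn both kinds of representation effectively, we then introduce two regularization terms: one enforces the consistency of the scale-shared representations across all temporal scales, and the other enforces the disparity of the scale-specific representations. Extensive experiments on multiple datasets confirm that DisMS-TS outperforms competitive baselines, with accuracy improvements of up to 9.71%.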
https://arxiv.org/abs/2507.04600
Model compression offers a promising path to reducing the cost and inaccessibility of large pre-trained models, without significantly compromising their impressive performance. Large Transformer models, including large language models (LLMs), often contain computational redundancy, which can serve as a target for new model compression methods. In this work, we specifically target neuron-level redundancies in model layers by combining groups of similar neurons into fewer neurons. We frame this width reduction as a Discrete Optimal Transport problem, and propose DOTResize, a novel Transformer compression method that uses optimal transport theory to transform and compress model weights. To ensure applicability within the Transformer architecture, we motivate and incorporate entropic regularization and matrix factorization into the transportation maps produced by our method. Unlike pruning-based approaches which discard neurons based on importance measures, DOTResize re-projects the entire neuron width, allowing the retention and redistribution of useful signal across the reduced layer. Empirical results show that compared to simple or state-of-the-art neuron width-pruning techniques, DOTResize can outperform these methods across multiple LLM families and sizes, while achieving measurable reductions in real-world computational cost.
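Model compression is a promising way to reduce the cost and inaccessibility of large pre-trained models without significantly compromising their performance. Large Transformer models, including large language models (LLMs), often contain computational redundancy, which new compression methods can target. In this work we target neuron-level redundancy within model layers by combining groups of similar neurons into fewer neurons. We frame this width reduction as a discrete optimal transport problem and propose DOTResize, a new Transformer compression method that uses optimal transport theory to transform and compress model weights. To make the method applicable within the Transformer architecture, we incorporate entropic regularization and matrix factorization into the transport maps it produces. Unlike pruning-based approaches, which discard neurons according to importance measures, DOTResize re-projects the entire neuron width, so useful signal is retained and redistributed across the reduced layer. Empirical results show that DOTResize outperforms both simple and state-of-the-art neuron width-pruning techniques across multiple LLM families and sizes while achieving measurable reductions in real-world computational cost.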
https://arxiv.org/abs/2507.04517
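The entropic regularization mentioned in the DOTResize abstract above is commonly solved with Sinkhorn iterations. The sketch below computes an entropy-regularized transport plan from 6 source neurons to 3 target neurons. The toy activations, the squared-difference cost, the choice of target neurons, and the use of the plan to average activations are all assumptions for illustration; the abstract does not specify how the authors build the cost or factorize the map.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iter=500):
    """Entropy-regularized optimal transport plan via Sinkhorn iterations."""
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)            # scale columns toward marginal b
        u = a / (K @ v)              # scale rows toward marginal a
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
acts = rng.normal(size=(64, 6))      # hypothetical activations: 64 tokens x 6 neurons
# Make neurons 3..5 near-duplicates of neurons 0..2 (the redundancy to exploit).
acts[:, 3:] = acts[:, :3] + 0.01 * rng.normal(size=(64, 3))
targets = acts[:, :3]                # assumed choice of 3 target neurons

# Cost between source neuron i and target neuron j: mean squared activation gap.
cost = ((acts[:, :, None] - targets[:, None, :]) ** 2).mean(axis=0)   # shape (6, 3)

a = np.full(6, 1 / 6)                # uniform mass on source neurons
b = np.full(3, 1 / 3)                # uniform mass on target neurons
plan = sinkhorn(cost, a, b)

# Each target neuron becomes a plan-weighted average of the source neurons.
merged = acts @ (plan / plan.sum(axis=0))
print(np.round(plan, 3))
```

Here each pair of near-duplicate neurons sends its mass to the same target, so the merged activations stay close to the originals. A smaller `eps` makes the plan sharper, and a larger `eps` spreads mass more evenly across targets.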
Neural Radiance Fields (NeRF) has emerged as a compelling framework for scene representation and 3D recovery. To improve its performance on real-world data, depth regularizations have proven to be the most effective ones. However, depth estimation models not only require expensive 3D supervision in training, but also suffer from generalization issues. As a result, the depth estimations can be erroneous in practice, especially for outdoor unbounded scenes. In this paper, we propose to employ view-consistent distributions instead of fixed depth value estimations to regularize NeRF training. Specifically, the distribution is computed by utilizing both low-level color features and high-level distilled features from foundation models at the projected 2D pixel-locations from per-ray sampled 3D points. By sampling from the view-consistency distributions, an implicit regularization is imposed on the training of NeRF. We also utilize a depth-pushing loss that works in conjunction with the sampling technique to jointly provide effective regularizations for eliminating the failure modes. Extensive experiments conducted on various scenes from public datasets demonstrate that our proposed method can generate significantly better novel view synthesis results than state-of-the-art NeRF variants as well as different depth regularization methods.
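Neural Radiance Fields (NeRF) has become a compelling framework for scene representation and 3D recovery. Depth regularization has proven to be among the most effective ways to improve its performance on real-world data. However, depth estimation models require expensive 3D supervision for training and also suffer from generalization issues, so their depth estimates can be wrong in practice, especially for unbounded outdoor scenes. In this paper we propose to regularize NeRF training with view-consistent distributions instead of fixed depth value estimates. Specifically, the distribution is computed from low-level color features and high-level distilled features from foundation models, taken at the 2D pixel locations onto which the 3D points sampled along each ray project. Sampling from these view-consistency distributions imposes an implicit regularization on NeRF training. We also use a depth-pushing loss that works together with the sampling technique to provide effective regularization and eliminate the failure modes. Extensive experiments on various scenes from public datasets show that the proposed method produces significantly better novel view synthesis results than state-of-the-art NeRF variants and other depth regularization methods.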
https://arxiv.org/abs/2507.04408
Current continuous sign language recognition (CSLR) methods struggle with handling diverse samples. Although dynamic convolutions are ideal for this task, they mainly focus on spatial modeling and fail to capture the temporal dynamics and contextual dependencies. To address this, we propose DESign, a novel framework that incorporates Dynamic Context-Aware Convolution (DCAC) and Subnet Regularization Connectionist Temporal Classification (SR-CTC). DCAC dynamically captures the inter-frame motion cues that constitute signs and uniquely adapts convolutional weights in a fine-grained manner based on contextual information, enabling the model to better generalize across diverse signing behaviors and boost recognition accuracy. Furthermore, we observe that existing methods still rely on only a limited number of frames for parameter updates during training, indicating that CTC learning overfits to a dominant path. To address this, SR-CTC regularizes training by applying supervision to subnetworks, encouraging the model to explore diverse CTC alignment paths and effectively preventing overfitting. A classifier-sharing strategy in SR-CTC further strengthens multi-scale consistency. Notably, SR-CTC introduces no inference overhead and can be seamlessly integrated into existing CSLR models to boost performance. Extensive ablations and visualizations further validate the effectiveness of the proposed methods. Results on mainstream CSLR datasets (i.e., PHOENIX14, PHOENIX14-T, CSL-Daily) demonstrate that DESign achieves state-of-the-art performance.
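Current continuous sign language recognition (CSLR) methods have difficulty handling diverse samples. Although dynamic convolutions are well suited to this task, they focus mainly on spatial modeling and cannot capture temporal dynamics and contextual dependencies. To address this, we propose DESign, a novel framework that combines Dynamic Context-Aware Convolution (DCAC) and Subnet Regularization Connectionist Temporal Classification (SR-CTC). DCAC dynamically captures the inter-frame motion cues that make up signs and adapts convolutional weights in a fine-grained manner based on contextual information, allowing the model to generalize better across diverse signing behaviors and improve recognition accuracy. We also observe that existing methods still rely on only a limited number of frames for parameter updates during training, which indicates that CTC learning overfits to a single dominant path. To address this, SR-CTC regularizes training by applying supervision to subnetworks, encouraging the model to explore diverse CTC alignment paths and effectively preventing overfitting. A classifier-sharing strategy in SR-CTC further strengthens multi-scale consistency. Notably, SR-CTC adds no overhead at inference time and can be integrated seamlessly into existing CSLR models to improve performance. Extensive ablations and visualizations further validate the effectiveness of the proposed methods. Results on mainstream CSLR datasets (i.e., PHOENIX14, PHOENIX14-T, CSL-Daily) show that DESign achieves state-of-the-art performance.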
https://arxiv.org/abs/2507.03339
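The notion of "CTC alignment paths" in the DESign abstract can be made concrete with a small sketch. The code below is not the SR-CTC method; it only illustrates, under the standard CTC collapsing rule (merge consecutive repeats, then drop blanks), that many frame-level paths map to the same label sequence. The blank symbol `-`, the toy vocabulary, and the function names are assumptions made for illustration.

```python
from itertools import product

BLANK = "-"  # hypothetical blank symbol, chosen for this sketch


def ctc_collapse(path):
    """Standard CTC mapping: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return tuple(out)


def valid_alignments(target, num_frames, vocab):
    """Brute-force every frame-level path that collapses to `target`."""
    symbols = list(vocab) + [BLANK]
    goal = tuple(target)
    return [p for p in product(symbols, repeat=num_frames)
            if ctc_collapse(p) == goal]


# Toy case: 4 frames, two glosses "A" and "B", target sequence A B.
paths = valid_alignments(["A", "B"], num_frames=4, vocab=["A", "B"])
print(len(paths))  # 15 distinct alignments for this toy case
for p in paths[:3]:
    print("".join(p))
```

A model that places almost all probability on one of these paths receives gradient from only a few frames, which is the overfitting behavior the abstract attributes to CTC learning.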
Explainable artificial intelligence (XAI) approaches have been increasingly applied in drug discovery to learn molecular representations and identify substructures driving property predictions. However, building end-to-end explainable machine learning models for structure-activity relationship (SAR) modeling for compound property prediction faces many challenges, such as limited activity data per target and the sensitivity of properties to subtle molecular changes. To address these challenges, we leveraged activity-cliff molecule pairs, i.e., compounds sharing a common scaffold but differing sharply in potency, targeting three proto-oncogene tyrosine-protein kinase Src proteins (i.e., PDB IDs 1O42, 2H8H, and 4MXO). We implemented graph neural network (GNN) methods to obtain atom-level feature information and predict compound-protein affinity (i.e., half maximal inhibitory concentration, IC50). In addition, we trained GNN models with different structure-aware loss functions to adequately leverage molecular property and structure information. We also utilized group lasso and sparse group lasso to prune and highlight molecular subgraphs and enhance the structure-specific model explainability for the predicted property difference in molecular activity-cliff pairs. We improved drug property prediction by integrating common and uncommon node information and using sparse group lasso, reducing the average root mean squared error (RMSE) by 12.70%, and achieving the lowest average RMSE=0.2551 and the highest PCC=0.9572. Furthermore, applying regularization enhances feature attribution methods that estimate the contribution of each atom in the molecular graphs by boosting global direction scores and atom-level accuracy in atom coloring, which improves model interpretability in drug discovery pipelines, particularly in investigating important molecular substructures in lead optimization.
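Explainable artificial intelligence (XAI) methods are increasingly used in drug discovery to learn molecular representations and identify the substructures that drive property predictions. However, building end-to-end explainable machine learning models for compound property prediction in structure-activity relationship (SAR) modeling faces many challenges, such as limited activity data per target and sensitivity to subtle molecular changes. To address these problems, we used activity-cliff molecule pairs, that is, compounds that share a common scaffold but differ markedly in potency, targeting three proto-oncogene tyrosine-protein kinase Src proteins (PDB IDs 1O42, 2H8H, and 4MXO). We implemented graph neural network (GNN) methods to obtain atom-level feature information and predict compound-protein affinity (i.e., half maximal inhibitory concentration, IC50). In addition, we trained the GNN models with different structure-aware loss functions to make full use of molecular property and structure information. We also applied group lasso and sparse group lasso to prune and highlight molecular subgraphs and to enhance structure-specific model explainability for the predicted property differences in activity-cliff pairs. By integrating common and uncommon node information and using sparse group lasso, we improved drug property prediction, reducing the average root mean squared error (RMSE) by 12.70% and achieving the lowest average RMSE=0.2551 and the highest PCC=0.9572. Furthermore, applying regularization enhances feature attribution methods, which estimate the contribution of each atom in the molecular graph, by raising global direction scores and atom-level accuracy, and it improves model interpretability in drug discovery pipelines, particularly when investigating important molecular substructures during lead optimization.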
https://arxiv.org/abs/2507.03318
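For readers unfamiliar with the regularizers named in the abstract above, the following is a minimal sketch of the textbook sparse group lasso penalty: a group-wise L2 term scaled by the square root of the group size, plus an element-wise L1 term. How the paper defines its groups (for example per atom or per substructure) is not stated in the abstract, so the grouping, the function name, and the default coefficients here are assumptions.

```python
import numpy as np


def sparse_group_lasso_penalty(weights, groups, lam_group=0.1, lam_l1=0.1):
    """Textbook sparse group lasso penalty.

    weights: 1-D array of parameters.
    groups:  list of index lists that partition the weights (hypothetical grouping).
    Setting lam_l1 to 0 recovers the plain group lasso penalty.
    """
    weights = np.asarray(weights, dtype=float)
    # Group term: sqrt(group size) * L2 norm of each group, summed over groups.
    group_term = sum(np.sqrt(len(idx)) * np.linalg.norm(weights[idx])
                     for idx in groups)
    # Element-wise L1 term encourages sparsity inside the surviving groups.
    l1_term = np.abs(weights).sum()
    return lam_group * group_term + lam_l1 * l1_term


# Toy example: the second group is entirely zero, so it adds nothing to the penalty.
w = np.array([3.0, 4.0, 0.0, 0.0])
groups = [[0, 1], [2, 3]]
print(sparse_group_lasso_penalty(w, groups, lam_group=1.0, lam_l1=1.0))
```

The group term can drive whole groups to zero (pruning), while the L1 term thins out the groups that remain, which matches the "prune and highlight" role described in the abstract.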
The advancement of Large Language Models (LLMs) has transformed Natural Language Processing (NLP), enabling performance across diverse tasks with little task-specific training. However, LLMs remain susceptible to social biases, particularly reflecting harmful stereotypes from training data, which can disproportionately affect marginalised communities. We measure gender bias in Maltese LMs, arguing that such bias is harmful as it reinforces societal stereotypes and fails to account for gender diversity, which is especially problematic in gendered, low-resource languages. While bias evaluation and mitigation efforts have progressed for English-centric models, research on low-resourced and morphologically rich languages remains limited. This research investigates the transferability of debiasing methods to Maltese language models, focusing on BERTu and mBERTu, BERT-based monolingual and multilingual models respectively. Bias measurement and mitigation techniques from English are adapted to Maltese, using benchmarks such as CrowS-Pairs and SEAT, alongside debiasing methods Counterfactual Data Augmentation, Dropout Regularization, Auto-Debias, and GuiDebias. We also contribute to future work in the study of gender bias in Maltese by creating evaluation datasets. Our findings highlight the challenges of applying existing bias mitigation methods to linguistically complex languages, underscoring the need for more inclusive approaches in the development of multilingual NLP.
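The development of Large Language Models (LLMs) has transformed Natural Language Processing (NLP), allowing diverse tasks to be performed with almost no task-specific training. However, LLMs remain susceptible to social biases, in particular reflecting harmful stereotypes from their training data, which can disproportionately affect marginalised communities. We measure gender bias in Maltese language models and argue that such bias is harmful because it reinforces societal stereotypes and fails to account for gender diversity, which is especially problematic for gendered, low-resource languages. Although bias evaluation and mitigation have progressed for English-centric models, research on low-resource and morphologically rich languages remains limited. This study investigates whether debiasing methods can be transferred to Maltese language models, focusing on the BERT-based monolingual model BERTu and the multilingual model mBERTu. Bias measurement and mitigation techniques developed for English are adapted to Maltese and evaluated with benchmarks such as CrowS-Pairs and SEAT, alongside the debiasing methods Counterfactual Data Augmentation, Dropout Regularization, Auto-Debias, and GuiDebias. We also contribute to future work on gender bias in Maltese by creating evaluation datasets. Our findings highlight the challenges of applying existing bias mitigation methods to linguistically complex languages and underscore the need for more inclusive approaches in the development of multilingual NLP.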
https://arxiv.org/abs/2507.03142
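Of the debiasing methods listed in the abstract above, Counterfactual Data Augmentation is the simplest to illustrate. The sketch below shows the general idea on English text only: each sentence containing a gendered term is duplicated with the term swapped. The word pairs, function names, and sentences are hypothetical, and the sketch deliberately ignores grammatical agreement, which a gendered, morphologically rich language such as Maltese would additionally require.

```python
import re

# Hypothetical English word pairs, chosen only for illustration.
PAIRS = [("he", "she"), ("man", "woman"), ("father", "mother"), ("boy", "girl")]
SWAP = {}
for a, b in PAIRS:
    SWAP[a] = b
    SWAP[b] = a


def counterfactual(sentence):
    """Swap each listed gendered word for its counterpart, keeping capitalization."""
    def swap(match):
        word = match.group(0)
        repl = SWAP.get(word.lower())
        if repl is None:
            return word  # not a listed gendered term; leave untouched
        return repl.capitalize() if word[0].isupper() else repl
    return re.sub(r"[A-Za-z]+", swap, sentence)


def augment(corpus):
    """Return the corpus plus a counterfactual copy of every sentence that changed."""
    out = []
    for sentence in corpus:
        out.append(sentence)
        flipped = counterfactual(sentence)
        if flipped != sentence:
            out.append(flipped)
    return out


print(augment(["He is a good father.", "It rained all day."]))
```

A dictionary lookup of this kind breaks down as soon as adjectives, verbs, or articles must agree with the swapped noun, which is one way to read the transfer difficulties the abstract reports.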
In this paper, we propose a novel model called Learnable VAE (L-VAE), which learns a disentangled representation together with the hyperparameters of the cost function. L-VAE can be considered an extension of β-VAE, wherein the hyperparameter β is empirically adjusted. L-VAE mitigates the limitations of β-VAE by learning the relative weights of the terms in the loss function to control the dynamic trade-off between disentanglement and reconstruction losses. In the proposed model, the weights of the loss terms and the parameters of the model architecture are learned concurrently. An additional regularization term is added to the loss function to prevent bias towards either reconstruction or disentanglement losses. Experimental analyses show that the proposed L-VAE finds an effective balance between reconstruction fidelity and disentangling the latent dimensions. Comparisons of the proposed L-VAE against β-VAE, VAE, ControlVAE, DynamicVAE, and σ-VAE on datasets such as dSprites, MPI3D-complex, Falcor3D, and Isaac3D reveal that L-VAE consistently provides the best or second-best performance as measured by a set of disentanglement metrics. Moreover, qualitative experiments on the CelebA dataset confirm the success of the L-VAE model in disentangling facial attributes.
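In this paper, we propose a new model called Learnable VAE (L-VAE), which learns a disentangled representation while also learning the hyperparameters of the cost function. L-VAE can be viewed as an extension of β-VAE, in which the hyperparameter β is adjusted empirically. L-VAE addresses the limitations of β-VAE by learning the relative weights of the terms in the loss function, thereby controlling the dynamic trade-off between the disentanglement and reconstruction losses. In the proposed model, the weights of the loss terms and the parameters of the model architecture are learned at the same time. To prevent bias towards either the reconstruction or the disentanglement loss, an additional regularization term is added to the loss function. Experimental analyses show that the proposed L-VAE effectively disentangles the latent dimensions while maintaining reconstruction fidelity. Comparisons against β-VAE, VAE, ControlVAE, DynamicVAE, and σ-VAE on the dSprites, MPI3D-complex, Falcor3D, and Isaac3D datasets show that L-VAE consistently achieves the best or second-best performance across a set of disentanglement metrics. In addition, qualitative experiments on the CelebA dataset further confirm the success of the L-VAE model in disentangling facial attributes.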
https://arxiv.org/abs/2507.02619
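The L-VAE abstract above does not give the form of its learned weights or of its extra regularization term, so the sketch below substitutes a different, well-known scheme as a stand-in: uncertainty-style weighting from multi-task learning, where each loss term is scaled by exp(-s) for a learned log-weight s, and the additive term s keeps the weights from collapsing to zero. The loss values are fixed stand-ins (in a real model they change as the network trains), and all names are hypothetical.

```python
import math


def weighted_loss(recon, kl, s_recon, s_kl):
    """Total loss with learned log-weights; the trailing `+ s` terms act as the regularizer."""
    return (math.exp(-s_recon) * recon + s_recon
            + math.exp(-s_kl) * kl + s_kl)


def grad_wrt_log_weights(recon, kl, s_recon, s_kl):
    """Analytic gradient of weighted_loss with respect to (s_recon, s_kl)."""
    return (1.0 - math.exp(-s_recon) * recon,
            1.0 - math.exp(-s_kl) * kl)


# Stand-in loss values held fixed so only the weights are being learned here.
recon, kl = 4.0, 0.5
s_recon, s_kl = 0.0, 0.0
for _ in range(500):
    g_recon, g_kl = grad_wrt_log_weights(recon, kl, s_recon, s_kl)
    s_recon -= 0.1 * g_recon
    s_kl -= 0.1 * g_kl

# At the optimum each s equals log(loss), so each effective weight is 1 / loss.
print(math.exp(-s_recon), math.exp(-s_kl))
print(weighted_loss(recon, kl, s_recon, s_kl))
```

Without the additive terms, gradient descent would simply push both weights towards zero; the regularizer is what makes a learned trade-off meaningful, which is the role the abstract assigns to its own (unspecified) regularization term.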