Human motion reconstruction from monocular videos is a fundamental challenge in computer vision, with broad applications in AR/VR, robotics, and digital content creation, but it remains difficult under the frequent occlusions of real-world scenarios. Existing regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference and heavy preprocessing. To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked Modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task and efficiently recovers human motion in a consistent global coordinate system from RGB videos. Through masked modeling, MoRo naturally handles occlusions while enabling efficient, end-to-end inference. To overcome the scarcity of paired video-motion data, we design a cross-modality learning scheme that learns multi-modal priors from a set of heterogeneous datasets: (i) a trajectory-aware motion prior trained on MoCap datasets, (ii) an image-conditioned pose prior trained on image-pose datasets, capturing diverse per-frame poses, and (iii) a video-conditioned masked transformer that fuses the motion and pose priors, fine-tuned on video-motion datasets to integrate visual cues with motion dynamics for robust inference. Extensive experiments on EgoBody and RICH demonstrate that MoRo substantially outperforms state-of-the-art methods in accuracy and motion realism under occlusions, while performing on par in non-occluded scenarios. MoRo achieves real-time inference at 70 FPS on a single H200 GPU.
https://arxiv.org/abs/2601.16079
Deep learning has substantially advanced medical image segmentation, yet achieving robust generalization across diverse imaging modalities and anatomical structures remains a major challenge. A key contributor to this limitation lies in how existing architectures, ranging from CNNs to Transformers and their hybrids, primarily encode spatial information while overlooking frequency-domain representations that capture rich structural and textural cues. Although a few recent studies have begun exploring spectral information at the feature level, supervision-level integration of frequency cues, which is crucial for fine-grained object localization, remains largely untapped. To this end, we propose Phi-SegNet, a CNN-based architecture that incorporates phase-aware information at both the architectural and optimization levels. The network integrates Bi-Feature Mask Former (BFMF) modules that blend neighboring encoder features to reduce semantic gaps, and Reverse Fourier Attention (RFA) blocks that refine decoder outputs using phase-regularized features. A dedicated phase-aware loss aligns these features with structural priors, forming a closed feedback loop that emphasizes boundary precision. Evaluated on five public datasets spanning X-ray, ultrasound, histopathology, MRI, and colonoscopy, Phi-SegNet consistently achieved state-of-the-art performance, with an average relative improvement of 1.54±1.26% in IoU and 0.98±0.71% in F1-score over the next best-performing model. In cross-dataset generalization scenarios involving unseen datasets from known domains, Phi-SegNet also exhibits robust and superior performance, highlighting its adaptability and modality-agnostic design. These findings demonstrate the potential of leveraging spectral priors in both feature representation and supervision, paving the way for generalized segmentation frameworks that excel in fine-grained object localization.
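The idea of supervising segmentation with frequency-domain phase can be illustrated with a minimal sketch. The functional form below is an assumption for illustration (the paper defines its own phase-aware loss): it penalizes the angular difference between the 2D Fourier phases of a predicted and a target mask, which is where structure and boundary placement concentrate.

```python
import numpy as np

def phase_loss(pred, target):
    """Illustrative phase-aware loss (not the paper's exact formulation):
    penalize the angular difference between the 2D Fourier phases of a
    predicted mask and its target. Phase encodes structure and boundary
    placement, so this term supervises localization rather than intensity."""
    dphi = np.angle(np.fft.fft2(pred)) - np.angle(np.fft.fft2(target))
    return float(np.mean(1.0 - np.cos(dphi)))  # 0 when phases agree exactly

rng = np.random.default_rng(0)
mask = rng.random((16, 16))
aligned = phase_loss(mask, mask)                      # identical masks -> zero loss
shifted = phase_loss(np.roll(mask, 4, axis=0), mask)  # translation -> phase ramp, positive loss
```

Note that a pure translation leaves Fourier magnitudes unchanged but shifts every phase, so a magnitude-only spectral loss would miss exactly the localization error this term catches.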
https://arxiv.org/abs/2601.16064
Diffusion models achieve state-of-the-art performance but often fail to generate outputs that align with human preferences and intentions, resulting in images with poor aesthetic quality and semantic inconsistencies. Existing alignment methods present a difficult trade-off: fine-tuning approaches suffer from loss of diversity with reward over-optimization, while test-time scaling methods introduce significant computational overhead and tend to under-optimize. To address these limitations, we propose HyperAlign, a novel framework that trains a hypernetwork for efficient and effective test-time alignment. Instead of modifying latent states, HyperAlign dynamically generates low-rank adaptation weights to modulate the diffusion model's generation operators. This allows the denoising trajectory to be adaptively adjusted based on input latents, timesteps and prompts for reward-conditioned alignment. We introduce multiple variants of HyperAlign that differ in how frequently the hypernetwork is applied, balancing between performance and efficiency. Furthermore, we optimize the hypernetwork using a reward score objective regularized with preference data to reduce reward hacking. We evaluate HyperAlign on multiple extended generative paradigms, including Stable Diffusion and FLUX. It significantly outperforms existing fine-tuning and test-time scaling baselines in enhancing semantic consistency and visual appeal.
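The hypernetwork-generated low-rank modulation can be sketched as follows. The single linear hypernetwork head, the shapes, and the conditioning vector are illustrative assumptions standing in for HyperAlign's actual design; the point is only that the adaptation weights are produced per-condition rather than fine-tuned into the base operator.

```python
import numpy as np

rng = np.random.default_rng(0)

class HyperLoRA:
    """Toy hypernetwork mapping a conditioning vector (standing in for
    latent/timestep/prompt features) to low-rank weights (A, B) that
    modulate a base linear operator W. The single linear head and the
    shapes are illustrative assumptions, not HyperAlign's actual design."""

    def __init__(self, d_cond, d_out, d_in, rank=4):
        self.d_out, self.d_in, self.rank = d_out, d_in, rank
        n_params = rank * d_in + d_out * rank   # all LoRA parameters at once
        self.H = rng.standard_normal((n_params, d_cond)) * 0.01

    def __call__(self, cond):
        p = self.H @ cond
        A = p[: self.rank * self.d_in].reshape(self.rank, self.d_in)
        B = p[self.rank * self.d_in :].reshape(self.d_out, self.rank)
        return A, B

def modulated_apply(W, hyper, cond, x, alpha=1.0):
    """y = (W + alpha * B @ A) @ x: the generation operator is adjusted
    per-condition, so the denoising trajectory shifts without touching W."""
    A, B = hyper(cond)
    return (W + alpha * (B @ A)) @ x

d_cond, d_out, d_in = 8, 6, 6
W = rng.standard_normal((d_out, d_in))
hyper = HyperLoRA(d_cond, d_out, d_in)
x = rng.standard_normal(d_in)
y_base = modulated_apply(W, hyper, np.zeros(d_cond), x)  # zero condition -> plain W @ x
y_cond = modulated_apply(W, hyper, np.ones(d_cond), x)   # nonzero condition -> shifted output
```

Because the update is rank-limited, the per-condition adjustment stays a small perturbation of the pretrained operator rather than an arbitrary rewrite of it.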
https://arxiv.org/abs/2601.15968
Novel view synthesis (NVS) of static and dynamic urban scenes is essential for autonomous driving simulation, yet existing methods often struggle to balance reconstruction time with quality. While state-of-the-art neural radiance fields and 3D Gaussian Splatting approaches achieve photorealism, they often rely on time-consuming per-scene optimization. Conversely, emerging feed-forward methods frequently adopt per-pixel Gaussian representations, which lead to 3D inconsistencies when aggregating multi-view predictions in complex, dynamic environments. We propose EvolSplat4D, a feed-forward framework that moves beyond existing per-pixel paradigms by unifying volume-based and pixel-based Gaussian prediction across three specialized branches. For close-range static regions, we predict consistent geometry of 3D Gaussians over multiple frames directly from a 3D feature volume, complemented by a semantically-enhanced image-based rendering module for predicting their appearance. For dynamic actors, we utilize object-centric canonical spaces and a motion-adjusted rendering module to aggregate temporal features, ensuring stable 4D reconstruction despite noisy motion priors. Far-field scenery is handled by an efficient per-pixel Gaussian branch to ensure full-scene coverage. Experimental results on the KITTI-360, KITTI, Waymo, and PandaSet datasets show that EvolSplat4D reconstructs both static and dynamic environments with superior accuracy and consistency, outperforming both per-scene optimization and state-of-the-art feed-forward baselines.
https://arxiv.org/abs/2601.15951
Text-Based Person Search (TBPS) holds unique value in real-world surveillance, bridging visual perception and language understanding, yet current paradigms utilizing pre-trained models often fail to transfer effectively to complex open-world scenarios. The reliance on "passive observation" leads to multifaceted spurious correlations and spatial semantic misalignment, causing a lack of robustness against distribution shifts. To fundamentally resolve these defects, this paper proposes ICON (Invariant Counterfactual Optimization with Neuro-symbolic priors), a framework integrating causal and topological priors. First, we introduce Rule-Guided Spatial Intervention to strictly penalize sensitivity to bounding box noise, forcibly severing location shortcuts to achieve geometric invariance. Second, Counterfactual Context Disentanglement is implemented via semantic-driven background transplantation, compelling the model to ignore background interference for environmental independence. Then, we employ Saliency-Driven Semantic Regularization with adaptive masking to resolve local saliency bias and guarantee holistic completeness. Finally, Neuro-Symbolic Topological Alignment utilizes neuro-symbolic priors to constrain feature matching, ensuring activated regions are topologically consistent with human structural logic. Experimental results demonstrate that ICON not only maintains leading performance on standard benchmarks but also exhibits exceptional robustness against occlusion, background interference, and localization noise. This approach effectively advances the field by shifting from fitting statistical co-occurrences to learning causal invariance.
https://arxiv.org/abs/2601.15931
Deep neural network models degrade significantly under long-tailed data distributions, where the training data are dominated by a small set of head classes while the tail classes receive far fewer training examples. To address this class imbalance, the related literature has focused mainly on adjustments in the decision space, i.e., corrections at the logit level that compensate for class-prior bias, while paying far less attention to the optimization process shaped by differences in confidence across samples. In the current study, we present a class- and confidence-aware re-weighting scheme for long-tailed learning. The scheme operates purely at the loss level and is complementary to existing logit-adjustment methods. In the practical implementation of the proposed scheme, we use an $\Omega(p_t, f_c)$ function that modulates each sample's contribution to the training task based on the confidence value $p_t$ of the prediction and the relative frequency $f_c$ of the corresponding class. Our theoretical discussion is corroborated by significant experimental results on the CIFAR-100-LT, ImageNet-LT, and iNaturalist2018 datasets under various imbalance factors.
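The role of the $\Omega(p_t, f_c)$ modulation can be sketched concretely. The functional form below is an assumption for illustration (the paper defines the exact function): a focal-style confidence term combined with an inverse-frequency class term, so that confident head-class samples are down-weighted and uncertain tail-class samples dominate the loss.

```python
import numpy as np

def omega(p_t, f_c, gamma=2.0, alpha=0.5):
    """Illustrative class- and confidence-aware modulation (the paper
    defines the exact Omega(p_t, f_c); this form is an assumption).

    p_t : predicted confidence for the true class, in (0, 1)
    f_c : relative frequency of that class, in (0, 1)
    """
    conf_term = (1.0 - p_t) ** gamma   # focal-style: down-weight easy, confident samples
    class_term = f_c ** (-alpha)       # inverse-frequency: up-weight tail classes
    return conf_term * class_term

def reweighted_ce(p_t, f_c):
    """Per-sample cross-entropy on the true class, modulated by Omega."""
    return omega(p_t, f_c) * -np.log(p_t)

# A confident prediction on a head class contributes far less to the loss
# than an uncertain prediction on a tail class:
head_loss = reweighted_ce(p_t=0.95, f_c=0.30)
tail_loss = reweighted_ce(p_t=0.40, f_c=0.001)
```

Since the weighting acts only on the loss, it composes directly with any logit-adjustment method applied inside the model, matching the complementary role described above.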
https://arxiv.org/abs/2601.15924
Purpose: Accurate 3D hand pose estimation supports surgical applications such as skill assessment, robot-assisted interventions, and geometry-aware workflow analysis. However, surgical environments pose severe challenges, including intense and localized lighting, frequent occlusions by instruments or staff, and uniform hand appearance due to gloves, combined with a scarcity of annotated datasets for reliable model training. Method: We propose a robust multi-view pipeline for 3D hand pose estimation in surgical contexts that requires no domain-specific fine-tuning and relies solely on off-the-shelf pretrained models. The pipeline integrates reliable person detection, whole-body pose estimation, and state-of-the-art 2D hand keypoint prediction on tracked hand crops, followed by a constrained 3D optimization. In addition, we introduce a novel surgical benchmark dataset comprising over 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth, recorded in a replica operating room under varying levels of scene complexity. Results: Quantitative experiments demonstrate that our method consistently outperforms baselines, achieving a 31% reduction in 2D mean joint error and a 76% reduction in 3D mean per-joint position error. Conclusion: Our work establishes a strong baseline for 3D hand pose estimation in surgery, providing both a training-free pipeline and a comprehensive annotated dataset to facilitate future research in surgical computer vision.
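A minimal sketch of the multi-view triangulation underlying the 3D ground truth (and the lifting of 2D keypoints to 3D) is the classic direct linear transform (DLT); the paper's full pipeline adds person detection, tracking, and a constrained optimization on top, which this sketch omits.

```python
import numpy as np

def triangulate_dlt(P_list, uv_list):
    """Triangulate one 3D point from >= 2 views with the DLT method.

    P_list  : list of 3x4 camera projection matrices
    uv_list : matching 2D keypoint observations (u, v), one per view
    Builds the homogeneous system A X = 0 from u*P[2]-P[0] and
    v*P[2]-P[1] per view, and solves it via SVD (smallest singular vector).
    """
    rows = []
    for P, (u, v) in zip(P_list, uv_list):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Two synthetic cameras observing the point (0.2, -0.1, 3.0):
X_true = np.array([0.2, -0.1, 3.0, 1.0])
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                   # camera at the origin
P2 = np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])   # shifted baseline

def project(P, Xh):
    x = P @ Xh
    return x[:2] / x[2]

X_hat = triangulate_dlt([P1, P2], [project(P1, X_true), project(P2, X_true)])
```

With noise-free observations the recovered point matches the ground truth up to numerical precision; in practice the paper's constrained optimization refines such estimates under occlusion and keypoint noise.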
https://arxiv.org/abs/2601.15918
We propose a novel first-order method for non-convex optimization of the form $\max_{\bm{w}\in\mathbb{R}^d}\mathbb{E}_{\bm{x}\sim\mathcal{D}}[f_{\bm{w}}(\bm{x})]$, termed Progressive Power Homotopy (Prog-PowerHP). The method applies stochastic gradient ascent to a surrogate objective obtained by first performing a power transformation and then Gaussian smoothing, $F_{N,\sigma}(\bm{\mu}):=\mathbb{E}_{\bm{w}\sim\mathcal{N}(\bm{\mu},\sigma^2I_d),\bm{x}\sim\mathcal{D}}[e^{N f_{\bm{w}}(\bm{x})}]$, while progressively increasing the power parameter $N$ and decreasing the smoothing scale $\sigma$ along the optimization trajectory. We prove that, under mild regularity conditions, Prog-PowerHP converges to a small neighborhood of the global optimum with an iteration complexity scaling nearly as $O(d^2\varepsilon^{-2})$. Empirically, Prog-PowerHP demonstrates clear advantages in phase retrieval when the samples-to-dimension ratio approaches the information-theoretic limit, and in training two-layer neural networks in under-parameterized regimes. These results suggest that Prog-PowerHP is particularly effective for navigating cluttered non-convex landscapes where standard first-order methods struggle.
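The surrogate ascent can be sketched with a score-function (REINFORCE-style) Monte-Carlo gradient, using $\nabla_{\bm{\mu}} \log \mathcal{N}(\bm{w};\bm{\mu},\sigma^2 I_d) = (\bm{w}-\bm{\mu})/\sigma^2$. The toy objective, the linear schedules for $N$ and $\sigma$, and the gradient normalization below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def surrogate_grad(f, mu, sigma, N, xs, n_samples=64):
    """Monte-Carlo score-function gradient of the smoothed power surrogate
    F_{N,sigma}(mu) = E_{w ~ N(mu, sigma^2 I), x ~ D}[exp(N * f(w, x))]."""
    d = mu.shape[0]
    g = np.zeros(d)
    for _ in range(n_samples):
        w = mu + sigma * rng.standard_normal(d)
        x = xs[rng.integers(len(xs))]
        g += np.exp(N * f(w, x)) * (w - mu) / sigma**2
    return g / n_samples

def prog_powerhp(f, mu0, xs, steps=400, lr=0.02):
    """Ascend the surrogate while progressively raising N and shrinking sigma
    (the linear schedules here are assumed; the paper specifies its own)."""
    mu = mu0.astype(float).copy()
    for t in range(steps):
        frac = t / steps
        N = 1.0 + 9.0 * frac        # power parameter: 1 -> 10
        sigma = 1.0 - 0.9 * frac    # smoothing scale: 1.0 -> 0.1
        g = surrogate_grad(f, mu, sigma, N, xs)
        mu += lr * g / (np.linalg.norm(g) + 1e-12)  # normalized ascent step
    return mu

# Toy landscape: global maximum at w = 2, plus a narrow spurious bump near w = -1.
def f(w, x):
    return -0.05 * (w[0] - 2.0) ** 2 + 0.3 * np.exp(-8.0 * (w[0] + 1.0) ** 2)

xs = np.array([0.0])
w_star = prog_powerhp(f, np.array([0.0]), xs)  # should approach the global max near w = 2
```

Early iterations (large $\sigma$, small $N$) see a heavily smoothed landscape in which the narrow spurious bump is washed out; late iterations sharpen both the objective and the search around the surviving mode.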
https://arxiv.org/abs/2601.15915
Analog mixed-signal circuit sizing involves complex trade-offs within high-dimensional design spaces. Existing automatic analog circuit sizing approaches rely solely on netlists, ignoring the circuit schematic, which hinders the cognitive link between the schematic and its performance. Furthermore, the black-box nature of machine learning methods and hallucination risks in large language models fail to provide the ground-truth explainability required for industrial sign-off. To address these challenges, we propose a Vision Language Model-optimized collaborative agent design workflow (VLM-CAD), which analyzes circuits, optimizes DC operating points, performs inference-based sizing, and executes external sizing optimization. We integrate Image2Net to annotate circuit schematics and generate a structured JSON description for precise interpretation by Vision Language Models. Furthermore, we propose an Explainable Trust Region Bayesian Optimization method (ExTuRBO) that employs collaborative warm-starting from agent-generated seeds and offers dual-granularity sensitivity analysis for external sizing optimization, supporting a comprehensive final design report. Experimental results on amplifier sizing tasks using 180nm, 90nm, and 45nm Predictive Technology Models demonstrate that VLM-CAD effectively balances power and performance while maintaining physics-based explainability. In optimizing an amplifier with a complementary input and a class-AB output stage, VLM-CAD meets all specification requirements while maintaining low power consumption, with a total runtime under 66 minutes across all experiments on the two amplifiers.
https://arxiv.org/abs/2601.07315
The development of native computer-use agents (CUA) represents a significant leap in multimodal AI. However, their potential is currently bottlenecked by the constraints of static data scaling. Existing paradigms relying primarily on passive imitation of static datasets struggle to capture the intricate causal dynamics inherent in long-horizon computer tasks. In this work, we introduce EvoCUA, a native computer-use agentic model. Unlike static imitation, EvoCUA integrates data generation and policy optimization into a self-sustaining evolutionary cycle. To mitigate data scarcity, we develop a verifiable synthesis engine that autonomously generates diverse tasks coupled with executable validators. To enable large-scale experience acquisition, we design a scalable infrastructure orchestrating tens of thousands of asynchronous sandbox rollouts. Building on these massive trajectories, we propose an iterative evolving learning strategy to efficiently internalize this experience. This mechanism dynamically regulates policy updates by identifying capability boundaries, reinforcing successful routines while transforming failure trajectories into rich supervision through error analysis and self-correction. Empirical evaluations on the OSWorld benchmark demonstrate that EvoCUA achieves a success rate of 56.7%, establishing a new open-source state of the art. Notably, EvoCUA significantly outperforms the previous best open-source model, OpenCUA-72B (45.0%), and surpasses leading closed-weight models such as UI-TARS-2 (53.1%). Crucially, our results underscore the generalizability of this approach: the evolving paradigm driven by learning from experience yields consistent performance gains across foundation models of varying scales, establishing a robust and scalable path for advancing native agent capabilities.
https://arxiv.org/abs/2601.15876
A convolutional neural network (CNN) is a deep learning algorithm specifically designed for computer vision applications. CNNs have proven successful in handling the increasing amount of data in many computer vision problems where classical machine learning algorithms were insufficient. Flowers have many uses in our daily lives, from decoration to making medicines to detoxifying the environment. Identifying flower types requires expert knowledge, yet accessing experts at any time and in any location is not always feasible. In this study, a mobile application based on CNNs was developed to recognize different types of flowers, giving non-specialists quick and easy access to information about flower types. The study employed three distinct CNN models, namely MobileNet, DenseNet-121, and Xception, to determine the most suitable model for the mobile application. The classification performance of each model was evaluated by training it with seven different optimization algorithms. The DenseNet-121 architecture trained with the stochastic gradient descent (SGD) optimization algorithm was the most successful, achieving 95.84% accuracy and 96.00% precision, recall, and F1-score. This result shows that CNNs can be used for flower classification in mobile applications.
https://arxiv.org/abs/2601.15810
Reinforcement learning has substantially improved the performance of LLM agents on tasks with verifiable outcomes, but it still struggles on open-ended agent tasks with vast solution spaces (e.g., complex travel planning). Due to the absence of objective ground-truth for these tasks, current RL algorithms largely rely on reward models that assign scalar scores to individual responses. We contend that such pointwise scoring suffers from an inherent discrimination collapse: the reward model struggles to distinguish subtle advantages among different trajectories, resulting in scores within a group being compressed into a narrow range. Consequently, the effective reward signal becomes dominated by noise from the reward model, leading to optimization stagnation. To address this, we propose ArenaRL, a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking. ArenaRL introduces a process-aware pairwise evaluation mechanism, employing multi-level rubrics to assign fine-grained relative scores to trajectories. Additionally, we construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals. Empirical results confirm that the proposed seeded single-elimination scheme achieves nearly equivalent advantage estimation accuracy to full pairwise comparisons of O(N^2) complexity, while operating with only O(N) complexity, striking an optimal balance between efficiency and precision. Furthermore, to address the lack of full-cycle benchmarks for open-ended agents, we build Open-Travel and Open-DeepResearch, two high-quality benchmarks featuring a comprehensive pipeline covering SFT, RL training, and multi-dimensional evaluation. Extensive experiments show that ArenaRL substantially outperforms standard RL baselines, enabling LLM agents to generate more robust solutions for complex real-world tasks.
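The O(N) seeded single-elimination idea can be sketched as follows; the bracket logic and tie handling are illustrative assumptions, and ArenaRL's process-aware pairwise evaluator is stubbed here as a plain comparison callable.

```python
def single_elim_rank(items, beats, seeds=None):
    """Rank items with a seeded single-elimination bracket.

    `beats(a, b)` -> True if `a` wins the pairwise judgment (standing in
    for a process-aware pairwise evaluator). Items eliminated in later
    rounds rank higher; the whole bracket needs only N - 1 comparisons,
    i.e. O(N) instead of the O(N^2) of full pairwise comparison.
    """
    n = len(items)
    assert n > 0 and (n & (n - 1)) == 0, "sketch assumes power-of-two N"
    order = seeds if seeds is not None else list(range(n))
    alive = [items[i] for i in order]
    out_round = {}
    rnd = 0
    while len(alive) > 1:
        survivors = []
        for a, b in zip(alive[::2], alive[1::2]):
            winner, loser = (a, b) if beats(a, b) else (b, a)
            survivors.append(winner)
            out_round[loser] = rnd
        alive = survivors
        rnd += 1
    out_round[alive[0]] = rnd  # the champion survives every round
    # Later elimination round => better rank (ties within a round stay unresolved).
    return sorted(items, key=lambda item: -out_round[item])

# With a transitive judge (higher hidden score wins), the champion is the best item:
scores = {"t1": 0.2, "t2": 0.9, "t3": 0.5, "t4": 0.7}
ranking = single_elim_rank(list(scores), beats=lambda a, b: scores[a] > scores[b])
```

Seeding matters because strong trajectories meeting in early rounds would waste comparisons; placing likely-strong candidates in different halves of the bracket makes the recovered ranking closer to the full pairwise one.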
https://arxiv.org/abs/2601.06487
2D Gaussian Splatting (2DGS) is an emerging explicit scene representation with significant potential for image compression due to its high fidelity and high compression ratios. However, existing low-light enhancement algorithms operate predominantly in the pixel domain, so processing 2DGS-compressed images requires a cumbersome decompression-enhancement-recompression pipeline that compromises efficiency and introduces secondary degradation. To address these limitations, we propose LL-GaussianImage, the first zero-shot unsupervised framework for low-light enhancement directly within the 2DGS compressed representation domain. The framework offers three primary advantages. First, a semantic-guided Mixture-of-Experts enhancement framework applies dynamic adaptive transformations to the sparse attribute space of 2DGS, using rendered images as guidance, enabling compression-as-enhancement without full decompression to a pixel grid. Second, a multi-objective collaborative loss function system strictly constrains smoothness and fidelity during enhancement, suppressing artifacts while improving visual quality. Third, a two-stage optimization process achieves reconstruction-as-enhancement: single-scale reconstruction ensures the accuracy of the base representation while strengthening network robustness. High-quality enhancement of low-light images is thus achieved while high compression ratios are maintained. Experimental results validate the feasibility and superiority of this paradigm of direct processing within the compressed representation domain.
https://arxiv.org/abs/2601.15772
Deploying reinforcement learning in the real world remains challenging due to sample inefficiency, sparse rewards, and noisy visual observations. Prior work leverages demonstrations and human feedback to improve learning efficiency and robustness. However, offline-to-online methods need large datasets and can be unstable, while VLA-assisted RL relies on large-scale pretraining and fine-tuning. As a result, a low-cost real-world RL method with minimal data requirements has yet to emerge. We introduce SigEnt-SAC, an off-policy actor-critic method that learns from scratch using a single expert trajectory. Our key design is a sigmoid-bounded entropy term that prevents negative-entropy-driven optimization toward out-of-distribution actions and reduces Q-function oscillations. We benchmark SigEnt-SAC on D4RL tasks against representative baselines. Experiments show that SigEnt-SAC substantially alleviates Q-function oscillations and reaches a 100% success rate faster than prior methods. Finally, we validate SigEnt-SAC on four real-world robotic tasks across multiple embodiments, where agents learn from raw images and sparse rewards; results demonstrate that SigEnt-SAC can learn successful policies with only a small number of real-world interactions, suggesting a low-cost and practical pathway for real-world RL deployment.
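The effect of bounding the entropy term can be sketched in a few lines; the exact functional form below is an assumption for illustration, not necessarily the paper's definition.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigent_bonus(log_prob, scale=1.0):
    """Sigmoid-bounded entropy bonus (illustrative form, not the paper's exact one).

    Standard SAC uses -log_prob as an unbounded entropy bonus: when log_prob
    is very negative (rare, out-of-distribution actions), the bonus explodes
    and drives the actor toward those actions, feeding Q-function oscillations.
    Squashing through a sigmoid keeps the bonus in (0, scale), damping that drive."""
    return scale * sigmoid(-log_prob)

# Unbounded bonus grows without limit; the sigmoid-bounded one saturates:
lp = np.array([-0.1, -2.0, -20.0])   # log-probabilities of three actions
unbounded = -lp                      # 0.1, 2.0, 20.0 -- the rarest action dominates
bounded = sigent_bonus(lp)           # all values stay strictly below 1.0
```

The bounded bonus still increases monotonically with entropy (rarer actions get a larger bonus), so the exploration incentive is preserved; only its unbounded growth is removed.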
https://arxiv.org/abs/2601.15761
Tabular data is a fundamental form of data structure. The evolution of table analysis tools reflects humanity's continuous progress in data acquisition, management, and processing. Dynamic changes in table columns arise from technological advancements, changing needs, data integration, and similar factors. However, the standard process of training AI models on tables with fixed columns and then performing inference is not suitable for handling dynamically changing tables, so new methods are needed to handle such tables efficiently in an unsupervised manner. In this paper, we introduce a new task, Tabular Incremental Inference (TabII), which aims to enable trained models to incorporate new columns at inference time, enhancing the practicality of AI models in scenarios where tables change dynamically. Furthermore, we demonstrate that this new task can be framed as an optimization problem based on information bottleneck theory: the key to an ideal tabular incremental inference approach lies in minimizing the mutual information between the tabular data and the representation while maximizing that between the representation and the task labels. Under this guidance, we design a TabII method with Large Language Model placeholders and a pretrained TabAdapter to provide external knowledge, and Incremental Sample Condensation blocks to condense the task-relevant information carried by the incremental column attributes. Experimental results across eight public datasets show that TabII effectively utilizes incremental attributes, achieving state-of-the-art performance.
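The information-bottleneck framing described above can be written compactly; the symbols below are notational assumptions for illustration ($X$ for the tabular input including incremental columns, $Z$ for the learned representation, $Y$ for the task labels, and $\beta > 0$ a trade-off weight):

```latex
% Ideal tabular incremental inference representation Z:
% keep Z maximally compressive of X while maximally predictive of Y.
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y), \qquad \beta > 0
```

The compression term discourages the representation from memorizing column-specific detail, while the prediction term forces it to retain whatever the incremental columns contribute to the task.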
https://arxiv.org/abs/2601.15751
Electrophysiological source imaging (ESI) is an essential technique for diagnosing brain disorders. While model-based optimization and deep learning methods have achieved promising results in this field, the accurate selection and refinement of features remain a central challenge for precise ESI. This paper proposes FAIR-ESI, a novel framework that adaptively refines feature importance across different views, including FFT-based spectral feature refinement, weighted temporal feature refinement, and self-attention-based patch-wise feature refinement. Extensive experiments on two simulation datasets with diverse configurations and two real-world clinical datasets validate the framework's efficacy, highlighting its potential to advance brain disorder diagnosis and offer new insights into brain function.
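As a toy illustration of the spectral view (our assumption of how FFT-based refinement might look, not FAIR-ESI's actual module), one can re-weight FFT magnitude bins with an importance vector and transform back to the time domain:

```python
import numpy as np

def refine_spectral(x, weights):
    """Toy sketch: re-weight FFT bins by (learned) importance weights,
    then map back to a refined time-domain signal."""
    spec = np.fft.rfft(x, axis=-1)
    refined = spec * weights            # per-bin importance weighting
    return np.fft.irfft(refined, n=x.shape[-1], axis=-1)

# Synthetic signal: a 5 Hz and a 30 Hz component over a 1 s window
t = np.linspace(0, 1, 128, endpoint=False)
x = np.sin(2 * np.pi * 5 * t) + np.sin(2 * np.pi * 30 * t)

w = np.zeros(65)                        # rfft of 128 samples -> 65 bins
w[5] = 1.0                              # keep only the 5 Hz bin
y = refine_spectral(x, w)               # recovers the 5 Hz component
```

In the actual framework the weights would presumably be learned end-to-end rather than hand-set as here.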
https://arxiv.org/abs/2601.15731
The performance of modern AI systems is fundamentally constrained by the quality of their underlying kernels, which translate high-level algorithmic semantics into low-level hardware operations. Achieving near-optimal kernels requires expert-level understanding of hardware architectures and programming models, making kernel engineering a critical but notoriously time-consuming and non-scalable process. Recent advances in large language models (LLMs) and LLM-based agents have opened new possibilities for automating kernel generation and optimization. LLMs are well suited to compressing expert-level kernel knowledge that is difficult to formalize, while agentic systems further enable scalable optimization by casting kernel development as an iterative, feedback-driven loop. Despite rapid progress, the field remains fragmented and lacks a systematic perspective on LLM-driven kernel generation. This survey addresses this gap by providing a structured overview of existing approaches, spanning LLM-based methods and agentic optimization workflows, and by systematically compiling the datasets and benchmarks that underpin learning and evaluation in this domain. We further outline key open challenges and future research directions, aiming to establish a comprehensive reference for the next generation of automated kernel optimization. To track progress in this field, we maintain an open-source GitHub repository at this https URL.
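The iterative, feedback-driven loop that such agentic systems implement can be sketched minimally. Everything below is illustrative: `propose_kernel` stands in for an LLM call, `benchmark` for compiling and profiling on real hardware, and the single "tile size" knob for a real kernel's configuration space:

```python
import random

def propose_kernel(history):
    """Stand-in for an LLM proposal step; here it just samples a
    candidate tile size (a real agent would condition on history)."""
    return {"tile": random.choice([8, 16, 32, 64])}

def benchmark(kernel):
    """Stand-in for compile-and-profile; in this toy cost model a
    tile size of 32 is optimal, and lower cost is better."""
    return abs(kernel["tile"] - 32) + 1.0

def optimize(iterations=20, seed=0):
    """Iterative loop: propose, measure, record feedback, keep the best."""
    random.seed(seed)
    history, best = [], None
    for _ in range(iterations):
        cand = propose_kernel(history)
        cost = benchmark(cand)
        history.append((cand, cost))    # feedback for the next proposal
        if best is None or cost < best[1]:
            best = (cand, cost)
    return best

best_kernel, best_cost = optimize()
```

Real systems close this loop with compiler diagnostics and hardware timings rather than a synthetic cost function, but the propose/measure/feedback structure is the same.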
https://arxiv.org/abs/2601.15727
Accurate alignment of multi-degree-of-freedom rehabilitation robots is essential for safe and effective patient training. This paper proposes a two-stage calibration framework for a self-designed three-degree-of-freedom (3-DOF) ankle rehabilitation robot. First, a Kronecker-product-based open-loop calibration method is developed to cast the input-output alignment into a linear parameter identification problem, which in turn defines the associated experimental design objective through the resulting information matrix. Building on this formulation, calibration posture selection is posed as a combinatorial design-of-experiments problem guided by a D-optimality criterion, i.e., selecting a small subset of postures that maximizes the determinant of the information matrix. To enable practical selection under constraints, a Proximal Policy Optimization (PPO) agent is trained in simulation to choose 4 informative postures from a candidate set of 50. Across simulation and real-robot evaluations, the learned policy consistently yields substantially more informative posture combinations than random selection: the mean determinant of the information matrix achieved by PPO is more than two orders of magnitude higher, with reduced variance. In addition, real-world results indicate that a parameter vector identified from only four D-optimality-guided postures provides stronger cross-episode prediction consistency than estimates obtained from a larger but unstructured set of 50 postures. The proposed framework therefore improves calibration efficiency while maintaining robust parameter estimation, offering practical guidance for high-precision alignment of multi-DOF rehabilitation robots.
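To make the D-optimality criterion concrete: the paper trains a PPO agent to select postures, but at toy scale the objective itself can be illustrated by brute force. The sketch below uses illustrative sizes (12 candidates, choose 4) and random stand-in per-posture Jacobians, not the robot's actual model, and exhaustively picks the subset whose summed information matrix has the largest determinant:

```python
from itertools import combinations
import numpy as np

def d_optimal_subset(jacobians, k):
    """Exhaustively choose the k postures whose information matrix
    sum_i J_i^T J_i has the largest determinant (D-optimality)."""
    best_det, best_idx = -np.inf, None
    for idx in combinations(range(len(jacobians)), k):
        info = sum(jacobians[i].T @ jacobians[i] for i in idx)
        det = np.linalg.det(info)
        if det > best_det:
            best_det, best_idx = det, idx
    return best_idx, best_det

rng = np.random.default_rng(0)
# Toy stand-in: each candidate posture yields a 3x3 Jacobian row block
candidates = [rng.normal(size=(3, 3)) for _ in range(12)]
sel, det = d_optimal_subset(candidates, 4)
```

Enumeration is feasible here (495 subsets) but grows combinatorially; at the paper's scale of choosing 4 from 50 postures, a learned selection policy such as PPO sidesteps the exhaustive search.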
https://arxiv.org/abs/2601.15707
Designing inclusive cycling infrastructure requires balancing competing needs of diverse user groups, yet designers often struggle to anticipate how different cyclists experience the same street. We investigate how persona-based multi-agent evaluation can support inclusive design by making experiential conflicts explicit. We present StreetDesignAI, an interactive system that enables designers to (1) ground evaluation in street context through imagery and map data, (2) receive parallel feedback from cyclist personas spanning confident to cautious users, and (3) iteratively modify designs while surfacing conflicts across perspectives. A within-subjects study with 26 transportation professionals demonstrates that structured multi-perspective feedback significantly improves designers' understanding of diverse user perspectives, ability to identify persona needs, and confidence in translating them into design decisions, with higher satisfaction and stronger intention for professional adoption. Qualitative findings reveal how conflict surfacing transforms design exploration from single-perspective optimization toward deliberate trade-off reasoning. We discuss implications for AI tools that scaffold inclusive design through disagreement as an interaction primitive.
https://arxiv.org/abs/2601.15671
Emotional information in speech plays a unique role in multimodal perception. However, current Speech Large Language Models (SpeechLLMs), like conventional speech emotion recognition (SER) systems, still treat emotion understanding as a simple classification problem. This offers limited interpretability of predictions and leaves the LLMs' expressive and reasoning capabilities underutilized. In this work, we take the first step toward reformulating SER as a deep reasoning problem through reinforcement learning (RL). We propose EmotionThinker, which is designed to generate accurate emotion predictions with interpretable explanations grounded in fine-grained acoustic cues. To achieve this, we first construct EmotionCoT-35K, an emotional reasoning dataset with Chain-of-Thought annotations and detailed captions. Second, we observe that current SpeechLLMs exhibit weak prosody perception, even though prosodic cues are fundamental signals for interpreting emotions. To address this, we develop the prosody-enhanced foundation model EmotionThinker-Base and demonstrate that prosody enhancement improves emotion understanding. Third, we introduce Group-Relative-Policy-Optimization with Progressive-Trust-aware-Reasoning-Reward (GRPO-PTR) for RL. Unlike standard GRPO, which relies only on rule-based outcome rewards, GRPO-PTR progressively introduces a reasoning reward, dynamically adjusts it with a trustworthiness weight reflecting the alignment between reasoning and outcome, and evaluates overall reasoning quality with a reward model based on multi-dimensional criteria. EmotionThinker outperforms previous state-of-the-art models in both emotion accuracy and explanation quality, advancing SER toward interpretable multimodal reasoning. Project page: this https URL
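A minimal sketch of how the progressively introduced, trust-weighted reasoning reward described above might be combined with the rule-based outcome reward. The linear warm-up schedule, the function signature, and the specific weighting are our illustrative assumptions, not the paper's exact formulation:

```python
def ptr_reward(outcome_correct, reasoning_score, alignment, step, warmup=1000):
    """Combine a rule-based outcome reward with a progressively
    introduced, trustworthiness-weighted reasoning reward.

    outcome_correct: bool, rule-based check of the final emotion label
    reasoning_score: in [0, 1], from a multi-dimensional reward model
    alignment:       in [0, 1], agreement between reasoning and outcome
    step:            current training step (reasoning reward ramps in)
    """
    outcome_r = 1.0 if outcome_correct else 0.0
    progress = min(1.0, step / warmup)   # progressive introduction
    trust = alignment                    # trustworthiness weight
    return outcome_r + progress * trust * reasoning_score

# Early in training only the outcome reward counts;
# later, well-aligned reasoning earns additional reward.
early = ptr_reward(True, 0.8, 0.9, step=0)       # 1.0
late = ptr_reward(True, 0.8, 1.0, step=1000)     # ~1.8
```

The key property is that a high reasoning score contributes little when it disagrees with the outcome (low `alignment`) or when training has just begun (low `progress`).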
https://arxiv.org/abs/2601.15668