As healthcare increasingly turns to AI for scalable and trustworthy clinical decision support, ensuring reliability in model reasoning remains a critical challenge. Individual large language models (LLMs) are susceptible to hallucinations and inconsistency, whereas naive ensembles of models often fail to deliver stable and credible recommendations. Building on our previous work on LLM Chemistry, which quantifies the collaborative compatibility among LLMs, we apply this framework to improve the reliability in medication recommendation from brief clinical vignettes. Our approach leverages multi-LLM collaboration guided by Chemistry-inspired interaction modeling, enabling ensembles that are effective (exploiting complementary strengths), stable (producing consistent quality), and calibrated (minimizing interference and error amplification). We evaluate our Chemistry-based Multi-LLM collaboration strategy on real-world clinical scenarios to investigate whether such interaction-aware ensembles can generate credible, patient-specific medication recommendations. Preliminary results are encouraging, suggesting that LLM Chemistry-guided collaboration may offer a promising path toward reliable and trustworthy AI assistants in clinical practice.
https://arxiv.org/abs/2512.05066
Constructing 4D language fields is crucial for embodied AI, augmented/virtual reality, and 4D scene understanding, as they provide enriched semantic representations of dynamic environments and enable open-vocabulary querying in complex scenarios. However, existing approaches to 4D semantic field construction primarily rely on scene-specific Gaussian splatting, which requires per-scene optimization, exhibits limited generalization, and is difficult to scale to real-world applications. To address these limitations, we propose 4DLangVGGT, the first Transformer-based feed-forward unified framework for 4D language grounding, which jointly integrates geometric perception and language alignment within a single architecture. 4DLangVGGT has two key components: the 4D Visual Geometry Transformer, StreamVGGT, which captures spatio-temporal geometric representations of dynamic scenes; and the Semantic Bridging Decoder (SBD), which projects geometry-aware features into a language-aligned semantic space, thereby enhancing semantic interpretability while preserving structural fidelity. Unlike prior methods that depend on costly per-scene optimization, 4DLangVGGT can be jointly trained across multiple dynamic scenes and directly applied during inference, achieving both deployment efficiency and strong generalization. This design significantly improves the practicality of large-scale deployment and establishes a new paradigm for open-vocabulary 4D scene understanding. Experiments on the HyperNeRF and Neu3D datasets demonstrate that our approach not only generalizes effectively but also achieves state-of-the-art performance, with gains of up to 2% under per-scene training and 1% under multi-scene training. Our code is released at this https URL.
https://arxiv.org/abs/2512.05060
Generating interactive and dynamic 4D scenes from a single static image remains a core challenge. Most existing generate-then-reconstruct and reconstruct-then-generate methods decouple geometry from motion, causing spatiotemporal inconsistencies and poor generalization. To address these issues, we extend the reconstruct-then-generate framework to jointly perform Motion generation and geometric Reconstruction for 4D Synthesis (MoRe4D). We first introduce TrajScene-60K, a large-scale dataset of 60,000 video samples with dense point trajectories, addressing the scarcity of high-quality 4D scene data. Based on this, we propose a diffusion-based 4D Scene Trajectory Generator (4D-STraG) to jointly generate geometrically consistent and motion-plausible 4D point trajectories. To leverage single-view priors, we design a depth-guided motion normalization strategy and a motion-aware module for effective geometry and dynamics integration. We then propose a 4D View Synthesis Module (4D-ViSM) to render videos with arbitrary camera trajectories from 4D point track representations. Experiments show that MoRe4D generates high-quality 4D scenes with multi-view consistency and rich dynamic details from a single image. Code: this https URL.
https://arxiv.org/abs/2512.05044
Facial image inpainting aims to restore missing or corrupted regions in face images while preserving identity, structural consistency, and photorealistic image quality, a task central to photo restoration. Despite recent advances in deep generative models, existing methods struggle with large irregular masks, often producing blurry textures at the edges of the masked region, semantic inconsistencies, or unconvincing facial structures, due to direct pixel-level synthesis and limited exploitation of facial priors. In this paper, we propose a novel architecture that addresses these challenges through semantic-guided hierarchical synthesis. Our approach first organizes and synthesizes information at the semantic level and then refines texture, establishing a clear view of the facial structure before detailed image generation. In the first stage, we blend two techniques: CNNs for local features and Vision Transformers for global features, which together produce clear and detailed semantic layouts. In the second stage, a Multi-Modal Texture Generator refines these layouts by pulling in information from different scales, ensuring cohesive and consistent results. The architecture naturally handles arbitrary mask configurations through dynamic attention, without mask-specific training. Experiments on the CelebA-HQ and FFHQ datasets show that our model outperforms state-of-the-art methods on metrics such as LPIPS, PSNR, and SSIM, producing visually striking results with better semantic preservation in challenging large-area inpainting scenarios.
https://arxiv.org/abs/2512.05039
Modern Large Language Models achieve impressive reasoning capabilities with long Chains of Thought, but they incur substantial computational cost during inference, and this motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. Although recent works have shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to $\sim2\times$ at matched accuracy.
https://arxiv.org/abs/2512.05033
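The routing idea in the Arbitrage abstract above can be illustrated with a toy sketch: a router estimates the target model's advantage over the draft's proposed step and spends target compute only when that advantage clears a threshold. Everything here (the `draft_step`/`target_step` stubs and the length-based `router_advantage` heuristic) is hypothetical scaffolding for illustration, not the paper's implementation.

```python
# Toy sketch of step-level speculative generation with an advantage router.
# draft_step / target_step / router_advantage are hypothetical stand-ins.

def draft_step(prefix: str) -> str:
    # Fast but weak drafter: proposes the next reasoning step cheaply.
    return prefix + " draft"

def target_step(prefix: str) -> str:
    # Slow but strong target model: produces a higher-quality step.
    return prefix + " target"

def router_advantage(prefix: str, proposal: str) -> float:
    # Lightweight router predicting how much better the target's step would
    # be than the draft's. Stub heuristic: assume the target only helps
    # early in the trajectory, after which drafts are good enough.
    return 1.0 if len(prefix) < 8 else -1.0

def arbitrage_generate(prefix: str, n_steps: int, threshold: float = 0.0):
    """Route each reasoning step to the draft or the target model."""
    target_calls = 0
    for _ in range(n_steps):
        proposal = draft_step(prefix)
        if router_advantage(prefix, proposal) > threshold:
            prefix = target_step(prefix)   # meaningful gain expected: pay for it
            target_calls += 1
        else:
            prefix = proposal              # keep the draft step, save target compute
    return prefix, target_calls
```

With these stubs only the first step is routed to the target; a learned router would instead make this decision from model logits or step embeddings.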
Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low-resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or rely on sensor-specific encoders, limiting generalization across heterogeneous EO modalities. To overcome these limitations, we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats modality, spatial resolution, and temporal resolution as key input data features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified transformer encoder reconstructing masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, containing various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model are available at this https URL.
https://arxiv.org/abs/2512.05025
Handwritten Text Recognition remains challenging due to limited data, high writing-style variance, and scripts with complex diacritics. Existing approaches, though they partially address these issues, often struggle to generalize without massive synthetic data. To address these challenges, we propose HTR-ConvText, a model designed to capture fine-grained, stroke-level local features while preserving global contextual dependencies. In the feature extraction stage, we integrate a residual Convolutional Neural Network backbone with a MobileViT with Positional Encoding block. This enables the model to both capture structural patterns and learn subtle writing details. We then introduce the ConvText encoder, a hybrid architecture combining global context and local features within a hierarchical structure that reduces sequence length for improved efficiency. Additionally, an auxiliary module injects textual context to mitigate the weakness of Connectionist Temporal Classification. Evaluations on IAM, READ2016, LAM and HANDS-VNOnDB demonstrate that our approach achieves improved performance and better generalization compared to existing methods, especially in scenarios with limited training samples and high handwriting diversity.
https://arxiv.org/abs/2512.05021
We present GNVC-VD, the first DiT-based generative neural video compression framework built upon an advanced video generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained image generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to perceptual flickering. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a video diffusion transformer to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherence under extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01 bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.
https://arxiv.org/abs/2512.05016
This extended abstract introduces Self-Explaining Contrastive Evidence Re-Ranking (CER), a novel method that restructures retrieval around factual evidence by fine-tuning embeddings with contrastive learning and generating token-level attribution rationales for each retrieved passage. Hard negatives are automatically selected using a subjectivity-based criterion, forcing the model to pull factual rationales closer while pushing subjective or misleading explanations apart. As a result, the method creates an embedding space explicitly aligned with evidential reasoning. We evaluated our method on clinical trial reports, and initial experimental results show that CER improves retrieval accuracy, mitigates the potential for hallucinations in RAG systems, and provides transparent, evidence-based retrieval that enhances reliability, especially in safety-critical domains.
https://arxiv.org/abs/2512.05012
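The contrastive objective described in the CER abstract above can be sketched as a triplet-style hinge loss that pulls a factual rationale toward the query embedding while pushing a subjectivity-selected hard negative away. The function names and the margin value are illustrative assumptions, not the authors' code.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def contrastive_evidence_loss(query, factual_rationale, subjective_negative,
                              margin: float = 0.2) -> float:
    # Triplet-style hinge: zero loss once the factual rationale is closer to
    # the query than the subjective hard negative by at least the margin.
    pos = cosine(query, factual_rationale)
    neg = cosine(query, subjective_negative)
    return max(0.0, margin - pos + neg)
```

Fine-tuning the embedding model to minimize this loss over many (query, factual, subjective) triples is what shapes the embedding space toward evidential reasoning.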
This thesis presents a unified modeling and simulation framework for analyzing sidewinding and tumbling locomotion of the COBRA snake robot across rigid, compliant, and granular terrains. A contact-implicit formulation is used to model distributed frictional interactions during sidewinding, and validated through MATLAB Simscape simulations and physical experiments on rigid ground and loose sand. To capture terrain deformation effects, Project Chrono's Soil Contact Model (SCM) is integrated with the articulated multibody dynamics, enabling prediction of slip, sinkage, and load redistribution that reduce stride efficiency on deformable substrates. For high-energy rolling locomotion on steep slopes, the Chrono DEM Engine is used to simulate particle-resolved granular interactions, revealing soil failure, intermittent lift-off, and energy dissipation mechanisms not captured by rigid models. Together, these methods span real-time control-oriented simulation and high-fidelity granular physics. Results demonstrate that rigid-ground models provide accurate short-horizon motion prediction, while continuum and particle-based terrain modeling becomes necessary for reliable mobility analysis in soft and highly dynamic environments. This work establishes a hierarchical simulation pipeline that advances robust, terrain-aware locomotion for robots operating in challenging unstructured settings.
https://arxiv.org/abs/2512.05008
The perception of transparent objects is a well-known challenge in computer vision. Conventional depth sensors have difficulty sensing the depth of transparent objects due to refraction and reflection of light. Previous research has typically trained a neural network to complete the depth acquired by the sensor, an approach that can quickly recover accurate depth maps of transparent objects. However, such training relies on a large amount of annotated data for supervision, and labeling depth maps is costly. To tackle this challenge, we propose a new self-supervised method for training depth completion networks. Our method simulates the depth deficits of transparent objects within non-transparent regions and uses the original depth map as ground truth for supervision. Experiments demonstrate that our method achieves performance comparable to supervised approaches, and that pre-training with our method improves model performance when training samples are scarce.
https://arxiv.org/abs/2512.05006
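The self-supervision trick in the abstract above (corrupting depth in non-transparent regions so that the original map becomes a free label) can be sketched as follows. The hole fraction, the flat-list depth representation, and the use of zero as the missing-depth sentinel are illustrative choices, not the paper's actual pipeline.

```python
import random

def simulate_transparent_deficits(depth, hole_frac: float = 0.3, seed: int = 0):
    # Zero out random entries of a valid (non-transparent-region) depth map to
    # mimic the missing depth a sensor produces on transparent objects.
    # The untouched original map then serves as the supervision target,
    # so no manual depth annotation is needed.
    rng = random.Random(seed)
    corrupted = depth[:]  # copy of a flat list of depth values
    holes = rng.sample(range(len(depth)), k=int(hole_frac * len(depth)))
    for i in holes:
        corrupted[i] = 0.0  # 0 = "missing depth" sentinel
    return corrupted, depth  # (network input, ground-truth target)
```

A depth completion network would be trained to map `corrupted` back to `depth`, then applied at test time to real sensor maps with genuine transparent-object holes.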
We introduce a diffusion-transformer (DiT) framework for single-image reflection removal that leverages the generalization strengths of foundation diffusion models in the restoration setting. Rather than relying on task-specific architectures, we repurpose a pre-trained DiT-based foundation model by conditioning it on reflection-contaminated inputs and guiding it toward clean transmission layers. We systematically analyze existing reflection removal data sources for diversity, scalability, and photorealism. To address the shortage of suitable data, we construct a physically based rendering (PBR) pipeline in Blender, built around the Principled BSDF, to synthesize realistic glass materials and reflection effects. Efficient LoRA-based adaptation of the foundation model, combined with the proposed synthetic data, achieves state-of-the-art performance on in-domain and zero-shot benchmarks. These results demonstrate that pretrained diffusion transformers, when paired with physically grounded data synthesis and efficient adaptation, offer a scalable and high-fidelity solution for reflection removal. Project page: this https URL
https://arxiv.org/abs/2512.05000
Current upper limb prostheses aim to enhance user independence in daily activities by incorporating basic motor functions. However, they fall short of replicating the natural movement and interaction capabilities of the human arm. In contrast, human limbs leverage intrinsic compliance and actively modulate joint stiffness, enabling adaptive responses to varying tasks, impact absorption, and efficient energy transfer during dynamic actions. Inspired by this adaptability, we developed a transhumeral prosthesis with Variable Stiffness Actuators (VSAs) to replicate the controllable compliance found in biological joints. The proposed prosthesis features a modular design, allowing customization for different residual limb shapes and accommodating a range of independent control signals derived from users' biological cues. Integrated elastic elements passively support more natural movements, facilitate safe interactions with the environment, and adapt to diverse task requirements. This paper presents a comprehensive overview of the platform and its functionalities, highlighting its potential applications in the field of prosthetics.
https://arxiv.org/abs/2512.04998
This paper proposes a memory-efficient optimization strategy for the high-performance point cloud registration algorithm VANICP, enabling lightweight execution on embedded GPUs with constrained hardware resources. VANICP is a recently published acceleration framework that significantly improves the computational efficiency of point-cloud-based applications. By transforming the global nearest neighbor search into a localized process through a dilation-based information propagation mechanism, VANICP greatly reduces the computational complexity of the NNS. However, its original implementation demands a considerable amount of memory, which restricts its deployment in resource-constrained environments such as embedded systems. To address this issue, we propose a GPU-oriented dynamic memory assignment strategy that optimizes the memory usage of the dilation operation. Furthermore, based on this strategy, we construct an enhanced version of the VANICP framework that achieves over 97% reduction in memory consumption while preserving the original performance. Source code is published on: this https URL.
https://arxiv.org/abs/2512.04996
The evolution of Large Language Models (LLMs) from passive responders to autonomous agents necessitates a fundamental shift in learning paradigms -- from static imitation to incentive-driven decision making. However, this transition is significantly impeded by the lack of scalable infrastructure capable of constructing high-quality interaction signals for effective policy learning. To address this, we introduce a comprehensive method designed to systematically scale the diversity and complexity of interactive environments. Our method realizes this scaling by addressing three orthogonal dimensions: (1) Complexity: NexAU, a flexible agent framework that supports building complex agent hierarchies via simple configurations; (2) Diversity: NexA4A automatically generates diverse agent hierarchies from natural language to cover infinite domains; and (3) Fidelity: NexGAP bridges the simulation-reality gap by integrating dynamic real-world environment for grounded trajectories synthesis. We train Nex-N1 upon the diverse and complex interactive environments established by our infrastructure. Empirical results on benchmarks such as SWE-bench and tau2 demonstrate that Nex-N1 consistently outperforms SOTA open-source models and achieves competitive performance against frontier proprietary models on complex agentic tasks. We open-source the Nex ecosystem and model weights to facilitate further research.
https://arxiv.org/abs/2512.04987
Large vision-language model (LVLM) based text-to-image (T2I) systems have become the dominant paradigm in image generation, yet whether they amplify social biases remains insufficiently understood. In this paper, we show that LVLM-based models produce markedly more socially biased images than non-LVLM-based models. We introduce a 1,024 prompt benchmark spanning four levels of linguistic complexity and evaluate demographic bias across multiple attributes in a systematic manner. Our analysis identifies system prompts, the predefined instructions guiding LVLMs, as a primary driver of biased behavior. Through decoded intermediate representations, token-probability diagnostics, and embedding-association analyses, we reveal how system prompts encode demographic priors that propagate into image synthesis. To this end, we propose FairPro, a training-free meta-prompting framework that enables LVLMs to self-audit and construct fairness-aware system prompts at test time. Experiments on two LVLM-based T2I models, SANA and Qwen-Image, show that FairPro substantially reduces demographic bias while preserving text-image alignment. We believe our findings provide deeper insight into the central role of system prompts in bias propagation and offer a practical, deployable approach for building more socially responsible T2I systems.
https://arxiv.org/abs/2512.04981
Variable Stiffness Actuators prove invaluable for robotics applications in unstructured environments, fostering safe interactions and enhancing task adaptability. Nevertheless, their mechanical design inevitably results in larger and heavier structures compared to classical rigid actuators. This paper introduces a novel 3 Degrees of Freedom (DoFs) parallel wrist that achieves variable stiffness through redundant elastic actuation. Leveraging its parallel architecture, the device employs only four motors, rendering it compact and lightweight. This characteristic makes it particularly well-suited for applications in prosthetics or humanoid robotics. The manuscript delves into the theoretical model of the device and proposes a sophisticated control strategy for independent regulation of joint position and stiffness. Furthermore, it validates the proposed controller through simulation, utilizing a comprehensive analysis of the system dynamics. The reported results affirm the ability of the device to achieve high accuracy and disturbance rejection in rigid configurations while minimizing interaction forces with its compliant behavior.
https://arxiv.org/abs/2512.04973
We pilot a family of stable contrastive losses for learning pixel-level representations that jointly capture semantic and geometric information. Our approach maps each pixel of an image to an overcomplete descriptor that is both view-invariant and semantically meaningful. It enables precise point-correspondence across images without requiring momentum-based teacher-student training. Two experiments in synthetic 2D and 3D environments demonstrate the properties of our loss and the resulting overcomplete representations.
https://arxiv.org/abs/2512.04970
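A minimal stand-in for a pixel-level contrastive loss of the kind described above is an InfoNCE objective in which descriptors of the same scene point seen from two views are positives and all other pixels act as negatives. This sketch is generic and does not reproduce the authors' specific stable loss family or overcomplete descriptors.

```python
import math

def l2_normalize(v):
    # Project a descriptor onto the unit sphere.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def pixel_infonce(desc_a, desc_b, temperature: float = 0.1) -> float:
    # desc_a[i] and desc_b[i] are descriptors of the SAME scene point in two
    # views; every other pairing serves as a negative.
    a = [l2_normalize(d) for d in desc_a]
    b = [l2_normalize(d) for d in desc_b]
    loss = 0.0
    for i in range(len(a)):
        logits = [sum(x * y for x, y in zip(a[i], b[j])) / temperature
                  for j in range(len(b))]
        m = max(logits)  # stabilize the log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]  # -log softmax at the matching pixel
    return loss / len(a)
```

The loss is near zero when corresponding descriptors already match across views, and large when correspondences are scrambled, which is exactly the view-invariance the representation is trained for.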
Rich feature representations derived from CLIP-ViT have been widely utilized in AI-generated image detection. While most existing methods primarily leverage features from the final layer, we systematically analyze the contributions of layer-wise features to this task. Our study reveals that earlier layers provide more localized and generalizable features, often surpassing the performance of final-layer features in detection tasks. Moreover, we find that different layers capture distinct aspects of the data, each contributing uniquely to AI-generated image detection. Motivated by these findings, we introduce a novel adaptive method, termed MoLD, which dynamically integrates features from multiple ViT layers using a gating-based mechanism. Extensive experiments on both GAN- and diffusion-generated images demonstrate that MoLD significantly improves detection performance, enhances generalization across diverse generative models, and exhibits robustness in real-world scenarios. Finally, we illustrate the scalability and versatility of our approach by successfully applying it to other pre-trained ViTs, such as DINOv2.
https://arxiv.org/abs/2512.04969
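The gating-based fusion in the MoLD abstract above can be caricatured as a softmax gate over per-layer ViT features. The sketch assumes mean-pooled feature vectors per layer and learned gate logits; it is an illustration of gated layer fusion in general, not the paper's architecture.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of gate logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def gated_layer_fusion(layer_feats, gate_logits):
    # layer_feats: one feature vector per ViT layer (all the same dimension);
    # the softmax gate decides how much each layer contributes to the fused
    # representation fed to the detection head.
    weights = softmax(gate_logits)
    dim = len(layer_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, layer_feats))
            for d in range(dim)]
```

In a learned system the gate logits would be produced per input by a small network, letting the detector lean on earlier, more localized layers when they are the most discriminative.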
Automated retinal disease diagnosis is vital given the rising prevalence of conditions such as diabetic retinopathy and macular degeneration. Conventional deep learning approaches require large annotated datasets, which are costly and often imbalanced across disease categories, limiting their reliability in practice. Few-shot learning (FSL) addresses this challenge by enabling models to generalize from only a few labeled samples per class. In this study, we propose a balanced few-shot episodic learning framework tailored to the Retinal Fundus Multi-Disease Image Dataset (RFMiD). Focusing on the ten most represented classes, which still show substantial imbalance between majority diseases (e.g., Diabetic Retinopathy, Macular Hole) and minority ones (e.g., Optic Disc Edema, Branch Retinal Vein Occlusion), our method integrates three key components: (i) balanced episodic sampling, ensuring equal participation of all classes in each 5-way 5-shot episode; (ii) targeted augmentation, including Contrast Limited Adaptive Histogram Equalization (CLAHE) and color/geometry transformations, to improve minority-class diversity; and (iii) a ResNet-50 encoder pretrained on ImageNet, selected for its superior ability to capture fine-grained retinal features. Prototypes are computed in the embedding space and classification is performed with cosine similarity for improved stability. Trained on 100 episodes and evaluated on 1,000 test episodes, our framework achieves substantial accuracy gains and reduces bias toward majority classes, with notable improvements for underrepresented diseases. These results demonstrate that dataset-aware few-shot pipelines, combined with balanced sampling and CLAHE-enhanced preprocessing, can deliver more robust and clinically fair retinal disease diagnosis under data-constrained conditions.
https://arxiv.org/abs/2512.04967
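Two pieces of the pipeline above lend themselves to a short sketch: balanced 5-way 5-shot episode sampling, and cosine-similarity classification against class prototypes computed as the mean of support embeddings. The data layout and the class names in the test are invented for illustration; this is not the authors' code.

```python
import math
import random

def sample_balanced_episode(class_to_items, n_way: int = 5, k_shot: int = 5,
                            seed: int = 0):
    # Every sampled class contributes exactly k_shot support items,
    # regardless of how imbalanced the underlying dataset is.
    rng = random.Random(seed)
    classes = rng.sample(sorted(class_to_items), n_way)
    return {c: rng.sample(class_to_items[c], k_shot) for c in classes}

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def classify_by_prototype(query, support):
    # support: class -> list of embedded support samples.
    # Prototype = mean of support embeddings; predict the most similar class.
    best, best_sim = None, -2.0
    for c, embs in support.items():
        proto = [sum(vals) / len(embs) for vals in zip(*embs)]
        sim = cosine(query, proto)
        if sim > best_sim:
            best, best_sim = c, sim
    return best
```

Because every episode sees all classes equally often, the prototype classifier never learns the majority-class shortcut that plagues conventionally trained models on imbalanced fundus data.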