Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene datasets, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene-level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator's transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes are random compositions of objects. This demonstrates that the generator's transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing the widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. Project page: this https URL
https://arxiv.org/abs/2512.13683
Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all 16 models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (avg GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp), while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646) with model-dependent capability impact. These findings provide researchers with evidence-based selection criteria for abliteration tool deployment across diverse model architectures. The principal finding indicates that mathematical reasoning capabilities exhibit the highest sensitivity to abliteration interventions, with GSM8K change ranging from +1.51 pp to -18.81 pp (-26.5% relative) depending on tool selection and model architecture.
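The directional orthogonalization at the heart of abliteration can be sketched independently of any particular tool: given a "refusal direction" in activation space, a weight matrix writing to the residual stream is projected so its outputs carry no component along that direction. A minimal NumPy sketch (the matrix, direction, and dimensions are purely illustrative, not any tool's actual implementation):

```python
import numpy as np

def abliterate(W: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Remove the component of each output of W along refusal direction d.

    W: (out_dim, in_dim) weight matrix writing to the residual stream.
    d: (out_dim,) refusal direction (normalized internally).
    """
    d = d / np.linalg.norm(d)
    # Rank-1 projection: W' = (I - d d^T) W
    return W - np.outer(d, d) @ W

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
d = rng.standard_normal(8)
W_abl = abliterate(W, d)

# Outputs of the edited matrix have no component along d
x = rng.standard_normal(4)
print(abs(float((d / np.linalg.norm(d)) @ (W_abl @ x))))  # ~0
```

Single-pass tools apply one such projection per targeted matrix, while optimization-based tools search over which direction(s) and layers to edit, which is one plausible source of the variable distribution shift reported above.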
https://arxiv.org/abs/2512.13655
Near-field perception is essential for the safe operation of autonomous mobile robots (AMRs) in manufacturing environments. Conventional ranging sensors such as light detection and ranging (LiDAR) and ultrasonic devices provide broad situational awareness but often fail to detect small objects near the robot base. To address this limitation, this paper presents a three-tier near-field perception framework. The first approach employs light-discontinuity detection, which projects a laser stripe across the near-field zone and identifies interruptions in the stripe to perform fast, binary cutoff sensing for obstacle presence. The second approach utilizes light-displacement measurement to estimate object height by analyzing the geometric displacement of a projected stripe in the camera image, which provides quantitative obstacle height information with minimal computational overhead. The third approach employs a computer vision-based object detection model on embedded AI hardware to classify objects, enabling semantic perception and context-aware safety decisions. All methods are implemented on a Raspberry Pi 5 system, achieving real-time performance at 25 or 50 frames per second. Experimental evaluation and comparative analysis demonstrate that the proposed hierarchy balances precision, computation, and cost, thereby providing a scalable perception solution for enabling safe operations of AMRs in manufacturing environments.
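The light-displacement idea follows from simple triangulation: a stripe projected at a known angle shifts sideways when it lands on top of an object, and the shift is proportional to the object's height. A sketch under assumed geometry (the angle, pixel scale, and function name are illustrative, not the paper's calibration):

```python
import math

def object_height_mm(pixel_shift: float, mm_per_pixel: float,
                     laser_angle_deg: float) -> float:
    """Estimate obstacle height from the observed stripe displacement.

    A stripe projected at laser_angle_deg from the vertical lands
    h * tan(angle) closer to the projector when it hits an object of
    height h, so h = shift / tan(angle).
    """
    shift_mm = pixel_shift * mm_per_pixel
    return shift_mm / math.tan(math.radians(laser_angle_deg))

# A 30 px shift at 0.5 mm/px with a 45-degree projection angle -> 15 mm object
print(object_height_mm(30, 0.5, 45.0))
```

This is why the second tier delivers quantitative height with minimal computation: per frame it only needs the stripe's pixel displacement and a fixed calibration.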
https://arxiv.org/abs/2512.13561
Generating 3D body movements from speech shows great potential for a wide range of downstream applications, yet it still struggles to imitate realistic human movements. Predominant research efforts focus on end-to-end schemes for generating co-speech gestures, spanning GANs, VQ-VAE, and recent diffusion models. Since the task is ill-posed, we argue in this paper that these prevailing learning schemes fail to model the crucial inter- and intra-correlations across different motion units, i.e. head, body, and hands, thus leading to unnatural movements and poor coordination. To delve into these intrinsic correlations, we propose a unified Hierarchical Implicit Periodicity (HIP) learning approach for audio-inspired 3D gesture generation. Different from predominant research, our approach models this multi-modal implicit relationship through two explicit technical insights: i) To disentangle the complicated gesture movements, we first explore gesture motion phase manifolds with periodic autoencoders, imitating human motion patterns from realistic distributions while incorporating non-periodic components from the current latent states for instance-level diversity. ii) To model the hierarchical relationship among face motions, body gestures, and hand movements, we drive the animation with cascaded guidance during learning. We demonstrate our approach on 3D avatars, and extensive experiments show that our method outperforms state-of-the-art co-speech gesture generation methods in both quantitative and qualitative evaluations. Code and models will be publicly available.
https://arxiv.org/abs/2512.13131
Debate has been widely adopted as a strategy to enhance critical thinking skills in English Language Arts (ELA). One important skill in debate is forming effective argumentation, which requires debaters to select supportive evidence from literature and construct compelling claims. However, the training of this skill largely depends on human coaching, which is labor-intensive and difficult to scale. To better support students in preparing for debates, this study explores the potential of leveraging artificial intelligence to generate effective arguments. Specifically, we prompted GPT-4 to create an evidence card and compared it to those produced by human debaters. The evidence cards outline the arguments students will present and how those arguments will be delivered, including components such as literature-based evidence quotations, summaries of core ideas, verbatim reading scripts, and tags (i.e., titles of the arguments). We compared the quality of the arguments in the evidence cards created by GPT and student debaters using Aristotle's rhetorical principles: ethos (credibility), pathos (emotional appeal), and logos (logical reasoning). Through a systematic qualitative and quantitative analysis, grounded in the rhetorical principles, we identify the strengths and limitations of human and GPT in debate reasoning, outlining areas where AI's focus and justifications align with or diverge from human reasoning. Our findings contribute to the evolving role of AI-assisted learning interventions, offering insights into how student debaters can develop strategies that enhance their argumentation and reasoning skills.
https://arxiv.org/abs/2512.12817
In this paper, we propose a method for generating layouts for image-based advertisements by leveraging a Vision-Language Model (VLM). Conventional advertisement layout techniques have predominantly relied on saliency mapping to detect salient regions within a background image, but such approaches often fail to fully account for the image's detailed composition and semantic content. To overcome this limitation, our method harnesses a VLM to recognize the products and other elements depicted in the background and to inform the placement of text and logos. The proposed layout-generation pipeline consists of two steps. In the first step, the VLM analyzes the image to identify object types and their spatial relationships, then produces a text-based "placement plan" based on this analysis. In the second step, that plan is rendered into the final layout by generating HTML-format code. We validated the effectiveness of our approach through evaluation experiments, conducting both quantitative and qualitative comparisons against existing methods. The results demonstrate that by explicitly considering the background image's content, our method produces noticeably higher-quality advertisement layouts.
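The second step, rendering a placement plan as HTML, can be sketched as follows (the plan schema and the absolute-positioning style are hypothetical stand-ins; the paper's VLM-generated code will differ):

```python
def render_layout(plan: list[dict]) -> str:
    """Render a text-based placement plan as absolutely positioned HTML."""
    items = "\n".join(
        f'  <div style="position:absolute; left:{p["x"]}%; top:{p["y"]}%;">'
        f'{p["text"]}</div>'
        for p in plan
    )
    return f'<div style="position:relative; width:100%; height:100%;">\n{items}\n</div>'

# Hypothetical plan: the VLM has identified the product in the lower-left
# region, so text and logo are routed to the free areas.
plan = [
    {"text": "Summer Sale", "x": 10, "y": 5},   # headline in the clear top strip
    {"text": "ACME logo",  "x": 80, "y": 90},   # logo in the unoccupied corner
]
print(render_layout(plan))
```

Keeping the plan as structured text before committing to HTML is what lets the VLM reason about object positions symbolically rather than in pixel space.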
https://arxiv.org/abs/2512.12596
Accurate coronary artery segmentation from coronary computed tomography angiography is essential for quantitative coronary analysis and clinical decision support. Nevertheless, reliable segmentation remains challenging because of small vessel calibers, complex branching, blurred boundaries, and myocardial interference. We propose a coronary artery segmentation framework that integrates myocardial anatomical priors, structure-aware feature encoding, and three-dimensional wavelet/inverse-wavelet transforms. Myocardial priors and residual-attention-based feature enhancement are incorporated during encoding to strengthen coronary structure representation. Wavelet/inverse-wavelet-based downsampling and upsampling enable joint spatial-frequency modeling and preserve multi-scale structural consistency, while a multi-scale feature fusion module integrates semantic and geometric information in the decoding stage. The model is trained and evaluated on the public ImageCAS dataset using a 3D overlapping-patch-based strategy with a 7:1:2 split for training, validation, and testing. Experimental results demonstrate that the proposed method achieves a Dice coefficient of 0.8082, Sensitivity of 0.7946, Precision of 0.8471, and an HD95 of 9.77 mm, outperforming several mainstream segmentation models. Ablation studies further confirm the complementary contributions of individual components. The proposed method enables more stable and consistent coronary artery segmentation under complex geometric conditions, providing reliable segmentation results for subsequent coronary structure analysis tasks.
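The appeal of the wavelet/inverse-wavelet pair is that downsampling becomes lossless: the detail coefficients retain what ordinary pooling would discard, so upsampling can restore the original resolution exactly. A one-dimensional, one-level Haar sketch of this idea (the paper operates on 3D feature volumes):

```python
import numpy as np

def haar_down(x: np.ndarray):
    """One-level Haar transform: split into approximation + detail,
    each half the length of x."""
    a, b = x[0::2], x[1::2]
    return (a + b) / np.sqrt(2), (a - b) / np.sqrt(2)

def haar_up(approx: np.ndarray, detail: np.ndarray) -> np.ndarray:
    """Inverse Haar transform: exact reconstruction at full length."""
    x = np.empty(approx.size * 2)
    x[0::2] = (approx + detail) / np.sqrt(2)
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x

x = np.array([4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 2.0, 0.0])
lo, hi = haar_down(x)
print(np.allclose(haar_up(lo, hi), x))  # True: no information lost
```

Strided convolution or max pooling would discard the `hi` band; carrying it through the network is what preserves the thin, high-frequency vessel boundaries across scales.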
https://arxiv.org/abs/2512.12539
Generative Artificial Intelligence (AI) has created unprecedented opportunities for creative expression, education, and research. Text-to-image systems such as DALL·E, Stable Diffusion, and Midjourney can now convert ideas into visuals within seconds, but they also present a dual-use dilemma, raising critical ethical concerns: amplifying societal biases, producing high-fidelity disinformation, and violating intellectual property. This paper introduces SafeGen, a framework that embeds ethical safeguards directly into the text-to-image generation pipeline, grounding its design in established principles for Trustworthy AI. SafeGen integrates two complementary components: BGE-M3, a fine-tuned text classifier that filters harmful or misleading prompts, and Hyper-SD, an optimized diffusion model that produces high fidelity, semantically aligned images. Built on a curated multilingual (English-Vietnamese) dataset and a fairness-aware training process, SafeGen demonstrates that creative freedom and ethical responsibility can be reconciled within a single workflow. Quantitative evaluations confirm its effectiveness, with Hyper-SD achieving IS = 3.52, FID = 22.08, and SSIM = 0.79, while BGE-M3 reaches an F1-Score of 0.81. An ablation study further validates the importance of domain-specific fine-tuning for both modules. Case studies illustrate SafeGen's practical impact in blocking unsafe prompts, generating inclusive teaching materials, and reinforcing academic integrity.
https://arxiv.org/abs/2512.12501
We introduce MARKET-BENCH, a benchmark that evaluates large language models (LLMs) on introductory quantitative trading tasks by asking them to construct executable backtesters from natural-language strategy descriptions and market assumptions. Each instance specifies one of three canonical strategies -- scheduled trading on Microsoft (NASDAQ: MSFT), pairs trading on Coca-Cola (NASDAQ: KO) and Pepsi (NASDAQ: PEP), or delta hedging on MSFT -- and models must produce code whose P&L, drawdown, and position paths match a verifiable reference implementation. We assess twelve state-of-the-art models using a multi-round pass@k metric that separates structural reliability (whether the backtest runs) from numerical accuracy (mean absolute error of the backtest metrics). While most models reliably execute the simplest strategy (average pass@3 of 0.80), errors vary by orders of magnitude across models and tasks: Gemini 3 Pro and Claude 4.5 Sonnet combine strong reliability with low error on simpler strategies, GPT-5.1 Codex-Max achieves perfect pass@1 on the first two strategies and the lowest best-run error on the easiest task, and Qwen3 Max attains perfect pass@3 yet sometimes produces inaccurate P&L paths. These results show that current LLMs can scaffold basic trading infrastructure but still struggle to reason robustly about prices, inventory, and risk; we release MARKET-BENCH and a public leaderboard at this https URL.
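For context, the standard unbiased pass@k estimator from code-generation benchmarks computes the probability that at least one of k sampled runs passes, given c passes among n runs (MARKET-BENCH's multi-round variant may differ in detail):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: P(at least one of k draws without replacement
    passes), given c passing runs out of n total runs."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# One passing run out of three total:
print(pass_at_k(n=3, c=1, k=3))  # 1.0
print(round(pass_at_k(n=3, c=1, k=1), 4))  # 0.3333
```

Separating this structural pass rate from the numerical error of runs that do execute is what lets the benchmark distinguish "the backtest runs" from "the backtest is right".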
https://arxiv.org/abs/2512.12264
Semantic distance measurement is a fundamental problem in computational linguistics, providing a quantitative characterization of similarity or relatedness between text segments, and underpinning tasks such as text retrieval and text classification. From a mathematical perspective, a semantic distance can be viewed as a metric defined on a space of texts or on a representation space derived from them. However, most classical semantic distance methods are essentially fixed, making them difficult to adapt to specific data distributions and task requirements. In this paper, a semantic distance measure based on multi-kernel Gaussian processes (MK-GP) is proposed. The latent semantic function associated with texts is modeled as a Gaussian process, with its covariance function given by a composite kernel combining Matérn and polynomial components. The kernel parameters are learned automatically from data under supervision, rather than being hand-crafted. This semantic distance is instantiated and evaluated in the context of fine-grained sentiment classification with large language models under an in-context learning (ICL) setup. The experimental results demonstrate the effectiveness of the proposed measure.
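A covariance kernel of this kind induces a distance directly, via d(x, y)² = k(x,x) − 2·k(x,y) + k(y,y). A minimal sketch with fixed weights and kernel parameters (the paper learns these under supervision; the specific values here are illustrative):

```python
import numpy as np

def matern32(x: np.ndarray, y: np.ndarray, ls: float = 1.0) -> float:
    """Matern kernel with smoothness 3/2 and length scale ls."""
    s = np.sqrt(3) * np.linalg.norm(x - y) / ls
    return (1.0 + s) * np.exp(-s)

def poly(x: np.ndarray, y: np.ndarray, c: float = 1.0, p: int = 2) -> float:
    """Polynomial kernel (x.y + c)^p."""
    return (float(x @ y) + c) ** p

def combined_kernel(x, y, w: float = 0.5) -> float:
    """Convex combination of the Matern and polynomial components."""
    return w * matern32(x, y) + (1.0 - w) * poly(x, y)

def gp_distance(x, y) -> float:
    """Distance induced by the kernel: sqrt(k(x,x) - 2 k(x,y) + k(y,y))."""
    d2 = combined_kernel(x, x) - 2.0 * combined_kernel(x, y) + combined_kernel(y, y)
    return float(np.sqrt(max(d2, 0.0)))

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
print(gp_distance(x, x))      # 0.0
print(gp_distance(x, y) > 0)  # True
```

In the ICL setting, such a distance over text embeddings would rank candidate demonstrations by proximity to the query, with the learned kernel weights deciding how much local (Matérn) versus global polynomial structure matters.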
https://arxiv.org/abs/2512.12238
Echocardiography is the most widely used imaging modality in cardiology, yet its interpretation remains labor-intensive and inherently multimodal, requiring view recognition, quantitative measurements, qualitative assessments, and guideline-based reasoning. While recent vision-language models (VLMs) have achieved broad success in natural images and certain medical domains, their potential in echocardiography has been limited by the lack of large-scale, clinically grounded image-text datasets and the absence of measurement-based reasoning central to echo interpretation. We introduce EchoGround-MIMIC, the first measurement-grounded multimodal echocardiography dataset, comprising 19,065 image-text pairs from 1,572 patients with standardized views, structured measurements, measurement-grounded captions, and guideline-derived disease labels. Building on this resource, we propose EchoVLM, a vision-language model that incorporates two novel pretraining objectives: (i) a view-informed contrastive loss that encodes the view-dependent structure of echocardiographic imaging, and (ii) a negation-aware contrastive loss that distinguishes clinically critical negative from positive findings. Across five types of clinical applications with 36 tasks spanning multimodal disease classification, image-text retrieval, view classification, chamber segmentation, and landmark detection, EchoVLM achieves state-of-the-art performance (86.5% AUC in zero-shot disease classification and 95.1% accuracy in view classification). We demonstrate that clinically grounded multimodal pretraining yields transferable visual representations and establish EchoVLM as a foundation model for end-to-end echocardiography interpretation. We will release EchoGround-MIMIC and the data curation code, enabling reproducibility and further research in multimodal echocardiography interpretation.
https://arxiv.org/abs/2512.12107
While many vision-language models (VLMs) are developed to answer well-defined, straightforward questions with highly specified targets, as in most benchmarks, they often struggle in practice with complex open-ended tasks, which usually require multiple rounds of exploration and reasoning in the visual space. Such visual thinking paths not only provide step-by-step exploration and verification as an AI detective but also produce better interpretations of the final answers. However, these paths are challenging to evaluate due to the large exploration space of intermediate steps. To bridge the gap, we develop an evaluation suite, ``Visual Reasoning with multi-step EXploration (V-REX)'', which is composed of a benchmark of challenging visual reasoning tasks requiring native multi-step exploration and an evaluation protocol. V-REX covers rich application scenarios across diverse domains. V-REX casts the multi-step exploratory reasoning into a Chain-of-Questions (CoQ) and disentangles VLMs' capability to (1) Planning: breaking down an open-ended task by selecting a chain of exploratory questions; and (2) Following: answering curated CoQ sequentially to collect information for deriving the final answer. By curating finite options of questions and answers per step, V-REX achieves a reliable quantitative and fine-grained analysis of the intermediate steps. By assessing SOTA proprietary and open-sourced VLMs, we reveal consistent scaling trends, significant differences between planning and following abilities, and substantial room for improvement in multi-step exploratory reasoning.
https://arxiv.org/abs/2512.11995
Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editable intrinsic representations remains unexplored. We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality. Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.
https://arxiv.org/abs/2512.11799
We present Particulate, a feed-forward approach that, given a single static 3D mesh of an everyday object, directly infers all attributes of the underlying articulated structure, including its 3D parts, kinematic structure, and motion constraints. At its core is a transformer network, Part Articulation Transformer, which processes a point cloud of the input mesh using a flexible and scalable architecture to predict all the aforementioned attributes with native multi-joint support. We train the network end-to-end on a diverse collection of articulated 3D assets from public datasets. During inference, Particulate lifts the network's feed-forward prediction to the input mesh, yielding a fully articulated 3D model in seconds, much faster than prior approaches that require per-object optimization. Particulate can also accurately infer the articulated structure of AI-generated 3D assets, enabling full-fledged extraction of articulated 3D objects from a single (real or synthetic) image when combined with an off-the-shelf image-to-3D generator. We further introduce a new challenging benchmark for 3D articulation estimation curated from high-quality public 3D assets, and redesign the evaluation protocol to be more consistent with human preferences. Quantitative and qualitative results show that Particulate significantly outperforms state-of-the-art approaches.
https://arxiv.org/abs/2512.11798
Seismic processing transforms raw data into subsurface images essential for geophysical applications. Traditional methods face challenges, such as noisy data, and manual parameter tuning, among others. Recently deep learning approaches have proposed alternative solutions to some of these problems. However, important challenges of existing deep learning approaches are spatially inconsistent results across neighboring seismic gathers and lack of user-control. We address these limitations by introducing ContextSeisNet, an in-context learning model, to seismic demultiple processing. Our approach conditions predictions on a support set of spatially related example pairs: neighboring common-depth point gathers from the same seismic line and their corresponding labels. This allows the model to learn task-specific processing behavior at inference time by observing how similar gathers should be processed, without any retraining. This method provides both flexibility through user-defined examples and improved lateral consistency across seismic lines. On synthetic data, ContextSeisNet outperforms a U-Net baseline quantitatively and demonstrates enhanced spatial coherence between neighboring gathers. On field data, our model achieves superior lateral consistency compared to both traditional Radon demultiple and the U-Net baseline. Relative to the U-Net, ContextSeisNet also delivers improved near-offset performance and more complete multiple removal. Notably, ContextSeisNet achieves comparable field data performance despite being trained on 90% less data, demonstrating substantial data efficiency. These results establish ContextSeisNet as a practical approach for spatially consistent seismic demultiple with potential applicability to other seismic processing tasks.
https://arxiv.org/abs/2512.11575
We present Flowception, a novel non-autoregressive and variable-length video generation framework. Flowception learns a probability path that interleaves discrete frame insertions with continuous frame denoising. Compared to autoregressive methods, Flowception alleviates error accumulation/drift as the frame insertion mechanism during sampling serves as an efficient compression mechanism to handle long-term context. Compared to full-sequence flows, our method reduces FLOPs for training three-fold, while also being more amenable to local attention variants, and allowing to learn the length of videos jointly with their content. Quantitative experimental results show improved FVD and VBench metrics over autoregressive and full-sequence baselines, which is further validated with qualitative results. Finally, by learning to insert and denoise frames in a sequence, Flowception seamlessly integrates different tasks such as image-to-video generation and video interpolation.
https://arxiv.org/abs/2512.11438
Accurate 3D plant models are crucial for computational phenotyping and physics-based simulation; however, current approaches face significant limitations. Learning-based reconstruction methods require extensive species-specific training data and lack editability. Procedural modeling offers parametric control but demands specialized expertise in geometric modeling and an in-depth understanding of complex procedural rules, making it inaccessible to domain scientists. We present FloraForge, an LLM-assisted framework that enables domain experts to generate biologically accurate, fully parametric 3D plant models through iterative natural language Plant Refinements (PR), minimizing the programming expertise required. Our framework leverages LLM-enabled co-design to refine Python scripts that generate parameterized plant geometries as hierarchical B-spline surface representations with botanical constraints, explicit control points, and parametric deformation functions. This representation can be easily tessellated into polygonal meshes with arbitrary precision, ensuring compatibility with functional structural plant analysis workflows such as light simulation, computational fluid dynamics, and finite element analysis. We demonstrate the framework on maize, soybean, and mung bean, fitting procedural models to empirical point cloud data through manual refinement of the Plant Descriptor (PD), a human-readable file format. The pipeline generates dual outputs: triangular meshes for visualization and triangular meshes with additional parametric metadata for quantitative analysis. This approach uniquely combines LLM-assisted template creation, mathematically continuous representations enabling both phenotyping and rendering, and direct parametric control through the PD. The framework democratizes sophisticated geometric modeling for plant science while maintaining mathematical rigor.
准确的3D植物模型对于计算表型分析和基于物理的仿真至关重要;然而,当前的方法面临着重大限制。基于学习的重建方法需要大量的特定物种训练数据,并且缺乏可编辑性。程序化建模提供了参数控制功能,但要求具备几何建模的专业知识以及对复杂程序规则有深入理解,这使得该方法对领域科学家来说难以掌握。 我们提出了FloraForge框架,这是一个由大语言模型(LLM)辅助的工具,它使领域专家能够通过迭代自然语言植物细化(PR),以生物准确性生成完全参数化的3D植物模型,并且不需要编程专业知识。我们的框架利用了LLM驱动的协同设计功能来优化Python脚本,这些脚本可以生成带有显式控制点和参数化变形函数的分层B样条曲面表示形式,并遵循植物学约束条件。这种表示形式可以轻松地被转化为具有任意精度的多边形网格,确保与功能性结构植物分析工作流程(如光照模拟、计算流体动力学以及有限元分析)兼容。 我们在玉米、大豆和绿豆上的案例研究中展示了该框架的应用情况,通过手动调整可读性强的文件——植物描述符(PD),以将程序化模型拟合到经验点云数据。整个管道生成了双输出:用于可视化的三角网格以及带有额外参数元数据用于定量分析的三角网格。 这种方法的独特之处在于它结合了LLM辅助模板创建、支持表型分析和渲染所需的数学连续表示,以及通过PD进行直接参数控制的能力。这种框架使复杂的几何建模技术能够普及到植物科学领域,并且保持了严谨的数学基础。
https://arxiv.org/abs/2512.11925
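The claim that a continuous parametric surface "can be easily tessellated into polygonal meshes with arbitrary precision" can be sketched concretely. The `leaf` function below is a hypothetical stand-in for a fitted B-spline leaf surface (the actual FloraForge representation uses hierarchical B-splines with botanical constraints); only the grid-to-triangles step is the point.

```python
import numpy as np

def tessellate(surface, nu=8, nv=6):
    """Tessellate a parametric surface f(u, v) -> (x, y, z) into a
    triangular mesh by sampling a regular (nu x nv) parameter grid."""
    us = np.linspace(0.0, 1.0, nu)
    vs = np.linspace(0.0, 1.0, nv)
    verts = np.array([surface(u, v) for u in us for v in vs])
    tris = []
    for i in range(nu - 1):
        for j in range(nv - 1):
            a = i * nv + j      # top-left corner of the grid cell
            b = a + nv          # same column, next row of the grid
            tris.append((a, b, a + 1))      # split each quad cell
            tris.append((a + 1, b, b + 1))  # into two triangles
    return verts, np.array(tris)

# hypothetical stand-in for a fitted, deformed B-spline leaf surface
def leaf(u, v):
    width = 0.3 * np.sin(np.pi * u)             # narrows at both ends
    return (u, (v - 0.5) * width, 0.1 * u * u)  # slight upward curl

verts, tris = tessellate(leaf, nu=16, nv=8)
print(verts.shape, tris.shape)
```

Raising `nu` and `nv` refines the mesh without touching the underlying parametric model, which is what makes one representation serve both fast visualization and fine meshes for CFD or finite element analysis.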
Speech-driven talking heads have recently emerged, enabling interactive avatars. However, real-world applications remain limited, as current methods are either visually high-fidelity but slow, or fast yet temporally unstable. Diffusion methods provide realistic image generation yet struggle in one-shot settings. Gaussian Splatting approaches run in real time, but inaccuracies in facial tracking or inconsistent Gaussian mappings lead to unstable outputs and video artifacts that are detrimental to realistic use cases. We address this problem by mapping Gaussian Splatting onto 3D Morphable Models to generate person-specific avatars. We introduce transformer-based prediction of model parameters, directly from audio, to drive temporal consistency. From monocular video and independent audio speech inputs, our method enables generation of real-time talking-head videos with competitive quantitative and qualitative performance.
https://arxiv.org/abs/2512.10939
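The "transformer-based prediction of model parameters directly from audio" step can be sketched in miniature. This is a toy single-head attention layer with random weights standing in for a trained model; the feature dimensions, parameter count, and linear head are all invented for illustration, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

def attention(q, k, v):
    """Scaled dot-product attention over the time axis."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def audio_to_3dmm(audio_feats, n_params=10):
    """Toy transformer layer: every output frame attends over the whole
    audio context, then a linear head emits 3DMM-style expression
    parameters. Attending across frames (rather than mapping each frame
    independently) is what gives temporal consistency a chance."""
    t, d = audio_feats.shape
    wq, wk, wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    head = rng.normal(scale=d ** -0.5, size=(d, n_params))
    ctx = attention(audio_feats @ wq, audio_feats @ wk, audio_feats @ wv)
    return ctx @ head  # (t, n_params): one parameter vector per frame

params = audio_to_3dmm(rng.normal(size=(25, 16)))  # 25 audio frames
print(params.shape)
```

The predicted per-frame parameter vectors would then drive the morphable-model-anchored Gaussians, so rendering stability comes from the smooth low-dimensional parameter trajectory rather than from per-frame image synthesis.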
We explore how large language models (LLMs) can enhance the proposal selection process at large user facilities, offering a scalable, consistent, and cost-effective alternative to traditional human review. Proposal selection depends on assessing the relative strength among submitted proposals; however, traditional human scoring often suffers from weak inter-proposal correlations and is subject to reviewer bias and inconsistency. A pairwise preference-based approach is logically superior, providing a more rigorous and internally consistent basis for ranking, but its quadratic workload makes it impractical for human reviewers. We address this limitation using LLMs. Leveraging the uniquely well-curated proposals and publication records from three beamlines at the Spallation Neutron Source (SNS), Oak Ridge National Laboratory (ORNL), we show that the LLM rankings correlate strongly with the human rankings (Spearman $\rho\simeq 0.2-0.8$, improving to $\geq 0.5$ after 10\% outlier removal). Moreover, LLM performance is no worse than that of human reviewers in identifying proposals with high publication potential, while costing over two orders of magnitude less. Beyond ranking, LLMs enable advanced analyses that are challenging for humans, such as quantitative assessment of proposal similarity via embedding models, which provides information crucial for review committees.
https://arxiv.org/abs/2512.10895
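The pairwise-preference idea and the Spearman comparison against a reference ranking can be sketched end to end. A noisy comparison stands in for the LLM judge, wins are aggregated Copeland-style (one point per pairwise win), and Spearman ρ is computed as the Pearson correlation of ranks (no tie correction); all numbers here are synthetic, not the SNS/ORNL data.

```python
import numpy as np

rng = np.random.default_rng(2)

def pairwise_rank(quality, noise=0.5):
    """Score n proposals from all n*(n-1)/2 pairwise judgments.
    The noisy comparison stands in for an LLM judge; this is the
    quadratic workload that is impractical for human reviewers."""
    n = len(quality)
    wins = np.zeros(n)
    for i in range(n):
        for j in range(i + 1, n):
            noisy_i = quality[i] + rng.normal(scale=noise)
            noisy_j = quality[j] + rng.normal(scale=noise)
            wins[i if noisy_i > noisy_j else j] += 1
    return wins

def spearman(x, y):
    """Spearman rho as Pearson correlation of ranks (ties ignored)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

quality = rng.normal(size=20)      # latent "true" proposal strength
rho = spearman(pairwise_rank(quality), quality)
print(round(rho, 2))
```

Even a fairly noisy judge recovers the latent ordering well here, because each proposal accumulates evidence from n−1 independent comparisons; that redundancy is what makes the pairwise scheme internally consistent.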
The notion of causal effect is fundamental across many scientific disciplines. Traditionally, quantitative researchers have studied causal effects at the level of variables; for example, how a certain drug dose (W) causally affects a patient's blood pressure (Y). However, in many modern data domains, the raw variables, such as pixels in an image or tokens in a language model, do not have the semantic structure needed to formulate meaningful causal questions. In this paper, we offer a more fine-grained perspective by studying causal effects at the level of events, drawing inspiration from probability theory, where core notions such as independence are first given for events and sigma-algebras before random variables enter the picture. Within the measure-theoretic framework of causal spaces, a recently introduced axiomatisation of causality, we first introduce several binary definitions that determine whether a causal effect is present, and prove properties linking causal effect to (in)dependence under an intervention measure. Further, we provide quantifying measures that capture the strength and nature of causal effects on events, and show that the common measures of treatment effect are recovered as special cases.
https://arxiv.org/abs/2512.11919
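A minimal worked example, assuming a toy structural model invented here for illustration (not the paper's measure-theoretic machinery): an event-level effect is present when an intervention on W changes the probability of an event of Y, and with the event "Y = 1" the difference of interventional probabilities reduces to the familiar average treatment effect.

```python
import numpy as np

# Toy structural model: noise U uniform on {0, 1, 2}, treatment W in
# {0, 1}, outcome Y = 1 if W + U >= 2 else 0. An intervention do(W = w)
# fixes W while leaving the noise distribution untouched.
U_VALS = np.array([0, 1, 2])
P_U = np.array([1 / 3, 1 / 3, 1 / 3])

def p_event_do(w, event):
    """P(Y in event | do(W = w)), by enumerating the noise U."""
    y = (w + U_VALS >= 2).astype(int)
    return float(P_U[np.isin(y, list(event))].sum())

event = {1}                   # the event "Y = 1"
p1 = p_event_do(1, event)     # 2/3: U in {1, 2} yields Y = 1
p0 = p_event_do(0, event)     # 1/3: only U = 2 yields Y = 1
has_effect = p1 != p0         # binary notion: effect present on this event
ate = p1 - p0                 # treatment effect recovered as a special case
print(has_effect, round(ate, 3))
```

For a richer event such as "Y lands in some set of values" the same `p_event_do` comparison applies unchanged, which is the sense in which the event-level definitions generalize the variable-level ones.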