Flow matching (FM) has driven remarkable progress in generative modeling, demonstrating strong generative capabilities and attracting significant attention as a simulation-free flow-based framework capable of learning exact data densities. Motivated by these advances, we propose LatentFM, a flow-based model operating in the latent space for medical image segmentation. To model the data distribution, we first design two variational autoencoders (VAEs) to encode both medical images and their corresponding masks into a lower-dimensional latent space. We then estimate a conditional velocity field that guides the flow based on the input image. By sampling multiple latent representations, our method synthesizes diverse segmentation outputs whose pixel-wise variance reliably captures the underlying data distribution, enabling both highly accurate and uncertainty-aware predictions. Furthermore, we generate confidence maps that quantify model certainty, providing clinicians with richer information for deeper analysis. We conduct experiments on two datasets, ISIC-2018 and CVC-Clinic, and compare our method with several prior baselines, including both deterministic and generative models. Through comprehensive evaluations, both qualitative and quantitative results show that our approach achieves superior segmentation accuracy while remaining highly efficient in the latent space.
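The flow-matching recipe the abstract relies on (linear noise-to-data paths whose velocity a network regresses, then ODE integration at sampling time) can be illustrated in a few lines. The interpolant, Euler sampler, and names below are illustrative assumptions, not LatentFM's actual implementation:

```python
import random

def fm_training_pair(z0, z1):
    """Rectified-flow-style training pair on the linear path from noise z0
    to a data latent z1. Returns (t, z_t, target_velocity); a conditional
    network v_theta(z_t, t, image) would be regressed onto the target.
    Sketch only; the paper's exact interpolant and conditioning are assumed.
    """
    t = random.random()
    z_t = [(1 - t) * a + t * b for a, b in zip(z0, z1)]
    v_target = [b - a for a, b in zip(z0, z1)]  # d z_t / d t for the linear path
    return t, z_t, v_target

def euler_sample(v_field, z0, steps=10):
    """Integrate dz/dt = v(z, t) from t=0 to t=1 with fixed Euler steps."""
    z, dt = list(z0), 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = v_field(z, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z
```

Sampling several latents `z0` and integrating each one is what yields the diverse segmentations whose pixel-wise variance the abstract describes.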
https://arxiv.org/abs/2512.04821
The rapid progress of Large Language Models (LLMs) has transformed natural language processing and broadened its impact across research and society. Yet, systematic evaluation of these models, especially for languages beyond English, remains limited. "Challenging the Abilities of LAnguage Models in ITAlian" (CALAMITA) is a large-scale collaborative benchmarking initiative for Italian, coordinated under the Italian Association for Computational Linguistics. Unlike existing efforts that focus on leaderboards, CALAMITA foregrounds methodology: it federates more than 80 contributors from academia, industry, and the public sector to design, document, and evaluate a diverse collection of tasks, covering linguistic competence, commonsense reasoning, factual consistency, fairness, summarization, translation, and code generation. Through this process, we not only assembled a benchmark of over 20 tasks and almost 100 subtasks, but also established a centralized evaluation pipeline that supports heterogeneous datasets and metrics. We report results for four open-weight LLMs, highlighting systematic strengths and weaknesses across abilities, as well as challenges in task-specific evaluation. Beyond quantitative results, CALAMITA exposes methodological lessons: the necessity of fine-grained, task-representative metrics, the importance of harmonized pipelines, and the benefits and limitations of broad community engagement. CALAMITA is conceived as a rolling benchmark, enabling continuous integration of new tasks and models. This makes it both a resource -- the most comprehensive and diverse benchmark for Italian to date -- and a framework for sustainable, community-driven evaluation. We argue that this combination offers a blueprint for other languages and communities seeking inclusive and rigorous LLM evaluation practices.
https://arxiv.org/abs/2512.04759
Depth completion plays a vital role in 3D perception systems, especially in scenarios where sparse depth data must be densified for tasks such as autonomous driving, robotics, and augmented reality. While many existing approaches rely on semantic segmentation to guide depth completion, they often overlook the benefits of object-level understanding. In this work, we introduce an instance-aware depth completion framework that explicitly integrates binary instance masks as spatial priors to refine depth predictions. Our model combines four main components: a frozen YOLO V11 instance segmentation branch, a U-Net-based depth completion backbone, a cross-attention fusion module, and an attention-guided prediction head. The instance segmentation branch generates per-image foreground masks that guide the depth branch via cross-attention, allowing the network to focus on object-centric regions during refinement. We validate our method on the Virtual KITTI 2 dataset, showing that it achieves lower RMSE compared to both a U-Net-only baseline and previous semantic-guided methods, while maintaining competitive MAE. Qualitative and quantitative results demonstrate that the proposed model effectively enhances depth accuracy near object boundaries, occlusions, and thin structures. Our findings suggest that incorporating instance-aware cues offers a promising direction for improving depth completion without relying on dense semantic labels.
https://arxiv.org/abs/2512.04734
Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.
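The EMA-Sink update described above can be sketched as a fixed-size buffer that starts from the initial frames and folds in each token evicted from the sliding window. The decay value and the slot-wise fusion rule are illustrative assumptions:

```python
class EMASink:
    """Fixed-size sink tokens updated by an exponential moving average over
    evicted window tokens (a sketch of the EMA-Sink mechanism; the decay
    and one-to-one slot pairing are assumptions for illustration)."""

    def __init__(self, init_tokens, decay=0.9):
        # Initialize sink slots from the initial-frame tokens.
        self.tokens = [list(t) for t in init_tokens]
        self.decay = decay

    def absorb(self, evicted):
        # Fuse each token leaving the sliding window into its sink slot,
        # so the sink tracks recent dynamics instead of staying static.
        for slot, tok in zip(self.tokens, evicted):
            for j, v in enumerate(tok):
                slot[j] = self.decay * slot[j] + (1 - self.decay) * v
```

Because the sink keeps a fixed number of slots, attention cost is unchanged while the sink content drifts toward recent frames rather than remaining a frozen copy of the first ones.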
https://arxiv.org/abs/2512.04678
The advancement of embodied AI has unlocked significant potential for intelligent humanoid robots. However, progress in both Vision-Language-Action (VLA) models and world models is severely hampered by the scarcity of large-scale, diverse training data. A promising solution is to "robotize" web-scale human videos, which has been proven effective for policy training. However, these solutions mainly "overlay" robot arms to egocentric videos, which cannot handle complex full-body motions and scene occlusions in third-person videos, making them unsuitable for robotizing humans. To bridge this gap, we introduce X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and finetunes it for the human-to-humanoid translation task. This finetuning requires paired human-humanoid videos, so we designed a scalable data creation pipeline, turning community assets into 17+ hours of paired synthetic videos using Unreal Engine. We then apply our trained model to 60 hours of the Ego-Exo4D videos, generating and releasing a new large-scale dataset of over 3.6 million "robotized" humanoid video frames. Quantitative analysis and user studies confirm our method's superiority over existing baselines: 69% of users rated it best for motion consistency, and 62.1% for embodiment correctness.
https://arxiv.org/abs/2512.04537
Recent advances in diffusion models have brought remarkable progress in image and video editing, yet some tasks remain underexplored. In this paper, we introduce a new task, Object Retexture, which transfers local textures from a reference object to a target object in images or videos. To perform this task, a straightforward solution is to use ControlNet conditioned on the source structure and the reference texture. However, this approach suffers from limited controllability for two reasons: conditioning on the raw reference image introduces unwanted structural information, and it fails to disentangle the visual texture and structure information of the source. To address this problem, we propose Refaçade, a method that consists of two key designs to achieve precise and controllable texture transfer in both images and videos. First, we employ a texture remover trained on paired textured/untextured 3D mesh renderings to remove appearance information while preserving the geometry and motion of source videos. Second, we disrupt the reference global layout using a jigsaw permutation, encouraging the model to focus on local texture statistics rather than the global layout of the object. Extensive experiments demonstrate superior visual quality, precise editing, and controllability, outperforming strong baselines in both quantitative and human evaluations. Code is available at this https URL.
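The jigsaw permutation used to disrupt the reference's global layout can be sketched as a tile shuffle over the reference image; the grid size and seeding are illustrative choices, not the paper's exact configuration:

```python
import random

def jigsaw_permute(img, grid=2, seed=0):
    """Split a 2D image (list of rows) into grid x grid tiles and shuffle
    them. This destroys the global layout while preserving local texture
    statistics, which is the property the reference conditioning needs.
    Assumes grid divides both image dimensions."""
    h, w = len(img), len(img[0])
    th, tw = h // grid, w // grid
    tiles = [[[img[r][c] for c in range(x, x + tw)] for r in range(y, y + th)]
             for y in range(0, h, th) for x in range(0, w, tw)]
    random.Random(seed).shuffle(tiles)
    out = [[0] * w for _ in range(h)]
    for idx, tile in enumerate(tiles):
        y0, x0 = (idx // grid) * th, (idx % grid) * tw
        for r in range(th):
            for c in range(tw):
                out[y0 + r][x0 + c] = tile[r][c]
    return out
```

The output contains exactly the same pixels as the input, only rearranged, so local texture statistics survive while object structure does not.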
https://arxiv.org/abs/2512.04534
As a challenging video editing task, movie trailer generation involves selecting and reorganizing movie shots to create engaging trailers. Currently, most existing automatic trailer generation methods employ a "selection-then-ranking" paradigm (i.e., first selecting key shots and then ranking them), which suffers from inevitable error propagation and limits the quality of the generated trailers. Beyond this paradigm, we propose a new self-paced and self-corrective masked prediction method called SSMP, which achieves state-of-the-art results in automatic trailer generation via bi-directional contextual modeling and progressive self-correction. In particular, SSMP trains a Transformer encoder that takes the movie shot sequences as prompts and generates corresponding trailer shot sequences accordingly. The model is trained via masked prediction, reconstructing each trailer shot sequence from its randomly masked counterpart. The mask ratio is self-paced, allowing the task difficulty to adapt to the model and thereby improving model performance. When generating a movie trailer, the model fills the shot positions with high confidence at each step and re-masks the remaining positions for the next prediction, forming a progressive self-correction mechanism that is analogous to how human editors work. Both quantitative results and user studies demonstrate the superiority of SSMP in comparison to existing automatic movie trailer generation methods. Demo is available at: this https URL.
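The progressive fill-and-re-mask loop can be sketched in the style of masked-prediction decoders: each step commits the highest-confidence positions and re-masks the rest for the next round. The keep schedule and the `predict` interface below are assumptions for illustration, not SSMP's exact procedure:

```python
def self_corrective_decode(predict, length, steps=3):
    """Progressive masked decoding (sketch of SSMP's generation loop).

    `predict` maps the current partially-masked sequence to one
    (token, confidence) pair per position. Each step commits the most
    confident predictions and re-masks everything else, so earlier
    choices can still be revised, analogous to a human editor's passes.
    """
    MASK = None
    seq = [MASK] * length
    for step in range(1, steps + 1):
        proposals = predict(seq)                  # [(token, confidence)] per position
        keep = max(1, (length * step) // steps)   # commit more positions each step
        ranked = sorted(range(length), key=lambda i: -proposals[i][1])
        new_seq = [MASK] * length
        for i in ranked[:keep]:
            new_seq[i] = proposals[i][0]
        seq = new_seq
    return seq
```

A self-paced mask ratio at training time plays the mirror role: the fraction of masked positions adapts to the model, so the reconstruction task stays neither trivial nor impossible.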
https://arxiv.org/abs/2512.04426
We introduce MVRoom, a controllable novel view synthesis (NVS) pipeline for 3D indoor scenes that uses multi-view diffusion conditioned on a coarse 3D layout. MVRoom employs a two-stage design in which the 3D layout is used throughout to enforce multi-view consistency. The first stage employs novel representations to effectively bridge the 3D layout and consistent image-based condition signals for multi-view generation. The second stage performs image-conditioned multi-view generation, incorporating a layout-aware epipolar attention mechanism to enhance multi-view consistency during the diffusion process. Additionally, we introduce an iterative framework that generates 3D scenes with varying numbers of objects and scene complexities by recursively performing multi-view generation (MVRoom), supporting text-to-scene generation. Experimental results demonstrate that our approach achieves high-fidelity and controllable 3D scene generation for NVS, outperforming state-of-the-art baseline methods both quantitatively and qualitatively. Ablation studies further validate the effectiveness of key components within our generation pipeline.
https://arxiv.org/abs/2512.04248
While text-to-video (T2V) generation has achieved remarkable progress in photorealism, generating intent-aligned videos that faithfully obey physics principles remains a core challenge. In this work, we systematically study Newtonian motion-controlled text-to-video generation and evaluation, emphasizing physical precision and motion coherence. We introduce MoReGen, a motion-aware, physics-grounded T2V framework that integrates multi-agent LLMs, physics simulators, and renderers to generate reproducible, physically accurate videos from text prompts in the code domain. To quantitatively assess physical validity, we propose object-trajectory correspondence as a direct evaluation metric and present MoReSet, a benchmark of 1,275 human-annotated videos spanning nine classes of Newtonian phenomena with scene descriptions, spatiotemporal relations, and ground-truth trajectories. Using MoReSet, we conduct experiments on existing T2V models, evaluating their physical validity through both our MoRe metrics and existing physics-based evaluators. Our results reveal that state-of-the-art models struggle to maintain physical validity, while MoReGen establishes a principled direction toward physically coherent video synthesis.
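As a rough illustration of scoring object-trajectory correspondence, one can compare a generated object's trajectory to the ground-truth one by mean per-frame distance; this simple distance is an assumption for concreteness, not MoReGen's exact metric:

```python
def trajectory_correspondence(gen, ref):
    """Mean per-frame Euclidean distance between a generated 2D object
    trajectory and its ground-truth counterpart (lower is better).
    An illustrative stand-in for the paper's correspondence metric."""
    assert len(gen) == len(ref), "trajectories must cover the same frames"
    total = 0.0
    for (x1, y1), (x2, y2) in zip(gen, ref):
        total += ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
    return total / len(gen)
```

Because MoReSet ships ground-truth trajectories, a metric of this shape can be evaluated directly against annotations rather than via a learned judge.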
https://arxiv.org/abs/2512.04221
Adapting large language models (LLMs) to low-resource languages remains a major challenge due to data scarcity and cross-lingual drift. This work presents a two-stage adaptation of Qwen2.5-3B to Tibetan, a morphologically rich and underrepresented language. We employ Continual Pretraining (CPT) to establish Tibetan linguistic grounding, followed by Supervised Fine-Tuning (SFT) for task and translation specialization. Empirical evaluations demonstrate a consistent decrease in perplexity (from 2.98 $\rightarrow$ 1.54) and substantial improvements in Chinese$\rightarrow$Tibetan translation quality (BLEU: 0.046 $\rightarrow$ 0.261; chrF: 2.2 $\rightarrow$ 6.6). Layer-wise analysis across 435 layers in Qwen3-4B reveals that adaptation primarily concentrates on embedding and output heads, with mid--late MLP projections encoding domain-specific transformations. Our findings suggest that CPT constructs a Tibetan semantic manifold while SFT sharpens task alignment with minimal representational disruption. This study provides the first quantitative exploration of Tibetan adaptation dynamics for LLMs, and offers an open, reproducible framework for extending multilingual foundation models to low-resource settings.
https://arxiv.org/abs/2512.03976
Industrial automation increasingly requires flexible control strategies that can adapt to changing tasks and environments. Agents based on Large Language Models (LLMs) offer potential for such adaptive planning and execution but lack standardized benchmarks for systematic comparison. We introduce a benchmark with an executable simulation environment representing the Blocksworld problem providing five complexity categories. By integrating the Model Context Protocol (MCP) as a standardized tool interface, diverse agent architectures can be connected to and evaluated against the benchmark without implementation-specific modifications. A single-agent implementation demonstrates the benchmark's applicability, establishing quantitative metrics for comparison of LLM-based planning and execution approaches.
https://arxiv.org/abs/2512.03955
We present CaravelMetrics, a computational framework for automated cerebrovascular analysis that models vessel morphology through skeletonization-derived graph representations. The framework integrates atlas-based regional parcellation, centerline extraction, and graph construction to compute fifteen morphometric, topological, fractal, and geometric features. The features can be estimated globally from the complete vascular network or regionally within arterial territories, enabling multiscale characterization of cerebrovascular organization. Applied to 570 3D TOF-MRA scans from the IXI dataset (ages 20-86), CaravelMetrics yields reproducible vessel graphs capturing age- and sex-related variations and education-associated increases in vascular complexity, consistent with findings reported in the literature. The framework provides a scalable and fully automated approach for quantitative cerebrovascular feature extraction, supporting normative modeling and population-level studies of vascular health and aging.
https://arxiv.org/abs/2512.03869
We present CloseUpAvatar, a novel approach for articulated human avatar representation that handles more general camera motions while preserving rendering quality for close-up views. CloseUpAvatar represents an avatar as a set of textured planes with two sets of learnable textures for low- and high-frequency detail. The method automatically switches to high-frequency textures only for cameras positioned close to the avatar's surface and gradually reduces their impact as the camera moves farther away. Such parametrization enables CloseUpAvatar to adjust rendering quality based on camera distance, ensuring realistic rendering across a wider range of camera orientations than previous approaches. We provide experiments using the ActorsHQ dataset with high-resolution input images. CloseUpAvatar demonstrates both qualitative and quantitative improvements over existing methods when rendering from a wide range of novel camera positions, while maintaining high FPS by limiting the number of required primitives.
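The distance-dependent switch between texture sets can be sketched as a weight that is 1 near the surface and decays to 0 far from it; the linear falloff and the `near`/`far` thresholds are illustrative assumptions, not the paper's actual schedule:

```python
def hf_texture_weight(cam_dist, near=0.5, far=2.0):
    """Weight of the high-frequency texture as a function of camera
    distance: 1 at or inside `near`, 0 at or beyond `far`, linear in
    between. Falloff shape and thresholds are assumptions."""
    if cam_dist <= near:
        return 1.0
    if cam_dist >= far:
        return 0.0
    return (far - cam_dist) / (far - near)

def blend_texel(low, high, cam_dist, near=0.5, far=2.0):
    """Blend one low- and one high-frequency texel by camera distance."""
    w = hf_texture_weight(cam_dist, near, far)
    return (1 - w) * low + w * high
```

The gradual blend avoids a visible pop when the camera crosses the switching distance, while distant views pay no cost for detail they cannot resolve.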
https://arxiv.org/abs/2512.03593
The RNA inverse folding problem, a key challenge in RNA design, involves identifying nucleotide sequences that can fold into desired secondary structures, which are critical for ensuring molecular stability and function. The inherent complexity of this task stems from the intricate relationship between sequence and structure, making it particularly challenging. In this paper, we propose a framework, named HyperRNA, a generative model with an encoder-decoder architecture that leverages hypergraphs to design RNA sequences. Specifically, our HyperRNA model consists of three main components: preprocessing, encoding and decoding. In the preprocessing stage, graph structures are constructed by extracting the atom coordinates of RNA backbone based on 3-bead coarse-grained representation. The encoding stage processes these graphs, capturing higher order dependencies and complex biomolecular interactions using an attention embedding module and a hypergraph-based encoder. Finally, the decoding stage generates the RNA sequence in an autoregressive manner. We conducted quantitative and qualitative experiments on the PDBBind and RNAsolo datasets to evaluate the inverse folding task for RNA sequence generation and RNA-protein complex sequence generation. The experimental results demonstrate that HyperRNA not only outperforms existing RNA design methods but also highlights the potential of leveraging hypergraphs in RNA engineering.
https://arxiv.org/abs/2512.03592
Scene Text Editing (STE) involves replacing text in a scene image with new target text while preserving both the original text style and background texture. Existing methods suffer from two major challenges: inconsistency and length-insensitivity. They often fail to maintain coherence between the edited local patch and the surrounding area, and they struggle to handle significant differences in text length before and after editing. To tackle these challenges, we propose an end-to-end framework called Global-Local Aware Scene Text Editing (GLASTE), which simultaneously incorporates high-level global contextual information along with delicate local features. Specifically, we design a global-local combination structure, joint global and local losses, and enhance text image features to ensure consistency in text style within local patches while maintaining harmony between local and global areas. Additionally, we express the text style as a vector independent of the image size, which can be transferred to target text images of various sizes. We use an affine fusion to fill target text images into the editing patch while maintaining their aspect ratio unchanged. Extensive experiments on real-world datasets validate that our GLASTE model outperforms previous methods in both quantitative metrics and qualitative results and effectively mitigates the two challenges.
https://arxiv.org/abs/2512.03574
Articulated object generation has seen increasing advancements, yet existing models often lack the ability to be conditioned on text prompts. To address the significant gap between textual descriptions and 3D articulated object representations, we propose GAOT, a three-phase framework that generates articulated objects from text prompts by leveraging diffusion models and hypergraph learning. First, we fine-tune a point cloud generation model to produce a coarse representation of objects from text prompts. Given the inherent connection between articulated objects and graph structures, we design a hypergraph-based learning method to refine these coarse representations, representing object parts as graph vertices. Finally, leveraging a diffusion model, the joints of articulated objects, represented as graph edges, are generated based on the object parts. Extensive qualitative and quantitative experiments on the PartNet-Mobility dataset demonstrate the effectiveness of our approach, achieving superior performance over previous methods.
https://arxiv.org/abs/2512.03566
This paper presents a preliminary investigation into automated dance movement analysis using contemporary computer vision techniques. We propose a proof-of-concept framework that integrates YOLOv8 and v11 for dancer detection with the Segment Anything Model (SAM) for precise segmentation, enabling the tracking and quantification of dancer movements in video recordings without specialized equipment or markers. Our approach identifies dancers within video frames, counts discrete dance steps, calculates spatial coverage patterns, and measures rhythm consistency across performance sequences. Testing this framework on a single 49-second recording of Ghanaian AfroBeats dance demonstrates technical feasibility, with the system achieving approximately 94% detection precision and 89% recall on manually inspected samples. The pixel-level segmentation provided by SAM, achieving approximately 83% intersection-over-union with visual inspection, enables motion quantification that captures body configuration changes beyond what bounding-box approaches can represent. Analysis of this preliminary case study indicates that the dancer classified as primary by our system executed 23% more steps with 37% higher motion intensity and utilized 42% more performance space compared to dancers classified as secondary. However, this work represents an early-stage investigation with substantial limitations including single-video validation, absence of systematic ground truth annotations, and lack of comparison with existing pose estimation methods. We present this framework to demonstrate technical feasibility, identify promising directions for quantitative dance metrics, and establish a foundation for future systematic validation studies.
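The intersection-over-union figure quoted above is, for binary masks, the standard ratio of overlapping to covered pixels; a minimal implementation for concreteness:

```python
def mask_iou(a, b):
    """Intersection-over-union of two same-sized binary masks (lists of
    0/1 rows), the measure used to compare SAM segmentations against
    visual inspection. Returns 1.0 when both masks are empty."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 1.0
```

Unlike bounding-box overlap, this pixel-level score credits or penalizes exactly the body-configuration detail the abstract argues boxes cannot represent.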
https://arxiv.org/abs/2512.03509
Autoregressive large language models (LLMs) trained on Next-Token Prediction (NTP) often suffer from ``Topic Drift'' where the generation wanders away from the initial prompt due to a reliance on local associations rather than global planning \citep{holtzman2019curious}. While scaling model size mitigates this \citep{brown2020language}, the fundamental myopia of the NTP objective remains. In this work, we introduce the Idea-Gated Transformer, a novel architecture that separates semantic planning from syntactic generation. We introduce an auxiliary ``Idea Head'' trained to predict the bag-of-words distribution for a future context window, creating a latent ``Concept Vector'' that actively gates the main vocabulary during generation. We propose a differentiable gating mechanism that suppresses semantically irrelevant tokens, effectively pruning the search space in real-time. Experiments on WikiText-103 demonstrate that while the Idea-Gated model achieves comparable validation perplexity to a standard GPT-2 baseline, it exhibits significantly superior Domain Retention. Qualitative and quantitative analysis reveals that the gating mechanism successfully locks generation into specific semantic clusters (e.g., Finance, Science) and resists associative drift, offering a parameter-efficient path toward more controllable language modeling.
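The gating idea, suppressing logits of tokens the concept vector deems irrelevant, can be sketched as a differentiable penalty added to each vocabulary logit. The sigmoid relevance and the `alpha` scale below are illustrative assumptions, since the paper's gate is learned end-to-end:

```python
import math

def gate_logits(logits, concept_scores, alpha=5.0):
    """Softly suppress off-concept tokens: map each token's concept score
    through a sigmoid to a relevance in (0, 1) and add alpha * log(relevance)
    to its logit. Relevant tokens are barely touched; irrelevant ones are
    pushed far down, pruning the effective vocabulary while staying
    differentiable. Sketch only; the exact gating function is assumed."""
    gated = []
    for logit, score in zip(logits, concept_scores):
        relevance = 1.0 / (1.0 + math.exp(-score))
        gated.append(logit + alpha * math.log(relevance))
    return gated
```

Because the penalty is a smooth function of the concept scores, gradients flow back into the Idea Head, which is what lets the concept prediction and the generator train jointly.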
https://arxiv.org/abs/2512.03343
A quarter century ago, Wikipedia's decentralized, crowdsourced, and consensus-driven model replaced the centralized, expert-driven, and authority-based standard for encyclopedic knowledge curation. The emergence of generative AI encyclopedias, such as Grokipedia, may represent another shift in epistemic evolution. This study investigates whether AI- and human-curated encyclopedias rely on the same foundations of authority. We conducted a multi-scale comparative analysis of the citation networks from 72 matched article pairs, which cite a total of almost 60,000 sources. Using an 8-category epistemic classification, we mapped the "epistemic profiles" of the articles on each platform. Our findings reveal several quantitative and qualitative differences in how knowledge is sourced and encyclopedia claims are epistemologically justified. Grokipedia replaces Wikipedia's heavy reliance on peer-reviewed "Academic & Scholarly" work with a notable increase in "User-generated" and "Civic organization" sources. Comparative network analyses further show that Grokipedia employs very different epistemological profiles when sourcing leisure topics (such as Sports and Entertainment) and more societally sensitive civic topics (such as Politics & Conflicts, Geographical Entities, and General Knowledge & Society). Finally, we find a "scaling-law for AI-generated knowledge sourcing" that shows a linear relationship between article length and citation density, which is distinct from collective human reference sourcing. We conclude that this first implementation of an LLM-based encyclopedia does not merely automate knowledge production but restructures it. Given the notable changes and the important role of encyclopedias, we suggest the continuation and deepening of algorithm audits, such as the one presented here, in order to understand the ongoing epistemological shifts.
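A linear length-density relationship of the kind reported can be recovered with an ordinary least-squares fit. The numbers below are synthetic stand-ins for illustration only, not the paper's data.

```python
import numpy as np

# Synthetic (length, citation-density) pairs standing in for per-article
# measurements; a real audit would use the 72 matched article pairs.
lengths = np.array([1000.0, 2000.0, 4000.0, 8000.0])  # words per article
density = np.array([0.012, 0.019, 0.041, 0.078])      # citations per word

# Degree-1 least-squares fit: density ~ slope * length + intercept
slope, intercept = np.polyfit(lengths, density, 1)
pred = slope * lengths + intercept
r = np.corrcoef(density, pred)[0, 1]  # goodness of the linear fit
```

A positive slope with a correlation near 1 is what a "scaling law" of this form would look like; comparing slopes fitted separately to the AI-generated and human-curated corpora is one way to test whether the two sourcing regimes differ.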
https://arxiv.org/abs/2512.03337
This study presents an LLM-assisted annotation pipeline for the sociolinguistic and topical analysis of bilingual discourse in two typologically distinct contexts: Spanish-English and Spanish-Guaraní. Using large language models, we automatically labeled topic, genre, and discourse-pragmatic functions across a total of 3,691 code-switched sentences, integrated demographic metadata from the Miami Bilingual Corpus, and enriched the Spanish-Guaraní dataset with new topic annotations. The resulting distributions reveal systematic links between gender, language dominance, and discourse function in the Miami data, and a clear diglossic division between formal Guaraní and informal Spanish in Paraguayan texts. These findings replicate and extend earlier interactional and sociolinguistic observations with corpus-scale quantitative evidence. The study demonstrates that large language models can reliably recover interpretable sociolinguistic patterns traditionally accessible only through manual annotation, advancing computational methods for cross-linguistic and low-resource bilingual research.
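One step of such an annotation pipeline might look like the sketch below. The label sets, prompt wording, and JSON validation are hypothetical, since the abstract does not specify them; the point is that constraining the model to a closed label set and validating its reply keeps automatic annotations auditable.

```python
import json

TOPICS = ["family", "work", "politics", "leisure"]    # hypothetical label set
FUNCTIONS = ["quotation", "emphasis", "topic-shift"]  # hypothetical label set

def build_prompt(sentence):
    """Build a labeling prompt that restricts the LLM to closed label sets
    and asks for machine-checkable JSON output."""
    return (
        "Label this code-switched sentence.\n"
        f"Sentence: {sentence}\n"
        f"topic (one of {TOPICS}), function (one of {FUNCTIONS}).\n"
        'Answer as JSON: {"topic": ..., "function": ...}'
    )

def parse_label(raw):
    """Validate the model's reply; reject any label outside the closed sets
    so malformed or hallucinated answers never enter the corpus."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if obj.get("topic") in TOPICS and obj.get("function") in FUNCTIONS:
        return {"topic": obj["topic"], "function": obj["function"]}
    return None
```

Rejected replies can be re-queried or routed to manual annotation, which is how a pipeline like this preserves the reliability the study reports.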
https://arxiv.org/abs/2512.03334