LLMs often adopt an assertive language style even when making false claims. Such ``overconfident hallucinations'' mislead users and erode trust. The ability to express in language the actual degree of uncertainty around a claim is therefore of great importance. We find that ``verbal uncertainty'' is governed by a single linear feature in the representation space of LLMs, and show that this feature has only a moderate correlation with the model's actual ``semantic uncertainty''. Applying this insight, we show that (1) the mismatch between semantic and verbal uncertainty is a better predictor of hallucinations than semantic uncertainty alone, and (2) we can intervene on verbal uncertainty at inference time to reduce hallucinations on short-form answers, achieving an average relative reduction of 32%.
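The probing-and-steering idea behind the abstract can be illustrated with a toy sketch: estimate a linear "verbal uncertainty" direction as a difference of class means over hidden states, then add a scaled copy of it at inference time. All data, dimensions, and the scaling factor below are hypothetical placeholders, not the authors' code.

```python
# Toy sketch (not the authors' implementation): a difference-of-means linear feature
# for verbal uncertainty, plus an additive inference-time intervention.
import numpy as np

rng = np.random.default_rng(0)
d = 64                                            # hidden size (placeholder)
H_hedged = rng.normal(0.5, 1.0, (200, d))         # hidden states of hedged answers (fake data)
H_assertive = rng.normal(-0.5, 1.0, (200, d))     # hidden states of assertive answers (fake data)

# Single linear feature: difference of class means, unit-normalized.
direction = H_hedged.mean(axis=0) - H_assertive.mean(axis=0)
direction /= np.linalg.norm(direction)

def verbal_uncertainty_score(h: np.ndarray) -> float:
    """Project a hidden state onto the verbal-uncertainty direction."""
    return float(h @ direction)

def steer(h: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Inference-time intervention: push the hidden state toward hedged phrasing."""
    return h + alpha * direction

h = rng.normal(size=d)
print(verbal_uncertainty_score(h), verbal_uncertainty_score(steer(h)))
```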
https://arxiv.org/abs/2503.14477
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in the OpenAI o1 blog and the DeepSeek R1 technical report), so the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecoupled Clip and $\textbf{D}$ynamic s$\textbf{A}$mpling $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{DAPO}$) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
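Two ingredients named in the algorithm's title can be sketched compactly: a decoupled (asymmetric) clip range in the PPO-style surrogate objective, and dynamic sampling that drops prompt groups whose rewards carry no learning signal. The hyperparameter values and data below are illustrative assumptions, not the authors' settings.

```python
# Hedged sketch of a decoupled-clip surrogate and a dynamic-sampling filter.
import numpy as np

def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Token-level surrogate with separate lower/upper clip bounds (illustrative values)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantage
    return np.minimum(unclipped, clipped)

def dynamic_sampling_filter(groups):
    """Keep only prompt groups whose sampled rewards are not all identical;
    all-correct or all-wrong groups yield zero advantage and are resampled."""
    return [g for g in groups if len(set(g["rewards"])) > 1]

groups = [
    {"prompt": "p1", "rewards": [1, 1, 1, 1]},   # no signal -> dropped
    {"prompt": "p2", "rewards": [0, 1, 0, 1]},   # kept
]
print([g["prompt"] for g in dynamic_sampling_filter(groups)])
print(clipped_surrogate(np.array([0.7, 1.4]), np.array([1.0, 1.0])))
```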
https://arxiv.org/abs/2503.14476
The field of Novel View Synthesis has been revolutionized by 3D Gaussian Splatting (3DGS), which enables high-quality scene reconstruction that can be rendered in real time. 3DGS-based techniques typically suffer from high GPU memory and disk storage requirements, which limit their practical application on consumer-grade devices. We propose Opti3DGS, a novel frequency-modulated coarse-to-fine optimization framework that aims to minimize the number of Gaussian primitives used to represent a scene, thus reducing memory and storage demands. Opti3DGS leverages image frequency modulation, initially enforcing a coarse scene representation and progressively refining it by modulating frequency details in the training images. On the baseline 3DGS, we demonstrate an average reduction of 62% in Gaussians, a 40% reduction in training GPU memory requirements, and a 20% reduction in optimization time, without sacrificing visual quality. Furthermore, we show that our method integrates seamlessly with many 3DGS-based techniques, consistently reducing the number of Gaussian primitives while maintaining, and often improving, visual quality. Additionally, Opti3DGS inherently produces a level-of-detail scene representation at no extra cost, a natural byproduct of the optimization pipeline. Results and code will be made publicly available.
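One plausible way to realize the "frequency-modulated coarse-to-fine" training targets described above is to low-pass-filter the training images with a blur whose strength decays over the optimization; the paper's exact schedule may differ, and the function and values below are only an illustration of the idea.

```python
# Minimal sketch of a coarse-to-fine frequency schedule for training images.
import numpy as np
from scipy.ndimage import gaussian_filter

def modulated_target(image: np.ndarray, step: int, total_steps: int,
                     sigma_max: float = 8.0) -> np.ndarray:
    """Return a low-pass-filtered copy of `image` whose cutoff rises with training progress."""
    progress = step / max(total_steps, 1)
    sigma = sigma_max * (1.0 - progress)          # strong blur early, none at the end
    return gaussian_filter(image, sigma=sigma) if sigma > 0 else image

image = np.random.rand(128, 128)                  # placeholder training image
for step in (0, 5000, 10000):
    target = modulated_target(image, step, total_steps=10000)
    print(step, float(np.abs(image - target).mean()))   # residual high-frequency detail
```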
https://arxiv.org/abs/2503.14475
Different attribution scores have been proposed to quantify the relevance of database tuples to a query answer. Among them are Causal Responsibility, the Shapley Value, the Banzhaf Power-Index, and the Causal Effect. They have been analyzed in isolation, mainly in terms of their computational properties. In this work, we start an investigation into the alignment of these scores on the basis of the queries at hand; that is, into whether they induce compatible rankings of tuples. We are able to identify vast classes of queries for which some pairs of scores are always aligned, and others for which they are not. It turns out that the presence of exogenous tuples makes a crucial difference in this regard.
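As a concrete anchor for one of the scores mentioned, the Shapley value of an endogenous tuple for a Boolean query can be computed by enumerating coalitions of the remaining endogenous tuples, with exogenous tuples always present. The query and database below are toy examples, not taken from the paper.

```python
# Illustrative exact Shapley value of a tuple for a Boolean query.
from itertools import combinations
from math import factorial

def shapley(tuple_of_interest, endogenous, exogenous, query):
    others = [t for t in endogenous if t != tuple_of_interest]
    n = len(endogenous)
    value = 0.0
    for k in range(len(others) + 1):
        for subset in combinations(others, k):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            with_t = query(set(subset) | {tuple_of_interest} | exogenous)
            without_t = query(set(subset) | exogenous)
            value += weight * (with_t - without_t)
    return value

# Toy Boolean query: "is there an R-tuple joining an S-tuple on value 1?"
def query(db):
    return int(("R", 1) in db and ("S", 1) in db)

endogenous = [("R", 1), ("S", 1), ("R", 2)]
exogenous = set()
print(shapley(("R", 1), endogenous, exogenous, query))   # 0.5: the two essential tuples split the credit
```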
https://arxiv.org/abs/2503.14469
The computer vision community has developed numerous techniques for digitally restoring true scene information from single-view degraded photographs, an important yet extremely ill-posed task. In this work, we tackle image restoration from a different perspective by jointly denoising multiple photographs of the same scene. Our core hypothesis is that degraded images capturing a shared scene contain complementary information that, when combined, better constrains the restoration problem. To this end, we implement a powerful multi-view diffusion model that jointly generates uncorrupted views by extracting rich information from multi-view relationships. Our experiments show that our multi-view approach outperforms existing single-view image and even video-based methods on image deblurring and super-resolution tasks. Critically, our model is trained to output 3D consistent images, making it a promising tool for applications requiring robust multi-view integration, such as 3D reconstruction or pose estimation.
https://arxiv.org/abs/2503.14463
We present RWKV-7 "Goose", a new sequence modeling architecture, along with pre-trained language models that establish a new state-of-the-art in downstream performance at the 3 billion parameter scale on multilingual tasks, and match current SoTA English language performance despite being trained on dramatically fewer tokens than other top 3B models. Nevertheless, RWKV-7 models require only constant memory usage and constant inference time per token. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to $\mathsf{TC}^0$. To demonstrate RWKV-7's language modeling capability, we also present an extended open source 3.1 trillion token multilingual corpus, and train four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset. To foster openness, reproduction, and adoption, we release our models and dataset component listing at this https URL, and our training and inference code at this https URL all under the Apache 2.0 License.
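To give a flavour of the recurrence the abstract refers to, here is a deliberately simplified delta-rule state update with a vector-valued decay ("gating") and an in-context learning rate. The actual RWKV-7 formulation is more general (and includes the relaxed value-replacement rule); the shapes and constants below are placeholders.

```python
# Simplified delta-rule step with vector gating; constant memory and constant time per token.
import numpy as np

def delta_rule_step(S, k, v, w, beta):
    """One recurrent step.
    S    : (d, d) matrix-valued state ("fast weights")
    k, v : (d,) key and value for the current token
    w    : (d,) vector-valued decay/gate in (0, 1]
    beta : in-context learning rate (scalar here; per-channel in general)
    """
    S = S * w[None, :]                 # gated decay applied per key channel
    prediction = S @ k                 # what the current state recalls for this key
    error = v - prediction             # delta-rule error
    S = S + beta * np.outer(error, k)  # write the correction back into the state
    return S

d = 8
S = np.zeros((d, d))
rng = np.random.default_rng(0)
for _ in range(4):                     # sequence loop: state size never grows
    k, v = rng.normal(size=d), rng.normal(size=d)
    S = delta_rule_step(S, k, v, w=np.full(d, 0.95), beta=0.5)
```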
https://arxiv.org/abs/2503.14456
We introduce a Reinforcement Learning (RL)-based method for re-synthesis of quantum circuits containing arbitrary Pauli rotations alongside Clifford operations. By collapsing each sub-block to a compact representation and then synthesizing it step-by-step through a learned heuristic, we obtain circuits that are both shorter and compliant with hardware connectivity constraints. We find that the method is fast enough and good enough to work as an optimization procedure: in direct comparisons on 6-qubit random Pauli Networks against state-of-the-art heuristic methods, our RL approach yields over 2x reduction in two-qubit gate count, while executing in under 10 milliseconds per circuit. We further integrate the method into a collect-and-re-synthesize pipeline, applied as a Qiskit transpiler pass, where we observe average improvements of 20% in two-qubit gate count and depth, reaching up to 60% for many instances, across the Benchpress benchmark. These results highlight the potential of RL-driven synthesis to significantly improve circuit quality in realistic, large-scale quantum transpilation workloads.
https://arxiv.org/abs/2503.14448
We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300 times.
https://arxiv.org/abs/2503.14445
Automated feature engineering plays a critical role in improving predictive model performance for tabular learning tasks. Traditional automated feature engineering methods are limited by their reliance on pre-defined transformations within fixed, manually designed search spaces, often neglecting domain knowledge. Recent advances using Large Language Models (LLMs) have enabled the integration of domain knowledge into the feature engineering process. However, existing LLM-based approaches use direct prompting or rely solely on validation scores for feature selection, failing to leverage insights from prior feature discovery experiments or establish meaningful reasoning between feature generation and data-driven performance. To address these challenges, we propose LLM-FE, a novel framework that combines evolutionary search with the domain knowledge and reasoning capabilities of LLMs to automatically discover effective features for tabular learning tasks. LLM-FE formulates feature engineering as a program search problem, where LLMs propose new feature transformation programs iteratively, and data-driven feedback guides the search process. Our results demonstrate that LLM-FE consistently outperforms state-of-the-art baselines, significantly enhancing the performance of tabular prediction models across diverse classification and regression benchmarks.
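The "feature engineering as program search" loop can be sketched as follows: an LLM proposes feature-transformation programs, a validation score provides the data-driven feedback, and the fittest programs survive. `llm_propose_program` is a hypothetical placeholder for the LLM call, and the dataset and transformations are toy stand-ins, not LLM-FE's actual prompts or operators.

```python
# Hedged sketch of an evolutionary feature-search loop guided by validation scores.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def llm_propose_program(best_programs):
    """Placeholder for an LLM call that mutates/combines prior feature programs;
    here it simply returns one of two hand-written transformations."""
    candidates = [
        lambda X: np.hstack([X, X[:, :1] * X[:, 1:2]]),        # interaction feature
        lambda X: np.hstack([X, np.log1p(np.abs(X[:, :1]))]),  # log feature
    ]
    return candidates[np.random.randint(len(candidates))]

def evaluate(program, X, y):
    return cross_val_score(LogisticRegression(max_iter=1000), program(X), y, cv=3).mean()

X = np.random.randn(200, 4)
y = (X[:, 0] * X[:, 1] > 0).astype(int)              # toy target that needs an interaction
population = [(lambda X: X, evaluate(lambda X: X, X, y))]
for _ in range(4):                                   # evolutionary search iterations
    program = llm_propose_program([p for p, _ in population])
    population.append((program, evaluate(program, X, y)))
    population.sort(key=lambda item: item[1], reverse=True)
    population = population[:3]                      # keep the fittest programs
print("best validation score:", population[0][1])
```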
https://arxiv.org/abs/2503.14434
Common subword tokenization algorithms like BPE and UnigramLM assume that text can be split into meaningful units by concatenative measures alone. This is not true for languages such as Hebrew and Arabic, where morphology is encoded in root-template patterns, or Malay and Georgian, where split affixes are common. We present SPLINTER, a pre-processing step which rearranges text into a linear form that better represents such nonconcatenative morphologies, enabling meaningful contiguous segments to be found by the tokenizer. We demonstrate SPLINTER's merit using both intrinsic measures evaluating token vocabularies in Hebrew, Arabic, and Malay; as well as on downstream tasks using BERT-architecture models trained for Hebrew.
https://arxiv.org/abs/2503.14433
Large language models (LLMs) are increasingly integrated with specialized external tools, yet many tasks demand zero-shot tool usage with minimal or noisy documentation. Existing solutions rely on manual rewriting or labeled data for validation, making them inapplicable in true zero-shot settings. To address these challenges, we propose PLAY2PROMPT, an automated framework that systematically "plays" with each tool to explore its input-output behaviors. Through this iterative trial-and-error process, PLAY2PROMPT refines tool documentation and generates usage examples without any labeled data. These examples not only guide LLM inference but also serve as validation to further enhance tool utilization. Extensive experiments on real-world tasks demonstrate that PLAY2PROMPT significantly improves zero-shot tool performance across both open and closed models, offering a scalable and effective solution for domain-specific tool integration.
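The "play with the tool" idea can be illustrated with a simple trial-and-error loop: probe the tool with candidate inputs, record successful input-output pairs as usage examples, and record failures as documentation constraints. The tool, the trial inputs, and the resulting document structure below are toy assumptions, not the PLAY2PROMPT implementation (whose trials would be LLM-proposed).

```python
# Hedged sketch of tool exploration that refines documentation without labeled data.
import json

def weather_tool(city: str, unit: str = "C"):
    """Toy external tool with sparse documentation."""
    if unit not in ("C", "F"):
        raise ValueError("unit must be 'C' or 'F'")
    return {"city": city, "temp": 21 if unit == "C" else 70}

trial_inputs = [                      # in the real system these would be LLM-proposed
    {"city": "Paris"},
    {"city": "Paris", "unit": "K"},   # deliberately wrong, to discover the constraint
    {"city": "Tokyo", "unit": "F"},
]

usage_examples, error_notes = [], []
for kwargs in trial_inputs:           # iterative trial-and-error exploration
    try:
        usage_examples.append({"input": kwargs, "output": weather_tool(**kwargs)})
    except Exception as exc:
        error_notes.append({"input": kwargs, "error": str(exc)})

refined_doc = {
    "tool": "weather_tool",
    "examples": usage_examples,       # later guide zero-shot LLM inference
    "constraints": error_notes,       # refine the (noisy) documentation
}
print(json.dumps(refined_doc, indent=2))
```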
https://arxiv.org/abs/2503.14432
Text-to-video (T2V) generation has made significant strides with diffusion models. However, existing methods still struggle with accurately binding attributes, determining spatial relationships, and capturing complex action interactions between multiple subjects. To address these limitations, we propose MagicComp, a training-free method that enhances compositional T2V generation through dual-phase refinement. Specifically, (1) during the conditioning stage, we introduce Semantic Anchor Disambiguation, which reinforces subject-specific semantics and resolves inter-subject ambiguity by progressively injecting the directional vectors of semantic anchors into the original text embeddings; (2) during the denoising stage, we propose Dynamic Layout Fusion Attention, which integrates grounding priors and model-adaptive spatial perception to flexibly bind subjects to their spatiotemporal regions through masked attention modulation. Furthermore, MagicComp is a model-agnostic and versatile approach, which can be seamlessly integrated into existing T2V architectures. Extensive experiments on T2V-CompBench and VBench demonstrate that MagicComp outperforms state-of-the-art methods, highlighting its potential for applications such as complex prompt-based and trajectory-controllable video generation. Project page: this https URL.
https://arxiv.org/abs/2503.14428
Escape rooms present a unique cognitive challenge that demands exploration-driven planning: players should actively search their environment, continuously update their knowledge based on new discoveries, and connect disparate clues to determine which elements are relevant to their objectives. Motivated by this, we introduce VisEscape, a benchmark of 20 virtual escape rooms specifically designed to evaluate AI models under these challenging conditions, where success depends not only on solving isolated puzzles but also on iteratively constructing and refining spatial-temporal knowledge of a dynamically changing environment. On VisEscape, we observed that even state-of-the-art multimodal models generally fail to escape the rooms, showing considerable variation in their levels of progress and trajectories. To address this issue, we propose VisEscaper, which effectively integrates Memory, Feedback, and ReAct modules and demonstrates significant improvements, performing on average 3.7 times more effectively and 5.0 times more efficiently.
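A generic agent loop combining the three named modules might look like the sketch below: a ReAct-style reason/act step, a memory of discoveries, and feedback on the last action. The environment, the `propose_action` heuristic (standing in for the multimodal LLM), and the control flow are all hypothetical, not VisEscaper's code.

```python
# Toy sketch of a Memory + Feedback + ReAct agent loop in an escape-room setting.
def propose_action(observation, memory, feedback):
    """Placeholder for the LLM: reason over observation, memory, and feedback, then act."""
    if "key" in memory and "locked door" in observation:
        return "use key on door"
    return "pick up key" if "key" in observation else "explore"

memory, feedback = set(), None
observation = "a key on the table, a locked door"
for step in range(5):
    action = propose_action(observation, memory, feedback)   # ReAct: reason then act
    if action == "pick up key":
        memory.add("key")                                     # Memory: persist discoveries
        observation, feedback = "a locked door", "you now hold a key"
    elif action == "use key on door":
        print(f"escaped at step {step}")
        break
    else:
        feedback = "nothing new found"                        # Feedback: critique the last step
```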
https://arxiv.org/abs/2503.14427
The ever-growing realism and quality of generated videos make it increasingly hard for humans to spot deepfake content, forcing them to rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at this https URL.
https://arxiv.org/abs/2503.14421
Social platforms have expanded opportunities for deliberation, with comments being used to inform one's opinion. However, using such information to form opinions is challenged by unsubstantiated or false content. To enhance the quality of opinion formation and potentially confer resistance to misinformation, we developed Iffy-Or-Not (ION), a browser extension that seeks to invoke critical thinking when reading texts. With three features guided by argumentation theory, ION highlights fallacious content, suggests diverse queries to probe it with, and offers deeper questions to consider and chat with others about. In a user study (N=18), we found that ION encourages users to be more attentive to the content, suggests queries that align with or are preferable to their own, and poses thought-provoking questions that expand their perspectives. However, some participants expressed aversion to ION due to misalignments with their information goals and thinking predispositions. Potential backfiring effects of ION are discussed.
https://arxiv.org/abs/2503.14412
Temporal graph neural networks (TGNNs) have shown remarkable performance in temporal graph modeling. However, real-world temporal graphs often possess rich textual information, giving rise to temporal text-attributed graphs (TTAGs). Such a combination of dynamic text semantics and evolving graph structures introduces heightened complexity. Existing TGNNs embed texts statically and rely heavily on encoding mechanisms that are biased toward structural information, overlooking the temporal evolution of text semantics and the essential interplay between semantics and structures that enables synergistic reinforcement. To tackle these issues, we present \textbf{Cross}, a novel framework that seamlessly extends existing TGNNs for TTAG modeling. The key idea is to employ advanced large language models (LLMs) to extract the dynamic semantics in text space and then generate expressive representations unifying both semantics and structures. Specifically, we propose a Temporal Semantics Extractor in the Cross framework, which empowers the LLM to offer a temporal semantic understanding of a node's evolving contexts of textual neighborhoods, facilitating semantic dynamics. Subsequently, we introduce the Semantic-structural Co-encoder, which collaborates with the above Extractor to synthesize illuminating representations by jointly considering both semantic and structural information while encouraging their mutual reinforcement. Extensive experimental results on four public datasets and one practical industrial dataset demonstrate Cross's significant effectiveness and robustness.
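One simple way to picture "jointly considering semantic and structural information" is a gated fusion of an LLM-derived text embedding with a temporal-GNN structural embedding for the same node. The actual Cross co-encoder is more elaborate; the gate, dimensions, and random vectors below are illustrative placeholders only.

```python
# Toy sketch of gated semantic-structural fusion for a single node representation.
import numpy as np

def gated_fusion(semantic, structural, W_gate):
    """A per-dimension gate decides how much semantic vs. structural signal to keep."""
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([semantic, structural]) @ W_gate)))
    return gate * semantic + (1.0 - gate) * structural

d = 16
rng = np.random.default_rng(0)
semantic = rng.normal(size=d)       # e.g., LLM embedding of the node's evolving textual neighborhood
structural = rng.normal(size=d)     # e.g., temporal GNN embedding of the same node
W_gate = rng.normal(size=(2 * d, d)) * 0.1
node_repr = gated_fusion(semantic, structural, W_gate)
print(node_repr.shape)
```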
https://arxiv.org/abs/2503.14411
Co-speech gestures convey a wide variety of meanings and play an important role in face-to-face human interactions. These gestures significantly influence the addressee's engagement, recall, comprehension, and attitudes toward the speaker. Similarly, they impact interactions between humans and embodied virtual agents. The process of selecting and animating meaningful gestures has thus become a key focus in the design of these agents. However, automating this gesture selection process poses a significant challenge. Prior gesture generation techniques have ranged from fully automated, data-driven methods, which often struggle to produce contextually meaningful gestures, to more manual approaches that require specific gesture-crafting expertise, are time-consuming, and lack generalizability. In this paper, we leverage the semantic capabilities of Large Language Models to develop a gesture selection approach that suggests meaningful, appropriate co-speech gestures. We first describe how information on gestures is encoded into GPT-4. Then, we conduct a study to evaluate alternative prompting approaches for their ability to select meaningful, contextually relevant gestures and to align them appropriately with the co-speech utterance. Finally, we detail and demonstrate how this approach has been implemented within a virtual agent system, automating the selection and subsequent animation of the selected gestures for enhanced human-agent interactions.
https://arxiv.org/abs/2503.14408
Recent multi-teacher distillation methods have unified the encoders of multiple foundation models into a single encoder, achieving competitive performance on core vision tasks like classification, segmentation, and depth estimation. This led us to ask: Could similar success be achieved when the pool of teachers also includes vision models specialized in diverse tasks across both 2D and 3D perception? In this paper, we define and investigate the problem of heterogeneous teacher distillation, or co-distillation, a challenging multi-teacher distillation scenario where teacher models vary significantly in both (a) their design objectives and (b) the data they were trained on. We explore data-sharing strategies and teacher-specific encoding, and introduce DUNE, a single encoder excelling in 2D vision, 3D understanding, and 3D human perception. Our model achieves performance comparable to that of its larger teachers, sometimes even outperforming them, on their respective tasks. Notably, DUNE surpasses MASt3R in Map-free Visual Relocalization with a much smaller encoder.
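Heterogeneous multi-teacher distillation can be pictured as one shared student encoder with a lightweight projection head per teacher and a per-teacher feature-matching loss. The dimensions, the cosine loss, and the uniform averaging below are illustrative assumptions, not DUNE's exact recipe.

```python
# Toy sketch of co-distillation from heterogeneous teachers via teacher-specific heads.
import numpy as np

rng = np.random.default_rng(0)
d_student = 32
teachers = {"2d_vision": 48, "3d_understanding": 64, "human_perception": 40}

student_feat = rng.normal(size=d_student)                      # shared student features
heads = {name: rng.normal(size=(d_student, d_t)) * 0.1         # teacher-specific projections
         for name, d_t in teachers.items()}
teacher_feats = {name: rng.normal(size=d_t) for name, d_t in teachers.items()}  # placeholder targets

def cosine_distance(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

losses = {name: cosine_distance(student_feat @ heads[name], teacher_feats[name])
          for name in teachers}
total_loss = sum(losses.values()) / len(losses)                # average over teachers
print(losses, total_loss)
```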
https://arxiv.org/abs/2503.14405
Facial Aesthetics Enhancement (FAE) aims to improve facial attractiveness by adjusting the structure and appearance of a facial image while preserving its identity as much as possible. Most existing methods adopt deep feature-based or score-based guidance for generation models to conduct FAE. Although these methods achieve promising results, they can produce excessively beautified results with lower identity consistency, or insufficiently improved facial attractiveness. To enhance facial aesthetics with less loss of identity, we propose Nearest Neighbor Structure Guidance based on Diffusion (NNSG-Diffusion), a diffusion-based FAE method that beautifies a 2D facial image with 3D structure guidance. Specifically, we propose to extract FAE guidance from a nearest-neighbor reference face. To allow for less change of facial structures in the FAE process, a 3D face model is recovered by referring to both the matched 2D reference face and the 2D input face, so that depth and contour guidance can be extracted from the 3D face model. The depth and contour clues then provide effective guidance to Stable Diffusion with ControlNet for FAE. Extensive experiments demonstrate that our method is superior to previous relevant methods in enhancing facial aesthetics while preserving facial identity.
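For readers unfamiliar with the guidance mechanism, here is a minimal sketch of depth-conditioned Stable Diffusion via ControlNet using the diffusers library. In NNSG-Diffusion the depth and contour maps would be rendered from the 3D face recovered with the nearest-neighbor reference; here the depth map is a random placeholder, the model IDs are common public checkpoints chosen for illustration, and the prompt is invented.

```python
# Hedged sketch: depth-guided Stable Diffusion generation with ControlNet (diffusers).
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Placeholder depth map; the method would render this from the recovered 3D face model.
depth_map = Image.fromarray((np.random.rand(512, 512) * 255).astype(np.uint8)).convert("RGB")

result = pipe(
    "a portrait photo of the same person with subtly enhanced facial aesthetics",
    image=depth_map,                 # structural guidance keeps the facial geometry stable
    num_inference_steps=30,
).images[0]
result.save("enhanced_face.png")
```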
https://arxiv.org/abs/2503.14402
Purpose: Accurate 3D MRI-ultrasound (US) deformable registration is critical for real-time guidance in high-dose-rate (HDR) prostate brachytherapy. We present a weakly supervised spatial implicit neural representation (SINR) method to address modality differences and pelvic anatomy challenges. Methods: The framework uses sparse surface supervision from MRI/US segmentations instead of dense intensity matching. SINR models deformations as continuous spatial functions, with patient-specific surface priors guiding a stationary velocity field for biologically plausible deformations. Validation included 20 public Prostate-MRI-US-Biopsy cases and 10 institutional HDR cases, evaluated via Dice similarity coefficient (DSC), mean surface distance (MSD), and 95% Hausdorff distance (HD95). Results: The proposed method achieved robust registration. For the public dataset, prostate DSC was $0.93 \pm 0.05$, MSD $0.87 \pm 0.10$ mm, and HD95 $1.58 \pm 0.37$ mm. For the institutional dataset, prostate CTV achieved DSC $0.88 \pm 0.09$, MSD $1.21 \pm 0.38$ mm, and HD95 $2.09 \pm 1.48$ mm. Bladder and rectum performance was lower due to ultrasound's limited field of view. Visual assessments confirmed accurate alignment with minimal discrepancies. Conclusion: This study introduces a novel weakly supervised SINR-based approach for 3D MRI-US deformable registration. By leveraging sparse surface supervision and spatial priors, it achieves accurate, robust, and computationally efficient registration, enhancing real-time image guidance in HDR prostate brachytherapy and improving treatment precision.
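The three reported metrics can be computed from a pair of binary segmentation masks as sketched below. This is a generic reference implementation, not the authors' evaluation code; it assumes isotropic 1 mm voxel spacing and uses boundary voxels as the surface.

```python
# Hedged sketch of Dice (DSC), mean surface distance (MSD), and 95% Hausdorff distance (HD95).
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import cdist

def surface_points(mask):
    """Voxel coordinates on the boundary of a binary mask."""
    boundary = mask & ~binary_erosion(mask)
    return np.argwhere(boundary).astype(float)

def registration_metrics(a, b):
    dsc = 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())
    pa, pb = surface_points(a), surface_points(b)
    d_ab = cdist(pa, pb).min(axis=1)      # each A-surface point to its nearest B-surface point
    d_ba = cdist(pb, pa).min(axis=1)
    msd = (d_ab.mean() + d_ba.mean()) / 2.0
    hd95 = max(np.percentile(d_ab, 95), np.percentile(d_ba, 95))
    return dsc, msd, hd95

a = np.zeros((32, 32, 32), dtype=bool); a[8:24, 8:24, 8:24] = True
b = np.zeros_like(a);                    b[9:25, 8:24, 8:24] = True   # shifted by one voxel
print(registration_metrics(a, b))
```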
https://arxiv.org/abs/2503.14395