This paper makes three key contributions. First, via a substantial corpus of 51,278 interview questions sourced from 888 YouTube videos of mock interviews of Indian civil service candidates, we demonstrate stark gender bias in the broad nature of questions asked of male and female candidates. Second, our experiments with large language models show a strong presence of gender bias in explanations provided by the LLMs on the gender inference task. Finally, we present a novel dataset of 51,278 interview questions that can inform future social science studies.
https://arxiv.org/abs/2409.12194
Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra "thinking" really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model's response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
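A selective-CoT setup of the kind the paper recommends could be prototyped by routing only symbolic-looking questions through a chain-of-thought prompt. The sketch below is a hedged illustration: the looks_symbolic heuristic and the prompt templates are assumptions for demonstration, not the paper's actual gating criterion.

```python
import re

def looks_symbolic(question: str) -> bool:
    # Rough proxy for "math or logic": equals signs, operator-adjacent digits,
    # or common math keywords (mirroring the paper's equals-sign observation).
    return bool(re.search(r"=|\d\s*[-+*/^]\s*\d|\b(prove|solve|equation|integral)\b",
                          question, re.IGNORECASE))

def build_prompt(question: str) -> str:
    # Spend CoT tokens only where they are likely to help; answer directly otherwise.
    if looks_symbolic(question):
        return f"{question}\nLet's think step by step."
    return f"{question}\nAnswer with the final answer only."

if __name__ == "__main__":
    for q in ["What is the capital of France?", "Solve for x: 3x + 5 = 20"]:
        print(build_prompt(q), end="\n\n")
```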
https://arxiv.org/abs/2409.12183
While significant progress has been made on the text-to-SQL task, recent solutions repeatedly encode the same database schema for every question, resulting in unnecessarily high inference costs and often overlooking crucial database knowledge. To address these issues, we propose You Only Read Once (YORO), a novel paradigm that directly internalizes database knowledge into the parametric knowledge of a text-to-SQL model during training and eliminates the need for schema encoding during inference. YORO significantly reduces the input token length by 66%-98%. Despite its shorter inputs, our empirical results demonstrate YORO's competitive performance with traditional systems on three benchmarks, as well as its significant outperformance on large databases. Furthermore, YORO excels in handling questions with challenging value retrievals, such as abbreviations.
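The token savings come from dropping the per-question schema encoding at inference time. The toy comparison below uses a made-up two-table schema and question purely to show the difference in input length; it is not YORO's actual prompt format.

```python
# Hypothetical schema for illustration; real database schemas are much larger.
SCHEMA = (
    "CREATE TABLE singer(singer_id INT, name TEXT, country TEXT, age INT);\n"
    "CREATE TABLE concert(concert_id INT, singer_id INT, year INT, venue TEXT);"
)

def conventional_prompt(question: str) -> str:
    # The schema is re-encoded for every single question.
    return f"-- Database schema:\n{SCHEMA}\n-- Question: {question}\n-- SQL:"

def yoro_style_prompt(question: str) -> str:
    # Schema knowledge is assumed to live in the model's parameters,
    # so only the question is sent at inference time.
    return f"-- Question: {question}\n-- SQL:"

question = "How many singers are from France?"
long_p, short_p = conventional_prompt(question), yoro_style_prompt(question)
print(len(long_p), "chars vs", len(short_p), "chars")
```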
https://arxiv.org/abs/2409.12172
In this report, we present a series of math-specific large language models: Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. The core innovation of the Qwen2.5 series lies in integrating the philosophy of self-improvement throughout the entire pipeline, from pre-training and post-training to inference: (1) During the pre-training phase, Qwen2-Math-Instruct is utilized to generate large-scale, high-quality mathematical data. (2) In the post-training phase, we develop a reward model (RM) by conducting massive sampling from Qwen2-Math-Instruct. This RM is then applied to the iterative evolution of data in supervised fine-tuning (SFT). With a stronger SFT model, it is possible to iteratively train and update the RM, which in turn guides the next round of SFT data iteration. On the final SFT model, we employ the ultimate RM for reinforcement learning, resulting in Qwen2.5-Math-Instruct. (3) Furthermore, during the inference stage, the RM is used to guide sampling, optimizing the model's performance. Qwen2.5-Math-Instruct supports both Chinese and English, and possesses advanced mathematical reasoning capabilities, including Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR). We evaluate our models on 10 mathematics datasets in both English and Chinese, such as GSM8K, MATH, GaoKao, AMC23, and AIME24, covering a range of difficulties from grade school level to math competition problems.
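Reward-model-guided sampling at inference time is commonly realized as best-of-N reranking: sample several candidate solutions and keep the one the RM scores highest. The sketch below assumes hypothetical generate and reward callables rather than the actual Qwen2.5-Math stack, and N=8 is an arbitrary illustrative choice.

```python
from typing import Callable, List

def best_of_n(question: str,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate solutions and keep the one the reward model scores highest."""
    candidates: List[str] = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda sol: reward(question, sol))

if __name__ == "__main__":
    import random
    # Stand-in generator and reward function, for demonstration only.
    fake_generate = lambda q: f"answer draft {random.randint(0, 9)}"
    fake_reward = lambda q, s: float(s[-1])  # pretend a higher trailing digit means a better solution
    print(best_of_n("Compute 12 * 13.", fake_generate, fake_reward))
```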
https://arxiv.org/abs/2409.12122
Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modeling techniques to audio data. However, audio codecs often operate at high frame rates, resulting in slow training and inference, especially for autoregressive models. To address this challenge, we present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps bitrate and 21.5 frames per second. We demonstrate that our novel codec can make the inference of LLM-based text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
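As a quick sanity check on the reported operating point, 1.89 kbps at 21.5 frames per second corresponds to roughly 88 bits of codec payload per frame (assuming kbps means 1000 bits per second):

```python
bitrate_bps = 1.89e3       # 1.89 kbps
frame_rate = 21.5          # codec frames per second
bits_per_frame = bitrate_bps / frame_rate
print(f"{bits_per_frame:.1f} bits per frame")  # ~87.9
```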
https://arxiv.org/abs/2409.12117
In the rapidly evolving field of machine learning, training models with datasets from various locations and organizations presents significant challenges due to privacy and legal concerns. The exploration of effective collaborative training settings capable of leveraging valuable knowledge from distributed and isolated datasets is increasingly crucial. This study investigates key factors that impact the effectiveness of collaborative training methods in code next-token prediction, as well as the correctness and utility of the generated code, demonstrating the promise of such methods. Additionally, we evaluate the memorization of different participant training data across various collaborative training settings, including centralized, federated, and incremental training, highlighting their potential risks in leaking data. Our findings indicate that the size and diversity of code datasets are pivotal factors influencing the success of collaboratively trained code models. We show that federated learning achieves competitive performance compared to centralized training while offering better data protection, as evidenced by lower memorization ratios in the generated code. However, federated learning can still produce verbatim code snippets from hidden training data, potentially violating privacy or copyright. Our study further explores effectiveness and memorization patterns in incremental learning, emphasizing the sequence in which individual participant datasets are introduced. We also identify cross-organizational clones as a prevalent challenge in both centralized and federated learning scenarios. Our findings highlight the persistent risk of data leakage during inference, even when training data remains unseen. We conclude with recommendations for practitioners and researchers to optimize multisource datasets, propelling cross-organizational collaboration forward.
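A simple way to approximate the memorization ratios discussed above is the share of generated lines that occur verbatim in a participant's training data. The function below is a generic illustration under that assumption; it is not the paper's exact metric, and the toy corpus is made up.

```python
def verbatim_line_ratio(generated_code: str, training_corpus: set) -> float:
    """Share of non-trivial generated lines that occur verbatim in the training corpus."""
    lines = [ln.strip() for ln in generated_code.splitlines() if len(ln.strip()) > 10]
    if not lines:
        return 0.0
    hits = sum(1 for ln in lines if ln in training_corpus)
    return hits / len(lines)

# Toy example with a made-up training corpus.
corpus = {"def add(a, b):", "return a + b"}
sample = "def add(a, b):\n    return a + b\n"
print(verbatim_line_ratio(sample, corpus))  # 1.0: every non-trivial line is a verbatim clone
```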
https://arxiv.org/abs/2409.12020
Real-time rendering of human head avatars is a cornerstone of many computer graphics applications, such as augmented reality, video games, and films, to name a few. Recent approaches address this challenge with computationally efficient geometry primitives in a carefully calibrated multi-view setup. Although these approaches produce photorealistic head renderings, they often fail to represent complex motion changes such as the mouth interior and strongly varying head poses. We propose a new method to generate highly dynamic and deformable human head avatars from multi-view imagery in real time. At the core of our method is a hierarchical representation of head models that allows us to capture the complex dynamics of facial expressions and head movements. First, with rich facial features extracted from raw input frames, we learn to deform the coarse facial geometry of the template mesh. We then initialize 3D Gaussians on the deformed surface and refine their positions in a fine step. We train this coarse-to-fine facial avatar model, along with the head pose as a learnable parameter, in an end-to-end framework. This enables not only controllable facial animation via video inputs, but also high-fidelity novel view synthesis of challenging facial expressions, such as tongue deformations and fine-grained teeth structure under large motion changes. Moreover, it encourages the learned head avatar to generalize towards new facial expressions and head poses at inference time. We demonstrate the performance of our method with comparisons against related methods on different datasets, spanning challenging facial expression sequences across multiple identities. We also show the potential application of our approach by demonstrating a cross-identity facial performance transfer application.
https://arxiv.org/abs/2409.11951
In this paper, we address the challenge of generating realistic 3D human motions for action classes that were never seen during the training phase. Our approach involves decomposing complex actions into simpler movements, specifically those observed during training, by leveraging the knowledge of human motion contained in GPT models. These simpler movements are then combined into a single, realistic animation using the properties of diffusion models. Our claim is that this decomposition and subsequent recombination of simple movements can synthesize an animation that accurately represents the complex input action. This method operates during the inference phase and can be integrated with any pre-trained diffusion model, enabling the synthesis of motion classes not present in the training data. We evaluate our method by dividing two benchmark human motion datasets into basic and complex actions, and then compare its performance against the state of the art.
https://arxiv.org/abs/2409.11920
Large Language Models (LLMs) can memorize sensitive information, raising concerns about potential misuse. LLM Unlearning, a post-hoc approach to remove this information from trained LLMs, offers a promising solution to mitigate these risks. However, previous practices face three key challenges: 1. Utility: successful unlearning often causes catastrophic collapse on unrelated tasks. 2. Efficiency: many methods either involve adding similarly sized models, which slows down unlearning or inference, or require retain data that are difficult to obtain. 3. Robustness: even effective methods may still leak data via extraction techniques. To address these challenges, we propose MEOW, a simple yet effective gradient descent-based unlearning method. Specifically, we use an offline LLM to generate a set of inverted facts. Then, we design a new metric, MEMO, to quantify memorization in LLMs. Finally, based on the signals provided by MEMO, we select the most appropriate set of inverted facts and finetune the model based on them. We evaluate MEOW on the commonly used unlearn benchmark, ToFU, with Llama2-7B-Chat and Phi-1.5B, and test it on both NLU and NLG tasks. Results demonstrate significant improvement of MEOW in forget quality without substantial loss in model utility. Meanwhile, MEOW does not exhibit significant degradation in NLU or NLG capabilities, and there is even a slight improvement in NLU performance.
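The pipeline in the abstract (generate inverted facts offline, quantify memorization with MEMO, pick the most suitable subset, then fine-tune) can be sketched as a selection step over candidate facts. In the sketch below, memo_score is a stand-in callable for MEMO, and the ranking direction and k are illustrative assumptions rather than the paper's choices.

```python
from typing import Callable, List, Tuple

def select_inverted_facts(inverted_candidates: List[str],
                          memo_score: Callable[[str], float],
                          k: int = 3) -> List[str]:
    """Rank candidate inverted facts by a memorization signal and keep k of them.

    memo_score stands in for the paper's MEMO metric; here it is just a callable
    returning a scalar memorization signal per candidate fact.
    """
    ranked: List[Tuple[float, str]] = sorted(
        ((memo_score(c), c) for c in inverted_candidates), reverse=True
    )
    return [fact for _, fact in ranked[:k]]

# Toy demonstration with a fake scoring function; the selected facts would then be
# used as fine-tuning targets in place of the memorized originals.
fake_memo = lambda text: float(len(text))
candidates = ["Alice was born in Paris.", "Alice was born in Rome.", "Alice was born in Oslo."]
print(select_inverted_facts(candidates, fake_memo, k=2))
```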
https://arxiv.org/abs/2409.11844
In recent years, speech diffusion models have advanced rapidly. Alongside the widely used U-Net architecture, transformer-based models such as the Diffusion Transformer (DiT) have also gained attention. However, current DiT speech models treat Mel spectrograms as general images, which overlooks the specific acoustic properties of speech. To address these limitations, we propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS), which builds on DiT and achieves fast training without compromising accuracy. Notably, DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties, enhancing the naturalness of the generated speech. Additionally, we introduce a fine-grained style temporal modeling method that further improves speaker style similarity. Experimental results demonstrate that our method increases the training speed by nearly 2 times and significantly outperforms the baseline models.
https://arxiv.org/abs/2409.11835
The recent development of deep learning large models in medicine shows remarkable performance in medical image analysis and diagnosis, but their large number of parameters causes memory and inference latency challenges. Knowledge distillation offers a solution, but the slide-level gradients cannot be backpropagated for student model updates due to high-resolution pathological images and slide-level labels. This study presents an Efficient Fine-tuning on Compressed Models (EFCM) framework with two stages: unsupervised feature distillation and fine-tuning. In the distillation stage, Feature Projection Distillation (FPD) is proposed with a TransScan module for adaptive receptive field adjustment to enhance the knowledge absorption capability of the student model. In the slide-level fine-tuning stage, three strategies (Reuse CLAM, Retrain CLAM, and End2end Train CLAM (ETC)) are compared. Experiments are conducted on 11 downstream datasets related to three large medical models: RETFound for retina, MRM for chest X-ray, and BROW for histopathology. The experimental results demonstrate that the EFCM framework significantly improves accuracy and efficiency in handling slide-level pathological image problems, effectively addressing the challenges of deploying large medical models. Specifically, it achieves a 4.33% increase in ACC and a 5.2% increase in AUC compared to the large model BROW on the TCGA-NSCLC and TCGA-BRCA datasets. The analysis of model inference efficiency highlights the high efficiency of the distillation fine-tuning method.
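At its core, the unsupervised feature-distillation stage matches projected student features to frozen teacher features. The PyTorch sketch below shows a generic feature-projection distillation loss under that reading; it does not include the TransScan receptive-field module described in the paper, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class FeatureProjectionDistiller(nn.Module):
    """Project student features to the teacher's dimension and match them with MSE."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)
        self.criterion = nn.MSELoss()

    def forward(self, student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
        # Teacher features are treated as fixed targets (no gradient flows into the teacher).
        return self.criterion(self.proj(student_feats), teacher_feats.detach())

# Toy usage with random features standing in for patch embeddings.
distiller = FeatureProjectionDistiller(student_dim=384, teacher_dim=768)
loss = distiller(torch.randn(16, 384), torch.randn(16, 768))
loss.backward()
print(float(loss))
```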
https://arxiv.org/abs/2409.11817
Face recognition in the wild is now advancing towards light-weight models, fast inference speed and resolution-adapted capability. In this paper, we propose a bridge distillation approach to turn a complex face model pretrained on private high-resolution faces into a light-weight one for low-resolution face recognition. In our approach, such a cross-dataset resolution-adapted knowledge transfer problem is solved via two-step distillation. In the first step, we conduct cross-dataset distillation to transfer the prior knowledge from private high-resolution faces to public high-resolution faces and generate compact and discriminative features. In the second step, the resolution-adapted distillation is conducted to further transfer the prior knowledge to synthetic low-resolution faces via multi-task learning. By learning low-resolution face representations and mimicking the adapted high-resolution knowledge, a light-weight student model can be constructed with high efficiency and promising accuracy in recognizing low-resolution faces. Experimental results show that the student model performs impressively in recognizing low-resolution faces with only 0.21M parameters and 0.057MB memory. Meanwhile, its speed reaches up to 14,705, ~934 and 763 faces per second on GPU, CPU and mobile phone, respectively.
https://arxiv.org/abs/2409.11786
Understanding emotions is fundamental to human interaction and experience. Humans easily infer emotions from situations or facial expressions, situations from emotions, and perform a variety of other forms of affective cognition. How adept is modern AI at these inferences? We introduce an evaluation framework for testing affective cognition in foundation models. Starting from psychological theory, we generate 1,280 diverse scenarios exploring relationships between appraisals, emotions, expressions, and outcomes. We evaluate the abilities of foundation models (GPT-4, Claude-3, Gemini-1.5-Pro) and humans (N = 567) across carefully selected conditions. Our results show foundation models tend to agree with human intuitions, matching or exceeding interparticipant agreement. In some conditions, models are "superhuman": they better predict modal human judgements than the average human. All models benefit from chain-of-thought reasoning. This suggests foundation models have acquired a human-like understanding of emotions and their influence on beliefs and behavior.
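The "superhuman" comparison reduces to checking how often a model's answer matches the modal (most common) human judgement versus how often a typical human rater does. The toy version below, with made-up emotion labels and raters, is one plausible way to compute that comparison rather than the paper's exact protocol.

```python
from collections import Counter
from statistics import mean

def modal_agreement(model_answers, human_answers_per_item):
    """Return (model accuracy, mean individual-human accuracy) against the modal human judgement."""
    model_hits, human_hits = [], []
    for model_ans, human_answers in zip(model_answers, human_answers_per_item):
        modal_ans, _ = Counter(human_answers).most_common(1)[0]
        model_hits.append(model_ans == modal_ans)
        human_hits.append(mean(a == modal_ans for a in human_answers))
    return mean(model_hits), mean(human_hits)

# Toy data: 2 scenarios, 5 human raters each.
model = ["joy", "anger"]
humans = [["joy", "joy", "surprise", "joy", "joy"],
          ["anger", "fear", "anger", "fear", "fear"]]
print(modal_agreement(model, humans))  # a model beating the second number would be "superhuman" here
```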
https://arxiv.org/abs/2409.11733
3D Gaussian Splatting has emerged as a powerful 3D scene representation technique, capturing fine details with high efficiency. In this paper, we introduce a novel voting-based method that extends 2D segmentation models to 3D Gaussian splats. Our approach leverages masked gradients, where gradients are filtered by input 2D masks, and these gradients are used as votes to achieve accurate segmentation. As a byproduct, we discovered that inference-time gradients can also be used to prune Gaussians, resulting in up to 21% compression. Additionally, we explore few-shot affordance transfer, allowing annotations from 2D images to be effectively transferred onto 3D Gaussian splats. The robust yet straightforward mathematical formulation underlying this approach makes it a highly effective tool for numerous downstream applications, such as augmented reality (AR), object editing, and robotics. The project code and additional resources are available at this https URL.
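In spirit, the voting scheme backpropagates a loss restricted to the 2D mask and reads off per-Gaussian gradient magnitudes as votes, and the same magnitudes can be reused to prune Gaussians that barely contribute. The PyTorch sketch below uses a stand-in renderer and illustrative thresholds; it is not the paper's implementation.

```python
import torch

def masked_gradient_votes(gaussian_params: torch.Tensor, render_fn, mask: torch.Tensor) -> torch.Tensor:
    """Return one vote (gradient magnitude) per Gaussian from a mask-restricted loss.

    render_fn is a stand-in for a differentiable Gaussian-splatting renderer that maps
    per-Gaussian parameters to an image tensor with the same shape as the mask.
    """
    params = gaussian_params.clone().requires_grad_(True)
    image = render_fn(params)
    (image * mask).sum().backward()        # gradients only flow from pixels inside the mask
    return params.grad.abs().sum(dim=-1)   # one scalar vote per Gaussian

# Toy example: a fake "renderer" whose output depends quadratically on the parameters.
fake_render = lambda p: (p ** 2).sum() * torch.ones(4, 4)
votes = masked_gradient_votes(torch.randn(100, 8), fake_render, torch.ones(4, 4))
selected = votes > votes.mean()            # segmentation: Gaussians with strong votes
keep = votes > 0.05 * votes.max()          # pruning: drop Gaussians that barely contribute
print(int(selected.sum()), "selected,", int(keep.sum()), "kept")
```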
https://arxiv.org/abs/2409.11681
Goal recognition (GR) involves inferring an agent's unobserved goal from a sequence of observations. This is a critical problem in AI with diverse applications. Traditionally, GR has been addressed using 'inference to the best explanation' or abduction, where hypotheses about the agent's goals are generated as the most plausible explanations for observed behavior. Alternatively, some approaches enhance interpretability by ensuring that an agent's behavior aligns with an observer's expectations or by making the reasoning behind decisions more transparent. In this work, we tackle a different challenge: explaining the GR process in a way that is comprehensible to humans. We introduce and evaluate an explainable model for goal recognition (GR) agents, grounded in the theoretical framework and cognitive processes underlying human behavior explanation. Drawing on insights from two human-agent studies, we propose a conceptual framework for human-centered explanations of GR. Using this framework, we develop the eXplainable Goal Recognition (XGR) model, which generates explanations for both why and why not questions. We evaluate the model computationally across eight GR benchmarks and through three user studies. The first study assesses the efficiency of generating human-like explanations within the Sokoban game domain, the second examines perceived explainability in the same domain, and the third evaluates the model's effectiveness in aiding decision-making in illegal fishing detection. Results demonstrate that the XGR model significantly enhances user understanding, trust, and decision-making compared to baseline models, underscoring its potential to improve human-agent collaboration.
https://arxiv.org/abs/2409.11675
AI data-driven models (Graphcast, Pangu Weather, Fourcastnet, and SFNO) are explored for storyline-based climate attribution because their short inference times can greatly increase the number of events studied and provide real-time attributions when public attention is heightened. The analysis is framed on the extreme atmospheric river episode of February 2017 that contributed to the Oroville dam spillway incident in Northern California. Past and future simulations are generated by perturbing the initial conditions with the pre-industrial and the late-21st century temperature climate change signals, respectively. The simulations are compared to results from a dynamical model which represents plausible pseudo-realities under both climate environments. Overall, the AI models show promising results, projecting a 5-6% increase in the integrated water vapor over the Oroville dam in the present day compared to the pre-industrial period, in agreement with the dynamical model. Different geopotential-moisture-temperature dependencies are unveiled for each of the AI models tested, providing valuable information for understanding the physicality of the attribution response. However, the AI models tend to simulate weaker attribution values than the pseudo-reality imagined by the dynamical model, suggesting some reduced extrapolation skill, especially for the late-21st century regime. Large ensembles generated with an AI model (>500 members) produced statistically significant present-day to pre-industrial attribution results, unlike the >20-member ensemble from the dynamical model. This analysis highlights the potential of AI models to conduct attribution analysis, while emphasizing future lines of work on explainable artificial intelligence to gain confidence in these tools, which can enable reliable attribution studies in real time.
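The storyline setup amounts to rerunning the forecast from initial conditions shifted by a climate-change temperature signal and comparing a target diagnostic such as integrated water vapour (IWV) between the factual and counterfactual runs. The schematic below uses a fake exponential moisture-temperature relation and made-up numbers purely to show the bookkeeping; it is not any of the models above.

```python
import numpy as np

def attribution_change(forecast_fn, analysis_T, delta_T_signal, target_index):
    """Relative change in a target diagnostic between factual and counterfactual runs.

    forecast_fn stands in for an AI weather model mapping an initial temperature
    field to an IWV field; delta_T_signal is the warming attributed to climate change.
    """
    iwv_factual = forecast_fn(analysis_T)
    iwv_preindustrial = forecast_fn(analysis_T - delta_T_signal)  # counterfactual past
    return (iwv_factual[target_index] - iwv_preindustrial[target_index]) \
        / iwv_preindustrial[target_index]

# Toy "model": IWV scales exponentially with temperature (Clausius-Clapeyron-like).
fake_forecast = lambda T: 20.0 * np.exp(0.07 * (T - 273.15))
T_analysis = np.array([283.0, 285.0])   # two grid points, in kelvin
delta_T = np.array([1.0, 1.0])          # assumed ~1 K pre-industrial-to-present warming
print(f"{100 * attribution_change(fake_forecast, T_analysis, delta_T, 0):.1f}% increase")
```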
https://arxiv.org/abs/2409.11605
Out-of-distribution (OOD) detection is crucial for enhancing the generalization of AI models used in mammogram screening. Given the challenge of limited prior knowledge about OOD samples in external datasets, unsupervised generative learning is a preferable solution, which trains the model to discern the normal characteristics of in-distribution (ID) data. The hypothesis is that during inference, the model aims to reconstruct ID samples accurately, while OOD samples exhibit poorer reconstruction due to their divergence from normality. Inspired by state-of-the-art (SOTA) hybrid architectures combining CNNs and transformers, we developed a novel backbone, HAND, for detecting OOD samples in large-scale digital screening mammogram studies. To boost the learning efficiency, we incorporated synthetic OOD samples and a parallel discriminator in the latent space to distinguish between ID and OOD samples. Gradient reversal applied to the OOD reconstruction loss penalizes the model for learning OOD reconstructions. An anomaly score is computed by weighting the reconstruction and discriminator losses. On an internal held-out RSNA mammogram test set and an external hand-curated Mayo Clinic dataset, the proposed HAND model outperformed encoder-based and GAN-based baselines, and, interestingly, it also outperformed the hybrid CNN+transformer baselines. Therefore, the proposed HAND pipeline offers an automated, efficient computational solution for domain-specific quality checks on external screening mammograms, yielding actionable insights without direct exposure to the private medical imaging data.
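The anomaly score described above weights the per-sample reconstruction error against the latent-space discriminator's OOD probability. A minimal sketch of that weighting follows; the 0.7 weight and the discriminator form are assumptions, and the gradient-reversal training step is omitted.

```python
import torch

def anomaly_score(x: torch.Tensor,
                  x_recon: torch.Tensor,
                  disc_logit: torch.Tensor,
                  recon_weight: float = 0.7) -> torch.Tensor:
    """Weighted sum of per-sample reconstruction error and discriminator 'OOD-ness'."""
    recon_err = ((x - x_recon) ** 2).flatten(1).mean(dim=1)   # per-sample MSE
    ood_prob = torch.sigmoid(disc_logit).squeeze(-1)          # discriminator's P(sample is OOD)
    return recon_weight * recon_err + (1.0 - recon_weight) * ood_prob

# Toy batch of two "mammograms": the second is reconstructed poorly and flagged by the discriminator.
x = torch.zeros(2, 1, 8, 8)
x_recon = torch.stack([torch.zeros(1, 8, 8), 0.5 * torch.ones(1, 8, 8)])
logits = torch.tensor([[-3.0], [3.0]])
print(anomaly_score(x, x_recon, logits))   # the second score should be clearly larger
```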
https://arxiv.org/abs/2409.11534
Embodied vision-based real-world systems, such as mobile robots, require a careful balance between energy consumption, compute latency, and safety constraints to optimize operation across dynamic tasks and contexts. As local computation tends to be restricted, offloading the computation, i.e., to a remote server, can save local resources while providing access to high-quality predictions from powerful and large models. However, the resulting communication and latency overhead has led to limited usability of cloud models in dynamic, safety-critical, real-time settings. To effectively address this trade-off, we introduce UniLCD, a novel hybrid inference framework for enabling flexible local-cloud collaboration. By efficiently optimizing a flexible routing module via reinforcement learning and a suitable multi-task objective, UniLCD is specifically designed to support the multiple constraints of safety-critical end-to-end mobile systems. We validate the proposed approach using a challenging, crowded navigation task requiring frequent and timely switching between local and cloud operations. UniLCD demonstrates improved overall performance and efficiency, by over 35% compared to state-of-the-art baselines based on various split computing and early exit strategies.
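The learned routing module can be caricatured as a policy that, per observation, chooses local or cloud inference under latency and safety constraints. The rule below is a hand-written stand-in for the reinforcement-learned policy; the thresholds and model stubs are purely illustrative.

```python
def route_and_infer(obs_risk: float, latency_budget_ms: float,
                    cloud_rtt_ms: float, local_model, cloud_model):
    """Choose local or cloud inference under a latency budget.

    A learned policy would replace this rule; here we fall back to the small local
    model whenever the cloud round trip would blow the latency budget or the scene
    is judged safety-critical (high obs_risk).
    """
    use_cloud = (cloud_rtt_ms < latency_budget_ms) and (obs_risk < 0.8)
    return ("cloud", cloud_model()) if use_cloud else ("local", local_model())

# Toy models: the cloud model is "better" but only reachable within the latency budget.
local_model = lambda: "coarse waypoint"
cloud_model = lambda: "precise trajectory"
for risk, rtt in [(0.2, 40.0), (0.9, 40.0), (0.2, 250.0)]:
    print(route_and_infer(risk, latency_budget_ms=100.0, cloud_rtt_ms=rtt,
                          local_model=local_model, cloud_model=cloud_model))
```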
https://arxiv.org/abs/2409.11403
Video diffusion models have shown great potential in generating high-quality videos, making them an increasingly popular focus. However, their inherent iterative nature leads to substantial computational and time costs. While efforts have been made to accelerate video diffusion by reducing inference steps (through techniques like consistency distillation) and by GAN training, these approaches often fall short in either performance or training stability. In this work, we introduce a two-stage training framework that effectively combines consistency distillation with GAN training to address these challenges. Additionally, we propose a novel video discriminator design, which eliminates the need for decoding the video latents and improves the final performance. Our model is capable of producing high-quality videos in merely one step, with the flexibility to perform multi-step refinement for further performance enhancement. Our quantitative evaluation on the OpenWebVid-1M benchmark shows that our model significantly outperforms existing methods. Notably, our 1-step performance (FVD 171.15) exceeds the 8-step performance of the consistency distillation based method, AnimateLCM (FVD 184.79), and approaches the 25-step performance of advanced Stable Video Diffusion (FVD 156.94).
https://arxiv.org/abs/2409.11367
Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200× faster. To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. We surprisingly find that this fine-tuning protocol also works directly on Stable Diffusion and achieves comparable performance to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior works.
https://arxiv.org/abs/2409.11355