Multilingual LLMs have achieved remarkable benchmark performance, but we find that they continue to underperform on non-Latin-script languages across contemporary LLM families. This discrepancy arises because LLMs are pretrained on orthographic text dominated by Latin-script characters, which obscures the phonology shared with non-Latin scripts. We propose leveraging phonemic transcriptions as complementary signals to induce script-invariant representations. Our study demonstrates that integrating phonemic signals improves performance across both non-Latin and Latin languages, and is particularly effective at closing the performance gap between the two. Through detailed experiments, we show that phonemic and orthographic scripts retrieve distinct examples for in-context learning (ICL). This motivates our proposed Mixed-ICL retrieval strategy, whose aggregated examples yield significant performance improvements for both Latin-script languages (up to 12.6%) and non-Latin-script languages (up to 15.1%) over randomized ICL retrieval.
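A minimal sketch of the Mixed-ICL idea under stated assumptions: the paper's actual retrievers and aggregation rule are not reproduced here, so `top_k`, the two embedding pools, and the interleaving scheme below are illustrative stand-ins for retrieving from orthographic and phonemic views and merging the results.

```python
import numpy as np

def top_k(query: np.ndarray, pool: np.ndarray, k: int) -> list:
    # Cosine similarity between the query and every candidate example.
    sims = pool @ query / (np.linalg.norm(pool, axis=1) * np.linalg.norm(query) + 1e-9)
    return list(np.argsort(-sims)[:k])

def mixed_icl_retrieve(q_orth, q_phon, pool_orth, pool_phon, k=8):
    """Interleave neighbours from the orthographic and phonemic views, deduplicated."""
    mixed, seen = [], set()
    for a, b in zip(top_k(q_orth, pool_orth, k), top_k(q_phon, pool_phon, k)):
        for idx in (a, b):
            if idx not in seen:
                seen.add(idx)
                mixed.append(idx)
    return mixed[:k]

# Toy usage with random embeddings standing in for real text/phoneme encoders.
rng = np.random.default_rng(0)
pool_o, pool_p = rng.normal(size=(100, 32)), rng.normal(size=(100, 32))
print(mixed_icl_retrieve(rng.normal(size=32), rng.normal(size=32), pool_o, pool_p))
```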
https://arxiv.org/abs/2411.02398
Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt-following) ability has also been greatly improved with large language models (e.g., T5, Llama). However, existing models cannot perfectly handle long and complex text prompts, especially when the prompts contain various objects with numerous attributes and interrelated spatial relationships. While many regional prompting methods have been proposed for UNet-based models (SD1.5, SDXL), there are still no implementations based on the recent Diffusion Transformer (DiT) architecture, such as SD3 and FLUX.1. In this report, we propose and implement regional prompting for FLUX.1 based on attention manipulation, which equips DiT with fine-grained compositional text-to-image generation capability in a training-free manner. Code is available at this https URL.
https://arxiv.org/abs/2411.02395
Current vision systems typically assign fixed-length representations to images, regardless of their information content. This contrasts with human intelligence (and even large language models), which allocates varying representational capacity based on entropy, context, and familiarity. Inspired by this, we propose an approach to learn variable-length token representations for 2D images. Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts. Each iteration refines the 2D tokens, updates the existing 1D latent tokens, and adaptively increases representational capacity by adding new tokens. This enables compression of images into a variable number of tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction loss and FID metrics, demonstrating that token count aligns with image entropy, familiarity, and downstream task requirements. Recurrent token processing with increasing representational capacity in each iteration shows signs of token specialization, revealing potential for object/part discovery.
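The following PyTorch sketch illustrates the recurrent-rollout mechanism only, not the paper's architecture: each rollout lets the 1D latent tokens read the 2D image tokens, then appends fresh tokens to grow capacity from 32 toward 256. All module shapes and the growth schedule are assumptions.

```python
import torch, torch.nn as nn

class RecurrentTokenizerSketch(nn.Module):
    def __init__(self, dim=64, start=32, grow=32, max_tokens=256):
        super().__init__()
        self.start, self.grow, self.max_tokens = start, grow, max_tokens
        self.new_token = nn.Parameter(torch.randn(1, 1, dim))   # template for added tokens
        self.cross = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, image_tokens, iters=4):
        B, _, D = image_tokens.shape
        lat = self.new_token.expand(B, self.start, D).contiguous()
        for _ in range(iters):
            upd, _ = self.cross(lat, image_tokens, image_tokens)  # latents read 2D tokens
            lat = lat + self.mlp(upd)                             # refine existing latents
            if lat.shape[1] < self.max_tokens:                    # adaptively add capacity
                lat = torch.cat([lat, self.new_token.expand(B, self.grow, D)], dim=1)
        return lat

tok = RecurrentTokenizerSketch()
print(tok(torch.randn(2, 196, 64)).shape)  # (2, 160, 64) after 4 rollouts
```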
https://arxiv.org/abs/2411.02393
Large language models (LLMs) have demonstrated remarkable capabilities in various scientific domains, from natural language processing to complex problem-solving tasks. Their ability to understand and generate human-like text has opened up new possibilities for advancing scientific research, enabling tasks such as data analysis, literature review, and even experimental design. One of the most promising applications of LLMs in this context is hypothesis generation, where they can identify novel research directions by analyzing existing knowledge. However, despite their potential, LLMs are prone to generating "hallucinations", outputs that are plausible-sounding but factually incorrect. Such a problem presents significant challenges in scientific fields that demand rigorous accuracy and verifiability, potentially leading to erroneous or misleading conclusions. To overcome these challenges, we propose KG-CoI (Knowledge Grounded Chain of Ideas), a novel system that enhances LLM hypothesis generation by integrating external, structured knowledge from knowledge graphs (KGs). KG-CoI guides LLMs through a structured reasoning process, organizing their output as a chain of ideas (CoI), and includes a KG-supported module for the detection of hallucinations. With experiments on our newly constructed hypothesis generation dataset, we demonstrate that KG-CoI not only improves the accuracy of LLM-generated hypotheses but also reduces the hallucination in their reasoning chains, highlighting its effectiveness in advancing real-world scientific research.
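To make the hallucination-detection step concrete, here is a hypothetical toy illustration of KG grounding (not the paper's system, which prompts an LLM to produce and verify a chain of ideas): each reasoning step carries an extracted triple that is checked against a small knowledge graph.

```python
# Toy knowledge graph of (head, relation, tail) triples; real systems would
# query a large biomedical KG instead.
kg = {("TP53", "regulates", "apoptosis"), ("BRCA1", "repairs", "DNA damage")}

def verify_chain(chain_of_ideas):
    """Flag steps whose extracted triple is unsupported by the KG."""
    report = []
    for step, triple in chain_of_ideas:
        supported = triple in kg
        report.append((step, "grounded" if supported else "possible hallucination"))
    return report

chain = [
    ("TP53 loss disrupts apoptosis", ("TP53", "regulates", "apoptosis")),
    ("TP53 directly repairs DNA damage", ("TP53", "repairs", "DNA damage")),
]
print(verify_chain(chain))  # second step is flagged: the KG does not support it
```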
https://arxiv.org/abs/2411.02382
In this paper, we present a dynamic semantic clustering approach inspired by the Chinese Restaurant Process, aimed at addressing uncertainty in the inference of Large Language Models (LLMs). We quantify the uncertainty of an LLM on a given query by calculating the entropy of the generated semantic clusters. Further, we propose leveraging the (negative) likelihood of these clusters as the (non)conformity score within the Conformal Prediction framework, allowing the model to predict a set of responses instead of a single output, thereby accounting for uncertainty in its predictions. We demonstrate the effectiveness of our uncertainty quantification (UQ) technique on two well-known question answering benchmarks, COQA and TriviaQA, utilizing two LLMs, Llama2 and Mistral. Our approach achieves SOTA performance in UQ, as assessed by metrics such as AUROC, AUARC, and AURAC. The proposed conformal predictor is also shown to produce smaller prediction sets while maintaining the same probabilistic guarantee of including the correct response, in comparison to the existing SOTA conformal prediction baseline.
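A compact sketch of the two ingredients, assuming responses are already grouped into semantic clusters with likelihoods: entropy over cluster probabilities quantifies uncertainty, and a split-conformal quantile of negative log-likelihood scores yields the prediction set. The calibration data below is synthetic.

```python
import numpy as np

def semantic_entropy(cluster_probs: np.ndarray) -> float:
    p = cluster_probs / cluster_probs.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def conformal_threshold(cal_scores: np.ndarray, alpha: float = 0.1) -> float:
    # Split-conformal quantile with the finite-sample correction.
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return float(np.quantile(cal_scores, min(q, 1.0)))

def prediction_set(cluster_probs, thr):
    scores = -np.log(cluster_probs + 1e-12)   # nonconformity = negative log-likelihood
    return [i for i, s in enumerate(scores) if s <= thr]

# Calibration scores would come from held-out (query, correct-cluster) pairs.
cal = -np.log(np.random.default_rng(1).uniform(0.2, 0.9, size=200))
probs = np.array([0.55, 0.3, 0.1, 0.05])
print(semantic_entropy(probs), prediction_set(probs, conformal_threshold(cal)))
```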
https://arxiv.org/abs/2411.02381
Current volumetric biomedical foundation models struggle to generalize as public 3D datasets are small and do not cover the broad diversity of medical procedures, conditions, anatomical regions, and imaging protocols. We address this by creating a representation learning method that instead anticipates strong domain shifts at training time itself. We first propose a data engine that synthesizes highly variable training samples that enable generalization to new biomedical contexts. To then train a single 3D network for any voxel-level task, we develop a contrastive learning method that pretrains the network to be stable against nuisance imaging variation simulated by the data engine, a key inductive bias for generalization. This network's features can be used as robust representations of input images for downstream tasks and its weights provide a strong, dataset-agnostic initialization for finetuning on new datasets. As a result, we set new standards across both multimodality registration and few-shot segmentation, a first for any 3D biomedical vision model, all without (pre-)training on any existing dataset of real images.
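A simplified stand-in for the pretraining objective: the paper trains voxel-level stability against data-engine nuisances, whereas this sketch shows the analogous image-level InfoNCE between two nuisance-augmented views of the same synthetic volume; the encoder and augmentations are toy placeholders.

```python
import torch

def nuisance_invariance_loss(encoder, volumes, augment, temp=0.2):
    """Two nuisance-augmented renderings of each synthetic volume should map to
    nearby features; other volumes in the batch serve as negatives."""
    z1 = torch.nn.functional.normalize(encoder(augment(volumes)), dim=-1)
    z2 = torch.nn.functional.normalize(encoder(augment(volumes)), dim=-1)
    logits = z1 @ z2.T / temp                  # positives lie on the diagonal
    return torch.nn.functional.cross_entropy(logits, torch.arange(z1.shape[0]))

encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(16**3, 128))
augment = lambda v: v * torch.empty(v.shape[0], 1, 1, 1).uniform_(0.5, 1.5) \
                    + 0.1 * torch.randn_like(v)   # stand-in intensity nuisances
batch = torch.randn(8, 16, 16, 16)                # synthetic volumes from a data engine
print(nuisance_invariance_loss(encoder, batch, augment))
```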
https://arxiv.org/abs/2411.02372
Multimodal large language models (MLLMs) have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. In contrast, the inference of MLLMs involves storing billions of parameters and performing tremendous computation, imposing significant hardware demands. In this paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Models (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once a proper size of the model has been activated for a specific situation, thus avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), peak computational consumption (i.e., latency), and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR reduces the computational cost of the LLM by 5.2-6.5x and the GPU memory of the LLM by 2-6x without compromising performance. Code and checkpoints are available at this https URL.
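A schematic of the multi-exit mechanism, with hedges: DeeR's exit criteria are learned and conditioned on resource budgets, whereas the fixed confidence threshold `tau` below is a hypothetical stand-in that merely shows how computation stops once an intermediate exit is confident enough.

```python
import torch, torch.nn as nn

class EarlyExitSketch(nn.Module):
    def __init__(self, dim=64, n_blocks=6, n_actions=7, tau=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_blocks))
        self.heads = nn.ModuleList(nn.Linear(dim, n_actions) for _ in range(n_blocks))
        self.tau = tau  # hypothetical threshold; DeeR derives criteria from budgets

    def forward(self, x):
        for depth, (blk, head) in enumerate(zip(self.blocks, self.heads), 1):
            x = torch.relu(blk(x))
            probs = head(x).softmax(-1)
            if probs.max() >= self.tau:      # confident enough: stop computing here
                return probs, depth
        return probs, depth                  # fall through to the full model

model = EarlyExitSketch()
probs, used = model(torch.randn(1, 64))
print(f"exited after {used} of 6 blocks")
```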
https://arxiv.org/abs/2411.02359
Nanorobots are a promising development in targeted drug delivery and the treatment of neurological disorders, with potential for crossing the blood-brain barrier (BBB). These small devices leverage advancements in nanotechnology and bioengineering for precise navigation and targeted payload delivery, particularly for conditions like brain tumors, Alzheimer's disease, and Parkinson's disease. Recent progress in artificial intelligence (AI) and machine learning (ML) has improved the navigation and effectiveness of nanorobots, allowing them to detect and interact with cancer cells through biomarker analysis. This study presents a new reinforcement learning (RL) framework for optimizing nanorobot navigation in complex biological environments, focusing on cancer cell detection by analyzing the concentration gradients of surrounding biomarkers. We utilize a computer simulation model to explore the behavior of nanorobots in a three-dimensional space with cancer cells and biological barriers. The proposed method uses Q-learning to refine movement strategies based on real-time biomarker concentration data, enabling nanorobots to autonomously navigate to cancerous tissues for targeted drug delivery. This research lays the groundwork for future laboratory experiments and clinical applications, with implications for personalized medicine and less invasive cancer treatments. The integration of intelligent nanorobots could revolutionize therapeutic strategies, reducing side effects and enhancing treatment effectiveness for cancer patients. Further research will investigate the practical deployment of these technologies in medical settings, aiming to unlock the full potential of nanorobotics in healthcare.
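A toy tabular Q-learning sketch of the navigation idea, reduced to one dimension for brevity (the paper simulates 3D environments with barriers): the reward is the change in a synthetic biomarker concentration, so the learned policy climbs the gradient toward the simulated cancer site.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pos, actions = 20, (-1, +1)             # 1-D track; the paper works in 3-D space
goal = 15                                 # cancer site = biomarker source
Q = np.zeros((n_pos, len(actions)))
alpha, gamma, eps = 0.5, 0.9, 0.2

def biomarker(pos):                       # concentration decays with distance
    return np.exp(-abs(pos - goal) / 4.0)

for _ in range(2000):
    s = int(rng.integers(n_pos))
    for _ in range(50):
        a = int(rng.integers(2)) if rng.random() < eps else int(Q[s].argmax())
        s2 = int(np.clip(s + actions[a], 0, n_pos - 1))
        r = biomarker(s2) - biomarker(s)  # reward = climbing the gradient
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2
print("greedy action at position 5:", actions[int(Q[5].argmax())])  # expect +1
```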
https://arxiv.org/abs/2411.02345
Decoder-only Transformers often struggle with complex reasoning tasks, particularly arithmetic reasoning requiring multiple sequential operations. In this work, we identify representation collapse in the model's intermediate layers as a key factor limiting their reasoning capabilities. To address this, we propose Sequential Variance-Covariance Regularization (Seq-VCR), which enhances the entropy of intermediate representations and prevents collapse. Combined with dummy pause tokens as substitutes for chain-of-thought (CoT) tokens, our method significantly improves performance in arithmetic reasoning problems. In the challenging $5 \times 5$ integer multiplication task, our approach achieves $99.5\%$ exact match accuracy, outperforming models of the same size (which yield $0\%$ accuracy) and GPT-4 with five-shot CoT prompting ($44\%$). We also demonstrate superior results on arithmetic expression and longest increasing subsequence (LIS) datasets. Our findings highlight the importance of preventing intermediate layer representation collapse to enhance the reasoning capabilities of Transformers and show that Seq-VCR offers an effective solution without requiring explicit CoT supervision.
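A sketch of the regularizer, assuming a VICReg-style variance-covariance formulation applied to flattened intermediate representations; the exact Seq-VCR weighting and layer placement would follow the paper, and the coefficients below are illustrative.

```python
import torch

def seq_vcr_loss(h: torch.Tensor, var_w=1.0, cov_w=0.04, eps=1e-4):
    """h: (batch*seq, dim) intermediate representations, flattened over positions."""
    h = h - h.mean(dim=0)
    # Variance term: push each dimension's std above 1 (prevents collapse).
    std = torch.sqrt(h.var(dim=0) + eps)
    var_loss = torch.relu(1.0 - std).mean()
    # Covariance term: decorrelate dimensions (spreads information out).
    n, d = h.shape
    cov = (h.T @ h) / (n - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / d
    return var_w * var_loss + cov_w * cov_loss

hidden = torch.randn(256, 64, requires_grad=True)
print(seq_vcr_loss(hidden))   # added to the LM loss at selected intermediate layers
```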
https://arxiv.org/abs/2411.02344
Texturing is a crucial step in the 3D asset production workflow, which enhances the visual appeal and diversity of 3D assets. Despite recent advancements in Text-to-Texture (T2T) generation, existing methods often yield subpar results, primarily due to local discontinuities, inconsistencies across multiple views, and their heavy dependence on UV unwrapping outcomes. To tackle these challenges, we propose a novel generation-refinement 3D texturing framework called MVPaint, which can generate high-resolution, seamless textures while emphasizing multi-view consistency. MVPaint mainly consists of three key modules. 1) Synchronized Multi-view Generation (SMG). Given a 3D mesh model, MVPaint first simultaneously generates multi-view images by employing an SMG model, which leads to coarse texturing results with unpainted parts due to missing observations. 2) Spatial-aware 3D Inpainting (S3I). To ensure complete 3D texturing, we introduce the S3I method, specifically designed to effectively texture previously unobserved areas. 3) UV Refinement (UVR). Furthermore, MVPaint employs a UVR module to improve the texture quality in the UV space, which first performs UV-space super-resolution, followed by a spatial-aware seam-smoothing algorithm for revising spatial texturing discontinuities caused by UV unwrapping. Moreover, we establish two T2T evaluation benchmarks: the Objaverse T2T benchmark and the GSO T2T benchmark, based on selected high-quality 3D meshes from the Objaverse dataset and the entire GSO dataset, respectively. Extensive experimental results demonstrate that MVPaint surpasses existing state-of-the-art methods. Notably, MVPaint can generate high-fidelity textures with minimal Janus issues and highly enhanced cross-view consistency.
https://arxiv.org/abs/2411.02336
Activation sparsity denotes the existence of substantial weakly contributing elements within activation outputs that can be eliminated, benefiting many important applications concerned with large language models (LLMs). Although promoting greater activation sparsity within LLMs deserves deeper study, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and potentially influential factors. In this paper, we present a comprehensive study on the quantitative scaling properties and influential factors of activation sparsity within decoder-only Transformer-based LLMs. Specifically, we propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. Firstly, different activation functions exhibit comparable performance but opposite training-time sparsity trends. The activation ratio (i.e., $1-\mathrm{sparsity\ ratio}$) evolves as a convergent increasing power law and a decreasing logspace power law with the amount of training data for SiLU-activated and ReLU-activated LLMs, respectively. This demonstrates that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity. Secondly, the activation ratio linearly increases with the width-depth ratio below a certain bottleneck point, indicating the potential advantage of a deeper architecture at a fixed parameter scale. Finally, at similar width-depth ratios, we surprisingly find that the limit value of activation sparsity varies weakly with the parameter scale, i.e., the activation patterns within LLMs are insensitive to parameter scale. These empirical laws, which point toward LLMs with greater activation sparsity, have important implications for making LLMs more efficient and interpretable.
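A sketch of the measurement side, with hedges: `activation_ratio` is the simple thresholded ratio, while the paper's PPL-$p\%$ metric instead chooses the threshold by bounding perplexity degradation; the power-law fit uses made-up coefficients purely to illustrate the reported SiLU-style convergent increasing trend.

```python
import numpy as np
from scipy.optimize import curve_fit

def activation_ratio(acts: np.ndarray, threshold: float = 0.0) -> float:
    """Fraction of activation entries above the threshold (1 - sparsity ratio)."""
    return float((np.abs(acts) > threshold).mean())

acts = np.maximum(np.random.default_rng(0).standard_normal(10_000), 0.0)  # ReLU outputs
print(f"activation ratio: {activation_ratio(acts):.3f}")                  # ~0.5 here

# Schematic convergent increasing law, ratio ~ A - B * D^-c in training tokens D;
# the data points are synthetic, not taken from the paper.
D = np.array([1.0, 5.0, 20.0, 100.0])        # training tokens, in billions
ratio = 0.8 - 0.6 * D ** -0.15
(A, B, c), _ = curve_fit(lambda d, A, B, c: A - B * d ** -c, D, ratio, p0=(1.0, 0.5, 0.2))
print(f"fitted limit activation ratio A = {A:.3f}")
```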
https://arxiv.org/abs/2411.02335
Generative diffusion models (GDMs) have recently shown great success in synthesizing multimedia signals with high perceptual quality, enabling highly efficient semantic communications in future wireless networks. In this paper, we develop an intent-aware generative semantic multicasting framework utilizing pre-trained diffusion models. In the proposed framework, the transmitter decomposes the source signal into multiple semantic classes based on the multi-user intent, i.e., each user is assumed to be interested in the details of only a subset of the semantic classes. The transmitter then sends each user only its intended classes, and multicasts a highly compressed semantic map to all users over shared wireless resources, allowing them to locally synthesize the other (non-intended) classes using pre-trained diffusion models. The signal retrieved at each user is thereby partially reconstructed and partially synthesized from the received semantic map. This improves utilization of the wireless resources while better preserving the privacy of the non-intended classes. We design a communication/computation-aware scheme for per-class adaptation of the communication parameters, such as the transmission power and compression rate, to minimize the total latency of retrieving signals at multiple receivers, tailored to the prevailing channel conditions as well as the users' reconstruction/synthesis distortion and perception requirements. The simulation results demonstrate significantly reduced per-user latency compared with non-generative and intent-unaware multicasting benchmarks while maintaining high perceptual quality of the signals retrieved at the users.
https://arxiv.org/abs/2411.02334
The past year has witnessed significant advances in video-based large language models. However, the challenge of developing a unified model for both short and long video understanding remains unresolved. Most existing video LLMs cannot handle hour-long videos, while methods tailored to long videos tend to be ineffective for shorter videos and images. In this paper, we identify the key issue as the redundant content in videos. To address this, we propose a novel pooling strategy that simultaneously achieves token compression and instruction-aware visual feature aggregation. Our model is termed Prompt-guided Pooling LLaVA, or PPLLaVA for short. Specifically, PPLLaVA consists of three core components: CLIP-based visual-prompt alignment that extracts visual information relevant to the user's instructions, prompt-guided pooling that compresses the visual sequence to arbitrary scales using convolution-style pooling, and a clip context extension designed for the lengthy prompts common in visual dialogue. Moreover, our codebase integrates the most advanced video Direct Preference Optimization (DPO) and visual interleave training. Extensive experiments have validated the performance of our model. With superior throughput and only 1024 visual context tokens, PPLLaVA achieves better results on image benchmarks as a video LLM, while achieving state-of-the-art performance across various video benchmarks, excelling in tasks ranging from caption generation to multiple-choice questions, and handling video lengths from seconds to hours. Code is available at this https URL.
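A rough stand-in for the prompt-guided pooling step: visual tokens are weighted by their relevance to the instruction tokens and then compressed to a fixed budget. Adaptive average pooling here substitutes for the paper's convolution-style pooling, and all shapes are assumptions.

```python
import torch, torch.nn.functional as F

def prompt_guided_pool(visual, prompt, out_len=1024):
    """visual: (T*N, D) video tokens, prompt: (P, D) instruction tokens.
    Weight each visual token by its best match to any prompt token, then
    compress the weighted sequence to out_len tokens."""
    rel = (visual @ prompt.T).max(dim=1).values
    w = torch.softmax(rel / visual.shape[-1] ** 0.5, dim=0).unsqueeze(-1)
    weighted = visual * w * visual.shape[0]              # keep magnitudes comparable
    pooled = F.adaptive_avg_pool1d(weighted.T.unsqueeze(0), out_len)
    return pooled.squeeze(0).T                           # (out_len, D)

video_tokens = torch.randn(16 * 576, 256)                # 16 frames of patch tokens
prompt_tokens = torch.randn(12, 256)
print(prompt_guided_pool(video_tokens, prompt_tokens).shape)  # torch.Size([1024, 256])
```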
https://arxiv.org/abs/2411.02327
Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design. In this paper, we propose to jointly investigate general 3D and 4D generation by leveraging camera and object movements commonly observed in daily life. Due to the lack of real-world 4D data in the community, we first propose a data curation pipeline to obtain camera poses and object motion strength from videos. Based on this pipeline, we introduce a large-scale real-world 4D scene dataset: CamVid-30K. By leveraging all the 3D and 4D data, we develop our framework, GenXD, which allows us to produce any 3D or 4D scene. We propose multiview-temporal modules, which disentangle camera and object movements, to seamlessly learn from both 3D and 4D data. Additionally, GenXD employs masked latent conditions to support a variety of conditioning views. GenXD can generate videos that follow the camera trajectory as well as consistent 3D views that can be lifted into 3D representations. We perform extensive evaluations across various real-world and synthetic datasets, demonstrating GenXD's effectiveness and versatility compared to previous methods in 3D and 4D generation.
https://arxiv.org/abs/2411.02319
Code large language models (LLMs) have made significant progress in code debugging by directly generating correct code from a buggy code snippet. Programming benchmarks, typically consisting of buggy code snippets and their associated test cases, are used to assess the debugging capabilities of LLMs. However, many existing benchmarks primarily focus on Python and are often limited in terms of language diversity (e.g., DebugBench and DebugEval). To advance the field of multilingual debugging with LLMs, we propose the first massively multilingual debugging benchmark, which includes 3.6K test samples across 18 programming languages and covers the automated program repair (APR), code review (CR), and bug identification (BI) tasks. Furthermore, we introduce the debugging instruction corpus MDEVAL-INSTRUCT by injecting bugs into correct multilingual queries and solutions (xDebugGen), and train a multilingual debugger, xDebugCoder, on MDEVAL-INSTRUCT as a strong baseline specifically designed to handle bugs across a wide range of programming languages (e.g., "Missing Mut" in Rust and "Misused Macro Definition" in C). Our extensive experiments on MDEVAL reveal a notable performance gap between open-source models and closed-source LLMs (e.g., the GPT and Claude series), highlighting huge room for improvement in multilingual code debugging scenarios.
https://arxiv.org/abs/2411.02310
Object-Centric Learning (OCL) can discover objects in images or videos by simply reconstructing the input. For better object discovery, representative OCL methods reconstruct the input as its Variational Autoencoder (VAE) intermediate representation, which suppresses pixel noise and promotes object separability by discretizing continuous super-pixels with template features. However, treating features as units overlooks their composing attributes, thus impeding model generalization; indexing features with scalar numbers loses attribute-level similarities and differences, thus hindering model convergence. We propose Grouped Discrete Representation (GDR) for OCL. We decompose features into combinatorial attributes via organized channel grouping, and compose these attributes into discrete representations via tuple indexes. Experiments show that our GDR improves both Transformer- and Diffusion-based OCL methods consistently on various datasets. Visualizations show that our GDR captures better object separability.
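A minimal sketch of the grouped discrete representation, under the assumption that quantization is nearest-neighbor per channel group: features are split into attribute groups, each group is indexed against its own codebook, and the tuple of per-group indexes replaces a single scalar code.

```python
import torch

def grouped_discrete_representation(features, codebooks):
    """features: (N, D); codebooks: list of G tensors of shape (K, D//G).
    Split channels into G attribute groups, quantize each group against its
    own codebook, and return the tuple indexes plus the recomposed vectors."""
    chunks = features.chunk(len(codebooks), dim=-1)        # organized channel grouping
    idx, quant = [], []
    for x, cb in zip(chunks, codebooks):
        d = torch.cdist(x, cb)                             # distance to attribute templates
        j = d.argmin(dim=-1)
        idx.append(j)
        quant.append(cb[j])
    return torch.stack(idx, dim=-1), torch.cat(quant, dim=-1)  # tuple index, (N, D)

feats = torch.randn(10, 64)
books = [torch.randn(16, 16) for _ in range(4)]            # 4 groups of 16 channels
tuples, q = grouped_discrete_representation(feats, books)
print(tuples.shape, q.shape)                               # (10, 4) and (10, 64)
```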
https://arxiv.org/abs/2411.02299
While 3D generative models have greatly improved artists' workflows, existing diffusion models for 3D generation suffer from slow generation and poor generalization. To address this issue, we propose a two-stage approach named Hunyuan3D-1.0, including a lite version and a standard version, both of which support text- and image-conditioned generation. In the first stage, we employ a multi-view diffusion model that efficiently generates multi-view RGB images in approximately 4 seconds. These multi-view images capture rich details of the 3D asset from different viewpoints, relaxing the task from single-view to multi-view reconstruction. In the second stage, we introduce a feed-forward reconstruction model that rapidly and faithfully reconstructs the 3D asset from the generated multi-view images in approximately 7 seconds. The reconstruction network learns to handle the noise and inconsistency introduced by the multi-view diffusion and leverages the available information from the condition image to efficiently recover the 3D structure. Extensive experimental results demonstrate the effectiveness of Hunyuan3D-1.0 in generating high-quality 3D assets. Our framework involves the text-to-image model, i.e., Hunyuan-DiT, making it a unified framework that supports both text- and image-conditioned 3D generation. Our standard version has $10\times$ more parameters than our lite version and other existing models. Hunyuan3D-1.0 achieves an impressive balance between speed and quality, significantly reducing generation time while maintaining the quality and diversity of the produced assets.
https://arxiv.org/abs/2411.02293
Machine learning (ML) has the potential to become an essential tool in supporting clinical decision-making processes, offering enhanced diagnostic capabilities and personalized treatment plans. However, outsourcing medical records to train ML models using patient data raises legal, privacy, and security concerns. Federated learning has emerged as a promising paradigm for collaborative ML, meeting healthcare institutions' requirements for robust models without sharing sensitive data and compromising patient privacy. This study proposes a novel method that combines federated learning (FL) and Graph Neural Networks (GNNs) to predict stroke severity using electroencephalography (EEG) signals across multiple medical institutions. Our approach enables multiple hospitals to jointly train a shared GNN model on their local EEG data without exchanging patient information. Specifically, we address a regression problem by predicting the National Institutes of Health Stroke Scale (NIHSS), a key indicator of stroke severity. The proposed model leverages a masked self-attention mechanism to capture salient brain connectivity patterns and employs EdgeSHAP to provide post-hoc explanations of the neurological states after a stroke. We evaluated our method on EEG recordings from four institutions, achieving a mean absolute error (MAE) of 3.23 in predicting NIHSS, close to the average error made by human experts (MAE $\approx$ 3.0). This demonstrates the method's effectiveness in providing accurate and explainable predictions while maintaining data privacy.
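The federated side can be illustrated with a plain FedAvg round (a common baseline; the paper's masked self-attention GNN and EdgeSHAP explanations are omitted): each hospital fine-tunes a copy on private data, and only the weights are averaged. The linear model and random tensors below are toy stand-ins for EEG connectivity graphs.

```python
import copy, torch

def fedavg_round(global_model, hospital_loaders, local_steps=1, lr=1e-3):
    """One communication round: each site trains on its private data, only
    weights travel to the server, and the server averages them (FedAvg)."""
    states = []
    for loader in hospital_loaders:
        local = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        for _ in range(local_steps):
            for x, y in loader:
                opt.zero_grad()
                # NIHSS prediction is a regression problem; L1 matches the MAE metric.
                loss = torch.nn.functional.l1_loss(local(x).squeeze(-1), y)
                loss.backward()
                opt.step()
        states.append(local.state_dict())
    avg = {k: torch.stack([s[k] for s in states]).mean(0) for k in states[0]}
    global_model.load_state_dict(avg)
    return global_model

net = torch.nn.Linear(32, 1)                                  # toy per-patient features
loaders = [[(torch.randn(4, 32), torch.rand(4) * 42)] for _ in range(4)]  # 4 hospitals
fedavg_round(net, loaders)                                    # NIHSS ranges 0-42
```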
https://arxiv.org/abs/2411.02286
Class imbalance and label noise are pervasive in large-scale datasets, yet much of machine learning research assumes well-labeled, balanced data, which rarely reflects real-world conditions. Existing approaches typically address either label noise or class imbalance in isolation, leading to suboptimal results when both issues coexist. In this work, we propose Conformal-in-the-Loop (CitL), a novel training framework that addresses both challenges with a conformal prediction-based approach. CitL evaluates sample uncertainty to adjust weights and prune unreliable examples, enhancing model resilience and accuracy with minimal computational cost. Our extensive experiments include a detailed analysis showing how CitL effectively emphasizes impactful data in noisy, imbalanced datasets. Our results show that CitL consistently boosts model performance, achieving up to a 6.1% increase in classification accuracy and a 5.0 mIoU improvement in segmentation. Our code is publicly available: CitL.
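One plausible reading of the weighting-and-pruning step, sketched with a standard split-conformal set (the exact CitL scoring may differ): samples whose noisy label falls outside the conformal set are pruned, and samples with large (uncertain) sets are down-weighted.

```python
import numpy as np

def citl_weights(probs, labels, cal_scores, alpha=0.1):
    """probs: (N, C) softmax outputs; labels: (N,) possibly noisy labels.
    Build a conformal prediction set per sample; prune samples whose label
    falls outside the set, and down-weight samples with large sets."""
    n = len(cal_scores)
    q = np.quantile(cal_scores, min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0))
    weights = np.ones(len(labels))
    for i, (p, y) in enumerate(zip(probs, labels)):
        pred_set = np.where(1.0 - p <= q)[0]          # nonconformity = 1 - probability
        if y not in pred_set:
            weights[i] = 0.0                          # likely label noise: prune
        else:
            weights[i] = 1.0 / max(len(pred_set), 1)  # vaguer samples count less
    return weights

rng = np.random.default_rng(3)
cal = 1.0 - rng.uniform(0.5, 0.95, 500)               # scores from a calibration split
probs = rng.dirichlet(np.ones(5), size=8)
print(citl_weights(probs, rng.integers(0, 5, 8), cal))
```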
https://arxiv.org/abs/2411.02281
Graph neural networks have inherent representational limitations due to their message-passing structure. Recent work has suggested that these limitations can be overcome by using unique node identifiers (UIDs). Here we argue that, despite their advantages, UIDs lose the desirable property of permutation equivariance. We thus propose to focus on UID models that are permutation-equivariant, and present theoretical arguments for their advantages. Motivated by this, we propose a method to regularize UID models towards permutation equivariance via a contrastive loss. We empirically demonstrate that our approach improves generalization and extrapolation abilities while providing faster training convergence. On the recent BREC expressiveness benchmark, our proposed method achieves state-of-the-art performance compared to other random-based approaches.
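A sketch of one way such a contrastive regularizer can look, with every component an assumption except the goal itself: embeddings of a graph and of a node-permuted copy (with independently re-drawn UIDs) are aligned node-by-node via InfoNCE, penalizing UID-dependent, permutation-sensitive behavior.

```python
import torch
import torch.nn.functional as F

def perm_equivariance_loss(gnn, x, adj, uid_dim, temp=0.1):
    """Run the UID-GNN on the graph and on a node-permuted copy (with fresh
    UIDs), then require each node's embedding to match its own permuted
    counterpart rather than any other node's."""
    n = x.shape[0]
    perm = torch.randperm(n)
    uid1 = torch.eye(uid_dim)[torch.randperm(uid_dim)][:n]   # random one-hot UIDs, view 1
    uid2 = torch.eye(uid_dim)[torch.randperm(uid_dim)][:n]   # independent UIDs, view 2
    h1 = gnn(torch.cat([x, uid1], -1), adj)
    h2 = gnn(torch.cat([x[perm], uid2], -1), adj[perm][:, perm])
    h2 = h2[perm.argsort()]                                  # undo the permutation
    z1, z2 = F.normalize(h1, dim=-1), F.normalize(h2, dim=-1)
    logits = z1 @ z2.T / temp                                # positives on the diagonal
    return F.cross_entropy(logits, torch.arange(n))

# Tiny one-layer dense GNN stand-in so the sketch runs end to end.
W = torch.randn(8 + 10, 16, requires_grad=True)
gnn = lambda feats, adj: adj @ feats @ W                     # message passing + projection
x, adj = torch.randn(6, 8), (torch.rand(6, 6) > 0.5).float()
print(perm_equivariance_loss(gnn, x, adj, uid_dim=10))
```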
https://arxiv.org/abs/2411.02271