The success of many RL techniques heavily relies on human-engineered dense rewards, which typically demand substantial domain expertise and extensive trial and error. In our work, we propose DrS (Dense reward learning from Stages), a novel approach for learning reusable dense rewards for multi-stage tasks in a data-driven manner. By leveraging the stage structures of the task, DrS learns a high-quality dense reward from sparse rewards and, if available, demonstrations. The learned rewards can be \textit{reused} in unseen tasks, thus reducing the human effort of reward engineering. Extensive experiments on three physical robot manipulation task families with 1000+ task variants demonstrate that our learned rewards can be reused in unseen tasks, resulting in improved performance and sample efficiency of RL algorithms. The learned rewards even achieve comparable performance to human-engineered rewards on some tasks. See our project page (this https URL) for more details.
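The stage-structured reward idea can be sketched minimally. The following is an illustrative toy, not DrS's actual learned reward (which trains per-stage components from sparse rewards and demonstrations); `stage_index` and `stage_score` are assumed inputs, with the score standing in for a learned per-stage progress signal in [0, 1):

```python
# Hypothetical sketch of a stage-structured dense reward: the stage index
# gives a coarse, monotonically increasing signal, and a per-stage learned
# score in [0, 1) densifies it, so any state in stage k+1 is rewarded
# higher than any state in stage k. Names and the score source are
# illustrative, not the paper's implementation.

def staged_dense_reward(stage_index, stage_score):
    """stage_index: int in [0, K); stage_score: learned value in [0, 1)."""
    assert 0.0 <= stage_score < 1.0
    return stage_index + stage_score
```

Under this construction, reaching a later stage always dominates any amount of progress within an earlier one, which is what makes the dense signal consistent with the sparse stage structure.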
https://arxiv.org/abs/2404.16779
The prevalent approaches of unsupervised 3D object detection follow cluster-based pseudo-label generation and iterative self-training processes. However, the challenge arises due to the sparsity of LiDAR scans, which leads to pseudo-labels with erroneous size and position, resulting in subpar detection performance. To tackle this problem, this paper introduces a Commonsense Prototype-based Detector, termed CPD, for unsupervised 3D object detection. CPD first constructs Commonsense Prototype (CProto) characterized by high-quality bounding box and dense points, based on commonsense intuition. Subsequently, CPD refines the low-quality pseudo-labels by leveraging the size prior from CProto. Furthermore, CPD enhances the detection accuracy of sparsely scanned objects by the geometric knowledge from CProto. CPD outperforms state-of-the-art unsupervised 3D detectors on Waymo Open Dataset (WOD), PandaSet, and KITTI datasets by a large margin. Besides, by training CPD on WOD and testing on KITTI, CPD attains 90.85% and 81.01% 3D Average Precision on easy and moderate car classes, respectively. These achievements position CPD in close proximity to fully supervised detectors, highlighting the significance of our method. The code will be available at this https URL.
https://arxiv.org/abs/2404.16493
Cell tracking remains a pivotal yet challenging task in biomedical research. The full potential of deep learning for this purpose is often untapped due to the limited availability of comprehensive and varied training data sets. In this paper, we present SynCellFactory, a generative cell-video augmentation method. At the heart of SynCellFactory lies the ControlNet architecture, which has been fine-tuned to synthesize cell imagery with photorealistic accuracy in style and motion patterns. This technique enables the creation of synthetic yet realistic cell videos that mirror the complexity of authentic microscopy time-lapses. Our experiments demonstrate that SynCellFactory boosts the performance of well-established deep learning models for cell tracking, particularly when original training data is sparse.
https://arxiv.org/abs/2404.16421
Human matting is a foundation task in image and video processing, where human foreground pixels are extracted from the input. Prior works either improve the accuracy with additional guidance or improve the temporal consistency of a single instance across frames. We propose a new framework, MaGGIe (Masked Guided Gradual Human Instance Matting), which predicts alpha mattes progressively for each human instance while maintaining computational cost, precision, and consistency. Our method leverages modern architectures, including transformer attention and sparse convolution, to output all instance mattes simultaneously without exploding memory and latency. While keeping inference costs constant in the multiple-instance scenario, our framework achieves robust and versatile performance on our proposed synthesized benchmarks. In addition to higher-quality image and video matting benchmarks, a novel multi-instance synthesis approach based on publicly available sources is introduced to increase the generalization of models to real-world scenarios.
https://arxiv.org/abs/2404.16035
Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of those activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage -- systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions: this enables us to apply the L1 penalty only to the former, limiting the scope of undesirable side effects. Through training SAEs on LMs of up to 7B parameters, we find that, in typical hyper-parameter ranges, Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity.
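The gate/magnitude separation can be sketched as follows. This is a hedged toy with random weights, assuming (as a simplification) that both paths share one encoder matrix with a per-feature rescale; the variable names are illustrative, not the paper's:

```python
import numpy as np

# Hypothetical sketch of the gated encoder idea: one path decides WHICH
# features fire (binary gate, the path that would carry the L1/sparsity
# penalty), a second estimates HOW STRONGLY they fire (magnitude path,
# free of the penalty). Weight names and sharing scheme are illustrative.

rng = np.random.default_rng(0)
d_model, d_sae = 8, 16
W_enc = rng.standard_normal((d_model, d_sae))
b_gate = rng.standard_normal(d_sae)
b_mag = rng.standard_normal(d_sae)
r_mag = rng.standard_normal(d_sae)  # per-feature magnitude rescale

def gated_encode(x):
    pre = x @ W_enc
    gate = (pre + b_gate > 0).astype(float)           # which features are on
    mag = np.maximum(pre * np.exp(r_mag) + b_mag, 0)  # how strongly they fire
    return gate * mag

x = rng.standard_normal(d_model)
f = gated_encode(x)
```

Because magnitudes are read off a path that never sees the sparsity penalty, the systematic shrinkage of feature activations described above does not arise by construction in this sketch.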
https://arxiv.org/abs/2404.16014
Decentralized Federated Learning (DFL) has become popular due to its robustness and avoidance of centralized coordination. In this paradigm, clients actively engage in training by exchanging models with their networked neighbors. However, DFL introduces increased costs in terms of training and communication. Existing methods focus on minimizing communication, often overlooking training efficiency and data heterogeneity. To address this gap, we propose a novel \textit{sparse-to-sparser} training scheme: DA-DPFL. DA-DPFL initializes with a subset of model parameters, which progressively reduces during training via \textit{dynamic aggregation}, leading to substantial energy savings while retaining adequate information during critical learning periods. Our experiments showcase that DA-DPFL substantially outperforms DFL baselines in test accuracy, while achieving up to a $5\times$ reduction in energy costs. We provide a theoretical analysis of DA-DPFL's convergence, solidifying its applicability to decentralized and personalized learning. The code is available at: this https URL
https://arxiv.org/abs/2404.15943
Despite the recent progress in long-context language models, it remains elusive how transformer-based models exhibit the capability to retrieve relevant information from arbitrary locations within the long context. This paper aims to address this question. Our systematic investigation across a wide spectrum of models reveals that a special type of attention head is largely responsible for retrieving information; we dub these retrieval heads. We identify intriguing properties of retrieval heads: (1) universal: all the explored models with long-context capability have a set of retrieval heads; (2) sparse: only a small portion (less than 5\%) of the attention heads are retrieval heads; (3) intrinsic: retrieval heads already exist in models pretrained with short context, and when extending the context length by continual pretraining, it is still the same set of heads that performs information retrieval; (4) dynamically activated: taking Llama-2 7B as an example, 12 retrieval heads always attend to the required information no matter how the context is changed, while the rest are activated in different contexts; (5) causal: completely pruning retrieval heads leads to failure in retrieving relevant information and results in hallucination, while pruning random non-retrieval heads does not affect the model's retrieval ability. We further show that retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the model needs to frequently refer back to the question and previously generated context. Conversely, tasks where the model directly generates the answer using its intrinsic knowledge are less impacted by masking out retrieval heads. These observations collectively explain which internal part of the model seeks information from the input tokens. We believe our insights will foster future research on reducing hallucination, improving reasoning, and compressing the KV cache.
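The retrieval-score idea behind detecting such heads can be illustrated with synthetic attention maps: score a head by how often its top-attended position falls inside the "needle" span being copied. This is a hedged sketch of the detection criterion, not the paper's exact implementation:

```python
import numpy as np

# Hypothetical sketch of a retrieval score: while the model copies a
# needle span from the context, count how often a head's top-attended
# position lands inside that span. Attention maps here are synthetic.

def retrieval_score(attn, needle_positions):
    """attn: (steps, context_len) attention of one head at each decode step."""
    hits = sum(int(np.argmax(row) in needle_positions) for row in attn)
    return hits / len(attn)

attn_retrieval = np.zeros((4, 10)); attn_retrieval[:, 7] = 1.0  # locks onto needle
attn_other = np.zeros((4, 10)); attn_other[:, 0] = 1.0          # attends elsewhere
```

A head with a score near 1 across many contexts would be flagged as a retrieval head under this criterion; the sparsity property above says few heads score highly.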
https://arxiv.org/abs/2404.15574
Recent advancements in machine learning have led to novel imaging systems and algorithms that address ill-posed problems. Assessing their trustworthiness and understanding how to deploy them safely at test time remains an important and open problem. We propose a method that leverages conformal prediction to retrieve upper/lower bounds and statistical inliers/outliers of reconstructions based on the prediction intervals of downstream metrics. We apply our method to sparse-view CT for downstream radiotherapy planning and show (1) that metric-guided bounds have valid coverage for downstream metrics, while conventional pixel-wise bounds do not, and (2) that upper/lower bounds differ anatomically between metric-guided and pixel-wise methods. Our work paves the way for more meaningful reconstruction bounds. Code available at this https URL
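The interval construction can be sketched with standard split conformal prediction on a downstream metric, assuming symmetric residual-based scores (the paper's metric-guided procedure may differ in detail):

```python
import numpy as np

# Minimal split-conformal sketch (not the paper's exact procedure): from
# calibration residuals of a downstream metric we take a finite-sample
# corrected (1 - alpha) quantile and use it as a symmetric interval
# half-width at test time, giving valid marginal coverage.

def conformal_halfwidth(cal_residuals, alpha=0.1):
    n = len(cal_residuals)
    q = np.ceil((n + 1) * (1 - alpha)) / n   # finite-sample correction
    return np.quantile(np.abs(cal_residuals), min(q, 1.0))

residuals = np.array([0.1, 0.2, 0.05, 0.3, 0.15, 0.25, 0.12, 0.08, 0.22])
h = conformal_halfwidth(residuals, alpha=0.2)  # half-width of the 80% interval
```

Running the same construction on per-pixel residuals instead of metric residuals is what yields the pixel-wise bounds whose coverage the abstract reports as invalid for downstream metrics.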
https://arxiv.org/abs/2404.15274
Medical Vision-Language Pretraining (Med-VLP) establishes a connection between visual content from medical images and the relevant textual descriptions. Existing Med-VLP methods primarily focus on 2D images depicting a single body part, notably chest X-rays. In this paper, we extend the scope of Med-VLP to encompass 3D images, specifically targeting full-body scenarios, by using a multimodal dataset of CT images and reports. Compared with its 2D counterpart, 3D VLP is required to effectively capture essential semantics from the significantly sparser representation of 3D imaging. We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning, aligning grounded visual features with precise diagnostic text. Additionally, we develop an abnormality dictionary to augment contrastive learning with diverse negative samples. Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates that it can identify organs and abnormalities in a zero-shot manner using natural language. The performance of CT-GLIP is validated on a separate test set of 1,130 patients, focusing on the 16 most frequent abnormalities across 7 organs. The experimental results show our model's superior performance over the standard CLIP framework across zero-shot and fine-tuning scenarios, using both CNN and ViT architectures.
https://arxiv.org/abs/2404.15272
We introduce XFT, a simple yet powerful training scheme that merges upcycled Mixture-of-Experts (MoE) models to unleash the performance limit of instruction-tuned code Large Language Models (LLMs). While vanilla sparse upcycling fails to improve instruction tuning, XFT introduces a shared expert mechanism with a novel routing weight normalization strategy into sparse upcycling, which significantly boosts instruction tuning. After fine-tuning the upcycled MoE model, XFT introduces a learnable model merging mechanism to compile the upcycled MoE model back into a dense model, achieving upcycled MoE-level performance with only dense-model compute. By applying XFT to a 1.3B model, we create a new state-of-the-art tiny code LLM (<3B) with 67.1 and 64.6 pass@1 on HumanEval and HumanEval+, respectively. With the same data and model architecture, XFT improves supervised fine-tuning (SFT) by 13% on HumanEval+, along with consistent improvements from 2% to 13% on MBPP+, MultiPL-E, and DS-1000, demonstrating its generalizability. XFT is fully orthogonal to existing techniques such as Evol-Instruct and OSS-Instruct, opening a new dimension for improving code instruction tuning. Codes are available at this https URL.
https://arxiv.org/abs/2404.15247
Long document question answering (DocQA) aims to answer questions from long documents of over 10k words. Such documents usually contain content structures such as sections, sub-sections, and paragraph demarcations. However, the indexing of long documents remains under-explored, and existing systems generally employ fixed-length chunking. As they do not consider content structures, the resultant chunks can exclude vital information or include irrelevant content. Motivated by this, we propose Multi-view Content-aware indexing (MC-indexing) for more effective long DocQA via (i) segmenting the structured document into content chunks, and (ii) representing each content chunk in raw-text, keywords, and summary views. We highlight that MC-indexing requires neither training nor fine-tuning. Having plug-and-play capability, it can be seamlessly integrated with any retriever to boost its performance. Besides, we propose a long DocQA dataset that includes not only question-answer pairs, but also document structure and answer scope. Compared to state-of-the-art chunking schemes, MC-indexing significantly increases recall by 42.8%, 30.0%, 23.9%, and 16.3% at top k = 1.5, 3, 5, and 10, respectively. These improved scores are the average over 8 widely used retrievers (2 sparse and 6 dense) via extensive experiments.
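The three-view representation can be sketched as follows; the keyword and summary extractors here are crude placeholders (frequency counts and a leading-words stub) standing in for whatever extractors MC-indexing actually uses:

```python
from collections import Counter

# Hypothetical sketch of multi-view chunk indexing: each content chunk
# (here, one per section) is stored under three views -- raw text, crude
# keywords, and a stub summary -- so a retriever can match a query
# against any of the three.

def build_views(chunk_text):
    words = [w.lower().strip(".,") for w in chunk_text.split()]
    keywords = [w for w, _ in Counter(w for w in words if len(w) > 4).most_common(3)]
    summary = " ".join(chunk_text.split()[:8])  # stub: leading words only
    return {"raw": chunk_text, "keywords": keywords, "summary": summary}

index = [build_views(c) for c in [
    "Transformers process tokens with attention mechanisms.",
    "Retrieval systems rank documents by relevance scores.",
]]
```

Because chunk boundaries follow sections rather than a fixed token budget, an answer span is less likely to be split across chunks, which is the failure mode of fixed-length chunking described above.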
https://arxiv.org/abs/2404.15103
Sparse Mixtures of Experts (SMoE) scales model capacity without significant increases in training and inference costs, but exhibits two issues: (1) low expert activation, where only a small subset of experts is activated for optimization; (2) a lack of fine-grained analytical capability for the multiple semantic concepts within individual tokens. We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens. These sub-tokens are then assigned to and processed by a diverse set of experts in parallel, and seamlessly reintegrated into the original token form. The multi-head mechanism enables the model to collectively attend to information from various representation spaces within different experts, while significantly enhancing expert activation, thus deepening context understanding and alleviating overfitting. Moreover, MH-MoE is straightforward to implement and decoupled from other SMoE optimization methods, making it easy to integrate with other SMoE models for enhanced performance. Extensive experimental results across three tasks (English-focused language modeling, multilingual language modeling, and masked multi-modality modeling) demonstrate the effectiveness of MH-MoE.
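The split-route-merge step can be sketched as follows, with toy linear experts and random routing standing in for a learned router; dimensions and names are illustrative:

```python
import numpy as np

# Hypothetical sketch of the multi-head split-route-merge idea: a token
# embedding is split into h sub-tokens, each sub-token is processed by
# its (here randomly) assigned expert, and the outputs are concatenated
# back into the original token shape. Experts are toy linear maps.

rng = np.random.default_rng(0)
d_model, n_heads, n_experts = 16, 4, 3
d_sub = d_model // n_heads
experts = [rng.standard_normal((d_sub, d_sub)) for _ in range(n_experts)]

def mh_moe(token):
    subs = token.reshape(n_heads, d_sub)              # split into sub-tokens
    routes = rng.integers(0, n_experts, size=n_heads)
    outs = [subs[i] @ experts[routes[i]] for i in range(n_heads)]
    return np.concatenate(outs)                       # merge back

y = mh_moe(rng.standard_normal(d_model))
```

Routing h sub-tokens instead of one token gives each token h chances to reach different experts, which is the mechanism behind the higher expert activation claimed above.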
https://arxiv.org/abs/2404.15045
In this paper, we present PRISM, a Promptable and Robust Interactive Segmentation Model, aiming for precise segmentation of 3D medical images. PRISM accepts various visual inputs, including points, boxes, and scribbles as sparse prompts, as well as masks as dense prompts. Specifically, PRISM is designed with four principles to achieve robustness: (1) Iterative learning. The model produces segmentations by using visual prompts from previous iterations to achieve progressive improvement. (2) Confidence learning. PRISM employs multiple segmentation heads per input image, each generating a continuous map and a confidence score to optimize predictions. (3) Corrective learning. Following each segmentation iteration, PRISM employs a shallow corrective refinement network to reassign mislabeled voxels. (4) Hybrid design. PRISM integrates hybrid encoders to better capture both the local and global information. Comprehensive validation of PRISM is conducted using four public datasets for tumor segmentation in the colon, pancreas, liver, and kidney, highlighting challenges caused by anatomical variations and ambiguous boundaries in accurate tumor identification. Compared to state-of-the-art methods, both with and without prompt engineering, PRISM significantly improves performance, achieving results that are close to human levels. The code is publicly available at this https URL.
https://arxiv.org/abs/2404.15028
Plenty of existing work has analyzed the abilities of the transformer architecture by describing its representational capacity with formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language \emph{acceptance}. We contend that this is an ill-suited problem in the study of \emph{language models} (LMs), which are definitionally \emph{probability distributions} over strings. In this paper, we focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.
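As a concrete instance of the model class the result covers, here is a tiny bigram (n = 2) LM over the alphabet {a, b}: the probability of a string factorizes into conditionals on the previous symbol, and the EOS mass makes it a proper distribution over finite strings:

```python
from itertools import product

# A tiny n-gram LM (n = 2): P(string) is a product of conditional
# probabilities given the previous symbol, times the EOS probability at
# the end. Each conditional row sums to 1, so the model is a probability
# distribution over strings -- the object the paper's result concerns.

probs = {  # P(next | prev) over {a, b, EOS}; each row sums to 1
    "BOS": {"a": 0.5, "b": 0.4, "EOS": 0.1},
    "a":   {"a": 0.2, "b": 0.5, "EOS": 0.3},
    "b":   {"a": 0.6, "b": 0.1, "EOS": 0.3},
}

def string_prob(s):
    p, prev = 1.0, "BOS"
    for ch in s:
        p *= probs[prev][ch]
        prev = ch
    return p * probs[prev]["EOS"]
```

The paper's claim is that a transformer LM with hard or sparse attention can represent any such distribution exactly, e.g. by attending to the previous n-1 symbols to select the right conditional row.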
https://arxiv.org/abs/2404.14994
In biological tasks, data is rarely plentiful, as it is generated from hard-to-gather measurements. Therefore, pre-training foundation models on large quantities of available data and then transferring them to low-data downstream tasks is a promising direction. However, how to design effective foundation models for molecular learning remains an open question, with existing approaches typically focusing on models with large parameter capacities. In this work, we propose $\texttt{MiniMol}$, a foundation model for molecular learning with 10 million parameters. $\texttt{MiniMol}$ is pre-trained on a mix of roughly 3300 sparsely defined graph- and node-level tasks of both quantum and biological nature. The pre-training dataset includes approximately 6 million molecules and 500 million labels. To demonstrate the generalizability of $\texttt{MiniMol}$ across tasks, we evaluate it on downstream tasks from the Therapeutic Data Commons (TDC) ADMET group, showing significant improvements over the prior state-of-the-art foundation model across 17 tasks. $\texttt{MiniMol}$ will be a public and open-sourced model for future research.
https://arxiv.org/abs/2404.14986
Blocking is a critical step in entity resolution, and the emergence of neural network-based representation models has led to the development of dense blocking as a promising approach for exploring deep semantics in blocking. However, previous advanced self-supervised dense blocking approaches require domain-specific training on the target domain, which limits the benefits and rapid adaptation of these methods. To address this issue, we propose UBlocker, a dense blocker that is pre-trained on a domain-independent, easily-obtainable tabular corpus using self-supervised contrastive learning. By conducting domain-independent pre-training, UBlocker can be adapted to various downstream blocking scenarios without requiring domain-specific fine-tuning. To evaluate the universality of our entity blocker, we also construct a new benchmark covering a wide range of blocking tasks from multiple domains and scenarios. Our experiments show that the proposed UBlocker, without any domain-specific learning, significantly outperforms previous self- and unsupervised dense blocking methods and is comparable and complementary to the state-of-the-art sparse blocking methods.
https://arxiv.org/abs/2404.14831
Although the convolutional neural network (CNN) achieves excellent performance in vision tasks by extracting the intra-sample representation, it incurs a higher training expense because it stacks numerous convolutional layers. Recently, graph neural networks (GNNs), as bilinear models, have succeeded in exploring the underlying topological relationships among graph data with only a few graph neural layers. Unfortunately, a GNN cannot be directly utilized on non-graph data due to the lack of graph structure, and it has high inference latency in large-scale scenarios. Inspired by these complementary strengths and weaknesses, \textit{we discuss a natural question: how can these two heterogeneous networks be bridged?} In this paper, we propose a novel CNN2GNN framework to unify CNN and GNN via distillation. First, to break the limitations of GNNs, a differentiable sparse graph learning module is designed as the head of the networks to dynamically learn the graph for inductive learning. Then, a response-based distillation is introduced to transfer the knowledge from the CNN to the GNN and bridge these two heterogeneous networks. Notably, by extracting the intra-sample representation of a single instance and the topological relationship among the dataset simultaneously, the distilled ``boosted'' two-layer GNN achieves much higher performance on Mini-ImageNet than CNNs containing dozens of layers, such as ResNet152.
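The response-based distillation step can be sketched with the standard temperature-scaled KL form (a common choice for response distillation; the paper's exact loss may differ), with the CNN as teacher and the GNN as student:

```python
import numpy as np

# Hypothetical sketch of response-based distillation: the student (the
# GNN in this setting) is trained to match the teacher's (CNN's)
# softened output distribution via a temperature-scaled KL divergence.
# Logits here are dummies; only the loss form is shown.

def softmax(z, T=1.0):
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, T=2.0):
    p, q = softmax(teacher_logits, T), softmax(student_logits, T)
    return float(T * T * np.sum(p * np.log(p / q)))  # KL(teacher || student)

t = np.array([3.0, 1.0, 0.2])
zero = distill_loss(t, t)                       # matching responses
loss = distill_loss(t, np.array([0.1, 2.0, 0.5]))  # mismatched responses
```

Minimizing this loss pushes the GNN's per-instance responses toward the CNN's, which is how the intra-sample knowledge crosses the bridge between the two heterogeneous networks.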
https://arxiv.org/abs/2404.14822
In this paper, we present a method to reconstruct the world and multiple dynamic humans in 3D from a monocular video input. As a key idea, we represent both the world and multiple humans via the recently emerging 3D Gaussian Splatting (3D-GS) representation, enabling us to conveniently and efficiently compose and render them together. In particular, we address scenarios with severely limited and sparse observations in 3D human reconstruction, a common challenge encountered in the real world. To tackle this challenge, we introduce a novel approach to optimize the 3D-GS representation in a canonical space by fusing the sparse cues in the common space, where we leverage a pre-trained 2D diffusion model to synthesize unseen views while keeping consistency with the observed 2D appearances. We demonstrate that our method can reconstruct high-quality animatable 3D humans in various challenging examples, in the presence of occlusion, image crops, few-shot settings, and extremely sparse observations. After reconstruction, our method is capable of not only rendering the scene in any novel view at arbitrary time instances, but also editing the 3D scene by removing individual humans or applying different motions to each human. Through various experiments, we demonstrate the quality and efficiency of our method over alternative existing approaches.
https://arxiv.org/abs/2404.14410
We address the task of estimating camera parameters from a set of images depicting a scene. Popular feature-based structure-from-motion (SfM) tools solve this task by incremental reconstruction: they repeat triangulation of sparse 3D points and registration of more camera views to the sparse point cloud. We re-interpret incremental structure-from-motion as an iterated application and refinement of a visual relocalizer, that is, of a method that registers new views to the current state of the reconstruction. This perspective allows us to investigate alternative visual relocalizers that are not rooted in local feature matching. We show that scene coordinate regression, a learning-based relocalization approach, allows us to build implicit, neural scene representations from unposed images. Different from other learning-based reconstruction methods, we do not require pose priors nor sequential inputs, and we optimize efficiently over thousands of images. Our method, ACE0 (ACE Zero), estimates camera poses to an accuracy comparable to feature-based SfM, as demonstrated by novel view synthesis. Project page: this https URL
https://arxiv.org/abs/2404.14351
Light Detection and Ranging (LiDAR) technology has proven to be an important part of many robotics systems. Surface normals estimated from LiDAR data are commonly used for a variety of tasks in such systems. As most of today's mechanical LiDAR sensors produce sparse data, estimating normals from a single scan in a robust manner poses difficulties. In this paper, we address the problem of estimating normals for sparse LiDAR data while avoiding the typical issue of smoothing out normals in high-curvature areas. Mechanical LiDARs rotate a set of rigidly mounted lasers. One firing of such a set of lasers produces an array of points in which each point's neighbor is known due to the known firing pattern of the scanner. We use this knowledge to connect these points to their neighbors and label them using the angles of the lines connecting them. When estimating normals at these points, we only consider points with the same label as neighbors. This allows us to avoid smoothing normals across high-curvature areas. We evaluate our approach on various data, both self-recorded and publicly available, acquired using various sparse LiDAR sensors. We show that our method yields normals that are more robust in areas with high curvature, which leads to maps of higher quality. We also show that our method incurs only a constant-factor runtime overhead with respect to a lightweight baseline normal estimation procedure and is therefore suited for operation in computationally demanding environments.
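The labeling idea can be illustrated with a 2D toy: consecutive points from one firing are connected, each connecting segment is labeled by its quantized line angle, and normal estimation would then pool only neighbors sharing a label, so a corner is detected rather than smoothed over. The binning scheme here is an assumption, not the paper's:

```python
import math

# Hypothetical 2D sketch: segments between consecutive scan points get a
# label from their quantized (undirected) line angle. A point whose two
# adjacent segments carry different labels sits on a high-curvature
# feature, and its normal would be estimated per side, not across it.

def segment_angle_label(p, q, bins=8):
    ang = math.atan2(q[1] - p[1], q[0] - p[0]) % math.pi  # undirected angle
    return int(ang / (math.pi / bins))

# An L-shaped profile: a flat run followed by a vertical wall.
pts = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
labels = [segment_angle_label(pts[i], pts[i + 1]) for i in range(len(pts) - 1)]
```

The label change between the second and third segments marks the corner at (2, 0); averaging normals only within each label group keeps the two faces of the corner crisp.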
https://arxiv.org/abs/2404.14281