Working with cognate data involves handling synonyms, that is, multiple words that describe the same concept in a language. In the early days of language phylogenetics it was recommended to select only one synonym. However, as we show here, binary character matrices, which are used as input for computational methods, do allow for representing the entire dataset, including all synonyms. Here we address the question of how one can, and whether one should, include all synonyms, or whether it is preferable to select synonyms a priori. To this end, we perform maximum likelihood tree inferences with the widely used RAxML-NG tool and show that it yields plausible trees when all synonyms are used as input. Furthermore, we show that a priori synonym selection can yield topologically substantially different trees, and we therefore advise against it. To represent cognate data including all synonyms, we introduce two types of character matrices beyond the standard binary ones: probabilistic binary and probabilistic multi-valued character matrices. We further show that the character matrix type for which the inferred RAxML-NG tree is topologically closest to the gold standard is dataset-dependent. We also make available a Python interface for generating all of the above character matrix types for cognate data provided in CLDF format.
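For concreteness, the sketch below shows one way cognate sets with synonyms can be encoded as a standard binary character matrix (invented toy data; the probabilistic variants and the paper's CLDF interface are not reproduced here):

```python
# Sketch: encoding cognate sets as a binary character matrix, keeping all
# synonyms. Each cognate class becomes one column; a language gets a 1 if
# any of its words (synonyms included) belongs to that class.
# The cognate classes and languages below are invented for illustration.
cognate_classes = {
    "hand-1": {"english", "german"},
    "hand-2": {"french", "spanish"},
    "hand-3": {"german"},  # a German synonym falling in its own class
}
languages = ["english", "german", "french", "spanish"]

matrix = {
    lang: [1 if lang in members else 0 for members in cognate_classes.values()]
    for lang in languages
}
for lang, row in matrix.items():
    print(f"{lang:>8}: {row}")
```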
https://arxiv.org/abs/2404.19328
Large pre-trained vision-language models (VLMs) have shown impressive zero-shot ability on downstream tasks with manually designed prompts, which are not optimal for specific domains. To further adapt VLMs to downstream tasks, soft prompts have been proposed to replace manually designed ones; a soft prompt acts as a learnable vector that is fine-tuned on domain-specific data. Prior prompt learning methods primarily learn a fixed prompt and a residual prompt from training samples. However, the learned prompts lack diversity and ignore information about unseen domains, potentially compromising their transferability. In this paper, we reframe the prompt learning framework from a generative perspective and propose a simple yet efficient method for the Domain Generalization (DG) task, namely \textbf{S}oft \textbf{P}rompt \textbf{G}eneration (SPG). To the best of our knowledge, we are the first to introduce a generative model into prompt learning for VLMs, and we explore its potential for producing soft prompts by relying solely on the generative model, ensuring prompt diversity. Specifically, SPG consists of a two-stage training phase and an inference phase. During the training phase, we introduce soft prompt labels for each domain, aiming to incorporate domain knowledge into the generative model. During the inference phase, the generator of the generative model is employed to obtain instance-specific soft prompts for the unseen target domain. Extensive experiments on five domain generalization benchmarks across three DG tasks demonstrate that our proposed SPG achieves state-of-the-art performance. The code will be available soon.
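A minimal sketch of the generative idea, with invented names and dimensions rather than SPG's actual architecture: a small generator maps an instance's image feature to soft prompt vectors that could be prepended to the text encoder's input.

```python
import torch
import torch.nn as nn

class PromptGenerator(nn.Module):
    """Maps an image feature to a sequence of soft prompt vectors.
    Hypothetical stand-in for SPG's generator; dimensions are invented."""
    def __init__(self, feat_dim=512, prompt_len=4, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, prompt_len * embed_dim),
        )
        self.prompt_len, self.embed_dim = prompt_len, embed_dim

    def forward(self, image_feat):  # (batch, feat_dim)
        out = self.net(image_feat)
        return out.view(-1, self.prompt_len, self.embed_dim)

gen = PromptGenerator()
soft_prompts = gen(torch.randn(8, 512))  # instance-specific soft prompts
print(soft_prompts.shape)                # torch.Size([8, 4, 512])
```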
https://arxiv.org/abs/2404.19286
Adapting Large Language Models (LLMs) to new tasks through fine-tuning has been made more efficient by the introduction of Parameter-Efficient Fine-Tuning (PEFT) techniques, such as LoRA. However, these methods often underperform compared to full fine-tuning, particularly in scenarios involving complex datasets. This issue becomes even more pronounced in complex domains, highlighting the need for improved PEFT approaches that can achieve better performance. Through a series of experiments, we have uncovered two critical insights that shed light on the training and parameter inefficiency of LoRA. Building on these insights, we have developed HydraLoRA, a LoRA framework with an asymmetric structure that eliminates the need for domain expertise. Our experiments demonstrate that HydraLoRA outperforms other PEFT approaches, even those that rely on domain knowledge during the training and inference phases. \href{this https URL}{Code}.
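The abstract does not spell out the asymmetric structure; one plausible reading, sketched below with invented names and sizes, pairs a single shared down-projection with several up-projection heads mixed by a learned router:

```python
import torch
import torch.nn as nn

class AsymmetricLoRA(nn.Module):
    """Sketch of an asymmetric LoRA layer: one shared down-projection A and
    several up-projection heads B, mixed by a learned router. This is an
    illustrative reading of HydraLoRA's asymmetry, not the official code."""
    def __init__(self, dim=768, rank=8, n_heads=3):
        super().__init__()
        self.A = nn.Linear(dim, rank, bias=False)            # shared across heads
        self.Bs = nn.ModuleList(
            nn.Linear(rank, dim, bias=False) for _ in range(n_heads)
        )
        self.router = nn.Linear(dim, n_heads)

    def forward(self, x):  # (batch, dim)
        weights = torch.softmax(self.router(x), dim=-1)      # (batch, n_heads)
        z = self.A(x)                                        # (batch, rank)
        heads = torch.stack([B(z) for B in self.Bs], dim=1)  # (batch, n_heads, dim)
        return (weights.unsqueeze(-1) * heads).sum(dim=1)    # mixed weight update

delta = AsymmetricLoRA()(torch.randn(4, 768))
print(delta.shape)  # torch.Size([4, 768])
```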
https://arxiv.org/abs/2404.19245
Large vision-language models have impressively promoted the performance of 2D visual recognition under zero/few-shot scenarios. In this paper, we focus on exploiting a large vision-language model, namely CLIP, to address zero/few-shot 3D shape recognition based on multi-view representations. The key challenge for both tasks is to generate a discriminative descriptor of a 3D shape represented by multiple view images, either without explicit training (zero-shot 3D shape recognition) or with only a limited amount of training data (few-shot 3D shape recognition). We observe that the two tasks are related and can be addressed jointly. Specifically, leveraging the descriptor that is effective for zero-shot inference to guide the tuning of the aggregated descriptor under few-shot training can significantly improve few-shot learning efficacy. Hence, we propose the Prompt-Enhanced View Aggregation Network (PEVA-Net) to simultaneously address zero/few-shot 3D shape recognition. Under the zero-shot scenario, we propose to leverage prompts built from the candidate categories to enhance the aggregation of the view-associated visual features. The resulting aggregated feature enables effective zero-shot recognition of 3D shapes. Under the few-shot scenario, we first exploit a transformer encoder to aggregate the view-associated visual features into a global descriptor. To tune the encoder, together with the main classification loss, we propose a self-distillation scheme via a feature distillation loss that treats the zero-shot descriptor as the guidance signal for the few-shot descriptor. This scheme can significantly enhance few-shot learning efficacy.
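The self-distillation scheme can be summarized in a few lines (the loss combination and weighting below are assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def peva_style_loss(logits, labels, few_shot_desc, zero_shot_desc, lam=1.0):
    """Classification loss plus a feature distillation term that treats the
    zero-shot descriptor as a fixed guidance signal. `lam` is an invented
    trade-off weight; the paper's exact formulation may differ."""
    cls_loss = F.cross_entropy(logits, labels)
    distill_loss = F.mse_loss(few_shot_desc, zero_shot_desc.detach())
    return cls_loss + lam * distill_loss

loss = peva_style_loss(
    torch.randn(4, 40), torch.randint(0, 40, (4,)),  # toy logits and labels
    torch.randn(4, 512), torch.randn(4, 512),        # toy descriptors
)
```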
https://arxiv.org/abs/2404.19168
This technical report describes the design and training of novel speculative decoding draft models for accelerating the inference speeds of large language models in a production environment. By conditioning draft predictions on both context vectors and sampled tokens, we can train our speculators to efficiently predict high-quality n-grams, which the base model then accepts or rejects. This allows us to effectively predict multiple tokens per inference forward pass, accelerating wall-clock inference speeds of highly optimized base model implementations by a factor of 2-3x. We explore these initial results and describe next steps for further improvements.
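The accept/reject mechanic of speculative decoding can be sketched as follows; `draft_model` and `base_model` are stand-ins returning per-position logits, greedy verification replaces the proper rejection-sampling scheme, and the report's conditioning on context vectors is not reproduced:

```python
import torch

def speculative_step(draft_model, base_model, context, k=4):
    """One speculative decoding step (batch size 1 assumed): the draft model
    proposes a k-token n-gram, the base model scores all positions in a single
    forward pass, and the longest agreeing prefix is kept."""
    drafted, ctx = [], context
    for _ in range(k):                               # cheap autoregressive drafting
        tok = draft_model(ctx).argmax(dim=-1)[..., -1:]
        drafted.append(tok)
        ctx = torch.cat([ctx, tok], dim=-1)
    proposal = torch.cat(drafted, dim=-1)            # (1, k)

    # One base-model pass verifies all proposed positions at once.
    logits = base_model(torch.cat([context, proposal], dim=-1))
    verified = logits.argmax(dim=-1)[..., context.shape[-1] - 1 : -1]

    accept = (verified == proposal).long().cumprod(dim=-1)  # stop at first mismatch
    n_accepted = int(accept.sum())
    return torch.cat([context, proposal[..., :n_accepted]], dim=-1)
```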
https://arxiv.org/abs/2404.19124
Graphics Processing Units (GPUs) have become the leading hardware accelerator for deep learning applications and are widely used in the training and inference of transformers; transformers have achieved state-of-the-art performance in many areas of machine learning and are used especially in most modern Large Language Models (LLMs). However, GPUs require large amounts of energy, which poses environmental concerns, demands high operational costs, and makes GPUs unsuitable for edge computing. We develop an accelerator for a transformer, namely Llama 2, an open-source state-of-the-art LLM, using high-level synthesis (HLS) on Field Programmable Gate Arrays (FPGAs). HLS allows us to rapidly prototype FPGA designs without writing code at the register-transfer level (RTL). We name our method HLSTransform, and the FPGA designs we synthesize with HLS achieve up to a 12.75x and an 8.25x reduction in energy used per token on the Xilinx Virtex UltraScale+ VU9P FPGA compared to an Intel Xeon Broadwell E5-2686 v4 CPU and an NVIDIA RTX 3090 GPU, respectively, while increasing inference speeds by up to 2.46x over the CPU and maintaining 0.53x the speed of the RTX 3090 GPU despite the GPU's 4x higher base clock rate. Given the lack of existing open-source FPGA accelerators for transformers, we open-source our code and document our synthesis steps. We hope this work will serve as a step toward democratizing the use of FPGAs in transformer inference and inspire research into energy-efficient inference methods as a whole. The code can be found on this https URL.
https://arxiv.org/abs/2405.00738
Large Language Models (LLMs) have significantly advanced natural language processing (NLP) tasks but also pose ethical and societal risks due to their propensity to generate harmful content. To address this, various approaches have been developed to prevent LLMs from producing unsafe content. However, existing methods have limitations, including the need to train specific control models and to proactively intervene during text generation, which lead to quality degradation and increased computational overhead. To mitigate these limitations, we propose LLMSafeGuard, a lightweight framework to safeguard LLM text generation in real time. LLMSafeGuard integrates an external validator into the beam search algorithm during decoding, rejecting candidates that violate safety constraints while allowing valid ones to proceed. We introduce a similarity-based validation approach, simplifying constraint introduction and eliminating the need for control model training. Additionally, LLMSafeGuard employs a context-wise timing selection strategy, intervening in the LLM only when necessary. We evaluate LLMSafeGuard on two tasks, detoxification and copyright safeguarding, and demonstrate its superior performance over SOTA baselines. For instance, on the detoxification task, LLMSafeGuard reduces the average toxicity score of LLM output by 29.7% compared to the best baseline, while preserving linguistic quality similar to natural output. Similarly, on the copyright task, LLMSafeGuard decreases the Longest Common Subsequence (LCS) by 56.2% compared to baselines. Moreover, our context-wise timing selection strategy reduces inference time by at least 24% while maintaining effectiveness comparable to validating at every time step. LLMSafeGuard also offers tunable parameters to balance its effectiveness and efficiency.
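A minimal sketch of validator-in-the-loop beam search (names are illustrative, not LLMSafeGuard's API):

```python
def safe_beam_step(beams, expand, validator, beam_width=4):
    """One beam search step with an external safety validator: candidate
    continuations that violate the constraint are rejected before ranking.
    `expand` yields (sequence, score) candidates for a beam; `validator` is
    any boolean safety check, e.g. a similarity threshold against a set of
    unsafe demonstrations. All names here are placeholders."""
    candidates = []
    for seq, score in beams:
        for next_seq, next_score in expand(seq, score):
            if validator(next_seq):          # keep only safe candidates
                candidates.append((next_seq, next_score))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_width]
```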
https://arxiv.org/abs/2404.19048
Speculative decoding has demonstrated its effectiveness in accelerating the inference of large language models while maintaining a consistent sampling distribution. However, the conventional approach of training a separate draft model to achieve a satisfactory token acceptance rate can be costly. Drawing inspiration from early exiting, we propose a novel self-speculative decoding framework \emph{Kangaroo}, which uses a fixed shallow sub-network as a self-draft model, with the remaining layers serving as the larger target model. We train a lightweight and efficient adapter module on top of the sub-network to bridge the gap between the sub-network and the full model's representation ability. It is noteworthy that the inference latency of the self-draft model may no longer be negligible compared to the large model, necessitating strategies to increase the token acceptance rate while minimizing the drafting steps of the small model. To address this challenge, we introduce an additional early exiting mechanism for generating draft tokens. Specifically, we halt the small model's subsequent prediction during the drafting phase once the confidence level for the current token falls below a certain threshold. Extensive experiments on the Spec-Bench demonstrate the effectiveness of Kangaroo. Under single-sequence verification, Kangaroo achieves speedups up to $1.68\times$ on Spec-Bench, outperforming Medusa-1 with 88.7\% fewer additional parameters (67M compared to 591M). The code for Kangaroo is available at this https URL.
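The drafting phase with confidence-based early exiting might look like the following sketch (names, shapes, and the threshold are illustrative; the verification pass by the full model is omitted):

```python
import torch

def draft_with_early_exit(sub_network, adapter, tokens, max_draft=8, tau=0.6):
    """Kangaroo-style drafting sketch (batch size 1 assumed): a shallow
    sub-network plus a light adapter proposes tokens, and drafting halts as
    soon as the confidence of the current token drops below `tau`."""
    drafted = []
    for _ in range(max_draft):
        logits = adapter(sub_network(tokens))        # shallow-layer features
        probs = torch.softmax(logits[..., -1, :], dim=-1)
        conf, tok = probs.max(dim=-1)
        if conf.item() < tau:                        # early exit: stop drafting
            break
        drafted.append(int(tok))
        tokens = torch.cat([tokens, tok.view(1, 1)], dim=-1)
    return drafted
```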
https://arxiv.org/abs/2404.18911
Recent work has demonstrated that the latent spaces of large language models (LLMs) contain directions predictive of the truth of sentences. Multiple methods recover such directions and build probes that are described as getting at a model's "knowledge" or "beliefs". We investigate this phenomenon, looking closely at the impact of context on the probes. Our experiments establish where in the LLM the probe's predictions can be described as being conditional on the preceding (related) sentences. Specifically, we quantify the responsiveness of the probes to the presence of (negated) supporting and contradicting sentences, and score the probes on their consistency. We also perform a causal intervention experiment, investigating whether moving the representation of a premise along these belief directions influences the position of the hypothesis along that same direction. We find that the probes we test are generally context sensitive, but that contexts which should not affect the truth often still impact the probe outputs. Our experiments show that the types of errors depend on the layer, the (type of) model, and the kind of data. Finally, our results suggest that belief directions are (one of the) causal mediators in the inference process that incorporates in-context information.
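As a toy illustration of what such a probe and intervention involve, the sketch below finds a direction with a simple mean-difference probe on synthetic activations and shifts a representation along it (one of several probing methods in the literature, not the paper's exact setup):

```python
import numpy as np

# Toy belief-direction probe. `true_h`/`false_h` stand in for layer
# activations of true and false sentences; real probes are fit on LLM states.
rng = np.random.default_rng(0)
true_h = rng.normal(0.5, 1.0, size=(100, 64))
false_h = rng.normal(-0.5, 1.0, size=(100, 64))

direction = true_h.mean(axis=0) - false_h.mean(axis=0)
direction /= np.linalg.norm(direction)

def probe(h):
    return h @ direction                  # projection along the direction

# Intervention: shifting a representation along the belief direction raises
# its truth score. The paper's experiment additionally checks whether shifting
# a premise moves a dependent hypothesis along the same direction.
premise = rng.normal(size=64)
shifted = premise + 2.0 * direction
print(probe(premise), probe(shifted))
```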
https://arxiv.org/abs/2404.18865
In this paper, we present a different way to use two modalities, in which a single model sees either one modality or the other. This can be useful when adapting a unimodal model to leverage more information while respecting a limited computational budget, as it yields a single model able to deal with either modality. To describe this, we coined the term anymodal learning. An example use case is room surveillance: when the lights are off, an infrared modality is much more valuable, while a visible one provides more discriminative information when the lights are on. This work investigates how to efficiently leverage visible and infrared/thermal modalities in a transformer-based object detection backbone to create an anymodal architecture. Our approach adds no inference overhead at test time while exploring an effective way to exploit the two modalities during training. To accomplish this, we introduce a novel anymodal training technique, Mixed Patches (MiPa), in conjunction with a patch-wise domain-agnostic module responsible for learning the best way to find a common representation of both modalities. This approach proves able to balance the modalities, reaching results on individual-modality benchmarks competitive with unimodal architectures across three different visible-infrared object detection datasets. Finally, when used as a regularizer for the stronger modality, our method can beat the performance of multimodal fusion methods while requiring only a single modality during inference. Notably, MiPa achieves state-of-the-art results on the LLVIP visible/infrared benchmark. Code: this https URL
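A rough sketch of the Mixed Patches idea on patch embeddings (the sampling scheme and mixing ratio are guesses, not MiPa's exact recipe):

```python
import torch

def mix_patches(visible, infrared, p=0.5):
    """Build one training input by randomly taking each patch token from
    either the visible or the infrared view of the same scene. Shapes are
    (batch, num_patches, dim); `p` is an illustrative mixing ratio."""
    mask = (torch.rand(visible.shape[:2]) < p).unsqueeze(-1)  # per-patch choice
    return torch.where(mask, visible, infrared)

vis = torch.randn(2, 196, 768)   # e.g., 14x14 ViT patch embeddings
ir = torch.randn(2, 196, 768)
mixed = mix_patches(vis, ir)     # one input drawn from both modalities
```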
https://arxiv.org/abs/2404.18849
We introduce Harmonic Robustness, a powerful and intuitive method to test the robustness of any machine-learning model either during training or in black-box real-time inference monitoring without ground-truth labels. It is based on functional deviation from the harmonic mean value property, indicating instability and lack of explainability. We show implementation examples in low-dimensional trees and feedforward NNs, where the method reliably identifies overfitting, as well as in more complex high-dimensional models such as ResNet-50 and Vision Transformer where it efficiently measures adversarial vulnerability across image classes.
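The underlying check is easy to sketch: a harmonic function equals the average of its values on a small sphere around a point, so the deviation from that average serves as a robustness score (the sampling scheme and radius below are illustrative choices):

```python
import numpy as np

def harmonic_deviation(f, x, eps=1e-2, n_samples=64, rng=None):
    """Gap between f(x) and the average of f on a small sphere around x.
    For a harmonic function this gap is zero; a large gap signals
    instability. Monte Carlo sphere sampling is an illustrative choice."""
    rng = rng or np.random.default_rng(0)
    dirs = rng.normal(size=(n_samples, x.size))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # points on unit sphere
    sphere_vals = np.array([f(x + eps * d) for d in dirs])
    return abs(f(x) - sphere_vals.mean())

# A linear (harmonic) function scores near zero; a kinked one does not.
print(harmonic_deviation(lambda v: v.sum(), np.zeros(8)))
print(harmonic_deviation(lambda v: np.abs(v).sum(), np.zeros(8)))
```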
https://arxiv.org/abs/2404.18825
Explanatory inference is the creation and evaluation of hypotheses that provide explanations, and is sometimes known as abduction or abductive inference. Generative AI is a new set of artificial intelligence models based on novel algorithms for generating text, images, and sounds. This paper proposes a set of benchmarks for assessing the ability of AI programs to perform explanatory inference, and uses them to determine the extent to which ChatGPT, a leading generative AI model, is capable of making explanatory inferences. Tests on the benchmarks reveal that ChatGPT performs creative and evaluative inferences in many domains, although it is limited to verbal and visual modalities. Claims that ChatGPT and similar models are incapable of explanation, understanding, causal reasoning, meaning, and creativity are rebutted.
https://arxiv.org/abs/2404.18982
This paper introduces YOLOv8-TO, a novel approach for reverse engineering of topology-optimized structures into interpretable geometric parameters using the YOLOv8 instance segmentation model. Density-based topology optimization methods require post-processing to convert the optimal density distribution into a parametric representation for design exploration and integration with CAD tools. Traditional methods such as skeletonization struggle with complex geometries and require manual intervention. YOLOv8-TO addresses these challenges by training a custom YOLOv8 model to automatically detect and reconstruct structural components from binary density distributions. The model is trained on a diverse dataset of both optimized and random structures generated using the Moving Morphable Components method. A custom reconstruction loss function based on the dice coefficient of the predicted geometry is used to train the new regression head of the model via self-supervised learning. The method is evaluated on test sets generated from different topology optimization methods, including out-of-distribution samples, and compared against a skeletonization approach. Results show that YOLOv8-TO significantly outperforms skeletonization in reconstructing visually and structurally similar designs. The method showcases an average improvement of 13.84% in the Dice coefficient, with peak enhancements reaching 20.78%. The method demonstrates good generalization to complex geometries and fast inference times, making it suitable for integration into design workflows using regular workstations. Limitations include the sensitivity to non-max suppression thresholds. YOLOv8-TO represents a significant advancement in topology optimization post-processing, enabling efficient and accurate reverse engineering of optimized structures for design exploration and manufacturing.
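For reference, the Dice coefficient on binary masks, on which both the reconstruction loss and the reported improvements are based, can be computed as follows (a plain NumPy sketch; the training loss would use a differentiable soft variant on predicted probabilities):

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-8):
    """Dice coefficient between two binary masks: 2|A∩B| / (|A| + |B|).
    A Dice-based reconstruction loss can then be written as 1 - dice."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)

a = np.zeros((32, 32)); a[8:24, 8:24] = 1    # toy "predicted" geometry
b = np.zeros((32, 32)); b[12:28, 8:24] = 1   # toy "target" geometry
print(dice_coefficient(a, b))                # 0.75: overlap-driven similarity
```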
https://arxiv.org/abs/2404.18763
Video Anomaly Detection (VAD) identifies unusual activities in video streams, a key technology with broad applications ranging from surveillance to healthcare. Tackling VAD in real-life settings poses significant challenges due to the dynamic nature of human actions, environmental variations, and domain shifts. Many research initiatives neglect these complexities, often concentrating on traditional testing methods that fail to account for performance on unseen datasets, creating a gap between theoretical models and their real-world utility. Online learning is a potential strategy to mitigate this issue by allowing models to adapt to new information continuously. This paper uses an online learning framework to assess how well current VAD algorithms, particularly those based on pose analysis (chosen for their efficiency and privacy advantages), can adjust to real-life conditions. Our proposed framework enables continuous model updates with streaming data from novel environments, thus mirroring real-world challenges and evaluating the models' ability to adapt in real time while maintaining accuracy. We investigate three state-of-the-art models in this setting, focusing on their adaptability across different domains. Our findings indicate that, even under the most challenging conditions, our online learning approach allows a model to preserve 89.39% of its original effectiveness compared to its offline-trained counterpart in a specific target domain.
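The evaluation protocol can be sketched as a test-then-train loop over streaming clips (all callables below are placeholders, not the paper's implementation):

```python
def online_adaptation(model, stream, update, evaluate, buffer_size=256):
    """Sketch of an online-learning protocol: each incoming clip from a novel
    domain is scored before the model is updated on recently seen data,
    mirroring prequential ("test-then-train") evaluation."""
    buffer, scores = [], []
    for clip in stream:
        scores.append(evaluate(model, clip))   # test before training on it
        buffer.append(clip)
        if len(buffer) >= buffer_size:
            update(model, buffer)              # continuous model update
            buffer.clear()
    return scores
```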
https://arxiv.org/abs/2404.18747
The "meaning" of an iconic gesture is conditioned on its informational evaluation. Only informational evaluation lifts a gesture to a quasi-linguistic level that can interact with verbal content. Interaction is either vacuous or regimented by usual lexicon-driven inferences. Informational evaluation is spelled out as extended exemplification (extemplification) in terms of perceptual classification of a gesture's visual iconic model. The iconic model is derived from Frege/Montague-like truth-functional evaluation of a gesture's form within spatially extended domains. We further argue that the perceptual classification of instances of visual communication requires a notion of meaning different from Frege/Montague frameworks. Therefore, a heuristic for gesture interpretation is provided that can guide the working semanticist. In sum, an iconic gesture semantics is introduced which covers the full range from kinematic gesture representations over model-theoretic evaluation to inferential interpretation in dynamic semantic frameworks.
https://arxiv.org/abs/2404.18708
Collective Perception has attracted significant attention in recent years due to its advantages in mitigating occlusion and expanding the field of view, thereby enhancing reliability, efficiency, and, most crucially, decision-making safety. However, developing collective perception models is highly resource-demanding due to the extensive requirements of processing input data for many agents, usually dozens of images and point clouds for a single frame. This not only slows down the model development process for collective perception but also impedes the utilization of larger models. In this paper, we propose an agent-based training framework that handles the deep learning modules and agent data separately to achieve a cleaner data flow structure. This framework not only provides an API for flexibly prototyping the data processing pipeline and defining the gradient calculation for each agent, but also provides a user interface for interactive training, testing, and data visualization. Training experiments with four collective object detection models on the prominent collective perception benchmark OPV2V show that agent-based training can significantly reduce GPU memory consumption and training time while retaining inference performance. The framework and model implementations are available at \url{this https URL}
https://arxiv.org/abs/2404.18617
Event-based sensors are well suited for real-time processing due to their fast response times and their encoding of the sensory data as successive temporal differences. These and other valuable properties, such as a high dynamic range, are suppressed when the data is converted to a frame-based format. However, most current methods either collapse events into frames or cannot scale up when processing the event data directly event-by-event. In this work, we address the key challenges of scaling up event-by-event modeling of the long event streams emitted by such sensors, which is a particularly relevant problem for neuromorphic computing. While prior methods can process up to a few thousand time steps, our model, based on modern recurrent deep state-space models, scales to event streams of millions of events for both training and inference. We leverage their stable parameterization for learning long-range dependencies, their parallelizability along the sequence dimension, and their ability to integrate asynchronous events effectively to scale them up to long event streams. We further augment these with novel event-centric techniques, enabling our model to match or beat the state-of-the-art performance on several event stream benchmarks. On the Spiking Speech Commands task, we improve the state of the art by a large margin of 6.6%, to 87.1%. On the DVS128-Gestures dataset, we achieve competitive results without using frames or convolutional neural networks. Our work demonstrates, for the first time, that it is possible to use fully event-based processing with purely recurrent networks to achieve state-of-the-art task performance on several event-based benchmarks.
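The key mechanism, integrating asynchronously timed events into a linear state-space recurrence, can be sketched in scalar form (invented parameters; the paper's models are learned, multi-dimensional, and parallelized along the sequence):

```python
import numpy as np

def ssm_over_events(timestamps, values, a=-1.0, b=1.0):
    """Toy scalar state-space recurrence h' = a*h + b*x, discretized per event
    with the actual inter-event gap dt (zero-order hold), so irregular timing
    enters the model through exp(a*dt)."""
    h, prev_t, states = 0.0, timestamps[0], []
    for t, x in zip(timestamps, values):
        dt = t - prev_t
        decay = np.exp(a * dt)                        # state carried across gap
        h = decay * h + ((decay - 1.0) / a) * b * x   # ZOH input integration
        states.append(h)
        prev_t = t
    return np.array(states)

# Irregularly spaced events: longer gaps decay the state more.
print(ssm_over_events(np.array([0.0, 0.1, 0.5, 0.55]), np.ones(4)))
```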
https://arxiv.org/abs/2404.18508
MTL is a learning paradigm that effectively leverages both task-specific and shared information to address multiple related tasks simultaneously. In contrast to STL, MTL offers a suite of benefits that enhance both the training process and the inference efficiency. MTL's key advantages encompass streamlined model architecture, performance enhancement, and cross-domain generalizability. Over the past twenty years, MTL has become widely recognized as a flexible and effective approach in various fields, including CV, NLP, recommendation systems, disease prognosis and diagnosis, and robotics. This survey provides a comprehensive overview of the evolution of MTL, encompassing the technical aspects of cutting-edge methods from traditional approaches to deep learning and the latest trend of pretrained foundation models. Our survey methodically categorizes MTL techniques into five key areas: regularization, relationship learning, feature propagation, optimization, and pre-training. This categorization not only chronologically outlines the development of MTL but also dives into various specialized strategies within each category. Furthermore, the survey reveals how the MTL evolves from handling a fixed set of tasks to embracing a more flexible approach free from task or modality constraints. It explores the concepts of task-promptable and -agnostic training, along with the capacity for ZSL, which unleashes the untapped potential of this historically coveted learning paradigm. Overall, we hope this survey provides the research community with a comprehensive overview of the advancements in MTL from its inception in 1997 to the present in 2023. We address present challenges and look ahead to future possibilities, shedding light on the opportunities and potential avenues for MTL research in a broad manner. This project is publicly available at this https URL.
https://arxiv.org/abs/2404.18961
Low Rank Adaptation (LoRA) has emerged as one of the most widely adopted methods for Parameter Efficient Fine-Tuning (PEFT) of Large Language Models (LLMs). LoRA reduces the number of trainable parameters and memory usage while achieving comparable performance to full fine-tuning. We aim to assess the viability of training and serving LLMs fine-tuned with LoRA in real-world applications. First, we measure the quality of LLMs fine-tuned with quantized low rank adapters across 10 base models and 31 tasks for a total of 310 models. We find that 4-bit LoRA fine-tuned models outperform base models by 34 points and GPT-4 by 10 points on average. Second, we investigate the most effective base models for fine-tuning and assess the correlative and predictive capacities of task complexity heuristics in forecasting the outcomes of fine-tuning. Finally, we evaluate the latency and concurrency capabilities of LoRAX, an open-source Multi-LoRA inference server that facilitates the deployment of multiple LoRA fine-tuned models on a single GPU using shared base model weights and dynamic adapter loading. LoRAX powers LoRA Land, a web application that hosts 25 LoRA fine-tuned Mistral-7B LLMs on a single NVIDIA A100 GPU with 80GB memory. LoRA Land highlights the quality and cost-effectiveness of employing multiple specialized LLMs over a single, general-purpose LLM.
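Conceptually, multi-LoRA serving in this style keeps one copy of the base weights and swaps small adapters per request; the sketch below uses invented class and method names, not LoRAX's actual API:

```python
class MultiLoRAServer:
    """Sketch of multi-adapter serving: one shared copy of the base weights,
    with small LoRA adapters loaded on demand and cached per target model.
    `base_model.generate(..., lora=...)` is a hypothetical interface."""
    def __init__(self, base_model, load_adapter):
        self.base = base_model          # shared across all fine-tuned variants
        self.load_adapter = load_adapter
        self.cache = {}

    def generate(self, prompt, adapter_id):
        if adapter_id not in self.cache:                 # dynamic adapter loading
            self.cache[adapter_id] = self.load_adapter(adapter_id)
        adapter = self.cache[adapter_id]
        return self.base.generate(prompt, lora=adapter)  # base weights + delta
```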
https://arxiv.org/abs/2405.00732
Language models can hallucinate when performing complex and detailed mathematical reasoning. Physics provides a rich domain for assessing mathematical reasoning capabilities, since physical context constrains the use of symbols, which must satisfy complex semantics (\textit{e.g.,} units, tensorial order); this leads to instances where an inference may be algebraically coherent, yet unphysical. In this work, we assess the ability of Language Models (LMs) to perform fine-grained mathematical and physical reasoning using a curated dataset encompassing multiple notations and Physics subdomains. We improve zero-shot scores using synthetic in-context examples, and demonstrate non-linear degradation of derivation quality with perturbation strength via the progressive omission of supporting premises. We find that the models' mathematical reasoning is not physics-informed in this setting: physical context is predominantly ignored in favour of reverse-engineering solutions.
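A toy illustration of "algebraically coherent yet unphysical" (entirely illustrative; the paper probes LMs rather than building a unit checker): track dimensions alongside values and flag additions whose operands disagree.

```python
# Quantities are (value, units) pairs, with units as exponent dictionaries.
def add(q1, q2):
    (v1, u1), (v2, u2) = q1, q2
    if u1 != u2:
        raise ValueError(f"unit mismatch: {u1} + {u2}")
    return (v1 + v2, u1)

velocity = (3.0, {"m": 1, "s": -1})
time = (2.0, {"s": 1})

add(velocity, velocity)   # fine: same dimensions
try:
    add(velocity, time)   # algebra would allow 3.0 + 2.0, physics does not
except ValueError as e:
    print(e)
```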
https://arxiv.org/abs/2404.18384