Numerous sign language datasets exist, yet they typically cover only a limited selection of the thousands of signs used globally. Moreover, creating diverse sign language datasets is an expensive and challenging task due to the costs associated with gathering a varied group of signers. Motivated by these challenges, we aimed to develop a solution that addresses these limitations. In this context, we focused on textually describing body movements from skeleton keypoint sequences, leading to the creation of a new dataset. We structured this dataset around AUTSL, a comprehensive isolated Turkish sign language dataset. We also developed a baseline model, SkelCap, which can generate textual descriptions of body movements. This model processes the skeleton keypoints data as a vector, applies a fully connected layer for embedding, and utilizes a transformer neural network for sequence-to-sequence modeling. We conducted extensive evaluations of our model, including signer-agnostic and sign-agnostic assessments. The model achieved promising results, with a ROUGE-L score of 0.98 and a BLEU-4 score of 0.94 in the signer-agnostic evaluation. The dataset we have prepared, namely the AUTSL-SkelCap, will be made publicly available soon.
https://arxiv.org/abs/2405.02977
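The SkelCap pipeline above (skeleton keypoints flattened into per-frame vectors, embedded by a fully connected layer, then fed to a transformer for sequence-to-sequence modeling) can be sketched roughly as follows. All shapes, the keypoint count, and the weight initialization are illustrative assumptions, not the authors' code; in practice the projection would be a learned layer feeding a transformer encoder-decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skeleton sequence: T frames, J keypoints, (x, y) coordinates.
T, J, d_model = 16, 27, 64
keypoints = rng.normal(size=(T, J, 2))

# Flatten each frame's keypoints into one vector, as the abstract describes.
frames = keypoints.reshape(T, J * 2)           # (T, 54)

# Fully connected embedding layer (weights would be learned in practice).
W = rng.normal(scale=0.02, size=(J * 2, d_model))
b = np.zeros(d_model)
embedded = frames @ W + b                      # (T, d_model)

# Standard sinusoidal positional encoding, added before the transformer.
pos = np.arange(T)[:, None]
i = np.arange(d_model)[None, :]
angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
pe = np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
encoder_input = embedded + pe                  # fed to a seq2seq transformer
print(encoder_input.shape)
```

From here, a standard encoder-decoder transformer would attend over `encoder_input` and decode the textual description token by token.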
Due to the ever-increasing availability of video surveillance cameras and the growing need for crime prevention, the violence detection task is attracting greater attention from the research community. With respect to other action recognition tasks, violence detection in surveillance videos shows additional issues, such as the presence of a significant variety of real fight scenes. Unfortunately, available datasets seem to be very small compared with other action recognition datasets. Moreover, in surveillance applications, the people in the scenes differ for each video and the background of the footage differs for each camera. Also, violent actions in real-life surveillance videos must be detected quickly to prevent unwanted consequences, thus models would benefit from a reduction in memory usage and computational costs. Such problems make classical action recognition methods difficult to adopt. To tackle all these issues, we introduce JOSENet, a novel self-supervised framework that provides outstanding performance for violence detection in surveillance videos. The proposed model receives two spatiotemporal video streams, i.e., RGB frames and optical flows, and involves a new regularized self-supervised learning approach for videos. JOSENet provides improved performance compared to self-supervised state-of-the-art methods, while requiring one-fourth of the number of frames per video segment and a reduced frame rate. The source code and the instructions to reproduce our experiments are available at this https URL.
https://arxiv.org/abs/2405.02961
Given a query consisting of a reference image and a relative caption, Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption. The reliance of supervised methods on labor-intensive manually labeled datasets hinders their broad applicability. In this work, we introduce a new task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled training dataset. We propose an approach named iSEARLE (improved zero-Shot composEd imAge Retrieval with textuaL invErsion) that involves mapping the visual information of the reference image into a pseudo-word token in CLIP token embedding space and combining it with the relative caption. To foster research on ZS-CIR, we present an open-domain benchmarking dataset named CIRCO (Composed Image Retrieval on Common Objects in context), the first CIR dataset where each query is labeled with multiple ground truths and a semantic categorization. The experimental results illustrate that iSEARLE obtains state-of-the-art performance on three different CIR datasets -- FashionIQ, CIRR, and the proposed CIRCO -- and two additional evaluation settings, namely domain conversion and object composition. The dataset, the code, and the model are publicly available at this https URL.
https://arxiv.org/abs/2405.02951
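In spirit, the textual inversion step above maps the reference image into a pseudo-word token in the text-embedding space and composes it with the relative caption to form the retrieval query. The toy sketch below uses random vectors as stand-ins for CLIP embeddings and a fixed random matrix for the learned mapping network; every name and dimension is a hypothetical placeholder (a real system would re-encode a full "a photo of S* ..." sentence through the CLIP text encoder).

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-ins for CLIP embeddings (in reality produced by the CLIP encoders).
reference_image = rng.normal(size=dim)
caption = rng.normal(size=dim)

# Textual inversion: project the image into the token-embedding space.
# A fixed random matrix stands in for the learned mapping network here.
phi = rng.normal(scale=0.3, size=(dim, dim))
pseudo_token = reference_image @ phi

# Compose pseudo-token and relative caption into one query embedding.
query = normalize(pseudo_token + caption)

# Retrieve the gallery image with the highest cosine similarity.
gallery = normalize(rng.normal(size=(5, dim)))
scores = gallery @ query
best = int(np.argmax(scores))
print(best, round(float(scores[best]), 3))
```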
Multi-intent natural language understanding (NLU) presents a formidable challenge due to the model confusion arising from multiple intents within a single utterance. While previous works train the model contrastively to increase the margin between different multi-intent labels, they are less suited to the nuances of multi-intent NLU. They ignore the rich information between the shared intents, which is beneficial to constructing a better embedding space, especially in low-data scenarios. We introduce a two-stage Prediction-Aware Contrastive Learning (PACL) framework for multi-intent NLU to harness this valuable knowledge. Our approach capitalizes on shared intent information by integrating word-level pre-training and prediction-aware contrastive fine-tuning. We construct a pre-training dataset using a word-level data augmentation strategy. Subsequently, our framework dynamically assigns roles to instances during contrastive fine-tuning while introducing a prediction-aware contrastive loss to maximize the impact of contrastive learning. We present experimental results and empirical analysis conducted on three widely used datasets, demonstrating that our method surpasses the performance of three prominent baselines on both low-data and full-data scenarios.
https://arxiv.org/abs/2405.02925
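The contrastive fine-tuning stage above builds on a standard InfoNCE-style objective; the sketch below shows only that generic loss on L2-normalized embeddings, with a "shared-intent" neighbour as the positive. The prediction-aware weighting and the dynamic role assignment that PACL adds are not reproduced here, and all vectors are random illustrations.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE contrastive loss on L2-normalized embeddings."""
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    a, p, n = norm(anchor), norm(positive), norm(negatives)
    logits = np.concatenate([[a @ p], n @ a]) / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # positive sits at index 0

rng = np.random.default_rng(2)
anchor = rng.normal(size=16)
positive = anchor + 0.05 * rng.normal(size=16)   # shared-intent neighbour
negatives = rng.normal(size=(8, 16))
loss = info_nce(anchor, positive, negatives)
print(round(float(loss), 4))
```

Minimizing this loss pulls utterances with shared intents together while pushing apart the negatives, which is the embedding-space effect the abstract describes.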
Deep learning-based image restoration methods have achieved promising performance. However, how to faithfully preserve the structure of the original image remains challenging. To address this challenge, we propose a novel Residual-Conditioned Optimal Transport (RCOT) approach, which models the image restoration as an optimal transport (OT) problem for both unpaired and paired settings, integrating the transport residual as a unique degradation-specific cue for both the transport cost and the transport map. Specifically, we first formalize a Fourier residual-guided OT objective by incorporating the degradation-specific information of the residual into the transport cost. Based on the dual form of the OT formulation, we design the transport map as a two-pass RCOT map that comprises a base model and a refinement process, in which the transport residual is computed by the base model in the first pass and then encoded as a degradation-specific embedding to condition the second-pass restoration. By duality, the RCOT problem is transformed into a minimax optimization problem, which can be solved by adversarially training neural networks. Extensive experiments on multiple restoration tasks show the effectiveness of our approach in terms of both distortion measures and perceptual quality. Particularly, RCOT restores images with more faithful structural details compared to state-of-the-art methods.
https://arxiv.org/abs/2405.02843
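Schematically, the residual-conditioned OT problem described above has the shape below. This is a generic Monge-style sketch with an illustrative residual-dependent cost, not the paper's exact objective: the transport residual $r = x - T(x)$ enters the cost, here through a Fourier-domain term.

```latex
% Generic Monge OT objective with an illustrative residual-dependent cost:
% T pushes the degraded-image law \mu onto the clean-image law \nu, and
% r = x - T(x) is the transport residual; \mathcal{F} is the Fourier transform.
\min_{T \,:\, T_{\#}\mu = \nu} \;
  \mathbb{E}_{x \sim \mu}\!\left[ \|x - T(x)\|^{2}
  + \lambda \,\big\| \mathcal{F}\big(x - T(x)\big) \big\|_{1} \right]
```

By Kantorovich duality, such a constrained problem becomes a minimax problem over the map $T$ and a dual potential, which is what the adversarial training of the two networks solves.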
Person search aims to localize a specific target person from a gallery set of images with various scenes. As the scene of a moving pedestrian changes, the captured person image inevitably brings in a large amount of background noise and foreground noise on the person feature, which is completely unrelated to the person identity, leading to severe performance degradation. To address this issue, we present a Scene-Adaptive Person Search (SEAS) model by introducing bilateral modulations to simultaneously eliminate scene noise and maintain a consistent person representation to adapt to various scenes. In SEAS, a Background Modulation Network (BMN) is designed to encode the feature extracted from the detected bounding box into a multi-granularity embedding, which reduces the input of background noise from multiple levels in a norm-aware manner. Additionally, to mitigate the effect of foreground noise on the person feature, SEAS introduces a Foreground Modulation Network (FMN) to compute a clutter-reduction offset for the person embedding based on the feature map of the scene image. By bilateral modulations on both background and foreground in an end-to-end manner, SEAS obtains consistent feature representations without scene noise. SEAS achieves state-of-the-art (SOTA) performance on two benchmark datasets: CUHK-SYSU with 97.1% mAP and PRW with 60.5% mAP. The code is available at this https URL.
https://arxiv.org/abs/2405.02834
Deep learning has had a significant impact on the identification and classification of mineral resources, especially playing a key role in efficiently and accurately identifying different minerals, which is important for improving the efficiency and accuracy of mining. However, traditional ore sorting methods often suffer from inefficiency and lack of accuracy, especially in complex mineral environments. To address these challenges, this study proposes a method called OreYOLO, which incorporates an attentional mechanism and a multi-scale feature fusion strategy, based on ore data from gold and sulfide ores. By introducing the progressive feature pyramid structure into YOLOv5 and embedding the attention mechanism in the feature extraction module, the detection performance and accuracy of the model are greatly improved. In order to adapt to the diverse ore sorting scenarios and the deployment requirements of edge devices, the network structure is designed to be lightweight, achieving a low number of parameters (3.458M) and low computational complexity (6.3 GFLOPs) while maintaining high accuracy (99.3% and 99.2%, respectively). In the experimental part, a target detection dataset containing 6000 images of gold and sulfuric iron ore is constructed for classification training, and several sets of comparison experiments are set up, including the YOLO series, EfficientDet, Faster-RCNN, and CenterNet, etc.; the experiments show that OreYOLO outperforms these commonly used high-performance object detection architectures.
https://arxiv.org/abs/2405.02785
Network traffic analysis is fundamental for network management, troubleshooting, and security. Tasks such as traffic classification, anomaly detection, and novelty discovery are fundamental for extracting operational information from network data and measurements. We witness the shift from deep packet inspection and basic machine learning to Deep Learning (DL) approaches where researchers define and test a custom DL architecture designed for each specific problem. We here advocate the need for a general DL architecture flexible enough to solve different traffic analysis tasks. We test this idea by proposing a DL architecture based on generic data adaptation modules, followed by an integration module that summarises the extracted information into a compact and rich intermediate representation (i.e. embeddings). The result is a flexible Multi-modal Autoencoder (MAE) pipeline that can solve different use cases. We demonstrate the architecture with traffic classification (TC) tasks since they allow us to quantitatively compare results with state-of-the-art solutions. However, we argue that the MAE architecture is generic and can be used to learn representations useful in multiple scenarios. On TC, the MAE performs on par or better than alternatives while avoiding cumbersome feature engineering, thus streamlining the adoption of DL solutions for traffic analysis.
https://arxiv.org/abs/2405.02649
The complexity of black-box algorithms can lead to various challenges, including the introduction of biases. These biases present immediate risks in the algorithms' application. It was, for instance, shown that neural networks can deduce racial information solely from a patient's X-ray scan, a task beyond the capability of medical experts. If this fact is not known to the medical expert, automatic decision-making based on this algorithm could lead to prescribing a treatment (purely) based on racial information. While current methodologies allow for the "orthogonalization" or "normalization" of neural networks with respect to such information, existing approaches are grounded in linear models. Our paper advances the discourse by introducing corrections for non-linearities such as ReLU activations. Our approach also encompasses scalar and tensor-valued predictions, facilitating its integration into neural network architectures. Through extensive experiments, we validate our method's effectiveness in safeguarding sensitive data in generalized linear models, normalizing convolutional neural networks for metadata, and rectifying pre-existing embeddings for undesired attributes.
https://arxiv.org/abs/2405.02475
Traditional recommender systems such as matrix factorization methods rely on learning a shared dense embedding space to represent both items and user preferences. Sequence models such as RNN, GRUs, and, recently, Transformers have also excelled in the task of sequential recommendation. This task requires understanding the sequential structure present in users' historical interactions to predict the next item they may like. Building upon the success of Large Language Models (LLMs) in a variety of tasks, researchers have recently explored using LLMs that are pretrained on vast corpora of text for sequential recommendation. To use LLMs in sequential recommendations, both the history of user interactions and the model's prediction of the next item are expressed in text form. We propose CALRec, a two-stage LLM finetuning framework that finetunes a pretrained LLM in a two-tower fashion using a mixture of two contrastive losses and a language modeling loss: the LLM is first finetuned on a data mixture from multiple domains followed by another round of target domain finetuning. Our model significantly outperforms many state-of-the-art baselines (+37% in Recall@1 and +24% in NDCG@10) and systematic ablation studies reveal that (i) both stages of finetuning are crucial, and, when combined, we achieve improved performance, and (ii) contrastive alignment is effective among the target domains explored in our experiments.
https://arxiv.org/abs/2405.02429
Structured science summaries or research contributions using properties or dimensions beyond traditional keywords enhance science findability. Current methods, such as those used by the Open Research Knowledge Graph (ORKG), involve manually curating properties to describe research papers' contributions in a structured manner, but this is labor-intensive and inconsistent among domain-expert human curators. We propose using Large Language Models (LLMs) to automatically suggest these properties. However, it's essential to assess the readiness of LLMs like GPT-3.5, Llama 2, and Mistral for this task before application. Our study performs a comprehensive comparative analysis between ORKG's manually curated properties and those generated by the aforementioned state-of-the-art LLMs. We evaluate LLM performance through four unique perspectives: semantic alignment and deviation with ORKG properties, fine-grained property mapping accuracy, SciNCL embeddings-based cosine similarity, and expert surveys comparing manual annotations with LLM outputs. These evaluations occur within a multidisciplinary science setting. Overall, LLMs show potential as recommendation systems for structuring science, but further finetuning is recommended to improve their alignment with scientific tasks and mimicry of human expertise.
https://arxiv.org/abs/2405.02105
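The embeddings-based perspective above boils down to cosine similarity between a manually curated property's embedding and an LLM-generated one. A minimal sketch, with random vectors standing in for actual SciNCL embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(3)
manual_property = rng.normal(size=32)     # stand-in for a SciNCL embedding
llm_property = manual_property + 0.1 * rng.normal(size=32)
sim = cosine_similarity(manual_property, llm_property)
print(round(sim, 3))
```

A similarity close to 1 indicates that the LLM-suggested property is semantically aligned with the curated one.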
Stability and reliable operation under a spectrum of environmental conditions is still an open challenge for soft and continuum style manipulators. The inability to carry sufficient load and effectively reject external disturbances are two drawbacks which limit the scale of continuum designs, preventing widespread adoption of this technology. To tackle these problems, this work details the design and experimental testing of a modular, tendon-driven bead-style continuum manipulator with tunable stiffness. By embedding the ability to independently control the stiffness of distinct sections of the structure, the manipulator can regulate its posture under greater loads of up to 1 kg at the end-effector, with reference to the flexible state. Likewise, an internal routing scheme vastly improves the stability of the proximal segment when operating the distal segment, reducing deviations by at least 70.11%. Operation is validated when gravity is both tangential and perpendicular to the manipulator backbone, a feature uncommon in previous designs. The findings presented in this work are key to the development of larger scale continuum designs, demonstrating that flexibility and tip stability under loading can co-exist without compromise.
https://arxiv.org/abs/2405.01925
The landscape of information retrieval has broadened from search services to a critical component in various advanced applications, where indexing efficiency, cost-effectiveness, and freshness are increasingly important yet remain less explored. To address these demands, we introduce Semi-parametric Vocabulary Disentangled Retrieval (SVDR). SVDR is a novel semi-parametric retrieval framework that supports two types of indexes: an embedding-based index for high effectiveness, akin to existing neural retrieval methods; and a binary token index that allows for quick and cost-effective setup, resembling traditional term-based retrieval. In our evaluation on three open-domain question answering benchmarks with the entire Wikipedia as the retrieval corpus, SVDR consistently demonstrates superiority. It achieves a 3% higher top-1 retrieval accuracy compared to the dense retriever DPR when using an embedding-based index and a 9% higher top-1 accuracy compared to BM25 when using a binary token index. Specifically, the adoption of a binary token index reduces index preparation time from 30 GPU hours to just 2 CPU hours and storage size from 31 GB to 2 GB, achieving a 90% reduction compared to an embedding-based index.
https://arxiv.org/abs/2405.01924
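The binary token index contrasted with the embedding-based index above can be approximated as a token-presence bitmap over the corpus, scored by overlap with the query's tokens. A toy sketch (whitespace tokenization and all document strings are illustrative, not SVDR's actual indexing code):

```python
import numpy as np

docs = [
    "neural retrieval with dense embeddings",
    "wikipedia is the retrieval corpus",
    "binary token index for cheap setup",
]

# Build the vocabulary and a binary document-token incidence matrix.
vocab = sorted({tok for doc in docs for tok in doc.split()})
tok2id = {tok: i for i, tok in enumerate(vocab)}
index = np.zeros((len(docs), len(vocab)), dtype=np.uint8)
for d, doc in enumerate(docs):
    for tok in doc.split():
        index[d, tok2id[tok]] = 1

def search(query):
    """Score documents by how many query tokens they contain."""
    q = np.zeros(len(vocab), dtype=np.int64)
    for tok in query.split():
        if tok in tok2id:
            q[tok2id[tok]] = 1
    return index @ q

scores = search("binary token retrieval")
print(scores, int(np.argmax(scores)))
```

Because the index stores only bits rather than dense float vectors, setup needs no GPU and storage stays small, which is the cost advantage the abstract quantifies.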
Large language models (LLMs) increasingly serve as the backbone for classifying text associated with distinct domains and simultaneously several labels (classes). When encountering domain shifts, e.g., classifier of movie reviews from IMDb to Rotten Tomatoes, adapting such an LLM-based multi-label classifier is challenging due to incomplete label sets at the target domain and daunting training overhead. The existing domain adaptation methods address either image multi-label classifiers or text binary classifiers. In this paper, we design DALLMi, Domain Adaptation Large Language Model interpolator, a first-of-its-kind semi-supervised domain adaptation method for text data models based on LLMs, specifically BERT. The core of DALLMi is the novel variation loss and MixUp regularization, which jointly leverage the limited positively labeled and large quantity of unlabeled text and, importantly, their interpolation from the BERT word embeddings. DALLMi also introduces a label-balanced sampling strategy to overcome the imbalance between labeled and unlabeled data. We evaluate DALLMi against the partial-supervised and unsupervised approach on three datasets under different scenarios of label availability for the target domain. Our results show that DALLMi achieves higher mAP than unsupervised and partially-supervised approaches by 19.9% and 52.2%, respectively.
https://arxiv.org/abs/2405.01883
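The MixUp regularization mentioned above interpolates between examples; applied to word embeddings it reduces to a convex combination of the embedding sequences and their (soft) multi-label targets. A minimal sketch with random stand-ins for BERT embeddings and a hypothetical pseudo-label for the unlabeled text (shapes and the Beta parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-ins for BERT word-embedding sequences of two texts (seq_len, dim).
emb_labeled = rng.normal(size=(10, 32))
emb_unlabeled = rng.normal(size=(10, 32))
y_labeled = np.array([1.0, 0.0, 1.0])    # multi-label target
y_guess = np.array([0.0, 1.0, 1.0])      # pseudo-label for the unlabeled text

# MixUp: convex combination with lambda drawn from a Beta distribution.
lam = rng.beta(0.4, 0.4)
emb_mixed = lam * emb_labeled + (1 - lam) * emb_unlabeled
y_mixed = lam * y_labeled + (1 - lam) * y_guess
print(round(float(lam), 3), emb_mixed.shape)
```

Training on such interpolated pairs lets the scarce positive labels regularize the large pool of unlabeled target-domain text.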
Healthcare monitoring is crucial, especially for the daily care of elderly individuals living alone. It can detect dangerous occurrences, such as falls, and provide timely alerts to save lives. Non-invasive millimeter wave (mmWave) radar-based healthcare monitoring systems using advanced human activity recognition (HAR) models have recently gained significant attention. However, they encounter challenges in handling sparse point clouds, achieving real-time continuous classification, and coping with limited monitoring ranges when statically mounted. To overcome these limitations, we propose RobHAR, a movable robot-mounted mmWave radar system with lightweight deep neural networks for real-time monitoring of human activities. Specifically, we first propose a sparse point cloud-based global embedding to learn the features of point clouds using the light-PointNet (LPN) backbone. Then, we learn the temporal pattern with a bidirectional lightweight LSTM model (BiLiLSTM). In addition, we implement a transition optimization strategy, integrating the Hidden Markov Model (HMM) with Connectionist Temporal Classification (CTC) to improve the accuracy and robustness of the continuous HAR. Our experiments on three datasets indicate that our method significantly outperforms the previous studies in both discrete and continuous HAR tasks. Finally, we deploy our system on a movable robot-mounted edge computing platform, achieving flexible healthcare monitoring in real-world scenarios.
https://arxiv.org/abs/2405.01882
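The HMM-based transition optimization above can be thought of as Viterbi decoding over per-frame class posteriors with a "sticky" transition matrix that penalizes implausible activity switches. A generic sketch, with illustrative probabilities rather than the paper's trained values:

```python
import numpy as np

def viterbi(frame_logprobs, trans_logprobs):
    """Most likely state sequence given per-frame log-posteriors."""
    T, K = frame_logprobs.shape
    score = frame_logprobs[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans_logprobs       # (K prev, K next)
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(K)] + frame_logprobs[t]
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Noisy per-frame posteriors for 3 activities, with a flicker at t=2.
probs = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.3, 0.6, 0.1],   # spurious flicker toward class 1
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
])
# Sticky transitions: staying in a state is much likelier than switching.
trans = np.log(np.array([[0.90, 0.05, 0.05],
                         [0.05, 0.90, 0.05],
                         [0.05, 0.05, 0.90]]))
path = viterbi(np.log(probs), trans)
print(path)
```

The decoded path suppresses the single-frame flicker that a frame-wise argmax would keep, which is how the transition model improves the robustness of continuous HAR.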
The task of steel surface defect recognition is an industrial problem with great industry value. Data insufficiency is the major challenge in training a robust defect recognition network. Existing methods have investigated enlarging the dataset by generating samples with generative models. However, their generation quality is still limited by the insufficiency of defect image samples. To this end, we propose Stable Surface Defect Generation (StableSDG), which transfers the vast generation distribution embedded in the Stable Diffusion model to steel surface defect image generation. To tackle the distinctive distribution gap between steel surface images and images generated by the diffusion model, we propose two processes. First, we align the distributions by adapting the parameters of the diffusion model, in both the token embedding space and the network parameter space. Besides, in the generation process, we propose image-oriented generation rather than generation from pure Gaussian noise. We conduct extensive experiments on a steel surface defect dataset, demonstrating state-of-the-art performance in generating high-quality samples and training recognition models; both designed processes are significant for this performance.
https://arxiv.org/abs/2405.01872
Unmanned Aerial Vehicles (UAVs) have emerged as a transformative technology across diverse sectors, offering adaptable solutions to complex challenges in both military and civilian domains. Their expanding capabilities present a platform for further advancement by integrating cutting-edge computational tools like Artificial Intelligence (AI) and Machine Learning (ML) algorithms. These advancements have significantly impacted various facets of human life, fostering an era of unparalleled efficiency and convenience. Large Language Models (LLMs), a key component of AI, exhibit remarkable learning and adaptation capabilities within deployed environments, demonstrating an evolving form of intelligence with the potential to approach human-level proficiency. This work explores the significant potential of integrating UAVs and LLMs to propel the development of autonomous systems. We comprehensively review LLM architectures, evaluating their suitability for UAV integration. Additionally, we summarize the state-of-the-art LLM-based UAV architectures and identify novel opportunities for LLM embedding within UAV frameworks. Notably, we focus on leveraging LLMs to refine data analysis and decision-making processes, specifically for enhanced spectral sensing and sharing in UAV applications. Furthermore, we investigate how LLM integration expands the scope of existing UAV applications, enabling autonomous data processing, improved decision-making, and faster response times in emergency scenarios like disaster response and network restoration. Finally, we highlight crucial areas for future research that are critical for facilitating the effective integration of LLMs and UAVs.
https://arxiv.org/abs/2405.01745
Image and multimodal machine learning tasks are very challenging to solve in the case of poorly distributed data. In particular, data availability and privacy restrictions exacerbate these hurdles in the medical domain. The state of the art in image generation quality is held by Latent Diffusion models, making them prime candidates for tackling this problem. However, a few key issues still need to be solved, such as the difficulty in generating data from under-represented classes and a slow inference process. To mitigate these issues, we propose a new method for image augmentation in long-tailed data based on leveraging the rich latent space of pre-trained Stable Diffusion Models. We create a modified separable latent space to mix head and tail class examples. We build this space via Iterated Learning of underlying sparsified embeddings, which we apply to task-specific saliency maps via a K-NN approach. Code is available at this https URL
https://arxiv.org/abs/2405.01705
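The abstract above describes mixing head- and tail-class examples in a separable latent space via a K-NN approach. The paper's actual mechanism (Iterated Learning over sparsified embeddings with task-specific saliency maps) is more involved; the following is only a minimal conceptual sketch of the K-NN latent-mixing idea, with all function names and the interpolation scheme being illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def knn_tail_neighbors(head_latent, tail_latents, k=3):
    """Return indices of the k tail-class latents closest to a head-class latent."""
    dists = np.linalg.norm(tail_latents - head_latent, axis=1)
    return np.argsort(dists)[:k]

def mix_latents(head_latent, tail_latents, k=3, alpha=0.5):
    """Interpolate a head-class latent toward its k nearest tail-class latents,
    yielding synthetic latents that borrow structure from under-represented classes.
    Linear interpolation is a stand-in for the paper's learned mixing space."""
    idx = knn_tail_neighbors(head_latent, tail_latents, k)
    return np.stack([(1 - alpha) * head_latent + alpha * tail_latents[i] for i in idx])

# Toy example with 4-dimensional latents standing in for Stable Diffusion latents.
rng = np.random.default_rng(0)
head = rng.normal(size=4)
tails = rng.normal(size=(10, 4))
mixed = mix_latents(head, tails, k=3, alpha=0.5)
```

In a real pipeline the mixed latents would be decoded by the pre-trained Stable Diffusion decoder to produce augmented images for the tail classes.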
Small object detection in aerial imagery presents significant challenges in computer vision due to the minimal data inherent in small-sized objects and their propensity to be obscured by larger objects and background noise. Traditional methods using transformer-based models often face limitations stemming from the lack of specialized databases, which adversely affect their performance with objects of varying orientations and scales. This underscores the need for more adaptable, lightweight models. In response, this paper introduces two innovative approaches that significantly enhance detection and segmentation capabilities for small aerial objects. Firstly, we explore the use of the SAHI framework on the newly introduced lightweight YOLO v9 architecture, which utilizes Programmable Gradient Information (PGI) to reduce the substantial information loss typically encountered in sequential feature extraction processes. Secondly, the paper employs the Vision Mamba model, which incorporates position embeddings to facilitate precise location-aware visual understanding, combined with a novel bidirectional State Space Model (SSM) for effective visual context modeling. This State Space Model adeptly harnesses the linear complexity of CNNs and the global receptive field of Transformers, making it particularly effective in remote sensing image classification. Our experimental results demonstrate substantial improvements in detection accuracy and processing efficiency, validating the applicability of these approaches for real-time small object detection across diverse aerial scenarios. This paper also discusses how these methodologies could serve as foundational models for future advancements in aerial object recognition technologies. The source code will be made accessible here.
https://arxiv.org/abs/2405.01699
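The core idea of slicing-aided inference (as in the SAHI framework mentioned above) is to split a large aerial image into overlapping tiles, run the detector on each tile, and map detections back to full-image coordinates so small objects occupy a larger fraction of each inference window. A minimal sketch of the tiling geometry, with parameter names chosen for illustration rather than taken from the SAHI API:

```python
def make_slices(img_w, img_h, slice_size=512, overlap=0.2):
    """Compute (x, y, w, h) windows tiling an image with the given overlap ratio."""
    step = int(slice_size * (1 - overlap))
    xs = list(range(0, max(img_w - slice_size, 0) + 1, step))
    ys = list(range(0, max(img_h - slice_size, 0) + 1, step))
    # Ensure the right and bottom edges are covered by a final window.
    if xs[-1] + slice_size < img_w:
        xs.append(img_w - slice_size)
    if ys[-1] + slice_size < img_h:
        ys.append(img_h - slice_size)
    return [(x, y, slice_size, slice_size) for y in ys for x in xs]

def shift_boxes(boxes, x_off, y_off):
    """Map per-slice detections (x, y, w, h) back to full-image coordinates."""
    return [(x + x_off, y + y_off, w, h) for (x, y, w, h) in boxes]

# A 1024x1024 image sliced into 512px windows with 50% overlap yields a 3x3 grid.
slices = make_slices(1024, 1024, slice_size=512, overlap=0.5)
```

In a full pipeline, a detector (e.g., YOLO v9) would run on each window, and the shifted boxes from all windows would be merged with non-maximum suppression.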
The study of privacy-preserving Natural Language Processing (NLP) has gained rising attention in recent years. One promising avenue studies the integration of Differential Privacy in NLP, which has brought about innovative methods in a variety of application settings. Of particular note are $\textit{word-level Metric Local Differential Privacy (MLDP)}$ mechanisms, which work to obfuscate potentially sensitive input text by performing word-by-word $\textit{perturbations}$. Although these methods have shown promising results in empirical tests, there are two major drawbacks: (1) the inevitable loss of utility due to addition of noise, and (2) the computational expensiveness of running these mechanisms on high-dimensional word embeddings. In this work, we aim to address these challenges by proposing $\texttt{1-Diffractor}$, a new mechanism that boasts high speedups in comparison to previous mechanisms, while still demonstrating strong utility- and privacy-preserving capabilities. We evaluate $\texttt{1-Diffractor}$ for utility on several NLP tasks, for theoretical and task-based privacy, and for efficiency in terms of speed and memory. $\texttt{1-Diffractor}$ shows significant improvements in efficiency, while still maintaining competitive utility and privacy scores across all conducted comparative tests against previous MLDP mechanisms. Our code is made available at: this https URL.
https://arxiv.org/abs/2405.01678
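The word-level MLDP mechanisms described above obfuscate text by perturbing each word independently: noise calibrated by the privacy parameter is added in embedding space, and the noisy vector is decoded to the nearest vocabulary word. A minimal sketch of this perturb-then-decode loop; it uses elementwise Laplace noise as a simple stand-in for the multivariate mechanisms used in practice, and is not the 1-Diffractor mechanism itself:

```python
import numpy as np

def perturb_word(word, vocab, embeddings, epsilon, rng):
    """Perturb one word: add noise scaled by 1/epsilon to its embedding,
    then decode to the nearest vocabulary word. Smaller epsilon = more noise."""
    vec = embeddings[vocab.index(word)]
    noisy = vec + rng.laplace(scale=1.0 / epsilon, size=vec.shape)
    dists = np.linalg.norm(embeddings - noisy, axis=1)
    return vocab[int(np.argmin(dists))]

def perturb_sentence(sentence, vocab, embeddings, epsilon, rng):
    """Apply the word-by-word mechanism to a whitespace-tokenized sentence."""
    return [perturb_word(w, vocab, embeddings, epsilon, rng)
            for w in sentence.split()]

# Toy 2-dimensional vocabulary for illustration.
rng = np.random.default_rng(0)
vocab = ["cat", "dog", "car"]
emb = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
private = perturb_sentence("cat dog", vocab, emb, epsilon=1.0, rng=rng)
```

The nearest-neighbor decode over the full embedding matrix is exactly the high-dimensional cost the abstract points to: the efficiency gains of 1-Diffractor come from avoiding this expensive search, by a mechanism not reproduced in this sketch.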