Hand-crafted image quality metrics, such as PSNR and SSIM, are commonly used to evaluate a model's privacy risk under reconstruction attacks. Under these metrics, reconstructed images judged to resemble the original generally indicate more privacy leakage, whereas images judged dissimilar overall indicate higher robustness against the attack. However, there is no guarantee that these metrics reflect human opinion well, and human judgment is the more trustworthy measure of model privacy leakage. In this paper, we comprehensively study how faithfully these hand-crafted metrics reflect human perception of the private information in reconstructed images. On 5 datasets ranging from natural images and faces to fine-grained classes, we use 4 existing attack methods to reconstruct images from many different classification models, and for each reconstructed image we ask multiple human annotators to assess whether it is recognizable. Our studies reveal that the hand-crafted metrics correlate only weakly with human evaluation of privacy leakage, and that even these metrics themselves often contradict one another. These observations point to risks in the metrics the community currently uses. To address this risk, we propose a learning-based measure called SemSim to evaluate the Semantic Similarity between original and reconstructed images. SemSim is trained with a standard triplet loss, using an original image as the anchor, one of its recognizable reconstructions as the positive sample, and an unrecognizable one as the negative. Because it is trained on human annotations, SemSim better reflects privacy leakage at the semantic level. We show that SemSim correlates significantly more strongly with human judgment than existing metrics do. Moreover, this strong correlation generalizes to unseen datasets, models, and attack methods.
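The triplet objective described above can be illustrated with a minimal sketch. It assumes Euclidean distance between embeddings and a margin of 1.0. The embeddings are hypothetical hand-written vectors standing in for the output of a learned encoder, and the SemSim network itself is not reproduced.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    anchor:   embedding of the original image
    positive: embedding of a reconstruction humans judged recognizable
    negative: embedding of a reconstruction humans judged unrecognizable
    margin:   assumed value; the paper's setting is not given here
    """
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)

# Hypothetical 3-d embeddings (stand-ins for a learned encoder's outputs).
anchor = [0.0, 0.0, 0.0]
recognizable = [0.3, 0.4, 0.0]    # distance 0.5 from the anchor
unrecognizable = [3.0, 4.0, 0.0]  # distance 5.0 from the anchor

print(triplet_loss(anchor, recognizable, unrecognizable))  # 0.0: already separated by more than the margin
print(triplet_loss(anchor, unrecognizable, recognizable))  # 5.5: ordering violated, so the loss is positive
```

Minimizing this loss pulls recognizable reconstructions toward the original in embedding space and pushes unrecognizable ones away. A distance in that space can then serve as a learned similarity score.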
https://arxiv.org/abs/2309.13038
PyPose is an open-source library for robot learning. It combines learning-based approaches with physics-based optimization, enabling seamless end-to-end robot learning. Thanks to its carefully designed application programming interface (API) and efficient implementation, it has been used in many tasks. Since its initial launch in early 2022, PyPose has been significantly enhanced with a wide variety of new features. To meet the growing demand for understanding and using the library, and to flatten the learning curve for new users, we present the fundamental design principle of its imperative programming interface and showcase the flexible use of its diverse functionalities and modules through an extremely simple Dubins car example. We also demonstrate that PyPose can easily be used to navigate a real quadruped robot with a few lines of code.
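To make the Dubins car example concrete, here is a minimal sketch of its kinematics in plain Python: x' = v cos θ, y' = v sin θ, θ' = ω, integrated with Euler steps. This is not PyPose's API. The library works with tensors and its own system and optimizer modules, none of which are reproduced here. The function names and step size are assumptions for illustration.

```python
import math

def dubins_step(state, v, omega, dt):
    """Advance a Dubins car by one explicit Euler step.

    state: (x, y, theta), with theta in radians
    v:     forward speed (constant and non-negative in the Dubins model)
    omega: turn rate, bounded in the classic model by v / min_turning_radius
    """
    x, y, theta = state
    return (x + v * math.cos(theta) * dt,
            y + v * math.sin(theta) * dt,
            theta + omega * dt)

def rollout(state, v, omegas, dt):
    """Apply a sequence of turn-rate commands and return all visited states."""
    states = [state]
    for omega in omegas:
        state = dubins_step(state, v, omega, dt)
        states.append(state)
    return states

# Drive straight for 10 steps of 0.1 s at 1 m/s: the car ends roughly 1 m ahead.
traj = rollout((0.0, 0.0, 0.0), v=1.0, omegas=[0.0] * 10, dt=0.1)
print(traj[-1])
```

In a differentiable library, the same transition would be written with tensors so that gradients flow through the rollout to the control inputs.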
https://arxiv.org/abs/2309.13035
Precise crop yield prediction is essential for improving agricultural practices and ensuring crop resilience in varying climates. Integrating weather data across the growing season, especially for different crop varieties, is crucial for understanding their adaptability to climate change. In the MLCAS2021 Crop Yield Prediction Challenge, we used a dataset of 93,028 training records to forecast yields for 10,337 test records, covering 159 locations across 28 U.S. states and Canadian provinces over 13 years (2003-2015). The dataset included details on 5,838 distinct genotypes and daily weather data for a 214-day growing season, enabling comprehensive analysis. As one of the winning teams, we developed two novel convolutional neural network (CNN) architectures: the CNN-DNN model, which combines a CNN with fully connected networks, and the CNN-LSTM-DNN model, which adds an LSTM layer for the weather variables. Using the Generalized Ensemble Method (GEM), we determined optimal model weights, yielding better performance than the baseline models. On the test data, the GEM model achieved lower RMSE (by 5.55% to 39.88%), lower MAE (by 5.34% to 43.76%), and higher correlation coefficients (by 1.1% to 10.79%). We applied the CNN-DNN model to identify top-performing genotypes for various locations and weather conditions, aiding genotype selection based on weather variables. Our data-driven approach is valuable when only a limited number of testing years is available. Additionally, a feature importance analysis based on RMSE change highlighted the significance of location, MG, year, and genotype, as well as of the weather variables MDNI and AP.
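The ensemble weighting step can be sketched with the classic closed-form generalized ensemble weights: w = C⁻¹1 / (1ᵀC⁻¹1), where C is the correlation matrix of the models' validation errors. Treat this as an assumption. The authors may have solved a constrained optimization instead, and this version allows negative weights. The toy predictions are made up for illustration.

```python
import numpy as np

def gem_weights(preds, y):
    """Closed-form generalized-ensemble weights (Perrone-Cooper style, assumed here).

    preds: array of shape (n_models, n_samples) with validation predictions
    y:     array of shape (n_samples,) with observed yields
    Returns weights that sum to 1 and minimize the ensemble MSE when
    weights may be negative (no non-negativity constraint is enforced).
    """
    errors = preds - y                         # misfit of each model
    C = errors @ errors.T / errors.shape[1]    # error correlation matrix
    ones = np.ones(C.shape[0])
    c_inv_1 = np.linalg.solve(C, ones)         # C^{-1} 1 without forming the inverse
    return c_inv_1 / (ones @ c_inv_1)

# Toy example with two models whose errors are uncorrelated (hypothetical data).
y = np.array([1.0, 2.0, 3.0, 4.0])
preds = np.array([[2.0, 1.0, 3.0, 4.0],    # model A: small errors
                  [1.0, 2.0, 5.0, 2.0]])   # model B: larger errors
w = gem_weights(preds, y)
ensemble = w @ preds                       # weighted combination of the models
print(w)  # approximately [0.8, 0.2]: the more accurate model gets more weight
```

With uncorrelated errors, each weight is proportional to the inverse of that model's error variance. This is why the blend here has a lower MSE (0.4) than either model alone (0.5 and 2.0).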
https://arxiv.org/abs/2309.13021
Recent advances in Large Language Models (LLMs) have enabled the generation of open-ended, high-quality texts that are difficult to distinguish from human-written texts. We refer to such LLM-generated texts as \emph{deepfake texts}. There are currently over 11K text generation models in the Hugging Face model repository. As a result, users with malicious intent can easily use these open-source LLMs to generate harmful texts and misinformation at scale. To mitigate this problem, we need a computational method that determines whether a given text is a deepfake text, i.e., a Turing Test (TT). In this work, we investigate a more general version of the problem, known as \emph{Authorship Attribution (AA)}, in a multi-class setting: not only determining whether a given text is a deepfake text but also pinpointing which LLM authored it. We propose \textbf{TopRoBERTa}, which improves existing AA solutions by adding a Topological Data Analysis (TDA) layer to the RoBERTa model to capture more linguistic patterns in deepfake texts. We show the benefits of a TDA layer for noisy, imbalanced, and heterogeneous datasets, where TDA features extracted from the reshaped $pooled\_output$ of RoBERTa serve as input. We use RoBERTa to capture contextual representations (i.e., semantic and syntactic linguistic features) and TDA to capture the shape and structure of the data (i.e., linguistic structures). Finally, \textbf{TopRoBERTa} outperforms vanilla RoBERTa on 2 of 3 datasets, achieving up to a 7\% increase in Macro F1 score.
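As a rough illustration of extracting TDA features from a reshaped pooled output, the sketch below computes 0-dimensional persistent homology. For a Vietoris-Rips filtration, the death times of connected components are exactly the edge lengths that Kruskal's minimum spanning tree algorithm accepts. Several details are assumptions, not taken from the paper: the 24 × 32 reshape of a 768-d vector, the restriction to H0, and the summary features. The paper's TDA library and feature set may differ.

```python
import math
import random

def h0_persistence(points):
    """Death times of 0-dimensional features in a Vietoris-Rips filtration.

    Every point is born at scale 0. Two components merge (one dies) when the
    shortest edge joining them enters the filtration. Kruskal's algorithm with
    union-find yields exactly these merge distances.
    """
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)
    return deaths  # n - 1 finite deaths; one component never dies

def reshape(vec, rows):
    """Reshape a flat vector (e.g. a 768-d pooled output) into `rows` points."""
    cols = len(vec) // rows
    return [vec[r * cols:(r + 1) * cols] for r in range(rows)]

# Hypothetical stand-in for RoBERTa's 768-d pooled output.
random.seed(0)
pooled_output = [random.gauss(0.0, 1.0) for _ in range(768)]
deaths = h0_persistence(reshape(pooled_output, rows=24))
tda_features = [len(deaths), sum(deaths), max(deaths)]  # assumed summary statistics
```

Fixed-length statistics of the persistence diagram like these could then be concatenated with the transformer representation before the classification head.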
https://arxiv.org/abs/2309.12934
High-quality text embeddings are pivotal for improving semantic textual similarity (STS) tasks, which are crucial components of Large Language Model (LLM) applications. However, existing text embedding models commonly face the problem of vanishing gradients, primarily because their optimization objectives rely on the cosine function, which has saturation zones. To address this issue, this paper proposes a novel angle-optimized text embedding model called AnglE. The core idea of AnglE is to introduce angle optimization in a complex space. This approach effectively mitigates the adverse effects of the cosine function's saturation zones, which can impede gradient flow and hinder optimization. To set up a comprehensive STS evaluation, we experiment on existing short-text STS datasets and on a newly collected long-text STS dataset from GitHub Issues. Furthermore, we examine domain-specific STS scenarios with limited labeled data and explore how AnglE works with LLM-annotated data. Extensive experiments were conducted on various tasks, including short-text STS, long-text STS, and domain-specific STS. The results show that AnglE outperforms state-of-the-art (SOTA) STS models that ignore the cosine saturation zone. These findings demonstrate AnglE's ability to generate high-quality text embeddings and the usefulness of angle optimization in STS.
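The saturation problem follows from d/dθ cos θ = −sin θ, which vanishes as θ approaches 0 or π. The sketch below shows the complex-space alternative. Each embedding is split into real and imaginary halves, and the phase of z / w gives the angle difference between the two complex vectors. This is a simplified illustration under those assumptions. AnglE's published objective normalizes and combines these terms differently, and that exact loss is not reproduced here.

```python
import math

def angle_difference(u, v):
    """Total absolute phase difference between two embeddings viewed as complex vectors.

    The first half of each vector is treated as the real part and the second
    half as the imaginary part (an assumed layout). For z = a + bi and
    w = c + di, the phase of z / w equals the phase difference of z and w.
    """
    if len(u) != len(v) or len(u) % 2:
        raise ValueError("embeddings must have equal, even length")
    half = len(u) // 2
    total = 0.0
    for a, b, c, d in zip(u[:half], u[half:], v[:half], v[half:]):
        # (a + bi) / (c + di) = ((ac + bd) + (bc - ad)i) / (c^2 + d^2);
        # the positive denominator does not change the phase, so it is dropped.
        re = a * c + b * d
        im = b * c - a * d
        total += abs(math.atan2(im, re))
    return total

# z = 1 and w = i differ by a quarter turn.
print(angle_difference([1.0, 0.0], [0.0, 1.0]))  # pi / 2
# Near theta = 0 the cosine gradient -sin(theta) is almost zero (saturation).
print(-math.sin(1e-3))
```

Unlike the cosine, the phase returned by `atan2` changes linearly with the angle near 0, so small differences between similar texts still produce a usable signal.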
https://arxiv.org/abs/2309.12871
Sequential recommendation (SRS) has recently become the technical foundation of many applications; it aims to recommend the next item based on a user's historical interactions. However, sequential recommendation often suffers from data sparsity, a problem that is widespread in recommender systems. Moreover, most users interact with only a few items, yet existing SRS models often perform poorly for these users. This problem, known as the long-tail user problem, remains unresolved. Data augmentation is a direct way to alleviate both problems, but existing augmentation methods often require elaborately designed training strategies or are hindered by low-quality generated interactions. To address these issues, we propose Diffusion Augmentation for Sequential Recommendation (DiffuASR) for higher-quality generation. The dataset augmented by DiffuASR can be used to train sequential recommendation models directly, without complex training procedures. To make the best use of the diffusion model's generative ability, we first propose a diffusion-based pseudo sequence generation framework to bridge the gap between image generation and sequence generation. Then, we design a sequential U-Net that adapts the diffusion noise-prediction model U-Net to the discrete sequence generation task. Finally, we develop two guide strategies that align the preferences of the generated sequences with those of the original sequences. To validate DiffuASR, we conduct extensive experiments on three real-world datasets with three sequential recommendation models. The experimental results demonstrate the effectiveness of DiffuASR. To the best of our knowledge, DiffuASR is one of the first works to introduce diffusion models into recommendation.
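To show how a continuous diffusion model can be connected to discrete items, here is a minimal sketch of two generic pieces. The first is the standard DDPM forward process applied to item embeddings, x_t = √ᾱ_t x_0 + √(1 − ᾱ_t) ε. The second maps a continuous vector back to the nearest item embedding. The linear β schedule values are common defaults and assumed here. DiffuASR's sequential U-Net denoiser and its guide strategies are not reproduced.

```python
import math
import random

def linear_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s) for a linear beta schedule (T >= 2)."""
    if T < 2:
        raise ValueError("need at least two diffusion steps")
    alpha_bars, prod = [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def q_sample(x0, alpha_bar, rng):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
            for x in x0]

def round_to_item(vec, item_table):
    """Map a continuous vector back to the item whose embedding is nearest."""
    return min(item_table, key=lambda item: math.dist(vec, item_table[item]))

# Hypothetical 2-d item embeddings.
items = {"item_a": [1.0, 0.0], "item_b": [0.0, 1.0]}
alpha_bars = linear_alpha_bars(1000)
noisy = q_sample(items["item_a"], alpha_bars[0], random.Random(0))
print(round_to_item(noisy, items))  # lightly noised item_a still rounds back to item_a
```

A trained denoiser would run this process in reverse, starting from pure noise. The rounding step then turns each generated embedding in a pseudo sequence into a concrete item ID.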
https://arxiv.org/abs/2309.12858
Eyebrows play a critical role in facial expression and appearance. Although the 3D digitization of faces is well explored, 3D eyebrow modeling has received less attention. In this work, we propose EMS, the first learning-based framework for single-view 3D eyebrow reconstruction. Following methods for scalp hair reconstruction, we represent the eyebrow as a set of fiber curves and cast reconstruction as a fiber-growing problem. Three modules are carefully designed: RootFinder first localizes the fiber root positions, which indicate where to grow; OriPredictor predicts an orientation field in 3D space to guide fiber growth; and FiberEnder determines when each fiber stops growing. OriPredictor directly borrows the method used in hair reconstruction. Given the differences between hair and eyebrows, RootFinder and FiberEnder are newly proposed. Specifically, to cope with the severe occlusion of root locations, we formulate root localization as a density map estimation task. Given the predicted density map, a density-based clustering method is then used to find the roots. Each fiber grows step by step from its root point until it ends, where each step is an oriented line segment of constant length that follows the predicted orientation field. To determine when to stop, a pixel-aligned RNN architecture forms a binary classifier that outputs whether to stop at each growing step. To support the training of all proposed networks, we build the first 3D synthetic eyebrow dataset, containing 400 high-quality eyebrow models manually created by artists. Extensive experiments demonstrate the effectiveness of the proposed EMS pipeline on a variety of eyebrow styles and lengths, ranging from short and sparse to long and bushy.
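The growth procedure described above (fixed-length steps along an orientation field until a stop decision) can be sketched as a simple loop. Here `orientation_at` and `should_stop` are hypothetical callables standing in for OriPredictor and the RNN-based FiberEnder. The step length and step cap are illustrative assumptions.

```python
import math

def grow_fiber(root, orientation_at, should_stop, step_length=0.5, max_steps=100):
    """Grow one fiber from its root by constant-length steps along an orientation field.

    root:           (x, y, z) start point, e.g. from a root detector
    orientation_at: hypothetical callable mapping a 3D point to a direction vector
    should_stop:    hypothetical callable taking the polyline so far and
                    returning True when the fiber should end
    """
    points = [tuple(root)]
    for _ in range(max_steps):
        direction = orientation_at(points[-1])
        norm = math.sqrt(sum(c * c for c in direction))
        if norm == 0.0:
            break  # undefined orientation: stop instead of dividing by zero
        step = tuple(step_length * c / norm for c in direction)
        points.append(tuple(p + s for p, s in zip(points[-1], step)))
        if should_stop(points):
            break
    return points

# Toy field pointing along +x; the stand-in stop rule ends the fiber after 4 steps.
fiber = grow_fiber((0.0, 0.0, 0.0),
                   orientation_at=lambda p: (1.0, 0.0, 0.0),
                   should_stop=lambda pts: len(pts) - 1 >= 4)
print(fiber[-1])  # (2.0, 0.0, 0.0)
```

Normalizing the direction before scaling keeps every segment at `step_length` regardless of the magnitude the orientation predictor outputs.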
https://arxiv.org/abs/2309.12787
Thanks to tokenization and vision transformer backbones, image data now benefits from a simple but effective self-supervised learning scheme built on masking and a self-reconstruction objective. Convolutional neural networks, another important and widely adopted architecture for image data, rely on contrastive learning for self-supervision but still struggle to benefit significantly from such a straightforward and general masking operation. In this work, we aim to ease the incorporation of masking, as an extra augmentation method, into the contrastive learning framework for convolutional neural networks. Prior work has discussed the additional but unwanted edges between masked and unmasked regions and other adverse effects of masking on ConvNets. Beyond these, we identify a further problem: for one view of a contrastive sample pair, the randomly sampled masked regions may be overly concentrated on important/salient objects, producing misleading contrast with the other view. To this end, we explicitly incorporate a saliency constraint so that the masked regions are distributed more evenly between foreground and background in the masking-based augmentation. Moreover, we introduce hard negative samples by masking larger regions of salient patches in an input image. Extensive experiments on various datasets, contrastive learning mechanisms, and downstream tasks verify the efficacy and superior performance of our method compared with several state-of-the-art baselines.
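A minimal sketch of the two masking ideas follows. The first samples masked patches evenly from salient (foreground) and non-salient (background) patches. The second masks the most salient patches to build a hard negative. The per-patch saliency scores, the 0.5 threshold, and the patch counts are assumptions for illustration. How the paper computes saliency and sets mask sizes is not specified here.

```python
import random

def saliency_balanced_mask(saliency, num_masked, threshold=0.5, rng=None):
    """Pick patches to mask, split as evenly as possible between foreground and background.

    saliency:   per-patch saliency scores in [0, 1] (assumed input)
    num_masked: total number of patches to mask
    threshold:  assumed cut-off separating foreground from background
    Returns sorted indices of the masked patches.
    """
    rng = rng or random.Random()
    fg = [i for i, s in enumerate(saliency) if s >= threshold]
    bg = [i for i, s in enumerate(saliency) if s < threshold]
    n_fg = min(len(fg), num_masked // 2)
    n_bg = min(len(bg), num_masked - n_fg)
    n_fg = min(len(fg), num_masked - n_bg)  # top up from the foreground if the background ran short
    return sorted(rng.sample(fg, n_fg) + rng.sample(bg, n_bg))

def hard_negative_mask(saliency, num_masked):
    """Mask the most salient patches, hiding the object to form a hard negative view."""
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(order[:num_masked])

patch_saliency = [0.9, 0.8, 0.7, 0.1, 0.2, 0.3]  # hypothetical 6-patch image
print(saliency_balanced_mask(patch_saliency, 4, rng=random.Random(0)))  # 2 fg + 2 bg patches
print(hard_negative_mask(patch_saliency, 2))  # [0, 1]
```

Balancing the masked patches means neither view loses most of the object. Masking the salient patches on purpose does the opposite and yields a view that should not be treated as a positive.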
https://arxiv.org/abs/2309.12757
Most existing methods for unsupervised industrial anomaly detection train a separate model for each object category. This approach easily captures category-specific feature distributions but incurs high storage costs and low training efficiency. In this paper, we propose a unified mixed-attention autoencoder (MAAE) that performs multi-class anomaly detection with a single model. To alleviate the performance degradation caused by the diverse distribution patterns of different categories, we employ spatial and channel attention to capture global category information and to model the feature distributions of multiple classes. Furthermore, to simulate realistic noise on features and to preserve the surface semantics of objects from different categories, which are essential for detecting subtle anomalies, we propose an adaptive noise generator and a multi-scale fusion module for the pre-trained features. MAAE delivers remarkable performance on the benchmark dataset compared with state-of-the-art methods.
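For readers unfamiliar with the two attention types, the sketch below shows generic versions in NumPy. Channel attention is squeeze-and-excitation style: global pooling followed by a bottleneck MLP gate per channel. Spatial attention is simplified here to a sigmoid of the channel-averaged activation at each location. These are textbook-style assumptions, not MAAE's actual modules, whose exact design is not given in the abstract.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(features, w1, w2):
    """Squeeze-and-excitation style channel attention (generic, assumed design).

    features: array of shape (C, H, W)
    w1: (C // r, C) and w2: (C, C // r) weights of a two-layer bottleneck MLP
    Each channel is rescaled by a gate in (0, 1) computed from global context.
    """
    squeezed = features.mean(axis=(1, 2))          # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)        # ReLU
    gates = sigmoid(w2 @ hidden)                   # (C,)
    return features * gates[:, None, None]

def spatial_attention(features):
    """Simplified spatial attention: gate each location by its channel-averaged activation."""
    gate = sigmoid(features.mean(axis=0))          # (H, W)
    return features * gate[None, :, :]

# Hypothetical 8-channel feature map with reduction ratio r = 4.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16, 16))
out = spatial_attention(channel_attention(feats, rng.normal(size=(2, 8)), rng.normal(size=(8, 2))))
print(out.shape)  # (8, 16, 16)
```

Both gates only rescale features and never change their shape. This is what lets such blocks be inserted into an existing autoencoder without altering the surrounding layers.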
https://arxiv.org/abs/2309.12700
Surface defect inspection is a very challenging task because surface defects often have weak appearances or appear against complex backgrounds. Most high-accuracy defect detection methods incur expensive computation and storage overhead, making them less practical in resource-constrained defect detection applications. Although some lightweight methods achieve real-time inference with fewer parameters, their detection accuracy is poor in complex defect scenarios. To this end, we develop a Global Context Aggregation Network (GCANet) for lightweight saliency detection of surface defects, built on an encoder-decoder structure. First, we introduce a novel transformer encoder on the top layer of the lightweight backbone, which captures global context information through a novel Depth-wise Self-Attention (DSA) module. The proposed DSA computes element-wise similarity along the channel dimension while maintaining linear complexity. In addition, we introduce a novel Channel Reference Attention (CRA) module before each decoder block to strengthen the representation of multi-level features in the bottom-up path. The proposed CRA exploits the channel correlation between features at different layers to adaptively enhance feature representation. Experimental results on three public defect datasets demonstrate that the proposed network achieves a better trade-off between accuracy and running efficiency than 17 other state-of-the-art methods. Specifically, GCANet achieves competitive accuracy (91.79% $F_{\beta}^{w}$, 93.55% $S_\alpha$, and 97.35% $E_\phi$) on SD-saliency-900 while running at 272 FPS on a single GPU.
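One common way to get self-attention with complexity linear in the number of pixels is to attend across channels rather than positions. The attention map is then C × C instead of N × N, so the cost is O(N·C²). The NumPy sketch below shows this channel-wise ("transposed") attention. It is offered as an assumed analogue of DSA, since the abstract does not spell out DSA's exact formulation. The projection matrices are hypothetical inputs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def channel_self_attention(x, wq, wk, wv):
    """Self-attention over channels: the attention map is C x C, not N x N.

    x: (N, C) tokens, where N = H * W spatial positions
    wq, wk, wv: (C, C) projection matrices (hypothetical parameters)
    Cost is O(N * C^2), i.e. linear in the number of pixels N.
    """
    q, k, v = x @ wq, x @ wk, x @ wv                     # each (N, C)
    # L2-normalize each channel over space so the C x C similarities stay bounded.
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-6)
    k = k / (np.linalg.norm(k, axis=0, keepdims=True) + 1e-6)
    attn = softmax(q.T @ k, axis=-1)                     # (C, C); each row sums to 1
    return v @ attn.T                                    # out[n, c] = sum_j attn[c, j] * v[n, j]

# Hypothetical 32x32 feature map with 16 channels, flattened to N = 1024 tokens.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(1024, 16))
w = [rng.normal(size=(16, 16)) for _ in range(3)]
print(channel_self_attention(tokens, *w).shape)  # (1024, 16)
```

Doubling the image area doubles N and therefore roughly doubles the cost here. Standard spatial self-attention would instead quadruple it.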
https://arxiv.org/abs/2309.12641
Surface defect inspection is of great importance for industrial manufacturing and production. Although deep learning-based defect inspection methods have made significant progress, challenges remain, such as weak defects that are hard to distinguish and defect-like interference in the background. To address these issues, we propose CINFormer, a transformer network with multi-stage CNN (Convolutional Neural Network) feature injection for surface defect segmentation, built on a UNet-like structure. CINFormer presents a simple yet effective feature integration mechanism that injects the multi-level CNN features of the input image into different stages of the transformer network in the encoder. This preserves both the CNN's strength in capturing detailed features and the transformer's strength in suppressing background noise, which facilitates accurate defect detection. In addition, CINFormer presents a Top-K self-attention module that focuses on the tokens carrying the most important information about the defects, further reducing the impact of the redundant background. Extensive experiments on the surface defect datasets DAGM 2007, Magnetic tile, and NEU show that CINFormer achieves state-of-the-art performance in defect detection.
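The Top-K idea can be sketched as scaled dot-product attention in which each query keeps only its k highest-scoring keys. All other scores are set to −∞ before the softmax, so those tokens receive exactly zero weight. This is a generic NumPy sketch of top-k sparse attention. CINFormer's choice of k and its placement within the network are not reproduced.

```python
import numpy as np

def topk_attention(q, k, v, top_k):
    """Scaled dot-product attention that keeps only the top_k keys per query.

    q: (Nq, d), k: (Nk, d), v: (Nk, dv). Scores outside each row's top_k
    are set to -inf before the softmax, so those tokens get zero weight.
    Returns the attended values and the (Nq, Nk) weight matrix.
    """
    if not 1 <= top_k <= k.shape[0]:
        raise ValueError("top_k must be between 1 and the number of keys")
    scores = q @ k.T / np.sqrt(q.shape[-1])
    drop = np.argsort(scores, axis=-1)[:, :-top_k]     # indices of all but the top_k per row
    masked = scores.copy()
    np.put_along_axis(masked, drop, -np.inf, axis=-1)
    masked -= masked.max(axis=-1, keepdims=True)       # stable softmax; the max is finite
    weights = np.exp(masked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical tokens: 2 queries attending over 5 keys, keeping 2 keys each.
rng = np.random.default_rng(0)
out, w = topk_attention(rng.normal(size=(2, 4)), rng.normal(size=(5, 4)),
                        rng.normal(size=(5, 3)), top_k=2)
print((w > 0).sum(axis=-1))  # [2 2]
```

Zeroing low-scoring tokens is one way to keep redundant background patches from diluting the attention paid to defect regions.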
https://arxiv.org/abs/2309.12639
Surface defect inspection is an important task in industrial inspection. Deep learning-based methods have demonstrated promising performance in this domain. Nevertheless, these methods still make misjudgments when faced with challenges such as low-contrast defects and complex backgrounds. To overcome these issues, we present a decision fusion network (DFNet) that combines the semantic decision with the feature decision to strengthen the network's decision ability. In particular, we introduce a decision fusion module (DFM) that extracts a semantic vector from the semantic decision branch and a feature vector from the feature decision branch, and fuses them to make the final classification decision. In addition, we propose a perception fine-tuning module (PFM) that fine-tunes the foreground and background during the segmentation stage. PFM generates the semantic and feature outputs that are sent to the classification decision stage. Furthermore, we present an inner-outer separation weight matrix to address the impact of label edge uncertainty during segmentation supervision. Experimental results on publicly available datasets, including KolektorSDD2 (96.1% AP) and Magnetic-tile-defect-datasets (94.6% mAP), demonstrate the effectiveness of the proposed method.
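A minimal sketch of the fusion step follows. Each branch's output map is summarized into a vector, the vectors are concatenated, and a linear layer with a sigmoid produces the final defect probability. The max-and-mean pooling and the single linear layer are assumptions for illustration. The abstract does not specify how DFM builds or fuses its vectors, and the weights below are hypothetical.

```python
import numpy as np

def pooled_vector(score_map):
    """Summarize a 2D branch output by its global max and mean (assumed choice)."""
    return np.array([score_map.max(), score_map.mean()])

def decision_fusion(semantic_vec, feature_vec, weight, bias):
    """Concatenate both branch vectors and map them to a defect probability.

    weight must have length len(semantic_vec) + len(feature_vec).
    """
    fused = np.concatenate([semantic_vec, feature_vec])
    logit = float(weight @ fused + bias)
    return 1.0 / (1.0 + np.exp(-logit))

# Hypothetical branch outputs for one image.
semantic_map = np.array([[0.1, 0.9], [0.0, 0.2]])   # e.g. a segmentation score map
feature_map = np.array([[0.5, 0.4], [0.3, 0.6]])    # e.g. a pooled feature response
p = decision_fusion(pooled_vector(semantic_map), pooled_vector(feature_map),
                    weight=np.array([1.0, 0.5, 0.8, 0.2]), bias=-1.0)
print(p)  # probability that the image contains a defect
```

Because both vectors feed one classifier, a strong response in either branch can override a weak or ambiguous one in the other.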
https://arxiv.org/abs/2309.12630
Surface electromyography (sEMG) and high-density sEMG (HD-sEMG) biosignals have been extensively investigated for myoelectric control of prosthetic devices, neurorobotics, and, more recently, human-computer interfaces, because they enable hand gesture recognition/prediction in a wearable and non-invasive manner. High intraday (same-day) performance has been reported. However, interday performance (with training and testing on separate days) degrades substantially because conventional approaches generalize poorly over time, hindering the application of such techniques in real-life practice. Few recent studies have examined the feasibility of multi-day hand gesture recognition, and those that exist face a major challenge: the need for long sEMG epochs makes the corresponding neural interfaces impractical because of the resulting delay in myoelectric control. This paper proposes a compact ViT-based network for multi-day dynamic hand gesture prediction. The proposed model addresses this main challenge by relying only on very short HD-sEMG signal windows (50 ms, one-sixth of the conventional window length for real-time myoelectric implementation), which boosts agility and responsiveness. The model predicts 11 dynamic gestures for 20 subjects with an average accuracy of over 71% on a testing day 3-25 days after training. Moreover, when calibrated on just a small portion of data from the testing day, the model achieves over 92% accuracy while retraining less than 10% of its parameters, for computational efficiency.
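To put the 50 ms window in concrete terms, the sketch below segments a multichannel recording into fixed-length windows. One-sixth of the conventional length implies a conventional window of about 300 ms. The 2048 Hz sampling rate and the 64-channel layout are assumptions for illustration, not values from the abstract. At that rate, a 50 ms window holds about 102 samples.

```python
def sliding_windows(signal, fs_hz, window_ms=50, stride_ms=None):
    """Split a multichannel recording into fixed-length windows.

    signal:    list of samples, each a list with one value per electrode
    fs_hz:     sampling rate in Hz (device-specific; assumed by the caller)
    window_ms: window length in milliseconds
    stride_ms: hop between window starts; defaults to window_ms (no overlap)
    """
    if stride_ms is None:
        stride_ms = window_ms
    window_len = round(fs_hz * window_ms / 1000)
    stride = round(fs_hz * stride_ms / 1000)
    return [signal[start:start + window_len]
            for start in range(0, len(signal) - window_len + 1, stride)]

# One second of a hypothetical 64-electrode HD-sEMG recording at an assumed 2048 Hz.
recording = [[0.0] * 64 for _ in range(2048)]
windows = sliding_windows(recording, fs_hz=2048)
print(len(windows), len(windows[0]))  # 20 windows of 102 samples each
```

A shorter window lowers the delay between muscle activity and the predicted gesture. The trade-off is that the classifier has less signal per decision.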
https://arxiv.org/abs/2309.12602
Data-driven insights are essential for modern agriculture. This paper introduces a machine learning framework designed to improve education and outreach in horticulture. The framework relies on data from the Horticulture Online Help Desk (HOHD), a large collection of questions from gardening enthusiasts who take part in the Extension Master Gardener Program (EMGP). The framework has two main parts. First, it uses machine learning models to classify questions into categories, so that each question can be routed quickly to the right expert and answered faster. Second, it analyzes when questions are asked and uses this information to forecast how many questions will arrive in the future and what they will be about. This helps us plan for the topics that will matter most, much like knowing in advance which questions will be popular in the coming months. We also take into account where questions come from by looking at their ZIP codes, which helps tailor research to the challenges gardeners face in different places. In this paper, we demonstrate the potential of machine learning techniques to predict trends in horticulture by analyzing textual queries from homeowners. We show that NLP, classification, and time series analysis can be used to identify patterns in homeowners' queries and predict future trends in horticulture. Our results suggest that machine learning could also be used to predict trends in other agricultural sectors. If large-scale agricultural industries curate and maintain a comparable repository of textual data, trend prediction and strategic agricultural planning could be revolutionized. This convergence of technology and agriculture offers a promising pathway for the future of sustainable farming and data-informed agricultural practices.
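As a baseline for the forecasting part, the sketch below uses a seasonal-naive forecast: each future month is predicted to receive as many questions as the same month one year earlier. This is an assumed illustrative baseline, not the time-series model used in the paper. The same function could be applied per question category or per ZIP code to produce topic- or region-specific forecasts.

```python
def seasonal_naive_forecast(monthly_counts, horizon, season=12):
    """Forecast future monthly question counts by repeating the most recent season.

    monthly_counts: chronological list of counts, at least `season` long
    horizon:        number of future months to predict
    season:         season length in months (12 for yearly gardening cycles)
    """
    if len(monthly_counts) < season:
        raise ValueError("need at least one full season of history")
    last_season = monthly_counts[-season:]
    return [last_season[h % season] for h in range(horizon)]

# Hypothetical two years of monthly question counts, peaking each spring.
history = [20, 25, 60, 90, 110, 80, 50, 40, 35, 30, 22, 18,
           24, 30, 70, 100, 120, 85, 55, 42, 38, 33, 25, 20]
print(seasonal_naive_forecast(history, horizon=3))  # [24, 30, 70]
```

A learned model is worth deploying only if it beats a baseline like this one. That makes this baseline a useful reference point when evaluating more sophisticated forecasters.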
https://arxiv.org/abs/2309.12579
The development of large language models (LLMs) capable of following instructions and engaging in conversational interactions has sparked increased interest in using them across various support tools. We investigate the utility of modern LLMs in assisting professional writers through an empirical user study (n=30). The design of our collaborative writing interface is grounded in the cognitive process model of writing, which views writing as a goal-oriented thinking process encompassing non-linear cognitive activities: planning, translating, and reviewing. Participants are asked to complete a post-completion survey giving feedback on the potential and pitfalls of LLMs as writing collaborators. Analyzing the writer-LLM interactions, we find that although writers seek the LLM's help across all three types of cognitive activity, they find LLMs more helpful for translating and reviewing. Our findings from analyzing both the interactions and the survey responses highlight future research directions in creative writing assistance using LLMs.
https://arxiv.org/abs/2309.12570
Human following is a crucial feature of human-robot interaction, yet it poses numerous challenges to mobile agents in real-world scenarios. Major hurdles arise when the target person is in a crowd, occluded by others, or facing away from the agent. To tackle these challenges, we present a novel person re-identification module composed of three parts: a 360-degree visual registration, a neural person re-identification component that uses human faces and torsos, and a motion tracker that records and predicts the target person's future position. Our human-following system also addresses other challenges, including identifying fast-moving targets with low latency, searching for targets that move out of the camera's view, avoiding collisions, and adaptively choosing among different following mechanisms based on the distance between the target person and the mobile agent. Extensive experiments show that our proposed person re-identification module significantly improves human following compared with other baseline variants.
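The sketch below illustrates two of the mechanisms mentioned above. The first is a constant-velocity motion model that predicts where the target will be shortly. The second is a distance-based switch between following behaviours. The velocity estimate from the last two observations, the mode names, and the 1 m and 4 m thresholds are all hypothetical. The paper's tracker and switching rules are not described in enough detail to reproduce.

```python
def predict_position(track, dt_ahead):
    """Extrapolate a target's 2D position with a constant-velocity model.

    track:    list of (t, x, y) observations in time order, at least two entries
    dt_ahead: how far into the future to predict, in the same time units as t
    Velocity is estimated from the last two observations (a simplifying assumption).
    """
    (t0, x0, y0), (t1, x1, y1) = track[-2], track[-1]
    dt = t1 - t0
    vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
    return (x1 + vx * dt_ahead, y1 + vy * dt_ahead)

def choose_mode(distance_m, near_m=1.0, far_m=4.0):
    """Pick a following behaviour from robot-target distance (hypothetical thresholds)."""
    if distance_m < near_m:
        return "hold"       # close enough: stay put to avoid crowding the person
    if distance_m > far_m:
        return "navigate"   # far away: plan a path toward the predicted position
    return "track"          # mid-range: follow the target directly

observations = [(0.0, 0.0, 0.0), (1.0, 1.0, 2.0)]  # seconds, metres
print(predict_position(observations, dt_ahead=1.0))  # (2.0, 4.0)
print(choose_mode(2.5))                              # track
```

Predicting ahead also gives the robot a place to search when the target briefly leaves the camera's field of view.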
https://arxiv.org/abs/2309.12479
Recently, there has been increasing interest in identifying and re-identifying people at long distances, for example from rooftop cameras, UAV cameras, street cameras, and others. Such recognition needs to go beyond the face and use whole-body markers such as gait. However, datasets for training and testing such recognition algorithms are scarce, and labeled ones even more so. This paper introduces DIOR, a framework for data collection and semi-automated annotation. It also provides a dataset of 14 subjects and 1.649 million RGB frames with 3D/2D skeleton gait labels, including 200,000 frames from a long-range camera. Our approach leverages advanced 3D computer vision techniques to attain pixel-level accuracy in indoor settings with motion capture systems. For outdoor long-range settings, we remove the dependency on motion capture systems and adopt a low-cost, hybrid 3D computer vision and learning pipeline with only 4 low-cost RGB cameras. This pipeline achieves precise skeleton labeling of far-away subjects even when their height is only 20-25 pixels within an RGB frame. Upon publication, we will make our pipeline openly available for others to use.
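Recovering 3D skeleton labels from a few calibrated RGB cameras typically involves triangulating each 2D joint detection. The sketch below implements standard linear (DLT) triangulation with NumPy. It assumes known 3×4 projection matrices and noise-free detections. Whether DIOR uses DLT, a refinement step, or a learned lifter is not stated in the abstract, and the camera parameters here are made up.

```python
import numpy as np

def triangulate_dlt(projections, pixels):
    """Linear (DLT) triangulation of one 3D point from two or more views.

    projections: list of 3x4 camera projection matrices P_i (assumed calibrated)
    pixels:      list of (u, v) observations of the same joint, one per camera
    Each view adds two rows, u * P[2] - P[0] and v * P[2] - P[1]; the
    homogeneous solution is the right singular vector with the smallest
    singular value.
    """
    rows = []
    for P, (u, v) in zip(projections, pixels):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point with a 3x4 camera matrix to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two hypothetical cameras 1 m apart observing a joint 5 m away.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
joint = np.array([0.2, -0.1, 5.0])
print(triangulate_dlt([P1, P2], [project(P1, joint), project(P2, joint)]))  # approx. [0.2, -0.1, 5.0]
```

With four cameras, each joint simply contributes eight rows instead of four. The least-squares solution then averages out some detection noise, which matters most for distant subjects only a few pixels tall.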
https://arxiv.org/abs/2309.12429
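DIOR's outdoor pipeline labels skeletons using only four calibrated RGB cameras. The abstract does not say how 2D keypoints are lifted to 3D. A common building block for this kind of multi-view labeling is linear triangulation.

The sketch below assumes calibrated 3x4 projection matrices and one 2D detection of the same joint per camera. It solves the inhomogeneous least-squares form of the direct linear transform, not the SVD-based homogeneous variant. The helper names are hypothetical and are not part of DIOR's pipeline.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def project(P: Matrix, X: Sequence[float]) -> Tuple[float, float]:
    """Pinhole projection of 3D point X with a 3x4 camera matrix P."""
    Xh = list(X) + [1.0]
    u, v, w = (sum(P[r][c] * Xh[c] for c in range(4)) for r in range(3))
    return u / w, v / w


def solve3(A: Matrix, b: Sequence[float]) -> List[float]:
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [list(A[i]) + [b[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):  # back substitution
        x[i] = (M[i][3] - sum(M[i][j] * x[j] for j in range(i + 1, 3))) / M[i][i]
    return x


def triangulate(Ps: Sequence[Matrix], uvs: Sequence[Tuple[float, float]]) -> List[float]:
    """Least-squares 3D point from >= 2 calibrated views (inhomogeneous linear DLT)."""
    rows, rhs = [], []
    for P, (u, v) in zip(Ps, uvs):
        # Each view gives (u*P[2] - P[0]) . [X, 1] = 0 and (v*P[2] - P[1]) . [X, 1] = 0.
        for coord, k in ((u, 0), (v, 1)):
            row = [coord * P[2][c] - P[k][c] for c in range(4)]
            rows.append(row[:3])
            rhs.append(-row[3])
    # Normal equations A^T A X = A^T b.
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Atb = [sum(r[i] * y for r, y in zip(rows, rhs)) for i in range(3)]
    return solve3(AtA, Atb)
```

For subjects only 20-25 pixels tall, the 2D detections themselves dominate the error. A real pipeline would likely add outlier rejection and nonlinear refinement on top of a linear estimate like this.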
Federated Learning (FL) has revolutionized how we train deep neural networks by enabling decentralized collaboration while safeguarding sensitive data and improving model performance. However, FL faces two crucial challenges: the heterogeneity of data held by individual clients and the vulnerability of FL systems to security breaches. This paper introduces an innovative solution named Estimated Mean Aggregation (EMA) that not only addresses these challenges but also provides a fundamental reference point, a $\mathsf{baseline}$, for advanced aggregation techniques in FL systems. EMA plays a dual role. It enhances model security by handling malicious outliers through trimmed means, and it uncovers data heterogeneity so that trained models adapt across diverse client datasets. In extensive experiments, EMA consistently achieves high accuracy and area under the curve (AUC) compared with alternative methods, establishing itself as a robust baseline for evaluating the effectiveness and security of FL aggregation methods. EMA thus offers a step forward in the efficiency, security, and versatility of decentralized deep learning in the context of FL.
https://arxiv.org/abs/2309.12267
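The abstract attributes EMA's robustness to trimmed means but does not spell out the full estimator. As a hedged illustration of that ingredient only, here is a standard coordinate-wise trimmed-mean aggregation over flattened client updates.

The `trim_ratio` parameter and the function names are assumptions. EMA's heterogeneity-estimation step is not modeled.

```python
from typing import List, Sequence


def trimmed_mean(values: Sequence[float], trim_ratio: float) -> float:
    """Mean after discarding the k smallest and k largest values, k = floor(n * trim_ratio)."""
    n = len(values)
    k = int(n * trim_ratio)
    if 2 * k >= n:
        raise ValueError("trim_ratio leaves no values to average")
    kept = sorted(values)[k:n - k]
    return sum(kept) / len(kept)


def trimmed_mean_aggregate(client_updates: List[List[float]],
                           trim_ratio: float = 0.1) -> List[float]:
    """Coordinate-wise trimmed mean of flattened client model updates."""
    if not client_updates:
        raise ValueError("no client updates")
    dim = len(client_updates[0])
    if any(len(u) != dim for u in client_updates):
        raise ValueError("all updates must have the same length")
    # Trim each parameter coordinate independently across clients.
    return [trimmed_mean([u[i] for u in client_updates], trim_ratio) for i in range(dim)]
```

With five clients and `trim_ratio=0.2`, one extreme value is dropped from each end of every coordinate. A single malicious update therefore cannot drag the aggregate arbitrarily far, unlike a plain mean.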
This work presents a case study of physics-based sonification of Quadratic Unconstrained Binary Optimization (QUBO) problems optimized by the Variational Quantum Eigensolver (VQE) algorithm. VQE approximates the solution of the problem through an iterative loop between a quantum computer and a classical optimization routine. This work uses the intermediate statevectors found in each VQE iteration as the means of sonifying the optimization process itself. The implementation was realised as a musical interface prototype named Variational Quantum Harmonizer (VQH), which suggests potential design strategies for musical applications, focusing on chords, chord progressions, and arpeggios. VQH can be used both to enhance data visualization and to create artistic pieces. The methodology is also relevant to how an artist can build intuition for achieving a desired musical sound by carefully designing QUBO cost functions. Flexible mapping strategies could supply a broad portfolio of sounds for QUBO and quantum-inspired musical compositions, as demonstrated in the case-study composition "Dependent Origination" by Peter Thomas and Paulo Itaborai.
https://arxiv.org/abs/2309.12254
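To make the QUBO and statevector vocabulary concrete, the sketch below does two things. First, it evaluates a QUBO cost $x^\top Q x$ over binary vectors. Second, it turns a statevector into chord loudness.

The mapping is hypothetical, not VQH's actual design:

- Each qubit is assigned one caller-chosen pitch (a MIDI note number).
- A note's velocity is proportional to the probability that its qubit measures 1.
- Qubit ordering is assumed little-endian: qubit 0 is the least significant bit of the basis-state index.

```python
import itertools
from typing import Dict, List, Sequence, Tuple


def qubo_cost(Q: Sequence[Sequence[float]], x: Sequence[int]) -> float:
    """QUBO objective x^T Q x for a binary vector x."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def brute_force_minimum(Q: Sequence[Sequence[float]]) -> Tuple[int, ...]:
    """Exhaustive reference solution (feasible only for a handful of variables)."""
    return min(itertools.product((0, 1), repeat=len(Q)), key=lambda x: qubo_cost(Q, x))


def qubit_marginals(statevector: Sequence[complex]) -> List[float]:
    """P(qubit q measures 1), assuming qubit 0 is the least significant index bit."""
    n_qubits = (len(statevector) - 1).bit_length()
    probs = [abs(amp) ** 2 for amp in statevector]
    return [sum(p for idx, p in enumerate(probs) if (idx >> q) & 1)
            for q in range(n_qubits)]


def chord_velocities(statevector: Sequence[complex], pitches: Sequence[int],
                     max_velocity: int = 127) -> Dict[int, int]:
    """Hypothetical mapping: one MIDI pitch per qubit, loudness proportional to P(qubit = 1)."""
    marginals = qubit_marginals(statevector)
    if len(pitches) != len(marginals):
        raise ValueError("need exactly one pitch per qubit")
    return {pitch: round(max_velocity * m) for pitch, m in zip(pitches, marginals)}
```

Calling `chord_velocities` once per VQE iteration would yield a sequence of chords. That sequence tracks how probability mass concentrates on low-cost bitstrings as the optimizer converges.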
Presentation attack (spoofing) detection (PAD) typically operates alongside biometric verification to improve reliability in the face of spoofing attacks. Even though the two sub-systems operate in tandem to solve the single task of reliable biometric verification, they address different detection tasks and are hence typically evaluated separately. Evidence shows that this approach is suboptimal. We introduce a new metric for the joint evaluation of PAD solutions operating in situ with biometric verification. In contrast to the recently proposed tandem detection cost function, the new tandem equal error rate (t-EER) is parameter free. The combination of two classifiers nonetheless leads to a \emph{set} of operating points at which false alarm and miss rates are equal and which also depend on the prevalence of attacks. We therefore introduce the \emph{concurrent} t-EER, a unique operating point that is invariant to the prevalence of attacks. We demonstrate the application of the t-EER to a wide range of biometric system evaluations under attack. For this we use modality-agnostic (and even application-agnostic) simulated scores, as well as real scores from a voice biometrics application. The proposed approach is a strong candidate metric for the tandem evaluation of PAD systems and biometric comparators.
https://arxiv.org/abs/2309.12237
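The prevalence invariance of the concurrent t-EER follows from a short argument. Assume the tandem false-alarm rate at attack prevalence $\alpha$ is modeled as $P_{fa}(\alpha) = (1-\alpha)P_{fa,non} + \alpha P_{fa,spf}$. Then any operating point with $P_{miss} = P_{fa,non} = P_{fa,spf}$ gives $P_{fa}(\alpha) = P_{miss}$ for every $\alpha$.

The sketch below approximates that point by brute force over empirical scores. It rests on two assumptions:

- A trial is accepted only if both the biometric comparator score and the PAD score reach their thresholds.
- Because empirical rates are discrete, the best point is the threshold pair with the smallest spread between the three error rates.

This is an illustrative approximation, not the authors' estimator. All function names are hypothetical.

```python
from typing import List, Tuple

Trial = Tuple[float, float]  # (comparator score, PAD score); higher means "accept"


def accept_rate(trials: List[Trial], t_bio: float, t_pad: float) -> float:
    """Fraction of trials accepted by both sub-systems (tandem AND rule)."""
    if not trials:
        raise ValueError("need at least one trial")
    accepted = sum(1 for s_bio, s_pad in trials if s_bio >= t_bio and s_pad >= t_pad)
    return accepted / len(trials)


def tandem_rates(targets: List[Trial], nontargets: List[Trial], spoofs: List[Trial],
                 t_bio: float, t_pad: float) -> Tuple[float, float, float]:
    """(P_miss, P_fa_nontarget, P_fa_spoof) at one pair of thresholds."""
    p_miss = 1.0 - accept_rate(targets, t_bio, t_pad)
    return p_miss, accept_rate(nontargets, t_bio, t_pad), accept_rate(spoofs, t_bio, t_pad)


def tandem_false_alarm(p_fa_non: float, p_fa_spf: float, alpha: float) -> float:
    """Overall false-alarm rate when a fraction alpha of negative trials are attacks."""
    return (1.0 - alpha) * p_fa_non + alpha * p_fa_spf


def concurrent_t_eer(targets: List[Trial], nontargets: List[Trial],
                     spoofs: List[Trial]) -> Tuple[float, float, float]:
    """Grid-search approximation: (rate, t_bio, t_pad) where the three rates agree best."""
    trials = targets + nontargets + spoofs
    # Observed scores plus +inf (reject everything) form the threshold grids.
    bio_grid = sorted({s for s, _ in trials}) + [float("inf")]
    pad_grid = sorted({s for _, s in trials}) + [float("inf")]
    best = None
    for t_bio in bio_grid:
        for t_pad in pad_grid:
            rates = tandem_rates(targets, nontargets, spoofs, t_bio, t_pad)
            key = (max(rates) - min(rates), sum(rates) / 3.0)  # spread first, then level
            if best is None or key < best[0]:
                best = (key, t_bio, t_pad)
    (_, rate), t_bio, t_pad = best
    return rate, t_bio, t_pad
```

The search is cubic in the number of trials, so it only suits small illustrative score sets. A practical evaluation would sweep sorted scores incrementally instead.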