A classification prediction algorithm based on Long Short-Term Memory (LSTM)-improved AdaBoost is used to predict virtual reality (VR) user experience. The dataset is randomly divided into training and test sets in a 7:3 ratio. During training, the model's loss value decreases from 0.65 to 0.31, showing that the model gradually reduces the discrepancy between its predictions and the actual labels while improving accuracy and generalisation ability. The final loss value of 0.31 indicates that the model fits the training data well and can predict and classify more accurately. The confusion matrix for the training set shows a total of 177 correct and 52 incorrect predictions, with an accuracy of 77%, precision of 88%, recall of 77% and an F1 score of 82%. The confusion matrix for the test set shows a total of 167 correct and 53 incorrect predictions, with 75% accuracy, 87% precision, 57% recall and a 69% F1 score. In summary, the classification prediction algorithm based on LSTM with improved AdaBoost shows good predictive ability for virtual reality user experience. This study is of great significance for enhancing the application of virtual reality technology in user experience. By combining the LSTM and AdaBoost algorithms, significant progress has been made in user experience prediction, which not only improves the accuracy and generalisation ability of the model but also provides useful insights for related research in the field of virtual reality. This approach can help developers better understand user requirements, optimise virtual reality product design, and enhance user satisfaction, promoting the wide application of virtual reality technology in various fields.
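The metrics reported above follow the standard binary confusion-matrix definitions. A minimal sketch (the counts below are illustrative, since the abstract reports only correct/incorrect totals, not the TP/FP/FN/TN split):

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Hypothetical counts for illustration only
acc, prec, rec, f1 = binary_metrics(tp=80, fp=10, fn=20, tn=90)
```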
https://arxiv.org/abs/2405.10515
In complex environments with large discrete action spaces, effective decision-making is critical in reinforcement learning (RL). Despite the widespread use of value-based RL approaches like Q-learning, they come with a computational burden, necessitating the maximization of a value function over all actions in each iteration. This burden becomes particularly challenging when addressing large-scale problems and using deep neural networks as function approximators. In this paper, we present stochastic value-based RL approaches which, in each iteration, as opposed to optimizing over the entire set of $n$ actions, only consider a variable stochastic set of a sublinear number of actions, possibly as small as $\mathcal{O}(\log(n))$. The presented stochastic value-based RL methods include, among others, Stochastic Q-learning, StochDQN, and StochDDQN, all of which integrate this stochastic approach for both value-function updates and action selection. The theoretical convergence of Stochastic Q-learning is established, while an analysis of stochastic maximization is provided. Moreover, through empirical validation, we illustrate that the various proposed approaches outperform the baseline methods across diverse environments, including different control problems, achieving near-optimal average returns in significantly reduced time.
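The core idea above — replacing the full max over $n$ actions with a max over a random $\mathcal{O}(\log(n))$ candidate set — can be sketched in tabular form. This is a simplified illustration: details such as reusing a remembered incumbent action follow the spirit of the approach, not its exact recipe.

```python
import math
import random
from collections import defaultdict

def stoch_max_action(Q, state, actions, memory):
    """Approximate argmax of Q(state, .) over a random O(log n) candidate set,
    augmented with the incumbent action remembered from the last call."""
    k = max(1, int(math.log2(len(actions))))
    candidates = set(random.sample(actions, k)) | memory[state]
    best = max(candidates, key=lambda a: Q[(state, a)])
    memory[state] = {best}  # keep the incumbent for the next visit
    return best

def stoch_q_update(Q, s, a, r, s2, actions, memory, alpha=0.1, gamma=0.99):
    """One Stochastic Q-learning step: the bootstrap max is itself stochastic."""
    a_next = stoch_max_action(Q, s2, actions, memory)
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a_next)] - Q[(s, a)])
```

With `Q = defaultdict(float)` and `memory = defaultdict(set)`, each update touches only O(log n) Q-values instead of all n.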
https://arxiv.org/abs/2405.10310
Before deploying outputs from foundation models in high-stakes tasks, it is imperative to ensure that they align with human values. For instance, in radiology report generation, reports generated by a vision-language model must align with human evaluations before their use in medical decision-making. This paper presents Conformal Alignment, a general framework for identifying units whose outputs meet a user-specified alignment criterion. It is guaranteed that on average, a prescribed fraction of selected units indeed meet the alignment criterion, regardless of the foundation model or the data distribution. Given any pre-trained model and new units with model-generated outputs, Conformal Alignment leverages a set of reference data with ground-truth alignment status to train an alignment predictor. It then selects new units whose predicted alignment scores surpass a data-dependent threshold, certifying their corresponding outputs as trustworthy. Through applications to question answering and radiology report generation, we demonstrate that our method is able to accurately identify units with trustworthy outputs via lightweight training over a moderate amount of reference data. En route, we investigate the informativeness of various features in alignment prediction and combine them with standard models to construct the alignment predictor.
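The selection step can be sketched with conformal p-values plus a Benjamini-Hochberg cutoff. This is a simplified illustration of the idea (score each test unit against the misaligned calibration scores, then select via a data-dependent threshold), not the paper's exact calibration procedure.

```python
def conformal_select(calib_scores, calib_aligned, test_scores, q=0.2):
    """Return indices of test units certified as trustworthy.
    Each test unit gets a conformal p-value against the *misaligned*
    calibration scores; Benjamini-Hochberg then picks a data-dependent
    cutoff targeting an error rate of q among selections."""
    null = [s for s, a in zip(calib_scores, calib_aligned) if a == 0]
    n0 = len(null)
    pvals = [(1 + sum(s >= t for s in null)) / (n0 + 1) for t in test_scores]
    # BH step-up on the p-values
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    m, cutoff = len(pvals), 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])
```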
https://arxiv.org/abs/2405.10301
Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not be able to efficiently learn optimal decision-making agents in multi-step goal-directed tasks from interactive environments. To address this challenge, we propose an algorithmic framework that fine-tunes VLMs with reinforcement learning (RL). Specifically, our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM to efficiently explore intermediate reasoning steps that lead to the final text-based action. Next, the open-ended text output is parsed into an executable action to interact with the environment to obtain goal-directed task rewards. Finally, our framework uses these task rewards to fine-tune the entire VLM with RL. Empirically, we demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7b models to outperform commercial models such as GPT4-V or Gemini. Furthermore, we find that CoT reasoning is a crucial component for performance improvement, as removing the CoT reasoning results in a significant decrease in the overall performance of our method.
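Parsing the open-ended CoT output into an executable action might look like the following. The trailing `Action: <name>` format is an assumption for illustration; the actual prompt format may differ.

```python
import re

def parse_action(llm_output, legal_actions):
    """Extract the final text action from a chain-of-thought completion.
    Assumes (hypothetically) the model is prompted to end with
    'Action: <name>'; returns None when no legal action is found."""
    m = re.search(r"Action:\s*([a-z_]+)", llm_output, flags=re.IGNORECASE)
    if m and m.group(1).lower() in legal_actions:
        return m.group(1).lower()
    return None
```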
https://arxiv.org/abs/2405.10292
Fact extraction is pivotal for constructing knowledge graphs. Recently, the increasing demand for temporal facts in downstream tasks has led to the emergence of the temporal fact extraction task. In this paper, we specifically address the extraction of temporal facts from natural language text. Previous studies fail to handle the challenge of establishing time-to-fact correspondences in complex sentences. To overcome this hurdle, we propose a timeline-based sentence decomposition strategy using large language models (LLMs) with in-context learning, ensuring a fine-grained understanding of the timeline associated with various facts. In addition, we evaluate the performance of LLMs for direct temporal fact extraction and obtain unsatisfactory results. To this end, we introduce TSDRE, a method that incorporates the decomposition capabilities of LLMs into the traditional fine-tuning of smaller pre-trained language models (PLMs). To support the evaluation, we construct ComplexTRED, a complex temporal fact extraction dataset. Our experiments show that TSDRE achieves state-of-the-art results on both the HyperRED-Temporal and ComplexTRED datasets.
https://arxiv.org/abs/2405.10288
This paper investigates the dynamics of a deep neural network (DNN) learning interactions. Previous studies have discovered and mathematically proven that given each input sample, a well-trained DNN usually only encodes a small number of interactions (non-linear relationships) between input variables in the sample. A series of theorems have been derived to prove that we can consider the DNN's inference equivalent to using these interactions as primitive patterns for inference. In this paper, we discover the DNN learns interactions in two phases. The first phase mainly penalizes interactions of medium and high orders, and the second phase mainly learns interactions of gradually increasing orders. We can consider the two-phase phenomenon as the starting point of a DNN learning over-fitted features. Such a phenomenon has been widely shared by DNNs with various architectures trained for different tasks. Therefore, the discovery of the two-phase dynamics provides a detailed mechanism for how a DNN gradually learns different inference patterns (interactions). In particular, we have also verified the claim that high-order interactions have weaker generalization power than low-order interactions. Thus, the discovered two-phase dynamics also explains how the generalization power of a DNN changes during the training process.
https://arxiv.org/abs/2405.10262
As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a project page where papers related to our topic are organized and listed: this https URL.
https://arxiv.org/abs/2405.10255
Epigenetic cell memory, the inheritance of gene expression patterns across subsequent cell divisions, is a critical property of multi-cellular organisms. In recent work [10], a subset of the authors observed in a simulation study how the stochastic dynamics and time-scale differences between establishment and erasure processes in chromatin modifications (such as histone modifications and DNA methylation) can have a critical effect on epigenetic cell memory. In this paper, we provide a mathematical framework to rigorously validate and extend beyond these computational findings. Viewing our stochastic model of a chromatin modification circuit as a singularly perturbed, finite state, continuous time Markov chain, we extend beyond existing theory in order to characterize the leading coefficients in the series expansions of stationary distributions and mean first passage times. In particular, we characterize the limiting stationary distribution in terms of a reduced Markov chain, provide an algorithm to determine the orders of the poles of mean first passage times, and determine how changing erasure rates affects system behavior. The theoretical tools developed in this paper not only allow us to set a rigorous mathematical basis for the computational findings of our prior work, highlighting the effect of chromatin modification dynamics on epigenetic cell memory, but they can also be applied to other singularly perturbed Markov chains beyond the applications in this paper, especially those associated with chemical reaction networks.
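Schematically, the objects being characterized are the leading terms of series expansions in the perturbation parameter (the notation below is illustrative, not the paper's):

```latex
% Stationary distribution of the singularly perturbed chain, expanded in \varepsilon:
\pi^{\varepsilon} = \pi_{0} + \varepsilon\,\pi_{1} + \varepsilon^{2}\,\pi_{2} + \cdots,
\qquad \pi_{0} = \lim_{\varepsilon \to 0} \pi^{\varepsilon},
% and a mean first passage time with a pole of order p at \varepsilon = 0:
h^{\varepsilon} = \varepsilon^{-p}\left(h_{-p} + \varepsilon\,h_{-p+1} + \cdots\right).
```

Here $\pi_{0}$ corresponds to the limiting stationary distribution of the reduced Markov chain, and the pole order $p$ is what the proposed algorithm determines for the mean first passage times.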
https://arxiv.org/abs/2405.10184
In this work, we introduce Libra, a prototype model with a decoupled vision system on a large language model (LLM). The decoupled vision system decouples inner-modal modeling and cross-modal interaction, yielding unique visual information modeling and effective cross-modal comprehension. Libra is trained through discrete auto-regressive modeling on both vision and language inputs. Specifically, we incorporate a routed visual expert with a cross-modal bridge module into a pretrained LLM to route the vision and language flows during attention computing to enable different attention patterns in inner-modal modeling and cross-modal interaction scenarios. Experimental results demonstrate that the dedicated design of Libra achieves a strong MLLM baseline that rivals existing works in the image-to-text scenario with merely 50 million training data, providing a new perspective for future multimodal foundation models. Code is available at this https URL.
https://arxiv.org/abs/2405.10140
Over the past century, the Turkish language has undergone substantial changes, primarily driven by governmental interventions. In this work, our goal is to investigate the evolution of the Turkish language since the establishment of Türkiye in 1923. Thus, we first introduce Turkronicles, a diachronic corpus for Turkish derived from the Official Gazette of Türkiye. Turkronicles contains 45,375 documents detailing governmental actions, making it a pivotal resource for analyzing the linguistic evolution influenced by state policies. In addition, we expand an existing diachronic Turkish corpus, which consists of the records of the Grand National Assembly of Türkiye, by covering additional years. Next, combining these two diachronic corpora, we seek answers to two main research questions: how has the Turkish vocabulary changed since the 1920s, and how have writing conventions changed? Our analysis reveals that the vocabularies of two different time periods diverge more as the time between them increases, and newly coined Turkish words take the place of their old counterparts. We also observe changes in writing conventions. In particular, the use of the circumflex noticeably decreases, and words ending with the letters "-b" and "-d" are successively replaced with "-p" and "-t", respectively. Overall, this study quantitatively highlights the dramatic changes in Turkish from various aspects of the language in a diachronic perspective.
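One simple way to quantify how two periods' vocabularies diverge is the Jaccard distance between their word sets. This is an illustrative measure; the study's exact divergence metric may differ.

```python
def vocab_divergence(vocab_a, vocab_b):
    """Jaccard distance between two periods' vocabularies:
    0 means identical word sets, 1 means fully disjoint."""
    a, b = set(vocab_a), set(vocab_b)
    return 1 - len(a & b) / len(a | b)
```

Computed over a sliding window of years, a rising value of this distance reflects the divergence-with-time-gap trend the abstract describes.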
https://arxiv.org/abs/2405.10133
Natural language could play an important role in developing generalist surgical models by providing a broad source of supervision from raw texts. This flexible form of supervision can enable the model's transferability across datasets and tasks as natural language can be used to reference learned visual concepts or describe new ones. In this work, we present HecVL, a novel hierarchical video-language pretraining approach for building a generalist surgical model. Specifically, we construct a hierarchical video-text paired dataset by pairing the surgical lecture video with three hierarchical levels of texts: at clip-level, atomic actions using transcribed audio texts; at phase-level, conceptual text summaries; and at video-level, overall abstract text of the surgical procedure. Then, we propose a novel fine-to-coarse contrastive learning framework that learns separate embedding spaces for the three video-text hierarchies using a single model. By disentangling embedding spaces of different hierarchical levels, the learned multi-modal representations encode short-term and long-term surgical concepts in the same model. Thanks to the injected textual semantics, we demonstrate that the HecVL approach can enable zero-shot surgical phase recognition without any human annotation. Furthermore, we show that the same HecVL model for surgical phase recognition can be transferred across different surgical procedures and medical centers.
https://arxiv.org/abs/2405.10075
Deformable image registration (alignment) is highly sought after in numerous clinical applications, such as computer-aided diagnosis and disease progression analysis. Deep Convolutional Neural Network (DCNN)-based image registration methods have demonstrated advantages in terms of registration accuracy and computational speed. However, while most methods excel at global alignment, they often perform worse in aligning local regions. To address this challenge, this paper proposes a mask-guided encoder-decoder DCNN-based image registration method, named MrRegNet. This approach employs a multi-resolution encoder for feature extraction and subsequently estimates multi-resolution displacement fields in the decoder to handle substantial image deformations. Furthermore, segmentation masks are employed to direct the model's attention toward aligning local regions. The results show that the proposed method outperforms traditional methods like Demons and a well-known deep learning method, VoxelMorph, on a public 3D brain MRI dataset (OASIS) and a local 2D brain MRI dataset with large deformations. Importantly, image alignment accuracy is significantly improved in the local regions guided by segmentation masks. GitHub link: this https URL.
https://arxiv.org/abs/2405.10068
Aerial forest-traversing robots capable of monitoring both biotic and abiotic data call for a range of capabilities, from multi-functionality to robustness and adaptability. These robots have to weather turbulent winds and various obstacles such as forest flora and wildlife, which amplifies the complexity of operating in such uncertain environments. The key to successful data collection is the flexibility to intermittently move from tree to tree in order to perch at vantage locations for extended periods. Perching not only reduces the disturbance caused by multi-rotor systems during data collection, but also allows the system to rest and recharge for longer outdoor missions. Current systems add perching modules that increase the aerial robot's weight and reduce the drone's overall endurance. Thus, the key questions currently studied in our work are: "How do we develop a single robot capable of metamorphosing its body for multi-modal flight and dynamic perching?", "How do we detect and land on perchable objects robustly and dynamically?", and "What spatial-temporal data is important for us to collect?"
https://arxiv.org/abs/2405.10043
Remote sensing image dehazing (RSID) aims to remove nonuniform and physically irregular haze for high-quality image restoration. The emergence of CNNs and Transformers has brought extraordinary strides to the RSID arena. However, these methods often struggle to balance adequate long-range dependency modeling with computational efficiency. To this end, we propose RSDehamba, the first lightweight Mamba-based network in the field of RSID. Inspired by the recent rise of the Selective State Space Model (SSM), with its linear complexity and strong modeling of long-range dependencies, our RSDehamba integrates the SSM framework into the U-Net architecture. Specifically, we propose the Vision Dehamba Block (VDB) as the core component of the overall network, which utilizes the linear complexity of SSM to achieve global context encoding. Simultaneously, the Direction-aware Scan Module (DSM) is designed to dynamically aggregate feature exchanges over different directional domains to effectively enhance the flexibility of sensing the spatially varying distribution of haze. In this way, RSDehamba fully exploits long-range spatial dependency capture and channel information exchange for better extraction of haze features. Extensive experimental results on widely used benchmarks validate the surpassing performance of RSDehamba against existing state-of-the-art methods.
https://arxiv.org/abs/2405.10030
The main challenge in learning image-conditioned robotic policies is acquiring a visual representation conducive to low-level control. Due to the high dimensionality of the image space, learning a good visual representation requires a considerable amount of visual data. However, when learning in the real world, data is expensive. Sim2Real is a promising paradigm for overcoming data scarcity in the real-world target domain by using a simulator to collect large amounts of cheap data closely related to the target task. However, it is difficult to transfer an image-conditioned policy from sim to real when the domains are very visually dissimilar. To bridge the sim2real visual gap, we propose using natural language descriptions of images as a unifying signal across domains that captures the underlying task-relevant semantics. Our key insight is that if two image observations from different domains are labeled with similar language, the policy should predict similar action distributions for both images. We demonstrate that training the image encoder to predict the language description or the distance between descriptions of a sim or real image serves as a useful, data-efficient pretraining step that helps learn a domain-invariant image representation. We can then use this image encoder as the backbone of an IL policy trained simultaneously on a large amount of simulated and a handful of real demonstrations. Our approach outperforms widely used prior sim2real methods and strong vision-language pretraining baselines like CLIP and R3M by 25 to 40%.
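The pretraining objective described above can be sketched as matching pairwise distances in image-embedding space to distances between language-description embeddings, so that sim and real observations with similar captions land close together. This is an illustrative stand-in for the paper's exact loss, written with plain lists for self-containment.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def distance_alignment_loss(img_embs, lang_embs):
    """Mean squared gap between pairwise cosine distances of image embeddings
    and of the corresponding language-description embeddings."""
    loss, pairs = 0.0, 0
    for i in range(len(img_embs)):
        for j in range(i + 1, len(img_embs)):
            d_img = 1 - cosine(img_embs[i], img_embs[j])
            d_lang = 1 - cosine(lang_embs[i], lang_embs[j])
            loss += (d_img - d_lang) ** 2
            pairs += 1
    return loss / pairs
```

Minimizing this over the image encoder (with language embeddings held fixed) pushes the encoder toward a domain-invariant representation organized by task semantics.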
https://arxiv.org/abs/2405.10020
Due to the significant advances in large-scale text-to-image generation by diffusion models (DMs), controllable human image generation has been attracting much attention recently. Although existing works such as ControlNet [36], T2I-Adapter [20] and HumanSD [10] have demonstrated good abilities in generating human images based on pose conditions, they still fail to meet the requirements of real e-commerce scenarios. These include: (1) the interaction between the shown product and the human should be considered; (2) human parts like the face/hand/arm/foot, and the interaction between the human model and the product, should be hyper-realistic; and (3) the identity of the product shown in advertising should be exactly consistent with the product itself. To this end, in this paper, we first define a new human image generation task for e-commerce marketing, i.e., Object-ID-retentive Human-object Interaction image Generation (OHG), and then propose a VirtualModel framework to generate human images for product display, which supports any category of product and any type of human-object interaction. As shown in Figure 1, VirtualModel not only outperforms other methods in terms of accurate pose control and image quality but also allows for the display of user-specified product objects by maintaining product-ID consistency and enhancing the plausibility of human-object interaction. Codes and data will be released.
https://arxiv.org/abs/2405.09985
The maturity classification of specialty crops such as strawberries and tomatoes is an essential agricultural downstream activity for selective harvesting and quality control (QC) at production and packaging sites. Recent advancements in Deep Learning (DL) have produced encouraging results in color images for maturity classification applications; however, hyperspectral imaging (HSI) outperforms methods based on color vision. Multivariate analysis methods and Convolutional Neural Networks (CNNs) deliver promising results, but the large amount of input data and the associated preprocessing requirements hinder practical application. Conventionally, the reflectance intensity in a given electromagnetic spectrum is employed to estimate fruit maturity. We present a feature extraction method to empirically demonstrate that the peak reflectance in subbands such as 500-670 nm (pigment band) and the wavelength of the peak position, and conversely, the trough reflectance and its corresponding wavelength within 671-790 nm (chlorophyll band), are convenient to compute yet distinctive features for maturity classification. The proposed feature selection method is beneficial because preprocessing, such as dimensionality reduction, is avoided before every prediction. The feature set is designed to capture these traits. The best SOTA methods among 3D-CNN, 1D-CNN, and SVM achieve at most 90.0% accuracy for strawberries and 92.0% for tomatoes on our dataset. Results show that the proposed method outperforms the SOTA, as it yields an accuracy above 98.0% in strawberry and 96.0% in tomato classification. A comparative analysis of the time efficiency of these methods is also conducted, showing that the proposed method performs prediction at 13 Frames Per Second (FPS), compared to the maximum 1.16 FPS attained by the full-spectrum SVM classifier.
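The four features described above — band-limited peak/trough reflectance and their wavelengths — are cheap to extract from a single spectrum. A direct sketch, with band edges as stated in the abstract:

```python
def maturity_features(wavelengths, reflectance):
    """Four scalar features from one spectrum: peak reflectance and its
    wavelength in the 500-670 nm pigment band, and trough reflectance and
    its wavelength in the 671-790 nm chlorophyll band."""
    pig = [(r, w) for w, r in zip(wavelengths, reflectance) if 500 <= w <= 670]
    chl = [(r, w) for w, r in zip(wavelengths, reflectance) if 671 <= w <= 790]
    peak_r, peak_w = max(pig)      # tuple comparison: max by reflectance
    trough_r, trough_w = min(chl)  # tuple comparison: min by reflectance
    return peak_r, peak_w, trough_r, trough_w
```

Because the features come straight from raw band-limited reflectance, no dimensionality reduction is needed before each prediction, which is what makes the method fast at inference time.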
https://arxiv.org/abs/2405.09955
Previous unsupervised anomaly detection (UAD) methods often struggle with significant intra-class diversity; i.e., a class in a dataset contains multiple subclasses, which we categorize as Feature-Rich Anomaly Detection Datasets (FRADs). This is evident in applications such as the unified setting and unmanned supermarket scenarios. To address this challenge, we developed MiniMaxAD: a lightweight autoencoder designed to efficiently compress and memorize extensive information from normal images. Our model utilizes a large kernel convolutional network equipped with a Global Response Normalization (GRN) unit and employs a multi-scale feature reconstruction strategy. The GRN unit significantly increases the upper limit of the network's capacity, while the large kernel convolution facilitates the extraction of highly abstract patterns, leading to compact normal feature modeling. Additionally, we introduce an Adaptive Contraction Loss (ADCLoss), tailored to FRADs to overcome the limitations of global cosine distance loss. MiniMaxAD was comprehensively tested across six challenging UAD benchmarks, achieving state-of-the-art results in four and highly competitive outcomes in the remaining two. Notably, our model achieved a detection AUROC of up to 97.0% in ViSA under the unified setting. Moreover, it not only achieved state-of-the-art performance in unmanned supermarket tasks but also exhibited an inference speed 37 times faster than the previous best method, demonstrating its effectiveness in complex UAD tasks.
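The abstract does not spell out the GRN unit's equations; a minimal numpy sketch in the style popularized for convolutional feature maps (channel-wise L2 aggregation followed by divisive normalization and a residual connection) is given below. The scalar `gamma`/`beta` arguments stand in for learnable per-channel parameters and are an assumption for brevity.

```python
import numpy as np

def global_response_norm(x, gamma, beta, eps=1e-6):
    """Global Response Normalization sketch.
    x: feature map of shape (N, H, W, C), channels-last."""
    # Aggregate: per-channel global L2 norm over the spatial dimensions
    gx = np.sqrt((x ** 2).sum(axis=(1, 2), keepdims=True))      # (N, 1, 1, C)
    # Normalize: divide each channel's response by the mean response
    nx = gx / (gx.mean(axis=-1, keepdims=True) + eps)            # (N, 1, 1, C)
    # Calibrate with learnable scale/shift, plus a residual connection
    return gamma * (x * nx) + beta + x
```

The divisive normalization encourages feature diversity across channels, which matches the abstract's claim that GRN raises the capacity ceiling of a compact network.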
https://arxiv.org/abs/2405.09933
Most existing attention prediction research focuses on salient instances like humans and objects. However, the more complex interaction-oriented attention, arising from human observers' comprehension of interactions between instances, remains largely unexplored. This is equally crucial for advancing human-machine interaction and human-centered artificial intelligence. To bridge this gap, we first collect a novel gaze fixation dataset named IG, comprising 530,000 fixation points across 740 diverse interaction categories, capturing visual attention during human observers' cognitive processing of interactions. Subsequently, we introduce the zero-shot interaction-oriented attention prediction task ZeroIA, which challenges models to predict visual cues for interactions not encountered during training. Thirdly, we present the Interactive Attention model IA, designed to emulate human observers' cognitive processes to tackle the ZeroIA problem. Extensive experiments demonstrate that the proposed IA outperforms other state-of-the-art approaches in both ZeroIA and fully supervised settings. Lastly, we endeavor to apply interaction-oriented attention to the interaction recognition task itself. Further experimental results demonstrate the promising potential to enhance the performance and interpretability of existing state-of-the-art HOI models by incorporating real human attention data from IG and attention labels generated by IA.
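Discrete gaze fixation points like those in IG are commonly converted to dense attention maps for training and evaluation by placing a Gaussian at each fixation; the sketch below illustrates this standard preprocessing step. The function name and the choice of `sigma` are assumptions, not details from the abstract.

```python
import numpy as np

def fixations_to_attention_map(points, h, w, sigma=15.0):
    """Render (x, y) fixation points into a normalized dense attention map."""
    ys, xs = np.mgrid[0:h, 0:w]
    amap = np.zeros((h, w))
    for px, py in points:
        # Superimpose an isotropic Gaussian centered on each fixation
        amap += np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
    m = amap.max()
    return amap / m if m > 0 else amap   # scale to [0, 1]
```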
https://arxiv.org/abs/2405.09931
Large-scale "foundation models" have gained traction as a way to leverage the vast amounts of unlabeled remote sensing data collected every day. However, due to the multiplicity of Earth Observation satellites, these models should learn "sensor agnostic" representations that generalize across sensor characteristics with minimal fine-tuning. This is complicated by data availability: low-resolution imagery, such as Sentinel-2 and Landsat-8 data, is available in large amounts, while very high-resolution aerial or satellite data is less common. To tackle these challenges, we introduce cross-sensor self-supervised training and alignment for remote sensing (X-STARS). We design a self-supervised training loss, the Multi-Sensor Alignment Dense loss (MSAD), to align representations across sensors, even with vastly different resolutions. Our X-STARS can be applied to train models from scratch, or to adapt large models pretrained on, e.g., low-resolution EO data to new high-resolution sensors in a continual pretraining framework. We collect and release MSC-France, a new multi-sensor dataset, on which we train our X-STARS models and then evaluate them on seven downstream classification and segmentation tasks. We demonstrate that X-STARS outperforms the state-of-the-art by a significant margin with less data across various conditions of data availability and resolutions.
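The abstract does not give the exact form of the MSAD loss; a common building block for dense cross-sensor alignment is a per-pixel cosine-distance penalty between feature maps resampled to a common spatial grid, sketched here under that assumption. The function name and the pre-resampling requirement are illustrative, not the authors' definition.

```python
import numpy as np

def dense_alignment_loss(feat_a, feat_b, eps=1e-8):
    """Per-pixel cosine-distance loss between two (H, W, C) feature maps
    from different sensors, already resampled to the same spatial grid.
    Returns 0 when the representations are perfectly aligned."""
    a = feat_a / (np.linalg.norm(feat_a, axis=-1, keepdims=True) + eps)
    b = feat_b / (np.linalg.norm(feat_b, axis=-1, keepdims=True) + eps)
    cos = (a * b).sum(axis=-1)           # per-pixel cosine similarity
    return float((1.0 - cos).mean())     # average cosine distance
```

Minimizing such a loss pushes the encoder toward sensor-agnostic features, since matching locations must produce similar embeddings regardless of which sensor captured them.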
https://arxiv.org/abs/2405.09922