Pre-training on Internet data has proven to be a key ingredient for broad generalization in many modern ML systems. What would it take to enable such capabilities in robotic reinforcement learning (RL)? Offline RL methods, which learn from datasets of robot experience, offer one way to incorporate prior data into the robotic learning pipeline. However, these methods have a "type mismatch" with video data (such as Ego4D), the largest prior datasets available for robotics, since video offers observation-only experience without the action or reward annotations needed for RL methods. In this paper, we develop a system for leveraging large-scale human video datasets in robotic offline RL, based entirely on learning value functions via temporal-difference learning. We show that value learning on video datasets learns representations that are more conducive to downstream robotic offline RL than other approaches for learning from video data. Our system, called V-PTR, combines the benefits of pre-training on video data with robotic offline RL approaches that train on diverse robot data, resulting in value functions and policies for manipulation tasks that perform better, act robustly, and generalize broadly. On several manipulation tasks on a real WidowX robot, our framework produces policies that greatly improve over prior methods. Our video and additional details can be found at this https URL.
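As a concrete picture of what action-free value pre-training looks like, the following is a minimal PyTorch sketch of TD value learning on video frames; the module names, shapes, and the goal-style reward are assumptions of this sketch, not V-PTR's exact objective.

import torch
import torch.nn as nn

class FrameValue(nn.Module):
    """Value head on top of a frame encoder (names/dims are assumptions)."""
    def __init__(self, encoder: nn.Module, feat_dim: int = 512):
        super().__init__()
        self.encoder = encoder  # maps (B, C, H, W) -> (B, feat_dim)
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(frames)).squeeze(-1)  # (B,)

def td_loss(value_fn, target_fn, obs, next_obs, reward, discount=0.98):
    # Bootstrapped TD target computed from the next frame only -- no action
    # labels are needed, which is what makes video-only pre-training
    # possible. `reward` can be a sparse goal-reaching indicator derived
    # from the video itself (an assumption of this sketch).
    with torch.no_grad():
        target = reward + discount * target_fn(next_obs)
    return ((value_fn(obs) - target) ** 2).mean()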
https://arxiv.org/abs/2309.13041
Neural radiance fields (NeRF) have revolutionized the field of image-based view synthesis. However, NeRF uses straight rays and fails to deal with complicated light path changes caused by refraction and reflection. This prevents NeRF from successfully synthesizing transparent or specular objects, which are ubiquitous in real-world robotics and AR/VR applications. In this paper, we introduce the refractive-reflective field. Taking the object silhouette as input, we first utilize marching tetrahedra with a progressive encoding to reconstruct the geometry of non-Lambertian objects and then model refraction and reflection effects of the object in a unified framework using Fresnel terms. Meanwhile, to achieve efficient and effective anti-aliasing, we propose a virtual cone supersampling technique. We benchmark our method on different shapes, backgrounds and Fresnel terms on both real-world and synthetic datasets. We also qualitatively and quantitatively benchmark the rendering results of various editing applications, including material editing, object replacement/insertion, and environment illumination estimation. Codes and data are publicly available at this https URL.
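For readers unfamiliar with Fresnel terms: the Fresnel reflectance determines how radiance splits between the reflected and refracted branches at a surface. The snippet below uses Schlick's widely-used approximation as an illustrative stand-in; whether the paper uses this approximation or the exact Fresnel equations is not specified in the abstract.

import numpy as np

def fresnel_schlick(cos_theta, n1=1.0, n2=1.5):
    # Schlick's approximation to the Fresnel reflectance (valid when light
    # enters the denser medium, n1 <= n2); cos_theta is the cosine of the
    # angle between the incoming ray and the surface normal.
    r0 = ((n1 - n2) / (n1 + n2)) ** 2
    return r0 + (1.0 - r0) * (1.0 - cos_theta) ** 5

# The reflectance F weights the reflected branch of a ray; 1 - F weights
# the refracted branch, so both branches can be rendered in one framework.
cos_theta = np.cos(np.deg2rad(30.0))
f = fresnel_schlick(cos_theta)  # ~0.04 near normal incidence for glass
print(f, 1.0 - f)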
https://arxiv.org/abs/2309.13039
The rapid development of AR/VR brings tremendous demands for 3D content. While the widely-used Computer-Aided Design (CAD) method requires a time-consuming and labor-intensive modeling process, sketch-based 3D modeling offers a potential solution as a natural form of computer-human interaction. However, the sparsity and ambiguity of sketches make it challenging to generate high-fidelity content reflecting creators' ideas. Precise drawing from multiple views or strategic step-by-step drawing is often required to tackle the challenge but is not friendly to novice users. In this work, we introduce a novel end-to-end approach, Deep3DSketch+, which performs 3D modeling using only a single free-hand sketch without inputting multiple sketches or view information. Specifically, we introduce a lightweight generation network for efficient inference in real time and a structure-aware adversarial training approach with a Stroke Enhancement Module (SEM) that captures structural information to facilitate learning of realistic and fine-detailed shape structures for high-fidelity performance. Extensive experiments demonstrate the effectiveness of our approach, which achieves state-of-the-art (SOTA) performance on both synthetic and real datasets.
https://arxiv.org/abs/2309.13006
Despite the recent successes of vanilla Graph Neural Networks (GNNs) on many tasks, their foundation on pairwise interaction networks inherently limits their capacity to discern latent higher-order interactions in complex systems. To bridge this capability gap, we propose a novel approach exploiting the rich mathematical theory of simplicial complexes (SCs) - a robust tool for modeling higher-order interactions. Current SC-based GNNs are burdened by high complexity and rigidity, and quantifying higher-order interaction strengths remains challenging. We therefore present a higher-order Flower-Petals (FP) model that incorporates FP Laplacians into SCs, and introduce a Higher-order Graph Convolutional Network (HiGCN) grounded in FP Laplacians, capable of discerning intrinsic features across varying topological scales. By employing learnable graph filters, a parameter group within each FP Laplacian domain, we can identify diverse patterns where the filters' weights serve as a quantifiable measure of higher-order interaction strengths. The theoretical underpinnings of HiGCN's advanced expressiveness are rigorously demonstrated. Additionally, our empirical investigations reveal that the proposed model accomplishes state-of-the-art (SOTA) performance on a range of graph tasks and provides a scalable and flexible solution to explore higher-order interactions in graphs.
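As a point of reference for the learnable-filter mechanism, the sketch below implements a generic learnable polynomial filter over a Laplacian; HiGCN's FP Laplacians and exact parameterization follow the paper, so the filter form here is an illustrative assumption only.

import torch
import torch.nn as nn

class LearnableGraphFilter(nn.Module):
    # A polynomial filter over a (normalized) Laplacian; the learned
    # weights alpha_k play the role HiGCN assigns to filter weights as a
    # quantifiable measure of interaction strength (sketch, not the
    # paper's exact construction).
    def __init__(self, order: int = 3):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(order + 1) / (order + 1))

    def forward(self, L: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        out, h = self.alpha[0] * x, x
        for k in range(1, self.alpha.numel()):
            h = L @ h  # propagate one more hop along the domain
            out = out + self.alpha[k] * h
        return out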
https://arxiv.org/abs/2309.12971
Nested Event Extraction (NEE) aims to extract complex event structures where an event contains other events as its arguments recursively. Nested events involve a kind of Pivot Elements (PEs) that simultaneously act as arguments of outer events and as triggers of inner events, and thus connect them into nested structures. This special characteristic of PEs brings challenges to existing NEE methods, as they cannot well cope with the dual identities of PEs. Therefore, this paper proposes a new model, called PerNee, which extracts nested events mainly based on recognizing PEs. Specifically, PerNee first recognizes the triggers of both inner and outer events and further recognizes the PEs via classifying the relation type between trigger pairs. To obtain better representations of triggers and arguments and further improve NEE performance, it incorporates the information of both event types and argument roles into PerNee through prompt learning. Since existing NEE datasets (e.g., Genia11) are limited to specific domains and contain a narrow range of event types with nested structures, we systematically categorize nested events in the generic domain and construct a new NEE dataset, namely ACE2005-Nest. Experimental results demonstrate that PerNee consistently achieves state-of-the-art performance on ACE2005-Nest, Genia11, and Genia13.
https://arxiv.org/abs/2309.12960
Assurance cases can be used to argue for the safety of products in safety engineering. In safety-critical areas, the construction of assurance cases is indispensable. Trustworthiness Derivation Trees (TDTs) enhance assurance cases by incorporating formal methods, making automated reasoning about assurance cases possible. We present Trustworthiness Derivation Tree Analyzer (Trusta), a desktop application designed to automatically construct and verify TDTs. The tool has a built-in Prolog interpreter in its backend, and is supported by the constraint solvers Z3 and MONA. Therefore, it can solve constraints over logical formulas involving arithmetic, sets, Horn clauses, etc. Trusta also utilizes large language models to make the creation and evaluation of assurance cases more convenient. It allows for interactive human examination and modification. We evaluated top language models like ChatGPT-3.5, ChatGPT-4, and PaLM 2 for generating assurance cases. Our tests showed a 50%-80% similarity between machine-generated and human-created cases. In addition, Trusta can extract formal constraints from natural-language text, facilitating an easier interpretation and validation process. This extraction is subject to human review and correction, blending the best of automated efficiency with human insight. To our knowledge, this marks the first integration of large language models in automatically creating and reasoning about assurance cases, bringing a novel approach to a traditional challenge. Through several industrial case studies, Trusta has proven able to quickly find subtle issues that are typically missed in manual inspection, demonstrating its practical value in enhancing the assurance case development process.
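As a toy illustration of the kind of constraint solving Trusta delegates to its backend (the constraint itself is invented for illustration), the Z3 Python bindings can check an arithmetic claim directly:

from z3 import Int, Solver, sat

# Toy stand-in for a constraint a TDT node might carry: check that a
# claimed resource budget is satisfiable, i.e. the safety argument at
# this node is at least coherent.
x, y = Int("x"), Int("y")
s = Solver()
s.add(x + y <= 10, x >= 3, y >= 3)
print(s.check() == sat)  # True -- Z3 finds a witness such as x=3, y=3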
https://arxiv.org/abs/2309.12941
Event Relation Extraction (ERE) aims to extract multiple kinds of relations among events in texts. However, existing methods simply categorize event relations into different classes, which inadequately captures the intrinsic semantics of these relations. To comprehensively understand their intrinsic semantics, in this paper, we obtain prototype representations for each type of event relation and propose a Prototype-Enhanced Matching (ProtoEM) framework for the joint extraction of multiple kinds of event relations. Specifically, ProtoEM extracts event relations in a two-step manner, i.e., prototype representing and prototype matching. In the first step, to capture the connotations of different event relations, ProtoEM utilizes examples to represent the prototypes corresponding to these relations. Subsequently, to capture the interdependence among event relations, it constructs a dependency graph for the prototypes corresponding to these relations and utilizes a Graph Neural Network (GNN)-based module for modeling. In the second step, it obtains the representations of new event pairs and calculates their similarity with the prototypes obtained in the first step to evaluate which types of event relations they belong to. Experimental results on the MAVEN-ERE dataset demonstrate that the proposed ProtoEM framework can effectively represent the prototypes of event relations and achieves a significant improvement over baseline models.
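The prototype-matching step can be pictured as nearest-prototype classification; the sketch below uses cosine similarity, which is an assumption of this illustration rather than the paper's stated metric.

import torch
import torch.nn.functional as F

def prototype_match(pair_repr: torch.Tensor, prototypes: torch.Tensor):
    # pair_repr:  (B, D) representations of candidate event pairs
    # prototypes: (R, D) one prototype per relation type, e.g. built from
    #             example pairs of that relation (sketch of step two)
    sims = F.cosine_similarity(pair_repr.unsqueeze(1),
                               prototypes.unsqueeze(0), dim=-1)  # (B, R)
    return sims.argmax(dim=-1)  # predicted relation type per pair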
https://arxiv.org/abs/2309.12892
Affect recognition, encompassing emotions, moods, and feelings, plays a pivotal role in human communication. In the realm of conversational artificial intelligence (AI), the ability to discern and respond to human affective cues is a critical factor for creating engaging and empathetic interactions. This study delves into the capacity of large language models (LLMs) to recognise human affect in conversations, with a focus on both open-domain chit-chat dialogues and task-oriented dialogues. Leveraging three diverse datasets, namely IEMOCAP, EmoWOZ, and DAIC-WOZ, covering a spectrum of dialogues from casual conversations to clinical interviews, we evaluated and compared LLMs' performance in affect recognition. Our investigation explores the zero-shot and few-shot capabilities of LLMs through in-context learning (ICL) as well as their model capacities through task-specific fine-tuning. Additionally, this study takes into account the potential impact of automatic speech recognition (ASR) errors on LLM predictions. With this work, we aim to shed light on the extent to which LLMs can replicate human-like affect recognition capabilities in conversations.
https://arxiv.org/abs/2309.12881
Existing video captioning approaches typically require first sampling video frames from a decoded video and then conducting a subsequent process (e.g., feature extraction and/or captioning model learning). In this pipeline, manual frame sampling may ignore key information in videos and thus degrade performance. Additionally, redundant information in the sampled frames may result in low efficiency in the inference of video captioning. Addressing this, we study video captioning from a different perspective in the compressed domain, which brings multi-fold advantages over the existing pipeline: 1) Compared to raw images from the decoded video, the compressed video, consisting of I-frames, motion vectors and residuals, is highly distinguishable, which allows us to leverage the entire video for learning without manual sampling through a specialized model design; 2) The captioning model is more efficient in inference as smaller and less redundant information is processed. We propose a simple yet effective end-to-end transformer in the compressed domain for video captioning that enables learning directly from the compressed video. We show that even with a simple design, our method can achieve state-of-the-art performance on different benchmarks while running almost 2x faster than existing approaches. Code is available at this https URL.
https://arxiv.org/abs/2309.12867
Recently, some researchers have started exploring the use of ViTs in tackling hyperspectral image (HSI) classification and achieved remarkable results. However, training ViT models requires a considerable number of training samples, while hyperspectral data, due to its high annotation costs, typically has a relatively small number of training samples. This contradiction has not been effectively addressed. In this paper, aiming to solve this problem, we propose the single-direction tuning (SDT) strategy, which serves as a bridge, allowing us to leverage existing labeled HSI datasets, and even RGB datasets, to enhance performance on new HSI datasets with limited samples. The proposed SDT inherits the idea of prompt tuning, aiming to reuse pre-trained models with minimal modifications for adaptation to new tasks. But unlike prompt tuning, SDT is custom-designed to accommodate the characteristics of HSIs. The proposed SDT utilizes a parallel architecture, an asynchronous cold-hot gradient update strategy, and unidirectional interaction. It aims to fully harness the potent representation learning capabilities derived from training on heterologous, even cross-modal, datasets. In addition, we also introduce a novel Triplet-structured transformer (Tri-Former), where spectral attention and spatial attention modules are merged in parallel to construct the token mixing component, reducing computation cost, and a 3D convolution-based channel mixer module is integrated to enhance stability and preserve structure information. Comparison experiments conducted on three representative HSI datasets captured by different sensors demonstrate that the proposed Tri-Former achieves better performance compared to several state-of-the-art methods. Homologous, heterologous and cross-modal tuning experiments verified the effectiveness of the proposed SDT.
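The abstract does not spell out the asynchronous cold-hot gradient update; one plausible reading, sketched below purely as an assumption, is that the pre-trained ("cold") branch is updated far more slowly than the newly initialized ("hot") branch.

import torch

def make_optimizer(pretrained_params, new_params, base_lr=1e-3, cold_scale=0.1):
    # One plausible reading of the "cold-hot" update (an assumption, not
    # the paper's exact scheme): the pretrained ("cold") branch is updated
    # with a much smaller learning rate than the branch trained from
    # scratch ("hot") on the new HSI dataset.
    return torch.optim.AdamW([
        {"params": pretrained_params, "lr": base_lr * cold_scale},
        {"params": new_params,        "lr": base_lr},
    ])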
https://arxiv.org/abs/2309.12865
Moving beyond the monolithic pairwise attention mechanism of conventional Transformer models, there is growing interest in leveraging sparse interactions that align more closely with biological principles. Approaches including the Set Transformer and the Perceiver employ cross-attention consolidated with a latent space that forms an attention bottleneck with limited capacity. Building upon recent neuroscience studies of Global Workspace Theory and associative memory, we propose the Associative Transformer (AiT). AiT induces low-rank explicit memory that serves both as priors to guide bottleneck attention in the shared workspace and as attractors within the associative memory of a Hopfield network. Through joint end-to-end training, these priors naturally develop module specialization, each contributing a distinct inductive bias to form attention bottlenecks. A bottleneck can foster competition among inputs for writing information into the memory. We show that AiT is a sparse representation learner, learning distinct priors through bottlenecks that are complexity-invariant to input quantities and dimensions. AiT demonstrates its superiority over methods such as the Set Transformer, Vision Transformer, and Coordination in various vision tasks.
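The attractor dynamics referenced here can be illustrated with the retrieval update of a modern continuous Hopfield network, which is standard and shown below; AiT's integration of this memory with bottleneck attention follows the paper.

import torch

def hopfield_retrieve(queries, memory, beta=8.0, steps=1):
    # Update rule of a modern continuous Hopfield network: each query is
    # pulled toward the stored pattern (attractor) it is most similar to.
    # queries: (B, D), memory: (M, D); one step often suffices to converge.
    for _ in range(steps):
        queries = torch.softmax(beta * queries @ memory.T, dim=-1) @ memory
    return queries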
https://arxiv.org/abs/2309.12862
Sequential recommendation (SRS) has recently become the technical foundation of many applications; it aims to recommend the next item based on the user's historical interactions. However, sequential recommendation often faces the problem of data sparsity, which widely exists in recommender systems. Besides, most users only interact with a few items, and existing SRS models often underperform for these users. Such a problem, named the long-tail user problem, is still to be resolved. Data augmentation is a natural way to alleviate these two problems, but existing augmentation methods often need elaborate training strategies or are hindered by poor-quality generated interactions. To address these problems, we propose Diffusion Augmentation for Sequential Recommendation (DiffuASR) for higher-quality generation. The dataset augmented by DiffuASR can be used to train sequential recommendation models directly, free from complex training procedures. To make the best of the generation ability of the diffusion model, we first propose a diffusion-based pseudo sequence generation framework to bridge the gap between image and sequence generation. Then, a sequential U-Net is designed to adapt the diffusion noise prediction model U-Net to the discrete sequence generation task. Finally, we develop two guide strategies to assimilate the preferences of generated and original sequences. To validate the proposed DiffuASR, we conduct extensive experiments on three real-world datasets with three sequential recommendation models. The experimental results illustrate the effectiveness of DiffuASR. To the best of our knowledge, DiffuASR is among the first to introduce the diffusion model to recommendation.
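The diffusion component follows the usual noise-prediction recipe; below is a minimal DDPM-style training-step sketch over embedded pseudo-sequences, with the noise schedule and shapes as assumptions of this illustration.

import torch

def diffusion_step_loss(denoiser, x0, T=1000):
    # Standard DDPM-style noise-prediction loss applied to embedded
    # pseudo-sequences x0 of shape (B, L, D); DiffuASR's sequential U-Net
    # would play the role of `denoiser` (schedule/shapes are assumptions).
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    betas = torch.linspace(1e-4, 0.02, T, device=x0.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(-1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise
    return ((denoiser(x_t, t) - noise) ** 2).mean()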
https://arxiv.org/abs/2309.12858
With the rapid advances in high-throughput sequencing technologies, the focus of survival analysis has shifted from examining clinical indicators to incorporating genomic profiles with pathological images. However, existing methods either directly adopt a straightforward fusion of pathological features and genomic profiles for survival prediction, or take genomic profiles as guidance to integrate the features of pathological images. The former would overlook intrinsic cross-modal correlations. The latter would discard pathological information irrelevant to gene expression. To address these issues, we present a Cross-Modal Translation and Alignment (CMTA) framework to explore the intrinsic cross-modal correlations and transfer potential complementary information. Specifically, we construct two parallel encoder-decoder structures for multi-modal data to integrate intra-modal information and generate cross-modal representation. Taking the generated cross-modal representation to enhance and recalibrate intra-modal representation can significantly improve its discrimination for comprehensive survival analysis. To explore the intrinsic crossmodal correlations, we further design a cross-modal attention module as the information bridge between different modalities to perform cross-modal interactions and transfer complementary information. Our extensive experiments on five public TCGA datasets demonstrate that our proposed framework outperforms the state-of-the-art methods.
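The cross-modal attention bridge can be pictured as standard cross-attention between the token sets of the two modalities; the following minimal sketch (dimensions and residual form are assumptions) conveys the direction of information transfer.

import torch.nn as nn

class CrossModalAttention(nn.Module):
    # Minimal sketch of the information bridge: tokens of one modality
    # (e.g. pathology patches) attend to tokens of the other (e.g. genomic
    # embeddings); swapping the roles gives the reverse direction.
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_tokens, context_tokens):
        out, _ = self.attn(query_tokens, context_tokens, context_tokens)
        return query_tokens + out  # residual transfer of complementary info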
https://arxiv.org/abs/2309.12855
We present CloudGripper, an open-source cloud robotics testbed consisting of a scalable, space- and cost-efficient design constructed as a rack of 32 small robot arm work cells. Each robot work cell is fully enclosed and features individual lighting, a low-cost custom 5-degree-of-freedom Cartesian robot arm with an attached parallel-jaw gripper, and a dual camera setup for experimentation. The system design is focused on continuous operation and features 10 Gbit/s network connectivity, allowing for high-throughput remote-controlled experimentation and data collection for robotic manipulation. CloudGripper is furthermore intended to form a community testbed to study the challenges of large-scale machine learning and cloud and edge computing in the context of robotic manipulation. In this work, we describe the mechanical design of the system and its initial software stack, and evaluate the repeatability of motions executed by the proposed robot arm design. A local network API throughput and latency analysis is also provided. CloudGripper-Rope-100, a dataset of more than a hundred hours of randomized rope pushing interactions and approximately 4 million camera images, is collected and serves as a proof of concept demonstrating data collection capabilities. A project website with more information is available at this https URL.
https://arxiv.org/abs/2309.12786
Large Language Models (LLMs), acting as powerful reasoners and generators, exhibit extraordinary performance across various natural language tasks, such as question answering (QA). Among these tasks, Multi-Hop Question Answering (MHQA) stands as a widely discussed category, necessitating seamless integration between LLMs and the retrieval of external knowledge. Existing methods employ LLMs to generate reasoning paths and plans, and utilize information retrieval (IR) to iteratively retrieve related knowledge, but these approaches have inherent flaws. On one hand, the retriever is hindered by the low quality of the queries generated by the LLM. On the other hand, the LLM is easily misguided by irrelevant knowledge returned by the retriever. These inaccuracies, accumulated through the iterative interaction between IR and LLM, severely degrade final effectiveness. To overcome the above barriers, in this paper we propose a novel pipeline for MHQA called Furthest-Reasoning-with-Plan-Assessment (FuRePA), including an improved framework (Furthest Reasoning) and an attached module (Plan Assessor). 1) Furthest Reasoning operates by masking previous reasoning paths and generated queries from the LLM, encouraging the LLM to generate its chain of thought from scratch in each iteration. This approach enables the LLM to break the shackles built by previous misleading thoughts and queries (if any). 2) The Plan Assessor is a trained evaluator that selects an appropriate plan from a group of candidate plans proposed by the LLM. Our methods are evaluated on three highly recognized public multi-hop question answering datasets and outperform the state of the art on most metrics (achieving a 10%-12% gain in answer accuracy).
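The overall control flow can be sketched as the loop below; the interfaces (propose_plans, select, search) are illustrative assumptions, not the paper's API.

def furthest_reasoning(question, llm, retriever, assessor, max_iters=5):
    # Control-flow sketch with assumed interfaces. The key point: each
    # iteration prompts the LLM with the question and accumulated evidence
    # only -- earlier reasoning paths and queries are masked out, so stale
    # mistakes cannot propagate across iterations.
    evidence = []
    for _ in range(max_iters):
        plans = llm.propose_plans(question, evidence)  # candidate plans
        plan = assessor.select(plans)                  # trained Plan Assessor
        if plan.is_final:
            return plan.answer
        evidence.extend(retriever.search(plan.query))
    return llm.answer(question, evidence)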
https://arxiv.org/abs/2309.12767
Video analysis is a major computer vision task that has received a lot of attention in recent years. The current state-of-the-art performance for video analysis is achieved with Deep Neural Networks (DNNs) that have high computational costs and need large amounts of labeled data for training. Spiking Neural Networks (SNNs) have significantly lower computational costs (thousands of times) than regular non-spiking networks when implemented on neuromorphic hardware. They have been used for video analysis with methods like 3D Convolutional Spiking Neural Networks (3D CSNNs). However, these networks have a significantly larger number of parameters compared with spiking 2D CSNNs. This not only increases the computational costs, but also makes these networks more difficult to implement on neuromorphic hardware. In this work, we use CSNNs trained in an unsupervised manner with the Spike Timing-Dependent Plasticity (STDP) rule, and we introduce, for the first time, Spiking Separated Spatial and Temporal Convolutions (S3TCs) for the sake of reducing the number of parameters required for video analysis. This unsupervised learning has the advantage of not needing large amounts of labeled data for training. Factorizing a single spatio-temporal spiking convolution into a spatial and a temporal spiking convolution decreases the number of parameters of the network. We test our network on the KTH, Weizmann, and IXMAS datasets, and we show that S3TCs successfully extract spatio-temporal information from videos, while increasing the output spiking activity and outperforming spiking 3D convolutions.
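The parameter saving from the factorization is easy to see in the non-spiking analogue below; the spiking neuron dynamics are deliberately omitted from this sketch.

import torch.nn as nn

def separated_st_conv(c_in, c_out, k=3):
    # Non-spiking analogue of the factorization: a (1, k, k) spatial
    # convolution followed by a (k, 1, 1) temporal one. For c_in == c_out
    # this needs (k^2 + k) * c^2 weights versus k^3 * c^2 for a full 3D
    # convolution -- e.g. 12c^2 instead of 27c^2 at k = 3 (biases omitted).
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, (1, k, k), padding=(0, k // 2, k // 2)),
        nn.Conv3d(c_out, c_out, (k, 1, 1), padding=(k // 2, 0, 0)),
    )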
https://arxiv.org/abs/2309.12761
Large language models (LLMs) have had a huge impact on society due to their impressive capabilities and vast knowledge of the world. Various applications and tools have been created that allow users to interact with these models in a black-box scenario. However, one limitation of this scenario is that users cannot modify the internal knowledge of the model, and the only way to add or modify internal knowledge is by explicitly mentioning it to the model during the current interaction. This learning process is called in-context learning, and it refers to learning that is confined to the user's current session or context. In-context learning has significant applications, but also has limitations that are seldom studied. In this paper, we present a study that shows how the model can suffer from interference between pieces of information that continually flow in the context, causing it to forget previously learned knowledge, which can reduce the model's performance. Along with demonstrating the problem, we propose an evaluation benchmark based on the bAbI dataset.
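The interference phenomenon can be probed with a bAbI-style toy example like the following; this is illustrative only, as the paper defines its own benchmark format.

# Toy probe of in-context interference in the spirit of bAbI: facts about
# the same entity keep flowing into the context, and the model must answer
# from the most recent one.
facts = [
    "John moved to the kitchen.",
    "John moved to the garden.",
    "John moved to the office.",
]
prompt = "\n".join(facts) + "\nQuestion: Where is John? Answer:"
# A model suffering from interference may answer with an earlier location
# ("kitchen"/"garden") instead of the latest one ("office").
print(prompt)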
https://arxiv.org/abs/2309.12727
Speech Emotion Recognition (SER) plays a pivotal role in enhancing human-computer interaction by enabling a deeper understanding of emotional states across a wide range of applications, contributing to more empathetic and effective communication. This study proposes an innovative approach that integrates self-supervised feature extraction with supervised classification for emotion recognition from small audio segments. In the preprocessing step, to eliminate the need for crafting audio features, we employed a self-supervised feature extractor, based on the Wav2Vec model, to capture acoustic features from audio data. Then, the output feature maps of the preprocessing step are fed to a custom-designed Convolutional Neural Network (CNN)-based model to perform emotion classification. Using the ShEMO dataset as our testing ground, the proposed method surpasses two baseline methods, i.e., a support vector machine classifier and transfer learning of a pretrained CNN. Comparing the proposed method to state-of-the-art methods on the SER task indicates its superiority. Our findings underscore the pivotal role of deep unsupervised feature learning in elevating the landscape of SER, offering enhanced emotional comprehension in the realm of human-computer interactions.
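A minimal sketch of the two-stage recipe follows; the checkpoint name, dimensions, and six-class ShEMO label set are assumptions of this illustration, not taken from the paper.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

# Frozen self-supervised features feeding a small CNN classifier.
extractor = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
extractor.eval()

classifier = nn.Sequential(
    nn.Conv1d(768, 128, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(128, 6),  # assumed six-class label set for ShEMO
)

def predict(waveform: torch.Tensor) -> torch.Tensor:  # (B, samples) @ 16 kHz
    with torch.no_grad():
        feats = extractor(waveform).last_hidden_state  # (B, T, 768)
    return classifier(feats.transpose(1, 2))           # (B, 6) logits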
https://arxiv.org/abs/2309.12714
Semantic Scene Completion (SSC) aims to jointly generate space occupancies and semantic labels for complex 3D scenes. Most existing SSC models focus on volumetric representations, which are memory-inefficient for large outdoor spaces. Point clouds provide a lightweight alternative but existing benchmarks lack outdoor point cloud scenes with semantic labels. To address this, we introduce PointSSC, the first cooperative vehicle-infrastructure point cloud benchmark for semantic scene completion. These scenes exhibit long-range perception and minimal occlusion. We develop an automated annotation pipeline leveraging Segment Anything to efficiently assign semantics. To benchmark progress, we propose a LiDAR-based model with a Spatial-Aware Transformer for global and local feature extraction and a Completion and Segmentation Cooperative Module for joint completion and segmentation. PointSSC provides a challenging testbed to drive advances in semantic point cloud completion for real-world navigation.
https://arxiv.org/abs/2309.12708
Offline multi-agent reinforcement learning is challenging due to the coupling of the distribution shift issue common in the offline setting and the high-dimensionality issue common in the multi-agent setting, making out-of-distribution (OOD) actions and value overestimation excessively severe. To mitigate this problem, we propose a novel multi-agent offline RL algorithm, named CounterFactual Conservative Q-Learning (CFCQL), to conduct conservative value estimation. Rather than regarding all the agents as a single high-dimensional agent and directly applying single-agent methods, CFCQL calculates conservative regularization for each agent separately in a counterfactual way and then linearly combines them to realize an overall conservative value estimation. We prove that it still enjoys the underestimation property and the performance guarantee of single-agent conservative methods, but the induced regularization and safe policy improvement bound are independent of the agent number, making it theoretically superior to the direct treatment referred to above, especially when the agent number is large. We further conduct experiments in four environments, covering both discrete and continuous action settings, on both existing and our purpose-built datasets, demonstrating that CFCQL outperforms existing methods on most datasets, and even by a remarkable margin on some of them.
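The counterfactual per-agent regularization can be sketched as below: each agent's penalty is computed with the other agents held at their dataset actions, and the per-agent results are combined linearly, so the regularization does not scale with the size of the joint action space. The paper's exact estimator differs in detail; this is an illustration of the idea only.

import torch

def cfcql_penalty(q_fn, obs, joint_actions, sample_ood_action, weights):
    # joint_actions: (B, N, A) dataset actions for N agents.
    # For agent i, only agent i's action is swapped for an OOD sample
    # (e.g. from a uniform or learned policy) while the others stay fixed;
    # the CQL-style gaps are then combined linearly with `weights`.
    total = 0.0
    for i, w in enumerate(weights):
        ood = joint_actions.clone()
        ood[:, i] = sample_ood_action(i, obs)  # counterfactual for agent i
        gap = q_fn(obs, ood) - q_fn(obs, joint_actions)
        total = total + w * gap.mean()
    return total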
https://arxiv.org/abs/2309.12696