Although face analysis has achieved remarkable improvements in the past few years, designing a multi-task face analysis model is still challenging. Most face analysis tasks are studied as separate problems and do not benefit from the synergy among related tasks. In this work, we propose a novel task-adaptive multi-task face analysis method named Q-Face, which simultaneously performs multiple face analysis tasks with a unified model. We fuse the features from multiple layers of a large-scale pre-trained model so that the whole model can use both local and global facial information to support multiple tasks. Furthermore, we design a task-adaptive module that performs cross-attention between a set of query vectors and the fused multi-stage features, adaptively extracting the desired features for each face analysis task. Extensive experiments show that our method can perform multiple tasks simultaneously and achieves state-of-the-art performance on facial expression recognition, action unit detection, face attribute analysis, age estimation, and face pose estimation. Compared to conventional methods, our method opens up new possibilities for multi-task face analysis and shows strong potential in both accuracy and efficiency.
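To make the task-adaptive module concrete, here is a minimal sketch in PyTorch; it is not the authors' code, and the sizes and pooling are illustrative assumptions. It shows only the cross-attention pattern the abstract describes: learnable per-task query vectors attending to fused multi-stage features.

    # Minimal sketch (not the Q-Face release): per-task learnable queries
    # cross-attend to fused multi-stage backbone features.
    import torch
    import torch.nn as nn

    class TaskAdaptiveModule(nn.Module):
        def __init__(self, dim=768, num_tasks=5, queries_per_task=4, heads=8):
            super().__init__()
            # One small set of learnable query vectors per face analysis task.
            self.queries = nn.Parameter(torch.randn(num_tasks, queries_per_task, dim))
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, fused_feats, task_id):
            # fused_feats: (B, N, dim) tokens fused from multiple backbone stages.
            B = fused_feats.size(0)
            q = self.queries[task_id].unsqueeze(0).expand(B, -1, -1)
            out, _ = self.cross_attn(q, fused_feats, fused_feats)
            return out.mean(dim=1)  # (B, dim) feature tailored to this task

    feats = torch.randn(2, 196, 768)                   # e.g. fused ViT patch tokens
    task_feat = TaskAdaptiveModule()(feats, task_id=0)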
https://arxiv.org/abs/2405.09059
The detection and tracking of small targets in passive optical remote sensing (PORS) have broad applications. However, most previously proposed methods seldom utilize the abundant temporal features formed by target motion, resulting in poor detection and tracking performance for low signal-to-clutter ratio (SCR) targets. In this article, we analyze the difficulty of realizing effective detection with spatial features alone and the feasibility of doing so with temporal features. Based on this analysis, we use multiple frames as a detection unit and propose a detection method based on temporal energy selective scaling (TESS). Specifically, we investigate the composition of the intensity temporal profiles (ITPs) formed by the pixels of a multi-frame detection unit. For a target-present pixel, the target passing through the pixel introduces a weak transient disturbance into the ITP and changes its statistical properties. We use a well-designed function to amplify this transient disturbance, suppress the background and noise components, and output the trajectory of the target on the multi-frame detection unit. Subsequently, to resolve the trade-off between detection rate and false alarm rate inherent in traditional threshold segmentation, we associate the temporal and spatial features of the output trajectory and propose a trajectory extraction method based on the 3D Hough transform. Finally, we model the trajectory of the target and propose a trajectory-based multi-target tracking method. Experiments in multiple scenarios demonstrate the superiority of our proposed methods over various state-of-the-art detection and tracking methods.
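The abstract does not give the exact form of the well-designed function, so the following is only a schematic sketch of the ITP idea under assumed choices (median/MAD normalization followed by an odd-power scaling): amplify brief transient disturbances on each pixel's profile while suppressing slowly varying background and sub-noise fluctuations.

    # Schematic sketch (assumed form, not the paper's exact TESS function):
    # score each pixel's intensity temporal profile (ITP) so that a brief
    # target transit is amplified and slowly varying background is suppressed.
    import numpy as np

    def tess_energy_map(cube, p=3):
        # cube: (T, H, W) multi-frame detection unit.
        med = np.median(cube, axis=0)                       # per-pixel background level
        mad = np.median(np.abs(cube - med), axis=0) + 1e-6  # robust noise scale
        z = (cube - med) / mad                              # transient disturbance on each ITP
        # Selective scaling: an odd power keeps the sign, amplifies strong
        # deviations, and shrinks sub-noise fluctuations; the max over time
        # keeps the transit trace.
        return np.max(np.sign(z) * np.abs(z) ** p, axis=0)  # (H, W) trajectory energy

    frames = np.random.randn(32, 64, 64)  # synthetic stand-in for a PORS clip
    energy = tess_energy_map(frames)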
https://arxiv.org/abs/2405.09054
"How does the person in the bounding box feel?" Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision. Facial expressions are not enough: body pose, contextual knowledge, and commonsense reasoning all contribute to how humans perform this emotional theory of mind task. In this paper, we examine two major approaches enabled by recent large vision language models: 1) image captioning followed by a language-only LLM, and 2) vision language models, under zero-shot and fine-tuned setups. We evaluate the methods on the Emotions in Context (EMOTIC) dataset and demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines. The results of this work aim to help robots and agents perform emotionally sensitive decision-making and interaction in the future.
https://arxiv.org/abs/2405.08992
Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our innovative approach for creating a question-answer dataset, utilizing advanced LLMs with human-in-the-loop and building upon human-generated raw data. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene. Additionally, we evaluate recent video-centric LLMs, both open-source and proprietary, on the test split of our dataset. The findings reveal that even state-of-the-art video-centric LLMs significantly lag behind human performance in these tasks, highlighting the complexity and challenge inherent in video understanding. The dataset is available at this https URL
https://arxiv.org/abs/2405.08813
Navigating the complex landscape of news articles involves understanding the various actors or entities involved, referred to as news stakeholders. These stakeholders, ranging from policymakers to opposition figures, citizens, and more, play pivotal roles in shaping news narratives. Recognizing their stakeholder types, reflecting their roles, political alignments, social standing, and more, is paramount for a nuanced comprehension of news content. Despite existing works focusing on salient entity extraction, coverage variations, and political affiliations through social media data, the automated detection of stakeholder roles within news content remains an underexplored domain. In this paper, we bridge this gap by introducing an effective approach to classify stakeholder types in news articles. Our method involves transforming the stakeholder classification problem into a natural language inference task, utilizing contextual information from news articles and external knowledge to enhance the accuracy of stakeholder type detection. Moreover, our proposed model showcases efficacy in zero-shot settings, further extending its applicability to diverse news contexts.
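As a rough illustration of casting stakeholder typing as natural language inference, the sketch below uses a generic NLI-based zero-shot pipeline; the label set, hypothesis template, and model choice here are illustrative assumptions, not the paper's exact setup.

    # Minimal sketch of stakeholder typing as NLI via a generic zero-shot
    # (entailment-based) pipeline; labels and template are illustrative.
    from transformers import pipeline

    nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

    context = ("The minister defended the bill in parliament, while opposition "
               "leaders and citizen groups protested outside.")
    stakeholder = "the minister"
    labels = ["policymaker", "opposition figure", "citizen", "expert"]

    result = nli(
        f"Stakeholder: {stakeholder}. Context: {context}",
        candidate_labels=labels,
        hypothesis_template="This stakeholder is a {}.",
    )
    print(result["labels"][0])  # most entailed stakeholder type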
https://arxiv.org/abs/2405.08751
Non-prehensile manipulation enables fast interactions with objects by circumventing the need to grasp and ungrasp, and it allows handling objects that cannot be grasped through force closure. Current approaches to non-prehensile manipulation focus on static contacts, avoiding the underactuation that comes with sliding. However, the ability to control sliding contact, essentially removing the no-slip constraint, opens up new possibilities in dynamic manipulation. In this paper, we explore a challenging dynamic non-prehensile manipulation task that requires the consideration of the full spectrum of hybrid contact modes. We leverage recent methods in contact-implicit MPC to handle the multi-modal planning aspect of the task. We demonstrate, with careful consideration of the integration between the simple model used for MPC and the low-level tracking controller, how contact-implicit MPC can be adapted to dynamic tasks. Surprisingly, despite the known inaccuracies of frictional rigid contact models, our method is able to react to these inaccuracies while still quickly performing the task. Moreover, we do not use common aids such as reference trajectories or motion primitives, highlighting the generality of our approach. To the best of our knowledge, this is the first application of contact-implicit MPC to a dynamic manipulation task in three dimensions.
https://arxiv.org/abs/2405.08731
Low-resource information extraction remains an ongoing challenge due to the inherent information scarcity within limited training examples. Existing data augmentation methods, considered potential solutions, struggle to strike a balance between weak augmentation (e.g., synonym augmentation) and drastic augmentation (e.g., conditional generation without proper guidance). This paper introduces a novel paradigm that employs targeted augmentation and back validation to produce augmented examples with enhanced diversity, polarity, accuracy, and coherence. Extensive experimental results demonstrate the effectiveness of the proposed paradigm. Furthermore, identified limitations are discussed, shedding light on areas for future improvement.
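A minimal sketch of the back-validation idea, with all function names as placeholders: an augmented example is kept only if a labeler trained on the original data still recovers the original label with sufficient confidence.

    # Schematic sketch of back validation (all callables are placeholders):
    # keep an augmented example only if a model trained on the original data
    # still recovers the original label from the rewritten text.
    def back_validate(examples, augment, label_fn, keep_threshold=0.9):
        kept = []
        for text, label in examples:
            new_text = augment(text, label)      # targeted augmentation
            pred, conf = label_fn(new_text)      # e.g. an IE model on original data
            if pred == label and conf >= keep_threshold:
                kept.append((new_text, label))   # accurate and coherent enough
        return kept

    kept = back_validate(
        [("Alice joined Acme in 2020.", "EMPLOYED_AT")],
        augment=lambda t, l: t.replace("2020", "2021"),  # toy rewrite
        label_fn=lambda t: ("EMPLOYED_AT", 0.95),        # toy validator
    )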
https://arxiv.org/abs/2405.08729
Ensuring safety and adapting to the user's behavior are of paramount importance in physical human-robot interaction. Thus, incorporating elastic actuators in the robot's mechanical design has become popular, since it offers intrinsic compliance and additionally provides a coarse estimate of the interaction force by measuring the deformation of the elastic components. While observer-based methods have been shown to improve these estimates, they rely on accurate models of the system, which are challenging to obtain in complex operating environments. In this work, we overcome this issue by learning the unknown dynamics components using Gaussian process (GP) regression. By employing the learned model in a Bayesian filtering framework, we improve the estimation accuracy and additionally obtain an observer that explicitly considers local model uncertainty in the confidence measure of the state estimate. Furthermore, we derive guaranteed estimation error bounds, thus facilitating its use in safety-critical applications. We demonstrate the effectiveness of the proposed approach experimentally in a human-exoskeleton interaction scenario.
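As a toy illustration of the learning step (not the paper's exoskeleton setup), the sketch below fits a GP to synthetic dynamics residuals and reads out the predictive standard deviation that a Bayesian filter could use as a local-uncertainty-aware confidence.

    # Minimal sketch (assumed setup): learn unknown dynamics residuals with
    # GP regression and read out a state-dependent confidence.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    # X: measured states (e.g. joint angle, velocity); y: model-minus-measured
    # torque residual. Both are synthetic stand-ins here.
    X = np.random.uniform(-1, 1, size=(200, 2))
    y = np.sin(3 * X[:, 0]) * X[:, 1] + 0.05 * np.random.randn(200)

    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gp.fit(X, y)

    mean, std = gp.predict(np.array([[0.2, -0.5]]), return_std=True)
    # `mean` corrects the nominal model inside the Bayesian filter; `std` feeds
    # the local-uncertainty-aware confidence of the state estimate.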
https://arxiv.org/abs/2405.08711
Addressing multi-label action recognition in videos represents a significant challenge for robotic applications in dynamic environments, especially when the robot is required to cooperate with humans in tasks that involve objects. Existing methods still struggle to recognize unseen actions or require extensive training data. To overcome these problems, we propose Dual-VCLIP, a unified approach for zero-shot multi-label action recognition. Dual-VCLIP enhances VCLIP, a zero-shot action recognition method, with the DualCoOp method for multi-label image classification. The strength of our method is that at training time it only learns two prompts, making it much simpler than other methods. We validate our method on the Charades dataset, which includes a majority of object-based actions, demonstrating that -- despite its simplicity -- our method performs favorably with respect to existing methods on the complete dataset and achieves promising performance when tested on unseen actions. Our contribution emphasizes the impact of verb-object class-splits during robots' training for new cooperative tasks, highlighting their influence on performance and giving insights into mitigating biases.
https://arxiv.org/abs/2405.08695
The nature of diversity in real-world environments necessitates neural network models to expand from closed category settings to accommodate novel emerging categories. In this paper, we study open-vocabulary object detection (OVD), which facilitates the detection of novel object classes under the supervision of only base annotations and open-vocabulary knowledge. However, we find that the inadequacy of neighboring relationships between regions during the alignment process inevitably constrains the performance of recent distillation-based OVD strategies. To this end, we propose Neighboring Region Attention Alignment (NRAA), which performs alignment within the attention mechanism of a set of neighboring regions to boost open-vocabulary inference. Specifically, for a given proposal region, we randomly explore the neighboring boxes and conduct our proposed neighboring region attention (NRA) mechanism to extract relationship information. Then, this interaction information is fed into the distillation procedure to assist the alignment between the detector and the pre-trained vision-language models (VLMs). Extensive experiments validate that our proposed model exhibits superior performance on open-vocabulary benchmarks.
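A minimal sketch, under assumed shapes, of the attention pattern the abstract describes: a proposal feature attends to itself and its randomly explored neighboring RoI features, and the result is aligned with a VLM region embedding (a random stand-in here).

    # Minimal sketch (not the NRAA release): attention over a proposal and
    # its neighboring boxes before alignment with a frozen VLM embedding.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NeighborRegionAttention(nn.Module):
        def __init__(self, dim=512, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, proposal, neighbors):
            # proposal: (B, dim); neighbors: (B, K, dim) RoI features of nearby boxes.
            q = proposal.unsqueeze(1)
            ctx = torch.cat([q, neighbors], dim=1)  # proposal attends to itself
            out, _ = self.attn(q, ctx, ctx)         # and to its neighbors
            return out.squeeze(1)                   # relation-enriched feature

    nra = NeighborRegionAttention()
    feat = nra(torch.randn(4, 512), torch.randn(4, 5, 512))
    vlm_feat = torch.randn(4, 512)                  # stand-in VLM region embedding
    align_loss = 1 - F.cosine_similarity(feat, vlm_feat).mean()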
https://arxiv.org/abs/2405.08593
Task and Motion Planning (TAMP) algorithms solve long-horizon robotics tasks by integrating task planning with motion planning; the task planner proposes a sequence of actions towards a goal state and the motion planner verifies whether this action sequence is geometrically feasible for the robot. However, state-of-the-art TAMP algorithms do not scale well with the difficulty of the task and require an impractical amount of time to solve relatively small problems. We propose Constraints and Streams for Task and Motion Planning (COAST), a probabilistically-complete, sampling-based TAMP algorithm that combines stream-based motion planning with an efficient, constrained task planning strategy. We validate COAST on three challenging TAMP domains and demonstrate that our method outperforms baselines in terms of cumulative task planning time by an order of magnitude. You can find more supplementary materials on our project \href{this https URL}{website}.
https://arxiv.org/abs/2405.08572
In visual tasks, large teacher models capture essential features and deep information, enhancing performance. However, distilling this information into smaller student models often leads to performance loss due to structural differences and capacity limitations. To tackle this, we propose a distillation framework based on graph knowledge, including a multi-level feature alignment strategy and an attention-guided mechanism that provides a targeted learning trajectory for the student model. We emphasize spectral embedding (SE) as a key technique in our distillation process, which merges the student's feature space with relational knowledge and structural complexities similar to those of the teacher network. This method captures the teacher's understanding in a graph-based representation, enabling the student model to more accurately mimic the complex structural dependencies present in the teacher model. Compared to methods that focus only on specific distillation areas, our strategy not only considers key features within the teacher model but also captures the relationships and interactions among feature sets, encoding this complex information into a graph structure so that the dynamic relationships among features can be understood and utilized from a global perspective. Experiments show that our method outperforms previous feature distillation methods on the CIFAR-100, MS-COCO, and Pascal VOC datasets, proving its efficiency and applicability.
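As a rough, non-differentiable illustration of spectral embedding as a relational quantity (a training loss would need a differentiable variant), the sketch below embeds teacher and student batch features on their affinity graphs and compares the embedded pairwise structure; Gram matrices are compared to sidestep eigenvector sign and rotation ambiguity.

    # Schematic sketch of graph-based relational comparison via spectral
    # embedding (SE); illustrative only, not the paper's training loss.
    import numpy as np
    from sklearn.manifold import SpectralEmbedding

    def relational_gap(f_teacher, f_student, k=8):
        se = SpectralEmbedding(n_components=k, affinity="rbf")
        zt = se.fit_transform(f_teacher)   # (N, k) teacher graph embedding
        zs = se.fit_transform(f_student)   # (N, k) student graph embedding
        gt, gs = zt @ zt.T, zs @ zs.T      # pairwise relational structure
        return np.mean((gt - gs) ** 2)

    gap = relational_gap(np.random.randn(64, 256), np.random.randn(64, 128))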
https://arxiv.org/abs/2405.08547
Conversation requires a substantial amount of coordination between dialogue participants, from managing turn taking to negotiating mutual understanding. Part of this coordination effort surfaces as the reuse of linguistic behaviour across speakers, a process often referred to as alignment. While the presence of linguistic alignment is well documented in the literature, several questions remain open, including the extent to which patterns of reuse across speakers have an impact on the emergence of labelling conventions for novel referents. In this study, we put forward a methodology for automatically detecting shared lemmatised constructions -- expressions with a common lexical core used by both speakers within a dialogue -- and apply it to a referential communication corpus where participants aim to identify novel objects for which no established labels exist. Our analyses uncover the usage patterns of shared constructions in interaction and reveal that features such as their frequency and the amount of different constructions used for a referent are associated with the degree of object labelling convergence the participants exhibit after social interaction. More generally, the present study shows that automatically detected shared constructions offer a useful level of analysis to investigate the dynamics of reference negotiation in dialogue.
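A minimal sketch of detecting shared lemmatised constructions as lemma n-grams produced by both speakers within a dialogue; the lemmatizer here is a crude stand-in for a proper NLP pipeline.

    # Minimal sketch: shared lemmatised constructions as lemma n-grams that
    # both speakers produce. The lemmatizer is a deliberately crude stand-in.
    from collections import Counter

    def lemmas(utterance):
        return [w.lower().rstrip("s") for w in utterance.split()]  # crude stand-in

    def shared_constructions(turns_a, turns_b, n=2):
        def ngrams(turns):
            grams = Counter()
            for t in turns:
                ls = lemmas(t)
                grams.update(tuple(ls[i:i + n]) for i in range(len(ls) - n + 1))
            return grams
        a, b = ngrams(turns_a), ngrams(turns_b)
        # Shared core: used by both speakers; value = total joint frequency.
        return {g: a[g] + b[g] for g in a.keys() & b.keys()}

    print(shared_constructions(["the spiky round one"], ["yes the spiky one"]))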
https://arxiv.org/abs/2405.08546
The IoT and Business Process Management (BPM) communities co-exist in many shared application domains, such as manufacturing and healthcare. The IoT community has a strong focus on hardware, connectivity and data; the BPM community focuses mainly on finding, controlling, and enhancing the structured interactions among the IoT devices in processes. While the field of Process Mining deals with the extraction of process models and process analytics from process event logs, the data produced by IoT sensors often is at a lower granularity than these process-level events. The fundamental questions about extracting and abstracting process-related data from streams of IoT sensor values are: (1) Which sensor values can be clustered together as part of process events?, (2) Which sensor values signify the start and end of such events?, (3) Which sensor values are related but not essential? This work proposes a framework to semi-automatically perform a set of structured steps to convert low-level IoT sensor data into higher-level process events that are suitable for process mining. The framework is meant to provide a generic sequence of abstract steps to guide the event extraction, abstraction, and correlation, with variation points for plugging in specific analysis techniques and algorithms for each step. To assess the completeness of the framework, we present a set of challenges, how they can be tackled through the framework, and an example on how to instantiate the framework in a real-world demonstration from the field of smart manufacturing. Based on this framework, future research can be conducted in a structured manner through refining and improving individual steps.
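As one hypothetical instantiation of the extraction step (the framework itself leaves the concrete technique as a plug-in), the sketch below segments a single sensor stream into start/end-delimited candidate events by simple activity thresholding, loosely mirroring questions (1)-(3).

    # Schematic sketch of one plug-in for the extraction step; thresholds
    # and grouping are illustrative, not prescribed by the framework.
    def extract_events(samples, on_threshold, min_len=3):
        # samples: list of (timestamp, value) from one IoT sensor.
        events, current = [], []
        for ts, value in samples:
            if value >= on_threshold:          # (1)+(2): activity marks event body
                current.append((ts, value))
            elif current:
                if len(current) >= min_len:    # (3): too-short bursts are related
                    events.append({"start": current[0][0],   # but not essential
                                   "end": current[-1][0],
                                   "peak": max(v for _, v in current)})
                current = []
        return events  # higher-level events suitable for process mining

    stream = [(t, 1.0 if 5 <= t <= 12 else 0.1) for t in range(20)]
    print(extract_events(stream, on_threshold=0.5))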
https://arxiv.org/abs/2405.08528
Traditional recommendation approaches, including content-based and collaborative filtering, usually focus on similarity between items or users. Existing approaches lack ways of introducing unexpectedness into recommendations, prioritizing globally popular items over exposing users to unforeseen items. This investigation aims to design and evaluate a novel layer on top of recommender systems, suited to incorporate relational information and suggest items with a user-defined degree of surprise. We propose a Knowledge Graph (KG) based recommender system built by encoding user interactions on item catalogs. Our study explores whether network-level metrics on KGs can influence the degree of surprise in recommendations. We hypothesize that surprisingness correlates with certain network metrics, treating user profiles as subgraphs within a larger catalog KG. The achieved solution reranks recommendations based on their impact on structural graph metrics. Our research contributes to optimizing recommendations to reflect these metrics. We experimentally evaluate our approach on two datasets of LastFM listening histories and synthetic Netflix viewing profiles. We find that reranking items based on complex network metrics leads to a more unexpected and surprising composition of recommendation lists.
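As a toy illustration of reranking by a structural graph metric (the metric and blending rule here are illustrative choices, not the paper's), the sketch below scores candidates by their KG distance from the user's profile subgraph, so a higher user-chosen surprise degree favors structurally distant items.

    # Toy sketch: rerank candidates by distance in the catalog KG from the
    # user's profile subgraph; metric and blending rule are illustrative.
    import networkx as nx

    kg = nx.Graph([("rock", "indie"), ("indie", "folk"), ("folk", "jazz"),
                   ("jazz", "bebop"), ("rock", "metal")])
    profile = {"rock", "indie"}                   # user's interaction subgraph

    def surprise(item):
        return min(nx.shortest_path_length(kg, item, p) for p in profile)

    def rerank(candidates, alpha):
        # alpha in [0, 1]: 0 favors familiar items, 1 favors unexpected ones.
        return sorted(candidates, key=lambda it: (2 * alpha - 1) * surprise(it),
                      reverse=True)

    print(rerank(["metal", "folk", "bebop"], alpha=0.9))  # bebop first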
https://arxiv.org/abs/2405.08465
When studying political communication, combining the information from text, audio, and video signals promises to reflect the richness of human communication more comprehensively than confining it to individual modalities alone. However, when modeling such multimodal data, its heterogeneity, connectedness, and interaction are challenging to address. We argue that aligning the respective modalities can be an essential step in fully exploiting the potential of multimodal data, because it informs the model with human understanding. Exploring aligned modalities unlocks promising analytical leverage. First, it allows us to make the most of the information in the data, which inter alia opens the door to better-quality predictions. Second, it makes it possible to answer research questions that span multiple modalities with cross-modal queries. Finally, alignment addresses concerns about model interpretability. We illustrate the utility of this approach by analyzing how German MPs address members of the far-right AfD in their speeches, and by predicting the tone of video advertising in the context of the 2020 US presidential race. Our paper offers important insights to all keen to analyze multimodal data effectively.
https://arxiv.org/abs/2405.08454
Robust road surface estimation is required for autonomous ground vehicles to navigate safely. Although it has become one of the main targets of autonomous mobility research in recent years, it remains an open problem for which cameras and LiDAR sensors have proven adequate to predict the position, size, and shape of the road a vehicle is driving on in different environments. In this work, a novel Convolutional Neural Network model is proposed for the accurate estimation of the roadway surface. Furthermore, an ablation study has been conducted to investigate how different encoding strategies affect model performance, testing six slightly different neural network architectures. Our model is based on a Twin Encoder-Decoder Neural Network (TEDNet) for independent camera and LiDAR feature extraction, and has been trained and evaluated on the Kitti-Road dataset. Bird's-eye-view projections of the camera and LiDAR data are used in this model to perform semantic segmentation of whether each pixel belongs to the road surface. The proposed method performs on par with other state-of-the-art methods and operates at the same frame rate as the LiDAR and cameras, making it suitable for real-time applications.
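A minimal sketch of the twin-encoder idea with illustrative layer sizes, not the TEDNet architecture itself: camera and LiDAR BEV inputs are encoded independently, fused, and decoded into per-pixel road logits.

    # Minimal sketch (sizes illustrative, not TEDNet): twin encoders over
    # camera and LiDAR BEV inputs, fused and decoded to road logits.
    import torch
    import torch.nn as nn

    def enc(in_ch):
        return nn.Sequential(nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                             nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())

    class TwinEncoderDecoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.cam_enc, self.lidar_enc = enc(3), enc(1)
            self.dec = nn.Sequential(
                nn.ConvTranspose2d(128, 32, 2, stride=2), nn.ReLU(),
                nn.ConvTranspose2d(32, 1, 2, stride=2))  # per-pixel road logits

        def forward(self, cam_bev, lidar_bev):
            fused = torch.cat([self.cam_enc(cam_bev), self.lidar_enc(lidar_bev)], dim=1)
            return self.dec(fused)  # (B, 1, H, W)

    logits = TwinEncoderDecoder()(torch.randn(1, 3, 128, 128),
                                  torch.randn(1, 1, 128, 128))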
https://arxiv.org/abs/2405.08429
Stereo image super-resolution (SR) refers to the reconstruction of a high-resolution (HR) image from a pair of low-resolution (LR) images, as typically captured by a dual-camera device. To enhance the quality of SR images, most previous studies focused on increasing the number and size of feature maps and introducing complex and computationally intensive structures, resulting in models with high computational complexity. Here, we propose a simple yet efficient stereo image SR model called NAFRSSR, which modifies the previous state-of-the-art model NAFSSR by introducing recursive connections and lightweighting the constituent modules. Our NAFRSSR model is composed of nonlinear activation free and group convolution-based blocks (NAFGCBlocks) and depth-separated stereo cross attention modules (DSSCAMs). The NAFGCBlock improves feature extraction and reduces the number of parameters by removing the simple channel attention mechanism from NAFBlock and using group convolution. The DSSCAM enhances feature fusion and reduces the number of parameters by replacing the 1x1 pointwise convolution in SCAM with weight-shared 3x3 depthwise convolution. Besides, we propose to incorporate a trainable edge detection operator into NAFRSSR to further improve the model performance. Four variants of NAFRSSR with different sizes, namely NAFRSSR-Mobile (NAFRSSR-M), NAFRSSR-Tiny (NAFRSSR-T), NAFRSSR-Super (NAFRSSR-S), and NAFRSSR-Base (NAFRSSR-B), are designed, and they all exhibit fewer parameters, higher PSNR/SSIM, and faster speed than the previous state-of-the-art models. In particular, to the best of our knowledge, NAFRSSR-M is the lightest (0.28M parameters) and fastest (50 ms inference time) model achieving an average PSNR/SSIM as high as 24.657 dB/0.7622 on the benchmark datasets. Codes and models will be released at this https URL.
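A minimal sketch of the DSSCAM projection change described above: a single 3x3 depthwise convolution, shared between the left and right views, in place of per-view 1x1 pointwise convolutions.

    # Minimal sketch of the weight-shared 3x3 depthwise projection that
    # replaces SCAM's 1x1 pointwise convolutions.
    import torch
    import torch.nn as nn

    class SharedDepthwiseProj(nn.Module):
        def __init__(self, channels):
            super().__init__()
            # groups=channels => depthwise; one instance reused => weight-shared.
            self.dw = nn.Conv2d(channels, channels, kernel_size=3,
                                padding=1, groups=channels, bias=False)

        def forward(self, left, right):
            return self.dw(left), self.dw(right)

    # At 64 channels: 64*9 = 576 depthwise weights, shared across both views,
    # versus 64*64 = 4096 weights for a single 1x1 pointwise projection.
    proj = SharedDepthwiseProj(64)
    l, r = proj(torch.randn(1, 64, 30, 90), torch.randn(1, 64, 30, 90))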
https://arxiv.org/abs/2405.08423
In the training process of Deep Reinforcement Learning (DRL), agents require repetitive interactions with the environment. With increasing training volume and model complexity, enhancing the data utilization and explainability of DRL training remains a challenging problem. This paper addresses these challenges by focusing on the temporal correlations within the time dimension of time series. We propose a novel approach to segment multivariate time series into meaningful subsequences and represent the time series based on these subsequences. Furthermore, the subsequences are employed for causal inference to identify fundamental causal factors that significantly impact training outcomes. We design a module to provide feedback on the causality during DRL training. Several experiments demonstrate the feasibility of our approach in common environments, confirming its ability to enhance the effectiveness of DRL training and to impart a certain level of explainability to the training process. Additionally, we extend our approach with a prioritized experience replay algorithm, and experimental results demonstrate its continued effectiveness.
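As a schematic sketch of the segmentation step (the cutting rule here is an assumption, not the paper's), the code below splits a multivariate series into subsequences at sharp windowed-mean shifts; each subsequence is the unit that would feed the downstream causal inference.

    # Schematic sketch (assumed segmentation rule): cut a multivariate time
    # series into subsequences wherever the windowed mean shifts sharply.
    import numpy as np

    def segment(series, win=10, thresh=1.0):
        # series: (T, D) multivariate trajectory from DRL interactions.
        cuts = [0]
        for t in range(win, len(series) - win):
            left = series[t - win:t].mean(axis=0)
            right = series[t:t + win].mean(axis=0)
            if np.linalg.norm(right - left) > thresh and t - cuts[-1] >= win:
                cuts.append(t)
        cuts.append(len(series))
        # Each subsequence becomes one unit for downstream causal inference.
        return [series[a:b] for a, b in zip(cuts, cuts[1:])]

    parts = segment(np.concatenate([np.zeros((50, 3)), np.ones((50, 3)) * 3]))
    print([len(p) for p in parts])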
https://arxiv.org/abs/2405.08380
Current architectures for video understanding mainly build upon 3D convolutional blocks or 2D convolutions with additional operations for temporal modeling. However, these methods all regard the temporal axis as a separate dimension of the video sequence, which requires large computation and memory budgets and thus limits their usage on mobile devices. In this paper, we propose to squeeze the time axis of a video sequence into the channel dimension and present a lightweight video recognition network, termed \textit{SqueezeTime}, for mobile video understanding. To enhance the temporal modeling capability of the proposed network, we design a Channel-Time Learning (CTL) Block to capture the temporal dynamics of the sequence. This module has two complementary branches: one branch is for temporal importance learning, and the other, with temporal position restoring capability, enhances inter-temporal object modeling ability. The proposed SqueezeTime is highly lightweight and fast, with high accuracy for mobile video understanding. Extensive experiments on various video recognition and action detection benchmarks, i.e., Kinetics400, Kinetics600, HMDB51, AVA2.1, and THUMOS14, demonstrate the superiority of our model. For example, our SqueezeTime achieves a $+1.2\%$ accuracy gain and $+80\%$ higher GPU throughput on Kinetics400 than prior methods. Codes are publicly available at this https URL and this https URL.
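The core squeeze operation is easy to illustrate; the sketch below shows only the time-to-channel reshape (the CTL block's two branches are not reproduced).

    # Minimal sketch of the core idea: fold the time axis of a clip into the
    # channel dimension so that plain 2D convolutions see temporal context.
    import torch
    import torch.nn as nn

    video = torch.randn(2, 3, 16, 224, 224)  # (B, C, T, H, W)
    B, C, T, H, W = video.shape
    x = video.reshape(B, C * T, H, W)        # squeeze time into channels

    conv2d = nn.Conv2d(C * T, 64, kernel_size=3, padding=1)
    feats = conv2d(x)                        # every 2D conv now spans all frames
    print(feats.shape)                       # torch.Size([2, 64, 224, 224])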
https://arxiv.org/abs/2405.08344