Model evaluations are central to understanding the safety, risks, and societal impacts of AI systems. While most real-world AI applications involve human-AI interaction, most current evaluations (e.g., common benchmarks) of AI models do not. Instead, they incorporate human factors in limited ways, assessing the safety of models in isolation, thereby falling short of capturing the complexity of human-model interactions. In this paper, we discuss and operationalize a definition of an emerging category of evaluations -- "human interaction evaluations" (HIEs) -- which focus on the assessment of human-model interactions or the process and the outcomes of humans using models. First, we argue that HIEs can be used to increase the validity of safety evaluations, assess direct human impact and interaction-specific harms, and guide future assessments of models' societal impact. Second, we propose a safety-focused HIE design framework -- containing a human-LLM interaction taxonomy -- with three stages: (1) identifying the risk or harm area, (2) characterizing the use context, and (3) choosing the evaluation parameters. Third, we apply our framework to two potential evaluations for overreliance and persuasion risks. Finally, we conclude with tangible recommendations for addressing concerns over costs, replicability, and unrepresentativeness of HIEs.
https://arxiv.org/abs/2405.10632
Large Language Models (LLMs) have gained significant attention in the field of natural language processing (NLP) due to their wide range of applications. However, training LLMs for languages other than English poses significant challenges, owing to the difficulty of acquiring a large-scale corpus and the requisite computing resources. In this paper, we propose ChatFlow, a cross-language transfer-based LLM, to address these challenges and train large Chinese language models in a cost-effective manner. We employ a mix of Chinese, English, and parallel corpora to continuously train the LLaMA2 model, aiming to align cross-language representations and facilitate knowledge transfer specifically to the Chinese language model. In addition, we use a dynamic data sampler to progressively transition the model from unsupervised pre-training to supervised fine-tuning. Experimental results demonstrate that our approach accelerates model convergence and achieves superior performance. We evaluate ChatFlow on popular Chinese and English benchmarks; the results indicate that it outperforms other Chinese models post-trained on LLaMA-2-7B.
https://arxiv.org/abs/2405.10626
Temporal Knowledge Graph (TKG) reasoning focuses on predicting events from historical information within snapshots distributed along a timeline. Existing studies mainly leverage the history of TKGs from two perspectives: capturing the evolution of each recent snapshot, or modeling correlations among global historical facts. Despite these significant accomplishments, such models still fall short of (1) investigating the influence of multi-granularity interactions across recent snapshots and (2) harnessing the expressive semantics of significant links aligned with queries throughout the entire history, especially events exerting a profound impact on the future. These inadequacies restrict the representations' ability to thoroughly reflect historical dependencies and future trends. To overcome these drawbacks, we propose an innovative TKG reasoning approach towards \textbf{His}torically \textbf{R}elevant \textbf{E}vents \textbf{S}tructuring ($\mathsf{HisRES}$). Concretely, $\mathsf{HisRES}$ comprises two distinctive modules excelling at structuring historically relevant events within TKGs: a multi-granularity evolutionary encoder that captures structural and temporal dependencies of the most recent snapshots, and a global relevance encoder that concentrates on crucial correlations among events relevant to queries from the entire history. Furthermore, $\mathsf{HisRES}$ incorporates a self-gating mechanism for adaptively merging multi-granularity recent and historically relevant structuring representations. Extensive experiments on four event-based benchmarks demonstrate the state-of-the-art performance of $\mathsf{HisRES}$ and indicate the superiority and effectiveness of structuring historical relevance for TKG reasoning.
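The abstract does not spell out the self-gating merge; a common formulation, sketched here with our own (hypothetical) names in NumPy, computes an elementwise sigmoid gate from the concatenated representations and interpolates between them:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def self_gate_merge(h_recent, h_global, W, b):
    """Adaptively merge two entity representations with a learned gate.

    g   = sigmoid(W @ [h_recent; h_global] + b)   # elementwise gate in (0, 1)
    out = g * h_recent + (1 - g) * h_global
    """
    z = np.concatenate([h_recent, h_global])
    g = sigmoid(W @ z + b)
    return g * h_recent + (1 - g) * h_global
```

With zero weights the gate is 0.5 everywhere, so the merge reduces to a plain average; training moves the gate toward whichever representation is more informative per dimension.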
https://arxiv.org/abs/2405.10621
In the Vision-and-Language Navigation (VLN) task, the agent is required to navigate to a destination following a natural language instruction. While learning-based approaches have been a major solution to the task, they suffer from high training costs and lack of interpretability. Recently, Large Language Models (LLMs) have emerged as a promising tool for VLN due to their strong generalization capabilities. However, existing LLM-based methods face limitations in memory construction and diversity of navigation strategies. To address these challenges, we propose a suite of techniques. Firstly, we introduce a method to maintain a topological map that stores navigation history, retaining information about viewpoints, objects, and their spatial relationships. This map also serves as a global action space. Additionally, we present a Navigation Chain of Thoughts module, leveraging human navigation examples to enrich navigation strategy diversity. Finally, we establish a pipeline that integrates navigational memory and strategies with perception and action prediction modules. Experimental results on the REVERIE and R2R datasets show that our method effectively enhances the navigation ability of the LLM and improves the interpretability of navigation reasoning.
https://arxiv.org/abs/2405.10620
In recent years, large language models (LLMs) have driven advances in natural language processing, yet their growing scale has increased the computational burden, necessitating a balance between efficiency and performance. Low-rank compression, a promising technique, reduces non-essential parameters by decomposing weight matrices into products of two low-rank matrices; its application to LLMs, however, has not been extensively studied. The key to low-rank compression lies in low-rank factorization and low-rank dimension allocation. To address the challenges of low-rank compression in LLMs, we conduct empirical research on the low-rank characteristics of large models and propose a low-rank compression method suitable for LLMs. This approach involves precise estimation of feature distributions through pooled covariance matrices and a Bayesian optimization strategy for allocating low-rank dimensions. Experiments on the LLaMA-2 models demonstrate that our method outperforms existing strong structured pruning and low-rank compression techniques in maintaining model performance at the same compression ratio.
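As background on the decomposition itself (not the paper's covariance-based estimation or Bayesian allocation), a rank-r factorization of a weight matrix is obtainable via truncated SVD, which gives the optimal rank-r approximation in Frobenius norm and shrinks the parameter count from m·n to r·(m+n):

```python
import numpy as np

def low_rank_factorize(W, r):
    """Decompose an (m x n) weight matrix into A (m x r) @ B (r x n).

    Truncated SVD: keep the r largest singular values/vectors; by the
    Eckart-Young theorem this is the best rank-r approximation of W.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]   # absorb singular values into the left factor
    B = Vt[:r, :]
    return A, B
```

At inference, the dense layer `W @ x` is replaced by `A @ (B @ x)`, which is cheaper whenever r·(m+n) < m·n.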
https://arxiv.org/abs/2405.10616
Given the power of vision transformers, a new learning paradigm, pre-training and then prompting, makes it more efficient and effective to address downstream visual recognition tasks. In this paper, we identify a novel security threat to this paradigm from the perspective of backdoor attacks. Specifically, an extra prompt token, called the switch token in this work, can turn the backdoor mode on, i.e., convert a benign model into a backdoored one. Once in backdoor mode, a specific trigger can force the model to predict a target class. This poses a severe risk to users of cloud APIs, since the malicious behavior cannot be activated or detected in benign mode, making the attack very stealthy. To attack a pre-trained model, our proposed attack, named SWARM, learns a trigger and a set of prompt tokens including a switch token. They are optimized with a clean loss, which encourages the model to behave normally even when the trigger is present, and a backdoor loss, which ensures that the backdoor can be activated by the trigger when the switch is on. Besides, we utilize cross-mode feature distillation to reduce the effect of the switch token on clean samples. Experiments on diverse visual recognition tasks confirm the success of our switchable backdoor attack, i.e., achieving a 95%+ attack success rate while remaining hard to detect and remove. Our code is available at this https URL.
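The two objectives can be written down schematically. The helper below is an illustrative sketch with names of our own choosing (the real SWARM objective operates on prompt tokens of a vision transformer, which we abstract into an opaque model function `f`):

```python
import numpy as np

def cross_entropy(logits, label):
    z = logits - logits.max()          # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label])

def switchable_backdoor_loss(f, x, trigger, prompt_off, prompt_on,
                             y_true, y_target, lam=1.0):
    # Clean loss: with the switch off, predict the true label whether or
    # not the trigger is present -- this is what makes the attack stealthy.
    l_clean = (cross_entropy(f(x, prompt_off), y_true)
               + cross_entropy(f(x + trigger, prompt_off), y_true))
    # Backdoor loss: with the switch on, the trigger forces the target class.
    l_bd = cross_entropy(f(x + trigger, prompt_on), y_target)
    return l_clean + lam * l_bd
```

Minimizing the combined loss jointly over the trigger and prompt tokens yields the switchable behavior described above.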
https://arxiv.org/abs/2405.10612
The crux of Referring Video Object Segmentation (RVOS) lies in modeling dense text-video relations to associate abstract linguistic concepts with dynamic visual contents at the pixel level. Current RVOS methods typically use vision and language models pre-trained independently as backbones. As images and texts are mapped to uncoupled feature spaces, they face the arduous task of learning Vision-Language (VL) relation modeling from scratch. Witnessing the success of Vision-Language Pre-trained (VLP) models, we propose to learn relation modeling for RVOS based on their aligned VL feature space. Nevertheless, transferring VLP models to RVOS is a deceptively challenging task due to the substantial gap between the pre-training task (image/region-level prediction) and the RVOS task (pixel-level prediction in videos). In this work, we introduce a framework named VLP-RVOS to address this transfer challenge. We first propose a temporal-aware prompt-tuning method, which not only adapts pre-trained representations for pixel-level prediction but also empowers the vision encoder to model temporal clues. We further propose to perform multi-stage VL relation modeling during and after feature extraction for comprehensive VL understanding. Besides, we customize a cube-frame attention mechanism for spatial-temporal reasoning. Extensive experiments demonstrate that our method outperforms state-of-the-art algorithms and exhibits strong generalization abilities.
https://arxiv.org/abs/2405.10610
Deep learning methods for time series have already achieved excellent performance in both prediction and classification tasks, including anomaly detection. However, the complexity inherent in Cyber-Physical Systems (CPS) poses a challenge for explainability methods. To overcome this inherent lack of interpretability, we propose ECATS, a concept-based neuro-symbolic architecture in which concepts are represented as Signal Temporal Logic (STL) formulae. Leveraging kernel-based methods for STL, concept embeddings are learnt in an unsupervised manner through a cross-attention mechanism. The network makes class predictions through these concept embeddings, allowing a meaningful explanation to be extracted naturally for each input. Our preliminary experiments with a simple CPS-based dataset show that our model is able to achieve strong classification performance while ensuring local interpretability.
https://arxiv.org/abs/2405.10608
Object-Centric Learning (OCL) seeks to enable neural networks to identify individual objects in visual scenes, which is crucial for interpretable visual comprehension and reasoning. Most existing OCL models adopt auto-encoding structures and learn to decompose visual scenes through specially designed inductive biases, which cause the model to miss small objects during reconstruction. Reverse hierarchy theory proposes that human vision corrects perception errors through a top-down visual pathway that returns to bottom-level neurons and acquires more detailed information. Inspired by this, we propose the Reverse Hierarchy Guided Network (RHGNet), which introduces a top-down pathway that works in different ways during training and inference. This pathway guides bottom-level features with top-level object representations during training, and incorporates information from bottom-level features into perception during inference. Our model achieves SOTA performance on several commonly used datasets, including CLEVR, CLEVRTex and MOVi-C. We demonstrate with experiments that our method promotes the discovery of small objects and also generalizes well to complex real-world scenes. Code will be available at https://anonymous.4open.science/r/RHGNet-6CEF.
https://arxiv.org/abs/2405.10598
Time-series analysis plays a pivotal role across a range of critical applications, from finance to healthcare, which involves various tasks, such as forecasting and classification. To handle the inherent complexities of time-series data, such as high dimensionality and noise, traditional supervised learning methods first annotate extensive labels for time-series data in each task, which is very costly and impractical in real-world applications. In contrast, pre-trained foundation models offer a promising alternative by leveraging unlabeled data to capture general time series patterns, which can then be fine-tuned for specific tasks. However, existing approaches to pre-training such models typically suffer from high-bias and low-generality issues due to the use of predefined and rigid augmentation operations and domain-specific data training. To overcome these limitations, this paper introduces UniCL, a universal and scalable contrastive learning framework designed for pretraining time-series foundation models across cross-domain datasets. Specifically, we propose a unified and trainable time-series augmentation operation to generate pattern-preserved, diverse, and low-bias time-series data by leveraging spectral information. Besides, we introduce a scalable augmentation algorithm capable of handling datasets with varying lengths, facilitating cross-domain pretraining. Extensive experiments on two benchmark datasets across eleven domains validate the effectiveness of UniCL, demonstrating its high generalization on time-series analysis across various fields.
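To make the spectral idea concrete: one simple pattern-preserving augmentation jitters the magnitude of each Fourier coefficient while keeping the phase, so the dominant periodic structure of the series survives. This sketch is our illustration of the general technique, not UniCL's actual operation, which is trainable:

```python
import numpy as np

def spectral_augment(x, strength=0.1, rng=None):
    """Frequency-domain augmentation: multiplicatively perturb the magnitude
    spectrum of a 1-D series, leaving the phase untouched so that spectral
    peaks (the series' dominant patterns) are preserved."""
    rng = np.random.default_rng(rng)
    spec = np.fft.rfft(x)
    mag, phase = np.abs(spec), np.angle(spec)
    mag = mag * (1.0 + strength * rng.standard_normal(mag.shape))
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(x))
```

Because the perturbation is multiplicative, near-zero frequency bins stay near zero; a small `strength` yields views that are diverse yet highly correlated with the original series.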
https://arxiv.org/abs/2405.10597
3D occupancy perception holds a pivotal role in recent vision-centric autonomous driving systems by converting surround-view images into integrated geometric and semantic representations within dense 3D grids. Nevertheless, current models still encounter two main challenges: modeling depth accurately in the 2D-3D view transformation stage, and overcoming the lack of generalizability due to sparse LiDAR supervision. To address these issues, this paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception. Our approach is three-fold: 1) integration of explicit lift-based depth prediction and implicit projection-based transformers for depth modeling, enhancing the density and robustness of the view transformation; 2) utilization of a mask-based encoder-decoder architecture for fine-grained semantic predictions; 3) adoption of context-aware self-training loss functions in the pre-training stage to complement LiDAR supervision, involving the re-rendering of 2D depth maps from 3D occupancy features and leveraging an image reconstruction loss to obtain denser depth supervision beyond sparse LiDAR ground truths. Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset with the lowest required image resolution and the most lightweight image backbone among current models, marking an improvement of 3.3% due to our proposed contributions. Comprehensive experimentation also demonstrates the consistent superiority of our method over baselines and alternative approaches.
https://arxiv.org/abs/2405.10591
Crowd counting and localization have become increasingly important in computer vision due to their wide-ranging applications. While point-based strategies have been widely used in crowd counting methods, they face a significant challenge, i.e., the lack of an effective learning strategy to guide the matching process. This deficiency leads to instability in matching point proposals to target points, adversely affecting overall performance. To address this issue, we introduce an effective approach to stabilize the proposal-target matching in point-based methods. We propose Auxiliary Point Guidance (APG) to provide clear and effective guidance for proposal selection and optimization, addressing the core issue of matching uncertainty. Additionally, we develop Implicit Feature Interpolation (IFI) to enable adaptive feature extraction in diverse crowd scenarios, further enhancing the model's robustness and accuracy. Extensive experiments demonstrate the effectiveness of our approach, showing significant improvements in crowd counting and localization performance, particularly under challenging conditions. The source codes and trained models will be made publicly available.
https://arxiv.org/abs/2405.10589
Large language model (LLM)-based recommender models that bridge users and items through textual prompts for effective semantic reasoning have gained considerable attention. However, few methods consider the underlying rationales behind interactions, such as user preferences and item attributes, limiting the reasoning capability of LLMs for recommendations. This paper proposes a rationale distillation recommender (RDRec), a compact model designed to learn rationales generated by a larger language model (LM). By leveraging rationales from reviews related to users and items, RDRec remarkably specifies their profiles for recommendations. Experiments show that RDRec achieves state-of-the-art (SOTA) performance in both top-N and sequential recommendations. Our source code is released at this https URL.
https://arxiv.org/abs/2405.10587
Stock price prediction has always been a difficult task for forecasters. Using cutting-edge deep learning techniques, stock price prediction based on investor sentiment extracted from online forums has become feasible. We propose a novel hybrid deep learning framework for predicting stock prices. The framework leverages the XLNET model to analyze the sentiment conveyed in user posts on online forums, combines these sentiments with the post popularity factor to compute daily group sentiments, and integrates this information with stock technical indicators into an improved BiLSTM-highway model for stock price prediction. Through a series of comparative experiments involving four stocks on the Chinese stock market, it is demonstrated that the hybrid framework effectively predicts stock prices. This study reveals the necessity of analyzing investors' textual views for stock price prediction.
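The daily group sentiment computation is described only at a high level; a popularity-weighted average is the simplest reading of "combines these sentiments with the post popularity factor" (the function and argument names below are our own):

```python
def daily_group_sentiment(posts):
    """posts: iterable of (sentiment, popularity) pairs for one trading day,
    with sentiment in [-1, 1] and popularity a non-negative weight.
    Widely-read posts contribute proportionally more to the daily score."""
    posts = list(posts)
    weight_sum = sum(w for _, w in posts)
    if weight_sum == 0:
        return 0.0          # no posts (or no engagement) that day
    return sum(s * w for s, w in posts) / weight_sum
```

The resulting daily score would then be concatenated with the stock's technical indicators before being fed to the prediction model.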
https://arxiv.org/abs/2405.10584
Experimental exploration of high-cost systems with safety constraints, common in engineering applications, is a challenging endeavor. Data-driven models offer a promising solution, but acquiring the requisite data remains expensive and is potentially unsafe. Safe active learning techniques prove essential, enabling the learning of high-quality models from a minimal number of expensive data points while maintaining high safety. This paper introduces a safe active learning framework tailored to time-varying systems, addressing drift, seasonal changes, and complexities due to dynamic behavior. The proposed Time-aware Integrated Mean Squared Prediction Error (T-IMSPE) method minimizes posterior variance over current and future states, optimizing information gathering in the time domain as well. Empirical results on toy and real-world examples highlight T-IMSPE's advantages in model quality, and T-IMSPE is compatible with state-of-the-art Gaussian processes. Our theoretical contributions include a clear delineation of which Gaussian process kernels, domains, and weighting measures are suitable for T-IMSPE, and even beyond that, for its non-time-aware predecessor IMSPE.
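To make the acquisition rule concrete: under a zero-mean GP with a unit-variance RBF kernel, an IMSPE-style criterion scores each candidate input by the average posterior variance over a reference grid after hypothetically adding it, and queries the minimizer. This is a plain, non-time-aware sketch of our own; T-IMSPE additionally weights future time points:

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """Unit-variance RBF kernel between 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def imspe_next_point(X, candidates, grid, noise=1e-6):
    """Pick the candidate whose addition minimizes the mean GP posterior
    variance over `grid` (an IMSPE-style active learning acquisition)."""
    best, best_score = None, np.inf
    for c in candidates:
        Xc = np.append(X, c)
        K = rbf(Xc, Xc) + noise * np.eye(len(Xc))
        Ks = rbf(grid, Xc)
        # posterior variance at each grid point: k(x,x) - k_s K^{-1} k_s^T
        var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
        score = var.mean()
        if score < best_score:
            best, best_score = c, score
    return best
```

Note that the posterior variance depends only on input locations, not on observed outputs, which is what makes this look-ahead evaluation cheap. A safe variant would additionally restrict `candidates` to inputs the current model deems safe.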
https://arxiv.org/abs/2405.10581
In this work, we explore idiomatic language processing with Large Language Models (LLMs). We introduce the Idiomatic language Test Suite IdioTS, a new dataset of difficult examples specifically designed by language experts to assess the capabilities of LLMs to process figurative language at sentence level. We propose a comprehensive evaluation methodology based on an idiom detection task, where LLMs are prompted with detecting an idiomatic expression in a given English sentence. We present a thorough automatic and manual evaluation of the results and an extensive error analysis.
https://arxiv.org/abs/2405.10579
Recent advances in multi-view camera-only 3D object detection either rely on an accurate reconstruction of bird's-eye-view (BEV) 3D features or on traditional 2D perspective view (PV) image features. While both have their own pros and cons, few have found a way to stitch them together in order to benefit from "the best of both worlds". To this end, we explore a duo space (i.e., BEV and PV) 3D perception framework, in conjunction with some useful duo space fusion strategies that allow effective aggregation of the two feature representations. To the best of our knowledge, our proposed method, DuoSpaceNet, is the first to leverage two distinct feature spaces and achieves the state-of-the-art 3D object detection and BEV map segmentation results on nuScenes dataset.
https://arxiv.org/abs/2405.10577
Robotic systems driven by artificial muscles present unique challenges due to the nonlinear dynamics of actuators and the complex designs of mechanical structures. Traditional model-based controllers often struggle to achieve desired control performance in such systems. Deep reinforcement learning (DRL), a trending machine learning technique widely adopted in robot control, offers a promising alternative. However, integrating DRL into these robotic systems faces significant challenges, including the requirement for large amounts of training data and the inevitable sim-to-real gap when deployed to real-world robots. This paper proposes an efficient reinforcement learning control framework with sim-to-real transfer to address these challenges. Bootstrap and augmentation enhancements are designed to improve the data efficiency of baseline DRL algorithms, while a sim-to-real transfer technique, namely randomization of muscle dynamics, is adopted to bridge the gap between simulation and real-world deployment. Extensive experiments and ablation studies are conducted utilizing two string-type artificial muscle-driven robotic systems including a two degree-of-freedom robotic eye and a parallel robotic wrist, the results of which demonstrate the effectiveness of the proposed learning control strategy.
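Randomization of muscle dynamics can be sketched as resampling each nominal actuator parameter from a uniform band at the start of every training episode, so the policy never overfits one simulated dynamics model (the parameter names below are illustrative, not the paper's):

```python
import random

def randomize_muscle_dynamics(nominal, spread=0.2, rng=None):
    """Return a perturbed copy of nominal actuator parameters, each scaled
    by a factor drawn uniformly from [1 - spread, 1 + spread]. Resampling
    per episode is the core of this domain-randomization sim-to-real idea."""
    rng = rng or random.Random()
    return {k: v * rng.uniform(1 - spread, 1 + spread)
            for k, v in nominal.items()}
```

A training loop would call this once per episode and rebuild the simulated actuator with the sampled parameters, forcing the learned policy to be robust across the whole band and, ideally, to the real hardware.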
https://arxiv.org/abs/2405.10576
In cardiac Magnetic Resonance Imaging (MRI) analysis, simultaneous myocardial segmentation and T2 quantification are crucial for assessing myocardial pathologies. Existing methods often address these tasks separately, limiting their synergistic potential. To address this, we propose SQNet, a dual-task network integrating Transformer and Convolutional Neural Network (CNN) components. SQNet features a T2-refine fusion decoder for quantitative analysis, leveraging global features from the Transformer, and a segmentation decoder with multiple local region supervision for enhanced accuracy. A tight coupling module aligns and fuses CNN and Transformer branch features, enabling SQNet to focus on myocardium regions. Evaluation on healthy controls (HC) and acute myocardial infarction patients (AMI) demonstrates superior segmentation dice scores (89.3/89.2) compared to state-of-the-art methods (87.7/87.9). T2 quantification yields strong linear correlations (Pearson coefficients: 0.84/0.93) with label values for HC/AMI, indicating accurate mapping. Radiologist evaluations confirm SQNet's superior image quality scores (4.60/4.58 for segmentation, 4.32/4.42 for T2 quantification) over state-of-the-art methods (4.50/4.44 for segmentation, 3.59/4.37 for T2 quantification). SQNet thus offers accurate simultaneous segmentation and quantification, enhancing cardiac disease diagnosis, such as AMI.
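For reference, the Dice scores reported above follow the standard definition 2|A∩B| / (|A|+|B|) for a predicted and a ground-truth binary mask; a minimal version:

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences.
    Returns 1.0 for two empty masks (a common convention)."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2.0 * inter / total if total else 1.0
```

The paper's scores (e.g., 89.3/89.2) are this quantity expressed as a percentage, averaged over subjects.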
https://arxiv.org/abs/2405.10570
Single image super-resolution (SR) is an established pixel-level vision task aimed at reconstructing a high-resolution image from its degraded low-resolution counterpart. Despite the notable advancements achieved by leveraging deep neural networks for SR, most existing deep learning architectures feature an extensive number of layers, leading to high computational complexity and substantial memory demands. These issues become particularly pronounced in the context of infrared image SR, where infrared devices often have stringent storage and computational constraints. To mitigate these challenges, we introduce a novel, efficient, and precise single infrared image SR model, termed the Lightweight Information Split Network (LISN). The LISN comprises four main components: shallow feature extraction, deep feature extraction, dense feature fusion, and high-resolution infrared image reconstruction. A key innovation within this model is the introduction of the Lightweight Information Split Block (LISB) for deep feature extraction. The LISB employs a sequential process to extract hierarchical features, which are then aggregated based on the relevance of the features under consideration. By integrating channel splitting and shift operations, the LISB successfully strikes an optimal balance between enhanced SR performance and a lightweight framework. Comprehensive experimental evaluations reveal that the proposed LISN achieves superior performance over contemporary state-of-the-art methods in terms of both SR quality and model complexity, affirming its efficacy for practical deployment in resource-constrained infrared imaging applications.
https://arxiv.org/abs/2405.10561