We introduce a novel task within the field of 3D dance generation, termed dance accompaniment, which necessitates the generation of responsive movements from a dance partner, the "follower", synchronized with the lead dancer's movements and the underlying musical rhythm. Unlike existing solo or group dance generation tasks, a duet dance scenario entails a heightened degree of interaction between the two participants, requiring delicate coordination in both pose and position. To support this task, we first build a large-scale and diverse duet interactive dance dataset, DD100, by recording about 117 minutes of professional dancers' performances. To address the challenges inherent in this task, we propose a GPT-based model, Duolando, which autoregressively predicts the subsequent tokenized motion conditioned on the coordinated information of the music, the leader's movements, and the follower's own movements. To further enhance the GPT's capability to generate stable results under unseen conditions (music and leader motions), we devise an off-policy reinforcement learning strategy that allows the model to explore viable trajectories from out-of-distribution samples, guided by human-defined rewards. Based on the collected dataset and proposed method, we establish a benchmark with several carefully designed metrics.
https://arxiv.org/abs/2403.18811
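The autoregressive loop described above can be sketched as follows. The real Duolando operates on learned VQ codebook tokens with a trained GPT; the toy scorer below is a purely hypothetical stand-in that only illustrates conditioning the follower's next token on the music and leader streams.

```python
# Hedged sketch of conditional autoregressive token generation (assumption:
# a toy scorer replaces the trained GPT; token vocabularies are simplified).
def next_token_logits(music, leader, follower_so_far, vocab_size=8):
    # toy conditional scorer: prefers a token derived from the current
    # music and leader tokens (stand-in for the GPT forward pass)
    t = len(follower_so_far)
    target = (music[t] + leader[t]) % vocab_size
    return [-abs(v - target) for v in range(vocab_size)]

def generate_follower(music, leader, vocab_size=8):
    # greedy autoregressive decoding: one follower token per time step
    follower = []
    for _ in range(len(leader)):
        logits = next_token_logits(music, leader, follower, vocab_size)
        follower.append(max(range(vocab_size), key=lambda v: logits[v]))
    return follower
```

In the actual system, sampling from the model's distribution (rather than greedy argmax) is what produces the out-of-distribution trajectories that the off-policy RL stage then scores with human-defined rewards.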
In the realm of data-driven AI technology, the application of open-source large language models (LLMs) in robotic task planning represents a significant milestone. Recent robotic task planning methods based on open-source LLMs typically leverage vast task planning datasets to enhance models' planning abilities. While these methods show promise, they struggle with complex long-horizon tasks, which require comprehending more context and generating longer action sequences. This paper addresses this limitation by proposing MLDT, the Multi-Level Decomposition Task planning method. This method innovatively decomposes tasks at the goal-level, task-level, and action-level to mitigate the challenge of complex long-horizon tasks. In order to enhance open-source LLMs' planning abilities, we introduce a goal-sensitive corpus generation method to create high-quality training data and conduct instruction tuning on the generated corpus. Since existing datasets are not sufficiently complex, we construct a more challenging dataset, LongTasks, to specifically evaluate planning ability on complex long-horizon tasks. We evaluate our method using various LLMs on four datasets in VirtualHome. Our results demonstrate a significant performance enhancement in robotic task planning, showcasing MLDT's effectiveness in overcoming the limitations of existing methods based on open-source LLMs as well as its practicality in complex, real-world scenarios.
https://arxiv.org/abs/2403.18760
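The goal-level / task-level / action-level decomposition can be illustrated with a toy planner. The lookup tables below stand in for the LLM calls MLDT would issue at each level, and the goals, tasks, and actions shown are invented examples, not content from the paper or its datasets.

```python
# Hedged sketch of multi-level task decomposition (assumption: dict lookups
# replace the per-level LLM queries; the household domain is invented).
GOAL_TO_TASKS = {"make breakfast": ["brew coffee", "toast bread"]}
TASK_TO_ACTIONS = {"brew coffee": ["grab mug", "pour coffee"],
                   "toast bread": ["grab bread", "use toaster"]}

def plan(goal):
    # decompose goal -> tasks -> actions; unknown entries pass through
    # unchanged, mimicking a level that needs no further decomposition
    actions = []
    for task in GOAL_TO_TASKS.get(goal, [goal]):
        actions.extend(TASK_TO_ACTIONS.get(task, [task]))
    return actions
```

The benefit the paper argues for is visible even in this sketch: each level only has to produce a short sequence, so no single call must generate the full long-horizon action list.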
Lighting normalization is a crucial but underexplored restoration task with broad applications. However, existing works often simplify this task within the context of shadow removal, limiting the light sources to one and oversimplifying the scene, thus excluding complex self-shadows and restricting surface classes to smooth ones. Although promising, such simplifications hinder generalizability to more realistic settings encountered in daily use. In this paper, we propose a new challenging task termed Ambient Lighting Normalization (ALN), which enables the study of interactions between shadows, unifying image restoration and shadow removal in a broader context. To address the lack of appropriate datasets for ALN, we introduce the large-scale high-resolution dataset Ambient6K, comprising samples obtained from multiple light sources and including self-shadows resulting from complex geometries, which is the first of its kind. For benchmarking, we select various mainstream methods and rigorously evaluate them on Ambient6K. Additionally, we propose IFBlend, a novel strong baseline that maximizes Image-Frequency joint entropy to selectively restore local areas under different lighting conditions, without relying on shadow localization priors. Experiments show that IFBlend achieves SOTA scores on Ambient6K and exhibits competitive performance on conventional shadow removal benchmarks compared to shadow-specific models with mask priors. The dataset, benchmark, and code are available at this https URL.
https://arxiv.org/abs/2403.18730
We introduce a method to verify stochastic reinforcement learning (RL) policies. This approach is compatible with any RL algorithm as long as the algorithm and its corresponding environment collectively adhere to the Markov property. In this setting, the future state of the environment should depend solely on its current state and the action executed, independent of any previous states or actions. Our method integrates a verification technique, referred to as model checking, with RL, leveraging a Markov decision process, a trained RL policy, and a probabilistic computation tree logic (PCTL) formula to build a formal model that can be subsequently verified via the model checker Storm. We demonstrate our method's applicability across multiple benchmarks, comparing it to baseline methods called deterministic safety estimates and naive monolithic model checking. Our results show that our method is suited to verify stochastic RL policies.
https://arxiv.org/abs/2403.18725
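A minimal sketch of the construction the method relies on: applying a stochastic policy to an MDP induces a Markov chain, whose reachability probabilities (the core of a PCTL query such as P=? [F target]) can then be computed. A real pipeline would hand the induced model to the Storm model checker; the toy value iteration below only illustrates the semantics.

```python
# Hedged sketch: policy + MDP -> induced Markov chain -> reachability
# probability (assumption: dict-of-dicts encoding; Storm is not invoked).
def induce_chain(mdp, policy):
    # P[s][s'] = sum over actions a of policy(s, a) * mdp(s, a, s')
    chain = {}
    for s, acts in mdp.items():
        dist = {}
        for a, trans in acts.items():
            for s2, p in trans.items():
                dist[s2] = dist.get(s2, 0.0) + policy[s][a] * p
        chain[s] = dist
    return chain

def reach_prob(chain, start, target, iters=100):
    # fixed-point iteration for P(eventually target)
    v = {s: (1.0 if s == target else 0.0) for s in chain}
    for _ in range(iters):
        v = {s: (1.0 if s == target
                 else sum(p * v[s2] for s2, p in chain[s].items()))
             for s in chain}
    return v[start]
```

Note how the Markov property mentioned in the abstract is exactly what makes this construction sound: the induced chain's transition probabilities depend only on the current state.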
This study focuses on addressing the instability issues prevalent in contrastive learning, specifically examining the InfoNCE loss function and its derivatives. We reveal a critical observation that these loss functions exhibit a restrictive behavior, leading to a convergence phenomenon where embeddings tend to merge into a singular point. This "over-fusion" effect detrimentally affects classification accuracy in subsequent supervised-learning tasks. Through theoretical analysis, we demonstrate that embeddings, when equalized or confined to a rank-1 linear subspace, represent a local minimum for InfoNCE. In response to this challenge, our research introduces an innovative strategy that leverages the same or fewer labeled data than typically used in the fine-tuning phase. The loss we proposed, Orthonormal Anchor Regression Loss, is designed to disentangle embedding clusters, significantly enhancing the distinctiveness of each embedding while simultaneously ensuring their aggregation into dense, well-defined clusters. Our method demonstrates remarkable improvements with just a fraction of the conventional label requirements, as evidenced by our results on CIFAR10 and CIFAR100 datasets.
https://arxiv.org/abs/2403.18699
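The stationary point discussed above is easy to see numerically: when all embeddings coincide, every logit in the InfoNCE softmax is equal, and the loss is exactly log N for a batch of size N, independent of where the collapsed point lies. The sketch below is a plain InfoNCE implementation; the proposed Orthonormal Anchor Regression Loss is not reproduced here.

```python
# Plain InfoNCE over paired embeddings (assumption: dot-product similarity
# and temperature 0.1; positives double as the negative pool).
import math

def info_nce(anchors, positives, temperature=0.1):
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        m = max(logits)  # numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(anchors)
```

Feeding a batch of identical embeddings returns log N, illustrating the "over-fusion" configuration the paper identifies as a local minimum.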
In 2020, I designed the course CMSC 20630/30630 Human-Robot Interaction: Research and Practice as a hands-on introduction to human-robot interaction (HRI) research for both undergraduate and graduate students at the University of Chicago. Since 2020, I have taught and refined this course each academic year. Human-Robot Interaction: Research and Practice focuses on the core concepts and cutting-edge research in the field of human-robot interaction (HRI), covering topics that include: nonverbal robot behavior, verbal robot behavior, social dynamics, norms & ethics, collaboration & learning, group interactions, applications, and future challenges of HRI. Course meetings involve students leading discussions of cutting-edge, peer-reviewed HRI research publications. Students also participate in a quarter-long collaborative research project, where they pursue an HRI research question that often involves conducting their own human-subjects research study, recruiting human subjects to interact with a robot. In this paper, I detail the structure of the course and its learning goals, as well as my reflections and student feedback on the course.
https://arxiv.org/abs/2403.18692
Annolid is a deep learning-based software package designed for the segmentation, labeling, and tracking of research targets within video files, focusing primarily on animal behavior analysis. Based on state-of-the-art instance segmentation methods, Annolid now harnesses the Cutie video object segmentation model to achieve resilient, markerless tracking of multiple animals from single annotated frames, even in environments in which they may be partially or entirely concealed by environmental features or by one another. Our integration of Segment Anything and Grounding-DINO strategies additionally enables the automatic masking and segmentation of recognizable animals and objects by text command, removing the need for manual annotation. Annolid's comprehensive approach to object segmentation flexibly accommodates a broad spectrum of behavior analysis applications, enabling the classification of diverse behavioral states such as freezing, digging, pup huddling, and social interactions in addition to the tracking of animals and their body parts.
https://arxiv.org/abs/2403.18690
Addressing the challenges related to data sparsity, cold-start problems, and diversity in recommendation systems is both crucial and demanding. Many current solutions leverage knowledge graphs to tackle these issues by combining both item-based and user-item collaborative signals. A common trend in these approaches focuses on improving ranking performance at the cost of escalating model complexity, reducing diversity, and complicating the task. It is essential to provide recommendations that are both personalized and diverse, rather than solely relying on achieving high rank-based performance, such as Click-through Rate, Recall, etc. In this paper, we propose a hybrid multi-task learning approach, training on user-item and item-item interactions. We apply item-based contrastive learning on descriptive text, sampling positive and negative pairs based on item metadata. Our approach allows the model to better understand the relationships between entities within the knowledge graph by utilizing semantic information from text. It leads to more accurate, relevant, and diverse user recommendations and a benefit that extends even to cold-start users who have few interactions with items. We perform extensive experiments on two widely used datasets to validate the effectiveness of our approach. Our findings demonstrate that jointly training user-item interactions and item-based signals using synopsis text is highly effective. Furthermore, our results provide evidence that item-based contrastive learning enhances the quality of entity embeddings, as indicated by metrics such as uniformity and alignment.
https://arxiv.org/abs/2403.18667
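The metadata-driven pair sampling for the item-based contrastive objective can be sketched as below; the `genre` field and the item format are illustrative stand-ins, not the paper's actual metadata schema.

```python
# Hedged sketch of contrastive pair sampling from item metadata
# (assumption: items sharing a metadata value form positive pairs).
def sample_pairs(items, key="genre"):
    pos, neg = [], []
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            (pos if a[key] == b[key] else neg).append((a["id"], b["id"]))
    return pos, neg
```

In the full method these pairs are built over descriptive (synopsis) text so that the contrastive loss injects semantic signal into the entity embeddings trained jointly with the user-item task.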
Process events are recorded by multiple information systems at different granularity levels. Based on the resulting event logs, process models are discovered at different granularity levels as well. Events stored at a fine-grained granularity level, for example, may prevent the discovered process model from being displayed due to the high number of resulting model elements. The discovered process model of a real-world manufacturing process, for example, consists of 1,489 model elements and over 2,000 arcs. Existing process model abstraction techniques could help reduce the size of the model, but would disconnect it from the underlying event log. Existing event abstraction techniques support neither the analysis of mixed granularity levels nor interactive exploration of a suitable granularity level. To enable the exploration of discovered process models at different granularity levels, we propose INEXA, an interactive, explainable process model abstraction method that keeps the link to the event log. As a starting point, INEXA aggregates large process models to a "displayable" size, e.g., for the manufacturing use case, to a process model with 58 model elements. Then, the process analyst can explore granularity levels interactively, while applied abstractions are automatically traced in the event log for explainability.
https://arxiv.org/abs/2403.18659
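A toy version of an abstraction that keeps the link to the event log: consecutive fine-grained events mapping to the same high-level activity are collapsed, while the originating events are recorded per abstracted element for explainability. The activity mapping shown is invented; INEXA derives and applies its abstractions differently.

```python
# Hedged sketch of explainable event abstraction (assumption: a fixed
# event->activity mapping; INEXA's actual aggregation is more involved).
def abstract_log(log, mapping):
    abstracted, trace = [], []
    for event in log:
        high = mapping.get(event, event)
        if abstracted and abstracted[-1] == high:
            trace[-1].append(event)  # same activity continues: merge
        else:
            abstracted.append(high)
            trace.append([event])    # new activity: open a new trace entry
    return abstracted, trace
```

The `trace` structure is what preserves the link back to the event log: every element of the smaller, displayable model can be expanded to the raw events it summarizes.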
Procedure Planning in instructional videos entails generating a sequence of action steps based on visual observations of the initial and target states. Despite the rapid progress in this task, there remain several critical challenges to be solved: (1) Adaptive procedures: Prior works hold an unrealistic assumption that the number of action steps is known and fixed, leading to non-generalizable models in real-world scenarios where the sequence length varies. (2) Temporal relation: Understanding the step temporal relation knowledge is essential in producing reasonable and executable plans. (3) Annotation cost: Annotating instructional videos with step-level labels (i.e., timestamp) or sequence-level labels (i.e., action category) is demanding and labor-intensive, limiting its generalizability to large-scale applications. In this work, we propose a new and practical setting, called adaptive procedure planning in instructional videos, where the procedure length is not fixed or pre-determined. To address these challenges, we introduce the Retrieval-Augmented Planner (RAP) model. Specifically, for adaptive procedures, RAP adaptively determines the conclusion of actions using an auto-regressive model architecture. For temporal relation, RAP establishes an external memory module to explicitly retrieve the most relevant state-action pairs from the training videos and revises the generated procedures. To tackle high annotation cost, RAP utilizes a weakly-supervised learning manner to expand the training dataset to other task-relevant, unannotated videos by generating pseudo labels for action steps. Experiments on CrossTask and COIN benchmarks show the superiority of RAP over traditional fixed-length models, establishing it as a strong baseline solution for adaptive procedure planning.
https://arxiv.org/abs/2403.18600
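RAP's external-memory lookup amounts to retrieving the training state-action pairs most similar to the current state. A minimal nearest-neighbor sketch, with Euclidean distance as an assumed similarity measure over toy state vectors:

```python
# Hedged sketch of the external-memory retrieval step (assumption:
# states are plain vectors and similarity is negative Euclidean distance).
def retrieve(memory, query, k=1):
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    # memory holds (state_vector, action) pairs; return the k nearest
    return sorted(memory, key=lambda sa: dist(sa[0], query))[:k]
```

In the full model, the retrieved pairs are used to revise the autoregressively generated procedure, injecting step temporal-relation knowledge observed in the training videos.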
Resource efficiency plays an important role for machine learning nowadays. The energy and decision latency are two critical aspects to ensure a sustainable and practical application. Unfortunately, the energy consumption and decision latency are not robust against adversaries. Researchers have recently demonstrated that attackers can compute and submit so-called sponge examples at inference time to increase the energy consumption and decision latency of neural networks. In computer vision, the proposed strategy crafts inputs with less activation sparsity which could otherwise be used to accelerate the computation. In this paper, we analyze the mechanism how these energy-latency attacks reduce activation sparsity. In particular, we find that input uniformity is a key enabler. A uniform image, that is, an image with mostly flat, uniformly colored surfaces, triggers more activations due to a specific interplay of convolution, batch normalization, and ReLU activation. Based on these insights, we propose two new simple, yet effective strategies for crafting sponge examples: sampling images from a probability distribution and identifying dense, yet inconspicuous inputs in natural datasets. We empirically examine our findings in a comprehensive evaluation with multiple image classification models and show that our attack achieves the same sparsity effect as prior sponge-example methods, but at a fraction of the computational effort. We also show that our sponge examples transfer between different neural networks. Finally, we discuss how our findings can be applied for good, improving efficiency by increasing activation sparsity.
https://arxiv.org/abs/2403.18587
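The uniformity effect can be illustrated with a toy stand-in for the conv/BN/ReLU interplay: after batch-norm-style standardization and a positive learned shift, a perfectly uniform feature map contains no negative values for ReLU to zero out, so activation density is maximal. The `beta` shift and the one-dimensional setup below are simplifications for illustration, not the paper's analysis.

```python
# Hedged toy model of why uniform inputs yield dense activations
# (assumption: BN reduces to standardization plus a learned shift beta).
def post_relu_density(feature_map, beta=0.1):
    n = len(feature_map)
    mean = sum(feature_map) / n
    var = sum((x - mean) ** 2 for x in feature_map) / n
    std = var ** 0.5 or 1.0  # degenerate (constant) maps: skip scaling
    acts = [max(0.0, (x - mean) / std + beta) for x in feature_map]
    return sum(a > 0 for a in acts) / n
```

A constant map gives density 1.0 (every unit fires), while a map with varied values loses roughly the sub-shift half of its activations, which is the sparsity a hardware accelerator could otherwise exploit.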
Reconstructing 3D hand mesh robustly from a single image is very challenging, due to the lack of diversity in existing real-world datasets. While data synthesis helps relieve the issue, the syn-to-real gap still hinders its usage. In this work, we present HandBooster, a new approach to uplift the data diversity and boost the 3D hand-mesh reconstruction performance by training a conditional generative space on hand-object interactions and purposely sampling the space to synthesize effective data samples. First, we construct versatile content-aware conditions to guide a diffusion model to produce realistic images with diverse hand appearances, poses, views, and backgrounds; favorably, accurate 3D annotations are obtained for free. Then, we design a novel condition creator based on our similarity-aware distribution sampling strategies to deliberately find novel and realistic interaction poses that are distinctive from the training set. Equipped with our method, several baselines can be significantly improved beyond the SOTA on the HO3D and DexYCB benchmarks. Our code will be released on this https URL.
https://arxiv.org/abs/2403.18575
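The similarity-aware sampling idea, deliberately finding poses distinctive from the training set, can be sketched as a farthest-point selection; the squared-Euclidean distance and flat pose vectors here are assumptions, not HandBooster's actual pose representation.

```python
# Hedged sketch of novelty-driven condition sampling (assumption: novelty
# of a candidate = distance to its nearest neighbor in the training poses).
def novel_conditions(candidates, train_poses, k=1):
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    novelty = lambda c: min(dist(c, t) for t in train_poses)
    return sorted(candidates, key=novelty, reverse=True)[:k]
```

Conditions chosen this way would steer the diffusion model toward interaction poses the training set does not cover, which is the data-diversity boost the paper targets.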
Existing research based on deep learning has extensively explored the problem of daytime image dehazing. However, few studies have considered the characteristics of nighttime hazy scenes. There are two distinctions between nighttime and daytime haze. First, there may be multiple active colored light sources with lower illumination intensity in nighttime scenes, which may cause haze, glow and noise with localized, coupled and frequency inconsistent characteristics. Second, due to the domain discrepancy between simulated and real-world data, unrealistic brightness may occur when applying a dehazing model trained on simulated data to real-world data. To address the above two issues, we propose a semi-supervised model for real-world nighttime dehazing. First, the spatial attention and frequency spectrum filtering are implemented as a spatial-frequency domain information interaction module to handle the first issue. Second, a pseudo-label-based retraining strategy and a local window-based brightness loss for the semi-supervised training process are designed to suppress haze and glow while achieving realistic brightness. Experiments on public benchmarks validate the effectiveness of the proposed method and its superiority over state-of-the-art methods. The source code and Supplementary Materials are available at this https URL.
https://arxiv.org/abs/2403.18548
The majority of existing recommender systems rely on user ratings, which are limited by the lack of user collaboration and the sparsity problem. To address these issues, this study proposes a behavior-based recommender system that leverages customers' natural behaviors, such as browsing and clicking, on e-commerce platforms. The proposed recommendation system involves clustering active customers, determining neighborhoods, collecting similar users, calculating product reputation based on similar users, and recommending high-reputation products. To overcome the complexity of customer behaviors and traditional clustering methods, an unsupervised clustering approach based on product categories is developed to enhance the recommendation methodology. This study makes notable contributions in several aspects. Firstly, a groundbreaking behavior-based recommendation methodology is developed, incorporating customer behavior to generate accurate and tailored recommendations leading to improved customer satisfaction and engagement. Secondly, an original unsupervised clustering method, focusing on product categories, enables more precise clustering and facilitates accurate recommendations. Finally, an approach to determine neighborhoods for active customers within clusters is established, ensuring grouping of customers with similar behavioral patterns to enhance recommendation accuracy and relevance. The proposed recommendation methodology and clustering method contribute to improved recommendation performance, offering valuable insights for researchers and practitioners in the field of e-commerce recommendation systems. Additionally, the proposed method outperforms benchmark methods in experiments conducted using a behavior dataset from the well-known e-commerce site Alibaba.
https://arxiv.org/abs/2403.18536
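The reputation step can be sketched as a weighted aggregation of behaviors observed in an active customer's neighborhood; the behavior weights below are illustrative values, not parameters from the study.

```python
# Hedged sketch of behavior-based product reputation (assumption: browsing/
# clicking-style behaviors carry fixed illustrative weights).
WEIGHTS = {"click": 1, "cart": 3, "buy": 5}

def product_reputation(neighborhood):
    # neighborhood: list of users, each a list of (product, behavior) events
    scores = {}
    for user_events in neighborhood:
        for product, behavior in user_events:
            scores[product] = scores.get(product, 0) + WEIGHTS[behavior]
    # products ranked by reputation; the top of this list is recommended
    return sorted(scores, key=scores.get, reverse=True)
```

This mirrors the pipeline described above: cluster customers, collect similar users as the neighborhood, score products from their natural behaviors, and recommend the high-reputation products.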
Unsupervised pathology detection can be implemented by training a model on healthy data only and measuring the deviation from the training set upon inference, for example with CNN-based feature extraction and one-class classifiers, or reconstruction-score-based methods such as AEs, GANs and Diffusion models. Normalizing Flows (NF) have the ability to directly learn the probability distribution of training examples through an invertible architecture. We leverage this property in a novel 3D NF-based model named CT-3DFlow, specifically tailored for patient-level pulmonary pathology detection in chest CT data. Our model is trained unsupervised on healthy 3D pulmonary CT patches, and detects deviations from its log-likelihood distribution as anomalies. We aggregate patch-level likelihood values from a patient's CT scan to provide a patient-level 'normal'/'abnormal' prediction. Out-of-distribution detection performance is evaluated using expert annotations on a separate chest CT test dataset, outperforming other state-of-the-art methods.
https://arxiv.org/abs/2403.18514
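The patch-to-patient aggregation can be sketched as follows. The abstract does not specify the aggregation rule, so taking the minimum patch log-likelihood (the most anomalous patch drives the decision) is one plausible choice, labeled as an assumption here.

```python
# Hedged sketch of patient-level aggregation of patch log-likelihoods
# (assumption: min-aggregation and a fixed decision threshold).
def patient_prediction(patch_loglikes, threshold):
    score = min(patch_loglikes)  # most anomalous patch under the NF
    return "abnormal" if score < threshold else "normal"
```

In the flow-based setting, a low log-likelihood under the model trained on healthy patches is precisely the out-of-distribution signal being thresholded.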
This paper proposes a new knowledge distillation method tailored for image semantic segmentation, termed Intra- and Inter-Class Knowledge Distillation (I2CKD). The focus of this method is on capturing and transferring knowledge between the intermediate layers of teacher (cumbersome model) and student (compact model). For knowledge extraction, we exploit class prototypes derived from feature maps. To facilitate knowledge transfer, we employ a triplet loss in order to minimize intra-class variances and maximize inter-class variances between teacher and student prototypes. Consequently, I2CKD enables the student to better mimic the feature representation of the teacher for each class, thereby enhancing the segmentation performance of the compact network. Extensive experiments on three segmentation datasets, i.e., Cityscapes, Pascal VOC and CamVid, using various teacher-student network pairs demonstrate the effectiveness of the proposed method.
https://arxiv.org/abs/2403.18490
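A minimal sketch of the two ingredients named above: class prototypes as per-class mean feature vectors (a common convention, assumed here), and a triplet loss that pulls a student prototype toward its teacher counterpart for the same class while pushing it away from other classes.

```python
# Hedged sketch of prototype extraction and the triplet objective
# (assumption: prototypes are per-class means of flattened features).
def class_prototypes(features, labels):
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y], counts[y] = [0.0] * len(f), 0
        sums[y] = [s + x for s, x in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def triplet_loss(anchor, positive, negative, margin=1.0):
    # anchor: student prototype; positive: teacher prototype (same class);
    # negative: teacher prototype of a different class
    d = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v))
    return max(0.0, d(anchor, positive) - d(anchor, negative) + margin)
```

Minimizing this loss over all class triples is what drives intra-class variance down and inter-class variance up between teacher and student, as the abstract describes.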
We present Stochastic Gaussian Splatting (SGS): the first framework for uncertainty estimation using Gaussian Splatting (GS). GS recently advanced the novel-view synthesis field by achieving impressive reconstruction quality at a fraction of the computational cost of Neural Radiance Fields (NeRF). However, contrary to the latter, it still lacks the ability to provide information about the confidence associated with their outputs. To address this limitation, in this paper, we introduce a Variational Inference-based approach that seamlessly integrates uncertainty prediction into the common rendering pipeline of GS. Additionally, we introduce the Area Under Sparsification Error (AUSE) as a new term in the loss function, enabling optimization of uncertainty estimation alongside image reconstruction. Experimental results on the LLFF dataset demonstrate that our method outperforms existing approaches in terms of both image rendering quality and uncertainty estimation accuracy. Overall, our framework equips practitioners with valuable insights into the reliability of synthesized views, facilitating safer decision-making in real-world applications.
https://arxiv.org/abs/2403.18476
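AUSE compares two sparsification curves: one removes the most *uncertain* predictions first, while the oracle removes the most *erroneous* ones; a perfect uncertainty ranking makes the curves coincide and the area between them zero. A toy implementation (the discretization into fixed steps is an assumption):

```python
# Hedged sketch of the Area Under the Sparsification Error (AUSE) metric.
def ause(errors, uncertainties, steps=10):
    n = len(errors)
    by_unc = sorted(range(n), key=lambda i: uncertainties[i], reverse=True)
    by_err = sorted(range(n), key=lambda i: errors[i], reverse=True)
    area = 0.0
    for k in range(steps):
        keep = n - int(n * k / steps)  # drop a growing fraction from the head
        mean = lambda idx: sum(errors[i] for i in idx[-keep:]) / keep if keep else 0.0
        area += mean(by_unc) - mean(by_err)  # actual curve minus oracle curve
    return area / steps
```

Using a differentiable variant of this quantity as a loss term, as the paper proposes, lets uncertainty estimation be optimized jointly with image reconstruction.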
Out-of-distribution (OoD) detection techniques for deep neural networks (DNNs) become crucial thanks to their filtering of abnormal inputs, especially when DNNs are used in safety-critical applications and interact with an open and dynamic environment. Nevertheless, integrating OoD detection into state-of-the-art (SOTA) object detection DNNs poses significant challenges, partly due to the complexity introduced by the SOTA OoD construction methods, which require the modification of DNN architecture and the introduction of complex loss functions. This paper proposes a simple, yet surprisingly effective, method that requires neither retraining nor architectural change in the object detection DNN, called Box Abstraction-based Monitors (BAM). The novelty of BAM stems from using a finite union of convex box abstractions to capture the learned features of objects for in-distribution (ID) data, and an important observation that features from OoD data are more likely to fall outside of these boxes. The union of convex regions within the feature space allows the formation of non-convex and interpretable decision boundaries, overcoming the limitations of VOS-like detectors without sacrificing real-time performance. Experiments integrating BAM into Faster R-CNN-based object detection DNNs demonstrate a considerably improved performance against SOTA OoD detection techniques.
https://arxiv.org/abs/2403.18373
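The box-abstraction monitor is simple to sketch: fit axis-aligned boxes over in-distribution features per class, then flag any feature that falls outside every box. For brevity this sketch uses a single box per class, whereas BAM uses a finite union of boxes per class to form non-convex boundaries.

```python
# Hedged sketch of a box-abstraction monitor (simplification: one
# axis-aligned bounding box per class instead of a union of boxes).
def fit_boxes(features, labels):
    boxes = {}
    for f, y in zip(features, labels):
        lo, hi = boxes.get(y, (list(f), list(f)))
        boxes[y] = ([min(a, b) for a, b in zip(lo, f)],
                    [max(a, b) for a, b in zip(hi, f)])
    return boxes

def is_ood(feature, boxes):
    # OoD if the feature lies outside every class's box
    return not any(all(l <= x <= h for x, l, h in zip(feature, lo, hi))
                   for lo, hi in boxes.values())
```

Because fitting and membership tests are cheap geometric operations on frozen features, no retraining or architectural change of the detector is needed, which is the property the paper emphasizes.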
We investigate the problem of supporting Industrial Internet of Things user equipment (IIoT UEs) with intent (i.e., requested quality of service (QoS)) and random traffic arrival. A deep reinforcement learning (DRL) based centralized dynamic scheduler for time-frequency resources is proposed to learn how to schedule the available communication resources among the IIoT UEs. The proposed scheduler leverages an RL framework to adapt to the dynamic changes in the wireless communication system and traffic arrivals. Moreover, a graph-based reduction scheme is proposed to reduce the state and action space of the RL framework to allow fast convergence and a better learning strategy. Simulation results demonstrate the effectiveness of the proposed intelligent scheduler in guaranteeing the expressed intent of IIoT UEs compared to several traditional scheduling schemes, such as round-robin, semi-static, and heuristic approaches. The proposed scheduler also outperforms the contention-free and contention-based schemes in maximizing the number of successfully computed tasks.
https://arxiv.org/abs/2403.18364
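The intent-aware objective can be illustrated with a toy reward signal for the DRL scheduler: the fraction of IIoT UEs whose allocated rate satisfies their requested QoS. This is an assumed reward shape for illustration, not the paper's exact formulation.

```python
# Hedged sketch of an intent-satisfaction reward (assumption: intents and
# allocations are scalar rates keyed by UE identifier).
def step_reward(allocations, intents):
    met = sum(1 for ue, rate in allocations.items() if rate >= intents[ue])
    return met / len(intents)
```

A DRL scheduler trained against such a reward adapts resource-block allocation to random traffic arrivals, which is the behavior the round-robin and semi-static baselines cannot match.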
To ensure safe driving in dynamic environments, autonomous vehicles should possess the capability to accurately predict the lane change intentions of surrounding vehicles in advance and forecast their future trajectories. Existing motion prediction approaches have ample room for improvement, particularly in terms of long-term prediction accuracy and interpretability. In this paper, we address these challenges by proposing LC-LLM, an explainable lane change prediction model that leverages the strong reasoning capabilities and self-explanation abilities of Large Language Models (LLMs). Essentially, we reformulate the lane change prediction task as a language modeling problem, processing heterogeneous driving scenario information in natural language as prompts for input into the LLM and employing a supervised fine-tuning technique to tailor the LLM specifically for our lane change prediction task. This allows us to utilize the LLM's powerful common sense reasoning abilities to understand complex interactive information, thereby improving the accuracy of long-term predictions. Furthermore, we incorporate explanatory requirements into the prompts in the inference stage. Therefore, our LC-LLM model not only can predict lane change intentions and trajectories but also provides explanations for its predictions, enhancing the interpretability. Extensive experiments on the large-scale highD dataset demonstrate the superior performance and interpretability of our LC-LLM in lane change prediction task. To the best of our knowledge, this is the first attempt to utilize LLMs for predicting lane change behavior. Our study shows that LLMs can encode comprehensive interaction information for driving behavior understanding.
https://arxiv.org/abs/2403.18344
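The reformulation as a language modeling problem hinges on serializing heterogeneous scenario information into a natural-language prompt, with the explanation requirement appended at inference time. A sketch of that step; the field names and wording below are invented placeholders, not the paper's actual template.

```python
# Hedged sketch of prompt construction for LLM-based lane change prediction
# (assumption: the scenario dict's keys are illustrative stand-ins).
def build_prompt(scenario):
    lines = [f"Ego speed: {scenario['ego_speed']} m/s.",
             f"Target vehicle lane: {scenario['lane']}.",
             f"Gap to leading vehicle: {scenario['gap']} m.",
             "Will the target vehicle change lanes? Predict the intention "
             "and trajectory, and explain your reasoning."]
    return "\n".join(lines)
```

Supervised fine-tuning on prompts of this shape, paired with ground-truth intentions and trajectories, is what tailors the general-purpose LLM to the prediction task.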