AI-generated video has revolutionized short video production, filmmaking, and personalized media, making local video editing an essential tool. However, this progress also blurs the line between reality and fiction, posing challenges in multimedia forensics. To address this urgent issue, we propose V2A-Mark, which tackles the limitations of current video tampering forensics, such as poor generalizability, single-purpose design, and a single-modality focus. Combining the fragility of video-into-video steganography with deep robust watermarking, our method can embed invisible visual-audio localization watermarks and copyright watermarks into the original video frames and audio, enabling precise manipulation localization and copyright protection. We also design a temporal alignment and fusion module and degradation prompt learning to enhance localization accuracy and decoding robustness. Meanwhile, we introduce a sample-level audio localization method and a cross-modal copyright extraction mechanism to couple the information of audio and video frames. The effectiveness of V2A-Mark has been verified on a visual-audio tampering dataset, emphasizing its superiority in localization precision and copyright accuracy, which is crucial for the sustainable development of video editing in the AIGC video era.
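V2A-Mark embeds its watermarks with deep steganography networks, which are not reproduced here; the underlying idea of a *fragile* localization watermark can however be sketched with a classical least-significant-bit scheme. The pixel values, bit pattern, and function names below are illustrative assumptions, not the paper's method:

```python
def embed_fragile_watermark(pixels, bits):
    """Write one watermark bit into each pixel's least significant bit.

    The watermark is fragile by design: any edit that rewrites a pixel is
    likely to destroy the bit stored there, exposing the tampered region.
    """
    return [(p & ~1) | b for p, b in zip(pixels, bits)]

def localize_tampering(pixels, bits):
    """Return indices of pixels whose stored bit no longer matches the key."""
    return [i for i, (p, b) in enumerate(zip(pixels, bits)) if (p & 1) != b]
```

Embedding a known bit pattern and later re-checking it flags exactly the pixels an editor touched, which is the same localization principle V2A-Mark realizes with learned, invisible watermarks.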
https://arxiv.org/abs/2404.16824
Representation-based Siamese networks have risen to popularity in lightweight text matching due to their low deployment and inference costs. While word-level attention mechanisms have been implemented within Siamese networks to improve performance, we propose Feature Attention (FA), a novel downstream block designed to enrich the modeling of dependencies among embedding features. Employing "squeeze-and-excitation" techniques, the FA block dynamically adjusts the emphasis on individual features, enabling the network to concentrate more on features that significantly contribute to the final classification. Building upon FA, we introduce a dynamic "selection" mechanism called Selective Feature Attention (SFA), which leverages a stacked BiGRU Inception structure. The SFA block facilitates multi-scale semantic extraction by traversing different stacked BiGRU layers, encouraging the network to selectively concentrate on semantic information and embedding features across varying levels of abstraction. Both the FA and SFA blocks offer a seamless integration capability with various Siamese networks, showcasing a plug-and-play characteristic. Experimental evaluations conducted across diverse text matching baselines and benchmarks underscore the indispensability of modeling feature attention and the superiority of the "selection" mechanism.
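The "squeeze-and-excitation" idea behind the FA block can be sketched in a few lines: squeeze the token sequence into one summary per embedding dimension, pass it through a small bottleneck, and use sigmoid gates to re-weight each feature. The weight shapes and names here are illustrative assumptions, not the paper's implementation:

```python
import math

def squeeze_and_excite(embeddings, w1, w2):
    """Gate each embedding feature by a data-dependent channel weight.

    embeddings: list of token vectors (seq_len x dim).
    w1: hidden x dim matrix (squeeze -> hidden); w2: dim x hidden matrix.
    """
    dim = len(embeddings[0])
    # Squeeze: global average over the sequence for each feature dimension.
    squeezed = [sum(tok[d] for tok in embeddings) / len(embeddings) for d in range(dim)]
    # Excitation: bottleneck MLP, ReLU then sigmoid, yielding one gate per dim.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden)))) for row in w2]
    # Re-weight every token's features, emphasising informative dimensions.
    return [[tok[d] * gates[d] for d in range(dim)] for tok in embeddings]
```

Because the gates are in (0, 1) and shared across the sequence, the block suppresses uninformative embedding dimensions without changing the tensor shape, which is what makes it plug-and-play.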
https://arxiv.org/abs/2404.16776
Navigating mobile robots in social environments remains a challenging task due to the intricacies of human-robot interactions. Most of the motion planners designed for crowded and dynamic environments focus on choosing the best velocity to reach the goal while avoiding collisions, but do not explicitly consider high-level navigation behavior (passing on the left or right side, letting others pass or passing before others, etc.). In this work, we present a novel motion planner that incorporates topologically distinct paths representing diverse navigation strategies around humans. The planner selects the topology class that best imitates human behavior using a deep neural network model trained on real-world human motion data, ensuring socially intelligent and contextually aware navigation. Our system refines the chosen path through an optimization-based local planner in real time, ensuring seamless adherence to desired social behaviors. In this way, we decouple perception and local planning from the decision-making process. We evaluate the prediction accuracy of the network with real-world data. In addition, we assess the navigation capabilities in both simulation and a real-world platform, comparing it with other state-of-the-art planners. We demonstrate that our planner exhibits socially desirable behaviors and shows smooth and remarkable performance.
https://arxiv.org/abs/2404.16705
In the rapidly evolving field of artificial intelligence, ensuring safe decision-making by Large Language Models (LLMs) is a significant challenge. This paper introduces Governance of the Commons Simulation (GovSim), a simulation platform designed to study strategic interactions and cooperative decision-making in LLMs. Through this simulation environment, we explore the dynamics of resource sharing among AI agents, highlighting the importance of ethical considerations, strategic planning, and negotiation skills. GovSim is versatile and supports any text-based agent, including LLM agents. Using the Generative Agent framework, we create a standard agent that facilitates the integration of different LLMs. Our findings reveal that within GovSim, only two out of 15 tested LLMs managed to achieve a sustainable outcome, indicating a significant gap in the ability of models to manage shared resources. Furthermore, we find that when agents' ability to communicate is removed, they overuse the shared resource, highlighting the importance of communication for cooperation. Interestingly, most LLMs lack the ability to make universalized hypotheses, which highlights a significant weakness in their reasoning skills. We open source the full suite of our research results, including the simulation environment, agent prompts, and a comprehensive web interface.
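The commons dynamics GovSim studies can be illustrated with a toy common-pool resource loop (the regrowth rule, parameter values, and function name are this sketch's assumptions, not the GovSim environment):

```python
def simulate_commons(policies, stock=100.0, regrowth=0.1, capacity=100.0, steps=20):
    """Toy common-pool resource game in the spirit of a commons simulation.

    Each policy maps the current stock to a harvest request; requests are
    capped at an equal share of the pool. The pool regrows logistically,
    and an emptied pool never recovers (a collapse).
    """
    share = 1.0 / len(policies)
    for _ in range(steps):
        harvest = sum(min(policy(stock), stock * share) for policy in policies)
        stock = max(0.0, stock - harvest)
        if stock <= 0.0:
            return 0.0  # collapse: nothing left to regrow
        stock = min(capacity, stock + regrowth * stock * (1.0 - stock / capacity))
    return stock
```

Greedy agents that request the whole pool destroy it immediately, while restrained agents leave a positive stock, mirroring the sustainable-versus-unsustainable gap the paper observes across LLMs.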
https://arxiv.org/abs/2404.16698
We explored the addition bias, a cognitive tendency to prefer adding elements over removing them to alter an initial state or structure, by conducting four preregistered experiments examining the problem-solving behavior of both humans and OpenAI's GPT-4 large language model. The experiments involved 588 participants from the U.S. and 680 iterations of the GPT-4 model. The problem-solving task was either to create symmetry within a grid (Experiments 1 and 3) or to edit a summary (Experiments 2 and 4). As hypothesized, we found that overall, the addition bias was present. Solution efficiency (Experiments 1 and 2) and valence of the instruction (Experiments 3 and 4) played important roles. Human participants were less likely to use additive strategies when subtraction was relatively more efficient than when addition and subtraction were equally efficient. GPT-4 exhibited the opposite behavior, with a strong addition bias when subtraction was more efficient. In terms of instruction valence, GPT-4 was more likely to add words when asked to "improve" compared to "edit", whereas humans did not show this effect. When we looked at the addition bias under different conditions, we found more biased responses for GPT-4 compared to humans. Our findings highlight the importance of considering comparable and sometimes superior subtractive alternatives, as well as reevaluating one's own and particularly the language models' problem-solving behavior.
https://arxiv.org/abs/2404.16692
Autonomous navigation in dynamic environments is a complex but essential task for autonomous robots, with recent deep reinforcement learning approaches showing promising results. However, the complexity of the real world makes it infeasible to train agents in every possible scenario configuration. Moreover, existing methods typically overlook factors such as robot kinodynamic constraints, or assume perfect knowledge of the environment. In this work, we present RUMOR, a novel planner for differential-drive robots that uses deep reinforcement learning to navigate in highly dynamic environments. Unlike other end-to-end DRL planners, it uses a descriptive robocentric velocity space model to extract the dynamic environment information, enhancing training effectiveness and scenario interpretation. Additionally, we propose an action space that inherently considers robot kinodynamics and train it in a simulator that reproduces problematic aspects of the real world, reducing the gap between reality and simulation. We extensively compare RUMOR with other state-of-the-art approaches, demonstrating a better performance, and provide a detailed analysis of the results. Finally, we validate RUMOR's performance in real-world settings by deploying it on a ground robot. Our experiments, conducted in crowded scenarios and unseen environments, confirm the algorithm's robustness and transferability.
https://arxiv.org/abs/2404.16672
Developing autonomous agents for mobile devices can significantly enhance user interactions by offering increased efficiency and accessibility. However, despite the growing interest in mobile device control agents, the absence of a commonly adopted benchmark makes it challenging to quantify scientific progress in this area. In this work, we introduce B-MoCA: a novel benchmark designed specifically for evaluating mobile device control agents. To create a realistic benchmark, we develop B-MoCA based on the Android operating system and define 60 common daily tasks. Importantly, we incorporate a randomization feature that changes various aspects of mobile devices, including user interface layouts and language settings, to assess generalization performance. We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs as well as agents trained from scratch using human expert demonstrations. While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to enhance their effectiveness. Our source code is publicly available at this https URL.
https://arxiv.org/abs/2404.16660
Current state-of-the-art two-stage models on the instance segmentation task suffer from several types of imbalances. In this paper, we address the Intersection over Union (IoU) distribution imbalance of positive input Regions of Interest (RoIs) during the training of the second stage. Our Self-Balanced R-CNN (SBR-CNN), an evolved version of the Hybrid Task Cascade (HTC) model, introduces new loop mechanisms for bounding box and mask refinement. With an improved Generic RoI Extraction (GRoIE), we also address the feature-level imbalance at the Feature Pyramid Network (FPN) level, which originates from a non-uniform integration between low- and high-level features from the backbone layers. In addition, redesigning the architecture heads toward a fully convolutional approach with FCC further reduces the number of parameters and yields more insight into the connection between the task to solve and the layers used. Moreover, our SBR-CNN model shows the same or even better improvements if adopted in conjunction with other state-of-the-art models. In fact, with a lightweight ResNet-50 as backbone, evaluated on the COCO minival 2017 dataset, our model reaches 45.3% and 41.5% AP for object detection and instance segmentation, with 12 epochs and without extra tricks. The code is available at this https URL
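The IoU distribution imbalance the paper targets is easy to make concrete: compute IoU per positive RoI and histogram the values into the usual 0.5-1.0 bins. A minimal sketch (bin edges and helper names are this example's choices, not the paper's code):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def bucket_by_iou(rois, gt_box, edges=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Histogram positive RoIs (IoU >= 0.5) into bins to expose the imbalance."""
    counts = [0] * len(edges)
    for roi in rois:
        v = iou(roi, gt_box)
        for i, lo in enumerate(edges):
            hi = edges[i + 1] if i + 1 < len(edges) else 1.0 + 1e-9
            if lo <= v < hi:
                counts[i] += 1
    return counts
```

In practice such a histogram is heavily skewed toward the lowest bin, which is exactly the training imbalance SBR-CNN's refinement loops aim to rebalance.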
https://arxiv.org/abs/2404.16633
Spatiotemporal action localization in chaotic scenes is a challenging task on the way toward advanced video understanding. Paving the way with high-quality video feature extraction and enhancing the precision of detector-predicted anchors can effectively improve model performance. To this end, we propose SFMViT, a high-performance dual-stream spatiotemporal feature extraction network with an anchor pruning strategy. The backbone of our SFMViT is composed of ViT and SlowFast with prior knowledge of spatiotemporal action localization, which fully utilizes ViT's excellent global feature extraction capabilities and SlowFast's spatiotemporal sequence modeling capabilities. Secondly, we introduce a confidence maximum heap to prune the anchors detected in each frame, retaining only the effective anchors. These designs enable our SFMViT to achieve a mAP of 26.62% on the Chaotic World dataset, far exceeding existing models. Code is available at this https URL.
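The confidence-maximum-heap pruning step amounts to keeping the k highest-confidence anchors per frame. A minimal sketch with Python's `heapq` (the data layout and `k` are this example's assumptions):

```python
import heapq

def prune_anchors(anchors, k):
    """Keep the k highest-confidence anchors of one frame via a max-heap.

    anchors: list of (confidence, box) pairs detected in a single frame.
    """
    # heapq is a min-heap, so push negated confidences to pop the largest
    # first; the index is a tie-breaker so boxes are never compared directly.
    heap = [(-conf, i, box) for i, (conf, box) in enumerate(anchors)]
    heapq.heapify(heap)
    kept = []
    while heap and len(kept) < k:
        neg_conf, _, box = heapq.heappop(heap)
        kept.append((-neg_conf, box))
    return kept
```

Heapifying is O(n) and popping k anchors is O(k log n), so this filters dense per-frame detections cheaply before the downstream localization head.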
https://arxiv.org/abs/2404.16609
Maintaining temporal stability is crucial in multi-agent trajectory prediction. Insufficient regularization to uphold this stability often results in fluctuations in kinematic states, leading to inconsistent predictions and the amplification of errors. In this study, we introduce a framework called Multi-Agent Trajectory prediction via neural interaction Energy (MATE). This framework assesses the interactive motion of agents by employing neural interaction energy, which captures the dynamics of interactions and illustrates their influence on the future trajectories of agents. To bolster temporal stability, we introduce two constraints: inter-agent interaction constraint and intra-agent motion constraint. These constraints work together to ensure temporal stability at both the system and agent levels, effectively mitigating prediction fluctuations inherent in multi-agent systems. Comparative evaluations against previous methods on four diverse datasets highlight the superior prediction accuracy and generalization capabilities of our model.
https://arxiv.org/abs/2404.16579
In below freezing winter conditions, road surface friction can greatly vary based on the mixture of snow, ice, and water on the road. Friction between the road and vehicle tyres is a critical parameter defining vehicle dynamics, and therefore road surface friction information is essential to acquire for several intelligent transportation applications, such as safe control of automated vehicles or alerting drivers of slippery road conditions. This paper explores computer vision-based evaluation of road surface friction from roadside cameras. Previous studies have extensively investigated the application of convolutional neural networks for the task of evaluating the road surface condition from images. Here, we propose a hybrid deep learning architecture, WCamNet, consisting of a pretrained visual transformer model and convolutional blocks. The motivation of the architecture is to combine general visual features provided by the transformer model, as well as finetuned feature extraction properties of the convolutional blocks. To benchmark the approach, an extensive dataset was gathered from national Finnish road infrastructure network of roadside cameras and optical road surface friction sensors. Acquired results highlight that the proposed WCamNet outperforms previous approaches in the task of predicting the road surface friction from the roadside camera images.
https://arxiv.org/abs/2404.16578
In recent years, with the rapid development of computer information technology, the development of artificial intelligence has been accelerating. Traditional geometry recognition technology is relatively backward and its recognition rate is low. Faced with massive information databases, traditional algorithm models inevitably suffer from low recognition accuracy and poor performance. Deep learning theory has gradually become a very important part of machine learning, and convolutional neural networks (CNNs) reduce the difficulty of graphics recognition algorithms. In this paper, exploiting the LeNet-5 architecture's advantages of weight sharing, feature extraction, and classification, the proposed geometric pattern recognition model trains faster on the training data set. By constructing shared feature parameters for the algorithm model and using the cross-entropy loss function during recognition, we improve the generalization of the model and the average recognition accuracy on the test data set.
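The cross-entropy loss mentioned above is standard: the negative log-probability the softmax assigns to the true class. A minimal, numerically stabilised sketch for one sample (the toy logits below are illustrative):

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample.

    Subtracting the max logit before exponentiating avoids overflow
    without changing the softmax probabilities.
    """
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    log_prob = (logits[label] - m) - math.log(sum(exps))
    return -log_prob  # low when the model is confident and correct
```

Uniform logits over two classes give a loss of ln 2, and the loss shrinks toward zero as the correct logit dominates, which is why minimising it sharpens the classifier's predictions.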
https://arxiv.org/abs/2404.16561
3D object generation has undergone significant advancements, yielding high-quality results. However, current methods fall short of achieving precise user control, often yielding results that do not align with user expectations, thus limiting their applicability. Users envisioning specific 3D objects face significant challenges in realizing their concepts with current generative models due to limited interaction capabilities. Existing methods mainly offer two approaches: (i) interpreting textual instructions with constrained controllability, or (ii) reconstructing 3D objects from 2D images. Both of them limit customization to the confines of the 2D reference and potentially introduce undesirable artifacts during the 3D lifting process, restricting the scope for direct and versatile 3D modifications. In this work, we introduce Interactive3D, an innovative framework for interactive 3D generation that grants users precise control over the generative process through extensive 3D interaction capabilities. Interactive3D is constructed in two cascading stages, utilizing distinct 3D representations. The first stage employs Gaussian Splatting for direct user interaction, allowing modifications and guidance of the generative direction at any intermediate step through (i) Adding and Removing components, (ii) Deformable and Rigid Dragging, (iii) Geometric Transformations, and (iv) Semantic Editing. Subsequently, the Gaussian splats are transformed into InstantNGP. We introduce a novel (v) Interactive Hash Refinement module to further add details and extract the geometry in the second stage. Our experiments demonstrate that Interactive3D markedly improves the controllability and quality of 3D generation. Our project webpage is available at \url{this https URL}.
https://arxiv.org/abs/2404.16510
Document-level Relation Extraction (DocRE) is the task of extracting all semantic relationships from a document. While studies have been conducted on English DocRE, limited attention has been given to DocRE in non-English languages. This work delves into effectively utilizing existing English resources to promote DocRE studies in non-English languages, with Japanese as the representative case. As an initial attempt, we construct a dataset by transferring an English dataset to Japanese. However, models trained on such a dataset suffer from low recalls. We investigate the error cases and attribute the failure to different surface structures and semantics of documents translated from English and those written by native speakers. We thus switch to explore if the transferred dataset can assist human annotation on Japanese documents. In our proposal, annotators edit relation predictions from a model trained on the transferred dataset. Quantitative analysis shows that relation recommendations suggested by the model help reduce approximately 50% of the human edit steps compared with the previous approach. Experiments quantify the performance of existing DocRE models on our collected dataset, portraying the challenges of Japanese and cross-lingual DocRE.
https://arxiv.org/abs/2404.16506
In this paper, we address the challenging source-free unsupervised domain adaptation (SFUDA) for pinhole-to-panoramic semantic segmentation, given only a pinhole image pre-trained model (i.e., source) and unlabeled panoramic images (i.e., target). Tackling this problem is non-trivial due to three critical challenges: 1) semantic mismatches from the distinct Field-of-View (FoV) between domains, 2) style discrepancies inherent in the UDA problem, and 3) inevitable distortion of the panoramic images. To tackle these problems, we propose 360SFUDA++ that effectively extracts knowledge from the source pinhole model with only unlabeled panoramic images and transfers the reliable knowledge to the target panoramic domain. Specifically, we first utilize Tangent Projection (TP) as it has less distortion and meanwhile splits the equirectangular projection (ERP) into patches with fixed FoV projection (FFP) to mimic the pinhole images. Both projections are shown effective in extracting knowledge from the source model. However, as the distinct projections make it difficult to directly transfer knowledge between domains, we then propose the Reliable Panoramic Prototype Adaptation Module (RP$^2$AM) to transfer knowledge at both prediction and prototype levels. RP$^2$AM selects the confident knowledge and integrates panoramic prototypes for reliable knowledge adaptation. Moreover, we introduce the Cross-projection Dual Attention Module (CDAM), which better aligns the spatial and channel characteristics across projections at the feature level between domains. Both knowledge extraction and transfer processes are synchronously updated to reach the best performance. Extensive experiments on synthetic and real-world benchmarks, including outdoor and indoor scenarios, demonstrate that our 360SFUDA++ achieves significantly better performance than prior SFUDA methods.
https://arxiv.org/abs/2404.16501
Automated driving systems require monitoring mechanisms to ensure safe operation, especially if system components degrade or fail. Their runtime self-representation plays a key role as it provides a priori knowledge about the system's capabilities and limitations. In this paper, we propose a data-driven approach for deriving such a self-representation model for the motion controller of an automated vehicle. A conformalized prediction model is learned that allows estimating how operational conditions, as well as potential degradations and failures of the vehicle's actuators, impact motion control performance. During runtime behavior generation, our predictor can provide a heuristic for determining the admissible action space.
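The "conformalized" part of the prediction model can be illustrated with split conformal prediction: calibrate a quantile of held-out absolute residuals, then wrap any point prediction in an interval with the target coverage. This is a generic sketch of the conformal recipe, not the paper's controller model:

```python
import math

def conformal_interval(residuals, prediction, alpha=0.1):
    """Split-conformal interval from held-out absolute residuals.

    With exchangeable data the interval covers the true value with
    probability at least 1 - alpha.
    """
    scores = sorted(abs(r) for r in residuals)
    n = len(scores)
    # Conformal quantile rank: ceil((n + 1) * (1 - alpha)), clipped to n.
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    q = scores[rank - 1]
    return prediction - q, prediction + q
```

The width of the interval is itself a useful runtime signal: wider intervals under degraded actuators would shrink the admissible action space.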
https://arxiv.org/abs/2404.16500
Model-free reinforcement learning methods lack an inherent mechanism to impose behavioural constraints on the trained policies. While certain extensions exist, they remain limited to specific types of constraints, such as value constraints with additional reward signals or visitation density constraints. In this work, we try to unify these existing techniques and bridge the gap with classical optimization and control theory, using a generic primal-dual framework for value-based and actor-critic reinforcement learning methods. The obtained dual formulations turn out to be especially useful for imposing additional constraints on the learned policy, as an intrinsic relationship between such dual constraints (or regularization terms) and reward modifications in the primal is revealed. Furthermore, using this framework, we are able to introduce some novel types of constraints, allowing to impose bounds on the policy's action density or on costs associated with transitions between consecutive states and actions. From the adjusted primal-dual optimization problems, a practical algorithm is derived that supports various combinations of policy constraints that are automatically handled throughout training using trainable reward modifications. The resulting $\texttt{DualCRL}$ method is examined in more detail and evaluated under different (combinations of) constraints on two interpretable environments. The results highlight the efficacy of the method, which ultimately provides the designer of such systems with a versatile toolbox of possible policy constraints.
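The primal-dual interplay between constraints and reward modifications can be sketched with plain dual ascent on a one-step problem: the primal step acts greedily on a penalty-modified reward, and the dual step grows the multiplier while the cost constraint is violated. This is a generic textbook sketch under illustrative rewards and costs, not the DualCRL algorithm:

```python
def dual_ascent(reward, cost, budget, actions, lr=0.5, iters=200):
    """Alternate a greedy primal step and a multiplier (dual) step.

    reward, cost: callables on actions; budget: max admissible cost.
    """
    lam = 0.0
    action = actions[0]
    for _ in range(iters):
        # Primal: best action under the dual-modified reward r(a) - lam * c(a).
        action = max(actions, key=lambda a: reward(a) - lam * cost(a))
        # Dual: raise the multiplier while the cost constraint is violated,
        # lower it (never below zero) while there is slack.
        lam = max(0.0, lam + lr * (cost(action) - budget))
    return action, lam
```

With a strictly preferable but over-budget action available, the multiplier settles at the penalty level that makes the feasible action optimal, which is exactly the "constraint as reward modification" relationship the dual formulation exposes.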
https://arxiv.org/abs/2404.16468
Mental health in children and adolescents has been steadily deteriorating over the past few years [1]. The recent advent of Large Language Models (LLMs) offers much hope for cost- and time-efficient scaling of monitoring and intervention, yet despite specifically prevalent issues such as school bullying and eating disorders, previous studies have not investigated performance in this domain or in open information extraction, where the set of answers is not predetermined. We create a new dataset of Reddit posts from adolescents aged 12-19, annotated by expert psychiatrists for the following categories: TRAUMA, PRECARITY, CONDITION, SYMPTOMS, SUICIDALITY and TREATMENT, and compare expert labels to annotations from two top-performing LLMs (GPT3.5 and GPT4). In addition, we create two synthetic datasets to assess whether LLMs perform better when annotating data as they generate it. We find GPT4 to be on par with human inter-annotator agreement, and performance on synthetic data to be substantially higher; however, the model still occasionally errs on issues of negation and factuality, and the higher performance on synthetic data is driven by the greater complexity of real data rather than an inherent advantage.
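Inter-annotator agreement of the kind used to benchmark GPT4 against experts is commonly measured with Cohen's kappa, which corrects raw agreement for chance. A minimal sketch for two annotators over nominal labels (the toy label lists are illustrative; the paper does not specify this exact metric):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' label sequences.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement);
    1.0 means perfect agreement, 0.0 means agreement at chance level.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Chance agreement: product of each annotator's marginal rate per category.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)
```

Comparing a model's kappa against the human-human kappa on the same items is what "on par with human inter-annotator agreement" operationalises.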
https://arxiv.org/abs/2404.16461
Semi-supervised action recognition aims to improve spatio-temporal reasoning ability with a few labeled data in conjunction with a large amount of unlabeled data. Despite recent advancements, existing powerful methods are still prone to making ambiguous predictions under scarce labeled data, manifested as a limited ability to distinguish different actions with similar spatio-temporal information. In this paper, we approach this problem by empowering the model with two capabilities, namely discriminative spatial modeling and temporal structure modeling, for learning discriminative spatio-temporal representations. Specifically, we propose an Adaptive Contrastive Learning (ACL) strategy. It assesses the confidence of all unlabeled samples by the class prototypes of the labeled data, and adaptively selects positive-negative samples from a pseudo-labeled sample bank to construct contrastive learning. Additionally, we introduce a Multi-scale Temporal Learning (MTL) strategy. It can highlight informative semantics from long-term clips and integrate them into the short-term clip while suppressing noisy information. Both of these new techniques are then integrated in a unified framework to encourage the model to make accurate predictions. Extensive experiments on UCF101, HMDB51 and Kinetics400 show the superiority of our method over prior state-of-the-art approaches.
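The confidence-assessment step of ACL — scoring an unlabeled sample against class prototypes built from labeled data — can be sketched with cosine similarity and a softmax over prototypes. The feature vectors, class names, and thresholding are this example's assumptions, not the paper's exact formulation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def assess_confidence(sample, prototypes):
    """Pseudo-label a sample by its nearest class prototype.

    Returns the best class and a softmax confidence over prototype
    similarities; low-confidence samples can be left out of training.
    """
    sims = {cls: cosine(sample, proto) for cls, proto in prototypes.items()}
    m = max(sims.values())
    exps = {cls: math.exp(s - m) for cls, s in sims.items()}
    best = max(sims, key=sims.get)
    return best, exps[best] / sum(exps.values())
```

Ranking the pseudo-labeled bank by this confidence is what lets the strategy adaptively pick trustworthy positives and hard negatives for the contrastive loss.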
https://arxiv.org/abs/2404.16416
In today's data- and information-rich world, summarization techniques are essential for harnessing vast amounts of text to extract key information and enhance decision-making and efficiency. In particular, topic-focused summarization is important due to its ability to tailor content to specific aspects of an extended text. However, this usually requires extensive labelled datasets and considerable computational power. This study introduces a novel method, Augmented-Query Summarization (AQS), for topic-focused summarization without the need for extensive labelled datasets, leveraging query augmentation and hierarchical clustering. This approach facilitates the transferability of machine learning models to the task of summarization, circumventing the need for topic-specific training. Through real-world tests, our method demonstrates the ability to generate relevant and accurate summaries, showing its potential as a cost-effective solution in data-rich environments. This innovation paves the way for broader application and accessibility in the field of topic-focused summarization technology, offering a scalable, efficient method for personalized content extraction.
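Hierarchical (agglomerative) clustering of the kind AQS leverages can be illustrated on the simplest possible case: single-linkage merging of 1-D similarity scores, where the closest clusters are repeatedly fused until k remain. This toy sketch stands in for clustering sentence embeddings by relevance to the augmented query; it is not the paper's pipeline:

```python
def single_linkage(points, k):
    """Agglomerative single-linkage clustering of 1-D points into k clusters.

    For sorted 1-D data, single linkage reduces to repeatedly merging the
    pair of adjacent clusters separated by the smallest gap.
    """
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > k:
        gaps = [clusters[i + 1][0] - clusters[i][-1] for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))  # closest adjacent pair
        clusters[i] = clusters[i] + clusters[i + 1]
        del clusters[i + 1]
    return clusters
```

In a summarizer, each resulting cluster groups sentences with similar query relevance, and the summary can then draw from the most relevant cluster without any topic-specific training.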
https://arxiv.org/abs/2404.16411