Current societal challenges exceed what human effort, whether individual or collective, can accomplish alone. As AI evolves, its role within human collectives is poised to vary from an assistive tool to a participatory member. Humans and AI possess complementary capabilities that, when synergized, can achieve a level of collective intelligence surpassing the capabilities of either humans or AI in isolation. However, the interactions in human-AI systems are inherently complex, involving intricate processes and interdependencies. This review incorporates perspectives from network science to conceptualize a multilayer representation of human-AI collective intelligence, comprising a cognition layer, a physical layer, and an information layer. Within this multilayer network, humans and AI agents exhibit varying characteristics; humans differ in diversity from surface-level to deep-level attributes, while AI agents range in degrees of functionality and anthropomorphism. The interplay among these agents shapes the overall structure and dynamics of the system. We explore how agents' diversity and interactions influence the system's collective intelligence. Furthermore, we present an analysis of real-world instances of AI-enhanced collective intelligence. We conclude by addressing the potential challenges in AI-enhanced collective intelligence and offer perspectives on future developments in this field.
当前的社会挑战已超出人类个体或集体努力所能单独应对的范围。随着AI的发展,其在人类集体中的角色将从辅助工具演变为参与成员。人类和AI拥有互补的能力,二者协同可以实现超越人类或AI单独能力的集体智能。然而,人类-AI系统中的交互本质上是复杂的,涉及错综的过程和相互依存关系。本综述引入网络科学的视角,构建了人类-AI集体智能的多层表示,包括认知层、物理层和信息层。在这一多层网络中,人类和AI智能体表现出不同的特征:人类的多样性涵盖从表层到深层的属性,而AI智能体在功能性和拟人化程度上各不相同。这些智能体之间的相互作用塑造了系统的整体结构和动态。我们探讨了智能体的多样性及其交互如何影响系统的集体智能,并对现实世界中AI增强集体智能的实例进行了分析。最后,我们讨论了AI增强集体智能可能面临的挑战,并展望了该领域的未来发展。
https://arxiv.org/abs/2403.10433
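The three-layer representation described above can be made concrete with a toy data structure; a minimal Python sketch, where the layer names follow the review but the class, agent IDs, and attributes are purely illustrative:

```python
from collections import defaultdict

class MultilayerNetwork:
    """Minimal multilayer network: each layer keeps its own edge set,
    while the node set (human and AI agents) is shared across layers."""
    def __init__(self, layers):
        self.layers = {name: defaultdict(set) for name in layers}
        self.nodes = {}  # agent id -> attributes (e.g. kind: human/AI)

    def add_agent(self, agent_id, **attrs):
        self.nodes[agent_id] = attrs

    def add_edge(self, layer, u, v):
        # undirected interaction within one layer
        self.layers[layer][u].add(v)
        self.layers[layer][v].add(u)

    def degree(self, layer, agent_id):
        return len(self.layers[layer][agent_id])

# A toy system: two humans and one AI agent coupled on three layers.
net = MultilayerNetwork(["cognition", "physical", "information"])
net.add_agent("h1", kind="human")
net.add_agent("h2", kind="human")
net.add_agent("ai1", kind="AI")
net.add_edge("information", "h1", "ai1")
net.add_edge("information", "h2", "ai1")
net.add_edge("cognition", "h1", "h2")
```

This separates who interacts with whom per layer, which is the structural property the review analyzes.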
Current rock engineering design in drill and blast tunnelling primarily relies on engineers' observational assessments. Measure While Drilling (MWD) data, a high-resolution sensor dataset collected during tunnel excavation, is underutilised, mainly serving for geological visualisation. This study aims to automate the translation of MWD data into actionable metrics for rock engineering. It seeks to link data to specific engineering actions, thus providing critical decision support for geological challenges ahead of the tunnel face. Leveraging a large and geologically diverse dataset of 500,000 drillholes from 15 tunnels, the research introduces models for accurate rock mass quality classification in a real-world tunnelling context. Both conventional machine learning and image-based deep learning are explored to classify MWD data into Q-classes and Q-values, examples of metrics describing the stability of the rock mass, using both tabular and image data. The results indicate that the K-nearest neighbours algorithm in an ensemble with tree-based models using tabular data effectively classifies rock mass quality. It achieves a cross-validated balanced accuracy of 0.86 in classifying rock mass into the Q-classes A, B, C, D, E1, E2, and 0.95 for a binary classification of E versus the rest. Classification using a CNN with MWD-images for each blasting round resulted in a balanced accuracy of 0.82 for binary classification. Regressing the Q-value from tabular MWD data achieved cross-validated R2 and MSE scores of 0.80 and 0.18 for an ensemble model similar to the one used in classification. High performance in regression and classification boosts confidence in automated rock mass assessment. Applying advanced modelling on a unique dataset demonstrates MWD data's value in improving rock mass classification accuracy and advancing data-driven rock engineering design, reducing manual intervention.
目前,钻爆法隧道施工中的岩石工程设计主要依赖工程师的观察评估。随钻测量(MWD)数据是隧道开挖过程中采集的高分辨率传感器数据集,目前利用不足,主要仅用于地质可视化。本研究旨在将MWD数据自动转化为岩石工程的可操作指标,将数据与具体的工程行动联系起来,为掌子面前方的地质挑战提供关键决策支持。研究利用来自15条隧道、50万个钻孔的大型且地质多样的数据集,提出了在真实隧道施工背景下准确分类岩体质量的模型。研究同时探索了传统机器学习和基于图像的深度学习方法,使用表格和图像数据将MWD数据分类为Q类和Q值——二者均为描述岩体稳定性的指标。结果表明,使用表格数据、由K近邻算法与树模型组成的集成方法能够有效分类岩体质量:在将岩体划分为A、B、C、D、E1、E2等Q类时,交叉验证的平衡精度达到0.86;在E类与其余类别的二分类中达到0.95。使用CNN对每个爆破循环的MWD图像进行二分类,平衡精度为0.82。使用与分类相似的集成模型从表格MWD数据回归Q值,交叉验证的R2和MSE分别为0.80和0.18。回归和分类的高性能增强了对自动化岩体评估的信心。在这一独特数据集上应用先进建模方法,证明了MWD数据在提高岩体分类准确性、推进数据驱动的岩石工程设计以及减少人工干预方面的价值。
https://arxiv.org/abs/2403.10404
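The tabular classification setup above (K-nearest neighbours over per-drillhole features) can be illustrated with a bare-bones KNN vote; the two features and the Q-class labels below are invented for illustration and are not the paper's feature set:

```python
import math
from collections import Counter

def knn_classify(train, labels, x, k=3):
    """Plain k-nearest-neighbours vote over tabular feature vectors.
    `train` holds per-drillhole feature vectors (e.g. penetration rate
    and rotation pressure -- illustrative MWD-style features only)."""
    dists = sorted(
        (math.dist(x, t), lab) for t, lab in zip(train, labels)
    )
    votes = Counter(lab for _, lab in dists[:k])
    return votes.most_common(1)[0][0]

# Toy MWD-like feature vectors: (penetration rate, rotation pressure).
train = [(1.0, 0.9), (1.1, 1.0), (3.0, 2.9), (3.2, 3.1)]
labels = ["A", "A", "E", "E"]
print(knn_classify(train, labels, (1.05, 0.95)))  # -> A
```

The paper combines such a KNN with tree-based models in an ensemble; this sketch shows only the KNN member.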
Accurate positioning of underwater robots in confined environments is crucial for inspection and mapping tasks and is also a prerequisite for autonomous operations. Presently, no available positioning system is suited for real-world use in confined underwater environments, unconstrained by environmental lighting and water turbidity levels, and accurate enough for reliable and repeatable navigation. This shortage presents a significant barrier to enhancing the capabilities of ROVs in such scenarios. This paper introduces an innovative positioning system for ROVs operating in confined, cluttered underwater settings, achieved through the collaboration of an omnidirectional surface vehicle and an ROV. A formulation is proposed and evaluated in simulation against ground truth. The experimental results from the simulation form a proof of principle of the proposed system and also demonstrate its deployability. Unlike many previous approaches, the system does not rely on fixed infrastructure or tracking of features in the environment and can cover large enclosed areas without additional equipment.
水下机器人在受限环境中的精确定位对于检测和测绘任务至关重要,也是自主作业的先决条件。目前,尚无适用于受限水下现实环境的定位系统能够不受环境光照和水体浊度的影响,并具备可靠、可重复导航所需的精度。这一缺失严重阻碍了ROV在此类场景中能力的提升。本文提出了一种创新的定位系统,用于在受限、杂乱的水下环境中作业的ROV,该系统通过全向水面航行器与ROV的协作实现。我们提出了一种数学表述,并在仿真中对照真值进行了评估。仿真实验结果构成了所提系统的原理验证,同时也展示了其可部署性。与许多先前的方法不同,该系统不依赖固定基础设施或环境特征跟踪,并且无需额外设备即可覆盖较大的封闭区域。
https://arxiv.org/abs/2403.10397
Pseudo-label-based semi-supervised learning (SSL) algorithms trained on a class-imbalanced set face two cascading challenges: 1) Classifiers tend to be biased towards majority classes, and 2) Biased pseudo-labels are used for training. It is difficult to appropriately re-balance the classifiers in SSL because the class distribution of an unlabeled set is often unknown and could be mismatched with that of a labeled set. We propose a novel class-imbalanced SSL algorithm called class-distribution-mismatch-aware debiasing (CDMAD). For each iteration of training, CDMAD first assesses the classifier's biased degree towards each class by calculating the logits on an image without any patterns (e.g., a solid color image), which can be considered irrelevant to the training set. CDMAD then refines biased pseudo-labels of the base SSL algorithm by ensuring the classifier's neutrality. CDMAD uses these refined pseudo-labels during the training of the base SSL algorithm to improve the quality of the representations. In the test phase, CDMAD similarly refines biased class predictions on test samples. CDMAD can be seen as an extension of post-hoc logit adjustment to address the challenge of incorporating the unknown class distribution of the unlabeled set when re-balancing the biased classifier under class distribution mismatch. CDMAD ensures Fisher consistency for the balanced error. Extensive experiments verify the effectiveness of CDMAD.
基于伪标签的半监督学习(SSL)算法在类不平衡数据集上训练时面临两个级联挑战:1)分类器倾向于偏向多数类;2)有偏的伪标签被用于训练。由于未标注集的类分布通常未知,并且可能与有标注集的类分布不匹配,因此很难适当地重新平衡分类器。我们提出了一种新颖的类不平衡SSL算法,称为类分布不匹配感知去偏(CDMAD)。在每次训练迭代中,CDMAD首先通过计算分类器在无任何模式的图像(如纯色图像)上的logits来评估其对各个类的偏置程度,这类图像可以被认为与训练集无关。然后,CDMAD通过确保分类器的中立性来精炼基础SSL算法的有偏伪标签,并在基础SSL算法的训练中使用这些精炼后的伪标签以提高表示的质量。在测试阶段,CDMAD同样对测试样本的有偏类别预测进行精炼。CDMAD可被视为事后logit调整的扩展,用以应对在类分布不匹配的情况下,如何利用未知的未标注集类分布来重新平衡有偏分类器的挑战。CDMAD对平衡误差保证了Fisher一致性。大量实验验证了CDMAD的有效性。
https://arxiv.org/abs/2403.10391
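The debiasing step described above — measuring the classifier's per-class bias on a pattern-free image and removing it from a sample's logits — can be sketched on a toy 3-class example (the actual CDMAD procedure runs inside an SSL training loop; this only shows the logit refinement):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cdmad_refine(sample_logits, blank_logits):
    """CDMAD-style debiasing sketch: the classifier's logits on a
    pattern-free (e.g. solid-colour) image estimate its per-class bias,
    which is subtracted from a sample's logits before the softmax."""
    refined = [s - b for s, b in zip(sample_logits, blank_logits)]
    return softmax(refined)

# A classifier biased toward class 0: the blank image scores class 0 high.
blank = [2.0, 0.0, 0.0]
sample = [2.5, 1.0, 0.2]
probs = cdmad_refine(sample, blank)
print(max(range(3), key=lambda c: probs[c]))  # -> 1 after debiasing
```

Without the subtraction, class 0 would win on raw logits; removing the measured bias flips the prediction to the genuinely supported class.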
Advances in Deep Learning have made possible reliable landmark tracking of human bodies and faces that can be used for a variety of tasks. We test a recent Computer Vision solution, MediaPipe Holistic (MPH), to find out if its tracking of the facial features is reliable enough for a linguistic analysis of data from sign languages, and compare it to an older solution (OpenFace, OF). We use an existing data set of sentences in Kazakh-Russian Sign Language and a newly created small data set of videos with head tilts and eyebrow movements. We find that MPH does not perform well enough for linguistic analysis of eyebrow movement -- but in a different way from OF, which is also performing poorly without correction. We reiterate a previous proposal to train additional correction models to overcome these limitations.
深度学习的进步使得对人体和面部关键点的可靠跟踪成为可能,并可用于各种任务。我们测试了一种最新的计算机视觉方案——MediaPipe Holistic(MPH),以确定其面部特征跟踪是否足够可靠,能够用于手语数据的语言学分析,并将其与较早的方案(OpenFace,OF)进行比较。我们使用了一个现有的哈萨克-俄罗斯手语句子数据集,以及一个新建的小型数据集,其中包含头部倾斜和眉毛运动的视频。我们发现,MPH在眉毛运动的语言学分析方面表现不够好,但其表现不佳的方式与OF不同——OF在未经校正的情况下同样表现较差。我们重申了之前的建议,即训练额外的校正模型以克服这些局限。
https://arxiv.org/abs/2403.10367
Recent progress in human shape learning shows that neural implicit models are effective in generating 3D human surfaces from a limited number of views, and even from a single RGB image. However, existing monocular approaches still struggle to recover fine geometric details such as face, hands or cloth wrinkles. They are also easily prone to depth ambiguities that result in distorted geometries along the camera optical axis. In this paper, we explore the benefits of incorporating depth observations in the reconstruction process by introducing ANIM, a novel method that reconstructs arbitrary 3D human shapes from single-view RGB-D images with an unprecedented level of accuracy. Our model learns geometric details from both multi-resolution pixel-aligned and voxel-aligned features to leverage depth information and enable spatial relationships, mitigating depth ambiguities. We further enhance the quality of the reconstructed shape by introducing a depth-supervision strategy, which improves the accuracy of the signed distance field estimation of points that lie on the reconstructed surface. Experiments demonstrate that ANIM outperforms state-of-the-art works that use RGB, surface normals, point cloud or RGB-D data as input. In addition, we introduce ANIM-Real, a new multi-modal dataset comprising high-quality scans paired with consumer-grade RGB-D captures, and our protocol to fine-tune ANIM, enabling high-quality reconstruction from real-world human capture.
人体形状学习的最新进展表明,神经隐式模型能够有效地从有限数量的视角、甚至单张RGB图像生成3D人体表面。然而,现有的单目方法仍难以恢复精细的几何细节,如面部、手部或衣物褶皱,而且容易受到深度歧义的影响,导致沿相机光轴方向的几何畸变。在本文中,我们通过提出ANIM来探讨在重建过程中引入深度观测的好处。ANIM是一种新方法,能够以前所未有的精度从单视角RGB-D图像重建任意3D人体形状。我们的模型从多分辨率的像素对齐和体素对齐特征中学习几何细节,以利用深度信息并建立空间关系,从而缓解深度歧义。我们进一步引入深度监督策略来提升重建形状的质量,提高位于重建表面上的点的符号距离场估计精度。实验表明,ANIM优于以RGB、表面法线、点云或RGB-D数据为输入的最先进方法。此外,我们还提出了ANIM-Real,这是一个新的多模态数据集,包含与消费级RGB-D相机拍摄配对的高质量扫描,以及用于微调ANIM的协议,使其能够对真实世界的人体捕捉数据实现高质量重建。
https://arxiv.org/abs/2403.10357
Surface parameterization is a fundamental geometry processing problem with rich downstream applications. Traditional approaches are designed to operate on well-behaved mesh models with high-quality triangulations that are laboriously produced by specialized 3D modelers, and are thus unable to meet the processing demands created by the current explosion of ordinary 3D data. In this paper, we seek to perform UV unwrapping on unstructured 3D point clouds. Technically, we propose ParaPoint, an unsupervised neural learning pipeline for achieving global free-boundary surface parameterization by building point-wise mappings between given 3D points and 2D UV coordinates with adaptively deformed boundaries. We ingeniously construct several geometrically meaningful sub-networks with specific functionalities, and assemble them into a bi-directional cycle mapping framework. We also design effective loss functions and auxiliary differential geometric constraints for the optimization of the neural mapping process. To the best of our knowledge, this work makes the first attempt to investigate neural point cloud parameterization that pursues both global mappings and free boundaries. Experiments demonstrate the effectiveness and inspiring potential of our proposed learning paradigm. The code will be publicly available.
表面参数化是一个基础的几何处理问题,具有丰富的下游应用。传统方法针对的是由专业3D建模师精心制作、具有高质量三角剖分的规范网格模型,因此无法满足当前普通3D数据爆炸式增长带来的处理需求。在本文中,我们尝试对非结构化3D点云进行UV展开。具体而言,我们提出了ParaPoint,一种无监督神经学习流程,通过在给定3D点与具有自适应变形边界的2D UV坐标之间建立逐点映射,实现全局自由边界的表面参数化。我们巧妙地构建了若干具有特定功能且几何意义明确的子网络,并将它们组装成一个双向循环映射框架。我们还为神经映射过程的优化设计了有效的损失函数和辅助微分几何约束。据我们所知,这是首次尝试研究同时追求全局映射和自由边界的神经点云参数化。实验证明了我们提出的学习范式的有效性和令人鼓舞的潜力。代码将公开。
https://arxiv.org/abs/2403.10349
Neural implicit surface representation methods have recently shown impressive 3D reconstruction results. However, existing solutions struggle to reconstruct urban outdoor scenes due to their large, unbounded, and highly detailed nature. Hence, to achieve accurate reconstructions, additional supervision data such as LiDAR, strong geometric priors, and long training times are required. To tackle such issues, we present SCILLA, a new hybrid implicit surface learning method to reconstruct large driving scenes from 2D images. SCILLA's hybrid architecture models two separate implicit fields: one for the volumetric density and another for the signed distance to the surface. To accurately represent urban outdoor scenarios, we introduce a novel volume-rendering strategy that relies on self-supervised probabilistic density estimation to sample points near the surface and transition progressively from volumetric to surface representation. Our solution permits a proper and fast initialization of the signed distance field without relying on any geometric prior on the scene, compared to concurrent methods. By conducting extensive experiments on four outdoor driving datasets, we show that SCILLA can learn an accurate and detailed 3D surface scene representation in various urban scenarios while being two times faster to train compared to previous state-of-the-art solutions.
神经隐式表面表示方法最近展现了令人印象深刻的3D重建效果。然而,城市户外场景规模大、无边界且细节丰富,现有方案难以对其进行重建。因此,为了获得准确的重建,需要额外的监督数据(如激光雷达)、强几何先验以及较长的训练时间。为解决这些问题,我们提出了SCILLA,一种新的混合隐式表面学习方法,用于从2D图像重建大型驾驶场景。SCILLA的混合架构建模了两个独立的隐式场:一个表示体积密度,另一个表示到表面的符号距离。为了准确表示城市户外场景,我们引入了一种新颖的体渲染策略,它依赖自监督的概率密度估计在表面附近采样点,并逐步从体积表示过渡到表面表示。与同期方法相比,我们的方案无需依赖任何场景几何先验,即可对符号距离场进行恰当且快速的初始化。通过在四个户外驾驶数据集上进行的大量实验,我们表明SCILLA能够在各种城市场景中学习到准确而细致的3D表面场景表示,且训练速度比此前最先进的方案快两倍。
https://arxiv.org/abs/2403.10344
Threat hunting is the practice of sifting through system logs to detect malicious activities that might have bypassed existing security measures. It can be performed in several ways, one of which is based on detecting anomalies. We propose an unsupervised framework, called continuous bag-of-terms-and-time (CBoTT), and publish its application programming interface (API) to help researchers and cybersecurity analysts perform anomaly-based threat hunting among SIEM logs geared toward process auditing on endpoint devices. Analyses show that our framework consistently outperforms benchmark approaches. When logs are sorted by likelihood of being an anomaly (from most likely to least), our approach identifies anomalies near the top of the ranking (at percentiles 1.82-6.46) while benchmark approaches identify the same anomalies further down (at percentiles 3.25-80.92). This framework can be used by other researchers to conduct benchmark analyses and by cybersecurity analysts to find anomalies in SIEM logs.
威胁狩猎是指通过筛查系统日志来检测可能绕过现有安全措施的恶意活动。它可以通过多种方式执行,其中一种基于异常检测。我们提出了一个名为连续词袋与时间(continuous bag-of-terms-and-time,CBoTT)的无监督框架,并发布了其应用程序编程接口(API),以帮助研究人员和网络安全分析师在面向终端设备进程审计的SIEM日志中进行基于异常的威胁狩猎。分析表明,我们的框架始终优于基准方法。当日志按异常可能性从高到低排序时,我们的方法在排名靠前的位置(百分位1.82–6.46)即可识别出异常,而基准方法要到更靠后的位置(百分位3.25–80.92)才识别出相同的异常。该框架可供其他研究人员进行基准分析,也可供网络安全分析师在SIEM日志中查找异常。
https://arxiv.org/abs/2403.10327
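The percentile-based evaluation quoted above — how far down the anomaly-sorted log list a known anomaly appears — can be sketched as follows (the scores are illustrative, not CBoTT output):

```python
def anomaly_percentile(scores, anomaly_index):
    """Rank logs from most to least anomalous and return the percentile
    position (lower is better) at which a known anomaly is found --
    the evaluation style the CBoTT abstract reports (1.82-6.46 vs
    3.25-80.92 for benchmark approaches)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    rank = order.index(anomaly_index)  # 0 = most anomalous
    return 100.0 * (rank + 1) / len(scores)

scores = [0.1, 0.9, 0.2, 0.3, 0.05]  # higher = more anomalous
print(anomaly_percentile(scores, 1))  # true anomaly ranks first -> 20.0
```

A perfect detector places every true anomaly at the lowest achievable percentile, which is why the smaller numbers in the abstract indicate the better method.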
We present a knowledge integration framework (called KIF) that uses Wikidata as a lingua franca to integrate heterogeneous knowledge bases. These can be triplestores, relational databases, CSV files, etc., which may or may not use the Wikidata dialect of RDF. KIF leverages Wikidata's data model and vocabulary plus user-defined mappings to expose a unified view of the integrated bases while keeping track of the context and provenance of their statements. The result is a virtual knowledge base which behaves like an "extended Wikidata" and which can be queried either through an efficient filter interface or using SPARQL. We present the design and implementation of KIF, discuss how we have used it to solve a real integration problem in the domain of chemistry (involving Wikidata, PubChem, and IBM CIRCA), and present experimental results on the performance and overhead of KIF.
我们提出了一个知识整合框架(称为KIF),它以Wikidata作为通用语(lingua franca)来整合异构知识库。这些知识库可以是三元组存储、关系数据库、CSV文件等,可能使用也可能不使用Wikidata的RDF方言。KIF利用Wikidata的数据模型和词汇表,加上用户定义的映射,对被整合的知识库呈现统一视图,同时跟踪其陈述的上下文和出处。其结果是一个行为类似"扩展版Wikidata"的虚拟知识库,可以通过高效的过滤接口或SPARQL进行查询。我们介绍了KIF的设计与实现,讨论了如何使用它解决化学领域的一个真实整合问题(涉及Wikidata、PubChem和IBM CIRCA),并给出了关于KIF性能和开销的实验结果。
https://arxiv.org/abs/2403.10304
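The filter-versus-SPARQL duality above can be illustrated with a sketch. KIF's actual filter API and term encoding are not documented in the abstract, so the function below is a hypothetical rendering of a (subject, property, value) pattern in Wikidata's RDF dialect:

```python
def wikidata_filter_to_sparql(subject=None, prop=None, value=None):
    """Illustrative only: render a KIF-style (subject, property, value)
    filter as a SPARQL pattern over Wikidata's direct-property dialect.
    Unspecified slots become variables."""
    s = subject or "?s"
    p = f"wdt:{prop}" if prop else "?p"
    v = value or "?v"
    return (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        f"SELECT * WHERE {{ {s} {p} {v} }}"
    )

# All statements with the "instance of" property (P31).
q = wikidata_filter_to_sparql(prop="P31")
print(q)
```

The point of such a translation layer is that the same filter can be answered by any integrated base, whether or not it natively speaks Wikidata's RDF vocabulary.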
Classical structural-based visual localization methods offer high accuracy but face trade-offs in terms of storage, speed, and privacy. A recent innovation, a keypoint scene coordinate regression (KSCR) method named D2S, addresses these issues by leveraging graph attention networks to enhance keypoint relationships and predict their 3D coordinates using a simple multilayer perceptron (MLP). Camera pose is then determined via PnP+RANSAC, using established 2D-3D correspondences. While KSCR achieves competitive results, rivaling state-of-the-art image-retrieval methods like HLoc across multiple benchmarks, its performance is hindered when data samples are limited due to the deep learning model's reliance on extensive data. This paper proposes a solution to this challenge by introducing a pipeline for keypoint descriptor synthesis using Neural Radiance Fields (NeRF). By generating novel poses and feeding them into a trained NeRF model to create new views, our approach enhances the KSCR's generalization capabilities in data-scarce environments. The proposed system can improve localization accuracy by up to 50\% while requiring only a fraction of the time for data synthesis. Furthermore, its modular design allows for the integration of multiple NeRFs, offering a versatile and efficient solution for visual localization. The implementation is publicly available at: this https URL.
经典的基于结构的视觉定位方法精度高,但在存储、速度和隐私方面存在权衡。最近的一项创新——名为D2S的关键点场景坐标回归(KSCR)方法,通过图注意力网络增强关键点之间的关系,并使用简单的多层感知机(MLP)预测其3D坐标,从而解决了这些问题。随后利用已建立的2D-3D对应关系,通过PnP+RANSAC确定相机位姿。尽管KSCR在多个基准上取得了有竞争力的结果,可与HLoc等最先进的图像检索方法相媲美,但由于深度学习模型对大量数据的依赖,当数据样本有限时其性能会受到影响。为应对这一挑战,本文提出了一种利用神经辐射场(NeRF)合成关键点描述符的流程:生成新的位姿并将其输入训练好的NeRF模型以渲染新视角,从而增强KSCR在数据稀缺环境中的泛化能力。所提系统可将定位精度提升高达50%,而数据合成仅需少量时间。此外,其模块化设计允许集成多个NeRF,为视觉定位提供了通用且高效的解决方案。实现已公开于:this https URL。
https://arxiv.org/abs/2403.10297
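The first half of the pipeline above — generating novel camera poses to feed into a trained NeRF — can be sketched as simple pose perturbation; the pose format (x, y, z, yaw) and jitter scale below are assumptions for illustration, and the NeRF rendering call is omitted:

```python
import random

def synthesize_poses(base_pose, n, jitter=0.1, seed=0):
    """Sketch of the data-augmentation idea: perturb a known camera pose
    to produce novel poses that a trained NeRF could render into new
    views for keypoint-descriptor synthesis. Pose layout and jitter
    magnitude are illustrative, not the paper's parameters."""
    rng = random.Random(seed)
    poses = []
    for _ in range(n):
        poses.append(tuple(c + rng.uniform(-jitter, jitter) for c in base_pose))
    return poses

novel = synthesize_poses((0.0, 0.0, 2.0, 0.0), n=5)
print(len(novel))  # -> 5
```

Each synthesized pose would then be rendered by the NeRF into a new view from which keypoint descriptors are extracted, enlarging the training set without new captures.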
Single-modal object re-identification (ReID) faces great challenges in maintaining robustness within complex visual scenarios. In contrast, multi-modal object ReID utilizes complementary information from diverse modalities, showing great potentials for practical applications. However, previous methods may be easily affected by irrelevant backgrounds and usually ignore the modality gaps. To address above issues, we propose a novel learning framework named \textbf{EDITOR} to select diverse tokens from vision Transformers for multi-modal object ReID. We begin with a shared vision Transformer to extract tokenized features from different input modalities. Then, we introduce a Spatial-Frequency Token Selection (SFTS) module to adaptively select object-centric tokens with both spatial and frequency information. Afterwards, we employ a Hierarchical Masked Aggregation (HMA) module to facilitate feature interactions within and across modalities. Finally, to further reduce the effect of backgrounds, we propose a Background Consistency Constraint (BCC) and an Object-Centric Feature Refinement (OCFR). They are formulated as two new loss functions, which improve the feature discrimination with background suppression. As a result, our framework can generate more discriminative features for multi-modal object ReID. Extensive experiments on three multi-modal ReID benchmarks verify the effectiveness of our methods. The code is available at this https URL.
单模态目标重识别(ReID)在复杂视觉场景中难以保持鲁棒性。相比之下,多模态目标ReID利用来自不同模态的互补信息,展现出巨大的实际应用潜力。然而,先前的方法容易受到无关背景的影响,且通常忽略模态间的差距。为解决上述问题,我们提出了一个名为EDITOR的新学习框架,用于从视觉Transformer中为多模态目标ReID选择多样化的token。我们首先使用共享的视觉Transformer从不同输入模态提取token化特征。然后,我们引入空间-频率token选择(SFTS)模块,自适应地选择兼具空间和频率信息的以目标为中心的token。接着,我们采用层次化掩码聚合(HMA)模块,促进模态内和模态间的特征交互。最后,为进一步降低背景的影响,我们提出了背景一致性约束(BCC)和目标中心特征精炼(OCFR),二者被表述为两个新的损失函数,通过背景抑制提升特征的判别性。因此,我们的框架能够为多模态目标ReID生成更具判别性的特征。在三个多模态ReID基准上进行的大量实验验证了我们方法的有效性。代码可在 this https URL 获取。
https://arxiv.org/abs/2403.10254
In this study, we address the intricate challenge of multi-task dense prediction, encompassing tasks such as semantic segmentation, depth estimation, and surface normal estimation, particularly when dealing with partially annotated data (MTPSL). The complexity arises from the absence of complete task labels for each training image. Given the inter-related nature of these pixel-wise dense tasks, our focus is on mining and capturing cross-task relationships. Existing solutions typically rely on learning global image representations for global cross-task image matching, imposing constraints that, unfortunately, sacrifice the finer structures within the images. Attempting local matching as a remedy faces hurdles due to the lack of precise region supervision, making local alignment a challenging endeavor. The introduction of Segment Anything Model (SAM) sheds light on addressing local alignment challenges by providing free and high-quality solutions for region detection. Leveraging SAM-detected regions, the subsequent challenge lies in aligning the representations within these regions. Diverging from conventional methods that directly learn a monolithic image representation, our proposal involves modeling region-wise representations using Gaussian Distributions. Aligning these distributions between corresponding regions from different tasks imparts higher flexibility and capacity to capture intra-region structures, accommodating a broader range of tasks. This innovative approach significantly enhances our ability to effectively capture cross-task relationships, resulting in improved overall performance in partially supervised multi-task dense prediction scenarios. Extensive experiments conducted on two widely used benchmarks underscore the superior effectiveness of our proposed method, showcasing state-of-the-art performance even when compared to fully supervised methods.
在这项研究中,我们处理多任务密集预测这一复杂挑战,涵盖语义分割、深度估计和表面法线估计等任务,尤其是在处理部分标注数据(MTPSL)的情形下。其复杂性源于每张训练图像并不具备全部任务的标签。鉴于这些像素级密集任务之间的相互关联,我们的重点是挖掘和捕捉跨任务关系。现有方案通常依赖学习全局图像表示来进行全局跨任务图像匹配,这种约束不幸地牺牲了图像中的精细结构;而以局部匹配作为补救又面临障碍:由于缺乏精确的区域监督,局部对齐成为一项艰巨的任务。Segment Anything Model(SAM)的出现为解决局部对齐挑战带来了曙光,因为它为区域检测提供了免费且高质量的方案。在利用SAM检测到的区域之后,接下来的挑战在于对齐这些区域内的表示。与直接学习单一整体图像表示的传统方法不同,我们提出使用高斯分布来建模逐区域的表示。在不同任务的对应区域之间对齐这些分布,赋予了更高的灵活性和捕捉区域内部结构的能力,从而适应更广泛的任务。这一创新方法显著增强了我们有效捕捉跨任务关系的能力,提升了部分监督多任务密集预测场景下的整体性能。在两个广泛使用的基准上进行的大量实验凸显了所提方法的卓越效果,即便与完全监督的方法相比也展现了最先进的性能。
https://arxiv.org/abs/2403.10252
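Aligning region-wise Gaussians requires a divergence with a closed form; one standard choice is the Gaussian KL divergence, shown univariate below (the paper's exact alignment objective is not specified in the abstract, so take this as an illustrative candidate):

```python
import math

def gaussian_kl(mu1, var1, mu2, var2):
    """KL(N(mu1, var1) || N(mu2, var2)) for univariate Gaussians --
    the kind of closed-form divergence one could minimise to align
    region-wise Gaussian representations across tasks."""
    return 0.5 * (
        var1 / var2
        + (mu2 - mu1) ** 2 / var2
        - 1.0
        + math.log(var2 / var1)
    )

print(gaussian_kl(0.0, 1.0, 0.0, 1.0))  # identical distributions -> 0.0
```

Because the divergence depends on means and variances rather than on a single pooled vector, matching distributions can preserve intra-region variability that a monolithic representation would average away.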
In recent advancements within the domain of Large Language Models (LLMs), there has been a notable emergence of agents capable of addressing Robotic Process Automation (RPA) challenges through enhanced cognitive capabilities and sophisticated reasoning. This development heralds a new era of scalability and human-like adaptability in goal attainment. In this context, we introduce AUTONODE (Autonomous User-interface Transformation through Online Neuro-graphic Operations and Deep Exploration). AUTONODE employs advanced neuro-graphical techniques to facilitate autonomous navigation and task execution on web interfaces, thereby obviating the necessity for predefined scripts or manual intervention. Our engine empowers agents to comprehend and implement complex workflows, adapting to dynamic web environments with unparalleled efficiency. Our methodology synergizes cognitive functionalities with robotic automation, endowing AUTONODE with the ability to learn from experience. We have integrated an exploratory module, DoRA (Discovery and mapping Operation for graph Retrieval Agent), which is instrumental in constructing a knowledge graph that the engine utilizes to optimize its actions and achieve objectives with minimal supervision. The versatility and efficacy of AUTONODE are demonstrated through a series of experiments, highlighting its proficiency in managing a diverse array of web-based tasks, ranging from data extraction to transaction processing.
在大型语言模型(LLM)领域的最新进展中,涌现出一批能够凭借增强的认知能力和复杂的推理来应对机器人流程自动化(RPA)挑战的智能体。这一发展预示着目标达成方面可扩展性与类人适应性的新时代。在此背景下,我们介绍了AUTONODE(通过在线神经图操作与深度探索实现自主用户界面转换)。AUTONODE采用先进的神经图技术,在网页界面上实现自主导航与任务执行,从而无需预定义脚本或人工干预。我们的引擎使智能体能够理解并执行复杂的工作流,以无与伦比的效率适应动态的网页环境。我们的方法将认知功能与机器人自动化相结合,赋予AUTONODE从经验中学习的能力。我们还集成了一个探索模块DoRA(面向图检索智能体的发现与映射操作),它对于构建知识图至关重要——引擎利用该知识图优化其行动,并在最少监督下达成目标。通过一系列实验,我们展示了AUTONODE的多功能性和有效性,凸显其在管理从数据提取到交易处理等各类网页任务方面的能力。
https://arxiv.org/abs/2403.10171
User Interface (UI) understanding has been an increasingly popular topic over the last few years. So far, the focus has been almost exclusively on web and mobile applications. In this paper, we introduce the harder task of computer UI understanding. With the goal of enabling research in this field, we have generated a dataset with a set of videos where a user is performing a sequence of actions and each image shows the desktop contents at that time point. We also present a framework that is composed of a synthetic sample generation pipeline to augment the dataset with relevant characteristics, and a contrastive learning method to classify images in the videos. We take advantage of the natural conditional, tree-like, relationship of the images' characteristics to regularize the learning of the representations by dealing with multiple partial tasks simultaneously. Experimental results show that the proposed framework outperforms previously proposed hierarchical multi-label contrastive losses in fine-grain UI classification.
用户界面(UI)理解在过去几年里成为越来越热门的话题。迄今为止,研究几乎完全集中于Web和移动应用。在本文中,我们提出了更困难的任务:计算机(桌面)UI理解。为了推动该领域的研究,我们生成了一个数据集,其中包含一组用户执行一系列操作的视频,每张图像展示了对应时间点的桌面内容。我们还提出了一个框架,它由一个用相关特征扩充数据集的合成样本生成管道,以及一个对视频中图像进行分类的对比学习方法组成。我们利用图像特征之间天然的条件化树状关系,通过同时处理多个部分任务来正则化表示的学习。实验结果表明,在细粒度UI分类上,所提框架优于先前提出的层次化多标签对比损失。
https://arxiv.org/abs/2403.10170
Recent advancements in dynamic neural radiance field methods have yielded remarkable outcomes. However, these approaches rely on the assumption of sharp input images. When faced with motion blur, existing dynamic NeRF methods often struggle to generate high-quality novel views. In this paper, we propose DyBluRF, a dynamic radiance field approach that synthesizes sharp novel views from a monocular video affected by motion blur. To account for motion blur in input images, we simultaneously capture the camera trajectory and object Discrete Cosine Transform (DCT) trajectories within the scene. Additionally, we employ a global cross-time rendering approach to ensure consistent temporal coherence across the entire scene. We curate a dataset comprising diverse dynamic scenes that are specifically tailored for our task. Experimental results on our dataset demonstrate that our method outperforms existing approaches in generating sharp novel views from motion-blurred inputs while maintaining spatial-temporal consistency of the scene.
动态神经辐射场方法的最新进展已取得显著成果。然而,这些方法依赖于输入图像清晰的假设。面对运动模糊,现有的动态NeRF方法往往难以生成高质量的新视角。在本文中,我们提出了DyBluRF,一种动态辐射场方法,能够从受运动模糊影响的单目视频中合成清晰的新视角。为了处理输入图像中的运动模糊,我们同时捕获场景中的相机轨迹和物体的离散余弦变换(DCT)轨迹。此外,我们采用全局跨时间渲染方法,以确保整个场景在时间上的一致性。我们构建了一个专门为该任务定制、包含多样动态场景的数据集。在该数据集上的实验结果表明,我们的方法在从运动模糊输入生成清晰新视角方面优于现有方法,同时保持了场景的时空一致性。
https://arxiv.org/abs/2403.10103
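The DCT trajectory representation mentioned above compresses a motion path into a few frequency coefficients. A plain DCT-II sketch (the paper applies this idea to object trajectories within a blur interval; the toy sequence below just shows the transform's behaviour):

```python
import math

def dct2(x):
    """Unnormalised DCT-II of a 1-D sequence: coefficient k measures how
    much of cos(pi*k*(2t+1)/(2n)) is present in the signal."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * k * (2 * t + 1) / (2 * n)) for t in range(n))
        for k in range(n)
    ]

# A constant trajectory has all its energy in the DC coefficient.
coeffs = dct2([1.0, 1.0, 1.0, 1.0])
print(round(coeffs[0], 6), round(coeffs[1], 6))  # -> 4.0 0.0
```

Smooth trajectories concentrate energy in the low-frequency coefficients, so a handful of DCT terms can parameterise the motion that produced the blur.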
Blind face restoration (BFR) is a highly challenging problem due to the uncertainty of degradation patterns. Current methods generalize poorly across photorealistic and heterogeneous domains. In this paper, we propose a Diffusion-Information-Diffusion (DID) framework for diffusion manifold hallucination correction (DiffMAC), which achieves high-generalization face restoration in diverse degraded scenes and heterogeneous domains. Specifically, the first diffusion stage aligns the restored face with spatial feature embedding of the low-quality face based on AdaIN, which synthesizes degradation-removal results but with uncontrollable artifacts for some hard cases. Based on Stage I, Stage II considers information compression using a manifold information bottleneck (MIB) and finetunes the first diffusion model to improve facial fidelity. DiffMAC effectively fights against blind degradation patterns and synthesizes high-quality faces with attribute and identity consistencies. Experimental results demonstrate the superiority of DiffMAC over state-of-the-art methods, with a high degree of generalization in real-world and heterogeneous settings. The source code and models will be public.
盲人脸修复(BFR)由于退化模式的不确定性而极具挑战性。现有方法在照片级真实域和异质域之间的泛化能力较低。在本文中,我们提出了一个扩散-信息-扩散(DID)框架来实现扩散流形幻觉校正(DiffMAC),在多样的退化场景和异质域中实现高泛化能力的人脸修复。具体来说,第一个扩散阶段基于AdaIN将修复后的人脸与低质量人脸的空间特征嵌入对齐,能够合成去除退化的结果,但在一些困难情形下会产生不可控的伪影。在阶段I的基础上,阶段II利用流形信息瓶颈(MIB)进行信息压缩,并微调第一个扩散模型以提升人脸保真度。DiffMAC有效对抗盲退化模式,合成在属性和身份上保持一致的高质量人脸。实验结果表明,DiffMAC优于最先进的方法,在真实世界和异质环境中具有很高的泛化能力。源代码和模型将公开。
https://arxiv.org/abs/2403.10098
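Stage I's AdaIN alignment has a standard closed form: normalise the content features, then re-scale and shift them to the style features' statistics. A minimal per-channel sketch (flat lists stand in for the spatial feature maps DiffMAC actually uses):

```python
import math

def adain(content, style, eps=1e-8):
    """Adaptive instance normalisation on one channel: the output keeps
    the content's pattern but adopts the style's mean and std."""
    def stats(xs):
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        return mu, math.sqrt(var + eps)
    mu_c, sd_c = stats(content)
    mu_s, sd_s = stats(style)
    return [(x - mu_c) / sd_c * sd_s + mu_s for x in content]

out = adain([0.0, 2.0], [10.0, 14.0])
print([round(v, 6) for v in out])  # -> [10.0, 14.0]
```

Transferring only first- and second-order statistics is what lets the restored face inherit the low-quality face's spatial layout without copying its degradation.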
The pretraining-finetuning paradigm has gained widespread adoption in vision tasks and other fields, yet it faces the significant challenge of high sample annotation costs. To mitigate this, the concept of active finetuning has emerged, aiming to select the most appropriate samples for model finetuning within a limited budget. Traditional active learning methods often struggle in this setting due to their inherent bias in batch selection. Furthermore, the recent active finetuning approach has primarily concentrated on aligning the distribution of selected subsets with the overall data pool, focusing solely on diversity. In this paper, we propose a Bi-Level Active Finetuning framework to select the samples for annotation in one shot, which includes two stages: core sample selection for diversity, and boundary sample selection for uncertainty. The process begins with the identification of pseudo-class centers, followed by an innovative denoising method and an iterative strategy for boundary sample selection in the high-dimensional feature space, all without relying on ground-truth labels. Our comprehensive experiments provide both qualitative and quantitative evidence of our method's efficacy, outperforming all the existing baselines.
预训练-微调范式已在视觉任务及其他领域得到广泛应用,但它面临样本标注成本高昂这一重大挑战。为缓解这一问题,主动微调的概念应运而生,旨在在有限预算内为模型微调选择最合适的样本。传统的主动学习方法由于其批量选择中的固有偏差,在这一设定下往往表现不佳。此外,近期的主动微调方法主要致力于使所选子集的分布与整体数据池对齐,仅关注多样性。在本文中,我们提出了一个双层主动微调(Bi-Level Active Finetuning)框架,一次性选出用于标注的样本,包含两个阶段:面向多样性的核心样本选择,以及面向不确定性的边界样本选择。该过程从识别伪类中心开始,随后采用一种创新的去噪方法,并在高维特征空间中对边界样本选择进行迭代,整个过程均不依赖真实标签。我们的全面实验提供了定性和定量证据,证明了该方法的有效性,超越了所有现有基线。
https://arxiv.org/abs/2403.10069
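The two-stage selection above can be sketched once pseudo-class centres are given; the centre identification and denoising steps from the paper are omitted, and the features and centres below are toy values:

```python
import math

def bilevel_select(feats, centers, n_core, n_boundary):
    """Sketch of the one-shot bi-level idea: core samples are those
    closest to a pseudo-class centre (diversity), boundary samples are
    those whose two nearest centres are nearly tied (uncertainty)."""
    def dists(f):
        return sorted(math.dist(f, c) for c in centers)

    def margin(i):
        d = dists(feats[i])
        return d[1] - d[0]  # small margin = near a decision boundary

    core = sorted(range(len(feats)), key=lambda i: dists(feats[i])[0])[:n_core]
    rest = [i for i in range(len(feats)) if i not in core]
    boundary = sorted(rest, key=margin)[:n_boundary]
    return core, boundary

feats = [(0.0, 0.0), (1.0, 1.0), (0.5, 0.5), (0.1, 0.1)]
centers = [(0.0, 0.0), (1.0, 1.0)]
core, boundary = bilevel_select(feats, centers, n_core=2, n_boundary=1)
print(core, boundary)  # -> [0, 1] [2]
```

The midpoint sample wins the boundary slot because its distances to the two centres tie, which is exactly the uncertainty criterion the second stage targets.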
Face de-identification in videos is a challenging task in the domain of computer vision, primarily used in privacy-preserving applications. Despite the considerable progress achieved through generative vision models, there remain multiple challenges in the latest approaches. They lack a comprehensive discussion and evaluation of aspects such as realism, temporal coherence, and preservation of non-identifiable features. In our work, we propose RID-Twin: a novel pipeline that leverages the state-of-the-art generative models, and decouples identity from motion to perform automatic face de-identification in videos. We investigate the task from a holistic point of view and discuss how our approach addresses the pertinent existing challenges in this domain. We evaluate the performance of our methodology on the widely employed VoxCeleb2 dataset, and also a custom dataset designed to accommodate the limitations of certain behavioral variations absent in the VoxCeleb2 dataset. We discuss the implications and advantages of our work and suggest directions for future research.
视频中的人脸去识别是计算机视觉领域一项具有挑战性的任务,主要用于隐私保护应用。尽管生成式视觉模型已带来可观的进展,但最新的方法仍存在诸多不足:它们缺乏对真实感、时间一致性以及非身份特征保留等方面的全面讨论和评估。在本工作中,我们提出了RID-Twin:一种利用最先进生成模型、将身份与运动解耦以实现视频自动人脸去识别的新流程。我们从整体视角审视这一任务,并讨论了我们的方法如何应对该领域现存的相关挑战。我们在广泛使用的VoxCeleb2数据集上评估了方法的性能,同时也在一个专门设计的自定义数据集上进行了评估——该数据集用于弥补VoxCeleb2中某些行为变化的缺失。我们讨论了这项工作的意义与优势,并提出了未来研究的方向。
https://arxiv.org/abs/2403.10058
Instruction tuning for large language models (LLMs) can drive them to produce results consistent with human goals in specific downstream tasks. However, the process of continual instruction tuning (CIT) for LLMs may bring about the catastrophic forgetting (CF) problem, where previously learned abilities are degraded. Recent methods try to alleviate the CF problem by modifying models or replaying data, which may only remember the surface-level pattern of instructions and get confused on held-out tasks. In this paper, we propose a novel continual instruction tuning method based on Key-part Information Gain (KPIG). Our method computes the information gain on masked parts to dynamically replay data and refine the training objective, which enables LLMs to capture task-aware information relevant to the correct response and alleviate overfitting to general descriptions in instructions. In addition, we propose two metrics, P-score and V-score, to measure the generalization and instruction-following abilities of LLMs. Experiments demonstrate our method achieves superior performance on both seen and held-out tasks.
对大型语言模型(LLM)进行指令微调,可以使其在特定下游任务中产生符合人类目标的结果。然而,对LLM进行持续指令微调(CIT)的过程可能带来灾难性遗忘(CF)问题,即先前学到的能力发生退化。近期方法尝试通过修改模型或回放数据来缓解CF问题,但这类做法可能只记住指令的表层模式,在保留(held-out)任务上产生混淆。在本文中,我们提出了一种基于关键部分信息增益(KPIG)的新型持续指令微调方法。我们的方法计算被掩码部分的信息增益,以动态回放数据并细化训练目标,使LLM能够捕捉与正确回答相关的任务感知信息,并缓解对指令中一般性描述的过拟合。此外,我们提出了P-score和V-score两个指标,用于衡量LLM的泛化能力和指令遵循能力。实验表明,我们的方法在已见任务和保留任务上均取得了优越的性能。
https://arxiv.org/abs/2403.10056