Despite increasingly realistic image quality, recent 3D image generative models often operate on 3D volumes of fixed extent with limited camera motions. We investigate the task of unconditionally synthesizing unbounded nature scenes, enabling arbitrarily large camera motion while maintaining a persistent 3D world model. Our scene representation consists of an extendable, planar scene layout grid, which can be rendered from arbitrary camera poses via a 3D decoder and volume rendering, and a panoramic skydome. Based on this representation, we learn a generative world model solely from single-view internet photos. Our method enables simulating long flights through 3D landscapes, while maintaining global scene consistency--for instance, returning to the starting point yields the same view of the scene. Our approach enables scene extrapolation beyond the fixed bounds of current 3D generative models, while also supporting a persistent, camera-independent world representation that stands in contrast to auto-regressive 3D prediction models. Our project page: this https URL.
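To make the persistence claim concrete, here is a toy sketch (not the paper's code; all names are illustrative) of an extendable planar grid whose chunks are generated lazily from a fixed seed, so revisiting any location reproduces identical features regardless of how far the camera has flown:

```python
import numpy as np

class ExtendableLayoutGrid:
    """Unbounded but persistent planar layout: chunks are created on demand
    and deterministically, so returning to the start yields the same scene."""
    def __init__(self, chunk_size=32, feat_dim=8, seed=0):
        self.chunk_size, self.feat_dim, self.seed = chunk_size, feat_dim, seed
        self.chunks = {}  # (i, j) chunk index -> feature array

    def _chunk(self, i, j):
        if (i, j) not in self.chunks:
            rng = np.random.default_rng(abs(hash((self.seed, i, j))) % 2**32)
            self.chunks[(i, j)] = rng.standard_normal(
                (self.chunk_size, self.chunk_size, self.feat_dim))
        return self.chunks[(i, j)]

    def features_at(self, x, y):
        cs = self.chunk_size
        return self._chunk(int(x) // cs, int(y) // cs)[int(x) % cs, int(y) % cs]
```

In the paper this role is played by a learned layout grid decoded via a 3D decoder and volume rendering; the sketch only illustrates the camera-independent, extendable bookkeeping.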
https://arxiv.org/abs/2303.13515
DEtection TRansformer (DETR) started a trend of using a group of learnable queries for unified visual perception. This work begins by applying this appealing paradigm to LiDAR-based point cloud segmentation and obtains a simple yet effective baseline. Although the naive adaptation obtains fair results, the instance segmentation performance is noticeably inferior to previous works. Diving into the details, we observe that instances in sparse point clouds are relatively small with respect to the whole scene and often have similar geometry while lacking the distinctive appearance cues that aid segmentation in the image domain. Considering that instances in 3D are characterized chiefly by their positional information, we emphasize this role during modeling and design a robust Mixed-parameterized Positional Embedding (MPE) to guide the segmentation process. It is embedded into backbone features and later guides the mask prediction and query update processes iteratively, leading to Position-Aware Segmentation (PA-Seg) and Masked Focal Attention (MFA). All these designs impel the queries to attend to specific regions and identify various instances. The method, named Position-guided Point cloud Panoptic segmentation transFormer (P3Former), outperforms previous state-of-the-art methods by 3.4% and 1.2% PQ on the SemanticKITTI and nuScenes benchmarks, respectively. The source code and models are available at this https URL.
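As a rough illustration of what a mixed parameterization can look like (the actual MPE design may differ; the frequency schedule and coordinate set are assumptions), the sketch below concatenates Fourier features of Cartesian coordinates with a polar parameterization (range and azimuth) that matches LiDAR scan geometry:

```python
import torch

def mixed_positional_embedding(xyz, num_freqs=4):
    """xyz: (..., 3) point coordinates -> (..., 10 * num_freqs) embedding."""
    x, y, z = xyz.unbind(-1)
    rho = torch.sqrt(x**2 + y**2)   # range in the ground plane
    phi = torch.atan2(y, x)         # azimuth angle
    coords = torch.stack([x, y, z, rho, phi], dim=-1)
    freqs = 2.0 ** torch.arange(num_freqs, dtype=xyz.dtype)
    angles = coords.unsqueeze(-1) * freqs            # (..., 5, num_freqs)
    emb = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return emb.flatten(-2)
```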
https://arxiv.org/abs/2303.13509
Most video restoration networks are slow, have a high computational load, and cannot be used for real-time video enhancement. In this work, we design an efficient and fast framework to perform real-time video enhancement for practical use cases like live video calls and video streams. Our proposed method, called Recurrent Bottleneck Mixer Network (ReBotNet), employs a dual-branch framework. The first branch learns spatio-temporal features by tokenizing the input frames along the spatial and temporal dimensions using a ConvNext-based encoder and processing these abstract tokens with a bottleneck mixer. To further improve temporal consistency, the second branch applies a mixer directly to tokens extracted from individual frames. A common decoder then merges the features from the two branches to predict the enhanced frame. In addition, we propose a recurrent training approach in which the previous frame's prediction is leveraged to efficiently enhance the current frame while improving temporal consistency. To evaluate our method, we curate two new datasets that emulate real-world video call and streaming scenarios, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computation, reduced memory requirements, and faster inference time.
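The recurrent scheme is easy to picture as a streaming loop (a sketch; `model(cur, prev)` is a stand-in for ReBotNet's actual interface): the previous enhanced frame conditions the next prediction, which both saves computation and stabilizes the output over time.

```python
import torch

def enhance_stream(model, frames):
    """frames: list of (C, H, W) tensors from a live video source."""
    prev = frames[0]             # bootstrap with the raw first frame
    outputs = []
    for cur in frames:
        prev = model(cur, prev)  # enhanced frame feeds the next step
        outputs.append(prev)
    return torch.stack(outputs)
```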
https://arxiv.org/abs/2303.13504
Automated diagnosis prediction from medical images is a valuable resource to support clinical decision-making. However, such systems usually need to be trained on large amounts of annotated data, which is often scarce in the medical domain. Zero-shot methods address this challenge by allowing a flexible adaptation to new settings with different clinical findings without relying on labeled data. Further, to integrate automated diagnosis into the clinical workflow, methods should be transparent and explainable, increasing medical professionals' trust and facilitating correctness verification. In this work, we introduce Xplainer, a novel framework for explainable zero-shot diagnosis in the clinical setting. Xplainer adapts the classification-by-description approach of contrastive vision-language models to the multi-label medical diagnosis task. Specifically, instead of directly predicting a diagnosis, we prompt the model to classify the existence of descriptive observations that a radiologist would look for on an X-ray scan, and use the descriptor probabilities to estimate the likelihood of a diagnosis. Our model is explainable by design, as the final diagnosis prediction is directly based on the prediction of the underlying descriptors. We evaluate Xplainer on two chest X-ray datasets, CheXpert and ChestX-ray14, and demonstrate its effectiveness in improving the performance and explainability of zero-shot diagnosis. Our results suggest that Xplainer provides a more detailed understanding of the decision-making process and can be a valuable tool for clinical diagnosis.
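Schematically, the descriptor-based prediction reduces to scoring each textual observation against the image and aggregating. The sketch below is only illustrative: the temperature, the sigmoid mapping, and mean aggregation are assumptions, and the paper's prompt design is richer.

```python
import torch

def diagnosis_probability(image_emb, descriptor_embs):
    """image_emb: (d,) and descriptor_embs: (k, d) from a contrastive
    vision-language model; returns (diagnosis prob, per-descriptor probs)."""
    sims = torch.cosine_similarity(image_emb.unsqueeze(0), descriptor_embs, dim=-1)
    desc_probs = torch.sigmoid(sims / 0.07)  # temperature is an assumption
    return desc_probs.mean(), desc_probs     # evidence stays inspectable
```

Because the diagnosis score is a function of the per-descriptor probabilities, each prediction comes with its supporting observations, which is the explainability-by-design property claimed above.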
https://arxiv.org/abs/2303.13391
Label scarcity is a bottleneck for improving task performance in specialised domains. We propose a novel compositional transfer learning framework (DoT5 - domain compositional zero-shot T5) for zero-shot domain transfer. Without access to in-domain labels, DoT5 jointly learns domain knowledge (from MLM of unlabelled in-domain free text) and task knowledge (from task training on more readily available general-domain data) in a multi-task manner. To improve the transferability of task training, we design a strategy named NLGU: we simultaneously train NLG for in-domain label-to-data generation, which enables data augmentation for self-finetuning, and NLU for label prediction. We evaluate DoT5 on the biomedical domain and the resource-lean subdomain of radiology, focusing on NLI, text summarisation and embedding learning. DoT5 demonstrates the effectiveness of compositional transfer learning through multi-task learning. In particular, DoT5 outperforms the current SOTA in zero-shot transfer by over 7 absolute points in accuracy on RadNLI. We validate DoT5 with ablations and a case study demonstrating its ability to solve challenging NLI examples requiring in-domain expertise.
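To make the NLGU strategy concrete, here is a heavily hedged sketch (all interfaces, such as `generate_text` and `finetune_nlu`, are hypothetical placeholders, not DoT5's real API): the NLG direction produces in-domain text conditioned on a label, and the generated pairs then serve as augmentation for self-finetuning the NLU direction.

```python
def nlgu_self_finetune(model, in_domain_labels):
    """Sketch of NLGU-style self-finetuning; method names are placeholders.
    NLG: label -> in-domain text; NLU: text -> label."""
    synthetic = []
    for label in in_domain_labels:
        text = model.generate_text(f"generate {label}:")  # label-to-data NLG
        synthetic.append((text, label))
    model.finetune_nlu(synthetic)  # self-finetune label prediction on pairs
    return synthetic
```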
https://arxiv.org/abs/2303.13386
Modern surgeries are performed in complex and dynamic settings, including ever-changing interactions between medical staff, patients, and equipment. The holistic modeling of the operating room (OR) is, therefore, a challenging but essential task, with the potential to optimize the performance of surgical teams and aid in developing new surgical technologies to improve patient outcomes. The holistic representation of surgical scenes as semantic scene graphs (SGG), where entities are represented as nodes and relations between them as edges, is a promising direction for fine-grained semantic OR understanding. We propose, for the first time, the use of temporal information for more accurate and consistent holistic OR modeling. Specifically, we introduce memory scene graphs, where the scene graphs of previous time steps act as the temporal representation guiding the current prediction. We design an end-to-end architecture that intelligently fuses the temporal information of our lightweight memory scene graphs with the visual information from point clouds and images. We evaluate our method on the 4D-OR dataset and demonstrate that integrating temporality leads to more accurate and consistent results, achieving a +5% increase and a new SOTA of 0.88 in macro F1. This work opens the path for representing the entire surgery history with memory scene graphs and improves the holistic understanding in the OR. Introducing scene graphs as memory representations can offer a valuable tool for many temporal understanding tasks.
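The abstract leaves the fusion mechanism open; below is a minimal sketch, under assumed interfaces and layer sizes (not the 4D-OR architecture), of how pooled embeddings of previous scene graphs could condition the current prediction:

```python
import torch
import torch.nn as nn

class MemorySceneGraphFusion(nn.Module):
    """Toy fusion of temporal memory with current visual features."""
    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, visual_feat, past_graph_embs):
        memory = past_graph_embs.mean(dim=0)  # pool the scene-graph history
        return self.fuse(torch.cat([visual_feat, memory], dim=-1))
```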
https://arxiv.org/abs/2303.13293
Scene Graph Generation (SGG) aims to extract <subject, predicate, object> relationships in images for vision understanding. Although recent works have made steady progress on SGG, they still suffer from the long-tail distribution issue: tail predicates are more costly to train and hard to distinguish due to the small amount of annotated data compared to frequent predicates. Existing re-balancing strategies try to handle it via prior rules but are still confined to pre-defined conditions, which are not scalable for various models and datasets. In this paper, we propose a Cross-modal prediCate boosting (CaCao) framework, where a visually-prompted language model is learned to generate diverse fine-grained predicates in a low-resource way. The proposed CaCao can be applied in a plug-and-play fashion and automatically strengthens existing SGG models to tackle the long-tailed problem. Based on that, we further introduce a novel Entangled cross-modal prompt approach for open-world predicate scene graph generation (Epic), where models can generalize to unseen predicates in a zero-shot manner. Comprehensive experiments on three benchmark datasets show that CaCao consistently boosts the performance of multiple scene graph generation models in a model-agnostic way. Moreover, our Epic achieves competitive performance on open-world predicate prediction.
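As a loose illustration only (every name here is hypothetical, and the paper's prompting scheme is more involved): a visually-prompted language model can be asked for candidate predicates for a subject-object region, which are then used to augment tail-class annotations.

```python
def boost_predicates(lm, visual_prompt_encoder, region_feat, top_k=5):
    """Hedged sketch of cross-modal predicate boosting: project visual
    features into the LM prompt space and sample fine-grained predicates."""
    prefix = visual_prompt_encoder(region_feat)  # visual tokens as a prompt
    return lm.generate(prompt_embeddings=prefix, num_candidates=top_k)
```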
https://arxiv.org/abs/2303.13233
Current video-based scene graph generation (VidSGG) methods have been found to perform poorly on predicting predicates that are less represented due to the inherent biased distribution in the training data. In this paper, we take a closer look at the predicates and identify that most visual relations (e.g. sit_above) involve both actional pattern (sit) and spatial pattern (above), while the distribution bias is much less severe at the pattern level. Based on this insight, we propose a decoupled label learning (DLL) paradigm to address the intractable visual relation prediction from the pattern-level perspective. Specifically, DLL decouples the predicate labels and adopts separate classifiers to learn actional and spatial patterns respectively. The patterns are then combined and mapped back to the predicate. Moreover, we propose a knowledge-level label decoupling method to transfer non-target knowledge from head predicates to tail predicates within the same pattern to calibrate the distribution of tail classes. We validate the effectiveness of DLL on the commonly used VidSGG benchmark, i.e. VidVRD. Extensive experiments demonstrate that the DLL offers a remarkably simple but highly effective solution to the long-tailed problem, achieving the state-of-the-art VidSGG performance.
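A minimal sketch of the decoupling idea (layer shapes and the predicate-to-pattern table are assumptions): two classifiers score actional and spatial patterns separately, and predicate scores are recovered from the pattern pair each predicate decomposes into.

```python
import torch
import torch.nn as nn

class DecoupledPredicateHead(nn.Module):
    """Toy decoupled label learning head: predicate = (action, space)."""
    def __init__(self, dim, n_action, n_spatial, pred2patterns):
        super().__init__()
        self.action_cls = nn.Linear(dim, n_action)
        self.spatial_cls = nn.Linear(dim, n_spatial)
        self.pred2patterns = pred2patterns  # e.g. sit_above -> (sit, above)

    def forward(self, feat):
        a = self.action_cls(feat).log_softmax(-1)
        s = self.spatial_cls(feat).log_softmax(-1)
        # predicate score = joint log-probability of its two patterns
        return torch.stack([a[..., ai] + s[..., si]
                            for ai, si in self.pred2patterns], dim=-1)
```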
https://arxiv.org/abs/2303.13209
In this paper, we propose a novel representation for grasping using contacts between multi-finger robotic hands and objects to be manipulated. This representation significantly reduces the prediction dimensions and accelerates the learning process. We present an effective end-to-end network, CMG-Net, for grasping unknown objects in a cluttered environment by efficiently predicting multi-finger grasp poses and hand configurations from a single-shot point cloud. Moreover, we create a synthetic grasp dataset that consists of five thousand cluttered scenes, 80 object categories, and 20 million annotations. We perform a comprehensive empirical study and demonstrate the effectiveness of our grasping representation and CMG-Net. Our work significantly outperforms the state-of-the-art for three-finger robotic hands. We also demonstrate that the model trained using synthetic data performs very well for real robots.
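One way to picture why a contact-based representation shrinks the output space (a sketch under assumptions: `solve_ik` stands in for any inverse-kinematics routine, and CMG-Net's actual parameterization may differ): predict one contact point per finger and recover the hand configuration from them, rather than regressing full poses directly.

```python
import numpy as np

def contacts_to_grasp(contact_points, solve_ik):
    """contact_points: (n_fingers, 3) predicted contacts on the object."""
    palm_center = contact_points.mean(axis=0)  # coarse hand placement
    joint_angles = solve_ik(palm_center, contact_points)
    return palm_center, joint_angles
```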
https://arxiv.org/abs/2303.13182
We address the challenge of training a large supernet for the object detection task, using a relatively small amount of training data. Specifically, we propose an efficient supernet-based neural architecture search (NAS) method that uses transfer learning and search space pruning. First, the supernet is pre-trained on a classification task, for which large datasets are available. Second, the search space defined by the supernet is pruned by removing candidate models that are predicted to perform poorly. To effectively remove the candidates over a wide range of resource constraints, we particularly design a performance predictor, called path filter, which can accurately predict the relative performance of the models that satisfy similar resource constraints. Hence, supernet training is more focused on the best-performing candidates. Our path filter handles prediction for paths with different resource budgets. Compared to once-for-all, our proposed method reduces the computational cost of the optimal network architecture by 30% and 63%, while yielding better accuracy-floating point operations Pareto front (0.85 and 0.45 points of improvement on average precision for Pascal VOC and COCO, respectively).
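A rough sketch of how such bucketed pruning could operate (all interfaces, including `path_filter.predict_score` and `flops_of`, are placeholders): candidates are grouped by similar resource cost, ranked within each group by the predictor, and the predicted-worst half is discarded before further supernet training.

```python
import numpy as np

def prune_search_space(paths, path_filter, flops_of, n_buckets=8, keep_ratio=0.5):
    """Keep only the predicted-best candidates within each FLOPs bucket."""
    flops = np.array([flops_of(p) for p in paths])
    edges = np.quantile(flops, np.linspace(0, 1, n_buckets + 1))
    bucket_id = np.digitize(flops, edges[1:-1])  # bucket index in 0..n_buckets-1
    kept = []
    for b in range(n_buckets):
        bucket = [p for p, k in zip(paths, bucket_id) if k == b]
        bucket.sort(key=path_filter.predict_score, reverse=True)
        kept.extend(bucket[: max(1, int(len(bucket) * keep_ratio))])
    return kept
```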
https://arxiv.org/abs/2303.13121
Deep learning models have shown promising performance in the field of diabetic retinopathy (DR) staging. However, collaboratively training a DR staging model across multiple institutions remains a challenge due to non-iid data, client reliability, and confidence evaluation of the predictions. To address these issues, we propose a novel federated uncertainty-aware aggregation paradigm (FedUAA), which considers the reliability of each client and produces a confidence estimation for the DR staging. In our FedUAA, an aggregated encoder is shared by all clients to learn a global representation of fundus images, while a novel temperature-warmed uncertainty head (TWEU) is utilized by each client for local, personalized staging criteria. Our TWEU employs an evidential deep layer to produce an uncertainty score alongside the DR staging results for client reliability evaluation. Furthermore, we develop a novel uncertainty-aware weighting module (UAW) to dynamically adjust the weights of model aggregation based on the uncertainty score distribution of each client. In our experiments, we collect five publicly available datasets from different institutions to construct a dataset for federated DR staging that satisfies the real-world non-iid condition. The experimental results demonstrate that our FedUAA achieves better DR staging performance with higher reliability compared to other federated learning methods. Our proposed FedUAA paradigm effectively addresses the challenges of collaboratively training DR staging models across multiple institutions, and provides a robust and reliable solution for the deployment of DR diagnosis models in real-world clinical scenarios.
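The abstract does not give the exact weighting rule, so the following is only a plausible sketch: clients whose evidential heads report lower average uncertainty receive larger aggregation weights through a softmax over negative uncertainty.

```python
import numpy as np

def uncertainty_aware_aggregate(client_states, client_uncertainties):
    """Hedged sketch of UAW-style aggregation; the softmax form is an
    assumption. client_states: list of {param_name: np.ndarray} dicts."""
    u = np.asarray(client_uncertainties, dtype=float)
    w = np.exp(-u) / np.exp(-u).sum()  # low uncertainty -> high weight
    return {k: sum(wi * st[k] for wi, st in zip(w, client_states))
            for k in client_states[0]}
```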
https://arxiv.org/abs/2303.13033
Knowledge Distillation (KD) uses the teacher's prediction logits as soft labels to guide the student, while self-KD does not need a real teacher to provide the soft labels. This work unifies the formulations of the two tasks by decomposing and reorganizing the generic KD loss into a Normalized KD (NKD) loss and customized soft labels for both the target class (the image's category) and non-target classes, named Universal Self-Knowledge Distillation (USKD). We decompose the KD loss and find that its non-target term forces the student's non-target logits to match the teacher's; however, the sums of the two sets of non-target logits differ, preventing them from being identical. NKD normalizes the non-target logits to equalize their sums. It can be used generally in KD and self-KD to make better use of the soft labels in the distillation loss. USKD generates customized soft labels for both target and non-target classes without a teacher. It smooths the student's target logit to obtain the soft target label and uses the rank of the intermediate feature, together with Zipf's law, to generate the soft non-target labels. For KD with teachers, our NKD achieves state-of-the-art performance on the CIFAR-100 and ImageNet datasets, boosting the ImageNet Top-1 accuracy of ResNet18 from 69.90% to 71.96% with a ResNet-34 teacher. For self-KD without teachers, USKD is the first self-KD method that can be applied effectively to both CNN and ViT models with negligible additional time and memory cost, yielding new state-of-the-art results, such as 1.17% and 0.55% accuracy gains on ImageNet for MobileNet and DeiT-Tiny, respectively. Our codes are available at this https URL.
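The normalization step can be sketched as follows (the exact loss form in the paper may differ; the temperature and epsilon are assumptions): remove the target class, renormalize both non-target distributions to sum to one, and apply a cross-entropy between them.

```python
import torch
import torch.nn.functional as F

def nkd_non_target_loss(student_logits, teacher_logits, target, temp=1.0):
    """Sketch of normalized non-target distillation. target: (B,) class ids."""
    s = F.softmax(student_logits / temp, dim=-1)
    t = F.softmax(teacher_logits / temp, dim=-1)
    mask = F.one_hot(target, s.size(-1)).bool()
    s_nt = s.masked_fill(mask, 0.0)
    t_nt = t.masked_fill(mask, 0.0)
    s_nt = s_nt / s_nt.sum(-1, keepdim=True)  # sums now match: both equal 1
    t_nt = t_nt / t_nt.sum(-1, keepdim=True)
    return -(t_nt * torch.log(s_nt + 1e-8)).sum(-1).mean()
```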
https://arxiv.org/abs/2303.13005
Facial expression is a form of communication that can be used to interact with computers and other electronic devices, and recognizing emotion from faces is an emerging practice with applications in many fields. Many cloud-based vision application programming interfaces are available that recognize emotion from facial images and video. In this article, the performances of two well-known APIs were compared using a public dataset of 980 images of facial emotions. For these experiments, a client program was developed which iterates over the image set, calls the cloud services, and caches the results of the emotion detection for each image. The performance was evaluated for each class of emotions using prediction accuracy. It was found that the prediction accuracy for each emotion varies according to the cloud service being used. Similarly, each service provider exhibits a strong variation of performance according to the class being analyzed, as is shown in more detail in this article.
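The client program described above amounts to a simple cached evaluation loop; a generic sketch follows, with `call_service` standing in for either provider's SDK (no real API signature is implied here):

```python
import json
import os

def evaluate_api(labeled_images, call_service, cache_path="results.json"):
    """labeled_images: {image_path: true_emotion}. Caches each API response
    so repeated runs do not re-query the cloud service."""
    cache = json.load(open(cache_path)) if os.path.exists(cache_path) else {}
    for path in labeled_images:
        if path not in cache:
            cache[path] = call_service(path)  # provider-specific request
    json.dump(cache, open(cache_path, "w"))
    hits = sum(cache[p] == lbl for p, lbl in labeled_images.items())
    return hits / len(labeled_images)
```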
https://arxiv.org/abs/2303.12974
Control Barrier Functions offer safety certificates by dictating controllers that enforce safety constraints. However, their response depends on the class-K function that is used to restrict the rate of change of the barrier function along the system trajectories. This paper introduces the notion of a Rate Tunable Control Barrier Function (RT-CBF), which allows for online tuning of the response of CBF-based controllers. In contrast to existing CBF approaches that use a fixed (predefined) class-K function to ensure safety, we parameterize the class-K function and adapt its parameters online. Furthermore, we discuss the challenges associated with multiple barrier constraints, namely ensuring that they admit a common control input that satisfies them simultaneously for all time. In practice, RT-CBF enables designing parameter dynamics for (1) a better-performing response, where performance is defined in terms of the cost accumulated over a time horizon, or (2) a less conservative response. We propose a model-predictive framework that computes the sensitivity of the future states with respect to the parameters and uses Sequential Quadratic Programming to derive an online law that updates the parameters in the direction of improving performance. When prediction is not possible, we also provide point-wise sufficient conditions to be imposed on any user-given parameter dynamics so that multiple CBF constraints continue to admit a common control input over time. Finally, we introduce RT-CBFs for decentralized uncooperative multi-agent systems, where a trust factor, computed based on the instantaneous ease of constraint satisfaction, is used to update the parameters online for a less conservative response.
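For context, a standard CBF safety filter with a tunable linear class-K function alpha(h) = p * h can be written as a small QP (a sketch only: the linear form and the interfaces f, g, grad_h are assumptions; RT-CBF concerns adapting p online):

```python
import cvxpy as cp

def cbf_qp_filter(u_nom, x, h, grad_h, f, g, p):
    """For dynamics xdot = f(x) + g(x) u, find the input closest to u_nom
    satisfying Lf h + Lg h u + p * h(x) >= 0 (the CBF condition)."""
    u = cp.Variable(len(u_nom))
    lie_f = grad_h(x) @ f(x)   # Lf h(x)
    lie_g = grad_h(x) @ g(x)   # Lg h(x)
    safety = [lie_f + lie_g @ u + p * h(x) >= 0]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(u - u_nom)), safety)
    prob.solve()
    return u.value
```

A larger p permits a faster approach toward the boundary of the safe set (less conservative), while a smaller p slows it (more conservative), which is exactly the response RT-CBF tunes online.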
https://arxiv.org/abs/2303.12966
In the last few years, many works have tried to explain the predictions of deep learning models. Few methods, however, have been proposed to verify the accuracy or faithfulness of these explanations. Recently, influence functions, a method that approximates the effect of leave-one-out retraining on the loss function, have been shown to be fragile. The proposed reason for this fragility remains unclear. Although previous work suggests the use of regularization to increase robustness, this does not hold in all cases. In this work, we revisit the experiments performed in prior work in an effort to understand the underlying mechanisms of influence-function fragility. First, we verify influence functions using procedures from the literature under conditions where the convexity assumptions of influence functions are met. Then, we relax these assumptions and study the effects of non-convexity by using deeper models and more complex datasets. Here, we analyze the key metrics and procedures that are used to validate influence functions. Our results indicate that the validation procedures may cause the observed fragility.
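For reference, the influence-function approximation under study (in Koh and Liang's formulation) estimates the effect of upweighting a training point z on the loss at a test point as

```latex
\mathcal{I}_{\mathrm{up,loss}}(z, z_{\mathrm{test}})
  = -\nabla_{\theta} L(z_{\mathrm{test}}, \hat{\theta})^{\top}
    H_{\hat{\theta}}^{-1}\,
    \nabla_{\theta} L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla_{\theta}^{2} L(z_i, \hat{\theta}),
```

which presumes the empirical Hessian is positive definite, i.e. the twice-differentiable, convex setting that the experiments above first satisfy and then deliberately relax.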
https://arxiv.org/abs/2303.12922
Myocardial infarction and heart failure are major cardiovascular diseases that affect millions of people in the US. The morbidity and mortality are highest among patients who develop cardiogenic shock. Early recognition of cardiogenic shock is critical. Prompt implementation of treatment measures can prevent the deleterious spiral of ischemia, low blood pressure, and reduced cardiac output due to cardiogenic shock. However, early identification of cardiogenic shock has been challenging due to human providers' inability to process the enormous amount of data in the cardiac intensive care unit (ICU) and lack of an effective risk stratification tool. We developed a deep learning-based risk stratification tool, called CShock, for patients admitted into the cardiac ICU with acute decompensated heart failure and/or myocardial infarction to predict onset of cardiogenic shock. To develop and validate CShock, we annotated cardiac ICU datasets with physician adjudicated outcomes. CShock achieved an area under the receiver operator characteristic curve (AUROC) of 0.820, which substantially outperformed CardShock (AUROC 0.519), a well-established risk score for cardiogenic shock prognosis. CShock was externally validated in an independent patient cohort and achieved an AUROC of 0.800, demonstrating its generalizability in other cardiac ICUs.
https://arxiv.org/abs/2303.12888
Placing a human in the loop may abate the risks of deploying AI systems in safety-critical settings (e.g., a clinician working with a medical AI system). However, mitigating risks arising from human error and uncertainty within such human-AI interactions is an important and understudied issue. In this work, we study human uncertainty in the context of concept-based models, a family of AI systems that enable human feedback via concept interventions where an expert intervenes on human-interpretable concepts relevant to the task. Prior work in this space often assumes that humans are oracles who are always certain and correct. Yet, real-world decision-making by humans is prone to occasional mistakes and uncertainty. We study how existing concept-based models deal with uncertain interventions from humans using two novel datasets: UMNIST, a visual dataset with controlled simulated uncertainty based on the MNIST dataset, and CUB-S, a relabeling of the popular CUB concept dataset with rich, densely-annotated soft labels from humans. We show that training with uncertain concept labels may help mitigate weaknesses of concept-based systems when handling uncertain interventions. These results allow us to identify several open challenges, which we argue can be tackled through future multidisciplinary research on building interactive uncertainty-aware systems. To facilitate further research, we release a new elicitation platform, UElic, to collect uncertain feedback from humans in collaborative prediction tasks.
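A minimal sketch of the uncertain-intervention setting (a toy, not the paper's implementation): rather than overwriting intervened concepts with hard 0/1 oracle values, the expert supplies soft probabilities that replace the model's concept predictions only where an intervention occurs.

```python
import torch

def intervene(concept_probs, expert_probs, intervene_mask):
    """intervene_mask: bool tensor marking which concepts the human touches;
    expert_probs carries the human's (possibly uncertain) soft labels."""
    return torch.where(intervene_mask, expert_probs, concept_probs)
```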
https://arxiv.org/abs/2303.12872
Existing defense methods against adversarial attacks can be categorized into training time and test time defenses. Training time defense, i.e., adversarial training, requires a significant amount of extra time for training and is often not able to be generalized to unseen attacks. On the other hand, test time defense by test time weight adaptation requires access to perform gradient descent on (part of) the model weights, which could be infeasible for models with frozen weights. To address these challenges, we propose DRAM, a novel defense method to Detect and Reconstruct multiple types of Adversarial attacks via Masked autoencoder (MAE). We demonstrate how to use MAE losses to build a KS-test to detect adversarial attacks. Moreover, the MAE losses can be used to repair adversarial samples from unseen attack types. In this sense, DRAM neither requires model weight updates in test time nor augments the training set with more adversarial samples. Evaluating DRAM on the large-scale ImageNet data, we achieve the best detection rate of 82% on average on eight types of adversarial attacks compared with other detection baselines. For reconstruction, DRAM improves the robust accuracy by 6% ~ 41% for Standard ResNet50 and 3% ~ 8% for Robust ResNet50 compared with other self-supervision tasks, such as rotation prediction and contrastive learning.
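The detection step can be sketched directly with a two-sample KS test over MAE reconstruction losses (the significance threshold is an assumption):

```python
from scipy.stats import ks_2samp

def detect_adversarial(clean_mae_losses, test_mae_losses, alpha=0.05):
    """Flag a batch as adversarial if its MAE-loss distribution differs
    significantly from that of held-out clean samples."""
    stat, p_value = ks_2samp(clean_mae_losses, test_mae_losses)
    return p_value < alpha, stat
```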
https://arxiv.org/abs/2303.12848
To quantify uncertainty, conformal prediction methods are attracting steadily growing interest and have already been successfully applied to various domains. However, they are difficult to apply to time series, as the autocorrelative structure of time series violates basic assumptions required by conformal prediction. We propose HopCPT, a novel conformal prediction approach for time series that not only copes with temporal structures but leverages them. We show that our approach is theoretically well justified for time series where temporal dependencies are present. In experiments, we demonstrate that our new approach outperforms state-of-the-art conformal prediction methods on multiple real-world time series datasets from four different domains.
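A hedged sketch of the general weighted-conformal recipe such a method builds on (HopCPT's actual weights come from a modern Hopfield network that retrieves errors from similar time-series regimes; this is not its implementation):

```python
import numpy as np

def weighted_conformal_interval(y_pred, abs_residuals, weights, alpha=0.1):
    """Interval half-width = (1 - alpha) weighted quantile of past
    absolute residuals, with weights expressing regime similarity."""
    order = np.argsort(abs_residuals)
    r = np.asarray(abs_residuals)[order]
    w = np.asarray(weights)[order]
    cdf = np.cumsum(w) / np.sum(w)
    q = r[min(np.searchsorted(cdf, 1 - alpha), len(r) - 1)]
    return y_pred - q, y_pred + q
```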
https://arxiv.org/abs/2303.12783
Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.
https://arxiv.org/abs/2303.12712