The recent emergence of deep learning has led to a great deal of work on designing supervised deep semantic segmentation algorithms. As sufficient pixel-level labels are very difficult to obtain in many tasks, we propose a method which combines a Gaussian mixture model (GMM) with unsupervised deep learning techniques. In the standard GMM, the pixel values within each sub-region are modelled by a Gaussian distribution. In order to identify the different regions, the parameter vector that minimizes the negative log-likelihood (NLL) function of the GMM has to be approximated. For this task, iterative optimization methods such as the expectation-maximization (EM) algorithm are usually used. In this paper, we propose to estimate these parameters directly from the image using a convolutional neural network (CNN). We thus change the iterative procedure in the EM algorithm, replacing the expectation step by a gradient step with respect to the network's parameters. This means that the network is trained to minimize the NLL function of the GMM, which comes with at least two advantages. First, once trained, the network is able to predict label probabilities very quickly compared with time-consuming iterative optimization methods. Second, due to the deep image prior, our method is able to partially overcome one of the main disadvantages of GMMs, namely that they do not take into account correlations between neighboring pixels, as they assume independence between them. We demonstrate the advantages of our method in various experiments on the example of myocardial infarct segmentation on multi-sequence MRI images.
https://arxiv.org/abs/2404.12252
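As a rough illustration of the objective in the abstract above, here is a minimal numpy sketch of the GMM negative log-likelihood that the CNN would be trained to minimize. The network itself and the deep image prior are omitted, and all shapes and values are made up:

```python
import numpy as np

def gmm_nll(pixels, weights, means, variances):
    """Negative log-likelihood of pixel values under a K-component GMM.

    pixels:    (N,) flattened pixel intensities
    weights:   (K,) mixture weights, summing to 1
    means:     (K,) component means
    variances: (K,) component variances
    """
    # Per-pixel, per-component Gaussian density: shape (N, K)
    diff = pixels[:, None] - means[None, :]
    dens = np.exp(-0.5 * diff**2 / variances) / np.sqrt(2 * np.pi * variances)
    # Marginal likelihood per pixel, then summed NLL
    marginal = (dens * weights).sum(axis=1)
    return -np.log(marginal + 1e-12).sum()

# In the paper's setting, a CNN would predict (weights, means, variances)
# from the image, and a gradient step on this loss would replace the E-step.
pixels = np.array([0.1, 0.12, 0.8, 0.85])
nll = gmm_nll(pixels,
              weights=np.array([0.5, 0.5]),
              means=np.array([0.1, 0.8]),
              variances=np.array([0.01, 0.01]))
```

Parameters that match the two pixel clusters yield a lower NLL than mismatched ones, which is exactly the signal the gradient step exploits.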
Medication recommendation systems are designed to deliver personalized drug suggestions that are closely aligned with individual patient needs. Previous studies have primarily concentrated on developing medication embeddings, achieving significant progress. Nonetheless, these approaches often fall short in accurately reflecting individual patient profiles, mainly due to challenges in distinguishing between various patient conditions and the inability to establish precise correlations between specific conditions and appropriate medications. In response to these issues, we introduce DisMed, a model that focuses on patient conditions to enhance personalization. DisMed employs causal inference to discern clear, quantifiable causal links. It then examines patient conditions in depth, recognizing and adapting to the evolving nuances of these conditions, and mapping them directly to corresponding medications. Additionally, DisMed leverages data from multiple patient visits to propose combinations of medications. Comprehensive testing on real-world datasets demonstrates that DisMed not only improves the customization of patient profiles but also surpasses leading models in both precision and safety.
https://arxiv.org/abs/2404.12228
Text-video retrieval aims to find the most relevant cross-modal samples for a given query. Recent methods focus on modeling the whole spatial-temporal relations. However, since video clips contain more diverse content than captions, the model aligning these asymmetric video-text pairs has a high risk of retrieving many false positive results. In this paper, we propose Probabilistic Token Aggregation (ProTA) to handle cross-modal interaction with content asymmetry. Specifically, we propose dual partial-related aggregation to disentangle and re-aggregate token representations in both low-dimension and high-dimension spaces. We propose token-based probabilistic alignment to generate token-level probabilistic representation and maintain the feature representation diversity. In addition, an adaptive contrastive loss is proposed to learn compact cross-modal distribution space. Based on extensive experiments, ProTA achieves significant improvements on MSR-VTT (50.9%), LSMDC (25.8%), and DiDeMo (47.2%).
https://arxiv.org/abs/2404.12216
Surveillance footage represents a valuable resource and opportunity for conducting gait analysis. However, the typical low quality and high noise levels in such footage can severely impact the accuracy of pose estimation algorithms, which are foundational for reliable gait analysis. Existing literature suggests a direct correlation between the efficacy of pose estimation and the subsequent gait analysis results. A common mitigation strategy involves fine-tuning pose estimation models on noisy data to improve robustness. However, this approach may degrade the downstream model's performance on the original high-quality data, leading to a trade-off that is undesirable in practice. We propose a processing pipeline that incorporates a task-targeted artifact correction model specifically designed to pre-process and enhance surveillance footage before pose estimation. Our artifact correction model is optimized to work alongside a state-of-the-art pose estimation network, HRNet, without requiring repeated fine-tuning of the pose estimation model. Furthermore, we propose a simple and robust method for automatically obtaining low-quality videos annotated with poses, for the purpose of training the artifact correction model. We systematically evaluate the performance of our artifact correction model against a range of noisy surveillance data and demonstrate that our approach not only achieves improved pose estimation on low-quality surveillance footage, but also preserves the integrity of the pose estimation on high-resolution footage. Our experiments show a clear enhancement in gait analysis performance, supporting the viability of the proposed method as a superior alternative to direct fine-tuning strategies. Our contributions pave the way for more reliable gait analysis using surveillance data in real-world applications, regardless of data quality.
https://arxiv.org/abs/2404.12183
Language models (LMs) trained on vast quantities of text data can acquire sophisticated skills such as generating summaries, answering questions or generating code. However, they also manifest behaviors that violate human preferences, e.g., they can generate offensive content, falsehoods or perpetuate social biases. In this thesis, I explore several approaches to aligning LMs with human preferences. First, I argue that aligning LMs can be seen as Bayesian inference: conditioning a prior (the base, pretrained LM) on evidence about human preferences (Chapter 2). Conditioning on human preferences can be implemented in numerous ways. In Chapter 3, I investigate the relation between two approaches to finetuning pretrained LMs using feedback given by a scoring function: reinforcement learning from human feedback (RLHF) and distribution matching. I show that RLHF can be seen as a special case of distribution matching, but distribution matching is strictly more general. In Chapter 4, I show how to extend distribution matching to conditional language models. Finally, in Chapter 5, I explore a different route: conditioning an LM on human preferences already during pretraining. I show that involving human feedback from the very start tends to be more effective than using it only during supervised finetuning. Overall, these results highlight the room for alignment techniques different from and complementary to RLHF.
https://arxiv.org/abs/2404.12150
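The Bayesian view in the thesis above (posterior = prior × preference likelihood) can be illustrated with a toy next-token distribution. The three-token vocabulary, the reward values, and the exp(reward/β) likelihood form are illustrative assumptions, not the thesis's exact construction:

```python
import numpy as np

def align_as_bayes(prior_probs, rewards, beta=1.0):
    """Condition a base LM's next-token distribution on preference evidence.

    Treats exp(reward / beta) as the likelihood of human approval, so the
    aligned distribution is the Bayesian posterior: prior * likelihood,
    renormalized. Toy illustration only; rewards and beta are made up.
    """
    posterior = prior_probs * np.exp(np.asarray(rewards, float) / beta)
    return posterior / posterior.sum()

# Hypothetical 3-token vocabulary: neutral, helpful, offensive.
prior = np.array([0.5, 0.3, 0.2])
rewards = np.array([0.0, 1.0, -5.0])   # assumed preference-model scores
aligned = align_as_bayes(prior, rewards)
```

The posterior boosts the preferred token and suppresses the dispreferred one while remaining a valid probability distribution.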
Knowledge Tracing (KT) aims to trace changes in students' knowledge states throughout their entire learning process by analyzing their historical learning data and predicting their future learning performance. Existing knowledge tracing models based on forgetting curve theory only consider the general forgetting caused by time intervals, ignoring the individualization of students and the causal relationships of the forgetting process. To address these problems, we propose a Concept-driven Personalized Forgetting knowledge tracing model (CPF) which integrates hierarchical relationships between knowledge concepts and incorporates students' personalized cognitive abilities. First, we integrate students' personalized capabilities into both the learning and forgetting processes to explicitly distinguish students' individual learning gains and forgetting rates according to their cognitive abilities. Second, we take into account the hierarchical relationships between knowledge points and design a precursor-successor knowledge concept matrix to simulate the causal relationship in the forgetting process, while also integrating the potential impact of forgetting prior knowledge points on subsequent ones. The proposed personalized forgetting mechanism can be applied not only to the learning of specific knowledge concepts but also to the life-long learning process. Extensive experimental results on three public datasets show that our CPF outperforms current forgetting-curve-based methods in predicting student performance, demonstrating that CPF can better simulate changes in students' knowledge states through the personalized forgetting mechanism.
https://arxiv.org/abs/2404.12127
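One simple way to individualize an Ebbinghaus-style forgetting curve, as the abstract above describes in spirit, is to scale memory stability by a per-student ability factor. This is a sketch of the general idea, not CPF's actual parameterization:

```python
import math

def retention(delta_t, base_stability, ability):
    """Exponential forgetting curve with a personalized forgetting rate.

    delta_t:        time since the concept was last practiced
    base_stability: concept-level memory stability (assumed parameter)
    ability:        student's cognitive-ability factor (> 0); a higher
                    value slows forgetting, mimicking CPF's individualization.
    """
    return math.exp(-delta_t / (base_stability * ability))

# Same elapsed time, two different students: the stronger one retains more.
weak = retention(5.0, base_stability=2.0, ability=0.8)
strong = retention(5.0, base_stability=2.0, ability=1.5)
```

A generic forgetting model would use a single rate for all students; here the `ability` factor is what makes the retention curves diverge per student.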
Recent advances in image deraining have focused on training powerful models on mixed multiple datasets comprising diverse rain types and backgrounds. However, this approach tends to overlook the inherent differences among rainy images, leading to suboptimal results. To overcome this limitation, we focus on addressing various rainy images by delving into meaningful representations that encapsulate both the rain and background components. Leveraging these representations as instructive guidance, we put forth a Context-based Instance-level Modulation (CoI-M) mechanism adept at efficiently modulating CNN- or Transformer-based models. Furthermore, we devise a rain-/detail-aware contrastive learning strategy to help extract joint rain-/detail-aware representations. By integrating CoI-M with the rain-/detail-aware contrastive learning, we develop CoIC, an innovative and potent algorithm tailored for training models on mixed datasets. Moreover, CoIC offers insight into modeling relationships of datasets, quantitatively assessing the impact of rain and details on restoration, and unveiling distinct behaviors of models given diverse inputs. Extensive experiments validate the efficacy of CoIC in boosting the deraining ability of CNN and Transformer models. CoIC also enhances the deraining prowess remarkably when a real-world dataset is included.
https://arxiv.org/abs/2404.12091
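Instance-level modulation of the kind CoI-M performs is commonly realized as a FiLM-style per-channel scale and shift driven by a context embedding. The abstract does not specify CoI-M's modulation network, so the tanh mapping below is purely a placeholder assumption:

```python
import numpy as np

def instance_modulation(feat, context):
    """FiLM-style instance-level modulation: scale and shift each feature
    channel by parameters derived from an instance's context embedding.
    (CoI-M's actual modulation network is not given in the abstract;
    the tanh mapping here is a stand-in.)"""
    c = feat.shape[0]
    gamma = 1.0 + np.tanh(context[:c])      # per-channel scale, near 1
    beta = np.tanh(context[c:2 * c])        # per-channel shift, near 0
    return gamma[:, None, None] * feat + beta[:, None, None]

# A zero context leaves the feature map untouched (gamma = 1, beta = 0),
# so modulation only kicks in when the instance context is informative.
feat = np.ones((2, 4, 4))
out = instance_modulation(feat, np.zeros(4))
```

The design point is that the backbone stays shared across all rainy images while the context vector injects per-instance behavior.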
Event-based eye tracking has shown great promise with the high temporal resolution and low redundancy provided by the event camera. However, the diversity and abruptness of eye movement patterns, including blinking, fixating, saccades, and smooth pursuit, pose significant challenges for eye localization. To achieve a stable event-based eye-tracking system, this paper proposes a bidirectional long-term sequence modeling and time-varying state selection mechanism to fully utilize contextual temporal information in response to the variability of eye movements. Specifically, the MambaPupil network is proposed, which consists of a multi-layer convolutional encoder to extract features from the event representations, a bidirectional Gated Recurrent Unit (GRU), and a Linear Time-Varying State Space Module (LTV-SSM), to selectively capture contextual correlation from the forward and backward temporal relationship. Furthermore, the Bina-rep is utilized as a compact event representation, and a tailor-made data augmentation, called Event-Cutout, is proposed to enhance the model's robustness by applying spatial random masking to the event image. The evaluation on the ThreeET-plus benchmark shows the superior performance of the MambaPupil, which secured the 1st place in the CVPR'2024 AIS Event-based Eye Tracking challenge.
https://arxiv.org/abs/2404.12083
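The Event-Cutout augmentation above amounts to zeroing a random spatial patch of the event image. A rough sketch follows; the exact mask shape, fill value, and sampling scheme are assumptions, not the paper's specification:

```python
import numpy as np

def event_cutout(event_img, mask_size, rng=None):
    """Apply spatial random masking (a zeroed square patch) to an event image.

    A minimal Cutout-style sketch of the augmentation described in the
    abstract; the paper may sample masks differently.
    """
    rng = rng or np.random.default_rng()
    out = event_img.copy()
    h, w = out.shape[:2]
    y = rng.integers(0, max(1, h - mask_size))   # top-left corner of patch
    x = rng.integers(0, max(1, w - mask_size))
    out[y:y + mask_size, x:x + mask_size] = 0
    return out

img = np.ones((32, 32), dtype=np.float32)
aug = event_cutout(img, mask_size=8, rng=np.random.default_rng(0))
```

Because the input is copied, the original event representation is left intact for other augmentation branches.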
Change detection (CD) from remote sensing (RS) images using deep learning has been widely investigated in the literature. It is typically regarded as a pixel-wise labeling task that aims to classify each pixel as changed or unchanged. Although per-pixel classification networks in encoder-decoder structures have shown dominance, they still suffer from imprecise boundaries and incomplete object delineation in various scenes. For high-resolution RS images, partly or totally changed objects are more worthy of attention than single pixels. Therefore, we revisit the CD task from the mask prediction and classification perspective and propose MaskCD to detect changed areas by adaptively generating categorized masks from input image pairs. Specifically, it utilizes a cross-level change representation perceiver (CLCRP) to learn multiscale change-aware representations and capture spatiotemporal relations from encoded features by exploiting deformable multihead self-attention (DeformMHSA). Subsequently, a masked-attention-based detection transformer (MA-DETR) decoder is developed to accurately locate and identify changed objects based on masked attention and self-attention mechanisms. It reconstructs the desired changed objects by decoding the pixel-wise representations into learnable mask proposals and making final predictions from these candidates. Experimental results on five benchmark datasets demonstrate that the proposed approach outperforms other state-of-the-art models. Codes and pretrained models are available online (this https URL).
https://arxiv.org/abs/2404.12081
The prevalence of digital media and evolving sociopolitical dynamics have significantly amplified the dissemination of hateful content. Existing studies mainly focus on classifying texts into binary categories, often overlooking the continuous spectrum of offensiveness and hatefulness inherent in the text. In this research, we present an extensive benchmark dataset for Amharic, comprising 8,258 tweets annotated for three distinct tasks: category classification, identification of hate targets, and rating offensiveness and hatefulness intensities. Our study highlights that a considerable majority of tweets belong to the less offensive and less hate intensity levels, underscoring the need for early interventions by stakeholders. The prevalence of ethnic and political hatred targets, with significant overlaps in our dataset, emphasizes the complex relationships within Ethiopia's sociopolitical landscape. We build classification and regression models and investigate the efficacy of models in handling these tasks. Our results reveal that hate and offensive speech cannot be addressed by a simplistic binary classification, instead manifesting as variables across a continuous range of values. The Afro-XLMR-large model exhibits the best performance, achieving F1-scores of 75.30%, 70.59%, and 29.42% for the category, target, and regression tasks, respectively. The 80.22% correlation coefficient of the Afro-XLMR-large model indicates strong alignment.
https://arxiv.org/abs/2404.12042
Data-free knowledge distillation (DFKD) is a promising approach for addressing issues related to model compression, security, privacy, and transmission restrictions. Although the existing methods exploiting DFKD have achieved inspiring achievements in coarse-grained classification, they obtain sub-optimal results in practical applications involving fine-grained classification tasks that require more detailed distinctions between similar categories. To address this issue, we propose an approach called DFKD-FGVC that extends DFKD to fine-grained visual categorization (FGVC) tasks. Our approach utilizes an adversarial distillation framework with an attention generator, mixed high-order attention distillation, and semantic feature contrast learning. Specifically, we introduce a spatial-wise attention mechanism to the generator to synthesize fine-grained images with more details of discriminative parts. We also utilize the mixed high-order attention mechanism to capture complex interactions among parts and the subtle differences among discriminative features of the fine-grained categories, paying attention to both local features and semantic context relationships. Moreover, we leverage the teacher and student models of the distillation framework to contrast high-level semantic feature maps in the hyperspace, comparing variances of different categories. We evaluate our approach on three widely-used FGVC benchmarks (Aircraft, Cars196, and CUB200) and demonstrate its superior performance.
https://arxiv.org/abs/2404.12037
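The backbone of any distillation framework like the one above is a loss that pulls the student's softened predictions toward the teacher's. Below is the classic temperature-scaled KL distillation term; DFKD-FGVC additionally distills mixed high-order attention maps and contrasts semantic features, which this sketch omits:

```python
import numpy as np

def softened(logits, T):
    """Temperature-scaled softmax (numerically stable)."""
    z = np.asarray(logits, float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def kd_loss(teacher_logits, student_logits, T=4.0):
    """Classic distillation loss: KL(teacher || student) on T-softened
    distributions, scaled by T^2 to keep gradient magnitudes comparable."""
    p = softened(teacher_logits, T)
    q = softened(student_logits, T)
    return (T ** 2) * np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)))
```

In the data-free setting, the logits would come from generator-synthesized images rather than real training data.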
The new trend in the multi-object tracking task is to track objects of interest using natural language. However, the scarcity of paired prompt-instance data hinders its progress. To address this challenge, we propose a high-quality yet low-cost data generation method based on Unreal Engine 5 and construct a brand-new benchmark dataset, named Refer-UE-City, which primarily includes scenes from intersection surveillance videos, detailing the appearance and actions of people and vehicles. Specifically, it provides 14 videos with a total of 714 expressions, and is comparable in scale to the Refer-KITTI dataset. Additionally, we propose a multi-level semantic-guided multi-object framework called MLS-Track, where the interaction between the model and text is enhanced layer by layer through the introduction of a Semantic Guidance Module (SGM) and a Semantic Correlation Branch (SCB). Extensive experiments on the Refer-UE-City and Refer-KITTI datasets demonstrate the effectiveness of our proposed framework, which achieves state-of-the-art performance. Code and datasets will be made available.
https://arxiv.org/abs/2404.12031
Multi-modal relation extraction (MMRE) is a challenging task that aims to identify relations between entities in text leveraging image information. Existing methods are limited by their neglect of the multiple entity pairs in one sentence sharing very similar contextual information (i.e., the same text and image), resulting in increased difficulty in the MMRE task. To address this limitation, we propose the Variational Multi-Modal Hypergraph Attention Network (VM-HAN) for multi-modal relation extraction. Specifically, we first construct a multi-modal hypergraph for each sentence with the corresponding image, to establish different high-order intra-/inter-modal correlations for different entity pairs in each sentence. We further design the Variational Hypergraph Attention Networks (V-HAN) to obtain representational diversity among different entity pairs using Gaussian distributions and learn a better hypergraph structure via variational attention. VM-HAN achieves state-of-the-art performance on the multi-modal relation extraction task, outperforming existing methods in terms of accuracy and efficiency.
https://arxiv.org/abs/2404.12006
Accurate traffic forecasting is essential for effective urban planning and congestion management. Deep learning (DL) approaches have achieved considerable success in traffic forecasting but still face challenges in capturing the intricacies of traffic dynamics. In this paper, we identify and address this challenge by emphasizing that spatial features are inherently dynamic and change over time. A novel in-depth feature representation, called Dynamic Spatio-Temporal (Dyn-ST) features, is introduced, which encapsulates spatial characteristics across varying times. Moreover, a Dynamic Spatio-Temporal Graph Transformer Network (DST-GTN) is proposed to capture Dyn-ST features and other dynamic adjacency relations between intersections. The DST-GTN can model dynamic ST relationships between nodes accurately and refine the representation of global and local ST characteristics by adopting adaptive weights in low-pass and all-pass filters, enabling the extraction of Dyn-ST features from traffic time-series data. Through numerical experiments on public datasets, the DST-GTN achieves state-of-the-art performance for a range of traffic forecasting tasks and demonstrates enhanced stability.
https://arxiv.org/abs/2404.11996
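The low-pass/all-pass filter blend mentioned above can be pictured on a tiny road graph: a low-pass filter averages each intersection's features over its neighbors (smooth, global trends), while an all-pass filter keeps the node's own features (local detail). The exact filter forms and the learned weighting in DST-GTN are not given in the abstract, so the following is an assumed minimal version with fixed scalar weights:

```python
import numpy as np

def filtered_features(adj, x, w_low, w_all):
    """Blend a low-pass and an all-pass graph filter with adaptive weights.

    Low-pass : D^{-1} A x  (neighborhood averaging -> smooths features)
    All-pass : x           (identity -> preserves node-specific detail)
    w_low, w_all stand in for the adaptive weights learned by the model.
    """
    deg = adj.sum(axis=1, keepdims=True)
    low_pass = (adj @ x) / np.maximum(deg, 1.0)
    return w_low * low_pass + w_all * x

adj = np.array([[0., 1.], [1., 0.]])   # two connected intersections
x = np.array([[0.], [2.]])             # one traffic feature per node
smoothed = filtered_features(adj, x, w_low=1.0, w_all=0.0)
```

Pushing `w_low` toward 1 exchanges features across the graph; pushing `w_all` toward 1 recovers the untouched per-node signal, which is the global/local trade-off the adaptive weights navigate.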
Weakly Incremental Learning for Semantic Segmentation (WILSS) leverages a pre-trained segmentation model to segment new classes using cost-effective and readily available image-level labels. A prevailing way to solve WILSS is the generation of seed areas for each new class, serving as a form of pixel-level supervision. However, a scenario usually arises where a pixel is concurrently predicted as an old class by the pre-trained segmentation model and as a new class by the seed areas. Such a scenario becomes particularly problematic in WILSS, as the lack of pixel-level annotations on new classes makes it intractable to ascertain whether the pixel pertains to the new class or not. To surmount this issue, we propose an innovative, tendency-driven relationship of mutual exclusivity, meticulously tailored to govern the behavior of the seed areas and the predictions generated by the pre-trained segmentation model. This relationship stipulates that predictions for the new and old classes must not conflict whilst prioritizing the preservation of predictions for the old classes, which not only addresses the conflicting prediction issue but also effectively mitigates the inherent challenge of incremental learning: catastrophic forgetting. Furthermore, under the auspices of this tendency-driven mutual exclusivity relationship, we generate pseudo masks for the new classes, allowing pseudo-mask generation to be executed concurrently with model parameter updating by solving a bi-level optimization problem. Extensive experiments substantiate the effectiveness of our framework, resulting in the establishment of new benchmarks and paving the way for further research in this field.
https://arxiv.org/abs/2404.11981
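The conflict rule described above (new and old predictions must not clash, with old classes taking priority) can be caricatured as a hard assignment over per-pixel labels. The actual method resolves this via a bi-level optimization rather than the one-liner below, so treat this as a simplified reading:

```python
import numpy as np

def pseudo_mask(old_pred, seed_pred, background=0):
    """Resolve old-model vs. seed-area conflicts when building pseudo masks.

    Keeps the pre-trained model's old-class prediction wherever it fires
    (preserving old classes mitigates catastrophic forgetting) and assigns
    a new class only where the old model predicts background.
    """
    return np.where(old_pred != background, old_pred, seed_pred)

old = np.array([1, 0, 0])    # pre-trained model: old class 1, then background
seed = np.array([5, 5, 0])   # seed areas propose new class 5 for two pixels
merged = pseudo_mask(old, seed)
```

Note how the conflicting first pixel keeps its old-class label, which is exactly the "prioritize old classes" tendency.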
Events refer to specific occurrences, incidents, or happenings that take place under a particular background. Event reasoning aims to infer events according to certain relations and predict future events. The cutting-edge techniques for event reasoning play a crucial role in various natural language processing applications. Large language models (LLMs) have made significant advancements in event reasoning owing to their wealth of knowledge and reasoning capabilities. However, smaller instruction-tuned models currently in use do not consistently demonstrate exceptional proficiency in managing these tasks. This discrepancy arises from the absence of explicit modeling of events and their interconnections within the instruction data. Consequently, these models face challenges in comprehending event structures and semantics while struggling to bridge the gap between their interpretations and human understanding of events. Additionally, their limitations in grasping event relations lead to constrained event reasoning abilities to effectively deduce and incorporate pertinent event knowledge. In this paper, we propose Event-Oriented Instruction Tuning (EvIT) to train our LLM. Specifically, we first propose a novel structure named event quadruple, which contains the structure and semantics of events and constitutes a complete event representation. We then design event-relation learning based on these structures. We encapsulate the learning into the instruction-tuning formulation to better stimulate the event reasoning capacity of our model. We design a heuristic unsupervised method to mine event quadruples from a large-scale corpus. Finally, we finetune a Llama model with our Event-Oriented Instruction Tuning. We conduct extensive experiments on event reasoning tasks on several datasets. Automatic and human evaluations demonstrate that EvIT achieves competitive performances on event reasoning.
https://arxiv.org/abs/2404.11978
Entity alignment (EA) aims to find equivalent entities between two Knowledge Graphs. Existing embedding-based EA methods usually encode entities as embeddings and triples as constraints on the embeddings, and learn to align the embeddings. The structural and side information are usually utilized via embedding propagation, aggregation or interaction. However, the details of the underlying logical inference steps in the alignment process are usually omitted, resulting in an inadequate inference process. In this paper, we introduce P-NAL, an entity alignment method that captures two types of logical inference paths with Non-Axiomatic Logic (NAL). Type 1 is the bridge-like inference path between to-be-aligned entity pairs, consisting of two relation/attribute triples and a similarity sentence between the other two entities. Type 2 links the entity pair by their embeddings. P-NAL iteratively aligns entities and relations by integrating the conclusions of the inference paths. Moreover, our method is logically interpretable and extensible due to the expressiveness of NAL. Our proposed method is suitable for various EA settings. Experimental results show that our method outperforms state-of-the-art methods in terms of Hits@1, achieving 0.98+ on all three datasets of DBP15K in both supervised and unsupervised settings. To our knowledge, we present the first in-depth analysis of entity alignment's basic principles from a unified logical perspective.
https://arxiv.org/abs/2404.11968
Powered hip exoskeletons have shown the ability to assist locomotion during treadmill walking. However, providing suitable assistance in real-world walking scenarios that involve changing terrain remains challenging. Recent research suggests that forecasting the lower-limb joint angles could provide target trajectories for exoskeletons and prostheses, and that forecasting performance can be improved with visual information. In this letter, we share a real-world dataset of 10 healthy subjects walking through five common types of terrain, with stride-level labels. We design a network called the Sandwich Fusion Transformer for Image and Kinematics (SFTIK), which predicts the thigh angle of the ensuing stride given the terrain images captured at the beginning of the preceding and ensuing strides and the IMU time series from the preceding stride. We introduce width-level patchify, tailored for egocentric terrain images, to reduce computational demands. We demonstrate that the proposed sandwich input and fusion mechanism significantly improves forecasting performance. Overall, SFTIK outperforms baseline methods at a computational cost of 3.31 GFLOPs, achieving a root mean square error (RMSE) of 3.445 ± 0.804° and a Pearson's correlation coefficient (PCC) of 0.971 ± 0.025. The results demonstrate that SFTIK can forecast the thigh angle accurately at low computational cost, and could thus serve as a terrain-adaptive trajectory-planning method for hip exoskeletons. Code and data are available at this https URL.
https://arxiv.org/abs/2404.11945
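The width-level patchify introduced in the SFTIK abstract above splits an egocentric terrain image into full-height, width-wise strips rather than the usual square grid, yielding far fewer tokens. A minimal sketch of that idea (the image size, strip width, and function name are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def width_level_patchify(image, patch_w):
    """Split an image into full-height strips along the width axis.

    image: array of shape (H, W, C); patch_w must divide W.
    Returns an array of shape (W // patch_w, H, patch_w, C), i.e. one
    token per width-wise strip instead of a token per square patch.
    """
    h, w, c = image.shape
    assert w % patch_w == 0, "patch width must divide image width"
    n = w // patch_w
    # Slice along the width only; each strip keeps the full image height.
    return np.stack([image[:, i * patch_w:(i + 1) * patch_w, :] for i in range(n)])

# A 224x224 RGB image with 16-pixel-wide strips yields 14 tokens,
# versus 196 tokens for a standard 16x16 square-grid patchify.
img = np.zeros((224, 224, 3))
tokens = width_level_patchify(img, 16)
print(tokens.shape)  # (14, 224, 16, 3)
```

Concatenating the strips back along the width axis recovers the original image, so the operation is lossless; the saving comes purely from the reduced token count fed to the transformer.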
Dopamine transporter (DAT) imaging is commonly used for monitoring Parkinson's disease (PD), where the striatal DAT uptake amount is computed to assess PD severity. However, DAT imaging is costly, carries the risk of radiation exposure, and is not available in general clinics. Recently, MRI patches of the nigral region have been proposed as a safer and more accessible alternative. This paper proposes a symmetric regressor for predicting the DAT uptake amount from a nigral MRI patch. Acknowledging the symmetry between the right and left nigrae, the proposed regressor incorporates a paired input-output model that simultaneously predicts the DAT uptake amounts for both the right and left striata. Moreover, it employs a symmetric loss that constrains the difference between the right and left predictions, reflecting the high correlation of DAT uptake amounts on the two lateral sides. Additionally, we propose a symmetric Monte Carlo (MC) dropout method that exploits the same symmetry to provide a useful uncertainty estimate of the DAT uptake prediction. We evaluated the proposed approach on 734 nigral patches; the symmetric regressor performed significantly better than standard regressors while offering better explainability and feature representation. The symmetric MC dropout also gave precise uncertainty ranges, with a high probability of the true DAT uptake amount falling within the range.
https://arxiv.org/abs/2404.11929
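The symmetric loss in the abstract above can be read as a standard paired regression loss plus a penalty tying the predicted right-minus-left difference to the observed one. A minimal sketch under that reading (the exact form and the weighting `lam` are assumptions, not the paper's published formulation):

```python
import numpy as np

def symmetric_loss(pred_r, pred_l, target_r, target_l, lam=0.5):
    """Paired regression loss with a right-left symmetry constraint.

    The fit term regresses each side's DAT uptake independently; the
    symmetry term penalizes disagreement between the predicted and true
    right-minus-left differences, exploiting the high lateral
    correlation of uptake amounts.
    """
    fit = np.mean((pred_r - target_r) ** 2) + np.mean((pred_l - target_l) ** 2)
    sym = np.mean(((pred_r - pred_l) - (target_r - target_l)) ** 2)
    return fit + lam * sym

# Perfect predictions incur zero loss; a prediction error that also
# breaks the lateral difference is penalized more than one that
# shifts both sides equally.
t_r, t_l = np.array([2.0, 1.5]), np.array([1.9, 1.4])
print(symmetric_loss(t_r, t_l, t_r, t_l))  # 0.0
```

A shared encoder with two output heads, trained with a loss of this shape, is one natural realization of the paired input-output model the abstract describes.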
The interactions between humans and objects are important for recognizing object-centric actions. Existing methods usually adopt a two-stage pipeline: object proposals are first detected with a pretrained detector and then fed to an action recognition model that extracts video features and learns object relations for action recognition. However, since the action prior is unknown in the object detection stage, important objects can easily be overlooked, leading to inferior action recognition performance. In this paper, we propose an end-to-end object-centric action recognition framework that simultaneously performs Detection And Interaction Reasoning in one stage. Specifically, after extracting video features with a base network, we create three modules for concurrent object detection and interaction reasoning. First, a Patch-based Object Decoder generates proposals from video patch tokens. Then, an Interactive Object Refining and Aggregation module identifies the objects important for action recognition, adjusts proposal scores based on position and appearance, and aggregates object-level information into a global video representation. Lastly, an Object Relation Modeling module encodes the object relations. These three modules, together with the video feature extractor, can be trained jointly in an end-to-end fashion, avoiding heavy reliance on an off-the-shelf object detector and reducing the multi-stage training burden. We conduct experiments on two datasets, Something-Else and Ikea-Assembly, to evaluate the performance of our approach on conventional, compositional, and few-shot action recognition tasks. Through in-depth experimental analysis, we show the crucial role of interactive objects in learning for action recognition, and we outperform state-of-the-art methods on both datasets.
https://arxiv.org/abs/2404.11903
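The aggregation step in the abstract above — folding scored object-level features into a global video representation — can be sketched as a score-weighted pooling. This is only an illustrative reading (the softmax weighting, the additive fusion, and all names here are assumptions, not the paper's actual module):

```python
import numpy as np

def aggregate_objects(video_feat, object_feats, scores):
    """Fold object-level features into a global video representation.

    video_feat:   (D,) clip-level feature from the base network.
    object_feats: (N, D) features for N refined object proposals.
    scores:       (N,) relevance scores for the proposals.
    A softmax over the scores weights each object's contribution, and
    the pooled object vector is added to the clip-level feature.
    """
    w = np.exp(scores - scores.max())  # numerically stable softmax
    w = w / w.sum()
    pooled = (w[:, None] * object_feats).sum(axis=0)
    return video_feat + pooled
```

Under this weighting, a proposal whose score dominates the others contributes almost the entire pooled vector, which is how action-relevant objects can steer the global representation.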