Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. Although scaling Transformer-based generators has been central to recent advances, the tokenizer component itself is rarely scaled, leaving open questions about how auto-encoder design choices influence both its reconstruction objective and downstream generative performance. Our work conducts an exploration of scaling in auto-encoders to fill this gap. To facilitate this exploration, we replace the typical convolutional backbone with an enhanced Vision Transformer architecture for Tokenization (ViTok). We train ViTok on large-scale image and video datasets far exceeding ImageNet-1K, removing data constraints on tokenizer scaling. We first study how scaling the auto-encoder bottleneck affects both reconstruction and generation -- and find that while it is highly correlated with reconstruction, its relationship with generation is more complex. We next explore the effect of separately scaling the auto-encoder's encoder and decoder on reconstruction and generation performance. Crucially, we find that scaling the encoder yields minimal gains for either reconstruction or generation, while scaling the decoder boosts reconstruction but the benefits for generation are mixed. Building on our exploration, we design ViTok as a lightweight auto-encoder that achieves competitive performance with state-of-the-art auto-encoders on ImageNet-1K and COCO reconstruction tasks (256p and 512p) while outperforming existing auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x fewer FLOPs. When integrated with Diffusion Transformers, ViTok demonstrates competitive performance on image generation for ImageNet-1K and sets new state-of-the-art benchmarks for class-conditional video generation on UCF-101.
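A minimal sketch of a ViT-style auto-encoder tokenizer in PyTorch, illustrating the encoder/bottleneck/decoder split that the abstract studies when scaling; the patch size, layer widths, and bottleneck dimension here are illustrative assumptions, not the authors' ViTok configuration.

```python
import torch
import torch.nn as nn

class ViTAutoencoder(nn.Module):
    """Toy ViT tokenizer: patchify -> Transformer encoder -> low-dim bottleneck -> decoder -> pixels."""
    def __init__(self, image_size=256, patch=16, width=384, bottleneck=16, depth_enc=4, depth_dec=4):
        super().__init__()
        self.patch = patch
        self.patchify = nn.Conv2d(3, width, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(d_model=width, nhead=6, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(d_model=width, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=depth_enc)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=depth_dec)
        self.to_latent = nn.Linear(width, bottleneck)      # the bottleneck being scaled in the study
        self.from_latent = nn.Linear(bottleneck, width)
        self.to_pixels = nn.Linear(width, 3 * patch * patch)

    def forward(self, x):
        b, _, h, w = x.shape
        tokens = self.patchify(x).flatten(2).transpose(1, 2)          # (B, N, width)
        z = self.to_latent(self.encoder(tokens))                      # (B, N, bottleneck) latent code
        recon_tokens = self.decoder(self.from_latent(z))
        pixels = self.to_pixels(recon_tokens)                          # (B, N, 3*p*p)
        pixels = pixels.transpose(1, 2).reshape(b, 3 * self.patch ** 2, h // self.patch, w // self.patch)
        return nn.functional.pixel_shuffle(pixels, self.patch), z      # unpatchify to (B, 3, H, W)

x = torch.randn(2, 3, 256, 256)
recon, latent = ViTAutoencoder()(x)
loss = nn.functional.mse_loss(recon, x)   # reconstruction objective (perceptual/GAN terms omitted)
```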
https://arxiv.org/abs/2501.09755
Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing fine-grained parts in unseen categories. We identify two primary challenges in OVPS: (1) the difficulty in aligning part-level image-text correspondence, and (2) the lack of structural understanding in segmenting object parts. To address these issues, we propose PartCATSeg, a novel framework that integrates object-aware part-level cost aggregation, compositional loss, and structural guidance from DINO. Our approach employs a disentangled cost aggregation strategy that handles object and part-level costs separately, enhancing the precision of part-level segmentation. We also introduce a compositional loss to better capture part-object relationships, compensating for the limited part annotations. Additionally, structural guidance from DINO features improves boundary delineation and inter-part understanding. Extensive experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets demonstrate that our method significantly outperforms state-of-the-art approaches, setting a new baseline for robust generalization to unseen part categories.
https://arxiv.org/abs/2501.09688
In today's assistant landscape, personalisation enhances interactions, fosters long-term relationships, and deepens engagement. However, many systems struggle with retaining user preferences, leading to repetitive user requests and disengagement. Furthermore, the unregulated and opaque extraction of user preferences in industry applications raises significant concerns about privacy and trust, especially in regions with stringent regulations like Europe. In response to these challenges, we propose a long-term memory system for voice assistants, structured around predefined categories. This approach leverages Large Language Models to efficiently extract, store, and retrieve preferences within these categories, ensuring both personalisation and transparency. We also introduce a synthetic multi-turn, multi-session conversation dataset (CarMem), grounded in real industry data, tailored to an in-car voice assistant setting. Benchmarked on the dataset, our system achieves an F1-score of .78 to .95 in preference extraction, depending on category granularity. Our maintenance strategy reduces redundant preferences by 95% and contradictory ones by 92%, while the accuracy of optimal retrieval is at .87. Collectively, the results demonstrate the system's suitability for industrial applications.
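A hedged sketch of the category-structured extraction step, assuming a hypothetical `llm_complete` text-completion client and an illustrative category list; the real CarMem schema, prompts, and maintenance logic are not specified in the abstract.

```python
import json
from dataclasses import dataclass, field

CATEGORIES = ["navigation", "climate", "media", "points_of_interest"]  # illustrative, not the paper's taxonomy

@dataclass
class PreferenceMemory:
    store: dict = field(default_factory=lambda: {c: [] for c in CATEGORIES})

    def add(self, category: str, preference: str) -> None:
        # naive de-duplication; the paper reports a dedicated maintenance strategy for redundancy/contradiction
        if category in self.store and preference not in self.store[category]:
            self.store[category].append(preference)

    def retrieve(self, category: str) -> list:
        return self.store.get(category, [])

def extract_preferences(utterance: str, llm_complete) -> dict:
    """Ask an LLM to map one user turn onto the predefined categories (returns {} on unparseable output)."""
    prompt = (
        "Extract user preferences from the utterance below as JSON mapping "
        f"each of these categories {CATEGORIES} to a list of short preference strings.\n"
        f"Utterance: {utterance}\nJSON:"
    )
    try:
        return json.loads(llm_complete(prompt))
    except (json.JSONDecodeError, TypeError):
        return {}

# usage with a stubbed LLM client
fake_llm = lambda prompt: json.dumps({"climate": ["prefers 21 degrees"], "media": []})
memory = PreferenceMemory()
for category, prefs in extract_preferences("Please set it to 21 degrees like always.", fake_llm).items():
    for p in prefs:
        memory.add(category, p)
print(memory.retrieve("climate"))
```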
https://arxiv.org/abs/2501.09645
The pivotal shift from traditional paper-based records to sophisticated Electronic Health Records (EHR) enabled systematic collection and analysis of patient data through descriptive statistics, providing insight into patterns and trends across patient populations. This evolution continued toward predictive analytics, allowing healthcare providers to anticipate patient outcomes and potential complications before they occur. This progression from basic digital record-keeping to sophisticated predictive modelling and digital twins reflects healthcare's broader evolution toward more integrated, patient-centred approaches that combine data-driven insights with personalized care delivery. This chapter explores the evolution and significance of healthcare information systems, beginning with an examination of the implementation of EHR in the UK and the USA. It provides a comprehensive overview of the International Classification of Diseases (ICD) system, tracing its development from ICD-9 to ICD-10. Central to this discussion is the MIMIC-III database, a landmark achievement in healthcare data sharing and arguably the most comprehensive critical care database freely available to researchers worldwide. MIMIC-III has democratized access to high-quality healthcare data, enabling unprecedented opportunities for research and analysis. The chapter examines its structure, clinical outcome analysis capabilities, and practical applications through case studies, with a particular focus on mortality and length of stay metrics, vital signs extraction, and ICD coding. Through detailed entity-relationship diagrams and practical examples, the text illustrates MIMIC's complex data structure and demonstrates how different querying approaches can lead to subtly different results, emphasizing the critical importance of understanding the database's architecture for accurate data extraction.
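As an illustration of the kind of querying the chapter discusses, a short pandas sketch that derives length of stay and in-hospital mortality from the MIMIC-III ADMISSIONS table; the column names follow the public MIMIC-III schema, while the CSV path is an assumption.

```python
import pandas as pd

# ADMISSIONS.csv ships with MIMIC-III; the path here is an assumption.
adm = pd.read_csv("ADMISSIONS.csv", parse_dates=["ADMITTIME", "DISCHTIME"])

# Length of stay in days per hospital admission.
adm["LOS_DAYS"] = (adm["DISCHTIME"] - adm["ADMITTIME"]).dt.total_seconds() / 86400.0

# In-hospital mortality: HOSPITAL_EXPIRE_FLAG is 1 when the patient died during the admission.
mortality_rate = adm["HOSPITAL_EXPIRE_FLAG"].mean()

print(adm[["HADM_ID", "LOS_DAYS", "HOSPITAL_EXPIRE_FLAG"]].head())
print(f"in-hospital mortality: {mortality_rate:.3f}, median LOS: {adm['LOS_DAYS'].median():.1f} days")
```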
https://arxiv.org/abs/2501.09640
Recent advances in large language models (LLMs) have demonstrated significant progress in performing complex tasks. While Reinforcement Learning from Human Feedback (RLHF) has been effective in aligning LLMs with human preferences, it is susceptible to spurious correlations in reward modeling. Consequently, it often introduces biases, such as length bias, sycophancy, conceptual bias, and discrimination, that hinder the model's ability to capture true causal relationships. To address this, we propose a novel causal reward modeling approach that integrates causal inference to mitigate these spurious correlations. Our method enforces counterfactual invariance, ensuring reward predictions remain consistent when irrelevant variables are altered. Through experiments on both synthetic and real-world datasets, we show that our approach mitigates various types of spurious correlations effectively, resulting in more reliable and fair alignment of LLMs with human preferences. As a drop-in enhancement to the existing RLHF workflow, our causal reward modeling provides a practical way to improve the trustworthiness and fairness of LLM finetuning.
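A minimal sketch of one way to encode the counterfactual-invariance idea: alongside the usual pairwise preference loss, penalize reward changes under perturbations of an irrelevant attribute (here, length-altered variants of the chosen response). The penalty form and weighting are assumptions and do not reproduce the paper's exact causal formulation.

```python
import torch
import torch.nn.functional as F

def causal_reward_loss(reward_model, chosen, rejected, chosen_variants, invariance_weight=1.0):
    """chosen/rejected: encoded prompt+response batches; chosen_variants: the same chosen
    responses with an irrelevant attribute altered (e.g., padded or paraphrased length)."""
    r_chosen = reward_model(chosen)          # (B,) scalar rewards
    r_rejected = reward_model(rejected)      # (B,)
    r_variant = reward_model(chosen_variants)

    # Standard Bradley-Terry preference loss used in RLHF reward modeling.
    preference_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Counterfactual invariance: the reward should not move when only the irrelevant variable changes.
    invariance_penalty = (r_chosen - r_variant).pow(2).mean()

    return preference_loss + invariance_weight * invariance_penalty

# usage with a toy reward head over pre-computed response embeddings
reward_head = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
reward_model = lambda x: reward_head(x).squeeze(-1)
chosen, rejected, variants = (torch.randn(8, 128) for _ in range(3))
loss = causal_reward_loss(reward_model, chosen, rejected, variants)
loss.backward()
```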
https://arxiv.org/abs/2501.09620
With the rapid advancement of deepfake generation technologies, the demand for robust and accurate face forgery detection algorithms has become increasingly critical. Recent studies have demonstrated that wavelet analysis can uncover subtle forgery artifacts that remain imperceptible in the spatial domain. Wavelets effectively capture important facial contours, which are often slender, fine-grained, and global in nature. However, existing wavelet-based approaches fail to fully leverage these unique characteristics, resulting in sub-optimal feature extraction and limited generalizability. To address this challenge, we introduce WMamba, a novel wavelet-based feature extractor built upon the Mamba architecture. WMamba maximizes the utility of wavelet information through two key innovations. First, we propose Dynamic Contour Convolution (DCConv), which employs specially crafted deformable kernels to adaptively model slender facial contours. Second, by leveraging the Mamba architecture, our method captures long-range spatial relationships with linear computational complexity. This efficiency allows for the extraction of fine-grained, global forgery artifacts from small image patches. Extensive experimental results show that WMamba achieves state-of-the-art (SOTA) performance, highlighting its effectiveness and superiority in face forgery detection.
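A rough sketch of the contour-adaptive convolution idea, using torchvision's generic deformable convolution as a stand-in: a small convolution predicts per-location sampling offsets so the kernel can follow slender contours. The actual DCConv kernel design, the wavelet front end, and the Mamba blocks are not reproduced; shapes and initialization are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class ContourAdaptiveConv(nn.Module):
    """Sketch of a contour-adaptive convolution: predicted offsets let the 3x3 kernel
    bend along slender facial contours instead of sampling a rigid grid."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, kernel_size=3, padding=1)  # (dx, dy) per tap
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.05)

    def forward(self, x):
        offsets = self.offset_pred(x)                          # (B, 2*k*k, H, W)
        return deform_conv2d(x, offsets, self.weight, padding=self.k // 2)

feat = torch.randn(1, 16, 64, 64)     # e.g., wavelet sub-band features
out = ContourAdaptiveConv(16, 32)(feat)
print(out.shape)                       # torch.Size([1, 32, 64, 64])
```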
https://arxiv.org/abs/2501.09617
Metric learning projects samples into an embedded space, where similarities and dissimilarities are quantified based on their learned representations. However, existing methods often rely on label-guided representation learning, where representations of different modalities, such as audio and visual data, are aligned based on annotated labels. This approach tends to underutilize latent complex features and potential relationships inherent in the distributions of audio and visual data that are not directly tied to the labels, resulting in suboptimal performance in audio-visual embedding learning. To address this issue, we propose a novel architecture that integrates cross-modal triplet loss with progressive self-distillation. Our method enhances representation learning by leveraging inherent distributions and dynamically refining soft audio-visual alignments -- probabilistic alignments between audio and visual data that capture the inherent relationships beyond explicit labels. Specifically, the model distills audio-visual distribution-based knowledge from annotated labels in a subset of each batch. This self-distilled knowledge is used t
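A small sketch of a cross-modal triplet objective of the kind the abstract builds on: an audio anchor is pulled toward its matching visual embedding and pushed away from a mismatched one, and symmetrically for visual anchors. The progressive self-distillation component is omitted; the margin and in-batch negative mining are illustrative choices.

```python
import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(audio_emb, visual_emb, margin=0.2):
    """audio_emb, visual_emb: (B, D); row i of each modality forms a matched pair."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    sim = a @ v.t()                                   # (B, B) cosine similarities
    pos = sim.diag()                                  # matched audio-visual pairs
    # hardest in-batch negative per anchor, excluding the positive on the diagonal
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg_a2v = sim.masked_fill(mask, float("-inf")).max(dim=1).values   # audio anchor, visual negative
    neg_v2a = sim.masked_fill(mask, float("-inf")).max(dim=0).values   # visual anchor, audio negative
    loss_a2v = F.relu(margin - pos + neg_a2v).mean()
    loss_v2a = F.relu(margin - pos + neg_v2a).mean()
    return loss_a2v + loss_v2a

audio = torch.randn(16, 256, requires_grad=True)
visual = torch.randn(16, 256, requires_grad=True)
cross_modal_triplet_loss(audio, visual).backward()
```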
https://arxiv.org/abs/2501.09608
Group theory has been used in machine learning to provide a theoretically grounded approach for incorporating known symmetry transformations in tasks from robotics to protein modeling. In these applications, equivariant neural networks use known symmetry groups with predefined representations to learn over geometric input data. We propose MatrixNet, a neural network architecture that learns matrix representations of group element inputs instead of using predefined representations. MatrixNet achieves higher sample efficiency and generalization over several standard baselines in prediction tasks over several finite groups and the Artin braid group. We also show that MatrixNet respects group relations, allowing generalization to group elements of greater word length than those in the training set.
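A compact sketch of the core idea as described in the abstract: each generator is mapped to a learned matrix, and a group element given as a word over generators is represented by the product of those matrices, so composition of elements corresponds to matrix multiplication. The embedding size and the downstream head are assumptions.

```python
import torch
import torch.nn as nn

class MatrixNetSketch(nn.Module):
    """Learn a d x d matrix per generator; a word g_{i1} g_{i2} ... maps to the matrix product."""
    def __init__(self, num_generators, d=8, out_dim=1):
        super().__init__()
        # one learnable matrix per generator and per inverse generator, initialized near the identity
        self.gen = nn.Parameter(torch.randn(2 * num_generators, d, d) * 0.1 + torch.eye(d))
        self.head = nn.Linear(d * d, out_dim)   # task head on the flattened representation
        self.d = d

    def represent(self, word):
        """word: list of generator indices in [0, 2*num_generators); returns the d x d representation."""
        m = torch.eye(self.d)
        for idx in word:
            m = m @ self.gen[idx]                # composing group elements = multiplying their matrices
        return m

    def forward(self, words):
        reps = torch.stack([self.represent(w) for w in words])   # (B, d, d)
        return self.head(reps.flatten(1))                        # (B, out_dim)

model = MatrixNetSketch(num_generators=3)
pred = model([[0, 1, 4], [2, 2, 0, 5]])   # variable-length words over generators and inverses
print(pred.shape)                          # torch.Size([2, 1])
```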
https://arxiv.org/abs/2501.09571
Conventional 2D human pose estimation methods typically require extensive labeled annotations, which are both labor-intensive and expensive. In contrast, semi-supervised 2D human pose estimation can alleviate the above problems by leveraging a large amount of unlabeled data along with a small portion of labeled data. Existing semi-supervised 2D human pose estimation methods update the network through backpropagation, ignoring crucial historical information from the previous training process. Therefore, we propose a novel semi-supervised 2D human pose estimation method built on a newly designed Teacher-Reviewer-Student framework. Specifically, we first design our framework to mimic the way human beings constantly review previous knowledge for consolidation: the teacher predicts results to guide the student's learning, and the reviewer stores important historical parameters to provide additional supervision signals. Secondly, we introduce a Multi-level Feature Learning strategy, which utilizes the outputs from different stages of the backbone to estimate the heatmap and guide network training, enriching the supervisory information while effectively capturing keypoint relationships. Finally, we design a data augmentation strategy, Keypoint-Mix, that perturbs pose information by mixing different keypoints, thus enhancing the network's ability to discern keypoints. Extensive experiments on publicly available datasets demonstrate that our method achieves significant improvements compared to existing methods.
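A very rough sketch of the bookkeeping such a framework could use, assuming an EMA-style teacher (the abstract does not specify how the teacher is updated) and a reviewer that snapshots historical parameters for extra supervision; the pose losses themselves are omitted.

```python
import copy
import torch

def ema_update(teacher, student, momentum=0.999):
    """One way to keep a teacher: an exponential moving average of the student's weights (assumption)."""
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(momentum).add_(s, alpha=1.0 - momentum)

class Reviewer:
    """Keeps snapshots of historical parameters so they can supply additional supervision signals later."""
    def __init__(self, capacity=3):
        self.snapshots, self.capacity = [], capacity

    def maybe_store(self, model, step, every=1000):
        if step % every == 0:
            self.snapshots.append(copy.deepcopy(model).eval())
            self.snapshots = self.snapshots[-self.capacity:]

student = torch.nn.Linear(10, 2)       # stand-in for a pose network
teacher = copy.deepcopy(student)
reviewer = Reviewer()
for step in range(2001):
    # ... supervised / pseudo-label heatmap losses would be computed and backpropagated here ...
    ema_update(teacher, student)
    reviewer.maybe_store(teacher, step)
print(len(reviewer.snapshots))
```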
https://arxiv.org/abs/2501.09565
The meanings and relationships of words shift over time. This phenomenon is referred to as semantic shift. Research focused on understanding how semantic shifts occur over multiple time periods is essential for gaining a detailed understanding of semantic shifts. However, detecting change points only between adjacent time periods is insufficient for analyzing detailed semantic shifts, and using BERT-based methods to examine word sense proportions incurs a high computational cost. To address those issues, we propose a simple yet intuitive framework for how semantic shifts occur over multiple time periods by leveraging a similarity matrix between the embeddings of the same word through time. We compute a diachronic word similarity matrix using fast and lightweight word embeddings across arbitrary time periods, enabling deeper analysis of continuous semantic shifts. Additionally, by clustering the similarity matrices for different words, we can categorize words that exhibit similar behavior of semantic shift in an unsupervised manner.
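A short sketch of the framework's central object under simple assumptions: given one lightweight embedding of a word per time period, build the period-by-period cosine similarity matrix, then cluster the flattened matrices of many words to group similar shift patterns. Training and aligning the per-period embeddings (e.g., SGNS vectors) is abstracted away.

```python
import numpy as np
from sklearn.cluster import KMeans

def diachronic_similarity_matrix(word_vectors):
    """word_vectors: (T, D) array, one embedding of the same word per time period.
    Returns the (T, T) cosine similarity matrix across periods."""
    v = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
    return v @ v.T

rng = np.random.default_rng(0)
T, D = 6, 50
# toy corpus: 100 words, each with per-period embeddings (stand-ins for aligned word vectors)
words = {f"word_{i}": rng.normal(size=(T, D)) for i in range(100)}
matrices = {w: diachronic_similarity_matrix(v) for w, v in words.items()}

# cluster words by the shape of their shift pattern (upper triangle of each matrix as the feature vector)
iu = np.triu_indices(T, k=1)
features = np.stack([m[iu] for m in matrices.values()])
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
print(dict(zip(list(words)[:5], labels[:5])))
```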
https://arxiv.org/abs/2501.09538
Online medical consultation (OMC) restricts doctors to gathering patient information solely through inquiries, making the already complex sequential decision-making process of diagnosis even more challenging. Recently, the rapid advancement of large language models has demonstrated a significant potential to transform OMC. However, most studies have primarily focused on improving diagnostic accuracy under conditions of relatively sufficient information, while paying limited attention to the "inquiry" phase of the consultation process. This lack of focus has left the relationship between "inquiry" and "diagnosis" insufficiently explored. In this paper, we first extract real patient interaction strategies from authentic doctor-patient conversations and use these strategies to guide the training of a patient simulator that closely mirrors real-world behavior. By inputting medical records into our patient simulator to simulate patient responses, we conduct extensive experiments to explore the relationship between "inquiry" and "diagnosis" in the consultation process. Experimental results demonstrate that inquiry and diagnosis adhere to Liebig's law: poor inquiry quality limits the effectiveness of diagnosis, regardless of diagnostic capability, and vice versa. Furthermore, the experiments reveal significant differences in the inquiry performance of various models. To investigate this phenomenon, we categorize the inquiry process into four types: (1) chief complaint inquiry; (2) specification of known symptoms; (3) inquiry about accompanying symptoms; and (4) gathering family or medical history. We analyze the distribution of inquiries across the four types for different models to explore the reasons behind their significant performance differences. We plan to open-source the weights and related code of our patient simulator at this https URL.
https://arxiv.org/abs/2501.09484
In this study, we first introduce a method that converts CityGML data into voxels; it works efficiently and fast at high resolution for large-scale datasets such as cities, at the cost of some building detail, thereby overcoming the limitations of previous voxelization methodologies, which have been computationally intensive and inefficient at transforming large-scale urban areas into high-resolution voxel representations. The voxelized 3D city data from multiple cities and corresponding air temperature data are used to develop a machine learning model. Before model training, Gaussian blurring is applied to the input data to account for spatial relationships; as a result, the correlation between air temperature and volumetric building morphology also increases after the blurring. After model training, the prediction results are evaluated not only with Mean Square Error (MSE) but also with image similarity metrics such as the Structural Similarity Index Measure (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS), which can detect and account for spatial relations during evaluation. The trained model is capable of predicting the spatial distribution of air temperature by using the building volume information of the corresponding pixel as input. By doing so, this research aims to assist urban planners in incorporating environmental parameters into their planning strategies, thereby facilitating more sustainable and inhabitable urban environments.
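A brief sketch of the preprocessing and correlation check described above, assuming a 2D raster of per-pixel building volume (derived from the voxelized city) and a matching air-temperature raster; the blur radius and the toy data are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(1)
# toy rasters: per-pixel building volume and measured air temperature on the same grid
building_volume = rng.gamma(shape=2.0, scale=50.0, size=(128, 128))
air_temperature = 30.0 - 0.01 * gaussian_filter(building_volume, 3) + rng.normal(0.0, 0.2, (128, 128))

def pearson(a, b):
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

# blurring the morphology raster lets each pixel reflect its neighbourhood before correlating
blurred_volume = gaussian_filter(building_volume, sigma=3.0)
print("corr without blur:", round(pearson(building_volume, air_temperature), 3))
print("corr with blur   :", round(pearson(blurred_volume, air_temperature), 3))
```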
https://arxiv.org/abs/2501.09469
We explore the impact of aperture size and shape on automotive camera systems for deep-learning-based tasks like traffic sign recognition and light state detection. A method is proposed to simulate optical effects using the point spread function (PSF), enhancing realism and reducing the domain gap between synthetic and real-world images. Computer-generated scenes are refined with this technique to model optical distortions and improve simulation accuracy.
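A minimal sketch of applying a point spread function to a rendered frame: each channel is convolved with a PSF kernel; here a simple Gaussian stands in for the aperture-dependent PSF the paper models.

```python
import numpy as np
from scipy.signal import fftconvolve

def gaussian_psf(size=15, sigma=2.0):
    """Placeholder PSF; in practice the PSF would be derived from aperture size and shape."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def apply_psf(image, psf):
    """image: (H, W, 3) float array from the renderer; returns the optically blurred frame."""
    return np.stack([fftconvolve(image[..., c], psf, mode="same") for c in range(3)], axis=-1)

synthetic_frame = np.random.default_rng(0).random((240, 320, 3))   # stand-in for a rendered scene
blurred_frame = apply_psf(synthetic_frame, gaussian_psf(size=21, sigma=3.0))
print(blurred_frame.shape)
```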
https://arxiv.org/abs/2501.09456
3D visual grounding (3DVG), which aims to correlate a natural language description with the target object within a 3D scene, is a significant yet challenging task. Despite recent advancements in this domain, existing approaches commonly encounter a shortage: a limited amount and diversity of text-3D pairs available for training. Moreover, they fall short in effectively leveraging different contextual clues (e.g., rich spatial relations within the 3D visual space) for grounding. To address these limitations, we propose AugRefer, a novel approach for advancing 3D visual grounding. AugRefer introduces cross-modal augmentation designed to extensively generate diverse text-3D pairs by placing objects into 3D scenes and creating accurate and semantically rich descriptions using foundation models. Notably, the resulting pairs can be utilized by any existing 3DVG methods for enriching their training data. Additionally, AugRefer presents a language-spatial adaptive decoder that effectively adapts the potential referring objects based on the language description and various 3D spatial relations. Extensive experiments on three benchmark datasets clearly validate the effectiveness of AugRefer.
https://arxiv.org/abs/2501.09428
Robust WiFi-based human pose estimation is a challenging task that bridges discrete and subtle WiFi signals to human skeletons. This paper revisits this problem and reveals two critical yet overlooked issues: 1) a cross-domain gap, i.e., significant variations between source and target domain pose distributions; and 2) a structural fidelity gap, i.e., predicted skeletal poses manifest distorted topology, usually with misplaced joints and disproportionate bone lengths. This paper fills these gaps by reformulating the task into a novel two-phase framework dubbed DT-Pose: Domain-consistent representation learning and Topology-constrained Pose decoding. Concretely, we first propose a temporal-consistent contrastive learning strategy with uniformity regularization, coupled with self-supervised masking-reconstruction operations, to enable robust learning of domain-consistent and motion-discriminative WiFi-specific representations. Beyond this, we introduce a simple yet effective pose decoder with task prompts, which integrates Graph Convolution Network (GCN) and Transformer layers to constrain the topology structure of the generated skeleton by exploring the adjacent-overarching relationships among human joints. Extensive experiments conducted on various benchmark datasets highlight the superior performance of our method in tackling these fundamental challenges in both 2D/3D human pose estimation tasks.
https://arxiv.org/abs/2501.09411
Neural implicit k-space representations (NIK) have shown promising results for dynamic magnetic resonance imaging (MRI) at high temporal resolutions. Yet, reducing acquisition time, and thereby available training data, results in severe performance drops due to overfitting. To address this, we introduce a novel self-supervised k-space loss function $\mathcal{L}_\mathrm{PISCO}$, applicable for regularization of NIK-based reconstructions. The proposed loss function is based on the concept of parallel imaging-inspired self-consistency (PISCO), enforcing a consistent global k-space neighborhood relationship without requiring additional data. Quantitative and qualitative evaluations on static and dynamic MR reconstructions show that integrating PISCO significantly improves NIK representations. Particularly for high acceleration factors (R$\geq$54), NIK with PISCO achieves superior spatio-temporal reconstruction quality compared to state-of-the-art methods. Furthermore, an extensive analysis of the loss assumptions and stability shows PISCO's potential as versatile self-supervised k-space loss function for further applications and architectures. Code is available at: this https URL
https://arxiv.org/abs/2501.09403
Transformer-based encoder-decoder models have achieved remarkable success in image-to-image transfer tasks, particularly in image restoration. However, their high computational complexity, manifested in elevated FLOPs and parameter counts, limits their application in real-world scenarios. Existing knowledge distillation methods in image restoration typically employ lightweight student models that directly mimic the intermediate features and reconstruction results of the teacher, overlooking the implicit attention relationships between them. To address this, we propose a Soft Knowledge Distillation (SKD) strategy that incorporates a Multi-dimensional Cross-net Attention (MCA) mechanism for compressing image restoration models. This mechanism facilitates interaction between the student and teacher across both channel and spatial dimensions, enabling the student to implicitly learn the attention matrices. Additionally, we employ a Gaussian kernel function to measure the distance between student and teacher features in kernel space, ensuring stable and efficient feature learning. To further enhance the quality of reconstructed images, we replace the commonly used L1 or KL divergence loss with a contrastive learning loss at the image level. Experiments on three tasks, image deraining, deblurring, and denoising, demonstrate that our SKD strategy significantly reduces computational complexity while maintaining strong image restoration capabilities.
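A small sketch of a kernel-space feature distance of the kind mentioned above: student and teacher feature maps are compared through a Gaussian (RBF) kernel rather than a raw L1/L2 difference. The bandwidth, the dimensionality normalization, and the assumption of matching feature shapes are illustrative choices.

```python
import torch

def gaussian_kernel_distance(student_feat, teacher_feat, sigma=1.0):
    """student_feat, teacher_feat: (B, C, H, W) feature maps.
    Returns mean of 1 - k(s, t), i.e., distance in the RBF kernel space."""
    s = student_feat.flatten(1)
    t = teacher_feat.flatten(1)
    sq_dist = ((s - t) ** 2).sum(dim=1)                      # per-sample squared L2 distance
    k = torch.exp(-sq_dist / (2 * sigma ** 2 * s.size(1)))    # normalize by dimensionality for stability
    return (1.0 - k).mean()

student = torch.randn(4, 64, 32, 32, requires_grad=True)
teacher = torch.randn(4, 64, 32, 32)
loss = gaussian_kernel_distance(student, teacher)
loss.backward()
```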
https://arxiv.org/abs/2501.09321
This paper introduces a new problem, Causal Abductive Reasoning on Video Events (CARVE), which involves identifying causal relationships between events in a video and generating hypotheses about causal chains that account for the occurrence of a target event. To facilitate research in this direction, we create two new benchmark datasets with both synthetic and realistic videos, accompanied by trigger-target labels generated through a novel counterfactual synthesis approach. To explore the challenge of solving CARVE, we present a Causal Event Relation Network (CERN) that examines the relationships between video events in temporal and semantic spaces to efficiently determine the root-cause trigger events. Through extensive experiments, we demonstrate the critical roles of event relational representation learning and interaction modeling in solving video causal reasoning challenges. The introduction of the CARVE task, along with the accompanying datasets and the CERN framework, will advance future research on video causal reasoning and significantly facilitate various applications, including video surveillance, root-cause analysis and movie content management.
https://arxiv.org/abs/2501.09304
Voice biometric tasks, such as age estimation, require modeling the often complex relationship between voice features and the biometric variable. While deep learning models can handle such complexity, they typically require large amounts of accurately labeled data to perform well. Such data are often scarce for biometric tasks such as voice-based age prediction. On the other hand, simpler models like linear regression can work with smaller datasets but often fail to generalize to the underlying non-linear patterns present in the data. In this paper we propose the Tessellated Linear Model (TLM), a piecewise linear approach that combines the simplicity of linear models with the capacity of non-linear functions. TLM tessellates the feature space into convex regions and fits a linear model within each region. We optimize the tessellation and the linear models using a hierarchical greedy partitioning. We evaluated TLM on the TIMIT dataset on the task of age prediction from voice, where it outperformed state-of-the-art deep learning models.
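A simplified sketch of the tessellated-linear idea under loose assumptions: the feature space is partitioned into convex cells (here via k-means, whose Voronoi cells are convex, standing in for the paper's hierarchical greedy partitioning) and an independent linear regressor is fit per cell.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

class TessellatedLinearSketch:
    """Piecewise-linear regressor: convex regions from k-means, one linear model per region."""
    def __init__(self, n_regions=8):
        self.n_regions = n_regions

    def fit(self, X, y):
        self.partition = KMeans(n_clusters=self.n_regions, n_init=10, random_state=0).fit(X)
        regions = self.partition.labels_
        self.models = {r: LinearRegression().fit(X[regions == r], y[regions == r])
                       for r in range(self.n_regions)}
        return self

    def predict(self, X):
        regions = self.partition.predict(X)
        y = np.empty(len(X))
        for r in np.unique(regions):
            y[regions == r] = self.models[r].predict(X[regions == r])
        return y

# toy stand-in for voice features -> age
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = np.abs(X[:, 0]) * 10 + X[:, 1] ** 2 + rng.normal(0, 0.5, 2000)   # non-linear target
model = TessellatedLinearSketch(n_regions=8).fit(X[:1500], y[:1500])
print("MAE:", np.abs(model.predict(X[1500:]) - y[1500:]).mean().round(2))
```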
https://arxiv.org/abs/2501.09229
Short text classification, as a research subtopic in natural language processing, is more challenging due to its semantic sparsity and insufficient labeled samples in practical scenarios. We propose a novel model named MI-DELIGHT for short text classification in this work. Specifically, it first performs multi-source information (i.e., statistical information, linguistic information, and factual information) exploration to alleviate the sparsity issues. Then, the graph learning approach is adopted to learn the representation of short texts, which are presented in graph forms. Moreover, we introduce a dual-level (i.e., instance-level and cluster-level) contrastive learning auxiliary task to effectively capture different-grained contrastive information within massive unlabeled data. Meanwhile, previous models merely perform the main task and auxiliary tasks in parallel, without considering the relationship among tasks. Therefore, we introduce a hierarchical architecture to explicitly model the correlations between tasks. We conduct extensive experiments across various benchmark datasets, demonstrating that MI-DELIGHT significantly surpasses previous competitive models. It even outperforms popular large language models on several datasets.
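A rough sketch of a dual-level contrastive auxiliary of the kind described above, assuming two augmented views per short text: an instance-level InfoNCE term between the views plus a cluster-level term computed on soft cluster-assignment distributions. Temperatures, the cluster count, and the toy encoder outputs are illustrative, and the graph learning and multi-source components are omitted.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """Instance-level contrast: view 1 of text n should match view 2 of the same text n."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def cluster_contrast(p1, p2, tau=1.0):
    """Cluster-level contrast: columns of the (B, K) soft-assignment matrices act as the instances."""
    c1, c2 = F.normalize(p1.t(), dim=-1), F.normalize(p2.t(), dim=-1)   # (K, B)
    logits = c1 @ c2.t() / tau
    targets = torch.arange(c1.size(0), device=c1.device)
    return F.cross_entropy(logits, targets)

B, D, K = 32, 128, 10
view1, view2 = torch.randn(B, D), torch.randn(B, D)        # stand-ins for encoded text views
cluster_head = torch.nn.Linear(D, K)
loss = info_nce(view1, view2) + cluster_contrast(
    cluster_head(view1).softmax(-1), cluster_head(view2).softmax(-1))
print(float(loss))
```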
https://arxiv.org/abs/2501.09214