Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. Although scaling Transformer-based generators has been central to recent advances, the tokenizer component itself is rarely scaled, leaving open questions about how auto-encoder design choices influence both its reconstruction objective and downstream generative performance. Our work explores scaling in auto-encoders to fill this gap. To facilitate this exploration, we replace the typical convolutional backbone with an enhanced Vision Transformer architecture for Tokenization (ViTok). We train ViTok on large-scale image and video datasets far exceeding ImageNet-1K, removing data constraints on tokenizer scaling. We first study how scaling the auto-encoder bottleneck affects both reconstruction and generation -- and find that while it is highly correlated with reconstruction, its relationship with generation is more complex. We next explore the effect of separately scaling the auto-encoder's encoder and decoder on reconstruction and generation performance. Crucially, we find that scaling the encoder yields minimal gains for either reconstruction or generation, while scaling the decoder boosts reconstruction but the benefits for generation are mixed. Building on our exploration, we design ViTok as a lightweight auto-encoder that achieves competitive performance with state-of-the-art auto-encoders on ImageNet-1K and COCO reconstruction tasks (256p and 512p) while outperforming existing auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x fewer FLOPs. When integrated with Diffusion Transformers, ViTok demonstrates competitive performance on image generation for ImageNet-1K and sets new state-of-the-art benchmarks for class-conditional video generation on UCF-101.
https://arxiv.org/abs/2501.09755
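For intuition, here is a minimal sketch of the kind of ViT-style auto-encoder the paper studies, with the latent bottleneck exposed as an explicit width parameter. All module sizes and names are illustrative assumptions, not the released ViTok configuration:

```python
import torch
import torch.nn as nn

class TinyViTok(nn.Module):
    """Minimal ViT-style auto-encoder sketch: patchify -> Transformer encoder
    -> narrow bottleneck -> Transformer decoder -> pixels. Sizes here are
    illustrative assumptions, not the released ViTok configuration."""

    def __init__(self, img=256, patch=16, dim=256, bottleneck=16, depth=4):
        super().__init__()
        n = (img // patch) ** 2                       # number of patch tokens
        self.pos = nn.Parameter(torch.zeros(1, n, dim))
        self.embed = nn.Linear(3 * patch * patch, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.down = nn.Linear(dim, bottleneck)        # the bottleneck under study
        self.up = nn.Linear(bottleneck, dim)
        self.decoder = nn.TransformerEncoder(layer, depth)
        self.unembed = nn.Linear(dim, 3 * patch * patch)

    def forward(self, patches):                       # patches: (B, N, 3*p*p)
        z = self.down(self.encoder(self.embed(patches) + self.pos))
        return self.unembed(self.decoder(self.up(z) + self.pos))
```

The paper's scaling study varies the bottleneck width and the encoder/decoder sizes independently; this sketch ties encoder and decoder depth together for brevity.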
Autoregressive sequence models, such as Transformer-based vision-language-action (VLA) policies, can be tremendously effective for capturing complex and generalizable robotic behaviors. However, such models require us to choose a tokenization of our continuous action signals, which determines how the discrete symbols predicted by the model map to continuous robot actions. We find that current approaches for robot action tokenization, based on simple per-dimension, per-timestep binning schemes, typically perform poorly when learning dexterous skills from high-frequency robot data. To address this challenge, we propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Our tokenization approach, Frequency-space Action Sequence Tokenization (FAST), enables us to train autoregressive VLAs for highly dexterous and high-frequency tasks where standard discretization methods fail completely. Based on FAST, we release FAST+, a universal robot action tokenizer, trained on 1M real robot action trajectories. It can be used as a black-box tokenizer for a wide range of robot action sequences, with diverse action spaces and control frequencies. Finally, we show that, when combined with the pi0 VLA, our method can scale to training on 10k hours of robot data and match the performance of diffusion VLAs, while reducing training time by up to 5x.
https://arxiv.org/abs/2501.09747
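A toy sketch of frequency-space action tokenization in the spirit of FAST. The scale factor and serialization are illustrative assumptions, and the released tokenizer additionally compresses the quantized coefficient stream (with byte-pair encoding), which is omitted here:

```python
import numpy as np
from scipy.fft import dct, idct

def tokenize(chunk, scale=10.0):
    # chunk: (T, D) continuous actions for one high-frequency window
    coeffs = dct(chunk, axis=0, norm="ortho")   # per-dimension frequency transform
    return np.round(coeffs * scale).astype(int).flatten()

def detokenize(tokens, T, D, scale=10.0):
    coeffs = tokens.reshape(T, D).astype(float) / scale
    return idct(coeffs, axis=0, norm="ortho")   # approximate reconstruction
```

Because smooth high-frequency trajectories concentrate their energy in a few low-frequency DCT coefficients, most tokens round to zero, which is what makes autoregressive prediction over such sequences tractable.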
The multimodal language models (MLMs) based on generative pre-trained Transformers are considered powerful candidates for unifying various domains and tasks. MLMs developed for remote sensing (RS) have demonstrated outstanding performance in multiple tasks, such as visual question answering and visual grounding. In addition to visual grounding, which detects the specific objects referred to by a given instruction, aerial detection, which detects all objects of multiple categories, is also a valuable and challenging task for RS foundation models. However, aerial detection has not been explored by existing RS MLMs because the autoregressive prediction mechanism of MLMs differs significantly from the detection outputs. In this paper, we present a simple baseline for applying MLMs to aerial detection for the first time, named LMMRotate. Specifically, we first introduce a normalization method to transform detection outputs into textual outputs compatible with the MLM framework. Then, we propose an evaluation method, which ensures a fair comparison between MLMs and conventional object detection models. We construct the baseline by fine-tuning open-source general-purpose MLMs and achieve impressive detection performance comparable to conventional detectors. We hope that this baseline will serve as a reference for future MLM development, enabling more comprehensive capabilities for understanding RS images. Code is available at this https URL.
https://arxiv.org/abs/2501.09720
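To illustrate the normalization idea, rotated-box detections can be serialized into a small integer vocabulary so an autoregressive MLM emits them as ordinary text. The format below is a hypothetical stand-in, not the paper's actual specification:

```python
def boxes_to_text(boxes, img_w, img_h, n_bins=1000):
    """boxes: iterable of (cx, cy, w, h, angle_deg, class_name) rotated boxes."""
    def q(v, span):                      # map a coordinate into [0, n_bins)
        return int(round(v / span * (n_bins - 1)))
    lines = []
    for cx, cy, w, h, a, cls in boxes:
        lines.append(f"{cls} <{q(cx, img_w)}> <{q(cy, img_h)}> "
                     f"<{q(w, img_w)}> <{q(h, img_h)}> <{int(a) % 180}>")
    return "\n".join(lines)              # one object per line of model output
```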
This study conducts a systematic assessment of the capabilities of 12 machine learning models and model variations in detecting economic ideology. As an evaluation benchmark, I use manifesto data spanning six elections in the United Kingdom, pre-annotated by expert and crowd coders. The analysis assesses the performance of several generative, fine-tuned, and zero-shot models at the granular and aggregate levels. The results show that generative models such as GPT-4o and Gemini 1.5 Flash consistently outperform other models against all benchmarks. However, they pose issues of accessibility and resource availability. Fine-tuning yields competitive performance and offers a reliable alternative through domain-specific optimization, but its dependency on training data severely limits scalability. Zero-shot models consistently face difficulties with identifying signals of economic ideology, often resulting in negative associations with human coding. Using general knowledge for the domain-specific task of ideology scaling proved to be unreliable. Other key findings include considerable within-party variation, fine-tuning benefiting from larger training data, and zero-shot models' sensitivity to prompt content. The assessment covers the strengths and limitations of each model and derives best practices for automated analyses of political content.
https://arxiv.org/abs/2501.09719
For privacy and security concerns, the need to erase unwanted information from pre-trained vision models is becoming evident nowadays. In real-world scenarios, erasure requests originate at any time from both users and model owners, and these requests usually form a sequence. Therefore, under such a setting, selective information is expected to be continuously removed from a pre-trained model while maintaining the rest. We define this problem as continual forgetting and identify three key challenges. (i) For unwanted knowledge, efficient and effective deletion is crucial. (ii) For remaining knowledge, the impact brought by the forgetting procedure should be minimal. (iii) In real-world scenarios, the training samples may be scarce or partially missing during the process of forgetting. To address them, we first propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we introduce LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. To further extend GS-LoRA to more practical scenarios, we incorporate prototype information as additional supervision and introduce a more practical approach, GS-LoRA++. For each forgotten class, we move the logits away from its original prototype. For the remaining classes, we pull the logits closer to their respective prototypes. We conduct extensive experiments on face recognition, object detection and image classification and demonstrate that our method manages to forget specific classes with minimal impact on other classes. Code has been released at this https URL.
https://arxiv.org/abs/2501.09705
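The group sparse regularization can be read as a group lasso over whole LoRA modules: an L2 norm within each (A, B) pair, summed L1-style across pairs, so that entire adapters are either selected or zeroed out. A minimal PyTorch sketch under that reading (the weighting is an assumed hyperparameter):

```python
import torch

def group_sparse_penalty(lora_pairs, weight=1e-3):
    """Group lasso over LoRA modules: L2 within each (A, B) pair, L1 across
    pairs, driving whole adapters exactly to zero during forgetting."""
    total = 0.0
    for A, B in lora_pairs:              # one (A, B) pair per adapted FFN layer
        total = total + torch.sqrt((A ** 2).sum() + (B ** 2).sum() + 1e-12)
    return weight * total
```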
Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. It directly learns from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, different data construction methods in existing works bring notable performance variations. We identify a crucial factor here: outcomes are largely contingent on whether the constructed data aligns on-policy w.r.t. the initial (reference) policy of DPO. Theoretical analysis suggests that learning from off-policy data is impeded by the presence of KL-divergence between the updated policy and the reference policy. From the perspective of dataset distribution, we systematically summarize the inherent flaws in existing algorithms that employ DPO to address hallucination issues. To alleviate the problems, we propose the On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Notably, with only 4.8k training samples, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B: 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark, compared to the previous SOTA algorithm trained with 16k samples.
https://arxiv.org/abs/2501.09695
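For reference, the DPO objective that OPA-DPO builds on is sketched below; the paper's contribution lies in constructing the (chosen, rejected) pairs on-policy with expert-revised responses, not in changing this loss:

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Arguments are summed log-probs of each response under the current
    # policy (pi_*) and the frozen reference policy (ref_*).
    chosen = beta * (pi_chosen - ref_chosen)
    rejected = beta * (pi_rejected - ref_rejected)
    return -F.logsigmoid(chosen - rejected).mean()
```

When the preference data is off-policy, the implicit rewards above are dominated by the KL term between updated and reference policies, which is the failure mode the paper analyses.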
The proliferation of Vision-Language Models (VLMs) in the past several years calls for rigorous and comprehensive evaluation methods and benchmarks. This work analyzes existing VLM evaluation techniques, including automated metrics, AI-based assessments, and human evaluations across diverse tasks. We first introduce Robin - a novel suite of VLMs that we built by combining Large Language Models (LLMs) and Vision Encoders (VEs) at multiple scales, and use Robin to identify shortcomings of current evaluation approaches across scales. Next, to overcome the identified limitations, we introduce CHIRP - a new long-form response benchmark we developed for more robust and complete VLM evaluation. We provide open access to the Robin training code, model suite, and CHIRP benchmark to promote reproducibility and advance VLM research.
https://arxiv.org/abs/2501.09672
Face recognition technology has dramatically transformed the landscape of security, surveillance, and authentication systems, offering a user-friendly and non-invasive biometric solution. However, despite its significant advantages, face recognition systems face increasing threats from physical and digital spoofing attacks. Current research typically treats face recognition and attack detection as distinct classification challenges. This approach necessitates the implementation of separate models for each task, leading to considerable computational complexity, particularly on devices with limited resources. Such inefficiencies can stifle scalability and hinder performance. In response to these challenges, this paper introduces an innovative unified model designed for face recognition and detection of physical and digital attacks. By leveraging the advanced Swin Transformer backbone and incorporating HiLo attention in a convolutional neural network framework, we address unified face recognition and spoof attack detection more effectively. Moreover, we introduce augmentation techniques that replicate the traits of physical and digital spoofing cues, significantly enhancing our model's robustness. Through comprehensive experimental evaluation across various datasets, we showcase the effectiveness of our model in unified face recognition and spoof detection. Additionally, we confirm its resilience against unseen physical and digital spoofing attacks, underscoring its potential for real-world applications.
https://arxiv.org/abs/2501.09635
The rapid spread of fake news presents a significant global challenge, particularly in low-resource languages like Bangla, which lack adequate datasets and detection tools. Although manual fact-checking is accurate, it is expensive and too slow to prevent the dissemination of fake news. Addressing this gap, we introduce BanFakeNews-2.0, a robust dataset to enhance Bangla fake news detection. This version includes 11,700 additional, meticulously curated fake news articles validated from credible sources, creating a proportional dataset of 47,000 authentic and 13,000 fake news items across 13 categories. In addition, we created a manually curated independent test set of 460 fake and 540 authentic news items for rigorous evaluation. We invested effort in collecting fake news from credible sources and verifying it manually while preserving linguistic richness. We develop a benchmark system utilizing transformer-based architectures, including fine-tuned Bidirectional Encoder Representations from Transformers variants (F1: 87%) and Large Language Models with Quantized Low-Rank Approximation (F1: 89%), that significantly outperforms traditional methods. BanFakeNews-2.0 offers a valuable resource to advance research and application in fake news detection for low-resourced languages. We publicly release our dataset and model on GitHub to foster research in this direction.
https://arxiv.org/abs/2501.09604
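A minimal fine-tuning skeleton for the BERT-variant benchmark using the Hugging Face transformers API; the checkpoint name is a placeholder assumption rather than the paper's exact choice:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-multilingual-cased"   # placeholder Bangla-capable checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

enc = tok(["<Bangla news text>"], truncation=True, padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits         # [authentic, fake] after fine-tuning
pred = logits.argmax(dim=-1)
```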
The appearance of surface impurities (e.g., water stains, fingerprints, stickers) is an often-mentioned issue that causes degradation of automated visual inspection systems. At the same time, synthetic data generation techniques for visual surface inspection have focused primarily on generating perfect examples and defects, disregarding impurities. This study highlights the importance of considering impurities when generating synthetic data. We introduce a procedural method to include photorealistic water stains in synthetic data. The synthetic datasets are generated to correspond to real datasets and are further used to train an anomaly detection model and investigate the influence of water stains. The high-resolution images used for surface inspection lead to memory bottlenecks during anomaly detection training. To address this, we introduce Sequential PatchCore - a method to build coresets sequentially and make training on large images tractable on consumer-grade hardware. This allows us to perform transfer learning using coresets pre-trained on different dataset versions. Our results show the benefits of using synthetic data for pre-training an explicit coreset anomaly model and the extended performance benefits of finetuning the coreset using real data. We observe how impurities and labelling ambiguity lower model performance, and we additionally report defect-wise recall to provide an industrially relevant perspective on model performance.
https://arxiv.org/abs/2501.09579
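A sketch of the greedy k-center coreset selection that underlies PatchCore; the "sequential" variant amounts to running selection batch by batch and carrying the chosen set forward so the full feature bank never has to fit in memory at once. Details here are assumptions, not the paper's exact algorithm:

```python
import numpy as np

def kcenter_greedy(features, budget, seed_idx=0):
    """Greedy k-center coreset selection over (N, d) patch features.
    A sequential variant calls this per image batch and carries the
    selected set over, bounding peak memory on consumer hardware."""
    selected = [seed_idx]
    dists = np.linalg.norm(features - features[seed_idx], axis=1)
    for _ in range(budget - 1):
        idx = int(dists.argmax())        # farthest point from current coreset
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(features - features[idx], axis=1))
    return features[selected]
```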
In this paper, we elaborate on how AI can support diversity and inclusion and exemplify research projects conducted in that direction. We start by looking at the challenges and progress in making large language models (LLMs) more transparent, inclusive, and aware of social biases. Even though LLMs like ChatGPT have impressive abilities, they struggle to understand different cultural contexts and engage in meaningful, human-like conversations. A key issue is that biases in language processing, especially in machine translation, can reinforce inequality. Tackling these biases requires a multidisciplinary approach to ensure AI promotes diversity, fairness, and inclusion. We also highlight AI's role in identifying biased content in media, which is important for improving representation. By detecting unequal portrayals of social groups, AI can help challenge stereotypes and create more inclusive technologies. Transparent AI algorithms, which clearly explain their decisions, are essential for building trust and reducing bias in AI systems. We also stress that AI systems need diverse and inclusive training data. Projects like the Child Growth Monitor show how using a wide range of data can help address real-world problems like malnutrition and poverty. We present a project that demonstrates how AI can be applied to monitor the role of search engines in spreading disinformation about the LGBTQ+ community. Moreover, we discuss the SignON project as an example of how technology can bridge communication gaps between hearing and deaf people, emphasizing the importance of collaboration and mutual trust in developing inclusive AI. Overall, with this paper, we advocate for AI systems that are not only effective but also socially responsible, promoting fair and inclusive interactions between humans and machines.
https://arxiv.org/abs/2501.09534
Text-to-SQL enables users to interact with databases through natural language, simplifying the retrieval and synthesis of information. Despite the success of large language models (LLMs) in converting natural language questions into SQL queries, their broader adoption is limited by two main challenges: achieving robust generalization across diverse queries and ensuring interpretative confidence in their predictions. To tackle these issues, our research investigates the integration of selective classifiers into Text-to-SQL systems. We analyse the trade-off between coverage and risk using entropy-based confidence estimation with selective classifiers and assess its impact on the overall performance of Text-to-SQL models. Additionally, we explore the models' initial calibration and improve it with calibration techniques for better alignment between confidence and accuracy. Our experimental results show that the encoder-decoder T5 is better calibrated than in-context-learning GPT-4 and decoder-only Llama 3, and thus its external entropy-based selective classifier performs better. The study also reveals that, in terms of error detection, the selective classifier is more likely to detect errors stemming from irrelevant questions than from incorrect query generation.
https://arxiv.org/abs/2501.09527
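The entropy-based selective mechanism reduces to abstaining whenever predictive entropy exceeds a threshold; sweeping the threshold traces the coverage-risk trade-off analysed in the paper. A minimal sketch:

```python
import numpy as np

def entropy(p, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=-1)

def coverage_risk(probs, correct, threshold):
    """probs: (N, C) predictive distributions; correct: (N,) 1 if the
    generated SQL was right. Abstain when entropy exceeds the threshold,
    then report the fraction answered (coverage) and the error rate
    among the answered queries (risk)."""
    answered = entropy(probs) <= threshold
    coverage = answered.mean()
    risk = 1.0 - correct[answered].mean() if answered.any() else 0.0
    return coverage, risk
```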
We propose a novel architecture for graph-based dependency parsing that explicitly constructs vectors from which both arcs and labels are scored. Our method addresses key limitations of the standard two-pipeline approach by unifying arc scoring and labeling into a single network, reducing scalability issues caused by the information bottleneck and lack of parameter sharing. Additionally, our architecture overcomes limited arc interactions by using transformer layers to efficiently simulate higher-order dependencies. Experiments on PTB and UD show that our model outperforms state-of-the-art parsers in both accuracy and efficiency.
https://arxiv.org/abs/2501.09451
This paper investigates the robustness of vision-language models against adversarial visual perturbations and introduces a novel "double visual defense" to enhance this robustness. Unlike previous approaches that resort to lightweight adversarial fine-tuning of a pre-trained CLIP model, we perform large-scale adversarial vision-language pre-training from scratch using web-scale data. We then strengthen the defense by incorporating adversarial visual instruction tuning. The resulting models from each stage, $\Delta$CLIP and $\Delta^2$LLaVA, show substantially enhanced zero-shot robustness and set a new state-of-the-art in adversarial defense for vision-language models. For example, the adversarial robustness of $\Delta$CLIP surpasses that of the previous best models on ImageNet-1k by ~20%. Similarly, compared to prior art, $\Delta^2$LLaVA brings a ~30% robustness improvement to the image captioning task and a ~20% robustness improvement to the visual question answering task. Furthermore, our models exhibit stronger zero-shot recognition capability, fewer hallucinations, and superior reasoning performance compared to baselines. Our project page is this https URL.
https://arxiv.org/abs/2501.09446
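Adversarial pre-training of this kind typically generates perturbed inputs with an L-infinity PGD attack at every training step. A generic sketch follows; the budget, step size, and step count are illustrative, and the paper's exact attack configuration may differ:

```python
import torch

def pgd_attack(model, images, targets, loss_fn, eps=4/255, alpha=1/255, steps=10):
    """L_inf PGD: iteratively ascend the loss gradient, projecting back
    into an eps-ball around the clean images after every step."""
    adv = images + torch.empty_like(images).uniform_(-eps, eps)
    for _ in range(steps):
        adv = adv.detach().requires_grad_(True)
        loss = loss_fn(model(adv), targets)
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv + alpha * grad.sign()
        adv = images + (adv - images).clamp(-eps, eps)   # project to eps-ball
        adv = adv.clamp(0, 1)                            # keep valid pixel range
    return adv.detach()
```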
While large language models (LLMs) present significant potential for supporting numerous real-world applications and delivering positive social impacts, they still face significant challenges in terms of the inherent risk of privacy leakage, hallucinated outputs, and value misalignment, and can be maliciously used for generating toxic content and for unethical purposes after being jailbroken. Therefore, in this survey, we present a comprehensive review of recent advancements aimed at mitigating these issues, organized across the four phases of LLM development and usage: data collecting and pre-training, fine-tuning and alignment, prompting and reasoning, and post-processing and auditing. We elaborate on the recent advances for enhancing the performance of LLMs in terms of privacy protection, hallucination reduction, value alignment, toxicity elimination, and jailbreak defenses. In contrast to previous surveys that focus on a single dimension of responsible LLMs, this survey presents a unified framework that encompasses these diverse dimensions, providing a comprehensive view of enhancing LLMs to better serve real-world applications.
https://arxiv.org/abs/2501.09431
Many practical vision-language applications require models that understand negation, e.g., when using natural language to retrieve images which contain certain objects but not others. Despite advancements in vision-language models (VLMs) through large-scale training, their ability to comprehend negation remains underexplored. This study addresses the question: how well do current VLMs understand negation? We introduce NegBench, a new benchmark designed to evaluate negation understanding across 18 task variations and 79k examples spanning image, video, and medical datasets. The benchmark consists of two core tasks designed to evaluate negation understanding in diverse multimodal settings: Retrieval with Negation and Multiple Choice Questions with Negated Captions. Our evaluation reveals that modern VLMs struggle significantly with negation, often performing at chance level. To address these shortcomings, we explore a data-centric approach wherein we finetune CLIP models on large-scale synthetic datasets containing millions of negated captions. We show that this approach can result in a 10% increase in recall on negated queries and a 40% boost in accuracy on multiple-choice questions with negated captions.
https://arxiv.org/abs/2501.09425
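The synthetic negated captions can be produced from simple templates that pair an object present in the image with one that is absent; the templates below are hypothetical illustrations of this data-centric recipe:

```python
import random

def negate_caption(present, absent):
    """Compose a synthetic negated caption from one object that is in the
    image (present) and one that is not (absent)."""
    templates = [
        "A photo of a {p}, but no {a}.",
        "An image containing a {p} and not a {a}.",
        "There is a {p} in this picture, without any {a}.",
    ]
    return random.choice(templates).format(p=present, a=absent)
```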
Robust WiFi-based human pose estimation is a challenging task that bridges discrete and subtle WiFi signals to human skeletons. This paper revisits this problem and reveals two critical yet overlooked issues: 1) the cross-domain gap, i.e., significant variations between source and target domain pose distributions; and 2) the structural fidelity gap, i.e., predicted skeletal poses manifesting distorted topology, usually with misplaced joints and disproportionate bone lengths. This paper fills these gaps by reformulating the task into a novel two-phase framework dubbed DT-Pose: Domain-consistent representation learning and Topology-constrained Pose decoding. Concretely, we first propose a temporal-consistent contrastive learning strategy with uniformity regularization, coupled with self-supervised masking-reconstruction operations, to enable robust learning of domain-consistent and motion-discriminative WiFi-specific representations. Beyond this, we introduce a simple yet effective pose decoder with task prompts, which integrates Graph Convolution Network (GCN) and Transformer layers to constrain the topology structure of the generated skeleton by exploring the adjacent-overarching relationships among human joints. Extensive experiments conducted on various benchmark datasets highlight the superior performance of our method in tackling these fundamental challenges in both 2D/3D human pose estimation tasks.
https://arxiv.org/abs/2501.09411
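The uniformity regularization mentioned above is, under a common reading, the Wang & Isola (2020) uniformity loss, which spreads L2-normalized embeddings over the unit hypersphere. A sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def uniformity_loss(z, t=2.0):
    """Wang & Isola (2020) uniformity: log of the mean pairwise Gaussian
    potential between L2-normalized embeddings on the hypersphere."""
    z = F.normalize(z, dim=-1)
    sq_dists = torch.pdist(z, p=2).pow(2)
    return sq_dists.mul(-t).exp().mean().log()
```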
The proliferation of Internet of Things (IoT) devices equipped with acoustic sensors necessitates robust acoustic scene classification (ASC) capabilities, even in noisy and data-limited environments. Traditional machine learning methods often struggle to generalize effectively under such conditions. To address this, we introduce Q-ASC, a novel Quantum-Inspired Acoustic Scene Classifier that leverages the power of quantum-inspired transformers. By integrating quantum concepts like superposition and entanglement, Q-ASC achieves superior feature learning and enhanced noise resilience compared to classical models. Furthermore, we introduce a Quantum Variational Autoencoder (QVAE) based data augmentation technique to mitigate the challenge of limited labeled data in IoT deployments. Extensive evaluations on the Tampere University of Technology (TUT) Acoustic Scenes 2016 benchmark dataset demonstrate that Q-ASC achieves remarkable accuracy between 68.3% and 88.5% under challenging conditions, outperforming state-of-the-art methods by over 5% in the best case. This research paves the way for deploying intelligent acoustic sensing in IoT networks, with potential applications in smart homes, industrial monitoring, and environmental surveillance, even in adverse acoustic environments.
https://arxiv.org/abs/2501.09394
Image segmentation, a key task in computer vision, has traditionally relied on convolutional neural networks (CNNs), yet these models struggle to capture complex spatial dependencies, handle objects at varying scales, and exploit contextual information, and they depend on manually crafted architecture components. This paper explores the shortcomings of CNN-based models and the shift towards transformer architectures to overcome those limitations. This work reviews state-of-the-art transformer-based segmentation models, addressing segmentation-specific challenges and their solutions. The paper discusses current challenges in transformer-based segmentation and outlines promising future trends, such as lightweight architectures and enhanced data efficiency. This survey serves as a guide for understanding the impact of transformers in advancing segmentation capabilities and overcoming the limitations of traditional models.
https://arxiv.org/abs/2501.09372
Instruction tuning enhances large language models (LLMs) to follow human instructions across diverse tasks, relying on high-quality datasets to guide behavior. However, these datasets, whether manually curated or synthetically generated, are often narrowly focused and misaligned with the broad distributions captured during pre-training, limiting LLM generalization and effective use of pre-trained knowledge. We propose *Aligning Instruction Tuning with Pre-training* (AITP), a method that bridges this gap by identifying coverage shortfalls in instruction-tuning datasets and rewriting underrepresented pre-training data into high-quality instruction-response pairs. This approach enriches dataset diversity while preserving task-specific objectives. Evaluations on three fully open LLMs across eight benchmarks demonstrate consistent performance improvements with AITP. Ablations highlight the benefits of adaptive data selection, controlled rewriting, and balanced integration, emphasizing the importance of aligning instruction tuning with pre-training distributions to unlock the full potential of LLMs.
https://arxiv.org/abs/2501.09368
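One simple way to operationalize "coverage shortfalls" is to embed both corpora and flag pre-training examples whose nearest instruction-tuning neighbour is distant. The heuristic below is an illustrative assumption, not necessarily AITP's actual selection rule:

```python
import numpy as np

def underrepresented(pretrain_emb, instruct_emb, frac=0.1):
    """Flag pre-training docs far (in cosine distance) from every
    instruction-tuning example -- a proxy for coverage shortfalls."""
    a = pretrain_emb / np.linalg.norm(pretrain_emb, axis=1, keepdims=True)
    b = instruct_emb / np.linalg.norm(instruct_emb, axis=1, keepdims=True)
    nn_dist = 1.0 - (a @ b.T).max(axis=1)    # distance to nearest instruction
    cutoff = np.quantile(nn_dist, 1.0 - frac)
    return np.where(nn_dist >= cutoff)[0]    # candidates to rewrite into pairs
```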