Our objective is to discover and localize monotonic temporal changes in a sequence of images. To achieve this, we exploit a simple proxy task of ordering a shuffled image sequence, with `time' serving as a supervisory signal since only changes that are monotonic with time can give rise to the correct ordering. We also introduce a flexible transformer-based model for general-purpose ordering of image sequences of arbitrary length with built-in attribution maps. After training, the model successfully discovers and localizes monotonic changes while ignoring cyclic and stochastic ones. We demonstrate applications of the model in multiple video settings covering different scene and object types, discovering both object-level and environmental changes in unseen sequences. We also demonstrate that the attention-based attribution maps function as effective prompts for segmenting the changing regions, and that the learned representations can be used for downstream applications. Finally, we show that the model achieves the state of the art on standard benchmarks for ordering a set of images.
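A predicted ordering from such a model is commonly scored against the ground-truth order with a rank-correlation metric such as Kendall's tau; the sketch below implements this standard metric, which is not necessarily the exact measure used in the paper.

```python
def kendall_tau(pred, true):
    # Rank correlation between a predicted and a ground-truth ordering:
    # each concordant pair contributes +1, each swapped pair -1,
    # normalized to [-1, 1].
    n = len(pred)
    rank = {item: i for i, item in enumerate(true)}
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            score += 1 if rank[pred[i]] < rank[pred[j]] else -1
    return 2 * score / (n * (n - 1))
```

A perfectly ordered sequence scores 1.0 and a fully reversed one -1.0.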
https://arxiv.org/abs/2404.16828
With the advent of virtual reality technology, omnidirectional image (ODI) rescaling techniques are increasingly embraced for reducing transmitted and stored file sizes while preserving high image quality. Despite this progress, current ODI rescaling methods predominantly focus on enhancing the quality of images in equirectangular projection (ERP) format, which overlooks the fact that the content viewed on head-mounted displays (HMDs) is actually a rendered viewport instead of an ERP image. In this work, we emphasize that focusing solely on ERP quality results in inferior viewport visual experiences for users. Thus, we propose ResVR, the first comprehensive framework for the joint Rescaling and Viewport Rendering of ODIs. ResVR allows obtaining low-resolution (LR) ERP images for transmission while rendering high-quality viewports for users to watch on HMDs. In ResVR, a novel discrete pixel sampling strategy is developed to tackle the complex mapping between the viewport and ERP, enabling end-to-end training of the ResVR pipeline. Furthermore, a spherical pixel shape representation technique is derived from spherical differentiation to significantly improve the visual quality of rendered viewports. Extensive experiments demonstrate that ResVR outperforms existing methods in viewport rendering tasks across different fields of view, resolutions, and view directions while keeping a low transmission overhead.
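The viewport/ERP mapping the abstract refers to is, at its core, the gnomonic projection from a perspective viewport onto equirectangular coordinates. A minimal NumPy sketch of that mapping follows (function and parameter names are illustrative; the paper's discrete pixel sampling strategy and spherical pixel shapes go well beyond this):

```python
import numpy as np

def viewport_to_erp_coords(vp_h, vp_w, fov_deg, yaw_deg, pitch_deg, erp_h, erp_w):
    # Focal length from the horizontal field of view (gnomonic projection).
    f = 0.5 * vp_w / np.tan(np.radians(fov_deg) / 2)
    xs = np.arange(vp_w) - (vp_w - 1) / 2
    ys = np.arange(vp_h) - (vp_h - 1) / 2
    x, y = np.meshgrid(xs, ys)
    # Ray directions in the camera frame (z forward, x right, y down).
    dirs = np.stack([x, y, np.full_like(x, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Rotate by pitch (around x) first, then yaw (around y).
    p, yw = np.radians(pitch_deg), np.radians(yaw_deg)
    Rx = np.array([[1, 0, 0], [0, np.cos(p), -np.sin(p)], [0, np.sin(p), np.cos(p)]])
    Ry = np.array([[np.cos(yw), 0, np.sin(yw)], [0, 1, 0], [-np.sin(yw), 0, np.cos(yw)]])
    d = dirs @ (Ry @ Rx).T
    lon = np.arctan2(d[..., 0], d[..., 2])       # [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1, 1))   # [-pi/2, pi/2]
    # Continuous ERP pixel coordinates for each viewport pixel.
    u = (lon / (2 * np.pi) + 0.5) * (erp_w - 1)
    v = (lat / np.pi + 0.5) * (erp_h - 1)
    return u, v
```

Sampling (e.g. bilinear interpolation) of the ERP image at `(u, v)` then renders the viewport; making that sampling step differentiable is what enables the end-to-end training the abstract describes.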
https://arxiv.org/abs/2404.16825
AI-generated video has revolutionized short video production, filmmaking, and personalized media, making local video editing an essential tool. However, this progress also blurs the line between reality and fiction, posing challenges in multimedia forensics. To address this urgent issue, we propose V2A-Mark to overcome the limitations of current video tampering forensics, such as poor generalizability, singular function, and a single-modality focus. Combining the fragility of video-into-video steganography with deep robust watermarking, our method can embed invisible visual-audio localization watermarks and copyright watermarks into the original video frames and audio, enabling precise manipulation localization and copyright protection. We also design a temporal alignment and fusion module and degradation prompt learning to enhance localization accuracy and decoding robustness. Meanwhile, we introduce a sample-level audio localization method and a cross-modal copyright extraction mechanism to couple the information of audio and video frames. The effectiveness of V2A-Mark has been verified on a visual-audio tampering dataset, emphasizing its superiority in localization precision and copyright accuracy, which is crucial for the sustainable development of video editing in the AIGC video era.
https://arxiv.org/abs/2404.16824
As large language models (LLMs) see increasing adoption across the globe, it is imperative for LLMs to be representative of the linguistic diversity of the world. India is a linguistically diverse country of 1.4 billion people. To facilitate research on multilingual LLM evaluation, we release IndicGenBench - the largest benchmark for evaluating LLMs on user-facing generation tasks across a diverse set of 29 Indic languages covering 13 scripts and 4 language families. IndicGenBench is composed of diverse generation tasks like cross-lingual summarization, machine translation, and cross-lingual question answering. IndicGenBench extends existing benchmarks to many Indic languages through human curation, providing multi-way parallel evaluation data for many under-represented Indic languages for the first time. We evaluate a wide range of proprietary and open-source LLMs, including GPT-3.5, GPT-4, PaLM-2, mT5, Gemma, BLOOM, and LLaMA, on IndicGenBench in a variety of settings. The largest PaLM-2 model performs best on most tasks; however, there is a significant performance gap in all languages compared to English, showing that further research is needed for the development of more inclusive multilingual language models. IndicGenBench is released at this http URL
https://arxiv.org/abs/2404.16816
Addressing the challenges of rare diseases is difficult, especially with the limited number of reference images and a small patient population. This is more evident in rare skin diseases, where we encounter long-tailed data distributions that make it difficult to develop unbiased and broadly effective models. The diverse ways in which image datasets are gathered, and their distinct purposes, also add to these challenges. Our study conducts a detailed examination of the benefits and drawbacks of episodic and conventional training methodologies, adopting a few-shot learning approach alongside transfer learning. We evaluated our models using the ISIC2018, Derm7pt, and SD-198 datasets. With minimal labeled examples, our models showed substantial information gains and better performance compared to previously trained models. Our research emphasizes the improved ability to represent features in DenseNet121 and MobileNetV2 models, achieved by using models pre-trained on ImageNet to increase similarities within classes. Moreover, our experiments, ranging from 2-way to 5-way classifications with up to 10 examples, showed a growing success rate for traditional transfer learning methods as the number of examples increased. The addition of data augmentation techniques significantly improved our transfer-learning-based model performance, leading to higher performance than existing methods, especially on the SD-198 and ISIC2018 datasets. All source code related to this work will be made publicly available soon at the provided URL.
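Episodic (few-shot) training, as contrasted with conventional training above, hinges on sampling N-way K-shot episodes from the labeled dataset. A minimal sketch of such a sampler (names and structure are illustrative, not the study's code):

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way=5, k_shot=5, q_queries=5, rng=None):
    # labels: list of class ids, one per image index.
    rng = rng or random.Random(0)
    by_class = defaultdict(list)
    for idx, c in enumerate(labels):
        by_class[c].append(idx)
    # Only classes with enough examples can appear in an episode --
    # a real concern for long-tailed rare-disease data.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= k_shot + q_queries]
    classes = rng.sample(eligible, n_way)
    support, query = [], []
    for c in classes:
        chosen = rng.sample(by_class[c], k_shot + q_queries)
        support += [(i, c) for i in chosen[:k_shot]]
        query += [(i, c) for i in chosen[k_shot:]]
    return support, query
```

Each episode yields a small support set to adapt on and a disjoint query set to evaluate on, mimicking the few-shot conditions at test time.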
https://arxiv.org/abs/2404.16814
Generative Commonsense Reasoning (GCR) requires a model to reason about a situation using commonsense knowledge, while generating coherent sentences. Although the quality of the generated sentences is crucial, the diversity of the generation is equally important because it reflects the model's ability to use a range of commonsense knowledge facts. Large Language Models (LLMs) have shown proficiency in enhancing the generation quality across various tasks through in-context learning (ICL) using given examples without the need for any fine-tuning. However, the diversity aspect in LLM outputs has not been systematically studied before. To address this, we propose a simple method that diversifies the LLM generations, while preserving their quality. Experimental results on three benchmark GCR datasets show that our method achieves an ideal balance between the quality and diversity. Moreover, the sentences generated by our proposed method can be used as training data to improve diversity in existing commonsense generators.
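Generation diversity of the kind discussed here is often measured with distinct-n, the ratio of unique n-grams to all n-grams across a set of generations; this is a standard proxy, not necessarily the paper's exact metric:

```python
def distinct_n(sentences, n=2):
    # Ratio of unique n-grams to total n-grams across all generations.
    ngrams, total = set(), 0
    for s in sentences:
        toks = s.split()
        for i in range(len(toks) - n + 1):
            ngrams.add(tuple(toks[i:i + n]))
            total += 1
    return len(ngrams) / total if total else 0.0
```

Identical generations push the score toward 0, fully distinct ones toward 1, so the quality/diversity trade-off can be read off alongside a quality metric like BLEU.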
https://arxiv.org/abs/2404.16807
Recent advances in large pre-trained vision-language models have demonstrated remarkable performance on zero-shot downstream tasks. Building upon this, recent studies, such as CoOp and CoCoOp, have proposed the use of prompt learning, where context within a prompt is replaced with learnable vectors, leading to significant improvements over manually crafted prompts. However, the performance improvement for unseen classes is still marginal, and to tackle this problem, data augmentation has been frequently used in traditional zero-shot learning techniques. Through our experiments, we have identified important issues in CoOp and CoCoOp: the context learned through traditional image augmentation is biased toward seen classes, negatively impacting generalization to unseen classes. To address this problem, we propose adversarial token embedding to disentangle low-level visual augmentation features from high-level class information when inducing bias in learnable prompts. Through our novel mechanism called "Adding Attributes to Prompt Learning", AAPL, we guide the learnable context to effectively extract text features by focusing on high-level features for unseen classes. We have conducted experiments across 11 datasets, and overall, AAPL shows favorable performances compared to the existing methods in few-shot learning, zero-shot learning, cross-dataset, and domain generalization tasks.
https://arxiv.org/abs/2404.16804
Although the capabilities of large language models (LLMs) ideally scale up with increasing data and compute, they are inevitably constrained by limited resources in reality. Suppose we have a moderately trained LLM (e.g., trained to align with human preference) in hand, can we further exploit its potential and cheaply acquire a stronger model? In this paper, we propose a simple method called ExPO to boost LLMs' alignment with human preference. ExPO assumes that a medium-aligned model can be interpolated between a less-aligned (weaker) model, e.g., the initial SFT model, and a better-aligned (stronger) one, thereby directly obtaining this stronger model by extrapolating from the weights of the former two relatively weaker models. On the AlpacaEval 2.0 benchmark, we show that ExPO pushes models trained with less preference data (e.g., 10% or 20%) to reach and even surpass the fully-trained one, without any additional training. Furthermore, ExPO also significantly improves off-the-shelf DPO/RLHF models and exhibits decent scalability across model sizes from 7B to 70B. Our work demonstrates the efficacy of model extrapolation in exploiting LLMs' capabilities, suggesting a promising direction that deserves future exploration.
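The extrapolation step ExPO describes amounts to moving past the stronger checkpoint along the weak-to-strong weight direction. A minimal sketch over per-parameter arrays (`alpha` is a hypothetical name for the extrapolation coefficient):

```python
import numpy as np

def expo_extrapolate(weak, strong, alpha=0.5):
    # theta = theta_strong + alpha * (theta_strong - theta_weak):
    # extrapolate beyond the better-aligned checkpoint along the
    # direction from the weaker (e.g. SFT) model to the stronger one.
    return {k: strong[k] + alpha * (strong[k] - weak[k]) for k in strong}
```

No gradient steps are involved; the "stronger" model is obtained purely from the two existing checkpoints' weights.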
https://arxiv.org/abs/2404.16792
In human neuroimaging studies, atlas registration enables mapping MRI scans to a common coordinate frame, which is necessary to aggregate data from multiple subjects. Machine learning registration methods have achieved excellent speed and accuracy but lack interpretability. More recently, keypoint-based methods have been proposed to tackle this issue, but their accuracy is still subpar, particularly when fitting nonlinear transforms. Here we propose Registration by Regression (RbR), a novel atlas registration framework that is highly robust and flexible, conceptually simple, and can be trained with cheaply obtained data. RbR predicts the (x,y,z) atlas coordinates for every voxel of the input scan (i.e., every voxel is a keypoint), and then uses closed-form expressions to quickly fit transforms using a wide array of possible deformation models, including affine and nonlinear (e.g., Bspline, Demons, invertible diffeomorphic models, etc.). Robustness is provided by the large number of voxels informing the registration and can be further increased by robust estimators like RANSAC. Experiments on independent public datasets show that RbR yields more accurate registration than competing keypoint approaches, while providing full control of the deformation model.
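Given per-voxel predicted atlas coordinates, the simplest closed-form fit the abstract mentions is an affine transform solved by least squares; a minimal sketch (a robust variant would wrap this in RANSAC, and nonlinear models would replace the linear system):

```python
import numpy as np

def fit_affine(src, dst):
    # src: (N, 3) voxel coordinates; dst: (N, 3) predicted atlas coordinates.
    # Solve dst ~= src @ A.T + t in closed form via least squares.
    X = np.hstack([src, np.ones((len(src), 1))])
    M, *_ = np.linalg.lstsq(X, dst, rcond=None)  # (4, 3) stacked [A.T; t]
    return M[:3].T, M[3]                         # A: (3, 3), t: (3,)
```

Because every voxel contributes an equation, the system is massively overdetermined, which is where the robustness the abstract claims comes from.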
https://arxiv.org/abs/2404.16781
The success of many RL techniques heavily relies on human-engineered dense rewards, which typically demand substantial domain expertise and extensive trial and error. In our work, we propose DrS (Dense reward learning from Stages), a novel approach for learning reusable dense rewards for multi-stage tasks in a data-driven manner. By leveraging the stage structures of the task, DrS learns a high-quality dense reward from sparse rewards and demonstrations if given. The learned rewards can be \textit{reused} in unseen tasks, thus reducing the human effort for reward engineering. Extensive experiments on three physical robot manipulation task families with 1000+ task variants demonstrate that our learned rewards can be reused in unseen tasks, resulting in improved performance and sample efficiency of RL algorithms. The learned rewards even achieve comparable performance to human-engineered rewards on some tasks. See our project page (this https URL) for more details.
https://arxiv.org/abs/2404.16779
Representation-based Siamese networks have risen to popularity in lightweight text matching due to their low deployment and inference costs. While word-level attention mechanisms have been implemented within Siamese networks to improve performance, we propose Feature Attention (FA), a novel downstream block designed to enrich the modeling of dependencies among embedding features. Employing "squeeze-and-excitation" techniques, the FA block dynamically adjusts the emphasis on individual features, enabling the network to concentrate more on features that significantly contribute to the final classification. Building upon FA, we introduce a dynamic "selection" mechanism called Selective Feature Attention (SFA), which leverages a stacked BiGRU Inception structure. The SFA block facilitates multi-scale semantic extraction by traversing different stacked BiGRU layers, encouraging the network to selectively concentrate on semantic information and embedding features across varying levels of abstraction. Both the FA and SFA blocks offer a seamless integration capability with various Siamese networks, showcasing a plug-and-play characteristic. Experimental evaluations conducted across diverse text matching baselines and benchmarks underscore the indispensability of modeling feature attention and the superiority of the "selection" mechanism.
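The "squeeze-and-excitation" recipe behind the FA block pools a per-feature descriptor, passes it through a small bottleneck, and gates each embedding feature with a sigmoid. A minimal NumPy sketch (`w1`/`w2` are the bottleneck weights; this mirrors the generic SE pattern, not the exact FA block):

```python
import numpy as np

def squeeze_excite(x, w1, w2):
    # x: (batch, seq_len, d) token embeddings.
    # Squeeze: average over the sequence to one descriptor per feature.
    z = x.mean(axis=1)                   # (batch, d)
    # Excite: bottleneck MLP + sigmoid yields a gate per embedding feature.
    s = np.maximum(z @ w1, 0.0) @ w2     # (batch, d), w1: (d, r), w2: (r, d)
    gate = 1.0 / (1.0 + np.exp(-s))
    # Re-weight every token's features by the learned gate.
    return x * gate[:, None, :]
```

The gate lies in (0, 1), so features that matter for the final classification are preserved while the rest are attenuated, which is the "emphasis on individual features" the abstract describes.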
https://arxiv.org/abs/2404.16776
Self-supervised contrastive learning has emerged as one of the most successful deep learning paradigms. In this regard, it has seen extensive use in image registration and, more recently, in the particular field of medical image registration. In this work, we test, extend, and improve ConKeD, a state-of-the-art framework for color fundus image registration. Using the ConKeD framework, we test multiple loss functions, adapting them to the framework and the application domain. Furthermore, we evaluate our models using the standardized benchmark dataset FIRE as well as several datasets that have never been used before for color fundus registration, for which we release the pairing data as well as a standardized evaluation approach. Our work demonstrates state-of-the-art performance across all datasets and metrics, with several advantages over current SOTA color fundus registration methods.
https://arxiv.org/abs/2404.16773
Existing definitions and associated conceptual frameworks for computer-based system safety should be revisited in light of real-world experiences from deploying autonomous vehicles. Current terminology used by industry safety standards emphasizes mitigation of risk from specifically identified hazards, and carries assumptions based on human-supervised vehicle operation. Operation without a human driver dramatically increases the scope of safety concerns, especially due to operation in an open-world environment, a requirement to self-enforce operational limits, participation in an ad hoc sociotechnical system of systems, and a requirement to conform to both legal and ethical constraints. Existing standards and terminology only partially address these new challenges. We propose updated definitions for core system safety concepts that encompass these additional considerations as a starting point for evolving safety approaches to address these additional safety challenges. These results might additionally inform the framing of safety terminology for other autonomous system applications.
https://arxiv.org/abs/2404.16768
While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g. value networks, clipping) and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative rewards via a direct policy parameterization between two completions to a prompt, enabling strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. REBEL can also cleanly incorporate offline data and handle the intransitive preferences we frequently see in practice. Empirically, we find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO, all while being simpler to implement and more computationally tractable than PPO.
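The regression at the heart of REBEL fits the scaled log-probability-ratio gap between two completions of a prompt to their reward gap. A simplified per-pair sketch of that squared-error objective (`eta` is the scaling hyperparameter; batching and the iterative policy updates are omitted):

```python
import numpy as np

def rebel_loss(logp_1, logp_2, logp_ref_1, logp_ref_2, r_1, r_2, eta=1.0):
    # Regress (1/eta) * (log-ratio of completion 1 - log-ratio of completion 2)
    # onto the relative reward r_1 - r_2, with a squared-error loss.
    pred = (1.0 / eta) * ((logp_1 - logp_ref_1) - (logp_2 - logp_ref_2))
    target = r_1 - r_2
    return np.mean((pred - target) ** 2)
```

Only policy log-probabilities and rewards appear, so no value network or clipping heuristics are needed, which is the "minimalist" point of the abstract.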
https://arxiv.org/abs/2404.16767
While supervised fine-tuning (SFT) has been a straightforward approach for tailoring the output of a foundation large language model (LLM) to specific preferences, concerns have been raised about the depth of this alignment, with some critiques suggesting it is merely "superficial". We critically examine this hypothesis within the scope of cross-lingual generation tasks, proposing that the effectiveness of SFT may be constrained by its reliance on prior tokens to guide cross-lingual generation. Based on this crucial insight, and in response to the challenges posed by the costly and limited availability of non-English data for SFT, we introduce a novel training-free alignment method named PreTTY, which employs minimal task-related prior tokens to bridge the foundation LLM and the SFT LLM, achieving comparable performance without training. Experiments on machine translation and part-of-speech tagging across eight languages demonstrate the efficacy of PreTTY in cross-lingual settings. Remarkably, by initiating the decoding process with only one or two prior tokens, foundation LLMs can achieve performance comparable to their SFT counterparts. This method presents a cost-effective alternative to SFT and advances the democratization of multilingual LLMs.
https://arxiv.org/abs/2404.16766
We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy. The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust performance. With such methods, we observe a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an approximate camera projection model. We quantify the error induced by current camera models and show that fitting 2D keypoints and p-GT accurately causes incorrect 3D poses. Our analysis defines the invalid distances within which minimizing 2D and p-GT losses is detrimental. We use this to formulate a new loss Threshold-Adaptive Loss Scaling (TALS) that penalizes gross 2D and p-GT losses but not smaller ones. With such a loss, there are many 3D poses that could equally explain the 2D evidence. To reduce this ambiguity we need a prior over valid human poses but such priors can introduce unwanted bias. To address this, we exploit a tokenized representation of human pose and reformulate the problem as token prediction. This restricts the estimated poses to the space of valid poses, effectively providing a uniform prior. Extensive experiments on the EMDB and 3DPW datasets show that our reformulated keypoint loss and tokenization allows us to train on in-the-wild data while improving 3D accuracy over the state-of-the-art. Our models and code are available for research at this https URL.
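One plausible reading of a threshold-adaptive loss of this kind: residuals inside a threshold incur no penalty, while only the excess beyond it is penalized. The sketch below is an illustrative guess at that shape, not the paper's exact TALS formulation (`tau` is a hypothetical threshold parameter):

```python
import numpy as np

def thresholded_keypoint_loss(residuals, tau):
    # Gross 2D / p-GT errors are penalized; residuals within the
    # "invalid distance" threshold contribute nothing, so the model
    # is not forced to overfit biased supervision.
    excess = np.maximum(np.abs(residuals) - tau, 0.0)
    return np.mean(excess ** 2)
```

With such a loss many 3D poses explain the 2D evidence equally well, which is exactly why the abstract then introduces the tokenized pose space as a uniform prior over valid poses.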
https://arxiv.org/abs/2404.16752
This paper addresses the task of 3D clothed human generation from textual descriptions. Previous works usually encode the human body and clothes as a holistic model and generate the whole model in a single-stage optimization, which makes clothing editing difficult and meanwhile forfeits fine-grained control over the whole generation process. To solve this, we propose a layer-wise clothed human representation combined with a progressive optimization strategy, which produces clothing-disentangled 3D human models while providing control capacity for the generation process. The basic idea is progressively generating a minimal-clothed human body and layer-wise clothes. During clothing generation, a novel stratified compositional rendering method is proposed to fuse multi-layer human models, and a new loss function is utilized to help decouple the clothing model from the human body. The proposed method achieves high-quality disentanglement, which thereby provides an effective way for 3D garment generation. Extensive experiments demonstrate that our approach achieves state-of-the-art 3D clothed human generation while also supporting cloth editing applications such as virtual try-on. Project page: this http URL
https://arxiv.org/abs/2404.16748
Word error rate (WER) is a metric used to evaluate the quality of transcriptions produced by Automatic Speech Recognition (ASR) systems. In many applications, it is of interest to estimate WER given a pair of a speech utterance and a transcript. Previous work on WER estimation focused on building models that are trained with a specific ASR system in mind (referred to as ASR system-dependent). These are also domain-dependent and inflexible in real-world applications. In this paper, a hypothesis generation method for ASR System-Independent WER estimation (SIWE) is proposed. In contrast to prior work, the WER estimators are trained using data that simulates ASR system output. Hypotheses are generated using phonetically similar or linguistically more likely alternative words. In WER estimation experiments, the proposed method reaches a similar performance to ASR system-dependent WER estimators on in-domain data and achieves state-of-the-art performance on out-of-domain data. On the out-of-domain data, the SIWE model outperformed the baseline estimators in root mean square error and Pearson correlation coefficient by relative 17.58% and 18.21%, respectively, on Switchboard and CALLHOME. The performance was further improved when the WER of the training set was close to the WER of the evaluation dataset.
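WER itself, the quantity being estimated here, is the word-level Levenshtein distance between hypothesis and reference, normalized by the reference length; a self-contained sketch:

```python
def wer(ref, hyp):
    # Word error rate: word-level edit distance (substitutions,
    # insertions, deletions) divided by the reference length.
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```

A WER *estimator*, as in this paper, predicts this quantity from the utterance/transcript pair alone, without access to a reference.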
https://arxiv.org/abs/2404.16743
Cancelable biometrics is a challenging research field in which the security of an original biometric image is ensured by transforming it into another, irreversible domain. Several approaches have been suggested in the literature for generating cancelable biometric templates. In this paper, two novel and simple cancelable biometric template generation methods based on Random Walk (CBRW) are proposed. By employing a random walk and the other steps given in the two proposed algorithms, viz. CBRW-BitXOR and CBRW-BitCMP, the original biometric is transformed into a cancelable template. The performance of the proposed methods is compared with other state-of-the-art methods. Experiments have been performed on eight publicly available gray and color datasets, i.e., CP (ear) (gray and color), UTIRIS (iris) (gray and color), ORL (face) (gray), IIT Delhi (iris) (gray and color), and AR (face) (color). Performance of the generated templates is measured in terms of Correlation Coefficient (Cr), Root Mean Square Error (RMSE), Peak Signal to Noise Ratio (PSNR), Structural Similarity (SSIM), Mean Absolute Error (MAE), Number of Pixel Change Rate (NPCR), and Unified Average Changing Intensity (UACI). Experimental results show that the proposed methods are superior to other state-of-the-art methods in both qualitative and quantitative analysis. Furthermore, CBRW performs better on both gray and color images.
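As an illustration of the BitXOR flavor, one can build a key image from an accumulated 2-D random walk and XOR it with the biometric image; the walk and steps below are a simplified stand-in for the paper's CBRW-BitXOR algorithm, not a reproduction of it:

```python
import numpy as np

def cbrw_bitxor(img, seed=0):
    # Build a key image by toggling cells along a seeded 2-D random walk,
    # then XOR it with the biometric image to get a revocable template.
    rng = np.random.default_rng(seed)
    h, w = img.shape
    key = np.zeros((h, w), dtype=np.uint8)
    y, x = h // 2, w // 2
    for _ in range(h * w):
        key[y, x] ^= 255
        dy, dx = rng.integers(-1, 2, size=2)  # step in {-1, 0, 1}
        y, x = (y + dy) % h, (x + dx) % w
    return img.astype(np.uint8) ^ key
```

Re-applying the transform with the same seed inverts it (XOR is an involution), while issuing a new seed yields a fresh template, which is the "cancelable" property.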
https://arxiv.org/abs/2404.16739
This paper presents a novel learning approach for the Dubins Traveling Salesman Problem with Neighborhoods (DTSPN) that quickly produces a tour for a non-holonomic vehicle passing through the neighborhoods of given task points. The method involves two learning phases: initially, a model-free reinforcement learning approach leverages privileged information to distill knowledge from expert trajectories generated by the Lin-Kernighan heuristic (LKH) algorithm. Subsequently, a supervised learning phase trains an adaptation network to solve problems independently of privileged information. Before the first learning phase, a parameter initialization technique using the demonstration data was also devised to enhance training efficiency. The proposed learning method produces a solution about 50 times faster than LKH and substantially outperforms other imitation learning and RL-with-demonstration schemes, most of which fail to sense all the task points.
https://arxiv.org/abs/2404.16721