As a significant step for human face modeling, editing, and generation, face landmarking aims at extracting facial keypoints from images. A generalizable face landmarker is required in practice because real-world facial images, e.g., the avatars in animations and games, are often stylized in various ways. However, achieving generalizable face landmarking is challenging due to the diversity of facial styles and the scarcity of labeled stylized faces. In this study, we propose a simple but effective paradigm to learn a generalizable face landmarker based on labeled real human faces and unlabeled stylized faces. Our method learns the face landmarker as the key module of a conditional face warper. Given a pair of real and stylized facial images, the conditional face warper predicts a warping field from the real face to the stylized one, in which the face landmarker predicts the ending points of the warping field and provides us with high-quality pseudo landmarks for the corresponding stylized facial images. Applying an alternating optimization strategy, we learn the face landmarker to minimize $i)$ the discrepancy between the stylized faces and the warped real ones and $ii)$ the prediction errors of both real and pseudo landmarks. Experiments on various datasets show that our method outperforms existing state-of-the-art domain adaptation methods in face landmarking tasks, leading to a face landmarker with better generalizability. Code is available at this https URL.
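The pseudo-labeling idea can be sketched in a few lines. The affine `warp_field` below is a hypothetical stand-in for the learned conditional warper (which predicts a dense field from an image pair); the point is that the ending points of the warping field, evaluated at the real-face landmarks, double as pseudo landmarks for the unlabeled stylized face.

```python
def warp_field(point, scale=1.2, shift=(5.0, -3.0)):
    """Hypothetical stand-in for the learned conditional warper: an affine map
    from a real-face coordinate to its ending point on the stylized face."""
    x, y = point
    return (scale * x + shift[0], scale * y + shift[1])

def pseudo_landmarks(real_landmarks, field=warp_field):
    """Ending points of the warping field serve as pseudo labels
    for the corresponding stylized image."""
    return [field(p) for p in real_landmarks]

real = [(10.0, 20.0), (30.0, 40.0)]      # labeled real-face landmarks
pseudo = pseudo_landmarks(real)          # labels for the stylized face
```

In the full method these pseudo landmarks feed the landmark-prediction loss, while the warped real image is compared against the stylized one; the two objectives are minimized alternately.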
https://arxiv.org/abs/2404.12322
Federated Learning (FL) has emerged as a promising solution for collaborative training of large language models (LLMs). However, the integration of LLMs into FL introduces new challenges, particularly concerning the evaluation of LLMs. Traditional evaluation methods that rely on labeled test sets and similarity-based metrics cover only a subset of the acceptable answers, thereby failing to accurately reflect the performance of LLMs on generative tasks. Meanwhile, although automatic evaluation methods that leverage advanced LLMs show potential, they face critical risks of data leakage, due to the need to transmit data to external servers, and suboptimal performance on downstream tasks, due to the lack of domain knowledge. To address these issues, we propose a Federated Evaluation framework of Large Language Models, named FedEval-LLM, that provides reliable performance measurements of LLMs on downstream tasks without relying on labeled test sets and external tools, thus ensuring strong privacy-preserving capability. FedEval-LLM leverages a consortium of personalized LLMs from participants as referees to provide domain knowledge and collective evaluation capability, thus aligning with the respective downstream tasks and mitigating the uncertainties and biases associated with a single referee. Experimental results demonstrate a significant improvement in the evaluation capability of personalized evaluation models on downstream tasks. When applied to FL, these evaluation models exhibit strong agreement with human preference and with the RougeL score on meticulously curated test sets. FedEval-LLM effectively overcomes the limitations of traditional metrics and the reliance on external services, making it a promising framework for the evaluation of LLMs within collaborative training scenarios.
https://arxiv.org/abs/2404.12273
Due to the cumbersome nature of human evaluation and limitations of code-based evaluation, Large Language Models (LLMs) are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators simply inherit all the problems of the LLMs they evaluate, requiring further human validation. We present a mixed-initiative approach to ``validate the validators'' -- aligning LLM-generated evaluation functions (be they prompts or code) with human requirements. Our interface, EvalGen, provides automated assistance to users in generating evaluation criteria and implementing assertions. While generating candidate implementations (Python functions, LLM grader prompts), EvalGen asks humans to grade a subset of LLM outputs; this feedback is used to select implementations that better align with user grades. A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative nature of alignment. In particular, we identify a phenomenon we dub \emph{criteria drift}: users need criteria to grade outputs, but grading outputs helps users define criteria. What is more, some criteria appear \emph{dependent} on the specific LLM outputs observed (rather than being independent criteria that can be defined \emph{a priori}), raising serious questions for approaches that assume the independence of evaluation from observation of model outputs. We present our interface and implementation details, a comparison of our algorithm with a baseline approach, and implications for the design of future LLM evaluation assistants.
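The selection loop the abstract describes, grading a subset of outputs and keeping the candidate implementation that best matches the grades, can be sketched as follows (the candidate functions and graded data are hypothetical; in EvalGen the candidates are Python assertions or LLM-grader prompts, and the same agreement criterion applies to both):

```python
def alignment(candidate, graded):
    """Fraction of human-graded outputs on which a candidate evaluator
    (a function: llm_output -> bool) agrees with the human grade."""
    return sum(candidate(out) == grade for out, grade in graded) / len(graded)

def select_best(candidates, graded):
    """Keep the implementation that best aligns with user grades."""
    return max(candidates, key=lambda c: alignment(c, graded))

# Hypothetical grades on a subset of LLM outputs: (output, human_thumbs_up)
graded = [("short", False), ("a detailed answer", True), ("ok", False)]
length_check = lambda out: len(out) > 5   # candidate 1: a code assertion
always_pass = lambda out: True            # candidate 2: a trivial grader
best = select_best([length_check, always_pass], graded)
```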
https://arxiv.org/abs/2404.12272
Extracting structured event knowledge, including event triggers and corresponding arguments, from military texts is fundamental to many applications, such as intelligence analysis and decision assistance. However, event extraction in the military field faces a data scarcity problem, which impedes research on event extraction models in this domain. To alleviate this problem, we propose CMNEE, a large-scale, document-level open-source Chinese Military News Event Extraction dataset. It contains 17,000 documents and 29,223 events, all manually annotated based on a pre-defined schema for the military domain comprising 8 event types and 11 argument role types. We designed a two-stage, multi-turn annotation strategy to ensure the quality of CMNEE and reproduced several state-of-the-art event extraction models with a systematic evaluation. Experimental results on CMNEE are notably lower than those on datasets from other domains, which demonstrates that event extraction for the military domain poses unique challenges and requires further research efforts. Our code and data can be obtained from this https URL.
https://arxiv.org/abs/2404.12242
Face Image Quality Assessment (FIQA) estimates the utility of face images for automated face recognition (FR) systems. In this work, we propose a novel approach to assess the quality of face images based on inspecting the changes required in the pre-trained FR model weights to minimize differences between testing samples and the distribution of the FR training dataset. To achieve that, we propose quantifying the discrepancy in Batch Normalization statistics (BNS), including mean and variance, between those recorded during FR training and those obtained by processing testing samples through the pre-trained FR model. We then generate gradient magnitudes of pre-trained FR weights by backpropagating the BNS discrepancy through the pre-trained model. The cumulative absolute sum of these gradient magnitudes serves as the FIQ in our approach. Through comprehensive experimentation, we demonstrate the effectiveness of our training-free and quality-labeling-free approach, achieving performance competitive with recent state-of-the-art FIQA approaches without relying on quality labeling, trained regression networks, specialized architectures, or the design and optimization of specific loss functions.
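A single-weight toy version of this scoring rule looks like the following. This is a sketch under strong simplifying assumptions (one BN channel, a linear one-weight "network", a hand-derived gradient instead of backpropagation), not the paper's pipeline:

```python
def bn_stats(w, xs):
    """Batch-norm statistics of activations a_i = w * x_i for one channel."""
    acts = [w * x for x in xs]
    mu = sum(acts) / len(acts)
    var = sum((a - mu) ** 2 for a in acts) / len(acts)
    return mu, var

def fiq_score(w, xs, mu_ref, var_ref):
    """Absolute gradient, w.r.t. the frozen weight w, of the discrepancy
    between test-time BN statistics and those recorded during FR training."""
    mx = sum(xs) / len(xs)
    vx = sum((x - mx) ** 2 for x in xs) / len(xs)
    mu, var = bn_stats(w, xs)              # here mu = w*mx and var = w**2*vx
    # d/dw [ (mu - mu_ref)**2 + (var - var_ref)**2 ], derived by hand
    grad = 2 * (mu - mu_ref) * mx + 2 * (var - var_ref) * 2 * w * vx
    return abs(grad)

in_dist = fiq_score(1.0, [0.0, 2.0], mu_ref=1.0, var_ref=1.0)   # stats match
shifted = fiq_score(1.0, [3.0, 5.0], mu_ref=1.0, var_ref=1.0)   # stats differ
```

A sample whose statistics match the training distribution needs no weight change (score 0 here), while a distribution-shifted sample yields a larger gradient magnitude; how the raw score maps to a final quality value is left open in this sketch.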
https://arxiv.org/abs/2404.12203
Instruction fine-tuning pretrained LLMs for diverse downstream tasks has demonstrated remarkable success and has captured the interest of both academics and practitioners. To ensure such fine-tuned LLMs align with human preferences, techniques such as RLHF and DPO have emerged. At the same time, there is increasing interest in smaller parameter counts for models. In this work, using OpenLLaMA 3Bv2 as a base model, we describe the recipe used to fine-tune the OpenBezoar family of models. In this recipe: We first generate synthetic instruction fine-tuning data using an open and commercially non-restrictive instruction fine-tuned variant of the Falcon-40B model under three schemes based on: LaMini-LM, WizardLM/Evol-Instruct (with databricks-dolly-15k as a seed dataset) and Orca (with the Flan Collection as a seed dataset), then filter these generations using GPT-4 as a human proxy. We then perform cost-effective QLoRA-based supervised fine-tuning sequentially with each scheme. The resulting checkpoint is further fine-tuned with a subset of the HH-RLHF dataset to minimize distribution shift prior to using the DPO loss to obtain the final checkpoint. Evaluation is done with the LM Eval Harness tasks/metrics as well as on MT-Bench using the "LLM-as-a-judge" framework with Claude 2.1, with the finding that the final checkpoint, "OpenBezoar-HH-RLHF-DPO", demonstrates superior performance over many models at the 3B parameter scale, even outperforming the top model in one of the categories on the Huggingface Open LLM Leaderboard. We release "OpenBezoar-SFT", "OpenBezoar-HH-RLHF-SFT", "OpenBezoar-HH-RLHF-DPO" checkpoints, alongside our generated datasets on HuggingFace at this https URL and our codebase at this https URL.
https://arxiv.org/abs/2404.12195
The burgeoning landscape of text-to-image models, exemplified by innovations such as Midjourney and DALLE 3, has revolutionized content creation across diverse sectors. However, these advancements bring forth critical ethical concerns, particularly with the misuse of open-source models to generate content that violates societal norms. Addressing this, we introduce Ethical-Lens, a framework designed to facilitate the value-aligned usage of text-to-image tools without necessitating internal model revision. Ethical-Lens ensures value alignment in text-to-image models across toxicity and bias dimensions by refining user commands and rectifying model outputs. Systematic evaluation metrics, combining GPT4-V, HEIM, and FairFace scores, assess alignment capability. Our experiments reveal that Ethical-Lens enhances alignment capabilities to levels comparable with or superior to commercial models like DALLE 3, ensuring user-generated content adheres to ethical standards while maintaining image quality. This study indicates the potential of Ethical-Lens to ensure the sustainable development of open-source text-to-image tools and their beneficial integration into society. Our code is available at this https URL.
https://arxiv.org/abs/2404.12104
Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning task, demanding intelligent systems to accurately respond to natural language queries based on audio-video input pairs. Nevertheless, prevalent AVQA approaches are prone to overlearning dataset biases, resulting in poor robustness. Furthermore, current datasets may not provide a precise diagnostic for these methods. To tackle these challenges, firstly, we propose a novel dataset, \textit{MUSIC-AVQA-R}, crafted in two steps: rephrasing questions within the test split of a public dataset (\textit{MUSIC-AVQA}) and subsequently introducing distribution shifts to split questions. The former leads to a large, diverse test space, while the latter results in a comprehensive robustness evaluation on rare, frequent, and overall questions. Secondly, we propose a robust architecture that utilizes a multifaceted cycle collaborative debiasing strategy to overcome bias learning. Experimental results show that this architecture achieves state-of-the-art performance on both datasets, especially obtaining a significant improvement of 9.68\% on the proposed dataset. Extensive ablation experiments are conducted on these two datasets to validate the effectiveness of the debiasing strategy. Additionally, we highlight the limited robustness of existing multi-modal QA methods through the evaluation on our dataset.
https://arxiv.org/abs/2404.12020
This work introduces a cooperative inspection system designed to efficiently control and coordinate a team of distributed heterogeneous UAV agents for the inspection of 3D structures in cluttered, unknown spaces. Our proposed approach employs a two-stage innovative methodology. Initially, it leverages the complementary sensing capabilities of the robots to cooperatively map the unknown environment. It then generates optimized, collision-free inspection paths, thereby ensuring comprehensive coverage of the structure's surface area. The effectiveness of our system is demonstrated through qualitative and quantitative results from extensive Gazebo-based simulations that closely replicate real-world inspection scenarios, highlighting its ability to thoroughly inspect real-world-like 3D structures.
https://arxiv.org/abs/2404.12018
Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, the generation of these responses occurs at the token level, in a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO in the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at this https URL.
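A heavily simplified, hypothetical version of such a token-level objective is sketched below: a DPO-style preference margin built from per-token log-probabilities, plus a per-token forward-KL penalty against the reference model. The constants `beta` and `alpha` and the exact placement of the KL term are assumptions made here for illustration; consult the paper for the actual TDPO loss.

```python
import math

def seq_logp(token_logps):
    """Sequence log-probability as the sum of per-token log-probabilities."""
    return sum(token_logps)

def forward_kl(ref_dist, pol_dist):
    """Forward KL(ref || policy) for a single token position."""
    return sum(r * math.log(r / p) for r, p in zip(ref_dist, pol_dist) if r > 0)

def token_level_dpo_loss(pol_w, ref_w, pol_l, ref_l, kl_w, beta=0.1, alpha=0.5):
    """Preference margin (winner vs. loser, policy vs. reference) minus a
    token-summed forward-KL penalty, fed through -log(sigmoid(.))."""
    margin = beta * ((seq_logp(pol_w) - seq_logp(ref_w))
                     - (seq_logp(pol_l) - seq_logp(ref_l)))
    penalty = alpha * sum(kl_w)            # summed over response tokens
    z = margin - penalty
    return -math.log(1.0 / (1.0 + math.exp(-z)))
```

With zero margin and zero KL the loss is log 2, and increasing the preferred-response margin lowers it, as with standard DPO.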
https://arxiv.org/abs/2404.11999
Accurate traffic forecasting is essential for effective urban planning and congestion management. Deep learning (DL) approaches have achieved great success in traffic forecasting but still face challenges in capturing the intricacies of traffic dynamics. In this paper, we identify and address these challenges by emphasizing that spatial features are inherently dynamic and change over time. A novel in-depth feature representation, called Dynamic Spatio-Temporal (Dyn-ST) features, is introduced, which encapsulates spatial characteristics across varying times. Moreover, a Dynamic Spatio-Temporal Graph Transformer Network (DST-GTN) is proposed to capture Dyn-ST features and other dynamic adjacency relations between intersections. The DST-GTN can model dynamic ST relationships between nodes accurately and refine the representation of global and local ST characteristics by adopting adaptive weights in low-pass and all-pass filters, enabling the extraction of Dyn-ST features from traffic time-series data. Through numerical experiments on public datasets, the DST-GTN achieves state-of-the-art performance for a range of traffic forecasting tasks and demonstrates enhanced stability.
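One way to picture the filtering step is the toy interpretation below (a 3-node road graph, not the released architecture): blend a low-pass graph filter (neighbour averaging, emphasizing global structure) with an all-pass filter (identity, preserving local detail) using adaptive weights.

```python
def low_pass(x, neighbours):
    """Average each node's value with its neighbours' (graph smoothing)."""
    out = []
    for i, xi in enumerate(x):
        vals = [xi] + [x[j] for j in neighbours[i]]
        out.append(sum(vals) / len(vals))
    return out

def filtered(x, neighbours, w_lp, w_ap):
    """Adaptive blend of a low-pass and an all-pass (identity) filter."""
    lp = low_pass(x, neighbours)
    return [w_lp * l + w_ap * xi for l, xi in zip(lp, x)]

x = [0.0, 2.0, 4.0]                  # a signal at 3 intersections
nbrs = {0: [1], 1: [0, 2], 2: [1]}   # a path graph
y = filtered(x, nbrs, w_lp=0.5, w_ap=0.5)
```

In the real model the weights are learned per node and time step; here they are fixed scalars to keep the sketch minimal.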
https://arxiv.org/abs/2404.11996
Events refer to specific occurrences, incidents, or happenings that take place under a particular background. Event reasoning aims to infer events according to certain relations and predict future events. The cutting-edge techniques for event reasoning play a crucial role in various natural language processing applications. Large language models (LLMs) have made significant advancements in event reasoning owing to their wealth of knowledge and reasoning capabilities. However, smaller instruction-tuned models currently in use do not consistently demonstrate exceptional proficiency in managing these tasks. This discrepancy arises from the absence of explicit modeling of events and their interconnections within the instruction data. Consequently, these models face challenges in comprehending event structures and semantics while struggling to bridge the gap between their interpretations and human understanding of events. Additionally, their limitations in grasping event relations lead to constrained event reasoning abilities to effectively deduce and incorporate pertinent event knowledge. In this paper, we propose Event-Oriented Instruction Tuning (EvIT) to train our LLM. Specifically, we first propose a novel structure named the event quadruple, which captures the structure and semantics of events and provides a complete event representation. We then design event-relation learning based on these structures. We encapsulate the learning into the instruction-tuning formulation to better stimulate the event reasoning capacity of our model. We design a heuristic unsupervised method to mine event quadruples from a large-scale corpus. At last, we fine-tune a Llama model with our Event-Oriented Instruction Tuning. We conduct extensive experiments on event reasoning tasks on several datasets. Automatic and human evaluations demonstrate EvIT achieves competitive performances on event reasoning.
https://arxiv.org/abs/2404.11978
In spoken languages, utterances are often shaped to be incomplete or vague for efficiency. This can lead to varying interpretations of the same input, based on different assumptions about the context. To ensure reliable user-model interactions in such scenarios, it is crucial for models to adeptly handle the inherent ambiguity in user queries. However, conversational agents built upon even the most recent large language models (LLMs) face challenges in processing ambiguous inputs, primarily due to the following two hurdles: (1) LLMs are not directly trained to handle inputs that are too ambiguous to be properly managed; (2) the degree of ambiguity in an input can vary according to the intrinsic knowledge of the LLMs, which is difficult to investigate. To address these issues, this paper proposes a method to align LLMs to explicitly handle ambiguous inputs. Specifically, we introduce a proxy task that guides LLMs to utilize their intrinsic knowledge to self-disambiguate a given input. We quantify the information gain from the disambiguation procedure as a measure of the extent to which the models perceive their inputs as ambiguous. This measure serves as a cue for selecting samples deemed ambiguous from the models' perspectives, which are then utilized for alignment. Experimental results from several question-answering datasets demonstrate that the LLMs fine-tuned with our approach are capable of handling ambiguous inputs while still performing competitively on clear questions within the task.
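The ambiguity cue can be illustrated with an entropy-based information gain over a small answer set. The concrete measure below is an illustrative assumption, not necessarily the paper's exact formulation: if self-disambiguation sharply sharpens the model's answer distribution, the input was likely ambiguous to the model.

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a discrete answer distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def information_gain(p_before, p_after):
    """Drop in entropy after the model self-disambiguates its input;
    a large gain suggests the model found the input ambiguous."""
    return entropy(p_before) - entropy(p_after)

clear_q = information_gain([0.95, 0.05], [0.95, 0.05])   # nothing changes
ambiguous_q = information_gain([0.5, 0.5], [0.9, 0.1])   # sharpens a lot
```

Samples with high gain are the ones selected as "ambiguous from the model's perspective" for alignment.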
https://arxiv.org/abs/2404.11972
On public roads, autonomous vehicles (AVs) face the challenge of frequent interactions with human-driven vehicles (HDVs), whose driving behavior is uncertain due to varying social characteristics among humans. To effectively assess the risks prevailing in the vicinity of AVs in social interactive traffic scenarios and achieve safe autonomous driving, this article proposes a social-suitable and safety-sensitive trajectory planning (S4TP) framework. Specifically, S4TP integrates the Social-Aware Trajectory Prediction (SATP) and Social-Aware Driving Risk Field (SADRF) modules. SATP utilizes Transformers to effectively encode the driving scene and incorporates the AV's planned trajectory during the prediction decoding process. SADRF assesses the expected surrounding risk degrees during AV-HDV interactions, each HDV having different social characteristics, visualized as two-dimensional heat maps centered on the AV. SADRF models the driving intentions of the surrounding HDVs and predicts trajectories based on the representation of vehicular interactions. S4TP employs an optimization-based approach for motion planning, utilizing the predicted HDVs' trajectories as input. With the integration of SADRF, S4TP executes real-time online optimization of the AV's planned trajectory within low-risk regions, thus improving the safety and interpretability of the planned trajectory. We have conducted comprehensive tests of the proposed method using the SMARTS simulator. Experimental results in complex social scenarios, such as unprotected left-turn intersections, merging, cruising, and overtaking, validate the superiority of our proposed S4TP in terms of safety and rationality. S4TP achieves a pass rate of 100% across all scenarios, surpassing the current state-of-the-art methods Fanta (98.25%) and Predictive-Decision (94.75%).
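A two-dimensional risk heat map of the kind SADRF visualizes can be pictured with a toy field. The Gaussian form, the `sigma` scale, and the scalar "aggressiveness" social factor are illustrative assumptions made here, not the paper's model:

```python
import math

def risk_at(point, vehicle, aggressiveness=1.0, sigma=5.0):
    """Gaussian risk contribution of one surrounding vehicle at a grid point,
    scaled by a social 'aggressiveness' factor."""
    dx, dy = point[0] - vehicle[0], point[1] - vehicle[1]
    return aggressiveness * math.exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2))

def total_risk(point, vehicles):
    """Sum over surrounding HDVs, each with its own social factor."""
    return sum(risk_at(point, pos, agg) for pos, agg in vehicles)

# Two HDVs near the AV at the origin; the aggressive one dominates the field.
hdvs = [((10.0, 0.0), 1.5), ((0.0, 20.0), 0.5)]
risk_here = total_risk((0.0, 0.0), hdvs)
```

Evaluating `total_risk` on a grid around the AV yields the heat map; the planner would then keep the optimized trajectory inside low-risk regions.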
https://arxiv.org/abs/2404.11946
Multilingual proficiency presents a significant challenge for large language models (LLMs). English-centric models are usually suboptimal in other languages, particularly those that are linguistically distant from English. This performance discrepancy mainly stems from the imbalanced distribution of training data across languages during pre-training and instruction tuning stages. To address this problem, we propose a novel approach called CrossIn, which utilizes a mixed composition of cross-lingual instruction tuning data. Our method leverages the compressed representation shared by various languages to efficiently enhance the model's task-solving capabilities and multilingual proficiency within a single process. In addition, we introduce a multi-task and multi-faceted benchmark to evaluate the effectiveness of CrossIn. Experimental results demonstrate that our method substantially improves performance across tasks and languages, and we provide extensive insights into the impact of cross-lingual data volume and the integration of translation data on enhancing multilingual consistency and accuracy.
https://arxiv.org/abs/2404.11932
Precise image editing with text-to-image models has attracted increasing interest due to their remarkable generative capabilities and user-friendly nature. However, such attempts face the pivotal challenge of misalignment between the intended precise editing target regions and the broader area impacted by the guidance in practice. Despite excellent methods leveraging attention mechanisms that have been developed to refine the editing guidance, these approaches necessitate modifications through complex network architecture and are limited to specific editing tasks. In this work, we re-examine the diffusion process and misalignment problem from a frequency perspective, revealing that, due to the power law of natural images and the decaying noise schedule, the denoising network primarily recovers low-frequency image components during the earlier timesteps and thus brings excessive low-frequency signals for editing. Leveraging this insight, we introduce a novel fine-tuning free approach that employs progressive $\textbf{Fre}$qu$\textbf{e}$ncy truncation to refine the guidance of $\textbf{Diff}$usion models for universal editing tasks ($\textbf{FreeDiff}$). Our method achieves comparable results with state-of-the-art methods across a variety of editing tasks and on a diverse set of images, highlighting its potential as a versatile tool in image editing applications.
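The core operation, discarding high-frequency components of the guidance, reduces in one dimension to truncating a discrete Fourier spectrum. The sketch below is a stand-in: the paper works on 2-D images with a progressive, timestep-dependent cutoff, whereas here a naive DFT and a fixed cutoff `keep` are used for clarity.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real signal."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def idft(X):
    """Inverse DFT, returning the real part of each sample."""
    n = len(X)
    return [sum(X[f] * cmath.exp(2j * cmath.pi * f * t / n)
                for f in range(n)).real / n for t in range(n)]

def truncate_high_freq(x, keep):
    """Zero out all but the `keep` lowest frequencies (both spectrum ends)."""
    X = dft(x)
    n = len(X)
    kept = [Xf if min(f, n - f) < keep else 0 for f, Xf in enumerate(X)]
    return idft(kept)

sig = [0.0, 1.0, 0.0, 1.0]                 # DC plus the highest frequency
smooth = truncate_high_freq(sig, keep=1)   # keep only the DC component
```

Truncating the alternating signal to its DC component leaves the constant 0.5, illustrating how early-timestep guidance is stripped of excessive low-frequency... rather, how the unwanted band is removed while the rest of the signal passes through unchanged.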
https://arxiv.org/abs/2404.11895
We focus on a very challenging task: imaging nighttime dynamic scenes. Most previous methods rely on the low-light enhancement of a conventional RGB camera. However, they would inevitably face a dilemma between the long exposure time required at night and the motion blur of dynamic scenes. Event cameras react to dynamic changes with higher temporal resolution (microseconds) and higher dynamic range (120 dB), offering an alternative solution. In this work, we present a novel nighttime dynamic imaging method with an event camera. Specifically, we discover that events at nighttime exhibit temporal trailing characteristics and a spatially non-stationary distribution. Consequently, we propose a nighttime event reconstruction network (NER-Net), which mainly includes a learnable event timestamps calibration module (LETC) to align the temporally trailing events and a non-uniform illumination aware module (NIAM) to stabilize the spatiotemporal distribution of events. Moreover, we construct a paired real low-light event dataset (RLED) through a co-axial imaging system, including 64,200 spatially and temporally aligned image GTs and low-light events. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art methods in terms of visual quality and generalization ability on real-world nighttime datasets. The project is available at this https URL.
https://arxiv.org/abs/2404.11884
Medical imaging has been used for diagnosis of various conditions, making it one of the most powerful resources for effective patient care. Due to widespread availability, low cost, and low radiation, chest X-ray is one of the most sought after radiology examination for the diagnosis of various thoracic diseases. Due to advancements in medical imaging technologies and increasing patient load, current radiology workflow faces various challenges including increasing backlogs, working long hours, and increase in diagnostic errors. An automated computer-aided diagnosis system that can interpret chest X-rays to augment radiologists by providing actionable insights has potential to provide second opinion to radiologists, highlight relevant regions in the image, in turn expediting clinical workflow, reducing diagnostic errors, and improving patient care. In this study, we applied a novel architecture augmenting the DenseNet121 Convolutional Neural Network (CNN) with multi-head self-attention mechanism using transformer, namely SA-DenseNet121, that can identify multiple thoracic diseases in chest X-rays. We conducted experiments on four of the largest chest X-ray datasets, namely, ChestX-ray14, CheXpert, MIMIC-CXR-JPG, and IU-CXR. Experimental results in terms of area under the receiver operating characteristics (AUC-ROC) shows that augmenting CNN with self-attention has potential in diagnosing different thoracic diseases from chest X-rays. The proposed methodology has the potential to support the reading workflow, improve efficiency, and reduce diagnostic errors.
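A minimal sketch of the core idea, augmenting CNN feature maps with multi-head self-attention: flatten the spatial grid of a DenseNet-style feature map into tokens and let every spatial location attend to every other. This is a generic NumPy reimplementation of standard multi-head attention, not the authors' SA-DenseNet121 code; the projection matrices and shapes are illustrative assumptions.

```python
import numpy as np

def multi_head_self_attention(feats, w_q, w_k, w_v, num_heads):
    """Multi-head self-attention over flattened CNN feature-map tokens.

    feats: (tokens, dim) — e.g. a 7x7 spatial grid of C-dim DenseNet
           features reshaped to (49, C).
    w_q, w_k, w_v: (dim, dim) query/key/value projection matrices.
    """
    tokens, dim = feats.shape
    head_dim = dim // num_heads

    def split(x):  # (tokens, dim) -> (heads, tokens, head_dim)
        return x.reshape(tokens, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(feats @ w_q), split(feats @ w_k), split(feats @ w_v)
    # Scaled dot-product attention per head: (heads, tokens, tokens).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = attn @ v                                # (heads, tokens, head_dim)
    # Concatenate heads back to (tokens, dim).
    return out.transpose(1, 0, 2).reshape(tokens, dim)
```

In the paper's setting, the attended tokens would be fed to a classification head that predicts one sigmoid score per thoracic disease.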
https://arxiv.org/abs/2404.11843
Embodied agents operating in complex and uncertain environments face considerable challenges. While some advanced agents handle complex manipulation tasks with proficiency, their success often hinges on extensive training data to develop their capabilities. In contrast, humans typically rely on recalling past experiences and analogous situations to solve new problems. Aiming to emulate this human approach in robotics, we introduce the Retrieval-Augmented Embodied Agent (RAEA). This innovative system equips robots with a form of shared memory, significantly enhancing their performance. Our approach integrates a policy retriever, allowing robots to access relevant strategies from an external policy memory bank based on multi-modal inputs. Additionally, a policy generator is employed to assimilate these strategies into the learning process, enabling robots to formulate effective responses to tasks. Extensive testing of RAEA in both simulated and real-world scenarios demonstrates its superior performance over traditional methods, representing a major leap forward in robotic technology.
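A hedged sketch of the policy-retriever idea: embed the current multi-modal observation, score it against an external memory bank of stored policy embeddings, and return the top-k matches for the policy generator to condition on. Cosine similarity is a stand-in assumption here; the paper's retriever is a learned, multi-modal module.

```python
import numpy as np

def retrieve_policies(query, memory_bank, k=2):
    """Return indices of the k memory-bank entries most similar to the query.

    query: (dim,) embedding of the current multi-modal observation.
    memory_bank: (n, dim) embeddings of stored policies/experiences.
    Similarity is plain cosine similarity, standing in for a learned retriever.
    """
    q = query / np.linalg.norm(query)
    m = memory_bank / np.linalg.norm(memory_bank, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity to every entry
    return np.argsort(-sims)[:k]      # indices of the k best matches
```

The retrieved entries would then be assimilated by the policy generator, analogous to how retrieval-augmented language models condition on fetched documents.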
https://arxiv.org/abs/2404.11699
Sign languages, often categorised as low-resource languages, face significant challenges in achieving accurate translation due to the scarcity of parallel annotated datasets. This paper introduces Select and Reorder (S&R), a novel approach that addresses data scarcity by breaking down the translation process into two distinct steps: Gloss Selection (GS) and Gloss Reordering (GR). Our method leverages large spoken language models and the substantial lexical overlap between source spoken languages and target sign languages to establish an initial alignment. Both steps make use of Non-AutoRegressive (NAR) decoding for reduced computation and faster inference speeds. Through this disentanglement of tasks, we achieve state-of-the-art BLEU and ROUGE scores on the Meine DGS Annotated (mDGS) dataset, demonstrating a substantial BLEU-1 improvement of 37.88% in Text to Gloss (T2G) translation. This innovative approach paves the way for more effective translation models for sign languages, even in resource-constrained settings.
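To make the two-step decomposition concrete, here is a toy sketch (our own, not the paper's model): Gloss Selection keeps source tokens that have a gloss counterpart, exploiting the lexical overlap the paper relies on, and Gloss Reordering then sorts the selected glosses into target sign order. In S&R both steps are non-autoregressive neural models; the vocabulary and ordering table below are hand-made illustrations.

```python
def select_glosses(tokens, gloss_vocab):
    """Gloss Selection: keep source tokens that have a gloss counterpart,
    exploiting lexical overlap between spoken words and glosses."""
    return [t.upper() for t in tokens if t.upper() in gloss_vocab]

def reorder_glosses(glosses, target_order):
    """Gloss Reordering: sort selected glosses into target sign order.
    target_order maps gloss -> position; unknown glosses sort last."""
    return sorted(glosses, key=lambda g: target_order.get(g, len(target_order)))
```

A usage example under these toy assumptions: selecting from "tomorrow i will go home" against the gloss vocabulary {TOMORROW, GO, HOME} and reordering with a hand-made order table yields a gloss sequence in sign order rather than spoken order.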
https://arxiv.org/abs/2404.11532