We introduce Blink, a new benchmark for multimodal large language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks pose significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans achieve 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also highlights that specialist CV models can solve these problems much better, suggesting potential pathways for future improvements. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.
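The headline numbers follow directly from multiple-choice accuracy measured against a random-guess baseline. A minimal sketch of that scoring loop, assuming a hypothetical `model_answer` callable and question records with an `options` list (not Blink's actual data format):

```python
def evaluate(model_answer, questions):
    """Score a multimodal model on multiple-choice perception items.

    `model_answer` is a hypothetical callable mapping (images, question,
    options) to one option letter; each question is a dict with keys
    'images', 'question', 'options', and 'answer'.
    """
    correct = 0
    for q in questions:
        pred = model_answer(q["images"], q["question"], q["options"])
        correct += int(pred == q["answer"])
    accuracy = 100.0 * correct / len(questions)
    # Random-guess baseline: expected accuracy is 1/len(options) per item.
    baseline = 100.0 * sum(1.0 / len(q["options"]) for q in questions) / len(questions)
    return accuracy, accuracy - baseline  # e.g., GPT-4V: 51.26, +13.17
```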
https://arxiv.org/abs/2404.12390
We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently. Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications using binary masks and text prompts. Our generator operates in two phases. First, a context encoder processes the current canvas and user mask to produce a compact global context tailored to the region to generate. Second, conditioned on this context, a diffusion-based transformer decoder synthesizes the masked pixels in a "lazy" fashion, i.e., it only generates the masked region. This contrasts with previous works that either regenerate the full canvas, wasting time and computation, or confine processing to a tight rectangular crop around the mask, ignoring the global image context altogether. Our decoder's runtime scales with the mask size, which is typically small, while our encoder introduces negligible overhead. We demonstrate that our approach is competitive with state-of-the-art inpainting methods in terms of quality and fidelity while providing a 10x speedup for typical user interactions, where the editing mask represents 10% of the image.
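The efficiency claim rests on decoding only the masked tokens while conditioning on a compact global context from a single encoder pass. A schematic sketch of that two-phase flow, where `encoder`, `decoder`, and their attributes are hypothetical stand-ins for the paper's modules:

```python
import torch

def lazy_generate(canvas, mask, prompt, encoder, decoder, num_steps=50):
    """Generate only the masked pixels, never the full canvas.

    canvas: (3, H, W) image tensor; mask: (1, H, W) binary tensor.
    `encoder` and `decoder` are hypothetical modules standing in for the
    paper's context encoder and diffusion transformer decoder.
    """
    # Phase 1: one encoder pass yields a compact global context for the region.
    context = encoder(canvas, mask, prompt)           # (num_ctx_tokens, d)

    # Phase 2: diffuse only over positions inside the mask, so decoder cost
    # scales with mask area rather than canvas area.
    masked_idx = mask.flatten().nonzero(as_tuple=True)[0]
    x = torch.randn(masked_idx.numel(), decoder.token_dim)  # hypothetical attr
    for t in reversed(range(num_steps)):
        x = decoder.denoise_step(x, t, context)       # predicts masked tokens only

    # Composite: paste the decoded pixels back into the untouched canvas.
    out = canvas.clone().flatten(1)                   # (3, H*W)
    out[:, masked_idx] = decoder.to_pixels(x).T       # hypothetical (n, 3) -> (3, n)
    return out.view_as(canvas)
```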
https://arxiv.org/abs/2404.12382
Current 3D reconstruction techniques struggle to infer unbounded scenes from a few images faithfully. Specifically, existing methods have high computational demands, require detailed pose information, and cannot reconstruct occluded regions reliably. We introduce 6Img-to-3D, an efficient, scalable transformer-based encoder-renderer method for single-shot image-to-3D reconstruction. Our method outputs a 3D-consistent parameterized triplane from only six outward-facing input images for large-scale, unbounded outdoor driving scenarios. We take a step towards resolving existing shortcomings by combining contracted custom cross- and self-attention mechanisms for triplane parameterization, differentiable volume rendering, scene contraction, and image feature projection. We showcase that six surround-view vehicle images from a single timestamp without global pose information are enough to reconstruct 360° scenes during inference time, taking 395 ms. Our method allows, for example, rendering third-person images and birds-eye views. Our code is available at this https URL, and more examples can be found on our website at this https URL.
https://arxiv.org/abs/2404.12378
We present FastFit, a method and Python package designed to provide fast and accurate few-shot classification, especially for scenarios with many semantically similar classes. FastFit utilizes a novel approach integrating batch contrastive learning and token-level similarity scores. Compared to existing few-shot learning packages, such as SetFit, Transformers, or few-shot prompting of large language models via API calls, FastFit significantly improves multiclass classification performance in speed and accuracy across FewMany, our newly curated English benchmark, and multilingual datasets. FastFit demonstrates a 3-20x improvement in training speed, completing training in just a few seconds. The FastFit package is now available on GitHub and PyPI, presenting a user-friendly solution for NLP practitioners.
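The token-level similarity score suggests a late-interaction match between query tokens and class-text tokens. A minimal sketch of one plausible form (a ColBERT-style MaxSim, which is an assumption about the mechanism, not FastFit's exact formulation):

```python
import torch
import torch.nn.functional as F

def token_level_similarity(query_tokens, class_tokens):
    """Late-interaction similarity between a query and one class description.

    query_tokens: (Lq, d) token embeddings; class_tokens: (Lc, d).
    Each query token is matched to its most similar class token and the
    matches are summed, finer-grained than a single pooled dot product.
    """
    q = F.normalize(query_tokens, dim=-1)
    c = F.normalize(class_tokens, dim=-1)
    sim = q @ c.T                       # (Lq, Lc) cosine similarities
    return sim.max(dim=1).values.sum()

def classify(query_tokens, class_token_list):
    # Pick the class whose text tokens best cover the query tokens.
    scores = torch.stack([token_level_similarity(query_tokens, c)
                          for c in class_token_list])
    return int(scores.argmax())
```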
https://arxiv.org/abs/2404.12365
Deep learning methods for accelerated MRI achieve state-of-the-art results but largely ignore additional speedups possible with non-Cartesian sampling trajectories. To address this gap, we created a generative diffusion model-based reconstruction algorithm for multi-coil highly undersampled spiral MRI. This model uses conditioning during training as well as frequency-based guidance to ensure consistency between images and measurements. Evaluated on retrospective data, we show high quality (structural similarity > 0.87) in reconstructed images with ultrafast scan times (0.02 seconds for a 2D image). We use this algorithm to identify a set of optimal variable-density spiral trajectories and show large improvements in image quality compared to conventional reconstruction using the non-uniform fast Fourier transform. By combining efficient spiral sampling trajectories, multi-coil imaging, and deep learning reconstruction, these methods could enable the extremely high acceleration factors needed for real-time 3D imaging.
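The "frequency-based guidance" is, in essence, a data-consistency step in k-space during reverse diffusion. A hedged single-coil sketch, where a Cartesian FFT stands in for the spiral NUFFT and multi-coil sensitivity handling the paper actually requires:

```python
import torch

def data_consistency_step(x, y_measured, sample_mask):
    """Project a denoised image estimate toward the measured k-space data.

    x: (H, W) complex image estimate from the current diffusion step.
    y_measured: (H, W) complex k-space, nonzero only where sampled.
    sample_mask: (H, W) binary mask of acquired frequencies. A Cartesian
    FFT stands in here for the spiral NUFFT used in the paper.
    """
    k = torch.fft.fft2(x)
    # Replace predicted frequencies with the acquired ones where available,
    # keeping the model's predictions only at unsampled locations.
    k = torch.where(sample_mask.bool(), y_measured, k)
    return torch.fft.ifft2(k)
```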
https://arxiv.org/abs/2404.12361
Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Despite the existence of various video summarization datasets, a notable limitation is their limited number of source videos, which hampers the effective fine-tuning of advanced large vision-language models (VLMs). Additionally, most existing datasets are created for video-to-video summarization, overlooking the contemporary need for multimodal video content summarization. Recent efforts have been made to expand from unimodal to multimodal video summarization, categorizing the task into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and a combination of video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indexes, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, specifically V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks into one large language model's (LLM) text decoder and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for V2V and V2VT summarization tasks.
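Because each textual summary references specific frame indexes, alignment between the two modalities is mechanically checkable. A small sketch of parsing such references and computing the summarization ratio; the `<f N>` tag syntax is an invented placeholder, not the dataset's actual convention:

```python
import re

def parse_frame_refs(text_summary):
    """Extract frame indexes from a summary like 'A man waves <f 12> ...'.

    The '<f N>' tag syntax is a hypothetical placeholder for whatever
    convention Instruct-V2Xum actually uses.
    """
    return [int(m) for m in re.findall(r"<f\s*(\d+)>", text_summary)]

def summarization_ratio(selected_frames, total_frames):
    # Fraction of source frames kept in the summary (~0.1639 on average).
    return len(set(selected_frames)) / total_frames
```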
https://arxiv.org/abs/2404.12353
This study evaluates the performance of general-purpose AI, like ChatGPT, in legal question-answering tasks, highlighting significant risks to legal professionals and clients. It suggests leveraging foundational models enhanced by domain-specific knowledge to overcome these issues. The paper advocates for creating open-source legal AI systems to improve accuracy, transparency, and narrative diversity, addressing general AI's shortcomings in legal contexts.
https://arxiv.org/abs/2404.12349
As a significant step for human face modeling, editing, and generation, face landmarking aims at extracting facial keypoints from images. A generalizable face landmarker is required in practice because real-world facial images, e.g., the avatars in animations and games, are often stylized in various ways. However, achieving generalizable face landmarking is challenging due to the diversity of facial styles and the scarcity of labeled stylized faces. In this study, we propose a simple but effective paradigm to learn a generalizable face landmarker based on labeled real human faces and unlabeled stylized faces. Our method learns the face landmarker as the key module of a conditional face warper. Given a pair of real and stylized facial images, the conditional face warper predicts a warping field from the real face to the stylized one, in which the face landmarker predicts the ending points of the warping field and provides us with high-quality pseudo landmarks for the corresponding stylized facial images. Applying an alternating optimization strategy, we learn the face landmarker to minimize (i) the discrepancy between the stylized faces and the warped real ones and (ii) the prediction errors of both real and pseudo landmarks. Experiments on various datasets show that our method outperforms existing state-of-the-art domain adaptation methods in face landmarking tasks, leading to a face landmarker with better generalizability. Code is available at this https URL.
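The alternating objective pairs a warping-consistency term (i) with landmark-regression terms on real and pseudo labels (ii). A hedged PyTorch-style sketch of one training step, where `warper`, its `transport` helper, and `landmarker` are hypothetical interfaces, not the paper's actual code:

```python
import torch.nn.functional as F

def training_step(real_img, stylized_img, real_landmarks, warper, landmarker):
    """One step of the alternating objective described above.

    (i) make the warped real face match the stylized face, and
    (ii) fit the landmarker on real landmarks plus the pseudo landmarks
    the warping field induces on the stylized image.
    """
    # Warping field from real to stylized, expressed here as a normalized
    # sampling grid in [-1, 1] so grid_sample can apply it directly.
    flow = warper(real_img, stylized_img)                  # (B, 2, H, W)
    warped = F.grid_sample(real_img, flow.permute(0, 2, 3, 1),
                           align_corners=False)
    loss_warp = F.l1_loss(warped, stylized_img)            # term (i)

    # Pseudo landmarks: endpoints of the warping field at the real landmarks
    # (`transport` is a hypothetical helper that follows the field).
    pseudo_landmarks = warper.transport(real_landmarks, flow)

    pred_real = landmarker(real_img)
    pred_styl = landmarker(stylized_img)
    loss_lmk = (F.mse_loss(pred_real, real_landmarks) +
                F.mse_loss(pred_styl, pseudo_landmarks))   # term (ii)
    return loss_warp + loss_lmk
```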
https://arxiv.org/abs/2404.12322
Unleashing the synergies of rapidly evolving mobility technologies in a multi-stakeholder landscape presents unique challenges and opportunities for addressing urban transportation problems. This paper introduces a novel synthetic participatory method, critically leveraging large language models (LLMs) to create digital avatars representing diverse stakeholders to plan shared automated electric mobility systems (SAEMS). These calibratable agents collaboratively identify objectives, envision and evaluate SAEMS alternatives, and strategize implementation under risks and constraints. The results of a Montreal case study indicate that a structured and parameterized workflow provides outputs with higher controllability and comprehensiveness for an SAEMS plan than those generated using a single LLM-enabled expert agent. Consequently, the approach provides a promising avenue for cost-efficiently improving the inclusivity and interpretability of multi-objective transportation planning, suggesting a paradigm shift in how we envision and strategize for sustainable and equitable transportation systems.
https://arxiv.org/abs/2404.12317
In Simultaneous Machine Translation (SiMT) systems, training with a simultaneous interpretation (SI) corpus is an effective method for achieving high-quality yet low-latency systems. However, it is very challenging to curate such a corpus due to limitations in the abilities of annotators, and hence, existing SI corpora are limited. Therefore, we propose a method to convert existing speech translation corpora into interpretation-style data, maintaining the original word order and preserving the entire source content using Large Language Models (LLM-SI-Corpus). We demonstrate that fine-tuning SiMT models in text-to-text and speech-to-text settings with the LLM-SI-Corpus reduces latencies while maintaining the same level of quality as the models trained with offline datasets. The LLM-SI-Corpus is available at this https URL.
https://arxiv.org/abs/2404.12299
This study introduces a novel method for irony detection, applying Large Language Models (LLMs) with prompt-based learning to facilitate emotion-centric text augmentation. Traditional irony detection techniques typically fall short due to their reliance on static linguistic features and predefined knowledge bases, often overlooking the nuanced emotional dimensions integral to irony. In contrast, our methodology augments the detection process by integrating subtle emotional cues, augmented through LLMs, into three benchmark pre-trained NLP models - BERT, T5, and GPT-2 - which are widely recognized as foundational in irony detection. We assessed our method using the SemEval-2018 Task 3 dataset and observed substantial enhancements in irony detection capabilities.
https://arxiv.org/abs/2404.12291
In the big data era, integrating diverse data modalities poses significant challenges, particularly in complex fields like healthcare. This paper introduces a new process model for multimodal Data Fusion for Data Mining, integrating embeddings and the Cross-Industry Standard Process for Data Mining with the existing Data Fusion Information Group model. Our model aims to decrease computational costs, complexity, and bias while improving efficiency and reliability. We also propose "disentangled dense fusion", a novel embedding fusion method designed to optimize mutual information and facilitate dense inter-modality feature interaction, thereby minimizing redundant information. We demonstrate the model's efficacy through three use cases: predicting diabetic retinopathy using retinal images and patient metadata, domestic violence prediction employing satellite imagery, internet, and census data, and identifying clinical and demographic features from radiography images and clinical notes. The model achieved a Macro F1 score of 0.92 in diabetic retinopathy prediction, an R-squared of 0.854 and sMAPE of 24.868 in domestic violence prediction, and a macro AUC of 0.92 and 0.99 for disease prediction and sex classification, respectively, in radiological analysis. These results underscore the Data Fusion for Data Mining model's potential to significantly impact multimodal data processing, promoting its adoption in diverse, resource-constrained settings.
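"Dense inter-modality feature interaction" is described only at a high level; below is a toy sketch of one plausible reading, using bidirectional cross-attention between two modality embeddings. This is a guess at the ingredients, and it omits the paper's mutual-information disentanglement objective:

```python
import torch
import torch.nn as nn

class DenseFusion(nn.Module):
    """Toy dense fusion of two modality embeddings (e.g., image + metadata).

    Cross-attention lets every token of one modality attend to the other;
    the exact disentanglement objective in the paper is not reproduced here.
    """
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, dim)

    def forward(self, emb_a, emb_b):
        # emb_a: (B, La, dim), emb_b: (B, Lb, dim)
        a2b, _ = self.attn_a(emb_a, emb_b, emb_b)   # A attends to B
        b2a, _ = self.attn_b(emb_b, emb_a, emb_a)   # B attends to A
        fused = torch.cat([a2b.mean(1), b2a.mean(1)], dim=-1)
        return self.head(fused)                     # (B, dim) joint embedding
```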
https://arxiv.org/abs/2404.12278
Although large language models (LLMs) have achieved significant success, their vulnerability to adversarial perturbations, including recent jailbreak attacks, has raised considerable concerns. However, the increasing size of these models and their limited access make improving their robustness a challenging task. Among various defense strategies, randomized smoothing has shown great potential for LLMs, as it does not require full access to the model's parameters or fine-tuning via adversarial training. However, randomized smoothing involves adding noise to the input before model prediction, and the final model's robustness largely depends on the model's performance on these noise-corrupted data; its effectiveness is therefore often limited by the model's sub-optimal performance on noisy data. To address this issue, we propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions. We call this procedure self-denoised smoothing. Unlike previous denoised smoothing techniques in computer vision, which require training a separate model to enhance the robustness of LLMs, our method offers significantly better efficiency and flexibility. Our experimental results indicate that our method surpasses existing methods in both empirical and certified robustness in defending against adversarial attacks for both downstream tasks and human alignments (i.e., jailbreak attacks). Our code is publicly available at this https URL
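The procedure is simple enough to sketch end to end: perturb the input, ask the same LLM to repair it, predict on the repaired copies, and vote. A minimal sketch assuming a hypothetical `llm(prompt) -> str` callable and word-level masking as the noise model (the paper's noise model may differ):

```python
import random
from collections import Counter

def self_denoised_smoothing(llm, text, num_samples=8, mask_rate=0.3):
    """Randomized smoothing where the LLM itself repairs the noisy input.

    `llm(prompt)` is a hypothetical callable returning a string. Noise is
    word-level masking here; the paper's exact noise model may differ.
    """
    votes = []
    for _ in range(num_samples):
        words = text.split()
        noisy = " ".join(w if random.random() > mask_rate else "[MASK]"
                         for w in words)
        # Step 1: self-denoise -- the model fills in the masked words.
        denoised = llm("Replace every [MASK] so the text reads naturally:\n"
                       + noisy)
        # Step 2: predict on the denoised input instead of the noisy one.
        votes.append(llm("Classify the sentiment (positive/negative):\n"
                         + denoised))
    # Majority vote over perturbed copies gives the smoothed prediction.
    return Counter(votes).most_common(1)[0][0]
```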
https://arxiv.org/abs/2404.12274
Federated Learning (FL) has emerged as a promising solution for collaborative training of large language models (LLMs). However, the integration of LLMs into FL introduces new challenges, particularly concerning the evaluation of LLMs. Traditional evaluation methods that rely on labeled test sets and similarity-based metrics cover only a subset of the acceptable answers, thereby failing to accurately reflect the performance of LLMs on generative tasks. Meanwhile, although automatic evaluation methods that leverage advanced LLMs present potential, they face critical risks of data leakage due to the need to transmit data to external servers and suboptimal performance on downstream tasks due to the lack of domain knowledge. To address these issues, we propose a Federated Evaluation framework of Large Language Models, named FedEval-LLM, that provides reliable performance measurements of LLMs on downstream tasks without the reliance on labeled test sets and external tools, thus ensuring strong privacy-preserving capability. FedEval-LLM leverages a consortium of personalized LLMs from participants as referees to provide domain knowledge and collective evaluation capability, thus aligning to the respective downstream tasks and mitigating uncertainties and biases associated with a single referee. Experimental results demonstrate a significant improvement in the evaluation capability of personalized evaluation models on downstream tasks. When applied to FL, these evaluation models exhibit strong agreement with human preference and RougeL-score on meticulously curated test sets. FedEval-LLM effectively overcomes the limitations of traditional metrics and the reliance on external services, making it a promising framework for the evaluation of LLMs within collaborative training scenarios.
https://arxiv.org/abs/2404.12273
Due to the cumbersome nature of human evaluation and limitations of code-based evaluation, Large Language Models (LLMs) are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators simply inherit all the problems of the LLMs they evaluate, requiring further human validation. We present a mixed-initiative approach to "validate the validators", aligning LLM-generated evaluation functions (be it prompts or code) with human requirements. Our interface, EvalGen, provides automated assistance to users in generating evaluation criteria and implementing assertions. While generating candidate implementations (Python functions, LLM grader prompts), EvalGen asks humans to grade a subset of LLM outputs; this feedback is used to select implementations that better align with user grades. A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative process of alignment. In particular, we identify a phenomenon we dub "criteria drift": users need criteria to grade outputs, but grading outputs helps users define criteria. What is more, some criteria appear dependent on the specific LLM outputs observed (rather than independent criteria that can be defined a priori), raising serious questions for approaches that assume the independence of evaluation from observation of model outputs. We present our interface and implementation details, a comparison of our algorithm with a baseline approach, and implications for the design of future LLM evaluation assistants.
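The selection step at EvalGen's core, keeping candidate assertions that agree with human grades, can be sketched concisely. The interfaces below (candidates as boolean callables, grades as booleans) are assumptions, not EvalGen's actual API:

```python
def select_aligned_assertions(candidates, outputs, human_grades,
                              min_agreement=0.8):
    """Keep candidate evaluation functions that track human judgments.

    candidates: list of callables, output -> bool (generated Python
    assertions or LLM-grader wrappers); human_grades: list of bools,
    one per item in `outputs`.
    """
    selected = []
    for check in candidates:
        agree = sum(check(o) == g for o, g in zip(outputs, human_grades))
        if agree / len(outputs) >= min_agreement:
            selected.append(check)
    return selected
```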
https://arxiv.org/abs/2404.12272
Physics-integrated generative modeling is a class of hybrid or grey-box modeling in which we augment the data-driven model with the physics knowledge governing the data distribution. The use of physics knowledge allows the generative model to produce output in a controlled way, so that the output, by construction, complies with the physical laws. It imparts improved generalization ability to extrapolate beyond the training distribution as well as improved interpretability, because the model is partly grounded in firm domain knowledge. In this work, we aim to improve the fidelity of reconstruction and robustness to noise in the physics-integrated generative model. To this end, we use a variational autoencoder as the generative model. To improve the reconstruction results of the decoder, we propose to learn the latent posterior distribution of both the physics and the trainable data-driven components using planar normalizing flows. A normalizing-flow-based posterior distribution harnesses the inherent dynamical structure of the data distribution, hence the learned model gets closer to the true underlying data distribution. To improve the robustness of the generative model against noise injected into the model, we propose a modification to the encoder of the normalizing-flow-based VAE. We design the encoder to incorporate scaled-dot-product-attention-based contextual information into the noisy latent vector, which mitigates the adverse effect of noise in the latent vector and makes the model more robust. We empirically evaluated our models on a human locomotion dataset [33], and the results validate the efficacy of our proposed models in terms of improvement in reconstruction quality as well as robustness against noise injected into the model.
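The planar flow is the one standard, fully specified ingredient here: f(z) = z + u * tanh(w^T z + b), with a tractable log-determinant (Rezende and Mohamed, 2015). A minimal sketch of a single layer for the latent posterior:

```python
import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    """One planar flow layer: f(z) = z + u * tanh(w^T z + b).

    Stacking several layers turns a simple Gaussian posterior into a more
    flexible one while keeping the density tractable via the log-det term.
    (The invertibility constraint on u is omitted here for brevity.)
    """
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(torch.randn(dim) * 0.01)
        self.w = nn.Parameter(torch.randn(dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):                          # z: (B, dim)
        lin = z @ self.w + self.b                  # (B,)
        f = z + self.u * torch.tanh(lin).unsqueeze(-1)
        # log|det df/dz| = log|1 + u^T psi|, psi = (1 - tanh^2(lin)) * w
        psi = (1 - torch.tanh(lin) ** 2).unsqueeze(-1) * self.w
        log_det = torch.log(torch.abs(1 + psi @ self.u) + 1e-8)
        return f, log_det
```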
https://arxiv.org/abs/2404.12267
Data analysts have long sought to turn unstructured text data into meaningful concepts. Though common, topic modeling and clustering focus on lower-level keywords and require significant interpretative work. We introduce concept induction, a computational process that instead produces high-level concepts, defined by explicit inclusion criteria, from unstructured text. For a dataset of toxic online comments, where a state-of-the-art BERTopic model outputs "women, power, female," concept induction produces high-level concepts such as "Criticism of traditional gender roles" and "Dismissal of women's concerns." We present LLooM, a concept induction algorithm that leverages large language models to iteratively synthesize sampled text and propose human-interpretable concepts of increasing generality. We then instantiate LLooM in a mixed-initiative text analysis tool, enabling analysts to shift their attention from interpreting topics to engaging in theory-driven analysis. Through technical evaluations and four analysis scenarios ranging from literature review to content moderation, we find that LLooM's concepts improve upon the prior art of topic models in terms of quality and data coverage. In expert case studies, LLooM helped researchers to uncover new insights even from familiar datasets, for example by suggesting a previously unnoticed concept of attacks on out-party stances in a political social media dataset.
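The iterative synthesize-and-generalize loop can be sketched at the prompt level. `llm` below is a hypothetical completion callable, and the prompts merely paraphrase the idea rather than reproduce LLooM's actual templates:

```python
import random

def concept_induction(llm, documents, rounds=3, sample_size=20):
    """Iteratively propose high-level concepts with explicit inclusion criteria.

    `llm(prompt)` is a hypothetical completion callable returning a string.
    Each round proposes concepts from a fresh sample of texts, then merges
    them with the running set into more general concepts.
    """
    concepts = ""
    for _ in range(rounds):
        sample = random.sample(documents, min(sample_size, len(documents)))
        proposed = llm("Propose high-level concepts, each with an explicit "
                       "inclusion criterion, covering these texts:\n"
                       + "\n".join(sample))
        concepts = llm("Merge and generalize the following concepts, keeping "
                       "inclusion criteria explicit:\n" + proposed + "\n" + concepts)
    return concepts
```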
https://arxiv.org/abs/2404.12259
Image-based methods to analyze food images have alleviated the user burden and biases associated with traditional methods. However, accurate portion estimation remains a major challenge due to the loss of 3D information in the 2D representation of foods captured by smartphone cameras or wearable devices. In this paper, we propose a new framework to estimate both food volume and energy from 2D images by leveraging the power of 3D food models and physical reference in the eating scene. Our method estimates the pose of the camera and the food object in the input image and recreates the eating occasion by rendering an image of a 3D model of the food with the estimated poses. We also introduce a new dataset, SimpleFood45, which contains 2D images of 45 food items and associated annotations including food volume, weight, and energy. Our method achieves an average error of 31.10 kCal (17.67%) on this dataset, outperforming existing portion estimation methods.
https://arxiv.org/abs/2404.12257
The autonomous driving industry is expected to grow by over 20 times in the coming decade, motivating researchers to delve into it. The primary focus of their research is to ensure safety, comfort, and efficiency. An autonomous vehicle has several modules responsible for one or more of the aforementioned items. Among these modules, the trajectory planner plays a pivotal role in the safety of the vehicle and the comfort of its passengers. The module is also responsible for respecting kinematic constraints and any applicable road constraints. In this paper, a novel online spatial-temporal graph trajectory planner is introduced to generate safe and comfortable trajectories. First, a spatial-temporal graph is constructed using the autonomous vehicle, its surrounding vehicles, and virtual nodes along the road with respect to the vehicle itself. Next, the graph is forwarded into a sequential network to obtain the desired states. To support the planner, a simple behavioral layer is also presented that determines kinematic constraints for the planner. Furthermore, a novel potential function is also proposed to train the network. Finally, the proposed planner is tested on three different complex driving tasks, and the performance is compared with two frequently used methods. The results show that the proposed planner generates safe and feasible trajectories while achieving similar or longer distances in the forward direction and a comparable ride comfort.
https://arxiv.org/abs/2404.12256
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark.
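The 43,090 test items are template-generated, a pattern easy to illustrate. A toy sketch of crossing templates, slot fillers, and personas into test items; all strings below are invented placeholders, not actual benchmark content:

```python
from itertools import product

# Invented placeholders -- the real benchmark's templates are not reproduced here.
TEMPLATES = ["How do I {activity}?", "What is the best way to {activity}?"]
ACTIVITIES = ["<hazardous activity A>", "<hazardous activity B>"]
PERSONAS = ["typical user", "malicious user", "vulnerable user"]

def expand_test_items(templates, activities, personas):
    """Cross templates with slot fillers and personas to yield test items."""
    items = []
    for tpl, act, persona in product(templates, activities, personas):
        items.append({"prompt": tpl.format(activity=act),
                      "persona": persona})
    return items

items = expand_test_items(TEMPLATES, ACTIVITIES, PERSONAS)
print(len(items))  # 2 templates x 2 fillers x 3 personas = 12 items
```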
https://arxiv.org/abs/2404.12241