Although large language models (LLMs) have achieved significant success, their vulnerability to adversarial perturbations, including recent jailbreak attacks, has raised considerable concern. However, the increasing size of these models and their limited access make improving their robustness a challenging task. Among various defense strategies, randomized smoothing has shown great potential for LLMs, as it does not require full access to the model's parameters or fine-tuning via adversarial training. However, randomized smoothing involves adding noise to the input before model prediction, so the final model's robustness largely depends on the model's performance on these noise-corrupted data; its effectiveness is therefore often limited by the model's sub-optimal performance on noisy inputs. To address this issue, we propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then make predictions based on these denoised versions. We call this procedure self-denoised smoothing. Unlike previous denoised smoothing techniques in computer vision, which require training a separate denoising model, our method offers significantly better efficiency and flexibility. Our experimental results indicate that our method surpasses existing methods in both empirical and certified robustness when defending against adversarial attacks on downstream tasks and on human alignment (i.e., jailbreak attacks). Our code is publicly available at this https URL
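The smoothing procedure can be sketched as a simple loop: perturb the input, ask the model to denoise it, predict on the denoised copy, and take a majority vote. The sketch below is illustrative only; `mask_tokens`, `denoise`, and `classify` are toy stand-ins for the LLM calls, and the paper's actual prompting and certification details are not reproduced here.

```python
import random
from collections import Counter

def mask_tokens(tokens, rate, rng):
    """Randomly mask a fraction of input tokens (the smoothing noise)."""
    return [t if rng.random() > rate else "[MASK]" for t in tokens]

def denoise(tokens):
    """Toy stand-in for asking the LLM itself to repair the masked input;
    a real system would prompt the model to fill in the blanks."""
    return [t for t in tokens if t != "[MASK]"]

def classify(tokens):
    """Toy stand-in for the LLM's downstream prediction."""
    return "positive" if "good" in tokens else "negative"

def self_denoised_smoothing(tokens, n_samples=50, mask_rate=0.2, seed=0):
    """Majority vote over predictions on denoised noisy copies of the input."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        noisy = mask_tokens(tokens, mask_rate, rng)
        votes[classify(denoise(noisy))] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_samples

label, agreement = self_denoised_smoothing("this movie is really good".split())
print(label, agreement)
```

The agreement fraction is what a certification procedure would bound; here it simply reports how stable the vote was across noisy copies.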
https://arxiv.org/abs/2404.12274
Federated Learning (FL) has emerged as a promising solution for collaborative training of large language models (LLMs). However, the integration of LLMs into FL introduces new challenges, particularly concerning the evaluation of LLMs. Traditional evaluation methods that rely on labeled test sets and similarity-based metrics cover only a subset of the acceptable answers, thereby failing to accurately reflect the performance of LLMs on generative tasks. Meanwhile, although automatic evaluation methods that leverage advanced LLMs show promise, they face critical risks of data leakage, since data must be transmitted to external servers, and suboptimal performance on downstream tasks due to a lack of domain knowledge. To address these issues, we propose a Federated Evaluation framework of Large Language Models, named FedEval-LLM, that provides reliable performance measurements of LLMs on downstream tasks without relying on labeled test sets or external tools, thus ensuring strong privacy-preserving capability. FedEval-LLM leverages a consortium of personalized LLMs from participants as referees to provide domain knowledge and collective evaluation capability, thus aligning with the respective downstream tasks and mitigating the uncertainties and biases associated with a single referee. Experimental results demonstrate a significant improvement in the evaluation capability of personalized evaluation models on downstream tasks. When applied to FL, these evaluation models exhibit strong agreement with human preferences and RougeL scores on meticulously curated test sets. FedEval-LLM effectively overcomes the limitations of traditional metrics and the reliance on external services, making it a promising framework for the evaluation of LLMs within collaborative training scenarios.
https://arxiv.org/abs/2404.12273
Due to the cumbersome nature of human evaluation and the limitations of code-based evaluation, Large Language Models (LLMs) are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators simply inherit all the problems of the LLMs they evaluate, requiring further human validation. We present a mixed-initiative approach to ``validate the validators'' -- aligning LLM-generated evaluation functions (be they prompts or code) with human requirements. Our interface, EvalGen, provides automated assistance to users in generating evaluation criteria and implementing assertions. While generating candidate implementations (Python functions, LLM grader prompts), EvalGen asks humans to grade a subset of LLM outputs; this feedback is used to select implementations that better align with user grades. A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative nature of alignment. In particular, we identify a phenomenon we dub \emph{criteria drift}: users need criteria to grade outputs, but grading outputs helps users define criteria. What is more, some criteria appear \emph{dependent} on the specific LLM outputs observed (rather than being independent criteria that can be defined \emph{a priori}), raising serious questions for approaches that assume the independence of evaluation from observation of model outputs. We present our interface and implementation details, a comparison of our algorithm with a baseline approach, and implications for the design of future LLM evaluation assistants.
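The core selection idea, picking the candidate evaluation function that best reproduces a handful of human grades, can be sketched as follows. All function names, outputs, and grades here are hypothetical, not EvalGen's actual implementation.

```python
def no_apology(output: str) -> bool:       # candidate assertion 1 (hypothetical)
    return "sorry" not in output.lower()

def short_enough(output: str) -> bool:     # candidate assertion 2 (hypothetical)
    return len(output.split()) <= 5

candidates = {"no_apology": no_apology, "short_enough": short_enough}

# Hypothetical LLM outputs with human grades (True = acceptable).
graded = [
    ("Paris is the capital of France.", True),
    ("Sorry, I cannot answer that.", False),
    ("I am sorry but no.", False),
    ("Berlin.", True),
]

def alignment(fn, graded):
    """Fraction of human grades that the candidate assertion reproduces."""
    return sum(fn(out) == grade for out, grade in graded) / len(graded)

scores = {name: alignment(fn, graded) for name, fn in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Here `no_apology` matches all four human grades while `short_enough` matches only one, so the alignment score selects the former.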
https://arxiv.org/abs/2404.12272
Physics-integrated generative modeling is a class of hybrid or grey-box modeling in which the data-driven model is augmented with the physics knowledge governing the data distribution. The use of physics knowledge allows the generative model to produce output in a controlled way, so that the output, by construction, complies with the physical laws. It imparts improved generalization ability to extrapolate beyond the training distribution as well as improved interpretability, because the model is partly grounded in firm domain knowledge. In this work, we aim to improve the fidelity of reconstruction and the robustness to noise of the physics-integrated generative model. To this end, we use a variational autoencoder (VAE) as the generative model. To improve the reconstruction results of the decoder, we propose to learn the latent posterior distribution of both the physics component and the trainable data-driven component using planar normalizing flows. The normalizing-flow-based posterior distribution harnesses the inherent dynamical structure of the data distribution, so the learned model comes closer to the true underlying data distribution. To improve the robustness of the generative model against noise injected into the model, we propose a modification to the encoder of the normalizing-flow-based VAE. We design the encoder to incorporate scaled dot-product attention-based contextual information into the noisy latent vector, which mitigates the adverse effect of noise in the latent vector and makes the model more robust. We empirically evaluated our models on a human locomotion dataset [33], and the results validate the efficacy of our proposed models in terms of improved reconstruction quality as well as robustness against noise injected into the model.
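For reference, a single planar flow step transforms a latent sample as f(z) = z + u * tanh(w . z + b), with log-determinant log|1 + (u . w) * tanh'(w . z + b)|. A minimal sketch, with values and dimensions chosen arbitrarily for illustration:

```python
import math

def planar_flow(z, w, u, b):
    """One planar flow step f(z) = z + u * tanh(w.z + b) and its log|det J|."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b        # w . z + b
    h = math.tanh(a)
    f = [zi + ui * h for zi, ui in zip(z, u)]           # transformed sample
    h_prime = 1.0 - h * h                               # tanh'(a)
    uw = sum(ui * wi for ui, wi in zip(u, w))           # u . w
    log_det = math.log(abs(1.0 + uw * h_prime))         # det(I + u psi^T) is rank-one
    return f, log_det

z = [0.5, -1.0]
f, log_det = planar_flow(z, w=[1.0, 0.0], u=[0.2, 0.1], b=0.0)
print(f, log_det)
```

Stacking several such steps, and accumulating the log-determinants, is what yields a posterior richer than a plain Gaussian in a flow-based VAE.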
https://arxiv.org/abs/2404.12267
This paper presents the design, implementation, and flight test results of linear quadratic integral regulator (LQRi) based attitude control for a quadcopter UAV. We present the derivation of the mathematical model for the kinematics and dynamics of the UAV, along with the linearized state space representation of the system about hover conditions. LQR and LQRi controllers are then designed to stabilize the UAV in hover and to track desired attitude commands. The controllers are implemented onboard the Pixhawk flight controller, and flight test results are discussed. Finally, the code related to this paper has been published open-source for replication and further research.
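As a toy illustration of the LQR design step, the sketch below iterates the discrete-time Riccati recursion for a scalar system. The paper's controllers operate on the full linearized attitude dynamics, and the LQRi variant additionally augments the state with an integral of the tracking error; both are omitted here, and the numbers are made up.

```python
def dlqr_scalar(A, B, Q, R, iters=500):
    """Discrete-time Riccati recursion for the scalar system
    x[k+1] = A*x[k] + B*u[k] with cost sum(Q*x^2 + R*u^2).
    Returns the gain K for the state-feedback law u = -K*x."""
    P = Q
    for _ in range(iters):
        K = (B * P * A) / (R + B * B * P)   # optimal gain for current P
        P = Q + A * A * P - A * P * B * K   # Riccati update
    return K

K = dlqr_scalar(A=1.1, B=0.5, Q=1.0, R=1.0)
print(K, abs(1.1 - 0.5 * K))  # gain and closed-loop pole magnitude
```

The open-loop system (A = 1.1) is unstable, while the closed-loop pole A - B*K has magnitude below one, which is exactly what the LQR design guarantees.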
https://arxiv.org/abs/2404.12261
Facial expression recognition is a pivotal component in machine learning, facilitating various applications. However, convolutional neural networks (CNNs) are often plagued by catastrophic forgetting, impeding their adaptability. The proposed method, emotion-centered generative replay (ECgr), tackles this challenge by integrating synthetic images from generative adversarial networks. Moreover, ECgr incorporates a quality assurance algorithm to ensure the fidelity of generated images. This dual approach enables CNNs to retain past knowledge while learning new tasks, enhancing their performance in emotion recognition. The experimental results on four diverse facial expression datasets demonstrate that incorporating images generated by our pseudo-rehearsal method enhances training on the targeted dataset and the source dataset while making the CNN retain previously learned knowledge.
https://arxiv.org/abs/2404.12260
Data analysts have long sought to turn unstructured text data into meaningful concepts. Though common, topic modeling and clustering focus on lower-level keywords and require significant interpretative work. We introduce concept induction, a computational process that instead produces high-level concepts, defined by explicit inclusion criteria, from unstructured text. For a dataset of toxic online comments, where a state-of-the-art BERTopic model outputs "women, power, female," concept induction produces high-level concepts such as "Criticism of traditional gender roles" and "Dismissal of women's concerns." We present LLooM, a concept induction algorithm that leverages large language models to iteratively synthesize sampled text and propose human-interpretable concepts of increasing generality. We then instantiate LLooM in a mixed-initiative text analysis tool, enabling analysts to shift their attention from interpreting topics to engaging in theory-driven analysis. Through technical evaluations and four analysis scenarios ranging from literature review to content moderation, we find that LLooM's concepts improve upon the prior art of topic models in terms of quality and data coverage. In expert case studies, LLooM helped researchers to uncover new insights even from familiar datasets, for example by suggesting a previously unnoticed concept of attacks on out-party stances in a political social media dataset.
https://arxiv.org/abs/2404.12259
In this study, we introduce DeepLocalization, an innovative framework devised for the real-time localization of actions, tailored explicitly for monitoring driver behavior. Utilizing the power of advanced deep learning methodologies, our objective is to tackle the critical issue of distracted driving, a significant factor contributing to road accidents. Our strategy employs a dual approach: leveraging Graph-Based Change-Point Detection for pinpointing actions in time alongside a Video Large Language Model (Video-LLM) for precisely categorizing activities. Through careful prompt engineering, we customize the Video-LLM to adeptly handle the nuances of driving activities, ensuring its classification efficacy even with sparse data. Engineered to be lightweight, our framework is optimized for consumer-grade GPUs, making it widely applicable in practical scenarios. We subjected our method to rigorous testing on the SynDD2 dataset, a complex benchmark for distracted driving behaviors, where it demonstrated commendable performance, achieving 57.5% accuracy in event classification and 51% in event detection. These outcomes underscore the substantial promise of DeepLocalization in accurately identifying diverse driver behaviors and their temporal occurrences, all within the bounds of limited computational resources.
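As a simplified stand-in for the temporal localization step, the sketch below finds a single change point in a 1-D feature stream by exhaustive cost minimization; the paper's graph-based change-point detection is considerably more involved, and the signal values here are invented.

```python
def best_change_point(signal):
    """Pick the split index minimizing the within-segment squared error."""
    def sse(seg):
        if not seg:
            return 0.0
        mean = sum(seg) / len(seg)
        return sum((x - mean) ** 2 for x in seg)
    # cost of splitting the stream into signal[:k] and signal[k:]
    costs = {k: sse(signal[:k]) + sse(signal[k:]) for k in range(1, len(signal))}
    return min(costs, key=costs.get)

# A feature stream that jumps when the driver's activity changes.
signal = [0.1, 0.2, 0.1, 0.0, 0.2, 5.1, 4.9, 5.0, 5.2, 4.8]
print(best_change_point(signal))  # → 5
```

Each detected segment would then be handed to the activity classifier (the Video-LLM in the paper) for labeling.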
https://arxiv.org/abs/2404.12258
Image-based methods to analyze food images have alleviated the user burden and biases associated with traditional methods. However, accurate portion estimation remains a major challenge due to the loss of 3D information in the 2D representation of foods captured by smartphone cameras or wearable devices. In this paper, we propose a new framework to estimate both food volume and energy from 2D images by leveraging the power of 3D food models and physical reference in the eating scene. Our method estimates the pose of the camera and the food object in the input image and recreates the eating occasion by rendering an image of a 3D model of the food with the estimated poses. We also introduce a new dataset, SimpleFood45, which contains 2D images of 45 food items and associated annotations including food volume, weight, and energy. Our method achieves an average error of 31.10 kCal (17.67%) on this dataset, outperforming existing portion estimation methods.
https://arxiv.org/abs/2404.12257
The autonomous driving industry is expected to grow more than 20-fold in the coming decade, motivating researchers to delve into it. The primary focus of their research is to ensure safety, comfort, and efficiency. An autonomous vehicle has several modules responsible for one or more of these objectives. Among them, the trajectory planner plays a pivotal role in the safety of the vehicle and the comfort of its passengers. The module is also responsible for respecting kinematic constraints and any applicable road constraints. In this paper, a novel online spatial-temporal graph trajectory planner is introduced to generate safe and comfortable trajectories. First, a spatial-temporal graph is constructed from the autonomous vehicle, its surrounding vehicles, and virtual nodes along the road, expressed relative to the vehicle itself. Next, the graph is fed into a sequential network to obtain the desired states. To support the planner, a simple behavioral layer is also presented that determines kinematic constraints for the planner. Furthermore, a novel potential function is proposed to train the network. Finally, the proposed planner is tested on three different complex driving tasks, and its performance is compared with two frequently used methods. The results show that the proposed planner generates safe and feasible trajectories while achieving similar or longer distances in the forward direction and comparable ride comfort.
https://arxiv.org/abs/2404.12256
Despite the impressive capabilities of Large Language Models (LLMs) on various tasks, they still struggle with scenarios that involve complex reasoning and planning. Recent work has proposed advanced prompting techniques and highlighted the necessity of fine-tuning with high-quality data to augment LLMs' reasoning abilities. However, these approaches are inherently constrained by data availability and quality. In light of this, self-correction and self-learning emerge as viable solutions, employing strategies that allow LLMs to refine their outputs and learn from self-assessed rewards. Yet the efficacy of LLMs in self-refining their responses, particularly on complex reasoning and planning tasks, remains dubious. In this paper, we introduce AlphaLLM for the self-improvement of LLMs, which integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop, thereby enhancing the capabilities of LLMs without additional annotations. Drawing inspiration from the success of AlphaGo, AlphaLLM addresses the unique challenges of combining MCTS with an LLM for self-improvement, including data scarcity, the vast search spaces of language tasks, and the subjective nature of feedback in language tasks. AlphaLLM comprises a prompt synthesis component, an efficient MCTS approach tailored to language tasks, and a trio of critic models for precise feedback. Our experimental results on mathematical reasoning tasks demonstrate that AlphaLLM significantly enhances the performance of LLMs without additional annotations, showing the potential for self-improvement in LLMs.
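A minimal UCT loop, shown here on a toy subtraction game rather than a language task, illustrates the search machinery that MCTS-based self-improvement builds on: selection by upper confidence bound, simulation to a terminal state, and backpropagation of wins. This is a generic sketch, not AlphaLLM's search.

```python
import math

def legal_moves(pile):
    return [m for m in (1, 2) if m <= pile]

def uct_best_move(n, iters=4000, c=1.4):
    """Minimal UCT for a toy subtraction game: players alternately remove 1
    or 2 tokens from a pile of n; whoever takes the last token wins."""
    stats = {}  # (pile, move) -> (visits, wins for the player making the move)

    def ucb(pile, move, total):
        v, w = stats.get((pile, move), (0, 0))
        if v == 0:
            return float("inf")  # always try unvisited moves first
        return w / v + c * math.sqrt(math.log(total) / v)

    for _ in range(iters):
        pile, player, path = n, 0, []
        # selection/expansion: descend by UCB until the game ends
        while pile > 0:
            opts = legal_moves(pile)
            total = sum(stats.get((pile, m), (0, 0))[0] for m in opts) + 1
            move = max(opts, key=lambda m: ucb(pile, m, total))
            path.append((pile, move, player))
            pile -= move
            player = 1 - player
        winner = 1 - player  # the player who just took the last token wins
        # backpropagation: credit every (state, move) chosen by the winner
        for pile, move, mover in path:
            v, w = stats.get((pile, move), (0, 0))
            stats[(pile, move)] = (v + 1, w + (mover == winner))

    # recommend the most-visited move at the root
    return max(legal_moves(n), key=lambda m: stats.get((n, m), (0, 0))[0])

print(uct_best_move(5))
```

From a pile of 5 the winning move is to take 2, leaving the opponent a losing pile of 3; in AlphaLLM the analogous "moves" are generation steps scored by critic models instead of game outcomes.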
https://arxiv.org/abs/2404.12253
The recent emergence of deep learning has led to a great deal of work on designing supervised deep semantic segmentation algorithms. Since sufficient pixel-level labels are very difficult to obtain for many tasks, we propose a method which combines a Gaussian mixture model (GMM) with unsupervised deep learning techniques. In the standard GMM, the pixel values within each sub-region are modelled by a Gaussian distribution. In order to identify the different regions, the parameter vector that minimizes the negative log-likelihood (NLL) function of the GMM has to be approximated. For this task, iterative optimization methods such as the expectation-maximization (EM) algorithm are usually used. In this paper, we propose to estimate these parameters directly from the image using a convolutional neural network (CNN). We thus change the iterative procedure of the EM algorithm, replacing the expectation step by a gradient step with respect to the network's parameters. This means that the network is trained to minimize the NLL function of the GMM, which comes with at least two advantages. First, once trained, the network can predict label probabilities very quickly compared with time-consuming iterative optimization methods. Second, due to the deep image prior, our method is able to partially overcome one of the main disadvantages of GMMs, namely that they assume independence between neighboring pixels and thus ignore the correlation between them. We demonstrate the advantages of our method in various experiments on the example of myocardial infarct segmentation on multi-sequence MRI images.
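For concreteness, the classical EM baseline that the proposed CNN replaces looks as follows for a two-component 1-D GMM; the CNN-based gradient step itself is not sketched here, and the synthetic data is invented for illustration.

```python
import math, random

def em_gmm_1d(data, iters=60):
    """EM for a two-component 1-D Gaussian mixture.
    Returns (weights, means, variances)."""
    mu = [min(data), max(data)]            # deterministic, well-spread init
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibilities r[i][j] = P(component j | x_i)
        resp = []
        for x in data:
            dens = [pi[j] * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                    / math.sqrt(2 * math.pi * var[j]) for j in range(2)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: re-estimate weights, means, variances from responsibilities
        for j in range(2):
            nj = sum(r[j] for r in resp)
            pi[j] = nj / len(data)
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var[j] = sum(r[j] * (x - mu[j]) ** 2
                         for r, x in zip(resp, data)) / nj + 1e-6
    return pi, mu, var

rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(5, 1) for _ in range(200)]
pi, mu, var = em_gmm_1d(data)
print(sorted(round(m, 2) for m in mu))
```

The paper's approach keeps this NLL objective but replaces the iterative E-step with gradient steps on a CNN's parameters, so the responsibilities come from a single forward pass at inference time.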
https://arxiv.org/abs/2404.12252
The study of human emotions, traditionally a cornerstone of fields like psychology and neuroscience, has been profoundly impacted by the advent of artificial intelligence (AI). Multiple channels, such as speech (voice) and facial expressions (image), are crucial in understanding human emotions. However, AI's journey in multimodal emotion recognition (MER) is marked by substantial technical challenges. One significant hurdle is how AI models manage the absence of a particular modality, a frequent occurrence in real-world situations. This study's central focus is assessing the performance and resilience of two strategies when confronted with the lack of one modality: a novel multimodal dynamic modality and view selection method, and a cross-attention mechanism. Results on the RECOLA dataset show that dynamic selection-based methods are a promising approach for MER. In the missing-modality scenarios, all dynamic selection-based methods outperformed the baseline. The study concludes by emphasizing the intricate interplay between audio and video modalities in emotion prediction, showcasing the adaptability of dynamic selection methods in handling missing modalities.
https://arxiv.org/abs/2404.12251
Anomaly detection and localization in images is a growing field in computer vision. In this area, a seemingly understudied problem is anomaly clustering, i.e., identifying and grouping different types of anomalies in a fully unsupervised manner. In this work, we propose a novel method for clustering anomalies in largely stationary images (textures) in a blind setting. That is, the input consists of normal and anomalous images without distinction and without labels. What contributes to the difficulty of the task is that anomalous regions are often small and may present only subtle changes in appearance, which can be easily overshadowed by the genuine variance in the texture. Moreover, each anomaly type may have a complex appearance distribution. We introduce a novel scheme for solving this task using a combination of blind anomaly localization and contrastive learning. By identifying the anomalous regions with high fidelity, we can restrict our focus to those regions of interest; then, contrastive learning is employed to increase the separability of different anomaly types and reduce the intra-class variation. Our experiments show that the proposed solution yields significantly better results compared to prior work, setting a new state of the art. Project page: this https URL.
https://arxiv.org/abs/2404.12246
Extracting structured event knowledge, including event triggers and corresponding arguments, from military texts is fundamental to many applications, such as intelligence analysis and decision assistance. However, event extraction in the military field faces a data scarcity problem, which impedes research on event extraction models in this domain. To alleviate this problem, we propose CMNEE, a large-scale, document-level, open-source Chinese Military News Event Extraction dataset. It contains 17,000 documents and 29,223 events, all manually annotated based on a pre-defined schema for the military domain comprising 8 event types and 11 argument role types. We designed a two-stage, multi-turn annotation strategy to ensure the quality of CMNEE and reproduced several state-of-the-art event extraction models with a systematic evaluation. The experimental results on CMNEE are noticeably lower than those on datasets from other domains, which demonstrates that event extraction for the military domain poses unique challenges and requires further research effort. Our code and data can be obtained from this https URL.
https://arxiv.org/abs/2404.12242
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark.
https://arxiv.org/abs/2404.12241
Accurate spatio-temporal information about the current situation is crucial for smart city applications such as modern routing algorithms. Often, this information describes the state of stationary resources, e.g. the availability of parking bays, charging stations, or the number of people waiting for a vehicle to pick them up near a given location. To exploit this kind of information, predicting future states of the monitored resources is often mandatory because a resource might change its state within the time until it is needed. To train an accurate predictive model, it is often not possible to obtain a continuous time series on the state of the resource. For example, the information might be collected from traveling agents visiting the resource with an irregular frequency. Thus, it is necessary to develop methods which work on sparse observations for training and prediction. In this paper, we propose time-inhomogeneous discrete Markov models to allow accurate prediction even when observations are very rare. Our new model is able to blend recent observations with historic data and also provides useful probabilistic estimates for future states. Since resource availability in a city is typically time-dependent, our Markov model is time-inhomogeneous and cyclic within a predefined time interval. To train our model, we propose a modified Baum-Welch algorithm. Evaluations on real-world datasets of parking bay availability show that our new method indeed yields good results compared to methods trained on complete data and to non-cyclic variants.
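The prediction step of such a model can be sketched as chaining per-slot transition matrices across the gap since the last observation. The matrices and the two-slot "day/night" cycle below are made up for illustration, and the modified Baum-Welch training on sparse observations is omitted.

```python
def predict(p0, slot0, steps, P_by_slot):
    """Propagate a state distribution `steps` slots forward using cyclic
    per-slot transition matrices (row-stochastic, indexed by time slot)."""
    T = len(P_by_slot)
    p = list(p0)
    for k in range(steps):
        P = P_by_slot[(slot0 + k) % T]   # the cycle wraps around
        p = [sum(p[i] * P[i][j] for i in range(len(p)))
             for j in range(len(p))]
    return p

# Two states (0 = free, 1 = occupied); a hypothetical 2-slot daily cycle.
P_day   = [[0.6, 0.4], [0.3, 0.7]]   # parking tends to fill up by day
P_night = [[0.9, 0.1], [0.6, 0.4]]   # and to empty out at night
p = predict([1.0, 0.0], slot0=0, steps=2, P_by_slot=[P_day, P_night])
print(p)
```

Starting from a bay observed free at the day slot, two propagation steps yield the probabilistic estimate the abstract describes, without needing an observation at every intermediate slot.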
https://arxiv.org/abs/2404.12240
This study introduces De-DSI, a novel framework that fuses large language models (LLMs) with genuine decentralization for information retrieval, particularly employing the differentiable search index (DSI) concept in a decentralized setting. Focused on efficiently connecting novel user queries with document identifiers without direct document access, De-DSI operates solely on query-docid pairs. To enhance scalability, an ensemble of DSI models is introduced, where the dataset is partitioned into smaller shards for individual model training. This approach not only maintains accuracy, by reducing the amount of data each model needs to handle, but also facilitates scalability by aggregating outcomes from multiple models. The aggregation uses beam search to identify top docids and applies a softmax function for score normalization, selecting the documents with the highest scores for retrieval. The decentralized implementation demonstrates that retrieval success is comparable to centralized methods, with the added benefit of distributing computational complexity across the network. This setup also allows for the retrieval of multimedia items through magnet links, eliminating the need for platforms or intermediaries.
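The aggregation step described above, softmax-normalizing each shard model's beam scores and summing per docid, can be sketched as follows; the scores and docids are hypothetical, and the beam search inside each DSI model is not reproduced.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a dict of docid -> raw score."""
    mx = max(scores.values())
    exps = {d: math.exp(s - mx) for d, s in scores.items()}
    z = sum(exps.values())
    return {d: e / z for d, e in exps.items()}

def aggregate(per_model_scores, top_k=2):
    """Normalize each shard model's beam scores with softmax, then sum the
    probabilities per docid across models and keep the top_k docids."""
    totals = {}
    for scores in per_model_scores:
        for docid, p in softmax(scores).items():
            totals[docid] = totals.get(docid, 0.0) + p
    return sorted(totals, key=totals.get, reverse=True)[:top_k]

# Hypothetical raw beam-search scores from three shard models.
per_model = [
    {"doc7": 2.1, "doc3": 0.4},
    {"doc7": 1.8, "doc9": 1.5},
    {"doc3": 0.9, "doc9": 1.1},
]
print(aggregate(per_model))
```

Because each model only sees its own shard, the softmax puts the per-model scores on a common probability scale before they are summed across the ensemble.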
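The ensemble aggregation step described above (softmax-normalize each shard model's beam-search scores, then combine across models) can be sketched like this. All names, scores, and docids are illustrative assumptions, and the per-shard beam search itself is elided; only the score normalization and aggregation are shown:

```python
import math
from collections import defaultdict

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def aggregate(model_outputs, top_k=3):
    """Combine ranked (docid, score) lists from several shard models.

    Each shard model contributes its beam-search candidates; scores are
    softmax-normalized per model so shards are comparable, then summed
    per docid. The highest-scoring docids are returned for retrieval.
    """
    totals = defaultdict(float)
    for ranking in model_outputs:
        probs = softmax([score for _, score in ranking])
        for (docid, _), p in zip(ranking, probs):
            totals[docid] += p
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Two hypothetical shard models, each returning top-2 beam candidates:
shard_a = [("d1", 2.0), ("d2", 1.0)]
shard_b = [("d1", 1.5), ("d3", 0.5)]
print(aggregate([shard_a, shard_b], top_k=2))  # "d1" ranks first
```

A docid surfaced by several shards accumulates probability mass, which is what lets the ensemble recover accuracy despite each model seeing only a partition of the data.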
https://arxiv.org/abs/2404.12237
Understanding how attention varies across individuals has significant scientific and societal impacts. However, existing visual scanpath models treat attention uniformly, neglecting individual differences. To bridge this gap, this paper focuses on individualized scanpath prediction (ISP), a new attention modeling task that aims to accurately predict how different individuals shift their attention in diverse visual tasks. It proposes an ISP method featuring three novel technical components: (1) an observer encoder to characterize and integrate an observer's unique attention traits, (2) an observer-centric feature integration approach that holistically combines visual features, task guidance, and observer-specific characteristics, and (3) an adaptive fixation prioritization mechanism that refines scanpath predictions by dynamically prioritizing semantic feature maps based on individual observers' attention traits. These novel components allow scanpath models to effectively address the attention variations across different observers. Our method is generally applicable to different datasets, model architectures, and visual tasks, offering a comprehensive tool for transforming general scanpath models into individualized ones. Comprehensive evaluations using value-based and ranking-based metrics verify the method's effectiveness and generalizability.
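The adaptive fixation prioritization idea, re-weighting semantic feature maps by an observer's attention traits before choosing the next fixation, can be sketched minimally. This is a loose illustration under stated assumptions, not the paper's architecture: `prioritize`, the trait-weight vector, and the argmax fixation rule are all hypothetical simplifications of what a learned model would do:

```python
import math

def prioritize(feature_maps, trait_weights):
    """Combine semantic feature maps into one observer-specific priority map.

    feature_maps  : list of equally sized 2-D maps (rows x cols).
    trait_weights : one raw score per map, assumed to be derived from the
                    observer's attention-trait encoding; softmax-normalized
                    here so the maps are blended as a convex combination.
    """
    m = max(trait_weights)
    exps = [math.exp(w - m) for w in trait_weights]
    z = sum(exps)
    weights = [e / z for e in exps]
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    combined = [[sum(w * fm[r][c] for w, fm in zip(weights, feature_maps))
                 for c in range(cols)] for r in range(rows)]
    # Toy fixation rule: attend to the highest-priority location.
    best = max(((r, c) for r in range(rows) for c in range(cols)),
               key=lambda rc: combined[rc[0]][rc[1]])
    return best, combined

# Two 2x2 semantic maps; this observer's traits strongly favor the first.
faces = [[1.0, 0.0], [0.0, 0.0]]
text = [[0.0, 0.0], [0.0, 1.0]]
print(prioritize([faces, text], [5.0, 0.0])[0])  # fixates (0, 0)
```

The observer-dependence enters only through `trait_weights`: two observers scanning the same image with different trait vectors produce different priority maps and hence different predicted scanpaths.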
https://arxiv.org/abs/2404.12235
Medication recommendation systems are designed to deliver personalized drug suggestions that are closely aligned with individual patient needs. Previous studies have primarily concentrated on developing medication embeddings, achieving significant progress. Nonetheless, these approaches often fall short in accurately reflecting individual patient profiles, mainly due to challenges in distinguishing between various patient conditions and the inability to establish precise correlations between specific conditions and appropriate medications. In response to these issues, we introduce DisMed, a model that focuses on patient conditions to enhance personalization. DisMed employs causal inference to discern clear, quantifiable causal links. It then examines patient conditions in depth, recognizing and adapting to the evolving nuances of these conditions, and mapping them directly to corresponding medications. Additionally, DisMed leverages data from multiple patient visits to propose combinations of medications. Comprehensive testing on real-world datasets demonstrates that DisMed not only improves the customization of patient profiles but also surpasses leading models in both precision and safety.
https://arxiv.org/abs/2404.12228