Autoregressive sequence models, such as Transformer-based vision-language-action (VLA) policies, can be tremendously effective for capturing complex and generalizable robotic behaviors. However, such models require us to choose a tokenization of our continuous action signals, which determines how the discrete symbols predicted by the model map to continuous robot actions. We find that current approaches for robot action tokenization, based on simple per-dimension, per-timestep binning schemes, typically perform poorly when learning dexterous skills from high-frequency robot data. To address this challenge, we propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Our tokenization approach, Frequency-space Action Sequence Tokenization (FAST), enables us to train autoregressive VLAs for highly dexterous and high-frequency tasks where standard discretization methods fail completely. Based on FAST, we release FAST+, a universal robot action tokenizer, trained on 1M real robot action trajectories. It can be used as a black-box tokenizer for a wide range of robot action sequences, with diverse action spaces and control frequencies. Finally, we show that, when combined with the pi0 VLA, our method can scale to training on 10k hours of robot data and match the performance of diffusion VLAs, while reducing training time by up to 5x.
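A minimal numpy sketch of the compression idea, not the authors' actual pipeline (the full FAST tokenizer involves further compression steps not shown, and the function names here are invented): take an orthonormal DCT of each action dimension over a chunk, quantize the coefficients, and treat the rounded integers as tokens. Smooth high-frequency trajectories concentrate energy in a few low-frequency coefficients, so most tokens come out zero.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    C[0] /= np.sqrt(2.0)
    return C

def tokenize(actions, scale=50.0):
    """Per-dimension DCT of a (timesteps, dims) chunk, then coarse rounding
    of the coefficients to integer tokens for an autoregressive model."""
    C = dct_matrix(actions.shape[0])
    return np.round(C @ actions * scale).astype(int)

def detokenize(tokens, scale=50.0):
    """Undo the quantization and invert the (orthonormal) DCT."""
    C = dct_matrix(tokens.shape[0])
    return C.T @ (tokens / scale)

# A smooth trajectory sampled at high frequency: after the DCT, most
# quantized coefficients are zero, so the token sequence is highly sparse.
t = np.linspace(0.0, 1.0, 50)
actions = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)
tokens = tokenize(actions)
recon = detokenize(tokens)
```

Detokenization is lossy, but for smooth trajectories the reconstruction error is bounded by the quantization step, while naive per-timestep binning would spend one token on every dimension of every timestep.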
https://arxiv.org/abs/2501.09747
The objective of BioCreative8 Track 3 is to extract phenotypic key medical findings embedded within EHR texts and subsequently normalize these findings to their Human Phenotype Ontology (HPO) terms. However, the presence of diverse surface forms in phenotypic findings makes it challenging to accurately normalize them to the correct HPO terms. To address this challenge, we explored various models for named entity recognition and implemented data augmentation techniques such as synonym marginalization to enhance the normalization step. Our pipeline resulted in an exact extraction and normalization F1 score 2.6% higher than the mean score of all submissions received in response to the challenge. Furthermore, in terms of the normalization F1 score, our approach surpassed the average performance by 1.9%. These findings contribute to the advancement of automated medical data extraction and normalization techniques, showcasing potential pathways for future research and application in the biomedical domain.
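To illustrate what marginalizing over synonyms buys during normalization, here is a toy sketch: each HPO concept is scored by the best similarity between the mention and any of the concept's synonyms, so unusual surface forms still reach the right term. The Jaccard similarity and the tiny two-concept ontology are stand-ins for the paper's learned models and the real HPO.

```python
def token_set(s):
    return set(s.lower().split())

def jaccard(a, b):
    """Word-overlap similarity between two strings (toy stand-in for a
    learned similarity model)."""
    A, B = token_set(a), token_set(b)
    return len(A & B) / len(A | B) if A | B else 0.0

def normalize_mention(mention, ontology):
    """Marginalize over each concept's synonyms (here: take the max
    similarity) and return the best-scoring HPO concept ID."""
    best_id, best_score = None, -1.0
    for hpo_id, synonyms in ontology.items():
        score = max(jaccard(mention, syn) for syn in synonyms)
        if score > best_score:
            best_id, best_score = hpo_id, score
    return best_id

# Two-concept toy ontology; each concept lists several surface forms.
ontology = {
    "HP:0001945": ["fever", "pyrexia", "elevated body temperature"],
    "HP:0002315": ["headache", "head pain", "cephalalgia"],
}
```

A mention like "pain in the head" matches none of the canonical names exactly, but the synonym "head pain" carries it to the correct concept.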
https://arxiv.org/abs/2501.09744
Existing video anomaly detection datasets are inadequate for representing complex anomalies that occur due to the interactions between objects. The absence of complex anomalies in previous video anomaly detection datasets affects research by shifting the focus onto simple anomalies. To address this problem, we introduce a new large-scale dataset: ComplexVAD. In addition, we propose a novel method to detect complex anomalies via modeling the interactions between objects using a scene graph with spatio-temporal attributes. With our proposed method and two other state-of-the-art video anomaly detection methods, we obtain baseline scores on ComplexVAD and demonstrate that our new method outperforms existing works.
https://arxiv.org/abs/2501.09733
Values or principles are key elements of human society that influence people to behave and function according to an accepted standard set of social rules to maintain social order. As AI systems become ubiquitous in human society, a major concern is that they could violate these norms or values and potentially cause harm. Thus, to prevent intentional or unintentional harm, AI systems are expected to take actions that align with these principles. Training systems to exhibit this type of behavior is difficult and often requires a specialized dataset. This work presents a multi-modal dataset illustrating normative and non-normative behavior in real-life situations described through natural language and artistic images. The training set contains curated sets of images designed to teach young children about social principles. Because these images were curated expressly to convey social principles, we argue that this is an ideal dataset for training socially normative agents.
https://arxiv.org/abs/2501.09707
The rapid deployment of autonomous AI agents creates urgent challenges around authorization, accountability, and access control in digital spaces. New standards are needed to establish on whose behalf AI agents act and to guide their use appropriately, protecting online spaces while unlocking the value of task delegation to autonomous agents. We introduce a novel framework for authenticated, authorized, and auditable delegation of authority to AI agents, in which human users can securely delegate and restrict the permissions and scope of agents while maintaining clear chains of accountability. This framework builds on existing identification and access management protocols, extending OAuth 2.0 and OpenID Connect with agent-specific credentials and metadata while maintaining compatibility with established authentication and web infrastructure. Further, we propose a framework for translating flexible, natural-language permissions into auditable access control configurations, enabling robust scoping of AI agent capabilities across diverse interaction modalities. Taken together, this practical approach facilitates immediate deployment of AI agents while addressing key security and accountability concerns. It works toward ensuring that agentic AI systems perform only appropriate actions, and it gives digital service providers a tool for enabling AI agent interactions without risking harm from scalable interaction.
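As a concrete sketch of agent-specific delegation metadata, the snippet below builds an unsigned OAuth-style token payload plus a scope check. The `act` (actor) claim is borrowed from RFC 8693 token exchange; the remaining claim layout and function names are illustrative assumptions, not something the paper specifies.

```python
import time

def issue_delegation(user, agent_id, scopes, ttl_s=3600):
    """Build an (unsigned) token payload delegating limited authority to an
    AI agent. A real deployment would sign this as a JWT and verify it
    server-side; only the "act" claim follows an existing spec (RFC 8693)."""
    now = int(time.time())
    return {
        "sub": user,               # the human principal the agent acts for
        "act": {"sub": agent_id},  # the acting party: the AI agent
        "scope": sorted(scopes),   # explicit, auditable permission list
        "iat": now,
        "exp": now + ttl_s,
    }

def is_allowed(token, action, at=None):
    """Permit an action only if it is in scope and the token is unexpired."""
    at = int(time.time()) if at is None else at
    return action in token["scope"] and at < token["exp"]

token = issue_delegation("alice", "agent-42", {"calendar.read", "email.draft"})
```

The `sub`/`act` pair preserves the chain of accountability (who delegated, which agent acted), while the explicit scope list is what an auditable access control configuration would be compiled into.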
https://arxiv.org/abs/2501.09674
Online motion planning is a challenging problem for intelligent robots moving in dense environments with dynamic obstacles, e.g., crowds. In this work, we propose a novel approach for optimal and safe online motion planning with minimal information about dynamic obstacles. Specifically, our approach requires only the current position of the obstacles and their maximum speed, but it does not need any information about their exact trajectories or dynamic model. The proposed methodology combines Monte Carlo Tree Search (MCTS), for online optimal planning via model simulations, with Velocity Obstacles (VO), for obstacle avoidance. We perform experiments in a cluttered simulated environment with walls, and up to 40 dynamic obstacles moving with random velocities and directions. With an ablation study, we show the key contribution of VO in scaling up the efficiency of MCTS, selecting the safest and most rewarding actions in the tree of simulations. Moreover, we show the superiority of our methodology with respect to state-of-the-art planners, including Non-linear Model Predictive Control (NMPC), in terms of improved collision rate, computational and task performance.
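A conservative velocity check in the spirit of Velocity Obstacles, using only the information the paper assumes available (each obstacle's current position and maximum speed): a candidate robot velocity is rejected if, at any sampled time within the horizon, the robot's straight-line motion could enter the obstacle's reachable disc. This is an illustrative variant with invented names, not the authors' exact VO formulation.

```python
import math

def may_collide(p_robot, v_robot, p_obs, v_obs_max, radius, horizon, steps=50):
    """With no trajectory or dynamic model, the obstacle's position at time t
    can only be bounded by a disc of radius v_obs_max * t around its current
    position. Reject v_robot if the robot ever enters that disc (inflated by
    the combined safety radius) before the horizon."""
    for i in range(1, steps + 1):
        t = horizon * i / steps
        rx = p_robot[0] + v_robot[0] * t
        ry = p_robot[1] + v_robot[1] * t
        if math.hypot(rx - p_obs[0], ry - p_obs[1]) <= v_obs_max * t + radius:
            return True   # some admissible obstacle motion could reach us
    return False          # safe against every obstacle motion up to the horizon

# Heading straight at an obstacle 5 m away is unsafe; heading away is safe.
towards = may_collide((0.0, 0.0), (1.0, 0.0), (5.0, 0.0), 0.5, 0.5, 4.0)
away = may_collide((0.0, 0.0), (-1.0, 0.0), (5.0, 0.0), 0.5, 0.5, 4.0)
```

Inside an MCTS loop, a check of this kind is what lets the tree discard unsafe actions cheaply instead of simulating them to a collision.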
https://arxiv.org/abs/2501.09649
In today's assistant landscape, personalisation enhances interactions, fosters long-term relationships, and deepens engagement. However, many systems struggle with retaining user preferences, leading to repetitive user requests and disengagement. Furthermore, the unregulated and opaque extraction of user preferences in industry applications raises significant concerns about privacy and trust, especially in regions with stringent regulations like Europe. In response to these challenges, we propose a long-term memory system for voice assistants, structured around predefined categories. This approach leverages Large Language Models to efficiently extract, store, and retrieve preferences within these categories, ensuring both personalisation and transparency. We also introduce a synthetic multi-turn, multi-session conversation dataset (CarMem), grounded in real industry data, tailored to an in-car voice assistant setting. Benchmarked on the dataset, our system achieves an F1-score of .78 to .95 in preference extraction, depending on category granularity. Our maintenance strategy reduces redundant preferences by 95% and contradictory ones by 92%, while the accuracy of optimal retrieval is at .87. Collectively, the results demonstrate the system's suitability for industrial applications.
https://arxiv.org/abs/2501.09645
The pivotal shift from traditional paper-based records to sophisticated Electronic Health Records (EHR) enabled systematic collection and analysis of patient data through descriptive statistics, providing insight into patterns and trends across patient populations. This evolution continued toward predictive analytics, allowing healthcare providers to anticipate patient outcomes and potential complications before they occur. This progression from basic digital record-keeping to sophisticated predictive modelling and digital twins reflects healthcare's broader evolution toward more integrated, patient-centred approaches that combine data-driven insights with personalized care delivery. This chapter explores the evolution and significance of healthcare information systems, beginning with an examination of the implementation of EHR in the UK and the USA. It provides a comprehensive overview of the International Classification of Diseases (ICD) system, tracing its development from ICD-9 to ICD-10. Central to this discussion is the MIMIC-III database, a landmark achievement in healthcare data sharing and arguably the most comprehensive critical care database freely available to researchers worldwide. MIMIC-III has democratized access to high-quality healthcare data, enabling unprecedented opportunities for research and analysis. The chapter examines its structure, clinical outcome analysis capabilities, and practical applications through case studies, with a particular focus on mortality and length of stay metrics, vital signs extraction, and ICD coding. Through detailed entity-relationship diagrams and practical examples, the text illustrates MIMIC's complex data structure and demonstrates how different querying approaches can lead to subtly different results, emphasizing the critical importance of understanding the database's architecture for accurate data extraction.
https://arxiv.org/abs/2501.09640
Planning for autonomous systems typically requires reasoning with models at different levels of abstraction, and the harmonization of two competing sets of objectives: high-level mission goals that refer to an interaction of the system with the external environment, and low-level platform constraints that aim to preserve the integrity and the correct interaction of the subsystems. The complicated interplay between these two models makes it very hard to reason on the system as a whole, especially when the objective is to find plans with robustness guarantees, considering the non-deterministic behavior of the lower layers of the system. In this paper, we introduce the problem of Platform-Aware Mission Planning (PAMP), addressing it in the setting of temporal durative actions. The PAMP problem differs from standard temporal planning for its exists-forall nature: the high-level plan dealing with mission goals is required to satisfy safety and executability constraints, for all the possible non-deterministic executions of the low-level model of the platform and the environment. We propose two approaches for solving PAMP. The first baseline approach amalgamates the mission and platform levels, while the second is based on an abstraction-refinement loop that leverages the combination of a planner and a verification engine. We prove the soundness and completeness of the proposed approaches and validate them experimentally, demonstrating the importance of heterogeneous modeling and the superiority of the technique based on abstraction-refinement.
https://arxiv.org/abs/2501.09632
With the rapid advancement of deepfake generation technologies, the demand for robust and accurate face forgery detection algorithms has become increasingly critical. Recent studies have demonstrated that wavelet analysis can uncover subtle forgery artifacts that remain imperceptible in the spatial domain. Wavelets effectively capture important facial contours, which are often slender, fine-grained, and global in nature. However, existing wavelet-based approaches fail to fully leverage these unique characteristics, resulting in sub-optimal feature extraction and limited generalizability. To address this challenge, we introduce WMamba, a novel wavelet-based feature extractor built upon the Mamba architecture. WMamba maximizes the utility of wavelet information through two key innovations. First, we propose Dynamic Contour Convolution (DCConv), which employs specially crafted deformable kernels to adaptively model slender facial contours. Second, by leveraging the Mamba architecture, our method captures long-range spatial relationships with linear computational complexity. This efficiency allows for the extraction of fine-grained, global forgery artifacts from small image patches. Extensive experimental results show that WMamba achieves state-of-the-art (SOTA) performance, highlighting its effectiveness and superiority in face forgery detection.
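For intuition about why wavelet subbands expose contours, here is a one-level 2D Haar decomposition in numpy, a simpler stand-in for whatever wavelet family the paper actually uses: detail subbands are exactly zero on flat regions and activate only along edges, which is the kind of slender, fine-grained cue a detector like WMamba consumes.

```python
import numpy as np

def haar2d(img):
    """One-level 2D Haar transform: approximation (LL) plus horizontal,
    vertical, and diagonal detail subbands (LH, HL, HH)."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # average of row pairs
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # difference of row pairs
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0      # responds to vertical edges
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0      # responds to horizontal edges
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

# A vertical contour: only the matching detail subband activates,
# while flat regions contribute nothing.
img = np.zeros((8, 8))
img[:, 5:] = 1.0
ll, lh, hl, hh = haar2d(img)
```

On a forged face, blending boundaries leave similar localized detail-band energy even when the spatial-domain image looks clean.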
https://arxiv.org/abs/2501.09617
De-identification of medical images is a critical step to ensure privacy during data sharing in research and clinical settings. The initial step in this process involves detecting Protected Health Information (PHI), which can be found in image metadata or imprinted within image pixels. Despite the importance of such systems, there has been limited evaluation of existing AI-based solutions, creating barriers to the development of reliable and robust tools. In this study, we present an AI-based pipeline for PHI detection, comprising three key components: text detection, text extraction, and analysis of PHI content in medical images. By experimenting with exchanging roles of vision and language models within the pipeline, we evaluate the performance and recommend the best setup for the PHI detection task.
https://arxiv.org/abs/2501.09552
In this paper, we elaborate on how AI can support diversity and inclusion and exemplify research projects conducted in that direction. We start by looking at the challenges and progress in making large language models (LLMs) more transparent, inclusive, and aware of social biases. Even though LLMs like ChatGPT have impressive abilities, they struggle to understand different cultural contexts and engage in meaningful, human-like conversations. A key issue is that biases in language processing, especially in machine translation, can reinforce inequality. Tackling these biases requires a multidisciplinary approach to ensure AI promotes diversity, fairness, and inclusion. We also highlight AI's role in identifying biased content in media, which is important for improving representation. By detecting unequal portrayals of social groups, AI can help challenge stereotypes and create more inclusive technologies. Transparent AI algorithms, which clearly explain their decisions, are essential for building trust and reducing bias in AI systems. We also stress that AI systems need diverse and inclusive training data. Projects like the Child Growth Monitor show how using a wide range of data can help address real-world problems like malnutrition and poverty. We present a project that demonstrates how AI can be applied to monitor the role of search engines in spreading disinformation about the LGBTQ+ community. Moreover, we discuss the SignON project as an example of how technology can bridge communication gaps between hearing and deaf people, emphasizing the importance of collaboration and mutual trust in developing inclusive AI. Overall, with this paper, we advocate for AI systems that are not only effective but also socially responsible, promoting fair and inclusive interactions between humans and machines.
https://arxiv.org/abs/2501.09534
We present a method for augmenting a Large Language Model (LLM) with a combination of text and visual data to enable accurate question answering in visualization of scientific data, making conversational visualization possible. LLMs struggle with tasks like visual data interaction, as they lack contextual visual information. We address this problem by merging a text description of a visualization and dataset with snapshots of the visualization. We extract their essential features into a structured text file that is highly compact, yet descriptive enough to appropriately augment the LLM with contextual information, without any fine-tuning. This approach can be applied to any visualization that has already been rendered, as long as it is associated with some textual description.
https://arxiv.org/abs/2501.09521
Understanding emotions accurately is essential for fields like human-computer interaction. Due to the complexity of emotions and their multi-modal nature (e.g., emotions are influenced by facial expressions and audio), researchers have turned to using multi-modal models to understand human emotions rather than single-modality. However, current video multi-modal large language models (MLLMs) encounter difficulties in effectively integrating audio and identifying subtle facial micro-expressions. Furthermore, the lack of detailed emotion analysis datasets also limits the development of multimodal emotion analysis. To address these issues, we introduce a self-reviewed dataset and a human-reviewed dataset, comprising 24,137 coarse-grained samples and 3,500 manually annotated samples with detailed emotion annotations, respectively. These datasets allow models to learn from diverse scenarios and better generalize to real-world applications. Moreover, in addition to the audio modeling, we propose to explicitly integrate facial encoding models into the existing advanced Video MLLM, enabling the MLLM to effectively unify audio and the subtle facial cues for emotion understanding. By aligning these features within a unified space and employing instruction tuning in our proposed datasets, our Omni-Emotion achieves state-of-the-art performance in both emotion recognition and reasoning tasks.
https://arxiv.org/abs/2501.09502
Online medical consultation (OMC) restricts doctors to gathering patient information solely through inquiries, making the already complex sequential decision-making process of diagnosis even more challenging. Recently, the rapid advancement of large language models has demonstrated a significant potential to transform OMC. However, most studies have primarily focused on improving diagnostic accuracy under conditions of relatively sufficient information, while paying limited attention to the "inquiry" phase of the consultation process. This lack of focus has left the relationship between "inquiry" and "diagnosis" insufficiently explored. In this paper, we first extract real patient interaction strategies from authentic doctor-patient conversations and use these strategies to guide the training of a patient simulator that closely mirrors real-world behavior. By inputting medical records into our patient simulator to simulate patient responses, we conduct extensive experiments to explore the relationship between "inquiry" and "diagnosis" in the consultation process. Experimental results demonstrate that inquiry and diagnosis adhere to Liebig's law of the minimum: poor inquiry quality limits the effectiveness of diagnosis, regardless of diagnostic capability, and vice versa. Furthermore, the experiments reveal significant differences in the inquiry performance of various models. To investigate this phenomenon, we categorize the inquiry process into four types: (1) chief complaint inquiry; (2) specification of known symptoms; (3) inquiry about accompanying symptoms; and (4) gathering family or medical history. We analyze the distribution of inquiries across the four types for different models to explore the reasons behind their significant performance differences. We plan to open-source the weights and related code of our patient simulator at this https URL.
https://arxiv.org/abs/2501.09484
Stereo matching is a key technique for metric depth estimation in computer vision and robotics. Real-world challenges like occlusion and non-texture hinder accurate disparity estimation from binocular matching cues. Recently, monocular relative depth estimation has shown remarkable generalization using vision foundation models. Thus, to facilitate robust stereo matching with monocular depth cues, we incorporate a robust monocular relative depth model into the recurrent stereo-matching framework, building a new framework for depth foundation model-based stereo-matching, DEFOM-Stereo. In the feature extraction stage, we construct the combined context and matching feature encoder by integrating features from conventional CNNs and DEFOM. In the update stage, we use the depth predicted by DEFOM to initialize the recurrent disparity and introduce a scale update module to refine the disparity at the correct scale. DEFOM-Stereo is verified to have performance comparable to state-of-the-art (SOTA) methods on the Scene Flow dataset, and it notably shows much stronger zero-shot generalization. Moreover, DEFOM-Stereo achieves SOTA performance on the KITTI 2012, KITTI 2015, Middlebury, and ETH3D benchmarks, ranking 1st on many metrics. In the joint evaluation under the robust vision challenge, our model simultaneously outperforms previous models on the individual benchmarks. Both results demonstrate the outstanding capabilities of the proposed model.
https://arxiv.org/abs/2501.09466
We propose a novel architecture for graph-based dependency parsing that explicitly constructs vectors from which both arcs and labels are scored. Our method addresses key limitations of the standard two-pipeline approach by unifying arc scoring and labeling into a single network, reducing scalability issues caused by the information bottleneck and lack of parameter sharing. Additionally, our architecture uses transformer layers to overcome limited arc interactions, efficiently simulating higher-order dependencies. Experiments on PTB and UD show that our model outperforms state-of-the-art parsers in both accuracy and efficiency.
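A schematic numpy sketch of the shared-vector idea, with the exact parameterization assumed rather than taken from the paper: one set of explicitly constructed arc vectors feeds both the arc scorer and the label scorer, so the two tasks share parameters instead of running as two separate pipelines.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_labels = 5, 16, 3            # tokens, vector width, dependency labels

H = rng.normal(size=(n, d))          # token encodings (random stand-ins for an encoder)
W_head = rng.normal(size=(d, d))     # shared projections: every candidate arc
W_dep = rng.normal(size=(d, d))      # gets one explicitly constructed vector
heads, deps = H @ W_head, H @ W_dep

arc_vecs = heads[:, None, :] + deps[None, :, :]   # (head, dependent, d)

w_arc = rng.normal(size=(d,))
W_label = rng.normal(size=(d, n_labels))
arc_scores = arc_vecs @ w_arc        # (n, n): one score per candidate arc
label_scores = arc_vecs @ W_label    # (n, n, n_labels): labels from the SAME vectors

best_head = arc_scores.argmax(axis=0)      # greedy head choice per dependent
best_label = label_scores.argmax(axis=-1)  # label for every candidate arc
```

Because `arc_vecs` is shared, label information flows through the same parameters that score arcs, which is the bottleneck-removal the abstract refers to.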
https://arxiv.org/abs/2501.09451
Throughout history, humans have created remarkable works of art, but artificial intelligence has only recently started to make strides in generating visually compelling art. Breakthroughs in the past few years have focused on using convolutional neural networks (CNNs) to separate and manipulate the content and style of images, applying texture synthesis techniques. Nevertheless, a number of current techniques continue to encounter obstacles, including lengthy processing times, restricted choices of style images, and the inability to modify the weight ratio of styles. To address these constraints, we propose a neural style transfer system that can add various artistic styles to a desired image, allowing flexible adjustment of style weight ratios and reducing processing time. The system uses the VGG19 model for feature extraction, ensuring high-quality, flexible stylization without compromising content integrity.
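The style side of such a system typically rests on Gram-matrix statistics of VGG19 feature maps. The sketch below shows that core computation on a random array standing in for real VGG19 activations (which would require a deep-learning framework); weighting one such loss term per style image is one way a style weight ratio can be exposed, though the paper's exact formulation is not given here.

```python
import numpy as np

def gram_matrix(features):
    """Channel-wise correlations of a (channels, height, width) feature map:
    the classic style representation in neural style transfer."""
    c, h, w = features.shape
    F = features.reshape(c, h * w)
    return (F @ F.T) / (h * w)   # (c, c), normalized by spatial size

def style_loss(f_generated, f_style):
    """Mean squared Gram-matrix difference; a total style objective would sum
    weighted terms like this, one per style image, giving adjustable ratios."""
    return float(np.mean((gram_matrix(f_generated) - gram_matrix(f_style)) ** 2))

# Random stand-in for a VGG19 feature map (e.g. an intermediate conv layer).
feats = np.random.default_rng(1).normal(size=(4, 8, 8))
```

Because the Gram matrix discards spatial layout and keeps only channel correlations, minimizing this loss transfers texture and style while a separate content loss preserves image structure.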
https://arxiv.org/abs/2501.09420
Electronic Health Record (EHR) tables pose unique challenges, among which is the presence of hidden contextual dependencies between medical features, compounded by high data dimensionality and sparsity. This study presents the first investigation into the abilities of LLMs to comprehend EHRs for patient data extraction and retrieval. We conduct extensive experiments using the MIMICSQL dataset to explore how prompt structure, instruction, context, and demonstrations affect the task performance of two backbone LLMs, Llama2 and Meditron. Through quantitative and qualitative analyses, our findings show that optimal feature selection and serialization methods can enhance task performance by up to 26.79% compared to naive approaches. Similarly, in-context learning setups with relevant example selection improve data extraction performance by 5.95%. Based on our findings, we propose guidelines that we believe would help the design of LLM-based models to support health search.
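As a toy illustration of the serialization step the study varies, the snippet below turns selected fields of a patient record into prompt text. The field names, values, and template are invented for illustration and are not taken from MIMICSQL; the point is only that both which features are kept and how they are rendered are free choices that the study found can swing performance substantially.

```python
def serialize_record(record, features, template="{name} is {value}"):
    """Render the selected EHR features as prompt text, skipping any feature
    missing from the record (EHR tables are sparse)."""
    lines = [template.format(name=k, value=record[k]) for k in features if k in record]
    return "; ".join(lines)

# Hypothetical patient record and feature selection.
record = {"age": 67, "gender": "F", "diagnosis": "sepsis", "heart_rate": 112}
prompt_ctx = serialize_record(record, ["age", "diagnosis", "heart_rate"])
```

Swapping the `template` (e.g. `"{name}: {value}"` or a full sentence) or the feature list produces the alternative serializations such a study would compare.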
https://arxiv.org/abs/2501.09384
Medicinal plants have been a key component in producing traditional and modern medicines, especially in the field of Ayurveda, an ancient Indian medical system. Producing these medicines, and collecting and extracting the right plants, is a crucial step due to the visually similar nature of some plants. Distinguishing these plants from non-medicinal plants requires human expert intervention. To solve the issue of accurate plant identification and reduce the need for a human expert in the collection process, employing computer vision methods is efficient and beneficial. In this paper, we propose a model that solves such issues. The proposed model is a custom convolutional neural network (CNN) architecture with 6 convolution layers, max-pooling layers, and dense layers. The model was tested on three different datasets: the Indian Medicinal Leaves Image Dataset, the MED117 Medicinal Plant Leaf Dataset, and a dataset self-curated by the authors. The proposed model achieved accuracies of 99.5%, 98.4%, and 99.7%, respectively, using various optimizers including Adam, RMSprop, and SGD with momentum.
https://arxiv.org/abs/2501.09363