As artificial intelligence (AI) becomes increasingly embedded in healthcare delivery, this chapter explores the critical aspects of developing reliable and ethical Clinical Decision Support Systems (CDSS). Beginning with the fundamental transition from traditional statistical models to sophisticated machine learning approaches, this work examines rigorous validation strategies and performance assessment methods, including the crucial role of model calibration and decision curve analysis. The chapter emphasizes that creating trustworthy AI systems in healthcare requires more than just technical accuracy; it demands careful consideration of fairness, explainability, and privacy. The challenge of ensuring equitable healthcare delivery through AI is stressed, discussing methods to identify and mitigate bias in clinical predictive models. The chapter then delves into explainability as a cornerstone of human-centered CDSS. This focus reflects the understanding that healthcare professionals must not only trust AI recommendations but also comprehend their underlying reasoning. The discussion then advances to an analysis of privacy vulnerabilities in medical AI systems, from data leakage in deep learning models to sophisticated attacks against model explanations. The text explores privacy-preservation strategies such as differential privacy and federated learning, while acknowledging the inherent trade-offs between privacy protection and model performance. This progression, from technical validation to ethical considerations, reflects the multifaceted challenges of developing AI systems that can be seamlessly and reliably integrated into daily clinical practice while maintaining the highest standards of patient care and data protection.
https://arxiv.org/abs/2501.09628
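As a concrete illustration of the decision curve analysis the chapter discusses, the sketch below computes a model's net benefit at a given risk threshold (the standard Vickers-Elkin formulation). The toy labels and predicted probabilities are illustrative assumptions, not data from the chapter:

```python
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    # Decision curve analysis: net benefit of treating every patient whose
    # predicted risk meets `threshold`. True positives add benefit; false
    # positives subtract, weighted by the odds of the threshold.
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    n = len(y_true)
    treat = y_prob >= threshold
    tp = np.sum(treat & (y_true == 1))
    fp = np.sum(treat & (y_true == 0))
    return tp / n - (fp / n) * threshold / (1 - threshold)
```

Plotting net benefit across a range of thresholds, against the "treat all" and "treat none" strategies, is what yields the decision curve itself.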
Understanding users' product preferences is essential to the efficacy of a recommendation system. Precision marketing leverages users' historical data to discern these preferences and recommends products that align with them. However, recent browsing and purchase records might better reflect current purchasing inclinations. Transformer-based recommendation systems have made strides in sequential recommendation tasks, but they often fall short in utilizing product image style information and shopping cart data effectively. In light of this, we propose Style4Rec, a transformer-based e-commerce recommendation system that harnesses style and shopping cart information to enhance existing transformer-based sequential product recommendation systems. We tested our model on an e-commerce dataset from our partnering company and found that it exceeded established transformer-based sequential recommendation benchmarks across various evaluation metrics: HR@5 increased from 0.681 to 0.735, NDCG@5 increased from 0.594 to 0.674, and MRR@5 increased from 0.559 to 0.654. Thus, Style4Rec represents a significant step forward in personalized e-commerce recommendation systems.
https://arxiv.org/abs/2501.09354
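The metrics reported above (HR@5, NDCG@5, MRR@5) can be computed per user with one held-out target item as follows. This is a minimal sketch of the standard leave-one-out definitions, not the authors' evaluation code:

```python
import math

def rank_metrics(ranked_items, target, k=5):
    # Hit Ratio, NDCG, and MRR at cutoff k for a single user whose
    # held-out ground-truth item is `target`. Corpus-level scores are
    # the averages of these values over all test users.
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0, 0.0, 0.0
    rank = top_k.index(target) + 1          # 1-based position in the list
    hr = 1.0                                # hit anywhere in the top k
    ndcg = 1.0 / math.log2(rank + 1)        # single relevant item => IDCG = 1
    mrr = 1.0 / rank
    return hr, ndcg, mrr
```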
Accurate molecular quantification is essential for advancing research and diagnostics in fields such as infectious diseases, cancer biology, and genetic disorders. Droplet digital PCR (ddPCR) has emerged as a gold standard for achieving absolute quantification. While computational ddPCR technologies have advanced significantly, achieving automatic interpretation and consistent adaptability across diverse operational environments remains a challenge. To address these limitations, we introduce the intelligent interpretable droplet digital PCR (I2ddPCR) assay, a comprehensive framework integrating front-end predictive models (for droplet segmentation and classification) with a GPT-4o multimodal large language model (MLLM, for context-aware explanations and recommendations) to automate and enhance ddPCR image analysis. This approach surpasses state-of-the-art models, affording 99.05% accuracy in processing complex ddPCR images containing over 300 droplets per image with varying signal-to-noise ratios (SNRs). By combining specialized neural networks and large language models, the I2ddPCR assay offers a robust and adaptable solution for absolute molecular quantification, achieving a sensitivity capable of detecting low-abundance targets as low as 90.32 copies/µL. Furthermore, it improves the model's transparency through detailed explanations and troubleshooting guidance, empowering users to make informed decisions. This innovative framework has the potential to benefit molecular diagnostics, disease research, and clinical applications, especially in resource-constrained settings.
https://arxiv.org/abs/2501.09218
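Once droplets are classified as positive or negative, ddPCR's absolute quantification follows from Poisson statistics. A minimal sketch of that final step is below; the default droplet volume (0.85 nL) is a common value for commercial systems and is used here only as an illustrative assumption, not a parameter from the paper:

```python
import math

def ddpcr_concentration(n_positive, n_total, droplet_volume_ul=0.00085):
    # Poisson correction for ddPCR absolute quantification: the mean
    # copies per droplet is lambda = -ln(fraction of negative droplets),
    # and dividing by the droplet volume gives copies per microliter.
    frac_negative = (n_total - n_positive) / n_total
    lam = -math.log(frac_negative)
    return lam / droplet_volume_ul
```

For example, if exactly half of 1,000 droplets are positive, lambda is ln(2) and the concentration is ln(2) divided by the droplet volume.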
Identifying reliable synthesis pathways in materials chemistry is a complex task, particularly in polymer science, due to the intricate and often non-unique nomenclature of macromolecules. To address this challenge, we propose an agent system that integrates large language models (LLMs) and knowledge graphs (KGs). By leveraging LLMs' powerful capabilities for extracting and recognizing chemical substance names, and storing the extracted data in a structured knowledge graph, our system fully automates the retrieval of relevant literature, extraction of reaction data, database querying, construction of retrosynthetic pathway trees, further expansion through retrieval of additional literature, and recommendation of optimal reaction pathways. A novel Multi-branched Reaction Pathway Search (MBRPS) algorithm enables the exploration of all pathways, with a particular focus on multi-branched ones, helping LLMs overcome weak reasoning in multi-branched paths. This work represents the first attempt to develop a fully automated retrosynthesis planning agent tailored specifically for macromolecules and powered by LLMs. Applied to polyimide synthesis, our new approach constructs a retrosynthetic pathway tree with hundreds of pathways and recommends optimized routes, including both known and novel pathways, demonstrating its effectiveness and potential for broader applications.
https://arxiv.org/abs/2501.08897
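To make the retrosynthetic-tree idea concrete, the sketch below naively enumerates every multi-branched pathway from a target down to purchasable building blocks. This is the exhaustive search that motivates MBRPS-style pruning, not the MBRPS algorithm itself, and the reaction table is a made-up toy:

```python
from itertools import product

def enumerate_pathways(target, reactions, building_blocks):
    # `reactions` maps a product to a list of candidate reactant tuples
    # (one tuple per known reaction). A pathway is complete when every
    # leaf is a purchasable building block.
    if target in building_blocks:
        return [target]                      # single trivial pathway: buy it
    pathways = []
    for reactants in reactions.get(target, []):
        # Expand each reactant branch independently, then combine one
        # sub-pathway per branch. This cartesian product is where
        # multi-branched search grows combinatorially.
        branch_options = [enumerate_pathways(r, reactions, building_blocks)
                          for r in reactants]
        for combo in product(*branch_options):
            pathways.append((target, combo))
    return pathways
```

A real planner would additionally score each pathway (yield, cost, step count) and return only the recommended routes.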
Rapid advancements in large language models have unlocked remarkable capabilities in processing and summarizing unstructured text data. This has implications for the analysis of rich, open-ended datasets, such as survey responses, where LLMs hold the promise of efficiently distilling key themes and sentiments. However, as organizations increasingly turn to these powerful AI systems to make sense of textual feedback, a critical question arises: can we trust LLMs to accurately represent the perspectives contained within these text-based datasets? While LLMs excel at generating human-like summaries, there is a risk that their outputs may inadvertently diverge from the true substance of the original responses. Discrepancies between the LLM-generated outputs and the actual themes present in the data could lead to flawed decision-making, with far-reaching consequences for organizations. This research investigates the effectiveness of LLMs as judge models to evaluate the thematic alignment of summaries generated by other LLMs. We utilized an Anthropic Claude model to generate thematic summaries from open-ended survey responses, with Amazon's Titan Express, Nova Pro, and Meta's Llama serving as LLM judges. The LLM-as-judge approach was compared to human evaluations using Cohen's kappa, Spearman's rho, and Krippendorff's alpha, validating a scalable alternative to traditional human-centric evaluation methods. Our findings reveal that while LLMs as judges offer a scalable solution comparable to human raters, humans may still excel at detecting subtle, context-specific nuances. This research contributes to the growing body of knowledge on AI-assisted text analysis. We discuss limitations and provide recommendations for future research, emphasizing the need for careful consideration when generalizing LLM judge models across various contexts and use cases.
https://arxiv.org/abs/2501.08167
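Of the agreement statistics named above, Cohen's kappa is the simplest to sketch: it corrects raw rater agreement for the agreement expected by chance. A minimal implementation for two raters over nominal labels:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    # Chance-corrected agreement between two raters on the same items:
    # kappa = (p_observed - p_expected) / (1 - p_expected).
    n = len(ratings_a)
    p_obs = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    # Expected agreement if both raters labeled independently at their
    # own marginal rates.
    p_exp = sum(counts_a[c] * counts_b[c]
                for c in set(counts_a) | set(counts_b)) / n**2
    return (p_obs - p_exp) / (1 - p_exp)
```

Kappa of 1 means perfect agreement; 0 means agreement no better than chance, which is why it is preferred over raw percent agreement when comparing LLM judges to human raters.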
We introduce V-Trans4Style, an innovative algorithm tailored for dynamic video content editing needs. It is designed to adapt videos to different production styles like documentaries, dramas, feature films, or a specific YouTube channel's video-making technique. Our algorithm recommends optimal visual transitions to help achieve this flexibility using a more bottom-up approach. We first employ a transformer-based encoder-decoder network to learn to recommend temporally consistent and visually seamless sequences of visual transitions using only the input videos. We then introduce a style conditioning module that leverages this model to iteratively adjust the visual transitions obtained from the decoder through activation maximization. We demonstrate the efficacy of our method through experiments conducted on our newly introduced AutoTransition++ dataset. It is a 6k-video version of the AutoTransition dataset that additionally categorizes its videos into different production-style categories. Our encoder-decoder model outperforms the state-of-the-art transition recommendation method, achieving improvements of 10% to 80% in Recall@K and mean rank values over the baseline. Our style conditioning module results in visual transitions that improve the capture of the desired video production style characteristics by an average of around 12% in comparison to other methods when measured with similarity metrics. We hope that our work serves as a foundation for exploring and understanding video production styles further.
https://arxiv.org/abs/2501.07983
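The activation maximization used by the style conditioning module can be illustrated in miniature: repeatedly nudge an input representation by the gradient of a style classifier's activation. The linear sigmoid head below is a hypothetical stand-in for the real style network, so this is only a toy of the optimization pattern, not the paper's module:

```python
import numpy as np

def activation_maximization(x0, w, steps=50, lr=0.5):
    # Gradient ascent on sigmoid(w . x): adjust the embedding x so the
    # (hypothetical) linear style head fires more strongly.
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        a = 1.0 / (1.0 + np.exp(-(w @ x)))   # current activation in (0, 1)
        x = x + lr * a * (1.0 - a) * w       # d sigmoid(w.x) / dx = a(1-a) w
    return x
```

In V-Trans4Style the optimized quantity is the decoder's transition sequence rather than a raw embedding, but the iterate-toward-higher-activation loop is the same idea.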
Given their capacity for advanced reasoning, extensive contextual understanding, and robust question answering, large language models have become prominent in healthcare management research. Despite adeptly handling a broad spectrum of healthcare inquiries, these models face significant challenges in delivering accurate and practical advice for chronic conditions such as diabetes. We evaluate the responses of ChatGPT versions 3.5 and 4 to diabetes patient queries, assessing their depth of medical knowledge and their capacity to deliver personalized, context-specific advice for diabetes self-management. Our findings reveal discrepancies in accuracy and embedded biases, emphasizing the models' limitations in providing tailored advice unless activated by sophisticated prompting techniques. Additionally, we observe that both models often provide advice without seeking necessary clarification, a practice that can result in potentially dangerous advice. This underscores the limited practical effectiveness of these models without human oversight in clinical settings. To address these issues, we propose a commonsense evaluation layer for prompt evaluation and incorporating disease-specific external memory using an advanced Retrieval Augmented Generation technique. This approach aims to improve information quality and reduce misinformation risks, contributing to more reliable AI applications in healthcare settings. Our findings seek to influence the future direction of AI in healthcare, enhancing both the scope and quality of its integration.
https://arxiv.org/abs/2501.07931
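The retrieval step of a Retrieval Augmented Generation pipeline like the one proposed above can be sketched very simply: rank the disease-specific memory documents by similarity to the patient query and prepend the best matches to the LLM prompt. The bag-of-words cosine ranking below is a deliberately minimal stand-in for a real embedding retriever, and the example documents are invented:

```python
import math
from collections import Counter

def retrieve(query, documents, k=1):
    # Rank documents by cosine similarity of bag-of-words vectors and
    # return the top-k, which a RAG pipeline would prepend to the prompt
    # as disease-specific external memory.
    def vec(text):
        return Counter(text.lower().split())

    def cosine(a, b):
        num = sum(a[t] * b[t] for t in set(a) & set(b))
        den = (math.sqrt(sum(v * v for v in a.values()))
               * math.sqrt(sum(v * v for v in b.values())))
        return num / den if den else 0.0

    q = vec(query)
    ranked = sorted(documents, key=lambda d: cosine(q, vec(d)), reverse=True)
    return ranked[:k]
```

A production system would use dense embeddings and would also pass retrieved passages through the proposed commonsense evaluation layer before generation.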
Large Language Models (LLMs) have emerged as the new recommendation engines, outperforming traditional methods in both capability and scope, particularly in code generation applications. Our research reveals a novel provider bias in LLMs: namely, without explicit input prompts, these models show systematic preferences for services from specific providers in their recommendations (e.g., favoring Google Cloud over Microsoft Azure). This bias holds significant implications for market dynamics and societal equilibrium, potentially promoting digital monopolies. It may also deceive users and violate their expectations, leading to various consequences. This paper presents the first comprehensive empirical study of provider bias in LLM code generation. We develop a systematic methodology encompassing an automated pipeline for dataset generation, incorporating 6 distinct coding task categories and 30 real-world application scenarios. Our analysis encompasses over 600,000 LLM-generated responses across seven state-of-the-art models, utilizing approximately 500 million tokens (equivalent to $5,000+ in computational costs). The study evaluates both the generated code snippets and their embedded service provider selections to quantify provider bias. Additionally, we conduct a comparative analysis of seven debiasing prompting techniques to assess their efficacy in mitigating these biases. Our findings demonstrate that LLMs exhibit significant provider preferences, predominantly favoring services from Google and Amazon, and can autonomously modify input code to incorporate their preferred providers without users' requests. Notably, we observe discrepancies between providers recommended in conversational contexts versus those implemented in generated code. The complete dataset and analysis results are available in our repository.
https://arxiv.org/abs/2501.07849
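A first-pass way to quantify the provider bias described above is to measure what share of generated snippets mention each provider. The keyword matcher below is a simplified sketch of that idea (the real study inspects embedded service selections, not just string mentions), and the sample snippets are invented:

```python
import re
from collections import Counter

def provider_share(snippets, providers):
    # Fraction of generated code snippets that mention each provider
    # (case-insensitive; a snippet may count toward several providers).
    counts = Counter()
    for snippet in snippets:
        for provider in providers:
            if re.search(re.escape(provider), snippet, re.IGNORECASE):
                counts[provider] += 1
    total = len(snippets)
    return {p: counts[p] / total for p in providers}
```

Large gaps between shares across providers, under neutral prompts, are the signal the paper's analysis formalizes over 600,000 responses.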
Increasingly, recommender systems are tasked with improving users' long-term satisfaction. In this context, we study a content exploration task, which we formalize as a bandit problem with delayed rewards. There is an apparent trade-off in choosing the learning signal: waiting for the full reward to become available might take several weeks, slowing the rate of learning, whereas using short-term proxy rewards reflects the actual long-term goal only imperfectly. First, we develop a predictive model of delayed rewards that incorporates all information obtained to date. Rewards as well as shorter-term surrogate outcomes are combined through a Bayesian filter to obtain a probabilistic belief. Second, we devise a bandit algorithm that quickly learns to identify content aligned with long-term success using this new predictive model. We prove a regret bound for our algorithm that depends on the Value of Progressive Feedback, an information-theoretic metric that captures the quality of short-term leading indicators that are observed prior to the long-term reward. We apply our approach to a podcast recommendation problem, where we seek to recommend shows that users engage with repeatedly over two months. We empirically validate that our approach significantly outperforms methods that optimize for short-term proxies or rely solely on delayed rewards, as demonstrated by an A/B test in a recommendation system that serves hundreds of millions of users.
https://arxiv.org/abs/2501.07761
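The combine-surrogates-with-delayed-rewards idea can be sketched with a Beta-Bernoulli Thompson sampler in which short-term proxy signals contribute down-weighted pseudo-counts before the long-term reward arrives. This is an illustrative stand-in for the paper's Bayesian filter, with an assumed surrogate weight, not the authors' algorithm:

```python
import random

class SurrogateThompsonBandit:
    # Thompson sampling over Bernoulli arms. Surrogate outcomes update a
    # Beta belief immediately (discounted by `surrogate_weight`); the
    # delayed long-term reward updates it at full weight when observed.
    def __init__(self, n_arms, surrogate_weight=0.5):
        self.alpha = [1.0] * n_arms     # Beta(1, 1) uniform priors
        self.beta = [1.0] * n_arms
        self.w = surrogate_weight

    def select(self):
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return samples.index(max(samples))

    def update_surrogate(self, arm, proxy):
        # proxy in [0, 1], e.g. short-term engagement; discounted because
        # it only imperfectly predicts the long-term reward.
        self.alpha[arm] += self.w * proxy
        self.beta[arm] += self.w * (1.0 - proxy)

    def update_reward(self, arm, reward):
        # delayed binary long-term reward, observed weeks later
        self.alpha[arm] += reward
        self.beta[arm] += 1.0 - reward
```

How informative the proxies are, relative to the delayed reward, is exactly what the paper's Value of Progressive Feedback metric measures.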
In recent years, AI red teaming has emerged as a practice for probing the safety and security of generative AI systems. Due to the nascency of the field, there are many open questions about how red teaming operations should be conducted. Based on our experience red teaming over 100 generative AI products at Microsoft, we present our internal threat model ontology and eight main lessons we have learned:

1. Understand what the system can do and where it is applied.
2. You don't have to compute gradients to break an AI system.
3. AI red teaming is not safety benchmarking.
4. Automation can help cover more of the risk landscape.
5. The human element of AI red teaming is crucial.
6. Responsible AI harms are pervasive but difficult to measure.
7. LLMs amplify existing security risks and introduce new ones.
8. The work of securing AI systems will never be complete.

By sharing these insights alongside case studies from our operations, we offer practical recommendations aimed at aligning red teaming efforts with real-world risks. We also highlight aspects of AI red teaming that we believe are often misunderstood and discuss open questions for the field to consider.
https://arxiv.org/abs/2501.07238
Combinatorial medication recommendation (CMR) is a fundamental task of healthcare, which offers opportunities for clinical physicians to provide more precise prescriptions for patients with intricate health conditions, particularly in the scenarios of long-term medical care. Previous research efforts have sought to extract meaningful information from electronic health records (EHRs) to facilitate combinatorial medication recommendations. Existing learning-based approaches further consider the chemical structures of medications, but ignore the textual medication descriptions in which the functionalities are clearly described. Furthermore, the textual knowledge derived from the EHRs of patients remains largely underutilized. To address these issues, we introduce the Natural Language-Assisted Multi-modal Medication Recommendation (NLA-MMR), a multi-modal alignment framework designed to learn knowledge from the patient view and medication view jointly. Specifically, NLA-MMR formulates CMR as an alignment problem from patient and medication modalities. In this vein, we employ pretrained language models (PLMs) to extract in-domain knowledge regarding patients and medications, serving as the foundational representation for both modalities. In the medication modality, we exploit both chemical structures and textual descriptions to create medication representations. In the patient modality, we generate the patient representations based on textual descriptions of diagnosis, procedure, and symptom. Extensive experiments conducted on three publicly accessible datasets demonstrate that NLA-MMR achieves new state-of-the-art performance, with a notable average improvement of 4.72% in Jaccard score. Our source code is publicly available on this https URL.
https://arxiv.org/abs/2501.07166
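The Jaccard score used to report the improvement above compares the predicted medication set against the ground-truth prescription. A minimal sketch of the standard definition:

```python
def jaccard(predicted, actual):
    # Jaccard score between a predicted medication set and the ground
    # truth: |intersection| / |union|, in [0, 1].
    predicted, actual = set(predicted), set(actual)
    if not predicted and not actual:
        return 1.0          # both empty: perfect (vacuous) agreement
    return len(predicted & actual) / len(predicted | actual)
```

Dataset-level results average this score over all patient visits in the test set.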
Multimodal information (e.g., visual, acoustic, and textual) has been widely used to enhance representation learning for micro-video recommendation. For integrating multimodal information into a joint representation of a micro-video, multimodal fusion plays a vital role in existing micro-video recommendation approaches. However, the static multimodal fusion used in previous studies is insufficient to model the various relationships among multimodal information of different micro-videos. In this paper, we develop a novel meta-learning-based multimodal fusion framework called Meta Multimodal Fusion (MetaMMF), which dynamically assigns parameters to the multimodal fusion function for each micro-video during its representation learning. Specifically, MetaMMF regards the multimodal fusion of each micro-video as an independent task. Based on the meta information extracted from the multimodal features of the input task, MetaMMF parameterizes a neural network as the item-specific fusion function via a meta learner. We perform extensive experiments on three benchmark datasets, demonstrating significant improvements over several state-of-the-art multimodal recommendation models, such as MMGCN, LATTICE, and InvRL. Furthermore, we lighten our model by adopting canonical polyadic decomposition to improve training efficiency, and validate its effectiveness through experimental results. Codes are available at this https URL.
https://arxiv.org/abs/2501.07110
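The item-specific fusion idea can be illustrated with a toy in which a (hypothetical) linear meta-learner reads the concatenated modality features of one micro-video and emits per-modality gates; the fused representation is the gated sum. MetaMMF actually parameterizes a whole fusion network per item, so this gate-only version is only a miniature of the pattern:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def meta_fusion(modality_feats, meta_w):
    # `modality_feats`: one feature vector per modality for this item.
    # `meta_w`: a toy linear meta-learner mapping the concatenated
    # features (the "meta information" of this fusion task) to one gate
    # per modality. Different items thus get different fusion weights.
    feats = np.concatenate(modality_feats)
    gates = softmax(meta_w @ feats)
    return sum(g * np.asarray(m, dtype=float)
               for g, m in zip(gates, modality_feats))
```

With a zero meta-learner every item falls back to a uniform average, i.e. exactly the static fusion the paper argues is too rigid.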
In the current development of large language models (LLMs), it is important to ensure the accuracy and reliability of the underlying data sources. LLMs are critical for various applications, but they often suffer from hallucinations and inaccuracies due to knowledge gaps in the training data. Knowledge graphs (KGs), as a powerful structural tool, could serve as a vital external information source to mitigate the aforementioned issues. By providing a structured and comprehensive understanding of real-world data, KGs enhance the performance and reliability of LLMs. However, errors are commonly introduced when extracting triplets from unstructured data to construct KGs. This could lead to degraded performance in downstream tasks such as question-answering and recommender systems. Therefore, anomaly detection in KGs is essential to identify and correct these errors. This paper presents an anomaly detection algorithm in knowledge graphs with dual-channel learning (ADKGD). ADKGD leverages a dual-channel learning approach to enhance representation learning from both the entity-view and triplet-view perspectives. Furthermore, using a cross-layer approach, our framework integrates internal information aggregation and context information aggregation. We introduce a Kullback-Leibler (KL) loss component to improve the accuracy of the scoring function between the dual channels. To evaluate ADKGD's performance, we conduct empirical studies on three real-world KGs: WN18RR, FB15K, and NELL-995. Experimental results demonstrate that ADKGD outperforms the state-of-the-art anomaly detection algorithms. The source code and datasets are publicly available at this https URL.
https://arxiv.org/abs/2501.07078
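The KL loss mentioned above penalizes disagreement between the score distributions of the two channels. A minimal sketch of discrete KL divergence, with a small epsilon for numerical safety (the epsilon and its value are implementation conveniences, not from the paper):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for discrete distributions given as aligned lists of
    # probabilities. Used here as the alignment term between the
    # entity-view and triplet-view scoring distributions.
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))
```

KL is zero exactly when the two channels score identically and grows as they diverge, so minimizing it pushes the dual channels toward consistent anomaly scores.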
In business analysis, providing effective recommendations is essential for enhancing company profits. The utilization of graph-based structures, such as bipartite graphs, has gained popularity for their ability to analyze complex data relationships. Link prediction is crucial for recommending specific items to users. Traditional methods in this area often involve identifying patterns in the graph structure or using representational techniques like graph neural networks (GNNs). However, these approaches encounter difficulties as the volume of data increases. To address these challenges, we propose a model called Graph Contrastive Learning for Multi-label Classification (MCGCL). MCGCL leverages contrastive learning to enhance recommendation effectiveness. The model incorporates two training stages: a main task and a subtask. The main task is holistic user-item graph learning to capture user-item relationships. The homogeneous user-user (item-item) subgraph is constructed to capture user-user and item-item relationships in the subtask. We assessed the performance using real-world datasets from Amazon Reviews in multi-label classification tasks. Comparative experiments with state-of-the-art methods confirm the effectiveness of MCGCL, highlighting its potential for improving recommendation systems.
https://arxiv.org/abs/2501.06985
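Contrastive learning of the kind MCGCL relies on is typically trained with an InfoNCE-style loss: pull an anchor embedding toward its positive and away from negatives. The NumPy sketch below shows that loss for a single anchor; it is a generic illustration of the objective family, not MCGCL's exact loss:

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    # InfoNCE loss for one anchor: softmax cross-entropy over cosine
    # similarities (scaled by temperature `tau`), where the positive
    # pair should win against all negatives.
    def cos(a, b):
        a, b = np.asarray(a, float), np.asarray(b, float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    logits = np.array([cos(anchor, positive)]
                      + [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -float(np.log(probs[0]))             # positive sits at index 0
```

In MCGCL's two-stage setup, such a loss would be applied both on the holistic user-item graph (main task) and on the homogeneous user-user / item-item subgraphs (subtask).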
Models trained on crowdsourced labels may not reflect broader population views when annotator pools are not representative. Since collecting representative labels is challenging, we propose Population-Aligned Instance Replication (PAIR), a method to address this bias through statistical adjustment. Using a simulation study of hate speech and offensive language detection, we create two types of annotators with different labeling tendencies and generate datasets with varying proportions of the types. Models trained on unbalanced annotator pools show poor calibration compared to those trained on representative data. However, PAIR, which duplicates labels from underrepresented annotator groups to match population proportions, significantly reduces bias without requiring new data collection. These results suggest statistical techniques from survey research can help align model training with target populations even when representative annotator pools are unavailable. We conclude with three practical recommendations for improving training data quality.
https://arxiv.org/abs/2501.06826
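The core PAIR mechanism, duplicating labels from under-represented annotator groups until group shares match target population proportions, can be sketched as below. The rounding scheme and the choice to anchor on the most over-represented group are implementation assumptions for illustration, not necessarily the paper's exact procedure:

```python
from collections import Counter

def pair_replicate(labels, groups, target_props):
    # PAIR-style statistical adjustment: append integer copies of
    # instances from under-represented annotator groups so each group's
    # share approximately matches its target population proportion.
    counts = Counter(groups)
    out_labels, out_groups = list(labels), list(groups)
    # Keep the most over-represented group fixed; scale the others up.
    ref = max(counts, key=lambda g: counts[g] / target_props[g])
    implied_total = counts[ref] / target_props[ref]
    for g in counts:
        deficit = round(target_props[g] * implied_total) - counts[g]
        idx = [i for i, gg in enumerate(groups) if gg == g]
        for k in range(deficit):                # no-op if deficit <= 0
            i = idx[k % len(idx)]               # cycle through the group
            out_labels.append(labels[i])
            out_groups.append(groups[i])
    return out_labels, out_groups
```

For example, a pool that is 80% group A and 20% group B, with a 50/50 target, gets group B's instances replicated until the two groups contribute equally to training.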
The air transport system recognizes the criticality of safety, as even minor anomalies can have severe consequences. Reporting accidents and incidents plays a vital role in identifying their causes and proposing safety recommendations. However, the narratives describing pre-accident events are presented in unstructured text that is not easily understood by computer systems. Classifying and categorizing safety occurrences based on these narratives can support informed decision-making by aviation industry stakeholders. In this study, researchers applied natural language processing (NLP) and artificial intelligence (AI) models to process text narratives to classify the flight phases of safety occurrences. The classification performance of two deep learning models, ResNet and sRNN, was evaluated using an initial dataset of 27,000 safety occurrence reports from the NTSB. The results demonstrated good performance, with both models achieving an accuracy exceeding 68%, well above the random guess rate of 14% for a seven-class classification problem. The models also exhibited high precision, recall, and F1 scores. The sRNN model greatly outperformed the simplified ResNet model architecture used in this study. These findings indicate that NLP and deep learning models can infer the flight phase from raw text narratives, enabling effective analysis of safety occurrences.
https://arxiv.org/abs/2501.06564
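As a toy illustration of the classification target in this entry, the sketch below maps a narrative to one of seven flight phases with a hand-built keyword lexicon. The phase names and cue words are assumptions made for illustration; the study's ResNet and sRNN models learn such cues from the 27,000 NTSB reports rather than relying on a fixed lexicon.

```python
# Illustrative seven-phase lexicon; the cue words are assumptions, not
# taken from the paper or the NTSB taxonomy.
PHASE_KEYWORDS = {
    "taxi": ["taxiing", "taxiway", "pushback"],
    "takeoff": ["takeoff", "rotation", "departure roll"],
    "climb": ["climb", "climbing"],
    "cruise": ["cruise", "level flight"],
    "descent": ["descent", "descending"],
    "approach": ["approach", "glideslope", "final"],
    "landing": ["landing", "touchdown", "flare"],
}

def infer_flight_phase(narrative):
    """Score each phase by how many of its cue words the narrative
    contains and return the best-scoring phase, or "unknown" if no cue
    word matches at all."""
    text = narrative.lower()
    scores = {phase: sum(kw in text for kw in kws)
              for phase, kws in PHASE_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

A learned model replaces this brittle lookup with representations trained end to end, which is what lets it reach the reported 68%+ accuracy on free-form narratives.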
Predicting individual mobility patterns is crucial across various applications. While current methods mainly focus on predicting the next location for personalized services like recommendations, they often fall short in supporting broader applications such as traffic management and epidemic control, which require longer-period forecasts of human mobility. This study addresses mid-term mobility prediction, aiming to capture daily travel patterns and forecast trajectories for the upcoming day or week. We propose a novel Multi-scale Spatial-Temporal Decoupled Predictor (MSTDP) designed to efficiently extract spatial and temporal information by decoupling daily trajectories into distinct location-duration chains. Our approach employs a hierarchical encoder to model multi-scale temporal patterns, including daily recurrence and weekly periodicity, and utilizes a transformer-based decoder to globally attend to predicted information in the location or duration chain. Additionally, we introduce a spatial heterogeneous graph learner to capture multi-scale spatial relationships, enhancing semantic-rich representations. Extensive experiments, including statistical physics analysis, are conducted on large-scale mobile phone records from five cities (Boston, Los Angeles, the SF Bay Area, Shanghai, and Tokyo) to demonstrate MSTDP's advantages. Applied to epidemic modeling in Boston, MSTDP significantly outperforms the best-performing baseline, achieving a remarkable 62.8% reduction in MAE for cumulative new cases.
https://arxiv.org/abs/2501.06561
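The decoupling step at the heart of MSTDP, splitting a daily trajectory into separate location and duration sequences, can be illustrated with a minimal sketch. The `(location, start_hour, end_hour)` record format and the merging of consecutive same-location stays are assumptions made for illustration, not the paper's exact preprocessing.

```python
from itertools import groupby

def to_location_duration_chain(trajectory):
    """Decouple a daily trajectory into a location-duration chain.

    `trajectory` is a chronologically ordered list of stays, e.g.
    [("home", 0, 8), ("work", 8, 17), ...].  The output is a pair of
    sequences (locations, durations) that spatial and temporal modules
    can then encode independently.
    """
    # Merge consecutive records at the same location into a single stay.
    merged = []
    for loc, stays in groupby(trajectory, key=lambda rec: rec[0]):
        stays = list(stays)
        merged.append((loc, stays[0][1], stays[-1][2]))
    locations = [loc for loc, _, _ in merged]
    durations = [end - start for _, start, end in merged]
    return locations, durations
```

Separating "where" from "how long" is what allows the hierarchical encoder to model daily recurrence and weekly periodicity on each chain independently.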
Spoken language datasets are vital for advancing linguistic research, Natural Language Processing, and speech technology. However, resources dedicated to Italian, a linguistically rich and diverse Romance language, remain underexplored compared to major languages like English or Mandarin. This survey provides a comprehensive analysis of 66 spoken Italian datasets, highlighting their characteristics, methodologies, and applications. The datasets are categorized by speech type, source and context, and demographic and linguistic features, with a focus on their utility in fields such as Automatic Speech Recognition, emotion detection, and education. Challenges related to dataset scarcity, representativeness, and accessibility are discussed alongside recommendations for enhancing dataset creation and utilization. The full dataset inventory is publicly accessible via GitHub and archived on Zenodo, serving as a valuable resource for researchers and developers. By addressing current gaps and proposing future directions, this work aims to support the advancement of Italian speech technologies and linguistic research.
https://arxiv.org/abs/2501.06557
This case study explores the integration of Generative AI tools and real-world experiences in business education. Through a study of an innovative undergraduate course, we investigate how AI-assisted learning, combined with experiential components, impacts students' creative processes and learning outcomes. Our findings reveal that this integrated approach accelerates knowledge acquisition, enables students to overcome traditional creative barriers, and facilitates a dynamic interplay between AI-generated insights and real-world observations. The study also highlights challenges, including the need for instructors with high AI literacy and the rapid evolution of AI tools, which creates a moving target for curriculum design. These insights contribute to the growing body of literature on AI in education and provide actionable recommendations for educators preparing students for the complexities of modern business environments.
https://arxiv.org/abs/2501.06527
Safety is a critical aspect of the air transport system, given that even slight operational anomalies can result in serious consequences. To reduce the chances of aviation safety occurrences, accidents and incidents are reported in order to establish root causes, propose safety recommendations, and so on. However, the narratives analyzing pre-accident events are presented as human-readable, raw, unstructured text that a computer system cannot process directly. The ability to classify and categorise safety occurrences from their textual narratives would help aviation industry stakeholders make informed safety-critical decisions. To classify and categorise safety occurrences, we applied natural language processing (NLP) and artificial intelligence (AI) models to the text narratives. The study aimed to answer the question: how well can the damage level caused to the aircraft in a safety occurrence be inferred from the text narrative using natural language processing? The classification performance of several deep learning models, LSTM, BLSTM, GRU, and sRNN, and their combinations, LSTM+GRU, BLSTM+GRU, sRNN+LSTM, sRNN+BLSTM, sRNN+GRU, sRNN+BLSTM+GRU, and sRNN+LSTM+GRU, was evaluated on a set of 27,000 safety occurrence reports from the NTSB. The results indicate that all models investigated performed competitively, recording an accuracy above 87.9%, well above the random-guess rate of 25% for a four-class classification problem. The models also recorded high precision, recall, and F1 scores above 80%, 88%, and 85%, respectively. sRNN slightly outperformed the other single models in terms of recall (90%) and accuracy (90%), while LSTM reported slightly better precision (87%).
https://arxiv.org/abs/2501.06490
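The precision, recall, and F1 figures reported in this entry can be computed per damage class and averaged; a minimal sketch follows. Macro averaging over the classes is an assumption here, since the abstract does not state which averaging scheme was used.

```python
def macro_prf1(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 for a multi-class task.

    Computes per-class precision, recall, and F1 from true and predicted
    labels (e.g. four aircraft damage levels), then averages each metric
    across classes so every class counts equally.
    """
    classes = sorted(set(y_true) | set(y_pred))
    per_p, per_r, per_f1 = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        per_p.append(p)
        per_r.append(r)
        per_f1.append(2 * p * r / (p + r) if p + r else 0.0)
    n = len(classes)
    return sum(per_p) / n, sum(per_r) / n, sum(per_f1) / n
```

Macro averaging is the stricter choice for imbalanced safety data, since a model cannot inflate its score by doing well only on the majority damage class.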