One of the challenges for neural networks in real-life applications is the overconfident errors these models make when the data is not from the original training distribution. Addressing this issue is known as Out-of-Distribution (OOD) detection. Many state-of-the-art OOD methods employ an auxiliary dataset as a surrogate for OOD data during training to achieve improved performance. However, these methods fail to fully exploit the local information embedded in the auxiliary dataset. In this work, we propose the idea of leveraging the information embedded in the gradient of the loss function during training to enable the network to not only learn a desired OOD score for each sample but also to exhibit similar behavior in a local neighborhood around each sample. We also develop a novel energy-based sampling method to allow the network to be exposed to more informative OOD samples during the training phase. This is especially important when the auxiliary dataset is large. We demonstrate the effectiveness of our method through extensive experiments on several OOD benchmarks, improving the existing state-of-the-art FPR95 by 4% on our ImageNet experiment. We further provide a theoretical analysis through the lens of certified robustness and Lipschitz analysis to showcase the theoretical foundation of our work. We will publicly release our code after the review process.
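The paper's exact training objective and sampler are not given in the abstract, but the energy score such methods typically build on is standard: E(x) = -T · logsumexp(f(x)/T), where f(x) are the network's logits. The sketch below is a pure-Python illustration, with a hypothetical selection criterion (`select_informative_ood`) that favors low-energy, in-distribution-looking auxiliary samples as the "more informative" ones; it is a sketch in the spirit of the abstract, not the authors' implementation:

```python
import math

def energy_score(logits, temperature=1.0):
    """Energy-based OOD score: E(x) = -T * logsumexp(f(x) / T).
    Lower energy means the network sees the sample as more in-distribution."""
    z = [v / temperature for v in logits]
    m = max(z)  # subtract the max for numerical stability
    return -temperature * (m + math.log(sum(math.exp(v - m) for v in z)))

def select_informative_ood(aux_logits, k):
    """Indices of the k auxiliary samples with the lowest energy, i.e. those
    the network currently confuses with in-distribution data (a hypothetical
    'most informative' criterion, not the paper's exact sampler)."""
    by_energy = sorted(range(len(aux_logits)),
                       key=lambda i: energy_score(aux_logits[i]))
    return by_energy[:k]
```

A confidently classified sample (one dominant logit) gets much lower energy than a uniform one, so it would be sampled first under this criterion.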
https://arxiv.org/abs/2404.12368
Data analysts have long sought to turn unstructured text data into meaningful concepts. Though common, topic modeling and clustering focus on lower-level keywords and require significant interpretative work. We introduce concept induction, a computational process that instead produces high-level concepts, defined by explicit inclusion criteria, from unstructured text. For a dataset of toxic online comments, where a state-of-the-art BERTopic model outputs "women, power, female," concept induction produces high-level concepts such as "Criticism of traditional gender roles" and "Dismissal of women's concerns." We present LLooM, a concept induction algorithm that leverages large language models to iteratively synthesize sampled text and propose human-interpretable concepts of increasing generality. We then instantiate LLooM in a mixed-initiative text analysis tool, enabling analysts to shift their attention from interpreting topics to engaging in theory-driven analysis. Through technical evaluations and four analysis scenarios ranging from literature review to content moderation, we find that LLooM's concepts improve upon the prior art of topic models in terms of quality and data coverage. In expert case studies, LLooM helped researchers to uncover new insights even from familiar datasets, for example by suggesting a previously unnoticed concept of attacks on out-party stances in a political social media dataset.
https://arxiv.org/abs/2404.12259
While many topics of the learning-based approach to automated music generation are under active research, musical form is under-researched. In particular, recent methods based on deep learning models generate music that, at the largest time scale, lacks any structure. In practice, music longer than one minute generated by such models is either unpleasantly repetitive or directionless. Adapting a recent music generation model, this paper proposes a novel method to generate music with form. The experimental results show that the proposed method can generate 2.5-minute-long music that is considered as pleasant as the music used to train the model. The paper first reviews a recent music generation method based on language models (transformer architecture). We discuss why learning musical form by such models is infeasible. Then we discuss our proposed method and the experiments.
https://arxiv.org/abs/2404.11976
In this review paper, we delve into the realm of Large Language Models (LLMs), covering their foundational principles, diverse applications, and nuanced training processes. The article sheds light on the mechanics of in-context learning and a spectrum of fine-tuning approaches, with a special focus on methods that optimize efficiency in parameter usage. Additionally, it explores how LLMs can be more closely aligned with human preferences through innovative reinforcement learning frameworks and other novel methods that incorporate human feedback. The article also examines the emerging technique of retrieval augmented generation, integrating external knowledge into LLMs. The ethical dimensions of LLM deployment are discussed, underscoring the need for mindful and responsible application. Concluding with a perspective on future research trajectories, this review offers a succinct yet comprehensive overview of the current state and emerging trends in the evolving landscape of LLMs, serving as an insightful guide for both researchers and practitioners in artificial intelligence.
https://arxiv.org/abs/2404.11973
This survey reviews the AIS 2024 Event-Based Eye Tracking (EET) Challenge. The task of the challenge focuses on processing eye movement recorded with event cameras and predicting the pupil center of the eye. The challenge emphasizes efficient eye tracking with event cameras to achieve good task accuracy and efficiency trade-off. During the challenge period, 38 participants registered for the Kaggle competition, and 8 teams submitted a challenge factsheet. The novel and diverse methods from the submitted factsheets are reviewed and analyzed in this survey to advance future event-based eye tracking research.
https://arxiv.org/abs/2404.11770
Address matching is an important task for many businesses, especially delivery and takeout companies, helping them retrieve a specific address from their data warehouse. Existing solutions use string similarity and edit-distance algorithms to find similar addresses in an address database, but these algorithms do not work effectively with redundant, unstructured, or incomplete address data. This paper discusses a semantic address matching technique, with which we can find a particular address from a list of possible addresses. We also review existing practices and their shortcomings. Semantic address matching is essentially an NLP task in the field of deep learning, and through this technique we can overcome the drawbacks of existing methods, such as redundant or abbreviated data. The solution applies OCR to invoices to extract addresses and create a data pool of addresses. This data is then fed to the BM25 algorithm to score the best-matching entries. The top matches are then passed through BERT to produce the best possible result from the similar queries. Our investigation shows that our methodology greatly improves both the accuracy and the recall of existing state-of-the-art techniques.
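As a rough illustration of the retrieval stage, here is a minimal pure-Python Okapi BM25 scorer over a toy address pool. The paper's actual pipeline adds OCR extraction up front and BERT re-ranking of the top BM25 candidates; all names and data below are illustrative:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of every candidate document against the query."""
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs
    df = Counter()  # document frequency per term
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)  # term frequency within this document
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy address pool; a real pipeline would OCR these from invoices
addresses = ["221b baker street london", "10 downing street london", "baker house berlin"]
docs = [a.split() for a in addresses]
scores = bm25_scores("221b baker st london".split(), docs)
best = scores.index(max(scores))  # top candidate to hand to a BERT re-ranker
```

The rare token "221b" carries the highest IDF, so the first address wins despite the abbreviated "st" not matching "street" lexically; handling such abbreviations is exactly what the semantic (BERT) stage is for.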
https://arxiv.org/abs/2404.11691
As the manufacturing industry advances with sensor integration and automation, the opaque nature of deep learning models poses a significant challenge for fault detection and diagnosis. Despite the predictive insights Artificial Intelligence (AI) can deliver, advanced machine learning engines often remain black boxes. This paper reviews eXplainable AI (XAI) tools and techniques in this context. We explore various XAI methodologies, focusing on their role in making AI decision-making transparent, particularly in critical scenarios where humans are involved. We also discuss current limitations and potential future research aimed at balancing explainability with model performance while improving trustworthiness in AI applications for critical industrial use cases.
https://arxiv.org/abs/2404.11597
To convince readers of the novelty of their research paper, authors must perform a literature review and compose a coherent story that connects and relates prior works to the current work. This challenging nature of literature review writing makes automatic related work generation (RWG) academically and computationally interesting, and also makes it an excellent test bed for examining the capability of SOTA natural language processing (NLP) models. Since the initial proposal of the RWG task, its popularity has waxed and waned, following the capabilities of mainstream NLP approaches. In this work, we survey the zoo of RWG historical works, summarizing the key approaches and task definitions and discussing the ongoing challenges of RWG.
https://arxiv.org/abs/2404.11588
Robotic systems are becoming pervasive and adopted in increasingly many domains, such as manufacturing, healthcare, and space exploration. To this end, engineering software has emerged as a crucial discipline for building maintainable and reusable robotic systems. Robotics software engineering research has received increasing attention, fostering autonomy as a fundamental goal. However, robotics developers are still challenged trying to achieve this goal given that simulation is not able to deliver solutions to realistically emulate real-world phenomena. Robots also need to operate in unpredictable and uncontrollable environments, which require safe and trustworthy self-adaptation capabilities implemented in software. Typical techniques to address the challenges are runtime verification, field-based testing, and mitigation techniques that enable fail-safe solutions. However, there is no clear guidance to architect ROS-based systems to enable and facilitate runtime verification and field-based testing. This paper aims to fill in this gap by providing guidelines that can help developers and QA teams when developing, verifying or testing their robots in the field. These guidelines are carefully tailored to address the challenges and requirements of testing robotics systems in real-world scenarios. We conducted a literature review on studies addressing runtime verification and field-based testing for robotic systems, mined ROS-based application repositories, and validated the applicability, clarity, and usefulness via two questionnaires with 55 answers. We contribute 20 guidelines formulated for researchers and practitioners in robotic software engineering. Finally, we map our guidelines to open challenges thus far in runtime verification and field-based testing for ROS-based systems and, we outline promising research directions in the field.
https://arxiv.org/abs/2404.11498
With the rapid advancement of large language models (LLMs), information retrieval (IR) systems, such as search engines and recommender systems, have undergone a significant paradigm shift. This evolution, while heralding new opportunities, introduces emerging challenges, particularly in terms of biases and unfairness, which may threaten the information ecosystem. In this paper, we present a comprehensive survey of existing works on emerging and pressing bias and unfairness issues that arise in IR systems with the integration of LLMs. We first unify bias and unfairness issues as distribution mismatch problems, providing a groundwork for categorizing various mitigation strategies through distribution alignment. Subsequently, we systematically delve into the specific bias and unfairness issues arising from three critical stages of LLM integration into IR systems: data collection, model development, and result evaluation. In doing so, we meticulously review and analyze recent literature, focusing on the definitions, characteristics, and corresponding mitigation strategies associated with these issues. Finally, we identify and highlight some open problems and challenges for future work, aiming to inspire researchers and stakeholders in the IR field and beyond to better understand and mitigate bias and unfairness issues of IR in this LLM era. We also consistently maintain a GitHub repository for the relevant papers and resources in this rising direction at this https URL.
https://arxiv.org/abs/2404.11457
The proliferation of applications using artificial intelligence (AI) systems has led to a growing number of users interacting with these systems through sophisticated interfaces. Human-computer interaction research has long shown that interfaces shape both user behavior and user perception of technical capabilities and risks. Yet, practitioners and researchers evaluating the social and ethical risks of AI systems tend to overlook the impact of anthropomorphic, deceptive, and immersive interfaces on human-AI interactions. Here, we argue that design features of interfaces with adaptive AI systems can have cascading impacts, driven by feedback loops, which extend beyond those previously considered. We first conduct a scoping review of AI interface designs and their negative impact to extract salient themes of potentially harmful design patterns in AI interfaces. Then, we propose Design-Enhanced Control of AI systems (DECAI), a conceptual model to structure and facilitate impact assessments of AI interface designs. DECAI draws on principles from control systems theory -- a theory for the analysis and design of dynamic physical systems -- to dissect the role of the interface in human-AI systems. Through two case studies on recommendation systems and conversational language model systems, we show how DECAI can be used to evaluate AI interface designs.
https://arxiv.org/abs/2404.11370
This paper reviews the NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment (S-UGC VQA), in which various solutions were submitted and evaluated on KVQ, a dataset collected from the popular short-form video platform Kuaishou/Kwai. The KVQ database is divided into three parts: 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition had 200 participants, and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performance for S-UGC VQA. The project can be found at this https URL.
https://arxiv.org/abs/2404.11313
This paper reviews the NTIRE 2024 Portrait Quality Assessment Challenge, highlighting the proposed solutions and results. This challenge aims to obtain an efficient deep neural network capable of estimating the perceptual quality of real portrait photos. The methods must generalize to diverse scenes and diverse lighting conditions (indoor, outdoor, low-light), movement, blur, and other challenging conditions. In the challenge, 140 participants registered, and 35 submitted results during the challenge period. The performance of the top 5 submissions is reviewed and provided here as a gauge for the current state-of-the-art in Portrait Quality Assessment.
https://arxiv.org/abs/2404.11159
Meeting summarization has become a critical task given the increase in online interactions. While new techniques are introduced regularly, their evaluation relies on metrics not designed to capture meeting-specific errors, undermining effective evaluation. This paper investigates what the frequently used automatic metrics capture, and which errors they mask, by correlating automatic metric scores with human evaluations across a broad error taxonomy. We begin with a comprehensive literature review on English meeting summarization to define key challenges, such as speaker dynamics and contextual turn-taking, and error types, such as missing information and linguistic inaccuracy, concepts previously only loosely defined in the field. We examine the relationship between characteristic challenges and errors using annotated transcripts from the general-summary QMSum dataset and summaries produced by Transformer-based sequence-to-sequence and autoregressive models. Through experimental validation, we find that different model architectures respond variably to challenges in meeting transcripts, resulting in distinct links between challenges and errors. The metrics currently used by default struggle to capture observable errors, showing weak to mid correlations, while a third of the correlations show trends of error masking. Only a subset reacts accurately to specific errors, while most correlations show either unresponsiveness or failure to reflect the error's impact on summary quality.
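A minimal sketch of the kind of correlation analysis described above: Spearman rank correlation between automatic metric scores and human judgments (a weak or near-zero correlation on summaries containing a given error type would suggest the metric masks that error). In practice one would typically use `scipy.stats.spearmanr`; the pure-Python helpers here are our own:

```python
def rank(values):
    """Average 1-based ranks, with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the tied positions
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

Because it only uses ranks, Spearman rewards a metric that orders summaries the same way humans do even when its scale is arbitrary, which is why it is the usual choice for metric-vs-human comparisons.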
https://arxiv.org/abs/2404.11124
Sentiment analysis (SA) aims to identify the sentiment expressed in a text, such as a product review. Given a review and the sentiment associated with it, this paper formulates SA as a combination of two tasks: (1) a causal discovery task that distinguishes whether a review "primes" the sentiment (Causal Hypothesis C1), or the sentiment "primes" the review (Causal Hypothesis C2); and (2) the traditional prediction task of modeling the sentiment using the review as input. Using the peak-end rule from psychology, we classify a sample as C1 if its overall sentiment score approximates an average of all the sentence-level sentiments in the review, and as C2 if the overall sentiment score approximates an average of the peak and end sentiments. For the prediction task, we use the discovered causal mechanisms behind the samples to improve the performance of LLMs by proposing causal prompts that give the models an inductive bias toward the underlying causal graph, leading to substantial improvements of up to 32.13 F1 points on zero-shot five-class SA. Our code is at this https URL
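The peak-end classification rule can be sketched directly. This is a simplified reading of the abstract's criterion: assign whichever hypothesis' average the overall score is closer to. The function name and the choice of the most extreme sentence as the "peak" are our assumptions:

```python
def classify_causal_hypothesis(sentence_sentiments, overall):
    """Label a review C1 or C2 via the peak-end rule (simplified sketch).

    C1: the overall score tracks the mean of all sentence-level sentiments
        (the review 'primes' the sentiment).
    C2: the overall score tracks the mean of the peak (taken here as the
        most extreme sentence, an assumption) and the end sentiment.
    """
    mean_all = sum(sentence_sentiments) / len(sentence_sentiments)
    peak = max(sentence_sentiments, key=abs)
    peak_end = (peak + sentence_sentiments[-1]) / 2
    return "C1" if abs(overall - mean_all) <= abs(overall - peak_end) else "C2"
```

For example, a review with sentence sentiments [1, 1, 5, 4] and an overall score of 4.5 is much closer to the peak-end mean (4.5) than to the full mean (2.75), so it would be labeled C2.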
https://arxiv.org/abs/2404.11055
With the rapid advancement of technology, Augmented Reality (AR) technology, known for its ability to deeply integrate virtual information with the real world, is gradually transforming traditional work modes and teaching methods. Particularly in the realms of remote work and online education, AR technology demonstrates a broad spectrum of application prospects. This paper delves into the application potential and actual effects of AR technology in remote work and education. Through a systematic literature review, this study outlines the key features, advantages, and challenges of AR technology. Based on theoretical analysis, it discusses the scientific basis and technical support that AR technology provides for enhancing remote work efficiency and promoting innovation in educational teaching models. Additionally, by designing an empirical research plan and analyzing experimental data, this article reveals the specific performance and influencing factors of AR technology in practical applications. Finally, based on the results of the experiments, this research summarizes the application value of AR technology in remote work and education, looks forward to its future development trends, and proposes forward-looking research directions and strategic suggestions, offering empirical foundation and theoretical guidance for further promoting the in-depth application of AR technology in related fields.
https://arxiv.org/abs/2404.10579
Social biases can manifest in language agency. For instance, White individuals and men are often described as "agentic" and achievement-oriented, whereas Black individuals and women are frequently described as "communal" and as assisting roles. This study establishes agency as an important aspect of studying social biases in both human-written and Large Language Model (LLM)-generated texts. To accurately measure "language agency" at sentence level, we propose a Language Agency Classification dataset to train reliable agency classifiers. We then use an agency classifier to reveal notable language agency biases in 6 datasets of human- or LLM-written texts, including biographies, professor reviews, and reference letters. While most prior NLP research on agency biases focused on single dimensions, we comprehensively explore language agency biases in gender, race, and intersectional identities. We observe that (1) language agency biases in human-written texts align with real-world social observations; (2) LLM-generated texts demonstrate remarkably higher levels of language agency bias than human-written texts; and (3) critical biases in language agency target people of minority groups--for instance, languages used to describe Black females exhibit the lowest level of agency across datasets. Our findings reveal intricate social biases in human- and LLM-written texts through the lens of language agency, warning against using LLM generations in social contexts without scrutiny.
https://arxiv.org/abs/2404.10508
High-Performance Computing (HPC) systems excel in managing distributed workloads, and the growing interest in Artificial Intelligence (AI) has resulted in a surge in demand for faster methods of Machine Learning (ML) model training and inference. In the past, research on HPC I/O focused on optimizing the underlying storage system for modeling and simulation applications and checkpointing the results, causing writes to be the dominant I/O operation. These applications typically access large portions of the data written by simulations or experiments. ML workloads, in contrast, perform small I/O reads spread across a large number of random files. This shift of I/O access patterns poses several challenges to HPC storage systems. In this paper, we survey I/O in ML applications on HPC systems, and target literature within a 6-year time window from 2019 to 2024. We provide an overview of the common phases of ML, review available profilers and benchmarks, examine the I/O patterns encountered during ML training, explore I/O optimizations utilized in modern ML frameworks and proposed in recent literature, and lastly, present gaps requiring further R&D. We seek to summarize the common practices used in accessing data by ML applications and expose research gaps that could spawn further R&D.
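The contrast between the two access patterns can be illustrated with a toy simulation: one large sequential write per file (checkpoint-style) versus many small reads at random offsets scattered across files (training-style). This is illustrative only, not a benchmark; file sizes, shard names, and read counts are arbitrary:

```python
import os
import random
import tempfile

def simulate_hpc_checkpoint(path, size_mb=1):
    """Simulation/checkpoint-style I/O: one large sequential write."""
    with open(path, "wb") as f:
        f.write(os.urandom(size_mb * 1024 * 1024))

def simulate_ml_training_reads(paths, reads=100, read_size=4096):
    """ML-training-style I/O: many small reads at random offsets
    spread across a set of files (e.g. dataset shards)."""
    chunks = []
    for _ in range(reads):
        path = random.choice(paths)
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            f.seek(random.randrange(max(1, size - read_size)))
            chunks.append(f.read(read_size))
    return chunks

tmpdir = tempfile.mkdtemp()
shards = [os.path.join(tmpdir, f"shard_{i}.bin") for i in range(3)]
for p in shards:
    simulate_hpc_checkpoint(p)  # the write-dominated pattern
chunks = simulate_ml_training_reads(shards, reads=20, read_size=512)
```

The second function is the pattern that stresses HPC storage systems tuned for the first: metadata lookups and small random reads dominate instead of large streaming writes.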
https://arxiv.org/abs/2404.10386
This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low-resolution and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. The challenge has 4 tracks: the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (i.e., runtime, FLOPs, and parameter count) were considered, and the main-track ranking is calculated as a weighted sum of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the score calculated from the number of FLOPs determined the ranking, and in sub-track 3, the score calculated from the number of parameters determined the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions, which together gauge the state of the art in efficient single-image super-resolution. To facilitate reproducibility and enable other researchers to build upon these findings, the code and the pre-trained models of validated solutions are made publicly available at this https URL.
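The main-track ranking rule can be sketched as a weighted sum over sub-track scores. The weights and team scores below are invented for illustration; the challenge report defines the actual weighting:

```python
def main_track_scores(team_subtrack_scores, weights):
    """Main-track score for each team as a weighted sum of its sub-track scores."""
    return {
        team: sum(weights[track] * score for track, score in tracks.items())
        for team, tracks in team_subtrack_scores.items()
    }

# Hypothetical weights and normalized scores, purely for illustration
weights = {"runtime": 0.5, "flops": 0.25, "params": 0.25}
teams = {
    "team_a": {"runtime": 0.9, "flops": 0.7, "params": 0.8},
    "team_b": {"runtime": 0.8, "flops": 0.9, "params": 0.9},
}
scores = main_track_scores(teams, weights)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under these made-up weights, team_b's stronger FLOPs and parameter scores outweigh team_a's runtime edge; different weightings would reorder the main track without changing any sub-track ranking.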
https://arxiv.org/abs/2404.10343
A contract is a type of legal document commonly used in organizations. Contract review is an integral and repetitive process to avoid business risk and liability. Contract analysis requires the identification and classification of key provisions and paragraphs within an agreement. Identification and validation of contract clauses can be a time-consuming and challenging task demanding the services of trained and expensive lawyers, paralegals, or other legal assistants. Classification of legal provisions in contracts using artificial intelligence and natural language processing is complex due to the requirement of domain-specialized legal language for model training and the scarcity of sufficient labeled data in the legal domain. Using general-purpose models is not effective in this context because contracts use specialized legal vocabulary that a general model may not recognize. To address this problem, we propose the use of a pre-trained large language model that is subsequently calibrated on a legal taxonomy. We propose LegalPro-BERT, a BERT transformer architecture model that we fine-tune to efficiently handle the classification task for legal provisions. We conducted experiments to measure and compare metrics with current benchmark results. We found that LegalPro-BERT outperforms the previous benchmark used for comparison in this research.
https://arxiv.org/abs/2404.10097