Large language models (LLMs) are increasingly deployed in agentic frameworks, in which prompts trigger complex tool-based analysis in pursuit of a goal. While these frameworks have shown promise across multiple domains, including finance, they typically lack a principled model-building step, relying instead on sentiment- or trend-based analysis. We address this gap by developing an agentic system that uses LLMs to iteratively discover stochastic differential equations for financial time series. These models generate risk metrics that inform daily trading decisions. We evaluate our system in traditional backtests and in a market simulator, which introduces synthetic but causally plausible price paths and news events. We find that model-informed trading strategies outperform standard LLM-based agents, improving Sharpe ratios across multiple equities. Our results show that combining LLMs with agentic model discovery enhances market risk estimation and enables more profitable trading decisions.
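To make the model-based risk step concrete, here is a minimal sketch (not the paper's implementation): it assumes the discovered SDE is a geometric Brownian motion with made-up parameters and derives a one-day 95% Value-at-Risk from simulated paths, the kind of risk metric that could feed a daily trading decision.

```python
import numpy as np

# Hypothetical example: suppose the agent has discovered a geometric
# Brownian motion dS = mu*S dt + sigma*S dW for a stock's price.
# We simulate one-day-ahead prices and compute a 95% Value-at-Risk.
rng = np.random.default_rng(0)
mu, sigma = 0.08, 0.25           # assumed annualized drift and volatility
s0, dt = 100.0, 1.0 / 252        # current price, one trading day

# Euler-Maruyama / exact GBM step over 10,000 one-day paths
z = rng.standard_normal(10_000)
s1 = s0 * np.exp((mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z)

returns = s1 / s0 - 1.0
var_95 = -np.quantile(returns, 0.05)  # loss not exceeded with 95% confidence
print(f"1-day 95% VaR: {var_95:.2%}")  # e.g., trade only if VaR is acceptable
```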
https://arxiv.org/abs/2507.08584
Decision conferences are structured, collaborative meetings that bring together experts from various fields to address complex issues and reach a consensus on recommendations for future actions or policies. These conferences often rely on facilitated discussions to ensure productive dialogue and collective agreement. Recently, Large Language Models (LLMs) have shown significant promise in simulating real-world scenarios, particularly through collaborative multi-agent systems that mimic group interactions. In this work, we present a novel LLM-based multi-agent system designed to simulate decision conferences, specifically focusing on detecting agreement among the participant agents. To achieve this, we evaluate six distinct LLMs on two tasks: stance detection, which identifies the position an agent takes on a given issue, and stance polarity detection, which identifies the sentiment as positive, negative, or neutral. These models are further assessed within the multi-agent system to determine their effectiveness in complex simulations. Our results indicate that LLMs can reliably detect agreement even in dynamic and nuanced debates. Incorporating an agreement-detection agent within the system can also improve the efficiency of group debates and enhance the overall quality and coherence of deliberations, making them comparable to real-world decision conferences in terms of outcomes and decision-making. These findings demonstrate the potential of LLM-based multi-agent systems to simulate group decision-making processes, and highlight that such systems could be instrumental in supporting decision-making in expert elicitation workshops across various domains.
https://arxiv.org/abs/2507.08440
This paper presents a competitive approach to multilingual subjectivity detection using large language models (LLMs) with few-shot prompting. We participated in Task 1 (Subjectivity) of the CheckThat! 2025 evaluation campaign. We show that LLMs, when paired with carefully designed prompts, can match or outperform fine-tuned smaller language models (SLMs), particularly in noisy or low-quality data settings. Despite experimenting with advanced prompt-engineering techniques, such as debating LLMs and various example-selection strategies, we found limited benefit beyond well-crafted standard few-shot prompts. Our system achieved top rankings across multiple languages in the CheckThat! 2025 subjectivity detection task, including first place in Arabic and Polish and top-four finishes in the Italian, English, German, and multilingual tracks. Notably, our method proved especially robust on the Arabic dataset, likely due to its resilience to annotation inconsistencies. These findings highlight the effectiveness and adaptability of LLM-based few-shot learning for multilingual sentiment tasks, offering a strong alternative to traditional fine-tuning, particularly when labeled data is scarce or inconsistent.
https://arxiv.org/abs/2507.07539
Time, cost, and energy efficiency are critical considerations in Deep Learning (DL), particularly when processing long texts. Transformers, which represent the current state of the art, exhibit quadratic computational complexity relative to input length, making them inefficient for extended documents. This study introduces a novel model architecture that combines Graph Neural Networks (GNNs) and Convolutional Neural Networks (CNNs), integrated with a real-time, end-to-end graph generation mechanism. The model processes compact batches of character-level inputs without requiring padding or truncation. To enhance performance while maintaining high speed and efficiency, the model incorporates information from Large Language Models (LLMs), such as token embeddings and sentiment polarities, through efficient dictionary lookups. It captures local contextual patterns using CNNs, expands local receptive fields via lattice-based graph structures, and employs small-world graphs to aggregate document-level information. The generated graphs exhibit structural properties indicative of meaningful semantic organization, with an average clustering coefficient of approximately 0.45 and an average shortest path length between 4 and 5. The model is evaluated across multiple text classification tasks, including sentiment analysis and news categorization, and is compared against state-of-the-art models. Experimental results confirm the proposed model's efficiency and competitive performance.
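The clustering-coefficient and path-length statistics quoted above are standard graph measurements; the sketch below shows how they are computed, using a connected Watts-Strogatz small-world graph as a stand-in for the paper's document-derived graphs.

```python
import networkx as nx

# Stand-in graph: a connected Watts-Strogatz small-world graph. The
# paper's graphs are generated from documents; here we only illustrate
# how the two reported structural statistics are measured.
G = nx.connected_watts_strogatz_graph(n=1000, k=6, p=0.1, seed=42)

avg_clustering = nx.average_clustering(G)          # paper reports ~0.45
avg_path_len = nx.average_shortest_path_length(G)  # paper reports 4-5
print(f"average clustering coefficient: {avg_clustering:.2f}")
print(f"average shortest path length:   {avg_path_len:.2f}")
```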
https://arxiv.org/abs/2507.07414
Public product launches in Artificial Intelligence can serve as focusing events for collective attention, surfacing how societies react to technological change. Social media provide a window into the sensemaking around these events, revealing hopes and fears and showing who chooses to engage in the discourse and when. We demonstrate that public sensemaking about AI is shaped by the economic interests and cultural values of those involved. We analyze 3.8 million tweets posted by 1.6 million users across 117 countries in response to the public launch of ChatGPT in 2022. Our analysis shows how economic self-interest, proxied by occupational skill types in writing, programming, and mathematics, and national cultural orientations, as measured by Hofstede's individualism, uncertainty avoidance, and power distance dimensions, shape who speaks, when they speak, and their stance towards ChatGPT. Roles requiring more technical skills, such as programming and mathematics, tend to engage earlier and express more positive stances, whereas writing-centric occupations join later with greater skepticism. At the cultural level, individualism predicts both earlier engagement and a more negative stance, and uncertainty avoidance reduces the prevalence of positive stances but does not delay when users first engage with ChatGPT. Aggregate sentiment trends mask the dynamics observed in our study: the shift toward a more critical stance towards ChatGPT over time stems primarily from the entry of more skeptical voices rather than a change of heart among early adopters. Our findings underscore the importance of both occupational background and cultural context in understanding public reactions to AI.
https://arxiv.org/abs/2507.06876
Multi-source Opinion Summarization (M-OS) extends beyond traditional opinion summarization by incorporating additional sources of product metadata, such as descriptions, key features, specifications, and ratings, alongside reviews. This integration yields comprehensive summaries that capture both the subjective opinions and the objective product attributes essential for informed decision-making. While Large Language Models (LLMs) have shown significant success in various Natural Language Processing (NLP) tasks, their potential in M-OS remains largely unexplored. Additionally, the lack of evaluation datasets for this task has impeded further advancements. To bridge this gap, we introduce M-OS-EVAL, a benchmark dataset for evaluating multi-source opinion summaries across 7 key dimensions: fluency, coherence, relevance, faithfulness, aspect coverage, sentiment consistency, and specificity. Our results demonstrate that M-OS significantly enhances user engagement, as evidenced by a user study in which, on average, 87% of participants preferred M-OS over plain opinion summaries. Our experiments demonstrate that factually enriched summaries enhance user engagement. Notably, M-OS-PROMPTS exhibit stronger alignment with human judgment, achieving an average Spearman correlation of ρ = 0.74, which surpasses the performance of previous methodologies.
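For reference, the Spearman correlation used to report alignment with human judgment can be computed as below; the scores are made-up stand-ins for automatic metric outputs and human ratings, not the paper's data.

```python
from scipy.stats import spearmanr

# Hypothetical stand-in scores: automatic metric vs. human judgments
# for eight summaries (the paper reports an average rho of 0.74).
metric_scores = [0.82, 0.55, 0.91, 0.40, 0.77, 0.60, 0.88, 0.35]
human_scores  = [4.5,  3.0,  4.8,  2.5,  4.0,  3.5,  4.6,  2.0]

rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```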
https://arxiv.org/abs/2507.04751
This work evaluates FinGPT, a financial domain-specific language model, across six key natural language processing (NLP) tasks: Sentiment Analysis, Text Classification, Named Entity Recognition, Financial Question Answering, Text Summarization, and Stock Movement Prediction. The evaluation uses finance-specific datasets to assess FinGPT's capabilities and limitations in real-world financial applications. The results show that FinGPT performs strongly in classification tasks such as sentiment analysis and headline categorization, often achieving results comparable to GPT-4. However, its performance is significantly lower in tasks that involve reasoning and generation, such as financial question answering and summarization. Comparisons with GPT-4 and human benchmarks highlight notable performance gaps, particularly in numerical accuracy and complex reasoning. Overall, the findings indicate that while FinGPT is effective for certain structured financial tasks, it is not yet a comprehensive solution. This study provides a useful benchmark for future research and underscores the need for architectural improvements and domain-specific optimization in financial language models.
https://arxiv.org/abs/2507.08015
Machine learning methods are increasingly applied to analyze health-related public discourse at scale, but questions remain regarding their ability to accurately detect different types of health sentiments. In particular, Large Language Models (LLMs) have gained attention as a powerful technology, yet their accuracy and feasibility in capturing different opinions and perspectives on health issues remain largely unexplored. This research therefore examines how accurately three prominent LLMs (GPT, Gemini, and LLaMA) detect risk-promoting versus health-supporting sentiments across two critical public health topics: Human Papillomavirus (HPV) vaccination and heated tobacco products (HTPs). Drawing on data from Facebook and Twitter, we curated multiple sets of messages supporting or opposing recommended health behaviors, supplemented with human annotations as the gold standard for sentiment classification. The findings indicate that all three LLMs generally demonstrate substantial accuracy in classifying risk-promoting and health-supporting sentiments, although notable discrepancies emerge by platform, health issue, and model type. Specifically, the models are often more accurate on risk-promoting sentiment on Facebook, whereas health-supporting messages on Twitter are detected more accurately. An additional analysis also shows the challenges LLMs face in reliably detecting neutral messages. These results highlight the importance of carefully selecting and validating language models for public health analyses, particularly given potential biases in training data that may lead LLMs to overestimate or underestimate the prevalence of certain perspectives.
https://arxiv.org/abs/2507.04364
Student dropout in distance learning remains a critical challenge, with profound societal and economic consequences. While classical machine learning models leverage structured socio-demographic and behavioral data, they often fail to capture the nuanced emotional and contextual factors embedded in unstructured student interactions. This paper introduces a transformative AI framework that redefines dropout prediction through three synergistic innovations: Retrieval-Augmented Generation (RAG) for domain-specific sentiment analysis, prompt engineering to decode academic stressors, and cross-modal attention fusion to dynamically align textual, behavioral, and socio-demographic insights. By grounding sentiment analysis in a curated knowledge base of pedagogical content, our RAG-enhanced BERT model interprets student comments with unprecedented contextual relevance, while optimized prompts isolate indicators of academic distress (e.g., "isolation," "workload anxiety"). A cross-modal attention layer then fuses these insights with temporal engagement patterns, creating holistic risk profiles. Evaluated on a longitudinal dataset of 4,423 students, the framework achieves 89% accuracy and an F1-score of 0.88, outperforming conventional models by 7% and reducing false negatives by 21%. Beyond prediction, the system generates interpretable interventions by retrieving contextually aligned strategies (e.g., mentorship programs for isolated learners). This work bridges the gap between predictive analytics and actionable pedagogy, offering a scalable solution to mitigate dropout risks in global education systems.
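The cross-modal attention fusion step can be sketched with standard attention primitives; the dimensions, modality encodings, and the use of MultiheadAttention below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative sketch: textual features attend over behavioral and
    socio-demographic features to form a fused dropout-risk profile."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, 2)  # dropout risk: yes / no

    def forward(self, text_feats, behavior_feats, demo_feats):
        # text features act as queries; the other modalities as keys/values
        context = torch.cat([behavior_feats, demo_feats], dim=1)
        fused, _ = self.attn(text_feats, context, context)
        return self.classifier(fused.mean(dim=1))  # pooled risk logits

# toy shapes: batch of 8 students with assumed 256-dim encoded features
text = torch.randn(8, 16, 256)      # 16 encoded comment tokens
behavior = torch.randn(8, 10, 256)  # 10 temporal engagement steps
demo = torch.randn(8, 4, 256)       # 4 socio-demographic fields
logits = CrossModalFusion()(text, behavior, demo)
print(logits.shape)                 # torch.Size([8, 2])
```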
https://arxiv.org/abs/2507.05285
Sentiment analysis, widely used in product reviews, also impacts financial markets by influencing asset prices through microblogs and news articles. Despite research in sentiment-driven finance, many studies focus on sentence-level classification, overlooking its practical application in trading. This study bridges that gap by evaluating sentiment-based trading strategies for generating positive alpha. We conduct a backtesting analysis using sentiment predictions from three models (two classification and one regression) applied to news articles on Dow Jones 30 stocks, comparing them to the benchmark Buy&Hold strategy. Results show all models produced positive returns, with the regression model achieving the highest return of 50.63% over 28 months, outperforming the benchmark Buy&Hold strategy. This highlights the potential of sentiment in enhancing investment strategies and financial decision-making.
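A minimal backtest loop for a sentiment strategy against Buy&Hold might look like the following; the returns and the sentiment signal are synthetic placeholders, not the paper's models or data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_days = 588  # roughly 28 months of trading days

# Synthetic placeholders: daily stock returns and a daily sentiment
# signal in [-1, 1] predicted from that day's news articles.
daily_returns = rng.normal(0.0004, 0.012, n_days)
sentiment = np.clip(daily_returns * 30 + rng.normal(0, 0.4, n_days), -1, 1)

# Strategy: hold the stock only on days following positive sentiment.
position = (np.roll(sentiment, 1) > 0).astype(float)
position[0] = 0.0
strategy_returns = position * daily_returns

def total_return(r):
    return np.prod(1 + r) - 1

print(f"Buy&Hold:  {total_return(daily_returns):.2%}")
print(f"Sentiment: {total_return(strategy_returns):.2%}")
```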
https://arxiv.org/abs/2507.03350
Large language models (LLMs) are increasingly integrated into applications ranging from review summarization to medical diagnosis support, where they affect human decisions. Even though LLMs perform well in many tasks, they may also inherit societal or cognitive biases, which can inadvertently transfer to humans. We investigate when and how LLMs expose users to biased content and quantify its severity. Specifically, we assess three LLM families on summarization and news fact-checking tasks, evaluating how consistent LLMs stay with their context and/or how much they hallucinate. Our findings show that LLMs expose users to content that changes the sentiment of the context in 21.86% of cases, hallucinate on post-knowledge-cutoff questions in 57.33% of cases, and exhibit primacy bias in 5.94% of cases. We evaluate 18 distinct mitigation methods across three LLM families and find that targeted interventions can be effective. Given the prevalent use of LLMs in high-stakes domains, such as healthcare or legal analysis, our results highlight the need for robust technical safeguards and for developing user-centered interventions that address LLM limitations.
https://arxiv.org/abs/2507.03194
With the rapid advancement of Reinforcement Learning from Human Feedback (RLHF) and autoregressive transformers, state-of-the-art models such as GPT-4.0, DeepSeek R1, and Llama 3.3 increasingly emphasize answer depth and personalization. However, most existing RLHF approaches (e.g., PPO, DPO) still rely on a binary-preference (BT) paradigm, which, while reducing annotation costs, still requires substantial human effort and captures only group-level tendencies rather than individual preferences. To overcome these limitations, we propose Adaptive Reward-Following (ARF), a self-assessment framework that leverages a high-precision emotion analyzer, achieving over 70% accuracy on GoEmotions, Sentiment140, and DailyDialog, to convert free-form user feedback into continuous preference scores. We further enrich and debias these signals through lightweight data augmentations, including synonym replacement, random trace truncation, and a score-bias annotation algorithm. A Dynamic Adapter Preference Tracker continuously models evolving user tastes in real time, enabling our novel Trace Bias (TB) fine-tuning algorithm to optimize directly on these tracked rewards instead of coarse binary labels. Experiments on Qwen-2/2.5, Gemma-2, and Llama-3.2 across four preference domains demonstrate that ARF improves on PPO by 3.3% and on DPO by 7.6%. Moreover, TB preserves theoretical alignment with the PPO and DPO objectives. Overall, ARF presents a scalable, personalized, and cost-effective approach to RLHF for LLMs through autonomous reward modeling.
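ARF's first step, turning free-form feedback into a continuous preference score via an emotion analyzer, could be sketched as below. The choice of a public GoEmotions classifier and the valence weights are assumptions for illustration, not the paper's components.

```python
from transformers import pipeline

# Assumed stand-in emotion analyzer: a public GoEmotions-style classifier
# (the paper's own analyzer is not released with the abstract).
emotion = pipeline("text-classification",
                   model="SamLowe/roberta-base-go_emotions", top_k=None)

# Hypothetical valence weights mapping emotion labels to reward signal.
VALENCE = {"admiration": 1.0, "joy": 0.8, "approval": 0.6,
           "neutral": 0.0, "annoyance": -0.6, "anger": -1.0}

def preference_score(feedback: str) -> float:
    """Map emotion probabilities to a scalar preference score in [-1, 1]."""
    scores = emotion(feedback)
    if scores and isinstance(scores[0], list):  # some versions nest results
        scores = scores[0]
    return sum(VALENCE.get(s["label"], 0.0) * s["score"] for s in scores)

print(preference_score("This answer was exactly what I needed, thanks!"))
```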
https://arxiv.org/abs/2507.03069
The global reach of social media has amplified the spread of hateful content, including implicit sexism, which is often overlooked by conventional detection methods. In this work, we introduce an Adaptive Supervised Contrastive lEarning framework for implicit sexism detectioN (ASCEND). A key innovation of our method is threshold-based contrastive learning: by computing cosine similarities between embeddings, we treat a sample pair as positive only if its similarity exceeds a learnable threshold. This mechanism refines the embedding space by robustly pulling together representations of semantically similar texts while pushing apart dissimilar ones, thus reducing false positives and false negatives. The final classification is achieved by jointly optimizing a contrastive loss with a cross-entropy loss. Textual features are enhanced through a word-level attention module, and we additionally employ sentiment, emotion, and toxicity features. Evaluations on the EXIST2021 and MLSC datasets demonstrate that ASCEND significantly outperforms existing methods, with average Macro F1 improvements of 9.86%, 29.63%, and 32.51% across multiple tasks, highlighting its efficacy in capturing the subtle cues of implicit sexist language.
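A minimal sketch of the threshold-based contrastive idea follows. It is not the paper's loss: the threshold is shown as a constant because a hard mask blocks its gradient (the paper's mechanism for making it learnable is omitted), and the clustered toy embeddings are fabricated so that some pairs actually cross the threshold.

```python
import torch
import torch.nn.functional as F

def threshold_contrastive_loss(embeddings, threshold=0.7, temperature=0.1):
    """Pairs count as positive only if cosine similarity > threshold."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T                                   # cosine similarity matrix
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    positives = (sim > threshold) & ~eye            # dynamic positive mask

    logits = sim / temperature
    log_prob = logits - torch.logsumexp(
        logits.masked_fill(eye, float("-inf")), dim=1, keepdim=True)

    pos_counts = positives.sum(dim=1).clamp(min=1)  # avoid divide-by-zero
    per_anchor = -(log_prob * positives).sum(dim=1) / pos_counts
    return per_anchor[positives.any(dim=1)].mean()

# Toy embeddings: two tight clusters of eight samples each, so that
# within-cluster pairs exceed the similarity threshold.
base = torch.randn(2, 128)
embeddings = (base.repeat_interleave(8, dim=0)
              + 0.1 * torch.randn(16, 128)).requires_grad_()
loss = threshold_contrastive_loss(embeddings)
loss.backward()  # jointly optimized with a cross-entropy loss in the paper
print(float(loss))
```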
https://arxiv.org/abs/2507.05271
Multiperspective Fusion (MPF) is a novel post-training alignment framework for large language models (LLMs), developed in response to the growing need for easy bias mitigation. Built on top of the SAGED pipeline, an automated system for constructing bias benchmarks and extracting interpretable baseline distributions, MPF leverages multiperspective generations to expose and align biases in LLM outputs with nuanced, human-like baselines. By decomposing a baseline, such as a sentiment distribution from HR professionals, into interpretable perspective components, MPF guides generation through sampling and balancing of responses, weighted by the probabilities obtained in the decomposition. Empirically, we demonstrate its ability to align LLM sentiment distributions with both counterfactual baselines (absolute equality) and the HR baseline (biased toward Top University), resulting in small KL divergence, reduced calibration error, and generalization to unseen questions. This shows that MPF offers a scalable and interpretable method for alignment and bias mitigation, compatible with deployed LLMs and requiring no extensive prompt engineering or fine-tuning.
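The decomposition-and-alignment check can be sketched numerically; the perspective components, weights, and output distribution below are made-up assumptions used only to show how a mixture baseline and the KL divergence fit together.

```python
import numpy as np
from scipy.stats import entropy

# Hypothetical perspective components: P(positive, neutral, negative).
perspectives = {
    "optimistic": np.array([0.70, 0.20, 0.10]),
    "cautious":   np.array([0.30, 0.50, 0.20]),
    "critical":   np.array([0.10, 0.30, 0.60]),
}
weights = np.array([0.5, 0.3, 0.2])  # assumed decomposition probabilities

# Baseline as the weighted mixture of perspective components.
baseline = sum(w * p for w, p in zip(weights, perspectives.values()))

# Sentiment distribution of (hypothetical) MPF-aligned LLM outputs.
llm_output = np.array([0.44, 0.33, 0.23])

kl = entropy(llm_output, baseline)  # KL(llm_output || baseline)
print(f"KL divergence from baseline: {kl:.4f}")
```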
https://arxiv.org/abs/2507.02595
Conversational agents have made significant progress since ELIZA, expanding their role across various domains, including healthcare, education, and customer service. As these agents become increasingly integrated into daily human interactions, the need for emotional intelligence, particularly empathetic listening, becomes increasingly essential. In this study, we explore how Large Language Models (LLMs) respond when tasked with generating emotionally rich interactions. Starting from a small dataset manually crafted by an expert to reflect empathic behavior, we extended the conversations using two LLMs: ChatGPT and Gemini. We analyzed the emotional progression of the dialogues using both sentiment analysis (via VADER) and expert assessments. While the generated conversations often mirrored the intended emotional structure, human evaluation revealed important differences in the perceived empathy and coherence of the responses. These findings suggest that emotion modeling in dialogues requires not only structural alignment in the expressed emotions but also qualitative depth, highlighting the importance of combining automated and human-centered methods in the development of emotionally competent agents.
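For readers unfamiliar with VADER, here is how the kind of turn-by-turn sentiment tracking described above is done with NLTK; the dialogue turns are invented stand-ins, not the study's data.

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

# Toy dialogue turns standing in for the generated conversations;
# we track VADER's compound score (-1 to 1) across the exchange.
turns = [
    "I've been feeling really overwhelmed at work lately.",
    "That sounds exhausting. Do you want to talk about what's weighing on you?",
    "Honestly, yes. It helps just to have someone listen.",
]
for turn in turns:
    compound = sia.polarity_scores(turn)["compound"]
    print(f"{compound:+.3f}  {turn}")
```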
https://arxiv.org/abs/2507.02537
Bias and stereotypes in language models can cause harm, especially in sensitive areas like content moderation and decision-making. This paper addresses bias and stereotype detection by exploring how jointly learning these tasks enhances model performance. We introduce StereoBias, a unique dataset labeled for bias and stereotype detection across religion, gender, socio-economic status, race, profession, and other categories, enabling a deeper study of their relationship. Our experiments compare encoder-only models and decoder-only models fine-tuned with QLoRA. While encoder-only models perform well, decoder-only models also show competitive results. Crucially, joint training on bias and stereotype detection significantly improves bias detection compared to training on them separately. Additional experiments with sentiment analysis confirm that the improvements stem from the connection between bias and stereotypes, not from multi-task learning alone. These findings highlight the value of leveraging stereotype information to build fairer and more effective AI systems.
https://arxiv.org/abs/2507.01715
In this paper, we investigate the transferability of pre-trained language models to low-resource Indonesian local languages through the task of sentiment analysis. We evaluate both zero-shot performance and adapter-based transfer on ten local languages using models of different types: a monolingual Indonesian BERT, multilingual models such as mBERT and XLM-R, and a modular adapter-based approach called MAD-X. To better understand model behavior, we group the target languages into three categories: seen (included during pre-training), partially seen (not included but linguistically related to seen languages), and unseen (absent and unrelated in pre-training data). Our results reveal clear performance disparities across these groups: multilingual models perform best on seen languages, moderately on partially seen ones, and poorly on unseen languages. We find that MAD-X significantly improves performance, especially for seen and partially seen languages, without requiring labeled data in the target language. Additionally, we conduct a further analysis on tokenization and show that while subword fragmentation and vocabulary overlap with Indonesian correlate weakly with prediction quality, they do not fully explain the observed performance. Instead, the most consistent predictor of transfer success is the model's prior exposure to the language, either directly or through a related language.
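The tokenization analysis mentioned above amounts to measuring how heavily a tokenizer fragments words in each language; a minimal sketch follows, with illustrative sample words (not the paper's evaluation data) and mBERT's tokenizer as the example model.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def fragmentation_rate(words):
    """Average number of subword pieces per word (1.0 = no splitting)."""
    pieces = [tokenizer.tokenize(w) for w in words]
    return sum(len(p) for p in pieces) / len(words)

# Illustrative word lists: Indonesian (seen in pre-training) vs. a
# related local language (Javanese-like forms, used here as examples).
indonesian = ["makan", "minum", "belajar", "sekolah"]
local_lang = ["mangan", "ngombe", "sinau", "sekolahan"]

print(f"Indonesian: {fragmentation_rate(indonesian):.2f} pieces/word")
print(f"Local lang: {fragmentation_rate(local_lang):.2f} pieces/word")
```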
https://arxiv.org/abs/2507.01645
Task-oriented dialogue (ToD) systems are designed to help users achieve specific goals through natural language interaction. While recent advances in large language models (LLMs) have significantly improved linguistic fluency and contextual understanding, building effective and emotionally intelligent ToD systems remains a complex challenge. Effective ToD systems must optimise for task success, emotional understanding and responsiveness, and precise information conveyance, all within inherently noisy and ambiguous conversational environments. In this work, we investigate architectural, representational, optimisation, and emotional considerations of ToD systems. We set up systems covering these design considerations with a challenging evaluation environment composed of a natural-language user simulator coupled with an imperfect natural language understanding module. We propose LUSTER, an LLM-based Unified System for Task-oriented dialogue with End-to-end Reinforcement learning, trained with both short-term (user sentiment) and long-term (task success) rewards. Our findings demonstrate that combining LLM capability with structured reward modelling leads to more resilient and emotionally responsive ToD systems, offering a practical path forward for next-generation conversational agents.
https://arxiv.org/abs/2507.01594
In the stance detection task, a text is classified as either favorable, opposing, or neutral towards a target. Prior work suggests that the use of external information, e.g., excerpts from Wikipedia, improves stance detection performance. However, whether or not such information can benefit large language models (LLMs) remains an unanswered question, despite their wide adoption in many reasoning tasks. In this study, we conduct a systematic evaluation of how external information from Wikipedia and web search affects stance detection across eight LLMs and three datasets with 12 targets. Surprisingly, we find that such information degrades performance in most cases, with macro F1 scores dropping by up to 27.9%. We explain this through experiments showing LLMs' tendency to align their predictions with the stance and sentiment of the provided information rather than the ground-truth stance of the given text. We also find that the performance degradation persists with chain-of-thought prompting, while fine-tuning mitigates but does not fully eliminate it. Our findings, in contrast to previous literature on BERT-based systems which suggests that external information enhances performance, highlight the risks of information biases in LLM-based stance classifiers. Code is available at this https URL.
https://arxiv.org/abs/2507.01543
Aspect-based Sentiment Analysis (ABSA) is a critical Natural Language Processing (NLP) task that extracts aspects from text and determines their associated sentiments, enabling fine-grained analysis of user opinions. Existing ABSA methods struggle to balance computational efficiency with high performance: deep learning models often lack global context, transformers demand significant computational resources, and Mamba-based approaches face CUDA dependency and diminished local correlations. Recent advancements in Extended Long Short-Term Memory (xLSTM) models, particularly their efficient modeling of long-range dependencies, have significantly advanced the NLP community. However, their potential in ABSA remains untapped. To this end, we propose xLSTM with Multihead Exponential Gated Fusion (MEGA), a novel framework integrating a bi-directional mLSTM architecture with forward and partially flipped backward (PF-mLSTM) streams. The PF-mLSTM enhances localized context modeling by processing the initial sequence segment in reverse with dedicated parameters, preserving critical short-range patterns. We further introduce an mLSTM-based multihead cross exponential gated fusion mechanism (MECGAF) that dynamically combines forward mLSTM outputs as query and key with PF-mLSTM outputs as value, optimizing short-range dependency capture while maintaining global context and efficiency. Experimental results on three benchmark datasets demonstrate that MEGA outperforms state-of-the-art baselines, achieving superior accuracy and efficiency in ABSA tasks.
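The two ideas named above, a partially flipped stream and gated cross-fusion, can be sketched as follows. This is not MEGA itself: nn.LSTM stands in for mLSTM, nn.MultiheadAttention stands in for the exponential gated fusion (MECGAF), and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class PFFusionSketch(nn.Module):
    """Illustrative sketch: (1) a partially flipped stream that re-reads
    the initial segment in reverse, and (2) attention-style fusion where
    the forward stream supplies queries/keys and the flipped stream
    supplies values, echoing MEGA's forward/PF-mLSTM + MECGAF design."""

    def __init__(self, dim=128, flip_len=16):
        super().__init__()
        self.flip_len = flip_len
        self.fwd = nn.LSTM(dim, dim, batch_first=True)   # stand-in for mLSTM
        self.pf = nn.LSTM(dim, dim, batch_first=True)    # stand-in for PF-mLSTM
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, x):
        # partially flipped copy: reverse only the first flip_len steps
        x_pf = torch.cat([x[:, :self.flip_len].flip(1),
                          x[:, self.flip_len:]], dim=1)
        h_fwd, _ = self.fwd(x)
        h_pf, _ = self.pf(x_pf)
        fused, _ = self.fuse(h_fwd, h_fwd, h_pf)  # Q,K from forward; V from PF
        return fused

x = torch.randn(4, 64, 128)   # batch, sequence length, feature dim
out = PFFusionSketch()(x)
print(out.shape)              # torch.Size([4, 64, 128])
```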
https://arxiv.org/abs/2507.01213