Large language models (LLMs) can often accurately describe probability distributions using natural language, yet they still struggle to generate faithful samples from them. This mismatch limits their use in tasks requiring reliable stochasticity, such as Monte Carlo methods, agent-based simulations, and randomized decision-making. We investigate this gap between knowledge and sampling in the context of Bernoulli distributions. We introduce Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling that prompts the LLM to reason about and accept or reject proposed samples. Despite relying on the same Bernoulli mechanism internally, VRS substantially reduces sampling bias across models. We provide theoretical analysis showing that, under mild assumptions, VRS improves over direct sampling, with gains attributable to both the algorithm and prompt design. More broadly, our results show how classical probabilistic tools can be verbalized and embedded into LLM workflows to improve reliability, without requiring access to model internals or heavy prompt engineering.
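Since VRS verbalizes classical rejection sampling, the control flow is easy to sketch. Below is a minimal sketch assuming a fair-coin proposal for a Bernoulli(p) target; `call_llm` is a hypothetical stand-in for the model API, and the prompt wording is illustrative rather than the paper's template. A classical sampler would accept a proposal x with probability p(x)/(M·q(x)).

```python
import random

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; should return 'accept' or 'reject'."""
    raise NotImplementedError

def verbalized_rejection_sampling(p: float, max_tries: int = 100) -> int:
    """Sketch of VRS for a Bernoulli(p) target with a fair-coin proposal q = 0.5.
    The accept/reject step of classical rejection sampling is delegated to the
    LLM in natural language instead of being computed numerically."""
    q = 0.5
    M = max(p, 1 - p) / q  # envelope constant so that p(x) <= M * q(x)
    x = 0
    for _ in range(max_tries):
        x = random.randint(0, 1)  # propose from Bernoulli(0.5)
        prompt = (
            f"Target distribution: Bernoulli(p={p}). Proposal: Bernoulli({q}). "
            f"Envelope constant M={M:.3f}. Proposed sample: x={x}. "
            "Following rejection sampling, reply with exactly one word, "
            "'accept' or 'reject'."
        )
        if call_llm(prompt).strip().lower().startswith("accept"):
            return x
    return x  # fall back to the last proposal if nothing was accepted
```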
https://arxiv.org/abs/2506.09998
Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy a subsequent moderation stage as an external safety guardrail in real-world products. Existing moderators mainly perform conventional full detection, which determines harmfulness based on the complete LLM output, causing high service latency. Recent works pay more attention to partial detection, where moderators oversee the generation midway and stop the output early if harmfulness is detected, but they directly apply moderators trained with the full-detection paradigm to incomplete outputs, introducing a training-inference gap that lowers performance. In this paper, we explore how to form a data-and-model solution that natively supports partial detection. For the data, we construct FineHarm, a dataset of 29K prompt-response pairs with fine-grained annotations that provide reasonable supervision for token-level training. We then propose the streaming content monitor (SCM), which is trained with dual supervision from response- and token-level labels and can follow the output stream of an LLM to make a timely judgment of harmfulness. Experiments show that SCM achieves a macro F1 score of 0.95+, comparable to full detection, while seeing only the first 18% of tokens in responses on average. Moreover, SCM can serve as a pseudo-harmfulness annotator for improving safety alignment, leading to a higher harmlessness score than DPO.
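A minimal sketch of the partial-detection loop the paper targets: the monitor follows the token stream and halts generation the moment its harmfulness estimate crosses a threshold. Here `score_prefix` is a hypothetical stand-in for SCM's token-level scoring head.

```python
from typing import Callable, Iterable, Tuple

def moderate_stream(tokens: Iterable[str],
                    score_prefix: Callable[[str], float],
                    threshold: float = 0.5) -> Tuple[str, bool]:
    """Follow an LLM's output stream and early-stop on detected harm.
    `score_prefix` maps the text seen so far to a harmfulness score in [0, 1]."""
    seen = ""
    for tok in tokens:
        seen += tok
        if score_prefix(seen) >= threshold:
            return seen, True   # harmful: stop generation here
    return seen, False          # stream completed and judged harmless
```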
https://arxiv.org/abs/2506.09996
Online toxic language causes real harm, especially in regions with limited moderation tools. In this study, we evaluate how large language models handle toxic comments in Serbian, Croatian, and Bosnian, languages with limited labeled data. We built and manually labeled a dataset of 4,500 YouTube and TikTok comments drawn from videos across diverse categories, including music, politics, sports, modeling, influencer content, discussions of sexism, and general topics. Four models (GPT-3.5 Turbo, GPT-4.1, Gemini 1.5 Pro, and Claude 3 Opus) were tested in two modes: zero-shot and context-augmented. We measured precision, recall, F1 score, accuracy and false positive rates. Including a short context snippet raised recall by about 0.12 on average and improved F1 score by up to 0.10, though it sometimes increased false positives. The best balance came from Gemini in context-augmented mode, reaching an F1 score of 0.82 and accuracy of 0.82, while zero-shot GPT-4.1 led on precision and had the lowest false alarms. We show how adding minimal context can improve toxic language detection in low-resource settings and suggest practical strategies such as improved prompt design and threshold calibration. These results show that prompt design alone can yield meaningful gains in toxicity detection for underserved Balkan language communities.
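The two evaluation modes differ only in whether a short context snippet is prepended. A sketch of the prompt construction, with wording that is illustrative rather than the study's exact prompts:

```python
from typing import Optional

def build_prompt(comment: str, context: Optional[str] = None) -> str:
    """Build a zero-shot or context-augmented moderation prompt.
    The phrasing here is an assumption; the paper's prompts may differ."""
    prompt = ("You are moderating Serbian/Croatian/Bosnian social media comments. "
              "Answer TOXIC or NOT_TOXIC.\n")
    if context is not None:  # context-augmented mode: add a short snippet
        prompt += f"Context (video title/description): {context}\n"
    return prompt + f"Comment: {comment}\nLabel:"
```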
https://arxiv.org/abs/2506.09992
Recent advances in large language models (LLMs) have enabled impressive performance on various tasks. However, standard prompting often struggles to produce structurally valid and accurate outputs, especially in dependency parsing. We propose a novel step-by-step instruction strategy, in which universal part-of-speech tagging precedes the prediction of syntactic heads and dependency labels, together with a simplified CoNLL-U-like output format. Our method achieves state-of-the-art accuracy on Universal Dependencies datasets across 17 languages without hallucination or contamination. We further show that multilingual fine-tuning simultaneously improves cross-language generalization. Our results highlight the effectiveness of explicit reasoning steps in LLM-based parsing and offer a scalable, format-consistent alternative to bracket-based approaches.
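An illustrative rendering of the step-by-step instruction: POS tags first, then heads, then labels, emitted in a simplified CoNLL-U-like format. The instruction text below is an assumption, not the paper's exact prompt.

```python
def parsing_prompt(sentence: str) -> str:
    """Sketch of the staged parsing instruction with a simplified
    CoNLL-U-like output (ID, FORM, UPOS, HEAD, DEPREL)."""
    return (
        "Parse the sentence in three explicit steps.\n"
        "Step 1: assign each token a Universal POS tag.\n"
        "Step 2: predict each token's syntactic head index (0 for the root).\n"
        "Step 3: predict each token's Universal Dependencies relation label.\n"
        "Then output one line per token: ID\\tFORM\\tUPOS\\tHEAD\\tDEPREL\n"
        f"Sentence: {sentence}\n"
    )
```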
https://arxiv.org/abs/2506.09983
Detecting AI-generated text is a difficult problem to begin with; detecting AI-generated text on social media is made even more difficult due to the short text length and informal, idiosyncratic language of the internet. It is nonetheless important to tackle this problem, as social media represents a significant attack vector in online influence campaigns, which may be bolstered through the use of mass-produced AI-generated posts supporting (or opposing) particular policies, decisions, or events. We approach this problem with the mindset and resources of a reasonably sophisticated threat actor, and create a dataset of 505,159 AI-generated social media posts from a combination of open-source, closed-source, and fine-tuned LLMs, covering 11 different controversial topics. We show that while the posts can be detected under typical research assumptions about knowledge of and access to the generating models, under the more realistic assumption that an attacker will not release their fine-tuned model to the public, detectability drops dramatically. This result is confirmed with a human study. Ablation experiments highlight the vulnerability of various detection algorithms to fine-tuned LLMs. This result has implications across all detection domains, since fine-tuning is a generally applicable and realistic LLM use case.
https://arxiv.org/abs/2506.09975
How cost-effectively can we elicit strong reasoning in language models by leveraging their underlying representations? We answer this question with Resa, a family of 1.5B reasoning models trained via a novel and efficient sparse autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to capture reasoning abilities from a source model, and then uses the trained SAE to guide a standard supervised fine-tuning process to elicit such abilities in a target model, all using verified question-answer data without any reasoning traces. Notably, when applied to certain base models before further RL post-training, SAE-Tuning retains >97% of its RL-trained counterpart's reasoning performance while reducing training costs by >2000x to roughly $1 and training time by >450x to around 20 minutes. Furthermore, when applied to lightly RL-trained models (e.g., within 1 hour on 2 GPUs), it enables reasoning performance such as 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23 for only around $1 additional cost. Surprisingly, the reasoning abilities extracted via SAEs are potentially both generalizable and modular. Generality means abilities extracted from one dataset still elevate performance on a larger and overlapping corpus. Modularity means abilities extracted from Qwen or Qwen-Math can be attached to the R1-Distill model at test time, without any retraining, and yield comparable gains. Extensive ablations validate these findings and all artifacts are fully open-sourced.
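The SAE component itself is standard; a minimal sketch follows, with placeholder dimensions. How the trained SAE then steers the supervised fine-tuning of the target model is specific to SAE-Tuning and not reproduced here.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE of the kind trained on a source model's hidden states to
    capture reasoning features; the dimensions are placeholders."""
    def __init__(self, d_model: int = 1536, d_dict: int = 8192):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, h: torch.Tensor):
        z = torch.relu(self.enc(h))  # sparse feature activations
        return self.dec(z), z

def sae_objective(h: torch.Tensor, recon: torch.Tensor, z: torch.Tensor,
                  l1_coef: float = 1e-3) -> torch.Tensor:
    """Standard SAE objective: reconstruction error plus an L1 sparsity penalty."""
    return ((recon - h) ** 2).mean() + l1_coef * z.abs().mean()
```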
https://arxiv.org/abs/2506.09967
Recent work has identified retrieval heads (Wu et al., 2025b), a subset of attention heads responsible for retrieving salient information in long-context language models (LMs), as measured by their copy-paste behavior in Needle-in-a-Haystack tasks. In this paper, we introduce QRHEAD (Query-Focused Retrieval Head), an improved set of attention heads that enhance retrieval from long context. We identify QRHEAD by aggregating attention scores with respect to the input query, using a handful of examples from real-world tasks (e.g., long-context QA). We further introduce QRRETRIEVER, an efficient and effective retriever that uses the accumulated attention mass of QRHEAD as retrieval scores. We use QRRETRIEVER for long-context reasoning by selecting the most relevant parts with the highest retrieval scores. On the multi-hop reasoning tasks LongMemEval and CLIPPER, this yields over 10% performance gains over full context and outperforms strong dense retrievers. We also evaluate QRRETRIEVER as a re-ranker on the BEIR benchmark and find that it achieves strong zero-shot performance, outperforming other LLM-based re-rankers such as RankGPT. Further analysis shows that both the query-context attention scoring and task selection are crucial for identifying QRHEAD with strong downstream utility. Overall, our work contributes a general-purpose retriever and offers interpretability insights into the long-context capabilities of LMs.
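A sketch of the scoring idea: sum the attention mass that query tokens place on each context chunk over a selected set of heads, then rank chunks by that mass. How the heads in `qr_heads` are selected (from a handful of real-task examples) follows the paper and is not shown.

```python
import numpy as np

def qr_retrieval_scores(attn, query_span, chunk_spans, qr_heads):
    """Score context chunks by accumulated query-to-chunk attention mass.
    `attn[layer]` is a [num_heads, seq_len, seq_len] array; spans are
    (start, end) token index pairs."""
    q0, q1 = query_span
    scores = []
    for c0, c1 in chunk_spans:
        mass = 0.0
        for layer, head in qr_heads:
            # rows: query tokens; columns: tokens of this chunk
            mass += float(attn[layer][head, q0:q1, c0:c1].sum())
        scores.append(mass)
    return np.asarray(scores)  # retrieve the chunks with the highest scores
```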
https://arxiv.org/abs/2506.09944
Reviews are valuable resources for customers making purchase decisions in online shopping. However, it is impractical for customers to go over the vast number of reviews and manually conclude the prominent opinions, which prompts the need for automated opinion summarization systems. Previous approaches, either extractive or abstractive, face challenges in automatically producing grounded aspect-centric summaries. In this paper, we propose a novel summarization system that not only captures predominant opinions from an aspect perspective with supporting evidence, but also adapts to varying domains without relying on a pre-defined set of aspects. Our proposed framework, ASESUM, summarizes viewpoints relevant to the critical aspects of a product by extracting aspect-centric arguments and measuring their salience and validity. We conduct experiments on a real-world dataset to demonstrate the superiority of our approach in capturing diverse perspectives of the original reviews compared to new and existing methods.
https://arxiv.org/abs/2506.09917
The prediction of foreign exchange rates, such as the US Dollar (USD) to Bangladeshi Taka (BDT), plays a pivotal role in global financial markets, influencing trade, investments, and economic stability. This study leverages historical USD/BDT exchange rate data from 2018 to 2023, sourced from Yahoo Finance, to develop advanced machine learning models for accurate forecasting. A Long Short-Term Memory (LSTM) neural network is employed, achieving an exceptional accuracy of 99.449%, a Root Mean Square Error (RMSE) of 0.9858, and a test loss of 0.8523, significantly outperforming traditional methods like ARIMA (RMSE 1.342). Additionally, a Gradient Boosting Classifier (GBC) is applied for directional prediction, with backtesting on a $10,000 initial capital revealing a 40.82% profitable trade rate, though resulting in a net loss of $20,653.25 over 49 trades. The study analyzes historical trends, showing a decline in BDT/USD rates from 0.012 to 0.009, and incorporates normalized daily returns to capture volatility. These findings highlight the potential of deep learning in forex forecasting, offering traders and policymakers robust tools to mitigate risks. Future work could integrate sentiment analysis and real-time economic indicators to further enhance model adaptability in volatile markets.
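A minimal sketch of the forecasting setup described: windowed next-day regression with an LSTM. The lookback length and layer sizes below are assumptions, not the paper's configuration.

```python
import numpy as np
from tensorflow import keras

def make_windows(series: np.ndarray, lookback: int = 30):
    """Slice a 1-D rate series into (lookback, 1) input windows with
    next-day targets."""
    X = np.stack([series[i:i + lookback] for i in range(len(series) - lookback)])
    return X[..., None], series[lookback:]

# Minimal LSTM forecaster of the kind described.
model = keras.Sequential([
    keras.layers.Input(shape=(30, 1)),
    keras.layers.LSTM(64),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X_train, y_train, epochs=50, validation_split=0.1)
```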
https://arxiv.org/abs/2506.09851
Effective rehabilitation assessment is essential for monitoring patient progress, particularly in home-based settings. Existing systems often face challenges such as data imbalance and difficulty detecting subtle movement errors. This paper introduces Error-Guided Pose Augmentation (EGPA), a method that generates synthetic skeleton data by simulating clinically relevant movement mistakes. Unlike standard augmentation techniques, EGPA targets biomechanical errors observed in rehabilitation. Combined with an attention-based graph convolutional network, EGPA improves performance across multiple evaluation metrics. Experiments demonstrate reductions in mean absolute error of up to 27.6 percent and gains in error classification accuracy of 45.8 percent. Attention visualizations show that the model learns to focus on clinically significant joints and movement phases, enhancing both accuracy and interpretability. EGPA offers a promising approach for improving automated movement quality assessment in both clinical and home-based rehabilitation contexts.
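A sketch of the augmentation idea: rather than random jitter, inject a systematic, clinically motivated error by drifting selected joints over a movement phase. The target joints, phase boundaries, and magnitudes below are placeholders for the paper's clinically derived error taxonomy.

```python
import numpy as np

def egpa_augment(skeleton: np.ndarray, joint_ids, magnitude: float = 0.05,
                 rng=None):
    """Generate a synthetic erroneous repetition from a clean skeleton
    sequence. `skeleton` is [frames, joints, 3]."""
    if rng is None:
        rng = np.random.default_rng()
    aug = skeleton.copy()
    t0, t1 = len(aug) // 4, 3 * len(aug) // 4   # mid-movement phase
    offset = rng.uniform(-magnitude, magnitude, size=3)
    for j in joint_ids:
        aug[t0:t1, j] += offset                 # systematic drift, not jitter
    return aug
```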
https://arxiv.org/abs/2506.09833
Neural front-ends are an appealing alternative to traditional, fixed feature extraction pipelines for automatic speech recognition (ASR) systems since they can be directly trained to fit the acoustic model. However, their performance often falls short compared to classical methods, which we show is largely due to their increased susceptibility to overfitting. This work therefore investigates regularization methods for training ASR models with learnable feature extraction front-ends. First, we examine audio perturbation methods and show that larger relative improvements can be obtained for learnable features. Additionally, we identify two limitations in the standard use of SpecAugment for these front-ends and propose masking in the short-time Fourier transform (STFT) domain as a simple but effective modification to address these challenges. Finally, integrating both regularization approaches effectively closes the performance gap between traditional and learnable features.
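A sketch of the STFT-domain variant, assuming the masking happens on the waveform before the learnable front-end: mask in the STFT domain and invert back to a waveform. The mask widths below are illustrative hyperparameters.

```python
import torch

def stft_mask(wave: torch.Tensor, n_fft: int = 512, hop: int = 128,
              max_f: int = 30, max_t: int = 40) -> torch.Tensor:
    """Apply SpecAugment-style frequency and time masks in the STFT domain,
    then invert, so a learnable front-end still receives a (masked) waveform."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft, hop_length=hop, window=window,
                      return_complex=True)
    F, T = spec.shape[-2], spec.shape[-1]
    f0 = int(torch.randint(0, max(1, F - max_f), (1,)))
    t0 = int(torch.randint(0, max(1, T - max_t), (1,)))
    spec[..., f0:f0 + max_f, :] = 0    # frequency mask
    spec[..., :, t0:t0 + max_t] = 0    # time mask
    return torch.istft(spec, n_fft, hop_length=hop, window=window,
                       length=wave.shape[-1])
```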
https://arxiv.org/abs/2506.09804
Knowing how test takers answer items in educational assessments is essential for test development, to evaluate item quality, and to improve test validity. However, this process usually requires extensive pilot studies with human participants. If large language models (LLMs) exhibit human-like response behavior to test items, this could open up the possibility of using them as pilot participants to accelerate test development. In this paper, we evaluate the human-likeness or psychometric plausibility of responses from 18 instruction-tuned LLMs with two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics. Our methodology builds on two theoretical frameworks from psychometrics which are commonly used in educational assessment, classical test theory and item response theory. The results show that while larger models are excessively confident, their response distributions can be more human-like when calibrated with temperature scaling. In addition, we find that LLMs tend to correlate better with humans in reading comprehension items compared to other subjects. However, the correlations are not very strong overall, indicating that LLMs should not be used for piloting educational assessments in a zero-shot setting.
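The calibration step is standard temperature scaling over the answer-option logits; T > 1 flattens an over-confident model's choice distribution toward the human response pattern. A minimal sketch:

```python
import numpy as np

def temperature_scale(logits: np.ndarray, T: float) -> np.ndarray:
    """Temperature-scaled softmax over answer-option logits."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# e.g. temperature_scale(np.array([4.0, 1.0, 0.5, 0.2]), T=2.5)
```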
https://arxiv.org/abs/2506.09796
AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward, ensuring format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97% format validity rate, along with high pass rate, node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.
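A sketch of what a rule-metric hybrid reward can look like: a hard format-validity rule gates a node-level F1 metric term. The paper's actual reward also covers structural integrity; the weights below are illustrative assumptions.

```python
def workflow_reward(pred_nodes, gold_nodes, format_ok: bool) -> float:
    """Hybrid reward: rule-based validity gate plus node-level F1 term."""
    if not format_ok:               # rule: malformed workflows earn nothing
        return 0.0
    pred, gold = set(pred_nodes), set(gold_nodes)
    overlap = len(pred & gold)
    if not overlap:
        return 0.2                  # validity bonus only
    p, r = overlap / len(pred), overlap / len(gold)
    return 0.2 + 0.8 * (2 * p * r / (p + r))   # validity bonus + node F1
```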
https://arxiv.org/abs/2506.09790
Prolonged Exposure (PE) therapy is an effective treatment for post-traumatic stress disorder (PTSD), but evaluating therapist fidelity remains labor-intensive due to the need for manual review of session recordings. We present a method for the automatic temporal localization of key PE fidelity elements -- identifying their start and stop times -- directly from session audio and transcripts. Our approach fine-tunes a large pre-trained audio-language model, Qwen2-Audio, using Low-Rank Adaptation (LoRA) to process focused 30-second windows of audio-transcript input. Fidelity labels for three core protocol phases -- therapist orientation (P1), imaginal exposure (P2), and post-imaginal processing (P3) -- are generated via LLM-based prompting and verified by trained raters. The model is trained to predict normalized boundary offsets using soft supervision guided by task-specific prompts. On a dataset of 313 real PE sessions, our best configuration (LoRA rank 8, 30s windows) achieves a mean absolute error (MAE) of 5.3 seconds across tasks. We further analyze the effects of window size and LoRA rank, highlighting the importance of context granularity and model adaptation. This work introduces a scalable framework for fidelity tracking in PE therapy, with potential to support clinician training, supervision, and quality assurance.
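A sketch of the target construction implied by the setup: absolute start/stop times are converted into normalized offsets within each 30-second window, with clamping for boundaries that fall outside the window. This is an assumption about the bookkeeping, not the paper's exact code.

```python
import torch

def boundary_targets(start_s: float, stop_s: float,
                     win_start_s: float, win_len_s: float = 30.0) -> torch.Tensor:
    """Normalized [start, stop] regression targets for one audio window."""
    start = (start_s - win_start_s) / win_len_s
    stop = (stop_s - win_start_s) / win_len_s
    return torch.tensor([start, stop]).clamp(0.0, 1.0)
```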
https://arxiv.org/abs/2506.09707
Dual encoder Vision-Language Models (VLM) such as CLIP are widely used for image-text retrieval tasks. However, those models struggle with compositionality, showing a bag-of-words-like behavior that limits their retrieval performance. Many different training approaches have been proposed to improve the vision-language compositionality capabilities of those models. In comparison, inference-time techniques have received little attention. In this paper, we propose to add simple structure at inference, where, given an image and a caption: i) we divide the image into different smaller crops, ii) we extract text segments, capturing objects, attributes and relations, iii) using a VLM, we find the image crops that better align with text segments obtaining matches, and iv) we compute the final image-text similarity aggregating the individual similarities of the matches. Based on various popular dual encoder VLMs, we evaluate our approach in controlled and natural datasets for VL compositionality. We find that our approach consistently improves the performance of evaluated VLMs without any training, which shows the potential of inference-time techniques. The results are especially good for attribute-object binding as shown in the controlled dataset. As a result of an extensive analysis: i) we show that processing image crops is actually essential for the observed gains in performance, and ii) we identify specific areas to further improve inference-time approaches.
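Step (iv) reduces to a simple aggregation once the crops and text segments are embedded. A minimal sketch, assuming crop extraction and text segmentation have run upstream and taking the mean of best-match similarities as one plausible aggregation choice:

```python
import numpy as np

def compositional_similarity(crop_embs: np.ndarray, seg_embs: np.ndarray) -> float:
    """Match each text segment to its best-aligned image crop by cosine
    similarity, then average the matched similarities into one score."""
    c = crop_embs / np.linalg.norm(crop_embs, axis=1, keepdims=True)
    s = seg_embs / np.linalg.norm(seg_embs, axis=1, keepdims=True)
    sims = s @ c.T                          # [num_segments, num_crops]
    return float(sims.max(axis=1).mean())   # best crop per segment, averaged
```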
https://arxiv.org/abs/2506.09691
Large language models (LLMs) have transformed natural language processing, but their reliable deployment requires effective uncertainty quantification (UQ). Existing UQ methods are often heuristic and lack a probabilistic foundation. This paper begins by providing a theoretical justification for the role of perturbations in UQ for LLMs. We then introduce a dual random walk perspective, modeling input-output pairs as two Markov chains with transition probabilities defined by semantic similarity. Building on this, we propose a fully probabilistic framework based on an inverse model, which quantifies uncertainty by evaluating the diversity of the input space conditioned on a given output through systematic perturbations. Within this framework, we define a new uncertainty measure, Inv-Entropy. A key strength of our framework is its flexibility: it supports various definitions of uncertainty measures, embeddings, perturbation strategies, and similarity metrics. We also propose GAAP, a perturbation algorithm based on genetic algorithms, which enhances the diversity of sampled inputs. In addition, we introduce a new evaluation metric, Temperature Sensitivity of Uncertainty (TSU), which directly assesses uncertainty without relying on correctness as a proxy. Extensive experiments demonstrate that Inv-Entropy outperforms existing semantic UQ methods. The code to reproduce the results can be found at this https URL.
https://arxiv.org/abs/2506.09684
It is important for Large Language Models to be aware of the boundary of their knowledge, i.e., the mechanism for identifying known and unknown queries. This type of awareness helps models perform adaptive inference, such as invoking RAG, engaging in slow and deep thinking, or adopting an abstention mechanism, which is beneficial to the development of efficient and trustworthy AI. In this work, we propose a method to detect knowledge boundaries via Query-Level Uncertainty, which aims to determine whether the model is able to address a given query without generating any tokens. To this end, we introduce a novel and training-free method called Internal Confidence, which leverages self-evaluations across layers and tokens. Empirical results on both factual QA and mathematical reasoning tasks demonstrate that our internal confidence can outperform several baselines. Furthermore, we showcase that our proposed method can be used for efficient RAG and model cascading, reducing inference costs while maintaining performance.
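A hedged sketch of one way such a layer-aggregated, generation-free signal could be computed: read a next-token distribution at the final prompt position from each layer (e.g., via the logit lens) and average each layer's top probability as a peakedness proxy. The paper's exact aggregation over layers and tokens may differ.

```python
import torch

def internal_confidence(layer_logits) -> float:
    """Aggregate per-layer prediction peakedness without generating tokens.
    `layer_logits` is a list of [vocab] tensors, one per layer."""
    confs = [torch.softmax(logits, dim=-1).max() for logits in layer_logits]
    return float(torch.stack(confs).mean())
```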
https://arxiv.org/abs/2506.09669
This paper presents a system developed for SemEval 2025 Task 8: Question Answering (QA) over tabular data. Our approach integrates several key components: text-to-SQL and text-to-code generation modules, a self-correction mechanism, and retrieval-augmented generation (RAG). Additionally, it includes an end-to-end (E2E) module, all orchestrated by a large language model (LLM). Through ablation studies, we analyzed the effects of different parts of our pipeline and identified the challenges that remain in this field. During the evaluation phase of the competition, our solution achieved an accuracy of 80%, resulting in a top-13 ranking among the 38 participating teams. Our pipeline demonstrates a significant improvement in accuracy for open-source models and achieves performance comparable to proprietary LLMs on QA tasks over tables. The code is available in a GitHub repository.
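A minimal sketch of the text-to-SQL path with a self-correction loop: generate a query, execute it, and on failure feed the error message back for a revision. `call_llm` is a hypothetical stand-in for the orchestrating model, and the prompt wording is illustrative.

```python
import sqlite3

def answer_with_self_correction(question, schema, call_llm, db_path,
                                max_tries: int = 3):
    """Generate, execute, and iteratively repair a SQL query over a table."""
    prompt = f"Schema:\n{schema}\nQuestion: {question}\nWrite one SQLite query."
    for _ in range(max_tries):
        sql = call_llm(prompt)
        try:
            with sqlite3.connect(db_path) as con:
                return con.execute(sql).fetchall()
        except sqlite3.Error as err:
            prompt += f"\nThe previous query failed with: {err}\nFix the query."
    return None   # fall back (e.g., to the end-to-end module)
```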
https://arxiv.org/abs/2506.09657
Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains, but their reliability is hindered by outdated knowledge and hallucinations. Retrieval-Augmented Generation mitigates these issues by grounding LLMs with external knowledge; however, most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Knowledge graphs, which represent facts as relational triples, offer a more structured and compact alternative. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering (KGQA), with a significant proportion adopting the retrieve-then-reasoning paradigm. In this framework, graph-based retrievers have demonstrated strong empirical performance, yet they still face challenges in generalization ability. In this work, we propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA. RAPL addresses these limitations through three aspects: (1) a two-stage labeling strategy that combines heuristic signals with parametric models to provide causally grounded supervision; (2) a model-agnostic graph transformation approach to capture both intra- and inter-triple interactions, thereby enhancing representational capacity; and (3) a path-based reasoning strategy that facilitates learning from the injected rational knowledge, and supports the downstream reasoner through structured inputs. Empirically, RAPL outperforms state-of-the-art methods by 2.66%-20.34%, and significantly reduces the performance gap between smaller and more powerful LLM-based reasoners, as well as the gap under cross-dataset settings, highlighting its superior retrieval capability and generalizability. Code is available at: this https URL.
https://arxiv.org/abs/2506.09645
Machine learning models fundamentally rely on large quantities of high-quality data. Collecting the necessary data for these models can be challenging due to cost, scarcity, and privacy restrictions. Signed languages are visual languages used by the deaf community and are considered low-resource languages. Sign language datasets are often orders of magnitude smaller than their spoken language counterparts. Sign Language Production is the task of generating sign language videos from spoken language sentences, while Sign Language Translation is the reverse translation task. Here, we propose leveraging recent advancements in Sign Language Production to augment existing sign language datasets and enhance the performance of Sign Language Translation models. For this, we utilize three techniques: a skeleton-based approach to production, sign stitching, and two photo-realistic generative models, SignGAN and SignSplat. We evaluate the effectiveness of these techniques in enhancing the performance of Sign Language Translation models by generating variation in the signer's appearance and the motion of the skeletal data. Our results demonstrate that the proposed methods can effectively augment existing datasets and enhance the performance of Sign Language Translation models by up to 19%, paving the way for more robust and accurate Sign Language Translation systems, even in resource-constrained environments.
https://arxiv.org/abs/2506.09643