Light curves serve as a valuable source of information on stellar formation and evolution. With the rapid advancement of machine learning techniques, they can be processed effectively to extract astronomical patterns and information. In this study, we present a comprehensive evaluation of deep-learning and large language model (LLM) based models for the automatic classification of variable star light curves, based on large datasets from the Kepler and K2 missions. Special emphasis is placed on Cepheids, RR Lyrae, and eclipsing binaries, examining the influence of observational cadence and phase distribution on classification precision. Employing AutoDL optimization, we achieve striking performance with the 1D-Convolution+BiLSTM architecture and the Swin Transformer, reaching accuracies of 94\% and 99\%, respectively; the latter also attains a notable 83\% accuracy on the elusive Type II Cepheids, which comprise merely 0.02\% of the total dataset. We unveil StarWhisper LightCurve (LC), an innovative series comprising three LLM-based models: an LLM, a multimodal large language model (MLLM), and a large audio language model (LALM). Each model is fine-tuned with strategic prompt engineering and customized training methods to explore the emergent abilities of these models for astronomical data. Remarkably, the StarWhisper LC series exhibits high accuracies of around 90\%, significantly reducing the need for explicit feature engineering and thereby paving the way for streamlined parallel data processing and the progression of multifaceted multimodal models in astronomical applications. The study furnishes two detailed catalogs illustrating the impacts of phase and sampling intervals on deep-learning classification accuracy, showing that observation duration can be reduced by up to 14\% and sampling points by up to 21\% without compromising accuracy by more than 10\%.
https://arxiv.org/abs/2404.10757
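The phase-distribution analysis above rests on folding each light curve on its period. A minimal sketch of that step, with synthetic data (the period, cadence, and variable names here are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def phase_fold(times, fluxes, period):
    """Fold a light curve on a trial period and sort by phase.

    times, fluxes: 1-D arrays of observation times and fluxes.
    period: trial period in the same time units as `times`.
    Returns (phases in [0, 1), fluxes) sorted by phase.
    """
    phases = np.mod(times, period) / period
    order = np.argsort(phases)
    return phases[order], fluxes[order]

# Toy example: a sinusoidal variable with a 2.5-day period,
# sampled at an irregular cadence over 50 days.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 50.0, 200))
f = 1.0 + 0.1 * np.sin(2.0 * np.pi * t / 2.5)
phase, flux = phase_fold(t, f, 2.5)
```

Folded this way, sparsely sampled observations stack into a single well-covered cycle, which is what lets the authors probe how far cadence and duration can be reduced before classification degrades.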
Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and then apply actor-critic algorithms, such as Proximal Policy Optimization (PPO). However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO). Is DPO truly superior to PPO? Why does PPO perform poorly on these benchmarks? In this paper, we first conduct both theoretical and empirical studies on the algorithmic properties of DPO and show that DPO may have fundamental limitations. Moreover, we comprehensively examine PPO and reveal the key factors behind its best performance in fine-tuning LLMs. Finally, we benchmark DPO and PPO across a collection of RLHF testbeds, ranging from dialogue to code generation. Experimental results demonstrate that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions.
https://arxiv.org/abs/2404.10719
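For context on the reward-free side of the comparison, the standard per-example DPO objective can be sketched as follows (a minimal scalar version; the toy log-probabilities and the `beta=0.1` value are illustrative assumptions):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * margin), where the
    margin compares the policy's log-ratio over the frozen reference
    model on the chosen (w) versus rejected (l) response."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy already prefers the chosen response more strongly than
# the reference does, the margin is positive and the loss dips below log(2).
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.0)
```

Unlike PPO, this loss needs no separate reward model or rollout phase, which is exactly the trade-off the paper interrogates.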
Harnessing visual texts represents a burgeoning frontier in the evolution of language modeling. In this paper, we introduce a novel pre-training framework for a suite of pixel-based autoregressive language models, pre-trained on a corpus of over 400 million documents rendered as RGB images. Our approach is characterized by a dual-modality training regimen, engaging both visual data through next patch prediction with a regression head and textual data via next token prediction with a classification head. This study is particularly focused on investigating the synergistic interplay between visual and textual modalities of language. Our comprehensive evaluation across a diverse array of benchmarks reveals that the confluence of visual and textual data substantially augments the efficacy of pixel-based language models. Notably, our findings show that a unidirectional pixel-based model, devoid of textual data during training, can match the performance levels of advanced bidirectional pixel-based models on various language understanding benchmarks. This work highlights the considerable untapped potential of integrating visual and textual information for language modeling purposes. We will release our code, data, and checkpoints to inspire further research advancement.
https://arxiv.org/abs/2404.10710
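The dual-modality regimen pairs a regression loss on the next image patch with a cross-entropy loss on the next token. A schematic scalar version (the equal `alpha` weighting is an assumption for illustration, not from the paper):

```python
import math

def dual_modality_loss(patch_pred, patch_target, token_logits, token_target, alpha=0.5):
    """Combine the two training signals: MSE from the regression head
    (next-patch pixels) and cross-entropy from the classification head
    (next-token logits), mixed by an assumed weight `alpha`."""
    # Regression head: mean squared error over predicted patch values.
    mse = sum((p - t) ** 2 for p, t in zip(patch_pred, patch_target)) / len(patch_target)
    # Classification head: softmax cross-entropy on the next token.
    z = [math.exp(l) for l in token_logits]
    ce = -math.log(z[token_target] / sum(z))
    return alpha * mse + (1.0 - alpha) * ce

# Perfect patch prediction with alpha=1.0 zeroes the loss entirely.
visual_only = dual_modality_loss([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], 0, alpha=1.0)
# Uniform logits over two tokens give a cross-entropy of log(2).
text_only = dual_modality_loss([1.0], [2.0], [0.0, 0.0], 0, alpha=0.0)
```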
The Arab Spring was a historic set of protests beginning in 2011 that toppled governments and led to major conflicts. Collective memories of events like these can vary significantly across social contexts in response to political, cultural, and linguistic factors. While Wikipedia plays an important role in documenting both historic and current events, little attention has been given to how Wikipedia articles, created in the aftermath of major events, continue to evolve over years or decades. Using the archived content of Arab Spring-related topics across the Arabic and English Wikipedias between 2011 and 2024, we define and evaluate multilingual measures of event salience, deliberation, contextualization, and consolidation of collective memory surrounding the Arab Spring. Our findings about the temporal evolution of the Wikipedia articles' content similarity across languages have implications for theorizing about online collective memory processes and evaluating linguistic models trained on these data.
https://arxiv.org/abs/2404.10706
Resolving coreference and bridging relations in chemical patents is important for better understanding the precise chemical process, for which chemical domain knowledge is critical. We propose an approach incorporating external knowledge into a multi-task learning model for both coreference and bridging resolution in the chemical domain. The results show that integrating external knowledge can benefit both chemical coreference and bridging resolution.
https://arxiv.org/abs/2404.10696
Visual Question Answering (VQA) is a complicated task that requires the capability to process natural language and images simultaneously. Early research on this task focused on methods that help machines understand objects and scene context in images. However, text appearing in an image often carries explicit information about the image's full content, and such scene text went largely unaddressed. With the continuous development of the AI era, many studies worldwide have examined the reading-comprehension ability of VQA models. In Vietnam, a developing country where resources are still limited, this task remains open. Therefore, we introduce the first large-scale Vietnamese dataset specializing in the ability to understand text appearing in images, which we call ViTextVQA (\textbf{Vi}etnamese \textbf{Text}-based \textbf{V}isual \textbf{Q}uestion \textbf{A}nswering dataset); it contains \textbf{over 16,000} images and \textbf{over 50,000} questions with answers. Through meticulous experiments with various state-of-the-art models, we uncover the significance of the order in which tokens in OCR text are processed and selected to formulate answers. This finding helped us significantly improve the performance of the baseline models on the ViTextVQA dataset. Our dataset is available at this \href{this https URL}{link} for research purposes.
https://arxiv.org/abs/2404.10652
We explore the self-play training procedure of large language models (LLMs) in a two-player adversarial language game called Adversarial Taboo. In this game, an attacker and a defender communicate with respect to a target word only visible to the attacker. The attacker aims to induce the defender to utter the target word unconsciously, while the defender tries to infer the target word from the attacker's utterances. To win the game, both players should have sufficient knowledge about the target word and high-level reasoning ability to infer and express in this information-reserved conversation. Hence, we are curious about whether LLMs' reasoning ability can be further enhanced by Self-Play in this Adversarial language Game (SPAG). With this goal, we let LLMs act as the attacker and play with a copy of itself as the defender on an extensive range of target words. Through reinforcement learning on the game outcomes, we observe that the LLMs' performance uniformly improves on a broad range of reasoning benchmarks. Furthermore, iteratively adopting this self-play process can continuously promote LLM's reasoning ability. The code is at this https URL.
https://arxiv.org/abs/2404.10642
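The win condition described above can be made concrete with a small episode judge (a simplified sketch: this word-level membership check is an assumption, and a real implementation would likely need lemmatization and morphological variants):

```python
def judge_taboo(target, defender_utterances, defender_guess):
    """Score one Adversarial Taboo episode.

    The attacker wins if the defender utters the target word during
    the conversation; the defender wins if it avoids the word and its
    final guess matches the target. Otherwise the episode is a draw.
    """
    target = target.lower()
    for utterance in defender_utterances:
        if target in utterance.lower().split():
            return "attacker"
    if defender_guess.lower() == target:
        return "defender"
    return "draw"
```

Game outcomes scored this way supply the reward signal that the self-play reinforcement learning stage optimizes.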
Getting large language models (LLMs) to perform well on downstream tasks requires pre-training over trillions of tokens. This typically demands a large number of powerful computational devices, in addition to a stable distributed training framework, to accelerate the training. The growing number of applications leveraging AI/ML has led to a scarcity of the expensive conventional accelerators (such as GPUs), creating a need for alternative specialized accelerators that are scalable and cost-efficient. AWS Trainium is the second-generation machine learning accelerator purpose-built for training large deep learning models. Its corresponding instance, Amazon EC2 trn1, is an alternative to GPU instances for LLM training. However, training LLMs with billions of parameters on trn1 is challenging due to its relatively nascent software ecosystem. In this paper, we showcase HLAT: a 7 billion parameter decoder-only LLM pre-trained using trn1 instances over 1.8 trillion tokens. The performance of HLAT is benchmarked against popular open-source baseline models including LLaMA and OpenLLaMA, which were trained on NVIDIA GPUs and Google TPUs, respectively. On various evaluation tasks, we show that HLAT achieves model quality on par with the baselines. We also share best practices for using the Neuron Distributed Training Library (NDTL), a customized distributed training library for AWS Trainium, to achieve efficient training. Our work demonstrates that AWS Trainium powered by the NDTL is able to successfully pre-train state-of-the-art LLM models with high performance and cost-effectiveness.
https://arxiv.org/abs/2404.10630
Large language models (LLMs) are now widely used in various fields, including finance. However, no Japanese financial-specific LLM has been proposed yet. Hence, this study aims to construct a Japanese financial-specific LLM through continual pre-training. Before tuning, we constructed Japanese financial-focused datasets for continual pre-training. As a base model, we employed a Japanese LLM that achieved state-of-the-art performance on Japanese financial benchmarks among models in the 10-billion-parameter class. After continual pre-training using the datasets and the base model, the tuned model performed better than the original model on the Japanese financial benchmarks. Moreover, a comparison of outputs reveals that the tuned model's answers tend to be better than the original model's in terms of quality and length. These findings indicate that domain-specific continual pre-training is also effective for LLMs. The tuned model is publicly available on Hugging Face.
https://arxiv.org/abs/2404.10555
Open conversations are one of the most engaging forms of teaching. However, creating such conversations in educational software is a complex endeavor, especially if we want to address the needs of different audiences. While language models hold great promise for educational applications, there are substantial challenges in training them to engage in meaningful and effective conversational teaching given the diverse needs of various audiences, and no official datasets exist to facilitate such training. This paper presents a novel source for facilitating conversational teaching of scientific concepts at various difficulty levels (from preschooler to expert), namely dialogues taken from video transcripts. We analyse this data source in various ways to show that it offers a diverse array of examples that can be used to generate contextually appropriate and natural responses to scientific topics for specific target audiences. It is a freely available and valuable resource for training and evaluating conversation models, encompassing organically occurring dialogues. While the raw data is available online, we provide additional metadata for conversational analysis of dialogues at each level in all available videos.
https://arxiv.org/abs/2404.10475
This study explores F0 entrainment in second language (L2) English speech imitation during an Alternating Reading Task (ART). Participants with Italian, French, and Slovak native languages imitated English utterances, and their F0 entrainment was quantified using the Dynamic Time Warping (DTW) distance between the parameterized F0 contours of the imitated utterances and those of the model utterances. Results indicate a nuanced relationship between L2 English proficiency and entrainment: speakers with higher proficiency generally exhibit less entrainment in pitch variation and declination. However, within dyads, the more proficient speakers demonstrate a greater ability to mimic pitch range, leading to increased entrainment. This suggests that proficiency influences entrainment differently at individual and dyadic levels, highlighting the complex interplay between language skill and prosodic adaptation.
https://arxiv.org/abs/2404.10440
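The DTW distance at the core of the entrainment measure can be sketched with the classic dynamic-programming recurrence (a minimal version using absolute difference as the local cost; the paper parameterizes the F0 contours before comparison):

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences.

    d[i][j] holds the minimal accumulated cost of aligning the first
    i elements of `a` with the first j elements of `b`, where each
    step may advance one or both sequences.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Because the warping path can stretch one sequence against the other, two contours with the same shape but different timing still score a small distance, which is what makes DTW suitable for comparing imitated and model utterances.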
Generative spoken language models produce speech in a wide range of voices, prosody, and recording conditions, seemingly approaching the diversity of natural speech. However, the extent to which generated speech is acoustically diverse remains unclear due to a lack of appropriate metrics. We address this gap by developing lightweight metrics of acoustic diversity, which we collectively refer to as MAD Speech. We focus on measuring five facets of acoustic diversity: voice, gender, emotion, accent, and background noise. We construct the metrics as a composition of specialized, per-facet embedding models and an aggregation function that measures diversity within the embedding space. Next, we build a series of datasets with a priori known diversity preferences for each facet. Using these datasets, we demonstrate that our proposed metrics achieve a stronger agreement with the ground-truth diversity than baselines. Finally, we showcase the applicability of our proposed metrics across several real-life evaluation scenarios. MAD Speech will be made publicly accessible.
https://arxiv.org/abs/2404.10419
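One plausible aggregation function of the kind the abstract describes is mean pairwise cosine distance within the embedding space (an illustrative assumption; the paper's exact aggregation choice is not specified here):

```python
import math

def mean_pairwise_cosine_distance(embeddings):
    """Diversity of a set of embeddings as the mean cosine distance
    (1 - cosine similarity) over all unordered pairs."""
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        nu = math.sqrt(sum(x * x for x in u))
        nv = math.sqrt(sum(x * x for x in v))
        return dot / (nu * nv)

    pairs = [(i, j) for i in range(len(embeddings))
             for j in range(i + 1, len(embeddings))]
    return sum(1.0 - cos(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)

# Identical embeddings (e.g. one repeated voice) have zero diversity;
# orthogonal embeddings score higher.
low = mean_pairwise_cosine_distance([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
high = mean_pairwise_cosine_distance([[1.0, 0.0], [0.0, 1.0]])
```

Running one such aggregation per facet-specific embedding model (voice, gender, emotion, accent, noise) yields the five per-facet scores.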
Training on large amounts of rationales (i.e., CoT fine-tuning) is effective at improving the reasoning capabilities of large language models (LLMs). However, acquiring human-authored rationales or augmenting rationales from proprietary models is costly and not scalable. In this paper, we study the problem of whether LLMs can self-improve their reasoning capabilities. To this end, we propose Self-Explore, where the LLM is tasked to explore the first wrong step (i.e., the first pit) within the rationale and use such signals as fine-grained rewards for further improvement. On the GSM8K and MATH test sets, Self-Explore achieves 11.57% and 2.89% improvement on average across three LLMs compared to supervised fine-tuning (SFT). Our code is available at this https URL.
https://arxiv.org/abs/2404.10346
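The "first pit" idea can be sketched as locating the earliest failing step in a rationale (a schematic sketch: in Self-Explore the verification signal comes from sampled continuations, whereas the toy verifier here just checks arithmetic):

```python
def first_pit(steps, verifier):
    """Return the index of the first wrong step in a rationale
    (the 'first pit'), or None if every step checks out.
    `verifier` is any callable mapping a step to True/False."""
    for i, step in enumerate(steps):
        if not verifier(step):
            return i
    return None

# Toy rationale whose last step contains an arithmetic slip.
rationale = ["2 + 3 = 5", "5 * 4 = 20", "20 - 1 = 18"]

def toy_verifier(step):
    lhs, rhs = step.split("=")
    return eval(lhs) == int(rhs)  # toy arithmetic check only

pit = first_pit(rationale, toy_verifier)
```

Everything before the pit can be treated as a correct prefix and everything from the pit onward as rejected, giving the fine-grained preference pairs used for further training.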
Large Language Models (LLMs) have exhibited remarkable performance across various downstream tasks, but they may generate inaccurate or false information with a confident tone. One possible solution is to endow the LLM with a confidence-expression capability, such that the expressed confidence is well aligned with the true probability that the generated answer is correct. However, leveraging the intrinsic ability of LLMs or the signals from the output logits of answers proves challenging for accurately capturing the response uncertainty of LLMs. Therefore, drawing inspiration from cognitive diagnostics, we propose Learning from Past experience (LePe), a method to enhance the capability for confidence expression. Specifically, we first identify three key problems: (1) How to capture the inherent confidence of the LLM? (2) How to teach the LLM to express confidence? (3) How to evaluate the confidence expression of the LLM? We then devise three stages in LePe to address these problems. In addition, to accurately capture the confidence of an LLM when constructing the training data, we design a complete pipeline including question preparation and answer sampling. We also conduct experiments using the Llama family of LLMs to verify the effectiveness of our proposed method on four datasets.
https://arxiv.org/abs/2404.10315
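The answer-sampling step in the pipeline can be sketched as estimating confidence from agreement among repeated samples (a hedged sketch of problem (1); the exact estimator and the exact-match comparison are assumptions, and the paper's pipeline also covers question preparation and a training stage):

```python
from collections import Counter

def sampled_confidence(answers):
    """Estimate a model's confidence on a question as the agreement
    rate among repeatedly sampled answers: return the modal answer
    and the fraction of samples that produced it."""
    counts = Counter(answers)
    most_common_answer, freq = counts.most_common(1)[0]
    return most_common_answer, freq / len(answers)

# Four samples, three of which agree -> confidence 0.75 in "Paris".
answer, conf = sampled_confidence(["Paris", "Paris", "Lyon", "Paris"])
```

Pairs of (question, empirical confidence) built this way can then serve as supervision for teaching the model to verbalize calibrated confidence.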
Aligned Large Language Models (LLMs) showcase remarkable versatility, capable of handling diverse real-world tasks. Meanwhile, aligned LLMs are also expected to exhibit speciality, excelling in specific applications. However, fine-tuning with extra data, a common practice to gain speciality, often leads to catastrophic forgetting (CF) of previously acquired versatility, hindering the model's performance across diverse tasks. In response to this challenge, we propose CoFiTune, a coarse-to-fine framework that attempts to strike a balance between speciality and versatility. At the coarse-grained level, an empirical tree-search algorithm is utilized to pinpoint and update specific modules that are crucial for speciality, while keeping other parameters frozen; at the fine-grained level, a soft-masking mechanism regulates the update to the LLMs, mitigating the CF issue without harming speciality. In an overall evaluation of both speciality and versatility, CoFiTune consistently outperforms baseline methods across diverse tasks and model scales. Compared to full-parameter SFT, CoFiTune leads to about a 14% versatility improvement with only marginal speciality loss on a 13B model. Lastly, based on further analysis, we provide a speculative insight into the information forwarding process in LLMs, which helps explain the effectiveness of the proposed method. The code is available at this https URL.
https://arxiv.org/abs/2404.10306
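The fine-grained soft-masking mechanism amounts to scaling each parameter's update by a value in [0, 1]. A schematic sketch (the flat-list parameters, mask values, and learning rate are illustrative assumptions, not the paper's actual implementation):

```python
def soft_masked_update(params, grads, mask, lr=0.01):
    """One gradient step where each parameter's update is scaled by a
    per-parameter soft mask in [0, 1]: mask values near 0 protect
    versatility-critical weights, values near 1 allow full speciality
    updates."""
    return [p - lr * m * g for p, m, g in zip(params, mask, grads)]

params = [1.0, 1.0, 1.0]
grads = [10.0, 10.0, 10.0]
mask = [0.0, 0.5, 1.0]   # frozen, damped, fully trainable
new_params = soft_masked_update(params, grads, mask, lr=0.1)
```

Hard freezing (the coarse-grained level) is the special case where the mask is exactly 0 or 1 per module; the soft mask interpolates continuously between the two.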
Health coaching helps patients achieve personalized and lifestyle-related goals, effectively managing chronic conditions and alleviating mental health issues. It is particularly beneficial for low-socioeconomic-status populations, yet cost-prohibitive for them, due to its highly personalized and labor-intensive nature. In this paper, we propose a neuro-symbolic goal summarizer to support health coaches in keeping track of goals, and a text-units-text dialogue generation model that converses with patients and helps them create and accomplish specific goals for physical activities. Our models outperform the previous state of the art while eliminating the need for a predefined schema and corresponding annotation. We also propose a new health coaching dataset extending previous work, and a metric that measures the unconventionality of a patient's response based on data difficulty, facilitating potential coach alerts during deployment.
https://arxiv.org/abs/2404.10268
Mixture of Expert Tuning (MoE-Tuning) has effectively enhanced the performance of general MLLMs with fewer parameters, yet its application in resource-limited medical settings has not been fully explored. To address this gap, we developed MoE-TinyMed, a model tailored for medical applications that significantly lowers parameter demands. In evaluations on the VQA-RAD, SLAKE, and Path-VQA datasets, MoE-TinyMed outperformed LLaVA-Med in all closed Med-VQA settings with just 3.6B parameters. Additionally, a streamlined version with 2B parameters surpassed LLaVA-Med's performance on Path-VQA, showcasing its effectiveness in resource-limited healthcare settings.
https://arxiv.org/abs/2404.10237
Recent advances in large language models (LLMs) have blurred the boundary of high-quality text generation between humans and machines, which is favorable for generative text steganography. However, current advanced steganographic mappings are not suitable for LLMs, since most users are restricted to accessing only the black-box API or user interface of the LLMs and thereby lack access to the training vocabulary and its sampling probabilities. In this paper, we explore a black-box generative text steganographic method based on the user interfaces of large language models, which we call LLM-Stega. The main goal of LLM-Stega is to conduct secure covert communication between Alice (sender) and Bob (receiver) using the user interfaces of LLMs. Specifically, we first construct a keyword set and design a new encrypted steganographic mapping to embed secret messages. Furthermore, to guarantee accurate extraction of secret messages and rich semantics of generated stego texts, an optimization mechanism based on rejection sampling is proposed. Comprehensive experiments demonstrate that the proposed LLM-Stega outperforms current state-of-the-art methods.
https://arxiv.org/abs/2404.10229
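The keyword-set idea can be sketched as a toy bits-to-keywords mapping (a stand-in for the paper's encrypted steganographic mapping: the keyword list and 2-bit chunking are illustrative assumptions, and the real method additionally prompts an LLM to weave the keywords into fluent stego text):

```python
def embed_bits(bits, keyword_set, bits_per_word=2):
    """Map a secret bit string onto keywords drawn from a fixed
    keyword set shared by sender and receiver."""
    assert len(keyword_set) == 2 ** bits_per_word
    words = []
    for i in range(0, len(bits), bits_per_word):
        chunk = bits[i:i + bits_per_word]
        words.append(keyword_set[int(chunk, 2)])
    return words

def extract_bits(words, keyword_set, bits_per_word=2):
    """Invert the mapping: recover the bit string from the keywords."""
    return "".join(format(keyword_set.index(w), "0{}b".format(bits_per_word))
                   for w in words)

keywords = ["river", "stone", "cloud", "field"]
stego = embed_bits("0110", keywords)       # two 2-bit chunks: 01, 10
recovered = extract_bits(stego, keywords)
```

The rejection-sampling mechanism mentioned in the abstract would then discard generated texts from which the keywords (and hence the bits) cannot be extracted exactly.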
The high volume and rapid evolution of content on social media present major challenges for studying the stance of social media users. In this work, we develop a two-stage stance labeling method that utilizes the user-hashtag bipartite graph and the user-user interaction graph. In the first stage, a simple and efficient heuristic for stance labeling uses the user-hashtag bipartite graph to iteratively update the stance association of user and hashtag nodes via a label propagation mechanism. This set of soft labels is then integrated with the user-user interaction graph to train a graph neural network (GNN) model using semi-supervised learning. We evaluate this method on two large-scale datasets containing tweets related to climate change from June 2021 to June 2022 and to gun control from January 2022 to January 2023. Experiments demonstrate that our user-hashtag heuristic and the semi-supervised GNN method outperform zero-shot stance labeling using LLMs such as GPT4. Further analysis illustrates how the stance labeling information and interaction graph can be used for evaluating the polarization of social media interactions on divisive issues such as climate change and gun control.
https://arxiv.org/abs/2404.10228
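The first-stage heuristic can be sketched as iterative averaging over the user-hashtag bipartite graph (a minimal sketch: the mean-of-neighbors update and the toy seed/edge data are assumptions for illustration, not the paper's exact propagation rule):

```python
def propagate_stance(user_seeds, edges, n_iters=10):
    """Iterative stance propagation over a user-hashtag bipartite graph.

    `user_seeds` maps some users to stances in [-1, 1]; `edges` is a
    list of (user, hashtag) pairs. Each round, a hashtag takes the mean
    stance of its users, and each unseeded user takes the mean stance
    of its hashtags; seed labels stay fixed."""
    users = {u for u, _ in edges}
    tags = {h for _, h in edges}
    user_stance = {u: user_seeds.get(u, 0.0) for u in users}
    tag_stance = {h: 0.0 for h in tags}
    for _ in range(n_iters):
        for h in tags:
            members = [user_stance[u] for u, h2 in edges if h2 == h]
            tag_stance[h] = sum(members) / len(members)
        for u in users:
            if u in user_seeds:
                continue
            nbrs = [tag_stance[h] for u2, h in edges if u2 == u]
            user_stance[u] = sum(nbrs) / len(nbrs)
    return user_stance, tag_stance

seeds = {"alice": 1.0, "bob": -1.0}
edges = [("alice", "#tagA"), ("carol", "#tagA"),
         ("bob", "#tagB"), ("dave", "#tagB")]
user_stance, tag_stance = propagate_stance(seeds, edges)
```

The resulting per-user soft labels are what the second stage feeds, together with the user-user interaction graph, into semi-supervised GNN training.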
Text-based reinforcement learning involves an agent interacting with a fictional environment using observed text and admissible actions in natural language to complete a task. Previous works have shown that agents can succeed in text-based interactive environments even in the complete absence of semantic understanding or other linguistic capabilities. The success of these agents in playing such games suggests that semantic understanding may not be important for the task. This raises an important question about the benefits of LMs in guiding the agents through the game states. In this work, we show that rich semantic understanding leads to efficient training of text-based RL agents. Moreover, we describe the occurrence of semantic degeneration as a consequence of inappropriate fine-tuning of language models in text-based reinforcement learning (TBRL). Specifically, we describe the shift in the semantic representation of words in the LM, as well as how it affects the performance of the agent in tasks that are semantically similar to the training games. We believe these results may help develop better strategies to fine-tune agents in text-based RL scenarios.
https://arxiv.org/abs/2404.10174