The SemEval task on Argument Reasoning in Civil Procedure is challenging in that it requires understanding legal concepts and inferring complex arguments. Currently, most Large Language Models (LLMs) excelling in the legal realm are designed primarily for classification tasks, so the soundness of their reasoning is open to contention. The approach we advocate involves using a powerful teacher LLM (ChatGPT) to extend the training dataset with explanations and to generate synthetic data. The resulting data are then leveraged to fine-tune a small student LLM. Contrary to previous work, our explanations are not directly derived from the teacher's internal knowledge. Instead, they are grounded in authentic human analyses and therefore deliver a superior reasoning signal. Additionally, a new 'mutation' method generates artificial data instances inspired by existing ones. We publicly release the explanations as an extension to the original dataset, along with the synthetic dataset and the prompts used to generate both. Our system ranked 15th in the SemEval competition. It outperforms its own teacher and can produce explanations aligned with the original human analyses, as verified by legal experts.
https://arxiv.org/abs/2405.08502
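As a rough illustration of the distillation setup described above — not the authors' actual pipeline — the sketch below builds a prompt that grounds the teacher's explanation in a human-written analysis rather than the teacher's internal knowledge. The prompt wording, field names, and the `query_teacher` stub are hypothetical placeholders.

```python
# Hypothetical sketch: grounding teacher explanations in human analyses
# before fine-tuning a student model. `query_teacher` is a placeholder
# for whatever API serves the teacher LLM (e.g., ChatGPT).

def build_explanation_prompt(question: str, answer: str, human_analysis: str) -> str:
    """Ask the teacher to explain the answer *using* the human analysis."""
    return (
        "You are assisting with legal argument reasoning.\n"
        f"Question: {question}\n"
        f"Correct answer: {answer}\n"
        f"Expert analysis: {human_analysis}\n"
        "Explain, based strictly on the expert analysis above, why the "
        "correct answer follows. Do not rely on outside knowledge."
    )

def query_teacher(prompt: str) -> str:
    raise NotImplementedError("Call the teacher LLM here.")

# Each (prompt, teacher explanation) pair would then become a
# fine-tuning example for the small student LLM.
example = build_explanation_prompt(
    "Is the motion timely under Rule 12?", "No",
    "Rule 12(b) motions must be filed before the responsive pleading...",
)
```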
In recent years, deep learning has greatly streamlined the generation of realistic fake face images. Aware of the dangers, researchers have developed various tools to spot these counterfeits. Yet none has asked the fundamental question: which digital manipulations make a real photographic face image fake, while others do not? In this paper, we put face forgery in a semantic context and define computational methods that alter semantic face attributes beyond human discrimination thresholds as sources of face forgery. Guided by our new definition, we construct a large face forgery image dataset, where each image is associated with a set of labels organized in a hierarchical graph. Our dataset enables two new testing protocols to probe the generalization of face forgery detectors. Moreover, we propose a semantics-oriented face forgery detection method that captures label relations and prioritizes the primary task (i.e., real or fake face detection). We show that the proposed dataset successfully exposes the weaknesses of current detectors when used as the test set and consistently improves their generalizability when used as the training set. Additionally, we demonstrate the superiority of our semantics-oriented method over traditional binary and multi-class classification-based detectors.
https://arxiv.org/abs/2405.08487
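A minimal sketch of the kind of semantics-oriented detector described above, under my own assumptions: a shared backbone feeds a primary real/fake head and an auxiliary head for semantic attribute labels, with the primary loss up-weighted. The layer sizes and the weighting scheme are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class SemanticForgeryDetector(nn.Module):
    """Shared backbone with a prioritized real/fake head plus an
    auxiliary head for hierarchical semantic attribute labels."""

    def __init__(self, num_attributes: int = 10):
        super().__init__()
        self.backbone = nn.Sequential(  # stand-in for a real CNN/ViT backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.real_fake_head = nn.Linear(32, 2)               # primary task
        self.attribute_head = nn.Linear(32, num_attributes)  # auxiliary labels

    def forward(self, x):
        feats = self.backbone(x)
        return self.real_fake_head(feats), self.attribute_head(feats)

def total_loss(rf_logits, attr_logits, rf_target, attr_target, primary_weight=2.0):
    # Up-weighting the primary loss is one simple way to "prioritize" it.
    primary = nn.functional.cross_entropy(rf_logits, rf_target)
    auxiliary = nn.functional.binary_cross_entropy_with_logits(
        attr_logits, attr_target.float())  # multi-label attribute targets
    return primary_weight * primary + auxiliary
```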
Reinforcement learning from human feedback (RLHF) is the canonical framework for large language model alignment. However, the rising popularity of offline alignment algorithms challenges the need for on-policy sampling in RLHF. Within the context of reward over-optimization, we start with a set of experiments that demonstrate the clear advantage of online methods over offline methods. This prompts us to investigate the causes of the performance discrepancy through a series of carefully designed experimental ablations. We show empirically that hypotheses such as offline data coverage and data quality cannot by themselves convincingly explain the performance difference. We also find that while offline algorithms train the policy to become good at pairwise classification, it is worse at generation; meanwhile, policies trained by online algorithms are good at generation but worse at pairwise classification. This hints at a unique interplay between discriminative and generative capabilities, which is greatly impacted by the sampling process. Lastly, we observe that the performance discrepancy persists for both contrastive and non-contrastive loss functions, and appears not to be addressed by simply scaling up policy networks. Taken together, our study sheds light on the pivotal role of on-policy sampling in AI alignment and hints at certain fundamental challenges of offline alignment algorithms.
https://arxiv.org/abs/2405.08448
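For concreteness, one canonical contrastive offline loss of the kind examined in this line of work is the DPO objective, which trains the policy precisely as a pairwise classifier between chosen and rejected responses. The sketch below is a generic rendering with scalar log-probabilities, not the study's implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Contrastive offline alignment loss (DPO-style).

    Minimizing it teaches the policy to separate the chosen (w) from
    the rejected (l) response relative to a reference model -- the
    discriminative skill offline training is observed to impart.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# Toy sequence log-probabilities under the policy and the reference.
logp_w, logp_l = torch.tensor([-12.0]), torch.tensor([-15.0])
ref_w, ref_l = torch.tensor([-13.0]), torch.tensor([-14.0])
print(dpo_loss(logp_w, logp_l, ref_w, ref_l))
```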
The rapid advancement of large language models (LLMs) has made it increasingly difficult to distinguish between text written by humans and machines. To address this, we propose a novel method for generating watermarks that strategically alters token probabilities during generation. Unlike previous works, this method uniquely employs linguistic features such as stylometry. Concretely, we introduce acrostica and sensorimotor norms to LLMs. Further, these features are parameterized by a key, which is updated every sentence. To compute this key, we use semantic zero-shot classification, which enhances resilience. In our evaluation, we find that for three or more sentences, our method achieves a false positive and false negative rate of 0.02. For the case of a cyclic translation attack, we observe similar results for seven or more sentences. This research is of particular interest for proprietary LLMs, to facilitate accountability and prevent societal harm.
https://arxiv.org/abs/2405.08400
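To make the probability-altering step above concrete, here is a toy sketch in the spirit of keyed logit biasing — not the authors' exact scheme. A key (which the paper derives per sentence via zero-shot classification, here just an integer seed) selects favored tokens, e.g. tokens beginning with an acrostic letter, whose logits are boosted before sampling.

```python
import numpy as np

def bias_logits(logits: np.ndarray, vocab: list[str], key: int,
                delta: float = 2.0) -> np.ndarray:
    """Boost logits of tokens carrying the keyed linguistic feature.

    Toy feature: tokens starting with a key-selected acrostic letter.
    In the paper the key is refreshed every sentence via zero-shot
    semantic classification; the feature set is richer as well.
    """
    rng = np.random.default_rng(key)
    letter = chr(ord("a") + rng.integers(26))
    biased = logits.copy()
    for i, tok in enumerate(vocab):
        if tok.lower().startswith(letter):
            biased[i] += delta  # shift probability mass toward marked tokens
    return biased

# Detection would test whether marked tokens occur more often than chance.
vocab = ["apple", "banana", "cherry", "avocado", "berry"]
print(bias_logits(np.zeros(len(vocab)), vocab, key=42))
```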
Respiratory disease, the third leading cause of death globally, is considered a high-priority ailment requiring significant research on identification and treatment. Stethoscope-recorded lung sounds and artificial intelligence-powered devices have been used to identify lung disorders and aid specialists in making accurate diagnoses. In this study, the audio-spectrogram vision transformer (AS-ViT), a new approach for identifying abnormal respiration sounds, was developed. The sounds of the lungs are converted into visual representations called spectrograms using the short-time Fourier transform (STFT). These images are then analyzed by a vision transformer to identify different types of respiratory sounds. Classification was carried out on the ICBHI 2017 database, which includes various types of lung sounds with different frequencies, noise levels, and backgrounds. Evaluated on three metrics, the proposed AS-ViT method achieved an unweighted average recall and overall score of 79.1% and 59.8% for the 60:40 split ratio and 86.4% and 69.3% for the 80:20 split ratio in respiratory sound detection, surpassing previous state-of-the-art results.
https://arxiv.org/abs/2405.08342
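The front end of the AS-ViT pipeline — turning a lung-sound waveform into a spectrogram image via STFT — can be sketched as follows; the window length and sample rate are illustrative, not the paper's settings.

```python
import numpy as np
from scipy.signal import stft

def audio_to_spectrogram(waveform: np.ndarray, sample_rate: int = 4000,
                         nperseg: int = 256) -> np.ndarray:
    """Short-time Fourier transform -> log-magnitude spectrogram.

    The resulting 2-D array is treated as an image and fed to a
    vision transformer for respiratory-sound classification.
    """
    _, _, Zxx = stft(waveform, fs=sample_rate, nperseg=nperseg)
    return np.log1p(np.abs(Zxx))  # log compression tames the dynamic range

# One second of a synthetic 200 Hz tone as a stand-in for a lung sound.
t = np.linspace(0, 1, 4000, endpoint=False)
spec = audio_to_spectrogram(np.sin(2 * np.pi * 200 * t))
print(spec.shape)  # (freq_bins, time_frames), e.g. (129, ...)
```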
Multi-parametric MRI (mpMRI) studies are widely available in clinical practice for the diagnosis of various diseases. As the volume of mpMRI exams increases yearly, so do the inaccuracies within the DICOM header fields of these exams. This precludes the use of header information for arranging the different series as part of the radiologist's hanging protocol, and clinician oversight is needed for correction. In this pilot work, we propose an automated framework to classify 8 different series types in mpMRI studies. We used 1,363 studies acquired by three Siemens scanners to train a DenseNet-121 model with 5-fold cross-validation. Then, we evaluated the performance of the DenseNet-121 ensemble on a held-out test set of 313 mpMRI studies. Our method achieved an average precision of 96.6%, sensitivity of 96.6%, specificity of 99.6%, and F1 score of 96.6% on the MRI series classification task. To the best of our knowledge, we are the first to develop a method for classifying the series type in mpMRI studies acquired at the level of the chest, abdomen, and pelvis. Our method enables robust automation of hanging protocols in modern radiology practice.
https://arxiv.org/abs/2405.08247
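A minimal sketch of the classification backbone described above, assuming the standard torchvision DenseNet-121 with its classifier resized to the 8 series types; the 5-fold cross-validation and ensembling are omitted.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_SERIES_TYPES = 8  # the 8 mpMRI series classes from the study

model = models.densenet121(weights=None)
model.classifier = nn.Linear(model.classifier.in_features, NUM_SERIES_TYPES)

# In the paper, five such models (one per cross-validation fold) are
# ensembled; a simple ensemble would average their softmax outputs.
dummy = torch.randn(1, 3, 224, 224)
probs = torch.softmax(model(dummy), dim=1)
print(probs.shape)  # torch.Size([1, 8])
```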
As training datasets become increasingly drawn from unstructured, uncontrolled environments such as the web, researchers and industry practitioners have increasingly relied upon data filtering techniques to "filter out the noise" of web-scraped data. While datasets have been widely shown to reflect the biases and values of their creators, in this paper we contribute to an emerging body of research that assesses the filters used to create these datasets. We show that image-text data filtering also has biases and is value-laden, encoding specific notions of what is counted as "high-quality" data. In our work, we audit a standard approach of image-text CLIP-filtering on the academic benchmark DataComp's CommonPool by analyzing discrepancies of filtering through various annotation techniques across multiple modalities of image, text, and website source. We find that data relating to several imputed demographic groups -- such as LGBTQ+ people, older women, and younger men -- are associated with higher rates of exclusion. Moreover, we demonstrate cases of exclusion amplification: not only are certain marginalized groups already underrepresented in the unfiltered data, but CLIP-filtering excludes data from these groups at higher rates. The data-filtering step in the machine learning pipeline can therefore exacerbate representation disparities already present in the data-gathering step, especially when existing filters are designed to optimize a specifically chosen downstream performance metric like zero-shot image classification accuracy. Finally, we show that the NSFW filter fails to remove sexually-explicit content from CommonPool, and that CLIP-filtering includes several categories of copyrighted content at high rates. Our conclusions point to a need for fundamental changes in dataset creation and filtering practices.
https://arxiv.org/abs/2405.08209
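The CLIP-filtering step being audited can be sketched as below, under common assumptions (the openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers and an arbitrary similarity threshold); DataComp's actual model and threshold may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_keep(image: Image.Image, caption: str, threshold: float = 0.28) -> bool:
    """Keep an image-text pair only if CLIP cosine similarity clears the bar.

    The paper's point is that this seemingly neutral score encodes
    value-laden notions of "quality" and excludes some groups more often.
    """
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum()) >= threshold
```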
Large language models (LLMs) have demonstrated remarkable capabilities in various biomedical natural language processing (NLP) tasks, leveraging demonstrations within the input context to adapt to new tasks. However, LLMs are sensitive to the selection of demonstrations. To address the hallucination issue inherent in LLMs, retrieval-augmented LLMs (RALs) offer a solution by retrieving pertinent information from an established database. Nonetheless, existing research lacks a rigorous evaluation of the impact of retrieval-augmented large language models on different biomedical NLP tasks. This deficiency makes it challenging to ascertain the capabilities of RALs within the biomedical domain. Moreover, the outputs of RALs are affected by retrieving unlabeled, counterfactual, or diverse knowledge, which is not well studied in the biomedical domain, even though such knowledge is common in the real world. Finally, exploring the self-awareness ability is also crucial for RAL systems. In this paper, we therefore systematically investigate the impact of RALs on 5 different biomedical tasks (triple extraction, link prediction, classification, question answering, and natural language inference). We analyze the performance of RALs with respect to four fundamental abilities: unlabeled robustness, counterfactual robustness, diverse robustness, and negative awareness. To this end, we propose an evaluation framework to assess RALs' performance on different biomedical NLP tasks and establish four testbeds based on the aforementioned fundamental abilities. We then evaluate 3 representative LLMs with 3 different retrievers on 5 tasks over 9 datasets.
https://arxiv.org/abs/2405.08151
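The retrieval-augmented setup under evaluation can be sketched generically: retrieve the top-k passages for a biomedical query and prepend them to the prompt. The embedding function and corpus below are placeholders, not the paper's retrievers.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; a real RAL would use a trained retriever."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=64)

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    q = embed(query)
    scores = [float(q @ embed(doc)) for doc in corpus]
    top = np.argsort(scores)[::-1][:k]  # highest dot-product similarity
    return [corpus[i] for i in top]

def build_ral_prompt(query: str, corpus: list[str]) -> str:
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = ["Aspirin inhibits COX enzymes.",
          "Metformin lowers hepatic glucose production.",
          "BRCA1 mutations raise breast cancer risk."]
print(build_ral_prompt("How does metformin work?", corpus))
```

The testbeds described above would then vary what sits in the retrieved context — unlabeled, counterfactual, or deliberately diverse passages — and measure how the LLM's answers degrade.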
Most Americans agree that misinformation, hate speech, and harassment are harmful and inadequately curbed on social media through current moderation practices. In this paper, we aim to understand the discursive strategies employed by people in response to harmful speech in news comments. We conducted a content analysis of more than 6,500 comment replies to trending news videos on YouTube and Twitter and identified seven distinct discursive objection strategies (Study 1). We examined the frequency of each strategy's occurrence in the 6,500 comment replies, as well as in a second sample of 2,004 replies (Study 2). Together, these studies show that people deploy a diversity of discursive strategies when objecting to speech, and that reputational attacks are the most common. The resulting classification scheme accounts for different theoretical approaches to expressing objections and offers a comprehensive perspective on grassroots efforts aimed at stopping offensive or problematic speech.
https://arxiv.org/abs/2405.08142
Mamba, an architecture whose RNN-like token mixer is the state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism and has subsequently been applied to vision tasks. Nevertheless, the performance of Mamba for vision is often underwhelming compared with convolutional and attention-based models. In this paper, we delve into the essence of Mamba and conceptually conclude that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics. For vision tasks, as image classification aligns with neither characteristic, we hypothesize that Mamba is not necessary for this task; detection and segmentation tasks are also not autoregressive, yet they adhere to the long-sequence characteristic, so we believe it is still worthwhile to explore Mamba's potential for these tasks. To empirically verify our hypotheses, we construct a series of models named \emph{MambaOut} by stacking Mamba blocks while removing their core token mixer, the SSM. The experimental results strongly support our hypotheses. Specifically, our MambaOut model surpasses all visual Mamba models on ImageNet image classification, indicating that Mamba is indeed unnecessary for this task. As for detection and segmentation, MambaOut cannot match the performance of state-of-the-art visual Mamba models, demonstrating the potential of Mamba for long-sequence visual tasks. The code is available at this https URL
https://arxiv.org/abs/2405.07992
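A minimal sketch of the MambaOut idea — a Mamba-style block whose SSM token mixer has been removed, leaving a gated block that mixes tokens only by depthwise convolution. Dimensions are illustrative; the real implementation is at the linked URL.

```python
import torch
import torch.nn as nn

class GatedCNNBlock(nn.Module):
    """Mamba block minus the SSM: gating plus depthwise conv only."""

    def __init__(self, dim: int, expansion: int = 2, kernel_size: int = 7):
        super().__init__()
        hidden = dim * expansion
        self.norm = nn.LayerNorm(dim)
        self.fc_in = nn.Linear(dim, hidden * 2)  # value and gate branches
        self.conv = nn.Conv1d(hidden, hidden, kernel_size,
                              padding=kernel_size // 2, groups=hidden)
        self.fc_out = nn.Linear(hidden, dim)

    def forward(self, x):  # x: (batch, tokens, dim)
        shortcut = x
        v, g = self.fc_in(self.norm(x)).chunk(2, dim=-1)
        v = self.conv(v.transpose(1, 2)).transpose(1, 2)  # token mixing by conv
        return shortcut + self.fc_out(v * torch.sigmoid(g))  # gate, no SSM

x = torch.randn(2, 196, 64)
print(GatedCNNBlock(64)(x).shape)  # torch.Size([2, 196, 64])
```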
Large language models (LLMs) have shown success in many natural language processing tasks. Despite rigorous safety alignment processes, supposedly safety-aligned LLMs like Llama 2 and Claude 2 are still susceptible to jailbreaks, leading to security risks and abuse of the models. One option to mitigate such risks is to augment the LLM with a dedicated "safeguard", which checks the LLM's inputs or outputs for undesired behaviour. A promising approach is to use the LLM itself as the safeguard. Nonetheless, baseline methods, such as prompting the LLM to self-classify toxic content, demonstrate limited efficacy. We hypothesise that this is due to domain shift: the alignment training imparts a self-censoring behaviour to the model ("Sorry I can't do that"), while the self-classify approach shifts it to a classification format ("Is this prompt malicious"). In this work, we propose PARDEN, which avoids this domain shift by simply asking the model to repeat its own outputs. PARDEN neither requires finetuning nor white box access to the model. We empirically verify the effectiveness of our method and show that PARDEN significantly outperforms existing jailbreak detection baselines for Llama-2 and Claude-2. Code and data are available at this https URL. We find that PARDEN is particularly powerful in the relevant regime of high True Positive Rate (TPR) and low False Positive Rate (FPR). For instance, for Llama2-7B, at TPR equal to 90%, PARDEN accomplishes a roughly 11x reduction in the FPR from 24.8% to 2.0% on the harmful behaviours dataset.
https://arxiv.org/abs/2405.07932
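The core of PARDEN as described above fits in a few lines: ask the model to repeat its own candidate output and flag the response when the repetition diverges, since an aligned model tends to refuse to repeat harmful content. The similarity measure, threshold, and `generate` stub below are simplified stand-ins, not the authors' exact choices.

```python
from difflib import SequenceMatcher

REPEAT_PROMPT = ("Here is some text: '{output}'. "
                 "Please repeat the text exactly, verbatim.")

def generate(prompt: str) -> str:
    raise NotImplementedError("Call the deployed LLM here.")

def parden_check(candidate_output: str, threshold: float = 0.8) -> bool:
    """Return True if the output looks benign.

    A low repeat-similarity signals that the model balked at its own
    output, i.e. a likely jailbreak. The threshold is illustrative and
    would be tuned to trade off TPR against FPR.
    """
    repeated = generate(REPEAT_PROMPT.format(output=candidate_output))
    similarity = SequenceMatcher(None, candidate_output, repeated).ratio()
    return similarity >= threshold
```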
Pathology is the study of tissue through microscopic inspection, and a pathology diagnosis is often the medical gold standard for diagnosing disease. Pathology images pose a unique challenge for computer-vision-based analysis: a single pathology Whole Slide Image (WSI) is gigapixel-sized and often contains hundreds of thousands to millions of objects of interest across multiple resolutions. In this work, we propose the PathoLogy Universal TransfOrmer (PLUTO): a light-weight pathology foundation model (FM) that is pre-trained on a diverse dataset of 195 million image tiles collected from multiple sites and extracts meaningful representations across multiple WSI scales, enabling a large variety of downstream pathology tasks. In particular, we design task-specific adaptation heads that utilize PLUTO's output embeddings for tasks spanning pathology scales from subcellular to slide-level, including instance segmentation, tile classification, and slide-level prediction. We compare PLUTO's performance to other state-of-the-art methods on a diverse set of external and internal benchmarks covering multiple biologically relevant tasks, tissue types, resolutions, stains, and scanners. We find that PLUTO matches or outperforms existing task-specific baselines and pathology-specific foundation models, some of which use orders-of-magnitude larger datasets and model sizes than PLUTO. Our findings present a path towards a universal embedding to power pathology image analysis, and motivate further exploration of pathology foundation models in terms of data diversity, architectural improvements, sample efficiency, and practical deployability in real-world applications.
https://arxiv.org/abs/2405.07905
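The adaptation-head pattern can be sketched as a light head over frozen foundation-model tile embeddings; the embedding dimension and head shape below are illustrative assumptions, not PLUTO's.

```python
import torch
import torch.nn as nn

EMBED_DIM = 768  # illustrative embedding size from the frozen FM

class TileClassificationHead(nn.Module):
    """Task-specific head on top of frozen PLUTO-style tile embeddings."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.LayerNorm(EMBED_DIM),
            nn.Linear(EMBED_DIM, num_classes),
        )

    def forward(self, tile_embeddings):  # (batch, EMBED_DIM), backbone frozen
        return self.head(tile_embeddings)

# Slide-level prediction could pool tile embeddings before a similar head.
head = TileClassificationHead(num_classes=4)
print(head(torch.randn(8, EMBED_DIM)).shape)  # torch.Size([8, 4])
```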
In driving scenarios, automobile active safety systems are increasingly incorporating deep learning technology. These systems typically need to handle multiple tasks simultaneously, such as detecting fatigue driving and recognizing the driver's identity. However, the traditional parallel-style approach of combining multiple single-task models tends to waste resources when dealing with similar tasks. Therefore, we propose a novel tree-style approach to multi-task modeling in which, rooted at a shared backbone, increasingly dedicated separate module branches are appended as the model pipeline goes deeper. Following the tree-style approach, we propose a multi-task learning model for simultaneously performing driver fatigue detection and face recognition for driver identification. This model shares a common feature extraction backbone module, with further separated feature extraction and classification module branches. The dedicated branches exploit and combine spatial and channel attention mechanisms to generate space-channel fused-attention enhanced features, leading to improved detection performance. As only single-task datasets are available, we introduce techniques including alternating updating and gradient accumulation to train our multi-task model using only single-task datasets. The effectiveness of our tree-style multi-task learning model is verified through extensive validations.
https://arxiv.org/abs/2405.07845
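One plausible reading of the alternating-updating scheme above, sketched with hypothetical loaders and heads: each step draws a batch from one single-task dataset and backpropagates through the shared backbone plus that task's branch only.

```python
import itertools
import torch

def train_alternating(backbone, fatigue_head, face_head,
                      fatigue_loader, face_loader, optimizer, steps=1000):
    """Alternate batches from the two single-task datasets.

    Each dataset labels only one task, so each step updates the
    shared backbone and the matching branch, never the other branch.
    """
    fatigue_iter = itertools.cycle(fatigue_loader)
    face_iter = itertools.cycle(face_loader)
    for step in range(steps):
        if step % 2 == 0:
            x, y = next(fatigue_iter)
            logits = fatigue_head(backbone(x))
        else:
            x, y = next(face_iter)
            logits = face_head(backbone(x))
        loss = torch.nn.functional.cross_entropy(logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```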
Replica exchange stochastic gradient Langevin dynamics (reSGLD) is an effective sampler for non-convex learning in large-scale datasets. However, the simulation may encounter stagnation issues when the high-temperature chain delves too deeply into the distribution tails. To tackle this issue, we propose reflected reSGLD (r2SGLD): an algorithm tailored for constrained non-convex exploration by utilizing reflection steps within a bounded domain. Theoretically, we observe that reducing the diameter of the domain enhances mixing rates, exhibiting a \emph{quadratic} behavior. Empirically, we test its performance through extensive experiments, including identifying dynamical systems with physical constraints, simulations of constrained multi-modal distributions, and image classification tasks. The theoretical and empirical findings highlight the crucial role of constrained exploration in improving the simulation efficiency.
https://arxiv.org/abs/2405.07839
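The reflection step at the heart of r2SGLD can be illustrated in one dimension: a Langevin proposal that leaves the bounded domain is folded back inside, keeping the chain in the constraint set. The potential and step size below are toy choices.

```python
import numpy as np

def reflect(x: float, lo: float, hi: float) -> float:
    """Fold a point back into [lo, hi] by reflecting at the boundary."""
    width = hi - lo
    x = (x - lo) % (2 * width)
    return lo + (x if x <= width else 2 * width - x)

def reflected_sgld_step(x, grad_U, step=0.01, temperature=1.0,
                        lo=-1.0, hi=1.0, rng=np.random.default_rng(0)):
    """One SGLD update followed by reflection into the bounded domain."""
    proposal = x - step * grad_U(x) + np.sqrt(2 * step * temperature) * rng.normal()
    return reflect(proposal, lo, hi)

# Toy double-well potential U(x) = (x^2 - 0.5)^2 on [-1, 1].
grad_U = lambda x: 4 * x * (x**2 - 0.5)
x = 0.9
for _ in range(5):
    x = reflected_sgld_step(x, grad_U)
print(x)
```

Shrinking [lo, hi] is exactly the "reducing the diameter of the domain" knob whose quadratic effect on mixing the paper analyzes.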
Detecting the ingestion environment is an important aspect of monitoring dietary intake, as it provides insightful information for dietary assessment. However, it is a challenging problem: human-based review is tedious, and algorithm-based review suffers from data imbalance and perceptual aliasing problems. To address these issues, we propose a neural network-based method with a two-stage training framework that tactfully combines fine-tuning and transfer learning techniques. Our method is evaluated on a newly collected dataset called the "UA Free Living Study", which uses an egocentric wearable camera, the AIM-2 sensor, to simulate food consumption in free-living conditions. The proposed training framework is applied to common neural network backbones, combined with approaches from the general imbalanced classification field. Experimental results on the collected dataset show that our proposed method for automatic ingestion environment recognition successfully addresses the challenging data imbalance problem in the dataset and achieves a promising overall classification accuracy of 96.63%.
https://arxiv.org/abs/2405.07827
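A minimal sketch of a two-stage schedule of the kind described above, under my own assumptions about the details: stage one transfers a pretrained backbone and trains only a new classification head; stage two unfreezes the backbone for fine-tuning at a lower learning rate. The backbone choice and class count are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_ENVIRONMENTS = 5  # illustrative number of ingestion-environment classes

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # transfer
model.fc = nn.Linear(model.fc.in_features, NUM_ENVIRONMENTS)

# Stage 1: freeze the transferred backbone, train only the new head.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True
stage1_opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# Stage 2: unfreeze everything and fine-tune end to end, gently.
for p in model.parameters():
    p.requires_grad = True
stage2_opt = torch.optim.Adam(model.parameters(), lr=1e-5)
```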
Over the last few years, 360$\degree$ video traffic on the network has grown significantly. A key challenge of 360$\degree$ video playback is ensuring a high quality of experience (QoE) with limited network bandwidth. Currently, most studies focus on tile-based adaptive bitrate (ABR) streaming based on single-viewpoint prediction to reduce bandwidth consumption. However, the performance of single-viewpoint prediction models is severely limited by the inherent uncertainty in head movement, and they cannot cope well with sudden user movements. This paper first presents a multimodal spatial-temporal attention transformer that, given a historical trajectory, generates multiple viewpoint trajectories with their probabilities. The proposed method models viewpoint prediction as a classification problem and uses attention mechanisms to capture the spatial and temporal characteristics of input video frames and viewpoint trajectories for multi-viewpoint prediction. After that, a multi-agent deep reinforcement learning (MADRL)-based ABR algorithm utilizing multi-viewpoint prediction for 360$\degree$ video streaming is proposed to maximize different QoE objectives under various network conditions. We formulate the ABR problem as a decentralized partially observable Markov decision process (Dec-POMDP) and present a MAPPO algorithm based on the centralized training and decentralized execution (CTDE) framework to solve it. The experimental results show that our proposed method improves the defined QoE metric by up to 85.5\% compared to existing ABR methods.
https://arxiv.org/abs/2405.07759
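The "viewpoint prediction as classification" idea can be sketched as follows, under heavy simplification: the viewing sphere is discretized into grid cells, a transformer encodes the head-movement history, and the top-k cells with their softmax probabilities serve as the multiple candidate viewpoints. The grid size, history encoding, and omission of the video-frame branch are all my assumptions.

```python
import torch
import torch.nn as nn

GRID = 8 * 8  # viewing sphere discretized into an 8x8 tile grid (illustrative)

class ViewpointClassifier(nn.Module):
    """Treats next-viewpoint prediction as classification over grid cells."""

    def __init__(self, d_model: int = 64):
        super().__init__()
        self.proj = nn.Linear(2, d_model)  # (yaw, pitch) per past step
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, GRID)

    def forward(self, history):  # history: (batch, history_len, 2)
        h = self.encoder(self.proj(history))
        return self.head(h[:, -1])  # logits over grid cells

model = ViewpointClassifier()
logits = model(torch.randn(2, 16, 2))
probs, cells = torch.softmax(logits, dim=-1).topk(3)  # top-3 candidate viewpoints
print(probs.shape, cells.shape)  # torch.Size([2, 3]) twice
```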
The novel coronavirus (COVID-19), a highly infectious respiratory disease caused by SARS-CoV-2, has emerged as an unprecedented healthcare crisis. The pandemic has had a devastating impact on the health, well-being, and economy of the global population. Early screening and diagnosis of symptomatic patients play a crucial role in isolating patients to help stop community transmission, and in providing early treatment that helps reduce the mortality rate. Although the RT-PCR test is the gold standard for COVID-19 testing, it is a manual, laborious, time-consuming, uncomfortable, and invasive process. Owing to its accessibility, availability, lower cost, ease of sanitisation, and portable setup, chest X-ray imaging can serve as an effective screening and diagnostic tool. In this study, we first highlight the limitations of existing datasets and studies in terms of data quality, data imbalance, and evaluation strategy. Second, we curate a large-scale COVID-19 chest X-ray dataset from many publicly available COVID-19 imaging databases and propose a pre-processing pipeline to improve the quality of the dataset. We then propose CoVScreen, a CNN architecture trained and tested on the curated dataset. Experimental results across different classification scenarios and various evaluation metrics demonstrate the effectiveness of the proposed methodology in screening for COVID-19 infection.
https://arxiv.org/abs/2405.07674
Many industry verticals are confronted with small-sized tabular data. In this low-data regime, it is currently unclear whether the best performance can be expected from simple baselines or from more complex machine learning approaches that leverage meta-learning and ensembling. On 44 tabular classification datasets with sample sizes $\leq$ 500, we find that L2-regularized logistic regression performs similarly to state-of-the-art automated machine learning (AutoML) frameworks (AutoPrognosis, AutoGluon) and off-the-shelf deep neural networks (TabPFN, HyperFast) on the majority of the benchmark datasets. We therefore recommend considering logistic regression as the first choice for data-scarce applications with tabular data, and we provide practitioners with best practices for further method selection.
https://arxiv.org/abs/2405.07662
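The recommended baseline is nearly a one-liner with scikit-learn; the sketch below tunes the L2 strength by cross-validation, a reasonable default for samples $\leq$ 500, though the paper's exact protocol may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic tabular dataset standing in for a <=500-sample problem.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# L2-regularized logistic regression with the penalty strength chosen
# by internal cross-validation over a logarithmic grid of C values.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=np.logspace(-4, 4, 9), cv=5, max_iter=5000),
)
clf.fit(X, y)
print(clf.score(X, y))
```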
For language model classification, would you prefer having only one workable class or having every class work? The latter is of far more practical use. Especially for large language models (LLMs), the fact that they achieve a fair overall accuracy through in-context learning (ICL) obscures large differences in individual class accuracies. In this work, we uncover and tackle language models' imbalance in per-class prediction accuracy by reconceptualizing it as the Contextual Oddity Bias (COBias), and we are the first to engage nonlinear integer programming (NIP) to debias it. Briefly, COBias refers to the difference in accuracy between a class A and its "odd" class, which holds the majority of class A's wrong predictions. With the COBias metric, we reveal that LLMs of varied scales and families exhibit large per-class accuracy differences. We then propose Debiasing as Nonlinear Integer Programming (DNIP) to correct ICL per-class probabilities for lower bias and higher overall accuracy. Our optimization objective is directly based on the evaluation scores of the COBias and accuracy metrics and is solved by simulated annealing. Evaluations of three LLMs across seven NLP classification tasks show that DNIP simultaneously achieves a significant COBias reduction ($-27\%$) and an accuracy improvement ($+12\%$) over the conventional ICL approach, suggesting that modeling pairwise class accuracy differences is a promising direction for more accurate, more reliable LLM predictions.
https://arxiv.org/abs/2405.07623
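One plausible reading of the COBias definition above, sketched from a confusion matrix: each class A is paired with the "odd" class that absorbs most of A's wrong predictions, and the metric averages the per-pair accuracy gaps. This is my interpretation of the stated definition, not the authors' reference code.

```python
import numpy as np

def cobias(confusion: np.ndarray) -> float:
    """Average accuracy gap between each class and its "odd" class.

    confusion[i, j] counts items of true class i predicted as class j.
    The odd class of A is the class receiving most of A's errors.
    """
    per_class_acc = np.diag(confusion) / confusion.sum(axis=1)
    gaps = []
    for a in range(len(confusion)):
        errors = confusion[a].copy()
        errors[a] = 0  # ignore correct predictions
        if errors.sum() == 0:
            continue  # class A is never misclassified
        odd = int(np.argmax(errors))
        gaps.append(abs(per_class_acc[a] - per_class_acc[odd]))
    return float(np.mean(gaps)) if gaps else 0.0

cm = np.array([[40, 8, 2],
               [25, 20, 5],
               [1, 4, 45]])
print(cobias(cm))  # ~0.43 for this toy matrix
```

DNIP would then search (e.g., by simulated annealing) for per-class probability corrections that jointly lower this score and raise overall accuracy.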
Recent advances in Tiny Machine Learning (TinyML) empower low-footprint embedded devices for real-time on-device Machine Learning. While many acknowledge the potential benefits of TinyML, its practical implementation presents unique challenges. This study aims to bridge the gap between prototyping single TinyML models and developing reliable TinyML systems in production: (1) Embedded devices operate in dynamically changing conditions. Existing TinyML solutions primarily focus on inference, with models trained offline on powerful machines and deployed as static objects. However, static models may underperform in the real world due to evolving input data distributions. We propose online learning to enable training on constrained devices, adapting local models towards the latest field conditions. (2) Nevertheless, current on-device learning methods struggle with heterogeneous deployment conditions and the scarcity of labeled data when applied across numerous devices. We introduce federated meta-learning incorporating online learning to enhance model generalization, facilitating rapid learning. This approach ensures optimal performance among distributed devices by knowledge sharing. (3) Moreover, TinyML's pivotal advantage is widespread adoption. Embedded devices and TinyML models prioritize extreme efficiency, leading to diverse characteristics ranging from memory and sensors to model architectures. Given their diversity and non-standardized representations, managing these resources becomes challenging as TinyML systems scale up. We present semantic management for the joint management of models and devices at scale. We demonstrate our methods through a basic regression example and then assess them in three real-world TinyML applications: handwritten character image classification, keyword audio classification, and smart building presence detection, confirming our approaches' effectiveness.
https://arxiv.org/abs/2405.07601
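The online-learning piece of point (1) can be illustrated with a tiny streaming update: the deployed model is refreshed one sample at a time as field data arrives, rather than shipped as a static object. The linear model and learning rate are toy choices standing in for an on-device TinyML learner.

```python
import numpy as np

class OnlineLinearModel:
    """Tiny streaming least-squares learner: SGD one sample at a time.

    Stands in for on-device adaptation of a TinyML model to drifting
    input distributions; a real deployment would update a compact
    neural network within the device's memory budget.
    """

    def __init__(self, n_features: int, lr: float = 0.01):
        self.w = np.zeros(n_features)
        self.lr = lr

    def update(self, x: np.ndarray, y: float) -> float:
        error = self.w @ x - y
        self.w -= self.lr * error * x  # one-sample gradient step
        return 0.5 * error**2

model = OnlineLinearModel(n_features=3)
rng = np.random.default_rng(0)
for _ in range(1000):  # simulated sensor stream
    x = rng.normal(size=3)
    y = 2.0 * x[0] - 1.0 * x[2]  # in the field, this target would drift
    model.update(x, y)
print(model.w)  # approaches [2, 0, -1]
```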