Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the need to train models for summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance summarization prompts. We show that adding keyphrases to prompts can improve ROUGE F1 and recall, making the generated summaries more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level salient information. However, the impact on hallucination is not universally positive across LLMs. To conduct this analysis, we introduce Keyphrase Signal Extractor (CriSPO), a lightweight model that can be finetuned to extract salient keyphrases. By using CriSPO, we achieve consistent ROUGE improvements across datasets and across open-weight and proprietary LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summarization systems.
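A minimal sketch of the recipe, with a toy frequency-based extractor standing in for the finetuned CriSPO model (the prompt wording and the stopword list are illustrative assumptions):

```python
# Keyphrase-augmented summarization prompting. The extractor below is a
# trivial frequency baseline; the paper finetunes a dedicated model.
from collections import Counter
import re

def extract_keyphrases(document: str, k: int = 5) -> list[str]:
    """Toy salience model: most frequent non-stopword unigrams."""
    stop = {"the", "a", "an", "of", "to", "and", "in", "is", "for", "on"}
    words = [w for w in re.findall(r"[a-z]+", document.lower()) if w not in stop]
    return [w for w, _ in Counter(words).most_common(k)]

def build_prompt(document: str, k: int = 5) -> str:
    # More keyphrases push recall up at some cost to precision,
    # per the paper's precision-recall trade-off observation.
    phrases = ", ".join(extract_keyphrases(document, k))
    return (
        "Summarize the following document.\n"
        f"Make sure the summary covers these key points: {phrases}\n\n"
        f"Document:\n{document}\n\nSummary:"
    )

print(build_prompt("The central bank raised interest rates to curb inflation ..."))
```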
https://arxiv.org/abs/2410.02748
Navigating complex environments requires Unmanned Aerial Vehicles (UAVs) and autonomous systems to perform trajectory tracking and obstacle avoidance in real-time. While many control strategies have effectively utilized linear approximations, addressing the non-linear dynamics of UAVs, especially in obstacle-dense environments, remains a key challenge that requires further research. This paper introduces a Non-linear Model Predictive Control (NMPC) framework for the DJI Matrice 100 that addresses these challenges by using a dynamic model and B-spline interpolation for smooth reference trajectories, ensuring minimal deviation while respecting safety constraints. The framework supports various trajectory types and employs a penalty-based cost function for control accuracy in tight maneuvers. It uses CasADi for efficient real-time optimization, enabling the UAV to maintain robust operation even under tight computational constraints. Simulation and real-world indoor and outdoor experiments demonstrate the framework's ability to adapt to disturbances, resulting in smooth, collision-free navigation.
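For flavor, here is a trimmed-down NMPC of this shape in CasADi's Opti interface, reduced to a 2D point mass with one waypoint and one obstacle (the paper uses the full Matrice 100 dynamics and B-spline references; the weights and limits below are illustrative assumptions):

```python
import casadi as ca
import numpy as np

N, dt = 20, 0.1                        # horizon length and time step
opti = ca.Opti()
X = opti.variable(4, N + 1)            # state: [px, py, vx, vy]
U = opti.variable(2, N)                # input: [ax, ay]

x0 = np.array([0.0, 0.0, 0.0, 0.0])
ref = np.array([2.0, 2.0])             # waypoint sampled from a reference path
obs, r_safe = np.array([1.0, 1.0]), 0.4

cost = 0
for k in range(N):
    # point-mass (double-integrator) dynamics, forward Euler
    opti.subject_to(X[0:2, k + 1] == X[0:2, k] + dt * X[2:4, k])
    opti.subject_to(X[2:4, k + 1] == X[2:4, k] + dt * U[:, k])
    # tracking + control effort + penalty-based obstacle term
    d2 = ca.sumsqr(X[0:2, k] - obs)
    cost += ca.sumsqr(X[0:2, k] - ref) + 0.01 * ca.sumsqr(U[:, k]) \
            + 50 * ca.fmax(0, r_safe**2 - d2)

opti.subject_to(X[:, 0] == x0)                     # initial condition
opti.subject_to(opti.bounded(-2, ca.vec(U), 2))    # input limits
opti.minimize(cost)
opti.solver("ipopt")
sol = opti.solve()
print("first control input:", sol.value(U[:, 0]))
```

In a receding-horizon loop, only the first control input is applied and the problem is re-solved at the next step.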
https://arxiv.org/abs/2410.02732
Large Language Models (LLMs) are pre-trained on large-scale corpora and excel in numerous general natural language processing (NLP) tasks, such as question answering (QA). Despite their advanced language capabilities, when it comes to domain-specific and knowledge-intensive tasks, LLMs suffer from hallucinations, knowledge cut-offs, and lack of knowledge attribution. Additionally, fine-tuning LLMs' intrinsic knowledge to highly specific domains is an expensive and time-consuming process. The retrieval-augmented generation (RAG) process has recently emerged as a method for optimizing LLM responses by grounding them in a predetermined ontology. It has been shown that using a Knowledge Graph (KG) ontology for RAG improves QA accuracy by taking into account relevant sub-graphs that preserve information in a structured manner. In this paper, we introduce SMART-SLIC, a highly domain-specific LLM framework that integrates RAG with a KG and a vector store (VS) that stores factual domain-specific information. Importantly, to avoid hallucinations in the KG, we build these highly domain-specific KGs and VSs without the use of LLMs, but via NLP, data mining, and nonnegative tensor factorization with automatic model selection. Pairing our RAG with a domain-specific (i) KG (containing structured information) and (ii) VS (containing unstructured information) enables the development of domain-specific chat-bots that attribute the source of information, mitigate hallucinations, lessen the need for fine-tuning, and excel in highly domain-specific question answering tasks. We pair SMART-SLIC with chain-of-thought prompting agents. The framework is designed to generalize to any specific or specialized domain. In this paper, we demonstrate the question answering capabilities of our framework on a corpus of scientific publications on malware analysis and anomaly detection.
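A toy sketch of the dual-store retrieval idea with source attribution (random vectors stand in for real embeddings, and the KG/VS contents are invented examples; SMART-SLIC builds its stores via NLP, data mining, and tensor factorization rather than with an LLM):

```python
# Dual retrieval: structured facts from a KG plus unstructured passages
# from a vector store, both tagged with their source so the chatbot can
# attribute what it says.
import numpy as np

rng = np.random.default_rng(0)
kg = [("EmotetMalware", "uses", "phishing emails"),
      ("EmotetMalware", "detected_by", "anomaly detection")]
passages = ["Emotet spreads through malicious email attachments.",
            "Anomaly detectors flag unusual process behaviour."]
emb = rng.normal(size=(len(passages), 64))   # pretend passage embeddings

def retrieve(query_vec, k=1):
    sims = emb @ query_vec / (np.linalg.norm(emb, axis=1) * np.linalg.norm(query_vec))
    top = np.argsort(-sims)[:k]
    return [(passages[i], f"VS:doc{i}") for i in top]

def build_context(entity, query_vec):
    facts = [(f"{s} {p} {o}", "KG") for s, p, o in kg if s == entity]
    return facts + retrieve(query_vec)

for text, source in build_context("EmotetMalware", rng.normal(size=64)):
    print(f"[{source}] {text}")   # attributed context fed to the LLM
```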
https://arxiv.org/abs/2410.02721
We have developed a Bayesian optimization (BO) workflow that integrates intra-step noise optimization into automated experimental cycles. Traditional BO approaches in automated experiments focus on optimizing experimental trajectories but often overlook the impact of measurement noise on data quality and cost. Our proposed framework simultaneously optimizes both the target property and the associated measurement noise by introducing time as an additional input parameter, thereby balancing the signal-to-noise ratio and experimental duration. Two approaches are explored: a reward-driven noise optimization and a double-optimization acquisition function, both enhancing the efficiency of automated workflows by considering noise and cost within the optimization process. We validate our method through simulations and real-world experiments using Piezoresponse Force Microscopy (PFM), demonstrating the successful optimization of measurement duration and property exploration. Our approach offers a scalable solution for optimizing multiple variables in automated experimental workflows, improving data quality, and reducing resource expenditure in materials science and beyond.
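One way to picture the reward-driven variant: treat measurement time t as an extra input, model per-point noise as sigma0/sqrt(t), and score candidate durations by predicted value minus a time cost. The acquisition below is a deliberately simple stand-in for the paper's, and all constants are assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(1)
sigma0, cost_w = 0.5, 0.05

def measure(x, t):
    true_val = np.sin(3 * x) * np.exp(-x**2)              # hidden property
    return true_val + rng.normal(0, sigma0 / np.sqrt(t))  # SNR grows with t

X, T, y = [0.0], [1.0], [measure(0.0, 1.0)]
for _ in range(15):
    # per-point noise enters the GP through alpha, tied to duration t
    gp = GaussianProcessRegressor(alpha=(sigma0**2) / np.array(T))
    gp.fit(np.array(X).reshape(-1, 1), y)
    xs = np.linspace(-2, 2, 200)
    mu, sd = gp.predict(xs.reshape(-1, 1), return_std=True)
    best = int(np.argmax(mu + 2 * sd))                    # UCB pick for x
    cand_t = np.array([0.5, 1.0, 2.0, 4.0])
    # reward per duration: optimistic value, minus time cost, minus a
    # penalty for the extra noise a short measurement would introduce
    reward = (mu[best] + 2 * sd[best]) - cost_w * cand_t - sigma0 / np.sqrt(cand_t)
    t = float(cand_t[np.argmax(reward)])
    X.append(float(xs[best])); T.append(t); y.append(measure(xs[best], t))

print(f"best observed value: {max(y):.3f}")
```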
https://arxiv.org/abs/2410.02717
Recent advances in robotics are pushing real-world autonomy, enabling robots to perform long-term and large-scale missions. A crucial component of successful missions is the incorporation of loop closures through place recognition, which effectively mitigates accumulated pose estimation drift. Despite computational advancements, optimizing performance for real-time deployment remains challenging, especially in resource-constrained mobile robots and multi-robot systems, since conventional keyframe sampling practices in place recognition often retain redundant information or overlook relevant data, as they rely on fixed sampling intervals or work directly in the 3D space instead of the feature space. To address these concerns, we introduce the concept of sample space in place recognition and demonstrate how different sampling techniques affect the query process and overall performance. We then present a novel keyframe sampling approach for LiDAR-based place recognition, which focuses on redundancy minimization and information preservation in the hyper-dimensional descriptor space. This approach is applicable to both learning-based and handcrafted descriptors, and through experimental validation across multiple datasets and descriptor frameworks, we demonstrate the effectiveness of our proposed method, showing that it can jointly minimize redundancy and preserve essential information in real-time. The proposed approach maintains robust performance across various datasets without requiring parameter tuning, contributing to more efficient and reliable place recognition for a wide range of robotic applications.
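The core idea admits a compact illustration: sample keyframes by novelty in descriptor space rather than at fixed intervals. The greedy similarity threshold below is an assumption; the paper's criterion is more principled:

```python
# Keep a frame only when its place-recognition descriptor adds
# information, i.e. it is not too similar to any retained descriptor.
import numpy as np

def sample_keyframes(descriptors: np.ndarray, sim_thresh: float = 0.9) -> list[int]:
    """descriptors: (n_frames, d) array of per-frame descriptors."""
    unit = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    kept: list[int] = []
    for i, d in enumerate(unit):
        if not kept or np.max(unit[kept] @ d) < sim_thresh:
            kept.append(i)          # novel viewpoint: retain
    return kept

rng = np.random.default_rng(0)
# a trajectory that lingers: many near-duplicate descriptors in a row
base = rng.normal(size=(5, 256))
frames = np.repeat(base, 20, axis=0) + 0.01 * rng.normal(size=(100, 256))
print("kept", len(sample_keyframes(frames)), "of", len(frames), "frames")
```

Fixed-interval sampling would keep many of the near-duplicates here; descriptor-space sampling collapses each lingering segment to roughly one keyframe.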
https://arxiv.org/abs/2410.02643
The widely used expected utility theory has been shown to be empirically inconsistent with human preferences in the psychology and behavioral economics literatures. Cumulative Prospect Theory (CPT) was developed to fill this gap and provide a model of human-based decision-making better supported by empirical evidence. It can express a wide range of attitudes and perceptions towards risk, gains, and losses. A few years ago, CPT was combined with Reinforcement Learning (RL) to formulate a CPT policy optimization problem in which the goal of the agent is to search for a policy generating long-term returns aligned with its preferences. In this work, we revisit this policy optimization problem and provide new insights into optimal policies and their nature depending on the utility function under consideration. We further derive a novel policy gradient theorem for the CPT policy optimization objective, generalizing the seminal corresponding result in standard RL. This result enables us to design a model-free policy gradient algorithm to solve the CPT-RL problem. We illustrate the performance of our algorithm in simple examples motivated by traffic control and electricity management applications. We also demonstrate that our policy gradient algorithm scales better to larger state spaces compared to the existing zeroth-order algorithm for solving the same problem.
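For intuition, here is how the CPT value that replaces the plain expected return can be estimated from sampled returns (gains-only Tversky-Kahneman form; the parameter values are conventional defaults, not the paper's):

```python
import numpy as np

def w(p, gamma=0.71):
    """Inverse-S probability weighting: overweights rare extreme outcomes."""
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

def u(x, alpha=0.88):
    """Concave utility over gains."""
    return np.maximum(x, 0.0) ** alpha

def cpt_value(returns, gamma=0.71, alpha=0.88):
    x = np.sort(np.asarray(returns))       # outcomes, ascending
    n = len(x)
    tail = np.arange(n, 0, -1) / n         # P(X >= x_i)
    # decision weights: differences of the weighted tail probabilities
    weights = w(tail, gamma) - w(tail - 1 / n, gamma)
    return float(weights @ u(x, alpha))

returns = np.random.default_rng(0).exponential(1.0, size=1000)
print("mean   :", returns.mean())
print("CPT val:", cpt_value(returns))   # differs: tails are re-weighted
```

With the identity weighting and utility, `cpt_value` reduces to the sample mean, which is the standard RL objective the paper generalizes.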
https://arxiv.org/abs/2410.02605
Recently, 3D Gaussian Splatting (3DGS) has excelled in novel view synthesis with its real-time rendering capabilities and superior quality. However, it faces challenges for high-resolution novel view synthesis (HRNVS) due to the coarse nature of primitives derived from low-resolution input views. To address this issue, we propose Super-Resolution 3DGS (SuperGS), an expansion of 3DGS designed with a two-stage coarse-to-fine training framework that uses a pretrained low-resolution scene representation as an initialization for super-resolution optimization. Moreover, we introduce Multi-resolution Feature Gaussian Splatting (MFGS), which incorporates a latent feature field for flexible feature sampling, and Gradient-guided Selective Splitting (GSS) for effective Gaussian upsampling. Integrating these strategies within the coarse-to-fine framework ensures both high fidelity and memory efficiency. Extensive experiments demonstrate that SuperGS surpasses state-of-the-art HRNVS methods on challenging real-world datasets using only low-resolution inputs.
https://arxiv.org/abs/2410.02571
Data augmentation, a cornerstone technique in deep learning, is crucial in enhancing model performance, especially with scarce labeled data. While traditional techniques are effective, their reliance on hand-crafted methods limits their applicability across diverse data types and tasks. Although modern learnable augmentation methods offer increased adaptability, they are computationally expensive and challenging to incorporate within prevalent augmentation workflows. In this work, we present a novel, efficient method for data augmentation, effectively bridging the gap between existing augmentation strategies and emerging datasets and learning tasks. We introduce SAFLEX (Self-Adaptive Augmentation via Feature Label EXtrapolation), which learns the sample weights and soft labels of augmented samples provided by any given upstream augmentation pipeline, using a specifically designed efficient bilevel optimization algorithm. Remarkably, SAFLEX effectively reduces the noise and label errors of the upstream augmentation pipeline at a marginal computational cost. As a versatile module, SAFLEX excels across diverse datasets, including natural and medical images and tabular data, showcasing its prowess in few-shot learning and out-of-distribution generalization. SAFLEX seamlessly integrates with common augmentation strategies like RandAug and CutMix, and with those from large pre-trained generative models like Stable Diffusion, and is also compatible with frameworks such as CLIP's fine-tuning. Our findings highlight the potential to adapt existing augmentation pipelines for new data types and tasks, signaling a move towards more adaptable and resilient training frameworks.
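A compact sketch of the bilevel mechanic: learn per-sample weights for augmented data by differentiating a clean-validation loss through one unrolled inner step (SAFLEX also learns soft labels and uses a purpose-built efficient solver; the one-step unrolling and the toy linear model here are illustrative assumptions):

```python
import torch

torch.manual_seed(0)
d, n_aug, n_val = 10, 64, 32
X_aug = torch.randn(n_aug, d); y_aug = torch.randint(0, 2, (n_aug,)).float()
X_val = torch.randn(n_val, d); y_val = torch.randint(0, 2, (n_val,)).float()

theta = torch.zeros(d, requires_grad=True)          # linear classifier
w_logit = torch.zeros(n_aug, requires_grad=True)    # sample-weight logits
bce = torch.nn.functional.binary_cross_entropy_with_logits

for step in range(100):
    # outer step: adjust weights through one unrolled inner update
    w = torch.sigmoid(w_logit)                      # weights in (0, 1)
    inner = (w * bce(X_aug @ theta, y_aug, reduction="none")).mean()
    g = torch.autograd.grad(inner, theta, create_graph=True)[0]
    outer = bce(X_val @ (theta - 0.5 * g), y_val)   # clean validation loss
    gw, = torch.autograd.grad(outer, w_logit)
    with torch.no_grad():
        w_logit -= 1.0 * gw
    # inner step: train the model on the re-weighted augmented data
    w = torch.sigmoid(w_logit).detach()
    loss = (w * bce(X_aug @ theta, y_aug, reduction="none")).mean()
    gt, = torch.autograd.grad(loss, theta)
    with torch.no_grad():
        theta -= 0.5 * gt

print("final val loss:", bce(X_val @ theta, y_val).item())
```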
https://arxiv.org/abs/2410.02512
Large Language Models (LLMs) can be \emph{misused} to spread online spam and misinformation. Content watermarking deters misuse by hiding a message in model-generated outputs, enabling their detection using a secret watermarking key. Robustness is a core security property, stating that evading detection requires (significant) degradation of the content's quality. Many LLM watermarking methods have been proposed, but robustness is tested only against \emph{non-adaptive} attackers who lack knowledge of the watermarking method and can find only suboptimal attacks. We formulate the robustness of LLM watermarking as an objective function and propose preference-based optimization to tune \emph{adaptive} attacks against the specific watermarking method. Our evaluation shows that (i) adaptive attacks substantially outperform non-adaptive baselines, (ii) even in a non-adaptive setting, adaptive attacks optimized against a few known watermarks remain highly effective when tested against other unseen watermarks, and (iii) optimization-based attacks are practical and require less than seven GPU hours. Our findings underscore the need to test robustness against adaptive attackers.
https://arxiv.org/abs/2410.02440
In this paper, we introduce Plug-and-Play (PnP) Flow Matching, an algorithm for solving imaging inverse problems. PnP methods leverage the strength of pre-trained denoisers, often deep neural networks, by integrating them in optimization schemes. While they achieve state-of-the-art performance on various inverse problems in imaging, PnP approaches face inherent limitations on more generative tasks like inpainting. On the other hand, generative models such as Flow Matching have pushed the boundary of image sampling yet lack a clear method for efficient use in image restoration. We propose to combine the PnP framework with Flow Matching (FM) by defining a time-dependent denoiser using a pre-trained FM model. Our algorithm alternates between gradient descent steps on the data-fidelity term, reprojections onto the learned FM path, and denoising. Notably, our method is computationally efficient and memory-friendly, as it avoids backpropagation through ODEs and trace computations. We evaluate its performance on denoising, super-resolution, deblurring, and inpainting tasks, demonstrating superior results compared to existing PnP algorithms and Flow Matching based state-of-the-art methods.
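The alternation is easy to sketch on a toy inpainting problem. The velocity network below is an untrained stand-in for a pre-trained FM model, so this shows only the control flow, under the common straight-path convention x_t = t*x_1 + (1-t)*x_0 (data x_1, noise x_0), where the time-dependent denoiser is D_t(x) = x + (1-t)*v(x, t):

```python
import torch

torch.manual_seed(0)
d = 16
v_net = torch.nn.Sequential(torch.nn.Linear(d + 1, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, d))   # stand-in for a trained FM model

def velocity(x, t):
    t_col = torch.full((x.shape[0], 1), float(t))
    return v_net(torch.cat([x, t_col], dim=1))

mask = (torch.rand(d) > 0.5).float()        # observed pixels
x_true = torch.randn(1, d)
y = mask * x_true                           # masked observation

x = torch.randn(1, d)
steps, lr = 50, 0.5
for t in torch.linspace(0.0, 1.0, steps):
    # 1) gradient step on the data-fidelity term ||mask * x - y||^2 / 2
    x = x - lr * mask * (mask * x - y)
    # 2) reproject onto the FM path at time t (re-noise)
    x = t * x + (1 - t) * torch.randn_like(x)
    # 3) denoise with the FM-derived time-dependent denoiser
    with torch.no_grad():
        x = x + (1 - t) * velocity(x, t)

print("masked-pixel error:", (mask * (x - x_true)).norm().item())
```

Note that no step backpropagates through an ODE solve, which is the source of the method's memory friendliness.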
https://arxiv.org/abs/2410.02423
In this work, we present BiSSL, a first-of-its-kind training framework that introduces bilevel optimization to enhance the alignment between the pretext pre-training and downstream fine-tuning stages in self-supervised learning. BiSSL formulates the pretext and downstream task objectives as the lower- and upper-level objectives in a bilevel optimization problem and serves as an intermediate training stage within the self-supervised learning pipeline. By more explicitly modeling the interdependence of these training stages, BiSSL facilitates enhanced information sharing between them, ultimately leading to a backbone parameter initialization that is better suited for the downstream task. We propose a training algorithm that alternates between optimizing the two objectives defined in BiSSL. Using a ResNet-18 backbone pre-trained with SimCLR on the STL10 dataset, we demonstrate that our proposed framework consistently achieves improved or competitive classification accuracies across various downstream image classification datasets compared to the conventional self-supervised learning pipeline. Qualitative analyses of the backbone features further suggest that BiSSL enhances the alignment of downstream features in the backbone prior to fine-tuning.
https://arxiv.org/abs/2410.02387
Mamba, a special case of the State Space Model, is gaining popularity as an alternative to template-based deep learning approaches in medical image analysis. While transformers are powerful architectures, they have drawbacks, including quadratic computational complexity and an inability to address long-range dependencies efficiently. This limitation affects the analysis of large and complex datasets in medical imaging, where there are many spatial and temporal relationships. In contrast, Mamba offers benefits that make it well-suited for medical image analysis. It has linear time complexity, which is a significant improvement over transformers. Mamba processes longer sequences without attention mechanisms, enabling faster inference and requiring less memory. Mamba also demonstrates strong performance in merging multimodal data, improving diagnostic accuracy and patient outcomes. The organization of this paper allows readers to appreciate the capabilities of Mamba in medical imaging step by step. We begin by defining the core concepts of SSMs and related models, including S4, S5, and S6, followed by an exploration of Mamba architectures such as pure Mamba, U-Net variants, and hybrid models with convolutional neural networks, transformers, and Graph Neural Networks. We also cover Mamba optimizations, techniques and adaptations, scanning, datasets, applications, and experimental results, and conclude with the challenges and future directions of Mamba in medical imaging. This review aims to demonstrate the transformative potential of Mamba in overcoming existing barriers within medical imaging while paving the way for innovative advancements in the field. A comprehensive list of the Mamba architectures applied in the medical field and reviewed in this work is available on GitHub.
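Since the review builds on the SSM recurrence, a minimal linear-time scan makes the contrast with quadratic attention concrete (random matrices here; S4/S5 parameterize and discretize A carefully, and Mamba's S6 makes B, C, and the step size input-dependent):

```python
# h_k = A h_{k-1} + B u_k,  y_k = C h_k  -- one pass, O(L) in sequence
# length, with a fixed-size hidden state instead of a growing KV cache.
import numpy as np

def ssm_scan(u: np.ndarray, A: np.ndarray, B: np.ndarray, C: np.ndarray) -> np.ndarray:
    """u: (L, d_in) input sequence -> (L, d_out) outputs."""
    h = np.zeros(A.shape[0])
    ys = []
    for u_k in u:                      # sequential recurrence
        h = A @ h + B @ u_k
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
n, d_in, d_out, L = 8, 4, 4, 1000
A = 0.9 * np.eye(n) + 0.01 * rng.normal(size=(n, n))   # stable-ish dynamics
B = rng.normal(size=(n, d_in)); C = rng.normal(size=(d_out, n))
y = ssm_scan(rng.normal(size=(L, d_in)), A, B, C)
print(y.shape)    # (1000, 4): memory does not grow with context length
```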
https://arxiv.org/abs/2410.02362
Preference Optimization (PO) techniques are currently among the state-of-the-art techniques for fine-tuning large language models (LLMs) on pairwise preference feedback from human annotators. However, in machine translation, this sort of feedback can be difficult to solicit. Additionally, Kreutzer et al. (2018) have shown that, for machine translation, pairwise preferences are less reliable than other forms of human feedback, such as 5-point ratings. We examine post-edits to see if they can be a source of reliable human preferences by construction. In PO, a human annotator is shown sequences $s_1$ and $s_2$ and asked for a preference judgment ($s_1 > s_2$), while for post-editing, editors \emph{create} $s_1$ and know that it should be better than $s_2$. We attempt to use these implicit preferences for PO and show that they help the model move towards post-edit-like hypotheses and away from machine-translation-like hypotheses. Furthermore, we show that the best results are obtained by pre-training the model with supervised fine-tuning (SFT) on post-edits in order to promote post-edit-like hypotheses to the top output ranks.
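Constructing the implicit preference pairs is straightforward; the record layout below matches common PO/DPO data formats but is an assumption, not the paper's exact schema:

```python
# Post-edits yield preference pairs "by construction": the edited
# translation is preferred over the raw MT hypothesis whenever the
# editor actually changed it.
def postedits_to_preferences(records):
    """records: iterable of (source, mt_output, post_edit) triples."""
    pairs = []
    for src, mt, pe in records:
        if pe.strip() == mt.strip():
            continue                      # unedited output: no signal
        pairs.append({"prompt": src,
                      "chosen": pe,       # editor-created s1
                      "rejected": mt})    # known-worse s2
    return pairs

data = [("Der Hund schläft.", "The dog sleep.", "The dog is sleeping."),
        ("Es regnet.", "It is raining.", "It is raining.")]
print(postedits_to_preferences(data))
```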
https://arxiv.org/abs/2410.02320
Second-order optimization methods offer notable advantages in training deep neural networks by utilizing curvature information to achieve faster convergence. However, traditional second-order techniques are computationally prohibitive, primarily due to the large matrix inversions and high memory demands they require. While adaptive trust-region methods have been developed to mitigate these issues, their performance is often hindered by conservative estimates of key parameters, such as the Lipschitz constant of the Hessian, resulting in suboptimal outcomes. In this paper, we introduce SecondOrderAdaptiveAdam (SOAA), a novel optimization algorithm designed to overcome these limitations. SOAA approximates the Fisher information matrix using a diagonal representation, reducing computational complexity from \(O(n^{2})\) to \(O(n)\), thereby making it suitable for large-scale deep learning models, including large language models (LLMs). Additionally, the algorithm integrates an adaptive trust-region mechanism that dynamically adjusts the trust region size based on observed loss reduction, ensuring both robust convergence and computational efficiency. We empirically demonstrate that SOAA achieves faster and more stable convergence compared to first-order optimizers, such as Adam, under similar computational constraints. However, the diagonal approximation of the Fisher information matrix may be less effective in capturing higher-order interactions between gradients, suggesting potential areas for further refinement and future research.
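An illustrative reading of the described update, combining an EMA of squared gradients as the diagonal Fisher approximation (O(n) memory) with a trust region that expands or shrinks based on the observed loss reduction (the coefficients below are assumptions, not the paper's):

```python
import numpy as np

def soaa_like(grad_fn, loss_fn, theta, steps=100, beta=0.99, eps=1e-8):
    fisher = np.ones_like(theta)                  # diagonal Fisher estimate
    radius = 0.1                                  # trust-region size
    loss = loss_fn(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        fisher = beta * fisher + (1 - beta) * g * g   # EMA of squared grads
        step = g / (fisher + eps)                 # diagonal preconditioning
        norm = np.linalg.norm(step)
        if norm > radius:                         # clip to the trust region
            step *= radius / norm
        new_loss = loss_fn(theta - step)
        if new_loss < loss:                       # reduction observed: accept, expand
            theta, loss, radius = theta - step, new_loss, min(radius * 1.1, 1.0)
        else:                                     # step rejected: shrink
            radius *= 0.5
    return theta

# toy quadratic with badly scaled curvature
H = np.diag([100.0, 1.0, 0.01])
loss = lambda th: 0.5 * th @ H @ th
grad = lambda th: H @ th
print(soaa_like(grad, loss, np.ones(3)))
```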
https://arxiv.org/abs/2410.02293
Accurately recommending products has long been a subject requiring in-depth research. This study proposes a multimodal paradigm for clothing recommendations. Specifically, it designs a multimodal analysis method that integrates clothing description texts and images, utilizing a pre-trained large language model to deeply explore the latent semantics of users and products. Additionally, a variational encoder is employed to learn the relationship between user information and products to address the cold start problem in recommendation systems. This study also validates the significant performance advantages of this method over various recommendation system methods through extensive ablation experiments, providing crucial practical guidance for the comprehensive optimization of recommendation systems.
https://arxiv.org/abs/2410.02219
Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. Although supervised pair preference models (PairPM) can express general preferences, their implementation is highly ad-hoc and cannot guarantee a consistent preference probability for compared pairs. Additionally, they impose high computational costs due to their quadratic query complexity when comparing multiple responses. In this paper, we introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. We further propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Experimental results show that our General Preference representation model (GPM) outperforms the BT reward model on the RewardBench benchmark by a margin of up to 5.6% and effectively models cyclic preferences, where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval2.0 and MT-Bench, following language model post-training with GPO and our general preference model, reveal substantial performance improvements with margins up to 9.3%. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at this https URL.
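Why a latent preference representation can capture cycles that no scalar reward can is easy to see in two dimensions with a skew-symmetric score; GPM's actual architecture is richer, but the geometric intuition is this:

```python
# With a skew-symmetric score s(i, j) = v_i^T J v_j we get a
# rock-paper-scissors cycle from three 2-D response embeddings.
import numpy as np

J = np.array([[0.0, -1.0], [1.0, 0.0]])     # skew-symmetric: s(i,j) = -s(j,i)
angles = np.deg2rad([0, 120, 240])
V = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # response embeddings

def prefers(i, j):
    return V[i] @ J @ V[j] > 0              # positive score: i beats j

for i, j in [(0, 1), (1, 2), (2, 0)]:
    print(f"response {i} beats {j}: {prefers(i, j)}")
# All three agree (all True or all False, i.e. a cycle in one direction
# or the other) -- something no scalar Bradley-Terry reward r_i can
# produce, since r_0 > r_1 > r_2 > r_0 is impossible.
```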
https://arxiv.org/abs/2410.02197
Multivariate Time Series (MTS) forecasting is a fundamental task with numerous real-world applications, such as transportation, climate, and epidemiology. While a myriad of powerful deep learning models have been developed for this task, few works have explored the robustness of MTS forecasting models to malicious attacks, which is crucial for their trustworthy employment in high-stakes scenarios. To address this gap, we dive deep into backdoor attacks on MTS forecasting models and propose an effective attack method named BackTime. By subtly injecting a few stealthy triggers into the MTS data, BackTime can alter the predictions of the forecasting model according to the attacker's intent. Specifically, BackTime first identifies vulnerable timestamps in the data for poisoning, and then adaptively synthesizes stealthy and effective triggers by solving a bi-level optimization problem with a GNN-based trigger generator. Extensive experiments across multiple datasets and state-of-the-art MTS forecasting models demonstrate the effectiveness, versatility, and stealthiness of BackTime attacks. The code is available at this https URL.
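Schematically (and purely as a defensive illustration of the attack surface), the poisoning looks as follows; BackTime selects timestamps by vulnerability and learns the trigger with a GNN-based generator, whereas both are fixed toys here:

```python
# A small trigger pattern is added at chosen timestamps of the training
# series, and the horizon that follows is overwritten with the
# attacker's target pattern.
import numpy as np

rng = np.random.default_rng(0)
series = rng.normal(size=(500, 3))               # (time, variables)
trigger = 0.3 * np.sin(np.linspace(0, np.pi, 8))[:, None]   # small bump
target = np.full((8, 3), 2.0)                    # attacker-desired forecast

def poison(x, timestamps, trigger, target):
    x = x.copy()
    for t in timestamps:
        x[t : t + len(trigger)] += trigger       # stealthy input trigger
        x[t + len(trigger) : t + len(trigger) + len(target)] = target
    return x

poisoned = poison(series, timestamps=[50, 200, 350], trigger=trigger, target=target)
print("fraction of cells changed:", np.mean(poisoned != series).round(3))
```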
https://arxiv.org/abs/2410.02195
Chain-of-Thought (CoT) is an efficient prompting method that enables the reasoning ability of large language models by augmenting the query using multiple examples with multiple intermediate steps. Despite the empirical success, the theoretical understanding of how to train a Transformer to achieve the CoT ability remains less explored. This is primarily due to the technical challenges involved in analyzing the nonconvex optimization of nonlinear attention models. To the best of our knowledge, this work provides the first theoretical study of training Transformers with nonlinear attention to obtain the CoT generalization capability, so that the resulting model can perform inference on unseen tasks when the input is augmented by examples of the new task. We first quantify the training samples and iterations required to train a Transformer model towards CoT ability. We then prove the success of its CoT generalization on unseen tasks with distribution-shifted testing data. Moreover, we theoretically characterize the conditions for an accurate reasoning output by CoT even when the provided reasoning examples contain noise and are not always accurate. In contrast, in-context learning (ICL), which can be viewed as one-step CoT without intermediate steps, may fail to provide an accurate output when CoT succeeds. These theoretical findings are justified through experiments.
https://arxiv.org/abs/2410.02167
Open-ended short-answer questions (SAGs) have been widely recognized as a powerful tool for providing deeper insights into learners' responses in the context of learning analytics (LA). However, SAGs often present challenges in practice due to the high grading workload and concerns about inconsistent assessments. With recent advancements in natural language processing (NLP), automatic short-answer grading (ASAG) offers a promising solution to these challenges. Despite this, current ASAG algorithms are often limited in generalizability and tend to be tailored to specific questions. In this paper, we propose a unified multi-agent ASAG framework, GradeOpt, which leverages large language models (LLMs) as graders for SAGs. More importantly, GradeOpt incorporates two additional LLM-based agents - the reflector and the refiner - into the multi-agent system. This enables GradeOpt to automatically optimize the original grading guidelines by performing self-reflection on its errors. Through experiments on a challenging ASAG task, namely the grading of pedagogical content knowledge (PCK) and content knowledge (CK) questions, GradeOpt demonstrates superior performance in grading accuracy and behavior alignment with human graders compared to representative baselines. Finally, comprehensive ablation studies confirm the effectiveness of the individual components designed in GradeOpt.
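A control-flow sketch of the three-agent loop; `llm` is a hypothetical stand-in for a chat-model client, and the prompts, error batching, and stopping rule are simplified relative to GradeOpt's:

```python
def llm(prompt: str) -> str:
    return "STUB"                      # replace with a real LLM client

def gradeopt_round(guidelines, labeled_batch):
    errors = []
    for answer, human_score in labeled_batch:
        # grader: score the answer under the current guidelines
        pred = llm(f"Guidelines:\n{guidelines}\n\nAnswer:\n{answer}\n\nScore:")
        if pred != human_score:
            errors.append((answer, human_score, pred))
    if not errors:
        return guidelines, 1.0
    # reflector: explain the disagreements with the human grader
    reflection = llm(f"Guidelines:\n{guidelines}\n\nMisgraded cases:\n{errors}\n"
                     "Why did the grading go wrong?")
    # refiner: rewrite the guidelines using the reflection
    guidelines = llm("Rewrite these guidelines to fix the issues below.\n"
                     f"Guidelines:\n{guidelines}\n\nIssues:\n{reflection}")
    return guidelines, 1 - len(errors) / len(labeled_batch)

guidelines = "Award 1 point if the answer names the misconception."
guidelines, acc = gradeopt_round(guidelines, [("The student ...", "1")])
print(acc)
```

Rounds repeat on fresh labeled batches until grading accuracy plateaus, at which point the optimized guidelines are frozen.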
https://arxiv.org/abs/2410.02165
This paper presents an approach for navigation and control in unmapped environments under input and state constraints using a composite control barrier function (CBF). We consider the scenario where real-time perception feedback (e.g., LiDAR) is used online to construct a local CBF that models local state constraints (e.g., local safety constraints such as obstacles) in the a priori unmapped environment. The approach employs a soft-maximum function to synthesize a single time-varying CBF from the N most recently obtained local CBFs. Next, the input constraints are transformed into controller-state constraints through the use of control dynamics. Then, we use a soft-minimum function to compose the input constraints with the time-varying CBF that models the a priori unmapped environment. This composition yields a single relaxed CBF, which is used in a constrained optimization to obtain an optimal control that satisfies the state and input constraints. The approach is validated through simulations of a nonholonomic ground robot that is equipped with LiDAR and navigates an unmapped environment. The robot successfully navigates the environment while avoiding the a priori unmapped obstacles and satisfying both speed and input constraints.
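The two smooth compositions at the heart of the approach are easy to state numerically (the sharpness rho and the sample values below are illustrative assumptions):

```python
# Soft maximum fuses the N most recent local CBFs into one time-varying
# environment CBF; soft minimum then composes that CBF with the
# controller-state (input) constraints. Both are smooth, which is what
# lets them enter a single relaxed CBF in the optimization.
import numpy as np

def soft_max(h, rho=10.0):
    # smooth over-approximation of max(h); -> max as rho -> infinity
    return np.log(np.sum(np.exp(rho * np.asarray(h)))) / rho

def soft_min(h, rho=10.0):
    # smooth under-approximation of min(h)
    return -soft_max(-np.asarray(h), rho)

# N local CBFs from recent LiDAR scans, evaluated at the current state:
h_local = [0.8, 1.5, 0.3]            # positive = safe w.r.t. each scan
h_env = soft_max(h_local)            # single time-varying environment CBF
h_input = [0.6, 1.2]                 # controller-state (input) constraints
h_composite = soft_min([h_env] + h_input)
print(f"h_env = {h_env:.3f}, composite = {h_composite:.3f}")
```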
https://arxiv.org/abs/2410.02106