Personalized outfit recommendation remains a complex challenge, demanding both fashion compatibility understanding and trend awareness. This paper presents a novel framework that harnesses the expressive power of large language models (LLMs) for this task, mitigating their "black box" and static nature through fine-tuning and direct feedback integration. We bridge the visual-textual gap in item descriptions by employing image captioning with a Multimodal Large Language Model (MLLM). This enables the LLM to extract style and color characteristics from human-curated fashion images, forming the basis for personalized recommendations. The LLM is efficiently fine-tuned on the open-source Polyvore dataset of curated fashion images, optimizing its ability to recommend stylish outfits. A direct preference mechanism using negative examples is employed to enhance the LLM's decision-making process. This creates a self-enhancing AI feedback loop that continuously refines recommendations in line with seasonal fashion trends. Our framework is evaluated on the Polyvore dataset, demonstrating its effectiveness in two key tasks: fill-in-the-blank and complementary item retrieval. These evaluations underline the framework's ability to generate stylish, trend-aligned outfit suggestions that continuously improve through direct feedback. The evaluation results demonstrate that our proposed framework significantly outperforms the base LLM, creating more cohesive outfits. The improved performance in these tasks underscores the proposed framework's potential to enhance the shopping experience with accurate suggestions, proving its effectiveness over vanilla LLM-based outfit generation.
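The "direct preference mechanism using negative examples" is the kind of objective that direct preference optimization (DPO) formalizes. Below is a minimal sketch of such a preference loss, assuming summed log-probabilities of a preferred (human-curated) and a rejected (negative) outfit description from the fine-tuned policy and a frozen reference model; the tensor values and the beta setting are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style preference loss: push the policy toward the curated outfit
    (chosen) and away from the negative example (rejected), measured relative
    to a frozen reference model."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Toy example: summed log-probabilities of two preference pairs.
policy_chosen = torch.tensor([-12.3, -9.8])
policy_rejected = torch.tensor([-11.9, -10.2])
ref_chosen = torch.tensor([-13.0, -10.5])
ref_rejected = torch.tensor([-11.5, -10.0])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```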
https://arxiv.org/abs/2409.12150
Large Language Models' (LLM) reasoning can be improved using test-time aggregation strategies, i.e., generating multiple samples and voting among generated samples. While these improve performance, they often reach a saturation point. Refinement offers an alternative by using LLM-generated feedback to improve solution quality. However, refinement introduces 3 key challenges: (1) Excessive refinement: Uniformly refining all instances can over-correct and reduce the overall performance. (2) Inability to localize and address errors: LLMs have a limited ability to self-correct and struggle to identify and correct their own mistakes. (3) Insufficient refinement: Deciding how many iterations of refinement are needed is non-trivial, and stopping too soon could leave errors unaddressed. To tackle these issues, we propose MAgICoRe, which avoids excessive refinement by categorizing problem difficulty as easy or hard, solving easy problems with coarse-grained aggregation and hard ones with fine-grained and iterative multi-agent refinement. To improve error localization, we incorporate external step-wise reward model (RM) scores. Moreover, to ensure effective refinement, we employ a multi-agent loop with three agents: Solver, Reviewer (which generates targeted feedback based on step-wise RM scores), and the Refiner (which incorporates feedback). To ensure sufficient refinement, we re-evaluate updated solutions, iteratively initiating further rounds of refinement. We evaluate MAgICoRe on Llama-3-8B and GPT-3.5 and show its effectiveness across 5 math datasets. Even one iteration of MAgICoRe beats Self-Consistency by 3.4%, Best-of-k by 3.2%, and Self-Refine by 4.0% while using less than half the samples. Unlike iterative refinement with baselines, MAgICoRe continues to improve with more iterations. Finally, our ablations highlight the importance of MAgICoRe's RMs and multi-agent communication.
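As a rough illustration of the control flow described above (not the authors' implementation), the sketch below routes problems by an assumed difficulty signal and runs a Reviewer-then-Refiner loop only on hard instances; `generate`, `score_steps`, and `refine` are hypothetical stand-ins for the LLM and step-wise reward-model calls, and the difficulty proxy is a simplification.

```python
from collections import Counter

def solve(problem, generate, score_steps, refine, k=8, max_rounds=3, easy_thresh=0.9):
    """Difficulty-aware refinement in the spirit of MAgICoRe: coarse aggregation
    for easy problems, targeted multi-round refinement for hard ones.
    Assumes short string answers and per-step RM scores in [0, 1]."""
    samples = [generate(problem) for _ in range(k)]            # Solver: k candidate solutions
    scores = [score_steps(problem, s) for s in samples]        # step-wise RM scores per solution
    best_idx = max(range(k), key=lambda i: min(scores[i]))
    if min(scores[best_idx]) >= easy_thresh:                   # easy: coarse majority vote
        return Counter(samples).most_common(1)[0][0]
    solution = samples[best_idx]                               # hard: iterative multi-agent refinement
    for _ in range(max_rounds):
        step_scores = score_steps(problem, solution)
        if min(step_scores) >= easy_thresh:                    # sufficient refinement reached
            break
        weakest_step = step_scores.index(min(step_scores))     # Reviewer: localize the weakest step
        solution = refine(problem, solution, weakest_step)     # Refiner: incorporate targeted feedback
    return solution
```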
https://arxiv.org/abs/2409.12147
We introduce MoRAG, a novel multi-part fusion based retrieval-augmented generation strategy for text-based human motion generation. The method enhances motion diffusion models by leveraging additional knowledge obtained through an improved motion retrieval process. By effectively prompting large language models (LLMs), we address spelling errors and rephrasing issues in motion retrieval. Our approach utilizes a multi-part retrieval strategy to improve the generalizability of motion retrieval across the language space. We create diverse samples through the spatial composition of the retrieved motions. Furthermore, by utilizing low-level, part-specific motion information, we can construct motion samples for unseen text descriptions. Our experiments demonstrate that our framework can serve as a plug-and-play module, improving the performance of motion diffusion models. Code, pretrained models and sample videos will be made available at: this https URL
https://arxiv.org/abs/2409.12140
With the advent of the big data and large language model era, zero-shot personalized rapid customization has emerged as a significant trend. In this report, we introduce Takin AudioLLM, a series of techniques and models, mainly including Takin TTS, Takin VC, and Takin Morphing, specifically designed for audiobook production. These models are capable of zero-shot speech production, generating high-quality speech that is nearly indistinguishable from real human speech and enabling individuals to customize the speech content according to their own needs. Specifically, we first introduce Takin TTS, a neural codec language model that builds upon an enhanced neural speech codec and a multi-task training framework, capable of generating high-fidelity natural speech in a zero-shot way. For Takin VC, we advocate an effective joint content and timbre modeling approach to improve speaker similarity, together with a conditional flow matching based decoder to further enhance its naturalness and expressiveness. Finally, we propose the Takin Morphing system with highly decoupled and advanced timbre and prosody modeling approaches, which enables individuals to customize speech production with their preferred timbre and prosody in a precise and controllable manner. Extensive experiments validate the effectiveness and robustness of our Takin AudioLLM series models. For detailed demos, please refer to this https URL.
https://arxiv.org/abs/2409.12139
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, which are the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16$\times$3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluations across diverse tasks demonstrate the potential of GRIN to significantly enhance MoE efficacy, achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.
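For intuition on why discrete top-2 routing blocks gradients, and how a sparse gradient estimator can restore them, here is a generic straight-through-style router surrogate in PyTorch. This is a common illustration of the problem and one standard workaround, not the SparseMixer-style estimator GRIN actually uses.

```python
import torch
import torch.nn.functional as F

def top2_route(logits):
    """Generic straight-through top-2 routing surrogate.
    Forward pass applies hard top-2 gates; backward pass lets gradients flow
    through the softmax, so the router logits still receive a learning signal
    despite the discrete expert selection."""
    probs = F.softmax(logits, dim=-1)
    top_vals, top_idx = probs.topk(2, dim=-1)
    hard = torch.zeros_like(probs).scatter(-1, top_idx, top_vals).detach()
    # Straight-through trick: value of `hard` in forward, gradient of `probs` in backward.
    return hard + probs - probs.detach()

logits = torch.randn(4, 16, requires_grad=True)   # 4 tokens routed over 16 experts
gates = top2_route(logits)
gates.sum().backward()
print(gates.shape, "gradient reaches router:", logits.grad.abs().sum().item() > 0)
```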
https://arxiv.org/abs/2409.12136
Temporal difference (TD) learning with linear function approximation, abbreviated as linear TD, is a classic and powerful prediction algorithm in reinforcement learning. While it is well understood that linear TD converges almost surely to a unique point, this convergence traditionally requires the assumption that the features used by the approximator are linearly independent. However, this linear independence assumption does not hold in many practical scenarios. This work is the first to establish the almost sure convergence of linear TD without requiring linearly independent features. In fact, we do not make any assumptions on the features. We prove that the approximated value function converges to a unique point and the weight iterates converge to a set. We also establish a notion of local stability of the weight iterates. Importantly, we do not need to introduce any other additional assumptions and do not need to make any modification to the linear TD algorithm. Key to our analysis is a novel characterization of bounded invariant sets of the mean ODE of linear TD.
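For readers unfamiliar with the setting, the linear TD(0) update analyzed here has the standard form below. The tiny Markov chain and the deliberately redundant (linearly dependent) feature set are invented purely to illustrate the paper's point that the update still behaves well without linearly independent features; nothing else about the example comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# 3-state Markov chain with a redundant feature set: the third feature is the
# sum of the first two, so the feature matrix has rank 2 (linearly dependent).
features = np.array([[1.0, 0.0, 1.0],
                     [0.0, 1.0, 1.0],
                     [1.0, 1.0, 2.0]])
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.3, 0.3, 0.4]])
rewards = np.array([0.0, 1.0, -1.0])   # reward received on leaving each state
gamma, alpha = 0.9, 0.01

w = np.zeros(3)
s = 0
for _ in range(50_000):
    s_next = rng.choice(3, p=P[s])
    # Linear TD(0): w <- w + alpha * (r + gamma * w^T phi(s') - w^T phi(s)) * phi(s)
    td_error = rewards[s] + gamma * features[s_next] @ w - features[s] @ w
    w += alpha * td_error * features[s]
    s = s_next

print("approximated values:", features @ w)   # the value estimates settle even with dependent features
```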
https://arxiv.org/abs/2409.12135
In tackling the challenge of Multi-Document Summarization (MDS), numerous methods have been proposed, spanning both extractive and abstractive summarization techniques. However, each approach has its own limitations, making it less effective to rely solely on either one. An emerging and promising strategy involves a synergistic fusion of extractive and abstractive summarization methods. Despite the plethora of studies in this domain, research on the combined methodology remains scarce, particularly in the context of Vietnamese language processing. This paper presents a novel Vietnamese MDS framework leveraging a two-component pipeline architecture that integrates extractive and abstractive techniques. The first component employs an extractive approach to identify key sentences within each document. This is achieved by a modification of the pre-trained BERT network, which derives semantically meaningful phrase embeddings using Siamese and triplet network structures. The second component utilizes the VBD-LLaMA2-7B-50b model for abstractive summarization, ultimately generating the final summary document. Our proposed framework demonstrates strong performance, attaining a ROUGE-2 score of 39.6% on the VN-MDS dataset and outperforming state-of-the-art baselines.
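A rough sketch of such an extract-then-abstract pipeline is shown below, assuming a sentence-level multilingual encoder and any instruction-tuned summarizer behind a hypothetical `generate` callable. The encoder name, the centroid-similarity selection heuristic, and the period-based sentence splitting are illustrative simplifications, not the paper's exact configuration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def extract_key_sentences(documents, top_k=3):
    """Stage 1 (extractive): keep the sentences closest to each document's
    centroid embedding, as a stand-in for the modified BERT sentence scorer."""
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    selected = []
    for doc in documents:
        sentences = [s.strip() for s in doc.split(".") if s.strip()]
        emb = encoder.encode(sentences, normalize_embeddings=True)
        centroid = emb.mean(axis=0)
        scores = emb @ centroid
        top = np.argsort(-scores)[:top_k]
        selected.extend(sentences[i] for i in sorted(top))
    return selected

def abstractive_summary(key_sentences, generate):
    """Stage 2 (abstractive): hand the extracted sentences to an LLM summarizer
    (the abstract names VBD-LLaMA2-7B-50b), abstracted here as `generate`."""
    prompt = ("Summarize the following sentences into one coherent paragraph:\n"
              + "\n".join(f"- {s}" for s in key_sentences))
    return generate(prompt)
```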
https://arxiv.org/abs/2409.12134
We propose a new benchmark to measure a language model's linguistic reasoning skills without relying on pre-existing language-specific knowledge. The test covers 894 questions grouped into 160 problems across 75 (mostly) extremely low-resource languages, extracted from the International Linguistic Olympiad corpus. To attain high accuracy on this benchmark, models do not need prior knowledge of the tested language, as all the information needed to solve the linguistic puzzle is presented in the context. We find that, while all analyzed models score below 25% accuracy, there is a significant gap between open and closed models, with the best-performing proprietary model at 24.05% and the best-performing open model at 8.84%.
https://arxiv.org/abs/2409.12126
Visual search is a fundamental natural task for humans and other animals. We investigated the decision processes humans use when searching briefly presented displays having well-separated potential target-object locations. Performance was compared with the Bayesian-optimal decision process under the assumption that the information from the different potential target locations is statistically independent. Surprisingly, humans performed slightly better than optimal, despite humans' substantial loss of sensitivity in the fovea, and the implausibility of the human brain replicating the optimal computations. We show that three factors can quantitatively explain these seemingly paradoxical results. Most importantly, simple and fixed heuristic decision rules reach near optimal search performance. Secondly, foveal neglect primarily affects only the central potential target location. Finally, spatially correlated neural noise causes search performance to exceed that predicted for independent noise. These findings have far-reaching implications for understanding visual search tasks and other identification tasks in humans and other animals.
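The "simple and fixed heuristic decision rules" mentioned above can be made concrete with a max-response rule: report the location whose noisy response is largest. The toy simulation below (Gaussian responses, equal sensitivity at every location, independent noise) is invented for illustration and is not the authors' experimental model; under these equal-sensitivity, equal-prior assumptions the max rule actually coincides with the Bayesian-optimal localization decision.

```python
import numpy as np

rng = np.random.default_rng(1)

def max_rule_accuracy(n_locations=8, d_prime=1.5, trials=20_000):
    """Localization accuracy of the fixed max-response heuristic: on each trial
    a signal of strength d' is added at one of n well-separated locations, and
    the observer responds with the location of the largest noisy response."""
    target = rng.integers(n_locations, size=trials)
    responses = rng.standard_normal((trials, n_locations))
    responses[np.arange(trials), target] += d_prime   # signal at the target location
    return float(np.mean(responses.argmax(axis=1) == target))

print("max-rule localization accuracy:", max_rule_accuracy())
```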
https://arxiv.org/abs/2409.12124
In this report, we present a series of math-specific large language models: Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. The core innovation of the Qwen2.5 series lies in integrating the philosophy of self-improvement throughout the entire pipeline, from pre-training and post-training to inference: (1) During the pre-training phase, Qwen2-Math-Instruct is utilized to generate large-scale, high-quality mathematical data. (2) In the post-training phase, we develop a reward model (RM) by conducting massive sampling from Qwen2-Math-Instruct. This RM is then applied to the iterative evolution of data in supervised fine-tuning (SFT). With a stronger SFT model, it is possible to iteratively train and update the RM, which in turn guides the next round of SFT data iteration. On the final SFT model, we employ the final RM for reinforcement learning, resulting in Qwen2.5-Math-Instruct. (3) Furthermore, during the inference stage, the RM is used to guide sampling, optimizing the model's performance. Qwen2.5-Math-Instruct supports both Chinese and English, and possesses advanced mathematical reasoning capabilities, including Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR). We evaluate our models on 10 mathematics datasets in both English and Chinese, such as GSM8K, MATH, GaoKao, AMC23, and AIME24, covering a range of difficulties from grade school level to math competition problems.
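The simplest form of RM-guided sampling at inference time is best-of-N selection: draw several candidate solutions and keep the one the reward model scores highest. A minimal sketch is below, with `generate_solution` and `reward_model` as hypothetical stand-ins for the model's sampler and its RM; the actual guidance strategy may be more elaborate.

```python
def best_of_n(problem, generate_solution, reward_model, n=8):
    """RM-guided sampling sketch: sample n candidate solutions and return the
    candidate with the highest reward-model score."""
    candidates = [generate_solution(problem) for _ in range(n)]
    scores = [reward_model(problem, c) for c in candidates]
    return max(zip(scores, candidates), key=lambda pair: pair[0])[1]
```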
https://arxiv.org/abs/2409.12122
Recent advances in speech spoofing necessitate stronger verification mechanisms in neural speech codecs to ensure authenticity. Current methods embed numerical watermarks before compression and extract them from reconstructed speech for verification, but face limitations such as separate training processes for the watermark and codec, and insufficient cross-modal information integration, leading to reduced watermark imperceptibility, extraction accuracy, and capacity. To address these issues, we propose WMCodec, the first neural speech codec to jointly train compression-reconstruction and watermark embedding-extraction in an end-to-end manner, optimizing both imperceptibility and extractability of the watermark. Furthermore, we design an iterative Attention Imprint Unit (AIU) for deeper feature integration of watermark and speech, reducing the impact of quantization noise on the watermark. Experimental results show WMCodec outperforms AudioSeal with Encodec in most quality metrics for watermark imperceptibility and consistently exceeds both AudioSeal with Encodec and reinforced TraceableSpeech in extraction accuracy of watermark. At a bandwidth of 6 kbps with a watermark capacity of 16 bps, WMCodec maintains over 99% extraction accuracy under common attacks, demonstrating strong robustness.
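Joint training of compression-reconstruction and watermark embedding-extraction amounts to optimizing a combined objective. The sketch below is a heavily simplified stand-in (a reconstruction term plus a bit-recovery term); WMCodec's actual training also involves adversarial and quantization losses and the AIU architecture, none of which are reproduced here, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def joint_codec_watermark_loss(speech, reconstructed, wm_bits, wm_logits, lambda_wm=1.0):
    """End-to-end objective combining waveform reconstruction with watermark
    recovery, illustrating the joint training idea in the abstract."""
    recon = F.l1_loss(reconstructed, speech)                          # compression-reconstruction term
    wm = F.binary_cross_entropy_with_logits(wm_logits, wm_bits)       # embedding-extraction term
    return recon + lambda_wm * wm

# Toy tensors standing in for codec outputs and the watermark decoder's logits.
speech = torch.randn(2, 16_000)
reconstructed = speech + 0.01 * torch.randn_like(speech)
wm_bits = torch.randint(0, 2, (2, 16)).float()
wm_logits = torch.randn(2, 16)
print(joint_codec_watermark_loss(speech, reconstructed, wm_bits, wm_logits))
```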
https://arxiv.org/abs/2409.12121
Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modeling techniques to audio data. However, audio codecs often operate at high frame rates, resulting in slow training and inference, especially for autoregressive models. To address this challenge, we present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps bitrate and 21.5 frames per second. We demonstrate that our novel codec can make the inference of LLM-based text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
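Finite scalar quantization, one ingredient named above, can be illustrated in a few lines: each latent channel is squashed into a fixed range, rounded to a small set of levels, and gradients are passed straight through the rounding. This is a generic FSQ sketch; LFSC's actual level counts, codec architecture, and adversarial training are not shown.

```python
import torch

def finite_scalar_quantize(z, levels=7):
    """Minimal finite scalar quantization: bound each latent channel to
    (-(levels-1)/2, (levels-1)/2), round to integer levels, and use a
    straight-through estimator so training gradients still flow."""
    half = (levels - 1) / 2.0
    bounded = torch.tanh(z) * half
    quantized = torch.round(bounded)
    return bounded + (quantized - bounded).detach()   # forward: quantized, backward: bounded

z = torch.randn(2, 8, 50, requires_grad=True)         # (batch, channels, frames)
q = finite_scalar_quantize(z)
q.sum().backward()
print("distinct levels used:", q.detach().unique().numel(), "| grad flows:", z.grad is not None)
```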
https://arxiv.org/abs/2409.12117
Teams of mobile [aerial, ground, or aquatic] robots have applications in resource delivery, patrolling, information-gathering, agriculture, forest fire fighting, chemical plume source localization and mapping, and search-and-rescue. Robot teams traversing hazardous environments -- with e.g. rough terrain or seas, strong winds, or adversaries capable of attacking or capturing robots -- should plan and coordinate their trails in consideration of risks of disablement, destruction, or capture. Specifically, the robots should take the safest trails, coordinate their trails to cooperatively achieve the team-level objective with robustness to robot failures, and balance the reward from visiting locations against risks of robot losses. Herein, we consider bi-objective trail-planning for a mobile team of robots orienteering in a hazardous environment. The hazardous environment is abstracted as a directed graph whose arcs, when traversed by a robot, present known probabilities of survival. Each node of the graph offers a reward to the team if visited by a robot (which e.g. delivers a good to or images the node). We wish to search for the Pareto-optimal robot-team trail plans that maximize two [conflicting] team objectives: the expected (i) team reward and (ii) number of robots that survive the mission. A human decision-maker can then select trail plans that balance, according to their values, reward and robot survival. We implement ant colony optimization, guided by heuristics, to search for the Pareto-optimal set of robot team trail plans. As a case study, we illustrate with an information-gathering mission in an art museum.
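To make the two objectives concrete, the toy calculation below evaluates one fixed candidate plan on a tiny hand-made graph: a node's reward counts if at least one robot is expected to reach it, and a robot survives the mission only if it survives every arc on its trail. The graph, survival probabilities, and rewards are invented for illustration, and the ant colony search over plans is not shown.

```python
import numpy as np

# Arc survival probabilities and node rewards (illustrative numbers only).
survival = {("start", "a"): 0.9, ("a", "b"): 0.8, ("start", "c"): 0.7, ("c", "b"): 0.95}
node_reward = {"a": 2.0, "b": 5.0, "c": 1.0}

def reach_probs(trail):
    """Probability that a robot survives up to each node along its trail."""
    p, probs = 1.0, {}
    for u, v in zip(trail, trail[1:]):
        p *= survival[(u, v)]
        probs[v] = p
    return probs

def trail_survival(trail):
    """Probability the robot survives its entire trail (the mission)."""
    p = 1.0
    for u, v in zip(trail, trail[1:]):
        p *= survival[(u, v)]
    return p

def evaluate(team_trails):
    """The two conflicting team objectives for a candidate plan:
    expected collected reward and expected number of surviving robots."""
    reach = [reach_probs(t) for t in team_trails]
    exp_reward = sum(r * (1 - np.prod([1 - rp.get(v, 0.0) for rp in reach]))
                     for v, r in node_reward.items())
    exp_survivors = sum(trail_survival(t) for t in team_trails)
    return exp_reward, exp_survivors

print(evaluate([["start", "a", "b"], ["start", "c", "b"]]))
```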
https://arxiv.org/abs/2409.12114
This paper introduces the Pareto Data Framework, an approach for identifying and selecting the Minimum Viable Data (MVD) required for enabling machine learning applications on constrained platforms such as embedded systems, mobile devices, and Internet of Things (IoT) devices. We demonstrate that strategic data reduction can maintain high performance while significantly reducing bandwidth, energy, computation, and storage costs. The framework identifies the MVD needed to optimize efficiency across resource-constrained environments without sacrificing performance. It addresses common inefficient practices in IoT applications, such as sensor overprovisioning, overprecision, and oversampling of signals, proposing scalable solutions for optimal sensor selection, signal extraction and transmission, and data representation. An experimental methodology demonstrates effective acoustic data characterization after downsampling, quantization, and truncation to simulate reduced-fidelity sensors and network and storage constraints; results show that performance can be maintained at up to 95\% with sample rates reduced by 75\% and bit depths and clip lengths reduced by 50\%, which translates into substantial cost and resource reductions. These findings have implications for the design and development of constrained systems. The paper also discusses broader implications of the framework, including the potential to democratize advanced AI technologies across IoT applications and sectors such as agriculture, transportation, and manufacturing to improve access and multiply the benefits of data-driven insights.
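A reduced-fidelity sensor of the kind described can be simulated in a few lines: truncate the clip, decimate the sample rate, and re-quantize to a coarser bit depth. The sketch below uses naive decimation (no anti-alias filter) on a normalized mono signal; the function and parameter names are illustrative, with defaults chosen to echo the 75% rate and 50% depth-and-length reductions mentioned above.

```python
import numpy as np

def reduce_fidelity(signal, sample_rate, rate_factor=0.25, bits=8, keep_fraction=0.5):
    """Simulate a reduced-fidelity acoustic sensor: truncate the clip,
    downsample by naive decimation, and re-quantize to `bits` of depth."""
    clipped = signal[: int(len(signal) * keep_fraction)]     # truncate clip length
    step = int(round(1 / rate_factor))
    downsampled = clipped[::step]                            # decimate (no anti-alias filter)
    levels = 2 ** bits
    quantized = np.round((downsampled + 1.0) / 2.0 * (levels - 1))
    quantized = quantized / (levels - 1) * 2.0 - 1.0         # back to [-1, 1] at reduced precision
    return quantized, sample_rate * rate_factor

t = np.linspace(0, 1, 16_000, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t)                          # 1 s of a 440 Hz tone at 16 kHz
reduced, new_rate = reduce_fidelity(audio, 16_000)
print(len(audio), "->", len(reduced), "samples at", new_rate, "Hz")
```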
https://arxiv.org/abs/2409.12112
With the ever-growing complexity of models in the field of remote sensing (RS), there is an increasing demand for solutions that balance model accuracy with computational efficiency. Knowledge distillation (KD) has emerged as a powerful tool to meet this need, enabling the transfer of knowledge from large, complex models to smaller, more efficient ones without significant loss in performance. This review article provides an extensive examination of KD and its innovative applications in RS. KD, a technique developed to transfer knowledge from a complex, often cumbersome model (teacher) to a more compact and efficient model (student), has seen significant evolution and application across various domains. Initially, we introduce the fundamental concepts and historical progression of KD methods. The advantages of employing KD are highlighted, particularly in terms of model compression, enhanced computational efficiency, and improved performance, which are pivotal for practical deployments in RS scenarios. The article provides a comprehensive taxonomy of KD techniques, where each category is critically analyzed to demonstrate the breadth and depth of the alternative options, and illustrates specific case studies that showcase the practical implementation of KD methods in RS tasks, such as instance segmentation and object detection. Further, the review discusses the challenges and limitations of KD in RS, including practical constraints and prospective future directions, providing a comprehensive overview for researchers and practitioners in the field of RS. Through this organization, the paper not only elucidates the current state of research in KD but also sets the stage for future research opportunities, thereby contributing significantly to both academic research and real-world applications.
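As background for the surveyed techniques, the classic response-based distillation objective (Hinton et al.) blends the usual cross-entropy on ground-truth labels with a temperature-softened KL term matching the student to the teacher. The sketch below is that generic formulation, not any single RS-specific method from the review; the logits and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Response-based knowledge distillation: cross-entropy on hard labels plus
    a KL term that matches the student's softened predictions to the teacher's."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return alpha * ce + (1 - alpha) * kd

student = torch.randn(8, 10)        # e.g., logits from a compact RS scene classifier
teacher = torch.randn(8, 10)        # logits from the large teacher model
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student, teacher, labels))
```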
https://arxiv.org/abs/2409.12111
Endoscopic Submucosal Dissection (ESD) is a minimally invasive procedure initially designed for the treatment of early gastric cancer but is now widely used for various gastrointestinal lesions. Computer-assisted Surgery systems have played a crucial role in improving the precision and safety of ESD procedures, however, their effectiveness is limited by the accurate recognition of surgical phases. The intricate nature of ESD, with different lesion characteristics and tissue structures, presents challenges for real-time surgical phase recognition algorithms. Existing surgical phase recognition algorithms struggle to efficiently capture temporal contexts in video-based scenarios, leading to insufficient performance. To address these issues, we propose SPRMamba, a novel Mamba-based framework for ESD surgical phase recognition. SPRMamba leverages the strengths of Mamba for long-term temporal modeling while introducing the Scaled Residual TranMamba block to enhance the capture of fine-grained details, overcoming the limitations of traditional temporal models like Temporal Convolutional Networks and Transformers. Moreover, a Temporal Sample Strategy is introduced to accelerate the processing, which is essential for real-time phase recognition in clinical settings. Extensive testing on the ESD385 dataset and the cholecystectomy Cholec80 dataset demonstrates that SPRMamba surpasses existing state-of-the-art methods and exhibits greater robustness across various surgical phase recognition tasks.
https://arxiv.org/abs/2409.12108
Human values and their measurement are a long-standing interdisciplinary inquiry. Recent advances in AI have sparked renewed interest in this area, with large language models (LLMs) emerging as both tools and subjects of value measurement. This work introduces Generative Psychometrics for Values (GPV), an LLM-based, data-driven value measurement paradigm, theoretically grounded in text-revealed selective perceptions. We begin by fine-tuning an LLM for accurate perception-level value measurement and verifying the capability of LLMs to parse texts into perceptions, forming the core of the GPV pipeline. Applying GPV to human-authored blogs, we demonstrate its stability, validity, and superiority over prior psychological tools. Then, extending GPV to LLM value measurement, we advance the current art with 1) a psychometric methodology that measures LLM values based on their scalable and free-form outputs, enabling context-specific measurement; 2) a comparative analysis of measurement paradigms, indicating response biases of prior methods; and 3) an attempt to bridge LLM values and their safety, revealing the predictive power of different value systems and the impacts of various values on LLM safety. Through interdisciplinary efforts, we aim to leverage AI for next-generation psychometrics and psychometrics for value-aligned AI.
https://arxiv.org/abs/2409.12106
Understanding how humans process visual information is one of the crucial steps for unraveling the underlying mechanism of brain activity. Recently, this curiosity has motivated the fMRI-to-image reconstruction task; given the fMRI data from visual stimuli, it aims to reconstruct the corresponding visual stimuli. Surprisingly, leveraging powerful generative models such as the Latent Diffusion Model (LDM) has shown promising results in reconstructing complex visual stimuli such as high-resolution natural images from vision datasets. Despite the impressive structural fidelity of these reconstructions, they often lack details of small objects, ambiguous shapes, and semantic nuances. Consequently, the incorporation of additional semantic knowledge, beyond mere visuals, becomes imperative. In light of this, we exploit how modern LDMs effectively incorporate multi-modal guidance (text guidance, visual guidance, and image layout) for structurally and semantically plausible image generations. Specifically, inspired by the two-streams hypothesis suggesting that perceptual and semantic information are processed in different brain regions, our framework, Brain-Streams, maps fMRI signals from these brain regions to appropriate embeddings. That is, by extracting textual guidance from semantic information regions and visual guidance from perceptual information regions, Brain-Streams provides accurate multi-modal guidance to LDMs. We validate the reconstruction ability of Brain-Streams both quantitatively and qualitatively on a real fMRI dataset comprising natural image stimuli and fMRI data.
https://arxiv.org/abs/2409.12099
Finding the perfect match between a job proposal and a set of freelancers is not an easy task to perform at scale, especially in multiple languages. In this paper, we propose a novel neural retriever architecture that tackles this problem in a multilingual setting. Our method encodes project descriptions and freelancer profiles by leveraging pre-trained multilingual language models. The latter are used as the backbone for a custom transformer architecture that aims to preserve the structure of the profiles and projects. This model is trained with a contrastive loss on historical data. Through several experiments, we show that this approach effectively captures skill-matching similarity and facilitates efficient matching, outperforming traditional methods.
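A common way to train such a dual encoder with a contrastive loss is in-batch InfoNCE: each project embedding is pulled toward the freelancer actually hired for it and pushed away from the other freelancers in the batch. The sketch below shows only this loss under that assumption; the multilingual encoder producing the embeddings and the paper's exact objective are not reproduced, and the temperature is illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_matching_loss(project_emb, freelancer_emb, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss over project/freelancer embeddings.
    Assumes the i-th project in the batch was historically matched to the
    i-th freelancer; all other freelancers serve as in-batch negatives."""
    p = F.normalize(project_emb, dim=-1)
    f = F.normalize(freelancer_emb, dim=-1)
    logits = p @ f.T / temperature          # pairwise cosine similarities
    targets = torch.arange(len(p))          # diagonal entries are the true matches
    return F.cross_entropy(logits, targets)

print(contrastive_matching_loss(torch.randn(16, 256), torch.randn(16, 256)))
```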
https://arxiv.org/abs/2409.12097
Efficiently and completely capturing the three-dimensional data of an object is a fundamental problem in industrial and robotic applications. The task of next-best-view (NBV) planning is to infer the pose of the next viewpoint based on the current data, and gradually realize the complete three-dimensional reconstruction. Many existing algorithms, however, suffer from a large computational burden due to the use of ray-casting. To address this, this paper proposes a projection-based NBV planning framework. It can select the next best view at an extremely fast speed while ensuring the complete scanning of the object. Specifically, this framework refits different types of voxel clusters into ellipsoids based on the voxel structure. Then, the next best view is selected from the candidate views using a projection-based viewpoint quality evaluation function in conjunction with a global partitioning strategy. This process replaces the ray-casting in voxel structures, significantly improving the computational efficiency. Comparative experiments with other algorithms in a simulation environment show that the proposed framework achieves a roughly tenfold efficiency improvement while capturing approximately the same coverage. The real-world experimental results also prove the efficiency and feasibility of the framework.
https://arxiv.org/abs/2409.12096