Q-learning methods are widely used in robot path planning but often face challenges of inefficient search and slow convergence. We propose an Improved Q-learning (IQL) framework that enhances standard Q-learning in two significant ways. First, we introduce the Path Adaptive Collaborative Optimization (PACO) algorithm to optimize Q-table initialization, providing better initial estimates and accelerating learning. Second, we incorporate a Utility-Controlled Heuristic (UCH) mechanism with dynamically tuned parameters to optimize the reward function, enhancing the algorithm's accuracy and effectiveness in path-planning tasks. Extensive experiments in three different raster grid environments validate the superior performance of our IQL framework. The results demonstrate that our IQL algorithm outperforms existing methods, including FIQL, PP-QL-based CPP, DFQL, and QMABC algorithms, in terms of path-planning capabilities.
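The abstract does not spell out PACO or UCH, so the sketch below only marks where they plug into tabular Q-learning: an informed initial Q-table and a heuristic bonus added to the environment reward. The `env` interface, `q_init`, and `heuristic_bonus` are hypothetical placeholders, not the paper's components.

```python
import numpy as np

def improved_q_learning(env, q_init, heuristic_bonus, episodes=500,
                        alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Q-learning skeleton. `q_init` stands in for a PACO-style
    Q-table initialization and `heuristic_bonus` for a UCH-style reward
    term; both are placeholders, not the paper's actual components."""
    Q = q_init.copy()                                  # informed prior instead of zeros
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = env.sample_action() if np.random.rand() < eps else int(np.argmax(Q[s]))
            s2, r, done = env.step(a)
            r += heuristic_bonus(s, a, s2)             # shaped reward on top of env reward
            target = r + gamma * np.max(Q[s2]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```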
https://arxiv.org/abs/2501.05411
Modern deep learning (DL) workloads increasingly use complex deep reinforcement learning (DRL) algorithms that generate training data within the learning loop. This results in programs with several nested loops and dynamic data dependencies between tensors. While DL systems with eager execution support such dynamism, they lack the optimizations and smart scheduling of graph-based execution. Graph-based execution, however, cannot express dynamic tensor shapes, instead requiring the use of multiple static subgraphs. Either execution model for DRL thus leads to redundant computation, reduced parallelism, and less efficient memory management. We describe TimeRL, a system for executing dynamic DRL programs that combines the dynamism of eager execution with the whole-program optimizations and scheduling of graph-based execution. TimeRL achieves this by introducing the declarative programming model of recurrent tensors, which allows users to define dynamic dependencies as intuitive recurrence equations. TimeRL translates recurrent tensors into a polyhedral dependence graph (PDG) with dynamic dependencies as symbolic expressions. Through simple PDG transformations, TimeRL applies whole-program optimizations, such as automatic vectorization, incrementalization, and operator fusion. The PDG also allows for the computation of an efficient program-wide execution schedule, which decides on buffer deallocations, buffer donations, and GPU/CPU memory swapping. We show that TimeRL executes current DRL algorithms up to 47$\times$ faster than existing DRL systems, while using 16$\times$ less GPU peak memory.
https://arxiv.org/abs/2501.05408
The co-design of neural network architectures, quantization precisions, and hardware accelerators offers a promising approach to achieving an optimal balance between performance and efficiency, particularly for model deployment on resource-constrained edge devices. In this work, we propose the JAQ Framework, which jointly optimizes the three critical dimensions. However, effectively automating the design process across the vast search space of those three dimensions poses significant challenges, especially when pursuing extremely low-bit quantization. Specifically, the primary challenges include: (1) Software-side memory overhead: low-precision quantization-aware training can lead to significant memory usage due to storing large intermediate features and latent weights for back-propagation, potentially causing memory exhaustion. (2) Hardware-side search time: the discrete nature of hardware parameters and the complex interplay between compiler optimizations and individual operators make the accelerator search time-consuming. To address these issues, JAQ mitigates the memory overhead through a channel-wise sparse quantization (CSQ) scheme, selectively applying quantization to the most sensitive components of the model during optimization. Additionally, JAQ designs BatchTile, which employs a hardware generation network to encode all possible tiling modes, thereby speeding up the search for the optimal compiler mapping strategy. Extensive experiments demonstrate the effectiveness of JAQ, achieving approximately 7% higher Top-1 accuracy on ImageNet compared to previous methods and reducing the hardware search time per iteration to 0.15 seconds.
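The abstract does not give CSQ's exact selection rule; as a rough sketch under that caveat, one can rank output channels by a gradient-magnitude sensitivity proxy and fake-quantize only the most sensitive fraction during optimization, keeping the rest in full precision:

```python
import torch

def channelwise_sparse_fakequant(w, grad, frac=0.25, bits=2):
    """Sketch of a CSQ-like scheme (not the paper's exact rule):
    rank output channels by mean |gradient| and fake-quantize only
    the most sensitive `frac` of them during optimization."""
    sens = grad.abs().mean(dim=tuple(range(1, w.dim())))      # per-channel sensitivity
    k = max(1, int(frac * w.shape[0]))
    idx = torch.argsort(sens, descending=True)[:k]            # most sensitive channels
    qmax = 2 ** (bits - 1) - 1
    wq = w.clone()
    sel = wq[idx]
    scale = sel.abs().amax(dim=tuple(range(1, sel.dim())), keepdim=True) / qmax + 1e-12
    wq[idx] = torch.round(sel / scale).clamp(-qmax - 1, qmax) * scale
    return wq
```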
https://arxiv.org/abs/2501.05339
Counterfactual estimators are critical for learning and refining policies using logged data, a process known as Off-Policy Evaluation (OPE). OPE allows researchers to assess new policies without costly experiments, speeding up the evaluation process. Online experimental methods, such as A/B tests, are effective but often slow, thus delaying the policy selection and optimization process. In this work, we explore the application of OPE methods in the context of resource allocation in dynamic auction environments. Given the competitive nature of environments where rapid decision-making is crucial for gaining a competitive edge, the ability to quickly and accurately assess algorithmic performance is essential. By utilizing counterfactual estimators as a preliminary step before conducting A/B tests, we aim to streamline the evaluation process, reduce the time and resources required for experimentation, and enhance confidence in the chosen policies. Our investigation focuses on the feasibility and effectiveness of using these estimators to predict the outcomes of potential resource allocation strategies, evaluate their performance, and facilitate more informed decision-making in policy selection. Motivated by the outcomes of our initial study, we envision an advanced analytics system designed to seamlessly and dynamically assess new resource allocation strategies and policies.
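For concreteness, the classical counterfactual estimator underlying most OPE pipelines is inverse propensity scoring (IPS), which reweights logged rewards by the ratio of target-policy to logging-policy action probabilities; a minimal version, not specific to this paper's auction setting, looks like:

```python
import numpy as np

def ips_value(logs, target_policy):
    """Inverse propensity scoring (IPS): estimate the value of a new
    policy from logged data without running it online."""
    # logs: iterable of (context, action, reward, logging_propensity)
    weights = np.array([target_policy(x, a) / p for x, a, r, p in logs])
    rewards = np.array([r for _, _, r, _ in logs])
    return float(np.mean(weights * rewards))   # unbiased if propensities are correct
```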
https://arxiv.org/abs/2501.05278
3D human pose estimation (3D HPE) has emerged as a prominent research topic, particularly in the realm of RGB-based methods. However, RGB images are susceptible to limitations such as sensitivity to lighting conditions and potential user discomfort. Consequently, multi-modal sensing, which leverages non-intrusive sensors, is gaining increasing attention. Nevertheless, multi-modal 3D HPE still faces challenges, including modality imbalance and the imperative for continual learning. In this work, we introduce a novel balanced continual multi-modal learning method for 3D HPE, which harnesses the power of RGB, LiDAR, mmWave, and WiFi. Specifically, we propose a Shapley value-based contribution algorithm to quantify the contribution of each modality and identify modality imbalance. To address this imbalance, we employ a re-learning strategy. Furthermore, recognizing that raw data is prone to noise contamination, we develop a novel denoising continual learning approach. This approach incorporates a noise identification and separation module to mitigate the adverse effects of noise and collaborates with the balanced learning strategy to enhance optimization. Additionally, an adaptive EWC mechanism is employed to alleviate catastrophic forgetting. We conduct extensive experiments on the widely-adopted multi-modal dataset, MM-Fi, which demonstrate the superiority of our approach in boosting 3D pose estimation and mitigating catastrophic forgetting in complex scenarios. We will release our code.
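As a reference point for the Shapley value-based contribution algorithm, the sketch below computes exact Shapley values over the four modalities; `value` is an assumed callable returning a performance score for any modality subset, and the concrete metric the paper uses is not stated in the abstract.

```python
from itertools import combinations
from math import factorial

def shapley_contributions(modalities, value):
    """Exact Shapley value of each modality, where value(S) scores a
    model trained on subset S; with four modalities the 2^4 subsets
    are cheap to enumerate."""
    n = len(modalities)
    phi = {m: 0.0 for m in modalities}
    for m in modalities:
        rest = [x for x in modalities if x != m]
        for k in range(n):
            for S in combinations(rest, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[m] += w * (value(frozenset(S) | {m}) - value(frozenset(S)))
    return phi

# e.g. shapley_contributions(["rgb", "lidar", "mmwave", "wifi"], value_fn)
```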
https://arxiv.org/abs/2501.05264
Reinforcement learning has demonstrated immense success in modelling complex physics-driven systems, providing end-to-end trainable solutions by interacting with a simulated or real environment, maximizing a scalar reward signal. In this work, we propose, building upon previous work, a multi-agent reinforcement learning approach with assignment constraints for reconstructing particle tracks in pixelated particle detectors. Our approach collaboratively optimizes a parametrized policy, functioning as a heuristic for a multidimensional assignment problem, by jointly minimizing the total amount of particle scattering over the reconstructed tracks in a readout frame. To satisfy constraints that guarantee a unique assignment of particle hits, we propose a safety layer that solves a linear assignment problem for every joint action. Further, to enforce cost margins, increasing the distance of the local policies' predictions from the decision boundaries of the optimizer mappings, we recommend the use of an additional component in the blackbox gradient estimation, pushing the policy toward solutions with lower total assignment costs. On simulated data generated for a particle detector developed for proton imaging, we empirically show the effectiveness of our approach compared to multiple single- and multi-agent baselines. We further demonstrate that constraints with cost margins are effective for both optimization and generalization, yielding wider regions of high reconstruction performance as well as reduced predictive instabilities. Our results form the basis for further developments in RL-based tracking, offering both enhanced performance with constrained policies and greater flexibility in optimizing tracking algorithms through the option for individual and team rewards.
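A safety layer of this kind can be realized with an off-the-shelf linear assignment solver; the sketch below, which assumes a per-frame matrix of agent-to-hit scores, projects the joint action onto a feasible one-to-one assignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def safety_layer(scores):
    """Project a joint action (matrix of per-agent hit-assignment
    scores) onto a feasible one-to-one assignment by solving a linear
    assignment problem, so every hit is used at most once."""
    cost = -np.asarray(scores)                 # maximize score = minimize -score
    rows, cols = linear_sum_assignment(cost)
    assignment = np.zeros_like(cost, dtype=int)
    assignment[rows, cols] = 1
    return assignment
```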
https://arxiv.org/abs/2501.05113
Miniature underwater robots play a crucial role in the exploration and development of marine resources, particularly in confined spaces and high-pressure deep-sea environments. This study presents the design, optimization, and performance of a miniature robotic fish, powered by the oscillation of bio-inspired fins. These fins feature a rigid-flexible hybrid structure and use an eccentric rotating mass (ERM) vibration motor as the excitation source to generate high-frequency unidirectional oscillations that induce acoustic streaming for propulsion. The drive mechanism, powered by miniature ERM vibration motors, eliminates the need for complex mechanical drive systems, enabling complete isolation of the entire drive system from the external environment and facilitating the miniaturization of the robotic fish. A compact, untethered robotic fish, measuring 85 × 60 × 45 mm, is equipped with three bio-inspired fins located at the pectoral and caudal positions. Experimental results demonstrate that the robotic fish achieves a maximum forward swimming speed of 1.36 body lengths (BL) per second when powered by all fins and a minimum turning radius of 0.6 BL when powered by a single fin. These results underscore the significance of employing the ERM vibration motor in advancing the development of highly maneuverable, miniature untethered underwater robots for various marine exploration tasks.
https://arxiv.org/abs/2501.05107
The discovery of causal relationships from observed data has attracted significant interest from disciplines such as economics, social sciences, epidemiology, and biology. In practical applications, considerable knowledge of the underlying systems is often unavailable, and real data are often associated with nonlinear causal structures, which makes the direct use of most conventional causality analysis methods difficult. This study proposes a novel quantum Peter-Clark (qPC) algorithm for causal discovery that does not assume any underlying model structures. Based on conditional independence tests in a class of reproducing kernel Hilbert spaces characterized by quantum circuits, the proposed qPC algorithm can explore causal relationships from observed data drawn from arbitrary distributions. We conducted systematic experiments on fundamental causal graph structures, demonstrating that the qPC algorithm exhibits significantly better performance, particularly with smaller sample sizes, than its classical counterpart. Furthermore, we propose a novel optimization approach based on Kernel Target Alignment (KTA) for determining the hyperparameters of quantum kernels. This method effectively reduces the risk of false positives in causal discovery, enabling more reliable inference. Our theoretical and experimental results demonstrate that the proposed quantum algorithm can empower classical algorithms for robust and accurate inference in causal discovery, supporting them in regimes where classical algorithms typically fail. Additionally, the effectiveness of this method was validated using the Boston Housing dataset as a real-world application. These findings demonstrate the new potential of quantum circuit-based causal discovery methods in addressing practical challenges, particularly in small-sample scenarios where traditional approaches have shown limitations.
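Kernel Target Alignment itself has a standard closed form, A(K, y) = ⟨K, yy^T⟩_F / (‖K‖_F ‖yy^T‖_F); a minimal implementation, independent of how the quantum kernel matrix K is produced, is:

```python
import numpy as np

def kernel_target_alignment(K, y):
    """Alignment between a kernel matrix K and the ideal target kernel
    yy^T built from labels y (typically +/-1); kernel hyperparameters
    can be tuned to maximize this score."""
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    T = y @ y.T                                # ideal target kernel
    return float(np.sum(K * T) / (np.linalg.norm(K) * np.linalg.norm(T)))
```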
https://arxiv.org/abs/2501.05007
In autonomous driving, traditional Computer Vision (CV) agents often struggle in unfamiliar situations due to biases in the training data. Deep Reinforcement Learning (DRL) agents address this by learning from experience and maximizing rewards, which helps them adapt to dynamic environments. However, ensuring their generalization remains challenging, especially with static training environments. Additionally, DRL models lack transparency, making it difficult to guarantee safety in all scenarios, particularly those not seen during training. To tackle these issues, we propose a method that combines DRL with Curriculum Learning for autonomous driving. Our approach uses a Proximal Policy Optimization (PPO) agent and a Variational Autoencoder (VAE) to learn safe driving in the CARLA simulator. The agent is trained using two-fold curriculum learning, progressively increasing environment difficulty and incorporating a collision penalty in the reward function to promote safety. This method improves the agent's adaptability and reliability in complex environments and helps it understand the nuances of balancing multiple reward components from different feedback signals in a single scalar reward function. Keywords: Computer Vision, Deep Reinforcement Learning, Variational Autoencoder, Proximal Policy Optimization, Curriculum Learning, Autonomous Driving.
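The abstract's reward weights are not given; the sketch below is only an illustration of folding several feedback signals and a collision penalty into one scalar reward, with a hypothetical difficulty schedule alongside:

```python
def driving_reward(speed, lane_offset, collided, collision_penalty=100.0):
    """Scalar reward mixing several feedback signals, with a large
    collision penalty as the safety term; the weights here are
    illustrative, not the paper's values."""
    r = 0.1 * speed - 0.5 * abs(lane_offset)   # progress vs. lane keeping
    if collided:
        r -= collision_penalty                  # safety dominates the trade-off
    return r

# curriculum idea: advance to denser traffic once the agent is stable
curriculum = [{"traffic": 0}, {"traffic": 20}, {"traffic": 50}]
```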
https://arxiv.org/abs/2501.04982
Deep learning models in medical imaging face dual challenges: domain shift, where models perform poorly when deployed in settings different from their training environment, and class imbalance, where certain disease conditions are naturally underrepresented. We present Imbalance-Aware Domain Adaptation (IADA), a novel framework that simultaneously tackles both challenges through three key components: (1) adaptive feature learning with class-specific attention mechanisms, (2) balanced domain alignment with dynamic weighting, and (3) adaptive threshold optimization. Our theoretical analysis establishes convergence guarantees and complexity bounds. Through extensive experiments on embryo development assessment across four imaging modalities, IADA demonstrates significant improvements over existing methods, achieving up to 25.19\% higher accuracy while maintaining balanced performance across classes. In challenging scenarios with low-quality imaging systems, IADA shows robust generalization with AUC improvements of up to 12.56\%. These results demonstrate IADA's potential for developing reliable and equitable medical imaging systems for diverse clinical settings. The code is made publicly available at \url{this https URL}
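IADA's exact dynamic weighting rule is not given in the abstract; one illustrative way to realize "balanced domain alignment with dynamic weighting" is to upweight classes with the worst current error so underrepresented classes dominate the alignment loss:

```python
import torch

def dynamic_class_weights(per_class_error, temperature=1.0):
    """Illustrative weighting rule (not IADA's exact scheme): classes
    the model currently handles worst receive larger weights in the
    alignment loss, counteracting class imbalance."""
    e = torch.as_tensor(per_class_error, dtype=torch.float)
    w = torch.softmax(e / temperature, dim=0)
    return w * len(w)   # normalize so the average weight is 1
```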
https://arxiv.org/abs/2501.04958
Multimodal Large Language Models (MLLMs) have achieved impressive performance and have been put into practical use in commercial applications, but they still have potential safety mechanism vulnerabilities. Jailbreak attacks are red teaming methods that aim to bypass safety mechanisms and discover MLLMs' potential risks. Existing MLLMs' jailbreak methods often bypass the model's safety mechanism through complex optimization methods or carefully designed image and text prompts. Despite achieving some progress, they have a low attack success rate on commercial closed-source MLLMs. Unlike previous research, we empirically find that there exists a Shuffle Inconsistency between MLLMs' comprehension ability and safety ability for the shuffled harmful instruction. That is, from the perspective of comprehension ability, MLLMs can understand the shuffled harmful text-image instructions well. However, they can be easily bypassed by the shuffled harmful instructions from the perspective of safety ability, leading to harmful responses. Then we innovatively propose a text-image jailbreak attack named SI-Attack. Specifically, to fully utilize the Shuffle Inconsistency and overcome the shuffle randomness, we apply a query-based black-box optimization method to select the most harmful shuffled inputs based on the feedback of the toxic judge model. A series of experiments show that SI-Attack can improve the attack's performance on three benchmarks. In particular, SI-Attack can obviously improve the attack success rate for commercial MLLMs such as GPT-4o or Claude-3.5-Sonnet.
https://arxiv.org/abs/2501.04931
The growth in the use of online advertising to foster brand awareness over recent years is largely attributable to the ubiquity of social media. One pivotal technology contributing to the success of online brand advertising is frequency capping, a mechanism that enables marketers to control the number of times an ad is shown to a specific user. However, the very foundation of this technology is being scrutinized as the industry gravitates towards advertising solutions that prioritize user privacy. This paper delves into the issue of reach measurement and optimization within the context of $k$-anonymity, a privacy-preserving model gaining traction across major online advertising platforms. We outline how to report reach within this new privacy landscape and demonstrate how probabilistic discounting, a probabilistic adaptation of traditional frequency capping, can be employed to optimize campaign performance. Experiments are performed to assess the trade-off between user privacy and the efficacy of online brand advertising. Notably, we discern a significant dip in performance once privacy is introduced, yet the additional cost for advertising platforms to offer their users more privacy remains limited.
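As an illustration of probabilistic discounting (the exact discount schedule is not stated in the abstract), a hard frequency cap can be relaxed into a geometrically decaying serve probability, which only needs coarse frequency buckets compatible with $k$-anonymous reporting:

```python
import random

def probabilistic_cap(times_seen, cap=3, discount=0.5):
    """Probabilistic adaptation of frequency capping: past the cap,
    serve the ad with geometrically decaying probability instead of
    a hard cutoff. Illustrative schedule, not the paper's."""
    if times_seen < cap:
        return True                              # under the cap: always serve
    return random.random() < discount ** (times_seen - cap + 1)
```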
https://arxiv.org/abs/2501.04882
This paper examines the specific obstacles of constructing Retrieval-Augmented Generation (RAG) systems in low-resource languages, with a focus on Persian's complicated morphology and versatile syntax. The research aims to improve retrieval and generation accuracy by introducing Persian-specific models, namely MatinaRoberta (a masked language model) and MatinaSRoberta (a fine-tuned Sentence-BERT), along with a comprehensive benchmarking framework. Three datasets were used to assess these models after they were trained on a varied corpus of 73.11 billion Persian tokens: general knowledge (PQuad), scientifically specialized texts, and organizational reports. The methodology involved extensive pretraining, fine-tuning with tailored loss functions, and systematic evaluations using both traditional metrics and the Retrieval-Augmented Generation Assessment framework. The results show that MatinaSRoberta outperformed previous embeddings, achieving superior contextual relevance and retrieval accuracy across datasets. Temperature tweaking, chunk size modifications, and document summary indexing were explored to enhance RAG setups. Larger models such as Llama-3.1 (70B) consistently demonstrated the highest generation accuracy, while smaller models faced challenges with domain-specific and formal contexts. The findings underscore the potential for developing RAG systems in Persian through customized embeddings and retrieval-generation settings and highlight the enhancement of NLP applications such as search engines and legal document analysis in low-resource languages.
https://arxiv.org/abs/2501.04858
We present Seg-TTO, a novel framework for zero-shot, open-vocabulary semantic segmentation (OVSS), designed to excel in specialized domain tasks. While current open-vocabulary approaches show impressive performance on standard segmentation benchmarks under zero-shot settings, they fall short of supervised counterparts on highly domain-specific datasets. We focus on segmentation-specific test-time optimization to address this gap. Segmentation requires an understanding of multiple concepts within a single image while retaining the locality and spatial structure of representations. We propose a novel self-supervised objective adhering to these requirements and use it to align the model parameters with input images at test time. In the textual modality, we learn multiple embeddings for each category to capture diverse concepts within an image, while in the visual modality, we calculate pixel-level losses followed by embedding aggregation operations specific to preserving spatial structure. Our resulting framework, termed Seg-TTO, is a plug-and-play module. We integrate Seg-TTO with three state-of-the-art OVSS approaches and evaluate across 22 challenging OVSS tasks covering a range of specialized domains. Seg-TTO demonstrates clear performance improvements across these tasks, establishing a new state-of-the-art. Code: this https URL.
https://arxiv.org/abs/2501.04696
Interior design involves the careful selection and arrangement of objects to create an aesthetically pleasing, functional, and harmonized space that aligns with the client's design brief. This task is particularly challenging, as a successful design must not only incorporate all the necessary objects in a cohesive style, but also ensure they are arranged in a way that maximizes accessibility, while adhering to a variety of affordability and usage considerations. Data-driven solutions have been proposed, but these are typically room- or domain-specific and lack explainability in the design considerations used to produce the final layout. In this paper, we investigate if large language models (LLMs) can be directly utilized for interior design. While we find that LLMs are not yet capable of generating complete layouts, they can be effectively leveraged in a structured manner, inspired by the workflow of interior designers. By systematically probing LLMs, we can reliably generate a list of objects along with relevant constraints that guide their placement. We translate this information into a design layout graph, which is then solved using an off-the-shelf constrained optimization setup to generate the final layouts. We benchmark our algorithm in various design configurations against existing LLM-based methods and human designs, and evaluate the results using a variety of quantitative and qualitative metrics along with user studies. In summary, we demonstrate that LLMs, when used in a structured manner, can effectively generate diverse high-quality layouts, making them a viable solution for creating large-scale virtual scenes. Project webpage at this https URL
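As a toy version of the pipeline's last step (the paper's exact graph format and solver are not specified), LLM-extracted pairwise constraints can be scored as penalty terms and handed to an off-the-shelf optimizer; the object names and distances below are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical constraints an LLM probe might return for a living room;
# names and distances are illustrative, not from the paper.
objects = ["sofa", "tv", "coffee_table"]
min_dist = {("sofa", "tv"): 2.0}             # keep a viewing distance
max_dist = {("sofa", "coffee_table"): 0.8}   # keep the table reachable

def violation(flat):
    """Sum of squared constraint violations for a flat [x0,y0,x1,y1,...] layout."""
    pos = {o: flat[2 * i:2 * i + 2] for i, o in enumerate(objects)}
    cost = 0.0
    for (a, b), d in min_dist.items():
        cost += max(0.0, d - np.linalg.norm(pos[a] - pos[b])) ** 2
    for (a, b), d in max_dist.items():
        cost += max(0.0, np.linalg.norm(pos[a] - pos[b]) - d) ** 2
    return cost

layout = minimize(violation, np.random.rand(2 * len(objects)) * 4.0,
                  method="Nelder-Mead").x
```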
https://arxiv.org/abs/2501.04648
Unrolled networks have become prevalent in various computer vision and imaging tasks. Although they have demonstrated remarkable efficacy in solving specific computer vision and computational imaging tasks, their adaptation to other applications presents considerable challenges. This is primarily due to the multitude of design decisions that practitioners working on new applications must navigate, each potentially affecting the network's overall performance. These decisions include selecting the optimization algorithm, defining the loss function, and determining the number of convolutional layers, among others. Compounding the issue, evaluating each design choice requires time-consuming simulations to train, fine-tune the neural network, and optimize for its performance. As a result, the process of exploring multiple options and identifying the optimal configuration becomes time-consuming and computationally demanding. The main objectives of this paper are (1) to unify some ideas and methodologies used in unrolled networks to reduce the number of design choices a user has to make, and (2) to report a comprehensive ablation study to discuss the impact of each of the choices involved in designing unrolled networks and present practical recommendations based on our findings. We anticipate that this study will help scientists and engineers design unrolled networks for their applications and diagnose problems within their networks efficiently.
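To make the design space concrete, the sketch below is a minimal unrolled network: ISTA for sparse recovery, unrolled for K iterations with learnable per-iteration step sizes and thresholds. It is an illustrative instance of the pattern, not an architecture from the paper; choices such as the unrolled algorithm, the depth K, and the training loss are exactly the decisions the ablation study examines.

```python
import torch
import torch.nn as nn

class UnrolledISTA(nn.Module):
    """K unrolled ISTA iterations for min_x 0.5||Ax - y||^2 + lam||x||_1,
    with each iteration's step size and soft-threshold made learnable."""
    def __init__(self, A, K=10):
        super().__init__()
        self.register_buffer("A", A)
        self.step = nn.Parameter(torch.full((K,), 0.1))      # per-iteration step size
        self.thresh = nn.Parameter(torch.full((K,), 0.01))   # per-iteration threshold

    def forward(self, y):
        x = torch.zeros(self.A.shape[1], device=y.device)
        for t, lam in zip(self.step, self.thresh):
            grad = self.A.T @ (self.A @ x - y)               # data-fidelity gradient
            z = x - t * grad
            x = torch.sign(z) * torch.clamp(z.abs() - lam, min=0)  # soft-threshold
        return x
```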
https://arxiv.org/abs/2501.04608
The image compression model has long struggled with adaptability and generalization, as the decoded bitstream typically serves only human or machine needs and fails to preserve information for unseen visual tasks. Therefore, this paper innovatively introduces supervision obtained from multimodal pre-training models and incorporates adaptive multi-objective optimization tailored to support both human visual perception and machine vision simultaneously with a single bitstream, denoted as Unified and Generalized Image Coding for Machine (UG-ICM). Specifically, to remove the reliance of compression models on downstream task supervision, we introduce Contrastive Language-Image Pre-training (CLIP) models into the training constraint for improved generalization. Global-to-instance-wise CLIP supervision is applied to help obtain hierarchical semantics that make models more generalizable for tasks relying on information of different granularity. Furthermore, to support both human and machine vision with only a unifying bitstream, we incorporate a conditional decoding strategy that takes human or machine preferences as conditions, enabling the bitstream to be decoded into different versions for the corresponding preferences. As such, our proposed UG-ICM is fully trained in a self-supervised manner, i.e., without awareness of any specific downstream models and tasks. Extensive experiments have shown that the proposed UG-ICM is capable of achieving remarkable improvements in various unseen machine analytics tasks, while simultaneously providing perceptually satisfying images.
https://arxiv.org/abs/2501.04579
Adversarial training has proven to be a highly effective method for improving the robustness of deep neural networks against adversarial attacks. Nonetheless, it has been observed to exhibit a limitation in terms of robust fairness, characterized by a significant disparity in robustness across different classes. Recent efforts to mitigate this problem have turned to class-wise reweighted methods. However, these methods suffer from a lack of rigorous theoretical analysis and are limited in their exploration of the weight space, as they mainly rely on existing heuristic algorithms or intuition to compute weights. In addition, these methods fail to guarantee the consistency of the optimization direction due to the decoupled optimization of weights and model parameters, potentially leading to suboptimal weight assignments and, consequently, a suboptimal model. To address these problems, this paper proposes a novel min-max training framework, Class Optimal Distribution Adversarial Training (CODAT), which employs distributionally robust optimization to fully explore the class-wise weight space, thus enabling the identification of the optimal weight with theoretical guarantees. Furthermore, we derive a closed-form optimal solution to the internal maximization and then obtain a deterministic equivalent objective function, which provides a theoretical basis for the joint optimization of weights and model parameters. Meanwhile, we propose a fairness elasticity coefficient to evaluate the algorithm with regard to both robustness and robust fairness. Experimental results on various datasets show that the proposed method can effectively improve the robust fairness of the model and outperform state-of-the-art approaches.
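The abstract does not reproduce CODAT's closed-form inner solution; for the common KL-regularized variant of such an inner maximization, the optimal class distribution is an exponential tilting of the per-class losses, shown below as a hedged stand-in:

```python
import torch

def worst_case_class_weights(per_class_robust_loss, reg=1.0):
    """Closed form of max_{w in simplex} <w, L> - reg * KL(w || uniform):
    w_c is proportional to exp(L_c / reg), so classes with worse robust
    loss receive more weight. A standard DRO identity, not necessarily
    CODAT's exact derivation."""
    L = torch.as_tensor(per_class_robust_loss, dtype=torch.float)
    return torch.softmax(L / reg, dim=0)
```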
https://arxiv.org/abs/2501.04527
Federated learning (FL) provides a privacy-preserving solution for fine-tuning pre-trained large language models (LLMs) using distributed private datasets, enabling task-specific adaptation while preserving data privacy. However, fine-tuning the extensive parameters in LLMs is particularly challenging in resource-constrained federated scenarios due to the significant communication and computational costs. To gain a deeper understanding of how these challenges can be addressed, this article conducts a comparative analysis of three advanced federated LLM (FedLLM) frameworks that integrate knowledge distillation (KD) and split learning (SL) to mitigate these issues: 1) FedLLMs, where clients upload model parameters or gradients to enable straightforward and effective fine-tuning; 2) KD-FedLLMs, which leverage KD for efficient knowledge sharing via logits; and 3) Split-FedLLMs, which split the LLMs into two parts, with one part executed on the client and the other on the server, to balance the computational load. Each framework is evaluated based on key performance metrics, including model accuracy, communication overhead, and client-side computational load, offering insights into their effectiveness for various federated fine-tuning scenarios. Through this analysis, we identify framework-specific optimization opportunities to enhance the efficiency of FedLLMs and discuss broader research directions, highlighting open opportunities to better adapt FedLLMs for real-world applications. A use case is presented to demonstrate the performance comparison of these three frameworks under varying configurations and settings.
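In the KD-FedLLM setting, what crosses the network are logits rather than parameters; the standard logit-distillation loss (a Hinton-style form assumed here, not quoted from the article) is:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Logit-based distillation: KL divergence between temperature-
    softened distributions; only these logits need to be communicated,
    not full model parameters or gradients."""
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T
```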
https://arxiv.org/abs/2501.04436
While many algorithms for diversity maximization under imitation constraints are online in nature, many applications require offline algorithms without environment interactions. Tackling this problem in the offline setting, however, presents significant challenges that require non-trivial, multi-stage optimization processes with non-stationary rewards. In this work, we present a novel offline algorithm that enhances diversity using an objective based on Van der Waals (VdW) force and successor features, and eliminates the need to learn a previously used skill discriminator. Moreover, by conditioning the value function and policy on a pre-trained Functional Reward Encoding (FRE), our method allows for better handling of non-stationary rewards and provides zero-shot recall of all skills encountered during training, significantly expanding the set of skills learned in prior work. Consequently, our algorithm benefits from receiving a consistently strong diversity signal (VdW), and enjoys more stable and efficient training. We demonstrate the effectiveness of our method in generating diverse skills for two robotic tasks in simulation: locomotion of a quadruped and local navigation with obstacle traversal.
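The abstract does not define its VdW objective precisely; a common Lennard-Jones-style surrogate, used here purely as an illustration, rewards skill successor features that keep a characteristic separation sigma and penalizes crowding:

```python
import torch

def vdw_diversity(psi, sigma=1.0):
    """Lennard-Jones-style pairwise potential over skill successor
    features psi (num_skills x dim): pairs closer than ~sigma are
    strongly repelled, distant ones mildly attracted, so skills spread
    to a characteristic separation. Illustrative form only."""
    eye = torch.eye(len(psi), device=psi.device)
    d = torch.cdist(psi, psi) + eye * 1e9        # mask self-distances
    s6 = (sigma / d) ** 6
    return -(s6 ** 2 - s6).sum()                 # negative LJ energy: maximize to diversify
```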
https://arxiv.org/abs/2501.04426