We present DreamHOI, a novel method for zero-shot synthesis of human-object interactions (HOIs), enabling a 3D human model to realistically interact with any given object based on a textual description. This task is complicated by the varying categories and geometries of real-world objects and the scarcity of datasets encompassing diverse HOIs. To circumvent the need for extensive data, we leverage text-to-image diffusion models trained on billions of image-caption pairs. We optimize the articulation of a skinned human mesh using Score Distillation Sampling (SDS) gradients obtained from these models, which predict image-space edits. However, directly backpropagating image-space gradients into complex articulation parameters is ineffective due to the local nature of such gradients. To overcome this, we introduce a dual implicit-explicit representation of a skinned mesh, combining (implicit) neural radiance fields (NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization, we transition between implicit and explicit forms, grounding the NeRF generation while refining the mesh articulation. We validate our approach through extensive experiments, demonstrating its effectiveness in generating realistic HOIs.
https://arxiv.org/abs/2409.08278
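To make the SDS mechanism in the abstract concrete, here is a minimal, self-contained PyTorch sketch of one Score Distillation Sampling update applied to articulation parameters. The renderer and noise predictor are toy stand-ins (not from the paper), and the usual timestep weighting w(t) is omitted; only the gradient path — residual detached, back-propagated through the rendered image into the pose parameters — follows the SDS recipe.

```python
import torch

# Toy stand-ins (assumptions, not from the paper): a "renderer" mapping six
# articulation parameters to a 3x32x32 image through a fixed random linear map,
# so the gradient path is differentiable end to end, and a dummy noise predictor
# in place of the frozen text-to-image diffusion model.
torch.manual_seed(0)
W_render = torch.randn(3 * 32 * 32, 6)

def render_skinned_mesh(pose_params):
    return torch.sigmoid(W_render @ pose_params).reshape(3, 32, 32)

def predict_noise(noisy_image, t, text_embedding):
    return torch.randn_like(noisy_image)  # a real model would condition on the prompt

# Standard DDPM-style cumulative alphas for 1000 timesteps.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

# One SDS update on the articulation parameters.
pose_params = torch.zeros(6, requires_grad=True)     # e.g. a few joint angles
optimizer = torch.optim.Adam([pose_params], lr=1e-2)

image = render_skinned_mesh(pose_params)             # differentiable render
t = torch.randint(20, 980, (1,))                     # random diffusion timestep
alpha_bar = alphas_cumprod[t].reshape(1, 1, 1)
eps = torch.randn_like(image)
noisy = alpha_bar.sqrt() * image + (1.0 - alpha_bar).sqrt() * eps
with torch.no_grad():
    eps_pred = predict_noise(noisy, t, text_embedding=None)

# SDS treats the residual as an image-space gradient and pushes it through the
# renderer only; the diffusion model itself is never updated.
sds_grad = eps_pred - eps
loss = (sds_grad * image).sum()   # surrogate whose gradient w.r.t. the image is sds_grad
optimizer.zero_grad()
loss.backward()
optimizer.step()
```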
High frame rate and accurate depth estimation plays an important role in several tasks crucial to robotics and automotive perception. To date, this can be achieved through ToF and LiDAR devices for indoor and outdoor applications, respectively. However, their applicability is limited by low frame rate, energy consumption, and spatial sparsity. Depth on Demand (DoD) allows for accurate temporal and spatial depth densification achieved by exploiting a high frame rate RGB sensor coupled with a potentially lower frame rate and sparse active depth sensor. Our proposal jointly enables lower energy consumption and denser shape reconstruction, by significantly reducing the streaming requirements on the depth sensor thanks to its three core stages: i) multi-modal encoding, ii) iterative multi-modal integration, and iii) depth decoding. We present extended evidence assessing the effectiveness of DoD on indoor and outdoor video datasets, covering both environment scanning and automotive perception use cases.
https://arxiv.org/abs/2409.08277
While tactile sensing is widely accepted as an important and useful sensing modality, its use pales in comparison to other sensory modalities like vision and proprioception. AnySkin addresses the critical challenges that impede the use of tactile sensing -- versatility, replaceability, and data reusability. Building on the simplistic design of ReSkin, and decoupling the sensing electronics from the sensing interface, AnySkin simplifies integration, making it as straightforward as putting on a phone case and connecting a charger. Furthermore, AnySkin is the first uncalibrated tactile sensor with cross-instance generalizability of learned manipulation policies. To summarize, this work makes three key contributions: first, we introduce a streamlined fabrication process and a design tool for creating an adhesive-free, durable and easily replaceable magnetic tactile sensor; second, we characterize slip detection and policy learning with the AnySkin sensor; and third, we demonstrate zero-shot generalization of models trained on one instance of AnySkin to new instances, and compare it with popular existing tactile solutions like DIGIT and ReSkin. this https URL
https://arxiv.org/abs/2409.08276
We present an approach to learn general robot manipulation priors from 3D hand-object interaction trajectories. We build a framework to use in-the-wild videos to generate sensorimotor robot trajectories. We do so by lifting both the human hand and the manipulated object in a shared 3D space and retargeting human motions to robot actions. Generative modeling on this data gives us a task-agnostic base policy. This policy captures a general yet flexible manipulation prior. We empirically demonstrate that finetuning this policy, with both reinforcement learning (RL) and behavior cloning (BC), enables sample-efficient adaptation to downstream tasks and simultaneously improves robustness and generalizability compared to prior approaches. Qualitative experiments are available at: \url{this https URL}.
https://arxiv.org/abs/2409.08273
Recent advancements in generative models have revolutionized image generation and editing, making these tasks accessible to non-experts. This paper focuses on local image editing, particularly the task of adding new content to a loosely specified area. Existing methods often require a precise mask or a detailed description of the location, which can be cumbersome and prone to errors. We propose Click2Mask, a novel approach that simplifies the local editing process by requiring only a single point of reference (in addition to the content description). A mask is dynamically grown around this point during a Blended Latent Diffusion (BLD) process, guided by a masked CLIP-based semantic loss. Click2Mask surpasses the limitations of segmentation-based and fine-tuning dependent methods, offering a more user-friendly and contextually accurate solution. Our experiments demonstrate that Click2Mask not only minimizes user effort but also delivers competitive or superior local image manipulation results compared to SoTA methods, according to both human judgement and automatic metrics. Key contributions include the simplification of user input, the ability to freely add objects unconstrained by existing segments, and the integration potential of our dynamic mask approach within other editing methods.
https://arxiv.org/abs/2409.08272
We present DreamBeast, a novel method based on score distillation sampling (SDS) for generating fantastical 3D animal assets composed of distinct parts. Existing SDS methods often struggle with this generation task due to a limited understanding of part-level semantics in text-to-image diffusion models. While recent diffusion models, such as Stable Diffusion 3, demonstrate a better part-level understanding, they are prohibitively slow and exhibit other common problems associated with single-view diffusion models. DreamBeast overcomes this limitation through a novel part-aware knowledge transfer mechanism. For each generated asset, we efficiently extract part-level knowledge from the Stable Diffusion 3 model into a 3D Part-Affinity implicit representation. This enables us to instantly generate Part-Affinity maps from arbitrary camera views, which we then use to modulate the guidance of a multi-view diffusion model during SDS to create 3D assets of fantastical animals. DreamBeast significantly enhances the quality of generated 3D creatures with user-specified part compositions while reducing computational overhead, as demonstrated by extensive quantitative and qualitative evaluations.
https://arxiv.org/abs/2409.08271
This study addresses the challenge of accurately segmenting 3D Gaussian Splatting from 2D masks. Conventional methods often rely on iterative gradient descent to assign each Gaussian a unique label, leading to lengthy optimization and sub-optimal solutions. Instead, we propose a straightforward yet globally optimal solver for 3D-GS segmentation. The core insight of our method is that, with a reconstructed 3D-GS scene, the rendering of the 2D masks is essentially a linear function with respect to the labels of each Gaussian. As such, the optimal label assignment can be solved via linear programming in closed form. This solution capitalizes on the alpha blending characteristic of the splatting process for single step optimization. By incorporating the background bias in our objective function, our method shows superior robustness in 3D segmentation against noises. Remarkably, our optimization completes within 30 seconds, about 50$\times$ faster than the best existing methods. Extensive experiments demonstrate the efficiency and robustness of our method in segmenting various scenes, and its superior performance in downstream tasks such as object removal and inpainting. Demos and code will be available at this https URL.
https://arxiv.org/abs/2409.08270
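A minimal NumPy sketch of the core insight: because alpha blending makes each rendered mask pixel a linear function of the per-Gaussian labels, label assignment is a linear program over per-Gaussian simplices whose optimum sits at a vertex, i.e. an argmax over accumulated blend weights. The array shapes, the background-bias term, and the assumption that class 0 is the background are illustrative, not the paper's exact formulation.

```python
import numpy as np

def assign_labels(blend_weights, pixel_masks, background_bias=0.0):
    """Closed-form label assignment for a reconstructed 3D-GS scene.

    blend_weights: (P, G) alpha-blending contribution of each of G Gaussians to
                   each of P rendered mask pixels (taken from the splatting renderer).
    pixel_masks:   (P, K) one-hot (or soft) 2D masks over K classes, class 0 = background.
    The rendered mask is linear in the per-Gaussian labels, so maximizing agreement
    under a per-Gaussian simplex constraint is a linear program whose optimum is a
    vertex: each Gaussian takes the class with the largest accumulated blend weight,
    optionally nudged by a background bias for robustness to noisy masks.
    """
    scores = blend_weights.T @ pixel_masks                       # (G, K) accumulated evidence
    scores[:, 0] += background_bias * blend_weights.sum(axis=0)  # background prior
    return scores.argmax(axis=1)                                 # (G,) hard label per Gaussian

# Tiny synthetic example: 4 pixels, 3 Gaussians, 2 classes (background, object).
W = np.array([[0.7, 0.2, 0.0],
              [0.1, 0.8, 0.0],
              [0.0, 0.1, 0.9],
              [0.0, 0.0, 0.6]])
M = np.array([[0, 1], [0, 1], [1, 0], [1, 0]], dtype=float)
print(assign_labels(W, M))    # -> [1 1 0]
```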
Today's touch sensors come in many shapes and sizes. This has made it challenging to develop general-purpose touch processing methods since models are generally tied to one specific sensor design. We address this problem by performing cross-modal prediction between touch sensors: given the tactile signal from one sensor, we use a generative model to estimate how the same physical contact would be perceived by another sensor. This allows us to apply sensor-specific methods to the generated signal. We implement this idea by training a diffusion model to translate between the popular GelSlim and Soft Bubble sensors. As a downstream task, we perform in-hand object pose estimation using GelSlim sensors while using an algorithm that operates only on Soft Bubble signals. The dataset, the code, and additional details can be found at this https URL.
https://arxiv.org/abs/2409.08269
Recent years have witnessed the success of large text-to-image diffusion models and their remarkable potential to generate high-quality images. The further pursuit of enhancing the editability of images has sparked significant interest in the downstream task of inpainting a novel object described by a text prompt within a designated region in the image. Nevertheless, the problem is not trivial from two aspects: 1) Solely relying on one single U-Net to align text prompt and visual object across all the denoising timesteps is insufficient to generate desired objects; 2) The controllability of object generation is not guaranteed in the intricate sampling space of diffusion model. In this paper, we propose to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting that infers the semantic features of desired objects in a multi-modal feature space; 2) high-fidelity object generation in diffusion latent space that pivots on such inpainted semantic features. To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object inpainting. Technically, the semantic inpainter is trained to predict the semantic features of the target object conditioning on unmasked context and text prompt. The outputs of the semantic inpainter then act as the informative visual prompts to guide high-fidelity object generation through a reference adapter layer, leading to controllable object inpainting. Extensive evaluations on OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion against the state-of-the-art methods. Code is available at \url{this https URL}.
https://arxiv.org/abs/2409.08260
Diffusion models have revolutionized generative modeling in numerous image synthesis tasks. Nevertheless, it is not trivial to directly apply diffusion models for synthesizing an image of a target person wearing a given in-shop garment, i.e., the image-based virtual try-on (VTON) task. The difficulty originates from the fact that the diffusion process should not only produce a holistically high-fidelity, photorealistic image of the target person, but also locally preserve every appearance and texture detail of the given garment. To address this, we shape a new diffusion model, namely GarDiff, which triggers the garment-focused diffusion process with amplified guidance of both the basic visual appearance and the detailed textures (i.e., high-frequency details) derived from the given garment. GarDiff first remoulds a pre-trained latent diffusion model with additional appearance priors derived from the CLIP and VAE encodings of the reference garment. Meanwhile, a novel garment-focused adapter is integrated into the UNet of the diffusion model, pursuing local fine-grained alignment with the visual appearance of the reference garment and the human pose. We specifically design an appearance loss over the synthesized garment to enhance the crucial, high-frequency details. Extensive experiments on the VITON-HD and DressCode datasets demonstrate the superiority of our GarDiff when compared to state-of-the-art VTON approaches. Code is publicly available at: \href{this https URL}{this https URL}.
https://arxiv.org/abs/2409.08258
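The abstract's "appearance loss ... to enhance the crucial, high-frequency details" could take several forms; below is a hedged PyTorch sketch of one plausible variant that compares Laplacian (high-pass) responses of the synthesized and reference images inside the garment mask. The filter choice and masking scheme are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def high_frequency_appearance_loss(pred, ref, garment_mask):
    """One plausible appearance loss emphasizing high-frequency garment detail.

    pred, ref:    (B, 3, H, W) synthesized and reference images.
    garment_mask: (B, 1, H, W) soft mask of the garment region.
    A 3x3 Laplacian acts as a cheap high-pass filter; the L1 gap between the
    filtered images is accumulated only inside the garment region.
    """
    lap = torch.tensor([[0., 1., 0.],
                        [1., -4., 1.],
                        [0., 1., 0.]]).reshape(1, 1, 3, 3).repeat(3, 1, 1, 1)
    hf_pred = F.conv2d(pred, lap, padding=1, groups=3)
    hf_ref = F.conv2d(ref, lap, padding=1, groups=3)
    return (garment_mask * (hf_pred - hf_ref).abs()).mean()

# Toy usage
pred = torch.rand(2, 3, 64, 64)
ref = torch.rand(2, 3, 64, 64)
mask = (torch.rand(2, 1, 64, 64) > 0.5).float()
print(high_frequency_appearance_loss(pred, ref, mask).item())
```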
The question of how cyber-physical systems should interact with human partners that can take over control or exert oversight is becoming more pressing, as these systems are deployed for an ever larger range of tasks. Drawing on the literatures on handing over control during semi-autonomous driving and human-robot interaction, we propose a design of a take-over request that combines an abstract pre-alert with an informative TOR: Relevant sensor information is highlighted on the controller's display, while a spoken message verbalizes the reason for the TOR. We conduct our study in the context of a semi-autonomous drone control scenario as our testbed. The goal of our online study is to assess in more detail what form a language-based TOR should take. Specifically, we compare a full sentence condition to shorter fragments, and test whether the visual highlighting should be done synchronously or asynchronously with the speech. Participants showed a higher accuracy in choosing the correct solution with our bi-modal TOR and felt that they were better able to recognize the critical situation. Using only fragments in the spoken message rather than full sentences did not lead to improved accuracy or faster reactions. Also, synchronizing the visual highlighting with the spoken message did not result in better accuracy and response times were even increased in this condition.
https://arxiv.org/abs/2409.08253
Panoptic narrative grounding (PNG), whose core target is fine-grained image-text alignment, requires a panoptic segmentation of referred objects given a narrative caption. Previous discriminative methods achieve only weak or coarse-grained alignment by panoptic segmentation pretraining or CLIP model adaptation. Given the recent progress of text-to-image Diffusion models, several works have shown their capability to achieve fine-grained image-text alignment through cross-attention maps and improved general segmentation performance. However, the direct use of phrase features as static prompts to apply frozen Diffusion models to the PNG task still suffers from a large task gap and insufficient vision-language interaction, yielding inferior performance. Therefore, we propose an Extractive-Injective Phrase Adapter (EIPA) bypass within the Diffusion UNet to dynamically update phrase prompts with image features and inject the multimodal cues back, which leverages the fine-grained image-text alignment capability of Diffusion models more sufficiently. In addition, we also design a Multi-Level Mutual Aggregation (MLMA) module to reciprocally fuse multi-level image and phrase features for segmentation refinement. Extensive experiments on the PNG benchmark show that our method achieves new state-of-the-art performance.
https://arxiv.org/abs/2409.08251
Whether learned, simulated, or analytical, approximations of a robot's dynamics can be inaccurate when encountering novel environments. Many approaches have been proposed to quantify the aleatoric uncertainty of such methods, i.e. uncertainty resulting from stochasticity, however these estimates alone are not enough to properly estimate the uncertainty of a model in a novel environment, where the actual dynamics can change. Such changes can induce epistemic uncertainty, i.e. uncertainty due to a lack of information/data. Accounting for both epistemic and aleatoric dynamics uncertainty in a theoretically-grounded way remains an open problem. We introduce Local Uncertainty Conformal Calibration (LUCCa), a conformal prediction-based approach that calibrates the aleatoric uncertainty estimates provided by dynamics models to generate probabilistically-valid prediction regions of the system's state. We account for both epistemic and aleatoric uncertainty non-asymptotically, without strong assumptions about the form of the true dynamics or how it changes. The calibration is performed locally in the state-action space, leading to uncertainty estimates that are useful for planning. We validate our method by constructing probabilistically-safe plans for a double-integrator under significant changes in dynamics.
https://arxiv.org/abs/2409.08249
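As context for the calibration step, here is a split-conformal sketch in NumPy: a scalar inflation factor for the model's predicted aleatoric standard deviation is chosen on held-out transitions so that the resulting box region achieves the target coverage. LUCCa calibrates locally in the state-action space; this global version only illustrates the conformal backbone, and the nonconformity score and region shape are assumptions.

```python
import numpy as np

def conformal_scale(pred_mean, pred_std, true_next, alpha=0.1):
    """Split-conformal inflation factor for a dynamics model's aleatoric std.

    pred_mean, pred_std, true_next: (N, d) predictions and outcomes on a held-out
    calibration set of transitions. Returns a scalar c such that the box region
    mean +/- c * std covers the true next state with probability >= 1 - alpha
    (marginally, under exchangeability).
    """
    scores = np.max(np.abs(true_next - pred_mean) / pred_std, axis=1)  # nonconformity
    n = len(scores)
    level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, level, method="higher")

# Toy usage: a model that under-reports its uncertainty by a factor of two.
rng = np.random.default_rng(0)
mean = rng.normal(size=(500, 2))
std = np.full((500, 2), 0.5)
truth = mean + rng.normal(scale=1.0, size=(500, 2))   # true noise is 2x the reported std
c = conformal_scale(mean, std, truth, alpha=0.1)
lo, hi = mean - c * std, mean + c * std               # calibrated prediction region
print(c)                                              # roughly 4: widened until ~90% coverage
```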
Recent breakthroughs in text-to-image models have opened up promising research avenues in personalized image generation, enabling users to create diverse images of a specific subject using natural language prompts. However, existing methods often suffer from performance degradation when given only a single reference image. They tend to overfit the input, producing highly similar outputs regardless of the text prompt. This paper addresses the challenge of one-shot personalization by mitigating overfitting, enabling the creation of controllable images through text prompts. Specifically, we propose a selective fine-tuning strategy that focuses on the text encoder. Furthermore, we introduce three key techniques to enhance personalization performance: (1) augmentation tokens to encourage feature disentanglement and alleviate overfitting, (2) a knowledge-preservation loss to reduce language drift and promote generalizability across diverse prompts, and (3) SNR-weighted sampling for efficient training. Extensive experiments demonstrate that our approach efficiently generates high-quality, diverse images using only a single reference image while significantly reducing memory and storage requirements.
https://arxiv.org/abs/2409.08248
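"SNR-weighted sampling" can be implemented in more than one way; the sketch below shows a closely related loss-reweighting form (min-SNR-style clipping of the per-timestep weight) on a standard DDPM schedule. The schedule, the clipping constant, and the choice to reweight the loss rather than the timestep sampling distribution are assumptions, not necessarily the paper's exact scheme.

```python
import torch

# DDPM-style schedule: SNR(t) = alpha_bar_t / (1 - alpha_bar_t).
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
snr = alphas_cumprod / (1.0 - alphas_cumprod)

def snr_weighted_loss(eps_pred, eps, t, gamma=5.0):
    """Per-sample noise-prediction MSE reweighted by a clipped SNR term.

    eps_pred, eps: (B, C, H, W) predicted and true noise; t: (B,) timesteps.
    The weight min(SNR(t), gamma) / SNR(t) down-weights very low-noise timesteps,
    which otherwise tend to dominate plain MSE training.
    """
    w = torch.clamp(snr[t], max=gamma) / snr[t]              # (B,)
    per_sample = ((eps_pred - eps) ** 2).mean(dim=(1, 2, 3))
    return (w * per_sample).mean()

# Toy usage
t = torch.randint(0, 1000, (4,))
eps = torch.randn(4, 4, 32, 32)
eps_pred = eps + 0.1 * torch.randn_like(eps)
print(snr_weighted_loss(eps_pred, eps, t).item())
```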
Clustering artworks based on style has many potential real-world applications, such as art recommendation, style-based search and retrieval, and the study of artistic style evolution in an artwork corpus. However, clustering artworks based on style remains a largely unaddressed problem. The few existing methods for clustering artworks rely principally on generic image feature representations derived from deep neural networks and do not specifically deal with artistic style. In this paper, we introduce and deliberate over the notion of style-based clustering of visual artworks. Our main objective is to explore neural feature representations and architectures that can be used for style-based clustering and observe their impact and effectiveness. We develop different methods and assess their relative efficacy for style-based clustering through qualitative and quantitative analysis by applying them to four artwork corpora and four curated synthetically styled datasets. Our analysis provides some key novel insights on architectures, feature representations, and evaluation methods suitable for style-based clustering.
https://arxiv.org/abs/2409.08245
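As one concrete baseline for the representations discussed above, the sketch below clusters Gram-matrix (style) statistics of CNN feature maps with K-means. The paper explores a range of neural representations and architectures; the Gram-plus-K-means pipeline and the random toy feature maps here are only illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def gram_style_feature(feature_map):
    """Gram-matrix style descriptor from a CNN feature map of shape (C, H, W).

    Channel-by-channel correlations capture texture/style statistics largely
    independent of spatial layout (as in neural style transfer).
    """
    c, h, w = feature_map.shape
    f = feature_map.reshape(c, h * w)
    gram = f @ f.T / (h * w)
    iu = np.triu_indices(c)              # keep the upper triangle only
    return gram[iu]

# Toy usage: 20 fake feature maps standing in for CNN activations of artworks,
# alternating between two synthetic "styles" that differ in texture energy.
rng = np.random.default_rng(0)
features = [rng.normal(size=(16, 8, 8)) * (1 + (i % 2)) for i in range(20)]
X = np.stack([gram_style_feature(f) for f in features])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)                            # alternating cluster ids
```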
While Text-to-Image (T2I) diffusion models excel at generating visually appealing images of individual instances, they struggle to accurately position multiple instances and control their feature generation. The Layout-to-Image (L2I) task was introduced to address the positioning challenges by incorporating bounding boxes as spatial control signals, but it still falls short in generating precise instance features. In response, we propose the Instance Feature Generation (IFG) task, which aims to ensure both positional accuracy and feature fidelity in generated instances. To address the IFG task, we introduce the Instance Feature Adapter (IFAdapter). The IFAdapter enhances feature depiction by incorporating additional appearance tokens and utilizing an Instance Semantic Map to align instance-level features with spatial locations. The IFAdapter guides the diffusion process as a plug-and-play module, making it adaptable to various community models. For evaluation, we contribute an IFG benchmark and develop a verification pipeline to objectively compare models' abilities to generate instances with accurate positioning and features. Experimental results demonstrate that IFAdapter outperforms other models in both quantitative and qualitative evaluations.
https://arxiv.org/abs/2409.08240
Large Language Models still struggle in challenging scenarios that leverage structured data, complex reasoning, or tool usage. In this paper, we propose Source2Synth: a new method that can be used for teaching LLMs new skills without relying on costly human annotations. Source2Synth takes as input a custom data source and produces synthetic data points with intermediate reasoning steps grounded in real-world sources. Source2Synth improves the dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two challenging domains: we test reasoning abilities in multi-hop question answering (MHQA), and tool usage in tabular question answering (TQA). Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotPotQA compared to the fine-tuned baselines.
https://arxiv.org/abs/2409.08239
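A hedged sketch of the answerability-based curation step: a synthetic example is kept only if a model can reproduce its answer from the grounding source alone. The data fields, the `llm_answer` stand-in, and the strict all-attempts criterion are placeholders, not the paper's exact procedure.

```python
from dataclasses import dataclass

@dataclass
class SyntheticExample:
    source_passage: str      # real-world grounding text
    question: str            # generated question
    reasoning_steps: str     # generated intermediate reasoning
    answer: str              # generated answer

def llm_answer(question: str, context: str) -> str:
    """Stand-in for an LLM call; a trivial heuristic so the sketch runs."""
    return context.split()[-1]           # pretend the answer is the last word

def filter_by_answerability(examples, n_attempts=3):
    """Keep an example only if its answer is reproducible from the source alone."""
    kept = []
    for ex in examples:
        hits = sum(
            llm_answer(ex.question, ex.source_passage).strip().lower()
            == ex.answer.strip().lower()
            for _ in range(n_attempts)
        )
        if hits == n_attempts:           # strict: answerable on every attempt
            kept.append(ex)
    return kept

# Toy usage: one answerable example survives, one unanswerable example is dropped.
good = SyntheticExample("The capital of France is Paris",
                        "What is the capital of France?", "...", "Paris")
bad = SyntheticExample("The capital of France is Paris",
                       "Who painted the Mona Lisa?", "...", "Leonardo")
print(len(filter_by_answerability([good, bad])))    # -> 1
```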
The rapid evolution of cyber threats necessitates innovative solutions for detecting and analyzing malicious activity. Honeypots, which are decoy systems designed to lure and interact with attackers, have emerged as a critical component in cybersecurity. In this paper, we present a novel approach to creating realistic and interactive honeypot systems using Large Language Models (LLMs). By fine-tuning a pre-trained open-source language model on a diverse dataset of attacker-generated commands and responses, we developed a honeypot capable of sophisticated engagement with attackers. Our methodology involved several key steps: data collection and processing, prompt engineering, model selection, and supervised fine-tuning to optimize the model's performance. Evaluation through similarity metrics and live deployment demonstrated that our approach effectively generates accurate and informative responses. The results highlight the potential of LLMs to revolutionize honeypot technology, providing cybersecurity professionals with a powerful tool to detect and analyze malicious activity, thereby enhancing overall security infrastructure.
https://arxiv.org/abs/2409.08234
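One plausible shape for the data-collection and prompt-engineering steps is to serialize attacker sessions into chat-style records for supervised fine-tuning, as sketched below. The system prompt, field names, and JSONL layout are illustrative assumptions, not taken from the paper.

```python
import json

# Illustrative prompt-engineering choice, not the paper's exact system prompt.
SYSTEM_PROMPT = "You are a Linux server shell. Reply only with realistic terminal output."

def to_sft_records(sessions):
    """Serialize attacker sessions into chat-style records for supervised fine-tuning.

    sessions: list of sessions, each a list of (command, observed_response) pairs
    collected from a real or emulated honeypot.
    """
    records = []
    for session in sessions:
        messages = [{"role": "system", "content": SYSTEM_PROMPT}]
        for command, response in session:
            messages.append({"role": "user", "content": command})
            messages.append({"role": "assistant", "content": response})
        records.append({"messages": messages})
    return records

# Toy usage: one short session written out as JSONL for a fine-tuning job.
sessions = [[("whoami", "root"),
             ("uname -a", "Linux honeypot 5.15.0 x86_64 GNU/Linux")]]
with open("honeypot_sft.jsonl", "w") as f:
    for rec in to_sft_records(sessions):
        f.write(json.dumps(rec) + "\n")
```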
Recent successes in applying reinforcement learning (RL) for robotics has shown it is a viable approach for constructing robotic controllers. However, RL controllers can produce many collisions in environments where new obstacles appear during execution. This poses a problem in safety-critical settings. We present a hybrid approach, called iKinQP-RL, that uses an Inverse Kinematics Quadratic Programming (iKinQP) controller to correct actions proposed by an RL policy at runtime. This ensures safe execution in the presence of new obstacles not present during training. Preliminary experiments illustrate our iKinQP-RL framework completely eliminates collisions with new obstacles while maintaining a high task success rate.
https://arxiv.org/abs/2409.08233
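The runtime correction idea can be illustrated with a small constrained projection: find the action closest to the RL proposal that respects joint limits and a minimum clearance to known obstacles. The sketch below uses SciPy's SLSQP as a generic solver with a toy clearance model; it is not the iKinQP formulation itself.

```python
import numpy as np
from scipy.optimize import minimize

def correct_action(a_rl, q, joint_lims, clearance, d_min=0.05, dt=0.05):
    """Project the RL policy's action onto simple safety constraints.

    a_rl:       (n,) joint-velocity command proposed by the RL policy
    q:          (n,) current joint positions
    joint_lims: (n, 2) lower/upper joint position limits
    clearance:  callable mapping next joint positions to obstacle clearance (m)
    Solves  min ||a - a_rl||^2  s.t.  position limits and clearance >= d_min.
    """
    cons = [
        {"type": "ineq", "fun": lambda a: (q + dt * a) - joint_lims[:, 0]},
        {"type": "ineq", "fun": lambda a: joint_lims[:, 1] - (q + dt * a)},
        {"type": "ineq", "fun": lambda a: clearance(q + dt * a) - d_min},
    ]
    res = minimize(lambda a: np.sum((a - a_rl) ** 2), x0=a_rl, constraints=cons)
    return res.x

# Toy usage: a 2-joint arm whose clearance shrinks as joint 0 moves forward.
q = np.array([0.2, -0.1])
lims = np.array([[-1.0, 1.0], [-1.0, 1.0]])
clearance = lambda qn: 0.2 - 0.5 * qn[0]
print(correct_action(np.array([3.0, 0.0]), q, lims, clearance))
# -> roughly [2.0, 0.0]: the aggressive joint-0 command is scaled back to keep clearance
```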
Segmenting brain tumors in multi-parametric magnetic resonance imaging enables performing quantitative analysis in support of clinical trials and personalized patient care. This analysis provides the potential to impact clinical decision-making processes, including diagnosis and prognosis. In 2023, the well-established Brain Tumor Segmentation (BraTS) challenge presented a substantial expansion with eight tasks and 4,500 brain tumor cases. In this paper, we present a deep learning-based ensemble strategy that is evaluated for newly included tumor cases in three tasks: pediatric brain tumors (PED), intracranial meningioma (MEN), and brain metastases (MET). In particular, we ensemble outputs from state-of-the-art nnU-Net and Swin UNETR models on a region-wise basis. Furthermore, we implemented a targeted post-processing strategy based on a cross-validated threshold search to improve the segmentation results for tumor sub-regions. The evaluation of our proposed method on unseen test cases for the three tasks resulted in lesion-wise Dice scores for PED: 0.653, 0.809, 0.826; MEN: 0.876, 0.867, 0.849; and MET: 0.555, 0.6, 0.58; for the enhancing tumor, tumor core, and whole tumor, respectively. Our method was ranked first for PED, third for MEN, and fourth for MET, respectively.
https://arxiv.org/abs/2409.08232
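A hedged sketch of the cross-validated threshold search used for post-processing: predictions of a tumor sub-region smaller than a size threshold are discarded, and the threshold is selected to maximize mean Dice on validation cases. The size-based criterion and the Dice objective are plausible choices, not necessarily the paper's exact rule.

```python
import numpy as np

def dice(pred, gt, eps=1e-6):
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def apply_threshold(prob_map, prob_thr=0.5, min_voxels=200):
    """Binarize a sub-region probability map and drop the prediction if too small.

    Removing tiny predictions avoids heavily penalized false-positive lesions in
    lesion-wise scoring; min_voxels is the value being searched over.
    """
    mask = prob_map > prob_thr
    if mask.sum() < min_voxels:
        mask[:] = False
    return mask

def search_min_voxels(val_probs, val_gts, candidates=(0, 50, 100, 200, 500, 1000)):
    """Pick the size threshold that maximizes mean Dice over validation cases."""
    best_thr, best_score = candidates[0], -1.0
    for thr in candidates:
        score = np.mean([dice(apply_threshold(p, min_voxels=thr), g)
                         for p, g in zip(val_probs, val_gts)])
        if score > best_score:
            best_thr, best_score = thr, score
    return best_thr

# Toy usage: one case with a real 6x6x6 lesion, one with only a 3x3x3 spurious blob.
prob_a = np.full((16, 16, 16), 0.1); prob_a[2:8, 2:8, 2:8] = 0.9
gt_a = np.zeros((16, 16, 16), dtype=bool); gt_a[2:8, 2:8, 2:8] = True
prob_b = np.full((16, 16, 16), 0.1); prob_b[0:3, 0:3, 0:3] = 0.9
gt_b = np.zeros((16, 16, 16), dtype=bool)
print(search_min_voxels([prob_a, prob_b], [gt_a, gt_b]))
# -> 50: large enough to drop the spurious blob, small enough to keep the real lesion
```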