We consider the problem of efficiently representing casually captured monocular videos in a spatially- and temporally-coherent manner. While existing approaches predominantly rely on 2D/2.5D techniques treating videos as collections of spatiotemporal pixels, they struggle with complex motions, occlusions, and geometric consistency due to absence of temporal coherence and explicit 3D structure. Drawing inspiration from monocular video as a projection of the dynamic 3D world, we explore representing videos in their intrinsic 3D form through continuous flows of Gaussian primitives in space-time. In this paper, we propose NutWorld, a novel framework that efficiently transforms monocular videos into dynamic 3D Gaussian representations in a single forward pass. At its core, NutWorld introduces a structured spatial-temporal aligned Gaussian (STAG) representation, enabling optimization-free scene modeling with effective depth and flow regularization. Through comprehensive experiments, we demonstrate that NutWorld achieves high-fidelity video reconstruction quality while enabling various downstream applications in real-time. Demos and code will be available at this https URL.
https://arxiv.org/abs/2502.03465
When deploying large language models (LLMs), it is important to ensure that these models are not only capable, but also reliable. Many benchmarks have been created to track LLMs' growing capabilities, however there has been no similar focus on measuring their reliability. To understand the potential ramifications of this gap, we investigate how well current benchmarks quantify model reliability. We find that pervasive label errors can compromise these evaluations, obscuring lingering model failures and hiding unreliable behavior. Motivated by this gap in the evaluation of reliability, we then propose the concept of so-called platinum benchmarks, i.e., benchmarks carefully curated to minimize label errors and ambiguity. As a first attempt at constructing such benchmarks, we revise examples from fifteen existing popular benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks such as elementary-level math word problems. Analyzing these failures further reveals previously unidentified patterns of problems on which frontier models consistently struggle. We provide code at this https URL
https://arxiv.org/abs/2502.03461
Small language models (SLMs) have attracted considerable attention from both academia and industry due to their broad range of applications in edge devices. To obtain SLMs with strong performance, conventional approaches either pre-train the models from scratch, which incurs substantial computational costs, or compress/prune existing large language models (LLMs), which results in performance drops and falls short in comparison to pre-training. In this paper, we investigate the family of acceleration methods that involve both structured pruning and model training. We find that 1) layer-wise adaptive pruning (Adapt-Pruner) is extremely effective in LLMs and yields significant improvements over existing pruning techniques, 2) adaptive pruning equipped with further training leads to models comparable to those pre-trained from scratch, and 3) incremental pruning brings a non-trivial performance gain by interleaving pruning with training and removing only a small portion of neurons (~5%) at a time. Experimental results on LLaMA-3.1-8B demonstrate that Adapt-Pruner outperforms conventional pruning methods, such as LLM-Pruner, FLAP, and SliceGPT, by an average of 1%-7% in accuracy on commonsense benchmarks. Additionally, Adapt-Pruner restores the performance of MobileLLM-125M to the level of the 600M model on the MMLU benchmark with 200× fewer tokens via pruning from its larger counterparts, and discovers a new 1B model that surpasses LLaMA-3.2-1B in multiple benchmarks.
https://arxiv.org/abs/2502.03460
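The incremental pruning idea in the abstract above (remove roughly 5% of neurons at a time, with training in between) can be sketched as a simple loop. This is a toy illustration under stated assumptions, not Adapt-Pruner's implementation: the magnitude-based importance score, the `train_step` callback, and every name below are hypothetical.

```python
from typing import Callable, List


def incremental_prune(
    weights: List[float],
    target_keep: int,
    step_fraction: float = 0.05,
    train_step: Callable[[List[float]], List[float]] = lambda w: w,
) -> List[float]:
    """Repeatedly drop the lowest-magnitude ~5% of neurons, training in between."""
    current = list(weights)
    while len(current) > target_keep:
        # Remove at least one neuron per round, never going below the target.
        n_remove = max(1, int(len(current) * step_fraction))
        n_remove = min(n_remove, len(current) - target_keep)
        # Hypothetical importance score: absolute weight magnitude.
        order = sorted(range(len(current)), key=lambda i: abs(current[i]))
        drop = set(order[:n_remove])
        current = [w for i, w in enumerate(current) if i not in drop]
        # Recovery training between pruning rounds (identity by default).
        current = train_step(current)
    return current


# Toy usage: 100 "neurons" pruned down to 50 in small steps.
pruned = incremental_prune([0.1 * i for i in range(1, 101)], target_keep=50)
```

Removing a small fraction per round, rather than everything at once, gives the recovery step a chance to adapt the remaining weights before the next cut; here the recovery step is a stand-in that does nothing.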
The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes. In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton-language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks.
https://arxiv.org/abs/2502.03459
Online planning and execution of minimum-time maneuvers on three-dimensional (3D) circuits is an open challenge in autonomous vehicle racing. In this paper, we present an artificial race driver (ARD) to learn the vehicle dynamics, plan and execute minimum-time maneuvers on a 3D track. ARD integrates a novel kineto-dynamical (KD) vehicle model for trajectory planning with economic nonlinear model predictive control (E-NMPC). We use a high-fidelity vehicle simulator (VS) to compare the closed-loop ARD results with a minimum-lap-time optimal control problem (MLT-VS), solved offline with the same VS. Our ARD sets lap times close to the MLT-VS, and the new KD model outperforms a literature benchmark. Finally, we study the vehicle trajectories, to assess the re-planning capabilities of ARD under execution errors. A video with the main results is available as supplementary material.
https://arxiv.org/abs/2502.03454
Scene graphs have emerged as a structured and serializable environment representation for grounded spatial reasoning with Large Language Models (LLMs). In this work, we propose SG-RwR, a Schema-Guided Retrieve-while-Reason framework for reasoning and planning with scene graphs. Our approach employs two cooperative, code-writing LLM agents: (1) a Reasoner for task planning and information query generation, and (2) a Retriever for extracting the corresponding graph information following the queries. The two agents collaborate iteratively, enabling sequential reasoning and adaptive attention to graph information. Unlike prior works, both agents are prompted only with the scene graph schema rather than the full graph data, which reduces hallucination by limiting input tokens and drives the Reasoner to generate a reasoning trace. Given the trace, the Retriever programmatically queries the scene graph data based on its understanding of the schema, allowing dynamic and global attention on the graph that enhances alignment between reasoning and retrieval. Through experiments in multiple simulation environments, we show that our framework surpasses existing LLM-based approaches in numerical Q&A and planning tasks, and can benefit from task-level few-shot examples, even in the absence of agent-level demonstrations. Project code will be released.
https://arxiv.org/abs/2502.03450
Recent advances in large models have significantly advanced image-to-3D reconstruction. However, the generated models are often fused into a single piece, limiting their applicability in downstream tasks. This paper focuses on 3D garment generation, a key area for applications like virtual try-on with dynamic garment animations, which require garments to be separable and simulation-ready. We introduce Dress-1-to-3, a novel pipeline that reconstructs physics-plausible, simulation-ready separated garments with sewing patterns and humans from an in-the-wild image. Starting with the image, our approach combines a pre-trained image-to-sewing pattern generation model for creating coarse sewing patterns with a pre-trained multi-view diffusion model to produce multi-view images. The sewing pattern is further refined using a differentiable garment simulator based on the generated multi-view images. Versatile experiments demonstrate that our optimization approach substantially enhances the geometric alignment of the reconstructed 3D garments and humans with the input image. Furthermore, by integrating a texture generation module and a human motion generation module, we produce customized physics-plausible and realistic dynamic garment demonstrations. Project page: this https URL
https://arxiv.org/abs/2502.03449
Recent advances in latent diffusion models have demonstrated their effectiveness for high-resolution image synthesis. However, the properties of the latent space from tokenizer for better learning and generation of diffusion models remain under-explored. Theoretically and empirically, we find that improved generation quality is closely tied to the latent distributions with better structure, such as the ones with fewer Gaussian Mixture modes and more discriminative features. Motivated by these insights, we propose MAETok, an autoencoder (AE) leveraging mask modeling to learn semantically rich latent space while maintaining reconstruction fidelity. Extensive experiments validate our analysis, demonstrating that the variational form of autoencoders is not necessary, and a discriminative latent space from AE alone enables state-of-the-art performance on ImageNet generation using only 128 tokens. MAETok achieves significant practical improvements, enabling a gFID of 1.69 with 76x faster training and 31x higher inference throughput for 512x512 generation. Our findings show that the structure of the latent space, rather than variational constraints, is crucial for effective diffusion models. Code and trained models are released.
https://arxiv.org/abs/2502.03444
Recent advancements in large language models (LLMs) have spurred growing interest in automatic theorem proving using Lean4, where effective tree search methods are crucial for navigating proof search spaces. While the existing approaches primarily rely on value functions and Monte Carlo Tree Search (MCTS), the potential of simpler methods like Best-First Search (BFS) remains underexplored. This paper investigates whether BFS can achieve competitive performance in large-scale theorem proving tasks. We present BFS-Prover, a scalable expert iteration framework, featuring three key innovations. First, we implement strategic data filtering at each expert iteration round, excluding problems solvable via beam search node expansion to focus on harder cases. Second, we improve the sample efficiency of BFS through Direct Preference Optimization (DPO) applied to state-tactic pairs automatically annotated with compiler error feedback, refining the LLM's policy to prioritize productive expansions. Third, we employ length normalization in BFS to encourage exploration of deeper proof paths. BFS-Prover achieves a score of 71.31 on the MiniF2F test set and therefore challenges the perceived necessity of complex tree search methods, demonstrating that BFS can achieve competitive performance when properly scaled.
https://arxiv.org/abs/2502.03438
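The third innovation in the abstract above, length normalization in best-first search, can be illustrated with a small sketch. The abstract does not give the scoring formula, so the form used here (cumulative log-probability divided by path length raised to a power `alpha`) is an assumption borrowed from common practice, and the `expand` and `is_goal` callbacks and the toy graph are hypothetical stand-ins for a tactic generator and a proof checker.

```python
import heapq
from typing import Callable, Hashable, Iterable, List, Optional, Tuple


def best_first_search(
    start: Hashable,
    expand: Callable[[Hashable], Iterable[Tuple[Hashable, float]]],
    is_goal: Callable[[Hashable], bool],
    alpha: float = 0.5,
    max_expansions: int = 1000,
) -> Optional[List[Hashable]]:
    """Always expand the node with the best length-normalized log-probability."""
    # Heap entries: (negated score, tie-breaker, cumulative log-prob, path).
    counter = 0
    frontier = [(0.0, counter, 0.0, [start])]
    seen = {start}
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, _, logp, path = heapq.heappop(frontier)
        state = path[-1]
        if is_goal(state):
            return path
        for child, step_logp in expand(state):
            if child in seen:
                continue
            seen.add(child)
            new_logp = logp + step_logp
            depth = len(path)  # number of steps taken to reach the child
            # Assumed normalization: dividing by depth**alpha makes long
            # paths less heavily penalized than under the raw sum.
            score = new_logp / (depth ** alpha)
            counter += 1
            heapq.heappush(frontier, (-score, counter, new_logp, path + [child]))
    return None


# Toy search space: the likelier first step leads to a dead end.
graph = {
    "root": [("a", -0.1), ("b", -1.0)],
    "a": [("dead_end", -0.1)],
    "b": [("goal", -0.1)],
}
path = best_first_search("root", lambda s: graph.get(s, []), lambda s: s == "goal")
```

With `alpha = 0` the score reduces to the plain cumulative log-probability; larger values shift the search toward deeper nodes.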
Following recent advancements in computer-aided detection and diagnosis systems for colonoscopy, the automated reporting of colonoscopy procedures is set to further revolutionize clinical practice. A crucial yet underexplored aspect in the development of these systems is the creation of computer vision models capable of autonomously segmenting full-procedure colonoscopy videos into anatomical sections and procedural phases. In this work, we aim to create the first open-access dataset for this task and propose a state-of-the-art approach, benchmarked against competitive models. We annotated the publicly available REAL-Colon dataset, consisting of 2.7 million frames from 60 complete colonoscopy videos, with frame-level labels for anatomical locations and colonoscopy phases across nine categories. We then present ColonTCN, a learning-based architecture that employs custom temporal convolutional blocks designed to efficiently capture long temporal dependencies for the temporal segmentation of colonoscopy videos. We also propose a dual k-fold cross-validation evaluation protocol for this benchmark, which includes model assessment on unseen, multi-center data. ColonTCN achieves state-of-the-art performance in classification accuracy while maintaining a low parameter count when evaluated using the two proposed k-fold cross-validation settings, outperforming competitive models. We report ablation studies to provide insights into the challenges of this task and highlight the benefits of the custom temporal convolutional blocks, which enhance learning and improve model efficiency. We believe that the proposed open-access benchmark and the ColonTCN approach represent a significant advancement in the temporal segmentation of colonoscopy procedures, fostering further open-access research to address this clinical need.
https://arxiv.org/abs/2502.03430
Unified multimodal large language models (U-MLLMs) have demonstrated impressive performance in visual understanding and generation in an end-to-end pipeline. Compared with generation-only models (e.g., Stable Diffusion), U-MLLMs may raise new questions about bias in their outputs, which can be affected by their unified capabilities. This gap is particularly concerning given the under-explored risk of propagating harmful stereotypes. In this paper, we benchmark the latest U-MLLMs and find that most exhibit significant demographic biases, such as gender and race bias. To better understand and mitigate this issue, we propose a locate-then-fix strategy, where we audit and show how the individual model component is affected by bias. Our analysis shows that bias originates primarily from the language model. More interestingly, we observe a "partial alignment" phenomenon in U-MLLMs, where understanding bias appears minimal, but generation bias remains substantial. Thus, we propose a novel balanced preference model to balance the demographic distribution with synthetic data. Experiments demonstrate that our approach reduces demographic bias while preserving semantic fidelity. We hope our findings underscore the need for more holistic interpretation and debiasing strategies of U-MLLMs in the future.
https://arxiv.org/abs/2502.03429
Pose-Guided Person Image Synthesis (PGPIS) generates images that maintain a subject's identity from a source image while adopting a specified target pose (e.g., skeleton). While diffusion-based PGPIS methods effectively preserve facial features during pose transformation, they often struggle to accurately maintain clothing details from the source image throughout the diffusion process. This limitation becomes particularly problematic when there is a substantial difference between the source and target poses, significantly impacting PGPIS applications in the fashion industry where clothing style preservation is crucial for copyright protection. Our analysis reveals that this limitation primarily stems from the conditional diffusion model's attention modules failing to adequately capture and preserve clothing patterns. To address this limitation, we propose human-parsing-guided attention diffusion, a novel approach that effectively preserves both facial and clothing appearance while generating high-quality results. We propose a human-parsing-aware Siamese network that consists of three key components: dual identical UNets (TargetNet for diffusion denoising and SourceNet for source image embedding extraction), a human-parsing-guided fusion attention (HPFA), and a CLIP-guided attention alignment (CAA). The HPFA and CAA modules can embed the face and clothes patterns into the target image generation adaptively and effectively. Extensive experiments on both the in-shop clothes retrieval benchmark and the latest in-the-wild human editing dataset demonstrate our method's significant advantages over 13 baseline approaches for preserving both facial and clothes appearance in the source image.
https://arxiv.org/abs/2502.03426
Explaining deep neural networks is challenging, due to their large size and non-linearity. In this paper, we introduce a concept-based explanation method that explains the prediction for an individual class and also contrasts any two classes, i.e., explains why the model predicts one class over the other. We test it on several openly available classification models trained on ImageNet1K, as well as on a segmentation model trained to detect tumors in stained tissue samples. We perform both qualitative and quantitative tests. For example, for a ResNet50 model from the PyTorch model zoo, we can use the explanation for why the model predicts a class 'A' to automatically select six dataset crops where the model does not predict class 'A'. The model then predicts class 'A' again for the newly combined image in 71% of the cases (works for 710 out of the 1000 classes). The code including an .ipynb example is available on git: this https URL.
https://arxiv.org/abs/2502.03422
Text-to-image generative models have shown remarkable progress in producing diverse and photorealistic outputs. In this paper, we present a comprehensive analysis of their effectiveness in creating synthetic portraits that accurately represent various demographic attributes, with a special focus on age, nationality, and gender. Our evaluation employs prompts specifying detailed profiles (e.g., Photorealistic selfie photo of a 32-year-old Canadian male), covering a broad spectrum of 212 nationalities, 30 distinct ages from 10 to 78, and balanced gender representation. We compare the generated images against ground truth age estimates from two established age estimation models to assess how faithfully age is depicted. Our findings reveal that although text-to-image models can consistently generate faces reflecting different identities, the accuracy with which they capture specific ages and do so across diverse demographic backgrounds remains highly variable. These results suggest that current synthetic data may be insufficiently reliable for high-stakes age-related tasks requiring robust precision, unless practitioners are prepared to invest in significant filtering and curation. Nevertheless, they may still be useful in less sensitive or exploratory applications, where absolute age precision is not critical.
https://arxiv.org/abs/2502.03420
Zero-shot prompting techniques have significantly improved the performance of Large Language Models (LLMs). However, we lack a clear understanding of why zero-shot prompts are so effective. For example, in the prompt "Let's think step-by-step," is "think" or "step-by-step" more crucial to its success? Existing interpretability methods, such as gradient-based and attention-based approaches, are computationally intensive and restricted to open-source models. We introduce the ZIP score (Zero-shot Importance of Perturbation score), a versatile metric applicable to both open and closed-source models, based on systematic input word perturbations. Our experiments across four recent LLMs, seven widely-used prompts, and several tasks, reveal interesting patterns in word importance. For instance, while both 'step-by-step' and 'think' show high ZIP scores, which one is more influential depends on the model and task. We validate our method using controlled experiments and compare our results with human judgments, finding that proprietary models align more closely with human intuition regarding word significance. These findings enhance our understanding of LLM behavior and contribute to developing more effective zero-shot prompts and improved model analysis.
https://arxiv.org/abs/2502.03418
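The perturbation idea behind the ZIP score can be sketched with a black-box model interface, which is why such a metric applies to closed-source models as well. The abstract does not define the score precisely, so the version below (accuracy drop when a single word is deleted) is an illustrative assumption rather than the paper's formula; `word_importance` and `toy_model` are hypothetical names, and the toy model is a stand-in for a real LLM call.

```python
from typing import Callable, Dict, List


def word_importance(
    prompt: str,
    questions: List[str],
    answers: List[str],
    model: Callable[[str, str], str],
) -> Dict[str, float]:
    """Score each prompt word by the accuracy drop when that word is removed."""

    def accuracy(p: str) -> float:
        correct = sum(model(p, q) == a for q, a in zip(questions, answers))
        return correct / len(questions)

    words = prompt.split()
    baseline = accuracy(prompt)
    scores: Dict[str, float] = {}
    for i, word in enumerate(words):
        # Perturbation: delete one word and keep the rest of the prompt.
        perturbed = " ".join(words[:i] + words[i + 1:])
        scores[word] = baseline - accuracy(perturbed)
    return scores


# Toy stand-in for an LLM: answers correctly only if the prompt says "think".
def toy_model(prompt: str, question: str) -> str:
    return "4" if "think" in prompt else "?"


scores = word_importance("Let's think step-by-step", ["2+2?"], ["4"], toy_model)
```

Only model outputs are needed, no gradients or attention weights. A repeated word would overwrite its earlier entry in this simple dictionary; keying by position would avoid that.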
We propose a novel approach for optimizing the graph ratio-cut by modeling the binary assignments as random variables. We provide an upper bound on the expected ratio-cut, as well as an unbiased estimate of its gradient, to learn the parameters of the assignment variables in an online setting. The clustering resulting from our probabilistic approach (PRCut) outperforms the Rayleigh quotient relaxation of the combinatorial problem, its online learning extensions, and several widely used methods. We demonstrate that the PRCut clustering closely aligns with the similarity measure and can perform as well as a supervised classifier when label-based similarities are provided. This novel approach can leverage out-of-the-box self-supervised representations to achieve competitive performance and serve as an evaluation method for the quality of these representations.
https://arxiv.org/abs/2502.03405
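For readers unfamiliar with the objective that PRCut optimizes, the deterministic ratio-cut of a fixed partition can be computed directly. The sketch below uses the common definition (for each cluster, the weight of edges leaving it divided by its size, summed over clusters; some texts add a factor of 1/2). It does not implement the paper's probabilistic bound or gradient estimate, and the function name and toy graph are illustrative.

```python
from typing import Dict, List, Tuple


def ratio_cut(edges: List[Tuple[int, int, float]], labels: Dict[int, int]) -> float:
    """Sum over clusters of (weight of edges leaving the cluster) / (cluster size)."""
    sizes: Dict[int, int] = {}
    for cluster in labels.values():
        sizes[cluster] = sizes.get(cluster, 0) + 1
    cut: Dict[int, float] = {c: 0.0 for c in sizes}
    for u, v, w in edges:
        if labels[u] != labels[v]:
            # A crossing edge leaves both of its endpoint clusters.
            cut[labels[u]] += w
            cut[labels[v]] += w
    return sum(cut[c] / sizes[c] for c in sizes)


# Two triangles joined by a single edge (2, 3).
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0), (2, 3, 1.0)]
good = ratio_cut(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1})
bad = ratio_cut(edges, {0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1})
```

Splitting along the bridge edge gives a ratio-cut of 2/3, while isolating a single node gives 2.4; dividing by cluster size is what penalizes the unbalanced split.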
Task offloading management in 6G vehicular networks is crucial for maintaining network efficiency, particularly as vehicles generate substantial data. Integrating secure communication through authentication introduces additional computational and communication overhead, significantly impacting offloading efficiency and latency. This paper presents a unified framework incorporating lightweight Identity-Based Cryptographic (IBC) authentication into task offloading within cloud-based 6G Vehicular Twin Networks (VTNs). Utilizing Proximal Policy Optimization (PPO) in Deep Reinforcement Learning (DRL), our approach optimizes authenticated offloading decisions to minimize latency and enhance resource allocation. Performance evaluation under varying network sizes, task sizes, and data rates reveals that IBC authentication can reduce offloading efficiency by up to 50% due to the added overhead. Besides, increasing network size and task size can further reduce offloading efficiency by up to 91.7%. As a countermeasure, increasing the transmission data rate can improve the offloading performance by as much as 63%, even in the presence of authentication overhead. The code for the simulations and experiments detailed in this paper is available on GitHub for further reference and reproducibility [1].
https://arxiv.org/abs/2502.03403
Aligning Large Language Models to integrate and reflect human values, especially for tasks that demand intricate human oversight, is arduous since it is resource-intensive and time-consuming to depend on human expertise for context-specific guidance. Prior work has utilized predefined sets of rules or principles to steer the behavior of models (Bai et al., 2022; Sun et al., 2023). However, these principles tend to be generic, making it challenging to adapt them to each individual input query or context. In this work, we present Situated-PRInciples (SPRI), a framework requiring minimal or no human effort that is designed to automatically generate guiding principles in real-time for each input query and utilize them to align each response. We evaluate SPRI on three tasks, and show that 1) SPRI can derive principles in a complex domain-specific task that leads to on-par performance as expert-crafted ones; 2) SPRI-generated principles lead to instance-specific rubrics that outperform prior LLM-as-a-judge frameworks; 3) using SPRI to generate synthetic SFT data leads to substantial improvement on truthfulness. We release our code and model generations at this https URL.
https://arxiv.org/abs/2502.03397
Creating a Digital Twin (DT) for Healthcare Intelligent Transportation Systems (HITS) is a hot research trend focusing on enhancing HITS management, particularly in emergencies where ambulance vehicles must arrive at the crash scene on time and tracking their real-time location is crucial to the medical authorities. Despite the claim of real-time representation, a temporal misalignment persists between the physical and virtual domains, leading to discrepancies in the ambulance's location representation. This study proposes integrating AI predictive models, specifically Support Vector Regression (SVR) and Deep Neural Networks (DNN), within a constructed mock DT data pipeline framework to anticipate the medical vehicle's next location in the virtual world. These models align virtual representations with their physical counterparts, i.e., they metaphorically offset the synchronization delay between the two worlds. Trained meticulously on a historical geospatial dataset, SVR and DNN exhibit exceptional prediction accuracy in MATLAB and Python environments. Through various testing scenarios, we visually demonstrate the efficacy of our methodology, showcasing SVR and DNN's key role in significantly reducing the observed gap within the HITS's DT. This transformative approach enhances real-time synchronization in emergency HITS by approximately 88% to 93%.
https://arxiv.org/abs/2502.03396
Time series forecasting is essential for operational intelligence in the hospitality industry, and particularly challenging in large-scale, distributed systems. This study evaluates the performance of statistical, machine learning (ML), deep learning, and foundation models in forecasting hourly sales over a 14-day horizon using real-world data from a network of thousands of restaurants across Germany. The forecasting solution includes features such as weather conditions, calendar events, and time-of-day patterns. Results demonstrate the strong performance of ML-based meta-models and highlight the emerging potential of foundation models like Chronos and TimesFM, which deliver competitive performance with minimal feature engineering, leveraging only the pre-trained model (zero-shot inference). Additionally, a hybrid PySpark-Pandas approach proves to be a robust solution for achieving horizontal scalability in large-scale deployments.
https://arxiv.org/abs/2502.03395