Recovering 3D Human-Object Interaction (HOI) from single color images is challenging due to depth ambiguities, occlusions, and the huge variation in object shape and appearance. Thus, past work requires controlled settings such as known object shapes and contacts, and tackles only limited object classes. Instead, we need methods that generalize to natural images and novel object classes. We tackle this in two main ways: (1) We collect PICO-db, a new dataset of natural images uniquely paired with dense 3D contact on both body and object meshes. To this end, we use images from the recent DAMON dataset that are paired with contacts, but these contacts are only annotated on a canonical 3D body. In contrast, we seek contact labels on both the body and the object. To infer these given an image, we retrieve an appropriate 3D object mesh from a database by leveraging vision foundation models. Then, we project DAMON's body contact patches onto the object via a novel method needing only 2 clicks per patch. This minimal human input establishes rich contact correspondences between bodies and objects. (2) We exploit our new dataset of contact correspondences in a novel render-and-compare fitting method, called PICO-fit, to recover 3D body and object meshes in interaction. PICO-fit infers contact for the SMPL-X body, retrieves a likely 3D object mesh and contact from PICO-db for that object, and uses the contact to iteratively fit the 3D body and object meshes to image evidence via optimization. Uniquely, PICO-fit works well for many object categories that no existing method can tackle. This is crucial to enable HOI understanding to scale in the wild. Our data and code are available at this https URL.
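To make the fitting stage concrete, the purely illustrative sketch below shows what a contact-driven fitting objective of this kind can look like; `forward_smplx`, `transform_object`, and `project` are hypothetical placeholders, and the actual PICO-fit energy also includes render-and-compare and prior terms not shown here.

```python
# Purely illustrative sketch of a contact-driven fitting loss in the spirit of
# the optimization described above. The helpers passed in are hypothetical
# placeholders, not the paper's API.
import torch

def fit_step(body_params, obj_pose, body_contact_idx, obj_contact_idx,
             keypoints_2d, forward_smplx, transform_object, project,
             w_contact=1.0, w_2d=1.0):
    body_verts, joints = forward_smplx(body_params)   # (Vb, 3) vertices, (J, 3) joints
    obj_verts = transform_object(obj_pose)            # (Vo, 3) posed object vertices
    # Contact term: corresponding body/object contact vertices should coincide.
    e_contact = (body_verts[body_contact_idx] - obj_verts[obj_contact_idx]).norm(dim=-1).mean()
    # Image-evidence term: reprojected joints should match detected 2D keypoints.
    e_2d = (project(joints) - keypoints_2d).norm(dim=-1).mean()
    return w_contact * e_contact + w_2d * e_2d
```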
https://arxiv.org/abs/2504.17695
Augmented reality (AR) applications for construction monitoring rely on real-time environmental tracking to visualize architectural elements. However, construction sites present significant challenges for traditional tracking methods due to featureless surfaces, dynamic changes, and drift accumulation, leading to misalignment between digital models and the physical world. This paper proposes a BIM-aware drift correction method to address these challenges. Instead of relying solely on SLAM-based localization, we align ``as-built'' detected planes from the real-world environment with ``as-planned'' architectural planes in BIM. Our method performs robust plane matching and computes a transformation (TF) between SLAM (S) and BIM (B) origin frames using optimization techniques, minimizing drift over time. By incorporating BIM as prior structural knowledge, we can achieve improved long-term localization and enhanced AR visualization accuracy in noisy construction environments. The method is evaluated through real-world experiments, showing significant reductions in drift-induced errors and optimized alignment consistency. On average, our system achieves a reduction of 52.24% in angular deviations and a reduction of 60.8% in the distance error of the matched walls compared to the initial manual alignment by the user.
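As a rough illustration of the alignment step, the sketch below estimates a rigid transform between the SLAM and BIM frames from matched plane pairs, using a Kabsch/SVD fit on the plane normals and a least-squares fit on the plane offsets; this is a generic recipe under those assumptions, not the paper's exact optimization, which also handles robust matching and drift over time.

```python
# Generic plane-to-plane rigid alignment sketch: planes given as unit normals n
# and offsets d with n . x = d. Assumes at least three non-parallel matched pairs.
import numpy as np

def align_slam_to_bim(n_slam, d_slam, n_bim, d_bim):
    # n_slam, n_bim: (K, 3) matched unit normals; d_slam, d_bim: (K,) plane offsets.
    H = n_slam.T @ n_bim
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                      # rotation with R @ n_slam[i] ~ n_bim[i]
    # For x_bim = R x_slam + t, offsets satisfy d_bim = d_slam + n_bim . t,
    # so t is the least-squares solution of n_bim @ t = d_bim - d_slam.
    t, *_ = np.linalg.lstsq(n_bim, d_bim - d_slam, rcond=None)
    return R, t
```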
https://arxiv.org/abs/2504.17693
This study explores the potential of small language model (SLM) ensembles to achieve accuracy comparable to proprietary large language models (LLMs). We propose Ensemble Bayesian Inference (EBI), a novel approach that applies Bayesian estimation to combine judgments from multiple SLMs, allowing them to exceed the performance limitations of individual models. Our experiments on diverse tasks (aptitude assessments and consumer profile analysis in both Japanese and English) demonstrate EBI's effectiveness. Notably, we analyze cases where incorporating models with negative Lift values into ensembles improves overall performance, and we examine the method's efficacy across different languages. These findings suggest new possibilities for constructing high-performance AI systems with limited computational resources and for effectively utilizing models with individually lower performance. Building on existing research on LLM performance evaluation, ensemble methods, and open-source LLM utilization, we discuss the novelty and significance of our approach.
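For intuition, here is a minimal sketch of one plausible Bayesian combination of discrete judgments from several small models, using per-model confusion matrices estimated on validation data; the paper's EBI estimator, priors, and use of Lift values may differ from this naive-Bayes form.

```python
# Naive-Bayes-style combination of classifier judgments: estimate P(model says p | true y)
# per model on validation data, then combine test-time judgments into a MAP class.
import numpy as np

def fit_confusion(preds, labels, n_classes, eps=1.0):
    # preds, labels: (n_samples,) predictions and true classes for one model.
    conf = np.full((n_classes, n_classes), eps)        # Laplace smoothing
    for p, y in zip(preds, labels):
        conf[y, p] += 1
    return conf / conf.sum(axis=1, keepdims=True)      # rows: true class, cols: prediction

def ebi_combine(model_preds, confusions, prior):
    # model_preds: one predicted class per model for a single query.
    log_post = np.log(prior)
    for pred, conf in zip(model_preds, confusions):
        log_post += np.log(conf[:, pred])              # accumulate per-model log-likelihoods
    return int(np.argmax(log_post))                    # MAP class
```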
https://arxiv.org/abs/2504.17685
As large language models (LLMs) scale in size and adoption, their computational and environmental costs continue to rise. Prior benchmarking efforts have primarily focused on latency reduction in idealized settings, often overlooking the diverse real-world inference workloads that shape energy use. In this work, we systematically analyze the energy implications of common inference efficiency optimizations across diverse Natural Language Processing (NLP) and generative Artificial Intelligence (AI) workloads, including conversational AI and code generation. We introduce a modeling approach that approximates real-world LLM workflows through a binning strategy for input-output token distributions and batch size variations. Our empirical analysis spans software frameworks, decoding strategies, GPU architectures, online and offline serving settings, and model parallelism configurations. We show that the effectiveness of inference optimizations is highly sensitive to workload geometry, software stack, and hardware accelerators, demonstrating that naive energy estimates based on FLOPs or theoretical GPU utilization significantly underestimate real-world energy consumption. Our findings reveal that the proper application of relevant inference efficiency optimizations can reduce total energy use by up to 73% from unoptimized baselines. These insights provide a foundation for sustainable LLM deployment and inform energy-efficient design strategies for future AI infrastructure.
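A hedged sketch of the binning idea follows: requests are grouped into bins over input-token, output-token, and batch-size ranges, and measured energy is aggregated per bin to approximate a real workload; the bin edges and column names are illustrative, not the paper's.

```python
# Illustrative workload binning: aggregate measured per-request energy by
# (input tokens, output tokens, batch size) bins. Columns/edges are assumptions.
import pandas as pd

def bin_workload(df, in_edges=(0, 128, 512, 2048, 8192),
                 out_edges=(0, 64, 256, 1024, 4096)):
    df = df.copy()
    df["in_bin"] = pd.cut(df["input_tokens"], in_edges)
    df["out_bin"] = pd.cut(df["output_tokens"], out_edges)
    # Mean measured energy (J) and request count per workload-geometry bin.
    return (df.groupby(["in_bin", "out_bin", "batch_size"], observed=True)["energy_j"]
              .agg(["mean", "count"]))
```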
https://arxiv.org/abs/2504.17674
This study addresses the critical challenge of hallucination mitigation in Large Vision-Language Models (LVLMs) for Visual Question Answering (VQA) tasks through a Split Conformal Prediction (SCP) framework. While LVLMs excel in multi-modal reasoning, their outputs often exhibit hallucinated content with high confidence, posing risks in safety-critical applications. We propose a model-agnostic uncertainty quantification method that integrates dynamic threshold calibration and cross-modal consistency verification. By partitioning data into calibration and test sets, the framework computes nonconformity scores to construct prediction sets with statistical guarantees under user-defined risk levels ($\alpha$). Key innovations include: (1) rigorous control of \textbf{marginal coverage} to ensure empirical error rates remain strictly below $\alpha$; (2) dynamic adjustment of prediction set sizes inversely with $\alpha$, filtering low-confidence outputs; (3) elimination of prior distribution assumptions and retraining requirements. Evaluations on benchmarks (ScienceQA, MMMU) with eight LVLMs demonstrate that SCP enforces theoretical guarantees across all $\alpha$ values. The framework achieves stable performance across varying calibration-to-test split ratios, underscoring its robustness for real-world deployment in healthcare, autonomous systems, and other safety-sensitive domains. This work bridges the gap between theoretical reliability and practical applicability in multi-modal AI systems, offering a scalable solution for hallucination detection and uncertainty-aware decision-making.
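For reference, the core split conformal recipe can be sketched as follows, using the common 1 − p(true class) nonconformity score; the paper's exact scores and cross-modal consistency checks are not reproduced here.

```python
# Minimal split conformal prediction sketch for a classifier: calibrate a
# threshold on held-out data, then build prediction sets with coverage >= 1 - alpha.
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    n = len(cal_labels)
    # Nonconformity score on the calibration split: 1 - probability of the true class.
    cal_scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Conformal quantile with finite-sample correction.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(cal_scores, min(q_level, 1.0), method="higher")
    # Prediction set: every class whose softmax score clears the threshold.
    return test_probs >= 1.0 - qhat        # boolean mask, one row per test sample

def coverage(pred_sets, test_labels):
    # Empirical coverage: fraction of test samples whose true label is in the set.
    return pred_sets[np.arange(len(test_labels)), test_labels].mean()
```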
https://arxiv.org/abs/2504.17671
With the advent of large-scale 3D datasets, feed-forward 3D generative models, such as the Large Reconstruction Model (LRM), have gained significant attention and achieved remarkable success. However, we observe that RGB images often lead to conflicting training objectives and lack the necessary clarity for geometry reconstruction. In this paper, we revisit the inductive biases associated with mesh reconstruction and introduce DiMeR, a novel disentangled dual-stream feed-forward model for sparse-view mesh reconstruction. The key idea is to disentangle both the input and the framework into geometry and texture parts, thereby reducing the training difficulty for each part according to the Principle of Occam's Razor. Given that normal maps are strictly consistent with geometry and accurately capture surface variations, we utilize normal maps as the exclusive input for the geometry branch to reduce the complexity between the network's input and output. Moreover, we improve the mesh extraction algorithm to introduce 3D ground truth supervision. As for the texture branch, we use RGB images as input to obtain the textured mesh. Overall, DiMeR demonstrates robust capabilities across various tasks, including sparse-view reconstruction, single-image-to-3D, and text-to-3D. Numerous experiments show that DiMeR significantly outperforms previous methods, achieving over 30% improvement in Chamfer Distance on the GSO and OmniObject3D datasets.
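For reference, the Chamfer Distance quoted above is commonly computed in its symmetric form over points sampled from the two meshes; the sketch below shows this generic form, whose exact normalization may differ from the benchmark's.

```python
# Symmetric Chamfer Distance between two point sets sampled from meshes.
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(points_a, points_b):
    # points_a: (N, 3), points_b: (M, 3) sampled surface points.
    d_ab, _ = cKDTree(points_b).query(points_a)   # nearest-neighbour distances A -> B
    d_ba, _ = cKDTree(points_a).query(points_b)   # nearest-neighbour distances B -> A
    return d_ab.mean() + d_ba.mean()
```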
https://arxiv.org/abs/2504.17670
Assisting LLMs with code generation has improved their performance on mathematical reasoning tasks. However, the evaluation of code-assisted LLMs is generally restricted to execution correctness, lacking a rigorous evaluation of their generated programs. In this work, we bridge this gap by conducting an in-depth analysis of code-assisted LLMs' generated programs in response to math reasoning tasks. Our evaluation focuses on the extent to which LLMs ground their programs in math rules, and how that affects their end performance. For this purpose, we assess the generations of five different LLMs, on two different math datasets, both manually and automatically. Our results reveal that the distribution of grounding depends on LLMs' capabilities and the difficulty of math problems. Furthermore, mathematical grounding is more effective for closed-source models, while open-source models fail to employ math rules in their solutions correctly. On MATH500, the percentage of grounded programs drops to half and ungrounded generations double, compared to the ASDiv grade-school problems. Our work highlights the need for in-depth evaluation beyond execution accuracy metrics, toward a better understanding of code-assisted LLMs' capabilities and limits in the math domain.
https://arxiv.org/abs/2504.17665
This paper presents a comprehensive empirical analysis of conformal prediction methods on a challenging aerial image dataset featuring diverse events in unconstrained environments. Conformal prediction is a powerful post-hoc technique that takes the output of any classifier and transforms it into a set of likely labels, providing a statistical guarantee on the coverage of the true label. Unlike evaluations on standard benchmarks, our study addresses the complexities of data-scarce and highly variable real-world settings. We investigate the effectiveness of leveraging pretrained models (MobileNet, DenseNet, and ResNet), fine-tuned with limited labeled data, to generate informative prediction sets. To further evaluate the impact of calibration, we consider two parallel pipelines (with and without temperature scaling) and assess performance using two key metrics: empirical coverage and average prediction set size. This setup allows us to systematically examine how calibration choices influence the trade-off between reliability and efficiency. Our findings demonstrate that even with relatively small labeled samples and simple nonconformity scores, conformal prediction can yield valuable uncertainty estimates for complex tasks. Moreover, our analysis reveals that while temperature scaling is often employed for calibration, it does not consistently lead to smaller prediction sets, underscoring the importance of careful consideration in its application. Furthermore, our results highlight the significant potential of model compression techniques within the conformal prediction pipeline for deployment in resource-constrained environments. Based on our observations, we advocate for future research to delve into the impact of noisy or ambiguous labels on conformal prediction performance and to explore effective model reduction strategies.
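To make the two pipelines concrete, the sketch below fits a temperature by negative log-likelihood on the calibration split and scores prediction sets (built, e.g., with the split-conformal recipe sketched earlier) by empirical coverage and average set size; this is a standard recipe under those assumptions, not the paper's exact evaluation code.

```python
# Temperature scaling on a calibration split plus the two evaluation metrics
# used above: empirical coverage and average prediction-set size.
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(cal_logits, cal_labels):
    # Fit a single temperature T by minimizing negative log-likelihood.
    def nll(T):
        p = softmax(cal_logits / T)
        return -np.log(p[np.arange(len(cal_labels)), cal_labels] + 1e-12).mean()
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

def evaluate(pred_sets, test_labels):
    cov = pred_sets[np.arange(len(test_labels)), test_labels].mean()   # empirical coverage
    avg_size = pred_sets.sum(axis=1).mean()                            # average set size
    return cov, avg_size
```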
https://arxiv.org/abs/2504.17655
The proliferation of abusive language in online communications has posed significant risks to the health and wellbeing of individuals and communities. The growing concern regarding online abuse and its consequences necessitates methods for identifying and mitigating harmful content and facilitating continuous monitoring, moderation, and early intervention. This paper presents a taxonomy for distinguishing key characteristics of abusive language within online text. Our approach uses a systematic method for taxonomy development, integrating classification systems of 18 existing multi-label datasets to capture key characteristics relevant to online abusive language classification. The resulting taxonomy is hierarchical and faceted, comprising 5 categories and 17 dimensions. It classifies various facets of online abuse, including context, target, intensity, directness, and theme of abuse. This shared understanding can lead to more cohesive efforts, facilitate knowledge exchange, and accelerate progress in the field of online abuse detection and mitigation among researchers, policy makers, online platform owners, and other stakeholders.
https://arxiv.org/abs/2504.17653
Safety-critical whole-body robot control demands reactive methods that ensure collision avoidance in real-time. Complementarity constraints and control barrier functions (CBF) have emerged as core tools for ensuring such safety constraints, and each represents a well-developed field. Despite addressing similar problems, their connection remains largely unexplored. This paper bridges this gap by formally proving the equivalence between these two methodologies for sampled-data, first-order systems, considering both single and multiple constraint scenarios. By demonstrating this equivalence, we provide a unified perspective on these techniques. This unification has theoretical and practical implications, facilitating the cross-application of robustness guarantees and algorithmic improvements between complementarity and CBF frameworks. We discuss these synergistic benefits and motivate future work in the comparison of the methods in more general cases.
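For readers unfamiliar with the two formalisms, the block below recalls their standard building blocks in generic notation: a sampled-data CBF condition for a first-order system and the complementarity condition in its usual form; the paper's precise formulation and equivalence proof are not reproduced here.

```latex
% Generic building blocks, not the paper's notation: a discrete-time CBF
% condition for a first-order system x_{k+1} = x_k + \Delta t\, u_k with
% barrier h(x) \ge 0 encoding the safe set, and the standard complementarity
% condition between a multiplier \lambda and a constraint function g.
\[
  h\!\left(x_k + \Delta t\, u_k\right) - h(x_k) \;\ge\; -\gamma\, h(x_k),
  \qquad 0 < \gamma \le 1,
\]
\[
  0 \;\le\; \lambda \;\perp\; g(x_k, u_k) \;\ge\; 0
  \quad\Longleftrightarrow\quad
  \lambda \ge 0,\quad g(x_k, u_k) \ge 0,\quad \lambda\, g(x_k, u_k) = 0 .
\]
```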
https://arxiv.org/abs/2504.17647
A brief overview of CLIPSE, a self-hosted image search engine aimed primarily at research use, is provided. In general, CLIPSE uses CLIP embeddings to process both the images and the text queries. The overall framework is designed for simplicity to enable easy extension and usage. Two benchmark scenarios are described and evaluated, covering indexing and querying time. It is shown that CLIPSE is capable of handling smaller datasets; for larger datasets, a distributed approach with several instances should be considered.
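A minimal sketch of the kind of CLIP-based retrieval CLIPSE describes is shown below, embedding images and text queries with a Hugging Face CLIP checkpoint and ranking by cosine similarity; the checkpoint and the in-memory index are illustrative, and CLIPSE's own indexing code may differ.

```python
# Text-to-image retrieval with CLIP embeddings and cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)   # unit-norm image embeddings

def search(query, image_feats, paths, k=5):
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = torch.nn.functional.normalize(q, dim=-1)
    scores = (image_feats @ q.T).squeeze(1)                # cosine similarities
    top = scores.topk(min(k, len(paths)))
    return [(paths[i], scores[i].item()) for i in top.indices.tolist()]
```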
https://arxiv.org/abs/2504.17643
Visual localization algorithms, i.e., methods that estimate the camera pose of a query image in a known scene, are core components of many applications, including self-driving cars and augmented / mixed reality systems. State-of-the-art visual localization algorithms are structure-based, i.e., they store a 3D model of the scene and use 2D-3D correspondences between the query image and 3D points in the model for camera pose estimation. While such approaches are highly accurate, they are also rather inflexible when it comes to adjusting the underlying 3D model after changes in the scene. Structureless localization approaches represent the scene as a database of images with known poses and thus offer a much more flexible representation that can be easily updated by adding or removing images. Although there is a large amount of literature on structure-based approaches, there is significantly less work on structureless methods. Hence, this paper is dedicated to providing the, to the best of our knowledge, first comprehensive discussion and comparison of structureless methods. Extensive experiments show that approaches that use a higher degree of classical geometric reasoning generally achieve higher pose accuracy. In particular, approaches based on classical absolute or semi-generalized relative pose estimation outperform very recent methods based on pose regression by a wide margin. Compared with state-of-the-art structure-based approaches, the flexibility of structureless methods comes at the cost of (slightly) lower pose accuracy, indicating an interesting direction for future work.
https://arxiv.org/abs/2504.17636
Diabetic foot ulcers (DFUs) pose a significant challenge in healthcare, requiring precise and efficient wound assessment to enhance patient outcomes. This study introduces the Attention Diffusion Zero-shot Unsupervised System (ADZUS), a novel text-guided diffusion model that performs wound segmentation without relying on labeled training data. Unlike conventional deep learning models, which require extensive annotation, ADZUS leverages zero-shot learning to dynamically adapt segmentation based on descriptive prompts, offering enhanced flexibility and adaptability in clinical applications. Experimental evaluations demonstrate that ADZUS surpasses traditional and state-of-the-art segmentation models, achieving an IoU of 86.68\% and the highest precision of 94.69\% on the chronic wound dataset, outperforming supervised approaches such as FUSegNet. Further validation on a custom-curated DFU dataset reinforces its robustness, with ADZUS achieving a median DSC of 75\%, significantly surpassing FUSegNet's 45\%. The model's text-guided segmentation capability enables real-time customization of segmentation outputs, allowing targeted analysis of wound characteristics based on clinical descriptions. Despite its competitive performance, the computational cost of diffusion-based inference and the need for potential fine-tuning remain areas for future improvement. ADZUS represents a transformative step in wound segmentation, providing a scalable, efficient, and adaptable AI-driven solution for medical imaging.
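For reference, the two reported metrics (IoU and DSC) can be computed from binary masks as in the generic sketch below; this is not the paper's evaluation code.

```python
# Intersection-over-Union and Dice similarity coefficient for binary masks.
import numpy as np

def iou(pred, gt):
    # pred, gt: boolean masks of the same shape.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def dice(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom else 1.0
```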
https://arxiv.org/abs/2504.17628
Our work addresses the problem of learning to localize objects in an open-world setting, i.e., given the bounding box information of a limited number of object classes during training, the goal is to localize all objects, belonging to both the training and unseen classes in an image, during inference. Towards this end, recent work in this area has focused on improving the characterization of objects either explicitly by proposing new objective functions (localization quality) or implicitly using object-centric auxiliary-information, such as depth information, pixel/region affinity map etc. In this work, we address this problem by incorporating background information to guide the learning of the notion of objectness. Specifically, we propose a novel framework to discover background regions in an image and train an object proposal network to not detect any objects in these regions. We formulate the background discovery task as that of identifying image regions that are not discriminative, i.e., those that are redundant and constitute low information content. We conduct experiments on standard benchmarks to showcase the effectiveness of our proposed approach and observe significant improvements over the previous state-of-the-art approaches for this task.
https://arxiv.org/abs/2504.17626
We exploit the mathematical modeling of the visual cortex mechanism for border completion to define custom filters for CNNs. We see a consistent improvement in performance, particularly in accuracy, when our modified LeNet 5 is tested with occluded MNIST images.
https://arxiv.org/abs/2504.17619
Hessians of neural networks (NNs) contain essential information about the curvature of NN loss landscapes, which can be used to estimate NN generalization capabilities. We have previously proposed generalization criteria that rely on the observation that the Hessian eigenvalue spectral density (HESD) behaves similarly for a wide class of NNs. This paper further studies their applicability by investigating factors that can result in different types of HESD. We conduct a wide range of experiments showing that HESD mainly has positive eigenvalues (MP-HESD) for NN training and fine-tuning with various optimizers on different datasets with different preprocessing and augmentation procedures. We also show that mainly negative HESD (MN-HESD) is a consequence of external gradient manipulation, indicating that the previously proposed Hessian analysis methodology cannot be applied in such cases. We also propose criteria and corresponding conditions to determine the HESD type and estimate NN generalization potential. These HESD types and the previously proposed generalization criteria are combined into a unified HESD analysis methodology. Finally, we discuss how HESD changes during training, and show the occurrence of quasi-singular (QS) HESD and its influence on the proposed methodology and on the conventional assumptions about the relation between Hessian eigenvalues and NN loss landscape curvature.
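HESD analysis builds on Hessian-vector products (HVPs); as a minimal illustration, the sketch below implements an HVP via double backpropagation and power iteration for the dominant Hessian eigenvalue. Estimating the full spectral density as in the paper would typically use stochastic Lanczos quadrature on top of the same HVP primitive.

```python
# Hessian-vector product via double backprop, plus power iteration for the
# largest-magnitude Hessian eigenvalue of the loss w.r.t. the parameters.
import torch

def hvp(loss, params, vec):
    # `loss` must be a scalar with a live autograd graph; `params` require grad.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(flat @ vec, params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in hv]).detach()

def top_eigenvalue(loss, params, iters=50):
    n = sum(p.numel() for p in params)
    v = torch.randn(n, device=params[0].device)
    v = v / v.norm()
    eig = 0.0
    for _ in range(iters):
        hv = hvp(loss, params, v)
        eig = torch.dot(hv, v).item()        # Rayleigh quotient estimate
        v = hv / (hv.norm() + 1e-12)
    return eig
```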
https://arxiv.org/abs/2504.17618
Aiming at the problems of poor steganographic image quality and slow network convergence in deep learning-based image steganography models, this paper proposes a Steganography Curriculum Learning training strategy (STCL) for deep learning image steganography models. The strategy selects only easy images for training when the model has poor fitting ability at the initial stage and gradually expands to more difficult images; it comprises a difficulty evaluation strategy based on teacher models and a knee point-based training scheduling strategy. Firstly, multiple teacher models are trained, and the consistency of the quality of steganographic images under the teacher models is used as the difficulty score to construct the training subsets from easy to difficult. Secondly, a training control strategy based on knee points is proposed to reduce the possibility of overfitting on small training sets and accelerate the training process. Experimental results on three large public datasets, ALASKA2, VOC2012 and ImageNet, show that the proposed image steganography scheme improves model performance under multiple algorithmic frameworks, achieving not only high PSNR, SSIM scores, and decoding accuracy, but also low steganalysis scores for the steganographic images generated by models trained with the STCL strategy. You can find our code at \href{this https URL}{this https URL}.
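A hedged sketch of the two ingredients follows: a per-image difficulty score from the consistency of stego-image quality across teacher models, and a simple knee-point test on a loss curve used to schedule when the next (harder) subset is unlocked; the PSNR-based score and the max-distance knee detector are illustrative stand-ins for the paper's exact choices.

```python
# Illustrative curriculum ingredients: teacher-consistency difficulty scores,
# easy-to-hard subsets, and a simple knee-point detector on a loss curve.
import numpy as np

def difficulty_scores(psnr_per_teacher):
    # psnr_per_teacher: (n_teachers, n_images) stego-image PSNR under each teacher.
    return psnr_per_teacher.std(axis=0)          # low consistency -> harder image

def easy_to_hard_subsets(scores, n_stages=3):
    order = np.argsort(scores)                   # most consistent (easiest) first
    return np.array_split(order, n_stages)

def knee_index(losses):
    # Point of maximum deviation from the straight line joining the curve endpoints.
    losses = np.asarray(losses, dtype=float)
    x = np.linspace(0.0, 1.0, len(losses))
    span = losses.max() - losses.min()
    y = (losses - losses.min()) / (span + 1e-12)
    line = y[0] + (y[-1] - y[0]) * x
    return int(np.argmax(np.abs(y - line)))
```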
https://arxiv.org/abs/2504.17609
The integration of dual-modal features has been pivotal in advancing RGB-Depth (RGB-D) tracking. However, current trackers are less efficient and focus solely on single-level features, resulting in weaker robustness in fusion and slower speeds that fail to meet the demands of real-world applications. In this paper, we introduce a novel network, denoted as HMAD (Hierarchical Modality Aggregation and Distribution), which addresses these challenges. HMAD leverages the distinct feature representation strengths of RGB and depth modalities, giving prominence to a hierarchical approach for feature distribution and fusion, thereby enhancing the robustness of RGB-D tracking. Experimental results on various RGB-D datasets demonstrate that HMAD achieves state-of-the-art performance. Moreover, real-world experiments further validate HMAD's capacity to effectively handle a spectrum of tracking challenges in real-time scenarios.
https://arxiv.org/abs/2504.17595
An intriguing phenomenon about JPEG compression has been observed for two decades: repeated JPEG compression and decompression eventually leads to a stable image that no longer changes, i.e., a fixed point. In this work, we prove the existence of fixed points in the essential JPEG procedures. We analyze JPEG compression and decompression processes, revealing the existence of fixed points that can be reached within a few iterations. These fixed points are diverse and preserve the image's visual quality, ensuring minimal distortion. This result is used to develop a method to create a tamper-evident image from the original authentic image, which can expose tampering operations by showing deviations from the fixed point image.
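The phenomenon itself is easy to reproduce: the sketch below repeatedly JPEG-encodes and decodes an image at a fixed quality until the decoded pixels stop changing. The quality setting and iteration cap are illustrative, and this is only the iteration, not the paper's tamper-evident construction.

```python
# Iterate JPEG compression/decompression until the decoded image is a fixed point.
import io
import numpy as np
from PIL import Image

def jpeg_fixed_point(img, quality=90, max_iters=100):
    arr = np.array(img.convert("RGB"))
    for i in range(max_iters):
        buf = io.BytesIO()
        Image.fromarray(arr).save(buf, format="JPEG", quality=quality)
        new = np.array(Image.open(buf).convert("RGB"))
        if np.array_equal(new, arr):         # decoded image no longer changes
            return Image.fromarray(arr), i
        arr = new
    return Image.fromarray(arr), max_iters
```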
https://arxiv.org/abs/2504.17594
The demand for realistic virtual immersive audio continues to grow, with Head-Related Transfer Functions (HRTFs) playing a key role. HRTFs capture how sound reaches our ears, reflecting unique anatomical features and enhancing spatial perception. It has been shown that personalized HRTFs improve localization accuracy, but their measurement remains time-consuming and requires a noise-free environment. Although machine learning has been shown to reduce the required measurement points and, thus, the measurement time, a controlled environment is still necessary. This paper proposes a method to address this constraint by presenting a novel technique that can upsample sparse, noisy HRTF measurements. The proposed approach combines an HRTF Denoisy U-Net for denoising and an Autoencoding Generative Adversarial Network (AE-GAN) for upsampling from three measurement points. The proposed method achieves a log-spectral distortion (LSD) error of 5.41 dB and a cosine similarity loss of 0.0070, demonstrating the method's effectiveness in HRTF upsampling.
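For reference, the log-spectral distortion reported above is commonly computed as in the sketch below over HRTF magnitude responses; the frequency range and averaging used in the paper's evaluation may differ.

```python
# Log-spectral distortion (in dB) between a reference and an estimated frequency response.
import numpy as np

def log_spectral_distortion(h_ref, h_est, eps=1e-12):
    # h_ref, h_est: complex (or magnitude) frequency responses of equal length.
    ratio = (np.abs(h_ref) + eps) / (np.abs(h_est) + eps)
    return np.sqrt(np.mean((20.0 * np.log10(ratio)) ** 2))
```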
https://arxiv.org/abs/2504.17586