We present HOIDiNi, a text-driven diffusion framework for synthesizing realistic and plausible human-object interaction (HOI). HOI generation is extremely challenging since it demands strict contact accuracy alongside a diverse motion manifold. While current literature trades off between realism and physical correctness, HOIDiNi optimizes directly in the noise space of a pretrained diffusion model using Diffusion Noise Optimization (DNO), achieving both. This is made feasible by our observation that the problem can be separated into two phases: an object-centric phase, primarily making discrete choices of hand-object contact locations, and a human-centric phase that refines the full-body motion to realize this blueprint. This structured approach allows for precise hand-object contact without compromising motion naturalness. Quantitative, qualitative, and subjective evaluations on the GRAB dataset alone clearly indicate that HOIDiNi outperforms prior works and baselines in contact accuracy, physical validity, and overall quality. Our results demonstrate the ability to generate complex, controllable interactions, including grasping, placing, and full-body coordination, driven solely by textual prompts. this https URL.
https://arxiv.org/abs/2506.15625
Estimating forest above-ground biomass (AGB) is crucial for assessing carbon storage and supporting sustainable forest management. Quantitative Structural Model (QSM) offers a non-destructive approach to AGB estimation through 3D tree structural reconstruction. However, current QSM methods face significant limitations, as they are primarily designed for individual trees, depend on high-quality point cloud data from terrestrial laser scanning (TLS), and also require multiple pre-processing steps that hinder scalability and practical deployment. This study presents a novel unified framework that enables end-to-end processing of large-scale point clouds using an innovative graph-based pipeline. The proposed approach seamlessly integrates tree segmentation, leaf-wood separation and 3D skeletal reconstruction through dedicated graph operations including pathing and abstracting for tree topology reasoning. Comprehensive validation was conducted on datasets with varying leaf conditions (leaf-on and leaf-off), spatial scales (tree- and plot-level), and data sources (TLS and UAV-based laser scanning, ULS). Experimental results demonstrate strong performance under challenging conditions, particularly in leaf-on scenarios (~20% relative error) and low-density ULS datasets with partial coverage (~30% relative error). These findings indicate that the proposed framework provides a robust and scalable solution for large-scale, non-destructive AGB estimation. It significantly reduces dependency on specialized pre-processing tools and establishes ULS as a viable alternative to TLS. To our knowledge, this is the first method capable of enabling seamless, end-to-end 3D tree reconstruction at operational scales. This advancement substantially improves the feasibility of QSM-based AGB estimation, paving the way for broader applications in forest inventory and climate change research.
https://arxiv.org/abs/2506.15577
Cancer is an abnormal growth with potential to invade locally and metastasize to distant organs. Accurate auto-segmentation of the tumor and surrounding normal tissues is required for radiotherapy treatment plan optimization. Recent AI-based segmentation models are generally trained on large public datasets, which lack the heterogeneity of local patient populations. While these studies advance AI-based medical image segmentation, research on local datasets is necessary to develop and integrate AI tumor segmentation models directly into hospital software for efficient and accurate oncology treatment planning and execution. This study enhances tumor segmentation using computationally efficient hybrid UNet-Transformer models on magnetic resonance imaging (MRI) datasets acquired from a local hospital under strict privacy protection. We developed a robust data pipeline for seamless DICOM extraction and preprocessing, followed by extensive image augmentation to ensure model generalization across diverse clinical settings, resulting in a total dataset of 6080 images for training. Our novel architecture integrates UNet-based convolutional neural networks with a transformer bottleneck and complementary attention modules, including efficient attention, Squeeze-and-Excitation (SE) blocks, Convolutional Block Attention Module (CBAM), and ResNeXt blocks. To accelerate convergence and reduce computational demands, we used a maximum batch size of 8 and initialized the encoder with pretrained ImageNet weights, training the model on dual NVIDIA T4 GPUs via checkpointing to overcome Kaggle's runtime limits. Quantitative evaluation on the local MRI dataset yielded a Dice similarity coefficient of 0.764 and an Intersection over Union (IoU) of 0.736, demonstrating competitive performance despite limited data and underscoring the importance of site-specific model development for clinical deployment.
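The reported Dice similarity coefficient and IoU follow the standard overlap definitions; a minimal sketch for binary masks (not the authors' evaluation code) is:

```python
def dice_iou(pred, target):
    """Compute the Dice similarity coefficient and Intersection over Union
    for two binary masks given as flat sequences of 0/1 values."""
    inter = sum(p * t for p, t in zip(pred, target))
    p_sum, t_sum = sum(pred), sum(target)
    union = p_sum + t_sum - inter
    dice = 2.0 * inter / (p_sum + t_sum) if (p_sum + t_sum) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou

# Two 2x2 masks flattened: one overlapping pixel out of three foreground pixels.
d, i = dice_iou([1, 1, 0, 0], [1, 0, 1, 0])  # d = 0.5, i = 1/3
```

The empty-mask guards return 1.0 by convention when both masks are empty; some toolkits instead skip such slices.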
https://arxiv.org/abs/2506.15562
We examine the intrinsic (within the attention head) and extrinsic (amongst the attention heads) structure of the self-attention mechanism in transformers. Theoretical evidence for invariance of the self-attention mechanism to softmax activation is obtained by appealing to paradifferential calculus (and is supported by computational examples), which relies on the intrinsic organization of the attention heads. Furthermore, we use an existing methodology for hierarchical organization of tensors to examine network structure by constructing hierarchical partition trees with respect to the query, key, and head axes of network 3-tensors. Such an organization is consequential since it allows one to profitably execute common signal processing tasks on a geometry where the organized network 3-tensors exhibit regularity. We exemplify this qualitatively, by visualizing the hierarchical organization of the tree composed of attention heads and the diffusion map embeddings, and quantitatively, by investigating network sparsity with the expansion coefficients of individual attention heads and the entire network with respect to the bi- and tri-Haar bases (respectively) on the space of queries, keys, and heads of the network. To showcase the utility of our theoretical and methodological findings, we provide computational examples using vision and language transformers. The ramifications of these findings are two-fold: (1) a subsequent step in interpretability analysis is theoretically admitted, and can be exploited empirically for downstream interpretability tasks; (2) one can use the network 3-tensor organization for empirical network applications such as model pruning (by virtue of network sparsity) and network architecture comparison.
https://arxiv.org/abs/2506.15541
We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels (acoustic, phonetic, and lexical) within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC's potential as a unified discrete speech representation, bridging acoustic detail and lexical meaning for downstream speech generation and understanding tasks.
https://arxiv.org/abs/2506.15456
Comparing time series is essential in various tasks such as clustering and classification. While elastic distance measures that allow warping provide a robust quantitative comparison, a qualitative comparison on top of them is missing. Traditional visualizations focus on point-to-point alignment and do not convey the broader structural relationships at the level of subsequences. This limitation makes it difficult to understand how and where one time series shifts, speeds up or slows down with respect to another. To address this, we propose a novel technique that simplifies the warping path to highlight, quantify and visualize key transformations (shift, compression, difference in amplitude). By offering a clearer representation of how subsequences match between time series, our method enhances interpretability in time series comparison.
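The warping path that the proposed technique simplifies comes from standard dynamic time warping (DTW); a minimal pure-Python sketch of path extraction follows (the paper's simplification and visualization steps are not reproduced):

```python
def dtw_path(a, b):
    """Dynamic time warping: return an optimal warping path between two
    numeric sequences as a list of (i, j) index pairs into a and b."""
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # match (diagonal step)
                                 cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1])      # b advances alone
    # Backtrack from (n, m) to (0, 0), always taking the cheapest predecessor.
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        cand = [(pi, pj) for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1))
                if pi >= 0 and pj >= 0]
        i, j = min(cand, key=lambda s: cost[s[0]][s[1]])
    return path[::-1]

# A local stretch in b: index 1 of a maps to indices 1 and 2 of b.
path = dtw_path([1, 2, 3], [1, 2, 2, 3])  # [(0, 0), (1, 1), (1, 2), (2, 3)]
```

Runs of horizontal or vertical steps in such a path are exactly the compressions and expansions the proposed visualization aims to summarize at the subsequence level.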
https://arxiv.org/abs/2506.15452
Motion capture using sparse inertial sensors has shown great promise due to its portability and lack of occlusion issues compared to camera-based tracking. Existing approaches typically assume that IMU sensors are tightly attached to the human body. However, this assumption often does not hold in real-world scenarios. In this paper, we present a new task of full-body human pose estimation using sparse, loosely attached IMU sensors. To solve this task, we simulate IMU recordings from an existing garment-aware human motion dataset. We develop transformer-based diffusion models to synthesize loose IMU data and estimate human poses from this challenging loose IMU data. In addition, we show that incorporating garment-related parameters while training the model on simulated loose data effectively maintains expressiveness and enhances the ability to capture variations introduced by looser or tighter garments. Experiments show that our proposed diffusion methods, trained on simulated and synthetic data, outperform the state-of-the-art methods quantitatively and qualitatively, opening up a promising direction for future research.
https://arxiv.org/abs/2506.15290
Visual monitoring of industrial assembly tasks is critical for preventing equipment damage due to procedural errors and ensuring worker safety. Although commercial solutions exist, they typically require rigid workspace setups or the application of visual markers to simplify the problem. We introduce ViMAT, a novel AI-driven system for real-time visual monitoring of assembly tasks that operates without these constraints. ViMAT combines a perception module that extracts visual observations from multi-view video streams with a reasoning module that infers the most likely action being performed based on the observed assembly state and prior task knowledge. We validate ViMAT on two assembly tasks, involving the replacement of LEGO components and the reconfiguration of hydraulic press molds, demonstrating its effectiveness through quantitative and qualitative analysis in challenging real-world scenarios characterized by partial and uncertain visual observations. Project page: this https URL
https://arxiv.org/abs/2506.15285
Generative AI, specifically text-to-image models, has revolutionized interior architectural design by enabling the rapid translation of conceptual ideas into visual representations from simple text prompts. While generative AI can produce visually appealing images, these images often lack actionable data for designers. In this work, we propose a novel pipeline that integrates DALL-E 3 with a materials dataset to enrich AI-generated designs with sustainability metrics and material usage insights. After the model generates an interior design image, a post-processing module identifies the top ten materials present and pairs them with carbon dioxide equivalent (CO2e) values from a general materials dictionary. This approach allows designers to immediately evaluate environmental impacts and refine prompts accordingly. We evaluate the system through three user tests: (1) no mention of sustainability to the user prior to the prompting process with generative AI, (2) sustainability goals communicated to the user before prompting, and (3) sustainability goals communicated along with quantitative CO2e data included in the generative AI outputs. Our qualitative and quantitative analyses reveal that the introduction of sustainability metrics in the third test leads to more informed design decisions; however, it can also trigger decision fatigue and lower overall satisfaction. Nevertheless, the majority of participants reported incorporating sustainability principles into their workflows in the third test, underscoring the potential of integrated metrics to guide more ecologically responsible practices. Our findings showcase the importance of balancing design freedom with practical constraints, offering a clear path toward holistic, data-driven solutions in AI-assisted architectural design.
https://arxiv.org/abs/2506.15008
As Large Language Models (LLMs) increasingly appear in social science research (e.g., economics and marketing), it becomes crucial to assess how well these models replicate human behavior. In this work, using hypothesis testing, we present a quantitative framework to assess the misalignment between LLM-simulated and actual human behaviors in multiple-choice survey settings. This framework allows us to determine in a principled way whether a specific language model can effectively simulate human opinions, decision-making, and general behaviors represented through multiple-choice options. We applied this framework to a popular language model for simulating people's opinions in various public surveys and found that this model is ill-suited for simulating the tested sub-populations (e.g., across different races, ages, and incomes) for contentious questions. This raises questions about the alignment of this language model with the tested populations, highlighting the need for new practices in using LLMs for social science studies beyond naive simulations of human subjects.
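The abstract does not spell out the test statistic; a Pearson chi-square goodness-of-fit test is one standard way to compare a simulated multiple-choice distribution against observed human counts. The sketch below is illustrative only, and the counts are invented:

```python
def chi_square_stat(observed, expected):
    """Pearson chi-square statistic comparing observed choice counts
    (e.g. LLM-simulated responses) against expected counts (human survey)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Toy example: 4 answer options, 100 respondents each (hypothetical counts).
human = [40, 30, 20, 10]   # expected counts from a human survey
llm   = [25, 25, 25, 25]   # observed counts from an LLM simulation
stat = chi_square_stat(llm, human)
# df = 4 - 1 = 3; the critical value at alpha = 0.05 is 7.815.
misaligned = stat > 7.815  # True here: reject the hypothesis of alignment
```

In practice one would use a library routine (e.g. `scipy.stats.chisquare`) to obtain exact p-values, and correct for the many sub-population comparisons the paper's framework performs.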
https://arxiv.org/abs/2506.14997
Jamming requires coordination, anticipation, and collaborative creativity between musicians. Current generative models of music produce expressive output but are not able to generate in an online manner, meaning simultaneously with other musicians (human or otherwise). We propose ReaLchords, an online generative model for improvising chord accompaniment to a user melody. We start with an online model pretrained by maximum likelihood, and use reinforcement learning to finetune the model for online use. The finetuning objective leverages both a novel reward model that provides feedback on harmonic and temporal coherence between melody and chord, and a divergence term that implements a novel type of distillation from a teacher model that can see the future melody. Through quantitative experiments and listening tests, we demonstrate that the resulting model adapts well to unfamiliar input and produces fitting accompaniment. ReaLchords opens the door to live jamming, as well as simultaneous co-creation in other modalities.
https://arxiv.org/abs/2506.14723
Previous methods for image geo-localization have typically treated the task as either classification or retrieval, often relying on black-box decisions that lack interpretability. The rise of large vision-language models (LVLMs) has enabled a rethinking of geo-localization as a reasoning-driven task grounded in visual cues. However, two major challenges persist. On the data side, existing reasoning-focused datasets are primarily based on street-view imagery, offering limited scene diversity and constrained viewpoints. On the modeling side, current approaches predominantly rely on supervised fine-tuning, which yields only marginal improvements in reasoning capabilities. To address these challenges, we propose a novel pipeline that constructs a reasoning-oriented geo-localization dataset, MP16-Reason, using diverse social media images. We introduce GLOBE, Group-relative policy optimization for Locatability assessment and Optimized visual-clue reasoning, yielding Bi-objective geo-Enhancement for the VLM in recognition and reasoning. GLOBE incorporates task-specific rewards that jointly enhance locatability assessment, visual clue reasoning, and geolocation accuracy. Both qualitative and quantitative results demonstrate that GLOBE outperforms state-of-the-art open-source LVLMs on geo-localization tasks, particularly in diverse visual scenes, while also generating more insightful and interpretable reasoning trajectories.
https://arxiv.org/abs/2506.14674
In the field of image fusion, promising progress has been made by modeling data from different modalities as linear subspaces. However, in practice, the source images are often located in a non-Euclidean space, where Euclidean methods usually cannot encapsulate the intrinsic topological structure. Typically, the inner product performed in Euclidean space calculates algebraic similarity rather than semantic similarity, which results in undesired attention output and a decrease in fusion performance. Moreover, the balance between low-level details and high-level semantics must be considered in the infrared and visible image fusion task. To address this issue, in this paper, we propose a novel attention mechanism based on the Grassmann manifold for infrared and visible image fusion (GrFormer). Specifically, our method constructs a low-rank subspace mapping through projection constraints on the Grassmann manifold, compressing attention features into subspaces of varying rank levels. This forces the features to decouple into high-frequency details (local low-rank) and low-frequency semantics (global low-rank), thereby achieving multi-scale semantic fusion. Additionally, to effectively integrate the significant information, we develop a cross-modal fusion strategy (CMS) based on a covariance mask to maximise the complementary properties between different modalities and to suppress the features with high correlation, which are deemed redundant. The experimental results demonstrate that our network outperforms SOTA methods both qualitatively and quantitatively on multiple image fusion benchmarks. The codes are available at this https URL.
https://arxiv.org/abs/2506.14384
Film grain, once a by-product of analog film, is now present in most cinematographic content for aesthetic reasons. However, when such content is compressed at medium to low bitrates, film grain is lost due to its random nature. To preserve artistic intent while compressing efficiently, film grain is analyzed and modeled before encoding and synthesized after decoding. This paper introduces FGA-NN, the first learning-based film grain analysis method to estimate conventional film grain parameters compatible with conventional synthesis. Quantitative and qualitative results demonstrate FGA-NN's superior balance between analysis accuracy and synthesis complexity, along with its robustness and applicability.
https://arxiv.org/abs/2506.14350
Music transcription plays a pivotal role in Music Information Retrieval (MIR), particularly for stringed instruments like the guitar, where symbolic music notations such as MIDI lack crucial playability information. This contribution introduces the Fretting-Transformer, an encoder-decoder model that utilizes a T5 transformer architecture to automate the transcription of MIDI sequences into guitar tablature. By framing the task as a symbolic translation problem, the model addresses key challenges, including string-fret ambiguity and physical playability. The proposed system leverages diverse datasets, including DadaGP, GuitarToday, and Leduc, with novel data pre-processing and tokenization strategies. We have developed metrics for tablature accuracy and playability to quantitatively evaluate the performance. The experimental results demonstrate that the Fretting-Transformer surpasses baseline methods like A* and commercial applications like Guitar Pro. The integration of context-sensitive processing and tuning/capo conditioning further enhances the model's performance, laying a robust foundation for future developments in automated guitar transcription.
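The paper's playability metric is not specified in the abstract; as an illustration only, one simple hypothetical check scores a tablature chord voicing by its fret span (the distance a hand must stretch):

```python
def fret_span(chord):
    """Hypothetical playability check: span in frets of a chord voicing,
    given (string, fret) pairs; open strings (fret 0) are ignored."""
    fretted = [fret for _string, fret in chord if fret > 0]
    return max(fretted) - min(fretted) if fretted else 0

# An open C major voicing, strings 5..1 -> frets 3, 2, 0, 1, 0.
c_major = [(5, 3), (4, 2), (3, 0), (2, 1), (1, 0)]
playable = fret_span(c_major) <= 4  # a 4-fret span is a common hand limit
```

A real metric would also account for finger count, barres, and transition cost between consecutive voicings; this sketch names none of the paper's actual criteria.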
https://arxiv.org/abs/2506.14223
In recirculating aquaculture systems, accurate and effective assessment of fish feeding intensity is crucial for reducing feed costs and calculating optimal feeding times. However, current studies have limitations in modality selection, feature extraction and fusion, and co-inference for decision making, which restrict further improvement in the accuracy, applicability and reliability of multimodal fusion models. To address this problem, this study proposes a Multi-stage Augmented Multimodal Interaction Network (MAINet) for quantifying fish feeding intensity. Firstly, a general feature extraction framework is proposed to efficiently extract feature information from input image, audio and water-wave data. Second, an Auxiliary-modality Reinforcement Primary-modality Mechanism (ARPM) is designed to enable inter-modal interaction and generate enhanced features; it consists of a Channel Attention Fusion Network (CAFN) and a Dual-mode Attention Fusion Network (DAFN). Finally, an Evidence Reasoning (ER) rule is introduced to fuse the output results of each modality and make decisions, thereby completing the quantification of fish feeding intensity. The experimental results show that the constructed MAINet reaches 96.76%, 96.78%, 96.79% and 96.79% in accuracy, precision, recall and F1-Score respectively, and its performance is significantly higher than the comparison models. Compared with models that adopt single-modality, dual-modality fusion and different decision-making fusion methods, it also has obvious advantages. Meanwhile, the ablation experiments further verified the key role of the proposed improvement strategy in improving the robustness and feature utilization efficiency of the model, which can effectively improve the accuracy of the quantitative results of fish feeding intensity.
https://arxiv.org/abs/2506.14170
Neural networks are a powerful tool for learning patterns from data. However, they do not respect known scientific laws, nor can they reveal novel scientific insights due to their black-box nature. In contrast, scientific reasoning distills biological or physical principles from observations and controlled experiments, and quantitatively interprets them with process-based models made of mathematical equations. Yet, process-based models rely on numerous free parameters that must be set in an ad-hoc manner, and thus often fit observations poorly in cross-scale predictions. While prior work has embedded process-based models in conventional neural networks, discovering interpretable relationships between parameters in process-based models and input features is still a grand challenge for scientific discovery. We thus propose Scientifically-Interpretable Reasoning Network (ScIReN), a fully-transparent framework that combines interpretable neural and process-based reasoning. An interpretable encoder predicts scientifically-meaningful latent parameters, which are then passed through a differentiable process-based decoder to predict labeled output variables. ScIReN also uses a novel hard-sigmoid constraint layer to restrict latent parameters to meaningful ranges defined by scientific prior knowledge, further enhancing its interpretability. While the embedded process-based model enforces established scientific knowledge, the encoder reveals new scientific mechanisms and relationships hidden in conventional black-box models. We apply ScIReN on two tasks: simulating the flow of organic carbon through soils, and modeling ecosystem respiration from plants. In both tasks, ScIReN outperforms black-box networks in predictive accuracy while providing substantial scientific interpretability -- it can infer latent scientific mechanisms and their relationships with input features.
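The hard-sigmoid constraint layer can be sketched as a piecewise-linear squashing followed by an affine map into the scientifically meaningful range. The slope and offset (0.2, 0.5) below follow the common Keras-style hard-sigmoid definition; ScIReN's exact parameterization is an assumption here:

```python
def hard_sigmoid(x):
    """Hard sigmoid: piecewise-linear squashing of x into [0, 1]."""
    return min(1.0, max(0.0, 0.2 * x + 0.5))

def constrain(latent, lo, hi):
    """Map an unconstrained latent value into the prior-knowledge range
    [lo, hi], as a hard-sigmoid constraint layer would."""
    return lo + (hi - lo) * hard_sigmoid(latent)

# E.g. a hypothetical soil-carbon turnover time bounded to [10, 50] years:
mid  = constrain(0.0, 10, 50)    # 30.0 (center of the range)
high = constrain(100.0, 10, 50)  # 50.0 (saturates at the upper bound)
```

Unlike a plain sigmoid, the hard variant has exactly-zero gradient outside the linear region, which pins saturated parameters to the bounds stated by prior knowledge.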
https://arxiv.org/abs/2506.14054
Time-optimal trajectories drive quadrotors to their dynamic limits, but computing such trajectories involves solving non-convex problems via iterative nonlinear optimization, making them prohibitively costly for real-time applications. In this work, we investigate learning-based models that imitate a model-based time-optimal trajectory planner to accelerate trajectory generation. Given a dataset of collision-free geometric paths, we show that modeling architectures can effectively learn the patterns underlying time-optimal trajectories. We introduce a quantitative framework to analyze local analytic properties of the learned models, and link them to the Backward Reachable Tube of the geometric tracking controller. To enhance robustness, we propose a data augmentation scheme that applies random perturbations to the input paths. Compared to classical planners, our method achieves substantial speedups, and we validate its real-time feasibility on a hardware quadrotor platform. Experiments demonstrate that the learned models generalize to previously unseen path lengths. The code for our approach can be found here: this https URL
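The data augmentation scheme can be sketched as random jitter on the waypoints of a geometric path. The Gaussian noise form, the fixed endpoints, and the `sigma` value are assumptions for illustration, not the paper's exact recipe:

```python
import random

def perturb_path(waypoints, sigma=0.05, seed=None):
    """Augmentation sketch: jitter each interior waypoint of a geometric
    path with zero-mean Gaussian noise of std sigma; endpoints stay fixed."""
    rng = random.Random(seed)
    out = [waypoints[0]]
    for point in waypoints[1:-1]:
        out.append(tuple(c + rng.gauss(0.0, sigma) for c in point))
    out.append(waypoints[-1])
    return out

# One augmented copy of a 3-waypoint path in 3D:
path = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]
augmented = perturb_path(path, sigma=0.05, seed=1)
```

Training on many such perturbed copies exposes the learned planner to inputs slightly off the demonstration manifold, which is the robustness effect the paper targets.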
https://arxiv.org/abs/2506.13915
Distilling pre-trained 2D diffusion models into 3D assets has driven remarkable advances in text-to-3D synthesis. However, existing methods typically rely on Score Distillation Sampling (SDS) loss, which involves asymmetric KL divergence--a formulation that inherently favors mode-seeking behavior and limits generation diversity. In this paper, we introduce Dive3D, a novel text-to-3D generation framework that replaces KL-based objectives with Score Implicit Matching (SIM) loss, a score-based objective that effectively mitigates mode collapse. Furthermore, Dive3D integrates both diffusion distillation and reward-guided optimization under a unified divergence perspective. Such reformulation, together with SIM loss, yields significantly more diverse 3D outputs while improving text alignment, human preference, and overall visual fidelity. We validate Dive3D across various 2D-to-3D prompts and find that it consistently outperforms prior methods in qualitative assessments, including diversity, photorealism, and aesthetic appeal. We further evaluate its performance on the GPTEval3D benchmark, comparing against nine state-of-the-art baselines. Dive3D also achieves strong results on quantitative metrics, including text-asset alignment, 3D plausibility, text-geometry consistency, texture quality, and geometric detail.
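For reference, the SDS objective that Dive3D moves away from is commonly written via its gradient (notation follows the standard score-distillation literature: $\epsilon_\phi$ is the pretrained diffusion model's noise prediction, $y$ the text prompt, $w(t)$ a timestep weighting, and $x = g(\theta)$ a differentiable rendering of the 3D asset):

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
= \mathbb{E}_{t,\epsilon}\!\left[
    w(t)\,\bigl(\epsilon_\phi(x_t;\, y, t) - \epsilon\bigr)\,
    \frac{\partial x}{\partial \theta}
\right]
```

This gradient corresponds to minimizing a reverse (mode-seeking) KL divergence between the distribution of rendered views and the diffusion prior, which is the asymmetry the abstract identifies as the source of limited diversity.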
https://arxiv.org/abs/2506.13594
Performance evaluation for Content-Based Image Retrieval (CBIR) remains a crucial but unsolved problem, especially in the medical domain. Various evaluation metrics have been discussed in the literature to address it. Most existing metrics (e.g., precision, recall) are adapted from classification tasks, which require manual labels as ground truth. However, such labels are often expensive and unavailable in specific thematic domains. Furthermore, medical images are usually associated with (radiological) case reports or annotated with descriptive captions in literature figures; such text contains information that can help to assess retrieval relevance. Researchers have argued that the medical concepts hidden in the text can serve as the basis for CBIR evaluation. However, these works often treat the medical concepts as independent and isolated labels, neglecting the subtle relationships between various concepts. In this work, we introduce the use of knowledge graphs to measure the distance between medical concepts and propose a novel relevance measure for the evaluation of CBIR: an approximate matching-based relevance score between two sets of medical concepts, which allows us to indirectly measure the similarity between medical images. We quantitatively demonstrate the effectiveness and feasibility of our relevance measure using a public dataset.
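An approximate matching-based relevance score of this kind can be sketched as follows. The exponential decay, the symmetrization, and the toy hop-count distance are our own illustrative choices, not the paper's definitions; concept names are hypothetical.

```python
import numpy as np

def set_relevance(concepts_a, concepts_b, dist, tau=1.0):
    # Approximate matching: each concept in one set matches its nearest
    # concept in the other set under the graph distance `dist`; graph
    # distances decay exponentially into a [0, 1] relevance score.
    def one_sided(src, dst):
        scores = [np.exp(-min(dist(a, b) for b in dst) / tau) for a in src]
        return float(np.mean(scores))
    # Symmetrize so relevance(A, B) == relevance(B, A).
    return 0.5 * (one_sided(concepts_a, concepts_b)
                  + one_sided(concepts_b, concepts_a))

# Toy graph distance: hop counts on a tiny made-up concept hierarchy.
hops = {("pneumonia", "pneumonia"): 0, ("effusion", "effusion"): 0,
        ("infection", "infection"): 0, ("pneumonia", "infection"): 1,
        ("effusion", "infection"): 2}
d = lambda a, b: hops.get((a, b), hops.get((b, a), 3))

score = set_relevance({"pneumonia", "effusion"}, {"infection"}, d)
```

Identical concept sets score 1.0, and the score decreases smoothly as the matched concepts grow farther apart in the graph, which captures the "subtle relationships" that independent-label metrics miss.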
https://arxiv.org/abs/2506.13509