We introduce ARPG, a novel visual autoregressive model that enables randomized parallel generation, addressing the inherent limitations of conventional raster-order approaches, which hinder inference efficiency and zero-shot generalization due to their sequential, predefined token generation order. Our key insight is that effective random-order modeling necessitates explicit guidance for determining the position of the next predicted token. To this end, we propose a novel guided decoding framework that decouples positional guidance from content representation, encoding them separately as queries and key-value pairs. By directly incorporating this guidance into the causal attention mechanism, our approach enables fully random-order training and generation, eliminating the need for bidirectional attention. Consequently, ARPG readily generalizes to zero-shot tasks such as image inpainting, outpainting, and resolution expansion. Furthermore, it supports parallel inference by concurrently processing multiple queries using a shared KV cache. On the ImageNet-1K 256 benchmark, our approach attains an FID of 1.94 with only 64 sampling steps, achieving over a 20-fold increase in throughput while reducing memory consumption by over 75% compared to representative recent autoregressive models at a similar scale.
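A minimal sketch (PyTorch) of the decoupling described above: the query carries only the position of the token to be predicted, while keys and values carry the content of tokens decoded so far, so several position queries can be answered in parallel against one shared KV cache. The module name, shapes, and wiring are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GuidedDecodingAttention(nn.Module):
    """Position-only queries attend over content-only keys/values (illustrative)."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, target_pos_emb, content_tokens):
        # target_pos_emb: (B, M, D) positional queries for the M tokens to predict next
        # content_tokens: (B, N, D) key/value states of already-generated tokens (shared KV cache)
        out, _ = self.attn(query=target_pos_emb, key=content_tokens, value=content_tokens)
        return out  # (B, M, D), fed to the output head to predict M tokens in parallel

# Toy usage: predict the next M randomly chosen positions in one step.
B, N, M, D = 2, 64, 8, 256
kv_cache = torch.randn(B, N, D)      # content states of tokens decoded so far
pos_queries = torch.randn(B, M, D)   # embeddings of the next M target positions
hidden = GuidedDecodingAttention(D)(pos_queries, kv_cache)
```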
https://arxiv.org/abs/2503.10568
Robot navigation in complex environments necessitates controllers that are adaptive and safe. Traditional controllers like Regulated Pure Pursuit, Dynamic Window Approach, and Model-Predictive Path Integral, while reliable, struggle to adapt to dynamic conditions. Reinforcement Learning offers adaptability but lacks formal safety guarantees. To address this, we propose a path tracking controller leveraging the Simplex architecture. It combines a Reinforcement Learning controller for adaptiveness and performance with a high-assurance controller providing safety and stability. Our contribution is twofold. First, we discuss general stability and safety considerations for designing controllers using the Simplex architecture. Second, we present a Simplex-based path tracking controller. Our simulation results, supported by preliminary in-field tests, demonstrate the controller's effectiveness in maintaining safety while achieving performance comparable to state-of-the-art methods.
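A compact sketch of the Simplex switching pattern the abstract relies on: run the learning-based controller by default and fall back to a high-assurance controller whenever the predicted next state would violate a safety condition. The controllers, the one-step predictor, and the safety check are placeholders, not the paper's design.

```python
def simplex_step(state, rl_controller, safe_controller, predict_next, is_safe):
    """One decision cycle of a Simplex-style path tracking controller (illustrative)."""
    u_rl = rl_controller(state)                 # adaptive, high-performance action
    if is_safe(predict_next(state, u_rl)):      # e.g., bounds on cross-track error and speed
        return u_rl
    return safe_controller(state)               # certified fallback (e.g., a conservative pure pursuit)
```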
https://arxiv.org/abs/2503.10559
The evolution from motion capture and teleoperation to robot skill learning has emerged as a hotspot and critical pathway for advancing embodied intelligence. However, existing systems still face a persistent gap in simultaneously achieving four objectives: accurate tracking of full upper limb movements over extended durations (Accuracy), ergonomic adaptation to human biomechanics (Comfort), versatile data collection (e.g., force data) and compatibility with humanoid robots (Versatility), and lightweight design for outdoor daily use (Convenience). We present NuExo, a wearable exoskeleton system incorporating user-friendly immersive teleoperation and multi-modal sensing to bridge this gap. Owing to a novel shoulder mechanism with a synchronized linkage and timing-belt transmission, the system adapts well to compound shoulder movements and covers 100% of the natural upper limb motion range. Weighing 5.2 kg, NuExo supports backpack-type use and can be conveniently applied in daily outdoor scenarios. Furthermore, we develop a unified, intuitive teleoperation framework and a comprehensive data collection system integrating multi-modal sensing for various humanoid robots. Experiments across distinct humanoid platforms and different users validate the exoskeleton's superiority in motion range and flexibility, while confirming its stability in data collection and its teleoperation accuracy in dynamic scenarios.
https://arxiv.org/abs/2503.10554
As facial recognition is increasingly adopted for government and commercial services, its potential misuse has raised serious concerns about privacy and civil rights. To counteract this, various anti-facial-recognition techniques have been proposed to protect privacy by adversarially perturbing face images, among which generative makeup-based approaches are the most popular. However, these methods, designed primarily to impersonate specific target identities, achieve only weak dodging success rates while increasing the risk of targeted abuse. In addition, they often introduce global visual artifacts or lack the adaptability to accommodate diverse makeup prompts, compromising user satisfaction. To address these limitations, we develop MASQUE, a novel diffusion-based framework that generates localized adversarial makeup guided by user-defined text prompts. Built upon precise null-text inversion, customized cross-attention fusion with masking, and a pairwise adversarial guidance mechanism using images of the same individual, MASQUE achieves robust dodging performance without requiring any external identity. Comprehensive evaluations on open-source facial recognition models and commercial APIs demonstrate that MASQUE significantly improves dodging success rates over all baselines, along with higher perceptual fidelity and stronger adaptability to various text makeup prompts.
https://arxiv.org/abs/2503.10549
We propose a general framework called VisionLogic to extract interpretable logic rules from deep vision models, with a focus on image classification tasks. Given any deep vision model that uses a fully connected layer as the output head, VisionLogic transforms neurons in the last layer into predicates and grounds them into vision concepts using causal validation. In this way, VisionLogic can provide local explanations for single images and global explanations for specific classes in the form of logic rules. Compared to existing interpretable visualization tools such as saliency maps, VisionLogic addresses several key challenges, including the lack of causal explanations, overconfidence in visualizations, and ambiguity in interpretation. VisionLogic also facilitates the study of visual concepts encoded by predicates, particularly how they behave under perturbation -- an area that remains underexplored in the field of hidden semantics. Apart from providing better visual explanations and insights into the visual concepts learned by the model, we show that VisionLogic retains most of the neural network's discriminative power in an interpretable and transparent manner. We envision it as a bridge between complex model behavior and human-understandable explanations, providing trustworthy and actionable insights for real-world applications.
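A small NumPy sketch of the neuron-to-predicate step: binarize last-hidden-layer activations with per-neuron thresholds so each neuron acts as a predicate, then read a class rule as a conjunction of the predicates most associated with that class. The thresholding and the frequency-based rule selection are crude stand-ins (assumptions) for the causal validation used by VisionLogic.

```python
import numpy as np

def neuron_predicates(activations, thresholds):
    # activations: (num_images, num_neurons) penultimate features; thresholds: (num_neurons,)
    return activations > thresholds                 # boolean predicate truth values per image

def class_rule(predicates, labels, target_class, top_k=5):
    # pick the k predicates that fire most often on images of the target class
    rates = predicates[labels == target_class].mean(axis=0)
    return np.argsort(-rates)[:top_k]               # indices forming the conjunctive rule
```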
https://arxiv.org/abs/2503.10547
With the rapid advancement of large language models (LLMs) and vision-language models (VLMs), significant progress has been made in developing open-vocabulary robotic manipulation systems. However, many existing approaches overlook the importance of object dynamics, limiting their applicability to more complex, dynamic tasks. In this work, we introduce KUDA, an open-vocabulary manipulation system that integrates dynamics learning and visual prompting through keypoints, leveraging both VLMs and learning-based neural dynamics models. Our key insight is that a keypoint-based target specification is simultaneously interpretable by VLMs and can be efficiently translated into cost functions for model-based planning. Given language instructions and visual observations, KUDA first assigns keypoints to the RGB image and queries the VLM to generate target specifications. These abstract keypoint-based representations are then converted into cost functions, which are optimized using a learned dynamics model to produce robotic trajectories. We evaluate KUDA on a range of manipulation tasks, including free-form language instructions across diverse object categories, multi-object interactions, and deformable or granular objects, demonstrating the effectiveness of our framework. The project page is available at this http URL.
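An illustrative sketch of the planning loop implied above: the keypoint-based target specification becomes a quadratic cost, minimized by random-shooting MPC through a learned dynamics model. Function names, the state layout, and the sampling scheme are assumptions.

```python
import numpy as np

def keypoint_cost(pred_keypoints, target_keypoints):
    # pred/target: (K, 3) arrays of keypoint positions
    return np.sum((pred_keypoints - target_keypoints) ** 2)

def plan_action(state, target_keypoints, dynamics_model, action_dim, horizon=10, n_samples=256):
    candidates = np.random.uniform(-1, 1, size=(n_samples, horizon, action_dim))
    costs = []
    for actions in candidates:
        s = state
        for a in actions:
            s = dynamics_model(s, a)                        # learned neural dynamics rollout
        costs.append(keypoint_cost(s["keypoints"], target_keypoints))  # assumes the state exposes keypoints
    return candidates[int(np.argmin(costs))][0]             # execute the first action of the best sequence
```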
https://arxiv.org/abs/2503.10546
This work concerns the path-star task, a minimal example of searching over a graph. The graph, $G$, is star-shaped with $D$ arms radiating from a start node, $s$. A language model (LM) is given $G$, $s$, and a target node $t$, which ends one of the arms and is tasked with generating the arm containing $t$. The minimal nature of this task means only a single choice needs to be made: which of the $D$ arms contains $t$? Decoder-only LMs fail to solve this elementary task above $1/D$ chance due to a learned shortcut that absorbs training supervision. We show how this pathology is caused by excess supervision and we present a series of solutions demonstrating that the task is solvable via decoder-only LMs. We find that the task's minimal nature causes its difficulty, as it prevents task decomposition. Our solutions provide insight into the pathology and its implications for LMs trained via next-token prediction.
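A short sketch of how a path-star instance can be built: D arms of length L radiate from a start node, the graph is serialized as a shuffled edge list, and the model must generate the arm ending at the target. Node naming and the serialization format are assumptions.

```python
import random

def make_path_star(D=5, L=4, seed=0):
    rng = random.Random(seed)
    start, arms, edges = "s", [], []
    for d in range(D):
        arm = [start] + [f"n{d}_{i}" for i in range(L)]
        arms.append(arm)
        edges += list(zip(arm[:-1], arm[1:]))
    rng.shuffle(edges)                               # edge order must not reveal the answer
    target_arm = rng.randrange(D)
    target = arms[target_arm][-1]
    prompt = " | ".join(f"{u}->{v}" for u, v in edges) + f" ; start={start} target={target} :"
    answer = " ".join(arms[target_arm])              # the arm containing the target
    return prompt, answer
```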
https://arxiv.org/abs/2503.10542
Support Vector Regression (SVR) and its variants are widely used for regression tasks; however, because their solution requires solving an expensive quadratic programming problem, their applicability is limited, especially on large datasets. In addition, SVR uses an epsilon-insensitive loss function that is sensitive to outliers, which can adversely affect its performance. We propose Granular Ball Support Vector Regression (GBSVR) to tackle the regression problem using the granular-ball concept. Granular balls are useful for simplifying complex data spaces in machine learning tasks; however, to the best of our knowledge, they have not been sufficiently explored for regression problems. Granular balls group data points by proximity and reduce the computational cost of SVR by replacing a large number of data points with far fewer granular balls. This work also proposes a discretization method for continuous-valued attributes to facilitate the construction of granular balls. The effectiveness of the proposed approach is evaluated on several benchmark datasets, where it outperforms existing state-of-the-art approaches.
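A rough sketch of the granular-ball idea as described: group training points into balls by proximity (k-means is used here as a stand-in for the ball-construction procedure), then fit a standard SVR on the much smaller set of ball centres. This is an assumption-laden illustration, not the paper's algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVR

def granular_ball_svr(X, y, n_balls=50):
    km = KMeans(n_clusters=n_balls, n_init=10).fit(X)
    centers = km.cluster_centers_
    # each ball's target is the mean response of the points it covers
    center_y = np.array([y[km.labels_ == k].mean() for k in range(n_balls)])
    return SVR(kernel="rbf").fit(centers, center_y)   # the QP now scales with n_balls, not len(X)
```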
https://arxiv.org/abs/2503.10539
High-quality test items are essential for educational assessments, particularly within Item Response Theory (IRT). Traditional validation methods rely on resource-intensive pilot testing to estimate item difficulty and discrimination. More recently, Item-Writing Flaw (IWF) rubrics have emerged as a domain-general approach for evaluating test items based on textual features. However, their relationship to IRT parameters remains underexplored. To address this gap, we conducted a study involving over 7,000 multiple-choice questions across various STEM subjects (e.g., math and biology). Using an automated approach, we annotated each question with a 19-criteria IWF rubric and studied relationships to data-driven IRT parameters. Our analysis revealed statistically significant links between the number of IWFs and the IRT difficulty and discrimination parameters, particularly in the life and physical science domains. We further observed that specific IWF criteria affect item quality with differing severity (e.g., negative wording vs. implausible distractors). Overall, while IWFs are useful for predicting IRT parameters, particularly for screening low-difficulty MCQs, they cannot replace traditional data-driven validation methods. Our findings highlight the need for further research on domain-general evaluation rubrics and on algorithms that understand domain-specific content for robust item validation.
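A minimal sketch of the kind of analysis reported: correlate the number of item-writing flaws per question with its estimated IRT difficulty and discrimination. The data-frame column names are hypothetical placeholders.

```python
import pandas as pd
from scipy.stats import spearmanr

def flaw_parameter_correlations(items: pd.DataFrame):
    # one row per MCQ with columns 'n_iwf', 'irt_difficulty', 'irt_discrimination' (assumed names)
    rho_b, p_b = spearmanr(items["n_iwf"], items["irt_difficulty"])
    rho_a, p_a = spearmanr(items["n_iwf"], items["irt_discrimination"])
    return {"difficulty": (rho_b, p_b), "discrimination": (rho_a, p_a)}
```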
https://arxiv.org/abs/2503.10533
In this study, we present an approach for efficient spatiotemporal feature extraction using MobileNetV4 and a multi-scale 3D MLP-Mixer-based temporal aggregation module. MobileNetV4, with its Universal Inverted Bottleneck (UIB) blocks, serves as the backbone for extracting hierarchical feature representations from input image sequences, ensuring both computational efficiency and rich semantic encoding. To capture temporal dependencies, we introduce a three-level MLP-Mixer module, which processes spatial features at multiple resolutions while maintaining structural integrity. Experimental results on the ABAW 8th competition demonstrate the effectiveness of our approach, showing promising performance in affective behavior analysis. By integrating an efficient vision backbone with a structured temporal modeling mechanism, the proposed framework achieves a balance between computational efficiency and predictive accuracy, making it well-suited for real-time applications in mobile and embedded computing environments.
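An illustrative PyTorch sketch of multi-scale MLP-Mixer-style temporal aggregation: per-frame backbone features are mixed along the time axis at several temporal resolutions and the pooled results are fused. Layer sizes, the pooling scheme, and the assumption that the frame count is divisible by each scale are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalMixer(nn.Module):
    def __init__(self, n_frames: int, dim: int):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.time_mix = nn.Sequential(nn.Linear(n_frames, n_frames), nn.GELU(), nn.Linear(n_frames, n_frames))
        self.chan_mix = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))

    def forward(self, x):                                  # x: (B, T, D)
        x = x + self.time_mix(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.chan_mix(self.norm2(x))

class MultiScaleTemporal(nn.Module):
    def __init__(self, n_frames: int, dim: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.mixers = nn.ModuleList([TemporalMixer(n_frames // s, dim) for s in scales])
        self.head = nn.Linear(dim * len(scales), dim)

    def forward(self, x):                                  # x: (B, T, D), T divisible by each scale
        outs = []
        for s, mixer in zip(self.scales, self.mixers):
            xs = x if s == 1 else F.avg_pool1d(x.transpose(1, 2), s).transpose(1, 2)
            outs.append(mixer(xs).mean(dim=1))             # temporal pooling at this scale
        return self.head(torch.cat(outs, dim=-1))
```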
https://arxiv.org/abs/2503.10530
3D Multimodal Large Language Models (MLLMs) have recently made substantial advancements. However, their potential remains untapped, primarily due to the limited quantity and suboptimal quality of 3D datasets. Current approaches attempt to transfer knowledge from 2D MLLMs to expand 3D instruction data, but still face modality and domain gaps. To this end, we introduce PiSA-Engine (Point-Self-Augmented-Engine), a new framework for generating instruction point-language datasets enriched with 3D spatial semantics. We observe that existing 3D MLLMs offer a comprehensive understanding of point clouds for annotation, while 2D MLLMs excel at cross-validation by providing complementary information. By integrating holistic 2D and 3D insights from off-the-shelf MLLMs, PiSA-Engine enables a continuous cycle of high-quality data generation. We select PointLLM as the baseline and adopt this co-evolution training framework to develop an enhanced 3D MLLM, termed PointLLM-PiSA. Additionally, we identify limitations in previous 3D benchmarks, which often feature coarse language captions and insufficient category diversity, resulting in inaccurate evaluations. To address this gap, we further introduce PiSA-Bench, a comprehensive 3D benchmark covering six key aspects with detailed and diverse labels. Experimental results demonstrate PointLLM-PiSA's state-of-the-art performance in zero-shot 3D object captioning and generative classification on our PiSA-Bench, achieving significant improvements of 46.45% (+8.33%) and 63.75% (+16.25%), respectively. We will release the code, datasets, and benchmark.
https://arxiv.org/abs/2503.10529
Reconstructing accelerated MRI is an ill-posed problem. Machine learning has recently shown great promise at this task, but current approaches to quantifying uncertainty focus on measuring variability in pixelwise intensity. Although these provide interpretable maps, they lack structural understanding and have no clear relationship to how the data will subsequently be analysed. In this paper, we propose a new approach to evaluating reconstruction variability based on apparent anatomical changes in the reconstruction, which is more tightly related to common downstream tasks. We use image registration and segmentation to evaluate several common MRI reconstruction approaches for accelerated imaging, where uncertainty is measured via ensembling. We demonstrate the intrinsic variability in reconstructed images and show that models with high scores on commonly used quality metrics such as SSIM and PSNR can nonetheless display high levels of variance and bias in anatomical measures.
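A short sketch of the proposed evaluation idea: reconstruct the same undersampled scan with an ensemble, segment each reconstruction, and report the spread of an anatomical measure (here a structure's volume) rather than pixelwise intensity variance. The reconstruction and segmentation callables are placeholders.

```python
import numpy as np

def anatomical_variability(kspace, ensemble, segment, voxel_volume_mm3=1.0):
    volumes = []
    for model in ensemble:
        recon = model(kspace)                   # one reconstruction per ensemble member
        mask = segment(recon)                   # e.g., a binary mask of the structure of interest
        volumes.append(mask.sum() * voxel_volume_mm3)
    volumes = np.array(volumes)
    return volumes.mean(), volumes.std()        # central estimate and variability of the anatomical measure
```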
https://arxiv.org/abs/2503.10527
Cross-modal retrieval aims to bridge the semantic gap between different modalities, such as visual and textual data, enabling accurate retrieval across them. Despite significant advancements with models like CLIP that align cross-modal representations, a persistent challenge remains: the hubness problem, where a small subset of samples (hubs) dominate as nearest neighbors, leading to biased representations and degraded retrieval accuracy. Existing methods often mitigate hubness through post-hoc normalization techniques, relying on prior data distributions that may not be practical in real-world scenarios. In this paper, we directly mitigate hubness during training and introduce NeighborRetr, a novel method that effectively balances the learning of hubs and adaptively adjusts the relations of various kinds of neighbors. Our approach not only mitigates the hubness problem but also enhances retrieval performance, achieving state-of-the-art results on multiple cross-modal retrieval benchmarks. Furthermore, NeighborRetr demonstrates robust generalization to new domains with substantial distribution shifts, highlighting its effectiveness in real-world applications. We make our code publicly available at: this https URL .
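A small diagnostic sketch of the hubness problem itself: count how often each gallery item appears among the k nearest neighbours of the queries (its k-occurrence); a heavy-tailed count distribution signals hubs. This is a generic measurement, not NeighborRetr's training-time mechanism.

```python
import numpy as np

def k_occurrence(query_emb, gallery_emb, k=10):
    sims = query_emb @ gallery_emb.T                     # cosine similarity if rows are L2-normalized
    topk = np.argsort(-sims, axis=1)[:, :k]              # each query's k nearest gallery items
    return np.bincount(topk.ravel(), minlength=gallery_emb.shape[0])  # hubs have unusually large counts
```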
https://arxiv.org/abs/2503.10526
This paper presents our method for the estimation of valence-arousal (VA) in the 8th Affective Behavior Analysis in-the-Wild (ABAW) competition. Our approach integrates visual and audio information through a multimodal framework. The visual branch uses a pre-trained ResNet model to extract spatial features from facial images. The audio branches employ pre-trained VGG models to extract VGGish and LogMel features from speech signals. These features undergo temporal modeling using Temporal Convolutional Networks (TCNs). We then apply cross-modal attention mechanisms, where visual features interact with audio features through query-key-value attention structures. Finally, the features are concatenated and passed through a regression layer to predict valence and arousal. Our method achieves competitive performance on the Aff-Wild2 dataset, demonstrating effective multimodal fusion for VA estimation in-the-wild.
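An illustrative PyTorch sketch of the fusion pipeline described: temporally smoothed visual and audio feature sequences interact through query-key-value cross-attention, are concatenated, and a regression head predicts valence and arousal per frame. The single convolution standing in for a TCN and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class AVFusionVA(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        self.v_tcn = nn.Conv1d(dim, dim, kernel_size=3, padding=1)   # stand-in for a TCN stack
        self.a_tcn = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.cross = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 2)                            # -> (valence, arousal)

    def forward(self, vis, aud):                                     # vis, aud: (B, T, dim)
        vis = self.v_tcn(vis.transpose(1, 2)).transpose(1, 2)
        aud = self.a_tcn(aud.transpose(1, 2)).transpose(1, 2)
        attended, _ = self.cross(query=vis, key=aud, value=aud)      # visual queries attend to audio
        return self.head(torch.cat([attended, vis], dim=-1))         # (B, T, 2)
```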
https://arxiv.org/abs/2503.10523
Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from scarce high-quality, multi-modal training data, and struggle to effectively integrate diverse inputs. In this work, we propose AudioX, a unified Diffusion Transformer model for Anything-to-Audio and Music Generation. Unlike previous domain-specific models, AudioX can generate both general audio and music with high quality, while offering flexible natural language control and seamless processing of various modalities including text, video, image, music, and audio. Its key innovation is a multi-modal masked training strategy that masks inputs across modalities and forces the model to learn from masked inputs, yielding robust and unified cross-modal representations. To address data scarcity, we curate two comprehensive datasets: vggsound-caps with 190K audio captions based on the VGGSound dataset, and V2M-caps with 6 million music captions derived from the V2M dataset. Extensive experiments demonstrate that AudioX not only matches or outperforms state-of-the-art specialized models, but also offers remarkable versatility in handling diverse input modalities and generation tasks within a unified architecture. The code and datasets will be available at this https URL
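A tiny sketch of the multi-modal masked training idea: at each step, whole modality conditions are randomly replaced by a learned null embedding, so the model must reconstruct the audio target from whatever subset remains. Modality names, shapes, and the masking rate are assumptions.

```python
import torch

def mask_modalities(cond: dict, null_emb: dict, p_mask: float = 0.5):
    # cond: {"text": (B, Lt, D), "video": (B, Lv, D), ...}; null_emb holds broadcastable null tokens
    masked = {}
    for name, feat in cond.items():
        drop = torch.rand(()) < p_mask
        masked[name] = null_emb[name].expand_as(feat) if drop else feat
    return masked                                   # passed to the diffusion transformer as conditioning
```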
https://arxiv.org/abs/2503.10522
Quality control of medical images is a critical component of digital pathology, ensuring that diagnostic images meet required standards. A pre-analytical task within this process is the verification of the number of specimen fragments, a process that ensures that the number of fragments on a slide matches the number documented in the macroscopic report. This step is important to ensure that the slides contain the appropriate diagnostic material from the grossing process, thereby guaranteeing the accuracy of subsequent microscopic examination and diagnosis. Traditionally, this assessment is performed manually, requiring significant time and effort while being subject to considerable variability due to its subjective nature. To address these challenges, this study explores an automated approach to fragment counting using the YOLOv9 and Vision Transformer models. Our results demonstrate that the automated system achieves a level of performance comparable to expert assessments, offering a reliable and efficient alternative to manual counting. Additionally, we present findings on interobserver variability, showing that the automated approach achieves an accuracy of 86%, which falls within the range of variation observed among experts (82-88%), further supporting its potential for integration into routine pathology workflows.
https://arxiv.org/abs/2503.10520
This paper presents a novel information-theoretic proof demonstrating that the human brain as currently understood cannot function as a classical digital computer. Through systematic quantification of distinguishable conscious states and their historical dependencies, we establish that the minimum information required to specify a conscious state exceeds the physical information capacity of the human brain by a significant factor. Our analysis calculates the bit-length requirements for representing consciously distinguishable sensory "stimulus frames" and demonstrates that consciousness exhibits mandatory temporal-historical dependencies that multiply these requirements beyond the brain's storage capabilities. This mathematical approach offers new insights into the fundamental limitations of computational models of consciousness and suggests that non-classical information processing mechanisms may be necessary to account for conscious experience.
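A toy rendering of the counting argument, with every number a placeholder assumption rather than a figure taken from the paper; it only shows the shape of the comparison between required and available bits.

```python
# All values below are illustrative assumptions, not the paper's estimates.
bits_per_frame = 1e6                       # bits to specify one distinguishable sensory frame
frames_per_second = 10                     # distinguishable conscious states per second
seconds_per_lifetime = 70 * 365.25 * 24 * 3600
history_factor = 2                         # multiplier from temporal-historical dependencies

required_bits = bits_per_frame * frames_per_second * seconds_per_lifetime * history_factor
brain_capacity_bits = 1e16                 # assumed bound on the brain's physical storage capacity
print(f"required ~ {required_bits:.2e} bits vs capacity ~ {brain_capacity_bits:.2e} bits")
```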
https://arxiv.org/abs/2503.10518
Discourse understanding is essential for many NLP tasks, yet most existing work remains constrained by framework-dependent discourse representations. This work investigates whether large language models (LLMs) capture discourse knowledge that generalizes across languages and frameworks. We address this question along two dimensions: (1) developing a unified discourse relation label set to facilitate cross-lingual and cross-framework discourse analysis, and (2) probing LLMs to assess whether they encode generalizable discourse abstractions. Using multilingual discourse relation classification as a testbed, we examine a comprehensive set of 23 LLMs of varying sizes and multilingual capabilities. Our results show that LLMs, especially those with multilingual training corpora, can generalize discourse information across languages and frameworks. Further layer-wise analyses reveal that language generalization at the discourse level is most salient in the intermediate layers. Lastly, our error analysis provides an account of challenging relation classes.
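A minimal sketch of the probing setup described: extract hidden states from every layer of a multilingual LM and fit a linear probe per layer to classify discourse relations. The model name, mean pooling, and in-sample scoring are simplifying assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

def layerwise_probe(texts, labels, model_name="xlm-roberta-base"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()
    with torch.no_grad():
        batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = model(**batch).hidden_states            # (n_layers + 1) tensors of shape (B, T, D)
    scores = []
    for layer_states in hidden:
        feats = layer_states.mean(dim=1).numpy()         # mean-pool tokens per example
        probe = LogisticRegression(max_iter=1000).fit(feats, labels)
        scores.append(probe.score(feats, labels))        # crude per-layer signal; use held-out data in practice
    return scores
```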
https://arxiv.org/abs/2503.10515
We consider the problem of generating valid and small prediction sets by sampling outputs (e.g., software code and natural language text) from a black-box deep generative model for a given input (e.g., a textual prompt). The validity of a prediction set is determined by a user-defined binary admissibility function that depends on the target application; for example, in code generation, requiring at least one program in the set to pass all test cases. To address this problem, we develop a simple and effective conformal inference algorithm referred to as Generative Prediction Sets (GPS). Given a set of calibration examples and black-box access to a deep generative model, GPS can generate prediction sets with provable guarantees. The key insight behind GPS is to exploit the inherent structure of the distribution over the minimum number of samples needed to obtain an admissible output, and to build a simple conformal regression approach on this quantity. Experiments on multiple datasets of code and math word problems, using different large language models, demonstrate the efficacy of GPS over state-of-the-art methods.
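A compact sketch of the calibration step behind GPS as described: for each calibration prompt, count how many samples the generator needs before an admissible output appears, then take a conformal quantile of those counts as the test-time sample budget. The generator and admissibility function are user-supplied placeholders.

```python
import math

def calibrate_sample_budget(cal_prompts, generate, admissible, alpha=0.1, max_samples=100):
    counts = []
    for prompt in cal_prompts:
        n = max_samples
        for i in range(1, max_samples + 1):
            if admissible(prompt, generate(prompt)):     # e.g., the generated program passes all tests
                n = i
                break
        counts.append(n)
    counts.sort()
    k = math.ceil((len(counts) + 1) * (1 - alpha))       # finite-sample conformal quantile index
    return counts[min(k, len(counts)) - 1]               # draw this many samples at test time
```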
https://arxiv.org/abs/2503.10512
In the domain of Image Anomaly Detection (IAD), existing methods frequently lack fine-grained, interpretable semantic information; as a result, the anomalous entities or actions they detect are susceptible to machine illusions and lack sufficient explanation. In this work, we propose a novel approach to anomaly detection, termed Hoi2Anomaly, which aims to achieve precise discrimination and localization of anomalies. First, we construct a multi-modal instruction tuning dataset comprising human-object interaction (HOI) pairs in anomalous scenarios. Second, we train an HOI extractor in threat scenarios to localize and match anomalous actions and entities. Finally, explanatory content is generated for the detected anomalous HOIs by fine-tuning a visual language pretraining (VLP) framework. The experimental results demonstrate that Hoi2Anomaly surpasses existing generative approaches in terms of precision and explainability. We will release Hoi2Anomaly to advance the field of anomaly detection.
https://arxiv.org/abs/2503.10508