Creating a photorealistic reconstruction of scene and human from a single monocular in-the-wild video figures prominently in the perception of a human-centric 3D world. Recent neural rendering advances have enabled holistic human-scene reconstruction but require pre-calibrated camera and human poses, and days of training time. In this work, we introduce a novel unified framework that simultaneously performs camera tracking, human pose estimation and human-scene reconstruction in an online fashion. 3D Gaussian Splatting is utilized to learn Gaussian primitives for humans and scenes efficiently, and reconstruction-based camera tracking and human pose estimation modules are designed to enable holistic understanding and effective disentanglement of pose and appearance. Specifically, we design a human deformation module to faithfully reconstruct details and enhance generalizability to out-of-distribution poses. Aiming to accurately learn the spatial correlation between human and scene, we introduce occlusion-aware human silhouette rendering and monocular geometric priors, which further improve reconstruction quality. Experiments on the EMDB and NeuMan datasets demonstrate superior or on-par performance with existing methods in camera tracking, human pose estimation, novel view synthesis and runtime. Our project page is at this https URL.
https://arxiv.org/abs/2504.13167
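To make the supervision concrete, here is a minimal sketch of how a composite objective combining photometric reconstruction, occlusion-aware silhouette supervision, and a monocular geometric prior might look; the term forms, masks, and weights are illustrative assumptions, not the paper's published loss.

```python
import torch
import torch.nn.functional as F

def training_loss(rendered_rgb, gt_rgb,
                  rendered_silhouette, human_mask, occlusion_mask,
                  rendered_depth, mono_depth,
                  w_sil=0.1, w_geo=0.05):
    """Illustrative composite objective: photometric + occlusion-aware
    silhouette + monocular geometric prior. Weights and term forms are
    assumptions for this sketch."""
    # Photometric reconstruction on the full frame.
    l_photo = F.l1_loss(rendered_rgb, gt_rgb)
    # Silhouette supervision only where the human is not occluded by
    # scene geometry (occlusion_mask = 1 on visible human pixels).
    l_sil = F.binary_cross_entropy(
        rendered_silhouette * occlusion_mask,
        human_mask * occlusion_mask)
    # Monocular depth prior, simplified here to an L1 on normalized depth.
    norm = lambda d: (d - d.mean()) / (d.std() + 1e-6)
    l_geo = F.l1_loss(norm(rendered_depth), norm(mono_depth))
    return l_photo + w_sil * l_sil + w_geo * l_geo
```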
This study explores the relationship between deep learning (DL) model accuracy and expert agreement in the classification of crash narratives. We evaluate five DL models -- including BERT variants, the Universal Sentence Encoder (USE), and a zero-shot classifier -- against expert-labeled data and narrative text. The analysis is further extended to four large language models (LLMs): GPT-4, LLaMA 3, Qwen, and Claude. Our results reveal a counterintuitive trend: models with higher technical accuracy often exhibit lower agreement with domain experts, whereas LLMs demonstrate greater expert alignment despite relatively lower accuracy scores. To quantify and interpret model-expert agreement, we employ Cohen's Kappa, Principal Component Analysis (PCA), and SHAP-based explainability techniques. Findings indicate that expert-aligned models tend to rely more on contextual and temporal language cues, rather than location-specific keywords. These results underscore that accuracy alone is insufficient for evaluating models in safety-critical NLP applications. We advocate for incorporating expert agreement as a complementary metric in model evaluation frameworks and highlight the promise of LLMs as interpretable, scalable tools for crash analysis pipelines.
https://arxiv.org/abs/2504.13068
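Since expert agreement is the paper's central metric, a minimal example of computing Cohen's Kappa between one model's predictions and expert labels (using scikit-learn; the crash categories below are made up for illustration) may help:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical label vectors: one model's predictions and the
# expert-assigned classes over the same crash narratives.
model_preds   = ["run-off-road", "rear-end", "rear-end", "angle", "rear-end"]
expert_labels = ["run-off-road", "rear-end", "angle", "angle", "rear-end"]

# Kappa corrects raw agreement for agreement expected by chance.
kappa = cohen_kappa_score(model_preds, expert_labels)
print(f"Cohen's kappa (model vs. expert): {kappa:.3f}")
```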
In the rapidly advancing field of robotics, dual-arm coordination and complex object manipulation are essential capabilities for developing advanced autonomous systems. However, the scarcity of diverse, high-quality demonstration data and real-world-aligned evaluation benchmarks severely limits such development. To address this, we introduce RoboTwin, a generative digital twin framework that uses 3D generative foundation models and large language models to produce diverse expert datasets and provide a real-world-aligned evaluation platform for dual-arm robotic tasks. Specifically, RoboTwin creates varied digital twins of objects from single 2D images, generating realistic and interactive scenarios. It also introduces a spatial relation-aware code generation framework that combines object annotations with large language models to break down tasks, determine spatial constraints, and generate precise robotic movement code. Our framework offers a comprehensive benchmark with both simulated and real-world data, enabling standardized evaluation and better alignment between simulated training and real-world performance. We validated our approach using the open-source COBOT Magic Robot platform. Policies pre-trained on RoboTwin-generated data and fine-tuned with limited real-world samples demonstrate significant potential for enhancing dual-arm robotic manipulation systems by improving success rates by over 70% for single-arm tasks and over 40% for dual-arm tasks compared to models trained solely on real-world data.
https://arxiv.org/abs/2504.13059
Gaussian splatting demonstrates proficiency in 3D scene modeling but suffers from substantial data volume due to inherent primitive redundancy. To enable future photorealistic 3D immersive visual communication applications, significant compression is essential for transmission over the existing Internet infrastructure. Hence, we propose Compressed Gaussian Splatting (CompGS++), a novel framework that leverages compact Gaussian primitives to achieve accurate 3D modeling with substantial size reduction for both static and dynamic scenes. Our design is based on the principle of eliminating redundancy both between and within primitives. Specifically, we develop a comprehensive prediction paradigm to address inter-primitive redundancy through spatial and temporal primitive prediction modules. The spatial primitive prediction module establishes predictive relationships for scene primitives and enables most primitives to be encoded as compact residuals, substantially reducing the spatial redundancy. We further devise a temporal primitive prediction module to handle dynamic scenes, which exploits primitive correlations across timestamps to effectively reduce temporal redundancy. Moreover, we devise a rate-constrained optimization module that jointly minimizes reconstruction error and rate consumption. This module effectively eliminates parameter redundancy within primitives and enhances the overall compactness of scene representations. Comprehensive evaluations across multiple benchmark datasets demonstrate that CompGS++ significantly outperforms existing methods, achieving superior compression performance while preserving accurate scene modeling. Our implementation will be made publicly available on GitHub to facilitate further research.
https://arxiv.org/abs/2504.13022
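A toy sketch of the inter-primitive prediction idea: most primitives are stored only as quantized residuals against values predicted from anchor primitives. The identity predictor and uniform scalar quantizer below are stand-ins for CompGS++'s learned modules.

```python
import numpy as np

def encode_residuals(primitives, anchors, predict, step=1e-3):
    """Predict each primitive's attributes from its anchor and store
    only the quantized residual. `predict` stands in for the learned
    prediction module (an assumption, not the paper's network)."""
    residuals = primitives - predict(anchors)
    # Uniform scalar quantization of residuals for entropy coding.
    return np.round(residuals / step).astype(np.int32)

def decode_residuals(q, anchors, predict, step=1e-3):
    return predict(anchors) + q.astype(np.float32) * step

# Toy usage with an identity predictor: nearby primitives are similar,
# so residuals are small and quantize to few symbols.
prims = np.array([[1.00, 2.00], [1.01, 2.02]], dtype=np.float32)
anchs = np.array([[1.00, 2.00], [1.00, 2.00]], dtype=np.float32)
print(encode_residuals(prims, anchs, predict=lambda a: a))
```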
Domain generalization aims to train on source domains to uncover a domain-invariant feature space, allowing the model to generalize robustly to unknown target domains. However, due to domain gaps, it is hard to find a reliable common image feature space; the reason is the lack of suitable basic units for images. Unlike images in vision space, language has comprehensive expression elements that can effectively convey semantics. Inspired by the semantic completeness of language and the intuitiveness of images, we propose VLCA, which combines language space and vision space and connects multiple image domains by using semantic space as the bridge domain. Specifically, in language space, taking advantage of the completeness of language basic units, we capture the semantic representation of the relations between categories through word-vector distance. Then, in vision space, taking advantage of the intuitiveness of image features, the common pattern of sample features within the same class is explored through low-rank approximation. In the end, the language representation is aligned with the vision representation through the multimodal space of text and image. Experiments demonstrate the effectiveness of the proposed method.
https://arxiv.org/abs/2504.12966
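The two building blocks the abstract names, word-vector distance in language space and low-rank approximation in vision space, can be illustrated in plain NumPy; the embeddings and the rank-k SVD solver below are illustrative stand-ins, not the paper's exact formulation.

```python
import numpy as np

# --- Language space: semantic relations via word-vector distance ---
# Hypothetical class embeddings (e.g., from pretrained word vectors).
word_vecs = {"dog":  np.array([0.90, 0.10, 0.30]),
             "wolf": np.array([0.85, 0.15, 0.35]),
             "car":  np.array([0.10, 0.90, 0.20])}

def cosine_dist(a, b):
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_dist(word_vecs["dog"], word_vecs["wolf"]))  # small: related
print(cosine_dist(word_vecs["dog"], word_vecs["car"]))   # large: unrelated

# --- Vision space: common class pattern via low-rank approximation ---
def low_rank_common_pattern(features, rank=1):
    """Rank-k SVD approximation of one class's (n_samples x dim) feature
    matrix; the leading components capture the pattern shared by samples
    of that class. A sketch, not the paper's exact solver."""
    U, S, Vt = np.linalg.svd(features, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]
```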
Credit assignment has remained a fundamental challenge in multi-agent reinforcement learning (MARL). Previous studies have primarily addressed this issue through value decomposition methods under the centralized training with decentralized execution paradigm, where neural networks are utilized to approximate the nonlinear relationship between individual Q-values and the global Q-value. Although these approaches have achieved considerable success in various benchmark tasks, they still suffer from several limitations, including imprecise attribution of contributions, limited interpretability, and poor scalability in high-dimensional state spaces. To address these challenges, we propose a novel algorithm, QLLM, which facilitates the automatic construction of credit assignment functions using large language models (LLMs). Specifically, the concept of TFCAF is introduced, wherein the credit allocation process is represented as a direct and expressive nonlinear functional formulation. A custom-designed coder-evaluator framework is further employed to guide the generation, verification, and refinement of executable code by LLMs, significantly mitigating issues such as hallucination and shallow reasoning during inference. Extensive experiments conducted on several standard MARL benchmarks demonstrate that the proposed method consistently outperforms existing state-of-the-art baselines. Moreover, QLLM exhibits strong generalization capability and maintains compatibility with a wide range of MARL algorithms that utilize mixing networks, positioning it as a promising and versatile solution for complex multi-agent scenarios.
https://arxiv.org/abs/2504.12961
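A hedged sketch of what a coder-evaluator refinement loop could look like; `coder` and `evaluator` are hypothetical callables standing in for the LLM and the code-verification stage the abstract describes, not the paper's actual interfaces.

```python
from typing import Callable

def coder_evaluator_loop(task_spec: str,
                         coder: Callable[[str], str],
                         evaluator: Callable[[str], tuple[bool, str]],
                         max_rounds: int = 5) -> str:
    """Sketch of generating a credit-assignment function with an LLM
    under verification. `coder` wraps the LLM call; `evaluator`
    compiles/runs the code and returns (ok, feedback)."""
    prompt = task_spec
    for _ in range(max_rounds):
        code = coder(prompt)            # LLM proposes executable code
        ok, feedback = evaluator(code)  # compile/run + sanity checks
        if ok:
            return code                 # accepted credit function
        # Feed the failure report back to curb shallow reasoning and
        # hallucinated APIs on the next round.
        prompt = f"{task_spec}\n\nPrevious attempt failed:\n{feedback}"
    raise RuntimeError("no valid credit-assignment function produced")
```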
Multispectral imaging plays a critical role in a range of intelligent transportation applications, including advanced driver assistance systems (ADAS), traffic monitoring, and night vision. However, accurate visible and thermal (RGB-T) image registration poses a significant challenge due to the considerable modality differences. In this paper, we present a novel joint Self-Correlation and Cross-Correspondence Estimation Framework (SC3EF), leveraging both local representative features and global contextual cues to effectively generate RGB-T correspondences. For this purpose, we design a convolution-transformer-based pipeline to extract local representative features and encode global correlations of intra-modality for inter-modality correspondence estimation between unaligned visible and thermal images. After merging the local and global correspondence estimation results, we further employ a hierarchical optical flow estimation decoder to progressively refine the estimated dense correspondence maps. Extensive experiments demonstrate the effectiveness of our proposed method, outperforming the current state-of-the-art (SOTA) methods on representative RGB-T datasets. Furthermore, it also shows competitive generalization capabilities across challenging scenarios, including large parallax, severe occlusions, adverse weather, and other cross-modal datasets (e.g., RGB-N and RGB-D).
https://arxiv.org/abs/2504.12869
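As a reference point for the cross-correspondence stage, a generic global correlation (cost) volume between RGB and thermal feature maps can be computed as below; SC3EF's actual pipeline additionally encodes intra-modality self-correlation and transformer-based global context.

```python
import torch
import torch.nn.functional as F

def correspondence_volume(feat_rgb, feat_thermal):
    """All-pairs correlation between two feature maps, a common
    building block for dense correspondence estimation.
    Shapes: (B, C, H, W) -> (B, H*W, H, W). Generic sketch only."""
    B, C, H, W = feat_rgb.shape
    a = F.normalize(feat_rgb.flatten(2), dim=1)      # (B, C, HW)
    b = F.normalize(feat_thermal.flatten(2), dim=1)  # (B, C, HW)
    corr = torch.einsum("bcm,bcn->bmn", a, b)        # (B, HW, HW)
    return corr.view(B, H * W, H, W)
```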
This paper investigates the integration of graph neural networks (GNNs) with Qualitative Explainable Graphs (QXGs) for scene understanding in automated driving. Scene understanding is the basis for any further reactive or proactive decision-making. Scene understanding and related reasoning is inherently an explanation task: why is another traffic participant doing something, what or who caused their actions? While previous work demonstrated QXGs' effectiveness using shallow machine learning models, these approaches were limited to analysing single relation chains between object pairs, disregarding the broader scene context. We propose a novel GNN architecture that processes entire graph structures to identify relevant objects in traffic scenes. We evaluate our method on the nuScenes dataset enriched with DriveLM's human-annotated relevance labels. Experimental results show that our GNN-based approach achieves superior performance compared to baseline methods. The model effectively handles the inherent class imbalance in relevant object identification tasks while considering the complete spatial-temporal relationships between all objects in the scene. Our work demonstrates the potential of combining qualitative representations with deep learning approaches for explainable scene understanding in autonomous driving systems.
https://arxiv.org/abs/2504.12817
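One standard way to handle the class imbalance the abstract mentions is to up-weight the rare "relevant" class in the node-classification loss; the pos_weight recipe below is a common default and an assumption here, not the paper's published setting.

```python
import torch
import torch.nn as nn

# Relevant objects are rare in a traffic scene, so the positive class
# is up-weighted; pos_weight = (#negatives / #positives) is a standard
# heuristic (illustrative counts).
num_pos, num_neg = 120, 2280
criterion = nn.BCEWithLogitsLoss(
    pos_weight=torch.tensor(num_neg / num_pos))

logits = torch.randn(8)                  # per-node relevance logits from the GNN
labels = torch.randint(0, 2, (8,)).float()
loss = criterion(logits, labels)
```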
Relational Triple Extraction (RTE) is a fundamental task in Natural Language Processing (NLP). However, prior research has primarily focused on optimizing model performance, with limited efforts to understand the internal mechanisms driving these models. Many existing methods rely on complex preprocessing to induce specific interactions, often resulting in opaque systems that may not fully align with their theoretical foundations. To address these limitations, we propose SMARTe: a Slot-based Method for Accountable Relational Triple extraction. SMARTe introduces intrinsic interpretability through a slot attention mechanism and frames the task as a set prediction problem. Slot attention consolidates relevant information into distinct slots, ensuring all predictions can be explicitly traced to learned slot representations and the tokens contributing to each predicted relational triple. While emphasizing interpretability, SMARTe achieves performance comparable to state-of-the-art models. Evaluations on the NYT and WebNLG datasets demonstrate that adding interpretability does not compromise performance. Furthermore, we conducted qualitative assessments to showcase the explanations provided by SMARTe, using attention heatmaps that map to their respective tokens. We conclude with a discussion of our findings and propose directions for future research.
https://arxiv.org/abs/2504.12816
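For readers unfamiliar with slot attention (Locatello et al.), a stripped-down PyTorch version shows the property SMARTe relies on: the softmax is taken over slots, so slots compete for tokens and each prediction is traceable to an attention map. The GRU update and per-iteration MLP of the full method are omitted in this sketch.

```python
import torch
import torch.nn as nn

class MiniSlotAttention(nn.Module):
    """Minimal slot-attention iteration; simplified, not SMARTe's
    exact module."""
    def __init__(self, dim, n_slots, n_iter=3):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(n_slots, dim) * 0.1)
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.n_iter, self.scale = n_iter, dim ** -0.5

    def forward(self, tokens):                        # tokens: (B, N, dim)
        slots = self.slots.expand(tokens.size(0), -1, -1)
        k, v = self.k(tokens), self.v(tokens)
        for _ in range(self.n_iter):
            attn = torch.softmax(                     # normalize over slots:
                self.q(slots) @ k.transpose(1, 2) * self.scale, dim=1)
            attn = attn / attn.sum(-1, keepdim=True)  # per-slot token weights
            slots = attn @ v                          # weighted token means
        return slots, attn   # attn gives the token heatmaps per prediction
```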
Thanks to research conducted mainly in the US and the broader English-speaking world, we know that language models (LMs) form biases and stereotypes of minorities, leading to unfair treatment of members of these groups. As the negative behavior of these models has severe consequences for society and individuals, industry and academia are actively developing methods to reduce the bias in LMs. However, there are many under-represented groups and languages that have been overlooked so far. This includes marginalized groups that are specific to individual countries and regions in the English-speaking and Western world, but crucially also almost all marginalized groups in the rest of the world. The UN estimates that between 600 million and 1.2 billion people worldwide are members of marginalized groups and in need of special protection. If we want to develop inclusive LMs that work for everyone, we have to broaden our understanding to include overlooked marginalized groups and low-resource languages and dialects. In this work, we contribute to this effort with the first study investigating offensive stereotyping bias in 23 LMs for 270 marginalized groups from Egypt, the remaining 21 Arab countries, Germany, the UK, and the US. Additionally, we investigate the impact of low-resource languages and dialects on the study of bias in LMs, demonstrating the limitations of current bias metrics, as we measure significantly higher bias when using the Egyptian Arabic dialect versus Modern Standard Arabic. Our results show that LMs indeed exhibit higher bias against many marginalized groups in comparison to dominant groups. However, this is not the case for Arabic LMs, where the bias is high against both marginalized and dominant groups in relation to religion and ethnicity. Our results also show higher intersectional bias against non-binary, LGBTQIA+, and Black women.
https://arxiv.org/abs/2504.12767
Recent advances in industrial anomaly detection have highlighted the need for deeper logical anomaly analysis, where unexpected relationships among objects, counts, and spatial configurations must be identified and explained. Existing approaches often rely on large-scale external reasoning modules or elaborate pipeline designs, hindering practical deployment and interpretability. To address these limitations, we introduce a new task, Reasoning Logical Anomaly Detection (RLAD), which extends traditional anomaly detection by incorporating logical reasoning. We propose a new framework, LAD-Reasoner, a customized tiny multimodal language model built on Qwen2.5-VL 3B. Our approach leverages a two-stage training paradigm that first employs Supervised Fine-Tuning (SFT) for fine-grained visual understanding, followed by Group Relative Policy Optimization (GRPO) to refine logical anomaly detection and enforce coherent, human-readable reasoning. Crucially, reward signals are derived from both the detection accuracy and the structural quality of the outputs, obviating the need for building chain of thought (CoT) reasoning data. Experiments on the MVTec LOCO AD dataset show that LAD-Reasoner, though significantly smaller, matches the performance of Qwen2.5-VL-72B in accuracy and F1 score, and further excels in producing concise and interpretable rationales. This unified design reduces reliance on large models and complex pipelines, while offering transparent and interpretable insights into logical anomaly detection. Code and data will be released.
https://arxiv.org/abs/2504.12749
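A plausible shape for the rule-based GRPO reward described above, combining detection accuracy with output-structure quality so that no chain-of-thought data is needed; the tag format and weights are assumptions for illustration, not the paper's exact reward.

```python
import re

def grpo_reward(response: str, is_anomalous: bool) -> float:
    """Sketch: reward = structural quality + detection accuracy."""
    # Structural term: response must expose readable reasoning and a
    # final verdict in a fixed, parseable format.
    structured = bool(re.search(
        r"<think>.+</think>\s*<answer>(yes|no)</answer>",
        response, re.DOTALL))
    r_format = 1.0 if structured else 0.0
    # Accuracy term: the parsed verdict must match the ground truth.
    m = re.search(r"<answer>(yes|no)</answer>", response)
    pred = (m.group(1) == "yes") if m else None
    r_acc = 1.0 if pred is not None and pred == is_anomalous else 0.0
    return 0.2 * r_format + 0.8 * r_acc  # illustrative weighting
```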
The rapid advancement of diffusion models and personalization techniques has made it possible to recreate individual portraits from just a few publicly available images. While such capabilities empower various creative applications, they also introduce serious privacy concerns, as adversaries can exploit them to generate highly realistic impersonations. To counter these threats, anti-personalization methods have been proposed, which add adversarial perturbations to published images to disrupt the training of personalization models. However, existing approaches largely overlook the intrinsic multi-image nature of personalization and instead adopt a naive strategy of applying perturbations independently, as commonly done in single-image settings. This neglects the opportunity to leverage inter-image relationships for stronger privacy protection. Therefore, we advocate for a group-level perspective on privacy protection against personalization. Specifically, we introduce Cross-image Anti-Personalization (CAP), a novel framework that enhances resistance to personalization by enforcing style consistency across perturbed images. Furthermore, we develop a dynamic ratio adjustment strategy that adaptively balances the impact of the consistency loss throughout the attack iterations. Extensive experiments on the classical CelebHQ and VGGFace2 benchmarks show that CAP substantially improves existing methods.
https://arxiv.org/abs/2504.12747
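A sketch of a style-consistency objective across the perturbed images of one group; using Gram matrices as the style statistic is a common choice borrowed from style transfer and is an assumption here, not CAP's published loss.

```python
import torch

def gram(feat):                       # feat: (B, C, H, W)
    """Per-image Gram matrix as a style statistic."""
    B, C, H, W = feat.shape
    f = feat.view(B, C, H * W)
    return f @ f.transpose(1, 2) / (C * H * W)

def style_consistency_loss(features):
    """Penalize style divergence across the perturbed images of one
    group, encouraging group-level (cross-image) protection."""
    grams = gram(features)            # (B, C, C), one per image
    mean_gram = grams.mean(dim=0, keepdim=True)
    return ((grams - mean_gram) ** 2).mean()
```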
As artificial intelligence methods are increasingly applied to complex task scenarios, high dimensional multi-label learning has emerged as a prominent research focus. At present, the curse of dimensionality remains one of the major bottlenecks in high-dimensional multi-label learning, which can be effectively addressed through multi-label feature selection methods. However, existing multi-label feature selection methods mostly focus on identifying global features shared across all labels, which overlooks personalized characteristics and specific requirements of individual labels. This global-only perspective may limit the ability to capture label-specific discriminative information, thereby affecting overall performance. In this paper, we propose a novel method called GPMFS (Global Foundation and Personalized Optimization for Multi-Label Feature Selection). GPMFS firstly identifies global features by exploiting label correlations, then adaptively supplements each label with a personalized subset of discriminative features using a threshold-controlled strategy. Experiments on multiple real-world datasets demonstrate that GPMFS achieves superior performance while maintaining strong interpretability and robustness. Furthermore, GPMFS provides insights into the label-specific strength across different multi-label datasets, thereby demonstrating the necessity and potential applicability of personalized feature selection approaches.
https://arxiv.org/abs/2504.12740
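The global-then-personalized selection scheme can be sketched with mutual information as the relevance measure and a simple threshold rule; both are stand-ins for GPMFS's actual criteria.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def gpmfs_sketch(X, Y, n_global=20, tau=0.05):
    """Global foundation + personalized top-up, as a sketch.
    X: (n_samples, n_features); Y: (n_samples, n_labels) binary."""
    # Per-label relevance matrix: (n_labels, n_features).
    rel = np.stack([mutual_info_classif(X, Y[:, j])
                    for j in range(Y.shape[1])])
    # Global features: informative on average across all labels.
    global_idx = np.argsort(rel.mean(axis=0))[::-1][:n_global]
    # Personalized features: label-specific relevance clears threshold
    # tau and the feature is not already selected globally.
    personalized = {
        j: np.setdiff1d(np.where(rel[j] > tau)[0], global_idx)
        for j in range(Y.shape[1])
    }
    return global_idx, personalized
```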
Humans can perceive and reason about spatial relationships from sequential visual observations, such as egocentric video streams. However, how pretrained models acquire such abilities, especially high-level reasoning, remains unclear. This paper introduces Embodied-R, a collaborative framework combining large-scale Vision-Language Models (VLMs) for perception and small-scale Language Models (LMs) for reasoning. Using Reinforcement Learning (RL) with a novel reward system considering think-answer logical consistency, the model achieves slow-thinking capabilities with limited computational resources. After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models (OpenAI-o1, Gemini-2.5-pro) on both in-distribution and out-of-distribution embodied spatial reasoning tasks. Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration. We further explore research questions including response length, training on VLM, strategies for reward design, and differences in model generalization after SFT (Supervised Fine-Tuning) and RL training.
https://arxiv.org/abs/2504.12680
Arabidopsis is a widely used model plant for gaining basic knowledge of plant physiology and development. Live imaging is an important technique to visualize and quantify elemental processes in plant development. To uncover novel theories underlying plant growth and cell division, accurate cell tracking on live imaging is of utmost importance. The commonly used cell tracking software TrackMate adopts a tracking-by-detection approach, applying the Laplacian of Gaussian (LoG) for blob detection and a Linear Assignment Problem (LAP) tracker for tracking. However, these do not perform sufficiently well when cells are densely arranged. To alleviate the problems mentioned above, we propose an accurate tracking method based on a genetic algorithm (GA) that uses knowledge of Arabidopsis root cellular patterns and spatial relationships among volumes. Our method can be described as coarse-to-fine: we first conduct relatively easy line-level tracking of cell nuclei, then perform complicated nuclear tracking based on the known linear arrangement of cell files and the spatial relationships between nuclei. Our method has been evaluated on a long-time live imaging dataset of Arabidopsis root tips, and with minor manual rectification, it accurately tracks nuclei. To the best of our knowledge, this research represents the first successful attempt to address a long-standing problem in the field of time-lapse microscopy of the root meristem by proposing an accurate tracking method for Arabidopsis root nuclei.
https://arxiv.org/abs/2504.12676
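For context, the LAP-tracker baseline the paper builds on links nuclei between consecutive frames by solving an assignment problem over detection distances; a minimal SciPy version:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def link_nuclei(prev_xyz, curr_xyz, max_dist=5.0):
    """Frame-to-frame nucleus linking as a Linear Assignment Problem
    (TrackMate-style baseline). Returns (prev_index, curr_index)
    pairs whose distance is within max_dist."""
    cost = cdist(prev_xyz, curr_xyz)          # Euclidean distances
    cost[cost > max_dist] = 1e6               # forbid implausible jumps
    rows, cols = linear_sum_assignment(cost)  # globally optimal matching
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 1e6]
```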
In this paper, we introduce a novel method named Robo-SGG, i.e., Layout-Oriented Normalization and Restitution for Robust Scene Graph Generation. Compared to the existing SGG setting, robust scene graph generation aims to perform inference on a diverse range of corrupted images, with the core challenge being the domain shift between clean and corrupted images. Existing SGG methods suffer from degraded performance due to compromised visual features, e.g., under corruption interference or occlusions. To obtain robust visual features, we exploit layout information, which is domain-invariant, to enhance the efficacy of existing SGG methods on corrupted images. Specifically, we employ Instance Normalization (IN) to filter out domain-specific features, and recover the unchangeable structural features, i.e., the positional and semantic relationships among objects, through the proposed Layout-Oriented Restitution. Additionally, we propose a Layout-Embedded Encoder (LEE) that augments the existing object and predicate encoders within the SGG framework, enriching the robust positional and semantic features of objects and predicates. Note that our proposed Robo-SGG module is designed as a plug-and-play component, which can be easily integrated into any baseline SGG model. Extensive experiments demonstrate that by integrating the state-of-the-art method into our proposed Robo-SGG, we achieve relative improvements of 5.6%, 8.0%, and 6.5% in mR@50 for PredCls, SGCls, and SGDet tasks on the VG-C dataset, respectively, and achieve new state-of-the-art performance on the corruption scene graph generation benchmarks (VG-C and GQA-C). We will release our source code and model.
https://arxiv.org/abs/2504.12606
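A compact sketch of the normalization-and-restitution idea: Instance Normalization strips corruption-specific statistics from the visual features, and a projection of the (domain-invariant) box layout is added back to restore structural cues. The embedding below is a simplification of the paper's Layout-Embedded Encoder, not its actual architecture.

```python
import torch
import torch.nn as nn

class NormalizeAndRestitute(nn.Module):
    """Illustrative layout-oriented normalization and restitution."""
    def __init__(self, channels):
        super().__init__()
        self.inorm = nn.InstanceNorm2d(channels, affine=False)
        self.layout_proj = nn.Linear(4, channels)  # boxes -> feature bias

    def forward(self, feat, boxes):
        # feat: (B, C, H, W) visual features; boxes: (B, 4) normalized
        # object boxes (x1, y1, x2, y2) carrying positional structure.
        clean = self.inorm(feat)                   # drop domain style
        bias = self.layout_proj(boxes)[..., None, None]
        return clean + bias                        # restitute layout cues
```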
Restoring images afflicted by complex real-world degradations remains challenging, as conventional methods often fail to adapt to the unique mixture and severity of artifacts present. This stems from a reliance on indirect cues which poorly capture the true perceptual quality deficit. To address this fundamental limitation, we introduce AdaQual-Diff, a diffusion-based framework that integrates perceptual quality assessment directly into the generative restoration process. Our approach establishes a mathematical relationship between regional quality scores from DeQAScore and optimal guidance complexity, implemented through an Adaptive Quality Prompting mechanism. This mechanism systematically modulates prompt structure according to measured degradation severity: regions with lower perceptual quality receive computationally intensive, structurally complex prompts with precise restoration directives, while higher quality regions receive minimal prompts focused on preservation rather than intervention. The technical core of our method lies in the dynamic allocation of computational resources proportional to degradation severity, creating a spatially-varying guidance field that directs the diffusion process with mathematical precision. By combining this quality-guided approach with content-specific conditioning, our framework achieves fine-grained control over regional restoration intensity without requiring additional parameters or inference iterations. Experimental results demonstrate that AdaQual-Diff achieves visually superior restorations across diverse synthetic and real-world datasets.
https://arxiv.org/abs/2504.12605
Retrieval-Augmented Generation (RAG) has significantly enhanced large language models (LLMs) in knowledge-intensive tasks by incorporating external knowledge retrieval. However, existing RAG frameworks primarily rely on semantic similarity and correlation-driven retrieval, limiting their ability to distinguish true causal relationships from spurious associations. This results in responses that may be factually grounded but fail to establish cause-and-effect mechanisms, leading to incomplete or misleading insights. To address this issue, we introduce Causal Dynamic Feedback for Adaptive Retrieval-Augmented Generation (CDF-RAG), a framework designed to improve causal consistency, factual accuracy, and explainability in generative reasoning. CDF-RAG iteratively refines queries, retrieves structured causal graphs, and enables multi-hop causal reasoning across interconnected knowledge sources. Additionally, it validates responses against causal pathways, ensuring logically coherent and factually grounded outputs. We evaluate CDF-RAG on four diverse datasets, demonstrating its ability to improve response accuracy and causal correctness over existing RAG-based methods. Our code is publicly available at https://github.com/elakhatibi/CDF-RAG.
https://arxiv.org/abs/2504.12560
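A toy illustration of multi-hop causal retrieval over a structured causal graph with networkx; in CDF-RAG the graph would come from retrieval rather than being hard-coded, and generated answers are then validated against the returned pathways.

```python
import networkx as nx

# Toy causal graph; edges point from cause to effect.
G = nx.DiGraph()
G.add_edges_from([("smoking", "tar buildup"),
                  ("tar buildup", "lung damage"),
                  ("lung damage", "reduced capacity")])

def multi_hop_causal_paths(graph, cause, effect, max_hops=4):
    """Enumerate directed causal chains linking a candidate cause to
    an effect, rather than relying on mere co-occurrence."""
    return list(nx.all_simple_paths(graph, cause, effect, cutoff=max_hops))

print(multi_hop_causal_paths(G, "smoking", "reduced capacity"))
# [['smoking', 'tar buildup', 'lung damage', 'reduced capacity']]
```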
Large Language Models (LLMs) are increasingly deployed in both academic and industry settings to automate the evaluation of information seeking systems, particularly by generating graded relevance judgments. Previous work on LLM-based relevance assessment has primarily focused on replicating graded human relevance judgments through various prompting strategies. However, there has been limited exploration of alternative assessment methods or comprehensive comparative studies. In this paper, we systematically compare multiple LLM-based relevance assessment methods, including binary relevance judgments, graded relevance assessments, pairwise preference-based methods, and two nugget-based evaluation methods -- document-agnostic and document-dependent. In addition to a traditional comparison based on system rankings using Kendall correlations, we also examine how well LLM judgments align with human preferences, as inferred from relevance grades. We conduct extensive experiments on datasets from three TREC Deep Learning tracks 2019, 2020 and 2021 as well as the ANTIQUE dataset, which focuses on non-factoid open-domain question answering. As part of our data release, we include relevance judgments generated by both an open-source (Llama3.2b) and a commercial (gpt-4o) model. Our goal is to reproduce various LLM-based relevance judgment methods to provide a comprehensive comparison. All code, data, and resources are publicly available in our GitHub Repository at this https URL.
https://arxiv.org/abs/2504.12558
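The traditional system-ranking comparison the paper mentions reduces to a Kendall correlation between per-system effectiveness scores under human versus LLM judgments; the scores below are invented solely to show the computation:

```python
from scipy.stats import kendalltau

# Hypothetical mean-effectiveness scores for five retrieval systems,
# once under human judgments and once under LLM judgments.
systems      = ["sysA", "sysB", "sysC", "sysD", "sysE"]
human_scores = [0.52, 0.47, 0.61, 0.39, 0.55]
llm_scores   = [0.50, 0.49, 0.64, 0.35, 0.53]

# Kendall's tau measures how similarly the two judgment sources
# rank the systems, regardless of absolute score values.
tau, p = kendalltau(human_scores, llm_scores)
print(f"Kendall tau between system rankings: {tau:.3f} (p={p:.3f})")
```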
This study critically examines the commonly held assumption that explicability in artificial intelligence (AI) systems inherently boosts user trust. Utilizing a meta-analytical approach, we conducted a comprehensive examination of the existing literature to explore the relationship between AI explainability and trust. Our analysis, incorporating data from 90 studies, reveals a statistically significant but moderate positive correlation between the explainability of AI systems and the trust they engender among users. This indicates that while explainability contributes to building trust, it is not the sole or predominant factor in this equation. In addition to academic contributions to the field of Explainable AI (XAI), this research highlights its broader socio-technical implications, particularly in promoting accountability and fostering user trust in critical domains such as healthcare and justice. By addressing challenges like algorithmic bias and ethical transparency, the study underscores the need for equitable and sustainable AI adoption. Rather than focusing solely on immediate trust, we emphasize the normative importance of fostering authentic and enduring trustworthiness in AI systems.
https://arxiv.org/abs/2504.12529
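The standard meta-analytic aggregation behind a pooled correlation is Fisher's z-transform with inverse-variance weights; a small worked example follows, with invented per-study correlations and sample sizes, not the study's data.

```python
import numpy as np

def pooled_correlation(rs, ns):
    """Fixed-effect pooling of per-study correlation coefficients via
    Fisher's z-transform (the standard meta-analytic aggregation)."""
    rs, ns = np.asarray(rs, float), np.asarray(ns, float)
    z = np.arctanh(rs)                 # Fisher z per study
    w = ns - 3                         # inverse-variance weights
    z_bar = (w * z).sum() / w.sum()    # weighted mean in z-space
    return np.tanh(z_bar)              # back to the correlation scale

# Illustrative inputs: three studies with their r values and sizes.
print(pooled_correlation(rs=[0.42, 0.31, 0.27], ns=[120, 85, 200]))
```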