Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multihop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate "visual thoughts" by performing visual editing on the input image through code, shifting and refining their visual focus. Specifically, ReFocus enables multimodal LLMs to generate Python code that calls tools and modifies the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment on a wide range of structured image understanding tasks involving tables and charts. ReFocus substantially improves performance on all tasks over GPT-4o without visual editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart tasks. We present an in-depth analysis of the effects of different visual edits and of the reasons why ReFocus can improve performance without introducing additional information. Further, we collect a 14k training set using ReFocus and show that such visual chain-of-thought with intermediate information offers better supervision than standard VQA data, reaching an 8.0% average gain over the same model trained with QA pairs and 2.6% over CoT.
结构化图像理解,如解释表格和图表,需要在图像中的各种结构和文本之间战略性地重新聚焦,形成推理序列以得出最终答案。然而,当前的多模态大型语言模型(LLM)缺乏这种多跳选择性注意的能力。在这项工作中,我们介绍了 ReFocus,这是一个简单而有效的框架,它使多模态 LLM 具备通过代码在输入图像上执行视觉编辑来生成“视觉思维”的能力,从而转移和精炼其视觉焦点。具体而言,ReFocus 使得多模态 LLM 能够生成 Python 代码调用工具并修改输入图像,依次绘制方框、高亮显示部分和屏蔽区域,从而增强视觉推理过程。我们在涉及表格和图表的多种结构化图像理解任务上进行了实验。与未经视觉编辑的 GPT-4o 相比,ReFocus 在所有任务中都显著提高了性能,在表格任务上的平均增益为 11.0%,在图表任务上的平均增益为 6.8%。我们深入分析了不同视觉编辑的效果,并解释了为什么 ReFocus 能够在不引入额外信息的情况下提高性能。此外,我们使用 ReFocus 收集了一个包含 14k 条数据的训练集,并证明了这种具有中间信息的视觉思维链提供了比标准 VQA 数据更好的监督效果:相比使用 QA 对训练的同一模型平均增益为 8.0%,相比 CoT 则为 2.6%。
https://arxiv.org/abs/2501.05452
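The kind of visual edit described in the ReFocus entry above (masking out areas and drawing boxes) can be sketched on a toy image. This is only an illustration under stated assumptions, not the authors' tooling: the image is a plain nested list of grayscale pixels, and `mask_out` and `draw_box` are hypothetical helper names.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels, 0 (black) to 255 (white)


def mask_out(img: Image, top: int, left: int, bottom: int, right: int,
             fill: int = 255) -> Image:
    """Return a copy of img with the rectangle [top, bottom) x [left, right) filled."""
    out = [row[:] for row in img]
    for r in range(max(top, 0), min(bottom, len(out))):
        for c in range(max(left, 0), min(right, len(out[r]))):
            out[r][c] = fill
    return out


def draw_box(img: Image, top: int, left: int, bottom: int, right: int,
             color: int = 0) -> Image:
    """Return a copy of img with a 1-pixel outline along the rectangle's border."""
    out = [row[:] for row in img]
    for r in range(max(top, 0), min(bottom, len(out))):
        for c in range(max(left, 0), min(right, len(out[r]))):
            on_edge = r in (top, bottom - 1) or c in (left, right - 1)
            if on_edge:
                out[r][c] = color
    return out


# Hypothetical 4x6 "table image": mask the two right-hand columns,
# then outline the region that stays in focus.
image = [[128] * 6 for _ in range(4)]
focused = draw_box(mask_out(image, 0, 4, 4, 6), 0, 0, 4, 4)
```

Both helpers return copies, so a sequence of edits can be replayed or inspected step by step, which mirrors the idea of a chain of intermediate "visual thoughts".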
We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at this https URL
我们通过实证研究了从视频中进行自回归预训练的方法。为了开展这项研究,我们构建了一系列名为Toto的自回归视频模型。我们将视频视为视觉令牌序列,并训练Transformer模型以自回归方式预测未来的令牌。我们的模型在包含超过1万亿个视觉令牌的多样化数据集(包括视频和图像)上进行了预训练。我们在架构、训练和推理设计选择方面做了各种探索。我们评估了所学习到的视觉表示在一系列下游任务上的表现,包括图像识别、视频分类、对象跟踪和机器人技术。我们的研究结果表明,尽管归纳偏置极少,自回归预训练仍然能在所有基准上取得有竞争力的性能。最后,我们发现随着视频模型规模的增长,其扩展曲线与语言模型相似,不过增长率有所不同。更多详情请参阅此 https URL
https://arxiv.org/abs/2501.05453
Large-scale AI model training divides work across thousands of GPUs, then synchronizes gradients across them at each step. This incurs a significant network burden that only centralized, monolithic clusters can support, driving up infrastructure costs and straining power systems. We propose Decentralized Diffusion Models, a scalable framework for distributing diffusion model training across independent clusters or datacenters by eliminating the dependence on a centralized, high-bandwidth networking fabric. Our method trains a set of expert diffusion models over partitions of the dataset, each in full isolation from one another. At inference time, the experts ensemble through a lightweight router. We show that the ensemble collectively optimizes the same objective as a single model trained over the whole dataset. This means we can divide the training burden among a number of "compute islands," lowering infrastructure costs and improving resilience to localized GPU failures. Decentralized diffusion models empower researchers to take advantage of smaller, more cost-effective and more readily available compute like on-demand GPU nodes rather than central integrated systems. We conduct extensive experiments on ImageNet and LAION Aesthetics, showing that decentralized diffusion models FLOP-for-FLOP outperform standard diffusion models. We finally scale our approach to 24 billion parameters, demonstrating that high-quality diffusion models can now be trained with just eight individual GPU nodes in less than a week.
大规模AI模型的训练通常会在数千个GPU上分配任务,然后在每一步中同步各个GPU上的梯度。这种方式会带来显著的网络负担,只有集中的、单一化的集群才能支持这种需求,从而推高基础设施成本并增加电力系统的压力。我们提出了去中心化扩散模型(Decentralized Diffusion Models),这是一个可扩展的框架,能够将扩散模型训练任务分布到独立的集群或数据中心中,而无需依赖于集中式的高带宽网络架构。我们的方法是在数据集的不同分区上分别训练一组专家扩散模型,并且它们彼此完全隔离。在推理阶段,这些专家通过一个轻量级路由器进行集成。我们证明了这一集成整体上优化的目标与在整个数据集上训练的单一模型相同。这意味着我们可以将训练负担分散到多个“计算岛”中,从而降低基础设施成本并提高对局部GPU故障的抵御能力。去中心化扩散模型使研究人员能够利用更小、成本更低且更容易获取的计算资源(如按需GPU节点),而无需依赖于集中式的集成系统。我们在ImageNet和LAION Aesthetics数据集上进行了广泛的实验,证明了去中心化扩散模型在同等运算量(FLOP)下优于标准扩散模型。最终,我们将方法扩展到了240亿参数规模,并展示了仅通过八台独立的GPU节点,在不到一周的时间内即可训练出高质量的扩散模型。
https://arxiv.org/abs/2501.05450
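The inference-time ensembling described in the Decentralized Diffusion Models entry above can be sketched as a router-weighted sum of expert predictions. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the softmax weighting, the function names `softmax` and `ensemble_predict`, and the toy experts and router.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over router logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(x: Vector,
                     experts: Sequence[Callable[[Vector], Vector]],
                     router: Callable[[Vector], Sequence[float]]) -> Vector:
    """Combine expert outputs with router weights (assumed: one logit per expert)."""
    weights = softmax(router(x))
    outputs = [expert(x) for expert in experts]
    return [sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(x))]


# Toy stand-ins for independently trained experts and a learned router.
experts = [lambda x: [2.0 * v for v in x], lambda x: [-1.0 * v for v in x]]
router = lambda x: [0.0, 0.0]  # equal logits give equal weights
combined = ensemble_predict([1.0, 2.0], experts, router)
```

Because the experts only meet inside this weighted sum, each one can be trained on its own data partition without exchanging gradients with the others.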
Pumpkin leaf diseases are significant threats to agricultural productivity, requiring timely and precise diagnosis for effective management. Traditional identification methods are laborious and susceptible to human error, emphasizing the necessity for automated solutions. This study uses the "Pumpkin Leaf Disease Dataset", which comprises 2,000 high-resolution images separated into five categories: downy mildew, powdery mildew, mosaic disease, bacterial leaf spot, and healthy leaves. The dataset was rigorously assembled from several agricultural fields to ensure a strong representation for model training. We explored several established deep learning architectures, including DenseNet201, DenseNet121, DenseNet169, Xception, ResNet50, ResNet101, and InceptionResNetV2, and observed that ResNet50 performed most effectively, with an accuracy of 90.5% and comparable precision, recall, and F1-Score. We used Explainable AI (XAI) approaches such as Grad-CAM, Grad-CAM++, Score-CAM, and Layer-CAM to provide meaningful representations of model decision-making processes, which improved understanding of and trust in automated disease diagnostics. These findings demonstrate ResNet50's potential to revolutionize pumpkin leaf disease detection, allowing for earlier and more accurate treatments.
南瓜叶病害是影响农业生产力的重要威胁,及时且精准的诊断对于有效管理这些疾病至关重要。传统的识别方法既费力又容易出错,因此迫切需要自动化解决方案。本研究使用了“南瓜叶病害数据集”,该数据集包含2000张高分辨率图像,并分为五个类别:霜霉病、白粉病、花叶病、细菌斑点病和健康叶片。数据集从多个农业现场严格收集而成,确保模型训练具有强大的代表性。 我们探索了多种高效深度学习架构,包括DenseNet201、DenseNet121、DenseNet169、Xception、ResNet50、ResNet101和InceptionResNetV2,并观察到ResNet50表现最为出色,其准确率为90.5%,且精度、召回率和F1分数相当高。我们还采用了可解释的人工智能(Explainable AI, XAI)方法,如Grad-CAM、Grad-CAM++、Score-CAM和Layer-CAM,以提供模型决策过程的有意义表示,从而增强了理解和信任。 这些发现表明ResNet50具有革新南瓜叶病害检测的巨大潜力,能够实现更早且更为准确的治疗。
https://arxiv.org/abs/2501.05449
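The accuracy, precision, recall, and F1-Score reported in the pumpkin leaf entry above are standard classification metrics. As a reminder of how they relate, here is a minimal sketch for one class treated as "positive". The labels and predictions are made up for illustration and are not taken from the dataset; `binary_metrics` is a hypothetical helper name.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[str], y_pred: Sequence[str],
                   positive: str) -> Dict[str, float]:
    """Accuracy over all samples, plus precision/recall/F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: one mosaic leaf is misclassified as healthy.
y_true = ["healthy", "mosaic", "mosaic", "healthy", "mosaic"]
y_pred = ["healthy", "mosaic", "healthy", "healthy", "mosaic"]
metrics = binary_metrics(y_true, y_pred, positive="mosaic")
```

For a five-class problem such as this one, the per-class values would typically be averaged across classes; the averaging scheme is not stated in the abstract.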
Monocular depth estimation (MDE) models have undergone significant advancements over recent years. Many MDE models aim to predict affine-invariant relative depth from monocular images, while recent developments in large-scale training and vision foundation models enable reasonable estimation of metric (absolute) depth. However, effectively leveraging these predictions for geometric vision tasks, in particular relative pose estimation, remains relatively underexplored. While depths provide rich constraints for cross-view image alignment, the intrinsic noise and ambiguity of the monocular depth priors present practical challenges to improving upon classic keypoint-based solutions. In this paper, we develop three solvers for relative pose estimation that explicitly account for independent affine (scale and shift) ambiguities, covering both calibrated and uncalibrated conditions. We further propose a hybrid estimation pipeline that combines our proposed solvers with classic point-based solvers and epipolar constraints. We find that the affine correction modeling is beneficial not only to the relative depth priors but also, surprisingly, to the "metric" ones. Results across multiple datasets demonstrate large improvements of our approach over classic keypoint-based baselines and PnP-based solutions, under both calibrated and uncalibrated setups. We also show that our method improves consistently with different feature matchers and MDE models, and can further benefit from very recent advances on both modules. Code is available at this https URL.
近年来,单目深度估计(MDE)模型取得了显著的进步。许多MDE模型旨在从单目图像中预测仿射不变的相对深度,而大规模训练和视觉基础模型的发展使得合理估算度量(绝对)深度成为可能。然而,在几何视觉任务中有效地利用这些预测——特别是相对姿态估计方面——仍然相对探索不足。虽然深度提供了跨视角图像对齐的丰富约束条件,但单目深度先验中的固有噪声和模糊性为改进传统的基于关键点的解决方案带来了实际挑战。 在本文中,我们开发了三种用于相对姿态估计的求解器,这些求解器明确考虑独立仿射(尺度和平移)歧义,并涵盖了校准和未校准的情况。我们进一步提出了一种混合估计流水线,将我们的拟议求解器与传统的基于点的求解器以及极线约束相结合。我们发现,仿射修正建模不仅对相对深度先验有益,而且令人惊讶地也对“度量”先验有利。 在多个数据集上的结果表明,在校准和未校准设置下,我们的方法相对于经典的关键点基线和PnP解决方案有了显著改进。此外,我们还展示了无论使用哪种特征匹配器或MDE模型,我们的方法都能保持一致的改进,并可以从这两个模块最近的进步中进一步受益。 代码可在提供的URL上获取。
https://arxiv.org/abs/2501.05446
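To make the "affine (scale and shift) ambiguity" in the entry above concrete: a relative depth map is only determined up to d' = a * d + b. The sketch below recovers a and b by ordinary least squares against reference depths. It is not one of the paper's solvers, which estimate pose and are not specified in the abstract; `fit_scale_shift` is a hypothetical name and the depth values are synthetic.

```python
from typing import Sequence, Tuple


def fit_scale_shift(pred: Sequence[float],
                    ref: Sequence[float]) -> Tuple[float, float]:
    """Least-squares (scale, shift) such that scale * pred + shift ~ ref."""
    n = len(pred)
    if n != len(ref) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0.0:
        raise ValueError("predicted depths are constant; scale is undefined")
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    scale = cov / var_p
    shift = mean_r - scale * mean_p
    return scale, shift


# Synthetic example: the reference depth is an exact affine map of the prediction.
relative_depth = [0.1, 0.2, 0.4, 0.8]
metric_depth = [2.0 * d + 0.5 for d in relative_depth]
scale, shift = fit_scale_shift(relative_depth, metric_depth)
```

With two views, each depth map carries its own independent (scale, shift) pair, which is why the abstract speaks of independent affine ambiguities.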
Score Distillation Sampling (SDS) has made significant strides in distilling image-generative models for 3D generation. However, its maximum-likelihood-seeking behavior often leads to degraded visual quality and diversity, limiting its effectiveness in 3D applications. In this work, we propose Consistent Flow Distillation (CFD), which addresses these limitations. We begin by leveraging the gradient of the diffusion ODE or SDE sampling process to guide the 3D generation. From the gradient-based sampling perspective, we find that the consistency of 2D image flows across different viewpoints is important for high-quality 3D generation. To achieve this, we introduce multi-view consistent Gaussian noise on the 3D object, which can be rendered from various viewpoints to compute the flow gradient. Our experiments demonstrate that CFD, through consistent flows, significantly outperforms previous methods in text-to-3D generation.
分数蒸馏采样(SDS)在将图像生成模型用于三维生成方面取得了显著进展。然而,其最大似然求取行为往往会导致视觉质量和多样性的下降,限制了它在三维应用中的有效性。为此,我们提出了具有一致性流蒸馏(CFD)的方法来解决这些问题。首先,我们将扩散ODE或SDE采样过程的梯度用于指导三维生成。从基于梯度的采样角度来看,我们发现2D图像流在不同视角下的一致性对于高质量的三维生成至关重要。为此,我们在3D对象上引入了多视图一致高斯噪声,该噪声可以从不同的视角渲染以计算流动梯度。 实验表明,通过一致性流,CFD在文本到3D生成方面显著优于先前的方法。
https://arxiv.org/abs/2501.05445
The ability to organically reason over and with both text and images is a pillar of human intelligence, yet the ability of Multimodal Large Language Models (MLLMs) to perform such multimodal reasoning remains under-explored. Existing benchmarks often emphasize text-dominant reasoning or rely on shallow visual cues, failing to adequately assess integrated visual and textual reasoning. We introduce EMMA (Enhanced MultiModal reAsoning), a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality, offering an enhanced test suite for MLLMs' reasoning capabilities. Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks, even with advanced techniques like Chain-of-Thought prompting and test-time compute scaling underperforming. These findings underscore the need for improved multimodal architectures and training paradigms to close the gap between human and model reasoning in multimodality.
有机地对文本和图像进行推理是人类智能的一个支柱,然而多模态大型语言模型(MLLMs)在执行这种多模态推理方面的能力尚未得到充分探索。现有的基准测试通常强调以文本为主的推理或依赖于浅层次的视觉线索,未能充分评估综合的视觉和文本推理能力。我们推出了EMMA(Enhanced MultiModal reAsoning),这是一个针对数学、物理、化学以及编程领域有机多模态推理的基准测试。EMMA任务要求高级跨模态推理,这些任务无法通过单一模态内的独立推理来解决,为MLLMs的推理能力提供了一个增强型测试套件。 我们在EMMA上对最先进的MLLMs进行评估后发现,在处理复杂的多模态和多步骤推理任务时,即使是先进的技术如链式思维提示(Chain-of-Thought prompting)和测试时间计算扩展也无法达到预期效果。这些发现强调了需要改进多模态架构和训练范例以缩小人类与模型在多模态能力上的差距。
https://arxiv.org/abs/2501.05444
The success of social media platforms has facilitated the emergence of various forms of online abuse within digital communities. This abuse manifests in multiple ways, including hate speech, cyberbullying, emotional abuse, grooming, and sexting. In this paper, we present a comprehensive analysis of the different forms of abuse prevalent in social media, with a particular focus on how emerging technologies, such as Language Models (LMs) and Large Language Models (LLMs), are reshaping both the detection and generation of abusive content within these networks. We delve into the mechanisms through which social media abuse is perpetuated, exploring the psychological and social impact. Additionally, we examine the dual role of advanced language models, highlighting their potential to enhance automated detection systems for abusive behavior while also acknowledging their capacity to generate harmful content. This paper aims to contribute to the ongoing discourse on online safety and ethics, offering insights into the evolving landscape of cyberabuse and the technological innovations that both mitigate and exacerbate it.
社交媒体平台的成功促进了数字社区中各种形式的在线虐待行为的出现。这种虐待以多种形式表现,包括仇恨言论、网络欺凌、情感虐待、引诱和色情信息交换等。本文将全面分析社交媒体中存在的不同形式的虐待,并特别关注新兴技术(如语言模型LMs和大规模语言模型LLMs)如何重塑这些平台中滥用内容的检测与生成方式。我们将深入探讨社交媒体虐待行为的传播机制,探索其心理和社会影响。此外,我们还将考察高级语言模型在自动检测系统中的双重作用——一方面它们有可能提升对有害行为的自动化识别能力,另一方面也要承认它们可能被用于生成有害内容的能力。本文旨在为在线安全和伦理讨论做出贡献,提供有关网络虐待演变及技术革新如何同时缓解与加剧这一问题的新见解。
https://arxiv.org/abs/2501.05443
Video tokenizers are essential for latent video diffusion models, converting raw video data into spatiotemporally compressed latent spaces for efficient training. However, extending state-of-the-art video tokenizers to achieve a temporal compression ratio beyond 4x without increasing channel capacity poses significant challenges. In this work, we propose an alternative approach to enhance temporal compression. We find that the reconstruction quality of temporally subsampled videos from a low-compression encoder surpasses that of high-compression encoders applied to original videos. This indicates that high-compression models can leverage representations from lower-compression models. Building on this insight, we develop a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models. Our method includes a cross-level feature-mixing module to retain information from the pretrained low-compression model and guide higher-compression blocks to capture the remaining details from the full video sequence. Evaluation on video benchmarks shows that our method significantly improves reconstruction quality while increasing temporal compression compared to direct extensions of existing video tokenizers. Furthermore, the resulting compact latent space effectively trains a video diffusion model for high-quality video generation with a reduced token budget.
视频标记化器对于潜在视频扩散模型至关重要,能够将原始视频数据转换为时空压缩的潜在空间,从而实现高效的训练。然而,在不增加通道容量的情况下,将现有最先进的视频标记化器扩展到超过4倍的时间压缩比面临重大挑战。为此,在这项工作中我们提出了一种增强时间压缩的替代方法。我们发现,低压缩率编码器对时间子采样视频的重建质量超过了高压缩率编码器对原始视频的重建质量。这一观察表明,高压缩模型可以利用低压缩模型的表示。基于此见解,我们开发了一个自举式的高时间压缩比模型,在已训练良好的低压缩率模型基础上逐步训练更高压缩率的模块。我们的方法包括一个跨层级特征混合模块,用于保留预训练低压缩模型的信息,并引导更高压缩率的块捕捉完整视频序列中的剩余细节。在视频基准测试上的评估显示,与现有视频标记化器的直接扩展相比,我们提出的方法在提高时间压缩比的同时显著提高了重建质量。此外,由此产生的紧凑潜在空间能够以更少的标记预算(token budget)有效训练用于高质量视频生成的视频扩散模型。
https://arxiv.org/abs/2501.05442
There is a widespread claim that GANs are difficult to train, and GAN architectures in the literature are littered with empirical tricks. We provide evidence against this claim and build a modern GAN baseline in a more principled manner. First, we derive a well-behaved regularized relativistic GAN loss that addresses issues of mode dropping and non-convergence that were previously tackled via a bag of ad-hoc tricks. We analyze our loss mathematically and prove that it admits local convergence guarantees, unlike most existing relativistic losses. Second, our new loss allows us to discard all ad-hoc tricks and replace outdated backbones used in common GANs with modern architectures. Using StyleGAN2 as an example, we present a roadmap of simplification and modernization that results in a new minimalist baseline: R3GAN. Despite being simple, our approach surpasses StyleGAN2 on FFHQ, ImageNet, CIFAR, and Stacked MNIST datasets, and compares favorably against state-of-the-art GANs and diffusion models.
有一个广泛流传的说法认为生成对抗网络(GAN)难以训练,并且文献中的GAN架构充斥着各种经验性的技巧。我们提供了与此说法相悖的证据,并以更为原则化的方式建立了一个现代的GAN基准模型。首先,我们推导出了一种具有良好性质的正则化的相对性GAN损失函数,该损失函数解决了之前通过一系列临时拼凑的方法来处理的模式丢失和非收敛问题。我们从数学角度分析了我们的损失函数并证明它可以提供局部收敛保证,而这在大多数现有的相对性损失中是不存在的。其次,新的损失函数使我们可以抛弃所有随意性的技巧,并用现代架构替换常见的GAN中的过时骨干网络。以StyleGAN2为例,我们展示了一个简化和现代化的道路图,这最终导致了一种全新的极简主义基准模型——R3GAN。尽管我们的方法非常简单,但它在FFHQ、ImageNet、CIFAR以及Stacked MNIST数据集上都超过了StyleGAN2,并且与最新的GAN和扩散模型相比也表现良好。
https://arxiv.org/abs/2501.05441
Learning policies in simulation and transferring them to the real world has become a promising approach in dexterous manipulation. However, bridging the sim-to-real gap for each new task requires substantial human effort, such as careful reward engineering, hyperparameter tuning, and system identification. In this work, we present a system that leverages low-level skills to address these challenges for more complex tasks. Specifically, we introduce a hierarchical policy for in-hand object reorientation based on previously acquired rotation skills. This hierarchical policy learns to select which low-level skill to execute based on feedback from both the environment and the low-level skill policies themselves. Compared to learning from scratch, the hierarchical policy is more robust to out-of-distribution changes and transfers easily from simulation to real-world environments. Additionally, we propose a generalizable object pose estimator that uses proprioceptive information, low-level skill predictions, and control errors as inputs to estimate the object pose over time. We demonstrate that our system can reorient objects, including symmetrical and textureless ones, to a desired pose.
在灵巧操作领域,从模拟环境中学得策略并将其转移到现实世界已经成为一种有前景的方法。然而,为每个新任务弥合仿真与现实之间的差距仍需大量的人力投入,包括精细的奖励工程、超参数调优和系统辨识等。在此研究中,我们提出了一种利用低级技能应对更复杂任务中这些挑战的系统。具体来说,我们引入了一种基于先前获得的旋转技能的分层策略,用于手内物体重新定向。该分层策略能够根据环境反馈以及低级技能策略自身的反馈选择执行哪种低级技能。相比于从零开始学习,这种分层策略对分布外变化更加鲁棒,并且能轻易地从仿真迁移到现实环境。此外,我们还提出了一种可泛化的物体姿态估计器,它利用本体感觉信息、低级技能预测和控制误差作为输入来估计随时间变化的物体姿态。我们证明了该系统能够将包括对称及无纹理物体在内的各种物体重新定向至所需姿态。
https://arxiv.org/abs/2501.05439
The shape of the human brain is complex and highly variable, with interactions between brain size, cortical folding, and age well-documented in the literature. However, few studies have explored how global brain size influences geometric features of the cortical surface derived from anatomical MRI. In this work, we focus on sulcal depth, an imaging phenotype that has gained significant attention in both basic research and clinical applications. We make key contributions to the field by: 1) providing the first quantitative analysis of how brain size affects sulcal depth measurements; 2) introducing a novel, scale-invariant method for sulcal depth estimation based on an original formalization of the problem; 3) presenting a validation framework and sharing our code and benchmark data with the community; and 4) demonstrating the biological relevance of our new sulcal depth measure using a large sample of 1,987 subjects spanning the developmental period from 26 weeks post-conception to adulthood.
人类大脑的形状复杂且高度变化,大脑大小、皮层折叠与年龄之间的相互作用在文献中已有充分记录。然而,鲜有研究探讨整体脑大小如何影响由解剖MRI衍生出的皮层表面几何特征。在这项工作中,我们专注于沟深这一成像表型,它已在基础研究和临床应用中引起了广泛的关注。我们在该领域做出了关键贡献: 1. 首次定量分析了大脑大小如何影响沟深测量; 2. 基于对该问题的原创形式化,引入了一种新的、尺度不变的沟深估计方法; 3. 提出了一个验证框架,并与社区分享我们的代码和基准数据; 4. 使用涵盖从受孕后26周到成年期的大样本(1,987名受试者),展示了我们新的沟深测量在生物学上的相关性。
https://arxiv.org/abs/2501.05436
Background: The field of Artificial Intelligence has undergone cyclical periods of growth and decline, known as AI summers and winters. Currently, we are in the third AI summer, characterized by significant advancements and commercialization, particularly in the integration of Symbolic AI and Sub-Symbolic AI, leading to the emergence of Neuro-Symbolic AI. Methods: The review followed the PRISMA methodology, utilizing databases such as IEEE Xplore, Google Scholar, arXiv, ACM, and SpringerLink. The inclusion criteria targeted peer-reviewed papers published between 2020 and 2024. Papers were screened for relevance to Neuro-Symbolic AI, with further inclusion based on the availability of associated codebases to ensure reproducibility. Results: From an initial pool of 1,428 papers, 167 met the inclusion criteria and were analyzed in detail. The majority of research efforts are concentrated in the areas of learning and inference (63%), logic and reasoning (35%), and knowledge representation (44%). Explainability and trustworthiness are less represented (28%), with Meta-Cognition being the least explored area (5%). The review identifies significant interdisciplinary opportunities, particularly in integrating explainability and trustworthiness with other research areas. Conclusion: Neuro-Symbolic AI research has seen rapid growth since 2020, with concentrated efforts in learning and inference. Significant gaps remain in explainability, trustworthiness, and Meta-Cognition. Addressing these gaps through interdisciplinary research will be crucial for advancing the field towards more intelligent, reliable, and context-aware AI systems.
背景:人工智能领域经历了周期性的增长与衰退,被称为AI的夏季和冬季。目前我们处于第三次AI夏季,这一时期以显著的进步和商业化为特征,特别是在符号式AI和亚符号式AI的融合中,导致了神经-符号式AI(Neuro-Symbolic AI)的出现。 方法:本综述采用了PRISMA方法论,利用IEEE Xplore、Google Scholar、arXiv、ACM 和 SpringerLink 等数据库。纳入标准是2020年至2024年间发表的同行评审论文。筛选出与神经-符号式AI相关的论文,并进一步根据是否提供相关代码库进行纳入,以确保可复现性。 结果:在最初的1,428篇论文中,有167篇符合纳入标准并进行了详细分析。大多数的研究集中在学习和推理(63%)、逻辑和推理(35%)以及知识表示(44%)。可解释性和可信度方面的研究较少(28%),而元认知是探索最少的领域(5%)。该综述指出了重要的跨学科机会,尤其是在将可解释性和可信度与其他研究领域相结合方面。 结论:自2020年以来,神经-符号式AI的研究迅速增长,并在学习和推理方面集中了大量的努力。然而,在可解释性、可信度以及元认知领域仍存在重大空白。通过跨学科的研究来解决这些缺口,对于推进该领域向更智能、可靠且上下文感知的AI系统发展至关重要。
https://arxiv.org/abs/2501.05435
When is it possible to project two sets of labeled points lying in a pair of projective planes to the same image on a projective line? We give a complete answer to this question and describe the loci of the projection centers that enable a common image. In particular, we find that there exists a solution to this problem if and only if these two sets are themselves images of a common pointset in projective space.
何时可以将位于一对射影平面中的两组带标记的点投影为射影直线上的同一图像?我们对这个问题给出了完整答案,并描述了能够得到共同图像的投影中心的轨迹。特别是,我们发现当且仅当这两组点本身是射影空间中某个共同点集的图像时,该问题存在解。
https://arxiv.org/abs/2501.05429
Recent advances in 2D image generation have achieved remarkable quality, largely driven by the capacity of diffusion models and the availability of large-scale datasets. However, direct 3D generation is still constrained by the scarcity and lower fidelity of 3D datasets. In this paper, we introduce Zero-1-to-G, a novel approach that addresses this problem by enabling direct single-view generation on Gaussian splats using pretrained 2D diffusion models. Our key insight is that Gaussian splats, a 3D representation, can be decomposed into multi-view images encoding different attributes. This reframes the challenging task of direct 3D generation within a 2D diffusion framework, allowing us to leverage the rich priors of pretrained 2D diffusion models. To incorporate 3D awareness, we introduce cross-view and cross-attribute attention layers, which capture complex correlations and enforce 3D consistency across generated splats. This makes Zero-1-to-G the first direct image-to-3D generative model to effectively utilize pretrained 2D diffusion priors, enabling efficient training and improved generalization to unseen objects. Extensive experiments on both synthetic and in-the-wild datasets demonstrate superior performance in 3D object generation, offering a new approach to high-quality 3D generation.
最近在二维图像生成领域的进展取得了显著的成果,这主要得益于扩散模型的能力增强和大规模数据集的可用性。然而,直接三维(3D)生成仍然受到三维数据集稀缺性和较低保真度的限制。在这篇论文中,我们介绍了一种新的方法——Zero-1-to-G,该方法通过利用预训练的二维扩散模型在高斯点上实现单视角直接生成,从而解决了这一问题。我们的关键见解在于:高斯点作为三维表示形式,可以被分解为编码不同属性的多视图图像。这将具有挑战性的直接三维生成任务重新框架到一个二维扩散模型中,从而使我们能够利用预训练二维扩散模型中的丰富先验知识。为了引入对3D感知的支持,我们提出了跨视角和跨属性注意层,这些层捕捉复杂的相关性,并确保在生成的高斯点之间保持3D一致性。这使得Zero-1-to-G成为第一个有效利用预训练2D扩散模型先验的直接图像到3D生成模型,从而实现高效的训练并提高对未见过物体的一般化能力。在合成数据集和真实世界数据集上的广泛实验表明,在三维对象生成方面具有卓越性能,并为高质量三维生成提供了一种新的方法。
https://arxiv.org/abs/2501.05427
Brain cancer represents a major challenge in medical diagnostics, requiring precise and timely detection for effective treatment. Diagnosis initially relies on the proficiency of radiologists, which can cause difficulties and risks when expertise is scarce. Despite the use of imaging resources, brain cancer diagnosis often remains difficult, time-consuming, and vulnerable to intraclass variability. This study presents the Bangladesh Brain Cancer MRI Dataset, containing 6,056 MRI images organized into three categories: Brain Tumor, Brain Glioma, and Brain Menin. The dataset was collected from several hospitals in Bangladesh, providing a diverse and realistic sample for research. We implemented advanced deep learning models, and DenseNet169 achieved exceptional results, with accuracy, precision, recall, and F1-Score all reaching 0.9983. In addition, Explainable AI (XAI) methods including GradCAM, GradCAM++, ScoreCAM, and LayerCAM were employed to provide visual representations of the decision-making processes of the models. In the context of brain cancer, these techniques highlight DenseNet169's potential to enhance diagnostic accuracy while simultaneously offering transparency, facilitating early diagnosis and better patient outcomes.
脑癌在医学诊断中是一个重大挑战,需要精准和及时的检测以进行有效的治疗。最初的诊断依赖于放射科医生的专业技能,但当专家资源稀少时,这会带来困难甚至威胁。尽管使用了成像资源,脑癌仍然常常难以诊断、耗时且容易受到类内变异性的困扰。本研究介绍了孟加拉国脑癌MRI数据集,该数据集中包含6,056张MRI图像,并按三个类别进行组织:脑肿瘤、脑胶质瘤和脑膜瘤。这些数据是从孟加拉国的多家医院收集的,为研究提供了多样且现实的样本。 我们实施了先进的深度学习模型,其中DenseNet169取得了卓越的结果,其准确率、精确度、召回率以及F1分数均达到了0.9983。此外,我们还应用了解释性人工智能(XAI)方法,包括GradCAM、GradCAM++、ScoreCAM和LayerCAM等,用以提供模型决策过程的可视化表示。 在脑癌诊断领域中,这些技术突显了DenseNet169增强诊断准确性的同时提供了透明度,有助于早期诊断及改善患者的治疗效果。
https://arxiv.org/abs/2501.05426
We present RoboPanoptes, a capable yet practical robot system that achieves whole-body dexterity through whole-body vision. Its whole-body dexterity allows the robot to utilize its entire body surface for manipulation, such as leveraging multiple contact points or navigating constrained spaces. Meanwhile, whole-body vision uses a camera system distributed over the robot's surface to provide comprehensive, multi-perspective visual feedback of its own and the environment's state. At its core, RoboPanoptes uses a whole-body visuomotor policy that learns complex manipulation skills directly from human demonstrations, efficiently aggregating information from the distributed cameras while maintaining resilience to sensor failures. Together, these design aspects unlock new capabilities and tasks, allowing RoboPanoptes to unbox in narrow spaces, sweep multiple or oversized objects, and succeed in multi-step stowing in cluttered environments, outperforming baselines in adaptability and efficiency. Results are best viewed on this https URL.
我们介绍了RoboPanoptes,这是一种功能强大且实用的机器人系统,通过全身视觉实现了全身灵巧性。其全身灵巧性使机器人能够利用整个身体表面进行操作,例如使用多个接触点或在受限空间中导航。与此同时,全身视觉采用分布在机器人表面的摄像头系统来提供全面、多视角的自身状态和环境状态的视觉反馈。RoboPanoptes的核心是一个全身视动策略(whole-body visuomotor policy),该策略可以直接从人类演示中学得复杂的操作技能,并能够有效地汇总分布式摄像头的信息,同时保持对传感器故障的鲁棒性。这些设计方面共同解锁了新的能力和任务,使RoboPanoptes能够在狭窄的空间中开箱,清扫多个或超大尺寸物体,以及在杂乱环境中完成多步骤收纳任务,其适应性和效率优于基线方法。结果最好在此 https URL 上查看。
https://arxiv.org/abs/2501.05420
Continuum instruments are integral to robot-assisted minimally invasive surgery (MIS), with tendon-driven mechanisms being the most common. Real-time tension feedback is crucial for precise articulation but remains a challenge in compact actuation unit designs. Additionally, accurate shape and external force sensing of continuum instruments are essential for advanced control and manipulation. This paper presents a compact and modular actuation unit that integrates a torque cell directly into the pulley module to provide real-time tension feedback. Building on this unit, we propose a novel shape-force sensing framework that incorporates polynomial curvature kinematics to accurately model non-constant curvature. The framework combines pose sensor measurements at the instrument tip and actuation tension feedback at the developed actuation unit. Experimental results demonstrate the improved performance of the proposed shape-force sensing framework in terms of shape reconstruction accuracy and force estimation reliability compared to conventional constant-curvature methods.
连续体仪器在机器人辅助的微创手术(MIS)中至关重要,其中腱驱动机制最为常见。实时张力反馈对于精确的操作非常重要,但在紧凑型驱动单元的设计中仍然是一个挑战。此外,对连续体器械进行准确的形状和外部力量感知是实现高级控制和操作所必需的。本文提出了一种集成了扭矩传感器直接到滑轮模块中的小型化、模块化的驱动单元,以提供实时张力反馈。在此基础上,我们提出了一个新的形变-力感测框架,该框架结合了多项式曲率运动学模型来精确模拟非恒定曲率。此框架结合了器械尖端的姿态传感器测量和在新开发的驱动单元中获得的操作张力反馈。 实验结果表明,在形状重建准确性和力量估计可靠性方面,所提出的形变-力感测框架相较于传统恒定曲率方法有了显著改善。
https://arxiv.org/abs/2501.05418
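The idea of polynomial curvature kinematics mentioned in the entry above can be illustrated in the plane. If the curvature along arc length s is a polynomial kappa(s) = a0 + a1*s + ..., the bending angle is its integral, and the backbone shape follows by integrating (cos(theta), sin(theta)). This is a planar sketch under those assumptions only; the paper's actual model and parameterization are not given in the abstract, and `bending_angle` and `backbone_points` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def bending_angle(coeffs: Sequence[float], s: float) -> float:
    """theta(s): closed-form integral of kappa(u) = sum(a_i * u**i) from 0 to s."""
    return sum(a * s ** (i + 1) / (i + 1) for i, a in enumerate(coeffs))


def backbone_points(coeffs: Sequence[float], length: float,
                    steps: int = 1000) -> List[Tuple[float, float]]:
    """Integrate the planar backbone shape with the midpoint rule."""
    ds = length / steps
    x = y = 0.0
    points = [(x, y)]
    for k in range(steps):
        theta = bending_angle(coeffs, (k + 0.5) * ds)  # angle at segment midpoint
        x += math.cos(theta) * ds
        y += math.sin(theta) * ds
        points.append((x, y))
    return points


# Constant curvature (a0 = 1, all higher terms zero) traces a unit circle arc;
# adding a linear term a1 makes the curvature vary along the instrument.
arc_tip = backbone_points([1.0], math.pi / 2)[-1]
varying_tip = backbone_points([1.0, 0.5], math.pi / 2)[-1]
```

The constant-curvature case is the special case with a single coefficient, which is what makes the polynomial form a strict generalization of the conventional model.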
Existing benchmarks for evaluating long-context language models (LCLMs) primarily focus on long-context recall, requiring models to produce short responses based on a few critical snippets while processing thousands of irrelevant tokens. We introduce LongProc (Long Procedural Generation), a new benchmark that requires both the integration of highly dispersed information and long-form generation. LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans. These tasks challenge LCLMs by testing their ability to follow detailed procedural instructions, synthesize and reason over dispersed information, and generate structured, long-form outputs (up to 8K tokens). Furthermore, as these tasks adhere to deterministic procedures and yield structured outputs, they enable reliable rule-based evaluation. We evaluate 17 LCLMs on LongProc across three difficulty levels, with maximum numbers of output tokens set at 500, 2K, and 8K. Notably, while all tested models claim a context window size above 32K tokens, open-weight models typically falter on 2K-token tasks, and closed-source models like GPT-4o show significant degradation on 8K-token tasks. Further analysis reveals that LCLMs struggle to maintain long-range coherence in long-form generations. These findings highlight critical limitations in current LCLMs and suggest substantial room for improvement. Data and code available at: this https URL
现有的评估长上下文语言模型(LCLM)的基准主要集中在长上下文回溯上,即要求这些模型在处理数千个无关令牌的同时,根据几个关键片段生成简短的回答。我们引入了LongProc(长程序化生成),这是一个新的评估基准,它不仅需要整合高度分散的信息,还需要进行长篇生成。LongProc 包含六个多样化的程序化生成任务,例如从HTML页面中提取结构化信息并将其转换为TSV格式、执行复杂的搜索过程来创建旅行计划等。这些任务通过测试LCLM遵循详细程序指令的能力、综合和推理分散的信息以及生成结构化长篇输出(最多8K令牌)的能力来挑战模型。此外,由于这些任务遵循确定性的流程,并产生结构化的输出,因此它们支持基于规则的可靠评估。我们在LongProc上对17个LCLM进行了三个难度级别的评估,最大输出令牌数分别设置为500、2K和8K。值得注意的是,尽管所有测试模型都声称其上下文窗口大小超过32K令牌,但开放权重模型通常在处理2K令牌任务时表现不佳,而像GPT-4o这样的闭源模型在处理8K令牌的任务时会表现出显著的性能下降。进一步分析表明,LCLM 在长篇生成中难以保持长程连贯性。这些发现突显了当前LCLM的关键局限,并表明仍有很大的改进空间。数据和代码可在此 https URL 获取
https://arxiv.org/abs/2501.05414
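The LongProc entry above notes that structured outputs such as TSV allow reliable rule-based evaluation. One simple way this could work is to compare predicted rows against gold rows. The row-level F1 used here is an assumption for illustration, not necessarily the benchmark's metric, and `parse_tsv` and `row_f1` are hypothetical names.

```python
from typing import List, Tuple

Row = Tuple[str, ...]


def parse_tsv(text: str) -> List[Row]:
    """Split TSV text into rows of whitespace-trimmed cells, skipping blank lines."""
    return [tuple(cell.strip() for cell in line.split("\t"))
            for line in text.splitlines() if line.strip()]


def row_f1(predicted: str, gold: str) -> float:
    """F1 over exact-match rows, ignoring row order and duplicate rows."""
    pred_rows = set(parse_tsv(predicted))
    gold_rows = set(parse_tsv(gold))
    overlap = len(pred_rows & gold_rows)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_rows)
    recall = overlap / len(gold_rows)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the model output misses one of three gold rows.
gold = "Paris\t2 nights\nRome\t3 nights\nOslo\t1 night\n"
output = "Rome\t3 nights\nParis\t2 nights\n"
score = row_f1(output, gold)
```

Because the check is a deterministic comparison, the same output always receives the same score, with no judge model involved.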
Training audio-to-image generative models requires an abundance of diverse audio-visual pairs that are semantically aligned. Such data is almost always curated from in-the-wild videos, given the cross-modal semantic correspondence that is inherent to them. In this work, we hypothesize that insisting on the absolute need for ground truth audio-visual correspondence is not only unnecessary, but also leads to severe restrictions in the scale, quality, and diversity of the data, ultimately impairing its use in modern generative models. That is, we propose a scalable image sonification framework where instances from a variety of high-quality yet disjoint uni-modal origins can be artificially paired through a retrieval process that is empowered by the reasoning capabilities of modern vision-language models. To demonstrate the efficacy of this approach, we use our sonified images to train an audio-to-image generative model that performs competitively against the state of the art. Finally, through a series of ablation studies, we exhibit several intriguing auditory capabilities, such as semantic mixing and interpolation, loudness calibration, and acoustic space modeling through reverberation, that our model has implicitly developed to guide the image generation process.
训练音频到图像生成模型需要大量语义对齐的多样化视听数据对。这种数据几乎总是从真实场景(in-the-wild)视频中整理而来,因为这些视频天然具有跨模态的语义对应关系。在这项工作中,我们假设坚持绝对需要真实(ground truth)的视听对应不仅没有必要,还会导致数据在规模、质量和多样性方面受到严重限制,最终影响其在现代生成模型中的应用。具体来说,我们提出了一种可扩展的图像声化框架,在该框架中,借助现代视觉语言模型的推理能力所支持的检索过程,可以将来自各种高质量但互不相交的单模态来源的实例人工配对。为了展示这种方法的有效性,我们使用声化图像训练了一个音频到图像的生成模型,其表现可与现有最先进的方法相竞争。最后,通过一系列消融研究,我们展示了模型隐式发展出的若干有趣的听觉能力,比如语义混合与插值、响度校准以及通过混响进行声学空间建模,这些能力用于引导图像生成过程。
https://arxiv.org/abs/2501.05413