We introduce a Mixture of Raytraced Experts, a stacked Mixture of Experts (MoE) architecture that can dynamically select sequences of experts, producing computational graphs of variable width and depth. Existing MoE architectures generally require a fixed amount of computation for a given sample. Our approach, in contrast, yields predictions of increasing accuracy as the computation cycles through the sequence of experts. We train our model by iteratively sampling from a set of candidate experts, unfolding the sequence akin to how Recurrent Neural Networks are trained. Our method does not require load-balancing mechanisms, and preliminary experiments show a reduction in training epochs of 10% to 40% with comparable or higher accuracy. These results point to new research directions in the field of MoEs, allowing the design of potentially faster and more expressive models. The code is available at this https URL
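To make the sequential routing idea concrete, the following is a minimal sketch and not the authors' implementation: an input is repeatedly routed to one expert sampled from a candidate distribution, each chosen expert updates a running state, and an anytime prediction can be read out after every cycle. The shapes, the router, and the readout head are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def make_expert(d_in, d_out):
    # a toy expert: one random tanh layer
    W = rng.normal(scale=0.1, size=(d_in, d_out))
    return lambda x: np.tanh(x @ W)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d, n_experts, n_cycles = 16, 8, 4
experts = [make_expert(d, d) for _ in range(n_experts)]
router_W = rng.normal(scale=0.1, size=(d, n_experts))
readout_W = rng.normal(scale=0.1, size=(d, 3))            # 3-way toy classifier head

x = rng.normal(size=d)                                     # one input sample
state = x.copy()
for cycle in range(n_cycles):
    probs = softmax(state @ router_W)                      # distribution over candidate experts
    k = rng.choice(n_experts, p=probs)                     # sample the next expert in the sequence
    state = state + experts[k](state)                      # residual update from the chosen expert
    logits = state @ readout_W                             # anytime prediction after each cycle
    print(f"cycle {cycle}: expert {k}, predicted class {int(logits.argmax())}")

The point of the sketch is the anytime behaviour: because a prediction exists after every cycle, computation can stop early for easy inputs, which is the property the abstract attributes to the expert sequence.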
https://arxiv.org/abs/2507.12419
Humans possess a remarkable capacity for spatial cognition, allowing for self-localization even in novel or unfamiliar environments. While hippocampal neurons encoding position and orientation are well documented, the large-scale neural dynamics supporting spatial representation, particularly during naturalistic, passive experience, remain poorly understood. Here, we demonstrate for the first time that non-invasive brain-computer interfaces (BCIs) based on electroencephalography (EEG) can decode spontaneous, fine-grained egocentric 6D pose, comprising three-dimensional position and orientation, during passive viewing of egocentric video. Despite EEG's limited spatial resolution and high signal noise, we find that spatially coherent visual input (i.e., continuous and structured motion) reliably evokes decodable spatial representations, aligning with participants' subjective sense of spatial engagement. Decoding performance further improves when visual input is presented at a rate of one frame every 100 ms, suggesting alignment with intrinsic neural temporal dynamics. Using gradient-based backpropagation through a neural decoding model, we identify distinct EEG channels contributing to position-specific and orientation-specific components, revealing a distributed yet complementary neural encoding scheme. These findings indicate that the brain's spatial systems operate spontaneously and continuously, even under passive conditions, challenging traditional distinctions between active and passive spatial cognition. Our results offer a non-invasive window into the automatic construction of egocentric spatial maps and advance our understanding of how the human mind transforms everyday sensory experience into structured internal representations.
https://arxiv.org/abs/2507.12417
Composed Image Retrieval (CIR) retrieves relevant images based on a reference image and accompanying text describing desired modifications. However, existing CIR methods focus only on retrieving the target image and disregard the relevance of other images. This limitation arises because most methods employ contrastive learning, which treats the target image as positive and all other images in the batch as negatives, and can therefore inadvertently include false negatives. This may result in retrieving irrelevant images, reducing user satisfaction even when the target image is retrieved. To address this issue, we propose Query-Relevant Retrieval through Hard Negative Sampling (QuRe), which optimizes a reward model objective to reduce false negatives. Additionally, we introduce a hard negative sampling strategy that selects images positioned between two steep drops in relevance scores following the target image, in order to effectively filter out false negatives. To evaluate how well CIR models align with human satisfaction, we create Human-Preference FashionIQ (HP-FashionIQ), a new dataset that explicitly captures user preferences beyond target retrieval. Extensive experiments demonstrate that QuRe achieves state-of-the-art performance on the FashionIQ and CIRR datasets while exhibiting the strongest alignment with human preferences on the HP-FashionIQ dataset. The source code is available at this https URL.
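The sampling rule can be illustrated with a short, hedged sketch: rank the candidates by relevance score, look at the score drops after the target, and keep the candidates that fall between the two steepest drops as hard negatives. The function name, the score values, and the tie handling are our own illustrative choices, not the paper's exact procedure.

import numpy as np

def hard_negatives_between_drops(scores, target_idx, n_neg=8):
    order = np.argsort(-scores)                       # candidates, most relevant first
    ranked = list(order)
    t = ranked.index(target_idx)
    tail = ranked[t + 1:]                             # everything ranked after the target
    if len(tail) < 3:
        return tail[:n_neg]
    drops = -np.diff(scores[tail])                    # score drop between consecutive ranks
    d1, d2 = sorted(np.argsort(-drops)[:2])           # positions of the two steepest drops
    between = tail[d1 + 1: d2 + 1]                    # candidates sitting between the two drops
    return between[:n_neg]

scores = np.array([0.95, 0.92, 0.90, 0.60, 0.58, 0.57, 0.30, 0.28])
print(hard_negatives_between_drops(scores, target_idx=0))   # picks the middle cluster (indices 3, 4, 5)

Candidates ranked above the first steep drop are plausibly relevant to the query (potential false negatives) and are deliberately not used as negatives in this reading.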
https://arxiv.org/abs/2507.12416
Training autonomous driving systems requires extensive datasets with precise annotations to attain robust performance. Human annotations suffer from imperfections, and multiple iterations are often needed to produce high-quality datasets. However, manually reviewing large datasets is laborious and expensive. In this paper, we introduce the AutoVDC (Automated Vision Data Cleaning) framework and investigate the use of Vision-Language Models (VLMs) to automatically identify erroneous annotations in vision datasets, thereby enabling users to eliminate these errors and enhance data quality. We validate our approach using the KITTI and nuImages datasets, which contain object detection benchmarks for autonomous driving. To test the effectiveness of AutoVDC, we create dataset variants with intentionally injected erroneous annotations and measure the error detection rate of our approach. Additionally, we compare the detection rates of different VLMs and explore the impact of VLM fine-tuning on our pipeline. The results demonstrate our method's high performance in error detection and data cleaning experiments, indicating its potential to significantly improve the reliability and accuracy of large-scale production datasets in autonomous driving.
https://arxiv.org/abs/2507.12414
In many critical applications, resource constraints limit the amount of information that can be gathered to make predictions. For example, in healthcare, patient data often spans diverse features ranging from lab tests to imaging studies. Each feature may carry different information and must be acquired at a cost in time, money, or risk to the patient. Moreover, temporal prediction tasks, where both instance features and labels evolve over time, introduce additional complexity in deciding when or what information is important. In this work, we propose NOCTA, a Non-Greedy Objective Cost-Tradeoff Acquisition method that sequentially acquires the most informative features at inference time while accounting for both temporal dynamics and acquisition cost. We first introduce a cohesive estimation target for the NOCTA setting, and then develop two complementary estimators: 1) a non-parametric method based on nearest neighbors that guides the acquisition (NOCTA-NP), and 2) a parametric method that directly predicts the utility of potential acquisitions (NOCTA-P). Experiments on synthetic and real-world medical datasets demonstrate that both NOCTA variants outperform existing baselines.
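As a rough illustration of the cost-utility tradeoff, the sketch below acquires features one at a time using a nearest-neighbour confidence estimate. The loop is greedy purely for simplicity, whereas NOCTA itself is explicitly non-greedy and also handles the temporal setting; the cost penalty, function names, and toy data are all assumptions.

import numpy as np

def knn_estimate(X_obs, y, x_partial, observed, k=5):
    # probability-like estimate from the k nearest neighbours on the observed features
    d = np.linalg.norm(X_obs[:, observed] - x_partial[observed], axis=1)
    return y[np.argsort(d)[:k]].mean()

def acquire(X_obs, y, x_full, costs, budget, penalty=0.05):
    observed = []
    while budget > 0:
        p0 = knn_estimate(X_obs, y, x_full, observed) if observed else 0.5
        best, best_gain = None, 0.0
        for j in range(len(x_full)):
            if j in observed or costs[j] > budget:
                continue
            p1 = knn_estimate(X_obs, y, x_full, observed + [j])
            gain = abs(p1 - 0.5) - abs(p0 - 0.5) - penalty * costs[j]   # confidence gain minus cost
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:
            break
        observed.append(best)
        budget -= costs[best]
    return observed

rng = np.random.default_rng(0)
X_obs = rng.normal(size=(200, 4))
y = (X_obs[:, 2] > 0).astype(float)             # only feature 2 is informative in this toy data
costs = np.array([1.0, 1.0, 3.0, 0.5])
print(acquire(X_obs, y, rng.normal(size=4), costs, budget=4.0))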
https://arxiv.org/abs/2507.12412
We consider manipulation problems in constrained and cluttered settings, which require several regrasps at unknown locations. We propose to inform an optimization-based task and motion planning (TAMP) solver with possible regrasp areas and grasp sequences to speed up the search. Our main idea is to use a state-space abstraction, a regrasp map, that captures the combinations of available grasps in different parts of the configuration space and allows us to provide the solver with guesses for the mode switches and additional constraints on object placements. By interleaving the creation of regrasp maps, their adaptation based on failed refinements, and the solving of TAMP (sub)problems, we obtain a robust search method for challenging regrasp manipulation problems.
https://arxiv.org/abs/2507.12407
Realistic human surveillance datasets are crucial for training and evaluating computer vision models under real-world conditions, facilitating the development of robust algorithms for detecting humans and human-interacting objects in complex environments. These datasets need to offer diverse and challenging data to enable a comprehensive assessment of model performance and the creation of more reliable surveillance systems for public safety. To this end, we present two visual object detection benchmarks named OD-VIRAT Large and OD-VIRAT Tiny, aiming at advancing visual understanding tasks in surveillance imagery. The video sequences in both benchmarks cover 10 different human surveillance scenes recorded from a significant height and distance. The proposed benchmarks offer rich annotations of bounding boxes and categories: OD-VIRAT Large has 8.7 million annotated instances in 599,996 images, and OD-VIRAT Tiny has 288,901 annotated instances in 19,860 images. This work also benchmarks state-of-the-art object detection architectures, including RTMDet, YOLOX, RetinaNet, DETR, and Deformable-DETR, on this object detection-specific variant of the VIRAT dataset. To the best of our knowledge, it is the first work to examine the performance of these recently published state-of-the-art object detection architectures on realistic surveillance imagery under challenging conditions such as complex backgrounds, occluded objects, and small-scale objects. The proposed benchmarking and experimental settings will help provide insights into the performance of the selected object detection models and set the base for developing more efficient and robust object detection architectures.
https://arxiv.org/abs/2507.12396
Large Language Models (LLMs) show potential for enhancing robotic path planning. This paper assesses visual input's utility for multimodal LLMs in such tasks via a comprehensive benchmark. We evaluated 15 multimodal LLMs on generating valid and optimal paths in 2D grid environments, simulating simplified robotic planning, comparing text-only versus text-plus-visual inputs across varying model sizes and grid complexities. Our results indicate moderate success rates on simpler small grids, where visual input or few-shot text prompting offered some benefits. However, performance significantly degraded on larger grids, highlighting a scalability challenge. While larger models generally achieved higher average success, the visual modality was not universally dominant over well-structured text for these multimodal systems, and successful paths on simpler grids were generally of high quality. These results indicate current limitations in robust spatial reasoning, constraint adherence, and scalable multimodal integration, identifying areas for future LLM development in robotic path planning.
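For readers who want a concrete picture of the evaluation, a checker of the kind such a benchmark needs might look like the sketch below (our own illustration, not the paper's harness): validate that a proposed path stays on free cells and moves one step at a time, and compare its length with the BFS-optimal length.

from collections import deque

def bfs_shortest(grid, start, goal):
    # grid: list of strings, '#' marks an obstacle; returns the optimal number of moves
    rows, cols = len(grid), len(grid[0])
    q, seen = deque([(start, 0)]), {start}
    while q:
        (r, c), d = q.popleft()
        if (r, c) == goal:
            return d
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != '#' and (nr, nc) not in seen:
                seen.add((nr, nc))
                q.append(((nr, nc), d + 1))
    return None

def path_is_valid(grid, path, start, goal):
    # every step must be a single 4-connected move onto a free cell
    if not path or path[0] != start or path[-1] != goal:
        return False
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1 or grid[r2][c2] == '#':
            return False
    return grid[start[0]][start[1]] != '#'

grid = ["....", ".##.", "....", "...."]
path = [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]     # e.g. parsed from an LLM response
optimal = bfs_shortest(grid, (0, 0), (2, 3))
print(path_is_valid(grid, path, (0, 0), (2, 3)), len(path) - 1, optimal)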
https://arxiv.org/abs/2507.12391
Semi-supervised medical image segmentation is a crucial technique for alleviating the high cost of data annotation. When labeled data is limited, textual information can provide additional context to enhance visual semantic understanding. However, research exploring the use of textual data to enhance visual semantic embeddings in 3D medical imaging tasks remains scarce. In this paper, we propose a novel text-driven multiplanar visual interaction framework for semi-supervised medical image segmentation (termed Text-SemiSeg), which consists of three main modules: Text-enhanced Multiplanar Representation (TMR), Category-aware Semantic Alignment (CSA), and Dynamic Cognitive Augmentation (DCA). Specifically, TMR facilitates text-visual interaction through planar mapping, thereby enhancing the category awareness of visual features. CSA performs cross-modal semantic alignment between text features augmented with learnable variables and intermediate-layer visual features. DCA reduces the distribution discrepancy between labeled and unlabeled data through their interaction, thus improving the model's robustness. Finally, experiments on three public datasets demonstrate that our model effectively enhances visual features with textual information and outperforms other methods. Our code is available at this https URL.
https://arxiv.org/abs/2507.12382
We investigate whether internal activations in language models can be used to detect arithmetic errors. Starting with a controlled setting of 3-digit addition, we show that simple probes can accurately decode both the model's predicted output and the correct answer from hidden states, regardless of whether the model's output is correct. Building on this, we train lightweight error detectors that predict model correctness with over 90% accuracy. We then extend our analysis to structured chain-of-thought traces on addition-only GSM8K problems and find that probes trained on simple arithmetic generalize well to this more complex setting, revealing consistent internal representations. Finally, we demonstrate that these probes can guide selective re-prompting of erroneous reasoning steps, improving task accuracy with minimal disruption to correct outputs. Our findings suggest that arithmetic errors can be anticipated from internal activations alone, and that simple probes offer a viable path toward lightweight model self-correction.
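The probing setup can be summarised with a hedged sketch: fit a logistic-regression probe on hidden activations to predict whether the model's answer is correct. The activations and labels below are synthetic stand-ins; in the paper they come from a transformer's hidden states on 3-digit addition prompts.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 2000, 64
hidden = rng.normal(size=(n, d))                                   # stand-in hidden states
w_true = rng.normal(size=d)
is_correct = (hidden @ w_true + 0.5 * rng.normal(size=n)) > 0      # stand-in correctness labels

probe = LogisticRegression(max_iter=1000).fit(hidden[:1500], is_correct[:1500])
print("probe accuracy:", probe.score(hidden[1500:], is_correct[1500:]))

In the selective re-prompting use described above, such a probe would flag a reasoning step as likely erroneous whenever its predicted probability of correctness falls below a threshold, and only those steps would be regenerated.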
https://arxiv.org/abs/2507.12379
Traditional information extraction systems face challenges with text-only language models, as these do not account for infographics (visual elements of information) such as tables, charts, and images that are often used to convey complex information to readers. Multimodal LLMs (MLLMs) face a needle-in-a-haystack problem, i.e., either long context lengths or a substantial number of documents as the search space. Late interaction mechanisms over visual language models have shown state-of-the-art performance in retrieval-based vision-augmented Q&A tasks. However, several challenges remain in using them for RAG-based multimodal Q&A. Firstly, many popular and widely adopted vector databases do not support native multi-vector retrieval. Secondly, late interaction requires additional computation and inflates the storage footprint, which can hinder enterprise adoption. Lastly, current late interaction mechanisms do not leverage approximate nearest-neighbor indexing methods for large speedups in the retrieval process. This paper explores a pragmatic approach to making the vision retrieval process scalable and efficient without compromising performance quality. We propose a multi-step custom implementation that combines widely adopted hybrid search (metadata & embedding) with a state-of-the-art late interaction re-ranker to retrieve the best matching pages. Finally, an MLLM is prompted as the reader to generate answers from the contextualized best matching pages. Through experiments, we observe that the proposed design is scalable (significant speedup) and stable (without degrading performance quality), and can therefore be used as a production system at enterprises.
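To make the re-ranking step concrete, here is a minimal MaxSim-style late-interaction scorer of the kind used by ColBERT-style retrievers; the shapes, the normalisation, and the candidate pages coming from hybrid search are illustrative assumptions rather than the paper's exact pipeline.

import numpy as np

def maxsim_score(query_vecs, page_vecs):
    # query_vecs: (n_query_tokens, dim), page_vecs: (n_page_patches, dim), both L2-normalized
    sims = query_vecs @ page_vecs.T             # token-to-patch cosine similarities
    return sims.max(axis=1).sum()               # best match per query token, summed

def normed(a):
    return a / np.linalg.norm(a, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
query = normed(rng.normal(size=(12, 128)))
pages = [normed(rng.normal(size=(196, 128))) for _ in range(5)]   # candidates returned by hybrid search
reranked = sorted(range(len(pages)), key=lambda i: -maxsim_score(query, pages[i]))
print("re-ranked page order:", reranked)

Because only a small candidate set reaches this stage, the quadratic token-to-patch comparison stays cheap, which is the practical argument for pairing a fast first-stage search with a late-interaction re-ranker.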
https://arxiv.org/abs/2507.12378
Large language models (LLMs) have traditionally relied on static training data, limiting their knowledge to fixed snapshots. Recent advancements, however, have equipped LLMs with web browsing capabilities, enabling real-time information retrieval and multi-step reasoning over live web content. While prior studies have demonstrated LLMs' ability to access and analyze websites, their capacity to directly retrieve and analyze social media data remains unexplored. Here, we evaluate whether web-browsing LLMs can infer demographic attributes of social media users given only their usernames. Using a synthetic dataset of 48 X (Twitter) accounts and a survey dataset of 1,384 international participants, we show that these models can access social media content and predict user demographics with reasonable accuracy. Analysis of the synthetic dataset further reveals how LLMs parse and interpret social media profiles, which may introduce gender and political biases against accounts with minimal activity. While this capability holds promise for computational social science in the post-API era, it also raises risks of misuse, particularly in information operations and targeted advertising, underscoring the need for safeguards. We recommend that LLM providers restrict this capability in public-facing applications, while preserving controlled access for verified research purposes.
https://arxiv.org/abs/2507.12372
Large Language Models (LLMs) have demonstrated significant capabilities in understanding and generating human language, contributing to more natural interactions with complex systems. However, they face challenges such as ambiguity in user requests. To address these challenges, this paper introduces and evaluates a multi-agent debate framework designed to enhance ambiguity detection and resolution beyond what single models achieve. The framework consists of three LLM architectures (Llama3-8B, Gemma2-9B, and Mistral-7B variants) and a dataset with diverse ambiguities. The debate framework markedly enhanced the performance of the Llama3-8B and Mistral-7B variants over their individual baselines, with Mistral-7B-led debates achieving a notable 76.7% success rate and proving particularly effective for complex ambiguities and efficient consensus. While acknowledging that models respond differently to collaborative strategies, these findings underscore the debate framework's value as a targeted method for augmenting LLM capabilities. This work offers important insights for developing more robust and adaptive language understanding systems by showing how structured debates can lead to improved clarity in interactive systems.
https://arxiv.org/abs/2507.12370
The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task, with enterprise models achieving baseline success rates in the 48-51% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods. We make the dataset and evaluation code publicly available at this https URL.
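An execution-based check in the spirit described above might look like the following sketch (not the benchmark's actual harness): run a candidate completion together with its unit test in a subprocess and record pass or fail. Pinning the library version, for example with a dedicated virtual environment per problem, is elided here, and the candidate, test, and library calls are illustrative.

import subprocess, sys, textwrap

candidate = textwrap.dedent("""
    import numpy as np
    def row_means(x):
        return np.mean(x, axis=1)
""")

unit_test = textwrap.dedent("""
    import numpy as np
    assert np.allclose(row_means(np.arange(6).reshape(2, 3)), [1.0, 4.0])
    print("PASS")
""")

proc = subprocess.run([sys.executable, "-c", candidate + unit_test],
                      capture_output=True, text=True, timeout=30)
print("passed" if "PASS" in proc.stdout else f"failed: {proc.stderr.strip()[:200]}")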
https://arxiv.org/abs/2507.12367
Neuro-symbolic artificial intelligence (neuro-symbolic AI) excels in logical analysis and reasoning. Hyperdimensional Computing (HDC), a promising brain-inspired computational model, is integral to neuro-symbolic AI. Various HDC models have been proposed to represent class-instance and class-class relations, but when representing the more complex class-subclass relation, where multiple objects associate different levels of classes and subclasses, they face challenges in factorization, a crucial task for neuro-symbolic AI systems. In this article, we propose FactorHD, a novel HDC model capable of representing and factorizing the complex class-subclass relation efficiently. FactorHD features a symbolic encoding method that embeds an extra memorization clause, preserving more information for multiple objects. In addition, it employs an efficient factorization algorithm that selectively eliminates redundant classes by identifying the memorization clause of the target class. This model significantly enhances computing efficiency and accuracy in representing and factorizing multiple objects with class-subclass relations, overcoming limitations of existing HDC models such as the "superposition catastrophe" and "the problem of 2". Evaluations show that FactorHD achieves approximately 5667x speedup at a representation size of 10^9 compared to existing HDC models. When integrated with the ResNet-18 neural network, FactorHD achieves 92.48% factorization accuracy on the CIFAR-10 dataset.
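For background, the sketch below shows the basic HDC primitives this line of work builds on: bipolar hypervectors, binding by elementwise product, superposition, and cleanup against a codebook. FactorHD's memorization clause and its factorization algorithm go well beyond these primitives; the sketch only illustrates why factorizing superposed bindings is the central operation.

import numpy as np

rng = np.random.default_rng(0)
D = 10000
def hv():
    # random bipolar hypervector
    return rng.choice([-1, 1], size=D)

color = {n: hv() for n in ("red", "blue")}
shape = {n: hv() for n in ("circle", "square")}

# Two bound (attribute, value) pairs superposed into a single object vector.
obj = np.sign(color["red"] * shape["circle"] + color["blue"] * shape["square"])

# "Factorize" the shape bound to red by unbinding with the color vector
# and cleaning up against the shape codebook.
query = obj * color["red"]
best = max(shape, key=lambda n: int(query @ shape[n]))
print("shape bound with red:", best)     # expected: circle, with high probability

With only two bound pairs the cleanup is easy; as more objects and relation levels are superposed, the cross terms grow and naive cleanup degrades, which is the failure mode the abstract refers to as the superposition catastrophe.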
https://arxiv.org/abs/2507.12366
We introduce Cluster Contrast (CueCo), a novel approach to unsupervised visual representation learning that effectively combines the strengths of contrastive learning and clustering methods. Inspired by recent advancements, CueCo is designed to simultaneously scatter and align feature representations within the feature space. This method utilizes two neural networks, a query and a key, where the key network is updated through a slow-moving average of the query outputs. CueCo employs a contrastive loss to push dissimilar features apart, enhancing inter-class separation, and a clustering objective to pull together features of the same cluster, promoting intra-class compactness. Our method achieves 91.40% top-1 classification accuracy on CIFAR-10, 68.56% on CIFAR-100, and 78.65% on ImageNet-100 using linear evaluation with a ResNet-18 backbone. By integrating contrastive learning with clustering, CueCo sets a new direction for advancing unsupervised visual representation learning.
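A condensed sketch of two of the ingredients named above, the slow-moving-average (momentum) key network and an InfoNCE-style contrastive loss, written against standard PyTorch with illustrative shapes. The clustering objective and the full training loop are omitted, and the momentum update shown here acts on the key network's weights, which is the common MoCo-style reading of a slow-moving average.

import torch
import torch.nn.functional as F

query_net = torch.nn.Linear(512, 128)
key_net = torch.nn.Linear(512, 128)
key_net.load_state_dict(query_net.state_dict())

@torch.no_grad()
def momentum_update(m=0.99):
    # key weights follow the query weights as an exponential moving average
    for q, k in zip(query_net.parameters(), key_net.parameters()):
        k.mul_(m).add_(q, alpha=1 - m)

def contrastive_loss(x1, x2, temperature=0.2):
    q = F.normalize(query_net(x1), dim=1)
    with torch.no_grad():
        k = F.normalize(key_net(x2), dim=1)
    logits = q @ k.t() / temperature            # positives sit on the diagonal
    return F.cross_entropy(logits, torch.arange(q.size(0)))

x1, x2 = torch.randn(32, 512), torch.randn(32, 512)   # two augmented views of the same batch
loss = contrastive_loss(x1, x2)
momentum_update()
print(float(loss))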
https://arxiv.org/abs/2507.12359
Gender bias has been widely observed in speech perception tasks, influenced by the fundamental voicing differences between genders. This study reveals a gender bias in the perception of Alzheimer's Disease (AD) speech. In a perception experiment in which 16 Chinese listeners evaluated both Chinese and Greek speech, we found that male speech was more frequently identified as AD, with this bias being particularly pronounced for Chinese speech. Acoustic analysis showed that shimmer values in male speech were significantly associated with AD perception, while the speech portion exhibited a significant negative correlation with AD identification. Although language did not have a significant impact on AD perception, our findings underscore the critical role of gender bias in AD speech perception. This work highlights the necessity of addressing gender bias when developing AD detection models and calls for further research to validate model performance across different linguistic contexts.
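For reference, the shimmer measure mentioned above is, in its simplest local form, the mean absolute difference between consecutive cycle peak amplitudes relative to the mean amplitude. The sketch below computes it from a stand-in amplitude sequence; real analyses extract per-cycle peak amplitudes with tools such as Praat, and other shimmer variants exist.

import numpy as np

def local_shimmer(peak_amplitudes):
    # relative (local) shimmer from consecutive glottal-cycle peak amplitudes
    a = np.asarray(peak_amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(a))) / np.mean(a)

print(local_shimmer([0.81, 0.78, 0.84, 0.80, 0.83]))   # roughly 0.05, i.e. about 5% shimmer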
https://arxiv.org/abs/2507.12356
Weed detection is a critical component of precision agriculture, facilitating targeted herbicide application and reducing environmental impact. However, deploying accurate object detection models on resource-limited platforms remains challenging, particularly when differentiating visually similar weed species commonly encountered in plant phenotyping applications. In this work, we investigate Channel-wise Knowledge Distillation (CWD) and Masked Generative Distillation (MGD) to enhance the performance of lightweight models for real-time smart spraying systems. Utilizing YOLO11x as the teacher model and YOLO11n as both reference and student, both CWD and MGD effectively transfer knowledge from the teacher to the student model. Our experiments, conducted on a real-world dataset comprising sugar beet crops and four weed types (Cirsium, Convolvulus, Fallopia, and Echinochloa), consistently show increased AP50 across all classes. The distilled CWD student model achieves a notable improvement of 2.5% in mAP50 over the baseline, and MGD achieves 1.9%, without increasing model complexity. Additionally, we validate real-time deployment feasibility by evaluating the student YOLO11n model on Jetson Orin Nano and Raspberry Pi 5 embedded devices, performing five independent runs to evaluate performance stability across random seeds. These findings confirm CWD and MGD as effective, efficient, and practical approaches for improving deep learning-based weed detection accuracy in precision agriculture and plant phenotyping scenarios.
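As a hedged illustration of channel-wise distillation, the loss below turns each channel's spatial activation map into a distribution with a temperature-scaled softmax and pushes the student toward the teacher's per-channel distribution with a KL term. Feature shapes and the temperature are illustrative, and the masked generative (MGD) variant is not shown.

import torch
import torch.nn.functional as F

def channel_wise_distillation(student_feat, teacher_feat, tau=4.0):
    # features: (batch, channels, height, width) from matched layers of student and teacher
    b, c, h, w = student_feat.shape
    s = F.log_softmax(student_feat.reshape(b * c, h * w) / tau, dim=1)
    t = F.softmax(teacher_feat.reshape(b * c, h * w) / tau, dim=1)
    return F.kl_div(s, t, reduction="batchmean") * (tau ** 2)

student_feat = torch.randn(2, 64, 20, 20, requires_grad=True)
teacher_feat = torch.randn(2, 64, 20, 20)
print(float(channel_wise_distillation(student_feat, teacher_feat)))

In training, this term would be added to the detector's usual losses with a weighting factor, so the student keeps its own supervised objective while being nudged toward the teacher's channel-wise activation patterns.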
https://arxiv.org/abs/2507.12344
Ensuring that neural models used in real-world applications cannot infer sensitive information, such as demographic attributes like gender or race, from text representations is a critical challenge when fairness is a concern. We address this issue through concept erasure, a process that removes information related to a specific concept from distributed representations while preserving as much of the remaining semantic information as possible. Our approach involves learning an orthogonal projection in the embedding space, designed to make the class-conditional feature distributions of the discrete concept being erased indistinguishable after projection. By adjusting the rank of the projector, we control the extent of information removal, while its orthogonality ensures strict preservation of the local structure of the embeddings. Our method, termed $\overline{\mathrm{L}}$EOPARD, achieves state-of-the-art performance in nonlinear erasure of a discrete attribute on classic natural language processing benchmarks. Furthermore, we demonstrate that $\overline{\mathrm{L}}$EOPARD effectively mitigates bias in deep nonlinear classifiers, thereby promoting fairness.
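The rank-controlled orthogonal projection can be illustrated with a simple numpy sketch: estimate the directions that best separate the concept's class-conditional means and project them out of the embeddings. The paper's learned projector and objective are more sophisticated (and target nonlinear erasure); this only shows the "orthogonal projector of chosen rank" idea, and all names and toy data are assumptions.

import numpy as np

def erase_concept(X, concept_labels, rank=1):
    # X: (n_samples, dim) embeddings; concept_labels: discrete attribute to remove
    classes = np.unique(concept_labels)
    means = np.stack([X[concept_labels == c].mean(axis=0) for c in classes])
    diffs = means - means.mean(axis=0)                   # directions separating class means
    U, _, _ = np.linalg.svd(diffs.T, full_matrices=False)
    B = U[:, :rank]                                      # top-rank concept directions
    P = np.eye(X.shape[1]) - B @ B.T                     # orthogonal projector onto their complement
    return X @ P

rng = np.random.default_rng(0)
gender = rng.integers(0, 2, size=500)
X = rng.normal(size=(500, 64)) + np.outer(gender, np.ones(64) * 0.8)   # embeddings leaking the attribute
X_clean = erase_concept(X, gender, rank=1)
print(abs(X_clean[gender == 0].mean() - X_clean[gender == 1].mean()))  # class means now nearly coincide

Increasing the rank removes more concept-related variance at the cost of discarding more of the embedding space, which is the tradeoff the abstract describes as controlling the extent of information removal.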
https://arxiv.org/abs/2507.12341
This paper introduces KeyDiff3D, a framework for unsupervised monocular 3D keypoint estimation that accurately predicts 3D keypoints from a single image. While previous methods rely on manual annotations or calibrated multi-view images, both of which are expensive to collect, our method enables monocular 3D keypoint estimation using only a collection of single-view images. To achieve this, we leverage the powerful geometric priors embedded in a pretrained multi-view diffusion model. In our framework, this model generates multi-view images from a single image, serving as a supervision signal that provides 3D geometric cues to our model. We also use the diffusion model as a powerful 2D multi-view feature extractor and construct 3D feature volumes from its intermediate representations, transforming the implicit 3D priors learned by the diffusion model into explicit 3D features. Beyond accurate keypoint estimation, we further introduce a pipeline that enables manipulation of 3D objects generated by the diffusion model. Experimental results on diverse datasets, including Human3.6M, Stanford Dogs, and several in-the-wild and out-of-domain datasets, highlight the effectiveness of our method in terms of accuracy, generalization, and its ability to enable manipulation of 3D objects generated by the diffusion model from a single image.
https://arxiv.org/abs/2507.12336