Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate for extracting low-level geometry at high resolution in early layers, where attention is expensive without bringing any benefit; attention captures high-level semantics and context more efficiently in low-resolution, deep layers. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers. To avoid the loss of spatial layout information when discarding redundant convolution layers, we introduce a novel, training-free 3D positional encoding, PointROPE. The resulting LitePT model has $3.6\times$ fewer parameters, runs $2\times$ faster, and uses $2\times$ less memory than the state-of-the-art Point Transformer V3, but nonetheless matches or even outperforms it on a range of tasks and datasets. Code and models are available at: this https URL.
https://arxiv.org/abs/2512.13689
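The abstract does not detail how PointROPE works; one plausible, training-free reading is a rotary-style encoding applied per coordinate axis. The sketch below is an illustration under that assumption (the function name, frequency schedule, and channel split are all hypothetical, not taken from the paper):

```python
import numpy as np

def point_rope(feats: np.ndarray, coords: np.ndarray, base: float = 100.0) -> np.ndarray:
    """Rotate channel pairs of each point feature by angles derived from its
    (x, y, z) coordinates, in the spirit of rotary position embeddings.

    feats:  (N, C) point features, C divisible by 6 (3 axes, pairs of 2 channels).
    coords: (N, 3) point coordinates.
    """
    n, c = feats.shape
    assert c % 6 == 0, "channels must split evenly over 3 axes in pairs"
    per_axis = c // 3                          # channels devoted to each axis
    out = np.empty_like(feats)
    for axis in range(3):
        block = feats[:, axis * per_axis:(axis + 1) * per_axis]
        pairs = block.reshape(n, per_axis // 2, 2)
        # one frequency per channel pair, geometrically spaced
        freqs = base ** (-np.arange(per_axis // 2) / (per_axis // 2))
        theta = coords[:, axis:axis + 1] * freqs            # (N, per_axis//2)
        cos, sin = np.cos(theta), np.sin(theta)
        rot = np.stack([pairs[..., 0] * cos - pairs[..., 1] * sin,
                        pairs[..., 0] * sin + pairs[..., 1] * cos], axis=-1)
        out[:, axis * per_axis:(axis + 1) * per_axis] = rot.reshape(n, per_axis)
    return out
```

Because the operation is a pure rotation of channel pairs, it injects position without changing feature norms and needs no learned parameters, which is consistent with the "training-free" claim.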
We present Recurrent Video Masked-Autoencoders (RVM): a novel video representation learning approach that uses a transformer-based recurrent neural network to aggregate dense image features over time, effectively capturing the spatio-temporal structure of natural video data. RVM learns via an asymmetric masked prediction task requiring only a standard pixel reconstruction objective. This design yields a highly efficient ``generalist'' encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action recognition and point/object tracking, while also performing favorably against image models (e.g. DINOv2) on tasks that test geometric and dense spatial understanding. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Moreover, we demonstrate that RVM's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based architectures. Finally, we use qualitative visualizations to highlight that RVM learns rich representations of scene semantics, structure, and motion.
https://arxiv.org/abs/2512.13684
Recent feed-forward reconstruction models like VGGT and $\pi^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity transformation ($\mathrm{Sim}(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes per-layer scale factors, and propagates them across both adjacent windows and timestamps. Extensive experiments show that LASER achieves state-of-the-art performance on camera pose estimation and point map reconstruction with offline models while operating at 14 FPS with 6 GB peak memory on an RTX A6000 GPU, enabling practical deployment for kilometer-scale streaming videos. Project website: $\href{this https URL}{\texttt{this https URL}}$
https://arxiv.org/abs/2512.13680
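As a rough illustration of the layer-wise scale alignment idea from the LASER abstract, the sketch below segments a depth map into quantile layers and rescales each layer by a robust factor estimated on an overlapping frame. The bin count, quantile-based segmentation, and median-ratio estimator are assumptions for illustration, not details from the paper:

```python
import numpy as np

def layerwise_scale_align(depth_prev: np.ndarray, depth_curr: np.ndarray,
                          n_layers: int = 3) -> np.ndarray:
    """Align the current window's depth prediction to the previous window,
    with one scale factor per depth layer rather than a single global scale.

    depth_prev, depth_curr: (H, W) depths of the same overlapping frame
    as predicted by two consecutive temporal windows.
    """
    # Segment the current prediction into discrete depth layers by quantile.
    edges = np.quantile(depth_curr, np.linspace(0, 1, n_layers + 1))
    layer_id = np.digitize(depth_curr, edges[1:-1])   # values in {0..n_layers-1}

    aligned = depth_curr.copy()
    for k in range(n_layers):
        mask = layer_id == k
        if mask.any():
            # Robust per-layer scale: median of depth ratios on the overlap.
            s = np.median(depth_prev[mask] / depth_curr[mask])
            aligned[mask] = depth_curr[mask] * s
    return aligned
```

A single $\mathrm{Sim}(3)$ fit would apply one scale to the whole frame; the per-layer factors are exactly what lets near and far scene layers drift by different amounts between windows, which is the failure mode the abstract describes.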
In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite recent encouraging advances, existing methods face two critical limitations. First, most existing approaches can only generate ambient sounds and lack the capability to produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which introduce additional architecture design and weaken the model simplicity of the original transformers. To address these issues, JoVA employs joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without the need for additional alignment modules. Furthermore, to enable high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection, which enhances supervision on the critical mouth region during training without compromising architectural simplicity. Extensive experiments on benchmarks demonstrate that JoVA outperforms or is competitive with both unified and audio-driven state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity. Our results establish JoVA as an elegant framework for high-quality multimodal generation.
https://arxiv.org/abs/2512.13677
Recent advances in diffusion-based generation techniques enable AI models to produce highly realistic videos, heightening the need for reliable detection mechanisms. However, existing detection methods provide only limited exploration of the 3D geometric patterns present in generated videos. In this paper, we use vanishing points as an explicit representation of 3D geometry patterns, revealing fundamental discrepancies in geometric consistency between real and AI-generated videos. We introduce Grab-3D, a geometry-aware transformer framework for detecting AI-generated videos based on 3D geometric temporal consistency. To enable reliable evaluation, we construct an AI-generated video dataset of static scenes, allowing stable 3D geometric feature extraction. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric classifier head to explicitly inject 3D geometric awareness into temporal modeling. Experiments demonstrate that Grab-3D significantly outperforms state-of-the-art detectors, achieving robust cross-domain generalization to unseen generators.
https://arxiv.org/abs/2512.13665
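The vanishing-point representation that Grab-3D builds on can be illustrated with basic projective geometry: in homogeneous coordinates, the images of parallel 3D lines intersect at the vanishing point, and that intersection is a pair of cross products. A minimal sketch (not the paper's pipeline, just the underlying geometry):

```python
import numpy as np

def vanishing_point(seg_a, seg_b):
    """Intersect two image line segments (each a pair of (x, y) endpoints)
    in homogeneous coordinates; for projections of parallel 3D lines this
    intersection is their vanishing point."""
    def line(p, q):
        # homogeneous line through p and q
        return np.cross([*p, 1.0], [*q, 1.0])
    vp = np.cross(line(*seg_a), line(*seg_b))
    return vp[:2] / vp[2]                      # back to inhomogeneous (x, y)
```

In a real video, a generator that is geometrically consistent should keep such vanishing points stable (for static scenes) across frames; inconsistency over time is the kind of signal the temporal-geometric attention can exploit.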
Memory has emerged as, and will continue to be, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.
https://arxiv.org/abs/2512.13564
Human-centric anomaly detection (AD) has primarily been studied to specify anomalous behaviors of a single person. However, as humans by nature tend to act in a collaborative manner, behavioral anomalies can also arise from human-human interactions. Detecting such anomalies using existing single-person AD models is prone to low accuracy, as these approaches are typically not designed to capture the complex and asymmetric dynamics of interactions. In this paper, we introduce a novel task, Human-Human Interaction Anomaly Detection (H2IAD), which aims to identify anomalous interactive behaviors within collaborative 3D human actions. To address H2IAD, we then propose the Interaction Anomaly Detection Network (IADNet), which is formalized with a Temporal Attention Sharing Module (TASM). Specifically, in designing TASM, we share the encoded motion embeddings across both people such that collaborative motion correlations can be effectively synchronized. Moreover, we notice that in addition to temporal dynamics, human interactions are also characterized by spatial configurations between two people. We thus introduce a Distance-Based Relational Encoding Module (DREM) to better reflect social cues in H2IAD. A normalizing flow is finally employed for anomaly scoring. Extensive experiments on human-human motion benchmarks demonstrate that IADNet outperforms existing human-centric AD baselines in H2IAD.
https://arxiv.org/abs/2512.13560
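The spatial-configuration cues that DREM encodes can be pictured as simple geometry between the two skeletons. A toy stand-in (the actual module is learned; this only shows the kind of raw distance signal it could start from):

```python
import numpy as np

def distance_relational_encoding(pose_a: np.ndarray, pose_b: np.ndarray) -> np.ndarray:
    """Toy distance-based relational feature: the (J, J) matrix of Euclidean
    distances between every joint of person A and every joint of person B,
    capturing the spatial configuration between the two people.

    pose_a, pose_b: (J, 3) 3D joint positions for each person.
    """
    diff = pose_a[:, None, :] - pose_b[None, :, :]   # (J, J, 3) pairwise offsets
    return np.linalg.norm(diff, axis=-1)
```

Such a matrix is invariant to the global position of the pair and changes sharply when interpersonal proximity does, which is exactly the social cue (e.g. unusually close or distant contact) the abstract motivates.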
The study presents the outcomes of research and experimental validation in the domain of automated codebase migration, with a focus on addressing challenges in transitioning SQL-based systems. The proposed migration method is essentially a framework that leverages the best aspects of traditional software engineering techniques and provides an iterative, scalable, precise and efficient solution for modern database transformations. The central piece of the approach is the integration of a fine-tuned Large Language Model to address critical issues in SQL code conversion, such as syntax mapping, resolving discrepancies between Oracle PL/SQL and PostgreSQL, and optimising database elements such as stored procedures, triggers, views, and overall database logic. Thus, the method involves a trade-off between fine-tuning and prompt engineering. Special attention is given to the fine-tuning approach, which enhances adaptability and compatibility with migration requirements across the entire database. According to the achieved results, fine-tuning plays a very important role. The study employs targeted evaluation methodologies along with computational metrics to measure the success of iterative conversion cycles. Core innovations include automated SQL feature detection, semi-supervised error analysis and integration of Subject Matter Expert feedback within a systematic migration workflow. The methodology achieves significant reductions in Syntax Error Rates, enhances feature alignment throughout migration iterations, and leverages dataset sampling to ensure continual improvement. By embedding generative AI (GAI) into the migration process, the framework facilitates precise feature mapping, semi-automated error resolution, and data-driven optimisation loops, improving workflow efficiency.
https://arxiv.org/abs/2512.13515
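To make the syntax-mapping and Syntax Error Rate ideas concrete: the paper delegates Oracle-to-PostgreSQL mapping to a fine-tuned LLM, but the shape of the conversion-and-validation loop can be sketched with a toy rule table. Everything below (the rules, function names, and validator hook) is an illustrative stand-in, not the paper's implementation:

```python
import re

# A few well-known Oracle -> PostgreSQL rewrites as a toy stand-in for the
# LLM-driven syntax mapping described in the abstract.
RULES = [
    (re.compile(r"\bNVL\s*\(", re.I), "COALESCE("),
    (re.compile(r"\bSYSDATE\b", re.I), "CURRENT_TIMESTAMP"),
    (re.compile(r"\bVARCHAR2\b", re.I), "VARCHAR"),
]

def convert(sql: str) -> str:
    """Apply each rewrite rule in order to one SQL statement."""
    for pattern, replacement in RULES:
        sql = pattern.sub(replacement, sql)
    return sql

def syntax_error_rate(statements, validate) -> float:
    """Fraction of converted statements rejected by a validator (e.g. a
    PostgreSQL parser); the iterative migration loop aims to drive this down."""
    failures = sum(0 if validate(convert(s)) else 1 for s in statements)
    return failures / len(statements)
```

In the paper's framework the `convert` step is a fine-tuned model and the validator feedback (plus Subject Matter Expert review) feeds the next iteration; the metric itself is just this rejected-over-total ratio.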
Native 4K (2160$\times$3840) video generation remains a critical challenge due to the quadratic computational explosion of full-attention as spatiotemporal resolution increases, making it difficult for models to strike a balance between efficiency and quality. This paper proposes a novel Transformer retrofit strategy termed $\textbf{T3}$ ($\textbf{T}$ransform $\textbf{T}$rained $\textbf{T}$ransformer) that, without altering the core architecture of full-attention pretrained models, significantly reduces compute requirements by optimizing their forward logic. Specifically, $\textbf{T3-Video}$ introduces a multi-scale weight-sharing window attention mechanism and, via hierarchical blocking together with an axis-preserving full-attention design, can effect an "attention pattern" transformation of a pretrained model using only modest compute and data. Results on 4K-VBench show that $\textbf{T3-Video}$ substantially outperforms existing approaches: while delivering performance improvements (+4.29$\uparrow$ VQA and +0.08$\uparrow$ VTC), it accelerates native 4K video generation by more than 10$\times$. Project page at this https URL
https://arxiv.org/abs/2512.13492
Premature semantic collapse -- the forced early commitment to a single meaning -- remains a core architectural limitation of current language models. Softmax-driven competition and greedy decoding cause models to discard valid interpretations before sufficient context is available, resulting in brittle reasoning and context failures. We introduce Non-Resolution Reasoning (NRR), a general computational framework that preserves semantic ambiguity during inference and performs resolution only when explicitly required. NRR integrates three components: (1) Multi-Vector Embeddings that maintain multiple viable interpretations per token, (2) Non-Collapsing Attention that prevents winner-take-all dynamics across layers, and (3) Contextual Identity Tracking (CIT), which assigns context-specific identities to recurring entities (e.g., distinguishing "Dr. Smith the cardiologist" from "Dr. Smith the researcher"). These mechanisms are unified by an external Resolution Operator $\rho$ that makes semantic commitment explicit, controllable, and task-dependent. Unlike standard architectures, NRR separates representation from resolution, allowing a single model to shift between creative, factual, and ambiguity-preserving reasoning without retraining. A synthetic evaluation demonstrates NRR's ability to preserve ambiguity and track context: CIT-enhanced models achieve 90.9% accuracy on out-of-distribution identity-shift tasks, compared to 9.1% for transformer baselines. NRR provides a principled alternative to premature collapse, reframing ambiguity as an explicit representational state rather than a failure mode. The question is not whether AI should resolve ambiguity, but when, how, and under whose control.
https://arxiv.org/abs/2512.13478
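The core NRR idea of separating representation from resolution can be shown in a few lines: each token keeps several candidate meaning vectors, and the resolution operator $\rho$ collapses them only when explicitly asked to. This is a minimal sketch of that control flow, not the paper's architecture:

```python
import numpy as np

def resolve(candidates: np.ndarray, context: np.ndarray, commit: bool):
    """Toy resolution operator rho: `candidates` holds K alternative meaning
    vectors (K, D) for one token. With commit=False all interpretations
    survive as a context-weighted mixture; with commit=True the operator
    performs explicit, late semantic commitment to a single vector."""
    scores = candidates @ context                  # (K,) affinity with context
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # soft preference, no collapse
    if commit:
        return candidates[np.argmax(weights)]      # explicit commitment
    return weights @ candidates                    # ambiguity-preserving mixture
```

The point of the split is that the same representation supports both behaviors: a factual task can call the operator with `commit=True`, while a creative or still-ambiguous context keeps `commit=False` and defers the decision.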
Pose-guided video generation refers to controlling the motion of subjects in generated video through a sequence of poses. It enables precise control over subject motion and has important applications in animation. However, current pose-guided video generation methods are limited to accepting only human poses as input, thus generalizing poorly to the poses of other subjects. To address this issue, we propose PoseAnything, the first universal pose-guided video generation framework capable of handling both human and non-human characters, supporting arbitrary skeletal inputs. To enhance consistency preservation during motion, we introduce a Part-aware Temporal Coherence Module, which divides the subject into different parts, establishes part correspondences, and computes cross-attention between corresponding parts across frames to achieve fine-grained part-level consistency. Additionally, we propose Subject and Camera Motion Decoupled CFG, a novel guidance strategy that, for the first time, enables independent camera movement control in pose-guided video generation, by separately injecting subject and camera motion control information into the positive and negative anchors of CFG. Furthermore, we present XPose, a high-quality public dataset containing 50,000 non-human pose-video pairs, along with an automated pipeline for annotation and filtering. Extensive experiments demonstrate that PoseAnything significantly outperforms state-of-the-art methods in both effectiveness and generalization.
https://arxiv.org/abs/2512.13465
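The decoupled-CFG idea can be written out as arithmetic on the denoiser outputs. The paper realizes it by injecting the two control signals into the positive and negative anchors of CFG; the schematic below uses the common additive form of multi-condition guidance, with hypothetical weight values:

```python
import numpy as np

def decoupled_cfg(eps_neg, eps_subject, eps_camera, w_subject=5.0, w_camera=2.0):
    """Schematic classifier-free guidance with independent weights for
    subject motion and camera motion. eps_* are denoiser predictions under
    the negative/unconditional anchor, the subject-motion condition, and
    the camera-motion condition; the weights are illustrative defaults."""
    return (eps_neg
            + w_subject * (eps_subject - eps_neg)
            + w_camera * (eps_camera - eps_neg))
```

Because each condition gets its own weight, camera movement can be strengthened, weakened, or turned off entirely (w_camera = 0 recovers standard single-condition CFG) without touching the pose guidance.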
Accurate and timely identification of plant leaf diseases is essential for resilient and sustainable agriculture, yet most deep learning approaches rely on large annotated datasets and computationally intensive models that are unsuitable for data-scarce and resource-constrained environments. To address these challenges we present a few-shot learning approach within a lightweight yet efficient framework that combines domain-adapted MobileNetV2 and MobileNetV3 models as feature extractors, along with a feature fusion technique to generate robust feature representations. For the classification task, the fused features are passed through a Bi-LSTM classifier enhanced with attention mechanisms to capture sequential dependencies and focus on the most relevant features, thereby achieving optimal classification performance even in complex, real-world environments with noisy or cluttered backgrounds. The proposed framework was evaluated across multiple experimental setups, including both laboratory-controlled and field-captured datasets. On tomato leaf diseases from the PlantVillage dataset, it consistently improved performance across 1- to 15-shot scenarios, reaching 98.23±0.33% at 15-shot, closely approaching the 99.98% SOTA benchmark achieved by a Transductive LSTM with attention, while remaining lightweight and mobile-friendly. Under real-world conditions using field images from the Dhan Shomadhan dataset, it maintained robust performance, reaching 69.28±1.49% at 15-shot and demonstrating strong resilience to complex backgrounds. Notably, it also outperformed the previous SOTA accuracy of 96.0% on six diseases from PlantVillage, achieving 99.72% with only 15-shot learning. With a compact model size of approximately 40 MB and inference complexity of approximately 1.12 GFLOPs, this work establishes a scalable, mobile-ready foundation for precise plant disease diagnostics in data-scarce regions.
https://arxiv.org/abs/2512.13428
Unexploitable example generation aims to transform personal images into their unexploitable (unlearnable) versions before they are uploaded online, thereby preventing unauthorized exploitation of online personal images. Recently, this task has garnered significant research attention due to its critical relevance to personal data privacy. Yet, despite recent progress, existing methods for this task can still suffer from limited practical applicability, as they can fail to generate examples that are broadly unexploitable across different real-world computer vision tasks. To deal with this problem, in this work, we propose a novel Meta Cross-Task Unexploitable Example Generation (MCT-UEG) framework. At the core of our framework, to optimize the unexploitable example generator for effectively producing broadly unexploitable examples, we design a flat-minima-oriented meta training and testing scheme. Extensive experiments show the efficacy of our framework.
https://arxiv.org/abs/2512.13416
Automatic polyp segmentation is crucial for improving the clinical identification of colorectal cancer (CRC). While Deep Learning (DL) techniques have been extensively researched for this problem, current methods frequently struggle with generalization, particularly in data-constrained or challenging settings. Moreover, many existing polyp segmentation methods rely on complex, task-specific architectures. To address these limitations, we present a framework that leverages the intrinsic robustness of DINO self-attention "key" features for robust segmentation. Unlike traditional methods that extract tokens from the deepest layers of the Vision Transformer (ViT), our approach leverages the key features of the self-attention module with a simple convolutional decoder to predict polyp masks, resulting in enhanced performance and better generalizability. We validate our approach using a multi-center dataset under two rigorous protocols: Domain Generalization (DG) and Extreme Single Domain Generalization (ESDG). Our results, supported by a comprehensive statistical analysis, demonstrate that this pipeline achieves state-of-the-art (SOTA) performance, significantly enhancing generalization, particularly in data-scarce and challenging scenarios. While avoiding a polyp-specific architecture, we surpass well-established models like nnU-Net and UM-Net. Additionally, we provide a systematic benchmark of the DINO framework's evolution, quantifying the specific impact of architectural advancements on downstream polyp segmentation performance.
https://arxiv.org/abs/2512.13376
LLMs achieve remarkable multi-step reasoning capabilities, yet effectively transferring these skills via post-training distillation remains challenging. Existing data selection methods, ranging from manual curation to heuristics based on length, entropy, or overall loss, fail to capture the causal importance of individual reasoning steps, limiting distillation efficiency. To address this, we propose Attention Influence for Reasoning (AIR), a principled, unsupervised and training-free framework that leverages mechanistic insights of the retrieval head to select high-value post-training data. AIR first identifies reasoning-critical attention heads of an off-the-shelf model, then constructs a weakened reference model with disabled head influence, and finally quantifies the resulting loss divergence as the Attention Influence Score. This score enables fine-grained assessment at both the step and sample levels, supporting step-level weighted fine-tuning and global sample selection. Experiments across multiple reasoning benchmarks show that AIR consistently improves reasoning accuracy, surpassing heuristic baselines and effectively isolating the most critical steps and samples. Our work establishes a mechanism-driven, data-efficient approach for reasoning distillation in LLMs.
https://arxiv.org/abs/2512.13279
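The Attention Influence Score reduces to a loss comparison between two forward passes: the off-the-shelf model and a weakened copy with its reasoning-critical heads disabled. Given those two per-step losses, scoring and selection are just the difference and a sort, sketched below (the head-ablation step itself is model-specific and omitted):

```python
import numpy as np

def attention_influence_scores(loss_full: np.ndarray,
                               loss_weakened: np.ndarray) -> np.ndarray:
    """Per-step Attention Influence Score: how much each reasoning step's
    loss grows when the reasoning-critical ('retrieval') heads are disabled.
    Steps that degrade most depend most on those heads."""
    return loss_weakened - loss_full

def select_top_steps(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k most influential reasoning steps, e.g. for
    step-level weighted fine-tuning or global sample selection."""
    return np.argsort(scores)[::-1][:k]
```

Since both passes are inference-only, this matches the abstract's "unsupervised and training-free" framing: no gradients or labels are needed to rank steps and samples.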
Collaborative perception has garnered significant attention as a crucial technology to overcome the perceptual limitations of single-agent systems. Many state-of-the-art (SOTA) methods have achieved communication efficiency and high performance via intermediate fusion. However, they share a critical vulnerability: their performance degrades under adverse communication conditions due to the misalignment induced by data transmission, which severely hampers their practical deployment. To bridge this gap, we re-examine different fusion paradigms, and discover that the strengths of intermediate and late fusion are not a trade-off, but a complementary pairing. Based on this key insight, we propose CoRA, a novel collaborative robust architecture with a hybrid approach to decouple performance from robustness with low communication. It is composed of two components: a feature-level fusion branch and an object-level correction branch. Its first branch selects critical features and fuses them efficiently to ensure both performance and scalability. The second branch leverages semantic relevance to correct spatial displacements, guaranteeing resilience against pose errors. Experiments demonstrate the superiority of CoRA. Under extreme scenarios, CoRA improves upon its baseline performance by approximately 19% in AP@0.7 with more than 5x less communication volume, which makes it a promising solution for robust collaborative perception.
https://arxiv.org/abs/2512.13191
The Automatic Identification System (AIS) enables data-driven maritime surveillance but suffers from reliability issues and irregular intervals. We address vessel destination estimation using global-scope AIS data by proposing a differentiated approach that recasts long port-to-port trajectories as a nested sequence structure. Using spatial grids, this method mitigates spatio-temporal bias while preserving detailed resolution. We introduce a novel deep learning architecture, WAY, designed to process these reformulated trajectories for long-term destination estimation days to weeks in advance. WAY comprises a trajectory representation layer and Channel-Aggregative Sequential Processing (CASP) blocks. The representation layer generates multi-channel vector sequences from kinematic and non-kinematic features. CASP blocks utilize multi-headed channel- and self-attention for aggregation and sequential information delivery. Additionally, we propose a task-specialized Gradient Dropout (GD) technique to enable many-to-many training on single labels, preventing biased feedback surges by stochastically blocking gradient flow based on sample length. Experiments on 5-year AIS data demonstrate WAY's superiority over conventional spatial grid-based approaches regardless of trajectory progression. Results further confirm that adopting GD leads to performance gains. Finally, we explore WAY's potential for real-world application through multitask learning for ETA estimation.
https://arxiv.org/abs/2512.13190
Large language models (LLMs) have demonstrated strong performance on a variety of natural language processing (NLP) tasks. However, they often struggle with long text sequences due to the ``lost in the middle'' phenomenon. This issue has been shown to arise from a U-shaped attention bias, where attention is disproportionately focused on the beginning and end of a text, leaving the middle section underrepresented. While previous studies have attributed this bias to position encoding, our work is the first to identify an additional factor: initial saliency. That is, in the attention computation for each token, tokens whose attention weights are high relative to the initial token tend to receive more attention when predicting the next token. We further find that exploiting this property by scaling the attention weights between the initial token and the others improves the model's ability to process long contexts, achieving a maximum improvement of 3.6\% on the MDQA dataset. Moreover, combining this approach with existing methods that reduce position-encoding bias further enhances performance, achieving a maximum improvement of 3.4\% on KV-Retrieval tasks.
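One plausible reading of the proposed intervention can be sketched as post-softmax rescaling: shrink the weight on the initial token and renormalize, so the freed-up attention mass is redistributed to the remaining (including middle) tokens. This is a minimal sketch under assumptions; the paper's exact scaling point (pre- vs post-softmax) and the value of `alpha` are not specified here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def rescale_initial_attention(weights, alpha):
    """Scale the attention weight on the initial (position-0) token
    by `alpha` < 1 and renormalize. The mass removed from the initial
    token is redistributed proportionally to all other tokens,
    counteracting the initial-saliency bias described in the paper."""
    scaled = [weights[0] * alpha] + weights[1:]
    z = sum(scaled)
    return [w / z for w in scaled]
```

With `alpha < 1`, every non-initial token's weight strictly increases, which is the direction needed to lift attention on the middle of a long context.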
https://arxiv.org/abs/2512.13109
Forest pests threaten ecosystem stability, requiring efficient monitoring. To overcome the limitations of traditional methods in large-scale, fine-grained detection, this study focuses on accurately identifying infected trees and analyzing infestation patterns. We propose FID-Net, a deep learning model that detects pest-affected trees from UAV visible-light imagery and enables infestation analysis via three spatial metrics. Based on YOLOv8n, FID-Net introduces a lightweight Feature Enhancement Module (FEM) to extract disease-sensitive cues, an Adaptive Multi-scale Feature Fusion Module (AMFM) to align and fuse dual-branch features (RGB and FEM-enhanced), and an Efficient Channel Attention (ECA) mechanism to enhance discriminative information efficiently. From detection results, we construct a pest situation analysis framework using: (1) Kernel Density Estimation to locate infection hotspots; (2) neighborhood evaluation to assess healthy trees' infection risk; (3) DBSCAN clustering to identify high-density healthy clusters as priority protection zones. Experiments on UAV imagery from 32 forest plots in eastern Tianshan, China, show that FID-Net achieves 86.10% precision, 75.44% recall, 82.29% mAP@0.5, and 64.30% mAP@0.5:0.95, outperforming mainstream YOLO models. Analysis confirms infected trees exhibit clear clustering, supporting targeted forest protection. FID-Net enables accurate tree health discrimination and, combined with spatial metrics, provides reliable data for intelligent pest monitoring, early warning, and precise management.
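Step (1) of the analysis framework, locating infection hotspots, can be sketched with a plain Gaussian Kernel Density Estimate over the 2-D coordinates of detected infected trees: hotspots are where this density peaks. The bandwidth value below is an illustrative choice, not the paper's setting.

```python
import math

def kde_density(points, query, bandwidth=20.0):
    """Gaussian Kernel Density Estimate at `query` (x, y), computed
    over the coordinates of detected infected trees. Higher values
    indicate infection hotspots; scanning a grid of query points and
    taking the maxima yields candidate hotspot centres."""
    h2 = bandwidth ** 2
    total = 0.0
    for x, y in points:
        d2 = (x - query[0]) ** 2 + (y - query[1]) ** 2
        total += math.exp(-d2 / (2.0 * h2))
    # Normalize so the function integrates to 1 over the plane.
    return total / (len(points) * 2.0 * math.pi * h2)
```

Steps (2) and (3) build on the same coordinates: neighborhood evaluation scores each healthy tree by its proximity to infected ones, and a density-based clustering such as DBSCAN groups the remaining dense healthy stands into priority protection zones.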
https://arxiv.org/abs/2512.13104
The canonical $O(N^2)$ Transformer remains the empirical performance frontier in sequence modeling, and its training can be further optimized by addressing geometric inefficiency. We propose an optimization framework that leverages an asymmetric projection to decompose the backward-pass gradients into parallel spans and orthogonal violations, while keeping the canonical forward-pass $QKV$ structure intact. Through consistent experimental validation across various decomposition and projection setups, we provide strong empirical evidence that the standard attention gradient is suboptimal. We demonstrate that selectively scaling these components, focusing primarily on $0^{th}$-order bidirectional parallel spans, yields the most effective learning signal. On the small WikiText-2 dataset, and with only a crude configuration, this method achieved a $0.56\%$ reduction in validation loss, confirming the framework's fundamental validity and suggesting significant potential gains on larger datasets and deeper training regimes.
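The core operation, splitting a gradient into a component inside a span and an orthogonal violation, then rescaling each, can be sketched for a single projection direction. This is a one-direction toy; the paper's asymmetric projection onto spans derived from the $QKV$ matrices is richer, and the scale factors below are placeholders, not the paper's values.

```python
def decompose_gradient(g, v):
    """Split gradient vector `g` into its component parallel to
    direction `v` (the 'parallel span') and the orthogonal remainder
    (the 'violation'), via standard vector projection."""
    dot = sum(a * b for a, b in zip(g, v))
    vv = sum(a * a for a in v)
    par = [dot / vv * a for a in v]           # projection onto span(v)
    orth = [a - b for a, b in zip(g, par)]    # orthogonal violation
    return par, orth

def rescaled_gradient(g, v, s_par=1.1, s_orth=0.9):
    """Selectively scale the two components before the parameter
    update, emphasizing the parallel span. The factors s_par and
    s_orth are illustrative placeholders."""
    par, orth = decompose_gradient(g, v)
    return [s_par * p + s_orth * o for p, o in zip(par, orth)]
```

With `s_par = s_orth = 1` this reduces exactly to the standard gradient, so the standard update is one point in the family the framework searches over.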
https://arxiv.org/abs/2512.13033