Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only direction on the unit hypersphere via Riemannian SGD. We cast direction learning as MAP with a von Mises-Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and TI-variants while maintaining subject similarity. Crucially, DTI's hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability that is absent in standard TI. Our findings suggest that direction-only optimization is a robust and scalable path for prompt-faithful personalization.
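As a rough illustration of the two mechanics named above, the sketch below (ours, not the authors' code) shows a Riemannian SGD step restricted to the unit hypersphere and spherical interpolation (slerp) between learned directions; the embedding dimension, the fixed magnitude, and the toy cosine objective are placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def riemannian_sgd_step(v, grad, lr=0.01):
    """One Riemannian SGD step on the unit hypersphere: project the Euclidean
    gradient onto the tangent space at v, step, then retract by normalization."""
    tangent_grad = grad - (grad @ v) * v        # remove the radial component
    v_new = v - lr * tangent_grad               # step along the tangent direction
    return F.normalize(v_new, dim=0)            # retraction back to unit norm

def slerp(u, v, t):
    """Spherical linear interpolation between unit vectors u and v."""
    omega = torch.acos(torch.clamp(u @ v, -1.0, 1.0))
    so = torch.sin(omega)
    if so < 1e-6:                               # nearly parallel: fall back to lerp
        return F.normalize((1 - t) * u + t * v, dim=0)
    return (torch.sin((1 - t) * omega) / so) * u + (torch.sin(t * omega) / so) * v

# Toy usage: the learned token embedding is r * v, with r fixed in-distribution
# and only the direction v optimized (the loss is a stand-in for the diffusion loss).
dim, r = 768, 27.0                              # hypothetical embedding dim / magnitude
v = F.normalize(torch.randn(dim), dim=0)
target = F.normalize(torch.randn(dim), dim=0)
for _ in range(100):
    v.requires_grad_(True)
    loss = 1.0 - (r * v) @ (r * target) / r**2  # placeholder objective on the direction
    grad, = torch.autograd.grad(loss, v)
    v = riemannian_sgd_step(v.detach(), grad, lr=0.1)
print(float(v @ target))                        # -> close to 1.0

other = F.normalize(torch.randn(dim), dim=0)
print(float(slerp(v, other, 0.5).norm()))       # interpolated concept stays on the sphere
```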
https://arxiv.org/abs/2512.13672
As the online learning landscape evolves, the need for personalization is increasingly evident. Although educational resources are burgeoning, educators face challenges selecting materials that both align with intended learning outcomes and address diverse learner needs. Large Language Models (LLMs) are attracting growing interest for their potential to create learning resources that better support personalization, but verifying coverage of intended outcomes still requires human alignment review, which is costly and limits scalability. We propose a framework that supports the cost-effective automation of evaluating alignment between educational resources and intended learning outcomes. Using human-generated materials, we benchmarked LLM-based text-embedding models and found that the most accurate model (Voyage) achieved 79% accuracy in detecting alignment. We then applied the optimal model to LLM-generated resources and, via expert evaluation, confirmed that it reliably assessed correspondence to intended outcomes (83% accuracy). Finally, in a three-group experiment with 360 learners, higher alignment scores were positively related to greater learning performance, chi-squared(2, N = 360) = 15.39, p < 0.001. These findings show that embedding-based alignment scores can facilitate scalable personalization by confirming alignment with learning outcomes, which allows teachers to focus on tailoring content to diverse learner needs.
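A minimal sketch of how an embedding-based alignment score of this kind can be computed; it uses an open-source sentence-transformers model as a stand-in for the commercial Voyage embeddings, and the 0.45 decision threshold is an arbitrary assumption rather than the paper's calibrated cutoff.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # stand-in for the Voyage API

def alignment_scores(outcomes, resources, threshold=0.45):
    """Score each (learning outcome, resource) pair by cosine similarity and
    flag pairs above a tuned threshold as 'aligned'."""
    model = SentenceTransformer("all-MiniLM-L6-v2")        # any text-embedding model
    e_out = model.encode(outcomes, normalize_embeddings=True)
    e_res = model.encode(resources, normalize_embeddings=True)
    sims = e_out @ e_res.T                                  # cosine similarity matrix
    return sims, sims >= threshold

outcomes = ["Students can explain how photosynthesis converts light into chemical energy."]
resources = ["A worked example on cellular respiration and ATP production.",
             "A short reading describing how chloroplasts capture light to build glucose."]
sims, aligned = alignment_scores(outcomes, resources)
print(np.round(sims, 3), aligned)
```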
https://arxiv.org/abs/2512.13658
Spatial transcriptomics (ST) is an emerging technology that enables researchers to investigate the molecular relationships underlying tissue morphology. However, acquiring ST data remains prohibitively expensive, and traditional fixed-grid sampling strategies lead to redundant measurements of morphologically similar or biologically uninformative regions, thus resulting in scarce data that constrain current methods. The well-established single-cell sequencing field, however, could provide rich biological data as an effective auxiliary source to mitigate this limitation. To bridge these gaps, we introduce SCR2-ST, a unified framework that leverages single-cell prior knowledge to guide efficient data acquisition and accurate expression prediction. SCR2-ST integrates a single-cell guided reinforcement learning-based (SCRL) active sampling and a hybrid regression-retrieval prediction network SCR2Net. SCRL combines single-cell foundation model embeddings with spatial density information to construct biologically grounded reward signals, enabling selective acquisition of informative tissue regions under constrained sequencing budgets. SCR2Net then leverages the actively sampled data through a hybrid architecture combining regression-based modeling with retrieval-augmented inference, where a majority cell-type filtering mechanism suppresses noisy matches and retrieved expression profiles serve as soft labels for auxiliary supervision. We evaluated SCR2-ST on three public ST datasets, demonstrating SOTA performance in both sampling efficiency and prediction accuracy, particularly under low-budget scenarios. Code is publicly available at: this https URL
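The retrieval branch of SCR2Net is described as nearest-neighbor matching with majority cell-type filtering and retrieved profiles used as soft labels; the snippet below is our guess at that logic in its simplest form, with toy data and hypothetical variable names.

```python
import numpy as np

def retrieval_soft_labels(query_emb, ref_embs, ref_expr, ref_celltypes, k=8):
    """Find the k nearest reference cells/spots, keep only those matching the
    majority cell type among the neighbors (to suppress noisy matches), and
    average their expression profiles as a soft label for auxiliary supervision."""
    sims = ref_embs @ query_emb / (
        np.linalg.norm(ref_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    top = np.argsort(-sims)[:k]
    types, counts = np.unique(ref_celltypes[top], return_counts=True)
    majority = types[np.argmax(counts)]
    kept = top[ref_celltypes[top] == majority]              # majority cell-type filtering
    return ref_expr[kept].mean(axis=0), majority

rng = np.random.default_rng(0)
ref_embs = rng.normal(size=(500, 64))                       # e.g. single-cell FM embeddings
ref_expr = rng.poisson(2.0, size=(500, 200)).astype(float)  # toy expression profiles
ref_types = rng.choice(["tumor", "immune", "stroma"], size=500)
soft_label, ctype = retrieval_soft_labels(rng.normal(size=64), ref_embs, ref_expr, ref_types)
print(ctype, soft_label.shape)                              # e.g. 'immune' (200,)
```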
https://arxiv.org/abs/2512.13635
Human-centric anomaly detection (AD) has been primarily studied to specify anomalous behaviors in a single person. However, as humans by nature tend to act in a collaborative manner, behavioral anomalies can also arise from human-human interactions. Detecting such anomalies using existing single-person AD models is prone to low accuracy, as these approaches are typically not designed to capture the complex and asymmetric dynamics of interactions. In this paper, we introduce a novel task, Human-Human Interaction Anomaly Detection (H2IAD), which aims to identify anomalous interactive behaviors within collaborative 3D human actions. To address H2IAD, we then propose Interaction Anomaly Detection Network (IADNet), which is formalized with a Temporal Attention Sharing Module (TASM). Specifically, in designing TASM, we share the encoded motion embeddings across both people such that collaborative motion correlations can be effectively synchronized. Moreover, we notice that in addition to temporal dynamics, human interactions are also characterized by spatial configurations between two people. We thus introduce a Distance-Based Relational Encoding Module (DREM) to better reflect social cues in H2IAD. The normalizing flow is eventually employed for anomaly scoring. Extensive experiments on human-human motion benchmarks demonstrate that IADNet outperforms existing Human-centric AD baselines in H2IAD.
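The abstract does not give DREM's exact form; the following sketch shows one plausible reading, in which pairwise inter-person joint distances are embedded per frame as a social-configuration feature. The joint count, feature size, and the MLP are assumptions.

```python
import torch
import torch.nn as nn

class DistanceRelationalEncoding(nn.Module):
    """Sketch of a distance-based relational encoding: per frame, compute pairwise
    Euclidean distances between the joints of person A and person B and embed
    them as a feature describing the spatial configuration of the interaction."""
    def __init__(self, n_joints=25, d_model=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_joints * n_joints, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, pose_a, pose_b):
        # pose_a, pose_b: (batch, time, n_joints, 3) 3D joint positions
        diff = pose_a.unsqueeze(3) - pose_b.unsqueeze(2)     # (B, T, J, J, 3)
        dist = diff.norm(dim=-1)                             # (B, T, J, J)
        return self.mlp(dist.flatten(start_dim=2))           # (B, T, d_model)

drem = DistanceRelationalEncoding()
a, b = torch.randn(2, 16, 25, 3), torch.randn(2, 16, 25, 3)
print(drem(a, b).shape)   # torch.Size([2, 16, 128])
```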
https://arxiv.org/abs/2512.13560
The study presents the outcomes of research and experimental validation in the domain of automated codebase migration, with a focus on the challenges of transitioning SQL-based systems. The proposed migration method is essentially a framework that draws on the best aspects of traditional software engineering and provides an iterative, scalable, precise and efficient solution for modern database transformations. Its central piece is a fine-tuned Large Language Model that addresses critical issues in SQL code conversion, such as syntax mapping, resolving discrepancies between Oracle PL/SQL and PostgreSQL, and optimising database elements such as stored procedures, triggers, views, and overall database logic. The method therefore weighs fine-tuning against prompt engineering; special attention is given to fine-tuning, which improves adaptability and compatibility with migration requirements across the entire database, and the results show it plays a very important role. The study employs targeted evaluation methodologies along with computational metrics to measure the success of iterative conversion cycles. Core innovations include automated SQL feature detection, semi-supervised error analysis and the integration of Subject Matter Expert feedback within a systematic migration workflow. The methodology achieves significant reductions in Syntax Error Rates, improves feature alignment across migration iterations, and leverages dataset sampling to ensure continual improvement. By embedding generative AI (GAI) into the migration process, the framework enables precise feature mapping, semi-automated error resolution, and data-driven optimisation loops, improving workflow efficiency.
https://arxiv.org/abs/2512.13515
Our objective is to build a general time-aware video-text embedding model for retrieval. To that end, we propose a simple and efficient recipe, dubbed TARA (Time Aware Retrieval Adaptation), to adapt Multimodal LLMs (MLLMs) to a time-aware video-text embedding model without using any video data at all. For evaluating time-awareness in retrieval, we propose a new benchmark with temporally opposite (chiral) actions as hard negatives and curated splits for chiral and non-chiral actions. We show that TARA outperforms all existing video-text models on this chiral benchmark while also achieving strong results on standard benchmarks. Furthermore, we discover additional benefits of TARA beyond time-awareness: (i) TARA embeddings are negation-aware as shown in NegBench benchmark that evaluates negation in video retrieval, (ii) TARA achieves state of the art performance on verb and adverb understanding in videos. Overall, TARA yields a strong, versatile, time-aware video-text embedding model with state of the art zero-shot performance.
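To make the chiral evaluation concrete, here is one way such a benchmark could be scored: each video is ranked against its true caption and the caption of the temporally opposite action. The pairing format and the accuracy metric are our assumptions, not the paper's exact protocol.

```python
import numpy as np

def chiral_retrieval_accuracy(video_embs, text_embs, chiral_pairs):
    """For each video, the candidates are its true caption and the caption of its
    temporally opposite (chiral) action, e.g. "opening a door" vs "closing a door".
    A time-aware model must rank the true caption above its chiral hard negative."""
    correct = 0
    for vid_idx, pos_idx, neg_idx in chiral_pairs:
        v = video_embs[vid_idx]
        correct += float(v @ text_embs[pos_idx]) > float(v @ text_embs[neg_idx])
    return correct / len(chiral_pairs)

rng = np.random.default_rng(0)
video_embs = rng.normal(size=(4, 256))
video_embs /= np.linalg.norm(video_embs, axis=1, keepdims=True)
text_embs = rng.normal(size=(4, 256))
text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)
pairs = [(0, 0, 1), (1, 1, 0), (2, 2, 3), (3, 3, 2)]   # (video, true caption, chiral negative)
print(chiral_retrieval_accuracy(video_embs, text_embs, pairs))
```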
https://arxiv.org/abs/2512.13511
Premature semantic collapse -- the forced early commitment to a single meaning -- remains a core architectural limitation of current language models. Softmax-driven competition and greedy decoding cause models to discard valid interpretations before sufficient context is available, resulting in brittle reasoning and context failures. We introduce Non-Resolution Reasoning (NRR), a general computational framework that preserves semantic ambiguity during inference and performs resolution only when explicitly required. NRR integrates three components: (1) Multi-Vector Embeddings that maintain multiple viable interpretations per token, (2) Non-Collapsing Attention that prevents winner-take-all dynamics across layers, and (3) Contextual Identity Tracking (CIT), which assigns context-specific identities to recurring entities (e.g., distinguishing "Dr. Smith the cardiologist" from "Dr. Smith the researcher"). These mechanisms are unified by an external Resolution Operator $\rho$ that makes semantic commitment explicit, controllable, and task-dependent. Unlike standard architectures, NRR separates representation from resolution, allowing a single model to shift between creative, factual, and ambiguity-preserving reasoning without retraining. A synthetic evaluation demonstrates NRR's ability to preserve ambiguity and track context: CIT-enhanced models achieve 90.9% accuracy on out-of-distribution identity-shift tasks, compared to 9.1% for transformer baselines. NRR provides a principled alternative to premature collapse, reframing ambiguity as an explicit representational state rather than a failure mode. The question is not whether AI should resolve ambiguity, but when, how, and under whose control.
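A toy rendering of two of the three components: multi-vector (per-sense) embeddings and an explicit resolution operator $\rho$ that commits to a meaning only when asked. The shapes, the softmax scoring, and the hard/soft switch are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

class MultiVectorEmbedding(torch.nn.Module):
    """Sketch: each token keeps K candidate sense vectors instead of one."""
    def __init__(self, vocab_size, k_senses=4, dim=64):
        super().__init__()
        self.senses = torch.nn.Parameter(torch.randn(vocab_size, k_senses, dim) * 0.02)

    def forward(self, token_ids):
        return self.senses[token_ids]                  # (seq, K, dim): ambiguity preserved

def resolution_operator(sense_vectors, context, temperature=1.0, hard=False):
    """rho: commit to a meaning only when required. Scores each sense against a
    context vector; 'hard' picks one sense, otherwise keeps a soft mixture."""
    scores = sense_vectors @ context / temperature     # (seq, K)
    weights = F.softmax(scores, dim=-1)
    if hard:                                           # explicit, task-dependent commitment
        idx = weights.argmax(dim=-1)
        return sense_vectors[torch.arange(len(idx)), idx]
    return (weights.unsqueeze(-1) * sense_vectors).sum(dim=1)  # ambiguity-preserving

emb = MultiVectorEmbedding(vocab_size=100)
senses = emb(torch.tensor([5, 17, 42]))                # (3, 4, 64)
context = torch.randn(64)
print(resolution_operator(senses, context).shape)             # soft: torch.Size([3, 64])
print(resolution_operator(senses, context, hard=True).shape)  # hard: torch.Size([3, 64])
```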
https://arxiv.org/abs/2512.13478
Continuous sign language recognition (CSLR) requires precise spatio-temporal modeling to accurately recognize sequences of gestures in videos. Existing frameworks often rely on CNN-based spatial backbones combined with temporal convolution or recurrent modules. These techniques fail to capture fine-grained hand and facial cues and to model long-range temporal dependencies. To address these limitations, we propose the Unified Spatio-Temporal Modeling (USTM) framework, a spatio-temporal encoder that effectively models complex patterns by combining a Swin Transformer backbone with a lightweight temporal adapter with positional embeddings (TAPE). Our framework captures fine-grained spatial features alongside short- and long-term temporal context, enabling robust sign language recognition from RGB videos without relying on multi-stream inputs or auxiliary modalities. Extensive experiments on benchmark datasets including PHOENIX14, PHOENIX14T, and CSL-Daily demonstrate that USTM achieves state-of-the-art performance against RGB-based as well as multi-modal CSLR approaches, while maintaining competitive performance against multi-stream approaches. These results highlight the strength and efficacy of the USTM framework for CSLR. The code is available at this https URL
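The abstract only names TAPE as a lightweight temporal adapter with positional embeddings; the sketch below shows a generic adapter of that flavor (bottleneck projection, temporal self-attention, residual add) operating on per-frame backbone features, with all sizes chosen arbitrarily rather than taken from the paper.

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Sketch of a lightweight temporal adapter with positional embeddings:
    per-frame features from a spatial backbone (e.g. Swin) are mixed across time
    by a small self-attention block and added back residually."""
    def __init__(self, dim=768, n_heads=8, max_frames=256, bottleneck=192):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, max_frames, dim))   # temporal positions
        self.down = nn.Linear(dim, bottleneck)                     # keep the adapter light
        self.attn = nn.MultiheadAttention(bottleneck, n_heads, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, time, dim) pooled per-frame features
        t = frame_feats.size(1)
        x = self.norm(frame_feats + self.pos[:, :t])
        h = self.down(x)
        h, _ = self.attn(h, h, h)                                  # temporal mixing
        return frame_feats + self.up(h)                            # residual update

adapter = TemporalAdapter()
clip_feats = torch.randn(2, 32, 768)   # e.g. 32 frames of Swin features
print(adapter(clip_feats).shape)       # torch.Size([2, 32, 768])
```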
https://arxiv.org/abs/2512.13415
Securing digital text is becoming increasingly relevant due to the widespread use of large language models. Individuals fear losing control over their data when it is used to train such machine learning models, and distinguishing model-generated output from text written by humans is difficult. Digital watermarking provides additional protection by embedding an invisible watermark within the data that requires protection. However, little work has been done to analyze and verify whether existing digital text watermarking methods are secure and undetectable by large language models. In this paper, we investigate the security-related intersection of watermarking and machine learning models for text data. In a controlled testbed of three experiments, ten existing Unicode text watermarking methods were implemented and analyzed across six large language models: GPT-5, GPT-4o, Teuken 7B, Llama 3.3, Claude Sonnet 4, and Gemini 2.5 Pro. Our findings indicate that the models, especially the latest reasoning models, can detect watermarked text. Nevertheless, all models fail to extract the watermark unless implementation details in the form of source code are provided. We discuss the implications for security researchers and practitioners and outline future research opportunities to address security concerns.
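For readers unfamiliar with Unicode text watermarks, here is a minimal example of the general family studied: hiding a payload in zero-width characters. It is not one of the ten methods from the paper, just an illustration of why such marks are invisible to humans yet recoverable once the implementation details are known.

```python
ZERO_WIDTH = {"0": "\u200b", "1": "\u200c"}        # zero-width space / non-joiner
INV = {v: k for k, v in ZERO_WIDTH.items()}

def embed_watermark(text: str, payload: str) -> str:
    """Hide an ASCII payload as zero-width characters after the first word."""
    bits = "".join(f"{ord(c):08b}" for c in payload)
    hidden = "".join(ZERO_WIDTH[b] for b in bits)
    first_space = text.find(" ")
    return text[:first_space] + hidden + text[first_space:] if first_space != -1 else text + hidden

def extract_watermark(text: str) -> str:
    """Recover the payload by collecting the zero-width characters in order."""
    bits = "".join(INV[ch] for ch in text if ch in INV)
    chars = [chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits) - len(bits) % 8, 8)]
    return "".join(chars)

marked = embed_watermark("The quick brown fox jumps over the lazy dog.", "id42")
print(marked == "The quick brown fox jumps over the lazy dog.")   # False, yet visually identical
print(extract_watermark(marked))                                   # id42
```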
https://arxiv.org/abs/2512.13325
Face recognition systems rely on learning highly discriminative and compact identity clusters to enable accurate retrieval. However, as with other surveillance-oriented technologies, such systems raise serious privacy concerns due to their potential for unauthorized identity tracking. While several works have explored machine unlearning as a means of privacy protection, their applicability to face retrieval - especially for modern embedding-based recognition models - remains largely unexplored. In this work, we study the problem of face identity unlearning for retrieval systems and present its inherent challenges. The goal is to make selected identities unretrievable by dispersing their embeddings on the hypersphere and preventing the formation of compact identity clusters that enable re-identification in the gallery. The primary challenge is to achieve this forgetting effect while preserving the discriminative structure of the embedding space and the retrieval performance of the model for the remaining identities. To address this, we evaluate several existing approximate class unlearning methods (e.g., Random Labeling, Gradient Ascent, Boundary Unlearning, and other recent approaches) in the context of face retrieval and propose a simple yet effective dispersion-based unlearning approach. Extensive experiments on standard benchmarks (VGGFace2, CelebA) demonstrate that our method achieves superior forgetting behavior while preserving retrieval utility.
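A sketch of what a dispersion-based unlearning objective could look like: a uniformity-style term that pushes forget-set embeddings apart on the hypersphere, plus an anchoring term for retained identities. The exact loss, weighting, and temperature are our assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def dispersion_unlearning_loss(forget_emb, retain_emb, retain_ref, temp=2.0, lam=1.0):
    """Push embeddings of identities to be forgotten apart on the hypersphere
    (so no compact cluster survives for re-identification), while anchoring the
    remaining identities to their original embeddings."""
    f = F.normalize(forget_emb, dim=-1)
    pairwise = torch.pdist(f) ** 2                       # squared distances, forget set
    disperse = torch.log(torch.exp(-temp * pairwise).mean())   # low when points spread out
    retain = F.mse_loss(F.normalize(retain_emb, dim=-1),
                        F.normalize(retain_ref, dim=-1))        # keep the rest in place
    return disperse + lam * retain

forget = torch.randn(32, 512, requires_grad=True)
retain = torch.randn(64, 512, requires_grad=True)
retain_ref = torch.randn(64, 512)
loss = dispersion_unlearning_loss(forget, retain, retain_ref)
loss.backward()
print(float(loss))
```

The log-sum-exp term mirrors the uniformity loss used in contrastive representation learning; any term that penalizes compactness of the forget cluster would serve the same illustrative purpose here.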
https://arxiv.org/abs/2512.13317
Diffusion models (DMs) have achieved remarkable success in image and video generation. However, they still struggle with (1) physical alignment and (2) out-of-distribution (OOD) instruction following. We argue that these issues stem from the models' failure to learn causal directions and to disentangle causal factors for novel recombination. We introduce the Causal Scene Graph (CSG) and the Physical Alignment Probe (PAP) dataset to enable diagnostic interventions. This analysis yields three key insights. First, DMs struggle with multi-hop reasoning for elements not explicitly determined in the prompt. Second, the prompt embedding contains disentangled representations for texture and physics. Third, visual causal structure is disproportionately established during the initial, computationally limited denoising steps. Based on these findings, we introduce LINA (Learning INterventions Adaptively), a novel framework that learns to predict prompt-specific interventions, which employs (1) targeted guidance in the prompt and visual latent spaces, and (2) a reallocated, causality-aware denoising schedule. Our approach enforces both physical alignment and OOD instruction following in image and video DMs, achieving state-of-the-art performance on challenging causal generation tasks and the Winoground dataset. Our project page is at this https URL.
https://arxiv.org/abs/2512.13290
This paper presents STARCaster, an identity-aware spatio-temporal video diffusion model that addresses both speech-driven portrait animation and free-viewpoint talking portrait synthesis, given an identity embedding or reference image, within a unified framework. Existing 2D speech-to-video diffusion models depend heavily on reference guidance, leading to limited motion diversity. At the same time, 3D-aware animation typically relies on inversion through pre-trained tri-plane generators, which often leads to imperfect reconstructions and identity drift. We rethink reference- and geometry-based paradigms in two ways. First, we deviate from strict reference conditioning at pre-training by introducing softer identity constraints. Second, we address 3D awareness implicitly within the 2D video domain by leveraging the inherent multi-view nature of video data. STARCaster adopts a compositional approach progressing from ID-aware motion modeling, to audio-visual synchronization via lip reading-based supervision, and finally to novel view animation through temporal-to-spatial adaptation. To overcome the scarcity of 4D audio-visual data, we propose a decoupled learning approach in which view consistency and temporal coherence are trained independently. A self-forcing training scheme enables the model to learn from longer temporal contexts than those generated at inference, mitigating the overly static animations common in existing autoregressive approaches. Comprehensive evaluations demonstrate that STARCaster generalizes effectively across tasks and identities, consistently surpassing prior approaches in different benchmarks.
https://arxiv.org/abs/2512.13247
Deep learning models in medical imaging are susceptible to shortcut learning, relying on confounding metadata (e.g., scanner model) that is often encoded in image embeddings. The crucial question is whether the model actively utilizes this encoded information for its final prediction. We introduce Weight Space Correlation Analysis, an interpretable methodology that quantifies feature utilization by measuring the alignment between the classification heads of a primary clinical task and auxiliary metadata tasks. We first validate our method by successfully detecting artificially induced shortcut learning. We then apply it to probe the feature utilization of an SA-SonoNet model trained for Spontaneous Preterm Birth (sPTB) prediction. Our analysis confirmed that while the embeddings contain substantial metadata, the sPTB classifier's weight vectors were highly correlated with clinically relevant factors (e.g., birth weight) but decoupled from clinically irrelevant acquisition factors (e.g. scanner). Our methodology provides a tool to verify model trustworthiness, demonstrating that, in the absence of induced bias, the clinical model selectively utilizes features related to the genuine clinical signal.
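The core measurement is simple enough to state in a few lines: cosine similarity between the weight vectors of the primary clinical head and an auxiliary metadata head fitted on the same frozen embeddings. The head shapes and the toy data below are assumptions used only to show the mechanics.

```python
import numpy as np

def head_correlation(primary_w, auxiliary_w):
    """Cosine similarity between the weight vectors of the primary clinical head
    and an auxiliary metadata head trained on frozen embeddings. High alignment
    suggests the primary classifier draws on that metadata direction."""
    p = primary_w / np.linalg.norm(primary_w, axis=1, keepdims=True)
    a = auxiliary_w / np.linalg.norm(auxiliary_w, axis=1, keepdims=True)
    return p @ a.T           # (primary classes, auxiliary classes) alignment matrix

rng = np.random.default_rng(0)
emb_dim = 256
w_sptb = rng.normal(size=(2, emb_dim))        # e.g. binary sPTB head
w_scanner = rng.normal(size=(5, emb_dim))     # e.g. 5 scanner models (metadata head)
w_scanner[0] = w_sptb[1] + 0.1 * rng.normal(size=emb_dim)   # simulate a shared feature
print(np.round(head_correlation(w_sptb, w_scanner), 2))
```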
https://arxiv.org/abs/2512.13144
High-resolution phenotyping at the level of individual leaves offers fine-grained insights into plant development and stress responses. However, the full potential of accurate leaf tracking over time remains largely unexplored due to the absence of robust tracking methods, particularly for structurally complex crops such as canola. Existing plant-specific tracking methods are typically limited to small-scale species or rely on constrained imaging conditions. In contrast, generic multi-object tracking (MOT) methods are not designed for dynamic biological scenes. Progress in developing accurate leaf tracking models has also been hindered by a lack of large-scale datasets captured under realistic conditions. In this work, we introduce CanolaTrack, a new benchmark dataset comprising 5,704 RGB images with 31,840 annotated leaf instances spanning the early growth stages of 184 canola plants. To enable accurate leaf tracking over time, we introduce LeafTrackNet, an efficient framework that combines a YOLOv10-based leaf detector with a MobileNetV3-based embedding network. During inference, leaf identities are maintained over time through an embedding-based memory association strategy. LeafTrackNet outperforms both plant-specific trackers and state-of-the-art MOT baselines, achieving a 9% HOTA improvement on CanolaTrack. Our work sets a new standard for leaf-level tracking under realistic conditions and provides CanolaTrack, the largest dataset for leaf tracking in agricultural crops, which will support future research in plant phenotyping. Our code and dataset are publicly available at this https URL.
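Below is a minimal sketch of an embedding-based memory association strategy of the kind described: each track keeps an exponential-moving-average appearance embedding and new detections are greedily matched above a similarity threshold. The threshold, the EMA update, and the greedy matching are our simplifications, not LeafTrackNet's exact procedure.

```python
import numpy as np

class EmbeddingMemoryTracker:
    """Each track keeps a running (EMA) appearance embedding; detections in a new
    frame are matched to the most similar free track above a threshold, otherwise
    they start a new leaf ID."""
    def __init__(self, sim_threshold=0.6, momentum=0.9):
        self.memory = {}                       # track_id -> unit-norm embedding
        self.sim_threshold, self.momentum = sim_threshold, momentum
        self.next_id = 0

    def update(self, det_embs):
        ids = []
        for e in det_embs:
            e = e / (np.linalg.norm(e) + 1e-8)
            best_id, best_sim = None, self.sim_threshold
            for tid, mem in self.memory.items():
                sim = float(e @ mem)
                if sim > best_sim and tid not in ids:   # one detection per track per frame
                    best_id, best_sim = tid, sim
            if best_id is None:                          # unmatched -> new leaf identity
                best_id, self.next_id = self.next_id, self.next_id + 1
                self.memory[best_id] = e
            else:                                        # EMA update of the stored embedding
                m = self.momentum * self.memory[best_id] + (1 - self.momentum) * e
                self.memory[best_id] = m / np.linalg.norm(m)
            ids.append(best_id)
        return ids

tracker = EmbeddingMemoryTracker()
frame1 = np.random.randn(3, 128)                 # e.g. MobileNetV3 embeddings of 3 leaves
print(tracker.update(frame1))                    # [0, 1, 2]
print(tracker.update(frame1 + 0.01))             # same leaves -> same IDs
```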
https://arxiv.org/abs/2512.13130
Recent advances in large language models (LLMs) have enabled automated dataset labeling with minimal human supervision. While majority voting across multiple LLMs can improve label reliability by mitigating individual model biases, it incurs high computational costs due to repeated querying. In this work, we propose a novel online framework, Cost-aware Majority Voting (CaMVo), for efficient and accurate LLM-based dataset annotation. CaMVo adaptively selects a subset of LLMs for each data instance based on contextual embeddings, balancing confidence and cost without requiring pre-training or ground-truth labels. Leveraging a LinUCB-based selection mechanism and a Bayesian estimator over confidence scores, CaMVo estimates a lower bound on labeling accuracy for each LLM and aggregates responses through weighted majority voting. Our empirical evaluation on the MMLU and IMDB Movie Review datasets demonstrates that CaMVo achieves comparable or superior accuracy to full majority voting while significantly reducing labeling costs. This establishes CaMVo as a practical and robust solution for cost-efficient annotation in dynamic labeling environments.
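A simplified sketch of the selection-plus-voting loop: a LinUCB score per LLM arm is computed from the instance's context embedding, arms are chosen greedily by score per cost under a budget, and their answers are combined by weighted majority voting. The Bayesian confidence estimator is omitted and all parameters (costs, budget, alpha) are placeholders.

```python
import numpy as np

class CostAwareVoter:
    """Simplified CaMVo-style annotator: LinUCB scores estimate each LLM's labeling
    accuracy from the context embedding; LLMs are queried greedily by score per cost
    until the budget is spent, and labels are merged by weighted majority voting."""
    def __init__(self, n_llms, dim, costs, alpha=1.0):
        self.A = [np.eye(dim) for _ in range(n_llms)]     # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_llms)]
        self.costs, self.alpha = np.asarray(costs, dtype=float), alpha

    def scores(self, x):
        out = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            out.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))  # optimism bonus
        return np.array(out)

    def select(self, x, budget):
        order = np.argsort(-self.scores(x) / self.costs)  # best estimated accuracy per cost
        chosen, spent = [], 0.0
        for i in order:
            if spent + self.costs[i] <= budget:
                chosen.append(i)
                spent += self.costs[i]
        return chosen

    def vote(self, labels, weights):
        tally = {}
        for lab, w in zip(labels, weights):
            tally[lab] = tally.get(lab, 0.0) + w
        return max(tally, key=tally.get)

    def update(self, i, x, correct):
        self.A[i] += np.outer(x, x)
        self.b[i] += float(correct) * x

voter = CostAwareVoter(n_llms=3, dim=8, costs=[1.0, 2.0, 5.0])
x = np.random.randn(8)
chosen = voter.select(x, budget=3.0)
labels = ["A", "B"][:len(chosen)]                        # stand-in LLM answers
print(chosen, voter.vote(labels, voter.scores(x)[chosen]))
voter.update(chosen[0], x, correct=True)                 # feedback for the chosen arm
```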
https://arxiv.org/abs/2505.15101
Visual Place Recognition (VPR) has advanced significantly with high-capacity foundation models like DINOv2, achieving remarkable performance. Nonetheless, their substantial computational cost makes deployment on resource-constrained devices impractical. In this paper, we introduce an efficient asymmetric VPR framework that combines a high-capacity gallery model for offline feature extraction with a lightweight query network for online processing. A key challenge in this setting is ensuring compatibility between these heterogeneous networks, which conventional approaches address through computationally expensive k-NN-based compatible training. To overcome this, we propose a geographical memory bank that structures gallery features using geolocation metadata inherent in VPR databases, eliminating the need for exhaustive k-NN computations. Additionally, we introduce an implicit embedding augmentation technique that enables the query network to model feature variations despite its limited capacity. Extensive experiments demonstrate that our method not only significantly reduces computational costs but also outperforms existing asymmetric retrieval techniques, opening a new direction for VPR in resource-limited environments. The code is available at this https URL
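A geographical memory bank can be approximated very simply from the geolocation metadata that VPR databases already carry; the sketch below buckets gallery features into coarse lat/lon grid cells and stores one centroid per cell. The cell size and the mean-pooling choice are assumptions, not the paper's construction.

```python
import numpy as np

def build_geo_memory_bank(features, latlon, cell_deg=0.01):
    """Bucket gallery features by a coarse lat/lon grid cell and store the mean
    gallery feature per cell. A query network can then be trained for
    compatibility against these centroids instead of running exhaustive k-NN
    over the whole gallery."""
    cells = np.floor(latlon / cell_deg).astype(int)            # (N, 2) grid indices
    bank = {}
    for key in {tuple(c) for c in cells}:
        mask = np.all(cells == key, axis=1)
        bank[key] = features[mask].mean(axis=0)
    return bank

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 128)).astype(np.float32)        # gallery-model features
latlon = np.stack([48.85 + 0.05 * rng.random(1000),            # toy coordinates
                   2.29 + 0.05 * rng.random(1000)], axis=1)
bank = build_geo_memory_bank(feats, latlon)
print(len(bank), next(iter(bank.values())).shape)              # number of cells, (128,)
```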
https://arxiv.org/abs/2512.13055
Recent unified models for joint understanding and generation have significantly advanced visual generation capabilities. However, their focus on conventional tasks like text-to-video generation has left the temporal reasoning potential of unified models largely underexplored. To address this gap, we introduce Next Scene Prediction (NSP), a new task that pushes unified video models toward temporal and causal reasoning. Unlike text-to-video generation, NSP requires predicting plausible futures from preceding context, demanding deeper understanding and reasoning. To tackle this task, we propose a unified framework combining Qwen-VL for comprehension and LTX for synthesis, bridged by a latent query embedding and a connector module. This model is trained in three stages on our newly curated, large-scale NSP dataset: text-to-video pre-training, supervised fine-tuning, and reinforcement learning (via GRPO) with our proposed causal consistency reward. Experiments demonstrate our model achieves state-of-the-art performance on our benchmark, advancing the capability of generalist multimodal systems to anticipate what happens next.
https://arxiv.org/abs/2512.13015
Accurate medical image analysis can greatly assist clinical diagnosis, but its effectiveness relies on high-quality expert annotations. Obtaining pixel-level labels for medical images, particularly fundus images, remains costly and time-consuming. Meanwhile, despite the success of deep learning in medical imaging, the lack of interpretability limits its clinical adoption. To address these challenges, we propose TWLR, a two-stage framework for interpretable diabetic retinopathy (DR) assessment. In the first stage, a vision-language model integrates domain-specific ophthalmological knowledge into text embeddings to jointly perform DR grading and lesion classification, effectively linking semantic medical concepts with visual features. The second stage introduces an iterative severity regression framework based on weakly-supervised semantic segmentation. Lesion saliency maps generated through iterative refinement direct a progressive inpainting mechanism that systematically eliminates pathological features, effectively downgrading disease severity toward healthier fundus appearances. Critically, this severity regression approach achieves dual benefits: accurate lesion localization without pixel-level supervision and an interpretable visualization of disease-to-healthy transformations. Experimental results on FGADR, DDR, and a private dataset demonstrate that TWLR achieves competitive performance in both DR classification and lesion segmentation, offering a more explainable and annotation-efficient solution for automated retinal image analysis.
https://arxiv.org/abs/2512.13008
The exponential growth of video content has created an urgent need for efficient multimodal moment retrieval systems. However, existing approaches face three critical challenges: (1) fixed-weight fusion strategies fail under cross-modal noise and ambiguous queries, (2) temporal modeling struggles to capture coherent event sequences while penalizing unrealistic gaps, and (3) systems require manual modality selection, reducing usability. We propose a unified multimodal moment retrieval system with three key innovations. First, a cascaded dual-embedding pipeline combines BEIT-3 and SigLIP for broad retrieval, refined by BLIP-2-based reranking to balance recall and precision. Second, a temporal-aware scoring mechanism applies exponential decay penalties to large temporal gaps via beam search, constructing coherent event sequences rather than isolated frames. Third, agent-guided query decomposition (GPT-4o) automatically interprets ambiguous queries, decomposes them into modality-specific sub-queries (visual/OCR/ASR), and performs adaptive score fusion, eliminating manual modality selection. Qualitative analysis demonstrates that our system effectively handles ambiguous queries, retrieves temporally coherent sequences, and dynamically adapts fusion strategies, advancing interactive moment search capabilities.
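The temporal-aware scoring idea can be illustrated with a small beam search: per-event candidate frames are chained in temporal order and each added frame's score is discounted by an exponential decay in the gap to the previous one. The decay constant and the exact scoring form are assumptions, not the system's implementation.

```python
import math

def temporal_beam_search(frame_scores, beam_width=3, decay=0.1):
    """Build one-frame-per-event sequences left to right with beam search, adding
    each frame's retrieval score weighted by exp(-decay * gap), which favors
    temporally coherent (small-gap) sequences over isolated high scorers."""
    # frame_scores: list over events; each is a list of (timestamp, score) candidates.
    beams = [([], 0.0)]                                   # (chosen timestamps, total score)
    for candidates in frame_scores:
        expanded = []
        for seq, total in beams:
            for t, s in candidates:
                if seq and t <= seq[-1]:
                    continue                              # enforce temporal order
                gap = t - seq[-1] if seq else 0.0
                bonus = math.exp(-decay * gap) if seq else 1.0
                expanded.append((seq + [t], total + s * bonus))
        beams = sorted(expanded, key=lambda b: -b[1])[:beam_width]
    return beams[0] if beams else ([], 0.0)

events = [
    [(10.0, 0.9), (40.0, 0.8)],        # candidates for "person opens the door"
    [(12.0, 0.7), (90.0, 0.95)],       # candidates for "person sits down"
]
print(temporal_beam_search(events))    # prefers (10.0, 12.0) despite 90.0's higher raw score
```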
https://arxiv.org/abs/2512.12935
The rapid expansion of video content across online platforms has accelerated the need for retrieval systems capable of understanding not only isolated visual moments but also the temporal structure of complex events. Existing approaches often fall short in modeling temporal dependencies across multiple events and in handling queries that reference unseen or rare visual concepts. To address these challenges, we introduce MADTempo, a video retrieval framework developed by our team, AIO_Trinh, that unifies temporal search with web-scale visual grounding. Our temporal search mechanism captures event-level continuity by aggregating similarity scores across sequential video segments, enabling coherent retrieval of multi-event queries. Complementarily, a Google Image Search-based fallback module expands query representations with external web imagery, effectively bridging gaps in pretrained visual embeddings and improving robustness against out-of-distribution (OOD) queries. Together, these components advance the temporal reasoning and generalization capabilities of modern video retrieval systems, paving the way for more semantically aware and adaptive retrieval across large-scale video corpora.
https://arxiv.org/abs/2512.12929