Plant disease classification via imaging is a critical task in precision agriculture. We propose XMACNet, a novel lightweight Convolutional Neural Network (CNN) that integrates self-attention and multimodal fusion of visible imagery and vegetation indices for chili disease detection. XMACNet uses an EfficientNetV2S backbone enhanced by a self-attention module and a fusion branch that processes both RGB images and computed vegetation index maps (NDVI, NPCI, MCARI). We curated a new dataset of 12,000 chili leaf images across six classes (five disease types plus healthy), augmented synthetically via StyleGAN to mitigate data scarcity. Trained on this dataset, XMACNet achieves high accuracy, F1-score, and AUC, outperforming baseline models such as ResNet-50, MobileNetV2, and a Swin Transformer variant. Crucially, XMACNet is explainable: we use Grad-CAM++ and SHAP to visualize and quantify the model's focus on disease features. The model's compact size and fast inference make it suitable for edge deployment in real-world farming scenarios.
https://arxiv.org/abs/2603.06750
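The vegetation index branch above computes index maps from band reflectances. As an illustration of how one such map (NDVI) is derived, here is a minimal NumPy sketch; the `ndvi_map` helper and the toy reflectance values are ours, not from the paper, and real chili imagery would need a sensor with a near-infrared band.

```python
import numpy as np

def ndvi_map(nir: np.ndarray, red: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).

    Healthy vegetation reflects strongly in NIR and absorbs red light,
    so values near +1 indicate dense healthy foliage.
    """
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return (nir - red) / (nir + red + eps)

# Toy 2x2 reflectance maps: top row vegetated, bottom row bare soil.
nir = np.array([[0.8, 0.7], [0.3, 0.3]])
red = np.array([[0.1, 0.1], [0.25, 0.3]])
print(np.round(ndvi_map(nir, red), 3))
```

The resulting map can simply be stacked with the RGB channels as an extra input plane for the fusion branch.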
The increasing demand for sustainable textile recycling requires robust automation solutions capable of handling deformable garments and detecting foreign objects in cluttered environments. This work presents a digital-twin-driven robotic sorting system that integrates grasp prediction, multimodal perception, and semantic reasoning for real-world textile classification. A dual-arm robotic cell equipped with RGBD sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them using state-of-the-art Vision Language Models (VLMs). We benchmark nine VLMs from five model families on a dataset of 223 inspection scenarios comprising shirts, socks, trousers, underwear, foreign objects (including garments outside of the aforementioned classes), and empty scenes. The evaluation assesses per-class accuracy, hallucination behavior, and computational performance under practical hardware constraints. Results show that the Qwen model family achieves the highest overall accuracy (up to 87.9%), with strong foreign-object detection performance, while lighter models such as Gemma3 offer competitive speed-accuracy trade-offs for edge deployment. A digital twin combined with MoveIt enables collision-aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment for improved manipulation reliability. The presented system demonstrates the feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.
https://arxiv.org/abs/2603.05230
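The benchmark above reports per-class accuracy and hallucination behavior (e.g., a model claiming to see a garment in an empty scene). A minimal sketch of how such metrics could be tallied; the class list and helper names are illustrative and not the paper's evaluation code.

```python
from collections import defaultdict

CLASSES = ["shirt", "sock", "trousers", "underwear", "foreign_object", "empty"]

def per_class_accuracy(records):
    """records: list of (true_label, predicted_label) pairs."""
    correct, total = defaultdict(int), defaultdict(int)
    for truth, pred in records:
        total[truth] += 1
        if pred == truth:
            correct[truth] += 1
    return {c: correct[c] / total[c] for c in total}

def hallucination_rate(records):
    """Fraction of empty scenes where the model claimed to see an object."""
    empties = [(t, p) for t, p in records if t == "empty"]
    if not empties:
        return 0.0
    return sum(1 for t, p in empties if p != "empty") / len(empties)

records = [("shirt", "shirt"), ("sock", "shirt"), ("empty", "empty"), ("empty", "sock")]
print(per_class_accuracy(records))
print(hallucination_rate(records))
```

Per-class breakdowns matter here because a model can score well overall while systematically hallucinating garments in empty or foreign-object scenes.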
OmniLottie is a versatile framework that generates high-quality vector animations from multimodal instructions. For flexible control of motion and visual content, we focus on Lottie, a lightweight JSON format that represents both shapes and animation behaviors. However, raw Lottie JSON files contain extensive invariant structural metadata and formatting tokens, posing significant challenges for learning vector animation generation. We therefore introduce a carefully designed Lottie tokenizer that transforms JSON files into structured sequences of commands and parameters representing shapes, animation functions, and control parameters. This tokenizer enables us to build OmniLottie upon pretrained vision language models that follow multimodal interleaved instructions and generate high-quality vector animations. To further advance research in vector animation generation, we curate MMLottie-2M, a large-scale dataset of professionally designed vector animations paired with textual and visual annotations. Extensive experiments validate that OmniLottie produces vivid and semantically aligned vector animations that adhere closely to multimodal human instructions.
https://arxiv.org/abs/2603.02138
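The tokenizer described above strips invariant metadata and flattens the remaining JSON into command/parameter tokens. The paper's actual token vocabulary is not specified in this abstract; the sketch below only illustrates the general idea on a toy Lottie-like file, with `SKIP_KEYS` chosen purely for illustration.

```python
import json

# Keys treated as invariant boilerplate and dropped (illustrative choice):
# version, metadata, layer names, match names.
SKIP_KEYS = {"v", "meta", "nm", "mn"}

def tokenize(node, tokens):
    """Flatten a Lottie-like JSON tree into a command/parameter token list."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in SKIP_KEYS:
                continue
            tokens.append(f"<{key}>")   # structural key becomes a command token
            tokenize(value, tokens)
    elif isinstance(node, list):
        for item in node:
            tokenize(item, tokens)
    else:
        tokens.append(str(node))        # leaf values become parameter tokens
    return tokens

lottie = json.loads('{"v": "5.7.1", "fr": 30, "layers": [{"ty": 4, "ks": {"p": [10, 20]}}]}')
print(tokenize(lottie, []))
```

A sequence like this is far shorter than the raw JSON text, which is what makes it tractable for a pretrained language model to predict.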
SkyReels V4 is a unified multimodal video foundation model for joint video-audio generation, inpainting, and editing. The model adopts a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture, in which one branch synthesizes video and the other generates temporally aligned audio, while both share a powerful text encoder based on a Multimodal Large Language Model (MMLM). SkyReels V4 accepts rich multimodal instructions, including text, images, video clips, masks, and audio references. By combining the MMLM's multimodal instruction-following capability with in-context learning in the video-branch MMDiT, the model can inject fine-grained visual guidance under complex conditioning, while the audio-branch MMDiT simultaneously leverages audio references to guide sound generation. On the video side, we adopt a channel-concatenation formulation that unifies a wide range of inpainting-style tasks, such as image-to-video, video extension, and video editing, under a single interface, and naturally extends to vision-referenced inpainting and editing via multimodal prompts. SkyReels V4 supports up to 1080p resolution, 32 FPS, and 15-second duration, enabling high-fidelity, multi-shot, cinema-level video generation with synchronized audio. To make such high-resolution, long-duration generation computationally feasible, we introduce an efficiency strategy: joint generation of low-resolution full sequences and high-resolution keyframes, followed by dedicated super-resolution and frame-interpolation models. To our knowledge, SkyReels V4 is the first video foundation model that simultaneously supports multimodal input, joint video-audio generation, and a unified treatment of generation, inpainting, and editing, while maintaining strong efficiency and quality at cinematic resolutions and durations.
https://arxiv.org/abs/2602.21818
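The channel-concatenation formulation mentioned above conditions the video branch by stacking the noisy latent, a masked reference latent, and the mask itself along the channel axis, so that image-to-video, extension, and editing differ only in which frames the mask selects. A minimal NumPy sketch under assumed latent shapes; `build_mmdit_input` and this exact layout are our illustration, not SkyReels' implementation.

```python
import numpy as np

def build_mmdit_input(noisy_latent, cond_latent, mask):
    """Channel-concatenation conditioning: stack the noisy latent, the
    masked conditioning latent, and the binary mask along channels.
    Shapes: (T, C, H, W) for the latents, (T, 1, H, W) for the mask."""
    return np.concatenate([noisy_latent, cond_latent * mask, mask], axis=1)

T, C, H, W = 4, 8, 16, 16
noisy = np.random.randn(T, C, H, W)
cond = np.random.randn(T, C, H, W)

# Image-to-video: only the first frame is conditioned on.
mask = np.zeros((T, 1, H, W))
mask[0] = 1.0

x = build_mmdit_input(noisy, cond, mask)
print(x.shape)  # (4, 17, 16, 16)
```

Video extension would instead set the mask on a prefix of frames, and editing on a spatial region, without changing the model interface.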
Deteriorating civil infrastructure requires automated inspection techniques that overcome the limitations of visual assessment. While Ground Penetrating Radar and Infrared Thermography enable subsurface defect detection, single-modal approaches face complementary constraints: radar struggles with moisture and shallow defects, while thermography exhibits weather dependency and limited depth. This paper presents a multimodal attention network that fuses radar temporal patterns with thermal spatial signatures for bridge deck delamination detection. Our architecture introduces temporal attention for radar processing, spatial attention for thermal features, and cross-modal fusion with learnable embeddings that discover complementary defect patterns invisible to individual sensors. We incorporate uncertainty quantification through Monte Carlo dropout and learned variance estimation, decomposing uncertainty into epistemic and aleatoric components for safety-critical decisions. Experiments on five bridge datasets reveal that on balanced to moderately imbalanced data, our approach substantially outperforms baselines in accuracy and AUC, representing meaningful improvements over single-modal and concatenation-based fusion. Ablation studies demonstrate that cross-modal attention provides critical gains beyond within-modality attention, while multi-head mechanisms achieve improved calibration. Uncertainty quantification reduces calibration error, enabling selective prediction by rejecting uncertain cases. However, under extreme class imbalance, attention mechanisms show vulnerability to majority-class collapse. These findings provide actionable guidance: the attention-based architecture performs well across typical scenarios, while extreme imbalance requires specialized techniques. Our system maintains deployment efficiency, enabling real-time inspection with characterized capabilities and limitations.
https://arxiv.org/abs/2512.20113
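The uncertainty decomposition above (Monte Carlo dropout plus learned variance) is commonly computed as follows: epistemic uncertainty is the variance of the predicted means across stochastic forward passes, and aleatoric uncertainty is the average of the predicted noise variances. A small sketch under that standard formulation; the paper's exact estimator may differ in detail.

```python
import numpy as np

def decompose_uncertainty(mc_means, mc_vars):
    """Decompose predictive uncertainty from T Monte Carlo dropout passes.

    mc_means: (T, N) predicted means per pass; mc_vars: (T, N) predicted
    (aleatoric) variances per pass. Epistemic = spread of the means across
    passes; aleatoric = average predicted noise variance.
    """
    epistemic = mc_means.var(axis=0)
    aleatoric = mc_vars.mean(axis=0)
    return epistemic, aleatoric, epistemic + aleatoric

rng = np.random.default_rng(0)
mc_means = rng.normal(0.7, 0.05, size=(50, 3))  # model disagrees slightly across passes
mc_vars = np.full((50, 3), 0.02)                 # constant learned noise variance
epi, ale, total = decompose_uncertainty(mc_means, mc_vars)
print(np.round(epi, 4), np.round(ale, 4))
```

For selective prediction, one simply rejects samples whose total uncertainty exceeds a threshold and routes them to a human inspector.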
Sustained operation of solar photovoltaic assets hinges on accurate detection and prioritization of surface faults across vast, geographically distributed modules. While multimodal imaging strategies are popular, they introduce logistical and economic barriers for routine farm-level deployment. This work demonstrates that deep learning and classical machine learning may be judiciously combined to achieve robust surface anomaly categorization and severity estimation from planar visible-band imagery alone. We introduce TinyViT, a compact pipeline integrating Transformer-based segmentation, spectral-spatial feature engineering, and ensemble regression. The system ingests consumer-grade color camera mosaics of PV panels, classifies seven nuanced surface faults, and generates actionable severity grades for maintenance triage. By eliminating reliance on electroluminescence or IR sensors, our method enables affordable, scalable upkeep for resource-limited installations and advances the state of solar health monitoring toward universal field accessibility. Experiments on real-world public datasets validate both the classification and regression sub-modules, achieving accuracy and interpretability competitive with specialized approaches.
https://arxiv.org/abs/2512.00117
Driver distraction remains a leading cause of road traffic accidents, contributing to thousands of fatalities annually across the globe. While deep learning-based driver activity recognition methods have shown promise in detecting such distractions, their effectiveness in real-world deployments is hindered by two critical challenges: variations in camera viewpoints (cross-view) and domain shifts such as changes in sensor modality or environment. Existing methods typically address either cross-view generalization or unsupervised domain adaptation in isolation, leaving a gap in the robust and scalable deployment of models across diverse vehicle configurations. In this work, we propose a novel two-phase cross-view, cross-modal unsupervised domain adaptation framework that addresses these challenges jointly on real-time driver monitoring data. In the first phase, we learn view-invariant and action-discriminative features within a single modality using contrastive learning on multi-view data. In the second phase, we perform domain adaptation to a new modality using an information bottleneck loss without requiring any labeled data from the new domain. We evaluate our approach using state-of-the-art video transformers (Video Swin, MViT) and the multimodal driver activity dataset Drive&Act, demonstrating that our joint framework improves top-1 accuracy on RGB video data by almost 50% compared to a supervised contrastive learning-based cross-view method, and outperforms unsupervised domain adaptation-only methods by up to 5%, using the same video transformer backbone.
https://arxiv.org/abs/2511.12196
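The first-phase contrastive objective above can be illustrated with an InfoNCE-style loss, in which a clip's embedding from one camera view is pulled toward the same clip's embedding from another view and pushed away from other clips in the batch. A NumPy sketch of that generic loss; the paper's exact loss and temperature are not given in this abstract.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Cross-view InfoNCE: anchors[i] (view A) should match positives[i]
    (same clip, view B) against all other clips in the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # positives sit on the diagonal

rng = np.random.default_rng(1)
z = rng.normal(size=(8, 16))
aligned = info_nce(z, z + 0.01 * rng.normal(size=(8, 16)))  # near-identical views
random_ = info_nce(z, rng.normal(size=(8, 16)))             # unrelated "views"
print(aligned, random_)
```

Minimizing this loss over paired camera views is what drives the features toward view invariance while keeping them action-discriminative.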
Multimodal recommender systems enhance personalized recommendations in e-commerce and online advertising by integrating visual, textual, and user-item interaction data. However, existing methods often overlook two critical biases: (i) modal confounding, where latent factors (e.g., brand style or product category) simultaneously drive multiple modalities and influence user preference, leading to spurious feature-preference associations; (ii) interaction bias, where genuine user preferences are mixed with noise from exposure effects and accidental clicks. To address these challenges, we propose a Causal-inspired multimodal Recommendation framework. Specifically, we introduce a dual-channel cross-modal diffusion module to identify hidden modal confounders, utilize back-door adjustment with hierarchical matching and vector-quantized codebooks to block confounding paths, and apply front-door adjustment combined with causal topology reconstruction to build a deconfounded causal subgraph. Extensive experiments on three real-world e-commerce datasets demonstrate that our method significantly outperforms state-of-the-art baselines while maintaining strong interpretability.
https://arxiv.org/abs/2510.12325
Profiling gamers provides critical insights for adaptive game design, behavioral understanding, and digital well-being. This study proposes an integrated, data-driven framework that combines psychological measures, behavioral analytics, and machine learning to reveal underlying gamer personas. A structured survey of 250 participants, including 113 active gamers, captured multidimensional behavioral, motivational, and social data. The analysis pipeline integrated feature engineering, association networks, knowledge-graph analysis, and unsupervised clustering to extract meaningful patterns. Correlation statistics (Cramér's V, Tschuprow's T, Theil's U, and Spearman's rank correlation) quantified feature associations, and network centrality guided feature selection. Dimensionality-reduction techniques such as PCA, SVD, and t-SNE were coupled with clustering algorithms (K-Means, Agglomerative, Spectral, DBSCAN) and evaluated using the Silhouette, Calinski-Harabasz, and Davies-Bouldin indices. The PCA + K-Means model with k = 4 achieved optimal cluster quality (Silhouette = 0.4), identifying four archetypes: Immersive Social Story-Seekers, Disciplined Optimizers, Strategic Systems Navigators, and Competitive Team-Builders. This research contributes a reproducible pipeline that links correlation-driven network insights with unsupervised learning. The integration of behavioral correlation networks with clustering not only enhances classification accuracy but also offers a holistic lens connecting gameplay motivations with psychological and wellness outcomes.
https://arxiv.org/abs/2510.10263
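Among the association measures listed above, Cramér's V is the chi-squared-based one for pairs of categorical features. A small self-contained implementation, computing the chi-squared statistic directly rather than via SciPy:

```python
import numpy as np

def cramers_v(table):
    """Cramér's V association measure from a contingency table.

    V = sqrt(chi2 / (n * (min(rows, cols) - 1))); 0 = independence,
    1 = perfect association.
    """
    table = np.asarray(table, dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

# Perfect association: each row maps to exactly one column -> V = 1.
print(round(cramers_v([[20, 0], [0, 30]]), 3))
# Independence: identical row profiles -> V = 0.
print(round(cramers_v([[10, 10], [10, 10]]), 3))
```

Pairwise V values between survey features are exactly what such an association network's edge weights can be built from before applying centrality-guided feature selection.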
In recent years, multimodal learning has become essential in robotic vision and information fusion, especially for understanding human behavior in complex environments. However, current methods struggle to fully leverage the textual modality, relying on supervised pretrained models, which limits semantic extraction in unsupervised robotic environments, particularly under significant modality loss. These methods also tend to be computationally intensive, leading to high resource consumption in real-world applications. To address these challenges, we propose the Multi-Modal Mamba-Enhanced Transformer (M3ET), a lightweight model designed for efficient multimodal learning, particularly on mobile platforms. By incorporating the Mamba module and a semantics-based adaptive attention mechanism, M3ET optimizes feature fusion, alignment, and modality reconstruction. Our experiments show that M3ET improves cross-task performance, with a 2.3-fold increase in pretraining inference speed. In particular, M3ET's accuracy on the core VQA task remains at 0.74, while the model's parameter count is reduced by 0.67. Although performance on the EQA task is limited, M3ET's lightweight design makes it well suited for deployment on resource-constrained robotic platforms.
https://arxiv.org/abs/2509.18005
Emotion Recognition in Conversation is widely applicable in call center analytics, opinion mining, finance, retail, healthcare, and other industries. In a call center scenario, the agent's role is not confined to receiving calls but also extends to providing a good customer experience by pacifying the frustration or anger of customers, which can be achieved by the agent maintaining a neutral or positive emotion. As in any conversation, the emotion of one speaker usually depends on the emotion of the other speaker. Hence an agent's positive emotion, accompanied by the right resolution, helps enhance the customer experience and can change an unhappy customer into a happy one. Imparting the right resolution at the right time becomes easier if the agent has insight into the emotion of future utterances. To predict the emotions of future utterances, we propose a novel architecture, Emotion Recognition and Forecasting in Conversation (ERFC). Our proposed ERFC architecture considers multiple modalities, different attributes of emotion, context, and the interdependencies of the speakers' utterances in the conversation. Our extensive experiments on the IEMOCAP dataset demonstrate the feasibility of the proposed ERFC. This approach can provide tremendous business value for applications such as call centers, where customer happiness is of utmost importance.
https://arxiv.org/abs/2509.18175
Recent developments in voice cloning and talking head generation demonstrate impressive capabilities in synthesizing natural speech and realistic lip synchronization. Current methods are typically trained on large-scale datasets through computationally intensive processes using clean, studio-recorded inputs, which is infeasible in noisy or low-resource environments. In this paper, we introduce a new modular pipeline built around Tortoise text-to-speech, a transformer-based latent diffusion model that performs high-fidelity zero-shot voice cloning given only a few training samples. We pair it with a lightweight generative adversarial network architecture for robust real-time lip synchronization. The solution reduces reliance on massive pretraining while supporting the generation of emotionally expressive speech and lip synchronization in noisy and unconstrained scenarios. The modular structure of the pipeline allows easy extension toward future multimodal and text-guided voice modulation, and it can be used in real-world systems.
https://arxiv.org/abs/2509.12831
School dropout is a serious problem in distance learning, where early detection is crucial for effective intervention and student perseverance. Predicting student dropout from available educational data is a widely researched topic in learning analytics. Our partner's distance learning platform highlights the importance of integrating diverse data sources, including socio-demographic data, behavioral data, and sentiment analysis, to accurately predict dropout risk. In this paper, we introduce a novel model that combines sentiment analysis of student comments using the Bidirectional Encoder Representations from Transformers (BERT) model with socio-demographic and behavioral data analyzed through Extreme Gradient Boosting (XGBoost). We fine-tuned BERT on student comments to capture nuanced sentiments, which were then merged with key features selected using feature-importance techniques in XGBoost. Our model was tested on unseen data from the next academic year, achieving an accuracy of 84%, compared to 82% for the baseline model. Additionally, the model demonstrated superior performance on other metrics, such as precision and F1-score. The proposed method could be a vital tool in developing personalized strategies to reduce dropout rates and encourage student perseverance.
https://arxiv.org/abs/2507.10421
Short-video misinformation detection has attracted wide attention in the multimodal domain, aiming to accurately identify misinformation in video format accompanied by the corresponding audio. Despite significant advancements, current models in this field, trained on particular domains (source domains), often exhibit unsatisfactory performance on unseen domains (target domains) due to domain gaps. To effectively realize domain generalization for the short-video misinformation detection task, we offer deep insights into the characteristics of different domains: (1) detection in various domains may rely mainly on different modalities (i.e., focusing primarily on videos or audios); to enhance domain generalization, it is crucial to achieve optimal model performance on all modalities simultaneously. (2) For some domains involving cross-modal joint fraud, a comprehensive analysis relying on cross-modal fusion is necessary. However, domain biases located in each modality (especially in each video frame) accumulate in this fusion process, which may seriously damage the final identification of misinformation. To address these issues, we propose a new DOmain generalization model via ConsisTency and invariance learning for shORt-video misinformation detection (named DOCTOR), which contains two characteristic modules: (1) we employ cross-modal feature interpolation to map multiple modalities into a shared space and interpolation distillation to synchronize multimodal learning; (2) we design a diffusion model that adds noise to retain core multimodal features and enhances domain-invariant features through cross-modal guided denoising. Extensive experiments demonstrate the effectiveness of our proposed DOCTOR model. Our code is publicly available at this https URL.
https://arxiv.org/abs/2507.04061
Recent advances in deep generative models have led to significant progress in video generation, yet the fidelity of AI-generated videos remains limited. Synthesized content often exhibits visual artifacts such as temporally inconsistent motion, physically implausible trajectories, unnatural object deformations, and local blurring that undermine realism and user trust. Accurate detection and spatial localization of these artifacts are crucial both for automated quality control and for guiding the development of improved generative models. However, the research community currently lacks a comprehensive benchmark specifically designed for artifact localization in AI-generated videos. Existing datasets either restrict themselves to video- or frame-level detection or lack the fine-grained spatial annotations necessary for evaluating localization methods. To address this gap, we introduce BrokenVideos, a benchmark dataset of 3,254 AI-generated videos with meticulously annotated, pixel-level masks highlighting regions of visual corruption. Each annotation is validated through detailed human inspection to ensure high-quality ground truth. Our experiments show that training state-of-the-art artifact detection models and multimodal large language models (MLLMs) on BrokenVideos significantly improves their ability to localize corrupted regions. Through extensive evaluation, we demonstrate that BrokenVideos establishes a critical foundation for benchmarking and advancing research on artifact localization in generative video models. The dataset is available at: this https URL.
https://arxiv.org/abs/2506.20103
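Evaluating localization against the pixel-level masks above typically uses Intersection-over-Union between predicted and annotated corruption regions. A minimal sketch; the benchmark's exact protocol (thresholds, per-video aggregation) may differ.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between two binary pixel masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:          # both masks empty: define IoU as 1
        return 1.0
    return np.logical_and(pred, gt).sum() / union

gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True        # 16-pixel annotated artifact region
pred = np.zeros((8, 8), dtype=bool)
pred[3:7, 3:7] = True      # prediction shifted by one pixel

print(mask_iou(pred, gt))  # 9 / 23 ≈ 0.391
```

Averaging this score over a dataset gives a simple localization metric that rewards tight spatial agreement rather than mere frame-level detection.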
Early and accurate detection of brain abnormalities, such as tumors and strokes, is essential for timely intervention and improved patient outcomes. In this study, we present a deep learning-based system capable of identifying both brain tumors and strokes from MRI images, along with their respective stages. We implemented two strategies using convolutional neural networks, MobileNetV2 and ResNet-50, optimized through transfer learning to classify MRI scans into five diagnostic categories. Our dataset, aggregated and augmented from various publicly available MRI sources, was carefully curated to ensure class balance and image diversity. To enhance model generalization and prevent overfitting, we applied dropout layers and extensive data augmentation. The models achieved strong performance, with training accuracy reaching 93% and validation accuracy up to 88%. While ResNet-50 demonstrated slightly better results, MobileNetV2 remains a promising option for real-time diagnosis in low-resource settings due to its lightweight architecture. This research offers a practical AI-driven solution for early brain abnormality detection, with potential for clinical deployment and future enhancement through larger datasets and multimodal inputs.
https://arxiv.org/abs/2506.09161
We introduce DEEVISum (Distilled Early-Exit Vision-language model for Summarization), a lightweight, efficient, and scalable vision language model designed for segment-wise video summarization. Leveraging multimodal prompts that combine textual and audio-derived signals, DEEVISum incorporates Multi-Stage Knowledge Distillation (MSKD) and Early Exit (EE) to strike a balance between performance and efficiency. MSKD offers a 1.33% absolute F1 improvement over baseline distillation (0.5%), while EE reduces inference time by approximately 21% at the cost of a 1.3-point drop in F1. Evaluated on the TVSum dataset, our best model, PaLI Gemma2 3B + MSKD, achieves an F1 score of 61.1, matching the performance of significantly larger models while maintaining a lower computational footprint. We publicly release our code and processed dataset to support further research.
https://arxiv.org/abs/2504.21831
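The Early Exit mechanism above trades accuracy for latency by stopping at the first classifier stage whose confidence clears a threshold. A toy sketch of that control flow; the stages and threshold here are illustrative, not DEEVISum's configuration.

```python
def early_exit_predict(stages, x, threshold=0.9):
    """Run classifier stages in order; return (class, depth) as soon as
    one stage's max class probability exceeds the confidence threshold."""
    for depth, stage in enumerate(stages, start=1):
        probs = stage(x)
        confidence = max(probs)
        if confidence >= threshold:
            return probs.index(confidence), depth
    # Fall through: no stage was confident enough, use the final stage.
    return probs.index(confidence), depth

# Two toy "stages": the first is unsure, the second is confident.
stages = [lambda x: [0.55, 0.45], lambda x: [0.95, 0.05]]
print(early_exit_predict(stages, None))       # exits at stage 2
print(early_exit_predict(stages, None, 0.5))  # a looser threshold exits at stage 1
```

Easy segments exit early and pay only part of the inference cost, which is where the reported ~21% latency reduction comes from at a modest F1 cost.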
Human pose estimation and action recognition have received attention due to their critical roles in healthcare monitoring, rehabilitation, and assistive technologies. In this study, we propose a novel architecture named the Transformer-based Encoder-Decoder Network (TED Net), designed to estimate human skeleton poses from WiFi Channel State Information (CSI). TED Net integrates convolutional encoders with transformer-based attention mechanisms to capture spatiotemporal features from CSI signals. The estimated skeleton poses are used as input to a customized Directed Graph Neural Network (DGNN) for action recognition. We validated our model on two datasets: a publicly available multimodal dataset for assessing general pose estimation, and a newly collected dataset focused on fall-related scenarios involving 20 participants. Experimental results demonstrate that TED Net outperforms existing approaches in pose estimation and that the DGNN achieves reliable action classification using CSI-based skeletons, with performance comparable to RGB-based systems. Notably, TED Net maintains robust performance across both fall and non-fall cases. These findings highlight the potential of CSI-driven human skeleton estimation for effective action recognition, particularly in home environments such as elderly fall detection. In such settings, WiFi signals are often readily available, offering a privacy-preserving alternative to vision-based methods, which may raise concerns about continuous camera monitoring.
https://arxiv.org/abs/2504.16655
Background: This study proposes a Vision-Language Model (VLM) leveraging the SIGLIP encoder and Gemma-3b transformer decoder to enhance automated chronic tuberculosis (TB) screening. By integrating chest X-ray images with clinical data, the model addresses the challenges of manual interpretation, improving diagnostic consistency and accessibility, particularly in resource-constrained settings. Methods: The VLM architecture combines a Vision Transformer (ViT) for visual encoding with a transformer-based text encoder that processes clinical context, such as patient histories and treatment records. Cross-modal attention mechanisms align radiographic features with textual information, while the Gemma-3b decoder generates comprehensive diagnostic reports. The model was pre-trained on 5 million paired medical images and texts and fine-tuned on 100,000 chronic TB-specific chest X-rays. Results: The model demonstrated high precision (94%) and recall (94%) for detecting key chronic TB pathologies, including fibrosis, calcified granulomas, and bronchiectasis. Area Under the Curve (AUC) scores exceeded 0.93, and Intersection over Union (IoU) values were above 0.91, validating its effectiveness in detecting and localizing TB-related abnormalities. Conclusion: The VLM offers a robust and scalable solution for automated chronic TB diagnosis, integrating radiographic and clinical data to deliver actionable, context-aware insights. Future work will address subtle pathologies and dataset biases to improve the model's generalizability, ensuring equitable performance across diverse populations and healthcare settings.
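The Intersection over Union (IoU) localization metric reported above has a simple closed form for axis-aligned boxes: intersection area divided by the union of the two areas. A minimal sketch (generic metric code, not tied to this paper's evaluation pipeline):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
```

For the example pair, the 1x1 overlap against a union of 7 gives IoU = 1/7; a value above 0.91, as reported here, means predicted and reference regions coincide almost exactly.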
https://arxiv.org/abs/2503.14536
3D Gaussian Splatting (3DGS) has been widely used in 3D reconstruction and 3D generation. Training a 3DGS scene often consumes substantial time and resources, and sometimes valuable creative inspiration as well. The growing number of 3DGS digital assets has created serious challenges for copyright protection, yet watermarking tailored to 3DGS remains largely unexplored. In this paper, we propose a new framework, X-SG$^2$S, which can simultaneously watermark 1D to 3D messages while keeping the original 3DGS scene almost unchanged. In general, the framework consists of an X-SG$^2$S injector for adding multi-modal messages simultaneously and an extractor for recovering them. Specifically, we first split the watermarks into message patches in a fixed manner and sort the 3DGS points. A self-adaptive gate then selects suitable locations for watermarking, and XD (multi-dimensional) injection heads embed the multi-modal messages into the sorted 3DGS points. A learnable gate recognizes the locations carrying extra messages, and XD extraction heads restore the hidden messages from the locations the learnable gate recommends. Extensive experiments demonstrate that X-SG$^2$S effectively conceals multi-modal messages without changing the pretrained 3DGS pipeline or the original form of the 3DGS parameters. With a simple, efficient model structure and high practicality, X-SG$^2$S also performs well at hiding and extracting multi-modal structured or unstructured messages. X-SG$^2$S is the first unified 1D-to-3D watermarking model for 3DGS and the first framework to embed multi-modal watermarks simultaneously in a single 3DGS scene, paving the way for later research.
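The preprocessing steps named in the abstract (splitting the watermark into fixed-size message patches and sorting the 3DGS points so injector and extractor agree on locations) can be sketched as follows. This is a toy illustration under assumed details: the patch length, the lexicographic sort key, and the one-patch-per-point carrier assignment are placeholders, and the learned gates and injection heads are omitted:

```python
import numpy as np

def split_into_patches(bits, patch_len):
    """Split a watermark bit sequence into fixed-length message patches,
    zero-padding the last patch (mirrors the fixed splitting step)."""
    pad = (-len(bits)) % patch_len
    bits = list(bits) + [0] * pad
    return [bits[i:i + patch_len] for i in range(0, len(bits), patch_len)]

rng = np.random.default_rng(2)
points = rng.standard_normal((100, 3))  # toy stand-in for 3DGS centers

# Deterministic ordering: sort by x, then y, then z (lexsort's primary
# key is the LAST one passed), so both sides derive the same ordering.
order = np.lexsort((points[:, 2], points[:, 1], points[:, 0]))

bits = rng.integers(0, 2, size=13)      # 13-bit toy watermark
patches = split_into_patches(bits, patch_len=4)

# Hypothetical gate: assign each patch to one sorted point as carrier.
carriers = order[: len(patches)]
```

Because the sort is a pure function of the point attributes, the extractor can recompute `order` from the watermarked scene alone and look up the same carrier locations without any side channel.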
https://arxiv.org/abs/2502.10475