Sustained operation of solar photovoltaic (PV) assets hinges on accurate detection and prioritization of surface faults across vast, geographically distributed modules. While multimodal imaging strategies are popular, they introduce logistical and economic barriers to routine farm-level deployment. This work demonstrates that deep learning and classical machine learning can be judiciously combined to achieve robust surface-anomaly categorization and severity estimation from planar visible-band imagery alone. We introduce TinyViT, a compact pipeline integrating Transformer-based segmentation, spectral-spatial feature engineering, and ensemble regression. The system ingests consumer-grade color-camera mosaics of PV panels, classifies seven nuanced surface faults, and generates actionable severity grades for maintenance triage. By eliminating reliance on electroluminescence or infrared sensors, our method enables affordable, scalable upkeep for resource-limited installations and advances solar health monitoring toward universal field accessibility. Experiments on public real-world datasets validate both the classification and regression sub-modules, achieving accuracy and interpretability competitive with specialized approaches.
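The ensemble-regression stage is not specified in detail in the abstract; as a rough illustration of the idea, the sketch below averages two hypothetical base regressors over toy spectral (intensity) and spatial (area) features. All feature and model choices here are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of a severity-grading stage: an ensemble that
# averages several base regressors over spectral-spatial features.
# Feature definitions and base models are illustrative assumptions.

def mean_intensity(pixels):
    """Toy 'spectral' feature: mean normalized intensity of a fault region."""
    return sum(pixels) / len(pixels)

def area_fraction(mask):
    """Toy 'spatial' feature: fraction of panel pixels flagged as faulty."""
    return sum(mask) / len(mask)

def ensemble_severity(features, regressors):
    """Average the predictions of all base regressors (a simple ensemble)."""
    preds = [r(features) for r in regressors]
    return sum(preds) / len(preds)

# Two stand-in base regressors mapping features to a 0-10 severity score.
reg_a = lambda f: 10.0 * f["area"]               # larger fault area -> worse
reg_b = lambda f: 10.0 * (1.0 - f["intensity"])  # darker region -> worse

pixels = [0.2, 0.3, 0.25, 0.25]   # normalized intensities inside the region
mask = [1, 1, 0, 0, 0, 0, 0, 0]   # 2 of 8 panel pixels flagged as faulty

features = {"intensity": mean_intensity(pixels), "area": area_fraction(mask)}
severity = ensemble_severity(features, [reg_a, reg_b])
```

Averaging keeps any single weak regressor from dominating the severity grade, which is the usual motivation for ensembling in a triage setting.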
https://arxiv.org/abs/2512.00117
Driver distraction remains a leading cause of road traffic accidents, contributing to thousands of fatalities annually across the globe. While deep learning-based driver activity recognition methods have shown promise in detecting such distractions, their effectiveness in real-world deployments is hindered by two critical challenges: variations in camera viewpoints (cross-view) and domain shifts such as changes in sensor modality or environment. Existing methods typically address either cross-view generalization or unsupervised domain adaptation in isolation, leaving a gap in the robust and scalable deployment of models across diverse vehicle configurations. In this work, we propose a novel two-phase cross-view, cross-modal unsupervised domain adaptation framework that addresses these challenges jointly on real-time driver monitoring data. In the first phase, we learn view-invariant and action-discriminative features within a single modality using contrastive learning on multi-view data. In the second phase, we perform domain adaptation to a new modality using an information bottleneck loss, without requiring any labeled data from the new domain. We evaluate our approach using state-of-the-art video transformers (Video Swin, MViT) and the multimodal driver activity dataset Drive&Act, demonstrating that our joint framework improves top-1 accuracy on RGB video data by almost 50% compared to a supervised contrastive learning-based cross-view method, and outperforms unsupervised domain adaptation-only methods by up to 5%, using the same video transformer backbone.
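The view-invariant first phase rests on supervised contrastive learning over multi-view clips: views of the same action are pulled together, other clips pushed apart. A toy, dependency-free sketch of such a loss (in the general style of supervised contrastive learning, not the authors' exact objective) is:

```python
import math

def sup_con_loss(embeddings, labels, temperature=0.1):
    """Toy supervised contrastive loss on L2-normalized embeddings.
    Samples sharing a label are treated as positives for each other."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    n = len(embeddings)
    total, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchors without positives contribute nothing
        denom = sum(math.exp(dot(embeddings[i], embeddings[j]) / temperature)
                    for j in range(n) if j != i)
        for j in positives:
            total += -math.log(
                math.exp(dot(embeddings[i], embeddings[j]) / temperature) / denom)
            count += 1
    return total / count

# Two views of the same action (label 0) plus one different action (label 1).
embs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
labels = [0, 0, 1]
loss = sup_con_loss(embs, labels)
```

With the two same-action views already aligned and the other action orthogonal, the loss is near zero, which is exactly the configuration the first training phase aims for.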
https://arxiv.org/abs/2511.12196
Multimodal recommender systems enhance personalized recommendations in e-commerce and online advertising by integrating visual, textual, and user-item interaction data. However, existing methods often overlook two critical biases: (i) modal confounding, where latent factors (e.g., brand style or product category) simultaneously drive multiple modalities and influence user preference, leading to spurious feature-preference associations; (ii) interaction bias, where genuine user preferences are mixed with noise from exposure effects and accidental clicks. To address these challenges, we propose a Causal-inspired multimodal Recommendation framework. Specifically, we introduce a dual-channel cross-modal diffusion module to identify hidden modal confounders, utilize back-door adjustment with hierarchical matching and vector-quantized codebooks to block confounding paths, and apply front-door adjustment combined with causal topology reconstruction to build a deconfounded causal subgraph. Extensive experiments on three real-world e-commerce datasets demonstrate that our method significantly outperforms state-of-the-art baselines while maintaining strong interpretability.
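The back-door adjustment the framework relies on can be illustrated numerically: P(Y=1 | do(X=x)) = Σ_z P(Y=1 | X=x, Z=z) P(Z=z), where Z is the latent confounder (e.g., a brand-style factor). The probability tables below are invented for illustration only.

```python
# Minimal numeric sketch of back-door adjustment over a discrete
# confounder Z. The toy probability tables are illustrative, not
# estimated quantities from the paper.

def backdoor_adjust(p_y_given_xz, p_z, x):
    """P(Y=1 | do(X=x)) via the back-door formula, blocking Z."""
    return sum(p_y_given_xz[(x, z)] * pz for z, pz in p_z.items())

# Confounder Z in {0, 1}; exposure X in {0, 1}.
p_z = {0: 0.5, 1: 0.5}
p_y_given_xz = {
    (1, 0): 0.8, (1, 1): 0.4,   # P(Y=1 | X=1, Z=z)
    (0, 0): 0.6, (0, 1): 0.2,   # P(Y=1 | X=0, Z=z)
}

# Deconfounded effect of exposing the feature (X=1 vs X=0).
effect = (backdoor_adjust(p_y_given_xz, p_z, 1)
          - backdoor_adjust(p_y_given_xz, p_z, 0))
```

Averaging over P(Z) rather than P(Z | X) is what severs the spurious feature-preference path that modal confounding creates.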
https://arxiv.org/abs/2510.12325
Profiling gamers provides critical insights for adaptive game design, behavioral understanding, and digital well-being. This study proposes an integrated, data-driven framework that combines psychological measures, behavioral analytics, and machine learning to reveal underlying gamer personas. A structured survey of 250 participants, including 113 active gamers, captured multidimensional behavioral, motivational, and social data. The analysis pipeline integrated feature engineering, association-network and knowledge-graph analysis, and unsupervised clustering to extract meaningful patterns. Correlation statistics (Cramér's V, Tschuprow's T, Theil's U, and Spearman's rank correlation) quantified feature associations, and network centrality guided feature selection. Dimensionality-reduction techniques (PCA, SVD, t-SNE) were coupled with clustering algorithms (K-Means, agglomerative, spectral, DBSCAN) and evaluated using the Silhouette, Calinski-Harabasz, and Davies-Bouldin indices. The PCA + K-Means (k = 4) model achieved the best cluster quality (Silhouette = 0.4), identifying four archetypes: Immersive Social Story-Seekers, Disciplined Optimizers, Strategic Systems Navigators, and Competitive Team-Builders. This research contributes a reproducible pipeline that links correlation-driven network insights with unsupervised learning. The integration of behavioral correlation networks with clustering not only enhances classification accuracy but also offers a holistic lens to connect gameplay motivations with psychological and wellness outcomes.
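Cramér's V, one of the association statistics used above, follows directly from the chi-square statistic of a contingency table; a minimal reference implementation:

```python
import math

def cramers_v(table):
    """Cramér's V for a 2D contingency table given as a list of rows.
    V = sqrt(chi2 / (n * (min(r, c) - 1))), ranging from 0 to 1."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n   # expected under independence
            chi2 += (obs - exp) ** 2 / exp
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (k - 1)))

# Perfectly associated 2x2 table -> V = 1; independent table -> V = 0.
v_assoc = cramers_v([[20, 0], [0, 20]])
v_indep = cramers_v([[10, 10], [10, 10]])
```

Because V is bounded in [0, 1] regardless of table size, it is convenient for building the kind of cross-feature association network described above.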
https://arxiv.org/abs/2510.10263
In recent years, multimodal learning has become essential in robotic vision and information fusion, especially for understanding human behavior in complex environments. However, current methods struggle to fully leverage the textual modality and rely on supervised pretrained models, which limits semantic extraction in unsupervised robotic environments, particularly under significant modality loss. These methods also tend to be computationally intensive, leading to high resource consumption in real-world applications. To address these challenges, we propose the Multi Modal Mamba Enhanced Transformer (M3ET), a lightweight model designed for efficient multimodal learning, particularly on mobile platforms. By incorporating the Mamba module and a semantics-based adaptive attention mechanism, M3ET optimizes feature fusion, alignment, and modality reconstruction. Our experiments show that M3ET improves cross-task performance, with a 2.3-fold increase in pretraining inference speed. In particular, M3ET's accuracy on the core VQA task remains at 0.74 while its parameter count is reduced by 0.67. Although performance on the EQA task is limited, M3ET's lightweight design makes it well suited for deployment on resource-constrained robotic platforms.
https://arxiv.org/abs/2509.18005
Emotion Recognition in Conversation is widely applicable in call-center analytics, opinion mining, finance, retail, healthcare, and other industries. In a call-center scenario, the agent's role is not confined to receiving calls but also extends to providing a good customer experience by pacifying customers' frustration or anger. This can be achieved if the agent maintains a neutral or positive emotion. As in any conversation, the emotion of one speaker usually depends on the emotion of the other speaker. Hence an agent's positive emotion, accompanied by the right resolution, helps enhance the customer experience and can turn an unhappy customer into a happy one. Imparting the right resolution at the right time becomes easier if the agent has insight into the emotions of future utterances. To predict the emotions of future utterances, we propose a novel architecture, Emotion Recognition and Forecasting in Conversation (ERFC). Our proposed ERFC architecture considers multiple modalities, different attributes of emotion, context, and the interdependencies of the speakers' utterances in the conversation. Our extensive experiments on the IEMOCAP dataset show the feasibility of the proposed ERFC. This approach can provide tremendous business value for applications such as call centers, where customer happiness is of utmost importance.
https://arxiv.org/abs/2509.18175
Recent developments in voice cloning and talking-head generation demonstrate impressive capabilities in synthesizing natural speech and realistic lip synchronization. Current methods typically require large-scale datasets and computationally intensive training on clean, studio-recorded inputs, which is infeasible in noisy or low-resource environments. In this paper, we introduce a modular pipeline built around Tortoise text-to-speech, a transformer-based latent diffusion model that performs high-fidelity zero-shot voice cloning given only a few training samples. We pair it with a lightweight generative adversarial network architecture for robust real-time lip synchronization. The solution reduces reliance on massive pretraining while generating emotionally expressive speech and lip synchronization in noisy and unconstrained scenarios. The pipeline's modular structure allows easy extension to future multimodal and text-guided voice modulation, and it could be used in real-world systems.
https://arxiv.org/abs/2509.12831
School dropout is a serious problem in distance learning, where early detection is crucial for effective intervention and student perseverance. Predicting student dropout from available educational data is a widely researched topic in learning analytics. Our partner's distance-learning platform highlights the importance of integrating diverse data sources, including socio-demographic data, behavioral data, and sentiment analysis, to accurately predict dropout risk. In this paper, we introduce a novel model that combines sentiment analysis of student comments using the Bidirectional Encoder Representations from Transformers (BERT) model with socio-demographic and behavioral data analyzed through Extreme Gradient Boosting (XGBoost). We fine-tuned BERT on student comments to capture nuanced sentiments, which were then merged with key features selected using feature-importance techniques in XGBoost. Our model was tested on unseen data from the next academic year, achieving an accuracy of 84\%, compared to 82\% for the baseline model. The model also demonstrated superior performance on other metrics, such as precision and F1-score. The proposed method could be a vital tool in developing personalized strategies to reduce dropout rates and encourage student perseverance.
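A minimal sketch of the fusion idea: append the BERT-derived sentiment probability to the tabular features and fit a classifier on the combined vector. For a dependency-free illustration, a tiny gradient-descent logistic regression stands in for XGBoost, and all feature names and values are invented.

```python
import math

def fuse(tabular, sentiment_prob):
    """Append the sentiment probability as one extra feature column."""
    return tabular + [sentiment_prob]

def train_logreg(X, y, lr=0.5, epochs=2000):
    """Tiny SGD logistic regression (a stand-in for XGBoost here)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0

# Toy rows: [hours_online, assignments_done] + sentiment of comments.
X = [fuse([0.9, 0.8], 0.9), fuse([0.8, 0.9], 0.8),   # engaged, positive
     fuse([0.2, 0.1], 0.1), fuse([0.1, 0.2], 0.2)]   # disengaged, negative
y = [0, 0, 1, 1]                                     # 1 = dropout
w, b = train_logreg(X, y)
preds = [predict(w, b, xi) for xi in X]
```

The point of the sketch is only the late-fusion step: the text signal enters the tabular model as one more engineered feature.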
https://arxiv.org/abs/2507.10421
Short-video misinformation detection has attracted wide attention in the multi-modal domain, aiming to accurately identify misinformation in videos accompanied by corresponding audio. Despite significant advancements, current models in this field, trained on particular domains (source domains), often exhibit unsatisfactory performance on unseen domains (target domains) due to domain gaps. To effectively realize domain generalization for the short-video misinformation detection task, we offer deep insights into the characteristics of different domains: (1) Detection in various domains may rely mainly on different modalities (i.e., mainly on video or on audio). To enhance domain generalization, it is crucial to achieve optimal model performance on all modalities simultaneously. (2) For some domains involving cross-modal joint fraud, a comprehensive analysis relying on cross-modal fusion is necessary. However, domain biases located in each modality (especially in individual video frames) accumulate in this fusion process, which can seriously damage the final identification of misinformation. To address these issues, we propose a new DOmain generalization model via ConsisTency and invariance learning for shORt-video misinformation detection (named DOCTOR), which contains two characteristic modules: (1) we employ cross-modal feature interpolation to map multiple modalities into a shared space and interpolation distillation to synchronize multi-modal learning; (2) we design a diffusion model that adds noise to retain the core multi-modal features and enhances domain-invariant features through cross-modal guided denoising. Extensive experiments demonstrate the effectiveness of our proposed DOCTOR model. Our code is publicly available at this https URL.
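Cross-modal feature interpolation can be pictured as convex mixing of modality embeddings in a shared space, mixup-style; DOCTOR's learned mapping is of course more elaborate than this toy version.

```python
# Toy sketch of cross-modal feature interpolation: mix a video embedding
# and an audio embedding with coefficient lam. Illustrative only; the
# actual interpolation in the model is learned, not a fixed mix.

def interpolate(video_feat, audio_feat, lam):
    """Convex combination of two modality embeddings (0 <= lam <= 1)."""
    return [lam * v + (1.0 - lam) * a for v, a in zip(video_feat, audio_feat)]

video = [1.0, 0.0, 0.5]
audio = [0.0, 1.0, 0.5]
mixed = interpolate(video, audio, 0.5)
```

Interpolated points lie between the modality clusters, which is what forces the shared space to treat both modalities consistently.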
https://arxiv.org/abs/2507.04061
Recent advances in deep generative models have led to significant progress in video generation, yet the fidelity of AI-generated videos remains limited. Synthesized content often exhibits visual artifacts such as temporally inconsistent motion, physically implausible trajectories, unnatural object deformations, and local blurring that undermine realism and user trust. Accurate detection and spatial localization of these artifacts are crucial both for automated quality control and for guiding the development of improved generative models. However, the research community currently lacks a comprehensive benchmark specifically designed for artifact localization in AI-generated videos. Existing datasets either restrict themselves to video- or frame-level detection or lack the fine-grained spatial annotations necessary for evaluating localization methods. To address this gap, we introduce BrokenVideos, a benchmark dataset of 3,254 AI-generated videos with meticulously annotated, pixel-level masks highlighting regions of visual corruption. Each annotation is validated through detailed human inspection to ensure high-quality ground truth. Our experiments show that training state-of-the-art artifact detection models and multimodal large language models (MLLMs) on BrokenVideos significantly improves their ability to localize corrupted regions. Through extensive evaluation, we demonstrate that BrokenVideos establishes a critical foundation for benchmarking and advancing research on artifact localization in generative video models. The dataset is available at: this https URL.
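Localization quality against such pixel-level masks is typically scored with Intersection-over-Union; a minimal reference implementation over flat binary masks:

```python
def mask_iou(pred, gt):
    """Intersection-over-Union between two binary masks (flat 0/1 lists),
    the standard metric for judging artifact-localization quality."""
    inter = sum(p & g for p, g in zip(pred, gt))
    union = sum(p | g for p, g in zip(pred, gt))
    return inter / union if union else 1.0  # two empty masks agree fully

# Predicted corruption mask overlaps 2 of the 3 ground-truth pixels
# and adds 1 false-positive pixel: intersection 2, union 4.
pred = [1, 1, 0, 1, 0, 0]
gt   = [1, 1, 1, 0, 0, 0]
iou = mask_iou(pred, gt)
```

Real evaluations run this per frame over 2D masks and average, but the per-mask computation is exactly this ratio.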
https://arxiv.org/abs/2506.20103
Early and accurate detection of brain abnormalities, such as tumors and strokes, is essential for timely intervention and improved patient outcomes. In this study, we present a deep learning-based system capable of identifying both brain tumors and strokes from MRI images, along with their respective stages. We pursued two strategies involving convolutional neural networks, MobileNetV2 and ResNet-50, optimized through transfer learning to classify MRI scans into five diagnostic categories. Our dataset, aggregated and augmented from various publicly available MRI sources, was carefully curated to ensure class balance and image diversity. To enhance model generalization and prevent overfitting, we applied dropout layers and extensive data augmentation. The models achieved strong performance, with training accuracy reaching 93\% and validation accuracy up to 88\%. While ResNet-50 demonstrated slightly better results, MobileNetV2 remains a promising option for real-time diagnosis in low-resource settings due to its lightweight architecture. This research offers a practical AI-driven solution for early brain abnormality detection, with potential for clinical deployment and future enhancement through larger datasets and multimodal inputs.
https://arxiv.org/abs/2506.09161
We introduce DEEVISum (Distilled Early Exit Vision language model for Summarization), a lightweight, efficient, and scalable vision-language model designed for segment-wise video summarization. Leveraging multimodal prompts that combine textual and audio-derived signals, DEEVISum incorporates Multi-Stage Knowledge Distillation (MSKD) and Early Exit (EE) to strike a balance between performance and efficiency. MSKD offers a 1.33% absolute F1 improvement over baseline distillation (0.5%), while EE reduces inference time by approximately 21% at the cost of a 1.3-point drop in F1. Evaluated on the TVSum dataset, our best model, PaLI-Gemma2-3B + MSKD, achieves an F1 score of 61.1, matching the performance of significantly larger models while maintaining a lower computational footprint. We publicly release our code and processed dataset to support further research.
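The Early Exit mechanism can be sketched as running model stages until an intermediate head is sufficiently confident. The stage functions, labels, and threshold below are illustrative stand-ins, not DEEVISum's actual heads.

```python
# Sketch of the Early Exit (EE) idea: stop running stages as soon as an
# intermediate classifier head is confident enough, trading a small
# accuracy drop for lower inference time. All stages here are stand-ins.

def early_exit_predict(x, stages, threshold=0.9):
    """Run stages in order; return (label, exit_index) at the first head
    whose confidence reaches the threshold."""
    for i, stage in enumerate(stages):
        label, confidence = stage(x)
        if confidence >= threshold:
            return label, i
    return label, len(stages) - 1  # fall through to the final head

# Three stand-in heads with increasing confidence on this input.
stages = [
    lambda x: ("highlight", 0.6),   # shallow head: unsure, keep going
    lambda x: ("highlight", 0.95),  # mid head: confident -> exit here
    lambda x: ("highlight", 0.99),  # deep head: never reached
]
label, exit_at = early_exit_predict("segment-7", stages)
```

Skipping the deepest stages on easy inputs is where the reported ~21% inference-time saving comes from in such schemes.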
https://arxiv.org/abs/2504.21831
Human pose estimation and action recognition have received attention due to their critical roles in healthcare monitoring, rehabilitation, and assistive technologies. In this study, we propose a novel architecture named Transformer-based Encoder-Decoder Network (TED Net) designed for estimating human skeleton poses from WiFi Channel State Information (CSI). TED Net integrates convolutional encoders with transformer-based attention mechanisms to capture spatiotemporal features from CSI signals. The estimated skeleton poses are used as input to a customized Directed Graph Neural Network (DGNN) for action recognition. We validated our model on two datasets: a publicly available multimodal dataset for assessing general pose estimation, and a newly collected dataset focused on fall-related scenarios involving 20 participants. Experimental results demonstrate that TED Net outperforms existing approaches in pose estimation, and that the DGNN achieves reliable action classification using CSI-based skeletons, with performance comparable to RGB-based systems. Notably, TED Net maintains robust performance across both fall and non-fall cases. These findings highlight the potential of CSI-driven human skeleton estimation for effective action recognition, particularly in home environments such as elderly fall detection. In such settings, WiFi signals are often readily available, offering a privacy-preserving alternative to vision-based methods, which may raise concerns about continuous camera monitoring.
https://arxiv.org/abs/2504.16655
Background: This study proposes a Vision-Language Model (VLM) leveraging the SIGLIP encoder and Gemma-3b transformer decoder to enhance automated chronic tuberculosis (TB) screening. By integrating chest X-ray images with clinical data, the model addresses the challenges of manual interpretation, improving diagnostic consistency and accessibility, particularly in resource-constrained settings. Methods: The VLM architecture combines a Vision Transformer (ViT) for visual encoding and a transformer-based text encoder to process clinical context, such as patient histories and treatment records. Cross-modal attention mechanisms align radiographic features with textual information, while the Gemma-3b decoder generates comprehensive diagnostic reports. The model was pre-trained on 5 million paired medical images and texts and fine-tuned using 100,000 chronic TB-specific chest X-rays. Results: The model demonstrated high precision (94%) and recall (94%) for detecting key chronic TB pathologies, including fibrosis, calcified granulomas, and bronchiectasis. Area Under the Curve (AUC) scores exceeded 0.93, and Intersection over Union (IoU) values were above 0.91, validating its effectiveness in detecting and localizing TB-related abnormalities. Conclusion: The VLM offers a robust and scalable solution for automated chronic TB diagnosis, integrating radiographic and clinical data to deliver actionable and context-aware insights. Future work will address subtle pathologies and dataset biases to enhance the model's generalizability, ensuring equitable performance across diverse populations and healthcare settings.
https://arxiv.org/abs/2503.14536
3D Gaussian Splatting (3DGS) has been widely used in 3D reconstruction and 3D generation. Training a 3DGS scene often takes substantial time and resources, as well as valuable creative effort. The increasing number of 3DGS digital assets has brought great challenges to copyright protection, yet the topic still lacks profound exploration targeted at 3DGS. In this paper, we propose a new framework, X-SG$^2$S, which can simultaneously watermark 1D-to-3D messages while keeping the original 3DGS scene almost unchanged. The framework comprises an X-SG$^2$S injector for adding multi-modal messages simultaneously and an extractor for extracting them. Specifically, we first split the watermarks into message patches in a fixed manner and sort the 3DGS points. A self-adaptive gate picks out suitable locations for watermarking, and XD (multi-dimension) injection heads add the multi-modal messages into the sorted 3DGS points. A learnable gate recognizes the locations carrying extra messages, and XD extraction heads restore the hidden messages from the locations the learnable gate recommends. Extensive experiments demonstrate that the proposed X-SG$^2$S can effectively conceal multi-modal messages without changing the pretrained 3DGS pipeline or the original form of the 3DGS parameters. With a simple, efficient model structure and high practicality, X-SG$^2$S also performs well in hiding and extracting multi-modal structured or unstructured messages. X-SG$^2$S is the first to unify 1D-to-3D watermarking for 3DGS and the first framework to add multi-modal watermarks simultaneously to one 3DGS scene, paving the way for later research.
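The fixed message-patching step can be sketched as splitting the watermark bit string into equal patches and pairing each patch with a run of sorted points. The real gating and injection heads are learned networks, and the scalar "points" below are a stand-in for 3DGS point attributes.

```python
# Illustrative sketch of fixed message patching and assignment to sorted
# points. Point sorting keys, patch size, and the embedding itself are
# assumptions for illustration; the paper's gates and heads are learned.

def split_into_patches(bits, patch_size):
    """Split the watermark bit string into fixed-size message patches."""
    return [bits[i:i + patch_size] for i in range(0, len(bits), patch_size)]

def assign_patches(points, patches):
    """Pair each patch with a consecutive run of sorted points."""
    points = sorted(points)
    run = len(points) // len(patches)
    return {i: (points[i * run:(i + 1) * run], p)
            for i, p in enumerate(patches)}

patches = split_into_patches("10110100", 4)
plan = assign_patches([0.9, 0.1, 0.5, 0.3], patches)
```

Because the split and the sort order are fixed, an extractor that knows the same convention can locate each patch deterministically before the learned gate refines the selection.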
https://arxiv.org/abs/2502.10475
Previous research has revealed the potential of large language models (LLMs) to support cognitive reframing therapy; however, it has focused primarily on text-based methods, often overlooking the non-verbal evidence crucial in real-life therapy. To bridge this gap, we extend textual cognitive reframing to multimodality by incorporating visual clues. Specifically, we present a new dataset called Multi Modal-Cognitive Support Conversation (M2CoSC), which pairs each GPT-4-generated dialogue with an image that reflects the virtual client's facial expressions. To better mirror real psychotherapy, where facial expressions lead to interpreting implicit emotional evidence, we propose a multi-hop psychotherapeutic reasoning approach that explicitly identifies and incorporates subtle evidence. Our comprehensive experiments with both LLMs and vision-language models (VLMs) demonstrate that the VLMs' performance as psychotherapists is significantly improved with the M2CoSC dataset. Furthermore, the multi-hop psychotherapeutic reasoning method enables VLMs to provide more thoughtful and empathetic suggestions, outperforming standard prompting methods.
https://arxiv.org/abs/2502.06873
Deep learning-based person re-identification (re-id) models have been widely employed in surveillance systems. Recent studies have demonstrated that black-box single-modality and cross-modality re-id models are vulnerable to adversarial examples (AEs), while the robustness of multi-modality re-id models remains unexplored. Because the specific type of model deployed in a target black-box surveillance system is unknown, we aim to generate modality-unified AEs for omni-modality (single-, cross-, and multi-modality) re-id models. Specifically, we propose a novel Modality Unified Attack (MUA) method that trains modality-specific adversarial generators to produce AEs that effectively attack the different omni-modality models. A multi-modality model is adopted as the surrogate model, wherein the features of each modality are perturbed by a metric disruption loss before fusion. To collapse the common features of omni-modality models, a Cross-Modality Simulated Disruption approach is introduced to mimic cross-modality feature embeddings by intentionally feeding images to non-corresponding modality-specific subnetworks of the surrogate model. Moreover, a Multi-Modality Collaborative Disruption strategy is devised to help the attacker comprehensively corrupt the informative content of person images by leveraging a multi-modality feature collaborative metric disruption loss. Extensive experiments show that our MUA method can effectively attack omni-modality re-id models, achieving mean mAP drop rates of 55.9%, 24.4%, 49.0%, and 62.7%, respectively.
https://arxiv.org/abs/2501.12761
Generating visual text in natural scene images is a challenging task with many unsolved problems. Unlike generating text on artificially designed images (such as posters, covers, and cartoons), text in natural scene images must meet four key criteria: (1) Fidelity: the generated text should appear as realistic as a photograph and be completely accurate, with no errors in any stroke. (2) Reasonability: the text should be generated on reasonable carrier areas (such as boards, signs, and walls), and the generated content should be relevant to the scene. (3) Utility: the generated text can facilitate the training of natural-scene OCR (Optical Character Recognition) tasks. (4) Controllability: attributes of the text (such as font and color) should be controllable. In this paper, we propose a two-stage method, SceneVTG++, which simultaneously satisfies all four criteria. SceneVTG++ consists of a Text Layout and Content Generator (TLCG) and a Controllable Local Text Diffusion (CLTD) module. The former utilizes the world knowledge of multimodal large language models to find reasonable text areas and recommend text content according to the natural-scene background image, while the latter generates controllable multilingual text based on a diffusion model. Through extensive experiments, we verify the effectiveness of TLCG and CLTD respectively, and demonstrate the state-of-the-art text generation performance of SceneVTG++. In addition, the generated images have superior utility in OCR tasks such as text detection and text recognition. Codes and datasets will be made available.
https://arxiv.org/abs/2501.02962
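The two-stage split can be pictured as a simple hand-off: TLCG proposes carrier regions and scene-relevant text, and CLTD renders each proposal with controllable attributes. The sketch below is a stand-in with stub logic; all function names, the lookup table, and the attribute set are illustrative assumptions, not the SceneVTG++ API:

```python
from dataclasses import dataclass

@dataclass
class TextRegion:
    box: tuple          # (x, y, w, h) carrier area in the scene image
    content: str        # text recommended for that area

def tlcg_propose(scene_description):
    # Stage 1 stand-in: a multimodal LLM would inspect the background
    # image and return plausible carrier areas plus relevant text.
    # Here we fake it with a lookup keyed on the scene description.
    proposals = {
        "storefront": TextRegion((40, 10, 200, 60), "OPEN 24 HOURS"),
        "road": TextRegion((80, 30, 120, 40), "SPEED LIMIT 50"),
    }
    return [proposals[scene_description]]

def cltd_render(image, region, font="sans", color="white"):
    # Stage 2 stand-in: the diffusion model would inpaint region.content
    # into `image` with the requested, controllable attributes.
    return {"image": image, "box": region.box,
            "text": region.content, "font": font, "color": color}

def scenevtg_pipeline(image, scene_description):
    # Hand each TLCG proposal to CLTD for rendering.
    return [cltd_render(image, r) for r in tlcg_propose(scene_description)]
```

The point of the split is that layout/content reasoning (world knowledge) and pixel synthesis (diffusion) are separate concerns, so each stage can be evaluated and improved independently, as the paper's experiments do.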
Autonomous off-road navigation is required for applications in agriculture, construction, search and rescue, and defence. Traditional on-road autonomy methods struggle with dynamic terrain, leading to poor vehicle control off-road. Recent deep-learning models combine perception sensors with kinesthetic feedback for navigation on such terrain, but this approach suffers from out-of-domain uncertainty: factors such as changes in weather and time of day degrade model performance. We propose FuseIsPath, a multimodal fusion network that uses LWIR and RGB images to provide robustness against dynamic weather and lighting conditions. To aid further work in this domain, we also open-source a day-night dataset with LWIR and RGB images along with pseudo-labels for traversability. To co-register the two images, we developed a novel method for targetless extrinsic calibration of LWIR, LiDAR, and RGB cameras, achieving a translation accuracy of 1.7 cm and a rotation accuracy of 0.827°.
https://arxiv.org/abs/2412.03173
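One common way such fusion networks gain day-night robustness is gated late fusion: each stream contributes in proportion to a confidence score, so the model can lean on LWIR at night and RGB in daylight. A pure-Python sketch of that gating idea (the scheme and names are assumptions for illustration, not the actual FuseIsPath architecture):

```python
def gated_fusion(rgb_feat, lwir_feat, rgb_conf, lwir_conf):
    # Normalize the per-stream confidences into fusion weights, then
    # take the weighted element-wise sum of the two feature vectors.
    total = rgb_conf + lwir_conf
    w_rgb, w_lwir = rgb_conf / total, lwir_conf / total
    return [w_rgb * r + w_lwir * l for r, l in zip(rgb_feat, lwir_feat)]

# E.g. a night scene: low RGB confidence shifts weight toward LWIR.
night_fused = gated_fusion([0.2, 0.9], [0.8, 0.1], rgb_conf=1.0, lwir_conf=4.0)
```

In a trained network the confidences would themselves be predicted from the inputs; here they are passed in explicitly to keep the sketch self-contained.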
The potential use of large language models (LLMs) in healthcare robotics can help address the significant demand placed on healthcare systems around the world by an aging demographic and a shortage of healthcare professionals. Although LLMs have already been integrated into medicine to assist both clinicians and patients, the integration of LLMs within healthcare robots has not yet been explored for clinical settings. In this perspective paper, we examine groundbreaking developments in robotics and LLMs to identify the system requirements for designing health-specific LLM-based robots, in terms of multimodal communication through human-robot interactions (HRIs), semantic reasoning, and task planning. Furthermore, we discuss the ethical issues, open challenges, and potential future research directions for this emerging and innovative field.
https://arxiv.org/abs/2411.03287