Reliable hydrologic and flood forecasting requires models that remain stable when input data are delayed, missing, or inconsistent. However, most advances in rainfall-runoff prediction have been evaluated under ideal data conditions, emphasizing accuracy rather than operational resilience. Here, we develop an operationally ready emulator of the Global Flood Awareness System (GloFAS) that couples long short-term memory (LSTM) networks with a relaxed water-balance constraint to preserve physical coherence. Five architectures span a continuum of information availability: from complete historical and forecast forcings to scenarios with data latency and outages, allowing systematic evaluation of robustness. Trained in minimally managed catchments across the United States and tested in more than 5,000 basins, including heavily regulated rivers in India, the emulator reproduces the hydrological core of GloFAS and degrades smoothly as information quality declines. Transfer across contrasting hydroclimatic and management regimes yields reduced yet physically consistent performance, defining the limits of generalization under data scarcity and human influence. The framework establishes operational robustness as a measurable property of hydrological machine learning and advances the design of reliable real-time forecasting systems.
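A minimal sketch of how an LSTM discharge emulator with a soft ("relaxed") water-balance penalty might be wired up; the layer sizes, variable names, penalty weight, and tolerance below are illustrative assumptions, not the paper's configuration.

```python
# Hypothetical sketch: LSTM runoff emulator with a relaxed water-balance penalty.
import torch
import torch.nn as nn

class RunoffEmulator(nn.Module):
    def __init__(self, n_forcings=5, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_forcings, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, forcings):                  # forcings: (batch, time, n_forcings)
        h, _ = self.lstm(forcings)
        return self.head(h).squeeze(-1)           # predicted runoff: (batch, time)

def loss_fn(pred, obs, precip, evap, lam=0.1, slack=0.05):
    mse = torch.mean((pred - obs) ** 2)
    # Relaxed (soft) water balance: cumulative runoff should stay close to
    # cumulative precipitation minus evaporation, within a tolerance.
    imbalance = torch.abs(pred.sum(dim=1) - (precip - evap).sum(dim=1))
    penalty = torch.relu(imbalance - slack).mean()
    return mse + lam * penalty

model = RunoffEmulator()
x = torch.randn(8, 365, 5)                        # one year of daily forcings, 8 basins
obs = torch.rand(8, 365)
precip, evap = torch.rand(8, 365), 0.3 * torch.rand(8, 365)
loss = loss_fn(model(x), obs, precip, evap)
loss.backward()
```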
https://arxiv.org/abs/2510.18535
We present a reproducibility study of the state-of-the-art neural architecture for sequence labeling proposed by Ma and Hovy (2016). The original BiLSTM-CNN-CRF model combines character-level representations via Convolutional Neural Networks (CNNs), word-level context modeling through Bi-directional Long Short-Term Memory networks (BiLSTMs), and structured prediction using Conditional Random Fields (CRFs). This end-to-end approach eliminates the need for hand-crafted features while achieving excellent performance on named entity recognition (NER) and part-of-speech (POS) tagging tasks. Our implementation successfully reproduces the key results, achieving 91.18% F1-score on CoNLL-2003 NER and demonstrating the model's effectiveness across sequence labeling tasks. We provide a detailed analysis of the architecture components and release an open-source PyTorch implementation to facilitate further research.
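A minimal sketch of the BiLSTM-CNN(-CRF) encoder: character CNN features are concatenated with word embeddings and fed to a BiLSTM that emits per-token tag scores. Dimensions are illustrative; a CRF layer (e.g., the third-party pytorch-crf package) would decode these emissions jointly rather than per token.

```python
import torch
import torch.nn as nn

class CharCNNWordBiLSTM(nn.Module):
    def __init__(self, n_words, n_chars, n_tags, w_dim=100, c_dim=30, hidden=200):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, w_dim)
        self.char_emb = nn.Embedding(n_chars, c_dim)
        self.char_cnn = nn.Conv1d(c_dim, c_dim, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(w_dim + c_dim, hidden, batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden, n_tags)

    def forward(self, words, chars):           # words: (B, T); chars: (B, T, L)
        B, T, L = chars.shape
        c = self.char_emb(chars).view(B * T, L, -1).transpose(1, 2)     # (B*T, c_dim, L)
        c = torch.relu(self.char_cnn(c)).max(dim=2).values.view(B, T, -1)
        x = torch.cat([self.word_emb(words), c], dim=-1)                # word + char features
        h, _ = self.bilstm(x)
        return self.emit(h)                    # per-token tag emissions: (B, T, n_tags)

model = CharCNNWordBiLSTM(n_words=10000, n_chars=80, n_tags=9)
emissions = model(torch.randint(0, 10000, (2, 12)), torch.randint(0, 80, (2, 12, 15)))
```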
https://arxiv.org/abs/2510.10936
Social media has become an essential part of the digital age, serving as a platform for communication, interaction, and information sharing. Celebrities are among the most active users and often reveal aspects of their personal and professional lives through online posts. Platforms such as Twitter provide an opportunity to analyze language and behavior for understanding demographic and social patterns. Since followers frequently share linguistic traits and interests with the celebrities they follow, textual data from followers can be used to predict celebrity demographics. However, most existing research in this field has focused on English and other high-resource languages, leaving Urdu largely unexplored. This study applies modern machine learning and deep learning techniques to the problem of celebrity profiling in Urdu. A dataset of short Urdu tweets from followers of subcontinent celebrities was collected and preprocessed. Multiple algorithms were trained and compared, including Logistic Regression, Support Vector Machines, Random Forests, Convolutional Neural Networks, and Long Short-Term Memory networks. The models were evaluated using accuracy, precision, recall, F1-score, and cumulative rank (cRank). The best performance was achieved for gender prediction with a cRank of 0.65 and an accuracy of 0.65, followed by moderate results for age, profession, and fame prediction. These results demonstrate that follower-based linguistic features can be effectively leveraged using machine learning and neural approaches for demographic prediction in Urdu, a low-resource language.
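A hedged sketch of one of the classical baselines compared in the study: TF-IDF features over follower tweets with logistic regression for the gender trait. The toy strings, labels, and character n-gram settings are purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, f1_score

tweets = ["follower tweet text one", "follower tweet text two",
          "follower tweet text three", "follower tweet text four"]   # stand-ins for Urdu tweets
labels = ["male", "female", "male", "female"]                        # celebrity gender labels

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams suit Urdu script
    LogisticRegression(max_iter=1000),
)
clf.fit(tweets, labels)
pred = clf.predict(tweets)
print(accuracy_score(labels, pred), f1_score(labels, pred, average="macro"))
```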
https://arxiv.org/abs/2510.11739
This review aims to conduct a comparative analysis of liquid neural networks (LNNs) and traditional recurrent neural networks (RNNs) and their variants, such as long short-term memory networks (LSTMs) and gated recurrent units (GRUs). The core dimensions of the analysis include model accuracy, memory efficiency, and generalization ability. By systematically reviewing existing research, this paper explores the basic principles, mathematical models, key characteristics, and inherent challenges of these neural network architectures in processing sequential data. Research findings reveal that LNNs, as emerging, biologically inspired, continuous-time dynamic neural networks, demonstrate significant potential in handling noisy, non-stationary data and achieving out-of-distribution (OOD) generalization. Additionally, some LNN variants outperform traditional RNNs in terms of parameter efficiency and computational speed. However, RNNs remain a cornerstone of sequence modeling due to their mature ecosystem and successful applications across various tasks. This review identifies the commonalities and differences between LNNs and RNNs, summarizes their respective shortcomings and challenges, and points out valuable directions for future research, particularly emphasizing the importance of improving the scalability of LNNs to promote their application in broader and more complex scenarios.
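A hedged sketch of a liquid time-constant (LTC)-style cell, the kind of continuous-time unit the review contrasts with discrete LSTM/GRU updates. This is an explicit-Euler discretization with illustrative dimensions; the published LTC formulation uses a fused ODE solver and a more elaborate parameterization.

```python
import torch
import torch.nn as nn

class LTCCell(nn.Module):
    def __init__(self, n_in, n_hidden, dt=0.1):
        super().__init__()
        self.lin = nn.Linear(n_in + n_hidden, n_hidden)
        self.tau = nn.Parameter(torch.ones(n_hidden))   # base time constants
        self.A = nn.Parameter(torch.zeros(n_hidden))    # equilibrium targets
        self.dt = dt

    def forward(self, u, x):
        f = torch.sigmoid(self.lin(torch.cat([u, x], dim=-1)))      # input-dependent gate
        # dx/dt = -(1/tau + f) * x + f * A : the effective time constant depends on the input
        dxdt = -(1.0 / torch.abs(self.tau) + f) * x + f * self.A
        return x + self.dt * dxdt                                    # explicit Euler step

cell = LTCCell(n_in=3, n_hidden=16)
x = torch.zeros(4, 16)
for t in range(20):                       # unroll over a short input sequence
    x = cell(torch.randn(4, 3), x)
```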
https://arxiv.org/abs/2510.07578
Solar Proton Events (SPEs) cause significant radiation hazards to satellites, astronauts, and technological systems. Accurate forecasting of their proton flux time profiles is crucial for early warnings and mitigation. This paper explores deep learning sequence-to-sequence (seq2seq) models based on Long Short-Term Memory networks to predict 24-hour proton flux profiles following SPE onsets. We used a dataset of 40 well-connected SPEs (1997-2017) observed by NOAA GOES, each associated with a ≥M-class western-hemisphere solar flare and undisturbed proton flux profiles. Using 4-fold stratified cross-validation, we evaluate seq2seq model configurations (varying hidden units and embedding dimensions) under multiple forecasting scenarios: (i) proton-only input vs. combined proton+X-ray input, (ii) original flux data vs. trend-smoothed data, and (iii) autoregressive vs. one-shot forecasting. Our major results are as follows: First, one-shot forecasting consistently yields lower error than autoregressive prediction, avoiding the error accumulation seen in iterative approaches. Second, on the original data, proton-only models outperform proton+X-ray models. However, with trend-smoothed data, this gap narrows or even reverses in favor of proton+X-ray models. Third, trend-smoothing significantly enhances the performance of proton+X-ray models by mitigating fluctuations in the X-ray channel. Fourth, while models trained on trend-smoothed data perform best on average, the best-performing model was trained on original data, suggesting that architectural choices can sometimes outweigh the benefits of data preprocessing.
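A minimal sketch contrasting the two decoding modes compared in the paper: one-shot (predict the full 24-hour flux profile in a single step) versus autoregressive (feed each prediction back in, where errors can accumulate). Sizes, names, and the simplification of holding the X-ray channel fixed during roll-out are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Seq2SeqFlux(nn.Module):
    def __init__(self, n_in=2, hidden=32, horizon=24):
        super().__init__()
        self.encoder = nn.LSTM(n_in, hidden, batch_first=True)
        self.one_shot_head = nn.Linear(hidden, horizon)   # one-shot: whole profile at once
        self.step_head = nn.Linear(hidden, 1)             # autoregressive: one hour at a time
        self.horizon = horizon

    def forward(self, x, autoregressive=False):           # x: (B, T, n_in) proton(+X-ray) history
        out, (h, c) = self.encoder(x)
        if not autoregressive:
            return self.one_shot_head(h[-1])               # (B, horizon)
        preds, step = [], x[:, -1:, :]
        for _ in range(self.horizon):                      # iterative roll-out
            out, (h, c) = self.encoder(step, (h, c))
            y = self.step_head(out[:, -1])                 # (B, 1)
            preds.append(y)
            step = torch.cat([y.unsqueeze(1), step[:, :, 1:]], dim=-1)  # reuse last X-ray value
        return torch.cat(preds, dim=1)                     # (B, horizon)

model = Seq2SeqFlux()
hist = torch.randn(4, 12, 2)                               # 12 h of proton + X-ray input
print(model(hist).shape, model(hist, autoregressive=True).shape)
```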
https://arxiv.org/abs/2510.05399
This study applies a range of forecasting techniques, including ARIMA, Prophet, Long Short-Term Memory networks (LSTM), Temporal Convolutional Networks (TCN), and XGBoost, to model and predict Russian equipment losses during the ongoing war in Ukraine. Drawing on daily and monthly open-source intelligence (OSINT) data from WarSpotting, we aim to assess trends in attrition, evaluate model performance, and estimate future loss patterns through the end of 2025. Our findings show that deep learning models, particularly TCN and LSTM, produce stable and consistent forecasts, especially under conditions of high temporal granularity. By comparing different model architectures and input structures, this study highlights the importance of ensemble forecasting in conflict modeling, and the value of publicly available OSINT data in quantifying material degradation over time.
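A hedged sketch of the ensemble idea: fit several forecasters on a daily loss series and average their projections. The synthetic series and the two toy baselines below are illustrative stand-ins for the ARIMA/Prophet/LSTM/TCN/XGBoost models actually compared in the study.

```python
import numpy as np

rng = np.random.default_rng(0)
daily_losses = np.cumsum(rng.poisson(15, size=900))        # synthetic cumulative loss series

def linear_trend_forecast(y, horizon):
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, 1)                 # ordinary least-squares trend
    return intercept + slope * (len(y) + np.arange(horizon))

def drift_forecast(y, horizon, window=30):
    drift = (y[-1] - y[-window]) / window                   # average recent daily increment
    return y[-1] + drift * (1 + np.arange(horizon))

horizon = 120                                               # forecast ~4 months ahead
forecasts = np.vstack([
    linear_trend_forecast(daily_losses, horizon),
    drift_forecast(daily_losses, horizon),
])
ensemble = forecasts.mean(axis=0)                           # simple ensemble average
print(ensemble[:5])
```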
https://arxiv.org/abs/2509.07813
Speech Emotion Recognition (SER) presents a significant yet persistent challenge in human-computer interaction. While deep learning has advanced spoken language processing, achieving high performance on limited datasets remains a critical hurdle. This paper confronts this issue by developing and evaluating a suite of machine learning models, including Support Vector Machines (SVMs), Long Short-Term Memory networks (LSTMs), and Convolutional Neural Networks (CNNs), for automated emotion classification in human speech. We demonstrate that by strategically employing transfer learning and innovative data augmentation techniques, our models can achieve impressive performance despite the constraints of a relatively small dataset. Our most effective model, a ResNet34 architecture, establishes a new performance benchmark on the combined RAVDESS and SAVEE datasets, attaining an accuracy of 66.7% and an F1 score of 0.631. These results underscore the substantial benefits of leveraging pre-trained models and data augmentation to overcome data scarcity, thereby paving the way for more robust and generalizable SER systems.
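A hedged sketch of the transfer-learning recipe described above: a ResNet34 backbone pretrained on ImageNet, re-headed for emotion classes and fed (augmented) spectrogram "images". The class count, input size, and the simple time-mask augmentation are illustrative assumptions rather than the paper's exact pipeline.

```python
import torch
import torch.nn as nn
from torchvision import models

n_emotions = 8                                   # e.g., the RAVDESS label set (illustrative)
net = models.resnet34(weights=models.ResNet34_Weights.DEFAULT)   # downloads ImageNet weights
net.fc = nn.Linear(net.fc.in_features, n_emotions)              # replace the ImageNet head

def augment(spec, max_width=20):
    """Crude time masking on a (freq, time) spectrogram, SpecAugment-style."""
    spec = spec.clone()
    t0 = torch.randint(0, spec.shape[-1] - max_width, (1,)).item()
    spec[..., t0:t0 + max_width] = 0.0
    return spec

spec = torch.rand(1, 128, 224)                   # one mel-spectrogram (freq x time)
x = augment(spec).repeat(3, 1, 1).unsqueeze(0)   # tile to 3 channels for the RGB backbone
logits = net(x)                                  # (1, n_emotions)
```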
https://arxiv.org/abs/2509.00077
Visual reasoning is critical for a wide range of computer vision tasks that go beyond surface-level object detection and classification. Despite notable advances in relational, symbolic, temporal, causal, and commonsense reasoning, existing surveys often address these directions in isolation, lacking a unified analysis and comparison across reasoning types, methodologies, and evaluation protocols. This survey aims to address this gap by categorizing visual reasoning into five major types (relational, symbolic, temporal, causal, and commonsense) and systematically examining their implementation through architectures such as graph-based models, memory networks, attention mechanisms, and neuro-symbolic systems. We review evaluation protocols designed to assess functional correctness, structural consistency, and causal validity, and critically analyze their limitations in terms of generalizability, reproducibility, and explanatory power. Beyond evaluation, we identify key open challenges in visual reasoning, including scalability to complex scenes, deeper integration of symbolic and neural paradigms, the lack of comprehensive benchmark datasets, and reasoning under weak supervision. Finally, we outline a forward-looking research agenda for next-generation vision systems, emphasizing that bridging perception and reasoning is essential for building transparent, trustworthy, and cross-domain adaptive AI systems, particularly in critical domains such as autonomous driving and medical diagnostics.
https://arxiv.org/abs/2508.10523
We introduce a novel class of untrained Recurrent Neural Networks (RNNs) within the Reservoir Computing (RC) paradigm, called Residual Reservoir Memory Networks (ResRMNs). ResRMN combines a linear memory reservoir with a non-linear reservoir, where the latter is based on residual orthogonal connections along the temporal dimension for enhanced long-term propagation of the input. The resulting reservoir state dynamics are studied through the lens of linear stability analysis, and we investigate diverse configurations for the temporal residual connections. The proposed approach is empirically assessed on time-series and pixel-level 1-D classification tasks. Our experimental results highlight the advantages of the proposed approach over other conventional RC models.
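A loose sketch, under stated assumptions, of the flavor of such a residual reservoir update: a fixed linear memory reservoir feeds a fixed non-linear reservoir whose state is carried forward through an orthogonal residual connection, and only a linear readout on the collected states would be trained. The scalings and sizes are illustrative, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_mem, n_res, T = 1, 50, 100, 200

Wm = 0.9 * np.linalg.qr(rng.standard_normal((n_mem, n_mem)))[0]   # linear memory dynamics
Vm = rng.standard_normal((n_mem, n_in)) * 0.5
O = np.linalg.qr(rng.standard_normal((n_res, n_res)))[0]          # orthogonal residual map
W = rng.standard_normal((n_res, n_res)) * 0.05
V = rng.standard_normal((n_res, n_in + n_mem)) * 0.5

u = rng.standard_normal((T, n_in))                                 # input sequence
m = np.zeros(n_mem)
x = np.zeros(n_res)
states = []
for t in range(T):
    m = Wm @ m + Vm @ u[t]                                         # untrained linear memory
    x = O @ x + np.tanh(W @ x + V @ np.concatenate([u[t], m]))     # residual non-linear reservoir
    states.append(np.concatenate([x, m]))
states = np.array(states)                                          # features for a ridge readout
```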
https://arxiv.org/abs/2508.09925
The evolution towards future generations of mobile systems and fixed wireless networks is primarily driven by the urgency to support high-bandwidth and low-latency services across various vertical sectors. This endeavor is fueled by smartphones as well as technologies like the industrial internet of things, extended reality (XR), and human-to-machine (H2M) collaborations for fostering industrial and social revolutions like Industry 4.0/5.0 and Society 5.0. To ensure an ideal immersive experience and avoid cyber-sickness for users in all the aforementioned usage scenarios, it is typically challenging to synchronize XR content from a remote machine to a human collaborator according to their head movements across a large geographic span in real time over communication networks. Thus, we propose a novel H2M collaboration scheme where the human's head movements are predicted ahead of time with highly accurate models like bidirectional long short-term memory networks to orient the machine's camera in advance. We validate that XR frame size varies in accordance with the human's head movements and predict the corresponding bandwidth requirements from the machine's camera to propose a human-machine coordinated dynamic bandwidth allocation (HMC-DBA) scheme. Through extensive simulations, we show that end-to-end latency and jitter requirements of XR frames are satisfied with much lower bandwidth consumption over enterprise networks like Fiber-To-The-Room-Business. Furthermore, we show that better efficiency in network resource utilization is achieved by employing our proposed HMC-DBA over state-of-the-art schemes.
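A hedged sketch of the prediction step: a bidirectional LSTM maps a short history of head orientations (yaw, pitch, roll) to a look-ahead orientation, which would then drive camera steering and the bandwidth request in an HMC-DBA-style scheme. The horizon, sizes, and the simple motion-to-bandwidth mapping are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HeadMotionPredictor(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.bilstm = nn.LSTM(3, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 3)

    def forward(self, history):              # history: (B, T, 3) past yaw/pitch/roll samples
        h, _ = self.bilstm(history)
        return self.out(h[:, -1])            # predicted future orientation: (B, 3)

predictor = HeadMotionPredictor()
future_pose = predictor(torch.randn(1, 30, 3))            # 30 past samples -> one look-ahead pose
# An (assumed) mapping from predicted angular magnitude to a bandwidth grant:
angular_speed = torch.linalg.norm(future_pose).item()
bandwidth_mbps = 30.0 + 20.0 * min(angular_speed, 1.0)    # faster motion -> larger XR frames
```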
https://arxiv.org/abs/2507.15254
The proliferation of large language models (LLMs) has significantly transformed the digital information landscape, making it increasingly challenging to distinguish between human-written and LLM-generated content. Detecting LLM-generated information is essential for preserving trust on digital platforms (e.g., social media and e-commerce sites) and preventing the spread of misinformation, a topic that has garnered significant attention in IS research. However, current detection methods, which primarily focus on identifying content generated by specific LLMs in known domains, face challenges in generalizing to new (i.e., unseen) LLMs and domains. This limitation reduces their effectiveness in real-world applications, where the number of LLMs is rapidly multiplying and content spans a vast array of domains. In response, we introduce a general LLM detector (GLD) that combines a twin memory networks design and a theory-guided detection generalization module to detect LLM-generated information across unseen LLMs and domains. Using real-world datasets, we conduct extensive empirical evaluations and case studies to demonstrate the superiority of GLD over state-of-the-art detection methods. The study has important academic and practical implications for digital platforms and LLMs.
https://arxiv.org/abs/2506.21589
Grain growth simulation is crucial for predicting metallic material microstructure evolution during annealing and the resulting final mechanical properties, but traditional partial differential equation-based methods are computationally expensive, creating bottlenecks in materials design and manufacturing. In this work, we introduce a machine learning framework that combines a Convolutional Long Short-Term Memory network with an Autoencoder to efficiently predict grain growth evolution. Our approach captures both spatial and temporal aspects of grain evolution while encoding high-dimensional grain structure data into a compact latent space for pattern learning, enhanced by a novel composite loss function combining Mean Squared Error, Structural Similarity Index Measure, and Boundary Preservation to maintain the structural integrity of grain boundary topology in the prediction. Results demonstrate that our machine learning approach accelerates grain growth prediction by a factor of up to 89, reducing computation time from 10 minutes to approximately 10 seconds while maintaining high-fidelity predictions. The best model (S-30-30) achieved a structural similarity score of 86.71% and a mean grain size error of just 0.07%. All models accurately captured grain boundary topology, morphology, and size distributions. This approach enables rapid microstructural prediction for applications where conventional simulations are prohibitively time-consuming, potentially accelerating innovation in materials science and manufacturing.
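A hedged sketch of a composite loss in the spirit described above: MSE plus a simplified, global SSIM term plus a boundary term that compares spatial gradients, the latter being an illustrative stand-in for the paper's boundary-preservation component. The weights are arbitrary.

```python
import torch
import torch.nn.functional as F

def global_ssim(x, y, c1=0.01**2, c2=0.03**2):
    # Simplified SSIM computed over the whole tensor (no sliding window).
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

def boundary_term(x, y):
    # Compare horizontal/vertical finite differences (a crude grain-boundary proxy).
    gx = lambda z: torch.abs(z[..., :, 1:] - z[..., :, :-1])
    gy = lambda z: torch.abs(z[..., 1:, :] - z[..., :-1, :])
    return F.l1_loss(gx(x), gx(y)) + F.l1_loss(gy(x), gy(y))

def composite_loss(pred, target, w_mse=1.0, w_ssim=0.5, w_bnd=0.5):
    return (w_mse * F.mse_loss(pred, target)
            + w_ssim * (1.0 - global_ssim(pred, target))
            + w_bnd * boundary_term(pred, target))

pred = torch.rand(2, 1, 64, 64, requires_grad=True)   # predicted grain-structure frames
target = torch.rand(2, 1, 64, 64)
composite_loss(pred, target).backward()
```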
https://arxiv.org/abs/2505.05354
Video Object Segmentation (VOS) is one of the most fundamental and challenging tasks in computer vision and has a wide range of applications. Most existing methods rely on spatiotemporal memory networks to extract frame-level features and have achieved promising results on commonly used datasets. However, these methods often struggle in more complex real-world scenarios. This paper addresses this issue, aiming to achieve accurate segmentation of video objects in challenging scenes. We propose fine-tuning VOS (FVOS), optimizing existing methods for specific datasets through tailored training. Additionally, we introduce a morphological post-processing strategy to address the issue of excessively large gaps between adjacent objects in single-model predictions. Finally, we apply a voting-based fusion method on multi-scale segmentation results to generate the final output. Our approach achieves J&F scores of 76.81% and 83.92% during the validation and testing stages, respectively, securing third place overall in the MOSE Track of the 4th PVUW challenge 2025.
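A hedged sketch of the post-processing described above: a morphological closing to tighten the gaps around each predicted object mask, followed by pixel-wise majority voting over masks predicted at several scales. The structure size and the voting threshold are illustrative choices.

```python
import numpy as np
from scipy import ndimage

def close_mask(mask, size=5):
    """Morphological closing to shrink spurious gaps between adjacent objects."""
    structure = np.ones((size, size), dtype=bool)
    return ndimage.binary_closing(mask.astype(bool), structure=structure)

def vote_fusion(masks):
    """Majority vote over per-scale binary masks of identical resolution."""
    stack = np.stack([m.astype(np.uint8) for m in masks], axis=0)
    return stack.sum(axis=0) >= (len(masks) // 2 + 1)

# Toy multi-scale predictions (already resized back to a common resolution):
rng = np.random.default_rng(0)
masks = [rng.random((128, 128)) > 0.5 for _ in range(3)]
final = vote_fusion([close_mask(m) for m in masks])
```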
https://arxiv.org/abs/2504.09507
Mainstream visual object tracking frameworks predominantly rely on template matching paradigms. Their performance heavily depends on the quality of template features, which becomes increasingly challenging to maintain in complex scenarios involving target deformation, occlusion, and background clutter. While existing spatiotemporal memory-based trackers emphasize memory capacity expansion, they lack effective mechanisms for dynamic feature selection and adaptive fusion. To address this gap, we propose a Dynamic Attention Mechanism in Spatiotemporal Memory Network (DASTM) with two key innovations: 1) A differentiable dynamic attention mechanism that adaptively adjusts channel-spatial attention weights by analyzing spatiotemporal correlations between the templates and memory features; 2) A lightweight gating network that autonomously allocates computational resources based on target motion states, prioritizing high-discriminability features in challenging scenarios. Extensive evaluations on OTB-2015, VOT 2018, LaSOT, and GOT-10K benchmarks demonstrate our DASTM's superiority, achieving state-of-the-art performance in success rate, robustness, and real-time efficiency, thereby offering a novel solution for real-time tracking in complex environments.
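A hedged sketch of the kind of channel-spatial re-weighting the abstract describes: memory features are re-weighted by channel and spatial attention maps computed from their correlation with the template features, and a small gate scales the result. All module sizes and the exact form of the correlation are illustrative assumptions, not the DASTM design.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.channel_fc = nn.Sequential(nn.Linear(channels, channels // 4),
                                        nn.ReLU(), nn.Linear(channels // 4, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.gate = nn.Sequential(nn.Linear(channels, 1), nn.Sigmoid())

    def forward(self, template, memory):                      # both: (B, C, H, W)
        corr = (template * memory).mean(dim=(2, 3))           # (B, C) channel-wise correlation
        ch_w = torch.sigmoid(self.channel_fc(corr))[:, :, None, None]
        pooled = torch.cat([memory.mean(1, keepdim=True),
                            memory.amax(1, keepdim=True)], dim=1)
        sp_w = torch.sigmoid(self.spatial_conv(pooled))        # (B, 1, H, W) spatial weights
        g = self.gate(corr)[:, :, None, None]                  # scalar gate per sample
        return memory * ch_w * sp_w * g

attn = ChannelSpatialAttention()
fused = attn(torch.randn(2, 256, 16, 16), torch.randn(2, 256, 16, 16))
```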
https://arxiv.org/abs/2503.16768
Video object segmentation is crucial for the efficient analysis of complex medical video data, yet it faces significant challenges in data availability and annotation. We introduce the task of one-shot medical video object segmentation, which requires separating foreground and background pixels throughout a video given only the mask annotation of the first frame. To address this problem, we propose a temporal contrastive memory network comprising image and mask encoders to learn feature representations, a temporal contrastive memory bank that aligns embeddings from adjacent frames while pushing apart distant ones to explicitly model inter-frame relationships and stores these features, and a decoder that fuses encoded image features and memory readouts for segmentation. We also collect a diverse, multi-source medical video dataset spanning various modalities and anatomies to benchmark this task. Extensive experiments demonstrate state-of-the-art performance in segmenting both seen and unseen structures from a single exemplar, showing ability to generalize from scarce labels. This highlights the potential to alleviate annotation burdens for medical video analysis. Code is available at this https URL.
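A hedged sketch of the temporal contrastive idea: frame embeddings that are temporal neighbours are pulled together while embeddings far apart in time are pushed away, before being written to a memory bank. The margin loss below is an illustrative formulation, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(frame_emb, margin=0.5):
    """frame_emb: (T, D) one embedding per frame of a video clip."""
    z = F.normalize(frame_emb, dim=-1)
    sim = z @ z.t()                                   # (T, T) cosine similarities
    T = z.shape[0]
    idx = torch.arange(T)
    adjacent = (idx[None, :] - idx[:, None]).abs() == 1
    distant = (idx[None, :] - idx[:, None]).abs() >= 4
    pull = (1.0 - sim[adjacent]).mean()               # adjacent frames: drive similarity up
    push = F.relu(sim[distant] - margin).mean()       # distant frames: keep below a margin
    return pull + push

emb = torch.randn(10, 128, requires_grad=True)        # 10 frames, 128-d embeddings
temporal_contrastive_loss(emb).backward()
```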
https://arxiv.org/abs/2503.14979
The artificial lateral line (ALL) is a bioinspired flow sensing system for underwater robots, comprising distributed flow sensors. The ALL has been successfully applied to detect the undulatory flow fields generated by body undulation and tail-flapping of bioinspired robotic fish. However, its feasibility and performance in sensing the undulatory flow fields produced by human leg kicks during swimming have not been systematically tested and studied. This paper presents a novel sensing framework to investigate the undulatory flow field generated by a swimmer's leg kicks, leveraging bioinspired ALL sensing. To evaluate the feasibility of using the ALL system for sensing the undulatory flow fields generated by swimmer leg kicks, this paper designs an experimental platform integrating an ALL system and a lab-fabricated human leg model. To enhance the accuracy of flow sensing, this paper proposes a feature extraction method that dynamically fuses time-domain and time-frequency characteristics. Specifically, time-domain features are extracted using one-dimensional convolutional neural networks and bidirectional long short-term memory networks (1DCNN-BiLSTM), while time-frequency features are extracted using the short-time Fourier transform and two-dimensional convolutional neural networks (STFT-2DCNN). These features are then dynamically fused based on attention mechanisms to achieve accurate sensing of the undulatory flow field. Furthermore, extensive experiments are conducted to test various scenarios inspired by human swimming, such as leg kick pattern recognition and kicking-leg localization, achieving satisfactory results.
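A hedged sketch of the dual-branch idea: a 1-D CNN + BiLSTM branch on the raw sensor time series, a 2-D CNN branch on its STFT magnitude, and an attention weight that mixes the two feature vectors before classification. The sizes, STFT parameters, and head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualBranchFlowSensor(nn.Module):
    def __init__(self, n_classes=4, hidden=32):
        super().__init__()
        self.cnn1d = nn.Conv1d(1, 16, kernel_size=7, padding=3)
        self.bilstm = nn.LSTM(16, hidden, batch_first=True, bidirectional=True)
        self.cnn2d = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                                   nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                                   nn.Linear(8 * 16, 2 * hidden))
        self.attn = nn.Linear(4 * hidden, 2)            # dynamic fusion weights
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                                # x: (B, T) one sensor channel
        t = torch.relu(self.cnn1d(x.unsqueeze(1))).transpose(1, 2)
        t, _ = self.bilstm(t)
        time_feat = t[:, -1]                             # (B, 2*hidden) time-domain branch
        spec = torch.stft(x, n_fft=64, window=torch.hann_window(64),
                          return_complex=True).abs().unsqueeze(1)
        freq_feat = self.cnn2d(spec)                     # (B, 2*hidden) time-frequency branch
        w = torch.softmax(self.attn(torch.cat([time_feat, freq_feat], dim=-1)), dim=-1)
        fused = w[:, :1] * time_feat + w[:, 1:] * freq_feat   # attention-weighted fusion
        return self.head(fused)                          # kick-pattern logits

model = DualBranchFlowSensor()
logits = model(torch.randn(2, 512))                      # 2 windows of 512 samples
```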
https://arxiv.org/abs/2503.07312
Chord recognition serves as a critical task in music information retrieval due to the abstract and descriptive nature of chords in music analysis. While audio chord recognition systems have achieved significant accuracy for small vocabularies (e.g., major/minor chords), large-vocabulary chord recognition remains a challenging problem. This complexity also arises from the inherent long-tail distribution of chords, where rare chord types are underrepresented in most datasets, leading to insufficient training samples. Effective chord recognition requires leveraging contextual information from audio sequences, yet existing models, such as combinations of convolutional neural networks, bidirectional long short-term memory networks, and bidirectional transformers, face limitations in capturing long-term dependencies and exhibit suboptimal performance on large-vocabulary chord recognition tasks. This work proposes ChordFormer, a novel conformer-based architecture designed to tackle structural chord recognition (e.g., triads, bass, sevenths) for large vocabularies. ChordFormer leverages conformer blocks that integrate convolutional neural networks with transformers, thus enabling the model to capture both local patterns and global dependencies effectively. By addressing challenges such as class imbalance through a reweighted loss function and structured chord representations, ChordFormer outperforms state-of-the-art models, achieving a 2% improvement in frame-wise accuracy and a 6% increase in class-wise accuracy on large-vocabulary chord datasets. Furthermore, ChordFormer excels in handling class imbalance, providing robust and balanced recognition across chord types. This approach bridges the gap between theoretical music knowledge and practical applications, advancing the field of large-vocabulary chord recognition.
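A hedged sketch of the class-reweighting step: per-chord-class weights derived from inverse label frequency, so rare chord qualities contribute more to the loss. The weighting formula and the toy counts are illustrative; the paper's exact reweighting scheme may differ.

```python
import torch
import torch.nn as nn

n_classes = 170                                         # large-vocabulary chord set (illustrative)
class_counts = torch.randint(5, 5000, (n_classes,)).float()   # long-tailed label counts
weights = 1.0 / class_counts
weights = weights / weights.sum() * n_classes           # normalize to average weight of 1.0

criterion = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(32, n_classes, requires_grad=True)        # frame-wise chord logits
labels = torch.randint(0, n_classes, (32,))
criterion(logits, labels).backward()
```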
https://arxiv.org/abs/2502.11840
Vehicular communication systems face significant challenges due to high mobility and rapidly changing environments, which affect the channel over which the signals travel. To address these challenges, neural network (NN)-based channel estimation methods have been suggested. These methods are primarily trained on high signal-to-noise ratio (SNR) data, under the assumption that training a NN in less noisy conditions can result in good generalisation. This study examines the effectiveness of training NN-based channel estimators on mixed-SNR datasets compared to training solely on high-SNR datasets, as is done in several related works. Estimators evaluated in this work include an architecture that uses convolutional layers and self-attention mechanisms; a method that employs temporal convolutional networks and data pilot-aided estimation; two methods that combine classical methods with multilayer perceptrons; and the current state-of-the-art model that combines Long Short-Term Memory networks with data pilot-aided and temporal averaging methods as post-processing. Our results indicate that using only high-SNR data for training is not always optimal, and the SNR range in the training dataset should be treated as a hyperparameter that can be adjusted for better performance. This is illustrated by the better performance of some models in low-SNR conditions when trained on the mixed-SNR dataset, as opposed to when trained exclusively on high-SNR data.
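A minimal sketch of the dataset choice under discussion: corrupting clean received symbols with noise drawn either at a single high SNR or from a mixed range, so that the training SNR range can be tuned like any other hyperparameter. The values, shapes, and toy BPSK grids are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_awgn(x, snr_db):
    signal_power = np.mean(np.abs(x) ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.sqrt(noise_power / 2) * (rng.standard_normal(x.shape)
                                        + 1j * rng.standard_normal(x.shape))
    return x + noise

clean = (rng.integers(0, 2, (1000, 64)) * 2 - 1).astype(complex)   # toy BPSK resource grids

high_snr_only = np.stack([add_awgn(s, snr_db=30) for s in clean])
mixed_snr = np.stack([add_awgn(s, snr_db=rng.uniform(0, 30)) for s in clean])
# The (0, 30) dB range above is the "hyperparameter" the study suggests tuning.
```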
https://arxiv.org/abs/2502.06824
Accurate detection of traffic anomalies is crucial for effective urban traffic management and congestion mitigation. We use the Spatiotemporal Generative Adversarial Network (STGAN) framework combining Graph Neural Networks and Long Short-Term Memory networks to capture complex spatial and temporal dependencies in traffic data. We apply STGAN to real-time, minute-by-minute observations from 42 traffic cameras across Gothenburg, Sweden, collected over several months in 2020. The images are processed to compute a flow metric representing vehicle density, which serves as input for the model. Training is conducted on data from April to November 2020, and validation is performed on a separate dataset from November 14 to 23, 2020. Our results demonstrate that the model effectively detects traffic anomalies with high precision and low false positive rates. The detected anomalies include camera signal interruptions, visual artifacts, and extreme weather conditions affecting traffic flow.
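A hedged sketch of the detection step only: once a spatiotemporal model predicts the expected flow metric for each camera and minute, anomalies are flagged where the observation deviates from the prediction by more than a few robust standard deviations. The thresholds and toy data are illustrative and not the STGAN scoring itself.

```python
import numpy as np

rng = np.random.default_rng(0)
observed = rng.normal(0.4, 0.05, size=(42, 1440))        # 42 cameras, one day of minutes
predicted = observed + rng.normal(0, 0.01, size=observed.shape)   # stand-in model output
observed[7, 600:660] = 0.0                               # simulate a camera signal dropout

residual = observed - predicted
mad = np.median(np.abs(residual - np.median(residual)))  # robust scale estimate
threshold = 5 * 1.4826 * mad
anomalies = np.argwhere(np.abs(residual) > threshold)    # (camera, minute) pairs flagged
```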
https://arxiv.org/abs/2502.01391
A crucial step to efficiently integrate Whole Slide Images (WSIs) in computational pathology is assigning a single high-quality feature vector, i.e., one embedding, to each WSI. With the existence of many pre-trained deep neural networks and the emergence of foundation models, extracting embeddings for sub-images (i.e., tiles or patches) is straightforward. However, for WSIs, given their high resolution and gigapixel nature, inputting them into existing GPUs as a single image is not feasible. As a result, WSIs are usually split into many patches. By feeding each patch to a pre-trained model, each WSI can then be represented by a set of patches and, hence, a set of embeddings. In such a setup, WSI representation learning reduces to set representation learning, where for each WSI we have access to a set of patch embeddings. To obtain a single embedding from a set of patch embeddings for each WSI, multiple set-based learning schemes have been proposed in the literature. In this paper, we evaluate the WSI search performance of multiple recently developed aggregation techniques (mainly set representation learning techniques), including simple average or max pooling operations, Deep Sets, Memory networks, Focal attention, Gaussian Mixture Model (GMM) Fisher Vector, and deep sparse and binary Fisher Vector, on four different primary sites from TCGA: bladder, breast, kidney, and colon. Further, we benchmark the search performance of these methods against the median of minimum distances of patch embeddings, a non-aggregating approach used for WSI retrieval.
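A minimal sketch contrasting the two retrieval styles benchmarked above: (i) aggregate each WSI's patch embeddings into one vector (here, simple mean pooling) and rank by cosine distance, versus (ii) the non-aggregating median of minimum distances between the two patch-embedding sets. The embedding sizes and toy data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
wsi_a = rng.standard_normal((120, 128))      # patch embeddings for slide A
wsi_b = rng.standard_normal((150, 128))      # patch embeddings for slide B

def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# (i) set -> single embedding, then one distance per slide pair
dist_pooled = cosine_distance(wsi_a.mean(axis=0), wsi_b.mean(axis=0))

# (ii) median of minimum patch-to-patch distances (no aggregation)
pairwise = np.linalg.norm(wsi_a[:, None, :] - wsi_b[None, :, :], axis=-1)  # (120, 150)
dist_median_min = np.median(pairwise.min(axis=1))

print(dist_pooled, dist_median_min)
```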
https://arxiv.org/abs/2501.17822