Fine-grained labor market analysis increasingly relies on mapping unstructured job advertisements to standardized skill taxonomies such as ESCO. This mapping is naturally formulated as an Extreme Multi-Label Classification (XMLC) problem, but supervised solutions are constrained by the scarcity and cost of large-scale, taxonomy-aligned annotations -- especially in non-English settings where job-ad language diverges substantially from formal skill definitions. We propose a zero-shot skill extraction framework that eliminates the need for manually labeled job-ad training data. The framework uses a Large Language Model (LLM) to synthesize training instances from ESCO definitions, and introduces hierarchically constrained multi-skill generation based on ESCO Level-2 categories to improve semantic coherence in multi-label contexts. On top of the synthetic corpus, we train a contrastive bi-encoder that aligns job-ad sentences with ESCO skill descriptions in a shared embedding space; the encoder augments a BERT backbone with BiLSTM and attention pooling to better model long, information-dense requirement statements. An upstream RoBERTa-based binary filter removes non-skill sentences to improve end-to-end precision. Experiments show that (i) hierarchy-conditioned generation improves both fluency and discriminability relative to unconstrained pairing, and (ii) the resulting multi-label model transfers effectively to real-world Chinese job advertisements, achieving strong zero-shot retrieval performance (F1@5 = 0.72) and outperforming TF-IDF and standard BERT baselines. Overall, the proposed pipeline provides a scalable, data-efficient pathway for automated skill coding in labor economics and workforce analytics.
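The retrieval step and the F1@5 metric can be sketched as follows; the embeddings, skill count, and dimensionality are toy stand-ins for the trained bi-encoder and the ESCO taxonomy, not the paper's model:

```python
import numpy as np

def top_k_skills(sentence_vec, skill_mat, k=5):
    """Rank skill vectors by cosine similarity to a sentence vector; return top-k indices."""
    s = sentence_vec / np.linalg.norm(sentence_vec)
    M = skill_mat / np.linalg.norm(skill_mat, axis=1, keepdims=True)
    sims = M @ s
    return list(np.argsort(-sims)[:k])

def f1_at_k(predicted, gold, k=5):
    """F1@k: harmonic mean of precision@k and recall over the gold label set."""
    hits = len(set(predicted[:k]) & set(gold))
    if hits == 0:
        return 0.0
    p = hits / k
    r = hits / len(gold)
    return 2 * p * r / (p + r)

# Toy setup: 6 "skill" embeddings; the job-ad sentence lies near skill 2.
rng = np.random.default_rng(0)
skills = rng.normal(size=(6, 8))
sentence = skills[2] + 0.01 * rng.normal(size=8)
pred = top_k_skills(sentence, skills, k=5)
```

With a single relevant skill, F1@5 is capped at 1/3 (precision at most 1/5), which is why multi-label gold sets matter for this metric.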
https://arxiv.org/abs/2601.09119
Accurately forecasting long-term atmospheric variables remains a defining challenge in meteorological science due to the chaotic nature of atmospheric systems. Temperature data represents a complex superposition of deterministic cyclical climate forces and stochastic, short-term fluctuations. While planetary mechanics drive predictable seasonal periodicities, rapid meteorological changes such as thermal variations, pressure anomalies, and humidity shifts introduce nonlinear volatilities that defy simple extrapolation. Historically, the Seasonal Autoregressive Integrated Moving Average (SARIMA) model has been the standard for modeling historical weather data, prized for capturing linear seasonal trends. However, SARIMA operates under strict assumptions of stationarity, failing to capture abrupt, nonlinear transitions. This leads to systematic residual errors, manifesting as the under-prediction of sudden spikes or the over-smoothing of declines. Conversely, Deep Learning paradigms, specifically Long Short-Term Memory (LSTM) networks, demonstrate exceptional efficacy in handling intricate time-series data. By utilizing memory gates, LSTMs learn complex nonlinear dependencies. Yet, LSTMs face instability in open-loop forecasting; without ground truth feedback, minor deviations compound recursively, causing divergence. To resolve these limitations, we propose a Hybrid SARIMA-LSTM architecture. This framework employs a residual-learning strategy to decompose temperature into a predictable climate component and a nonlinear weather component. The SARIMA unit models the robust, long-term seasonal trend, while the LSTM is trained exclusively on the residuals -- the nonlinear errors SARIMA fails to capture. By fusing statistical stability with neural plasticity, this hybrid approach minimizes error propagation and enhances long-horizon accuracy.
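A minimal sketch of the residual-learning decomposition, with a seasonal-mean climatology standing in for SARIMA and an AR(1) recursion standing in for the residual LSTM (both stand-ins are assumptions for illustration, not the paper's models):

```python
import numpy as np

def seasonal_fit(y, period):
    """Climatology stand-in for SARIMA: the mean of each phase of the cycle."""
    phases = np.arange(len(y)) % period
    clim = np.array([y[phases == p].mean() for p in range(period)])
    return clim, phases

def hybrid_forecast(y, period, horizon):
    clim, phases = seasonal_fit(y, period)
    seasonal = clim[phases]
    resid = y - seasonal                      # nonlinear "weather" component
    # AR(1) stand-in for the residual model: r_{t+1} ~ phi * r_t
    phi = np.dot(resid[:-1], resid[1:]) / (np.dot(resid[:-1], resid[:-1]) + 1e-12)
    r = resid[-1]
    out = []
    for h in range(1, horizon + 1):
        r = phi * r                           # recursive residual forecast
        out.append(clim[(len(y) - 1 + h) % period] + r)
    return np.array(out)

# Toy monthly series: pure seasonal cycle, so residuals are ~0 and the
# forecast reduces to the climatology.
t = np.arange(48)
y = 10 + 5 * np.sin(2 * np.pi * t / 12)
fc = hybrid_forecast(y, period=12, horizon=12)
```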
https://arxiv.org/abs/2601.07951
Few-shot remote sensing image classification is challenging due to limited labeled samples and high variability in land-cover types. We propose a reconstruction-guided few-shot network (RGFS-Net) that enhances generalization to unseen classes while preserving consistency for seen categories. Our method incorporates a masked image reconstruction task, where parts of the input are occluded and reconstructed to encourage semantically rich feature learning. This auxiliary task strengthens spatial understanding and improves class discrimination under low-data settings. We evaluate our approach on the EuroSAT and PatternNet datasets under 1-shot and 5-shot protocols, where it consistently outperforms existing baselines. The proposed method is simple, effective, and compatible with standard backbones, offering a robust solution for few-shot remote sensing classification. Code is available at this https URL.
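The masked-reconstruction auxiliary task can be illustrated roughly as below; the patch size, mask ratio, and loss weight `lam` are hypothetical choices for the sketch, not values from the paper:

```python
import numpy as np

def mask_patches(img, patch=4, ratio=0.5, rng=None):
    """Occlude a random subset of non-overlapping patches; return (masked image, mask)."""
    rng = rng or np.random.default_rng(0)
    H, W = img.shape
    mask = np.zeros_like(img, dtype=bool)
    ph, pw = H // patch, W // patch
    idx = rng.permutation(ph * pw)[: int(ratio * ph * pw)]
    for i in idx:
        r, c = divmod(int(i), pw)
        mask[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = True
    out = img.copy()
    out[mask] = 0.0
    return out, mask

def joint_loss(cls_loss, recon, target, mask, lam=0.1):
    """Auxiliary objective: classification loss + weighted MSE on the masked pixels only."""
    rec_mse = np.mean((recon[mask] - target[mask]) ** 2)
    return cls_loss + lam * rec_mse

img = np.arange(64, dtype=float).reshape(8, 8) / 64.0
masked, m = mask_patches(img, patch=4, ratio=0.5)
```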
https://arxiv.org/abs/2601.07335
Drone light shows have emerged as a popular form of entertainment in recent years. However, several high-profile incidents involving large-scale drone failures -- where multiple drones simultaneously fall from the sky -- have raised safety and reliability concerns. To ensure robustness, we propose a drone parking algorithm designed specifically for multiple drone failures in drone light shows, aimed at mitigating the risk of cascading collisions by drone evacuation and enabling rapid recovery from failures by leveraging strategically placed hidden drones. Our algorithm integrates a Social LSTM model with attention mechanisms to predict the trajectories of failing drones and compute near-optimal evacuation paths that minimize the likelihood of surviving drones being hit by fallen drones. In the recovery phase, our system deploys hidden drones (operating with their LED lights turned off) to replace failed drones so that the drone light show can continue. Our experiments showed that our approach can greatly increase the robustness of a multi-drone system by leveraging deep learning to predict the trajectories of fallen drones.
https://arxiv.org/abs/2601.06728
Dominant sequence models like the Transformer represent structure implicitly through dense attention weights, incurring quadratic complexity. We propose RewriteNets, a novel neural architecture built on an alternative paradigm: explicit, parallel string rewriting. Each layer in a RewriteNet contains a set of learnable rules. For each position in an input sequence, the layer performs four operations: (1) fuzzy matching of rule patterns, (2) conflict resolution via a differentiable assignment operator to select non-overlapping rewrites, (3) application of the chosen rules to replace input segments with output segments of potentially different lengths, and (4) propagation of untouched tokens. While the discrete assignment of rules is non-differentiable, we employ a straight-through Gumbel-Sinkhorn estimator, enabling stable end-to-end training. We evaluate RewriteNets on algorithmic, compositional, and string manipulation tasks, comparing them against strong LSTM and Transformer baselines. Results show that RewriteNets excel at tasks requiring systematic generalization (achieving 98.7% accuracy on the SCAN benchmark's length split) and are computationally more efficient than Transformers. We also provide an analysis of learned rules and an extensive ablation study, demonstrating that this architecture presents a promising direction for sequence modeling with explicit structural inductive biases.
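One layer's rewrite cycle might look like the following discrete sketch, with exact pattern matching standing in for the learned fuzzy matcher and greedy leftmost selection standing in for the differentiable assignment operator:

```python
def rewrite_step(tokens, rules):
    """One discrete RewriteNet-style layer: greedy, non-overlapping, leftmost
    application of (pattern, replacement) rules; untouched tokens pass through.
    Exact matching stands in for the learned fuzzy matcher."""
    out, i = [], 0
    while i < len(tokens):
        for pat, rep in rules:
            if tokens[i:i + len(pat)] == list(pat):
                out.extend(rep)          # replacement may change segment length
                i += len(pat)
                break
        else:
            out.append(tokens[i])        # propagate an untouched token
            i += 1
    return out

# SCAN-like toy rules: "jump twice" rewrites to two primitive actions.
rules = [(("jump", "twice"), ("JUMP", "JUMP")), (("walk",), ("WALK",))]
result = rewrite_step(["jump", "twice", "and", "walk"], rules)
```

The length-changing replacement ("jump twice" → two output tokens) is the property that distinguishes rewriting from position-wise transformation.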
https://arxiv.org/abs/2601.07868
Inertial localization is particularly valuable in GPS-denied environments such as indoors. However, localization using only Inertial Measurement Units (IMUs) suffers from drift caused by motion-process noise and sensor biases. This paper introduces Uncertainty-aware Map-constrained Inertial Localization (UMLoc), an end-to-end framework that jointly models IMU uncertainty and map constraints to achieve drift-resilient positioning. UMLoc integrates two coupled modules: (1) a Long Short-Term Memory (LSTM) quantile regressor, which estimates the specific quantiles needed to define 68%, 90%, and 95% prediction intervals, serving as a measure of localization uncertainty; and (2) a Conditioned Generative Adversarial Network (CGAN) with cross-attention that fuses IMU dynamic data with distance-based floor-plan maps to generate geometrically feasible trajectories. The modules are trained jointly, allowing uncertainty estimates to propagate through the CGAN during trajectory generation. UMLoc was evaluated on three datasets, including a newly collected 2-hour indoor benchmark with time-aligned IMU data, ground-truth poses and floor-plan maps. Results show that the method achieves a mean drift ratio of 5.9% over a 70 m travel distance and an average Absolute Trajectory Error (ATE) of 1.36 m, while maintaining calibrated prediction bounds.
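The quantile-regression half can be sketched with the pinball loss and the interval construction below; the quantile values are invented toy numbers, not model outputs:

```python
def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss: the asymmetric penalty whose minimizer is the
    q-th conditional quantile of the target."""
    e = y_true - y_pred
    return max(q * e, (q - 1.0) * e)

def prediction_interval(quantiles, level=0.90):
    """Build a central interval from estimated quantiles, e.g. 90% -> (q05, q95)."""
    lo = (1.0 - level) / 2.0
    return quantiles[round(lo, 3)], quantiles[round(1.0 - lo, 3)]

# Toy quantile estimates for one trajectory step (position error in metres)
q_hat = {0.05: -1.8, 0.16: -1.0, 0.84: 1.0, 0.95: 1.9}
band90 = prediction_interval(q_hat, level=0.90)
```

Under-prediction and over-prediction are penalized asymmetrically (9:1 for q = 0.9), which is what pushes each output toward its target quantile.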
https://arxiv.org/abs/2601.06602
Algorithm extraction aims to synthesize executable programs directly from models trained on specific algorithmic tasks, enabling de novo algorithm discovery without relying on human-written code. However, extending this paradigm to Transformers is hindered by superposition, where entangled features encoded in overlapping directions obstruct the extraction of symbolic expressions. In this work, we propose the Discrete Transformer, an architecture explicitly engineered to bridge the gap between continuous representations and discrete symbolic logic. By enforcing a strict functional disentanglement, which constrains Numerical Attention to information routing and Numerical MLP to element-wise arithmetic, and employing temperature-annealed sampling, our method effectively facilitates the extraction of human-readable programs. Empirically, the Discrete Transformer not only achieves performance comparable to RNN-based baselines but crucially extends interpretability to continuous variable domains. Moreover, our analysis of the annealing process shows that the efficient discrete search undergoes a clear phase transition from exploration to exploitation. We further demonstrate that our method enables fine-grained control over synthesized programs by imposing inductive biases. Collectively, these findings establish the Discrete Transformer as a robust framework for demonstration-free algorithm discovery, offering a rigorous pathway toward Transformer interpretability.
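Temperature-annealed sampling itself is easy to illustrate: at high temperature the sampler explores all candidates, and as the temperature drops it collapses onto the argmax. A minimal sketch with plain softmax sampling (not the paper's Gumbel machinery):

```python
import math, random

def annealed_sample(logits, temperature, rng):
    """Draw a discrete choice from softmax(logits / T). High T spreads the
    distribution (exploration); T -> 0 concentrates it on the argmax."""
    z = [l / temperature for l in logits]
    m = max(z)                                   # max-shift for numerical stability
    w = [math.exp(v - m) for v in z]
    r = rng.random() * sum(w)
    for i, wi in enumerate(w):
        r -= wi
        if r <= 0:
            return i
    return len(w) - 1

rng = random.Random(0)
logits = [2.0, 0.5, 0.1]
hot = [annealed_sample(logits, 5.0, rng) for _ in range(2000)]    # exploration
cold = [annealed_sample(logits, 0.05, rng) for _ in range(2000)]  # exploitation
```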
https://arxiv.org/abs/2601.05770
Classical Recurrent Neural Networks (RNNs) summarize musical context into a deterministic hidden state vector, imposing an information bottleneck that fails to capture the inherent ambiguity in music. We propose the Density Matrix RNN (DM-RNN), a novel theoretical architecture utilizing the Density Matrix. This allows the model to maintain a statistical ensemble of musical interpretations (a mixed state), capturing both classical probabilities and quantum coherences. We rigorously define the temporal dynamics using Quantum Channels (CPTP maps). Crucially, we detail a parameterization strategy based on the Choi-Jamiolkowski isomorphism, ensuring the learned dynamics remain physically valid (CPTP) by construction. We introduce an analytical framework using Von Neumann Entropy to quantify musical uncertainty and Quantum Mutual Information (QMI) to measure entanglement between voices. The DM-RNN provides a mathematically rigorous framework for modeling complex, ambiguous musical structures.
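The core objects here are standard quantum-information constructions. A minimal numpy sketch of a CPTP map in Kraus form and the von Neumann entropy (a bit-flip channel on one qubit is used for illustration; it is not a musical model):

```python
import numpy as np

def apply_channel(rho, kraus):
    """CPTP map in Kraus form: rho -> sum_k K rho K†, with sum_k K†K = I
    guaranteeing trace preservation."""
    return sum(K @ rho @ K.conj().T for K in kraus)

def von_neumann_entropy(rho):
    """S(rho) = -tr(rho log2 rho): 0 for a pure state, log2(d) when maximally mixed."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-(evals * np.log2(evals)).sum())

# Pure state |0><0| through a bit-flip channel with flip probability p
rho = np.array([[1.0, 0.0], [0.0, 0.0]], dtype=complex)
p = 0.5
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
kraus = [np.sqrt(1 - p) * I2, np.sqrt(p) * X]   # satisfies sum K†K = I by construction
rho_out = apply_channel(rho, kraus)
```

At p = 0.5 the output is the maximally mixed state I/2, so the entropy rises from 0 to 1 bit, which is exactly the kind of uncertainty measure the abstract proposes for musical ambiguity.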
https://arxiv.org/abs/2601.04592
The rapid development of large language models has led to an increase in AI-generated text, with students increasingly using LLM-generated content as their own work, which violates academic integrity. This paper presents an evaluation of AI text detection methods, including both traditional machine learning models and transformer-based architectures. We utilize two datasets, HC3 and DAIGT v2, to build a unified benchmark and apply a topic-based data split to prevent information leakage. This approach ensures robust generalization across unseen domains. Our experiments show that TF-IDF logistic regression achieves a reasonable baseline accuracy of 82.87%. However, deep learning models outperform it. The BiLSTM classifier achieves an accuracy of 88.86%, while DistilBERT achieves a similar accuracy of 88.11% with the highest ROC-AUC score of 0.96, demonstrating the strongest overall performance. The results indicate that contextual semantic modeling is significantly superior to lexical features and highlight the importance of mitigating topic memorization through appropriate evaluation protocols. The limitations of this work are primarily related to dataset diversity and computational constraints. In future work, we plan to expand dataset diversity and utilize parameter-efficient fine-tuning methods such as LoRA. We also plan to explore smaller or distilled models and employ more efficient batching strategies and hardware-aware optimization.
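The topic-based split is the key protocol detail: every document of a topic must land on the same side of the train/test boundary, so the detector cannot succeed by memorizing topics. A minimal sketch (topic labels and counts are invented):

```python
import random

def topic_split(docs, test_frac=0.25, seed=0):
    """Split (topic, text) pairs by topic rather than by document, preventing
    information leakage between train and test sets."""
    topics = sorted({t for t, _ in docs})
    rng = random.Random(seed)
    rng.shuffle(topics)
    n_test = max(1, int(test_frac * len(topics)))
    test_topics = set(topics[:n_test])
    train = [d for d in docs if d[0] not in test_topics]
    test = [d for d in docs if d[0] in test_topics]
    return train, test, test_topics

docs = [("medicine", "essay 1"), ("medicine", "essay 2"),
        ("finance", "essay 3"), ("law", "essay 4"),
        ("law", "essay 5"), ("sports", "essay 6")]
train, test, held_out = topic_split(docs)
```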
https://arxiv.org/abs/2601.03812
Respiratory sounds captured via auscultation contain critical clues for diagnosing pulmonary conditions. Automated classification of these sounds faces challenges due to subtle acoustic differences and severe class imbalance in clinical datasets. This study investigates respiratory sound classification with a focus on mitigating pronounced class imbalance. We propose a hybrid deep learning model that combines a Long Short-Term Memory (LSTM) network for sequential feature encoding with a Kolmogorov-Arnold Network (KAN) for classification. The model is integrated with a comprehensive feature extraction pipeline and targeted imbalance mitigation strategies. Experiments were conducted on a public respiratory sound database comprising six classes with a highly skewed distribution. Techniques such as focal loss, class-specific data augmentation, and Synthetic Minority Over-sampling Technique (SMOTE) were employed to enhance minority class recognition. The proposed Hybrid LSTM-KAN model achieves an overall accuracy of 94.6 percent and a macro-averaged F1 score of 0.703, despite the dominant COPD class accounting for over 86 percent of the data. Improved detection performance is observed for minority classes compared to baseline approaches, demonstrating the effectiveness of the proposed architecture for imbalanced respiratory sound classification.
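Of the imbalance-mitigation techniques listed, focal loss is the most self-contained to sketch; `gamma` and `alpha` below are the commonly used defaults, not necessarily the paper's settings:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: the (1 - p_t)^gamma factor down-weights easy,
    well-classified examples so rare minority-class errors dominate training."""
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))

easy = focal_loss(0.95, 1)   # confident and correct -> near-zero loss
hard = focal_loss(0.10, 1)   # confident and wrong  -> large loss
```

With gamma = 0 and alpha = 0.5 this reduces to (half of) ordinary cross-entropy; raising gamma sharpens the focus on misclassified minority samples such as the rare non-COPD classes.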
https://arxiv.org/abs/2601.03610
Student mistakes in mathematics are often systematic: a learner applies a coherent but wrong procedure and repeats it across contexts. We introduce MalruleLib, a learning-science-grounded framework that translates documented misconceptions into executable procedures, drawing on 67 learning-science and mathematics education sources, and generates step-by-step traces of malrule-consistent student work. We formalize a core student-modeling problem as Malrule Reasoning Accuracy (MRA): infer a misconception from one worked mistake and predict the student's next answer under cross-template rephrasing. Across nine language models (4B-120B), accuracy drops from 66% on direct problem solving to 40% on cross-template misconception prediction. MalruleLib encodes 101 malrules over 498 parameterized problem templates and produces paired dual-path traces for both correct reasoning and malrule-consistent student reasoning. Because malrules are executable and templates are parameterizable, MalruleLib can generate over one million instances, enabling scalable supervision and controlled evaluation. Using MalruleLib, we observe cross-template degradations of 10-21%, while providing student step traces improves prediction by 3-15%. We release MalruleLib as infrastructure for educational AI that models student procedures across contexts, enabling diagnosis and feedback that targets the underlying misconception.
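An executable malrule can be as simple as the classic "smaller-from-larger" subtraction bug, in which the student subtracts the smaller digit from the larger in every column and never borrows. The sketch below is an illustrative example in the spirit of MalruleLib, not code from the release (and it assumes a >= b with no leading-zero columns):

```python
def correct_subtract(a, b):
    return a - b

def malrule_smaller_from_larger(a, b):
    """Documented misconception: column-wise, subtract the smaller digit from
    the larger one, ignoring borrowing entirely."""
    da = str(a)
    db = str(b).rjust(len(da), "0")
    out = "".join(str(abs(int(x) - int(y))) for x, y in zip(da, db))
    return int(out)

# The bug is systematic: the same wrong procedure reproduces across
# problem instances ("templates"), which is what makes it diagnosable.
buggy_1 = malrule_smaller_from_larger(52, 38)   # student writes 26, not 14
buggy_2 = malrule_smaller_from_larger(71, 29)   # student writes 58, not 42
```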
https://arxiv.org/abs/2601.03217
Spatial prediction of reservoir parameters, especially permeability, is crucial for oil and gas exploration and development. However, the wide range and high variability of permeability prevent existing methods from providing reliable predictions. For the first time in subsurface spatial prediction, this study presents a quantum-enhanced long short-term memory with attention (QLSTMA) model that incorporates variational quantum circuits (VQCs) into the recurrent cell. Using quantum entanglement and superposition principles, the QLSTMA significantly improves the ability to predict complex geological parameters such as permeability. Two quantization structures, QLSTMA with Shared Gates (QLSTMA-SG) and with Independent Gates (QLSTMA-IG), are designed to investigate and evaluate the effects of quantum structure configurations and the number of qubits on model performance. Experimental results demonstrate that the 8-qubit QLSTMA-IG model significantly outperforms the traditional long short-term memory with attention (LSTMA), reducing Mean Absolute Error (MAE) by 19% and Root Mean Squared Error (RMSE) by 20%, with particularly strong performance in regions featuring complex well-logging data. These findings validate the potential of quantum-classical hybrid neural networks for reservoir prediction, indicating that increasing the number of qubits yields further accuracy gains despite the reliance on classical simulations. This study establishes a foundational framework for the eventual deployment of such models on real quantum hardware and their extension to broader applications in petroleum engineering and geoscience.
https://arxiv.org/abs/2601.02818
With the development of teleconferencing and in-vehicle voice assistants, far-field multi-speaker speech recognition has become a hot research topic. Recently, a multi-channel transformer (MCT) has been proposed, which demonstrates the ability of the transformer to model far-field acoustic environments. However, MCT cannot encode high-dimensional acoustic features for each speaker from mixed input audio because of the interference between speakers. Motivated by these limitations, we propose the multi-channel multi-speaker transformer (M2Former) for far-field multi-speaker ASR in this paper. Experiments on the SMS-WSJ benchmark show that the M2Former outperforms the neural beamformer, MCT, dual-path RNN with transform-average-concatenate and multi-channel deep clustering based end-to-end systems by 9.2%, 14.3%, 24.9%, and 52.2% respectively, in terms of relative word error rate reduction.
https://arxiv.org/abs/2601.02688
Human cognition integrates information across nested timescales. While the cortex exhibits hierarchical Temporal Receptive Windows (TRWs), local circuits often display heterogeneous time constants. To reconcile this, we trained biologically constrained deep networks, based on scale-invariant hippocampal time cells, on a language classification task mimicking the hierarchical structure of language (e.g., 'letters' forming 'words'). First, using a feedforward model (SITHCon), we found that a hierarchy of TRWs emerged naturally across layers, despite the network having an identical spectrum of time constants within layers. We then distilled these inductive priors into a biologically plausible recurrent architecture, SITH-RNN. Training a sequence of architectures ranging from generic RNNs to this restricted subset showed that the scale-invariant SITH-RNN learned faster with orders-of-magnitude fewer parameters, and generalized zero-shot to out-of-distribution timescales. These results suggest the brain employs scale-invariant, sequential priors - coding "what" happened "when" - making recurrent networks with such priors particularly well-suited to describe human cognition.
https://arxiv.org/abs/2601.02618
The automotive industry's pursuit of enhanced fuel economy and performance necessitates efficient aerodynamic design. However, traditional evaluation methods such as computational fluid dynamics (CFD) and wind tunnel testing are resource intensive, hindering rapid iteration in the early design stages. Machine learning-based surrogate models offer a promising alternative, yet many existing approaches suffer from high computational complexity, limited interpretability, or insufficient accuracy for detailed geometric inputs. This paper introduces a novel lightweight surrogate model for the prediction of the aerodynamic drag coefficient (Cd) based on sequential, slice-wise processing of the 3D vehicle geometry. Inspired by medical imaging, 3D point clouds of vehicles are decomposed into an ordered sequence of 2D cross-sectional slices along the stream-wise axis. Each slice is encoded by a lightweight PointNet2D module, and the sequence of slice embeddings is processed by a bidirectional LSTM to capture longitudinal geometric evolution. The model, trained and evaluated on the DrivAerNet++ dataset, achieves a high coefficient of determination (R^2 > 0.9528) and a low mean absolute error (MAE approx 6.046 x 10^{-3}) in Cd prediction. With an inference time of approximately 0.025 seconds per sample on a consumer-grade GPU, our approach provides fast, accurate, and interpretable aerodynamic feedback, facilitating more agile and informed automotive design exploration.
https://arxiv.org/abs/2601.02112
Orthognathic surgery repositions jaw bones to restore occlusion and enhance facial aesthetics. Accurate simulation of postoperative facial morphology is essential for preoperative planning. However, traditional biomechanical models are computationally expensive, while geometric deep learning approaches often lack interpretability. In this study, we develop and validate a physics-informed geometric deep learning framework named PhysSFI-Net for precise prediction of soft tissue deformation following orthognathic surgery. PhysSFI-Net consists of three components: a hierarchical graph module with craniofacial and surgical plan encoders combined with attention mechanisms to extract skeletal-facial interaction features; a Long Short-Term Memory (LSTM)-based sequential predictor for incremental soft tissue deformation; and a biomechanics-inspired module for high-resolution facial surface reconstruction. Model performance was assessed using point cloud shape error (Hausdorff distance), surface deviation error, and landmark localization error (Euclidean distances of craniomaxillofacial landmarks) between predicted facial shapes and corresponding ground truths. A total of 135 patients who underwent combined orthodontic and orthognathic treatment were included for model training and validation. Quantitative analysis demonstrated that PhysSFI-Net achieved a point cloud shape error of 1.070 +/- 0.088 mm, a surface deviation error of 1.296 +/- 0.349 mm, and a landmark localization error of 2.445 +/- 1.326 mm. Comparative experiments indicated that PhysSFI-Net outperformed the state-of-the-art method ACMT-Net in prediction accuracy. In conclusion, PhysSFI-Net enables interpretable, high-resolution prediction of postoperative facial morphology with superior accuracy, showing strong potential for clinical application in orthognathic surgical planning and simulation.
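The point cloud shape error is the Hausdorff distance between predicted and ground-truth surfaces, which can be computed directly for small clouds; the toy "faces" below are four points with a uniform 1 mm offset, not real craniofacial data:

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between point clouds A (N,3) and B (M,3):
    the worst-case nearest-neighbour deviation in either direction."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # (N, M) pairwise distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Toy predicted vs ground-truth surface: a rigid 1 mm shift along z
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
pred = gt + np.array([0.0, 0.0, 1.0])
err = hausdorff(pred, gt)
```

The brute-force (N, M) distance matrix is fine for sketches but scales quadratically; real facial meshes would use a KD-tree nearest-neighbour query instead.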
https://arxiv.org/abs/2601.02088
Strawberry harvesting robots faced persistent challenges such as low integration of visual perception, fruit-gripper misalignment, empty grasping, and strawberry slippage from the gripper due to insufficient gripping force, all of which compromised harvesting stability and efficiency in orchard environments. To overcome these issues, this paper proposed a visual fault diagnosis and self-recovery framework that integrated multi-task perception with corrective control strategies. At the core of this framework was SRR-Net, an end-to-end multi-task perception model that simultaneously performed strawberry detection, segmentation, and ripeness estimation, thereby unifying visual perception with fault diagnosis. Based on this integrated perception, a relative error compensation method based on simultaneous target-gripper detection was designed to address positional misalignment, correcting deviations when the error exceeded the tolerance threshold. To mitigate empty grasping and fruit-slippage faults, an early abort strategy was implemented. A micro-optical camera embedded in the end-effector provided real-time visual feedback, enabling grasp detection during the deflating stage and strawberry slip prediction during snap-off through a MobileNet V3-Small classifier and a time-series LSTM classifier. Experiments demonstrated that SRR-Net maintained high perception accuracy. For detection, it achieved a precision of 0.895 and recall of 0.813 on strawberries, and 0.972/0.958 on hands. In segmentation, it yielded a precision of 0.887 and recall of 0.747 for strawberries, and 0.974/0.947 for hands. For ripeness estimation, SRR-Net attained a mean absolute error of 0.035, while simultaneously supporting multi-task perception and sustaining a competitive inference speed of 163.35 FPS.
https://arxiv.org/abs/2601.02085
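The threshold-gated compensation step described in the abstract can be sketched in a few lines of hypothetical control code; the tolerance value, 2-D coordinate convention, and function name are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

TOLERANCE_MM = 4.0  # hypothetical tolerance threshold


def relative_error_compensation(target_xy, gripper_xy, tolerance=TOLERANCE_MM):
    """Correct the gripper pose only when the target-gripper deviation,
    measured from simultaneous detection of both, exceeds the tolerance."""
    error = np.asarray(target_xy, dtype=float) - np.asarray(gripper_xy, dtype=float)
    if np.linalg.norm(error) <= tolerance:
        return np.zeros(2)  # within tolerance: no correction applied
    return error            # correction vector fed to the corrective controller


print(relative_error_compensation([100.0, 50.0], [98.0, 49.0]))  # small deviation, no correction
print(relative_error_compensation([100.0, 50.0], [90.0, 44.0]))  # large deviation, corrected
```

The same gating pattern (measure, compare against a threshold, act only on exceedance) also matches the early-abort strategy, where the triggering signal is a classifier output rather than a positional error.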
Current attempts at applying Reinforcement Learning to autonomous controllers are data-demanding, while the results are under-performing and unstable, unable to grasp and anchor on the concept of safety, and over-concentrated on noise features due to the nature of pixel reconstruction. In contrast, Self-Supervised Learning approaches that learn high-dimensional representations by leveraging the Joint-Embedding Predictive Architecture (JEPA) are an interesting and effective alternative, since the idea mimics the human brain's natural ability to acquire new skills using imagination and only minimal samples of observations. This study introduces Hanoi-World, a JEPA-based world model that uses a recurrent neural network (RNN) for long-horizon planning with efficient inference time. Experiments conducted on the Highway-Env package across different environments showcase its capability to produce safety-aware driving plans, with a considerably lower collision rate than SOTA baselines.
https://arxiv.org/abs/2601.01577
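As an illustration of the planning loop sketched in the abstract, the following minimal numpy example rolls a recurrent predictor forward in latent space and scores candidate action sequences; all weights are random stand-ins, and the cost head, dimensions, and candidate-sampling scheme are assumptions rather than details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT, ACTIONS, HORIZON, CANDIDATES = 16, 5, 8, 32

# Hypothetical frozen JEPA encoder and RNN predictor, stood in for by random
# linear maps; the real model would be trained with a JEPA objective.
W_enc = rng.normal(size=(LATENT, 64))
W_h = rng.normal(size=(LATENT, LATENT)) * 0.1
W_a = rng.normal(size=(LATENT, ACTIONS)) * 0.1
w_cost = rng.normal(size=LATENT)  # maps a latent state to a predicted safety cost


def rollout_cost(z0, action_seq):
    """Roll the recurrent predictor forward in latent space and accumulate
    the predicted cost of the whole action sequence."""
    z, cost = z0, 0.0
    for a in action_seq:
        one_hot = np.eye(ACTIONS)[a]
        z = np.tanh(W_h @ z + W_a @ one_hot)  # latent dynamics step
        cost += w_cost @ z                    # hypothetical collision penalty
    return cost


obs = rng.normal(size=64)          # raw observation (e.g. ego-vehicle features)
z0 = np.tanh(W_enc @ obs)          # encode once, then plan entirely in latent space
plans = rng.integers(0, ACTIONS, size=(CANDIDATES, HORIZON))
best = plans[np.argmin([rollout_cost(z0, p) for p in plans])]
print("selected plan:", best)
```

The key property this sketch illustrates is that planning never returns to pixel space: the encoder runs once per observation, and all imagined futures are evaluated on compact latent states, which is what makes long-horizon rollouts cheap at inference time.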
Cloud storage has become the backbone of modern data infrastructure, yet privacy and efficient data retrieval remain significant challenges. Traditional privacy-preserving approaches primarily focus on enhancing database security but fail to address the automatic identification of sensitive information before encryption. Automating this step can dramatically reduce query processing time and mitigate the errors that arise during manual identification of sensitive information, thereby reducing potential privacy risks. To address this limitation, this research proposes an intelligent privacy-preserving query optimization framework that integrates Named Entity Recognition (NER) to detect sensitive information in queries, utilizing secure data encryption and query optimization techniques for sensitive and non-sensitive data in parallel, thereby enabling efficient database optimization. We combined deep learning algorithms and transformer-based models to detect and classify sensitive entities with high precision, and used the Advanced Encryption Standard (AES) algorithm for encryption, with blind indexing to secure the search functionality over sensitive data, whereas non-sensitive data was divided into groups using the K-means algorithm, along with a rank search for optimization. Among all NER models, the Deep Belief Network combined with Long Short-Term Memory (DBN-LSTM) delivers the best performance, with an accuracy of 93%, precision of 94%, and recall and F1 score of 93% each. Moreover, encrypted search achieved considerably faster results with the help of blind indexing, and non-sensitive data fetching also outperformed traditional clustering-based searches. By integrating sensitive data detection, encryption, and query optimization, this work advances the state of privacy-preserving computation in modern cloud infrastructures.
https://arxiv.org/abs/2601.01254
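The blind-indexing idea for equality search over encrypted fields can be sketched with Python's standard library; the key, the lower-casing normalisation, and the storage layout below are hypothetical, and the actual AES encryption of the value (which would use a crypto library) is elided:

```python
import hashlib
import hmac

INDEX_KEY = b"server-side secret"  # hypothetical key, kept separate from the AES key


def blind_index(value: str) -> str:
    """Deterministic keyed hash of a sensitive field. The index is stored
    alongside the AES ciphertext, so equality queries can match records
    without decrypting anything and without revealing the plaintext."""
    return hmac.new(INDEX_KEY, value.lower().encode(), hashlib.sha256).hexdigest()


# Write path: store ciphertext keyed by the blind index of the plaintext.
stored = {blind_index("alice@example.com"): "<AES ciphertext>"}

# Query path: the same keyed hash is computed over the search term.
assert blind_index("Alice@Example.com") in stored   # case-normalised equality match
assert blind_index("bob@example.com") not in stored
```

Because the digest is keyed (HMAC rather than a plain hash), an attacker who dumps the index column cannot brute-force common values without also obtaining the index key; the trade-off is that only exact-equality lookups are supported, which is why the non-sensitive partition keeps conventional clustering and rank search.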
This technical report presents the 600K-KS-OCR Dataset, a large-scale synthetic corpus comprising approximately 602,000 word-level segmented images designed for training and evaluating optical character recognition systems targeting Kashmiri script. The dataset addresses a critical resource gap for Kashmiri, an endangered Dardic language spoken by approximately seven million people that uses a modified Perso-Arabic writing system. Each image is rendered at 256x64 pixels with corresponding ground-truth transcriptions provided in multiple formats compatible with CRNN, TrOCR, and general-purpose machine learning pipelines. The generation methodology incorporates three traditional Kashmiri typefaces, comprehensive data augmentation simulating real-world document degradation, and diverse background textures to enhance model robustness. The dataset is distributed across ten partitioned archives totaling approximately 10.6 GB and is released under the CC-BY-4.0 license to facilitate research in low-resource language optical character recognition.
https://arxiv.org/abs/2601.01088
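A minimal sketch of the kind of degradation-style augmentation the report describes (contrast jitter plus additive noise on a rendered word image) might look as follows; the parameter values and function name are illustrative assumptions, not the report's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(42)
W, H = 256, 64  # word-image size used by the dataset


def degrade(img, noise_std=12.0, contrast=(0.7, 1.1)):
    """Hypothetical document-degradation augmentation: random contrast
    jitter followed by additive Gaussian noise on a grayscale word image."""
    c = rng.uniform(*contrast)
    out = img.astype(np.float32) * c + rng.normal(0.0, noise_std, img.shape)
    return np.clip(out, 0, 255).astype(np.uint8)


clean = np.full((H, W), 255, dtype=np.uint8)  # stand-in for a rendered word image
noisy = degrade(clean)
print(noisy.shape, noisy.dtype)               # 64x256 grayscale, uint8
```

In a full pipeline the input would come from rendering Kashmiri text in one of the three typefaces over a sampled background texture, with the ground-truth transcription written out alongside in each supported annotation format.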