Video understanding requires models to continuously track and update world state during playback. While existing benchmarks have advanced video understanding evaluation across multiple dimensions, they provide little insight into how models maintain world state. We propose VCBench, a streaming counting benchmark that repositions counting as a minimal probe for diagnosing world-state maintenance. We decompose this capability into object counting (tracking currently visible objects vs. tracking cumulative unique identities) and event counting (detecting instantaneous actions vs. tracking complete activity cycles), forming 8 fine-grained subcategories. VCBench contains 406 videos with frame-by-frame annotations of 10,071 event occurrences and object state changes, yielding 1,000 streaming QA pairs with 4,576 query points along video timelines. By observing state-maintenance trajectories through streaming multi-point queries, we design three complementary metrics to diagnose numerical precision, trajectory consistency, and temporal awareness. Evaluation of mainstream video-language models shows that current models still exhibit significant deficiencies in spatio-temporal state maintenance, struggling in particular with tasks such as periodic event counting. VCBench provides a diagnostic framework for measuring and improving state maintenance in video understanding systems.
https://arxiv.org/abs/2603.12703
Speech production and perception are the main ways humans communicate daily. Prior brain-to-text decoding studies have largely focused on a single modality and alphabetic languages. Here, we present a unified brain-to-sentence decoding framework for both speech production and perception in Mandarin Chinese. The framework exhibits strong generalization ability, enabling sentence-level decoding when trained only on single-character data and supporting characters and syllables unseen during training. In addition, it allows direct and controlled comparison of neural dynamics across modalities. Mandarin speech is decoded by first classifying syllable components in Hanyu Pinyin, namely initials and finals, from neural signals, followed by a post-trained large language model (LLM) that maps sequences of toneless Pinyin syllables to Chinese sentences. To enhance LLM decoding, we designed a three-stage post-training and two-stage inference framework based on a 7-billion-parameter LLM, achieving overall performance that exceeds larger commercial LLMs with hundreds of billions of parameters or more. In addition, several characteristics were observed in Mandarin speech production and perception: speech production involved neural responses across broader cortical regions than auditory perception; channels responsive to both modalities exhibited similar activity patterns, with speech perception showing a temporal delay relative to production; and decoding performance was broadly comparable across hemispheres. Our work not only establishes the feasibility of a unified decoding framework but also provides insights into the neural characteristics of Mandarin speech production and perception. These advances contribute to brain-to-text decoding in logosyllabic languages and pave the way toward neural language decoding systems supporting multiple modalities.
https://arxiv.org/abs/2603.12628
Understanding freely moving animal behavior is central to neuroscience, where pose estimation and behavioral understanding form the foundation for linking neural activity to natural actions. Yet both tasks still depend heavily on human annotation or unstable unsupervised pipelines, limiting scalability and reproducibility. We present BehaviorVLM, a unified vision-language framework for pose estimation and behavioral understanding that requires no task-specific finetuning and only minimal human labeling, guiding pretrained Vision-Language Models (VLMs) through detailed, explicit, and verifiable reasoning steps. For pose estimation, we leverage quantum-dot-grounded behavioral data and propose a multi-stage pipeline that integrates temporal, spatial, and cross-view reasoning. This design greatly reduces human annotation effort, exposes low-confidence labels through geometric checks such as reprojection error, and produces labels that can later be filtered, corrected, or used to fine-tune downstream pose models. For behavioral understanding, we propose a pipeline that integrates deep embedded clustering for over-segmented behavior discovery, VLM-based per-clip video captioning, and LLM-based reasoning to merge and semantically label behavioral segments. The behavioral pipeline can operate directly from visual information and does not require keypoints to segment behavior. Together, these components enable scalable, interpretable, and label-light analysis of multi-animal behavior.
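The geometric checks mentioned above (e.g., reprojection error) are standard multi-view consistency tests. A minimal sketch, assuming a simple pinhole camera given as a 3x4 projection matrix with toy values; BehaviorVLM's actual camera calibration and thresholds are not specified here:

```python
# Geometric sanity check: reproject a 3D keypoint through a pinhole
# camera (3x4 projection matrix P) and measure the pixel distance to the
# proposed 2D label. P and the points below are toy values.
def project(P, X):
    x = [sum(P[r][c] * Xh for c, Xh in enumerate(X + [1.0])) for r in range(3)]
    return [x[0] / x[2], x[1] / x[2]]

def reprojection_error(P, X, label_2d):
    u, v = project(P, X)
    return ((u - label_2d[0]) ** 2 + (v - label_2d[1]) ** 2) ** 0.5

# Identity-like camera: unit focal length, no principal-point offset.
P = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
X = [0.2, 0.4, 2.0]                          # hypothetical 3D keypoint
good = reprojection_error(P, X, [0.1, 0.2])  # consistent 2D label
bad = reprojection_error(P, X, [0.5, 0.5])   # inconsistent 2D label
```

A label whose error exceeds a chosen pixel threshold would be flagged as low-confidence for later filtering or correction.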
https://arxiv.org/abs/2603.12176
During music listening, cortical activity encodes both acoustic and expectation-related information. Prior work has shown that ANN representations resemble cortical representations and can serve as supervisory signals for EEG recognition. Here we show that distinguishing acoustic and expectation-related ANN representations as teacher targets improves EEG-based music identification. Models pretrained to predict either representation outperform non-pretrained baselines, and combining them yields complementary gains that exceed strong seed ensembles formed by varying random initializations. These findings show that teacher representation type shapes downstream performance and that representation learning can be guided by neural encoding. This work points toward advances in predictive music cognition and neural decoding. Our expectation representation, computed directly from raw signals without manual labels, reflects predictive structure beyond onset or pitch, enabling investigation of multilayer predictive encoding across diverse stimuli. Its scalability to large, diverse datasets further suggests potential for developing general-purpose EEG models grounded in cortical encoding principles.
https://arxiv.org/abs/2603.03190
Translating single-cell RNA sequencing (scRNA-seq) data into mechanistic biological hypotheses remains a critical bottleneck, as agentic AI systems lack direct access to transcriptomic representations while expression foundation models remain opaque to natural language. Here we introduce ELISA (Embedding-Linked Interactive Single-cell Agent), an interpretable framework that unifies scGPT expression embeddings with BioBERT-based semantic retrieval and LLM-mediated interpretation for interactive single-cell discovery. An automatic query classifier routes inputs to gene marker scoring, semantic matching, or reciprocal rank fusion pipelines depending on whether the query is a gene signature, a natural language concept, or a mixture of both. Integrated analytical modules perform pathway activity scoring across 60+ gene sets, ligand-receptor interaction prediction using 280+ curated pairs, condition-aware comparative analysis, and cell-type proportion estimation, all operating directly on embedded data without access to the original count matrix. Benchmarked across six diverse scRNA-seq datasets spanning inflammatory lung disease, pediatric and adult cancers, organoid models, healthy tissue, and neurodevelopment, ELISA significantly outperforms CellWhisperer in cell type retrieval (combined permutation test, p < 0.001), with particularly large gains on gene-signature queries (Cohen's d = 5.98 for MRR). ELISA replicates published biological findings (mean composite score 0.90) with near-perfect pathway alignment and theme coverage (0.98 each), and generates candidate hypotheses through grounded LLM reasoning, bridging the gap between transcriptomic data exploration and biological discovery. Code available at: this https URL (If you use ELISA in your research, please cite this work).
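Reciprocal rank fusion (RRF) is a standard way to merge the gene-signature and semantic rankings for mixed queries. A minimal sketch, using the conventional k = 60 constant and toy cell-type rankings; ELISA's exact parameters are not given here:

```python
# Reciprocal rank fusion: each ranked list contributes 1/(k + rank) to an
# item's score, and items are re-sorted by the summed score. k = 60 is
# the conventional constant; the rankings are toy examples.
def reciprocal_rank_fusion(ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a gene-signature ranking with a natural-language semantic ranking.
gene_rank = ["T_cell", "NK_cell", "B_cell"]
text_rank = ["T_cell", "macrophage", "NK_cell"]
fused = reciprocal_rank_fusion([gene_rank, text_rank])
```

Because RRF uses only ranks, it needs no score calibration between the two heterogeneous retrieval pipelines.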
https://arxiv.org/abs/2603.11872
The expression of affect is integral to spoken communication, yet its link to underlying articulatory execution remains unclear. Measures of articulatory muscle activity such as EMG could reveal how speech production is modulated by emotion, complementing acoustic speech analyses. We investigate affect decoding from facial and neck surface electromyography (sEMG) during phonated and silent speech production. For this purpose, we introduce a dataset comprising 2,780 utterances from 12 participants across 3 tasks, on which we evaluate both intra- and inter-subject decoding using a range of features and model embeddings. Our results reveal that EMG representations reliably discriminate frustration with up to 0.845 AUC and generalize well across articulation modes. Our ablation study further demonstrates that affective signatures are embedded in facial motor activity and persist in the absence of phonation, highlighting the potential of EMG sensing for affect-aware silent speech interfaces.
https://arxiv.org/abs/2603.11715
Dense Video Captioning (DVC) is a challenging multimodal task that involves temporally localizing multiple events within a video and describing them with natural language. While query-based frameworks enable simultaneous, end-to-end processing of localization and captioning, their reliance on shared queries often leads to significant interference between the two tasks, as well as temporal redundancy in localization. In this paper, we propose utilizing role-specific queries that separate localization and captioning into independent components, allowing each to exclusively learn its role. We then employ contrastive alignment to enforce semantic consistency between the corresponding outputs, ensuring coherent behavior across the separated queries. Furthermore, we design a novel suppression mechanism in which mutual temporal overlaps across queries are penalized to tackle temporal redundancy, supervising the model to learn distinct, non-overlapping event regions for more precise localization. Additionally, we introduce a lightweight module that captures core event concepts to further enhance semantic richness in captions through concept-level representations. We demonstrate the effectiveness of our method through extensive experiments on the major DVC benchmarks YouCook2 and ActivityNet Captions.
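The suppression mechanism penalizes mutual temporal overlap across queries. One plausible form, shown here as an illustration rather than the paper's exact loss, is the mean pairwise temporal IoU over the predicted (start, end) segments:

```python
# Sketch of a pairwise temporal-overlap penalty: the mean temporal IoU
# over all pairs of predicted segments. Minimizing it pushes queries
# toward distinct, non-overlapping event regions. This is an assumed
# stand-in for the paper's actual suppression loss.
def temporal_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def overlap_suppression_loss(segments):
    pairs = [(i, j) for i in range(len(segments))
             for j in range(i + 1, len(segments))]
    if not pairs:
        return 0.0
    return sum(temporal_iou(segments[i], segments[j]) for i, j in pairs) / len(pairs)

disjoint = [(0.0, 10.0), (12.0, 20.0)]     # no overlap: zero penalty
overlapping = [(0.0, 10.0), (5.0, 15.0)]   # 5 s overlap: positive penalty
```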
https://arxiv.org/abs/2603.11439
The automatic identification of cough segments in audio, by determining their start and end points, is pivotal to building scalable screening tools for pulmonary diseases in health technologies. We propose applying two recent pre-trained architectures to the task of cough activity detection. A dataset of recordings containing coughs from patients symptomatic for tuberculosis (TB) who self-present at community-level care centres in South Africa and Uganda is employed. When automatic start and end points are determined using XLS-R, an average precision of 0.96 and an area under the receiver operating characteristic of 0.99 are achieved on the test set. We show that the best average precision is achieved by utilising only the first three layers of the network, which has the dual benefit of reduced computational and memory requirements, both pivotal for smartphone-based applications. This XLS-R configuration outperforms an audio spectrogram transformer (AST) and a logistic regression baseline by 9% and 27% absolute in test set average precision, respectively. Furthermore, a downstream TB classification model trained on the coughs automatically isolated by XLS-R comfortably outperforms a model trained on the coughs isolated by AST, and is only narrowly outperformed by a classifier trained on the ground truth coughs. We conclude that the application of large pre-trained transformer models is an effective approach to identifying cough end-points and that the integration of such a model into a screening tool is feasible.
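Average precision, the headline metric above, can be computed at the frame level by ranking frames by detector score and averaging the precision measured at each positive frame. A minimal pure-Python sketch with toy scores and labels (not the paper's data):

```python
# Frame-level average precision: rank frames by cough score, then take
# the mean of precision@k evaluated at each ground-truth cough frame.
def average_precision(scores, labels):
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

scores = [0.9, 0.8, 0.3, 0.7]   # per-frame cough scores (toy values)
labels = [1, 0, 1, 1]           # ground-truth cough frames
ap = average_precision(scores, labels)
```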
https://arxiv.org/abs/2603.11241
Healthcare professionals work in complex, high-stakes environments where effective communication is critical for care delivery, team coordination, and individual well-being. However, communication activity in everyday clinical settings remains challenging to measure and largely unexplored in human behavioral research. We present VoxCare, a scalable egocentric wearable audio sensing and computing system that captures natural communication behaviors of hospital professionals in real-world settings without storing raw audio. VoxCare performs real-time, on-device acoustic feature extraction and applies a speech foundation model-guided teacher-student framework to identify foreground speech activity. From these features, VoxCare derives interpretable behavioral measures of communication frequency, duration, and vocal arousal. Our analyses reveal how, when, and how often clinicians communicate across different shifts and working units, and suggest that communication activity reflects underlying workload and stress. By enabling continuous assessment of communication patterns in everyday contexts, this study provides data-driven approaches to understand the behaviors of healthcare providers and ultimately improve healthcare delivery.
https://arxiv.org/abs/2603.10888
We present a parameter-efficient Diffusion Transformer (DiT) for generating 200 bp cell-type-specific regulatory DNA sequences. By replacing the U-Net backbone of DNA-Diffusion with a transformer denoiser equipped with a 2D CNN input encoder, our model matches the U-Net's best validation loss in 13 epochs (60× fewer) and converges 39% lower, while reducing memorization, measured as the fraction of generated sequences aligning to training data via BLAT, from 5.3% to 1.7%. Ablations show the CNN encoder is essential: without it, validation loss increases by 70% regardless of the positional embedding choice. We further apply DDPO finetuning using Enformer as a reward model, achieving a 38× improvement in predicted regulatory activity. Cross-validation against DRAKES on an independent prediction task confirms that these improvements reflect genuine regulatory signal rather than reward-model overfitting.
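Memorization above is quantified with BLAT, an external sequence aligner. As a rough illustration only, the same quantity can be approximated with exact k-mer matching against the training set; the 30 bp match threshold and toy sequences below are assumptions, not the paper's protocol:

```python
# Crude memorization check: fraction of generated sequences containing a
# long exact match (>= min_len bp) to any training sequence. The paper
# uses BLAT alignment; exact substring matching is a simplified proxy.
def memorization_fraction(generated, training, min_len=30):
    kmers = {seq[i:i + min_len]
             for seq in training for i in range(len(seq) - min_len + 1)}
    def memorized(seq):
        return any(seq[i:i + min_len] in kmers
                   for i in range(len(seq) - min_len + 1))
    return sum(memorized(s) for s in generated) / len(generated)

train_seqs = ["ACGT" * 20]                    # one 80 bp training sequence
gen_seqs = [train_seqs[0][:40], "TTTT" * 20]  # one copied, one novel
frac = memorization_fraction(gen_seqs, train_seqs)
```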
https://arxiv.org/abs/2603.10885
Deep learning models can predict protein properties with unprecedented accuracy but rarely offer mechanistic insight or actionable guidance for engineering improved variants. When a model flags an antibody as unstable, the protein engineer is left without recourse: which mutations would rescue stability while preserving function? We introduce Manifold-Constrained Counterfactual Optimization for Proteins (MCCOP), a framework that computes minimal, biologically plausible sequence edits that flip a model's prediction to a desired target state. MCCOP operates in a continuous joint sequence-structure latent space and employs a pretrained diffusion model as a manifold prior, balancing three objectives: validity (achieving the target property), proximity (minimizing mutations), and plausibility (producing foldable proteins). We evaluate MCCOP on three protein engineering tasks (GFP fluorescence rescue, thermodynamic stability enhancement, and E3 ligase activity recovery) and show that it generates sparser, more plausible counterfactuals than both discrete and continuous baselines. The recovered mutations align with known biophysical mechanisms, including chromophore packing and hydrophobic core consolidation, establishing MCCOP as a tool for both model interpretation and hypothesis-driven protein design. Our code is publicly available at this http URL.
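The three-way objective can be illustrated with a toy latent-space gradient descent in which validity, proximity, and plausibility are all quadratic surrogates. The real framework uses a property predictor and a diffusion prior, so every term, weight, and variable below is a stand-in, not MCCOP's implementation:

```python
import numpy as np

# Toy counterfactual search in a latent space: minimize
#   L(z) = ||z - z_target||^2 + lam_prox * ||z - z0||^2
#          + lam_plaus * ||z - z_manifold||^2
# where the first term mimics the validity loss, the second keeps the
# edit minimal (proximity), and the third mimics a manifold prior.
def counterfactual(z0, z_target, z_manifold, lam_prox=0.1, lam_plaus=0.1,
                   lr=0.1, steps=200):
    z = z0.copy()
    for _ in range(steps):
        grad = 2 * (z - z_target)                  # validity term
        grad += 2 * lam_prox * (z - z0)            # proximity term
        grad += 2 * lam_plaus * (z - z_manifold)   # plausibility term
        z -= lr * grad
    return z

z0 = np.zeros(4)             # latent code of the original sequence
z_target = np.ones(4)        # region where the predictor flips its label
z_manifold = np.full(4, 0.5) # stand-in for the data manifold
z_cf = counterfactual(z0, z_target, z_manifold)
```

The fixed point is the weighted average of the three attractors, which is why larger proximity weights yield sparser edits.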
https://arxiv.org/abs/2603.10811
Photometric stereo is a technique for estimating surface normals using images captured under varying illumination. However, conventional frame-based photometric stereo methods are limited in real-world applications due to their reliance on controlled lighting and their susceptibility to ambient illumination. To address these limitations, we propose an event-based photometric stereo system that leverages an event camera, which is effective in scenarios with continuously varying scene radiance and high dynamic range conditions. Our setup employs a single light source moving along a predefined circular trajectory, eliminating the need for multiple synchronized light sources and enabling a more compact and scalable design. We further introduce a lightweight per-pixel multi-layer neural network that directly predicts surface normals from event signals generated by intensity changes as the light source rotates, without system calibration. Experimental results on benchmark datasets and real-world data collected with our data acquisition system demonstrate the effectiveness of our method, achieving a 7.12% reduction in mean angular error compared to existing event-based photometric stereo methods. In addition, our method demonstrates robustness in regions with sparse event activity, strong ambient illumination, and scenes affected by specularities.
https://arxiv.org/abs/2603.10748
We present FireRedASR2S, a state-of-the-art industrial-grade all-in-one automatic speech recognition (ASR) system. It integrates four modules in a unified pipeline: ASR, Voice Activity Detection (VAD), Spoken Language Identification (LID), and Punctuation Prediction (Punc). All modules achieve SOTA performance on the evaluated benchmarks:
- FireRedASR2: an ASR module with two variants, FireRedASR2-LLM (8B+ parameters) and FireRedASR2-AED (1B+ parameters), supporting speech and singing transcription for Mandarin, Chinese dialects and accents, English, and code-switching. Compared to FireRedASR, FireRedASR2 delivers improved recognition accuracy and broader dialect and accent coverage. FireRedASR2-LLM achieves 2.89% average CER on 4 public Mandarin benchmarks and 11.55% on 19 public Chinese dialect and accent benchmarks, outperforming competitive baselines including Doubao-ASR, Qwen3-ASR, and Fun-ASR.
- FireRedVAD: an ultra-lightweight module (0.6M parameters) based on the Deep Feedforward Sequential Memory Network (DFSMN), supporting streaming VAD, non-streaming VAD, and multi-label VAD (mVAD). On the FLEURS-VAD-102 benchmark, it achieves 97.57% frame-level F1 and 99.60% AUC-ROC, outperforming Silero-VAD, TEN-VAD, FunASR-VAD, and WebRTC-VAD.
- FireRedLID: an encoder-decoder LID module supporting 100+ languages and 20+ Chinese dialects and accents. On FLEURS (82 languages), it achieves 97.18% utterance-level accuracy, outperforming Whisper and SpeechBrain.
- FireRedPunc: a BERT-style punctuation prediction module for Chinese and English. On multi-domain benchmarks, it achieves 78.90% average F1, outperforming FunASR-Punc (62.77%).
To advance research in speech processing, we release model weights and code at this https URL.
https://arxiv.org/abs/2603.10420
Emerging experimental evidence shows that writing with AI assistance can change both the views people express in writing and the opinions they hold afterwards. Yet, we lack substantive understanding of the procedural and behavioral changes in co-writing with AI that underlie the observed opinion-shaping power of AI writing tools. We conducted a mixed-methods study, combining retrospective interviews with 19 participants about their AI co-writing experience with a quantitative analysis tracing engagement with ideas and opinions in 1,291 AI co-writing sessions. Our analysis shows that engaging with the AI's suggestions (reading them and deciding whether to accept them) becomes a central activity in the writing process, taking away from more traditional processes of ideation and language generation. As writers often do not complete their own ideation before engaging with suggestions, the suggested ideas and opinions seeded directions that writers then elaborated on. At the same time, writers did not notice the AI's influence and felt in full control of their writing, as they, in principle, could always edit the final text. We term this shift Reactive Writing: an evaluation-first, suggestion-led writing practice that departs substantially from conventional composing in the presence of AI assistance and is highly vulnerable to AI-induced biases and opinion shifts.
https://arxiv.org/abs/2603.10374
Mobile manipulators are envisioned to serve more complex roles in people's everyday lives. With recent breakthroughs in large language models, task planners have become better at translating human verbal instructions into a sequence of tasks. However, there is still a need for a decision-making algorithm that can seamlessly interface with the high-level task planner to carry out the sequence of tasks efficiently. In this work, building on the idea of nonlinear lexicographic optimization, we propose a novel Hierarchical-Task Model Predictive Control framework that is able to complete sequential tasks with improved performance and reactivity by effectively leveraging the robot's redundancy. Compared to the state-of-the-art task-prioritized inverse kinematic control method, our approach improves hierarchical trajectory tracking performance by 42% on average when facing task changes, robot singularities, and reference variations. Compared to a typical single-task architecture, our proposed hierarchical task control architecture enables the robot to traverse a shorter path in task space and achieves an execution time 2.3 times faster when executing a sequence of delivery tasks. We demonstrate these results with real-world experiments on a 9-degree-of-freedom mobile manipulator.
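The task-prioritized inverse kinematic baseline referenced above is classically implemented with nullspace projection: the secondary task velocity is restricted to the nullspace of the primary task's Jacobian so it cannot disturb the higher-priority task. A minimal two-level sketch with toy 1x3 Jacobians (this is the standard baseline formulation, not the paper's MPC):

```python
import numpy as np

# Classical two-level task-priority resolution for a redundant robot.
# The secondary task acts only in the nullspace of the primary Jacobian.
def task_priority_qdot(J1, x1dot, J2, x2dot):
    J1p = np.linalg.pinv(J1)
    N1 = np.eye(J1.shape[1]) - J1p @ J1        # nullspace projector of task 1
    qdot1 = J1p @ x1dot
    # Secondary task tracks its residual, restricted to the nullspace.
    qdot2 = np.linalg.pinv(J2 @ N1) @ (x2dot - J2 @ qdot1)
    return qdot1 + N1 @ qdot2

# Toy Jacobians for a 3-DoF redundant system (values are illustrative).
J1 = np.array([[1.0, 0.0, 0.0]])   # primary task direction
J2 = np.array([[0.0, 1.0, 0.0]])   # secondary task direction
qdot = task_priority_qdot(J1, np.array([0.5]), J2, np.array([0.2]))
```

Here both task velocities are achieved exactly because the tasks are orthogonal; in general the secondary task is satisfied only as far as the redundancy allows.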
https://arxiv.org/abs/2603.10232
As compared to typical mobile manipulation tasks, sequential mobile manipulation poses a unique challenge -- as the robot operates over extended periods, successful task completion is not solely dependent on consistent motion generation but also on the robot's awareness and adaptivity to changes in the operating environment. While existing motion planners can generate whole-body trajectories to complete sequential tasks, they typically assume that the environment remains static and rely on precomputed maps. This assumption often breaks down during long-term operations, where semi-static changes such as object removal, introduction, or shifts are common. In this work, we propose a novel perceptive hierarchical-task model predictive control (HTMPC) framework for efficient sequential mobile manipulation in unstructured, changing environments. To tackle the challenge, we leverage a Bayesian inference framework to explicitly model object-level changes and thereby maintain a temporally accurate representation of the 3D environment; this up-to-date representation is embedded in a lexicographic optimization framework to enable efficient execution of sequential tasks. We validate our perceptive HTMPC approach through both simulated and real-robot experiments. In contrast to baseline methods, our approach systematically accounts for moved and phantom obstacles, successfully completing sequential tasks with higher efficiency and reactivity, without relying on prior maps or external infrastructure.
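The Bayesian treatment of object-level change can be sketched as a per-object Bernoulli presence filter updated in log-odds form, in the style of occupancy grids; the hit and false-alarm likelihoods below are assumed values, not the paper's model:

```python
import math

# Per-object presence belief maintained as log-odds and updated with a
# Bayes filter. p_hit = P(detected | present); p_false = P(detected | absent).
def update_presence(log_odds, detected, p_hit=0.9, p_false=0.1):
    if detected:
        log_odds += math.log(p_hit / p_false)
    else:
        log_odds += math.log((1 - p_hit) / (1 - p_false))
    return log_odds

def probability(log_odds):
    return 1.0 / (1.0 + math.exp(-log_odds))

belief = 0.0  # log-odds 0 corresponds to P(present) = 0.5
# Object observed twice, then missed three times (e.g., it was removed).
for obs in [True, True, False, False, False]:
    belief = update_presence(belief, obs)
```

Repeated misses drive the belief down, letting the planner drop a "phantom" obstacle that no longer exists, while repeated detections promote newly introduced objects.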
https://arxiv.org/abs/2603.10227
Dexterous manipulation is essential for real-world robot autonomy, mirroring the central role of human hand coordination in daily activity. Humans rely on rich multimodal perception (vision, sound, and language-guided intent) to perform dexterous actions, motivating vision-based, language-conditioned manipulation systems for robots. However, training reliable vision-language-action (VLA) models for dexterous manipulation requires large-scale demonstrations across many robotic hands. In addition, as new dexterous embodiments appear rapidly, collecting data for each becomes costly and impractical, creating a need for scalable cross-embodiment learning. We introduce XL-VLA, a vision-language-action framework integrated with a unified latent action space shared across diverse dexterous hands. This embodiment-invariant latent space is directly pluggable into standard VLA architectures, enabling seamless cross-embodiment training and efficient reuse of both existing and newly collected data. Experimental results demonstrate that XL-VLA consistently outperforms baseline VLA models operating in raw joint spaces, establishing it as an effective solution for scalable cross-embodiment dexterous manipulation.
https://arxiv.org/abs/2603.10158
Animal brains exhibit remarkable efficiency in perception and action, while being robust to both external and internal perturbations. The means by which brains accomplish this remains, for now, poorly understood, hindering our understanding of animal and human cognition, as well as our own implementation of efficient algorithms for control of dynamical systems. A potential candidate for a robust mechanism of state estimation and action computation is the free energy principle, but existing implementations of this principle have largely relied on conventional, biologically implausible approaches without spikes. We propose a novel, efficient, and robust spiking control framework with realistic biological characteristics. The resulting networks function as free energy constrainers, in which neurons only fire if they reduce the free energy of their internal representation. The networks offer efficient operation through highly sparse activity while matching the performance of other similar spiking frameworks, and have high resilience against both external perturbations (e.g. sensory noise or collisions) and internal perturbations (e.g. synaptic noise and delays, or neuron silencing) that such a network would face when deployed by either an organism or an engineer. Overall, our work provides a novel mathematical account for spiking control through constraining free energy, providing both better insight into how brain networks might leverage their spiking substrate and a new route for implementing efficient control algorithms in neuromorphic hardware.
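The firing rule described above, where a neuron spikes only if it lowers the free energy of its internal representation, can be illustrated with a greedy toy version in which "free energy" is reduced to squared reconstruction error. This is a deliberate simplification (the paper's networks also involve dynamics, noise, and delays), and the decoding matrix is a toy value:

```python
import numpy as np

# Greedy spiking sketch: each neuron i has a decoding column D[:, i]; a
# neuron fires only if its spike reduces the squared reconstruction
# error (the stand-in for free energy) of the signal estimate.
def encode(x, D, steps=50):
    r = np.zeros(D.shape[1])          # accumulated spike counts
    for _ in range(steps):
        err = x - D @ r
        fired = False
        for i in range(D.shape[1]):
            new_err = err - D[:, i]
            if new_err @ new_err < err @ err:   # fire only if energy drops
                r[i] += 1
                err = new_err
                fired = True
        if not fired:                 # no neuron can lower the energy
            break
    return r

D = np.array([[0.5, 0.0],
              [0.0, 0.5]])           # toy decoding weights
x = np.array([1.0, 1.5])             # signal to represent
r = encode(x, D)
```

Because every spike must strictly reduce the energy, activity is sparse by construction: neurons stay silent once the representation is good enough.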
https://arxiv.org/abs/2603.09729
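The "fire only if it reduces free energy" rule can be sketched with a toy greedy spike-coding loop, using squared reconstruction error as a stand-in for free energy. The decoding weights, update schedule, and energy function below are illustrative assumptions, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Each neuron fires only when its spike reduces the squared reconstruction
# error between a target signal x and the decoded population readout D @ r.
n_neurons, dim = 20, 3
D = rng.normal(size=(dim, n_neurons))   # decoding weights
x = rng.normal(size=dim)                # signal to represent
r = np.zeros(n_neurons)                 # accumulated spike counts

def free_energy(rates):
    return 0.5 * np.sum((x - D @ rates) ** 2)

for _ in range(200):                    # asynchronous candidate updates
    i = rng.integers(n_neurons)         # pick a candidate neuron
    trial = r.copy()
    trial[i] += 1.0                     # effect of one spike from neuron i
    if free_energy(trial) < free_energy(r):
        r = trial                       # fire only if energy strictly drops
```

Because every accepted spike strictly lowers the energy, activity stays sparse: once the readout is close to x, most candidate spikes are rejected and the network goes quiet.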
Wi-Fi Channel State Information (CSI) has emerged as a promising non-line-of-sight sensing modality for human and robotic activity recognition. However, prior work has predominantly relied on CSI amplitude while underutilizing phase information, particularly in robotic arm activity recognition. In this paper, we present GateFusion-Bidirectional Long Short-Term Memory network (GF-BiLSTM) for WiFi sensing in robotic activity recognition. GF-BiLSTM is a two-stream gated fusion network that encodes amplitude and phase separately and adaptively integrates per-time features through a learned gating mechanism. We systematically evaluate state-of-the-art deep learning models under a Leave-One-Velocity-Out (LOVO) protocol across four input configurations: amplitude only, phase only, amplitude + unwrapped phase, and amplitude + sanitized phase. Experimental results demonstrate that incorporating phase alongside amplitude consistently improves recognition accuracy and cross-speed robustness, with GF-BiLSTM achieving the best performance. To the best of our knowledge, this work provides the first systematic exploration of CSI phase for robotic activity recognition, establishing its critical role in Wi-Fi-based sensing.
https://arxiv.org/abs/2603.09047
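The learned gating mechanism that fuses the amplitude and phase streams can be sketched per time step as follows. The random weights, dimensions, and the sigmoid-gated convex mixture are assumptions for illustration; in GF-BiLSTM the gate would be learned jointly with the two BiLSTM encoders:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

T, d = 50, 32                      # time steps, feature dim per stream
h_amp = rng.normal(size=(T, d))    # amplitude-stream features (from one encoder)
h_pha = rng.normal(size=(T, d))    # phase-stream features (from the other)

# A gate computed from both streams decides, per time step and per feature,
# how much to trust amplitude versus phase.
W_g = rng.normal(size=(2 * d, d)) / np.sqrt(2 * d)
gate = sigmoid(np.concatenate([h_amp, h_pha], axis=1) @ W_g)  # (T, d), in (0, 1)
fused = gate * h_amp + (1.0 - gate) * h_pha                   # convex mixture
```

The fused sequence would then feed the downstream classifier; because the gate is a convex weight, each fused feature always lies between its amplitude and phase counterparts.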
Graph neural networks (GNNs) have gained significant attention for their effectiveness across various domains. This study focuses on applying GNNs to process 3D point cloud data for human pose estimation (HPE) and human activity recognition (HAR). We propose novel point cloud feature extraction (PCFEx) techniques that capture meaningful information at the point, edge, and graph levels by treating the point cloud as a graph. Moreover, we introduce a GNN architecture designed to efficiently process these features. Our approach is evaluated on four of the most popular publicly available millimeter-wave radar datasets, three for HPE and one for HAR. The results show substantial improvements, with significantly reduced errors on all three HPE benchmarks and an overall accuracy of 98.8% in mmWave-based HAR, outperforming existing state-of-the-art models. This work demonstrates the great potential of combining feature extraction with a GNN modeling approach to enhance the precision of point cloud processing.
https://arxiv.org/abs/2603.08540