Abstract
We extend the framework of Serialized Output Training (SOT) to address the practical needs of both streaming and offline automatic speech recognition (ASR) applications. Our approach focuses on balancing latency and accuracy, catering to real-time captioning and summarization requirements. We propose several key improvements: (1) Leveraging a Continuous Speech Separation (CSS) single-channel front-end with end-to-end (E2E) systems for highly overlapping scenarios, challenging the conventional wisdom of E2E versus cascaded setups. The CSS front-end improves ASR accuracy by separating the overlapped speech of multiple speakers. (2) Implementing dual models, a Conformer Transducer for streaming and a Sequence-to-Sequence model for offline recognition, or alternatively a two-pass model based on cascaded encoders. (3) Exploring segment-based SOT (segSOT), which is better suited to offline scenarios while also enhancing the readability of multi-talker transcriptions.
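For readers unfamiliar with SOT, the sketch below contrasts utterance-level serialization with the segment-based variant the abstract refers to. It is a minimal Python example, not code from the paper: the <sc> speaker-change token follows the standard SOT recipe, while the fixed segment length, the uniform word-timing heuristic, and the names sot_target and segsot_target are assumptions made purely for illustration.

```python
# Hedged sketch of SOT vs. segment-based SOT (segSOT) target construction.
# Illustrative only: the segmentation rule, segment length, and word-timing
# heuristic below are assumptions, not the authors' implementation.

from dataclasses import dataclass

SC = "<sc>"  # speaker-change token used in SOT-style targets


@dataclass
class Utterance:
    speaker: str
    start: float  # seconds
    end: float
    text: str


def sot_target(utts: list[Utterance]) -> str:
    """Utterance-level SOT: concatenate transcripts in first-in-first-out
    order of start time, emitting <sc> whenever the speaker changes."""
    pieces: list[str] = []
    prev = None
    for u in sorted(utts, key=lambda u: u.start):
        if prev is not None and u.speaker != prev:
            pieces.append(SC)
        pieces.append(u.text)
        prev = u.speaker
    return " ".join(pieces)


def segsot_target(utts: list[Utterance], seg_len: float = 4.0) -> str:
    """Hypothetical segSOT: bucket words into fixed time segments so a short
    interjection lands near where it was spoken instead of after the other
    talker's entire utterance."""
    # Crude word timing: spread each utterance's words uniformly over its span.
    timed: list[tuple[float, str, str]] = []
    for u in utts:
        toks = u.text.split()
        step = (u.end - u.start) / max(len(toks), 1)
        for i, tok in enumerate(toks):
            timed.append((u.start + i * step, u.speaker, tok))
    timed.sort(key=lambda t: t[0])

    # Within each segment, keep each speaker's words contiguous (first-come order).
    buckets: dict[int, list[tuple[str, str]]] = {}
    for t, spk, tok in timed:
        buckets.setdefault(int(t // seg_len), []).append((spk, tok))
    pieces, prev = [], None
    for idx in sorted(buckets):
        by_spk: dict[str, list[str]] = {}
        for spk, tok in buckets[idx]:
            by_spk.setdefault(spk, []).append(tok)
        for spk, toks in by_spk.items():
            if prev is not None and spk != prev:
                pieces.append(SC)
            pieces.append(" ".join(toks))
            prev = spk
    return " ".join(pieces)


if __name__ == "__main__":
    utts = [
        Utterance("A", 0.0, 8.0, "so about the quarterly numbers we are slightly ahead of plan"),
        Utterance("B", 2.0, 3.0, "right"),
        Utterance("B", 9.0, 10.0, "great"),
    ]
    print(sot_target(utts))     # interjection deferred to the very end
    print(segsot_target(utts))  # interjection stays near its time segment
```

On this sample dialogue, utterance-level SOT defers speaker B's "right" until after A's entire utterance, whereas the segment-based variant keeps it near the words it overlapped with, which is the readability gain the abstract attributes to segSOT.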
URL
https://arxiv.org/abs/2506.14204