Abstract
Voice-based interaction has emerged as a natural and intuitive modality for controlling IoT devices. However, speech-driven edge devices face a fundamental trade-off between cloud-based solutions, which offer stronger language understanding capabilities at the cost of latency, connectivity dependence, and privacy concerns, and edge-based solutions, which provide low latency and improved privacy but are limited by computational constraints. This paper presents ASTA, an adaptive speech-to-action solution that dynamically routes voice commands between edge and cloud inference to balance performance and system resource utilization. ASTA integrates on-device automatic speech recognition and lightweight offline language-model inference with cloud-based LLM processing, guided by real-time system metrics such as CPU workload, device temperature, and network latency. A metric-aware routing mechanism selects the inference path at runtime, while a rule-based command validation and repair component ensures successful end-to-end command execution. We implemented our solution on an NVIDIA Jetson-based edge platform and evaluated it using a diverse dataset of 80 spoken commands. Experimental results show that ASTA successfully routes all input commands for execution, achieving a balanced distribution between online and offline inference. The system attains an ASR accuracy of 62.5% and generates executable commands without repair for only 47.5% of inputs, highlighting the importance of the repair mechanism in improving robustness. These results suggest that adaptive edge-cloud orchestration is a viable approach for resilient and resource-aware voice-controlled IoT systems.
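The abstract names two mechanisms without detailing them: metric-aware routing between edge and cloud inference, and rule-based command validation and repair. The sketch below illustrates one way such components could look. It is not ASTA's implementation. The threshold values, the routing policy, the command schema (`device`/`action`), and all names (`SystemMetrics`, `route`, `validate`, `repair`) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class SystemMetrics:
    cpu_load_pct: float        # CPU workload, 0-100
    temperature_c: float       # device temperature in degrees Celsius
    network_latency_ms: float  # round-trip latency to the cloud endpoint


# Hypothetical thresholds, chosen only for illustration (not from the paper).
MAX_LATENCY_MS = 200.0
MAX_CPU_LOAD_PCT = 80.0
MAX_TEMPERATURE_C = 75.0


def route(metrics: SystemMetrics) -> str:
    """Return "cloud" or "edge" for the next command (assumed policy)."""
    network_ok = metrics.network_latency_ms <= MAX_LATENCY_MS
    device_stressed = (
        metrics.cpu_load_pct > MAX_CPU_LOAD_PCT
        or metrics.temperature_c > MAX_TEMPERATURE_C
    )
    if not network_ok:
        return "edge"   # cloud too slow or unreachable: stay on-device
    if device_stressed:
        return "cloud"  # offload when the device is busy or hot
    return "edge"       # otherwise prefer local inference


# Hypothetical command schema: which actions each device accepts.
ALLOWED_ACTIONS = {"light": {"on", "off"}, "fan": {"on", "off"}}
ACTION_SYNONYMS = {
    "turn on": "on", "switch on": "on",
    "turn off": "off", "switch off": "off",
}


def validate(command: dict) -> bool:
    """Check that the command names a known device and a permitted action."""
    device = command.get("device")
    return device in ALLOWED_ACTIONS and command.get("action") in ALLOWED_ACTIONS[device]


def repair(command: dict) -> dict:
    """Apply simple rules: trim and lowercase fields, map action synonyms."""
    device = str(command.get("device", "")).strip().lower()
    action = str(command.get("action", "")).strip().lower()
    action = ACTION_SYNONYMS.get(action, action)
    return {"device": device, "action": action}
```

Under these assumptions, a model output such as `{"device": " Light ", "action": "Switch On"}` fails validation as generated and passes once `repair` has been applied, which is the kind of case a repair stage is meant to recover.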
URL
https://arxiv.org/abs/2512.12769