Speech Command Recognition: Text-to-Speech and Speech Corpus Scraping Are All You Need

Askat Kuzdeuov; Shakhizat Nurgaliyev; Diana Turmakhan; Nurkhan Laiyk; Huseyin Atakan Varol

doi:10.36227/techrxiv.22717657.v1

Abstract

Speech Command Recognition (SCR) is rapidly gaining prominence due to its diverse applications, such as virtual assistants, smart homes, hands-free navigation, and voice-controlled industrial machinery. In this paper, we present a data-centric approach to creating SCR systems for low-resource languages, focusing on the Kazakh language. By leveraging synthetic data generated by Text-to-Speech (TTS) and data extracted from a large-scale speech corpus, we created the Kazakh language equivalent of the Google Speech Commands dataset. We also compiled the Kazakh Speech Commands dataset from recordings of 119 participants and used it to benchmark the Keyword-MLP model trained on our synthetic dataset. The model achieves 89.79% accuracy on the real-world data, demonstrating the efficacy of our approach. Our work can serve as a recipe for creating customized speech command datasets, including for low-resource languages, obviating the need for laborious and costly human data collection.
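
The abstract does not say how command words are extracted from the speech corpus. As a rough illustration only, the following Python sketch assumes the corpus already comes with word-level time alignments and selects fixed-length windows around each occurrence of a command word. The names `AlignedWord`, `ClipSpec`, and `find_command_clips`, the one-second window (chosen to mirror the clip length of Google Speech Commands), and the placeholder command "forward" are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class AlignedWord:
    """One word of a transcript with its time span (assumed to be available)."""
    utterance_id: str
    word: str
    start_s: float
    end_s: float


@dataclass(frozen=True)
class ClipSpec:
    """Where to cut a fixed-length command clip out of an utterance."""
    utterance_id: str
    command: str
    start_s: float
    end_s: float


def find_command_clips(
    words: Iterable[AlignedWord],
    commands: Iterable[str],
    clip_len_s: float = 1.0,  # assumption: one-second clips
) -> List[ClipSpec]:
    """Return one clip window per occurrence of a command word.

    Occurrences with a non-positive duration, or longer than the clip
    itself, are skipped because they cannot fit in the window.
    """
    wanted = {c.lower() for c in commands}
    clips: List[ClipSpec] = []
    for w in words:
        token = w.word.lower()
        duration = w.end_s - w.start_s
        if token not in wanted or duration <= 0 or duration > clip_len_s:
            continue
        # Centre the window on the word, clamped so it never starts before 0 s.
        centre = (w.start_s + w.end_s) / 2
        start = max(0.0, centre - clip_len_s / 2)
        clips.append(ClipSpec(w.utterance_id, token, start, start + clip_len_s))
    return clips


if __name__ == "__main__":
    # Placeholder alignments; a real corpus would supply these.
    demo = [
        AlignedWord("utt1", "forward", 0.2, 0.6),
        AlignedWord("utt1", "please", 0.6, 0.9),
        AlignedWord("utt2", "Forward", 2.0, 2.4),
    ]
    for clip in find_command_clips(demo, ["forward"]):
        print(clip)
```

The resulting windows would then be cut from the audio with any audio library. The sketch does not clamp a window that runs past the end of an utterance, which a real pipeline would need to handle, for example by padding with silence.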