Speech Command Recognition: Text-to-Speech and Speech Corpus Scraping
Are All You Need
Abstract
Speech Command Recognition (SCR) is rapidly gaining prominence due to
its diverse applications, such as virtual assistants, smart homes,
hands-free navigation, and voice-controlled industrial machinery. In
this paper, we present a data-centric approach to creating SCR systems
for low-resource languages, particularly focusing on the Kazakh
language. By leveraging synthetic data generated by Text-to-Speech (TTS)
and data extracted from a large-scale speech corpus, we successfully
created the Kazakh language equivalent of the Google Speech Commands
dataset. We also compiled the Kazakh Speech Commands dataset
with data collected from 119 participants. This dataset was used to
benchmark the performance of the Keyword-MLP model trained using our
synthetic dataset. The results show that the model achieves 89.79%
accuracy on the real-world data, demonstrating the efficacy of our
approach. Our work can serve as a recipe for creating customized speech
command datasets, including for low-resource languages, obviating the
need for laborious and costly human data collection.