Enhancing Speech Rehabilitation: Calibrating 3D-CNN Lip Reading Models
for Higher Single-User Accuracy to Improve Communication in Aphonia and
Aphasia Cases
Abstract
Automated lip-reading systems have the potential to greatly improve
speech recognition and communication for individuals with speech or
hearing disabilities. People with aphasia, aphonia, dysphonia, other
voice disorders, or difficulty swallowing have limited speaking ability
and can be assisted by lip-reading technology. Traditional approaches to
lip reading have relied on hand-crafted features and statistical modeling
techniques, which have limitations in capturing the complex
spatiotemporal dynamics of lip movements. Deep learning approaches have
shown promise in addressing these limitations by learning features
directly from data and have achieved state-of-the-art results in various
speech-related tasks. In this paper, a 3D Convolutional Neural Network
(3D-CNN) is proposed as an approach to automated lip reading. This
system takes a video of a person speaking and processes it through a
3D convolutional layer to extract spatiotemporal features from the
video frames. The system uses deep learning to learn the
mapping between lip movements and corresponding phonemes, enabling it to
recognize spoken words. The approach is evaluated on the MIRACL-VC1
dataset, a collection of visual recordings of spoken words covering 10
words with multiple instances of each. The proposed model achieves 99.0%
training accuracy on this dataset but only 61.3% testing accuracy,
indicating overfitting and a high degree of speaker dependency.
A second, single-speaker dataset was created by the author from
recordings of one speaker. On this dataset, the model achieved 89.0%
training accuracy and 83.0% testing accuracy. Both models are then
evaluated on user-provided video input. The
proposed approach has applications in speech therapy, speech
recognition, and translation for those with speech and voice
disabilities.
The human participant in this research is the researcher/author.
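As a concrete illustration of the architecture summarized above, the following is a minimal sketch of a 3D-CNN word classifier written in PyTorch. The specific layer counts, kernel shapes, and input resolution are assumptions made for illustration; only the use of 3D convolutions over frame sequences and the 10-word output vocabulary follow from the abstract.

```python
# Minimal sketch of a 3D-CNN lip-reading classifier, assuming PyTorch.
# Layer sizes and input resolution are illustrative assumptions, not the
# exact architecture used in the paper.
import torch
import torch.nn as nn

class LipReading3DCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # 3D convolutions slide over (time, height, width), so each filter
        # captures lip motion across consecutive frames as well as spatial shape.
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(2, 2, 2)),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),     # collapse remaining time/space dims
            nn.Flatten(),
            nn.Linear(64, num_classes),  # one logit per word in the vocabulary
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, frames, height, width)
        return self.classifier(self.features(clip))

# Example: a batch containing one 30-frame RGB mouth-region clip at 64x64 pixels.
logits = LipReading3DCNN()(torch.randn(1, 3, 30, 64, 64))
print(logits.shape)  # torch.Size([1, 10]) -> scores over the 10-word vocabulary
```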