Alamina, Iyalla John (2021) Deep Scattering and End-to-End Speech Models towards Low Resource Speech Recognition. Doctoral thesis, University of Huddersfield.

Automatic Speech Recognition (ASR) has advanced rapidly, largely owing to two
machine learning models: Hidden Markov Models (HMMs) and Deep Neural Networks
(DNNs). State-of-the-art results have been achieved by combining these two
disparate methods into a hybrid system. This approach also requires that the
various components of the speech recognizer be trained independently, based on
a probabilistic noisy channel model. Although the hybrid HMM-DNN method has
been successful in recent studies, the independent development of its
individual components makes ASR development fragile and expensive in terms of
the time needed to develop the components and their associated sub-systems. As
a result, ASR systems are difficult to develop and use, especially for new
applications and languages.
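
For orientation, the noisy channel model mentioned above is commonly written as the decoding rule below. The symbols are assumptions introduced here for illustration and are not taken from the thesis: X denotes the sequence of acoustic observations, W a candidate word sequence, p(X | W) the acoustic model and P(W) the language model.

```latex
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid X)
        = \operatorname*{arg\,max}_{W} \, p(X \mid W)\, P(W)
```

In a hybrid system the two factors are typically estimated by separately trained components, which is the independent training that the paragraph above describes.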

The alternative approach, known as the end-to-end paradigm, uses a single deep
neural network architecture to encapsulate as many subcomponents of speech
recognition as possible in a single process. In this paradigm, the latent
variables of the sub-components are subsumed by the neural network
sub-architectures and their associated parameters. The gain of a simplified
ASR development process is, in turn, traded for higher internal model
complexity and greater computational resources needed to train the end-to-end
models.

This research focuses on taking advantage of the development gains of the
end-to-end model for new and low-resource languages. Using a specialised
lightweight convolution-like neural network called the deep scattering network
(DSN) to replace the input layer of the end-to-end model, our objective was to
measure the performance of the end-to-end model using these augmented speech
features, while checking whether the lightweight, wavelet-based architecture
brought about any improvements for low-resource speech recognition in
particular.

The results showed that it is possible to use this compact strategy for speech
pattern recognition by deploying deep scattering network features with
higher-dimensional vectors than traditional speech features. With Word Error
Rates (WER) of 26.8% and 76.7% for the SVCSR and LVCSR tasks respectively, the
ASR system metrics fell a few WER points short of their respective baselines.
In addition, training times tended to be longer than those of the respective
baselines, and the approach therefore showed no significant improvement for
low-resource speech recognition training.
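
As a reminder of what the Word Error Rate figures above measure, the following is a minimal sketch of the standard WER calculation: the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference transcript. The function name `word_error_rate` and the whitespace tokenisation are assumptions made for illustration; this is not the scoring tool used in the thesis.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: tokenises on whitespace and applies no
    text normalisation (casing, punctuation), which real scoring tools
    usually do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.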

FINAL THESIS - Alamina.pdf - Accepted Version
Available under License Creative Commons Attribution Non-commercial No Derivatives.
