TechRxiv

A Flow-Based Deep Latent Variable Model for Speech Spectrogram Modeling and Enhancement

Preprint posted on 27.05.2020, 13:44 by Aditya Arie Nugraha, Kouhei Sekiguchi, Kazuyoshi Yoshii
This paper describes a deep latent variable model of speech power spectrograms and its application to semi-supervised speech enhancement with a deep speech prior. By integrating two major deep generative models, the variational autoencoder (VAE) and the normalizing flow (NF), in a mutually beneficial manner, we formulate a flexible latent variable model called the NF-VAE, which can extract low-dimensional latent representations from high-dimensional observations, like the VAE, and does not need to explicitly represent the distribution of the observations, like the NF. In this paper, we consider a variant of NF called the generative flow (GF, a.k.a. Glow) and formulate a latent variable model called the GF-VAE. We experimentally show that the proposed GF-VAE captures the fine-structured harmonics of speech spectrograms better than the standard VAE, especially in the high-frequency range. A similar finding holds when the GF-VAE and the VAE are used to generate speech spectrograms from latent variables randomly sampled from the standard Gaussian distribution. Lastly, when these models are used as speech priors for statistical multichannel speech enhancement, the GF-VAE outperforms both the VAE and the GF.
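The flow component mentioned above can be illustrated with its standard building block. The sketch below is a minimal NumPy implementation of a Glow-style affine coupling layer: half of the input passes through unchanged, the other half is scaled and shifted by functions of the first half, which makes the transform exactly invertible with a cheap log-determinant. The single-linear-layer scale/translation "networks" are toy stand-ins chosen for brevity, not the architecture used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

class AffineCoupling:
    """One Glow-style affine coupling layer (toy linear s/t networks)."""

    def __init__(self, dim):
        assert dim % 2 == 0
        self.half = dim // 2
        # Toy "networks": single linear maps. Real Glow uses CNNs here.
        self.Ws = 0.1 * rng.standard_normal((self.half, self.half))
        self.Wt = 0.1 * rng.standard_normal((self.half, self.half))

    def _scale_translate(self, x1):
        # tanh keeps the log-scale bounded, a common stabilization trick.
        return np.tanh(x1 @ self.Ws), x1 @ self.Wt

    def forward(self, x):
        # Split: x1 passes through, x2 is affinely transformed given x1.
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self._scale_translate(x1)
        y2 = x2 * np.exp(s) + t
        # The Jacobian is triangular, so log|det J| = sum of log-scales.
        logdet = s.sum(axis=1)
        return np.concatenate([x1, y2], axis=1), logdet

    def inverse(self, y):
        # Because y1 == x1, s and t can be recomputed and undone exactly.
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s, t = self._scale_translate(y1)
        x2 = (y2 - t) * np.exp(-s)
        return np.concatenate([y1, x2], axis=1)
```

Stacking such layers (with permutations between them) yields an invertible map whose exact likelihood is tractable via the accumulated log-determinants, which is what lets the NF-side of the NF-VAE avoid an explicit observation distribution.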

Funding

JSPS KAKENHI No. 19H04137

NII CRIS-Line Collaborative Research

History

Email Address of Submitting Author

adityaarie.nugraha@riken.jp

ORCID of Submitting Author

0000-0001-5424-747X

Submitting Author's Institution

RIKEN

Submitting Author's Country

Japan


Read the peer-reviewed publication in IEEE/ACM Transactions on Audio, Speech, and Language Processing.
