Publication Details

Autoencoder based multi-stream combination for noise robust speech recognition

MALLIDI Sri Harish, OGAWA Tetsuji, VESELÝ Karel, NIDADAVOLU Phani S. and HEŘMANSKÝ Hynek. Autoencoder based multi-stream combination for noise robust speech recognition. In: Proceeding of Interspeech 2015. Dresden: International Speech Communication Association, 2015, pp. 3551-3555. ISBN 978-1-5108-1790-6. ISSN 1990-9772.

Czech title

Multi-proudová kombinace založená na autoenkodéru pro rozpoznávání řeči robustní vůči šumu

Type

conference paper

Language

English

Authors

Mallidi Sri Harish (AmazonCom)
Ogawa Tetsuji, Ph.D. (WasUni)
Veselý Karel, Ing., Ph.D. (DCGM FIT BUT)
Nidadavolu Phani S. (JHU)
Heřmanský Hynek, prof. Ing., Dr.Eng. (DCGM FIT BUT)

URL

http://www.fit.vutbr.cz/research/groups/speech/publi/2015/mallidi_interspeech2015_IS150897.pdf PDF

Keywords

speech recognition, human-computer interaction, computational paralinguistics

Abstract

In this paper, we propose a technique to estimate the performance of DNN-based classifiers. The technique models the distribution of DNN outputs using autoencoders.

Annotation

The performance of automatic speech recognition (ASR) systems degrades rapidly when there is a mismatch between train and test acoustic conditions. Performance can be improved using a multi-stream framework, which combines posterior probabilities from several classifiers (often deep neural networks (DNNs)) trained on different features/streams. Knowledge about the confidence of each of these classifiers on a noisy test utterance can help in devising better techniques for posterior combination than simple sum and product rules [1]. In this work, we propose to use autoencoders, which are multilayer feed-forward neural networks, to estimate this confidence measure. During the training phase, an autoencoder is trained for each stream on TANDEM features extracted from the corresponding DNN. Applying the autoencoder during the testing phase, we show that its reconstruction error is correlated with the robustness of the corresponding stream. These error estimates are then used as confidence measures to combine the posterior probabilities generated by each of the streams. Experiments on the Aurora4 and BABEL databases indicate significant improvements, especially when train and test acoustic conditions are mismatched.
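
The combination idea described in the annotation can be illustrated with a minimal sketch. It is not the authors' implementation: the annotation does not state the weighting rule, so the softmax over negative reconstruction errors, the `temperature` parameter, and all function names here are assumptions made for illustration, and a PCA-style linear autoencoder stands in for the multilayer feed-forward autoencoder used in the paper.

```python
import numpy as np


def fit_linear_autoencoder(train_feats, bottleneck_dim):
    """Fit a PCA-style linear autoencoder (stand-in for a multilayer network)."""
    mean = train_feats.mean(axis=0)
    # Rows of vt are the principal directions of the centered training features.
    _, _, vt = np.linalg.svd(train_feats - mean, full_matrices=False)
    return mean, vt[:bottleneck_dim]


def reconstruction_error(feats, model):
    """Mean squared reconstruction error over the frames of one utterance."""
    mean, basis = model
    centered = feats - mean
    # Encode into the bottleneck, then decode back to the feature space.
    reconstructed = centered @ basis.T @ basis
    return float(np.mean((centered - reconstructed) ** 2))


def combine_posteriors(stream_posteriors, stream_errors, temperature=1.0):
    """Weighted sum of per-stream posteriors; lower error gives a larger weight.

    The softmax weighting is an assumed rule, not taken from the paper.
    """
    errors = np.asarray(stream_errors, dtype=float)
    scores = -errors / temperature
    scores -= scores.max()  # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    combined = sum(w * p for w, p in zip(weights, stream_posteriors))
    # Renormalize each frame so the combined posteriors sum to one.
    combined = combined / combined.sum(axis=1, keepdims=True)
    return combined, weights
```

In use, one such model would be fitted per stream on features from matched training data; at test time, each stream's reconstruction error on the utterance would be passed to `combine_posteriors` together with that stream's frame-level posteriors (arrays of shape frames × classes).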

Published

2015

Pages

3551-3555

Journal

Proceedings of Interspeech - on-line, vol. 2015, no. 9, ISSN 1990-9772

Proceedings

Proceeding of Interspeech 2015

Conference

Interspeech Conference, Dresden, DE

ISBN

978-1-5108-1790-6

Publisher

International Speech Communication Association

Place

Dresden, DE

UT WoS

000380581601277

EID Scopus

2-s2.0-84959165456

BibTeX

@INPROCEEDINGS{FITPUB10970,
   author = "Harish Sri Mallidi and Tetsuji Ogawa and Karel Vesel\'{y} and S. Phani Nidadavolu and Hynek He\v{r}mansk\'{y}",
   title = "Autoencoder based multi-stream combination for noise robust speech recognition",
   pages = "3551--3555",
   booktitle = "Proceeding of Interspeech 2015",
   journal = "Proceedings of Interspeech - on-line",
   volume = 2015,
   number = 09,
   year = 2015,
   location = "Dresden, DE",
   publisher = "International Speech Communication Association",
   ISBN = "978-1-5108-1790-6",
   ISSN = "1990-9772",
   language = "english",
   url = "https://www.fit.vut.cz/research/publication/10970"
}