Publication Details

Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model

KOCOUR Martin, ŽMOLÍKOVÁ Kateřina, ONDEL Yang Lucas Antoine Francois, ŠVEC Ján, DELCROIX Marc, OCHIAI Tsubasa, BURGET Lukáš and ČERNOCKÝ Jan. Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Incheon: International Speech Communication Association, 2022, pp. 4955-4959. ISSN 1990-9772. Available from: https://www.isca-speech.org/archive/pdfs/interspeech_2022/kocour22_interspeech.pdf
Czech title
Návrat k rozpoznávání řeči více mluvčích založenému na společném dekódování s DNN akustickým modelem
Type
conference paper
Language
English
Authors
Kocour Martin, Ing. (DCGM FIT BUT)
Žmolíková Kateřina, Ing., Ph.D. (DCGM FIT BUT)
Ondel Yang Lucas Antoine Francois, Mgr., Ph.D. (UPSAC)
Švec Ján, Ing. (FIT BUT)
Delcroix Marc (NTT)
Ochiai Tsubasa (NTT)
Burget Lukáš, doc. Ing., Ph.D. (DCGM FIT BUT)
Černocký Jan, prof. Dr. Ing. (DCGM FIT BUT)
URL
Keywords

Multi-talker speech recognition, Permutation invariant training, Factorial Hidden Markov models

Abstract

In typical multi-talker speech recognition systems, a neural network-based acoustic model predicts senone state posteriors for each speaker. These are later used by a single-talker decoder which is applied to each speaker-specific output stream separately. In this work, we argue that such a scheme is sub-optimal and propose a principled solution that decodes all speakers jointly. We modify the acoustic model to predict joint state posteriors for all speakers, enabling the network to express uncertainty about the attribution of parts of the speech signal to the speakers. We employ a joint decoder that can make use of this uncertainty together with higher-level language information. For this, we revisit decoding algorithms used in factorial generative models in early multi-talker speech recognition systems. In contrast with these early works, we replace the GMM acoustic model with a DNN, which provides greater modeling power and simplifies part of the inference. We demonstrate the advantage of joint decoding in proof-of-concept experiments on a mixed-TIDIGITS dataset.

Published
2022
Pages
4955-4959
Journal
Proceedings of Interspeech - on-line, no. 9, ISSN 1990-9772
Proceedings
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Conference
Interspeech Conference, Incheon, KR
Publisher
International Speech Communication Association
Place
Incheon, KR
DOI
10.21437/Interspeech.2022-10406
UT WoS
000900724505027
EID Scopus
BibTeX
@INPROCEEDINGS{FITPUB12852,
   author = "Martin Kocour and Kate\v{r}ina \v{Z}mol\'{i}kov\'{a} and Francois Antoine Lucas Yang Ondel and J\'{a}n \v{S}vec and Marc Delcroix and Tsubasa Ochiai and Luk\'{a}\v{s} Burget and Jan \v{C}ernock\'{y}",
   title = "Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model",
   pages = "4955--4959",
   booktitle = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
   journal = "Proceedings of Interspeech - on-line",
   number = 9,
   year = 2022,
   location = "Incheon, KR",
   publisher = "International Speech Communication Association",
   ISSN = "1990-9772",
   doi = "10.21437/Interspeech.2022-10406",
   language = "english",
   url = "https://www.fit.vut.cz/research/publication/12852"
}