Publication Details

Single Channel Target Speaker Extraction and Recognition with Speaker Beam

DELCROIX Marc, ŽMOLÍKOVÁ Kateřina, KINOSHITA Keisuke, OGAWA Atsunori and NAKATANI Tomohiro. Single Channel Target Speaker Extraction and Recognition with Speaker Beam. In: Proceedings of ICASSP 2018. Calgary: IEEE Signal Processing Society, 2018, pp. 5554-5558. ISBN 978-1-5386-4658-8.

Czech title

Extrakce cílového mluvčího z jednoho kanálu a rozpoznávání s paprskem adaptovaným na mluvčího

Type

conference paper

Language

English

Authors

Delcroix Marc (NTT)
Žmolíková Kateřina, Ing., Ph.D. (DCGM FIT BUT)
Kinoshita Keisuke (NTT)
Ogawa Atsunori (NTT)
Nakatani Tomohiro (NTT)

URL

http://www.fit.vutbr.cz/research/groups/speech/publi/2018/delcroix_icassp2018_0005554.pdf

Keywords

Speech Recognition, Speech mixtures, Speaker extraction, Adaptation, Robust ASR

Abstract

This paper addresses the problem of single channel speech recognition of a target speaker in a mixture of speech signals. We propose to exploit auxiliary speaker information provided by an adaptation utterance from the target speaker to extract and recognize only that speaker. Using such auxiliary information, we can build a speaker extraction neural network (NN) that is independent of the number of sources in the mixture and that can track speakers across different utterances; handling an unknown number of sources and tracking speakers are two challenging issues for conventional approaches to speech recognition of mixtures. We call such an informed speaker extraction scheme "SpeakerBeam". SpeakerBeam exploits a recently developed context adaptive deep NN (CADNN) that allows tracking speech from a target speaker using a speaker adaptation layer, whose parameters are adjusted depending on auxiliary features representing the target speaker characteristics. SpeakerBeam was previously investigated for speaker extraction using a microphone array. In this paper, we demonstrate that it is also efficient for single channel speaker extraction. The speaker adaptation layer can be employed to build either a speaker adaptive acoustic model that recognizes only the target speaker or a mask-based speaker extraction network that extracts the target speech from the speech mixture signal prior to recognition. We also show that the latter speaker extraction network can be optimized jointly with an acoustic model to further improve ASR performance.
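
The idea of a speaker adaptation layer whose parameters depend on auxiliary speaker features can be illustrated with a small sketch. The abstract does not give the layer's equations, so everything below is an assumption made for illustration: the layer is modeled as a weighted sum of sub-layers, the weights come from a hypothetical auxiliary network (time average, linear map, softmax) applied to the adaptation utterance, and all names, dimensions, and random parameters are invented. It is not the authors' implementation.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def mean_over_time(frames):
    """Average a list of feature frames into one vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def adaptation_weights(adapt_frames, A):
    """Hypothetical auxiliary network: time average, linear map, softmax.

    Returns one weight per sub-layer, summarizing the target speaker.
    """
    return softmax(matvec(A, mean_over_time(adapt_frames)))


def adaptive_layer(x, sub_layers, alpha):
    """Assumed adaptation layer: sum over m of alpha[m] * (W_m x + b_m)."""
    out = [0.0] * len(sub_layers[0][0])
    for a, (W, b) in zip(alpha, sub_layers):
        h = matvec(W, x)
        out = [o + a * (hi + bi) for o, hi, bi in zip(out, h, b)]
    return out


# Toy dimensions and random parameters, purely illustrative.
rng = random.Random(0)
feat_dim, out_dim, n_sub = 4, 3, 2
sub_layers = [
    (
        [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(out_dim)],
        [0.0] * out_dim,
    )
    for _ in range(n_sub)
]
A = [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(n_sub)]

# Features of the adaptation utterance (5 frames) and one mixture frame.
adapt_frames = [[rng.uniform(0, 1) for _ in range(feat_dim)] for _ in range(5)]
mixture_frame = [rng.uniform(0, 1) for _ in range(feat_dim)]

alpha = adaptation_weights(adapt_frames, A)
hidden = adaptive_layer(mixture_frame, sub_layers, alpha)
```

In this sketch the weights `alpha` are computed once per adaptation utterance and reused for every frame of the mixture, which is one way a network could stay focused on the same speaker regardless of how many other sources are present.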

Published

2018

Pages

5554-5558

Proceedings

Proceedings of ICASSP 2018

Conference

IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, CA

ISBN

978-1-5386-4658-8

Publisher

IEEE Signal Processing Society

Place

Calgary, CA

DOI

10.1109/ICASSP.2018.8462661

UT WoS

000446384605144

EID Scopus

2-s2.0-85054290595

BibTeX

@INPROCEEDINGS{FITPUB11721,
   author = "Marc Delcroix and Kate\v{r}ina \v{Z}mol\'{i}kov\'{a} and Keisuke Kinoshita and Atsunori Ogawa and Tomohiro Nakatani",
   title = "Single Channel Target Speaker Extraction and Recognition with Speaker Beam",
   pages = "5554--5558",
   booktitle = "Proceedings of ICASSP 2018",
   year = 2018,
   location = "Calgary, CA",
   publisher = "IEEE Signal Processing Society",
   ISBN = "978-1-5386-4658-8",
   doi = "10.1109/ICASSP.2018.8462661",
   language = "english",
   url = "https://www.fit.vut.cz/research/publication/11721"
}