Publication Details

Multitask Speech Recognition and Speaker Change Detection for Unknown Number of Speakers

KUMAR Sashi, MADIKERI Srikanth, NIGMATULINA Iuliia, VILLATORO-TELLO Esaú, MOTLÍČEK Petr, PANDIA Karthick, DUBAGUNTA S. Pavankumar and GANAPATHIRAJU Aravind. Multitask Speech Recognition and Speaker Change Detection for Unknown Number of Speakers. In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Seoul: IEEE Signal Processing Society, 2024, pp. 12592-12596. ISBN 979-8-3503-4485-1. Available from: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10446130

Czech title

Víceúlohové rozpoznávání řeči a detekce změny mluvčího pro neznámý počet mluvčích

Type

conference paper

Language

English

Authors

Kumar Sashi (IDIAP)
Madikeri Srikanth (IDIAP)
Nigmatulina Iuliia (IDIAP)
Villatoro-Tello Esaú (IDIAP)
Motlíček Petr, doc. Ing., Ph.D. (DCGM FIT BUT)
Pandia Karthick (Uniphore)
Dubagunta S. Pavankumar (Uniphore)
Ganapathiraju Aravind (Uniphore)

Keywords

speaker change detection, speaker turn detection, speech recognition, multitask learning, F1 score

Abstract

Traditionally, automatic speech recognition (ASR) and speaker change detection (SCD) systems have been trained independently to generate comprehensive transcripts accompanied by speaker turns. Recently, joint training of ASR and SCD systems, by inserting speaker turn tokens in the ASR training text, has been shown to be successful. In this work, we present a multitask alternative to the joint training approach. Results obtained on the mix-headset audio of the AMI corpus show that the proposed multitask training yields an absolute improvement of 1.8% in the coverage- and purity-based F1 score on the SCD task without ASR degradation. We also examine the trade-offs between ASR and SCD performance when trained using multitask criteria. Additionally, we validate the speaker change information in the embedding spaces obtained after different transformer layers of a self-supervised pre-trained model, such as XLSR-53, by integrating an SCD classifier at the output of specific transformer layers. Results reveal that the use of different embedding spaces from the XLSR-53 model for multitask ASR and SCD is advantageous.
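
For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of how an F1 score is commonly derived from segment-level coverage and purity, namely as their harmonic mean. The function name and the sample values are illustrative assumptions and are not taken from the paper; the paper's exact evaluation setup may differ.

```python
def coverage_purity_f1(coverage: float, purity: float) -> float:
    """Return the harmonic mean of coverage and purity (both in [0, 1])."""
    if not (0.0 <= coverage <= 1.0 and 0.0 <= purity <= 1.0):
        raise ValueError("coverage and purity must lie in [0, 1]")
    if coverage + purity == 0.0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2.0 * coverage * purity / (coverage + purity)


# Illustrative values only (assumed, not results reported in the paper).
example_f1 = coverage_purity_f1(0.90, 0.80)
```

Because the harmonic mean is pulled toward the smaller of the two inputs, a system cannot score well by over-segmenting (high purity, low coverage) or under-segmenting (high coverage, low purity) alone.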

Published

2024

Pages

12592-12596

Proceedings

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Conference

2024 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Seoul, KR

ISBN

979-8-3503-4485-1

Publisher

IEEE Signal Processing Society

Place

Seoul, KR

DOI

10.1109/ICASSP48485.2024.10446130

BibTeX

@INPROCEEDINGS{FITPUB13375,
   author = "Sashi Kumar and Srikanth Madikeri and Iuliia Nigmatulina and Esa\'{u} Villatoro-Tello and Petr Motl\'{i}\v{c}ek and Karthick Pandia and Pavankumar S. Dubagunta and Aravind Ganapathiraju",
   title = "Multitask Speech Recognition and Speaker Change Detection for Unknown Number of Speakers",
   pages = "12592--12596",
   booktitle = "ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
   year = 2024,
   location = "Seoul, KR",
   publisher = "IEEE Signal Processing Society",
   ISBN = "979-8-3503-4485-1",
   doi = "10.1109/ICASSP48485.2024.10446130",
   language = "english",
   url = "https://www.fit.vut.cz/research/publication/13375"
}