Publication Details

Probing Self-Supervised Learning Models With Target Speech Extraction

PENG Junyi, DELCROIX Marc, OCHIAI Tsubasa, ASHIHARA Takanori, PLCHOT Oldřich, ARAKI Shoko and ČERNOCKÝ Jan. Probing Self-Supervised Learning Models With Target Speech Extraction. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Seoul: IEEE Signal Processing Society, 2024, pp. 535-539. ISBN 979-8-3503-7451-3. Available from: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10627502
Czech title
Testování modelů získaných samoučením na úloze extrakce řeči cílového mluvčího
Type
conference paper
Language
English
Authors
Peng Junyi, Msc. Eng. (DCGM FIT BUT)
Delcroix Marc (NTT)
Ochiai Tsubasa (NTT)
Ashihara Takanori (NTT)
Plchot Oldřich, Ing., Ph.D. (DCGM FIT BUT)
Araki Shoko (NTT)
Černocký Jan, prof. Dr. Ing. (DCGM FIT BUT)
URL
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10627502
Keywords

Target speech extraction, self-supervised learning, SUPERB

Abstract

Large-scale pre-trained self-supervised learning (SSL) models have shown remarkable advancements in speech-related tasks. However, the utilization of these models in complex multi-talker scenarios, such as extracting a target speaker in a mixture, is yet to be fully evaluated. In this paper, we introduce target speech extraction (TSE) as a novel downstream task to evaluate the feature extraction capabilities of pre-trained SSL models. TSE uniquely requires both speaker identification and speech separation, distinguishing it from other tasks in the Speech processing Universal PERformance Benchmark (SUPERB) evaluation. Specifically, we propose a TSE downstream model composed of two lightweight task-oriented modules based on the same frozen SSL model. One module functions as a speaker encoder to obtain target speaker information from an enrollment speech, while the other estimates the target speaker's mask to extract its speech from the mixture. Experimental results on the Libri2mix datasets reveal the relevance of the TSE downstream task to probe SSL models, as its performance cannot be simply deduced from other related tasks such as speaker verification and separation.
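
As a rough illustration of the two-module design described in the abstract, the following minimal Python sketch shows one frozen feature extractor shared by a speaker encoder and a mask estimator. It is not the authors' implementation: the random projections standing in for the SSL model and the trainable layers, the multiplicative fusion of the speaker embedding, the sigmoid mask over per-frame spectral magnitudes, and all names and sizes (FEAT_DIM, EMB_DIM, N_BINS, frozen_ssl, speaker_encoder, extract) are assumptions made only for this example.

```python
import math
import random

random.seed(0)

FEAT_DIM = 4  # hypothetical SSL feature size
EMB_DIM = 3   # hypothetical speaker-embedding size
N_BINS = 5    # hypothetical number of spectral bins per frame


def rand_matrix(rows, cols):
    """Random weights standing in for learned parameters (assumption)."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Stand-in for the frozen SSL model: a fixed projection that is never updated.
W_SSL = rand_matrix(FEAT_DIM, N_BINS)


def frozen_ssl(frames):
    """Map each input frame to a feature vector with the frozen weights."""
    return [[math.tanh(x) for x in matvec(W_SSL, frame)] for frame in frames]


# Module 1: speaker encoder (mean pooling over time, then a linear layer).
W_SPK = rand_matrix(EMB_DIM, FEAT_DIM)


def speaker_encoder(enrollment_frames):
    feats = frozen_ssl(enrollment_frames)
    pooled = [sum(column) / len(feats) for column in zip(*feats)]
    return matvec(W_SPK, pooled)


# Module 2: mask estimator conditioned on the speaker embedding.
W_FUSE = rand_matrix(FEAT_DIM, EMB_DIM)
W_MASK = rand_matrix(N_BINS, FEAT_DIM)


def extract(mixture_frames, speaker_embedding):
    scale = matvec(W_FUSE, speaker_embedding)  # embedding projected to feature size
    estimate = []
    for frame, feat in zip(mixture_frames, frozen_ssl(mixture_frames)):
        fused = [f * s for f, s in zip(feat, scale)]  # multiplicative fusion (assumption)
        mask = [1.0 / (1.0 + math.exp(-z)) for z in matvec(W_MASK, fused)]
        estimate.append([m * x for m, x in zip(mask, frame)])  # apply mask to the mixture
    return estimate


# Toy non-negative "magnitude" frames for the enrollment utterance and the mixture.
enrollment = [[random.random() for _ in range(N_BINS)] for _ in range(6)]
mixture = [[random.random() for _ in range(N_BINS)] for _ in range(8)]
estimate = extract(mixture, speaker_encoder(enrollment))
```

In this sketch only W_SPK, W_FUSE and W_MASK would be trained, while W_SSL stays fixed, which mirrors the idea of probing a frozen SSL model through lightweight task-oriented modules.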

Published
2024
Pages
535-539
Proceedings
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Conference
2024 IEEE International Conference on Acoustics, Speech and Signal Processing IEEE, Seoul, KR
ISBN
979-8-3503-7451-3
Publisher
IEEE Signal Processing Society
Place
Seoul, KR
DOI
10.1109/ICASSPW62465.2024.10627502
BibTeX
@INPROCEEDINGS{FITPUB13276,
   author = "Junyi Peng and Marc Delcroix and Tsubasa Ochiai and Takanori Ashihara and Old\v{r}ich Plchot and Shoko Araki and Jan \v{C}ernock\'{y}",
   title = "Probing Self-Supervised Learning Models With Target Speech Extraction",
   pages = "535--539",
   booktitle = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
   year = 2024,
   location = "Seoul, KR",
   publisher = "IEEE Signal Processing Society",
   ISBN = "979-8-3503-7451-3",
   doi = "10.1109/ICASSPW62465.2024.10627502",
   language = "english",
   url = "https://www.fit.vut.cz/research/publication/13276"
}