Publication Details

Target Speech Extraction with Pre-Trained Self-Supervised Learning Models

PENG Junyi, DELCROIX Marc, OCHIAI Tsubasa, PLCHOT Oldřich, ARAKI Shoko and ČERNOCKÝ Jan. Target Speech Extraction with Pre-Trained Self-Supervised Learning Models. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Seoul: IEEE Signal Processing Society, 2024, pp. 10421-10425. ISBN 979-8-3503-4485-1. Available from: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10448315
Type
conference paper
Language
English
Authors
Peng Junyi, Msc. Eng. (DCGM FIT BUT)
Delcroix Marc (NTT)
Ochiai Tsubasa (NTT)
Plchot Oldřich, Ing., Ph.D. (DCGM FIT BUT)
Araki Shoko (NTT)
Černocký Jan, prof. Dr. Ing. (DCGM FIT BUT)
URL
Keywords

Target speech extraction, pre-trained models, self-supervised learning, feature aggregation

Abstract

Pre-trained self-supervised learning (SSL) models have achieved remarkable success in various speech tasks. However, their potential in target speech extraction (TSE) has not been fully exploited. TSE aims to extract the speech of a target speaker in a mixture guided by enrollment utterances. We exploit pre-trained SSL models for two purposes within a TSE framework, i.e., to process the input mixture and to derive speaker embeddings from the enrollment. In this paper, we focus on how to effectively use SSL models for TSE. We first introduce a novel TSE downstream task following the SUPERB principles. This simple experiment shows the potential of SSL models for TSE, but extraction performance remains far behind the state-of-the-art. We then extend a powerful TSE architecture by incorporating two SSL-based modules: an Adaptive Input Enhancer (AIE) and a speaker encoder. Specifically, the proposed AIE utilizes intermediate representations from the CNN encoder by adjusting the time resolution of CNN encoder and transformer blocks through progressive upsampling, capturing both fine-grained and hierarchical features. Our method outperforms current TSE systems achieving a SI-SDR improvement of 14.0 dB on LibriMix. Moreover, we can further improve performance by 0.7 dB by fine-tuning the whole model including the SSL model parameters.
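
The abstract reports results as SI-SDR improvement in dB. As a reading aid, the following is a minimal sketch of how that metric is commonly computed, assuming the standard scale-invariant SDR definition (project the estimate onto the reference, then compare the energy of the projection with the energy of the residual). The function names `si_sdr` and `si_sdr_improvement`, the mean removal, and the small `eps` constant are illustrative assumptions; they are not taken from the paper or its evaluation code.

```python
import math


def si_sdr(estimate, target, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length sample sequences.

    Assumed standard definition; not the authors' evaluation script.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    # Remove the mean so a constant offset does not influence the score.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Optimal scaling of the target that best explains the estimate.
    alpha = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    scaled_target = [alpha * b for b in t]
    residual = [a - s for a, s in zip(e, scaled_target)]
    signal_energy = sum(s * s for s in scaled_target)
    residual_energy = sum(r * r for r in residual)
    # eps keeps the ratio finite when either energy is zero.
    return 10.0 * math.log10((signal_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, target):
    """Gain in SI-SDR of the extracted signal over the unprocessed mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


# Toy usage with synthetic tones standing in for speech (hypothetical data).
target = [math.sin(0.10 * n) for n in range(1000)]
interferer = [math.sin(0.37 * n + 1.0) for n in range(1000)]
mixture = [s + i for s, i in zip(target, interferer)]
estimate = [s + 0.1 * i for s, i in zip(target, interferer)]
improvement_db = si_sdr_improvement(estimate, mixture, target)
```

In this toy case the extracted signal keeps one tenth of the interferer amplitude, so the improvement comes out close to 20 dB. The figures in the abstract are measured on LibriMix and are not reproduced by this sketch.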

Published
2024
Pages
10421-10425
Proceedings
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Conference
2024 IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, KR
ISBN
979-8-3503-4485-1
Publisher
IEEE Signal Processing Society
Place
Seoul, KR
DOI
10.1109/ICASSP48485.2024.10448315
BibTeX
@INPROCEEDINGS{FITPUB13275,
   author = "Junyi Peng and Marc Delcroix and Tsubasa Ochiai and Old\v{r}ich Plchot and Shoko Araki and Jan \v{C}ernock\'{y}",
   title = "Target Speech Extraction with Pre-Trained Self-Supervised Learning Models",
   pages = "10421--10425",
   booktitle = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
   year = 2024,
   location = "Seoul, KR",
   publisher = "IEEE Signal Processing Society",
   ISBN = "979-8-3503-4485-1",
   doi = "10.1109/ICASSP48485.2024.10448315",
   language = "english",
   url = "https://www.fit.vut.cz/research/publication/13275"
}