Publication Details
Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations
Stafylakis Themos
Mošner Ladislav, Ing. (DCGM FIT BUT)
Kakouros Sofoklis (unknown)
Plchot Oldřich, Ing., Ph.D. (DCGM FIT BUT)
Burget Lukáš, doc. Ing., Ph.D. (DCGM FIT BUT)
Černocký Jan, prof. Dr. Ing. (DCGM FIT BUT)
Speaker identification, speaker verification, emotion recognition, self-supervised models
Self-supervised learning of speech representations from large amounts of unlabeled data has enabled state-of-the-art results in several speech processing tasks. These representations are typically aggregated across time using descriptive statistics, in particular the first- and second-order statistics of the representation coefficients. In this paper, we examine an alternative way of extracting speaker and emotion information from self-supervised models, based on the correlations between the coefficients of the representations, termed correlation pooling. We show improvements over mean pooling and further gains when the pooling methods are combined via fusion. The code is available at github.com/Lamomal/s3prl_correlation.
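To make the contrast between the pooling strategies in the abstract concrete, here is a minimal NumPy sketch. It is not taken from the paper's repository: the function names, the `(T, D)` input layout (frames by channels), the `eps` stabilizer, and the choice to keep only the upper triangle of the correlation matrix are all assumptions made for illustration, and the published implementation may differ (for example, it may reduce dimensionality before computing correlations).

```python
import numpy as np


def mean_pooling(frames):
    """First-order statistics: average each channel over time. (T, D) -> (D,)"""
    return frames.mean(axis=0)


def mean_std_pooling(frames):
    """First- and second-order statistics: per-channel mean and std. (T, D) -> (2D,)"""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])


def correlation_pooling(frames, eps=1e-8):
    """Channel-wise Pearson correlations over time. (T, D) -> (D * (D - 1) / 2,)

    Hypothetical sketch: each channel is standardized over time, the D x D
    correlation matrix is formed, and only the strictly upper triangle is
    kept because the matrix is symmetric with a (near-)unit diagonal.
    """
    centered = frames - frames.mean(axis=0, keepdims=True)
    normalized = centered / (centered.std(axis=0, keepdims=True) + eps)
    corr = normalized.T @ normalized / frames.shape[0]
    rows, cols = np.triu_indices(frames.shape[1], k=1)
    return corr[rows, cols]


# Stand-in for frame-level features of one utterance (random, not real data):
# 200 frames, 16 channels.
rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 16))

print(mean_pooling(frames).shape)         # (16,)
print(mean_std_pooling(frames).shape)     # (32,)
print(correlation_pooling(frames).shape)  # (120,)
```

Note that the correlation vector grows quadratically with the number of channels, whereas mean and mean-plus-standard-deviation pooling grow linearly, so any practical use on high-dimensional self-supervised features would likely need some form of dimensionality reduction first.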
@INPROCEEDINGS{FITPUB12985,
   author = "Themos Stafylakis and Ladislav Mo\v{s}ner and Sofoklis Kakouros and Old\v{r}ich Plchot and Luk\'{a}\v{s} Burget and Jan \v{C}ernock\'{y}",
   title = "Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations",
   pages = "1136--1143",
   booktitle = "2022 IEEE Spoken Language Technology Workshop, SLT 2022 - Proceedings",
   year = 2023,
   location = "Doha, QA",
   publisher = "IEEE Signal Processing Society",
   ISBN = "978-1-6654-7189-3",
   doi = "10.1109/SLT54892.2023.10023345",
   language = "english",
   url = "https://www.fit.vut.cz/research/publication/12985"
}