Publication Details

Fine-Tuning Self-Supervised Models for Language Identification Using Orthonormal Constraint

PRASAD Amrutha, CAROFILIS Andrés, VANDERREYDT Geoffroy, KHALIL Driss, MADIKERI Srikanth, MOTLÍČEK Petr and SCHUEPBACH Christof. Fine-Tuning Self-Supervised Models for Language Identification Using Orthonormal Constraint. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Seoul: IEEE Signal Processing Society, 2024, pp. 11921-11925. ISBN 979-8-3503-4485-1. Available from: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10446751
Czech title
Fine-Tuning samoučicích modelů pro identifikaci jazyka pomocí ortonormálního omezení
Type
conference paper
Language
english
Authors
Prasad Amrutha (DCGM FIT BUT)
Carofilis Andrés (UNILEON)
Vanderreydt Geoffroy (IDLab - imec)
Khalil Driss (IDIAP)
Madikeri Srikanth (IDIAP)
Motlíček Petr, doc. Ing., Ph.D. (DCGM FIT BUT)
Schuepbach Christof (armasuisse)
URL
https://www.fit.vut.cz/research/publication/13280
Keywords

Language Identification, Transformers, Wav2Vec2, fine-tuning, low-resource, out-of-domain

Abstract

Self-supervised models trained with high linguistic diversity, such as the XLS-R model, can be effectively fine-tuned for the language recognition task. Typically, a back-end classifier followed by a statistics pooling layer is added during training. Commonly used back-end classifiers require a large number of parameters to be trained, which is not ideal in limited-data conditions. In this work, we explore smaller-parameter back-ends using the factorized Time Delay Neural Network (TDNN-F). The TDNN-F architecture is also integrated into Emphasized Channel Attention, Propagation and Aggregation TDNN (ECAPA-TDNN) models, termed ECAPA-TDNN-F, reducing the number of parameters by 30 to 50% absolute, with competitive accuracies and no change in minimum cost. The results show that ECAPA-TDNN-F can be extended to tasks where ECAPA-TDNN is suitable. We also test the effectiveness of a linear classifier and a variant, the orthonormal linear classifier, previously used in x-vector-type systems. The models are trained with NIST LRE17 data and evaluated on the NIST LRE17, LRE22, and ATCO2 LID datasets. Both linear classifiers outperform conventional back-ends, with improvements in accuracy between 0.9% and 9.1%.
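As an illustration of the orthonormal constraint named in the title: in x-vector-type systems, such constraints are commonly enforced by penalizing the deviation of the classifier's weight matrix W from row-orthonormality, i.e. by adding a term proportional to ||W Wᵀ − I||²_F to the training loss. The sketch below, in NumPy, shows only this penalty in isolation; it is not the paper's implementation, and the function name and matrix shapes are illustrative assumptions.

```python
import numpy as np

def semi_orthogonal_penalty(W: np.ndarray) -> float:
    """Illustrative orthonormality penalty ||W W^T - I||_F^2.

    Driving this term toward zero encourages the rows of W
    (a hypothetical linear-classifier weight matrix) to stay
    orthonormal, i.e. W to be semi-orthogonal.
    """
    rows = W.shape[0]
    gram = W @ W.T                      # pairwise row inner products
    return float(np.sum((gram - np.eye(rows)) ** 2))

# A matrix with orthonormal rows incurs (near-)zero penalty,
# while a random matrix does not.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
W_ortho = Q[:4, :]                      # 4x8 with orthonormal rows
W_rand = rng.standard_normal((4, 8))

print(semi_orthogonal_penalty(W_ortho))  # effectively 0
print(semi_orthogonal_penalty(W_rand))   # clearly positive
```

In practice such a penalty (or a periodic re-orthonormalization step, as in Kaldi's factorized TDNN layers) is combined with the main classification loss during fine-tuning.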

Published
2024
Pages
11921-11925
Proceedings
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Conference
2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, KR
ISBN
979-8-3503-4485-1
Publisher
IEEE Signal Processing Society
Place
Seoul, KR
DOI
10.1109/ICASSP48485.2024.10446751
EID Scopus
BibTeX
@INPROCEEDINGS{FITPUB13280,
   author = "Amrutha Prasad and Andr\'{e}s Carofilis and Geoffroy Vanderreydt and Driss Khalil and Srikanth Madikeri and Petr Motl\'{i}\v{c}ek and Christof Schuepbach",
   title = "Fine-Tuning Self-Supervised Models for Language Identification Using Orthonormal Constraint",
   pages = "11921--11925",
   booktitle = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
   year = 2024,
   location = "Seoul, KR",
   publisher = "IEEE Signal Processing Society",
   ISBN = "979-8-3503-4485-1",
   doi = "10.1109/ICASSP48485.2024.10446751",
   language = "english",
   url = "https://www.fit.vut.cz/research/publication/13280"
}