Publication Details
BUT/JHU System Description for CHiME-8 NOTSOFAR-1 Challenge
Klement Dominik, Bc. (FIT BUT)
Han Jiangyu, M.Eng. (DCGM FIT BUT)
Sedláček Šimon, Ing. (DCGM FIT BUT)
Yusuf Bolaji (DCGM FIT BUT)
Maciejewski Matthew (JHU)
Wiesner Matthew (JHU)
Burget Lukáš, doc. Ing., Ph.D. (DCGM FIT BUT)
multi-talker speech recognition, CHiME-8, NOTSOFAR-1, target-speaker
This paper presents our method for tackling the CHiME-8 challenge's NOTSOFAR-1 task, which requires participants to perform multi-speaker automatic speech recognition (ASR) using audio from distant microphone arrays. We modify the Pyannote3 diarization pipeline, incorporating pre-trained WavLM as local EEND to adapt effectively to new domains, and we introduce two diarization-aware approaches to ASR by conditioning Whisper on diarization outputs for target-speaker ASR. The first method, which we refer to as Query-Key Biasing, modifies Whisper's attention mechanism and positional embeddings with a learnable attention mask to exclude non-target speaker segments in the audio. The second method, called Frame-Level Diarization-Dependent Transformations, applies affine, diarization-dependent transformations with trainable parameters to the inputs of one or more transformer blocks. We also extend both the ASR and diarization systems to a multichannel setup by incorporating cross-channel communication into our models. Finally, we report the performance of these approaches on the NOTSOFAR-1 dataset.
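The second method in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of one affine transform per diarization class, and the blending of those transforms by per-frame class probabilities are all assumptions made for illustration. The abstract states only that the transformations are affine, diarization-dependent, trainable, and applied to transformer block inputs.

```python
# Hypothetical sketch: names, shapes and the per-class parameterization are
# assumptions for illustration, not details taken from the paper.

def affine(W, b, x):
    """Return W @ x + b for a matrix W and vectors b, x given as plain lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def fddt(frames, diar_probs, weights, biases):
    """Blend per-class affine transforms of each frame by diarization probabilities.

    frames:     T vectors of size D (inputs to a transformer block)
    diar_probs: T vectors of size C (per-frame class probabilities)
    weights:    C matrices of size D x D (trainable in a real model)
    biases:     C vectors of size D (trainable in a real model)
    """
    out = []
    for x, probs in zip(frames, diar_probs):
        y = [0.0] * len(x)
        for p, W, b in zip(probs, weights, biases):
            z = affine(W, b, x)
            y = [yi + p * zi for yi, zi in zip(y, z)]
        out.append(y)
    return out


# Toy example with D = 2 and C = 2 assumed classes:
# class 0 ("target") keeps the frame, class 1 ("non-target") zeroes it.
identity = [[1.0, 0.0], [0.0, 1.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
weights = [identity, zero]
biases = [[0.0, 0.0], [0.0, 0.0]]
frames = [[1.0, 2.0], [3.0, 4.0]]
diar_probs = [[1.0, 0.0], [0.0, 1.0]]  # frame 0 is target, frame 1 is not
transformed = fddt(frames, diar_probs, weights, biases)
```

In this toy setting the target frame passes through unchanged and the non-target frame is suppressed. In a trained model the weights and biases would be learned rather than fixed by hand.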
@INPROCEEDINGS{FITPUB13338,
   author = "Alexander Polok and Dominik Klement and Jiangyu Han and \v{S}imon Sedl\'{a}\v{c}ek and Bolaji Yusuf and Matthew Maciejewski and Matthew Wiesner and Luk\'{a}\v{s} Burget",
   title = "BUT/JHU System Description for CHiME-8 NOTSOFAR-1 Challenge",
   pages = "18--22",
   booktitle = "Proceedings of CHiME 2024 Workshop",
   year = 2024,
   location = "Kos Island, GR",
   publisher = "International Speech Communication Association",
   doi = "10.21437/CHiME.2024-4",
   language = "english",
   url = ""
}