Publication Details
Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks
Madikeri Srikanth (IDIAP)
Zuluaga-Gomez Juan (IDIAP)
Sharma Bidisha (Uniphore)
Sarfjoo Seyyed Saeed (IDIAP)
Nigmatulina Iuliia (IDIAP)
Motlíček Petr, doc. Ing., Ph.D. (DCGM FIT BUT)
Ivanov Alexei V.
Ganapathiraju Aravind (Uniphore)
Speech Recognition, Human-computer Interaction, Spoken Language Understanding, Word Consensus Networks, Cross-modal Attention
In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems to perform the SLU intent detection task: 1) text-based, 2) lattice-based, and a novel 3) multimodal approach. Our work provides a comprehensive analysis of what could be the achievable performance of different state-of-the-art SLU systems under different circumstances, e.g., automatically vs. manually generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) outputs, namely word-consensus-networks, allows the SLU system to improve in comparison to the 1-best setup (5.5% relative improvement). However, cross-modal approaches, i.e., learning from acoustic and text embeddings, obtain performance similar to the oracle setup, a relative improvement of 17.8% over the 1-best configuration, making them a recommended alternative to overcome the limitations of working with automatically generated transcripts.
@INPROCEEDINGS{FITPUB13158,
   author = "Esa\'{u} Villatoro-Tello and Srikanth Madikeri and Juan Zuluaga-Gomez and Bidisha Sharma and Seyyed Saeed Sarfjoo and Iuliia Nigmatulina and Petr Motl\'{i}\v{c}ek and Alexei V. Ivanov and Aravind Ganapathiraju",
   title = "Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks",
   pages = "1--5",
   booktitle = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
   year = 2023,
   location = "Rhodes Island, GR",
   publisher = "IEEE Signal Processing Society",
   ISBN = "978-1-7281-6327-7",
   doi = "10.1109/ICASSP49357.2023.10095168",
   language = "english",
   url = "https://www.fit.vut.cz/research/publication/13158"
}