End-to-end architectures for ASR-free spoken language understanding

Authors:

Elisavet Palogiannidi, Ioannis Gkinis, George Mastrapas, Petr Mizera, Themos Stafylakis

Collaborators:

Athens University of Economics and Business
Omilia

Publication Date:

November 2020

Dialogue systems are an important part of today's Artificial Intelligence (AI) applications. Task-oriented dialogue systems aim to help users complete certain tasks more efficiently. Spoken language understanding (SLU), a key component of task-oriented dialogue systems, is the problem of extracting meaning from speech utterances. It is typically addressed as a two-stage procedure: first, an Automatic Speech Recognition (ASR) model decodes speech into text; then, a Natural Language Understanding (NLU) model takes the most likely hypothesis for the user's utterance as input and extracts its meaning. Several challenges lead to ASR errors that propagate to the NLU module. Because the misrecognition of a single word may result in misunderstanding of the whole utterance, one technique for making these systems more robust is to feed the NLU module a list of the most likely hypotheses (an N-best list) instead of only the single most likely hypothesis (the 1-best).

In this thesis, we examined whether the performance of an SLU system on the intent detection task improves when N-best lists are used as input to the NLU module during training, as a form of data augmentation, compared to using only the 1-best hypothesis. We conducted experiments with a set of standard LSTM-based architectures and state-of-the-art transformer models on the recently introduced Fluent Speech Commands (FSC) dataset, where intents are defined as combinations of three slots (action, object, and location).
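As a rough illustration of this augmentation scheme (a minimal sketch with hypothetical names, not the thesis code: asr_nbest stands in for an actual ASR decoder, and the toy slot values are invented), each training utterance contributes one (hypothesis, intent) pair per N-best entry instead of a single 1-best pair, with the intent label formed from the three FSC slots:

from collections import namedtuple

# In Fluent Speech Commands, an intent is the combination of three slots.
Intent = namedtuple("Intent", ["action", "object", "location"])

def asr_nbest(audio, n=5):
    """Hypothetical placeholder for an ASR decoder that returns the N most
    likely transcription hypotheses, ordered from best to worst."""
    raise NotImplementedError

def augment_with_nbest(dataset, n=5):
    """Expand each (audio, intent) pair into up to N (hypothesis, intent)
    pairs, one per ASR hypothesis, instead of keeping only the 1-best."""
    augmented = []
    for audio, intent in dataset:
        for hypothesis in asr_nbest(audio, n=n):
            augmented.append((hypothesis, intent))
    return augmented

# Example FSC-style label for "turn on the lights in the kitchen"
# (slot values here are illustrative):
label = Intent(action="activate", object="lights", location="kitchen")

The augmented (hypothesis, intent) pairs would then be used to train the NLU classifier (an LSTM- or transformer-based model in this setting) exactly as 1-best pairs would be, so the extra N-best variants act purely as additional training data.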