Dialogue systems are an important part of today's Artificial Intelligence (AI) applications. The main goal of task-oriented dialogue systems is to help users complete
certain tasks more efficiently. Spoken language understanding (SLU), a key component
of task-oriented dialogue systems, is the problem of extracting the meaning from speech
utterances. It is typically addressed as a two-task procedure, where initially an Automatic
Speech Recognition (ASR) model is employed to decode speech into text, followed by a
Natural Language Understanding (NLU) model that takes as input the most likely hypothesis
for the user's utterance in order to extract the meaning. Several challenges lead to errors
in ASR that propagate to the NLU module. Because the misrecognition of a single word may result in the misunderstanding of the whole utterance, one technique for making these systems more robust is
to use a list of the most likely hypotheses (an N-best list) as input to the NLU module,
instead of only the single most likely hypothesis (1-best). In this thesis, we examined whether the SLU system
performance for the problem of intent detection is improved by using N-best lists as input
to NLU during training, as a form of data augmentation, compared to using only the 1-best
hypothesis. We conducted experiments with a set of standard LSTM-based architectures and
state-of-the-art Transformer models on the recently introduced Fluent Speech Commands
(FSC) dataset, where intent classes are formed as combinations of three slots (action,
object, and location).
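
The augmentation idea described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the hypothesis strings, intent tuples, and function names are invented for the example, and in practice the N-best lists would come from an ASR decoder run on the FSC utterances.

```python
# Sketch of N-best-list data augmentation for intent detection.
# Each training example pairs an N-best list of ASR hypotheses with the
# gold intent, here written as an (action, object, location) slot tuple
# in the style of the FSC dataset. All data below is invented.

def one_best_pairs(examples):
    """Baseline: keep only the most likely (1-best) hypothesis per utterance."""
    return [(hyps[0], intent) for hyps, intent in examples]

def n_best_pairs(examples):
    """Augmentation: every hypothesis in the N-best list becomes its own
    training pair, all carrying the utterance's intent label."""
    return [(hyp, intent) for hyps, intent in examples for hyp in hyps]

examples = [
    (["turn on the lights in the kitchen",      # 1-best
      "turn on the light in the kitchen",       # alternative hypotheses
      "turn on the lights in the chicken"],
     ("activate", "lights", "kitchen")),
    (["decrease the heat",
      "decrease the heating"],
     ("decrease", "heat", "none")),
]

baseline = one_best_pairs(examples)   # one pair per utterance
augmented = n_best_pairs(examples)    # one pair per hypothesis
```

The augmented set exposes the NLU model to plausible ASR misrecognitions during training, which is the hypothesized source of the robustness gain examined in the thesis.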