Speaker Verification Using End-to-End Adversarial Language Adaptation

Omilia’s R&D team comprises PhD and MSc holders who contribute actively to research on ASR and NLU, co-authoring papers under their university or Omilia affiliation.

Authors

Johan Rohdin, Themos Stafylakis, Anna Silnova, Hossein Zeinali, Lukas Burget, Oldrich Plchot

Abstract: In this paper we investigate the use of adversarial domain adaptation for addressing the problem of language mismatch between speaker recognition corpora. In the context of speaker verification, adversarial domain adaptation methods aim at minimizing certain divergences between the distributions that the utterance-level features (i.e. speaker embeddings) follow when drawn from the source and target domains (i.e. languages), while preserving their capacity to recognize speakers. Neural architectures for extracting utterance-level representations enable us to apply adversarial adaptation methods in an end-to-end fashion and train the network jointly with the standard cross-entropy loss. We examine several configurations, such as the use of (pseudo-)labels on the target domain as well as domain labels in the feature extractor, and we demonstrate the effectiveness of our method on the challenging NIST SRE16 and SRE18 benchmarks.
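As a rough illustration of the joint objective described in the abstract, the following minimal sketch combines a speaker cross-entropy loss on labeled source-language utterances with an adversarial term from a language discriminator. It assumes a generic gradient-reversal-style formulation, which may differ from the divergence the authors actually minimize. The function names, the `adv_weight` parameter, and the toy scores are all hypothetical and chosen only for illustration.

```python
import math

def softmax_cross_entropy(logits, label):
    """Speaker classification loss for one utterance, from raw class scores."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def binary_cross_entropy(logit, is_target):
    """Discriminator loss for one embedding (is_target: 1 = target language, 0 = source)."""
    s = logit if is_target else -logit
    # Numerically stable softplus(-s) = log(1 + exp(-s))
    return max(-s, 0.0) + math.log1p(math.exp(-abs(s)))

def encoder_objective(speaker_logits, speaker_labels,
                      domain_logits, domain_labels, adv_weight=0.1):
    """Loss seen by the embedding extractor (assumed formulation, not the paper's).

    The speaker loss uses labeled source utterances only. The domain loss uses
    embeddings from both languages; the extractor is rewarded when the
    discriminator does badly, hence the minus sign.
    """
    spk = sum(softmax_cross_entropy(z, y)
              for z, y in zip(speaker_logits, speaker_labels)) / len(speaker_labels)
    dom = sum(binary_cross_entropy(z, d)
              for z, d in zip(domain_logits, domain_labels)) / len(domain_labels)
    return spk - adv_weight * dom, spk, dom

# Toy scores: two source utterances with speaker labels, four embeddings with language labels.
total, spk, dom = encoder_objective(
    speaker_logits=[[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]],
    speaker_labels=[0, 1],
    domain_logits=[1.2, -0.4, 0.3, -2.0],
    domain_labels=[1, 0, 1, 0],
)
print(f"speaker loss={spk:.4f}, domain loss={dom:.4f}, encoder objective={total:.4f}")
```

In a real system the discriminator would be trained to minimize the domain loss while the extractor minimizes the combined objective, so the two play a min-max game. A larger `adv_weight` pushes the embeddings toward language invariance at the possible cost of speaker discrimination.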

Introduction

The need for domain adaptation (DA) arises when the target domain data is insufficient (and possibly unlabeled) for training a model from scratch, and therefore source domain data (assumed labeled and sufficient for training a model) should be leveraged as well. The core idea behind DA is that the knowledge distilled from the source domain can be transferred to the target domain, despite the differences in the marginal distributions of the two domains. Conventional approaches to DA, such as fine-tuning a source domain model on the target domain data, may fail in many settings because the target data is weakly labeled or even unlabeled.

DA methods for speaker verification are of particular interest, as large amounts of labeled target domain data are rarely available for real-world applications. Hence, for training state-of-the-art models, which require several thousand training utterances, one should resort to large out-of-domain corpora and use the small and possibly unlabeled target domain datasets for language, channel or other types of adaptation. In order to promote further research in DA methods, MIT-LL and NIST have organized three evaluations: the MIT-LL DA challenge (DAC-2013), NIST SRE16, and the recent NIST SRE18, with the latter two focused primarily on language adaptation. Several DA methods were introduced as part of those evaluations, the majority of which approach the problem as a transformation of fixed utterance-level representations, such as i-vectors.

View the complete paper here: https://arxiv.org/pdf/1811.02331.pdf