The Omilia R&D team is participating in JSALT 2020, hosted by Johns Hopkins University. Omilia joins distinguished academic institutions in a six-week research workshop on machine learning for speech, language, and computer vision technology.

You may find more information at the official website. You can also contact us for additional information or to reach the R&D team involved.

R&D team:

Themos Stafylakis – Head of Machine Learning & Voice Biometrics

Niko Brümmer – Senior Speech and Speaker Recognition Scientist

Johan Rohdin – Senior Speech and Speaker Recognition Scientist (also with Brno University of Technology)

Speech Recognition and Diarization for Unsegmented Multi-talker Recordings with Speaker Overlaps. 


Multi-talker conversational speech transcription using distant microphones is becoming an increasingly important application scenario in the speech industry.

However, many fundamental challenges still need to be overcome. Overlapped speech (and, equally importantly, quick turn-taking), which breaks the assumption of "one active person at a time", is one of the long-standing problems that have barely been addressed. Speech separation and speech extraction are two extensively studied approaches for handling overlaps. The former separates the mixture into its constituent speech signals, each of which is then processed by speech recognition and speaker diarization. The latter starts by detecting speaker segments in the overlapped speech to obtain speaker embeddings, followed by speaker-informed speech separation or recognition to extract the transcription for each speaker. While both approaches have been studied extensively in laboratory settings using pre-segmented utterances, their application to real, unsegmented multi-talker recordings remains limited. Moreover, existing real-world systems are modular, built from separately trained subsystems for speech separation, speech recognition, and so on, which may result in sub-optimal solutions.
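To make the contrast between the two modular approaches concrete, the following is a minimal sketch of their control flow. Every function here is an illustrative stub with placeholder logic, not a real toolkit API; the names (`separate_sources`, `speaker_informed_extract`, etc.) are assumptions introduced only for this example.

```python
import numpy as np

def separate_sources(mixture, num_speakers=2):
    """Speech separation stub: split a mixture into per-speaker signals."""
    return [mixture / num_speakers for _ in range(num_speakers)]

def recognize(signal):
    """Speech recognition stub: returns a placeholder transcript."""
    return "<transcript>"

def diarize(signal):
    """Speaker diarization stub: returns (start, end, speaker) segments."""
    return [(0.0, len(signal) / 16000.0, "spk0")]

def extract_embedding(segment):
    """Speaker embedding stub (e.g. a fixed-size x-vector-like vector)."""
    return np.zeros(4)

def speaker_informed_extract(mixture, embedding):
    """Target-speaker extraction stub conditioned on an embedding."""
    return mixture

def separation_pipeline(mixture):
    # Approach 1: separate first, then run recognition and
    # diarization independently on each separated stream.
    return [
        {"transcript": recognize(src), "segments": diarize(src)}
        for src in separate_sources(mixture)
    ]

def extraction_pipeline(mixture, speaker_segments):
    # Approach 2: embeddings computed from detected speaker segments
    # drive speaker-informed extraction, then recognition per speaker.
    results = []
    for seg in speaker_segments:
        emb = extract_embedding(seg)
        target = speaker_informed_extract(mixture, emb)
        results.append({"transcript": recognize(target)})
    return results

mixture = np.random.randn(16000)  # one second of synthetic 16 kHz audio
print(len(separation_pipeline(mixture)))             # → 2
print(len(extraction_pipeline(mixture, [mixture])))  # → 1
```

The structural difference is where the speaker identity enters: approach 1 only assigns speakers after separation, while approach 2 conditions the extraction itself on a speaker embedding, which is what makes joint, end-to-end training of the two stages an attractive direction.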

In this workshop, we propose hosting a team to pursue the following goals: (1) to build fully contained multi-talker audio transcription systems based on the two approaches described above, while investigating their relative merits with respect to overlap handling, speech recognition, and speaker diarization; (2) to explore end-to-end modeling for unsegmented multi-talker audio recordings within the framework of each approach; and (3) to explore the use of unlabeled data to further improve these techniques in an unsupervised manner. The emphasis of the project is on building fully contained systems that handle unsegmented conversational audio without depending on unrealistic assumptions such as the availability of speaker segmentation files.

We aim to organize a research team of outstanding researchers who have worked intensively in the relevant areas, including speech separation, speech recognition, speaker diarization, unsupervised training, and end-to-end modeling. The team will focus on delivering systems that provide a foundation for future research, and on cross-fertilizing ideas between these areas by investigating interdisciplinary end-to-end approaches.

The goal of the workshop is to improve existing methods, and develop novel ones, for Speech Recognition and Diarization for Unsegmented Multi-talker Recordings with Speaker Overlaps. You can find the full project description here.