Text-To-Speech

Convert text to audible speech with Omilia deepTTS. Synthesize words with accurate pronunciation and a humanlike voice.

Omilia deepTTS

Omilia’s deepTTS receives text and delivers human-like, natural-sounding voices to end users. It is designed to empower call centers and voice applications with state-of-the-art conversational speech synthesis for a range of languages and voices, based on neural models optimized for different language domains.

• deepTTS delivers lifelike synthesized speech based on deep learning technologies. It batches multiple requests to greatly reduce generation time, supports multiple languages and voices, and allows custom lexicons, configurations and models for improved in-domain performance.

• Omilia’s deepTTS can be integrated through APIs, CLI tools, and user interfaces, and offers deeper insight when combined with Omilia’s analytics and reviewing tools.

• It can be used synchronously when streaming is required, and it integrates seamlessly with live interactions, whether in call centers, drive-throughs or anywhere else where response time and latency play a crucial role.

Multiple voices support

Choose an already available persona or create your own from a voice.

3rd party integrations available.

Total control

SSML support allows users to control prosody, pitch, rate of speech, volume, breaks and other important features of speech.
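
To illustrate the kind of control SSML gives, here is a minimal Python sketch that builds a small SSML document with the standard W3C `<break>` and `<prosody>` elements. The helper name `build_ssml`, its parameters and the sample sentence are assumptions made for this example only. Which SSML elements and attribute values deepTTS accepts is not stated here, and the markup is simplified (the namespace declaration is omitted).

```python
import xml.etree.ElementTree as ET


def build_ssml(intro, key_phrase, rate="slow", pitch="+5%", volume="loud", pause="300ms"):
    """Return SSML that speaks `intro`, pauses, then speaks `key_phrase` with adjusted prosody.

    Hypothetical helper for illustration; not part of any Omilia API.
    """
    speak = ET.Element("speak", {"version": "1.1", "xml:lang": "en-US"})
    speak.text = intro
    # <break> inserts a pause of the given duration.
    ET.SubElement(speak, "break", {"time": pause})
    # <prosody> adjusts rate, pitch and volume for the text it encloses.
    prosody = ET.SubElement(
        speak, "prosody", {"rate": rate, "pitch": pitch, "volume": volume}
    )
    prosody.text = key_phrase
    # ElementTree escapes special characters such as "&" when serializing.
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Your balance is", "two hundred dollars & ten cents")
print(ssml)
```

Building the markup with an XML library rather than string concatenation keeps the document well-formed even when the input text contains characters such as `&` or `<`.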

Dialog Manager Integration

Integration with Dialog Manager context (Dialog Acts) for improved synthesis of ambiguous sentences.

Omilia Text-To-Speech Samples

EN-US deepTTS Samples

The TTS engine preprocesses long text using not only punctuation but also lexical and prosodic cues. As a result, generation is faster and more robust, and the output sounds more natural, as if a real person were speaking.
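
The idea of splitting long text on more than punctuation can be sketched roughly as follows. This is a toy illustration, not the engine's actual preprocessing: the function `split_for_synthesis`, the `CUE_WORDS` set and the `max_words` threshold are all assumptions, and a real system would rely on much richer lexical and prosodic analysis.

```python
import re

# Hypothetical lexical cues at which a long sentence may be split.
CUE_WORDS = {"and", "but", "because", "so"}


def split_for_synthesis(text, max_words=8):
    """Split text into chunks: first at punctuation, then before cue words in long sentences."""
    chunks = []
    # First pass: split after sentence-level punctuation.
    for sentence in re.split(r"(?<=[.!?;:])\s+", text.strip()):
        current = []
        for word in sentence.split():
            # Second pass: start a new chunk before a cue word once the chunk is long enough.
            if len(current) >= max_words and word.lower() in CUE_WORDS:
                chunks.append(" ".join(current))
                current = []
            current.append(word)
        if current:
            chunks.append(" ".join(current))
    return chunks


text = (
    "Thank you for calling. We have located your account and we can see "
    "two open orders but we need to verify your identity first."
)
for chunk in split_for_synthesis(text, max_words=4):
    print(chunk)
```

Shorter chunks can be synthesized sooner and independently, which is one plausible reason such preprocessing helps with both speed and robustness.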

The TTS engine is capable of assigning different prosodic patterns based on intended meaning. Here, the part “compromised or stolen” is not a disjunction (as in the previous “your checking or your savings” example) but it is meant to be voiced as one possibility, where the answer to the question would be a yes or a no.

The TTS engine can disambiguate homographs thanks to the part-of-speech analysis that runs before audio is synthesized. In this example, the noun "record" is successfully differentiated from the corresponding verb.
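
A minimal sketch of how a part-of-speech tag can select between pronunciations of a homograph is shown below. The lexicon, the tag names and the `pronounce` function are assumptions for illustration, the tags are supplied by hand instead of coming from a real tagger, and the phoneme strings use ARPAbet-style notation. It does not describe how deepTTS is implemented.

```python
# Toy pronunciation lexicon keyed by (word, part of speech); ARPAbet-style phonemes.
HOMOGRAPHS = {
    ("record", "NOUN"): "R EH1 K ER0 D",   # stress on the first syllable
    ("record", "VERB"): "R IH0 K AO1 R D",  # stress on the second syllable
}


def pronounce(tagged_tokens):
    """Map (word, pos) pairs to pronunciations, leaving unknown words unchanged."""
    return [HOMOGRAPHS.get((word.lower(), pos), word) for word, pos in tagged_tokens]


# Hand-tagged input; in practice a part-of-speech tagger would supply the tags.
noun_use = [("the", "DET"), ("record", "NOUN")]
verb_use = [("we", "PRON"), ("record", "VERB"), ("calls", "NOUN")]

print(pronounce(noun_use))
print(pronounce(verb_use))
```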

The TTS engine is geared to use prosody as well as emphasis wherever needed so that the persona is fully conversational. Our goal is for the synthesized voice to be indistinguishable from a real one.

RU-RU deepTTS Samples