Current synthetic speech capabilities are practical and acceptable
when used to provide voice as a utility,
for instance when speaking
sat-nav directions or making simple public announcements.
However, today in mid 2014, text-to-speech technology is not capable
of producing speech that
could be regarded as realistic,
emotionally engaging or genuinely expressive.
The purpose of DRAMATIS software is to make it possible to edit,
enhance and manipulate synthetic speech to render a realistic
and even dramatically pleasing end result for the listener.
Voice synthesis technology has developed significantly in the last decade. It is a maturing technological capability – which has now moved beyond the disability aid sector – into a rapidly expanding general consumer market.
The most famous synthetic voice in the world is that used by Stephen Hawking – it is now iconic and as recognisable globally as if it were his own real speaking voice. But it could not be described as realistic.
In recent years, great advances have been made in the technology used to capture and ‘digitise’ the human voice. A number of companies are now able to deliver a high level of realism. The Edinburgh based company CereProc are developing what they call voice cloning – to give back a speaking capability – in their own voice – to people who have lost it through illness.
To date, the main focus of voice synthesis technology development has been to provide ‘automatic’ conversion of text into speech. The text to speech (TTS) capability has been present within personal computers for the last 20-years – primarily to assist the visually impaired.
Over the last year Apple brought TTS and STT (Speech to Text) to market in their iPhone. Other companies are combining voice recognition and voice synthesis to provide a smart phone with a ‘live’ language translation capability. First portrayed as science fiction fantasy in StarTrek in the 1970s – today LinguaTec offer it as a working reality at very low cost.
Current speech generation capabilities give acceptable results in utility applications where only ‘information’ is being presented. For example public travel announcements, in-car navigation systems, mobile phone messaging etc. But today, at the start of 2013, state-of-the art TTS systems are unable to give the synthesised words any meaningful level of ‘expression’. Any writer or actor knows that ‘EXPRESSION’ is required in the way the spoken words are delivered to communicate their full true meaning – in context – and also to convey the feelings and emotions contained within the words.
To fully evaluate the current capability of TTS systems – to convey emotion and meaning – a comprehensive, multi-vendor, test project was implemented in Q1 2012 . The project applied automated TTS – in a dramatic context – to read/perform the dialogue of a full length film script.
The test project evaluated voices from a wide selection of vendors (including IVONA, CereProc, AT&T, Cepstral, Nuance. From this test exercise, two companies emerged as offering the most convincing systems – Infovox and Loquendo which was recently aquired by Nuance.
As voice quality improves, so TTS is being utilised increasingly, in multi-media applications, games, customer support, online promotions, marketing and business presentations. Current take-up is limited.
The main problem with the current capability of text-to-speech systems is no facility is made for the user to edit or fine-tune of the rendition to give it more expression. To understand the significance of this – imagine digital photography – but with no retouching capability – because image editors like PhotoShop simply do not exist.
This is the situation that currently exists in the text-to-speech arena; there is no editing capability to control or enhance the performance to a significant level of realism.
Infovox do provide a handy software application which allows the user to fine-tune the pronunciation of specific words or acronyms at ‘phoneme’ syllable level. This is an essential tool when dealing with unique words, proper names, place names, nick names, slang etc. but it barely scratches the surface of what is needed to produce realistic performances. Yet even Infovox speech output fails in applications where the words carry emotional content and need to be expressed with some level of feeling.
Acknowledging this shortcoming, one company, Loquendo produced an editing application (Java based developer standard) that allows the user to ‘enhance’ the delivery of lines by synthesised voice. The software is called TTS Director but it was withdrawn from the market when the company was taken over by Nuance in 2010.
This current stage of development is very similar to the one that existed in the music technology sector in the mid 1980’s; sound synthesis and sound sampling had reached extremely advanced levels. In other words – synthesised or digitised voices of the instruments – was close to being imperceptible from the real thing.
Sounds were assigned to a digital keyboard using MIDI – (Musical Instrument Digital Interface) – connection and control standard. Automation was achieved through the use of an electronic ‘sequencer’ which triggered repeating patterns of notes in a completely mechanical way. Fully automated control – the software programmes required for fully playing the instruments – to give them ‘performance’ or ‘expressive’ values simply did not exist.
This need has been fully answered over the course of the last two decades by software applications – Cubase from Steinberg first released in 1989 and ProLogic Audio released by Emagic in 1993.
(Apple purchased Emagic in 2002 for US$30-million – around £20-million. At that time there were 200,000 users on Mac and 70,000 users on PC.)
It appears that currently there is no application which facilitates the editing of TTS synthetic voice output to:
1. improve phrasing, pace, pauses and the timing of the delivery of words
2. refine emphasis and expand dynamic range of speech within phrases
3. clarify and enhance meaning and improve intonation and expressiveness
4. achieve a more realistic, expressive and human performance
There is a gap in the market and the opportunity exists to develop a TTS authoring application which makes it possible and practical to enhance the performance of synthesised voice output. In so doing, to rapidly capture and stake a claim to a potentially valuable sub-sector of the TTS market.
The application that is envisaged be an object-oriented, cross-platform software programme. The project working title is “DRAMATIS” (or Drama–Text-Into-Speech).
Loquendo TTS director was withdrawn from the market
after the company was purchased by Nuance in 2012
Loquendo TTS software brings you truly natural sounding voices able to read any kind of dynamic data and prompts in your server-based, multimedia, embedded and multimodal voice applications. Loquendo is the very first company to bring expressive synthetic speech to the market: new, high-quality voices guarantee Loquendo’s market leadership in quality, efficiency and portability as well as in pronunciation accuracy, natural timbre and intonation.
Loquendo is the only speech technology vendor that provides a complete product line guaranteeing the same wide range of high-quality voices and languages, and the same core engine for all these environments.
The Benefits For You…
Give your users the best available TTS technology for IVR (banking, government…), live news, accessing business documents, e-learning, entertainment, automotive telematics, email reading and any embedded application – there are no limits!
Loquendo’s truly lifelike TTS means there’s no need for costly, time-consuming pre-recording, and it enables the rapid deployment of vocal services that customers will love using.
Loquendo’s voices are expressive, clear, natural and fluent: they have been enriched with a repertoire of “expressive cues” that allow for highly emotional pronunciation.
Two ways for creating your own audio files are available:
* Loquendo TTS Director – a complete development environment for creating your own voice prompts, tuning them and saving customizations.
* Loquendo TTS Voice Experience – designed in a games console style, this is a highly interactive and user friendly environment in which voice parameters can be easily and rapidly adjusted.
With all this at your disposal, you can get Loquendo TTS voices talking just the way you want them to!
A World of Languages & Voices…
Loquendo gives its customers expanded reach in today’s global marketplace. Loquendo TTS is a rapidly growing family of expressive voices and personas from around the world.
Loquendo’s research and highly efficient methods of development enable the rapid release of new, high-quality voices and languages, as well as customized voices to suit your corporate profile.
To discover more about Loquendo’s latest TTS successes, visit our interactive TTS Demo page.
A Technological Leap…
Loquendo TTS delivers the highest levels of flexibility, scalability, performance and robustness; its multi-threaded,
multi-process configuration supports deployment in any high performance software architecture,
meeting all your business and technical requirements.
Loquendo TTS has highly performant algorithms and guarantees an extremely rapid response.
The speech engine can synthesize different languages and voices simultaneously, switching between them at any time and on any channel.
It is designed for use in any kind of voice application, including highly intensive TTS.
The Pronunciation Lexicon ensures that specialized vocabulary, abbreviations, acronyms and even regional pronunciation differences sound just as the developer intends them to. The characteristics of each voice (i.e. pitch, speaking rate and volume) can be fine tuned and fully controlled.
Special formats such as phone numbers, currencies and e-mail addresses can also be read correctly.
Loquendo TTS is available for Telephony, Multimedia and Embedded applications, guaranteeing the same wide range of high-quality voices and languages, all with the same core engine.
For current text-into-speech capabilities
to be comprehensively evaluated in a dramatic context
a full 120-minute film soundtrack was created
This was done using only text-to-speech voices, a selection
of the best from all the major vendors, to produce the dialogue
1
First, the written lines for each character
were pasted into a text editor
2
Then the text file was assigned a voice
and used to generate speech
3
This synthetic speech output was then captured
as an audio file in mp3 file format
4
The audio file for each character was then placed
into the ProLogic multi-track audio production application then,
each part was cut into snippets of dialogue, edited and integrated
with all the other spoken parts of the characters.
5
These audio parts were then sequenced together to create the scenes
and produce realistic spoken exchanges between the characters.
6
Sound effects and music were added, along with audio spacial effects
such as echo and reverb, to produce the final mix of the soundtrack
Research and assessment of TTS capabilities within the context of a drama was done using a film screenplay. The script required a cast of 14 main characters – and 21 other speaking parts – so provided the opportunity to test all the leading synthetic voice systems – on the Apple OSX platform.
The test project – comparable to a radio drama workshop production – features several different types of dialogue. The large cast of characters are of different ages and required many different accents.
The end result is an audio programme of a 120-minutes. Listening to this production it is easy to hear where improvements to ‘expression and performance’ need to be made – in fact are absolutely essential – to achieve a truly convincing and realistic result.
What is not apparent – is that to achieve this standard of result meant pushing the current capabilities of voice synthesis to the absolute limit.
In fact, to get even an acceptable result, it was necessary to push beyond current TTS capabilities and to ‘cheat’ by using sophisticated audio editing techniques to make the presentation of some dialogue listenable.
To complete the production it was necessary to enhance the voices with spacial effects like reverb and echo and then add sound effects and music to maximise the dramatic impact.
The gap between what it is possible to achieve today – and the standard actually needed to produce a convincing PERFORMANCE of dialogue – represents the development and commercial opportunity.
The target market for a powerful and intuative application – to be used to improve text-to-speech output across all vendors, is potentially in excess of 2-million users across all of Europe, North and South America and also Russia and other countries using the Cyrillic alphabet.
The target sale price for the application would be £30.00. A customer base of 2-million would produce total sales of £60-million after 5-years.
The market for the application is diverse – covering both amateurs and professionals – involved in script writing, mulit-media presentations, film, animation, games, online promotions, audio books etc.
A ‘PROFESSIONAL’ version of the software could justify a price of over £100. Further revenue streams could come from offering character voices, additional plug-in software, sound effects, music etc.
To develop and produce the DRAMATIS software application in English would take 12-months and cost circa £200k. Translation into Spanish, German, Italian, French, Russian and up to 5 other languages could cost a further £200k. Setting up online distribution, branding and viral marketing £100k+.
The development strategy would be to seemlessly integrate with world leading script writing software application Final Draft and also Apple’s Garage Band and Pro Logic Audio. Development would also aim to make it possible for all voices and accents be used to speak all languages.
The exit strategy would be to sell the company to one of the large vendors in the sector within or at 5-years.
Listen to the full 120-minute soundtrack.
It is available to hear and follow along with the written script
as you can see on the player interface shown below
It is essential to note that the shortcomings of the soundtrack
illustrate what is required and define the technological tasks
Whatever can be done with software development
to make the soundtrack sound more realistic
represents the commercial opportunity
[ click to commence playback ]