Provide additional control over how Overdub generates speech from the text (e.g., using SSML tags). 🤖
Please extend Overdub to support the easy-to-use Speech Synthesis Markup Language (SSML) tags for cases where users need additional control over how Overdub generates speech from the text. So as not to detract from readability, SSML tag visibility should be controlled by a keyboard shortcut and/or menu option in Descript (as Pascal commented here).
TL;DR VERSION: While I understand Overdub Styles are intended to give Descript users some control over how their text is rendered as Overdub speech, their implementation is opaque, imprecise, and cumbersome to apply and manage. Rather than reinventing the wheel, supporting standardized SSML tags in Overdub would let Descript users easily see and precisely apply their desired speech prosody/intonation, tonality/pitch, pause duration, emphasis, and phonetic pronunciation, as well as have Overdub read out the individual letters of a word or digits of a number, as exemplified below.
SSML TAG EXAMPLES
- To insert a 3-second pause in the speech, simply insert the markup <break time="3s"/> in the corresponding text.
- To read the individual digits of a telephone number 555-123-4567, wrap the number with <say-as interpret-as="digits">555-123-4567</say-as>.
- To emphasize a word or phrase such as "truth to power", simply wrap the text in <prosody volume="x-loud">truth to power</prosody>.
- To lower or raise its pitch, wrap the text in <prosody pitch="low">truth to power</prosody> or <prosody pitch="high">truth to power</prosody>. Similarly, the prosody attribute rate changes the speaking rate, and the prosody attribute volume adjusts the speaking volume.
- To read out the individual letters of the word "consensus", wrap the word with <say-as interpret-as='spell-out'>consensus</say-as>.
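To make the examples above concrete, here is a minimal Python sketch that assembles them into a single SSML document, checks that it is well-formed XML, and then strips the tags again to recover the plain, readable text (roughly what a "hide SSML tags" toggle might display). This uses only Python's standard library; it is not a Descript or Overdub API, and the function names `build_ssml` and `plain_text` are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def build_ssml() -> str:
    """Combine the example tags from this request into one <speak> document."""
    parts = [
        "<speak>",
        'Please hold. <break time="3s"/>',
        'Call <say-as interpret-as="digits">555-123-4567</say-as>. ',
        'Speak <prosody volume="x-loud">truth to power</prosody>, ',
        'then <prosody pitch="low" rate="slow">truth to power</prosody>. ',
        'That is spelled <say-as interpret-as="spell-out">consensus</say-as>.',
        "</speak>",
    ]
    return "".join(parts)


def plain_text(ssml: str) -> str:
    """Return the text with every SSML tag removed (assumed 'tags hidden' view)."""
    root = ET.fromstring(ssml)  # raises ParseError if the markup is malformed
    return "".join(root.itertext())


ssml = build_ssml()
root = ET.fromstring(ssml)

# List which SSML elements were used, in document order.
used_tags = [element.tag for element in root.iter()]
print(used_tags)
print(plain_text(ssml))
```

Parsing the markup before handing it to a speech engine catches unclosed or mistyped tags early, which matters once tags can be hidden from view in the editor.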
For more details on SSML use, please see Amazon's SSML references for Alexa here and Polly here; for Google's Assistant, Apple's Siri, et al., see here.
- To help other readers understand how SSML tags could be employed in Descript's Overdub, Google has provided lots of excellent examples to test drive here: https://developers.google.com/assistant/conversational/df-asdk/ssml
- Hereafter, with SSML implemented, I would recommend you license the Descript services for general global use in Alexa, Polly, Siri, Cortana, etc., so we can all text speech (using our own voices) to friends, family, ... the World (even after we're gone ;-).
- This request would address issues indicated by these 29 users: "Marquis Miller", "Alessandro", "Nick Ritter", "Stephen Massey", "daytona", "Adam Knee", "Mercedes Rothwell", "Roxana Stratila", "Nosson Weissman", "Steve Steve", "Mark Sobrepena", "Laura Baiardi", "Hargitai Henrik", "Umesh Kumar", "Mark Bramhill", "Chad Pennycuff", "Aemyn Connolly", "Matt Neputin [Mateusz Peplinski]", "Marek Basler", "Akshay Raj", "Jim McKeeth", "Harry Hawk", "Tau Lukos", "Kweli Kush", "Lee Schneider", "Kat Lind", "Podcast Advocate", "David Swaddle", "Pascal".
Please see the following 12 related feature requests:
- Expressions and Tonality (Overdub)
- Pitch Correction
- Emotional Prosody
- Support SSML export format
- Ability to use foreign languages
- Edit word gap duration from the script
- Control of Overdub pauses and emphasis
- Overdub: Add emphasis
- Add phonetic pronunciations support for Overdub
- '/' gets read as 'Divided by' in overdub?
- Provide a flexible way to increase the pauses between words and sentences
- SSML Input