Provide additional control over how Overdub generates speech from the text (e.g., using SSML tags). 🤖
Frameworks
Please extend Overdub to support the easy-to-use Speech Synthesis Markup Language (SSML) tags for cases where users need additional control over how Overdub generates speech from the text. So as not to detract from readability, SSML tag visibility should be controlled by a keyboard shortcut and/or menu option in Descript (as Pascal commented here).
TL;DR VERSION:
While I understand Overdub Styles are intended to give Descript users some control over how their text is rendered as Overdub speech, the implementation is opaque, imprecise, and cumbersome to apply and manage. Rather than reinventing the wheel, supporting standardized SSML tags in Overdub would let Descript users easily see and precisely apply their desired speech prosody/intonation, tonality/pitch, pause duration, emphasis, and phonetic pronunciation, as well as have Overdub read out the individual letters of a word or digits of a number, as exemplified below.
SSML TAG EXAMPLES
- To insert a 3-second pause in the speech, simply insert the markup <break time="3s"/> in the corresponding text.
- To read the individual digits of a telephone number 555-123-4567, wrap the number with <say-as interpret-as="digits">555-123-4567</say-as>.
- To emphasize a word or phrase such as "truth to power", simply wrap the text in <prosody volume="x-loud">truth to power</prosody>.
- To lower/raise its pitch, wrap the text in <prosody pitch="low">truth to power</prosody> or <prosody pitch="high">truth to power</prosody>. Similarly, the prosody attribute rate changes the speaking rate and the prosody attribute volume adjusts the speaking volume.
- To read out the individual letters of the word "consensus", wrap the word with <say-as interpret-as='spell-out'>consensus</say-as>.
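The tags listed above can also be generated programmatically. The following is a minimal sketch in Python, assuming only the standard SSML elements already shown (break, say-as, prosody); the helper names pause, say_as, prosody and speak are hypothetical and are not part of Descript or any SSML library. It builds a small SSML document and checks that it is well-formed XML.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def pause(seconds: float) -> str:
    """Return a <break> tag for a pause of the given length in seconds."""
    return f'<break time="{seconds:g}s"/>'


def say_as(text: str, interpret_as: str) -> str:
    """Wrap text in <say-as>, e.g. interpret_as='digits' or 'spell-out'."""
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(text)}</say-as>"


def prosody(text: str, **attrs: str) -> str:
    """Wrap text in <prosody> with attributes such as pitch, rate, or volume."""
    rendered = "".join(f" {name}={quoteattr(value)}" for name, value in attrs.items())
    return f"<prosody{rendered}>{escape(text)}</prosody>"


def speak(*parts: str) -> str:
    """Join fragments into a <speak> document and check it is well-formed XML.

    Parts may be markup from the helpers above or plain text; plain text is
    not escaped here, so it must not contain raw '<' or '&'.
    """
    document = "<speak>" + " ".join(parts) + "</speak>"
    ET.fromstring(document)  # raises xml.etree.ElementTree.ParseError if malformed
    return document


ssml = speak(
    "Call",
    say_as("555-123-4567", "digits"),
    pause(3),
    "and speak",
    prosody("truth to power", volume="x-loud"),
)
print(ssml)
```

Whether a given speech engine honors every attribute value varies by vendor, so the output should be treated as input for an SSML-aware tool rather than as something Overdub accepts today.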
For more details on using SSML, please see Amazon's SSML references for Alexa (here) and Polly (here), Google's reference for Assistant, and, for Apple's Siri et al., see here.
NOTES
- To help other readers understand how SSML tags could be employed in Descript's Overdub, Google has provided lots of excellent examples to test drive here: https://developers.google.com/assistant/conversational/df-asdk/ssml
- Hereafter, with SSML implemented, I would recommend you license the Descript services for general global use in Alexa, Polly, Siri, Cortana, etc., so we can all text speech (using our own voices) to friends, family, ... the World (even after we're gone ;-).
- This request would address issues indicated by these 29 users: "Marquis Miller", "Alessandro", "Nick Ritter", "Stephen Massey", "daytona", "Adam Knee", "Mercedes Rothwell", "Roxana Stratila", "Nosson Weissman", "Steve Steve", "Mark Sobrepena", "Laura Baiardi", "Hargitai Henrik", "Umesh Kumar", "Mark Bramhill", "Chad Pennycuff", "Aemyn Connolly", "Matt Neputin [Mateusz Peplinski]", "Marek Basler", "Akshay Raj", "Jim McKeeth", "Harry Hawk", "Tau Lukos", "Kweli Kush", "Lee Schneider", "Kat Lind", "Podcast Advocate", "David Swaddle", "Pascal".
Please see the following 12 related feature requests:
- Expressions and Tonality (Overdub)
- Pitch Correction
- Emotional Prosody
- Support SSML export format
- Ability to use foreign languages
- Edit word gap duration from the script
- Control of Overdub pauses and emphasis
- Overdub: Add emphasis
- Add phonetic pronunciations support for Overdub
- '/' gets read as 'Divided by' in overdub?
- Provide a flexible way to increase the pauses between words and sentences
- SSML Input
Aimee Nguyen
This would be amazingly helpful for me. I work at a university and trying to get the Descript voices to pronounce the initials correctly is soooo difficult without being able to use SSML. Also trying to get them to read URLs is often complicated as well. Just a lot of small things that would be easier if SSML was available for use.
Support Team
Merged in a post:
Emotional Prosody
Nosson Weissman
Working on a game to teach emotion recognition to people with ASD. Some of the levels require human voices to communicate specific emotions with context-neutral sentences (e.g., "Kids are standing by the door"). But finding non-proprietary recordings online is difficult, especially for some less common emotions. Is there an option for emotional prosody with content-neutral sentences?
Support Team
Merged in a post:
Overdub reads subtitle file text matching timestamps
G
Thousands of text-to-speech (TTS) tools offer AI voices that read text, but I haven't found one that reads texts that match timestamps. With this feature, we could provide audio files for our clients to hear TTS of subtitles instead of only having the option to read them.
Canny AI
Merged in a post:
Integrate Emotion Tags for Script Narration in Descript
Elijah Ebinum
Dear Descript Team,
I've been exploring the capabilities of Descript for script narration, and while it's been an incredibly useful tool, I've encountered a challenge in infusing emotions and feelings into the AI speaker's delivery. Specifically, I'm looking to have the AI read my script with varying emotions such as sadness, joy, or even tears at different points.
Currently, I'm finding it difficult to achieve this nuanced emotional delivery within Descript. I believe incorporating emotion tags could greatly enhance the functionality and creative possibilities of the platform. These emotion tags would allow users to annotate specific sections of their script with desired emotional tones, guiding the AI speaker's delivery accordingly.
With emotion tags, users like myself could seamlessly indicate where in the script we want the AI to convey sadness, joy, or any other nuanced emotion, enabling more expressive and engaging narrations.
I believe integrating emotion tags into Descript would not only streamline the process of emotive script narration but also open up a wealth of creative opportunities for users across various industries, from storytelling to marketing and beyond.
I'm eager to see Descript evolve to encompass this feature, and I believe it would significantly enhance the platform's appeal and utility to users like myself who rely on it for dynamic and emotionally resonant audio productions.
Here is a sample of how emotion tags could look:
Narrator [somber]: I don’t know how long I was lost in that darkness. Time didn’t seem to exist there. But then I felt a hand gripping mine. It was Amina. Her touch was weak, but it pulled me back from the edge of the void. I opened my eyes and saw her, tears streaming down her face.
Narrator [hopeful]: “We must escape,” she whispered, her voice shaking. “There is a way, but it’s dangerous.”
Thank you for considering my suggestion, and I look forward to seeing how Descript continues to innovate in the future.
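To make the emotion-tag sample above concrete, here is a minimal sketch in Python of how lines in the "Speaker [emotion]: text" form could be split into segments for a speech engine. The Segment type, the TAGGED_LINE pattern and parse_script are hypothetical names chosen for illustration; nothing here reflects an existing Descript feature.

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    speaker: str
    emotion: str
    text: str


# Matches lines such as "Narrator [somber]: I don't know how long..."
TAGGED_LINE = re.compile(
    r"^(?P<speaker>[^\[\]:]+?)\s*\[(?P<emotion>[^\]]+)\]:\s*(?P<text>.+)$"
)


def parse_script(script: str) -> list[Segment]:
    """Return one Segment per tagged line; untagged lines are skipped."""
    segments = []
    for line in script.splitlines():
        match = TAGGED_LINE.match(line.strip())
        if match:
            segments.append(
                Segment(
                    match["speaker"].strip(),
                    match["emotion"].strip(),
                    match["text"].strip(),
                )
            )
    return segments


sample = (
    "Narrator [somber]: Time didn't seem to exist there.\n"
    "Narrator [hopeful]: There is a way, but it's dangerous."
)
for segment in parse_script(sample):
    print(segment.emotion, "->", segment.text)
```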
Carl Bartlett
As others have mentioned, this is critical to large projects. I need to be able to adjust audio, in text format, for large sections: changing the speaker, adding pauses (with length), and changing the tempo, volume, and audio effects for sections of text. This allows editing in other applications where it can be done quickly. If I want to add a pause after every instance of a specific word, it is simple to search and replace in a text editor, but in Descript it is cumbersome and tedious!
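The search-and-replace workflow described above could look roughly like this in Python. This is a sketch only, assuming the script is available as plain text and that the pause is expressed with the SSML break tag; add_pause_after is a hypothetical helper name.

```python
import re


def add_pause_after(text: str, word: str, seconds: float) -> str:
    """Insert an SSML <break> after every whole-word occurrence of `word`."""
    # \b keeps "in" from matching inside "inside"; matching ignores case.
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    tag = f'<break time="{seconds:g}s"/>'
    # A function replacement avoids backslash handling in the replacement text.
    return pattern.sub(lambda m: f"{m.group(0)} {tag}", text)


print(add_pause_after("Breathe in. Breathe out.", "breathe", 1))
```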
Miss Mez
Full Disclosure: I've been using Descript Pro for a week now, so I'm still learning. But, this needs to be bumped, please. We need easier, global control for pauses and Convert to Audio functions.
Right now I am using Overdub to voice my entire affirmation / meditation scripts. I trained my voice profile to speak slowly, not as slowly as I'd like, but (I think) that will just take time and more training.
The problem I'm facing is that the pauses between sentences are FAR too short for my genre.
It is a hassle to go thru and first change EVERY sentence, individually, with Convert to Audio. (Global Convert would be AWESOME) Then I have to go thru and add a gap between each sentence.
If we can't assign symbols, punctuations, or code to be certain pause lengths, then could we at least use the Shorten Word Gaps to actually INCREASE pauses globally? That would greatly decrease the amount of time we have to spend adjusting the pacing of our files.
Right now, it's frustrating to have to do this on my 5-minute, daily affirmation files... I can't even imagine how long it will take me to do a 1+ hour guided meditation!
Or am I just missing something? If I am, please let me know, because I'm nearly done with my 30 days of 5-min affirmations project... and the long meditation projects are next!
Samuel Eisenberg
Yes, this would make a huge impact. Right now Overdub is pretty monotonous.
I'm using Overdub for audiobooks and the output has plenty of room for improvement.
Mathnasium Online
Enabling tuning and adjusting of Overdub output (for example, to read individual letters and numeric digits, insert pauses, adjust tone and emphasis, etc) is essential for our use-case creating training materials for staff/students teaching/learning mathematics.
The requester’s proposal of using SSML to implement such easy, user-configurable tweaking of Overdub’s output would be truly awesome and, presently lacking this capability, Descript is forcing its customers to seek workarounds and alternate solutions, and leaves a big opening for competitors to walk through.
Based on its continuing upvotes and enthusiastic comments below, after ~two years~ sitting here as a feature request, it’s time to get this one at least “Under Review”!
Nicole Berryhill Phd
YES! This would be INCREDIBLY useful. We're currently in the process of using Overdub to make written coursework available to blind students in my voice. Manually adding gaps between sentences for a long script is extremely time consuming.
My suggestion (from a strictly UI standpoint) would be to expand the functionality of the "Shorten Word Gaps" tool, making it a "Manage Word Gaps" tool. Ideally, to include the option to "Insert [input area to define length of desired gap] Between all Sentences".
While Descript is indeed a "game changing" tool for all of its current offerings, I (for one, with many equally interested colleagues) would most certainly continue my Pro Subscription for all eternity with this specific, automated flexibility. This feature request should really be moved to the top of the list. It would have an incredible impact on workflow for those looking to automate this necessary part of using Overdub on lengthy scripts.
If this function already exists in some other format that I'm/we're overlooking, please advise. Otherwise, please provide it in some automated way, perhaps as described above.
Thank you.
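The "insert a gap between all sentences" option suggested above could be sketched as follows in Python. This is an illustration under stated assumptions, not a description of how Descript works: the gap is written as an SSML break tag, sentences are split naively on ".", "!" or "?" followed by whitespace (so abbreviations such as "Dr." would be split too), and insert_sentence_gaps is a hypothetical name.

```python
import re

# Whitespace that directly follows sentence-ending punctuation.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def insert_sentence_gaps(text: str, seconds: float) -> str:
    """Place an SSML <break> of the given length between all sentences."""
    tag = f'<break time="{seconds:g}s"/>'
    sentences = [s for s in SENTENCE_END.split(text.strip()) if s]
    return f" {tag} ".join(sentences)


print(insert_sentence_gaps("I am calm. I am strong! Am I ready?", 2))
```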
veritas et caritas
Please, something like this is needed as soon as possible.