Provide additional control over how Overdub generates speech from the text (e.g., using SSML tags). 🤖

Frameworks

Please extend Overdub to support the easy-to-use Speech Synthesis Markup Language (SSML) tags for cases where users need additional control over how Overdub generates speech from the text. So as not to detract from readability, SSML tag visibility should be toggled by a keyboard shortcut and/or menu option in Descript (as Pascal commented here).

TL;DR VERSION:

While I understand that Overdub Styles are intended to give Descript users some control over how their text is rendered as Overdub speech, the implementation is opaque, imprecise, and cumbersome to apply and manage. Rather than reinventing the wheel, supporting standardized SSML tags in Overdub would let Descript users easily see and precisely apply their desired speech prosody/intonation, tonality/pitch, pause duration, emphasis, and phonetic pronunciation, as well as have Overdub read out the individual letters of a word or digits of a number, as exemplified below.

SSML TAG EXAMPLES

To insert a 3-second pause in the speech, simply insert the markup <break time="3s"/> in the corresponding text.
To read the individual digits of a telephone number such as 555-123-4567, wrap the number with <say-as interpret-as="digits">555-123-4567</say-as>.
To emphasize a word or phrase such as "truth to power", simply wrap the text in <prosody volume="x-loud">truth to power</prosody>.
To lower/raise its pitch, wrap the text in <prosody pitch="low">truth to power</prosody> or <prosody pitch="high">truth to power</prosody>. Similarly, the prosody attribute rate changes the speaking rate and the prosody attribute volume adjusts the speaking volume accordingly.
To read out the individual letters of the word "consensus", wrap the word with <say-as interpret-as="spell-out">consensus</say-as>.
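
For readers who want to see how these tags fit together, here is a minimal sketch in Python that combines the examples above into one SSML document and checks that the result is well-formed XML. The `<speak>` root element comes from the SSML standard; the helper name `build_ssml` and the sample sentences are only illustrative, and nothing here reflects how Overdub works internally.

```python
import xml.etree.ElementTree as ET


def build_ssml(fragments):
    """Wrap SSML fragments in a <speak> root (hypothetical helper)."""
    return "<speak>" + " ".join(fragments) + "</speak>"


# Sample sentences are made up; the tags are the ones from the examples above.
fragments = [
    'Please hold.<break time="3s"/>',
    'Call <say-as interpret-as="digits">555-123-4567</say-as>.',
    'Speak <prosody volume="x-loud">truth to power</prosody>.',
    'Spell <say-as interpret-as="spell-out">consensus</say-as>.',
]

ssml = build_ssml(fragments)

# ET.fromstring raises xml.etree.ElementTree.ParseError if a tag is left unclosed.
root = ET.fromstring(ssml)
tags = [element.tag for element in root.iter()]
print(tags)
```

A well-formedness check like this only catches unclosed or mistyped tags. Whether a given speech engine honors each tag is a separate question.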

For more details on SSML use, please see Amazon's SSML reference for Alexa here and Polly here; for Google's Assistant and for Apple's Siri, et al., see here.

NOTES

To help other readers understand how SSML tags could be employed in Descript's Overdub, Google has provided lots of excellent examples to test drive here: https://developers.google.com/assistant/conversational/df-asdk/ssml
Hereafter, with SSML implemented, I would recommend you license the Descript services for general global use in Alexa, Polly, Siri, Cortana, etc., so we can all text speech (using our own voices) to friends, family, ... the World (even after we're gone ;-).
This request would address issues indicated by these 29 users: "Marquis Miller", "Alessandro", "Nick Ritter", "Stephen Massey", "daytona", "Adam Knee", "Mercedes Rothwell", "Roxana Stratila", "Nosson Weissman", "Steve Steve", "Mark Sobrepena", "Laura Baiardi", "Hargitai Henrik", "Umesh Kumar", "Mark Bramhill", "Chad Pennycuff", "Aemyn Connolly", "Matt Neputin [Mateusz Peplinski]", "Marek Basler", "Akshay Raj", "Jim McKeeth", "Harry Hawk", "Tau Lukos", "Kweli Kush", "Lee Schneider", "Kat Lind", "Podcast Advocate", "David Swaddle", "Pascal".

Please see the following 12 related feature requests:

November 19, 2020

Gabe Michalski

Merged in a post:

Overdub: Allow easy changes to pronunciation of words

Scott DeLuzio

English is full of words that are taken from other languages (loanwords). One word I can't get Overdub to pronounce correctly is "résumé", i.e. the document you submit to a potential employer to show your work history. Instead, it is pronounced "resume", i.e. what you do when you begin doing something again after taking a break. It is a fairly common word, yet no matter how I spell it (with/without accents, phonetically, etc.) I can't get Overdub to pronounce it correctly. The odd thing is that some non-English loanwords are pronounced just fine, both with and without accents. For example, some of the words in this paragraph sound just fine while others, like résumé and café, sound off (note the language of origin is in parentheses):

In the bustling metropolis, I found solace at a quaint café (French), sipping on espresso while perusing the menu for croissants (French) and quiche (French). The atmosphere was très chic (French), with its bohemian décor (French) and avant-garde (French) artwork adorning the walls. As I indulged in the scrumptious cuisine, I couldn't help but feel a pang of nostalgia for the boulangeries (French) of Paris. Afterward, I rendezvoused (French) with a friend to discuss our mutual interest in yoga and meditation, practicing asanas (Sanskrit) and finding inner zen (Japanese). We reminisced about our recent trip to Kyoto (Japanese), where we experienced the beauty of traditional tea ceremonies and composed haikus (Japanese) inspired by the serene gardens. As the evening approached, I checked my résumé (French) and updated my RSVP for a soirée (French) at a luxurious mansion, where I anticipated engaging conversations with cosmopolitan guests, sampling delicacies like sushi (Japanese) and tapas (Spanish). The soirée turned out to be a lively affair, with the sounds of salsa music (Spanish) and people engaging in animated discussions about politics, philosophy, and the latest films à la (French) cinephiles (French). I bid adieu (French) to the soirée, feeling content and naïvely (French) hopeful for the adventures that lay ahead.
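
If SSML were supported, the standard `<phoneme>` tag is the usual way to pin down a pronunciation like this. Below is a minimal sketch in Python that wraps a word in that tag. The helper name `phoneme` is made up for illustration, the IPA string is only an approximate US pronunciation, and whether Overdub would ever accept this tag is an assumption.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme(word, ipa):
    """Return an SSML <phoneme> tag giving `word` an explicit IPA pronunciation."""
    # quoteattr adds the surrounding quotes and escapes the attribute value.
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(word)}</phoneme>'


# Approximate IPA for the noun "résumé" (illustrative, not authoritative).
tagged = phoneme("résumé", "ˈrɛzəmeɪ")
print(tagged)
```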

April 3, 2026

Gabe Michalski

Merged in a post:

Speech Synthesis Markup Language or equivalent fine grained controls

Gabriel Lambert

I would need to be able to change the spoken rate, add pauses, or change emphasis for different words or phrases.

April 3, 2026

Autopilot

Merged in a post:

Add Support for SSML Markup in AI Voice

Eduardo Podecast

I’d love to see Descript take the next step and add support for SSML (Speech Synthesis Markup Language) or similar markup directly in AI voice scripts. SSML is becoming an industry standard for controlling prosody, pacing, pauses, pronunciation, and emotion. It would unlock a whole new level of polish and flexibility—especially for creators looking for natural, expressive AI voiceovers.

For reference, ElevenLabs recently launched their v3, which allows SSML-type markup in scripts. This makes a huge difference in the quality and versatility of AI-generated speech, and I think Descript could offer even more value if you implemented something similar.

June 27, 2025

Autopilot

Merged in a post:

Add Support for SSML Markup in AI Voice

Eduardo Podecast

I’d love to see Descript take the next step and add support for SSML (Speech Synthesis Markup Language) or similar markup directly in AI voice scripts. SSML is becoming an industry standard for controlling prosody, pacing, pauses, pronunciation, and emotion. It would unlock a whole new level of polish and flexibility—especially for creators looking for natural, expressive AI voiceovers.
For reference, ElevenLabs recently launched their v3, which allows SSML-type markup in scripts. This makes a huge difference in the quality and versatility of AI-generated speech, and I think Descript could offer even more value if you implemented something similar.
Would love to hear your thoughts and if this is on your roadmap!

June 27, 2025

Aimee Nguyen

This would be amazingly helpful for me. I work at a university, and trying to get the Descript voices to pronounce the initials correctly is soooo difficult without being able to use SSML. Trying to get them to read URLs is often complicated as well. Just a lot of small things that would be easier if SSML were available for use.

Support Team

Merged in a post:

Emotional Prosody

Nosson Weissman

Working on a game to teach emotion recognition to people with ASD. Some of the levels require human voices to communicate specific emotions with context-neutral sentences (e.g., "Kids are standing by the door"). But finding non-proprietary recordings online is difficult, especially for some less common emotions. Is there an option for emotional prosody with content-neutral sentences?

August 10, 2024

Support Team

Merged in a post:

Overdub reads subtitle file text matching timestamps

Thousands of text-to-speech (TTS) tools offer AI voices that read text, but I haven't found one that reads texts that match timestamps. With this feature, we could provide audio files for our clients to hear TTS of subtitles instead of only having the option to read them.
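
To make the idea concrete, here is a rough sketch in Python of the first step such a feature would need: reading a subtitle file into timed cues, so each line of text can be spoken inside its own time window. It assumes the common SRT layout (`HH:MM:SS,mmm --> HH:MM:SS,mmm`); the function names and the sample cues are hypothetical, and the speech generation itself is left out.

```python
def to_seconds(stamp):
    """Convert an SRT timestamp such as 00:00:03,500 to seconds."""
    hms, millis = stamp.split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000


def parse_srt(text):
    """Return a list of (start, end, text) tuples, one per subtitle cue."""
    cues = []
    for block in text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # skip incomplete cues
        start, end = (to_seconds(t.strip()) for t in lines[1].split("-->"))
        cues.append((start, end, " ".join(lines[2:])))
    return cues


# Made-up sample subtitles, for illustration only.
sample = """1
00:00:01,000 --> 00:00:03,500
Kids are standing by the door.

2
00:00:05,250 --> 00:00:07,000
We must escape.
"""

cues = parse_srt(sample)
for start, end, text in cues:
    print(f"{start:.2f}s to {end:.2f}s: {text}")
```

Each cue's `end - start` gives the time available for the spoken line, which is the constraint a timestamp-matching TTS feature would have to respect.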

August 9, 2024

Autopilot

Merged in a post:

Integrate Emotion Tags for Script Narration in Descript

Elijah Ebinum

Dear Descript Team,
I've been exploring the capabilities of Descript for script narration, and while it's been an incredibly useful tool, I've encountered a challenge in infusing emotions and feelings into the AI speaker's delivery. Specifically, I'm looking to have the AI read my script with varying emotions such as sadness, joy, or even tears at different points.
Currently, I'm finding it difficult to achieve this nuanced emotional delivery within Descript. I believe incorporating emotion tags could greatly enhance the functionality and creative possibilities of the platform. These emotion tags would allow users to annotate specific sections of their script with desired emotional tones, guiding the AI speaker's delivery accordingly.
With emotion tags, users like myself could seamlessly indicate where in the script we want the AI to convey sadness, joy, or any other nuanced emotion, enabling more expressive and engaging narrations.
I believe integrating emotion tags into Descript would not only streamline the process of emotive script narration but also open up a wealth of creative opportunities for users across various industries, from storytelling to marketing and beyond.
I'm eager to see Descript evolve to encompass this feature, and I believe it would significantly enhance the platform's appeal and utility to users like myself who rely on it for dynamic and emotionally resonant audio productions.
Here is a sample of what an emotion tag could look like:
Narrator [somber]: I don’t know how long I was lost in that darkness. Time didn’t seem to exist there. But then I felt a hand gripping mine. It was Amina. Her touch was weak, but it pulled me back from the edge of the void. I opened my eyes and saw her, tears streaming down her face.
Narrator [hopeful]: “We must escape,” she whispered, her voice shaking. “There is a way, but it’s dangerous.”
Thank you for considering my suggestion, and I look forward to seeing how Descript continues to innovate in the future.
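
As a rough illustration of how the sample syntax above could be read by software, here is a small Python sketch that splits a `Speaker [emotion]: text` line into its three parts. The pattern and the function name `parse_line` are assumptions based only on the sample in this post, not on anything Descript has announced.

```python
import re

# Matches lines shaped like: Narrator [somber]: Some text...
LINE = re.compile(
    r"^(?P<speaker>[^\[\]:]+?)\s*\[(?P<emotion>[^\]]+)\]:\s*(?P<text>.+)$"
)


def parse_line(line):
    """Return (speaker, emotion, text), or None if the line has no emotion tag."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    return match.group("speaker"), match.group("emotion"), match.group("text")


parsed = parse_line("Narrator [hopeful]: We must escape.")
untagged = parse_line("Just a plain line of narration.")
print(parsed, untagged)
```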

June 6, 2024

Carl Bartlett

As others have mentioned, this is critical for large projects. I need to be able to adjust audio, in text format, for large sections: changing the speaker, adding pauses (with length), adjusting the tempo and volume, and adding audio effects for sections of text. This allows editing in other applications where it can be done quickly. If I want to add a pause after every instance of a specific word, it is simple to search and replace in a text editor, but in Descript it is cumbersome and tedious!
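
The search-and-replace workflow described here could look something like the following Python sketch, which inserts an SSML `<break>` tag after every instance of a given word. The function name `add_pause_after` and the sample sentence are hypothetical, and it assumes the script is exported as plain text and that the speech engine accepts `<break>` tags.

```python
import re


def add_pause_after(text, word, seconds):
    """Insert an SSML break after every whole-word, case-insensitive match of `word`."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    # A function replacement keeps the original capitalization of each match.
    return pattern.sub(
        lambda match: f'{match.group(0)} <break time="{seconds}s"/>', text
    )


result = add_pause_after("Breathe in. Now breathe out.", "breathe", 2)
print(result)
```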

Miss Mez

Full Disclosure: I've been using Descript Pro for a week now, so I'm still learning. But, this needs to be bumped, please. We need easier, global control for pauses and Convert to Audio functions.
Right now I am using Overdub to voice my entire affirmation / meditation scripts. I trained my voice profile to speak slowly, not as slowly as I'd like, but (I think) that will just take time and more training.
The problem I'm facing is that the pauses between sentences are FAR too short for my genre.
It is a hassle to go thru and first change EVERY sentence, individually, with Convert to Audio. (Global Convert would be AWESOME) Then I have to go thru and gap between each sentence.
If we can't assign symbols, punctuations, or code to be certain pause lengths, then could we at least use the Shorten Word Gaps to actually INCREASE pauses globally? That would greatly decrease the amount of time we have to spend adjusting the pacing of our files.
Right now, it's frustrating to have to do this on my 5-minute, daily affirmation files... I can't even imagine how long it will take me to do a 1+ hour guided meditation!
Or am I just missing something? If I am, please let me know, because I'm nearly done with my 30 days of 5-min affirmations project... and the long meditation projects are next!
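
To show what "assigning punctuation to certain pause lengths" might mean in practice, here is a minimal Python sketch that adds an SSML `<break>` tag after each sentence-ending punctuation mark. The pause lengths in `PAUSES` and the function name `add_pauses` are made-up placeholders, and it assumes a speech engine that accepts `<break>` tags, which Overdub does not currently advertise.

```python
import re

# Hypothetical pause lengths per punctuation mark; adjust to taste.
PAUSES = {".": "2s", "!": "2s", "?": "3s"}


def add_pauses(text, pauses=PAUSES):
    """Append an SSML break after each run of sentence-ending punctuation."""
    marks = "[" + re.escape("".join(pauses)) + "]+"

    def with_break(match):
        run = match.group(0)
        # A run such as "..." gets one break, chosen by its last character.
        return f'{run}<break time="{pauses[run[-1]]}"/>'

    return re.sub(marks, with_break, text)


script = add_pauses("I am calm. I am safe. Am I ready?")
print(script)
```

Because the mapping lives in one dictionary, lengthening every pause in a long meditation script would be a one-line change rather than a sentence-by-sentence edit.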
