Word-level timestamps in subtitle export

Ben Woodward

I'd like to export transcription data that I can use in an interactive transcript, i.e. a transcript where each word is highlighted as it is spoken in the audio (the same way words are highlighted as they are spoken in the Descript app itself).
The VTT spec allows for karaoke-style cues:
1
00:16.500 --> 00:18.500
When the moon <00:17.500>hits your eye

1
00:00:18.500 --> 00:00:20.500
Like a <b><00:19.000>big-a <00:19.500></b>pizza <00:20.000>pie

1
00:00:20.500 --> 00:00:21.500
That's <00:00:21.000>amore

Another option would be to export a JSON file that contains an array of words with a start_time and end_time value for each.
Ideally you'd be able to export to both of these formats and have some control over how the exported data is formatted. For example, should the current word in a line inside a VTT file be bolded?
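
To illustrate how the two requested formats relate, here is a minimal Python sketch that turns a JSON word list into one karaoke-style VTT cue. The `start_time` and `end_time` fields follow the description above; the `text` field, the times being in seconds, and the function names are assumptions made for illustration, not anything Descript is known to export.

```python
import json


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def karaoke_cue(words: list[dict]) -> str:
    """Build one cue; every word after the first gets an inline timestamp."""
    start = format_timestamp(words[0]["start_time"])
    end = format_timestamp(words[-1]["end_time"])
    parts = [words[0]["text"]]
    for word in words[1:]:
        parts.append(f"<{format_timestamp(word['start_time'])}>{word['text']}")
    return f"{start} --> {end}\n" + " ".join(parts)


# Hypothetical JSON payload: an array of words with assumed field names.
payload = """[
  {"text": "That's", "start_time": 20.5, "end_time": 21.0},
  {"text": "amore", "start_time": 21.0, "end_time": 21.5}
]"""

print(karaoke_cue(json.loads(payload)))
```

A real export would also need the `WEBVTT` header line and a blank line between cues.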

September 12, 2019

Support Team

Merged in a post:

Allow for "text" exports with timestamps to include sub-second precision

Joshua Krall

Right now, "text" exports can include timestamps, but they are rounded to the nearest second — we'd like the ability to have sub-second precision on these timestamps.

August 8, 2024

Raymond

I came here to make this exact same suggestion as it would add a lot of value to the way I use Descript. I hope it wouldn't be too hard to implement since the timeline already displays 3 decimal places of precision.

Andrew Mason

marked this post as open

Stéphane L.

Andrew Mason: Hi Andrew, long-time Descript user here. This functionality was requested in 2020 and has received a lot of community support. It is now 2023, and a JSON export should not take long to ship, since Descript probably already uses that data as part of its transcription workflow. Do you have any information about it?

Vit Muller

Yes, and it would be great if it did this too, please!!
I like to drop markers throughout my edits to separate the different topics of my conversations with guests. (I have a podcast, and I use Descript to edit the file, generate clean audio with 'Studio Sound', and export the transcript for show notes. For getting my timestamps I only have a painful workaround.)
Right now the workaround is to copy the marker text and add a comment at that point, which is very PAINFUL and time-consuming 🙁
Then I can see all the comments with timestamps, but to get one complete list of timestamps I still have to copy each one manually and format the timing in brackets so they become clickable when I post them in my show notes.
Honestly, a simple export button for 'Timestamps' (based on where I place my 'Markers') would be a game changer!!
Formatting would need to be auto generated as follows:
(00:00:00) - MARKER NAME
So, for example, if I placed markers like this:
3 mins
Marker Title A
15 mins 43 secs
Marker Title B
44 mins 12 secs
Marker Title C
1 hour 12 mins 12 secs
Marker Title D
Ideally it'd then generate the export in the following format:
(00:03:00) - MARKER TITLE A
(00:15:43) - MARKER TITLE B
(00:44:12) - MARKER TITLE C
(01:12:12) - MARKER TITLE D
Simple txt export file would be fine.
Please?
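
For anyone scripting this in the meantime, the requested format is easy to produce once the marker positions are available. This is only a sketch: it assumes the markers have already been collected by hand as (seconds, title) pairs, and `format_marker` is a hypothetical helper, not a Descript feature.

```python
def format_marker(seconds: int, title: str) -> str:
    """Render one marker as '(HH:MM:SS) - MARKER NAME'."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"({hours:02d}:{minutes:02d}:{secs:02d}) - {title.upper()}"


# Hypothetical marker list: offsets in seconds from the start of the episode.
markers = [
    (3 * 60, "Marker Title A"),
    (15 * 60 + 43, "Marker Title B"),
]

timestamp_lines = [format_marker(seconds, title) for seconds, title in markers]
print("\n".join(timestamp_lines))
```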

Daniel Sommer

It's not a direct export, but the JSON file is available. 
Open your project
Publish (enable show transcript)
Open in Browser 
Ctrl + Shift + I (Inspect in Chrome). The Elements tab in developer tools should open up.
Ctrl + F (Find)
Search for "descript:transcript"
The JSON file at the neighboring link will have the individual word timings

Steven Black

I'd also love to have this data exported either directly as an optional part of the VTT file or in a JSON format where I could pull the information I want from there.
I specialize in improvised lyrical a cappella, and the transcription is the slowest part of my process. The data is a lot more interesting if I can retain the word-level timing. I've used my own ad hoc tools for this, but I love the Descript UI. It's a lot nicer than my weird little curses-based thing.
I sometimes generate such karaoke-enabled VTT files from the MIDI files produced by Lilypond (for the timing) coupled with the Lilypond source (to get the line endings and spaces right). Here's a snippet from "All Aboard for Podunk":
00:12.000 --> 00:21.734
<00:13.000><c>There</c> <00:13.273><c>was</c> <00:13.546><c>a</c> <00:13.819><c>sta</c><00:14.092><c>tion</c> <00:14.365><c>a</c><00:14.638><c>gent</c> <00:14.911><c>on</c> <00:15.184><c>the</c> <00:15.457><c>New</c> <00:15.730><c>York</c> <00:16.003><c>Cen</c><00:16.276><c>tral</c> <00:16.549><c>line.</c>
<00:17.367><c>At</c> <00:17.640><c>Mack'</c><00:17.913><c>rel</c> <00:18.186><c>Cen</c><00:18.459><c>ter,</c> <00:18.732><c>where</c> <00:19.005><c>he</c> <00:19.278><c>lived</c> <00:19.551><c>they</c> <00:19.824><c>thought</c> <00:20.097><c>him</c> <00:20.370><c>ver</c><00:20.643><c>y</c> <00:20.916><c>fine.</c>
The <c> spans are meant to be styled with CSS. I've used the following with mixed success (it's not always well supported):
::cue {
  font-size: larger;
  background-color: transparent;
  text-shadow: 2px 2px 5px black;
}
::cue(:past) {
  color: yellow;
}
::cue(:future) {
  color: cyan;
}
::cue(:active) {
  color: magenta !important;
}
::cue(#title) {
  color: lightgreen;
}
Getting things to work like karaoke without CSS requires folks jumping through a lot of hoops. (There's software that post processes things so there's a series of changes for each line.)
For my purposes, I can't work around it. I need to preserve line and stanza/paragraph breaks.
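
As a rough illustration of the post-processing approach mentioned above, the following Python sketch expands one karaoke cue into a series of consecutive cues, each repeating the full line with the active segment wrapped in <b>. It is a simplified sketch, not any particular tool's method: `expand_cue` is a hypothetical name, and other tags such as <c> are passed through untouched.

```python
import re

# Matches inline cue timestamps such as <00:17.500> or <00:00:21.000>.
STAMP = re.compile(r"<((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})>")


def expand_cue(start: str, end: str, payload: str) -> list[str]:
    """Split a karaoke cue into one plain cue per timed segment."""
    pieces = STAMP.split(payload)  # [text, stamp, text, stamp, text, ...]
    texts = pieces[0::2]
    times = [start] + pieces[1::2] + [end]
    cues = []
    for i, active in enumerate(texts):
        word = active.rstrip()
        if word:
            # Bold the active segment, keeping its trailing whitespace outside.
            bold = f"<b>{word}</b>{active[len(word):]}"
            rendered = texts[:i] + [bold] + texts[i + 1:]
        else:
            # Nothing is being spoken yet (payload began with a timestamp).
            rendered = texts
        cues.append(f"{times[i]} --> {times[i + 1]}\n{''.join(rendered)}")
    return cues


result = expand_cue("00:00:20.500", "00:00:21.500", "That's <00:00:21.000>amore")
print("\n\n".join(result))
```

Because the payload is carried through whole, line breaks inside a cue survive, at the cost of a much longer file.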

Andrew Mason

marked this post as under consideration

Russell Silber

I use Descript for webinar recordings and would prefer to add interactive transcripts to the video embeds on our site instead of captions (to not block the powerpoint presentation on-screen). Would love this export feature so I could find a way to add interactive transcripts to our video player.

Anas

I would love to have word-level timestamps exportable as JSON, similar to the Google Speech-to-Text and IBM Watson APIs.

Eric Lampi

I have the same feature request.

In the meantime, there is a workaround:

Using your transcript text file auto-generated by Descript:

(I was able to use the automatically transcribed text just fine, but if you've manually written the script, skip to 2.):

1. Export as a rich text file
2. Add a line break after each word

Note: you can do this in the application or in a text editor. Hold Option and tap the right arrow to advance to the next word, then hit Return to add a line break. (You could also use find and replace to add a break wherever a space exists, and it shouldn't be hard to write a script to do this.)

3. Make a new project
4. Import (drag and drop) the audio file and choose "Import Transcript" from the drop-down
5. Click "Transcribe"
6. Copy and paste the text from your rich text document into the text window that is presented
7. Descript will sync the new text with the audio
8. Export one last time as an .srt or .vtt file

The exported file will have an in/out timestamp for each word (see the attached example below)
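
The line-break step in the workaround above can be scripted. Here is a minimal Python sketch, assuming the transcript has been saved as plain text rather than rich text; `one_word_per_line` is just an illustrative name.

```python
def one_word_per_line(text: str) -> str:
    """Put every whitespace-separated word on its own line."""
    return "\n".join(text.split())


sample = "When the moon\nhits your eye"
print(one_word_per_line(sample))
```

Punctuation stays attached to the word it follows, which matches what find and replace on spaces would give.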