The federal government recently postponed a rule requiring public entities and universities to comply with WCAG 2.1 Level AA under the ADA, which includes Audio Description requirements for video. The compliance date is now April 24, 2027. Producing audio descriptions manually can take many hours for even simple videos, and is practically or operationally impossible for many workflows or content types.
Audio Descriptions allow low-vision users to hear a spoken description of the visual content and action happening on screen, either during the natural pauses between dialogue (standard) or by pausing the video to allow the visual content to be spoken (extended).
AI models have only very recently become capable of analyzing video frames and accurately summarizing the content over time. When these models are paired with speech-to-text or speech generation models, producing the descriptions can be largely automated rather than relying entirely on manual review, scripting, voicing, and editing.
Ideal features would include:
  • Auto-generated, editable, time-aware text descriptions of on-screen visual content
  • Generation of speech audio from those descriptions, regenerated whenever the text is edited
  • Ability to detect non-dialogue portions of a video for possible AD audio insertion
  • Automatic insertion of a freeze frame and ripple edit of generated audio
  • Customization of timbre, timing, and phonetic pronunciation for non-standard words or phrases (think company names, uncommon personal names, industry jargon, and acronyms)
  • Custom voice selection and cloning
  • Ability for human review, editing, and customization throughout the process
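As a rough illustration of the gap-detection and standard-versus-extended features listed above, here is a minimal Python sketch. It assumes dialogue timestamps are already available from a transcript; the names `Segment`, `find_gaps`, and `placement_mode`, the 1.5-second minimum gap, and the 2.5 words-per-second speaking rate are all hypothetical choices for illustration, not part of any existing Descript API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A time span in seconds (hypothetical helper type)."""
    start: float
    end: float


def find_gaps(dialogue, video_duration, min_gap=1.5):
    """Return non-dialogue spans at least `min_gap` seconds long.

    `dialogue` is a list of Segment objects, e.g. derived from a transcript.
    The 1.5 s default is an assumed threshold, not a standard.
    """
    gaps = []
    cursor = 0.0  # end of the latest dialogue seen so far
    for seg in sorted(dialogue, key=lambda s: s.start):
        if seg.start - cursor >= min_gap:
            gaps.append(Segment(cursor, seg.start))
        cursor = max(cursor, seg.end)  # handles overlapping speech
    if video_duration - cursor >= min_gap:
        gaps.append(Segment(cursor, video_duration))
    return gaps


def placement_mode(gap, word_count, words_per_second=2.5):
    """Choose 'standard' if the spoken description fits in the gap,
    otherwise 'extended' (pause the video with a freeze frame).

    The speaking rate is an assumed average; a real system would
    measure the generated audio instead of estimating it.
    """
    needed = word_count / words_per_second
    return "standard" if needed <= (gap.end - gap.start) else "extended"


dialogue = [Segment(0.0, 4.0), Segment(6.0, 10.0), Segment(10.5, 12.0)]
gaps = find_gaps(dialogue, video_duration=20.0)
for gap in gaps:
    print(gap, placement_mode(gap, word_count=10))
```

In this sketch a description that does not fit its gap falls back to extended mode, which is where the freeze frame and ripple edit would apply. A human reviewer could still override either decision.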
As a video editing platform on the cutting edge of integrating AI media tools into real production environments, Descript already has many of the tools in place to significantly speed up AD generation for the hundreds of thousands of hours of video content in public entities' backlogs, in addition to the new content generated every year. Very few effective commercial solutions currently exist otherwise.