This is like a poor man's multitrack. Here's a use case.
Setup: a single camera with 2 people in shot, plus 2 separate audio tracks, one from each person's microphone.
If you could set Descript to zoom in on whichever person is speaking, based on which audio track it detects speech on, this would give the effect of multiple cameras.
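To make the idea concrete, here is a rough sketch of the kind of logic I have in mind. It is not Descript's API or how Descript works internally. The function name, the `threshold` and `hold_windows` parameters, and the assumption that you already have one loudness value per short time window for each microphone track are all hypothetical.

```python
from typing import List, Optional


def active_speaker_per_window(
    levels_a: List[float],
    levels_b: List[float],
    threshold: float = 0.05,   # hypothetical: minimum level that counts as speech
    hold_windows: int = 3,     # hypothetical: windows a change must persist before cutting
) -> List[Optional[str]]:
    """Return "A", "B", or None (stay on the wide shot) for each window."""
    shots: List[Optional[str]] = []
    current: Optional[str] = None    # shot currently on screen
    candidate: Optional[str] = None  # shot we are considering switching to
    streak = 0                       # consecutive windows the candidate has held

    for a, b in zip(levels_a, levels_b):
        # Pick the louder track, but only if it is above the speech threshold.
        if a >= threshold and a > b:
            loudest: Optional[str] = "A"
        elif b >= threshold and b > a:
            loudest = "B"
        else:
            loudest = None

        if loudest == current:
            # Nothing to change, so forget any pending switch.
            candidate, streak = None, 0
        else:
            if loudest == candidate:
                streak += 1
            else:
                candidate, streak = loudest, 1
            # Only cut once the new speaker has held for long enough,
            # so a cough or a short "mm-hm" does not trigger a zoom.
            if streak >= hold_windows:
                current, candidate, streak = loudest, None, 0

        shots.append(current)

    return shots
```

Each "A" or "B" in the result would map to a zoom on that person's side of the frame, and None to the full two-person shot. The hold period is there so the picture does not jump around on brief noises.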