Prepare an intermediate stereo file with channels:
L: has soundtrack
R: has music synced to soundtrack
THEN all you need is an FFT imager on that intermediate file.
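If you'd rather script the preparation step than do it by hand in an editor, here is a rough sketch of building that stereo file. It assumes you already have both sources as mono 16-bit samples at the same sample rate and that they are already time-aligned (the syncing itself is not done here). The names `make_stereo` and `write_stereo_wav` are made up for illustration; only Python's standard `wave` and `array` modules are used.

```python
import array
import sys
import wave


def make_stereo(left, right):
    """Interleave two mono 16-bit sample sequences into L/R stereo frames.

    The shorter input is padded with silence so both channels keep the
    same length. No time alignment is attempted here.
    """
    n = max(len(left), len(right))
    left = list(left) + [0] * (n - len(left))
    right = list(right) + [0] * (n - len(right))
    stereo = array.array("h")  # signed 16-bit samples
    for l, r in zip(left, right):
        stereo.append(l)  # L: soundtrack
        stereo.append(r)  # R: music synced to the soundtrack
    return stereo


def write_stereo_wav(path, stereo, sample_rate=44100):
    """Write interleaved 16-bit samples as a two-channel WAV file."""
    data = array.array("h", stereo)  # copy so the caller's data is untouched
    if sys.byteorder == "big":
        data.byteswap()  # WAV PCM data is little-endian
    with wave.open(path, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(data.tobytes())
```

Something like `write_stereo_wav("intermediate.wav", make_stereo(soundtrack, music))` would then give you a file to open in Audition, where `soundtrack` and `music` are your own sample lists.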
Try Audition’s Center Channel Extractor, or try adding QuikQuak Mashtactic to Audition for a more visual approach.
EDIT: Also maybe http://blog.wavosaur.com/extract-vocals-from-song-with-kn0ck0ut-vst/