If you have access to multiple language tracks with identical music, then it should theoretically be possible to filter out whatever matches between the tracks, which would be the music. However, some speech may be lost where its frequencies overlap with the music, and some of the music may remain because of audio compression artifacts. The result should improve the more language tracks you use. I think this would work better than a typical voice-removal filter.
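
To make the idea concrete, here is a rough sketch in Python with NumPy, not a tested tool. It assumes the tracks are already decoded to mono sample arrays of equal length and are sample-aligned, which real tracks may not be. The function names (stft, istft, split_common) and the frame size are my own choices for illustration. For each time/frequency cell it takes the smallest magnitude found across all tracks as the "common" part (the music) and treats the rest of each track as that track's own content (the speech).

```python
import numpy as np

def stft(x, frame=1024):
    """Short-time Fourier transform with a periodic Hann window, 50% overlap."""
    hop = frame // 2
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame) / frame)
    # Pad so that every input sample is covered by exactly two windows.
    padded = np.pad(x, (hop, frame))
    n_frames = 1 + (len(padded) - frame) // hop
    frames = np.stack([padded[i * hop:i * hop + frame] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, length, frame=1024):
    """Overlap-add inverse of stft() above (the Hann windows sum to 1)."""
    hop = frame // 2
    frames = np.fft.irfft(spec, n=frame, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + frame)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + frame] += f
    return out[hop:hop + length]

def split_common(tracks, frame=1024):
    """Split each track into a part shared by all tracks and a unique part.

    Assumption: tracks are equal-length, sample-aligned 1-D float arrays.
    """
    specs = np.stack([stft(t, frame) for t in tracks])
    mags = np.abs(specs)
    # Whatever is present in every track at this time/frequency cell.
    common_mag = mags.min(axis=0)
    # Fraction of each track's own magnitude that is "common", in [0, 1].
    mask = common_mag / np.maximum(mags, 1e-12)
    n = len(tracks[0])
    common = [istft(mask[k] * specs[k], n, frame) for k in range(len(tracks))]
    unique = [istft((1 - mask[k]) * specs[k], n, frame) for k in range(len(tracks))]
    return common, unique
```

Calling split_common([english, french, german]) would return, per track, an estimate of the shared music and an estimate of that track's speech. The minimum is the reason more tracks should help: a cell only counts as music if every language track has energy there at the same moment.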

I don't know if there is a software package that allows you to do this ...

The analysis should be done in a frequency domain, preferably using the same transform as the codec ... assuming the tracks all use the same codec, which I don't think they do.

Am I correct that on all movies the English track is in DTS-HD (lossless) and the others are in either DTS 5.1 or Dolby Digital 5.1? The DTS-HD track could be transformed into the frequency domain of any of the others. I have not checked, but I doubt that DTS and Dolby Digital use the same transform.
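
As an illustration of what "the codec's frequency domain" could look like: as far as I know, Dolby Digital (AC-3) is built on an MDCT filter bank. The sketch below is a plain textbook MDCT with a sine window in Python with NumPy. It is not the actual AC-3 transform, which uses its own window and block sizes. The function names and the block size are hypothetical, and the direct matrix form is slow but easy to check.

```python
import numpy as np

def _basis_and_window(n_half):
    # MDCT cosine basis: n_half coefficients from each block of 2*n_half samples.
    n = np.arange(2 * n_half)
    k = np.arange(n_half)
    basis = np.cos(np.pi / n_half * np.outer(k + 0.5, n + 0.5 + n_half / 2))
    # Sine window: w[n]^2 + w[n + n_half]^2 == 1 (Princen-Bradley condition).
    window = np.sin(np.pi * (n + 0.5) / (2 * n_half))
    return basis, window

def mdct(x, n_half=256):
    """Windowed MDCT of a 1-D signal; returns an (n_blocks, n_half) array."""
    basis, window = _basis_and_window(n_half)
    # Pad so that every input sample falls into two overlapping blocks.
    padded = np.pad(x, (n_half, 2 * n_half))
    n_blocks = 1 + (len(padded) - 2 * n_half) // n_half
    blocks = np.stack([padded[i * n_half:i * n_half + 2 * n_half]
                       for i in range(n_blocks)])
    return (blocks * window) @ basis.T

def imdct(coeffs, length):
    """Inverse of mdct() above; overlap-add cancels the time-domain aliasing."""
    n_half = coeffs.shape[1]
    basis, window = _basis_and_window(n_half)
    blocks = (coeffs @ basis) * (2.0 / n_half) * window
    out = np.zeros(n_half * (len(blocks) - 1) + 2 * n_half)
    for i, b in enumerate(blocks):
        out[i * n_half:i * n_half + 2 * n_half] += b
    return out[n_half:n_half + length]
```

The lossless track could be passed through mdct() and compared coefficient by coefficient with a lossy track analysed the same way. Whether that actually lines up with the lossy codec's own blocks is something I have not verified.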