
Audio Isolation Using Per-Sample (or near per sample) Mode Averaging

Author
Time

I have absolutely no idea if this would work or not, but it seems to me that theoretically it could.

My thought on this project is based on the "My Neighbor Totoro" original English dub - I'd love to hear the vocal talent from that release grafted onto the higher-quality soundtrack backing the Japanese and Disney dubs. 

Of course, this only even theoretically works if the audio is in perfect sync.

Usually when you combine audio or video, you use a regular mean average. Sometimes when you're capturing video and want to remove artifacts, you use a median average that eliminates outliers.

What you don't see a lot is MODE averaging, where the most commonly occurring pixel or sample is used.

I'm proposing a potential breakthrough, at least for dubbed movies: using a MODE-average process to compare and combine audio tracks on a per-sample basis (or a 1 ms basis... just something very small).

In the case of Totoro, my thought is this - Use the following Audio Tracks

1 - Japanese Blu Track

2 - Disney Blu Track

3 - Original Dub

4 - Original Dub duplicate

5 - French Blu Track (or any other foreign language track that appears to use the same soundscape as the other Blu Tracks)

In this scenario, hopefully the three Blu-sourced tracks would provide the most commonly occurring samples/sections for the music and background noises/effects. Wherever there is dialogue, the three Blu languages would be too different from one another, letting the two identical dub tracks become the most commonly occurring.
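To make the idea concrete, here is a minimal numpy sketch of per-sample mode averaging. This is not an existing tool, just an illustration of the voting logic described above; it assumes the five tracks are already perfectly sample-aligned, and the `tol` parameter is an invented tolerance, since real transfers never match bit-for-bit:

```python
import numpy as np

def per_sample_mode(tracks, tol=64):
    """Per-sample mode across aligned PCM tracks.

    tracks: 2-D int array, shape (n_tracks, n_samples).
    tol: samples within this distance are treated as "the same" value.
    For each sample position, returns the value from the largest
    cluster of agreeing tracks.
    """
    tracks = np.asarray(tracks, dtype=np.int64)
    out = np.empty(tracks.shape[1], dtype=np.int64)
    for i in range(tracks.shape[1]):
        col = tracks[:, i]
        # for each track's value, count how many tracks agree within tol
        agree = np.abs(col[:, None] - col[None, :]) <= tol
        out[i] = col[np.argmax(agree.sum(axis=1))]
    return out

# toy example at one sample position: three "Blu" tracks carry music,
# two identical dub tracks carry dialogue - the cluster of 3 wins
tracks = np.array([[100], [101], [99], [5000], [5001]])
print(per_sample_mode(tracks))  # → [100]
```

On a dialogue sample the three Blu languages would all disagree, the two dub copies would form the biggest cluster, and the dub value would win instead - which is the behavior the scenario above is counting on.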

I'm unsure exactly how to implement this, but I'd imagine phase inversion and/or difference extraction could then be used on a track like this to isolate the dialogue in the other tracks as well. While not important for the project I'm interested in here, I imagine something like that would be very helpful for Star Wars edits.
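The phase-inversion/difference-extraction arithmetic itself is simple; here is a sketch with synthetic sine tones standing in for real stems. It only works this cleanly when the two signals are sample-aligned and level-matched, which is the whole difficulty in practice:

```python
import numpy as np

# Difference extraction via polarity inversion: if mix = music + vocals
# and we have the music bed in isolation, adding the polarity-inverted
# bed to the mix cancels the shared content and leaves the vocals.
sr = 8000
t = np.arange(sr) / sr
music = np.sin(2 * np.pi * 220 * t)          # stand-in for the music bed
vocals = 0.5 * np.sin(2 * np.pi * 440 * t)   # stand-in for the dialogue
mix = music + vocals

isolated = mix + (-music)  # polarity-invert the bed and sum
```

With real recordings, even a one-sample misalignment turns this cancellation into comb filtering, which is why the later posts in this thread push toward FFT-based methods.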

Is any of this possible? Could anyone explain to me how to do it? Or should I go to sleep because it's 2 in the morning and I'm going to have no clue what I even meant here in the morning?

Preferred Saga:
1/2: Hal9000
3: L8wrtr
4/5: Adywan
6-9: Hal9000

Author
Time

I'm not quite sure I understand what your goals are, but it sounds interesting. I'm no audio engineer, but I do goof around in Audacity and VinylStudio from time to time when cleaning up flaws in my record and CD library. I'd like some software that can merge two or more copies of identical vinyl pressings as a snap/crackle/pop noise-removal technique. There is an existing patent on this concept; it's about 15 years old or more, IIRC. However, I don't know if there is any TooT-type audio-merging software that is commercially available. Being able to separately save the sounds removed by a process like this is what I think you are trying to accomplish. Perfect alignment of the waveforms is, I think, key to everything - I see you mention a per-sample or 1 ms basis. I wonder if zero-crossings are sufficient alignment points?
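The vinyl-merging idea can be sketched as a per-sample median in numpy - assuming the transfers are already aligned, which, as noted, is the hard part:

```python
import numpy as np

def median_merge(copies):
    """Merge aligned transfers of identical pressings by per-sample median.

    A click present in only one copy is an outlier at that sample, so
    the median ignores it while the shared program material survives.
    copies: shape (n_copies, n_samples), already aligned.
    """
    return np.median(np.asarray(copies, dtype=float), axis=0)

# three aligned transfers; the second has a pop at sample 2
a = np.array([0.1, 0.2, 0.3, 0.2])
b = np.array([0.1, 0.2, 0.9, 0.2])  # 0.9 is the pop
c = np.array([0.1, 0.2, 0.3, 0.2])
merged = median_merge([a, b, c])    # the pop is voted out
```

Saving `copies - merged` separately would give exactly the removed clicks, which sounds like the "save the sounds removed" idea mentioned above.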

If your crop is water, what, exactly, would you dust your crops with?

Author
Time

If you can get the music in isolation from other sources, REMIXAVIER may be of use:

The Remixavier ("remix savior") project is concerned with recovering the "difference" between different mixes of the same track. For instance, given a full mix and an instrumental, we can try to recover the vocals, or given the full mix and an a cappella version, we can try to produce an instrumental version. In the process, we can identify the precise temporal alignment between the two versions, which may be useful in its own right.

http://www.ee.columbia.edu/~dpwe/resources/matlab/remixavier/

http://labrosa.ee.columbia.edu/hamr2013/proceedings/doku.php/remixavier

Example 1: Significant time skew and channel difference

This example consists of an original instrumental track, digitized from a vinyl LP release, and a rap that uses the track as backing, taken directly from a CD. Thus, the different signal paths mean that the timing is significantly different (clock drift of 0.1%), and the overall spectrum is very different too.
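A constant clock drift like that 0.1% can in principle be undone by resampling one signal before comparing the two. Here is a rough numpy illustration using linear interpolation; a real resampler (as Remixavier uses) would do better:

```python
import numpy as np

def correct_drift(x, drift):
    """Undo a constant clock-rate error by resampling.

    drift: relative rate error, e.g. 0.001 for a 0.1% skew.
    Linear interpolation is a crude stand-in for a proper
    resampler, but it shows the idea.
    """
    src = np.arange(len(x))
    dst = src / (1.0 + drift)  # where each corrected sample reads from
    return np.interp(dst, src, x)

# a 440 Hz tone "recorded" 0.1% fast, then corrected
sr = 48000
t = np.arange(sr) / sr
fast = np.sin(2 * np.pi * 440 * t * 1.001)
fixed = correct_drift(fast, 0.001)
```

Estimating the drift value in the first place is the part Remixavier's alignment step handles; here it is simply assumed known.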

We haven’t been staying away so much as not coming here. -Ron Nasty

Author
Time

Actually, that might work on two tracks that are identical except for the language. It would result in just the two vocal tracks, which could then have their phase inverted and be combined with the Japanese track - that would remove the Japanese voices and leave just the dub.


Author
Time
 (Edited)

It seems that Remixavier uses stretching and FFT (which I can do with a combination of tools; see below).

Thanks, obi-juan, for suggesting Remixavier. There's also utagoerip, which is FFT-only. I'll mention other FFT-only tools below.

You can forget about all this "Audio Isolation Using Per-Sample (or near per sample) Mode Averaging" (unless you stretch the audio first, and that by itself succeeds). Any audio that is not in perfect sync will not polarity-invert cleanly; this is where FFT-based tools come in handy.

Re: FFT and sources - other-language tracks are less useful unless they align 100% (or unless they are surround and have mostly isolated music in some channels). You may need to combine FFT with polarity inversion to get the BGM if you have stereo only.
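As a rough illustration of what "FFT-based" means here, this is a bare-bones spectral-subtraction sketch in numpy: it subtracts the magnitude spectrum of a known bed from the mix frame by frame, keeping the mix's phase, so it tolerates small sync errors better than raw polarity inversion. It is nowhere near what the dedicated plugins do (no windowing or overlap-add), just the core idea:

```python
import numpy as np

def spectral_subtract(mix, bed, frame=1024):
    """Frame-by-frame magnitude subtraction of `bed` from `mix`.

    Works on magnitudes rather than raw samples, keeping the mix's
    phase. Sketch only: rectangular frames, no overlap-add.
    """
    n = min(len(mix), len(bed)) // frame * frame
    out = np.zeros(n)
    for start in range(0, n, frame):
        M = np.fft.rfft(mix[start:start + frame])
        B = np.fft.rfft(bed[start:start + frame])
        # clamp at zero: a bin can't have negative magnitude
        mag = np.maximum(np.abs(M) - np.abs(B), 0.0)
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(M)), frame)
    return out

# demo with on-bin tones: mix = "music" (10 cycles/frame) + "vocals"
# (50 cycles/frame); subtracting the music bed should leave the vocals
i = np.arange(4096)
bed = np.sin(2 * np.pi * 10 * i / 1024)
mix = bed + 0.5 * np.sin(2 * np.pi * 50 * i / 1024)
vocals_est = spectral_subtract(mix, bed)
```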

If you really have only two languages in stereo and they don't align (and you don't treat it, because it's generally hard to fix), you will get poor results, because two sides are pulling at the same vocal frequencies. See the pic for an example where we want to keep only the music but have only two voiced sources.

If you want to break through and get stuff done (and you know how to use an actual DAW instead of Audacity), I suggest tools like REAPER plus Quikquak MashTactic or R-Mix (FFT-based imagers) to remove music/SFX.

I have completed a project where I mostly removed the music from a mono version of the Animax Cardcaptor Sakura dub and replaced it with my own fixed surround version of the BGM, derived from the JP 5.1 track, using the tools I mentioned above. You can probably do this with the kn0ckout VST (or another FFT tool) in Audacity, but it would be at least 4x slower and more finicky.

I'm in the middle of a similar project replacing the mono BGM of the Streamline Kiki's Delivery Service dub with surround BGM. But I don't really have that much time nowadays.

Author
Time

Thanks for all of the suggestions!

... this might seem odd... but any chance you could post the Kiki Streamline dub somewhere? PM me if so. I've been trying to get my hands on it.


Author
Time
 (Edited)

My 2017 reply.

My thought on this project is based on the “My Neighbor Totoro” original english DUB - I’d love to hear the vocal talent from that release grafted onto the higher quality soundtrack backing the Japanese and Disney DUB. Of course, this only even theoretically works if the audio is in perfect sync.

Cool.

Usually when you combine audio or video, you use a regular mean average. Sometimes when you’re capturing video and want to remove artifacts, you use a median average that eliminates outliers. What you don’t see a lot is MODE averaging, where the most commonly occurring pixel or sample is used.

Okay.

I’m proposing a potential breakthrough in movies with dubbing at least, using a MODE average process to compare and combine audio tracks on a Per-Sample basis (or 1ms basis… just something very small).

No, mode is probably not going to work, because the tracks are probably not perfectly in sync. You can try, but you will likely not get what you want.

If you wanted to do something roughly like mode averaging, it wouldn't work in the time domain - you'd need to use FFT. And it would be easiest to use only two languages and take a 'minimum of' the bins in the frequency domain instead of the mode. This tool does not exist yet. Let's take another look at the motive:

I’d love to hear the vocal talent from that release grafted onto the higher quality soundtrack backing the Japanese and Disney DUB.

Okay. So if this is your motive, you can do center-channel removal on a stereo track and then place the result behind the mono EN dub.

Or take the side channel of any stereo track and place it behind the mono EN dub.

Drawbacks of both methods: there may be too much or too little content, and the imaging is unstable.
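A toy numpy illustration of the center-channel-removal route: content identical in both channels (usually dialogue) cancels in L - R, leaving the side signal. The numbers are made up, not from any actual track:

```python
import numpy as np

# Center-channel removal on a stereo pair: content panned dead center
# (usually dialogue) is identical in both channels, so L - R cancels
# it and leaves the "side" signal (wide-panned music and effects).
music_l = np.array([0.3, -0.2, 0.5])    # music energy in the left channel
music_r = np.array([0.1, 0.4, -0.3])    # music energy in the right channel
dialogue = np.array([0.2, 0.2, -0.1])   # identical in both channels

left = music_l + dialogue
right = music_r + dialogue
side = (left - right) / 2  # dialogue cancels; only the music difference remains
```

The "too much or too little content" drawback shows up here directly: anything panned center in the music (bass, kick, lead melody) cancels along with the dialogue.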

NB: I haven’t tried any of these things btw.
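For what it's worth, the 'minimum of' frequency bins idea mentioned above could be sketched like this in numpy (equally untried: rectangular frames, phase borrowed from one track). Shared content such as music has similar energy in both language tracks and survives; vocals are loud in only one track, so the other track's smaller bin suppresses them:

```python
import numpy as np

def min_bins(a, b, frame=1024):
    """Keep the smaller magnitude per FFT bin across two language tracks.

    Sketch only: rectangular frames, no overlap-add, phase taken
    from track `a`.
    """
    n = min(len(a), len(b)) // frame * frame
    out = np.zeros(n)
    for s in range(0, n, frame):
        A = np.fft.rfft(a[s:s + frame])
        B = np.fft.rfft(b[s:s + frame])
        mag = np.minimum(np.abs(A), np.abs(B))  # shared content survives
        out[s:s + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(A)), frame)
    return out

# demo with on-bin tones: shared "music" plus a different "vocal"
# tone in each language track; the minimum keeps only the music
i = np.arange(4096)
music = np.sin(2 * np.pi * 20 * i / 1024)
a = music + 0.8 * np.sin(2 * np.pi * 60 * i / 1024)  # language A vocals
b = music + 0.8 * np.sin(2 * np.pi * 90 * i / 1024)  # language B vocals
shared = min_bins(a, b)
```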

Author
Time

I’m wondering if this will help me with my edit Of Terminator 2. I need to isolate and extract all the dialogue, as it’s the only original audio my edit will have. All other sounds are being rebuilt from the ground up. Gunshots, explosions, the T-1000’s noises, ambience, and of course, the score.

Author
Time

TylerDurden389 said:

I’m wondering if this will help me with my edit Of Terminator 2. I need to isolate and extract all the dialogue, as it’s the only original audio my edit will have. All other sounds are being rebuilt from the ground up. Gunshots, explosions, the T-1000’s noises, ambience, and of course, the score.

No it won’t. You won’t have a 2ry audio source. If T2 is in surround, you can try fiddling with the center chan, if it isn’t satisfactory or 2ch, try dialogue isolate in izotope Rx6.