I'm coming late to this party, so perhaps I can't claim with complete believability to have been right (lol), but my reaction to the clips was in line with the actual results. At first I couldn't tell that anything was wrong with any of them, but looking more closely I thought the second was the most correct, while the first was late and the last was early. It's kind of hard to tell with dialogue, perhaps, because nobody moves their mouth exactly the same way or fully enunciates, and the difference was so small that I couldn't have said with any confidence which was which. Still, I was gratified to see that my fleeting impression was actually the right one.
So it seems that a difference of a couple of frames is noticeable, but only just barely, and perhaps not completely objectionable if the error is small enough. I think the sound being early was easier to spot than the sound being delayed, because the movement carried on a bit after the sound had already been heard, which looks the least natural. But they were close enough that it was very easy to get psyched out and confused about it. Hmm . . .