CatBus said:
Feallan said:
I tried OCRing Thai with their trial, it was pretty legit. About half of the lines had a mistake or two, but it could be done.
Damn, their demo can do up to 100 pages during the trial period. I bet I could combine the 4000 or so images of individual lines of Thai text into fewer pages than that. Or at least one film's worth, at ~1200 images.
This is completely doable. I'll be combining the subtitles into pseudo-documents, simulating a page of A4 paper scanned at 300dpi with font sizes and margins within the normal range. Each film should get around 30-ish "pages" of subtitles per language, which can then be fed into FineReader to produce actual text!
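A rough sketch of how the page layout could be planned, using only the Python standard library. It computes where each subtitle-line image would sit on an A4 page at 300 dpi, and it does not do the actual image pasting. The `plan_pages` function, the `Placement` record, and the `margin` and `gap` values are illustrative assumptions, not anything taken from the post or from FineReader.

```python
from dataclasses import dataclass

# A4 is 210 mm x 297 mm; at 300 dpi that is about 2480 x 3508 pixels.
DPI = 300
PAGE_W = round(210 / 25.4 * DPI)  # 2480
PAGE_H = round(297 / 25.4 * DPI)  # 3508


@dataclass
class Placement:
    page: int   # zero-based page number
    index: int  # position of the subtitle image in the input order
    x: int      # left edge on the page, in pixels
    y: int      # top edge on the page, in pixels


def plan_pages(heights, margin=150, gap=30):
    """Stack subtitle-line images top to bottom, starting a new page when full.

    heights: pixel height of each subtitle image, in subtitle order.
    margin and gap are hypothetical sample values, in pixels.
    """
    placements = []
    page, y = 0, margin
    for index, h in enumerate(heights):
        # Move to a fresh page if this image would cross the bottom margin
        # (but never leave a page empty for an oversized image).
        if y + h > PAGE_H - margin and y > margin:
            page += 1
            y = margin
        placements.append(Placement(page, index, margin, y))
        y += h + gap
    return placements


# Example: 1200 line images, each assumed to be 50 px tall.
plan = plan_pages([50] * 1200)
print(plan[-1].page + 1, "pages")
```

With those sample values, 40 lines fit on each page, so 1200 images come out to 30 pages, in line with the 30-ish estimate. Keeping the `index` of every placement also makes it possible to match each OCRed line back to its original subtitle afterwards.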
Then, of course, there will be a lengthy process of manual correction and moving the subs back into SRT format.
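For the last step, a minimal sketch of writing corrected lines back out as SRT. It assumes the start and end time of each subtitle image were kept (in milliseconds) in the same order as the images; the post does not say how the timings are stored, so the `cues` structure and both function names are hypothetical.

```python
def srt_timestamp(ms):
    """Format a millisecond offset as HH:MM:SS,mmm (the SRT time format)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(cues):
    """Build SRT text from (start_ms, end_ms, text) tuples in display order."""
    blocks = []
    for number, (start, end, text) in enumerate(cues, start=1):
        # Each block: counter, time range, subtitle text.
        blocks.append(
            f"{number}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # A blank line separates consecutive blocks.
    return "\n".join(blocks)


# Hypothetical timings paired with corrected OCR output.
print(to_srt([(1000, 2500, "สวัสดี"), (3000, 4200, "ขอบคุณ")]))
```

The output should be saved as UTF-8 so the Thai, Chinese, or Japanese text survives intact.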
I'll start with Thai, but will then create these "pages" for Cantonese, Mandarin/Traditional, and, if it might help Sadako, Japanese. I'm honestly not sure if the Chinese ones will end up being used, but considering Cantonese has no text equivalent at all, maybe I can include it as a convenience.