Post #202060

Author
tweaker
Parent topic
The Thief and the Cobbler: Recobbled Director's Cut (Released)
Link to post in topic
https://originaltrilogy.com/post/id/202060/action/topic#202060
Date created
17-Apr-2006, 5:39 AM
Spent a while getting the formatting right on the OCR stuff--for those who don't know, before you can OCR an article or whatever, the program has to figure out what's text, what's an illustration or picture, and what's just garbage (specks of dust, blots on a xerox, etc). So you just tell the program to automatically figure it out. Then you get to go back and FIX what the program just did. For 60 pages, it took about 2 hours.

So now I've OCRed the pages, and now I have to fix the results. Some of the pages are pretty good and only take 5 minutes or so to spell check. The most common issues are lower-case Ls and capital Is being recognized as the number "1" or as exclamation points. In contrast, there are the pages that are poorly copied and use small, funky fonts, and I get crap like this (note, this is copied directly from a paragraph that I have yet to spell check):

"H rearm anlMaty apprcprLBle Irtal Williams lurrnd on io anirfa-lion al age Swb ntar lacing Snow rtfitfa irtd tna Senm t>wtlfl F4&H J4 r* li Hii dra »*4 By mc« raponiHl at "boyish.' dw in pan |D rul -nletliOLn enimiuasfn lor ta wort Born "1 TotWfc wtvE hia met**' Trtxhod al ¦« HuatrBbt and Kri IVrtot *ta* ¦ Cinhmercal "rmf. William? K"jfc lo Irtfl OrawinQ bcvrQ V an eH'ly aga and wn> ImtKmJ by lid Disney chB/aclflf* AF 14. Tie mi»fl ¦ pihjmnflQd [Pj Thi^) Id Ihfl Oi*nay studia Vi Bur bank. CaJilo*r*ir and Ihtuugrt a FlVnd dF hil mo[r>wrE mam9«tf ID gam mjrfW-lariCB Ha «poX* 10 IrtB Di*r*y ¦n.naicr? i*lwis* wort r<e riad flffmk*!. Bjid Ihey in ftun vrero Impaiawf *lh lr« younpHflr"? .JadkiaHofl and Ulant ¦ Loll Df peop* al ¦» Ikne *H'a Dh"*T ivi«dcir' I* Of» I0U ar> 4nlBiviai-flr, "ftiil I cduM *clij#lty drt» aHaFir"

I shit you not. Those first three 'words' (H rearm anlMaty) should read "It seems entirely". I think I see five words in that mess that actually OCRed correctly.
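
For the milder pages, where the main problem is lower-case Ls and capital Is coming out as "1" or "!", here is a minimal sketch of how the suspect words could be flagged for review before spell checking. The function names `is_suspect` and `flag_suspect_tokens` are hypothetical and not part of any OCR program, and the heuristic (a "1" or "!" inside a token that also contains letters) is an assumption, not what the OCR software does. It only flags words and does not try to guess the correction.

```python
# Hypothetical helper names; this is a rough heuristic, not an OCR program feature.
TRAILING_PUNCT = "!?.,;:\"')"
LEADING_PUNCT = "\"'("


def is_suspect(token: str) -> bool:
    """Return True if a token mixes letters with '1' or '!' in a suspicious way."""
    # Drop ordinary punctuation at the edges so "great!" is not flagged.
    core = token.rstrip(TRAILING_PUNCT).lstrip(LEADING_PUNCT)
    has_letter = any(ch.isalpha() for ch in core)
    has_confusable = any(ch in "1!" for ch in core)
    # Pure numbers such as "14" have no letters and are left alone.
    return has_letter and has_confusable


def flag_suspect_tokens(text: str) -> list[str]:
    """List the whitespace-separated tokens worth a second look."""
    return [tok for tok in text.split() if is_suspect(tok)]


if __name__ == "__main__":
    for tok in flag_suspect_tokens("Wi1liams turned to animation at age 14!"):
        print(tok)
```

This will also flag legitimate tokens such as "1st", so it is only useful as a list of words to eyeball, and it does nothing for pages as badly garbled as the paragraph quoted above.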

Ahh, the joys of OCRing shit...