
Post #1226666

Author
CatBus
Parent topic
Project Threepio (Star Wars OOT subtitles)
Link to post in topic
https://originaltrilogy.com/post/id/1226666/action/topic#1226666
Date created
18-Jul-2018, 1:14 PM

Every now and then I get a glimpse of what an ambitious project Unicode is, and how complicated things get when theory meets reality. Basically Unicode is trying to create a 1:1 mapping of numbers (code points) to characters: U+0041 is A, U+0042 is B, and so on. But we’ve already seen from CJK unification that it’s not really 1:1 with what ends up on screen. You’ve got a 1:1 mapping of numbers to the abstract idea of a character, but the text also needs to be tagged with a particular language so that you can select a language-specific font and render the right glyphs.
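To make the "number to abstract character" idea concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `describe` is made up for illustration. U+76F4 is picked as an example of a unified CJK code point that is conventionally drawn differently in Japanese and Chinese fonts; nothing in the string itself records which one you want.

```python
import unicodedata

def describe(cp: int) -> str:
    """Return the code point, the character, and its Unicode name."""
    ch = chr(cp)
    return f"U+{cp:04X} {ch} {unicodedata.name(ch)}"

# Plain Latin letters: one number, one abstract character.
print(describe(0x0041))
print(describe(0x0042))

# A unified CJK ideograph: still one number, but the glyph you see
# depends on the language tag and font chosen outside the text.
print(describe(0x76F4))
```

The point is that the string only carries the numbers. Language has to come from somewhere else, such as markup or a subtitle file's metadata.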

So far, so good. Then things take a turn for the strange. There’s this little feature defined within the fonts themselves (OpenType calls it localized forms, or `locl`) where the font can have different glyphs for different languages. For example, a font can define special Romanian versions of certain letters. That exists because for a long time the Romanian letter S with a comma below was considered the same thing as the letter S with a cedilla below used in other languages such as Turkish, and a distinct Unicode code point wasn’t assigned until much later. In the meantime, many Romanian documents were written with the original “S with cedilla” code point. So if you use a font that supports it, and software that supports it, and the software is somehow made aware that the text is in Romanian, then it will display the Romanian glyph instead of the typical S with cedilla. But Project Threepio can sidestep all of this because it just uses the newer Unicode code points.
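For anyone cleaning up older Romanian text, a sketch of the "just use the newer code points" approach might look like the following. This is not Project Threepio's actual tooling, and the function name `to_romanian_comma_forms` is hypothetical. It only handles the precomposed letters, and it should only be run on text known to be Romanian, since Turkish legitimately uses the cedilla forms.

```python
# Legacy cedilla code points mapped to the dedicated comma-below ones.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # S with cedilla -> S with comma below
    "\u015F": "\u0219",  # s with cedilla -> s with comma below
    "\u0162": "\u021A",  # T with cedilla -> T with comma below
    "\u0163": "\u021B",  # t with cedilla -> t with comma below
})

def to_romanian_comma_forms(text: str) -> str:
    """Replace precomposed cedilla letters with comma-below letters."""
    return text.translate(CEDILLA_TO_COMMA)

# Old-style spelling in, comma-below spelling out.
print(to_romanian_comma_forms("\u015Fi \u0163ar\u0103"))
```

Once the text uses the comma-below code points, it no longer depends on the font and renderer knowing the language to pick the right glyph.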

And now the wheels come off. In Serbian and Macedonian, certain Cyrillic letters look very different from how they look in other languages using Cyrillic characters, but only when the letters are italicized. So again, that font feature is used so that if the font supports it, and the software supports it, and the software is aware of the text’s language, and the text is italicized, then a totally different-looking glyph is displayed. And it doesn’t look like these cases will ever get distinct Unicode code points, because they’re only distinct in italic form. So this complicated process is the only way you’ll ever be able to get the right results.
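With Pango, one way to supply both the language and the italic style is through its markup, where a `span` can carry `lang` and `font_style` attributes. The sketch below only builds that markup string with the Python standard library; it does not call Pango, and the post doesn't say how Project Threepio actually feeds text to it. The helper name `pango_italic_span` is hypothetical, and the sample letters г, д, п and т are ones commonly cited as having distinct Serbian italic shapes.

```python
from xml.sax.saxutils import escape, quoteattr

def pango_italic_span(text: str, lang: str) -> str:
    """Wrap text in a Pango markup span tagged with a language and italic style."""
    return f'<span lang={quoteattr(lang)} font_style="italic">{escape(text)}</span>'

# Same code points, two language tags. A renderer and font that support
# localized forms can pick different italic glyphs for each.
print(pango_italic_span("\u0433\u0434\u043f\u0442", "sr"))
print(pango_italic_span("\u0433\u0434\u043f\u0442", "ru"))
```

Whether the localized glyphs actually appear still depends on the font having them and the renderer honoring the language tag, which is the chain of "ifs" described above.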

Which is really why it’s officially a miracle when anything ever works at all. But never fear, I’ve confirmed Project Threepio will handle this correctly, thanks to the Pango software and Noto fonts.