- Post
- #702120
- Topic
- Project Threepio (Star Wars OOT subtitles)
- Link
- https://originaltrilogy.com/post/id/702120/action/topic#702120
Now those Osram lightbulb commercials featuring Triumph the Comic Insult Dog make so much more sense.
Just a little update for thread-watchers so they know what's happening after that last page of posts.
First off, we found official subtitles for Polish, Greek, Turkish, and European Portuguese. This means improvements for Turkish (especially ROTJ), promoting Greek from unverified to verified, adding a long-absent Portuguese dialect, and possibly some minor improvement for Polish. Some of this work is already done.
Then, there's the effort I have taken to calling "Operation Eyestrain". The goal is to no longer have any graphical-only subtitles--using a mixture of our newfound OCR method and a painfully slow phase of manual transcription and correction. Japanese is furthest along, thanks to Sadako. Mandarin/Traditional and Cantonese are probably in a very good state since our OCR software seemed to handle Chinese characters very well, but some manual correction is undoubtedly necessary, and I'm hoping I can lean on Sadako for that as well. The surprise was with Thai, which I thought would be easy because it uses an alphabet with a much more manageable number of character permutations than Chinese. Instead, the OCR fell down hard on this text, and it's hard to find a single line that doesn't require at least one manual correction, if not several. Feallan is taking two films and I'm taking one. I predict this will be the slowest of the jobs, since neither of us knows the language at all.
PM sent.
Oh, it's not bad for a holiday variety show special of the era--in fact, it's pretty good by those standards. It's just not so great on an absolute scale, but it's hardly the bottom of the barrel there either. Think of the prequels, for example.
imperialscum said:
I am a common-sense type of fan. I don't limit myself to some area of SW (i.e. OT, PT, EU). If I find something enjoyable then it is part of my personal canon. If something sucks I ignore it.
Yup, me too. For me, though, that's a 100% overlap with OT purist, not a new category.
Although I do derive a morbid fascination that definitely isn't enjoyment from the Holiday Special.
Anyone with a Facebook account who can tell me what that link says?
The very best thing is the fact that, per episode numbering, they had to stop after making three of them.
Jerry, FWIW, I ran into a hardware player compatibility issue with your theatrical 5.1 DTS track (strange audio artifacting on an Oppo BDP-93)--I think possibly because it's 24-bit, but just standard DTS (not DTS-MA). Anyway, I was able to re-encode it as DTS-MA and that worked around the compatibility issue for me. In case anyone else runs into this or you're planning to re-encode it for 2.0.
FWIW, there was a Polish release of the GOUT with (as far as I can tell) English-only audio, but with European Portuguese, Turkish, Polish, and Greek subtitles. I don't have anything but the subtitles (guess what's coming in Project Threepio v7.1?)
DrCrowTStarwars said:
CatBus said:
Harmy said:
Oh, well, it's true of most projectors, but it's possible that some projectors can actually do true 24fps but I've never seen one. I know that digital projectors in cinemas definitely have higher refresh rates than 24Hz, so it seems curious that a home projector would do native 24fps.
But with a color wheel, the projector has to refresh 3 times for each full-color frame, so my guess would be that its true refresh rate for 24fps sources is 72Hz, thus producing 24fps in full color, and for higher frame-rate sources it's capable of even higher refresh rates.
Yeah, and 120Hz+ displays should be able to do the same thing. So if the player upscales 720p24 to 1080p24, the only conversion is the image upscale, no change to the frame cadence. If it converts to 720p60, it's likely the display will need to upscale the image too, so you get both conversions.
Again, not so you'd notice for the most part. But you do need to hand in your videophile card if someone catches you watching film at 60Hz. I hear they take your plasma away too ;-)
What do they do to you if you only have an LCD that goes up to 60Hz?
Not that I would know anything about that, or know anyone who would know anything about that.
Usually you can escape from them when they're facepalming, so you'll be fine.
Harmy said:
Oh, well, it's true of most projectors, but it's possible that some projectors can actually do true 24fps but I've never seen one. I know that digital projectors in cinemas definitely have higher refresh rates than 24Hz, so it seems curious that a home projector would do native 24fps.
But with a color wheel, the projector has to refresh 3 times for each full-color frame, so my guess would be that its true refresh rate for 24fps sources is 72Hz, thus producing 24fps in full color, and for higher frame-rate sources it's capable of even higher refresh rates.
Yeah, and 120Hz+ displays should be able to do the same thing. So if the player upscales 720p24 to 1080p24, the only conversion is the image upscale, no change to the frame cadence. If it converts to 720p60, it's likely the display will need to upscale the image too, so you get both conversions.
Again, not so you'd notice for the most part. But you do need to hand in your videophile card if someone catches you watching film at 60Hz. I hear they take your plasma away too ;-)
Harmy said:
The question is what actually happens when it gets converted to 720p60? My guess is, that it simply shows some frames 3 times and some 2 times, which is something that happens on a 60Hz monitor or TV anyway and on a 120Hz or higher, the frames are simply going to be shown that many more times, right?
Precisely. And it's honestly not that big of a deal for the most part, except you notice the lack of smooth scrolling in the credits, etc.
For more information, 720p/24 is a valid AVCHD/Blu-ray format, but it's not in the HDMI spec, so players have to convert it to something else before transmitting it to the display. The proper thing to do is 1080p/24, but some do 720p/60. Not much you can do to fix it other than get a different player.
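For anyone who wants to see that cadence spelled out, here is a minimal Python sketch. The function name `repeat_counts` is made up for illustration, and it assumes the simplest possible behaviour (each source frame is held until the next one is due); real players and displays may do something different.

```python
def repeat_counts(source_fps: int, display_hz: int, frames: int) -> list[int]:
    """Return how many display refreshes each source frame is held for."""
    counts = []
    for i in range(frames):
        # First refresh that shows frame i, and first refresh that shows frame i+1.
        start = (i * display_hz) // source_fps
        end = ((i + 1) * display_hz) // source_fps
        counts.append(end - start)
    return counts


if __name__ == "__main__":
    print(repeat_counts(24, 60, 8))   # uneven mix of 2s and 3s
    print(repeat_counts(24, 72, 8))   # every frame held 3 refreshes
    print(repeat_counts(24, 120, 8))  # every frame held 5 refreshes
```

At 60Hz the hold times alternate between 2 and 3 refreshes, which is the uneven motion you notice in scrolling credits. At 72Hz or 120Hz every frame is held for the same number of refreshes, so the cadence stays even.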
I was tempted to despecialize it like the Ukrainian audio, but I'm not doing anything with it ATM (don't have it either).
PM sent.
Molly said:
inb4 jokes about Syfy in Poland.
A good point for the "Sci-Fi vs. Space Opera" argument: The prequels clearly aren't a form of Syfy, because antibiotics don't make them any more tolerable (nor does liquor, for that matter).
Gather 'round, kiddos! It's story-time with CatBus.
So a conversation with Sadako started me thinking about all of the interesting and unexpected things a person learns when diving into a new language (such as: "jedi" means "to eat" in Croatian, which leads to some translation issues), and I thought I'd share one of the ones that I thought was neat.
So, first off, we have this unverified Simplified Mandarin fansub. I suspect it's actually very good because I could tell whoever did it was very thorough and loved Star Wars, but I just can't say for certain if the Chinese was very good.
One of the interesting things about these Chinese subtitles is how they incorporate foreign words, such as Jawa. There are, as you probably know, thousands of Chinese characters, and quite a lot of them can be pronounced as some variation on "ja" or "wa". So a translator finds two characters that make those sounds, and if they're good, they choose two characters that actually can describe the thing in question.
"Wa" is easy. The character for "baby" is pronounced "wa", Jawas are small people and kinda cute, so that's that. "Ja" on the other hand... well, this translator chose "claw". So Jawas are "claw babies"...
...which is pretty accurate actually, but it somehow makes me think about an alternate version of Star Wars made by David Cronenberg, where claw babies would fit right in. Also, it makes some lines a little funny, like Luke asking, "Why would the Empire want to slaughter claw babies?" Gee, I dunno, Luke. Self-defense? Because they're an abomination? To kill them before they grow and multiply? I can think of plenty of reasons.
Anyway, that's just a language tidbit I thought was interesting and a little funny. There are plenty of others, I'm sure, if I think about it.
yoda-sama said:
could Admiral Oswald, rather than being clumsy or stupid, instead be a highly placed Rebel plant or a Rebel sympathizer?
It's Ozzel, or are you merging your conspiracy theories to save space? ;-)
BDSup2Sub, Perl, and ImageMagick--I'm a one-trick pony in that respect. BDSup2Sub to extract individual image files, Perl to script everything, and ImageMagick to handle the image compositing.
What I've got only really works with white subtitles with a black border. Yellow subtitles and such would require different code. Also if your subs aren't 720p, you may need to do some resizing. Can't guarantee it's bug-free, but even if there end up being problems, it still saves a lot of time.
Perl code below:
#!/usr/bin/perl
# Assemble single-line subtitle images into page-sized PNGs for OCR.
# Needs ImageMagick's convert, identify, and composite on the PATH.
use strict;
use warnings;

if (!@ARGV) {
    print "Usage: perl assemble.pl <filename...>\n";
    exit 1;
}

# Build the file list, expanding wildcards the shell did not expand.
my @filelist;
for my $sourcefile (@ARGV) {
    if (-e $sourcefile) {
        push @filelist, $sourcefile;
    } else {
        my @tmplist = glob($sourcefile);
        if (!@tmplist) {
            print "Error: Could not find $sourcefile\n";
            exit 1;
        }
        push @filelist, @tmplist;
    }
}

# Page geometry: A4 at 300dpi, with whole-pixel margins.
my $pagewidth    = 2480;
my $pageheight   = 3508;
my $leftmargin   = int($pagewidth / 20);
my $topmargin    = int($pageheight / 20);
my $bottommargin = int($pageheight / 18);

my $tmpfile       = "tmp1.png";
my $pagenum       = 1;
my $currentpage   = sprintf("page%02d.png", $pagenum);
my $currentmargin = $topmargin;
unlink $currentpage;    # never append to a page left over from an earlier run

# Run an external command (list form, so odd file names need no quoting).
sub run {
    system(@_) == 0 or die "Command failed: @_\n";
}

for my $sourcefile (@filelist) {
    # Start a fresh white page if the current one does not exist yet.
    if (!-e $currentpage) {
        run("convert", "-size", "${pagewidth}x${pageheight}", "xc:white", $currentpage);
        $currentmargin = $topmargin;
    }

    # Invert white-on-black subtitles into black-on-white text.
    run("convert", $sourcefile, "-negate", $tmpfile);

    # Ask ImageMagick for the image size, e.g. "640x48".
    my $dims = `identify -format %wx%h $tmpfile`;
    my ($imagewidth, $imageheight) = $dims =~ /(\d+)x(\d+)/
        or die "Could not read dimensions of $sourcefile\n";

    # Paste the line onto the page below the previous one.
    run("composite", "-compose", "atop", "-geometry", "+$leftmargin+$currentmargin",
        $tmpfile, $currentpage, "_$currentpage");
    unlink $tmpfile;
    rename("_$currentpage", $currentpage) or die "Could not update $currentpage: $!\n";
    $currentmargin += $imageheight + 5;

    # Once the bottom margin is reached, move on to the next page.
    if ($currentmargin >= $pageheight - $bottommargin) {
        $pagenum++;
        $currentpage = sprintf("page%02d.png", $pagenum);
        unlink $currentpage;
    }
}
print "Done.\n";
Conclusion: it works! (at least for Japanese)
Thanks Buster D for getting this conversation started and pointing us at a good product, and Feallan and Sadako too! Corrections are still required, but it's not that bad. Perhaps someday Project Threepio will no longer have any graphical-only subtitles!
Thanks for volunteering ;)
I've got some files to Feallan for testing the process. If all goes well, I'll make some more.
CatBus said:
Feallan said:
I tried OCRing Thai with their trial, it was pretty legit. About half of the lines had a mistake or two, but it could be done.
Damn, their demo can do up to 100 pages during the trial period. I bet I could combine the 4000 or so images of individual lines of Thai text into less than that. Or at least one film at ~1200 images.
This is completely doable. I'll be combining the subtitles into pseudo-documents, simulating a page of A4 paper scanned at 300dpi with font sizes and margins within the normal range. Each film should get around 30-ish "pages" of subtitles per language, which can then be fed into FineReader to produce actual text!
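As a rough sanity check on those numbers, here is a small Python sketch of the page arithmetic. The function names are made up for illustration, and the 70-pixel line height, 5-pixel gap, and 3139-pixel usable height are assumed values for the example, not measurements from the actual subtitle images.

```python
MM_PER_INCH = 25.4


def page_pixels(width_mm: float, height_mm: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page scanned at the given resolution."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


def pages_needed(line_count: int, line_height_px: int,
                 gap_px: int, usable_height_px: int) -> int:
    """Pages needed when subtitle lines are stacked top to bottom."""
    per_page = usable_height_px // (line_height_px + gap_px)
    return -(-line_count // per_page)  # ceiling division


if __name__ == "__main__":
    # A4 is 210mm x 297mm.
    print(page_pixels(210, 297, 300))
    # Assumed values: 1200 lines, 70px tall, 5px apart, 3139px of usable height.
    print(pages_needed(1200, 70, 5, 3139))
```

With those assumed values, one film's ~1200 lines land at about 30 pages, comfortably under the 100-page trial limit.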
Then, of course, will be a lengthy process of manual correction and moving the subs back into an SRT format.
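For reference, an SRT entry is just a counter, a timing line, and the text, followed by a blank line. A minimal Python sketch of writing one is below; the function names and the sample timings are hypothetical, and it assumes the start and end times for each line are already known in milliseconds.

```python
def srt_timestamp(ms: int) -> str:
    """Format a time in milliseconds as HH:MM:SS,mmm (SRT uses a comma)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def srt_entry(index: int, start_ms: int, end_ms: int, text: str) -> str:
    """Build one SRT entry: counter, timing line, text, blank line."""
    timing = f"{srt_timestamp(start_ms)} --> {srt_timestamp(end_ms)}"
    return f"{index}\n{timing}\n{text}\n\n"


if __name__ == "__main__":
    # Hypothetical timings and placeholder text, for illustration only.
    print(srt_entry(1, 3_723_004, 3_725_500, "(corrected OCR text goes here)"), end="")
```

The file should be saved as UTF-8 so that Thai, Chinese, and Japanese text survives the round trip.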
I'll start with Thai, but will then create these "pages" for Cantonese, Mandarin/Traditional, and, if it might help Sadako, Japanese. I'm honestly not sure if the Chinese ones will end up being used, but considering Cantonese has no text equivalent at all, maybe I can include it as a convenience.
Feallan said:
I tried OCRing Thai with their trial, it was pretty legit. About half of the lines had a mistake or two, but it could be done.
Damn, their demo can do up to 100 pages during the trial period. I bet I could combine the 4000 or so images of individual lines of Thai text into less than that. Or at least one film at ~1200 images.
The only real benefits of text-derived subs are that: 1) they can look better, 2) they can be easily modified and corrected, and 3) they can be used with old-school MKV players that don't support BD-SUP files. That's a short list of nice-to-haves, but none of them are critical.
So if there's even a chance of the odd OCR mistake, I'd rather stick with the graphical subs, because they are actually still pretty decent, after all. That's why Thai OCR is the only one I'm willing to go for, even if the Chinese OCR looks okay to me.
I suppose I should add this regarding OCR--while I wouldn't trust any OCR without manual correction, and I am unqualified to do this manual correction for Japanese and Chinese text, I do feel confident that I could manage it (albeit very slowly) for Thai, because it uses an alphabet and diacritics with a much more manageable number of permutations. I've even managed to transcribe a few lines myself entirely by hand before it got way too frustrating (my process for combining diacritics was pretty clunky).
So if anyone's sitting on a copy of FineReader, I'm sure we could make some use of it for this project.