
Project Threepio (Star Wars OOT subtitles) — Page 12

Author
Time

PM sent.

BTW, here's a report on the differences in the (16mm) mono mix for Empire:

- Threepio says: "Oh, this is suicide! There's nowhere to go!"

- Threepio says: "Oh, dear. What now? I don't like the look of this. If only you'd attached my legs..."

- Threepio says "Hello?" only once when looking for the R2 unit in Cloud City.

Lots of little mixing differences, occasionally making dialogue easier to hear, as was the case with the Star Wars mono mix. For example, when Lando says "So you see, since we're a small operation...", the "So" is much clearer in the mono mix.

I plan on including SRT files for the Empire mono mix in the next release, but no pre-rendered graphics, similar to what I've done for the Star Wars mono mix.

Project Threepio (Star Wars OOT subtitles)

Author
Time

Project files have been updated to version 7.0 (original post has been updated as well). This release includes a huge number of changes to the underlying utilities and procedures for creating subtitles, and a moderate number of other changes, specified below. Please PM me for the temporary download links until the files are available in a more permanent location.

Rough summary of text changes:
- Added Ukrainian subtitles (verified, thanks to lexsanor)
- Japanese subtitles converted from graphical-only to text and graphical (Star Wars only for now, huge thanks to Sadako--OCR doesn't really work for Japanese, so this work was all manual!!)
- Further improvements to Polish subs (thanks to Feallan)
- American Spanish subtitles promoted from unverified to verified (thanks to Leoj)
- Russian subtitles promoted from unverified to verified (thanks to lexsanor)
- Indonesian subtitles are now available in both text and graphical form for the entire trilogy, verified. They haven't all been shortened as effectively as the ones for Star Wars, so there may still be some improvements to come for these
- Improvement to Croatian subtitles for Empire, but I don't yet consider these to be verified (thanks to Feallan's international connections)
- Minor improvement to English subtitles for Empire, including the addition of text-only subtitles for the 16mm mono mix

Rough summary of behind-the-scenes changes:
- Every text-based subtitle has been re-rendered to correct an issue where a few pixels of semitransparent drop shadow were shaved off in earlier versions (this issue was barely visible; otherwise the new subtitles should look identical). The drop shadow was removed from SDH subtitles, where the black background made a drop shadow redundant
- Non-SDH DVD-resolution subtitles were re-rendered with a slightly thicker black border (easier to read on a standard definition CRT)
- Project Threepio now includes 1080p SUP files (upscaled from 720p).  I'm really hoping these might be useful to someone someday ;)
- Further improvements to cross-platform support (still requires Windows for the initial render of text subtitles)
- This project is finally weaned off an old, outdated version of BDSup2Sub. The upside is no longer having to work around its bugs; the downside is that completely different command syntax is required for batch processing on Windows and non-Windows platforms, so the instructions have become more convoluted
- The Perl scripts have been moved out of the project's root folder to avoid confusing people
- Many more processes have been fully automated (for example, adjusting subtitle positions), which I hope will reduce errors and improve quality, and also save my wrists from certain ruin

I'm sure I've forgotten something, but those are the big changes.

Project Threepio (Star Wars OOT subtitles)

Author
Time

I think there are actually some programs (not SubRip) that will OCR Japanese text in images.  I tried FineReader a long time ago and it seemed to work somewhat well on scans of an old novel I had, but the output still required some editing.  Not sure how well it would work on subtitle images, but it might be worth a try.  FineReader isn't free though and I'm not sure how well other software works.

Author
Time
 (Edited)

Yeah, that was a bit of an oversimplification. Everything that exists requires some fairly significant post-correction, and unlike OCR'ing, say, Swedish, you actually have to be fairly familiar with the language to do the manual correction properly on Japanese (or Chinese). Plus, application Unicode support for furigana is pretty much shit, so you'll have to rework those bits anyway. And as you said, Japanese DVD subtitles are like a lo-fi 8-bit approximation of a character shape, and even good OCR software may fail with them.

In my estimation, for text as short as this, the manual correction would take just as long as the manual transcription. Whether that's true or not, I don't know, but that's what we're doing--and the results are good, and we're 1/3 of the way through!

Project Threepio (Star Wars OOT subtitles)

Author
Time

I've just heard that Project Threepio v7 is available on MySpleen! Thanks, marvins!

Project Threepio (Star Wars OOT subtitles)

Author
Time

BTW it's a good idea to make a text file named "help_needed" or whatever, and include info about languages needing verification/transcription plus your contact info. A good number of people download DeEds from tehparadox and never even visit ot.com, so we might catch someone this way.

Fanrestore - Fan Restoration Forum: https://fanrestore.com

Author
Time

Ah, good point. There's a link to the discussion thread in the README, but that's in more of an "if you need help" statement, not an "if you'd like to help". It might not even occur to people that they could help us out.

Something for 7.1 ;)

Project Threepio (Star Wars OOT subtitles)

Author
Time

I suppose I should add this regarding OCR--while I wouldn't trust any OCR without manual correction, and I am unqualified to do this manual correction for Japanese and Chinese text, I do feel confident that I could manage it (albeit very slowly) for Thai, because it uses an alphabet and diacritics with a much more manageable number of permutations.  I've even managed to transcribe a few lines myself entirely by hand before it got way too frustrating (my process for combining diacritics was pretty clunky).

So if anyone's sitting on a copy of FineReader, I'm sure we could make some use of it for this project.

Project Threepio (Star Wars OOT subtitles)

Author
Time
 (Edited)

I tried OCRing Thai with their trial, it was pretty legit. About half of the lines had a mistake or two, but it could be done.

I even tried this program on Chinese subs, and it seems to handle them better than Thai... I think...

Fanrestore - Fan Restoration Forum: https://fanrestore.com

Author
Time

The only real benefits of text-derived subs are that: 1) they can look better, 2) they can be easily modified and corrected, and 3) they can be used with old-school MKV players that don't support BD-SUP files. That's a short list of nice-to-haves, but none of them are critical.

So if there's even a chance of the odd OCR mistake, I'd rather stick with the graphical subs, because they are actually still pretty decent, after all. That's why Thai OCR is the only one I'm willing to go for, even if the Chinese OCR looks okay to me.

Project Threepio (Star Wars OOT subtitles)

Author
Time

Feallan said:

I tried OCRing Thai with their trial, it was pretty legit. About half of the lines had a mistake or two, but it could be done.

Damn, their demo can do up to 100 pages during the trial period. I bet I could combine the 4000 or so images of individual lines of Thai text into less than that. Or at least one film at ~1200 images.

Project Threepio (Star Wars OOT subtitles)

Author
Time
 (Edited)

CatBus said:

Feallan said:

I tried OCRing Thai with their trial, it was pretty legit. About half of the lines had a mistake or two, but it could be done.

Damn, their demo can do up to 100 pages during the trial period. I bet I could combine the 4000 or so images of individual lines of Thai text into less than that. Or at least one film at ~1200 images.

This is completely doable. I'll be combining the subtitles into pseudo-documents, simulating a page of A4 paper scanned at 300dpi with font sizes and margins within the normal range.  Each film should get around 30-ish "pages" of subtitles per language, which can then be fed into FineReader to produce actual text!
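
A rough back-of-the-envelope check of that page count, sketched in Python rather than the Perl used for the real work. The margin fractions and the 5-pixel gap are borrowed from the assembly script posted further down the thread; the 70-pixel line height is purely a guess for illustration, so treat the results as ballpark figures only.

```python
import math

MM_PER_INCH = 25.4

def a4_pixels(dpi=300):
    """Pixel size of an A4 sheet (210 x 297 mm) scanned at the given dpi."""
    return (round(210 / MM_PER_INCH * dpi), round(297 / MM_PER_INCH * dpi))

def lines_per_page(page_height, line_height, gap=5,
                   top_frac=1 / 20, bottom_frac=1 / 18):
    """Estimate how many subtitle images fit on one page.

    Lines are stacked from the top margin until the running offset
    reaches the bottom margin, so the count is rounded up.
    """
    top = int(page_height * top_frac)
    bottom = int(page_height * bottom_frac)
    usable = page_height - top - bottom
    return math.ceil(usable / (line_height + gap))

def pages_needed(image_count, per_page):
    """Number of pages required for image_count subtitle images."""
    return math.ceil(image_count / per_page)

width, height = a4_pixels(300)
per_page = lines_per_page(height, line_height=70)  # 70 px is an assumed height
print(width, height, per_page)
print(pages_needed(1200, per_page), pages_needed(4000, per_page))
```

Under those assumptions one film comes out at roughly 30 pages, and the full set of about 4000 images stays under the 100-page trial limit.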

Then, of course, there will be a lengthy process of manual correction and moving the subs back into an SRT format.
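
For anyone unfamiliar with the target format, here is a minimal Python sketch of writing corrected lines back out as SRT cues. The function names and the timings are made up for illustration; in practice the time codes would have to come from the original graphical subtitles, and this is not the project's actual tooling.

```python
def format_timestamp(ms):
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"

def format_cue(index, start_ms, end_ms, text):
    """One SRT cue: counter, time range, then the subtitle text."""
    return (f"{index}\n"
            f"{format_timestamp(start_ms)} --> {format_timestamp(end_ms)}\n"
            f"{text}\n")

def build_srt(cues):
    """Join (start_ms, end_ms, text) tuples into a numbered SRT document.

    Cues are separated by a blank line, as the format expects.
    """
    return "\n".join(format_cue(i, start, end, text)
                     for i, (start, end, text) in enumerate(cues, start=1))

# Hypothetical timings, for illustration only.
print(build_srt([(1000, 2500, "Hello?"), (3723004, 3725000, "Oh, dear.")]))
```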

I'll start with Thai, but will then create these "pages" for Cantonese, Mandarin/Traditional, and, if it might help Sadako, Japanese.  I'm honestly not sure if the Chinese ones will end up being used, but considering Cantonese has no text equivalent at all, maybe I can include it as a convenience.

Project Threepio (Star Wars OOT subtitles)

Author
Time

Pretty much the hardest part of transcribing the Japanese is, like, the first 100-200 lines--most of the important terms in the movie will have been said by then ('Rebel army secret base', 'Battle station', etc.), so fewer and fewer terms have to be looked up in order to get readings for them. After that, it's just a matter of finding enough motivation to crank out the lines, lol.

Since every line of OCR'd text would have to be checked anyway, I'm not sure how much time it would actually save over just doing it by hand. But I'm definitely intrigued. If it works well, then I might be able to help proof Traditional Chinese, if no one else actually, y'know, knows Mandarin (I know Classical Chinese, and I know my way around Traditional hanzi, but I don't actually speak Chinese, and I'm hopeless with Simplified).

Author
Time

Thanks for volunteering ;)

I've sent some files to Feallan for testing the process.  If all goes well, I'll make some more.

Project Threepio (Star Wars OOT subtitles)

Author
Time
 (Edited)

Conclusion: it works! (at least for Japanese)

Thanks Buster D for getting this conversation started and pointing us at a good product, and Feallan and Sadako too! Corrections are still required, but it's not that bad. Perhaps someday Project Threepio will no longer have any graphical-only subtitles!

Project Threepio (Star Wars OOT subtitles)

Author
Time

I'm gonna bet $5 that the Japanese OCR that this program does is a modification of their Traditional Chinese OCR--because it's AWESOME at reading kanji, it's pretty good at hiragana, and runs into a lot of the problems that first and second-semester Japanese students have with katakana. I ran a quick test, and it performed really well; every single error I found was in kana, lol.

This program would work really well for the transcribing portion, at least in my language. Correcting the mistakes that it does make is no big deal, and the rest of the work is just the other tedious portions--double-checking, formatting and putting in time codes, lol.
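
Since the errors cluster in kana, one way to prioritise the double-checking might be to flag the kana-heavy lines first. The Python sketch below classifies characters by Unicode block (Hiragana U+3040 to U+309F, Katakana U+30A0 to U+30FF, CJK Unified Ideographs U+4E00 to U+9FFF). The function names, the 50% threshold, and the sample strings are all made up for illustration and are not taken from the actual subtitle files.

```python
from collections import Counter

def script_of(ch):
    """Classify one character by the Unicode block it falls in."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"

def count_scripts(line):
    """Count how many characters of each script a line contains."""
    return dict(Counter(script_of(ch) for ch in line))

def kana_heavy(lines, threshold=0.5):
    """Return the lines whose share of kana characters meets the threshold."""
    flagged = []
    for line in lines:
        counts = count_scripts(line)
        kana = counts.get("hiragana", 0) + counts.get("katakana", 0)
        if line and kana / len(line) >= threshold:
            flagged.append(line)
    return flagged

# Made-up sample lines, not real subtitle text.
samples = ["反乱軍の秘密基地", "デス・スター"]
print(kana_heavy(samples))
```

This only sorts lines by risk; every line would still need a human read-through.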

Author
Time

CatBus said:

Conclusion: it works! (at least for Japanese)

Thanks Buster D for getting this conversation started and pointing us at a good product, and Feallan and Sadako too! Corrections are still required, but it's not that bad. Perhaps someday Project Threepio will no longer have any graphical-only subtitles!

Awesome! What did you do to get multiple subtitle lines into one image, just manually copying and pasting? I might want to try this on some of my own projects someday.

Author
Time

BDSup2Sub, Perl, and ImageMagick--I'm a one-trick pony in that respect.  BDSup2Sub to extract individual image files, Perl to script everything, and ImageMagick to handle the image compositing.

What I've got only really works with white subtitles with a black border.  Yellow subtitles and such would require different code. Also if your subs aren't 720p, you may need to do some resizing. Can't guarantee it's bug-free, but even if there end up being problems, it still saves a lot of time.

Perl code below:

#!/usr/bin/perl
# assemble.pl: stack individual subtitle images onto A4-sized "pages" for OCR.
# Needs ImageMagick's convert, identify and composite on the PATH.
use strict;
use warnings;

if(@ARGV==0) {
  print "Usage: perl assemble.pl <filename...>\n";
  exit;
}

# Build the list of images, expanding wildcards for shells that don't
my @filelist=();

for my $sourcefile (@ARGV) {
  if(-e $sourcefile) {
    push(@filelist,$sourcefile);
  } else {
    my @tmplist=glob($sourcefile);
    if(@tmplist==0) {
      print "Error: Could not find $sourcefile\n";
      exit 1;
    }
    push(@filelist,@tmplist);
  }
}

# A4 at 300dpi is 2480x3508 pixels; margins are rounded down to whole pixels
my $pagenum=1;
my $currentpage=sprintf("page%02d.png",$pagenum);
my $pagewidth=2480;
my $pageheight=3508;
my $tmpfile="tmp1.png";
my $leftmargin=int($pagewidth/20);
my $topmargin=int($pageheight/20);
my $bottommargin=int($pageheight/18);
my $currentmargin=$topmargin;

# Clear out a first page left over from an earlier run
unlink($currentpage);

for my $sourcefile (@filelist) {
  # Start a fresh white page when there isn't one to add to
  if(!(-e $currentpage)) {
    system("convert","-size",$pagewidth."x".$pageheight,"xc:white",$currentpage)==0
      or die "Error: convert could not create $currentpage\n";
    $currentmargin=$topmargin;
  }
  # Negate the colours so the white text comes out black on the white page
  system("convert",$sourcefile,"-negate",$tmpfile)==0
    or die "Error: convert could not process $sourcefile\n";
  # Only the height is needed, to know where the next line goes
  my $dims=`identify -format %wx%h $tmpfile`;
  chomp($dims);
  my (undef,$imageheight)=split(/x/,$dims);
  defined($imageheight) or die "Error: could not read the size of $sourcefile\n";
  # Paste the line onto the page at the current position
  system("composite","-compose","atop","-geometry","+".$leftmargin."+".$currentmargin,
    $tmpfile,$currentpage,"_".$currentpage)==0
    or die "Error: composite failed on $sourcefile\n";
  unlink($tmpfile);
  rename("_".$currentpage,$currentpage);
  $currentmargin=$currentmargin+$imageheight+5;
  # Page full: move on, removing any stale page of the same name first
  if($currentmargin>=$pageheight-$bottommargin) {
    $pagenum++;
    $currentpage=sprintf("page%02d.png",$pagenum);
    unlink($currentpage);
  }
}
print "Done.\n";
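
If you want to check where each line will land before running ImageMagick over a few thousand images, the paging logic can be modelled on its own. This is a hedged Python sketch of the same loop, not part of the script: the function name is made up, and the margins are the script's fractions rounded down to whole pixels (175 and 194 for a 3508-pixel page).

```python
def layout(heights, page_height=3508, top=175, bottom=194, gap=5):
    """Return (page_number, y_offset) for each image height, in order.

    Mirrors the script's loop: place the image, advance by its height
    plus the gap, and start a new page once the offset reaches the
    bottom margin.
    """
    placements = []
    page, y = 1, top
    for h in heights:
        placements.append((page, y))
        y += h + gap
        if y >= page_height - bottom:
            page, y = page + 1, top
    return placements

# 43 lines of an assumed 70-pixel height: 42 fit on page 1, one spills over.
spots = layout([70] * 43)
print(spots[41], spots[42])
```

Note that the overflow check runs after an image is placed, so the last line on a page can extend a little into the bottom margin. With normal subtitle heights it still sits well inside the page.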

Project Threepio (Star Wars OOT subtitles)

Author
Time

Only ran into one page in the ESB that it couldn't read properly for whatever reason, but I was able to get all the subs in one evening. 

Author
Time

Gather 'round, kiddos! It's story-time with CatBus.

So a conversation with Sadako started me thinking about all of the interesting and unexpected things a person learns when diving into a new language (such as: "jedi" means "to eat" in Croatian, which leads to some translation issues), and I thought I'd share one of the ones that I thought was neat.

So, first off, we have this unverified Simplified Mandarin fansub.  I suspect it's actually very good because I could tell whoever did it was very thorough and loved Star Wars, but I just can't say for certain if the Chinese was very good.

One of the interesting things about these Chinese subtitles is how they incorporate foreign words, such as Jawa.  There are, as you probably know, thousands of Chinese characters, and quite a lot of them can be pronounced as some variation on "ja" or "wa".  So a translator finds two characters that make those sounds, and if they're good, they choose two characters that actually can describe the thing in question.

"Wa" is easy. The character for "baby" is pronounced "wa", Jawas are small people and kinda cute, so that's that. "Ja" on the other hand... well, this translator chose "claw". So Jawas are "claw babies"...

...which is pretty accurate actually, but it somehow makes me think about an alternate version of Star Wars made by David Cronenberg, where claw babies would fit right in.  Also, it makes some lines a little funny, like Luke asking, "Why would the Empire want to slaughter claw babies?" Gee, I dunno, Luke.  Self-defense? Because they're an abomination? To kill them before they grow and multiply? I can think of plenty of reasons.

Anyway, that's just a language tidbit I thought was interesting and a little funny. There are plenty of others, I'm sure, if I think about it.

Project Threepio (Star Wars OOT subtitles)

Author
Time

Jedi is "to eat" in Croatian, Serbian etc, but the pronunciation is nothing like you'd say it. :) Things like that happen sometimes. You know that German company "Osram", which makes lightbulbs and whatnot? In Polish this word literally means shitting on something...

Fanrestore - Fan Restoration Forum: https://fanrestore.com

Author
Time

inb4 jokes about Syfy in Poland.

"Right now the coffees are doing their final work." (Airi, Masked Rider Den-o episode 1)

Author
Time

Molly said:

inb4 jokes about Syfy in Poland.

A good point for the "Sci-Fi vs. Space Opera" argument: The prequels clearly aren't a form of Syfy, because antibiotics don't make them any more tolerable (nor does liquor, for that matter).

Project Threepio (Star Wars OOT subtitles)