2012-01-23

Chinese OCR: translating scanned or photographed Chinese text to any language


Having lived in China for almost 3 years now, I can recognize a good number of characters. I can type in Chinese on the computer too, but writing is much easier than reading: it doesn't require you to actually memorize the characters, since you just type in pinyin (phonetics). That is not enough to understand a full, complex text.
After months of research I've finally figured out how to recognize Chinese characters automatically from a picture, so that I can copy/paste the text into a translator such as Google Translate. The solution was right under my nose all this time: Microsoft Office 2007. I had no idea that Office 2007 came with such features. I've always known of expensive solutions such as OmniPage Pro, but I refused to buy the app considering its price and how little I would need it.


OCR, which stands for Optical Character Recognition, is the digital analysis of an image to extract the characters/text it contains, so that the text can be manipulated on a computer.

The solution I describe is for PC/Windows users only. If you're interested in doing the same in real time, and much more simply, directly from your iPhone, I recommend the excellent Pleco. I've tried out the demo version and am seriously considering purchasing the full version.

TLDR: there are basically four steps involved in the process:

  1. Take a photo with your digital camera, or scan your document with your scanner
  2. (Optional) Convert your photo to a TIF image
  3. Open the TIF image with Microsoft Office Document Imaging, run the OCR
  4. Export the text to Microsoft Word then translate it

This tutorial requires the following software to be installed on your computer:

  1. Microsoft Office 2007 (though this supposedly works with Microsoft Office 2003), installed with "Document Imaging" and "Picture Manager". Both are components that you can select during the setup process. If you don't have those two installed on your computer, modify your Office setup to include them.
  2. Chinese language support for Microsoft Office, which isn't exactly something you come across easily. I'm lucky enough to work in China, where we have licenses for the Chinese version of Microsoft Office, so I've had no trouble. As an alternative, you can get the Microsoft Office 2007 Multi-Language Pack and install Chinese support, as well as a bunch of other languages if you're interested.

Step 1: Taking a picture or scanning a document
I don't need to remind you how to take pictures with a digital camera, nor how to save pictures from a website with your favorite web browser. If you are going to use a scanner though, which is probably the solution that will get you the best results, you can probably skip the next step if your scanner supports saving as TIF/TIFF documents.
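
If you're not sure whether a file really is a TIF, you don't have to trust its extension. Here is a minimal Python sketch (my own illustration, not part of the Office workflow) that compares the first four bytes of a file against the two classic TIFF signatures. The name `looks_like_tiff` is made up for this example, and the check deliberately ignores the newer BigTIFF variant.

```python
# The two classic TIFF signatures: little-endian ("II") and big-endian ("MM"),
# each followed by the number 42 stored in that byte order.
TIFF_SIGNATURES = (b"II*\x00", b"MM\x00*")


def looks_like_tiff(path):
    """Return True if the file at `path` starts with a classic TIFF signature."""
    with open(path, "rb") as fh:
        return fh.read(4) in TIFF_SIGNATURES
```

If this returns True for your scan, you can most likely go straight to Step 3.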

To illustrate this tutorial I've chosen to work with a photo taken with my iPhone 4S. It shows a page from a random advertisement booklet I found at a friend's place. I tried to get a clear shot of the text to make sure OCR works as accurately as possible.


Step 2: Converting to TIF/TIFF image
Unfortunately, and I must admit I find this quite odd myself, the tool we're going to use for performing the OCR does not support anything other than the TIF format. So if your picture was saved under any other format (JPG typically, like mine) you'll have to convert it. There are plenty of ways to do so.

Since you have Microsoft Office installed on your computer, you should have everything it takes. Right-click your JPG image and "Open with..." - "Microsoft Office Picture Manager". Go to "File" - "Export..." and select the TIF format.
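
If you have a whole folder of photos, clicking through Picture Manager gets old fast. As an alternative, here is a rough sketch that does the conversion in Python with the third-party Pillow library (`pip install Pillow`). Pillow is an assumption on my part and not something Office requires. The helper names `tiff_path_for` and `convert_to_tiff` are hypothetical, and I can't promise that Document Imaging will accept every TIFF variant Pillow can write.

```python
from pathlib import Path


def tiff_path_for(source):
    """Return the path of `source` with its extension replaced by .tif."""
    return Path(source).with_suffix(".tif")


def convert_to_tiff(source):
    """Save a copy of the image at `source` as a TIFF file next to it."""
    # Pillow is third-party; it is imported here so the path helper above
    # stays usable even when Pillow is not installed.
    from PIL import Image

    target = tiff_path_for(source)
    with Image.open(source) as img:
        # Convert to plain RGB first so the TIFF is written in a common mode.
        img.convert("RGB").save(target, format="TIFF")
    return target
```

Calling `convert_to_tiff("photo.jpg")` should leave a `photo.tif` in the same folder, ready to be opened in Step 3.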

Step 3: Performing the OCR
The actual OCR (Optical Character Recognition) is performed by Microsoft Office Document Imaging. Open the tool, which should be located in your Start Menu under Microsoft Office / Microsoft Office Tools / Microsoft Office Document Imaging.

Before performing the OCR you need to specify the document language. To do so, open the "Tools" menu, go to "Options" - "OCR" and select "Chinese" in the drop-down list. The next steps are simple...

Open the file... Click on the OCR button... Click on the "Send Text to Word" button... press OK and you're done!
The text should be more or less faithfully transcribed depending on the quality of the original picture. Now onto translating it to something actually readable for the average westerner :-)
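
The same OCR engine can also be driven from a script, since Document Imaging exposes a COM object named `MODI.Document`. The following is only a sketch under several assumptions. It needs Windows, Office with Document Imaging and the Chinese OCR language installed, and the third-party pywin32 package. The function name `ocr_tiff` and its `dispatch` parameter are mine. The language ID 2052 is the value I understand the MODI type library uses for Simplified Chinese (1028 for Traditional). I'd treat it as a starting point rather than a tested recipe.

```python
# Language IDs as assumed from the MODI type library (MiLANGUAGES enum).
MODI_CHINESE_SIMPLIFIED = 2052
MODI_CHINESE_TRADITIONAL = 1028


def ocr_tiff(path, lang_id=MODI_CHINESE_SIMPLIFIED, dispatch=None):
    """Run Document Imaging's OCR on a TIF file and return the recognized text.

    `dispatch` is a hypothetical hook: any callable that takes a COM ProgID
    and returns the object. By default pywin32's Dispatch is used.
    """
    if dispatch is None:
        # pywin32 (third-party, Windows only); imported lazily on purpose.
        from win32com.client import Dispatch as dispatch

    doc = dispatch("MODI.Document")
    doc.Create(str(path))  # load the TIF into a MODI document
    try:
        # Arguments: language, auto-orient the image, auto-straighten the image.
        doc.OCR(lang_id, True, True)
        images = doc.Images
        # One entry per page of the TIF; Layout.Text holds the recognized text.
        pages = [images.Item(i).Layout.Text for i in range(images.Count)]
    finally:
        doc.Close(False)  # close without saving changes
    return "\n".join(pages)
```

On a machine that meets the assumptions above, `ocr_tiff(r"C:\scans\photo.tif")` should return the recognized Chinese text as a string, one page per line break.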

Step 4: Translation to English or other languages
There are tons of translators out there but I'm going to stick to Microsoft Office since that is what we've been using from the start. Yes, you can translate Chinese directly from within Word 2007 if you follow the simple instructions described below:
  1. Select the text you want translated
  2. Right-click the selected text and in the menu, go to "Translate" - "Translate..." 
  3. Select the input and output languages and click the little green arrow
  4. You'll be taken to Microsoft's online translation service, which provides a surprisingly accurate translation of my original text.
Before:

After:
Note: the original document IS about shady management techniques. That's all I had.
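
If you'd rather use Google Translate, which I mentioned at the top, the copy/paste step can be shortened by building a link that opens the translator with the text already filled in. This is a small sketch using only Python's standard library. The function name `google_translate_url` is hypothetical, and the query parameters (`sl`, `tl`, `text`, `op`) are simply what the Google Translate web page appears to accept. They are not a documented API and may change.

```python
from urllib.parse import urlencode


def google_translate_url(text, source="zh-CN", target="en"):
    """Build a Google Translate link with `text` pre-filled.

    The parameter names below are an assumption based on the public web page.
    """
    query = urlencode({"sl": source, "tl": target, "text": text, "op": "translate"})
    return "https://translate.google.com/?" + query
```

Passing the result to `webbrowser.open(...)` should open the translation in your default browser. Very long texts may not fit in a URL, so for a full page the plain copy/paste is probably still the safer route.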

Voilà, you've successfully translated a document written in another language, using nothing more than a simple photo and Microsoft Office. Ah, isn't technology wonderful?

13 comments:

Ruby said...

Wow, translating documents just got more exciting! I wish we can also use OCR in other languages like Arabic, Russian, Japanese, or even Korean in order to understand their cultures and their words.


Ruby Badcoe

Fajar said...

wow, wonderfull. finally, i can find this way. will be very useful for me, when in large numbers.
deserve it, you can write a very good book.

Keep Shared :)

document finder said...

Nice,it's a useful

Anonymous said...

Ok, I installed all the necessary things but when I go to options there is no CHINESE language, how can I install it?

j0j0soft@yahoo.com said...

Forgot to mention, I'm running win. 7 ultimate sp1 (64bit), I already had try installing few multilanguage pack for office 2007 (sp1, sp2, sp3 for XP cos I don't see for ultimate) and still can't find Chinese at options, so, I guess I fail somewhere. Really would like to install this language for my kid, he's playing a Chinese game and all day he bother me to help him translate this or that (I have a friend who is Chinese, but I can't bother him always for a game thing).
Would appreciate if you could lend me a hand with this issue.
PS: I was the Anonymous before this post, now as my name I write my e-mail address.

Alec said...

For quick OCR and translation, I do it all online.

1. I first use this free service at
http://www.sciweavers.org/free-online-ocr

2. Then I use google translate
http://translate.google.com/

Clem said...

Wow, nice! Thanks Alec, this looks like a neat solution. And I bet it works better than the one I am talking about in this post... I'll give this a try and update my post when I can.

wennie09 said...

I'm so lucky to find this blog. Thank you, you saved me today.

id card reader said...

Not only is this a well-written post, but I love the topic. Really trustworthy blog. Thanks for sharing !

Dora Trevino said...

I tried it but it didn't work for me. When I get the .tif file, I get an error message.

Can someone help me....please?

Thanks!!

Dora

TSC Translation said...

Thanks for shearing this informative blog about Chinese English Translation

DAG said...

Well, had to straighten the scan, save as JPEG file and it worked for me. Thank you

Anonymous said...

Hey, thanks a lot! you saved my life :)!
