2012-01-23

Chinese OCR: translating scanned or photographed Chinese text to any language


Having lived in China for almost 3 years now, I am able to recognize a good bunch of characters. I can type in Chinese on the computer too, but writing is much easier than reading since it doesn't require you to actually memorize the characters: you just type in pinyin (phonetics). That is not enough to understand a full, complex text.
After months of research I've finally figured out how to recognize Chinese characters automatically from a picture, in order to copy/paste the text into a translator such as Google Translate or others. The solution was right under my eyes all this time: Microsoft Office 2007. I had no idea that Office 2007 came with such features. I've always known of expensive solutions such as OmniPage Pro, but I refused to resort to purchasing the app considering its price and how little I would need it.


OCR, which stands for Optical Character Recognition, is the process of digitally analyzing an image to extract the characters/text it contains, so that the text can be manipulated on a computer.

The solution I describe is for PC/Windows users only. If you're interested in doing the same in real time, and in a much simpler way, directly from your iPhone, I recommend the excellent Pleco. I've tried out the demo version and am seriously considering purchasing the full version.

TLDR: there are basically four steps involved in the process:

  1. Take a photo with your digital camera, or scan your document with your scanner
  2. (Optional) Convert your photo to a TIF image
  3. Open the TIF image with Microsoft Office Document Imaging, run the OCR
  4. Export the text to Microsoft Word then translate it

This tutorial requires the following software to be installed on your computer:

  1. Microsoft Office 2007 (though this supposedly works with Microsoft Office 2003), installed with "Document Imaging" and "Picture Manager". Both are components that you can select during the setup process. If you don't have those two installed on your computer, modify your Office setup to include them.
  2. Chinese language support for Microsoft Office, which isn't exactly something you come across easily. I'm lucky enough to work in China and we have licenses for the Chinese version of Microsoft Office, so I've had no trouble. As an alternative you can get the Microsoft Office 2007 Multi-Language pack and install Chinese support, as well as a bunch of other languages if you're interested.

Step 1: Taking a picture or scanning a document
I don't need to remind you how to take pictures with a digital camera, nor how to save pictures from a website with your favorite web browser. If you are going to use a scanner though, and that is probably the solution that will get you the best results, you can probably skip the next step if your scanner supports saving as TIF/TIFF documents.

To illustrate this tutorial I've chosen to work with a photo taken with my iPhone 4S. It shows a page from a random advertisement booklet found at a friend's place. I tried to get a clear shot of the text to make sure OCR works as accurately as possible.


Step 2: Converting to TIF/TIFF image
Unfortunately, and I must admit I find this quite odd myself, the tool we're going to use for performing the OCR does not support anything other than the TIF format. So if your picture was saved under any other format (JPG typically, like mine) you'll have to convert it. There are plenty of ways to do so.

Since you have Microsoft Office installed on your computer, you should have everything it takes. Right-click your JPG image and "Open with..." - "Microsoft Office Picture Manager". Go to "File" - "Export..." and select the TIF format.
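If Document Imaging refuses to open the exported file, one quick thing to rule out is a file that has the .tif extension but isn't actually TIFF data inside. Here is a minimal Python sketch of that check, assuming a classic TIFF file (not BigTIFF); the function name `looks_like_tiff` is my own and the script has nothing to do with Office itself. It only reads the first four bytes and compares them to the two standard TIFF signatures.

```python
from pathlib import Path

# Classic TIFF files start with a byte-order mark followed by the number 42:
# b"II*\x00" for little-endian files, b"MM\x00*" for big-endian ones.
TIFF_SIGNATURES = (b"II*\x00", b"MM\x00*")


def looks_like_tiff(path):
    """Return True if the file at `path` starts with a classic TIFF signature.

    `looks_like_tiff` is a hypothetical helper name, not part of any library.
    """
    with Path(path).open("rb") as handle:
        return handle.read(4) in TIFF_SIGNATURES
```

For example, `looks_like_tiff("scan.tif")` (with `scan.tif` standing in for your own file) returns False for a JPG that was merely renamed, which tells you the conversion step needs to be redone.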

Step 3: Performing the OCR
The actual OCR (Optical Character Recognition) is performed by Microsoft Office Document Imaging. Open the tool, which should be located in your Start Menu under Microsoft Office / Microsoft Office Tools / Microsoft Office Document Imaging.

Before performing the OCR you need to specify the document language. To do so, open the "Tools" menu, go to "Options" - "OCR" and select "Chinese" in the drop-down list. The next steps are simple...

Open the file... Click on the OCR button... Click on the "Send Text to Word" button... press OK and you're done!
The text should be more or less faithfully transcribed depending on the quality of the original picture. Now onto translating it to something actually legible to the average westerner :-)
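Before translating, it can be worth a quick sanity check of what the OCR produced. The sketch below is only an illustration and not something Office does for you: it computes what share of the non-whitespace characters fall in the basic CJK Unified Ideographs range (U+4E00 to U+9FFF). The function name `cjk_ratio` is my own invention. Chinese punctuation and any digits or Latin letters in the source count against the ratio, so real text will score below 1.0; a very low score just hints that the OCR language was probably not set to Chinese or that the picture was too poor.

```python
def cjk_ratio(text):
    """Share of non-whitespace characters that are basic CJK ideographs.

    `cjk_ratio` is a hypothetical helper; the range U+4E00..U+9FFF covers
    the basic CJK Unified Ideographs block only, not the rarer extensions.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)
```

You would paste the text exported to Word into a string and call `cjk_ratio(that_string)`. What counts as "too low" is a judgment call and depends on your document.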

Step 4: Translation to English or other languages
There are tons of translators out there but I'm going to stick to Microsoft Office since that is what we've been using from the start. Yes, you can translate Chinese directly from within Word 2007 if you follow the simple instructions described below:
  1. Select the text you want translated
  2. Right-click the selected text and in the menu, go to "Translate" - "Translate..." 
  3. Select the input and output languages and click the little green arrow
  4. You'll be taken to Microsoft's online translation service, which provides a surprisingly accurate translation of my original text.
Before:

After:
Note: the original document IS about shady management techniques. That's all I had.
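If you would rather send the text to Google Translate than use Word, the copy/paste can be shortened by building a link with the text already filled in. This is a small sketch under assumptions: the query parameters `sl`, `tl`, `text` and `op` are what I see in the site's address bar, not a documented API, so they may change. The function name `google_translate_url` and the default language codes are my own choices.

```python
from urllib.parse import urlencode


def google_translate_url(text, source="zh-CN", target="en"):
    """Build a translate.google.com link with `text` pre-filled.

    Hypothetical helper: the parameter names (sl, tl, text, op) are an
    assumption based on the URLs the website itself produces.
    """
    query = urlencode({"sl": source, "tl": target, "text": text, "op": "translate"})
    return "https://translate.google.com/?" + query
```

Calling `google_translate_url(ocr_text)` gives you a link you can open in your browser. Very long documents may not fit in a URL, in which case plain copy/paste is still the way to go.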

Voilà, you've successfully translated a document written in another language, based on a simple photo and Microsoft Office. Ah, isn't technology wonderful?

21 comments:

Ruby said...

Wow, translating documents just got more exciting! I wish we can also use OCR in other languages like Arabic, Russian, Japanese, or even Korean in order to understand their cultures and their words.


Ruby Badcoe

Fajar said...

wow, wonderfull. finally, i can find this way. will be very useful for me, when in large numbers.
deserve it, you can write a very good book.

Keep Shared :)

document finder said...

Nice,it's a useful

Anonymous said...

Ok, I installed all the necessary things but when I go to options there is no CHINESE language, how can I install it?

j0j0soft@yahoo.com said...

Forgot to mention, I'm running win. 7 ultimate sp1 (64bit), I already had try installing few multilanguage pack for office 2007 (sp1, sp2, sp3 for XP cos I don't see for ultimate) and still can't find Chinese at options, so, I guess I fail somewhere. Really would like to install this language for my kid, he's playing a Chinese game and all day he bother me to help him translate this or that (I have a friend who is Chinese, but I can't bother him always for a game thing).
Would appreciate if you could lend me a hand with this issue.
PS: I was the Anonymous before this post, now as my name I write my e-mail address.

Alec said...

For quick OCR and translation, I do it all online.

1. I first use this free service at
http://www.sciweavers.org/free-online-ocr

2. Then I use google translate
http://translate.google.com/

Clem said...

Wow, nice! Thanks Alec, this looks like a neat solution. And I bet it works better than the one I am talking about in this post... I'll give this a try and update my post when I can.

wennie09 said...

I'm so lucky to find this blog. Thank you, you saved me today.

id card reader said...

Not only is this a well-written post, but I love the topic. Really trustworthy blog. Thanks for sharing !

Dora Trevino said...

I tried it but it didn't work for me. When I get the .tif file, I get an error message.

Can someone help me....please?

Thanks!!

Dora

TSC Translation said...

Thanks for shearing this informative blog about Chinese English Translation

DAG said...

Well, had to straighten the scan, save as JPEG file and it worked for me. Thank you

Anonymous said...

Hey, thanks a lot! you saved my life :)!

kristine Peterson said...

Hello all,
Very impressive and useful post. this looks like a neat solution. And I bet it works better than the one I am talking about in this post... I'll give this a try and update my post when I can.professional language translation service

Anonymous said...

I just tried translating a instrument user guide and I got this (promise):

Put on a little fairy playing outside towards Lu Yi was the thoroughfare.
Instead, take the dumb n Li, son Furthermore, "Qi n female resistance to shut the gate to the value of Qi, 0 ginger, on
The next street is slow nose open into the bottom of the lazy coffee cafe tenant key key M b Tun boron Cave is set in Philippi. Position, insert the instrument .n "... HH said _ li n zob, _, low pool felt n
Philippians _F count as lazy move Gu Xuan Qi Shuai stable measuring plate. Wu is now off the pole by guanidine or too j, electrical ground of puppets FS to stir the sweet coconut foam ... guanidine aiming brush to use the new tunnel lost power cricket aiming Relief pressure transducer Chuxiu z group ... heavy dish,, u Officials "wash Rae .1 Gaidianlila -z

Hold _ raw ore homemade puppet Khan Mian eyebrow when changing fat sadistic type auxiliary, electric rates under _n `Ye Yuan Qi / I Wang Cao by adding T-quinone straight 0
Phoenix will take the light side with a line delicately put to bite Yee was female z
To make.. "-z, Rocky E have the coffee earth lie light. Ferran down on the bud., ~
Electricity. Said White in a straight z ~ zTz gung workers Temple 0.
,, Only. Puppet propagate the main screen. _ Will claw n, more .zr. _ No. Stay Cheung Miriam N ~ belong female insect mesh twenty right Panoptimization mesh Yao ~,
. O ,, x guard also ,,, z

Child. "With stay. ~ M, fine seized v Huan,

Under the unitary puppet Shen _ should be: "Fu can z ~ _z -

Ah well, good try!

Anonymous said...

More on the Chinese Instrument User Guide:

I removed the occasional english characters, numbers, punctuation, from the original scanned image in Photoshop and ran it through again.

"...
The open body significantly lower pole pitch little justice on the child and then press it to the warehouse was used
Axillary slightly on playing electric department said Soy value sets are not low pool pool house
When the board so that the bottom of the eye mail to help key number key than + or large electrical electrical power
Face off state r old anti-fear is a positive end by Wu Yue _ is now out in a very poor pressure after paste swapped out snail
Under the cover does not open the instrument steady thunder body said _ in a mountain too old tactic to use power off mode electric pole Officials pull Mu
Care + "fairy door moving trough put new measure _ not the number of officials like ... screen is not worth the battery always wash _ eagle interest
Surety _ Qi 1 to stir the foam _ not _ magpie friend who was under heavy Chia law change significantly _ old puppet master does not wash electric and battery _ Wang

Under the operation, such as linear light by Nan Lu clothing to bite when the bud was down by Paul said gamma _,, _ is lying on the right was a small set of more twenty Yao _ should note the old power ~,
Take a light source with a light half Angeles branch has to be the value of the measured light curtain _ in addition to holding such kind of instrument music old guard pool, fine prosecution may change
Make electricity in the old Stapleton E is not guaranteed to be _ SOLUTIONS screen character by using electricity _ does not guarantee that the wine should be more
..."

Ahhh! Well that's so much clearer now! Did you guess what the instrument is already?

Saba Naaz said...

very nice blog

ali raza said...

very impressive blog i like your article information thanks
now translate all your documents in multi language more than 30 more info visit us
Documents Translation Services

best online translation said...

I would say that the longer the text needed in the document, the more likely the translated document will be incorrect. Even best online translation software still requires a human translator
