2012-01-23

Chinese OCR: translating scanned or photographed Chinese text to any language


Having lived in China for almost 3 years now, I am able to recognize a good number of characters. I can type in Chinese on the computer too, but writing is much easier than reading since it doesn't require you to actually memorize the characters: you just type in pinyin (phonetics). That is not enough to understand a full, complex text.
After months of research I've finally figured out how to recognize Chinese characters automatically from a picture, so that I can copy/paste the text into a translator such as Google Translate. The solution was right under my nose all this time: Microsoft Office 2007. I had no idea that Office 2007 came with such features. I've always known of expensive solutions such as OmniPage Pro, but I refused to purchase it considering its price and how little I would need it.


OCR, which stands for Optical Character Recognition, is the digital analysis of an image to extract the characters/text it contains, so that the text can then be manipulated on a computer.

The solution I describe is for PC/Windows users only. If you're interested in doing the same in real time, and in a much simpler way, directly from your iPhone, I recommend the excellent Pleco. I've tried out the demo version and am seriously considering purchasing the full version.

TLDR: there are basically four steps involved in the process:

  1. Take a photo with your digital camera, or scan your document with your scanner
  2. (Optional) Convert your photo to a TIF image
  3. Open the TIF image with Microsoft Office Document Imaging, run the OCR
  4. Export the text to Microsoft Word then translate it

This tutorial requires the following software to be installed on your computer:

  1. Microsoft Office 2007 (though this supposedly works with Microsoft Office 2003), installed with "Document Imaging" and "Picture Manager". Both are components that you can select during the setup process. If you don't have those two installed on your computer, modify your Office setup to include them.
  2. Chinese language support for Microsoft Office, which isn't exactly something you come across easily. I'm lucky enough to work in China, where we have licenses for the Chinese version of Microsoft Office, so I've had no trouble. As an alternative you can get the Microsoft Office 2007 Multi-Language Pack and install Chinese support, as well as a bunch of other languages if you're interested.

Step 1: Taking a picture or scanning a document
I don't need to remind you how to take pictures with a digital camera, nor how to save pictures from a website with your favorite web browser. If you are going to use a scanner though, which is probably the option that will get you the best results, you can skip the next step as long as your scanner supports saving as TIF/TIFF documents.

To illustrate this tutorial I've chosen to work with a photo taken with my iPhone 4S. It shows a page from a random advertisement booklet found at a friend's place. I tried to get a clear shot of the text to make sure the OCR works as accurately as possible.


Step 2: Converting to TIF/TIFF image
Unfortunately, and I must admit I find this quite odd myself, the tool we're going to use for performing the OCR does not support anything other than the TIF format. So if your picture was saved in any other format (typically JPG, like mine), you'll have to convert it. There are plenty of ways to do so.

Since you have Microsoft Office installed on your computer, you should have everything it takes. Right-click your JPG image and "Open with..." - "Microsoft Office Picture Manager". Go to "File" - "Export..." and select the TIF format.
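
If you have a whole batch of photos to convert, the same step can also be scripted. Here is a minimal Python sketch, assuming the third-party Pillow library is installed (pip install Pillow). Pillow is not part of the Office workflow described in this tutorial, and the helper names tif_path_for and convert_to_tif are made up for this example.

```python
from pathlib import Path


def tif_path_for(image_path):
    """Return the path of the converted file: same name, .tif extension."""
    return Path(image_path).with_suffix(".tif")


def convert_to_tif(image_path):
    """Convert one image (JPG, PNG, ...) to TIFF and return the new path."""
    # Assumption: Pillow is installed. It is imported here so that the
    # path helper above stays usable without it.
    from PIL import Image

    target = tif_path_for(image_path)
    with Image.open(image_path) as img:
        # Convert to plain RGB so the saved TIFF carries no palette or
        # alpha channel (an assumption about what the OCR tool prefers).
        img.convert("RGB").save(target, format="TIFF")
    return target
```

Calling convert_to_tif("photo.jpg") should write photo.tif next to the original picture, ready to be opened in the next step.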

Step 3: Performing the OCR
The actual OCR (Optical Character Recognition) is performed by Microsoft Office Document Imaging. Open the tool, which should be located in your Start Menu under Microsoft Office / Microsoft Office Tools / Microsoft Office Document Imaging.

Before performing the OCR you need to specify the document language. To do so, open the "Tools" menu, go to "Options" - "OCR" and select "Chinese" in the drop-down list. The next steps are simple...

Open the file... Click on the OCR button... Click on the "Send Text to Word" button... press OK and you're done!
The text should be more or less faithfully transcribed, depending on the quality of the original picture. Now on to translating it into something actually readable for the average westerner :-)
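
The clicks above can, in principle, be replaced by a script, since Document Imaging exposes a COM interface (MODI.Document). What follows is a rough Python sketch under several assumptions: Windows, Office installed with Document Imaging and Chinese support, and the third-party pywin32 package. The function name ocr_tif and the MODI_LANG_IDS table are made up for this example; treat it as a starting point rather than a tested recipe.

```python
import os

# Language IDs from MODI's MiLANGUAGES enumeration (same values as the
# Windows locale IDs for zh-CN and zh-TW).
MODI_LANG_IDS = {
    "chinese_simplified": 2052,
    "chinese_traditional": 1028,
}


def ocr_tif(tif_path, language="chinese_simplified"):
    """Run the Document Imaging OCR on a TIF file and return its text."""
    lang_id = MODI_LANG_IDS[language]  # KeyError for an unknown language

    # Assumption: pywin32 is installed. It is imported here because it
    # only exists on Windows.
    import win32com.client

    doc = win32com.client.Dispatch("MODI.Document")
    doc.Create(os.path.abspath(tif_path))
    try:
        # Arguments: language, auto-rotate the image, straighten the image.
        doc.OCR(lang_id, True, True)
        pages = [
            doc.Images.Item(i).Layout.Text
            for i in range(doc.Images.Count)
        ]
    finally:
        doc.Close(False)  # close without saving changes to the file
    return "\n".join(pages)
```

The returned string is the same text that the "Send Text to Word" button would hand over, one page after another, so it can be pasted into any translator.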

Step 4: Translation to English or other languages
There are tons of translators out there but I'm going to stick to Microsoft Office since that is what we've been using from the start. Yes, you can translate Chinese directly from within Word 2007 if you follow the simple instructions described below:
  1. Select the text you want translated
  2. Right-click the selected text and in the menu, go to "Translate" - "Translate..." 
  3. Select the input and output languages and click the little green arrow
  4. You'll be taken to Microsoft's online translation service, which provides a surprisingly accurate translation of my original text.

Before: [image: the Chinese text as recognized by the OCR]

After: [image: the English translation]

Note: the original document IS about shady management techniques. That's all I had.

Voilà, you've successfully translated a document written in another language, starting from a simple photo and using nothing but Microsoft Office. Ah, isn't technology wonderful?
