Pyango View

16 December 2009 – James – Brooklyn

Two Sections

This post is split into two sections. The first is the story of trying to get Urdu text to print properly in an image and what I learned. The second is the tool I now use to print Urdu text in images with good accuracy. Perfect, according to at least one native speaker!

The Story

Omar Zaidan and Chris (CCB) both had the idea to deter cheating in human-aided translation tools by not showing actual text. Instead, print the text into an image, similar to how a CAPTCHA uses images of text as an are-you-human test. Cheaters will often cut and paste the text into a machine translation tool to do the job without actually being able to speak the language. Putting the text in an image makes that harder, because there is nothing to cut and paste.

Imagemagick and PIL

Both Imagemagick and PIL are fairly simple image tools. Neither handles text that goes from right-to-left. It’s possible to send Urdu characters into both, but the letters are shown backwards. Text is really just a sequence of bytes stored in reading order. Properties of the characters themselves determine whether it should be drawn left-to-right or right-to-left, and a tool that ignores those properties draws everything left-to-right.
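A minimal sketch of that last point, using only Python’s standard library: `unicodedata.bidirectional` reports the directional class Unicode assigns to each character. The `base_direction` helper is a hypothetical name used for illustration, and it only applies a simplified "first strong character" rule, not the full Unicode Bidirectional Algorithm.

```python
import unicodedata


def base_direction(text):
    """Guess the paragraph direction from the first strongly directional character.

    Simplified illustration only: real bidi handling has many more rules.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # right-to-left, Arabic letter
            return "rtl"
        if bidi_class == "L":  # left-to-right
            return "ltr"
    return "ltr"  # no strong character found, fall back to left-to-right


# The word "Urdu", written with escapes so the code reads the same in any editor.
urdu = "\u0627\u0631\u062F\u0648"

for ch in urdu:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))

print(base_direction(urdu))
```

The direction is a property looked up per character, not something stored alongside the string, which is why a tool that never does the lookup gets it wrong.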


Dang. Neither PIL nor Imagemagick could do this properly.

Linear B and PIL+arabic_rtlize

Chris suggested I try Linear B’s rendering, which means I go out to Java. I’m just looking for a solution so I’m quite happy to use this if it works.

While Chris was generating some text images, I was googling for methods of staying inside Python when I came across Hasan’s arabic-writer. Arabic-writer converts Arabic text, which needs to read right-to-left, so that it displays correctly in a tool that only understands left-to-right. I need this to work for Urdu, but I suspect this will get me fairly close. Interestingly, I’ve discovered that Arabic fonts also do not cover all Urdu letters. Under the hood, in terms of how fonts are stored, Arabic and Urdu are significantly different.

I wrote to Hasan to let him know I planned to convert his tool into a library for my purposes and have released that as arabic_rtlize with some of the GUI tools Hasan wrote stripped out. It’s just a library, instead of an application.

Here is what that code looks like, but first: The Urdu text looks like it’s going from right-to-left, doesn’t it? You can tell by seeing the terminator (e.g. a period) on the left instead of the right. But don’t be fooled, your computer is playing tricks on you by displaying it right-to-left even though it’s stored left-to-right in the code. Tricksy hobbitses!
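The trick is easy to check from Python. This is a small illustration, not the arabic_rtlize code: it shows that the terminator is the last character of the stored string, and what a naive reversal for a left-to-right-only renderer would look like. `naive_visual_order` is a hypothetical helper, and plain reversal is an assumption that ignores letter shaping, combining marks, and mixed-direction text.

```python
import unicodedata

# "Urdu" followed by the Urdu full stop (U+06D4), stored in reading order.
sentence = "\u0627\u0631\u062F\u0648\u06D4"

# The first stored character is the first letter read, not the terminator.
print(unicodedata.name(sentence[0]))
# The terminator is stored last, even though the screen shows it on the left.
print(unicodedata.name(sentence[-1]))


def naive_visual_order(text):
    """Reverse the code points so a left-to-right renderer draws them right-to-left.

    Illustration only: this does not join letters or handle combining marks.
    """
    return text[::-1]


visual = naive_visual_order(sentence)
print(unicodedata.name(visual[0]))
```

Reversal alone puts the characters in the right places but leaves every letter in its isolated shape, which is the part a tool like arabic-writer has to fix on top.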

We showed these to Omar and he pointed out that the PIL version appears to have excessive spacing between letters. We only had the PIL copy with us while we were talking, so I went back and compared the two in more detail.

Now, for a test on the same string. We’ll use a string from http://ur.wikipedia.org/wiki/اردو.

Firefox shows it like this

Linear B renders this

PIL combined with rtlize renders this

The first mistake I noticed was a dot under the last character, reading right-to-left, in the Linear B version. The two renderings look like this.

PIL-rtlize has no dot

Linear B has a dot

The second mistake I noticed, and this one seemed more troublesome, was a lot of spacing between letters in the sixth word from the right of the sentence in the PIL version. Linear B appeared to handle the word just fine.

PIL-rtlize has spaced characters

Linear B has correctly joined letters

I started to dig into Linear B to see what kind of options I’d have regarding fonts. That dot, mentioned as the first mistake above, could just be a font difference. I wondered if it was a mark similar to how some people cross their upper case J’s and others don’t. The idea of fonts in Arabic was not something I had considered until I saw the differences in rendering. I believe typographers would call the spacing issue in PIL-rtlize kerning, but it looks like more than spacing: the letters appear to change shape subtly to fit the adjacent letter. Consider the heavy upward swoosh in the PIL-rtlize image directly above. It’s not present in the Linear B alternative.

Something to consider: how letters are shaped and spaced depends on context, and that context is not present in the bytes representing the text. This is a strange concept for me because I prefer unambiguous maps from letter graphic to byte. Handling this context is precisely what Hasan’s tool does for Arabic. It’s just incomplete for Urdu.
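Unicode’s own character database gives a feel for this. The sketch below, which uses only the standard library, scans the Arabic Presentation Forms blocks and lists the contextual shapes recorded for a single letter. `contextual_forms` is a hypothetical helper written for this illustration, and it is not how arabic-writer or any renderer actually picks glyphs.

```python
import unicodedata

# Arabic Presentation Forms-A and Forms-B, where the contextual shapes live.
PRESENTATION_RANGES = [(0xFB50, 0xFDFF), (0xFE70, 0xFEFF)]


def contextual_forms(letter):
    """Map a form name (isolated, initial, medial, final) to its presentation character."""
    forms = {}
    for start, end in PRESENTATION_RANGES:
        for code_point in range(start, end + 1):
            # A decomposition looks like "<initial> 0628": a tag plus the base letter.
            parts = unicodedata.decomposition(chr(code_point)).split()
            if len(parts) != 2 or not parts[0].startswith("<"):
                continue  # unassigned, or a ligature built from several letters
            if int(parts[1], 16) == ord(letter):
                forms[parts[0].strip("<>")] = chr(code_point)
    return forms


beh = "\u0628"  # ARABIC LETTER BEH, one stored code point
for form, shaped in sorted(contextual_forms(beh).items()):
    print(form, f"U+{ord(shaped):04X}", unicodedata.name(shaped))
```

One stored code point, several possible shapes on the page: which one gets drawn depends on the neighbours, and that decision lives in the renderer rather than in the bytes.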

Pango View

After reaching this point, I heard from Hasan again. He suggested I try Pango View.

Here is the text as rendered by pango-view.

We have heard that the pango-view version is by far the best. Nicer on the eye and smoother.

We have a solution!

Pyango View

Pyango View is a Python library that takes the basics of how we use pango-view and turns it into a library.

I will let the documentation for Pyango View speak for itself.
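For a rough idea of what wrapping pango-view from Python can look like, here is a minimal sketch. It is not Pyango View’s actual API: the function names are hypothetical, and it assumes the `pango-view` command is installed and on the PATH. It uses the `--no-display`, `--output`, `--text` and `--font` options of pango-view.

```python
import subprocess


def build_pango_view_command(text, output_path, font=None):
    """Build the argument list for rendering text to an image file with pango-view."""
    command = [
        "pango-view",
        "--no-display",             # write the file without opening a window
        f"--output={output_path}",  # image format follows the file extension
        f"--text={text}",
    ]
    if font is not None:
        command.append(f"--font={font}")  # a Pango font description, e.g. "Sans 24"
    return command


def render_text_image(text, output_path, font=None):
    """Run pango-view; raises CalledProcessError if the command fails."""
    subprocess.run(build_pango_view_command(text, output_path, font), check=True)


# Example call (requires pango-view to be installed):
# render_text_image("\u0627\u0631\u062F\u0648\u06D4", "urdu.png", font="Sans 24")
```

Passing the arguments as a list avoids shell quoting problems, which matters when the text is arbitrary Unicode.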

Creative Commons License
