One the the initial languages that I’d like to target for WikiTrans is Urdu. Urdu is one of the two official languages of Pakistan (along with English), it is written in Arabic script, and its word order is subject-object-verb. I worked on building an Urdu-English machine translation system this summer, and I also experimented with using Mechanical Turk to gather training data. I had a lot of success getting high quality translations from Turkers, and I’d like to hire them to translate Urdu Wikipedia articles into English to create yet more training data and to populate WikiTrans.
I wanted to get a list of Urdu featured articles, similar to the English ones that I got last weekend. I found the pages that contain the Urdu featured articles, and Urdu feature article candidates, but my first hurtle was that I had trouble copying and pasting the Urdu text into my python session. My work around was to save the two Urdu category labels in a utf8 text file and then read it in, like this:
The file contains these contents:
When the contents above are rendered in my web browser there are some boxes that incorrectly apprear in place of Urdu characters – I’m not sure what’s up with that. They display properly if you click on “view raw” and set the page encoding to UTF8.
Once I’ve got the access to the category labels, I can grab their members:
We can then sort the articles based on their page view stats, just like we did for the English featured articles.
Note that Urdu page views are way smaller than for English. The most popular Urdu featured article has only been viewed 1,200 times v. 422,000 times for the most popular English featured article in a similar time span. The Urdu wikipedia has far fewer articles than English, with 12,500 articles and a grand total of 270,000 edits versus 3,100,000 English articles which have been edited more than 360,000,000 times.
My hope is that WikiTrans can be used to translate some of the good English content into Urdu, and that we can nominate well-translated articles from WikiTrans for inclusion the Urdu Wikipedia.