This weekend I taught myself Python and started writing some more code for the wikipydia.py module that James started. I’m extremely impressed at how easy Python and JSON are to use, and how nicely written James’s code is. I picked it up with no problems at all.
As an introductory exercise, I decided to try to solve the problem of choosing which pages we should translate for WikiTrans. Because Wikipedia has so many articles, we won’t be able to translate all of them into all other languages: using Mechanical Turk would be too expensive, and even machine translation would probably require too many CPU hours to be feasible. So my goal was to select a subset of articles to translate first.
We could draw our translation candidates from Wikipedia’s Featured Articles, which meet the criteria of being well-written, comprehensive, and well-researched. These articles are tagged with Category:Featured_articles. I wrote a method for retrieving members of a given category:
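The sketch below is one way such a method could look; it is not the wikipydia code itself. It assumes the MediaWiki API's `list=categorymembers` query on the English Wikipedia and follows the `continue` block the API returns when there are more results. The names `category_members` and `fetch_json` and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(url):
    """Download a URL and decode its JSON body."""
    # Hypothetical User-Agent string; Wikipedia asks clients to identify themselves.
    req = urllib.request.Request(url, headers={"User-Agent": "wikitrans-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def category_members(category, fetch=fetch_json, api_url=API_URL):
    """Return the titles of all pages in `category`, following continuation."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",
        "format": "json",
    }
    titles = []
    while True:
        data = fetch(api_url + "?" + urllib.parse.urlencode(params))
        titles.extend(m["title"] for m in data["query"]["categorymembers"])
        cont = data.get("continue")
        if not cont:
            return titles
        # Carry the continuation tokens (e.g. cmcontinue) into the next request.
        params.update(cont)
```

Calling `category_members("Category:Featured_articles")` would then walk the whole category in batches of up to 500 titles. The `fetch` parameter is only there so the function can be exercised without network access.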
Here’s an example of its output:
Even 2739 articles may be too many to start out with. Ideally, I’d like to be able to sort them by popularity, using page view statistics to quantify it. Wikipedia user Henrik maintains stats for daily Wikipedia page views at stats.grok.se (there’s also a 3-month archive of raw hourly traffic data at dammit.lt/wikistats). Henrik provides a JSON interface to stats.grok.se, so I wrote a new wikipydia method to query for page views.
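As a rough sketch of what such a query might look like (again, not the actual wikipydia method), the code below assumes that the JSON interface is reachable at `http://stats.grok.se/json/<lang>/<YYYYMM>/<title>` and that the response carries a `daily_views` mapping from date to count. Both the URL layout and the response key are assumptions, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def fetch_json(url):
    """Download a URL and decode its JSON body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def page_views(title, month, lang="en", fetch=fetch_json):
    """Return a {date: views} dict for one article and one month (YYYYMM)."""
    # Assumed URL layout: /json/<lang>/<YYYYMM>/<title with underscores>
    safe_title = urllib.parse.quote(title.replace(" ", "_"))
    url = "http://stats.grok.se/json/%s/%s/%s" % (lang, month, safe_title)
    data = fetch(url)
    # Assumed response key; fall back to an empty dict if it is missing.
    return data.get("daily_views", {})


def total_views(title, month, **kwargs):
    """Sum the daily counts for one article over one month."""
    return sum(page_views(title, month, **kwargs).values())
```

With these helpers, `total_views("Barack Obama", "201001")` would issue one request and add up the daily counts for that month. Since each article needs its own request, the total running time grows with the number of articles.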
Here’s an example of what that returns:
It takes about an hour to gather the stats for all 2738 featured articles in English. After that we can sort them based on their total views:
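The sorting step can be sketched as follows, assuming the totals have already been collected into a dict that maps article title to total views. The function name `rank_by_views` and the sample titles and counts are placeholders, not real statistics.

```python
def rank_by_views(view_totals):
    """Return (title, views) pairs sorted from most to least viewed."""
    return sorted(view_totals.items(), key=lambda item: item[1], reverse=True)


# Placeholder data for illustration only.
sample = {"Article A": 10, "Article B": 30, "Article C": 20}
for title, views in rank_by_views(sample):
    print("%8d  %s" % (views, title))
```

Because Python's sort is stable, articles with equal totals keep the order in which they were inserted into the dict.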