I have been using NLTK for sentence splitting while working on Wikitrans. NLTK is, in their words, "Open source Python modules, linguistic data and documentation for research and development in natural language processing, supporting dozens of NLP tasks, with distributions for Windows, Mac OSX and Linux." The SVN repository also includes a copy of their book, Natural Language Processing with Python, written by Steven Bird, Ewan Klein and Edward Loper.
I came across NLTK initially while looking for sentence splitters. NLTK has the Punkt Tokenizer for splitting, which works well, but it does not cover Urdu. D’oh! So I emailed Chris and he sent me a copy of Danish Munir’s urdu-segmenter.pl. Perl! Gasp!
Wikitrans is built using Python and I don’t want to run code outside Python just to split some sentences, so I sat down and converted Danish’s code. I had some corpora that had already been split and stored in XML, but I didn’t have the original input, and I didn’t have any corpora that Danish’s code itself had split. All I had were some already-split sentences from NIST.
This exercise is useful because it also forces me to think about an API to which the splitting mechanism of WikiTrans will conform: NLTK for some languages, urdu_segmenter for Urdu, and we’ll deal with future limitations as we come to them.
I pointed my browser at Wikipedia’s Urdu edition and quickly found the Urdu version of this page. It didn’t take long for my computer to get confused by the right-to-left nature of Urdu. For example, if I highlighted the last (right-most) character in a sentence and copied it, my paste would sometimes give me the left-most character. Because I don’t read Urdu, it was difficult for me to understand exactly what was causing this, and I didn’t want to waste time debugging, so I needed a solution quickly.
To grab the text with no concern for how my browser or editors were going to handle the left-to-rightness and right-to-leftness, I hopped into Python and wrote some code to fetch the page and strip out the HTML. I’m a big Python fan so I’d like to take the time to explain how I did what I did.
First, Wikipedia does some things based on the User-Agent of whatever is retrieving the data. Those with a sophisticated mobile device will see this functionality when they’re redirected to a mobile version of Wikipedia. However, if I use Python’s urllib to retrieve data, Wikipedia gets confused (it returns a multilingual page explaining that something is broken), so it’s important to set the User-Agent to something Wikipedia feels comfortable with. I told Wikipedia my program is Firefox and now the data retrieval works just fine.
From there I am able to use Python’s BeautifulSoup package to get at the text. From their web page: “Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping.” They don’t lie; it’s easy to scrape text using BeautifulSoup, as you’ll see in the code below. The rest of this program is some basic file I/O.
The first thing to do is subclass urllib’s FancyURLopener. This lets us set its version field to the User-Agent string Firefox uses. Then we tell urllib to use our opener, which keeps urllib’s behavior as normal except for the identity it presents when opening URLs. urllib makes fetching data from URLs trivial, so opening the URL is one line and reading the data is another.
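Roughly, the fetching part looks like this (Python 2; the User-Agent string and the URL here are stand-ins rather than the exact values I used):

    import urllib

    class FirefoxOpener(urllib.FancyURLopener):
        # FancyURLopener sends self.version as the User-Agent header, so
        # overriding it makes Wikipedia treat us like a normal browser.
        version = 'Mozilla/5.0 (X11; U; Linux i686; rv:1.9.1) Gecko/20090704 Firefox/3.5'

    # Tell urllib to use our opener for subsequent urlopen() calls.
    urllib._urlopener = FirefoxOpener()

    url = 'http://ur.wikipedia.org/wiki/SOME_ARTICLE'  # stand-in for the page I grabbed
    html = urllib.urlopen(url).read()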
My use of BeautifulSoup is fairly simple too, and is similar to how I use the library in WikiTrans. First, I create a soup object out of the HTML I got from urllib. Then I loop over all the paragraph tags and ask for all the text in each paragraph, which comes back as a list (findAll returns lists), generally consisting of one item. I join the items of the list (to be safe) and put the result in my p_texts list.
From there I join the paragraph texts with two newlines, so the text looks somewhat normal, and write the data to a file while telling Python the text is Unicode so it gets encoded properly on the way out.
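The scraping half, roughly (this is BeautifulSoup 3, with findAll; urdu.txt is just the output name I used):

    import codecs
    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup(html)

    p_texts = []
    for p in soup.findAll('p'):
        # findAll(text=True) returns a list of the text nodes under the tag,
        # usually just one item; joining them covers the other cases.
        p_texts.append(u''.join(p.findAll(text=True)))

    # codecs.open() writes the unicode text out as UTF-8 for us.
    out = codecs.open('urdu.txt', 'w', 'utf-8')
    out.write(u'\n\n'.join(p_texts))
    out.close()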
The end result, at the time of writing, looks like this: urdu.txt. Your browser will not display this correctly. Downloading the file to your computer and viewing it locally should work fine though.
At least half of Danish’s code appeared to be command line processing. This is made easy in Python (Perl too, actually) using getopt. Python has something even easier called optparse, which autogenerates help output and takes care of some other nice things for you. So first, let’s look at Danish’s code and inspect what it’s doing.
Lines 64-108 are the command line processing and a little regex for getting the file name without the path information. An astute reader might be wondering why he didn’t just use Perl’s basename and dirname subroutines and to that I think the answer is just that this is really old Perl code. Anyway…
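For comparison, the optparse version of that option handling shrinks to a few lines. Only the -f flag from the usage line further down is real; treat the rest as a sketch:

    from optparse import OptionParser

    # optparse autogenerates --help output from these declarations.
    parser = OptionParser(usage='usage: %prog -f FILE')
    parser.add_option('-f', '--file', dest='filename',
                      help='UTF-8 text file to split into sentences')
    (options, args) = parser.parse_args()
    if not options.filename:
        parser.error('an input file is required (-f FILE)')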
I do not speak Urdu, so part of my programming involved simply copying Danish’s logic over to Python without fully understanding what wisdom he might be passing on. But as I did this it became clear the real meat of this tool is in line 126, where he uses a regex to split on particular characters. He expresses these characters as their Unicode values, which I found tricky to read, so in my implementation I converted them to something easier to read (e.g. QUESTION, DASH, BULLET, CR, ELLIPSIS, FULL_STOP).
After that split, the program essentially loops over the data, making sure any terminal punctuation it finds is printed on the same line as the sentence it belongs to. The split itself is careful to keep that punctuation in its output too.
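In my Python version that split boils down to something like the following. I am confident about the Urdu full stop and question mark codepoints; the values I use for ELLIPSIS, BULLET and DASH are my own guesses rather than a transcription of Danish's regex:

    import re

    FULL_STOP = u'\u06d4'   # Arabic full stop, the Urdu sentence terminator
    QUESTION  = u'\u061f'   # Arabic question mark
    ELLIPSIS  = u'\u2026'   # guess at the exact codepoint
    BULLET    = u'\u2022'   # guess
    DASH      = u'\u2014'   # guess
    CR        = u'\r'

    # The capturing group makes re.split() keep the terminal characters in its
    # output, so each piece of punctuation stays next to the sentence it ends.
    SPLITTER = re.compile(u'([%s])' % (FULL_STOP + QUESTION + ELLIPSIS +
                                       BULLET + DASH + CR))

    def split_sentences(text):
        """Return an alternating list: [sentence, terminal, sentence, terminal, ...]."""
        return SPLITTER.split(text)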
I spent the whole time working on this looking at text written right to left, while being somewhat aware that the computer sometimes found this confusing. Intuitively, I felt that if we were to split ‘.text more some is this .text some is this’ we would get back a data structure that looks like (‘.’, ‘text more some is this’, ‘.’, ‘text some is this’). I mistakenly thought my program would actually read the text from the file in that order too. It turns out that is not the case: the data is stored in logical, first-character-first order internally, and only the display reverses it. Seems obvious in hindsight…
The program actually reads the text like ‘this is some text. this is some more text.’ and we get a data structure like (‘this is some text’, ‘.’, ‘this is some more text’, ‘.’). My code and Danish’s both write the data to the file in this order too: each sentence and its terminal punctuation go on the same line, and the display on our computers handles arranging them correctly. Neat!
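Pairing things back up for output is then just a matter of walking that alternating list two items at a time; a sketch, assuming the split_sentences() structure from above:

    def write_split(pieces, out):
        # pieces alternates sentence, terminal, sentence, terminal, ...
        # Writing each pair on one line keeps the terminal with its sentence;
        # the display layer takes care of the right-to-left rendering.
        # (A trailing fragment with no terminal is ignored in this sketch.)
        for i in range(0, len(pieces) - 1, 2):
            sentence = pieces[i].strip()
            if sentence:
                out.write(sentence + pieces[i + 1] + u'\n')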
The rest of the code is fairly straightforward and simply prints the data to a file, along with some XML information if the user has requested it.
Here is the Python code. To run this code against urdu.txt, type ./urdu_segmenter.py -f urdu.txt
A particular difference between this splitter and NLTK’s is the use of regexes. Splitting on a period, in English, is not going to be perfect. For example, it will choke on “I think Washington D.C. is neato.” NLTK’s punkt is based on machine learning techniques so it handles this better, but I found it is also not perfect. For example, NLTK will also choke on “I think Washington D.C. is neato.” but it will handle “I think Washington D. C. is neato.” just fine. The difference is the space between D. and C.
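If you want to try this yourself, something like the following shows what punkt does with both sentences (it assumes you have fetched the punkt English model via nltk.download()):

    import nltk.data

    # Load the pre-trained English punkt model shipped with NLTK's data packages.
    punkt = nltk.data.load('tokenizers/punkt/english.pickle')

    print punkt.tokenize('I think Washington D.C. is neato.')
    print punkt.tokenize('I think Washington D. C. is neato.')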
Ideally, NLTK would handle all the tokenization but there is no support for Urdu splitting in NLTK at the time of this writing. I intend to look into how one trains NLTK on sentence segmentation soon though.
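My untested guess is that training starts from something like this, handing punkt raw text and letting it learn abbreviations and sentence starters; for Urdu I suspect the language variables (which default to ‘.’, ‘?’ and ‘!’ as sentence enders) would need customizing on top of this:

    from nltk.tokenize.punkt import PunktSentenceTokenizer

    # Train an unsupervised punkt model directly on raw text.
    with open('urdu.txt') as f:
        training_text = f.read().decode('utf-8')

    tokenizer = PunktSentenceTokenizer(training_text)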
I would really like some feedback on people’s views of sentence splitting if anyone is interested. Please email me.
—