I was inspired by Peter Norvig's chapter "Natural Language Corpus Data" in the book Beautiful Data, and by his implementation of a spelling corrector. I wondered whether I could implement a similar spelling corrector for pinyin, with the corrected pinyin used to suggest individual characters, as a primitive Chinese input method editor (IME) would do. The syllable (pinyin without tones) and character frequency lists were taken from Jun Da at Middle Tennessee State University. I am a linguistics minor with an interest in natural language processing, but I have no significant experience in it beyond experimenting a little with NLTK. I started this project out of curiosity and without much expectation, figuring it would be an interesting learning experience either way. I also wanted to use the opportunity to explore how to make Jupyter notebooks more interactive by learning ipywidgets.
Here is the character data. The characters are ranked by frequency.
Here is the syllables data. The syllables are paired with frequency counts.
Below are a bar plot and a word cloud that show the most common syllables. We can see that "de" is the most common syllable, followed by "shi" and "yi". Notice how syllables end in a vowel, "n", or "ng". Compared to English, there are far fewer ways to form a valid syllable, and there are essentially no true consonant clusters. (The velar nasal "ng" and the retroflex "sh" are each written with two letters, but linguistically each is a single phoneme, unlike a true cluster such as "sk".)
Most Chinese IMEs don't differentiate between tones, so we don't either. We add a new column, "toneless", which contains the pinyin with its tone marks stripped using Unidecode.
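To make the tone-stripping step concrete, here is a minimal sketch. It is not the notebook's code: it swaps Unidecode for the standard library's `unicodedata` so that it runs without extra installs, and the function name `strip_tones` is made up for illustration. Note that this approach also folds "ü" into "u".

```python
import unicodedata


def strip_tones(pinyin: str) -> str:
    """Return pinyin with tone marks (and any other diacritics) removed.

    Hypothetical helper: a standard-library stand-in for Unidecode.
    """
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", pinyin)
    # Keep only the base letters; combining marks have a non-zero combining class.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_tones("yī"))   # yi
print(strip_tones("hǎo"))  # hao
print(strip_tones("nǚ"))   # nu
```

Applied to a whole pinyin column, a function like this would produce the "toneless" values used for lookups.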
Now we can look up all the characters that correspond to any syllable written in toneless pinyin. The intuition behind the IME is that, for any valid toneless pinyin syllable such as "yi", it should retrieve all the corresponding characters in order of usage frequency.
For the purposes of the word cloud and our implementation, we reorganize the dataframes into dictionaries using custom-built functions.
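As a rough sketch of what such a dictionary might look like, the example below maps each toneless syllable to its characters, most frequent first. The function name `build_char_index` and the handful of rows are assumptions for illustration. The row order is not copied from Jun Da's list, and the repository's actual functions may be organized differently.

```python
from collections import defaultdict

# Illustrative rows only: (character, toneless pinyin), assumed to be
# already sorted from most to least frequent.
rows = [
    ("的", "de"),
    ("一", "yi"),
    ("是", "shi"),
    ("以", "yi"),
    ("意", "yi"),
]


def build_char_index(rows):
    """Map each toneless syllable to its characters, preserving row order."""
    index = defaultdict(list)
    for char, syllable in rows:
        index[syllable].append(char)
    return dict(index)


char_index = build_char_index(rows)
print(char_index.get("yi", []))   # ['一', '以', '意']
print(char_index.get("xyz", []))  # []
```

Because the rows are appended in frequency order, each list is already ranked, so suggesting the top characters for a syllable is just a slice.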
On to the spelling corrector! Our implementation is not much different from Peter Norvig's.
The intuition behind it is to:
1. Get all candidates. Candidates include:
   - The input itself if it is a known syllable
   - All known syllables one edit away (deletes, transposes, replaces, inserts)
   - All known syllables two edits away
   - If no known syllables are found using the first three methods, the input itself will be the only candidate returned, so there will be no correction made.
2. Get the probabilities for each candidate, based on relative frequency.
3. Return the candidate with the highest probability as the correction.
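The steps above can be sketched roughly as follows, closely following the structure of Norvig's corrector. This is an illustration rather than the repository's code. The counts in `SYLLABLE_COUNTS` are invented for the example and are not Jun Da's figures, and the candidate groups are assumed to be tried in priority order, as in Norvig's version.

```python
from collections import Counter

# Made-up counts for illustration; real values would come from the syllable list.
SYLLABLE_COUNTS = Counter(
    {"de": 50, "shi": 40, "yi": 30, "dou": 20, "du": 10, "miao": 5, "mie": 2}
)
TOTAL = sum(SYLLABLE_COUNTS.values())
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def probability(syllable):
    """Relative frequency of a syllable (0 for unknown syllables)."""
    return SYLLABLE_COUNTS[syllable] / TOTAL


def known(syllables):
    """Keep only the syllables that appear in the frequency list."""
    return {s for s in syllables if s in SYLLABLE_COUNTS}


def edits1(text):
    """All strings one edit (delete, transpose, replace, insert) away."""
    splits = [(text[:i], text[i:]) for i in range(len(text) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def edits2(text):
    """All strings two edits away."""
    return {e2 for e1 in edits1(text) for e2 in edits1(e1)}


def candidates(text):
    """First non-empty group: the input, one edit, two edits, else the input as-is."""
    return known([text]) or known(edits1(text)) or known(edits2(text)) or {text}


def correct(text):
    """Return the most probable candidate."""
    return max(candidates(text), key=probability)


print(sorted(candidates("mieo")))  # ['miao', 'mie']
print(correct("ddu"))              # dou (with the made-up counts above)
print(correct("zzzzz"))            # zzzzz (nothing known within two edits)
```

With these toy counts, both "du" and "dou" are one edit away from "ddu", so the choice between them comes down entirely to which syllable is more frequent.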
All functions can be viewed in the repository. For more on the process behind the data manipulation and implementation, view the exploratory notebook (https://github.com/rtang18/xiaoshuru/blob/master/notebooks/Discovery%20and%20Discussion.ipynb).
Here are the candidates and corrections when "mieo" and "ddu" are inputted:
We're getting closer to our IME! Here are the top 10 most frequent characters for "dou", the corrected version of "ddu".
When I have more time, I would definitely look into improving the probability calculations. This is already an improvement over some older IMEs that do not tolerate spelling errors well, but I'm still not sure that "dou" is a more realistic correction than "du" when I type "ddu". The fact that Mandarin has a far more limited number of syllables than English could certainly be leveraged as well. In the future, I would also like to use Jun Da's bigram frequency data so that the IME can handle bigrams. As of now, only one character can be processed at a time, and there is no support yet for special characters such as punctuation marks.
I don't have a lot of experience with dashboards or interactive front ends in general, so I decided to see what I could do with ipywidgets. IMEs usually need proper GUIs, but this year I'm challenging myself to make more interactive notebooks. So even though this is a strange use of the widgets, almost a misuse, it's a fun proof of concept.
There are some limitations. As of now, it handles single characters and automatically outputs the first suggested character. The version below mimics a desktop IME with a select menu. For best results, run the Jupyter notebook (as opposed to just viewing it on GitHub or elsewhere).
Below is a capture of me writing, "你好我是小唐", which means "Hello, I am Little Tang", using the IME.