Analyze the Abstracts as a multiclass text classification problem, treating each Journal that an Abstract belongs to as a class, i.e. trying to predict which Journal an Abstract is affiliated with.
We will remove the top class (Geophysical Research Letters) to reduce the imbalance.
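Dropping the most frequent class is a one-liner in pandas. A minimal sketch (the DataFrame contents and the column names `Abstract` and `Journal` are assumptions standing in for the real dataset):

```python
import pandas as pd

# Toy stand-in for the real dataset; column names are assumptions.
df = pd.DataFrame({
    "Abstract": ["a", "b", "c", "d", "e"],
    "Journal": ["Geophysical Research Letters",
                "Geophysical Research Letters",
                "Geophysical Research Letters",
                "JGR Space Physics",
                "Water Resources Research"],
})

# Find the most frequent class and remove all of its rows.
top_class = df["Journal"].value_counts().idxmax()
df = df[df["Journal"] != top_class].reset_index(drop=True)
```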
We are only interested in three columns:
One common approach for extracting features from text is the bag-of-words model: a model where, for each document (the abstract Content in our case), the presence (and often the frequency) of words is taken into consideration, but the order in which they occur is ignored.
We will use sklearn.feature_extraction.text.TfidfVectorizer to calculate a tf-idf vector for each of the abstracts:
* If float, the parameter represents a proportion of documents; if int, an absolute count. I usually start with something like 5; if that leads to out-of-memory problems, I use a float like 0.5 instead to limit the size of the vocabulary and hence reduce memory usage.
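The vectorization step can be sketched as follows (the three-document corpus is a toy stand-in; `min_df=1` is used here only because the corpus is tiny, whereas on the real data `min_df=5` is the starting point discussed above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in corpus; the real input is the abstract texts.
corpus = [
    "solar wind interaction with the magnetosphere",
    "groundwater flow in fractured aquifers",
    "magnetosphere response to solar storms",
]

# sublinear_tf replaces raw counts with 1 + log(tf);
# ngram_range=(1, 2) keeps both unigrams and bigrams.
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=1,
                        ngram_range=(1, 2), stop_words="english")
features = tfidf.fit_transform(corpus)
print(features.shape)  # (n_documents, n_unigrams + n_bigrams)
```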
Now each of the 30,844 abstracts is represented by 120,237 features, the tf-idf scores of different unigrams and bigrams.
We can use sklearn.feature_selection.chi2 to find the terms that are the most correlated with each of the journals:
The unigrams and bigrams look sensible. So far so good!
Now that we have vector representations of the text, we can train supervised classifiers and, given an unseen “Abstract”, predict the “Journal” to which it belongs.
Let's try Naive Bayes first:
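A minimal Naive Bayes sketch (the training corpus and labels are toy assumptions; the real model is fitted on the tf-idf features of the abstracts):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in training data.
X_train = ["solar wind magnetosphere storm", "magnetosphere plasma solar",
           "aquifer groundwater flow", "river discharge aquifer"]
y_train = ["JGR Space Physics", "JGR Space Physics",
           "Water Resources Research", "Water Resources Research"]

# Vectorize and classify in one pipeline.
nb = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb.fit(X_train, y_train)

prediction = nb.predict(["lunar impact on the magnetosphere"])[0]
print(prediction)
```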
As we can see, without GRL the classifications of these abstracts became more targeted (e.g. 'Lunar impact ...' got classified as 'JGR Space Physics Section').
Looks like by just removing GRL, we have gotten better results on Naive Bayes alone. Let's get some quantitative data on other models.
Let's experiment with different models, evaluate their accuracy, and see if we can identify other problems. The models we will experiment with and benchmark are:
Wow! Without GRL, the mean accuracies increased by ~10% (even more for Naive Bayes: 15%).
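The benchmark loop can be sketched with cross-validation (the corpus, labels, and the particular three models shown are stand-in assumptions; the real benchmark runs on the abstract features with the full model list above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy stand-in corpus and labels.
X = ["solar wind magnetosphere", "magnetosphere plasma storm",
     "solar flare magnetosphere", "storm plasma solar wind",
     "aquifer groundwater flow", "river discharge streamflow",
     "groundwater recharge aquifer", "streamflow river basin"]
y = ["JGR Space Physics"] * 4 + ["Water Resources Research"] * 4

features = TfidfVectorizer().fit_transform(X)

# Mean cross-validated accuracy per model.
mean_accuracy = {}
for model in (MultinomialNB(), LogisticRegression(max_iter=1000), LinearSVC()):
    scores = cross_val_score(model, features, y, cv=2)
    mean_accuracy[type(model).__name__] = scores.mean()
print(mean_accuracy)
```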
The majority of the predictions show up on the diagonal, where predicted = actual, as expected. However, as in the case with GRL, we see some misclassifications, this time around Geochemistry, Geophysics, Geosystems, which sounds like yet another generic category.
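How such a matrix is built and read can be sketched as follows (the six true/predicted labels are hypothetical; rows are actual journals, columns are predicted journals, so off-diagonal entries are the misclassifications discussed above):

```python
from sklearn.metrics import confusion_matrix

journals = ["Geochemistry, Geophysics, Geosystems",
            "JGR Space Physics", "Water Resources Research"]

# Hypothetical true vs. predicted labels for six test abstracts.
y_true = [journals[0], journals[0], journals[1],
          journals[1], journals[2], journals[2]]
y_pred = [journals[0], journals[1], journals[1],
          journals[1], journals[0], journals[2]]

cm = confusion_matrix(y_true, y_pred, labels=journals)
print(cm)  # rows = actual journal, columns = predicted journal
```

The diagonal sum (`cm.trace()`) counts the correct predictions; everything else on a row shows where that journal's abstracts were mistakenly sent.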
Let's look at the unigrams and bigrams again (using the chi^2 test):
After removing the abstracts of the GRL Journal, we saw a whopping 10% jump in the overall accuracy! (Less data, more accuracy :))
The only low-performing classes are those that are either too generic (e.g. Geochemistry, Geophysics, Geosystems) or have little or no data.
These results confirm that the selection of the Journal for a new publication can be automated. An API endpoint can be created to suggest a Journal automatically, providing the publisher with the top two or three choices. Once the publisher makes the final choice, it will be fed back to the ML model for continuous training and accuracy improvement.
all-encompassing (i.e. generic) Journals
The fact that it was the Linear SVC that performed the best suggests that the knowledge space for each journal is clearly separable, which would allow for automating a knowledge management system.
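The top-two-or-three suggestion idea behind the proposed API endpoint can be sketched with Linear SVC's decision scores (the training corpus, labels, and the helper `suggest` are hypothetical; LinearSVC has no `predict_proba`, so we rank classes by `decision_function` instead):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in training data: two abstracts per journal.
X_train = ["solar wind magnetosphere", "magnetosphere plasma storm",
           "aquifer groundwater flow", "river discharge streamflow",
           "mantle geochemistry basalt", "isotope geochemistry mantle"]
y_train = ["JGR Space Physics", "JGR Space Physics",
           "Water Resources Research", "Water Resources Research",
           "Geochemistry, Geophysics, Geosystems",
           "Geochemistry, Geophysics, Geosystems"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(X_train, y_train)

def suggest(abstract, k=2):
    """Return the k journals with the highest decision scores."""
    scores = clf.decision_function([abstract])[0]
    order = np.argsort(scores)[::-1][:k]
    return [clf.classes_[i] for i in order]

suggestions = suggest("magnetosphere response to a solar storm")
print(suggestions)
```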