Multiclass classifier for Abstracts


We analyze abstracts to build a multiclass text classifier, treating the Journal each abstract belongs to as its class; i.e., we try to predict which journal a given abstract is affiliated with.


We will remove the top class (Geophysical Research Letters) to reduce the imbalance.

Preprocessing


We are only interested in three columns:

  • Journal (i.e. classes/output)
  • Abstract (i.e. the content/input): abstract of the manuscript
  • Title: title of the manuscript
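A minimal sketch of this selection step in pandas; the toy frame (and its extra `DOI` column) is a stand-in for the real dataset, whose file name is not shown in the notebook:

```python
import pandas as pd

# Toy frame standing in for the real dataset; only the column names
# mirror the ones described above.
raw = pd.DataFrame({
    "Journal":  ["JGR Space Physics", "Water Resources Research"],
    "Abstract": ["Lunar impact plasma ...", "Aquifer recharge model ..."],
    "Title":    ["A lunar study", "A hydrology study"],
    "DOI":      ["10.1/abc", "10.1/def"],   # example of a column we drop
})

# Keep only the three columns of interest and drop rows with missing
# abstracts, since they carry no input signal.
df = raw[["Journal", "Abstract", "Title"]].dropna(subset=["Abstract"])
print(df.columns.tolist())  # → ['Journal', 'Abstract', 'Title']
```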

Remove the Geophysical Research Letters rows

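The filter itself is a one-liner; here is a sketch on a toy frame standing in for the preprocessed dataframe:

```python
import pandas as pd

# Toy frame; in the notebook this is the dataframe from the
# preprocessing step above.
df = pd.DataFrame({
    "Journal":  ["Geophysical Research Letters", "JGR Space Physics", "GeoHealth"],
    "Abstract": ["a", "b", "c"],
})

# Drop the dominant class to reduce the imbalance.
df = df[df["Journal"] != "Geophysical Research Letters"].reset_index(drop=True)
print(df["Journal"].tolist())  # → ['JGR Space Physics', 'GeoHealth']
```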

Constructing the BoW using tf-idf vectors


One common approach to extracting features from text is the bag-of-words model: for each document (an Abstract in our case) the presence, and often the frequency, of words is taken into account, but the order in which they occur is ignored. We will use sklearn.feature_extraction.text.TfidfVectorizer to calculate a tf-idf vector for each abstract:

  • sublinear_tf is set to True to use a logarithmic form for term frequency
  • *min_df is the minimum number of documents a word must appear in to be kept
  • norm is set to l2 to ensure all our feature vectors have a Euclidean norm of 1
  • ngram_range is set to (1, 2) to indicate that we want to consider both unigrams and bigrams
  • stop_words is set to "english" to remove common words ("a", "the", etc.)

* If a float, the parameter represents a proportion of documents; an int represents an absolute count. Usually I start with something like 5; however, if that leads to out-of-memory problems, I switch to a float such as 0.5 to limit the vocabulary size and hence reduce the memory footprint.
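Putting the parameters above together, a sketch on a toy corpus (min_df=1 only because the corpus is tiny; the note above suggests starting around 5 on real data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the real abstracts.
corpus = [
    "lunar impact plasma observations",
    "aquifer recharge and groundwater flow",
    "plasma waves in the magnetosphere",
]

# Parameters mirror the bullet list above.
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=1, norm="l2",
                        ngram_range=(1, 2), stop_words="english")
features = tfidf.fit_transform(corpus)

# One row per document; columns are unigram and bigram features.
print(features.shape)
```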


Now each of the 30844 abstracts is represented by 120237 features: the tf-idf scores of different unigrams and bigrams.

We can use sklearn.feature_selection.chi2 to find the terms that are the most correlated with each of the journals:

Looks like the unigrams and bigrams make sense. So far so good!

Now that we have the vector representations of the text, we can train supervised classifiers and, given unseen “Abstracts”, predict the “Journal” to which they belong.

Let's try Naive Bayes first.

Naive Bayes classifier

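A minimal sketch of the classifier on toy data; in the notebook, the real tf-idf features from the step above are used instead:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy abstracts and journal labels standing in for the real data.
texts = ["lunar plasma impact", "magnetosphere plasma waves",
         "aquifer recharge flow", "groundwater flow model"]
journals = ["JGR Space Physics", "JGR Space Physics",
            "Water Resources Research", "Water Resources Research"]

# Vectorize and fit a multinomial Naive Bayes model in one pipeline.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, journals)
print(clf.predict(["lunar impact basin plasma"]))  # → ['JGR Space Physics']
```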

As we can see, without GRL the classifications of these abstracts became more targeted (e.g. 'Lunar impact ...' got classified as 'JGR Space Physics Section').

Looks like by just removing GRL, we have gotten better results on Naive Bayes alone. Let's get some quantitative data on other models.

Model Selection


Let's experiment with different models, evaluate their accuracy, and see if we can identify any other problems. The models we will experiment with and benchmark are:

  • Logistic Regression
  • Naive Bayes (Multinomial)
  • Linear Support Vector Machine
  • Random Forest
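The benchmark can be sketched with cross_val_score; the toy corpus below stands in for the real tf-idf features, and the hyperparameters shown are illustrative, not the notebook's:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy corpus; the notebook runs this on the real tf-idf features.
texts = ["lunar plasma impact", "magnetosphere plasma waves",
         "plasma storm aurora", "aquifer recharge flow",
         "groundwater flow model", "river basin hydrology flow"]
y = ["space"] * 3 + ["water"] * 3
X = TfidfVectorizer().fit_transform(texts)

models = [LogisticRegression(max_iter=1000), MultinomialNB(),
          LinearSVC(), RandomForestClassifier(n_estimators=50, random_state=0)]

# Mean cross-validated accuracy per model.
results = {}
for model in models:
    acc = cross_val_score(model, X, y, scoring="accuracy", cv=3)
    results[type(model).__name__] = acc.mean()
    print(type(model).__name__, acc.mean())
```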

Wow! Without GRL, the mean accuracies increased by ~10% (even more for Naive Bayes: 15%).

Model Evaluation

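The confusion matrix can be computed as below; the hand-made `y_test`/`y_pred` labels are stand-ins for a held-out split of the real data:

```python
from sklearn.metrics import confusion_matrix

# Stand-ins for true and predicted journal labels on a test split.
y_test = ["space", "space", "water", "water", "water"]
y_pred = ["space", "water", "water", "water", "water"]

# Rows = actual class, columns = predicted class.
cm = confusion_matrix(y_test, y_pred, labels=["space", "water"])
print(cm)
# [[1 1]
#  [0 3]]
```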

The majority of the predictions fall on the diagonal where predicted = actual, as expected. However, as in the case with GRL, we see some misclassifications, this time centered on Geochemistry, Geophysics, Geosystems, which sounds like yet another generic category.


Let's look at the most correlated unigrams and bigrams for each class (using the chi^2 test).

Classification report for each class

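A per-class report can be produced with classification_report; the toy labels here stand in for the real test split:

```python
from sklearn.metrics import classification_report

# Stand-ins for true and predicted journal labels on a test split.
y_test = ["space", "space", "water", "water", "water"]
y_pred = ["space", "water", "water", "water", "water"]

# Per-class precision, recall, f1, and support; output_dict=True makes
# the numbers easy to inspect programmatically.
report = classification_report(y_test, y_pred, output_dict=True)
print(report["water"]["recall"])  # → 1.0  (all three water samples recovered)
```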

Observations and recommendations


After removing the abstracts of the GRL journal, we saw a whopping 10% jump in overall accuracy (less data, more accuracy!). The only low-performing classes are those that are either too generic (e.g. Geochemistry, Geophysics, Geosystems) or have little or no data (e.g. GeoHealth).

These results confirm that the selection of the journal for a new publication can be automated. An API endpoint can be created to suggest a journal automatically and provide a publisher with the top two or three choices. Once the publisher makes the final choice, it will be fed back to the ML model for continuous training and further accuracy gains.

  • Remove all-encompassing (i.e. generic) Journals
  • Create a recommender API

The fact that the Linear SVC performed best suggests that the knowledge space for each journal is clearly separable, which would allow for automating a knowledge management system.