Build a language detector model


The goal of this exercise is to train a linear classifier on text features representing sequences of up to 3 consecutive characters, so as to recognize natural languages by using the frequencies of short character sequences as 'fingerprints'.

Author: Olivier Grisel

License: Simplified BSD

Split the dataset into a training and a test set:
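A minimal sketch of the split, using a hypothetical toy corpus as a stand-in for the exercise's actual paragraphs dataset (the sentences and label names below are assumptions, not the real data):

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for the exercise's dataset.
docs = [
    "this is a sentence in english",
    "the cat sat on the mat",
    "hello world how are you today",
    "ceci est une phrase en francais",
    "le chat est sur le tapis",
    "bonjour tout le monde comment allez vous",
    "dies ist ein satz auf deutsch",
    "die katze sitzt auf der matte",
    "hallo welt wie geht es dir heute",
]
labels = ["en", "en", "en", "fr", "fr", "fr", "de", "de", "de"]

# Hold out a third of the documents; stratify keeps the language
# proportions the same in both splits.
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.33, random_state=0, stratify=labels
)
print(len(docs_train), len(docs_test))
```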


Build a vectorizer that splits strings into sequences of 1 to 3 characters instead of word tokens


Fit the pipeline on the training set
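A sketch of the pipeline and fit step. The training split here is a hypothetical toy stand-in, and Perceptron is just one possible linear classifier for this exercise:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline

# Hypothetical training split; the exercise fits on docs_train / y_train
# produced by the train/test split above.
docs_train = [
    "the cat sat on the mat",
    "hello world how are you",
    "le chat est sur le tapis",
    "bonjour tout le monde",
    "die katze sitzt auf der matte",
    "hallo welt wie geht es dir",
]
y_train = ["en", "en", "fr", "fr", "de", "de"]

# Chain the character n-gram vectorizer with a linear classifier.
clf = Pipeline([
    ("vec", TfidfVectorizer(analyzer="char", ngram_range=(1, 3))),
    ("clf", Perceptron()),
])
clf.fit(docs_train, y_train)
```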


Predict the outcome on the testing set in a variable named y_predicted
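The prediction step is a single call on the fitted pipeline. The splits below are hypothetical placeholders so the snippet runs on its own; in the exercise they come from the earlier train/test split:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline

# Hypothetical splits standing in for the exercise's real ones.
docs_train = ["the cat sat on the mat", "le chat est sur le tapis"]
y_train = ["en", "fr"]
docs_test = ["hello world", "bonjour le monde"]

clf = Pipeline([
    ("vec", TfidfVectorizer(analyzer="char", ngram_range=(1, 3))),
    ("clf", Perceptron()),
])
clf.fit(docs_train, y_train)

# One predicted label per held-out document.
y_predicted = clf.predict(docs_test)
print(y_predicted)
```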


Model Evaluation: Print the classification report
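A sketch of the report call, using hypothetical y_test / y_predicted arrays in place of the real test split and pipeline output. The report lists the metrics defined below per class:

```python
from sklearn.metrics import classification_report

# Hypothetical labels: one "fr" document misclassified as "de".
y_test = ["en", "fr", "de", "en", "fr", "de"]
y_predicted = ["en", "fr", "de", "en", "de", "de"]

# Prints precision, recall, f1-score, and support per class.
print(classification_report(y_test, y_predicted))
```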


precision: The ability of the classifier not to label as positive a sample that is negative. The higher the number, the more confident we are that the positive labels are actually positive.

precision = true_positives / (true_positives + false_positives)

recall: The ability of the classifier to find all the positive samples. The higher the number, the more confident we are that we are not missing any positive labels.

recall = true_positives / (true_positives + false_negatives)

f1-score: Combines precision and recall into a single number (their harmonic mean).

f1 = 2 * true_positives / (2 * true_positives + false_positives + false_negatives)

Equivalently, in terms of precision and recall:

f1 = 2 * (precision * recall) / (precision + recall)

support: The number of occurrences of each class in the true labels (y_test).

Plot the confusion matrix


AKA error matrix.
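A plotting sketch using confusion_matrix and ConfusionMatrixDisplay; the y_test / y_predicted values are hypothetical stand-ins for the exercise's real test labels and predictions. Rows are true classes, columns are predicted classes, so off-diagonal entries are errors:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Hypothetical labels: one "fr" document misclassified as "de".
y_test = ["en", "fr", "de", "en", "fr", "de"]
y_predicted = ["en", "fr", "de", "en", "de", "de"]

langs = ["de", "en", "fr"]
cm = confusion_matrix(y_test, y_predicted, labels=langs)
print(cm)

ConfusionMatrixDisplay(cm, display_labels=langs).plot()
plt.savefig("confusion_matrix.png")
```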


Predict the result on some short new sentences:
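A self-contained sketch of this final step, refitting the pipeline on a hypothetical toy corpus (in the exercise you would reuse the pipeline already fitted on the full training set) and classifying two unseen sentences:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline

# Hypothetical toy training data standing in for the real training split.
docs_train = [
    "this is a sentence in english",
    "the cat sat on the mat",
    "ceci est une phrase en francais",
    "le chat est sur le tapis",
]
y_train = ["en", "en", "fr", "fr"]

clf = Pipeline([
    ("vec", TfidfVectorizer(analyzer="char", ngram_range=(1, 3))),
    ("clf", Perceptron()),
])
clf.fit(docs_train, y_train)

# Short new sentences the model has never seen.
sentences = [
    "This is a language detection test.",
    "Ceci est un test de detection de la langue.",
]
predicted = clf.predict(sentences)
for s, p in zip(sentences, predicted):
    print('The language of "%s" is "%s"' % (s, p))
```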