Build a language detector model

The goal of this exercise is to train a linear classifier on text features that represent sequences of up to 3 consecutive characters, so that it can recognize natural languages by using the frequencies of short character sequences as 'fingerprints'.
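
To make the 'fingerprint' idea concrete, here is a minimal pure-Python sketch that counts every run of 1 to 3 consecutive characters in a string. The helper name `char_ngram_counts` and the sample sentence are illustrative assumptions, not part of the original exercise.

```python
from collections import Counter


def char_ngram_counts(text, n_min=1, n_max=3):
    """Count every run of n_min..n_max consecutive characters in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Hypothetical sample text; frequent n-grams such as "th" hint at English.
fingerprint = char_ngram_counts("the theory of the thing")
print(fingerprint.most_common(5))
```

A classifier can then compare such frequency profiles across languages instead of relying on whole-word vocabularies.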

Author: Olivier Grisel olivier.grisel@ensta.org

License: Simplified BSD

Split the dataset into training and test sets:
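
A minimal sketch of this step, assuming scikit-learn's `train_test_split`. The tiny in-memory corpus below is a hypothetical stand-in for the real dataset, and the `random_state` and `stratify` arguments are added only to make the toy example reproducible.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for the real dataset.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "this is a short english sentence",
    "where is the nearest train station",
    "i would like a cup of tea please",
    "le renard brun saute par-dessus le chien paresseux",
    "ceci est une courte phrase en français",
    "où se trouve la gare la plus proche",
    "je voudrais une tasse de thé s'il vous plaît",
]
labels = ["en"] * 4 + ["fr"] * 4

# Hold out half of the documents for testing, keeping both languages
# equally represented in each half.
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.5, random_state=0, stratify=labels
)
print(len(docs_train), "training documents,", len(docs_test), "test documents")
```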

Build a vectorizer that splits strings into sequences of 1 to 3 characters instead of word tokens
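
One way to build such a vectorizer is with scikit-learn's `TfidfVectorizer` configured for character n-grams. This is a sketch under assumptions: the choice of `TfidfVectorizer` and `use_idf=False` is one reasonable option, and the two input strings are made up so the resulting vocabulary is easy to read.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# analyzer="char" yields character n-grams rather than word tokens;
# ngram_range=(1, 3) keeps sequences of 1, 2 and 3 characters.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), analyzer="char", use_idf=False)

# Hypothetical toy input, short enough to inspect by eye.
X = vectorizer.fit_transform(["abc", "abd"])

features = sorted(vectorizer.vocabulary_)
print(features)
print(X.shape)
```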

Fit the pipeline on the training set
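
A possible version of this step chains the character n-gram vectorizer with a linear classifier in a `Pipeline`. The `Perceptron` classifier, the step names `"vec"` and `"clf"`, and the four training sentences are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline

# Hypothetical toy training set.
docs_train = [
    "the quick brown fox jumps over the lazy dog",
    "where is the nearest train station",
    "le renard brun saute par-dessus le chien paresseux",
    "où se trouve la gare la plus proche",
]
y_train = ["en", "en", "fr", "fr"]

vectorizer = TfidfVectorizer(ngram_range=(1, 3), analyzer="char", use_idf=False)
clf = Pipeline([
    ("vec", vectorizer),                  # text -> character n-gram features
    ("clf", Perceptron(random_state=0)),  # linear classifier on those features
])

# Fitting the pipeline fits the vectorizer and then the classifier.
clf.fit(docs_train, y_train)
print(clf.classes_)
```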

Predict the outcome on the testing set in a variable named y_predicted
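
A minimal sketch of the prediction step. The pipeline is rebuilt here on hypothetical toy data so the snippet runs on its own; in the exercise, `clf` and `docs_test` would come from the earlier steps.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import make_pipeline

# Hypothetical toy training and test documents.
docs_train = [
    "this is a short english sentence",
    "i would like a cup of tea please",
    "ceci est une courte phrase en français",
    "je voudrais une tasse de thé s'il vous plaît",
]
y_train = ["en", "en", "fr", "fr"]
docs_test = ["where is the station", "où est la gare"]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), analyzer="char", use_idf=False),
    Perceptron(random_state=0),
)
clf.fit(docs_train, y_train)

# One predicted label per test document.
y_predicted = clf.predict(docs_test)
for doc, label in zip(docs_test, y_predicted):
    print(f"{label}: {doc}")
```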

Model Evaluation: Print the classification report

http://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-and-f-measures

https://en.wikipedia.org/wiki/Precision_and_recall#Precision

precision: Ability of the classifier not to label as positive a sample that is negative. The higher the number, the more sure we are that the positive labels are actually positive.

precision = true_positives / (true_positives + false_positives)

recall: Ability of the classifier to find all the positive samples. The higher the number, the more sure we are that we are not missing any positive labels.

recall = true_positives / (true_positives + false_negatives)

f1-score: Combines the precision and recall as their harmonic mean. The two expressions below are equivalent.

f1 = 2.0 * true_positives / (2*true_positives + false_positives + false_negatives)

f1 = 2 * (precision * recall) / (precision + recall)

support: The number of occurrences of each class in the true labels.
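
The definitions above can be checked by hand on a small example. The sketch below uses made-up label lists (not results from the exercise), treats "fr" as the positive class, computes each quantity from raw counts, and then prints scikit-learn's `classification_report` for comparison.

```python
from sklearn import metrics

# Hypothetical true and predicted labels, for illustration only.
y_true = ["en", "en", "en", "en", "fr", "fr", "fr", "fr"]
y_pred = ["en", "en", "fr", "fr", "fr", "fr", "fr", "en"]

# Count outcomes with "fr" as the positive class.
pairs = list(zip(y_true, y_pred))
true_positives = sum(t == "fr" and p == "fr" for t, p in pairs)
false_positives = sum(t != "fr" and p == "fr" for t, p in pairs)
false_negatives = sum(t == "fr" and p != "fr" for t, p in pairs)

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1_from_counts = 2.0 * true_positives / (
    2 * true_positives + false_positives + false_negatives
)
f1_from_rates = 2 * (precision * recall) / (precision + recall)
support = y_true.count("fr")

print(precision, recall, f1_from_counts, f1_from_rates, support)
print(metrics.classification_report(y_true, y_pred))
```

Both f1 expressions give the same value, and the "fr" row of the report should match the hand-computed numbers.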

Plot the confusion matrix

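Also known as an error matrix. In scikit-learn's convention, each row corresponds to a true class and each column to a predicted class, so off-diagonal entries count misclassifications.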
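
A small sketch of computing the matrix with `sklearn.metrics.confusion_matrix`. The label lists are made up for illustration, and the explicit `labels` argument is only there to fix the row and column order.

```python
from sklearn import metrics

# Hypothetical true and predicted labels, for illustration only.
y_true = ["en", "en", "en", "en", "fr", "fr", "fr", "fr"]
y_pred = ["en", "en", "fr", "fr", "fr", "fr", "fr", "en"]

# Rows are true classes, columns are predicted classes, both ordered
# as given in `labels`.
cm = metrics.confusion_matrix(y_true, y_pred, labels=["en", "fr"])
print(cm)
```

To draw the matrix rather than print it, one option is to pass `cm` to matplotlib's `plt.matshow` and call `plt.show()`.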

Predict the result on some short new sentences:
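
A sketch of this last step, again on a hypothetical toy corpus so that it runs standalone. With so little training data the predictions are not meant to be reliable; the point is only the shape of the call.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import make_pipeline

# Hypothetical toy training corpus.
docs_train = [
    "the quick brown fox jumps over the lazy dog",
    "this is a short english sentence",
    "le renard brun saute par-dessus le chien paresseux",
    "ceci est une courte phrase en français",
]
y_train = ["en", "en", "fr", "fr"]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), analyzer="char", use_idf=False),
    Perceptron(random_state=0),
)
clf.fit(docs_train, y_train)

# Illustrative new sentences that were not seen during training.
sentences = [
    "This is a language detection test.",
    "Ceci est un test de détection de la langue.",
]
predicted = clf.predict(sentences)

for sentence, language in zip(sentences, predicted):
    print(f'The language of "{sentence}" is "{language}"')
```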
