Approaches to Encoding Categorical Variables


Understanding the differences between:

  • pd.factorize
  • pd.get_dummies
  • sklearn.preprocessing.LabelEncoder
  • sklearn.preprocessing.OneHotEncoder

These four encoders fall into two groups:

  • Encode labels into categorical variables (integer codes). pandas factorize and scikit-learn LabelEncoder return a 1-dimensional array with one integer code per value
  • Encode categorical variables into dummy (binary) variables. pandas get_dummies and scikit-learn OneHotEncoder return n columns, one per distinct category

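Tip: the scikit-learn encoders expose fit and transform methods, so the mapping learned on one dataset can be reapplied to another; OneHotEncoder in particular is designed to be used as a step in a scikit-learn pipeline.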
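As a minimal sketch of the tip above, the snippet below puts OneHotEncoder in front of a classifier inside a pipeline. The sample data, the column name color, and the choice of LogisticRegression are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: one categorical feature and a binary target.
X_train = pd.DataFrame({"color": ["red", "green", "blue", "green", "red", "blue"]})
y_train = [1, 0, 0, 0, 1, 0]

# The encoder is fitted together with the classifier when fit() is called.
model = make_pipeline(OneHotEncoder(), LogisticRegression())
model.fit(X_train, y_train)

# predict() reuses the category-to-column mapping learned during fit().
X_new = pd.DataFrame({"color": ["red", "blue"]})
preds = model.predict(X_new)
print(preds)
```

Because the encoder lives inside the pipeline, new data is encoded with exactly the columns learned from the training data.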

Encode labels into Categorical Variables

  • Pandas factorize
  • scikit-learn LabelEncoder
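
The two functions listed above can be compared with a short sketch. The colors series is hypothetical sample data. Note that the two assign codes in a different order: factorize numbers categories by first appearance, while LabelEncoder numbers them in sorted order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample data.
colors = pd.Series(["red", "green", "blue", "green", "red"])

# pd.factorize returns the integer codes and the unique values,
# numbered in order of first appearance.
codes, uniques = pd.factorize(colors)
print(codes)          # [0 1 2 1 0]
print(list(uniques))  # ['red', 'green', 'blue']

# LabelEncoder learns the classes in sorted order during fit.
encoder = LabelEncoder()
labels = encoder.fit_transform(colors)
print(labels)                  # [2 1 0 1 2]
print(list(encoder.classes_))  # ['blue', 'green', 'red']

# The fitted encoder can map the codes back to the original labels.
restored = encoder.inverse_transform(labels)
print(list(restored))  # ['red', 'green', 'blue', 'green', 'red']
```

Both results are 1-dimensional: one integer per input value.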

Encode labels into Dummy Variables

  • Pandas get_dummies
  • scikit-learn OneHotEncoder
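
A comparable sketch for the two dummy-variable encoders follows. The DataFrame and its color column are hypothetical sample data, and handle_unknown="ignore" is an optional setting chosen here to show how a fitted encoder treats a category it has not seen.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# get_dummies builds one indicator column per category, in sorted order.
dummies = pd.get_dummies(df["color"])
print(dummies.columns.tolist())  # ['blue', 'green', 'red']

# OneHotEncoder expects 2-D input and returns a sparse matrix by default,
# so the result is converted to a dense array for display.
encoder = OneHotEncoder(handle_unknown="ignore")
onehot = encoder.fit_transform(df[["color"]]).toarray()
print(onehot)

# With handle_unknown="ignore", a category not seen during fit
# is encoded as a row of zeros instead of raising an error.
unseen = encoder.transform(pd.DataFrame({"color": ["purple"]})).toarray()
print(unseen)  # [[0. 0. 0.]]
```

Both encoders produce the same three indicator columns here. The difference is that the fitted OneHotEncoder remembers its categories, whereas get_dummies derives the columns anew from whatever data it is given.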