Preprocessing Sneaker Data from the Taobao and Tmall Platforms


Import packages and mount Google Drive

Data Cleaning


Trim away columns that are highly correlated or have a constant value

  • GoodComment is highly correlated with Comments
  • OriginPrice is highly correlated with Price
  • SellerId is highly correlated with UserId
  • UnitPrice is highly correlated with OriginPrice
  • DetailStatus has constant value 1
  • Rate has constant value 0
  • RateStatus has constant value 0
  • ShangHaiExpress has constant value 0
  • UrlNo has constant value 1359
  • stock has constant value 0

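The three price fields are highly correlated, and positive reviews make up the vast majority of all comments. In other words, negative reviews and discounted prices can be treated as outliers and left to the Anomaly Detection model. Here we drop the redundant fields first and keep only one of each group.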
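The checks behind the list above can be sketched roughly as follows. This is a minimal illustration, not the notebook's original code: the helper name `find_redundant_columns`, the 0.9 correlation threshold, and the toy data are all assumptions, and only the column names are borrowed from the list.

```python
import numpy as np
import pandas as pd


def find_redundant_columns(df, corr_threshold=0.9):
    """Return (constant_cols, correlated_cols) for the numeric columns of df.

    Hypothetical helper; the 0.9 threshold is an assumed cut-off.
    """
    numeric = df.select_dtypes(include="number")
    # A column is constant if it holds a single distinct value.
    constant_cols = [c for c in numeric.columns
                     if numeric[c].nunique(dropna=False) <= 1]
    varying = numeric.drop(columns=constant_cols)
    corr = varying.corr().abs()
    # Keep the strict upper triangle so each pair is checked once and
    # the earlier column of a correlated pair is the one that survives.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    correlated_cols = [c for c in upper.columns
                       if (upper[c] > corr_threshold).any()]
    return constant_cols, correlated_cols


# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "Price": [100.0, 250.0, 80.0, 400.0, 150.0],
    "OriginPrice": [120.0, 300.0, 96.0, 480.0, 180.0],  # always 1.2 x Price
    "Comments": [5, 300, 12, 40, 7],
    "Rate": [0, 0, 0, 0, 0],                            # constant
})

constant_cols, correlated_cols = find_redundant_columns(toy)
trimmed = toy.drop(columns=constant_cols + correlated_cols)
```

On the toy frame this keeps `Price` and `Comments`, which mirrors the idea of retaining one field per correlated group.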


Focus on some features


Visualization


Cut a slice of 0-2000 to see the distribution of each feature
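One way to read this step is as a value-range filter on a single feature followed by a histogram. The sketch below assumes that reading; the helper name `slice_and_bin`, the choice of `Price` as the feature, and the sample values are hypothetical. If the slice was instead taken over rows, `df.iloc[:2000]` would be the pandas equivalent.

```python
import numpy as np
import pandas as pd


def slice_and_bin(df, column, low=0, high=2000, bins=10):
    """Keep rows whose `column` lies in [low, high] and bin them.

    Hypothetical helper: returns the filtered frame plus histogram
    counts and bin edges for the chosen column.
    """
    subset = df[df[column].between(low, high)]  # inclusive on both ends
    counts, edges = np.histogram(subset[column], bins=bins, range=(low, high))
    return subset, counts, edges


# Made-up prices; the last two fall outside the 0-2000 slice.
prices = pd.DataFrame({"Price": [50, 150, 199, 800, 1999, 2500, 12000]})
subset, counts, edges = slice_and_bin(prices, "Price")
```

Restricting the range this way keeps a few extreme values from flattening the rest of the histogram.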

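  • The dense cluster in the lower left, near the origin, consists of items with low sales and few comments. Normally these have a large marker area (high price); any low-priced items in this region are suspected counterfeits and are worth a closer look.
  • A few scattered points are visible in the upper right. They represent low-priced, high-volume best sellers, which are a high-risk area for counterfeiting.
  • The lower right represents items with many comments but low sales, possibly caused by things like product relisting; this still needs to be checked.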
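The three regions described above can also be expressed as boolean masks, which makes it easy to pull the matching rows for inspection. This is only a sketch: it assumes the plot has comments on the x-axis and sales on the y-axis, and the column names `Sales`, `Comments`, `Price`, the cut-off values, and the region labels are all hypothetical.

```python
import numpy as np
import pandas as pd


def tag_regions(df, sales_cut=1000, comments_cut=1000, price_cut=100):
    """Label each item with the plot region it would fall into.

    All three cut-offs are assumed values chosen for illustration.
    """
    hi_sales = df["Sales"] >= sales_cut
    hi_comments = df["Comments"] >= comments_cut
    low_price = df["Price"] <= price_cut
    conditions = [
        ~hi_sales & ~hi_comments & low_price,  # lower left, but cheap
        hi_sales & hi_comments & low_price,    # upper right best sellers
        ~hi_sales & hi_comments,               # lower right
    ]
    labels = ["suspect_low_price", "best_seller", "comments_without_sales"]
    return pd.Series(np.select(conditions, labels, default="other"),
                     index=df.index)


# Toy rows, made up for illustration only.
items = pd.DataFrame({
    "Sales": [3, 5, 9000, 4, 20],
    "Comments": [1, 2, 7000, 5000, 10],
    "Price": [899, 59, 69, 499, 650],
})
regions = tag_regions(items)
```

Filtering with `items[regions == "suspect_low_price"]` would then give the rows to examine first.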

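This preliminary analysis shows that the raw product dataset is heavily skewed. In the later model training and analysis stage, a separate anomaly detection model will need to be built, which may give better results.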
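A quick way to put a number on the skew is to compare each feature's sample skewness and its mean against its median. The snippet below is a sketch on made-up values; the helper name `skew_report` and the column name `Sales` are assumptions, not taken from the notebook.

```python
import pandas as pd


def skew_report(df):
    """Summarise skewness, mean and median for each numeric column."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "skew": numeric.skew(),      # sample skewness; > 0 means a long right tail
        "mean": numeric.mean(),
        "median": numeric.median(),
    })


# Made-up sales counts with a long right tail.
sales = pd.DataFrame({"Sales": [1, 2, 3, 5, 8, 20, 60, 400, 5000]})
report = skew_report(sales)
```

A large positive skew together with a mean far above the median is the pattern expected when a few best sellers dominate the totals.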

To be continued
