Introduction

This data set is funnel data from a freelancer marketplace platform. The service uses machine learning algorithms to 1) recommend new products to users and 2) detect personal transactions made outside the platform. Before the data can be fed to these algorithms, it needs to be cleaned. We will clean the data so that it can be used for training.

Data Loading

conversion.csv

conversion contains the activity log of the website and mobile service. We can analyze it to understand user usage patterns.

  • eventcategory: the event category. Possible values are:
    • install
    • launch
    • deeplinkLaunch
    • goal
    • exit
    • foreground, background
    • launchlnSession
  • isfirstactivity: whether the event is the user's first occurrence of that activity. Boolean
  • apppackagename: unique application package name. For Android it takes the applicationId; for iOS it takes the Bundle ID
  • appversion: the application version
  • devicetype: user's device name
  • devicemanufacturer: device manufacturer name
  • osversion: the device OS version
  • canonicaldeviceuuid: unique device ID. It can be used to identify users
  • sourcetype: How the user joined the service
  • channel: detailed version of sourcetype
    • unattributed
    • WEB
    • google-play, m_naver, google, (not set), google.adwords...
  • params_campaign: campaign name parameter entered by the marketer
  • params_medium: campaign medium parameter entered by the marketer
  • params_term: campaign term parameter entered by the marketer
  • inappeventcategory: in-app event value. Hierarchical (category > action > label)
    • only available when eventcategory equals goal
    • foreign key for the funnel data set
    • e.g. seller_selling_history.view, gig_detail.view
  • inappeventlabel:
    • foreign key for category data set.
  • eventdatetime: the time at which the event occurred
  • isfirstgoalactivity: for goal events, whether this is the first occurrence of that goal event. Events are considered the same only if Goal Label, Description, Key, and Category match. Boolean
  • event_rank: used to sort the data.
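
A minimal sketch of how the log could be loaded with pandas. A small in-memory CSV with made-up rows stands in for conversion.csv so the snippet runs on its own; the variable name log and the sample values are assumptions, not taken from the original data.

```python
import io

import pandas as pd

# Stand-in for conversion.csv: made-up rows with a few of the columns above.
sample_csv = io.StringIO(
    "eventcategory,isfirstactivity,canonicaldeviceuuid,eventdatetime\n"
    "launch,True,c9f0-aa,2018-01-01 09:30:00\n"
    "goal,False,c9f0-aa,2018-01-01 09:31:10\n"
)

# With the real file this would be pd.read_csv("conversion.csv").
log = pd.read_csv(sample_csv)

# Quick look at the size and the inferred column types.
print(log.shape)
print(log.dtypes)
```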

funnel.csv

A funnel is the path a user takes up to the point of purchasing a product.

From the funnel data we can measure conversion and churn rates.

category.csv

The product category data set.

Data Cleansing

Essentially, this is log data, so it is difficult to analyze as is. We will clean the data and save it.

1. Change canonicaldeviceuuid to userid

canonicaldeviceuuid is important because it identifies users. However, its values are not easy to read. To make the data more intuitive, we will convert it to userid.
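
One way this conversion could look, sketched under the assumption that each distinct canonicaldeviceuuid should map to a small integer. pd.factorize numbers the values in order of first appearance; the sample UUIDs are made up, and starting the ids at 1 is an arbitrary choice.

```python
import pandas as pd

# Made-up device UUIDs for illustration.
log = pd.DataFrame(
    {"canonicaldeviceuuid": ["c9f0-aa", "17be-0d", "c9f0-aa", "5d2e-91"]}
)

# factorize returns integer codes (0, 1, 2, ...) in order of first appearance,
# plus the array of distinct values those codes refer to.
codes, uniques = pd.factorize(log["canonicaldeviceuuid"])

# Shift by one so that user ids start at 1 (an assumption, not a requirement).
log["userid"] = codes + 1
```

The same device always receives the same userid, so grouping by userid is equivalent to grouping by the original UUID.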

2. Convert eventdatetime to datetime type and extract the date and time
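
A short sketch of this step, assuming eventdatetime is stored as a string that pd.to_datetime can parse. The sample timestamps and the new column names event_date and event_time are hypothetical.

```python
import pandas as pd

# Made-up timestamps for illustration.
log = pd.DataFrame(
    {"eventdatetime": ["2018-01-01 09:30:00", "2018-01-02 23:05:45"]}
)

# Parse the strings into a proper datetime64 column.
log["eventdatetime"] = pd.to_datetime(log["eventdatetime"])

# Extract the calendar date and the time of day through the .dt accessor.
log["event_date"] = log["eventdatetime"].dt.date
log["event_time"] = log["eventdatetime"].dt.time
```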

3. osversion

osversion actually contains two pieces of information: the OS type (Android or iOS) and the OS version.
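
The split could be sketched as follows. The raw format of osversion is not shown here, so the snippet assumes values such as "Android 7.0", with a single space between the OS type and the version; os_type and os_version are hypothetical column names.

```python
import pandas as pd

# Made-up values; the "<type> <version>" layout is an assumption.
log = pd.DataFrame({"osversion": ["Android 7.0", "iOS 11.2.1", "Android 8.0.0"]})

# Split once on the first space: the left part is the OS type,
# the right part is the version string.
parts = log["osversion"].str.split(" ", n=1, expand=True)
log["os_type"] = parts[0]
log["os_version"] = parts[1]
```

If the real values use a different separator, only the first argument of str.split needs to change.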

4. devicemanufacturer

There are three major smartphone manufacturers: Samsung, Apple, and LG. However, there are others too, such as Xiaomi, Foxconn, Pantech, and Huawei.

Looking at the data, there are some duplicate values. For example, LGE and LG Electronics refer to the same manufacturer.
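
A sketch of how such duplicates could be collapsed with an alias mapping. Only the LGE / LG Electronics pair comes from the text; the sample rows, the target label "LG", and the manufacturer column name are assumptions.

```python
import pandas as pd

# Made-up rows for illustration.
log = pd.DataFrame(
    {"devicemanufacturer": ["LGE", "LG Electronics", "Samsung", "Apple", "LGE"]}
)

# Map every known alias to one canonical label; other values are left untouched.
aliases = {"LGE": "LG", "LG Electronics": "LG"}
log["manufacturer"] = log["devicemanufacturer"].replace(aliases)
```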

Only 2% of the data comes from manufacturers other than Samsung, Apple, and LG. It makes sense to group them as others for more efficient analysis.
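
The grouping could be sketched like this, assuming the manufacturer names have already been normalized. The sample rows are made up (so the share below is not the 2% from the real data), and the manufacturer column name is hypothetical.

```python
import pandas as pd

major = ["Samsung", "Apple", "LG"]

# Made-up rows for illustration.
log = pd.DataFrame(
    {"manufacturer": ["Samsung", "Apple", "LG", "Xiaomi", "Huawei", "Samsung"]}
)

# Keep the value where it is one of the major manufacturers,
# otherwise replace it with the label "others".
is_major = log["manufacturer"].isin(major)
log["manufacturer"] = log["manufacturer"].where(is_major, "others")

# Share of each group, as a fraction of all rows.
share = log["manufacturer"].value_counts(normalize=True)
```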

channel

The channel column shows which channel brought the user in. However, the values are not uniform and need a little cleaning to give a clearer picture.

The result shows that different services from the same company appear as separate values. For example, google-play, google, and google.adwords could be merged into one. There are also NaN values, which we will take care of as well.
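
A minimal sketch of this cleanup, assuming that every value starting with "google" belongs to the same company and that missing values should receive a placeholder. The helper clean_channel, the column channel_clean, and the placeholder "unknown" are hypothetical choices.

```python
import pandas as pd

# Made-up rows using channel values mentioned above, plus one missing value.
log = pd.DataFrame(
    {"channel": ["google-play", "google", "google.adwords", "m_naver", None, "WEB"]}
)


def clean_channel(value):
    """Merge Google services into one label and mark missing values."""
    if pd.isna(value):
        return "unknown"
    if value.startswith("google"):
        return "google"
    return value


log["channel_clean"] = log["channel"].map(clean_channel)
```

Writing the rule as a small function keeps it easy to extend when other companies turn out to have several channel names.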

The value set is cleaner now; however, a sizable share of the data (23.7%) has no channel information.

inappeventcategory

This column shows what the user is doing inside the application, for example whether the user is looking at a product page or processing an order. It is also the foreign key used to join with the funnel data. However, if you look at the values, you can see that each one is actually a combination of several pieces of information.

We are going to separate the values into smaller pieces.
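
One possible way to separate the values, assuming they follow the page.action pattern of the examples seller_selling_history.view and gig_detail.view. The column names event_page and event_action are hypothetical, and rows without a value stay missing.

```python
import pandas as pd

# The two example values from the column description, plus one missing value.
log = pd.DataFrame(
    {"inappeventcategory": ["seller_selling_history.view", "gig_detail.view", None]}
)

# Split once from the right on the last dot, so that any dots inside
# the page name would be preserved.
parts = log["inappeventcategory"].str.rsplit(".", n=1, expand=True)
log["event_page"] = parts[0]
log["event_action"] = parts[1]
```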

Dropping unnecessary columns

After the cleansing process, we are going to drop the columns that have already been processed and are no longer needed: osversion, devicemanufacturer, canonicaldeviceuuid, channel, and event_rank.
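
A short sketch of the drop. The single made-up row only exists so the snippet runs on its own; the column names are the ones listed above, plus userid from the earlier step.

```python
import pandas as pd

# One made-up row; only the column names matter here.
log = pd.DataFrame(
    {
        "userid": [1],
        "canonicaldeviceuuid": ["c9f0-aa"],
        "osversion": ["Android 7.0"],
        "devicemanufacturer": ["LGE"],
        "channel": ["google-play"],
        "event_rank": [1],
    }
)

# Columns whose information now lives in the cleaned columns.
processed = [
    "osversion",
    "devicemanufacturer",
    "canonicaldeviceuuid",
    "channel",
    "event_rank",
]
log = log.drop(columns=processed)
```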

Rearrange the order of the columns

Order the columns 1) from more important to less important and 2) so that similar columns sit together.
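
Reordering amounts to selecting the columns in the desired order. The column names and their ranking below are assumptions for illustration only.

```python
import pandas as pd

# One made-up row with columns in an arbitrary order.
log = pd.DataFrame(
    {
        "os_type": ["Android"],
        "event_time": ["09:30:00"],
        "userid": [1],
        "event_date": ["2018-01-01"],
    }
)

# Identifier first, then the date and time columns side by side,
# then the device information.
ordered = ["userid", "event_date", "event_time", "os_type"]
log = log[ordered]
```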

Merge datasets into one

To make it easier to analyze the data or train a machine learning algorithm, we are going to merge the data sets together. First, we merge the log and funnel data using merge. The key column is view_id.
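
A sketch of this merge with made-up frames. The how="left" join (keep every log row), the validate check, and the indicator column are assumptions added here to make the result easy to verify; the view_id values and funnel_name labels are hypothetical.

```python
import pandas as pd

# Made-up log rows; the last view_id has no match in the funnel data.
log = pd.DataFrame(
    {
        "userid": [1, 2, 3],
        "view_id": ["gig_detail.view", "seller_selling_history.view", "other.view"],
    }
)

# Made-up funnel rows, one per view_id.
funnel = pd.DataFrame(
    {
        "view_id": ["gig_detail.view", "seller_selling_history.view"],
        "funnel_name": ["browse", "seller"],
        "depth": [1, 2],
    }
)

# validate raises if view_id is not unique in funnel, which would
# otherwise silently duplicate log rows.
merged = log.merge(
    funnel, on="view_id", how="left", validate="many_to_one", indicator=True
)

# Rows that found no partner in the funnel data.
unmatched = int((merged["_merge"] == "left_only").sum())
```

Comparing len(merged) with len(log) and looking at unmatched is a quick way to confirm that the merge behaved as expected.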

Data seems to be merged without an issue.

Next, we merge the log and category data, using 'in_app_event_label' and 'category_id' as the keys.
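
Because the key has a different name on each side, the merge needs left_on and right_on. The frames below are made up, the integer ids are hypothetical, and how="left" is an assumption that keeps every log row.

```python
import pandas as pd

# Made-up log rows; label 999 has no matching category.
log = pd.DataFrame({"userid": [1, 2, 3], "in_app_event_label": [101, 102, 999]})

# Made-up category rows.
category = pd.DataFrame({"category_id": [101, 102], "category1_id": [1, 1]})

merged = log.merge(
    category,
    left_on="in_app_event_label",
    right_on="category_id",
    how="left",
)

# Log rows whose label did not match any category.
missing_category = int(merged["category_id"].isna().sum())
```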

Data seems to be merged without an issue.

Drop unnecessary columns and rearrange for finalizing

The following columns are not going to be used, so they will be dropped:

  • in_app_event_category
  • in_app_event_label
  • source_type
  • Lv1, Lv2
  • funnel_name, depth
  • category_id, category1_id, category2_id, category3_id

Reindex the data frame

row_uuid is the unique ID of each row. Let's make it the index and drop the column afterward, since it would otherwise be redundant.
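
A minimal sketch of this step with made-up rows. set_index removes the column from the frame by default (drop=True), so no separate drop call is needed in this version.

```python
import pandas as pd

# Made-up rows for illustration.
log = pd.DataFrame({"row_uuid": ["r-001", "r-002"], "userid": [1, 2]})

# Promote row_uuid to the index; the column itself is removed in the same call.
log = log.set_index("row_uuid")
```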
