
Notes:

Notice that we have three potential target variables. While count is the exact target, we could also predict casual and registered separately and add them to obtain count... we'll see later how we calculate the final predictions.


Notes:

  • The date variable is stored as an object, so you will have to convert it to a datetime type (a quick sketch follows below)
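
A minimal sketch of that conversion with pandas (the column name here is illustrative, not necessarily the notebook's exact one):

import pandas as pd

# Toy frame standing in for the raw data
df = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00"]})
print(df["datetime"].dtype)   # object

df["datetime"] = pd.to_datetime(df["datetime"])
print(df["datetime"].dtype)   # datetime64[ns]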

Data Cleaning

Notes:

  • The only missing values are in the target variables... which is how it should be!

Feature Engineering

Notes:

  • extract_dateinfo function is a fast.ai function.
    • From a date column, it is able to derive many calculated features!
    • It automates the feature engineering process.
  • custom_datainfo function is a function I created (a rough sketch of the idea follows below).
    • It focuses on renaming or creating arbitrary columns (like rush_hour)
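
The actual helper lives in the notebook code; as a hypothetical illustration of the custom_datainfo idea, a rush_hour flag could be built like this (the column name and hour ranges are assumptions):

import pandas as pd

def custom_datainfo(df):
    # Hypothetical re-creation: flag typical commute hours as rush_hour
    df = df.copy()
    df["rush_hour"] = df["Hour"].isin([7, 8, 9, 16, 17, 18, 19]).astype(int)
    return df

print(custom_datainfo(pd.DataFrame({"Hour": [3, 8, 12, 18]})))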

Data Visualization

Notes:

  • For casual, the # of bikes being used has a huge spike at the low end
  • For registered, the distribution is flatter (indicating that registered users are more common at a daily level)
  • We are better off applying log1p to get a distribution closer to normal.

A log transformation better represents the data when it is highly skewed. For example, if our data contained values like 0, 1, 10, and 100, it would be skewed by the 100. When transforming 0, 1, 10, and 100 with log1p, we compress the jumps between the values onto a comparable scale:

import numpy as np

log_transform = [np.log1p(i) for i in [0, 1, 10, 100]]
log_transform

# Output: [0.0, 0.6931471805599453, 2.3978952727983707, 4.61512051684126]
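
Applied to the targets, the same idea looks roughly like this; np.expm1 is the exact inverse of np.log1p, which is how predictions get mapped back to counts (the toy frame is illustrative):

import numpy as np
import pandas as pd

# Toy frame standing in for the training targets
df = pd.DataFrame({"casual": [0, 3, 40], "registered": [10, 120, 300]})
df["count"] = df["casual"] + df["registered"]

# Model on the log scale...
for col in ["casual", "registered", "count"]:
    df[col + "_log"] = np.log1p(df[col])

# ...and map predictions back with the inverse transform
assert np.allclose(np.expm1(df["count_log"]), df["count"])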

Notes:

  • Count has the largest variation

Notes:

  • Notice the increase during peak hours in both 2011 and 2012.

Notes:

  • Every season has a similar trend except for the spring
    • 1 = spring, 2 = summer, 3 = fall, 4 = winter

Notes:

  • The season and weather visualizations show that:
    • All seasons follow a similar distribution along the x-axis (hours), but spring has a significantly lower count
    • Clear weather attracts more people (higher count)

Notes:

  • Casual users are more common on the weekend, while registered users are more common on weekdays

Data Cleaning

Model Selection: Correlation

Notes:

  • The top correlated variables with the targets are Hour, Days_in_year, frac_day, and Elapsed (a correlation sketch follows below)
  • The correlation values are similar across the three targets, with the exception of workingday
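
A sketch of how such a correlation check can be produced (synthetic data here; the notebook uses the engineered training frame):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
features = pd.DataFrame({
    "Hour": rng.integers(0, 24, 500),
    "temp": rng.uniform(0, 40, 500),
    "humidity": rng.uniform(0, 100, 500),
})
features["count"] = 5 * features["Hour"] + 0.5 * features["temp"] + rng.normal(0, 10, 500)

# Correlation of every column with the target, strongest first
print(features.corr()["count"].sort_values(ascending=False))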

Data Cleaning: Removing Columns

Notes:

  • After checking the correlation values, there are a couple of columns that represent the same information (see the sketch after this list). Some of the redundant pairs are:
    • Hour and frac_day
    • temp and atemp
    • Days_in_year and is_leap_year
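
Dropping one column from each redundant pair might look like this (a toy frame with the assumed column names):

import pandas as pd

# Toy frame containing the redundant pairs listed above
df = pd.DataFrame(columns=["Hour", "frac_day", "temp", "atemp",
                           "Days_in_year", "is_leap_year", "count"])

# Keep one column from each pair; errors="ignore" skips any name that is absent
df = df.drop(columns=["frac_day", "atemp", "is_leap_year"], errors="ignore")
print(df.columns.tolist())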

Data Cleaning: Dummies

To encode the categorical variables, we will be using pd.get_dummies().
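
For example, on a couple of assumed categorical columns:

import pandas as pd

df = pd.DataFrame({"season": [1, 2, 3, 4], "weather": [1, 1, 2, 3]})

# One-hot encode the categorical columns; drop_first avoids a redundant dummy per column
dummies = pd.get_dummies(df, columns=["season", "weather"], drop_first=True)
print(dummies.head())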

Feature Engineering

Data Modeling

Notes:

  • factorplot is an excellent way of creating charts that can be easily manipulated and tweaked to include additional features... it's an effective way to automate visualization with seaborn's API! (A small sketch follows below.)
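
A small sketch of the idea (factorplot was renamed to catplot in newer seaborn releases; the data below is illustrative):

import pandas as pd
import seaborn as sns

# Toy frame standing in for the bike-sharing data
df = pd.DataFrame({
    "Hour": [8, 8, 17, 17, 2, 2],
    "count": [120, 90, 150, 160, 10, 15],
    "workingday": [1, 0, 1, 0, 1, 0],
})

# One call produces a point plot per hour, split by an extra feature via hue
sns.catplot(data=df, x="Hour", y="count", hue="workingday", kind="point")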

Notes:

  • I would like to use all three target variables to make predictions.
  • However, finding the appropriate weight to each of the target variables is more complicated than it seems.
  • I am going to make two predictions (to gain a better understanding of their performance); see the sketch after this list:
    • Add the Casual + Registered predictions
    • Use the Count predictions
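
A hedged sketch of the two strategies, using placeholder log-scale predictions (the arrays stand in for real model output):

import numpy as np

# Placeholder log-scale predictions standing in for real model output
pred_casual_log = np.array([1.2, 2.5, 0.8])
pred_registered_log = np.array([3.4, 4.1, 2.9])
pred_count_log = np.array([3.6, 4.3, 3.0])

# Strategy 1: predict casual and registered separately, then add the back-transformed counts
count_from_parts = np.expm1(pred_casual_log) + np.expm1(pred_registered_log)

# Strategy 2: predict count directly
count_direct = np.expm1(pred_count_log)

print(count_from_parts, count_direct)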

Model: XGBRegressor

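A minimal sketch of fitting an XGBRegressor on a log-transformed target (synthetic data and rough hyperparameters; the notebook's actual settings may differ):

import numpy as np
import pandas as pd
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({"Hour": rng.integers(0, 24, 300),
                  "temp": rng.uniform(0, 40, 300)})
y = 5 * X["Hour"] + rng.normal(0, 10, 300)

# Train on the log scale, as with the earlier transform
model_xgb = XGBRegressor(n_estimators=200, max_depth=5, learning_rate=0.1)
model_xgb.fit(X, np.log1p(np.clip(y, 0, None)))

# Back-transform the predictions to counts
preds = np.expm1(model_xgb.predict(X))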

Scoring Performance:

  • model_xgb_all: 0.41202
  • model_xgb_feat: 0.42347
  • model_xgb_agg_all: 0.41332
  • model_xgb_agg_feat: 0.41574

Model: GradientBoostingRegressor

Scoring Performance:

  • model_gb_all: 0.43123
  • model_gb_feat: 0.44574
  • model_gb_agg_all: 0.44070
  • model_gb_agg_feat: 0.45243

Model: RandomForestRegressor

Notes:

  • For some reason, Random Forest has terrible performance. This is something we should look into!

Scoring Performance:

  • model_rf_all: 0.99251
  • model_rf_feat: 0.75609
  • model_rf_agg_all: 0.72469
  • model_rf_agg_feat: 1.04207

Model: Advanced (Combining Models)

Scoring Performance:

  • Weighted Average: 0.41143
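
A sketch of the weighted-average combination (the weights and arrays here are placeholders, not the tuned values):

import numpy as np

# Placeholder count-scale predictions from two of the individual models
pred_xgb = np.array([120.0, 45.0, 300.0])
pred_gb = np.array([110.0, 50.0, 280.0])

# Hypothetical weights favouring the stronger model
weighted_avg = 0.7 * pred_xgb + 0.3 * pred_gb
print(weighted_avg)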