Notice that we have three potential target variables. While count is the exact target, we could also predict casual and registered and sum them to obtain count... we'll see how we calculate the predictions later.

- The date variable is an object, so you will have to convert it to a datetime type
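A minimal sketch of that conversion with pandas (the `datetime` column name and sample values are stand-ins, not the notebook's actual data):

```python
import pandas as pd

# Small stand-in frame; in the notebook this is the loaded train/test data
df = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00"]})

# The date column is read in as an object (string) dtype,
# so convert it to a proper datetime64 dtype first
df["datetime"] = pd.to_datetime(df["datetime"])

print(df["datetime"].dtype)  # datetime64[ns]
```

Once the column is a real datetime, attributes like hour and day of week become available through the `.dt` accessor.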

- The only missing values are in the target variables... which is how it should be!

`extract_dateinfo` is a fast.ai function.

- From a date datatype, it is able to derive many features!
- It automates the feature engineering process.

`custom_datainfo` is a function I created.

- It focuses on renaming or creating arbitrary columns (like rush_hour)
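The notebook doesn't show `custom_datainfo` itself, but the idea can be sketched like this (the commute-hour windows and the `Hour` column name are assumptions for illustration):

```python
import pandas as pd

def custom_datainfo(df, hour_col="Hour"):
    """Hypothetical sketch of a hand-made feature function.

    The real custom_datainfo is not shown in this write-up; this only
    illustrates creating an arbitrary column such as rush_hour.
    """
    df = df.copy()
    # Flag typical commute windows (assumed: 7-9am and 4-7pm)
    df["rush_hour"] = df[hour_col].isin([7, 8, 9, 16, 17, 18, 19]).astype(int)
    return df

demo = pd.DataFrame({"Hour": [3, 8, 12, 17]})
print(custom_datainfo(demo)["rush_hour"].tolist())  # [0, 1, 0, 1]
```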

- For casual, the # of bikes being used has a huge spike at the low end
- For registered, the distribution is flatter (indicating that registered users are more common at a daily level)
- We are better off applying `np.log1p` to get a closer-to-normal distribution.

A log transform better represents the data when the data is highly skewed. For example, if our data contained values like 0, 1, 10, and 100, the 100 would skew the distribution. When transforming 0, 1, 10, and 100 with `np.log1p`, we compress the jumps between 0, 1, 10, and 100:

```python
import numpy as np

log_transform = [np.log1p(i) for i in [0, 1, 10, 100]]
log_transform
# Output: [0.0, 0.6931471805599453, 2.3978952727983707, 4.61512051684126]
```

- Count has the largest variation

- Notice the increase during peak hours in both 2011 and 2012.

- Every season has a similar trend except for spring
- 1 = spring, 2 = summer, 3 = fall, 4 = winter

- The season and weather visualization shows that:
  - All seasons follow a similar distribution along the x-axis (hours), but spring has a significantly lower count
  - Clear weather attracts more people (higher count)

- Casual users are more common on weekends, while registered users are more common on weekdays
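That weekday/weekend split can be checked with a simple groupby. A sketch on toy data (the column names and numbers below are illustrative, not the dataset's):

```python
import pandas as pd

# Toy data standing in for the notebook's frame; dayofweek: 0=Mon ... 6=Sun
df = pd.DataFrame({
    "dayofweek":  [0, 0, 5, 5, 6, 6],
    "casual":     [10, 12, 80, 90, 85, 95],
    "registered": [200, 210, 60, 70, 65, 75],
})

# Average usage per user type, split into weekdays vs weekends
df["is_weekend"] = df["dayofweek"] >= 5
print(df.groupby("is_weekend")[["casual", "registered"]].mean())
```

On the toy numbers, casual averages jump on weekends while registered averages drop, mirroring the pattern in the plot.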

- The variables most correlated with the targets are Hour, Days_in_year, frac_day, and Elapsed
- The correlation values are similar across the three targets, with the exception of workingday

- After checking the correlation values, there are a couple of columns that represent the same information. Some of the redundant pairs are:
  - Hour and frac_day
  - temp and atemp
  - Days_in_year and is_leap_year
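Redundant pairs like these show up as near-1.0 correlations. A sketch of the check (synthetic data: `frac_day` here is just `Hour` rescaled, as an assumed definition for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
hour = rng.integers(0, 24, 500)

# If frac_day is hour rescaled to [0, 1), the two columns are
# perfectly linearly related and one of them can be dropped
df = pd.DataFrame({"Hour": hour, "frac_day": hour / 24.0})

corr = df["Hour"].corr(df["frac_day"])
print(round(corr, 4))  # 1.0
```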

To encode the categorical variables, we will use `pd.get_dummies()`.
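A minimal example of that encoding (column names match the dataset's categoricals; the tiny frame is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"season": [1, 2, 3, 4], "weather": [1, 1, 2, 3]})

# One-hot encode the categorical columns; drop_first avoids
# the redundant "dummy trap" column per category
encoded = pd.get_dummies(df, columns=["season", "weather"], drop_first=True)
print(list(encoded.columns))
# ['season_2', 'season_3', 'season_4', 'weather_2', 'weather_3']
```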

`factorplot` is an excellent way of creating charts that can be easily manipulated and tweaked to include additional features... it's an effective method to automate visualization with seaborn's API!

- I would like to use all three target variables to make predictions.
- However, finding the appropriate weight for each target variable is more complicated than it seems.
- I am going to make two predictions (to gain a better understanding of their performance):
  - Add the Casual + Registered predictions
  - Use the Count predictions
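Since the targets were trained on the `log1p` scale, both strategies can be sketched as follows (the prediction arrays and their names are hypothetical placeholders for the model outputs):

```python
import numpy as np

# Hypothetical model outputs on the log1p scale (values are illustrative)
log_casual_pred     = np.array([1.2, 2.5])
log_registered_pred = np.array([3.1, 4.0])
log_count_pred      = np.array([3.3, 4.2])

# Strategy 1: invert the log1p transform, then add Casual + Registered
count_from_parts = np.expm1(log_casual_pred) + np.expm1(log_registered_pred)

# Strategy 2: invert the Count model's predictions directly
count_direct = np.expm1(log_count_pred)

print(count_from_parts, count_direct)
```

Note that `expm1` is the exact inverse of `log1p`, so the back-transformed predictions land on the original count scale.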

- `model_xgb_all`: 0.41202
- `model_xgb_feat`: 0.42347
- `model_xgb_agg_all`: 0.41332
- `model_xgb_agg_feat`: 0.41574

- `model_gb_all`: 0.43123
- `model_gb_feat`: 0.44574
- `model_gb_agg_all`: 0.44070
- `model_gb_agg_feat`: 0.45243

- For some reason, Random Forest has terrible performance. This should be something we look into!

- `model_rf_all`: 0.99251
- `model_rf_feat`: 0.75609
- `model_rf_agg_all`: 0.72469
- `model_rf_agg_feat`: 1.04207

- `Weighted Average`: 0.41143