In this kernel I'm working with data from the TMDB Box Office Prediction Challenge. The film industry is booming and revenues are growing, so we have a lot of data about films. Can we build models that accurately predict film revenues? Could these models be used to suggest changes to movies that increase their revenues even further? I'll try to answer these questions in my kernel!
(Screenshot of the main page of https://www.themoviedb.org/)
There are only 3000 samples in train data! Let's hope this is enough to train models.
We can see that some of the columns contain lists of dictionaries. Some lists contain a single dictionary, some have several. Let's extract data from these columns!
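A minimal sketch of the extraction step, assuming the list-of-dictionaries columns arrive as strings that look like Python literals. The helper name `parse_dict_list` and the toy rows are made up for illustration; they are not taken from the competition files.

```python
import ast

import pandas as pd


def parse_dict_list(value):
    """Turn a stringified list of dicts into a real list; missing values become []."""
    if isinstance(value, str) and value.strip():
        return ast.literal_eval(value)
    return []


# Toy rows standing in for the real train data (values are invented).
toy = pd.DataFrame({
    "genres": [
        "[{'id': 35, 'name': 'Comedy'}]",
        "[{'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}]",
        None,  # an empty cell
    ]
})

toy["genres"] = toy["genres"].apply(parse_dict_list)
genre_counts = toy["genres"].apply(len).tolist()
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text read from a file.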
2396 values in this column are empty, 604 contain information about the collections. I suppose that only the collection name can be useful. Another possibly useful feature is whether the film belongs to a collection at all.
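The two collection features could be built roughly like this. This is a sketch on invented rows, assuming the column has already been parsed into Python lists; the output column names `collection_name` and `has_collection` are my own choice.

```python
import pandas as pd

# Toy rows: one film in a collection, two without (values are invented).
toy = pd.DataFrame({
    "belongs_to_collection": [
        [{"id": 1, "name": "Example Collection"}],
        [],
        [],
    ]
})

# Name of the collection, or an empty string when there is none.
toy["collection_name"] = toy["belongs_to_collection"].apply(
    lambda items: items[0]["name"] if items else ""
)
# Binary flag: 1 if the film belongs to any collection, else 0.
toy["has_collection"] = toy["belongs_to_collection"].apply(
    lambda items: int(len(items) > 0)
)
```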
The genres column contains the names and ids of the genres to which films belong. Most films have 2-3 genres, and 5-6 genres are possible. 0 and 7 are outliers, I think. Let's extract genres! I'll create a column with all genres of the film and also separate columns for each genre.
But first let's have a look at the genres themselves.
Drama, Comedy and Thriller are popular genres.
I'll create separate columns for top-15 genres.
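A sketch of how the genre columns might be built, again on invented rows with already parsed lists. `top_k` is 2 here only to keep the toy example small (the text above uses 15), and the column names `num_genres`, `all_genres` and `genre_<name>` are illustrative.

```python
from collections import Counter

import pandas as pd

toy = pd.DataFrame({
    "genres": [
        [{"id": 18, "name": "Drama"}, {"id": 53, "name": "Thriller"}],
        [{"id": 35, "name": "Comedy"}],
        [{"id": 35, "name": "Comedy"}, {"id": 18, "name": "Drama"}],
        [{"id": 18, "name": "Drama"}],
        [],
    ]
})

# Sorted list of genre names per film.
genre_names = toy["genres"].apply(lambda gs: sorted(g["name"] for g in gs))
toy["num_genres"] = genre_names.apply(len)
toy["all_genres"] = genre_names.apply(" ".join)

# Count genres over all films and keep the most frequent ones.
top_k = 2
counts = Counter(name for names in genre_names for name in names)
top_genres = [name for name, _ in counts.most_common(top_k)]

# One binary column per top genre.
for genre in top_genres:
    toy["genre_" + genre] = genre_names.apply(lambda names, g=genre: int(g in names))
```

The same pattern should carry over to production companies, countries, keywords, cast and crew by swapping the column name and `top_k`.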
Most films have 1-2 production companies, sometimes 3-4. But there are films with 10+ companies! Let's have a look at some of them.
For now I'm not sure what to do with this data. I'll simply create binary columns for the top-30 companies. Maybe later I'll have a better idea.
Normally films are produced by a single country, but there are cases when companies from several countries worked together.
Here we have some keywords describing films. Of course there can be a lot of them. Let's have a look at the most common ones.
The cast heavily impacts the quality of a film. We have not only the name of each actor, but also the gender and character name/type.
First, let's have a look at the popular names.
0 is unspecified, 1 is female, and 2 is male. (https://www.kaggle.com/c/tmdb-box-office-prediction/discussion/80983#475572)
I think it is quite funny that the most popular male role is playing himself. :)
A great crew is very important in creating a film. We have not only the names of the crew members, but also their genders, jobs and departments.
First, let's have a look at the popular names.
As we can see, the revenue distribution is highly skewed! It is better to use np.log1p of revenue.
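A small sketch of the transformation on invented revenue values. The point is only that `np.log1p` compresses the long right tail and that `np.expm1` inverts it, which is what predictions made on the log scale would need.

```python
import numpy as np
import pandas as pd

# Invented revenues with one blockbuster to mimic a long right tail.
revenue = pd.Series(
    [100_000, 1_000_000, 5_000_000, 20_000_000, 1_500_000_000], dtype="float64"
)

log_revenue = np.log1p(revenue)   # log(1 + x), safe for zero revenue
raw_skew = revenue.skew()
log_skew = log_revenue.skew()     # noticeably smaller than raw_skew here

# expm1 is the exact inverse, used to map log-scale values back to revenue.
restored = np.expm1(log_revenue)
```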
We can see that budget and revenue are somewhat correlated. A logarithmic transformation makes the budget distribution more manageable.
Most of homepages are unique, so this feature may be useless.
Films with a homepage tend to generate more revenue! I suppose people can learn more about the film thanks to the homepage.
As we know, there are many more English films and they have a wider range of values. Films with the highest revenue are usually in English, but there are also high-revenue films in other languages.
Let's try to see which words have a high impact on the revenue. I'll build a simple model and use ELI5 for this.
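A rough sketch of the idea, not the exact model from this kernel: TF-IDF features from the overview text fed into a linear regression on log revenue. Instead of ELI5 it reads the weights straight from `coef_`, which is the same information ELI5 would display for a linear model. The overviews and targets are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression

# Invented overviews with invented log-revenue targets.
overviews = [
    "superhero saves the city",
    "superhero team battles aliens",
    "quiet family drama",
    "quiet village romance",
]
log_revenue = np.array([19.0, 20.0, 13.0, 12.0])

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(overviews)
model = LinearRegression().fit(X, log_revenue)

# Map each word to its learned weight and list the most positive ones.
weights = {word: model.coef_[idx] for word, idx in vectorizer.vocabulary_.items()}
top_words = sorted(weights, key=weights.get, reverse=True)[:3]
```

With this little data the weights mean nothing; on the real overviews a regularized model such as Ridge may give more stable weights.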
We can see that some words can be used to predict revenue, but we will need more than the overview text to build a good model.
I'm not exactly sure what popularity represents. Maybe it is some kind of weighted rating, maybe something else. It seems it has low correlation with the target.
We can see that number of films and total revenue are growing, which is to be expected. But there were some years in the past with a high number of successful films, which brought high revenue.
Surprisingly, films released on Wednesdays and Thursdays tend to have a higher revenue.
It seems that most of the films are 1.5-2 hours long, and films with the highest revenue are also in this range.
As we can see, only 4 films in train data and 7 in test aren't released yet, so this feature is quite useless.
Films that are part of a collection usually have higher revenues. I suppose such films have a bigger fan base thanks to previous films.
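A sketch of how the weekday comparison could be computed. The dates and revenues are invented, and ISO-formatted date strings are an assumption made for this example; the real column may need a different parsing step.

```python
import pandas as pd

toy = pd.DataFrame({
    "release_date": ["2019-05-01", "2019-05-02", "2019-05-03", "2019-05-08"],
    "revenue": [300.0, 200.0, 50.0, 100.0],
})

toy["release_date"] = pd.to_datetime(toy["release_date"])
# Weekday name of the release date, e.g. "Wednesday".
toy["release_weekday"] = toy["release_date"].dt.day_name()

# Average revenue per release weekday.
mean_by_weekday = toy.groupby("release_weekday")["revenue"].mean()
```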
Some genres tend to have less revenue, some tend to have higher.
There are only a couple of companies that have distinctly higher revenues compared to others.
In fact, I think that the number of production countries hardly matters. Most films are produced by 1-2 companies, so films with 1-2 companies have the highest revenue.
There are only a couple of countries that have distinctly higher revenues compared to others.
We can see that important features native to LGB and top features in ELI5 are mostly similar. This means that our model is quite good at working with these features.
SHAP provides more detailed information even if it may be more difficult to understand.
For example, a low budget has a negative impact on revenue, while high budgets usually tend to bring higher revenue.
Here we can see interactions between important features. There are some interesting things here. For example, the relationship between release_date_year and log_budget. Up to ~1990 low-budget films brought higher revenues, but after 2000 high budgets tended to be correlated with higher revenues. And in general the effect of budget diminished.
Let's create new features as interactions between top important features. Some of them make little sense, but maybe they could improve the model.
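One possible way to generate such interaction features, sketched on invented values. The helper `add_interactions` and the `_x_` / `_div_` naming scheme are hypothetical, and pairwise products and ratios are just one simple choice of interaction.

```python
import pandas as pd

toy = pd.DataFrame({
    "log_budget": [16.0, 18.0, 12.0],
    "popularity": [8.0, 20.0, 2.0],
    "release_date_year": [1995, 2010, 1980],
})


def add_interactions(df, columns):
    """Add the product and ratio of every pair of the given columns."""
    out = df.copy()
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            out[f"{a}_x_{b}"] = out[a] * out[b]
            # A zero in the denominator yields inf, so check those columns first.
            out[f"{a}_div_{b}"] = out[a] / out[b]
    return out


with_interactions = add_interactions(
    toy, ["log_budget", "popularity", "release_date_year"]
)
```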
I'm adding external features from this kernel: https://www.kaggle.com/kamalchhirang/eda-feature-engineering-lgb-xgb-cat by kamalchhirang. All credit for these features goes to him and his kernel.