General information


In this kernel I'm working with data from the TMDB Box Office Prediction Challenge. The film industry is booming and revenues are growing, so we have a lot of data about films. Can we build models that accurately predict film revenues? Could these models be used to suggest changes to movies that increase their revenues even further? I'll try to answer these questions in my kernel!

(Screenshot of the main page of https://www.themoviedb.org/)


Data loading and overview


There are only 3000 samples in the train data! Let's hope this is enough to train models.

We can see that some of the columns contain lists of dictionaries. Some lists contain a single dictionary, some have several. Let's extract data from these columns!
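
As a rough sketch of that extraction step, the snippet below assumes the list-of-dictionary columns arrive as strings holding Python literals, with missing values left empty or as NaN. The helper name `parse_dict_list` and the sample string are made up for illustration.

```python
import ast


def parse_dict_list(value):
    """Turn a stringified list of dicts into a real list; empty or missing -> []."""
    if isinstance(value, list):
        return value  # already parsed
    if not isinstance(value, str) or not value.strip():
        return []  # NaN (a float) or an empty string
    return ast.literal_eval(value)  # safe parsing of Python literals only


# Illustrative cell content, not taken from the dataset.
raw_genres = "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]"
genres = parse_dict_list(raw_genres)
genre_names = [g["name"] for g in genres]
print(genre_names)
```

On a DataFrame the same helper could be applied column by column, for example with `df[col].apply(parse_dict_list)`.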

belongs_to_collection


2396 values in this column are empty and 604 contain information about the collections. I suppose that only the collection name can be useful. Another possibly useful feature is the fact of belonging to a collection at all.
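
Those two features could look roughly like this, assuming each cell has already been parsed into a list of dictionaries (empty when the film has no collection). The function name, the output keys and the sample entry are hypothetical.

```python
def collection_features(collection):
    """Build two features from a parsed belongs_to_collection cell."""
    if collection:
        return {"collection_name": collection[0]["name"], "has_collection": 1}
    return {"collection_name": "", "has_collection": 0}


# Made-up entries for illustration.
in_collection = collection_features([{"id": 1, "name": "Example Collection"}])
standalone = collection_features([])
print(in_collection, standalone)
```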

genres


The genres column contains the names and ids of the genres to which films belong. Most films have 2-3 genres, and 5-6 genres are possible. 0 and 7 are outliers, I think. Let's extract genres! I'll create a column with all genres of the film and also separate columns for each genre.

But first let's have a look at the genres themselves.


Drama, Comedy and Thriller are popular genres.


I'll create separate columns for top-15 genres.
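
A minimal sketch of the top-k binary columns, assuming the genre names have already been pulled out into one list per film. The toy data, the column naming scheme (`genre_<name>`, `num_genres`, `all_genres`) and `TOP_K = 2` are illustrative; the text above uses the top 15.

```python
from collections import Counter

# One list of genre names per film (made-up data).
films = [
    ["Comedy", "Drama"],
    ["Drama", "Thriller"],
    ["Drama"],
    ["Comedy"],
    [],
]
TOP_K = 2  # 15 in the text; kept small for this toy example

# Count how often each genre occurs across all films.
counts = Counter(name for names in films for name in names)
top_genres = [name for name, _ in counts.most_common(TOP_K)]

rows = []
for names in films:
    row = {"num_genres": len(names), "all_genres": " ".join(sorted(names))}
    for genre in top_genres:
        row[f"genre_{genre}"] = int(genre in names)  # 1 if the film has this genre
    rows.append(row)

print(top_genres)
print(rows[1])
```

The same pattern would carry over to the other list-like columns (companies, countries, keywords, cast, crew) with a different `TOP_K`.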

production_companies


Most films have 1-2 production companies, sometimes 3-4. But there are films with 10+ companies! Let's have a look at some of them.


For now I'm not sure what to do with this data. I'll simply create binary columns for the top-30 companies. Maybe later I'll have a better idea.

production_countries


Normally films are produced by a single country, but there are cases when companies from several countries worked together.


Spoken languages


Keywords


Here we have some keywords describing films. Of course there can be a lot of them. Let's have a look at the most common ones.


cast


The cast heavily impacts the quality of the film. We have not only the name of each actor, but also the gender and the character name/type.

First let's have a look at the popular names.


I think it is quite funny that the most popular male role is playing himself. :)

crew


A great crew is very important in creating a film. We have not only the names of the crew members, but also their genders, jobs and departments.

First let's have a look at the popular names.


Data exploration


Target


As we can see, the revenue distribution is highly skewed! It is better to use np.log1p of revenue.
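
A small sketch of the transform and its inverse, using made-up revenue values. `np.log1p(x)` computes log(1 + x), so zero revenue maps to zero instead of minus infinity, and `np.expm1` undoes it when predictions need to go back to the original scale.

```python
import numpy as np

# Made-up revenues spanning several orders of magnitude.
revenue = np.array([0.0, 1_000.0, 1_000_000.0, 1_000_000_000.0])

log_revenue = np.log1p(revenue)   # compresses the long right tail
restored = np.expm1(log_revenue)  # inverse transform, back to the original scale

print(log_revenue)
print(np.allclose(restored, revenue))
```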

Budget


We can see that budget and revenue are somewhat correlated. A logarithmic transformation makes the budget distribution more manageable.

homepage


Most homepages are unique, so this feature may be useless as it is.


Films with a homepage tend to generate more revenue! I suppose people can learn more about the film thanks to the homepage.
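
Since the URLs themselves are mostly unique, one way to keep the signal is a binary flag. The sketch below uses a tiny made-up DataFrame and a hypothetical column name `has_homepage`, and compares median revenue between the two groups.

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset.
df = pd.DataFrame({
    "homepage": ["http://example.com/a", None, None, "http://example.com/b"],
    "revenue": [500.0, 100.0, 200.0, 900.0],
})

# 1 if the film has a homepage, 0 otherwise.
df["has_homepage"] = df["homepage"].notna().astype(int)

# Median revenue per group.
median_by_flag = df.groupby("has_homepage")["revenue"].median()
print(median_by_flag)
```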

original_language


As we know, there are many more English-language films, and their revenues cover a wider range of values. Films with the highest revenue are usually in English, but there are also high-revenue films in other languages.

original_title


It can be interesting to see which words are common in titles.


overview


Let's try to see which words have high impact on the revenue. I'll build a simple model and use ELI5 for this.


We can see that some words can be used to predict revenue, but we will need more than the overview text to build a good model.

popularity


I'm not exactly sure what popularity represents. Maybe it is some kind of weighted rating, maybe something else. It seems to have a low correlation with the target.


release_date


We can see that the number of films and the total revenue are growing, which is to be expected. But there were some years in the past with a high number of successful films, which brought high revenue.


Surprisingly, films released on Wednesdays and Thursdays tend to have a higher revenue.
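
Year and weekday features like the ones discussed here can be derived as below. The sketch assumes the date is already an ISO string (`YYYY-MM-DD`); the raw column may use a different format and would need its own parsing. Apart from `release_date_year`, the feature names are hypothetical.

```python
from datetime import date


def date_features(release_date):
    """Split an ISO date string into simple calendar features."""
    d = date.fromisoformat(release_date)
    return {
        "release_date_year": d.year,
        "release_date_month": d.month,
        "release_date_weekday": d.weekday(),  # Monday=0 ... Sunday=6
    }


features = date_features("2015-02-20")  # an arbitrary example date (a Friday)
print(features)
```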

runtime


The length of the film in minutes


It seems that most of the films are 1.5-2 hours long, and the films with the highest revenue are also in this range.

Status


As we can see, only 4 films in the train data and 7 in the test data aren't released yet, so this feature is quite useless.

tagline


Collections


Films that are part of a collection usually have higher revenues. I suppose such films have a bigger fan base thanks to the previous films.

Genres


Some genres tend to have lower revenue, some tend to have higher.

Production companies


There are only a couple of companies that have distinctly higher revenues compared to the others.

Production countries


In fact I think that the number of production countries hardly matters. Most films are produced by 1-2 countries, so films with 1-2 countries have the highest revenue.


There are only a couple of countries that have distinctly higher revenues compared to the others.

Cast


Keywords


Crew


Modelling and feature generation


OOF features based on texts


Additional feature generation


Important features


Let's have a look at important features using ELI5 and SHAP!


We can see that the feature importances native to LGB and the top features in ELI5 are mostly similar. This suggests that our model is quite good at working with these features.


SHAP provides more detailed information even if it may be more difficult to understand.

For example, a low budget has a negative impact on predicted revenue, while high budgets usually push it higher.


Here we can see interactions between important features. There are some interesting things here, for example the relationship between release_date_year and log_budget. Up to ~1990 low-budget films brought higher revenues, but after 2000 high budgets tended to be correlated with higher revenues. And in general the effect of budget diminished.

Let's create new features as interactions between top important features. Some of them make little sense, but maybe they could improve the model.
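
One possible way to build such interactions is shown below: a product and a ratio for every pair of columns. Which operations and which columns the kernel actually uses is not stated here, so the helper `add_interactions`, the naming scheme and the example row are assumptions.

```python
from itertools import combinations


def add_interactions(row, columns):
    """Add a product and a ratio feature for every pair of the given columns."""
    out = dict(row)
    for a, b in combinations(columns, 2):
        out[f"{a}_x_{b}"] = row[a] * row[b]
        # Guard against division by zero; 0.0 is an arbitrary fallback.
        out[f"{a}_div_{b}"] = row[a] / row[b] if row[b] != 0 else 0.0
    return out


# A made-up film with three of the features mentioned in the text.
film = {"log_budget": 16.0, "release_date_year": 2000, "popularity": 8.0}
with_interactions = add_interactions(film, ["log_budget", "release_date_year", "popularity"])
print(with_interactions)
```

Pairwise features grow quadratically with the number of columns, so it makes sense to restrict this to a short list of top features.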

External features


I'm adding external features from this kernel: https://www.kaggle.com/kamalchhirang/eda-feature-engineering-lgb-xgb-cat by kamalchhirang. All credit for these features goes to him and his kernel.


Blending


Stacking
