This notebook is part of a project to predict the solar energy production of a photovoltaic system on top of a house. Here we carry out the exploratory data analysis of the photovoltaic dataset, which was obtained from the interface of the machine.
We have 1740 CSV files with photovoltaic data. Let us import them.
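As a minimal sketch of such an import, assuming all files sit in one folder and share the same columns (the folder layout and the helper name `load_csv_folder` are assumptions, not taken from the original notebook):

```python
from pathlib import Path

import pandas as pd


def load_csv_folder(folder, pattern="*.csv", **read_csv_kwargs):
    """Read every CSV file matching `pattern` in `folder` into one DataFrame.

    Hypothetical helper: extra keyword arguments are passed to pd.read_csv,
    e.g. sep=";" if the export uses semicolons.
    """
    files = sorted(Path(folder).glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern!r} in {folder}")
    # Read each file on its own and concatenate once at the end; growing a
    # DataFrame inside the loop would copy the data over and over.
    frames = [pd.read_csv(f, **read_csv_kwargs) for f in files]
    return pd.concat(frames, ignore_index=True)
```

Reading the files into a list first and calling `pd.concat` once keeps the import roughly linear in the number of files.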
Let's save it because this took a while.
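One way to avoid repeating a slow import is to cache the combined table on disk. The sketch below uses pandas' pickle support; the helper name `load_or_build` and the cache file name in the usage note are hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_or_build(cache_path, build):
    """Return the cached DataFrame if it exists, otherwise build and cache it.

    `build` is any zero-argument function that returns a DataFrame
    (for example, the slow CSV import).
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_pickle(cache_path)
    df = build()
    df.to_pickle(cache_path)
    return df
```

A call could look like `df = load_or_build("pv_raw.pkl", my_import_function)`, where both names are placeholders.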
We see that we have data from October 2017 until the end of June 2019, and that it contains only the production data of the photovoltaic system. The data has a resolution of 15 minutes.
Let us see how many days we have and whether the number of intervals per day checks out.
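A check like this can be sketched by counting rows per calendar day and listing the days that deviate from the expected count. The column name `timestamp` and both helper names are assumptions for illustration.

```python
import pandas as pd


def entries_per_day(df, time_col="timestamp"):
    """Count the number of rows for each calendar day."""
    # normalize() sets the time of day to midnight, leaving one value per day
    days = pd.to_datetime(df[time_col]).dt.normalize()
    return days.value_counts().sort_index()


def irregular_days(df, time_col="timestamp", expected=96):
    """Return only the days whose row count differs from `expected`."""
    counts = entries_per_day(df, time_col)
    return counts[counts != expected]
```

`len(entries_per_day(df))` gives the number of distinct days, and `irregular_days(df)` lists the days that need a closer look.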
Interesting, so typically there are 96 entries per day (@@0@@ for the total of the 15-minute intervals in one day), but there are 6 days where this criterion is not met. Let's see which days these are!
It is now clear that the extra entries are duplicates generated by the data mining process. We just have to drop them.
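Dropping the duplicates could look like the sketch below. It assumes that duplicated rows share the same timestamp and that keeping the first occurrence is acceptable; the column name `timestamp` is again hypothetical.

```python
import pandas as pd


def drop_duplicate_intervals(df, time_col="timestamp"):
    """Keep one row per timestamp and return the rows in time order."""
    return (
        df.drop_duplicates(subset=time_col, keep="first")
        .sort_values(time_col)
        .reset_index(drop=True)
    )
```

If the duplicated rows could carry different values, it is worth comparing them before dropping, since `keep="first"` silently discards the later ones.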
Perfect, let's now visualize the data.
Here we have the hourly production, which is not what we are looking for. Instead, let us look at the total daily production in units of kWh by integrating the curve for each day.
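The daily integration can be sketched with the trapezoid rule, written out by hand here. This assumes the data holds power readings in kW under a column called `power_kw` and timestamps under `timestamp`; both names, and the choice of the trapezoid rule, are assumptions rather than the notebook's confirmed method.

```python
import numpy as np
import pandas as pd


def daily_energy_kwh(df, time_col="timestamp", power_col="power_kw"):
    """Integrate power (kW) over time (h) separately for each calendar day."""
    power = pd.Series(
        df[power_col].to_numpy(dtype=float),
        index=pd.DatetimeIndex(pd.to_datetime(df[time_col])),
    ).sort_index()

    def integrate(day):
        # Elapsed time since the first sample of the day, in hours
        hours = np.asarray((day.index - day.index[0]).total_seconds()) / 3600.0
        p = day.to_numpy()
        # Trapezoid rule: mean of neighbouring readings times the time step
        return float(np.sum((p[1:] + p[:-1]) / 2.0 * np.diff(hours)))

    return power.groupby(power.index.normalize()).apply(integrate)
```

Because each day is integrated from its first to its last sample, the final 15 minutes before midnight are not covered; for photovoltaic data this interval is at night and contributes nothing.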
Let's also visualize that.
Very good! Let's look at these values.
Interesting, we see that the mean daily production is 22.7 kWh, with extrema of 0 and 56 kWh. Half of the days produce between 0 and 20.8 kWh (the 50% quantile, i.e. the median).
Let us also consider the four seasons.
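In the northern hemisphere, the meteorological seasons are defined as: winter (December to February), spring (March to May), summer (June to August), and fall (September to November).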
Let us check if we encoded the months correctly!
Seems fine (note that winter is 0, spring 1, summer 2, fall 3, as expected).
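An encoding with these integer codes can be sketched with a little modular arithmetic. The function name `month_to_season` and the lookup table are hypothetical; the notebook may have encoded the seasons differently.

```python
import pandas as pd

# Integer codes as used in the text: winter 0, spring 1, summer 2, fall 3
SEASON_NAMES = {0: "winter", 1: "spring", 2: "summer", 3: "fall"}


def month_to_season(month):
    """Map month numbers 1..12 to meteorological season codes 0..3.

    `month % 12` sends December to 0, so December, January and February
    land in the same bucket of three when divided by 3.
    Works on plain integers as well as on pandas Series or Index objects.
    """
    return (month % 12) // 3


# Quick check for all twelve months
months = pd.Series(range(1, 13))
check = pd.DataFrame({"month": months, "season": month_to_season(months)})
check["name"] = check["season"].map(SEASON_NAMES)
```

For a table with a `DatetimeIndex`, the same function can be applied to `index.month`.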
To get the seasonal production, let us group the data by season and make a violin plot.
What we see here is the distribution of daily energy production grouped by season. We identify the following key takeaways:
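A plot of this kind can be sketched with matplotlib's `Axes.violinplot`, which may differ from the plotting library used in the notebook. The function name and its arguments are assumptions: `daily_kwh` is a Series of daily totals and `season` is an aligned Series of season codes (0 to 3).

```python
import matplotlib.pyplot as plt
import pandas as pd

SEASON_NAMES = {0: "winter", 1: "spring", 2: "summer", 3: "fall"}


def plot_seasonal_violins(daily_kwh, season, ax=None):
    """Draw one violin per season from daily production values."""
    if ax is None:
        _, ax = plt.subplots()
    grouped = daily_kwh.groupby(season)
    # Only plot seasons that are present; each needs several distinct
    # values, otherwise the density estimate behind the violin fails.
    codes = sorted(grouped.groups)
    data = [grouped.get_group(code).to_numpy() for code in codes]
    ax.violinplot(data, positions=codes, showmedians=True)
    ax.set_xticks(codes)
    ax.set_xticklabels([SEASON_NAMES.get(code, str(code)) for code in codes])
    ax.set_ylabel("Daily production (kWh)")
    return ax
```

The width of each violin shows how the daily totals are distributed within a season, and the horizontal bar marks the median.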