Exploratory data analysis of photovoltaic dataset


This notebook is part of a project to predict the solar energy production of a photovoltaic system installed on top of a house. Here we carry out an exploratory data analysis of the photovoltaic dataset, which was obtained from the interface of the machine.

Table of contents


1. Import libraries


2. Load all the data into one dataframe


We have 1740 CSV files with photovoltaic data. Let us import them.
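A minimal sketch of how many CSV files can be stacked into one dataframe with pandas. The folder path and the column name `time` are assumptions for illustration; the real export may use different names.

```python
from pathlib import Path

import pandas as pd


def load_all_csv(folder, time_col="time"):
    """Read every CSV file in `folder` and stack them into one dataframe.

    `folder` and `time_col` are hypothetical; adapt them to the real export.
    """
    files = sorted(Path(folder).glob("*.csv"))
    if not files:
        raise FileNotFoundError(f"no CSV files found in {folder}")
    frames = [pd.read_csv(f, parse_dates=[time_col]) for f in files]
    # ignore_index gives one continuous index instead of repeating 0..n per file
    df = pd.concat(frames, ignore_index=True)
    # sort chronologically, since file-name order need not match time order
    return df.sort_values(time_col).reset_index(drop=True)
```

Collecting the frames in a list and calling `pd.concat` once is usually much faster than appending to a dataframe inside the loop.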

Let's save the combined dataframe, because loading took a while.
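One possible way to avoid repeating the slow import is to cache the result on disk. This is only a sketch: the helper `load_or_build` and the cache file name are hypothetical, and pickle is one of several formats pandas can write.

```python
from pathlib import Path

import pandas as pd


def load_or_build(cache_path, build):
    """Return the cached dataframe if it exists, otherwise build and cache it.

    `build` is any zero-argument function that returns a dataframe.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_pickle(cache_path)
    df = build()
    df.to_pickle(cache_path)
    return df
```

A call such as `load_or_build("pv_data.pkl", build)` would then read the CSV files only on the first run.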


We see that we have data from October 2017 until the end of June 2019, and only the production data of the photovoltaic system. The data has a resolution of 15 minutes.

Let us see how many days we have and whether the overall number of intervals checks out.
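A sketch of this check on toy data, assuming a timestamp column named `time` (a hypothetical name): count the rows per calendar day and keep the days that do not have exactly 96 entries.

```python
import pandas as pd


def entries_per_day(df, time_col="time"):
    """Number of rows recorded on each calendar day."""
    return df.groupby(df[time_col].dt.date).size()


# toy example: two complete days, with the first timestamp duplicated once
times = pd.date_range("2017-10-01", periods=2 * 96, freq="15min")
toy = pd.DataFrame({"time": times.append(times[:1])})

counts = entries_per_day(toy)
irregular = counts[counts != 96]  # days that deviate from 96 entries
```

In the toy data only the first day ends up in `irregular`, because it carries one extra row.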


Interesting: typically there are 96 entries per day (4 × 24, the number of 15-minute intervals in one day), but there are 6 days where this criterion is not met. Let's see which ones these are!


It is now clear that the extra entries are duplicates generated by the mining process. We just have to drop them.
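Dropping the duplicates could look like the following sketch. It assumes that a duplicate means a repeated timestamp in a column named `time`; both the column names and that definition are assumptions, since the original cell is not shown.

```python
import pandas as pd


def drop_duplicate_timestamps(df, time_col="time"):
    """Keep the first row for every timestamp and discard later repeats."""
    deduped = df.drop_duplicates(subset=time_col, keep="first")
    return deduped.sort_values(time_col).reset_index(drop=True)


# toy example: four readings, the first two of which appear twice
times = pd.date_range("2017-10-01", periods=4, freq="15min")
toy = pd.DataFrame({
    "time": times.append(times[:2]),
    "power_kw": [0.0, 0.1, 0.2, 0.3, 0.0, 0.1],
})
clean = drop_duplicate_timestamps(toy)
```

If the duplicated rows are identical in every column, calling `df.drop_duplicates()` without `subset` has the same effect.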


Perfect, let's now visualize the data.

3. Visualize


Here we have the hourly production, which is not what we are looking for. Let us instead look at the total daily production in kWh, obtained by integrating the curve for each day.

4. Compute total energy production for each day


in kWh!
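A sketch of the daily integration, under the assumption that each reading is the average power in kW over its 15-minute interval. The column names `time` and `power_kw` are hypothetical, and the rectangle rule used here is one simple choice of integration scheme. If the readings are already energy per interval, a plain daily sum is enough.

```python
import pandas as pd

INTERVAL_HOURS = 0.25  # one 15-minute interval expressed in hours


def daily_energy_kwh(df, time_col="time", power_col="power_kw"):
    """Rectangle-rule integral of power (kW) over each calendar day, in kWh."""
    power = df.set_index(time_col)[power_col]
    # min_count=1 leaves days without any reading as NaN instead of 0
    daily_sum = power.resample("D").sum(min_count=1)
    return (daily_sum * INTERVAL_HOURS).rename("energy_kwh")


# toy example: a constant 2 kW for one full day gives 2 kW * 24 h = 48 kWh
times = pd.date_range("2018-06-01", periods=96, freq="15min")
toy = pd.DataFrame({"time": times, "power_kw": 2.0})
energy = daily_energy_kwh(toy)
```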


Let's also visualize that.


Very good! Let's look at these values.


Interesting: the mean daily production is 22.7 kWh, with extrema of 0 and 56 kWh. Half of the days lie between 0 and 20.8 kWh (the 50% quantile, i.e. the median).
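Summary figures of this kind are what pandas reports through `Series.describe()`, where the `50%` row is the median. The sketch below uses made-up toy values, not the notebook's data.

```python
import pandas as pd

# toy daily totals in kWh (illustrative values only)
toy_energy = pd.Series([0.0, 10.0, 20.0, 30.0, 40.0], name="energy_kwh")

summary = toy_energy.describe()
mean_kwh = summary["mean"]
median_kwh = summary["50%"]  # half of the days lie at or below this value
lowest, highest = summary["min"], summary["max"]
```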

5. Visualize seasonal energy production


Let us also consider the four seasons.

In the northern hemisphere, the four meteorological seasons are defined as:

  • Spring: March until May
  • Summer: June until August
  • Fall: September until November
  • Winter: December until February

Let us check if we encoded the months correctly!


Seems fine (note that winter is 0, spring 1, summer 2, fall 3, as expected).
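One compact way to obtain exactly this encoding from the month number is `(month % 12) // 3`, which maps December, January and February to 0 and each following block of three months to the next code. This is a sketch of one possible encoding, not necessarily the one used in the original cell; the names `season_code` and `SEASON_NAMES` are hypothetical.

```python
import pandas as pd

SEASON_NAMES = {0: "winter", 1: "spring", 2: "summer", 3: "fall"}


def season_code(timestamps):
    """Meteorological season code (0 = winter ... 3 = fall) for a datetime Series."""
    return (timestamps.dt.month % 12) // 3


# check the mapping for one date in every month of a year
months = pd.Series(pd.date_range("2018-01-01", periods=12, freq="MS"))
check = pd.DataFrame({"month": months.dt.month, "season": season_code(months)})
check["name"] = check["season"].map(SEASON_NAMES)
```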

To get the seasonal production, let us group the data by season and make a violin plot.
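A sketch of such a plot using matplotlib's `Axes.violinplot` (the original cell may well have used a different plotting library). The column names `season` and `energy_kwh` are assumptions, and the toy data is random and purely illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_seasonal_violins(daily, names=("winter", "spring", "summer", "fall")):
    """Violin plot of daily energy per season.

    `daily` needs the hypothetical columns 'season' (codes 0..3) and 'energy_kwh'.
    """
    codes = list(range(len(names)))
    groups = [daily.loc[daily["season"] == c, "energy_kwh"].to_numpy() for c in codes]
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.violinplot(groups, positions=codes, showmedians=True)
    ax.set_xticks(codes)
    ax.set_xticklabels(names)
    ax.set_ylabel("Daily production (kWh)")
    return fig, ax


# synthetic toy data: 30 random days per season
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "season": np.repeat(np.arange(4), 30),
    "energy_kwh": rng.uniform(0, 50, size=120),
})
fig, ax = plot_seasonal_violins(toy)
```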


What we see here is the distribution of daily energy production grouped by season. We identify the following key takeaways:

  • Daily production is shifted towards higher values in summer and spring, with medians of around 42 and 28 kWh, respectively, compared with around 8 kWh in winter and 18 kWh in fall.
  • The distribution is more uniform in spring and fall than in winter and summer. In winter the distribution is concentrated at lower values, and in summer at higher values.

6. Check for missing data
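For regularly sampled data, a missing-data check can cover two things: NaN values in the columns, and timestamps that are absent altogether. The sketch below shows both on toy data, assuming a `time` column and a 15-minute grid; the column names are hypothetical.

```python
import pandas as pd


def missing_timestamps(df, time_col="time", freq="15min"):
    """Timestamps of the regular grid that do not occur in the data."""
    full = pd.date_range(df[time_col].min(), df[time_col].max(), freq=freq)
    return full.difference(pd.DatetimeIndex(df[time_col]))


# toy example: eight intervals with the fourth one removed
times = pd.date_range("2018-01-01", periods=8, freq="15min")
toy = pd.DataFrame({"time": times.delete(3), "power_kw": 1.0})

nan_counts = toy.isna().sum()    # NaN values per column
gaps = missing_timestamps(toy)   # intervals with no row at all
```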
