In this notebook I present the process a data scientist or analyst should follow to extract useful information from a dataset. As an example I use the given Acc.csv file, which records accidents in the United Kingdom for 2017. The analysis is split into the steps needed to produce meaningful insights.

I will not use these columns in the following steps, so there is no need to handle their null values.
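A minimal sketch of how missing values can be counted per column before deciding whether they matter. The toy DataFrame and its column `LSOA` are assumptions for illustration, not the actual contents of Acc.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the accidents data; "LSOA" is a hypothetical unused column.
df = pd.DataFrame({
    "SpeedLimit": [30, 30, 60, 70],
    "NumberOfVehicles": [2, 1, 3, 2],
    "LSOA": ["E01", np.nan, np.nan, "E04"],
})

# Count missing values in every column.
null_counts = df.isnull().sum()
print(null_counts)

# Keep only the names of columns that contain at least one missing value.
cols_with_nulls = null_counts[null_counts > 0].index.tolist()
print(cols_with_nulls)
```

If every column in `cols_with_nulls` is one that the analysis drops anyway, no imputation or row removal is required.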

The basic statistics of our DataFrame are presented above, but most of them are meaningless because pandas read the corresponding attributes with the wrong data type. For example, the attribute Road_Type should be a category rather than an integer, as the output of describe() makes obvious.
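The sketch below illustrates the problem on a toy DataFrame: an integer-coded attribute gets a mean and standard deviation from describe(), although only its frequencies are meaningful. The code values used for `Road_Type` here are made up for the example.

```python
import pandas as pd

# Toy data: Road_Type holds integer codes, not quantities (codes are illustrative).
df = pd.DataFrame({
    "Road_Type": [6, 3, 6, 1],
    "SpeedLimit": [30, 70, 30, 40],
})

# Both columns are read as integers, so describe() summarises both numerically.
print(df.dtypes)
print(df.describe())

# The "average road type" can be computed but has no interpretation.
road_type_mean = df["Road_Type"].mean()

# Frequencies per code are the meaningful summary for such a column.
road_type_counts = df["Road_Type"].value_counts()
print(road_type_counts)
```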

I will select only the columns that I will use for the analysis.
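One way to do this, sketched on a toy DataFrame. The column `Unused_Column` is hypothetical, and the list of columns to keep is only an example.

```python
import pandas as pd

# Toy stand-in for the full dataset; "Unused_Column" is hypothetical.
df = pd.DataFrame({
    "Road_Type": [6, 3],
    "SpeedLimit": [30, 70],
    "NumberOfVehicles": [2, 1],
    "Unused_Column": ["a", "b"],
})

# Columns needed for the analysis (example selection).
columns_to_keep = ["Road_Type", "SpeedLimit", "NumberOfVehicles"]

# .copy() gives an independent DataFrame, so later edits do not touch df.
accidents = df[columns_to_keep].copy()
print(accidents.head())
```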

To understand what the values in each column of the dataset represent, I referred to the given metadata Excel file and replaced the coded values with their labels.
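A sketch of this kind of recoding with a dictionary and `Series.map`. The column names `Severity` and `DayOfWeek` and the code tables below are assumptions for illustration; the real mappings come from the metadata file.

```python
import pandas as pd

# Assumed code tables; in practice these are taken from the metadata Excel file.
severity_labels = {1: "Fatal", 2: "Serious", 3: "Slight"}
day_labels = {
    1: "Sunday", 2: "Monday", 3: "Tuesday", 4: "Wednesday",
    5: "Thursday", 6: "Friday", 7: "Saturday",
}

# Toy data with integer codes (column names are hypothetical).
df = pd.DataFrame({
    "Severity": [3, 1, 2, 3],
    "DayOfWeek": [7, 1, 6, 7],
})

# Replace each code with its label; codes missing from the dict become NaN.
df["Severity"] = df["Severity"].map(severity_labels)
df["DayOfWeek"] = df["DayOfWeek"].map(day_labels)
print(df)
```

Because unmapped codes turn into NaN, a quick `df.isnull().sum()` after mapping is a cheap way to spot a code that the dictionary does not cover.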

I will convert some attributes from object to category.
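A minimal sketch of the conversion, assuming the labelled columns are called `Severity` and `Road_Type` (the first name and all labels are illustrative).

```python
import pandas as pd

# Toy data after recoding: label columns are stored as generic objects (strings).
df = pd.DataFrame({
    "Severity": ["Slight", "Fatal", "Slight"],
    "Road_Type": ["Single carriageway", "Roundabout", "Single carriageway"],
    "SpeedLimit": [30, 60, 30],
})

# Columns to treat as categorical (example selection).
categorical_columns = ["Severity", "Road_Type"]

# Convert one column at a time; numeric columns are left untouched.
for column in categorical_columns:
    df[column] = df[column].astype("category")

print(df.dtypes)
```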

An important step in getting a general picture of the dataset is to extract basic statistics from the numeric attributes. The mean, standard deviation, min, max and quartiles are shown in the table above for the two numeric attributes, NumberOfVehicles and SpeedLimit.
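For reference, a summary table of this kind can be produced as in the sketch below. The values are toy data, not the real dataset.

```python
import pandas as pd

# Toy values for the two numeric attributes named in the text.
df = pd.DataFrame({
    "NumberOfVehicles": [1, 2, 2, 3],
    "SpeedLimit": [30, 30, 60, 40],
})

# describe() returns count, mean, std, min, quartiles and max per column.
summary = df[["NumberOfVehicles", "SpeedLimit"]].describe()
print(summary)
```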

Now the dataset is cleaned and ready for analysis. I will carry out the analysis by answering some queries on the dataset in order to gain insight from the results.

****

Find the percentage of all the accidents that are Fatal and occur on Saturday.

So, 0.216% of all the accidents are Fatal and occurred on a Saturday.
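This query can be read in two ways, and they give different numbers: the share of all accidents that are both Fatal and on a Saturday, or the share of Saturday accidents that are Fatal. The sketch below computes both on toy data; the column names `Severity` and `DayOfWeek` are assumptions.

```python
import pandas as pd

# Toy data (8 accidents); column names and values are illustrative.
df = pd.DataFrame({
    "Severity": ["Fatal", "Slight", "Slight", "Slight",
                 "Serious", "Fatal", "Slight", "Slight"],
    "DayOfWeek": ["Saturday", "Saturday", "Saturday", "Saturday",
                  "Monday", "Monday", "Friday", "Sunday"],
})

is_fatal = df["Severity"] == "Fatal"
is_saturday = df["DayOfWeek"] == "Saturday"
fatal_saturday = (is_fatal & is_saturday).sum()

# Reading 1: denominator is every accident in the dataset.
pct_of_all = 100 * fatal_saturday / len(df)

# Reading 2: denominator is only the accidents that happened on a Saturday.
pct_of_saturday = 100 * fatal_saturday / is_saturday.sum()

print(pct_of_all, pct_of_saturday)
```

Stating the denominator next to the result avoids confusing the two figures.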

****

Find the number of accidents that happened in Greater Manchester and occurred when it was snowing.

So, 25 accidents happened in Greater Manchester when it was snowing.
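A count like this can be obtained by combining two boolean masks. In the sketch below the column names `PoliceForce` and `Weather`, and the weather labels, are assumptions. Snow may appear under more than one label, so the mask matches on the prefix.

```python
import pandas as pd

# Toy data; column names and labels are hypothetical.
df = pd.DataFrame({
    "PoliceForce": ["Greater Manchester", "Greater Manchester",
                    "Greater Manchester", "Kent", "Kent"],
    "Weather": ["Snowing no high winds", "Fine no high winds",
                "Snowing + high winds", "Snowing no high winds",
                "Raining no high winds"],
})

in_manchester = df["PoliceForce"] == "Greater Manchester"
# Match every label that starts with "Snowing", with or without high winds.
is_snowing = df["Weather"].str.startswith("Snowing")

# Summing a boolean mask counts the rows where both conditions hold.
snow_count = int((in_manchester & is_snowing).sum())
print(snow_count)
```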

So, 10% of the accidents that happened in urban areas were due to the driver exceeding the speed limit of 30 miles per hour that applies in these areas.

From this graph it is obvious that most of the accidents happened where the speed limit was 30 miles per hour. There is also a significant number of accidents where the speed limit was greater than 60 miles per hour.

Note: the distribution has this shape because the SpeedLimit attribute takes only a few discrete values and should really be categorical rather than numeric, as this plot shows. However, I treated it as numeric in order to extract other statistical information from the dataset.
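The discrete nature of SpeedLimit is easy to check by counting accidents per distinct value, as in this sketch on toy data.

```python
import pandas as pd

# Toy speed limits in miles per hour (illustrative values).
df = pd.DataFrame({"SpeedLimit": [30, 30, 30, 60, 70, 40, 30, 70]})

# Count accidents per distinct speed limit, ordered by the limit itself.
counts = df["SpeedLimit"].value_counts().sort_index()
print(counts)

# Only a handful of distinct values exist, which is typical of a category.
n_distinct = df["SpeedLimit"].nunique()
most_common_limit = counts.idxmax()
print(n_distinct, most_common_limit)
```

Plotting `counts` as a bar chart, for example with `counts.plot(kind="bar")` when matplotlib is installed, shows one bar per speed limit instead of a histogram with empty bins.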

From this plot it is obvious that the Fatal accidents have a large interquartile range, so an accident can be fatal at any speed limit. Moreover, the Slight accidents occur at low speed limits, with some outliers.
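The interquartile range that a box plot draws can also be computed directly per severity class. This is a sketch on toy data, with `Severity` as an assumed column name.

```python
import pandas as pd

# Toy data: Fatal accidents spread over many limits, Slight ones cluster low.
df = pd.DataFrame({
    "Severity": ["Fatal"] * 4 + ["Slight"] * 4,
    "SpeedLimit": [30, 40, 60, 70, 30, 30, 30, 40],
})

grouped = df.groupby("Severity")["SpeedLimit"]

# First and third quartile of the speed limit within each severity class.
q1 = grouped.quantile(0.25)
q3 = grouped.quantile(0.75)

# Interquartile range: the height of the box in a box plot.
iqr = q3 - q1
print(iqr)
```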

This plot shows the same results as the previous one.

It is clear that the accidents involving the most vehicles happened at speed limits of 40 and 50 miles per hour on Sundays, which makes sense because most people return from weekend trips on that day.
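A tabular view of the same relationship can be built with a pivot table. The sketch below takes the maximum number of vehicles per day and speed limit, which is one possible reading of "the most vehicles". The data are toy values and `DayOfWeek` is an assumed column name.

```python
import pandas as pd

# Toy data; values are illustrative only.
df = pd.DataFrame({
    "DayOfWeek": ["Sunday", "Sunday", "Sunday", "Friday", "Friday", "Monday"],
    "SpeedLimit": [40, 50, 30, 30, 40, 50],
    "NumberOfVehicles": [5, 4, 2, 2, 3, 1],
})

# Rows: day of week, columns: speed limit, cells: largest vehicle count seen.
pivot = df.pivot_table(
    index="DayOfWeek",
    columns="SpeedLimit",
    values="NumberOfVehicles",
    aggfunc="max",
)
print(pivot)

# Combinations with no accidents are NaN; max() skips them.
largest = pivot.max().max()
print(largest)
```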

It is obvious that most accidents occurred on Fridays and were labeled as Slight.

So, most of the accidents are Slight and happened on a Dry surface.
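Counts of two categorical attributes against each other, such as severity by day or severity by road surface, can be tabulated with `pd.crosstab`. This is a sketch on toy data; the column names `DayOfWeek`, `Severity` and `RoadSurface` and the labels are assumptions.

```python
import pandas as pd

# Toy data; column names and labels are hypothetical.
df = pd.DataFrame({
    "DayOfWeek": ["Friday", "Friday", "Friday", "Monday", "Sunday", "Friday"],
    "Severity": ["Slight", "Slight", "Serious", "Slight", "Fatal", "Slight"],
    "RoadSurface": ["Dry", "Dry", "Wet or damp", "Dry", "Dry", "Wet or damp"],
})

# Contingency tables: one row per day or surface, one column per severity.
day_by_severity = pd.crosstab(df["DayOfWeek"], df["Severity"])
surface_by_severity = pd.crosstab(df["RoadSurface"], df["Severity"])
print(day_by_severity)
print(surface_by_severity)

# The most frequent combination, found from the group sizes.
top_day_severity = df.groupby(["DayOfWeek", "Severity"]).size().idxmax()
top_surface_severity = df.groupby(["RoadSurface", "Severity"]).size().idxmax()
print(top_day_severity, top_surface_severity)
```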