Data Source: Department of Education, Training and Employment (DETE) and Technical and Further Education (TAFE) in Queesland, Austrailia.
What we'd like to find out are:
A data dictionary wasn't provided with the dataset. In a job setting, we'd make sure to meet with a manager to confirm the definitions of the data. For this project, we'll use our general knowledge to define the columns.
Below is a preview of a couple columns we'll work with from the dete_survey.csv:
Below is a preview of a couple columns we'll work with from the tafe_survey.csv:
In dete_survey, there are 56 columns and 822 entries.
In tafe_survey, there are 72 columns and 702 entries.
Both datasets contains quite a bit of data. We will have to clean up and decide what are the data that are useful for this analysis.
The columns that we dropped aren't directly connected to the reason why employees resigned. To simplify the analysis, it's easier to drop them.
Column names aren't standardized. Let's standardize the column names.
Change the columns names to unify them.
Since we are only interested in the people who resigned we don't need any other data entries besides them. Let's look at the serpartiontype column to filter out unnecessary data.
Now we see there are 3 different resigation types in DETE survey.
In TAFE survey, there is only one category for resignation
Looks like data successfully filtered considering the number of columns and rows.
Looks like data successfully filtered considering the number of columns rows.
We are going to look at the data and see if there is any inconsidtency or logically doesn't make sense.
cease_date is the last year of their emplyment and date_start_date is their first year of employment. It wouldn't make sense if date_start_date is after cease_date.
The data is inconsidtence. Some of them have yyyy format and other have mm/yyyy. We are only interested the year they resigned for now.
It looks like they are in one format now. And it doesn't loos like any data contains dete start date after cease date.
There is a column institute service in tafe_resignations which indicate the years of service before the resignation. We can make the same column in dete resignations using cease date and dete_start_date.
If the employee indicated any of the factors above caused them to resign, we'll mark them as dissatisfied in a new column.
To create the new column, we'll do the following:
The values in the columns, we can't use them for indentification. They need to be in True, False, or NaN.
We combined two datasets into one to make it easy to analyze. Also dropped columns do not have more than 500 of non-null values. If it has too many null values it won't be too helpful to analyze the data.
Now that we've combined our dataframes, we're almost at a place where we can perform some kind of analysis! First, though, we'll have to clean up the institute_service column. This column is tricky to clean because it currently contains values in a couple different forms:
The result is messy. Some are in floats, some are in string. We are going to cateogrize them into the following.