Employee Exit Survey Analysis

#Employee-Exit-Survey-Analysis

Data Source: Department of Education, Training and Employment (DETE) and Technical and Further Education (TAFE) in Queesland, Austrailia.

What we'd like to find out are:

  • Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
  • Are younger employees resigning due to some kind of dissatisfaction? What about older employees?

A data dictionary wasn't provided with the dataset. In a job setting, we'd make sure to meet with a manager to confirm the definitions of the data. For this project, we'll use our general knowledge to define the columns.

Below is a preview of a couple columns we'll work with from the dete_survey.csv:

  • ID: An id used to identify the participant of the survey
  • SeparationType: The reason why the person's employment ended
  • Cease Date: The year or month the person's employment ended
  • DETE Start Date: The year the person began employment with the DETE

Below is a preview of a couple columns we'll work with from the tafe_survey.csv:

  • Record ID: An id used to identify the participant of the survey
  • Reason for ceasing employment: The reason why the person's employment ended
  • LengthofServiceOverall. Overall Length of Service at Institute (in years): The length of the person's employment (in years)

Environment Setup

#Environment-Setup

Load Datasets

#Load-Datasets

Data Information

#Data-Information
Loading output library...

In dete_survey, there are 56 columns and 822 entries.

Loading output library...

In tafe_survey, there are 72 columns and 702 entries.

Both datasets contains quite a bit of data. We will have to clean up and decide what are the data that are useful for this analysis.

  • The dete_survey data set contains 'Not Stated' values, which is better represented with NaN value.
  • They have many columns that we don't need for this analysis
  • There is a overlap between two datasets. But the column names are different.
  • There are several columns/answers that indicate an employee resigned du to dissatisfaction.

Clean Up

#Clean-Up

Dropping Columns

#Dropping-Columns

The columns that we dropped aren't directly connected to the reason why employees resigned. To simplify the analysis, it's easier to drop them.

Clean Column Names

#Clean-Column-Names

Column names aren't standardized. Let's standardize the column names.

  • Make all the capitalization lowercase.
  • Remove any trailing whitespace from the strings.
  • Replace spaces with underscores('_')

Change the columns names to unify them.

  • 'Record ID': 'id'
  • 'CESSATION YEAR': 'cease_date'
  • 'Reason for ceasing employment': 'separationtype'
  • 'Gender. What is your Gender?': 'gender'
  • 'CurrentAge. Current Age': 'age'
  • 'Employment Type. Employment Type': 'employment_status'
  • 'Classification. Classification': 'position'
  • 'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service'
  • 'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'
Loading output library...
Loading output library...

Data Filtering

#Data-Filtering

Since we are only interested in the people who resigned we don't need any other data entries besides them. Let's look at the serpartiontype column to filter out unnecessary data.

Loading output library...

Now we see there are 3 different resigation types in DETE survey.

Loading output library...

In TAFE survey, there is only one category for resignation

Loading output library...

Looks like data successfully filtered considering the number of columns and rows.

Loading output library...

Looks like data successfully filtered considering the number of columns rows.

Data Verification

#Data-Verification

We are going to look at the data and see if there is any inconsidtency or logically doesn't make sense.

cease_date is the last year of their emplyment and date_start_date is their first year of employment. It wouldn't make sense if date_start_date is after cease_date.

Loading output library...
Loading output library...
Loading output library...

The data is inconsidtence. Some of them have yyyy format and other have mm/yyyy. We are only interested the year they resigned for now.

Loading output library...
Loading output library...
Loading output library...

It looks like they are in one format now. And it doesn't loos like any data contains dete start date after cease date.

Loading output library...

New Columns

#New-Columns

There is a column institute service in tafe_resignations which indicate the years of service before the resignation. We can make the same column in dete resignations using cease date and dete_start_date.

Loading output library...

Dissatisfied Employees Identification

#Dissatisfied-Employees-Identification

tafe_survey_updated:

  • Contributing Factors. Dissatisfaction
  • Contributing Factors. Job Dissatisfaction dafe_survey_updated:
  • job_dissatisfaction
  • dissatisfaction_with_the_department
  • physical_work_environment
  • lack_of_recognition
  • lack_of_job_security
  • work_location
  • employment_conditions
  • work_life_balance
  • workload

If the employee indicated any of the factors above caused them to resign, we'll mark them as dissatisfied in a new column.

To create the new column, we'll do the following:

  • Convert the values in the 'Contributing Factors. Dissatisfaction' and 'Contributing Factors. Job Dissatisfaction' columns in the tafe_resignations dataframe to True, False, or NaN values.
  • If any of the columns listed above contain a True value, we'll add a True value to a new column named dissatisfied. To accomplish this, we'll use the DataFrame.any() method to do the following:
    • Return True if any element in the selected columns above is True
    • Return False if none of the elements in the selected columns above is True
    • Return NaN if the value is NaN
Loading output library...
Loading output library...
Loading output library...

The values in the columns, we can't use them for indentification. They need to be in True, False, or NaN.

Combine the Data

#Combine-the-Data
Loading output library...

We combined two datasets into one to make it easy to analyze. Also dropped columns do not have more than 500 of non-null values. If it has too many null values it won't be too helpful to analyze the data.

Service column clean up

#Service-column-clean-up

Now that we've combined our dataframes, we're almost at a place where we can perform some kind of analysis! First, though, we'll have to clean up the institute_service column. This column is tricky to clean because it currently contains values in a couple different forms:

Loading output library...

The result is messy. Some are in floats, some are in string. We are going to cateogrize them into the following.

  • New: Less than 3 years at a company
  • Experienced: 3-6 years at a company
  • Established: 7-10 years at a company
  • Veteran: 11 or more years at a company

Perform Initial Analysis

#Perform-Initial-Analysis
Loading output library...
Loading output library...
Loading output library...