Predicting Salary Level from the Indeed Site 💵


An online job listing website such as Indeed holds extensive data, most of it unstructured text describing the posted jobs. Many listings provide a salary, but as many as half do not. For the listings that do not, can we predict whether the salary is high🔺 or low🔻 from the job's location, its title, or both 🤔?

That is what we will address in this tutorial blog: predicting the job salary level (high or low).

To achieve this, we will walk through the following steps:


1- Collecting data by scraping the Indeed website. For each listing we will collect:

  • Location 📍.
  • Job Title 🔖.
  • Job Salary 💵.
  • Company 🏢.

2- Building a binary predictor with Logistic Regression and other models.


So let's get started 💪🏼 🙏🏻 !!!!

1. Scraping job listings from Indeed.com


We will be scraping job listings from Indeed.com using BeautifulSoup.
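As a minimal sketch of this step (the URL parameters and class names here are assumptions based on the sample HTML shown below), fetching and parsing one results page might look like this:

```python
import requests
from bs4 import BeautifulSoup

# Location and offset are passed as query parameters (see the section on
# collecting data from multiple cities below).
URL = "https://www.indeed.com/jobs?l=New+York&start=0"

response = requests.get(URL)
soup = BeautifulSoup(response.text, "html.parser")

# Each posting sits in a <div class="row result"> card.
results = soup.find_all("div", class_="result")
print(f"Found {len(results)} job cards on this page")
```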


Here is an example of a single job listing from the raw HTML:

```html
<div class=" row result" data-jk="2480d203f7e97210" data-tn-component="organicJob" id="p_2480d203f7e97210" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_2480d203f7e97210">
<a class="turnstileLink" data-tn-element="jobTitle" onmousedown="return rclk(this,jobmap[0],1);" rel="nofollow" target="_blank" title="AVP/Quantitative Analyst">AVP/Quantitative Analyst</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Alliancebernstein?from=SERP&amp;campaignid=serp-linkcompanyname&amp;fromjk=2480d203f7e97210&amp;jcid=b374f2a780e04789" target="_blank">
    AllianceBernstein</a></span>
</span>
<tr>
<td class="snip">
<nobr>$117,500 - $127,500 a year</nobr>
<span class="summary" itemprop="description">
Conduct quantitative and statistical research as well as portfolio management for various investment portfolios. Collaborate with Quantitative Analysts and</span>
</td>
</tr>
</table>
```

We are going to write four functions, one to extract each item:

  • Location 📍.
  • Job Title 🔖.
  • Job Salary 💵.
  • Company 🏢.

    🛑 We will make sure these functions are robust and can handle cases where a field is not available by:

    - Checking whether a field is empty or `None` before attempting to call methods on it.
    - Using `try/except`.

We will extract each field from the HTML result using the following structure (see the sketch after the list):

  • The salary is available in a span element with class='salary no-wrap'.

  • The job title is in a link with class set to jobtitle and a data-tn-element='jobTitle'.

  • The location is set in a span with class='location'.

  • The company is set in a span with class='company'.
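Here is a minimal sketch of the four extractors, assuming each receives one job-card element (a BeautifulSoup `Tag`) like the sample above:

```python
def extract_salary(job):
    # Salary lives in <span class="salary no-wrap">; most cards lack it.
    try:
        return job.find("span", class_="no-wrap").text.strip()
    except AttributeError:
        return None

def extract_title(job):
    # The title link is marked with data-tn-element="jobTitle".
    try:
        return job.find("a", attrs={"data-tn-element": "jobTitle"}).text.strip()
    except AttributeError:
        return None

def extract_location(job):
    try:
        return job.find("span", class_="location").text.strip()
    except AttributeError:
        return None

def extract_company(job):
    try:
        return job.find("span", class_="company").text.strip()
    except AttributeError:
        return None
```

Each function returns `None` instead of raising when a field is missing, so one malformed card cannot stop the scraping loop.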

Collecting data from multiple cities


Now, to scale up our scraping, we need to accumulate more results. We can do this by examining the results URL, which looks like .../jobs?l=New+York&start=10.

The URL has two query parameters:

  • l=New+York controls the location 📍 of the results.

  • start=10 controls the offset into the result list; each page gives 10 results, so we can keep incrementing it by 10 to go further down the list (see the sketch below).
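A sketch of the full collection loop, reusing the extractors above (the city list and result cap are illustrative):

```python
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

cities = ["New+York", "San+Francisco", "Austin", "Boston", "Pittsburgh", "Miami"]
max_results_per_city = 100

rows = []
for city in cities:
    for start in range(0, max_results_per_city, 10):  # 10 results per page
        url = f"https://www.indeed.com/jobs?l={city}&start={start}"
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for job in soup.find_all("div", class_="result"):
            rows.append({
                "location": extract_location(job),
                "title": extract_title(job),
                "salary": extract_salary(job),
                "company": extract_company(job),
            })
        time.sleep(1)  # be polite to the server

df = pd.DataFrame(rows)
```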


2. EDA


This is one of the most important parts of the tutorial. In this section we will do some data cleaning and feature engineering.

2.1 Data Cleaning 🚿 🛀🏻


First, we need to clean up the salary data:

  • Only a small number of the scraped results have salary information; only these will be used for modeling.
  • Some of the salaries are hourly or weekly rather than yearly; these will not be useful to us.
  • Some of the entries may be duplicated.
  • The salaries are given as text, usually as ranges.

Converting salary 💵


We are going to write a function that takes a salary string and converts it to a number, averaging the two endpoints when the salary is given as a range.
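A minimal sketch of such a converter (the regular expression is one reasonable choice, not the only one):

```python
import re

def salary_to_number(salary_text):
    # Pull out every number, ignoring '$' signs and ',' separators.
    numbers = [float(n) for n in re.findall(r"\d+\.?\d*", salary_text.replace(",", ""))]
    if not numbers:
        return None
    # A range averages its two endpoints; a single figure averages to itself.
    return sum(numbers) / len(numbers)

print(salary_to_number("$117,500 - $127,500 a year"))  # 122500.0
```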


2.2 Feature Engineering ⚒ ⚙️


To build features, we need to convert the two string columns, location and job title, into numbers. We will convert them differently:

  • Location 📍 ----> Dummy variables

  • Job title 🔖 -----> Bag of Words (BOW)

Target column 💰:


We want to predict a binary variable: whether the salary is low 🔻 or high 🔺.

To create it, we will compute the median salary and build a new binary target that is true when the salary is high (above the median).
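With the cleaned yearly salaries in a `salary` column (column names here are assumptions), the target is one line:

```python
# 1 when the salary is above the median, 0 otherwise.
median_salary = df["salary"].median()
df["high_salary"] = (df["salary"] > median_salary).astype(int)
```

Splitting at the median also gives a roughly 50/50 class balance, which makes 0.5 the natural baseline accuracy.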


2.3 Visualization


Here we will investigate the salary column by visualizing our data.


📝 Things to note:

  • As we can see, the highest average salary was in San Francisco and the lowest was in Austin.

📝 Things to note:

  • From the histogram and boxplot above we can see that most salary values lie between 50,000 and 150,000.

  • The tallest bar tells us that the most common salary values are below 100,000.

  • We also have some outliers above 200,000.


📝 Things to note:

  • The violin plot above gives us a good overview of the range of high and low salaries in each city. We can see that some cities contain only one class: Boston has only high salaries, while Pittsburgh and Miami have only low salaries.

3. Modeling


We are going to try several models and then choose the best of them.

3.1 Modeling based on location property


For this task, we will use `get_dummies` to extract features from the job location.
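A sketch of this step and of scoring a first model on it with scikit-learn (column names follow the assumptions above):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# One 0/1 dummy column per city.
X = pd.get_dummies(df["location"])
y = df["high_salary"]

lr = LogisticRegression()
scores = cross_val_score(lr, X, y, cv=5)
print(f"Location-only accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The same `cross_val_score` call can be repeated with KNN, Random Forest, and the other models to compare them on an equal footing.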


Baseline accuracy:


We will compute a baseline accuracy against which to examine the performance of our models; a minimal sketch follows.
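Since the target is a median split, the majority class covers about half the data:

```python
# Accuracy of always predicting the majority class.
baseline = df["high_salary"].value_counts(normalize=True).max()
print(f"Baseline accuracy: {baseline:.3f}")  # ~0.5 for a median split
```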


KNN is the worst one !!!

3.1.1 Feature Importance

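To see which cities drive the prediction, we can inspect the coefficients of the fitted logistic regression (a sketch, reusing `lr`, `X`, and `y` from the previous snippet):

```python
# Pair each dummy column with its coefficient, most negative first.
lr.fit(X, y)
coefs = pd.Series(lr.coef_[0], index=X.columns).sort_values()
print(coefs)
```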

📝 Summary of what the feature coefficients mean:

  • The larger a coefficient's absolute value, the stronger the association between that variable and our target.

  • Pittsburgh and Miami do not have any high-salary listings, which may be why they have strongly negative coefficients.

  • Boston has only high-salary listings, which may be why it has a large positive coefficient.

3.2 Modeling based on job levels and categories


For this task, we will use NLP, specifically a bag-of-words (BOW) representation, to extract features from the job title.
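A minimal sketch with scikit-learn's `CountVectorizer` (the parameter values are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Turn each title into word counts over the title vocabulary.
vectorizer = CountVectorizer(max_features=300, stop_words="english")
X_title = vectorizer.fit_transform(df["title"])
print(X_title.shape)  # (number of jobs, vocabulary size)
```

Words like "senior", "director", or "junior" become individual features, which is exactly the kind of signal we hope separates high salaries from low ones.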


Modeling


As above, we will try several models and then choose the best of them.


Feature Importance


📝 Summary of what the feature coefficients mean:

As before, the larger a coefficient's absolute value, the stronger the association between that title word and our target.

3.3 Tuning

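As a sketch of this step, a grid search over the regularization strength of the logistic regression could look like this (the grid values are illustrative, not necessarily the exact grids used for every model):

```python
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="roc_auc")
grid.fit(X_title, y)
print(grid.best_params_, grid.best_score_)
```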

3.4 Evaluation


We will evaluate our models by doing the following (see the sketch after this list):

  • Evaluate the AUC, precision and recall of the models.
  • Plot the ROC and precision-recall curves for the best model based on cross-validation scores.
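A sketch of computing these scores with cross-validation (reusing `X_title` and `y` from above):

```python
from sklearn.model_selection import cross_val_score

for metric in ["roc_auc", "precision", "recall"]:
    scores = cross_val_score(LogisticRegression(), X_title, y, cv=5, scoring=metric)
    print(f"{metric}: {scores.mean():.3f}")
```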

3.4.1 Evaluate models based on AUC score


3.4.2 Evaluate models based on recall score


Recall:

  • Recall is the ratio of correctly predicted high-salary jobs to all truly high-salary jobs.

The question this metric answers is:

How many of the truly high-salary jobs did we miss?

or (put another way):

Of all the jobs that truly have a high salary, how many did we correctly label as high?


3.4.3 Evaluate models based on precision score


Precision:


Precision is the ratio of correctly predicted high-salary jobs to the total jobs predicted as high salary.

The question this metric answers is:

How many jobs did we predict as high salary that actually are not?


Plot the ROC:


ROC is a plot of signal (True Positive Rate) against noise (False Positive Rate).
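A sketch of plotting it, assuming a model `model` fitted on a training split and a held-out `X_test`, `y_test` (the names are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

probs = model.predict_proba(X_test)[:, 1]  # probability of the "high" class
fpr, tpr, _ = roc_curve(y_test, probs)

plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, probs):.2f}")
plt.plot([0, 1], [0, 1], "k--", label="random baseline")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```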


Interpretation of ROC curves


A classifier with the perfect performance level shows a combination of two straight lines – from the origin (0.0, 0.0) to the top left corner (0.0, 1.0) and further to the top right corner (1.0, 1.0).

But it is important to notice that classifiers with meaningful performance levels usually lie in the area between the random ROC curve (the baseline) and the perfect ROC curve, as we got in our plot above [1].

Plot precision-recall curves:


Interpretation of precision-recall curves:


A classifier with random performance shows a horizontal line at P / (P + N), where P and N are the numbers of positive and negative examples. This line separates the precision-recall space into two areas: the area above the line is the region of good performance, and the area below it is the region of poor performance [2].
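A sketch of the corresponding plot, with the same assumed `probs` and `y_test` as in the ROC sketch:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

precision, recall, _ = precision_recall_curve(y_test, probs)
plt.plot(recall, precision)
# The random baseline sits at P / (P + N), i.e. the positive-class rate.
plt.axhline(y_test.mean(), linestyle="--", color="k", label="random baseline")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```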

Conclusion


In this tutorial, we focused on the problem of modeling and predicting a job's salary level based on its location and title.

We tested a variety of classification models, including Random Forest, Extra Trees, Logistic Regression, Gradient Boosting, XGBoost and AdaBoost. We optimized the parameters of each of these methods, validated the performance of each model using cross-validation, and compared their results. The best result we achieved was with the AdaBoost model (71%).

In the future I'm hoping to build on and refine this model, extending its accuracy and predictive ability by employing other NLP techniques, such as using longer sequences of words (bigrams and trigrams) as features. Another option would be to use the job description as a feature.