HR Analytics: Job Change of Data Scientists

Jun 13, 2023

HDSC Premiere Project

A Project by Spectral Clustering

HR analytics is revolutionizing the operations of human resources departments, leading to enhanced efficiency and better outcomes. While analytics has been utilized in HR for some time, much of the data collection, processing, and analysis have traditionally been manual processes. However, this approach has proven limiting considering the dynamic nature of HR and HR Key Performance Indicators (KPIs). It is not surprising that HR departments have only recently recognized the significance of machine learning. By employing predictive analytics, data scientists and engineers can generate valuable insights, while data analysts can draw meaningful conclusions from the data. Machine learning presents a clear advantage in terms of accuracy and speed, making it well-suited for these tasks.

Aims and Objectives

A company in the field of Big Data and Data Science wants to identify data scientists among those who have completed their courses. To achieve this, they plan to develop a machine learning model that predicts whether a candidate will seek a new job or stay with the company. This model aims to minimize costs, save time, improve training quality, and categorize candidates effectively. Understanding the factors influencing employee decisions will aid in achieving these goals and making informed decisions.

Flow Process

Data Gathering: Data collection is a vital stage in the machine learning process, as it significantly impacts the utility and accuracy of a project. The quality of the data obtained during collection is paramount. It should be relevant, devoid of duplicates and missing information, and accurately reflect all relevant classifications and subcategories. Ensuring high-quality data sets a solid foundation for successful machine learning projects.

Data Pre-processing: After acquiring data, it is crucial to preprocess it as a vital step in the machine learning process. Preprocessing involves preparing the data by cleaning, validating, and transforming it into a usable dataset. Data cleaning plays a significant role in this process, as it involves reformatting attributes and addressing issues such as handling missing values through techniques like imputation. Proper preprocessing ensures that the data is in a suitable format for effective analysis and modeling in machine learning tasks.
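The imputation step mentioned above can be sketched with plain pandas. The column names and values below are illustrative stand-ins for the HR dataset, not rows from the real file: categorical gaps are filled with the most frequent value, numeric gaps with the median.

```python
import pandas as pd
import numpy as np

# Hypothetical mini-frame with gaps similar to those in the HR dataset
df = pd.DataFrame({
    "gender": ["Male", None, "Female", "Male"],
    "training_hours": [36, 47, np.nan, 18],
})

# Categorical gap -> most frequent value; numeric gap -> median
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])
df["training_hours"] = df["training_hours"].fillna(df["training_hours"].median())

print(df.isna().sum().sum())  # 0 — no missing values remain
```

More elaborate strategies (KNN imputation, per-group medians) follow the same pattern: detect the gaps, then fill them consistently before modeling.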

Model Building: The selection of the final model for solving the specified business problem is a critical step, requiring careful attention to several key elements. These elements include the input data type and output data type, as well as additional factors such as accuracy, complexity, scalability, and interpretability. During this stage, the model is trained on labelled examples of the desired outcome and then tested on held-out data it has not seen before. The results from the training and testing runs are compared, and performance is evaluated by comparing the actual values with the predicted values. This process aids in selecting the model that most effectively addresses the business problem at hand.
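The train/held-out-test separation described above is conventionally done with scikit-learn's `train_test_split`; the tiny feature matrix here is a placeholder for the encoded HR data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels standing in for the encoded HR data
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows for testing; stratify keeps the 0/1 ratio intact
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Stratifying on the target matters here because, as noted later, the classes are imbalanced: a random split could otherwise leave the test set with almost no positive examples.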

Model Evaluation: Model evaluation is crucial for selecting the best model and accurately predicting values. It assesses the fit between the model and the data, as well as compares multiple models.
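For a binary target like "looking for a job change", evaluation usually reduces to confusion-matrix metrics. This is a minimal hand-rolled sketch (scikit-learn's `metrics` module provides the same quantities); the labels below are illustrative, not results from the project.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Confusion-matrix metrics for a binary job-change target."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))   # correctly flagged leavers
    fp = np.sum((y_true == 0) & (y_pred == 1))   # stayers flagged as leavers
    fn = np.sum((y_true == 1) & (y_pred == 0))   # leavers that were missed
    accuracy = float(np.mean(y_true == y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical predictions against true labels
scores = evaluate([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
print(scores)
```

On an imbalanced target, precision/recall/F1 are far more informative than raw accuracy, which is why balancing the data (see SMOTE below) or choosing metrics carefully matters.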

Model Deployment: After completing all the necessary steps, machine learning models are deployed in production environments, similar to other applications. Data is continuously fed into the model from internal or external sources, allowing for real-time analysis. The model’s performance is monitored and reports, as well as visualizations, are generated using various tools. This helps stakeholders understand the firm’s performance, identify areas for improvement, and make informed business decisions.
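As a minimal illustration of the deployment step, a trained model is typically serialized to disk so a production service can reload it and score incoming records. Here a plain dictionary stands in for a fitted estimator (in practice one would pickle the scikit-learn model object itself, or use `joblib`).

```python
import os
import pickle
import tempfile

# Stand-in "model": any fitted estimator could be serialized the same way
model = {"threshold": 0.5, "weights": [0.2, 0.8]}

# Persist the trained artifact so a production service can reload it later
path = os.path.join(tempfile.gettempdir(), "job_change_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# The serving process loads the artifact and uses it for real-time scoring
with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored == model)  # True
```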

Data Source

The dataset for this problem was obtained from Kaggle and was very close to analysis-ready.

Exploratory Data Analysis

The data is divided into train and test sets. The target is not included in the test set, but a separate file with the test target values is available for related tasks. A sample submission file corresponding to the enrollee_id values of the test set is also provided, with columns: enrollee_id, target.


  • The dataset is imbalanced.
  • Most features are categorical (nominal, ordinal, binary), and some have high cardinality.
  • Missing-value imputation can be part of the modelling pipeline as well.


  • enrollee_id: Unique ID for candidate
  • city: City code
  • city_development_index: Development index of the city (scaled)
  • gender: Gender of the candidate
  • relevent_experience: Relevant experience of the candidate
  • enrolled_university: Type of University course enrolled if any
  • education_level: Education level of the candidate
  • major_discipline: Education major discipline of the candidate
  • experience: Candidate’s total experience in years
  • company_size: Number of employees in the candidate’s current company
  • company_type: Type of current employer
  • last_new_job: Difference in years between previous job and current job
  • training_hours: Training hours completed
  • target: 0 — Not looking for a job change, 1 — Looking for a job change
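The split between identifying and predictive columns in the schema above can be illustrated with a toy frame (the values are made up for illustration; `relevent_experience` keeps the dataset's own spelling):

```python
import pandas as pd

# Toy rows following the schema above — values are illustrative, not real data
df = pd.DataFrame({
    "enrollee_id": [1, 2, 3],
    "city": ["city_103", "city_40", "city_103"],
    "city_development_index": [0.92, 0.776, 0.92],
    "relevent_experience": ["Has relevent experience", "No relevent experience",
                            "Has relevent experience"],
    "target": [0, 1, 0],
})

# enrollee_id is only an identifier and target is the label, so neither is a feature
X = df.drop(columns=["enrollee_id", "target"])
y = df["target"]
print(X.columns.tolist(), y.tolist())
```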

Data Visualization

The count plot above shows that there are more male candidates in the dataset, and about 30% are likely to look for a job change. Most of the female candidates are not looking for a job change. Also, about 75% of candidates who have relevant experience are not looking for a job change, while about 48% of candidates with no relevant experience are looking for one.

The plot shows a higher number of candidates not enrolled in university who are likely to change jobs. Additionally, the majority of candidates in the dataset have an average educational level and are less likely to change jobs after training.

The left diagram indicates that the majority of candidates in the STEM discipline are not looking for a job change. On the other hand, the right boxplot suggests that experienced candidates are less likely to seek job changes.

In the left plot, we observe a list of companies categorized by their employee size. Companies with approximately 50–99 employees have a ratio of 2:1, indicating that there are twice as many candidates who are unlikely to change their job compared to those who are likely to change. In the right plot, we see a list of company types, with “PVT LTD” having the highest number of candidates who are unlikely to change their job.

Model Training, Evaluation, and Validation

From the heatmap above, we identified the five attributes most strongly correlated with the target, which are:

  • Experience
  • Enrolled University
  • Relevant Experience
  • Company Size
  • Last New Job

The target variable in our dataset showed an imbalance, with a count of 14,381 for target 0 (not looking for a job change) and a count of 4,777 for target 1 (looking for a job change). To address this imbalance, we employed SMOTE (Synthetic Minority Oversampling Technique) to up-sample the minority class.
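In practice the up-sampling was done with the `SMOTE` class from the imbalanced-learn library; the core idea can be sketched with plain NumPy. Each synthetic point is an interpolation between a minority-class sample and one of its nearest minority-class neighbours (the function name and toy data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sketch(X_min, n_new, k=2):
    """Minimal SMOTE idea: interpolate between a minority sample and one of
    its k nearest minority neighbours to synthesize new points."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # random position on the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Three toy minority-class points in a 2-D feature space
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]])
X_new = smote_sketch(X_minority, n_new=4)
print(X_new.shape)  # (4, 2)
```

Because every synthetic point lies on a segment between two real minority samples, the new class stays inside the region the minority class already occupies, unlike naive duplication, which adds no new information.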

As our dataset contained categorical variables, we used Label Encoder to convert them into numerical variables. This allowed us to work with the categorical data effectively.
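scikit-learn's `LabelEncoder` maps each distinct category to an integer code (classes are ordered alphabetically); the education levels below are an illustrative column, not the full dataset:

```python
from sklearn.preprocessing import LabelEncoder

# Encode an illustrative categorical column into integer codes
education = ["Graduate", "Masters", "High School", "Graduate", "Phd"]
le = LabelEncoder()
codes = le.fit_transform(education)
print(list(le.classes_))  # alphabetical class order
print(codes.tolist())
```

Note that label encoding imposes an arbitrary numeric order on nominal categories; for truly unordered features, one-hot encoding is often the safer choice, although tree-based models such as Random Forest are fairly robust to integer codes.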

Using the balanced dataset, we divided it into training and test sets for model creation. We built six models: Logistic Regression, Naïve Bayes, Decision Tree, Support Vector Machine, Random Forest Classifier, and XGB Classifier. Among these models, the Random Forest Classifier performed the best, achieving a score of 0.67.
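The winning Random Forest workflow follows the standard scikit-learn pattern. The synthetic data below merely stands in for the balanced, encoded HR features, so the score printed here is not the project's 0.67:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced, encoded HR features
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # an easily learnable rule

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit an ensemble of 100 trees and score it on the held-out split
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(round(score, 2))
```

The same `fit`/`score` interface applies to all six candidate models, which is what makes the side-by-side comparison straightforward.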


From our data visualization, we discovered that most of the data scientists in the sample:

  • Are male
  • Have relevant experience
  • Are not currently enrolled in a university course
  • Are graduates in a STEM discipline
  • Have more than 20 years of work experience
  • Work in a private company
  • Work in a medium-sized company
  • Got their new job in the last year

The dataset is imbalanced: only 24.9% of the data scientists in the sample are looking to change their job.

Conclusion and Recommendation

This analysis illustrates the potential of using machine learning to predict job change based on factors such as candidate experience, enrollment in a university, relevant experience, and previous job history. It is important to note that the accuracy of the findings heavily relies on the quality of the dataset used to train the model.

To obtain more accurate results, it is crucial to ensure that the model is trained with precise and reliable data from the company’s records. By doing so, better outcomes can be achieved in predicting job changes. Thank you for your attention.



