I’m currently going through a course that I’m going to review soon.
The course’s main dataset has training and testing data on employee attrition. Overall, around 16% of employees leave the company, and based on the available features¹, we want to build a model (GLM, Deep Learning, Stacked Ensemble, and so on) that predicts whether an employee will leave or stay. After cleaning and preparing the data for the ML model, the goal is to recommend policies to stakeholders that reduce attrition costs.
The best model that I’ve got with the h2o framework is a Stacked Ensemble that contains:
- 1 Deep Learning model.
- 2 Distributed Random Forest models.
- 1 Gradient Boosting Machine model.
- 1 Generalized Linear model.
On cross-validation data, I am getting an AUC of 0.8411565, with a LogLoss of 0.3124953.
The confusion matrix on testing data is as follows:
| | MODEL = No | MODEL = Yes | Error | Rate |
|---|---|---|---|---|
| ACTUAL = No | 181 | 3 | 0.016304 | =3/184 |
| ACTUAL = Yes | 17 | 19 | 0.472222 | =17/36 |
| Total | 198 | 22 | 0.090909 | =20/220 |

This is all with the optimal threshold value of 0.415080729861174.
So, what does this all mean?
Well, the core problem of binary (or any) classification is choosing the threshold value that tells us when to classify an employee (in this case) as one who will leave the organization. Our optimal threshold value is 0.41508: when the model predicts a probability of leaving (given the particular set of features for one employee) greater than or equal to 41.51% (rounded), we label that employee as a Job Quitter.
The optimal threshold value of 0.415080729861174 is the one that maximizes the F1 Score, which is defined by the following formula:
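The thresholding rule can be sketched in a few lines of Python. This is only an illustration: the function name `label_employee` and the example probabilities are made up, and only the threshold value comes from the model above.

```python
# Threshold reported in the text for this model.
THRESHOLD = 0.415080729861174


def label_employee(p_leave: float, threshold: float = THRESHOLD) -> str:
    """Return "Yes" (predicted to leave) when the probability reaches the threshold."""
    return "Yes" if p_leave >= threshold else "No"


# Hypothetical predicted probabilities of leaving, for illustration only.
example_probabilities = [0.05, 0.41, 0.42, 0.90]
labels = [label_employee(p) for p in example_probabilities]
print(labels)  # ['No', 'No', 'Yes', 'Yes']
```

Note that 0.41 falls just below the threshold and 0.42 just above it, so two employees with nearly identical predicted probabilities end up with different labels.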
\[ F_{1} = \frac{2 \times (Precision \times Recall)}{Precision + Recall} \]
Essentially, this is the harmonic mean of Precision and Recall.
Okay, but what are Precision and Recall, and why should we care about them?
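As a quick check, here is a minimal Python sketch that computes the F1 Score from the confusion matrix counts on the testing data (TP = 19, FP = 3, FN = 17), using the definitions of Precision and Recall given in the sections that follow. The variable names are my own.

```python
# Counts taken from the confusion matrix on the testing data.
TP, FP, FN = 19, 3, 17

precision = TP / (TP + FP)
recall = TP / (TP + FN)

# F1 as written in the formula above.
f1 = 2 * (precision * recall) / (precision + recall)

# Harmonic mean: the reciprocal of the mean of the reciprocals.
harmonic_mean = 2 / (1 / precision + 1 / recall)

print(round(f1, 4))             # 0.6552
print(round(harmonic_mean, 4))  # 0.6552
```

The harmonic mean sits closer to the smaller of the two values, so a model cannot reach a high F1 Score by doing well on only one of Precision or Recall.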
1 Precision
The formal definition of precision is:
\[ Precision = \frac{TP}{TP + FP} \]
… where the relevant values can be found in the confusion matrix:
- TP = true positives. Intersection of MODEL = Yes and ACTUAL = Yes. In this case: 19.
- FP = false positives. Intersection of MODEL = Yes and ACTUAL = No. In this case: 3.
So, with the optimal threshold value, the precision is 0.8636364, or 86.36%.
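Spelled out with the counts from the confusion matrix, the arithmetic is:

\[ Precision = \frac{19}{19 + 3} = \frac{19}{22} \approx 0.8636 \]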
Let’s make this even clearer: for Precision, our pivot/main column is what the model is predicting: MODEL = Yes.
So, when the model predicts that an employee will leave the organization (again, with the selected threshold value), it is correct in 86.36% of cases on data (testing data) that the model has never seen.
2 Recall
The formal definition of recall is:
\[ Recall = \frac{TP}{TP + FN} \]
… where the relevant values can be found in the confusion matrix:
- TP = true positives. Intersection of MODEL = Yes and ACTUAL = Yes. In this case: 19.
- FN = false negatives. Intersection of MODEL = No and ACTUAL = Yes. In this case: 17.
So, with the optimal threshold value, the recall is 0.5277778, or 52.78%.
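Spelled out with the counts from the confusion matrix, the arithmetic is:

\[ Recall = \frac{19}{19 + 17} = \frac{19}{36} \approx 0.5278 \]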
Let’s make this even clearer: for Recall, our pivot/main row is what the actual data says: ACTUAL = Yes.
So, out of the employees who actually left the organization (in data that the model has never seen), the model correctly classified 52.78% as Job Quitters.
3 Conclusion
In this sense, we can see that Recall can often be more important than Precision. Why? Well, imagine that the model is predicting which patients have cancer. If the model predicts that a patient does not have cancer (MODEL = No), but the patient actually has cancer (ACTUAL = Yes), the hospital could be overwhelmed with lawsuits.
Nevertheless, both measures are important, but we should value them differently using the concept of expected value (\(EV\)): that is, each case where the model delivers a false positive or a false negative should be weighted with its own level of cost. Off the top of my head, these could be:
- In case of false negatives, the expected cost of a lawsuit (\(cost \times p_{success}\)).
- In case of false positives, the expected cost of treatment.
These measures need to be tested at different threshold values in order to find the optimum. h2o does this automatically, and in the following graph you can see the results:
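To make the weighting concrete, here is a small Python sketch. All monetary figures and the probability of a successful lawsuit are hypothetical placeholders, and the false positive and false negative counts are borrowed from the confusion matrix above purely for illustration.

```python
def expected_cost(fp: int, fn: int, cost_fp: float, cost_fn: float) -> float:
    """Total expected cost of the model's mistakes at one threshold."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical figures, not taken from any real data.
TREATMENT_COST = 1_000.0    # cost of one false positive
LAWSUIT_COST = 500_000.0    # cost of one lawsuit
P_SUCCESS = 0.25            # probability that a lawsuit succeeds

# Expected cost of one false negative: cost * p_success.
COST_FN = LAWSUIT_COST * P_SUCCESS

total = expected_cost(fp=3, fn=17, cost_fp=TREATMENT_COST, cost_fn=COST_FN)
print(total)  # 2128000.0
```

With costs this lopsided, the 17 false negatives dominate the total, which is the argument for giving Recall more weight in such a setting.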
The red line represents the optimal threshold value of 0.415080729861174 (the highest point of the F1 Score), while the blue line represents the respective values of the different measures at the optimal threshold value. At the optimal threshold value, the overall accuracy of the model is highest: almost 91% (200 of the 220 test cases are classified correctly).
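The idea behind such a threshold search can be sketched in plain Python. This is not h2o's implementation, only a minimal illustration of sweeping candidate thresholds and keeping the one with the highest F1 Score. The probabilities and labels below are made up.

```python
def f1_at(threshold, probs, actuals):
    """F1 Score when every probability >= threshold is labelled positive."""
    tp = sum(1 for p, a in zip(probs, actuals) if p >= threshold and a == 1)
    fp = sum(1 for p, a in zip(probs, actuals) if p >= threshold and a == 0)
    fn = sum(1 for p, a in zip(probs, actuals) if p < threshold and a == 1)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * (precision * recall) / (precision + recall)


# Hypothetical predicted probabilities and true labels (1 = left, 0 = stayed).
probs = [0.05, 0.10, 0.20, 0.35, 0.45, 0.50, 0.70, 0.90]
actuals = [0, 0, 1, 0, 1, 0, 1, 1]

# Candidate thresholds: each distinct predicted probability.
candidates = sorted(set(probs))
best_threshold = max(candidates, key=lambda t: f1_at(t, probs, actuals))
best_f1 = f1_at(best_threshold, probs, actuals)
print(best_threshold, round(best_f1, 4))  # 0.2 0.8
```

The same loop could maximize accuracy or minimize an expected cost instead of F1 by swapping the scoring function.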
Footnotes
¹ Age, BusinessTravel, DailyRate, Department, DistanceFromHome, Education, EducationField, Gender, HourlyRate, JobInvolvement, JobLevel, JobRole, JobSatisfaction, MaritalStatus …