After finishing Data Science for Business (Part 02, hereinafter: DSBF2), I wrote a post on LinkedIn where I reviewed the course. Since the platform has a cursed character limit, this post gives more details for fence sitters.
1 Motivation
I entered this course with over 500 hours of theory study in both R and Python. I’ve also used both languages daily in my professional life for the past 4.5 years.
So, why did I take this course? What was the motivation behind it?
I was already introduced to ML during my study of R and Python, and this field has always fascinated me with the unique ways it can bring value to an industry or organization. But I was always left with tools and ideas that could not be implemented: at least not in the industry where I was working. When you write a report as an Expert Witness, your stakeholders are:
- Lawyer(s).
- Judge(s).
- Some other legal body.
Most of them don’t like statistics or conclusions that end in probability statements (at least not in the official report). Which is bad, since we make everyday decisions based on the probability of some event (not) occurring. You cannot escape it.
The UK legal system has similar problems.
The Expert Witness, Forensic Science and the Criminal Justice Systems of the UK by Lucina Hackman, Fiona Raitt, Sue Black (2019) deals with forensic science in the classical sense (gunshots, footprints and so on), but the conclusions can easily be applied to white collar crimes. Current legal theory in the context of Expert Witness procedures has the following to say on the well-known Bayes’ theorem (bolds are mine):
The calculation of probabilities is strengthened through the use of Bayes’ theorem in the calculation process (Aitken et al. 2010). A detailed description of Bayes’ theorem is beyond the scope of this chapter; however, the use of Bayes’ theorem to assist in explaining to the court mathematical or statistical assessments of probability has caused, and continues to cause, controversy in relation to expert evidence (Berger et al. 2011). The Appeal Court has issued guidance for practitioners, experts and the judiciary in English courts, where their concern is for the introduction of Bayes’ theorem as evidence because of its potential to confuse the fact-finder and thus lead to miscarriages of justice. However, this approach has been the subject of debate and challenged in the academic literature which has presented a strong case for the use of Bayesian methods, or at the very least the use of a likelihood ratio, to remain as an admissible measure of evidential strength in courts (Berger et al. 2011; Donnelly 2005; Fenton and Neil 2011; Robertson et al. 2011).1
The authors go on to describe a case where Bayes’ theorem was successfully applied:
In the first trial a statistician was brought into the court to explain to the jury how they might use a fully Bayesian approach when interpreting the strength of the DNA evidence that was being presented by the experts to the court (Donnelly 2005). The prosecution scientist having calculated a match probability of 1 in 200 million, the probability calculated was the chance that someone would match the discovered DNA profile if they were innocent. As already discussed, this is not the same as the probability that the accused was innocent, which is the misunderstanding that is the prosecutor’s fallacy, or the possibility that they are innocent if they match the profile. While the jury was instructed in the use of Bayes’ theorem to allow them to understand the probabilities at the culmination of the trial, Adams was found guilty. There was an appeal which was upheld on the basis that the judge should have given more advice to the jury about how to approach the statistics presented in the courtroom, if they did not wish to use Bayes’ theorem to interpret the evidence placed before them. The appeal was upheld and a retrial ordered at which, again, the evidence was presented using Bayes’ theorem. This time the judge asked the experts to come together to create a way of explaining how to apply Bayes’ theorem in their decision-making process. Adams was found guilty a second time and a second appeal was unsuccessful; however, the court ruled heavily against the use of Bayes’ theorem, endorsing the reservations expressed in the first appeal as follows: “To introduce Bayes Theorem, or any similar methods, into a criminal trial plunges the jury into inappropriate and unnecessary realms of theory and complexity deflecting them from their proper task”. 
They continued: “We regard the reliance on evidence of this kind in such cases as a recipe for confusion, misunderstanding and misjudgement, possibly even amongst counsel, but very probably among judges and, as we conclude, almost certainly among jurors”.2
Somewhat ironically, the appeal judges said the following:
But the apparently objective numerical figures used in the theorem (Bayes’) may conceal the element of judgement on which it entirely depends. More importantly for present purposes, however, whatever the merits or demerits of the Bayes theorem in mathematical or statistical assessments of probability, it seems to us that it is not appropriate for use in jury trial or as a means to assist the jury in their task. … More fundamentally, however, the attempt to determine guilt or innocence on the basis of a mathematical formula, applied to each separate piece of evidence, is simply inappropriate to the jury’s task. Jurors evaluate evidence and reach a conclusion not by means of a formula, mathematical or otherwise, but by the joint application of their individual common sense and knowledge of the world to the evidence before them.3
For me, this is ironic, since any judge has some areas of uncertainty in his decision-making process where, internally, his thought process runs as follows: given the evidence that was presented, I can be (or cannot be) sure that the following is or is not the case …
Or, to put it in simpler terms: when somebody tells us that a coin is fair, with \(p = 0.5\) being the probability of heads, but after 50 flips I get 40 tails (so only 10 heads), I can be 95% sure that the true \(p\) lies between 10.2% and 32.2%. I can conclude that the coin is almost certainly not fair.
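The coin argument can be checked with a few lines of Python. This is only a sketch: it uses the Wilson score interval, which is my assumption and not necessarily the method behind the bounds quoted above, so the numbers come out slightly different (roughly 11% to 33%). The function name `wilson_interval` is made up for illustration.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width

# 40 tails in 50 flips means 10 heads.
low, high = wilson_interval(successes=10, trials=50)
print(f"95% interval for p(heads): {low:.3f} to {high:.3f}")
print("p = 0.5 inside the interval:", low <= 0.5 <= high)
```

Whichever interval method is used, the claimed value of 0.5 sits far outside the plausible range, which is the whole point of the example.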
Anyway, back to the course.
2 Review
The course begins with a binary classification problem: around 16% of employees have left the organization, which leads to an enormous attrition cost for the company because of lost productivity, the cost of finding a new employee and so on. The goal is to correctly estimate the probability that an employee will leave the company, based on various features, such as: Age, BusinessTravel, DailyRate, Department, DistanceFromHome, Education, EducationField, EmployeeCount, EmployeeNumber, EnvironmentSatisfaction, Gender, HourlyRate, JobInvolvement, JobLevel, JobRole, JobSatisfaction, MaritalStatus, MonthlyIncome, MonthlyRate, NumCompaniesWorked, Over18, OverTime, PercentSalaryHike, PerformanceRating, RelationshipSatisfaction, StandardHours, StockOptionLevel, TotalWorkingYears, TrainingTimesLastYear, WorkLifeBalance, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager.
After presenting the business problem and explaining the financial implications of turnover (costs are measured in millions of dollars), the course cleans the data with the newest packages available in the R community (mostly tidyverse). Frankly, I never paid any attention to tidy evaluation, but this course really emphasizes it, and I can see the value in it. On the other hand, I was never into the map functions from purrr: I was learning R in 2018, and the classical apply family that base R has (especially lapply and sapply) served me quite well. Of course, during further study of R, I came across purrr and have used it, but the map functions were not really in focus. But, in the context of the programming principles that Python taught me, I might start to use map, since the name of each function hints at its return type (map_dbl, map_df and so on).
With the cleaned data (the course includes an excellent section on data transformation), the modelling can begin. I have to admit, h2o and LIME are game changers for me. h2o generates ML models on the fly, with detailed statistics, and LIME helps the data scientist demystify the black box behind each model. During my previous study of statistics/ML in R/Python, I had never seen anyone use these frameworks, and frankly, it’s a shame. The caveat is that hard work cannot be avoided: I would strongly recommend that anybody interested in ML examine the theory behind each of the well-known models (logistic regression, random forest and so on).
After choosing the best models, special emphasis (which I greatly appreciate) is put on concepts such as F1, sensitivity, specificity, precision and recall. Better yet, these concepts are explained in light of the business problem at hand and of the business domain as such: for example, if our business problem is to determine whether a patient has cancer or not, it’s better for the hospital to lower the false negative rate. Why? For the simple reason that the hospital can be sued if the conclusion was that the patient did not have cancer when in fact he did (PREDICTED = FALSE | ACTUAL = TRUE).
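For readers who want the definitions spelled out, here is a minimal Python sketch that derives these metrics from the four cells of a confusion matrix. The counts are invented purely for illustration and do not come from the course data; `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)          # of predicted positives, how many are real
    recall = tp / (tp + fn)             # a.k.a. sensitivity: real positives we caught
    specificity = tn / (tn + fp)        # real negatives correctly left alone
    f1 = 2 * precision * recall / (precision + recall)
    false_negative_rate = fn / (fn + tp)  # PREDICTED = FALSE | ACTUAL = TRUE
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "false_negative_rate": false_negative_rate,
    }

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=30, fp=20, fn=10, tn=140)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that recall and the false negative rate always sum to 1, so lowering the false negative rate in the hospital example is the same thing as raising sensitivity.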
The course really emphasizes this crucial point in the context of delivering value to business stakeholders. Based on the threshold value, we can model various overtime (OT) policy changes that affect the total attrition cost in different ways. For example, the best model that I had on the testing data (GLM) gives the following sensitivity analysis:
In this graph, we can see that the best threshold value is around 12%: if the model gives a probability of leaving the organization equal to or greater than 12% for a particular employee, we change the OT policy for that employee. Do note that the optimum threshold by F1 score on the testing data actually gives only the 3rd best solution. The reason is obvious: we don’t value false negatives the same as false positives. The do-nothing policy is actually deceptive in this case (it shows some savings), since the h2o framework does not calculate the metrics at a threshold at which no employee would be characterized as a job quitter.
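Why the F1-optimal threshold need not be the cheapest one can be shown with a toy Python sketch. Everything here is an assumption made for illustration: the scores, the labels and the two cost figures are invented, and the course's actual cost model is considerably more detailed.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP and FN when everyone with score >= threshold is flagged."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    return tp, fp, fn

def f1_at(scores, labels, threshold):
    tp, fp, fn = confusion_counts(scores, labels, threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def cost_at(scores, labels, threshold, cost_fp, cost_fn):
    _, fp, fn = confusion_counts(scores, labels, threshold)
    return fp * cost_fp + fn * cost_fn

# Invented predicted probabilities of leaving and true outcomes (1 = left).
scores = [0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1]

# Assumed costs: missing a leaver is five times worse than a needless policy change.
COST_FP, COST_FN = 1, 5

thresholds = sorted(set(scores))
best_f1_threshold = max(thresholds, key=lambda t: f1_at(scores, labels, t))
best_cost_threshold = min(
    thresholds, key=lambda t: cost_at(scores, labels, t, COST_FP, COST_FN)
)
print("threshold with best F1:", best_f1_threshold)
print("threshold with lowest cost:", best_cost_threshold)
```

With these made-up numbers the cost-minimizing threshold is much lower than the F1-maximizing one, because expensive false negatives push us to flag more employees.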
And this is why this course has great value: business insights!
The last parts of the course give recommendation strategies for a particular employee in three different areas, but I will not go into this section any further.
All this being said, I have some important caveats:
- As I’ve said, hard work cannot be avoided. I made a lot of progress in a short time because I had dealt with these concepts previously. But if R or statistics are completely new to you, you will be demotivated, since you won’t be ready for the advanced concepts.
- Matt Dancho (the instructor) implements good coding practices in designing the project structure. I’ve used a similar workflow for the last 4-5 years: there is a special folder for data, helper files, steps of analysis and so on. But I’ve been diverging somewhat from his coding practices:
  - For example, in the later sections of the course, scripts start with defining the file paths, reading and cleaning those files through defined functions and so on. This leads to code duplication, thus breaking the DRY principle. Instead of that approach, I’m using special functions that return cleaned data (for example: get_train_data gives me raw training data, get_processed_data_train gives me processed training data, get_train_tbl_after_bake gives me processed training data after applying the transformation recipe). With that approach, the environment is cleaner and the functions can be documented. But I have nothing against Matt’s approach: after all, this is a course, and organizing everything into functions might confuse some of the users.
  - The approach that I use also confirms a comment I’ve read in the Python vs. R wars (which are ultimately useless, since you should know both languages, IMO): after Python, your R code will get better.
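The accessor-function idea from the list above can be sketched in Python. This is not my actual R code: the function names are borrowed from it, but the inline CSV, its values and the mean-centring step (a stand-in for the transformation recipe) are all made up so that the example runs on its own.

```python
import csv
import io

# Made-up stand-in for a raw training file; a real project would read it
# from the data folder instead.
RAW_TRAIN_CSV = """EmployeeNumber,OverTime,MonthlyIncome
1,Yes,5993
2,No,5130
3,Yes,2090
"""

def get_train_data():
    """Return the raw training rows exactly as stored (all values are strings)."""
    return list(csv.DictReader(io.StringIO(RAW_TRAIN_CSV)))

def get_processed_data_train():
    """Return cleaned training rows with proper Python types."""
    return [
        {
            "EmployeeNumber": int(row["EmployeeNumber"]),
            "OverTime": row["OverTime"] == "Yes",
            "MonthlyIncome": int(row["MonthlyIncome"]),
        }
        for row in get_train_data()
    ]

def get_train_tbl_after_bake():
    """Return processed rows after a stand-in transformation (mean-centred income)."""
    rows = get_processed_data_train()
    mean_income = sum(r["MonthlyIncome"] for r in rows) / len(rows)
    return [{**r, "MonthlyIncome": r["MonthlyIncome"] - mean_income} for r in rows]

for row in get_train_tbl_after_bake():
    print(row)
```

Each analysis script then needs a single call instead of repeating paths and cleaning steps, and every stage has one documented place where it is defined.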
As I’ve said in my LinkedIn post, I wholeheartedly recommend this course. An additional value of the course is the Slack channel, where you can ask the professional community anything regarding the course or statistics in general. Also, Matt replies to your problem in a reasonable amount of time, and is always eager to help and deliver value. Worth every penny, IMO, and I doubt you will get this kind of treatment elsewhere. :smile: