So I started pursuing an online Master's program in Machine Learning earlier this year. The purpose was twofold. The first was to get a more structured and broader handle on the field of Machine Learning, since all of my knowledge so far is self-taught or learned on the job. The more broadly you understand the gamut of algorithms available, the better and more realistic the applications you can find or design for your Supply Chains.
The second was to get a formal credential in the field.
So last week (5th July), I wrote an online proctored exam as part of the coursework. It was kind of “adventurous” since I was RVing. I had to find a spot in the White Mountain National Forest with good phone coverage, use my phone as a hotspot, climb onto a bunk bed in the RV (to get away from my wife and kid), and then write the exam hunched over for 90 minutes, since I hardly fit in there.
Anyway, the exam went just fine (by my own assessment). I have never been a fan of exams and am notorious for not “preparing” specifically for them, so doing that at this age was out of the question. I just went in with what I had learned, and it was fun. However, this post is not about how the exam went. It is about why Data Science education needs to evolve in the face of AutoML and the ubiquitous availability of code snippets.
I am disappointed with Data Science education
But before I crib, I have to disclose my choice of program. There was another Master's in Data Science, but I chose the Master's in AI and ML, simply because I wanted more exposure to programming, primarily to understand which packages and libraries to leverage for which algorithm. I have never been a professional programmer or developer, and before I started teaching myself, my only exposure to programming was the programming courses included in required coursework. So the evaluation criteria that I am reviewing in this post may very well be justified for the coursework I am pursuing.
The points below are what I think are relevant for a Data Science education.
Yes, you still need Data Scientists, despite the coding aspects getting automated
I am still amazed that we have not yet figured out that a Data Scientist is not, and should not be, a developer. If you don’t have an ML engineer helping your DS develop, your team is structured very wrong.
Yes, they do need a good depth of knowledge of the algorithms and the math behind them. They need to understand the entire modeling process in depth, along with the underlying code (even if the AutoML tool does the coding in the background).
But do you really need to test real skills by asking them to write code under a time constraint?
Because ~5 years from now, they will not (and should not) be coding at all. If a person with a Data Scientist title is primarily writing code in your organization, you have got it all wrong, and I can diagnose straight away that many of your Data Science projects are struggling.
And forget 5 years from now. If I really need code to create a list out of an existing list (a question on the exam that I got right), in the real world I would just use one of the thousands of code snippets available online. So what did it test? Speed of coding is not relevant if I am not a developer, and understanding of code fundamentals can easily be tested by MCQs designed to test code and modeling concepts together. I did not see even a single question like that on the exam.
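For a sense of scale, a task of that kind (building one list from another) usually takes a line or two of Python. The sketch below is purely illustrative: the actual exam question is not reproduced in this post, and the list `order_quantities` and the threshold `MIN_LARGE_ORDER` are made-up assumptions.

```python
# Hypothetical data: the real exam question is not reproduced here.
order_quantities = [12, 250, 7, 90, 430]

# Assumed cut-off for what counts as a "large" order (illustration only).
MIN_LARGE_ORDER = 90

# A list comprehension builds a new list from the existing one,
# keeping only the quantities at or above the cut-off.
large_orders = [qty for qty in order_quantities if qty >= MIN_LARGE_ORDER]

print(large_orders)
```

The original list is left unchanged. The comprehension returns a new list that contains only the elements passing the filter.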
I am not a developer and never have been. So if I need to build an algorithm, I either look online and use what I find with tweaks (since I know coding, which is different from being a professional coder/developer, and hence know what to tweak and update), or I hire a programmer. If you have been in my LinkedIn network for long, you may have seen requests from me looking for programmers for hire. That happens when the algorithm is a unique idea for which I cannot find a pre-existing code snippet.
And that is the smart way to be an Analytics professional, if you don’t aim to be a professional Python developer/ML Engineer.
The education market is driven by the same confusion that persists in hiring
And this method of testing is driven by the way we hire. We hire Data Scientists by testing them either purely on coding skills or on a case study that is widely available on the internet on a hundred-plus sites. (If I got a dollar for every Data Science “education” site that uses the telecom churn case study, I would be filthy rich.)
It is unfortunate that it is 2020 and we still cannot differentiate between a Machine Learning Engineer and a Data Scientist. Since the hype behind Data Science is driving the education, the education programs are chasing what hiring managers are looking for.
So the programs are testing and spitting out developers in the guise of Data Scientists, and hiring managers are hiring the same. The result? We see disappointments from AI and ML projects, where we blame the failure on many different things.
But in my mind, the failure is bad design of algorithms, because we have focussed only on writing the code of the algorithm as the primary skill. If this is not checked now, it will keep haunting companies for years.
So what does the DS need to know?
The list can be long but some aspects at a high level are:
- The business aspects of algorithms: which algorithm will do its magic in which scenario, and what variables, parameters and constraints are involved.
- The fundamentals of the math behind the algorithms and computations, to understand how the algorithm is doing its magic.
- The ins and outs of model evaluation, performance, etc., and the ability to interpret the results and tie them to the business scenario/problem.
- Proficiency to peek under the hood and understand the code in case of a suspected “breakdown”.
So how should the evaluation be?
So now, considering the points above, do you really want to evaluate a DS on their ability to write a Python code snippet in a “time constrained environment”?
I think the key phrase here is “time constrained environment”.
Even in scenarios where AutoML tools have not evolved, I would see Data Scientists working in close collaboration with ML engineers (which I believe the program I am enrolled in may be focussed towards). So any “rapid fire” programming can be taken care of by these professional algorithm developers.
What scares me is that testing methods like this further cement the notion in students that the primary skill for a Data Scientist is to code rapidly.
I cannot code rapidly, but I can guarantee that I can design an Analytics solution to any Supply Chain problem, if a solution exists, and identify the libraries that need to be leveraged, along with any other tools (like solvers) and infrastructure (AWS S3, Redshift, etc.) involved.
Someone with rapid coding skills can then work with me from there. THAT is actually what DS skills, at least in Supply Chain, should be.
From a full solution development perspective, I am useless without an ML engineer, because I am not an ML Engineer. My skills are different from an ML Engineer's.
I wrote a case study to include in this article as an example of how Data Scientists can be tested for true DS skills, even for coding aspects, in a way that makes them complement ML engineers perfectly. But then I decided not to include it here. I don’t think training providers like the one I am currently enrolled with have the maturity to comprehend and incorporate such testing methods, and neither is there any will to do so. Those case studies can be put to better use in the future: to hire analytics professionals who will deliver AI solutions that ACTUALLY work.