Introduction to Machine Learning: Is AutoML Replacing Data Scientists?

Types of Machine Learning Models
Before explaining how ML is applied, this section introduces the common types of ML models.

Supervised learning: A machine learning task that learns from a labeled training dataset, builds a function from it, and predicts outputs for new inputs using that function. The training dataset usually consists of pairs of an input vector (or scalar) and the corresponding output vector (or scalar). The output of the function can be a continuous value (regression) or a class label (classification).

[Figure: Supervised learning]
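As a rough illustration of learning a function from input/output pairs, the following minimal sketch fits a straight line to a tiny labeled dataset and predicts the output for an unseen input. The data and the function names (`fit_line`, `predict`) are assumptions made up for this example.

```python
# Minimal supervised-learning sketch: learn f(x) = w*x + b from labeled pairs.
# The dataset and function names are illustrative assumptions.

def fit_line(xs, ys):
    """Ordinary least squares for a single input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

def predict(model, x):
    """Apply the learned function to a new input."""
    w, b = model
    return w * x + b

# Training set: (input, output) pairs that follow y = 2x + 1.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

model = fit_line(train_x, train_y)
prediction = predict(model, 4.0)  # regression output for an unseen input
print(model, prediction)
```

A classifier follows the same pattern, except that the learned function returns a class label instead of a continuous value.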

Unsupervised learning: A type of machine learning algorithm that learns from a dataset without any labels. The algorithm automatically groups or categorises the input data. Its main applications include cluster analysis, association rule learning, and dimensionality reduction.

[Figure: Unsupervised learning]
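Cluster analysis can be sketched with a small k-means loop. This is a simplified one-dimensional version for illustration only; the points and starting centers are assumed values.

```python
# Minimal k-means sketch on 1-D data: no labels are given, the algorithm
# groups the points on its own. Data and starting centers are assumptions.

def kmeans_1d(points, centers, iterations=10):
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers, clusters)
```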

Semi-supervised learning: It learns from a small amount of labeled data combined with a large amount of unlabeled data. Semi-supervised learning sits between unsupervised learning (no labeled training data) and supervised learning (fully labeled training data). Machine learning research has shown that combining unlabeled data with a small amount of labeled data can improve learning accuracy.


For example, in medical image analysis of CT or MRI scans, radiologists can examine and mark a small portion of the scans for tumors or disease. This type of data is easy to collect, but manually marking every scan is time-consuming and costly. Compared with purely unsupervised learning, a model built with deep learning networks to assist with labelling can therefore benefit from a small proportion of labelled data and improve its accuracy.
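One common way to use a few labels together with many unlabeled samples is self-training (pseudo-labelling). The sketch below is a toy version, not a medical imaging pipeline: each "scan" is reduced to a single assumed feature value, and a nearest-centroid rule stands in for the deep learning model.

```python
# Self-training sketch: start from two labeled samples, then repeatedly
# pseudo-label the unlabeled sample the current model is most confident about.
# Feature values, labels, and the nearest-centroid model are assumptions.

def centroid(values):
    return sum(values) / len(values)

def self_train(labeled, unlabeled):
    labeled = list(labeled)
    remaining = list(unlabeled)
    pseudo_labels = {}
    while remaining:
        c0 = centroid([x for x, y in labeled if y == 0])
        c1 = centroid([x for x, y in labeled if y == 1])
        # Confidence: how much closer a sample is to one centroid than the other.
        best = max(remaining, key=lambda x: abs(abs(x - c0) - abs(x - c1)))
        label = 0 if abs(best - c0) < abs(best - c1) else 1
        pseudo_labels[best] = label
        labeled.append((best, label))  # the pseudo-label joins the training set
        remaining.remove(best)
    return pseudo_labels

labeled_scans = [(1.0, 0), (9.0, 1)]          # marked by an expert
unlabeled_scans = [2.0, 3.0, 7.0, 8.0, 4.0]   # not yet marked
pseudo_labels = self_train(labeled_scans, unlabeled_scans)
print(pseudo_labels)
```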

Reinforcement learning: A type of machine learning concerned with how an agent should act in an environment to maximise cumulative reward. Unlike supervised learning, reinforcement learning does not require accurately labelled input/output pairs, and sub-optimal actions do not need to be explicitly corrected. Instead, reinforcement learning focuses on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).

[Figure: Reinforcement learning]
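The exploration/exploitation trade-off can be illustrated with an epsilon-greedy agent on a multi-armed bandit, one of the simplest reinforcement learning settings. The reward values, `epsilon`, and the step count below are assumptions chosen for the sketch.

```python
# Epsilon-greedy bandit sketch: with probability epsilon the agent explores a
# random arm, otherwise it exploits the arm with the best estimate so far.
# Rewards are fixed per arm here to keep the example easy to follow.
import random

def epsilon_greedy(rewards, steps=1000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)  # current knowledge about each arm
    counts = [0] * len(rewards)
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(rewards))                            # explore
        else:
            arm = max(range(len(rewards)), key=lambda a: estimates[a])   # exploit
        reward = rewards[arm]
        counts[arm] += 1
        # Incremental mean of the rewards observed for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

estimates, counts = epsilon_greedy(rewards=[0.2, 0.5, 0.9])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_arm, counts)
```

No labels tell the agent which arm is correct; it only sees the reward for the arm it actually pulled.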

The demand for Machine Learning

ML users can be divided into three categories, ordered from short to long development time and from high-level to detailed problems:

  • "Fast food" AI service user

This type of client just needs an automated AI service, and existing AI services can meet their needs. For example, they can load their data into a data lake and then use Amazon Rekognition or Amazon Comprehend to deliver ML results to their API.

  • Problem identifier

This type of client has clearly defined business needs: they understand what data they can provide and which kind of machine learning output can solve their problems. Existing AutoML modules or tools, such as Amazon ML or AWS SageMaker Marketplace, can be used to meet those needs.

  • Need help in defining data assumptions

This kind of client has business needs but no clear definition of the problem, and expects to find some value in the data at hand. In this case, it is necessary to define the business hypothesis with BI tools such as Tableau, and then develop algorithms using SageMaker, EMR, etc. Everything from feature engineering to modeling, and even the architecture, must be highly customized.

[Figure: The three categories of ML demand]

*Note: The development time depends on the users' needs. From bottom to top, the levels are Fast time to value for AI users, Define questions, and Define hypothesis. The higher the level a user requires, up to defining the hypothesis, the longer the development time.
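For the first category, calling an existing AI service can be as short as the sketch below. It assumes the boto3 Rekognition client and its `detect_labels` call; the bucket and object names are hypothetical placeholders. The client is passed in as a parameter so that the function can be tried without AWS credentials.

```python
# Sketch of the "existing AI service" scenario: ask Amazon Rekognition to label
# an image that already sits in S3. Bucket and key names are hypothetical.

def label_image(rekognition_client, bucket, key, max_labels=5):
    """Return (label, confidence) pairs for an image stored in S3."""
    response = rekognition_client.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=max_labels,
    )
    return [(item["Name"], item["Confidence"]) for item in response["Labels"]]

# Typical usage (requires the boto3 package and AWS credentials):
#   import boto3
#   client = boto3.client("rekognition")
#   print(label_image(client, "my-data-lake-bucket", "images/shelf.jpg"))
```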

AutoML (Automated Machine Learning): An Agile Approach to AI Deployment

Machine learning is a problem-solving approach that draws on a great deal of mathematical knowledge. Different scenarios call for different kinds of machine learning models. The following figure shows the relationships between different ML methods. Each problem may draw on different areas of mathematics and different professional fields, so it is sometimes hard to decide which model to use among so many choices.

In the past, building an ML model relied heavily on experienced data scientists. They needed to find the appropriate features, processes, models, model hyperparameters, etc., and the whole process was usually time-consuming. To make ML modular and widely usable across many different scenarios, a research field has emerged that automates ML: AutoML.

In general, AutoML is not only about model development; it also covers data cleansing, feature analysis, and feature transformation. Model development usually involves repeatedly selecting one or more algorithms, testing models, optimizing hyperparameters, and evaluating models.
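The repeated select, tune, and evaluate loop is the part AutoML tools automate. The following toy sketch shows the idea with two hand-written candidate algorithms and a tiny hyperparameter grid; real AutoML tools are far more sophisticated, and every name and value here is an assumption made for illustration.

```python
# Toy model-selection loop: try each algorithm with each hyperparameter
# setting, score it on validation data, and rank the results.

def mean_model(train, _params):
    """Baseline: always predict the average training output."""
    avg = sum(y for _, y in train) / len(train)
    return lambda x: avg

def knn_model(train, params):
    """k-nearest-neighbours regression on a single feature."""
    k = params["k"]
    def predict(x):
        nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

def mse(predict, data):
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

# (algorithm name, builder, hyperparameter grid)
SEARCH_SPACE = [
    ("mean", mean_model, [{}]),
    ("knn", knn_model, [{"k": 1}, {"k": 3}, {"k": 5}]),
]

def auto_select(train, valid):
    results = []
    for name, build, grid in SEARCH_SPACE:
        for params in grid:
            model = build(train, params)
            results.append((mse(model, valid), name, params))
    results.sort(key=lambda r: r[0])  # lowest validation error first
    return results

train = [(float(x), 2.0 * x) for x in range(0, 10, 2)]
valid = [(float(x), 2.0 * x) for x in range(1, 10, 2)]
leaderboard = auto_select(train, valid)
best_score, best_name, best_params = leaderboard[0]
print(best_name, best_params, best_score)
```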

AutoML removes much of the repetitive and redundant work of modeling and tuning. It enables enterprises to experiment with a variety of models, to solve problems more efficiently, or to get more accurate results.

Will data scientists lose their jobs?
McKinsey Global Institute (MGI) points out that the emergence of AutoML can ease the talent shortage in the data science field and replace 50% of the work of data scientists. However, the role of the data scientist will not be completely replaced by AutoML, because only data scientists with domain know-how know what data to collect and how to organise it to solve a specific business problem. Likewise, only data scientists can understand and accurately judge which model should be deployed to production.

How to apply AutoML to the Problem Identifier scenario?
Traditional ML processes require data pre-processing, feature engineering, model training, and model tuning until the most suitable model is found. The whole process is time-consuming. When we try to adopt a solution, the solution itself may become another development problem. If we use AutoML modules or tools to help train the model, what we care about most is whether the technology saves us time. Generally, the machine learning process can be divided into many parts, from data acquisition to deployment, and those parts can be modularized. Problem taxonomies and solutions differ depending on the scenario and usage.
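Modularizing the process can be sketched as a chain of small, replaceable stages. The stage names and the sample data below are assumptions; in practice any stage could be swapped for an AutoML module or tool.

```python
# Sketch of a modular ML pipeline: each stage is a plain function, and the
# pipeline is the ordered list of stages. Data and stage logic are assumptions.
from functools import reduce

def acquire(_):
    """Data acquisition: here just a hard-coded sample."""
    return [
        {"visits": 120, "orders": 30},
        {"visits": None, "orders": 12},  # incomplete record
        {"visits": 80, "orders": 20},
    ]

def preprocess(rows):
    """Data pre-processing: drop records with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def engineer_features(rows):
    """Feature engineering: derive a conversion rate per record."""
    return [dict(r, conversion=r["orders"] / r["visits"]) for r in rows]

def train(rows):
    """Model training: a trivial 'model' that stores the mean conversion."""
    avg = sum(r["conversion"] for r in rows) / len(rows)
    return {"mean_conversion": avg}

PIPELINE = [acquire, preprocess, engineer_features, train]

def run(steps, data=None):
    # Feed the output of each stage into the next one.
    return reduce(lambda acc, step: step(acc), steps, data)

model = run(PIPELINE)
print(model)
```

Because each stage only depends on the output of the previous one, replacing a hand-written stage with a tool-provided one does not affect the rest of the chain.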

Take data analysis in retail as an example, and assume the problem is to find out where the potential market for a store is. It is necessary to consider factors such as the number of customers, the number of orders, the location, etc. Many aspects can help uncover potential information. In the past, we had to gather geospatial data from different sources ourselves, but now we can leverage tools such as the geoenrichment modules in Python. They provide geospatial, population, and income data to smooth the data analysis process and increase the accuracy of ML models. The algorithm selection, training, and model deployment parts can be completed by AutoML tools or modules.

2019-06-11T16:32:04+00:00 2019/04/11 |News|