Introduction to Machine Learning: Is AutoML Replacing Data Scientists?

Types of Machine Learning Methods

Machine learning (ML) is one way to realize artificial intelligence: it solves AI problems by learning from data. Big data is about analyzing large amounts of data, and artificial intelligence is about making machines behave more intelligently. Both can use machine learning as a core tool. Let’s introduce the common types of ML before discussing its applications.

Supervised learning

Supervised learning learns from a labeled training dataset, builds a function, and predicts outputs with that function. The training dataset usually consists of pairs of an input vector (or scalar) and the corresponding output vector (or scalar). The output of the function can be a continuous value (regression) or a class label (classification).
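As a minimal sketch of the idea, the snippet below fits a straight line to a handful of (input, output) pairs by ordinary least squares and then uses the fitted function to predict an unseen input. The data values and the names `training_pairs`, `fit_line`, and `predict` are assumptions made up for illustration; a real project would normally rely on an ML library rather than hand-written code.

```python
# Toy training set of (input, output) pairs; the values are made up.
training_pairs = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]


def fit_line(pairs):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Training": build the function from the labeled pairs.
slope, intercept = fit_line(training_pairs)


def predict(x):
    """Regression output for an input that was not in the training set."""
    return slope * x + intercept


print(predict(5.0))
```

For a classification task the learned function would return a class label instead of a number, but the workflow (learn from labeled pairs, then predict) stays the same.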

Unsupervised learning

This type of machine learning algorithm learns from a dataset without any labels. The algorithm automatically groups or categorizes the input data. The applications of unsupervised learning mainly include cluster analysis, association rule learning, and dimensionality reduction.
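Cluster analysis can be illustrated with a small k-means sketch on one-dimensional data. No labels are given; the algorithm only sees the values and groups them around centroids. The data, the starting centroids, and the function name `kmeans_1d` are assumptions chosen for the example.

```python
def kmeans_1d(values, initial_centroids, iterations=10):
    """Lloyd's k-means algorithm on one-dimensional data."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Unlabeled toy data with two visible groups; the values are made up.
values = [1.0, 1.2, 0.8, 8.0, 8.3, 7.7]
centroids, clusters = kmeans_1d(values, initial_centroids=[0.0, 10.0])
print(centroids, clusters)
```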

Semi-Supervised learning

It learns from a small amount of labeled data combined with a large amount of unlabeled data. Semi-supervised learning sits between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many machine learning studies have shown that combining unlabeled data with a small amount of labeled data can improve learning accuracy.

For example, in medical image analysis of CT or MRI scans, radiologists can examine and mark tumors or diseases in a small part of the scans. The scans themselves are easy to collect, but manually marking all of them is time-intensive and costly. Compared with unsupervised learning, a deep learning network trained to assist with labeling can benefit from this small proportion of labeled data and improve its accuracy.
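One simple way to combine the two kinds of data is self-training (pseudo-labeling): fit a model on the few labeled examples, let it label the unlabeled ones, and refit on everything. The sketch below uses a nearest-centroid classifier on a single hypothetical feature score per scan. It is not the deep learning setup described above, and all names and numbers are assumptions for illustration.

```python
def centroids_by_label(labeled):
    """Mean feature value per label."""
    sums = {}
    for x, label in labeled:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + x, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}


def nearest_label(x, centroids):
    """Label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def self_train(labeled, unlabeled, rounds=3):
    """Self-training: pseudo-label the unlabeled data, then refit on all data."""
    centroids = centroids_by_label(labeled)
    for _ in range(rounds):
        pseudo_labeled = [(x, nearest_label(x, centroids)) for x in unlabeled]
        centroids = centroids_by_label(labeled + pseudo_labeled)
    return centroids


# Two expert-labeled scans and six unlabeled ones (made-up feature scores).
labeled = [(1.0, "normal"), (9.0, "abnormal")]
unlabeled = [2.0, 3.0, 2.5, 7.0, 8.0, 7.5]

centroids = self_train(labeled, unlabeled)
print(centroids)
print(nearest_label(4.0, centroids))
```

In practice, pseudo-labels are usually accepted only when the model is confident about them; this sketch accepts all of them to stay short.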

Reinforcement learning

Reinforcement learning emphasizes how agents should act in an environment to maximize cumulative reward. Unlike supervised learning, reinforcement learning does not require correctly labeled input/output pairs, and suboptimal actions do not need to be explicitly corrected. It focuses on online planning, looking for a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
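The balance between exploration and exploitation can be sketched with an epsilon-greedy agent on a multi-armed bandit. With probability `epsilon` the agent tries a random action (exploration); otherwise it picks the action with the best reward estimate so far (exploitation). The reward means, the noise model, and the parameter values below are assumptions chosen for illustration.

```python
import random


def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Run an epsilon-greedy agent on a bandit with Gaussian rewards."""
    rng = random.Random(seed)
    counts = [0] * len(true_means)       # how often each arm was pulled
    estimates = [0.0] * len(true_means)  # running mean reward per arm
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Exploration: try a random arm.
            arm = rng.randrange(len(true_means))
        else:
            # Exploitation: use the arm that currently looks best.
            arm = max(range(len(true_means)), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the running mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward


# Three actions with hidden average rewards; the agent only sees noisy rewards.
estimates, counts, total_reward = epsilon_greedy_bandit([0.2, 0.5, 1.0])
print(counts)
```

Note that the agent is never told which action is correct. It only receives rewards, and over time it should pull the arm with the highest average reward most often.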

The demand for Machine Learning

Those who need ML can be divided into three categories, ordered from short to long development time and from high-level problems to detailed ones:

AI service fast food demander

This type of client requires an automated AI service, and existing AI services can already meet their needs. For example, they can load their data into a data lake and then use AWS Rekognition or AWS Comprehend to return ML results through their API.

Problem identifier

This type of client has a clear definition of their business needs. They understand what data they can provide and which kind of machine learning output can solve their problems. Existing AutoML modules or tools, such as Amazon ML or AWS SageMaker Marketplace, can be used to meet these needs.

Need help in defining data assumptions

This kind of client has business needs but no clear definition of the problem, and expects to find some value in the data at hand. Currently, it is necessary to first define the business hypothesis with BI tools such as Tableau, and then develop algorithms using AWS SageMaker, EMR, etc. Everything from feature engineering to modeling, and even the architecture, must be highly customized.

Note: The development time depends on the user's needs. The three categories run from the fastest time to value (AI service users), through defining questions, to defining hypotheses. The more help users need at the level of defining the hypothesis, the longer the development time.

AutoML (Automated Machine Learning): An Agile Approach to AI Deployment

Machine learning is a problem-solving approach that combines a lot of mathematical knowledge. Different scenarios call for different machine learning models. The following figure shows the relationships between different ML methods. Each problem may draw on different fields of mathematics and computer science. With so many choices, it is sometimes hard to decide which model to use.

Approach to AI Deployment

In common practice, building an ML model relies heavily on experienced data scientists. They need to find the appropriate features, processes, models, model hyperparameters, etc., and the whole process is usually time-consuming. AutoML is the research field that automates ML so that it can become modular and be widely used in many different scenarios. In general, AutoML covers not only model development but also data cleansing, feature analysis, and transformation. Model development usually involves repeatedly selecting one or more algorithms, testing models, optimizing hyperparameters, and evaluating models. AutoML can take over this repetitive and redundant process of modeling and tuning. It enables enterprises to experiment with a variety of models, and helps them solve problems more efficiently or get more accurate results.
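The repetitive part of model development described above (pick an algorithm, set hyperparameters, evaluate, repeat) can be sketched as a small search loop. The example below tries a majority-class baseline and a k-nearest-neighbors classifier with several values of `k`, and keeps the candidate with the best accuracy on a held-out validation set. The data and all names are assumptions made up for illustration. Real AutoML tools search far larger spaces and also automate data cleansing and feature transformation, which this sketch does not attempt.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def make_majority(train):
    """Baseline model: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def make_knn(k):
    """Return a factory that builds a k-NN predictor from training data."""
    return lambda train: (lambda x: knn_predict(train, x, k))


def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)


def auto_select(train, valid, candidates):
    """Try every candidate and keep the one with the best validation accuracy."""
    best_name, best_score, best_model = None, -1.0, None
    for name, build in candidates:
        model = build(train)
        score = accuracy(model, valid)
        if score > best_score:
            best_name, best_score, best_model = name, score, model
    return best_name, best_score, best_model


# Toy one-dimensional data; (4, 1) is a deliberately noisy training label.
train = [(1, 0), (2, 0), (3, 0), (4, 1), (6, 1), (7, 1), (8, 1), (9, 1)]
valid = [(3.6, 0), (4.4, 0), (5.5, 1), (2.5, 0), (7.5, 1)]

candidates = [("majority_baseline", make_majority)] + [
    (f"knn_k={k}", make_knn(k)) for k in (1, 3, 5)
]
best_name, best_score, best_model = auto_select(train, valid, candidates)
print(best_name, best_score)
```

The design choice worth noting is that the loop is indifferent to what each candidate is. Adding another algorithm or hyperparameter value only means adding an entry to `candidates`, which is what makes this kind of search easy to automate.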

Will data scientists become unemployed?

McKinsey Global Institute (MGI) pointed out that the emergence of AutoML can ease the talent shortage in the data science field and replace 50% of the work of data scientists. Even so, AutoML will not replace data scientists completely. Data scientists know what kind of data should be collected and how to organize it to solve a specific business problem. Furthermore, they understand how to make sound judgments, for instance about which model should be deployed to production.

How to apply AutoML in the Problem Identifier scenario?

Traditional ML processes required data pre-processing, feature engineering, model training, and model tuning until the most suitable model was found. The whole process is time-consuming, and adopting a solution can turn into another development problem of its own. If we use AutoML modules or tools to help train our model, this technology can save us the time we care about most. In general, the machine learning process can be divided into many parts, from data acquisition to deployment, and those parts can be modularized. The problem taxonomy and the solutions differ depending on the scenario and usage.

Take retail data analysis as an example. If the problem is to find where the potential market for a store is, we need to consider factors such as the number of customers, the number of orders, and the location, which can help reveal potential opportunities. In the past, we had to collect geospatial data from different sources ourselves. Now we can leverage tools such as the geo enrichment modules in Python, which provide geospatial data, population, and income to smooth the data analysis process and improve the accuracy of the ML models. The algorithm, training, and deployment components can be handled with AutoML tools or modules.
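To make the modular view concrete, here is a rough sketch of a pipeline in which an enrichment step attaches area-level population and income to store records before a feature step runs. It does not use any real geo enrichment module. The records, the `area_stats` lookup, and every function name are hypothetical stand-ins for whatever tool supplies that data.

```python
# Hypothetical store records and area statistics; all values are made up.
stores = [
    {"store": "A", "area": "north", "customers": 1200, "orders": 900},
    {"store": "B", "area": "south", "customers": 800, "orders": 500},
]
area_stats = {
    "north": {"population": 50000, "median_income": 42000},
    "south": {"population": 120000, "median_income": 38000},
}


def enrich(records, stats_by_area):
    """Enrichment step: attach area-level fields to each store record."""
    return [{**r, **stats_by_area.get(r["area"], {})} for r in records]


def add_features(records):
    """Feature step: derive customers per 1,000 residents where possible."""
    rows = []
    for r in records:
        row = dict(r)
        if "population" in r:
            row["customers_per_1k"] = 1000 * r["customers"] / r["population"]
        rows.append(row)
    return rows


def run_pipeline(data, steps):
    """Apply each modular step in order; steps can be swapped independently."""
    for step in steps:
        data = step(data)
    return data


rows = run_pipeline(stores, [lambda d: enrich(d, area_stats), add_features])
for row in rows:
    print(row["store"], row["customers_per_1k"])
```

Because each step only takes records in and returns records out, the training and deployment steps could be appended to the same list, or replaced by an AutoML tool, without touching the enrichment code.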

2019-09-09T15:05:05+00:00 2019/04/11 |Insights|