Machine Learning Landscape

Recently, I started reading Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurélien Géron, and decided to summarize each chapter in a blog post on my page so that I can keep what I learn from the book in mind longer, and hopefully it benefits those who are interested in the topic.

In this article, the machine learning landscape is discussed: what it is, why we use it, its main types, and its main challenges. Let’s begin!

Content

  1. Types of Machine Learning Systems
    a. Supervised and Unsupervised Learning
    b. Batch and Online Learning
    c. Instance-Based and Model-Based Learning
  2. Main Challenges
    a. Insufficient Training Data
    b. Nonrepresentative Training Data
    c. Poor Quality Data
    d. Irrelevant Features
    e. Overfitting
    f. Underfitting
  3. Testing and Validating
    a. Hyperparameter Tuning and Model Selection
  4. References

Basically, machine learning is the science of programming computers so that they can learn from data. Your spam filter is a very common example: given examples of spam and regular emails, the ML program can learn how to flag spam. The examples that your ML algorithm uses to learn are called the training set. We usually divide our data into two or three partitions, so the ML program uses one to learn, another to calibrate, and the last to measure its performance.
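
To make this concrete, here is a minimal sketch of such a split using scikit-learn's train_test_split; the array names and the split ratios below are my own toy assumptions, not an example taken from the book:

```python
# A minimal sketch of splitting data into training, validation, and test sets.
# The toy arrays and the 60/20/20 ratios are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)   # 20 toy instances with a single feature
y = np.arange(20)                  # toy labels

# First carve out a test set (20%), then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 12 4 4
```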

When do we need to use Machine Learning?

  • When we face problems for which existing solutions require a lot of fine-tuning or long lists of rules.
  • When we need to solve complex problems for which traditional approaches do not work.
  • When we want to get insights about complex problems and large amounts of data.

The list can go on, but we will see more of its importance in the following posts as well.

Some applications of ML:

  • Detecting credit card fraud
  • Summarizing long documents automatically
  • Detecting tumors in brain scans
  • Building an intelligent bot for a game
  • Segmenting clients in a market

Types of Machine Learning Systems

There are many ways to classify ML systems into categories, based on the following:

  • Whether they are trained with human supervision
  • Whether they can learn incrementally on the fly
  • Whether they work by simply comparing new data points to known data points

Now let’s see these in detail.

Supervised and Unsupervised Learning

ML systems can be classified according to the amount and type of supervision they get during training.

Supervised Learning: In this type of learning, the training set contains the desired results, called labels. Two typical supervised learning tasks are classification and predicting a numeric value (regression). An example of the first is the spam filter: the training data contains the information of whether each email is spam (its class). An example of the second is predicting the price of a car; for that, a dataset that contains cars’ properties together with their labels (prices) is needed. The most important supervised learning algorithms are as follows (a minimal classification sketch follows the list):

  • KNN
  • Linear Regression
  • Logistic Regression
  • Support Vector Machine
  • Decision Trees and Random Forests
  • Neural Networks
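
To make the idea concrete, below is a minimal supervised classification sketch with scikit-learn; the toy feature (a count of suspicious words per email) and the labels are assumptions made purely for illustration:

```python
# A minimal, illustrative supervised-learning sketch: a logistic regression
# classifier trained on a toy labeled dataset. Feature values and labels are made up.
from sklearn.linear_model import LogisticRegression

X_train = [[50], [20], [80], [10], [95], [30]]   # toy feature: suspicious words per email
y_train = [1, 0, 1, 0, 1, 0]                     # labels: 1 = spam, 0 = not spam

clf = LogisticRegression()
clf.fit(X_train, y_train)                        # learn from the labeled examples

print(clf.predict([[70]]))                       # predict the class of a new email
```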

Unsupervised Learning: In this case, our data is not labeled. Some important tasks regarding unsupervised learning:

  • Clustering
  • Anomaly detection and novelty detection
  • Dimensionality reduction
  • Association rule learning

For example, suppose you have data about your blog’s visitors and want to group similar visitors based on some properties. At no point is the algorithm told which group a visitor belongs to; it finds those groups by itself.
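
A minimal clustering sketch of that scenario could look like the following; the visitor properties, the data values, and the choice of two clusters are all illustrative assumptions:

```python
# Grouping visitors by two made-up properties (e.g. visit duration and pages viewed).
# No labels are given; KMeans finds the groups on its own.
import numpy as np
from sklearn.cluster import KMeans

visitors = np.array([[2, 3], [3, 3], [2, 2],         # a "quick readers" group
                     [20, 15], [22, 14], [19, 16]])  # a "deep readers" group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(visitors)
print(labels)   # e.g. [0 0 0 1 1 1] -- groups discovered without supervision
```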

Semisupervised Learning: In cases where the data has many unlabeled instances, you may need algorithms that deal with partially labeled data; this type of ML system is called semisupervised. A very good example of this is Google Photos. Once you upload your photos, it automatically recognizes the people who show up in them. Then the system needs you to tell it who those people are.

Reinforcement Learning: In this learning system, an agent observes the environment, performs actions, and gets positive or negative rewards based on those actions; it then learns the best policy by itself. Many robots actually implement reinforcement learning algorithms.
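
As a rough illustration of this loop (not an example from the book), here is a tiny Q-learning sketch on a made-up one-dimensional "walk to the goal" environment; all states, rewards, and parameters are assumptions:

```python
# The agent observes a state, picks an action, receives a reward, and updates
# its action values; over many episodes it learns a policy (here: walk right).
import random

n_states, goal = 5, 4                        # states 0..4, reward when reaching state 4
Q = [[0.0, 0.0] for _ in range(n_states)]    # action values: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.2        # learning rate, discount, exploration rate

for episode in range(200):
    state = 0
    while state != goal:
        # epsilon-greedy action selection: explore sometimes, otherwise act greedily
        action = random.randint(0, 1) if random.random() < epsilon else Q[state].index(max(Q[state]))
        next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
        reward = 1.0 if next_state == goal else 0.0
        # Q-learning update: learn from the observed reward and the best future value
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print([q.index(max(q)) for q in Q[:goal]])   # learned policy: typically prefers action 1 (right)
```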

Batch and Online Learning

In batch learning, the system cannot learn incrementally; it must be trained using all available data, which takes a lot of time and computing resources, so it is typically done offline. First the system is trained, then it is launched into production and runs without learning anymore. Therefore, if new data arrives and you want your system to know about it, you need to train the system from scratch on the full dataset (the old data plus the new data).

In online learning, on the other hand, you train the system incrementally by feeding it data either individually or in small groups (mini-batches). Thus, the system is able to learn about new data on the fly, as it arrives.
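
A minimal sketch of online learning with scikit-learn's SGDClassifier, which supports incremental training via partial_fit, might look like this; the mini-batches below are toy data invented for illustration:

```python
# Incremental training: each mini-batch updates the model without retraining from scratch.
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(random_state=42)
classes = np.array([0, 1])                    # all classes must be declared up front

for _ in range(3):                            # pretend three mini-batches arrive over time
    X_batch = np.random.rand(10, 2)
    y_batch = (X_batch[:, 0] > 0.5).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)   # learn incrementally

print(clf.predict([[0.9, 0.1], [0.1, 0.9]]))
```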

Instance-Based and Model-Based Learning

Another way to categorize ML systems is related to how they generalize. Having good performance on the training data is nice, but it is not sufficient; the real goal is to perform well on new instances that the model has never seen. In that context, there are two main approaches.

In instance-based learning, the system learns the examples by heart, then generalizes to new cases by using a similarity measure to compare them to the learned examples.

In model-based learning, the system builds a model of the examples and then uses that model to make predictions.
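
To contrast the two approaches, here is a small sketch using a k-nearest neighbors regressor (instance-based) and a linear regression model (model-based) on the same made-up data points:

```python
# KNeighborsRegressor predicts by comparing to stored examples (instance-based),
# while LinearRegression fits a model and predicts from it (model-based).
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4], [5]]
y = [1.1, 1.9, 3.2, 3.9, 5.1]

instance_based = KNeighborsRegressor(n_neighbors=2).fit(X, y)   # memorizes the examples
model_based = LinearRegression().fit(X, y)                      # learns a line

print(instance_based.predict([[3.5]]))   # average of the 2 nearest neighbors
print(model_based.predict([[3.5]]))      # point on the fitted line
```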

To summarize, in a typical ML project we have the following steps:

  1. Study the data
  2. Select an appropriate model
  3. Train the model on the training set
  4. Make predictions on new cases

We will cover what a complete ML project looks like in detail in the following posts.

Main Challenges

Our main task is to select a learning algorithm and train it on some data, so what could go wrong in this process? We will discuss the main issues below.

Insufficient Training Data

For an ML algorithm to work properly, we need to feed it a lot of data. Even for very simple problems, thousands of examples may be needed; for complex problems the need is much greater, on the order of millions of examples.

Nonrepresentative Training Data

In order for our model to generalize well, it is very important that the training data be a good representative of the new cases that we are going to generalize to. By building our model on nonrepresentative training data, we are very unlikely to make accurate predictions. Therefore it is crucial to use a training set that is representative of new cases. However, this is often harder than it sounds: if the sample is too small, we will have sampling noise, and even if the sample is very large, it can be nonrepresentative if the sampling method is flawed, which results in sampling bias.
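
One common way to keep a split representative with respect to class proportions is stratified sampling; the following is a hedged sketch (the imbalanced toy labels are an assumption), not a recipe from the book:

```python
# Stratified sampling preserves the class proportions of the original data,
# reducing the risk of sampling bias introduced by the split itself.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)        # imbalanced toy dataset (10% positives)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
print(y_tr.mean(), y_te.mean())          # both stay close to the original 0.10 ratio
```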

Poor Quality Data

Obviously, if the data has lots of errors, outliers, or noise, it will be hard for the ML system to detect the underlying patterns, which will cause the algorithm to underperform. That is why it is very important to spend a significant amount of time cleaning up the data.
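
A small data-cleaning sketch with pandas could look like this; the column names, values, and thresholds are my own assumptions:

```python
# Fill missing values and filter an obvious outlier from a tiny toy table.
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, None, 41, 230],            # 230 is an obvious data-entry error
    "income": [30_000, 45_000, 52_000, None, 61_000],
})

df["age"] = df["age"].fillna(df["age"].median())          # fill missing values
df["income"] = df["income"].fillna(df["income"].median())
df = df[df["age"] < 120]                                  # drop the outlier row
print(df)
```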

Irrelevant Features

There is a very good saying in statistics: garbage in, garbage out. This is a very important one to keep in mind. The system can capture the necessary information only if the data contains enough relevant features and not too many irrelevant ones. The process in which we decide what features to use to build the model is called feature engineering.

Overfitting

The case where the model we build performs very well on the training data but underperforms on the test data is called overfitting. In that case, the model does not generalize well. Possible solutions for the overfitting problem (see the sketch after this list):

  • Simplify the model by reducing the number of attributes in the data.
  • Gather more training data.
  • Reduce the noise in the training data (e.g., remove outliers).
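
A hedged sketch of the first remedy, comparing a needlessly complex model with a simpler one on the same toy data (the data, the polynomial degrees, and the pipeline are illustrative assumptions):

```python
# A high-degree polynomial tends to fit the training points almost perfectly
# but generalize poorly; a simpler model usually scores closer on train and test.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.rand(30, 1) * 6
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for degree in (15, 3):                                   # overly complex vs simpler
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree, model.score(X_train, y_train), model.score(X_test, y_test))
    # the degree-15 model typically scores high on train but worse on test (overfitting)
```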

Underfitting

Underfitting is the opposite of overfitting. It happens when the model is too simple to learn the underlying structure of the data. In that case, the model performs well neither on the training set nor on the test set. Possible solutions for this problem are as follows:

  • Train a more powerful model or use more parameters.
  • Select better features to feed the model.
  • Reduce the regularization hyperparameter, if there is one.

Now we have discussed many issues about machine learning, but there is one last important topic: after training a model, we expect it to generalize well to new cases. That is why we want to evaluate our model and fine-tune it if necessary.

Testing and Validating

The most intuitive approach to testing our model is to try it on new cases. One option is to split our data into two sets: the training set and the test set. Simply put, you train your model on the training set and test it on the test set (we typically use 80% of the data for training and 20% for testing). By evaluating the model on the test set, you get an estimate of the error rate on new cases, which tells you how well your model performs on instances it has never seen. As an additional note, if the error rate on the training set is low but the error rate on the test set is high, it means that your model is overfitting.

Hyperparameter Tuning and Model Selection

Previously, we learned how to evaluate our model before production, using a test set separated from the training set beforehand. Now, suppose we trained a model and want to apply some regularization to avoid overfitting. How do we choose the regularization hyperparameters? One option is to train 100 different models using 100 different values for the hyperparameter, choose the value with the lowest error rate, and then deploy your model into production. But a problem arises here: you measured the generalization error on the test set several times and adapted the model to produce the best results for that particular test set, so it is unlikely to perform as well on truly new data. A solution to this problem is called holdout validation: you hold out part of your training set to evaluate several candidate models and then select the best one. The held-out set is called the validation set (also development or dev set). More precisely, you train multiple models with various hyperparameters on the reduced training set and select the model (i.e., the hyperparameters) with the best performance on the validation set. After this procedure, you train your best model on the full training set (reduced training set + validation set). Finally, you evaluate that model on the test set to obtain an estimate of the error rate on new cases. Another option is to perform repeated cross-validation, which means using many small validation sets. Each model is evaluated once per validation set after it is trained on the rest of the data. By averaging out the evaluations of a model, you get a much more accurate measure of its performance.
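
A minimal sketch of this model-selection procedure, using cross-validation on the training set to compare candidate hyperparameter values (the Ridge model, the alpha grid, and the toy data are assumptions for illustration):

```python
# Compare hyperparameter candidates with cross-validation, retrain the winner
# on the full training set, and estimate generalization error once on the test set.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.RandomState(42)
X = rng.rand(100, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Evaluate each candidate hyperparameter on the training set via 5-fold cross-validation
scores = {alpha: cross_val_score(Ridge(alpha=alpha), X_train, y_train, cv=5).mean()
          for alpha in (0.01, 0.1, 1.0, 10.0)}
best_alpha = max(scores, key=scores.get)

# Retrain the chosen model on the full training set, then evaluate on the held-out test set
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
print(best_alpha, final_model.score(X_test, y_test))
```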

In this part, we have learned the basics of machine learning, its main challenges, and how to build and test a model at a basic level. In the following posts, we are going to get into more advanced topics.

References

  1. Géron, A. (2019). Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (2nd ed.). O’Reilly Media, Inc.