Supervised machine learning models can then be further classified into regression and classification algorithms, which will be explained in more detail in this article.
Machine Learning Regression Models
Regression algorithms are used to predict a continuous outcome (y) using independent variables (x).
For example, look at the table below:
Image by author
In this case, we would like to predict the rent of a house based on its size, the number of bedrooms, and whether it is fully furnished. The dependent variable, “Rent”, is numeric, which makes this a regression problem.
A problem with several input variables like the one above is called a multiple regression problem (it is often loosely referred to as multivariate regression).
Regression Metrics
A common misconception among data science beginners is that a regression model can be evaluated using a metric like accuracy. Accuracy is a metric used to assess the performance of classification models, as will be explained later in this article.
Regression models, on the other hand, are evaluated using metrics such as MAE (Mean Absolute Error), MSE (Mean Squared Error), and RMSE (Root Mean Squared Error).
Let’s add a predicted value to the house price problem above and evaluate these predictions using a few regression metrics:
Image by author
1. Mean Absolute Error:
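The mean absolute error sums the absolute differences between all true and predicted values, and divides this by the total number of observations. Here is the formula to calculate MAE:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$$
Here, $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations.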
Let’s calculate the Mean Absolute Error of the above values using this formula:
The mean absolute error between the actual and predicted house price is approximately $155.
2. Mean Squared Error:
The formula to calculate a model’s mean squared error is similar to that of its mean absolute error:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Here, $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations.
Note that while the mean absolute error calculates the average absolute distance between the actual and predicted values, the mean squared error finds the average squared distance between them.
Let’s calculate the MSE between the actual and predicted values above:
3. Root Mean Squared Error:
The RMSE of an estimator is calculated by finding the square root of its mean squared error. One advantage of calculating a dataset’s RMSE over its MSE is that the error is returned in the same unit as the variable we are predicting.
In this case, for instance, the RMSE is √54,520.25 ≈ 233.5. This value is interpretable since it is in terms of house price, while the Mean Squared Error was not.
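As a rough sketch of the three metrics above, the snippet below computes MAE, MSE, and RMSE with NumPy. The actual and predicted values are made-up placeholders, not the figures from the table in this article, so the results will differ from the numbers quoted above.

```python
import numpy as np

# Hypothetical actual and predicted prices (illustrative values only)
actual = np.array([1500.0, 1200.0, 2000.0, 900.0])
predicted = np.array([1400.0, 1300.0, 1700.0, 1000.0])

errors = actual - predicted        # per-observation error
mae = np.mean(np.abs(errors))      # mean absolute error
mse = np.mean(errors ** 2)         # mean squared error
rmse = np.sqrt(mse)                # root mean squared error, same unit as the target

print(f"MAE: {mae:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}")
```

Because the errors are squared before averaging, the single large miss (300) dominates the MSE and RMSE far more than it does the MAE.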
Now that you understand the concept of regression, let’s look into the different types of regression models:
Simple Linear Regression
Linear regression is a linear approach to modeling the relationship between a dependent and one or more independent variables. This algorithm involves finding a line that best fits the data at hand.
Here is a visual representation of how a simple linear regression model works:
Image by author
The chart above showcases the relationship between house price and size. The linear regression model will create a line that best models this relationship. All house price predictions relative to different values of size will lie on the best fit line.
Observe that there are three lines drawn on the diagram above. Which of these lines is the “line of best fit?”
Line of Best Fit
Just by looking at the diagram above, we can see that the orange line is the closest to all the data points showcased. Hence, we can intuitively say that it represents the “line of best fit.”
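The intuition of “closest to all the data points” can be checked numerically. The sketch below uses made-up house sizes and prices and three hypothetical candidate lines (none of these values come from the chart above), computes the sum of squared errors for each, and then compares them with the least squares fit returned by NumPy’s polyfit.

```python
import numpy as np

# Hypothetical house sizes and prices (illustrative values only)
size = np.array([500.0, 750.0, 1000.0, 1250.0, 1500.0])
price = np.array([150.0, 200.0, 260.0, 310.0, 355.0])

def sse(m, c):
    """Sum of squared errors for the line y = m * x + c."""
    predicted = m * size + c
    return float(np.sum((price - predicted) ** 2))

# Three hypothetical candidate lines, given as (slope, intercept)
candidates = {"flat": (0.1, 150.0), "steep": (0.4, -100.0), "moderate": (0.2, 50.0)}
errors = {name: sse(m, c) for name, (m, c) in candidates.items()}
best_candidate = min(errors, key=errors.get)

# Least squares fit of a degree-1 polynomial (a straight line)
m_ls, c_ls = np.polyfit(size, price, deg=1)

print(errors, best_candidate)
print(f"least squares line: y = {m_ls:.3f}x + {c_ls:.3f}, SSE = {sse(m_ls, c_ls):.2f}")
```

The least squares line can never have a larger sum of squared errors than any hand-picked candidate, which is what makes it the line of best fit under this criterion.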
Here is a more formal explanation as to how the line of best fit is found in linear regression:
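The equation of a straight line is y = mx + c. Here, m represents the slope of the line and c represents its y-intercept. There are infinitely many ways to draw this line, as there are infinitely many possible values for m and c. The line of best fit, also known as the least squares regression line, is found by minimizing the sum of squared distances between the true and predicted values:
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Read the Essentials of Linear Regression in Python tutorial to learn more about the linear regression machine learning model and its implementation.
Ridge Regression
Ridge regression is an extension of the linear regression model explained above. It is a technique used to keep the coefficients of a regression model as low as possible.
One problem with a simple linear regression model is that its coefficients can become large, which makes the model more sensitive to its inputs. This can lead to overfitting.
Let’s take a simple example to understand the concept of overfitting:
Image by author
In the image above, the line of best fit models the relationship between X and y perfectly, and the sum of squared distances between the true and predicted values is 0. Recall that the equation of this line is y = mx + c.
While this line fits the training dataset perfectly, it might not generalize well to test data. This phenomenon is called overfitting, and you can read this article on overfitting to learn more.
In simple terms, a highly complex model picks up on unnecessary nuances of the training dataset that are not reflected in the real world. Such a model performs extremely well on training data but underperforms on datasets outside of what it was trained on.
A linear regression model with large coefficients is prone to overfitting.
Ridge regression is a regularization technique that forces the algorithm to choose smaller coefficients by penalizing the loss function with an additional cost.
As shown in the previous section, here is the error we want to minimize in simple linear regression:
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
In ridge regression, this equation changes slightly, and a penalty term is added to the error above:
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j} m_j^2$$
Here, $m_j$ are the model’s coefficients and $\lambda$ controls how strongly large coefficients are penalized.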
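To make the effect of the ridge penalty concrete, here is a minimal sketch that fits an ordinary linear regression and a ridge regression on the same synthetic data and compares the size of their coefficients. The data, the “true” coefficients, and the choice of alpha=10.0 are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data with assumed coefficients (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_coefs = np.array([4.0, -3.0, 2.0, 0.5, 0.0])
y = X @ true_coefs + rng.normal(scale=1.0, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # alpha plays the role of the penalty strength

# Overall coefficient size, measured as the Euclidean norm
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

print(f"OLS coefficient norm:   {ols_norm:.3f}")
print(f"Ridge coefficient norm: {ridge_norm:.3f}")
```

The ridge coefficients are pulled toward zero relative to the unpenalized fit, and a larger alpha shrinks them further.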
If an independent variable’s coefficient reaches zero, the feature can be eliminated from the model. This reduces the feature space and makes the algorithm easier to interpret, which is the biggest advantage of lasso regression.
Due to this, lasso regression can also be used as a feature selection technique, since variables with low importance can have coefficients that reach zero and will be removed entirely from the model.
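The feature selection behavior described above can be sketched on synthetic data. In the example below, only the first of five features actually influences the target; the data and the alpha=0.5 setting are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only feature 0 drives the target (assumed setup)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)

# Indices of the features whose coefficients were not driven to zero
kept_features = np.flatnonzero(lasso.coef_)

print("coefficients:", lasso.coef_)
print("features kept:", kept_features)
```

The coefficients of the four irrelevant features are set exactly to zero, so they can be dropped from the model without changing its predictions.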
How to Build a Regression Machine Learning Model in Python
You can build linear, ridge, and lasso regression models using the Scikit-Learn library:
1. Linear Regression
from sklearn.linear_model import LinearRegression
lr_model = LinearRegression()
To fit the model on your training dataset, run:
lr_model.fit(X_train,y_train)
2. Ridge Regression
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
The lambda term can be configured via the “alpha” parameter when defining the model.
3. Lasso Regression
from sklearn.linear_model import Lasso
model = Lasso(alpha=1.0)
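Putting the three models together, here is one possible end-to-end sketch: it generates a synthetic dataset with make_regression, holds out a test set, fits each model, and reports the test RMSE. The dataset, the split size, and the alpha values are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumed for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=1.0),
}

rmse_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)                       # learn coefficients on training data
    predictions = model.predict(X_test)               # predict on unseen data
    mse = mean_squared_error(y_test, predictions)
    rmse_by_model[name] = float(np.sqrt(mse))         # RMSE in the target's unit

print(rmse_by_model)
```

Evaluating on held-out data rather than on the training set is what reveals whether a model has overfit.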
If you’d like to learn more about linear models and how to build them in Python, take our Introduction to Linear Modeling in Python course.
Machine Learning Classification Models
We use classification algorithms to predict a discrete outcome (y) using independent variables (x). The dependent variable, in this case, is always a class or category.
For example, predicting whether a patient is likely to develop heart disease based on their risk factors is a classification problem:
Image by author
The table above showcases a classification problem with four independent variables and one dependent variable, heart disease. Since there are only two possible outcomes (Yes and No), this is called a binary classification problem.
Other examples of a binary classification problem include classifying whether an email is spam or legitimate, customer churn prediction, and deciding whether to provide someone a loan.
A multiclass classification problem is one with three or more possible outcomes, such as weather forecasting or distinguishing between different animal species.
Classification Metrics
There are many ways to evaluate a classification model. While accuracy is the most used metric, it is not always the most reliable.
Let’s look at some common methods used to evaluate a classification algorithm based on the dataset below:
Image by author
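1. Accuracy: Accuracy can be defined as the fraction of correct predictions made by the machine learning model.
The formula to calculate accuracy is:
$$\mathrm{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$$
In this case, the accuracy is 4/6, or 0.67.
2. Precision: Precision is a metric used to calculate the quality of positive predictions made by the model. It is defined as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Here, TP is the number of true positives and FP is the number of false positives.
The above model has a precision of 2/4, or 0.5.
3. Recall: Recall measures how many of the actual positive cases the model correctly identifies. It is defined as:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
Here, FN is the number of false negatives.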
The above model has a recall of 2/2 or 1.
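The three metrics can be computed with Scikit-Learn as sketched below. The labels are hypothetical: they were chosen so that the results match the values quoted above (accuracy of about 0.67, precision of 0.5, recall of 1) and are not copied from the dataset in the image.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels (1 = positive, 0 = negative), chosen to reproduce
# the metric values quoted in the text
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)     # correct predictions / all predictions
precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

Here the model finds every actual positive (recall of 1) but half of its positive predictions are wrong (precision of 0.5).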
Let’s look at a simple example to understand the difference between precision and recall:
There is a rare, fatal disease that affects a fraction of the population. 95% of the patients in a hospital’s database do not have the disease, while only 5% do. If we build a machine learning algorithm that predicts that nobody has the disease, then the training accuracy of this model will be 95%. Despite the high accuracy, we know this is not a good model since it fails to identify patients with the disease.
This is where metrics like precision and recall come in. Precision tells us what fraction of the patients the model flags as having the disease actually have it. Recall, or sensitivity, tells us how well the model identifies people with the disease.
A “good” precision and recall value is subjective and depends on your use case.
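The rare disease example can be reproduced in a few lines. This sketch assumes a database of 100 patients, 5 of whom have the disease, and a model that predicts “no disease” for everyone; zero_division=0 is passed so that precision is reported as 0 when the model makes no positive predictions.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 95 healthy patients (0) and 5 patients with the disease (1)
y_true = [0] * 95 + [1] * 5
# A model that predicts that nobody has the disease
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, zero_division=0)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}, precision={precision:.2f}")
```

The accuracy is 95%, yet the recall is 0 because none of the five patients with the disease is identified.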
In this spam email example, if the text contains little to no suspicious keywords, then the probability of it being spam will be low and close to 0. On the other hand, an email with many suspicious keywords will have a high probability of being spam, close to 1.
This probability is then turned into a classification outcome:
Image by author
All the points colored in red have a probability >= 0.5 of being spam. Hence, they are classified as spam and the logistic regression model will return a classification outcome of 1. The points colored in green have a probability < 0.5 of being spam, so they are classified by the model as “Not Spam” and will return a classification outcome of 0.
For binary classification problems like the above, the default threshold of a logistic regression model is 0.5, which means that data points with a higher probability than 0.5 will automatically be assigned a label of 1. This threshold value can be manually changed depending on your use case to achieve better results.
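One way to change the threshold manually is to work with the predicted probabilities directly instead of the default labels. The sketch below uses a synthetic dataset as a stand-in for the spam example, and the stricter threshold of 0.8 is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data standing in for the spam example
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# Probability of the positive class (label 1) for each observation
spam_probability = model.predict_proba(X)[:, 1]

# Turn probabilities into labels using two different thresholds
default_labels = (spam_probability >= 0.5).astype(int)
strict_labels = (spam_probability >= 0.8).astype(int)

print("flagged at 0.5:", int(default_labels.sum()))
print("flagged at 0.8:", int(strict_labels.sum()))
```

Raising the threshold flags fewer observations as positive, which typically trades recall for precision.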
Now, recall that in linear regression, we found the line of best fit by minimizing the sum of squared error between the predicted and true values. In logistic regression, however, the coefficients are estimated using a technique called maximum likelihood estimation instead of least squares.
Read our Python logistic regression tutorial to learn more about the concept of maximum likelihood estimation and how logistic regression works.
K-Nearest Neighbors
KNN is a classification algorithm that classifies a data point based on what group the data points nearest to it belong to.
Here is a simple example to demonstrate how the K-Nearest Neighbors model works:
Image by author
In the diagram above, there are two classes of data points - A and B. The black triangle represents a new data point that needs to be classified into one of these two classes.
The K-Nearest Neighbors algorithm works like this:
Step 1: The model first stores all the training data.
Step 2: Then, it calculates the distance from the new data point to all points in the dataset.
Step 3: The model sorts these data points based on their distance to the new data point.
Step 4: The new data point is assigned to the class of its nearest neighbors depending on the value of “k.”
In the visual above, the value of k is 1. This means that we look at only one closest neighbor to the black triangle and assign the data point to that class. The new data point is closest to the blue point, so we assign it to class B.
Now, let’s amend the value of k. Let’s try two possible values of k, 3 and 7:
Image by author
Now, notice that when we choose k=3, the three nearest neighbors come from two different categories, so we pick the majority class. Two nearest neighbors are blue, and one nearest neighbor is green, so the data point will again be assigned to the class with blue points, class B.
When k=7, however, things change. Now, two nearest neighbors are blue, and five are green. In this case, the data point will be assigned to the green class, class A.
Choosing different values of k will impact what class the new point is assigned to.
Selecting a value of k that is too small makes the model noisy and sensitive to outliers, while selecting a value that is too large might make you overlook categories with fewer data points.
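The four steps above can be sketched from scratch in a few lines. The coordinates below are hypothetical and were arranged so that the two closest points belong to class B and the next five belong to class A, mirroring the k=1, k=3, and k=7 discussion.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 2: distance from the new point to every stored training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: sort by distance and keep the indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the labels of those neighbors
    votes = Counter(str(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0]

# Step 1: the "model" simply stores the training data (hypothetical points)
X_train = np.array([
    [0.5, 0.0], [0.0, 0.6],                                        # class B
    [1.0, 0.0], [0.0, 1.1], [1.2, 0.0], [0.0, -1.3], [-1.4, 0.0],  # class A
])
y_train = np.array(["B", "B", "A", "A", "A", "A", "A"])
x_new = np.array([0.0, 0.0])

for k in (1, 3, 7):
    print(k, knn_predict(X_train, y_train, x_new, k))
```

With k=1 and k=3 the new point is assigned to class B, while with k=7 the five class A neighbors outvote the two class B neighbors.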
If you’d like to learn more about the K-Nearest Neighbors algorithm and how to select an optimal “k” value, read this KNN tutorial.
Build a Classification Model in Python
Here are some code snippets you can use to build a classification model in Python using the Scikit-Learn library:
1. Logistic Regression
from sklearn.linear_model import LogisticRegression log_reg = LogisticRegression()
2. K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier knn = KNeighborsClassifier()
Machine Learning Tree-Based Models
In this case, if a student does not study every week, they will fail. If they study every week but do not complete their homework, the result is still “Fail.” They will only pass if they were to study every week and finish all their homework.
Notice that the decision tree above splits first on the variable “Studies Every Week?” It then stops splitting if the answer is “No,” saying that the student will fail.
The decision tree will choose a variable to split on first based on a metric called entropy. It will stop splitting when a “pure split” is obtained, i.e., when all the data points belong to a single class.
There are many ways to build a decision tree. The tree needs to find a feature to split on first, second, third, etc. This structure is created based on a metric called information gain. The best possible decision tree is one with the highest information gain.
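To illustrate entropy and information gain, here is a small sketch on a made-up version of the student example. The eight student records are hypothetical; they follow the rule that a student passes only if they both study every week and complete their homework.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return float(-np.sum(proportions * np.log2(proportions)))

def information_gain(labels, feature):
    """Entropy of the parent node minus the weighted entropy of its children."""
    weighted_child_entropy = 0.0
    for value in np.unique(feature):
        subset = labels[feature == value]
        weighted_child_entropy += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - weighted_child_entropy

# Hypothetical records: 1 = yes, 0 = no
studies = np.array([1, 1, 1, 1, 0, 0, 0, 0])
homework = np.array([1, 1, 0, 0, 1, 1, 1, 0])
passed = np.array([1, 1, 0, 0, 0, 0, 0, 0])

gain_studies = information_gain(passed, studies)
gain_homework = information_gain(passed, homework)

print(f"gain from 'studies every week': {gain_studies:.4f}")
print(f"gain from 'completes homework': {gain_homework:.4f}")
```

On this data, splitting on “studies every week” yields the higher information gain, so it would be chosen as the first split. The “No” branch contains only failing students, which is a pure split with an entropy of 0.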
To learn more about how decision trees work, along with metrics like entropy and information gain, read this Python decision tree classification article.
One of the biggest advantages of decision trees is that they are highly interpretable. It is easy to work backward and understand how a decision tree has obtained its final outcome based on the training dataset.
However, decision trees are also highly prone to overfitting if left to grow completely. This is because they are designed to split perfectly on all samples of the training dataset, which makes them unable to generalize well to external data.
This drawback of decision trees can be solved by using the random forest algorithm.
Random Forests
The random forest model is a tree-based algorithm that helps us mitigate some of the problems that arise when using decision trees, one of which is overfitting. Random forests are created by combining the predictions made by multiple decision tree models and returning a single output.
It does this in two steps:
- Step 1: First, the rows and variables of the dataset are randomly sampled with replacement. Multiple decision trees are then created and trained on each data sample.
- Step 2: Next, the predictions made by all these decision trees are combined to come up with a single output. For instance, if 3 separate decision trees were trained and 2 of them predicted “Yes” while 1 predicted “No,” then the final outcome of the random forest algorithm would be “Yes.”
In case of a regression problem, the outcome will be the average prediction of all decision trees.
Here is a simple visual to showcase how the random forest algorithm works:
Image by author
In the diagram above, the first and third decision trees predict “Yes” while the second predicts “No.”
Since this is a classification task, the majority class is selected. In this case, the random forest algorithm will return a final outcome of “Yes” based on the predictions made by 2 out of 3 decision trees.
One of the biggest advantages of the random forest algorithm is that it generalizes well, since it combines the output of multiple decision trees that are trained on a subset of features.
Furthermore, while the output of a single decision tree can vary dramatically based on a small change in the training dataset, this problem does not arise with the random forest algorithm as the training dataset is sampled many times.
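The two steps described above can be approximated by hand with three decision trees. This is a simplified sketch rather than Scikit-Learn’s implementation: it bootstraps the rows, lets each tree consider a random subset of features at every split (max_features="sqrt"), and combines the trees with a hard majority vote. Scikit-Learn’s RandomForestClassifier instead averages the trees’ predicted probabilities. The dataset is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Step 1: train each tree on a bootstrap sample (rows drawn with replacement)
trees = []
for seed in range(3):
    rows = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    tree.fit(X[rows], y[rows])
    trees.append(tree)

# Step 2: collect every tree's prediction and take the majority class
all_predictions = np.array([tree.predict(X) for tree in trees])  # shape (3, 300)
majority_vote = (all_predictions.sum(axis=0) >= 2).astype(int)   # labels are 0 or 1

print(all_predictions[:, :5])
print(majority_vote[:5])
```

A label of 1 is returned whenever at least two of the three trees predict 1, which matches the “2 out of 3” example above.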
Build a Tree-Based Model in Python
Run the following lines of code to build a tree-based machine learning algorithm with Scikit-Learn:
1. Decision Tree
# classification
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
# regression
from sklearn.tree import DecisionTreeRegressor
dt_reg = DecisionTreeRegressor()
2. Random Forests
# classification
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier()
# regression
from sklearn.ensemble import RandomForestRegressor
rf_reg = RandomForestRegressor()
Machine Learning Clustering
So far, we’ve explored supervised machine learning models to tackle classification and regression problems. Now, we will dive into a popular unsupervised learning approach called clustering.
In simple words, clustering is the task of creating a group of objects that are similar to each other but different from others. This technique has a variety of business use cases, such as recommending movies to users with similar viewing patterns on a video streaming site, anomaly detection, and customer segmentation.
In this section, we will examine an algorithm called K-Means clustering - the simplest and most popular machine learning model used for unsupervised learning tasks.
K-Means Clustering
K-Means clustering is an unsupervised machine learning technique that is used to group similar objects together in data.
Here is an example of how the K-Means clustering algorithm works:
Image by author
Step 1: The image above consists of unlabeled observations that have not been grouped. Initially, each observation will be assigned to a cluster at random. A centroid will then be computed for each cluster.
These are represented with the “+” symbol in the diagram below:
Image by author
Step 2: Next, the distance of each data point to the centroid is measured, and each point is assigned to the nearest centroid:
Image by author
Step 3: The centroid of the new cluster is then recalculated, and data points will be reassigned accordingly.
Step 4: This process is repeated until data points are no longer being reassigned:
Image by author
Observe that three clusters were created in the example above. The number of clusters is referred to as “k” in the K-Means clustering algorithm, and this has to be determined by us.
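The four steps can be written out directly with NumPy. This is a bare-bones sketch that follows the description above (random initial assignment, then alternating centroid updates and reassignments); it is not how Scikit-Learn’s KMeans is implemented, and the three groups of points are synthetic. The handling of empty clusters is an assumption added to keep the code from failing.

```python
import numpy as np

def kmeans(X, k, rng, max_iter=100):
    """Cluster X into k groups by alternating centroid updates and reassignment."""
    # Step 1: assign every observation to a random cluster
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # Compute the centroid of each cluster; if a cluster is empty,
        # fall back to a randomly chosen observation (an added assumption)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j)
            else X[rng.integers(len(X))]
            for j in range(k)
        ])
        # Step 2: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop once no point changes cluster
        if np.array_equal(new_labels, labels):
            break
        # Step 3: keep the new assignment and recompute centroids on the next pass
        labels = new_labels
    return labels, centroids

# Three synthetic groups of 2-D points (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0.0, 5.0), scale=0.5, size=(50, 2)),
])

labels, centroids = kmeans(X, k=3, rng=rng)
print(centroids)
```

Because the starting assignment is random, different seeds can converge to different clusterings, which is one reason library implementations run the algorithm several times.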
There are a few different ways to select “k” in K-Means, the most popular of which is the elbow method. This technique consists of plotting the error for a different number of clusters on a graph and choosing the inflection point of the curve as “k.”
Read our K-Means clustering in Python tutorial to learn more about the elbow method and the inner workings of K-Means clustering.
Build a K-Means Clustering Model in Python
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 3, init='k-means++')
The n_clusters argument indicates the number of clusters “k” that you need to define when building the algorithm.
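As a sketch of the elbow method mentioned earlier, the snippet below fits K-Means for several values of k on synthetic data and records the error, which Scikit-Learn exposes as the inertia_ attribute (the sum of squared distances from each point to its closest centroid). The dataset and the range of k values are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data generated around three centers (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

inertia_by_k = {}
for k in range(1, 7):
    kmeans = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    kmeans.fit(X)
    inertia_by_k[k] = kmeans.inertia_   # error for this number of clusters

for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 1))
```

Plotting these values against k would typically show a sharp drop up to the true number of groups and a much flatter curve afterwards; the bend in the curve is the “elbow.”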
Machine Learning Models Explained - Next Steps:
If you managed to follow along with this entire article, congratulations! You now know about some of the most popular supervised and unsupervised machine learning models and algorithms and how they can be applied to solve a variety of predictive modeling problems.
To become a data scientist, you need to understand how different types of machine learning models work to apply them to solve a problem. For instance, if you’d like to build a model that is interpretable and has low computation time, it might make sense to create a decision tree. If your aim is to create a model that generalizes well, however, then you can choose to build a random forest algorithm instead.
It is also important to understand how to evaluate machine learning models. A “good” model is subjective and highly dependent on your use case. In classification problems, for instance, high accuracy alone isn’t indicative of a good model. As a data scientist, you need to review metrics like precision, recall, and F1-Score to get a better idea of how well your model is performing.
If you would like to gain a deeper understanding of machine learning models than the concepts covered in this article, take the Machine Learning Scientist with Python course. This career track will teach you the theory behind how machine learning models operate and how they can be implemented in Python. You will also learn data preparation techniques such as normalization, decorrelation, and feature selection in the course.