Logistic regression is one of the most popular machine learning algorithms. As a predictive statistical analysis model, it’s a simple and effective way to classify whether something is one thing or another.
In fact, understanding the essential elements of logistic regression is one of the best places to start when it comes to mastering machine learning and data analysis.
What is logistic regression?
Logistic regression is a supervised machine learning algorithm that predicts the probability of an outcome.
In contrast to linear regression, where the predicted value can in theory be any number, logistic regression has only two possible outcomes, 1 or 0. This means the prediction will be discrete – basically, something is or it isn’t.
An email is either spam or it isn’t. The candidate will win or they won’t. Tomorrow, it will rain or it won’t.
This is what’s known as a classification algorithm, meaning it classifies data into categories.
Considering its simplicity, logistic regression is both extremely powerful and widely used. Uses include:
And many more.
How logistic regression works
Let’s say that you run an online store and you’ve just sent a promotional email out to your customer list.
Of the customers that are on your newsletter list, you know their age and whether or not they bought a product as a direct result of reading your promotional email. You know this because you can tell which links they clicked in your newsletter and you have tracking set up to see what that user then went on to do once they landed on your website.
Now you want to know whether someone of a certain age (e.g. 25 years old) will buy from you in the future.
This is a binary question – we can predict that the user either will buy or they will not buy, which means that logistic regression is a suitable analytical tool for us to use to create this model.
First let’s take a look at the data in this example:
We can see that the observed actions (a person who bought something from your online store) are plotted as red dots (data points). There are no data points in the middle because we already know whether the individual did or did not buy – the data is binary.
Across the horizontal axis we have age and across the vertical axis we have two ticks, 1 and 0.
Remember that 1 means someone purchased something after reading your email and 0 means someone did not purchase anything. Another way of looking at this is that 1 is 100% likelihood that someone bought something and 0 is 0% likelihood.
So logistic regression measures the relationship between our dependent variable (whether someone bought something) and the independent variable (in this case, their age).
Rather than trying to predict exactly what any given user is going to do, we can use logistic regression to predict the probability, or likelihood, that a person will buy something based on their age. These probabilities are then transformed into binary values that allow us to make a yes-no prediction.
You might be asking how we transform these values – enter the sigmoid function.
The sigmoid function is an S-shaped curve that can scale any real number to be between 0 and 1, but never exactly reach 0 or 1. Once the numbers are mapped to sit between 0 and 1 they are transformed into either a 0 (no) or 1 (yes) using a threshold value.
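To make the S-shaped curve a little more concrete, here is a minimal sketch of the sigmoid function in plain Python. The function name `sigmoid` and the sample inputs are our own choices for illustration, not something taken from a particular library.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number into the range between 0 and 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

# Sample inputs chosen only for illustration.
for z in (-6, -2, 0, 2, 6):
    print(z, round(sigmoid(z), 4))
```

An input of 0 lands exactly in the middle, at 0.5, and the outputs creep toward 0 or 1 as the input moves further from zero. One caveat: mathematically the curve never touches 0 or 1, but in floating-point arithmetic very large inputs will round to exactly 0.0 or 1.0.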
Ultimately, the threshold value is arbitrary. In practice, however, it’s usually set to 0.5, or 50%. This provides symmetry, as it sits exactly in the middle of the range.
Any value below the threshold line will be projected onto the 0 line and classified as a ‘no’, while any value above the threshold line will be projected onto the 1 line and classified as a ‘yes’.
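The thresholding step can be sketched in a couple of lines. The function name `classify` and the example probabilities are hypothetical, and treating a value that sits exactly on the threshold as a ‘yes’ is an assumption on our part, since the description above only covers values below or above the line.

```python
def classify(probability: float, threshold: float = 0.5) -> int:
    """Turn a probability into a yes (1) or no (0) prediction."""
    # Assumption: a probability exactly equal to the threshold counts as a 'yes'.
    return 1 if probability >= threshold else 0

# Hypothetical probabilities, for illustration only.
probabilities = [0.12, 0.49, 0.5, 0.87]
predictions = [classify(p) for p in probabilities]
print(predictions)  # → [0, 0, 1, 1]
```

Raising or lowering `threshold` trades one kind of mistake for another, which is why 0.5 is a common default rather than a rule.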
OK, so we now know whether our model predicts a customer will buy something. But how likely are they to buy?
Since logistic regression works on probabilities, we can project the data point onto the vertical axis and obtain the probability of something happening.
Here’s the formula for logistic regression:
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
y is the dependent variable or predicted output. In our example that’s the predicted probability that a customer will make a purchase.
x is the independent variable or predictor. We assume that changes in the independent variable are related to changes in the dependent variable in some way. Here we’re talking about the age of potential customers.
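b1 is the coefficient for the independent variable. This coefficient controls how steep our S-shaped curve is and expresses how a one-unit change in x (being one year older) affects y (how likely the customer is to buy). Strictly speaking, each extra unit of x adds b1 to the log-odds of y, that is, to log(y / (1 - y)).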
b0 is the constant term, or intercept. It is the value of b0 + b1*x when x is 0, and it shifts the S-shaped curve left or right along the horizontal axis.
e is Euler’s number, a mathematical constant roughly equal to 2.718. The value we’re actually transforming with the sigmoid function is the exponent, b0 + b1*x.
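Here is a small sketch of the formula in Python, to show how the pieces fit together. The coefficient values `B0` and `B1` are hypothetical and chosen only so the numbers are easy to follow. They are not fitted to any real customer data, and `purchase_probability` is just an illustrative name.

```python
import math

# Hypothetical coefficients, picked for illustration only (not fitted to data).
B0 = -4.0  # constant term (intercept)
B1 = 0.1   # coefficient for age

def purchase_probability(age: float, b0: float = B0, b1: float = B1) -> float:
    """Evaluate y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)) for a given age x."""
    linear_part = b0 + b1 * age
    return math.exp(linear_part) / (1.0 + math.exp(linear_part))

for age in (25, 40, 55):
    print(age, round(purchase_probability(age), 3))
```

With these made-up coefficients, a 40-year-old sits right at the midpoint of the curve (b0 + b1*x is 0, so the probability is 0.5), a 25-year-old falls below it, and a 55-year-old falls above it.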
Logistic regression vs linear regression
At this point, you may be like “hold the phone – didn’t we already talk about a regression somewhere else?” and the answer is yes, yes we did.
Regression analysis is all about estimating and understanding the relationship between variables. Because of this, both linear regression and logistic regression are similar in the role they carry out.
In fact, you might even recognize elements of the linear regression formula in the logistic regression formula. That’s because logistic regression is a linear model whose output has been passed through the logistic function, which is the same thing as the sigmoid function.
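One way to see the linear model hiding inside the logistic formula is to undo the transformation. The sketch below, using hypothetical coefficients and function names of our own, converts each predicted probability back into log-odds, log(p / (1 - p)), and compares it with the plain linear expression b0 + b1*x.

```python
import math

def logistic(x: float, b0: float, b1: float) -> float:
    """The logistic regression formula: e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

def log_odds(p: float) -> float:
    """Inverse of the logistic function: the natural log of p / (1 - p)."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients, for illustration only.
b0, b1 = -4.0, 0.1

for age in (25, 40, 55):
    p = logistic(age, b0, b1)
    # The two printed values should agree up to floating-point rounding.
    print(age, round(log_odds(p), 6), round(b0 + b1 * age, 6))
```

The two columns match, which is the sense in which logistic regression is a linear method: it is linear in the log-odds, not in the probability itself.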
At the end of the day, both algorithms are doing much the same thing – they fit the coefficients of an equation so that it matches the data in a dataset as closely as possible.
Linear regression, however, predicts a continuous outcome that is measured along a sliding numeric scale (like house prices, for example), while logistic regression predicts a discrete, or binary outcome, which means the outcome will always be one thing or another – an email is spam or it is not spam.
IRL, a linear regression might predict that a house will sell for $100,000, whereas a logistic regression would predict whether or not the house will sell, based on certain defined variables.
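To show what fitting looks like in the logistic case, here is a rough sketch that estimates b0 and b1 from a tiny dataset using gradient ascent on the log-likelihood, one common way of fitting this kind of model. Everything in it is an assumption made for illustration: the ages and purchase labels are made up, the learning rate and step count are arbitrary, and age is standardized first purely to keep the simple update rule stable.

```python
import math

# Made-up toy data: customer ages and whether they bought (1) or not (0).
ages = [22, 25, 28, 31, 35, 38, 42, 45, 50, 55]
bought = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

# Standardize age so a fixed step size works reasonably well.
mean_age = sum(ages) / len(ages)
std_age = math.sqrt(sum((a - mean_age) ** 2 for a in ages) / len(ages))

def standardize(age: float) -> float:
    return (age - mean_age) / std_age

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fit(xs, ys, learning_rate=0.1, steps=5000):
    """Estimate b0 and b1 by gradient ascent on the average log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Difference between each observed label and its predicted probability.
        errors = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += learning_rate * sum(errors) / n
        b1 += learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1

b0, b1 = fit([standardize(a) for a in ages], bought)

def predict_probability(age: float) -> float:
    """Predicted probability of a purchase for a customer of the given age."""
    return sigmoid(b0 + b1 * standardize(age))

print(round(predict_probability(25), 3), round(predict_probability(50), 3))
```

Because older customers in this toy dataset bought more often, the fitted b1 comes out positive and the predicted probability rises with age. In real projects you would normally reach for an established library rather than hand-rolling the optimizer.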
What happens if you want to classify something with more than two possible outcomes?
Well, luckily, logistic regression’s got your back here too.
Using a method called one-vs-all classification – one approach to multiclass classification – you can carry out the same kind of prediction as a binary logistic regression, but for more than two classes.
In one-vs-all classification, you train multiple logistic regression classifiers – one for each class in your data set, each predicting “this class” versus “everything else”. Your multiclass classifier will then pick the class whose logistic regression classifier outputs the highest probability. Easy.
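The “pick the highest probability” step can be sketched as follows. The three class labels and their (b0, b1) pairs are entirely hypothetical and stand in for classifiers that would normally be trained on real data. The point here is only the selection logic.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical one-vs-all classifiers: one (b0, b1) pair per class.
# These numbers are made up for illustration, not fitted to any data.
classifiers = {
    "low_spender": (3.0, -0.10),
    "mid_spender": (-0.5, 0.00),
    "high_spender": (-5.0, 0.10),
}

def predict_class(x: float):
    """Score x with every binary classifier and return the best label and all scores."""
    scores = {label: sigmoid(b0 + b1 * x) for label, (b0, b1) in classifiers.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores

for age in (20, 40, 70):
    label, scores = predict_class(age)
    print(age, label, {k: round(v, 3) for k, v in scores.items()})
```

Note that the individual scores come from separate classifiers, so they don’t need to add up to 1. Only their ranking matters for the final pick.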
So that’s the basics of logistic regression, an undeniable contender for MVP of machine learning, statistics, and data analysis.