Linear regression is the simplest machine learning algorithm to get started with, making it perfect for beginners. In fact, it’s so easy that you can basically get started with machine learning today – like, right now 🙌
What is Linear Regression?
Linear regression is a type of statistical modeling that allows you to investigate whether one thing (a variable) is dependent on others. The relationship between variables is illustrated by a trend-line which is overlaid on your data and can be used for predicting many different things.
For example, you could use linear regression to determine how well a particular crop will grow based on how much it’s watered, how much milk a cow will produce based on how frequently it’s milked, or how much a house will sell for based on the number of rooms.
For all of the above examples you probably already have a hunch that there’s going to be some kind of a relationship between the variables. It’s likely that more water will mean more crops, more milking will yield more milk, and more rooms in a house will equate to more money at sale time.
By undertaking a linear regression analysis, your hunch (or hypothesis), will be supported by actual data, which is obviously what we’re all about.
Once you understand the relationship between variables, you can begin to make some really powerful predictions.
FYI linear regression is both a statistical model and a supervised machine learning algorithm. In the context of machine learning, the “machine” is the linear regression model and “learning” means that the linear regression model was trained on a sample of the dataset and then learned the correlations between the variables in order to be able to make future predictions.
Simple vs multivariate regression
Let’s talk about names for a minute.
There are many types of regression out there. However, the term linear regression refers to the fact that the relationship between variables is linear – it follows a straight line.
When you’re exploring the relationship between two variables, an independent variable and dependent variable (x and y), the method of regression used to interrogate this is referred to as a simple linear regression.
If you’re comparing one dependent and multiple independent variables then you’re doing multiple linear regression.
And if you have multiple dependent variables then you’ve got multivariate regression.
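To make those names concrete, here’s a minimal sketch in Python using scikit-learn and some made-up toy data – the numbers themselves are meaningless, they’re just there to show the shape of each setup:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up toy data, only to illustrate the three naming conventions
X_one = np.array([[1.0], [2.0], [3.0]])                  # one independent variable
X_many = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0]])  # several independent variables
y_one = np.array([2.0, 4.0, 6.0])                        # one dependent variable
Y_many = np.array([[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]])  # several dependent variables

simple = LinearRegression().fit(X_one, y_one)          # simple linear regression
multiple = LinearRegression().fit(X_many, y_one)       # multiple linear regression
multivariate = LinearRegression().fit(X_many, Y_many)  # multivariate regression
```

The same `LinearRegression` class handles all three cases – what changes is how many columns your inputs (x) and outputs (y) have.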
How linear regression works
Let’s look at a potential scenario where linear regression would be useful: recruiting spies.
Before we run our analysis let’s start with the assumption that there are two types of spies: those that are really good at their jobs and those who aren’t.
There are many reasons why a spy might be bad at their job – they might take unnecessary risks which gets their cover blown or they may fall in love with a double agent and then help them sneak across the Canadian border to evade capture by the authorities.
You can just never know for sure.
But we can use linear regression to determine which potential candidates are most likely to be the best match for the very high-stress job of being a spy. We only want to hire those who have a high probability of being good at their jobs.
The goal of a regression analysis is to create a trend line that best fits our data. This then allows us to investigate whether (and how) one thing (the dependent variable) depends on other things (the independent variables).
Before we can begin shortlisting the best candidates, we need to look at the different attributes and determine whether there is a correlation between them. If there is, then we can use these attributes to start making some predictions. Luckily, because all potential candidates in this scenario have been through rigorous vetting, we have access to extensive data about their personal attributes such as physical fitness, personality traits, IQ, potential terrorist affiliations etc.
To explore potential relationships in the data we plot the data and apply a trend line, which will tell us whether there’s a correlation between the variables. Linear regression can then be used to confirm or deny the relationship between attributes.
Let’s take a look at the attribute IQ.
We can see here that there’s a correlation between the attributes ‘spy potential’ and ‘IQ’. We can therefore say that candidates who are smarter are more likely to make better spies.*
Once we have confirmed with regression analysis whether there’s a relationship between the various attributes, we can deploy a regression algorithm that will learn the relationship between these variables.
This will ultimately allow us to predict whether a future candidate will make a good prospective spy or not.
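As a rough sketch of what that looks like in code – with completely made-up IQ scores and spy-potential ratings, in the spirit of the footnote at the end of this article – you could fit a model and score a new candidate like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up numbers: candidates' IQ scores and a 0-100 'spy potential' rating
iq = np.array([[95], [105], [110], [120], [130], [140]])
spy_potential = np.array([40, 52, 55, 68, 75, 88])

# Learn the relationship between IQ and spy potential
model = LinearRegression().fit(iq, spy_potential)

# Predict the spy potential of a new candidate with an IQ of 125
candidate_score = model.predict(np.array([[125]]))[0]
```

A positive coefficient (`model.coef_`) would support the hunch that higher IQ goes with higher spy potential, and `predict` is what lets you shortlist future candidates you’ve never seen before.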
Linear regression equation
The equation for simple linear regression is, well, pretty simple:
y = b0 + b1 * x1
Let’s break this equation down.
y is the dependent variable. This is what you’re trying to explain – you want to understand how (and in which way) it depends on something else. In this example, our dependent variable would be spy potential.
x is the independent variable. We assume that this is causing the dependent variable to change in some way. This might not be a direct cause, but it implies an association between the variables and we obviously want to know more about that. So in our spy example, the independent variable could be something like IQ or physical fitness.
b1 is what’s known as the coefficient for the independent variable. This is a fancy way of expressing how much y changes for each one-unit change in x (one extra IQ point, for example). It also controls the angle or slope of the line: the steeper the line, the more spy potential a candidate gains per extra IQ point, and the shallower the slope, the less spy potential each additional IQ point is worth.
b0 is the constant term, or intercept – the point where your trend line crosses the vertical axis. It’s the predicted value of y when x is zero.
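If you’re curious where b0 and b1 actually come from, here’s a short sketch (using made-up data points) that computes them with the classic ordinary least squares formulas – b1 is how x and y vary together divided by how x varies on its own, and b0 anchors the line through the means:

```python
import numpy as np

# Made-up data points for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Ordinary least squares estimates for simple linear regression
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Predictions from the fitted line: y = b0 + b1 * x1
y_hat = b0 + b1 * x
```

This is exactly the line that libraries like scikit-learn find for you – they just do it behind the scenes.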
If you have multiple variables the equation looks almost the same, except you add more independent variables into the mix:
y = b0 + b1 * x1 + b2 * x2 + … + bn * xn
For each additional independent variable that you’re looking to find a relationship for, you add an extra coefficient (b) and independent variable (x) to your formula.
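Here’s a quick sketch of that with made-up numbers – a multiple linear regression with two independent variables, where the fitted model ends up with one coefficient per x plus the constant term:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data where y follows y = 10 + 2*x1 + 3*x2 exactly, for illustration
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = 10 + 2 * X[:, 0] + 3 * X[:, 1]

model = LinearRegression().fit(X, y)

# model.intercept_ recovers b0; model.coef_ holds one b per independent variable
```

Because the toy data follows the equation exactly, the model recovers b0 = 10, b1 = 2, and b2 = 3; with real, noisy data the estimates would only approximate the true relationship.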
When to use linear regression
Linear regression’s power is in its simplicity, which means it can be used to answer all types of questions. If you are planning to use a linear regression algorithm there are a few conditions that your data should meet.
First, the relationship between the variables in your data needs to be linear. This means that they can be plotted along a straight line. Once plotted, the spread of the differences between the real and predicted values (the residuals) needs to be more or less constant across the range of predictions (homoscedastic).
At the same time, the residuals must be independent of each other, and the predictors (independent variables) should not be highly correlated with one another – a problem known as multicollinearity.
You can determine whether your data meets these conditions by plotting it and then doing a bit of digging into its structure.
Want to try linear regression yourself? Check out the tutorial DIY Simple Linear Regression.
Linear regression in action
Linear regression is one of the most widely used statistical tools thanks to its conceptual simplicity and wide applicability.
A partial list of areas that could benefit from a regression analysis includes economics, finance, business, law, meteorology, medicine, biology, chemistry, engineering, physics, education, sports, history, sociology, and psychology.
Here are a few real-world examples of regression analysis in action, taken from this book:
🐄 Agriculture: Predicting milk production.
💸 Labor: Calculating the effects of specific labor laws on a family in the USA.
🏛️ Government: Determining the impact of domestic immigration.
🏺 Archeology: Estimating the age of ancient Egyptian skulls.
⚕️ Healthcare: Quantifying the cost of delivering health care.
Want to learn more about linear regression? Check out this tutorial on Simple Linear Regressions.
Ready to learn about another machine learning algorithm? Read more about Logistic Regression.
Looking for more information about machine learning? Learn about the essentials of machine learning and how to get started with machine learning.
* Note: the numbers used in this example are made up for the purpose of illustration. How good someone is at their job depends on a myriad of different factors.