Linear regression is a really simple but powerful way of making predictions. Looking to rock your own machine learning style? Start here 👊
🤖 Table of Contents 🤖
What is simple linear regression?
Getting started with simple linear regression
Fitting the data
Visualizing the data
In the wild
What is simple linear regression?
Linear regression is a type of statistical modeling that explores the relationship between variables and lets you determine how one variable depends on another. This relationship can then be illustrated by a trend line that’s overlaid on your data.
Once you understand the relationships between different variables in your data you can start making interesting and powerful predictions.
While linear regression has its origins in statistics, it’s also used in machine learning to teach computers the correlations between variables.
This in turn allows us to make future predictions 🎉
FYI – the simple in simple linear regression refers to the fact that you’re only comparing two variables, the independent variable (or “predictor”) and the dependent variable (or “outcome”). So simple linear regression studies only one predictor variable, whereas multiple linear regression looks at two or more predictor variables.
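Under the hood, simple linear regression just fits a straight line, y = b0 + b1·x, where b0 is the intercept and b1 is the slope. Here’s a minimal sketch of that idea, with completely made-up coefficients just for illustration:

```python
# Simple linear regression models the outcome as a straight line:
#   y = b0 + b1 * x
# where b0 is the intercept and b1 is the slope.
# The coefficients below are made up purely for illustration.

def predict_income(months, b0=50.0, b1=120.0):
    """Predict income from months of blogging using a fitted line."""
    return b0 + b1 * months

print(predict_income(6))  # 50 + 120 * 6 = 770.0
```

Training a model is just the process of finding the b0 and b1 that best fit your data, which is exactly what we’ll do below.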
Despite being simple, linear regression is actually a really powerful tool and can be used to answer all sorts of questions such as how much money you’ll need to spend on gas for a cross-country road trip or how many cocktails you can have before saying something you’ll regret later.
To learn more about linear regression, check out this post: Linear Regression: The Beginner’s Machine Learning Algorithm.
Getting started with simple linear regression
In this DIY we’re going to explore the glamorous world of blogging by using Python to train a simple linear regression model.
Our goal is to determine the correlation between the number of months of blogging and amount of money earned.
Before we dive into the nuts and bolts of this tutorial, let’s talk about the context that our data exists in. This will help us better understand the data and produce analyses that are more useful in the real world.
Let’s say you’re thinking about a change of career to become a full-time blogger.
Although you think it will be a lot of fun, you’re going to need to know that some cash money is eventually coming your way. Otherwise, you won’t be able to keep your momentum up because #lattesdon’tgrowontrees.
After some online research, you found income reports that some full-time bloggers published and put them into a spreadsheet so you can figure out when they began making their big bucks.
This will form your dataset.
With data now in hand, you can use simple linear regression to understand the correlation between the number of months of experience and the amount of money earned. This model will then allow you to predict approximately how much money can be earned based on the number of months of experience blogging.
You probably already have a hunch that the older a blog is, the more money it makes, but by using linear regression you’ll be able to support that hypothesis with actual data. You know, the kind of data you can take to the bank.
Before you begin, you’ll need to assemble your tools.
Luckily, you only need two things:
- Dataset – download the Blogging_Income dataset here
- Anaconda – the free Python distribution that includes Spyder, the editor we’ll be writing our code in (installation covered below)
FYI you don’t need to know any Python in order to follow this DIY but you’re probably going to want some because it’s awesome.
If you want to learn Python, or are just curious, these posts are a good place to start: Absolute Beginner’s Guide to Python and Getting Started with Python (for Machine Learning).
Get your data
First, download the Blogging_Income dataset. You can find it here.
For this DIY our dataset will consist of 30 observations: income figures for 30 different blogs, along with each blog’s age in months.
* Note: the numbers in this dataset are totally made up for the purposes of this DIY. How much money a blog makes depends on all sorts of different factors such as quality of content, type of monetization, frequency of posting, and strength of outreach network.
Install Anaconda and launch Spyder
This tutorial will use Anaconda’s Spyder interface to run Python so if you haven’t already, you’ll need to download and install Anaconda. Luckily, it’s really easy.
Find out how to install Anaconda and launch Spyder here
BTW – Python is the programming language, Spyder is the interface that we’ll use to write our Python code in, and Anaconda is the distribution that bundles Spyder together with Python and a bunch of useful libraries 🐍 🕷️ 🐍
Once Spyder is up and running you’ll want to set the right folder as your working directory. This lets Spyder know where to find the dataset that you just downloaded. Where you store your dataset is up to you. For example, you could leave it on your desktop or, if you’re a stickler for organization you can always make a folder called something like “Machine Learning DIY” in which to store any datasets that you download.
Here’s how you set your working directory:
In the top bar of Spyder click on the black folder icon (like above). Once the window pops up, navigate to the location of your dataset, whether that is your desktop or a folder stored elsewhere. Then, when you’ve found and selected the location of your data, click on Select Folder (Windows) or Open (Mac). Your working directory is now set!
If you want to see what’s in your working directory, head on over to the File Explorer (see image below) which shows a list of all the files currently stored in the given folder.
And while we’re at it, let’s make a new file and call it simple_linear_regression.py. You can do that by clicking on File > New File or CTRL + N (Windows) / CMD + N (Mac).
Save the file by clicking on File > Save File As. Go ahead and save simple_linear_regression.py in the same place you saved Blogging_Income.csv.
Or you can always download a template with all the code already in it here. Don’t worry, I won’t tell 😜 If you go for this route, you can open a Python file in Spyder by going to File > Open and then navigating to the location on your computer where you stored the file.
PROTIP: to execute any code you’ve written in Spyder, highlight the code in question and then press CTRL + ENTER or CMD + ENTER.
So, to recap:
- Your computer is turned on.
- Blogging_Income.csv is downloaded and put somewhere you can find it.
- Spyder is running and knows where to find Blogging_Income.csv.
- You have a file saved and open in Spyder called simple_linear_regression.py.
Now it’s time to get your data ready. The first step in any machine learning project is to clean and prep your data. This step is called pre-processing.
Since the dataset is already provided, you won’t need to clean the data; however, it’s highly likely you’ll need to do this for projects out in the wild.
Data cleaning involves tasks like filling in any blanks in the data, making sure the numbers are all on the same scale, and also converting text data into numerical data.
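To give you a taste of what that cleaning can look like, here’s a sketch using pandas and scikit-learn on a tiny made-up table (the column names and values are purely illustrative, not part of our Blogging_Income dataset):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical messy data: one missing income value and a text column
df = pd.DataFrame({
    'MonthsExperience': [1, 2, 3, 4],
    'Income': [100.0, None, 300.0, 400.0],
    'Niche': ['food', 'travel', 'food', 'tech'],
})

# 1. Fill in blanks (here, with the column's mean)
df['Income'] = df['Income'].fillna(df['Income'].mean())

# 2. Put numbers on the same scale (mean 0, standard deviation 1)
df[['MonthsExperience']] = StandardScaler().fit_transform(df[['MonthsExperience']])

# 3. Convert text data into numerical data (one-hot encoding)
df = pd.get_dummies(df, columns=['Niche'])

print(df.columns.tolist())
```

None of this is needed for our tidy tutorial dataset, but these three moves cover a surprising amount of real-world pre-processing.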
For our purposes, data pre-processing will involve the following three steps:
- Importing the libraries
- Importing the data
- Splitting the data into training and test sets
Let’s walk through these steps one at a time:
01. Import your Python libraries
The cool thing about Python is that if you want to do something, odds are, someone has already written code for whatever it is you’re trying to do. This often takes the form of what’s known as a library, which is a package of pre-written code that you import into your IDE, like Spyder in this example.
For our simple linear regression DIY we’ll need to import three libraries: numpy, matplotlib, and pandas. Each library does its own thing, and combined they provide the essential tools needed to build machine learning models.
How to import a library:
In Spyder, paste the below code into the simple_linear_regression.py Python file that we currently have open.
```python
# import these libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```
Then, select the above lines of code and hit CTRL + ENTER / CMD + ENTER. This will execute the code.
You’ll notice that once you do this, something happens in the console on the bottom right-hand side of Spyder. This is letting you know that the system has imported the libraries. If there’s no error message then it’s all good 👍
02. Import the data
The next step is to import your data into Spyder. This will require three simple lines of code:
```python
# import your dataset
dataset = pd.read_csv('Blogging_Income.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values
```
Once you’ve added these to simple_linear_regression.py, don’t forget to highlight them and click CTRL + ENTER / CMD + ENTER.
Head on over to the Variable Explorer to check out what we’ve just done.
If you double-click on “dataset” you can see all of the data from our Blogging_Income.csv file, arranged in two columns: MonthsExperience and Income.
The second and third lines of code break our data into the independent variable (X) and the dependent variable (y). In this case, X is the number of months blogging and y is the income.
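If the `iloc` slicing looks cryptic, here’s a toy version with a three-row stand-in for our dataset (the numbers are made up):

```python
import pandas as pd

# Tiny stand-in for Blogging_Income.csv
dataset = pd.DataFrame({'MonthsExperience': [1, 2, 3],
                        'Income': [100, 210, 290]})

X = dataset.iloc[:, :-1].values  # every column except the last
y = dataset.iloc[:, 1].values    # the second column (Income)

print(X.shape)  # (3, 1) -> 2-D, the shape sklearn expects for features
print(y.shape)  # (3,)   -> 1-D target values
```

The key detail is that X stays two-dimensional (rows × columns) even with a single feature, while y is a flat list of outcomes.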
03. Split the dataset into a training set and test set
Machine learning is about getting machines to learn, right?
In this case, we want our algorithm to learn correlations in our data and then to make predictions based on what it learned. To do this though, we need to split our data into two different sets, a training set and a test set.
Splitting the data into two different sets ensures that our linear regression algorithm doesn’t overfit (i.e. overlearn). This would be like a student who memorized all of the answers to a test by heart and then, when it comes time to take the test, can’t answer any of the questions because they are worded differently.
It would be a total bummer if our machine could only make predictions based on the one dataset that we provided.
So by splitting the data into a training set which we’ll use to a build our linear regression algorithm and a test set which we’ll test our algorithm on, we can make sure that the machine actually learned the correlations and didn’t just memorize them. Then we can apply the learnings to other data sets.
In this DIY we have 30 observations, so a reasonable split would be 20 for the training set and 10 for the test set.
To split your data, add the following code to simple_linear_regression.py:
```python
# split your dataset into a training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)
```
Highlight and CTRL + ENTER / CMD + ENTER that thang.
You might have noticed that the code above ends with random_state = 0. This ensures that we all get the same results. You can set random_state to other values, but it’s important to always use the same value so you can reproduce your results when you re-run your code.
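Here’s a quick sketch of why that matters, using 30 fake observations in place of our real data: with the same random_state, the split comes out identical every time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 30 fake observations standing in for our real X and y
X = np.arange(30).reshape(-1, 1)
y = np.arange(30)

# Same random_state -> identical splits on every run
X_a, _, _, _ = train_test_split(X, y, test_size=1/3, random_state=0)
X_b, _, _, _ = train_test_split(X, y, test_size=1/3, random_state=0)

print(np.array_equal(X_a, X_b))  # True
```

Change random_state on one of the two calls and the comparison flips to False, because each seed shuffles the rows differently.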
If you head over to the Variable Explorer you’ll see that your algorithm has now split the data into four new variables: X_train, X_test, y_train, and y_test.
Alright friends, we’re now ready to begin machine learning!
Fitting the data
Now that our data is ready, our first step is to fit the simple linear regression model to our training set.
Basically, we’re going to get our computer to learn the correlations in our training set so that it can predict the dependent variable (income) based on the independent variable (blogging experience).
In machine learning terms, the machine is the simple linear regressor which learns the correlations between income and blogging experience that are given in the training set. Based on its learning experience, the machine will then be able to predict income based on amount of experience.
This code will create our linear regressor:
```python
# fit simple linear regression to the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
```
Now go ahead and execute that code. You’re not going to see anything exciting happen (yet) but you’ve just successfully created a machine learning model 🥂🥂
If you recall, our goal is to train a simple linear regression model to learn the correlation between number of months blogging (experience) and income. By doing this, our model will then be able to make predictions based on months of experience.
So, with model in hand, we can now begin making predictions. Since our linear regressor (aka machine) has learned the observations in our training set, we’re now ready to make predictions based on our test set.
We’re going to do this with one line of code:
```python
# predict test set results
y_pred = regressor.predict(X_test)
```
Execute your code and then head over to the Variable Explorer. Here you’ll see y_pred, which contains the list of predicted incomes that our machine produced.
If you look at y_test and y_pred side by side, you’ll see that y_test has the real observations, while y_pred has the predicted incomes. By comparing them you can see how accurate your machine’s predictions were.
Not too shabby, imo.
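If eyeballing the Variable Explorer isn’t enough, you can put a number on the accuracy. This sketch uses made-up stand-ins for the training and test arrays, but the two metric calls at the end are exactly what you could run on your own y_test and y_pred:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up stand-ins for the real training and test arrays
X_train = np.array([[1], [2], [3], [4], [5], [6]])
y_train = np.array([100, 210, 295, 405, 500, 610])
X_test = np.array([[7], [8]])
y_test = np.array([690, 815])

regressor = LinearRegression().fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Mean absolute error: the average miss, in the same units as income
print(mean_absolute_error(y_test, y_pred))
# R squared: 1.0 means the line explains the data perfectly
print(r2_score(y_test, y_pred))
```

A small mean absolute error and an R² close to 1 both point to a model that’s predicting well.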
Visualizing the data
Let’s take this DIY one step further and display all of our data on a graph. I mean, linear regression is all about the lines, right?
First, let’s go back and plot the real observation points from our training set:
```python
# visualize training set results
plt.scatter(X_train, y_train, color = '#fe5656')
plt.plot(X_train, regressor.predict(X_train), color = '#302a2c')
plt.title('Income vs Experience (Training)')
plt.xlabel('Months of Experience')
plt.ylabel('Income')
plt.show()
```
Executing this code will create a visualization that looks like this:
The first thing to look at is the distinction between real and predicted values – the red dots plot the real (observed) income against months of experience, while the black regression line shows the income predicted from this distribution.
Looking at this, there’s clearly a linear relationship between income and months of experience, since the regression line sits close to all of the observation points.
Ok, so we’ve confirmed that there’s a linear relationship between income and months blogging.
Now for the fun part!
Let’s look at how good our machine is at making predictions. Again, we’ll plot our data, but this time we’ll be looking at the data from the test set.
```python
# visualize test set results
plt.scatter(X_test, y_test, color = '#fe5656')
plt.plot(X_train, regressor.predict(X_train), color = '#302a2c')
plt.title('Income vs Experience (Test)')
plt.xlabel('Months of Experience')
plt.ylabel('Income')
plt.show()
```
The red points on this new graph now show the test data which we can compare against the linear regressor. The black line stays the same because we’d already trained it on the training set earlier.
PROTIP: Graph only showing up in your console? To get your graphics to pop out in a separate window, click on Tools > Preferences > IPython Console > Graphics (on Windows) or python > Preferences > IPython Console > Graphics (on Mac), then change Backend from “inline” to “automatic”.
So how well did our machine do?
The closer our red dots are to the line, the better the prediction is. Dots that are actually on the line are 💯
So by doing a quick visual review, we can see that on the whole, our machine has done a pretty good job of predicting income based on blogging experience!
Let’s say you decide you’re willing to give blogging 6 months of time. After that point, if you’re not making any money you’ll call it quits. By looking at our graph we can see that our machine predicts you should be making money by then, which is v interesting.
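You can also ask the trained model for that number directly instead of reading it off the graph. Here’s a sketch with made-up numbers standing in for the real dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data standing in for Blogging_Income.csv
X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([0, 50, 120, 180, 260])

regressor = LinearRegression().fit(X_train, y_train)

# Predicted income at the 6-month mark
print(regressor.predict(np.array([[6]]))[0])
```

Note that predict expects a 2-D array (rows of features), even for a single query – hence the double brackets around the 6.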
Either way, you now have a data-driven method of setting milestones.
The great thing about this, of course, is that it can be applied to almost anything where a linear relationship exists.
In the wild
What do you think? Simple linear regression is pretty easy, right?
Have you done any cool examples of linear regression IRL? If so, share with the hashtag #machinesgonewild ✨