Linear regression is a really simple but powerful way of making predictions. Looking to begin your machine learning journey? Start here 👊


🤖 Table of Contents 🤖
What is simple linear regression?
Getting started with simple linear regression
Essential tools
Data pre-processing
Fitting the data
Making predictions
Visualizing the data
Post-game analysis


What is simple linear regression?

Linear regression is a type of statistical modeling that explores relationships between variables and determines whether one variable is dependent on other variables. This relationship can then be illustrated with a trend-line that’s overlaid on your data.

Like this:

Once you understand the relationships between different variables in your data you can start making interesting and powerful predictions.

While linear regression has its origins in statistics, it’s also used in machine learning to teach computers the correlations between variables.

This in turn allows us to make future predictions 🎉

FYI – the simple in simple linear regression refers to the fact that you’re only comparing two variables, the independent variable (or “predictor”) and the dependent variable (or “outcome”). So simple linear regression studies only one predictor variable, whereas multiple linear regression looks at two or more predictor variables.
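To make "one predictor, one outcome" a bit more concrete, here is a minimal sketch that fits a straight line to four points using NumPy's polyfit. The numbers are invented for illustration (they are not from the Blogging_Income dataset) and were chosen to sit exactly on the line y = 2x + 1, so the fit should recover a slope of about 2 and an intercept of about 1.

```python
import numpy as np

# Made-up points that lie exactly on the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])  # the predictor (independent variable)
y = np.array([3.0, 5.0, 7.0, 9.0])  # the outcome (dependent variable)

# Fit a degree-1 polynomial (a straight line) by least squares.
# polyfit returns the coefficients from the highest degree down: slope, then intercept.
slope, intercept = np.polyfit(x, y, 1)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

With real data the points won't fall perfectly on a line, and the fitted slope and intercept describe the line that comes closest to them overall.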

Despite the simplicity, linear regression is actually a really powerful tool and can be used to answer all sorts of questions such as how much money you’ll need to spend on gas for a cross-country road trip or how many cocktails you can have before saying something you’ll regret later.

To learn more about linear regression, check this out for an overview of the essentials: Linear Regression: The Beginner’s Machine Learning Algorithm.

Getting started with simple linear regression

In this DIY we’re going to explore the glamorous world of blogging by using Python to train a simple linear regression model.

Our goal is to determine the correlation (if any) between the number of months of blogging and amount of money earned.

Before we dive into the nuts and bolts of this tutorial though, let’s talk about the context that our data exists in. This will help us better understand the data we’re working with and produce analyses that are more useful in the real world.

The background:
Let’s say you’re thinking about a change of career and now want to become a full-time blogger.

Although you think it will be a lot of fun, you’re going to need to know that some cash money is eventually coming your way to replace your income. Otherwise, you won’t be able to keep your momentum up.

After some online research you find income reports that some full-time bloggers have published and you put them into a spreadsheet so you can find out when they began making money.

This will form your dataset.

With data now in hand, you can use simple linear regression to understand the correlation between the number of months of experience and the amount of money earned. This model will then allow you to predict approximately how much money can be earned based on the number of months of experience blogging.

You probably already have a hunch that the older a blog is, the more money it makes, but by using linear regression you’ll be able to support or refute this hypothesis with actual data. You know, the kind of data you can take to the bank.

Essential tools

Before you begin, you’ll need to assemble your tools.

Luckily, you only need two things:

  • Dataset – download the Blogging_Income dataset here
  • Computer

Note: you don’t need to know any Python in order to follow this DIY but you’re probably going to want some because it’s awesome.

If you want to learn Python, these posts are a good place to start: Absolute Beginner’s Guide to Python and Getting Started with Python (for Machine Learning).

Get your data

First, download the Blogging_Income dataset. You can find it here.

For this DIY, our data set will consist of 30 observations. This means the income figures for 30 different blogs and their corresponding age*.

* Note: the numbers in this dataset are made up for the purposes of this exercise. How much money a blog makes depends on all sorts of different factors such as the quality of content, type of monetization, frequency of posting, and strength of outreach network.

Install Anaconda and launch Spyder

This tutorial will use Anaconda’s Spyder interface to run Python so if you haven’t already, you’ll need to download and install Anaconda. Luckily, it’s really easy.

Find out how (and why) to install Anaconda and launch Spyder here.

BTW – Python is the programming language, Spyder is the interface (IDE) that we’ll write our Python code in, and Anaconda is the Python distribution that bundles Spyder together with the libraries we’ll be using 🐍 🕷️ 🐍

Once Spyder is up and running you’ll want to set the right folder as your working directory. This lets Spyder know where to find the dataset that you just downloaded.

Ultimately, where you store your dataset is up to you. For example, you could leave it on your desktop or, if you’re a stickler for organization, you can always make a folder called something like “Machine Learning DIY” in which to store any datasets that you download.

Here’s how you set your working directory:  

In the top bar of Spyder click on the black folder icon (like the one above).

Once the window pops up, navigate to the location of your dataset, whether that is your desktop or a folder stored elsewhere.

Then, when you’ve found and selected the location of your data, click on Select Folder (Windows) or Open (Mac). Your working directory is now set.

If you want to see what’s in your working directory, head on over to the File Explorer tab (see image below) which shows a list of all the files currently stored in the given folder.

And while we’re at it, let’s make a new file and call it simple_linear_regression.py.

You can do that by clicking on File > New File or CTRL + N (Windows) / CMD + N (Mac).

Save the file by clicking on File > Save File As. Go ahead and save simple_linear_regression.py in the same place you saved Blogging_Income.csv.

Or you can always download a template with all the code already in it here. Don’t worry, I won’t tell.

If you go down this route, you can open a Python file in Spyder by going to File > Open and then navigating to the location on your computer where you stored the file.

PROTIP: to execute any code you’ve written in Spyder, highlight the code in question and then press CTRL + ENTER or CMD + ENTER.

Data pre-processing

So, to recap:

  • Your computer is turned on.
  • Blogging_Income.csv is downloaded and put in a location on your computer where you can find it.
  • Spyder is running and knows where to find Blogging_Income.csv.
  • You have a file saved and open in Spyder called simple_linear_regression.py.

Great!

Now it’s time to get your data ready. The first step in any machine learning project is to clean and prep your data.

This step is called pre-processing.

In this case the dataset is already provided in a ready-to-go format, so you won’t need to clean it. Out in the wild, though, it’s highly likely you’ll need to clean almost any other data you work with.

Data cleaning involves tasks like filling in any blanks in the data, making sure the numbers are all on the same scale, or converting text data into numerical data, where required.
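Here is a rough sketch of what those three cleaning tasks can look like in pandas. The tiny table below is hypothetical: the "Niche" column and all of the values are made up for illustration and are not part of Blogging_Income.csv. Filling with the mean and min-max scaling are just one reasonable choice each, not the only way to do it.

```python
import pandas as pd

# Hypothetical messy data: one missing income and one text column
raw = pd.DataFrame({
    "MonthsExperience": [1, 2, 3, 4],
    "Income": [100.0, None, 300.0, 400.0],
    "Niche": ["food", "travel", "food", "tech"],
})

clean = raw.copy()

# 1. Fill in the blank using the mean of the known incomes
clean["Income"] = clean["Income"].fillna(clean["Income"].mean())

# 2. Put Income on a 0-to-1 scale (min-max scaling)
income_min = clean["Income"].min()
income_max = clean["Income"].max()
clean["IncomeScaled"] = (clean["Income"] - income_min) / (income_max - income_min)

# 3. Convert the text column into numeric indicator columns
clean = pd.get_dummies(clean, columns=["Niche"])

print(clean)
```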

For our purposes, data pre-processing will involve the following three steps:

  • Importing the libraries
  • Importing the data
  • Splitting the data into training and test sets

Let’s walk through these steps one at a time:

01. Import your Python libraries

The great thing about Python is that if you want to do something, the odds are, someone has already written code for whatever it is you’re trying to do.

This often takes the form of what’s known as a library, which is a package of pre-written code that you simply import into your IDE, like Spyder.

For our simple linear regression DIY we’ll need to import three libraries: numpy, matplotlib, and pandas (sadly, not these 🐼).

Each library does its own thing and when combined they level up your computer’s Python-game by providing the essential tools needed to build machine learning models.

How to import a library:
In Spyder, paste the below code into the simple_linear_regression.py Python file that we currently have open.

# import these libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Note: The # denotes a comment. It’s ignored by the system when the code is executed and is only there to let us (and others) know what the code below it is doing, which in this case is importing libraries.

Once this code is pasted in, select the lines of code and hit CTRL + ENTER / CMD + ENTER. This will execute the code.

You’ll notice that once you do this, something happens in the console on the bottom right-hand side of Spyder. This is letting you know that the system has imported the libraries. If there’s no error message then it’s all good 👍

02. Import the data

The next step is to import your data into Spyder. This will require adding three simple lines of code below the previously entered code:

# import your dataset

dataset = pd.read_csv('Blogging_Income.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values

Once you’ve added these to simple_linear_regression.py, don’t forget to highlight them and press CTRL + ENTER / CMD + ENTER.

Now we’re going to head on over to the Variable Explorer to check out what we’ve just done.

If you double-click on “dataset” you can see all of the data from our Blogging_Income.csv file, arranged in two columns: MonthsExperience and Income.

The second and third lines of code break our data into the independent variable (X) and the dependent variable (y). In this case, X is the number of months blogging and y is the income.
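If the iloc slicing looks a little cryptic, this small sketch shows what the two lines produce. It uses a five-row stand-in table with made-up numbers in place of Blogging_Income.csv (only the column names are taken from the real file), so you can run it without the download.

```python
import pandas as pd

# Stand-in for Blogging_Income.csv with made-up numbers
dataset = pd.DataFrame({
    "MonthsExperience": [1, 2, 3, 4, 5],
    "Income": [50, 120, 180, 260, 310],
})

X = dataset.iloc[:, :-1].values  # every row, every column except the last
y = dataset.iloc[:, 1].values    # every row, second column only

print(X.shape)  # (5, 1)
print(y.shape)  # (5,)
```

Notice that X keeps two dimensions (rows and one column) while y is a flat array. scikit-learn expects its inputs in this shape, which is why X is sliced with :-1 rather than with a single column index.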

03. Split the dataset into a training set and test set

Machine learning is about getting machines to learn, right?

In this case, we want our algorithm to learn correlations in our data and then to make predictions based on what it learned. To do this though, we need to split our data into two different sets, a training set and a test set.

Splitting the data into two different sets lets us check that our linear regression algorithm hasn’t overlearned (in machine learning terms, overfit). An overfit model is like a student who memorized all of the answers to a practice test by heart and then, when it comes time to take the real test, can’t answer any of the questions because they are worded differently.

It would be a real downer if our machine could only make predictions based on the one dataset that we provided.

So by splitting the data into a training set, which we’ll use to build our linear regression model, and a test set, which we’ll test the model on, we can make sure that the machine actually learned the correlations and didn’t just memorize them. Then we can apply the learnings to other data sets.

In this DIY we have 30 observations, so in this case, a reasonable split for this data would be 20 for the training set and 10 for the test set.

To split your data, add the following code to simple_linear_regression.py:

# split your dataset into a training set and test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

Highlight the code and CTRL + ENTER / CMD + ENTER.

You might have noticed that the end of the code above includes random_state = 0. This ensures that we all get the same results. You can set random_state to other values, but it’s important to always use the same value so you can reproduce your results when you re-run your code.

If you head over to the Variable Explorer you’ll see that the data has now been split into X_train and y_train (20 observations each) and X_test and y_test (10 observations each).
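If you'd like to convince yourself of the 20/10 split and of what random_state does, here is a small sketch. It uses 30 made-up observations instead of the real dataset, and it imports train_test_split from sklearn.model_selection, which is where the function lives in current scikit-learn releases.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 30 made-up observations standing in for the real dataset
X = np.arange(1, 31).reshape(-1, 1)  # months 1 to 30, as a single column
y = 100 * np.arange(1, 31)           # invented incomes

# First split: one third of the data goes to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0
)

# Second split with the same random_state
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=1/3, random_state=0
)

print(len(X_train), len(X_test))         # sizes of the two sets
print(np.array_equal(X_test, X_test2))   # same rows picked both times?
```

Try changing random_state in the second call to see a different set of rows land in the test set.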

Alright friends, we’re now ready to begin some machine learning!

Fitting the data

Now that our data is ready, our first step is to fit the simple linear regression model to our training set.

Basically, we’re going to get our computer to learn the correlations in our training set so that it can predict the dependent variable (income) based on the independent variable (blogging experience).

In machine learning terms, the machine is the simple linear regressor which learns the correlations between income and blogging experience that are given in the training set.

Based on its learning experience, the machine should then be able to predict income based on amount of experience.

This code will create our linear regressor – add it in beneath the previous code:

# fit simple linear regression to the training set

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

Now go ahead and execute that code. You’re not going to see anything exciting happen (yet) but you’ve just successfully created a machine learning model 🥂🥂
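If you're curious about what the model actually learned, a fitted LinearRegression stores its line in two attributes: coef_ (the slope) and intercept_. The sketch below uses made-up training data that follows income = 150 * months + 20 exactly, so you can see the model recover those two numbers. With the real dataset the values will of course be different.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data that follows income = 150 * months + 20 exactly
X_train = np.array([[1], [2], [3], [4], [5]])
y_train = 150 * X_train.ravel() + 20

regressor = LinearRegression()
regressor.fit(X_train, y_train)

slope = regressor.coef_[0]        # income gained per extra month
intercept = regressor.intercept_  # predicted income at month zero

print(f"income = {slope:.1f} * months + {intercept:.1f}")
```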

Making predictions

So, if we now think back to why we started this project our goal was to train a simple linear regression model to learn if any correlations existed between the number of months blogging (experience) and income.

We set out to do this with the hypothesis that our model would then be able to make predictions of likely financial success based on months of experience.

With our model in hand we can now begin making some predictions.

Since our linear regressor (aka machine) has learned from the observations in our training set, we’re now ready to make predictions based on our test set.

We’re going to do this with one line of code:

# predict test set results

y_pred = regressor.predict(X_test)

Execute your code as before and then head over to the Variable Explorer.

Here you’ll see y_pred which contains a list of predicted incomes that our machine produced.

If you look at y_test and y_pred side by side, you’ll see that y_test has the real observations, while y_pred has predicted incomes. By comparing them this way you can see how accurate your machine’s predictions were.
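Eyeballing two columns works, but you can also put a number on the comparison. This is an optional sketch, not part of the original walkthrough: it uses scikit-learn's mean_absolute_error and r2_score on made-up values standing in for y_test and y_pred. Mean absolute error is the average gap between real and predicted income, and R² is closer to 1 the better the predictions track the real values.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up actual and predicted incomes, standing in for y_test and y_pred
y_test = np.array([200.0, 400.0, 600.0, 800.0])
y_pred = np.array([220.0, 390.0, 580.0, 830.0])

mae = mean_absolute_error(y_test, y_pred)  # average size of the misses
r2 = r2_score(y_test, y_pred)              # 1.0 would be a perfect fit

print(f"Mean absolute error: {mae:.2f}")
print(f"R squared: {r2:.3f}")
```

In your own script you would skip the two made-up arrays and pass in the y_test and y_pred you already have.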

Not too shabby.

Now let’s put this in a format that makes it easier to understand and interpret at a glance.

Visualizing the data

Now we’re going to take this DIY one step further and display all of our data visually on a graph.

I mean, linear regression is all about the lines, right?

First, let’s go back and plot the real observation points from our training set. We can do this by running the following code:

# visualize training set results

plt.scatter(X_train, y_train, color = '#fe5656')
plt.plot(X_train, regressor.predict(X_train), color = '#302a2c')
plt.title('Income vs Experience (Training)')
plt.xlabel('Months of Experience')
plt.ylabel('Income')
plt.show()

Executing this code will create a visualization that looks something like this:

The first thing to look at is the distinction between real and predicted values – the red dots plot the real (observed) income against months of experience, while the black regression line shows the income the model predicts for each amount of experience.

Looking at this graph, there appears to be a linear dependency between Income and Months of Experience, since the regression line is pretty close to all of the observation points.

Ok, so we’ve confirmed that there definitely appears to be a linear relationship between income and the number of months blogging.

Now for the fun part!

Let’s look at how good our machine is at making predictions.

Again, we’ll plot our data, but this time we’ll be looking at the data from the test set. We’re going to run this code:

# visualize test set results

plt.scatter(X_test, y_test, color = '#fe5656')
plt.plot(X_train, regressor.predict(X_train), color = '#302a2c')
plt.title('Income vs Experience (Test)')
plt.xlabel('Months of Experience')
plt.ylabel('Income')
plt.show()

The red points on this new graph now show the test data which we can compare against the linear regressor. The black line stays the same because we’d already trained it on the training set earlier.

PROTIP: Graph only showing up in your console? To get your graphics to pop out in a second window click on Tools > Preferences > IPython console > Graphics (on Windows) or python > Preferences > IPython console > Graphics (on Mac), then change Backend to “Automatic” instead of “Inline”.

Post-game analysis

So how well did our machine do?

The closer our red dots are to the line, the better the prediction is. Dots that are actually on the line are 💯

So by doing a quick visual review, we can see that on the whole, our machine has done a pretty good job of predicting likely income based on blogging experience!

Let’s say you decide you’re willing to give blogging 6 months of your time. After that point, if you’re not making any money you’ll call it quits. By looking at our graph we can see that our machine predicts there’s a good chance that you should be making money by then, which is very interesting.
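You don't have to read the 6-month figure off the graph: you can ask the model directly. Here is a minimal sketch, assuming a regressor trained on made-up data that follows income = 100 * months - 200 (these are not the real dataset's numbers). Note that predict expects a 2-D input, hence the double brackets around the 6.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data that follows income = 100 * months - 200 exactly
X_train = np.array([[2], [4], [8], [10]])
y_train = 100 * X_train.ravel() - 200

regressor = LinearRegression().fit(X_train, y_train)

# Predict income at 6 months (one row, one column)
months = np.array([[6]])
predicted_income = regressor.predict(months)[0]

# The month where the fitted line crosses zero income
break_even_month = -regressor.intercept_ / regressor.coef_[0]

print(f"Predicted income at 6 months: {predicted_income:.2f}")
print(f"Line crosses zero at month: {break_even_month:.2f}")
```

In your own script, regressor is already trained, so the only new line you need is the call to regressor.predict.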

Either way, you now have a data-driven method of setting some milestones.

The great thing about this, of course, is that it can be applied to almost anything where a linear relationship exists and this is why it’s such a powerful tool when it comes to data science and statistical analysis.

January 8, 2019
