At some point in your life you’re probably going to find yourself with a CSV file that needs to be opened. If you want to manipulate the data in that file, or if it has more than 1 million rows, you’re going to want to open that CSV with a coding language like Python. Knowing how to read, process, and parse CSV files programmatically is an essential skill for anyone interested in data science or machine learning and will save you time in the long run.
🤖 Table of Contents 🤖
What is a CSV file?
Python’s CSV Library
Using Pandas to work with CSV files
Conclusion
Column 1, column 2, column 3 First row data 1, data 2, data 3 Second row data 1, data 2, data 3 ...
In the above example, the first line provides the column names, while each data point is separated with a comma. If you’ve worked with an excel file before you can imagine a CSV as a series of rows and columns where the columns are separated by a delimiter, in this case, a comma. Commas aren’t the only way to separate data however, other delimiters could be a tab (\t), colon (:), or semi-colon (;).
CSV files are a convenient way of working with data as they provide an easy way of exporting data from spreadsheets and databases in order to import in other programs. For instance, you can easily export an excel file as a CSV, and vice versa.
There are a few different ways to work with CSV files in Python – we’re going to talk about using Python’s built-in csv
library and the Pandas library.
Python’s csv
Library
When it comes to working with CSV files, Python comes with an out of the box solution in the form of the csv
library. Designed to work with Excel-generated CSV files, the csv
library is a useful tool that can read, write, and process data from and to CSV files.
Reading CSV files with csv
Let’s look at a CSV file that includes information about dog breeds called dog_breed.csv.
The file includes the following data:
dog name,breed,birth month Muffin,chihuahua,August Scruffy,border collie,February Lance,great dane,November
To read this file, use the following code:
with open('dog_breed.csv') as csv_file: csv_reader = csv.reader(csv_file, delimiter=',') line_count = 0 for row in csv_reader: if line_count == 0: print(f'Column names are {", ".join(row)}') line_count += 1 else: print(f'\t{row[0]} is a {row[1]} and was born in {row[2]}.') line_count += 1 print(f'This file has {line_count} lines.')
This will produce the following output:
Column names are dog name, breed, birth month Muffin is a chihuahua, and was born in August. Scruffy is a border collie, and was born in February. Lance is a great dane, and was born in November. This file has 4 lines.
As you can see, the Python opens the file reads through the contents. It then prints out the column names. Because csv
reader is anticipating the first row of your file to be the column names, it treats this separately to the rest of the data. Python then proceeds to work through the data row-by-row, printing the selected information as it goes. Finally, the code prints out the total number of lines processed.
This code reads the contents of the CSV file into string
elements. You might, however, want to read the CSV data directly into a dictionary, which you can do with just a few minor changes to the code.
To learn the difference between a string and a dictionary check out the post: Absolute Begginer’s Guide to Python
import csv with open('dog_breed.csv', mode='r') as csv_file: csv_reader = csv.DictReader(csv_file) line_count = 0 for row in csv_reader: if line_count == 0: print(f'Column names are {", ".join(row)}') line_count += 1 print(f'\t{row["dog name"]} is a {row["breed"]} and was born in {row["birth month"]}.') line_count += 1 print(f'Processed {line_count} lines.')
The output is the same – except this time the first line of the CSV file (the column names) are turned into the keys used to build the dictionary.
Writing CSV files with csv
The csv
library can also write to a CSV file.
import csv with open('about_dogs.csv', mode='w') as dog_breed: dog_breed = csv.writer(dog_breed, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL) dog_breed.writerow(['Muffin', 'chihuahua', 'August']) dog_breed.writerow(['Scruffy', 'border collie', 'February']) dog_breed.writerow(['Lance', 'great dane', 'November'])
This will create a plain text file that contains the following:
Muffin,chihuahua,August Scruffy,border collie,February Lance,great dane,November
You can also write a dictionary to CSV.
import csv with open('about_dogs_dict.csv', mode='w') as csv_file: fieldnames = ['name', 'dog_breed', 'birth_month'] writer = csv.DictWriter(csv_file, fieldnames=fieldnames) writer.writeheader() writer.writerow({'name': 'Muffin', 'dog_breed': 'chihuahua', 'birth_month': 'August'}) writer.writerow({'name': 'Scruffy', 'dog_breed': 'border collie', 'birth_month': 'February'}) writer.writerow({'name': 'Lance', 'dog_breed': 'great dane', 'birth_month': 'November'})
This will produce the following file:
name,dog_breed,birth_month Muffin,chihuahua,August Scruffy,border collie,February Lance,great dane,November
The primary difference when writing a dictionary to CSV is that you need to define the fieldnames
parameter. This is because Python needs to know which keys to use in order to retrieve information from your dictionary. It also uses the fieldname
keys to writer the first row as column names.
Using Pandas to work with CSV files
pandas
is an open-source Python library that provides easy-to-use tools for data analysis. It can be installed directly to your computer or through the Anaconda distribution of Python.
To install pandas
using pip
, open your command line and type:
$ pip install pandas
Alternatively, to install pandas
using Anaconda open your Anaconda Navigator. On the right hand side click Environments
> then click the green arrow to the right of base (root) > select Open Terminal
. This opens Anaconda’s command line.
In the command line type:
$ pip install pandas
This will install pandas
and its dependencies.
Reading CSV files with `pandas`
Pandas is a powerful tool for analysing and working with data. To highlight this, let’s add a few more details about our dogs in a new CSV called dog_breed2.csv.
dog name,breed,birthday,weight_kg Muffin,chihuahua,22/09/18, 8.4 Scruffy,border collie,07/02/19, 20.3 Lance,great dane,16/11/17, 47.9
Reading a CSV into a pandas DataFrame
is really simple. Just three lines of code, in fact.
import pandas df = pandas.read_csv('dog_breed2.csv') print(df)
The code is simple. It opens the CSV, stores it in a DataFrame and then prints the DataFrame as a table:
dog name breed birthday weight_kg 0 Muffin chihuaha 22/09/18 8.4 1 Scruffy border collie 07/02/19 20.3 2 Lance great dane 16/11/17 47.9
pandas
automatically recognizes the first row of the CSV as column names.
As with all data work, you might need to do a bit of cleaning up before you can do jump into analysis. In this instance, pandas
is reading the birthday column as a string, rather than a date.
Use the following code to confirm this and make sure your other columns are in the right format:
print(type(df['birthday'][0]))
This is easily fixed by forcing pandas
read the data as a date when you load the CSV into your DataFrame.
import pandas df = pandas.read_csv('dog_breed2.csv', parse_dates=['birthday']) print(df)
When you use the optional parse_dates parameter Python will output the following:
dog name breed birthday weight_kg 0 Muffin chihuaha 22/09/18 8.4 1 Scruffy border collie 07/02/19 20.3 2 Lance great dane 16/11/17 47.9
As you can see, our dogs’ birthdays are now formatted as dates.
Another thing you might want to change is the zero-based integer indices in the far left-hand column. We can do this by adding another optional parameter to pandas read_csv
function.
import pandas df = pandas.read_csv('dog_breed2.csv', index_col='dog name', parse_dates=['birthday']) print(df)
You should now see:
breed birthday weight_kg dog name Muffin chihuaha 2018-09-22 8.4 Scruffy border collie 2019-07-02 20.3 Lance great dane 2017-11-16 47.9
Writing CSV files with pandas
Writing a CSV file with pandas
is as easy as reading one.
import pandas df = pandas.read_csv('dog_breed2.csv', index_col='dog name', parse_dates=['birthday'], header=0, names=['dog name', 'breed', 'birthday', 'weight_kg']) df.to_csv('dog_breed2_new.csv')
This is very similar to the code used to read our CSV, except we need to denote the names of the columns (dog name, breed etc.) and the name of our new CSV (dog_breed2_new.csv)
The output of this code will be:
dog name,breed,birthday,weight_kg Muffin,chihuaha,2018-09-22,8.4 Scruffy,border collie,2019-07-02,20.3 Lance,great dane,2017-11-16,47.9