At some point in your life you’re probably going to find yourself with a CSV file that needs to be opened. If you want to manipulate the data in that file, or if it has more than 1 million rows, you’re going to want to open that CSV with a coding language like Python. Knowing how to read, process, and parse CSV files programmatically is an essential skill for anyone interested in data science or machine learning and will save you time in the long run.

What is a CSV file?

A CSV file (Comma Separated Values) is a type of plain text file that holds tabular data. As the name suggests, CSVs generally use commas to separate each data value. For instance:

Column 1, column 2, column 3
First row data 1, data 2, data 3
Second row data 1, data 2, data 3
...

In the above example, the first line provides the column names, while each data point is separated with a comma. If you’ve worked with an excel file before you can imagine a CSV as a series of rows and columns where the columns are separated by a delimiter, in this case, a comma. Commas aren’t the only way to separate data however, other delimiters could be a tab (\t), colon (:), or semi-colon (;). 

CSV files are a convenient way of working with data as they provide an easy way of exporting data from spreadsheets and databases in order to import in other programs. For instance, you can easily export an excel file as a CSV, and vice versa.

There are a few different ways to work with CSV files in Python – we’re going to talk about using Python’s built-in csv library and the Pandas library.

Python’s csv Library

When it comes to working with CSV files, Python comes with an out of the box solution in the form of the csv library. Designed to work with Excel-generated CSV files, the csv library is a useful tool that can read, write, and process data from and to CSV files.

Reading CSV files with csv

Let’s look at a CSV file that includes information about dog breeds called dog_breed.csv.

The file includes the following data:

dog name,breed,birth month
Muffin,chihuahua,August
Scruffy,border collie,February
Lance,great dane,November

To read this file, use the following code:

with open('dog_breed.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            print(f'Column names are {", ".join(row)}')
            line_count += 1
        else:
            print(f'\t{row[0]} is a {row[1]} and was born in {row[2]}.')
            line_count += 1
    print(f'This file has {line_count} lines.')

This will produce the following output:

Column names are dog name,  breed,  birth month
  Muffin is a  chihuahua, and was born in  August.
  Scruffy is a  border collie, and was born in  February.
  Lance is a  great dane, and was born in  November.
This file has 4 lines.

As you can see, the Python opens the file reads through the contents. It then prints out the column names. Because csv reader is anticipating the first row of your file to be the column names, it treats this separately to the rest of the data. Python then proceeds to work through the data row-by-row, printing the selected information as it goes. Finally, the code prints out the total number of lines processed.

This code reads the contents of the CSV file into string elements. You might, however, want to read the CSV data directly into a dictionary, which you can do with just a few minor changes to the code.

To learn the difference between a string and a dictionary check out the post: Absolute Begginer’s Guide to Python

import csv

with open('dog_breed.csv', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            print(f'Column names are {", ".join(row)}')
            line_count += 1
        print(f'\t{row["dog name"]} is a {row["breed"]} and was born in {row["birth month"]}.')
        line_count += 1
    print(f'Processed {line_count} lines.')

The output is the same – except this time the first line of the CSV file (the column names) are turned into the keys used to build the dictionary.

Writing CSV files with csv

The csv library can also write to a CSV file.

import csv

with open('about_dogs.csv', mode='w') as dog_breed:
    dog_breed = csv.writer(dog_breed, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    dog_breed.writerow(['Muffin', 'chihuahua', 'August'])
    dog_breed.writerow(['Scruffy', 'border collie', 'February'])
    dog_breed.writerow(['Lance', 'great dane', 'November'])

This will create a plain text file that contains the following:

Muffin,chihuahua,August
Scruffy,border collie,February
Lance,great dane,November

You can also write a dictionary to CSV.

import csv

with open('about_dogs_dict.csv', mode='w') as csv_file:
    fieldnames = ['name', 'dog_breed', 'birth_month']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

    writer.writeheader()
    writer.writerow({'name': 'Muffin', 'dog_breed': 'chihuahua', 'birth_month': 'August'})
    writer.writerow({'name': 'Scruffy', 'dog_breed': 'border collie', 'birth_month': 'February'})
    writer.writerow({'name': 'Lance', 'dog_breed': 'great dane', 'birth_month': 'November'})

This will produce the following file:

name,dog_breed,birth_month
Muffin,chihuahua,August
Scruffy,border collie,February
Lance,great dane,November

The primary difference when writing a dictionary to CSV is that you need to define the fieldnames parameter. This is because Python needs to know which keys to use in order to retrieve information from your dictionary. It also uses the fieldname keys to writer the first row as column names.

Using Pandas to work with CSV files

pandas is an open-source Python library that provides easy-to-use tools for data analysis. It can be installed directly to your computer or through the Anaconda distribution of Python.

To install pandas using pip, open your command line and type:

$ pip install pandas

Alternatively, to install pandas using Anaconda open your Anaconda Navigator. On the right hand side click Environments > then click the green arrow to the right of base (root) > select Open Terminal. This opens Anaconda’s command line.

In the command line type:

$ pip install pandas

This will install pandas and its dependencies.

Reading CSV files with `pandas`

Pandas is a powerful tool for analysing and working with data. To highlight this, let’s add a few more details about our dogs in a new CSV called dog_breed2.csv.

dog name,breed,birthday,weight_kg
Muffin,chihuahua,22/09/18, 8.4
Scruffy,border collie,07/02/19, 20.3
Lance,great dane,16/11/17, 47.9

Reading a CSV into a pandas DataFrame is really simple. Just three lines of code, in fact. 

import pandas
df = pandas.read_csv('dog_breed2.csv')
print(df)

The code is simple. It opens the CSV, stores it in a DataFrame and then prints the DataFrame as a table:

  dog name          breed  birthday  weight_kg
0   Muffin       chihuaha  22/09/18        8.4
1  Scruffy  border collie  07/02/19       20.3
2    Lance     great dane  16/11/17       47.9

pandas automatically recognizes the first row of the CSV as column names.

As with all data work, you might need to do a bit of cleaning up before you can do jump into analysis. In this instance, pandas is reading the birthday column as a string, rather than a date.

Use the following code to confirm this and make sure your other columns are in the right format:

print(type(df['birthday'][0]))

This is easily fixed by forcing pandas read the data as a date when you load the CSV into your DataFrame.

import pandas
df = pandas.read_csv('dog_breed2.csv', parse_dates=['birthday'])
print(df)

When you use the optional parse_dates parameter Python will output the following:

  dog name          breed  birthday  weight_kg
0   Muffin       chihuaha  22/09/18        8.4
1  Scruffy  border collie  07/02/19       20.3
2    Lance     great dane  16/11/17       47.9

As you can see, our dogs’ birthdays are now formatted as dates.

Another thing you might want to change is the zero-based integer indices in the far left-hand column. We can do this by adding another optional parameter to pandas read_csv function.

import pandas
df = pandas.read_csv('dog_breed2.csv', index_col='dog name', parse_dates=['birthday'])
print(df)

You should now see:

                  breed   birthday  weight_kg
dog name                                     
Muffin         chihuaha 2018-09-22        8.4
Scruffy   border collie 2019-07-02       20.3
Lance        great dane 2017-11-16       47.9

Writing CSV files with pandas

Writing a CSV file with pandas is as easy as reading one.

import pandas
df = pandas.read_csv('dog_breed2.csv', 
            index_col='dog name', 
            parse_dates=['birthday'],
            header=0, 
            names=['dog name', 'breed', 'birthday', 'weight_kg'])
df.to_csv('dog_breed2_new.csv')

This is very similar to the code used to read our CSV, except we need to denote the names of the columns (dog name, breed etc.) and the name of our new CSV (dog_breed2_new.csv)

The output of this code will be:

dog name,breed,birthday,weight_kg
Muffin,chihuaha,2018-09-22,8.4
Scruffy,border collie,2019-07-02,20.3
Lance,great dane,2017-11-16,47.9

Conclusion

CSV are a very common method of storing data and because of this, knowing how to work with CSV files in Python is an excellent skill to have.

July 8, 2019

RELATED POSTS