Appendix: Data Wrangling
In this notebook, we will focus on loading different types of data files. Other aspects of ‘wrangling’ such as combining different datasets will be covered in future tutorials, and are explored in the assignments.
Note: Throughout this notebook, we use ! to run the shell command cat, which prints out the contents of the example data files.
Python I/O#
Let’s start with basic Python utilities for reading and loading data files.
# Check out an example data file
!cat files/data.txt
The command "cat" is either misspelled or
could not be found.
# First, explicitly open the file object for reading
file_obj = open('files/data.txt', 'r')
# You can then loop through the file object, grabbing each line of data
for line in file_obj:
    # Here we explicitly remove the new line marker at the end of each line (the '\n')
    print(line.strip('\n'))
# File objects then have to be closed when you are finished with them
file_obj.close()
First line of data
Second line of data
Since opening and closing a file almost always go together, Python provides a shortcut that does both: the with keyword.
When a file is opened using with, it is automatically closed at the end of the code block.
# Use 'with' keyword to open, read, and then close a file
with open('files/data.txt', 'r') as file_obj:
    for line in file_obj:
        print(line.strip('\n'))
First line of data
Second line of data
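To check that with really does close the file, one can inspect the closed attribute of the file object. The following is a minimal sketch: it writes its own throwaway file so that it does not depend on files/data.txt (the file name example.txt and the temporary directory are assumptions made for illustration).

```python
import os
import tempfile

# Create a small throwaway file (hypothetical name, not part of the notebook's files/ folder)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'example.txt')
with open(path, 'w') as file_obj:
    file_obj.write('First line of data\nSecond line of data\n')

# Read the file back; inside the block the file is still open
with open(path, 'r') as file_obj:
    was_open_inside = not file_obj.closed
    lines = [line.rstrip('\n') for line in file_obj]

# After the block ends, the file has been closed automatically
print(was_open_inside)
print(file_obj.closed)
print(lines)
```

Note that the name file_obj still exists after the block; only the underlying file has been closed.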
Reading data files with the input / output functionality of the Python standard library is a fairly ‘low level’ approach. You often have to spell out yourself how a file is organized and how each part of it should be read. For example, in the simple example above, we had to remove the new line character from each line explicitly.
As long as your data files are reasonably well structured and use standardized file types, you can use higher-level functions that take care of many of these details, for example by loading data straight into pandas data objects.
File types#
There are many different file types in which data may be stored.
Here, we will start by examining CSV and JSON files.
CSV Files#
# Let's have a look at a csv file (printed out in plain text)
!cat files/data.csv
The command "cat" is either misspelled or
could not be found.
CSV Files with Python#
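# Python has a module devoted to working with csv's
import csv
# We can read through our file with the csv module
# (the csv documentation recommends opening csv files with newline='')
with open('files/data.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    # Each row comes back as a list of strings
    for row in csv_reader:
        print(', '.join(row))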
1, 2, 3, 4
5, 6, 7, 8
9, 10, 11, 12
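One detail that the printed output hides is that csv.reader returns every value as a string, so numeric data has to be converted explicitly. The sketch below illustrates this. It uses an in-memory string as a stand-in for files/data.csv, and the exact file contents are an assumption based on the values printed above.

```python
import csv
import io

# Stand-in for files/data.csv (assumed contents, based on the printed output)
csv_text = '1,2,3,4\n5,6,7,8\n9,10,11,12\n'

rows = []
for row in csv.reader(io.StringIO(csv_text), delimiter=','):
    # Each row is a list of strings, so convert each value to an integer
    rows.append([int(value) for value in row])

print(rows)
```

This kind of manual conversion is one of the details that higher-level loaders can handle for you.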
CSV Files with Pandas#
# Pandas also has functions to directly load csv data
# Import pandas first, otherwise the name pd is not defined
import pandas as pd
# In IPython / Jupyter, a trailing ? shows the documentation for a function
pd.read_csv?
# Let's read in our csv file (it has no header row, so we pass header=None)
pd.read_csv('files/data.csv', header=None)
As we can see, using Pandas saves us from having to do more work (write more code) to load the file.
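As a rough illustration of what Pandas does for us here, the sketch below loads the same kind of data from an in-memory string and inspects the result. The contents of csv_text are an assumption standing in for files/data.csv, and the example requires pandas to be installed.

```python
import io

import pandas as pd

# Stand-in for files/data.csv (assumed contents)
csv_text = '1,2,3,4\n5,6,7,8\n9,10,11,12\n'

# header=None tells pandas that the first row is data, not column names
df = pd.read_csv(io.StringIO(csv_text), header=None)

# The values are parsed as numbers, with no manual conversion needed
print(df.shape)
print(df.dtypes)
total = int(df.to_numpy().sum())
print(total)
```

With header=None, the columns are simply labeled with integers starting from 0.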
JSON Files#
# Let's have a look at a json file (printed out in plain text)
!cat files/data.json
{
"firstName": "John",
"age": 53
}
# Think of json's as similar to dictionaries
# (the age is a number in the json file, so we keep it a number here too)
d = {'firstName': 'John', 'age': 53}
print(d)
{'firstName': 'John', 'age': 53}
JSON Files with Python#
# Python also has a module for dealing with json
import json
# Load a json file
with open('files/data.json') as dat_file:
    dat = json.load(dat_file)
# Check what data type this gets loaded as
print(type(dat))
<class 'dict'>
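The json module also converts the values inside the file to matching Python types: JSON strings become str and JSON numbers become int or float. The sketch below shows this using json.loads on an in-memory string that stands in for files/data.json (the string is an assumption based on the file contents printed above).

```python
import json

# Stand-in for the contents of files/data.json
json_text = '{"firstName": "John", "age": 53}'

# json.loads parses a string, whereas json.load reads from a file object
dat = json.loads(json_text)

# Values keep their types: the name is a str, the age is an int
print(type(dat['firstName']))
print(type(dat['age']))

# json.dumps converts the dictionary back into a json formatted string
round_trip = json.dumps(dat)
print(round_trip)
```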
JSON Files with Pandas#
# Pandas also has support for reading in json files
pd.read_json?
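# You can read in json formatted strings with pandas
# Recent pandas versions deprecate passing a literal json string, so we wrap it in StringIO
# Note that here I am specifying to read it in as a pd.Series, as there is a single line of data
from io import StringIO
pd.read_json(StringIO('{ "first": "Alan", "place": "Manchester"}'), typ='series')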
first Alan
place Manchester
dtype: object
# Read in our json file with pandas (passing the path lets pandas open and close the file)
pd.read_json('files/data.json', typ='series')
firstName John
age 53
dtype: object
Conclusion#
As a general guideline, storing data in standardized file types and loading them with ‘higher-level’ tools such as Pandas makes data files easier to work with.