Week 2 of the course started with a video on dirty data and the method of ensuring that data can be used: data cleaning. Before the video even went into details I knew that this would be the long, boring part of data analysis. It’s the same with anything really: you can’t get to work on something until all of the errors are removed, and removing errors is usually time-consuming because a) you have to spot the errors, and b) those errors are usually hard to spot, and finding them usually can’t be automated.
Whereas last week I was looking at tuberculosis cases with data supplied by the WHO, this week I’d also be working with weather data from Weather Underground. This appealed to the little girl in me who really enjoyed her geography classes.
I’d be working with data stored in a CSV file, and so I needed to use the `pandas` module in Jupyter. Once I’d imported a CSV file, I’d need to narrow down the results – dataframes can have hundreds or thousands of rows, so it is often not practical to display a whole dataframe. You can do this in a number of ways, but the course recommended these methods:

- `irow()` – displays a specific row, based on the index number provided.
- `head()` – returns either the first 5 or the first specified number of records.
- `tail()` – returns either the last 5 or the last specified number of records.
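A quick sketch of these on a made-up table (values are illustrative; note that `irow()` has since been removed from `pandas`, with `iloc[]` as its modern replacement):

```python
import pandas as pd

# A tiny stand-in for the WHO data used in the course (values illustrative).
df = pd.DataFrame({
    'Country': ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola', 'Argentina'],
    'Population (1000s)': [30552, 3173, 39208, 79, 21472, 41446],
})

first_three = df.head(3)   # first 3 rows
last_two = df.tail(2)      # last 2 rows
row_four = df.iloc[4]      # the row at index 4 (iloc[] replaces the older irow())
```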
So far in the course, when I wanted to see the data stored in a particular column of the table I was working with, I’d used the `Series` data type, and this would show the records without a column heading. The course now explained that if I wanted to show several columns with their headings, but not the whole table, I’d need to pass a list of column names, which returns a dataframe rather than a series. Both data types share most of the same methods too.
For example, if I wanted to display the first 5 rows of the columns `Country` and `Population (1000s)` from a table stored under the variable `df`, I’d use `df[['Country', 'Population (1000s)']].head()`.
|   | Country | Population (1000s) |
|---|---------|--------------------|
| 0 | Afghanistan | 30552 |
| 1 | Albania | 3173 |
| 2 | Algeria | 39208 |
| 3 | Andorra | 79 |
| 4 | Angola | 21472 |
If I only wanted to display the first 5 rows of the column `Population (1000s)`, without the column heading, I’d use `df['Population (1000s)'].head()`.

```
0    30552
1     3173
2    39208
3       79
4    21472
Name: Population (1000s), dtype: float64
```
At this point I got to play around with the code myself, and completed three small tasks on the notebook that I downloaded for week 2.
The course then introduced comparison operators. Python has several of these:

- `==` – equals
- `!=` – not equal
- `<` – less than
- `>` – greater than
- `<=` – less than or equal to
- `>=` – greater than or equal to
You can apply these operators to the columns of a dataframe, and the results will only show the rows where the comparison is true. So, if I wanted to see the entire table that I have stored as `df`, but only showing countries with a population over 80 million (the column is in thousands, so the threshold is 80,000), I’d use `df[df['Population (1000s)'] > 80000]`.
|     | Country | Population (1000s) | TB deaths |
|-----|---------|--------------------|-----------|
| 13  | Bangladesh | 156595 | 80000 |
| 23  | Brazil | 200362 | 4400 |
| 36  | China | 1393337 | 41000 |
| 53  | Egypt | 82056 | 550 |
| 58  | Ethiopia | 94101 | 30000 |
| 65  | Germany | 82727 | 300 |
| 77  | India | 1252140 | 240000 |
| 78  | Indonesia | 249866 | 64000 |
| 85  | Japan | 127144 | 2100 |
| 109 | Mexico | 122332 | 2200 |
| 124 | Nigeria | 173615 | 160000 |
| 128 | Pakistan | 182143 | 49000 |
| 134 | Philippines | 98394 | 27000 |
| 141 | Russian Federation | 142834 | 17000 |
| 185 | United States of America | 320051 | 490 |
| 190 | Vietnam | 91680 | 17000 |
I was given one small task to complete in my notebook to reinforce what I’d learnt in this section.
My next topic was bitwise operators. The course introduced two of these:

- `&` – and
- `|` – or
As with comparison operators, when applied to the columns of a dataframe, the results will only show the rows where the whole expression evaluates as true.
Because expressions involving operators can often be long and complex, the course recommends storing individual columns as variables so they can be referred to more easily.
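A small sketch of that advice, on a made-up subset of the WHO table (each comparison needs its own parentheses when combined with `&` or `|`):

```python
import pandas as pd

# Illustrative subset of the WHO table.
df = pd.DataFrame({
    'Country': ['Brazil', 'China', 'Egypt', 'Germany'],
    'Population (1000s)': [200362, 1393337, 82056, 82727],
    'TB deaths': [4400, 41000, 550, 300],
})

# Storing the columns as variables keeps the combined expressions readable.
population = df['Population (1000s)']
deaths = df['TB deaths']

# & : rows where both conditions are true.
big_and_deadly = df[(population > 100000) & (deaths > 5000)]

# | : rows where either condition is true.
either = df[(population > 1000000) | (deaths < 1000)]
```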
One small exercise was given to show I understood this section.
From this point I got to work with weather data and was asked to download data for London in 2014. My first task was to clean the data to ensure I could analyse it properly. I began with rogue spaces. I could do this by opening the CSV file in Excel and using Find & Replace to get rid of them all, but `pandas` offers the ability to ignore white space in most places with the addition of `skipinitialspace=True` to the `read_csv()` function that you use to read a CSV file with `pandas`. So, you’d use `london = read_csv('London_2014.csv', skipinitialspace=True)` to store the table in a dataframe named `london`. The `skipinitialspace` parameter does not skip white space at the end of column names – you’d need to rename the columns to remove that.
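A self-contained sketch of both steps, using a simulated CSV rather than the real Weather Underground file (the column name is made up for illustration):

```python
import pandas as pd
from io import StringIO

# Simulated CSV: spaces after each comma, and a trailing space in one header.
csv_text = "GMT, Mean TemperatureC ,Events\n2014-1-1, 7,Rain\n"
london = pd.read_csv(StringIO(csv_text), skipinitialspace=True)

# skipinitialspace removed the leading spaces but not the trailing one,
# so the second column is still named 'Mean TemperatureC '.
london = london.rename(columns={'Mean TemperatureC ': 'Mean TemperatureC'})

# Alternatively, strip every header in one go:
london.columns = [name.strip() for name in london.columns]
```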
Another feature offered by `pandas` is the ability to strip unwanted characters from data. In the Weather Underground data I downloaded, there were lots of references to `<br />`, which they used to separate rows in their data.

For the column heading I’d have to rename it using `london = london.rename(columns={'WindDirDegrees<br />': 'WindDirDegrees'})`. This finds the column `'WindDirDegrees<br />'`, renames it to `'WindDirDegrees'`, and overwrites the table stored as `london` with the new table.
For the rows themselves I could use the `rstrip()` method. Typing `london['WindDirDegrees'] = london['WindDirDegrees'].str.rstrip('<br />')` strips `<br />` from the rear (hence the `r` in `rstrip`) of the strings (`str`) stored in the cells of the column named `WindDirDegrees`. (Strictly speaking, `rstrip()` removes any trailing characters that appear in the set you give it, which happens to work here.) Again, you could do this manually via Excel.
`pandas` also offers the ability to find null values using `isnull()`. If you find null values, you can then use `fillna()` to fill them all in with whatever you like. So `london['Events'] = london['Events'].fillna('')` would replace the null values in the `Events` column with an empty string (normally you’d put text between the two quotation marks, but as we don’t want anything in these cells, we simply use an empty pair). When dealing with numbers you don’t want to fill with 0, as those zeros would be included in any calculations that you do.
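A minimal, self-contained sketch of the `isnull()`/`fillna()` pair, with made-up `Events` values:

```python
import pandas as pd

# Hypothetical 'Events' column with missing values, mirroring the weather data.
london = pd.DataFrame({'Events': ['Rain', None, 'Fog-Rain', None]})

missing = london['Events'].isnull()             # True where the value is missing
london['Events'] = london['Events'].fillna('')  # replace each null with an empty string
```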
Another way to deal with missing values is to ignore rows containing them. The `dropna()` method returns a new dataframe where all rows with at least one not-available value have been removed.
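For example, on a made-up table where two of three rows each have a missing value:

```python
import numpy as np
import pandas as pd

# Illustrative table: only row 0 is complete.
df = pd.DataFrame({
    'Max TemperatureC': [10, np.nan, 12],
    'Events': ['Rain', 'Fog', np.nan],
})

# dropna() returns a new dataframe with every incomplete row removed.
complete_rows = df.dropna()
```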
Although replacing null values can be useful, sometimes you don’t have to replace them at all. Most programs – including the `pandas` module in Jupyter – ignore them when computing numeric statistics, i.e. when using `sum()`, `median()` and other similar methods.
Sometimes your columns may have the wrong data type, so you might want to change it. To check the data type of each column in `pandas`, you use the `dtypes` attribute. This prints a list of all columns and the type their contents are stored as. In my CSV file, `'GMT'` and `'WindDirDegrees'` were stored incorrectly, so I’d need to change those.
The `'WindDirDegrees'` column is meant to store the direction that wind is coming from, in degrees. As degrees are numbers, and we’d probably need to compare rows where the wind comes from a specific direction, an integer data type (`int64` in `pandas`) is appropriate here. To change it you’d use `london['WindDirDegrees'] = london['WindDirDegrees'].astype('int64')`, which finds the column `'WindDirDegrees'` and sets its data type to `int64` using `astype()`.
The `'GMT'` column is meant to store dates, and does so as strings of numbers. `pandas` has a function, `to_datetime()`, that converts dates stored as strings to a proper date type called `datetime64`. To do this, you’d use `london['GMT'] = to_datetime(london['GMT'])`, which applies `to_datetime()` to the column `'GMT'` (converting the strings and changing the column’s data type to `datetime64`) and overwrites the table stored as `london` with the new table.
Note that because you’ve changed the `'GMT'` column to the `datetime64` data type, you can’t match its contents against a string. So `london[london['GMT'] == '01/03/2014']` would not work, as the content between the quotation marks is a string, even though to your eye it is formatted exactly like the CSV. Instead you’d use `london[london['GMT'] == datetime(2014, 3, 1)]` to find all rows in the `'GMT'` column that match the 1st of March 2014. You can also use `london[(london['GMT'] >= datetime(2014, 12, 8)) & (london['GMT'] <= datetime(2014, 12, 12))]` to find all the rows with dates between December 8th and December 12th 2014.
I had three small tasks to complete in my notebook to show that I understood how to use these different ways of adjusting my data in `pandas`.
At this point it is quite late in the evening and I’m going climbing with Tom. I’ll stop here for now, as the course is about to change topics, and I’ll finish off tomorrow.
Terms
These terms are written as I understand them.