
Data Analysis Course Week 2 (Part 1)

Week 2 of the course started with a video on dirty data and the method of ensuring that data can be used: data cleaning. Before the video even went into details I knew that this would be the long and boring part of data analysis. It’s the same with anything really: you can’t get to work on something until all of the errors are removed, and removing errors is usually time consuming because you have to spot the errors first, they’re often hard to spot, and the process usually can’t be automated.

Whereas last week I was looking at tuberculosis cases with data supplied by the WHO, this week I’d also be working with weather data from Weather Underground. This appealed to the little girl in me who really enjoyed her geography classes.

I’d be working with data stored in a CSV file, so I needed to use the pandas module in Jupyter. Once I’d imported a CSV file, I’d need to narrow down the results – dataframes can have hundreds or thousands of rows, so it is often not practical to display a whole dataframe. You can do this in a number of ways, but the course recommended these methods (see the sketch after this list):

  • irow() – Displays a specific row based on the index number provided.
  • head() – Returns either the first 5 or the first specified number of records.
  • tail() – Returns either the last 5 or the last specified number of records.
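
As a minimal sketch of these (the filename below is a placeholder rather than necessarily the one from the course; note too that irow() has since been removed from pandas, with .iloc[] as its modern replacement):

from pandas import read_csv

df = read_csv('WHO_POP_TB.csv')   # hypothetical filename

df.head()     # the first 5 rows
df.head(10)   # the first 10 rows
df.tail(3)    # the last 3 rows
df.iloc[2]    # the row at index 2; in 2015-era pandas this was df.irow(2)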

So far in the course, when I wanted to see the data stored in a particular column of the table I was working with, I’d used the Series data type, and this would show the records without a column heading. The course now suggested that if I wanted to show several columns with their headings, but not the whole table, I’d need to pass in a List of column names instead, which returns a smaller dataframe. Both data types support most of the same methods too.

For example, if I wanted to display the first 5 rows of the columns Country and Population (1000s) from a table stored under the variable df, I’d use df[['Country', 'Population (1000s)']].head().

   Country      Population (1000s)
0  Afghanistan  30552
1  Albania      3173
2  Algeria      39208
3  Andorra      79
4  Angola       21472

If I only wanted to display the first 5 rows of the column Population (1000s), without the column heading, I’d use df['Population (1000s)'].head().

0     30552
1     3173
2     39208
3     79
4     21472
Name: Population (1000s), dtype: float64

At this point I got to play around with the code myself, and completed three small tasks on the notebook that I downloaded for week 2.

The course then introduced comparison operators. Python has access to several of these:

  • == – equals
  • != – not equal
  • < – less than
  • > – greater than
  • <= – less than or equal to
  • >= – greater than or equal to

You can apply these operators to the columns of a dataframe, and the results will only show the rows where the comparison is true. So, if I wanted to see the table that I have stored as df, but only the rows for countries with a population over 80 million (the column counts in thousands, so the threshold is 80000), I’d use df[df['Population (1000s)'] > 80000].

     Country                   Population (1000s)  TB deaths
13   Bangladesh                156595              80000
23   Brazil                    200362              4400
36   China                     1393337             41000
53   Egypt                     82056               550
58   Ethiopia                  94101               30000
65   Germany                   82727               300
77   India                     1252140             240000
78   Indonesia                 249866              64000
85   Japan                     127144              2100
109  Mexico                    122332              2200
124  Nigeria                   173615              160000
128  Pakistan                  182143              49000
134  Philippines               98394               27000
141  Russian Federation        142834              17000
185  United States of America  320051              490
190  Vietnam                   91680               17000

I was given one small task to complete in my notebook to reinforce what I’d learnt in this section.

My next topic was bitwise operators. The course covered two of these:

  • & – and
  • | – or

As with comparison operators, when these are applied to the columns of a dataframe, the results will only show the rows where the combined condition is true.

Because expressions involving operators can often be long and complex, the course recommends storing individual columns as variables so they can be referred to more easily, as in the sketch below.
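
A rough sketch of combining conditions this way (reusing the WHO table and the hypothetical filename from earlier):

from pandas import read_csv

df = read_csv('WHO_POP_TB.csv')   # hypothetical filename

# Store the columns as variables to keep the expressions readable.
population = df['Population (1000s)']
deaths = df['TB deaths']

# Countries with over 80 million people AND more than 10000 TB deaths.
df[(population > 80000) & (deaths > 10000)]

# Countries with under 100 thousand people OR over 1 billion.
df[(population < 100) | (population > 1000000)]

Each comparison has to be wrapped in its own parentheses, because & and | bind more tightly than < and > in Python.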

One small exercise was given to show I understood this section.

From this point I got to work with weather data and was asked to download data for London in 2014. My first task was to clean the data to ensure I could analyse it properly. I began with rogue spaces. I could do this by opening the CSV file in Excel and using Find & Replace to get rid of them all, but pandas offers the ability to ignore white space in most places with the addition of skipinitialspace=True to the read_csv() function that you use to load a CSV file with pandas. So, you’d use london = read_csv('London_2014.csv', skipinitialspace=True) to store the table in a dataframe named london. The skipinitialspace parameter does not remove white space at the end of column names – you’d need to rename the columns to remove this.

Another feature offered by pandas is the ability to strip unwanted characters from data. In the Weather Underground data I downloaded, there were lots of references to <br /> which they used to separate rows in their data.

For the column I’d have to rename it using london = london.rename(columns={'WindDirDegrees<br />':'WindDirDegrees'}). This finds the column named 'WindDirDegrees<br />', renames it to 'WindDirDegrees', and overwrites the table stored as london with the updated table.

For the rows themselves I could use the rstrip() method. So, I’d type london['WindDirDegrees'] = london['WindDirDegrees'].str.rstrip('<br />'), which strips the term <br /> from the rear (hence the r in .rstrip) of the strings (str) stored in the cells of the column named WindDirDegrees. Again, you could do this manually via Excel.
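
A sketch of both clean-up steps together. One caveat worth knowing: rstrip() treats its argument as a set of characters to strip rather than a literal substring, which happens to be harmless here, but str.replace() is the safer general-purpose choice:

from pandas import read_csv

london = read_csv('London_2014.csv', skipinitialspace=True)

# Fix the column heading...
london = london.rename(columns={'WindDirDegrees<br />': 'WindDirDegrees'})

# ...then strip the stray '<br />' from the values themselves.
london['WindDirDegrees'] = london['WindDirDegrees'].str.rstrip('<br />')

# Equivalent, and safer if the unwanted text could appear mid-string:
# london['WindDirDegrees'] = london['WindDirDegrees'].str.replace('<br />', '')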

Pandas also offers the ability to find null values using isnull(). If you find null values, you can then use fillna() to fill them all in with whatever you like. So london['Events'] = london['Events'].fillna('') would fill all null cells in the 'Events' column with an empty string (normally you’d put text between the two quotation marks, but as we don’t want anything in these cells, we simply use two quotation marks with nothing between them). When dealing with numeric columns you don’t want to fill with the number 0, as those zeroes would be included in any calculations that you do.

Another way to deal with missing values is to ignore the rows containing them. The dropna() method returns a new dataframe where all rows with at least one not-available value have been removed.

Although replacing null values can be useful, sometimes you don’t have to replace them. Most programs – including the pandas module in Jupyter – ignore them when computing numeric statistics, e.g. when using sum(), median() and other similar methods.
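
A minimal sketch of the three approaches, using the london dataframe from above (the 'Events' and 'Mean TemperatureC' column names are from my Weather Underground download and may differ in yours):

# Which cells in the 'Events' column are null?
london['Events'].isnull()

# Fill the null values with an empty string so they no longer show as NaN.
london['Events'] = london['Events'].fillna('')

# Or build a new dataframe without any row that has a null value.
london_complete = london.dropna()

# Numeric statistics skip nulls anyway, so these work either way.
london['Mean TemperatureC'].sum()
london['Mean TemperatureC'].median()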

Sometimes your columns may have the wrong data type, so you might want to change it. To check what data type each column has in pandas, you use the .dtypes attribute. This prints a list of all columns and the type their contents are stored as. In my CSV file, 'GMT' and 'WindDirDegrees' were stored incorrectly, so I’d need to change those.

The 'WindDirDegrees' column is meant to store the direction the wind is coming from in degrees. As degrees are whole numbers, and we’d probably need to compare rows by wind direction, an integer (int) data type (int64 in pandas) is appropriate here. To change this you’d use london['WindDirDegrees'] = london['WindDirDegrees'].astype('int64'), which finds the column 'WindDirDegrees' and sets its data type to int64 using .astype().

The 'GMT' column is meant to store dates, but does so as strings of numbers. Pandas has a function, to_datetime(), that converts dates stored as strings to a proper date type called datetime64. To do this, you’d use london['GMT'] = to_datetime(london['GMT']), which applies the to_datetime() function to the column 'GMT' (converting the strings and changing the column’s data type to datetime64) and overwrites the column stored in london with the converted one.
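
A sketch of checking and fixing both types, assuming the earlier '<br />' clean-up has already been done (the exact dtypes listed will depend on the CSV):

from pandas import to_datetime

london.dtypes   # lists every column and its current data type

# Degrees are whole numbers, so store them as 64-bit integers...
london['WindDirDegrees'] = london['WindDirDegrees'].astype('int64')

# ...and convert the date strings in 'GMT' to the datetime64 type.
london['GMT'] = to_datetime(london['GMT'])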

Note that because you’ve changed the 'GMT' column to use the datetime64 data type, you can’t match its contents using a string. So london[london['GMT'] == '01/03/2014'] would not work, as the content between the quotation marks is a string, even though to your eye it is formatted exactly like the CSV. Instead you’d use london[london['GMT'] == datetime(2014, 3, 1)] to find all rows where 'GMT' matches the 1st of March 2014. You can also use london[(london['GMT'] >= datetime(2014, 12, 8)) & (london['GMT'] <= datetime(2014, 12, 12))] to find all the rows with dates between December 8th 2014 and December 12th 2014.
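
The datetime constructor here comes from Python’s standard library, so those comparisons need one extra import to run:

from datetime import datetime

# All rows for the 1st of March 2014.
london[london['GMT'] == datetime(2014, 3, 1)]

# All rows from the 8th to the 12th of December 2014, inclusive.
london[(london['GMT'] >= datetime(2014, 12, 8)) &
       (london['GMT'] <= datetime(2014, 12, 12))]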

I had three small tasks to complete in my notebook to show that I understood how to use these different ways of adjusting my data in pandas.

At this point it is quite late in the evening and I’m going climbing with Tom. I’ll stop here for now, as the course is about to change topics, and I’ll finish off tomorrow.

Terms

These terms are written as I understand them.

Attribute
Bitwise operator
Comma-separated values (CSV)
Comparison operator
Dataframe
Data type
Dirty data
Data cleaning/cleansing
Dot notation
Expression
Function
Index
Integer (int)
Metadata
Method
Null value
Object
Operator
Plain text
Pseudo code
String (str)
Variable
White space

Posted 4th November 2015 (updated 16th August 2020) in Coding and Programming, Data Analysis.
