Data Analysis Course Week 3

Posted on

This week I’m working on combining and transforming data with details on life expectancy and GDP from the World Bank. The aim is to see if people in richer countries have a longer life expectancy.

The data you download from the World Bank is extensive, so to make it easier to work with, I’ll be setting up a new dataframe using Pandas. The code gdp = DataFrame(columns=['Country', 'GDP (US$)'], data=[['UK', 2678454886796.7], ['USA', 16768100000000.0], ['China', 9240270452047.0], ['Brazil', 2245673032353.8], ['South Africa', 366057913367.1]]) would store a new dataframe as gdp and add the column titles 'Country' and 'GDP (US$)' to it, storing the data on GDP for the UK, USA, China, Brazil and South Africa. To make this easier to write, you could store the column names and the data in variables (with the GDP records in a list):

table = [
   ['UK', 2678454886796.7],
   ['USA', 16768100000000.0],
   ['China', 9240270452047.0],
   ['Brazil', 2245673032353.8],
   ['South Africa', 366057913367.1]
headings = ['Country', 'GDP (US$)']
gdp = DataFrame(columns=headings, data=table)

There was a small task to check I  knew how to form a new dataframe, and then the course moved on to defining functions.

When trying to compare data it sometimes helps to use rounded figures. It makes large numbers easier to read when accuracy of the number isn’t an issue. To round a number, you can use the following code:

def roundToMillions (value):
    result = round(value / 1000000)
    return result

This defines (def) a function called roundToMillions that takes a number that you specify, divides it by a million, and rounds it to the nearest whole number with the in-built Python function round().

Some of the data I downloaded was given in USD$. To convert this to GBP£ I’d need the following code:

def usdToGbp (usd):
    return usd / 1.564768

This defines (def) a function called usdToGbp that takes a number that you specify and divides it by the conversion rate (which was provided by the course and is an average rate for 2013).

The course encourages you to test your function with both expected and unexpected (such as using 0 or negative numbers) values to see how it copes.

I was then asked to do a small task in week 3’s exercise notebook that included writing my own function and testing the ones already provided.

For the next section, it was pointed out that some of the country names were not the same (e.g. UK as United Kingdom), so I needed to write another function to fix this. As it wasn’t all country names, the course suggested a function that only corrected the incorrect entries and left all others the same using a conditional statement.

Once written, I had to apply these new functions to the dataframes. You apply functions to a dataframe column using apply(). So:

column = gdp['Country']

Would turn all references to the incorrect names specified in my function, and correct them using the new names I’d supplied. And I could make this a new column entirely by using: gdp['Country name'] = column.apply(expandCountry).

I can also apply multiple functions at a time using method chaining as so:

column = gdp['GDP (US$)']
result = column.apply(usdToGbp).apply(roundToMillions)
gdp['GDP (£m)'] = result

From here the course introduces merging dataframes together, linked by their common records: in this case the country names. To do this, the tables must be joined with the merge() function. The code merge(gdp, life, on='Country name', how='left') joins the dataframes gdp and life based on their common column 'Country name' from the  gdp dataframe (the left dataframe). You can change how='left' to how='right' to join based on the 'Country name' from the life dataframe (the right dataframe). You can also include all the countries from the 'Country name' column from both dataframes with a how='outer', or only the entries that both dataframes have in common with how='inner'.

The course then went into constants and suggested I set up commonly recurring column names as variables so that, should they need to be renamed in future, I could do so easily without having to change many lines of code – only the constant would need updating.

The next section concerned getting data from the World Bank. Pandas could do this without me downloading anything provided I knew the unique indicator for the particular data set I wanted. The only problem with this is that the index had already been set. Python does allow you to reset the index however, using the .reset_index() method.

YEAR = 2013
data = download(indicator=LIFE_INDICATOR, country='all', 
	      start=YEAR, end=YEAR)
life = data.reset_index()

The above code sets the unique indicator for the data we’re after as the constant LIFE_INDICATOR, sets the variable data to download that data set for all countries in the year stored in the constant YEAR. The code then resets the index for that dataframe and displays the first 5 rows.

The data from the World Bank includes a lot of groups of countries which we want removed. To do this the code [n:m] is used (where n is the lowest row you want to reference, and m is the highest) after a dataframe. So, gdp[0:3] would show you the first 4 rows (0, 1, 2 and 3) from the gdp dataframe. You could leave out the m to display all rows from point n onwards. In this case, the list of individual countries starts in row number 34 so gdp[34:] would display the data I’m interested in.

The course then went into correlation by using the Spearman rank correlation coefficient. The pandas module doesn’t have access to something that can do this, but the scipy module does.

from scipy.stats import spearmanr

gdpColumn = gdpVsLife[GDP]
lifeColumn = gdpVsLife[LIFE]
(correlation, pValue) = spearmanr(gdpColumn, lifeColumn)
print('The correlation is', correlation)
if pValue < 0.05:
	print('It is statistically significant.')
        print('It is not statistically significant.')

The above code will firstly import spearmanr from scipy. Next it sets up the variables gdpColumn and lifeColumn with columns from the gdpVsLife dataframe, and then sets up two more variables to store the spearmanr results performed on the gdpColumn and lifeColumn variables. The code then prints The correlation is followed by the value stored in correlation, and, depending on if the value in pValue is below 0.05 or not, prints It is statistically significant or It is not statistically significant.

The course made sure to point out that just because the data is statistically significant, it does not mean one causes the other – just that they are related. Correlation does not equal causation, as the saying goes.

It is also possible to see related data in other ways such as with scatterplot graphs. The code gdpVsLife.plot(x=GDP, y=LIFE, kind='scatter', grid=True, logx=True, figsize = (10, 4)) plots a graph of the gdpVsLife dataframe with an X axis using the GDP constant and a Y axis of the LIFE constant (x=GDP, y=LIFE),  of the graph type scatter (kind='scatter'), showing the grid lines (grid=True), with a logarithmic x axis (logx=True), that is 10 by 4 units (figsize = (10, 4)).

The project for this week was to extend the downloaded project notebook, adding in extra data sources to see if healthcare expenditure per capita, or GDP per capita had more of an effect on life expectancy than total GDP did. You can see a copy of my notebook here.


These terms are written as I understand them at the time of writing this blog. I may come to expand on them, or change them completely as I learn more about programming. You can find an up-to-date list of the terms on my programming terms page.

Combining data – Using multiple data sets to improve the accuracy of your results, or to expand the number of things you can compare. Aggregating data requires the data sets to have a single point in common, for example a date range. Also known as aggregating data.

Conditional Statement – Where the computer performs a task when one or more conditions are met. An ‘if else’ is a conditional statement e.g. if condition, then statement, else if condition; then statement; else condition, then statement. You don’t have to use the else if parts, you could just use if alone. To give a real life example for making tea: if you want tea: make tea; if you want sugar: add sugar; if you want milk: add milk; else don’t make tea. If you added an else if to this, you’d only have milk or sugar if you did not make tea, which is a bit odd! These are also referred to as conditional expressions or conditional constructs.

Constant – A value that is not supposed to change and therefore cannot be modified by the program itself. These values are said to be hard-coded and can only be adjusted by rewriting the code.

Correlation – The technique used to show if and how strongly variables are related. For example, taller people tend to be heavier than shorter people, but there are always going to be exceptions to this for very over-weight people.

Dataframe – See the programming terms page.

Function – See the programming terms page.

Gross Domestic Product (GDP) – The size a country’s economy calculated by working out how much it produces including taxes, minus subsidies.

Hard-coded – See the programming terms page.

Index – See the programming terms page.

Join – A term used for merging separate tables of data based on a common column without duplicating that column.

Method Chaining – Applying multiple methods at the same time – this allows you to run them as a single statement without storing any variables that would be generated until all methods have run. Method chaining only works if the methods involved return the same type of value, in the same way that you can chain in maths: 3+4-5 – each part one takes two numbers and returns a number that is used by the next section of the chain This is an object-oriented programming term.

Object-oriented Programming – See the programming terms page.

Spearman rank correlation coefficient – Using a rank to describe the relationship between data that has been sorted from smallest to largest, and a number known as a p-value to determine if this is a significant relationship. Data that increases whilst related data also increases is ranked as a positive number; data that shrinks while the other related data grows is ranked as a negative number; and data that has no relationship at all is ranked as 0. P-values below 0.05 are considered significant and so anything given these values can be said to have a strong relationship.

Transforming data – The process of taking data that is in the incorrect unit and making it the same as the rest of your data. When aggregating data, some records may have different units of measurement even though they represent the same thing. For example, temperature measured in both Celsius and Fahrenheit.

Variable – See the programming terms page.