Data Analysis Course Week 3

This week I’m working on combining and transforming data, using World Bank figures on life expectancy and GDP. The aim is to see whether people in richer countries have a longer life expectancy.

The data you download from the World Bank is extensive, so to make it easier to work with, I’ll be setting up a new dataframe using pandas. The code gdp = DataFrame(columns=['Country', 'GDP (US$)'], data=[['UK', 2678454886796.7], ['USA', 16768100000000.0], ['China', 9240270452047.0], ['Brazil', 2245673032353.8], ['South Africa', 366057913367.1]]) stores a new dataframe as gdp with the column titles 'Country' and 'GDP (US$)', holding the GDP figures for the UK, USA, China, Brazil and South Africa. To make this easier to write, you can store the column names and the data in variables (with the GDP records in a list of lists):

from pandas import DataFrame

table = [
    ['UK', 2678454886796.7],
    ['USA', 16768100000000.0],
    ['China', 9240270452047.0],
    ['Brazil', 2245673032353.8],
    ['South Africa', 366057913367.1]
]
headings = ['Country', 'GDP (US$)']
gdp = DataFrame(columns=headings, data=table)

There was a small task to check I knew how to form a new dataframe, and then the course moved on to defining functions.

When comparing data it sometimes helps to use rounded figures: large numbers are easier to read when precision isn’t an issue. To round a number to the nearest million, you can use the following code:

def roundToMillions(value):
    # Divide by a million and round to the nearest whole number
    result = round(value / 1000000)
    return result

This defines (def) a function called roundToMillions that takes the number you specify, divides it by a million, and rounds the result to the nearest whole number with Python’s built-in round() function.

Some of the data I downloaded was given in US dollars. To convert this to British pounds I’d need the following code:

def usdToGbp(usd):
    # Convert US dollars to British pounds at the 2013 average rate
    return usd / 1.564768

This defines (def) a function called usdToGbp that takes the number you specify and divides it by the conversion rate (provided by the course as an average dollars-per-pound rate for 2013).

The course encourages you to test your function with both expected and unexpected (such as using 0 or negative numbers) values to see how it copes.
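A few quick checks along those lines, repeating the two functions from above (the expected values are my own workings, not from the course):

    def roundToMillions(value):
        return round(value / 1000000)

    def usdToGbp(usd):
        return usd / 1.564768

    # An expected value, then the unexpected ones: zero and a negative
    print(roundToMillions(4567890))   # a little under 5 million rounds to 5
    print(roundToMillions(0))         # 0 still works
    print(roundToMillions(-1000000))  # negative numbers round too: -1
    print(usdToGbp(1.564768))         # exactly one pound: 1.0

Both functions cope fine with zero and negative inputs, which is what you want before applying them to a whole column.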

I was then asked to do a small task in week 3’s exercise notebook that included writing my own function and testing the ones already provided.

For the next section, it was pointed out that some country names differ between data sets (e.g. UK versus United Kingdom), so I needed to write another function to fix this. As only some names were affected, the course suggested a function that corrects the incorrect entries and leaves all others unchanged, using a conditional statement.
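My version looked roughly like this (the name expandCountry and the two substitutions shown are just the cases my data needed; yours may differ):

    def expandCountry(name):
        # Expand the abbreviated names; leave everything else unchanged
        if name == 'UK':
            return 'United Kingdom'
    elif name == 'USA':
        return 'United States'
    else:
        return name

The else branch is what makes it safe to apply to the whole column: names that are already correct pass through untouched.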

Once written, I had to apply these new functions to the dataframes. You apply a function to every value in a dataframe column using apply(). So:

column = gdp['Country']
column.apply(expandCountry)
column

would return a new column in which the incorrect names specified in my function are replaced with the names I’d supplied (apply() doesn’t change the original column; it returns a new one). I could store this as a new column in the dataframe using: gdp['Country name'] = column.apply(expandCountry).

I can also apply multiple functions at a time using method chaining, like so:

column = gdp['GDP (US$)']
result = column.apply(usdToGbp).apply(roundToMillions)
gdp['GDP (£m)'] = result
gdp

From here the course introduces merging dataframes, linked by their common records: in this case the country names. To do this, the tables are joined with the merge() function. The code merge(gdp, life, on='Country name', how='left') joins the dataframes gdp and life on their common column 'Country name', keeping all the rows from gdp (the left dataframe). You can change how='left' to how='right' to keep all the rows from life (the right dataframe). You can also include the countries from both dataframes with how='outer', or only the entries that both dataframes have in common with how='inner'.
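A small, self-contained illustration of the four join types (these miniature dataframes and the figures in them are made up for the example):

    from pandas import DataFrame, merge

    gdp = DataFrame({'Country name': ['United Kingdom', 'United States', 'China'],
                     'GDP (£m)': [1711746, 10716026, 5905202]})
    life = DataFrame({'Country name': ['United Kingdom', 'China', 'Brazil'],
                      'Life expectancy': [81.0, 75.7, 74.1]})

    # Keep every row of gdp; countries missing from life get NaN
    left = merge(gdp, life, on='Country name', how='left')

    # Keep only the countries present in both dataframes
    inner = merge(gdp, life, on='Country name', how='inner')
    print(inner)

Here left has three rows (the United States gets a NaN life expectancy) while inner has only the two countries both tables share; outer would have all four.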

The course then went into constants and suggested I set up commonly recurring column names as variables so that, should they need renaming in future, only the constant would need updating rather than many lines of code.
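For instance (the exact column names are whatever your dataframes use; these are just the ones from this post):

    from pandas import DataFrame

    # Column-name constants: if a heading ever changes, only these lines change
    COUNTRY = 'Country name'
    GDP = 'GDP (£m)'

    gdpVsLife = DataFrame({COUNTRY: ['United Kingdom'], GDP: [1711746]})
    print(gdpVsLife[GDP])  # every lookup goes through the constant

Every later lookup, merge or plot then refers to COUNTRY or GDP rather than repeating the string literal.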

The next section concerned getting data directly from the World Bank. pandas could do this without me downloading anything, provided I knew the unique indicator code for the particular data set I wanted. The only wrinkle is that the downloaded data arrives with the country and year already set as the index; Python does let you turn these back into ordinary columns, using the .reset_index() method.

# download is the World Bank interface; in current pandas it lives
# in the separate pandas-datareader package
from pandas_datareader.wb import download

YEAR = 2013
LIFE_INDICATOR = 'SP.DYN.LE00.IN'
data = download(indicator=LIFE_INDICATOR, country='all',
                start=YEAR, end=YEAR)
life = data.reset_index()
life.head()

The above code sets the unique indicator for the data we’re after as the constant LIFE_INDICATOR, then downloads that data set for all countries for the year stored in the constant YEAR and stores it in the variable data. The code then resets the index for that dataframe and displays the first five rows.

The data from the World Bank includes a lot of groups of countries, which we want removed. To do this the slice [n:m] is used after a dataframe, where n is the first row you want and m is the row after the last one (the end of a slice is exclusive). So gdp[0:3] would show you the first three rows (0, 1 and 2) of the gdp dataframe. You can leave out m to display all rows from row n onwards. In this case, the list of individual countries starts at row number 34, so gdp[34:] would display the data I’m interested in.
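A quick check of that exclusive end, using the small gdp dataframe from earlier in the post:

    from pandas import DataFrame

    gdp = DataFrame(columns=['Country', 'GDP (US$)'], data=[
        ['UK', 2678454886796.7],
        ['USA', 16768100000000.0],
        ['China', 9240270452047.0],
        ['Brazil', 2245673032353.8],
        ['South Africa', 366057913367.1]
    ])

    print(gdp[0:3])  # rows 0, 1 and 2 only; row 3 is excluded
    print(gdp[3:])   # everything from row 3 onwards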

The course then went into correlation, using the Spearman rank correlation coefficient. The pandas module doesn’t have a function that also reports the significance of the result, but the scipy module does.

from scipy.stats import spearmanr

gdpColumn = gdpVsLife[GDP]
lifeColumn = gdpVsLife[LIFE]
(correlation, pValue) = spearmanr(gdpColumn, lifeColumn)
print('The correlation is', correlation)
if pValue < 0.05:
    print('It is statistically significant.')
else:
    print('It is not statistically significant.')

The above code first imports spearmanr from scipy. Next it sets up the variables gdpColumn and lifeColumn with columns from the gdpVsLife dataframe, then stores the two values spearmanr returns for those columns: the coefficient and the p-value. The code prints The correlation is followed by the value stored in correlation and, depending on whether the value in pValue is below 0.05, prints It is statistically significant. or It is not statistically significant.

The course made sure to point out that even when a correlation is statistically significant, it does not mean one variable causes the other – just that they are related. Correlation does not equal causation, as the saying goes.

It is also possible to see related data in other ways, such as with scatter plots. The code gdpVsLife.plot(x=GDP, y=LIFE, kind='scatter', grid=True, logx=True, figsize=(10, 4)) plots the gdpVsLife dataframe with the GDP constant on the x axis and the LIFE constant on the y axis (x=GDP, y=LIFE), as a scatter plot (kind='scatter'), showing the grid lines (grid=True), with a logarithmic x axis (logx=True), at 10 by 4 inches (figsize=(10, 4)).

The project for this week was to extend the downloaded project notebook, adding in extra data sources to see if healthcare expenditure per capita, or GDP per capita had more of an effect on life expectancy than total GDP did. You can see a copy of my notebook here.

Terms

These terms are written as I understand them.
