The data you download from the World Bank is extensive, so to make it easier to work with, I’ll be setting up a new dataframe using Pandas. The code
gdp = DataFrame(columns=['Country', 'GDP (US$)'], data=[['UK', 2678454886796.7], ['USA', 16768100000000.0], ['China', 9240270452047.0], ['Brazil', 2245673032353.8], ['South Africa', 366057913367.1]])
would store a new dataframe as gdp and add the column titles 'Country' and 'GDP (US$)' to it, storing the data on GDP for the UK, USA, China, Brazil and South Africa. To make this easier to write, you could store the column names and the data in variables (with the GDP records in a list of lists):
from pandas import DataFrame
# Each inner list is one row: country, then GDP in US dollars
table = [
    ['UK', 2678454886796.7],
    ['USA', 16768100000000.0],
    ['China', 9240270452047.0],
    ['Brazil', 2245673032353.8],
    ['South Africa', 366057913367.1]
]
headings = ['Country', 'GDP (US$)']
gdp = DataFrame(columns=headings, data=table)
There was a small task to check I knew how to form a new dataframe, and then the course moved on to defining functions.
When trying to compare data it sometimes helps to use rounded figures. Rounding makes large numbers easier to read when exact precision isn’t needed. To round a number to the nearest million, you can use the following code:
def roundToMillions(value):
    result = round(value / 1000000)
    return result
This defines (def) a function called roundToMillions that takes a number that you specify, divides it by a million, and rounds it to the nearest whole number with the in-built Python function round().
Some of the data I downloaded was given in US dollars (USD). To convert this to British pounds (GBP) I’d need the following code:
def usdToGbp(usd):
    return usd / 1.564768
This defines (def) a function called usdToGbp that takes a number that you specify and divides it by the conversion rate (which was provided by the course and is an average rate for 2013).
The course encourages you to test your function with both expected and unexpected (such as using 0 or negative numbers) values to see how it copes.
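As a rough sketch of what that testing could look like, the checks below run both functions against a few expected and unexpected values. The test values are my own choices for illustration, not the ones the course used.

```python
def roundToMillions(value):
    """Divide by one million and round to the nearest whole number."""
    return round(value / 1000000)


def usdToGbp(usd):
    """Convert US dollars to British pounds at the 2013 average rate."""
    return usd / 1.564768


# Expected values
assert roundToMillions(4567890.1) == 5
assert usdToGbp(1.564768) == 1.0

# Unexpected values: zero and negative numbers
assert roundToMillions(0) == 0
assert roundToMillions(-4567890.1) == -5
assert usdToGbp(0) == 0
assert usdToGbp(-1.564768) == -1.0

# A surprise: Python 3 rounds exact halves to the nearest even number
assert roundToMillions(2500000) == 2
assert roundToMillions(3500000) == 4

print('All checks passed.')
```

The last two checks are the sort of thing testing is good for: round() in Python 3 sends 2.5 down to 2 but 3.5 up to 4, which is easy to miss without trying it.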
I was then asked to do a small task in week 3’s exercise notebook that included writing my own function and testing the ones already provided.
For the next section, it was pointed out that some of the country names were not the same in each data set (e.g. UK in one and United Kingdom in the other), so I needed to write another function to fix this. As not every country name needed changing, the course suggested a function that used a conditional statement to correct only the incorrect entries and leave all others the same.
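A function along those lines might look like the sketch below. The name expandCountry matches the function applied later on; the UK replacement comes from the example above, while the 'USA' to 'United States' mapping is my assumption about what else would need fixing.

```python
def expandCountry(name):
    """Return the full country name for a known abbreviation.

    Any name that is not a known abbreviation is returned unchanged.
    """
    if name == 'UK':
        return 'United Kingdom'
    elif name == 'USA':
        # Assumed mapping, added for illustration
        return 'United States'
    else:
        return name


print(expandCountry('UK'))
print(expandCountry('Brazil'))
```

The else branch is what leaves every other entry alone, so the function is safe to run over a whole column.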
Once written, I had to apply these new functions to the dataframes. You apply a function to a dataframe column using the column’s apply() method. The code
column = gdp['Country']
column.apply(expandCountry)
would return a copy of the column with the incorrect names specified in my function replaced by the new names I’d supplied. The original column is left unchanged, so to keep the result I could store it as an entirely new column using:
gdp['Country name'] = column.apply(expandCountry)
I can also apply multiple functions one after another using method chaining, like so:
column = gdp['GDP (US$)']
result = column.apply(usdToGbp).apply(roundToMillions)
gdp['GDP (£m)'] = result
gdp
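To try the apply() calls above end to end, here is a small self-contained sketch. It uses two of the GDP rows from earlier and a cut-down expandCountry that only handles 'UK'; it assumes pandas is installed.

```python
from pandas import DataFrame


def usdToGbp(usd):
    """Convert US dollars to British pounds at the 2013 average rate."""
    return usd / 1.564768


def roundToMillions(value):
    """Divide by one million and round to the nearest whole number."""
    return round(value / 1000000)


def expandCountry(name):
    """Cut-down version: only 'UK' is expanded, other names pass through."""
    if name == 'UK':
        return 'United Kingdom'
    return name


gdp = DataFrame(columns=['Country', 'GDP (US$)'],
                data=[['UK', 2678454886796.7], ['Brazil', 2245673032353.8]])

# apply() returns a new column, so the result is stored under a new name
gdp['Country name'] = gdp['Country'].apply(expandCountry)

# Chained: convert to pounds first, then round the converted value to millions
gdp['GDP (£m)'] = gdp['GDP (US$)'].apply(usdToGbp).apply(roundToMillions)

print(gdp)
```

The order of the chain matters: each apply() works on the output of the one before it, so the rounding happens on the value in pounds rather than the value in dollars.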
From here the course introduces merging dataframes together, linked by their common records: in this case the country names. To do this, the tables must be joined with the merge() function. The code
merge(gdp, life, on='Country name', how='left')
joins the dataframes gdp and life on their common column 'Country name', keeping every country that appears in the gdp dataframe (the left dataframe). You can change this to how='right' to keep every country from the life dataframe (the right dataframe) instead. You can also include all the countries from the 'Country name' column of both dataframes with how='outer', or only the entries that both dataframes have in common with how='inner'.
The course then went into constants and suggested I set up commonly recurring column names as variables so that, should they need to be renamed in future, I could do so easily without having to change many lines of code – only the constant would need updating.
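The sketch below puts the two ideas together: the column names are held in constants, and the same pair of dataframes is merged with each of the four how options to show which countries survive. The numbers are made-up placeholders rather than World Bank figures, and the column title held in LIFE is my assumption. It assumes pandas is installed.

```python
from pandas import DataFrame, merge

# Constants for column names used in more than one place
COUNTRY = 'Country name'
GDP = 'GDP (£m)'
LIFE = 'Life expectancy (years)'  # assumed column title

# Placeholder values only, chosen so the two tables overlap partially
gdp = DataFrame(columns=[COUNTRY, GDP],
                data=[['United Kingdom', 1], ['Brazil', 2], ['China', 3]])
life = DataFrame(columns=[COUNTRY, LIFE],
                 data=[['United Kingdom', 10], ['Brazil', 20], ['India', 30]])

rowCounts = {}
for how in ['left', 'right', 'outer', 'inner']:
    joined = merge(gdp, life, on=COUNTRY, how=how)
    rowCounts[how] = len(joined)
    print(how, list(joined[COUNTRY]))
```

China is only in gdp and India is only in life, so the left and right joins each keep 3 countries, the outer join keeps 4 and the inner join keeps 2. If 'Country name' ever needed renaming, only the COUNTRY line would have to change.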
The next section concerned getting data from the World Bank. Pandas could do this without me downloading anything, provided I knew the unique indicator for the particular data set I wanted. The only problem with this is that the index had already been set. Pandas does allow you to reset the index, however, using the reset_index() method:
YEAR = 2013
LIFE_INDICATOR = 'SP.DYN.LE00.IN'
data = download(indicator=LIFE_INDICATOR, country='all', start=YEAR, end=YEAR)
life = data.reset_index()
life.head()
The above code stores the unique indicator for the data we’re after in the constant LIFE_INDICATOR, then downloads that data set for all countries for the year stored in the constant YEAR and stores it in the variable data. It then resets the index of that dataframe and displays the first 5 rows.
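To see what reset_index() does without needing a connection to the World Bank, the sketch below builds a tiny dataframe by hand. It assumes the downloaded data is indexed by country and year, and the values are made-up placeholders. It assumes pandas is installed.

```python
from pandas import DataFrame, MultiIndex

LIFE_INDICATOR = 'SP.DYN.LE00.IN'

# Hand-built stand-in for the downloaded data: the index is (country, year)
index = MultiIndex.from_tuples([('Brazil', '2013'), ('China', '2013')],
                               names=['country', 'year'])
data = DataFrame({LIFE_INDICATOR: [1.0, 2.0]}, index=index)  # placeholder values

# reset_index() turns the index levels back into ordinary columns
life = data.reset_index()

print(list(data.columns))
print(list(life.columns))
```

After the reset, country and year are normal columns that can be renamed, filtered or merged on, and the rows are simply numbered from 0.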
The data from the World Bank includes a lot of groups of countries, which we want removed. To do this, the slice [n:m] is used after a dataframe, where n is the first row you want and m is the row to stop before (row m itself is not included). So, gdp[0:3] would show you the first 3 rows (0, 1 and 2) from the gdp dataframe. You could leave out the m to display all rows from row n onwards. In this case, the list of individual countries starts in row number 34, so gdp[34:] would display the data I’m interested in.
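The slicing rules are easy to check on the small five-country dataframe from earlier. This is only a sketch on that toy table, not the full World Bank data, and it assumes pandas is installed.

```python
from pandas import DataFrame

gdp = DataFrame(columns=['Country', 'GDP (US$)'],
                data=[['UK', 2678454886796.7],
                      ['USA', 16768100000000.0],
                      ['China', 9240270452047.0],
                      ['Brazil', 2245673032353.8],
                      ['South Africa', 366057913367.1]])

# Rows 0, 1 and 2: the slice stops before row 3
firstRows = gdp[0:3]

# Leaving out the end gives every row from row 3 onwards
laterRows = gdp[3:]

print(list(firstRows['Country']))
print(list(laterRows['Country']))
```

Slicing returns a new dataframe and leaves gdp itself untouched, so the result needs storing in a variable if you want to keep working with it.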
from scipy.stats import spearmanr
gdpColumn = gdpVsLife[GDP]
lifeColumn = gdpVsLife[LIFE]
(correlation, pValue) = spearmanr(gdpColumn, lifeColumn)
print('The correlation is', correlation)
if pValue < 0.05:
    print('It is statistically significant.')
else:
    print('It is not statistically significant.')
The above code first imports the spearmanr function from scipy.stats. Next it sets up the variables gdpColumn and lifeColumn with columns from the gdpVsLife dataframe, and then sets up two more variables, correlation and pValue, to store the results of spearmanr performed on the gdpColumn and lifeColumn variables. The code then prints 'The correlation is' followed by the value stored in correlation and, depending on whether the value in pValue is below 0.05 or not, prints 'It is statistically significant.' or 'It is not statistically significant.'
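To get a feel for what the correlation figure measures, it can be worked out by hand: Spearman's coefficient is the ordinary (Pearson) correlation of the ranks of the two columns. The sketch below does that with pandas on made-up numbers; the values and the column names held in GDP and LIFE are placeholders, not real data, and it does not compute a p-value.

```python
from pandas import DataFrame

GDP = 'GDP'
LIFE = 'Life expectancy'

# Made-up values: life expectancy mostly, but not always, rises with GDP
gdpVsLife = DataFrame({GDP: [10, 20, 30, 40, 50],
                       LIFE: [62, 60, 75, 70, 80]})

# Replace each value by its position in the sorted column (1 = smallest)
gdpRanks = gdpVsLife[GDP].rank()
lifeRanks = gdpVsLife[LIFE].rank()

# Series.corr() defaults to Pearson correlation, here applied to the ranks
correlation = gdpRanks.corr(lifeRanks)

print('The rank correlation is', correlation)
```

Because only the ranks are compared, the result depends on the ordering of the countries rather than the size of the gaps between them. A value of 1 would mean the two columns put the countries in exactly the same order.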
The course made sure to point out that just because a correlation is statistically significant, it does not mean one thing causes the other, just that they are related. Correlation does not equal causation, as the saying goes.
It is also possible to see related data in other ways, such as with scatterplot graphs. The code
gdpVsLife.plot(x=GDP, y=LIFE, kind='scatter', grid=True, logx=True, figsize=(10, 4))
plots a graph of the gdpVsLife dataframe with an x axis using the column named by the GDP constant and a y axis using the column named by the LIFE constant (x=GDP, y=LIFE), of the graph type scatter (kind='scatter'), showing the grid lines (grid=True), with a logarithmic x axis (logx=True), that is 10 inches wide by 4 inches tall (figsize=(10, 4)).
The project for this week was to extend the downloaded project notebook, adding in extra data sources to see if healthcare expenditure per capita, or GDP per capita had more of an effect on life expectancy than total GDP did. You can see a copy of my notebook here.
These terms are written as I understand them.