Data Analysis Course Week 4 (Part 1)

Posted on

Today I hope to have the final week of my Data Analysis course completed. As I hurt my back over the weekend, however, I might not be able to. Sitting at my desk on a dining chair is not the best idea and a computer chair should probably be on the list of things to get with my next pay cheque (when I finally get myself a job – the future’s looking bleak on that one).


This week the course is introducing comparison of subsets of data within a data set. For example, in a dataset you may have the GDP listed for each country over a number of years. You can then make a subset using just one country. I’ll be doing that using the Comtrade website which is run by the UN and collates import and export data from all countries. You can download the CSV from the website, but as with the World Bank from week 3, you can use the website’s API to get the data directly.

To get the data you use the pandas module in Jupyter. The following code imports the contents of the pandas module, specifies a particular data set from the Comtrade website, and uses the read_csv() method to place it into a dataframe.

from pandas import *
URL='http://comtrade.un.org/api/get?max=5000&type=C&freq=A&px=HS&ps=2014%2C2013%2C2012&r=826&p=all&rg=all&cc=0401%2C0402&fmt=csv'
df=read_csv(URL, dtype={'Commodity Code':str, 'Reporter Code':str})

Should you want to save the data you’ve loaded to a file, you can use the code df.to_csv('saved_data_example.csv', index=False) to specify that the dataframe stored as df should be saved to a file named saved_data_example.csv.

To split the data into subsets, you can use comparison operators to locate specific rows. For example df_imports=df[df['Trade Flow']=='Imports'] would find all rows including the word Imports in the Trade Flow column of the df dataframe, and save all of this to a new dataframe called df_imports.

For more complicated data comparisons, the course suggested the split-apply-combine pattern:

  • split the original data set into the subsets you wish to compare
  • apply a summary operator to the new subsets
  • merge the subsets together

To split the data set, the course suggested grouping the rows together by common criteria. To do this with Python you can use the code grouped = df.groupby('Commodity') which will make a new dataframe called grouped and fill it with the data from the dataframe df that has been sorted into groups based on the Commodity column. You can group by multiple columns with grouped = df.groupby( ['Commodity','Year']).

You can then use the get_group() method to pull only rows that match a particular group. So for the code grouped.get_group('B') you’d only see rows that have a Commodity of B.

You can see what options you have available for your groups with the code grouped.groups.keys() which will list all of the unique entries contained in the grouped dataframe.

When it comes to the apply section of the pattern, here you typically add a filter or summarise the data through averages or totals for particular groups. To get the total for a group of data you’d use the code:

grouped=df.groupby('Commodity')

grouped.aggregate(sum)

This sets a dataframe called df that has been grouped by the Commodity column (.groupby('Commodity')) and names it grouped. It then uses the aggregate(sum) method to add up all of the values in this dataframe for each particular group. You can also use the apply() method if you’ve written a function of your own. For example:

def top2byAmount(g):
	return g.sort('Amount', ascending=False).head(2)

grouped.apply(top2byAmount)

The above would define a function called top2byAmount that takes a group input (g) and sorts it by the ‘Amount’ column in descending order (.sort('Amount', ascending=False)), returning the top two values (head(2)). The new function is then applied to the grouped dataframe.

You can plot graphs easily using the aggregate() method. The code grouped['Trade Value (US$)'].aggregate(sum).plot(kind='barh') would make a horizontal bar graph (barh) from the grouped dataframe with Trade Value (US$) as the x axis.

As well as the sum in aggregate(sum) there is also min, max and mean. Sum adds all of the contents of a group; min will find the minimum value; max will find the maximum value; and mean will provide the average of the numbers by adding them and dividing by the total. You can apply all of these to the same dataframe with .aggregate([sum, min, max, mean]).

To display a dataframe with columns using the aggregate() method you can use df.groupby(('Period','Trade Flow'))['Trade Value (US$)'].aggregate([sum, min, max, mean]). This will display the 'Period' and 'Trade Flow' columns, with the unique entries therein, and then offer four more columns with the sum, min, max and mean for the values in the 'Trade Value (US$)' for the unique entries in the previous two columns.


At this point I was getting frustrated with the exercise book. It threw a lot of information at me that wasn’t covered by the course, that and my back became sore again. I’ve decided to stop here so I don’t become more annoyed. It’s a good point top stop (about half way through this week’s work) and I’ll pick it up again tomorrow.

Terms

These terms are written as I understand them at the time of writing this blog. I may come to expand on them, or change them completely as I learn more about programming. You can find an up-to-date list of the terms on my programming terms page.

Application Programming Interface (API) – An API provides a means for one computer to talk to another in a way they will understand. It allows a script or another program to interact with it by providing instructions and tools. I’ve mostly seen this used for websites where you can use plugins for a CMS, or output data into a JSON file, or in computer games where you write modifications or mods.

Content Management System (CMS) – A program that allows you to organise, publish, delete, edit and modify content from a central interface. WordPress is a CMS.

Comma-separated values (CSV) – See the programming terms page.

Comparison operator – See the programming terms page.

Dataframe – See the programming terms page.

Expression – See the programming terms page.

Function – See the programming terms page.

JavaScript Object Notation (JSON) – A simple language that allows you to easily read lists of data. It is also easy for computers to read too.

Juypter – See the programming terms page.

Method – See the programming terms page.

Operator – See the programming terms page.

Pandas – See the programming terms page.

Summary operator – A type of operator which summarises the input values. Expressions using this type of operator always returns one value, usually a total or an average.