Start of the Data Analysis Course

Today I started my four-week FutureLearn data analysis course. It’s a free course that will give me the basics of analysing data and, along the way, teach me the basics of a coding language that is new to me: Python.

I progressed through the first couple of pages very quickly – these were mostly an introduction to the course and an explanation of what I could expect to learn – but when it came time to start typing in some example code, I noticed a problem with my Anaconda installation.

I’d already installed Anaconda early last week, as advised by an email sent before the course started, but I noticed that, due to the way my PC is set up, it wasn’t quite right: the program could not access the location where I wanted to store my project. Because of this, I spent a while figuring out how to specify the location. The Notebook application kept loading up my C:\ drive, but I keep all my documents on my E:\ drive (C:\ is the solid state drive that my Windows installation runs off). I tried reinstalling on my E:\ drive, but it still opened up on C:\. In the end, Stack Exchange came to the rescue, as it so often does, so I’ll just have to remember to open the Notebook via that desktop shortcut.

We’re working on calculating the prevalence of tuberculosis (TB), so the course has provided the population size, number of deaths and number of TB cases for a few countries: the Portuguese-speaking countries (Brazil, Portugal and Angola) and the BRICS countries (Brazil, Russia, India, China and South Africa).

The first practical lesson was with variables and assignments, and finally I get a nice definition of the word variable in language I can understand:

[…]find in the attic an empty box, put the number 100 in the box, and write “deathsInPortugal” on the box[…] the attic is the computer’s memory, boxes are called variables[…], what’s written on a box is the variable’s name, and storing a value in a variable is called an assignment.

It’s a little clunky, and very basic, but you get the picture.

After this, the course went into how to name your variables correctly and what happens if you don’t: one of two types of error occurs, a syntax error or a name error. Lower camel case and all that (I’ve talked about that previously). According to Wikipedia, Python recommends upper camel case for class names and lowercase words joined by underscores for most other names, variables included[1], though the course states that it favours lower camel case. As I favour lower camel case, I’ll stick with the course’s advice.
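
A minimal sketch of the two error types, assuming made-up names (the misspelling and the digit-first name are illustrations, not taken from the course):

```python
deathsInPortugal = 100

# Name error: the name below is misspelled, so Python cannot find it.
try:
    deathsInPortugl
except NameError as error:
    nameErrorMessage = str(error)

# Syntax error: a variable name cannot start with a digit.
# compile() checks the broken line without stopping the rest of the cell.
try:
    compile("2ndCountryDeaths = 100", "<example>", "exec")
    syntaxErrorRaised = False
except SyntaxError:
    syntaxErrorRaised = True

print(nameErrorMessage)
print(syntaxErrorRaised)
```

In a notebook you would normally just see the error message under the cell; the try/except blocks are only there so both errors can be shown in one run.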

The interesting thing about using Anaconda is that its Notebook application runs code in little blocks, or cells. These cells are labelled In [] or Out [] depending on what they are doing. In [] cells are where you input your code or text, and Out [] cells are generated by the Notebook in response to the code. So, using the example above, if I type the following into an In [] cell:

deathsInPortugal = 100
deathsInPortugal = 140
deathsInPortugal

Then the Out [] cell will show 140. This is because I’ve set the variable deathsInPortugal to 100, then set it to 140, and then asked the Notebook to display the contents of the deathsInPortugal variable. As I’ve overwritten the initial value of 100 with 140, it displays 140.

It’s also important to note that the Notebook won’t run code unless you tell it to. If you’ve closed your notebook and opened it again, any Out [] cells that appear are left over from a previous session. This means that you could change a cell and, unless you run the cells that depend on it again, they would not reflect the change.

Where the first exercise had me inputting new variables, the second went into expressions and statements, including the arithmetic operators Python provides (plus +, minus -, multiply *, and divide /). I had to perform a few tasks, such as adding up the populations of the 5 BRICS countries and dividing that number by 5 to get the average population.
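
As a rough sketch of that averaging task, assuming placeholder numbers rather than the course's real population figures (the variable names are also made up):

```python
# Placeholder populations for illustration only, not real figures.
populationOfBrazil = 10
populationOfRussia = 20
populationOfIndia = 30
populationOfChina = 40
populationOfSouthAfrica = 50

# An expression using + to add the five values together.
totalPopulation = (populationOfBrazil + populationOfRussia + populationOfIndia
                   + populationOfChina + populationOfSouthAfrica)

# An expression using / to divide the total by the number of countries.
averagePopulation = totalPopulation / 5

print(totalPopulation)
print(averagePopulation)
```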

The third exercise was on functions. The course introduced two of Python’s built-in functions: max() and min(). I used these two functions to get the range of deaths across the 5 BRICS countries.
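
A minimal sketch of the range calculation, again assuming placeholder death counts and made-up variable names rather than the course's data:

```python
# Placeholder death counts for illustration only, not real figures.
deathsInBrazil = 40
deathsInRussia = 10
deathsInIndia = 90
deathsInChina = 30
deathsInSouthAfrica = 20

# max() and min() each take several values and return one of them.
largest = max(deathsInBrazil, deathsInRussia, deathsInIndia,
              deathsInChina, deathsInSouthAfrica)
smallest = min(deathsInBrazil, deathsInRussia, deathsInIndia,
               deathsInChina, deathsInSouthAfrica)

# The range is the gap between the largest and smallest values.
rangeOfDeaths = largest - smallest
print(rangeOfDeaths)
```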

The fourth exercise was on using comments to keep the Python code legible. Some of the numbers being used in the course are estimates, and others are given in thousands, so comments needed to be added to show which were which. This made calculations based on these numbers easier.
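
To show why the comments help, here is a small sketch that assumes placeholder values and made-up variable names; the per-100,000 calculation is just one possible use of the two numbers:

```python
# Placeholder values for illustration only, not real figures.
populationInThousands = 5000   # population, given in thousands of people
estimatedDeaths = 100          # an estimate, given as a plain count of people

# The comments above show the units differ, so convert the population
# to a plain count before combining the two numbers.
population = populationInThousands * 1000

# Deaths per 100,000 inhabitants.
deathsPer100000 = estimatedDeaths * 100000 / population
print(deathsPer100000)
```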

The fifth exercise started to get a bit more involved. Having a variable for everything in data analysis can mean a lot of variables, and the more variables there are, the more chance there is of making a mistake. To combat this, Anaconda includes a module called pandas that lets you store data in tables, like you would in a spreadsheet or database. To enable pandas, I had to import the module, which tells the computer I want to use this specific module. Doing so used a statement that introduced me to reserved words.
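
A short sketch of importing pandas and checking for a reserved word. The course's exact import statement is not reproduced here; `import pandas as pd` is a common convention, and the table contents are placeholders:

```python
import keyword        # standard library module that lists Python's reserved words
import pandas as pd   # "import" and "as" are both reserved words

# Reserved words cannot be used as variable names.
print(keyword.iskeyword("import"))
print(keyword.iskeyword("pandas"))

# A small table with placeholder data, similar in shape to a spreadsheet.
table = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Population": [10, 20, 30],
})
print(table)
```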

The pandas module lets me load data from an Excel spreadsheet, store whole columns from this table in variables as an array and then perform calculations on individual columns using a method. The pandas module offers a few methods:

  • sum() – adds up all of the values
  • max() – finds the largest of the values
  • min() – finds the smallest of the values
  • mean() – finds the average of the values
  • median() – finds the number in the middle of the values (half of the numbers are below the median and half are above)
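
The methods listed above can be sketched as follows. The table is built by hand with placeholder values and a made-up column name, because the course's Excel file is not reproduced here:

```python
import pandas as pd

# Placeholder data for illustration only, not real figures.
data = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "TB deaths": [40, 10, 90, 30, 20],
})

# Store a whole column in a variable, then call methods on it.
deathsColumn = data["TB deaths"]

total = deathsColumn.sum()        # adds up all of the values
largest = deathsColumn.max()      # finds the largest value
smallest = deathsColumn.min()     # finds the smallest value
average = deathsColumn.mean()     # finds the average of the values
middle = deathsColumn.median()    # finds the middle value

print(total, largest, smallest, average, middle)
```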

I got a bit confused about the difference between a function and a method, so I went to my partner for help. He didn’t think there was a difference either until he found this beauty of a thread on Stack Exchange, so now he’s learnt something new. I’m still pretty confused about it, but I’ve written up what he said on the matter in the Terms section below. Either way, the definitions do not matter much; the course is about data analysis, not programming per se.
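
One common way to see the difference is in how each is called. This is only a sketch with placeholder values, and not necessarily how the Stack Exchange thread explains it:

```python
import pandas as pd

# Placeholder values for illustration only.
deaths = pd.Series([40, 10, 90, 30, 20])

# A function stands on its own: the data is passed in between the brackets.
largestByFunction = max(deaths)

# A method belongs to a value: it is called on the data with a dot.
largestByMethod = deaths.max()

print(largestByFunction, largestByMethod)
```

Both calls give the same answer here; the difference is whether the data is handed to the function or the method is called on the data.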

With the methods that pandas gave I could then perform a number of calculations on the data in the table, and even make new columns from these calculations.
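
A minimal sketch of making a new column from a calculation, assuming placeholder data and made-up column names:

```python
import pandas as pd

# Placeholder data for illustration only, not real figures.
data = pd.DataFrame({
    "Country": ["A", "B"],
    "Population (1000s)": [5000, 2000],   # given in thousands of people
    "TB deaths": [100, 10],               # plain counts
})

# The calculation is applied to every row at once, and the result is
# stored as a new column in the same table.
data["TB deaths (per 100,000)"] = (
    data["TB deaths"] * 100 / data["Population (1000s)"]
)

print(data)
```

Multiplying by 100 rather than 100,000 accounts for the population already being in thousands.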

The final task was to use all that I had learned on a whole sample of data from the World Health Organisation (WHO) to make a notebook in Jupyter and then analyse my findings. I cut down the data to just include countries in Europe. I’ve included a copy of this final task including the Jupyter project notebook, and the data I used in Excel format, should you be interested.

Terms

These terms are written as I understand them.

References

  1. Naming convention (programming)
