Today I started my 4 week FutureLearn data analysis course. It’s a free course that will give me the basics of analysing data, and will just happen to help me learn the basics of a new coding language: Python.
I progressed through the first couple of pages very quickly – these were mostly an introduction to the course and an explanation of what I could expect to learn – but when it came to typing in some example code, I noticed a problem with my Anaconda installation.
I’d already installed Anaconda early last week as advised by an email that was sent before the course started but I noticed that due to the way my PC was set up, it wasn’t quite right: the location where I wanted to store my project could not be accessed by the program. Because of this, I spent a while figuring out how to specify the location. The Notebook application kept loading up my C:\ drive but I keep all my documents on my E:\ drive (C:\ is the solid state drive that my Windows installation runs off). I tried reinstalling on my E:\ drive but it still opened up on C:\. In the end, Stack Exchange came to the rescue, as it so often does, so I’ll just have to remember to open the Notebook via that desktop shortcut.
We’re working on calculating the prevalence of tuberculosis (TB) so the course has provided details on population size, number of deaths and number of TB cases for a few countries – Portuguese speaking countries (Brazil, Portugal and Angola) and BRICS countries (Brazil, Russia, India, China and South Africa).
[…]find in the attic an empty box, put the number 100 in the box, and write “deathsInPortugal” on the box’[…] the attic is the computer’s memory, boxes are called variables[…], what’s written on a box is the variable’s name, and storing a value in a variable is called an assignment.
It’s a little clunky, and very basic, but you get the picture.
After this, the course went into how to name your variables correctly and what happens if you don’t – one of two types of error occurs: a syntax error or a name error. Lower camel case and all that (I’ve talked about that previously). Python’s own style guide, PEP 8, recommends lowercase words separated by underscores for variable and function names, and upper camel case for class names, though the course states that it favours lower camel case. As I favour lower camel case too, I’ll stick with the course’s advice.
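To see the two errors for myself, here is a minimal sketch. The value 100 and the misspelled names are just for illustration, and I've wrapped everything in try/except so the cell keeps running instead of stopping at the first error:

```python
# Illustrative value only, not taken from the course data.
deathsInPortugal = 100

# Name error: names are case-sensitive, so this all-lowercase
# spelling refers to a variable that does not exist.
try:
    deathsinportugal
except NameError as error:
    nameErrorMessage = str(error)

# Syntax error: a variable name cannot contain a space.
# compile() only parses the text, so the error can be caught here.
try:
    compile("deaths Portugal = 100", "<example>", "exec")
except SyntaxError as error:
    syntaxErrorMessage = error.msg

print(nameErrorMessage)
print(syntaxErrorMessage)
```

In a real notebook you would simply see the error message appear under the cell; the try/except is only there so both errors fit in one example.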
The interesting thing about the Jupyter Notebook that comes with Anaconda is that it runs code in little blocks, or cells. These cells are labelled as In or Out depending on what they are doing. In cells are where you input your code or text, and the Out cells are generated by the notebook in response to the code. So, using the example above, if I type the following into an In cell:
deathsInPortugal = 100
deathsInPortugal = 140
deathsInPortugal
Then the Out cell will show 140. This is because I’ve set the variable deathsInPortugal to 100, then set it to 140, and then asked the notebook to display the contents of the deathsInPortugal variable. As I’ve overwritten the initial value of 100 with 140, the notebook displays 140.
It’s also important to note that the notebook won’t run code unless you tell it to. If you’ve closed your notebook and opened it again, any Out cells that appear are left over from a previous session. This means that you could change a cell and, unless you run the code again, any other cells that depend on it would not recognise the change.
Where the first exercise had me creating new variables, the second went into expressions and statements, including the arithmetic operators Python provides (plus +, minus -, multiply *, and divide /). I had to perform a few tasks such as adding up the populations of the 5 BRICS countries and dividing that number by 5 to get the average population.
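The averaging task looks roughly like this. The population numbers below are round placeholders I've made up so the arithmetic is easy to follow; they are not the course's figures:

```python
# Placeholder populations (NOT real figures), one variable per country.
populationOfBrazil = 100
populationOfRussia = 200
populationOfIndia = 300
populationOfChina = 400
populationOfSouthAfrica = 500

# An expression using + to add the five values together.
totalPopulation = (populationOfBrazil + populationOfRussia
                   + populationOfIndia + populationOfChina
                   + populationOfSouthAfrica)

# An expression using / to get the average of the five countries.
averagePopulation = totalPopulation / 5

print(totalPopulation)
print(averagePopulation)
```

Note that / always gives a decimal number in Python 3, even when the division works out exactly.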
The third exercise was on functions. The course introduced two of Python’s built-in functions, max() and min(), which return the largest and smallest of the values they are given. I used these two functions to get the range of deaths across the 5 BRICS countries.
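As a rough sketch of the range calculation, again with made-up death counts rather than the course's data:

```python
# Placeholder death counts (NOT real figures).
deathsInBrazil = 10
deathsInRussia = 20
deathsInIndia = 50
deathsInChina = 40
deathsInSouthAfrica = 30

# max() and min() are built-in functions that accept several values.
largest = max(deathsInBrazil, deathsInRussia, deathsInIndia,
              deathsInChina, deathsInSouthAfrica)
smallest = min(deathsInBrazil, deathsInRussia, deathsInIndia,
               deathsInChina, deathsInSouthAfrica)

# The range is the gap between the largest and smallest values.
rangeOfDeaths = largest - smallest

print(rangeOfDeaths)
```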
The fourth exercise was on using comments to keep the Python code legible. Some of the numbers being used in the course are estimates, and others are given in thousands, so comments needed to be added to show which were which. This made calculations based on these numbers easier to get right.
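Here is a small sketch of why those comments matter. The variable names and values are my own placeholders; the point is that the comments record the units, so I remember to convert before dividing:

```python
# Population, given in thousands of inhabitants (placeholder value).
populationInThousands = 10000

# Estimated number of TB deaths, given as a plain count (placeholder value).
deaths = 150

# Convert thousands to whole inhabitants so both numbers use the same unit.
population = populationInThousands * 1000

# Deaths per 100,000 inhabitants.
deathsPer100000 = deaths * 100000 / population

print(deathsPer100000)
```

Without the first comment it would be easy to divide by 10000 directly and end up with a rate a thousand times too large.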
The fifth exercise started to get a bit more involved. Having variables for everything in data analysis can mean a lot of variables, and the more variables there are, the more chance there is of making a mistake. So, to combat this, Anaconda has access to a module called pandas that enables the usage of tables to store data like you would in a spreadsheet or database. To enable pandas, I had to import the module – tell the computer I wanted to use this specific module. Doing so used a statement that introduced me to reserved words.
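I haven't quoted the course's exact import statement here, so the sketch below assumes the commonly used form "import pandas as pd". It also uses Python's standard keyword module to check which words are reserved:

```python
import keyword

# "import" and "as" are reserved words; "pandas" and "pd" are ordinary names.
# The alias pd is a common convention, assumed here for illustration.
import pandas as pd

isImportReserved = keyword.iskeyword("import")
isAsReserved = keyword.iskeyword("as")
isPandasReserved = keyword.iskeyword("pandas")

print(isImportReserved, isAsReserved, isPandasReserved)
```

Because "import" is reserved, trying to use it as a variable name gives a syntax error.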
The pandas module lets me load data from an Excel spreadsheet, store whole columns from this table in variables as an array and then perform calculations on individual columns using a method. The pandas module offers a few methods:
- sum() – adds up all of the values
- max() – finds the largest of the values
- min() – finds the smallest of the values
- mean() – finds the average of the values
- median() – finds the number in the middle of the values (half of the numbers are below the median and half are above)
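To try the methods listed above without needing the course's spreadsheet, this sketch builds a tiny table by hand. The country labels and numbers are placeholders, and the column name "Deaths" is my own choice:

```python
import pandas as pd

# A small stand-in table (placeholder values, NOT the course data).
data = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Deaths": [10, 20, 30, 40, 100],
})

# Store one whole column in a variable.
deathsColumn = data["Deaths"]

# Each method works on every value in the column at once.
totalDeaths = deathsColumn.sum()
largestDeaths = deathsColumn.max()
smallestDeaths = deathsColumn.min()
meanDeaths = deathsColumn.mean()
medianDeaths = deathsColumn.median()

print(totalDeaths, largestDeaths, smallestDeaths, meanDeaths, medianDeaths)
```

The mean and median differ here because the single large value pulls the mean up, while the median stays at the middle value.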
I got a bit confused about the difference between a function and a method, so I went to my partner for help. He didn’t think there was a difference either until he found this beauty of a thread on Stack Exchange, so now he’s learnt something new. I’m still pretty confused about it, but I’ve written up what he said on the matter in the Terms section below. Either way, the definitions do not matter much: the course is about data analysis, not programming per se.
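As far as I understand it, the difference shows up in how you write the call. In this sketch (placeholder numbers again), max() is a function that I hand the values to, while .max() is a method that I call on the column itself:

```python
import pandas as pd

# Placeholder values for illustration.
deaths = [10, 20, 30]
deathsColumn = pd.Series(deaths)

# Function: called on its own, with the values passed in as an argument.
largestFromFunction = max(deaths)

# Method: called on an object using a dot, here the pandas column.
largestFromMethod = deathsColumn.max()

print(largestFromFunction, largestFromMethod)
```

Both give the same answer; the method just belongs to the column object rather than standing alone.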
With the methods that pandas gave me, I could then perform a number of calculations on the data in the table, and even make new columns from these calculations.
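Making a new column from a calculation looks something like the sketch below. The table, its column names and its values are all placeholders of mine, not the course's data:

```python
import pandas as pd

# A small stand-in table (placeholder values, NOT the course data).
data = pd.DataFrame({
    "Country": ["A", "B"],
    "Population": [1000000, 2000000],
    "Deaths": [10, 40],
})

# The calculation is applied row by row, and the result is stored
# in a brand new column of the same table.
data["DeathsPer100000"] = data["Deaths"] * 100000 / data["Population"]

print(data)
```

The original columns are left untouched, so I can always check the new column against the numbers it came from.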
The final task was to use all that I had learned on a whole sample of data from the World Health Organisation (WHO) to make a notebook in Jupyter and then analyse my findings. I cut down the data to just include countries in Europe. I’ve included a copy of this final task including the Jupyter project notebook, and the data I used in Excel format, should you be interested.
These terms are written as I understand them at the time of writing this blog. I may come to expand on them, or change them completely as I learn more about programming. You can find an up-to-date list of the terms on my programming terms and programming-related terms pages.
Array – A type of variable that can store multiple pieces of information and sort these pieces in a specified manner. Particular entries can be retrieved from the array, or the whole array can be modified using a method.
Assignment – The act of storing a value in a variable. So if a variable is a box used to store teddy bears, then putting a teddy bear in the box is the assignment.
Class – In object-oriented programming, a class is a template for an object. If you had a class of dog it would contain everything that a dog has, but would never actually be a dog. The dog would be an object.
Comment – A method for annotating code. While comments are visible to you, the computer will skip over them when running the code. This means you can explain what different parts of your code do, without worrying that the computer will try to do something with it.
Database – An organized collection of data stored in tables. These tables have a certain layout, or schema, telling you what the names of the table headings are and what types of content they store. You can then run queries on the data stored in the database and produce reports on it.
Expression – A fragment of code that produces a value. Printing the contents of a variable and calling a function are also considered expressions. You evaluate an expression to get its value. You can assign the value of an expression to a variable.
Function – A function is a piece of code that takes zero or more values (the function’s “arguments”) and returns a result. You “call” a function to get its value. A function performs a specific set of related tasks within your program. So, you could have a function that deals with everything to do with making toast within a program that makes your breakfast. I’m told this is used interchangeably with the term method, but there are differences between the two.
Instance – When you instantiate a class you are making a new object from the template that the class provides and filling in its details. That object is an instance of the class.
Method – A special type of function that can only be called in a specific context. It can only be called from an object – as it is defined in the class of that object – whereas a function can be called from anywhere. Note that methods are an object-oriented concept and so for some languages, the term method means the same as the term function and vice versa.
Module – A package of various pieces of code that add extra functionality to a product. Modules are loosely related to the original product, in that they can connect to, interact with, and share resources with it. Modularity is explained in more detail on Wikipedia.
Name error – When the computer doesn’t know of any variable with that name. This usually happens when you’ve misspelled a variable name (the capitalisation must match exactly, as well as the letters), or haven’t assigned a value to it yet.
Object – An instance of a class.
Object-oriented Programming – See the programming terms page.
Operator – A type of action or procedure which produces a new value from zero or more input values, called operands.
Reserved word – A word that has a specific use in the code and as such cannot be used for variable or class names. These words vary from programming language to programming language.
Spreadsheet – A program allowing you to organize data in tables, graphs or individual cells. Any value can be adjusted or have calculations performed on it.
Statement – A command for the computer to do something. This command does not produce a value like an expression does.
Syntax error – When the computer doesn’t understand the line of code. This can be caused by adding a comma or brace in the wrong place, for example.
Variable – A named storage area for values that can vary.
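To tie a few of these terms together, here is a minimal sketch of the dog example as I currently understand it. The class, its method and the names are all hypothetical, made up purely for illustration:

```python
# Class: the template for what every dog has and can do.
class Dog:
    def __init__(self, name):
        # Each instance stores its own name.
        self.name = name

    # Method: defined in the class, called from an object.
    def speak(self):
        return self.name + " says woof"


# Function: defined on its own, callable from anywhere.
def describe(dog):
    return "This dog is called " + dog.name


# Instance / object: an actual dog made from the template.
rex = Dog("Rex")

print(rex.speak())
print(describe(rex))
```

So Dog is the class, rex is an object (an instance of Dog), speak is a method, and describe is a plain function.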