Data Analysis Course Week 2 (Part 2)

Continuing from my efforts yesterday, I’ll be finishing week 2 of the data analysis course. There’s not much more to do, but I didn’t want to rush myself to get everything done before I went climbing. And climbing is important too – as an unemployed bum living in the ass-end of nowhere, it’s one of the few times I get out of the house :-p.

The course continued with a small section on graphing your data using the pandas module and its plot() method.

The code london['Max Wind SpeedKm/h'].plot(grid=True, figsize=(10,5)) will plot the 'Max Wind SpeedKm/h' column of the london dataframe as a graph that shows grid lines (grid=True) and is 10 inches wide by 5 inches high (figsize=(10,5)).

You can also plot multiple columns on the same graph by passing a list of column names instead of a single one: london[['Max Wind SpeedKm/h', 'Mean Wind SpeedKm/h']].plot(grid=True, figsize=(10,5)).
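Here's a rough sketch of both plot() calls that runs on its own. The wind speeds are made up to stand in for the course's London file, and the Agg backend line is only there so it works without a screen; neither comes from the course.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import pandas as pd

# Made-up wind speeds standing in for the course's London 2014 data
london = pd.DataFrame({
    "Max Wind SpeedKm/h": [26, 31, 24, 19, 35],
    "Mean Wind SpeedKm/h": [14, 18, 12, 10, 21],
})

# A single column (a Series) is drawn as one line
single_ax = london["Max Wind SpeedKm/h"].plot(grid=True, figsize=(10, 5))

# A list of column names (a DataFrame) is drawn as one line per column
both_ax = london[["Max Wind SpeedKm/h", "Mean Wind SpeedKm/h"]].plot(
    grid=True, figsize=(10, 5)
)
```

Note the double square brackets in the second call: the inner pair is an ordinary Python list of column names.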

By default, the plot() method draws a line chart, but it can produce other chart types too. I’ll learn about those later in the course.

I had a small task to complete in the notebook I was asked to download, before moving on to the next section on changing the index of a dataframe.

So far, my dataframes have been using a default numerical index running from 0 to 364 – a number for every entry in the dataframe starting at 0 for January 1st 2014, all the way to 364 for December 31st 2014. Seeing as the dates are unique – there’s only ever one entry per day – I can use the 'GMT' column to index my dataframe instead. To do this, I’d assign the column to the index: london.index = london['GMT'] gives the london dataframe a datetime64 index starting at the 1st of January 2014. I can then look up a single day with .ix, so london.ix[datetime(2014, 1, 1)] returns the row for that date. (Newer versions of pandas have removed .ix; .loc does the same job for label lookups like this.)
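A minimal sketch of indexing by date, using three made-up rows instead of the real file. It assumes the index is set by assigning the 'GMT' column to london.index, and it uses .loc for the lookup because .ix no longer exists in current pandas.

```python
from datetime import datetime
import pandas as pd

# Three made-up rows standing in for the London 2014 data
london = pd.DataFrame({
    "GMT": pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-03"]),
    "Mean TemperatureC": [9, 8, 7],
})

# Replace the default 0, 1, 2... index with the dates
london.index = london["GMT"]

# Look up one day by its date rather than by its position
row = london.loc[datetime(2014, 1, 1)]
```

The lookup gives back a single row, so row['Mean TemperatureC'] is just the temperature for that day.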

I can now run a query on the dataframe that finds all the rows where the date is between the 8th December and 12th December as london.ix[datetime(2014,12,8) : datetime(2014,12,12)] instead of the method I’d used before: london[(london['GMT'] >= datetime(2014, 12, 8)) & (london['GMT'] <= datetime(2014, 12, 12))] (which would still work).
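To convince myself the two queries really match, here's a small sketch with made-up data for the 5th to the 15th of December. It uses .loc where the course used .ix, and the temperature values are just placeholders.

```python
from datetime import datetime
import pandas as pd

# Made-up rows covering 5th to 15th December 2014
dates = pd.date_range("2014-12-05", "2014-12-15", freq="D")
london = pd.DataFrame({"GMT": dates, "Mean TemperatureC": range(len(dates))})
london.index = london["GMT"]

start, end = datetime(2014, 12, 8), datetime(2014, 12, 12)

# Slicing the date index: both end dates are included
by_slice = london.loc[start:end]

# The older approach: compare the 'GMT' column and combine the two tests with &
by_mask = london[(london["GMT"] >= start) & (london["GMT"] <= end)]
```

Unlike ordinary Python slices, a label slice includes the end value, which is why the 12th is in the result.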

You’d need to make sure the dataframe was sorted in date order, however. Otherwise slicing the index by a range of dates wouldn’t work reliably. You’d do this with the sort_index() method: london = london.sort_index().
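A quick sketch of the sorting step, with three made-up rows deliberately put in the wrong order. I've used .loc again for the slice, since that's what current pandas offers.

```python
from datetime import datetime
import pandas as pd

# Made-up rows, deliberately out of date order
london = pd.DataFrame(
    {"Mean TemperatureC": [7, 9, 8]},
    index=pd.to_datetime(["2014-12-10", "2014-12-08", "2014-12-09"]),
)

was_sorted = london.index.is_monotonic_increasing  # False before sorting

# sort_index() returns a new dataframe, so assign it back
london = london.sort_index()

# With the dates in order, a date-range slice behaves as expected
first_two_days = london.loc[datetime(2014, 12, 8):datetime(2014, 12, 9)]
```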

At this point I was sent to the notebook again for another small exercise.

My final task for this week was to make a project on holiday weather. We were given an example notebook to work from, but asked to pick another city for a two week holiday. I went looking for somewhere in Finland because my family want to go and see the Northern Lights at some point.

For the best holiday I’d need to find a two week period where the skies were clear (so I’d need to remove any rows that mention rain, fog, thunderstorm or snow in the 'Event' column of my data). According to the Aurora Zone the best time to see the Northern Lights is between January and March, so I would be able to narrow my search to these three months. They also suggest that colder nights are the best, so I’d be able to further narrow my data using temperature. As for location, I compared those listed on Weather Underground with the destinations on Lapland Safaris, seeing which matched the activities we’d want to do (searching for the Northern Lights and a husky safari). In the end, Rovaniemi came out on top, so I downloaded the 2014 data for there.
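The filtering plan above could be sketched roughly like this. It isn't the code from my notebook: the rows are made up, the 'Min TemperatureC' column name and the -10 cut-off are assumptions for illustration, and I've assumed days with no weather event show up as missing values in the 'Event' column.

```python
from datetime import datetime
import pandas as pd

# Made-up rows standing in for the Rovaniemi 2014 download
rovaniemi = pd.DataFrame(
    {
        "Event": [None, "Snow", None, "Fog-Snow", None],
        "Min TemperatureC": [-22, -9, -15, -4, -18],  # assumed column name
    },
    index=pd.to_datetime(
        ["2014-01-10", "2014-01-11", "2014-02-03", "2014-03-20", "2014-04-02"]
    ),
).sort_index()

# Step 1: keep January to March only
winter = rovaniemi.loc[datetime(2014, 1, 1):datetime(2014, 3, 31)]

# Step 2: keep days with no recorded rain, fog, thunderstorm or snow
clear = winter[winter["Event"].isnull()]

# Step 3: keep the colder nights (hypothetical threshold)
COLD_BELOW_C = -10
candidates = clear[clear["Min TemperatureC"] < COLD_BELOW_C]
```

This only picks out individual clear, cold days. Finding a good two week stretch would still mean looking at where those days cluster.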

As an aside, we actually want to go to Reykjavik, Iceland but Tom took too long to remind me about this.

I’ve uploaded a copy of my notebook and the data I used should you be interested in the results.


These terms are written as I understand them at the time of writing this blog. I may come to expand on them, or change them completely as I learn more about programming. You can find an up-to-date list of the terms on my programming terms page.

Dataframe – See the programming terms page.

Index – See the programming terms page.