I find myself on a dull Tuesday evening in the London Borough of Hillingdon, an excellent staging post ahead of a visit to a certain nearby airport for sundry work purposes that are of little interest here. Previous research into lodgings along the Bath Road had alerted me to an interesting venue on the way to Sipson:
Could this be Pakistani curry within easy walking distance of the strip of airport hotels that line Heathrow’s northern flank? I knew I would have to investigate.
I drove up from the south coast under apocalyptic skies, accompanied throughout the journey by stop-start (but not much in the way of stop) torrential rain. Dashing from my parked car to the hotel entrance entailed getting thoroughly soaked. Although my goal was only ten minutes’ walk away, I wondered whether walking there was an option in these conditions.
After checking in I noticed something of a sucker’s gap on the weather radar. It was now or never. My confidence in making the journey without drowning received a further boost on the discovery of an ingenious machine in the lobby that would hire me an umbrella for the princely sum of two quid. Suddenly I was all set.
The magical umbrella machine
I made my way along Bath Road, constant and thunderous traffic keeping me company as I dodged the puddles and spray. As I turned up Sipson Road my goal gradually revealed itself around the corner – here was Spice Mix, sharing its salubrious location with an airport parking firm and a hand car wash.
Presenting myself at the building shown on Google Maps, I noticed a sign directing me further into the car park. A portacabin beckoned from a corner. Strolling in, I found it empty. The actual kitchen was to be found in the next building, along with a member of staff in the midst of a phone call, who espied me as I wandered around. He bade me wait for him in the portacabin.
Phone call still underway, the man joined me in the portacabin to take my order. This was to be lamb karahi, plain rice and a butter naan. I had an important question – did the karahi contain the devil’s vegetable (capsicum)? It did not. But – and this was most important – how spicy did I want it? I offered “desi spicy?” – this was understood. “Medium?” he replied. “And a little bit more.” We had an understanding.
The counter with its refrigerated delights
I took a seat and appraised my surroundings. This was no frills in the extreme, a curry caff in its purest sense. Others arrived, and similar negotiations were entered into.
After a decent interval my food appeared. This would be an extremely moist karahi, to the point of it having shorva as opposed to masala. It was bedecked with a garnish of julienned ginger, bullet chillies and a sprig of coriander. The bread was served quartered. The rice was in abundance, and would probably be too much for me. Dipping the bread into the shorva revealed spice but not much in the way of seasoning.
I should note at this point that I had recently recovered from a bout of the ‘vid, during which my sense of smell went temporarily astray. While it has since returned, there is still a chance my taste buds are not quite back to full function, although I think they are mostly working now.
Getting stuck in, I decided my method of attack would be to transport the rice from its bowl to sit atop the shorva. The lamb was tender, with one piece of the eight or nine present on the bone. I got the impression that it and the shorva had been introduced only recently. The spice built nicely; the seasoning, however, was still rather lacking. It was nonetheless a perfectly serviceable curry, and one I enjoyed eating.
The bread was an interesting proposition. While it certainly had butter on it and tasted buttery, it was a stodgy old thing and not quite what I had hoped it would be. On reflection, I would probably have been better off forgoing the bread in favour of the rice, which was nicely infused with the aroma of cardamoms – one of which I narrowly avoided biting into, always the surprise nobody wants.
I managed all the meat and most of the shorva, but, as predicted, some of the rice had to remain uneaten. This was perfectly good, honest food, which set me back £13.90. Not unreasonable by any means. The man I paid was a different fellow to the one who had taken my order and prepared the food. I found the latter as I left, in the same window through which he had originally espied me. “How was it?”, he asked. I told him I had asked for desi, and that was exactly what he had given me.
After the roast beef (never t*rk*y) excesses of Christmas Day, I’m out of the house on Boxing Day and looking for a late lunch. I desire curry, but something different from the usual BIR fare that is omnipresent round these parts.
Smile Grill (121 Charminster Road, Bournemouth BH8 8UH) is one of those pizza-kebab-burger-curry joints that are relatively common in many parts of the UK, but the curry element is a bit of a rarity on the sunny south coast, where we spice lovers have to satisfy ourselves with BIR establishments all serving variations on the usual offerings. Smile is set apart still further by being an Afghan outlet. My hopes are high.
On entry, I note a series of tables to the left, all occupied bar one. I catch the eye of one of the numerous staff and point to it, receiving an affirmative nod. The menu is laid out on an illuminated sign above the counter, and I engage what turns out to be mein host to place my simple order – lamb curry and rice. What rice? The options are pilau and “Kabul rice”. The joint is busy, and while I would like to ask mein host what goes into Kabul rice, I opt for pilau. Would I like bread? Oh, go on then.
Taking my seat I further observe a Tardis-like back room that seemingly swallows up all the subsequently arriving customers. The tables are bare, bar a selection of condiments that I do not investigate. Behind the counter, the various curry offerings are visible along with a rotisserie cabinet and the usual elephant’s legs.
A waiter appears with a single naan bread, a small plate of hummus and a courtesy salad. The bread is served whole and is pleasantly blistered – I sacrifice some to the hummus and find both to be enjoyable. My rice follows, studded with sultanas and embedded with strips of carrot. With that, the main event arrives.
This is most certainly not a BIR curry. Lots of small – boneless – pieces of lamb in something that is decidedly more masala than shorva. I note visible oil separating around the edges of the dish. I decide to adopt a two-pronged method of attack, digging into the masala with the bread and transporting the meat over to the larger rice plate.
The bread is both crisp and slightly chewy – perfect – and collects the masala well. The lamb is soft, not to the point of falling apart, but needs minimal persuasion. The spice level is decidedly medium but certainly enjoyable – I had not asked for any customisations, so this is fine. The rice is delightful, bouncy and with a fruity twang thanks to the many sultanas.
My only gripe is that the food could have been slightly hotter, although this was not helped by my being slightly in the draught of the constantly opening-and-closing front door.
The price of this Boxing Day feed – including a Diet Coke – came to the princely sum of £11.00 which I was more than happy to hand over to mein host at the counter on departure.
Would I return? Most certainly – I want to investigate the alternative Kabul rice, and to enquire at a quieter time about the possibility of customisation. I also want to have a go at what appears to be a lamb shank biryani. À la prochaine!
The 23A is a bus service that runs once a year from Warminster to the middle of nowhere, via a village where nobody lives. The fact that it resembles a TfL bus route, with Routemasters, Boris Buses and proper TfL bus stop flags makes it all the more incongruous.
In the course of learning how to work with data in the R statistical programming language, I ran into a problem whenever I tried to plot multiple columns in a dataset – like this, for instance, where we are looking at local authority-level coronavirus vaccination figures obtained from Public Health England:
> head(vaccs_combined)
date areaName First Second
1 2021-05-19 Bournemouth, Christchurch and Poole 2428 1838
2 2021-05-18 Bournemouth, Christchurch and Poole 349 2048
3 2021-05-17 Bournemouth, Christchurch and Poole 293 1050
4 2021-05-16 Bournemouth, Christchurch and Poole 384 1424
5 2021-05-15 Bournemouth, Christchurch and Poole 1632 3987
6 2021-05-14 Bournemouth, Christchurch and Poole 550 2007
This sort of presentation of data is great for humans to read, but rather more difficult for R to understand. The problem is that, in the context of this data, the column headers First and Second don’t actually represent variables in their own right; rather, they are values of a hypothetical variable that doesn’t exist yet, describing the type of vaccination event.
What on earth are you talking about?
It’s probably easier to explain this visually. Breaking out into Excel so I can easily colour code the cells, our data currently looks like this:
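date        areaName                             First  Second
2021-05-19  Bournemouth, Christchurch and Poole   2428    1838
2021-05-18  Bournemouth, Christchurch and Poole    349    2048
2021-05-17  Bournemouth, Christchurch and Poole    293    1050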
Whereas for R to be able to interpret it and neatly plot the data, it needs to look more like this:
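date        areaName                             event   total
2021-05-19  Bournemouth, Christchurch and Poole  First    2428
2021-05-18  Bournemouth, Christchurch and Poole  First     349
2021-05-17  Bournemouth, Christchurch and Poole  First     293
2021-05-19  Bournemouth, Christchurch and Poole  Second   1838
2021-05-18  Bournemouth, Christchurch and Poole  Second   2048
2021-05-17  Bournemouth, Christchurch and Poole  Second   1050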
This is the difference between wide data, shown in the upper table, and long data, shown in the lower table. Immediately you can see that the long data is not quite as easy for us humans to interpret – this is why both wide and long data are perfectly valid methods of presentation, but with different use cases.
Column headers containing values instead of variable names is in fact the first common problem of messy datasets described by the New Zealand statistician Hadley Wickham in his paper Tidy Data (Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. doi:10.18637/jss.v059.i10). For interest, the full list of common problems is:
column headers being values, not variable names;
multiple variables being stored in one column;
variables being stored in both rows and columns;
multiple types of observational units being stored in the same table;
a single observational unit being stored in multiple tables.
Thankfully, in the covid vaccination data we only have to deal with one of these problems! As is usually the case in R, there’s more than one way to solve this particular problem.
We tidyr the data and gathr
To elongate our data, we are going to use the gather function that comes as part of the tidyr package. tidyr helps us create tidy data. In his Tidy Data paper, Wickham cites the characteristics of tidy data as being:
each variable forms a column;
each observation forms a row;
each type of observational unit forms a table.
First off, let’s install tidyr:
install.packages("tidyr")
In our R script we will then need to load tidyr so we can use the gather function:
library("tidyr")
Our data already exists in the dataframe vaccs_combined – as a reminder it’s currently set out in a wide format like this:
> head(vaccs_combined)
date areaName First Second
1 2021-05-19 Bournemouth, Christchurch and Poole 2428 1838
2 2021-05-18 Bournemouth, Christchurch and Poole 349 2048
3 2021-05-17 Bournemouth, Christchurch and Poole 293 1050
4 2021-05-16 Bournemouth, Christchurch and Poole 384 1424
5 2021-05-15 Bournemouth, Christchurch and Poole 1632 3987
6 2021-05-14 Bournemouth, Christchurch and Poole 550 2007
We’re going to create a new dataframe called vaccs_long and use gather to write the elongated data into it. gather works like this:
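# the key/value pair become the two new columns – here named event and total,
# as seen in the output below – while First and Second are the columns being folded into them
vaccs_long <- gather(vaccs_combined, key = "event", value = "total", First, Second)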
Hey presto! Our data has been converted from wide to long:
> head(vaccs_long)
date areaName event total
1 2021-05-19 Bournemouth, Christchurch and Poole First 2428
2 2021-05-18 Bournemouth, Christchurch and Poole First 349
3 2021-05-17 Bournemouth, Christchurch and Poole First 293
4 2021-05-16 Bournemouth, Christchurch and Poole First 384
5 2021-05-15 Bournemouth, Christchurch and Poole First 1632
6 2021-05-14 Bournemouth, Christchurch and Poole First 550
I quite enjoy tinkering with data, and the sheer amount of it that the global response to covid-19 has produced gives us a number of rich datasets to explore. In this blogpost I’ll try and set out my adventures in the R statistical programming language to plot the covid case rate in England onto a graph.
To do this I’m using RStudio which is free software available for the Mac (which I use), Windows and Linux, along with the ggplot2 package within R. I don’t have the expertise in any of this to provide a particularly good tutorial in either R or RStudio but there are countless guides, YouTube videos, books and so on available. This is simply the story of my own travails, partly to serve as an aide-memoire but also in the hope that someone else could find it interesting.
I will try to recreate the process of discovery I stepped through on a bank holiday afternoon. I’m conscious that might not be terribly helpful in terms of working out how the underlying code ends up, so I’ve published it over on Github in case you would like to see it.
Finding some data
First, I need a dataset to work with. Thankfully the Public Health England covid data is easy to obtain from https://coronavirus.data.gov.uk, and the download page allows you to build a URL you can use again and again. Here, I’m using the following parameters (all selected from dropdowns) to build my permanent link:
Area type: Nation
Area name: England
Metrics: newCasesBySpecimenDate
Data release date: Latest
Data format: CSV
As you select each parameter, your permanent link is built for you below. Here’s mine, based on the selections I made above:
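https://api.coronavirus.data.gov.uk/v2/data?areaType=nation&areaCode=E92000001&metric=newCasesBySpecimenDate&format=csv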
You can visit this link in your web browser; it’ll return a CSV file which you can load into a spreadsheet package such as Excel or Google Sheets and manipulate that way if you like.
Into R
As I start writing my R code I need to do a few things: first, I need to load the ggplot2 library to enable me to plot some nice looking charts. Second, I need to load my dataset using the URL we built above. And third, I need to plot the data into some sort of meaningful form.
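The first two steps look something like this:

# LOAD LIBRARIES ----
library("ggplot2")

# IMPORT DATASET ----
covid_cases_csv <- read.csv(url("https://api.coronavirus.data.gov.uk/v2/data?areaType=nation&areaCode=E92000001&metric=newCasesBySpecimenDate&format=csv"))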
A couple of things have happened here. First off, I’ve loaded ggplot2 as mentioned just above. I’ve also created a variable, covid_cases_csv, into which I’ve dumped the covid case dataset by way of the read.csv command, which takes the permanent link I got from Public Health England as an argument.
Having run this code I don’t see any particular errors, but I want to make sure we’ve loaded something that R can work with. If I issue the command head(covid_cases_csv) I get back the following:
> head(covid_cases_csv)
areaCode areaName areaType date newCasesBySpecimenDate
1 E92000001 England nation 2021-05-02 674
2 E92000001 England nation 2021-05-01 991
3 E92000001 England nation 2021-04-30 1343
4 E92000001 England nation 2021-04-29 1836
5 E92000001 England nation 2021-04-28 2134
6 E92000001 England nation 2021-04-27 1805
Excellent! We have a working dataset I can use to create a plot. I’ll now issue the following command – am I going to get a beautifully formatted plot?
ggplot(covid_cases_csv, aes(x = date, y = newCasesBySpecimenDate))
Not quite. I do get an x and a y axis, a few case rate tickmarks and a great many date tickmarks, but that’s about it. I need to add what’s known as a geom to actually see any data. I’m also going to create a second variable, covid_cases_plot, to hold my ggplot command and make it easier to work with later.
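Something along these lines:

covid_cases_plot <- ggplot(covid_cases_csv, aes(x = date, y = newCasesBySpecimenDate)) +
  geom_point()
covid_cases_plot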
So now my new variable is a receptacle for the ggplot command. With the plus mark I can also add the geom_point command to plot my data. And finally, I call my variable to execute the plot.
We have some data! We also have a few things to fix, the first thing being the slightly odd date format in the underlying data that ggplot is clearly having trouble interpreting. I’m going to add a couple of lines to my script just after I populate my covid_cases_csv variable which will rewrite the data to a slightly easier format:
# FORMAT DATA ----
covid_cases_csv$date <- as.Date(covid_cases_csv$date, "%Y-%m-%d")
That x-axis immediately starts to look better.
Zooming in and smoothing off
Because I’m mainly interested in the case rate change we’re experiencing in 2021, there’s a lot of early data I can eliminate. I also want to calculate a seven-day average to remove some of the noise in the plot and make patterns easier to discern.
To cut out data prior to 1st November 2020, I need to change my ggplot command to filter the dataframe using a which condition in square brackets, effectively cutting my plot down to “dates greater than 1st November 2020”:
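# a sketch of the revised command – the square-bracket which() filter keeps only the dates we want
covid_cases_plot <- ggplot(covid_cases_csv[which(covid_cases_csv$date > as.Date("2020-11-01")), ],
                           aes(x = date, y = newCasesBySpecimenDate)) +
  geom_point()
covid_cases_plot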
We can now see the familiar shape of cases dropping off over the November lockdown, only to rise again through December and January. At closer range this is even noisier, so we want to implement the seven-day rolling average to calm it down a little. This requires us to add two more libraries, dplyr and zoo, to enable us to manipulate the data further. I therefore add the following lines to the # LOAD LIBRARIES section of my script:
library("dplyr")
library("zoo")
Next, I add an additional step to my # IMPORT DATASET section:
covid_cases_csv <- read.csv(url("https://api.coronavirus.data.gov.uk/v2/data?areaType=nation&areaCode=E92000001&metric=newCasesBySpecimenDate&format=csv")) %>%
dplyr::mutate(cases_07da = zoo::rollmean(newCasesBySpecimenDate, k = 7, fill = NA))
Here I’ve used a function of the dplyr library called mutate, and a function of the zoo library called rollmean, to create a seven-day rolling average of the newCasesBySpecimenDate column in my dataset. This produces a value I have named cases_07da, and we now need to make sure we plot this on our y axis instead of newCasesBySpecimenDate. Again we change our ggplot command:
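# as before, but with the new seven-day average column on the y axis
covid_cases_plot <- ggplot(covid_cases_csv[which(covid_cases_csv$date > as.Date("2020-11-01")), ],
                           aes(x = date, y = cases_07da)) +
  geom_point()
covid_cases_plot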
Off to the logging camp to see about a dangling end
This is starting to look a lot better. However, the linear scale doesn’t really help us chart the case rates given the low levels of the disease present in the UK at the time of writing. A log scale will show this in a rather better way. Let’s add an additional argument to our final line where we call the covid_cases_plot variable:
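# scale_y_log10() switches the y axis to a log scale
covid_cases_plot + scale_y_log10()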
Now we can see much more easily what’s going on at those lower levels. However, this improvement has highlighted a further problem inherent in the “cases by specimen date” dataset: the numbers for each day can be revised upwards as new test results come in. This can theoretically happen at any time, although as Richard @RP131 shows us daily on Twitter, the figures are usually pretty stable after five days:
Chart for monitoring lag in reporting of England positive test results. Each column represents a given day's report and shows which specimen dates it covers. pic.twitter.com/2eyrHOkaJG
What we therefore want to do is avoid plotting anything from the last five days. It was not immediately obvious how to do this until I discovered R’s Sys.Date() command, which returns the current date in a form that allows me to perform a simple subtraction on it:
less_recent_days <- Sys.Date() - 5
If I return again to our ggplot command, I now need to add a further statement to my which argument to allow us to cut off the date range at both ends:
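# the same filter, now trimming the most recent five days as well
covid_cases_plot <- ggplot(covid_cases_csv[which(covid_cases_csv$date > as.Date("2020-11-01") &
                                                 covid_cases_csv$date < less_recent_days), ],
                           aes(x = date, y = cases_07da)) +
  geom_point()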
Now that we don’t have any artificially low rates to concern ourselves with, we can have a go at plotting a reasonably current trendline. This is done in ggplot using a geom called geom_smooth. As we’re looking at a seven-day rolling average, I’ll use a seven-day trendline as well. To figure out the date seven days back from five days ago (remember we’ve chopped off our dangling ends up above), I’ll simply use the output of my less_recent_days variable like so:
less_seven_days <- less_recent_days - 7
The geom_smooth gets added on after the geom_point command. Using the subset argument and our less_seven_days variable, we can make sure we only track the trend for the period of time we want. The method of “lm” fits a linear model, giving us a straight line.
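Something like this – I’ve also capped the subset at less_recent_days so the trendline stays within the plotted range:

# fit a straight trendline over the most recent week of plotted data only
covid_cases_plot <- covid_cases_plot +
  geom_smooth(data = subset(covid_cases_csv,
                            date > less_seven_days & date < less_recent_days),
              method = "lm")
covid_cases_plot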
We now have a useful data plot; however, it’s still a little rough around the edges. If I add some additional arguments to the final calling of the covid_cases_plot variable I can neatly label each axis, make sure the x-axis shows each month as a separate tickmark, add a title and finally change the style of the plot:
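# a sketch of the final call – the exact labels, title and theme are a matter of taste
covid_cases_plot +
  scale_y_log10() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
  labs(x = "Specimen date",
       y = "New cases (seven-day rolling average)",
       title = "Covid-19 cases in England") +
  theme_bw()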
I also add a further argument to my ggplot command to credit PHE for the data:
labs(caption = "Data from Public Health England / https://coronavirus.data.gov.uk")
The final product looks something like this, which I’m pretty pleased about to be honest:
I’ve yet to decide whether or not I do anything with this data – I might try and have a look at doing something similar at a county level for my local area but I’ll leave that for another day.
The end product
I don’t think I’d ever recommend that anyone use any sort of code I’ve written, but if you’re curious about how these various snippets of code ended up, you can see them over on Github.