Greasemonkey vs the ONS at Rewired State
I was at the Rewired State National Hack the Government Day yesterday, held at the almost-brand-new Guardian offices by Saint Pancras station. The motivation is summed up on the site:
Government isn’t very good at computers. They spend millions to produce mediocre websites, hide away really useful public information and generally get it wrong. Which is a shame.
And we were there to do something about it, by showing them how they could improve things fairly simply—if only they could procure and specify their online efforts effectively, and avoid the temptation to see every piece of government-produced data as a sales opportunity rather than the commonly-funded common good that it is. We’re all users of government-funded websites, and we benefit by improving them.
James and I decided to take a look at improving access to time series datasets on the Office for National Statistics StatBase site. There’s a lot of information on there:
Time Series Data is a free-of-charge service that gives you access to a comprehensive database of more than 40,000 time series from major National Statistics economic and socio-economic releases. The complete histories of the time series are available in the database. You can download complete releases to your PC in a few easy steps; or make your own custom selection to view or download.
But it’s a shocker of a website, to be honest: it’s hard to know what information is available, and it’s difficult to get at. It works, but it’s clumsy and obviously hasn’t received much love in the past decade, as this quote makes quite clear:
Webpages for this service are best viewed using Microsoft Internet Explorer Version 4+ or Netscape Navigator Version 3+.
The complete datasets are available in two formats: as a plain text file with no obvious schema, or in a proprietary NaviData format. Proprietary, that is, to the antediluvian in-house NaviData application written by the ONS themselves. It’s Windows-only (although it does work in Linux under Wine), but you aren’t missing out on much, apart from perhaps the weirdest toolbar icons ever committed to pixels (the pointing hand closes the program):
So if you don’t run Windows, the bulk datasets are pretty useless to you, and if you do … they’re still not awfully useful. But that’s not the only way to get to the information: you can drill down to smaller subsets via the website, and view these as tables or export them (CSV is one of the formats) for further processing.
If you do that for GDP data, you get something like … alas, it’s not possible to link directly to a result set: you have to go through multiple levels of form submission. Those of you following along at home will need to start on the time series page then proceed as follows:
- Select Gross Domestic Product (O)
- Select View Tables
- Click go
- Select 1: GVA: Major output aggregates
- Select View Series
- Click go
- Select the top three items
- Select Add to Selection
- Click go
- Make sure the three items are still selected
- Select Download
- Click go
- Make sure the three items are still selected
- Select View On-Screen
- Click go
And you’ll get a page like this (several feet of data cropped for brevity):
At this point, I’m going to quote James verbatim:
So for the first part of our hack, in a fine example of post-modern programming, we decided to use JavaScript to generate a graph in the browser using the HTML table form of the data. Thanks to an article by Rebecca Murphey, we decided to use jQuery together with Rebecca’s graphTable jQuery plugin which uses the flot jQuery plugin to actually draw the graphs. We used the GreaseMonkey, the Firefox add-on, to write a little script to pull everything together.
This all enhances the turgid numbers by patching the site to add a graph and explanatory key above the table, thus:
A great improvement, don’t you think? All the code is on GitHub, where there’s an explanation of how to install the script.
Because we were able to pull together pre-existing code, it didn’t take us all that long. In the remaining time, we started to look at scraping the site to try to find a way to reduce the tedious click/select/click/select process enumerated earlier and to be able to make static links to the charts. We didn’t have time to finish that, but I think it’s worth pursuing.
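For illustration, here is a minimal sketch in Python of the scraping step: turning an on-screen results table into named series that could be charted or linked to. It is not the code from the day. The table layout (a header row of series codes, then one row per period), the sample markup, and the names `TableExtractor`, `table_to_series`, `AAAA` and `BBBB` are all assumptions made up for the example; the real StatBase markup may well differ.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_series(html):
    """Return {series_name: [(period, value), ...]} from the assumed layout."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    names = header[1:]  # first column is assumed to hold the period label
    series = {name: [] for name in names}
    for period, *values in body:
        for name, value in zip(names, values):
            series[name].append((period, float(value)))
    return series


# Hypothetical sample markup, not copied from the ONS site.
SAMPLE = """
<table>
  <tr><th>Period</th><th>AAAA</th><th>BBBB</th></tr>
  <tr><td>2008 Q1</td><td>100.0</td><td>98.5</td></tr>
  <tr><td>2008 Q2</td><td>101.5</td><td>97.0</td></tr>
</table>
"""

if __name__ == "__main__":
    for name, points in table_to_series(SAMPLE).items():
        print(name, points)
```

Once the data is in that shape, producing a static link or a chart is a separate and much easier problem than clicking through the forms by hand.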
One thing we did discover, however, was that the site has hidden form fields containing data like this:
SELECT Series.ReleaseID, ReleaseTable.Name, Series.Name, Series.Title, Series.ID, Series.StartYear, Series.EndYear, Series.StartQuarter, Series.EndQuarter, Series.StartMonth, Series.EndMonth FROM series, ReleaseTable WHERE ((Series.ReleaseID = ReleaseTable.ReleaseID)AND (Series.TableSequence = ReleaseTable.TableSequence))AND ((Series.ID = 2905343) OR (Series.ID = 2905344) OR (Series.ID = 2905345))
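Ouch. That is a complete SQL query sitting in the page, which suggests the server may simply run whatever the form sends back. Let’s hope it’s a read-only database without any sensitive data. We didn’t try anything evil, though: that would have been the wrong kind of government hacking!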
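The usual alternative to shipping SQL through a form field is for the form to carry only the selected series IDs, and for the server to build the query itself with bound parameters. A minimal sketch in Python with SQLite follows. We have no idea what the ONS back end actually runs, so the single-table schema, the function name `fetch_series` and the placeholder names and titles are assumptions; only the column names and the three IDs come from the query above.

```python
import sqlite3


def fetch_series(conn, raw_ids):
    """Look up series by ID, treating the submitted values as untrusted."""
    # int() raises ValueError for anything that is not a plain integer,
    # so SQL fragments in the form data never reach the query text.
    ids = [int(value) for value in raw_ids]
    if not ids:
        return []
    # One "?" per ID; the values themselves are bound by the driver.
    placeholders = ", ".join("?" for _ in ids)
    query = (
        "SELECT Series.ID, Series.Name, Series.Title "
        "FROM Series WHERE Series.ID IN (" + placeholders + ") "
        "ORDER BY Series.ID"
    )
    return conn.execute(query, ids).fetchall()


# Hypothetical in-memory stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Series (ID INTEGER PRIMARY KEY, Name TEXT, Title TEXT)")
conn.executemany(
    "INSERT INTO Series VALUES (?, ?, ?)",
    [
        (2905343, "placeholder-a", "Placeholder title A"),
        (2905344, "placeholder-b", "Placeholder title B"),
        (2905345, "placeholder-c", "Placeholder title C"),
    ],
)

if __name__ == "__main__":
    print(fetch_series(conn, ["2905343", "2905345"]))
    try:
        fetch_series(conn, ["2905343 OR 1=1"])
    except ValueError as error:
        print("rejected:", error)
```

The point of the design is that the browser only ever names which series it wants; the shape of the query stays on the server.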
You can see all the things made by people on the day on the Rewired State Projects site.