In 2010, Tim Berners-Lee created his “Five-Star Open Data” guidelines, which define a set of requirements for what he sees as ideal open data. The guidelines have gotten a lot of press, conference mentions and so on – you can even examine them on a mug. For those not familiar with the guidelines, they are:

  • One star: Data is available on the web, under an open license
  • Two stars: Data is available as structured data (e.g., not a PDF)
  • Three stars: Data is available in a non-proprietary format (e.g., not Excel)
  • Four stars: Entities defined in the data get a URI specified for them
  • Five stars: Data is linked to other data, such as by using RDF

These guidelines are overall pretty reasonable, though – at the risk of disagreeing with Sir Tim – I see some strange aspects to them. Is the problem with Excel spreadsheets really just that they are in a proprietary format, and not that spreadsheets are hard to parse? Would data placed in the equivalent open format for spreadsheets, ODS, really be as good as CSV or JSON data? Is using RDF really so helpful, in practical terms, that it justifies a star of its own? (The official description of the fifth star doesn’t mention RDF, but that is the obvious implication.) Maybe most importantly – why do all these steps involve simply putting files online? Does an API which can provide all this same data, but only in small chunks at a time and in response to queries, have no value? Anyway, whatever issues there are, it’s nice that these guidelines exist. They provide something that we can bounce ideas off of – and yes, these are reasonable metrics to follow, for those who otherwise wouldn’t know where to start. (At least, the first four stars certainly are.)

But thinking about these made me wonder: perhaps there should be similar guidelines for people looking to create open data? It’s understandable that there are not: after all, whether others can read and understand the content is much more important than how the content was generated. (And by “others” in this case, I mean machines and humans, as the saying goes.)

Still, creating the data in the right way can be critical. Let’s say that your great open data set is stored in an Excel spreadsheet on someone’s computer. You can save the data to one or more CSV files, run a CSV-to-RDF utility on it (or put it in CKAN), put the resulting files on your web site, and presto, you’ve hit the jackpot of five-star open data. But the data remains in one file on one person’s computer – hardly a recipe for success, in either the short or long term. If, for example, someone else wants to fix something in that data, they might be out of luck – even if they work in the next cubicle over. If you use MediaWiki with Page Forms and/or Cargo and/or Semantic MediaWiki, or you’re a Wikidata editor, you probably know where this is going.
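To show how little the "save to CSV, then convert" step actually involves, here is a minimal sketch in Python, using only the standard library. The base URI, the column names and the sample rows are all made up for illustration, and the N-Triples output is a deliberately naive mapping (one triple per non-empty cell), not what any particular CSV-to-RDF utility produces.

```python
import csv
import io
import json
from urllib.parse import quote

# Hypothetical namespace; in practice this would be a domain you control.
BASE = "http://example.org/"

def csv_to_records(csv_text):
    """Parse CSV text into a list of dicts, one dict per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def escape_literal(value):
    """Escape the characters that N-Triples string literals require."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))

def records_to_ntriples(records, id_column):
    """Emit one triple per non-empty cell, using id_column to build the subject URI."""
    lines = []
    for row in records:
        subject = f"<{BASE}item/{quote(row[id_column], safe='')}>"
        for column, value in row.items():
            if column == id_column or value == "":
                continue
            predicate = f"<{BASE}prop/{quote(column, safe='')}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return "\n".join(lines)

# Made-up sample data standing in for the exported spreadsheet.
sample = "id,name,population\n1,Springfield,30000\n2,Shelbyville,25000\n"
records = csv_to_records(sample)
as_json = json.dumps(records, indent=2)
as_ntriples = records_to_ntriples(records, "id")
print(as_ntriples)
```

Nothing in this script knows or cares where the spreadsheet lives or who is allowed to change it, which is exactly the gap described above.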

I considered creating an actual five-star listing for generating data, but it felt too self-indulgent, because the attributes I came up with are basically descriptions of the software I’m involved with. But let me just list, in no particular order, some of what I think are the important elements of a system to create open data:

  • Publicly-editable – anyone who makes use of a particular item of data should be able to modify it, even if they have to “jump through some hoops” to do the modification.
  • Version history – it should be possible to easily see the history of any piece of data – when it was created, and all the times it was changed. Ideally even if it gets deleted. That’s especially true if anyone can edit the data – but even if there’s just one person editing it (or one machine), it’s very useful to know how the data has changed.
  • Form-based entry – obviously, forms make data entry a lot easier. But beyond that, forms are helpful because they define a data structure: they make it obvious what information is desired and what is not – and they allow for easily changing the rules, and enforcing the new rules automatically.
  • Easy export to a variety of formats like CSV, JSON and (sure, why not) RDF.
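To make these four elements a little more concrete, here is a toy sketch in Python of a data store that has all of them in miniature. It is only an illustration under my own assumptions: the class name, the method names and the sample data are hypothetical, and this is in no way how MediaWiki or any of its extensions is implemented. The "form" is reduced to a dictionary of field names and converters, which is enough to show how a form both defines and enforces a structure.

```python
import csv
import io
import json
from datetime import datetime, timezone

class VersionedStore:
    """Toy store: anyone may edit, every edit is kept, fields are validated."""

    def __init__(self, fields):
        # fields plays the role of a form definition: name -> converter.
        self.fields = fields
        self.history = {}  # key -> list of revisions, oldest first

    def _append(self, key, values, editor):
        self.history.setdefault(key, []).append({
            "editor": editor,
            "time": datetime.now(timezone.utc).isoformat(),
            "values": values,  # None marks a deletion
        })

    def save(self, key, values, editor):
        """Validate values against the form definition, then add a revision."""
        unknown = set(values) - set(self.fields)
        missing = set(self.fields) - set(values)
        if unknown or missing:
            raise ValueError(f"unknown={sorted(unknown)} missing={sorted(missing)}")
        cleaned = {name: convert(values[name]) for name, convert in self.fields.items()}
        self._append(key, cleaned, editor)

    def delete(self, key, editor):
        """Deleting is just another revision, so the history survives."""
        self._append(key, None, editor)

    def revert(self, key, editor):
        """Undo the latest edit by re-saving the revision before it."""
        revisions = self.history[key]
        if len(revisions) < 2:
            raise ValueError("nothing to revert to")
        self._append(key, revisions[-2]["values"], editor)

    def current(self, key):
        return self.history[key][-1]["values"]

    def _live_rows(self):
        for key, revisions in self.history.items():
            values = revisions[-1]["values"]
            if values is not None:
                yield {"key": key, **values}

    def to_json(self):
        return json.dumps(list(self._live_rows()), indent=2)

    def to_csv(self):
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=["key"] + list(self.fields))
        writer.writeheader()
        writer.writerows(self._live_rows())
        return buffer.getvalue()

# Made-up usage: a good edit, a bad edit, and an easy undo.
store = VersionedStore({"name": str, "population": int})
store.save("springfield", {"name": "Springfield", "population": "30000"}, "alice")
store.save("springfield", {"name": "Springfield", "population": "3"}, "vandal")
store.revert("springfield", "bob")
print(store.to_csv())
```

The point of the sketch is the shape, not the details: because no revision is ever overwritten, letting anyone edit is cheap, since a bad edit costs one call to `revert`.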

As far as I know, the software packages that currently match all of these criteria (all four “stars”, you could say) all consist of MediaWiki plus different sets of extensions. That is, I know of no other software, past or present, that can serve as an ideal tool for the creation of data sets, open or otherwise. Software like Google Docs comes close, as do the newfangled collaborative apps like Airtable, Coda, Zoho Creator, etc. – but I don’t believe any of them truly allow for unlimited public editing of the data in the same way that MediaWiki does, by making it extremely easy to undo anyone’s bad edits.

And I’m not sure how well they handle form-based entry either. Can their forms handle maps, calendars and tables of data, like MediaWiki’s Page Forms does? (The fact that these applications all seem to view the spreadsheet, and not the page, as the basic building block of data seems to place a severe limit on their abilities; but that’s a story for another day.) In short, there are a few great solutions for data creation, and they all involve MediaWiki. At least, that is my (deeply biased) view.

But if I’m wrong, I would love to hear why.