As part of the build of the new IATI datastore, there’s an important point for our community to consider: what goes into the datastore?
A common response might be “all published IATI data, surely?”, but I wanted to offer an alternative, which I think others will support.
In short, I propose that the IATI datastore initially pulls in data that is:
Valid against the IATI schema, and;
Published under an open licence, and;
Using version 2.0x of the IATI standard
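To make the three conditions concrete, here is a rough sketch of how they might combine into a single entry check. It is only an illustration, not how the datastore is (or will be) built: the `Dataset` fields, the `OPEN_LICENCES` set and the `eligible` function are all names I have made up, and the licence identifiers are examples rather than an agreed list.

```python
from dataclasses import dataclass

# Assumption: example identifiers for open licences, not an official list.
OPEN_LICENCES = {"cc-by", "cc-zero", "odc-pddl", "odc-by"}


@dataclass
class Dataset:
    """Hypothetical summary of one published IATI file."""
    name: str
    schema_valid: bool   # result of a schema check done elsewhere
    licence_id: str      # licence identifier declared by the publisher
    iati_version: str    # e.g. "1.05", "2.01", "2.03"


def eligible(ds: Dataset) -> bool:
    """Return True only if all three proposed conditions hold."""
    return (
        ds.schema_valid
        and ds.licence_id.lower() in OPEN_LICENCES
        and ds.iati_version.startswith("2.0")
    )


if __name__ == "__main__":
    candidates = [
        Dataset("valid-open-v2", True, "cc-by", "2.03"),
        Dataset("valid-open-v1", True, "cc-by", "1.05"),
        Dataset("valid-restricted-v2", True, "other-closed", "2.02"),
        Dataset("invalid-open-v2", False, "odc-pddl", "2.01"),
    ]
    for ds in candidates:
        print(ds.name, "->", "include" if eligible(ds) else "omit")
```

The point of the sketch is that the conditions are joined with "and": failing any one of them keeps a file out, whatever the other two say.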
To be clear: this does not encompass all currently published data. So - why limit ourselves? Here are three reasons:
1 - Schema validation is something we are very used to
For those of us that grapple with data validation around IATI, we know it can often mean many things. The term “compliance with IATI” is often heard, but not universally agreed.
However, we have a very simple mechanism to help us: the IATI schema. The schema provides exactness on how data should be ordered and structured: it’s the minimum level of validation one should pass.
The online IATI validator has always provided a means to test data against the schema. It’s true that there is a range of further “validation” tests one could make, including a host of rulesets and codelist checks - and even extending to the coherence of organisation identifiers. However, to establish a basic level of quality, we should begin by working with data that passes this initial schema inspection.
My argument here is simple: if we start to support data that is not valid against the schema, why have a schema? Or even - what support are we giving to data users, if we supply invalid data?
2 - Data licensing supports data use
You might be surprised to see mention of data licensing in this proposal, as it can often be something that is added at the last moment of publication, whilst someone sets up their IATI Registry account. However, appropriate data licensing is an absolute must if we are to support people to successfully use IATI data.
In fact, we should really consider the datastore as a data user! In this light, it needs to be clear that it can access data under a licence that is permissive and open. When data is licensed under restrictive terms, it cannot be reused (as that is what the licence says!).
My challenge: Why would we support the datastore to use data that has no licence for reuse?
3 - Version 1 of the standard is deprecated
The TAG meeting in Kathmandu supported the decision of IATI members to deprecate version 1 of the standard, meaning that from June 2019 publishers using it will no longer be supported.
Whilst it’s technically possible to convert data from version 1 to version 2 of the IATI standard, this would consume limited resources on the datastore project that we could deploy elsewhere.
My rationale: to get the support of the new datastore, organisations need to supply data in a version that is actively supported by the initiative.
Support for our principles?
A common thread between these three conditions is mutual support. We all want to support our data standard and data users via the datastore project. To do this, we must ensure that we respect the core protocols we have around the schemas, licences and versions for our standard. Given that the datastore represents a renewed focus on data quality and use, I can’t imagine a scenario where we would actively go against these.
Of course, a range of publishing organisations would currently be omitted from the datastore because of failed schema tests, restrictive licensing and/or use of unsupported versions. However, we should be careful not to start citing examples in order to find reasons to relax these criteria. I do believe this is a relatively low bar for entry - and that our community and tech team can provide positive support to those that need to remedy their data.
What next? I’m hoping those active in our community can support these principles, so that we can in turn endorse the data that makes its way into the datastore. Maybe respond with a quick nod of approval to get us moving on…
My guess (based on informal discussions - see below) is that the first two principles are very agreeable, whilst there’s a dilemma about use of version 1 data. That seems fine - and is a reason for my separating v1 into a new point.
After this, we can start to extend our discussions around data validity, compliance and quality in other, more advanced, ways. But, I do hope colleagues are able to step back and agree that this initial benchmark is for the betterment of the initiative.
Disclosure: prior to posting this I quickly discussed these ideas with @bill_anderson, @KateHughes, @Herman, @siemvaessen, @andylolz, @markbrough & @rory_scott - more as a valued and accessible sounding board than as a definitive answer (but thanks, nevertheless!)!