IATI Datastore - what data should go in?


(David Megginson) #14

I agree about the reuse problem. That’s why I’d have the data excluded from common queries by default, and included only when the user explicitly opted in (e.g. “Include non-open data” option in the UI, or “&license=nonopen” in the API).

There are some use cases where non-open data is better than no data, but it’s OK to make the users do a bit of extra work in those cases (including demonstrating that they’re aware of the problem).


(Andy Lulham) #15

^^ Agreed / cool. Step #1 is this ticket, which would stem the tide of “license unspecified” data.

It sounds like there’s appetite for removing the option to publish closed IATI data going forward (FWIW I support this). Plus the number of activities published with a closed license is really small (see table above). If the option of a closed license were to be removed, I doubt it would be worth special casing for closed data in the datastore API.


(Steven Flower) #16

Thanks everyone to the detailed, considered and useful answers. It’s like a Technical Advisory Group!

Allow me for a moment to sit on my TAG chair cushion and undertake my duties. In amongst all these exciting conversations and (potential) tangents, I think this is where we are:

  1. On schema validation - I see no objection.
  2. On open licences - we seem to also agree on the principle, but see a contradiction in how an open data standard can accommodate closed licences.
  3. On 1.0x, we seem less ready to “reject” that data - but think the deprecation of v1 should mean active publishers will make a plan to migrate to v2

There are a few tasks coming from this, it seems:

  • clarifying our guidance on closed licences
  • understanding why/how the Registry would allow them
  • thinking through how the Registry might apply some / all of these principles
  • considering how we make available / archive “non-active” version 1 publishers
  • understanding our position on limiting data, in an unlimited data world

But - as we break for the weekend (think of it as a coffee break in this energetic meeting we’re having, but the chance to get some actual fresh air) I’m hoping this is a adequate summary of where we are at.


(Bill Anderson) #17

Yes.

The Datastore’s priority clients are data users who should reasonably expect to be served usable data.

In my opinion “usable data” sits somewhere in between schema-only validation and full validation against schema, codelists and rulesets.

We have agreed the first step: all activities from active publishers MUST validate against the schema.

BUT we haven’t yet provided the DS developers with guidance and a roadmap as to how to tolerate ruleset and codelist errors.


(SJohns) #18

Good summary - going back to Andy’s original summary of the activities affected, would any exclusions be based on excluding activities rather than whole data files? Just thinking of the 600+ CSO publishers, some of whom have old activities going back to 2011 that won’t meet these criteria, but are part of the same datafile as newer activities that will be 2.0x and meet the criteria. They are not going to have the resources to go back and update older activities. And as many donors now link the payment of funds to the publication of data - it could be a real risk to them to have their datafile pulled completely. What would be the best advice you can give a CSO in advance of these changes?


(Steven Flower) #19

Hi @SJohns - thanks, it’s a very valid question :slight_smile:

In terms of specific file having a mix of 1.0x and 2.0x activities within it, then I don’t think this is actually possible. The version attribute is only applicable at the <iati-activities> element, not the <iati-activity>, so it can only be declared once per file. It used to be different (in version 1.0x) - but was changed in the move to 2.01 (see changelog). @bill_anderson @IATI-techteam do you agree?

However, the point still remains that it could be possible to publish a file with a mix of valid and invalid activities (in the same version). I think @andylolz did some stats on this too…


(Andy Lulham) #20

@SJohns: pragmatically, I’d suggest any publisher that can’t go back and update old v1.0x data should ensure all new activities are created in a brand new v2.03 activity file. This means all future data will be “datastore compliant”. And perhaps at some point, the old v1.0x data could be one-off converted.

That’s true – in the stats above, schema validation was performed at activity level (i.e. rather than validate each dataset, I validated each activity.) So in practice this means the “activity count” is a count of invalid activities, rather than a count of all activities inside invalid datasets.


(Matt Geddes) #21

Slightly off-topic but does IATI give a ‘lifespan’ estimate when new version of the standard are created i.e. with an operating system, updates are guaranteed for X number of years? It seems like it might be helpful/standard practice to say with new version that they will not be depreciated (dropped from the registry/core tools) for X years, or until X date?


(SJohns) #22

@stevieflow @andylolz thanks for clarifying. I was really thinking of this scenario - older activities that are poorer quality within the same datafile as activities of good quality, but I didn’t express it well!! So just thinking about Andy’s suggestion - how would it work for AidStream users.

Aidstream users (who are using the full version) should click on the button to upgrade to version 2.03 and then continue to add in their data for the current activities. If they have older, closed activities on AidStream that are poor quality (data missing/incomplete), then they can convert them to draft activities in AidStream by editing them. This means that when they publish the datafile, only the current activities will show up in a datafile that is tagged as iati-activities version=“2.03”.

This 2.03 datafile should (if no other issues) get pulled through to the new database without the older activities, which will no longer be publicly available. This should not therefore impact their funding (because the current activities are published) but will shorten their track record.

Then if an organisation has extra resources, they can go back and fix the older files if they want to show a longer track record.

For organisations with a smaller amount of activities, this will be feasible to do.For organisations that use AidStream to publish many activities,for multiple donors, it’s going to be a headache, so the more time and warning you can give, the better.

Unfortunately, as soon as funders link an open, public good like IATI to withdrawing funding which an organisation receives to run their programmes (which vulnerable people depend on) it gets a lot more complicated than just excluding data and teliing organisations to update it as and when.


(Andy Lulham) #23

@SJohns: I’ve replied on twitter with a suggested approach that doesn’t involve removing existing IATI data.


(Steven Flower) #24

Thanks again @SJohns

I think we are into some of the implementation details , based on the agreement of the principles above.

@andylolz would it be possible to share your twitter feedback in a new thread, where we can discuss this in a dedicated space? @SJohns by no means am I saying we should ignore this - but I want to keep this thread to our shared three principles. Just in the same way we have a new discussion on follow-ups for licencing, we should detail the support needed for Aistream publishers in a concentrated channel.


(Steven Flower) #25

Hi everyone

I’m just flagging that our technical advice to the @IATI-techteam & partners via @siemvaessen looks to be a clear line on the datastore initially ingesting data that is:

  • Valid to the relevant schema
  • Openly licenced
  • Version 2.0x (but actively checking valid/open 1.0x data alongside this)

As we can see, there are follow ups and actions elsewhere, but I wanted to thank everyone for their input here, and pass onto @KateHughes in terms of implementation of the datastore. Thanks!


(Andy Lulham) #26

Could you expand on this, @stevieflow? It’s unclear what this would mean for v1.0x publishers.

Thanks


(Herman van Loon) #27

@stevieflow i agree with @andylolz: it needs to be very clear what is going to happen with valid 1.0x data. I would expect the DS to process this data.

Agree?


(Bill Anderson) #28

See ^^. It has already been agreed that …

The DS spec does not require for deprecated versions of the standard to be processed.

It was decided, pragmatically, that although DS will come online before V1.0 is deprecated, we are talking about a couple of months and it was not worth the effort to complicate the load.


(Herman van Loon) #29

Then there hopefully will be no active 1.x data publishers anymore after the depreciation date in June this year.


(Steven Flower) #30

Thanks all

It’s useful for us to reaffirm our role as a community here. We’re giving technical advice (the TA of TAG!) to the datastore project. We’re not in a position to project manage the datastore, for example. For this reason, it’ll be great to hear a progress update from the @IATI-techteam

In terms of the discussion we’ve had so far on 1.0x, then apologies if I left that vague. My understanding is that we’d leave the door open for valid 1.0x data, but that other factors instigated by the @IATI-techteam may mean becomes less of an issue:

  • Existing active 1.0x publishers shift to 2.0x before June
  • There’s a process in place to support the Aidstream users, who may have a mix of versions

(Mark Brough) #31

Jumping on this thread a little late – I think it would be great to ensure that the needs of the ultimate users of the data are factored in here. There are currently some big donors publishing v1.x data to IATI (see below). It would be really unfortunate if the data from these organisations, which is currently available through the Datastore, became no longer available.

I don’t really understand the suggestion of loading all v1.x data into the Datastore once, and then never again – I would assume the development time required would be more or less the same, and presenting out of date data to users from a Datastore that is supposed to update nightly would arguably be misleading. Perhaps a better approach would be to gracefully degrade for older versions – i.e., trying to load v1.x data on a “best efforts” basis, but not focusing development or maintenance time on this.

Here are few suggestions about how to avoid losing access to all these organisations’ data:

  1. IATI tech team works intensively with priority organisations to support/encourage them to begin publishing v2.x data. I would argue that prioritisation should be based primarily on size of organisation.
  2. If there are still a large number of organisations (especially large ones) publishing v1.x data, then have a policy of gracefully degrading for no longer supported versions.
  3. The Datastore importer could perhaps use something like @andylolzv1.x to v2.03 transformer where possible to simplify the import process.

IATI Dashboard - Versions

v1.03

  • AFD (France)
  • AsDB
  • Finland
  • France (Ministry of Foreign Affairs)
  • Switzerland (SDC)
  • UNOPS

v1.04

  • European Commission (FPI)

v1.05

  • Germany (Environment Ministry)
  • Climate Investment Funds
  • New Zealand
  • The Global Fund to Fight AIDS, Tuberculosis and Malaria

(Bill Anderson) #32

The TAG consensus to deprecate v1 in June was based on the realistic expectation (based on the ongoing work of the Tech Team) that all big publishers will upgrade. Your Suggestion 1 has been going on for some time.


(Mark Brough) #33

@bill_anderson that’s great to hear, then perhaps we can just revisit this question around June, once we know how much progress these publishers have made.