IATI Datastore - what data should go in?

Agree. But you go on to focus solely on the supply side.

As a consumer I’m not interested in what producers should or shouldn’t be capable of doing. I just want the data. I’m not bothered whether my banana is malformed so long as it is edible.

I’m quite happy to hold my hand up and admit that for the best part of ten years I was part of a machinery (and community) that paid insufficient attention to users outside of our immediate supply chain. Now that I’m on the other side of the fence things look different …

This kind of (much-used) argument is fundamentally flawed. Improving data quality and maximising the use of what currently exists are two very separate ideas that actually reinforce each other.

Thanks for this discussion!

I think part of the issue is that 2.01 made it much easier to fail schema validation by requiring elements to be in a particular order (it did this in order to make certain fields “mandatory”, which I think was the wrong way of enforcing compliance). That didn’t matter much until now, because everyone could continue to use data even though it failed validation, but it would obviously begin to make much more of a difference if we stick to this approach.
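
For readers less familiar with XSD mechanics, here is a minimal sketch of the ordering point, using Python with lxml and a deliberately tiny two-element schema (not the real IATI 2.01 XSD): under xs:sequence, element order is part of validity, so the same data in a different order fails.

```python
# Illustrative only: a toy two-element schema, not the real IATI 2.01 XSD.
from lxml import etree

TOY_XSD = b"""<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="iati-activity">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="iati-identifier" type="xs:string"/>
        <xs:element name="title" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = etree.XMLSchema(etree.fromstring(TOY_XSD))

in_order = etree.fromstring(
    b"<iati-activity><iati-identifier>XM-1</iati-identifier>"
    b"<title>Example</title></iati-activity>")
swapped = etree.fromstring(
    b"<iati-activity><title>Example</title>"
    b"<iati-identifier>XM-1</iati-identifier></iati-activity>")

print(schema.validate(in_order))  # True: all elements present, in the expected order
print(schema.validate(swapped))   # False: same data, different order
```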

I don’t think making it impossible to access schema-invalid data through the IATI Datastore shifts any problem from a consumer to a producer. At the moment, it just makes it much more difficult for the consumer to access the data (even if it’s just a question of one element in one activity in one file being in the wrong order). If publishers quickly resolved data validation issues, that would be fine. However, the evidence suggests that around 10% of publishers have invalid files, and the number has remained fairly stable for the last three years – see these charts.

As various people have mentioned, one way of squaring this circle might be for publishers to be automatically notified (or politely bombarded) when their data fails validation.

If you’re a publisher reading this thread – you can sign up for alerts from IATI Canary!

2 Likes

I agree. I’m not an XML expert, but isn’t there another way of checking mandatory fields without ordinality?
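
One possible order-insensitive approach (just a sketch, and not a claim about how the official IATI Validator works) is to treat “mandatory” as a presence check applied on top of a more permissive schema, for example:

```python
# A sketch only: check for the presence of mandatory elements with
# rule-style assertions, independent of where they appear in the activity.
from lxml import etree

# Example element names only; the real list of required fields is defined
# by the IATI Standard, not by this snippet.
MANDATORY = ["iati-identifier", "title", "reporting-org"]

def missing_mandatory(activity):
    """Return the mandatory child elements missing from an iati-activity."""
    return [name for name in MANDATORY if not activity.findall(name)]

activity = etree.fromstring(
    b"<iati-activity>"
    b"<title><narrative>Example</narrative></title>"
    b"<iati-identifier>XM-DAC-12345-1</iati-identifier>"
    b"</iati-activity>")

print(missing_mandatory(activity))  # ['reporting-org'] - order never matters
```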

1 Like

Flagging this from earlier in this thread:

I checked again today, and the number of schema-valid activities in schema-invalid datasets is now 74,752. It’s possible to validate at activity level and still provide access to raw XML, by excluding the invalid activities.
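
For reference, a rough sketch of how activity-level validation could work: wrap a copy of each iati-activity in a minimal iati-activities envelope (reusing the dataset’s root attributes, e.g. @version) and validate that envelope on its own. The schema filename below is a placeholder for a local copy of the IATI activities XSD.

```python
# A rough sketch of activity-level validation, not the Datastore's actual code.
import copy
from lxml import etree

schema = etree.XMLSchema(etree.parse("iati-activities-schema.xsd"))  # placeholder path

def count_valid_activities(dataset_path):
    """Count the activities in a file that are schema-valid in isolation."""
    root = etree.parse(dataset_path).getroot()
    count = 0
    for activity in root.findall("iati-activity"):
        # Reuse the dataset's root attributes (e.g. @version) on the envelope.
        envelope = etree.Element("iati-activities", attrib=dict(root.attrib))
        envelope.append(copy.deepcopy(activity))  # copy, so the original file is untouched
        if schema.validate(envelope):
            count += 1
    return count
```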

3 Likes

From a user perspective: if there is activity-level validation and some activities are dropped for failing it, it would be great to have an unmissable notification that the activities shown are not everything the publisher intended to publish, because users quite often take the existence of data for a publisher to imply a comprehensive dataset from that publisher.

3 Likes

This is a huge number of activities that should reside in the Datastore. Let me check what the effort would be to validate at activity level. Perhaps @AlexLydiate could chime in?

Bumping this as it’s been dormant for a month and needs urgent attention imo.

I would like to revisit the initial IATI community consensus on the original RfP specification for the Datastore (I’m no longer sure when or where it was reached) on the matter of blocking non-schema-valid datasets from entering the new datastore.

With the knowledge we all have today, and with both the 2020 validator and datastore moving towards production, I propose that the IATI community act on @bill_anderson’s suggestion and amend that consensus: move away from dataset-level schema validation towards activity-level schema validation. The originally proposed requirement omits huge amounts of data from the new datastore and is not in line with the IATI strategy of data use and data quality. Activity-level validation is a more granular approach and will ensure that all schema-valid IATI activities are stored in the Datastore, which is currently not the case and never will be if the current consensus remains in place.

If we make this amendment IATI will massively increase the availability of IATI data.

The current policy of blocking every activity in a dataset that fails schema validation – even when 999 out of 1,000 of its activities are themselves schema valid(!) – seems unsustainable and should be revisited urgently.

cc @bill_anderson @IATI-techteam @stevieflow @Mhirji @matmaxgeds @markbrough @andylolz @David_Megginson @Herman @rolfkleef

Could I suggest that anyone objecting to this proposal speak up? Otherwise, silence should be taken as community consent for this amendment, allowing the secretariat to take the necessary steps to make it happen.

1 Like

Agree with you, Mark: data order should be irrelevant. Not so sure about other examples such as missing transaction type codes, currency codes, etc. Those activities should i.m.o. not be in the data-store (these are the really inedible rotten bananas).

Accepting files with schema errors and doing ‘activity level’ validation only would make file-level schema validation unnecessary.

But then the question is how you are going to do ‘activity level’ validation. If this is done only as part of the data-store ingestion process (validating and cleaning IATI data there), it would mean that every existing IATI XML consuming application would be forced to use the data-store if it wanted validated IATI data.

The data-store would become the only source of validated IATI data, since validating the publisher’s raw IATI data against the XSD would lose its meaning. This is i.m.o. only acceptable if the data-store also provides fully IATI standard compliant XML output (which validates against the XSD), with just the erroneous activities removed.
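
For what it’s worth, that kind of output looks technically straightforward. A self-contained sketch (again with a placeholder schema path, and assuming any dataset-level problems have already been dealt with) that keeps the publisher’s document intact apart from the activities that fail in isolation:

```python
# A sketch only: emit the publisher's document minus the activities that
# fail schema validation on their own, leaving everything else untouched.
import copy
from lxml import etree

schema = etree.XMLSchema(etree.parse("iati-activities-schema.xsd"))  # placeholder path

def write_filtered_dataset(in_path, out_path):
    tree = etree.parse(in_path)
    root = tree.getroot()
    for activity in root.findall("iati-activity"):
        envelope = etree.Element("iati-activities", attrib=dict(root.attrib))
        envelope.append(copy.deepcopy(activity))
        if not schema.validate(envelope):
            root.remove(activity)  # drop only the erroneous activities
    tree.write(out_path, xml_declaration=True, encoding="utf-8")
```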

Correct, it can still stay in place, but for datastore entry the policy would be revised towards activity level.

Validator and Datastore will update their (technical exchange) arrangements to fit this amendment.

Hold on. Neither the Validator nor the Datastore will clean IATI data. This thread is about increasing IATI data availability by moving towards activity-level schema validation.

This is already the case today, since only schema-validated datasets arrive in the datastore.

Correct, datastore will serve fully IATI standard compliant XML output (next to CSV and JSON formats).

But to get back on topic a bit: do you have fundamental objections to the proposed amendment, i.e. moving to schema validation at activity level?

I do not object to activity-level schema validation as long as the publisher is responsible for doing it. The above proposal shifts this responsibility from the publisher to the IATI data-store, thereby removing any incentive to act on bad data. The problem with that is that it lets the publisher off the hook. Providing usable data will become a more technical responsibility.

Ultimately this proposal is about who is responsible for the quality of the data in the IATI ecosystem. The proposal partly shifts that responsibility, and I am not sure that in the long term the benefits will outweigh the costs. Reading the thread I see different viewpoints on this. In addition, I would say that this should not be decided solely by the tech-team and the members of the community who are not on vacation right now. It warrants a broader audience (maybe an IATI WG?).

Nonetheless this should not be a showstopper for the further development of the data-store now. We can always decide to add this functionality in the next version of the data-store.

Correct me if I am wrong, but the raw XML at the publisher’s URL, validated against the XSD, is i.m.o. the primary source of validated (or rejected) IATI XML.

The reason for this amendment is that in some cases (as pointed out by @bill_anderson earlier) an error in one part of a single transaction results in all the other (schema-valid) activities being rejected under the current policy.

We all agree publishers are responsible for good data, that’s not really the point for this amendment.

This is about getting all schema valid activities available out to users to use. That’s the point of having a datastore in the first place. Current policy omits clearly valid data and therefore this policy must be revised.

This amendment does not in any way shift responsibility from the publisher to the IATI data-store; it merely feeds into the datastore schema-valid activities that are actively being rejected under the current policy.

Not sure exactly what changing to activity-level schema validation would entail, but from the datastore’s perspective the work looks minimal. I’d argue that rejecting good IATI data is far worse.

Sure, we’re in the middle of summer, so no immediate action is required. But actively blocking schema-valid activities from the datastore is not a healthy data policy and should no longer be considered.

It implicitly does, i.m.o., because you will change the content of the publisher’s IATI XML (by omitting bad activities) without the publisher being aware.

What I meant here was not the cost of the technical implementation, but the cost of increasingly sloppy publisher data caused by silently omitting bad activities from a publication.

It would i.m.o. be better to provide feedback to the publisher that they have data quality problems instead of processing the XML anyway. My experience is that data quality improves when providing feedback.

Could you provide (automated) feedback to publishers when activities are rejected and inform the publisher that their IATI data will not be available in the data-store until they correct the errors?

(B.t.w., did anyone in the above examples try providing feedback to the publishers? If so, what was the response?)

Maybe I am not familiar enough with the data-store, but I was not able to retrieve the original publisher’s XML with all data elements. Is there an endpoint for that?

I’m late to the party, @siemvaessen, but I’m in full agreement. The packaging of IATI activities is arbitrary (a data provider could choose anything from one XML file per activity to one XML file for all their activities, ever), so an error in one activity in the package shouldn’t block the others from the data store.

4 Likes

@Herman I disagree with much of your approach. As an increasingly heavy user** (and champion) of IATI data I want access to as much usable data as possible. That’s what I expect from the datastore. Being told that I can’t have access because the datastore is on a mission to produce ‘pure’ data won’t wash.

In my particular use case my biggest problem is ensuring that both geographic and sector percentage splits add up to 100 so that I can do reliable transaction arithmetic. There’s a whole load of other QA issues that I’m not the slightest bit interested in. I would rather deal with my particular problem myself if I know that the Datastore is doing as little as possible to get in the way.

This has got nothing to do with letting the publisher off the hook. That’s got nothing to do with me (or the datastore). If we can get useful information based on usable data in front of a lot of people (not just those responsible for supply chain accountability) the incentives for publishers to improve their data will far outweigh our moral sticks.

(** For the record I have nothing to do with the supply side or governance any more)

Thanks to everyone for engaging with this conversation.

The Tech Team is currently focussing on ensuring that the original TOR are delivered as we prepare for launch. Any consideration of requirements outside of the current TOR will be led by the Tech Team after launch; we will engage with the community to make sure this complex issue in particular is fully explored. In the meantime, the Tech Team is contacting publishers with schema invalid files this week to urge them to address their issues.

Thanks again for all your good input above; we look forward to discussing this further after we launch.

2 Likes

I understand @IATI-techteam’s position above, so this is in no way an attempt to contradict that timeline. I’m putting it here for posterity, for when this conversation re-opens after the initial launch.

I’m in agreement with @bill_anderson, @David_Megginson and others on this. I am routinely asked why IATI doesn’t line up with official statistics, or why expected IATI data is missing, or why we can’t trust IATI data. In my opinion it isn’t satisfactory to say that a publisher fluffed one number in a dataset of thousands of activities and therefore all of that data is inaccessible, and that this is by design.

The arbitrary removal of valid data will undermine trust in IATI, frustrate both users and publishers, and exacerbate existing narratives about the viability of the entire corpus of data we work to produce.

Regarding @Herman’s concern about the onus moving away from the publisher: I understand this principle, but at the moment we need pragmatism. I would be happy with a number of measures to re-establish that onus that don’t involve removing access to valuable data.

For example:

  1. We could institutionalise IATI Canary, making its use a basic prerequisite for participation.
  2. We could take this further by publishing response times to data validation issues, and possibly push for this to be a relevant metric in future transparency evaluations such as the Aid Transparency Index or the IATI Dashboard.
  3. We could include a flag within the XML to denote validity, and put garish, unsightly banners on relevant D-Portal pages or other presentation sites to make it clear that there are validation issues.
  4. We could celebrate the rapid engagement with and resolution of data validation issues in newsletters and official communications (if the publisher consents).
  5. We could have a public ‘caution’ list of publishers with invalid data.

I’m not seriously suggesting all of these, and some of them might seem extreme, but for me they are all sensible* compared to removing an unknown quantity of valid data from the one official data store.


*To add some numbers to this sentiment (see workings here):

  • There are currently ~982k activities.
  • If we take the publisher stats and add an activity-to-file ratio value, we can see that the top 25 publishers by number of activities published account for ~814k activities, about 82.89% of the total.
  • These activities are split amongst 2,234 files (meaning a total activity-to-file ratio of 364 among them).

The median activity-to-file ratio among them is 530, and the arithmetic mean is 1,657. The gap between the two is down to our top five activity-to-file publishers:

  • GlobalGiving.org
  • UN Pooled Funds
  • Norad - Norwegian Agency for Development Cooperation
  • Food and Agriculture Organization of the United Nations (FAO)
  • The Global Alliance for Improved Nutrition

Together these five account for 38,000 activities spread between 5 files.

Going back to our top 25 publishers by activity count, it’s fairly clear that a single validation error in any of these publishers’ files would mean a serious loss of valid IATI data.

If GlobalGiving have one missing sector field or other schema error, we could lose nearly 2% of all IATI data, pertaining to nearly 10,000 organisations.

EDIT: changing the sector example as per @JoshStanley’s correction.

4 Likes

Just to be clear, data quality issues such as sector percentages not adding up to 100 will not prevent the dataset from being ingested by the Datastore, as this is a Standard rule (a must), rather than something that is dictated by the Schema.
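
To illustrate the distinction: a percentage-sum check is the kind of ruleset test that sits on top of schema validation rather than inside the XSD. A simplified sketch (ignoring sector vocabularies for brevity, so not the exact Standard rule):

```python
# A simplified sketch of a ruleset-style check; the XSD knows nothing of this.
from lxml import etree

def sector_percentages_ok(activity, tolerance=0.01):
    """True if declared sector percentages sum to ~100, or none are declared."""
    values = [float(s.get("percentage"))
              for s in activity.findall("sector")
              if s.get("percentage") is not None]
    return not values or abs(sum(values) - 100.0) <= tolerance
```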

2 Likes

Thanks for clarifying Josh - I’ve changed the example to reflect that :slight_smile:

2 Likes

@rory_scott and @bill_anderson I understand the need to have as much data available as possible. As mentioned before, I am not against activity-level schema validation for ingesting XML data into the data-store as such, provided there is an active feedback mechanism to the publisher (active meaning that the publisher does not need to take any action to be informed about the data quality issues).

@rory_scott proposes a number of interesting feedback mechanisms, to which I would like to add one: sending an e-mail to the e-mail address provided at activity level (iati-activities/iati-activity/contact-info/email), or, if there is no such e-mail address, to the publisher’s contact e-mail address as stored in the registry.
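
A small sketch of that fallback; the registry_email argument stands in for a lookup against the Registry, which is not shown here:

```python
# A sketch only: pick a feedback address for a given activity.
from typing import Optional
from lxml import etree

def feedback_address(activity, registry_email: Optional[str] = None) -> Optional[str]:
    """Prefer the activity's own contact-info/email; otherwise fall back to
    the publisher's contact e-mail as recorded in the registry."""
    email = activity.findtext("contact-info/email")
    if email and email.strip():
        return email.strip()
    return registry_email
```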

I object to any solution which would silently skip activities being processed without any notification to the user or the publisher of the data. Users will be kept in the dark about the completeness of the data and publishers will be kept in the dark about the quality problems in their data.

One last thought: if a large publisher has just a few tiny errors among the many activities published, why not simply contact that publisher and ask them to correct the problem? I.m.o. it is this lack of active engagement between data users and publishers that causes a great deal of these problems.

4 Likes