Data retention on the Data Store

(Siem Vaessen) #1

The Data Store depends on whatever is available on the IATI Registry. We have come across a use-case with @Wendy on the Humanitarian Data Portal (Grand Bargain) where data is not available in the Data Store once it becomes unavailable in the Registry.

While some publishers remove their data on purpose, sometimes data simply becomes unreachable and is therefore removed from the Datastore.

What retention policy should we consider here? And should the Registry have a publisher ‘kill-switch’ which, when set to true, tells other services down the data pipeline (the Datastore etc.) to remove the data, subject to a retention policy of, say, 30 days?
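To make the idea concrete, here is a minimal sketch of how a downstream service might combine such a flag with a retention window. The field names (`retention_kill_switch`, `last_seen`) and the 30-day window are illustrative assumptions, not actual Registry metadata:

```python
from datetime import datetime, timedelta

RETENTION_DAYS = 30  # illustrative retention window

def should_purge(dataset, now):
    """Decide whether a downstream tool (e.g. the Datastore) should
    remove a dataset. `retention_kill_switch` and `last_seen` are
    hypothetical fields, not real Registry metadata."""
    if dataset.get("retention_kill_switch"):
        # Publisher explicitly requested removal: purge immediately.
        return True
    last_seen = dataset.get("last_seen")
    if last_seen is None:
        return False
    # Otherwise purge only once the data has been unreachable for
    # longer than the retention window.
    return now - last_seen > timedelta(days=RETENTION_DAYS)

# Unreachable for 31 days, no kill switch set: purge.
print(should_purge({"retention_kill_switch": False,
                    "last_seen": datetime(2019, 7, 29)},
                   datetime(2019, 8, 29)))  # → True
```

The point of the sketch is that the kill-switch and the retention window answer different questions: the former signals deliberate removal, the latter guards against transient unavailability.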

(Samuele Mattiuzzo) #2

Wouldn’t removing the data from the Datastore and other tools also affect how we report on historical data, since there would be gaps where data used to be?
Although I think this would be the case anyway whenever a publisher takes their file offline.

That aside, I would say a notification system that tells all the tools the data has been removed by the publisher is what we should aim for, so that we can ensure consistency across tools.

(Matt Geddes) #3

As far as I am aware, IATI doesn’t have any approach to historical data; it is up to publishers to decide whether they want to keep historical data available.

My assumption would be that if data is removed or becomes unreachable, it should immediately disappear from the Datastore, as there is no way of knowing whether this was an error or on purpose. I also suspect that many publishers would ignore a kill-switch.

If the DS kept the data in, and the publisher then replaced it with another file, there would be issues knowing which one was the right one.

(Samuele Mattiuzzo) #4

Ah, I hadn’t thought about the edge cases of mistakes, publishing the wrong file, or similar errors (in which case retaining faulty, broken data becomes detrimental).

I was referring more to data that was valid, was once used for calculations, and then gets removed by the publisher two or three years later. That would fundamentally change previous reports, and possibly reports based on comparisons as well (since the comparable data would have been removed).

Maybe there needs to be a way to clarify which datasets were actually published with mistakes and thus need removal, and which data can stay but has been withdrawn or is no longer relevant for certain reasons?

(Matt Geddes) #5

@samuele-mattiuzzo - totally with you on the need to think about, and decide, whether IATI should keep historical data (and what that would involve). But I think this must have been discussed 10 years ago and probably needs solving in that context, not in the context of a problem it causes for the DS, as the implications are far wider. For example, if something was published as v1.03 and the Secretariat stored it for historical purposes, would it then be lost when v1 is deprecated (in which case only part of history is being kept), or would the Secretariat republish it in v2? I guess all these difficulties are why the current situation exists 🙂

(Bill Anderson) #6

Splitting this in three:

  1. As far as I am aware the current position remains that the Datastore is an organised interface to the registry. It is not a curated database. If a publisher removes any data from the registry it gets removed from the datastore. This is simple. The Datastore does not hold data (historical or otherwise) against the wishes of a publisher.
  2. Ascertaining whether a publisher has removed data or whether it is accidentally, temporarily unavailable is the challenge that @siemvaessen focuses on above, and that does require some rules that I don’t think have been clearly expressed anywhere.
  3. As per 1 above I don’t think it is the responsibility of the Datastore to perpetually cope with deprecated versions of the standard - where a publisher has neither removed nor refreshed their data - but this is perhaps best dealt with as a separate, validation issue.
(Siem Vaessen) #7

  1. Agree.
  2. Correct, IATI needs to define a ruleset for this.
  3. Agree.

(Rolf Kleef) #8

My guess would be this might involve the 104 FTS files with URLs that have gone missing since July 30th?

As the new IATI Validator currently still uses its own independent harvesting backend (integration with the Datastore is underway), we use the “last seen alive” version of files (from July 29th).

This allows for reasonably robust availability of data independent of availability of source files (or even the Registry).

A “publisher kill switch” would work in one of two ways:

  1. Remove the dataset from the Registry: in that case, the dataset will disappear from the Validator at the next successful refresh from the Registry (currently scheduled every 3 hours).

  2. Remove an activity (or all data) from the data served via the URL of the dataset: in that case, the specific activity will disappear after the next successful file refresh and processing (currently scheduled every 8 hours).
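The “last seen alive” fallback underlying both options can be sketched roughly as follows; the function, the injected fetcher, and the cache are illustrative, not the Validator’s actual code:

```python
# Sketch of the "last seen alive" approach described above: serve the
# most recent successfully fetched copy whenever a source file is down.
def refresh_dataset(url, fetch, cache):
    """Fetch a dataset via `fetch(url)`; on failure, fall back to the
    cached last-good copy (or None if never seen alive)."""
    try:
        data = fetch(url)
    except OSError:
        # Source file (or the whole Registry) unreachable: keep the
        # "last seen alive" version instead of dropping the data.
        return cache.get(url)
    cache[url] = data  # remember this as the new last-good copy
    return data
```

With a scheme like this, a dataset only truly disappears once it is removed from the Registry itself (option 1) or the publisher serves a file without it (option 2); transient outages are absorbed by the cache.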

(Andy Lulham) #9

Happily, such a notification system exists:

(Siem Vaessen) #10

Back to my original subject: data retention. Currently, no policy (rules) on the matter is in place. What is the best way forward to introduce such a policy, and who needs to be involved? What is the mechanism to start this?

(Andy Lulham) #11

I’m in favour of consistency between tools.

So for instance: d-portal has a policy on this (@shi and xriss can elucidate). The validator also has a policy (as @rolfkleef outlines above) which sounds very similar to the one d-portal uses. I think (but would need to check) that the existing datastore policy is also similar.

So there’s already a precedent here. It would seem reasonable for the new datastore to follow the precedent already set (see: Principle of least astonishment).

(Siem Vaessen) #12

Sure, but we need a formal rule, not informal precedent. Is there any audit trail of IATI discussion on this? It needs to be discussed outside the realm of 2-3 tools, seeing as it touches on overall data consistency.

And it also does not provide a solution for data accidentally becoming unavailable on the Registry while some other tool (a website, for example) depends on another tool (the Datastore, for example), and so on down the chain.

(Andy Lulham) #13

Absolutely agree. I suppose a rule on this should be proposed, agreed, and then formally conveyed (e.g. via a page on iatistandard.org).

I don’t see anything in the draft ToR about it. This was certainly discussed as part of the Manchester developer workshop last year, but I’m afraid I’m not able to find documentation of that.

(shi) #14

Thanks, @andylolz - yes, our process is similar to the validator:

  1. If a file download errors out, we use the last successful download. (This mostly deals with intermittent network errors.)
  2. If the Registry looks suspiciously empty, we skip the Registry update and use the previous version.

Step 2 was added when the Registry hiccuped and emptied itself.
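A rough sketch of step 2, assuming a 50% shrinkage threshold (the actual threshold d-portal uses is not stated here):

```python
def safe_registry_update(new_datasets, previous):
    """Accept the new Registry dataset list only if it has not shrunk
    drastically compared with the previous snapshot. The 50% cut-off
    is an illustrative guess, not d-portal's actual rule."""
    if previous and len(new_datasets) < 0.5 * len(previous):
        # Registry looks suspiciously empty: keep the old snapshot.
        return previous
    return new_datasets
```

This kind of guard trades freshness for robustness: a genuine mass removal would be delayed until someone confirms it, but an empty Registry response can no longer wipe a tool’s data.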

(Matt Geddes) #15

I am not convinced that the issue of data being accidentally or temporarily unavailable in the Registry should be solved via the Datastore, as that would mean only tools that use the Datastore as their data source benefit from it. If everyone needs this feature, perhaps it would be better inserted into the current ToR for the revisions to the Registry; then it would be solved once (and with only one ruleset) for everyone.

(Siem Vaessen) #16

Me neither; I never made that claim either. Again: this topic tries to address the lack of a data retention policy and how to formalise one. Who within IATI is responsible for shaping this policy? Perhaps @WendyThomas could shed light on the matter?

Where could I find that ToR revisions plan for the Registry?
