Discussion

Data retention on the Data Store

Siem Vaessen • 13 November 2019

The Data Store depends on whatever is available on the IATI Registry. We have come across a use-case with Wendy Rogers on the Humanitarian Data Portal (Grand Bargain) where data is not available in the Data Store once it becomes unavailable in the Registry.

While some publishers remove their data on purpose, sometimes data simply becomes unreachable and is therefore removed from the Datastore.

What retention policy should we consider here? And should the Registry have a publisher ‘kill-switch’ that could tell other services down the data pipeline (Datastore etc.) to remove data when it is set to True, in combination with a retention policy of, say, 30 days?

Comments (12)

Samuele Mattiuzzo

Wouldn’t removing the data from the Datastore and other tools also affect how we can report on historical data, since there would be gaps/missing pieces that were there in the first place?
Although I think this would be the case anyway whenever a publisher takes their file offline.

That aside, I would say we should aim for a notification system that tells all the tools the data has been removed by the publisher, so that we can ensure consistency across tools.

Andy Lulham
samuele-mattiuzzo:

Above aside, I would say a notification system that tells all the tools that the data has been removed by the publishers is what we should aim for, so that we can ensure consistency across tools

Happily, such a notification system exists:
IATI Canary

An early warning system for IATI publishers.

matmaxgeds

As far as I am aware, IATI doesn’t have any historical approach; it is up to the publishers to decide whether they want to keep historical data available.

My assumption would be that if data is removed or becomes unreachable, it should immediately disappear from the Datastore, as there is no way of knowing whether this was an error or on purpose, and I suspect that many publishers would ignore a kill-switch.

If the DS kept it in, and the publisher then replaced it with another file, there would be issues knowing which one was the right one, etc.

Samuele Mattiuzzo
matmaxgeds:

My assumption would be that if data is removed/becomes unreachable then it should immediately disappear from the datastore as there is no way of knowing whether this was an error, or on-purpose as I suspect that many publishers would ignore a kill-switch.

Ah, I had not thought about the edge cases of mistakes, publishing the wrong file, or similar generic errors (in which case retaining faulty, broken data becomes detrimental).

I was referring more to data that was valid and was once used to draw calculations, and then gets removed by the publisher 2-3 years afterwards. That would fundamentally change previous reports, and possibly reports based on comparisons as well (as the comparable data would have been removed).

Maybe there needs to be a way to clarify which datasets were actually published with mistakes and thus need removal, and which data can stay but has been withdrawn or is no longer relevant for certain reasons?

matmaxgeds

Samuele Mattiuzzo - totally with you on the need to think/decide whether IATI should keep historical data (and what that would involve), but I think it must have been discussed 10 years ago and probably needs solving in that context, not in the context of a problem it causes for the DS, as the implications are far wider. Also, for example, if something was published as v1.03 and the Secretariat stored it for historical purposes, would it then be lost when v1 is deprecated (in which case only part of history is being kept), or would the Secretariat republish it in v2? I guess all these difficulties are why the current situation exists.

Bill Anderson

Splitting this into three:

  1. As far as I am aware the current position remains that the Datastore is an organised interface to the registry. It is not a curated database. If a publisher removes any data from the registry it gets removed from the datastore. This is simple. The Datastore does not hold data (historical or otherwise) against the wishes of a publisher.
  2. Ascertaining whether a publisher has removed data or whether it is accidentally, temporarily unavailable is the challenge that @siemvaessen focuses on above, and that does require some rules that I don’t think have been clearly expressed anywhere.
  3. As per 1 above I don’t think it is the responsibility of the Datastore to perpetually cope with deprecated versions of the standard - where a publisher has neither removed nor refreshed their data - but this is perhaps best dealt with as a separate, validation issue.

Rolf Kleef

My guess would be this might involve the 104 FTS files with URLs that have gone missing since July 30th?

As the new IATI Validator currently still uses its own independent harvesting backend (integration with the Datastore is underway), we use the “last seen alive” version of files (July 29th).

This allows for reasonably robust availability of data independent of availability of source files (or even the Registry).

The “publisher kill switch” would be one of two options:

  1. Remove the dataset from the Registry: in that case, the dataset will disappear from the Validator at the next successful refresh from the Registry (currently scheduled every 3 hours).

  2. Remove an activity (or all data) from the data served via the URL of the dataset: in that case, the specific activity will disappear after the next successful file refresh and processing (currently scheduled every 8 hours).

shi

Thanks, Andy Lulham - yes, our process is similar to the validator:

  1. If a file download errors out, we use the last successful download. (This mostly deals with intermittent network errors.)
  2. We run a sanity check: if the Registry looks suspiciously empty, we skip the Registry update and use the previous version.

Step 2 was added when the Registry hiccuped and emptied itself.
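The two-step fallback described above could be sketched roughly like this (the function and parameter names here are illustrative assumptions, not the actual Datastore code):

```python
# Hypothetical sketch of the two-step fallback logic described above.
# `fetch` stands in for whatever performs the actual download.

def fetch_with_fallback(fetch, last_good, min_expected=1):
    """Return freshly fetched datasets, or fall back to the last
    successful result on errors or suspiciously empty responses."""
    try:
        datasets = fetch()
    except Exception:
        # Step 1: the download errored out, so reuse the last
        # successful download (covers intermittent network errors).
        return last_good
    if len(datasets) < min_expected:
        # Step 2: the Registry looks suspiciously empty, so skip
        # this update and keep the previous version.
        return last_good
    return datasets
```

The key design point is that a failed or empty refresh never overwrites known-good state; the previous snapshot is only replaced once a fetch both succeeds and passes the sanity check.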

matmaxgeds

I am not convinced that the issue of data being accidentally/temporarily unavailable in the Registry should be solved via the Datastore, as that means only tools that use the Datastore as their data source benefit from it. If everyone needs this feature, perhaps it would be better inserted into the current ToR for the revisions to the Registry; then it would be solved once (and with only one ruleset) for everyone.

IATI Technical Team

The IATI Technical Team has now written guidance for data owners that sets out the process to follow when they choose to remove data published to the IATI Standard. Please note that this is separate from data retention, which sets policies on, for instance, how long a dataset is retained in a specific application. We will look at retention in the future; for now, the priority was developing data removal guidance. Please do have a look at the IATI data removal guidance Discuss post and let us know if you have any points needing clarification.
