Data divergence between registry and datastore

(Matt Geddes) #1

Thanks very much @KateHughes for the quarterly update perhaps worth posting on the forums too?

@IATI-techteam In it I learnt that the new datastore will access data from the validator API, not directly from the registry. This suggests that the registry and the datastore could end up with different data i.e. a query via the registry to see how many active activities are there in country X could be different from the same query via the new datastore? Is this correct or have I missed something? If so, is one of them (registry API, validator API, datastore API) going to be ‘official’?

(Herman van Loon) #2

As far as I understand it (but please correct me if I am wrong), there can only be a timing difference since the data validator needs some time to do the back-end processing of the IATI files from the registry. Imo the registry is the source of the actual status of all data and is therefore ‘official’.

The same reasoning can also be applied to all other products making use of the registry (e.g. D-portal).

(Matt Geddes) #3

thanks @Herman that is really helpful, and would be great if confirmed. I was worrying that it might mean some data available in the registry was being rejected by the validator and therefore would not be present in the datastore.

(Herman van Loon) #4

As far as I understand it from previous discussions, that can be the case if the data does not validate against the XSD. But the same data will also be unusable when directly retrieving them based on the info from the registry.

And it also good to remember that the registry itself does not contain data. It just refers to the URL’s where IATI publishers place their datasets. From that point of view there can not be a difference between the DS and the registry (since the registry is not a database). Basically the registry accepts any kind of rubbish. By definition the DS, being a proper database, can not do that.

(Siem Vaessen) #5

@IATI-techteam was over in ZZ office last week and spent a day with @rolfkleef discussing options on how the new DS (Datastore) will communicate with Validator. Our focus was on how to deliver this with minimal impact in mind on both DS and Validator.

One part of the ETL process in DS basically reviews XML files processed from the Registry into DS and decides if it is valid XML. Once declared valid and having identified its IATI version it will then push that file to the relevant version based parser.

Last week we identified this part of the process of DS to be most relevant for the new IATI Validator to be activated. The concept will define a system integration around this. This means the DS will not use its internal validator, but will communicate with new IATI Validator as a 3rd party service / API. Essentially DS and Validator will be in constant communication on data validation and data handling. Non valid XML will never pass the validator test, only XSD validated data will pass and rulesets will be used to review data and depending on the ruleset outcome, an activity for example will be flagged in the DS on activity level, true/false. DS will then by default return all activities with a filter/parameter being introduced for API request only containing activities flagged true or false.

In this system, the Validator will no longer look at the IATI Registry for data sources, but will look-up the /api/datasets endpoint in DS, review its sha1 signature (new or same) and will then decide to process a file etc. Once we have a final systems integrations docs on this process we can share it here.

(Herman van Loon) #6

Hi Siem,
If I understand your message correctly, the DS will use the registry to retrieve all the raw IATI-XML and the validator will subsequently retrieve the raw IATI-XML from the DS?

(Siem Vaessen) #7

Hi @Herman that is correct. We drew up other options (both DS and Validator using Registry as source), but they had provided rather challenging (= time consuming) synchronisation issues in between DS and Validator.

(Matt Geddes) #8

This suggests to me that we should consider closing the registry API to public requests (leaving it only open to the datastore), otherwise people who directly used the registry as the source of IATI data (and put the effort into working with files that do not pass the XSD) would get different results to the same query via the datastore.

As I understand it, the /api/datasets endpoint in the datastore will provide all the same functionality for querying the registry that we had before anyway.

My concern is that IATI is in danger of splintering as a data source and that this will seriously damage its validity to end users.

(Kate) #9

You raise a valid point, the role of each tool within the system needs to be fully explored. There will be future posts from the Tech Team on this as the integration of the tools is worked out in detail. We’re going through our tools and working out their usage, use cases and what the future holds for them. As many things have dependancies on other things, this is something we don’t want to hurry, but instead want to build up step by step.