Hi @YohannaLoucheur ,
Thank you for this suggestion. I actually think the governance side is the only viable avenue for substantial progress on this topic. Some progress was made on releasing XML codelists, as evidenced by the existence of the (now very outdated) files mentioned by @stevieflow above (see the penultimate link here), but it seems that this went cold in 2016.
For me, the hierarchy of preference for solutions to this is as follows (high preference to low):
- There's a proper API for CRS++, meaning that codelists are accessible through queries, and have proper metadata (lots of investment, but would reap huge benefits for CRS over time);
- Codes are stored primarily as structured data, and then presented in graphical formats as needed. In this case, XML files can be used as the authoritative source, and .xls files are generated from those XML files (shouldn't be that difficult, and could bring lots of short-term benefit by making it possible to introduce changelogs, notifications, etc.);
- There is at least a commitment to update the structured data version of the codelists on a predictable and frequent basis (unless the original file was compiled entirely manually, this should be very easy - just pressing 'go' on whatever script made the last one).
In all of these cases, three things are completely necessary:
- A changelog;
- A real commitment to never reuse codes that have been deprecated;
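To make the changelog and no-reuse points concrete, here is a minimal sketch of how both could be checked mechanically once two releases of a codelist are available as structured data. It assumes each release has already been loaded into a plain {code: name} mapping; the function name, the `retired` set and the example codes are all made up for illustration and are not real CRS values.

```python
def codelist_changelog(old, new, retired=frozenset()):
    """Compare two {code: name} mappings and describe what changed.

    `retired` is a set of codes withdrawn in earlier releases (an assumed
    input; someone would have to maintain it).
    """
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    # Codes present in both releases whose name has changed.
    renamed = sorted(c for c in set(old) & set(new) if old[c] != new[c])
    # Newly added codes that were previously withdrawn: these break the
    # "never reuse a deprecated code" commitment.
    reused = sorted(c for c in added if c in retired)
    return {"added": added, "removed": removed,
            "renamed": renamed, "reused": reused}


# Made-up example data, not real CRS codes.
old = {"100": "Alpha", "200": "Beta", "300": "Gamma"}
new = {"100": "Alpha", "200": "Beta (revised)", "400": "Delta", "050": "Epsilon"}
changes = codelist_changelog(old, new, retired={"050"})
```

A non-empty "reused" list would be the signal that a deprecated code has come back, which is exactly the case downstream users cannot detect on their own.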
I have tried to contact people whom I believe to be relevant, but with little success.
@markbrough thank you for directing me to your code. I've written some similar scripts in R and Pandas, but the issue is that however sophisticated a script is, there's no guarantee that the next spreadsheet will have the same structure as the one it was written for. For instance, having just cloned your script and updated the URLs (which have changed), I can see that there's been an arbitrary change which stops the script from working at the first hurdle:
Rorys-DI-MBP:IATI-Codelists-NonEmbedded roryscott$ python crs_convert.py
Getting mapping Sector
Traceback (most recent call last):
File "crs_convert.py", line 51, in <module>
#### verbose debugging removed ####
xlrd.biffh.XLRDError: No sheet named <'Purpose code'>
That sheet is now called 'Purpose codes'. Clearly this isn't an insurmountable problem, but what if there is a more subtle difference, one which means that the script runs to completion but produces incorrect output? Things start to become more complicated when we put our trust in non-deterministic procedures.
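One partial defence on the script side is to check the workbook against what the script expects before parsing anything, and to fail with a readable message. The sketch below is only an illustration: `check_sheet_names` is a hypothetical helper, the 'Notes' sheet is invented, and it covers only the loud case of a missing or renamed sheet, not the subtler ones.

```python
import difflib


def check_sheet_names(actual, expected):
    """Raise ValueError if an expected sheet is missing, suggesting near matches."""
    problems = []
    for name in expected:
        if name not in actual:
            close = difflib.get_close_matches(name, actual, n=1)
            hint = f" (did you mean {close[0]!r}?)" if close else ""
            problems.append(f"missing sheet {name!r}{hint}")
    if problems:
        raise ValueError("; ".join(problems))


# With xlrd, `actual` could come from workbook.sheet_names().
# The sheet list here is illustrative only.
try:
    check_sheet_names(["Purpose codes", "Notes"], ["Purpose code"])
except ValueError as err:
    message = str(err)
```

The same comparison could be run on the header row of each sheet, but it still cannot catch a column whose meaning has changed while its name has stayed the same.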
Now consider their XML version:
<Codelist name="Sector" xml:lang="en" category-codelist="SectorCategory" complete="1">
<name>"DAC 5 Digit Sector"</name>
<name xml:lang="fr">Produits chimiques</name>
<description>Industrial and non-industrial production facilities; includes pesticides production.</description>
<description xml:lang="fr">Production industrielle et non industrielle ; y compris fabrication des pesticides.</description>
<!-- more codelist items -->
This would make the lives of myself and other IATI/CRS users significantly easier, and it would also allow a much more responsive and rapid effort to join up CRS codes with others, helping to make data much more interoperable. At a minimum, I could just use a diff-checker to make sure none of the element or attribute names have changed, and we could even introduce a schema to start to standardise.
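As a rough illustration of the diff-checker idea, the sketch below compares the element and attribute names used in two XML releases, ignoring the text content. It uses only the Python standard library; the function names and the tiny example documents are made up for illustration and are not the real codelist files.

```python
import xml.etree.ElementTree as ET


def structure(xml_text):
    """Return the set of (tag, attribute-name tuple) pairs used in a document."""
    root = ET.fromstring(xml_text)
    return {(el.tag, tuple(sorted(el.attrib))) for el in root.iter()}


def structure_diff(old_xml, new_xml):
    """Return (pairs only in the old document, pairs only in the new one)."""
    old, new = structure(old_xml), structure(new_xml)
    return sorted(old - new), sorted(new - old)


# Made-up documents: the second one renames <name> to <title>.
old_xml = '<codelist name="Sector"><item><code>1</code><name>A</name></item></codelist>'
new_xml = '<codelist name="Sector"><item><code>1</code><title>A</title></item></codelist>'
gone, appeared = structure_diff(old_xml, new_xml)
```

If both lists come back empty, the structure a script relies on is unchanged and only the content needs reviewing. A published schema would make this check unnecessary, since the files could simply be validated against it.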
I recognise that there may be serious counter-arguments to the points I'm making, but I would be very interested to start a dialogue about them and to hear what could be achieved from the governance angle.