IATI Identifiers should not be allowed to contain special characters

(Siem Vaessen) #21

From the IATI network perspective yes and no. I understand IATI does not have jurisdiction over their business rules, but I guess leaving this as is will continue to cause issues down the line.

What if IATI -from data quality perspective- picks this up according to tbd upon convention / conversion we can all agree to? As in your option 1 @bill_anderson What would you propose?

(Bill Anderson) #22

If the consensus is Option 1 I would:

  1. Agree Regex rules
  2. Schema (3.01):Add Regex validation to all fields containing org and activity identifiers
  3. Guideline (2.03): Publishers should follow Regex rules
  4. Action: Add Regex rule and guidance to identify-org.net (@TimDavies ok?)
  5. Guideline (2.03): Publishers whose in-house business model involves use of invalid characters should provide a note (in the registry metadata?) on how users may be able to derive the original project id.

(Wendy Rogers) #23

Just to add that re 5) in @bill_anderson post above, publishers can also use the other-identifier element to cross ref the activity to their own internal project identifier.

(Bill Anderson) #24

Good shout. So 5 should in fact be:

5 . Publishers whose in-house business model involves use of invalid characters should record the original identifier in the other-identifier element with @type="A1"

Question: Is this a should or a must?

(Herman van Loon) #25

IATI isn’t a green field anymore. Since changing existing IATI identifiers will break references to organizations and activities of other publishers, I strongly oppose to this change. This change may have a huge impact on existing IATI data users.

As a rule you never change existing business identifiers. I can only think of two exceptions in this case:
1 - An IATI identifier has not been used by any other publisher: so it is safe to change the identifier or
2 - the proposed change is only applied to NEW IATI identifiers. The exiting identifiers are left unchanged.

To estimate to impact of this change it would be nice to have some metrics on the use of IATI identifiers with invalid characters.

(r_clements) #26

In the specific case I was talking about the ID on IATI was actually a compound ID based on data from their internal system and as they were using CSV2IATI to generate their data, so the ID actually didn’t exist in the IATI form on thier system.

They hadn’t realised what they were doing would cause a problem and were happy to change it when they did, so I think that there might be some naivety in the publishing community as to the issues that are being caused by IDs with non url compliant characters.

@bill_anderson - The only addition I would make to your list is that we identify organisations that are currently publishing IDs that have non complaint urls and, if possible, give them a nudge to change the impacted ID(s) to something that’s compliant with urls.

(Mark Brough) #27

Strongly agree with @Herman, @bill_anderson and others on option 2. Proper URL encoding much simpler than getting all publishers to replace characters from their project IDs. Slashes in project IDs sometimes have real meaning, and getting all organisations to implement manually RFC 3986 rather than have libraries that do the same job seems like a recipe for disaster to me.

For example, if the project ID is 2017/123-456, should both the publisher and the implementing partner be told that they need to remember to ignore the slash and turn it into some other character? Clearly that won’t always happen, so tools will always need to handle these characters, so why make people go to any effort? Even the conversation about what to do is complicated and going to add a lot of overhead.

I think we need a clearer explanation of why percent-encoding URL inputs is insufficient before undertaking what would be quite a disruptive step.

(Bill Anderson) #28

I think the point that comes out of this is that data usage is - or should be - , primarily, content-related. Finding the path of least resistance might produce a ‘better’ technical solution that appears to improve data quality (fewer errors), but does it achieve this at the expense of the meaning of the data?