IATI Identifiers should not be allowed to contain special characters

Hello,

We have had a lot of issues lately when trying to develop solutions on top of IATI data when the underlying IATI identifiers contain special characters like ’ and &.

The IATI standard states that identifiers should pass the following regex: [^/&|?]+

I think that this suggestion should be made into a requirement that identifiers must not contain special characters and that any IATI identifier that contains special characters will cause the activity to fail validation.

The standard also states that once an ID has been created that it must not be changed. I think in the case of special characters within IDs, however, that an exception should be made in order to improve the quality of the data contained within the registry.

Ross

Comments (37)

Steven Flower • 7 years ago

Hi r_clements

I agree that such identifiers then “break” uses of the data - especially when an identifier is a part of a URL - a / or a \ can present challenges / break things

One issue in terms of enforcing this is that the current IATI schema does not check such content - for example the format of an identifier. The rules you cite are additional resources - meaning that somebody validating their data via the schema (or the IATI Validator online) may “pass” validation, but be unaware / skip this additional good practice.

It would be great if others could indicate how such a process could become a part of the central resources

Wendy Rogers • 7 years ago

Thanks for raising this issue r_clements and I thought I should add that it is still our intention to enhance the IATI validator so that it carries out all specific validation and content checking (such as this for the activity identifier) as defined as part of the IATI Standard. Unfortunately work on the new validator has had to be paused due to other priorities but we hope to get going with it again in the near future.

Also, you suggest that the activity identifier should not contain any special characters, so could we extend the regex to explicitly cover other characters that should not be used, e.g. ’ $%* etc.? However, I assume that we would want to continue to allow hyphens (‘-’) to be used?

I also agree that whilst it is explicitly stated that an activity identifier once published should not change we may perhaps need to make an exception for when special characters have inadvertently been used. However, I would be interested to get the views of others (and especially data users) on this?

Herman van Loon • 7 years ago

I am not sure the benefits of this change outweigh the costs. When using IATI identifiers in URLs you should always URL-encode the IATI identifier. That solves the problem.

If you still want to exclude special characters from the IATI identifier, this check should be part of the IATI validator, since this is the formal IATI conformance check. I would not change existing IATI identifiers, since that would cause all kinds of problems when relating activities (as we extensively do). The hyphen ‘-’ is a part of the existing guidelines. No reason i.m.o. to change that. Since this is a breaking change, it should be part of the next integer upgrade of the standard, only to be applied to new activities. That would not be trivial to implement though.
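A minimal sketch of the URL-encoding approach, using only the Python standard library. The base URL and the identifier are made up for illustration; the point is that `quote(..., safe="")` also encodes the forward slash, so the whole identifier travels as one path segment and decodes back unchanged.

```python
from urllib.parse import quote, unquote


def identifier_to_url(base_url: str, iati_identifier: str) -> str:
    """Build a URL with the identifier as a single, fully encoded path segment."""
    # safe="" matters: by default quote() leaves "/" unencoded.
    return base_url.rstrip("/") + "/" + quote(iati_identifier, safe="")


# Hypothetical identifier containing two of the problem characters.
original = "XM-EXAMPLE-2017/123&456"
url = identifier_to_url("https://example.org/activity", original)

# The receiving side decodes the last path segment to get the identifier back.
encoded_segment = url.rsplit("/", 1)[1]
round_tripped = unquote(encoded_segment)
```

Here `url` ends in `XM-EXAMPLE-2017%2F123%26456` and `round_tripped` equals `original`, so nothing about the published identifier has to change.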

r_clements • 7 years ago

Thank you to everyone for your thoughts on my initial question.

Wendy Rogers - I think to clarify I would like to say that I had never thought that hyphen would be one of the characters that’s removed as DFID use it in all of our H2 IATI project identifiers and it doesn’t break URLs.

Herman van Loon - The standard currently warns against the use of characters that would break URLs but I don’t think that it goes far enough and they should be explicitly banned from usage. Ideally the IATI validator would check for this but I think that firming up the guidance would be a good start.
The OIPA API / DevTracker ecosystem has been updated so that it returns project identifiers that are stripped of special characters. However, if I want to return raw JSON data from the OIPA API, I am forcing the interface to process low-quality input (e.g. the / character in a project identifier) so that users get the expected project data returned to them.

I honestly believe that this issue is causing problems across the board for people trying to create tools on top of IATI data and means that developers are spending time trying to interpret bad data rather than improve the tools that they are developing.

Herman van Loon • 7 years ago

Hi r_clements ,
Yes I agree that the use of IATI should be as simple as possible in principle. Changing existing identifiers will have considerable impact though in all existing applications referring to activities of other publishers. So this is no trivial change. It would be interesting to know how many publishers are producing activity identifiers with non-standard characters, so we know what the impact of this change would be.

The use of special characters throughout IATI should maybe be considered (such as the use of special characters in the names of IATI files in the registry).

matmaxgeds • 7 years ago

Hi all,

I am not really a programmer but is an alternative not to tweak the IATI standard to enforce (or convert to) the use of character entity references e.g. http://www.theukwebdesigncompany.com/articles/entity-escape-characters.php instead of just the [^/&|?]+ regex so that all characters can continue to be used, and people’s parsers don’t break? E.g. “/” would be “&#47;”. This seems to me one of the advantages of raw IATI data being machine readable.
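To make the character-reference idea concrete, here is a small sketch in Python. The helper name and the set of reserved characters are assumptions for illustration, taken from the characters in the regex above.

```python
import html


def to_numeric_references(text: str, reserved: str = "/&|?") -> str:
    """Replace each reserved character with its decimal character reference."""
    return "".join(f"&#{ord(ch)};" if ch in reserved else ch for ch in text)


escaped = to_numeric_references("2017/123-456")  # "2017&#47;123-456"
# html.unescape resolves numeric references back to the original characters.
restored = html.unescape(escaped)                # "2017/123-456"
```

One thing to keep in mind: an XML parser resolves such references when it reads the file, so a tool consuming the parsed data still receives the raw “/” and still has to encode it before putting it in a URL.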

My hunch is that otherwise this could force publishers to maintain two names e.g. for projects in their own systems (one for their system including special characters, and another for publishing to IATI), and also remove lots of useful hyperlinks in the data which would be a significant inconvenience and reduce the amount of data that is published.

Matt

Andy Lulham • 7 years ago

I’m puzzled about this! [^\/\&\|\?]+ will match an identifier that contains at least one character that isn’t a forward slash, ampersand, pipe or question mark.

So for instance, :-/ uh oh? :-| would match this regex, and therefore be considered a valid identifier, despite the special characters.

I wonder if this regex should instead be: ^[^\/\&\|\?]+$ i.e. the identifier can’t contain any forward slashes, ampersands, pipes or question marks.
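The difference is easy to check in Python. This is only a sketch of the two patterns discussed here; `GB-1-12345` is a made-up identifier.

```python
import re

UNANCHORED = re.compile(r"[^/&|?]+")
ANCHORED = re.compile(r"^[^/&|?]+$")

candidate = ":-/ uh oh? :-|"

# search() succeeds as soon as it finds one run of permitted characters.
unanchored_ok = UNANCHORED.search(candidate) is not None  # True
# With ^ and $ the whole string must consist of permitted characters.
anchored_ok = ANCHORED.search(candidate) is not None      # False
clean_ok = ANCHORED.search("GB-1-12345") is not None      # True
```

Whether the unanchored pattern is a problem in practice depends on how the validating tool applies it: a tool that uses a full-match call would reject the candidate either way, while one that uses a search would accept it.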

Ben Webb - IATI Secretariat • 7 years ago

Yes, I think you’re right. This was probably a mistake on my part a few years ago.

Bill Anderson • 7 years ago

We do not seem to have consensus on this issue. There would appear to be two approaches to use of characters in IATI identifiers.

1. Restrict characters allowed and fix regex rules.
2. Allow all valid characters and enforce url encoding when using identifiers in urls.

Could everyone provide a bottom-line response to these two options?

Herman van Loon r_clements Andy Lulham Ben Webb - IATI Secretariat matmaxgeds Mark Brough Steven Flower Tim Davies

(Personally I am with @herman on the second option as the activity part of the identifier should reflect the identifier used in the publisher’s own system.)

Steven Flower • 7 years ago

bill_anderson:

Allow all valid characters and enforce url encoding when using identifiers in urls.

I agree (in more than 20 characters)

Tim Davies • 7 years ago

I prefer option (2).

The two other considerations that might point towards (1) however:

(a) Input systems that only allow alphanumeric characters. Often legacy systems will have restrictive validation on fields; or at least may not support full unicode in input fields for identifiers;
(b) Cases where identifiers are presented in a range of different ways (e.g. Australian Business Number might be written as ‘123 123 123’ or ‘123.123.123’ or ‘123123123’ in different places, all for the same organisation);

These noted, I still support option (2), with the idea that we might need to provide guidance to users of organisation identifiers on some basic normalisation to apply to them to maximally ensure identifier matches.

(For example, I’m not aware of any identifier schemes in which ‘123/123/123’ and ‘123123123’ would identify different organisations - so it is generally safe internally when consuming data to strip out special characters to get the best chance of identifier matches)
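A rough sketch of that kind of internal normalisation, assuming a data consumer only wants a matching key and never republishes the stripped form. The function name is hypothetical.

```python
import re


def normalise_for_matching(identifier: str) -> str:
    """Drop everything except ASCII letters and digits to build a match key."""
    return re.sub(r"[^0-9A-Za-z]", "", identifier)


variants = ["123 123 123", "123.123.123", "123/123/123", "123123123"]
# All four spellings collapse to the same key.
keys = {normalise_for_matching(v) for v in variants}
```

After this, `keys` holds a single value, `123123123`. The original string should still be stored alongside the key, since the stripped form loses information and cannot be converted back.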

Bill Anderson • 7 years ago

Tim Davies I think the bigger problem lies with the ‘project number’ part of the identifier, though I take the point that this could crop up on the organisation side as well

matmaxgeds • 7 years ago

Also with Herman van Loon - the donor systems I have seen would have significant difficulty (huge increase in manual tweaking of fields required) to restrict these characters, potentially also making those fields less readable. Suggest enforcing URL encoding for all URLs, not just those with identifiers.

r_clements • 7 years ago

Hello everyone,

For me it’s still: 1. Restrict characters allowed and fix regex rules.

I don’t like to be the contrarian here, but I’m afraid I disagree and would like to highlight one of the issues.

An organisation whose data we wanted to consume published an identifier that went something like: GB-test-Sana’a and caused the team behind DevTracker and OIPA no end of headaches to try and formulate and process an ID that was usable from the existing data.

I think it would be far easier for any tool builder to have the option of excluding IATI identifiers that contain this type of character and I fail to see why allowing / ’ " = and & values within IATI identifiers adds any value to the standard.

For me it results in a significant overhead for anyone trying to build API tools on top of the data, that will be replicated by any technical team looking to build tools with IATI data.

As a consumer of IATI data, I would rather spend my time working to develop new features for our tools (e.g. DevTracker), instead of trying to mitigate the impact of IATI identifiers which contain special characters that break URL encoding.

Herman van Loon Andy Lulham Bill Anderson Ben Webb - IATI Secretariat Mark Brough Steven Flower Vincent van 't Westende matmaxgeds

Bill Anderson • 7 years ago

r_clements you make a strong case, but how do you maintain a link to the original project number:

1. Is there a conversion (that works in both directions) that could be standardised?
2. Or do you argue that the benefits of breaking that link outweigh the drawbacks?

r_clements • 7 years ago

Hello Bill Anderson , In this case we were lucky in that the project was not linked to any other IATI data sets so I asked the organisation to fix it (i.e. remove the ’ and republish) - I must apologise to the community here as I didn’t realise that you’re not meant to do this, so I misinformed the publisher in this case.

There were other issues with their published data, projects were published four times a year as the ID was adapted to reflect the current financial quarter, causing duplication of data within the IATI network. Again we have advised the organisation involved about this and they are changing their data, going forward, so they only publish a project ID once and adjust the finances rather than change the ID.

I think my real concern with the IATI identifiers is that when we start to really improve our linking through the network, via the transaction ref fields, that a badly formed identifier (i.e. non url compliant) is going to cause issues for anyone trying to track funds when they’re using API calls to return the data.

It’s not impossible to work round this issue, but I think we’re going to make the IATI data much more complex to work with than it needs to be. I suppose my core question, difficulties of changing the systems to enforce compliance aside, is this: is it too much to ask project inputters to avoid using these specific characters when allocating IATI identifiers?

r_clements • 7 years ago

Sorry I went off a bit there: I think you’ve hit the nail on the head. In the first instance I would break the link, so 2., but would look to flag this in some way (Vincent’s IATI bug tracker could be adapted for this), so that part 1. can happen in conversation between the impacted organisations.

Anonymous • 7 years ago

Do we know firsthand how many IATI identifiers contain special characters? I don’t have those numbers in front of me.

From API (OIPA) perspective this has been an ongoing issue for years now. Non URL/URI compliance does in many cases require a custom approach from our perspective. Seeing how others may have a different approach as well, this does not add to interoperability if we for example were to align different systems in the IATI network.

Basically the issue is two-fold: an IATI org. identifier can contain special characters, plus the additional identifier may contain special characters as well.

We would prefer this to be solved at the root of the chain -the standard itself- and not anywhere else.

Bill Anderson • 7 years ago

siemvaessen:

the root of the chain

Isn’t the root of the chain the institution’s own business rules for how they id their projects? IATI doesn’t have jurisdiction over this.

r_clements just to be clear, when I talked about breaking the link I didn’t mean across IATI datasets, but between IATI and the reporting organisation’s project management system

Anonymous • 7 years ago

bill_anderson:

Isn’t the root of the chain the institution’s own business rules for how they id their projects? IATI doesn’t have jurisdiction over this.

From the IATI network perspective yes and no. I understand IATI does not have jurisdiction over their business rules, but I guess leaving this as is will continue to cause issues down the line.

What if IATI, from a data quality perspective, picks this up according to a convention / conversion, still to be decided upon, that we can all agree to, as in your option 1, Bill Anderson? What would you propose?

Bill Anderson • 7 years ago

siemvaessen:

What would you propose?

If the consensus is Option 1 I would:

1. Agree Regex rules
2. Schema (3.01): Add Regex validation to all fields containing org and activity identifiers
3. Guideline (2.03): Publishers should follow Regex rules
4. Action: Add Regex rule and guidance to identify-org.net (Tim Davies ok?)
5. Guideline (2.03): Publishers whose in-house business model involves use of invalid characters should provide a note (in the registry metadata?) on how users may be able to derive the original project id.

r_clements • 7 years ago

In the specific case I was talking about, the ID on IATI was actually a compound ID based on data from their internal system, and as they were using CSV2IATI to generate their data, the ID didn’t actually exist in the IATI form on their system.

They hadn’t realised what they were doing would cause a problem and were happy to change it when they did, so I think that there might be some naivety in the publishing community as to the issues that are being caused by IDs with non url compliant characters.

Bill Anderson - The only addition I would make to your list is that we identify organisations that are currently publishing IDs that are not URL-compliant and, if possible, give them a nudge to change the impacted ID(s) to something that is.

Wendy Rogers • 7 years ago

Just to add that re 5) in Bill Anderson post above, publishers can also use the other-identifier element to cross ref the activity to their own internal project identifier.

Bill Anderson • 7 years ago

Wendy:

publishers can also use the other-identifier element

Good shout. So 5 should in fact be:

5. Publishers whose in-house business model involves use of invalid characters should record the original identifier in the other-identifier element with @type="A1"

Question: Is this a should or a must?
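As a sketch of what point 5 could look like when generating data, the snippet below builds an activity with Python's standard `xml.etree.ElementTree`. The replacement rule (swap each disallowed character for a hyphen), the function name and the identifiers are assumptions for illustration only; the `ref` and `type` attribute names follow my understanding of the 2.x schema and should be checked against it.

```python
import re
import xml.etree.ElementTree as ET


def build_activity(org_id: str, internal_id: str) -> ET.Element:
    """Publish a cleaned identifier and keep the internal one for reference."""
    # Hypothetical rule: replace each disallowed character with "-".
    published = f"{org_id}-" + re.sub(r"[/&|?]", "-", internal_id)
    activity = ET.Element("iati-activity")
    ET.SubElement(activity, "iati-identifier").text = published
    if published != f"{org_id}-{internal_id}":
        # Record the untouched internal ID so users can trace the original.
        ET.SubElement(
            activity, "other-identifier", {"ref": internal_id, "type": "A1"}
        )
    return activity


activity = build_activity("XM-EXAMPLE", "2017/123-456")
xml_text = ET.tostring(activity, encoding="unicode")
```

The published identifier becomes `XM-EXAMPLE-2017-123-456`, while `other-identifier` still carries `2017/123-456`. Note that a rule like this is one-way: two different internal IDs could map to the same published identifier, which is part of why the should/must question matters.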

Herman van Loon • 7 years ago

IATI isn’t a green field anymore. Since changing existing IATI identifiers will break references to organizations and activities of other publishers, I strongly oppose this change. This change may have a huge impact on existing IATI data users.

As a rule you never change existing business identifiers. I can only think of two exceptions in this case:
1 - An IATI identifier has not been used by any other publisher: so it is safe to change the identifier or
2 - the proposed change is only applied to NEW IATI identifiers. The existing identifiers are left unchanged.

To estimate the impact of this change it would be nice to have some metrics on the use of IATI identifiers with invalid characters.

Mark Brough • 7 years ago

Strongly agree with Herman van Loon , Bill Anderson and others on option 2. Proper URL encoding is much simpler than getting all publishers to replace characters from their project IDs. Slashes in project IDs sometimes have real meaning, and getting all organisations to manually implement RFC 3986 rather than have libraries that do the same job seems like a recipe for disaster to me.

For example, if the project ID is 2017/123-456, should both the publisher and the implementing partner be told that they need to remember to ignore the slash and turn it into some other character? Clearly that won’t always happen, so tools will always need to handle these characters, so why make people go to any effort? Even the conversation about what to do is complicated and going to add a lot of overhead.

I think we need a clearer explanation of why percent-encoding URL inputs is insufficient before undertaking what would be quite a disruptive step.

Bill Anderson • 7 years ago

markbrough:

For example, if the project ID is 2017/123-456, should both the publisher and the implementing partner be told that they need to remember to ignore the slash and turn it into some other character?

I think the point that comes out of this is that data usage is, or should be, primarily content-related. Finding the path of least resistance might produce a ‘better’ technical solution that appears to improve data quality (fewer errors), but does it achieve this at the expense of the meaning of the data?

David Megginson • 5 years ago

Reactivating an old conversation, since I just stumbled on that regex in the 2.03 conversation. The current regex means that, in a Unicode context, this is a valid activity identifier:

XI-IATI-OCHADSC-️

Perhaps it would be wise to revise at least to specify allowable Unicode character classes.

Andy Lulham • 5 years ago

David Megginson can you explain why this would be a problem? Is it just that some systems would fail to handle some unicode characters correctly? Also, could you suggest an alternative regex that would deal with this?

Aside: Back in Feb 2017, I sent a PR to fix this regex. This was merged earlier this year, but then discarded (presumably by accident) in this PR

David Megginson • 4 years ago

I think BNF (or similar) might be clearer than a regex, because of all the different regex flavours, but if we are sticking with regular expressions, then in POSIX-y dialects (including Python regex’s) we can use \w to match any alphanumeric character, \s to match any whitespace character, etc.

We also have to specify whether we’re allowing Unicode or just ASCII. I’m a huge Unicode (and UTF-8 encoding) fan, but even an experienced coder or DBA will often blow up a system and/or open security holes when they get an unexpected non-ASCII character in an identifier, etc. If we scan the registry and find that no one is using non-ASCII characters in identifiers, I’d suggest making the regex very explicit (inclusion rather than exclusion character groups) and issuing a guidance note along the lines of “this is what we meant, and what the registry will support”.

Mark Brough • 4 years ago

Hey David Megginson – I think this is a rare moment where I maybe have to disagree with you! There are a bunch of different perspectives and arguments in the thread above, but my argument is something like the following:

So IATI Identifiers should be composed of [Organisation ID]-[Organisation's internal project ID].

Having restrictions on IATI Identifiers means that we have to:

1. restrict which characters an organisation has in its project ID in internal systems (which as Bill Anderson says is not something IATI has control over), or
2. require organisations with non-permitted characters to convert those characters in a consistent way

I think 2. has several issues:

every other organisation referring to this IATI ID has to convert in exactly the same way, whereas probably at least sometimes people will make mistakes
it breaks the link between internal project IDs and IATI Identifiers

In either case:

the benefits are difficult to ascertain, because as we have seen elsewhere in IATI, there will always be cases where organisations don’t implement this perfectly – so systems using the data will therefore have to handle funny characters anyway (e.g percent-encoding if using these identifiers in URLs)
we would have to have a long and very painful discussion about which characters exactly should be permitted… e.g. would we be excluding data from Chinese or Arabic systems which have non-ASCII identifiers? I don’t know…

Or have I got the wrong end of the stick of what you’re proposing here?

Andy Lulham • 4 years ago

David_Megginson:

if we are sticking with regular expressions, then in POSIX-y dialects (including Python regex’s) we can use \w to match any alphanumeric character, \s to match any whitespace character, etc.

By default, \w will match unicode characters in python3. But python’s re library has the re.ASCII flag, which would make \w do the right thing (if “the right thing” means ASCII-only).

[Just mentioning this here because I was previously unaware of re.ASCII].
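A quick illustration of the flag, as a sketch only; the sample strings are made up. Note that `\w` also matches the underscore, so it is slightly wider than "alphanumeric".

```python
import re

word = "projet-\u00e91"  # contains "é" (U+00E9)

# Default in Python 3: \w is Unicode-aware, so "é" counts as a word character.
unicode_match = re.fullmatch(r"[\w-]+", word) is not None                # True
# With re.ASCII, \w is limited to [a-zA-Z0-9_], so "é" is rejected.
ascii_match = re.fullmatch(r"[\w-]+", word, flags=re.ASCII) is not None  # False
ascii_plain = re.fullmatch(r"[\w-]+", "projet-e1", flags=re.ASCII) is not None  # True
```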

David Megginson • 4 years ago

Good points, Mark Brough , but I think there’s a risk of being overly cautious here. Yes, it is possible that there is a major enterprise computer system somewhere in the world that uses emojis in its database primary keys, but I’d suggest that it’s highly improbable, to the point that we can leave it out of consideration. Accented or non-Roman characters in a primary key are slightly less improbable, but anyone doing that would already have to convert them for interoperability with other systems.

On the other side of the scale, allowing non-alphanumeric, non-basic-punctuation characters opens a huge range of security holes in naive implementations, and a huge range of potential bugs. So we have to ask which cost is greater – accommodating a theoretically-possible edge case (that we could help a single org work around if it happened), or adding the potential for bugs and security holes in every IATI software implementation. There’s no zero-cost choice here.

(Note that I am a huge advocate of multilingual support in the human-readable data in IATI – titles, descriptions, etc – but not necessarily in the purely machine-readable stuff like XML tags, identifiers, etc).

Herman van Loon • 4 years ago

Changes in current IATI identifiers will produce havoc when using IATI data, especially when those identifiers are being used in other activities or by other publishers. I think the best we could do here is to make this a guideline, which could be checked and flagged by the data-validator (e.g. IATI identifier ABC contains non-standard characters) as a ‘warning’ class message.
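A sketch of what such a warning-class check might look like. It is not the IATI validator's code; the allow-list (ASCII letters, digits, ".", "_" and "-") is an assumption for illustration, and the function name is hypothetical. Nothing is rejected: the function only reports which characters it would flag.

```python
import re

# Assumed allow-list for illustration: ASCII letters, digits, ".", "_" and "-".
ALLOWED = re.compile(r"[A-Za-z0-9._-]")


def identifier_warnings(identifier: str) -> list:
    """Return warning messages; an empty list means nothing to flag."""
    offending = sorted({ch for ch in identifier if not ALLOWED.fullmatch(ch)})
    if not offending:
        return []
    listed = ", ".join(repr(ch) for ch in offending)
    return [
        f"IATI identifier {identifier!r} contains non-standard characters: {listed}"
    ]


flagged = identifier_warnings("XM-EXAMPLE-2017/123&456")
clean = identifier_warnings("XM-EXAMPLE-2017-123")
```

Using an inclusion list rather than an exclusion list means the check does not need updating each time a new troublesome character turns up, and because it only warns, existing identifiers stay as published.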

David Megginson • 4 years ago

Very true, Herman. My question is whether it would involve a change to any existing identifiers. We’d have to crawl the registry to check.

Andy Lulham • 4 years ago

Worth noting that there’s already a recommendation in the docs against using non-ASCII characters.

David_Megginson:

We’d have to crawl the registry to check.

This is trivial to do using iatikit.

Here’s a gist containing the code and results.

In summary, there are currently 198 non-ASCII identifiers on the registry.

Worth mentioning that d-portal appears to cope with unicode in identifiers. The only exceptions relate to carriage returns in identifiers (which d-portal strips e.g. here) and angle brackets in identifiers (which d-portal gives up on e.g. here, from here). But neither of these are unicode issues.
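For anyone without iatikit to hand, the same kind of check can be sketched with the standard library alone. This is not the code from the gist and does not touch the registry: it runs over a made-up XML snippet, and the identifiers in it are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample; \u2019 is the curly apostrophe that caused trouble earlier.
SAMPLE = """<iati-activities>
  <iati-activity><iati-identifier>XM-EXAMPLE-001</iati-identifier></iati-activity>
  <iati-activity><iati-identifier>XM-EXAMPLE-Sana\u2019a</iati-identifier></iati-activity>
  <iati-activity><iati-identifier>XM-EXAMPLE-2017/002</iati-identifier></iati-activity>
</iati-activities>"""


def non_ascii_identifiers(xml_text: str) -> list:
    """Return every iati-identifier value that contains a non-ASCII character."""
    root = ET.fromstring(xml_text)
    ids = [(el.text or "").strip() for el in root.iter("iati-identifier")]
    return [i for i in ids if not i.isascii()]


found = non_ascii_identifiers(SAMPLE)
```

Only the second identifier is returned. The third one contains a slash, which is ASCII, so a separate check would be needed to count non-alphanumeric or punctuation characters.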

David Megginson • 4 years ago

Thanks, Andy. Did you see how many non-alphanumeric/basic punctuation characters there were?
