This repository was archived by the owner on Apr 12, 2025. It is now read-only.
add punycode homograph identification #2
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
@mikesamuel , thank you for your earlier feedback in the Google Group regarding identifying potential punycode homograph attacks (https://groups.google.com/forum/#!topic/owasp-java-html-sanitizer-support/rI3V_Z6K6us). I've made this tentative PR to try and address the issue, but would like your feedback, and thought this might be a good way to advance the discussion.
I had a couple questions come up while working on this, that are probably worth discussing:
I'm not sure marking potential punycode homographs as invalid is the best way to go, since I'm not able confidentially identify them as such. After reading how other systems handle this attack vector, it looks like the primary responsibility for blocking homograph domains is on the TLD registrars. Other systems seem to rely on indicating a potential homograph to the user, or converting the questionable domain name to ASCII. How would you like to move forward with this solution?
I wasn't sure what the best way to include the punycode logic into this project, but it was starting to look like the punycode portion was substantial enough to warrant it's own class so I've broken it out. At the same time, it didn't seem like it should be a UrlClassifier itself, so I'd appreciate your thoughts on how to best structure it.
I tried following a builder pattern for inclusion of the PunycodeIdentifier and AuthorityClassifier's usage of it, but am open to any suggestions around it's implementation, or whether it needs a builder at all, since there are no additional properties at this time. The only properties I could think of might be worth including in the future would be to fail checks on suspicion of a homograph, and the option to auto-convert the string to ASCII if a suspicion is found. The latter option would probably be part of a separate function call if it was added.
Lastly, this is still an incomplete implementation, as I am only checking for mixed script domains, and not any same-script homographs that utilize. I think I would have to incorporate the codepoints in this document in order to do so, https://tools.ietf.org/html/rfc5892. This is something I could do at this time at the cost of some performance, or in a later iteration if the current implementation is found insufficient.
I bumped the JVM target to 8 so I could use Character.UnicodeScript, which requires 7, and streams, which requires 8. If we need to stay with 6 as the target, I can replace the usage of streams, but Character.UnicodeScript will be a bit more work. Let me know if this is going to be necessary.
Thanks again for your help on this, hopefully we can get it into useable shape.