Skip to content
This repository was archived by the owner on Apr 12, 2025. It is now read-only.

add punycode homograph identification #2

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

trentdm
Copy link

@trentdm trentdm commented Apr 24, 2019

@mikesamuel , thank you for your earlier feedback in the Google Group regarding identifying potential punycode homograph attacks (https://groups.google.com/forum/#!topic/owasp-java-html-sanitizer-support/rI3V_Z6K6us). I've made this tentative PR to try and address the issue, but would like your feedback, and thought this might be a good way to advance the discussion.

I had a couple questions come up while working on this, that are probably worth discussing:

  1. I'm not sure marking potential punycode homographs as invalid is the best way to go, since I'm not able confidentially identify them as such. After reading how other systems handle this attack vector, it looks like the primary responsibility for blocking homograph domains is on the TLD registrars. Other systems seem to rely on indicating a potential homograph to the user, or converting the questionable domain name to ASCII. How would you like to move forward with this solution?

  2. I wasn't sure what the best way to include the punycode logic into this project, but it was starting to look like the punycode portion was substantial enough to warrant it's own class so I've broken it out. At the same time, it didn't seem like it should be a UrlClassifier itself, so I'd appreciate your thoughts on how to best structure it.

  3. I tried following a builder pattern for inclusion of the PunycodeIdentifier and AuthorityClassifier's usage of it, but am open to any suggestions around it's implementation, or whether it needs a builder at all, since there are no additional properties at this time. The only properties I could think of might be worth including in the future would be to fail checks on suspicion of a homograph, and the option to auto-convert the string to ASCII if a suspicion is found. The latter option would probably be part of a separate function call if it was added.

  4. Lastly, this is still an incomplete implementation, as I am only checking for mixed script domains, and not any same-script homographs that utilize. I think I would have to incorporate the codepoints in this document in order to do so, https://tools.ietf.org/html/rfc5892. This is something I could do at this time at the cost of some performance, or in a later iteration if the current implementation is found insufficient.

  5. I bumped the JVM target to 8 so I could use Character.UnicodeScript, which requires 7, and streams, which requires 8. If we need to stay with 6 as the target, I can replace the usage of streams, but Character.UnicodeScript will be a bit more work. Let me know if this is going to be necessary.

Thanks again for your help on this, hopefully we can get it into useable shape.

@trentdm trentdm marked this pull request as ready for review April 24, 2019 20:02
@mikesamuel
Copy link
Collaborator

Thanks for putting this together. I'm ignorant of homograph attacks, so I'm afraid I have a bunch of goal related questions before diving into the code.

I'm not sure marking potential punycode homographs as invalid is the best way to go

IIUC, the primary risk is misplaced trust decisions made by users when presented with a URL.
Are homographs dangerous in other circumstances?
If not, maybe we could phrase the API in terms of

unfit for human consumption

I've tried to associate detailed problem info with the URL value but also require opt-in to unsafety when the signal is strong. It sounds like you think the signal might be weak. We could start off as opt-out and ask a large user like the Gmail iphone app to collect stats.

After reading how other systems handle this attack vector, it looks like the primary responsibility for blocking homograph domains is on the TLD registrars.

What use cases are there for vetting homographs not vetted by registrats?

Is there a reputation confusion risk in paths like github.com/аliceor twitter.com/аlice with a Cyrillic 'а'?

I'm horribly ignorant of how well those are filtered out during account creation.

Lastly, this is still an incomplete implementation

I'm fine with using a first PR to figure out the API changes necessary.
I'd rather not have tons of API knobs for subtly different kinds of homographs attacks but maybe some kind of EnumSet could provide a nuanced view if there's a use case for that.

If we need to stay with 6 as the target,

I'm fine with discussing the PR assuming 8+ and then figuring out how to backport if necessary.

This is security machinery, so I can't just say update your JVM if you want security fixes, but there's a good chance noone is using Java6 anymore.

I may ask for consensus on the mailing list since I have little visibility into how many clients use older JVMs.

@trentdm
Copy link
Author

trentdm commented Apr 26, 2019

IIUC, the primary risk is misplaced trust decisions made by users when presented with a URL.

My understanding is the same, and I agree, being able to warn and raise awareness via an easy to consume API would probably be best.

It sounds like you think the signal might be weak. We could start off as opt-out and ask a large user like the Gmail iphone app to collect stats.

Yeah, with there being valid mixed-case usages, and invalid single-script usages, that may be the best play.

What use cases are there for vetting homographs not vetted by registrats?

Is there a reputation confusion risk in paths like github.com/аliceor twitter.com/аlice with a Cyrillic 'а'?

That's a really great point that I wasn't considering earlier. There's probably a wide range of variability among sites that allow names/inputs to later be rendered as urls, from whitelist only characters to anything goes. Being able to identify these potential homographs from another site may still be of value. So fair point not to rely only on the registrar. This may partly explain why Mozilla has been moving away from using the TLD as a whitelist.

I'm fine with using a first PR to figure out the API changes necessary.
I'd rather not have tons of API knobs for subtly different kinds of homographs attacks but maybe some kind of EnumSet could provide a nuanced view if there's a use case for that.

Sounds good to me. I'm happy to continue working on this if further needs arise.

I'm fine with discussing the PR assuming 8+ and then figuring out how to backport if necessary.

This is security machinery, so I can't just say update your JVM if you want security fixes, but there's a good chance noone is using Java6 anymore.

I may ask for consensus on the mailing list since I have little visibility into how many clients use older JVMs.

Sounds great. Thanks for being willing to look into this!

@try3hacker
Copy link

Homo graph
http://xn--eby-7cd.com/

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants