Implement encoding fallback for commit data #427
Conversation
When invalid encoding is in the author name or other commit metadata, libgit2 always returns the text as if it were UTF-8 encoded (which, per the spec, it should be). JGit will attempt to parse using the current system's default encoding. Core git simply returns the raw bytes, leaving it to the client shell to interpret them (which posh-git does by defaulting to cp-1252, but Git Bash in msysgit displays differently).
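To make the three behaviours concrete, here is a small Python sketch (illustrative only; the actual implementations are in C, Java, and PowerShell) of how the same raw author-name bytes come out under each policy:

```python
# Raw bytes of "Fòcker" as written by a client configured for Windows-1252.
raw = "Fòcker".encode("cp1252")  # b"F\xf2cker"

# libgit2-style: assume the bytes are UTF-8, as the spec says they should be.
try:
    libgit2_view = raw.decode("utf-8")
except UnicodeDecodeError:
    # The invalid lead byte becomes U+FFFD: "F\ufffdcker"
    libgit2_view = raw.decode("utf-8", errors="replace")

# posh-git-style: fall back to cp1252, which happens to be correct here.
posh_git_view = raw.decode("cp1252")  # "Fòcker"
```

The same six bytes yield a mangled string under one policy and the intended name under the other, which is exactly the divergence being described.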
…et to a 1252 default
Oh, that failed for a different reason than I was expecting. Should I have included the test repo in a post-build step somewhere to make sure that it gets copied across when running on the build server?
Could it be that your test repo's
…r exists on clone
Thanks @dahlbyk - turns out my bare repo was a little too bare... The build is now failing as expected. Basically, we get the error discussed above because the default encoding on Linux is UTF-8; the unit test passes on a Windows machine with a Latin-1 default charset. Would we still want to add the encoding fallback? If so, I can write the test so that it checks for a Latin-1 fallback when the default encoding is Latin-1, Dos850 when that is the code page, etc., then tidy up the commits and re-submit.
Is there a workaround for consumers of the library if they wanted to do the fallback themselves? Can you get the raw bytes out of the string and then try to parse them yourself?
I was trying to think of a way without polluting the API too much, but couldn't think of one without making something public that sits very low down in the guts of marshalling data to/from the natives. The best idea (which @nulltoken came up with) was to pass an optional default fallback encoding down as a configuration parameter when you construct the repo. Most consumers of the library would just go with a default that falls back to the system default encoding, but you could pass a specific encoding to fall back to, or switch the fallback behaviour off. The unit tests would specify an encoding to use and be able to test the fallback behaviour.
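A rough Python sketch of that proposal (all names hypothetical; the real API would be C#): the fallback encoding becomes a constructor option with a sensible default, and `None` switches the fallback off entirely:

```python
class Repository:
    """Hypothetical sketch of the constructor-parameter proposal."""

    def __init__(self, path, fallback_encoding="cp1252"):
        self.path = path
        # None disables the fallback; tests can inject a specific encoding.
        self.fallback_encoding = fallback_encoding

    def decode_metadata(self, raw: bytes) -> str:
        try:
            return raw.decode("utf-8")  # the happy, spec-compliant path
        except UnicodeDecodeError:
            if self.fallback_encoding is None:
                raise  # fallback switched off: surface the real failure
            return raw.decode(self.fallback_encoding)

repo = Repository("/tmp/repo")
name = repo.decode_metadata(b"F\xf2cker")  # "Fòcker" via the cp1252 fallback
```

Valid UTF-8 input never touches the fallback, which is what lets the unit tests pin the fallback behaviour independently of the machine's locale.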
@dahlbyk @xpaulbettsx @spraints @KindDragon What's your feeling about this issue? How would you like to see this handled? Would you need an optional callback? I'm a little wary about this, as there's very little chance of correctly guessing the initial encoding of the commit details. Moreover, as one can see in the JGit tests (cf. above), nothing guarantees that the signatures and the commit message were encoded with the same initial encoding.
What's the drawback to guessing wrong in these cases, though? I mean, it goes from wrong to still wrong, right? But in some cases, it's right. :)
Instead of being wrong everywhere, it's going to depend on the user's default locale. That's a recipe for very nice fails-on-the-user's-machine-only bug reports ;-) Moreover, it's not fixing the root cause: the commit that was initially pushed is actually wrongly encoded. Instead of benefiting from a callback running the perfect heuristic to nicely display a wrongly encoded commit, I'd rather be in favor of adding a button to the UIs stating "Click to send an email to the author/committer of this commit begging them to fix their configuration once and for all". 😉
We've seen this issue a lot on github.com, where even if the commit is hinted to be a certain encoding, it's actually a different encoding. It'd be awesome if we could guess the encoding via statistical analysis (relatively straightforward to do, but a fair amount of code if you don't have a library like iconv). In the Windows case, the usual fallback is that if it's not UTF-8, assume it's non-Unicode text encoded using the user's OEM code page.

/cc @brianmario
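That Windows-style fallback chain — strict UTF-8 first, then the user's local code page — can be sketched like this, using Python's `locale` module as a stand-in for the Windows ACP/OEM code-page lookup (an assumption; the real lookup would use the Win32 APIs):

```python
import locale

def decode_commit_text(raw: bytes) -> str:
    # Per the git spec, commit text should be UTF-8; try that strictly first.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise assume the user's locale encoding, replacing any bytes
        # that still don't map rather than raising.
        return raw.decode(locale.getpreferredencoding(False), errors="replace")
```

Note the asymmetry: the UTF-8 attempt is strict (so valid commits are never mangled), while the locale-based fallback is lenient (so the caller always gets *some* string back).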
Hm. Maybe there exists some sane heuristic, but I don't know enough about this madness to actually know that to be true. Falling back to the user's codepage (for somebody else's commits) will, I think, violate the "it can't make things worse" assumption. Consider that I commit with a remarkably popular Windows encoding (CP-1252) and you're running some popular git server on Java, running on your Mac, meaning that your default charset is something else entirely. My name is David Fòcker, and I'm very embarrassed whenever anybody views my commits.
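The mangling described here is easy to reproduce: decoding CP-1252 bytes as MacRoman never raises, since both encodings assign a character to every byte value, so the error passes silently:

```python
name_bytes = "Fòcker".encode("cp1252")

# A JVM on a Mac historically defaulted to MacRoman; the decode "succeeds"...
mac_view = name_bytes.decode("mac_roman")

# ...but the name is now wrong, and nothing signalled an error.
assert mac_view != "Fòcker"
```

This is exactly the "makes things worse" failure mode: a locale-based fallback can turn a detectable error into a silently wrong answer.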
So, on Windows I don't think there's much in the way of options there - at least
I think we're talking about doing all the byte-array-to-char-array conversion in the language bindings. libgit2 just returns bytes from the raw data structure, and it's up to the language bindings to correctly represent them - i.e., in this case, as .NET framework elements. I understand the arguments here on correctness; however, I'm also heavily persuaded by the pragmatic solutions uncovered after exposure to a lot of repos. How about this:
How about this?
I'd rather not make LibGit2Sharp internally default to the platform default.
/cc @yishaigalatzer |
@nulltoken I could live with that. It'd give us the option to be in control of our destiny. I like that.
Actually, I think this idea is horrible. I'd rather have a way to pass LibGit2Sharp a method to let me do the conversion myself, instead of handing me back a mangled string that I then get to demangle. Give me a UTF-8 string, or give me the actual bytes and let me make it a UTF-8 string. This "fallback encoding" stuff will never be correct in all cases.
I agree with @xpaulbettsx - if users want to mangle, let them do it. Then we can wash our hands of this nonsense.
My plan for this is to release a LibGit2Sharp.Iconv library that provides an implementation based on what we currently do on GitHub.com, the website - use ICU to actually heuristically determine the encoding and marshal it to UTF-8.
@xpaulbettsx well, technically we're using ICU to do the detection ;) I didn't even realize iconv had a detection API?
@xpaulbettsx Why not? This was actually my first proposal. Would a
This is what I get when I don't read carefully
Is it always a Stream? I thought at this point we'd already have all the bytes, so it may as well just be a
Yeah, I may have confused ICU and iconv. I meant the former :) |
Actually all we've got is a native pointer. Returning a
And what would you expect the library to do with that? 😉 Actually, the callback puts you in full control, but you have to return a string. If you don't know, well, you'd have to do the guessing.
FWIW, you can even use Encoding.Default 😝
If a fallback callback isn't provided, or if the fallback returns
@dahlbyk I'd rather not see
Fair enough.
I'm okay with that too, that's probably better.
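Pulling this sub-discussion together, here is a Python sketch (hypothetical names; the real API would be C#) of the callback shape being converged on: strict UTF-8 first, then a user-supplied fallback that receives the raw bytes, with a lenient decode as the last resort when no callback is set or the callback declines by returning `None`:

```python
def decode_with_fallback(raw: bytes, fallback=None) -> str:
    try:
        return raw.decode("utf-8")          # strict, spec-compliant attempt
    except UnicodeDecodeError:
        if fallback is not None:
            result = fallback(raw)          # the caller sees the real bytes
            if result is not None:
                return result               # caller handled it
        return raw.decode("utf-8", errors="replace")  # last resort

# A consumer opting into cp1252 demangling for known-bad repos:
fixed = decode_with_fallback(b"F\xf2cker", lambda b: b.decode("cp1252"))
```

Because the callback receives the unmodified bytes, consumers who care can do their own transcoding, while everyone else gets a deterministic (if lossy) replacement-character decode rather than a locale-dependent guess.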
Just my 2c: as a user of the library, I'd expect that in most cases it behaves gracefully and fixes minor problems, which is why I suggested the fallback with Encoding.Default in the first place. It will work in most cases, and the consumer has a smaller concept count to worry about. Augmenting it with an optional callback makes it customizable, but I still believe the out-of-the-box experience should be as simple as possible. I didn't see a strong reason to oppose Encoding.Default, rather an opinion; what would be a good alternative? Perhaps falling back to the Latin code page when the default encoding is UTF-8?

In regards to automated code-page detection: this is a very hairy thing to do for short pieces of text like signatures, and there is no reliable code I'm aware of that can do it. We experimented with tens of thousands (literally) of files and detection systems, only to realize that UTF-8 is a great first guess (and fortunately, in this case, the default by spec), and that mangling by default is just a bad idea. It assumes that the average user will be fine, while they have a bug lurking that they are simply not aware of.
I agree, the heuristic only comes into play when decoding as UTF-8 fails or results in invalid characters.
This is what I suggested initially, but I now believe it is a bad idea. Going down this route means that you can't get the real data if the built-in marshaller fails. That's Bad™.
Agree 1000000% - we probably shouldn't be modifying any bytes before handing them back to the user, or we should at least provide a way to get to them if we're going to transcode in the simple case. We'll unfortunately never get the transcoding 100% accurate, so providing the original unmodified bytes somehow is critical for raw access to the git database.
I don't think anyone is arguing against providing access to the raw bytes, but realistically some consumers simply won't care enough to implement custom transcoding. If there's no custom fallback in place, what would be the least surprising default behavior?
Yeah, I can agree there too. Though I have my own personal agenda about encoding awareness (read: OCD, heh), in which we take extra effort to educate other developers about the fact that every stream of bytes is either binary or text in some encoding. If we're going to hide that for the sake of usability, we're just punting on an opportunity to educate more people, IMO. That being said, I'm not trying to sign anyone up for any extra work, so if there's anything I can do to help documentation-wise, or to brush the dust off my .NET 2.0-era C# skills, let me know ;)
I agree with adding a fallback. But here are my thoughts:
This is probably true. I'm okay with the default implementation of this
You can fall back to ISO-8859-1 with no worry: it has no unused codes, so you can encode the resulting string back into a byte[] with no data loss.
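This round-trip property is easy to check: ISO-8859-1 assigns a code point to every byte value 0x00–0xFF, so decoding can never fail and re-encoding recovers the original bytes exactly:

```python
# Every possible byte value survives an ISO-8859-1 round trip unchanged.
every_byte = bytes(range(256))
round_tripped = every_byte.decode("iso-8859-1").encode("iso-8859-1")
assert round_tripped == every_byte
```

That makes ISO-8859-1 the one fallback that is guaranteed lossless: the displayed string may still be wrong, but the original bytes remain recoverable from it.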
IMHO you shouldn't do any fallback by default; it should be an option. It is better to show the user that something is wrong than to try to cover it up. If the fallback means the user never sees their error, they will never correct it, and other users with their system code page set to UTF-8, or using another git GUI, will always see mojibake.
Guys, I'll start working on this once #531 is merged. |
This is quite stale at this point. Closing it. |
A workaround to resolve issue #387, where a repo has commit data that is not correctly encoded as UTF-8. Not sure this should be merged yet, but throwing it out there for comment.