Skip to content

Implement encoding fallback for commit data #427

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from
Closed

Implement encoding fallback for commit data #427

wants to merge 5 commits into from

Conversation

martinwoodward
Copy link
Member

Work-around to resolve issue #387 where a repo has commit data that is not correctly encoded as UTF8. Not sure that this should be merged yet but throwing it out there for comment.

When invalid encoding is in the author name or other commit metadata,
libgit2 always returns the text as if it where UTF8 encoded (which it
should be). jGit will attempt to parse using the current systems
default encoder. Core git will simply return the bytes leaving it
up to the client shell to interpret (which posh-git does by
defaulting to cp-1252 but Git Bash in msysgit shows differently)
@martinwoodward
Copy link
Member Author

Oh, that failed for a different reason than I was expecting. Should I have included the test repo in a post build step somewhere to make sure that it gets copied across when running on the build server?

@dahlbyk
Copy link
Member

dahlbyk commented May 9, 2013

Could it be that your test repo's refs folder is missing?

@martinwoodward
Copy link
Member Author

Thanks @dahlbyk - turns out my bare repo was a little too bare...

Build now failing as expected. Basically get an error as discussed because the default encoding on Linux is UTF8. Unit test passes on a Windows machine with a Latin1 default charset.

Would we still want to add the encoding fallback?

If so I can code the test so that it tests for Latin1 fallback when default encoding is set to Latin1, Dos850 when that is the codepage etc then tidy up the commits and re-submit.

@ethomson
Copy link
Member

Is there a workaround for consumers of the library if they wanted to do the fallback themselves?

Can you get the raw bytes out of the string and then try to parse them yourself?

@martinwoodward
Copy link
Member Author

I was trying to think of a way without polluting the API too much, but couldn't think of one without making something public that is really very low down in the guts of marshalling data to/from the natives.

About the best idea (that @nulltoken came up with) was to pass an optional default fallback encoding down as a configuration parameter when you construct the repo. Most consumers of the library would just go with a default which falls back to the default encoding but you could pass an encoding to fall back to or switch the fallback behaviour off. The unit tests would specify an encoding to use and be able to test the fallback behaviour.

@nulltoken
Copy link
Member

About the best idea was to pass an optional default fallback encoding down as a configuration parameter when you construct the repo.

@dahlbyk @xpaulbettsx @spraints @KindDragon What's your feeling about this issue? How would you like see this handled? Would you need an optional callback?

I'm a little bit wary about this as there's very little chance to correctly guess what was the initial encoding of the commit details. Moreover, as one can see in the JGit tests (cf. above), nothing guarantees that the signatures and the commit message have been encoded with the same initial encoding.

@haacked
Copy link
Contributor

haacked commented May 10, 2013

What's the drawback to guessing wrong in these cases though? I mean it goes from wrong to still wrong, right? But in some cases, it's right. :)

@nulltoken
Copy link
Member

What's the drawback to guessing wrong in these cases though? I mean it goes from wrong to still wrong, right? But in some cases, it's right. :)

Instead of being wrong everywhere, it's going to depend on the user's default locale. It's a recipe for very nice Fails-on-the-user's-machine-only bug reports ;-)

Moreover, it's not fixing the root cause. The commit that has been initially pushed is actually wrongly encoded. Instead of benefiting from the callback running the perfect heuristic to nicely display a wrongly encoded commit, I'd rather be in favor of adding a button on the UIs stating "Click to send an email to the author/committer of this commit begging him to fix his configuration once and for all". 😉

@anaisbetts
Copy link
Contributor

We've seen this issue a lot on github.com, where even if the commit is hinted to be a certain encoding, it's actually a different encoding.

It'd be awesome if we could guess the encoding via statistical analysis (relatively straightforward to do, but is a fair amount of code if you don't have a library like iconv).

In the Windows case, the usual fallback is, that if it's not UTF-8, assume that it's non-Unicode encoded using the user's OEM Code page (i.e. run it through MultiByteToWideChar and hope for the best). On !Windows, you can probably bet on iconv existing, though I'm sure there are 100000000 versions of it installed on people's boxes.

/cc @brianmario

@ethomson
Copy link
Member

Hm. Maybe there exists some same heuristic, but I don't know enough about this madness to actually know that to be true.

Falling back to the user's codepage (for somebody else's commits) will, I think, violate the "it can't make things worse" assumption.

Consider that I commit with a remarkably popular windows encoding (CP 1252) and you're running some popular git server on java, running on your Mac, meaning that your default file.encoding sys prop is MacRoman.

My name is David Fòcker, and I'm very embarrassed whenever anybody views my commits.

On May 10, 2013, at 8:53 PM, Paul Betts notifications@github.com wrote:

We've seen this issue a lot on github.com, where even if the commit is hinted to be a certain encoding, it's actually a different encoding.

It'd be awesome if we could guess the encoding via statistical analysis (relatively straightforward to do, but is a fair amount of code if you don't have a library like iconv).

In the Windows case, the usual fallback is, that if it's not UTF-8, assume that it's non-Unicode encoded using the user's OEM Code page (i.e. run it through MultiByteToWideChar and hope for the best). On !Windows, you can probably bet on iconv existing, though I'm sure there are 100000000 versions of it installed on people's boxes.

/cc @brianmario


Reply to this email directly or view it on GitHub.

@anaisbetts
Copy link
Contributor

Falling back to the user's codepage (for somebody else's commits) will, I think, violate the "it can't make things worse" assumption

So, on Windows I don't think there's much in the way of options there - at least MultiByteToWideChar will replace invalid characters with "?"s. You might be able to find something useful in the NLS functions, but I don't think so (I'd love to be wrong though!)

@martinwoodward
Copy link
Member Author

I think we're talking about doing all the byte array to char array conversion in the language bindings. LibGit2 just returns bytes from the raw data structure and it's up to the language bindings to correctly represent them - i.e. in this case .NET framework elements. I understand the arguments here on correctness, however I'm also heavily persuaded by the pragmatic solutions uncovered after exposure to a lot of repos. How about this:

  1. LibGit2 just passes bytes. The language bindings are responsible for encoding those bytes into char arrays / strings.
  2. When a Repository object is constructed, an optional fallback encoding can be provided. A fallback of UTF8 is essentially a no-op. If no fallback encoding is passed then this defaults to the platform default encoding as experience shows that this is more often correct for client side processes so makes LibGit2Sharp a friendly and easy to consume API when it comes to new developers coding against the library.
  3. The commit signature is parsed as UTF8, but if that fails encoding then it will fallback to the encoding specified or the platform default if no encoding specified.
  4. The commit message is parsed with the specified commit message encoding. If no encoding passed (as is the normal case) then the commit message is parsed at UTF8
  5. If LibGit2Sharp is being run in a server side process then the fallback encoding is placed in to the Repository constructor which is then used to ensure it is the client requestors encoding.

@nulltoken
Copy link
Member

How about this?

  1. LibGit2 just passes bytes. The language bindings are responsible for encoding those bytes into char arrays / strings.
  2. When a Repository object is constructed, an optional fallback encoding can be provided. A fallback of UTF8 is essentially a no-op. If no fallback encoding is passed then this defaults to the platform default encoding as experience shows that this is more often correct for client side processes so makes LibGit2Sharp a friendly and easy to consume API when it comes to new developers coding against the library. If no fallback is passed then the string will be parsed as UTF8 and returned (even if mangled). The code leveraging LibGit2Sharp is, of course, free to pass Encoding.Default as a fallback while instantiating an new Repository.
  3. The commit signature is parsed as UTF8, but if that fails encoding then it will fallback to the encoding specified or the platform default if no encoding specified or return the mangled parsed string when no fallback has been passed.
  4. The commit message is parsed with the specified commit message encoding. If no encoding passed (as is the normal case) then the commit message is parsed at UTF8
  5. If LibGit2Sharp is being run in a server side process then the fallback encoding is placed in to the Repository constructor which is then used to ensure it is the client requestors encoding.

I'd rather not make LibGit2Sharp internally default to the platform default.

@nulltoken
Copy link
Member

/cc @yishaigalatzer

@haacked
Copy link
Contributor

haacked commented May 14, 2013

@nulltoken I could live with that. It'd give us the option to be in control of our destiny. I like that.

@anaisbetts
Copy link
Contributor

Actually, I think this idea is horrible. I'd rather have a way to pass libgit2sharp a method to let me do the conversion myself, instead of handing me back a mangled string that I now get to demangle. Give me a UTF8 string, or give me the actual bytes and let me make it a UTF-8 string. This "Fallback encoding" stuff will never be correct in all cases.

@ethomson
Copy link
Member

I agree with @xpaulbettsx - if users want to mangle, let them do it. Then we can wash our hands of this nonsense.

@anaisbetts
Copy link
Contributor

My plan for this is to release a LibGit2Sharp.Iconv library that provides an implementation based on what we currently do on GitHub.com, comma the Website - use ICU to actually heuristically determine the encoding and marshal it to UTF-8.

@brianmario
Copy link

@xpaulbettsx well, technically we're using ICU to do the detection ;) I didn't even realize Iconv had a detection API?

@nulltoken
Copy link
Member

I'd rather have a way to pass libgit2sharp a method to let me do the conversion myself

@xpaulbettsx Why not? This was actually my first proposal.

Would a Func<Stream, string> do the trick? Where Stream would in fact be an UnmanagedMemoryStream.

@anaisbetts
Copy link
Contributor

@xpaulbettsx Why not? This was actually my first proposal.

This is what I get when I don't read carefully

Would a Func<Stream, string> do the trick? Where Stream would in fact be an UnmanagedMemoryStream.

Is it always a Stream? I thought at this point we'd already have all the bytes so it may as well just be a byte[]. Stream would work too, as long as there's semantics to say "I don't know" (probably by throwing an exception that libgit2sharp would catch).

@anaisbetts
Copy link
Contributor

well, technically we're using ICU to do the detection ;) I didn't even realize Iconv had a detection API?

Yeah, I may have confused ICU and iconv. I meant the former :)

@nulltoken
Copy link
Member

Is it always a Stream? I thought at this point we'd already have all the bytes so it may as well just be a byte[].

Actually all we've got is a native pointer. Returning a byte[] would require to copy the memory, wouldn't it?

as long as there's semantics to say "I don't know" (probably by throwing an exception that libgit2sharp would catch).

And what would you expect the library to do with that? 😉

Actually the callback puts you in full control, but you have to return a string. If you don't know, well you'd have to do the guessing.

@nulltoken
Copy link
Member

Actually the callback puts you in full control, but you have to return a string. If you don't know, well you'd have to do the guessing.

FWIW, you can even use Encoding.Default 😝

@dahlbyk
Copy link
Member

dahlbyk commented May 14, 2013

If a fallback callback isn't provided, or if the fallback returns null, is it reasonable for lg2s to defer to Encoding.Default?

@nulltoken
Copy link
Member

@dahlbyk I'd rather not see Encoding.Default in LibGit2Sharp code.

@dahlbyk
Copy link
Member

dahlbyk commented May 14, 2013

Fair enough.

@anaisbetts
Copy link
Contributor

Actually the callback puts you in full control, but you have to return a string. If you don't know, well you'd have to do the guessing.

I'm okay with that too, that's probably better.

@yishaigalatzer
Copy link

Just my 2c, as a user of the library I'd expect that in most cases it behaves gracefully and fixes minor problems, which is why I suggested the fallback in the first place with Encoding.Default. It will work in most cases, and the consumer has a smaller concept count to worry about.

Augmenting it with an optional callback makes it customizable, but still I believe that the out of the box experience should be as simple as possible.

I didn't see a strong reason to oppose Encoding.Default rather an opinion, what would be a good alternative? Perhaps fallback to Latin code page if the default encoding is UTF8?

In regards to automated code page detection - This is a very hairy thing to do for short pieces of text like signatures, there is no reliable code I'm aware of that can do that. We experiments with tens of thousands (literally) of files, and detection systems, only to realize that UTF8 is a great first guess (and fortunately in this case this is the default by spec), and mangling by default is just a bad idea. It assumes that the average user will be fine, while he has a bug lurking that he is simply not aware of.

@anaisbetts
Copy link
Contributor

We experiments with tens of thousands (literally) of files, and detection systems, only to realize that UTF8 is a great first guess (and fortunately in this case this is the default by spec), and mangling by default is just a bad idea

I agree, the heuristic only comes into play when decoding as UTF8 fails or results in invalid characters.

Perhaps fallback to Latin code page if the default encoding is UTF8?

This is what I suggested initially but now believe is a bad idea. Going down this route means that you can't get the real data if the built-in marshaller fails. That's Bad™.

@brianmario
Copy link

Going down this route means that you can't get the real data if the built-in marshaller fails. That's Bad™.

Agree 1000000% - we probably shouldn't be modifying any bytes before handing them back to the user, or at least provide a way to get to them if we're going to transcode in the simple case. We'll unfortunately never get the transcoding 100% accurate so providing the original unmodified bytes somehow is critical to provide raw access to the git database.

@dahlbyk
Copy link
Member

dahlbyk commented May 15, 2013

I don't think anyone is arguing against providing access to the raw bytes, but realistically some consumers simply won't care enough to implement custom transcoding. If there's not a custom fallback in place, what would be the least surprising default behavior?

@brianmario
Copy link

Yeah I can agree there too. Though I have my own personal agenda about encoding awareness (read: OCD heh) in which we take extra effort to educate other developers about the fact that every stream of bytes is either binary or text in some encoding. If we're going to hide that for the sake of usability we're just punting on an opportunity to educate more people IMO.

That being said, I'm not trying to sign anyone up for any extra work so if there's anything I can do to help documentation-wise or brush off the dust on my C# skills from .NET 2.0 let me know ;)

@yishaigalatzer
Copy link

I agree with adding a fallback.

But here is my thoughts

  1. The fallback (direct access to binaries) can be added separately from this PR, it is currently not supported with the current marshaller.
  2. Keep in mind that in most cases the end user will have less clue about what to do for transcoding failures (unless say he knows for sure what the codepage used was).

@anaisbetts
Copy link
Contributor

Keep in mind that in most cases the end user will have less clue about what to do for transcoding failures (unless say he knows for sure what the codepage used was).

This is probably true. I'm okay with the default implementation of this Func<Stream, string> to be "Use Encoding.Default" or whatever. If people don't want to deal with it they shouldn't have to

@jbialobr
Copy link
Contributor

Perhaps fallback to Latin code page if the default encoding is UTF8?

This is what I suggested initially but now believe is a bad idea. Going down this route means that you can't get the real data if the built-in marshaller fails. That's Bad™.

You can fallback to ISO-8859-1 with no worry, there are no unused codes. So you can encode back given string into byte[] with no data loss.

@jbialobr
Copy link
Contributor

IMHO you shouldn't do any fallback by default, it should be an option. It is better to show user that something is wrong than try to cover it. If fallback makes user not see his error he will never correct it. Other users with system code page set to UTF-8 or using other git GUI will always see mojibakes.

@nulltoken
Copy link
Member

IMHO you shouldn't do any fallback by default, it should be an option. It is better to show user that something is wrong than try to cover it. If fallback makes user not see his error he will never correct it. Other users with system code page set to UTF-8 or using other git GUI will always see mojibakes.

@jbialobr 👍

@nulltoken
Copy link
Member

Guys, I'll start working on this once #531 is merged.

@ethomson
Copy link
Member

This is quite stale at this point. Closing it.

@ethomson ethomson closed this Nov 25, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants