Implement encoding fallback for commit data #427

martinwoodward · 2013-05-09T16:18:51Z

Work-around to resolve issue #387 where a repo has commit data that is not correctly encoded as UTF8. Not sure that this should be merged yet but throwing it out there for comment.

When invalid encoding is in the author name or other commit metadata, libgit2 always returns the text as if it where UTF8 encoded (which it should be). jGit will attempt to parse using the current systems default encoder. Core git will simply return the bytes leaving it up to the client shell to interpret (which posh-git does by defaulting to cp-1252 but Git Bash in msysgit shows differently)

…et to a 1252 default

martinwoodward · 2013-05-09T16:26:58Z

Oh, that failed for a different reason than I was expecting. Should I have included the test repo in a post build step somewhere to make sure that it gets copied across when running on the build server?

dahlbyk · 2013-05-09T16:54:54Z

Could it be that your test repo's refs folder is missing?

…r exists on clone

martinwoodward · 2013-05-10T10:12:43Z

Thanks @dahlbyk - turns out my bare repo was a little too bare...

Build now failing as expected. Basically get an error as discussed because the default encoding on Linux is UTF8. Unit test passes on a Windows machine with a Latin1 default charset.

Would we still want to add the encoding fallback?

If so I can code the test so that it tests for Latin1 fallback when default encoding is set to Latin1, Dos850 when that is the codepage etc then tidy up the commits and re-submit.

ethomson · 2013-05-10T11:50:00Z

Is there a workaround for consumers of the library if they wanted to do the fallback themselves?

Can you get the raw bytes out of the string and then try to parse them yourself?

martinwoodward · 2013-05-10T12:06:08Z

I was trying to think of a way without polluting the API too much, but couldn't think of one without making something public that is really very low down in the guts of marshalling data to/from the natives.

About the best idea (that @nulltoken came up with) was to pass an optional default fallback encoding down as a configuration parameter when you construct the repo. Most consumers of the library would just go with a default which falls back to the default encoding but you could pass an encoding to fall back to or switch the fallback behaviour off. The unit tests would specify an encoding to use and be able to test the fallback behaviour.

nulltoken · 2013-05-10T16:28:06Z

About the best idea was to pass an optional default fallback encoding down as a configuration parameter when you construct the repo.

@dahlbyk @xpaulbettsx @spraints @KindDragon What's your feeling about this issue? How would you like see this handled? Would you need an optional callback?

I'm a little bit wary about this as there's very little chance to correctly guess what was the initial encoding of the commit details. Moreover, as one can see in the JGit tests (cf. above), nothing guarantees that the signatures and the commit message have been encoded with the same initial encoding.

haacked · 2013-05-10T16:42:12Z

What's the drawback to guessing wrong in these cases though? I mean it goes from wrong to still wrong, right? But in some cases, it's right. :)

nulltoken · 2013-05-10T17:06:42Z

What's the drawback to guessing wrong in these cases though? I mean it goes from wrong to still wrong, right? But in some cases, it's right. :)

Instead of being wrong everywhere, it's going to depend on the user's default locale. It's a recipe for very nice Fails-on-the-user's-machine-only bug reports ;-)

Moreover, it's not fixing the root cause. The commit that has been initially pushed is actually wrongly encoded. Instead of benefiting from the callback running the perfect heuristic to nicely display a wrongly encoded commit, I'd rather be in favor of adding a button on the UIs stating "Click to send an email to the author/committer of this commit begging him to fix his configuration once and for all". 😉

anaisbetts · 2013-05-10T18:52:59Z

We've seen this issue a lot on github.com, where even if the commit is hinted to be a certain encoding, it's actually a different encoding.

It'd be awesome if we could guess the encoding via statistical analysis (relatively straightforward to do, but is a fair amount of code if you don't have a library like iconv).

In the Windows case, the usual fallback is, that if it's not UTF-8, assume that it's non-Unicode encoded using the user's OEM Code page (i.e. run it through MultiByteToWideChar and hope for the best). On !Windows, you can probably bet on iconv existing, though I'm sure there are 100000000 versions of it installed on people's boxes.

/cc @brianmario

ethomson · 2013-05-10T21:25:17Z

Hm. Maybe there exists some same heuristic, but I don't know enough about this madness to actually know that to be true.

Falling back to the user's codepage (for somebody else's commits) will, I think, violate the "it can't make things worse" assumption.

Consider that I commit with a remarkably popular windows encoding (CP 1252) and you're running some popular git server on java, running on your Mac, meaning that your default file.encoding sys prop is MacRoman.

My name is David Fòcker, and I'm very embarrassed whenever anybody views my commits.

On May 10, 2013, at 8:53 PM, Paul Betts notifications@github.com wrote:

We've seen this issue a lot on github.com, where even if the commit is hinted to be a certain encoding, it's actually a different encoding.

It'd be awesome if we could guess the encoding via statistical analysis (relatively straightforward to do, but is a fair amount of code if you don't have a library like iconv).

In the Windows case, the usual fallback is, that if it's not UTF-8, assume that it's non-Unicode encoded using the user's OEM Code page (i.e. run it through MultiByteToWideChar and hope for the best). On !Windows, you can probably bet on iconv existing, though I'm sure there are 100000000 versions of it installed on people's boxes.

/cc @brianmario

—
Reply to this email directly or view it on GitHub.

anaisbetts · 2013-05-11T01:13:28Z

Falling back to the user's codepage (for somebody else's commits) will, I think, violate the "it can't make things worse" assumption

So, on Windows I don't think there's much in the way of options there - at least MultiByteToWideChar will replace invalid characters with "?"s. You might be able to find something useful in the NLS functions, but I don't think so (I'd love to be wrong though!)

martinwoodward · 2013-05-11T01:33:58Z

I think we're talking about doing all the byte array to char array conversion in the language bindings. LibGit2 just returns bytes from the raw data structure and it's up to the language bindings to correctly represent them - i.e. in this case .NET framework elements. I understand the arguments here on correctness, however I'm also heavily persuaded by the pragmatic solutions uncovered after exposure to a lot of repos. How about this:

LibGit2 just passes bytes. The language bindings are responsible for encoding those bytes into char arrays / strings.
When a Repository object is constructed, an optional fallback encoding can be provided. A fallback of UTF8 is essentially a no-op. If no fallback encoding is passed then this defaults to the platform default encoding as experience shows that this is more often correct for client side processes so makes LibGit2Sharp a friendly and easy to consume API when it comes to new developers coding against the library.
The commit signature is parsed as UTF8, but if that fails encoding then it will fallback to the encoding specified or the platform default if no encoding specified.
The commit message is parsed with the specified commit message encoding. If no encoding passed (as is the normal case) then the commit message is parsed at UTF8
If LibGit2Sharp is being run in a server side process then the fallback encoding is placed in to the Repository constructor which is then used to ensure it is the client requestors encoding.

nulltoken · 2013-05-14T15:35:17Z

How about this?

LibGit2 just passes bytes. The language bindings are responsible for encoding those bytes into char arrays / strings.
When a Repository object is constructed, an optional fallback encoding can be provided. A fallback of UTF8 is essentially a no-op. If no fallback encoding is passed then this defaults to the platform default encoding as experience shows that this is more often correct for client side processes so makes LibGit2Sharp a friendly and easy to consume API when it comes to new developers coding against the library. If no fallback is passed then the string will be parsed as UTF8 and returned (even if mangled). The code leveraging LibGit2Sharp is, of course, free to pass Encoding.Default as a fallback while instantiating an new Repository.
The commit signature is parsed as UTF8, but if that fails encoding then it will fallback to the encoding specified ~~or the platform default if no encoding specified~~ or return the mangled parsed string when no fallback has been passed.
The commit message is parsed with the specified commit message encoding. If no encoding passed (as is the normal case) then the commit message is parsed at UTF8
If LibGit2Sharp is being run in a server side process then the fallback encoding is placed in to the Repository constructor which is then used to ensure it is the client requestors encoding.

I'd rather not make LibGit2Sharp internally default to the platform default.

nulltoken · 2013-05-14T15:35:57Z

/cc @yishaigalatzer

haacked · 2013-05-14T15:53:24Z

@nulltoken I could live with that. It'd give us the option to be in control of our destiny. I like that.

anaisbetts · 2013-05-14T17:27:05Z

Actually, I think this idea is horrible. I'd rather have a way to pass libgit2sharp a method to let me do the conversion myself, instead of handing me back a mangled string that I now get to demangle. Give me a UTF8 string, or give me the actual bytes and let me make it a UTF-8 string. This "Fallback encoding" stuff will never be correct in all cases.

ethomson · 2013-05-14T17:39:55Z

I agree with @xpaulbettsx - if users want to mangle, let them do it. Then we can wash our hands of this nonsense.

anaisbetts · 2013-05-14T18:06:59Z

My plan for this is to release a LibGit2Sharp.Iconv library that provides an implementation based on what we currently do on GitHub.com, comma the Website - use ICU to actually heuristically determine the encoding and marshal it to UTF-8.

brianmario · 2013-05-14T18:11:51Z

@xpaulbettsx well, technically we're using ICU to do the detection ;) I didn't even realize Iconv had a detection API?

nulltoken · 2013-05-14T18:47:14Z

I'd rather have a way to pass libgit2sharp a method to let me do the conversion myself

@xpaulbettsx Why not? This was actually my first proposal.

Would a Func<Stream, string> do the trick? Where Stream would in fact be an UnmanagedMemoryStream.

anaisbetts · 2013-05-14T19:08:11Z

@xpaulbettsx Why not? This was actually my first proposal.

This is what I get when I don't read carefully

Would a Func<Stream, string> do the trick? Where Stream would in fact be an UnmanagedMemoryStream.

Is it always a Stream? I thought at this point we'd already have all the bytes so it may as well just be a byte[]. Stream would work too, as long as there's semantics to say "I don't know" (probably by throwing an exception that libgit2sharp would catch).

anaisbetts · 2013-05-14T19:09:04Z

well, technically we're using ICU to do the detection ;) I didn't even realize Iconv had a detection API?

Yeah, I may have confused ICU and iconv. I meant the former :)

nulltoken · 2013-05-14T19:26:46Z

Is it always a Stream? I thought at this point we'd already have all the bytes so it may as well just be a byte[].

Actually all we've got is a native pointer. Returning a byte[] would require to copy the memory, wouldn't it?

as long as there's semantics to say "I don't know" (probably by throwing an exception that libgit2sharp would catch).

And what would you expect the library to do with that? 😉

Actually the callback puts you in full control, but you have to return a string. If you don't know, well you'd have to do the guessing.

nulltoken · 2013-05-14T19:31:15Z

Actually the callback puts you in full control, but you have to return a string. If you don't know, well you'd have to do the guessing.

FWIW, you can even use Encoding.Default 😝

dahlbyk · 2013-05-14T19:39:06Z

If a fallback callback isn't provided, or if the fallback returns null, is it reasonable for lg2s to defer to Encoding.Default?

nulltoken · 2013-05-14T19:40:38Z

@dahlbyk I'd rather not see Encoding.Default in LibGit2Sharp code.

dahlbyk · 2013-05-14T19:57:59Z

Fair enough.

anaisbetts · 2013-05-14T22:10:27Z

Actually the callback puts you in full control, but you have to return a string. If you don't know, well you'd have to do the guessing.

I'm okay with that too, that's probably better.

yishaigalatzer · 2013-05-15T00:24:25Z

Just my 2c, as a user of the library I'd expect that in most cases it behaves gracefully and fixes minor problems, which is why I suggested the fallback in the first place with Encoding.Default. It will work in most cases, and the consumer has a smaller concept count to worry about.

Augmenting it with an optional callback makes it customizable, but still I believe that the out of the box experience should be as simple as possible.

I didn't see a strong reason to oppose Encoding.Default rather an opinion, what would be a good alternative? Perhaps fallback to Latin code page if the default encoding is UTF8?

In regards to automated code page detection - This is a very hairy thing to do for short pieces of text like signatures, there is no reliable code I'm aware of that can do that. We experiments with tens of thousands (literally) of files, and detection systems, only to realize that UTF8 is a great first guess (and fortunately in this case this is the default by spec), and mangling by default is just a bad idea. It assumes that the average user will be fine, while he has a bug lurking that he is simply not aware of.

anaisbetts · 2013-05-15T02:35:48Z

We experiments with tens of thousands (literally) of files, and detection systems, only to realize that UTF8 is a great first guess (and fortunately in this case this is the default by spec), and mangling by default is just a bad idea

I agree, the heuristic only comes into play when decoding as UTF8 fails or results in invalid characters.

Perhaps fallback to Latin code page if the default encoding is UTF8?

This is what I suggested initially but now believe is a bad idea. Going down this route means that you can't get the real data if the built-in marshaller fails. That's Bad™.

brianmario · 2013-05-15T08:20:08Z

Going down this route means that you can't get the real data if the built-in marshaller fails. That's Bad™.

Agree 1000000% - we probably shouldn't be modifying any bytes before handing them back to the user, or at least provide a way to get to them if we're going to transcode in the simple case. We'll unfortunately never get the transcoding 100% accurate so providing the original unmodified bytes somehow is critical to provide raw access to the git database.

dahlbyk · 2013-05-15T08:39:15Z

I don't think anyone is arguing against providing access to the raw bytes, but realistically some consumers simply won't care enough to implement custom transcoding. If there's not a custom fallback in place, what would be the least surprising default behavior?

brianmario · 2013-05-15T08:45:38Z

Yeah I can agree there too. Though I have my own personal agenda about encoding awareness (read: OCD heh) in which we take extra effort to educate other developers about the fact that every stream of bytes is either binary or text in some encoding. If we're going to hide that for the sake of usability we're just punting on an opportunity to educate more people IMO.

That being said, I'm not trying to sign anyone up for any extra work so if there's anything I can do to help documentation-wise or brush off the dust on my C# skills from .NET 2.0 let me know ;)

yishaigalatzer · 2013-05-15T16:53:26Z

I agree with adding a fallback.

But here is my thoughts

The fallback (direct access to binaries) can be added separately from this PR, it is currently not supported with the current marshaller.
Keep in mind that in most cases the end user will have less clue about what to do for transcoding failures (unless say he knows for sure what the codepage used was).

anaisbetts · 2013-05-15T19:08:07Z

Keep in mind that in most cases the end user will have less clue about what to do for transcoding failures (unless say he knows for sure what the codepage used was).

This is probably true. I'm okay with the default implementation of this Func<Stream, string> to be "Use Encoding.Default" or whatever. If people don't want to deal with it they shouldn't have to

jbialobr · 2013-05-23T19:20:34Z

Perhaps fallback to Latin code page if the default encoding is UTF8?

This is what I suggested initially but now believe is a bad idea. Going down this route means that you can't get the real data if the built-in marshaller fails. That's Bad™.

You can fallback to ISO-8859-1 with no worry, there are no unused codes. So you can encode back given string into byte[] with no data loss.

jbialobr · 2013-05-23T19:34:12Z

IMHO you shouldn't do any fallback by default, it should be an option. It is better to show user that something is wrong than try to cover it. If fallback makes user not see his error he will never correct it. Other users with system code page set to UTF-8 or using other git GUI will always see mojibakes.

nulltoken · 2013-05-23T21:02:13Z

IMHO you shouldn't do any fallback by default, it should be an option. It is better to show user that something is wrong than try to cover it. If fallback makes user not see his error he will never correct it. Other users with system code page set to UTF-8 or using other git GUI will always see mojibakes.

@jbialobr 👍

nulltoken · 2013-10-15T10:27:38Z

Guys, I'll start working on this once #531 is merged.

ethomson · 2017-11-25T18:38:49Z

This is quite stale at this point. Closing it.

martinwoodward added 4 commits May 9, 2013 02:03

Create a test for #387 against a repo with funky commit encodings

7326361

Implement encoding fallback in commit messages for #387

0032af2

Comment out the MacRoman and Dos850 tests so test passes on Windows s…

aa28f6c

…et to a 1252 default

Add a head ref in the commitencoding bare repo so that the refs folde…

a837637

…r exists on clone

nulltoken mentioned this pull request Oct 15, 2013

Conservative and liberal string encodings #531

Merged

jbialobr mentioned this pull request Oct 18, 2013

Re-Integrate LibGit2Sharp branch into master gitextensions/gitextensions#1996

Closed

4 tasks

nulltoken referenced this pull request in GitTools/GitVersion Nov 12, 2013

remove fody dependency

d980913

nulltoken force-pushed the vNext branch from 49987df to 0e5bfbc Compare October 8, 2014 22:05

ethomson closed this Nov 25, 2017

Implement encoding fallback for commit data #427

Implement encoding fallback for commit data #427

Uh oh!

Conversation

martinwoodward commented May 9, 2013

Uh oh!

martinwoodward commented May 9, 2013

Uh oh!

dahlbyk commented May 9, 2013

Uh oh!

martinwoodward commented May 10, 2013

Uh oh!

ethomson commented May 10, 2013

Uh oh!

martinwoodward commented May 10, 2013

Uh oh!

nulltoken commented May 10, 2013

Uh oh!

haacked commented May 10, 2013

Uh oh!

nulltoken commented May 10, 2013

Uh oh!

anaisbetts commented May 10, 2013

Uh oh!

ethomson commented May 10, 2013

Uh oh!

anaisbetts commented May 11, 2013

Uh oh!

martinwoodward commented May 11, 2013

Uh oh!

nulltoken commented May 14, 2013

Uh oh!

nulltoken commented May 14, 2013

Uh oh!

haacked commented May 14, 2013

Uh oh!

anaisbetts commented May 14, 2013

Uh oh!

ethomson commented May 14, 2013

Uh oh!

anaisbetts commented May 14, 2013

Uh oh!

brianmario commented May 14, 2013

Uh oh!

nulltoken commented May 14, 2013

Uh oh!

anaisbetts commented May 14, 2013

Uh oh!

anaisbetts commented May 14, 2013

Uh oh!

nulltoken commented May 14, 2013

Uh oh!

nulltoken commented May 14, 2013

Uh oh!

dahlbyk commented May 14, 2013

Uh oh!

nulltoken commented May 14, 2013

Uh oh!

dahlbyk commented May 14, 2013

Uh oh!

anaisbetts commented May 14, 2013

Uh oh!

yishaigalatzer commented May 15, 2013

Uh oh!

anaisbetts commented May 15, 2013

Uh oh!

brianmario commented May 15, 2013

Uh oh!

dahlbyk commented May 15, 2013

Uh oh!

brianmario commented May 15, 2013

Uh oh!

yishaigalatzer commented May 15, 2013

Uh oh!

anaisbetts commented May 15, 2013

Uh oh!

jbialobr commented May 23, 2013

Uh oh!

jbialobr commented May 23, 2013

Uh oh!