Conservative and liberal string encodings #531

nulltoken · 2013-10-11T18:16:06Z

This PR started as an attempt to deal with #530.

Eventually, honoring i18n.commitEncoding turned out to be unnecessary while crafting a commit. This PR morphed into segregating the Marshalers into Strict and Lax categories.

Strict encoding will be used when passing data to libgit2 (and throw when the passed in string cannot be converted into UTF-8 bytes)
Lax encoding will be used when retrieving data from libgit2 (and convert (mangle) the data without throwing)

nulltoken · 2013-10-11T18:17:00Z

@jbialobr @ethomson Well, I've started to work on the encoding thingy... Not ready yet though.

ethomson · 2013-10-11T18:18:36Z

LibGit2Sharp/Core/Ensure.cs

+            }
+
+            throw new ArgumentException(
+                string.Format("Zero bytes ('\\0') are not allowed. One zero byte has been found at position {0}.", zeroPos), argumentName);


Nitpick: "One zero byte" makes it sound like there is only one zero byte in the string. Perhaps "A zero byte"?

nulltoken · 2013-10-11T18:26:22Z

Some questions:

Are there other methods that we should protect from the evil \0 char?
Should we use Encoding.UTF8 or new UTF8Encoding(false, true)

ethomson · 2013-10-11T18:31:49Z

Actually, having thought about this for another second, I think I am not in favor of throwing here. Invalid byte sequences coming from existing commits should not throw, ideally. We should do our best to decode that in the encoding specified and - if there are byte sequences that do not conform to the specified encoding - well, we should do something reasonable that allows the libgit2sharp consumer to continue. (Be liberal in what you accept, etc, etc, etc.) When the throwOnInvalidBytes param is set to false, this will replace with question-marks. This seems reasonable to me.

nulltoken · 2013-10-11T19:07:42Z

Be liberal in what you accept, etc, etc, etc.

@ethomson I agree with you. We may introduce asymmetric encoding with a LaxUtf8Marshaler and a StrictUtf8Marshaler. Strict would be used when creating and Lax when reading.

How should we deal with the FilePathMarshaler (which current rely on the Utf8Marshaler)? Should we make it rely on LaxUtf8 (or StrictUtf8) by default? Or should we also deal with proper LaxFilePathMarshaler when reading and StrictFilePathMarshaler when writing? I'm not question marks would help here 😉...

ethomson · 2013-10-11T19:21:21Z

Hmm. I think that we probably do want a strict and a lax, but I'm thinking that strictness / laxness only comes into the picture when turning a byte[] into a string?

If that's true, we probably want to be strict when dealing with file paths (if libgit2 hands us a filename with a set of bytes that is totally illegal then there's probably nothing we can do...?) But we probably want to be lax when dealing with data from commits (it would be really annoying if getting a commit threw for an encoding-related reason.)

ethomson · 2013-10-11T19:25:03Z

NOPE, I am totally wrong, it turns out that you can totally stuff a string full of random chars that do not represent unicode code points...!

Still, I think filenames should always be strict? And that accepting commit data should be lax and writing it should be strict, as you said.

jbialobr · 2013-10-11T20:02:46Z

LibGit2Sharp/Core/EncodingMarshaler.cs

+                throw new MarshalDirectiveException(
+                    string.Format("{0} must be used on a string.", GetType().Name));
+            }
+            var str = str1;


What is the purpose of this line?

It was the result of an Extract Method followed by an Inline Method that turned bad....

Fixed!

nulltoken · 2013-10-12T17:13:48Z

Guys, here's a first draft. Can you please take a look?

Current status:

Only strict encoding is actually dealt with. If that's ok with you, I'd like you (as part of the review) to help me and point which method should be relaxed).
Creation and parsing of commits now encode and decode on the fly the signature and the message.
Annotated tag bears no encoding header. As such they're, by default, considered as UTF-8 encoded.
Stash and Notes are technically made of commits. However libgit2 API doesn't expose (yet?) an encoding param for those. As such, until further change, their messages and signatures are also considered as UTF-8 encoded.

nulltoken · 2013-10-12T18:34:40Z

@jbialobr If I'm correct the current scenario should fail

Set i18n.commitEncoding to windows-1250

Use git.git to create the following git objects (using specific letters from this codepage for the messages and textual contents)
- Commits
- Create Notes
- Stash
Use LibGit2Sharp to
- Browse the Reflog
- List the notes
- List the stashes

jbialobr · 2013-10-12T19:31:19Z

AFAIK git.git uses i18n.commitEncoding only for commit message, so you shouldn't use it for author and commiter.
git.git takes author and commiter from config file. To read config file git.git uses:

on linux: current operating system encoding
on windows: msysgit - UTF8, cygwin - don't know exactly, but if I am not mistaken it uses OS encoding (by default equal to windows encoding).
on mac - I don't know.

The same encoding is used for branches, tags, etc.

Now I am wondering if it is worth the effort to handle all this cases. (Actually - I think it is not)

nulltoken · 2013-10-13T21:40:03Z

@jbialobr Actually, you've very right. This is completely useless to support i18n.commitEncoding while crafting a commit.

Signatures and commit messages are passed through strings. Those strings are in-memory UTF-16. They are marshaled to libgit2 as UTF-8 (which can represent any character). Probing for i18n.commitEncoding, or even allowing to support a different encoding than the git default one while creating a new commit is completely dumb.

Parsing a commit is a different story though, but that will have to be deal with while fixing #387.

I think I'm going to rip off the 3 last commits. However, I still think the previous commits make sense. Although they do not bring i18n support, they still add some value:

Only remaining point to be deal with would be to relax the utf8 parsing of strings coming out of the odb or filesystem.

/cc @carlosmn

carlosmn · 2013-10-13T22:57:54Z

As the CLR (and anything sensible created since the 90s) is based on Unicode, that's what we should be using. Anything else (like commitencoding) is support for broken/older/legacy systems which managed to get some other character set into the odb.

Reading data out of the repository is of course a different area, in which we should try not to fail reading bogus data, and there we can attempt to use commitencoding as the source encoding. But using a per-repo or per-user configuration option for something that can change as much as the encoding of any particular commit is fraught with silly issues no matter what.

jbialobr · 2013-10-14T05:31:04Z

This is completely useless to support i18n.commitEncoding while crafting a commit.

In GitExtensions we support it, but you opened my eyes and I see now that it was introduced mostly for command line usage when OS encoding differs from UTF8. I will remove support of crafting commit using i18n.commitEncoding.

Only remaining point to be deal with would be to relax the utf8 parsing of strings coming out of the odb or filesystem.

Note that filepaths, git refs, commiter and author data are sotred using OS encoding. There is no sense to try to read them correctly when you are not going to write them using OS encoding (the same file path written using different encodings is seen by git.git as two diffferent paths).
We force GitExtensions users to conver their repos to UTF8. https://github.com/gitextensions/gitextensions/wiki/Git%20Encoding. They can work with old encoding using msysgit version prior to msysgit-1.7.10, but I think after integratig with libgit2sharp we will stop supporting using it.

nulltoken · 2013-10-14T15:44:37Z

I've pushed my take about Lax and Strict utf-8 marshaling. Could you please take a peek at it? Lax encoding is used when retrieving strings from libgit2. Strict when passing string to libgit2.

/cc @jamill @phkelley @dahlbyk

nulltoken · 2013-10-14T22:06:55Z

We should do our best to decode that in the encoding specified and - if there are byte sequences that do not conform to the specified encoding - well, we should do something reasonable that allows the libgit2sharp consumer to continue. (Be liberal in what you accept, etc, etc, etc.)

@ethomson Does this meet what you had in mind?

Retrieval process from what we get from libgit2 shouldn't be different from before (ie. goes through Encoding.UTF8, or at least something similar which emits the UTF8 header and doesn't throw on invalid bytes)
Every data passed on to libgit2 is now enforced to be valid UTF8

ethomson · 2013-10-15T13:00:30Z

LibGit2Sharp/BranchCollection.cs

@@ -203,9 +203,9 @@ private static bool LooksLikeABranchName(string referenceName)
                referenceName.LooksLikeRemoteTrackingBranch();
        }

-        private static string branchToCanoncialName(IntPtr namePtr, GitBranchType branchType)
+        private static string BranchToCanoncialName(IntPtr namePtr, GitBranchType branchType)


Typo here: should be "canonical"

Thakns! Fiexd. 😉

ethomson · 2013-10-15T13:16:38Z

"at least something similar which emits the UTF8 header"

Are we really emitting a BOM here? I wouldn't expected us to; you wouldn't want a BOM in the middle of a signature, would you?

I'm very unclear on how the UTF8Encoding class emits a BOM. The documentation claims that the first boolean argument controls this, but in practice, I get the same byte[] (all without a BOM) from using

the System.Text.Encoding.UTF8 property, which we previously used; documentation does not specify whether it writes a BOM
new Encoding(true, true) which claims it should emit a BOM, but does not for me
new Encoding(false, true) which claims it should not emit a BOM, and does not for me

Anyway. I guess I'm trying to say that I think that in theory we should be using new UTF8Encoding(false, true) but I don't understand why there's no difference in practice.

jbialobr · 2013-10-15T13:19:03Z

It emits BOM when it writes a data to a file. System.Text.Encoding.UTF8 emits BOM.

ethomson · 2013-10-15T13:25:47Z

@jbialobr Interesting. What I did was:

new Encoding(true, true).GetBytes("foo")

and got no BOM. You're saying that this is expected, but there's some other method that does emit a BOM for conversion, depending on the arguments in the constructor?

nulltoken · 2013-10-15T13:26:15Z

the System.Text.Encoding.UTF8 property, which we previously used; documentation does not specify whether it writes a BOM

I've read this as well. Eventually, I took this approach by peeking at the Mono source code (see Encoding.cs and Utf8Encoding.cs).

jbialobr · 2013-10-15T13:28:41Z

Do:

File.WriteAllText(fileName, "foo", new Encoding(true, true));

and you get BOM.

ethomson · 2013-10-15T13:32:53Z

I've read this as well. Eventually, I took this approach by peeking at the Mono source code (see Encoding.cs and Utf8Encoding.cs).

It's unwise for me to look at the source for a different CLR implementation. :)

Would you mind telling me the result?

jbialobr · 2013-10-15T13:34:54Z

LibGit2Sharp/TreeChanges.cs

@@ -75,8 +75,8 @@ private int FileCallback(GitDiffDelta delta, float progress, IntPtr payload)

        private int PrintCallBack(GitDiffDelta delta, GitDiffRange range, GitDiffLineOrigin lineorigin, IntPtr content, UIntPtr contentlen, IntPtr payload)
        {
-            string formattedoutput = Utf8Marshaler.FromNative(content, (int)contentlen);
-            var filePath = FilePathMarshaler.FromNative(delta.NewFile.Path);
+            string formattedoutput = LaxUtf8Marshaler.FromNative(content, (int)contentlen);


Here you should use encoding that was used to write file's content.

I'm not sure to know how we should detect the encoding that was initially used to create the file (@carlosmn @ethomson @arrbee any idea?).

However, I'd rather prefer to keep this PR being focused on Lax/Strict segregation. Maybe could you open an issue regarding this topic?

You guess, following the industry's best practices http://xkcd.com/221/

You have no information on the encoding of any given file, unless you're willing to go look at XML headers or HTML meta tags and whatnot (and lol if you believe those are going to be correct). If you can't parse it as Unicode, then maybe you can try to use the current user's locale (assuming this is not Unicode because it's used by a group of people with another de facto encoding).

As a last resort, you can do something libmagic- or libicu-like by looking at the content of the pre- or post-image and guessing based on that.

Well, maybe would this would be another use case where providing an optional callback for decoding related issues might be useful?

Maybe could you open an issue regarding this topic?

@jbialobr If you could also come up with a failing test, that'd be awesome 😉

git.git diplays any character out of system codepage range as <byte code>
Example
sprawdzi<E6> is diplayed for sprawdzić

In Git Extensions we have Files content encoding setting for each repo. IMHO it is very uncommon to hold files of different content encoding in a single repo. No one has requested per file setting of content encoding (but it would be the clearest solution).

git.git diplays any character out of system codepage range as

That would not be git, but the pager (typically less). git-core doesn't care about your encoding. That's the job of whatever displays the characters.

If you could also come up with a failing test, that'd be awesome

Done. #533

If you could also come up with a failing test, that'd be awesome

Done. #533

✨

nulltoken · 2013-10-15T13:35:18Z

Would you mind telling me the result?

Mono Encoding.UTF8 implementation creates a new UTF8Encoding type passing a parameter requiring the encoding to emit the UTF8 identifier/header.

@jbialobr @ethomson What's your take on this? Should we use

new Encoding(true,... which looks like matching Encoding.UTF8 implementation
new Encoding(false, ... which may feel safer as we control what's produced

jbialobr · 2013-10-15T13:37:31Z

I preffer to use new Encoding(false, ... as it is optional to emit BOM, and there exists apps that doesn't support BOM correctly.

ethomson · 2013-10-15T13:46:53Z

Well, maybe a BOM might be useful for files, but that's a different conversation. We're talking about all the marshaling here, and a BOM in the middle of a commit message is never going to be correct. We can't have:

Author: \xEF\xBB\xBFauthor name <\xEF\xBB\xBFauthor@example.com>...

\xEF\xBB\xBFThis is a message...

I think that using false here makes our intention more obvious. That said, this appears moot: consumers of the encoding class (like using File.WriteAllText above - thanks @jbialobr for the example) are expected to call GetPreamble. The UTF8Encoding class switches in GetPreamble based on the value of the encoderShouldEmitUTF8Identifier constructor argument - if it was true, GetPreamble will return the BOM, otherwise it will return an empty byte array.

For our uses, then, I think it's identical.

nulltoken · 2013-10-15T15:01:34Z

"I preffer to use new Encoding(false, ..."
"I think that using false here makes our intention more obvious. "

@jbialobr @ethomson Fixed with 0034283

jbialobr · 2013-10-15T18:15:46Z

LibGit2Sharp/Core/NativeMethods.cs

            ref GitOid tree,
            int parentCount,
            [MarshalAs(UnmanagedType.LPArray)] [In] IntPtr[] parents);

        [DllImport(libgit2)]
-        [return : MarshalAs(UnmanagedType.CustomMarshaler, MarshalCookie = UniqueId.UniqueIdentifier, MarshalTypeRef = typeof(Utf8NoCleanupMarshaler))]
+        [return : MarshalAs(UnmanagedType.CustomMarshaler, MarshalCookie = UniqueId.UniqueIdentifier, MarshalTypeRef = typeof(LaxUtf8NoCleanupMarshaler))]


Why did you drop using git_commit_message_encoding here? You have aviable this information, so I see no reason to not using it.

In my mind, this should be taken care of as part of #427. At least this is what I foresee to do.

nulltoken · 2013-10-15T19:06:35Z

Squashed and rebased.

Ok to merge?

ethomson reviewed Oct 11, 2013
View reviewed changes

jbialobr reviewed Oct 11, 2013
View reviewed changes

nulltoken mentioned this pull request Oct 12, 2013

Failing test for handling i18n.commitEncoding #530

Closed

nulltoken mentioned this pull request Oct 13, 2013

i18n.commitEncoding is EVIL #532

Open

nulltoken mentioned this pull request Oct 15, 2013

Implement encoding fallback for commit data #427

Closed

ethomson reviewed Oct 15, 2013
View reviewed changes

jbialobr reviewed Oct 15, 2013
View reviewed changes

nulltoken added 2 commits October 15, 2013 19:34

Introduce EncodingMarshaler.Cleanup()

75eaceb

Make string marshaling either lax or strict

b10c4b6

jbialobr reviewed Oct 15, 2013
View reviewed changes

nulltoken merged commit b10c4b6 into vNext Oct 16, 2013

nulltoken deleted the ntk/fix/commit_encoding branch October 16, 2013 23:17

nulltoken referenced this pull request in GitTools/GitVersion Nov 12, 2013

remove fody dependency

d980913

Conservative and liberal string encodings #531

Conservative and liberal string encodings #531

Uh oh!

Conversation

nulltoken commented Oct 11, 2013

Uh oh!

nulltoken commented Oct 11, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nulltoken commented Oct 11, 2013

Uh oh!

ethomson commented Oct 11, 2013

Uh oh!

nulltoken commented Oct 11, 2013

Uh oh!

ethomson commented Oct 11, 2013

Uh oh!

ethomson commented Oct 11, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nulltoken commented Oct 12, 2013

Uh oh!

nulltoken commented Oct 12, 2013

Uh oh!

jbialobr commented Oct 12, 2013

Uh oh!

nulltoken commented Oct 13, 2013

Uh oh!

carlosmn commented Oct 13, 2013

Uh oh!

jbialobr commented Oct 14, 2013

Uh oh!

nulltoken commented Oct 14, 2013

Uh oh!

nulltoken commented Oct 14, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ethomson commented Oct 15, 2013

Uh oh!

jbialobr commented Oct 15, 2013

Uh oh!

ethomson commented Oct 15, 2013

Uh oh!

nulltoken commented Oct 15, 2013

Uh oh!

jbialobr commented Oct 15, 2013

Uh oh!

ethomson commented Oct 15, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nulltoken commented Oct 15, 2013

Uh oh!

jbialobr commented Oct 15, 2013

Uh oh!