Failing test for handling i18n.commitEncoding #530

jbialobr · 2013-10-10T04:55:34Z

No description provided.

nulltoken · 2013-10-10T14:21:27Z

@jbialobr Thanks for this.

Currently libgit2 expects commit message, committer and author names as zero terminated strings. So provided the i18n.commitEncoding doesn't point to an encoding that introduce zeros as part of the produced output (eg. utf-16, utf-32), we should be able to honor this quite easily.

I think we could make the following happen:

Add a string parameter to repo.ObjectDatabase.CreateCommit() to make it accept an encoding name
Make repo.Commit() retrieve the i18n.commitEncoding config entry, if any, and pass it to the above function.

This will produce a commit in the ObjectDatabase which will bear an encoding entry as part of the header section.

Of course, this will assume that the passed in strings (message, signature names and emails) have been previously generated with this same encoding, as this encoding will also be used to decode the commit headers when loading a commit back from the ObjectDatabase.

Would that make sense to you?

/cc @ethomson

ethomson · 2013-10-10T14:30:24Z

What you're describing makes sense to me, @nulltoken.

I think that we should do three things as prerequisites to make sure that this solution is complete:

Ensure that there are no non-Unicode encodings that would legitimately produce a null in their representative byte array
~~Ensure that when we produce UTF-8 strings we create "Modified UTF-8" such that it does not include a null~~
Ensure that core git does not handle UTF-16 or UTF-32 commit messages or otherwise put a null character into the commit body

I am investigating point 1 currently.

nulltoken · 2013-10-10T15:31:50Z

Ensure that when we produce UTF-8 strings we create "Modified UTF-8" such that it does not include a null

@ethomson Hmmm. No dice for me. The tests below miserably fail.

[Fact]
public void ZerosAreThatEvil()
{
    string input = "hel\0lo";

    Encoding utf8encoder = new UTF8Encoding(false, true);

    byte[] bytes = utf8encoder.GetBytes(input);

    foreach (var b in bytes)
    {
        Assert.NotEqual((byte)0, b);
    }
}

[Fact]
public void ZerosAreThatEvil_TheSequel()
{
    var bytes = new byte[]
                {
                    (byte) 'h', (byte) 'e', (byte) 'l', 0xC0, 0x80, (byte) 'l', (byte) 'o'
                };

    Encoding utf8decoder = new UTF8Encoding(false, true);

    string output = null;
    Assert.DoesNotThrow(() => output = utf8decoder.GetString(bytes));

    Assert.Equal("hel\0lo", output);
}

ethomson · 2013-10-10T16:28:35Z

OH FER JGIT:

Repository repo = new RepositoryBuilder().setGitDir(new File("C:/Temp/test/.git")).build();

CommitBuilder commit = new CommitBuilder();
commit.setAuthor(new PersonIdent("Name", "test@test.test", Calendar.getInstance().getTime(), TimeZone.getDefault()));
commit.setCommitter(new PersonIdent("Name", "test@test.test", Calendar.getInstance().getTime(), TimeZone.getDefault()));
commit.setMessage("Foo!\0Has an embedded null!");
commit.setTreeId(ObjectId.fromString("aaff74984cccd156a469afa7d9ab10e4777beb24"));

ObjectInserter repoInserter = repo.newObjectInserter();
ObjectId result = repo.newObjectInserter().insert(commit);
repoInserter.flush();
repo.close();

Produces:

0000080: 3432 3232 3839 202d 3034 3030 0a0a 466f  422289 -0400..Fo
0000090: 6f21 0048 6173 2061 6e20 656d 6265 6464  o!.Has an embedd
00000a0: 6564 206e 756c 6c21                      ed null!

YUP, that's a real NUL in the middle of our commit!

ethomson · 2013-10-10T16:36:19Z

Okay, so I guess we have several options, when given a \0 in a string. I am listing them in order from my least to most favorite:

Figure out how to put it in the commit message body as a 0x00. Can we all agree that this is obviously and totally very very wrong?
Convert to 0xc0 0x80. I thought jgit would have prior art here but I was very wrong, apparently only JNI uses modified UTF-8, the regular UTF-8 conversion routines do not. Lacking that, this seems like a no-win.
Truncate at the 0x00 (ie, by passing it to libgit2) silently. Not a fan.
Remove the '\0 silently.
Throw

I suppose the argument in favor of mucking things about silently is that probably nobody will ever do this. Except maybe to see how we handle it, as I just did with jgit. But I am opposed to taking a legitimate (if weird) C# string and silently changing it.

nulltoken · 2013-10-10T17:28:04Z

While creating a Commit
👍 to throw. This will have to be checked in the Signature constructor and the ObjectDatabase.CreateCommit() method.

While reading a Commit from the ObjectDatabase
I'm not sure about our sane options here. How about letting the picked Encoding deal with it?

ethomson · 2013-10-10T17:29:34Z

While reading a Commit from the ObjectDatabase

Yup. If there's mangled stuff in there, then I would say that this is undefined behavior. If there's crazy bytes in there, there's only so much we can do.

jbialobr · 2013-10-10T18:39:01Z

This is from git.git source:

    if (memchr(msg->buf, '\0', msg->len))
        return error("a NUL byte in commit log message not allowed.");

So throw an exception is the right way.

jbialobr · 2013-10-10T18:52:25Z

Add a string parameter to repo.ObjectDatabase.CreateCommit() to make it accept an encoding name

I wouldn't do that. I would make repo.ObjectDatabase.CreateCommit() retrieve the i18n.commitEncoding config entry, if any, and if commitEncoding != '' and != UTF-8 then encode appropriate commit parts using this encoding.
This will allow lg2s customer to not care about encoding stuff.

nulltoken · 2013-10-10T19:42:17Z

I would make repo.ObjectDatabase.CreateCommit() retrieve the i18n.commitEncoding config entry

repo.ObjectDatabase.CreateCommit() is a very low level method. It directly interacts with the odb. It doesn't try to guess its parent(s). It doesn't update the HEAD. As such, I'd rather keep it as bare as possible and not make it depend on the configuration. However, repo.Commit() do all of these, and looks like the correct place to try to guess the encoding to be used.

then encode appropriate commit parts using this encoding.

repo.ObjectDatabase.CreateCommit() will take care of encoding the inputs.

BTW going back to my previous statement ("this will assume that the passed in strings (message, signature names and emails) have been previously generated with this same encoding"), I realize that this is completely wrong/misleading. A string is a sequential collection of Unicode characters. Encoding only comes into play when when converting to/from bytes.

This will allow lg2s customer to not care about encoding stuff.

Considering those above statements, I think we could still achieve the "not caring" goal.

Standard customers will rather use repo.Commit() and won't have to care about encoding.
Advanced ones will have the possibility to specify the Encoding they'd like to see used in their commit through repo.ObjectDatabase.CreateCommit()

jbialobr · 2013-10-10T20:11:37Z

repo.ObjectDatabase.CreateCommit() will take care of encoding the inputs.

That is ok. Reading your previous comment I thought that consumer will have to pass message that is encoded in encoding argument.

Encoding only comes into play when when converting to/from bytes.

Yes, but to pass it to libgit2 you have to create a string that consits from bytes that decodes string in encoding argument.

        public static string ReEncodeString(string s, Encoding fromEncoding, Encoding toEncoding)
        {
            if (s == null || fromEncoding.HeaderName.Equals(toEncoding.HeaderName))
                return s;
            else
            {
                byte[] bytes = fromEncoding.GetBytes(s);
                s = toEncoding.GetString(bytes);
                return s;
            }
        }

ethomson · 2013-10-10T20:15:03Z

@jbialobr What purpose does that function serve? It looks like its only purpose is to corrupt a string.

jbialobr · 2013-10-10T20:17:53Z

Git stores commit messages as a raw bytes. So when you pass a string, it has to consists of raw bytes as they were used to decode string in encoding argument.

ethomson · 2013-10-10T20:19:25Z

@jbialobr Yes, of course. So why are you trying to take the bytes out of a string as one encoding and then put them back in as if it were a different encoding? What are you going to do with the resultant string?

jbialobr · 2013-10-10T20:21:21Z

Pass to git_commit_create.

jbialobr · 2013-10-10T20:23:50Z

I assume that if (git_buf_puts(&commit, message) < 0) takes raw bytes from string 'as is' - no recoding is performed.

ethomson · 2013-10-10T20:28:40Z

Yes, libgit2 just takes bytes. But I don't see what that has to do with the getting the bytes out of a string in one encoding, then feeding those bytes into an encoder as if those bytes were in a different encoding, which they're not. Then getting a string out... Then...?

We need to take a string, convert it to a byte[] in the encoding you want, and then hand that byte[] to git_commit_create. No need to re-encode anything, and even if there were, pretending a byte[] in one encoding is a different encoding is not the way to do that.

jbialobr · 2013-10-10T20:32:30Z

I agree, I thought you have to pass a string not a byte[]. When you would have to pass a string then you have to create its byte[] equivalent using as toEnconding some lossless encoding(ISO-8859-1).

ethomson · 2013-10-10T20:35:51Z

Okay, great, it sounds like we're on the same page - apologies if I was misunderstanding you. We can just change the proxy layer to avoid any re-encoding!

nulltoken · 2013-10-10T21:28:02Z

We can just change the proxy layer to avoid any re-encoding!

@ethomson cough Utf8Marshaler cough

jbialobr · 2013-10-11T04:29:11Z

I have never worked before with native libriaries so I don't know which moves are allowed. But I am glad I can learn new things from you.

nulltoken · 2013-10-11T05:32:24Z

I have never worked before with native librairies so I don't know which moves are allowed.

@jbialobr I you feel like working on a Pull Request, we'd more than happy to help you with it.

If your plate is already full, please let me know and I'll start working on it.

jbialobr · 2013-10-11T06:52:24Z

@nulltoken I feel that I used inappropriate time in my previous comment. I am now working on a PR. I see that you already know what and how has to be done, so please, take this issue. Yes, my plate is even more than full. Currently I am testing GitExtensions integration with lg2s.

BTW Did you read #527? Any thoughts on this?

nulltoken · 2013-10-11T08:40:25Z

I feel that I used inappropriate time in my previous comment.

No comment is ever inappropriate 😉 Thanks for the time you've taken with this.

Currently I am testing GitExtensions integration with lg2s.

@jbialobr That's great news! Please let us know about any issue you encounter so that we can react accordingly and help you with it. BTW, I think that #526 would have to be fixed in order to make the GUI/Command line integration complete.

BTW Did you read #527? Any thoughts on this?

Yes, I've read it. I think that RepositoryStatus and FileStatus may indeed need a little rework. I'd like to handle this as part of #523. I'll try to write down my ideas about these in the next few days.

nulltoken · 2013-10-12T18:52:22Z

@jbialobr Closing this as I manually merged your commits into #531.

✨ ✨ ✨ ✨ ✨ ✨ ✨ ✨ ✨ ✨ ✨ ✨

jbialobr added 2 commits October 10, 2013 06:52

Failing test HandlesRead_i18n_commitEncoding

dad4a50

HandlesWrite_i18n_commitEncoding

92f6d41

jbialobr mentioned this pull request Oct 10, 2013

Re-Integrate LibGit2Sharp branch into master gitextensions/gitextensions#1996

Closed

4 tasks

nulltoken mentioned this pull request Oct 11, 2013

Conservative and liberal string encodings #531

Merged

nulltoken closed this Oct 12, 2013

jbialobr deleted the jb/encoding_test branch October 12, 2013 19:35

nulltoken mentioned this pull request Nov 22, 2013

Ensure any passed in string doesn't contains the \0 character #571

Open

Failing test for handling i18n.commitEncoding #530

Failing test for handling i18n.commitEncoding #530

Uh oh!

Conversation

jbialobr commented Oct 10, 2013

Uh oh!

nulltoken commented Oct 10, 2013

Uh oh!

ethomson commented Oct 10, 2013

Uh oh!

nulltoken commented Oct 10, 2013

Uh oh!

ethomson commented Oct 10, 2013

Uh oh!

ethomson commented Oct 10, 2013

Uh oh!

nulltoken commented Oct 10, 2013

Uh oh!

ethomson commented Oct 10, 2013

Uh oh!

jbialobr commented Oct 10, 2013

Uh oh!

jbialobr commented Oct 10, 2013

Uh oh!

nulltoken commented Oct 10, 2013

Uh oh!

jbialobr commented Oct 10, 2013

Uh oh!

ethomson commented Oct 10, 2013

Uh oh!

jbialobr commented Oct 10, 2013

Uh oh!

ethomson commented Oct 10, 2013

Uh oh!

jbialobr commented Oct 10, 2013

Uh oh!

jbialobr commented Oct 10, 2013

Uh oh!

ethomson commented Oct 10, 2013

Uh oh!

jbialobr commented Oct 10, 2013

Uh oh!

ethomson commented Oct 10, 2013

Uh oh!

nulltoken commented Oct 10, 2013

Uh oh!

jbialobr commented Oct 11, 2013

Uh oh!

nulltoken commented Oct 11, 2013

Uh oh!

jbialobr commented Oct 11, 2013

Uh oh!

nulltoken commented Oct 11, 2013

Uh oh!

nulltoken commented Oct 12, 2013

Uh oh!

Uh oh!