Skip to content

Failing test for handling i18n.commitEncoding #530

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

jbialobr
Copy link
Contributor

No description provided.

@nulltoken
Copy link
Member

@jbialobr Thanks for this.

Currently libgit2 expects commit message, committer and author names as zero terminated strings. So provided the i18n.commitEncoding doesn't point to an encoding that introduce zeros as part of the produced output (eg. utf-16, utf-32), we should be able to honor this quite easily.

I think we could make the following happen:

  • Add a string parameter to repo.ObjectDatabase.CreateCommit() to make it accept an encoding name
  • Make repo.Commit() retrieve the i18n.commitEncoding config entry, if any, and pass it to the above function.

This will produce a commit in the ObjectDatabase which will bear an encoding entry as part of the header section.

Of course, this will assume that the passed in strings (message, signature names and emails) have been previously generated with this same encoding, as this encoding will also be used to decode the commit headers when loading a commit back from the ObjectDatabase.

Would that make sense to you?

/cc @ethomson

@ethomson
Copy link
Member

What you're describing makes sense to me, @nulltoken.

I think that we should do three things as prerequisites to make sure that this solution is complete:

  • Ensure that there are no non-Unicode encodings that would legitimately produce a null in their representative byte array
  • Ensure that when we produce UTF-8 strings we create "Modified UTF-8" such that it does not include a null
  • Ensure that core git does not handle UTF-16 or UTF-32 commit messages or otherwise put a null character into the commit body

I am investigating point 1 currently.

@nulltoken
Copy link
Member

Ensure that when we produce UTF-8 strings we create "Modified UTF-8" such that it does not include a null

@ethomson Hmmm. No dice for me. The tests below miserably fail.

[Fact]
public void ZerosAreThatEvil()
{
    string input = "hel\0lo";

    Encoding utf8encoder = new UTF8Encoding(false, true);

    byte[] bytes = utf8encoder.GetBytes(input);

    foreach (var b in bytes)
    {
        Assert.NotEqual((byte)0, b);
    }
}

[Fact]
public void ZerosAreThatEvil_TheSequel()
{
    var bytes = new byte[]
                {
                    (byte) 'h', (byte) 'e', (byte) 'l', 0xC0, 0x80, (byte) 'l', (byte) 'o'
                };

    Encoding utf8decoder = new UTF8Encoding(false, true);

    string output = null;
    Assert.DoesNotThrow(() => output = utf8decoder.GetString(bytes));

    Assert.Equal("hel\0lo", output);
}

@ethomson
Copy link
Member

OH FER JGIT:

Repository repo = new RepositoryBuilder().setGitDir(new File("C:/Temp/test/.git")).build();

CommitBuilder commit = new CommitBuilder();
commit.setAuthor(new PersonIdent("Name", "test@test.test", Calendar.getInstance().getTime(), TimeZone.getDefault()));
commit.setCommitter(new PersonIdent("Name", "test@test.test", Calendar.getInstance().getTime(), TimeZone.getDefault()));
commit.setMessage("Foo!\0Has an embedded null!");
commit.setTreeId(ObjectId.fromString("aaff74984cccd156a469afa7d9ab10e4777beb24"));

ObjectInserter repoInserter = repo.newObjectInserter();
ObjectId result = repo.newObjectInserter().insert(commit);
repoInserter.flush();
repo.close();

Produces:

0000080: 3432 3232 3839 202d 3034 3030 0a0a 466f  422289 -0400..Fo
0000090: 6f21 0048 6173 2061 6e20 656d 6265 6464  o!.Has an embedd
00000a0: 6564 206e 756c 6c21                      ed null!

YUP, that's a real NUL in the middle of our commit!

@ethomson
Copy link
Member

Okay, so I guess we have several options, when given a \0 in a string. I am listing them in order from my least to most favorite:

  1. Figure out how to put it in the commit message body as a 0x00. Can we all agree that this is obviously and totally very very wrong?
  2. Convert to 0xc0 0x80. I thought jgit would have prior art here but I was very wrong, apparently only JNI uses modified UTF-8, the regular UTF-8 conversion routines do not. Lacking that, this seems like a no-win.
  3. Truncate at the 0x00 (ie, by passing it to libgit2) silently. Not a fan.
  4. Remove the '\0 silently.
  5. Throw

I suppose the argument in favor of mucking things about silently is that probably nobody will ever do this. Except maybe to see how we handle it, as I just did with jgit. But I am opposed to taking a legitimate (if weird) C# string and silently changing it.

@nulltoken
Copy link
Member

While creating a Commit
👍 to throw. This will have to be checked in the Signature constructor and the ObjectDatabase.CreateCommit() method.

While reading a Commit from the ObjectDatabase
I'm not sure about our sane options here. How about letting the picked Encoding deal with it?

@ethomson
Copy link
Member

While reading a Commit from the ObjectDatabase

Yup. If there's mangled stuff in there, then I would say that this is undefined behavior. If there's crazy bytes in there, there's only so much we can do.

@jbialobr
Copy link
Contributor Author

This is from git.git source:

    if (memchr(msg->buf, '\0', msg->len))
        return error("a NUL byte in commit log message not allowed.");

So throw an exception is the right way.

@jbialobr
Copy link
Contributor Author

Add a string parameter to repo.ObjectDatabase.CreateCommit() to make it accept an encoding name

I wouldn't do that. I would make repo.ObjectDatabase.CreateCommit() retrieve the i18n.commitEncoding config entry, if any, and if commitEncoding != '' and != UTF-8 then encode appropriate commit parts using this encoding.
This will allow lg2s customer to not care about encoding stuff.

@nulltoken
Copy link
Member

I would make repo.ObjectDatabase.CreateCommit() retrieve the i18n.commitEncoding config entry

repo.ObjectDatabase.CreateCommit() is a very low level method. It directly interacts with the odb. It doesn't try to guess its parent(s). It doesn't update the HEAD. As such, I'd rather keep it as bare as possible and not make it depend on the configuration. However, repo.Commit() do all of these, and looks like the correct place to try to guess the encoding to be used.

then encode appropriate commit parts using this encoding.

repo.ObjectDatabase.CreateCommit() will take care of encoding the inputs.

BTW going back to my previous statement ("this will assume that the passed in strings (message, signature names and emails) have been previously generated with this same encoding"), I realize that this is completely wrong/misleading. A string is a sequential collection of Unicode characters. Encoding only comes into play when when converting to/from bytes.

This will allow lg2s customer to not care about encoding stuff.

Considering those above statements, I think we could still achieve the "not caring" goal.

  • Standard customers will rather use repo.Commit() and won't have to care about encoding.
  • Advanced ones will have the possibility to specify the Encoding they'd like to see used in their commit through repo.ObjectDatabase.CreateCommit()

@jbialobr
Copy link
Contributor Author

repo.ObjectDatabase.CreateCommit() will take care of encoding the inputs.

That is ok. Reading your previous comment I thought that consumer will have to pass message that is encoded in encoding argument.

Encoding only comes into play when when converting to/from bytes.

Yes, but to pass it to libgit2 you have to create a string that consits from bytes that decodes string in encoding argument.

        public static string ReEncodeString(string s, Encoding fromEncoding, Encoding toEncoding)
        {
            if (s == null || fromEncoding.HeaderName.Equals(toEncoding.HeaderName))
                return s;
            else
            {
                byte[] bytes = fromEncoding.GetBytes(s);
                s = toEncoding.GetString(bytes);
                return s;
            }
        }

@ethomson
Copy link
Member

@jbialobr What purpose does that function serve? It looks like its only purpose is to corrupt a string.

@jbialobr
Copy link
Contributor Author

Git stores commit messages as a raw bytes. So when you pass a string, it has to consists of raw bytes as they were used to decode string in encoding argument.

@ethomson
Copy link
Member

@jbialobr Yes, of course. So why are you trying to take the bytes out of a string as one encoding and then put them back in as if it were a different encoding? What are you going to do with the resultant string?

@jbialobr
Copy link
Contributor Author

Pass to git_commit_create.

@jbialobr
Copy link
Contributor Author

I assume that if (git_buf_puts(&commit, message) < 0) takes raw bytes from string 'as is' - no recoding is performed.

@ethomson
Copy link
Member

Yes, libgit2 just takes bytes. But I don't see what that has to do with the getting the bytes out of a string in one encoding, then feeding those bytes into an encoder as if those bytes were in a different encoding, which they're not. Then getting a string out... Then...?

We need to take a string, convert it to a byte[] in the encoding you want, and then hand that byte[] to git_commit_create. No need to re-encode anything, and even if there were, pretending a byte[] in one encoding is a different encoding is not the way to do that.

@jbialobr
Copy link
Contributor Author

I agree, I thought you have to pass a string not a byte[]. When you would have to pass a string then you have to create its byte[] equivalent using as toEnconding some lossless encoding(ISO-8859-1).

@ethomson
Copy link
Member

Okay, great, it sounds like we're on the same page - apologies if I was misunderstanding you. We can just change the proxy layer to avoid any re-encoding!

@nulltoken
Copy link
Member

We can just change the proxy layer to avoid any re-encoding!

@ethomson cough Utf8Marshaler cough

@jbialobr
Copy link
Contributor Author

I have never worked before with native libriaries so I don't know which moves are allowed. But I am glad I can learn new things from you.

@nulltoken
Copy link
Member

I have never worked before with native librairies so I don't know which moves are allowed.

@jbialobr I you feel like working on a Pull Request, we'd more than happy to help you with it.

If your plate is already full, please let me know and I'll start working on it.

@jbialobr
Copy link
Contributor Author

@nulltoken I feel that I used inappropriate time in my previous comment. I am now working on a PR. I see that you already know what and how has to be done, so please, take this issue. Yes, my plate is even more than full. Currently I am testing GitExtensions integration with lg2s.

BTW Did you read #527? Any thoughts on this?

@nulltoken
Copy link
Member

I feel that I used inappropriate time in my previous comment.

No comment is ever inappropriate 😉 Thanks for the time you've taken with this.

Currently I am testing GitExtensions integration with lg2s.

@jbialobr That's great news! Please let us know about any issue you encounter so that we can react accordingly and help you with it. BTW, I think that #526 would have to be fixed in order to make the GUI/Command line integration complete.

BTW Did you read #527? Any thoughts on this?

Yes, I've read it. I think that RepositoryStatus and FileStatus may indeed need a little rework. I'd like to handle this as part of #523. I'll try to write down my ideas about these in the next few days.

@nulltoken
Copy link
Member

@jbialobr Closing this as I manually merged your commits into #531.

✨ ✨ ✨ ✨ ✨ ✨ ✨ ✨ ✨ ✨ ✨ ✨

@nulltoken nulltoken closed this Oct 12, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants