-
Notifications
You must be signed in to change notification settings - Fork 899
Failing test for handling i18n.commitEncoding #530
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@jbialobr Thanks for this. Currently libgit2 expects commit message, committer and author names as zero terminated strings. So provided the I think we could make the following happen:
This will produce a commit in the ObjectDatabase which will bear an encoding entry as part of the header section. Of course, this will assume that the passed in strings (message, signature names and emails) have been previously generated with this same encoding, as this encoding will also be used to decode the commit headers when loading a commit back from the ObjectDatabase. Would that make sense to you? /cc @ethomson |
What you're describing makes sense to me, @nulltoken. I think that we should do three things as prerequisites to make sure that this solution is complete:
I am investigating point 1 currently. |
@ethomson Hmmm. No dice for me. The tests below miserably fail. [Fact]
public void ZerosAreThatEvil()
{
string input = "hel\0lo";
Encoding utf8encoder = new UTF8Encoding(false, true);
byte[] bytes = utf8encoder.GetBytes(input);
foreach (var b in bytes)
{
Assert.NotEqual((byte)0, b);
}
}
[Fact]
public void ZerosAreThatEvil_TheSequel()
{
var bytes = new byte[]
{
(byte) 'h', (byte) 'e', (byte) 'l', 0xC0, 0x80, (byte) 'l', (byte) 'o'
};
Encoding utf8decoder = new UTF8Encoding(false, true);
string output = null;
Assert.DoesNotThrow(() => output = utf8decoder.GetString(bytes));
Assert.Equal("hel\0lo", output);
} |
OH FER JGIT:
Produces:
YUP, that's a real NUL in the middle of our commit! |
Okay, so I guess we have several options, when given a
I suppose the argument in favor of mucking things about silently is that probably nobody will ever do this. Except maybe to see how we handle it, as I just did with jgit. But I am opposed to taking a legitimate (if weird) C# string and silently changing it. |
While creating a Commit While reading a Commit from the ObjectDatabase |
Yup. If there's mangled stuff in there, then I would say that this is undefined behavior. If there's crazy bytes in there, there's only so much we can do. |
This is from git.git source: if (memchr(msg->buf, '\0', msg->len))
return error("a NUL byte in commit log message not allowed."); So throw an exception is the right way. |
I wouldn't do that. I would make repo.ObjectDatabase.CreateCommit() retrieve the i18n.commitEncoding config entry, if any, and if commitEncoding != '' and != UTF-8 then encode appropriate commit parts using this encoding. |
BTW going back to my previous statement ("this will assume that the passed in strings (message, signature names and emails) have been previously generated with this same encoding"), I realize that this is completely wrong/misleading. A string is a sequential collection of Unicode characters. Encoding only comes into play when when converting to/from bytes.
Considering those above statements, I think we could still achieve the "not caring" goal.
|
That is ok. Reading your previous comment I thought that consumer will have to pass message that is encoded in
Yes, but to pass it to libgit2 you have to create a string that consits from bytes that decodes string in public static string ReEncodeString(string s, Encoding fromEncoding, Encoding toEncoding)
{
if (s == null || fromEncoding.HeaderName.Equals(toEncoding.HeaderName))
return s;
else
{
byte[] bytes = fromEncoding.GetBytes(s);
s = toEncoding.GetString(bytes);
return s;
}
} |
@jbialobr What purpose does that function serve? It looks like its only purpose is to corrupt a |
Git stores commit messages as a raw bytes. So when you pass a string, it has to consists of raw bytes as they were used to decode string in |
@jbialobr Yes, of course. So why are you trying to take the bytes out of a string as one encoding and then put them back in as if it were a different encoding? What are you going to do with the resultant string? |
Pass to git_commit_create. |
I assume that |
Yes, libgit2 just takes bytes. But I don't see what that has to do with the getting the bytes out of a string in one encoding, then feeding those bytes into an encoder as if those bytes were in a different encoding, which they're not. Then getting a string out... Then...? We need to take a |
I agree, I thought you have to pass a string not a byte[]. When you would have to pass a string then you have to create its byte[] equivalent using as |
Okay, great, it sounds like we're on the same page - apologies if I was misunderstanding you. We can just change the proxy layer to avoid any re-encoding! |
@ethomson cough Utf8Marshaler cough |
I have never worked before with native libriaries so I don't know which moves are allowed. But I am glad I can learn new things from you. |
@jbialobr I you feel like working on a Pull Request, we'd more than happy to help you with it. If your plate is already full, please let me know and I'll start working on it. |
@nulltoken I feel that I used inappropriate time in my previous comment. I am now working on a PR. I see that you already know what and how has to be done, so please, take this issue. Yes, my plate is even more than full. Currently I am testing GitExtensions integration with lg2s. BTW Did you read #527? Any thoughts on this? |
No comment is ever inappropriate 😉 Thanks for the time you've taken with this.
@jbialobr That's great news! Please let us know about any issue you encounter so that we can react accordingly and help you with it. BTW, I think that #526 would have to be fixed in order to make the GUI/Command line integration complete.
Yes, I've read it. I think that |
No description provided.