-
Notifications
You must be signed in to change notification settings - Fork 899
Conservative and liberal string encodings #531
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
} | ||
|
||
throw new ArgumentException( | ||
string.Format("Zero bytes ('\\0') are not allowed. One zero byte has been found at position {0}.", zeroPos), argumentName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: "One zero byte" makes it sound like there is only one zero byte in the string. Perhaps "A zero byte"?
Some questions:
|
Actually, having thought about this for another second, I think I am not in favor of throwing here. Invalid byte sequences coming from existing commits should not throw, ideally. We should do our best to decode that in the encoding specified and - if there are byte sequences that do not conform to the specified encoding - well, we should do something reasonable that allows the libgit2sharp consumer to continue. (Be liberal in what you accept, etc, etc, etc.) When the |
@ethomson I agree with you. We may introduce asymmetric encoding with a How should we deal with the |
Hmm. I think that we probably do want a strict and a lax, but I'm thinking that strictness / laxness only comes into the picture when turning a If that's true, we probably want to be strict when dealing with file paths (if libgit2 hands us a filename with a set of bytes that is totally illegal then there's probably nothing we can do...?) But we probably want to be lax when dealing with data from commits (it would be really annoying if getting a commit threw for an encoding-related reason.) |
NOPE, I am totally wrong, it turns out that you can totally stuff a Still, I think filenames should always be strict? And that accepting commit data should be lax and writing it should be strict, as you said. |
throw new MarshalDirectiveException( | ||
string.Format("{0} must be used on a string.", GetType().Name)); | ||
} | ||
var str = str1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was the result of an Extract Method
followed by an Inline Method
that turned bad....
Fixed!
Guys, here's a first draft. Can you please take a look? Current status:
|
@jbialobr If I'm correct the current scenario should fail Set
|
AFAIK git.git uses i18n.commitEncoding only for commit message, so you shouldn't use it for author and commiter.
The same encoding is used for branches, tags, etc. Now I am wondering if it is worth the effort to handle all this cases. (Actually - I think it is not) |
@jbialobr Actually, you've very right. This is completely useless to support Signatures and commit messages are passed through strings. Those strings are in-memory UTF-16. They are marshaled to libgit2 as UTF-8 (which can represent any character). Probing for Parsing a commit is a different story though, but that will have to be deal with while fixing #387. I think I'm going to rip off the 3 last commits. However, I still think the previous commits make sense. Although they do not bring i18n support, they still add some value: Only remaining point to be deal with would be to relax the utf8 parsing of strings coming out of the odb or filesystem. /cc @carlosmn |
As the CLR (and anything sensible created since the 90s) is based on Unicode, that's what we should be using. Anything else (like Reading data out of the repository is of course a different area, in which we should try not to fail reading bogus data, and there we can attempt to use |
In GitExtensions we support it, but you opened my eyes and I see now that it was introduced mostly for command line usage when OS encoding differs from UTF8. I will remove support of crafting commit using
Note that filepaths, git refs, commiter and author data are sotred using OS encoding. There is no sense to try to read them correctly when you are not going to write them using OS encoding (the same file path written using different encodings is seen by git.git as two diffferent paths). |
@ethomson Does this meet what you had in mind?
|
@@ -203,9 +203,9 @@ private static bool LooksLikeABranchName(string referenceName) | |||
referenceName.LooksLikeRemoteTrackingBranch(); | |||
} | |||
|
|||
private static string branchToCanoncialName(IntPtr namePtr, GitBranchType branchType) | |||
private static string BranchToCanoncialName(IntPtr namePtr, GitBranchType branchType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo here: should be "canonical"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thakns! Fiexd. 😉
Are we really emitting a BOM here? I wouldn't expected us to; you wouldn't want a BOM in the middle of a signature, would you? I'm very unclear on how the
Anyway. I guess I'm trying to say that I think that in theory we should be using |
It emits BOM when it writes a data to a file. System.Text.Encoding.UTF8 emits BOM. |
@jbialobr Interesting. What I did was:
and got no BOM. You're saying that this is expected, but there's some other method that does emit a BOM for conversion, depending on the arguments in the constructor? |
I've read this as well. Eventually, I took this approach by peeking at the Mono source code (see Encoding.cs and Utf8Encoding.cs). |
Do:
and you get BOM. |
It's unwise for me to look at the source for a different CLR implementation. :) Would you mind telling me the result? |
@@ -75,8 +75,8 @@ private int FileCallback(GitDiffDelta delta, float progress, IntPtr payload) | |||
|
|||
private int PrintCallBack(GitDiffDelta delta, GitDiffRange range, GitDiffLineOrigin lineorigin, IntPtr content, UIntPtr contentlen, IntPtr payload) | |||
{ | |||
string formattedoutput = Utf8Marshaler.FromNative(content, (int)contentlen); | |||
var filePath = FilePathMarshaler.FromNative(delta.NewFile.Path); | |||
string formattedoutput = LaxUtf8Marshaler.FromNative(content, (int)contentlen); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you should use encoding that was used to write file's content.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You guess, following the industry's best practices http://xkcd.com/221/
You have no information on the encoding of any given file, unless you're willing to go look at XML headers or HTML meta tags and whatnot (and lol if you believe those are going to be correct). If you can't parse it as Unicode, then maybe you can try to use the current user's locale (assuming this is not Unicode because it's used by a group of people with another de facto encoding).
As a last resort, you can do something libmagic
- or libicu
-like by looking at the content of the pre- or post-image and guessing based on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, maybe would this would be another use case where providing an optional callback for decoding related issues might be useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe could you open an issue regarding this topic?
@jbialobr If you could also come up with a failing test, that'd be awesome 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
git.git diplays any character out of system codepage range as <byte code>
Example
sprawdzi<E6>
is diplayed for sprawdzić
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Git Extensions we have Files content encoding
setting for each repo. IMHO it is very uncommon to hold files of different content encoding in a single repo. No one has requested per file setting of content encoding (but it would be the clearest solution).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
git.git diplays any character out of system codepage range as
That would not be git, but the pager (typically less
). git-core doesn't care about your encoding. That's the job of whatever displays the characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you could also come up with a failing test, that'd be awesome
Done. #533
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you could also come up with a failing test, that'd be awesome
Done. #533
✨
Mono @jbialobr @ethomson What's your take on this? Should we use
|
I preffer to use |
Well, maybe a BOM might be useful for files, but that's a different conversation. We're talking about all the marshaling here, and a BOM in the middle of a commit message is never going to be correct. We can't have:
I think that using For our uses, then, I think it's identical. |
ref GitOid tree, | ||
int parentCount, | ||
[MarshalAs(UnmanagedType.LPArray)] [In] IntPtr[] parents); | ||
|
||
[DllImport(libgit2)] | ||
[return : MarshalAs(UnmanagedType.CustomMarshaler, MarshalCookie = UniqueId.UniqueIdentifier, MarshalTypeRef = typeof(Utf8NoCleanupMarshaler))] | ||
[return : MarshalAs(UnmanagedType.CustomMarshaler, MarshalCookie = UniqueId.UniqueIdentifier, MarshalTypeRef = typeof(LaxUtf8NoCleanupMarshaler))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you drop using git_commit_message_encoding
here? You have aviable this information, so I see no reason to not using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my mind, this should be taken care of as part of #427. At least this is what I foresee to do.
Squashed and rebased. Ok to merge? |
This PR started as an attempt to deal with #530.
Eventually, honoring i18n.commitEncoding turned out to be unnecessary while crafting a commit. This PR morphed into segregating the Marshalers into Strict and Lax categories.