Skip to content

Conservative and liberal string encodings #531

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 16, 2013
Merged

Conversation

nulltoken
Copy link
Member

This PR started as an attempt to deal with #530.

Eventually, honoring i18n.commitEncoding turned out to be unnecessary while crafting a commit. This PR morphed into segregating the Marshalers into Strict and Lax categories.

  • Strict encoding will be used when passing data to libgit2 (and throw when the passed in string cannot be converted into UTF-8 bytes)
  • Lax encoding will be used when retrieving data from libgit2 (and convert (mangle) the data without throwing)

@nulltoken
Copy link
Member Author

@jbialobr @ethomson Well, I've started to work on the encoding thingy... Not ready yet though.

}

throw new ArgumentException(
string.Format("Zero bytes ('\\0') are not allowed. One zero byte has been found at position {0}.", zeroPos), argumentName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: "One zero byte" makes it sound like there is only one zero byte in the string. Perhaps "A zero byte"?

@nulltoken
Copy link
Member Author

Some questions:

  • Are there other methods that we should protect from the evil \0 char?
  • Should we use Encoding.UTF8 or new UTF8Encoding(false, true)

@ethomson
Copy link
Member

Actually, having thought about this for another second, I think I am not in favor of throwing here. Invalid byte sequences coming from existing commits should not throw, ideally. We should do our best to decode that in the encoding specified and - if there are byte sequences that do not conform to the specified encoding - well, we should do something reasonable that allows the libgit2sharp consumer to continue. (Be liberal in what you accept, etc, etc, etc.) When the throwOnInvalidBytes param is set to false, this will replace with question-marks. This seems reasonable to me.

@nulltoken
Copy link
Member Author

Be liberal in what you accept, etc, etc, etc.

@ethomson I agree with you. We may introduce asymmetric encoding with a LaxUtf8Marshaler and a StrictUtf8Marshaler. Strict would be used when creating and Lax when reading.

How should we deal with the FilePathMarshaler (which current rely on the Utf8Marshaler)? Should we make it rely on LaxUtf8 (or StrictUtf8) by default? Or should we also deal with proper LaxFilePathMarshaler when reading and StrictFilePathMarshaler when writing? I'm not question marks would help here 😉...

@ethomson
Copy link
Member

Hmm. I think that we probably do want a strict and a lax, but I'm thinking that strictness / laxness only comes into the picture when turning a byte[] into a string?

If that's true, we probably want to be strict when dealing with file paths (if libgit2 hands us a filename with a set of bytes that is totally illegal then there's probably nothing we can do...?) But we probably want to be lax when dealing with data from commits (it would be really annoying if getting a commit threw for an encoding-related reason.)

@ethomson
Copy link
Member

NOPE, I am totally wrong, it turns out that you can totally stuff a string full of random chars that do not represent unicode code points...!

Still, I think filenames should always be strict? And that accepting commit data should be lax and writing it should be strict, as you said.

throw new MarshalDirectiveException(
string.Format("{0} must be used on a string.", GetType().Name));
}
var str = str1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was the result of an Extract Method followed by an Inline Method that turned bad....

Fixed!

@nulltoken
Copy link
Member Author

Guys, here's a first draft. Can you please take a look?

Current status:

  • Only strict encoding is actually dealt with. If that's ok with you, I'd like you (as part of the review) to help me and point which method should be relaxed).
  • Creation and parsing of commits now encode and decode on the fly the signature and the message.
  • Annotated tag bears no encoding header. As such they're, by default, considered as UTF-8 encoded.
  • Stash and Notes are technically made of commits. However libgit2 API doesn't expose (yet?) an encoding param for those. As such, until further change, their messages and signatures are also considered as UTF-8 encoded.

@nulltoken
Copy link
Member Author

@jbialobr If I'm correct the current scenario should fail

Set i18n.commitEncoding to windows-1250

  • Use git.git to create the following git objects (using specific letters from this codepage for the messages and textual contents)
    • Commits
    • Create Notes
    • Stash
  • Use LibGit2Sharp to
    • Browse the Reflog
    • List the notes
    • List the stashes

@jbialobr
Copy link
Contributor

AFAIK git.git uses i18n.commitEncoding only for commit message, so you shouldn't use it for author and commiter.
git.git takes author and commiter from config file. To read config file git.git uses:

  • on linux: current operating system encoding
  • on windows: msysgit - UTF8, cygwin - don't know exactly, but if I am not mistaken it uses OS encoding (by default equal to windows encoding).
  • on mac - I don't know.

The same encoding is used for branches, tags, etc.

Now I am wondering if it is worth the effort to handle all this cases. (Actually - I think it is not)

@nulltoken
Copy link
Member Author

@jbialobr Actually, you've very right. This is completely useless to support i18n.commitEncoding while crafting a commit.

Signatures and commit messages are passed through strings. Those strings are in-memory UTF-16. They are marshaled to libgit2 as UTF-8 (which can represent any character). Probing for i18n.commitEncoding, or even allowing to support a different encoding than the git default one while creating a new commit is completely dumb.

Parsing a commit is a different story though, but that will have to be deal with while fixing #387.

I think I'm going to rip off the 3 last commits. However, I still think the previous commits make sense. Although they do not bring i18n support, they still add some value:

Only remaining point to be deal with would be to relax the utf8 parsing of strings coming out of the odb or filesystem.

/cc @carlosmn

@carlosmn
Copy link
Member

As the CLR (and anything sensible created since the 90s) is based on Unicode, that's what we should be using. Anything else (like commitencoding) is support for broken/older/legacy systems which managed to get some other character set into the odb.

Reading data out of the repository is of course a different area, in which we should try not to fail reading bogus data, and there we can attempt to use commitencoding as the source encoding. But using a per-repo or per-user configuration option for something that can change as much as the encoding of any particular commit is fraught with silly issues no matter what.

@jbialobr
Copy link
Contributor

This is completely useless to support i18n.commitEncoding while crafting a commit.

In GitExtensions we support it, but you opened my eyes and I see now that it was introduced mostly for command line usage when OS encoding differs from UTF8. I will remove support of crafting commit using i18n.commitEncoding.

Only remaining point to be deal with would be to relax the utf8 parsing of strings coming out of the odb or filesystem.

Note that filepaths, git refs, commiter and author data are sotred using OS encoding. There is no sense to try to read them correctly when you are not going to write them using OS encoding (the same file path written using different encodings is seen by git.git as two diffferent paths).
We force GitExtensions users to conver their repos to UTF8. https://github.com/gitextensions/gitextensions/wiki/Git%20Encoding. They can work with old encoding using msysgit version prior to msysgit-1.7.10, but I think after integratig with libgit2sharp we will stop supporting using it.

@nulltoken
Copy link
Member Author

I've pushed my take about Lax and Strict utf-8 marshaling. Could you please take a peek at it? Lax encoding is used when retrieving strings from libgit2. Strict when passing string to libgit2.

/cc @jamill @phkelley @dahlbyk

@nulltoken
Copy link
Member Author

We should do our best to decode that in the encoding specified and - if there are byte sequences that do not conform to the specified encoding - well, we should do something reasonable that allows the libgit2sharp consumer to continue. (Be liberal in what you accept, etc, etc, etc.)

@ethomson Does this meet what you had in mind?

  • Retrieval process from what we get from libgit2 shouldn't be different from before (ie. goes through Encoding.UTF8, or at least something similar which emits the UTF8 header and doesn't throw on invalid bytes)
  • Every data passed on to libgit2 is now enforced to be valid UTF8

@@ -203,9 +203,9 @@ private static bool LooksLikeABranchName(string referenceName)
referenceName.LooksLikeRemoteTrackingBranch();
}

private static string branchToCanoncialName(IntPtr namePtr, GitBranchType branchType)
private static string BranchToCanoncialName(IntPtr namePtr, GitBranchType branchType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo here: should be "canonical"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thakns! Fiexd. 😉

@ethomson
Copy link
Member

"at least something similar which emits the UTF8 header"

Are we really emitting a BOM here? I wouldn't expected us to; you wouldn't want a BOM in the middle of a signature, would you?

I'm very unclear on how the UTF8Encoding class emits a BOM. The documentation claims that the first boolean argument controls this, but in practice, I get the same byte[] (all without a BOM) from using

  • the System.Text.Encoding.UTF8 property, which we previously used; documentation does not specify whether it writes a BOM
  • new Encoding(true, true) which claims it should emit a BOM, but does not for me
  • new Encoding(false, true) which claims it should not emit a BOM, and does not for me

Anyway. I guess I'm trying to say that I think that in theory we should be using new UTF8Encoding(false, true) but I don't understand why there's no difference in practice.

@jbialobr
Copy link
Contributor

It emits BOM when it writes a data to a file. System.Text.Encoding.UTF8 emits BOM.

@ethomson
Copy link
Member

@jbialobr Interesting. What I did was:

new Encoding(true, true).GetBytes("foo")

and got no BOM. You're saying that this is expected, but there's some other method that does emit a BOM for conversion, depending on the arguments in the constructor?

@nulltoken
Copy link
Member Author

the System.Text.Encoding.UTF8 property, which we previously used; documentation does not specify whether it writes a BOM

I've read this as well. Eventually, I took this approach by peeking at the Mono source code (see Encoding.cs and Utf8Encoding.cs).

@jbialobr
Copy link
Contributor

Do:

File.WriteAllText(fileName, "foo", new Encoding(true, true));

and you get BOM.

@ethomson
Copy link
Member

I've read this as well. Eventually, I took this approach by peeking at the Mono source code (see Encoding.cs and Utf8Encoding.cs).

It's unwise for me to look at the source for a different CLR implementation. :)

Would you mind telling me the result?

@@ -75,8 +75,8 @@ private int FileCallback(GitDiffDelta delta, float progress, IntPtr payload)

private int PrintCallBack(GitDiffDelta delta, GitDiffRange range, GitDiffLineOrigin lineorigin, IntPtr content, UIntPtr contentlen, IntPtr payload)
{
string formattedoutput = Utf8Marshaler.FromNative(content, (int)contentlen);
var filePath = FilePathMarshaler.FromNative(delta.NewFile.Path);
string formattedoutput = LaxUtf8Marshaler.FromNative(content, (int)contentlen);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you should use encoding that was used to write file's content.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure to know how we should detect the encoding that was initially used to create the file (@carlosmn @ethomson @arrbee any idea?).

However, I'd rather prefer to keep this PR being focused on Lax/Strict segregation. Maybe could you open an issue regarding this topic?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You guess, following the industry's best practices http://xkcd.com/221/

You have no information on the encoding of any given file, unless you're willing to go look at XML headers or HTML meta tags and whatnot (and lol if you believe those are going to be correct). If you can't parse it as Unicode, then maybe you can try to use the current user's locale (assuming this is not Unicode because it's used by a group of people with another de facto encoding).

As a last resort, you can do something libmagic- or libicu-like by looking at the content of the pre- or post-image and guessing based on that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, maybe would this would be another use case where providing an optional callback for decoding related issues might be useful?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe could you open an issue regarding this topic?

@jbialobr If you could also come up with a failing test, that'd be awesome 😉

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

git.git diplays any character out of system codepage range as <byte code>
Example
sprawdzi<E6> is diplayed for sprawdzić

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Git Extensions we have Files content encoding setting for each repo. IMHO it is very uncommon to hold files of different content encoding in a single repo. No one has requested per file setting of content encoding (but it would be the clearest solution).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

git.git diplays any character out of system codepage range as

That would not be git, but the pager (typically less). git-core doesn't care about your encoding. That's the job of whatever displays the characters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you could also come up with a failing test, that'd be awesome

Done. #533

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you could also come up with a failing test, that'd be awesome

Done. #533

@nulltoken
Copy link
Member Author

Would you mind telling me the result?

Mono Encoding.UTF8 implementation creates a new UTF8Encoding type passing a parameter requiring the encoding to emit the UTF8 identifier/header.

@jbialobr @ethomson What's your take on this? Should we use

  • new Encoding(true,... which looks like matching Encoding.UTF8 implementation
  • new Encoding(false, ... which may feel safer as we control what's produced

@jbialobr
Copy link
Contributor

I preffer to use new Encoding(false, ... as it is optional to emit BOM, and there exists apps that doesn't support BOM correctly.

@ethomson
Copy link
Member

Well, maybe a BOM might be useful for files, but that's a different conversation. We're talking about all the marshaling here, and a BOM in the middle of a commit message is never going to be correct. We can't have:

Author: \xEF\xBB\xBFauthor name <\xEF\xBB\xBFauthor@example.com>...

\xEF\xBB\xBFThis is a message...

I think that using false here makes our intention more obvious. That said, this appears moot: consumers of the encoding class (like using File.WriteAllText above - thanks @jbialobr for the example) are expected to call GetPreamble. The UTF8Encoding class switches in GetPreamble based on the value of the encoderShouldEmitUTF8Identifier constructor argument - if it was true, GetPreamble will return the BOM, otherwise it will return an empty byte array.

For our uses, then, I think it's identical.

@nulltoken
Copy link
Member Author

"I preffer to use new Encoding(false, ..."
"I think that using false here makes our intention more obvious. "

@jbialobr @ethomson Fixed with 0034283

ref GitOid tree,
int parentCount,
[MarshalAs(UnmanagedType.LPArray)] [In] IntPtr[] parents);

[DllImport(libgit2)]
[return : MarshalAs(UnmanagedType.CustomMarshaler, MarshalCookie = UniqueId.UniqueIdentifier, MarshalTypeRef = typeof(Utf8NoCleanupMarshaler))]
[return : MarshalAs(UnmanagedType.CustomMarshaler, MarshalCookie = UniqueId.UniqueIdentifier, MarshalTypeRef = typeof(LaxUtf8NoCleanupMarshaler))]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you drop using git_commit_message_encoding here? You have aviable this information, so I see no reason to not using it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my mind, this should be taken care of as part of #427. At least this is what I foresee to do.

@nulltoken
Copy link
Member Author

Squashed and rebased.

Ok to merge?

@nulltoken nulltoken merged commit b10c4b6 into vNext Oct 16, 2013
@nulltoken nulltoken deleted the ntk/fix/commit_encoding branch October 16, 2013 23:17
nulltoken referenced this pull request in GitTools/GitVersion Nov 12, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants