i18n.commitEncoding is EVIL #532

Open
@Aimeast

I have spent the last few days researching i18n.commitEncoding, and it really is EVIL!!

First of all, I am Chinese, and my OS display language is en-US with the region set to Chinese (Simplified). The commonly used encoding here is GB2312 (also GBK, code page 936).

i18n.commitEncoding is defined here as: "Character encoding the commit messages are stored in; git itself does not care per se, but this information is necessary."

OK, this means the setting affects commit messages. But I could not find any documentation that defines the format of the encoding name. I opened the preferences window of Git GUI and found the Default File Contents Encoding field. The first item in its Change menu is System (cp936), but there is also an item Chinese Simplified (GB2312). My first impression was that these two items are equivalent.

I created a new repository and made 3 commits with different i18n.commitEncoding values to test them. Each commit message is four Chinese characters, and i18n.commitEncoding was set, in order, to the default (unset), gb2312, and cp936. (pushed to https://github.com/Aimeast/TestForFirst/commits/i18n)

  • git.exe can display only the second commit
  • Git GUI can display the first and third commits
  • GitHub can display the first and second commits; the third commit is displayed as Japanese text
  • LibGit2Sharp can display the first and second commits (the third commit is garbled, but not in the same way as on GitHub)

OK, the result is really funny. I decompressed the object files to dig further.

  • The first and second commits (e2266b and 38a2b0) were stored as utf-8
  • The third commit (bd9f62) was stored as gb2312

From my analysis, git.exe recognizes encoding names of the form cp followed by a code page number. That explains the results above.
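To illustrate why the third commit comes out garbled, here is a minimal sketch that decodes GB2312 bytes once as UTF-8 and once as code page 936. The class and method names are hypothetical, and the bytes are the GB2312 encoding of "中文", not the actual message from the test repository. It assumes a modern .NET runtime where `CodePagesEncodingProvider` is available; on .NET Framework, code page 936 is built in and the `RegisterProvider` line can be dropped.

```csharp
using System;
using System.Text;

public static class MisdecodeDemo
{
    // "中文" encoded as GB2312/GBK (code page 936): D6 D0 CE C4.
    public static readonly byte[] Gb2312Bytes = { 0xD6, 0xD0, 0xCE, 0xC4 };

    // Decodes the bytes as UTF-8, the way a reader that ignores the
    // commit's declared encoding would.
    public static string DecodeAsUtf8(byte[] raw)
    {
        return Encoding.UTF8.GetString(raw);
    }

    // Decodes the same bytes with code page 936.
    public static string DecodeAsCp936(byte[] raw)
    {
        // Assumption: running on .NET Core 3.0+ / .NET 5+, where legacy
        // code pages must be registered before use.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(936).GetString(raw);
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        // Invalid UTF-8 sequences become U+FFFD replacement characters.
        Console.WriteLine(DecodeAsUtf8(Gb2312Bytes));
        // The intended text is recovered with the matching code page.
        Console.WriteLine(DecodeAsCp936(Gb2312Bytes));
    }
}
```

None of the four bytes forms a valid UTF-8 sequence, so the UTF-8 path cannot recover the original characters no matter how the result is rendered afterwards.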

Now, let's get to the issue. The code

        // NativeMethods.git_commit_message
        [DllImport(libgit2)]
        [return : MarshalAs(UnmanagedType.CustomMarshaler, MarshalCookie = UniqueId.UniqueIdentifier, MarshalTypeRef = typeof(Utf8NoCleanupMarshaler))]
        internal static extern string git_commit_message(GitObjectSafeHandle commit);

The message is always marshaled as UTF-8, so the third commit message comes out garbled. I suggest returning the raw data and decoding the string in Proxy.git_commit_message. But the evil part is that the .NET Framework does not support some of the encodings that git.exe supports.

Here is my code for detecting the encoding of a commit

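        // Maps a commit's declared encoding name to a .NET Encoding,
        // falling back to UTF-8 when the name is missing or unsupported.
        // As an extension method, this must be declared in a static class.
        public static Encoding CpAsEncoding(this Commit commit)
        {
            try
            {
                var encoding = commit.Encoding;

                // For "cp936"-style names, parse the number and look up by code page.
                if (encoding.StartsWith("cp", StringComparison.OrdinalIgnoreCase))
                    return Encoding.GetEncoding(int.Parse(encoding.Substring(2)));

                return Encoding.GetEncoding(encoding);
            }
            catch
            {
                // Null encoding, malformed code page number, or unknown name.
                return Encoding.UTF8;
            }
        }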

It's not perfect code, but it works.
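As a rough sketch of the "return raw data, then decode" suggestion, the same lookup can be written against a plain encoding name and a byte array, without the catch-all. The names `CommitEncodingSketch`, `Resolve`, and `Decode` are hypothetical, and how the raw message bytes would be obtained from libgit2 is left open here.

```csharp
using System;
using System.Text;

public static class CommitEncodingSketch
{
    // Resolves a git encoding name (for example "cp936" or "gb2312") to a
    // .NET Encoding; returns UTF-8 when the name is empty or unknown.
    public static Encoding Resolve(string name)
    {
        if (string.IsNullOrWhiteSpace(name))
            return Encoding.UTF8; // no declared encoding: treat as UTF-8

        try
        {
            int codePage;
            if (name.StartsWith("cp", StringComparison.OrdinalIgnoreCase)
                && int.TryParse(name.Substring(2), out codePage))
                return Encoding.GetEncoding(codePage);

            return Encoding.GetEncoding(name);
        }
        catch (ArgumentException)      // name or code page not recognized
        {
            return Encoding.UTF8;
        }
        catch (NotSupportedException)  // code page not available on this platform
        {
            return Encoding.UTF8;
        }
    }

    // Decodes raw commit message bytes using the declared encoding name.
    public static string Decode(byte[] rawMessage, string encodingName)
    {
        return Resolve(encodingName).GetString(rawMessage);
    }

    public static void Main()
    {
        byte[] raw = Encoding.UTF8.GetBytes("中文");
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(Decode(raw, "cp65001")); // code page 65001 is UTF-8
    }
}
```

Catching only ArgumentException and NotSupportedException keeps the UTF-8 fallback for unknown encodings while letting unrelated failures surface.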
