What's the unit of character in Point #21

Closed
@ghost

Description

The Point section says:

The line field (1-indexed integer) represents a line in a source file. The column field (1-indexed integer) represents a column in a source file. The offset field (0-indexed integer) represents a character in a source file.

What is the unit of 'character' and 'column'? Is it a UTF-16 code unit (as used in JavaScript) or a Unicode code point? See Wikipedia:

[UTF-16] encoding is variable-length, as code points are encoded with one or two 16-bit code units

I tried using remark to parse this markdown piece:

a𠮷b

Here, 𠮷 is a single Unicode code point (U+20BB7) that cannot be encoded in a single UTF-16 code unit. Because JavaScript strings use UTF-16:

'a𠮷b'.length
//=> 4

But in other languages like Python:

len('a𠮷b')
#=> 3
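To make the difference concrete, here is a minimal sketch that counts the same string three ways in JavaScript. It uses only standard language features (`String.prototype.length`, the string iterator, and `TextEncoder`). The variable names are mine, and the string is written with an escape so the code point is unambiguous.

```javascript
// 'a𠮷b', with 𠮷 written as its code point escape (U+20BB7).
const text = 'a\u{20BB7}b';

// String.prototype.length counts UTF-16 code units.
const codeUnits = text.length;

// The string iterator steps by Unicode code point, so spreading
// the string yields one array entry per code point.
const codePoints = [...text].length;

// UTF-8 byte length, for comparison with byte-oriented tools.
const utf8Bytes = new TextEncoder().encode(text).length;

console.log({ codeUnits, codePoints, utf8Bytes });
// { codeUnits: 4, codePoints: 3, utf8Bytes: 6 }
```

The Python result above matches `codePoints`, while the JavaScript `length` matches `codeUnits`.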

As for remark, the above markdown piece is parsed into:

{
  "type": "text",
  "value": "a𠮷b",
  "position": {
    "start": {
      "line": 1,
      "column": 1,
      "offset": 0
    },
    "end": {
      "line": 1,
      "column": 5,
      "offset": 4
    },
    "indent": []
  }
}

The end column is 5 and the end offset is 4, which means remark treats this text as four 'chars' long, measured in UTF-16 code units.
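If the positions are indeed UTF-16 based, a consumer that wants code point offsets would have to convert them. A rough sketch of that conversion follows. `toCodePointOffset` is a hypothetical helper, not part of remark or unist, and it assumes the offset is a UTF-16 code unit index into the same string.

```javascript
// Hypothetical helper, not part of remark or unist.
// Assumes utf16Offset is a UTF-16 code unit index into text.
function toCodePointOffset(text, utf16Offset) {
  // Take everything before the offset, then count it by code point.
  return [...text.slice(0, utf16Offset)].length;
}

const value = 'a\u{20BB7}b'; // 'a𠮷b'
const end = { line: 1, column: 5, offset: 4 }; // as reported above

console.log(toCodePointOffset(value, end.offset));
// 3
```

An offset that lands in the middle of a surrogate pair would count the lone surrogate as one extra code point, so this only gives a meaningful answer for offsets on code point boundaries.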

So what is the unit of 'character'? It's confusing.
