
Description
The Point section says:
The line field (1-indexed integer) represents a line in a source file. The column field (1-indexed integer) represents a column in a source file. The offset field (0-indexed integer) represents a character in a source file.
What is the unit of 'character' and 'column'? Is it a UTF-16 code unit (as used in JavaScript) or a Unicode code point? See Wikipedia:
[UTF-16] encoding is variable-length, as code points are encoded with one or two 16-bit code units
I tried using remark to parse this piece of Markdown:
a𠮷b
Here, 𠮷 is a single Unicode code point that cannot be encoded as a single UTF-16 code unit; it needs two (a surrogate pair). JavaScript's String uses UTF-16, so:
'a𠮷b'.length
//=> 4
But in other languages like Python:
len('a𠮷b')
#=> 3
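To make the difference concrete, here is a small sketch that measures the same string in three units. The helper name measure is made up for this example and is not part of remark or unist.

```javascript
// Hypothetical helper: measure a string in three different units.
function measure(text) {
  return {
    // String.prototype.length counts UTF-16 code units.
    utf16CodeUnits: text.length,
    // The string iterator walks code points, so spreading counts code points.
    codePoints: [...text].length,
    // TextEncoder always encodes to UTF-8, so this is a byte count.
    utf8Bytes: new TextEncoder().encode(text).length
  }
}

console.log(measure('a𠮷b'))
// utf16CodeUnits: 4, codePoints: 3, utf8Bytes: 6
```

So "character" could plausibly mean any of 4, 3, or 6 units here, depending on which one the spec intends.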
remark parses the Markdown above into:
{
  "type": "text",
  "value": "a𠮷b",
  "position": {
    "start": {
      "line": 1,
      "column": 1,
      "offset": 0
    },
    "end": {
      "line": 1,
      "column": 5,
      "offset": 4
    },
    "indent": []
  }
}
The column of end is 5 and the offset of end is 4, which means remark treats this text as four 'chars' long, measured in UTF-16 code units.
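As a rough illustration of the two readings, the sketch below computes the end point of a piece of text under each unit. The endPoint helper is hypothetical, and it assumes the text has no line breaks and starts at line 1, column 1, offset 0; it is not how remark is implemented.

```javascript
// Hypothetical helper: end point of single-line text that starts at
// line 1, column 1, offset 0, measured in the given unit.
function endPoint(text, unit) {
  // 'code-point' counts Unicode code points; anything else counts
  // UTF-16 code units, which is what String.prototype.length returns.
  const length = unit === 'code-point' ? [...text].length : text.length
  return {line: 1, column: 1 + length, offset: length}
}

console.log(endPoint('a𠮷b', 'utf16'))
// { line: 1, column: 5, offset: 4 }
console.log(endPoint('a𠮷b', 'code-point'))
// { line: 1, column: 4, offset: 3 }
```

Only the UTF-16 reading reproduces the end point in the output above.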
So what is the unit of a character? The spec is confusing on this point.