Improve compression of pickled quotes #3877

Closed
@nicolasstucki

Description

We need to improve the encoding performed in TastyString, using the encoding described in the discussion quoted below.

@lrytz 5 days ago • Owner
It took me a while to find it. This needs to be cleaned up and documented. The method parseScalaSigBytes calls ConstantPool.getBytes, which goes through ByteCodecs.decode.

The encoding is explained here: http://www.scala-lang.org/old/sites/default/files/sids/dubochet/Mon,%202010-05-31,%2015:25/Storage%20of%20pickled%20Scala%20signatures%20in%20class%20files.pdf

First, map all 8-bit bytes to 7-bit values (shifting the remaining bits into the next value).
Then increment every value by 1 (within 7 bits), so 0x7f becomes 0x00.
Then encode 0x00 as 0xc0 0x80, which is an overlong UTF-8 encoding of zero. This is what the JVM classfile spec uses to avoid having 0x00 in strings; it is called "modified UTF-8".
The reason for incrementing by 1 is that 0x7f is expected to be less common than 0x00, so the two-byte encoding is needed less often.
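
As a rough illustration of the steps above, here is a minimal Python sketch. It is not the actual ByteCodecs code: the function names are made up for this example, and packing the bits least-significant-first is an assumption about the layout.

```python
def encode_8_to_7(data: bytes) -> bytes:
    """Repack 8-bit bytes into 7-bit values (assumed order: low bits first)."""
    out = bytearray()
    acc = 0    # bit accumulator
    nbits = 0  # number of valid bits currently held in acc
    for b in data:
        acc |= b << nbits
        nbits += 8
        while nbits >= 7:
            out.append(acc & 0x7F)
            acc >>= 7
            nbits -= 7
    if nbits > 0:
        out.append(acc & 0x7F)  # final partial group, zero-padded
    return bytes(out)


def add_one(values: bytes) -> bytes:
    """Increment each 7-bit value by 1 modulo 128, so 0x7f becomes 0x00."""
    return bytes((v + 1) & 0x7F for v in values)


def escape_zero(values: bytes) -> bytes:
    """Write 0x00 as the two-byte sequence 0xc0 0x80 (modified UTF-8)."""
    out = bytearray()
    for v in values:
        if v == 0:
            out += b"\xc0\x80"
        else:
            out.append(v)
    return bytes(out)


def encode(data: bytes) -> bytes:
    return escape_zero(add_one(encode_8_to_7(data)))


def decode(encoded: bytes) -> bytes:
    """Undo the three steps in reverse order."""
    values = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == 0xC0 and i + 1 < len(encoded) and encoded[i + 1] == 0x80:
            values.append(0)
            i += 2
        else:
            values.append(encoded[i])
            i += 1
    out = bytearray()
    acc = 0
    nbits = 0
    for v in values:
        acc |= ((v - 1) & 0x7F) << nbits  # undo the +1, then repack
        nbits += 7
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    return bytes(out)  # leftover bits are padding and are dropped
```

In this sketch an input byte of 0x00 costs no escape, while a 7-bit value of 0x7f is the one that turns into the two-byte form, which matches the trade-off described above.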

The confusing part is that the class ScalaSigBytes, used in the backend to encode the signature, uses ByteCodecs.encode8to7 but does the +1 itself. It doesn't need to map 0x00 to the two-byte version because ASM will do that when writing the annotation to the classfile. However, in the unpickler we don't use ASM to read the annotation; we just get the bytes from the classfile directly, so there we see the two-byte encoding. ByteCodecs.decode does the necessary work.
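
To make that split of responsibilities concrete, here is a small Python sketch of the zero handling alone. The three functions are hypothetical stand-ins (for the backend's +1 step, for the classfile writer, and for the reader), not the real ScalaSigBytes, ASM, or ByteCodecs APIs, and they operate on values that have already been mapped to 7 bits.

```python
def backend_plus_one(values: bytes) -> bytes:
    """Backend side as described: +1 modulo 128, no zero escaping."""
    return bytes((v + 1) & 0x7F for v in values)


def classfile_writer(payload: bytes) -> bytes:
    """Stand-in for the classfile writer: modified UTF-8 stores 0x00 as
    0xc0 0x80 and leaves 0x01..0x7f as single bytes."""
    return payload.replace(b"\x00", b"\xc0\x80")


def reader_undo(raw: bytes) -> bytes:
    """Reader side on raw classfile bytes: collapse 0xc0 0x80 back to 0x00,
    then undo the +1. Safe here because 7-bit values never contain 0xc0."""
    collapsed = raw.replace(b"\xc0\x80", b"\x00")
    return bytes((v - 1) & 0x7F for v in collapsed)


values = bytes(range(128))                 # every possible 7-bit value
in_memory = backend_plus_one(values)       # may still contain 0x00
on_disk = classfile_writer(in_memory)      # 0x00 is gone after writing
restored = reader_undo(on_disk)            # reader recovers the 7-bit values
```

The point of the sketch is only that the writer side can leave 0x00 alone when something downstream escapes it, whereas a reader that looks at the raw bytes has to undo the escaping itself.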
