gix free pack verify --statistics uses ambiguous "KB" for SI kilobyte #1947

Closed
@EliahKagan

Current behavior 😯

As attested by the current journey test snapshots, gix free pack verify --statistics outputs compression-related sizes in units of "KB":

compression
compressed entries size : 51.8 KB
decompressed entries size : 103.7 KB
total object size : 288.7 KB
pack size : 51.9 KB

But it is not immediately clear what unit that actually is. Is it…

  • …an SI decimal kilobyte, equal to 1000 bytes? (kB)
  • …an IEC binary kibibyte, equal to 1024 bytes? (KiB)

It turns out that, in this case, 1 KB = 1 kB, but it is not obvious.
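To make the difference concrete, here is a minimal sketch that formats one byte count both ways. It deliberately does not use bytesize. The function names `format_kb` and `format_kib` and the input of 51,800 bytes are hypothetical, chosen only for illustration, since the exact byte counts behind the snapshot above are not shown here.

```rust
/// Formats a byte count in SI decimal kilobytes (1 kB = 1000 bytes).
/// Hypothetical helper for illustration; not part of gitoxide or bytesize.
fn format_kb(bytes: u64) -> String {
    format!("{:.1} kB", bytes as f64 / 1000.0)
}

/// Formats a byte count in IEC binary kibibytes (1 KiB = 1024 bytes).
/// Hypothetical helper for illustration; not part of gitoxide or bytesize.
fn format_kib(bytes: u64) -> String {
    format!("{:.1} KiB", bytes as f64 / 1024.0)
}

fn main() {
    // Assumed example value, not taken from the actual pack statistics.
    let bytes = 51_800;
    println!("{}", format_kb(bytes)); // 51.8 kB
    println!("{}", format_kib(bytes)); // 50.6 KiB
}
```

The same quantity differs by about 2.4% depending on the unit, so a reader who sees "KB" cannot tell which number they are looking at.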

Expected behavior 🤔

It is also not obvious what unit is intended. gitoxide-core uses the bytesize library to display the units:

writeln!(out, "\ncompression")?;
#[rustfmt::skip]
writeln!(
    out, "\t{:<width$}: {}\n\t{:<width$}: {}\n\t{:<width$}: {}\n\t{:<width$}: {}",
    "compressed entries size", ByteSize(stats.total_compressed_entries_size),
    "decompressed entries size", ByteSize(stats.total_decompressed_entries_size),
    "total object size", ByteSize(stats.total_object_size),
    "pack size", ByteSize(stats.pack_size),
    width = width
)?;

Which unit is displayed, and which symbol represents it, vary across major versions of bytesize. In current (i.e. recent stable) releases, the default unit is the IEC binary kibibyte, abbreviated KiB, and one can explicitly request the SI decimal kilobyte, abbreviated kB. Old versions of bytesize behave differently: they default to the SI decimal kilobyte and abbreviate it with the non-SI symbol KB. The new behavior came in as of bytesize 2.0.0. But gitoxide-core depends on:

bytesize = "1.0.1"

I suggest upgrading to bytesize 2.0.* and deciding whether we actually want…

  • …units of 1000 bytes abbreviated kB, in which case the above would be changed to:

    writeln!(out, "\ncompression")?;
    #[rustfmt::skip]
    writeln!(
        out, "\t{:<width$}: {}\n\t{:<width$}: {}\n\t{:<width$}: {}\n\t{:<width$}: {}",
        "compressed entries size", ByteSize(stats.total_compressed_entries_size).display().si(),
        "decompressed entries size", ByteSize(stats.total_decompressed_entries_size).display().si(),
        "total object size", ByteSize(stats.total_object_size).display().si(),
        "pack size", ByteSize(stats.pack_size).display().si(),
        width = width
    )?;

  • …or units of 1024 bytes abbreviated KiB, in which case no change would be needed to that source code file.

    (Though it could, if desired, be made explicit by calling iec() where the SI alternative calls si().)

Git behavior

I'm not sure if there's a Git behavior that should be considered to correspond exactly to this, since gix doesn't have or aim for the same interface as git, and since git verify-pack does not show file sizes in its statistics:

$ git verify-pack -s tests/fixtures/packs/pack-11fdfa9e156ab73caae3b6da867192221f2089c2.idx
non delta: 18 objects
chain length = 1: 4 objects
chain length = 2: 3 objects
chain length = 3: 1 object
chain length = 4: 2 objects
chain length = 5: 1 object
chain length = 6: 1 object

But some other git commands do show sizes of things in "human" units. For example:

$ git count-objects -vH
count: 0
size: 0 bytes
in-pack: 163413
packs: 2
size-pack: 79.16 MiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

When it is run without -H the size-pack value is shown with no unit, but git-count-objects(1) documents it as being in units of KiB. With neither -v nor -H, one gets:

$ git count-objects
0 objects, 0 kilobytes

There, "kilobytes" is ambiguous. One might think it means decimal SI kilobytes (1000 bytes). But actually that occurrence of "kilobytes" means binary IEC kibibytes, as revealed by:

if (human_readable)
	strbuf_humanise_bytes(&buf, loose_size);
else
	strbuf_addf(&buf, "%lu kilobytes",
			(unsigned long)(loose_size / 1024));
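To show what that division implies for the number printed, here is a small Rust sketch that mirrors only the non-human-readable branch quoted above. The function names and the input of 1,000,000 bytes are hypothetical. `decimal_kilobytes` is not something Git does; it is included only for contrast.

```rust
/// Mirrors only the `else` branch quoted above: integer division by 1024,
/// with the result labelled "kilobytes". Hypothetical name, for illustration.
fn git_style_kilobytes(loose_size: u64) -> String {
    format!("{} kilobytes", loose_size / 1024)
}

/// Hypothetical decimal counterpart (division by 1000), shown only for
/// comparison. Git does not do this.
fn decimal_kilobytes(loose_size: u64) -> String {
    format!("{} kilobytes", loose_size / 1000)
}

fn main() {
    // Assumed example value.
    let loose_size = 1_000_000;
    println!("{}", git_style_kilobytes(loose_size)); // 976 kilobytes
    println!("{}", decimal_kilobytes(loose_size)); // 1000 kilobytes
}
```

Both outputs carry the same "kilobytes" label, which is why the word alone does not tell the reader which divisor was used.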

I don't think any of this has much bearing on what we should do, since it's about display behavior that makes no effort to be similar to Git. However, it may be that Git's use of binary IEC units, rather than decimal SI units, reflects a broader preference for those units, or would lead users to expect that gitoxide use such units. (My personal preference is also for binary IEC units.)

Any change here, especially if it includes upgrading bytesize, should be fairly convenient for me to include in a larger PR that I am already working on. (Although the above-linked code currently fixes this by keeping them SI decimal kilobytes and changing the unit to "kB", I did that because it is closer to the current behavior, not to express a preference for that approach.)

Steps to reproduce 🕹

Check the journey test snapshot file shown above and observe that the journey tests are passing. Alternatively, run:

cargo run --bin=gix -- --no-verbose free pack verify --statistics tests/fixtures/packs/pack-11fdfa9e156ab73caae3b6da867192221f2089c2.idx

This shows the following, where the sizes are in "KB" units and "KB" means what would less ambiguously be called "kB":

    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.20s
     Running `target/debug/gix --no-verbose free pack verify --statistics tests/fixtures/packs/pack-11fdfa9e156ab73caae3b6da867192221f2089c2.idx`
objects per delta chain length
         0: 18
         1: 4
         2: 3
         3: 1
         4: 2
         5: 1
         6: 1
        ->: 30

averages
        delta chain length:            1;
        decompressed entry [B]:        3456;
        compressed entry [B]:          1725;
        decompressed object size [B]:  9621;

compression
        compressed entries size       : 51.8 KB
        decompressed entries size     : 103.7 KB
        total object size             : 288.7 KB
        pack size                     : 51.9 KB

        num trees                     : 15
        num blobs                     : 5
        num commits                   : 10
        num tags                      : 0

        compression ratio             : 2.00
        delta compression ratio       : 5.58
        delta gain                    : 2.78
        pack overhead                 : 0.235%

Broadening the scope: Other uses of ambiguous units

I've framed this in terms of gix free pack verify --statistics because that's what I stumbled upon first (EliahKagan#18 (comment)), and because the exact way that formats sizes is under test, and because I didn't really think through the full scope this issue should have. But this should very possibly be construed more broadly: some other places also show ambiguous units, and also show what I believe to be decimal SI units when they might perhaps better show binary IEC units.

$ gix clone git@github.com:EliahKagan/gitoxide.git
 19:17:24 indexing done 161.1K objects in 2.85s (56.5K objects/s)
 19:17:24 decompressing done 175.7MB in 2.85s (61.7MB/s)
 19:17:25     Resolving done 161.1K objects in 0.86s (188.1K objects/s)
 19:17:25      Decoding done 1.4GB in 0.86s (1.6GB/s)
 19:17:25 writing index file done 4.5MB in 0.02s (290.5MB/s)
 19:17:25  create index file done 161.1K objects in 3.80s (42.4K objects/s)
 19:17:25          read pack done 77.6MB in 4.10s (18.9MB/s)
 19:17:25           checkout done 2.4K files in 0.14s (16.9K files/s)
 19:17:25            writing done 72.3MB in 0.14s (516.5MB/s)
...

In contrast, Git uses binary IEC units:

$ git clone https://github.com/GitoxideLabs/gitoxide.git
Cloning into 'gitoxide'...
remote: Enumerating objects: 163078, done.
remote: Counting objects: 100% (869/869), done.
remote: Compressing objects: 100% (363/363), done.
remote: Total 163078 (delta 630), reused 506 (delta 506), pack-reused 162209 (from 6)
Receiving objects: 100% (163078/163078), 74.22 MiB | 37.44 MiB/s, done.
Resolving deltas: 100% (107181/107181), done.

Upgrading bytesize in all gitoxide crates' Cargo.toml does not affect that. I'm not sure if that's because prodash depends on bytesize 1.3.3, or for some other reason.

Labels: acknowledged (an issue is accepted as shortcoming to be fixed), help wanted (Extra attention is needed)
