Skip to content

Overhaul std collections and algorithm documentation #36318

Closed
@gnzlbg

Description

@gnzlbg

The documentation of the std::collections:

  • lacks enough rigor to make good informed decisions about using the collections and their methods as well as most algorithms.
  • some of the information is there, but it is not easy to get to it from a particular method (e.g. if I want to find the complexity of some container insert method googling for "Rust HashSet insert" would lead me to the documentation of the insert function, but this information is in a performance table in the main page of the collection API.
  • the place for rigor is in the docs and not in the book. The docs are the reference.

I really think that being more thorough with the documentation will help both find and fix bugs in the documentation and make the std library way better.

The solution I propose is to make all the relevant information accessible from all the methods.

The first thing to do would be to achieve consensus on which information should be available for each method. Taking inspiration from the C++ standard, and to kickstart the discussion, I would propose that each method / algorithm / function documentation comment should contain the following "fields" below the short and long descriptions. If a field doesn't make sense for some particular method, I would propose to still have it there but explicitly leave it blank.

  • Algorithmic complexity: with big O in terms of the relevant operations (which can be multiple: copies/moves in terms of the number of elements, number of swaps, number of comparisons, hashing...), with exact number of operations when guaranteed (performs exactly N moves, performs exactly M - N swaps, ...), and for those methods that have amortized/expected complexity it should contain both the complexity in the amortized case as well as the worst case (and maybe best case), and under which conditions they trigger.
    • Example: Vec::push(t): amortized O(1): exactly 1 move/copy of t if Vec::size() < Vec::capacity(), otherwise worst case O(N): exactly N + 1 moves/copies where N == Vec::size().
    • Example: Vec::reserve(M): O(1) time if M <= Vec::capacity(), O(N) otherwise: exactly N moves/copies where N == Vec::size().
  • Memory complexity: does it use O(1) stack space? Is it recursive and it uses O(logN) stack space? Does it allocate memory using an allocator? If so how much?
    • Example: Vec::push(t): amortized O(1) stack space, otherwise worst case O(1) stack space and a single memory allocation of O(k*N) where k is an implementation-defined growth factor independent of N and where N == Vec::size().
    • Example: Vec::reserve(M): O(1) stack space if M <= Vec::capacity(), otherwise a single allocation of O(k*N) where k is an implementation-defined growth factor independent of N and N == Vec::size().
  • Effects: What are the effects of the operation?
    • Example: Vec::push(t): Before: size() == N, capacity == M, After: size() == N + 1, capacity() == M (amortized case) or capacity() == k*M (worst case, where k is an implementation defined growth factor), pop() == Some(t).
    • Example: Vec::reserve(N): Before: capacity() == M, After: capacity() == max(N, M).
  • Panics: When is an operation supposed to panic?
    • Example: Vec::push(t) panics if OOM or if a copy of t is performed and copying panics.
    • Example: Vec::size() never panics.

I consider this the bare minimum level of "formalization" that at least the basic collections and algorithms in the standard library should have, but as you can see things become extremely verbose so one might want to explore how to factor out some parts of the documentation (e.g. that for Vec the growth-factor is a constant independent of the number of elements or the capacity of the vector) that are to be repeated in a lot of comments.

So that's for my strawman proposal. I guess the steps to follow (please correct me if I'm wrong):

  • Achieve consensus on how to document the methods of the std library algorithms and collections. Maybe through an RFC? Or isn't an RFC required here?
  • Explore if there is an easy way to reuse some documentation bits (e.g. OOM panics, never panics, explanation for amortized / expected complexity).
  • Go (again) through all methods and document them rigorously (i.e. actually implement this).
  • Orthogonal to this, since not all collections are mentioned in the main collections page (e.g. HashSet), probably we want to go over the main collection page and overhaul it as well.

For a feeling of the current state of things:

  • HashMap::insert doesn't mention that insertions are O(1) so if one lands there (e.g. via google) one is on its own. Relevant information should be visible right away.
  • vec::insert: in the amortized case requires O(1) stack memory and will perform exactly O(end - pos) moves but in the worst case it requires O(kN) memory and O(N) moves. None of this is "obvious" from the docs. We should make it obvious.
  • std::slice::sort complexity is O(NlogN) in the number of elements, but it mentions that it allocates approximately "2*n" elements. Does that mean at worst 2N? Can it be more? While this is better documented than other methods, it is still extremely vague. In particular, trivial stable sort implementations require 1/2N extra memory, a bad one at least N, and the good ones are in place (e.g. GrailSort requires O(1)). The rust standard library doesn't have an inplace unstable sorting algorithm (AFAIK), but the documentation of one implemented with something quicksort-based like intro-sort should say that it uses O(logN) stack space in some cases (I did not have any examples above with anything but O(1) stack space, but basically every recursive algorithm requires something different than O(1)).

cc @GuillaumeGomez

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-docsArea: Documentation for any part of the project, including the compiler, standard library, and toolsC-enhancementCategory: An issue proposing an enhancement or a PR with one.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions