Description
The UCG currently requires that produced values satisfy their validity invariant. This issue asks whether the UCG should also recommend that produced values satisfy their safety invariant. Here are examples of values that would follow the requirement but not the recommendation:
- Values of type
&str
that are not UTF-8 (somehow this seems a common example). - Values of type
&mut T
that are pinned (return value ofPin::get_unchecked_mut()
). - Values of type
&mut [u8]
that point to uninitialized bytes (Document current justification for not requiring recursive reference validity (in particular,&mut uninit
not being immediate UB) #346). - Values of type
&'static T
that are not'static
(Box::leak()
eventually followed byBox::from_raw()
).
Those values can be considered "unsafe" because they cannot be passed to arbitrary safe code (aka safe code outside the scope of unsafe) at their current type without possibly breaking soundness.
The decision is not obvious, because there are many violations of this recommendation at this time (and more getting added). Besides this legacy issue, there is a non-trivial trade-off between making the language simpler to explain on one side, and making the language harder to use on the other (most probably why there are so many violations at this time).
Difference with library UB
This issue is not about the UCG recommending programs to not have library UB. This would be a different issue.
Library UB occurs when the safety contract of a library API is violated. This issue is about producing a value that does not satisfy its safety invariant. It is independent of the notion of library API and their safety contracts. It is a notion of the type system and in particular types (possibly library types) and their safety invariants.
Language simpler to explain
The rules for unsafe Rust are simpler for users who follow recommendations, because they don't need to know about the validity invariant (since you can't produce a value violating the validity invariant without also violating the safety invariant). This is valuable for a few reasons:
- The validity invariant is unstable while the safety invariant is stable
- The validity invariant doesn't leak to users, letting language designers do what they want with it
- Users can get confused with 2 invariants (which one is UB? which one is maybe later UB to violate?)
- Most type systems have only one notion of type invariant
The language is also simpler to make sense of. There is no need to account for ill-typed values (no need for an update type when a refinement type would suffice, see also this thread).
This would also be a way to provide partial guarantees about validity invariants in the form of new and weaker types whose safety invariant matches the partial guarantees we want to provide for that validity invariant. Note that multiple types (of the type system) can have the same underlying calculus type (the opsem concept that gives the notion of validity invariant). This is for example the case with pointers. This gives programmers access to a hierarchy of types with different safety invariants for the same underlying calculus type, to choose the most appropriate type for the needed operation (see next section).
Language harder to use
The type of an ill-typed value needs to be changed (weakened in the hierarchy of types) until the value becomes well-typed at that new type. In the initial examples, that would be:
- Using
&[u8]
instead of&str
. - Using
Pin<&mut T>
orNonNull<T>
instead of&mut T
. - Using
&mut [MaybeUninit<u8>]
instead of&mut [u8]
. - Using
NonNull<T>
or&'a T
(with a correct and possibly too strict'a
) instead of&'static T
.
Besides the inconvenience of propagating those changes (when starting from an ill-typed program), one needs to find a suitable weakened type (even when starting a program from scratch). To maximize the chances of finding a suitable type, the hierarchy of types needs to be as small grain as possible, and go up (or down depending how you look at it) to the most weakened types, namely the calculus types. The situation has improved in that regard, like the introduction of MaybeUninit
, and will probably continue, like with UnsafeAlias
. One could hope that it would continue to improve, and a suitable type can always be found without having to weaken too much and use *mut T
if NonNull<T>
would suffice (and it would not exist).
Values are usually only temporarily ill-typed, since they should eventually reach safe code. When values cross a safe API, their type may need to be adjusted such that they can be used at their maximum capability without unsafe. For example, if unsafe code took a &mut [MaybeUninit<u8>]
as argument, it could return a &mut [u8]
to the initialized part as convenience to its user, maximizing what a safe user could do compared to an unsafe user.
Changing the type of a value crossing a safe API is only possible in some situations. If the type is under a type constructor, one needs layout guarantees like those of slices. If unsafe code receives a Foo<Dirty>
, cleans it, and returns a Foo<Clean>
(where Clean
is a semantic subtype of Dirty
), it can't simply transmute the value if Foo
doesn't provide any layout guarantees. It needs to deconstruct and reconstruct the value.
Those difficulties seem quite consequent for multiple reasons:
- We are used to ignore them.
- The type hierarchy is not granular enough.
- We lack subtyping relations in the hierarchy.
- We lack strong updates.
The main question (at least for me) is whether we believe those difficulties will be lifted over time, and it's already time to start having type hygiene, or whether we have to wait and continue wearing our cowboy hats.
Terminology
The term EB (#512) could be generalized the same way the term UB has been. We would get:
- UB = a requirement of an X safety contract was violated (you get no guarantees)
- EB = a recommendation of an X safety contract was violated (you get guarantees but you're operating outside recommendations)
- X can be calculus, type system, language, library, program, environment, anything that can have a contract
- When "X = language", those requirements and recommendations must be checkable
The UCG rules mentioned at the beginning of the issue would read:
- Producing a value violating its validity invariant is language UB
- Producing a value violating its safety invariant is type system EB