# Proposal

## Problem statement
A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization.
## Motivating examples or use cases
https://github.com/calder/dot-bench contains benchmarks for different dot product implementations, but all stable Rust implementations suffer from the same inability to vectorize and fuse instructions because the compiler needs to preserve the order of individual floating point operations.
Measurements below were performed on an i7-10875H:
### C++: 10us ✅

With Clang 18.1.3 and `-O2 -march=haswell`:

```cpp
float dot(float *a, float *b, size_t len) {
  #pragma clang fp reassociate(on)
  float sum = 0.0;
  for (size_t i = 0; i < len; ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}
```

*(assembly screenshot omitted)*
### Nightly Rust: 10us ✅

With rustc 1.86.0-nightly (8239a37f9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
    }
    sum
}
```

*(assembly screenshot omitted)*
### Stable Rust: 84us ❌

With rustc 1.84.1 (e71f9a9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
```

*(assembly screenshot omitted)*
## Solution sketch

Expose the `core::intrinsics::f*_algebraic` intrinsics as:
```rust
// core::num::f16
impl f16 {
    pub fn algebraic_add(self, rhs: f16) -> f16;
    pub fn algebraic_sub(self, rhs: f16) -> f16;
    pub fn algebraic_mul(self, rhs: f16) -> f16;
    pub fn algebraic_div(self, rhs: f16) -> f16;
    pub fn algebraic_rem(self, rhs: f16) -> f16;
}

// core::num::f32
impl f32 {
    pub fn algebraic_add(self, rhs: f32) -> f32;
    pub fn algebraic_sub(self, rhs: f32) -> f32;
    pub fn algebraic_mul(self, rhs: f32) -> f32;
    pub fn algebraic_div(self, rhs: f32) -> f32;
    pub fn algebraic_rem(self, rhs: f32) -> f32;
}

// core::num::f64
impl f64 {
    pub fn algebraic_add(self, rhs: f64) -> f64;
    pub fn algebraic_sub(self, rhs: f64) -> f64;
    pub fn algebraic_mul(self, rhs: f64) -> f64;
    pub fn algebraic_div(self, rhs: f64) -> f64;
    pub fn algebraic_rem(self, rhs: f64) -> f64;
}

// core::num::f128
impl f128 {
    pub fn algebraic_add(self, rhs: f128) -> f128;
    pub fn algebraic_sub(self, rhs: f128) -> f128;
    pub fn algebraic_mul(self, rhs: f128) -> f128;
    pub fn algebraic_div(self, rhs: f128) -> f128;
    pub fn algebraic_rem(self, rhs: f128) -> f128;
}
```
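As a usage sketch (not part of the proposal text), here is how the stable benchmark could be rewritten against the proposed API. Since the methods are not yet available, the `algebraic_add`/`algebraic_mul` free functions below are hypothetical stand-ins that forward to the ordinary operators so the sketch compiles today; the real methods would additionally license the compiler to reassociate and fuse operations.

```rust
// Stand-ins for the proposed methods (hypothetical, so this compiles on
// stable today). The real f32::algebraic_add / f32::algebraic_mul would
// permit reassociation and FMA contraction; these do not.
fn algebraic_add(a: f32, b: f32) -> f32 { a + b }
fn algebraic_mul(a: f32, b: f32) -> f32 { a * b }

// The dot product from the problem statement, expressed in the proposed style.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for (&x, &y) in a.iter().zip(b) {
        sum = algebraic_add(sum, algebraic_mul(x, y));
    }
    sum
}

fn main() {
    // 1*4 + 2*5 + 3*6 = 32
    assert_eq!(dot(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0]), 32.0);
    println!("ok");
}
```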
## Alternatives

rust-lang/rust#21690 has a lot of good discussion of various options for supporting fast math in Rust, but it is still open a decade later because any choice that opts in at a granularity coarser than individual operations is ultimately contrary to Rust's design principles.

In the meantime, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit.
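For comparison, one partial workaround available on stable Rust today (a sketch, not part of the proposal) is to accumulate into several independent partial sums so the compiler can keep them in separate vector lanes, then combine them at the end. Note that the summation order differs from the naive loop, so results may differ by a few ULPs, and this still cannot match full reassociation plus FMA contraction.

```rust
// Manual multi-accumulator dot product: LANES independent sums give the
// compiler instruction-level parallelism without changing any individual
// floating-point operation's operands.
fn dot_unrolled(a: &[f32], b: &[f32]) -> f32 {
    const LANES: usize = 8; // number of independent accumulators (assumption)
    let mut sums = [0.0f32; LANES];
    let chunks = a.len() / LANES;
    for c in 0..chunks {
        for l in 0..LANES {
            let i = c * LANES + l;
            sums[l] += a[i] * b[i];
        }
    }
    // Fold the tail that doesn't fill a full group of LANES into sums[0].
    for i in chunks * LANES..a.len() {
        sums[0] += a[i] * b[i];
    }
    sums.iter().sum()
}

fn main() {
    // 1*4 + 2*5 + 3*6 = 32
    assert_eq!(dot_unrolled(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0]), 32.0);
    // 1 + 2 + ... + 10 = 55 (exercises both the chunked and tail loops)
    let a: Vec<f32> = (1..=10).map(|x| x as f32).collect();
    let b = vec![1.0f32; 10];
    assert_eq!(dot_unrolled(&a, &b), 55.0);
    println!("ok");
}
```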
## Open questions

- Should we add these methods to `f16` and `f128` as well? --> Yes
- Should we include more primitives? --> Not in the initial implementation, but happy to in follow ups
- Should we rename to `fast_*()` since we're unlikely to ever want to expose the current (unsafe) `f*_fast()` intrinsics? --> No, "algebraic" does a better job capturing the intent guiding which optimizations we will enable
- Are there other optimizations it makes sense to enable? --> Not initially, but potential subnormal optimizations in the future
## Links and related work
- Imprecise floating point operations (fast-math) rust#21690
- Tracking Issue for algebraic floating point methods rust#136469
- Expose algebraic floating point intrinsics rust#136457
- Add "algebraic" fast-math intrinsics, based on fast-math ops that cannot return poison rust#120718
- https://github.com/calder/dot-bench
- https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps
## What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capacity becomes available. Current response times do not have a clear estimate, but may be up to several months.
### Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
- We think this problem seems worth solving, and the standard library might be the right place to solve it.
- We think that this probably doesn't belong in the standard library.
Second, if there's a concrete solution:
- We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
- We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.