test-backend-ops performance numbers incorrect

I noticed that with the CUDA backend on an RTX 3090, the achieved memory bandwidth that test-backend-ops reports for matrix multiplication can be much higher than 936 GB/s, the theoretical peak of the hardware. Therefore, there must be a bug in how these numbers are calculated.
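
To illustrate the kind of sanity check I have in mind, here is a minimal sketch. It is not the code from test-backend-ops: the function names are hypothetical, and I am assuming the simplest traffic model, in which each input matrix is read once and the output is written once. The element sizes (FP16 inputs, FP32 output) are likewise just example defaults.

```python
# Hypothetical sanity check; none of these names come from test-backend-ops.
RTX_3090_PEAK_GBS = 936.0  # peak memory bandwidth quoted above


def matmul_min_bytes(m, n, k, bytes_a=2, bytes_b=2, bytes_out=4):
    """Bytes moved for C[m,n] = A[m,k] @ B[k,n], assuming A and B are each
    read once and C is written once (an assumed, simplified traffic model)."""
    return m * k * bytes_a + k * n * bytes_b + m * n * bytes_out


def bandwidth_gbs(total_bytes, seconds):
    """Achieved bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return total_bytes / seconds / 1e9


def exceeds_peak(reported_gbs, peak_gbs=RTX_3090_PEAK_GBS):
    """True if a reported bandwidth is above the hardware peak."""
    return reported_gbs > peak_gbs


if __name__ == "__main__":
    # The timing below is a made-up placeholder, not a measurement.
    total = matmul_min_bytes(4096, 4096, 4096)
    gbs = bandwidth_gbs(total, seconds=1e-3)
    print(f"{gbs:.1f} GB/s, exceeds peak: {exceeds_peak(gbs)}")
```

If the byte count used by the benchmark is larger than what this model gives for the same shapes, for example because an operand is counted once per use rather than once per read, that could explain values above the hardware peak.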