From 4152247a182514485a91cbf1dbb3587932c011ee Mon Sep 17 00:00:00 2001
From: Elliot Waite
Date: Tue, 10 Dec 2019 23:35:27 -0800
Subject: [PATCH] Fix typos in C++ extensions tutorial

---
 advanced_source/cpp_extension.rst | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/advanced_source/cpp_extension.rst b/advanced_source/cpp_extension.rst
index 56b02dd1818..d74c0eac771 100644
--- a/advanced_source/cpp_extension.rst
+++ b/advanced_source/cpp_extension.rst
@@ -115,13 +115,13 @@
 PyTorch has no knowledge of the *algorithm* you are implementing. It knows only
 of the individual operations you use to compose your algorithm. As such, PyTorch
 must execute your operations individually, one after the other. Since each
 individual call to the implementation (or *kernel*) of an operation, which may
-involve launch of a CUDA kernel, has a certain amount of overhead, this overhead
-may become significant across many function calls. Furthermore, the Python
-interpreter that is running our code can itself slow down our program.
+involve the launch of a CUDA kernel, has a certain amount of overhead, this
+overhead may become significant across many function calls. Furthermore, the
+Python interpreter that is running our code can itself slow down our program.
 
 A definite method of speeding things up is therefore to rewrite parts in C++ (or
 CUDA) and *fuse* particular groups of operations. Fusing means combining the
-implementations of many functions into a single functions, which profits from
+implementations of many functions into a single function, which profits from
 fewer kernel launches as well as other optimizations we can perform with
 increased visibility of the global flow of data.
@@ -509,12 +509,12 @@ and with our new C++ version::
    Forward: 349.335 us | Backward 443.523 us
 
 We can already see a significant speedup for the forward function (more than
-30%). For the backward function a speedup is visible, albeit not major one. The
-backward pass I wrote above was not particularly optimized and could definitely
-be improved. Also, PyTorch's automatic differentiation engine can automatically
-parallelize computation graphs, may use a more efficient flow of operations
-overall, and is also implemented in C++, so it's expected to be fast.
-Nevertheless, this is a good start.
+30%). For the backward function, a speedup is visible, albeit not a major one.
+The backward pass I wrote above was not particularly optimized and could
+definitely be improved. Also, PyTorch's automatic differentiation engine can
+automatically parallelize computation graphs, may use a more efficient flow of
+operations overall, and is also implemented in C++, so it's expected to be
+fast. Nevertheless, this is a good start.
 
 Performance on GPU Devices
 **************************
@@ -571,7 +571,7 @@ And C++/ATen::
 
 That's a great overall speedup compared to non-CUDA code. However, we can pull
 even more performance out of our C++ code by writing custom CUDA kernels, which
-we'll dive into soon. Before that, let's dicuss another way of building your C++
+we'll dive into soon. Before that, let's discuss another way of building your C++
 extensions.
 
 JIT Compiling Extensions
@@ -851,7 +851,7 @@ and ``Double``), you can use ``AT_DISPATCH_ALL_TYPES``.
 
 Note that we perform some operations with plain ATen. These operations will
 still run on the GPU, but using ATen's default implementations. This makes
-sense, because ATen will use highly optimized routines for things like matrix
+sense because ATen will use highly optimized routines for things like matrix
 multiplies (e.g. ``addmm``) or convolutions which would be much harder to
 implement and improve ourselves.
 
@@ -903,7 +903,7 @@ You can see in the CUDA kernel that we work directly on pointers with the right
 type. Indeed, working directly with high level type agnostic tensors inside cuda
 kernels would be very inefficient.
 
-However, this comes at a cost of ease of use and readibility, especially for
+However, this comes at a cost of ease of use and readability, especially for
 highly dimensional data. In our example, we know for example that the
 contiguous ``gates`` tensor has 3 dimensions:
 
@@ -920,7 +920,7 @@ arithmetic.
 
    gates.data()[n*3*state_size + row*state_size + column]
 
-In addition to being verbose, this expression needs stride to be explicitely
+In addition to being verbose, this expression needs stride to be explicitly
 known, and thus passed to the kernel function within its arguments. You can see
 that in the case of kernel functions accepting multiple tensors with different
 sizes you will end up with a very long list of arguments.
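
For context on the last hunk: the ``gates.data()[n*3*state_size + row*state_size + column]``
expression is plain pointer arithmetic over a flattened ``[batch_size, 3, state_size]``
tensor inside a CUDA kernel. The sketch below is not part of the patched tutorial; the
kernel name, the ``out`` buffer, and the launch geometry are hypothetical (only ``gates``,
``state_size``, and the index arithmetic come from the tutorial text), but it shows why
every size used in the indexing has to be passed to the kernel explicitly::

    // Hypothetical kernel, for illustration only: indexes a flattened
    // [batch_size, 3, state_size] "gates" buffer with explicit arithmetic.
    template <typename scalar_t>
    __global__ void gate_slice_kernel(
        const scalar_t* __restrict__ gates,  // flattened [batch_size, 3, state_size]
        scalar_t* __restrict__ out,          // flattened [batch_size, state_size]
        int batch_size,
        int state_size) {
      const int column = blockIdx.x * blockDim.x + threadIdx.x;
      const int n = blockIdx.y;  // batch index
      if (n < batch_size && column < state_size) {
        const int row = 0;  // e.g. the first of the three gate slices
        // Same index arithmetic as in the tutorial: every size involved in the
        // computation (here 3 and state_size) must be known inside the kernel.
        out[n * state_size + column] =
            gates[n * 3 * state_size + row * state_size + column];
      }
    }

With more tensors of different shapes in a real kernel, each one brings its own sizes
(and, for non-contiguous inputs, strides) into this argument list, which is exactly the
"very long list of arguments" the final context lines describe.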