Commit 2f55eb8

Vincent Moens and svekars authored
Apply suggestions from code review
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
1 parent d4169d4 commit 2f55eb8

File tree

1 file changed: +5 −5 lines changed


intermediate_source/pinmem_nonblock.py

Lines changed: 5 additions & 5 deletions
@@ -125,7 +125,7 @@
 # As the following example will show, three requirements must be met to enable this:
 #
 # 1. The device must have at least one free DMA (Direct Memory Access) engine. Modern GPU architectures such as Volterra,
-# Tesla or H100 devices have more than one DMA engine.
+# Tesla, or H100 devices have more than one DMA engine.
 #
 # 2. The transfer must be done on a separate, non-default cuda stream. In PyTorch, cuda streams can be handles using
 # :class:`~torch.cuda.Stream`.
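As a side note on the requirement this hunk edits, the overlap it describes (a transfer on a separate, non-default stream running alongside default-stream work) can be sketched as follows. This is an illustrative sketch, not part of the commit; the tensor size and helper name are arbitrary, and the code falls back gracefully on machines without CUDA:

```python
import torch

def overlapped_copy(cpu_tensor):
    """Copy a CPU tensor to the GPU on a non-default stream so the DMA
    engine can overlap the transfer with work on the default stream."""
    if not torch.cuda.is_available():
        # No CUDA device: nothing to overlap, return the tensor unchanged.
        return cpu_tensor
    side_stream = torch.cuda.Stream()  # separate, non-default stream
    with torch.cuda.stream(side_stream):
        gpu_tensor = cpu_tensor.to("cuda", non_blocking=True)
    # Make the default stream wait for the copy before consuming the result.
    torch.cuda.current_stream().wait_stream(side_stream)
    return gpu_tensor

x = torch.ones(1024)
if torch.cuda.is_available():
    x = x.pin_memory()  # pinned source memory is what makes the copy async
y = overlapped_copy(x)
```

Note that without pinned source memory the host-to-device copy degrades to a synchronous staged copy, which is why the tutorial pairs this pattern with pinned allocation.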
@@ -250,7 +250,7 @@ def benchmark_with_profiler(
 # New tensors can be directly created in pinned memory with functions like :func:`~torch.zeros`, :func:`~torch.ones` and other
 # constructors.
 #
-# Let us check the speed of pinning memory and sending tensors to cuda:
+# Let us check the speed of pinning memory and sending tensors to CUDA:
 
 
 import torch
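The context lines of this hunk mention creating tensors directly in pinned memory via constructor functions. A minimal sketch of that pattern (not part of the commit; `pin_memory=True` needs a CUDA-enabled build, so it is guarded here):

```python
import torch

# Allocate directly in pinned (page-locked) memory via the factory-function
# keyword, instead of allocating pageable memory and pinning it afterwards.
if torch.cuda.is_available():
    pinned = torch.zeros(1024, pin_memory=True)
    assert pinned.is_pinned()
else:
    # CPU-only fallback: a regular pageable allocation.
    pinned = torch.zeros(1024)

print(pinned.shape)
```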
@@ -318,10 +318,10 @@ def timer(cmd):
 #
 # However, contrary to a somewhat common belief, calling :meth:`~torch.Tensor.pin_memory()` on a pageable tensor before
 # casting it to GPU should not bring any significant speed-up, on the contrary this call is usually slower than just
-# executing the transfer. This makes sense, since we're actually asking python to execute an operation that CUDA will
+# executing the transfer. This makes sense, since we're actually asking Python to execute an operation that CUDA will
 # perform anyway before copying the data from host to device.
 #
-# .. note:: The pytorch implementation of
+# .. note:: The PyTorch implementation of
 # `pin_memory <https://github.com/pytorch/pytorch/blob/5298acb5c76855bc5a99ae10016efc86b27949bd/aten/src/ATen/native/Memory.cpp#L58>`_
 # which relies on creating a brand new storage in pinned memory through `cudaHostAlloc <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gb65da58f444e7230d3322b6126bb4902>`
 # could be, in rare cases, faster than transitioning data in chunks as ``cudaMemcpy`` does.
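The comparison this hunk polishes (pin-then-copy versus a direct copy of a pageable tensor) can be measured with a small sketch like the one below. This is an illustration under assumptions, not the tutorial's own benchmark; timings only run when a CUDA device is present, and the helper names are invented:

```python
import timeit
import torch

def direct_copy(t):
    # One call: CUDA stages the pageable host memory internally.
    return t.to("cuda", non_blocking=True)

def pin_then_copy(t):
    # Two steps: pin_memory() allocates fresh pinned storage and copies
    # into it on the Python side, then the host-to-device transfer runs.
    # The hunk's point is that this is usually *slower* than direct_copy.
    return t.pin_memory().to("cuda", non_blocking=True)

if torch.cuda.is_available():
    t = torch.randn(1_000_000)
    torch.cuda.synchronize()
    d = timeit.timeit(lambda: (direct_copy(t), torch.cuda.synchronize()), number=10)
    p = timeit.timeit(lambda: (pin_then_copy(t), torch.cuda.synchronize()), number=10)
    print(f"direct: {d:.4f}s, pin-then-copy: {p:.4f}s")
```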
@@ -505,7 +505,7 @@ def pin_copy_to_device_nonblocking(*tensors):
 
 
 ######################################################################
-# Other copy directions (GPU -> CPU, CPU -> MPS etc.)
+# Other copy directions (GPU -> CPU, CPU -> MPS)
 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 #
 # .. _pinned_memory_other_direction:
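For the GPU -> CPU direction named in this retitled section, the caveat with `non_blocking=True` sits on the consumer side: the destination CPU tensor must not be read before the copy has completed. A hedged sketch (not part of the commit; the helper name is invented, and the synchronization point is one reasonable choice):

```python
import torch

def copy_to_cpu(gpu_tensor):
    """Device-to-host copy with non_blocking=True.

    The copy is queued asynchronously, so we synchronize before the
    caller reads the CPU tensor; reading earlier risks stale data."""
    cpu_tensor = gpu_tensor.to("cpu", non_blocking=True)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # ensure the D2H copy finished
    return cpu_tensor

if torch.cuda.is_available():
    g = torch.arange(4, device="cuda")
    print(copy_to_cpu(g))
```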
