
Commit d3befe4

Author: Vincent Moens
Merge remote-tracking branch 'origin/pinmem-nonblock-tuto' into pinmem-nonblock-tuto
2 parents: 8f4d6d7 + 12d1b69

File tree

1 file changed: +12 −12 lines


intermediate_source/pinmem_nonblock.py

Lines changed: 12 additions & 12 deletions
@@ -125,7 +125,7 @@
 # As the following example will show, three requirements must be met to enable this:
 #
 # 1. The device must have at least one free DMA (Direct Memory Access) engine. Modern GPU architectures such as Volterra,
-#    Tesla or H100 devices have more than one DMA engine.
+#    Tesla, or H100 devices have more than one DMA engine.
 #
 # 2. The transfer must be done on a separate, non-default cuda stream. In PyTorch, cuda streams can be handled using
 #    :class:`~torch.cuda.Stream`.
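As a concrete illustration of the requirements listed in this hunk, here is a minimal sketch (not part of the patch) that issues a host-to-device copy on a non-default stream; it assumes a CUDA-capable machine, and the tensor shape is arbitrary:

import torch

if torch.cuda.is_available():
    copy_stream = torch.cuda.Stream()                # separate, non-default stream
    src = torch.randn(1024, 1024, pin_memory=True)   # page-locked host tensor

    with torch.cuda.stream(copy_stream):
        # non_blocking=True lets a free DMA engine run the copy asynchronously
        dst = src.to("cuda", non_blocking=True)

    # Make the default stream wait for the copy before consuming ``dst``.
    torch.cuda.current_stream().wait_stream(copy_stream)
    out = dst * 2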
@@ -250,7 +250,7 @@ def benchmark_with_profiler(
 # New tensors can be directly created in pinned memory with functions like :func:`~torch.zeros`, :func:`~torch.ones` and other
 # constructors.
 #
-# Let us check the speed of pinning memory and sending tensors to cuda:
+# Let us check the speed of pinning memory and sending tensors to CUDA:


 import torch
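For readers skimming the diff, the comparison this part of the tutorial runs looks roughly like the following sketch (assumed shapes and a CUDA device; :class:`torch.utils.benchmark.Timer` is used here for brevity, the tutorial's own benchmark differs):

import torch
from torch.utils.benchmark import Timer

if torch.cuda.is_available():
    pageable = torch.zeros(1024, 1024)                  # ordinary host memory
    pinned = torch.zeros(1024, 1024, pin_memory=True)   # page-locked host memory

    for name, t in [("pageable", pageable), ("pinned", pinned)]:
        m = Timer(
            stmt="t.to('cuda', non_blocking=True); torch.cuda.synchronize()",
            globals={"t": t, "torch": torch},
        ).blocked_autorange()
        print(f"{name}: {m.median * 1e6:.1f} us")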
@@ -318,10 +318,10 @@ def timer(cmd):
 #
 # However, contrary to a somewhat common belief, calling :meth:`~torch.Tensor.pin_memory()` on a pageable tensor before
 # casting it to GPU should not bring any significant speed-up; on the contrary, this call is usually slower than just
-# executing the transfer. This makes sense, since we're actually asking python to execute an operation that CUDA will
+# executing the transfer. This makes sense, since we're actually asking Python to execute an operation that CUDA will
 # perform anyway before copying the data from host to device.
 #
-# .. note:: The pytorch implementation of
+# .. note:: The PyTorch implementation of
 #    `pin_memory <https://github.com/pytorch/pytorch/blob/5298acb5c76855bc5a99ae10016efc86b27949bd/aten/src/ATen/native/Memory.cpp#L58>`_,
 #    which relies on creating a brand new storage in pinned memory through `cudaHostAlloc <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gb65da58f444e7230d3322b6126bb4902>`_,
 #    could be, in rare cases, faster than transitioning data in chunks as ``cudaMemcpy`` does.
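The claim in this hunk can be checked with a quick sketch along these lines (assumed sizes, not the tutorial's benchmark): pinning a pageable tensor yourself simply moves the page-locking work into Python before the same transfer runs.

import torch
from torch.utils.benchmark import Timer

if torch.cuda.is_available():
    t = torch.randn(1024, 1024)  # pageable host tensor

    direct = Timer(
        stmt="t.to('cuda', non_blocking=True); torch.cuda.synchronize()",
        globals={"t": t, "torch": torch},
    ).blocked_autorange()
    pin_first = Timer(
        stmt="t.pin_memory().to('cuda', non_blocking=True); torch.cuda.synchronize()",
        globals={"t": t, "torch": torch},
    ).blocked_autorange()
    print(f"direct: {direct.median * 1e6:.1f} us, pin first: {pin_first.median * 1e6:.1f} us")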
@@ -505,7 +505,7 @@ def pin_copy_to_device_nonblocking(*tensors):


 ######################################################################
-# Other copy directions (GPU -> CPU, CPU -> MPS etc.)
+# Other copy directions (GPU -> CPU, CPU -> MPS)
 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 #
 # .. _pinned_memory_other_direction:
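For the device-to-host direction named in this heading, a minimal sketch (assuming a CUDA device and an arbitrary tensor size) shows why synchronization matters: a non-blocking GPU -> CPU copy can return before the data has actually landed on the host.

import torch

if torch.cuda.is_available():
    gpu_t = torch.randn(1024, device="cuda")
    cpu_t = gpu_t.to("cpu", non_blocking=True)  # copy may still be in flight here
    torch.cuda.synchronize()                    # wait before reading cpu_t
    print(cpu_t.sum())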
@@ -693,7 +693,7 @@ def pin_copy_to_device_nonblocking(*tensors):
 #
 # - **System Architecture**
 #
-#   How is the system's architecture influencing data transfer speeds (e.g., bus speeds, network latency)?
+#   How is the system's architecture influencing data transfer speeds (for example, bus speeds, network latency)?
 #
 # Additionally, allocating a large number of tensors or sizable tensors in pinned memory can monopolize a substantial
 # portion of RAM.
@@ -718,11 +718,11 @@ def pin_copy_to_device_nonblocking(*tensors):
 #
 # .. _pinned_memory_resources:
 #
-# If you are dealing with issues with memory copies when using CUDA devices or want to learn more about
-# what was discussed in this tutorial, check the following references:
+# If you are dealing with issues with memory copies when using CUDA devices or want to learn more about
+# what was discussed in this tutorial, check the following references:
 #
-# - `CUDA toolkit memory management doc <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html>`_
-# - `CUDA pin-memory note <https://forums.developer.nvidia.com/t/pinned-memory/268474>`_
-# - `How to Optimize Data Transfers in CUDA C/C++ <https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/>`_
-# - tensordict :meth:`~tensordict.TensorDict.to` method;
+# - `CUDA toolkit memory management doc <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html>`_;
+# - `CUDA pin-memory note <https://forums.developer.nvidia.com/t/pinned-memory/268474>`_;
+# - `How to Optimize Data Transfers in CUDA C/C++ <https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/>`_;
+# - tensordict :meth:`~tensordict.TensorDict.to` method.
 #
