
Commit 07f9932

Authored and committed by Vincent Moens

address comments
1 parent 1dfe315 commit 07f9932

File tree

1 file changed: +11 -11 lines


intermediate_source/pinmem_nonblock.py

Lines changed: 11 additions & 11 deletions
@@ -66,7 +66,7 @@
 #
 # When one creates a CPU tensor in PyTorch, the content of this tensor needs to be placed
 # in memory. The memory we talk about here is a rather complex concept worth looking at carefully.
-# We distinguish two types of memory that are handled by the Memory Management Unit: the main memory (for simplicity)
+# We distinguish two types of memory that are handled by the Memory Management Unit: the RAM (for simplicity)
 # and the swap space on disk (which may or may not be the hard drive). Together, the available space in disk and RAM (physical memory)
 # make up the virtual memory, which is an abstraction of the total resources available.
 # In short, the virtual memory makes it so that the available space is larger than what can be found on RAM in isolation
@@ -78,9 +78,9 @@
 #
 # Typically, when a program accesses a page that is not in RAM, a "page fault" occurs and the operating system (OS) then brings
 # back this page into RAM ("swap in" or "page in").
-# In turn, the OS may have to _swap out_ (or _page out_) another page to make room for the new page.
+# In turn, the OS may have to swap out (or "page out") another page to make room for the new page.
 #
-# In contrast to pageable memory, a _pinned_ (or _page-locked_ or _non-pageable_) memory is a type of memory that cannot
+# In contrast to pageable memory, a pinned (or page-locked or non-pageable) memory is a type of memory that cannot
 # be swapped out to disk.
 # It allows for faster and more predictable access times, but has the downside that it is more limited than the
 # pageable memory (aka the main memory).
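For readers following along, here is a minimal sketch of what pinned allocation looks like in PyTorch. It is not part of this diff, and pinning requires a CUDA-capable build, hence the availability guard:

import torch

# Ordinary CPU tensor: lives in pageable memory and may be swapped out by the OS.
t_paged = torch.randn(1024, 1024)

if torch.cuda.is_available():
    # Page-locked (pinned) tensor: allocate it pinned directly, or copy an
    # existing tensor into a freshly pinned buffer.
    t_pinned = torch.randn(1024, 1024, pin_memory=True)
    t_pinned_copy = t_paged.pin_memory()
    print(t_paged.is_pinned(), t_pinned.is_pinned(), t_pinned_copy.is_pinned())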
@@ -158,13 +158,13 @@ def inner(pinned: bool, streamed: bool):
             t1_cuda = t1_cpu_pinned.to(device, non_blocking=True)
         else:
             t2_cuda = t2_cpu_paged.to(device, non_blocking=True)
-        t2_h2d_event = s.record_event()
+        t_star_cuda_h2d_event = s.record_event()
     # This operation can be executed during the CPU to GPU copy if and only if the tensor is pinned and the copy is
     # done in the other stream
     t3_cuda_mul = t3_cuda * t3_cuda * t3_cuda
-    t1_h2d_event = torch.cuda.current_stream().record_event()
-    t1_h2d_event.synchronize()
-    t2_h2d_event.synchronize()
+    t3_cuda_h2d_event = torch.cuda.current_stream().record_event()
+    t_star_cuda_h2d_event.synchronize()
+    t3_cuda_h2d_event.synchronize()


 # Our profiler: profiles the `inner` function and stores the results in a .json file
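As a side note, a self-contained sketch of the record_event/synchronize pattern this hunk relies on (tensor names and sizes are illustrative assumptions; a CUDA device is required):

import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    side_stream = torch.cuda.Stream()

    src = torch.randn(1 << 20, pin_memory=True)       # pinned source, so the copy can overlap
    gpu_tensor = torch.randn(1 << 20, device=device)  # independent GPU work

    with torch.cuda.stream(side_stream):
        dst = src.to(device, non_blocking=True)       # H2D copy issued on the side stream
        copy_done = side_stream.record_event()

    # Compute on the default stream can proceed while the copy is in flight.
    out = gpu_tensor * gpu_tensor * gpu_tensor
    compute_done = torch.cuda.current_stream().record_event()

    copy_done.synchronize()                           # wait for the copy
    compute_done.synchronize()                        # wait for the compute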
@@ -206,7 +206,7 @@ def benchmark_with_profiler(
 #
 # Using a pinned tensor doesn't change the trace much, both operations are still executed consecutively:

-benchmark_with_profiler(streamed=True, pinned=False)
+benchmark_with_profiler(streamed=False, pinned=True)

 ######################################################################
 #
@@ -215,7 +215,7 @@ def benchmark_with_profiler(
 #
 # Sending a pageable tensor to GPU on a separate stream is also a blocking operation:

-benchmark_with_profiler(streamed=False, pinned=True)
+benchmark_with_profiler(streamed=True, pinned=False)

 ######################################################################
 #
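The `benchmark_with_profiler` helper is defined earlier in the tutorial file and is not shown in this diff. A rough, hypothetical stand-in, assuming the tutorial's `inner` function is in scope, might look like:

import torch
from torch.profiler import ProfilerActivity, profile

def profile_inner(pinned: bool, streamed: bool, trace_path: str = "trace.json"):
    # Profile CPU and CUDA activity while running ``inner`` and export a Chrome trace.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        inner(pinned=pinned, streamed=streamed)
        torch.cuda.synchronize()
    prof.export_chrome_trace(trace_path)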
@@ -323,7 +323,7 @@ def timer(cmd):
 #
 # .. note:: The PyTorch implementation of
 #   `pin_memory <https://github.com/pytorch/pytorch/blob/5298acb5c76855bc5a99ae10016efc86b27949bd/aten/src/ATen/native/Memory.cpp#L58>`_
-#   which relies on creating a brand new storage in pinned memory through `cudaHostAlloc <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gb65da58f444e7230d3322b6126bb4902>`
+#   which relies on creating a brand new storage in pinned memory through `cudaHostAlloc <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gb65da58f444e7230d3322b6126bb4902>`_
 #   could be, in rare cases, faster than transitioning data in chunks as ``cudaMemcpy`` does.
 # Here too, the observation may vary depending on the available hardware, the size of the tensors being sent or
 # the amount of available RAM.
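To see this trade-off on a given machine, one rough way (a sketch using `torch.utils.benchmark`, not the tutorial's own `timer` helper) is to time both paths on a CUDA device:

import torch
from torch.utils.benchmark import Timer

t = torch.randn(1024, 1024)

plain_copy = Timer(
    "t.to('cuda'); torch.cuda.synchronize()",
    globals={"t": t, "torch": torch},
).blocked_autorange()

pin_then_copy = Timer(
    "t.pin_memory().to('cuda', non_blocking=True); torch.cuda.synchronize()",
    globals={"t": t, "torch": torch},
).blocked_autorange()

print(plain_copy)
print(pin_then_copy)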
@@ -724,5 +724,5 @@ def pin_copy_to_device_nonblocking(*tensors):
 # - `CUDA toolkit memory management doc <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html>`_;
 # - `CUDA pin-memory note <https://forums.developer.nvidia.com/t/pinned-memory/268474>`_;
 # - `How to Optimize Data Transfers in CUDA C/C++ <https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/>`_;
-# - tensordict :meth:`~tensordict.TensorDictBase.to` method.
+# - `tensordict doc <https://pytorch.org/tensordict/stable/index.html>`_ and `repo <https://github.com/pytorch/tensordict>`_.
 #
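A small usage sketch of what the linked tensordict docs cover (assumes `tensordict` is installed and a CUDA device is available):

import torch
from tensordict import TensorDict

td = TensorDict(
    {"obs": torch.randn(128, 3, 84, 84), "action": torch.randn(128, 6)},
    batch_size=[128],
)
# ``to`` moves every leaf tensor at once; with ``non_blocking=True`` the copies
# are issued asynchronously (most effective when the source tensors are pinned).
td_cuda = td.to("cuda", non_blocking=True)
torch.cuda.synchronize()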
