"""
(prototype) Accelerating ``torch.save`` and ``torch.load`` with GPUDirect Storage
=================================================================================

GPUDirect Storage enables a direct data path for direct memory access transfers
between GPU memory and storage, avoiding a bounce buffer through the CPU.

In version **2.7**, we introduced new prototype APIs in ``torch.cuda.gds`` that serve as thin wrappers around
the `cuFile APIs <https://docs.nvidia.com/gpudirect-storage/api-reference-guide/index.html#cufile-io-api>`_
and can be used with ``torch.Tensor`` to achieve improved I/O performance.

In this tutorial, we will demonstrate how to use the ``torch.cuda.gds`` APIs in conjunction with
checkpoints generated by ``torch.save`` and ``torch.load`` on the local filesystem.
"""

################################################################################
# Using GPUDirect Storage with ``torch.save`` and ``torch.load``
# ================================================================
#
# GPUDirect Storage requires a storage alignment of 4KB. You can configure this
# using ``torch.utils.serialization.config.save.storage_alignment``:

import torch
from torch.utils.serialization import config as serialization_config
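
################################################################################
# As a minimal sketch of the save-side setup (the small CUDA ``nn.Linear`` below
# is purely illustrative; any state dictionary of CUDA tensors works), we set the
# 4KB storage alignment and then save under the ``torch.serialization.skip_data``
# context manager, which writes the checkpoint metadata and reserves space for
# the storage bytes that will be written with GDS below.

serialization_config.save.storage_alignment = 4096

import torch.nn as nn

# GDS moves data between GPU memory and storage, so the tensors live on CUDA.
m = nn.Linear(5, 10, device="cuda")
sd = m.state_dict()

# Write the checkpoint skeleton: metadata only, no storage bytes yet.
with torch.serialization.skip_data():
    torch.save(sd, "checkpoint.pt")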

################################################################################
# We can get the offsets that each storage should be written to within the checkpoint by loading under
# a ``FakeTensorMode``. A ``FakeTensor`` is a tensor that has metadata (such as sizes, strides, dtype, and device)
# information about the tensor, but does not have any storage bytes. The following snippet will not materialize
# any data, but will tag each ``FakeTensor`` with the offset within the checkpoint that
# corresponds to the tensor.
#
# If you are continuously saving the same state dictionary during training, you
# only need to obtain the offsets once and the same offsets can be re-used. Similarly, if a tensor is going to
# be saved or loaded repeatedly, you can use ``torch.cuda.gds.gds_register_buffer``, which wraps
# ``cuFileBufRegister``, to register the storages as GDS buffers.
#
# Note that ``torch.cuda.gds.GdsFile.save_storage`` binds to the synchronous ``cuFileWrite`` API,
# so no synchronization is needed afterwards.
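
# As a hypothetical illustration of buffer registration (optional, and mainly
# worthwhile when the same storages are saved or loaded repeatedly;
# ``torch.cuda.gds.gds_deregister_buffer`` undoes the registration):

for v in sd.values():
    torch.cuda.gds.gds_register_buffer(v.untyped_storage())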


import os
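
################################################################################
# Below is a minimal sketch of the write path under the assumptions above. Note
# that ``_checkpoint_offset`` is a private, prototype-level attribute and may
# change.

from torch._subclasses.fake_tensor import FakeTensorMode

# No storage bytes are materialized here; each FakeTensor is tagged with the
# offset of its storage within the checkpoint.
with FakeTensorMode():
    fake_sd = torch.load("checkpoint.pt", mmap=True)

f = torch.cuda.gds.GdsFile("checkpoint.pt", os.O_RDWR)

for k, v in sd.items():
    offset = fake_sd[k].untyped_storage()._checkpoint_offset
    # save_storage binds to the synchronous cuFileWrite, so no sync is needed.
    f.save_storage(v.untyped_storage(), offset)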

# Reload the full checkpoint to verify the bytes written via GDS.
for k, v in torch.load("checkpoint.pt").items():
    assert torch.equal(v, sd[k])

################################################################################
# The loading flow is the inverse: you can use ``torch.load`` with the ``torch.serialization.skip_data`` context
# manager to load everything except the storage bytes. This means that any tensors in the checkpoint will be
# created, but their storages will be empty (as if the tensors had been created via ``torch.empty``).

with torch.serialization.skip_data():
    sd_loaded = torch.load("checkpoint.pt")
|
105 | 108 |
|
106 | 109 | ################################################################################
|
107 | 110 | # We once again use the ``FakeTensorMode`` to get the checkpoint offsets and
|
108 | 111 | # ascertain that the loaded checkpoint is the same as the saved checkpoint.
|
| 112 | +# |
| 113 | +# Similar to ``torch.cuda.gds.GdsFile.save_storage``, ``torch.cuda.gds.GdsFile.load_storage`` |
| 114 | +# binds to the synchronous ``cuFileRead`` API, so no synchronization is needed afterwards. |
109 | 115 |
|
110 | 116 | for k, v in sd_loaded.items():
|
111 | 117 | assert not torch.equal(v, sd[k])
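
################################################################################
# A minimal sketch of the read path, mirroring the write path above: the offsets
# come from the same ``FakeTensorMode`` pass (``fake_sd`` in the earlier sketch),
# and ``load_storage`` fills the empty storages in place through the ``GdsFile``
# handle ``f`` opened above.

for k, v in sd_loaded.items():
    offset = fake_sd[k].untyped_storage()._checkpoint_offset
    # load_storage binds to the synchronous cuFileRead, so no sync is needed.
    f.load_storage(v.untyped_storage(), offset)

# After the GDS reads, the loaded state dict matches the saved one.
for k, v in sd_loaded.items():
    assert torch.equal(v, sd[k])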

# Deleting the ``GdsFile`` handle closes the file and releases cuFile resources.
del f

################################################################################
# Conclusion
# ==========
#
# In this tutorial we have demonstrated how to use the prototype ``torch.cuda.gds`` APIs
# in conjunction with ``torch.save`` and ``torch.load`` on the local filesystem. Please
# file an issue in the PyTorch GitHub repo if you have any feedback.