
Commit 35a79eb

Merge branch 'site' into patch-1
2 parents: 62233ef + 5b9c9dc

File tree

2,503 files changed (+18,162 / -4,433 lines)


_posts/2024-02-06-pytorch-2-paper-tutorial.md

Lines changed: 2 additions & 2 deletions
@@ -11,12 +11,12 @@ During the ASPLOS conference, we'll be conducting a tutorial on Saturday, April
 
 A preview of the paper is attached below:
 
-Title: **PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation.** [**Full Paper PDF**](/assets/pytorch_2.pdf)
+Title: **PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation.** [**Full Paper PDF**](/assets/pytorch2-2.pdf)
 
 ### Abstract
 This paper introduces two extensions to the popular PyTorch machine learning framework, TorchDynamo and TorchInductor, which implement the torch.compile feature released in PyTorch 2. TorchDynamo is a Python-level just-in-time (JIT) compiler that enables graph compilation in PyTorch programs without sacrificing the flexibility of Python. It achieves this by dynamically modifying Python bytecode before execution and extracting sequences of PyTorch operations into an FX graph, which is then JIT compiled using one of many extensible backends. TorchInductor is the default compiler backend for TorchDynamo, which translates PyTorch programs into OpenAI's Triton for GPUs and C++ for CPUs. Results show that TorchDynamo is able to capture graphs more robustly than prior approaches while adding minimal overhead, and TorchInductor is able to provide a 2.27x inference and 1.41x training geometric mean speedup on an NVIDIA A100 GPU across 180+ real-world models, which outperforms six other compilers. These extensions provide a new way to apply optimizations through compilers in eager mode frameworks like PyTorch.
 
 
 ### Authors
 
-Jason Ansel (Meta); Edward Yang (Meta); Horace He (Meta); Natalia Gimelshein (OpenAI); Animesh Jain (Meta); Michael Voznesensky (Meta); Bin Bao (Meta); David Berard (Meta); Geeta Chauhan (Meta); Anjali Chourdia (Meta); Will Constable (Meta); Alban Desmaison (Meta); Zachary DeVito (Meta); Elias Ellison (Meta); Will Feng (Meta); Jiong Gong (Intel); Michael Gschwind (Meta); Brian Hirsh (Meta); Sherlock Huang (Meta); Laurent Kirsch (Meta); Michael Lazos (Meta); Yanbo Liang (Meta); Jason Liang (Meta); Yinghai Lu (Meta); CK Luk (Meta); Bert Maher (Meta); Yunjie Pan (University of Michigan); Christian Puhrsch (Meta); Matthias Reso (Meta); Mark Saroufim (Meta); Helen Suk (Meta); Michael Suo (Meta); Phil Tillet (OpenAI); Eikan Wang (Intel); Xiaodong Wang (Meta); William Wen (Meta); Shunting Zhang (Meta); Xu Zhao (Meta); Keren Zhou (OpenAI & George Mason University); Richard Zou (Meta); Ajit Mathews (Meta); Gregory Chanan (Meta); Peng Wu (Meta); Soumith Chintala (Meta)
+Jason Ansel (Meta); Edward Yang (Meta); Horace He (Meta); Natalia Gimelshein (OpenAI); Animesh Jain (Meta); Michael Voznesensky (Meta); Bin Bao (Meta); Peter Bell (Quansight); David Berard (Meta); Evgeni Burovski (Quansight); Geeta Chauhan (Meta); Anjali Chourdia (Meta); Will Constable (Meta); Alban Desmaison (Meta); Zachary DeVito (Meta); Elias Ellison (Meta); Will Feng (Meta); Jiong Gong (Intel); Michael Gschwind (Meta); Brian Hirsh (Meta); Sherlock Huang (Meta); Kshiteej Kalambarkar (Quansight); Laurent Kirsch (Meta); Michael Lazos (Meta); Mario Lezcano (Quansight); Yanbo Liang (Meta); Jason Liang (Meta); Yinghai Lu (Meta); CK Luk (Meta); Bert Maher (Meta); Yunjie Pan (University of Michigan); Christian Puhrsch (Meta); Matthias Reso (Meta); Mark Saroufim (Meta); Marcos Yukio Siraichi (Quansight); Helen Suk (Meta); Michael Suo (Meta); Phil Tillet (OpenAI); Eikan Wang (Intel); Xiaodong Wang (Meta); William Wen (Meta); Shunting Zhang (Meta); Xu Zhao (Meta); Keren Zhou (OpenAI & George Mason University); Richard Zou (Meta); Ajit Mathews (Meta); Gregory Chanan (Meta); Peng Wu (Meta); Soumith Chintala (Meta)
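For context on the torch.compile feature the abstract describes, here is a minimal usage sketch of standard PyTorch 2 (illustrative only; the model shape and names are made up, nothing here is taken from the paper or this commit):

import torch

# torch.compile wraps an eager module: TorchDynamo captures Python bytecode
# into an FX graph and TorchInductor lowers it (Triton on GPU, C++ on CPU).
model = torch.nn.Sequential(
    torch.nn.Linear(64, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10),
)
compiled = torch.compile(model)   # default backend is "inductor"

x = torch.randn(32, 64)
out = compiled(x)                 # first call triggers JIT compilation
print(out.shape)                  # torch.Size([32, 10])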

assets/quick-start-module.js

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

docs/2.2/_images/RReLU.png

86 Bytes

docs/2.2/_modules/torch.html

Lines changed: 7 additions & 1 deletion
@@ -1126,7 +1126,13 @@
 
 
 def set_default_tensor_type(t):
-    r"""Sets the default ``torch.Tensor`` type to floating point tensor type
+    r"""
+    .. warning::
+
+        This function is deprecated as of PyTorch 2.1, please use :func:`torch.set_default_dtype()` and
+        :func:`torch.set_default_device()` as alternatives.
+
+    Sets the default ``torch.Tensor`` type to floating point tensor type
     ``t``. This type will also be used as default floating point type for
     type inference in :func:`torch.tensor`.
 
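The new warning points at the two replacement calls. A minimal migration sketch (standard PyTorch usage, not part of this commit; assumes a CUDA device is available) might be:

import torch

# Deprecated since PyTorch 2.1 (sets dtype and device through one tensor type):
# torch.set_default_tensor_type(torch.cuda.FloatTensor)

# Recommended replacement: set the default dtype and default device separately.
torch.set_default_dtype(torch.float32)
torch.set_default_device("cuda")

x = torch.tensor([1.0, 2.0])   # created as float32 on the default CUDA device
print(x.dtype, x.device)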

docs/2.2/_modules/torch/distributed/checkpoint/state_dict.html

Lines changed: 20 additions & 18 deletions
@@ -621,7 +621,7 @@
             if not skip_ddp_prefix:
                 fqn_obj_names.append(curr_obj_name)
         elif isinstance(curr_obj, FSDP):
-            if obj_names[i + 1] == FLAT_PARAM:
+            if i < len(obj_names) - 1 and obj_names[i + 1] == FLAT_PARAM:
                 prefix = ".".join(fqn_obj_names)
                 flat_param = getattr(curr_obj, FLAT_PARAM)
                 if prefix:
@@ -660,7 +660,7 @@
         Union[str, torch.Tensor], Union[Set[str], torch.Tensor]
     ] = {}
     all_fqns = set()
-    for name, param in model.named_parameters():
+    for name, param in chain(model.named_parameters(), model.named_buffers()):
         fqns = _get_fqns(model, name)
         fqn_param_mapping[param] = fqns
         for fqn in fqns:
@@ -859,7 +859,7 @@
     if not info.handle_model or not state_dict:
         return _IncompatibleKeys({}, {})
 
-    for key, _ in model.named_parameters():
+    for key, _ in chain(model.named_parameters(), model.named_buffers()):
         fqns = _get_fqns(model, key)
         fqns_with_ddp_prefix = _get_fqns(model, key, skip_ddp_prefix=False)
         for fqn, fqn_with_ddp_prefix in zip(fqns, fqns_with_ddp_prefix):
@@ -1142,25 +1142,25 @@
         optimizer parameter IDs to the canonical FQNs.
 
     Example:
+        >>> # xdoctest: +SKIP
+        >>> import torch
+        >>> from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
+        >>> from torch.nn.parallel import DistributedDataParallel as DDP
+        >>> from torch.distributed.checkpoint.state_dict import get_state_dict
 
-        import torch
-        from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
-        from torch.nn.parallel import DistributedDataParallel as DDP
-        from torch.distributed.checkpoint.state_dict import get_state_dict
-
-        fsdp_model = FSDP(copy.deepcopy(model))
-        fsdp_optim = torch.optim.Adam(model.parameters(), lr=1e-3)
-        ddp_model = DDP(copy.deepcopy(model))
-        ddp_optim = torch.optim.Adam(model.parameters(), lr=1e-3)
+        >>> fsdp_model = FSDP(copy.deepcopy(model))
+        >>> fsdp_optim = torch.optim.Adam(model.parameters(), lr=1e-3)
+        >>> ddp_model = DDP(copy.deepcopy(model))
+        >>> ddp_optim = torch.optim.Adam(model.parameters(), lr=1e-3)
 
 
-        ddp_state_dict, ddp_optim_state_dict = get_state_dict(ddp_model, ddp_optim)
-        fsdp_state_dict, fsdp_optim_state_dict = get_state_dict(fsdp_model, fsdp_optim)
+        >>> ddp_state_dict, ddp_optim_state_dict = get_state_dict(ddp_model, ddp_optim)
+        >>> fsdp_state_dict, fsdp_optim_state_dict = get_state_dict(fsdp_model, fsdp_optim)
 
-        # if we simply call ddp_model.state_dict() and fsdp_model.state_dict(),
-        # the asserts will fail.
-        assert ddp_state_dict == fsdp_state_dict
-        assert ddp_optim_state == fsdp_optim_state_dict
+        >>> # if we simply call ddp_model.state_dict() and fsdp_model.state_dict(),
+        >>> # the asserts will fail.
+        >>> assert ddp_state_dict == fsdp_state_dict
+        >>> assert ddp_optim_state == fsdp_optim_state_dict
 
 
     Args:
@@ -1175,6 +1175,8 @@
 
     Returns:
         ``Tuple`` that contain model state_dict and optimizer state_dict.
+
+    :rtype: typing.Tuple[typing.Dict[str, ValueType], OptimizerStateType]
     """
 
     with gc_context():

docs/2.2/_modules/torch/distributed/fsdp/fully_sharded_data_parallel.html

Lines changed: 1 addition & 1 deletion
@@ -932,7 +932,7 @@
                 "ignored_states": self._ignored_params,
                 "device_mesh": device_mesh,
             }
-            if sharding_strategy in HYBRID_SHARDING_STRATEGIES:
+            if sharding_strategy in HYBRID_SHARDING_STRATEGIES and device_mesh is None:
                 # Share root process groups with children to maintain
                 # the invariant that all FSDP modules will have the same
                 # process groups.

docs/2.2/_modules/torch/distributed/tensor/parallel/style.html

Lines changed: 1 addition & 1 deletion
@@ -495,7 +495,7 @@
 
 class ColwiseParallel(ParallelStyle):
     """
-    Partition a compatible nn.Module in a row-wise fashion. Currently supports nn.Linear and nn.Embedding.
+    Partition a compatible nn.Module in a column-wise fashion. Currently supports nn.Linear and nn.Embedding.
     Users can compose it together with RowwiseParallel to achieve the sharding of more complicated modules.
     (i.e. MLP, Attention)
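The corrected docstring describes column-wise partitioning. A hedged sketch of composing ColwiseParallel with RowwiseParallel via parallelize_module (the module class and the "net1"/"net2" names are made up for this example, and a distributed job is assumed to be initialized):

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

class MLP(nn.Module):
    # Toy two-layer MLP used only to illustrate the sharding plan.
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(16, 32)
        self.net2 = nn.Linear(32, 16)

    def forward(self, x):
        return self.net2(torch.relu(self.net1(x)))

mesh = init_device_mesh("cuda", (dist.get_world_size(),))

# Shard net1 column-wise and net2 row-wise so the intermediate activation
# stays sharded and only the final output needs a collective.
tp_model = parallelize_module(
    MLP().cuda(),
    mesh,
    {"net1": ColwiseParallel(), "net2": RowwiseParallel()},
)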
