 # networks, so improving the speed of these operations can improve
 # overall network training speed. Future releases of nvFuser will
 # improve the performance of Linear Layers, but for now we will
-# specifically look at the Bias-Dropout-Add-LayerNorm section of this
+# specifically look at the ``Bias-Dropout-Add-LayerNorm`` section of this
 # Transformer Block.
 #
 # .. figure:: /_static/img/nvfuser_intro/nvfuser_transformer_block.png
@@ -154,7 +154,7 @@ def profile_workload(forward_func, grad_output, iteration_count=100, label=""):
         # Run model, forward and backward
         output = forward_func()
         output.backward(grad_output)
-        # delete gradiens to avoid profiling the gradient accumulation
+        # delete gradients to avoid profiling the gradient accumulation
         for p in parameters:
             p.grad = None
 
@@ -165,7 +165,7 @@ def profile_workload(forward_func, grad_output, iteration_count=100, label=""):
         # Run model, forward and backward
         output = forward_func()
         output.backward(grad_output)
-        # delete gradiens to avoid profiling the gradient accumulation
+        # delete gradients to avoid profiling the gradient accumulation
         for p in parameters:
             p.grad = None
 
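
For context, the two hunks above touch the warm-up and timed loops of the tutorial's ``profile_workload`` helper. A minimal sketch of such a helper is shown below; the loop structure and the module-level ``parameters`` list are assumptions based on the surrounding diff, not the tutorial's exact code::

    import time

    import torch

    # Assumption: ``parameters`` holds the tensors that accumulate gradients,
    # e.g. ``parameters = list(model.parameters())`` earlier in the tutorial.
    parameters = []


    def profile_workload(forward_func, grad_output, iteration_count=100, label=""):
        # Warm-up iterations so one-time compilation cost is not measured
        for _ in range(3):
            # Run model, forward and backward
            output = forward_func()
            output.backward(grad_output)
            # delete gradients to avoid profiling the gradient accumulation
            for p in parameters:
                p.grad = None

        # Timed iterations (a CUDA device is assumed, as in the tutorial)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iteration_count):
            # Run model, forward and backward
            output = forward_func()
            output.backward(grad_output)
            # delete gradients to avoid profiling the gradient accumulation
            for p in parameters:
                p.grad = None
        torch.cuda.synchronize()
        stop = time.perf_counter()
        print(f"{label}: {iteration_count / (stop - start):.1f} iterations per second")
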
@@ -265,7 +265,7 @@ def profile_workload(forward_func, grad_output, iteration_count=100, label=""):
 # nvFuser took around 2.4s in total to compile these high speed
 # GPU functions.
 #
-# nvFuser’s capabilities extend well beyond this initial performance gain.
+# nvFuser's capabilities extend well beyond this initial performance gain.
 #
 
 ######################################################################
@@ -281,7 +281,7 @@ def profile_workload(forward_func, grad_output, iteration_count=100, label=""):
 # To use nvFuser on inputs that change shape from iteration, we
 # generate new input and output gradient tensors and make a few
 # different sizes. Since the last dimension is shared with the
-# parameters and cannot be changed dynamically in LayerNorm, we
+# parameters and cannot be changed dynamically in ``LayerNorm``, we
 # perturb the first two dimensions of the input and gradient tensors.
 #
 
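
The hunk above describes the dynamic-shape experiment, but the tensor sizes themselves are not part of this diff. A sketch of how such inputs might be generated, with illustrative sizes and the last (normalized) dimension held fixed::

    import torch

    # Illustrative values only; the tutorial's actual sizes are not shown in this diff.
    hidden_size = 1024
    device = "cuda"
    dtype = torch.float16

    # Perturb the first two dimensions; the last dimension stays fixed because it
    # is tied to the LayerNorm parameters.
    shapes = [(128, 64, hidden_size), (120, 56, hidden_size), (136, 72, hidden_size)]

    inputs = [torch.randn(s, device=device, dtype=dtype, requires_grad=True) for s in shapes]
    grad_outputs = [torch.randn(s, device=device, dtype=dtype) for s in shapes]
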
@@ -390,16 +390,16 @@ def profile_workload(forward_func, grad_output, iteration_count=100, label=""):
 #
 
 ######################################################################
-# Defining novel operations with nvFuser and FuncTorch
+# Defining novel operations with nvFuser and functorch
 # ----------------------------------------------------
 #
 # One of the primary benefits of nvFuser is the ability to define
 # novel operations composed of PyTorch “primitives” which are then
 # just-in-time compiled into efficient kernels.
 #
 # PyTorch has strong performance for any individual operation,
-# especially composite operations like LayerNorm. However, if
-# LayerNorm wasn’t already implemented in PyTorch as a composite
+# especially composite operations like ``LayerNorm``. However, if
+# ``LayerNorm`` wasn’t already implemented in PyTorch as a composite
 # operation, then you’d have to define it as a series of simpler
 # (primitive) operations. Let’s make such a definition and run it
 # without nvFuser.
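
The tutorial's actual ``primitive_definition`` (referenced in the hunk headers below) also covers the bias, dropout, and residual add; as an illustration of the idea only, a LayerNorm built from primitive ops could look roughly like this (hypothetical helper, not the tutorial's code)::

    import torch


    def layer_norm_from_primitives(
        x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor, eps: float = 1e-5
    ) -> torch.Tensor:
        # Normalize over the last dimension using only primitive operations
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, unbiased=False, keepdim=True)
        normalized = (x - mean) / torch.sqrt(var + eps)
        # Affine transform, matching what torch.nn.functional.layer_norm applies
        return normalized * weight + bias
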
@@ -488,7 +488,7 @@ def primitive_definition(
 #
 # However, the performance is still slower than the original eager
 # mode performance of the composite definition. TorchScript works well
-# when predefined composite operations are used, however TorchScript’s
+# when predefined composite operations are used, however TorchScript's
 # application of Autograd saves all of the activations for each
 # operator in the fusion for re-use in the backwards pass. However,
 # this is not typically the optimal choice. Especially when chaining
@@ -499,7 +499,7 @@ def primitive_definition(
 # It’s possible to optimize away many of these unnecessary memory
 # accesses, but it requires building a connected forward and backward
 # graph which isn’t possible with TorchScript. The
-# `memory_efficient_fusion` pass in FuncTorch, however, is such an
+# ``memory_efficient_fusion`` pass in functorch, however, is such an
 # optimization pass. To use this pass, we have to redefine our
 # function to pull the constants inside (for now it’s easiest to make
 # non-tensor constants literals in the function definition):
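
The body of ``primitive_definition_for_memory_efficient_fusion`` is not shown in this diff; a sketch of what "pulling the constants inside" might look like, with the dropout probability and eps written as literals (the signature and values here are assumptions)::

    import torch


    def primitive_definition_for_memory_efficient_fusion(
        input1: torch.Tensor,
        input2: torch.Tensor,
        weight: torch.Tensor,
        bias1: torch.Tensor,
        bias2: torch.Tensor,
    ) -> torch.Tensor:
        # Bias-Dropout-Add with the dropout probability as a literal
        bias_out = input1 + bias1
        dropout_out = torch.nn.functional.dropout(bias_out, 0.1)
        residual = dropout_out + input2
        # LayerNorm from primitives with eps as a literal
        mean = residual.mean(dim=-1, keepdim=True)
        var = residual.var(dim=-1, unbiased=False, keepdim=True)
        norm_output = (residual - mean) / torch.sqrt(var + 1e-5)
        return norm_output * weight + bias2
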
@@ -527,11 +527,11 @@ def primitive_definition_for_memory_efficient_fusion(
 
 ######################################################################
 # Now, instead of passing our function to TorchScript, we will pass it
-# to FuncTorch’s optimization pass.
+# to the functorch optimization pass.
 #
 
 
-# Optimize the model with FuncTorch tracing and the memory efficiency
+# Optimize the model with functorch tracing and the memory efficiency
 # optimization pass
 memory_efficient_primitive_definition = memory_efficient_fusion(
     primitive_definition_for_memory_efficient_fusion
@@ -550,22 +550,22 @@ def primitive_definition_for_memory_efficient_fusion(
 
 ######################################################################
 # This recovers even more speed, but it’s still not as fast as
-# TorchScripts original performance with the composite definition.
+# TorchScript's original performance with the composite definition.
 # However, this is still faster than running this new definition
 # without nvFuser, and is still faster than the composite definition
 # without nvFuser.
 #
 # .. figure:: /_static/img/nvfuser_intro/nvfuser_tutorial_5.png
 #
-# .. note:: FuncTorch’s memory efficient pass is experimental and still
+# .. note:: The functorch memory efficient pass is experimental and still
 #    actively in development.
 #    Future versions of the API are expected to achieve performance
 #    closer to that of TorchScript with the composite definition.
 #
-# .. note:: FuncTorch’s memory efficient pass specializes on the shapes of
+# .. note:: The functorch memory efficient pass specializes on the shapes of
 #    the inputs to the function. If new inputs are provided with
 #    different shapes, then you need to construct a new function
-#    using `memory_efficient_fusion` and apply it to the new inputs.
+#    using ``memory_efficient_fusion`` and apply it to the new inputs.
 
 
 ######################################################################
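
To illustrate the second note above: because the pass specializes on input shapes, a new fused function has to be built when the shapes change. A self-contained sketch of that workflow, using a toy function in place of the tutorial's definition (``fused_bias_gelu`` and the sizes are made up; a CUDA device is assumed)::

    import torch
    from functorch.compile import memory_efficient_fusion


    def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
        # Stand-in for the tutorial's primitive definition
        return torch.nn.functional.gelu(x + bias)


    bias = torch.randn(1024, device="cuda", requires_grad=True)

    # The fused function specializes on the shapes seen at first use...
    fused = memory_efficient_fusion(fused_bias_gelu)
    out_a = fused(torch.randn(128, 64, 1024, device="cuda", requires_grad=True), bias)

    # ...so inputs with different shapes need a freshly constructed function.
    fused_new_shape = memory_efficient_fusion(fused_bias_gelu)
    out_b = fused_new_shape(torch.randn(96, 48, 1024, device="cuda", requires_grad=True), bias)
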
@@ -577,10 +577,10 @@ def primitive_definition_for_memory_efficient_fusion(
 # an entirely new operation in PyTorch – which takes a lot of time and
 # knowledge of the lower-level PyTorch code as well as parallel
 # programming – or writing the operation in simpler PyTorch ops and
-# settling for poor performance. For example, let's replace LayerNorm
-# in our example with RMSNorm. Even though RMSNorm is a bit simpler
-# than LayerNorm, it doesn’t have an existing compound operation in
-# PyTorch. See the `Root Mean Square Layer Normalization <https://doi.org/10.48550/arXiv.1910.07467>`__ paper for more information about RMSNorm.
+# settling for poor performance. For example, let's replace ``LayerNorm``
+# in our example with ``RMSNorm``. Even though ``RMSNorm`` is a bit simpler
+# than ``LayerNorm``, it doesn’t have an existing compound operation in
+# PyTorch. See the `Root Mean Square Layer Normalization <https://doi.org/10.48550/arXiv.1910.07467>`__ paper for more information about ``RMSNorm``.
 # As before, we’ll define our new transformer block with
 # primitive PyTorch operations.
 #
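
The tutorial's ``with_rms_norm`` block is defined outside this diff; for reference, the RMSNorm part written in primitive ops could look roughly like this (a sketch based on the cited paper, not the tutorial's exact code)::

    import torch


    def rms_norm_from_primitives(
        x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6
    ) -> torch.Tensor:
        # RMSNorm rescales by the root mean square of the last dimension;
        # unlike LayerNorm there is no mean subtraction and no bias term.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
        return x / rms * weight
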
@@ -608,7 +608,7 @@ def with_rms_norm(
 # As before, we’ll get a baseline by running PyTorch without nvFuser.
 #
 
-# Profile rms_norm
+# Profile ``rms_norm``
 func = functools.partial(
     with_rms_norm,
     input1,
@@ -625,7 +625,7 @@ def with_rms_norm(
 # With nvFuser through TorchScript.
 #
 
-# Profile scripted rms_norm
+# Profile scripted ``rms_norm``
 scripted_with_rms_norm = torch.jit.script(with_rms_norm)
 func = functools.partial(
     scripted_with_rms_norm,
@@ -656,7 +656,7 @@ def with_rms_norm_for_memory_efficient_fusion(
     return norm_output
 
 
-# Profile memory efficient rms_norm
+# Profile memory efficient ``rms_norm``
 memory_efficient_rms_norm = memory_efficient_fusion(
     with_rms_norm_for_memory_efficient_fusion
 )
@@ -666,12 +666,12 @@ def with_rms_norm_for_memory_efficient_fusion(
 ######################################################################
 # .. figure:: /_static/img/nvfuser_intro/nvfuser_tutorial_6.png
 #
-# Since RMSNorm is simpler than LayerNorm the performance of our new
+# Since ``RMSNorm`` is simpler than ``LayerNorm`` the performance of our new
 # transformer block is a little higher than the primitive definition
 # without nvFuser (354 iterations per second compared with 260
 # iterations per second). With TorchScript, the iterations per second
 # increases by 2.68x and 3.36x to 952 iterations per second and 1,191
-# iterations per second with TorchScript and FuncTorch’s memory
+# iterations per second with TorchScript and the functorch memory
 # efficient optimization pass, respectively. The performance of this
 # new operation nearly matches the performance of the composite Layer
 # Norm definition with TorchScript.