add norm_ffn_norm to profile script #282

vkuzo · 2024-06-14T04:44:11Z

Stack from ghstack (oldest at bottom):

Summary:

This PR adds an example FFN with the preceding and subsequent norms
to the profile script. It also adds a few quality-of-life (QOL) items for
automatic data extraction:

1. extract GPU time and aggregate it per CPU kernel name
2. attribute the kernel GPU time to gemms, float8 overhead, or other
3. approximate the time spent syncing scales/amaxes and display it as a fraction of total time
4. if the TORCHINDUCTOR_PROFILE environment variable is set, also parse its output for Triton kernel memory bandwidth

I hope this speeds up debugging of kernel performance on various models, since it automates many high-level metrics that take longer to obtain by inspecting the traces visually.
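
The aggregation and attribution steps listed above could look roughly like the sketch below. It is not the script's actual code: the substring rules in `F8_OVERHEAD_TOKENS` are an assumption inferred from the kernel names in the sample output, and `categorize` and `time_by_category` are hypothetical helper names.

```python
from collections import defaultdict

# Assumed rules, inferred from the sample output in this PR description.
# The real profile script may classify kernels differently.
GEMM_KERNELS = {"aten::mm", "aten::_scaled_mm"}
F8_OVERHEAD_TOKENS = ("abs", "max", "clamp", "scaled_mm")


def categorize(kernel_name: str, is_float8: bool) -> str:
    """Map a CPU-side kernel name to one of the three report categories."""
    if kernel_name in GEMM_KERNELS:
        return "0_gemm"
    # Only the float8 experiment can have float8 overhead.
    if is_float8 and any(tok in kernel_name for tok in F8_OVERHEAD_TOKENS):
        return "1_f8_overhead"
    return "2_other"


def time_by_category(kernel_times_ms, is_float8: bool) -> dict:
    """Sum GPU time (ms) per category from (kernel_name, time_ms) pairs."""
    totals = defaultdict(float)
    for name, time_ms in kernel_times_ms:
        totals[categorize(name, is_float8)] += time_ms
    return dict(totals)


# Three rows taken from the float8 experiment in the sample output.
sample = [
    ("aten::_scaled_mm", 8.148),
    ("triton_red_fused_abs_max_3", 0.127),
    ("aten::add_", 0.567),
]
totals = time_by_category(sample, is_float8=True)
```

Substring matching keeps the sketch short, but it is brittle: a fused kernel whose name happens to contain `max` for a non-float8 reason would be misattributed, so any real rule set needs checking against the traces.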

Example output when testing norm_ffn_norm with dynamic scaling and compile, with bandwidth measurements on:

Summary of GPU time by CPU kernel

    experiment                                                                                      kernel       category  time_ms  pct_gpu_time  bw_gpbs
1       0_ref                                                                                    aten::mm         0_gemm   15.625         0.826     None
9       0_ref                                                     triton_red_fused__to_copy_add_mul_sum_2        2_other    1.375         0.073   241.12
11      0_ref                                                                                  aten::add_        2_other    0.566         0.030     None
8       0_ref                                            triton_poi_fused_add_fill_mul_sigmoid_silu_sub_1        2_other    0.520         0.027  2203.88
2       0_ref                                                                 triton_poi_fused_mul_silu_1        2_other    0.302         0.016  2207.31
10      0_ref                                             triton_red_fused__to_copy_add_div_mul_pow_sum_3        2_other    0.150         0.008  1375.62
7       0_ref                                             triton_red_fused__to_copy_add_div_mul_pow_sum_0        2_other    0.122         0.006  1963.47
3       0_ref                                          triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_2        2_other    0.084         0.004  1935.35
6       0_ref                                                                                 aten::copy_        2_other    0.060         0.003     None
0       0_ref                                          triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_0        2_other    0.052         0.003  1861.85
4       0_ref                                                                                   aten::sum        2_other    0.051         0.003     None
5       0_ref                                                                                 aten::fill_        2_other    0.002         0.000     None
19   1_float8                                                                            aten::_scaled_mm         0_gemm    8.148         0.623     None
36   1_float8  triton_poi_fused__to_copy_add_clamp_copy_empty_like_fill_mul_reciprocal_sigmoid_silu_sub_6  1_f8_overhead    0.408         0.031  2220.22
34   1_float8                                    triton_red_fused_abs_add_fill_max_mul_sigmoid_silu_sub_4  1_f8_overhead    0.314         0.024  2090.00
18   1_float8                                                         triton_poi_fused__scaled_mm_clone_6  1_f8_overhead    0.294         0.022  1439.36
37   1_float8                               triton_poi_fused__scaled_mm_clamp_clone_copy_mul_reciprocal_7  1_f8_overhead    0.292         0.022  1527.26
22   1_float8                                  triton_poi_fused__to_copy_clamp_copy_mul_reciprocal_silu_9  1_f8_overhead    0.255         0.020  2148.34
20   1_float8                                                         triton_red_fused_abs_max_mul_silu_7  1_f8_overhead    0.244         0.019  1834.33
23   1_float8                                                        triton_poi_fused__scaled_mm_clone_10  1_f8_overhead    0.146         0.011  1462.59
15   1_float8                                                                  triton_red_fused_abs_max_3  1_f8_overhead    0.127         0.010  1498.30
31   1_float8                                     triton_red_fused__to_copy_abs_add_div_max_mul_pow_sum_1  1_f8_overhead    0.091         0.007  1971.11
33   1_float8                               triton_poi_fused__scaled_mm_clamp_clone_copy_mul_reciprocal_3  1_f8_overhead    0.090         0.007  1388.84
21   1_float8                                                                  triton_red_fused_abs_max_8  1_f8_overhead    0.062         0.005  1487.67
17   1_float8                                       triton_poi_fused__to_copy_clamp_copy_mul_reciprocal_5  1_f8_overhead    0.046         0.003  1833.92
12   1_float8                                  triton_red_fused__to_copy_abs_add_max_mean_mul_pow_rsqrt_0  1_f8_overhead    0.044         0.003  1246.58
16   1_float8                                        triton_per_fused_abs_clamp_copy_max_mul_reciprocal_4  1_f8_overhead    0.010         0.001     0.32
35   1_float8                  triton_per_fused__scaled_mm_abs_clamp_clone_copy_max_mul_reciprocal_silu_5  1_f8_overhead    0.005         0.000     0.32
32   1_float8                       triton_red_fused__scaled_mm_abs_clamp_clone_copy_max_mul_reciprocal_2  1_f8_overhead    0.003         0.000     4.51
13   1_float8                               triton_red_fused__to_copy_abs_clamp_copy_max_mul_reciprocal_1  1_f8_overhead    0.003         0.000     4.61
38   1_float8                                                     triton_red_fused__to_copy_add_mul_sum_8        2_other    0.804         0.061   244.46
40   1_float8                                                                                  aten::add_        2_other    0.567         0.043     None
30   1_float8                                                         triton_red_fused__to_copy_mul_sum_0        2_other    0.562         0.043   236.73
39   1_float8                                             triton_red_fused__to_copy_add_div_mul_pow_sum_9        2_other    0.149         0.011  1377.16
25   1_float8                                                                   triton_poi_fused_clone_12        2_other    0.146         0.011  1532.05
24   1_float8                                         triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_11        2_other    0.115         0.009  1293.20
29   1_float8                                                                                 aten::copy_        2_other    0.060         0.005     None
27   1_float8                                                                                   aten::sum        2_other    0.050         0.004     None
26   1_float8                                                                   triton_poi_fused_clone_13        2_other    0.043         0.003  1333.22
14   1_float8                                                               triton_poi_fused_empty_like_2        2_other    0.002         0.000     0.00
28   1_float8                                                                                 aten::fill_        2_other    0.002         0.000     None

Float8 amax/scale sync approx ratio of total time: 0.000

Summary of time (ms) by kernel category

 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
category                                              
0_gemm        15.625     8.148       0.521       1.918
1_f8_overhead  0.000     2.436         inf       0.000
2_other        3.283     2.498       0.761       1.314
All           18.908    13.082       0.692       1.445
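
The ratio columns in this table can be reproduced from the per-category totals. The sketch below assumes that `f8_div_ref` is float8 time divided by reference time and that `ref_div_f8` is the inverse, which matches the printed values. `ratio` is a hypothetical helper, not part of the script.

```python
import math


def ratio(numerator: float, denominator: float) -> float:
    """Divide, returning inf (or nan for 0/0) instead of raising on zero."""
    if denominator == 0:
        return math.inf if numerator else math.nan
    return numerator / denominator


# Category totals in ms, copied from the table above.
ref = {"0_gemm": 15.625, "1_f8_overhead": 0.000, "2_other": 3.283}
f8 = {"0_gemm": 8.148, "1_f8_overhead": 2.436, "2_other": 2.498}

ref_total = sum(ref.values())
f8_total = sum(f8.values())

# (f8_div_ref, ref_div_f8) per category, plus the "All" row.
rows = {cat: (ratio(f8[cat], ref[cat]), ratio(ref[cat], f8[cat])) for cat in ref}
rows["All"] = (ratio(f8_total, ref_total), ratio(ref_total, f8_total))

# pct_gpu_time for the reference gemm row: share of that experiment's total.
gemm_pct_ref = ref["0_gemm"] / ref_total
```

Read this way, the `All` row says the float8 run takes about 0.69x the GPU time of the reference run: the gemm time is nearly halved, and the `1_f8_overhead` kernels give back part of that saving.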

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: D59163495

Summary: This PR adds an example FFN with the preceding and subsequent norms to the profile script. I hope for this to speed up debugging of kernel performance on LLaMa. Test Plan: Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]

Summary: This PR adds an example FFN with the preceding and subsequent norms to the profile script. I hope for this to speed up debugging of kernel performance on LLaMa. Test Plan: Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 8eb0020 Pull Request resolved: #282

Summary: This PR adds an example FFN with the preceding and subsequent norms to the profile script. I hope for this to speed up debugging of kernel performance on LLaMa. Test Plan: Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 7f7ea6b Pull Request resolved: #282

Summary:

This PR adds an example FFN with the preceding and subsequent norms to the profile script. It also adds a few quality-of-life (QOL) items for automatic data extraction:

1. extract GPU time and aggregate it per CPU kernel name
2. attribute the kernel GPU time to gemms, float8 overhead or other
3. approximate the time spent syncing scales/amaxes and display as pct of total time

I hope for this to speed up debugging of kernel performance on various models, as this automates a lot of high level metrics which take more time to get from visualizing the traces.

Example output when testing `norm_ffn_norm` with delayed scaling and compile:

```
Summary of GPU time by CPU kernel

    experiment  kernel                                                                         category       time_ms
 0  0_ref       triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_0                             2_other          0.061
 1  0_ref       aten::mm                                                                       0_gemm          14.691
 2  0_ref       triton_poi_fused_mul_silu_1                                                    2_other          0.304
 3  0_ref       triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_2                             2_other          0.083
 4  0_ref       aten::sum                                                                      2_other          0.050
 5  0_ref       aten::fill_                                                                    2_other          0.002
 6  0_ref       aten::copy_                                                                    2_other          0.059
 7  0_ref       triton_red_fused__to_copy_add_div_mul_pow_sum_0                                2_other          0.129
 8  0_ref       triton_poi_fused_add_fill_mul_sigmoid_silu_sub_1                               2_other          0.520
 9  0_ref       triton_red_fused__to_copy_add_mul_sum_2                                        2_other          1.363
10  0_ref       triton_red_fused__to_copy_add_div_mul_pow_sum_3                                2_other          0.151
11  0_ref       aten::add_                                                                     2_other          0.567
12  1_float8    triton_per_fused_cat_copy_max_roll_0                                           1_f8_overhead    0.009
13  1_float8    triton_poi_fused_copy_1                                                        2_other          0.004
14  1_float8    triton_poi_fused_copy_2                                                        2_other          0.002
15  1_float8    triton_poi_fused_copy_3                                                        2_other          0.004
16  1_float8    triton_poi_fused_copy_4                                                        2_other          0.002
17  1_float8    triton_poi_fused_copy_5                                                        2_other          0.004
18  1_float8    triton_poi_fused_copy_6                                                        2_other          0.002
19  1_float8    triton_red_fused__to_copy_abs_add_clamp_max_mean_mul_pow_rsqrt_0               1_f8_overhead    0.089
20  1_float8    triton_red_fused__to_copy_abs_fill_max_mul_1                                   1_f8_overhead    0.003
21  1_float8    triton_red_fused_abs_max_2                                                     1_f8_overhead    0.140
22  1_float8    triton_per_fused_abs_fill_max_3                                                1_f8_overhead    0.010
23  1_float8    triton_poi_fused_reciprocal_4                                                  2_other          0.014
24  1_float8    triton_poi_fused__scaled_mm_clone_5                                            1_f8_overhead    0.289
25  1_float8    aten::_scaled_mm                                                               0_gemm           8.054
26  1_float8    triton_red_fused_abs_max_mul_silu_6                                            1_f8_overhead    0.246
27  1_float8    triton_red_fused_abs_max_7                                                     1_f8_overhead    0.061
28  1_float8    triton_poi_fused__to_copy_clamp_mul_silu_8                                     1_f8_overhead    0.254
29  1_float8    triton_poi_fused__scaled_mm_clone_9                                            1_f8_overhead    0.149
30  1_float8    triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_10                            2_other          0.115
31  1_float8    triton_poi_fused_clone_11                                                      2_other          0.151
32  1_float8    triton_poi_fused_clone_12                                                      2_other          0.089
33  1_float8    aten::sum                                                                      2_other          0.049
34  1_float8    aten::fill_                                                                    2_other          0.013
35  1_float8    aten::copy_                                                                    2_other          0.060
36  1_float8    triton_red_fused__to_copy_mul_sum_0                                            2_other          0.548
37  1_float8    triton_red_fused__scaled_mm__to_copy_abs_add_div_max_mul_pow_reciprocal_sum_1  1_f8_overhead    0.109
38  1_float8    triton_red_fused_abs_fill_max_2                                                1_f8_overhead    0.004
39  1_float8    triton_poi_fused__scaled_mm_clone_reciprocal_3                                 1_f8_overhead    0.049
40  1_float8    triton_poi_fused__scaled_mm_clone_reciprocal_4                                 1_f8_overhead    0.007
41  1_float8    triton_red_fused_abs_add_fill_max_mul_sigmoid_silu_sub_5                       1_f8_overhead    0.316
42  1_float8    triton_per_fused_abs_fill_max_mul_silu_6                                       1_f8_overhead    0.005
43  1_float8    triton_poi_fused__to_copy_add_clamp_fill_mul_sigmoid_silu_sub_7                1_f8_overhead    0.408
44  1_float8    triton_poi_fused__scaled_mm_clone_reciprocal_8                                 1_f8_overhead    0.294
45  1_float8    triton_red_fused__to_copy_add_mul_sum_9                                        2_other          0.810
46  1_float8    triton_red_fused__to_copy_add_div_mul_pow_sum_10                               2_other          0.148
47  1_float8    aten::add_                                                                     2_other          0.567

Summary of time (ms) by kernel category, across ref and float8

 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
category
0_gemm        14.691     8.054       0.548       1.824
1_f8_overhead  0.000     2.441         inf       0.000
2_other        3.291     2.582       0.785       1.274
All           17.981    13.077       0.727       1.375

Float8 amax/scale sync approx ratio of total time: 0.014
```

Test Plan: Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]

Summary: This PR adds an example FFN with the preceding and subsequent norms to the profile script. I hope for this to speed up debugging of kernel performance on LLaMa. Test Plan: Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 0affdc3 Pull Request resolved: #282

Summary: This PR adds an example FFN with the preceding and subsequent norms to the profile script. I hope for this to speed up debugging of kernel performance on LLaMa. Test Plan: Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: eefdb94 Pull Request resolved: #282

Summary:

This PR adds an example FFN with the preceding and subsequent norms to the profile script. It also adds a few quality-of-life (QOL) items for automatic data extraction:

1. extract GPU time and aggregate it per CPU kernel name
2. attribute the kernel GPU time to gemms, float8 overhead or other
3. approximate the time spent syncing scales/amaxes and display as pct of total time

I hope for this to speed up debugging of kernel performance on various models, as this automates a lot of high level metrics which take more time to get from visualizing the traces.

Example output when testing `norm_ffn_norm` with delayed scaling and compile, bandwidth on:

```
    experiment  kernel                                                             category       time_ms  pct_gpu_time  bw_gpbs
 0  0_ref       triton_red_fused_native_layer_norm_0                               2_other          0.242         0.021  2085.93
 1  0_ref       aten::mm                                                           0_gemm          10.120         0.877     None
 2  0_ref       aten::sum                                                          2_other          0.121         0.010     None
 3  0_ref       aten::fill_                                                        2_other          0.002         0.000     None
 4  0_ref       aten::copy_                                                        2_other          0.200         0.017     None
 5  0_ref       triton_red_fused_native_layer_norm_native_layer_norm_backward_0    2_other          0.350         0.030  2207.69
 6  0_ref       aten::add_                                                         2_other          0.511         0.044     None
 7  1_float8    triton_per_fused_copy_max_roll_0                                   1_f8_overhead    0.005         0.001     0.01
 8  1_float8    triton_per_fused_copy_max_roll_1                                   1_f8_overhead    0.003         0.000     0.01
 9  1_float8    triton_red_fused__to_copy_abs_clamp_max_mul_native_layer_norm_0    1_f8_overhead    0.367         0.048  1083.39
10  1_float8    triton_red_fused_abs_fill_max_native_layer_norm_1                  1_f8_overhead    0.004         0.001     5.52
11  1_float8    triton_red_fused_abs_max_2                                         1_f8_overhead    0.069         0.009  1486.46
12  1_float8    triton_per_fused_abs_fill_max_3                                    1_f8_overhead    0.002         0.000     0.20
13  1_float8    triton_poi_fused_reciprocal_4                                      2_other          0.004         0.001     0.00
14  1_float8    triton_poi_fused__scaled_mm_clone_5                                1_f8_overhead    0.152         0.020  1460.56
15  1_float8    aten::_scaled_mm                                                   0_gemm           5.213         0.683     None
16  1_float8    triton_poi_fused_clone_6                                           2_other          0.172         0.023  1488.13
17  1_float8    aten::sum                                                          2_other          0.126         0.017     None
18  1_float8    aten::fill_                                                        2_other          0.006         0.001     None
19  1_float8    aten::copy_                                                        2_other          0.200         0.026     None
20  1_float8    triton_red_fused_abs_max_0                                         1_f8_overhead    0.129         0.017  1732.38
21  1_float8    triton_per_fused_abs_fill_max_1                                    1_f8_overhead    0.002         0.000     0.20
22  1_float8    triton_poi_fused__scaled_mm__to_copy_clamp_clone_mul_reciprocal_2  1_f8_overhead    0.310         0.041  1459.47
23  1_float8    triton_poi_fused__scaled_mm__to_copy_clamp_clone_mul_reciprocal_3  1_f8_overhead    0.002         0.000     0.00
24  1_float8    triton_red_fused_native_layer_norm_native_layer_norm_backward_4    2_other          0.352         0.046  2205.86
25  1_float8    aten::add_                                                         2_other          0.510         0.067     None
```

Test Plan: Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]

Summary: This PR adds an example FFN with the preceding and subsequent norms to the profile script. I hope for this to speed up debugging of kernel performance on LLaMa. Test Plan: Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 6e6036b Pull Request resolved: #282

Summary: This PR adds an example FFN with the preceding and subsequent norms to the profile script. I hope for this to speed up debugging of kernel performance on LLaMa. Test Plan: Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 48420c4 Pull Request resolved: #282

vkuzo · 2024-06-28T16:29:48Z

@vkuzo has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-06-28T17:51:30Z

This pull request has been merged in 0b60496.

add norm_ffn_norm to profile script

899002b

Summary: This PR adds an example FFN with the preceding and subsequent norms to the profile script. I hope for this to speed up debugging of kernel performance on LLaMa. Test Plan: Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]

vkuzo mentioned this pull request Jun 14, 2024

QOL improvements to benchmarks/profile_linear_float8.py #281

Closed

facebook-github-bot added the CLA Signed label Jun 14, 2024

Update on "add norm_ffn_norm to profile script"

998b0c8

Summary: This PR adds an example FFN with the preceding and subsequent norms to the profile script. I hope for this to speed up debugging of kernel performance on LLaMa. Test Plan: Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]

vkuzo requested review from drisspg and y-sq June 17, 2024 21:36

y-sq approved these changes Jun 25, 2024

facebook-github-bot closed this in 0b60496 Jun 28, 2024

facebook-github-bot added the Merged label Jun 28, 2024
