Skip to content
This repository was archived by the owner on Aug 7, 2024. It is now read-only.

add norm_ffn_norm to profile script #282

Closed
wants to merge 6 commits into from
Closed

Conversation

vkuzo
Copy link
Contributor

@vkuzo vkuzo commented Jun 14, 2024

Stack from ghstack (oldest at bottom):

Summary:

This PR adds an example FFN with the preceding and subsequent norms
to the profile script. It also adds a couple of automatic data exctration QOL
items:

  1. extract GPU time and aggregate it per CPU kernel name
  2. attribute the kernel GPU time to gemms, float8 overhead or other
  3. approximate the time spent syncing scales/amaxes and display as pct of total time
  4. if TORCHINDUCTOR_PROFILE env variable is set, also parses its output for triton kernel memory bandwidth

I hope for this to speed up debugging of kernel performance on various models, as this automates a lot of high level metrics which take more time to get from visualizing the traces.

Example output when testing norm_ffn_norm with dynamic scaling and compile, and bandwidth measurements on

Summary of GPU time by CPU kernel

    experiment                                                                                      kernel       category  time_ms  pct_gpu_time  bw_gpbs
1       0_ref                                                                                    aten::mm         0_gemm   15.625         0.826     None
9       0_ref                                                     triton_red_fused__to_copy_add_mul_sum_2        2_other    1.375         0.073   241.12
11      0_ref                                                                                  aten::add_        2_other    0.566         0.030     None
8       0_ref                                            triton_poi_fused_add_fill_mul_sigmoid_silu_sub_1        2_other    0.520         0.027  2203.88
2       0_ref                                                                 triton_poi_fused_mul_silu_1        2_other    0.302         0.016  2207.31
10      0_ref                                             triton_red_fused__to_copy_add_div_mul_pow_sum_3        2_other    0.150         0.008  1375.62
7       0_ref                                             triton_red_fused__to_copy_add_div_mul_pow_sum_0        2_other    0.122         0.006  1963.47
3       0_ref                                          triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_2        2_other    0.084         0.004  1935.35
6       0_ref                                                                                 aten::copy_        2_other    0.060         0.003     None
0       0_ref                                          triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_0        2_other    0.052         0.003  1861.85
4       0_ref                                                                                   aten::sum        2_other    0.051         0.003     None
5       0_ref                                                                                 aten::fill_        2_other    0.002         0.000     None
19   1_float8                                                                            aten::_scaled_mm         0_gemm    8.148         0.623     None
36   1_float8  triton_poi_fused__to_copy_add_clamp_copy_empty_like_fill_mul_reciprocal_sigmoid_silu_sub_6  1_f8_overhead    0.408         0.031  2220.22
34   1_float8                                    triton_red_fused_abs_add_fill_max_mul_sigmoid_silu_sub_4  1_f8_overhead    0.314         0.024  2090.00
18   1_float8                                                         triton_poi_fused__scaled_mm_clone_6  1_f8_overhead    0.294         0.022  1439.36
37   1_float8                               triton_poi_fused__scaled_mm_clamp_clone_copy_mul_reciprocal_7  1_f8_overhead    0.292         0.022  1527.26
22   1_float8                                  triton_poi_fused__to_copy_clamp_copy_mul_reciprocal_silu_9  1_f8_overhead    0.255         0.020  2148.34
20   1_float8                                                         triton_red_fused_abs_max_mul_silu_7  1_f8_overhead    0.244         0.019  1834.33
23   1_float8                                                        triton_poi_fused__scaled_mm_clone_10  1_f8_overhead    0.146         0.011  1462.59
15   1_float8                                                                  triton_red_fused_abs_max_3  1_f8_overhead    0.127         0.010  1498.30
31   1_float8                                     triton_red_fused__to_copy_abs_add_div_max_mul_pow_sum_1  1_f8_overhead    0.091         0.007  1971.11
33   1_float8                               triton_poi_fused__scaled_mm_clamp_clone_copy_mul_reciprocal_3  1_f8_overhead    0.090         0.007  1388.84
21   1_float8                                                                  triton_red_fused_abs_max_8  1_f8_overhead    0.062         0.005  1487.67
17   1_float8                                       triton_poi_fused__to_copy_clamp_copy_mul_reciprocal_5  1_f8_overhead    0.046         0.003  1833.92
12   1_float8                                  triton_red_fused__to_copy_abs_add_max_mean_mul_pow_rsqrt_0  1_f8_overhead    0.044         0.003  1246.58
16   1_float8                                        triton_per_fused_abs_clamp_copy_max_mul_reciprocal_4  1_f8_overhead    0.010         0.001     0.32
35   1_float8                  triton_per_fused__scaled_mm_abs_clamp_clone_copy_max_mul_reciprocal_silu_5  1_f8_overhead    0.005         0.000     0.32
32   1_float8                       triton_red_fused__scaled_mm_abs_clamp_clone_copy_max_mul_reciprocal_2  1_f8_overhead    0.003         0.000     4.51
13   1_float8                               triton_red_fused__to_copy_abs_clamp_copy_max_mul_reciprocal_1  1_f8_overhead    0.003         0.000     4.61
38   1_float8                                                     triton_red_fused__to_copy_add_mul_sum_8        2_other    0.804         0.061   244.46
40   1_float8                                                                                  aten::add_        2_other    0.567         0.043     None
30   1_float8                                                         triton_red_fused__to_copy_mul_sum_0        2_other    0.562         0.043   236.73
39   1_float8                                             triton_red_fused__to_copy_add_div_mul_pow_sum_9        2_other    0.149         0.011  1377.16
25   1_float8                                                                   triton_poi_fused_clone_12        2_other    0.146         0.011  1532.05
24   1_float8                                         triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_11        2_other    0.115         0.009  1293.20
29   1_float8                                                                                 aten::copy_        2_other    0.060         0.005     None
27   1_float8                                                                                   aten::sum        2_other    0.050         0.004     None
26   1_float8                                                                   triton_poi_fused_clone_13        2_other    0.043         0.003  1333.22
14   1_float8                                                               triton_poi_fused_empty_like_2        2_other    0.002         0.000     0.00
28   1_float8                                                                                 aten::fill_        2_other    0.002         0.000     None

Float8 amax/scale sync approx ratio of total time: 0.000

Summary of time (ms) by kernel category

 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
category                                              
0_gemm        15.625     8.148       0.521       1.918
1_f8_overhead  0.000     2.436         inf       0.000
2_other        3.283     2.498       0.761       1.314
All           18.908    13.082       0.692       1.445

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: D59163495

Summary:

This PR adds an example FFN with the preceding and subsequent norms
to the profile script. I hope for this to speed up debugging of kernel
performance on LLaMa.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 14, 2024
vkuzo added a commit that referenced this pull request Jun 14, 2024
Summary:

This PR adds an example FFN with the preceding and subsequent norms
to the profile script. I hope for this to speed up debugging of kernel
performance on LLaMa.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 8eb0020
Pull Request resolved: #282
Summary:

This PR adds an example FFN with the preceding and subsequent norms
to the profile script. I hope for this to speed up debugging of kernel
performance on LLaMa.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
vkuzo added a commit that referenced this pull request Jun 17, 2024
Summary:

This PR adds an example FFN with the preceding and subsequent norms
to the profile script. I hope for this to speed up debugging of kernel
performance on LLaMa.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 7f7ea6b
Pull Request resolved: #282
@vkuzo vkuzo requested review from drisspg and y-sq June 17, 2024 21:36
Summary:

This PR adds an example FFN with the preceding and subsequent norms
to the profile script. It also adds a couple of automatic data exctration QOL
items:
1. extract GPU time and aggregate it per CPU kernel name
2. attribute the kernel GPU time to gemms, float8 overhead or other
3. approximate the time spent syncing scales/amaxes and display as pct of total time

I hope for this to speed up debugging of kernel performance on various models, as this automates a lot of high level metrics which take more time to get from visualizing the traces.

Example output when testing `norm_ffn_norm` with delayed scaling and compile:

```
Summary of GPU time by CPU kernel                                                                                                                                                                                                    
                                                                                                                                                                                                                                     
    experiment                                                                         kernel       category  time_ms
0       0_ref                             triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_0        2_other    0.061
1       0_ref                                                                       aten::mm         0_gemm   14.691
2       0_ref                                                    triton_poi_fused_mul_silu_1        2_other    0.304
3       0_ref                             triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_2        2_other    0.083              
4       0_ref                                                                      aten::sum        2_other    0.050             
5       0_ref                                                                    aten::fill_        2_other    0.002             
6       0_ref                                                                    aten::copy_        2_other    0.059             
7       0_ref                                triton_red_fused__to_copy_add_div_mul_pow_sum_0        2_other    0.129             
8       0_ref                               triton_poi_fused_add_fill_mul_sigmoid_silu_sub_1        2_other    0.520             
9       0_ref                                        triton_red_fused__to_copy_add_mul_sum_2        2_other    1.363             
10      0_ref                                triton_red_fused__to_copy_add_div_mul_pow_sum_3        2_other    0.151             
11      0_ref                                                                     aten::add_        2_other    0.567             
12   1_float8                                           triton_per_fused_cat_copy_max_roll_0  1_f8_overhead    0.009             
13   1_float8                                                        triton_poi_fused_copy_1        2_other    0.004                                                                                                                 
14   1_float8                                                        triton_poi_fused_copy_2        2_other    0.002                                                                                                                 
15   1_float8                                                        triton_poi_fused_copy_3        2_other    0.004                                                                                                                 
16   1_float8                                                        triton_poi_fused_copy_4        2_other    0.002                                                                                                                 
17   1_float8                                                        triton_poi_fused_copy_5        2_other    0.004                                                                                                                 
18   1_float8                                                        triton_poi_fused_copy_6        2_other    0.002                                                                                                                 
19   1_float8               triton_red_fused__to_copy_abs_add_clamp_max_mean_mul_pow_rsqrt_0  1_f8_overhead    0.089                                                                                                                 
20   1_float8                                   triton_red_fused__to_copy_abs_fill_max_mul_1  1_f8_overhead    0.003                                                                                                                 
21   1_float8                                                     triton_red_fused_abs_max_2  1_f8_overhead    0.140                                                                                                                 
22   1_float8                                                triton_per_fused_abs_fill_max_3  1_f8_overhead    0.010                                                                                                                 
23   1_float8                                                  triton_poi_fused_reciprocal_4        2_other    0.014                                                                                                                 
24   1_float8                                            triton_poi_fused__scaled_mm_clone_5  1_f8_overhead    0.289                                                                                                                 
25   1_float8                                                               aten::_scaled_mm         0_gemm    8.054                                                                                                                 
26   1_float8                                            triton_red_fused_abs_max_mul_silu_6  1_f8_overhead    0.246                                                                                               
27   1_float8                                                     triton_red_fused_abs_max_7  1_f8_overhead    0.061                                                                                                                 
28   1_float8                                     triton_poi_fused__to_copy_clamp_mul_silu_8  1_f8_overhead    0.254                                                                                                                 
29   1_float8                                            triton_poi_fused__scaled_mm_clone_9  1_f8_overhead    0.149                                                                                                                 
30   1_float8                            triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_10        2_other    0.115                                                                                                                 
31   1_float8                                                      triton_poi_fused_clone_11        2_other    0.151                                                                                                                 
32   1_float8                                                      triton_poi_fused_clone_12        2_other    0.089                                                                                                                 
33   1_float8                                                                      aten::sum        2_other    0.049                                                                                                                 
34   1_float8                                                                    aten::fill_        2_other    0.013                                                                                                                 
35   1_float8                                                                    aten::copy_        2_other    0.060                                                                                                                 
36   1_float8                                            triton_red_fused__to_copy_mul_sum_0        2_other    0.548                                                                                                                 
37   1_float8  triton_red_fused__scaled_mm__to_copy_abs_add_div_max_mul_pow_reciprocal_sum_1  1_f8_overhead    0.109                                                                                                                 
38   1_float8                                                triton_red_fused_abs_fill_max_2  1_f8_overhead    0.004              
39   1_float8                                 triton_poi_fused__scaled_mm_clone_reciprocal_3  1_f8_overhead    0.049             
40   1_float8                                 triton_poi_fused__scaled_mm_clone_reciprocal_4  1_f8_overhead    0.007             
41   1_float8                       triton_red_fused_abs_add_fill_max_mul_sigmoid_silu_sub_5  1_f8_overhead    0.316             
42   1_float8                                       triton_per_fused_abs_fill_max_mul_silu_6  1_f8_overhead    0.005             
43   1_float8                triton_poi_fused__to_copy_add_clamp_fill_mul_sigmoid_silu_sub_7  1_f8_overhead    0.408             
44   1_float8                                 triton_poi_fused__scaled_mm_clone_reciprocal_8  1_f8_overhead    0.294             
45   1_float8                                        triton_red_fused__to_copy_add_mul_sum_9        2_other    0.810             
46   1_float8                               triton_red_fused__to_copy_add_div_mul_pow_sum_10        2_other    0.148             
47   1_float8                                                                     aten::add_        2_other    0.567             
                                                                                                                                                                                                                                     
Summary of time (ms) by kernel category, across ref and float8                                                                                                                                                                       
                                                                                                                                                                                                                                     
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8                                                                                                                                                                              
category                                                                                                                                                                                                                             
0_gemm        14.691     8.054       0.548       1.824                                                                                                                                                                               1_f8_overhead  0.000     2.441         inf       0.000                                                                                                                                                                               
2_other        3.291     2.582       0.785       1.274                                                                                                                                                                               
All           17.981    13.077       0.727       1.375                                                                                                                                                                               
                                                                                                                                                                                                                                     
Float8 amax/scale sync approx ratio of total time: 0.014                                                                                                                                                                             
```

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
vkuzo added a commit that referenced this pull request Jun 17, 2024
Summary:

This PR adds an example FFN with the preceding and subsequent norms
to the profile script. I hope for this to speed up debugging of kernel
performance on LLaMa.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 0affdc3
Pull Request resolved: #282
Summary:

This PR adds an example FFN with the preceding and subsequent norms
to the profile script. It also adds a couple of automatic data exctration QOL
items:
1. extract GPU time and aggregate it per CPU kernel name
2. attribute the kernel GPU time to gemms, float8 overhead or other
3. approximate the time spent syncing scales/amaxes and display as pct of total time

I hope for this to speed up debugging of kernel performance on various models, as this automates a lot of high level metrics which take more time to get from visualizing the traces.

Example output when testing `norm_ffn_norm` with delayed scaling and compile:

```
Summary of GPU time by CPU kernel                                                                                                                                                                                                    
                                                                                                                                                                                                                                     
    experiment                                                                         kernel       category  time_ms
0       0_ref                             triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_0        2_other    0.061
1       0_ref                                                                       aten::mm         0_gemm   14.691
2       0_ref                                                    triton_poi_fused_mul_silu_1        2_other    0.304
3       0_ref                             triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_2        2_other    0.083              
4       0_ref                                                                      aten::sum        2_other    0.050             
5       0_ref                                                                    aten::fill_        2_other    0.002             
6       0_ref                                                                    aten::copy_        2_other    0.059             
7       0_ref                                triton_red_fused__to_copy_add_div_mul_pow_sum_0        2_other    0.129             
8       0_ref                               triton_poi_fused_add_fill_mul_sigmoid_silu_sub_1        2_other    0.520             
9       0_ref                                        triton_red_fused__to_copy_add_mul_sum_2        2_other    1.363             
10      0_ref                                triton_red_fused__to_copy_add_div_mul_pow_sum_3        2_other    0.151             
11      0_ref                                                                     aten::add_        2_other    0.567             
12   1_float8                                           triton_per_fused_cat_copy_max_roll_0  1_f8_overhead    0.009             
13   1_float8                                                        triton_poi_fused_copy_1        2_other    0.004                                                                                                                 
14   1_float8                                                        triton_poi_fused_copy_2        2_other    0.002                                                                                                                 
15   1_float8                                                        triton_poi_fused_copy_3        2_other    0.004                                                                                                                 
16   1_float8                                                        triton_poi_fused_copy_4        2_other    0.002                                                                                                                 
17   1_float8                                                        triton_poi_fused_copy_5        2_other    0.004                                                                                                                 
18   1_float8                                                        triton_poi_fused_copy_6        2_other    0.002                                                                                                                 
19   1_float8               triton_red_fused__to_copy_abs_add_clamp_max_mean_mul_pow_rsqrt_0  1_f8_overhead    0.089                                                                                                                 
20   1_float8                                   triton_red_fused__to_copy_abs_fill_max_mul_1  1_f8_overhead    0.003                                                                                                                 
21   1_float8                                                     triton_red_fused_abs_max_2  1_f8_overhead    0.140                                                                                                                 
22   1_float8                                                triton_per_fused_abs_fill_max_3  1_f8_overhead    0.010                                                                                                                 
23   1_float8                                                  triton_poi_fused_reciprocal_4        2_other    0.014                                                                                                                 
24   1_float8                                            triton_poi_fused__scaled_mm_clone_5  1_f8_overhead    0.289                                                                                                                 
25   1_float8                                                               aten::_scaled_mm         0_gemm    8.054                                                                                                                 
26   1_float8                                            triton_red_fused_abs_max_mul_silu_6  1_f8_overhead    0.246                                                                                               
27   1_float8                                                     triton_red_fused_abs_max_7  1_f8_overhead    0.061                                                                                                                 
28   1_float8                                     triton_poi_fused__to_copy_clamp_mul_silu_8  1_f8_overhead    0.254                                                                                                                 
29   1_float8                                            triton_poi_fused__scaled_mm_clone_9  1_f8_overhead    0.149                                                                                                                 
30   1_float8                            triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_10        2_other    0.115                                                                                                                 
31   1_float8                                                      triton_poi_fused_clone_11        2_other    0.151                                                                                                                 
32   1_float8                                                      triton_poi_fused_clone_12        2_other    0.089                                                                                                                 
33   1_float8                                                                      aten::sum        2_other    0.049                                                                                                                 
34   1_float8                                                                    aten::fill_        2_other    0.013                                                                                                                 
35   1_float8                                                                    aten::copy_        2_other    0.060                                                                                                                 
36   1_float8                                            triton_red_fused__to_copy_mul_sum_0        2_other    0.548                                                                                                                 
37   1_float8  triton_red_fused__scaled_mm__to_copy_abs_add_div_max_mul_pow_reciprocal_sum_1  1_f8_overhead    0.109                                                                                                                 
38   1_float8                                                triton_red_fused_abs_fill_max_2  1_f8_overhead    0.004              
39   1_float8                                 triton_poi_fused__scaled_mm_clone_reciprocal_3  1_f8_overhead    0.049             
40   1_float8                                 triton_poi_fused__scaled_mm_clone_reciprocal_4  1_f8_overhead    0.007             
41   1_float8                       triton_red_fused_abs_add_fill_max_mul_sigmoid_silu_sub_5  1_f8_overhead    0.316             
42   1_float8                                       triton_per_fused_abs_fill_max_mul_silu_6  1_f8_overhead    0.005             
43   1_float8                triton_poi_fused__to_copy_add_clamp_fill_mul_sigmoid_silu_sub_7  1_f8_overhead    0.408             
44   1_float8                                 triton_poi_fused__scaled_mm_clone_reciprocal_8  1_f8_overhead    0.294             
45   1_float8                                        triton_red_fused__to_copy_add_mul_sum_9        2_other    0.810             
46   1_float8                               triton_red_fused__to_copy_add_div_mul_pow_sum_10        2_other    0.148             
47   1_float8                                                                     aten::add_        2_other    0.567             
                                                                                                                                                                                                                                     
Summary of time (ms) by kernel category, across ref and float8                                                                                                                                                                       
                                                                                                                                                                                                                                     
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8                                                                                                                                                                              
category                                                                                                                                                                                                                             
0_gemm        14.691     8.054       0.548       1.824                                                                                                                                                                               1_f8_overhead  0.000     2.441         inf       0.000                                                                                                                                                                               
2_other        3.291     2.582       0.785       1.274                                                                                                                                                                               
All           17.981    13.077       0.727       1.375                                                                                                                                                                               
                                                                                                                                                                                                                                     
Float8 amax/scale sync approx ratio of total time: 0.014                                                                                                                                                                             
```

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
vkuzo added a commit that referenced this pull request Jun 17, 2024
Summary:

This PR adds an example FFN with the preceding and subsequent norms
to the profile script. I hope for this to speed up debugging of kernel
performance on LLaMa.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: eefdb94
Pull Request resolved: #282
Summary:

This PR adds an example FFN with the preceding and subsequent norms
to the profile script. It also adds a couple of automatic data exctration QOL
items:
1. extract GPU time and aggregate it per CPU kernel name
2. attribute the kernel GPU time to gemms, float8 overhead or other
3. approximate the time spent syncing scales/amaxes and display as pct of total time

I hope for this to speed up debugging of kernel performance on various models, as this automates a lot of high level metrics which take more time to get from visualizing the traces.

Example output when testing `norm_ffn_norm` with delayed scaling and compile:

bandwidth off
```
Summary of GPU time by CPU kernel                                                                                                                                                                                                    
                                                                                                                                                                                                                                     
    experiment                                                                         kernel       category  time_ms
0       0_ref                             triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_0        2_other    0.061
1       0_ref                                                                       aten::mm         0_gemm   14.691
2       0_ref                                                    triton_poi_fused_mul_silu_1        2_other    0.304
3       0_ref                             triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_2        2_other    0.083              
4       0_ref                                                                      aten::sum        2_other    0.050             
5       0_ref                                                                    aten::fill_        2_other    0.002             
6       0_ref                                                                    aten::copy_        2_other    0.059             
7       0_ref                                triton_red_fused__to_copy_add_div_mul_pow_sum_0        2_other    0.129             
8       0_ref                               triton_poi_fused_add_fill_mul_sigmoid_silu_sub_1        2_other    0.520             
9       0_ref                                        triton_red_fused__to_copy_add_mul_sum_2        2_other    1.363             
10      0_ref                                triton_red_fused__to_copy_add_div_mul_pow_sum_3        2_other    0.151             
11      0_ref                                                                     aten::add_        2_other    0.567             
12   1_float8                                           triton_per_fused_cat_copy_max_roll_0  1_f8_overhead    0.009             
13   1_float8                                                        triton_poi_fused_copy_1        2_other    0.004                                                                                                                 
14   1_float8                                                        triton_poi_fused_copy_2        2_other    0.002                                                                                                                 
15   1_float8                                                        triton_poi_fused_copy_3        2_other    0.004                                                                                                                 
16   1_float8                                                        triton_poi_fused_copy_4        2_other    0.002                                                                                                                 
17   1_float8                                                        triton_poi_fused_copy_5        2_other    0.004                                                                                                                 
18   1_float8                                                        triton_poi_fused_copy_6        2_other    0.002                                                                                                                 
19   1_float8               triton_red_fused__to_copy_abs_add_clamp_max_mean_mul_pow_rsqrt_0  1_f8_overhead    0.089                                                                                                                 
20   1_float8                                   triton_red_fused__to_copy_abs_fill_max_mul_1  1_f8_overhead    0.003                                                                                                                 
21   1_float8                                                     triton_red_fused_abs_max_2  1_f8_overhead    0.140                                                                                                                 
22   1_float8                                                triton_per_fused_abs_fill_max_3  1_f8_overhead    0.010                                                                                                                 
23   1_float8                                                  triton_poi_fused_reciprocal_4        2_other    0.014                                                                                                                 
24   1_float8                                            triton_poi_fused__scaled_mm_clone_5  1_f8_overhead    0.289                                                                                                                 
25   1_float8                                                               aten::_scaled_mm         0_gemm    8.054                                                                                                                 
26   1_float8                                            triton_red_fused_abs_max_mul_silu_6  1_f8_overhead    0.246                                                                                               
27   1_float8                                                     triton_red_fused_abs_max_7  1_f8_overhead    0.061                                                                                                                 
28   1_float8                                     triton_poi_fused__to_copy_clamp_mul_silu_8  1_f8_overhead    0.254                                                                                                                 
29   1_float8                                            triton_poi_fused__scaled_mm_clone_9  1_f8_overhead    0.149                                                                                                                 
30   1_float8                            triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_10        2_other    0.115                                                                                                                 
31   1_float8                                                      triton_poi_fused_clone_11        2_other    0.151                                                                                                                 
32   1_float8                                                      triton_poi_fused_clone_12        2_other    0.089                                                                                                                 
33   1_float8                                                                      aten::sum        2_other    0.049                                                                                                                 
34   1_float8                                                                    aten::fill_        2_other    0.013                                                                                                                 
35   1_float8                                                                    aten::copy_        2_other    0.060                                                                                                                 
36   1_float8                                            triton_red_fused__to_copy_mul_sum_0        2_other    0.548                                                                                                                 
37   1_float8  triton_red_fused__scaled_mm__to_copy_abs_add_div_max_mul_pow_reciprocal_sum_1  1_f8_overhead    0.109                                                                                                                 
38   1_float8                                                triton_red_fused_abs_fill_max_2  1_f8_overhead    0.004              
39   1_float8                                 triton_poi_fused__scaled_mm_clone_reciprocal_3  1_f8_overhead    0.049             
40   1_float8                                 triton_poi_fused__scaled_mm_clone_reciprocal_4  1_f8_overhead    0.007             
41   1_float8                       triton_red_fused_abs_add_fill_max_mul_sigmoid_silu_sub_5  1_f8_overhead    0.316             
42   1_float8                                       triton_per_fused_abs_fill_max_mul_silu_6  1_f8_overhead    0.005             
43   1_float8                triton_poi_fused__to_copy_add_clamp_fill_mul_sigmoid_silu_sub_7  1_f8_overhead    0.408             
44   1_float8                                 triton_poi_fused__scaled_mm_clone_reciprocal_8  1_f8_overhead    0.294             
45   1_float8                                        triton_red_fused__to_copy_add_mul_sum_9        2_other    0.810             
46   1_float8                               triton_red_fused__to_copy_add_div_mul_pow_sum_10        2_other    0.148             
47   1_float8                                                                     aten::add_        2_other    0.567             
                                                                                                                                                                                                                                     
Summary of time (ms) by kernel category, across ref and float8                                                                                                                                                                       
                                                                                                                                                                                                                                     
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8                                                                                                                                                                              
category                                                                                                                                                                                                                             
0_gemm        14.691     8.054       0.548       1.824                                                                                                                                                                               1_f8_overhead  0.000     2.441         inf       0.000                                                                                                                                                                               
2_other        3.291     2.582       0.785       1.274                                                                                                                                                                               
All           17.981    13.077       0.727       1.375                                                                                                                                                                               
                                                                                                                                                                                                                                     
Float8 amax/scale sync approx ratio of total time: 0.014                                                                                                                                                                             
```

bandwidth on
```
    experiment                                                             kernel       category  time_ms  pct_gpu_time  bw_gpbs
0       0_ref                               triton_red_fused_native_layer_norm_0        2_other    0.242         0.021  2085.93
1       0_ref                                                           aten::mm         0_gemm   10.120         0.877     None
2       0_ref                                                          aten::sum        2_other    0.121         0.010     None
3       0_ref                                                        aten::fill_        2_other    0.002         0.000     None
4       0_ref                                                        aten::copy_        2_other    0.200         0.017     None
5   i   0_ref    driton_red_fused_nytive_layer_norm_native_layer_norm_backward_0        2_other    0.350         0.030  2207.69
6       0_ref                                                         aten::add_        2_other    0.511         0.044     None
7    1_float8                                   triton_per_fused_copy_max_roll_0  1_f8_overhead    0.005         0.001     0.01
8    1_float8                                   triton_per_fused_copy_max_roll_1  1_f8_overhead    0.003         0.000     0.01
9    1_float8    triton_red_fused__to_copy_abs_clamp_max_mul_native_layer_norm_0  1_f8_overhead    0.367         0.048  1083.39
10   1_float8                  triton_red_fused_abs_fill_max_native_layer_norm_1  1_f8_overhead    0.004         0.001     5.52
11   1_float8                                         triton_red_fused_abs_max_2  1_f8_overhead    0.069         0.009  1486.46
12   1_float8                                    triton_per_fused_abs_fill_max_3  1_f8_overhead    0.002         0.000     0.20
13   1_float8                                      triton_poi_fused_reciprocal_4        2_other    0.004         0.001     0.00
14   1_float8                                triton_poi_fused__scaled_mm_clone_5  1_f8_overhead    0.152         0.020  1460.56
15   1_float8                                                   aten::_scaled_mm         0_gemm    5.213         0.683     None
16   1_float8                                           triton_poi_fused_clone_6        2_other    0.172         0.023  1488.13
17   1_float8                                                          aten::sum        2_other    0.126         0.017     None
18   1_float8                                                        aten::fill_        2_other    0.006         0.001     None
19   1_float8                                                        aten::copy_        2_other    0.200         0.026     None
20   1_float8                                         triton_red_fused_abs_max_0  1_f8_overhead    0.129         0.017  1732.38
21   1_float8                                    triton_per_fused_abs_fill_max_1  1_f8_overhead    0.002         0.000     0.20
22   1_float8  triton_poi_fused__scaled_mm__to_copy_clamp_clone_mul_reciprocal_2  1_f8_overhead    0.310         0.041  1459.47
23   1_float8  triton_poi_fused__scaled_mm__to_copy_clamp_clone_mul_reciprocal_3  1_f8_overhead    0.002         0.000     0.00
24   1_float8    triton_red_fused_native_layer_norm_native_layer_norm_backward_4        2_other    0.352         0.046  2205.86
25   1_float8                                                         aten::add_        2_other    0.510         0.067     None

```

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
vkuzo added a commit that referenced this pull request Jun 17, 2024
Summary:

This PR adds an example FFN with the preceding and subsequent norms
to the profile script. I hope for this to speed up debugging of kernel
performance on LLaMa.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 6e6036b
Pull Request resolved: #282
Summary:

This PR adds an example FFN with the preceding and subsequent norms
to the profile script. It also adds a couple of automatic data exctration QOL
items:
1. extract GPU time and aggregate it per CPU kernel name
2. attribute the kernel GPU time to gemms, float8 overhead or other
3. approximate the time spent syncing scales/amaxes and display as pct of total time
4. if TORCHINDUCTOR_PROFILE env variable is set, also parses its output for triton kernel memory bandwidth

I hope for this to speed up debugging of kernel performance on various models, as this automates a lot of high level metrics which take more time to get from visualizing the traces.

Example output when testing `norm_ffn_norm` with dynamic scaling and compile, and bandwidth measurements on

```
Summary of GPU time by CPU kernel

    experiment                                                                                      kernel       category  time_ms  pct_gpu_time  bw_gpbs
1       0_ref                                                                                    aten::mm         0_gemm   15.625         0.826     None
9       0_ref                                                     triton_red_fused__to_copy_add_mul_sum_2        2_other    1.375         0.073   241.12
11      0_ref                                                                                  aten::add_        2_other    0.566         0.030     None
8       0_ref                                            triton_poi_fused_add_fill_mul_sigmoid_silu_sub_1        2_other    0.520         0.027  2203.88
2       0_ref                                                                 triton_poi_fused_mul_silu_1        2_other    0.302         0.016  2207.31
10      0_ref                                             triton_red_fused__to_copy_add_div_mul_pow_sum_3        2_other    0.150         0.008  1375.62
7       0_ref                                             triton_red_fused__to_copy_add_div_mul_pow_sum_0        2_other    0.122         0.006  1963.47
3       0_ref                                          triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_2        2_other    0.084         0.004  1935.35
6       0_ref                                                                                 aten::copy_        2_other    0.060         0.003     None
0       0_ref                                          triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_0        2_other    0.052         0.003  1861.85
4       0_ref                                                                                   aten::sum        2_other    0.051         0.003     None
5       0_ref                                                                                 aten::fill_        2_other    0.002         0.000     None
19   1_float8                                                                            aten::_scaled_mm         0_gemm    8.148         0.623     None
36   1_float8  triton_poi_fused__to_copy_add_clamp_copy_empty_like_fill_mul_reciprocal_sigmoid_silu_sub_6  1_f8_overhead    0.408         0.031  2220.22
34   1_float8                                    triton_red_fused_abs_add_fill_max_mul_sigmoid_silu_sub_4  1_f8_overhead    0.314         0.024  2090.00
18   1_float8                                                         triton_poi_fused__scaled_mm_clone_6  1_f8_overhead    0.294         0.022  1439.36
37   1_float8                               triton_poi_fused__scaled_mm_clamp_clone_copy_mul_reciprocal_7  1_f8_overhead    0.292         0.022  1527.26
22   1_float8                                  triton_poi_fused__to_copy_clamp_copy_mul_reciprocal_silu_9  1_f8_overhead    0.255         0.020  2148.34
20   1_float8                                                         triton_red_fused_abs_max_mul_silu_7  1_f8_overhead    0.244         0.019  1834.33
23   1_float8                                                        triton_poi_fused__scaled_mm_clone_10  1_f8_overhead    0.146         0.011  1462.59
15   1_float8                                                                  triton_red_fused_abs_max_3  1_f8_overhead    0.127         0.010  1498.30
31   1_float8                                     triton_red_fused__to_copy_abs_add_div_max_mul_pow_sum_1  1_f8_overhead    0.091         0.007  1971.11
33   1_float8                               triton_poi_fused__scaled_mm_clamp_clone_copy_mul_reciprocal_3  1_f8_overhead    0.090         0.007  1388.84
21   1_float8                                                                  triton_red_fused_abs_max_8  1_f8_overhead    0.062         0.005  1487.67
17   1_float8                                       triton_poi_fused__to_copy_clamp_copy_mul_reciprocal_5  1_f8_overhead    0.046         0.003  1833.92
12   1_float8                                  triton_red_fused__to_copy_abs_add_max_mean_mul_pow_rsqrt_0  1_f8_overhead    0.044         0.003  1246.58
16   1_float8                                        triton_per_fused_abs_clamp_copy_max_mul_reciprocal_4  1_f8_overhead    0.010         0.001     0.32
35   1_float8                  triton_per_fused__scaled_mm_abs_clamp_clone_copy_max_mul_reciprocal_silu_5  1_f8_overhead    0.005         0.000     0.32
32   1_float8                       triton_red_fused__scaled_mm_abs_clamp_clone_copy_max_mul_reciprocal_2  1_f8_overhead    0.003         0.000     4.51
13   1_float8                               triton_red_fused__to_copy_abs_clamp_copy_max_mul_reciprocal_1  1_f8_overhead    0.003         0.000     4.61
38   1_float8                                                     triton_red_fused__to_copy_add_mul_sum_8        2_other    0.804         0.061   244.46
40   1_float8                                                                                  aten::add_        2_other    0.567         0.043     None
30   1_float8                                                         triton_red_fused__to_copy_mul_sum_0        2_other    0.562         0.043   236.73
39   1_float8                                             triton_red_fused__to_copy_add_div_mul_pow_sum_9        2_other    0.149         0.011  1377.16
25   1_float8                                                                   triton_poi_fused_clone_12        2_other    0.146         0.011  1532.05
24   1_float8                                         triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_11        2_other    0.115         0.009  1293.20
29   1_float8                                                                                 aten::copy_        2_other    0.060         0.005     None
27   1_float8                                                                                   aten::sum        2_other    0.050         0.004     None
26   1_float8                                                                   triton_poi_fused_clone_13        2_other    0.043         0.003  1333.22
14   1_float8                                                               triton_poi_fused_empty_like_2        2_other    0.002         0.000     0.00
28   1_float8                                                                                 aten::fill_        2_other    0.002         0.000     None

Float8 amax/scale sync approx ratio of total time: 0.000

Summary of time (ms) by kernel category

 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
category                                              
0_gemm        15.625     8.148       0.521       1.918
1_f8_overhead  0.000     2.436         inf       0.000
2_other        3.283     2.498       0.761       1.314
All           18.908    13.082       0.692       1.445

```

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
vkuzo added a commit that referenced this pull request Jun 28, 2024
Summary:

This PR adds an example FFN with the preceding and subsequent norms
to the profile script. I hope for this to speed up debugging of kernel
performance on LLaMa.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 48420c4
Pull Request resolved: #282
@vkuzo
Copy link
Contributor Author

vkuzo commented Jun 28, 2024

@vkuzo has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 0b60496.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants