This repository was archived by the owner on Aug 7, 2024. It is now read-only.
Summary: This PR adds an example FFN with the preceding and subsequent norms to the profile script. I hope for this to speed up debugging of kernel performance on LLaMa. Test Plan: Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
Summary: This PR adds an example FFN with the preceding and subsequent norms to the profile script. It also adds a couple of automatic data extraction QOL items:

1. extract GPU time and aggregate it per CPU kernel name
2. attribute the kernel GPU time to gemms, float8 overhead or other
3. approximate the time spent syncing scales/amaxes and display as pct of total time

I hope for this to speed up debugging of kernel performance on various models, as this automates a lot of high level metrics which take more time to get from visualizing the traces.

Example output when testing `norm_ffn_norm` with delayed scaling and compile:

```
Summary of GPU time by CPU kernel

    experiment  kernel  category  time_ms
 0  0_ref     triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_0  2_other  0.061
 1  0_ref     aten::mm  0_gemm  14.691
 2  0_ref     triton_poi_fused_mul_silu_1  2_other  0.304
 3  0_ref     triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_2  2_other  0.083
 4  0_ref     aten::sum  2_other  0.050
 5  0_ref     aten::fill_  2_other  0.002
 6  0_ref     aten::copy_  2_other  0.059
 7  0_ref     triton_red_fused__to_copy_add_div_mul_pow_sum_0  2_other  0.129
 8  0_ref     triton_poi_fused_add_fill_mul_sigmoid_silu_sub_1  2_other  0.520
 9  0_ref     triton_red_fused__to_copy_add_mul_sum_2  2_other  1.363
10  0_ref     triton_red_fused__to_copy_add_div_mul_pow_sum_3  2_other  0.151
11  0_ref     aten::add_  2_other  0.567
12  1_float8  triton_per_fused_cat_copy_max_roll_0  1_f8_overhead  0.009
13  1_float8  triton_poi_fused_copy_1  2_other  0.004
14  1_float8  triton_poi_fused_copy_2  2_other  0.002
15  1_float8  triton_poi_fused_copy_3  2_other  0.004
16  1_float8  triton_poi_fused_copy_4  2_other  0.002
17  1_float8  triton_poi_fused_copy_5  2_other  0.004
18  1_float8  triton_poi_fused_copy_6  2_other  0.002
19  1_float8  triton_red_fused__to_copy_abs_add_clamp_max_mean_mul_pow_rsqrt_0  1_f8_overhead  0.089
20  1_float8  triton_red_fused__to_copy_abs_fill_max_mul_1  1_f8_overhead  0.003
21  1_float8  triton_red_fused_abs_max_2  1_f8_overhead  0.140
22  1_float8  triton_per_fused_abs_fill_max_3  1_f8_overhead  0.010
23  1_float8  triton_poi_fused_reciprocal_4  2_other  0.014
24  1_float8  triton_poi_fused__scaled_mm_clone_5  1_f8_overhead  0.289
25  1_float8  aten::_scaled_mm  0_gemm  8.054
26  1_float8  triton_red_fused_abs_max_mul_silu_6  1_f8_overhead  0.246
27  1_float8  triton_red_fused_abs_max_7  1_f8_overhead  0.061
28  1_float8  triton_poi_fused__to_copy_clamp_mul_silu_8  1_f8_overhead  0.254
29  1_float8  triton_poi_fused__scaled_mm_clone_9  1_f8_overhead  0.149
30  1_float8  triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_10  2_other  0.115
31  1_float8  triton_poi_fused_clone_11  2_other  0.151
32  1_float8  triton_poi_fused_clone_12  2_other  0.089
33  1_float8  aten::sum  2_other  0.049
34  1_float8  aten::fill_  2_other  0.013
35  1_float8  aten::copy_  2_other  0.060
36  1_float8  triton_red_fused__to_copy_mul_sum_0  2_other  0.548
37  1_float8  triton_red_fused__scaled_mm__to_copy_abs_add_div_max_mul_pow_reciprocal_sum_1  1_f8_overhead  0.109
38  1_float8  triton_red_fused_abs_fill_max_2  1_f8_overhead  0.004
39  1_float8  triton_poi_fused__scaled_mm_clone_reciprocal_3  1_f8_overhead  0.049
40  1_float8  triton_poi_fused__scaled_mm_clone_reciprocal_4  1_f8_overhead  0.007
41  1_float8  triton_red_fused_abs_add_fill_max_mul_sigmoid_silu_sub_5  1_f8_overhead  0.316
42  1_float8  triton_per_fused_abs_fill_max_mul_silu_6  1_f8_overhead  0.005
43  1_float8  triton_poi_fused__to_copy_add_clamp_fill_mul_sigmoid_silu_sub_7  1_f8_overhead  0.408
44  1_float8  triton_poi_fused__scaled_mm_clone_reciprocal_8  1_f8_overhead  0.294
45  1_float8  triton_red_fused__to_copy_add_mul_sum_9  2_other  0.810
46  1_float8  triton_red_fused__to_copy_add_div_mul_pow_sum_10  2_other  0.148
47  1_float8  aten::add_  2_other  0.567

Summary of time (ms) by kernel category, across ref and float8

experiment     0_ref   1_float8  f8_div_ref  ref_div_f8
category
0_gemm         14.691  8.054     0.548       1.824
1_f8_overhead  0.000   2.441     inf         0.000
2_other        3.291   2.582     0.785       1.274
All            17.981  13.077    0.727       1.375

Float8 amax/scale sync approx ratio of total time: 0.014
```

Test Plan: Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
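The category column in the output above can be derived from the kernel name alone. The following is a minimal sketch of one way to do that, not the PR's actual implementation: the function name `kernel_name_to_category` and the substring markers are assumptions chosen to match the categories visible in the example output.

```python
def kernel_name_to_category(kernel_name: str) -> str:
    """Map a kernel name to a coarse category (hypothetical heuristic)."""
    # Numeric prefixes keep the categories sorted in printed tables.
    if kernel_name in ("aten::mm", "aten::_scaled_mm"):
        return "0_gemm"
    # Assumed markers of float8-specific work: amax computation (abs, max),
    # casting to float8 (clamp), layout fixes for scaled_mm, and amax
    # history updates (roll).
    float8_markers = ("abs", "max", "clamp", "scaled_mm", "roll")
    if any(marker in kernel_name for marker in float8_markers):
        return "1_f8_overhead"
    return "2_other"


for name in (
    "aten::_scaled_mm",
    "triton_red_fused_abs_max_2",
    "triton_poi_fused_reciprocal_4",
):
    print(name, "->", kernel_name_to_category(name))
```

A substring heuristic like this is brittle: an unrelated kernel whose name happens to contain `max` would be counted as float8 overhead, so the marker list would need to be revisited for other models.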
Summary: This PR adds an example FFN with the preceding and subsequent norms to the profile script. It also adds a couple of automatic data extraction QOL items:

1. extract GPU time and aggregate it per CPU kernel name
2. attribute the kernel GPU time to gemms, float8 overhead or other
3. approximate the time spent syncing scales/amaxes and display as pct of total time

I hope for this to speed up debugging of kernel performance on various models, as this automates a lot of high level metrics which take more time to get from visualizing the traces.

Example output when testing `norm_ffn_norm` with delayed scaling and compile:

bandwidth off

```
Summary of GPU time by CPU kernel

    experiment  kernel  category  time_ms
 0  0_ref     triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_0  2_other  0.061
 1  0_ref     aten::mm  0_gemm  14.691
 2  0_ref     triton_poi_fused_mul_silu_1  2_other  0.304
 3  0_ref     triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_2  2_other  0.083
 4  0_ref     aten::sum  2_other  0.050
 5  0_ref     aten::fill_  2_other  0.002
 6  0_ref     aten::copy_  2_other  0.059
 7  0_ref     triton_red_fused__to_copy_add_div_mul_pow_sum_0  2_other  0.129
 8  0_ref     triton_poi_fused_add_fill_mul_sigmoid_silu_sub_1  2_other  0.520
 9  0_ref     triton_red_fused__to_copy_add_mul_sum_2  2_other  1.363
10  0_ref     triton_red_fused__to_copy_add_div_mul_pow_sum_3  2_other  0.151
11  0_ref     aten::add_  2_other  0.567
12  1_float8  triton_per_fused_cat_copy_max_roll_0  1_f8_overhead  0.009
13  1_float8  triton_poi_fused_copy_1  2_other  0.004
14  1_float8  triton_poi_fused_copy_2  2_other  0.002
15  1_float8  triton_poi_fused_copy_3  2_other  0.004
16  1_float8  triton_poi_fused_copy_4  2_other  0.002
17  1_float8  triton_poi_fused_copy_5  2_other  0.004
18  1_float8  triton_poi_fused_copy_6  2_other  0.002
19  1_float8  triton_red_fused__to_copy_abs_add_clamp_max_mean_mul_pow_rsqrt_0  1_f8_overhead  0.089
20  1_float8  triton_red_fused__to_copy_abs_fill_max_mul_1  1_f8_overhead  0.003
21  1_float8  triton_red_fused_abs_max_2  1_f8_overhead  0.140
22  1_float8  triton_per_fused_abs_fill_max_3  1_f8_overhead  0.010
23  1_float8  triton_poi_fused_reciprocal_4  2_other  0.014
24  1_float8  triton_poi_fused__scaled_mm_clone_5  1_f8_overhead  0.289
25  1_float8  aten::_scaled_mm  0_gemm  8.054
26  1_float8  triton_red_fused_abs_max_mul_silu_6  1_f8_overhead  0.246
27  1_float8  triton_red_fused_abs_max_7  1_f8_overhead  0.061
28  1_float8  triton_poi_fused__to_copy_clamp_mul_silu_8  1_f8_overhead  0.254
29  1_float8  triton_poi_fused__scaled_mm_clone_9  1_f8_overhead  0.149
30  1_float8  triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_10  2_other  0.115
31  1_float8  triton_poi_fused_clone_11  2_other  0.151
32  1_float8  triton_poi_fused_clone_12  2_other  0.089
33  1_float8  aten::sum  2_other  0.049
34  1_float8  aten::fill_  2_other  0.013
35  1_float8  aten::copy_  2_other  0.060
36  1_float8  triton_red_fused__to_copy_mul_sum_0  2_other  0.548
37  1_float8  triton_red_fused__scaled_mm__to_copy_abs_add_div_max_mul_pow_reciprocal_sum_1  1_f8_overhead  0.109
38  1_float8  triton_red_fused_abs_fill_max_2  1_f8_overhead  0.004
39  1_float8  triton_poi_fused__scaled_mm_clone_reciprocal_3  1_f8_overhead  0.049
40  1_float8  triton_poi_fused__scaled_mm_clone_reciprocal_4  1_f8_overhead  0.007
41  1_float8  triton_red_fused_abs_add_fill_max_mul_sigmoid_silu_sub_5  1_f8_overhead  0.316
42  1_float8  triton_per_fused_abs_fill_max_mul_silu_6  1_f8_overhead  0.005
43  1_float8  triton_poi_fused__to_copy_add_clamp_fill_mul_sigmoid_silu_sub_7  1_f8_overhead  0.408
44  1_float8  triton_poi_fused__scaled_mm_clone_reciprocal_8  1_f8_overhead  0.294
45  1_float8  triton_red_fused__to_copy_add_mul_sum_9  2_other  0.810
46  1_float8  triton_red_fused__to_copy_add_div_mul_pow_sum_10  2_other  0.148
47  1_float8  aten::add_  2_other  0.567

Summary of time (ms) by kernel category, across ref and float8

experiment     0_ref   1_float8  f8_div_ref  ref_div_f8
category
0_gemm         14.691  8.054     0.548       1.824
1_f8_overhead  0.000   2.441     inf         0.000
2_other        3.291   2.582     0.785       1.274
All            17.981  13.077    0.727       1.375

Float8 amax/scale sync approx ratio of total time: 0.014
```

bandwidth on

```
    experiment  kernel  category  time_ms  pct_gpu_time  bw_gpbs
 0  0_ref     triton_red_fused_native_layer_norm_0  2_other  0.242  0.021  2085.93
 1  0_ref     aten::mm  0_gemm  10.120  0.877  None
 2  0_ref     aten::sum  2_other  0.121  0.010  None
 3  0_ref     aten::fill_  2_other  0.002  0.000  None
 4  0_ref     aten::copy_  2_other  0.200  0.017  None
 5  0_ref     triton_red_fused_native_layer_norm_native_layer_norm_backward_0  2_other  0.350  0.030  2207.69
 6  0_ref     aten::add_  2_other  0.511  0.044  None
 7  1_float8  triton_per_fused_copy_max_roll_0  1_f8_overhead  0.005  0.001  0.01
 8  1_float8  triton_per_fused_copy_max_roll_1  1_f8_overhead  0.003  0.000  0.01
 9  1_float8  triton_red_fused__to_copy_abs_clamp_max_mul_native_layer_norm_0  1_f8_overhead  0.367  0.048  1083.39
10  1_float8  triton_red_fused_abs_fill_max_native_layer_norm_1  1_f8_overhead  0.004  0.001  5.52
11  1_float8  triton_red_fused_abs_max_2  1_f8_overhead  0.069  0.009  1486.46
12  1_float8  triton_per_fused_abs_fill_max_3  1_f8_overhead  0.002  0.000  0.20
13  1_float8  triton_poi_fused_reciprocal_4  2_other  0.004  0.001  0.00
14  1_float8  triton_poi_fused__scaled_mm_clone_5  1_f8_overhead  0.152  0.020  1460.56
15  1_float8  aten::_scaled_mm  0_gemm  5.213  0.683  None
16  1_float8  triton_poi_fused_clone_6  2_other  0.172  0.023  1488.13
17  1_float8  aten::sum  2_other  0.126  0.017  None
18  1_float8  aten::fill_  2_other  0.006  0.001  None
19  1_float8  aten::copy_  2_other  0.200  0.026  None
20  1_float8  triton_red_fused_abs_max_0  1_f8_overhead  0.129  0.017  1732.38
21  1_float8  triton_per_fused_abs_fill_max_1  1_f8_overhead  0.002  0.000  0.20
22  1_float8  triton_poi_fused__scaled_mm__to_copy_clamp_clone_mul_reciprocal_2  1_f8_overhead  0.310  0.041  1459.47
23  1_float8  triton_poi_fused__scaled_mm__to_copy_clamp_clone_mul_reciprocal_3  1_f8_overhead  0.002  0.000  0.00
24  1_float8  triton_red_fused_native_layer_norm_native_layer_norm_backward_4  2_other  0.352  0.046  2205.86
25  1_float8  aten::add_  2_other  0.510  0.067  None
```

Test Plan: Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
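In the bandwidth-on output, the `pct_gpu_time` values appear to be each kernel's share of the total GPU time within its own experiment. The sketch below recomputes that column for the `0_ref` rows under that assumption; the helper name `add_pct_gpu_time` and the list-of-dicts row layout are hypothetical, and the results match the printed values only up to rounding of `time_ms`.

```python
from collections import defaultdict


def add_pct_gpu_time(rows):
    """Return copies of rows with a pct_gpu_time field added."""
    # Total GPU time per experiment, so each experiment sums to 1.0.
    totals = defaultdict(float)
    for row in rows:
        totals[row["experiment"]] += row["time_ms"]
    return [
        {**row, "pct_gpu_time": row["time_ms"] / totals[row["experiment"]]}
        for row in rows
    ]


# time_ms values copied from the 0_ref rows of the bandwidth-on output.
ref_rows = [
    {"experiment": "0_ref", "kernel": "triton_red_fused_native_layer_norm_0", "time_ms": 0.242},
    {"experiment": "0_ref", "kernel": "aten::mm", "time_ms": 10.120},
    {"experiment": "0_ref", "kernel": "aten::sum", "time_ms": 0.121},
    {"experiment": "0_ref", "kernel": "aten::fill_", "time_ms": 0.002},
    {"experiment": "0_ref", "kernel": "aten::copy_", "time_ms": 0.200},
    {"experiment": "0_ref", "kernel": "triton_red_fused_native_layer_norm_native_layer_norm_backward_0", "time_ms": 0.350},
    {"experiment": "0_ref", "kernel": "aten::add_", "time_ms": 0.511},
]

with_pct = add_pct_gpu_time(ref_rows)
for row in with_pct:
    print(f'{row["kernel"]}: {row["pct_gpu_time"]:.3f}')
```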
y-sq
approved these changes
Jun 25, 2024
Summary: This PR adds an example FFN with the preceding and subsequent norms to the profile script. It also adds a couple of automatic data extraction QOL items:

1. extract GPU time and aggregate it per CPU kernel name
2. attribute the kernel GPU time to gemms, float8 overhead or other
3. approximate the time spent syncing scales/amaxes and display as pct of total time
4. if TORCHINDUCTOR_PROFILE env variable is set, also parses its output for triton kernel memory bandwidth

I hope for this to speed up debugging of kernel performance on various models, as this automates a lot of high level metrics which take more time to get from visualizing the traces.

Example output when testing `norm_ffn_norm` with dynamic scaling and compile, and bandwidth measurements on:

```
Summary of GPU time by CPU kernel

    experiment  kernel  category  time_ms  pct_gpu_time  bw_gpbs
 1  0_ref     aten::mm  0_gemm  15.625  0.826  None
 9  0_ref     triton_red_fused__to_copy_add_mul_sum_2  2_other  1.375  0.073  241.12
11  0_ref     aten::add_  2_other  0.566  0.030  None
 8  0_ref     triton_poi_fused_add_fill_mul_sigmoid_silu_sub_1  2_other  0.520  0.027  2203.88
 2  0_ref     triton_poi_fused_mul_silu_1  2_other  0.302  0.016  2207.31
10  0_ref     triton_red_fused__to_copy_add_div_mul_pow_sum_3  2_other  0.150  0.008  1375.62
 7  0_ref     triton_red_fused__to_copy_add_div_mul_pow_sum_0  2_other  0.122  0.006  1963.47
 3  0_ref     triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_2  2_other  0.084  0.004  1935.35
 6  0_ref     aten::copy_  2_other  0.060  0.003  None
 0  0_ref     triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_0  2_other  0.052  0.003  1861.85
 4  0_ref     aten::sum  2_other  0.051  0.003  None
 5  0_ref     aten::fill_  2_other  0.002  0.000  None
19  1_float8  aten::_scaled_mm  0_gemm  8.148  0.623  None
36  1_float8  triton_poi_fused__to_copy_add_clamp_copy_empty_like_fill_mul_reciprocal_sigmoid_silu_sub_6  1_f8_overhead  0.408  0.031  2220.22
34  1_float8  triton_red_fused_abs_add_fill_max_mul_sigmoid_silu_sub_4  1_f8_overhead  0.314  0.024  2090.00
18  1_float8  triton_poi_fused__scaled_mm_clone_6  1_f8_overhead  0.294  0.022  1439.36
37  1_float8  triton_poi_fused__scaled_mm_clamp_clone_copy_mul_reciprocal_7  1_f8_overhead  0.292  0.022  1527.26
22  1_float8  triton_poi_fused__to_copy_clamp_copy_mul_reciprocal_silu_9  1_f8_overhead  0.255  0.020  2148.34
20  1_float8  triton_red_fused_abs_max_mul_silu_7  1_f8_overhead  0.244  0.019  1834.33
23  1_float8  triton_poi_fused__scaled_mm_clone_10  1_f8_overhead  0.146  0.011  1462.59
15  1_float8  triton_red_fused_abs_max_3  1_f8_overhead  0.127  0.010  1498.30
31  1_float8  triton_red_fused__to_copy_abs_add_div_max_mul_pow_sum_1  1_f8_overhead  0.091  0.007  1971.11
33  1_float8  triton_poi_fused__scaled_mm_clamp_clone_copy_mul_reciprocal_3  1_f8_overhead  0.090  0.007  1388.84
21  1_float8  triton_red_fused_abs_max_8  1_f8_overhead  0.062  0.005  1487.67
17  1_float8  triton_poi_fused__to_copy_clamp_copy_mul_reciprocal_5  1_f8_overhead  0.046  0.003  1833.92
12  1_float8  triton_red_fused__to_copy_abs_add_max_mean_mul_pow_rsqrt_0  1_f8_overhead  0.044  0.003  1246.58
16  1_float8  triton_per_fused_abs_clamp_copy_max_mul_reciprocal_4  1_f8_overhead  0.010  0.001  0.32
35  1_float8  triton_per_fused__scaled_mm_abs_clamp_clone_copy_max_mul_reciprocal_silu_5  1_f8_overhead  0.005  0.000  0.32
32  1_float8  triton_red_fused__scaled_mm_abs_clamp_clone_copy_max_mul_reciprocal_2  1_f8_overhead  0.003  0.000  4.51
13  1_float8  triton_red_fused__to_copy_abs_clamp_copy_max_mul_reciprocal_1  1_f8_overhead  0.003  0.000  4.61
38  1_float8  triton_red_fused__to_copy_add_mul_sum_8  2_other  0.804  0.061  244.46
40  1_float8  aten::add_  2_other  0.567  0.043  None
30  1_float8  triton_red_fused__to_copy_mul_sum_0  2_other  0.562  0.043  236.73
39  1_float8  triton_red_fused__to_copy_add_div_mul_pow_sum_9  2_other  0.149  0.011  1377.16
25  1_float8  triton_poi_fused_clone_12  2_other  0.146  0.011  1532.05
24  1_float8  triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_11  2_other  0.115  0.009  1293.20
29  1_float8  aten::copy_  2_other  0.060  0.005  None
27  1_float8  aten::sum  2_other  0.050  0.004  None
26  1_float8  triton_poi_fused_clone_13  2_other  0.043  0.003  1333.22
14  1_float8  triton_poi_fused_empty_like_2  2_other  0.002  0.000  0.00
28  1_float8  aten::fill_  2_other  0.002  0.000  None

Float8 amax/scale sync approx ratio of total time: 0.000

Summary of time (ms) by kernel category

experiment     0_ref   1_float8  f8_div_ref  ref_div_f8
category
0_gemm         15.625  8.148     0.521       1.918
1_f8_overhead  0.000   2.436     inf         0.000
2_other        3.283   2.498     0.761       1.314
All            18.908  13.082    0.692       1.445
```

Test Plan: Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
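The per-category summary above can be reproduced from per-kernel rows by summing `time_ms` per category and experiment, adding an `All` row, and dividing the two experiment columns by each other. The following is a minimal standard-library sketch of that aggregation, not the script's own code: the function name `summarize_by_category` and the tuple layout are assumptions, and the input uses the category totals from the summary rather than individual kernels to keep it short.

```python
import math
from collections import defaultdict

REF, F8 = "0_ref", "1_float8"


def summarize_by_category(rows):
    """Sum time per category and experiment, then add the two ratios.

    Assumes every row belongs to one of the two experiments REF or F8.
    """
    totals = defaultdict(lambda: {REF: 0.0, F8: 0.0})
    for experiment, category, time_ms in rows:
        totals[category][experiment] += time_ms
        totals["All"][experiment] += time_ms
    summary = {}
    for category, times in totals.items():
        ref, f8 = times[REF], times[F8]
        summary[category] = {
            REF: ref,
            F8: f8,
            # A zero denominator is reported as inf, as in the printed table.
            "f8_div_ref": f8 / ref if ref else math.inf,
            "ref_div_f8": ref / f8 if f8 else math.inf,
        }
    return summary


# (experiment, category, time_ms) totals taken from the summary above.
rows = [
    (REF, "0_gemm", 15.625),
    (F8, "0_gemm", 8.148),
    (F8, "1_f8_overhead", 2.436),
    (REF, "2_other", 3.283),
    (F8, "2_other", 2.498),
]

summary = summarize_by_category(rows)
for category in sorted(summary):
    print(category, {k: round(v, 3) for k, v in summary[category].items()})
```

Reading the `All` row: `ref_div_f8` is the end-to-end GPU-time speedup of float8 over the reference, while the `0_gemm` row shows the gemm-only speedup before float8 overhead is added back.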
@vkuzo has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This pull request has been merged in 0b60496.
Labels
CLA Signed
This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Merged
Stack from ghstack (oldest at bottom):
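Summary:
This PR adds an example FFN with the preceding and subsequent norms to the profile script. It also adds a couple of automatic data extraction QOL items:
1. extract GPU time and aggregate it per CPU kernel name
2. attribute the kernel GPU time to gemms, float8 overhead or other
3. approximate the time spent syncing scales/amaxes and display as pct of total time
4. if TORCHINDUCTOR_PROFILE env variable is set, also parses its output for triton kernel memory bandwidth

I hope for this to speed up debugging of kernel performance on various models, as this automates a lot of high level metrics which take more time to get from visualizing the traces.
Example output when testing `norm_ffn_norm` with dynamic scaling and compile, and bandwidth measurements on, is shown in the final commit message above.
Test Plan: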
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: D59163495