Commit 7fa0b6e

add xpu profiling files and recipes_source/profile_with_itt.rst
1 parent 5c45426 commit 7fa0b6e

File tree

5 files changed: +112 −4 lines changed
_static/img/trace_xpu_img.png (88.3 KB, binary file)
recipes_source/profile_with_itt.rst

Lines changed: 33 additions & 3 deletions
@@ -58,6 +58,10 @@ Launch Intel® VTune™ Profiler

To verify the functionality, you need to start an Intel® VTune™ Profiler instance. Please check the `Intel® VTune™ Profiler User Guide <https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/launch.html>`__ for steps to launch Intel® VTune™ Profiler.

+.. note::
+   Users can also use the web server UI by following the `Intel® VTune™ Profiler Web Server UI Guide <https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2024-1/web-server-ui.html>`__,
+   e.g.: vtune-backend --web-port=8080 --allow-remote-access --enable-server-profiling
+
Once you get the Intel® VTune™ Profiler GUI launched, you should see a user interface as below:

.. figure:: /_static/img/itt_tutorial/vtune_start.png
@@ -66,8 +70,8 @@ Once you get the Intel® VTune™ Profiler GUI launched, you should see a user i

Three sample results are available on the left side navigation bar under the `sample (matrix)` project. If you do not want profiling results to appear in this default sample project, you can create a new project via the `New Project...` button under the blue `Configure Analysis...` button. To start a new profiling, click the blue `Configure Analysis...` button to initiate configuration of the profiling.

-Configure Profiling
-~~~~~~~~~~~~~~~~~~~
+Configure Profiling for CPU
+~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you click the `Configure Analysis...` button, you should see the screen below:

@@ -77,6 +81,16 @@ Once you click the `Configure Analysis...` button, you should see the screen bel

The right side of the window is split into 3 parts: `WHERE` (top left), `WHAT` (bottom left), and `HOW` (right). With `WHERE`, you can assign the machine on which you want to run the profiling. With `WHAT`, you can set the path of the application that you want to profile. To profile a PyTorch script, it is recommended to wrap all manual steps, including activating a Python environment and setting required environment variables, into a bash script, then profile this bash script. In the screenshot above, we wrapped all steps into the `launch.sh` bash script and profiled `bash` with the parameter `<path_of_launch.sh>`. On the right side `HOW`, you can choose whichever analysis type you would like to run. Intel® VTune™ Profiler provides a number of profiling types to choose from. Details can be found in the `Intel® VTune™ Profiler User Guide <https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance.html>`__.

+
+Configure Profiling for XPU
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Pick the `GPU Offload` analysis type instead of `Hotspots`, and follow the same instructions as for CPU to launch the application.
+
+.. figure:: /_static/img/itt_tutorial/vtune_xpu_config.png
+   :width: 100%
+   :align: center
+
+
Read Profiling Result
~~~~~~~~~~~~~~~~~~~~~
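The `launch.sh` wrapper described above can be as simple as the following sketch (the environment setup, variable, and script path are hypothetical placeholders, not part of the original recipe):

```shell
#!/bin/bash
# Activate the Python environment the workload needs (placeholder paths/names)
source /opt/intel/oneapi/setvars.sh   # assumed oneAPI install location
conda activate my_pytorch_env         # hypothetical environment name

# Set any environment variables the workload requires (example only)
export OMP_NUM_THREADS=8

# Run the PyTorch script to be profiled (hypothetical path)
python /path/to/sample.py
```

Intel® VTune™ Profiler then profiles `bash` with this script's path as the parameter, so every step inside the wrapper is captured in one session.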

@@ -101,6 +115,18 @@ As illustrated on the right side navigation bar, brown portions in the timeline

Of course there is a much richer set of profiling features that Intel® VTune™ Profiler provides to help you understand a performance issue. When you understand the root cause of a performance issue, you can get it fixed. More detailed usage instructions are available in the `Intel® VTune™ Profiler User Guide <https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance.html>`__.

+Read XPU Profiling Result
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+After a successful profiling with ITT, you can open the `Platform` tab of the profiling result to see labels in the Intel® VTune™ Profiler timeline.
+
+.. figure:: /_static/img/itt_tutorial/vtune_xpu_timeline.png
+   :width: 100%
+   :align: center
+
+
+The timeline shows the main thread as a `python` thread at the top, with labeled PyTorch operators and customized regions shown in the main thread row. All operators starting with `aten::` are labeled implicitly by the ITT feature in PyTorch. The timeline also shows the GPU Computing Queue, where users can see the different XPU kernels dispatched into it.
+
A short sample code showcasing how to use PyTorch ITT APIs
----------------------------------------------------------

@@ -128,8 +154,12 @@ The topology is formed by two operators, `Conv2d` and `Linear`. Three iterations
        return x

def main():
    m = ITTSample()
+    # uncomment the line below for XPU
+    # m = m.to("xpu")
    x = torch.rand(10, 3, 244, 244)
+    # uncomment the line below for XPU
+    # x = x.to("xpu")
    with torch.autograd.profiler.emit_itt():
        for i in range(3):
            # Labeling a region with pair of range_push and range_pop
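For readers trying the labeling pattern outside the full sample, a minimal self-contained sketch of `emit_itt` with a matched `range_push`/`range_pop` pair looks like this (it uses a stand-in `Linear` model rather than the tutorial's `ITTSample`, and the region name is arbitrary):

```python
import torch

# Stand-in model; the tutorial uses its own ITTSample module
model = torch.nn.Linear(4, 2)
x = torch.rand(1, 4)

# emit_itt() emits ITT annotations for each PyTorch operator; the labels
# become visible only when the script runs under Intel VTune Profiler.
with torch.autograd.profiler.emit_itt():
    for i in range(3):
        # Label a custom region with a matched range_push/range_pop pair
        torch.profiler.itt.range_push("iteration_{}".format(i))
        model(x)
        torch.profiler.itt.range_pop()
```

Outside of a VTune session the ITT calls are no-ops, so the script runs normally either way.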

recipes_source/recipes/profiler_recipe.py

Lines changed: 79 additions & 1 deletion
@@ -70,6 +70,7 @@
# - ``ProfilerActivity.CPU`` - PyTorch operators, TorchScript functions and
#   user-defined code labels (see ``record_function`` below);
# - ``ProfilerActivity.CUDA`` - on-device CUDA kernels;
+# - ``ProfilerActivity.XPU`` - on-device XPU kernels;
# - ``record_shapes`` - whether to record shapes of the operator inputs;
# - ``profile_memory`` - whether to report amount of memory consumed by
#   model's Tensors;
@@ -197,6 +198,46 @@
# Self CPU time total: 23.015ms
# Self CUDA time total: 11.666ms
#
+######################################################################
+# Profiler can also be used to analyze the performance of models executed on XPUs:
+
+model = models.resnet18().xpu()
+inputs = torch.randn(5, 3, 224, 224).xpu()
+
+with profile(activities=[
+        ProfilerActivity.CPU, ProfilerActivity.XPU], record_shapes=True) as prof:
+    with record_function("model_inference"):
+        model(inputs)
+
+print(prof.key_averages().table(sort_by="xpu_time_total", row_limit=10))
+
+######################################################################
+# (Note: the first use of XPU profiling may incur extra overhead.)
+
+######################################################################
+# The resulting table output (omitting some columns):
+#
+# .. code-block:: sh
+#
+#    -------------------------------  ------------  ------------  ------------  ------------  ----------
+#    Name                                 Self XPU    Self XPU %     XPU total  XPU time avg  # of Calls
+#    -------------------------------  ------------  ------------  ------------  ------------  ----------
+#    model_inference                       0.000us         0.00%       2.567ms       2.567ms           1
+#    aten::conv2d                          0.000us         0.00%       1.871ms      93.560us          20
+#    aten::convolution                     0.000us         0.00%       1.871ms      93.560us          20
+#    aten::_convolution                    0.000us         0.00%       1.871ms      93.560us          20
+#    aten::convolution_overrideable        1.871ms        72.89%       1.871ms      93.560us          20
+#    gen_conv                              1.484ms        57.82%       1.484ms      74.216us          20
+#    aten::batch_norm                      0.000us         0.00%     432.640us      21.632us          20
+#    aten::_batch_norm_impl_index          0.000us         0.00%     432.640us      21.632us          20
+#    aten::native_batch_norm             432.640us        16.85%     432.640us      21.632us          20
+#    conv_reorder                        386.880us        15.07%     386.880us       6.448us          60
+#    -------------------------------  ------------  ------------  ------------  ------------  ----------
+#    Self CPU time total: 712.486ms
+#    Self XPU time total: 2.567ms
+
+#

######################################################################
# Note the occurrence of on-device kernels in the output (e.g. ``sgemm_32x32x32_NN``).
@@ -266,6 +307,7 @@
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# Profiling results can be output as a ``.json`` trace file:
+# Tracing CUDA kernels

model = models.resnet18().cuda()
inputs = torch.randn(5, 3, 224, 224).cuda()
@@ -282,6 +324,24 @@
# .. image:: ../../_static/img/trace_img.png
#    :scale: 25 %

+# Tracing XPU kernels
+
+model = models.resnet18().xpu()
+inputs = torch.randn(5, 3, 224, 224).xpu()
+
+with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU]) as prof:
+    model(inputs)
+
+prof.export_chrome_trace("trace.json")
+
+######################################################################
+# You can examine the sequence of profiled operators and XPU kernels
+# in Chrome trace viewer (``chrome://tracing``):
+#
+# .. image:: ../../_static/img/trace_xpu_img.png
+#    :scale: 25 %
+
+
######################################################################
# 6. Examining stack traces
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -384,7 +444,7 @@
# To send the signal to the profiler that the next step has started, call the ``prof.step()`` function.
# The current profiler step is stored in ``prof.step_num``.
#
-# The following example shows how to use all of the concepts above:
+# The following example shows how to use all of the concepts above for CUDA kernels:

def trace_handler(p):
    output = p.key_averages().table(sort_by="self_cuda_time_total", row_limit=10)
@@ -403,6 +463,24 @@ def trace_handler(p):
        model(inputs)
        p.step()

+# The following example shows how to use all of the concepts above for XPU kernels:
+
+def trace_handler(p):
+    output = p.key_averages().table(sort_by="self_xpu_time_total", row_limit=10)
+    print(output)
+    p.export_chrome_trace("/tmp/trace_" + str(p.step_num) + ".json")
+
+with profile(
+    activities=[ProfilerActivity.CPU, ProfilerActivity.XPU],
+    schedule=torch.profiler.schedule(
+        wait=1,
+        warmup=1,
+        active=2),
+    on_trace_ready=trace_handler
+) as p:
+    for idx in range(8):
+        model(inputs)
+        p.step()

######################################################################
# Learn More
