Description
🚀 Describe the improvement or the new tutorial
The first thing you see when you Google PyTorch performance is this. The recipe is well written, but it's very much out of date today:
https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html
Some concrete things we should fix:
- For fusions we should talk about torch.compile instead of torch.jit.script
- We should mention overhead reduction with CUDA graphs
- We should point to the *-fast series (for example gpt-fast and segment-anything-fast) as places where people can learn more
- For CPU-specific optimization, the most important one is launcher core pinning, so we should either make that the default or explain the point in more detail
- Instead of the current CPU section, we can go deeper into the inductor CPU backend
- The AMP section is fine, but we could expand it to cover quantization
- The DDP section should be moved somewhere else, together with an FSDP performance guide
- GPU sync section is good
- Mention tensor cores and how to enable them and why they're not enabled by default
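
To make the torch.compile, CUDA graphs, and tensor core items concrete, here is a rough sketch of the kind of snippet the updated recipe could show. The helper names `enable_tensor_core_matmul` and `compile_model` are hypothetical and exist only for illustration. The sketch assumes a PyTorch 2.x install. The TF32 and CUDA graph benefits only apply on a supporting NVIDIA GPU (Ampere or newer for TF32).

```python
import torch
import torch.nn as nn


def enable_tensor_core_matmul() -> None:
    # "high" allows float32 matmuls to run on TF32 tensor cores where the GPU
    # supports them. The default, "highest", keeps full float32 precision,
    # which is why the faster path is opt-in rather than on by default.
    torch.set_float32_matmul_precision("high")


def compile_model(model: nn.Module) -> nn.Module:
    # torch.compile takes over the fusion role that torch.jit.script had in the
    # old recipe. mode="reduce-overhead" additionally uses CUDA graphs to cut
    # per-kernel launch overhead. Compilation is lazy: it runs on the first
    # call of the returned module, not here.
    return torch.compile(model, mode="reduce-overhead")
```

Typical usage would be to call `enable_tensor_core_matmul()` once at startup, then wrap the model with `compile_model(model)` and run a few warm-up iterations before timing anything.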
cc @sekyondaMeta @svekars @kit1980 @drisspg who first made me aware of this with an internal note that was important enough to make public
Existing tutorials on this topic
No response
Additional context
No response