From 16a0e0a80d3a14c381b3b04f174e2d8bff1aa61e Mon Sep 17 00:00:00 2001
From: Andrew Bringaze Linux Foundation
+
-Since then, TorchRec has matured significantly, with wide internal adoption across many Meta production recommendation models for training and inference, alongside new features such as: [variable batched embeddings, embedding offloading, zero collision hashing, etc.](https://github.com/pytorch/torchrec/releases?page=1) Furthermore, TorchRec has a presence outside of Meta, such as [in recommendation models at Databricks](https://docs.databricks.com/en/machine-learning/train-recommender-models.html) and in the [Twitter algorithm](https://github.com/twitter/the-algorithm-ml). As a result, standard TorchRec features have been marked as **stable**, with PyTorch style BC guarantees, and can be seen on the [revamped TorchRec documentation](https://pytorch.org/torchrec/).
+*To see a full list of public feature submissions, click [here](https://docs.google.com/spreadsheets/d/1TzGkWuUMF1yTe88adz1dt2mzbIsZLd3PBasy588VWgk/edit?usp=sharing).*
-## FBGEMM
-[FBGEMM is a library that provides high-performance kernels for CPUs and GPUs](https://pytorch.org/FBGEMM/). Since 2018, FBGEMM has supported the efficient execution of Meta-internal and external AI/ML workloads by expanding its scope from [performance-critical kernels for inference on CPUs](https://arxiv.org/abs/2101.05615) to more complex sparse operators for both training and inference – and recently for Generative AI – on CPUs and GPUs.
+## BETA FEATURES
-FBGEMM has been empowering TorchRec through its backend high-performance kernel implementations for recommendation workloads, ranging from embedding bag kernels to jagged tensor operations. Together with TorchRec, we released FBGEMM 1.0, which guarantees the functionality and backward-compatibility of several stable APIs serving its core features with [enhanced documentation](https://pytorch.org/FBGEMM/).
+### [Beta] CuDNN backend for SDPA
-## Performance
+The cuDNN "Fused Flash Attention" backend has landed for *torch.nn.functional.scaled_dot_product_attention*. On NVIDIA H100 GPUs this can provide up to a 75% speed-up over FlashAttention-2. This speedup is enabled by default for all users of SDPA on H100 or newer GPUs.
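+A minimal usage sketch (shapes are illustrative, and it assumes *SDPBackend.CUDNN_ATTENTION* is exposed in your build): on H100-class GPUs the cuDNN backend is already chosen by default, so the backend-selection context manager is only needed to force or exclude it.
+
+```
+import torch
+import torch.nn.functional as F
+from torch.nn.attention import SDPBackend, sdpa_kernel
+
+# (batch, heads, sequence length, head dim) in bf16 on the GPU.
+q, k, v = (torch.randn(8, 16, 4096, 64, device="cuda", dtype=torch.bfloat16)
+           for _ in range(3))
+
+# Restrict SDPA to the cuDNN fused flash-attention backend for this region.
+with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
+    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
+```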
-[DLRM (Deep Learning Recommendation Model)](https://ai.meta.com/blog/dlrm-an-advanced-open-source-deep-learning-recommendation-model/) is the standard neural network architecture for powering recommendations at Meta, with categorical features being processed through embeddings, while continuous (dense) features are processed with a bottom multilayer perceptron. The following diagram depicts the basic architecture of DLRM, with a second order interaction layer between the dense and sparse features and a top MLP for generating the prediction.
+### [Beta] *torch.compile* regional compilation without recompilations
+Regional compilation without recompilations is available via *torch._dynamo.config.inline_inbuilt_nn_modules*, which defaults to True in 2.5+. This option allows users to compile a repeated *nn.Module* (e.g. a transformer layer in an LLM) without recompilations. Compared to compiling the full model, this option can result in smaller compilation latencies, with 1%-5% performance degradation.
+See the [tutorial](https://pytorch.org/tutorials/recipes/regional_compilation.html) for more information.
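+A minimal sketch of the pattern (module and sizes are illustrative): compile the repeated block once and reuse it, rather than compiling the whole model.
+
+```
+import torch
+import torch.nn as nn
+
+# Defaults to True in 2.5+; shown here only for clarity.
+torch._dynamo.config.inline_inbuilt_nn_modules = True
+
+class Block(nn.Module):
+    def __init__(self, dim=128):
+        super().__init__()
+        self.linear = nn.Linear(dim, dim)
+
+    def forward(self, x):
+        return torch.relu(self.linear(x))
+
+class Model(nn.Module):
+    def __init__(self, n_layers=12):
+        super().__init__()
+        self.layers = nn.ModuleList(Block() for _ in range(n_layers))
+
+    def forward(self, x):
+        for layer in self.layers:
+            x = layer(x)
+        return x
+
+model = Model()
+# Compile only the repeated region; every layer reuses the same compiled code.
+for layer in model.layers:
+    layer.compile()
+
+out = model(torch.randn(4, 128))
+```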
-TorchRec provides standardized modules with significant optimizations in fusing embedding lookups. EBC is a traditional PyTorch embedding module implementation containing a collection of `torch.nn.EmbeddingBag` modules. FusedEBC, powered by FBGEMM for high-performance operations on embedding tables, with a fused optimizer and UVM caching/management to alleviate memory constraints, is the optimized version present in sharded TorchRec modules for distributed training and inference. The benchmark below demonstrates the substantial performance improvements of FusedEBC over a traditional PyTorch embedding module implementation (EBC), and the ability of FusedEBC to handle embeddings much larger than available GPU memory via UVM caching.
+### [Beta] TorchInductor CPU backend optimization
+This feature advances Inductor’s CPU backend optimization, including CPP backend code generation and FX fusions with customized CPU kernels. The Inductor CPU backend supports vectorization of common data types and all Inductor IR operations, along with static and symbolic shapes. It is compatible with both Linux and Windows and supports the default Python wrapper, the CPP wrapper, and AOT-Inductor mode.
+Additionally, it extends the max-autotune mode of the GEMM template (prototyped in 2.5), offering further performance gains. The backend supports various FX fusions, lowering to customized kernels such as oneDNN for Linear/Conv operations and SDPA. The Inductor CPU backend consistently achieves performance speedups across three benchmark suites (TorchBench, Hugging Face, and TIMM), outperforming eager mode in 97.5% of the 193 models tested.
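+A small sketch of exercising the CPU backend (the model is illustrative; the CPP wrapper toggle is optional, and the default Python wrapper needs no configuration):
+
+```
+import torch
+
+# Optional: generate a C++ wrapper instead of the default Python wrapper.
+torch._inductor.config.cpp_wrapper = True
+
+model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).eval()
+compiled = torch.compile(model)  # Inductor is the default torch.compile backend
+
+with torch.no_grad():
+    out = compiled(torch.randn(8, 64))
+```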
-## TorchRec Data Types
-TorchRec provides standard [data types](https://pytorch.org/torchrec/datatypes-api-reference.html) and [modules](https://pytorch.org/torchrec/modules-api-reference.html) for easy handling of distributed embeddings. Here is a simple example setting up a collection of embedding tables through TorchRec:
+## PROTOTYPE FEATURES
-```
-import torch
-import torchrec
-from torchrec import EmbeddingBagCollection
-from torchrec import KeyedJaggedTensor
-from torchrec import JaggedTensor
+### [Prototype] FlexAttention
-ebc = torchrec.EmbeddingBagCollection(
- device="cpu",
- tables=[
- torchrec.EmbeddingBagConfig(
- name="product_table",
- embedding_dim=64,
- num_embeddings=4096,
- feature_names=["product"],
- pooling=torchrec.PoolingType.SUM,
- ),
- torchrec.EmbeddingBagConfig(
- name="user_table",
- embedding_dim=64,
- num_embeddings=4096,
- feature_names=["user"],
- pooling=torchrec.PoolingType.SUM,
- )
- ]
-)
+We've introduced a flexible API that enables implementing various attention mechanisms such as Sliding Window, Causal Mask, and PrefixLM with just a few lines of idiomatic PyTorch code. This API leverages torch.compile to generate a fused FlashAttention kernel, which eliminates extra memory allocation and achieves performance comparable to handwritten implementations. Additionally, we automatically generate the backwards pass using PyTorch's autograd machinery. Furthermore, our API can take advantage of sparsity in the attention mask, resulting in significant improvements over standard attention implementations.
-product_jt = JaggedTensor(
- values=torch.tensor([1, 2, 1, 5]), lengths=torch.tensor([3, 1])
-)
-user_jt = JaggedTensor(values=torch.tensor([2, 3, 4, 1]), lengths=torch.tensor([2, 2]))
+For more information and examples, please refer to the [official blog post](https://pytorch.org/blog/flexattention/) and [Attention Gym](https://github.com/pytorch-labs/attention-gym).
-kjt = KeyedJaggedTensor.from_jt_dict({"product": product_jt, "user": user_jt})
-print("Call EmbeddingBagCollection Forward: ", ebc(kjt))
-```
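+A short sketch of the FlexAttention API described above (prototype, so the import path may change; the *score_mod* below adds an illustrative relative-position bias):
+
+```
+import torch
+from torch.nn.attention.flex_attention import flex_attention
+
+def relative_bias(score, b, h, q_idx, kv_idx):
+    # Modify each attention score in plain PyTorch; fused into the kernel.
+    return score + (q_idx - kv_idx)
+
+q, k, v = (torch.randn(2, 8, 1024, 64, device="cuda") for _ in range(3))
+
+# torch.compile fuses the score_mod into a single FlashAttention-style kernel.
+compiled_flex = torch.compile(flex_attention)
+out = compiled_flex(q, k, v, score_mod=relative_bias)
+```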
+### [Prototype] Compiled Autograd
+Compiled Autograd is an extension to the PT2 stack allowing the capture of the entire backward pass. Unlike the backward graph traced by AOT dispatcher, Compiled Autograd tracing is deferred until backward execution time, which makes it impervious to forward pass graph breaks, and allows it to record backward hooks into the graph.
+Please refer to the [tutorial](https://pytorch.org/tutorials/intermediate/compiled_autograd_tutorial.html) for more information.
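+A minimal sketch following the pattern in the linked tutorial (the config flag name is taken from there): the backward graph is captured when *.backward()* runs inside the compiled region.
+
+```
+import torch
+
+# Enable Compiled Autograd so the backward pass is traced at backward time.
+torch._dynamo.config.compiled_autograd = True
+
+model = torch.nn.Linear(10, 10)
+x = torch.randn(16, 10)
+
+@torch.compile
+def train_step(model, x):
+    loss = model(x).sum()
+    loss.backward()  # backward hooks and forward graph breaks are tolerated here
+
+train_step(model, x)
+```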
-## Sharding
-TorchRec provides a planner class that automatically generates an optimized sharding plan across many GPUs. Here we demonstrate generating a sharding plan across two GPUs:
+### [Prototype] Flight Recorder
+Flight recorder is a new debugging tool that helps debug stuck jobs. The tool works by continuously capturing information about collectives as they run. Upon detecting a stuck job, the information can be used to quickly identify misbehaving ranks/machines along with code stack traces.
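+A hedged sketch of turning it on (the environment-variable names follow the linked tutorial and should be treated as assumptions); they must be set before the process group is created:
+
+```
+import os
+
+# Keep a rolling buffer of recent collectives and dump it if a collective times out.
+os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"
+os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "true"
+
+import torch.distributed as dist
+
+dist.init_process_group("nccl")  # flight recorder records this process group's collectives
+```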
-```
-from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
+For more information please refer to the following [tutorial](https://pytorch.org/tutorials/prototype/flight_recorder_tutorial.html).
-planner = EmbeddingShardingPlanner(
- topology=Topology(
- world_size=2,
- compute_device="cuda",
- )
-)
-plan = planner.collective_plan(ebc, [sharder], pg)
+### [Prototype] Max-autotune Support on CPU with GEMM Template
-print(f"Sharding Plan generated: {plan}")
-```
+Max-autotune mode for the Inductor CPU backend in torch.compile profiles multiple implementations of operations at compile time and selects the best-performing one. This is particularly beneficial for GEMM-related operations, using a C++ template-based GEMM implementation as an alternative to the ATen-based approach with oneDNN and MKL libraries. We support FP32, BF16, FP16, and INT8 with epilogue fusions for x86 CPUs. We’ve seen up to a 7% geomean speedup on the dynamo benchmark suites and up to a 20% improvement in next-token latency for LLM inference.
+For more information please refer to the [tutorial](https://pytorch.org/tutorials/prototype/max_autotune_on_CPU_tutorial.html).
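+A minimal sketch (model and shapes are illustrative): in max-autotune mode, Inductor benchmarks the C++ GEMM template against the ATen/oneDNN path at compile time and keeps the faster one.
+
+```
+import torch
+
+model = torch.nn.Linear(1024, 1024).eval()
+compiled = torch.compile(model, mode="max-autotune")
+
+with torch.no_grad():
+    out = compiled(torch.randn(32, 1024))
+```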
-## Model Parallel
+### [Prototype] TorchInductor CPU on Windows
-TorchRec’s main distributed training API is [DistributedModelParallel](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module), which calls the planner to generate a sharding plan (demonstrated above) and shards TorchRec modules according to that plan. We demonstrate applying [DistributedModelParallel](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module) to our EmbeddingBagCollection to shard embeddings for distributed training:
+The Inductor CPU backend in torch.compile now works on Windows. Currently, the MSVC (cl), clang (clang-cl), and Intel (icx-cl) compilers are supported for Inductor on Windows.
+See the [tutorial](https://pytorch.org/tutorials/prototype/inductor_windows_cpu.html) for more details.
-```
-model = torchrec.distributed.DistributedModelParallel(ebc, device=torch.device("cuda"))
-```
+### [Prototype] FP16 support on CPU path for both eager mode and TorchInductor CPP backend
+Float16 is a reduced-precision floating point type commonly used to improve performance in neural network inference and training. As of this release, float16 is supported on the CPU path for both eager mode and TorchInductor.
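+A small sketch (model and shapes are illustrative) of float16 inference on the CPU path, in eager mode and through TorchInductor:
+
+```
+import torch
+
+model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU()).eval().half()
+x = torch.randn(16, 256, dtype=torch.float16)
+
+with torch.no_grad():
+    eager_out = model(x)                    # eager-mode fp16 on CPU
+    compiled_out = torch.compile(model)(x)  # TorchInductor CPP backend path
+```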
-## Inference
-TorchRec provides simple APIs for quantizing and sharding embeddings for a model for distributed inference. The usage is demonstrated below:
+### [Prototype] Autoload Device Extension
+PyTorch now supports autoloading for out-of-tree device extensions, streamlining integration by eliminating the need for manual imports. This feature, enabled through the torch.backends entrypoint, simplifies usage by ensuring seamless extension loading, while allowing users to disable it via an environment variable if needed.
-```
-from torchrec.inference.modules import (
- quantize_inference_model,
- shard_quant_model,
-)
-quant_model = quantize_inference_model(ebc)
-sharded_model, _ = shard_quant_model(
- quant_model, compute_device=device, sharding_device=device
-)
-```
+See the [tutorial](https://pytorch.org/tutorials/prototype/python_extension_autoload.html) for more information.
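+A hedged sketch of how an out-of-tree backend package might opt in (the package and hook names below are hypothetical): it declares an entry point in the *torch.backends* group, which *import torch* discovers and calls; the tutorial also documents an environment variable for opting out.
+
+```
+# setup.py of a hypothetical out-of-tree backend package
+from setuptools import setup
+
+setup(
+    name="torch_foo_backend",  # hypothetical package name
+    entry_points={
+        "torch.backends": [
+            # hypothetical module:function invoked automatically on `import torch`
+            "torch_foo_backend = torch_foo_backend:_autoload",
+        ],
+    },
+)
+```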
+### [Prototype] Enhanced Intel GPU support
+Enhanced Intel GPU support is now available for both Intel® Data Center GPU Max Series and Intel® Client GPUs (Intel® Core™ Ultra processors with built-in Intel® Arc™ graphics, and Intel® Arc™ Graphics for dGPU parts), making it easier to accelerate your machine learning workflows on Intel GPUs in the PyTorch 2.5 release. This release also enables initial support for PyTorch on Windows for Intel® Client GPUs.
-## Conclusion
-TorchRec and FBGEMM are now stable, with optimized features for large-scale recommendation systems.
-For setting up TorchRec and FBGEMM, check out the [getting started guide](https://pytorch.org/torchrec/setup-torchrec.html). \
- \
-We also recommend the comprehensive, end-to-end [tutorial for introducing the features in TorchRec and FBGEMM](https://pytorch.org/tutorials/intermediate/torchrec_intro_tutorial.html#).
\ No newline at end of file
+* Expanded PyTorch hardware backend support matrix to include both Intel Data Center and Client GPUs.
+* Implemented SYCL* kernels to enhance coverage and execution of Aten operators on Intel GPUs, boosting performance in PyTorch eager mode.
+* Enhanced Intel GPU backend of torch.compile to improve inference and training performance for a wide range of deep learning workloads.
+
+These features are available through PyTorch preview and nightly binary PIP wheels. For more information regarding Intel GPU support, please refer to [documentation](https://pytorch.org/docs/main/notes/get_start_xpu.html).
\ No newline at end of file
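+A brief usage sketch (model and sizes are illustrative): once the Intel GPU backend is available, the usual device idioms apply, with "xpu" as the device string.
+
+```
+import torch
+
+device = "xpu" if torch.xpu.is_available() else "cpu"
+
+model = torch.nn.Linear(128, 128).to(device)
+x = torch.randn(32, 128, device=device)
+out = torch.compile(model)(x)  # torch.compile also targets the Intel GPU backend
+```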
diff --git a/assets/images/performance-gains-over-fp32-eager-2.png b/assets/images/performance-gains-over-fp32-eager-2.png
new file mode 100644
index 0000000000000000000000000000000000000000..deadec7e17ec6290cbc511d6718ef11ad6fc9ba7
+| Beta | Prototype |
+| --- | --- |
+| CuDNN backend for SDPA | FlexAttention |
+| torch.compile regional compilation without recompilations | Compiled Autograd |
+| TorchDynamo added support for exception handling & MutableMapping types | Flight Recorder |
+| TorchInductor CPU backend optimization | Max-autotune Support on CPU with GEMM Template |
+|  | TorchInductor on Windows |
+|  | FP16 support on CPU path for both eager mode and TorchInductor CPP backend |
+|  | Autoload Device Extension |
+|  | Enhanced Intel GPU support |
Supporting on-device AI presents unique challenges with diverse hardware, critical power requirements, low/no internet connectivity, and real-time processing needs. These constraints have historically prevented or slowed the creation of scalable and performant on-device AI solutions. We designed ExecuTorch, backed by industry leaders like Meta, Arm, Apple, and Qualcomm, to be highly portable and to provide superior developer productivity without sacrificing performance.
-ExecuTorch was initially introduced to the community at the 2023 PyTorch Conference. With our most recent alpha release, we further expanded ExecuTorch’s capabilities across multiple dimensions. First, we enabled support for the deployment of large language models (LLMs) on various edge devices. Second, with ExecuTorch alpha, we further stabilized the API surface. Lastly, we significantly improved the developer experience by simplifying the installation flow and improving observability and developer productivity via the ExecuTorch SDK. The ExecuTorch alpha release also provides early support for the recently announced Llama 3 8B, along with demonstrations of how to run this model on an iPhone 15 Pro and a Samsung Galaxy S24 mobile phone.
From e32b6a83ac8738ec05a3b1913fac5b09c3aa8e65 Mon Sep 17 00:00:00 2001
From: Andrew Bringaze Linux Foundation