From 16a0e0a80d3a14c381b3b04f174e2d8bff1aa61e Mon Sep 17 00:00:00 2001
From: Andrew Bringaze Linux Foundation
Date: Tue, 22 Oct 2024 11:19:37 -0500
Subject: [PATCH 1/6] Wednesday blog post
---
_posts/2024-10-23-torchrec-fbgemm-1.md | 142 +++++++++++++++++++++++++
1 file changed, 142 insertions(+)
create mode 100644 _posts/2024-10-23-torchrec-fbgemm-1.md
diff --git a/_posts/2024-10-23-torchrec-fbgemm-1.md b/_posts/2024-10-23-torchrec-fbgemm-1.md
new file mode 100644
index 000000000000..f34e514daa00
--- /dev/null
+++ b/_posts/2024-10-23-torchrec-fbgemm-1.md
@@ -0,0 +1,142 @@
+---
+layout: blog_detail
+title: "TorchRec and FBGEMM 1.0 Stable Release"
+author: Paul Zhang, Zain Huda, Sarunya Pumma, Shintaro Iwasaki, Supadchaya Puangpontip, Benson Ma
+---
+
+We are happy to announce the stable release, 1.0, for [TorchRec](https://github.com/pytorch/torchrec) and [FBGEMM](https://github.com/pytorch/FBGEMM). TorchRec is the PyTorch native recommendation systems library, powered by FBGEMM’s (Facebook GEneral Matrix Multiplication) efficient, low-level kernels.
+
+
+## TorchRec
+
+[Initially open sourced in 2022](https://pytorch.org/blog/introducing-torchrec/), [TorchRec](https://github.com/pytorch/torchrec) provides common primitives for creating state-of-the-art personalization models:
+
+* Simple, optimized APIs for distributed training across hundreds of GPUs
+* Advanced sharding techniques for embeddings
+* Modules common in authoring recommendation systems
+* Frictionless path to distributed inference with APIs for quantization and sharding of TorchRec models
+
+Since then, TorchRec has matured significantly, with wide internal adoption across many Meta production recommendation models for training and inference, alongside new features such as: [variable batched embeddings, embedding offloading, zero collision hashing, etc.](https://github.com/pytorch/torchrec/releases?page=1) Furthermore, TorchRec has a presence outside of Meta, such as [in recommendation models at Databricks](https://docs.databricks.com/en/machine-learning/train-recommender-models.html) and in the [Twitter algorithm](https://github.com/twitter/the-algorithm-ml). As a result, standard TorchRec features have been marked as **stable**, with PyTorch style BC guarantees, and can be seen on the [revamped TorchRec documentation](https://pytorch.org/torchrec/).
+
+
+## FBGEMM
+
+[FBGEMM is a library that provides high-performance kernels for CPUs and GPUs](https://pytorch.org/FBGEMM/). Since 2018, FBGEMM has supported the efficient execution of Meta-internal and external AI/ML workloads by expanding its scope from [performance-critical kernels for inference on CPUs](https://arxiv.org/abs/2101.05615) to more complex sparse operators for both training and inference – and recently for Generative AI – on CPUs and GPUs.
+
+FBGEMM has been empowering TorchRec through its backend high-performance kernel implementations for recommendation workloads, ranging from embedding bag kernels to jagged tensor operations. Together with TorchRec, we released FBGEMM 1.0, which guarantees the functionality and backward-compatibility of several stable APIs serving its core features with [enhanced documentation](https://pytorch.org/FBGEMM/).
+
+
+## Performance
+
+[DLRM (Deep Learning Recommendation Model)](https://ai.meta.com/blog/dlrm-an-advanced-open-source-deep-learning-recommendation-model/) is the standard neural network architecture for powering recommendations at Meta, with categorical features being processed through embeddings, while continuous (dense) features are processed with a bottom multilayer perceptron. The following diagram depicts the basic architecture of DLRM, with a second order interaction layer between the dense and sparse features and a top MLP for generating the prediction.
+
+{:style="width:100%"}
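+
+To make the data flow concrete, here is a minimal, self-contained sketch of a DLRM-style forward pass (the `TinyDLRM` module, the sizes, and the simple dot-product interaction below are illustrative only, not the production model):
+
+```
+import torch
+import torch.nn as nn
+
+class TinyDLRM(nn.Module):
+    """DLRM-style toy model: bottom MLP for dense features, embeddings for
+    categorical features, pairwise dot-product interaction, then a top MLP."""
+    def __init__(self, num_dense=13, num_sparse=4, num_embeddings=1000, dim=16):
+        super().__init__()
+        self.bottom_mlp = nn.Sequential(nn.Linear(num_dense, dim), nn.ReLU())
+        self.embeddings = nn.ModuleList(
+            nn.EmbeddingBag(num_embeddings, dim, mode="sum") for _ in range(num_sparse)
+        )
+        num_vectors = 1 + num_sparse  # bottom MLP output plus one vector per embedding table
+        num_pairs = num_vectors * (num_vectors - 1) // 2
+        self.top_mlp = nn.Linear(dim + num_pairs, 1)
+
+    def forward(self, dense, sparse_ids):
+        d = self.bottom_mlp(dense)                                   # (B, dim)
+        vectors = [d] + [emb(ids) for emb, ids in zip(self.embeddings, sparse_ids)]
+        stacked = torch.stack(vectors, dim=1)                        # (B, num_vectors, dim)
+        dots = torch.bmm(stacked, stacked.transpose(1, 2))           # pairwise dot products
+        i, j = torch.triu_indices(stacked.size(1), stacked.size(1), offset=1)
+        pairs = dots[:, i, j]                                        # (B, num_pairs)
+        return torch.sigmoid(self.top_mlp(torch.cat([d, pairs], dim=1)))
+
+model = TinyDLRM()
+dense = torch.randn(8, 13)
+sparse_ids = [torch.randint(0, 1000, (8, 3)) for _ in range(4)]
+print(model(dense, sparse_ids).shape)  # torch.Size([8, 1])
+```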
+
+
+
+TorchRec provides standardized modules with significant optimizations for fusing embedding lookups. EBC is a traditional PyTorch embedding module implementation containing a collection of `torch.nn.EmbeddingBag` modules. FusedEBC, powered by FBGEMM for high-performance operations on embedding tables, with a fused optimizer and UVM caching/management to alleviate memory constraints, is the optimized version present in sharded TorchRec modules for distributed training and inference. The benchmark below demonstrates the substantial performance improvements of FusedEBC over EBC, as well as FusedEBC’s ability to handle embeddings much larger than available GPU memory via UVM caching.
+
+{:style="width:100%"}
+
+
+
+## TorchRec Data Types
+
+TorchRec provides standard [data types](https://pytorch.org/torchrec/datatypes-api-reference.html) and [modules](https://pytorch.org/torchrec/modules-api-reference.html) for easy handling of distributed embeddings. Here is a simple example setting up a collection of embedding tables through TorchRec:
+
+
+```
+import torch
+import torchrec
+from torchrec import JaggedTensor, KeyedJaggedTensor
+
+ebc = torchrec.EmbeddingBagCollection(
+ device="cpu",
+ tables=[
+ torchrec.EmbeddingBagConfig(
+ name="product_table",
+ embedding_dim=64,
+ num_embeddings=4096,
+ feature_names=["product"],
+ pooling=torchrec.PoolingType.SUM,
+ ),
+ torchrec.EmbeddingBagConfig(
+ name="user_table",
+ embedding_dim=64,
+ num_embeddings=4096,
+ feature_names=["user"],
+ pooling=torchrec.PoolingType.SUM,
+ )
+ ]
+)
+
+product_jt = JaggedTensor(
+ values=torch.tensor([1, 2, 1, 5]), lengths=torch.tensor([3, 1])
+)
+user_jt = JaggedTensor(values=torch.tensor([2, 3, 4, 1]), lengths=torch.tensor([2, 2]))
+
+kjt = KeyedJaggedTensor.from_jt_dict({"product": product_jt, "user": user_jt})
+
+print("Call EmbeddingBagCollection Forward: ", ebc(kjt))
+```
+
+
+
+## Sharding
+
+TorchRec provides a planner class that automatically generates an optimized sharding plan across many GPUs. Here we demonstrate generating a sharding plan across two GPUs:
+
+
+```
+from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
+
+planner = EmbeddingShardingPlanner(
+ topology=Topology(
+ world_size=2,
+ compute_device="cuda",
+ )
+)
+
+# "sharder" (e.g. an EmbeddingBagCollectionSharder) and "pg" (the default
+# process group) are assumed to have been created beforehand.
+plan = planner.collective_plan(ebc, [sharder], pg)
+
+print(f"Sharding Plan generated: {plan}")
+```
+
+
+
+## Model Parallel
+
+TorchRec’s main distributed training API is [DistributedModelParallel](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module), which calls the planner to generate a sharding plan (demonstrated above) and shards TorchRec modules according to that plan. Here we apply [DistributedModelParallel](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module) to our EmbeddingBagCollection to shard embeddings for distributed training:
+
+
+```
+model = torchrec.distributed.DistributedModelParallel(ebc, device=torch.device("cuda"))
+```
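+
+Note that DistributedModelParallel expects `torch.distributed` to be initialized and a device to be selected first. Below is a minimal, single-process sketch of that setup (the rendezvous values are placeholders for illustration):
+
+```
+import os
+import torch
+import torch.distributed as dist
+
+# Minimal single-process rendezvous, for illustration only.
+os.environ.setdefault("MASTER_ADDR", "localhost")
+os.environ.setdefault("MASTER_PORT", "29500")
+dist.init_process_group(backend="nccl", rank=0, world_size=1)
+torch.cuda.set_device(0)
+
+model = torchrec.distributed.DistributedModelParallel(ebc, device=torch.device("cuda"))
+# The wrapped model is then used like any nn.Module, e.g. model(kjt) with the
+# KeyedJaggedTensor from the earlier example.
+```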
+
+
+
+## Inference
+
+TorchRec provides simple APIs for quantizing and sharding the embeddings of a model for distributed inference. The usage is demonstrated below:
+
+
+```
+from torchrec.inference.modules import (
+ quantize_inference_model,
+ shard_quant_model,
+)
+# "ebc" is the EmbeddingBagCollection defined earlier; "device" is the target
+# inference device, e.g. torch.device("cuda").
+quant_model = quantize_inference_model(ebc)
+sharded_model, _ = shard_quant_model(
+ quant_model, compute_device=device, sharding_device=device
+)
+```
+
+
+
+## Conclusion
+
+TorchRec and FBGEMM are now stable, with optimized features for large scale recommendation systems.
+
+For setting up TorchRec and FBGEMM, check out the [getting started guide](https://pytorch.org/torchrec/setup-torchrec.html).
+
+We also recommend the comprehensive, end-to-end [tutorial for introducing the features in TorchRec and FBGEMM](https://pytorch.org/tutorials/intermediate/torchrec_intro_tutorial.html#).
\ No newline at end of file
From 2d96e028624296b0508e4b8d9f4c9a50368abe9d Mon Sep 17 00:00:00 2001
From: Andrew Bringaze Linux Foundation
Date: Tue, 22 Oct 2024 12:45:34 -0500
Subject: [PATCH 2/6] test for today's date
---
_posts/2024-10-22-torchrec-fbgemm-1.md | 142 +++++++++++++++++++++++++
1 file changed, 142 insertions(+)
create mode 100644 _posts/2024-10-22-torchrec-fbgemm-1.md
diff --git a/_posts/2024-10-22-torchrec-fbgemm-1.md b/_posts/2024-10-22-torchrec-fbgemm-1.md
new file mode 100644
index 000000000000..f34e514daa00
--- /dev/null
+++ b/_posts/2024-10-22-torchrec-fbgemm-1.md
@@ -0,0 +1,142 @@
+---
+layout: blog_detail
+title: "TorchRec and FBGEMM 1.0 Stable Release"
+author: Paul Zhang, Zain Huda, Sarunya Pumma, Shintaro Iwasaki, Supadchaya Puangpontip, Benson Ma
+---
+
+We are happy to announce the stable release, 1.0, for [TorchRec](https://github.com/pytorch/torchrec) and [FBGEMM](https://github.com/pytorch/FBGEMM). TorchRec is the PyTorch native recommendation systems library, powered by FBGEMM’s (Facebook GEneral Matrix Multiplication) efficient, low-level kernels.
+
+
+## TorchRec
+
+[Initially open sourced in 2022](https://pytorch.org/blog/introducing-torchrec/), [TorchRec](https://github.com/pytorch/torchrec) provides common primitives for creating state-of-the-art personalization models:
+
+* Simple, optimized APIs for distributed training across hundreds of GPUs
+* Advanced sharding techniques for embeddings
+* Modules common in authoring recommendation systems
+* Frictionless path to distributed inference with APIs for quantization and sharding of TorchRec models
+
+Since then, TorchRec has matured significantly, with wide internal adoption across many Meta production recommendation models for training and inference, alongside new features such as: [variable batched embeddings, embedding offloading, zero collision hashing, etc.](https://github.com/pytorch/torchrec/releases?page=1) Furthermore, TorchRec has a presence outside of Meta, such as [in recommendation models at Databricks](https://docs.databricks.com/en/machine-learning/train-recommender-models.html) and in the [Twitter algorithm](https://github.com/twitter/the-algorithm-ml). As a result, standard TorchRec features have been marked as **stable**, with PyTorch style BC guarantees, and can be seen on the [revamped TorchRec documentation](https://pytorch.org/torchrec/).
+
+
+## FBGEMM
+
+[FBGEMM is a library that provides high-performance kernels for CPUs and GPUs](https://pytorch.org/FBGEMM/). Since 2018, FBGEMM has supported the efficient execution of Meta-internal and external AI/ML workloads by expanding its scope from [performance-critical kernels for inference on CPUs](https://arxiv.org/abs/2101.05615) to more complex sparse operators for both training and inference – and recently for Generative AI – on CPUs and GPUs.
+
+FBGEMM has been empowering TorchRec through its backend high-performance kernel implementations for recommendation workloads, ranging from embedding bag kernels to jagged tensor operations. Together with TorchRec, we released FBGEMM 1.0, which guarantees the functionality and backward-compatibility of several stable APIs serving its core features with [enhanced documentation](https://pytorch.org/FBGEMM/).
+
+
+## Performance
+
+[DLRM (Deep Learning Recommendation Model)](https://ai.meta.com/blog/dlrm-an-advanced-open-source-deep-learning-recommendation-model/) is the standard neural network architecture for powering recommendations at Meta, with categorical features being processed through embeddings, while continuous (dense) features are processed with a bottom multilayer perceptron. The following diagram depicts the basic architecture of DLRM, with a second order interaction layer between the dense and sparse features and a top MLP for generating the prediction.
+
+{:style="width:100%"}
+
+
+
+TorchRec provides standardized modules with significant optimizations for fusing embedding lookups. EBC is a traditional PyTorch embedding module implementation containing a collection of `torch.nn.EmbeddingBag` modules. FusedEBC, powered by FBGEMM for high-performance operations on embedding tables, with a fused optimizer and UVM caching/management to alleviate memory constraints, is the optimized version present in sharded TorchRec modules for distributed training and inference. The benchmark below demonstrates the substantial performance improvements of FusedEBC over EBC, as well as FusedEBC’s ability to handle embeddings much larger than available GPU memory via UVM caching.
+
+{:style="width:100%"}
+
+
+
+## TorchRec Data Types
+
+TorchRec provides standard [data types](https://pytorch.org/torchrec/datatypes-api-reference.html) and [modules](https://pytorch.org/torchrec/modules-api-reference.html) for easy handling of distributed embeddings. Here is a simple example setting up a collection of embedding tables through TorchRec:
+
+
+```
+import torch
+import torchrec
+from torchrec import JaggedTensor, KeyedJaggedTensor
+
+ebc = torchrec.EmbeddingBagCollection(
+ device="cpu",
+ tables=[
+ torchrec.EmbeddingBagConfig(
+ name="product_table",
+ embedding_dim=64,
+ num_embeddings=4096,
+ feature_names=["product"],
+ pooling=torchrec.PoolingType.SUM,
+ ),
+ torchrec.EmbeddingBagConfig(
+ name="user_table",
+ embedding_dim=64,
+ num_embeddings=4096,
+ feature_names=["user"],
+ pooling=torchrec.PoolingType.SUM,
+ )
+ ]
+)
+
+product_jt = JaggedTensor(
+ values=torch.tensor([1, 2, 1, 5]), lengths=torch.tensor([3, 1])
+)
+user_jt = JaggedTensor(values=torch.tensor([2, 3, 4, 1]), lengths=torch.tensor([2, 2]))
+
+kjt = KeyedJaggedTensor.from_jt_dict({"product": product_jt, "user": user_jt})
+
+print("Call EmbeddingBagCollection Forward: ", ebc(kjt))
+```
+
+
+
+## Sharding
+
+TorchRec provides a planner class that automatically generates an optimized sharding plan across many GPUs. Here we demonstrate generating a sharding plan across two GPUs:
+
+
+```
+from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
+
+planner = EmbeddingShardingPlanner(
+ topology=Topology(
+ world_size=2,
+ compute_device="cuda",
+ )
+)
+
+# "sharder" (e.g. an EmbeddingBagCollectionSharder) and "pg" (the default
+# process group) are assumed to have been created beforehand.
+plan = planner.collective_plan(ebc, [sharder], pg)
+
+print(f"Sharding Plan generated: {plan}")
+```
+
+
+
+## Model Parallel
+
+TorchRec’s main distributed training API is [DistributedModelParallel](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module), which calls the planner to generate a sharding plan (demonstrated above) and shards TorchRec modules according to that plan. Here we apply [DistributedModelParallel](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module) to our EmbeddingBagCollection to shard embeddings for distributed training:
+
+
+```
+model = torchrec.distributed.DistributedModelParallel(ebc, device=torch.device("cuda"))
+```
+
+
+
+## Inference
+
+TorchRec provides simple APIs for quantizing and sharding the embeddings of a model for distributed inference. The usage is demonstrated below:
+
+
+```
+from torchrec.inference.modules import (
+ quantize_inference_model,
+ shard_quant_model,
+)
+# "ebc" is the EmbeddingBagCollection defined earlier; "device" is the target
+# inference device, e.g. torch.device("cuda").
+quant_model = quantize_inference_model(ebc)
+sharded_model, _ = shard_quant_model(
+ quant_model, compute_device=device, sharding_device=device
+)
+```
+
+
+
+## Conclusion
+
+TorchRec and FBGEMM are now stable, with optimized features for large scale recommendation systems.
+
+For setting up TorchRec and FBGEMM, check out the [getting started guide](https://pytorch.org/torchrec/setup-torchrec.html).
+
+We also recommend the comprehensive, end-to-end [tutorial for introducing the features in TorchRec and FBGEMM](https://pytorch.org/tutorials/intermediate/torchrec_intro_tutorial.html#).
\ No newline at end of file
From ddd5d230e527bc3fcf5927b8a47d05d419704360 Mon Sep 17 00:00:00 2001
From: Andrew Bringaze Linux Foundation
Date: Tue, 22 Oct 2024 16:09:19 -0500
Subject: [PATCH 3/6] executorch paragraph removal and addition of button + new
blog post for Tuesday
---
...024-10-22-intel-gpu-support-pytorch-2-5.md | 145 ++++++++++++++
_posts/2024-10-22-torchrec-fbgemm-1.md | 142 --------------
_posts/2024-10-23-torchrec-fbgemm-1.md | 183 ++++++++++--------
.../performance-gains-over-fp32-eager-2.png | Bin 0 -> 8771 bytes
.../performance-gains-over-fp32-eager.png | Bin 0 -> 7730 bytes
executorch.html | 7 +-
6 files changed, 246 insertions(+), 231 deletions(-)
create mode 100644 _posts/2024-10-22-intel-gpu-support-pytorch-2-5.md
delete mode 100644 _posts/2024-10-22-torchrec-fbgemm-1.md
create mode 100644 assets/images/performance-gains-over-fp32-eager-2.png
create mode 100644 assets/images/performance-gains-over-fp32-eager.png
diff --git a/_posts/2024-10-22-intel-gpu-support-pytorch-2-5.md b/_posts/2024-10-22-intel-gpu-support-pytorch-2-5.md
new file mode 100644
index 000000000000..42d7d01cf3e6
--- /dev/null
+++ b/_posts/2024-10-22-intel-gpu-support-pytorch-2-5.md
@@ -0,0 +1,145 @@
+---
+layout: blog_detail
+title: "Intel GPU Support Now Available in PyTorch 2.5"
+author: PyTorch Team at Intel
+---
+
+Support for Intel GPUs is now available in PyTorch® 2.5, providing improved functionality and performance for Intel GPUs, including [Intel® Arc™ discrete graphics](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/arc.html), [Intel® Core™ Ultra processors](https://www.intel.com/content/www/us/en/products/details/processors/core-ultra.html) with built-in Intel® Arc™ graphics, and [Intel® Data Center GPU Max Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series.html). This integration brings Intel GPUs and the SYCL\* software stack into the official PyTorch stack, ensuring a consistent user experience and enabling more extensive AI application scenarios, particularly in the AI PC domain.
+
+Developers and customers building for and using Intel GPUs will have a better user experience by directly obtaining continuous software support from native PyTorch, unified software distribution, and consistent product release timing.
+
+Furthermore, Intel GPU support provides more choices for users. PyTorch now provides a consistent GPU programming paradigm on both front ends and back ends. Developers can now run and deploy workloads on Intel GPUs with minimal coding effort.
+
+## **Overview of Intel GPU support**
+
+Intel GPU support in PyTorch provides eager mode and graph mode support in the PyTorch built-in front end. Eager mode now has an implementation of commonly used Aten operators with the SYCL programming language. Graph mode (torch.compile) now has an Intel GPU back end enabled to implement optimizations for Intel GPUs and to integrate Triton.
+
+Essential components of Intel GPU support were added to PyTorch, including the runtime, Aten operators, oneDNN, TorchInductor, Triton, and Intel GPU toolchain integration. Meanwhile, quantization and distributed support are being actively developed in preparation for the PyTorch 2.6 release.
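+
+As a rough illustration of the two modes, the following snippet (the model and shapes are placeholders) runs a small module on an Intel GPU in eager mode and then through torch.compile:
+
+```
+import torch
+
+model = torch.nn.Sequential(torch.nn.Linear(128, 256), torch.nn.ReLU()).to("xpu")
+x = torch.randn(32, 128, device="xpu")
+
+eager_out = model(x)                   # eager mode: Aten ops dispatched to SYCL kernels
+compiled_model = torch.compile(model)  # graph mode: TorchInductor with the Intel GPU back end
+compiled_out = compiled_model(x)
+```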
+
+## **Features**
+
+In addition to providing key features for Intel® Client GPUs and Intel® Data Center GPU Max Series for inference and training, PyTorch keeps the same user experience as with other hardware PyTorch supports. If you migrate code from CUDA\*, you can run your existing application code on an Intel GPU with minimal changes: just switch the device name from cuda to xpu. For example:
+
+```
+# CUDA Code
+tensor = torch.tensor([1.0, 2.0]).to("cuda")
+
+# Code for Intel GPU
+tensor = torch.tensor([1.0, 2.0]).to("xpu")
+```
+
+PyTorch 2.5 features with an Intel GPU include:
+
+* Inference and training workflows.
+* Enhanced torch.compile and eager mode functionality (more ops), together with performance improvements, and full runs of the three Dynamo benchmark suites (Hugging Face\*, TIMM\*, and TorchBench\*) in both eager and compile modes.
+* Data types such as FP32, BF16, FP16, and automatic mixed precision (AMP); a short AMP sketch follows this list.
+* Runs on Intel® Client GPUs and Intel® Data Center GPU Max Series.
+* Supports Linux (Ubuntu, SUSE Linux and Red Hat Linux) and Windows 10/11.
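+
+Below is a small sketch of automatic mixed precision on an Intel GPU (assuming `torch.autocast` accepts the `xpu` device type; the model and shapes are placeholders):
+
+```
+import torch
+
+model = torch.nn.Linear(64, 64).to("xpu")
+x = torch.randn(16, 64, device="xpu")
+
+# Automatic mixed precision: run eligible ops in bfloat16 on the xpu device.
+with torch.autocast(device_type="xpu", dtype=torch.bfloat16):
+    y = model(x)
+```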
+
+## **Get Started**
+
+Get a tour of the environment setup, PIP wheel installation, and examples on Intel® Client GPUs and Intel® Data Center GPU Max Series in the [Getting Started Guide](https://pytorch.org/docs/main/notes/get_start_xpu.html). Support for Intel GPUs can be experienced through PyTorch PIP wheel installation using the nightly and preview binary releases.
+
+* Try Intel® Client GPUs through Intel® Arc™ Graphics family (Codename DG2), Intel® Core™ Ultra processor family with Intel® Graphics (Codename Meteor Lake), and Intel® Core™ Ultra mobile processor family with Intel® Graphics (Codename Lunar Lake).
+
+* Try Intel Data Center GPU Max Series through [Intel® Tiber™ AI Cloud](https://cloud.intel.com/).
+
+ 1. To learn how to create a free Standard account, see [Get Started](https://console.cloud.intel.com/docs/guides/get_started.html). Then do the following:
+
+ * Sign in to the [cloud console](https://console.cloud.intel.com/docs/guides/get_started.html).
+
+    * From the [Training](https://console.cloud.intel.com/training) section, open the [PyTorch on Intel® GPUs](https://console.cloud.intel.com/training/detail/7db2a900-e47d-4b70-8968-cefa08432c1d) notebook and click “Launch Jupyter Notebook.”
+
+ * Ensure that the **PyTorch 2.5** kernel is selected for the notebook.
+
+## **Performance**
+
+The performance of Intel GPUs on PyTorch has been continuously optimized to achieve solid results on the three Dynamo benchmark suites (Hugging Face, TIMM, and TorchBench) in both eager and compile modes.
+
+The latest performance data, measured on the PyTorch Dynamo benchmarking suite using a single Intel® Data Center GPU Max Series 1100 card, showcases the significant FP16/BF16 speedup over FP32 in eager mode (Figure 1) and the torch.compile speedup over eager mode (Figure 2). Both inference and training achieved similarly significant improvements.
+
+![FP16/BF16 performance gains over FP32 eager]({{ site.baseurl }}/assets/images/performance-gains-over-fp32-eager.png){:style="width:100%"}
+
+Figure 1: FP16/BF16 Performance Gains Over FP32 Eager
+
+![Torch.compile performance gains over eager mode]({{ site.baseurl }}/assets/images/performance-gains-over-fp32-eager-2.png){:style="width:100%"}
+
+Figure 2: Torch.compile Performance Gains Over Eager Mode
+
+## **Summary**
+
+Intel GPU support in PyTorch 2.5 brings Intel® Client GPUs (Intel® Core™ Ultra processors with built-in Intel® Arc™ graphics and Intel® Arc™ Graphics for dGPU parts) and Intel® Data Center GPU Max Series into the PyTorch ecosystem for AI workload acceleration. In particular, Client GPUs have been added to the list of supported GPUs for AI PC use scenarios on Windows and Linux environments.
+
+We warmly welcome the community to evaluate and provide feedback on these enhancements to [Intel GPU support on PyTorch](https://github.com/pytorch/pytorch?tab=readme-ov-file#intel-gpu-support).
+
+## **Resources**
+
+* [PyTorch Docs: Getting Started on Intel GPU](https://pytorch.org/docs/main/notes/get_start_xpu.html)
+* [Intel® Tiber™ AI Cloud](https://cloud.intel.com/)
+
+## **Acknowledgments**
+
+We want to thank the PyTorch open source community for their technical discussions and insights: [Andrey Talman](https://github.com/atalman), [Alban Desmaison](https://github.com/alband), [Nikita Shulga](https://github.com/malfet), [Eli Uriegas](https://github.com/seemethere), [Jason Ansel](https://github.com/jansel), and [Bin Bao](https://github.com/desertfire).
+
+We also thank collaborators from PyTorch for their professional support and guidance.
+
+## **Performance Configuration**
+
+The configurations in the table were collected with [svr-info](https://github.com/intel/svr-info). Tested by Intel on September 12, 2024.
+
+## Table 1
+
+| Component | Details |
+| :---- | :---- |
+| **Name** | Intel® Max Series GPU 1100 in Intel® Tiber™ Developer Cloud |
+| **Time** | Thu Sep 12 08:21:27 UTC 2024 |
+| **System** | Supermicro SYS-521GE-TNRT |
+| **Baseboard** | Supermicro X13DEG-OA |
+| **Chassis** | Supermicro Other |
+| **CPU Model** | Intel(R) Xeon(R) Platinum 8468V |
+| **Microarchitecture** | SPR\_XCC |
+| **Sockets** | 2 |
+| **Cores per Socket** | 48 |
+| **Hyperthreading** | Enabled |
+| **CPUs** | 192 |
+| **Intel Turbo Boost** | Enabled |
+| **Base Frequency** | 2.4GHz |
+| **All-core Maximum Frequency** | 2.4GHz |
+| **Maximum Frequency** | 2.9GHz |
+| **NUMA Nodes** | 2 |
+| **Prefetchers** | L2 HW: Enabled, L2 Adj.: Enabled, DCU HW: Enabled, DCU IP: Enabled, AMP: Disabled, Homeless: Disabled, LLC: Disabled |
+| **PPINs** | 5e3f862ef7ba9d50, 6c85812edfcc84b1 |
+| **Accelerators** | DLB 2, DSA 2, IAA 2, QAT (on CPU) 2, QAT (on chipset) 0 |
+| **Installed Memory** | 1024GB (16x64GB DDR5 4800 MT/s \[4800 MT/s\]) |
+| **Hugepagesize** | 2048 kB |
+| **Transparent Huge Pages** | madvise |
+| **Automatic NUMA Balancing** | Enabled |
+| **NIC** | 2 x Ethernet Controller X710 for 10GBASE-T, 4 x MT2892 Family \[ConnectX-6 Dx\] |
+| **Disk** | 1 x 894.3G Micron\_7450\_MTFDKBG960TFR |
+| **BIOS** | 1.4a |
+| **Microcode** | 0x2b0004b1 |
+| **OS** | Ubuntu 22.04.2 LTS |
+| **Kernel** | 5.15.0-73-generic |
+| **TDP** | 330W |
+| **Power & Perf Policy** | Normal (6) |
+| **Frequency Governor** | performance |
+| **Frequency Driver** | acpi-cpufreq |
+| **Max C-State** | 9 |
+
+## Table 2
+
+| Component | Details |
+| :---- | :---- |
+| **Single Card** | Intel® Max Series GPU 1100 series on 4th Gen Intel® Xeon® processors of Intel Tiber Developer Cloud |
+| **Workload & version** | Timm ac34701, TorchBench 03cde49, Torchvision d23a6e1, Torchaudio b3f6f51, Transformers 243e186 |
+| **Software Stack** | intel-for-pytorch-gpu-dev 0.5.3, intel-pti-dev 0.9.0, Intel xpu backend for Triton cc981fe |
+| **Framework** | Pytorch 4a3dabd67f8ce63f2fc45f278421cca3cc532cfe |
+| **GPU driver** | agama-ci-devel-803.61 |
+| **GFX FW Version** | PVC2\_1.23374 |
+
+**Notices & Disclaimers**
+
+Performance varies by use, configuration and other factors. Learn more on the Performance Index site. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.
+
+Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
+
+**AI disclaimer:**
+AI features may require software purchase, subscription or enablement by a software or platform provider, or may have specific configuration or compatibility requirements. Details at [www.intel.com/AIPC](http://www.intel.com/AIPC). Results may vary.
\ No newline at end of file
diff --git a/_posts/2024-10-22-torchrec-fbgemm-1.md b/_posts/2024-10-22-torchrec-fbgemm-1.md
deleted file mode 100644
index f34e514daa00..000000000000
--- a/_posts/2024-10-22-torchrec-fbgemm-1.md
+++ /dev/null
@@ -1,142 +0,0 @@
----
-layout: blog_detail
-title: "TorchRec and FBGEMM 1.0 Stable Release"
-author: Paul Zhang, Zain Huda, Sarunya Pumma, Shintaro Iwasaki, Supadchaya Puangpontip, Benson Ma
----
-
-We are happy to announce the stable release, 1.0, for [TorchRec](https://github.com/pytorch/torchrec) and [FBGEMM](https://github.com/pytorch/FBGEMM). TorchRec is the PyTorch native recommendation systems library, powered by FBGEMM’s (Facebook GEneral Matrix Multiplication) efficient, low-level kernels.
-
-
-## TorchRec
-
-[Initially open sourced in 2022](https://pytorch.org/blog/introducing-torchrec/), [TorchRec](https://github.com/pytorch/torchrec) provides common primitives for creating state-of-the-art personalization models:
-
-* Simple, optimized APIs for distributed training across hundreds of GPUs
-* Advanced sharding techniques for embeddings
-* Modules common in authoring recommendation systems
-* Frictionless path to distributed inference with APIs for quantization and sharding of TorchRec models
-
-Since then, TorchRec has matured significantly, with wide internal adoption across many Meta production recommendation models for training and inference, alongside new features such as: [variable batched embeddings, embedding offloading, zero collision hashing, etc.](https://github.com/pytorch/torchrec/releases?page=1) Furthermore, TorchRec has a presence outside of Meta, such as [in recommendation models at Databricks](https://docs.databricks.com/en/machine-learning/train-recommender-models.html) and in the [Twitter algorithm](https://github.com/twitter/the-algorithm-ml). As a result, standard TorchRec features have been marked as **stable**, with PyTorch style BC guarantees, and can be seen on the [revamped TorchRec documentation](https://pytorch.org/torchrec/).
-
-
-## FBGEMM
-
-[FBGEMM is a library that provides high-performance kernels for CPUs and GPUs](https://pytorch.org/FBGEMM/). Since 2018, FBGEMM has supported the efficient execution of Meta-internal and external AI/ML workloads by expanding its scope from [performance-critical kernels for inference on CPUs](https://arxiv.org/abs/2101.05615) to more complex sparse operators for both training and inference – and recently for Generative AI – on CPUs and GPUs.
-
-FBGEMM has been empowering TorchRec through its backend high-performance kernel implementations for recommendation workloads, ranging from embedding bag kernels to jagged tensor operations. Together with TorchRec, we released FBGEMM 1.0, which guarantees the functionality and backward-compatibility of several stable APIs serving its core features with [enhanced documentation](https://pytorch.org/FBGEMM/).
-
-
-## Performance
-
-[DLRM (Deep Learning Recommendation Model)](https://ai.meta.com/blog/dlrm-an-advanced-open-source-deep-learning-recommendation-model/) is the standard neural network architecture for powering recommendations at Meta, with categorical features being processed through embeddings, while continuous (dense) features are processed with a bottom multilayer perceptron. The following diagram depicts the basic architecture of DLRM, with a second order interaction layer between the dense and sparse features and a top MLP for generating the prediction.
-
-{:style="width:100%"}
-
-
-
-TorchRec provides standardized modules with significant optimizations in fusing embedding lookups. EBC is a traditional PyTorch embedding module implementation, containing a collection of `torch.nn.EmbeddingBags.` FusedEBC, powered by FBGEMM for high performance operations on embedding tables with a fused optimizer and UVM caching/management for alleviating memory constraints, is the optimized version present in sharded TorchRec modules for distributed training and inference. The below benchmark demonstrates the vast performance improvements of FusedEBC in comparison to a traditional PyTorch embedding module implementation (EBC) and the ability for FusedEBC to handle much larger embeddings than what is available on GPU memory with UVM caching.
-
-{:style="width:100%"}
-
-
-
-## TorchRec Data Types
-
-TorchRec provides standard [data types](https://pytorch.org/torchrec/datatypes-api-reference.html) and [modules](https://pytorch.org/torchrec/modules-api-reference.html) for easy handling of distributed embeddings. Here is a simple example setting up a collection of embedding tables through TorchRec:
-
-
-```
-from torchrec import EmbeddingBagCollection
-from torchrec import KeyedJaggedTensor
-from torchrec import JaggedTensor
-
-ebc = torchrec.EmbeddingBagCollection(
- device="cpu",
- tables=[
- torchrec.EmbeddingBagConfig(
- name="product_table",
- embedding_dim=64,
- num_embeddings=4096,
- feature_names=["product"],
- pooling=torchrec.PoolingType.SUM,
- ),
- torchrec.EmbeddingBagConfig(
- name="user_table",
- embedding_dim=64,
- num_embeddings=4096,
- feature_names=["user"],
- pooling=torchrec.PoolingType.SUM,
- )
- ]
-)
-
-product_jt = JaggedTensor(
- values=torch.tensor([1, 2, 1, 5]), lengths=torch.tensor([3, 1])
-)
-user_jt = JaggedTensor(values=torch.tensor([2, 3, 4, 1]), lengths=torch.tensor([2, 2]))
-
-kjt = KeyedJaggedTensor.from_jt_dict({"product": product_jt, "user": user_jt})
-
-print("Call EmbeddingBagCollection Forward: ", ebc(kjt))
-```
-
-
-
-## Sharding
-
-TorchRec provides a planner class that automatically generates an optimized sharding plan across many GPUs. Here we demonstrate generating a sharding plan across two GPUs:
-
-
-```
-from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
-
-planner = EmbeddingShardingPlanner(
- topology=Topology(
- world_size=2,
- compute_device="cuda",
- )
-)
-
-plan = planner.collective_plan(ebc, [sharder], pg)
-
-print(f"Sharding Plan generated: {plan}")
-```
-
-
-
-## Model Parallel
-
-TorchRec’s main distributed training API is [DistributedModelParallel](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module), which calls the planner to generate a sharding plan (demonstrated above) and shards TorchRec modules according to that plan. We demonstrate using [DistributedModelParallel](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module) to our EmbeddingBagCollection for sharding embeddings distributed training:
-
-
-```
-model = torchrec.distributed.DistributedModelParallel(ebc, device=torch.device("cuda"))
-```
-
-
-
-## Inference
-
-TorchRec provides simple APIs for quantizing and sharding embeddings for a model for distributed inference. The usage is demonstrated below:
-
-
-```
-from torchrec.inference.modules import (
- quantize_inference_model,
- shard_quant_model,
-)
-quant_model = quantize_inference_model(ebc)
-sharded_model, _ = shard_quant_model(
- quant_model, compute_device=device, sharding_device=device
-)
-```
-
-
-
-## Conclusion
-
-TorchRec and FBGEMM are now stable, with optimized features for large scale recommendation systems.
-
-For setting up TorchRec and FBGEMM, check out the [getting started guide](https://pytorch.org/torchrec/setup-torchrec.html). \
- \
-We also recommend the comprehensive, end-to-end [tutorial for introducing the features in TorchRec and FBGEMM](https://pytorch.org/tutorials/intermediate/torchrec_intro_tutorial.html#).
\ No newline at end of file
diff --git a/_posts/2024-10-23-torchrec-fbgemm-1.md b/_posts/2024-10-23-torchrec-fbgemm-1.md
index f34e514daa00..ba33b7151872 100644
--- a/_posts/2024-10-23-torchrec-fbgemm-1.md
+++ b/_posts/2024-10-23-torchrec-fbgemm-1.md
@@ -1,142 +1,155 @@
---
layout: blog_detail
-title: "TorchRec and FBGEMM 1.0 Stable Release"
-author: Paul Zhang, Zain Huda, Sarunya Pumma, Shintaro Iwasaki, Supadchaya Puangpontip, Benson Ma
+title: "PyTorch 2.5 Release Blog"
---
-We are happy to announce the stable release, 1.0, for [TorchRec](https://github.com/pytorch/torchrec) and [FBGEMM](https://github.com/pytorch/FBGEMM). TorchRec is the PyTorch native recommendation systems library, powered by FBGEMM’s (Facebook GEneral Matrix Multiplication) efficient, low-level kernels.
+We are excited to announce the release of PyTorch® 2.5 ([release note](https://github.com/pytorch/pytorch/releases/tag/v2.5.0))! This release features a new CuDNN backend for SDPA, enabling speedups by default for users of SDPA on H100s or newer GPUs. In addition, regional compilation of torch.compile offers a way to reduce the cold start-up time of torch.compile by allowing users to compile a repeated nn.Module (e.g. a transformer layer in an LLM) without recompilations. Finally, the TorchInductor CPP backend offers solid performance speedups with numerous enhancements like FP16 support, a CPP wrapper, AOT-Inductor mode, and max-autotune mode.
+This release is composed of 4095 commits from 504 contributors since PyTorch 2.4. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.5. More information about how to get started with the PyTorch 2-series can be found at our [Getting Started](https://pytorch.org/get-started/pytorch-2.0/) page.
-## TorchRec
+Also, please check out our new ecosystem project releases with [TorchRec](https://github.com/pytorch/torchrec) and [TorchFix](https://github.com/pytorch-labs/torchfix/releases/tag/v0.6.0).
-[Initially open sourced in 2022](https://pytorch.org/blog/introducing-torchrec/), [TorchRec](https://github.com/pytorch/torchrec) provides common primitives for creating state-of-the-art personalization models:
-* Simple, optimized APIs for distributed training across hundreds of GPUs
-* Advanced sharding techniques for embeddings
-* Modules common in authoring recommendation systems
-* Frictionless path to distributed inference with APIs for quantization and sharding of TorchRec models
+
+| Beta | Prototype |
+| :---- | :---- |
+| CuDNN backend for SDPA | FlexAttention |
+| torch.compile regional compilation without recompilations | Compiled Autograd |
+| TorchDynamo added support for exception handling & MutableMapping types | Flight Recorder |
+| TorchInductor CPU backend optimization | Max-autotune Support on CPU with GEMM Template |
+|  | TorchInductor on Windows |
+|  | FP16 support on CPU path for both eager mode and TorchInductor CPP backend |
+|  | Autoload Device Extension |
+|  | Enhanced Intel GPU support |
+
-Since then, TorchRec has matured significantly, with wide internal adoption across many Meta production recommendation models for training and inference, alongside new features such as: [variable batched embeddings, embedding offloading, zero collision hashing, etc.](https://github.com/pytorch/torchrec/releases?page=1) Furthermore, TorchRec has a presence outside of Meta, such as [in recommendation models at Databricks](https://docs.databricks.com/en/machine-learning/train-recommender-models.html) and in the [Twitter algorithm](https://github.com/twitter/the-algorithm-ml). As a result, standard TorchRec features have been marked as **stable**, with PyTorch style BC guarantees, and can be seen on the [revamped TorchRec documentation](https://pytorch.org/torchrec/).
+*To see a full list of public feature submissions click [here](https://docs.google.com/spreadsheets/d/1TzGkWuUMF1yTe88adz1dt2mzbIsZLd3PBasy588VWgk/edit?usp=sharing).*
-## FBGEMM
-[FBGEMM is a library that provides high-performance kernels for CPUs and GPUs](https://pytorch.org/FBGEMM/). Since 2018, FBGEMM has supported the efficient execution of Meta-internal and external AI/ML workloads by expanding its scope from [performance-critical kernels for inference on CPUs](https://arxiv.org/abs/2101.05615) to more complex sparse operators for both training and inference – and recently for Generative AI – on CPUs and GPUs.
+## BETA FEATURES
-FBGEMM has been empowering TorchRec through its backend high-performance kernel implementations for recommendation workloads, ranging from embedding bag kernels to jagged tensor operations. Together with TorchRec, we released FBGEMM 1.0, which guarantees the functionality and backward-compatibility of several stable APIs serving its core features with [enhanced documentation](https://pytorch.org/FBGEMM/).
+### [Beta] CuDNN backend for SDPA
-## Performance
+The cuDNN "Fused Flash Attention" backend was landed for *torch.nn.functional.scaled_dot_product_attention*. On NVIDIA H100 GPUs this can provide up to 75% speed-up over FlashAttentionV2. This speedup is enabled by default for all users of SDPA on H100 or newer GPUs.
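+
+A minimal sketch of calling SDPA (the tensor shapes are illustrative; on H100 or newer GPUs the cuDNN backend is selected by default):
+
+```
+import torch
+import torch.nn.functional as F
+
+# (batch, heads, sequence, head_dim) tensors in bf16 on the GPU.
+q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.bfloat16)
+k = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.bfloat16)
+v = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.bfloat16)
+
+out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
+```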
-[DLRM (Deep Learning Recommendation Model)](https://ai.meta.com/blog/dlrm-an-advanced-open-source-deep-learning-recommendation-model/) is the standard neural network architecture for powering recommendations at Meta, with categorical features being processed through embeddings, while continuous (dense) features are processed with a bottom multilayer perceptron. The following diagram depicts the basic architecture of DLRM, with a second order interaction layer between the dense and sparse features and a top MLP for generating the prediction.
-{:style="width:100%"}
+### [Beta] *torch.compile* regional compilation without recompilations
+Regional compilation without recompilations is available via *torch._dynamo.config.inline_inbuilt_nn_modules*, which defaults to True in 2.5+. This option allows users to compile a repeated *nn.Module* (e.g. a transformer layer in an LLM) without recompilations. Compared to compiling the full model, this option can result in smaller compilation latencies, with 1%-5% performance degradation.
+See the [tutorial](https://pytorch.org/tutorials/recipes/regional_compilation.html) for more information.
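+
+A rough sketch of the idea (the toy `Block` module below is hypothetical): compile each repeated layer rather than the whole model, so compiled code is reused across instances instead of triggering a fresh compilation per layer.
+
+```
+import torch
+import torch._dynamo
+import torch.nn as nn
+
+torch._dynamo.config.inline_inbuilt_nn_modules = True  # default in 2.5+
+
+class Block(nn.Module):
+    def __init__(self, dim=256):
+        super().__init__()
+        self.linear = nn.Linear(dim, dim)
+
+    def forward(self, x):
+        return torch.relu(self.linear(x))
+
+# Compile the repeated block instead of the full model.
+layers = [torch.compile(Block()) for _ in range(8)]
+
+x = torch.randn(4, 256)
+for layer in layers:
+    x = layer(x)
+```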
-TorchRec provides standardized modules with significant optimizations in fusing embedding lookups. EBC is a traditional PyTorch embedding module implementation, containing a collection of `torch.nn.EmbeddingBags.` FusedEBC, powered by FBGEMM for high performance operations on embedding tables with a fused optimizer and UVM caching/management for alleviating memory constraints, is the optimized version present in sharded TorchRec modules for distributed training and inference. The below benchmark demonstrates the vast performance improvements of FusedEBC in comparison to a traditional PyTorch embedding module implementation (EBC) and the ability for FusedEBC to handle much larger embeddings than what is available on GPU memory with UVM caching.
-{:style="width:100%"}
+### [Beta] TorchInductor CPU backend optimization
+This feature advances Inductor’s CPU backend optimization, including CPP backend code generation and FX fusions with customized CPU kernels. The Inductor CPU backend supports vectorization of common data types and all Inductor IR operations, along with static and symbolic shapes. It is compatible with both Linux and Windows OS and supports the default Python wrapper, the CPP wrapper, and AOT-Inductor mode.
+Additionally, it extends the max-autotune mode of the GEMM template (prototyped in 2.5), offering further performance gains. The backend supports various FX fusions, lowering to customized kernels such as oneDNN for Linear/Conv operations and SDPA. The Inductor CPU backend consistently achieves performance speedups across three benchmark suites (TorchBench, Hugging Face, and TIMM), outperforming eager mode in 97.5% of the 193 models tested.
-## TorchRec Data Types
-TorchRec provides standard [data types](https://pytorch.org/torchrec/datatypes-api-reference.html) and [modules](https://pytorch.org/torchrec/modules-api-reference.html) for easy handling of distributed embeddings. Here is a simple example setting up a collection of embedding tables through TorchRec:
+## PROTOTYPE FEATURES
-```
-from torchrec import EmbeddingBagCollection
-from torchrec import KeyedJaggedTensor
-from torchrec import JaggedTensor
+### [Prototype] FlexAttention
-ebc = torchrec.EmbeddingBagCollection(
- device="cpu",
- tables=[
- torchrec.EmbeddingBagConfig(
- name="product_table",
- embedding_dim=64,
- num_embeddings=4096,
- feature_names=["product"],
- pooling=torchrec.PoolingType.SUM,
- ),
- torchrec.EmbeddingBagConfig(
- name="user_table",
- embedding_dim=64,
- num_embeddings=4096,
- feature_names=["user"],
- pooling=torchrec.PoolingType.SUM,
- )
- ]
-)
+We've introduced a flexible API that enables implementing various attention mechanisms such as Sliding Window, Causal Mask, and PrefixLM with just a few lines of idiomatic PyTorch code. This API leverages torch.compile to generate a fused FlashAttention kernel, which eliminates extra memory allocation and achieves performance comparable to handwritten implementations. Additionally, we automatically generate the backwards pass using PyTorch's autograd machinery. Furthermore, our API can take advantage of sparsity in the attention mask, resulting in significant improvements over standard attention implementations.
-product_jt = JaggedTensor(
- values=torch.tensor([1, 2, 1, 5]), lengths=torch.tensor([3, 1])
-)
-user_jt = JaggedTensor(values=torch.tensor([2, 3, 4, 1]), lengths=torch.tensor([2, 2]))
+For more information and examples, please refer to the [official blog post](https://pytorch.org/blog/flexattention/) and [Attention Gym](https://github.com/pytorch-labs/attention-gym).
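+
+As a small sketch of the API (shapes are illustrative), a causal mask can be expressed as a score_mod and passed to flex_attention:
+
+```
+import torch
+from torch.nn.attention.flex_attention import flex_attention
+
+def causal(score, b, h, q_idx, kv_idx):
+    # Mask out positions where the key index is after the query index.
+    return torch.where(q_idx >= kv_idx, score, -float("inf"))
+
+q = torch.randn(2, 4, 128, 64, device="cuda")
+k = torch.randn(2, 4, 128, 64, device="cuda")
+v = torch.randn(2, 4, 128, 64, device="cuda")
+
+out = flex_attention(q, k, v, score_mod=causal)
+```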
-kjt = KeyedJaggedTensor.from_jt_dict({"product": product_jt, "user": user_jt})
-print("Call EmbeddingBagCollection Forward: ", ebc(kjt))
-```
+### [Prototype] Compiled Autograd
+Compiled Autograd is an extension to the PT2 stack allowing the capture of the entire backward pass. Unlike the backward graph traced by AOT dispatcher, Compiled Autograd tracing is deferred until backward execution time, which makes it impervious to forward pass graph breaks, and allows it to record backward hooks into the graph.
+Please refer to the [tutorial](https://pytorch.org/tutorials/intermediate/compiled_autograd_tutorial.html) for more information.
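+
+A minimal sketch following the tutorial's pattern (the toy model is illustrative):
+
+```
+import torch
+import torch._dynamo
+
+torch._dynamo.config.compiled_autograd = True
+
+model = torch.nn.Linear(8, 8)
+
+@torch.compile
+def train_step(x):
+    loss = model(x).sum()
+    loss.backward()  # the backward pass is captured by Compiled Autograd
+    return loss
+
+train_step(torch.randn(4, 8))
+```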
-## Sharding
-TorchRec provides a planner class that automatically generates an optimized sharding plan across many GPUs. Here we demonstrate generating a sharding plan across two GPUs:
+### [Prototype] Flight Recorder
+Flight recorder is a new debugging tool that helps debug stuck jobs. The tool works by continuously capturing information about collectives as they run. Upon detecting a stuck job, the information can be used to quickly identify misbehaving ranks/machines along with code stack traces.
-```
-from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
+For more information please refer to the following [tutorial](https://pytorch.org/tutorials/prototype/flight_recorder_tutorial.html).
-planner = EmbeddingShardingPlanner(
- topology=Topology(
- world_size=2,
- compute_device="cuda",
- )
-)
-plan = planner.collective_plan(ebc, [sharder], pg)
+### [Prototype] Max-autotune Support on CPU with GEMM Template
-print(f"Sharding Plan generated: {plan}")
-```
+Max-autotune mode for the Inductor CPU backend in torch.compile profiles multiple implementations of operations at compile time and selects the best-performing one. This is particularly beneficial for GEMM-related operations, using a C++ template-based GEMM implementation as an alternative to the ATen-based approach with oneDNN and MKL libraries. We support FP32, BF16, FP16, and INT8 with epilogue fusions for x86 CPUs. We’ve seen up to 7% geomean speedup on the dynamo benchmark suites and up to a 20% improvement in next-token latency for LLM inference.
+For more information please refer to the [tutorial](https://pytorch.org/tutorials/prototype/max_autotune_on_CPU_tutorial.html).
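+
+A short sketch of enabling max-autotune through torch.compile on CPU (toy model for illustration):
+
+```
+import torch
+
+model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU()).eval()
+x = torch.randn(32, 256)
+
+# mode="max-autotune" profiles candidate implementations (e.g. the C++ GEMM
+# template vs. the ATen/oneDNN path) at compile time and picks the fastest.
+compiled = torch.compile(model, mode="max-autotune")
+with torch.no_grad():
+    y = compiled(x)
+```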
-## Model Parallel
+### [Prototype] TorchInductor CPU on Windows
-TorchRec’s main distributed training API is [DistributedModelParallel](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module), which calls the planner to generate a sharding plan (demonstrated above) and shards TorchRec modules according to that plan. We demonstrate using [DistributedModelParallel](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module) to our EmbeddingBagCollection for sharding embeddings distributed training:
+The Inductor CPU backend in torch.compile now works on Windows. We currently support MSVC (cl), clang (clang-cl), and the Intel compiler (icx-cl) for Inductor on Windows.
+See the [tutorial](https://pytorch.org/tutorials/prototype/inductor_windows_cpu.html) for more details.
-```
-model = torchrec.distributed.DistributedModelParallel(ebc, device=torch.device("cuda"))
-```
+### [Prototype] FP16 support on CPU path for both eager mode and TorchInductor CPP backend
+Float16 is a commonly used reduced-precision floating point type for improving performance in neural network inference and training. As of this release, float16 is supported on the CPU path for both eager mode and TorchInductor.
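+
+A minimal sketch of float16 inference on the CPU path, in eager mode and through the TorchInductor CPP backend (toy model for illustration):
+
+```
+import torch
+
+model = torch.nn.Linear(64, 64).eval().to(torch.float16)
+x = torch.randn(8, 64, dtype=torch.float16)
+
+with torch.no_grad():
+    eager_out = model(x)                    # eager mode float16 on CPU
+    compiled_out = torch.compile(model)(x)  # TorchInductor CPP backend
+```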
-## Inference
-TorchRec provides simple APIs for quantizing and sharding embeddings for a model for distributed inference. The usage is demonstrated below:
+### [Prototype] Autoload Device Extension
+PyTorch now supports autoloading for out-of-tree device extensions, streamlining integration by eliminating the need for manual imports. This feature, enabled through the torch.backends entrypoint, simplifies usage by ensuring seamless extension loading, while allowing users to disable it via an environment variable if needed.
-```
-from torchrec.inference.modules import (
- quantize_inference_model,
- shard_quant_model,
-)
-quant_model = quantize_inference_model(ebc)
-sharded_model, _ = shard_quant_model(
- quant_model, compute_device=device, sharding_device=device
-)
-```
+See the [tutorial](https://pytorch.org/tutorials/prototype/python_extension_autoload.html) for more information.
+### [Prototype] Enhanced Intel GPU support
+Enhanced Intel GPU support is now available for both Intel® Data Center GPU Max Series and Intel® Client GPUs (Intel® Core™ Ultra processors with built-in Intel® Arc™ graphics and Intel® Arc™ Graphics for dGPU parts), making it easier to accelerate your machine learning workflows on Intel GPUs in the PyTorch 2.5 release. We also enabled initial support for PyTorch on Windows for Intel® Client GPUs in this release.
-## Conclusion
-TorchRec and FBGEMM are now stable, with optimized features for large scale recommendation systems.
-For setting up TorchRec and FBGEMM, check out the [getting started guide](https://pytorch.org/torchrec/setup-torchrec.html). \
- \
-We also recommend the comprehensive, end-to-end [tutorial for introducing the features in TorchRec and FBGEMM](https://pytorch.org/tutorials/intermediate/torchrec_intro_tutorial.html#).
\ No newline at end of file
+* Expanded PyTorch hardware backend support matrix to include both Intel Data Center and Client GPUs.
+* The implementation of SYCL* kernels to enhance coverage and execution of Aten operators on Intel GPUs to boost performance in PyTorch eager mode.
+* Enhanced Intel GPU backend of torch.compile to improve inference and training performance for a wide range of deep learning workloads.
+
+These features are available through PyTorch preview and nightly binary PIP wheels. For more information regarding Intel GPU support, please refer to [documentation](https://pytorch.org/docs/main/notes/get_start_xpu.html).
\ No newline at end of file
diff --git a/assets/images/performance-gains-over-fp32-eager-2.png b/assets/images/performance-gains-over-fp32-eager-2.png
new file mode 100644
index 0000000000000000000000000000000000000000..deadec7e17ec6290cbc511d6718ef11ad6fc9ba7
GIT binary patch (binary image data omitted)
zB%@9X!1X6>ulQ-Q!>e5{?~;k(n?W+i?FV&4*Ih&nk|SmM(>7?PUXy-Qz1P2c27de9
zhppOMWVoWgD-twKJ@4&0Q6p8q()teKWx~^4{I)w`*!G=-EyAzpJ8SPU3;f(k@!vG!
z+NG}DrrR4c9y_Rs}l
z@?G;jYwSIrd3U4UxeLf`3s3u{TsZ_UmuOi+R+$yQ1q_o~kg=WgR6=(V6*xwj7oryrr|4m|ATY{NI$GebP{%C86$pb2FFAoqN#>Y5}&CSdZ%TxreRa^wk
z*==Fm_E)uJOB+b7P@L!+@SLvSw_gsxROmrkoKqFpyzcwo6kx8&j6JE
z?4CDyojeaQ`={y7&&0hI^B`u9vo=3x@YG`X`klpWcHX%#(nCWH_kEE3Zo0PFUZNV@
zTJf{w*-#V1bHQ)}!0Tr}8=EPvYO(l_CwNX;tVU`frEthdT@53CnL2BK;$tQBnm_}v
z85qtU8XT(t3B13fQ}J@(qZr0_8r|bCwBf=^2mus!4?d~L@eoJRmTArIRyVC{0UZfm
zz*!9shdzcpr<0nn2S_$CmgyiIIShB6^&$2arWWk1+RKDVm@bw6BF+Za_R{>ESqV=8
zVTjE7nK~X&C~dLG@(>9LQSGa7Z8#?VYIN(8V=Z_>f{*RT(_g)Vzmiw@JaS?QZd4d4
zN9}3u0Z}v#HO`+RApd4JPWOJf8+Jy!r@J5e#R{)x_X{`!BS}k^vMUG~|D~P&vFfU2
zh~u>`(nhfdWYkCxqGDa9b=j
zPa^dX;vw{3?+TJV-Xh0Rj5#ib=676?M-X>=3Fhg>gW33;!CP1KS68!hU*C|(Qtx7b
zzx>0+zOQfRlDlT!T^-cm%o*toKG
z*UjN;mvgfmmV?nrHh`=XD{Sh2CkUFa2T~tND04%d)8lJ;-k1G63ik|!TjvV67
zC~<-+i0K}2#q;@QI2m7Wa0fXO#ucp*k>|GPxEwVJ5g%w64RT~9=>3I*9DV%>715>tp{e211B{4A=pWi&^W?!kiv3HY
z^DN|M`R6bZX+b!d{o;R2L!|w+IFT(8xW6=>f7udI{#TSUDWI9|TNKT(QYbgWMUfz{
zSqSqD3x$74DvP28o8GG{Cg8Wyh%~D;=K_SghIrP9
zw1LfTnWZ{^ezeyg?Oat!2|98qzf0H{&B(60nq4+o`RMn_yij*`8)2PwUKDyD%G-vZ
zf6z9bbRF@n{#x|uiG2PYPD1-KG3G?y8l-z_BE7%!eP6fmPoPN5)p0Ncvwn8=(0&>%
zV0tB}Jc-GygR0r7+kYkeqxOVDLrfSSXKZyQ1iB%Mz09{<^a!{
zYk!|O!F5n8khKN!qDvaH5$t$WK0*1c_Ch)sdmS?K^htI%@rfay@c*~O08@Yf)gdS=
zI$&X+X{(%+i}h;r@JdV(cRZXqiE&s}B*eA1}&s&wWa^y20(^eIWEal}pGSl|Qt!moFeoMHrN0Vzcid%rL#9hQ8M-bpFn3}j662%IhZ4HY
zn8(qt=#}>*QHWM2u9GwI@TvJ)1$ENgw2%K!~>M
z)O^4Y?ZWq|YSZo=S8kp_hcH@70lF>%a3}9Z{2wW->0J>VzV<#k0u`W_%O&tMTYDkb
z7z5$7nR5G|a~?T;wIHm{h5{0{B!1780*WfTc?JQN`t$btnaT!cntyKdnnj8}{odla
z-HmSX*GVp*h7+=rqHmV1n&{9Rqh9SMV)c1r9OGE8*p$G)c`mQC`f-5dU~5
z{M)(&zO%
z6uZCJEz^a5`FSVV0D8*y{-`f5AO*lVvYAKtds8RO4!PlO=)blK(6hwTBCOD-_N$GpP%
zvvO~lbUN6xUf6a2p!e8Hy0&+$IX^51=p~|jggokaI_<0+RJ~E*OEEQr43E23k-^Wy
zUE;(nm;fJRN3NloRBk*<)3K9}{Bm6DmdN^do+T2m*Goy--Gn=D;~rYwlcj9DA)bc5
zrs-{?_s)k~;P$KhOrfxc-D=CIn3M?W_%JHnqe4H|7K1VvK4J5IBy9~>
z@wA3uu&b)K6V%Kjkz$o{P|g*0yGwCBo2$idpI+#3=BA)Fn6jpb@6xMIM0W`dLABJt
zI#o;ld4cFVQ8OS<^K6*5H}QVUI4@;t@;C3d6J{3@?NigEgMX6(DuqakmdE@XFd&9A
zE(Hk%;iZ_TNW6@gZ5ciI5-A`~53EW7^Ah)~M4c8M02V)(`BUQ#?q#bL!}Mg|<_{ZO
z99mB7e_MAB;3-K!4pJ_r0~N|&S)oHEtxJ-4B~dYToU;1y9dZyy3U_{~i+umMj&Bl%
zq+!Lc8;OC4@A{;GlZ9>g3FJYicdL*T>~hG}B=FjWe_1eR0*MQ@45D3S&k~l|Dalnz
zzGPc>eJa(YQbVVdC=x_Eg-PfR1$u9$QCO0KN-4zz)_&fa-56nQ1WddlAKk~aE-iwD
z_oHQKDLEOr*&YrGY`_S=bIRU${yMu9B{+QU$@r(3{y7N*dR+oKyf%elREOc^{4LVt
z6R_V$zP|ZjSaZdX_;2OE<+1>gPF#8MSUx}u$O?7~f9Z0Z6=MmdyC}n0
z{n^r=@BMcxrK~9F3bS!{6VOx8K~W2H1mKfYB^^3=WUjiItN~EeB7G
zO{y*}FfCxR(qaX8)?G5Zd45~)ukXuenIjExplore ExecuTorch
ExecuTorch Documentation
+
+ ExecuTorch on GitHub
+
Why ExecuTorch?
Supporting on-device AI presents unique challenges: diverse hardware, critical power requirements, low or no internet connectivity, and real-time processing needs. These constraints have historically prevented or slowed down the creation of scalable, performant on-device AI solutions. We designed ExecuTorch, backed by industry leaders like Meta, Arm, Apple, and Qualcomm, to be highly portable and to provide superior developer productivity without sacrificing performance.
- ExecuTorch Alpha Release
-
- ExecuTorch was initially introduced to the community at the 2023 PyTorch Conference. With our most recent alpha release, we further expanded ExecuTorch’s capabilities across multiple dimensions. First, we enabled support for the deployment of large language models (LLMs) on various edge devices. Second, with ExecuTorch alpha, we have further stabilized the API surface. Lastly, we have significantly improved the developer experience by simplifying the installation flow as well as improving observability and developer productivity via the ExecuTorch SDK. ExecuTorch alpha release also provides early support for the recently announced Llama 3 8B along with demonstrations on how to run this model on an iPhone 15 Pro and a Samsung Galaxy S24 mobile phone.
-
From e32b6a83ac8738ec05a3b1913fac5b09c3aa8e65 Mon Sep 17 00:00:00 2001
From: Andrew Bringaze Linux Foundation
Date: Tue, 22 Oct 2024 17:20:00 -0500
Subject: [PATCH 4/6] change date to Friday
---
...pytorch-2-5.md => 2024-10-25-intel-gpu-support-pytorch-2-5.md} | 0
1 file changed, 0 insertions(+), 0 deletions(-)
rename _posts/{2024-10-22-intel-gpu-support-pytorch-2-5.md => 2024-10-25-intel-gpu-support-pytorch-2-5.md} (100%)
diff --git a/_posts/2024-10-22-intel-gpu-support-pytorch-2-5.md b/_posts/2024-10-25-intel-gpu-support-pytorch-2-5.md
similarity index 100%
rename from _posts/2024-10-22-intel-gpu-support-pytorch-2-5.md
rename to _posts/2024-10-25-intel-gpu-support-pytorch-2-5.md
From 8863e2a88f25f3dc961a676f0d10938a35f36943 Mon Sep 17 00:00:00 2001
From: Andrew Bringaze Linux Foundation
Date: Wed, 23 Oct 2024 18:21:58 -0500
Subject: [PATCH 5/6] Wednesday blog fix
---
_posts/2024-10-23-torchrec-fbgemm-1.md | 183 ++++++++++++-------------
1 file changed, 85 insertions(+), 98 deletions(-)
diff --git a/_posts/2024-10-23-torchrec-fbgemm-1.md b/_posts/2024-10-23-torchrec-fbgemm-1.md
index ba33b7151872..f34e514daa00 100644
--- a/_posts/2024-10-23-torchrec-fbgemm-1.md
+++ b/_posts/2024-10-23-torchrec-fbgemm-1.md
@@ -1,155 +1,142 @@
---
layout: blog_detail
-title: "PyTorch 2.5 Release Blog"
+title: "TorchRec and FBGEMM 1.0 Stable Release"
+author: Paul Zhang, Zain Huda, Sarunya Pumma, Shintaro Iwasaki, Supadchaya Puangpontip, Benson Ma
---
-We are excited to announce the release of PyTorch® 2.5 ([release note](https://github.com/pytorch/pytorch/releases/tag/v2.5.0))! This release features a new CuDNN backend for SDPA, enabling speedups by default for users of SDPA on H100s or newer GPUs. As well, regional compilation of torch.compile offers a way to reduce the cold start up time for torch.compile by allowing users to compile a repeated nn.Module (e.g. a transformer layer in LLM) without recompilations. Finally, TorchInductor CPP backend offers solid performance speedup with numerous enhancements like FP16 support, CPP wrapper, AOT-Inductor mode, and max-autotune mode.
+We are happy to announce the stable release, 1.0, for [TorchRec](https://github.com/pytorch/torchrec) and [FBGEMM](https://github.com/pytorch/FBGEMM). TorchRec is the PyTorch native recommendation systems library, powered by FBGEMM’s (Facebook GEneral Matrix Multiplication) efficient, low-level kernels.
-This release is composed of 4095 commits from 504 contributors since PyTorch 2.4. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.5. More information about how to get started with the PyTorch 2-series can be found at our [Getting Started](https://pytorch.org/get-started/pytorch-2.0/) page.
-As well, please check out our new ecosystem projects releases with [TorchRec](https://github.com/pytorch/torchrec) and [TorchFix](https://github.com/pytorch-labs/torchfix/releases/tag/v0.6.0).
+## TorchRec
+[Initially open sourced in 2022](https://pytorch.org/blog/introducing-torchrec/), [TorchRec](https://github.com/pytorch/torchrec) provides common primitives for creating state-of-the-art personalization models:
-| Beta | Prototype |
-| --- | --- |
-| CuDNN backend for SDPA | FlexAttention |
-| torch.compile regional compilation without recompilations | Compiled Autograd |
-| TorchDynamo added support for exception handling & MutableMapping types | Flight Recorder |
-| TorchInductor CPU backend optimization | Max-autotune Support on CPU with GEMM Template |
-|  | TorchInductor on Windows |
-|  | FP16 support on CPU path for both eager mode and TorchInductor CPP backend |
-|  | Autoload Device Extension |
-|  | Enhanced Intel GPU support |
+* Simple, optimized APIs for distributed training across hundreds of GPUs
+* Advanced sharding techniques for embeddings
+* Modules common in authoring recommendation systems
+* Frictionless path to distributed inference with APIs for quantization and sharding of TorchRec models
+Since then, TorchRec has matured significantly, with wide internal adoption across many Meta production recommendation models for training and inference, alongside new features such as: [variable batched embeddings, embedding offloading, zero collision hashing, etc.](https://github.com/pytorch/torchrec/releases?page=1) Furthermore, TorchRec has a presence outside of Meta, such as [in recommendation models at Databricks](https://docs.databricks.com/en/machine-learning/train-recommender-models.html) and in the [Twitter algorithm](https://github.com/twitter/the-algorithm-ml). As a result, standard TorchRec features have been marked as **stable**, with PyTorch style BC guarantees, and can be seen on the [revamped TorchRec documentation](https://pytorch.org/torchrec/).
-*To see a full list of public feature submissions click [here](https://docs.google.com/spreadsheets/d/1TzGkWuUMF1yTe88adz1dt2mzbIsZLd3PBasy588VWgk/edit?usp=sharing).
+## FBGEMM
-## BETA FEATURES
+[FBGEMM is a library that provides high-performance kernels for CPUs and GPUs](https://pytorch.org/FBGEMM/). Since 2018, FBGEMM has supported the efficient execution of Meta-internal and external AI/ML workloads by expanding its scope from [performance-critical kernels for inference on CPUs](https://arxiv.org/abs/2101.05615) to more complex sparse operators for both training and inference – and recently for Generative AI – on CPUs and GPUs.
+FBGEMM has been empowering TorchRec through its backend high-performance kernel implementations for recommendation workloads, ranging from embedding bag kernels to jagged tensor operations. Together with TorchRec, we released FBGEMM 1.0, which guarantees the functionality and backward-compatibility of several stable APIs serving its core features with [enhanced documentation](https://pytorch.org/FBGEMM/).
-### [Beta] CuDNN backend for SDPA
-The cuDNN "Fused Flash Attention" backend was landed for *torch.nn.functional.scaled_dot_product_attention*. On NVIDIA H100 GPUs this can provide up to 75% speed-up over FlashAttentionV2. This speedup is enabled by default for all users of SDPA on H100 or newer GPUs.
+## Performance
+[DLRM (Deep Learning Recommendation Model)](https://ai.meta.com/blog/dlrm-an-advanced-open-source-deep-learning-recommendation-model/) is the standard neural network architecture for powering recommendations at Meta, with categorical features being processed through embeddings, while continuous (dense) features are processed with a bottom multilayer perceptron. The following diagram depicts the basic architecture of DLRM, with a second order interaction layer between the dense and sparse features and a top MLP for generating the prediction.
-### [Beta] *torch.compile* regional compilation without recompilations
+{:style="width:100%"}
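+To make this concrete, here is a minimal, illustrative PyTorch sketch of the DLRM-style computation described above (the toy layer sizes and single-layer MLPs are assumptions for brevity, not the production architecture):
+
+```
+import torch
+import torch.nn as nn
+
+class ToyDLRM(nn.Module):
+    """Toy DLRM-style model: a bottom MLP for dense features, embedding tables for
+    categorical features, pairwise (second order) interactions, then a top MLP."""
+
+    def __init__(self, num_dense=13, num_sparse=4, num_embeddings=1000, dim=16):
+        super().__init__()
+        self.bottom_mlp = nn.Sequential(nn.Linear(num_dense, dim), nn.ReLU())
+        self.embeddings = nn.ModuleList(
+            nn.Embedding(num_embeddings, dim) for _ in range(num_sparse)
+        )
+        num_feats = num_sparse + 1                # embedding outputs plus the dense projection
+        num_pairs = num_feats * (num_feats - 1) // 2
+        self.top_mlp = nn.Linear(dim + num_pairs, 1)
+
+    def forward(self, dense, sparse_ids):
+        # dense: (B, num_dense) floats; sparse_ids: (B, num_sparse) category indices
+        d = self.bottom_mlp(dense)                                   # (B, dim)
+        feats = [d] + [emb(sparse_ids[:, i]) for i, emb in enumerate(self.embeddings)]
+        stacked = torch.stack(feats, dim=1)                          # (B, F, dim)
+        pairwise = torch.bmm(stacked, stacked.transpose(1, 2))       # all pairwise dot products
+        i, j = torch.triu_indices(len(feats), len(feats), offset=1)
+        interactions = pairwise[:, i, j]                             # (B, num_pairs)
+        return self.top_mlp(torch.cat([d, interactions], dim=1))     # (B, 1) logit
+
+model = ToyDLRM()
+logits = model(torch.randn(8, 13), torch.randint(0, 1000, (8, 4)))
+print(logits.shape)  # torch.Size([8, 1])
+```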
-Regional compilation without recompilations, via *torch._dynamo.config.inline_inbuilt_nn_modules* which default to True in 2.5+. This option allows users to compile a repeated *nn.Module* (e.g. a transformer layer in LLM) without recompilations. Compared to compiling the full model, this option can result in smaller compilation latencies with 1%-5% performance degradation compared to full model compilation.
-See the [tutorial](https://pytorch.org/tutorials/recipes/regional_compilation.html) for more information.
+TorchRec provides standardized modules with significant optimizations for fusing embedding lookups. EBC is a traditional PyTorch embedding module implementation, containing a collection of `torch.nn.EmbeddingBag` modules. FusedEBC, the optimized version used in sharded TorchRec modules for distributed training and inference, is powered by FBGEMM for high-performance operations on embedding tables, with a fused optimizer and UVM caching/management to alleviate GPU memory constraints. The benchmark below demonstrates the significant performance improvements of FusedEBC over a traditional PyTorch embedding module implementation (EBC), as well as FusedEBC's ability, via UVM caching, to handle embeddings much larger than available GPU memory.
-### [Beta] TorchInductor CPU backend optimization
+{:style="width:100%"}
-This feature advances Inductor’s CPU backend optimization, including CPP backend code generation and FX fusions with customized CPU kernels. The Inductor CPU backend supports vectorization of common data types and all Inductor IR operations, along with the static and symbolic shapes. It is compatible with both Linux and Windows OS and supports the default Python wrapper, the CPP wrapper, and AOT-Inductor mode.
-Additionally, it extends the max-autotune mode of the GEMM template (prototyped in 2.5), offering further performance gains. The backend supports various FX fusions, lowering to customized kernels such as oneDNN for Linear/Conv operations and SDPA. The Inductor CPU backend consistently achieves performance speedups across three benchmark suites—TorchBench, Hugging Face, and timms—outperforming eager mode in 97.5% of the 193 models tested.
+## TorchRec Data Types
-## PROTOTYPE FEATURES
+TorchRec provides standard [data types](https://pytorch.org/torchrec/datatypes-api-reference.html) and [modules](https://pytorch.org/torchrec/modules-api-reference.html) for easy handling of distributed embeddings. Here is a simple example setting up a collection of embedding tables through TorchRec:
-### [Prototype] FlexAttention
+```
+import torch
+import torchrec
+from torchrec import JaggedTensor, KeyedJaggedTensor
-We've introduced a flexible API that enables implementing various attention mechanisms such as Sliding Window, Causal Mask, and PrefixLM with just a few lines of idiomatic PyTorch code. This API leverages torch.compile to generate a fused FlashAttention kernel, which eliminates extra memory allocation and achieves performance comparable to handwritten implementations. Additionally, we automatically generate the backwards pass using PyTorch's autograd machinery. Furthermore, our API can take advantage of sparsity in the attention mask, resulting in significant improvements over standard attention implementations.
+ebc = torchrec.EmbeddingBagCollection(
+    device="cpu",
+    tables=[
+        torchrec.EmbeddingBagConfig(
+            name="product_table",
+            embedding_dim=64,
+            num_embeddings=4096,
+            feature_names=["product"],
+            pooling=torchrec.PoolingType.SUM,
+        ),
+        torchrec.EmbeddingBagConfig(
+            name="user_table",
+            embedding_dim=64,
+            num_embeddings=4096,
+            feature_names=["user"],
+            pooling=torchrec.PoolingType.SUM,
+        )
+    ]
+)
-For more information and examples, please refer to the [official blog post](https://pytorch.org/blog/flexattention/) and [Attention Gym](https://github.com/pytorch-labs/attention-gym).
+product_jt = JaggedTensor(
+    values=torch.tensor([1, 2, 1, 5]), lengths=torch.tensor([3, 1])
+)
+user_jt = JaggedTensor(values=torch.tensor([2, 3, 4, 1]), lengths=torch.tensor([2, 2]))
+kjt = KeyedJaggedTensor.from_jt_dict({"product": product_jt, "user": user_jt})
-### [Prototype] Compiled Autograd
+print("Call EmbeddingBagCollection Forward: ", ebc(kjt))
+```
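+The `KeyedJaggedTensor` above packs, for each feature, a batch of two variable-length lists of IDs. As a quick, illustrative sanity check (not part of the original example), it can be unpacked back into per-feature `JaggedTensor`s:
+
+```
+# "product" holds IDs [1, 2, 1] for the first example and [5] for the second;
+# "user" holds [2, 3] and [4, 1].
+jt_dict = kjt.to_dict()
+print(jt_dict["product"].lengths())  # tensor([3, 1])
+print(jt_dict["user"].values())      # tensor([2, 3, 4, 1])
+```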
-Compiled Autograd is an extension to the PT2 stack allowing the capture of the entire backward pass. Unlike the backward graph traced by AOT dispatcher, Compiled Autograd tracing is deferred until backward execution time, which makes it impervious to forward pass graph breaks, and allows it to record backward hooks into the graph.
-Please refer to the [tutorial](https://pytorch.org/tutorials/intermediate/compiled_autograd_tutorial.html) for more information.
+## Sharding
-### [Prototype] Flight Recorder
+TorchRec provides a planner class that automatically generates an optimized sharding plan across many GPUs. Here we demonstrate generating a sharding plan across two GPUs:
-Flight recorder is a new debugging tool that helps debug stuck jobs. The tool works by continuously capturing information about collectives as they run. Upon detecting a stuck job, the information can be used to quickly identify misbehaving ranks/machines along with code stack traces.
-For more information please refer to the following [tutorial](https://pytorch.org/tutorials/prototype/flight_recorder_tutorial.html).
+```
+from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
+planner = EmbeddingShardingPlanner(
+    topology=Topology(
+        world_size=2,
+        compute_device="cuda",
+    )
+)
-### [Prototype] Max-autotune Support on CPU with GEMM Template
+plan = planner.collective_plan(ebc, [sharder], pg)
-Max-autotune mode for the Inductor CPU backend in torch.compile profiles multiple implementations of operations at compile time and selects the best-performing one. This is particularly beneficial for GEMM-related operations, using a C++ template-based GEMM implementation as an alternative to the ATen-based approach with oneDNN and MKL libraries. We support FP32, BF16, FP16, and INT8 with epilogue fusions for x86 CPUs. We’ve seen up to 7% geomean speedup on the dynamo benchmark suites and up to 20% boost in next-token latency for LLM inference.
+print(f"Sharding Plan generated: {plan}")
+```
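+The `sharder` and `pg` arguments above are assumed to be defined beforehand. A minimal sketch of one way to set them up (the specific sharder class and process group initialization shown here are assumptions for illustration) is:
+
+```
+import torch.distributed as dist
+from torchrec.distributed.embeddingbag import EmbeddingBagCollectionSharder
+
+# Initialize the default process group (normally done under a launcher such as torchrun,
+# which also sets RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT).
+dist.init_process_group(backend="nccl")
+pg = dist.GroupMember.WORLD
+
+# A sharder describes how EmbeddingBagCollection modules may be sharded.
+sharder = EmbeddingBagCollectionSharder()
+```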
-For more information please refer to the [tutorial](https://pytorch.org/tutorials/prototype/max_autotune_on_CPU_tutorial.html).
-### [Prototype] TorchInductor CPU on Windows
+## Model Parallel
-Inductor CPU backend in torch.compile now works on Windows. We support MSVC (cl), clang (clang-cl) and Intel compiler (icx-cl) for Windows inductor currently.
+TorchRec’s main distributed training API is [DistributedModelParallel](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module), which calls the planner to generate a sharding plan (demonstrated above) and shards TorchRec modules according to that plan. Below, we apply [DistributedModelParallel](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module) to our EmbeddingBagCollection to shard its embeddings for distributed training:
-See the [tutorial](https://pytorch.org/tutorials/prototype/inductor_windows_cpu.html) for more details.
+```
+model = torchrec.distributed.DistributedModelParallel(ebc, device=torch.device("cuda"))
+```
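+Once wrapped, `model` behaves like a regular `nn.Module`. A small illustrative follow-up, assuming the `model` object from the snippet above and an initialized default process group (the `plan` and `module` attributes referenced here are assumed parts of the DistributedModelParallel API surface):
+
+```
+# Printing the wrapper shows the now-sharded EmbeddingBagCollection structure.
+print(model)
+
+# The applied sharding plan and the wrapped module are exposed as attributes.
+print(model.plan)
+print(model.module)
+```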
-### [Prototype] FP16 support on CPU path for both eager mode and TorchInductor CPP backend
-Float16 is a commonly used reduced floating point type for performance improvement in neural network inference/training. Since this release, float16 for both eager and TorchInductor is supported on the CPU path.
+## Inference
-### [Prototype] Autoload Device Extension
+TorchRec provides simple APIs for quantizing and sharding a model’s embeddings for distributed inference. The usage is demonstrated below:
-PyTorch now supports autoloading for out-of-tree device extensions, streamlining integration by eliminating the need for manual imports. This feature, enabled through the torch.backends entrypoint, simplifies usage by ensuring seamless extension loading, while allowing users to disable it via an environment variable if needed.
-See the [tutorial](https://pytorch.org/tutorials/prototype/python_extension_autoload.html) for more information.
+```
+import torch
+from torchrec.inference.modules import (
+    quantize_inference_model,
+    shard_quant_model,
+)
+
+# Target device for the quantized, sharded model (assumed here for illustration).
+device = torch.device("cuda")
+
+quant_model = quantize_inference_model(ebc)
+sharded_model, _ = shard_quant_model(
+    quant_model, compute_device=device, sharding_device=device
+)
+```
-### [Prototype] Enhanced Intel GPU support
-Intel GPUs support enhancement is now available for both Intel® Data Center GPU Max Series and Intel® Client GPUs (Intel® Core™ Ultra processors with built-in Intel® Arc™ graphics and Intel® Arc™ Graphics for dGPU parts), which is to make it easier to accelerate your Machine Learning workflows on Intel GPUs in PyTorch 2.5 release. We also enabled the initial support of PyTorch on Windows for Intel® Client GPUs in this release.
+## Conclusion
+TorchRec and FBGEMM are now stable, with optimized features for large-scale recommendation systems.
-* Expanded PyTorch hardware backend support matrix to include both Intel Data Center and Client GPUs.
-* The implementation of SYCL* kernels to enhance coverage and execution of Aten operators on Intel GPUs to boost performance in PyTorch eager mode.
-* Enhanced Intel GPU backend of torch.compile to improve inference and training performance for a wide range of deep learning workloads.
-
-These features are available through PyTorch preview and nightly binary PIP wheels. For more information regarding Intel GPU support, please refer to [documentation](https://pytorch.org/docs/main/notes/get_start_xpu.html).
\ No newline at end of file
+For setting up TorchRec and FBGEMM, check out the [getting started guide](https://pytorch.org/torchrec/setup-torchrec.html).
+
+We also recommend the comprehensive, end-to-end [tutorial for introducing the features in TorchRec and FBGEMM](https://pytorch.org/tutorials/intermediate/torchrec_intro_tutorial.html#).
\ No newline at end of file
From 2a69832b720e16ae4da35e54faaf1ce8ea2e3105 Mon Sep 17 00:00:00 2001
From: Andrew Bringaze Linux Foundation
Date: Wed, 23 Oct 2024 18:31:38 -0500
Subject: [PATCH 6/6] removal
---
_posts/2024-10-23-torchrec-fbgemm-1.md | 142 -------------------------
1 file changed, 142 deletions(-)
delete mode 100644 _posts/2024-10-23-torchrec-fbgemm-1.md
diff --git a/_posts/2024-10-23-torchrec-fbgemm-1.md b/_posts/2024-10-23-torchrec-fbgemm-1.md
deleted file mode 100644
index f34e514daa00..000000000000
--- a/_posts/2024-10-23-torchrec-fbgemm-1.md
+++ /dev/null
@@ -1,142 +0,0 @@
----
-layout: blog_detail
-title: "TorchRec and FBGEMM 1.0 Stable Release"
-author: Paul Zhang, Zain Huda, Sarunya Pumma, Shintaro Iwasaki, Supadchaya Puangpontip, Benson Ma
----
-
-We are happy to announce the stable release, 1.0, for [TorchRec](https://github.com/pytorch/torchrec) and [FBGEMM](https://github.com/pytorch/FBGEMM). TorchRec is the PyTorch native recommendation systems library, powered by FBGEMM’s (Facebook GEneral Matrix Multiplication) efficient, low-level kernels.
-
-
-## TorchRec
-
-[Initially open sourced in 2022](https://pytorch.org/blog/introducing-torchrec/), [TorchRec](https://github.com/pytorch/torchrec) provides common primitives for creating state-of-the-art personalization models:
-
-* Simple, optimized APIs for distributed training across hundreds of GPUs
-* Advanced sharding techniques for embeddings
-* Modules common in authoring recommendation systems
-* Frictionless path to distributed inference with APIs for quantization and sharding of TorchRec models
-
-Since then, TorchRec has matured significantly, with wide internal adoption across many Meta production recommendation models for training and inference, alongside new features such as: [variable batched embeddings, embedding offloading, zero collision hashing, etc.](https://github.com/pytorch/torchrec/releases?page=1) Furthermore, TorchRec has a presence outside of Meta, such as [in recommendation models at Databricks](https://docs.databricks.com/en/machine-learning/train-recommender-models.html) and in the [Twitter algorithm](https://github.com/twitter/the-algorithm-ml). As a result, standard TorchRec features have been marked as **stable**, with PyTorch style BC guarantees, and can be seen on the [revamped TorchRec documentation](https://pytorch.org/torchrec/).
-
-
-## FBGEMM
-
-[FBGEMM is a library that provides high-performance kernels for CPUs and GPUs](https://pytorch.org/FBGEMM/). Since 2018, FBGEMM has supported the efficient execution of Meta-internal and external AI/ML workloads by expanding its scope from [performance-critical kernels for inference on CPUs](https://arxiv.org/abs/2101.05615) to more complex sparse operators for both training and inference – and recently for Generative AI – on CPUs and GPUs.
-
-FBGEMM has been empowering TorchRec through its backend high-performance kernel implementations for recommendation workloads, ranging from embedding bag kernels to jagged tensor operations. Together with TorchRec, we released FBGEMM 1.0, which guarantees the functionality and backward-compatibility of several stable APIs serving its core features with [enhanced documentation](https://pytorch.org/FBGEMM/).
-
-
-## Performance
-
-[DLRM (Deep Learning Recommendation Model)](https://ai.meta.com/blog/dlrm-an-advanced-open-source-deep-learning-recommendation-model/) is the standard neural network architecture for powering recommendations at Meta, with categorical features being processed through embeddings, while continuous (dense) features are processed with a bottom multilayer perceptron. The following diagram depicts the basic architecture of DLRM, with a second order interaction layer between the dense and sparse features and a top MLP for generating the prediction.
-
-{:style="width:100%"}
-
-
-
-TorchRec provides standardized modules with significant optimizations in fusing embedding lookups. EBC is a traditional PyTorch embedding module implementation, containing a collection of `torch.nn.EmbeddingBags.` FusedEBC, powered by FBGEMM for high performance operations on embedding tables with a fused optimizer and UVM caching/management for alleviating memory constraints, is the optimized version present in sharded TorchRec modules for distributed training and inference. The below benchmark demonstrates the vast performance improvements of FusedEBC in comparison to a traditional PyTorch embedding module implementation (EBC) and the ability for FusedEBC to handle much larger embeddings than what is available on GPU memory with UVM caching.
-
-{:style="width:100%"}
-
-
-
-## TorchRec Data Types
-
-TorchRec provides standard [data types](https://pytorch.org/torchrec/datatypes-api-reference.html) and [modules](https://pytorch.org/torchrec/modules-api-reference.html) for easy handling of distributed embeddings. Here is a simple example setting up a collection of embedding tables through TorchRec:
-
-
-```
-from torchrec import EmbeddingBagCollection
-from torchrec import KeyedJaggedTensor
-from torchrec import JaggedTensor
-
-ebc = torchrec.EmbeddingBagCollection(
- device="cpu",
- tables=[
- torchrec.EmbeddingBagConfig(
- name="product_table",
- embedding_dim=64,
- num_embeddings=4096,
- feature_names=["product"],
- pooling=torchrec.PoolingType.SUM,
- ),
- torchrec.EmbeddingBagConfig(
- name="user_table",
- embedding_dim=64,
- num_embeddings=4096,
- feature_names=["user"],
- pooling=torchrec.PoolingType.SUM,
- )
- ]
-)
-
-product_jt = JaggedTensor(
- values=torch.tensor([1, 2, 1, 5]), lengths=torch.tensor([3, 1])
-)
-user_jt = JaggedTensor(values=torch.tensor([2, 3, 4, 1]), lengths=torch.tensor([2, 2]))
-
-kjt = KeyedJaggedTensor.from_jt_dict({"product": product_jt, "user": user_jt})
-
-print("Call EmbeddingBagCollection Forward: ", ebc(kjt))
-```
-
-
-
-## Sharding
-
-TorchRec provides a planner class that automatically generates an optimized sharding plan across many GPUs. Here we demonstrate generating a sharding plan across two GPUs:
-
-
-```
-from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
-
-planner = EmbeddingShardingPlanner(
- topology=Topology(
- world_size=2,
- compute_device="cuda",
- )
-)
-
-plan = planner.collective_plan(ebc, [sharder], pg)
-
-print(f"Sharding Plan generated: {plan}")
-```
-
-
-
-## Model Parallel
-
-TorchRec’s main distributed training API is [DistributedModelParallel](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module), which calls the planner to generate a sharding plan (demonstrated above) and shards TorchRec modules according to that plan. We demonstrate using [DistributedModelParallel](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module) to our EmbeddingBagCollection for sharding embeddings distributed training:
-
-
-```
-model = torchrec.distributed.DistributedModelParallel(ebc, device=torch.device("cuda"))
-```
-
-
-
-## Inference
-
-TorchRec provides simple APIs for quantizing and sharding embeddings for a model for distributed inference. The usage is demonstrated below:
-
-
-```
-from torchrec.inference.modules import (
- quantize_inference_model,
- shard_quant_model,
-)
-quant_model = quantize_inference_model(ebc)
-sharded_model, _ = shard_quant_model(
- quant_model, compute_device=device, sharding_device=device
-)
-```
-
-
-
-## Conclusion
-
-TorchRec and FBGEMM are now stable, with optimized features for large scale recommendation systems.
-
-For setting up TorchRec and FBGEMM, check out the [getting started guide](https://pytorch.org/torchrec/setup-torchrec.html). \
- \
-We also recommend the comprehensive, end-to-end [tutorial for introducing the features in TorchRec and FBGEMM](https://pytorch.org/tutorials/intermediate/torchrec_intro_tutorial.html#).
\ No newline at end of file