
Commit 6c88b06

Merge branch 'main' into fix/inductor/link
2 parents 7c2a165 + 0902157 commit 6c88b06

7 files changed (+65 lines, -13 lines)

advanced_source/cpp_custom_ops.rst

Lines changed: 49 additions & 6 deletions
@@ -63,9 +63,47 @@ Using ``cpp_extension`` is as simple as writing the following ``setup.py``:
 
 If you need to compile CUDA code (for example, ``.cu`` files), then instead use
 `torch.utils.cpp_extension.CUDAExtension <https://pytorch.org/docs/stable/cpp_extension.html#torch.utils.cpp_extension.CUDAExtension>`_.
-Please see how
-`extension-cpp <https://github.com/pytorch/extension-cpp>`_ for an example for
-how this is set up.
+Please see `extension-cpp <https://github.com/pytorch/extension-cpp>`_ for an
+example for how this is set up.
+
+Starting with PyTorch 2.6, you can now build a single wheel for multiple CPython
+versions (similar to what you would do for pure Python packages). In particular,
+if your custom library adheres to the `CPython Stable Limited API
+<https://docs.python.org/3/c-api/stable.html>`_ or avoids CPython entirely, you
+can build one Python agnostic wheel against a minimum supported CPython version
+through setuptools' ``py_limited_api`` flag, like so:
+
+.. code-block:: python
+
+   from setuptools import setup, Extension
+   from torch.utils import cpp_extension
+
+   setup(name="extension_cpp",
+         ext_modules=[
+             cpp_extension.CppExtension(
+                 "extension_cpp",
+                 ["python_agnostic_code.cpp"],
+                 py_limited_api=True)],
+         cmdclass={'build_ext': cpp_extension.BuildExtension},
+         options={"bdist_wheel": {"py_limited_api": "cp39"}}
+   )
+
+Note that you must specify ``py_limited_api=True`` both within ``setup``
+and also as an option to the ``"bdist_wheel"`` command with the minimal supported
+Python version (in this case, 3.9). This ``setup`` would build one wheel that could
+be installed across multiple Python versions ``python>=3.9``. Please see
+`torchao <https://github.com/pytorch/ao>`_ for an example.
+
+.. note::
+
+   You must verify independently that the built wheel is truly Python agnostic.
+   Specifying ``py_limited_api`` does not check for any guarantees, so it is possible
+   to build a wheel that looks Python agnostic but will crash, or worse, be silently
+   incorrect, in another Python environment. Take care to avoid using unstable CPython
+   APIs, for example APIs from libtorch_python (in particular the pytorch/python bindings),
+   and to only use APIs from libtorch (ATen objects, operators, and the dispatcher).
+   For example, to give access to custom ops from Python, the library should register
+   the ops through the dispatcher (covered below!).
 
 Defining the custom op and adding backend implementations
 ---------------------------------------------------------
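As an aside on the note above (an editorial sketch, not part of this commit): one way to catch accidental use of non-stable CPython APIs at build time is to define the ``Py_LIMITED_API`` macro for the minimum supported version, for example via ``extra_compile_args``. Only ``py_limited_api`` and the ``bdist_wheel`` option come from the diff above; the macro placement here is an assumption.

    from setuptools import setup
    from torch.utils import cpp_extension

    setup(
        name="extension_cpp",
        ext_modules=[
            cpp_extension.CppExtension(
                "extension_cpp",
                ["python_agnostic_code.cpp"],
                py_limited_api=True,
                # Defining Py_LIMITED_API makes the compiler reject CPython calls
                # outside the stable ABI for the minimum version (0x03090000 == 3.9).
                extra_compile_args={"cxx": ["-DPy_LIMITED_API=0x03090000"]},
            )
        ],
        cmdclass={"build_ext": cpp_extension.BuildExtension},
        options={"bdist_wheel": {"py_limited_api": "cp39"}},
    )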
@@ -177,7 +215,7 @@ operator specifies how to compute the metadata of output tensors given the metad
 The FakeTensor kernel should return dummy Tensors of your choice with
 the correct Tensor metadata (shape/strides/``dtype``/device).
 
-We recommend that this be done from Python via the `torch.library.register_fake` API,
+We recommend that this be done from Python via the ``torch.library.register_fake`` API,
 though it is possible to do this from C++ as well (see
 `The Custom Operators Manual <https://pytorch.org/docs/main/notes/custom_operators.html>`_
 for more details).
@@ -188,7 +226,9 @@ for more details).
 # before calling ``torch.library`` APIs that add registrations for the
 # C++ custom operator(s). The following import loads our
 # C++ custom operator definitions.
-# See the next section for more details.
+# Note that if you are striving for Python agnosticism, you should use
+# the ``load_library(...)`` API call instead. See the next section for
+# more details.
 from . import _C
 
 @torch.library.register_fake("extension_cpp::mymuladd")
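For context (an editorial sketch, not shown in this commit): the fake kernel registered under ``extension_cpp::mymuladd`` only needs to describe output metadata. Assuming ``mymuladd(a, b, c)`` returns a tensor shaped and typed like ``a``, it could look like this:

    import torch

    @torch.library.register_fake("extension_cpp::mymuladd")
    def _(a, b, c):
        # Only metadata matters here: check what we rely on, then return an
        # empty tensor with the same shape/dtype/device as the real output.
        torch._check(a.shape == b.shape)
        torch._check(a.device == b.device)
        return torch.empty_like(a)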
@@ -214,7 +254,10 @@ of two ways:
 1. If you're following this tutorial, importing the Python C extension module
    we created will load the C++ custom operator definitions.
 2. If your C++ custom operator is located in a shared library object, you can
-   also use ``torch.ops.load_library("/path/to/library.so")`` to load it.
+   also use ``torch.ops.load_library("/path/to/library.so")`` to load it. This
+   is the blessed path for Python agnosticism, as you will not have a Python C
+   extension module to import. See `torchao __init__.py <https://github.com/pytorch/ao/blob/881e84b4398eddcea6fee4d911fc329a38b5cd69/torchao/__init__.py#L26-L28>`_
+   for an example.
 
 
 Adding training (autograd) support for an operator
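To make option 2 concrete (an editorial sketch modeled on the torchao example linked above; the ``_C*.so`` naming scheme is an assumption): a package ``__init__.py`` can locate the bundled shared library and load it through the dispatcher, with no Python C extension module to import.

    from pathlib import Path

    import torch

    # Find the shared library shipped next to this __init__.py and register
    # its C++ custom operators with the PyTorch dispatcher.
    so_files = list(Path(__file__).parent.glob("_C*.so"))
    assert len(so_files) == 1, f"Expected exactly one _C*.so file, found {len(so_files)}"
    torch.ops.load_library(str(so_files[0]))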

advanced_source/custom_ops_landing_page.rst

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@ You may wish to author a custom operator from Python (as opposed to C++) if:
   respect to ``torch.compile`` and ``torch.export``.
 - you have some Python bindings to C++/CUDA kernels and want those to compose with PyTorch
   subsystems (like ``torch.compile`` or ``torch.autograd``)
+- you are using Python (and not a C++-only environment like AOTInductor).
 
 Integrating custom C++ and/or CUDA code with PyTorch
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

advanced_source/python_custom_ops.py

Lines changed: 7 additions & 1 deletion
@@ -3,7 +3,7 @@
 """
 .. _python-custom-ops-tutorial:
 
-Python Custom Operators
+Custom Python Operators
 =======================
 
 .. grid:: 2
@@ -30,6 +30,12 @@
   into the function).
 - Adding training support to an arbitrary Python function
 
+Use :func:`torch.library.custom_op` to create Python custom operators.
+Use the C++ ``TORCH_LIBRARY`` APIs to create C++ custom operators (these
+work in Python-less environments).
+See the `Custom Operators Landing Page <https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html>`_
+for more details.
+
 Please note that if your operation can be expressed as a composition of
 existing PyTorch operators, then there is usually no need to use the custom operator
 API -- everything (for example ``torch.compile``, training support) should
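For reference (an editorial sketch, not part of this commit): ``torch.library.custom_op`` turns an opaque Python function into a real operator, and a fake kernel registered on it keeps the op traceable under ``torch.compile``. The NumPy-backed op below is illustrative only and assumes CPU tensors.

    import numpy as np
    import torch

    @torch.library.custom_op("mylib::numpy_sin", mutates_args=())
    def numpy_sin(x: torch.Tensor) -> torch.Tensor:
        # The body is opaque to torch.compile; it is dispatched like any other op.
        return torch.from_numpy(np.sin(x.numpy()))

    @numpy_sin.register_fake
    def _(x):
        # Describe output metadata only; no real computation happens here.
        return torch.empty_like(x)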

beginner_source/onnx/README.txt

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ ONNX
 
 1. intro_onnx.py
    Introduction to ONNX
-   https://pytorch.org/tutorials/onnx/intro_onnx.html
+   https://pytorch.org/tutorials/beginner/onnx/intro_onnx.html
 
 2. export_simple_model_to_onnx_tutorial.py
    Exporting a PyTorch model to ONNX

en-wordlist.txt

Lines changed: 2 additions & 0 deletions
@@ -392,6 +392,8 @@ FlexAttention
 fp
 frontend
 functionalized
+functionalizes
+functionalization
 functorch
 fuser
 geomean

index.rst

Lines changed: 4 additions & 4 deletions
@@ -397,14 +397,14 @@ Welcome to PyTorch Tutorials
    :tags: Frontend-APIs,C++
 
 .. customcarditem::
-   :header: Python Custom Operators Landing Page
+   :header: PyTorch Custom Operators Landing Page
    :card_description: This is the landing page for all things related to custom operators in PyTorch.
    :image: _static/img/thumbnails/cropped/Custom-Cpp-and-CUDA-Extensions.png
    :link: advanced/custom_ops_landing_page.html
    :tags: Extending-PyTorch,Frontend-APIs,C++,CUDA
 
 .. customcarditem::
-   :header: Python Custom Operators
+   :header: Custom Python Operators
    :card_description: Create Custom Operators in Python. Useful for black-boxing a Python function for use with torch.compile.
    :image: _static/img/thumbnails/cropped/Custom-Cpp-and-CUDA-Extensions.png
    :link: advanced/python_custom_ops.html
@@ -426,14 +426,14 @@ Welcome to PyTorch Tutorials
 
 .. customcarditem::
    :header: Custom C++ and CUDA Extensions
-   :card_description: Create a neural network layer with no parameters using numpy. Then use scipy to create a neural network layer that has learnable weights.
+   :card_description: Create a neural network layer with no parameters using numpy. Then use scipy to create a neural network layer that has learnable weights.
    :image: _static/img/thumbnails/cropped/Custom-Cpp-and-CUDA-Extensions.png
    :link: advanced/cpp_extension.html
    :tags: Extending-PyTorch,Frontend-APIs,C++,CUDA
 
 .. customcarditem::
    :header: Extending TorchScript with Custom C++ Operators
-   :card_description: Implement a custom TorchScript operator in C++, how to build it into a shared library, how to use it in Python to define TorchScript models and lastly how to load it into a C++ application for inference workloads.
+   :card_description: Implement a custom TorchScript operator in C++, how to build it into a shared library, how to use it in Python to define TorchScript models and lastly how to load it into a C++ application for inference workloads.
    :image: _static/img/thumbnails/cropped/Extending-TorchScript-with-Custom-Cpp-Operators.png
    :link: advanced/torch_script_custom_ops.html
    :tags: Extending-PyTorch,Frontend-APIs,TorchScript,C++

recipes_source/distributed_device_mesh.rst

Lines changed: 1 addition & 1 deletion
@@ -164,7 +164,7 @@ DeviceMesh allows users to slice child mesh from the parent mesh and re-use the
 
     # Users can access the underlying process group thru `get_group` API.
     replicate_group = hsdp_mesh["replicate"].get_group()
-    shard_group = hsdp_mesh["Shard"].get_group()
+    shard_group = hsdp_mesh["shard"].get_group()
     tp_group = tp_mesh.get_group()
 
 
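The lowercase ``"shard"`` key matters because slicing matches the mesh dimension names exactly as given at construction. An editorial sketch (mesh shape and launch setup assumed, not from the recipe):

    from torch.distributed.device_mesh import init_device_mesh

    # Assumes 8 ranks launched with torchrun and CUDA available.
    # Dimension names are matched exactly when slicing, so "shard" resolves
    # while "Shard" would fail.
    hsdp_mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))
    replicate_group = hsdp_mesh["replicate"].get_group()
    shard_group = hsdp_mesh["shard"].get_group()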

0 commit comments