From 404a858e88fed0515127186d9938bcac7422b717 Mon Sep 17 00:00:00 2001 From: andresruizfacebook <68402331+andresruizfacebook@users.noreply.github.com> Date: Mon, 26 Oct 2020 10:03:35 -0700 Subject: [PATCH 01/12] Create 2020-10-26-1.7-released.md --- _posts/2020-10-26-1.7-released.md | 416 ++++++++++++++++++++++++++++++ 1 file changed, 416 insertions(+) create mode 100644 _posts/2020-10-26-1.7-released.md diff --git a/_posts/2020-10-26-1.7-released.md b/_posts/2020-10-26-1.7-released.md new file mode 100644 index 000000000000..9175150c1793 --- /dev/null +++ b/_posts/2020-10-26-1.7-released.md @@ -0,0 +1,416 @@ +--- +layout: blog_detail +title: 'PyTorch 1.7 released w/ CUDA 11, New APIs for FFTs, Windows support for Distributed training and more..' +author: Team PyTorch +--- + +Today, we’re announcing the availability of PyTorch 1.7, along with updated domain libraries. The PyTorch 1.7 release includes a number of new APIs, profiling and benchmarking tools, major updates to both distributed data parallel (DDP) and remote procedure call (RPC) based distributed training. In addition, several features moved to [stable](https://pytorch.org/docs/stable/index.html#pytorch-documentation) including custom C++ Classes, the memory profiler, the creation of custom tensor-like objects, user async functions in RPC and a number of other features in torch.distributed like Per-RPC timeout, DDP dynamic bucketing and RRef helper. 
+ +A few of the highlights include: + +* 1- CUDA 11 is now officially supported with binaries available at [PyTorch.org](http://pytorch.org/) +* 2- Updates and additions to profiling and performance for RPC, TorchScript, Stack traces and Benchmark utilities +* 3- (Beta) Support for NumPy compatible Fast Fourier transforms (FFT) via torch.fft +* 4- (Prototype) Support for Nvidia A100 generation GPUs and native TF32 format +* 5- (Prototype) Distributed training on Windows now supported + +To reiterate, starting [PyTorch 1.6](https://pytorch.org/blog/pytorch-feature-classification-changes/), features are now classified as stable, beta and prototype. You can see the detailed announcement [here](https://pytorch.org/blog/pytorch-feature-classification-changes/). Note that the prototype features listed in this blog are available as part of this release. + +Find the full release notes [here](https://github.com/pytorch/pytorch/releases). + +# Front End APIs + +## [Beta] NumPy Compatible torch.fft module + +FFT-related functionality is commonly used in a variety of scientific fields like signal processing. While PyTorch has historically supported a few FFT-related functions, the 1.7 release adds a new torch.fft module that implements FFT-related functions with the same API as NumPy. + +This new module must be imported to be used in the 1.7 release, since its name conflicts with the historic (and now deprecated) torch.fft function. 
+ +**Example usage:** + +```python +>>> import torch.fft +>>> t = torch.arange(4) +>>> t +tensor([0, 1, 2, 3]) + +>>> torch.fft.fft(t) +tensor([ 6.+0.j, -2.+2.j, -2.+0.j, -2.-2.j]) + +>>> t = torch.tensor([0.+1.j, 2.+3.j, 4.+5.j, 6.+7.j]) +>>> torch.fft.fft(t) +tensor([12.+16.j, -8.+0.j, -4.-4.j, 0.-8.j]) + ``` + +* Documentation | [Link](https://pytorch.org/docs/stable/fft.html#torch-fft) + +## [Beta] C++ Support for Transformer NN Modules + +Since [PyTorch 1.5](https://pytorch.org/blog/pytorch-1-dot-5-released-with-new-and-updated-apis/), we’ve continued to maintain parity between the Python and C++ frontend APIs. This update allows developers to use the nn.Transformer module abstraction from the C++ frontend. Moreover, developers no longer need to save a module from Python/JIT and load it into C++, as it can now be used in C++ directly. + +* Documentation | [Link](https://pytorch.org/cppdocs/api/classtorch_1_1nn_1_1_transformer_impl.html#_CPPv4N5torch2nn15TransformerImplE) + +## [Beta] torch.set_deterministic + +Reproducibility (bit-for-bit determinism) may help identify errors when debugging or testing a program. To facilitate reproducibility, PyTorch 1.7 adds the ```torch.set_deterministic(bool)``` function that can direct PyTorch operators to select deterministic algorithms when available, and to throw a runtime error if an operation may result in nondeterministic behavior. By default, the flag this function controls is false and there is no change in behavior, meaning PyTorch may implement its operations nondeterministically. + +More precisely, when this flag is true: + +* Operations known to not have a deterministic implementation throw a runtime error; +* Operations with deterministic variants use those variants (usually with a performance penalty versus the non-deterministic version); and +* ```torch.backends.cudnn.deterministic = True``` is set. 
+ +Note that this is necessary, **but not sufficient**, for determinism **within a single run of a PyTorch program**. Other sources of randomness like random number generators, unknown operations, or asynchronous or distributed computation may still cause nondeterministic behavior. + +See the documentation for ```torch.set_deterministic(bool)``` for the list of affected operations. + +* RFC | [Link](https://github.com/pytorch/pytorch/issues/15359) +* Documentation | [Link](https://pytorch.org/docs/stable/generated/torch.set_deterministic.html) + +# Performance & Profiling + +## [Beta] Benchmark Utilities + +Benchmarks for pull requests are currently ad hoc and often inadequate. They frequently involve starting a timer, running an op or ops many times, taking the average, and then comparing before and after numbers to determine if a change improves or regresses performance. The PyTorch 1.7 release includes new utilities that allow the user to take accurate performance measurements, and that provide composable tools to help with both benchmark formulation and post-processing. + +**Example usage:** + +```python +"""Demonstration of how size can affect efficiency. + +1K: Task is small, overhead dominates. +1M: Good performance. +32M: Task does not fit into L2 cache which degrades performance. 
+""" +import torch +import torch.utils.benchmark as benchmark_utils + +for n in [1024, 1024 ** 2, 32 * 1024 ** 2]: + timer = benchmark_utils.Timer( + "torch.dot(x, y)", + description=f"n = {n}", + globals={ + "x": torch.ones((n,)), + "y": torch.ones((n,)), + } + ) + m = timer.blocked_autorange(min_run_time=1) + print(f"{m}\n{m.median / n * 1e9:5.2f} ns / element\n") + +torch.dot(x, y) +n = 1024 + Median: 2.37 us + IQR: 0.07 us (2.34 to 2.41) + 415 measurements, 1000 runs per measurement, 1 thread + 2.31 ns / element + + +torch.dot(x, y) +n = 1048576 + Median: 265.56 us + IQR: 5.56 us (263.17 to 268.74) + 374 measurements, 10 runs per measurement, 1 thread + 0.25 ns / element + + +torch.dot(x, y) +n = 33554432 + Median: 29.80 ms + IQR: 0.58 ms (29.53 to 30.10) + 34 measurements, 1 runs per measurement, 1 thread + 0.89 ns / element + ``` + + Documentation | Link **Missing link** + + ## [Beta] Stack traces added to profiler + +Users can now see not only operator name/inputs in the profiler output table but also where the operator is in the code. The workflow requires very little change to take advantage of this capability. The user uses the [autograd profiler](https://pytorch.org/docs/stable/autograd.html#profiler) as before but with optional new parameters: ```with_stack``` and ```group_by_stack_n.``` + +* Details | [Link](https://github.com/pytorch/pytorch/pull/43898/) +* Documentation | [Link](https://pytorch.org/docs/stable/autograd.html) + +# Distributed Training & RPC + +## [Stable] TorchElastic now bundled into PyTorch docker image + +Torchelastic offers a strict superset of the current ```torch.distributed.launch``` CLI with added features for fault-tolerance and elasticity. If the user is not interested in fault-tolerance, they can get exact functionality/behavior parity by setting ```max_restarts=0```, with the added convenience of auto-assigned ```RANK``` and ```MASTER_ADDR|PORT``` (versus manually specified in ```torch.distributed.launch```). 
+ +By bundling ```torchelastic``` in the same docker image as PyTorch, users can start experimenting with torchelastic right-away without having to separately install ```torchelastic```. In addition to convenience, this work is a nice-to-have when adding support for elastic parameters in the existing Kubeflow’s distributed PyTorch operators. + +* Usage examples and how to get started | [Link](https://pytorch.org/elastic/0.2.0/examples.html) + +## [Beta] Support for uneven dataset inputs in DDP + +PyTorch 1.7 introduces a new context manager to be used in conjunction with models trained using ```torch.nn.parallel.DistributedDataParallel``` to enable training with uneven dataset size across different processes. This feature enables greater flexibility when using DDP and prevents the user from having to manually ensure dataset sizes are the same across different process. With this context manager, DDP will handle uneven dataset sizes automatically, which can prevent errors or hangs at the end of training. + +* RFC | [Link](https://github.com/pytorch/pytorch/issues/38174) +* Documentation | [Link](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.join) + +## [Beta] NCCL Reliability - Async Error/Timeout Handling + +In the past, NCCL training runs would hang indefinitely due to stuck collectives, leading to a very unpleasant experience for users. This feature will abort stuck collectives and throw an exception/crash the process if a potential hang is detected. When used with something like torchelastic (which can recover the training process from the last checkpoint), users can have much greater reliability for distributed training. This feature is completely opt in and sits behind an environment variable that needs to be explicitly set in order to enable this feature (otherwise users will see the same behavior as before). 
+ +* Documentation | Link **Missing Link** +* Usage examples | Link **Missing Link** + +## [Beta] TorchScript rpc_remote and rpc_sync + +```rpc_async``` has been available in TorchScript as a beta feature in prior releases. For PyTorch 1.7, this functionality is extended to the remaining two core RPC APIs, ```rpc_sync``` and ```rpc_remote```. This completes the major RPC APIs targeted for support in TorchScript and can possibly improve application performance by allowing users to use these APIs within TorchScript. + +More specifically, it supports the following use cases: + +```python +from torch.distributed import rpc +from torch import Tensor +from typing import Dict, Tuple + +@torch.jit.script +def script_rpc_async_call( + dst_worker_name: str, args: Tuple[Tensor, Tensor], kwargs: Dict[str, Tensor] +): + fut = rpc.rpc_async(dst_worker_name, two_args_two_kwargs, args, kwargs) + ret = fut.wait() + return ret + +@torch.jit.script +def script_rpc_sync_call( + dst_worker_name: str, args: Tuple[Tensor, Tensor], kwargs: Dict[str, Tensor] +): + res = rpc.rpc_sync(dst_worker_name, two_args_two_kwargs, args, kwargs) + return res + +@torch.jit.script +def script_rpc_remote_call( + dst_worker_name: str, args: Tuple[Tensor, Tensor], kwargs: Dict[str, Tensor] +): + rref_res = rpc.remote(dst_worker_name, two_args_two_kwargs, args, kwargs) + return rref_res.to_here() + ``` +* Design doc | Link **Missing Link** +* Documentation | Link **Missing Link** +* Usage examples | Link **Missing Link** + +## [Beta] Distributed optimizer with TorchScript support + +PyTorch provides a broad set of optimizers for training algorithms, and these have been used repeatedly as part of the Python API. However, users often want to use multithreaded training instead of multiprocess training, as it provides better resource utilization and efficiency in the context of large-scale distributed training (e.g. Distributed Model Parallel or any RPC-based training application). 
Users couldn’t do this with the distributed optimizer before, because doing so requires getting rid of the Python Global Interpreter Lock (GIL) limitation. + +In PyTorch 1.7, we are enabling TorchScript support in the distributed optimizer to remove the GIL limitation and make it possible to run the optimizer in multithreaded applications. The new distributed optimizer has exactly the same interface as before, but it automatically converts the optimizers in each worker into TorchScript to make them GIL-free. This is done by leveraging a functional optimizer concept that allows the distributed optimizer to convert the computation part of the optimizer into TorchScript. This will help use cases like distributed model parallel training improve their performance with multithreading. + +Currently, the only optimizer that supports automatic conversion with TorchScript is ```Adagrad```, all other optimizers will still work as before without TorchScript support. We are working on expanding the coverage to all PyTorch optimizers. + +* Design doc | Link **Missing Link** +* Documentation | Link **Missing Link** +* Usage examples | Link **Missing Link** + +## [Beta] Enhancements to RPC-based Profiling + +Support for using the PyTorch profiler in conjunction with the RPC framework was first introduced in PyTorch 1.6. In PyTorch 1.7, the following enhancements have been made: + +* Implemented better support for profiling TorchScript functions over RPC +* Achieved parity in terms of profiler features that work with RPC +* Added support for asynchronous RPC functions on the server-side (functions decorated with ```rpc.functions.async_execution```). + +Users are now able to use familiar profiling tools such as with ```torch.autograd.profiler.profile()``` and ```with torch.autograd.profiler.record_function```, and this works transparently with the RPC framework with full feature support, profiles asynchronous functions, and TorchScript functions. 
+ +* Design doc | [Link](https://github.com/pytorch/pytorch/issues/39675) +* Usage examples | [Link](https://pytorch.org/tutorials/recipes/distributed_rpc_profiling.html) + +## [Beta] DDP memory reduction + +As of PyTorch 1.6, DDP would put an extra copy of gradient tensors in communication buckets. This incurs additional memory overhead which is equivalent to the size of the gradients. In PyTorch 1.7, we added a ```gradient_as_bucket_view``` flag to the DDP constructor API. When this flag is set to ```True```, DDP will override ```param.grad``` as views that point of communication buckets. This not only eliminates an extra in-memory copy of gradients, but also avoids the additional read/write operations to synchronize communication buckets and ```param.grad``` values. + +* Design doc | [Link](https://github.com/pytorch/pytorch/issues/37030) +* Documentation | [Link](https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html?highlight=distributeddataparallel) + +## [Prototype] Windows support for Distributed Training + +PyTorch 1.7 brings prototype support for ```DistributedDataParallel``` and collective communications on the Windows platform. In this release, the support only covers Gloo-based ```ProcessGroup``` and ```FileStore```. +To use this feature across multiple machines, please provide a file from a shared file system in ```init_process_group```. 
+ +```python +# initialize the process group +dist.init_process_group( + "gloo", + # multi-machine example: + # init_method = "file://////{machine}/{share_folder}/file" + init_method="file:///{your local file path}", + rank=rank, + world_size=world_size +) + +model = DistributedDataParallel(local_model, device_ids=[rank]) +``` +* Design doc | [Link](https://github.com/pytorch/pytorch/issues/42095) +* Documentation | [Link](https://pytorch.org/docs/master/distributed.html#backends-that-come-with-pytorch) +* Acknowledgement | [gunandrose4u](https://github.com/gunandrose4u) + +# Mobile + +PyTorch Mobile supports both [iOS](https://pytorch.org/mobile/ios) and [Android](https://pytorch.org/mobile/android/) with binary packages available in [Cocoapods](https://cocoapods.org/) and [JCenter](https://mvnrepository.com/repos/jcenter) respectively. You can learn more about PyTorch-Mobile [here](https://pytorch.org/mobile/home/). + +## [Beta] PyTorch-Mobile Caching allocator for performance improvements + +On some mobile platforms, such as Pixel, we observed that memory is returned to the system more aggressively. This results in frequent page faults because PyTorch, being a functional framework, does not maintain state for the operators. Thus, for most ops, outputs are allocated dynamically on each execution of the op. To ameliorate the resulting performance penalties, PyTorch 1.7 provides a simple caching allocator for CPU. The allocator caches allocations by tensor size and is currently available only via the PyTorch C++ API. The caching allocator itself is owned by the client, and thus the lifetime of the allocator is also maintained by client code. Such a client-owned caching allocator can then be used with a scoped guard, ```c10::WithCPUCachingAllocatorGuard```, to enable the use of cached allocation within that scope. + +**Example usage:** + +```cpp +#include +..... +c10::CPUCachingAllocator caching_allocator; + // Owned by client code. 
Can be a member of some client class so as to tie + // the lifetime of the caching allocator to that of the class. +..... +{ + c10::optional caching_allocator_guard; + if (FLAGS_use_caching_allocator) { + caching_allocator_guard.emplace(&caching_allocator); + } + .... + model.forward(..); +} +... +``` + +**NOTE**: Caching allocator is only available on mobile builds, thus the use of caching allocator outside of mobile builds won’t be effective. + +* Documentation | [Link](https://github.com/pytorch/pytorch/blob/master/c10/mobile/CPUCachingAllocator.h#L13-L43) +* Usage examples | [Link](https://github.com/pytorch/pytorch/blob/master/binaries/speed_benchmark_torch.cc#L207) + +# torchvision + +## [Stable] Transforms now support Tensor inputs, batch computation, GPU, and TorchScript + +torchvision transforms now inherit from ```nn.Module``` and can be torchscripted and applied on torch Tensor inputs as well as on PIL images. They also support Tensors with batch dimensions and work seamlessly on CPU/GPU devices: + +```python +import torch +import torchvision.transforms as T + +# to fix random seed, use torch.manual_seed +# instead of random.seed +torch.manual_seed(12) + +transforms = torch.nn.Sequential( + T.RandomCrop(224), + T.RandomHorizontalFlip(p=0.3), + T.ConvertImageDtype(torch.float), + T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) +) +scripted_transforms = torch.jit.script(transforms) +# Note: we can similarly use T.Compose to define transforms +# transforms = T.Compose([...]) and +# scripted_transforms = torch.jit.script(torch.nn.Sequential(*transforms.transforms)) + +tensor_image = torch.randint(0, 256, size=(3, 256, 256), dtype=torch.uint8) +# works directly on Tensors +out_image1 = transforms(tensor_image) +# on the GPU +out_image1_cuda = transforms(tensor_image.cuda()) +# with batches +batched_image = torch.randint(0, 256, size=(4, 3, 256, 256), dtype=torch.uint8) +out_image_batched = transforms(batched_image) +# and has torchscript 
support +out_image2 = scripted_transforms(tensor_image) +``` + +These improvements enable the following new features: + +* support for GPU acceleration +* batched transformations e.g. as needed for videos +* transform multi-band torch tensor images (with more than 3-4 channels) +* torchscript transforms together with your model for deployment + +**Note: Exceptions for TorchScript support include ```Compose```, ```RandomChoice```, ```RandomOrder```, ```Lambda``` and those applied on PIL images, such as ```ToPILImage```**. + +## [Stable] Native image IO for JPEG and PNG formats + +torchvision 0.8.0 introduces native image reading and writing operations for JPEG and PNG formats. Those operators support TorchScript and return CxHxW tensors in uint8 format, and can thus now be part of your model for deployment in C++ environments. + +```python +import torch +from torchvision.io import read_image +# tensor_image is a CxHxW uint8 Tensor +tensor_image = read_image('path_to_image.jpeg') + +# or equivalently +from torchvision.io import read_file, decode_image +# raw_data is a 1d uint8 Tensor with the raw bytes +raw_data = read_file('path_to_image.jpeg') +tensor_image = decode_image(raw_data) + +# all operators are torchscriptable and can be +# serialized together with your model torchscript code +scripted_read_image = torch.jit.script(read_image) +``` +## [Stable] RetinaNet detection model + +This release adds pretrained models for RetinaNet with a ResNet50 backbone from [Focal Loss for Dense Object Detection](https://arxiv.org/abs/1708.02002), delivering improved accuracy on COCO val2017. + +## [Beta] New Video Reader API + +This release introduces a new video reading abstraction, which gives more fine-grained control of iteration over videos. It supports image and audio, and implements an iterator interface so that it is interoperable with other Python libraries such as itertools. 
+ +```python +from torchvision.io import VideoReader + +# stream indicates if reading from audio or video +reader = VideoReader('path_to_video.mp4', stream='video') +# can change the stream after construction +# via reader.set_current_stream + +# to read all frames in a video starting at 2 seconds +for frame in reader.seek(2): + # frame is a dict with "data" and "pts" metadata + print(frame["data"], frame["pts"]) + +# because reader is an iterator you can combine it with +# itertools +from itertools import takewhile, islice +# read 10 frames starting from 2 seconds +for frame in islice(reader.seek(2), 10): + pass + +# or to return all frames between 2 and 5 seconds +for frame in takewhile(lambda x: x["pts"] < 5, reader.seek(2)): + pass +``` +**Notes:** + +* In order to use the Video Reader API beta, you must compile torchvision from source and have ffmpeg installed on your system. +* The VideoReader API is currently released as beta and its API may change following user feedback. + +# torchaudio + +With this release, torchaudio is expanding its support for models and [end-to-end applications](https://github.com/pytorch/audio/tree/master/examples), adding a wav2letter training pipeline and end-to-end text-to-speech and source separation pipelines. Please file an issue on [github](https://github.com/pytorch/audio/issues/new?template=questions-help-support.md) to provide feedback on them. + +## [Stable] Speech Recognition + +Building on the addition of the wav2letter model for speech recognition in the last release, we’ve now added an [example wav2letter training pipeline](https://github.com/pytorch/audio/tree/master/examples/pipeline_wav2letter) with the LibriSpeech dataset. + +## [Stable] Text-to-speech + +With the goal of supporting text-to-speech applications, we added a vocoder based on the WaveRNN model, following the implementation from [this repository](https://github.com/fatchord/WaveRNN). 
The original implementation was introduced in "Efficient Neural Audio Synthesis". We also provide an [example WaveRNN training pipeline](https://github.com/pytorch/audio/tree/master/examples/pipeline_wavernn) that uses the LibriTTS dataset added to torchaudio in this release. + +## [Stable] Source Separation + +With the addition of the ConvTasNet model, based on the paper "Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation," torchaudio now also supports source separation. An [example ConvTasNet training pipeline](https://github.com/pytorch/audio/tree/master/examples/source_separation) is provided with the wsj-mix dataset. + +# Additional updates + +## PyTorch Developer Day, November 12 + +Kicking off this November, we plan to host two separate virtual PyTorch events: one for developers and users to discuss PyTorch’s future development called “Developer Day” and another for the entire PyTorch ecosystem to showcase their work, network and collaborate called “Ecosystem Day” (scheduled for early 2021). + +The PyTorch Developer Day takes place on November 12, 2020 PST with a full day of technical talks, project deep dives, and a networking event. The talks will be available to the public and the following networking event requires registration (Space is limited). + +* YouTube Premiere Link +* Facebook Watch Link +* Networking event registration + +Cheers! 
+Team PyTorch + + + From d0b92ed689fe724cc18681cae21a63edf0a6de0a Mon Sep 17 00:00:00 2001 From: andresruizfacebook <68402331+andresruizfacebook@users.noreply.github.com> Date: Mon, 26 Oct 2020 10:58:35 -0700 Subject: [PATCH 02/12] Update 2020-10-26-1.7-released.md --- _posts/2020-10-26-1.7-released.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2020-10-26-1.7-released.md b/_posts/2020-10-26-1.7-released.md index 9175150c1793..2e87dac9b823 100644 --- a/_posts/2020-10-26-1.7-released.md +++ b/_posts/2020-10-26-1.7-released.md @@ -276,7 +276,7 @@ c10::CPUCachingAllocator caching_allocator; **NOTE**: Caching allocator is only available on mobile builds, thus the use of caching allocator outside of mobile builds won’t be effective. * Documentation | [Link](https://github.com/pytorch/pytorch/blob/master/c10/mobile/CPUCachingAllocator.h#L13-L43) -* Usage examples | [Link](https://github.com/pytorch/pytorch/blob/master/binaries/speed_benchmark_torch.cc#L207) +* Usage examples [Link](https://github.com/pytorch/pytorch/blob/master/binaries/speed_benchmark_torch.cc#L207) # torchvision From 86b28b4baa0ceca247d094f7805549d5f66613e2 Mon Sep 17 00:00:00 2001 From: andresruizfacebook <68402331+andresruizfacebook@users.noreply.github.com> Date: Mon, 26 Oct 2020 11:47:07 -0700 Subject: [PATCH 03/12] Update 2020-10-26-1.7-released.md --- _posts/2020-10-26-1.7-released.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_posts/2020-10-26-1.7-released.md b/_posts/2020-10-26-1.7-released.md index 2e87dac9b823..86dc7302008e 100644 --- a/_posts/2020-10-26-1.7-released.md +++ b/_posts/2020-10-26-1.7-released.md @@ -127,8 +127,8 @@ n = 33554432 Users can now see not only operator name/inputs in the profiler output table but also where the operator is in the code. The workflow requires very little change to take advantage of this capability. 
The user uses the [autograd profiler](https://pytorch.org/docs/stable/autograd.html#profiler) as before but with optional new parameters: ```with_stack``` and ```group_by_stack_n.``` -* Details | [Link](https://github.com/pytorch/pytorch/pull/43898/) -* Documentation | [Link](https://pytorch.org/docs/stable/autograd.html) +* [Details Link](https://github.com/pytorch/pytorch/pull/43898/) +* [Documentation Link](https://pytorch.org/docs/stable/autograd.html) # Distributed Training & RPC From f482f692380ba9dde0250b2fa2112be733bf881b Mon Sep 17 00:00:00 2001 From: andresruizfacebook <68402331+andresruizfacebook@users.noreply.github.com> Date: Mon, 26 Oct 2020 12:00:38 -0700 Subject: [PATCH 04/12] Update 2020-10-26-1.7-released.md --- _posts/2020-10-26-1.7-released.md | 60 +++++++++++++++---------------- 1 file changed, 30 insertions(+), 30 deletions(-) diff --git a/_posts/2020-10-26-1.7-released.md b/_posts/2020-10-26-1.7-released.md index 86dc7302008e..57951008e4f2 100644 --- a/_posts/2020-10-26-1.7-released.md +++ b/_posts/2020-10-26-1.7-released.md @@ -8,11 +8,11 @@ Today, we’re announcing the availability of PyTorch 1.7, along with updated do A few of the highlights include: -* 1- CUDA 11 is now officially supported with binaries available at [PyTorch.org](http://pytorch.org/) -* 2- Updates and additions to profiling and performance for RPC, TorchScript, Stack traces and Benchmark utilities -* 3- (Beta) Support for NumPy compatible Fast Fourier transforms (FFT) via torch.fft -* 4- (Prototype) Support for Nvidia A100 generation GPUs and native TF32 format -* 5- (Prototype) Distributed training on Windows now supported +1. CUDA 11 is now officially supported with binaries available at [PyTorch.org](http://pytorch.org/) +2. Updates and additions to profiling and performance for RPC, TorchScript, Stack traces and Benchmark utilities +3. (Beta) Support for NumPy compatible Fast Fourier transforms (FFT) via torch.fft +4. 
(Prototype) Support for Nvidia A100 generation GPUs and native TF32 format +5. (Prototype) Distributed training on Windows now supported To reiterate, starting [PyTorch 1.6](https://pytorch.org/blog/pytorch-feature-classification-changes/), features are now classified as stable, beta and prototype. You can see the detailed announcement [here](https://pytorch.org/blog/pytorch-feature-classification-changes/). Note that the prototype features listed in this blog are available as part of this release. @@ -64,8 +64,8 @@ Note that this is necessary, **but not sufficient**, for determinism **within a See the documentation for ```torch.set_deterministic(bool)``` for the list of affected operations. -* RFC | [Link](https://github.com/pytorch/pytorch/issues/15359) -* Documentation | [Link](https://pytorch.org/docs/stable/generated/torch.set_deterministic.html) +* RFC ([Link](https://github.com/pytorch/pytorch/issues/15359)) +* Documentation ([Link](https://pytorch.org/docs/stable/generated/torch.set_deterministic.html)) # Performance & Profiling @@ -121,14 +121,14 @@ n = 33554432 0.89 ns / element ``` - Documentation | Link **Missing link** + Documentation (Link) **Missing link** ## [Beta] Stack traces added to profiler Users can now see not only operator name/inputs in the profiler output table but also where the operator is in the code. The workflow requires very little change to take advantage of this capability. 
The user uses the [autograd profiler](https://pytorch.org/docs/stable/autograd.html#profiler) as before but with optional new parameters: ```with_stack``` and ```group_by_stack_n.``` -* [Details Link](https://github.com/pytorch/pytorch/pull/43898/) -* [Documentation Link](https://pytorch.org/docs/stable/autograd.html) +* Detail ([Link](https://github.com/pytorch/pytorch/pull/43898/)) +* Documentation ([Link](https://pytorch.org/docs/stable/autograd.html)) # Distributed Training & RPC @@ -138,21 +138,21 @@ Torchelastic offers a strict superset of the current ```torch.distributed.launch By bundling ```torchelastic``` in the same docker image as PyTorch, users can start experimenting with torchelastic right-away without having to separately install ```torchelastic```. In addition to convenience, this work is a nice-to-have when adding support for elastic parameters in the existing Kubeflow’s distributed PyTorch operators. -* Usage examples and how to get started | [Link](https://pytorch.org/elastic/0.2.0/examples.html) +* Usage examples and how to get started ([Link](https://pytorch.org/elastic/0.2.0/examples.html)) ## [Beta] Support for uneven dataset inputs in DDP PyTorch 1.7 introduces a new context manager to be used in conjunction with models trained using ```torch.nn.parallel.DistributedDataParallel``` to enable training with uneven dataset size across different processes. This feature enables greater flexibility when using DDP and prevents the user from having to manually ensure dataset sizes are the same across different process. With this context manager, DDP will handle uneven dataset sizes automatically, which can prevent errors or hangs at the end of training. 
-* RFC | [Link](https://github.com/pytorch/pytorch/issues/38174) -* Documentation | [Link](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.join) +* RFC ([Link](https://github.com/pytorch/pytorch/issues/38174)) +* Documentation ([Link](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.join)) ## [Beta] NCCL Reliability - Async Error/Timeout Handling In the past, NCCL training runs would hang indefinitely due to stuck collectives, leading to a very unpleasant experience for users. This feature will abort stuck collectives and throw an exception/crash the process if a potential hang is detected. When used with something like torchelastic (which can recover the training process from the last checkpoint), users can have much greater reliability for distributed training. This feature is completely opt in and sits behind an environment variable that needs to be explicitly set in order to enable this feature (otherwise users will see the same behavior as before). 
-* Documentation | Link **Missing Link** -* Usage examples | Link **Missing Link** +* Documentation (Link) **Missing Link** +* Usage examples (Link) **Missing Link** ## [Beta] TorchScript rpc_remote and rpc_sync @@ -187,9 +187,9 @@ def script_rpc_remote_call( rref_res = rpc.remote(dst_worker_name, two_args_two_kwargs, args, kwargs) return rref_res.to_here() ``` -* Design doc | Link **Missing Link** -* Documentation | Link **Missing Link** -* Usage examples | Link **Missing Link** +* Design doc (Link) **Missing Link** +* Documentation (Link) **Missing Link** +* Usage examples (Link) **Missing Link** ## [Beta] Distributed optimizer with TorchScript support @@ -199,9 +199,9 @@ In PyTorch 1.7, we are enabling the TorchScript support in distributed optimizer Currently, the only optimizer that supports automatic conversion with TorchScript is ```Adagrad```, all other optimizers will still work as before without TorchScript support. We are working on expanding the coverage to all PyTorch optimizers. -* Design doc | Link **Missing Link** -* Documentation | Link **Missing Link** -* Usage examples | Link **Missing Link** +* Design doc (Link) **Missing Link** +* Documentation (Link) **Missing Link** +* Usage examples (Link) **Missing Link** ## [Beta] Enhancements to RPC-based Profiling @@ -213,15 +213,15 @@ Support for using the PyTorch profiler in conjunction with the RPC framework was Users are now able to use familiar profiling tools such as with ```torch.autograd.profiler.profile()``` and ```with torch.autograd.profiler.record_function```, and this works transparently with the RPC framework with full feature support, profiles asynchronous functions, and TorchScript functions. 
-* Design doc | [Link](https://github.com/pytorch/pytorch/issues/39675) -* Usage examples | [Link](https://pytorch.org/tutorials/recipes/distributed_rpc_profiling.html) +* Design doc ([Link](https://github.com/pytorch/pytorch/issues/39675)) +* Usage examples ([Link](https://pytorch.org/tutorials/recipes/distributed_rpc_profiling.html)) ## [Beta] DDP memory reduction As of PyTorch 1.6, DDP would put an extra copy of gradient tensors in communication buckets. This incurs additional memory overhead which is equivalent to the size of the gradients. In PyTorch 1.7, we added a ```gradient_as_bucket_view``` flag to the DDP constructor API. When this flag is set to ```True```, DDP will override ```param.grad``` as views that point of communication buckets. This not only eliminates an extra in-memory copy of gradients, but also avoids the additional read/write operations to synchronize communication buckets and ```param.grad``` values. -* Design doc | [Link](https://github.com/pytorch/pytorch/issues/37030) -* Documentation | [Link](https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html?highlight=distributeddataparallel) +* Design doc ([Link](https://github.com/pytorch/pytorch/issues/37030)) +* Documentation ([Link](https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html?highlight=distributeddataparallel)) ## [Prototype] Windows support for Distributed Training @@ -241,9 +241,9 @@ dist.init_process_group( model = DistributedDataParallel(local_model, device_ids=[rank]) ``` -* Design doc | [Link](https://github.com/pytorch/pytorch/issues/42095) -* Documentation | [Link](https://pytorch.org/docs/master/distributed.html#backends-that-come-with-pytorch) -* Acknowledgement | [gunandrose4u](https://github.com/gunandrose4u) +* Design doc ([Link](https://github.com/pytorch/pytorch/issues/42095)) +* Documentation ([Link](https://pytorch.org/docs/master/distributed.html#backends-that-come-with-pytorch)) +* 
Acknowledgement ([gunandrose4u](https://github.com/gunandrose4u)) # Mobile @@ -275,8 +275,8 @@ c10::CPUCachingAllocator caching_allocator; **NOTE**: Caching allocator is only available on mobile builds, thus the use of caching allocator outside of mobile builds won’t be effective. -* Documentation | [Link](https://github.com/pytorch/pytorch/blob/master/c10/mobile/CPUCachingAllocator.h#L13-L43) -* Usage examples [Link](https://github.com/pytorch/pytorch/blob/master/binaries/speed_benchmark_torch.cc#L207) +* Documentation ([Link](https://github.com/pytorch/pytorch/blob/master/c10/mobile/CPUCachingAllocator.h#L13-L43)) +* Usage examples ([Link](https://github.com/pytorch/pytorch/blob/master/binaries/speed_benchmark_torch.cc#L207)) # torchvision From 77f74bfffbdba9a4ad5335ba92b1f47b4982008f Mon Sep 17 00:00:00 2001 From: andresruizfacebook <68402331+andresruizfacebook@users.noreply.github.com> Date: Mon, 26 Oct 2020 13:21:22 -0700 Subject: [PATCH 05/12] Update 2020-10-26-1.7-released.md --- _posts/2020-10-26-1.7-released.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_posts/2020-10-26-1.7-released.md b/_posts/2020-10-26-1.7-released.md index 57951008e4f2..11098e099d03 100644 --- a/_posts/2020-10-26-1.7-released.md +++ b/_posts/2020-10-26-1.7-released.md @@ -42,13 +42,13 @@ tensor([ 6.+0.j, -2.+2.j, -2.+0.j, -2.-2.j]) tensor([12.+16.j, -8.+0.j, -4.-4.j, 0.-8.j]) ``` -* Documentation | [Link](https://pytorch.org/docs/stable/fft.html#torch-fft) +* Documentation ([Link](https://pytorch.org/docs/stable/fft.html#torch-fft)) ## [Beta] C++ Support for Transformer NN Modules Since [PyTorch 1.5](https://pytorch.org/blog/pytorch-1-dot-5-released-with-new-and-updated-apis/), we’ve continued to maintain parity between the python and C++ frontend APIs. This update allows developers to use the nn.transformer module abstraction from the C++ Frontend. 
And moreover, developers no longer need to save a module from python/JIT and load into C++ as it can now be used it in C++ directly. -* Documentation | [Link](https://pytorch.org/cppdocs/api/classtorch_1_1nn_1_1_transformer_impl.html#_CPPv4N5torch2nn15TransformerImplE) +* Documentation ([Link](https://pytorch.org/cppdocs/api/classtorch_1_1nn_1_1_transformer_impl.html#_CPPv4N5torch2nn15TransformerImplE)) ## [Beta] torch.set_deterministic From 735a9347b2842688e2f8b649f8591be194061ab6 Mon Sep 17 00:00:00 2001 From: Woo Kim <39344090+wookim3@users.noreply.github.com> Date: Mon, 26 Oct 2020 23:03:31 -0700 Subject: [PATCH 06/12] Update and rename 2020-10-26-1.7-released.md to 2020-10-26-pytorch-1.7-released.md --- ....md => 2020-10-26-pytorch-1.7-released.md} | 223 +++++++----------- 1 file changed, 81 insertions(+), 142 deletions(-) rename _posts/{2020-10-26-1.7-released.md => 2020-10-26-pytorch-1.7-released.md} (63%) diff --git a/_posts/2020-10-26-1.7-released.md b/_posts/2020-10-26-pytorch-1.7-released.md similarity index 63% rename from _posts/2020-10-26-1.7-released.md rename to _posts/2020-10-26-pytorch-1.7-released.md index 11098e099d03..1addf41b2485 100644 --- a/_posts/2020-10-26-1.7-released.md +++ b/_posts/2020-10-26-pytorch-1.7-released.md @@ -1,20 +1,26 @@ --- layout: blog_detail -title: 'PyTorch 1.7 released w/ CUDA 11, New APIs for FFTs, Windows support for Distributed training and more..' +title: 'PyTorch 1.7 released w/ CUDA 11, New APIs for FFTs, Windows support for Distributed training and more' author: Team PyTorch --- -Today, we’re announcing the availability of PyTorch 1.7, along with updated domain libraries. The PyTorch 1.7 release includes a number of new APIs, profiling and benchmarking tools, major updates to both distributed data parallel (DDP) and remote procedure call (RPC) based distributed training. 
In addition, several features moved to [stable](https://pytorch.org/docs/stable/index.html#pytorch-documentation) including custom C++ Classes, the memory profiler, the creation of custom tensor-like objects, user async functions in RPC and a number of other features in torch.distributed like Per-RPC timeout, DDP dynamic bucketing and RRef helper. +Today, we’re announcing the availability of PyTorch 1.7, along with updated domain libraries. The PyTorch 1.7 release includes a number of new APIs including support for NumPy-Compatible FFT operations, profiling tools and major updates to both distributed data parallel (DDP) and remote procedure call (RPC) based distributed training. In addition, several features moved to [stable](https://pytorch.org/docs/stable/index.html#pytorch-documentation) including custom C++ Classes, the memory profiler, extensions via custom tensor-like objects, user async functions in RPC and a number of other features in torch.distributed such as Per-RPC timeout, DDP dynamic bucketing and RRef helper. A few of the highlights include: -1. CUDA 11 is now officially supported with binaries available at [PyTorch.org](http://pytorch.org/) -2. Updates and additions to profiling and performance for RPC, TorchScript, Stack traces and Benchmark utilities -3. (Beta) Support for NumPy compatible Fast Fourier transforms (FFT) via torch.fft -4. (Prototype) Support for Nvidia A100 generation GPUs and native TF32 format -5. 
(Prototype) Distributed training on Windows now supported +* CUDA 11 is now officially supported with binaries available at [PyTorch.org](http://pytorch.org/) +* Updates and additions to profiling and performance for RPC, TorchScript and Stack traces in the autograd profiler +* (Beta) Support for NumPy compatible Fast Fourier transforms (FFT) via torch.fft +* (Prototype) Support for Nvidia A100 generation GPUs and native TF32 format +* (Prototype) Distributed training on Windows now supported +* torchvision + * (Stable) Transforms now support Tensor inputs, batch computation, GPU, and TorchScript + * (Stable) Native image I/O for JPEG and PNG formats + * (Beta) New Video Reader API +* torchaudio + * (Stable) Added support for speech rec (wav2letter), text to speech (WaveRNN) and source separation (ConvTasNet) -To reiterate, starting [PyTorch 1.6](https://pytorch.org/blog/pytorch-feature-classification-changes/), features are now classified as stable, beta and prototype. You can see the detailed announcement [here](https://pytorch.org/blog/pytorch-feature-classification-changes/). Note that the prototype features listed in this blog are available as part of this release. +To reiterate, starting PyTorch 1.6, features are now classified as stable, beta and prototype. You can see the detailed announcement [here](https://pytorch.org/blog/pytorch-feature-classification-changes/). Note that the prototype features listed in this blog are available as part of this release. Find the full release notes [here](https://github.com/pytorch/pytorch/releases). @@ -22,33 +28,33 @@ Find the full release notes [here](https://github.com/pytorch/pytorch/releases). ## [Beta] NumPy Compatible torch.fft module -FFT-related functionality is commonly used in a variety of scientific fields like signal processing. While PyTorch has historically supported a few FFT-related functions, the 1.7 release adds a new torch.fft module that implements FFT-related functions with the same API as NumPy. 
+FFT-related functionality is commonly used in a variety of scientific fields like signal processing. While PyTorch has historically supported a few FFT-related functions, the 1.7 release adds a new torch.fft module that implements FFT-related functions with the same API as NumPy. This new module must be imported to be used in the 1.7 release, since its name conflicts with the historic (and now deprecated) torch.fft function. **Example usage:** ```python -*>>>** **import* torch.fft ->>> t *=* torch*.*arange(4) +>>> import torch.fft +>>> t = torch.arange(4) >>> t tensor([0, 1, 2, 3]) ->>> torch*.*fft*.*fft(t) +>>> torch.fft.fft(t) tensor([ 6.+0.j, -2.+2.j, -2.+0.j, -2.-2.j]) ->>> t *=* tensor([0.*+*1.j, 2.*+*3.j, 4.*+*5.j, 6.*+*7.j]) ->>> torch*.*fft*.*fft(t) +>>> t = tensor([0.+1.j, 2.+3.j, 4.+5.j, 6.+7.j]) +>>> torch.fft.fft(t) tensor([12.+16.j, -8.+0.j, -4.-4.j, 0.-8.j]) ``` -* Documentation ([Link](https://pytorch.org/docs/stable/fft.html#torch-fft)) +* [Documentation](https://pytorch.org/docs/stable/fft.html#torch-fft) ## [Beta] C++ Support for Transformer NN Modules -Since [PyTorch 1.5](https://pytorch.org/blog/pytorch-1-dot-5-released-with-new-and-updated-apis/), we’ve continued to maintain parity between the python and C++ frontend APIs. This update allows developers to use the nn.transformer module abstraction from the C++ Frontend. And moreover, developers no longer need to save a module from python/JIT and load into C++ as it can now be used it in C++ directly. +Since [PyTorch 1.5](https://pytorch.org/blog/pytorch-1-dot-5-released-with-new-and-updated-apis/), we’ve continued to maintain parity between the python and C++ frontend APIs. This update allows developers to use the nn.transformer module abstraction from the C++ Frontend. And moreover, developers no longer need to save a module from python/JIT and load into C++ as it can now be used it in C++ directly. 
-* Documentation ([Link](https://pytorch.org/cppdocs/api/classtorch_1_1nn_1_1_transformer_impl.html#_CPPv4N5torch2nn15TransformerImplE)) +* [Documentation] (https://pytorch.org/cppdocs/api/classtorch_1_1nn_1_1_transformer_impl.html#_CPPv4N5torch2nn15TransformerImplE)) ## [Beta] torch.set_deterministic @@ -64,71 +70,17 @@ Note that this is necessary, **but not sufficient**, for determinism **within a See the documentation for ```torch.set_deterministic(bool)``` for the list of affected operations. -* RFC ([Link](https://github.com/pytorch/pytorch/issues/15359)) -* Documentation ([Link](https://pytorch.org/docs/stable/generated/torch.set_deterministic.html)) +* [RFC](https://github.com/pytorch/pytorch/issues/15359) +* [Documentation](https://pytorch.org/docs/stable/generated/torch.set_deterministic.html) # Performance & Profiling - -## [Beta] Benchmark Utilities - -Benchmarks for pull requests are ad-hoc and typically inadequate currently. They frequently involve starting a timer, running an op or ops many times, and then taking the average and comparing before and after numbers to determine if a change improves or regresses performance. The PyTorch 1.7 release includes new utilities that allow the user to take accurate performance measurements, and provide composable tools to help with both benchmark formulation and post processing. - -**Example usage:** - -```python -"""Demonstration of how size can affect efficiency. - -1K: Task is small, overhead dominates. -1M: Good performance. -16M: Task does not fit into L2 cache which degrades performance. 
-""" -import torch -import torch.utils.benchmark as benchmark_utils - -for n in [1024, 1024 ** 2, 32 * 1024 ** 2]: - timer = benchmark_utils.Timer( - "torch.dot(x, y)", - description=f"n = {n}", - globals={ - "x": torch.ones((n,)), - "y": torch.ones((n,)), - } - ) - m = timer.blocked_autorange(min_run_time=1) - print(f"{m}\n{m.median / n * 1e9:5.2f} ns / element\n") - -torch.dot(x, y) -n = 1024 - Median: 2.37 us - IQR: 0.07 us (2.34 to 2.41) - 415 measurements, 1000 runs per measurement, 1 thread - 2.31 ns / element - - -torch.dot(x, y) -n = 1048576 - Median: 265.56 us - IQR: 5.56 us (263.17 to 268.74) - 374 measurements, 10 runs per measurement, 1 thread - 0.25 ns / element - - -torch.dot(x, y) -n = 33554432 - Median: 29.80 ms - IQR: 0.58 ms (29.53 to 30.10) - 34 measurements, 1 runs per measurement, 1 thread - 0.89 ns / element - ``` - - Documentation (Link) **Missing link** ## [Beta] Stack traces added to profiler -Users can now see not only operator name/inputs in the profiler output table but also where the operator is in the code. The workflow requires very little change to take advantage of this capability. The user uses the [autograd profiler](https://pytorch.org/docs/stable/autograd.html#profiler) as before but with optional new parameters: ```with_stack``` and ```group_by_stack_n.``` +Users can now see not only operator name/inputs in the profiler output table but also where the operator is in the code. The workflow requires very little change to take advantage of this capability. The user uses the [autograd profiler](https://pytorch.org/docs/stable/autograd.html#profiler) as before but with optional new parameters: ```with_stack``` and ```group_by_stack_n```. Caution: regular profiling runs should not use this feature as it adds significant overhead. 
-* Detail ([Link](https://github.com/pytorch/pytorch/pull/43898/)) -* Documentation ([Link](https://pytorch.org/docs/stable/autograd.html)) +* [Detail](https://github.com/pytorch/pytorch/pull/43898/) +* [Documentation](https://pytorch.org/docs/stable/autograd.html) # Distributed Training & RPC @@ -136,72 +88,66 @@ Users can now see not only operator name/inputs in the profiler output table but Torchelastic offers a strict superset of the current ```torch.distributed.launch``` CLI with the added features for fault-tolerance and elasticity. If the user is not be interested in fault-tolerance, they can get the exact functionality/behavior parity by setting ```max_restarts=0``` with the added convenience of auto-assigned ```RANK``` and ```MASTER_ADDR|PORT``` (versus manually specified in ```torch.distributed.launch)```. -By bundling ```torchelastic``` in the same docker image as PyTorch, users can start experimenting with torchelastic right-away without having to separately install ```torchelastic```. In addition to convenience, this work is a nice-to-have when adding support for elastic parameters in the existing Kubeflow’s distributed PyTorch operators. +By bundling ```torchelastic``` in the same docker image as PyTorch, users can start experimenting with TorchElastic right-away without having to separately install ```torchelastic```. In addition to convenience, this work is a nice-to-have when adding support for elastic parameters in the existing Kubeflow’s distributed PyTorch operators. -* Usage examples and how to get started ([Link](https://pytorch.org/elastic/0.2.0/examples.html)) +* [Usage examples and how to get started](https://pytorch.org/elastic/0.2.0/examples.html) ## [Beta] Support for uneven dataset inputs in DDP PyTorch 1.7 introduces a new context manager to be used in conjunction with models trained using ```torch.nn.parallel.DistributedDataParallel``` to enable training with uneven dataset size across different processes. 
This feature enables greater flexibility when using DDP and prevents the user from having to manually ensure dataset sizes are the same across different process. With this context manager, DDP will handle uneven dataset sizes automatically, which can prevent errors or hangs at the end of training. -* RFC ([Link](https://github.com/pytorch/pytorch/issues/38174)) -* Documentation ([Link](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.join)) +* [RFC](https://github.com/pytorch/pytorch/issues/38174) +* [Documentation](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.join) ## [Beta] NCCL Reliability - Async Error/Timeout Handling -In the past, NCCL training runs would hang indefinitely due to stuck collectives, leading to a very unpleasant experience for users. This feature will abort stuck collectives and throw an exception/crash the process if a potential hang is detected. When used with something like torchelastic (which can recover the training process from the last checkpoint), users can have much greater reliability for distributed training. This feature is completely opt in and sits behind an environment variable that needs to be explicitly set in order to enable this feature (otherwise users will see the same behavior as before). +In the past, NCCL training runs would hang indefinitely due to stuck collectives, leading to a very unpleasant experience for users. This feature will abort stuck collectives and throw an exception/crash the process if a potential hang is detected. When used with something like torchelastic (which can recover the training process from the last checkpoint), users can have much greater reliability for distributed training. 
This feature is completely opt-in and sits behind an environment variable that needs to be explicitly set in order to enable this functionality (otherwise users will see the same behavior as before). -* Documentation (Link) **Missing Link** -* Usage examples (Link) **Missing Link** +* [Documentation](https://pytorch.org/docs/stable/distributed.html?highlight=init_process_group#torch.distributed.init_process_group) +* [RFC](https://github.com/pytorch/pytorch/issues/46874) -## [Beta] TorchScript rpc_remote and rpc_sync +## [Beta] TorchScript ```rpc_remote``` and ```rpc_sync``` -```rpc_async``` has been available in TorchScript as a beta feature in prior releases. For PyTorch 1.7, this functionality will be extended the remaining two core RPC APIs, ```rpc_sync``` and ```rpc_remote```. This will complete the major RPC APIs targeted for support in TorchScript and can possibly improve application performance by allowing users to use these APIs within TorchScript. +```torch.distributed.rpc.rpc_async``` has been available in TorchScript in prior releases. For PyTorch 1.7, this functionality will be extended the remaining two core RPC APIs, ```torch.distributed.rpc.rpc_sync``` and ```torch.distributed.rpc.remote```. This will complete the major RPC APIs targeted for support in TorchScript, it allows users to use the existing python RPC APIs within TorchScript (in a script function or script method, which releases the python Global Interpreter Lock) and could possibly improve application performance in multithreaded environment. 
-More specifically, support the following use case: - -```python -from torch.distributed import rpc -from torch import Tensor -from typing import Dict, Tuple - -@torch.jit.script -def script_rpc_async_call( - dst_worker_name: str, args: Tuple[Tensor, Tensor], kwargs: Dict[str, Tensor] -): - fut = rpc.rpc_async(dst_worker_name, two_args_two_kwargs, args, kwargs) - ret = fut.wait() - return ret - -@torch.jit.script -def script_rpc_sync_call( - dst_worker_name: str, args: Tuple[Tensor, Tensor], kwargs: Dict[str, Tensor] -): - res = rpc.rpc_sync(dst_worker_name, two_args_two_kwargs, args, kwargs) - return res - -@torch.jit.script -def script_rpc_remote_call( - dst_worker_name: str, args: Tuple[Tensor, Tensor], kwargs: Dict[str, Tensor] -): - rref_res = rpc.remote(dst_worker_name, two_args_two_kwargs, args, kwargs) - return rref_res.to_here() - ``` -* Design doc (Link) **Missing Link** -* Documentation (Link) **Missing Link** -* Usage examples (Link) **Missing Link** +* [Documentation](https://pytorch.org/docs/stable/rpc.html#rpc) +* [Usage examples](https://github.com/pytorch/pytorch/blob/58ed60c259834e324e86f3e3118e4fcbbfea8dd1/torch/testing/_internal/distributed/rpc/jit/rpc_test.py#L505-L525) ## [Beta] Distributed optimizer with TorchScript support -PyTorch provides a broad set of optimizers for training algorithms, and these have been used repeatedly as part of the python API. However, users often want to use multithreaded training instead of multiprocess training as it provides better resource utilization and efficiency in the context of large scale distributed training (e.g. Distributed Model Parallel) or any RPIC-based training application). Users couldn’t do this with with distributed optimizer before because we need to get rid of the python Global Interpreter Lock (GIL) limitation to achieve this. +PyTorch provides a broad set of optimizers for training algorithms, and these have been used repeatedly as part of the python API. 
However, users often want to use multithreaded training instead of multiprocess training as it provides better resource utilization and efficiency in the context of large scale distributed training (e.g. Distributed Model Parallel) or any RPC-based training application). Users couldn’t do this with with distributed optimizer before because we need to get rid of the python Global Interpreter Lock (GIL) limitation to achieve this. + +In PyTorch 1.7, we are enabling the TorchScript support in distributed optimizer to remove the GIL, and make it possible to run optimizer in multithreaded applications. The new distributed optimizer has the exact same interface as before but it automatically converts optimizers within each worker into TorchScript to make each GIL free. This is done by leveraging a functional optimizer concept and allowing the distributed optimizer to convert the computational portion of the optimizer into TorchScript. This will help use cases like distributed model parallel training and improve performance using multithreading. -In PyTorch 1.7, we are enabling the TorchScript support in distributed optimizer to remove the GIL, and make it possible to run optimizer in multithreaded applications. The new distributed optimizer has exact same interface as before, but it automatically turn optimizers in each worker into TorchScript to make it GIL free. This is done by leveraging a functional optimizer concept to allow distributed optimizer turn the computation part of the optimizer into TorchScript. This will help use cases like distributed model parallel training to improve their performance with multithreading. +Currently, the only optimizer that supports automatic conversion with TorchScript is ```Adagrad``` and all other optimizers will still work as before without TorchScript support. We are working on expanding the coverage to all PyTorch optimizers and expect more to come in future releases. 
The usage to enable TorchScript support is automatic and exactly the same with existing python APIs, here is an example of how to use this: -Currently, the only optimizer that supports automatic conversion with TorchScript is ```Adagrad```, all other optimizers will still work as before without TorchScript support. We are working on expanding the coverage to all PyTorch optimizers. +```python +import torch.distributed.autograd as dist_autograd +import torch.distributed.rpc as rpc +from torch import optim +from torch.distributed.optim import DistributedOptimizer + +with dist_autograd.context() as context_id: + # Forward pass. + rref1 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 3)) + rref2 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1)) + loss = rref1.to_here() + rref2.to_here() + + # Backward pass. + dist_autograd.backward(context_id, [loss.sum()]) + + # Optimizer, pass in optim.Adagrad, DistributedOptimizer will + # automatically convert/compile it to TorchScript (GIL-free) + dist_optim = DistributedOptimizer( + optim.Adagrad, + [rref1, rref2], + lr=0.05, + ) + dist_optim.step(context_id) + ``` -* Design doc (Link) **Missing Link** -* Documentation (Link) **Missing Link** -* Usage examples (Link) **Missing Link** +* [Documentation](https://pytorch.org/docs/stable/rpc.html#module-torch.distributed.optim) +* [RFC](https://github.com/pytorch/pytorch/issues/46883) ## [Beta] Enhancements to RPC-based Profiling @@ -213,19 +159,13 @@ Support for using the PyTorch profiler in conjunction with the RPC framework was Users are now able to use familiar profiling tools such as with ```torch.autograd.profiler.profile()``` and ```with torch.autograd.profiler.record_function```, and this works transparently with the RPC framework with full feature support, profiles asynchronous functions, and TorchScript functions. 
-* Design doc ([Link](https://github.com/pytorch/pytorch/issues/39675)) -* Usage examples ([Link](https://pytorch.org/tutorials/recipes/distributed_rpc_profiling.html)) - -## [Beta] DDP memory reduction - -As of PyTorch 1.6, DDP would put an extra copy of gradient tensors in communication buckets. This incurs additional memory overhead which is equivalent to the size of the gradients. In PyTorch 1.7, we added a ```gradient_as_bucket_view``` flag to the DDP constructor API. When this flag is set to ```True```, DDP will override ```param.grad``` as views that point of communication buckets. This not only eliminates an extra in-memory copy of gradients, but also avoids the additional read/write operations to synchronize communication buckets and ```param.grad``` values. - -* Design doc ([Link](https://github.com/pytorch/pytorch/issues/37030)) -* Documentation ([Link](https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html?highlight=distributeddataparallel)) +* [Design doc](https://github.com/pytorch/pytorch/issues/39675) +* [Usage examples](https://pytorch.org/tutorials/recipes/distributed_rpc_profiling.html) ## [Prototype] Windows support for Distributed Training PyTorch 1.7 brings prototype support for ```DistributedDataParallel``` and collective communications on the Windows platform. In this release, the support only covers Gloo-based ```ProcessGroup``` and ```FileStore```. + To use this feature across multiple machines, please provide a file from a shared file system in ```init_process_group```. 
```python @@ -241,15 +181,15 @@ dist.init_process_group( model = DistributedDataParallel(local_model, device_ids=[rank]) ``` -* Design doc ([Link](https://github.com/pytorch/pytorch/issues/42095)) -* Documentation ([Link](https://pytorch.org/docs/master/distributed.html#backends-that-come-with-pytorch)) +* [Design doc](https://github.com/pytorch/pytorch/issues/42095) +* [Documentation](https://pytorch.org/docs/master/distributed.html#backends-that-come-with-pytorch) * Acknowledgement ([gunandrose4u](https://github.com/gunandrose4u)) # Mobile -PyTorch Mobile supports both [iOS](https://pytorch.org/mobile/ios) and [Android](https://pytorch.org/mobile/android/) with binary packages available in [Cocoapods](https://cocoapods.org/) and [JCenter](https://mvnrepository.com/repos/jcenter) respectively. You can learn more about PyTorch-Mobile [here](https://pytorch.org/mobile/home/). +PyTorch Mobile supports both [iOS](https://pytorch.org/mobile/ios) and [Android](https://pytorch.org/mobile/android/) with binary packages available in [Cocoapods](https://cocoapods.org/) and [JCenter](https://mvnrepository.com/repos/jcenter) respectively. You can learn more about PyTorch Mobile [here](https://pytorch.org/mobile/home/). -## [Beta] PyTorch-Mobile Caching allocator for performance improvements +## [Beta] PyTorch Mobile Caching allocator for performance improvements On some mobile platforms, such as Pixel, we observed that memory is returned to the system more aggressively. This results in frequent page faults as PyTorch being a functional framework does not maintain state for the operators. Thus outputs are allocated dynamically on each execution of the op, for the most ops. To ameliorate performance penalties due to this, PyTorch 1.7 provides a simple caching allocator for CPU. The allocator caches allocations by tensor sizes and, is currently, available only via the PyTorch C++ API. 
The caching allocator itself is owned by client and thus the lifetime of the allocator is also maintained by client code. Such a client owned caching allocator can then be used with scoped guard, ```c10::WithCPUCachingAllocatorGuard```, to enable the use of cached allocation within that scope. @@ -275,15 +215,14 @@ c10::CPUCachingAllocator caching_allocator; **NOTE**: Caching allocator is only available on mobile builds, thus the use of caching allocator outside of mobile builds won’t be effective. -* Documentation ([Link](https://github.com/pytorch/pytorch/blob/master/c10/mobile/CPUCachingAllocator.h#L13-L43)) -* Usage examples ([Link](https://github.com/pytorch/pytorch/blob/master/binaries/speed_benchmark_torch.cc#L207)) +* [Documentation](https://github.com/pytorch/pytorch/blob/master/c10/mobile/CPUCachingAllocator.h#L13-L43) +* [Usage examples](https://github.com/pytorch/pytorch/blob/master/binaries/speed_benchmark_torch.cc#L207) # torchvision ## [Stable] Transforms now support Tensor inputs, batch computation, GPU, and TorchScript torchvision transforms are now inherited from ```nn.Module``` and can be torchscripted and applied on torch Tensor inputs as well as on PIL images. They also support Tensors with batch dimensions and work seamlessly on CPU/GPU devices: - ```python import torch import torchvision.transforms as T @@ -322,11 +261,11 @@ These improvements enable the following new features: * transform multi-band torch tensor images (with more than 3-4 channels) * torchscript transforms together with your model for deployment -**Note: Exceptions for TorchScript support includes ```Compose```, ```RandomChoice```, ```RandomOrder```, ```Lambda``` and those applied on PIL images, such as ```ToPILImage```**. +**Note:** Exceptions for TorchScript support includes ```Compose```, ```RandomChoice```, ```RandomOrder```, ```Lambda``` and those applied on PIL images, such as ```ToPILImage```. 
## [Stable] Native image IO for JPEG and PNG formats -torchvision 0.8.0 introduces native image reading and writing operations for JPEG and PNG formats. Those operators support TorchScript and return CxHxW tensors in uint8 format, and can thus be now part of your model for deployment in C++ environments. +torchvision 0.8.0 introduces native image reading and writing operations for JPEG and PNG formats. Those operators support TorchScript and return ```CxHxW``` tensors in ```uint8``` format, and can thus be now part of your model for deployment in C++ environments. ```python from torchvision.io import read_image @@ -385,7 +324,7 @@ for frame in takewhile(lambda x: x["pts"] < 5, reader): With this release, torchaudio is expanding its support for models and [end-to-end applications](https://github.com/pytorch/audio/tree/master/examples), adding a wav2letter training pipeline and end-to-end text-to-speech and source separation pipelines. Please file an issue on [github](https://github.com/pytorch/audio/issues/new?template=questions-help-support.md) to provide feedback on them. -## [Stable] *Speech Recognition* +## [Stable] Speech Recognition Building on the addition of the wav2letter model for speech recognition in the last release, we’ve now added an [example wav2letter training pipeline](https://github.com/pytorch/audio/tree/master/examples/pipeline_wav2letter) with the LibriSpeech dataset. 
From 8ae0a5e24786e16a245ad0035cdcc7239dbe2648 Mon Sep 17 00:00:00 2001 From: Woo Kim <39344090+wookim3@users.noreply.github.com> Date: Tue, 27 Oct 2020 06:59:12 -0700 Subject: [PATCH 07/12] Update 2020-10-26-pytorch-1.7-released.md --- _posts/2020-10-26-pytorch-1.7-released.md | 77 +++-------------------- 1 file changed, 8 insertions(+), 69 deletions(-) diff --git a/_posts/2020-10-26-pytorch-1.7-released.md b/_posts/2020-10-26-pytorch-1.7-released.md index 1addf41b2485..171a113ed110 100644 --- a/_posts/2020-10-26-pytorch-1.7-released.md +++ b/_posts/2020-10-26-pytorch-1.7-released.md @@ -6,8 +6,7 @@ author: Team PyTorch Today, we’re announcing the availability of PyTorch 1.7, along with updated domain libraries. The PyTorch 1.7 release includes a number of new APIs including support for NumPy-Compatible FFT operations, profiling tools and major updates to both distributed data parallel (DDP) and remote procedure call (RPC) based distributed training. In addition, several features moved to [stable](https://pytorch.org/docs/stable/index.html#pytorch-documentation) including custom C++ Classes, the memory profiler, extensions via custom tensor-like objects, user async functions in RPC and a number of other features in torch.distributed such as Per-RPC timeout, DDP dynamic bucketing and RRef helper. 
-A few of the highlights include: - +A few of the highlights include: * CUDA 11 is now officially supported with binaries available at [PyTorch.org](http://pytorch.org/) * Updates and additions to profiling and performance for RPC, TorchScript and Stack traces in the autograd profiler * (Beta) Support for NumPy compatible Fast Fourier transforms (FFT) via torch.fft @@ -18,22 +17,19 @@ A few of the highlights include: * (Stable) Native image I/O for JPEG and PNG formats * (Beta) New Video Reader API * torchaudio - * (Stable) Added support for speech rec (wav2letter), text to speech (WaveRNN) and source separation (ConvTasNet) + * (Stable) Added support for speech rec (wav2letter), text to speech (WaveRNN) and source separation (ConvTasNet) To reiterate, starting PyTorch 1.6, features are now classified as stable, beta and prototype. You can see the detailed announcement [here](https://pytorch.org/blog/pytorch-feature-classification-changes/). Note that the prototype features listed in this blog are available as part of this release. Find the full release notes [here](https://github.com/pytorch/pytorch/releases). # Front End APIs - ## [Beta] NumPy Compatible torch.fft module - FFT-related functionality is commonly used in a variety of scientific fields like signal processing. While PyTorch has historically supported a few FFT-related functions, the 1.7 release adds a new torch.fft module that implements FFT-related functions with the same API as NumPy. This new module must be imported to be used in the 1.7 release, since its name conflicts with the historic (and now deprecated) torch.fft function. 
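To make concrete what a NumPy-compatible `fft` returns, here is a minimal, dependency-free sketch of the discrete Fourier transform definition. It is not how `torch.fft` is implemented; `naive_dft` is a hypothetical helper introduced only for illustration, and the input simply mirrors the values of `torch.arange(4)`.

```python
import cmath


def naive_dft(values):
    """O(n^2) DFT using the convention X[k] = sum_n x[n] * exp(-2j*pi*k*n/N).

    Hypothetical helper for illustration only. numpy.fft.fft and
    torch.fft.fft use fast algorithms but follow this same sign and
    (unnormalized forward) scaling convention by default.
    """
    n_points = len(values)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * n / n_points)
            for n, x in enumerate(values))
        for k in range(n_points)
    ]


# Same values as torch.arange(4)
spectrum = naive_dft([0, 1, 2, 3])

# Results carry small floating-point noise, so compare with a tolerance
# rather than relying on exact equality.
for k, coefficient in enumerate(spectrum):
    print(k, coefficient)
```

Because the module follows NumPy's API, the same comparison against `numpy.fft.fft` or `torch.fft.fft` should agree up to floating-point error.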
**Example usage:** - ```python >>> import torch.fft >>> t = torch.arange(4) @@ -51,17 +47,13 @@ tensor([12.+16.j, -8.+0.j, -4.-4.j, 0.-8.j]) * [Documentation](https://pytorch.org/docs/stable/fft.html#torch-fft) ## [Beta] C++ Support for Transformer NN Modules - Since [PyTorch 1.5](https://pytorch.org/blog/pytorch-1-dot-5-released-with-new-and-updated-apis/), we’ve continued to maintain parity between the python and C++ frontend APIs. This update allows developers to use the nn.transformer module abstraction from the C++ Frontend. Moreover, developers no longer need to save a module from python/JIT and load it into C++, as it can now be used in C++ directly. - -* [Documentation] (https://pytorch.org/cppdocs/api/classtorch_1_1nn_1_1_transformer_impl.html#_CPPv4N5torch2nn15TransformerImplE)) +* [Documentation](https://pytorch.org/cppdocs/api/classtorch_1_1nn_1_1_transformer_impl.html#_CPPv4N5torch2nn15TransformerImplE) ## [Beta] torch.set_deterministic - Reproducibility (bit-for-bit determinism) may help identify errors when debugging or testing a program. To facilitate reproducibility, PyTorch 1.7 adds the ```torch.set_deterministic(bool)``` function that can direct PyTorch operators to select deterministic algorithms when available, and to throw a runtime error if an operation may result in nondeterministic behavior. By default, the flag this function controls is false and there is no change in behavior, meaning PyTorch may implement its operations nondeterministically. More precisely, when this flag is true: - * Operations known to not have a deterministic implementation throw a runtime error; * Operations with deterministic variants use those variants (usually with a performance penalty versus the non-deterministic version); and * ```torch.backends.cudnn.deterministic = True``` is set.
@@ -69,52 +61,38 @@ More precisely, when this flag is true: Note that this is necessary, **but not sufficient**, for determinism **within a single run of a PyTorch program**. Other sources of randomness like random number generators, unknown operations, or asynchronous or distributed computation may still cause nondeterministic behavior. See the documentation for ```torch.set_deterministic(bool)``` for the list of affected operations. - * [RFC](https://github.com/pytorch/pytorch/issues/15359) * [Documentation](https://pytorch.org/docs/stable/generated/torch.set_deterministic.html) # Performance & Profiling - - ## [Beta] Stack traces added to profiler - -Users can now see not only operator name/inputs in the profiler output table but also where the operator is in the code. The workflow requires very little change to take advantage of this capability. The user uses the [autograd profiler](https://pytorch.org/docs/stable/autograd.html#profiler) as before but with optional new parameters: ```with_stack``` and ```group_by_stack_n```. Caution: regular profiling runs should not use this feature as it adds significant overhead. - +## [Beta] Stack traces added to profiler +Users can now see not only operator name/inputs in the profiler output table but also where the operator is in the code. The workflow requires very little change to take advantage of this capability. The user uses the [autograd profiler](https://pytorch.org/docs/stable/autograd.html#profiler) as before but with optional new parameters: ```with_stack``` and ```group_by_stack_n```. Caution: regular profiling runs should not use this feature as it adds significant overhead. 
* [Detail](https://github.com/pytorch/pytorch/pull/43898/) * [Documentation](https://pytorch.org/docs/stable/autograd.html) # Distributed Training & RPC - ## [Stable] TorchElastic now bundled into PyTorch docker image - TorchElastic offers a strict superset of the current ```torch.distributed.launch``` CLI with added features for fault-tolerance and elasticity. If the user is not interested in fault-tolerance, they can get exact functionality/behavior parity by setting ```max_restarts=0``` with the added convenience of auto-assigned ```RANK``` and ```MASTER_ADDR|PORT``` (versus manually specified in ```torch.distributed.launch```). By bundling ```torchelastic``` in the same docker image as PyTorch, users can start experimenting with TorchElastic right away without having to separately install ```torchelastic```. In addition to convenience, this work is a nice-to-have when adding support for elastic parameters in Kubeflow’s existing distributed PyTorch operators. - * [Usage examples and how to get started](https://pytorch.org/elastic/0.2.0/examples.html) ## [Beta] Support for uneven dataset inputs in DDP - PyTorch 1.7 introduces a new context manager to be used in conjunction with models trained using ```torch.nn.parallel.DistributedDataParallel``` to enable training with uneven dataset sizes across different processes. This feature enables greater flexibility when using DDP and prevents the user from having to manually ensure dataset sizes are the same across different processes. With this context manager, DDP will handle uneven dataset sizes automatically, which can prevent errors or hangs at the end of training.
- * [RFC](https://github.com/pytorch/pytorch/issues/38174) * [Documentation](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.join) ## [Beta] NCCL Reliability - Async Error/Timeout Handling - In the past, NCCL training runs would hang indefinitely due to stuck collectives, leading to a very unpleasant experience for users. This feature will abort stuck collectives and throw an exception/crash the process if a potential hang is detected. When used with something like torchelastic (which can recover the training process from the last checkpoint), users can have much greater reliability for distributed training. This feature is completely opt-in and sits behind an environment variable that needs to be explicitly set in order to enable this functionality (otherwise users will see the same behavior as before). - -* [Documentation](https://pytorch.org/docs/stable/distributed.html?highlight=init_process_group#torch.distributed.init_process_group) * [RFC](https://github.com/pytorch/pytorch/issues/46874) +* [Documentation](https://pytorch.org/docs/stable/distributed.html?highlight=init_process_group#torch.distributed.init_process_group) ## [Beta] TorchScript ```rpc_remote``` and ```rpc_sync``` - ```torch.distributed.rpc.rpc_async``` has been available in TorchScript in prior releases. For PyTorch 1.7, this functionality will be extended to the remaining two core RPC APIs, ```torch.distributed.rpc.rpc_sync``` and ```torch.distributed.rpc.remote```. This will complete the major RPC APIs targeted for support in TorchScript. It allows users to use the existing python RPC APIs within TorchScript (in a script function or script method, which releases the python Global Interpreter Lock) and could possibly improve application performance in multithreaded environments.
- * [Documentation](https://pytorch.org/docs/stable/rpc.html#rpc) * [Usage examples](https://github.com/pytorch/pytorch/blob/58ed60c259834e324e86f3e3118e4fcbbfea8dd1/torch/testing/_internal/distributed/rpc/jit/rpc_test.py#L505-L525) ## [Beta] Distributed optimizer with TorchScript support - PyTorch provides a broad set of optimizers for training algorithms, and these have been used repeatedly as part of the python API. However, users often want to use multithreaded training instead of multiprocess training as it provides better resource utilization and efficiency in the context of large scale distributed training (e.g. Distributed Model Parallel or any RPC-based training application). Users couldn’t do this with the distributed optimizer before, because doing so requires getting rid of the python Global Interpreter Lock (GIL) limitation. In PyTorch 1.7, we are enabling TorchScript support in the distributed optimizer to remove the GIL limitation and make it possible to run the optimizer in multithreaded applications. The new distributed optimizer has the exact same interface as before, but it automatically converts the optimizers within each worker into TorchScript to make each one GIL-free. This is done by leveraging a functional optimizer concept and allowing the distributed optimizer to convert the computational portion of the optimizer into TorchScript. This will help use cases like distributed model parallel training and improve performance using multithreading. @@ -145,25 +123,20 @@ with dist_autograd.context() as context_id: ) dist_optim.step(context_id) ``` - -* [Documentation](https://pytorch.org/docs/stable/rpc.html#module-torch.distributed.optim) * [RFC](https://github.com/pytorch/pytorch/issues/46883) +* [Documentation](https://pytorch.org/docs/stable/rpc.html#module-torch.distributed.optim) ## [Beta] Enhancements to RPC-based Profiling - Support for using the PyTorch profiler in conjunction with the RPC framework was first introduced in PyTorch 1.6.
In PyTorch 1.7, the following enhancements have been made: - * Implemented better support for profiling TorchScript functions over RPC * Achieved parity in terms of profiler features that work with RPC * Added support for asynchronous RPC functions on the server side (functions decorated with ```rpc.functions.async_execution```). Users are now able to use familiar profiling tools such as ```with torch.autograd.profiler.profile()``` and ```with torch.autograd.profiler.record_function```, which work transparently with the RPC framework with full feature support, including profiling of asynchronous functions and TorchScript functions. - * [Design doc](https://github.com/pytorch/pytorch/issues/39675) * [Usage examples](https://pytorch.org/tutorials/recipes/distributed_rpc_profiling.html) ## [Prototype] Windows support for Distributed Training - PyTorch 1.7 brings prototype support for ```DistributedDataParallel``` and collective communications on the Windows platform. In this release, the support only covers Gloo-based ```ProcessGroup``` and ```FileStore```. To use this feature across multiple machines, please provide a file from a shared file system in ```init_process_group```. @@ -186,13 +159,10 @@ model = DistributedDataParallel(local_model, device_ids=[rank]) * Acknowledgement ([gunandrose4u](https://github.com/gunandrose4u)) # Mobile - PyTorch Mobile supports both [iOS](https://pytorch.org/mobile/ios) and [Android](https://pytorch.org/mobile/android/) with binary packages available in [Cocoapods](https://cocoapods.org/) and [JCenter](https://mvnrepository.com/repos/jcenter) respectively. You can learn more about PyTorch Mobile [here](https://pytorch.org/mobile/home/). ## [Beta] PyTorch Mobile Caching allocator for performance improvements - On some mobile platforms, such as Pixel, we observed that memory is returned to the system more aggressively. This results in frequent page faults as PyTorch, being a functional framework, does not maintain state for the operators.
Thus outputs are allocated dynamically on each execution of the op, for most ops. To ameliorate performance penalties due to this, PyTorch 1.7 provides a simple caching allocator for CPU. The allocator caches allocations by tensor size and is currently available only via the PyTorch C++ API. The caching allocator itself is owned by the client, and thus the lifetime of the allocator is also maintained by client code. Such a client-owned caching allocator can then be used with a scoped guard, ```c10::WithCPUCachingAllocatorGuard```, to enable the use of cached allocation within that scope. - **Example usage:** ```cpp @@ -212,16 +182,12 @@ c10::CPUCachingAllocator caching_allocator; } ... ``` - **NOTE**: The caching allocator is only available on mobile builds; using it outside of mobile builds won’t be effective. - * [Documentation](https://github.com/pytorch/pytorch/blob/master/c10/mobile/CPUCachingAllocator.h#L13-L43) * [Usage examples](https://github.com/pytorch/pytorch/blob/master/binaries/speed_benchmark_torch.cc#L207) # torchvision - ## [Stable] Transforms now support Tensor inputs, batch computation, GPU, and TorchScript - torchvision transforms now inherit from ```nn.Module``` and can be torchscripted and applied on torch Tensor inputs as well as on PIL images. They also support Tensors with batch dimensions and work seamlessly on CPU/GPU devices: ```python import torch @@ -253,20 +219,15 @@ out_image_batched = transforms(batched_image) # and has torchscript support out_image2 = scripted_transforms(tensor_image) ``` - These improvements enable the following new features: - * support for GPU acceleration * batched transformations e.g.
as needed for videos * transform multi-band torch tensor images (with more than 3-4 channels) * torchscript transforms together with your model for deployment - **Note:** Exceptions to TorchScript support include ```Compose```, ```RandomChoice```, ```RandomOrder```, ```Lambda``` and those applied on PIL images, such as ```ToPILImage```. ## [Stable] Native image IO for JPEG and PNG formats - torchvision 0.8.0 introduces native image reading and writing operations for JPEG and PNG formats. Those operators support TorchScript and return ```CxHxW``` tensors in ```uint8``` format, and can thus now be part of your model for deployment in C++ environments. - ```python from torchvision.io import read_image @@ -284,13 +245,10 @@ tensor_image = decode_image(raw_data) scripted_read_image = torch.jit.script(read_image) ``` ## [Stable] RetinaNet detection model - This release adds pretrained models for RetinaNet with a ResNet50 backbone from [Focal Loss for Dense Object Detection](https://arxiv.org/abs/1708.02002), delivering improved accuracy on COCO val2017. ## [Beta] New Video Reader API - This release introduces a new video reading abstraction, which gives more fine-grained control of iteration over videos. It supports image and audio, and implements an iterator interface so that it is interoperable with other python libraries such as itertools. - ```python from torchvision.io import VideoReader @@ -316,40 +274,21 @@ for frame in takewhile(lambda x: x["pts"] < 5, reader): pass ``` **Notes:** - * In order to use the Video Reader API beta, you must compile torchvision from source and have ffmpeg installed in your system. * The VideoReader API is currently released as beta and its API may change following user feedback.
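The iterator interface is what makes the reader composable with `itertools`. The sketch below illustrates that pattern without torchvision or ffmpeg. `FakeFrameReader` is a hypothetical stand-in, not part of torchvision; the `pts` key follows the example above, while the `data` payload and the frame rate are assumptions made for illustration.

```python
from itertools import islice, takewhile


class FakeFrameReader:
    """Hypothetical stand-in for a video reader: an iterator over frame dicts."""

    def __init__(self, num_frames, fps):
        self.num_frames = num_frames
        self.fps = fps
        self._index = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._index >= self.num_frames:
            raise StopIteration
        # "pts" is the presentation timestamp in seconds.
        frame = {"data": f"frame-{self._index}", "pts": self._index / self.fps}
        self._index += 1
        return frame


reader = FakeFrameReader(num_frames=100, fps=10)

# Consume frames until the presentation timestamp reaches 2 seconds.
first_two_seconds = list(takewhile(lambda x: x["pts"] < 2, reader))

# takewhile has already pulled (and dropped) the first frame that failed the
# predicate, so the next read resumes one frame later.
next_three = list(islice(reader, 3))

print(len(first_two_seconds), [frame["pts"] for frame in next_three])
```

One design point worth noting: because `takewhile` consumes the first non-matching item, code that needs that boundary frame should check the timestamp inside an explicit `for` loop instead.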
# torchaudio - With this release, torchaudio is expanding its support for models and [end-to-end applications](https://github.com/pytorch/audio/tree/master/examples), adding a wav2letter training pipeline and end-to-end text-to-speech and source separation pipelines. Please file an issue on [github](https://github.com/pytorch/audio/issues/new?template=questions-help-support.md) to provide feedback on them. ## [Stable] Speech Recognition - Building on the addition of the wav2letter model for speech recognition in the last release, we’ve now added an [example wav2letter training pipeline](https://github.com/pytorch/audio/tree/master/examples/pipeline_wav2letter) with the LibriSpeech dataset. ## [Stable] Text-to-speech - With the goal of supporting text-to-speech applications, we added a vocoder based on the WaveRNN model, based on the implementation from [this repository](https://github.com/fatchord/WaveRNN). The original implementation was introduced in "Efficient Neural Audio Synthesis". We also provide an [example WaveRNN training pipeline](https://github.com/pytorch/audio/tree/master/examples/pipeline_wavernn) that uses the LibriTTS dataset added to torchaudio in this release. ## [Stable] Source Separation - With the addition of the ConvTasNet model, based on the paper "Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation," torchaudio now also supports source separation. An [example ConvTasNet training pipeline](https://github.com/pytorch/audio/tree/master/examples/source_separation) is provided with the wsj-mix dataset. -# Additional updates - -## PyTorch Developer Day, November 12 - -Kicking off this November, we plan to host two separate virtual PyTorch events: one for developers and users to discuss PyTorch’s future development called “Developer Day” and another for the entire PyTorch ecosystem to showcase their work, network and collaborate called “Ecosystem Day” (scheduled for early 2021). 
- -The PyTorch Developer Day takes place on November 12, 2020 PST with a full day of technical talks, project deep dives, and a networking event. The talks will be available to the public and the following networking event requires registration (Space is limited). - -* YouTube Premiere Link -* Facebook Watch Link -* Networking event registration - Cheers! -Team PyTorch - - +Team PyTorch From 97af4967f2f5054a347c288dcdc080cce7d9cbcb Mon Sep 17 00:00:00 2001 From: Woo Kim <39344090+wookim3@users.noreply.github.com> Date: Tue, 27 Oct 2020 07:07:15 -0700 Subject: [PATCH 08/12] Update 2020-10-26-pytorch-1.7-released.md --- _posts/2020-10-26-pytorch-1.7-released.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2020-10-26-pytorch-1.7-released.md b/_posts/2020-10-26-pytorch-1.7-released.md index 171a113ed110..58a641ed29c8 100644 --- a/_posts/2020-10-26-pytorch-1.7-released.md +++ b/_posts/2020-10-26-pytorch-1.7-released.md @@ -245,7 +245,7 @@ tensor_image = decode_image(raw_data) scripted_read_image = torch.jit.script(read_image) ``` ## [Stable] RetinaNet detection model -This release adds pretrained models for RetinaNet with a ResNet50 backbone from [Focal Loss for Dense Object Detection](https://arxiv.org/abs/1708.02002), delivering improved accuracy on COCO val2017. +This release adds pretrained models for RetinaNet with a ResNet50 backbone from [Focal Loss for Dense Object Detection](https://arxiv.org/abs/1708.02002). ## [Beta] New Video Reader API This release introduces a new video reading abstraction, which gives more fine-grained control of iteration over videos. It supports image and audio, and implements an iterator interface so that it is interoperable with other the python libraries such as itertools. 
From aeee04ba448d4cc0abcb727ad91262c0d60f3412 Mon Sep 17 00:00:00 2001 From: Woo Kim <39344090+wookim3@users.noreply.github.com> Date: Tue, 27 Oct 2020 07:51:14 -0700 Subject: [PATCH 09/12] Update 2020-10-26-pytorch-1.7-released.md --- _posts/2020-10-26-pytorch-1.7-released.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2020-10-26-pytorch-1.7-released.md b/_posts/2020-10-26-pytorch-1.7-released.md index 58a641ed29c8..528670fc27d8 100644 --- a/_posts/2020-10-26-pytorch-1.7-released.md +++ b/_posts/2020-10-26-pytorch-1.7-released.md @@ -146,7 +146,7 @@ To use this feature across multiple machines, please provide a file from a share dist.init_process_group( "gloo", # multi-machien example: - # init_method = "file://////{machine}/{share_folder}/file" + # init_method = "file:///{machine}/{share_folder}/file" init_method="file:///{your local file path}", rank=rank, world_size=world_size From 028c1ec976b0177ae2b997bfc90db6a084e27c8d Mon Sep 17 00:00:00 2001 From: Woo Kim <39344090+wookim3@users.noreply.github.com> Date: Tue, 27 Oct 2020 09:21:45 -0700 Subject: [PATCH 10/12] Update 2020-10-26-pytorch-1.7-released.md --- _posts/2020-10-26-pytorch-1.7-released.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2020-10-26-pytorch-1.7-released.md b/_posts/2020-10-26-pytorch-1.7-released.md index 528670fc27d8..58a641ed29c8 100644 --- a/_posts/2020-10-26-pytorch-1.7-released.md +++ b/_posts/2020-10-26-pytorch-1.7-released.md @@ -146,7 +146,7 @@ To use this feature across multiple machines, please provide a file from a share dist.init_process_group( "gloo", # multi-machien example: - # init_method = "file:///{machine}/{share_folder}/file" + # init_method = "file://////{machine}/{share_folder}/file" init_method="file:///{your local file path}", rank=rank, world_size=world_size From e6278bf2e156597ae4d78e74f091ec6b539e9fbf Mon Sep 17 00:00:00 2001 From: Woo Kim <39344090+wookim3@users.noreply.github.com> Date: Tue, 27 Oct 2020 
09:26:25 -0700 Subject: [PATCH 11/12] Update 2020-10-26-pytorch-1.7-released.md --- _posts/2020-10-26-pytorch-1.7-released.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2020-10-26-pytorch-1.7-released.md b/_posts/2020-10-26-pytorch-1.7-released.md index 58a641ed29c8..a3679b25e166 100644 --- a/_posts/2020-10-26-pytorch-1.7-released.md +++ b/_posts/2020-10-26-pytorch-1.7-released.md @@ -145,7 +145,7 @@ To use this feature across multiple machines, please provide a file from a share # initialize the process group dist.init_process_group( "gloo", - # multi-machien example: + # multi-machine example: # init_method = "file://////{machine}/{share_folder}/file" init_method="file:///{your local file path}", rank=rank, From 5beb6224bb2435e9cd039392294ed98b3a21d85c Mon Sep 17 00:00:00 2001 From: Woo Kim <39344090+wookim3@users.noreply.github.com> Date: Tue, 27 Oct 2020 09:31:29 -0700 Subject: [PATCH 12/12] Rename 2020-10-26-pytorch-1.7-released.md to 2020-10-27-pytorch-1.7-released.md --- ...pytorch-1.7-released.md => 2020-10-27-pytorch-1.7-released.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename _posts/{2020-10-26-pytorch-1.7-released.md => 2020-10-27-pytorch-1.7-released.md} (100%) diff --git a/_posts/2020-10-26-pytorch-1.7-released.md b/_posts/2020-10-27-pytorch-1.7-released.md similarity index 100% rename from _posts/2020-10-26-pytorch-1.7-released.md rename to _posts/2020-10-27-pytorch-1.7-released.md