A few weeks ago, TorchVision v0.11 was released, packed with numerous new primitives, models and training recipe improvements which allowed us to achieve state-of-the-art (SOTA) results. The project was dubbed “[TorchVision with Batteries Included](https://github.com/pytorch/vision/issues/3911)” and aimed to modernize our library. We wanted to enable researchers to reproduce papers and conduct research more easily by using common building blocks. Moreover, we aspired to provide the necessary tools to applied ML practitioners to train their models on their own data using the same SOTA techniques as in research. Finally, we wanted to refresh our pre-trained weights and offer better off-the-shelf models to our users, hoping that they would build better applications.
Though there is still much work to be done, we wanted to share with you some exciting results from the above work. We will showcase how one can use the new tools included in TorchVision to achieve state-of-the-art results on a highly competitive and well-studied architecture such as ResNet50 [[1]](https://arxiv.org/abs/1512.03385). We will share the exact recipe used to improve our baseline by over 4.7 accuracy points, reaching a final top-1 accuracy of 80.9%, and share the journey of deriving the new training process. Moreover, we will show that this recipe generalizes well to other model variants and families. We hope that the above will influence future research on developing stronger, generalizable training methodologies and will inspire the community to adopt and contribute to our efforts.
## The Results
Using our new training recipe found on ResNet50, we’ve refreshed the pre-trained weights of the following models:
| Model | Accuracy@1 | Accuracy@5 |
|----------|:--------:|:----------:|
| ResNet50 | 80.858 | 95.434 |
| ResNet101 | 81.886 | 95.780 |
| ResNet152 | 82.284 | 96.002 |
| ResNeXt50-32x4d | 81.198 | 95.340 |
Note that the accuracy of all models except ResNet50 can be further improved by adjusting their training parameters slightly, but our focus was to have a single robust recipe which performs well for all.
**UPDATE:** We have refreshed the majority of TorchVision’s popular classification models; you can find the details in this [blog post](https://pytorch.org/blog/introducing-torchvision-new-multi-weight-support-api/).
There are currently two ways to use the latest weights of the model. The first is the new prototype mechanism we are working on, which extends the model builders to support multiple weights; the second, for those who don’t want to use a prototype API, is to access the new weights directly.
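As a rough sketch, here is how loading one of the refreshed weight sets looks with the multi-weight API, assuming torchvision ≥ 0.13, where the mechanism that was still a prototype at the time of writing has since stabilized (the enum and method names below reflect that later API):

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

# Load the refreshed "V2" ResNet50 weights via the multi-weight API.
weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights)
model.eval()

# The weights object also carries the matching inference preprocessing.
preprocess = weights.transforms()
batch = preprocess(torch.rand(3, 300, 300)).unsqueeze(0)
with torch.no_grad():
    probabilities = model(batch).softmax(dim=1)
```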
## The Training Recipe
Our goal was to use the newly introduced primitives of TorchVision to derive a new strong training recipe which achieves state-of-the-art results for the vanilla ResNet50 architecture when trained from scratch on ImageNet with no additional external data. Though by using architecture-specific tricks [[2]](https://arxiv.org/abs/1812.01187) one could further improve the accuracy, we’ve decided not to include them so that the recipe can be used for other architectures. Our recipe heavily focuses on simplicity and builds upon work by FAIR [[3]](https://arxiv.org/abs/2103.06877), [[4]](https://arxiv.org/abs/2106.14881), [[5]](https://arxiv.org/abs/1906.06423), [[6]](https://arxiv.org/abs/2012.12877), [[7]](https://arxiv.org/abs/2110.00476). Our findings align with the parallel study of Wightman et al. [[7]](https://arxiv.org/abs/2110.00476), who also report major accuracy improvements by focusing on the training recipes.
Without further ado, here are the main parameters of our recipe:
*The tuning of the inference size was done on top of the last model. See below for details.
** Community contribution done after the release of the article. See below for details.
## Baseline
Our baseline is the previously released ResNet50 model of TorchVision. It was trained with the following recipe:
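For orientation, a minimal sketch of what such a classic baseline setup looks like in PyTorch is shown below; the values are the conventional ImageNet defaults (SGD with momentum, step-wise learning rate decay), used purely for illustration and not a verbatim copy of the recipe’s flags:

```python
import torch
import torchvision

# Conventional ImageNet baseline ingredients (illustrative values only):
# plain SGD with momentum and a step-wise learning rate schedule.
model = torchvision.models.resnet50()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```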
## Random Erasing
Another data augmentation technique known to help the classification accuracy is Random Erasing [[10]](https://arxiv.org/abs/1708.04896), [[11]](https://arxiv.org/abs/1708.04552). Often paired with Automatic Augmentation methods, it usually yields additional improvements in accuracy due to its regularization effect. In our experiments we tuned only the probability of applying the method via a grid search and found that it’s beneficial to keep its probability at low levels, typically around 10%.
Here is the extra parameter introduced on top of the previous:
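As a minimal sketch of what this corresponds to with torchvision’s transforms API (the exact flag name used by the training script may differ; the ~10% probability comes from the grid search described above):

```python
from torchvision import transforms

# Random Erasing is applied with a low probability (~10%). Note that it
# operates on tensors, so it is placed after ToTensor() in the pipeline.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.1),
])
```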
## Mixup and Cutmix
Two data augmentation techniques often used to produce SOTA results are Mixup and Cutmix [[13]](https://arxiv.org/abs/1710.09412), [[14]](https://arxiv.org/abs/1905.04899). They both provide strong regularization effects by softening not only the labels but also the images. In our setup we found it beneficial to apply one of them randomly with equal probability. Each is parameterized with a hyperparameter alpha, which controls the shape of the Beta distribution from which the smoothing probability is sampled. We did a very limited grid search, focusing primarily on common values proposed in the papers.
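For intuition, here is a simplified sketch of the Mixup mechanism; the `alpha` default below is illustrative rather than the tuned value, and one-hot (or smoothed) targets are assumed:

```python
import torch

def mixup(images, targets, alpha=0.2):
    """Blend a batch with a shuffled copy of itself (simplified sketch)."""
    # The mixing coefficient is drawn from a Beta(alpha, alpha) distribution.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_targets = lam * targets + (1.0 - lam) * targets[perm]
    return mixed_images, mixed_targets
```

Cutmix follows the same pattern, but instead of interpolating pixel values it pastes a rectangular patch from the shuffled images and mixes the labels in proportion to the patch area.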
Below you will find the optimal values for the alpha parameters of the two techniques:
## FixRes mitigations
An important property identified early in our experiments is the fact that the models performed significantly better if the resolution used during validation was increased from the 224x224 used in training. This effect is studied in detail in the FixRes paper [[5]](https://arxiv.org/abs/1906.06423), and two mitigations are proposed: a) one could reduce the training resolution so that the accuracy at the validation resolution is maximized, or b) one could fine-tune the model in a second training phase so that it adjusts to the target resolution. Since we didn’t want to introduce a two-phase training, we went for option a). This means that we reduced the train crop size from 224 and used a grid search to find the value that maximizes validation accuracy at a resolution of 224x224.
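In transform terms, option a) boils down to training on crops smaller than the 224x224 evaluation resolution. The sketch below uses a placeholder crop size, since the actual value was selected by the grid search mentioned above:

```python
from torchvision import transforms

train_crop_size = 192  # hypothetical placeholder; the real value came from a grid search

# Train on crops smaller than the validation resolution...
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(train_crop_size),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# ...while validation keeps the standard resize + central 224x224 crop.
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```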
Below you can see the optimal value used on our recipe:
Unlike all other steps of the process which involved training models with different parameters, this optimization was done on top of the final model.
Below you can see the optimal value used on our recipe:
```
val_resize_size=232,
```
The above is an optimization which improved our accuracy by 0.224 points. It’s worth noting that the optimal value for ResNet50 also works best for ResNet101, ResNet152 and ResNeXt50, which hints that it generalizes across models:
<div style="display: flex">
<img src="/assets/images/sota/ResNet152 Inference Resize.png" alt="Best ResNet50 trained with 224 Resolution" width="30%">
</div>
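Assuming the standard resize-then-center-crop evaluation pipeline at the 224x224 validation resolution discussed earlier, the tuned value corresponds roughly to the following preprocessing (the normalization statistics are the usual ImageNet ones):

```python
from torchvision import transforms

# Resize the shorter side to 232, then evaluate on the central 224x224 crop.
inference_transform = transforms.Compose([
    transforms.Resize(232),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```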
## [UPDATE] Repeated Augmentation
Repeated Augmentation [[15]](https://arxiv.org/abs/1901.09335), [[16]](https://arxiv.org/abs/1902.05509) is another technique which can improve the overall accuracy and has been used by other strong recipes such as those in [[6]](https://arxiv.org/abs/2012.12877) and [[7]](https://arxiv.org/abs/2110.00476). Tal Ben-Nun, a community contributor, has [further improved](https://github.com/pytorch/vision/pull/5201) upon our original recipe by proposing to train the model with 4 repetitions. His contribution came after the release of this article.
Below you can see the optimal value used on our recipe:
```
ra_sampler=True,
ra_reps=4,
```
The above is the final optimization which improved our accuracy by 0.184 points.
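For intuition only, here is a naive sketch of what a repeated-augmentation sampler does; it is not the reference implementation, but it shows the core idea of feeding several independently augmented views of each selected image:

```python
import torch
from torch.utils.data import Sampler

class NaiveRepeatedAugmentationSampler(Sampler):
    """Yield every selected index `reps` times so that the dataset's random
    transforms produce `reps` different views of the same image."""

    def __init__(self, dataset_len: int, reps: int = 4):
        self.dataset_len = dataset_len
        self.reps = reps

    def __iter__(self):
        order = torch.randperm(self.dataset_len).tolist()
        for idx in order:
            yield from [idx] * self.reps

    def __len__(self):
        return self.dataset_len * self.reps
```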
## Optimizations that were tested but not adopted
During the early stages of our research, we experimented with additional techniques, configurations and optimizations. Since our target was to keep our recipe as simple as possible, we decided not to include anything that didn’t provide a significant improvement. Here are a few approaches that we tried but that didn’t make it into our final recipe:
## Acknowledgements
We would like to thank Piotr Dollar, Mannat Singh and Hugo Touvron for providing their insights and feedback during the development of the recipe and for their previous research work on which our recipe is based. Their support was invaluable for achieving the above result. Moreover, we would like to thank Prabhat Roy, Kai Zhang, Yiwen Song, Joel Schlosser, Ilqar Ramazanli, Francisco Massa, Mannat Singh, Xiaoliang Dai, Samuel Gabriel, Allen Goodman and Tal Ben-Nun for their contributions to the Batteries Included project.
## References
12. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, Zbigniew Wojna. “Rethinking the Inception Architecture for Computer Vision”
13. Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz. “mixup: Beyond Empirical Risk Minimization”