@@ -2,9 +2,9 @@ TorchVision Object Detection Finetuning Tutorial
 ====================================================

 .. tip::
-   To get the most of this tutorial, we suggest using this
-   `Colab Version <https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/torchvision_finetuning_instance_segmentation.ipynb>`__.
-   This will allow you to experiment with the information presented below.
+   To get the most out of this tutorial, we suggest using this
+   `Colab Version <https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/torchvision_finetuning_instance_segmentation.ipynb>`__.
+   This will allow you to experiment with the information presented below.

 For this tutorial, we will be finetuning a pre-trained `Mask
 R-CNN <https://arxiv.org/abs/1703.06870>`__ model in the `Penn-Fudan
@@ -57,11 +57,14 @@ training and evaluation, and will use the evaluation scripts from
 ``pycocotools`` which can be installed with ``pip install pycocotools``.

 .. note::
-   For Windows, please install ``pycocotools`` from `gautamchitnis <https://github.com/gautamchitnis/cocoapi>`__ with command
+   For Windows, please install ``pycocotools`` from `gautamchitnis <https://github.com/gautamchitnis/cocoapi>`__ with command

    ``pip install git+https://github.com/gautamchitnis/cocoapi.git@cocodataset-master#subdirectory=PythonAPI``

-One note on the ``labels``. The model considers class ``0`` as background. If your dataset does not contain the background class, you should not have ``0`` in your ``labels``. For example, assuming you have just two classes, *cat* and *dog*, you can define ``1`` (not ``0``) to represent *cats* and ``2`` to represent *dogs*. So, for instance, if one of the images has both classes, your ``labels`` tensor should look like ``[1,2]``.
+One note on the ``labels``. The model considers class ``0`` as background. If your dataset does not contain the background class,
+you should not have ``0`` in your ``labels``. For example, assuming you have just two classes, *cat* and *dog*, you can
+define ``1`` (not ``0``) to represent *cats* and ``2`` to represent *dogs*. So, for instance, if one of the images has both
+classes, your ``labels`` tensor should look like ``[1,2]``.

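For illustration, a ``target`` for one image that contains a single *cat* and a
single *dog* could look like the following sketch (the box coordinates are made up):

.. code:: python

    import torch

    # hypothetical two-class dataset: 1 = cat, 2 = dog; 0 is implicitly the background
    target = {
        "boxes": torch.tensor([[10., 20., 150., 200.],      # cat box (xmin, ymin, xmax, ymax)
                               [160., 30., 300., 220.]]),   # dog box
        "labels": torch.tensor([1, 2], dtype=torch.int64),  # never use 0 for a real object
    }
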
Additionally, if you want to use aspect ratio grouping during training
(so that each batch only contains images with similar aspect ratios),
@@ -94,7 +97,7 @@ have the following folder structure:
        FudanPed00003.png
        FudanPed00004.png

-Here is one example of a pair of images and segmentation masks
+Here is one example of a pair of images and segmentation masks

 .. image:: ../../_static/img/tv_tutorial/tv_image01.png

@@ -103,13 +106,21 @@ Here is one example of a pair of images and segmentation masks
 So each image has a corresponding
 segmentation mask, where each color corresponds to a different instance.
 Let’s write a ``torch.utils.data.Dataset`` class for this dataset.
+In the code below, we are wrapping images, bounding boxes and masks into
+``torchvision.datapoints`` structures so that we will be able to apply torchvision
+built-in transformations (`new Transforms API <https://pytorch.org/vision/stable/transforms.html>`_)
+that cover the object detection and segmentation tasks.
+For more information about torchvision datapoints, see `this documentation <https://pytorch.org/vision/stable/datapoints.html>`_.

 .. code:: python

     import os
-    import numpy as np
     import torch
-    from PIL import Image
+
+    from torchvision.io import read_image
+    from torchvision.ops.boxes import masks_to_boxes
+    from torchvision import datapoints as dp
+    from torchvision.transforms.v2 import functional as F


     class PennFudanDataset(torch.utils.data.Dataset):
@@ -125,48 +136,36 @@ Let’s write a ``torch.utils.data.Dataset`` class for this dataset.
             # load images and masks
             img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
             mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
-            img = Image.open(img_path).convert("RGB")
-            # note that we haven't converted the mask to RGB,
-            # because each color corresponds to a different instance
-            # with 0 being background
-            mask = Image.open(mask_path)
-            # convert the PIL Image into a numpy array
-            mask = np.array(mask)
+            img = read_image(img_path)
+            mask = read_image(mask_path)
             # instances are encoded as different colors
-            obj_ids = np.unique(mask)
+            obj_ids = torch.unique(mask)
             # first id is the background, so remove it
             obj_ids = obj_ids[1:]
+            num_objs = len(obj_ids)

             # split the color-encoded mask into a set
             # of binary masks
-            masks = mask == obj_ids[:, None, None]
+            masks = (mask == obj_ids[:, None, None]).to(dtype=torch.uint8)

             # get bounding box coordinates for each mask
-            num_objs = len(obj_ids)
-            boxes = []
-            for i in range(num_objs):
-                pos = np.nonzero(masks[i])
-                xmin = np.min(pos[1])
-                xmax = np.max(pos[1])
-                ymin = np.min(pos[0])
-                ymax = np.max(pos[0])
-                boxes.append([xmin, ymin, xmax, ymax])
-
-            # convert everything into a torch.Tensor
-            boxes = torch.as_tensor(boxes, dtype=torch.float32)
+            boxes = masks_to_boxes(masks)
+
             # there is only one class
             labels = torch.ones((num_objs,), dtype=torch.int64)
-            masks = torch.as_tensor(masks, dtype=torch.uint8)

             image_id = torch.tensor([idx])
             area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
             # suppose all instances are not crowd
             iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

+            # Wrap sample and targets into torchvision datapoints:
+            img = dp.Image(img)
+
             target = {}
-            target["boxes"] = boxes
+            target["boxes"] = dp.BoundingBoxes(boxes, format="XYXY", canvas_size=F.get_size(img))
+            target["masks"] = dp.Mask(masks)
             target["labels"] = labels
-            target["masks"] = masks
             target["image_id"] = image_id
             target["area"] = area
             target["iscrowd"] = iscrowd
@@ -189,7 +188,7 @@ In this tutorial, we will be using `Mask
 R-CNN <https://arxiv.org/abs/1703.06870>`__, which is based on top of
 `Faster R-CNN <https://arxiv.org/abs/1506.01497>`__. Faster R-CNN is a
 model that predicts both bounding boxes and class scores for potential
-objects in the image.
+objects in the image.

 .. image:: ../../_static/img/tv_tutorial/tv_image03.png

@@ -199,7 +198,7 @@ instance.

 .. image:: ../../_static/img/tv_tutorial/tv_image04.png

-There are two common
+There are two common
 situations where one might want
 to modify one of the available models in torchvision modelzoo. The first
 is when we want to start from a pre-trained model, and just finetune the
@@ -229,7 +228,7 @@ way of doing it:
     # get number of input features for the classifier
     in_features = model.roi_heads.box_predictor.cls_score.in_features
     # replace the pre-trained head with a new one
-    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
+    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

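For orientation, the snippet this hunk belongs to reads roughly as follows in the
tutorial; ``num_classes = 2`` assumes the person-vs-background setup used for Penn-Fudan:

.. code:: python

    import torchvision
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

    # load an object detection model pre-trained on COCO
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

    # replace the classifier with a new one for our number of classes
    num_classes = 2  # 1 class (person) + background
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
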
2 - Modifying the model to add a different backbone
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -252,7 +251,7 @@ way of doing it:
     # location, with 5 different sizes and 3 different aspect
     # ratios. We have a Tuple[Tuple[int]] because each feature
     # map could potentially have different sizes and
-    # aspect ratios
+    # aspect ratios
     anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                        aspect_ratios=((0.5, 1.0, 2.0),))

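For orientation, a sketch of the full backbone-swapping example that surrounds this
hunk, assuming a MobileNetV2 feature extractor with 1280 output channels:

.. code:: python

    import torchvision
    from torchvision.models.detection import FasterRCNN
    from torchvision.models.detection.rpn import AnchorGenerator

    # load a pre-trained classification model and keep only its features
    backbone = torchvision.models.mobilenet_v2(weights="DEFAULT").features
    # FasterRCNN needs to know the number of output channels of the backbone
    backbone.out_channels = 1280

    # the RPN will generate 5 x 3 anchors per spatial location
    anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                       aspect_ratios=((0.5, 1.0, 2.0),))

    # feature maps used for RoI cropping, and the size of the crop
    roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0'],
                                                    output_size=7,
                                                    sampling_ratio=2)

    # put the pieces together inside a FasterRCNN model
    model = FasterRCNN(backbone,
                       num_classes=2,
                       rpn_anchor_generator=anchor_generator,
                       box_roi_pool=roi_pooler)
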
@@ -316,48 +315,52 @@ Putting everything together

 In ``references/detection/``, we have a number of helper functions to
 simplify training and evaluating detection models. Here, we will use
-``references/detection/engine.py``, ``references/detection/utils.py``
-and ``references/detection/transforms.py``. Just copy everything under
-``references/detection`` to your folder and use them here.
+``references/detection/engine.py`` and ``references/detection/utils.py``.
+Just copy everything under ``references/detection`` to your folder and use them here.
+
+Since v0.15.0, torchvision provides a `new Transforms API <https://pytorch.org/vision/stable/transforms.html>`_
+to easily write data augmentation pipelines for Object Detection and Segmentation tasks.

 Let’s write some helper functions for data augmentation /
 transformation:

 .. code:: python

-    import transforms as T
+    from torchvision.transforms import v2 as T
+

     def get_transform(train):
         transforms = []
-        transforms.append(T.PILToTensor())
-        transforms.append(T.ConvertImageDtype(torch.float))
+        transforms.append(T.ToImage())
         if train:
-            transforms.append(T.RandomHorizontalFlip(0.5))
+            transforms.append(T.RandomHorizontalFlip(0.5))
+        transforms.append(T.ToDtype(torch.float, scale=True))
+        transforms.append(T.ToPureTensor())
         return T.Compose(transforms)


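Because the dataset wraps its targets in datapoints, a composed v2 pipeline receives
the image and the target together, so a random flip moves the boxes and masks along
with the pixels. A small, illustrative sketch:

.. code:: python

    transform = get_transform(train=True)

    raw_dataset = PennFudanDataset("PennFudanPed", None)  # no transforms applied yet
    img, target = raw_dataset[0]
    img_t, target_t = transform(img, target)              # boxes and masks flip with the image
    print(img_t.dtype, img_t.shape)                       # torch.float32, (3, H, W)
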
 Testing ``forward()`` method (Optional)
 ---------------------------------------

-Before iterating over the dataset, it's good to see what the model
+Before iterating over the dataset, it's good to see what the model
 expects during training and inference time on sample data.

 .. code:: python

     model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
     dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
     data_loader = torch.utils.data.DataLoader(
-        dataset, batch_size=2, shuffle=True, num_workers=4,
-        collate_fn=utils.collate_fn)
+        dataset, batch_size=2, shuffle=True, num_workers=4,
+        collate_fn=utils.collate_fn)
     # For Training
-    images,targets = next(iter(data_loader))
+    images, targets = next(iter(data_loader))
     images = list(image for image in images)
     targets = [{k: v for k, v in t.items()} for t in targets]
-    output = model(images,targets)   # Returns losses and detections
+    output = model(images, targets)   # Returns losses and detections
     # For inference
     model.eval()
     x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
-    predictions = model(x)           # Returns predictions
+    predictions = model(x)           # Returns predictions

 Let’s now write the main function which performs the training and the
 validation:
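That function does not appear in the hunks shown here. A condensed sketch of it,
assuming the ``train_one_epoch`` and ``evaluate`` helpers from
``references/detection/engine.py`` and a ``get_model_instance_segmentation`` helper
defined as in the full tutorial:

.. code:: python

    import torch

    from engine import train_one_epoch, evaluate
    import utils

    def main():
        device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
        num_classes = 2  # our dataset has two classes only: background and person

        # our dataset with the defined transformations, split into train and test
        dataset = PennFudanDataset("PennFudanPed", get_transform(train=True))
        dataset_test = PennFudanDataset("PennFudanPed", get_transform(train=False))
        indices = torch.randperm(len(dataset)).tolist()
        dataset = torch.utils.data.Subset(dataset, indices[:-50])
        dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])

        data_loader = torch.utils.data.DataLoader(
            dataset, batch_size=2, shuffle=True, num_workers=4,
            collate_fn=utils.collate_fn)
        data_loader_test = torch.utils.data.DataLoader(
            dataset_test, batch_size=1, shuffle=False, num_workers=4,
            collate_fn=utils.collate_fn)

        # model, optimizer and learning-rate schedule
        model = get_model_instance_segmentation(num_classes).to(device)
        params = [p for p in model.parameters() if p.requires_grad]
        optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

        for epoch in range(10):
            # train for one epoch, printing every 10 iterations, then evaluate
            train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
            lr_scheduler.step()
            evaluate(model, data_loader_test, device=device)
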
@@ -504,12 +507,12 @@ After training for 10 epochs, I got the following metrics
     Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.818

 But what do the predictions look like? Let’s take one image in the
-dataset and verify
+dataset and verify

 .. image:: ../../_static/img/tv_tutorial/tv_image05.png

 The trained model predicts 9
-instances of person in this image, let’s see a couple of them:
+instances of person in this image; let’s see a couple of them:

 .. image:: ../../_static/img/tv_tutorial/tv_image06.png

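A sketch of how such a visualization can be produced with torchvision's drawing
utilities; ``model``, ``device`` and an evaluation image ``image`` (a float tensor in
``[0, 1]``) are assumed to come from the training code, and the ``0.7`` score
threshold is an arbitrary choice:

.. code:: python

    import torch
    from torchvision.utils import draw_bounding_boxes, draw_segmentation_masks

    model.eval()
    with torch.no_grad():
        prediction = model([image.to(device)])[0]
        prediction = {k: v.cpu() for k, v in prediction.items()}

    img_uint8 = (image * 255).to(torch.uint8)
    keep = prediction["scores"] > 0.7                     # arbitrary confidence threshold
    boxes = prediction["boxes"][keep]
    masks = (prediction["masks"][keep] > 0.5).squeeze(1)  # binarize the soft instance masks

    vis = draw_bounding_boxes(img_uint8, boxes, colors="red", width=3)
    vis = draw_segmentation_masks(vis, masks, alpha=0.6, colors="blue")
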
@@ -531,7 +534,7 @@ For a more complete example, which includes multi-machine / multi-gpu
 training, check ``references/detection/train.py``, which is present in
 the torchvision repo.

-You can download a full source file for this tutorial
-`here <https://pytorch.org/tutorials/_static/tv-training-code.py>`__.
-
+You can download a full source file for this tutorial
+`here <https://pytorch.org/tutorials/_static/tv-training-code.py>`__.
+