diff --git a/intermediate_source/torchvision_tutorial.rst b/intermediate_source/torchvision_tutorial.rst
index c82b8097e93..93fcfd3d247 100644
--- a/intermediate_source/torchvision_tutorial.rst
+++ b/intermediate_source/torchvision_tutorial.rst
@@ -32,7 +32,7 @@ should return:
   - ``boxes (FloatTensor[N, 4])``: the coordinates of the ``N``
     bounding boxes in ``[x0, y0, x1, y1]`` format, ranging from ``0``
     to ``W`` and ``0`` to ``H``
-  - ``labels (Int64Tensor[N])``: the label for each bounding box
+  - ``labels (Int64Tensor[N])``: the label for each bounding box. ``0`` always represents the background class.
   - ``image_id (Int64Tensor[1])``: an image identifier. It should be
     unique between all the images in the dataset, and is used during
     evaluation
@@ -56,6 +56,8 @@
 If your model returns the above methods, they will make it work for both
 training and evaluation, and will use the evaluation scripts from ``pycocotools``.
 
+One note on ``labels``: the model considers class ``0`` to be background. If your dataset does not contain the background class, you should not have ``0`` in your ``labels``. For example, assuming you have just two classes, *cat* and *dog*, you can define ``1`` (not ``0``) to represent *cats* and ``2`` to represent *dogs*. So, for instance, if one of the images has both classes, your ``labels`` tensor should look like ``[1, 2]``.
+
 Additionally, if you want to use aspect ratio grouping during training
 (so that each batch only contains images with similar aspect ratio),
 then it is recommended to also implement a ``get_height_and_width``
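
For example, here is a minimal sketch of a dataset that follows this labeling convention. It is illustrative only: the ``CatDogDataset`` class, the image size, and the box coordinates are made up, and a real dataset would load images and annotations from disk.

.. code-block:: python

    import torch
    from torch.utils.data import Dataset

    class CatDogDataset(Dataset):
        # Hypothetical two-class detection dataset: 1 = cat, 2 = dog.
        # 0 is reserved for the background class, so it never appears
        # in ``labels``.

        def __len__(self):
            return 1

        def __getitem__(self, idx):
            # Dummy 3-channel image (H=300, W=400); a real dataset
            # would load and decode an image file here.
            image = torch.rand(3, 300, 400)
            target = {
                # One box per object in [x0, y0, x1, y1] format,
                # within 0..W and 0..H.
                "boxes": torch.tensor([[50.0, 60.0, 180.0, 200.0],
                                       [210.0, 40.0, 350.0, 220.0]]),
                # This image contains both classes (one cat, one dog),
                # so the labels tensor is [1, 2]; no 0 anywhere.
                "labels": torch.tensor([1, 2], dtype=torch.int64),
                # Unique identifier for the image.
                "image_id": torch.tensor([idx]),
            }
            return image, target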