diff --git a/intermediate_source/torchvision_tutorial.rst b/intermediate_source/torchvision_tutorial.rst
index c82b8097e93..93fcfd3d247 100644
--- a/intermediate_source/torchvision_tutorial.rst
+++ b/intermediate_source/torchvision_tutorial.rst
@@ -32,7 +32,7 @@ should return:
   - ``boxes (FloatTensor[N, 4])``: the coordinates of the ``N``
     bounding boxes in ``[x0, y0, x1, y1]`` format, ranging from ``0``
     to ``W`` and ``0`` to ``H``
-  - ``labels (Int64Tensor[N])``: the label for each bounding box
+  - ``labels (Int64Tensor[N])``: the label for each bounding box. ``0`` always represents the background class.
   - ``image_id (Int64Tensor[1])``: an image identifier. It should be
     unique between all the images in the dataset, and is used during
     evaluation
@@ -56,6 +56,8 @@
 If your model returns the above methods, they will make it work for both
 training and evaluation, and will use the evaluation scripts from ``pycocotools``.
 
+One note on ``labels``: the model considers class ``0`` to be background. If your dataset does not contain the background class, you should not have ``0`` in your ``labels``. For example, assuming you have just two classes, *cat* and *dog*, you can define ``1`` (not ``0``) to represent *cats* and ``2`` to represent *dogs*. So, for instance, if one of the images has both classes, your ``labels`` tensor should look like ``[1, 2]``.
+
 Additionally, if you want to use aspect ratio grouping during training
 (so that each batch only contains images with similar aspect ratio),
 then it is recommended to also implement a ``get_height_and_width``
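
For example, here is a minimal sketch of a dataset that follows this labeling convention. It is illustrative only: the ``CatDogDataset`` class, the image size, and the box coordinates are made up, and a real dataset would load images and annotations from disk.

.. code-block:: python

    import torch
    from torch.utils.data import Dataset

    class CatDogDataset(Dataset):
        # Hypothetical two-class detection dataset: 1 = cat, 2 = dog.
        # 0 is reserved for the background class, so it never appears
        # in ``labels``.

        def __len__(self):
            return 1

        def __getitem__(self, idx):
            # Dummy 3-channel image (H=300, W=400); a real dataset
            # would load and decode an image file here.
            image = torch.rand(3, 300, 400)
            target = {
                # One box per object in [x0, y0, x1, y1] format,
                # within 0..W and 0..H.
                "boxes": torch.tensor([[50.0, 60.0, 180.0, 200.0],
                                       [210.0, 40.0, 350.0, 220.0]]),
                # This image contains both classes (one cat, one dog),
                # so the labels tensor is [1, 2]; no 0 anywhere.
                "labels": torch.tensor([1, 2], dtype=torch.int64),
                # Unique identifier for the image.
                "image_id": torch.tensor([idx]),
            }
            return image, target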