Fixes issue #1868 regarding TypeError: ZipperIterDataPipe #1954

Merged 2 commits on Oct 13, 2022
4 changes: 2 additions & 2 deletions beginner_source/translation_transformer.py
@@ -309,7 +309,7 @@ def train_epoch(model, optimizer):
optimizer.step()
losses += loss.item()

- return losses / len(train_dataloader)
+ return losses / len(list(train_dataloader))
Contributor

It's better to append a LengthSetterDataPipe at the end of each Dataset within TorchText that has a known length. As for Zipper in torchdata, it is meant for general usage and requires that the prior DataPipes provide a valid length.
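For illustration, a minimal sketch of the length requirement described above, using assumed torchdata datapipes (not code from this PR or the tutorial): zipping in a pipe that cannot report a length makes len() fail, and len(list(...)) only works by materializing everything.

from torchdata.datapipes.iter import IterableWrapper

src = IterableWrapper(range(10))
tgt = IterableWrapper(range(10)).filter(lambda x: x % 2 == 0)  # filter -> no valid length

zipped = src.zip(tgt)
# len(zipped) raises TypeError: "... instance doesn't have valid length"
# len(list(zipped)) returns 5, but only after materializing the whole pipe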

cc: @Nayef211

Contributor

@ejguan if my understanding is correct, calling list(train_dataloader) would cause all data in the dataset to materialize, right? I thought this is something we'd want to avoid, given that it can lead to OOMs for very large datasets.

> It's better to append a LengthSetterDataPipe at the end of each Dataset within TorchText that has a known length.

If we do this, can we avoid the data materialization issue I mentioned above?

Contributor

@ejguan Oct 12, 2022

Correct, we shouldn't materialize the data here. If this Dataset has a known length, we can add a DataPipe:

from torchdata.datapipes.iter import IterDataPipe

class LenSetter(IterDataPipe):
    """Wraps a DataPipe and reports a known length without materializing it."""

    def __init__(self, datapipe, length):
        self.datapipe = datapipe
        self.length = length

    def __iter__(self):
        # Stream items straight through; nothing is pre-loaded.
        yield from self.datapipe

    def __len__(self):
        return self.length

This would prevent any pre-materialization before the data is iterated. IterDataPipe itself doesn't guarantee a __len__ implementation because some DataPipe operations (e.g. filter) introduce a dynamic length.
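For illustration only, here is a sketch of how such a wrapper might be attached in the translation tutorial; the Multi30k call, the batch size, and the 29000 length are assumptions for the example, not part of this PR:

from torch.utils.data import DataLoader
from torchtext.datasets import Multi30k

train_iter = Multi30k(split='train', language_pair=('de', 'en'))
train_iter = LenSetter(train_iter, length=29000)  # assumed known size of the train split

train_dataloader = DataLoader(train_iter, batch_size=128)
num_batches = len(train_dataloader)  # resolves from the declared length, nothing is materialized

With the length declared once at the end of the pipeline, the tutorial's losses / len(train_dataloader) would work without the list(...) workaround.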

Contributor

Gotcha, just created an issue to track this: pytorch/text#1943. Will merge this PR for now.



def evaluate(model):
@@ -333,7 +333,7 @@ def evaluate(model):
loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
losses += loss.item()

- return losses / len(val_dataloader)
+ return losses / len(list(val_dataloader))

######################################################################
# Now we have all the ingredients to train our model. Let's do it!