Skip to content

FSDP T5 Example not working #1210

Open
@YooSungHyun

Description

@YooSungHyun

Context

  • Pytorch version: 3.10
  • Operating System and version: ubuntu 22.04

Your Environment

  • Installed using source? [yes/no]: no
  • Are you planning to deploy it using docker container? [yes/no]: no
  • Is it a CPU or GPU environment?: GPU
  • Which example are you using: FSDP T5 example
  • Link to code or data to repro [if any]:

Expected Behavior

training well

Current Behavior

error raised and training stop

Possible Solution

Steps to Reproduce

  1. launch just fsdp t5 example
  2. error raised TypeError: T5Block.forward() got an unexpected keyword argument 'offload_to_cpu'
    ...

Failure Logs [if any]

Traceback (most recent call last):
  File "/data/bart/temp_workspace/examples/distributed/FSDP/T5_training.py", line 215, in <module>
    fsdp_main(args)
  File "/data/bart/temp_workspace/examples/distributed/FSDP/T5_training.py", line 148, in fsdp_main
    train_accuracy = train(args, model, rank, world_size, train_loader, optimizer, epoch, sampler=sampler1)
  File "/data/bart/temp_workspace/examples/distributed/FSDP/utils/train_utils.py", line 50, in train
    output = model(input_ids=batch["source_ids"],attention_mask=batch["source_mask"],labels=batch["target_ids"] )
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 839, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 1706, in forward
    encoder_outputs = self.encoder(
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 1110, in forward
    layer_outputs = layer_module(
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 839, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py", line 164, in forward
    return self.checkpoint_fn(  # type: ignore[misc]
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 458, in checkpoint
    ret = function(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: T5Block.forward() got an unexpected keyword argument 'offload_to_cpu'

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions