[BUG] corrupted dataset due to simultaneous downloading by all ranks. #3065

Open
@LamForest

Add Link

https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html

Describe the bug

When the tutorial's FSDP MNIST example is launched with several ranks, every rank appears to call datasets.MNIST('./data', train=True, download=True, ...) at the same time. All ranks write to the same files under ./data/MNIST/raw, so one rank can run its integrity check on a file that another rank is still downloading, and it fails with "File not found or corrupted." The interleaved output from the ranks:

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 9912422/9912422 [00:03<00:00, 3078874.05it/s]

  5%|█████▎                                                                                                    | 491520/9912422 [00:01<00:22, 417952.41it/s]Traceback (most recent call last):
  File "fsdp_mnist.py", line 173, in <module>
    mp.spawn(fsdp_main,
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 163, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/ssd1/gaotianlin/baidu/hac-aiacc/Megatron/old_scripts/fsdp/fsdp_mnist.py", line 94, in fsdp_main
    dataset1 = datasets.MNIST('./data', train=True, download=True,
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torchvision/datasets/mnist.py", line 99, in __init__
    self.download()
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torchvision/datasets/mnist.py", line 187, in download
    download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torchvision/datasets/utils.py", line 434, in download_and_extract_archive
    download_url(url, download_root, filename, md5)
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torchvision/datasets/utils.py", line 155, in download_url
    raise RuntimeError("File not found or corrupted.")
RuntimeError: File not found or corrupted.

/root/miniconda3/envs/old_mega/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
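
One possible workaround, sketched here rather than taken from the tutorial, is to let rank 0 download first while the other ranks wait at a barrier and then open the files with download=False. The helper name build_dataset_rank0_first and its make_dataset and barrier parameters are hypothetical. The sketch assumes all ranks share one filesystem (as with mp.spawn on a single node) and that the process group is already initialized before the barrier is called.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def build_dataset_rank0_first(
    rank: int,
    make_dataset: Callable[[bool], T],
    barrier: Callable[[], object],
) -> T:
    """Build a dataset so that only rank 0 ever downloads it.

    make_dataset(download) is a hypothetical factory that constructs the
    dataset; barrier() must block until every rank has called it.
    """
    dataset = None
    if rank == 0:
        # Only rank 0 touches the network and writes the raw files.
        dataset = make_dataset(True)
    # Every rank waits here until rank 0 has finished downloading.
    barrier()
    if rank != 0:
        # The files now exist on the shared filesystem, so just read them.
        dataset = make_dataset(False)
    return dataset


# Hypothetical use inside fsdp_main(rank, world_size, args), after the
# process group has been set up:
#
#   import torch.distributed as dist
#   from torchvision import datasets
#
#   dataset1 = build_dataset_rank0_first(
#       rank,
#       lambda download: datasets.MNIST(
#           './data', train=True, download=download, transform=transform),
#       dist.barrier,
#   )
```

The same call could be repeated for the test split. On a multi-node job where each node has its own local disk, the check would presumably need to use the local rank instead of the global rank, which this sketch does not cover.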

Describe your environment

...
