
DistributedDataParallel doesn't work on nightly build (1.6.0.dev20200408+cu101) #36268

Closed
@meixitu

Description


🐛 Bug

I have a single machine with 4 GPUs.

nn.parallel.DistributedDataParallel fails to initialize on the nightly build.

The same code runs correctly with PyTorch 1.4.

Traceback (most recent call last):
File "lstm_toy.py", line 72, in
model = nn.parallel.DistributedDataParallel(model,find_unused_parameters=True,check_reduction=True)
File "/usr/local/anaconda3/envs/torch1.5/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 305, in init
self._ddp_init_helper()
File "/usr/local/anaconda3/envs/torch1.5/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 382, in _ddp_init_helper
expect_sparse_gradient)
RuntimeError: Model replicas must have an equal number of parameters.

To Reproduce

Run the code below from the command line:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import time
import os
import sys
import traceback

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
import faulthandler
faulthandler.enable()

# local helper for tracking GPU memory usage (not part of PyTorch)
from gpu_mem_track import MemTracker
import inspect

frame = inspect.currentframe()
gpu_tracker = MemTracker(frame)


torch.backends.cudnn.benchmark = True

BATCH_SIZE = 4
INPUT_DIM = 2048
OUTPUT_DIM = 5000
EPOCHS = 10000
HIDDEN_DIM = 2048
N_LAYERS=5
SEQ_LEN = 2000

class Net(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, hidden_layers):
        super(Net, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        self.hidden_layers = hidden_layers

        self.lstm = nn.LSTM(input_dim, hidden_dim, hidden_layers, batch_first=True)
        self.h2o = nn.Linear(hidden_dim, output_dim)

    def forward(self, x, y):
        self.lstm.flatten_parameters()
        h_t, _ = self.lstm(x)
        output = self.h2o(h_t)
        loss = F.mse_loss(output, y)
        return loss

X_data = torch.randn((BATCH_SIZE, SEQ_LEN, INPUT_DIM)).cuda()
Y_data = torch.rand((BATCH_SIZE, SEQ_LEN, OUTPUT_DIM)).cuda()

model = Net(INPUT_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS)
if 0:  # set to 1 to use nn.DataParallel instead of DistributedDataParallel
    model = nn.DataParallel(model)
    model.cuda()
else:
    torch.distributed.init_process_group(
        backend='nccl',
        init_method='tcp://localhost:' + str(np.random.randint(100, 60000)),
        rank=0,
        world_size=1)
    model.cuda()
    model = nn.parallel.DistributedDataParallel(model, find_unused_parameters=True, check_reduction=True)
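
For comparison, the one-process-per-GPU setup pins each DDP process to a single device, so no in-process replication across devices takes place. Below is a minimal sketch of that pattern, not taken from the report: the launcher flags, the LOCAL_RANK handling, and the wrap_ddp_single_device name are illustrative assumptions, and it reuses the Net model defined above.

# Sketch only (assumption, not the reporter's code): one DDP process per GPU.
# Assumes a launcher such as `python -m torch.distributed.launch --use_env
# --nproc_per_node=4 lstm_toy.py`, which sets RANK, WORLD_SIZE, MASTER_ADDR,
# MASTER_PORT and (with --use_env) LOCAL_RANK in each process's environment.
import os
import torch
import torch.distributed as dist
import torch.nn as nn

def wrap_ddp_single_device(model):
    local_rank = int(os.environ.get('LOCAL_RANK', 0))  # assumption: set by the launcher
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend='nccl', init_method='env://')
    model = model.cuda(local_rank)
    # A single entry in device_ids keeps the replica on exactly one GPU,
    # so DDP does not replicate the module across devices inside the process.
    return nn.parallel.DistributedDataParallel(
        model,
        device_ids=[local_rank],
        output_device=local_rank,
        find_unused_parameters=True,
    )

With this layout each of the 4 GPUs runs its own process, so the cross-replica parameter check that raises the RuntimeError above should not be exercised.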

Environment

PyTorch version: 1.6.0.dev20200408+cu101
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 6.5.0-2ubuntu1~16.04) 6.5.0 20181026
CMake version: version 3.5.1

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.105
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti
GPU 3: GeForce GTX 1080 Ti
GPU 4: GeForce GTX 1080 Ti
GPU 5: GeForce GTX 1080 Ti
GPU 6: GeForce GTX 1080 Ti
GPU 7: GeForce GTX 1080 Ti

Nvidia driver version: 435.21
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.6.0.21
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip3] numpy==1.13.3
[conda] torch 1.6.0.dev20200408+cu101 pypi_0 pypi
[conda] torchvision 0.6.0.dev20200408+cu101 pypi_0 pypi
[conda] warprnnt-pytorch 0.1 pypi_0 pypi

Additional context

Thanks
Meixitu

cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar


Labels: high priority, module: data parallel, module: regression, oncall: distributed, triaged
