How to implement model parallelism using PyTorch in an HPC environment? #896

Open
@Akshara211

Hello,
I am trying to implement model parallelism with PyTorch in my HPC environment, which has 4 GPUs available. My goal is to split a neural network across these GPUs to improve training efficiency.

Here's what I've tried so far:

- Followed the PyTorch documentation on model parallelism
- Implemented a basic split of the model across the GPUs

However, I am encountering performance bottlenecks and underutilization of the GPUs. Can someone guide me on how to set this up properly in my HPC environment?
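For reference, here is a minimal sketch of the kind of split I mean. The module names and layer sizes are illustrative placeholders, not my actual model, and the sketch falls back to CPU when two GPUs aren't visible so it still runs anywhere:

```python
import torch
import torch.nn as nn

# Illustrative two-stage split. When fewer than 2 GPUs are available,
# both stages land on CPU so the sketch remains runnable.
dev0 = torch.device("cuda:0" if torch.cuda.device_count() >= 2 else "cpu")
dev1 = torch.device("cuda:1" if torch.cuda.device_count() >= 2 else "cpu")

class SplitNet(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the network lives on dev0, second half on dev1.
        self.stage1 = nn.Sequential(nn.Linear(128, 256), nn.ReLU()).to(dev0)
        self.stage2 = nn.Sequential(nn.Linear(256, 10)).to(dev1)

    def forward(self, x):
        x = self.stage1(x.to(dev0))
        # Activations are copied across the device boundary between stages;
        # with a naive split like this, each GPU idles while the other works.
        return self.stage2(x.to(dev1))

model = SplitNet()
out = model(torch.randn(32, 128))
print(out.shape)  # torch.Size([32, 10])
```

With this naive split, only one stage computes at a time, which I suspect is related to the underutilization I'm seeing.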

Any advice or pointers to resources would be greatly appreciated!
