Hello,
I am trying to implement model parallelism with PyTorch in my HPC environment, which has 4 GPUs available. My goal is to split a neural network model across these GPUs to improve training efficiency.
Here's what I've tried so far:
- Followed the PyTorch documentation on model parallelism
- Implemented a basic split of the model across GPUs
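For reference, my split looks roughly like the sketch below (the class, layer sizes, and device choices are placeholders, not my actual model): each half of the network is pinned to a different device and activations are moved between devices explicitly in `forward`.

```python
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    """Toy two-stage model split across two GPUs (placeholder layer sizes)."""
    def __init__(self):
        super().__init__()
        # First stage lives on GPU 0
        self.part1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to('cuda:0')
        # Second stage lives on GPU 1
        self.part2 = nn.Sequential(nn.Linear(1024, 10)).to('cuda:1')

    def forward(self, x):
        # Activations must be copied to the device that owns the next stage
        x = self.part1(x.to('cuda:0'))
        x = self.part2(x.to('cuda:1'))
        return x

model = SplitModel()
out = model(torch.randn(64, 1024))  # output tensor ends up on cuda:1
loss = out.sum()
loss.backward()                     # autograd handles the cross-device graph
```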
However, I am running into performance bottlenecks and the GPUs are underutilized. Can someone guide me on how to implement this efficiently in my HPC setup?
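One direction I have been looking at is pipelining, i.e. splitting each batch into micro-batches so that the stages can work on different micro-batches at the same time instead of sitting idle. Below is a minimal sketch of that idea, assuming the hypothetical two-stage `SplitModel` above (with `part1` on `cuda:0` and `part2` on `cuda:1`); I am not sure whether this is the right way to scale it to all 4 GPUs.

```python
import torch

def pipelined_forward(model, x, split_size=16):
    """Stream micro-batches through the two stages so cuda:0 and cuda:1 overlap."""
    splits = iter(x.split(split_size, dim=0))
    s_next = next(splits)
    # Run stage 1 on the first micro-batch and move its output to cuda:1
    s_prev = model.part1(s_next.to('cuda:0')).to('cuda:1')
    outputs = []
    for s_next in splits:
        # Stage 2 (cuda:1) processes the previous micro-batch while
        # stage 1 (cuda:0) starts the next one; CUDA kernel launches are
        # asynchronous, so the two devices can run concurrently
        outputs.append(model.part2(s_prev))
        s_prev = model.part1(s_next.to('cuda:0')).to('cuda:1')
    # Drain the last micro-batch through stage 2
    outputs.append(model.part2(s_prev))
    return torch.cat(outputs, dim=0)
```

Even with this, I am unsure how to keep all 4 GPUs busy rather than just two, or whether the communication between devices is what is limiting me.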
Any advice or pointers to resources would be greatly appreciated!