FSDP2 example code for tutorial #1343

Merged
merged 4 commits into from
May 9, 2025

Conversation

weifengpy
Contributor

Run FSDP2 on a transformer model:

torchrun --nproc_per_node 2 train.py
  • On the first run, it creates a "checkpoints" folder and saves state dicts there
  • On the second run, it loads from the previous checkpoints
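
The save-on-first-run, load-on-second-run behavior can be sketched roughly as below. This is only an illustration of the control flow, not the code in train.py: the `save_or_load` helper and the `state.json` file name are hypothetical, and `json` stands in for the real state dict serialization so the sketch runs without a GPU or process group.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Tuple


def save_or_load(ckpt_dir: Path, fresh_state: Dict) -> Tuple[Dict, bool]:
    """Return (state, loaded): load an existing checkpoint, else save fresh_state."""
    ckpt_file = ckpt_dir / "state.json"  # hypothetical file name
    if ckpt_file.exists():
        # Second and later runs: resume from what was saved before.
        return json.loads(ckpt_file.read_text()), True
    # First run: create the folder and write the initial state.
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    ckpt_file.write_text(json.dumps(fresh_state))
    return fresh_state, False


with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = Path(tmp) / "checkpoints"
    # Simulate two consecutive runs against the same folder.
    first_state, first_loaded = save_or_load(ckpt_dir, {"step": 0})
    second_state, second_loaded = save_or_load(ckpt_dir, {"step": 999})
```

The second call ignores its fresh state and returns what the first call saved, which mirrors the two-run behavior described in the bullets.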

To enable explicit prefetching

torchrun --nproc_per_node 2 train.py --explicit-prefetch
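
For reviewers unfamiliar with explicit prefetching, here is a rough sketch of the scheduling idea: each layer is told which neighboring layers to start all-gathering early. The helper names and the index-based plan are illustrative assumptions, not taken from train.py. With FSDP2, each index list would be mapped to the corresponding modules and passed to `set_modules_to_forward_prefetch` or `set_modules_to_backward_prefetch` on the sharded layer.

```python
from typing import Dict, List


def forward_prefetch_plan(num_layers: int, num_to_prefetch: int) -> Dict[int, List[int]]:
    """Map each layer index to the later layers it should prefetch in forward."""
    plan: Dict[int, List[int]] = {}
    for i in range(num_layers):
        targets = [i + j for j in range(1, num_to_prefetch + 1) if i + j < num_layers]
        if targets:
            plan[i] = targets
    return plan


def backward_prefetch_plan(num_layers: int, num_to_prefetch: int) -> Dict[int, List[int]]:
    """Map each layer index to the earlier layers it should prefetch in backward."""
    plan: Dict[int, List[int]] = {}
    for i in range(num_layers):
        targets = [i - j for j in range(1, num_to_prefetch + 1) if i - j >= 0]
        if targets:
            plan[i] = targets
    return plan


# Hypothetical sizes: 4 transformer layers, prefetching 2 layers ahead.
fwd = forward_prefetch_plan(4, 2)
bwd = backward_prefetch_plan(4, 2)
```

Backward runs through the layers in reverse, so its plan points at lower indices, nearest layer first.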

To enable mixed precision

torchrun --nproc_per_node 2 train.py --mixed-precision

To showcase the DCP API

torchrun --nproc_per_node 2 train.py --dcp-api
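
As a minimal sketch of how the three flags above could be parsed, assuming each is a plain boolean switch (the actual argument handling in train.py may differ, and `build_parser` is a hypothetical helper):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three optional switches named in this PR."""
    parser = argparse.ArgumentParser(description="FSDP2 example (sketch)")
    # Assumption: every flag defaults to off and takes no value.
    parser.add_argument("--explicit-prefetch", action="store_true")
    parser.add_argument("--mixed-precision", action="store_true")
    parser.add_argument("--dcp-api", action="store_true")
    return parser


# Simulate: torchrun --nproc_per_node 2 train.py --mixed-precision --dcp-api
args = build_parser().parse_args(["--mixed-precision", "--dcp-api"])
```

argparse turns the dashes into underscores, so the values are read as `args.explicit_prefetch`, `args.mixed_precision`, and `args.dcp_api`.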

weifengpy added 2 commits May 8, 2025 16:40

netlify bot commented May 8, 2025

Deploy Preview for pytorch-examples-preview canceled.

Name Link
🔨 Latest commit d281dcd
🔍 Latest deploy log https://app.netlify.com/sites/pytorch-examples-preview/deploys/681d4749abc5e40008eda968

@weifengpy weifengpy requested review from wconstab and mori360 May 8, 2025 23:54
torchrun --nproc_per_node 2 train.py --mixed-precision
```

To showcse DCP API

typo

cd distributed/FSDP2
torchrun --nproc_per_node 2 train.py
```
* For 1st time, it creates a "checkpoints" folder and save state dicts there

save -> saves

@weifengpy weifengpy merged commit 7092296 into pytorch:main May 9, 2025
8 checks passed
3 participants