[BUG] - Distributed Checkpoint Code Example Is Failing

### Add Link

https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html

### Describe the bug

As it is, [the code provided in the tutorial](https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html) fails.

# Saving the Checkpoints

## [Code](https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html#saving)

<details>
<summary>
To save distributed checkpoints the tutorial provides the following code.
</summary>

```python
import os

import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
import torch.multiprocessing as mp
import torch.nn as nn

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict
from torch.distributed.checkpoint.stateful import Stateful
from torch.distributed.fsdp.fully_sharded_data_parallel import StateDictType

CHECKPOINT_DIR = "checkpoint"


class AppState(Stateful):
    """This is a useful wrapper for checkpointing the Application State. Since this object is compliant
    with the Stateful protocol, DCP will automatically call state_dict/load_stat_dict as needed in the
    dcp.save/load APIs.

    Note: We take advantage of this wrapper to hande calling distributed state dict methods on the model
    and optimizer.
    """

    def __init__(self, model, optimizer=None):
        self.model = model
        self.optimizer = optimizer

    def state_dict(self):
        # this line automatically manages FSDP FQN's, as well as sets the default state dict type to FSDP.SHARDED_STATE_DICT
        model_state_dict, optimizer_state_dict = get_state_dict(model, optimizer)
        return {
            "model": model_state_dict,
            "optim": optimizer_state_dict
        }

    def load_state_dict(self, state_dict):
        # sets our state dicts on the model and optimizer, now that we've loaded
        set_state_dict(
            self.model,
            self.optimizer,
            model_state_dict=state_dict["model"],
            optim_state_dict=state_dict["optim"]
        )

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(16, 16)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(16, 8)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355 "

    # initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)


def cleanup():
    dist.destroy_process_group()


def run_fsdp_checkpoint_save_example(rank, world_size):
    print(f"Running basic FSDP checkpoint saving example on rank {rank}.")
    setup(rank, world_size)

    # create a model and move it to GPU with id rank
    model = ToyModel().to(rank)
    model = FSDP(model)

    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

    optimizer.zero_grad()
    model(torch.rand(8, 16, device="cuda")).sum().backward()
    optimizer.step()

    state_dict = { "app": AppState(model, optimizer) }
    dcp.save(state_dict, checkpoint_id=CHECKPOINT_DIR)

    cleanup()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    print(f"Running fsdp checkpoint example on {world_size} devices.")
    mp.spawn(
        run_fsdp_checkpoint_save_example,
        args=(world_size,),
        nprocs=world_size,
        join=True,
    )
```

</details>

## Error

Executing this code results in the error:
>   File "fsdp/saving.py", line 32, in state_dict
    model_state_dict, optimizer_state_dict = get_state_dict(model, optimizer)
NameError: name 'model' is not defined

## Fix

`model` and `optimizer` should be `self.model` and `self.optimizer`.

# Loading the Checkpoint

## [Code](https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html#loading)

<details>
<summary>
To load checkpoints the tutorial provides the following code.
</summary>

```python
import os

import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict
import torch.multiprocessing as mp
import torch.nn as nn

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

CHECKPOINT_DIR = "checkpoint"


class AppState(Stateful):
    """This is a useful wrapper for checkpointing the Application State. Since this object is compliant
    with the Stateful protocol, DCP will automatically call state_dict/load_stat_dict as needed in the
    dcp.save/load APIs.

    Note: We take advantage of this wrapper to hande calling distributed state dict methods on the model
    and optimizer.
    """

    def __init__(self, model, optimizer=None):
        self.model = model
        self.optimizer = optimizer

    def state_dict(self):
        # this line automatically manages FSDP FQN's, as well as sets the default state dict type to FSDP.SHARDED_STATE_DICT
        model_state_dict, optimizer_state_dict = get_state_dict(model, optimizer)
        return {
            "model": model_state_dict,
            "optim": optimizer_state_dict
        }

    def load_state_dict(self, state_dict):
        # sets our state dicts on the model and optimizer, now that we've loaded
        set_state_dict(
            self.model,
            self.optimizer,
            model_state_dict=state_dict["model"],
            optim_state_dict=state_dict["optim"]
        )

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(16, 16)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(16, 8)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355 "

    # initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)


def cleanup():
    dist.destroy_process_group()


def run_fsdp_checkpoint_load_example(rank, world_size):
    print(f"Running basic FSDP checkpoint loading example on rank {rank}.")
    setup(rank, world_size)

    # create a model and move it to GPU with id rank
    model = ToyModel().to(rank)
    model = FSDP(model)

    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

    state_dict = { "app": AppState(model, optimizer)}
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
    # generates the state dict we will load into
    model_state_dict, optimizer_state_dict = get_state_dict(model, optimizer)
    state_dict = {
        "model": model_state_dict,
        "optimizer": optimizer_state_dict
    }
    dcp.load(
        state_dict=state_dict,
        checkpoint_id=CHECKPOINT_DIR,
    )

    cleanup()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    print(f"Running fsdp checkpoint example on {world_size} devices.")
    mp.spawn(
        run_fsdp_checkpoint_load_example,
        args=(world_size,),
        nprocs=world_size,
        join=True,
    )
```
</details>

## Error
There are several errors:
- > NameError: name 'Stateful' is not defined
- >   File "torch/distributed/checkpoint/default_planner.py", line 235, in create_default_local_load_plan
    md = metadata.state_dict_metadata[fqn]
KeyError: 'model.net1.weight'

## Fix

- Importing `Stateful`
- Fixing the `state_dict` which is passed to `dcp.load`

## In-place Loading Documentation

As a side note, although that the documentation mentions that state dict are loaded in place (as quoted below), this could be emphasized even more. Especially, it could be clearly stated there is no need to load states as traditionally done `model.load_state_dict(...)`. 
> It operates in place, meaning that the model should allocate its data first and DCP uses that storage instead.

> DCP uses the pre-allocated storage from model state_dict to load from the checkpoint directory. During loading, the state_dict passed in will be updated in place.

### Describe your environment

- Platform: Linux;
- Cuda version: 12.4;
- Torch version: 2.3.0.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG] - Distributed Checkpoint Code Example Is Failing #3067

Add Link

Describe the bug

Saving the Checkpoints

Code

Error

Fix

Loading the Checkpoint

Code

Error

Fix

In-place Loading Documentation

Describe your environment

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[BUG] - Distributed Checkpoint Code Example Is Failing #3067

Description

Add Link

Describe the bug

Saving the Checkpoints

Code

Error

Fix

Loading the Checkpoint

Code

Error

Fix

In-place Loading Documentation

Describe your environment

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions