A couple corrections to the Zeus post #1624

Merged 1 commit on May 14, 2024

18 changes: 3 additions & 15 deletions _posts/2024-05-11-zeus.md
@@ -60,8 +60,6 @@ measurement = monitor.end_window("training")
print(f"Entire training: {measurement.time} s, {measurement.total_energy} J")
```

<script src="https://gist.github.com/jaywonchung/f580b782ff0513374c6fa507d5e072a8.js"></script>

What you see above is a typical PyTorch training loop that uses four GPUs for data-parallel training. Inside, we created an instance of `ZeusMonitor` and passed in a list of GPU indices to monitor. Then, using the monitor, we can measure the time and energy consumption of arbitrary execution _windows_ within the training script by pairing calls to `begin_window` and `end_window`. Multiple windows can overlap and nest in arbitrary ways without affecting each other's measurements, as long as their names are different.
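
To make the window semantics concrete, here is a minimal sketch of a nested pair of windows around an epoch and its steps; the GPU indices, `train_loader`, and `train_one_step` are placeholders rather than part of the post's example.

```python
from zeus.monitor import ZeusMonitor

# Monitor the four GPUs used for data-parallel training (indices are placeholders).
monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])

monitor.begin_window("epoch")                     # outer window
for step, (x, y) in enumerate(train_loader):      # placeholder dataloader
    monitor.begin_window("step")                  # nested window with a different name
    train_one_step(x, y)                          # placeholder training step
    step_m = monitor.end_window("step")
    print(f"step {step}: {step_m.time} s, {step_m.total_energy} J")

epoch_m = monitor.end_window("epoch")
print(f"epoch: {epoch_m.time} s, {epoch_m.total_energy} J")
```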

`ZeusMonitor` adds very little overhead – typically single-digit milliseconds – around the window. This allows `ZeusMonitor` to be used in various applications. For instance:
@@ -79,9 +77,8 @@ See our [blog post](https://ml.energy/blog/energy/measurement/measuring-gpu-ener
Let me introduce you to two of the energy optimizers provided by Zeus.


```
GlobalPowerLimitOptimizer
```
### GlobalPowerLimitOptimizer



GPUs allow users to configure their maximum power draw, called the _power limit_. Typically, as you lower the GPU’s power limit from the default maximum, computation may get slightly slower, but you’ll save disproportionately more energy. The `GlobalPowerLimitOptimizer` in Zeus automatically finds the optimal GPU power limit globally across all GPUs.
@@ -108,8 +105,6 @@ for e in range(100):
plo.on_epoch_end()
```

<script src="https://gist.github.com/jaywonchung/1922ddd56b15f8764f2bdacc4a441109.js"></script>

In our familiar PyTorch training loop, we have instantiated `GlobalPowerLimitOptimizer` and passed it an instance of the `ZeusMonitor`, through which the optimizer sees the GPUs. Then, we just need to let the optimizer know about training progress (step and epoch boundaries), and the optimizer will transparently do all the necessary profiling and converge to the optimal power limit.
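
For reference, here is a minimal sketch of that wiring, assuming the optimizer also exposes `on_epoch_begin` and `on_step_begin` callbacks alongside the `on_epoch_end` call visible in the diff above; the dataloader and training step are placeholders.

```python
from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer

monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])
plo = GlobalPowerLimitOptimizer(monitor)

for epoch in range(100):
    plo.on_epoch_begin()              # assumed epoch-boundary callback
    for x, y in train_dataloader:     # placeholder dataloader
        plo.on_step_begin()           # assumed step-boundary callback
        train_one_step(x, y)          # placeholder training step
    plo.on_epoch_end()                # matches the call shown in the diff above
```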

If you’re using the HuggingFace [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) or [SFTTrainer](https://huggingface.co/docs/trl/main/en/sft_trainer), integration is even easier:
@@ -131,8 +126,6 @@ trainer = Trainer(
)
```

<script src="https://gist.github.com/jaywonchung/69aa379dd9633a6a486cede1887cec2c.js"></script>

The `HFGlobalPowerLimitOptimizer` wraps `GlobalPowerLimitOptimizer` so that it automatically detects step and epoch boundaries. We have example integrations [here](https://github.com/ml-energy/zeus/tree/master/examples/huggingface), including running Gemma 7B supervised fine-tuning with QLoRA.

Now, we know how to integrate the optimizer, but what is the _optimal_ power limit? We know different users can have different preferences regarding trading off time and energy, so we allow users to specify an `OptimumSelector` (basically the [Strategy Pattern](https://en.wikipedia.org/wiki/Strategy_pattern)) to express their needs.
@@ -154,15 +147,10 @@ plo = GlobalPowerLimitOptimizer(

```

<script src="https://gist.github.com/jaywonchung/1077b14bc7440b849be1f8320d4bf791.js"></script>

Some of the built-in strategies include “Minimize time” ([Time](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.Time), this might still reduce the power limit from the default since some workloads exhibit almost no slowdown even on lower power limits), “Minimize energy” ([Energy](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.Energy)), “Somewhere in between” ([ZeusCost](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.ZeusCost)), and “Minimize energy given maximum slowdown” ([MaxSlowdownConstraint](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.MaxSlowdownConstraint)). Users can also create their own optimum selectors as needed.
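
Below is a rough sketch of what a custom selector could look like, loosely in the spirit of `MaxSlowdownConstraint`: it picks the lowest power limit whose measured slowdown stays within a budget. The `select` hook, the `power_limit`/`time` fields on each profiling result, the import location of `OptimumSelector`, and the way the selector is passed to the optimizer are all assumptions for illustration; `monitor` is the instance created earlier.

```python
from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer, OptimumSelector

class CheapestWithinBudget(OptimumSelector):
    """Illustrative selector: lowest power limit within a slowdown budget."""

    def __init__(self, factor: float = 1.1) -> None:
        self.factor = factor

    def select(self, measurements):
        # Assumed interface: one profiling result per power limit, each exposing
        # `.power_limit` and `.time`; return the chosen power limit.
        default = max(measurements, key=lambda m: m.power_limit)  # default = highest limit
        ok = [m for m in measurements if m.time <= self.factor * default.time]
        return min(ok, key=lambda m: m.power_limit).power_limit

# Selector passed alongside the monitor (argument form is illustrative).
plo = GlobalPowerLimitOptimizer(monitor, CheapestWithinBudget(factor=1.1))
```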


```
PipelineFrequencyOptimizer
```

### PipelineFrequencyOptimizer

The pipeline frequency optimizer, based on our research paper [Perseus](https://ml.energy/zeus/research_overview/perseus), is our latest work on energy optimization for training large models like GPT-3. Perseus can reduce the energy consumption of large model training with no or negligible degradation in training throughput. We’ll briefly talk about how.
