
Commit f52198c

A couple corrections to the Zeus post (#1624)
Signed-off-by: Chris Abraham <cjyabraham@gmail.com>
1 parent e0904f8 commit f52198c

File tree

1 file changed (+3, -15 lines)


_posts/2024-05-11-zeus.md

Lines changed: 3 additions & 15 deletions
@@ -60,8 +60,6 @@ measurement = monitor.end_window("training")
print(f"Entire training: {measurement.time} s, {measurement.total_energy} J")
```

-<script src="https://gist.github.com/jaywonchung/f580b782ff0513374c6fa507d5e072a8.js"></script>
-
What you see above is a typical PyTorch training loop which uses four GPUs for data parallel training. Inside, we created an instance of `ZeusMonitor` and passed in a list of GPU indices to monitor. Then, using the monitor, we can measure the time and energy consumption of arbitrary execution _windows_ within the training script by pairing calls to `begin_window` and `end_window`. Multiple windows can overlap and nest in arbitrary ways without affecting the measurement of each, as long as their names are different.
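As a minimal sketch of the nested measurement windows described above (not taken from the post itself): the `ZeusMonitor` import, `begin_window`/`end_window`, and the `measurement.time`/`measurement.total_energy` fields appear in the diff, while the `gpu_indices` argument name and the training helper are assumptions.

```
from zeus.monitor import ZeusMonitor

def train_one_epoch():
    """Placeholder for one epoch of actual model training."""
    ...

# `gpu_indices` is the assumed name of the constructor argument the post refers to.
monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])

monitor.begin_window("training")
for epoch in range(3):
    monitor.begin_window("epoch")    # nests inside the outer "training" window
    train_one_epoch()
    epoch_measurement = monitor.end_window("epoch")
    print(f"Epoch {epoch}: {epoch_measurement.time} s, {epoch_measurement.total_energy} J")

measurement = monitor.end_window("training")
print(f"Entire training: {measurement.time} s, {measurement.total_energy} J")
```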
`ZeusMonitor` adds very little overhead – typically single digit milliseconds – around the window. This allows `ZeusMonitor` to be used in various applications. For instance:
@@ -79,9 +77,8 @@ See our [blog post](https://ml.energy/blog/energy/measurement/measuring-gpu-ener
Let me introduce you to two of the energy optimizers provided by Zeus.


-```
-GlobalPowerLimitOptimizer
-```
+### GlobalPowerLimitOptimizer
+


GPUs allow users to configure their maximum power draw, called _power limit_. Typically, as you lower the GPU’s power limit from the default maximum, computation may get slightly slower, but you’ll save disproportionately more energy. The `GlobalPowerLimitOptimizer` in Zeus automatically finds the optimal GPU power limit globally across all GPUs.
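To make the power limit knob concrete, here is a small sketch (not from the post) that queries and optionally lowers it through NVML via `pynvml`; changing the limit usually requires administrator privileges, and `GlobalPowerLimitOptimizer` automates this choice for you.

```
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Allowed power-limit range and current limit, reported in milliwatts.
min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
current_mw = pynvml.nvmlDeviceGetPowerManagementLimit(handle)
print(f"Power limit range: {min_mw / 1000:.0f}-{max_mw / 1000:.0f} W, current: {current_mw / 1000:.0f} W")

# Lowering the limit trades a small slowdown for disproportionate energy savings.
# Requires elevated privileges; uncomment to apply.
# pynvml.nvmlDeviceSetPowerManagementLimit(handle, min_mw)

pynvml.nvmlShutdown()
```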
@@ -108,8 +105,6 @@ for e in range(100):
    plo.on_epoch_end()
```

-<script src="https://gist.github.com/jaywonchung/1922ddd56b15f8764f2bdacc4a441109.js"></script>
-
In our familiar PyTorch training loop, we have instantiated `GlobalPowerLimitOptimizer` and passed it an instance of the `ZeusMonitor`, through which the optimizer sees the GPUs. Then, we just need to let the optimizer know about training progress (step and epoch boundaries), and the optimizer will transparently do all the necessary profiling and converge to the optimal power limit.
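A sketch of the loop this paragraph describes, pieced together from the fragments visible in this diff (`for e in range(100):` and `plo.on_epoch_end()`); the module path and the other callback names are assumptions about the Zeus API.

```
from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer  # assumed module path

monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])
plo = GlobalPowerLimitOptimizer(monitor)

train_dataloader = []  # placeholder: your DataLoader

for e in range(100):
    plo.on_epoch_begin()          # assumed counterpart of on_epoch_end
    for x, y in train_dataloader:
        plo.on_step_begin()       # assumed: tells the optimizer a step is starting
        ...                       # forward, backward, optimizer.step()
        plo.on_step_end()         # assumed
    plo.on_epoch_end()            # shown in the diff context above
```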
If you’re using the HuggingFace [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) or [SFTTrainer](https://huggingface.co/docs/trl/main/en/sft_trainer), integration is even easier:
@@ -131,8 +126,6 @@ trainer = Trainer(
)
```

-<script src="https://gist.github.com/jaywonchung/69aa379dd9633a6a486cede1887cec2c.js"></script>
-
The `HFGlobalPowerLimitOptimizer` wraps `GlobalPowerLimitOptimizer` so that it automatically detects step and epoch boundaries. We have example integrations [here](https://github.com/ml-energy/zeus/tree/master/examples/huggingface), including running Gemma 7B supervised fine-tuning with QLoRA.
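A sketch of the HuggingFace integration this paragraph refers to, assuming `HFGlobalPowerLimitOptimizer` is passed to the `Trainer` through its `callbacks` argument; the model, dataset, and module path below are placeholders or assumptions.

```
from transformers import Trainer, TrainingArguments
from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import HFGlobalPowerLimitOptimizer  # assumed module path

monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])
power_limit_callback = HFGlobalPowerLimitOptimizer(monitor)

trainer = Trainer(
    model=model,                       # placeholder: your model
    args=TrainingArguments(output_dir="out"),
    train_dataset=train_dataset,       # placeholder: your dataset
    callbacks=[power_limit_callback],  # Zeus observes step/epoch boundaries via Trainer callbacks
)
trainer.train()
```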
Now, we know how to integrate the optimizer, but what is the _optimal_ power limit? We know different users can have different preferences regarding trading off time and energy, so we allow users to specify an `OptimumSelector` (basically the [Strategy Pattern](https://en.wikipedia.org/wiki/Strategy_pattern)) to express their needs.
@@ -154,15 +147,10 @@ plo = GlobalPowerLimitOptimizer(

```

-<script src="https://gist.github.com/jaywonchung/1077b14bc7440b849be1f8320d4bf791.js"></script>
-
Some of the built-in strategies include “Minimize time” ([Time](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.Time), this might still reduce the power limit from the default since some workloads exhibit almost no slowdown even on lower power limits), “Minimize energy” ([Energy](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.Energy)), “Somewhere in between” ([ZeusCost](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.ZeusCost)), and “Minimize energy given maximum slowdown” ([MaxSlowdownConstraint](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.MaxSlowdownConstraint)). Users can also create their own optimum selectors as needed.
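For illustration, a sketch that plugs one of these selectors into the optimizer; the import path and the keyword name `optimum_selector` are assumptions based on the reference links above.

```
from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import (  # assumed module path
    GlobalPowerLimitOptimizer,
    MaxSlowdownConstraint,
)

monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])

# "Minimize energy given maximum slowdown": tolerate at most a 10% slowdown.
plo = GlobalPowerLimitOptimizer(
    monitor,
    optimum_selector=MaxSlowdownConstraint(factor=1.1),  # assumed parameter names
)
```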


-```
-PipelineFrequencyOptimizer
-```
-
+### PipelineFrequencyOptimizer

The pipeline frequency optimizer, based on our research paper [Perseus](https://ml.energy/zeus/research_overview/perseus), is our latest work on energy optimization for large model training, like GPT-3. Perseus can reduce the energy consumption of large model training with no or negligible training throughput degradation. We’ll briefly talk about how.
