diff --git a/_community_blog/enhancing-deep-learning.md b/_community_blog/enhancing-deep-learning.md
new file mode 100644
index 000000000000..8c5cfd93846a
--- /dev/null
+++ b/_community_blog/enhancing-deep-learning.md
@@ -0,0 +1,8 @@
+---
+title: 'Enhancing Deep Learning Workflows: PyTorch Ecosystem Tools'
+author: Team PyTorch
+ext_url: /blog/enhancing-deep-learning/
+date: May 12, 2024
+---
+
+Welcome to the thriving PyTorch ecosystem, where a wealth of tools and libraries await, purpose-built to elevate your experience in deep learning as a developer or researcher. The Ecosystem Tools pages host many projects from experts spanning academia, industry, application development, and machine learning.
diff --git a/_community_blog/introducing-depyf.md b/_community_blog/introducing-depyf.md
new file mode 100644
index 000000000000..e245ea20c9a8
--- /dev/null
+++ b/_community_blog/introducing-depyf.md
@@ -0,0 +1,8 @@
+---
+title: 'Introducing depyf: mastering torch.compile with ease'
+author: PyTorch
+ext_url: /blog/introducing-depyf/
+date: May 11, 2024
+---
+
+We are thrilled to introduce `depyf`, a new project in the PyTorch ecosystem designed to help users understand, learn, and adapt to `torch.compile`!
diff --git a/_community_blog/zeus.md b/_community_blog/zeus.md
new file mode 100644
index 000000000000..24299fb494b0
--- /dev/null
+++ b/_community_blog/zeus.md
@@ -0,0 +1,8 @@
+---
+title: 'Deep Learning Energy Measurement and Optimization'
+author: Jae-Won Chung
+ext_url: /blog/zeus/
+date: May 11, 2024
+---
+
+[Zeus](https://github.com/ml-energy/zeus) is an open-source toolbox for measuring and optimizing the energy consumption of deep learning workloads. Our goal is to make energy optimization based on accurate measurements as easy as possible for diverse deep learning workloads and setups by offering composable tools with minimal assumptions.
diff --git a/_posts/2024-05-11-enhancing-deep-learning.md b/_posts/2024-05-11-enhancing-deep-learning.md
new file mode 100644
index 000000000000..b4029540b3fe
--- /dev/null
+++ b/_posts/2024-05-11-enhancing-deep-learning.md
@@ -0,0 +1,106 @@
+---
+layout: blog_detail
+title: "Enhancing Deep Learning Workflows: PyTorch Ecosystem Tools"
+hidden: true
+---
+
+Welcome to the thriving PyTorch ecosystem, where a wealth of tools and libraries await, purpose-built to elevate your experience in deep learning as a developer or researcher. The Ecosystem Tools pages host many projects from experts spanning academia, industry, application development, and machine learning.
+
+Initially, PyTorch aimed to establish a thriving community, enabling developers to access each other's tools, engage in meaningful discussions, and explore the wealth of resources available within the community.
+
+Today, the PyTorch ecosystem has grown to feature over 100 projects tailored to your needs, providing robust support, enhanced speed, and effortless integration with PyTorch. If your project aligns with our mission, we invite you to [submit](https://pytorch.org/ecosystem/join) it and join this dynamic ecosystem.
+
+New this month, we’ve moved all of our Ecosystem blogs over to the PyTorch.org website, creating a space where our community can showcase its latest innovations to our users. Read on to hear about the latest projects in the ecosystem!
+
+## Explore the Latest Tools and Frameworks in the Ecosystem
+
+As we continue into 2024, we're thrilled to showcase an impressive array of ecosystem tools that significantly enrich the PyTorch community. These tools cover a wide range of domains, including anomaly detection, performance profiling, energy measurement, and even quantum computing. Let's explore each one to see firsthand how they are reshaping the PyTorch landscape and opening up exciting possibilities for developers.
+
+
+### [Anomalib](https://github.com/openvinotoolkit/anomalib)
+
+
+Anomalib is a deep learning library that aims to collect state-of-the-art anomaly detection algorithms for benchmarking on both public and private datasets. Anomalib provides several ready-to-use implementations of anomaly detection algorithms described in the recent literature, as well as a set of tools that facilitate the development and implementation of custom models. The library has a strong focus on image-based anomaly detection, where the goal of the algorithm is to identify anomalous images, or anomalous pixel regions within images in a dataset. Anomalib is constantly updated with the latest algorithms and training/inference extensions.
+
+### [Diffusers](https://huggingface.co/docs/diffusers)
+
+Diffusers is Hugging Face's library for state-of-the-art pretrained diffusion models. It offers easy-to-use pipelines for generating images, audio, and other modalities from simple prompts, along with interchangeable noise schedulers and pretrained model components that can serve as building blocks for custom diffusion systems. Built on PyTorch, Diffusers makes it straightforward to run inference, fine-tune, and train diffusion models for generative AI applications.
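+
+As a quick illustration, here is a minimal, hedged sketch of text-to-image generation with Diffusers (the checkpoint name and prompt are placeholders, and a CUDA GPU is assumed):
+
+```python
+import torch
+from diffusers import DiffusionPipeline
+
+# Load a pretrained text-to-image pipeline (checkpoint name is illustrative).
+pipe = DiffusionPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
+).to("cuda")
+
+# Generate an image from a text prompt and save it.
+image = pipe("an astronaut riding a horse on the moon").images[0]
+image.save("astronaut.png")
+```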
+
+### [Pomegranate](https://pomegranate.readthedocs.io/en/latest/)
+
+Pomegranate is a versatile machine learning library that integrates seamlessly with PyTorch. It provides a wide range of probabilistic models and tools for probabilistic modeling tasks. Pomegranate empowers users to build complex models such as hidden Markov models (HMMs), Bayesian networks, and Gaussian mixture models (GMMs). By combining the strengths of PyTorch and Pomegranate, developers can leverage the power of deep learning and probabilistic modeling to tackle various machine learning challenges.
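+
+As a rough sketch of what this looks like in practice (assuming pomegranate 1.x's PyTorch-backed API; exact class and module names may differ across versions), fitting a two-component Gaussian mixture might look like this:
+
+```python
+import torch
+# Module paths below follow pomegranate 1.x and are an assumption.
+from pomegranate.distributions import Normal
+from pomegranate.gmm import GeneralMixtureModel
+
+# Toy data: 1,000 two-dimensional samples.
+X = torch.randn(1000, 2)
+
+# Fit a mixture of two normal components.
+model = GeneralMixtureModel([Normal(), Normal()])
+model.fit(X)
+
+# Posterior probability of each component for every sample.
+probs = model.predict_proba(X)
+```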
+
+
+### [PyPose](https://pypose.org/)
+
+PyPose is a PyTorch-based library for robot learning that connects learning-based perception with physics-based optimization. It provides differentiable Lie group operations and optimizers, making it easier to build end-to-end differentiable pipelines for robotics tasks such as SLAM, odometry, planning, and control. Its intuitive, PyTorch-native APIs make it an excellent choice for researchers and developers working at the intersection of robotics and deep learning.
+
+
+### [PyPOTS](https://github.com/WenjieDu/PyPOTS)
+
+PyPOTS is a Python toolbox for data mining on partially-observed time series with PyTorch. It includes state-of-the-art models supporting imputation, classification, clustering, and forecasting on incomplete (irregularly-sampled) multivariate time series with missing values.
+
+### [OctoML Profiler](https://github.com/octoml/octoml-profile)
+
+OctoML Profiler is a performance profiling tool that aids in optimizing PyTorch models. This tool helps developers identify performance bottlenecks and inefficiencies within their deep learning models. By providing insights into memory usage, compute time, and data movement, the OctoML Profiler enables developers to fine-tune their models for improved efficiency. With this valuable feedback, developers can optimize their models for deployment on various hardware platforms.
+
+### [OpenCompass](https://github.com/open-compass/opencompass)
+
+OpenCompass is a one-stop platform aiming to provide a fair, open, and reproducible benchmark for large model evaluation. Its main features include comprehensive support for models and datasets, efficient distributed evaluation, diversified evaluation paradigms, a modular design with high extensibility, and an experiment management and reporting mechanism.
+
+### [Renate](https://renate.readthedocs.io/en/latest/)
+
+Renate is a PyTorch-based library for automatic model retraining and continual learning. It helps practitioners keep deployed models up to date as new data arrives, providing continual learning algorithms that mitigate catastrophic forgetting together with built-in hyperparameter optimization. By using Renate, developers can maintain model quality over time without the cost of repeatedly retraining from scratch.
+
+
+### [RoMa](https://github.com/naver/roma)
+
+
+RoMa is a standalone library to handle rotation representations with PyTorch (rotation matrices, quaternions, rotation vectors, etc). It aims for robustness, ease-of-use, and efficiency.
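+
+A small sketch of what working with RoMa can look like (function names follow RoMa's documentation as we recall it and may differ slightly across versions):
+
+```python
+import torch
+import roma
+
+# A batch of four rotations in axis-angle (rotation vector) form.
+rotvec = torch.randn(4, 3)
+
+# Convert between rotation representations.
+R = roma.rotvec_to_rotmat(rotvec)          # (4, 3, 3) rotation matrices
+q = roma.rotmat_to_unitquat(R)             # (4, 4) unit quaternions
+rotvec_back = roma.unitquat_to_rotvec(q)   # back to rotation vectors
+```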
+
+
+### [Substra](https://github.com/Substra)
+
+Substra is open-source federated learning (FL) software. It enables the training and validation of machine learning models on distributed datasets, and provides a flexible Python interface and a web application to run federated learning training at scale. Substra is primarily used in production environments and has already been deployed and used by hospitals and biotech companies. It can also be used on a single machine to perform FL simulations and debug code.
+
+### [TorchQuantum](https://hanruiwanghw.wixsite.com/torchquantum)
+
+TorchQuantum is a powerful library that combines the PyTorch framework with quantum computing concepts. It enables developers to explore quantum machine learning algorithms and build hybrid classical-quantum models. By integrating the principles of quantum computing into PyTorch, TorchQuantum opens up new possibilities for solving complex problems that traditional deep learning approaches may struggle with.
+
+### [TIAToolbox](https://github.com/TissueImageAnalytics/tiatoolbox)
+
+TIAToolbox (Tissue Image Analytics Toolbox) is a PyTorch-based library for computational pathology. It provides an end-to-end API for reading and processing whole-slide images, along with pretrained models and tools for common pathology tasks such as tissue segmentation and patch-level classification. By using TIAToolbox, researchers and developers can build digital pathology pipelines without reimplementing low-level whole-slide image handling.
+
+### [torchdistill](https://github.com/yoshitomo-matsubara/torchdistill)
+
+torchdistill is a coding-free framework built on PyTorch for reproducible deep learning and knowledge distillation studies. The framework enables users to design experiments with declarative PyYAML configuration files and supports high-level module abstractions.
+
+### [TorchOpt](https://torchopt.readthedocs.io/en/latest/#)
+
+TorchOpt is a PyTorch library for differentiable optimization. It provides functional and object-oriented implementations of common gradient-based optimizers, along with support for differentiating through the optimization process itself, a key ingredient in meta-learning and bi-level optimization research. With TorchOpt, developers can express these optimization-centric workloads concisely while retaining PyTorch's flexibility and performance.
+
+### [USB](https://usb.readthedocs.io/)
+
+USB (Unified Semi-supervised Learning Benchmark) is a PyTorch-based toolkit for training and evaluating semi-supervised learning algorithms. It provides standardized implementations, datasets, and evaluation protocols across computer vision, natural language processing, and audio tasks, enabling fair and reproducible comparisons between semi-supervised methods. By using USB, researchers and developers can benchmark their algorithms against state-of-the-art baselines and drive advancements in semi-supervised learning.
+
+### [Zeus](https://github.com/ml-energy/zeus)
+
+Zeus is the current state-of-the-art in deep learning energy measurement and optimization. It has monitor components that allow users to measure GPU energy consumption and optimizer components that automatically optimize DNN or GPU knobs based on measurements from the monitor component.
+
+
+## Be Part of Our Ecosystem
+
+Our diverse ecosystem tools are instrumental in PyTorch's success. They provide essential support for tasks such as anomaly detection, generative AI, probabilistic modeling, performance profiling, model evaluation, robot learning, computational pathology, federated learning, quantum computing, knowledge distillation, differentiable optimization, semi-supervised learning, and energy measurement.
+
+Leveraging these tools empowers developers and researchers to accelerate their deep learning workflows and unlock new possibilities in the field of AI.
+
+Have a tool that would be a good fit for the [PyTorch Ecosystem](https://pytorch.org/ecosystem/)? If you can answer the below questions, we’d love for you to [submit your tool for review](https://pytorch.org/ecosystem/join).
+
+
+
+1. Does your project complement PyTorch, enhancing user experience, introducing new capabilities, or accelerating training and inference processes?
+ * Examples could include visualization tools, a kernel library or a framework that sits on top to enable research in a particular area such as NLP.
+2. Is the project ready for broad developer usage?
+ * For example, is the project stable, will it be maintained, and is there adequate supporting infrastructure, documentation, and technical support to allow a developer to successfully use it?
+
+Thank you to all of our contributors and collaborators in our ecosystem! Here’s to a great 2024.
diff --git a/_posts/2024-05-11-introducing-depyf.md b/_posts/2024-05-11-introducing-depyf.md
new file mode 100644
index 000000000000..382e0d441eb6
--- /dev/null
+++ b/_posts/2024-05-11-introducing-depyf.md
@@ -0,0 +1,216 @@
+---
+layout: blog_detail
+title: "Introducing depyf: mastering torch.compile with ease"
+hidden: true
+---
+
+![depyf logo](/assets/images/depyf.png){:style="width:100%;display: block; max-width: 400px; margin-right: auto; margin-left: auto"}
+
+
+We are thrilled to introduce `depyf`, a new project in the PyTorch ecosystem designed to help users understand, learn, and adapt to `torch.compile`!
+
+
+## Motivation
+
+`torch.compile` is a cornerstone of PyTorch 2.x, offering a straightforward path to accelerate machine learning workflows with just a single line of code for both training and inference. The mere inclusion of `@torch.compile` can [dramatically enhance the performance of your code](https://pytorch.org/get-started/pytorch-2.0/). However, identifying the optimal insertion point for `torch.compile` is not easy, not to mention the complexity of adjusting various knobs for maximum efficiency.
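+
+For example, `torch.compile` can be applied either as a decorator on a function or directly to an `nn.Module`; where you place that single line determines what gets captured and optimized. The snippet below is a minimal sketch of both forms (the toy model and tensor shapes are illustrative only):
+
+```
+import torch
+
+model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
+
+# Option 1: compile an entire nn.Module.
+compiled_model = torch.compile(model)
+out = compiled_model(torch.randn(8, 16))
+
+# Option 2: compile a free function with the decorator form.
+@torch.compile
+def f(x):
+    return torch.sin(x) + torch.cos(x)
+
+y = f(torch.randn(8))
+```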
+
+The intricacies of the `torch.compile` stack, encompassing Dynamo, AOTAutograd, Inductor, and more, present a **steep learning curve**. These components, essential for deep learning performance optimization, can be daunting without a solid foundation in the subject.
+
+
+_Note: For an introductory example of how torch.compile works, please refer to this [walk-through explanation](https://depyf.readthedocs.io/en/latest/walk_through.html)._
+
+
+## A common tool: `TORCH_COMPILE_DEBUG`
+
+To demystify `torch.compile`, the common approach involves leveraging the `TORCH_COMPILE_DEBUG` environment variable. While it provides more information, deciphering the output remains a formidable task.
+
+For example, when we have the following code:
+
+
+```
+# test.py
+import torch
+from torch import _dynamo as torchdynamo
+from typing import List
+
+@torch.compile
+def toy_example(a, b):
+ x = a / (torch.abs(a) + 1)
+ if b.sum() < 0:
+ b = b * -1
+ return x * b
+
+def main():
+ for _ in range(100):
+ toy_example(torch.randn(10), torch.randn(10))
+
+if __name__ == "__main__":
+ main()
+```
+
+
+When we run it with `TORCH_COMPILE_DEBUG=1 python test.py`, we get a directory named `torch_compile_debug/run_2024_02_05_23_02_45_552124-pid_9520`, under which there are these files:
+
+
+```
+.
+├── torchdynamo
+│ └── debug.log
+└── torchinductor
+ ├── aot_model___0_debug.log
+ ├── aot_model___10_debug.log
+ ├── aot_model___11_debug.log
+ ├── model__4_inference_10.1
+ │ ├── fx_graph_readable.py
+ │ ├── fx_graph_runnable.py
+ │ ├── fx_graph_transformed.py
+ │ ├── ir_post_fusion.txt
+ │ ├── ir_pre_fusion.txt
+ │ └── output_code.py
+ ├── model__5_inference_11.2
+ │ ├── fx_graph_readable.py
+ │ ├── fx_graph_runnable.py
+ │ ├── fx_graph_transformed.py
+ │ ├── ir_post_fusion.txt
+ │ ├── ir_pre_fusion.txt
+ │ └── output_code.py
+ └── model___9.0
+ ├── fx_graph_readable.py
+ ├── fx_graph_runnable.py
+ ├── fx_graph_transformed.py
+ ├── ir_post_fusion.txt
+ ├── ir_pre_fusion.txt
+ └── output_code.py
+```
+
+
+The generated files and logs often raise more questions than they answer, leaving developers puzzled over the meaning and relationships within the data. Common puzzles for `TORCH_COMPILE_DEBUG` include:
+
+
+
+* What does `model__4_inference_10.1` mean?
+* I have one function but three `model__xxx.py` in the directory, what is their correspondence?
+* What is all that `LOAD_GLOBAL` stuff in `debug.log`?
+
+
+## A better tool: `depyf` comes to the rescue
+
+Let’s see how `depyf` can help developers resolve the above challenges. To use `depyf`, simply execute `pip install depyf` or follow the project page ([https://github.com/thuml/depyf](https://github.com/thuml/depyf)) to install the latest version, and then wrap the main code inside `with depyf.prepare_debug`.
+
+
+```
+# test.py
+import torch
+from torch import _dynamo as torchdynamo
+from typing import List
+
+@torch.compile
+def toy_example(a, b):
+ x = a / (torch.abs(a) + 1)
+ if b.sum() < 0:
+ b = b * -1
+ return x * b
+
+def main():
+ for _ in range(100):
+ toy_example(torch.randn(10), torch.randn(10))
+
+if __name__ == "__main__":
+ import depyf
+ with depyf.prepare_debug("depyf_debug_dir"):
+ main()
+```
+
+
+After executing `python test.py`, `depyf` will produce a directory named `depyf_debug_dir` (the argument of the `prepare_debug` function). Under the directory, there will be these files:
+
+
+```
+.
+├── __compiled_fn_0 AFTER POST GRAD 0.py
+├── __compiled_fn_0 Captured Graph 0.py
+├── __compiled_fn_0 Forward graph 0.py
+├── __compiled_fn_0 kernel 0.py
+├── __compiled_fn_3 AFTER POST GRAD 0.py
+├── __compiled_fn_3 Captured Graph 0.py
+├── __compiled_fn_3 Forward graph 0.py
+├── __compiled_fn_3 kernel 0.py
+├── __compiled_fn_4 AFTER POST GRAD 0.py
+├── __compiled_fn_4 Captured Graph 0.py
+├── __compiled_fn_4 Forward graph 0.py
+├── __compiled_fn_4 kernel 0.py
+├── __transformed_code_0_for_torch_dynamo_resume_in_toy_example_at_8.py
+├── __transformed_code_0_for_toy_example.py
+├── __transformed_code_1_for_torch_dynamo_resume_in_toy_example_at_8.py
+└── full_code_for_toy_example_0.py
+```
+
+
+And there are two obvious benefits:
+
+
+
+1. The long and difficult-to-understand `torchdynamo/debug.log` is gone. Its content is cleaned up and shown as human-readable source code in `full_code_for_xxx.py` and `__transformed_code_{n}_for_xxx.py`. It is worth noting that the most tedious and difficult job of `depyf` is to decompile the bytecode inside `torchdynamo/debug.log` into Python source code, freeing developers from the intimidating internals of Python.
+2. The correspondence between function names and computation graphs is preserved. For example, in `__transformed_code_0_for_toy_example.py`, we can see a function named `__compiled_fn_0`, and we immediately know that its corresponding computation graphs are in `__compiled_fn_0_xxx.py`, because they share the same `__compiled_fn_0` prefix.
+
+Starting with `full_code_for_xxx.py` and following the functions involved, users will have a clear view of what `torch.compile` does to their code.
+
+
+## One more thing: step-through debuggability
+
+Stepping through code line by line using debuggers is a great way to understand how code works. However, under `TORCH_COMPILE_DEBUG`, those files are only for users’ information, and cannot be executed with the data users care about.
+
+
+_Note: By “debug”, we mean the process of inspecting and improving a program, rather than correcting buggy code._
+
+A standout feature of `depyf` is its capability to facilitate step-through debugging for `torch.compile`: all of the files it generates are linked with runtime code objects inside the Python interpreter, and we can set breakpoints in these files. The usage is simple: just add one more context manager, `with depyf.debug()`, and it should do the trick:
+
+
+```
+# test.py
+import torch
+from torch import _dynamo as torchdynamo
+from typing import List
+
+@torch.compile
+def toy_example(a, b):
+ x = a / (torch.abs(a) + 1)
+ if b.sum() < 0:
+ b = b * -1
+ return x * b
+
+def main():
+ for _ in range(100):
+ toy_example(torch.randn(10), torch.randn(10))
+
+if __name__ == "__main__":
+ import depyf
+ with depyf.prepare_debug("depyf_debug_dir"):
+ main()
+ with depyf.debug():
+ main()
+```
+
+
+Just one caveat: the workflow of debugging `torch.compile` deviates from the standard debugging workflow. With `torch.compile`, much of the code is **dynamically** generated. Therefore, we need to:
+
+
+
+1. Launch the program.
+2. When the program exits `with depyf.prepare_debug("depyf_debug_dir")`, the code will be available in `depyf_debug_dir`.
+3. When the program enters `with depyf.debug()`, it will automatically set a breakpoint internally, so that the program is paused.
+4. Navigate to `depyf_debug_dir` to set breakpoints.
+5. Continue to run the code, and debuggers will hit these breakpoints!
+
+
+![Screenshot of step-through debugging of torch.compile generated code](/assets/images/depyf-screenshot.png){:style="width:100%;"}
+
+
+Here is a screenshot of what it looks like. All code and tensor variables are live, and we can inspect any variable and step through the code, just as in our daily debugging workflow! The only difference is that we are debugging `torch.compile`-generated code rather than human-written code.
+
+
+## Conclusion
+
+`torch.compile` serves as an invaluable tool for accelerating PyTorch code effortlessly. However, for those looking to delve deeper into `torch.compile`, whether to leverage its full potential or to integrate custom operations, the learning curve can be steep. `depyf` is designed to lower this barrier, offering a user-friendly experience to understand, learn, and adapt to `torch.compile`.
+
+Do explore `depyf` and experience its benefits firsthand! The project is open-source and readily available at [https://github.com/thuml/depyf](https://github.com/thuml/depyf). Installation is straightforward via `pip install depyf`. We hope `depyf` can enhance everyone’s development workflow with `torch.compile`.
\ No newline at end of file
diff --git a/_posts/2024-05-11-zeus.md b/_posts/2024-05-11-zeus.md
new file mode 100644
index 000000000000..46fa59549ffd
--- /dev/null
+++ b/_posts/2024-05-11-zeus.md
@@ -0,0 +1,194 @@
+---
+layout: blog_detail
+title: "Deep Learning Energy Measurement and Optimization"
+hidden: true
+author: Jae-Won Chung
+---
+
+![Zeus logo](/assets/images/zeus/fig1.png){:style="width:100%;display: block; max-width: 400px; margin-right: auto; margin-left: auto"}
+
+_This post is authored by [Jae-Won Chung](https://jaewonchung.me/about), a PhD student at the University of Michigan and the lead of the [ML.ENERGY Initiative](https://ml.energy)._
+
+Deep learning consumes quite a bit of energy. For instance, training a single 200B LLM on AWS p4d instances consumed around 11.9 GWh (source: [CIDR 2024 keynote](https://mvdirona.com/jrh/talksandpapers/JamesHamiltonCIDR2024.pdf)), which is an amount that can single-handedly power more than a thousand [average US households](https://www.eia.gov/tools/faqs/faq.php?id=97&t=3) for a year.
+
+[Zeus](https://github.com/ml-energy/zeus) is an open-source toolbox for measuring and optimizing the energy consumption of deep learning workloads. Our goal is to make energy optimization based on accurate measurements as easy as possible for diverse deep learning workloads and setups by offering composable tools with minimal assumptions.
+
+Zeus largely provides two types of tools:
+
+
+
+1. Programmatic and command line GPU energy **measurement** tools
+2. Several energy **optimization** tools that find the best ML and/or GPU configurations
+
+Zeus can benefit those who would like to
+
+
+
+* measure and optimize their electricity cost
+* reduce heat dissipation from their GPUs (by lowering power draw)
+* report energy usage from research and development
+* reduce carbon footprint from electricity usage
+
+
+## Part 1: Measuring Energy
+
+Just like performance optimization, accurate measurement is the basis of effective energy optimization. Popular proxies for estimating power consumption like the maximum power draw of the hardware [can sometimes be vastly off](https://ml.energy/blog/energy/measurement/measuring-gpu-energy-best-practices/) compared to actual measurement.
+
+To make energy measurement as easy and transparent as possible, the core utility Zeus offers is the `ZeusMonitor` class. Let’s take a look at the actual snippet:
+
+```python
+from zeus.monitor import ZeusMonitor
+
+# All four GPUs are measured simultaneously.
+monitor = ZeusMonitor(gpu_indices=[0,1,2,3])
+
+# Measure total time and energy within the window.
+monitor.begin_window("training")
+for e in range(100):
+
+ # Measurement windows can arbitrarily be overlapped.
+ monitor.begin_window("epoch")
+ for x, y in train_dataloader:
+ y_hat = model(x)
+ loss = criterion(y, y_hat)
+ loss.backward()
+ optim.step()
+ measurement = monitor.end_window("epoch")
+ print(f"Epoch {e}: {measurement.time} s, {measurement.total_energy} J")
+
+measurement = monitor.end_window("training")
+print(f"Entire training: {measurement.time} s, {measurement.total_energy} J")
+```
+
+<script src="https://gist.github.com/jaywonchung/f580b782ff0513374c6fa507d5e072a8.js"></script>
+
+What you see above is a typical PyTorch training loop which uses four GPUs for data parallel training. Inside, we created an instance of `ZeusMonitor` and passed in a list of GPU indices to monitor. Then, using the monitor, we can measure the time and energy consumption of arbitrary execution _windows_ within the training script by pairing calls to `begin_window` and `end_window`. Multiple windows can overlap and nest in arbitrary ways without affecting the measurement of each, as long as their names are different.
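+
+As a small sketch of the overlapping-window behavior described above (using only the `begin_window`/`end_window` API from the snippet), two windows can share a region without nesting; the model, loss, and optimizer below are illustrative placeholders:
+
+```python
+import torch
+from zeus.monitor import ZeusMonitor
+
+# Toy setup so the sketch is self-contained (assumes one CUDA GPU).
+model = torch.nn.Linear(16, 1).cuda()
+optim = torch.optim.SGD(model.parameters(), lr=0.01)
+x, y = torch.randn(32, 16).cuda(), torch.randn(32, 1).cuda()
+
+monitor = ZeusMonitor(gpu_indices=[0])
+
+# "fwd_bwd" covers forward + backward; "bwd_step" covers backward + optimizer step.
+# The two windows overlap but are measured independently.
+monitor.begin_window("fwd_bwd")
+loss = torch.nn.functional.mse_loss(model(x), y)
+monitor.begin_window("bwd_step")
+loss.backward()
+fwd_bwd = monitor.end_window("fwd_bwd")
+optim.step()
+bwd_step = monitor.end_window("bwd_step")
+
+print(f"fwd+bwd: {fwd_bwd.time:.3f} s, {fwd_bwd.total_energy:.1f} J")
+print(f"bwd+step: {bwd_step.time:.3f} s, {bwd_step.total_energy:.1f} J")
+```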
+
+`ZeusMonitor` adds very little overhead – typically single digit milliseconds – around the window. This allows `ZeusMonitor` to be used in various applications. For instance:
+
+
+
+* [The ML.ENERGY Leaderboard](https://ml.energy/leaderboard): The first open-source benchmark on how much energy LLM text generation consumes.
+* [The ML.ENERGY Colosseum](https://ml.energy/leaderboard): An online service that lets users compare LLM responses side-by-side based on response quality _and_ energy consumption.
+
+See our [blog post](https://ml.energy/blog/energy/measurement/measuring-gpu-energy-best-practices/) for a deeper technical dive into accurate GPU energy measurement.
+
+
+## Part 2: Optimizing Energy
+
+Let me introduce you to two of the energy optimizers provided by Zeus.
+
+
+### `GlobalPowerLimitOptimizer`
+
+
+GPUs allow users to configure their maximum power draw, called the _power limit_. Typically, as you lower the GPU’s power limit from the default maximum, computation may get slightly slower, but you’ll save disproportionately more energy. The `GlobalPowerLimitOptimizer` in Zeus automatically finds the optimal GPU power limit globally across all GPUs.
+
+```python
+from zeus.monitor import ZeusMonitor
+from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer
+
+# The optimizer measures time and energy through the ZeusMonitor.
+monitor = ZeusMonitor(gpu_indices=[0,1,2,3])
+plo = GlobalPowerLimitOptimizer(monitor)
+
+for e in range(100):
+ plo.on_epoch_begin()
+ for x, y in train_dataloader:
+ plo.on_step_begin()
+
+ y_hat = model(x)
+ loss = criterion(y, y_hat)
+ loss.backward()
+ optim.step()
+
+ plo.on_step_end()
+ plo.on_epoch_end()
+```
+
+<script src="https://gist.github.com/jaywonchung/1922ddd56b15f8764f2bdacc4a441109.js"></script>
+
+In our familiar PyTorch training loop, we have instantiated `GlobalPowerLimitOptimizer` and passed it an instance of the `ZeusMonitor`, through which the optimizer sees the GPUs. Then, we just need to let the optimizer know about training progress (step and epoch boundaries), and the optimizer will transparently do all the necessary profiling and converge to the optimal power limit.
+
+If you’re using the HuggingFace [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) or [SFTTrainer](https://huggingface.co/docs/trl/main/en/sft_trainer), integration is even easier:
+
+```python
+from zeus.monitor import ZeusMonitor
+from zeus.optimizer.power_limit import HFGlobalPowerLimitOptimizer
+
+# ZeusMonitor actually auto-detects CUDA_VISIBLE_DEVICES.
+monitor = ZeusMonitor()
+pl_optimizer = HFGlobalPowerLimitOptimizer(monitor)
+
+# Pass in the optimizer as a Trainer callback. Also works for SFTTrainer.
+trainer = Trainer(
+ model=model,
+ train_dataset=train_dataset,
+ ...,
+ callbacks=[pl_optimizer],
+)
+```
+
+<script src="https://gist.github.com/jaywonchung/69aa379dd9633a6a486cede1887cec2c.js"></script>
+
+The `HFGlobalPowerLimitOptimizer` wraps `GlobalPowerLimitOptimizer` so that it automatically detects step and epoch boundaries. We have example integrations [here](https://github.com/ml-energy/zeus/tree/master/examples/huggingface), including running Gemma 7B supervised fine-tuning with QLoRA.
+
+Now, we know how to integrate the optimizer, but what is the _optimal_ power limit? We know different users can have different preferences regarding trading off time and energy, so we allow users to specify an `OptimumSelector` (basically the [Strategy Pattern](https://en.wikipedia.org/wiki/Strategy_pattern)) to express their needs.
+
+```python
+# Built-in strategies for selecting the optimal power limit.
+from zeus.optimizer.power_limit import (
+ GlobalPowerLimitOptimizer,
+ Time,
+ Energy,
+ MaxSlowdownConstraint,
+)
+
+# Minimize energy while tolerating at most 10% slowdown.
+plo = GlobalPowerLimitOptimizer(
+ monitor,
+ MaxSlowdownConstraint(factor=1.1),
+)
+
+```
+
+<script src="https://gist.github.com/jaywonchung/1077b14bc7440b849be1f8320d4bf791.js"></script>
+
+Some of the built-in strategies include “Minimize time” ([Time](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.Time), this might still reduce the power limit from the default since some workloads exhibit almost no slowdown even on lower power limits), “Minimize energy” ([Energy](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.Energy)), “Somewhere in between” ([ZeusCost](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.ZeusCost)), and “Minimize energy given maximum slowdown” ([MaxSlowdownConstraint](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.MaxSlowdownConstraint)). Users can also create their own optimum selectors as needed.
+
+
+### `PipelineFrequencyOptimizer`
+
+
+The pipeline frequency optimizer, based on our research paper [Perseus](https://ml.energy/zeus/research_overview/perseus), is our latest work on energy optimization for large model training, like GPT-3. Perseus can reduce the energy consumption of large model training with no or negligible training throughput degradation. We’ll briefly talk about how.
+
+![One training iteration of four-stage pipeline parallelism, with each forward and backward computation colored by its power consumption](/assets/images/zeus/fig2.png){:style="width:100%;"}
+
+
+The above is a visualization of one iteration of training with four-stage _pipeline parallelism_ running with the 1F1B schedule. Each box is either a forward or a backward computation, and is colored with its power consumption.
+
+The key observation here is that when models are partitioned into pipeline stages, it’s very difficult to slice them into perfectly equal sizes. This leads to forward/backward boxes of varying widths and therefore computation _idle time_ between boxes. Notice that those smaller boxes can afford to run slightly slower, and the overall critical path (blue line) will not change at all.
+
+![The same pipeline with off-critical-path computations slowed down by Perseus](/assets/images/zeus/fig3.png){:style="width:100%;"}
+
+That’s what Perseus automatically does. Based on profiling, it identifies computation boxes that are not on the critical path and figures out the precise amount of slowdown for each box that minimizes energy consumption. When done correctly, the computations we slow down consume less power and energy, but the overall iteration time of the pipeline does not change.
+
+See [our guide](https://ml.energy/zeus/optimize/pipeline_frequency_optimizer/) to get started with Perseus!
+
+
+## Final Words
+
+For users who run their own on-premise compute, energy consumption and the resulting electricity bill are not something that can be easily overlooked. On a larger scale, energy consumption is not just about electricity bills, but also about data center power delivery. With thousands of GPUs running in clusters, finding stable, affordable, and sustainable electricity sources to power data centers is becoming [increasingly challenging](https://www.cbre.com/insights/reports/north-america-data-center-trends-h1-2023). Finding ways to reduce energy consumption disproportionately more than the accompanying slowdown lowers average power consumption, which can help with the power delivery challenge.
+
+With Zeus, we hope to take the first step towards deep learning energy measurement and optimization.
+
+Wondering where to go from here? Here are a few helpful links:
+
+* [Zeus homepage/documentation](https://ml.energy/zeus)
+* [Zeus GitHub repository](https://github.com/ml-energy/zeus)
+* [Zeus usage and integration examples](https://github.com/ml-energy/zeus/tree/master/examples)
+* [ML.ENERGY Initiative](https://ml.energy) (i.e., the people building Zeus)
\ No newline at end of file
diff --git a/assets/images/depyf-screenshot.png b/assets/images/depyf-screenshot.png
new file mode 100644
index 000000000000..23ecde6f04da
Binary files /dev/null and b/assets/images/depyf-screenshot.png differ
diff --git a/assets/images/depyf.png b/assets/images/depyf.png
new file mode 100644
index 000000000000..d9104cf8f829
Binary files /dev/null and b/assets/images/depyf.png differ
diff --git a/assets/images/zeus/fig1.png b/assets/images/zeus/fig1.png
new file mode 100644
index 000000000000..05be1cc99a0a
Binary files /dev/null and b/assets/images/zeus/fig1.png differ
diff --git a/assets/images/zeus/fig2.png b/assets/images/zeus/fig2.png
new file mode 100644
index 000000000000..e7486983e387
Binary files /dev/null and b/assets/images/zeus/fig2.png differ
diff --git a/assets/images/zeus/fig3.png b/assets/images/zeus/fig3.png
new file mode 100644
index 000000000000..caf4e0a1af4b
Binary files /dev/null and b/assets/images/zeus/fig3.png differ