Commit 2bb6103

split authors into two lines
Signed-off-by: Chris Abraham <cjyabraham@gmail.com>
1 parent 1580429 commit 2bb6103

File tree

1 file changed: +4 -1 lines changed

_posts/2024-11-25-training-using-float8-fsdp2.md

Lines changed: 4 additions & 1 deletion
```diff
@@ -1,9 +1,12 @@
 ---
 layout: blog_detail
 title: "Supercharging Training using float8 and FSDP2"
-author: "IBM: Tuan Hoang Trong, Alexei Karve, Yan Koyfman, Linsong Chu, Divya Kumari, Shweta Salaria, Robert Walkup, Praneet Adusumilli, Nirmit Desai, Raghu Ganti, Seetharami Seelam, Meta: Less Wright, Wei Feng, Vasiliy Kuznetsov, Driss Guesseous"
+author: "IBM and Meta"
 ---
 
+**IBM**: Tuan Hoang Trong, Alexei Karve, Yan Koyfman, Linsong Chu, Divya Kumari, Shweta Salaria, Robert Walkup, Praneet Adusumilli, Nirmit Desai, Raghu Ganti, Seetharami Seelam
+**Meta**: Less Wright, Wei Feng, Vasiliy Kuznetsov, Driss Guesseous
+
 In this blog, we will demonstrate how we achieve up to **50% throughput speedup** while achieving loss and evaluation benchmark parity in training over [FSDP1 bf16 training](https://pytorch.org/blog/maximizing-training-throughput/). We achieve this speedup by leveraging FSDP2, DTensor, and torch.compile with torchao’s float8 via linear layer updates (compute), and float8 all_gathers for weight communication. We showcase these improvements across a spectrum of Meta LLaMa model architecture sizes, ranging from small 1.8B model size all the way to 405B model size, making training faster than ever.
 
 We demonstrate these improvements using the Meta Llama3 architecture, and then perform model quality studies at two scales: 100B tokens at 8B model size, and 50B tokens at 70B model size, which provide an exact comparison of float8 and bf16 training loss curves. We demonstrate that the loss curves result in identical loss convergence across these model training runs compared to the `bf16` counterpart. Further, we train a 3B model to 1T tokens using the FineWeb-edu dataset and run standard evaluation benchmarks to ensure that the model quality is intact and comparable to a `bf16` run.
```
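For context, the post excerpted in this diff pairs torchao float8 linear layers and float8 all-gathers with FSDP2 (DTensor-based per-parameter sharding) and torch.compile. The snippet below is a minimal sketch of that kind of setup, not the blog's training code: it uses torchao's public `convert_to_float8_training` / `Float8LinearConfig` API and FSDP2's `fully_shard`, while the toy model, dimensions, optimizer, and launch details are illustrative assumptions.

```python
# Minimal sketch (not the blog's training recipe): float8 linear compute +
# float8 all-gather with FSDP2 and torch.compile. Toy model, dimensions, and
# launch assumptions (torchrun, NCCL, Hopper-class GPUs) are illustrative.
import torch
import torch.distributed as dist
import torch.nn as nn
# FSDP2 entry point; older releases expose it under torch.distributed._composable.fsdp.
from torch.distributed.fsdp import fully_shard
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

dist.init_process_group("nccl")  # assumes launch via torchrun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())


class Block(nn.Module):
    """Toy FFN block standing in for a transformer layer."""

    def __init__(self, dim: int) -> None:
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ffn(x)


model = nn.Sequential(*(Block(4096) for _ in range(4))).cuda()

# Swap eligible nn.Linear layers for float8 linears (float8 compute) and enable
# float8 all-gather so FSDP2 communicates sharded weights in float8 rather than bf16.
convert_to_float8_training(
    model,
    config=Float8LinearConfig(enable_fsdp_float8_all_gather=True),
)

# Regional torch.compile: compile each block so the float8 casting/scaling is fused.
for block in model:
    block.compile()

# FSDP2: shard each block, then the root module (parameters become DTensors).
for block in model:
    fully_shard(block)
fully_shard(model)

# One illustrative training step with a dummy objective.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(8, 4096, device="cuda")
loss = model(x).float().pow(2).mean()
loss.backward()
optimizer.step()
```

Per-block sharding and per-block compilation mirror the common FSDP2 usage pattern of wrapping each transformer layer before the root module; the real runs described in the post use Meta Llama3 architectures and bf16 master weights rather than this toy FFN stack.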
