
Commit f8d49b8

Added blog post "Introducing torchchat" (pytorch#1692)

Authored by cjyabraham and kyliewd

Signed-off-by: Chris Abraham <cjyabraham@gmail.com>
Co-authored-by: Kylie Wagar-Dirks <107439830+kyliewd@users.noreply.github.com>

1 parent 9f6f7a6

File tree: 2 files changed (162 additions, 0 deletions)

---
layout: blog_detail
title: "Introducing torchchat: Accelerating Local LLM Inference on Laptop, Desktop and Mobile"
author: Ali Khosh, Jesse White, Orion Reblitz-Richardson
---

Today, we’re releasing [torchchat](https://github.com/pytorch/torchchat), a library showcasing how to seamlessly and performantly run Llama 3, 3.1, and other large language models across laptop, desktop, and mobile.
In our previous blog posts, we [showed](https://pytorch.org/blog/accelerating-generative-ai-2/) how to use native PyTorch 2.0 to run LLMs with great performance using CUDA. Torchchat expands on this with more target environments, models, and execution modes, while providing important functions such as export, quantization, and evaluation in a way that’s easy to understand.
You will find the project organized into three areas:

* Python: Torchchat provides a [REST API](https://github.com/pytorch/torchchat?tab=readme-ov-file#server) that is called via a Python CLI or can be accessed via the browser (a minimal client sketch follows the diagram below)
* C++: Torchchat produces a desktop-friendly binary using PyTorch's [AOTInductor](https://pytorch-dev-podcast.simplecast.com/episodes/aotinductor) backend
* Mobile devices: Torchchat uses [ExecuTorch](https://pytorch.org/executorch/stable/index.html) to export a .pte binary file for on-device inference
![torchchat schema](/assets/images/torchchat.png){:style="width:100%"}
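
To make the Python entry point concrete, here is a minimal sketch of querying a locally running torchchat server from Python. The startup command, port, route, and payload shape follow the OpenAI-style API described in the torchchat README at the time of writing; treat them as assumptions and verify the exact details against the README in your checkout.

```python
# Minimal sketch: querying a locally running torchchat server from Python.
# Assumes the server was started with a command along the lines of
# `python3 torchchat.py server llama3` and that it exposes an OpenAI-style
# chat completions route on port 5000; both are assumptions taken from the
# torchchat README, so verify the command, port, and path locally.
import requests

payload = {
    "model": "llama3",  # model alias used when the server was started
    "messages": [
        {"role": "user", "content": "Write a haiku about running LLMs locally."}
    ],
    "temperature": 0.7,
}

resp = requests.post(
    "http://127.0.0.1:5000/v1/chat/completions", json=payload, timeout=120
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```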
## Performance
The following tables track the performance of torchchat running Llama 3 across a variety of configurations.

_Numbers for Llama 3.1 are coming soon._
**Llama 3 8B Instruct on Apple MacBook Pro M1 Max 64GB**
<table class="table table-bordered">
  <tr>
    <td><strong>Mode</strong></td>
    <td><strong>DType</strong></td>
    <td><strong>Llama 3 8B Tokens/Sec</strong></td>
  </tr>
  <tr>
    <td rowspan="3">Arm Compile</td>
    <td>float16</td>
    <td>5.84</td>
  </tr>
  <tr>
    <td>int8</td>
    <td>1.63</td>
  </tr>
  <tr>
    <td>int4</td>
    <td>3.99</td>
  </tr>
  <tr>
    <td rowspan="3">Arm AOTI</td>
    <td>float16</td>
    <td>4.05</td>
  </tr>
  <tr>
    <td>int8</td>
    <td>1.05</td>
  </tr>
  <tr>
    <td>int4</td>
    <td>3.28</td>
  </tr>
  <tr>
    <td rowspan="3">MPS Eager</td>
    <td>float16</td>
    <td>12.63</td>
  </tr>
  <tr>
    <td>int8</td>
    <td>16.9</td>
  </tr>
  <tr>
    <td>int4</td>
    <td>17.15</td>
  </tr>
</table>
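
The int8 and int4 rows come from weight-only quantization, which torchchat configures through a small JSON object passed to its CLI. The sketch below shows one way to drive such a run from Python; the `generate` subcommand, the `--quantize` and `--prompt` flags, and the `linear:int4`/`groupsize` scheme reflect our reading of the torchchat docs and should be treated as assumptions rather than a definitive recipe.

```python
# Illustrative sketch: producing a quantized run like the int4 rows above.
# The subcommand, flags, and quantization scheme names are assumptions based
# on the torchchat README; confirm them against the quantization docs in your
# checkout before relying on them.
import json
import subprocess

quant_config = {"linear:int4": {"groupsize": 256}}  # example 4-bit weight-only scheme

subprocess.run(
    [
        "python3", "torchchat.py", "generate", "llama3",
        "--quantize", json.dumps(quant_config),
        "--prompt", "Once upon a time",
    ],
    check=True,
)
```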
**Llama 3 8B Instruct on Linux x86 and CUDA**
_Intel(R) Xeon(R) Platinum 8339HC CPU @ 1.80GHz with 180GB RAM + A100 (80GB)_
<table class="table table-bordered">
  <tr>
    <td><strong>Mode</strong></td>
    <td><strong>DType</strong></td>
    <td><strong>Llama 3 8B Tokens/Sec</strong></td>
  </tr>
  <tr>
    <td rowspan="3">x86 Compile</td>
    <td>bfloat16</td>
    <td>2.76</td>
  </tr>
  <tr>
    <td>int8</td>
    <td>3.15</td>
  </tr>
  <tr>
    <td>int4</td>
    <td>5.33</td>
  </tr>
  <tr>
    <td rowspan="3">CUDA Compile</td>
    <td>bfloat16</td>
    <td>83.23</td>
  </tr>
  <tr>
    <td>int8</td>
    <td>118.17</td>
  </tr>
  <tr>
    <td>int4</td>
    <td>135.16</td>
  </tr>
</table>
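
A note on the Mode column: Eager runs the model with ordinary op-by-op PyTorch execution, Compile wraps it in `torch.compile`, and AOTI compiles it ahead of time into a shared library via AOTInductor. The snippet below is a conceptual sketch of the eager-versus-compile distinction only, shown on a toy module rather than Llama 3 so it stays self-contained; it is not torchchat's own code.

```python
# Conceptual sketch of the "Eager" vs. "Compile" modes in the tables above,
# demonstrated on a toy module so the example runs on its own. torchchat
# applies the same idea (torch.compile around the decode step) to the real model.
import torch

class TinyDecoder(torch.nn.Module):
    def __init__(self, vocab_size: int = 128, dim: int = 64):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.proj = torch.nn.Linear(dim, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(self.embed(tokens))

model = TinyDecoder()
tokens = torch.randint(0, 128, (1, 8))

# Eager mode: the forward pass runs op by op, exactly as written.
eager_logits = model(tokens)

# Compile mode: torch.compile captures the forward pass and emits fused kernels.
compiled_model = torch.compile(model)
compiled_logits = compiled_model(tokens)

print(eager_logits.shape, compiled_logits.shape)
```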
Torchchat provides exceptional performance for Llama 3 8B on mobile (iPhone and Android). We run Llama 2 7B on the Samsung Galaxy S22 and S23 and on the iPhone 15 Pro using 4-bit GPTQ and post-training quantization (PTQ). Early work on Llama 3 8B support, done in collaboration with ExecuTorch, is also included; many improvements were made to export speed, memory overhead, and runtime speed along the way. Ultimately, we expect even stronger performance through Core ML, MPS, and HTP in the near future. We are excited!
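
For on-device runs like the ones described above, the model is first exported to an ExecuTorch .pte file and then loaded by the mobile runner. A rough sketch of driving that export from Python follows; the `export` subcommand and the `--output-pte-path` flag reflect our reading of the torchchat README and may differ in your checkout, so verify the available options there first.

```python
# Rough sketch: exporting Llama 3 to an ExecuTorch .pte file for on-device inference.
# The subcommand and flag names are assumptions based on the torchchat README;
# check the export documentation in your version before running this.
import subprocess

subprocess.run(
    [
        "python3", "torchchat.py", "export", "llama3",
        "--output-pte-path", "llama3.pte",
    ],
    check=True,
)
```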
We encourage you to **[clone the torchchat repo and give it a spin](https://github.com/pytorch/torchchat)**, explore its capabilities, and share your feedback as we continue to empower the PyTorch community to run LLMs locally and on constrained devices. Together, let's unlock the full potential of generative AI and LLMs on any device. Please submit [issues](https://github.com/pytorch/torchchat/issues) as you see them, as well as to [PyTorch](https://github.com/pytorch/pytorch/issues) and [ExecuTorch](https://github.com/pytorch/executorch/issues), since we are still iterating quickly. We’re also inviting community contributions across a broad range of areas, from additional models and target hardware support to new quantization schemes and performance improvements. Happy experimenting!

assets/images/torchchat.png (binary file, 206 KB)