_posts/2025-03-14-pytorch-at-gtc.md
Lines changed: 28 additions & 55 deletions
@@ -9,7 +9,7 @@ author: "Team PyTorch at NVIDIA"
9
9
Join in person with [discounted GTC registration](https://www.nvidia.com/gtc/?ncid=GTC-NVI0K8HVX) for PyTorch Foundation or [watch online](https://register.nvidia.com/flow/nvidia/gtcs25/registration/) with free registration.
### [Scaling Open Source AI: From Foundation Models to Ecosystem Success](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1738966749087001K1dG)
@@ -19,104 +19,79 @@ Hear from PyTorch Foundation’s Executive Director Matt White & panelists from
19
19
20
20
## PyTorch @ GTC
21
21
22
-
[The Performance of CUDA with the Flexibility of PyTorch ](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1726155993061001WWZM)
23
-
22
+
[The Performance of CUDA with the Flexibility of PyTorch ](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1726155993061001WWZM)
24
23
Mark Saroufim, Software Engineer, Meta Platforms
25
24
26
25
This talk explores how PyTorch users are also becoming CUDA developers. We'll start with motivating examples from eager mode, the launch of torch.compile, and the more recent trend of kernel zoos. We will share details on how we went about integrating low-bit matmuls in torchao and the torch.compile CUTLASS backend. We'll also discuss how you can define, build, and package your own custom ops in PyTorch so you get the raw performance of CUDA while maintaining the flexibility of PyTorch.
27
26
28
-
[Make My PyTorch Model Fast, and Show Me How You Did It](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1727978036338001UVLu)
29
-
30
-
Thomas Viehmann, Principal Research Engineer, Lightning AI
31
-
27
+
[Make My PyTorch Model Fast, and Show Me How You Did It](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1727978036338001UVLu)
28
+
Thomas Viehmann, Principal Research Engineer, Lightning AI
32
29
Luca Antiga, CTO, Lightning AI
33
30
34
31
PyTorch is popular in deep learning and LLMs for its richness and ease of expression. To make the most of compute resources, PyTorch models benefit from nontrivial optimizations, but these often cost some of that ease and understandability. Learn how Thunder, a PyTorch-to-Python compiler focused on usability, understandability, and extensibility, lets you optimize and transform (i.e., distribute across many machines) models while: leaving the PyTorch code unchanged; targeting a variety of models without needing to adapt to each of them; understanding each transformation step, because the results are presented as simple Python code; and accessing powerful extension code for your own optimizations with just one or a few lines of code. We'll show how the combination of Thunder transforms and the NVIDIA stack (NVFuser, cuDNN, Apex) delivers optimized performance in training and inference on a variety of models.
35
32
36
-
[FlexAttention: The Flexibility of PyTorch With the Performance of FlashAttention](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1726184633014001Jh5G)
37
-
33
+
[FlexAttention: The Flexibility of PyTorch With the Performance of FlashAttention](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1726184633014001Jh5G)
38
34
Driss Guessous, Machine Learning Engineer, Meta Platforms
39
35
40
36
Introducing FlexAttention: a novel PyTorch API that enables custom, user-defined attention mechanisms with performance comparable to state-of-the-art solutions. By leveraging the PyTorch compiler stack, FlexAttention supports dynamic modifications to attention scores within SDPA, achieving both runtime and memory efficiency through kernel fusion with the FlashAttention algorithm. Our benchmarks on A100 GPUs show FlexAttention achieves 90% of FlashAttention2's performance in forward passes and 85% in backward passes. On H100 GPUs, FlexAttention's forward performance averages 85% of FlashAttention3 and is ~25% faster than FlashAttention2, while backward performance averages 76% of FlashAttention3 and is ~3% faster than FlashAttention2. Explore how FlexAttention balances near-state-of-the-art performance with unparalleled flexibility, empowering researchers to rapidly iterate on attention mechanisms without sacrificing efficiency.
41
37
42
-
[Keep Your GPUs Going Brrr : Crushing Whitespace in Model Training](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1731693095418001cruA)
43
-
44
-
Syed Ahmed, Senior Software Engineer, NVIDIA
45
-
46
-
Alban Desmaison, Research Engineer, Meta
47
-
38
+
[Keep Your GPUs Going Brrr : Crushing Whitespace in Model Training](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1731693095418001cruA)
39
+
Syed Ahmed, Senior Software Engineer, NVIDIA
40
+
Alban Desmaison, Research Engineer, Meta
48
41
Aidyn Aitzhan, Senior Software Engineer, NVIDIA
49
42
50
43
Substantial progress has recently been made on the compute-intensive portions of model training, such as high-performing attention variants. While invaluable, this progress exposes previously hidden bottlenecks in model training, such as redundant copies during collectives and data loading time. We'll present recent improvements in PyTorch achieved through Meta/NVIDIA collaboration to tackle these newly exposed bottlenecks and how practitioners can leverage them.
51
44
52
-
[Accelerated Python: The Community and Ecosystem](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1727176757800001qp7T)
53
-
54
-
Andy Terrel, CUDA Python Product Lead, NVIDIA
55
-
56
-
Jeremy Tanner, Open Source Programs, NVIDIA
57
-
45
+
[Accelerated Python: The Community and Ecosystem](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1727176757800001qp7T)
46
+
Andy Terrel, CUDA Python Product Lead, NVIDIA
47
+
Jeremy Tanner, Open Source Programs, NVIDIA
58
48
Anshuman Bhat, CUDA Product Management, NVIDIA
59
49
60
50
Python is everywhere. Simulation, data science, and Gen AI all depend on it. Unfortunately, the dizzying array of tools leaves a newcomer baffled at where to start. We'll take you on a guided tour of the vibrant community and ecosystem surrounding accelerated Python programming. Explore a variety of tools, libraries, and frameworks that enable efficient computation and performance optimization in Python, including CUDA Python, RAPIDS, Warp, and Legate. We'll also discuss integration points with PyData, PyTorch, and JAX communities. Learn about collaborative efforts within the community, including open source projects and contributions that drive innovation in accelerated computing. We'll discuss best practices for leveraging these frameworks to enhance productivity in developing AI-driven applications and conducting large-scale data analyses.
61
51
62
-
[Supercharge large scale AI with Google Cloud AI hypercomputer (Presented by Google Cloud)](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1734571562315001xMKM)
63
-
64
-
Deepak Patil, Product Manager, Google Cloud
65
-
52
+
[Supercharge large scale AI with Google Cloud AI hypercomputer (Presented by Google Cloud)](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1734571562315001xMKM)
53
+
Deepak Patil, Product Manager, Google Cloud
66
54
Rajesh Anantharaman, Product Management Lead, ML Software, Google Cloud
67
55
68
56
Unlock the potential of your large-scale AI workloads with Google Cloud AI Hypercomputer – a supercomputing architecture designed for maximum performance and efficiency. In this session, we will deep dive into PyTorch and JAX stacks on Google Cloud on NVIDIA GPUs, and showcase capabilities for high performance foundation model building on Google Cloud.
69
57
70
-
[Peering Into the Future: What AI and Graph Networks Can Mean for the Future of Financial Analysis](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1739906058885001OxEF)
71
-
72
-
Siddharth Samsi, Sr. Solutions Architect, NVIDIA
73
-
58
+
[Peering Into the Future: What AI and Graph Networks Can Mean for the Future of Financial Analysis](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1739906058885001OxEF)
59
+
Siddharth Samsi, Sr. Solutions Architect, NVIDIA
74
60
Sudeep Kesh, Chief Innovation Officer, S&P Global
75
61
76
62
Artificial Intelligence, agentic systems, and graph neural networks (GNNs) are providing the new frontier to assess, monitor, and estimate opportunities and risks across work portfolios within financial services. Although many of these technologies are still developing, organizations are eager to understand their potential. See how S&P Global and NVIDIA are working together to find practical ways to learn and integrate such capabilities, ranging from forecasting corporate debt issuance to understanding capital markets at a deeper level. We'll show a graph representation of market data using the PyTorch-Geometric library and a dataset of issuances spanning three decades and across financial and non-financial industries. Technical developments include generation of a bipartite graph and link-prediction GNN forecasting. We'll address data preprocessing, pipelines, model training, and how these technologies can broaden capabilities in an increasingly complex world.
77
63
78
-
[Unlock Deep Learning Performance on Blackwell With cuDNN](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1727984645671001Y9eq)
79
-
64
+
[Unlock Deep Learning Performance on Blackwell With cuDNN](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1727984645671001Y9eq)
80
65
Yang Xu (Enterprise Products), DL Software Engineering Manager, NVIDIA
81
66
82
67
Since its launch, cuDNN, a library for GPU-accelerating deep learning (DL) primitives, has been powering many AI applications in domains such as conversational AI, recommender systems, and speech recognition, among others. cuDNN remains a core library for DL primitives in popular frameworks such as PyTorch, JAX, TensorFlow, and many more while covering training, fine-tuning, and inference use cases. Even in the rapidly evolving space of Gen AI (be it Llama, Gemma, or mixture-of-experts variants requiring complex DL primitives such as flash attention variants), cuDNN is powering them all. Learn about the new and updated cuDNN APIs for Blackwell’s microscaling format, and how to program against them. We'll deep dive into leveraging its graph APIs to build some fusion patterns, such as matmul fusion patterns and fused flash attention from state-of-the-art models. Understand how new CUDA graph support in cuDNN, not to be confused with the cuDNN graph API, can be used to avoid rebuilding CUDA graphs, offering an alternative to CUDA graph capture with real-world framework usage.
83
68
84
-
[Train and Serve AI Systems Fast With the Lightning AI Open-Source Stack (Presented by Lightning AI)](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1736347047099001au7y)
85
-
69
+
[Train and Serve AI Systems Fast With the Lightning AI Open-Source Stack (Presented by Lightning AI)](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1736347047099001au7y)
86
70
Luca Antiga, CTO, Lightning AI
87
71
88
72
See how the Lightning stack can cover the full life cycle, from data preparation to deployment, with practical examples and particular focus on distributed training and high-performance inference. We'll show examples that focus on new features like support for multi-dimensional parallelism through DTensors, as well as quantization through torchao.
89
73
90
74
91
75
## Connect With Experts (Interactive Sessions)
92
76
93
-
[Meet the Experts From Deep Learning Framework Teams ](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1728516848639001tO7H)
94
-
95
-
Eddie Yan, Technical Lead of PyTorch, NVIDIA
96
-
97
-
Masaki Kozuki, Senior Software Engineer in PyTorch, NVIDIA
98
-
99
-
Patrick Wang (Enterprise Products), Software Engineer in PyTorch, NVIDIA
100
-
101
-
Mike Ruberry, Distinguished Engineer in Deep Learning Frameworks, NVIDIA
102
-
77
+
[Meet the Experts From Deep Learning Framework Teams ](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1728516848639001tO7H)
78
+
Eddie Yan, Technical Lead of PyTorch, NVIDIA
79
+
Masaki Kozuki, Senior Software Engineer in PyTorch, NVIDIA
80
+
Patrick Wang (Enterprise Products), Software Engineer in PyTorch, NVIDIA
81
+
Mike Ruberry, Distinguished Engineer in Deep Learning Frameworks, NVIDIA
103
82
Rishi Puri, Sr. Deep Learning Engineer and Lead for PyTorch Geometric, NVIDIA
104
83
105
84
106
85
## Training Labs
107
86
108
-
[Kernel Optimization for AI and Beyond: Unlocking the Power of Nsight Compute ](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1726073884811001C0za)
109
-
110
-
Felix Schmitt, Sr. System Software Engineer, NVIDIA
111
-
87
+
[Kernel Optimization for AI and Beyond: Unlocking the Power of Nsight Compute ](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1726073884811001C0za)
88
+
Felix Schmitt, Sr. System Software Engineer, NVIDIA
112
89
Peter Labus, Senior System Software Engineer, NVIDIA
113
90
114
91
Learn how to unlock the full potential of NVIDIA GPUs with the powerful profiling and analysis capabilities of Nsight Compute. AI workloads are rapidly increasing the demand for GPU computing, and ensuring that they efficiently utilize all available GPU resources is essential. Nsight Compute is the most powerful tool for understanding kernel execution behavior and performance. Learn how to configure and launch profiles customized for your needs, including advice on profiling accelerated Python applications, AI frameworks like PyTorch, and optimizing Tensor Core utilization essential to modern AI performance. Learn how to debug your kernel and use the expert system built into Nsight Compute, known as “Guided Analysis,” that automatically detects common issues and directs you to the most relevant performance data all the way down to the source code level.
115
92
116
-
[Make Retrieval Better: Fine-Tuning an Embedding Model for Domain-Specific RAG](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1725042189130001cmoW)
117
-
118
-
Gabriel Moreira, Sr. Research Scientist, NVIDIA
119
-
93
+
[Make Retrieval Better: Fine-Tuning an Embedding Model for Domain-Specific RAG](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1725042189130001cmoW)
94
+
Gabriel Moreira, Sr. Research Scientist, NVIDIA
120
95
Ronay Ak, Sr. Data Scientist, NVIDIA
121
96
122
97
LLMs power AI applications like conversational chatbots and content generators, but are constrained by their training data. This can lead to hallucinations when content generation requires up-to-date or domain-specific information. Retrieval augmented generation (RAG) addresses this issue by enabling LLMs to access external context without modifying model parameters. Embedding or dense retrieval models are a key component of a RAG pipeline, retrieving relevant context for the LLM. However, an embedding model’s effectiveness at capturing the unique characteristics of custom data hinges on the quality and domain relevance of its training data. Fine-tuning embedding models is gaining interest as a way to provide more accurate and relevant responses tailored to users’ specific domains.
@@ -126,10 +101,8 @@ In this lab, you'll learn to generate a synthetic dataset with question-context
126
101
127
102
## Poster Presentations
128
103
129
-
[Single-View X-Ray 3D Reconstruction Using Neural Back Projection and Frustum Resampling](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1729781473379001KiPD)
130
-
104
+
[Single-View X-Ray 3D Reconstruction Using Neural Back Projection and Frustum Resampling](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1729781473379001KiPD)
131
105
Tran Minh Quan, Developer Technologist, NVIDIA
132
106
133
-
[Enable Novel Applications in the New AI Area in Medicine: Accelerated Feature Computation for Pathology Slides](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1729757102989001KDG4)
134
-
107
+
[Enable Novel Applications in the New AI Area in Medicine: Accelerated Feature Computation for Pathology Slides](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1729757102989001KDG4)
135
108
Nils Bruenggel, Principal Software Engineer, Roche Diagnostics Int. AG