
Commit c276d2b

Update documentation (#22)
* update readmes
1 parent 326308e commit c276d2b

File tree

7 files changed: +322, -36 lines

CITATION.cff

Lines changed: 74 additions & 0 deletions

cff-version: 1.2.0
title: "Can Large Language Models Write Parallel Code?"
message: "If you use this library and love it, cite the software and the paper \U0001F917"
authors:
  - given-names: Daniel
    family-names: Nichols
    email: dnicho@umd.edu
    affiliation: University of Maryland, College Park
  - given-names: Josh
    family-names: Davis
    email: jhdavis@umd.edu
    affiliation: University of Maryland, College Park
  - given-names: Zhaojun
    family-names: Xie
    email: zxie12@umd.edu
    affiliation: University of Maryland, College Park
  - given-names: Arjun
    family-names: Rajaram
    email: arajara1@umd.edu
    affiliation: University of Maryland, College Park
  - given-names: Abhinav
    family-names: Bhatele
    email: bhatele@cs.umd.edu
    affiliation: University of Maryland, College Park
version: 1.0.0
doi: https://doi.org/10.48550/arXiv.2401.12554
date-released: 2024-01-23
references:
  - type: article
    authors:
      - given-names: Daniel
        family-names: Nichols
        email: dnicho@umd.edu
        affiliation: University of Maryland, College Park
      - given-names: Josh
        family-names: Davis
        email: jhdavis@umd.edu
        affiliation: University of Maryland, College Park
      - given-names: Zhaojun
        family-names: Xie
        email: zxie12@umd.edu
        affiliation: University of Maryland, College Park
      - given-names: Arjun
        family-names: Rajaram
        email: arajara1@umd.edu
        affiliation: University of Maryland, College Park
      - given-names: Abhinav
        family-names: Bhatele
        email: bhatele@cs.umd.edu
        affiliation: University of Maryland, College Park
    title: "Can Large Language Models Write Parallel Code?"
    year: 2024
    journal: ArXiv
    doi: https://doi.org/10.48550/arXiv.2401.12554
    url: https://arxiv.org/abs/2401.12554
abstract: >-
  Large Language Models are becoming an increasingly popular tool for software
  development. Their ability to model and generate source code has been
  demonstrated in a variety of contexts, including code completion,
  summarization, translation, and lookup. However, they often struggle to
  generate code for more complex tasks. In this paper, we explore the ability of
  state-of-the-art language models to generate parallel code. We propose a
  benchmark, PCGBench, consisting of a set of 420 tasks for evaluating the
  ability of language models to generate parallel code, and we evaluate the
  performance of several state-of-the-art open- and closed-source language
  models on these tasks. We introduce novel metrics for comparing parallel code
  generation performance and use them to explore how well each LLM performs on
  various parallel programming models and computational problem types.
keywords:
  - Large Language Models
  - High Performance Computing
  - Parallel Computing
license: Apache-2.0

README.md

Lines changed: 71 additions & 3 deletions

Removed (the old top-level README):

# LLMs for HPC
Data and scripts related to evaluating Large Language Models (LLMs) for tasks in
High Performance Computing (HPC).

Added (the new top-level README):

# PCGBench

This repo contains the Parallel Code Generation Benchmark (PCGBench) for
evaluating the ability of Large Language Models to write parallel code. See the
[PCGBench Leaderboard](https://pssg.cs.umd.edu/blog/2024/pareval/) for
up-to-date results on different LLMs.

## Overview

The organization of the repo is as follows.

- `prompts/` -- the prompts in PCGBench alongside some utility scripts
- `generate/` -- scripts for generating LLM outputs
- `drivers/` -- scripts to evaluate LLM outputs
- `analysis/` -- scripts to analyze driver results and compute metrics
- `tpl/` -- git submodule dependencies

Each subdirectory has further documentation on its contents. The general
workflow is to use `generate/generate.py` to get LLM outputs, run
`drivers/run-all.py` to evaluate those outputs, and `analysis/metrics.py` to
postprocess the results, as sketched below.
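
For orientation, an end-to-end run might look like the following minimal sketch. The model handle and output file names are only examples, each script is normally run from its own subdirectory, and every script accepts more options via `--help`.

```sh
# 1. Generate completions for the PCGBench prompts
#    (the model handle below is just an example).
cd generate
python generate.py \
    --prompts ../prompts/generation-prompts.json \
    --model bigcode/starcoderbase \
    --output ../generated-outputs.json \
    --do_sample --prompted

# 2. Build, run, and validate the generated code.
cd ../drivers
python run-all.py ../generated-outputs.json -o ../results.json

# 3. Post-process the results into metrics; the exact arguments
#    are documented by each script's --help.
cd ../analysis
python create-dataframe.py --help
python metrics.py --help
```
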
## Setup and Installation

A few pieces of core system software are assumed to be installed: Python >=3.7,
a C++ compiler that supports C++17 and OpenMP, Make, CMake, and an MPI
implementation. If you are testing the CUDA and HIP prompts, then you will also
need access to NVIDIA and AMD GPUs alongside their respective software stacks.
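
A quick way to sanity-check the prerequisites (the GPU toolchains are only needed if you evaluate the CUDA or HIP prompts; command names may differ on your system):

```sh
python3 --version   # needs >= 3.7
c++ --version       # any C++17 compiler with OpenMP support (g++, clang++, ...)
cmake --version
make --version
mpicc --version     # any MPI implementation
nvcc --version      # only for the CUDA prompts
hipcc --version     # only for the HIP prompts
```
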
First, clone the repo.

```sh
git clone --recurse-submodules https://github.com/pssg-int/llms-for-hpc.git
```

Next, you need to build Kokkos (if you want to include it in testing).

```sh
cd tpl/kokkos

mkdir build
cd build

# depending on your system you may need to pass your c++ compiler to CMAKE_CXX_COMPILER
cmake .. -DCMAKE_INSTALL_PREFIX=. -DKokkos_ENABLE_THREADS=ON
make install -j4
```

Finally, you need to install the Python dependencies. `requirements.txt` has
the set of dependencies pinned at the versions they were tested with. Other
versions may also work. Note that some of these are only required for parts of
the pipeline, e.g., PyTorch and Transformers are only needed for generating LLM
outputs.

```sh
pip install -r requirements.txt
```
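
If you prefer an isolated environment, the same install works inside a virtual environment, for example:

```sh
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
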
## Citing PCGBench

```
@misc{nichols2024large,
    title={Can Large Language Models Write Parallel Code?},
    author={Daniel Nichols and Joshua H. Davis and Zhaojun Xie and
            Arjun Rajaram and Abhinav Bhatele},
    year={2024},
    eprint={2401.12554},
    archivePrefix={arXiv},
    primaryClass={cs.DC}
}
```

analysis/README.md

Lines changed: 20 additions & 0 deletions

# Analysis

This subdirectory contains scripts for analyzing the LLM outputs and driver
results.

`create-dataframe.py` -- convert results JSON files into CSV format

`metrics.py` -- compute pass@k, efficiency@k, speedup@k, and build@k for a
particular results CSV file

`metrics-scaling.py` -- compute the metrics at different resource counts; used
to get scaling results

`bin-the-stack.py` -- a utility script for analyzing The Stack dataset

The arguments to each of these scripts can be found with `--help`. In general,
the workflow is to use `create-dataframe.py` to get a CSV results file for the
driver outputs and then feed this into `metrics.py` to get the relevant metrics.
This will in turn output another CSV file with the metrics divided by problem
type and execution model.
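
For reference, `pass@k` here is presumably the standard unbiased estimator used for code generation benchmarks; a minimal sketch of that estimator is below. This is illustrative only: `metrics.py` is the authoritative implementation, and the efficiency@k and speedup@k variants additionally incorporate measured run times as defined in the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n -- number of samples generated for a prompt
    c -- number of those samples that build and pass the correctness checks
    k -- evaluation budget
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 50 samples per prompt (the generate.py default), 10 of which pass.
print(pass_at_k(50, 10, 1))   # 0.2
print(pass_at_k(50, 10, 10))  # ~0.92
```
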

drivers/README.md

Lines changed: 57 additions & 33 deletions

Scripts to handle running, testing, and benchmarking code generated by LLMs.
The drivers are split up by language.

## Setup

The C++ drivers need some object files to be built before running. This can
be done by running the following lines.

```sh
cd cpp
make
```

## Running the prompts

Given a prompt and output data set in `generated-outputs.json`, you can run each
of the generated outputs using the command below.

```sh
python run-all.py generated-outputs.json

# usage: run-all.py [-h] [-o OUTPUT] [--scratch-dir SCRATCH_DIR] [--launch-configs LAUNCH_CONFIGS] [--problem-sizes PROBLEM_SIZES] [--yes-to-all]
#                   [--dry] [--overwrite] [--hide-progress]
#                   [--exclude-models {serial,omp,mpi,mpi+omp,kokkos,cuda,hip} [{serial,omp,mpi,mpi+omp,kokkos,cuda,hip} ...] | --include-models
#                   {serial,omp,mpi,mpi+omp,kokkos,cuda,hip} [{serial,omp,mpi,mpi+omp,kokkos,cuda,hip} ...]]
#                   [--problem PROBLEM | --problem-type PROBLEM_TYPE] [--early-exit-runs] [--build-timeout BUILD_TIMEOUT] [--run-timeout RUN_TIMEOUT]
#                   [--log {INFO,DEBUG,WARNING,ERROR,CRITICAL}] [--log-build-errors] [--log-runs]
#                   input_json
#
# Run all the generated code.
#
# [...]
#                         If provided, put scratch files here.
#   --launch-configs LAUNCH_CONFIGS
#                         config for how to run samples.
#   --problem-sizes PROBLEM_SIZES
#                         config for how to run samples.
#   --yes-to-all          If provided, automatically answer yes to all prompts.
#   --dry                 Dry run. Do not actually run the code snippets.
#   --overwrite           If outputs are already in DB for a given prompt, then overwrite them. Default behavior is to skip existing results.
#   --hide-progress       If provided, do not show progress bar.
#   --exclude-models {serial,omp,mpi,mpi+omp,kokkos,cuda,hip} [{serial,omp,mpi,mpi+omp,kokkos,cuda,hip} ...]
#                         Exclude the given parallelism models from testing.
#   --include-models {serial,omp,mpi,mpi+omp,kokkos,cuda,hip} [{serial,omp,mpi,mpi+omp,kokkos,cuda,hip} ...]
#                         Only test the given parallelism models.
#   --problem PROBLEM     Only test this problem if provided.
#   --problem-type PROBLEM_TYPE
#                         Only test problems of this type if provided.
#   --early-exit-runs     If provided, stop evaluating a model output after the first run configuration fails.
#   --build-timeout BUILD_TIMEOUT
#                         Timeout in seconds for building a program.
#   --run-timeout RUN_TIMEOUT
#                         Timeout in seconds for running a program.
#   --log {INFO,DEBUG,WARNING,ERROR,CRITICAL}
#                         logging level
#   --log-build-errors    On build error, display the stderr of the build process.
#   --log-runs            Display the stderr and stdout of runs.
```

The launch configurations (node counts and launch commands) are defined in a
JSON file passed to `--launch-configs`. The default should work on any system
with the Slurm workload manager. Depending on the parallel models being tested,
you may need a newer C++ compiler and an MPI implementation loaded.

It is likely you will need to split up the runs between execution models (few
machines have both NVIDIA and AMD GPUs in them). The `--include-models` and
`--exclude-models` options allow you to run only subsets of the prompts.

Additionally, the script uses `/tmp` for building and running the generated
code. On many machines `/tmp` is node-local, which will cause the MPI jobs to
fail. To solve this, set `--scratch-dir` to point to a scratch directory on a
shared file system.
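
Putting these options together, a typical CPU-only evaluation on a cluster might look like the following sketch (file names, paths, and timeout values are placeholders):

```sh
python run-all.py generated-outputs.json \
    -o results.json \
    --launch-configs launch-configs.json \
    --scratch-dir /path/to/shared/scratch \
    --include-models serial omp mpi \
    --build-timeout 120 --run-timeout 180
```
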
Removed (Zaratan-specific instructions):

### Running on Zaratan

To run the drivers on Zaratan you need the following modules (or similar).

```bash
ml gcc/11.3.0 openmpi/gcc/11.3.0 python
```

Additionally, the default `/tmp` scratch space for building and running is
node-local on Zaratan, so you need to pass `--scratch-dir` as some root
directory on the _scratch_ or _home_ filesystem, since these are accessible
from all compute nodes and on the network.

To prevent _cuda library not found errors_ with MPI you should run
`export OMPI_MCA_opal_warn_on_missing_libcuda=0` before running the drivers.

## Organization of Drivers

Within `drivers/` there are subdirectories for each programming language. Each
language subdirectory contains a `*_driver_wrapper.py` file that handles running
code for that language. This wrapper further uses functionality from drivers in
`models/` and `benchmarks/` in the language subdirectory. These define the
behavior for running each programming model and benchmark.

### Notes on Drivers

The benchmark drivers (in `benchmarks/`) follow the naming convention
`<problem-type>/<problem-name>/<model>.<ext>`. Likewise, the model drivers in
`models/` follow the naming convention `<model>-driver.<ext>`. The problem name
is the name used in the prompts data set. The model is one of the available
parallel backend models. These are used as keys in the code, so this naming
convention and spelling need to be followed exactly. The `models/` subdirectory
has the `main` function for each execution model, while the actual testing code
for each of the prompts lives in the `benchmarks/` subdirectory.
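
As a purely hypothetical illustration of that layout for the C++ drivers (the actual file extensions and problem names come from the repository and the prompts data set):

```
cpp/
├── models/
│   ├── serial-driver.<ext>     # main() for the serial execution model
│   ├── omp-driver.<ext>        # main() for OpenMP
│   └── ...
└── benchmarks/
    └── stencil/                # <problem-type>
        └── 17_problem_name/    # <problem-name> from the prompts data set
            ├── serial.<ext>    # <model>.<ext>
            ├── omp.<ext>
            └── ...
```
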
Make sure to run `make` in the corresponding subdirectories for models that need
to be compiled. For example, in `cpp/` run `make` to build the driver binaries.

generate/README.md

Lines changed: 58 additions & 0 deletions

# Generate

This subdirectory contains scripts for generating the LLM outputs for PCGBench.
The main script is `generate.py`. It can be run as follows.

```sh
usage: generate.py [-h] --prompts PROMPTS --model MODEL --output OUTPUT [--restart] [--cache CACHE] [--restore_from RESTORE_FROM] [--max_new_tokens MAX_NEW_TOKENS]
                   [--num_samples_per_prompt NUM_SAMPLES_PER_PROMPT] [--temperature TEMPERATURE] [--top_p TOP_P] [--do_sample] [--batch_size BATCH_SIZE] [--prompted]

Generate code

optional arguments:
  -h, --help            show this help message and exit
  --prompts PROMPTS     Path to the prompt JSON file
  --model MODEL         Path to the language model
  --output OUTPUT       Path to the output JSON file
  --restart             Restart generation from scratch (default: False)
  --cache CACHE         JSONL file to cache intermediate results in. Will be restored from if it already exists and --restart is not specified
  --restore_from RESTORE_FROM
                        JSON file to restore old results from. Will be restored from if it already exists and --restart is not specified. Is different from --cache in that it is a JSON
                        file, not a JSONL file, and it is only used to restore old results where the prompt is equivalent. Cached results are prioritized over restored results.
  --max_new_tokens MAX_NEW_TOKENS
                        Maximum number of new tokens to generate (default: 1024)
  --num_samples_per_prompt NUM_SAMPLES_PER_PROMPT
                        Number of code samples to generate (default: 50)
  --temperature TEMPERATURE
                        Temperature for controlling randomness (default: 0.2)
  --top_p TOP_P         Top p value for nucleus sampling (default: 0.95)
  --do_sample           Enable sampling (default: False)
  --batch_size BATCH_SIZE
                        Batch size for generation (default: 8)
  --prompted            Use prompted generation. See StarCoder paper (default: False)
```

To get outputs in a reasonable amount of time you will most likely need to run
this script on a system with a GPU. The `generate-openai.py` and
`generate-gemini.py` scripts generate outputs from the respective APIs; these
require you to set the API keys in environment variables or pass them as
arguments. Use `--help` to see the available arguments.

The `--model` argument in `generate.py` can either be a local model path or the
handle of a HuggingFace model. The cache and restart arguments can be used to
restart from existing outputs. The `--prompted` flag will prepend _solution_
comments to the prompt, as in the StarCoder paper. You likely want this turned
on, as it almost always improves the results.
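
For example, a typical sampling run against a HuggingFace model, launched from this directory, might look like this (the model handle and file names are only examples):

```sh
python generate.py \
    --prompts ../prompts/generation-prompts.json \
    --model bigcode/starcoderbase \
    --output starcoderbase-outputs.json \
    --cache starcoderbase-cache.jsonl \
    --num_samples_per_prompt 50 \
    --temperature 0.2 --top_p 0.95 --do_sample \
    --batch_size 8 \
    --prompted
```
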
## Adding New LLMs

Since a number of the LLMs have different inference settings and prompt formats,
we use an inference configuration abstraction to set these parameters for
different models. These are defined in `utils.py`. To add support for a new LLM
you will need to define an inference config for that LLM in `utils.py`. The
existing configurations in that file should serve as examples.
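
The exact shape of the abstraction is defined in `utils.py`; purely as a hypothetical illustration, such a config usually bundles a prompt template with generation settings, along these lines:

```python
from dataclasses import dataclass

@dataclass
class InferenceConfig:
    """Hypothetical sketch; see utils.py for the real abstraction."""
    prompt_template: str          # how to wrap the raw PCGBench prompt
    max_new_tokens: int = 1024
    stop_strings: tuple = ()      # strings that should terminate generation

# e.g. an instruct-tuned model might need a chat-style wrapper
example = InferenceConfig(
    prompt_template="[INST] Complete the following code. [/INST]\n{prompt}",
    stop_strings=("\n}\n",),
)
```
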
## Translate Tests

Each of the generate scripts has a translation equivalent. These are used for
the translation tasks.

prompts/README.md

Lines changed: 31 additions & 0 deletions

# Prompts

This directory contains the PCGBench prompts.
The prompts for the generation task are contained in `generation-prompts.json`
and the prompts for the translation task are in `translation-prompts.json`.

The format of the prompts dataset is as follows:

```json
[
    {
        "problem_type": "stencil",
        "language": "cpp",
        "name": "17_problem_name",
        "parallelism_model": "serial",
        "prompt": "/* prompt for the model here */\nvoid foo(int x) {"
    },
    ...
]
```
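
The file is plain JSON, so it is straightforward to load and filter, e.g. to pull out only the OpenMP (`omp`) generation prompts:

```python
import json

with open("generation-prompts.json") as f:
    prompts = json.load(f)

omp_prompts = [p for p in prompts if p["parallelism_model"] == "omp"]
print(len(omp_prompts), "OpenMP prompts")
print(omp_prompts[0]["name"], omp_prompts[0]["problem_type"])
```
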
## Other Utilities

`create-serial-tests.py` -- this script parses the sequential baselines for
each problem out of the drivers and creates a "fake" output file with these as
the sequential solutions. This can be used to test that the driver setup is
working.

`count-tokens.py` -- this can be used to estimate the number of tokens passed
to the OpenAI API and, thus, estimate the cost of generating outputs.
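
`count-tokens.py` is the authoritative tool here; for a rough idea of what such an estimate involves, the `tiktoken` package can count tokens for OpenAI models. A sketch (not necessarily what the script itself does):

```python
import json

import tiktoken  # pip install tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

with open("generation-prompts.json") as f:
    prompts = json.load(f)

total = sum(len(enc.encode(p["prompt"])) for p in prompts)
print(f"~{total} prompt tokens per pass over the dataset")
```
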
