Squashed commit of the following:

Dando18 · Dando18 · commit 0e6dc7b259cf · 2024-02-21T18:31:38.000-05:00
commit 244bb43
Author: Daniel Nichols <dando18studios@gmail.com>
Date:   Wed Feb 21 18:28:03 2024 -0500

    Update documentation (#22)

    * update readmes

commit 578ad03
Author: Daniel Nichols <dando18studios@gmail.com>
Date:   Tue Feb 20 12:06:40 2024 -0500

    Update gemini outputs (#19)

    * update gemini outputs

commit a7e38b1
Author: Daniel Nichols <dando18studios@gmail.com>
Date:   Tue Feb 20 12:05:07 2024 -0500

    add gemini (#18)

    * add gemini script
    * add gemini generation outputs
    * add results

commit 9911eb2
Author: Daniel Nichols <dando18studios@gmail.com>
Date:   Sun Feb 18 19:11:31 2024 -0500

    Translation (#17)

    * finalize prompts
    * translation scripts
    * gpt translate outputs
    * add translation outputs
    * translation results

diff --git a/CITATION.cff b/CITATION.cff
@@ -0,0 +1,74 @@
+cff-version: 1.2.0
+title: "Can Large Language Models Write Parallel Code?"
+message: "If you use this library and love it, cite the software and the paper \U0001F917"
+authors:
+  - given-names: Daniel
+    family-names: Nichols
+    email: dnicho@umd.edu
+    affiliation: University of Maryland, College Park
+  - given-names: Josh
+    family-names: Davis
+    email: jhdavis@umd.edu
+    affiliation: University of Maryland, College Park
+  - given-names: Zhaojun
+    family-names: Xie
+    email: zxie12@umd.edu
+    affiliation: University of Maryland, College Park
+  - given-names: Arjun
+    family-names: Rajaram
+    email: arajara1@umd.edu
+    affiliation: University of Maryland, College Park
+  - given-names: Abhinav
+    family-names: Bhatele
+    email: bhatele@cs.umd.edu
+    affiliation: University of Maryland, College Park
+version: 1.0.0
+doi: https://doi.org/10.48550/arXiv.2401.12554
+date-released: 2024-01-23
+references:
+  - type: article
+    authors:
+      - given-names: Daniel
+        family-names: Nichols
+        email: dnicho@umd.edu
+        affiliation: University of Maryland, College Park
+      - given-names: Josh
+        family-names: Davis
+        email: jhdavis@umd.edu
+        affiliation: University of Maryland, College Park
+      - given-names: Zhaojun
+        family-names: Xie
+        email: zxie12@umd.edu
+        affiliation: University of Maryland, College Park
+      - given-names: Arjun
+        family-names: Rajaram
+        email: arajara1@umd.edu
+        affiliation: University of Maryland, College Park
+      - given-names: Abhinav
+        family-names: Bhatele
+        email: bhatele@cs.umd.edu
+        affiliation: University of Maryland, College Park
+    title: "Can Large Language Models Write Parallel Code?"
+    year: 2024
+    journal: ArXiv
+    doi: https://doi.org/10.48550/arXiv.2401.12554
+    url: https://arxiv.org/abs/2401.12554
+
+abstract: >-
+  Large Language Models are becoming an increasingly popular tool for software
+  development. Their ability to model and generate source code has been
+  demonstrated in a variety of contexts, including code completion,
+  summarization, translation, and lookup. However, they often struggle to
+  generate code for more complex tasks. In this paper, we explore the ability of
+  state-of-the-art language models to generate parallel code. We propose a
+  benchmark, PCGBench, consisting of a set of 420 tasks for evaluating the
+  ability of language models to generate parallel code, and we evaluate the
+  performance of several state-of-the-art open- and closed-source language
+  models on these tasks. We introduce novel metrics for comparing parallel code
+  generation performance and use them to explore how well each LLM performs on
+  various parallel programming models and computational problem types.
+keywords:
+  - Large Language Models
+  - High Performance Computing
+  - Parallel Computing
+license: Apache-2.0
diff --git a/README.md b/README.md
@@ -1,3 +1,71 @@
-# LLMs for HPC
-Data and scripts related to evaluating Large Language Models (LLMs) for tasks in 
-High Performance Computing (HPC).
+# PCGBench
+
+This repo contains the Parallel Code Generation Benchmark (PCGBench) for
+evaluating the ability of Large Language Models to write parallel code. See the
+[PCGBench Leaderboard](https://pssg.cs.umd.edu/blog/2024/pareval/) for
+up-to-date results on different LLMs. 
+
+## Overview
+
+The organization of the repo is as follows.
+
+- `prompts/` -- the prompts in PCGBench alongside some utility scripts
+- `generate/` -- scripts for generating LLM outputs
+- `drivers/` -- scripts to evaluate LLM outputs
+- `analysis/` -- scripts to analyze driver results and compute metrics
+- `tpl/` -- git submodule dependencies
+
+Each subdirectory has further documentation on its contents. The general
+workflow is to use `generate/generate.py` to get LLM outputs, run
+`drivers/run-all.py` to evaluate outputs, and `analysis/metrics.py` to
+postprocess the results.
+
+## Setup and Installation
+
+Some core system software is assumed to be installed: Python >=3.7, a C++
+compiler that supports C++17 and OpenMP, Make, CMake, and an MPI implementation.
+If you are testing the CUDA and HIP prompts, then you will need access to NVIDIA
+and AMD GPUs alongside their respective software stacks.
+
+First, clone the repo.
+
+```sh
+git clone --recurse-submodules https://github.com/pssg-int/llms-for-hpc.git
+```
+
+Next, you need to build Kokkos (if you want to include it in testing).
+
+```sh
+cd tpl/kokkos
+
+mkdir build
+cd build
+
+# depending on your system you may need to set CMAKE_CXX_COMPILER to your C++ compiler
+cmake .. -DCMAKE_INSTALL_PREFIX=. -DKokkos_ENABLE_THREADS=ON
+make install -j4
+```
+
+Finally, you need to install the Python dependencies. `requirements.txt` has
+the set of dependencies pinned at the version they were tested with. Other
+versions may also work. Note that some of these are only required for parts of
+the pipeline, e.g. PyTorch and Transformers are only needed for generating LLM
+outputs.
+
+```sh
+pip install -r requirements.txt
+```
+
+## Citing PCGBench
+
+```
+@misc{nichols2024large,
+      title={Can Large Language Models Write Parallel Code?}, 
+      author={Daniel Nichols and Joshua H. Davis and Zhaojun Xie and 
+              Arjun Rajaram and Abhinav Bhatele},
+      year={2024},
+      eprint={2401.12554},
+      archivePrefix={arXiv},
+      primaryClass={cs.DC}
+}
+```
diff --git a/analysis/README.md b/analysis/README.md
@@ -0,0 +1,20 @@
+# Analysis
+
+This subdirectory contains scripts for analyzing the LLM outputs and driver
+results.
+
+`create-dataframe.py` -- convert results JSON files into CSV format
+
+`metrics.py` -- compute pass@k, efficiency@k, speedup@k, and build@k for a
+particular results CSV file
+
+`metrics-scaling.py` -- compute the metrics at different resource counts; used
+to get scaling results
+
+`bin-the-stack.py` -- a utility script for analyzing The Stack dataset.
+
+The arguments to each of these scripts can be found with `--help`. In general,
+the workflow is to use `create-dataframe.py` to get a CSV results file for the
+driver outputs and then feed this into `metrics.py` to get the relevant metrics.
+This will in turn output another CSV file with the metrics divided by problem
+type and execution model.
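
As background for the pass@k metric that `metrics.py` reports, here is a minimal sketch of the commonly used unbiased pass@k estimator, `1 - C(n-c, k) / C(n, k)`, where `n` samples are generated for a prompt and `c` of them are correct. It illustrates the general technique and is not taken from `metrics.py`; the function name `pass_at_k` and its signature are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples generated for a prompt
    c: number of those samples that passed
    k: number of samples drawn (1 <= k <= n)

    Hypothetical helper for illustration; not the implementation in metrics.py.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples are incorrect, every draw of k contains a pass.
    if n - c < k:
        return 1.0
    # Probability that all k drawn samples are incorrect, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 3 of 10 samples pass; estimate pass@1 and pass@5 for that prompt.
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```

A benchmark-level score would typically be the mean of this per-prompt value over all prompts in a given problem type or execution model.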
diff --git a/drivers/README.md b/drivers/README.md
@@ -2,16 +2,29 @@
 Scripts to handle running, testing, and benchmarking code generated by LLMs.
 The drivers are split up by language.
 
+## Setup
+
+The C++ drivers need some object files to be built before running. This can
+be done by running the following lines.
+
+```sh
+cd cpp
+make
+```
+
 ## Running the prompts
 Given a prompt and output data set in `generated-outputs.json` you can run each
 of the generated outputs using the below command.
 
-```bash
+```sh
 python run-all.py generated-outputs.json
 
-# usage: run-all.py [-h] [-o OUTPUT] [--scratch-dir SCRATCH_DIR] [--launch-configs LAUNCH_CONFIGS] [--yes-to-all] [--dry] [--overwrite]
-#                   [--exclude-models {serial,omp,mpi} [{serial,omp,mpi} ...] | --include-models {serial,omp,mpi} [{serial,omp,mpi} ...]]
-#                   [--log {INFO,DEBUG,WARNING,ERROR,CRITICAL}]
+# usage: run-all.py [-h] [-o OUTPUT] [--scratch-dir SCRATCH_DIR] [--launch-configs LAUNCH_CONFIGS] [--problem-sizes PROBLEM_SIZES] [--yes-to-all]
+#                   [--dry] [--overwrite] [--hide-progress]
+#                   [--exclude-models {serial,omp,mpi,mpi+omp,kokkos,cuda,hip} [{serial,omp,mpi,mpi+omp,kokkos,cuda,hip} ...] | --include-models
+#                   {serial,omp,mpi,mpi+omp,kokkos,cuda,hip} [{serial,omp,mpi,mpi+omp,kokkos,cuda,hip} ...]]
+#                   [--problem PROBLEM | --problem-type PROBLEM_TYPE] [--early-exit-runs] [--build-timeout BUILD_TIMEOUT] [--run-timeout RUN_TIMEOUT]
+#                   [--log {INFO,DEBUG,WARNING,ERROR,CRITICAL}] [--log-build-errors] [--log-runs]
 #                   input_json
 # 
 # Run all the generated code.
@@ -27,50 +40,61 @@ python run-all.py generated-outputs.json
 #                         If provided, put scratch files here.
 #   --launch-configs LAUNCH_CONFIGS
 #                         config for how to run samples.
+#   --problem-sizes PROBLEM_SIZES
+#                         config for how to run samples.
 #   --yes-to-all          If provided, automatically answer yes to all prompts.
 #   --dry                 Dry run. Do not actually run the code snippets.
 #   --overwrite           If ouputs are already in DB for a given prompt, then overwrite them. Default behavior is to skip existing results.
-#   --exclude-models {serial,omp,mpi} [{serial,omp,mpi} ...]
+#   --hide-progress       If provided, do not show progress bar.
+#   --exclude-models {serial,omp,mpi,mpi+omp,kokkos,cuda,hip} [{serial,omp,mpi,mpi+omp,kokkos,cuda,hip} ...]
 #                         Exclude the given parallelism models from testing.
-#   --include-models {serial,omp,mpi} [{serial,omp,mpi} ...]
+#   --include-models {serial,omp,mpi,mpi+omp,kokkos,cuda,hip} [{serial,omp,mpi,mpi+omp,kokkos,cuda,hip} ...]
 #                         Only test the given parallelism models.
+#   --problem PROBLEM     Only test this problem if provided.
+#   --problem-type PROBLEM_TYPE
+#                         Only test problems of this type if provided.
+#   --early-exit-runs     If provided, stop evaluating a model output after the first run configuration fails.
+#   --build-timeout BUILD_TIMEOUT
+#                         Timeout in seconds for building a program.
+#   --run-timeout RUN_TIMEOUT
+#                         Timeout in seconds for running a program.
 #   --log {INFO,DEBUG,WARNING,ERROR,CRITICAL}
 #                         logging level
+#   --log-build-errors    On build error, display the stderr of the build process.
+#   --log-runs            Display the stderr and stdout of runs.
 ```
 
-Depending on the parallel models being tested this will require a newer C++
-compiler and MPI being loaded. 
-
-### Running on Zaratan
-To run the drivers on Zaratan you needc the following modules (or similar).
-
-```bash
-ml gcc/11.3.0 openmpi/gcc/11.3.0 python
-```
+The launch configurations (node counts and launch commands) are defined in a
+JSON file passed to `--launch-configs`. The default should work on any system
+with the Slurm workload manager. Depending on the parallel models being tested,
+you may need to load a newer C++ compiler and an MPI implementation.
 
-Additionally, the default `/tmp` scratch space for building and running is 
-node-local on Zaratan, so you need to pass `--scratch-dir` as some root 
-directory on the _scratch_ or _home_ filesystem, since these are accessible
-from all compute nodes and on the network.
+It is likely you will need to split up the runs between execution models (there
+are not many machines that have both NVIDIA and AMD GPUs on them). The
+`--include-models` and `--exclude-models` options allow you to only run subsets
+of prompts. 
 
-To prevent _cuda library not found errors_ with MPI you should run 
-`export OMPI_MCA_opal_warn_on_missing_libcuda=0` before running the drivers.
+Additionally, the script uses `/tmp` for building and running the generated code.
+On many machines `/tmp` is node-local, which will cause the MPI jobs to fail.
+To solve this you can set `--scratch-dir` to point to a scratch directory
+on a shared file system.
 
 ## Organization of Drivers
-Within `drivers/` there are subdirectories for each programming language.
-In this subdirectory is a `*_driver_wrapper.py` file that handles running
-code for that language.
-This wrapper further uses functionality from drivers in `models/` and 
-`benchmarks/` in the language subdirectory.
-These define behavior for running each programming model and benchmark.
+Within `drivers/` there is a subdirectory for each programming language. Each
+language subdirectory contains a `*_driver_wrapper.py` file that handles running
+code for that language. This wrapper in turn uses drivers in `models/` and
+`benchmarks/` within the same subdirectory, which define the behavior for
+running each programming model and benchmark.
 
 ### Notes on Drivers
-The benchmark drivers (in `benchmarks/`) follow the nameing convention 
-`<test-name>-<model>-driver.<ext>`. Likewise, the model drivers in `models/`
-follow the naming convention `<model>-driver.<ext>`. The test name is the name 
-used in the prompts data set. The model is one of the parallel backend models 
-available. These are used as keys in the code so this naming convention and 
-spelling needs to be followed.
+The benchmark drivers (in `benchmarks/`) follow the naming convention
+`<problem-type>/<problem-name>/<model>.<ext>`. Likewise, the model drivers in
+`models/` follow the naming convention `<model>-driver.<ext>`. The problem name
+is the name used in the prompts data set. The model is one of the parallel
+backend models available. These are used as keys in the code, so this naming
+convention and spelling must be followed. The `models/` subdirectory has the
+`main` function for each execution model. The actual testing code for each of
+the prompts is in the `benchmarks/` subdirectory.
 
 Make sure to run `make` in the corresponding subdirectories for models that need
 to be compiled. For example in `cpp/` run `make` to build the driver binaries.
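
To make the driver naming conventions in the `drivers/README.md` notes concrete, here is a small sketch that builds both kinds of path from their keys. The helper names, the example problem type and name, and the `cc` extension are hypothetical and chosen only for illustration. The set of model names is copied from the `--include-models` choices in the usage text.

```python
from pathlib import PurePosixPath

# Execution model keys as listed in the run-all.py usage text.
MODELS = {"serial", "omp", "mpi", "mpi+omp", "kokkos", "cuda", "hip"}


def _check_model(model: str) -> None:
    # The README notes that names are used as keys, so spelling must match.
    if model not in MODELS:
        raise ValueError(f"unknown execution model: {model!r}")


def benchmark_driver_path(problem_type: str, problem_name: str,
                          model: str, ext: str) -> PurePosixPath:
    """Build benchmarks/<problem-type>/<problem-name>/<model>.<ext>."""
    _check_model(model)
    return PurePosixPath("benchmarks") / problem_type / problem_name / f"{model}.{ext}"


def model_driver_path(model: str, ext: str) -> PurePosixPath:
    """Build models/<model>-driver.<ext>."""
    _check_model(model)
    return PurePosixPath("models") / f"{model}-driver.{ext}"


if __name__ == "__main__":
    # "sort" and "example_problem" are placeholder names, not real prompts.
    print(benchmark_driver_path("sort", "example_problem", "omp", "cc"))
    print(model_driver_path("omp", "cc"))
```

Rejecting unknown model names early mirrors the README's point that a misspelled key would otherwise fail to match any driver.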
diff --git a/drivers/run-cuda.sbatch b/drivers/run-cuda.sbatch
@@ -20,6 +20,7 @@ module purge
 ml python gcc/11.3.0 cuda/12.1.1/gcc/11.3.0/
 
 python run-all.py \
+    $GENERATED_PROMPTS \
     $GENERATED_PROMPTS \
     --output $OUTPUT \
     --scratch-dir $SCRATCH_DIR \
diff --git a/drivers/run-hip.sbatch b/drivers/run-hip.sbatch
@@ -17,6 +17,7 @@ module purge
 ml python rocm/5.7.0 flux_wrappers/0.1
 
 python run-all.py \
+    $GENERATED_PROMPTS \
     $GENERATED_PROMPTS \
     --output $OUTPUT \
     --scratch-dir $SCRATCH_DIR \
diff --git a/drivers/run-kokkos.sbatch b/drivers/run-kokkos.sbatch
@@ -22,7 +22,7 @@ export OMP_PROC_BIND=spread
 export OMP_PLACES=cores
 
 python run-all.py \
-    $OUTPUT \
+    $GENERATED_PROMPTS \
     --output $OUTPUT \
     --scratch-dir $SCRATCH_DIR \
     --launch-configs launch-configs.json \
diff --git a/drivers/run-mpi+omp.sbatch b/drivers/run-mpi+omp.sbatch
@@ -2,6 +2,7 @@
 #SBATCH -N 4
 #SBATCH --exclusive
 #SBATCH -t 05:00:00
+#SBATCH -t 05:00:00
 #SBATCH -A bhatele-lab-cmsc
 #SBATCH -J mpi+omp-magicoder-s-ds-6.7b
 #SBATCH -o run-outputs/magicoder-s-ds-6.7b-mpi+omp-%A.out
@@ -22,7 +23,7 @@ export OMP_PLACES=cores
 export OMPI_MCA_opal_warn_on_missing_libcuda=0
 
 python run-all.py \
-    $OUTPUT \
+    $GENERATED_PROMPTS \
     --output $OUTPUT \
     --scratch-dir $SCRATCH_DIR \
     --launch-configs launch-configs.json \
diff --git a/drivers/run-mpi.sbatch b/drivers/run-mpi.sbatch
@@ -1,6 +1,6 @@
 #!/bin/bash
 #SBATCH -n 512
-#SBATCH -t 08:00:00
+#SBATCH -t 04:00:00
 #SBATCH -A bhatele-lab-cmsc
 #SBATCH -J mpi-magicoder-s-ds-6.7b
 #SBATCH -o run-outputs/magicoder-s-ds-6.7b-mpi-%A.out
diff --git a/drivers/run-omp.sbatch b/drivers/run-omp.sbatch
@@ -3,6 +3,7 @@
 #SBATCH --exclusive
 #SBATCH -p serial
 #SBATCH -t 05:00:00
+#SBATCH -t 05:00:00
 #SBATCH -A bhatele-lab-cmsc
 #SBATCH -J omp-magicoder-s-ds-6.7b
 #SBATCH -o run-outputs/magicoder-s-ds-6.7b-omp-%A.out
@@ -22,6 +23,7 @@ export OMP_PROC_BIND=spread
 export OMP_PLACES=cores
 
 python run-all.py \
+    $GENERATED_PROMPTS \
     $GENERATED_PROMPTS \
     --output $OUTPUT \
     --scratch-dir $SCRATCH_DIR \
@@ -30,4 +32,5 @@ python run-all.py \
     --yes-to-all \
     --include-models omp \
     --early-exit-runs \
+    --log info
     --log info
diff --git a/drivers/run-serial.sbatch b/drivers/run-serial.sbatch
@@ -3,6 +3,7 @@
 #SBATCH --exclusive
 #SBATCH -p serial
 #SBATCH -t 04:00:00
+#SBATCH -t 04:00:00
 #SBATCH -A bhatele-lab-cmsc
 #SBATCH -J serial-magicoder-s-ds-6.7b
 #SBATCH -o run-outputs/magicoder-s-ds-6.7b-serial-%A.out
diff --git a/generate/README.md b/generate/README.md
diff --git a/prompts/README.md b/prompts/README.md
diff --git a/requirements.txt b/requirements.txt