
Commit c276d2b

Update documentation (#22)
* update readmes
1 parent 326308e commit c276d2b

File tree

7 files changed: +322, -36 lines

CITATION.cff

Lines changed: 74 additions & 0 deletions

cff-version: 1.2.0
title: "Can Large Language Models Write Parallel Code?"
message: "If you use this library and love it, cite the software and the paper \U0001F917"
authors:
  - given-names: Daniel
    family-names: Nichols
    email: dnicho@umd.edu
    affiliation: University of Maryland, College Park
  - given-names: Josh
    family-names: Davis
    email: jhdavis@umd.edu
    affiliation: University of Maryland, College Park
  - given-names: Zhaojun
    family-names: Xie
    email: zxie12@umd.edu
    affiliation: University of Maryland, College Park
  - given-names: Arjun
    family-names: Rajaram
    email: arajara1@umd.edu
    affiliation: University of Maryland, College Park
  - given-names: Abhinav
    family-names: Bhatele
    email: bhatele@cs.umd.edu
    affiliation: University of Maryland, College Park
version: 1.0.0
doi: https://doi.org/10.48550/arXiv.2401.12554
date-released: 2024-01-23
references:
  - type: article
    authors:
      - given-names: Daniel
        family-names: Nichols
        email: dnicho@umd.edu
        affiliation: University of Maryland, College Park
      - given-names: Josh
        family-names: Davis
        email: jhdavis@umd.edu
        affiliation: University of Maryland, College Park
      - given-names: Zhaojun
        family-names: Xie
        email: zxie12@umd.edu
        affiliation: University of Maryland, College Park
      - given-names: Arjun
        family-names: Rajaram
        email: arajara1@umd.edu
        affiliation: University of Maryland, College Park
      - given-names: Abhinav
        family-names: Bhatele
        email: bhatele@cs.umd.edu
        affiliation: University of Maryland, College Park
    title: "Can Large Language Models Write Parallel Code?"
    year: 2024
    journal: ArXiv
    doi: https://doi.org/10.48550/arXiv.2401.12554
    url: https://arxiv.org/abs/2401.12554
abstract: >-
  Large Language Models are becoming an increasingly popular tool for software
  development. Their ability to model and generate source code has been
  demonstrated in a variety of contexts, including code completion,
  summarization, translation, and lookup. However, they often struggle to
  generate code for more complex tasks. In this paper, we explore the ability of
  state-of-the-art language models to generate parallel code. We propose a
  benchmark, PCGBench, consisting of a set of 420 tasks for evaluating the
  ability of language models to generate parallel code, and we evaluate the
  performance of several state-of-the-art open- and closed-source language
  models on these tasks. We introduce novel metrics for comparing parallel code
  generation performance and use them to explore how well each LLM performs on
  various parallel programming models and computational problem types.
keywords:
  - Large Language Models
  - High Performance Computing
  - Parallel Computing
license: Apache-2.0

README.md

Lines changed: 71 additions & 3 deletions

Removed (the old top-level README):

# LLMs for HPC
Data and scripts related to evaluating Large Language Models (LLMs) for tasks in
High Performance Computing (HPC).

Added (the new top-level README):

# PCGBench

This repo contains the Parallel Code Generation Benchmark (PCGBench) for
evaluating the ability of Large Language Models to write parallel code. See the
[PCGBench Leaderboard](https://pssg.cs.umd.edu/blog/2024/pareval/) for
up-to-date results on different LLMs.

## Overview

The organization of the repo is as follows.

- `prompts/` -- the prompts in PCGBench alongside some utility scripts
- `generate/` -- scripts for generating LLM outputs
- `drivers/` -- scripts to evaluate LLM outputs
- `analysis/` -- scripts to analyze driver results and compute metrics
- `tpl/` -- git submodule dependencies

Each subdirectory has further documentation on its contents. The general
workflow is to use `generate/generate.py` to get LLM outputs, run
`drivers/run-all.py` to evaluate those outputs, and `analysis/metrics.py` to
postprocess the results, as sketched below.
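
For orientation, an end-to-end run might look like the following minimal sketch. The model handle and output file names are only examples, each script is normally run from its own subdirectory, and every script accepts more options via `--help`.

```sh
# 1. Generate completions for the PCGBench prompts
#    (the model handle below is just an example).
cd generate
python generate.py \
    --prompts ../prompts/generation-prompts.json \
    --model bigcode/starcoderbase \
    --output ../generated-outputs.json \
    --do_sample --prompted

# 2. Build, run, and validate the generated code.
cd ../drivers
python run-all.py ../generated-outputs.json -o ../results.json

# 3. Post-process the results into metrics; the exact arguments
#    are documented by each script's --help.
cd ../analysis
python create-dataframe.py --help
python metrics.py --help
```
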
## Setup and Installation

A few pieces of core system software are assumed to be installed: Python >=3.7,
a C++ compiler that supports C++17 and OpenMP, Make, CMake, and an MPI
implementation. If you are testing the CUDA and HIP prompts, then you will also
need access to NVIDIA and AMD GPUs alongside their respective software stacks.
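
A quick way to sanity-check the prerequisites (the GPU toolchains are only needed if you evaluate the CUDA or HIP prompts; command names may differ on your system):

```sh
python3 --version   # needs >= 3.7
c++ --version       # any C++17 compiler with OpenMP support (g++, clang++, ...)
cmake --version
make --version
mpicc --version     # any MPI implementation
nvcc --version      # only for the CUDA prompts
hipcc --version     # only for the HIP prompts
```
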
First, clone the repo.

```sh
git clone --recurse-submodules https://github.com/pssg-int/llms-for-hpc.git
```

Next, you need to build Kokkos (if you want to include it in testing).

```sh
cd tpl/kokkos

mkdir build
cd build

# depending on your system you may need to pass your c++ compiler to CMAKE_CXX_COMPILER
cmake .. -DCMAKE_INSTALL_PREFIX=. -DKokkos_ENABLE_THREADS=ON
make install -j4
```

Finally, you need to install the Python dependencies. `requirements.txt` has
the set of dependencies pinned at the versions they were tested with. Other
versions may also work. Note that some of these are only required for parts of
the pipeline, e.g., PyTorch and Transformers are only needed for generating LLM
outputs.

```sh
pip install -r requirements.txt
```
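
If you prefer an isolated environment, the same install works inside a virtual environment, for example:

```sh
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
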
## Citing PCGBench

```
@misc{nichols2024large,
    title={Can Large Language Models Write Parallel Code?},
    author={Daniel Nichols and Joshua H. Davis and Zhaojun Xie and
            Arjun Rajaram and Abhinav Bhatele},
    year={2024},
    eprint={2401.12554},
    archivePrefix={arXiv},
    primaryClass={cs.DC}
}
```

analysis/README.md

Lines changed: 20 additions & 0 deletions

# Analysis

This subdirectory contains scripts for analyzing the LLM outputs and driver
results.

`create-dataframe.py` -- convert results JSON files into CSV format

`metrics.py` -- compute pass@k, efficiency@k, speedup@k, and build@k for a
particular results CSV file

`metrics-scaling.py` -- compute the metrics at different resource counts; used
to get scaling results

`bin-the-stack.py` -- a utility script for analyzing The Stack dataset

The arguments to each of these scripts can be found with `--help`. In general,
the workflow is to use `create-dataframe.py` to get a CSV results file for the
driver outputs and then feed this into `metrics.py` to get the relevant metrics.
This will in turn output another CSV file with the metrics divided by problem
type and execution model.
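
For reference, `pass@k` here is presumably the standard unbiased estimator used for code generation benchmarks; a minimal sketch of that estimator is below. This is illustrative only: `metrics.py` is the authoritative implementation, and the efficiency@k and speedup@k variants additionally incorporate measured run times as defined in the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n -- number of samples generated for a prompt
    c -- number of those samples that build and pass the correctness checks
    k -- evaluation budget
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 50 samples per prompt (the generate.py default), 10 of which pass.
print(pass_at_k(50, 10, 1))   # 0.2
print(pass_at_k(50, 10, 10))  # ~0.92
```
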

drivers/README.md

Lines changed: 57 additions & 33 deletions

Scripts to handle running, testing, and benchmarking code generated by LLMs.
The drivers are split up by language.

## Setup

The C++ drivers need some object files to be built before running. This can
be done by running the following lines.

```sh
cd cpp
make
```

## Running the prompts

Given a prompt and output data set in `generated-outputs.json`, you can run each
of the generated outputs using the command below.

```sh
python run-all.py generated-outputs.json

# usage: run-all.py [-h] [-o OUTPUT] [--scratch-dir SCRATCH_DIR] [--launch-configs LAUNCH_CONFIGS] [--problem-sizes PROBLEM_SIZES] [--yes-to-all]
#                   [--dry] [--overwrite] [--hide-progress]
#                   [--exclude-models {serial,omp,mpi,mpi+omp,kokkos,cuda,hip} [{serial,omp,mpi,mpi+omp,kokkos,cuda,hip} ...] | --include-models
#                   {serial,omp,mpi,mpi+omp,kokkos,cuda,hip} [{serial,omp,mpi,mpi+omp,kokkos,cuda,hip} ...]]
#                   [--problem PROBLEM | --problem-type PROBLEM_TYPE] [--early-exit-runs] [--build-timeout BUILD_TIMEOUT] [--run-timeout RUN_TIMEOUT]
#                   [--log {INFO,DEBUG,WARNING,ERROR,CRITICAL}] [--log-build-errors] [--log-runs]
#                   input_json
#
# Run all the generated code.
#
# [...]
#                         If provided, put scratch files here.
#   --launch-configs LAUNCH_CONFIGS
#                         config for how to run samples.
#   --problem-sizes PROBLEM_SIZES
#                         config for how to run samples.
#   --yes-to-all          If provided, automatically answer yes to all prompts.
#   --dry                 Dry run. Do not actually run the code snippets.
#   --overwrite           If outputs are already in DB for a given prompt, then overwrite them. Default behavior is to skip existing results.
#   --hide-progress       If provided, do not show progress bar.
#   --exclude-models {serial,omp,mpi,mpi+omp,kokkos,cuda,hip} [{serial,omp,mpi,mpi+omp,kokkos,cuda,hip} ...]
#                         Exclude the given parallelism models from testing.
#   --include-models {serial,omp,mpi,mpi+omp,kokkos,cuda,hip} [{serial,omp,mpi,mpi+omp,kokkos,cuda,hip} ...]
#                         Only test the given parallelism models.
#   --problem PROBLEM     Only test this problem if provided.
#   --problem-type PROBLEM_TYPE
#                         Only test problems of this type if provided.
#   --early-exit-runs     If provided, stop evaluating a model output after the first run configuration fails.
#   --build-timeout BUILD_TIMEOUT
#                         Timeout in seconds for building a program.
#   --run-timeout RUN_TIMEOUT
#                         Timeout in seconds for running a program.
#   --log {INFO,DEBUG,WARNING,ERROR,CRITICAL}
#                         logging level
#   --log-build-errors    On build error, display the stderr of the build process.
#   --log-runs            Display the stderr and stdout of runs.
```

The launch configurations (node counts and launch commands) are defined in a
JSON file passed to `--launch-configs`. The default should work on any system
with the Slurm workload manager. Depending on the parallel models being tested,
you may need a newer C++ compiler and an MPI implementation loaded.

It is likely you will need to split up the runs between execution models (few
machines have both NVIDIA and AMD GPUs in them). The `--include-models` and
`--exclude-models` options allow you to run only subsets of the prompts.

Additionally, the script uses `/tmp` for building and running the generated
code. On many machines `/tmp` is node-local, which will cause the MPI jobs to
fail. To solve this, set `--scratch-dir` to point to a scratch directory on a
shared file system.
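
Putting these options together, a typical CPU-only evaluation on a cluster might look like the following sketch (file names, paths, and timeout values are placeholders):

```sh
python run-all.py generated-outputs.json \
    -o results.json \
    --launch-configs launch-configs.json \
    --scratch-dir /path/to/shared/scratch \
    --include-models serial omp mpi \
    --build-timeout 120 --run-timeout 180
```
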
Removed (Zaratan-specific instructions):

### Running on Zaratan

To run the drivers on Zaratan you need the following modules (or similar).

```bash
ml gcc/11.3.0 openmpi/gcc/11.3.0 python
```

Additionally, the default `/tmp` scratch space for building and running is
node-local on Zaratan, so you need to pass `--scratch-dir` as some root
directory on the _scratch_ or _home_ filesystem, since these are accessible
from all compute nodes and on the network.

To prevent _cuda library not found errors_ with MPI you should run
`export OMPI_MCA_opal_warn_on_missing_libcuda=0` before running the drivers.

## Organization of Drivers

Within `drivers/` there are subdirectories for each programming language. Each
language subdirectory contains a `*_driver_wrapper.py` file that handles running
code for that language. This wrapper further uses functionality from drivers in
`models/` and `benchmarks/` in the language subdirectory. These define the
behavior for running each programming model and benchmark.

### Notes on Drivers

The benchmark drivers (in `benchmarks/`) follow the naming convention
`<problem-type>/<problem-name>/<model>.<ext>`. Likewise, the model drivers in
`models/` follow the naming convention `<model>-driver.<ext>`. The problem name
is the name used in the prompts data set. The model is one of the available
parallel backend models. These are used as keys in the code, so this naming
convention and spelling need to be followed exactly. The `models/` subdirectory
has the `main` function for each execution model, while the actual testing code
for each of the prompts lives in the `benchmarks/` subdirectory.
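
As a purely hypothetical illustration of that layout for the C++ drivers (the actual file extensions and problem names come from the repository and the prompts data set):

```
cpp/
├── models/
│   ├── serial-driver.<ext>     # main() for the serial execution model
│   ├── omp-driver.<ext>        # main() for OpenMP
│   └── ...
└── benchmarks/
    └── stencil/                # <problem-type>
        └── 17_problem_name/    # <problem-name> from the prompts data set
            ├── serial.<ext>    # <model>.<ext>
            ├── omp.<ext>
            └── ...
```
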
Make sure to run `make` in the corresponding subdirectories for models that need
to be compiled. For example, in `cpp/` run `make` to build the driver binaries.

generate/README.md

Lines changed: 58 additions & 0 deletions

# Generate

This subdirectory contains scripts for generating the LLM outputs for PCGBench.
The main script is `generate.py`. It can be run as follows.

```sh
usage: generate.py [-h] --prompts PROMPTS --model MODEL --output OUTPUT [--restart] [--cache CACHE] [--restore_from RESTORE_FROM] [--max_new_tokens MAX_NEW_TOKENS]
                   [--num_samples_per_prompt NUM_SAMPLES_PER_PROMPT] [--temperature TEMPERATURE] [--top_p TOP_P] [--do_sample] [--batch_size BATCH_SIZE] [--prompted]

Generate code

optional arguments:
  -h, --help            show this help message and exit
  --prompts PROMPTS     Path to the prompt JSON file
  --model MODEL         Path to the language model
  --output OUTPUT       Path to the output JSON file
  --restart             Restart generation from scratch (default: False)
  --cache CACHE         JSONL file to cache intermediate results in. Will be restored from if it already exists and --restart is not specified
  --restore_from RESTORE_FROM
                        JSON file to restore old results from. Will be restored from if it already exists and --restart is not specified. Is different from --cache in that it is a JSON
                        file, not a JSONL file, and it is only used to restore old results where the prompt is equivalent. Cached results are prioritized over restored results.
  --max_new_tokens MAX_NEW_TOKENS
                        Maximum number of new tokens to generate (default: 1024)
  --num_samples_per_prompt NUM_SAMPLES_PER_PROMPT
                        Number of code samples to generate (default: 50)
  --temperature TEMPERATURE
                        Temperature for controlling randomness (default: 0.2)
  --top_p TOP_P         Top p value for nucleus sampling (default: 0.95)
  --do_sample           Enable sampling (default: False)
  --batch_size BATCH_SIZE
                        Batch size for generation (default: 8)
  --prompted            Use prompted generation. See StarCoder paper (default: False)
```

To get outputs in a reasonable amount of time you will most likely need to run
this script on a system with a GPU. The `generate-openai.py` and
`generate-gemini.py` scripts generate outputs from the respective APIs; these
require you to set the API keys in environment variables or pass them as
arguments. Use `--help` to see the available arguments.

The `--model` argument in `generate.py` can either be a local model path or the
handle of a HuggingFace model. The cache and restart arguments can be used to
restart from existing outputs. The `--prompted` flag will prepend _solution_
comments to the prompt, as in the StarCoder paper. You likely want this turned
on, as it almost always improves the results.
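
For example, a typical sampling run against a HuggingFace model, launched from this directory, might look like this (the model handle and file names are only examples):

```sh
python generate.py \
    --prompts ../prompts/generation-prompts.json \
    --model bigcode/starcoderbase \
    --output starcoderbase-outputs.json \
    --cache starcoderbase-cache.jsonl \
    --num_samples_per_prompt 50 \
    --temperature 0.2 --top_p 0.95 --do_sample \
    --batch_size 8 \
    --prompted
```
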
## Adding New LLMs

Since a number of the LLMs have different inference settings and prompt formats,
we use an inference configuration abstraction to set these parameters for
different models. These are defined in `utils.py`. To add support for a new LLM
you will need to define an inference config for that LLM in `utils.py`. The
existing configurations in that file should serve as examples.
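
The exact shape of the abstraction is defined in `utils.py`; purely as a hypothetical illustration, such a config usually bundles a prompt template with generation settings, along these lines:

```python
from dataclasses import dataclass

@dataclass
class InferenceConfig:
    """Hypothetical sketch; see utils.py for the real abstraction."""
    prompt_template: str          # how to wrap the raw PCGBench prompt
    max_new_tokens: int = 1024
    stop_strings: tuple = ()      # strings that should terminate generation

# e.g. an instruct-tuned model might need a chat-style wrapper
example = InferenceConfig(
    prompt_template="[INST] Complete the following code. [/INST]\n{prompt}",
    stop_strings=("\n}\n",),
)
```
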
## Translate Tests

Each of the generate scripts has a translation equivalent. These are used for
the translation tasks.

prompts/README.md

Lines changed: 31 additions & 0 deletions

# Prompts

This directory contains the PCGBench prompts.
The prompts for the generation task are contained in `generation-prompts.json`
and the prompts for the translation task are in `translation-prompts.json`.

The format of the prompts dataset is as follows:

```json
[
    {
        "problem_type": "stencil",
        "language": "cpp",
        "name": "17_problem_name",
        "parallelism_model": "serial",
        "prompt": "/* prompt for the model here */\nvoid foo(int x) {"
    },
    ...
]
```
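
The file is plain JSON, so it is straightforward to load and filter, e.g. to pull out only the OpenMP (`omp`) generation prompts:

```python
import json

with open("generation-prompts.json") as f:
    prompts = json.load(f)

omp_prompts = [p for p in prompts if p["parallelism_model"] == "omp"]
print(len(omp_prompts), "OpenMP prompts")
print(omp_prompts[0]["name"], omp_prompts[0]["problem_type"])
```
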
## Other Utilities

`create-serial-tests.py` -- this script parses the sequential baselines for
each problem out of the drivers and creates a "fake" output file with these as
the sequential solutions. This can be used to test that the driver setup is
working.

`count-tokens.py` -- this can be used to estimate the number of tokens passed
to the OpenAI API and, thus, estimate the cost of generating outputs.
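
`count-tokens.py` is the authoritative tool here; for a rough idea of what such an estimate involves, the `tiktoken` package can count tokens for OpenAI models. A sketch (not necessarily what the script itself does):

```python
import json

import tiktoken  # pip install tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

with open("generation-prompts.json") as f:
    prompts = json.load(f)

total = sum(len(enc.encode(p["prompt"])) for p in prompts)
print(f"~{total} prompt tokens per pass over the dataset")
```
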
