Xgb datasets adding #60

Merged: 39 commits, Apr 26, 2021
- 62f87c3: Applied mypy + flake8 for all files (Mar 22, 2021)
- 132d73f: Sorted imports with ISort (Mar 22, 2021)
- 4aa4898: Moved env change to runner (Mar 22, 2021)
- 5a8db33: fixed all mypy errors and added mypy check to CI (Mar 22, 2021)
- 5594efd: Yet another mypy fixes (Mar 22, 2021)
- 35b55b8: Small runner refactoring (Mar 23, 2021)
- 56de8f7: First attempt of adding nvidia datasets (Mar 29, 2021)
- 0ee5f05: Merge branch 'master' into mypy-applying (Mar 29, 2021)
- 04e7a64: removed E265 ignoring for flake8 job (Mar 29, 2021)
- 8268747: Merge remote-tracking branch 'my/mypy-applying' into xgb-nvidia-datasets (Mar 30, 2021)
- b6a7eb0: NVidia benchmarks are working now (Mar 30, 2021)
- 7e780bb: Added higgs, msrank and airline fetching (Mar 30, 2021)
- 670c289: small fixes of env (Mar 30, 2021)
- dc0e9c9: Applying comments (Apr 1, 2021)
- f64ae68: Merge branch 'mypy-applying' into xgb-nvidia-datasets (Apr 1, 2021)
- 873754b: Split dataset loading to different files (Apr 1, 2021)
- 93ea32d: Merge remote-tracking branch 'origin/master' into xgb-nvidia-datasets (Apr 1, 2021)
- dcfc5b9: Why doesnt mypy work? (Apr 1, 2021)
- 340402e: Added abalone + letters, updated all GB configs (Apr 15, 2021)
- 6e47423: Added links and descriptions for new datasets (Apr 15, 2021)
- 340a628: Merge remote-tracking branch 'origin/master' into xgb-nvidia-datasets (Apr 15, 2021)
- 4be3720: handling mypy (Apr 15, 2021)
- 8184016: Handled skex fake message throwing (Apr 15, 2021)
- cf5ee76: Trying to handle mypy, at. 3 (Apr 15, 2021)
- 9db3177: Trying to handle mypy, at. 4 (Apr 15, 2021)
- 5e76a0b: Trying to handle mypy, at. 5 (Apr 15, 2021)
- 13fcd20: Changed configs readme and made small fixes in GB testing configs (Apr 20, 2021)
- 0873f97: Merge branch 'master' of https://github.com/IntelPython/scikit-learn_… (Apr 20, 2021)
- 877e0fd: Applying more comments, updating readme's (Apr 20, 2021)
- 8bdc7f2: Applying comments: renamed configs (Apr 20, 2021)
- f9cf09b: Changed all datasets to npy, applied Kirill's comments (Apr 23, 2021)
- 41e003f: Merge branch 'master' of https://github.com/IntelPython/scikit-learn_… (Apr 23, 2021)
- 523df30: Cleanup after someone's commit (Apr 23, 2021)
- 59303fa: Applying mypy (Apr 23, 2021)
- b56e42c: Applied Ekaterina's suggestions (Apr 23, 2021)
- ad176e5: Applied other Ekaterina's comments (Apr 23, 2021)
- b92a27f: Merge branch 'xgb-nvidia-datasets' of https://github.com/RukhovichIV/… (Apr 23, 2021)
- 11a8ffc: Final commits applying (Apr 26, 2021)
- 37d5461: Alexander's final comments (Apr 26, 2021)
75 changes: 41 additions & 34 deletions README.md
@@ -27,55 +27,56 @@ We publish blogs on Medium, so [follow us](https://medium.com/intel-analytics-so

## Table of contents

* [How to create conda environment for benchmarking](#how-to-create-conda-environment-for-benchmarking)
* [Running Python benchmarks with runner script](#running-python-benchmarks-with-runner-script)
* [Benchmark supported algorithms](#benchmark-supported-algorithms)
* [Intel(R) Extension for Scikit-learn* support](#intelr-extension-for-scikit-learn-support)
* [Algorithms parameters](#algorithms-parameters)
- [How to create conda environment for benchmarking](#how-to-create-conda-environment-for-benchmarking)
- [Running Python benchmarks with runner script](#running-python-benchmarks-with-runner-script)
- [Benchmark supported algorithms](#benchmark-supported-algorithms)
- [Intel(R) Extension for Scikit-learn* support](#intelr-extension-for-scikit-learn-support)
- [Algorithms parameters](#algorithms-parameters)

## How to create conda environment for benchmarking

Create a suitable conda environment for each framework to test. Each item in the list below links to instructions to create an appropriate conda environment for the framework.

* [**scikit-learn**](sklearn_bench#how-to-create-conda-environment-for-benchmarking)
- [**scikit-learn**](sklearn_bench#how-to-create-conda-environment-for-benchmarking)

```bash
pip install -r sklearn_bench/requirements.txt
# or
conda install -c intel scikit-learn scikit-learn-intelex pandas
conda install -c intel scikit-learn scikit-learn-intelex pandas tqdm
```

* [**daal4py**](daal4py_bench#how-to-create-conda-environment-for-benchmarking)
- [**daal4py**](daal4py_bench#how-to-create-conda-environment-for-benchmarking)

```bash
conda install -c conda-forge scikit-learn daal4py pandas
conda install -c conda-forge scikit-learn daal4py pandas tqdm
```

* [**cuml**](cuml_bench#how-to-create-conda-environment-for-benchmarking)
- [**cuml**](cuml_bench#how-to-create-conda-environment-for-benchmarking)

```bash
conda install -c rapidsai -c conda-forge cuml pandas cudf
conda install -c rapidsai -c conda-forge cuml pandas cudf tqdm
```

* [**xgboost**](xgboost_bench#how-to-create-conda-environment-for-benchmarking)
- [**xgboost**](xgboost_bench#how-to-create-conda-environment-for-benchmarking)

```bash
pip install -r xgboost_bench/requirements.txt
# or
conda install -c conda-forge xgboost pandas
conda install -c conda-forge xgboost scikit-learn pandas tqdm
```

## Running Python benchmarks with runner script

Run `python runner.py --configs configs/config_example.json [--output-file result.json --verbose INFO --report]` to launch benchmarks.

Options:
* ``--configs``: specify the path to a configuration file.
* ``--no-intel-optimized``: use Scikit-learn without [Intel(R) Extension for Scikit-learn*](#intelr-extension-for-scikit-learn-support). Now available for [scikit-learn benchmarks](https://github.com/IntelPython/scikit-learn_bench/tree/master/sklearn_bench). By default, the runner uses Intel(R) Extension for Scikit-learn.
* ``--output-file``: output file name for the benchmark result. The default name is `result.json`
* ``--report``: create an Excel report based on benchmark results. The `openpyxl` library is required.
* ``--dummy-run``: run configuration parser and dataset generation without benchmarks running.
* ``--verbose``: *WARNING*, *INFO*, *DEBUG*. print additional information during benchmarks running. Default is *INFO*.

- ``--configs``: specify the path to a configuration file.
- ``--no-intel-optimized``: use Scikit-learn without [Intel(R) Extension for Scikit-learn*](#intelr-extension-for-scikit-learn-support). Now available for [scikit-learn benchmarks](https://github.com/IntelPython/scikit-learn_bench/tree/master/sklearn_bench). By default, the runner uses Intel(R) Extension for Scikit-learn.
- ``--output-file``: specify the name of the output file for the benchmark result. The default name is `result.json`.
- ``--report``: create an Excel report based on benchmark results. The `openpyxl` library is required.
- ``--dummy-run``: run configuration parser and dataset generation without benchmarks running.
- ``--verbose``: *WARNING*, *INFO*, *DEBUG*. Print out additional information when the benchmarks are running. The default is *INFO*.

| Level | Description |
|-----------|---------------|
@@ -84,10 +85,11 @@ Options:
| *WARNING* | An indication that something unexpected happened, or indicative of some problem in the near future (e.g. ‘disk space low’). The software is still working as expected. |

Benchmarks currently support the following frameworks:
* **scikit-learn**
* **daal4py**
* **cuml**
* **xgboost**

- **scikit-learn**
- **daal4py**
- **cuml**
- **xgboost**

The configuration of benchmarks allows you to select the frameworks to run, select datasets for measurements and configure the parameters of the algorithms.

@@ -117,27 +119,32 @@ The configuration of benchmarks allows you to select the frameworks to run, sele
When you run scikit-learn benchmarks on CPU, [Intel(R) Extension for Scikit-learn](https://github.com/intel/scikit-learn-intelex) is used by default. Use the ``--no-intel-optimized`` option to run the benchmarks without the extension.

The following benchmarks have GPU support:
* dbscan
* kmeans
* linear
* log_reg

- dbscan
- kmeans
- linear
- log_reg

You may use the [configuration file for these benchmarks](https://github.com/IntelPython/scikit-learn_bench/blob/master/configs/skl_xpu_config.json) to run them on both CPU and GPU.

## Algorithms parameters
## Algorithms parameters

You can launch benchmarks for each algorithm separately.
To do this, go to the directory with the benchmark:

cd <framework>
```bash
cd <framework>
```

Run the following command:

python <benchmark_file> --dataset-name <path to the dataset> <other algorithm parameters>
```bash
python <benchmark_file> --dataset-name <path to the dataset> <other algorithm parameters>
```
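If you do not yet have a dataset on disk, you can create one in the `.npy` format that the benchmarks load. A minimal sketch, where the file names, shapes, and dtype are illustrative only and not prescribed by the benchmark suite:

```python
import numpy as np

# Build a small synthetic binary-classification set and store it as .npy
# files; these paths could then be passed as the dataset arguments of a
# benchmark. Names and shapes here are just an example.
rng = np.random.default_rng(seed=0)
x_train = rng.standard_normal((1000, 16)).astype(np.float32)
y_train = (x_train.sum(axis=1) > 0).astype(np.int64)

np.save("x_train.npy", x_train)
np.save("y_train.npy", y_train)
```

The saved arrays round-trip through `np.load`, which is how `bench.py` reads `.npy` inputs.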

You can find the list of supported parameters for each algorithm here:

* [**scikit-learn**](sklearn_bench#algorithms-parameters)
* [**daal4py**](daal4py_bench#algorithms-parameters)
* [**cuml**](cuml_bench#algorithms-parameters)
* [**xgboost**](xgboost_bench#algorithms-parameters)
- [**scikit-learn**](sklearn_bench#algorithms-parameters)
- [**daal4py**](daal4py_bench#algorithms-parameters)
- [**cuml**](cuml_bench#algorithms-parameters)
- [**xgboost**](xgboost_bench#algorithms-parameters)
4 changes: 2 additions & 2 deletions azure-pipelines.yml
@@ -33,7 +33,7 @@ jobs:
steps:
- script: |
conda update -y -q conda
conda create -n bench -q -y -c conda-forge python=3.7 pandas scikit-learn daal4py
conda create -n bench -q -y -c conda-forge python=3.7 pandas scikit-learn daal4py tqdm
displayName: Create Anaconda environment
- script: |
. /usr/share/miniconda/etc/profile.d/conda.sh
@@ -46,7 +46,7 @@
steps:
- script: |
conda update -y -q conda
conda create -n bench -q -y -c conda-forge python=3.7 pandas xgboost scikit-learn daal4py
conda create -n bench -q -y -c conda-forge python=3.7 pandas xgboost scikit-learn daal4py tqdm
displayName: Create Anaconda environment
- script: |
. /usr/share/miniconda/etc/profile.d/conda.sh
24 changes: 13 additions & 11 deletions bench.py
@@ -16,6 +16,7 @@

import argparse
import json
import logging
import sys
import timeit

@@ -200,15 +201,16 @@ def parse_args(parser, size=None, loop_types=(),
from sklearnex import patch_sklearn
patch_sklearn()
except ImportError:
print('Failed to import sklearnex.patch_sklearn.'
'Use stock version scikit-learn', file=sys.stderr)
logging.info('Failed to import sklearnex.patch_sklearn.'
'Use stock version scikit-learn', file=sys.stderr)
params.device = 'None'
else:
if params.device != 'None':
print('Device context is not supported for stock scikit-learn.'
'Please use --no-intel-optimized=False with'
f'--device={params.device} parameter. Fallback to --device=None.',
file=sys.stderr)
logging.info(
'Device context is not supported for stock scikit-learn.'
'Please use --no-intel-optimized=False with'
f'--device={params.device} parameter. Fallback to --device=None.',
file=sys.stderr)
params.device = 'None'

# disable finiteness check (default)
@@ -218,7 +220,7 @@
# Ask DAAL what it thinks about this number of threads
num_threads = prepare_daal_threads(num_threads=params.threads)
if params.verbose:
print(f'@ DAAL gave us {num_threads} threads')
logging.info(f'@ DAAL gave us {num_threads} threads')

n_jobs = None
if n_jobs_supported:
@@ -234,7 +236,7 @@

# Very verbose output
if params.verbose:
print(f'@ params = {params.__dict__}')
logging.info(f'@ params = {params.__dict__}')

return params

@@ -249,8 +251,8 @@ def set_daal_num_threads(num_threads):
if num_threads:
daal4py.daalinit(nthreads=num_threads)
except ImportError:
print('@ Package "daal4py" was not found. Number of threads '
'is being ignored')
logging.info('@ Package "daal4py" was not found. Number of threads '
'is being ignored')


def prepare_daal_threads(num_threads=-1):
@@ -417,7 +419,7 @@ def load_data(params, generated_data=[], add_dtype=False, label_2d=False,
# load and convert data from npy/csv file if path is specified
if param_vars[file_arg] is not None:
if param_vars[file_arg].name.endswith('.npy'):
data = np.load(param_vars[file_arg].name)
data = np.load(param_vars[file_arg].name, allow_pickle=True)
else:
data = read_csv(param_vars[file_arg].name, params)
full_data[element] = convert_data(
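The optional-dependency fallback that this part of the diff exercises can be summarized with a short, self-contained sketch. This mirrors the pattern in `bench.py` rather than reproducing its exact code, and the message text is illustrative:

```python
import logging

# Try to enable Intel(R) Extension for Scikit-learn; fall back to stock
# scikit-learn when the optional sklearnex package is not installed.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()  # subsequent sklearn imports resolve to patched classes
    patched = True
except ImportError:
    logging.info('Failed to import sklearnex.patch_sklearn. '
                 'Using stock scikit-learn')
    patched = False
```

Note that, unlike `print`, `logging.info` accepts no `file=` keyword; where the output goes is decided by the logging configuration (handlers), not by the call site.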
63 changes: 32 additions & 31 deletions configs/README.md
@@ -1,4 +1,4 @@
## Config JSON Schema
# Config JSON Schema

Configure benchmarks by editing the `config.json` file.
You can configure some algorithm parameters, datasets, a list of frameworks to use, and the usage of some environment variables.
@@ -11,58 +11,59 @@ Refer to the tables below for descriptions of all fields in the configuration fi
- [Training Object](#training-object)
- [Testing Object](#testing-object)

### Root Config Object
## Root Config Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
|omp_env| array[string] | For xgboost only. Specify an environment variable to set the number of omp threads |
|common| [Common Object](#common-object)| **REQUIRED** common benchmarks setting: frameworks and input data settings |
|cases| array[[Case Object](#case-object)] | **REQUIRED** list of algorithms, their parameters and training data |
|cases| List[[Case Object](#case-object)] | **REQUIRED** list of algorithms, their parameters and training data |

### Common Object
## Common Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
|lib| array[string] | **REQUIRED** list of test frameworks. It can be *sklearn*, *daal4py*, *cuml* or *xgboost* |
|data-format| array[string] | **REQUIRED** input data format. Data formats: *numpy*, *pandas* or *cudf* |
|data-order| array[string] | **REQUIRED** input data order. Data order: *C* (row-major, default) or *F* (column-major) |
|dtype| array[string] | **REQUIRED** input data type. Data type: *float64* (default) or *float32* |
|check-finitness| array[] | Check finiteness in sklearn input check(disabled by default) |
|device| array[string] | For scikit-learn only. The list of devices to run the benchmarks on. It can be *None* (default, run on CPU without sycl context) or one of the types of sycl devices: *cpu*, *gpu*, *host*. Refer to [SYCL specification](https://www.khronos.org/files/sycl/sycl-2020-reference-guide.pdf) for details|
|data-format| Union[str, List[str]] | **REQUIRED** Input data format: *numpy*, *pandas*, or *cudf*. |
|data-order| Union[str, List[str]] | **REQUIRED** Input data order: *C* (row-major, default) or *F* (column-major). |
|dtype| Union[str, List[str]] | **REQUIRED** Input data type: *float64* (default) or *float32*. |
|check-finitness| List[] | Check finiteness during scikit-learn input check (disabled by default). |
|device| array[string] | For scikit-learn only. The list of devices to run the benchmarks on.<br/>It can be *None* (default, run on CPU without sycl context) or one of the types of sycl devices: *cpu*, *gpu*, *host*.<br/>Refer to [SYCL specification](https://www.khronos.org/files/sycl/sycl-2020-reference-guide.pdf) for details.|

### Case Object
## Case Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
|lib| array[string] | **REQUIRED** list of test frameworks. It can be *sklearn*, *daal4py*, *cuml* or *xgboost*|
|algorithm| string | **REQUIRED** benchmark name |
|dataset| array[[Dataset Object](#dataset-object)] | **REQUIRED** input data specifications. |
|benchmark parameters| array[Any] | **REQUIRED** algorithm parameters. a list of supported parameters can be found here |
|lib| Union[str, List[str]] | **REQUIRED** A test framework or a list of frameworks. Must be from [*sklearn*, *daal4py*, *cuml*, *xgboost*]. |
|algorithm| string | **REQUIRED** Benchmark file name. |
|dataset| List[[Dataset Object](#dataset-object)] | **REQUIRED** Input data specifications. |
|**specific algorithm parameters**| Union[int, float, str, List[int], List[float], List[str]] | Other algorithm-specific parameters |

**Important:** You can move any parameter from **"cases"** to **"common"** if this parameter is common to all cases

### Dataset Object
## Dataset Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
|source| string | **REQUIRED** data source. It can be *synthetic* or *csv* |
|type| string | **REQUIRED** for synthetic data only. The type of task for which the dataset is generated. It can be *classification*, *blobs* or *regression* |
|source| string | **REQUIRED** Data source: *synthetic*, *csv*, or *npy*. |
|type| string | **REQUIRED for synthetic data**. The type of task for which the dataset is generated: *classification*, *blobs*, or *regression*. |
|n_classes| int | For *synthetic* data and for *classification* type only. The number of classes (or labels) of the classification problem |
|n_clusters| int | For *synthetic* data and for *blobs* type only. The number of centers to generate |
|n_features| int | **REQUIRED** For *synthetic* data only. The number of features to generate |
|name| string | Name of dataset |
|training| [Training Object](#training-object) | **REQUIRED** algorithm parameters. a list of supported parameters can be found here |
|testing| [Testing Object](#testing-object) | **REQUIRED** algorithm parameters. a list of supported parameters can be found here |
|n_features| int | **REQUIRED for *synthetic* data**. The number of features to generate. |
|name| string | Name of the dataset. |
|training| [Training Object](#training-object) | **REQUIRED** An object with the paths to the training datasets. |
|testing| [Testing Object](#testing-object) | An object with the paths to the testing datasets. If not provided, the training datasets are used. |

### Training Object
## Training Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
| n_samples | int | The total number of the training points |
| x | str | The path to the training samples |
| y | str | The path to the training labels |
| n_samples | int | **REQUIRED** The total number of the training samples |
| x | str | **REQUIRED** The path to the training samples |
| y | str | **REQUIRED** The path to the training labels |

### Testing Object
## Testing Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
| n_samples | int | The total number of the testing points |
| x | str | The path to the testing samples |
| y | str | The path to the testing labels |
| n_samples | int | **REQUIRED** The total number of the testing samples |
| x | str | **REQUIRED** The path to the testing samples |
| y | str | **REQUIRED** The path to the testing labels |
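Putting the objects described in the tables above together, a minimal configuration could look like the following sketch. All dataset names, file paths, and parameter values here are illustrative, not taken from the repository's shipped configs:

```json
{
  "common": {
    "lib": ["xgboost"],
    "data-format": ["pandas"],
    "data-order": ["F"],
    "dtype": ["float32"]
  },
  "cases": [
    {
      "algorithm": "gbt",
      "dataset": [
        {
          "source": "npy",
          "name": "example",
          "training": {
            "n_samples": 10000,
            "x": "data/example_x_train.npy",
            "y": "data/example_y_train.npy"
          },
          "testing": {
            "n_samples": 1000,
            "x": "data/example_x_test.npy",
            "y": "data/example_y_test.npy"
          }
        }
      ],
      "n_estimators": [100]
    }
  ]
}
```

Here `n_estimators` stands in for a specific algorithm parameter; per the Case Object table, any such parameter could also be moved to `"common"` when it applies to every case.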