
Commit d0444a6

XGB datasets adding (#60)

Author: Igor Rukhovich (outoftardis)
Co-authored-by: Ekaterina Mekhnetsova <mekkatya@gmail.com>
1 parent: f58fac4

34 files changed: +2457 −1567 lines changed

README.md

Lines changed: 41 additions & 34 deletions

@@ -27,55 +27,56 @@ We publish blogs on Medium, so [follow us](https://medium.com/intel-analytics-so
 
 ## Table of content
 
-* [How to create conda environment for benchmarking](#how-to-create-conda-environment-for-benchmarking)
-* [Running Python benchmarks with runner script](#running-python-benchmarks-with-runner-script)
-* [Benchmark supported algorithms](#benchmark-supported-algorithms)
-* [Intel(R) Extension for Scikit-learn* support](#intelr-extension-for-scikit-learn-support)
-* [Algorithms parameters](#algorithms-parameters)
+- [How to create conda environment for benchmarking](#how-to-create-conda-environment-for-benchmarking)
+- [Running Python benchmarks with runner script](#running-python-benchmarks-with-runner-script)
+- [Benchmark supported algorithms](#benchmark-supported-algorithms)
+- [Intel(R) Extension for Scikit-learn* support](#intelr-extension-for-scikit-learn-support)
+- [Algorithms parameters](#algorithms-parameters)
 
 ## How to create conda environment for benchmarking
 
 Create a suitable conda environment for each framework to test. Each item in the list below links to instructions to create an appropriate conda environment for the framework.
 
-* [**scikit-learn**](sklearn_bench#how-to-create-conda-environment-for-benchmarking)
+- [**scikit-learn**](sklearn_bench#how-to-create-conda-environment-for-benchmarking)
 
 ```bash
 pip install -r sklearn_bench/requirements.txt
 # or
-conda install -c intel scikit-learn scikit-learn-intelex pandas
+conda install -c intel scikit-learn scikit-learn-intelex pandas tqdm
 ```
 
-* [**daal4py**](daal4py_bench#how-to-create-conda-environment-for-benchmarking)
+- [**daal4py**](daal4py_bench#how-to-create-conda-environment-for-benchmarking)
 
 ```bash
-conda install -c conda-forge scikit-learn daal4py pandas
+conda install -c conda-forge scikit-learn daal4py pandas tqdm
 ```
 
-* [**cuml**](cuml_bench#how-to-create-conda-environment-for-benchmarking)
+- [**cuml**](cuml_bench#how-to-create-conda-environment-for-benchmarking)
 
 ```bash
-conda install -c rapidsai -c conda-forge cuml pandas cudf
+conda install -c rapidsai -c conda-forge cuml pandas cudf tqdm
 ```
 
-* [**xgboost**](xgboost_bench#how-to-create-conda-environment-for-benchmarking)
+- [**xgboost**](xgboost_bench#how-to-create-conda-environment-for-benchmarking)
 
 ```bash
 pip install -r xgboost_bench/requirements.txt
 # or
-conda install -c conda-forge xgboost pandas
+conda install -c conda-forge xgboost scikit-learn pandas tqdm
 ```
 
 ## Running Python benchmarks with runner script
 
 Run `python runner.py --configs configs/config_example.json [--output-file result.json --verbose INFO --report]` to launch benchmarks.
 
 Options:
-* ``--configs``: specify the path to a configuration file.
-* ``--no-intel-optimized``: use Scikit-learn without [Intel(R) Extension for Scikit-learn*](#intelr-extension-for-scikit-learn-support). Now available for [scikit-learn benchmarks](https://github.com/IntelPython/scikit-learn_bench/tree/master/sklearn_bench). By default, the runner uses Intel(R) Extension for Scikit-learn.
-* ``--output-file``: output file name for the benchmark result. The default name is `result.json`
-* ``--report``: create an Excel report based on benchmark results. The `openpyxl` library is required.
-* ``--dummy-run``: run configuration parser and dataset generation without benchmarks running.
-* ``--verbose``: *WARNING*, *INFO*, *DEBUG*. print additional information during benchmarks running. Default is *INFO*.
+
+- ``--configs``: specify the path to a configuration file.
+- ``--no-intel-optimized``: use Scikit-learn without [Intel(R) Extension for Scikit-learn*](#intelr-extension-for-scikit-learn-support). Now available for [scikit-learn benchmarks](https://github.com/IntelPython/scikit-learn_bench/tree/master/sklearn_bench). By default, the runner uses Intel(R) Extension for Scikit-learn.
+- ``--output-file``: specify the name of the output file for the benchmark result. The default name is `result.json`
+- ``--report``: create an Excel report based on benchmark results. The `openpyxl` library is required.
+- ``--dummy-run``: run configuration parser and dataset generation without benchmarks running.
+- ``--verbose``: *WARNING*, *INFO*, *DEBUG*. Print out additional information when the benchmarks are running. The default is *INFO*.
 
 | Level | Description |
 |-----------|---------------|
@@ -84,10 +85,11 @@ Options:
 | *WARNING* | An indication that something unexpected happened, or indicative of some problem in the near future (e.g. ‘disk space low’). The software is still working as expected. |
 
 Benchmarks currently support the following frameworks:
-* **scikit-learn**
-* **daal4py**
-* **cuml**
-* **xgboost**
+
+- **scikit-learn**
+- **daal4py**
+- **cuml**
+- **xgboost**
 
 The configuration of benchmarks allows you to select the frameworks to run, select datasets for measurements and configure the parameters of the algorithms.
 
@@ -117,27 +119,32 @@ The configuration of benchmarks allows you to select the frameworks to run, sele
 
 When you run scikit-learn benchmarks on CPU, [Intel(R) Extension for Scikit-learn](https://github.com/intel/scikit-learn-intelex) is used by default. Use the ``--no-intel-optimized`` option to run the benchmarks without the extension.
 
 The following benchmarks have a GPU support:
-* dbscan
-* kmeans
-* linear
-* log_reg
+
+- dbscan
+- kmeans
+- linear
+- log_reg
 
 You may use the [configuration file for these benchmarks](https://github.com/IntelPython/scikit-learn_bench/blob/master/configs/skl_xpu_config.json) to run them on both CPU and GPU.
 
-## Algorithms parameters
+## Algorithms parameters
 
 You can launch benchmarks for each algorithm separately.
 To do this, go to the directory with the benchmark:
 
-    cd <framework>
+```bash
+cd <framework>
+```
 
 Run the following command:
 
-    python <benchmark_file> --dataset-name <path to the dataset> <other algorithm parameters>
+```bash
+python <benchmark_file> --dataset-name <path to the dataset> <other algorithm parameters>
+```
 
 The list of supported parameters for each algorithm you can find here:
 
-* [**scikit-learn**](sklearn_bench#algorithms-parameters)
-* [**daal4py**](daal4py_bench#algorithms-parameters)
-* [**cuml**](cuml_bench#algorithms-parameters)
-* [**xgboost**](xgboost_bench#algorithms-parameters)
+- [**scikit-learn**](sklearn_bench#algorithms-parameters)
+- [**daal4py**](daal4py_bench#algorithms-parameters)
+- [**cuml**](cuml_bench#algorithms-parameters)
+- [**xgboost**](xgboost_bench#algorithms-parameters)
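The runner options documented in this hunk map naturally onto an `argparse` parser. The sketch below is hypothetical — the real `runner.py` may declare its flags differently — but it mirrors the documented names and defaults:

```python
import argparse


def make_runner_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch of the documented runner flags; not the
    # actual runner.py definition.
    parser = argparse.ArgumentParser(description='scikit-learn_bench runner (sketch)')
    parser.add_argument('--configs', required=True,
                        help='path to a configuration file')
    parser.add_argument('--no-intel-optimized', action='store_true',
                        help='run scikit-learn without Intel(R) Extension for Scikit-learn')
    parser.add_argument('--output-file', default='result.json',
                        help='output file name for the benchmark result')
    parser.add_argument('--report', action='store_true',
                        help='create an Excel report (requires openpyxl)')
    parser.add_argument('--dummy-run', action='store_true',
                        help='parse configs and generate datasets without running benchmarks')
    parser.add_argument('--verbose', default='INFO',
                        choices=('WARNING', 'INFO', 'DEBUG'),
                        help='logging verbosity')
    return parser


# Example invocation matching the README's command line.
args = make_runner_parser().parse_args(
    ['--configs', 'configs/config_example.json', '--report'])
```

Note how the unset flags keep their documented defaults, e.g. `args.output_file` stays `'result.json'` and `args.verbose` stays `'INFO'`.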

azure-pipelines.yml

Lines changed: 2 additions & 2 deletions

@@ -33,7 +33,7 @@ jobs:
   steps:
   - script: |
       conda update -y -q conda
-      conda create -n bench -q -y -c conda-forge python=3.7 pandas scikit-learn daal4py
+      conda create -n bench -q -y -c conda-forge python=3.7 pandas scikit-learn daal4py tqdm
     displayName: Create Anaconda environment
   - script: |
       . /usr/share/miniconda/etc/profile.d/conda.sh
@@ -46,7 +46,7 @@ jobs:
   steps:
   - script: |
       conda update -y -q conda
-      conda create -n bench -q -y -c conda-forge python=3.7 pandas xgboost scikit-learn daal4py
+      conda create -n bench -q -y -c conda-forge python=3.7 pandas xgboost scikit-learn daal4py tqdm
     displayName: Create Anaconda environment
   - script: |
       . /usr/share/miniconda/etc/profile.d/conda.sh

bench.py

Lines changed: 13 additions & 11 deletions

@@ -16,6 +16,7 @@
 
 import argparse
 import json
+import logging
 import sys
 import timeit
 
@@ -200,15 +201,16 @@ def parse_args(parser, size=None, loop_types=(),
         from sklearnex import patch_sklearn
         patch_sklearn()
     except ImportError:
-        print('Failed to import sklearnex.patch_sklearn.'
-              'Use stock version scikit-learn', file=sys.stderr)
+        logging.info('Failed to import sklearnex.patch_sklearn.'
+                     'Use stock version scikit-learn', file=sys.stderr)
         params.device = 'None'
     else:
         if params.device != 'None':
-            print('Device context is not supported for stock scikit-learn.'
-                  'Please use --no-intel-optimized=False with'
-                  f'--device={params.device} parameter. Fallback to --device=None.',
-                  file=sys.stderr)
+            logging.info(
+                'Device context is not supported for stock scikit-learn.'
+                'Please use --no-intel-optimized=False with'
+                f'--device={params.device} parameter. Fallback to --device=None.',
+                file=sys.stderr)
             params.device = 'None'
 
     # disable finiteness check (default)
@@ -218,7 +220,7 @@ def parse_args(parser, size=None, loop_types=(),
     # Ask DAAL what it thinks about this number of threads
     num_threads = prepare_daal_threads(num_threads=params.threads)
     if params.verbose:
-        print(f'@ DAAL gave us {num_threads} threads')
+        logging.info(f'@ DAAL gave us {num_threads} threads')
 
     n_jobs = None
     if n_jobs_supported:
@@ -234,7 +236,7 @@ def parse_args(parser, size=None, loop_types=(),
 
     # Very verbose output
     if params.verbose:
-        print(f'@ params = {params.__dict__}')
+        logging.info(f'@ params = {params.__dict__}')
 
     return params
 
@@ -249,8 +251,8 @@ def set_daal_num_threads(num_threads):
         if num_threads:
             daal4py.daalinit(nthreads=num_threads)
     except ImportError:
-        print('@ Package "daal4py" was not found. Number of threads '
-              'is being ignored')
+        logging.info('@ Package "daal4py" was not found. Number of threads '
+                     'is being ignored')
 
 
 def prepare_daal_threads(num_threads=-1):
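The hunks above replace `print(..., file=sys.stderr)` with `logging.info(...)`. One caveat when adapting this pattern: unlike `print`, `logging.info` accepts no `file=` keyword, so routing output to stderr is done through a handler instead. A minimal sketch of the intended pattern (logger name and format are illustrative):

```python
import logging
import sys

# The replaced print(..., file=sys.stderr) calls wrote to stderr.
# logging.info takes no file= keyword, so direct output via a handler.
logger = logging.getLogger('bench_sketch')
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))
logger.addHandler(handler)

logger.info('Failed to import sklearnex.patch_sklearn. '
            'Using the stock version of scikit-learn')
```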
@@ -417,7 +419,7 @@ def load_data(params, generated_data=[], add_dtype=False, label_2d=False,
     # load and convert data from npy/csv file if path is specified
     if param_vars[file_arg] is not None:
         if param_vars[file_arg].name.endswith('.npy'):
-            data = np.load(param_vars[file_arg].name)
+            data = np.load(param_vars[file_arg].name, allow_pickle=True)
         else:
             data = read_csv(param_vars[file_arg].name, params)
         full_data[element] = convert_data(
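The last hunk adds `allow_pickle=True` to `np.load`. The flag matters because object-dtype arrays are serialized via pickle inside the `.npy` container, and since NumPy 1.16.3 loading them is refused by default. A small self-contained sketch of the behavior:

```python
import os
import tempfile

import numpy as np

# An object-dtype array, stored via pickle inside the .npy container.
arr = np.array([{'n_samples': 1000}, {'n_samples': 2000}], dtype=object)
path = os.path.join(tempfile.mkdtemp(), 'data.npy')
np.save(path, arr)

refused = False
try:
    np.load(path)  # default allow_pickle=False refuses object arrays
except ValueError:
    refused = True

# Explicit opt-in loads the array back intact.
data = np.load(path, allow_pickle=True)
```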

configs/README.md

Lines changed: 32 additions & 31 deletions

@@ -1,4 +1,4 @@
-## Config JSON Schema
+# Config JSON Schema
 
 Configure benchmarks by editing the `config.json` file.
 You can configure some algorithm parameters, datasets, a list of frameworks to use, and the usage of some environment variables.
@@ -11,58 +11,59 @@ Refer to the tables below for descriptions of all fields in the configuration fi
 - [Training Object](#training-object)
 - [Testing Object](#testing-object)
 
-### Root Config Object
+## Root Config Object
+
 | Field Name | Type | Description |
 | ----- | ---- |------------ |
-|omp_env| array[string] | For xgboost only. Specify an environment variable to set the number of omp threads |
 |common| [Common Object](#common-object)| **REQUIRED** common benchmarks setting: frameworks and input data settings |
-|cases| array[[Case Object](#case-object)] | **REQUIRED** list of algorithms, their parameters and training data |
+|cases| List[[Case Object](#case-object)] | **REQUIRED** list of algorithms, their parameters and training data |
 
-### Common Object
+## Common Object
 
 | Field Name | Type | Description |
 | ----- | ---- |------------ |
-|lib| array[string] | **REQUIRED** list of test frameworks. It can be *sklearn*, *daal4py*, *cuml* or *xgboost* |
-|data-format| array[string] | **REQUIRED** input data format. Data formats: *numpy*, *pandas* or *cudf* |
-|data-order| array[string] | **REQUIRED** input data order. Data order: *C* (row-major, default) or *F* (column-major) |
-|dtype| array[string] | **REQUIRED** input data type. Data type: *float64* (default) or *float32* |
-|check-finitness| array[] | Check finiteness in sklearn input check(disabled by default) |
-|device| array[string] | For scikit-learn only. The list of devices to run the benchmarks on. It can be *None* (default, run on CPU without sycl context) or one of the types of sycl devices: *cpu*, *gpu*, *host*. Refer to [SYCL specification](https://www.khronos.org/files/sycl/sycl-2020-reference-guide.pdf) for details|
+|data-format| Union[str, List[str]] | **REQUIRED** Input data format: *numpy*, *pandas*, or *cudf*. |
+|data-order| Union[str, List[str]] | **REQUIRED** Input data order: *C* (row-major, default) or *F* (column-major). |
+|dtype| Union[str, List[str]] | **REQUIRED** Input data type: *float64* (default) or *float32*. |
+|check-finitness| List[] | Check finiteness during scikit-learn input check (disabled by default). |
+|device| array[string] | For scikit-learn only. The list of devices to run the benchmarks on.<br/>It can be *None* (default, run on CPU without sycl context) or one of the types of sycl devices: *cpu*, *gpu*, *host*.<br/>Refer to [SYCL specification](https://www.khronos.org/files/sycl/sycl-2020-reference-guide.pdf) for details.|
 
-### Case Object
+## Case Object
 
 | Field Name | Type | Description |
 | ----- | ---- |------------ |
-|lib| array[string] | **REQUIRED** list of test frameworks. It can be *sklearn*, *daal4py*, *cuml* or *xgboost*|
-|algorithm| string | **REQUIRED** benchmark name |
-|dataset| array[[Dataset Object](#dataset-object)] | **REQUIRED** input data specifications. |
-|benchmark parameters| array[Any] | **REQUIRED** algorithm parameters. a list of supported parameters can be found here |
+|lib| Union[str, List[str]] | **REQUIRED** A test framework or a list of frameworks. Must be from [*sklearn*, *daal4py*, *cuml*, *xgboost*]. |
+|algorithm| string | **REQUIRED** Benchmark file name. |
+|dataset| List[[Dataset Object](#dataset-object)] | **REQUIRED** Input data specifications. |
+|**specific algorithm parameters**| Union[int, float, str, List[int], List[float], List[str]] | Other algorithm-specific parameters |
+
+**Important:** You can move any parameter from **"cases"** to **"common"** if this parameter is common to all cases
 
-### Dataset Object
+## Dataset Object
 
 | Field Name | Type | Description |
 | ----- | ---- |------------ |
-|source| string | **REQUIRED** data source. It can be *synthetic* or *csv* |
-|type| string | **REQUIRED** for synthetic data only. The type of task for which the dataset is generated. It can be *classification*, *blobs* or *regression* |
+|source| string | **REQUIRED** Data source: *synthetic*, *csv*, or *npy*. |
+|type| string | **REQUIRED for synthetic data**. The type of task for which the dataset is generated: *classification*, *blobs*, or *regression*. |
 |n_classes| int | For *synthetic* data and for *classification* type only. The number of classes (or labels) of the classification problem |
 |n_clusters| int | For *synthetic* data and for *blobs* type only. The number of centers to generate |
-|n_features| int | **REQUIRED** For *synthetic* data only. The number of features to generate |
-|name| string | Name of dataset |
-|training| [Training Object](#training-object) | **REQUIRED** algorithm parameters. a list of supported parameters can be found here |
-|testing| [Testing Object](#testing-object) | **REQUIRED** algorithm parameters. a list of supported parameters can be found here |
+|n_features| int | **REQUIRED for *synthetic* data**. The number of features to generate. |
+|name| string | Name of the dataset. |
+|training| [Training Object](#training-object) | **REQUIRED** An object with the paths to the training datasets. |
+|testing| [Testing Object](#testing-object) | An object with the paths to the testing datasets. If not provided, the training datasets are used. |
 
-### Training Object
+## Training Object
 
 | Field Name | Type | Description |
 | ----- | ---- |------------ |
-| n_samples | int | The total number of the training points |
-| x | str | The path to the training samples |
-| y | str | The path to the training labels |
+| n_samples | int | **REQUIRED** The total number of the training samples |
+| x | str | **REQUIRED** The path to the training samples |
+| y | str | **REQUIRED** The path to the training labels |
 
-### Testing Object
+## Testing Object
 
 | Field Name | Type | Description |
 | ----- | ---- |------------ |
-| n_samples | int | The total number of the testing points |
-| x | str | The path to the testing samples |
-| y | str | The path to the testing labels |
+| n_samples | int | **REQUIRED** The total number of the testing samples |
+| x | str | **REQUIRED** The path to the testing samples |
+| y | str | **REQUIRED** The path to the testing labels |
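Read together, the updated tables describe a nested JSON document. As a rough sketch, a config assembled from them might be built and serialized like this (field values — paths, names, sizes — are placeholders, not a tested setup, and the exact keys a given benchmark accepts may differ):

```python
import json

# Illustrative config following the schema tables above.
# All concrete values here are hypothetical placeholders.
config = {
    'common': {
        'data-format': 'pandas',       # numpy | pandas | cudf
        'data-order': 'F',             # C | F
        'dtype': 'float64',            # float64 | float32
    },
    'cases': [
        {
            'lib': 'sklearn',          # sklearn | daal4py | cuml | xgboost
            'algorithm': 'kmeans',     # benchmark file name
            'n-clusters': 10,          # an algorithm-specific parameter
            'dataset': [
                {
                    'source': 'csv',   # synthetic | csv | npy
                    'name': 'example',
                    'training': {
                        'n_samples': 100000,
                        'x': 'data/example_x_train.csv',
                        'y': 'data/example_y_train.csv',
                    },
                }
            ],
        }
    ],
}

config_json = json.dumps(config, indent=4)
```

Per the **Important** note above, parameters shared by all cases (here, for example, `lib`) could be hoisted into `"common"` instead.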
