Xgb datasets adding #60

Merged: 39 commits, Apr 26, 2021
- 62f87c3: Applied mypy + flake8 for all files (Mar 22, 2021)
- 132d73f: Sorted imports with ISort (Mar 22, 2021)
- 4aa4898: Moved env change to runner (Mar 22, 2021)
- 5a8db33: fixed all mypy errors and added mypy check to CI (Mar 22, 2021)
- 5594efd: Yet another mypy fixes (Mar 22, 2021)
- 35b55b8: Small runner refactoring (Mar 23, 2021)
- 56de8f7: First attempt of adding nvidia datasets (Mar 29, 2021)
- 0ee5f05: Merge branch 'master' into mypy-applying (Mar 29, 2021)
- 04e7a64: removed E265 ignoring for flake8 job (Mar 29, 2021)
- 8268747: Merge remote-tracking branch 'my/mypy-applying' into xgb-nvidia-datasets (Mar 30, 2021)
- b6a7eb0: NVidia benchmarks are working now (Mar 30, 2021)
- 7e780bb: Added higgs, msrank and airline fetching (Mar 30, 2021)
- 670c289: small fixes of env (Mar 30, 2021)
- dc0e9c9: Applying comments (Apr 1, 2021)
- f64ae68: Merge branch 'mypy-applying' into xgb-nvidia-datasets (Apr 1, 2021)
- 873754b: Split dataset loading to different files (Apr 1, 2021)
- 93ea32d: Merge remote-tracking branch 'origin/master' into xgb-nvidia-datasets (Apr 1, 2021)
- dcfc5b9: Why doesnt mypy work? (Apr 1, 2021)
- 340402e: Added abalone + letters, updated all GB configs (Apr 15, 2021)
- 6e47423: Added links and descriptions for new datasets (Apr 15, 2021)
- 340a628: Merge remote-tracking branch 'origin/master' into xgb-nvidia-datasets (Apr 15, 2021)
- 4be3720: handling mypy (Apr 15, 2021)
- 8184016: Handled skex fake message throwing (Apr 15, 2021)
- cf5ee76: Trying to handle mypy, at. 3 (Apr 15, 2021)
- 9db3177: Trying to handle mypy, at. 4 (Apr 15, 2021)
- 5e76a0b: Trying to handle mypy, at. 5 (Apr 15, 2021)
- 13fcd20: Changed configs readme and made small fixes in GB testing configs (Apr 20, 2021)
- 0873f97: Merge branch 'master' of https://github.com/IntelPython/scikit-learn_… (Apr 20, 2021)
- 877e0fd: Applying more comments, updating readme's (Apr 20, 2021)
- 8bdc7f2: Applying comments: renamed configs (Apr 20, 2021)
- f9cf09b: Changed all datasets to npy, applied Kirill's comments (Apr 23, 2021)
- 41e003f: Merge branch 'master' of https://github.com/IntelPython/scikit-learn_… (Apr 23, 2021)
- 523df30: Cleanup after someone's commit (Apr 23, 2021)
- 59303fa: Applying mypy (Apr 23, 2021)
- b56e42c: Applied Ekaterina's suggestions (Apr 23, 2021)
- ad176e5: Applied other Ekaterina's comments (Apr 23, 2021)
- b92a27f: Merge branch 'xgb-nvidia-datasets' of https://github.com/RukhovichIV/… (Apr 23, 2021)
- 11a8ffc: Final commits applying (Apr 26, 2021)
- 37d5461: Alexander's final comments (Apr 26, 2021)
75 changes: 41 additions & 34 deletions README.md
@@ -27,55 +27,56 @@ We publish blogs on Medium, so [follow us](https://medium.com/intel-analytics-so

## Table of contents

* [How to create conda environment for benchmarking](#how-to-create-conda-environment-for-benchmarking)
* [Running Python benchmarks with runner script](#running-python-benchmarks-with-runner-script)
* [Benchmark supported algorithms](#benchmark-supported-algorithms)
* [Intel(R) Extension for Scikit-learn* support](#intelr-extension-for-scikit-learn-support)
* [Algorithms parameters](#algorithms-parameters)
- [How to create conda environment for benchmarking](#how-to-create-conda-environment-for-benchmarking)
- [Running Python benchmarks with runner script](#running-python-benchmarks-with-runner-script)
- [Benchmark supported algorithms](#benchmark-supported-algorithms)
- [Intel(R) Extension for Scikit-learn* support](#intelr-extension-for-scikit-learn-support)
- [Algorithms parameters](#algorithms-parameters)

## How to create conda environment for benchmarking

Create a suitable conda environment for each framework to test. Each item in the list below links to instructions to create an appropriate conda environment for the framework.

* [**scikit-learn**](sklearn_bench#how-to-create-conda-environment-for-benchmarking)
- [**scikit-learn**](sklearn_bench#how-to-create-conda-environment-for-benchmarking)

```bash
pip install -r sklearn_bench/requirements.txt
# or
conda install -c intel scikit-learn scikit-learn-intelex pandas
conda install -c intel scikit-learn scikit-learn-intelex pandas tqdm
```

* [**daal4py**](daal4py_bench#how-to-create-conda-environment-for-benchmarking)
- [**daal4py**](daal4py_bench#how-to-create-conda-environment-for-benchmarking)

```bash
conda install -c conda-forge scikit-learn daal4py pandas
conda install -c conda-forge scikit-learn daal4py pandas tqdm
```

* [**cuml**](cuml_bench#how-to-create-conda-environment-for-benchmarking)
- [**cuml**](cuml_bench#how-to-create-conda-environment-for-benchmarking)

```bash
conda install -c rapidsai -c conda-forge cuml pandas cudf
conda install -c rapidsai -c conda-forge cuml pandas cudf tqdm
```

* [**xgboost**](xgboost_bench#how-to-create-conda-environment-for-benchmarking)
- [**xgboost**](xgboost_bench#how-to-create-conda-environment-for-benchmarking)

```bash
pip install -r xgboost_bench/requirements.txt
# or
conda install -c conda-forge xgboost pandas
conda install -c conda-forge xgboost scikit-learn pandas tqdm
```

## Running Python benchmarks with runner script

Run `python runner.py --configs configs/config_example.json [--output-file result.json --verbose INFO --report]` to launch benchmarks.

Options:
* ``--configs``: specify the path to a configuration file.
* ``--no-intel-optimized``: use Scikit-learn without [Intel(R) Extension for Scikit-learn*](#intelr-extension-for-scikit-learn-support). Now available for [scikit-learn benchmarks](https://github.com/IntelPython/scikit-learn_bench/tree/master/sklearn_bench). By default, the runner uses Intel(R) Extension for Scikit-learn.
* ``--output-file``: output file name for the benchmark result. The default name is `result.json`
* ``--report``: create an Excel report based on benchmark results. The `openpyxl` library is required.
* ``--dummy-run``: run configuration parser and dataset generation without benchmarks running.
* ``--verbose``: *WARNING*, *INFO*, *DEBUG*. print additional information during benchmarks running. Default is *INFO*.

- ``--configs``: specify the path to a configuration file.
- ``--no-intel-optimized``: use Scikit-learn without [Intel(R) Extension for Scikit-learn*](#intelr-extension-for-scikit-learn-support). Now available for [scikit-learn benchmarks](https://github.com/IntelPython/scikit-learn_bench/tree/master/sklearn_bench). By default, the runner uses Intel(R) Extension for Scikit-learn.
- ``--output-file``: specify the name of the output file for the benchmark result. The default name is `result.json`.
- ``--report``: create an Excel report based on benchmark results. The `openpyxl` library is required.
- ``--dummy-run``: run configuration parser and dataset generation without benchmarks running.
- ``--verbose``: *WARNING*, *INFO*, *DEBUG*. Print out additional information when the benchmarks are running. The default is *INFO*.

| Level | Description |
|-----------|---------------|
@@ -84,10 +85,11 @@ Options:
| *WARNING* | An indication that something unexpected happened, or indicative of some problem in the near future (e.g. ‘disk space low’). The software is still working as expected. |

Benchmarks currently support the following frameworks:
* **scikit-learn**
* **daal4py**
* **cuml**
* **xgboost**

- **scikit-learn**
- **daal4py**
- **cuml**
- **xgboost**

The configuration of benchmarks allows you to select the frameworks to run, select datasets for measurements and configure the parameters of the algorithms.

@@ -117,27 +119,32 @@ The configuration of benchmarks allows you to select the frameworks to run, sele
When you run scikit-learn benchmarks on CPU, [Intel(R) Extension for Scikit-learn](https://github.com/intel/scikit-learn-intelex) is used by default. Use the ``--no-intel-optimized`` option to run the benchmarks without the extension.

The following benchmarks have GPU support:
* dbscan
* kmeans
* linear
* log_reg

- dbscan
- kmeans
- linear
- log_reg

You may use the [configuration file for these benchmarks](https://github.com/IntelPython/scikit-learn_bench/blob/master/configs/skl_xpu_config.json) to run them on both CPU and GPU.

## Algorithms parameters
## Algorithms parameters

You can launch benchmarks for each algorithm separately.
To do this, go to the directory with the benchmark:

cd <framework>
```bash
cd <framework>
```

Run the following command:

python <benchmark_file> --dataset-name <path to the dataset> <other algorithm parameters>
```bash
python <benchmark_file> --dataset-name <path to the dataset> <other algorithm parameters>
```
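If you do not yet have a dataset on disk, you can create one in the `.npy` format that the benchmarks load. A minimal sketch, where the file names, shapes, and dtype are illustrative only and not prescribed by the benchmark suite:

```python
import numpy as np

# Build a small synthetic binary-classification set and store it as .npy
# files; these paths could then be passed as the dataset arguments of a
# benchmark. Names and shapes here are just an example.
rng = np.random.default_rng(seed=0)
x_train = rng.standard_normal((1000, 16)).astype(np.float32)
y_train = (x_train.sum(axis=1) > 0).astype(np.int64)

np.save("x_train.npy", x_train)
np.save("y_train.npy", y_train)
```

The saved arrays round-trip through `np.load`, which is how `bench.py` reads `.npy` inputs.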

You can find the list of supported parameters for each algorithm here:

* [**scikit-learn**](sklearn_bench#algorithms-parameters)
* [**daal4py**](daal4py_bench#algorithms-parameters)
* [**cuml**](cuml_bench#algorithms-parameters)
* [**xgboost**](xgboost_bench#algorithms-parameters)
- [**scikit-learn**](sklearn_bench#algorithms-parameters)
- [**daal4py**](daal4py_bench#algorithms-parameters)
- [**cuml**](cuml_bench#algorithms-parameters)
- [**xgboost**](xgboost_bench#algorithms-parameters)
4 changes: 2 additions & 2 deletions azure-pipelines.yml
@@ -33,7 +33,7 @@ jobs:
steps:
- script: |
conda update -y -q conda
conda create -n bench -q -y -c conda-forge python=3.7 pandas scikit-learn daal4py
conda create -n bench -q -y -c conda-forge python=3.7 pandas scikit-learn daal4py tqdm
displayName: Create Anaconda environment
- script: |
. /usr/share/miniconda/etc/profile.d/conda.sh
@@ -46,7 +46,7 @@
steps:
- script: |
conda update -y -q conda
conda create -n bench -q -y -c conda-forge python=3.7 pandas xgboost scikit-learn daal4py
conda create -n bench -q -y -c conda-forge python=3.7 pandas xgboost scikit-learn daal4py tqdm
displayName: Create Anaconda environment
- script: |
. /usr/share/miniconda/etc/profile.d/conda.sh
24 changes: 13 additions & 11 deletions bench.py
@@ -16,6 +16,7 @@

import argparse
import json
import logging
import sys
import timeit

@@ -200,15 +201,16 @@ def parse_args(parser, size=None, loop_types=(),
from sklearnex import patch_sklearn
patch_sklearn()
except ImportError:
print('Failed to import sklearnex.patch_sklearn.'
'Use stock version scikit-learn', file=sys.stderr)
logging.info('Failed to import sklearnex.patch_sklearn.'
'Use stock version scikit-learn', file=sys.stderr)
params.device = 'None'
else:
if params.device != 'None':
print('Device context is not supported for stock scikit-learn.'
'Please use --no-intel-optimized=False with'
f'--device={params.device} parameter. Fallback to --device=None.',
file=sys.stderr)
logging.info(
'Device context is not supported for stock scikit-learn.'
'Please use --no-intel-optimized=False with'
f'--device={params.device} parameter. Fallback to --device=None.',
file=sys.stderr)
params.device = 'None'

# disable finiteness check (default)
@@ -218,7 +220,7 @@
# Ask DAAL what it thinks about this number of threads
num_threads = prepare_daal_threads(num_threads=params.threads)
if params.verbose:
print(f'@ DAAL gave us {num_threads} threads')
logging.info(f'@ DAAL gave us {num_threads} threads')

n_jobs = None
if n_jobs_supported:
@@ -234,7 +236,7 @@

# Very verbose output
if params.verbose:
print(f'@ params = {params.__dict__}')
logging.info(f'@ params = {params.__dict__}')

return params

@@ -249,8 +251,8 @@ def set_daal_num_threads(num_threads):
if num_threads:
daal4py.daalinit(nthreads=num_threads)
except ImportError:
print('@ Package "daal4py" was not found. Number of threads '
'is being ignored')
logging.info('@ Package "daal4py" was not found. Number of threads '
'is being ignored')


def prepare_daal_threads(num_threads=-1):
@@ -417,7 +419,7 @@ def load_data(params, generated_data=[], add_dtype=False, label_2d=False,
# load and convert data from npy/csv file if path is specified
if param_vars[file_arg] is not None:
if param_vars[file_arg].name.endswith('.npy'):
data = np.load(param_vars[file_arg].name)
data = np.load(param_vars[file_arg].name, allow_pickle=True)
else:
data = read_csv(param_vars[file_arg].name, params)
full_data[element] = convert_data(
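The optional-dependency fallback that this part of the diff exercises can be summarized with a short, self-contained sketch. This mirrors the pattern in `bench.py` rather than reproducing its exact code, and the message text is illustrative:

```python
import logging

# Try to enable Intel(R) Extension for Scikit-learn; fall back to stock
# scikit-learn when the optional sklearnex package is not installed.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()  # subsequent sklearn imports resolve to patched classes
    patched = True
except ImportError:
    logging.info('Failed to import sklearnex.patch_sklearn. '
                 'Using stock scikit-learn')
    patched = False
```

Note that, unlike `print`, `logging.info` accepts no `file=` keyword; where the output goes is decided by the logging configuration (handlers), not by the call site.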
63 changes: 32 additions & 31 deletions configs/README.md
@@ -1,4 +1,4 @@
## Config JSON Schema
# Config JSON Schema

Configure benchmarks by editing the `config.json` file.
You can configure some algorithm parameters, datasets, a list of frameworks to use, and the usage of some environment variables.
@@ -11,58 +11,59 @@ Refer to the tables below for descriptions of all fields in the configuration fi
- [Training Object](#training-object)
- [Testing Object](#testing-object)

### Root Config Object
## Root Config Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
|omp_env| array[string] | For xgboost only. Specify an environment variable to set the number of omp threads |
|common| [Common Object](#common-object)| **REQUIRED** common benchmarks setting: frameworks and input data settings |
|cases| array[[Case Object](#case-object)] | **REQUIRED** list of algorithms, their parameters and training data |
|cases| List[[Case Object](#case-object)] | **REQUIRED** list of algorithms, their parameters and training data |

### Common Object
## Common Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
|lib| array[string] | **REQUIRED** list of test frameworks. It can be *sklearn*, *daal4py*, *cuml* or *xgboost* |
|data-format| array[string] | **REQUIRED** input data format. Data formats: *numpy*, *pandas* or *cudf* |
|data-order| array[string] | **REQUIRED** input data order. Data order: *C* (row-major, default) or *F* (column-major) |
|dtype| array[string] | **REQUIRED** input data type. Data type: *float64* (default) or *float32* |
|check-finitness| array[] | Check finiteness in sklearn input check(disabled by default) |
|device| array[string] | For scikit-learn only. The list of devices to run the benchmarks on. It can be *None* (default, run on CPU without sycl context) or one of the types of sycl devices: *cpu*, *gpu*, *host*. Refer to [SYCL specification](https://www.khronos.org/files/sycl/sycl-2020-reference-guide.pdf) for details|
|data-format| Union[str, List[str]] | **REQUIRED** Input data format: *numpy*, *pandas*, or *cudf*. |
|data-order| Union[str, List[str]] | **REQUIRED** Input data order: *C* (row-major, default) or *F* (column-major). |
|dtype| Union[str, List[str]] | **REQUIRED** Input data type: *float64* (default) or *float32*. |
|check-finitness| List[] | Check finiteness during scikit-learn input check (disabled by default). |
|device| array[string] | For scikit-learn only. The list of devices to run the benchmarks on.<br/>It can be *None* (default, run on CPU without sycl context) or one of the types of sycl devices: *cpu*, *gpu*, *host*.<br/>Refer to [SYCL specification](https://www.khronos.org/files/sycl/sycl-2020-reference-guide.pdf) for details.|

### Case Object
## Case Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
|lib| array[string] | **REQUIRED** list of test frameworks. It can be *sklearn*, *daal4py*, *cuml* or *xgboost*|
|algorithm| string | **REQUIRED** benchmark name |
|dataset| array[[Dataset Object](#dataset-object)] | **REQUIRED** input data specifications. |
|benchmark parameters| array[Any] | **REQUIRED** algorithm parameters. a list of supported parameters can be found here |
|lib| Union[str, List[str]] | **REQUIRED** A test framework or a list of frameworks. Must be from [*sklearn*, *daal4py*, *cuml*, *xgboost*]. |
|algorithm| string | **REQUIRED** Benchmark file name. |
|dataset| List[[Dataset Object](#dataset-object)] | **REQUIRED** Input data specifications. |
|**specific algorithm parameters**| Union[int, float, str, List[int], List[float], List[str]] | Other algorithm-specific parameters |

**Important:** You can move any parameter from **"cases"** to **"common"** if this parameter is common to all cases

### Dataset Object
## Dataset Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
|source| string | **REQUIRED** data source. It can be *synthetic* or *csv* |
|type| string | **REQUIRED** for synthetic data only. The type of task for which the dataset is generated. It can be *classification*, *blobs* or *regression* |
|source| string | **REQUIRED** Data source: *synthetic*, *csv*, or *npy*. |
|type| string | **REQUIRED for synthetic data**. The type of task for which the dataset is generated: *classification*, *blobs*, or *regression*. |
|n_classes| int | For *synthetic* data and for *classification* type only. The number of classes (or labels) of the classification problem |
|n_clusters| int | For *synthetic* data and for *blobs* type only. The number of centers to generate |
|n_features| int | **REQUIRED** For *synthetic* data only. The number of features to generate |
|name| string | Name of dataset |
|training| [Training Object](#training-object) | **REQUIRED** algorithm parameters. a list of supported parameters can be found here |
|testing| [Testing Object](#testing-object) | **REQUIRED** algorithm parameters. a list of supported parameters can be found here |
|n_features| int | **REQUIRED for *synthetic* data**. The number of features to generate. |
|name| string | Name of the dataset. |
|training| [Training Object](#training-object) | **REQUIRED** An object with the paths to the training datasets. |
|testing| [Testing Object](#testing-object) | An object with the paths to the testing datasets. If not provided, the training datasets are used. |

### Training Object
## Training Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
| n_samples | int | The total number of the training points |
| x | str | The path to the training samples |
| y | str | The path to the training labels |
| n_samples | int | **REQUIRED** The total number of the training samples |
| x | str | **REQUIRED** The path to the training samples |
| y | str | **REQUIRED** The path to the training labels |

### Testing Object
## Testing Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
| n_samples | int | The total number of the testing points |
| x | str | The path to the testing samples |
| y | str | The path to the testing labels |
| n_samples | int | **REQUIRED** The total number of the testing samples |
| x | str | **REQUIRED** The path to the testing samples |
| y | str | **REQUIRED** The path to the testing labels |
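Putting the objects described in the tables above together, a minimal configuration could look like the following sketch. All dataset names, file paths, and parameter values here are illustrative, not taken from the repository's shipped configs:

```json
{
  "common": {
    "lib": ["xgboost"],
    "data-format": ["pandas"],
    "data-order": ["F"],
    "dtype": ["float32"]
  },
  "cases": [
    {
      "algorithm": "gbt",
      "dataset": [
        {
          "source": "npy",
          "name": "example",
          "training": {
            "n_samples": 10000,
            "x": "data/example_x_train.npy",
            "y": "data/example_y_train.npy"
          },
          "testing": {
            "n_samples": 1000,
            "x": "data/example_x_test.npy",
            "y": "data/example_y_test.npy"
          }
        }
      ],
      "n_estimators": [100]
    }
  ]
}
```

Here `n_estimators` stands in for a specific algorithm parameter; per the Case Object table, any such parameter could also be moved to `"common"` when it applies to every case.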