diff --git a/README.md b/README.md index 8099a0511..97bee26e1 100755 --- a/README.md +++ b/README.md @@ -27,42 +27,42 @@ We publish blogs on Medium, so [follow us](https://medium.com/intel-analytics-so ## Table of content -* [How to create conda environment for benchmarking](#how-to-create-conda-environment-for-benchmarking) -* [Running Python benchmarks with runner script](#running-python-benchmarks-with-runner-script) -* [Benchmark supported algorithms](#benchmark-supported-algorithms) -* [Intel(R) Extension for Scikit-learn* support](#intelr-extension-for-scikit-learn-support) -* [Algorithms parameters](#algorithms-parameters) +- [How to create conda environment for benchmarking](#how-to-create-conda-environment-for-benchmarking) +- [Running Python benchmarks with runner script](#running-python-benchmarks-with-runner-script) +- [Benchmark supported algorithms](#benchmark-supported-algorithms) +- [Intel(R) Extension for Scikit-learn* support](#intelr-extension-for-scikit-learn-support) +- [Algorithms parameters](#algorithms-parameters) ## How to create conda environment for benchmarking Create a suitable conda environment for each framework to test. Each item in the list below links to instructions to create an appropriate conda environment for the framework. -* [**scikit-learn**](sklearn_bench#how-to-create-conda-environment-for-benchmarking) +- [**scikit-learn**](sklearn_bench#how-to-create-conda-environment-for-benchmarking) ```bash pip install -r sklearn_bench/requirements.txt # or -conda install -c intel scikit-learn scikit-learn-intelex pandas +conda install -c intel scikit-learn scikit-learn-intelex pandas tqdm ``` -* [**daal4py**](daal4py_bench#how-to-create-conda-environment-for-benchmarking) +- [**daal4py**](daal4py_bench#how-to-create-conda-environment-for-benchmarking) ```bash -conda install -c conda-forge scikit-learn daal4py pandas +conda install -c conda-forge scikit-learn daal4py pandas tqdm ``` -* [**cuml**](cuml_bench#how-to-create-conda-environment-for-benchmarking) +- [**cuml**](cuml_bench#how-to-create-conda-environment-for-benchmarking) ```bash -conda install -c rapidsai -c conda-forge cuml pandas cudf +conda install -c rapidsai -c conda-forge cuml pandas cudf tqdm ``` -* [**xgboost**](xgboost_bench#how-to-create-conda-environment-for-benchmarking) +- [**xgboost**](xgboost_bench#how-to-create-conda-environment-for-benchmarking) ```bash pip install -r xgboost_bench/requirements.txt # or -conda install -c conda-forge xgboost pandas +conda install -c conda-forge xgboost scikit-learn pandas tqdm ``` ## Running Python benchmarks with runner script @@ -70,12 +70,13 @@ conda install -c conda-forge xgboost pandas Run `python runner.py --configs configs/config_example.json [--output-file result.json --verbose INFO --report]` to launch benchmarks. Options: -* ``--configs``: specify the path to a configuration file. -* ``--no-intel-optimized``: use Scikit-learn without [Intel(R) Extension for Scikit-learn*](#intelr-extension-for-scikit-learn-support). Now available for [scikit-learn benchmarks](https://github.com/IntelPython/scikit-learn_bench/tree/master/sklearn_bench). By default, the runner uses Intel(R) Extension for Scikit-learn. -* ``--output-file``: output file name for the benchmark result. The default name is `result.json` -* ``--report``: create an Excel report based on benchmark results. The `openpyxl` library is required. -* ``--dummy-run``: run configuration parser and dataset generation without benchmarks running. -* ``--verbose``: *WARNING*, *INFO*, *DEBUG*. 
print additional information during benchmarks running. Default is *INFO*. + +- ``--configs``: specify the path to a configuration file. +- ``--no-intel-optimized``: use Scikit-learn without [Intel(R) Extension for Scikit-learn*](#intelr-extension-for-scikit-learn-support). Now available for [scikit-learn benchmarks](https://github.com/IntelPython/scikit-learn_bench/tree/master/sklearn_bench). By default, the runner uses Intel(R) Extension for Scikit-learn. +- ``--output-file``: specify the name of the output file for the benchmark result. The default name is `result.json` +- ``--report``: create an Excel report based on benchmark results. The `openpyxl` library is required. +- ``--dummy-run``: run configuration parser and dataset generation without benchmarks running. +- ``--verbose``: *WARNING*, *INFO*, *DEBUG*. Print out additional information when the benchmarks are running. The default is *INFO*. | Level | Description | |-----------|---------------| @@ -84,10 +85,11 @@ Options: | *WARNING* | An indication that something unexpected happened, or indicative of some problem in the near future (e.g. ‘disk space low’). The software is still working as expected. | Benchmarks currently support the following frameworks: -* **scikit-learn** -* **daal4py** -* **cuml** -* **xgboost** + +- **scikit-learn** +- **daal4py** +- **cuml** +- **xgboost** The configuration of benchmarks allows you to select the frameworks to run, select datasets for measurements and configure the parameters of the algorithms. @@ -117,27 +119,32 @@ The configuration of benchmarks allows you to select the frameworks to run, sele When you run scikit-learn benchmarks on CPU, [Intel(R) Extension for Scikit-learn](https://github.com/intel/scikit-learn-intelex) is used by default. Use the ``--no-intel-optimized`` option to run the benchmarks without the extension. The following benchmarks have a GPU support: -* dbscan -* kmeans -* linear -* log_reg + +- dbscan +- kmeans +- linear +- log_reg You may use the [configuration file for these benchmarks](https://github.com/IntelPython/scikit-learn_bench/blob/master/configs/skl_xpu_config.json) to run them on both CPU and GPU. -## Algorithms parameters +## Algorithms parameters You can launch benchmarks for each algorithm separately. To do this, go to the directory with the benchmark: - cd +```bash +cd +``` Run the following command: - python --dataset-name +```bash +python --dataset-name +``` The list of supported parameters for each algorithm you can find here: -* [**scikit-learn**](sklearn_bench#algorithms-parameters) -* [**daal4py**](daal4py_bench#algorithms-parameters) -* [**cuml**](cuml_bench#algorithms-parameters) -* [**xgboost**](xgboost_bench#algorithms-parameters) +- [**scikit-learn**](sklearn_bench#algorithms-parameters) +- [**daal4py**](daal4py_bench#algorithms-parameters) +- [**cuml**](cuml_bench#algorithms-parameters) +- [**xgboost**](xgboost_bench#algorithms-parameters) diff --git a/azure-pipelines.yml b/azure-pipelines.yml index 86db13ef6..34b1efec5 100755 --- a/azure-pipelines.yml +++ b/azure-pipelines.yml @@ -33,7 +33,7 @@ jobs: steps: - script: | conda update -y -q conda - conda create -n bench -q -y -c conda-forge python=3.7 pandas scikit-learn daal4py + conda create -n bench -q -y -c conda-forge python=3.7 pandas scikit-learn daal4py tqdm displayName: Create Anaconda environment - script: | . 
/usr/share/miniconda/etc/profile.d/conda.sh @@ -46,7 +46,7 @@ jobs: steps: - script: | conda update -y -q conda - conda create -n bench -q -y -c conda-forge python=3.7 pandas xgboost scikit-learn daal4py + conda create -n bench -q -y -c conda-forge python=3.7 pandas xgboost scikit-learn daal4py tqdm displayName: Create Anaconda environment - script: | . /usr/share/miniconda/etc/profile.d/conda.sh diff --git a/bench.py b/bench.py index 11b9e3d7c..cd26c166e 100644 --- a/bench.py +++ b/bench.py @@ -16,6 +16,7 @@ import argparse import json +import logging import sys import timeit @@ -200,15 +201,16 @@ def parse_args(parser, size=None, loop_types=(), from sklearnex import patch_sklearn patch_sklearn() except ImportError: - print('Failed to import sklearnex.patch_sklearn.' - 'Use stock version scikit-learn', file=sys.stderr) + logging.info('Failed to import sklearnex.patch_sklearn.' + 'Use stock version scikit-learn', file=sys.stderr) params.device = 'None' else: if params.device != 'None': - print('Device context is not supported for stock scikit-learn.' - 'Please use --no-intel-optimized=False with' - f'--device={params.device} parameter. Fallback to --device=None.', - file=sys.stderr) + logging.info( + 'Device context is not supported for stock scikit-learn.' + 'Please use --no-intel-optimized=False with' + f'--device={params.device} parameter. Fallback to --device=None.', + file=sys.stderr) params.device = 'None' # disable finiteness check (default) @@ -218,7 +220,7 @@ def parse_args(parser, size=None, loop_types=(), # Ask DAAL what it thinks about this number of threads num_threads = prepare_daal_threads(num_threads=params.threads) if params.verbose: - print(f'@ DAAL gave us {num_threads} threads') + logging.info(f'@ DAAL gave us {num_threads} threads') n_jobs = None if n_jobs_supported: @@ -234,7 +236,7 @@ def parse_args(parser, size=None, loop_types=(), # Very verbose output if params.verbose: - print(f'@ params = {params.__dict__}') + logging.info(f'@ params = {params.__dict__}') return params @@ -249,8 +251,8 @@ def set_daal_num_threads(num_threads): if num_threads: daal4py.daalinit(nthreads=num_threads) except ImportError: - print('@ Package "daal4py" was not found. Number of threads ' - 'is being ignored') + logging.info('@ Package "daal4py" was not found. Number of threads ' + 'is being ignored') def prepare_daal_threads(num_threads=-1): @@ -417,7 +419,7 @@ def load_data(params, generated_data=[], add_dtype=False, label_2d=False, # load and convert data from npy/csv file if path is specified if param_vars[file_arg] is not None: if param_vars[file_arg].name.endswith('.npy'): - data = np.load(param_vars[file_arg].name) + data = np.load(param_vars[file_arg].name, allow_pickle=True) else: data = read_csv(param_vars[file_arg].name, params) full_data[element] = convert_data( diff --git a/configs/README.md b/configs/README.md index 44ce2ae21..02dee119b 100644 --- a/configs/README.md +++ b/configs/README.md @@ -1,4 +1,4 @@ -## Config JSON Schema +# Config JSON Schema Configure benchmarks by editing the `config.json` file. You can configure some algorithm parameters, datasets, a list of frameworks to use, and the usage of some environment variables. @@ -11,58 +11,59 @@ Refer to the tables below for descriptions of all fields in the configuration fi - [Training Object](#training-object) - [Testing Object](#testing-object) -### Root Config Object +## Root Config Object + | Field Name | Type | Description | | ----- | ---- |------------ | -|omp_env| array[string] | For xgboost only. 
Specify an environment variable to set the number of omp threads | |common| [Common Object](#common-object)| **REQUIRED** common benchmarks setting: frameworks and input data settings | -|cases| array[[Case Object](#case-object)] | **REQUIRED** list of algorithms, their parameters and training data | +|cases| List[[Case Object](#case-object)] | **REQUIRED** list of algorithms, their parameters and training data | -### Common Object +## Common Object | Field Name | Type | Description | | ----- | ---- |------------ | -|lib| array[string] | **REQUIRED** list of test frameworks. It can be *sklearn*, *daal4py*, *cuml* or *xgboost* | -|data-format| array[string] | **REQUIRED** input data format. Data formats: *numpy*, *pandas* or *cudf* | -|data-order| array[string] | **REQUIRED** input data order. Data order: *C* (row-major, default) or *F* (column-major) | -|dtype| array[string] | **REQUIRED** input data type. Data type: *float64* (default) or *float32* | -|check-finitness| array[] | Check finiteness in sklearn input check(disabled by default) | -|device| array[string] | For scikit-learn only. The list of devices to run the benchmarks on. It can be *None* (default, run on CPU without sycl context) or one of the types of sycl devices: *cpu*, *gpu*, *host*. Refer to [SYCL specification](https://www.khronos.org/files/sycl/sycl-2020-reference-guide.pdf) for details| +|data-format| Union[str, List[str]] | **REQUIRED** Input data format: *numpy*, *pandas*, or *cudf*. | +|data-order| Union[str, List[str]] | **REQUIRED** Input data order: *C* (row-major, default) or *F* (column-major). | +|dtype| Union[str, List[str]] | **REQUIRED** Input data type: *float64* (default) or *float32*. | +|check-finitness| List[] | Check finiteness during scikit-learn input check (disabled by default). | +|device| array[string] | For scikit-learn only. The list of devices to run the benchmarks on.
It can be *None* (default: run on CPU without a SYCL context) or one of the SYCL device types: *cpu*, *gpu*, or *host*.
Refer to [SYCL specification](https://www.khronos.org/files/sycl/sycl-2020-reference-guide.pdf) for details.| -### Case Object +## Case Object | Field Name | Type | Description | | ----- | ---- |------------ | -|lib| array[string] | **REQUIRED** list of test frameworks. It can be *sklearn*, *daal4py*, *cuml* or *xgboost*| -|algorithm| string | **REQUIRED** benchmark name | -|dataset| array[[Dataset Object](#dataset-object)] | **REQUIRED** input data specifications. | -|benchmark parameters| array[Any] | **REQUIRED** algorithm parameters. a list of supported parameters can be found here | +|lib| Union[str, List[str]] | **REQUIRED** A test framework or a list of frameworks. Must be from [*sklearn*, *daal4py*, *cuml*, *xgboost*]. | +|algorithm| string | **REQUIRED** Benchmark file name. | +|dataset| List[[Dataset Object](#dataset-object)] | **REQUIRED** Input data specifications. | +|**specific algorithm parameters**| Union[int, float, str, List[int], List[float], List[str]] | Other algorithm-specific parameters | + +**Important:** You can move any parameter from **"cases"** to **"common"** if this parameter is common to all cases -### Dataset Object +## Dataset Object | Field Name | Type | Description | | ----- | ---- |------------ | -|source| string | **REQUIRED** data source. It can be *synthetic* or *csv* | -|type| string | **REQUIRED** for synthetic data only. The type of task for which the dataset is generated. It can be *classification*, *blobs* or *regression* | +|source| string | **REQUIRED** Data source: *synthetic*, *csv*, or *npy*. | +|type| string | **REQUIRED for synthetic data**. The type of task for which the dataset is generated: *classification*, *blobs*, or *regression*. | |n_classes| int | For *synthetic* data and for *classification* type only. The number of classes (or labels) of the classification problem | |n_clusters| int | For *synthetic* data and for *blobs* type only. The number of centers to generate | -|n_features| int | **REQUIRED** For *synthetic* data only. The number of features to generate | -|name| string | Name of dataset | -|training| [Training Object](#training-object) | **REQUIRED** algorithm parameters. a list of supported parameters can be found here | -|testing| [Testing Object](#testing-object) | **REQUIRED** algorithm parameters. a list of supported parameters can be found here | +|n_features| int | **REQUIRED for *synthetic* data**. The number of features to generate. | +|name| string | Name of the dataset. | +|training| [Training Object](#training-object) | **REQUIRED** An object with the paths to the training datasets. | +|testing| [Testing Object](#testing-object) | An object with the paths to the testing datasets. If not provided, the training datasets are used. 
| -### Training Object +## Training Object | Field Name | Type | Description | | ----- | ---- |------------ | -| n_samples | int | The total number of the training points | -| x | str | The path to the training samples | -| y | str | The path to the training labels | +| n_samples | int | **REQUIRED** The total number of the training samples | +| x | str | **REQUIRED** The path to the training samples | +| y | str | **REQUIRED** The path to the training labels | -### Testing Object +## Testing Object | Field Name | Type | Description | | ----- | ---- |------------ | -| n_samples | int | The total number of the testing points | -| x | str | The path to the testing samples | -| y | str | The path to the testing labels | +| n_samples | int | **REQUIRED** The total number of the testing samples | +| x | str | **REQUIRED** The path to the testing samples | +| y | str | **REQUIRED** The path to the testing labels | diff --git a/configs/cuml_config.json b/configs/cuml_config.json index 01ec8333b..6217b6e96 100755 --- a/configs/cuml_config.json +++ b/configs/cuml_config.json @@ -1,5 +1,4 @@ { - "omp_env": ["OMP_NUM_THREADS"], "common": { "lib": ["cuml"], "data-format": ["cudf"], @@ -104,31 +103,31 @@ "dtype": ["float32"], "dataset": [ { - "source": "csv", + "source": "npy", "name": "higgs1m", "training": { - "x": "data/higgs1m_x_train.csv", - "y": "data/higgs1m_y_train.csv" + "x": "data/higgs1m_x_train.npy", + "y": "data/higgs1m_y_train.npy" }, "testing": { - "x": "data/higgs1m_x_test.csv", - "y": "data/higgs1m_y_test.csv" + "x": "data/higgs1m_x_test.npy", + "y": "data/higgs1m_y_test.npy" } }, { - "source": "csv", + "source": "npy", "name": "airline-ohe", "training": { - "x": "data/airline-ohe_x_train.csv", - "y": "data/airline-ohe_y_train.csv" + "x": "data/airline-ohe_x_train.npy", + "y": "data/airline-ohe_y_train.npy" }, "testing": { - "x": "data/airline-ohe_x_test.csv", - "y": "data/airline-ohe_y_test.csv" + "x": "data/airline-ohe_x_test.npy", + "y": "data/airline-ohe_y_test.npy" } } ], @@ -227,17 +226,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "ijcnn", "training": { - "x": "data/ijcnn_x_train.csv", - "y": "data/ijcnn_y_train.csv" + "x": "data/ijcnn_x_train.npy", + "y": "data/ijcnn_y_train.npy" }, "testing": { - "x": "data/ijcnn_x_test.csv", - "y": "data/ijcnn_y_test.csv" + "x": "data/ijcnn_x_test.npy", + "y": "data/ijcnn_y_test.npy" } } ], @@ -248,17 +247,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "a9a", "training": { - "x": "data/a9a_x_train.csv", - "y": "data/a9a_y_train.csv" + "x": "data/a9a_x_train.npy", + "y": "data/a9a_y_train.npy" }, "testing": { - "x": "data/a9a_x_test.csv", - "y": "data/a9a_y_test.csv" + "x": "data/a9a_x_test.npy", + "y": "data/a9a_y_test.npy" } } ], @@ -269,17 +268,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "gisette", "training": { - "x": "data/gisette_x_train.csv", - "y": "data/gisette_y_train.csv" + "x": "data/gisette_x_train.npy", + "y": "data/gisette_y_train.npy" }, "testing": { - "x": "data/gisette_x_test.csv", - "y": "data/gisette_y_test.csv" + "x": "data/gisette_x_test.npy", + "y": "data/gisette_y_test.npy" } } ], @@ -290,17 +289,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "klaverjas", "training": { - "x": "data/klaverjas_x_train.csv", - "y": "data/klaverjas_y_train.csv" + "x": "data/klaverjas_x_train.npy", + "y": "data/klaverjas_y_train.npy" }, "testing": { - "x": 
"data/klaverjas_x_test.csv", - "y": "data/klaverjas_y_test.csv" + "x": "data/klaverjas_x_test.npy", + "y": "data/klaverjas_y_test.npy" } } ], @@ -311,17 +310,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "skin_segmentation", "training": { - "x": "data/skin_segmentation_x_train.csv", - "y": "data/skin_segmentation_y_train.csv" + "x": "data/skin_segmentation_x_train.npy", + "y": "data/skin_segmentation_y_train.npy" }, "testing": { - "x": "data/skin_segmentation_x_test.csv", - "y": "data/skin_segmentation_y_test.csv" + "x": "data/skin_segmentation_x_test.npy", + "y": "data/skin_segmentation_y_test.npy" } } ], @@ -453,12 +452,12 @@ "algorithm": "train_test_split", "dataset": [ { - "source": "csv", + "source": "npy", "name": "census", "training": { - "x": "data/census_x.csv", - "y": "data/census_y.csv" + "x": "data/census_x_train.npy", + "y": "data/census_y_train.npy" } } ], @@ -469,12 +468,12 @@ "algorithm": "lasso", "dataset": [ { - "source": "csv", - "name": "mortgage", + "source": "npy", + "name": "mortgage1Q", "training": { - "x": "data/mortgage_x.csv", - "y": "data/mortgage_y.csv" + "x": "data/mortgage1Q_x_train.npy", + "y": "data/mortgage1Q_y_train.npy" } } ], @@ -485,17 +484,17 @@ "algorithm": "elasticnet", "dataset": [ { - "source": "csv", + "source": "npy", "name": "year_prediction_msd", "training": { - "x": "data/year_prediction_msd_x_train.csv", - "y": "data/year_prediction_msd_y_train.csv" + "x": "data/year_prediction_msd_x_train.npy", + "y": "data/year_prediction_msd_y_train.npy" }, "testing": { - "x": "data/year_prediction_msd_x_test.csv", - "y": "data/year_prediction_msd_y_test.csv" + "x": "data/year_prediction_msd_x_test.npy", + "y": "data/year_prediction_msd_y_test.npy" } } ], diff --git a/configs/lgbm_mb_cpu_config.json b/configs/lgbm_mb_cpu_config.json deleted file mode 100755 index e8a2111da..000000000 --- a/configs/lgbm_mb_cpu_config.json +++ /dev/null @@ -1,109 +0,0 @@ -{ - "omp_env": ["OMP_NUM_THREADS", "OMP_PLACES"], - "common": { - "lib": ["modelbuilders"], - "data-format": ["pandas"], - "data-order": ["F"], - "dtype": ["float32"] - }, - "cases": [ - { - "algorithm": "lgbm_mb", - "dataset": [ - { - "source": "csv", - "name": "mortgage1Q", - "training": - { - "x": "data/mortgage_x.csv", - "y": "data/mortgage_y.csv" - } - } - ], - "n-estimators": [100], - "objective": ["regression"], - "max-depth": [8], - "scale-pos-weight": [2], - "learning-rate": [0.1], - "subsample": [1], - "reg-alpha": [0.9], - "reg-lambda": [1], - "min-child-weight": [0], - "max-leaves": [256] - }, - { - "algorithm": "lgbm_mb", - "dataset": [ - { - "source": "csv", - "name": "airline-ohe", - "training": - { - "x": "data/airline-ohe_x_train.csv", - "y": "data/airline-ohe_y_train.csv" - } - } - ], - "reg-alpha": [0.9], - "max-bin": [256], - "scale-pos-weight": [2], - "learning-rate": [0.1], - "subsample": [1], - "reg-lambda": [1], - "min-child-weight": [0], - "max-depth": [8], - "max-leaves": [256], - "n-estimators": [1000], - "objective": ["binary"] - }, - { - "algorithm": "lgbm_mb", - "dataset": [ - { - "source": "csv", - "name": "higgs1m", - "training": - { - "x": "data/higgs1m_x_train.csv", - "y": "data/higgs1m_y_train.csv" - } - } - ], - "reg-alpha": [0.9], - "max-bin": [256], - "scale-pos-weight": [2], - "learning-rate": [0.1], - "subsample": [1], - "reg-lambda": [1], - "min-child-weight": [0], - "max-depth": [8], - "max-leaves": [256], - "n-estimators": [1000], - "objective": ["binary"] - }, - { - "algorithm": "lgbm_mb", - "dataset": [ - { - "source": 
"csv", - "name": "msrank", - "training": - { - "x": "data/mlsr_x_train.csv", - "y": "data/mlsr_y_train.csv" - } - } - ], - "max-bin": [256], - "learning-rate": [0.3], - "subsample": [1], - "reg-lambda": [2], - "min-child-weight": [1], - "min-split-gain": [0.1], - "max-depth": [8], - "max-leaves": [256], - "n-estimators": [200], - "objective": ["multiclass"] - } - ] -} diff --git a/configs/modelbuilders/lgbm_mb_cpu_config.json b/configs/modelbuilders/lgbm_mb_cpu_config.json new file mode 100755 index 000000000..a0dabdffa --- /dev/null +++ b/configs/modelbuilders/lgbm_mb_cpu_config.json @@ -0,0 +1,115 @@ +{ + "common": { + "lib": "modelbuilders", + "data-format": "pandas", + "data-order": "F", + "dtype": "float32", + "algorithm": "lgbm_mb" + }, + "cases": [ + { + "dataset": [ + { + "source": "npy", + "name": "airline-ohe", + "training": + { + "x": "data/airline-ohe_x_train.npy", + "y": "data/airline-ohe_y_train.npy" + }, + "testing": + { + "x": "data/airline-ohe_x_test.npy", + "y": "data/airline-ohe_y_test.npy" + } + } + ], + "reg-alpha": 0.9, + "max-bin": 256, + "scale-pos-weight": 2, + "learning-rate": 0.1, + "subsample": 1, + "reg-lambda": 1, + "min-child-weight": 0, + "max-depth": 8, + "max-leaves": 256, + "n-estimators": 1000, + "objective": "binary" + }, + { + "dataset": [ + { + "source": "npy", + "name": "higgs1m", + "training": + { + "x": "data/higgs1m_x_train.npy", + "y": "data/higgs1m_y_train.npy" + }, + "testing": + { + "x": "data/higgs1m_x_test.npy", + "y": "data/higgs1m_y_test.npy" + } + } + ], + "reg-alpha": 0.9, + "max-bin": 256, + "scale-pos-weight": 2, + "learning-rate": 0.1, + "subsample": 1, + "reg-lambda": 1, + "min-child-weight": 0, + "max-depth": 8, + "max-leaves": 256, + "n-estimators": 1000, + "objective": "binary" + }, + { + "dataset": [ + { + "source": "npy", + "name": "mortgage1Q", + "training": + { + "x": "data/mortgage1Q_x_train.npy", + "y": "data/mortgage1Q_y_train.npy" + } + } + ], + "n-estimators": 100, + "objective": "regression", + "max-depth": 8, + "scale-pos-weight": 2, + "learning-rate": 0.1, + "subsample": 1, + "reg-alpha": 0.9, + "reg-lambda": 1, + "min-child-weight": 0, + "max-leaves": 256 + }, + { + "dataset": [ + { + "source": "npy", + "name": "mlsr", + "training": + { + "x": "data/mlsr_x_train.npy", + "y": "data/mlsr_y_train.npy" + } + } + ], + "max-bin": 256, + "learning-rate": 0.3, + "subsample": 1, + "reg-lambda": 2, + "min-child-weight": 1, + "min-split-loss": 0.1, + "max-depth": 8, + "max-leaves": 256, + "n-estimators": 200, + "objective": "multiclass" + } + ] +} diff --git a/configs/modelbuilders/xgb_mb_cpu_config.json b/configs/modelbuilders/xgb_mb_cpu_config.json new file mode 100755 index 000000000..483f3c158 --- /dev/null +++ b/configs/modelbuilders/xgb_mb_cpu_config.json @@ -0,0 +1,118 @@ +{ + "common": { + "lib": "modelbuilders", + "data-format": "pandas", + "data-order": "F", + "dtype": "float32", + "algorithm": "xgb_mb", + "tree-method": "hist", + "count-dmatrix":"" + }, + "cases": [ + { + "dataset": [ + { + "source": "npy", + "name": "airline-ohe", + "training": + { + "x": "data/airline-ohe_x_train.npy", + "y": "data/airline-ohe_y_train.npy" + }, + "testing": + { + "x": "data/airline-ohe_x_test.npy", + "y": "data/airline-ohe_y_test.npy" + } + } + ], + "reg-alpha": 0.9, + "max-bin": 256, + "scale-pos-weight": 2, + "learning-rate": 0.1, + "subsample": 1, + "reg-lambda": 1, + "min-child-weight": 0, + "max-depth": 8, + "max-leaves": 256, + "n-estimators": 1000, + "objective": "binary:logistic" + }, + { + "dataset": [ + { + "source": 
"npy", + "name": "higgs1m", + "training": + { + "x": "data/higgs1m_x_train.npy", + "y": "data/higgs1m_y_train.npy" + }, + "testing": + { + "x": "data/higgs1m_x_test.npy", + "y": "data/higgs1m_y_test.npy" + } + } + ], + "reg-alpha": 0.9, + "max-bin": 256, + "scale-pos-weight": 2, + "learning-rate": 0.1, + "subsample": 1, + "reg-lambda": 1, + "min-child-weight": 0, + "max-depth": 8, + "max-leaves": 256, + "n-estimators": 1000, + "objective": "binary:logistic", + "enable-experimental-json-serialization": "False", + "inplace-predict": "" + }, + { + "dataset": [ + { + "source": "npy", + "name": "mortgage1Q", + "training": + { + "x": "data/mortgage1Q_x_train.npy", + "y": "data/mortgage1Q_y_train.npy" + } + } + ], + "n-estimators": 100, + "objective": "reg:squarederror", + "max-depth": 8, + "scale-pos-weight": 2, + "learning-rate": 0.1, + "subsample": 1, + "reg-alpha": 0.9, + "reg-lambda": 1, + "min-child-weight": 0, + "max-leaves": 256 + }, + { + "dataset": [ + { + "source": "npy", + "name": "mlsr", + "training": + { + "x": "data/mlsr_x_train.npy", + "y": "data/mlsr_y_train.npy" + } + } + ], + "max-bin": 256, + "learning-rate": 0.3, + "subsample": 1, + "reg-lambda": 2, + "min-child-weight": 1, + "min-split-loss": 0.1, + "max-depth": 8, + "n-estimators": 200, + "objective": "multi:softprob" + } + ] +} diff --git a/configs/skl_config.json b/configs/skl_config.json index 93c23e068..a385e50be 100755 --- a/configs/skl_config.json +++ b/configs/skl_config.json @@ -115,31 +115,31 @@ "dtype": ["float32"], "dataset": [ { - "source": "csv", + "source": "npy", "name": "higgs1m", "training": { - "x": "data/higgs1m_x_train.csv", - "y": "data/higgs1m_y_train.csv" + "x": "data/higgs1m_x_train.npy", + "y": "data/higgs1m_y_train.npy" }, "testing": { - "x": "data/higgs1m_x_test.csv", - "y": "data/higgs1m_y_test.csv" + "x": "data/higgs1m_x_test.npy", + "y": "data/higgs1m_y_test.npy" } }, { - "source": "csv", + "source": "npy", "name": "airline-ohe", "training": { - "x": "data/airline-ohe_x_train.csv", - "y": "data/airline-ohe_y_train.csv" + "x": "data/airline-ohe_x_train.npy", + "y": "data/airline-ohe_y_train.npy" }, "testing": { - "x": "data/airline-ohe_x_test.csv", - "y": "data/airline-ohe_y_test.csv" + "x": "data/airline-ohe_x_test.npy", + "y": "data/airline-ohe_y_test.npy" } } ], @@ -238,17 +238,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "ijcnn", "training": { - "x": "data/ijcnn_x_train.csv", - "y": "data/ijcnn_y_train.csv" + "x": "data/ijcnn_x_train.npy", + "y": "data/ijcnn_y_train.npy" }, "testing": { - "x": "data/ijcnn_x_test.csv", - "y": "data/ijcnn_y_test.csv" + "x": "data/ijcnn_x_test.npy", + "y": "data/ijcnn_y_test.npy" } } ], @@ -259,17 +259,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "a9a", "training": { - "x": "data/a9a_x_train.csv", - "y": "data/a9a_y_train.csv" + "x": "data/a9a_x_train.npy", + "y": "data/a9a_y_train.npy" }, "testing": { - "x": "data/a9a_x_test.csv", - "y": "data/a9a_y_test.csv" + "x": "data/a9a_x_test.npy", + "y": "data/a9a_y_test.npy" } } ], @@ -280,17 +280,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "gisette", "training": { - "x": "data/gisette_x_train.csv", - "y": "data/gisette_y_train.csv" + "x": "data/gisette_x_train.npy", + "y": "data/gisette_y_train.npy" }, "testing": { - "x": "data/gisette_x_test.csv", - "y": "data/gisette_y_test.csv" + "x": "data/gisette_x_test.npy", + "y": "data/gisette_y_test.npy" } } ], @@ -301,17 +301,17 @@ "algorithm": 
"svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "klaverjas", "training": { - "x": "data/klaverjas_x_train.csv", - "y": "data/klaverjas_y_train.csv" + "x": "data/klaverjas_x_train.npy", + "y": "data/klaverjas_y_train.npy" }, "testing": { - "x": "data/klaverjas_x_test.csv", - "y": "data/klaverjas_y_test.csv" + "x": "data/klaverjas_x_test.npy", + "y": "data/klaverjas_y_test.npy" } } ], @@ -322,17 +322,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", - "name": "connect4", + "source": "npy", + "name": "connect", "training": { - "x": "data/connect_x_train.csv", - "y": "data/connect_y_train.csv" + "x": "data/connect_x_train.npy", + "y": "data/connect_y_train.npy" }, "testing": { - "x": "data/connect_x_test.csv", - "y": "data/connect_y_test.csv" + "x": "data/connect_x_test.npy", + "y": "data/connect_y_test.npy" } } ], @@ -343,17 +343,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "mnist", "training": { - "x": "data/mnist_x_train.csv", - "y": "data/mnist_y_train.csv" + "x": "data/mnist_x_train.npy", + "y": "data/mnist_y_train.npy" }, "testing": { - "x": "data/mnist_x_test.csv", - "y": "data/mnist_y_test.csv" + "x": "data/mnist_x_test.npy", + "y": "data/mnist_y_test.npy" } } ], @@ -364,17 +364,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "sensit", "training": { - "x": "data/sensit_x_train.csv", - "y": "data/sensit_y_train.csv" + "x": "data/sensit_x_train.npy", + "y": "data/sensit_y_train.npy" }, "testing": { - "x": "data/sensit_x_test.csv", - "y": "data/sensit_y_test.csv" + "x": "data/sensit_x_test.npy", + "y": "data/sensit_y_test.npy" } } ], @@ -385,17 +385,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "skin_segmentation", "training": { - "x": "data/skin_segmentation_x_train.csv", - "y": "data/skin_segmentation_y_train.csv" + "x": "data/skin_segmentation_x_train.npy", + "y": "data/skin_segmentation_y_train.npy" }, "testing": { - "x": "data/skin_segmentation_x_test.csv", - "y": "data/skin_segmentation_y_test.csv" + "x": "data/skin_segmentation_x_test.npy", + "y": "data/skin_segmentation_y_test.npy" } } ], @@ -406,17 +406,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "covertype", "training": { - "x": "data/covertype_x_train.csv", - "y": "data/covertype_y_train.csv" + "x": "data/covertype_x_train.npy", + "y": "data/covertype_y_train.npy" }, "testing": { - "x": "data/covertype_x_test.csv", - "y": "data/covertype_y_test.csv" + "x": "data/covertype_x_test.npy", + "y": "data/covertype_y_test.npy" } } ], @@ -427,17 +427,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "codrnanorm", "training": { - "x": "data/codrnanorm_x_train.csv", - "y": "data/codrnanorm_y_train.csv" + "x": "data/codrnanorm_x_train.npy", + "y": "data/codrnanorm_y_train.npy" }, "testing": { - "x": "data/codrnanorm_x_test.csv", - "y": "data/codrnanorm_y_test.csv" + "x": "data/codrnanorm_x_test.npy", + "y": "data/codrnanorm_y_test.npy" } } ], @@ -570,12 +570,12 @@ "algorithm": "train_test_split", "dataset": [ { - "source": "csv", + "source": "npy", "name": "census", "training": { - "x": "data/census_x.csv", - "y": "data/census_y.csv" + "x": "data/census_x_train.npy", + "y": "data/census_y_train.npy" } } ], @@ -589,12 +589,12 @@ "algorithm": "lasso", "dataset": [ { - "source": "csv", - "name": "mortgage", + "source": "npy", + "name": "mortgage1Q", "training": { - "x": "data/mortgage_x.csv", - "y": 
"data/mortgage_y.csv" + "x": "data/mortgage1Q_x_train.npy", + "y": "data/mortgage1Q_y_train.npy" } } ], @@ -605,17 +605,17 @@ "algorithm": "elasticnet", "dataset": [ { - "source": "csv", + "source": "npy", "name": "year_prediction_msd", "training": { - "x": "data/year_prediction_msd_x_train.csv", - "y": "data/year_prediction_msd_y_train.csv" + "x": "data/year_prediction_msd_x_train.npy", + "y": "data/year_prediction_msd_y_train.npy" }, "testing": { - "x": "data/year_prediction_msd_x_test.csv", - "y": "data/year_prediction_msd_y_test.csv" + "x": "data/year_prediction_msd_x_test.npy", + "y": "data/year_prediction_msd_y_test.npy" } } ], diff --git a/configs/svm/svc_proba_cuml.json b/configs/svm/svc_proba_cuml.json index 85fe1f0df..c765a2164 100755 --- a/configs/svm/svc_proba_cuml.json +++ b/configs/svm/svc_proba_cuml.json @@ -12,17 +12,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "ijcnn", "training": { - "x": "data/ijcnn_x_train.csv", - "y": "data/ijcnn_y_train.csv" + "x": "data/ijcnn_x_train.npy", + "y": "data/ijcnn_y_train.npy" }, "testing": { - "x": "data/ijcnn_x_test.csv", - "y": "data/ijcnn_y_test.csv" + "x": "data/ijcnn_x_test.npy", + "y": "data/ijcnn_y_test.npy" } } ], @@ -33,17 +33,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "a9a", "training": { - "x": "data/a9a_x_train.csv", - "y": "data/a9a_y_train.csv" + "x": "data/a9a_x_train.npy", + "y": "data/a9a_y_train.npy" }, "testing": { - "x": "data/a9a_x_test.csv", - "y": "data/a9a_y_test.csv" + "x": "data/a9a_x_test.npy", + "y": "data/a9a_y_test.npy" } } ], @@ -54,17 +54,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "gisette", "training": { - "x": "data/gisette_x_train.csv", - "y": "data/gisette_y_train.csv" + "x": "data/gisette_x_train.npy", + "y": "data/gisette_y_train.npy" }, "testing": { - "x": "data/gisette_x_test.csv", - "y": "data/gisette_y_test.csv" + "x": "data/gisette_x_test.npy", + "y": "data/gisette_y_test.npy" } } ], @@ -75,17 +75,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "klaverjas", "training": { - "x": "data/klaverjas_x_train.csv", - "y": "data/klaverjas_y_train.csv" + "x": "data/klaverjas_x_train.npy", + "y": "data/klaverjas_y_train.npy" }, "testing": { - "x": "data/klaverjas_x_test.csv", - "y": "data/klaverjas_y_test.csv" + "x": "data/klaverjas_x_test.npy", + "y": "data/klaverjas_y_test.npy" } } ], @@ -96,17 +96,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "connect", "training": { - "x": "data/connect_x_train.csv", - "y": "data/connect_y_train.csv" + "x": "data/connect_x_train.npy", + "y": "data/connect_y_train.npy" }, "testing": { - "x": "data/connect_x_test.csv", - "y": "data/connect_y_test.csv" + "x": "data/connect_x_test.npy", + "y": "data/connect_y_test.npy" } } ], @@ -117,17 +117,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "mnist", "training": { - "x": "data/mnist_x_train.csv", - "y": "data/mnist_y_train.csv" + "x": "data/mnist_x_train.npy", + "y": "data/mnist_y_train.npy" }, "testing": { - "x": "data/mnist_x_test.csv", - "y": "data/mnist_y_test.csv" + "x": "data/mnist_x_test.npy", + "y": "data/mnist_y_test.npy" } } ], @@ -138,17 +138,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "sensit", "training": { - "x": "data/sensit_x_train.csv", - "y": "data/sensit_y_train.csv" + "x": "data/sensit_x_train.npy", + "y": 
"data/sensit_y_train.npy" }, "testing": { - "x": "data/sensit_x_test.csv", - "y": "data/sensit_y_test.csv" + "x": "data/sensit_x_test.npy", + "y": "data/sensit_y_test.npy" } } ], @@ -159,17 +159,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "skin_segmentation", "training": { - "x": "data/skin_segmentation_x_train.csv", - "y": "data/skin_segmentation_y_train.csv" + "x": "data/skin_segmentation_x_train.npy", + "y": "data/skin_segmentation_y_train.npy" }, "testing": { - "x": "data/skin_segmentation_x_test.csv", - "y": "data/skin_segmentation_y_test.csv" + "x": "data/skin_segmentation_x_test.npy", + "y": "data/skin_segmentation_y_test.npy" } } ], @@ -180,17 +180,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "covertype", "training": { - "x": "data/covertype_x_train.csv", - "y": "data/covertype_y_train.csv" + "x": "data/covertype_x_train.npy", + "y": "data/covertype_y_train.npy" }, "testing": { - "x": "data/covertype_x_test.csv", - "y": "data/covertype_y_test.csv" + "x": "data/covertype_x_test.npy", + "y": "data/covertype_y_test.npy" } } ], @@ -201,17 +201,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "codrnanorm", "training": { - "x": "data/codrnanorm_x_train.csv", - "y": "data/codrnanorm_y_train.csv" + "x": "data/codrnanorm_x_train.npy", + "y": "data/codrnanorm_y_train.npy" }, "testing": { - "x": "data/codrnanorm_x_test.csv", - "y": "data/codrnanorm_y_test.csv" + "x": "data/codrnanorm_x_test.npy", + "y": "data/codrnanorm_y_test.npy" } } ], diff --git a/configs/svm/svc_proba_sklearn.json b/configs/svm/svc_proba_sklearn.json index 53c1676cf..3ded70b29 100755 --- a/configs/svm/svc_proba_sklearn.json +++ b/configs/svm/svc_proba_sklearn.json @@ -12,17 +12,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "ijcnn", "training": { - "x": "data/ijcnn_x_train.csv", - "y": "data/ijcnn_y_train.csv" + "x": "data/ijcnn_x_train.npy", + "y": "data/ijcnn_y_train.npy" }, "testing": { - "x": "data/ijcnn_x_test.csv", - "y": "data/ijcnn_y_test.csv" + "x": "data/ijcnn_x_test.npy", + "y": "data/ijcnn_y_test.npy" } } ], @@ -33,17 +33,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "a9a", "training": { - "x": "data/a9a_x_train.csv", - "y": "data/a9a_y_train.csv" + "x": "data/a9a_x_train.npy", + "y": "data/a9a_y_train.npy" }, "testing": { - "x": "data/a9a_x_test.csv", - "y": "data/a9a_y_test.csv" + "x": "data/a9a_x_test.npy", + "y": "data/a9a_y_test.npy" } } ], @@ -54,17 +54,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "gisette", "training": { - "x": "data/gisette_x_train.csv", - "y": "data/gisette_y_train.csv" + "x": "data/gisette_x_train.npy", + "y": "data/gisette_y_train.npy" }, "testing": { - "x": "data/gisette_x_test.csv", - "y": "data/gisette_y_test.csv" + "x": "data/gisette_x_test.npy", + "y": "data/gisette_y_test.npy" } } ], @@ -75,17 +75,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "klaverjas", "training": { - "x": "data/klaverjas_x_train.csv", - "y": "data/klaverjas_y_train.csv" + "x": "data/klaverjas_x_train.npy", + "y": "data/klaverjas_y_train.npy" }, "testing": { - "x": "data/klaverjas_x_test.csv", - "y": "data/klaverjas_y_test.csv" + "x": "data/klaverjas_x_test.npy", + "y": "data/klaverjas_y_test.npy" } } ], @@ -96,17 +96,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "connect", 
"training": { - "x": "data/connect_x_train.csv", - "y": "data/connect_y_train.csv" + "x": "data/connect_x_train.npy", + "y": "data/connect_y_train.npy" }, "testing": { - "x": "data/connect_x_test.csv", - "y": "data/connect_y_test.csv" + "x": "data/connect_x_test.npy", + "y": "data/connect_y_test.npy" } } ], @@ -117,17 +117,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "mnist", "training": { - "x": "data/mnist_x_train.csv", - "y": "data/mnist_y_train.csv" + "x": "data/mnist_x_train.npy", + "y": "data/mnist_y_train.npy" }, "testing": { - "x": "data/mnist_x_test.csv", - "y": "data/mnist_y_test.csv" + "x": "data/mnist_x_test.npy", + "y": "data/mnist_y_test.npy" } } ], @@ -138,17 +138,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "sensit", "training": { - "x": "data/sensit_x_train.csv", - "y": "data/sensit_y_train.csv" + "x": "data/sensit_x_train.npy", + "y": "data/sensit_y_train.npy" }, "testing": { - "x": "data/sensit_x_test.csv", - "y": "data/sensit_y_test.csv" + "x": "data/sensit_x_test.npy", + "y": "data/sensit_y_test.npy" } } ], @@ -159,17 +159,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "skin_segmentation", "training": { - "x": "data/skin_segmentation_x_train.csv", - "y": "data/skin_segmentation_y_train.csv" + "x": "data/skin_segmentation_x_train.npy", + "y": "data/skin_segmentation_y_train.npy" }, "testing": { - "x": "data/skin_segmentation_x_test.csv", - "y": "data/skin_segmentation_y_test.csv" + "x": "data/skin_segmentation_x_test.npy", + "y": "data/skin_segmentation_y_test.npy" } } ], @@ -180,17 +180,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "covertype", "training": { - "x": "data/covertype_x_train.csv", - "y": "data/covertype_y_train.csv" + "x": "data/covertype_x_train.npy", + "y": "data/covertype_y_train.npy" }, "testing": { - "x": "data/covertype_x_test.csv", - "y": "data/covertype_y_test.csv" + "x": "data/covertype_x_test.npy", + "y": "data/covertype_y_test.npy" } } ], @@ -201,17 +201,17 @@ "algorithm": "svm", "dataset": [ { - "source": "csv", + "source": "npy", "name": "codrnanorm", "training": { - "x": "data/codrnanorm_x_train.csv", - "y": "data/codrnanorm_y_train.csv" + "x": "data/codrnanorm_x_train.npy", + "y": "data/codrnanorm_y_train.npy" }, "testing": { - "x": "data/codrnanorm_x_test.csv", - "y": "data/codrnanorm_y_test.csv" + "x": "data/codrnanorm_x_test.npy", + "y": "data/codrnanorm_y_test.npy" } } ], diff --git a/configs/testing/daal4py_xgboost.json b/configs/testing/daal4py_xgboost.json index 56accdce3..548ec82bf 100755 --- a/configs/testing/daal4py_xgboost.json +++ b/configs/testing/daal4py_xgboost.json @@ -1,20 +1,21 @@ { - "omp_env": ["OMP_NUM_THREADS", "OMP_PLACES"], "common": { - "lib": ["modelbuilders"], - "data-format": ["pandas"], - "data-order": ["F"], - "dtype": ["float32"] + "lib": "modelbuilders", + "data-format": "pandas", + "data-order": "F", + "dtype": "float32", + "algorithm": "xgb_mb", + "tree-method": "hist", + "count-dmatrix":"" }, "cases": [ { - "algorithm": "xgb_mb", "dataset": [ { - "source": "synthetic", - "type": "classification", - "n_classes": 5, - "n_features": 10, + "source": "synthetic", + "type": "classification", + "n_classes": 5, + "n_features": 10, "training": { "n_samples": 100 }, @@ -23,10 +24,9 @@ } } ], - "n-estimators": [10], - "tree-method": ["hist"], - "objective": ["multi:softprob"], - "max-depth": [8] + "n-estimators": 10, + "max-depth": 8, + "objective": 
"multi:softprob" } ] } diff --git a/configs/testing/xgboost.json b/configs/testing/xgboost.json index 5107ee793..33242a630 100755 --- a/configs/testing/xgboost.json +++ b/configs/testing/xgboost.json @@ -1,21 +1,21 @@ { - "omp_env": ["OMP_NUM_THREADS", "OMP_PLACES"], "common": { - "lib": ["xgboost"], - "data-format": ["pandas"], - "data-order": ["F"], - "dtype": ["float64"] + "lib": "xgboost", + "data-format": "pandas", + "data-order": "F", + "dtype": "float32", + "algorithm": "gbt", + "tree-method": "hist", + "count-dmatrix":"" }, "cases": [ - { - "algorithm": "gbt", "dataset": [ { - "source": "synthetic", - "type": "classification", - "n_classes": 5, - "n_features": 10, + "source": "synthetic", + "type": "classification", + "n_classes": 5, + "n_features": 10, "training": { "n_samples": 1000 }, @@ -24,21 +24,19 @@ } } ], - "n-estimators": [50], - "objective": ["multi:softprob"], - "tree-method": ["hist"], - "max-depth": [7], - "subsample": [0.7], - "colsample-bytree": [0.7] + "n-estimators": 50, + "max-depth": 7, + "subsample": 0.7, + "colsample-bytree": 0.7, + "objective": "multi:softprob" }, { - "algorithm": "gbt", "dataset": [ { - "source": "synthetic", - "type": "regression", - "n_classes": 5, - "n_features": 10, + "source": "synthetic", + "type": "regression", + "n_classes": 5, + "n_features": 10, "training": { "n_samples": 100 }, @@ -47,12 +45,11 @@ } } ], - "n-estimators": [50], - "objective": ["reg:squarederror"], - "tree-method": ["hist"], - "max-depth": [8], - "learning-rate": [0.1], - "reg-alpha": [0.9] + "n-estimators": 50, + "max-depth": 8, + "learning-rate": 0.1, + "reg-alpha": 0.9, + "objective": "reg:squarederror" } ] } diff --git a/configs/xgb_cpu_config.json b/configs/xgb_cpu_config.json deleted file mode 100644 index ecc0da15b..000000000 --- a/configs/xgb_cpu_config.json +++ /dev/null @@ -1,163 +0,0 @@ -{ - "omp_env": ["OMP_NUM_THREADS", "OMP_PLACES"], - "common": { - "lib": ["xgboost"], - "data-format": ["pandas"], - "data-order": ["F"], - "dtype": ["float32"], - "count-dmatrix": [""] - }, - "cases": [ - { - "algorithm": "gbt", - "dataset": [ - { - "source": "csv", - "name": "plasticc", - "training": - { - "x": "data/plasticc_x_train.csv", - "y": "data/plasticc_y_train.csv" - }, - "testing": - { - "x": "data/plasticc_x_test.csv", - "y": "data/plasticc_y_test.csv" - } - } - ], - "n-estimators": [60], - "objective": ["multi:softprob"], - "tree-method": ["hist"], - "max-depth": [7], - "subsample": [0.7], - "colsample-bytree": [0.7] - }, - { - "algorithm": "gbt", - "dataset": [ - { - "source": "csv", - "name": "santander", - "training": - { - "x": "data/santander_x_train.csv", - "y": "data/santander_y_train.csv" - } - } - ], - "n-estimators": [10000], - "objective": ["binary:logistic"], - "tree-method": ["hist"], - "max-depth": [1], - "subsample": [0.5], - "eta": [0.1], - "colsample-bytree": [0.05], - "single-precision-histogram": [""] - }, - { - "algorithm": "gbt", - "dataset": [ - { - "source": "csv", - "name": "mortgage1Q", - "training": - { - "x": "data/mortgage_x.csv", - "y": "data/mortgage_y.csv" - } - } - ], - "n-estimators": [100], - "objective": ["reg:squarederror"], - "tree-method": ["hist"], - "max-depth": [8], - "scale-pos-weight": [2], - "learning-rate": [0.1], - "subsample": [1], - "reg-alpha": [0.9], - "reg-lambda": [1], - "min-child-weight": [0], - "max-leaves": [256] - }, - { - "algorithm": "gbt", - "dataset": [ - { - "source": "csv", - "name": "airline-ohe", - "training": - { - "x": "data/airline-ohe_x_train.csv", - "y": "data/airline-ohe_y_train.csv" - } 
- } - ], - "reg-alpha": [0.9], - "max-bin": [256], - "scale-pos-weight": [2], - "learning-rate": [0.1], - "subsample": [1], - "reg-lambda": [1], - "min-child-weight": [0], - "max-depth": [8], - "max-leaves": [256], - "n-estimators": [1000], - "objective": ["binary:logistic"], - "tree-method": ["hist"] - }, - { - "algorithm": "gbt", - "dataset": [ - { - "source": "csv", - "name": "higgs1m", - "training": - { - "x": "data/higgs1m_x_train.csv", - "y": "data/higgs1m_y_train.csv" - } - } - ], - "reg-alpha": [0.9], - "max-bin": [256], - "scale-pos-weight": [2], - "learning-rate": [0.1], - "subsample": [1], - "reg-lambda": [1], - "min-child-weight": [0], - "max-depth": [8], - "max-leaves": [256], - "n-estimators": [1000], - "objective": ["binary:logistic"], - "tree-method": ["hist"], - "enable-experimental-json-serialization": ["False"], - "inplace-predict": [""] - }, - { - "algorithm": "gbt", - "dataset": [ - { - "source": "csv", - "name": "msrank", - "training": - { - "x": "data/mlsr_x_train.csv", - "y": "data/mlsr_y_train.csv" - } - } - ], - "max-bin": [256], - "learning-rate": [0.3], - "subsample": [1], - "reg-lambda": [2], - "min-child-weight": [1], - "min-split-loss": [0.1], - "max-depth": [8], - "n-estimators": [200], - "objective": ["multi:softprob"], - "tree-method": ["hist"], - "single-precision-histogram": [""] - } - ] -} diff --git a/configs/xgb_gpu_config.json b/configs/xgb_gpu_config.json deleted file mode 100644 index 44d9aec45..000000000 --- a/configs/xgb_gpu_config.json +++ /dev/null @@ -1,160 +0,0 @@ -{ - "omp_env": ["OMP_NUM_THREADS", "OMP_PLACES"], - "common": { - "lib": ["xgboost"], - "data-format": ["cudf"], - "data-order": ["F"], - "dtype": ["float32"], - "count-dmatrix": [""] - }, - "cases": [ - { - "algorithm": "gbt", - "dataset": [ - { - "source": "csv", - "name": "plasticc", - "training": - { - "x": "data/plasticc_x_train.csv", - "y": "data/plasticc_y_train.csv" - }, - "testing": - { - "x": "data/plasticc_x_test.csv", - "y": "data/plasticc_y_test.csv" - } - } - ], - "n-estimators": [60], - "objective": ["multi:softprob"], - "tree-method": ["gpu_hist"], - "max-depth": [7], - "subsample": [0.7], - "colsample-bytree": [0.7] - }, - { - "algorithm": "gbt", - "dataset": [ - { - "source": "csv", - "name": "santander", - "training": - { - "x": "data/santander_x_train.csv", - "y": "data/santander_y_train.csv" - } - } - ], - "n-estimators": [10000], - "objective": ["binary:logistic"], - "tree-method": ["gpu_hist"], - "max-depth": [1], - "subsample": [0.5], - "eta": [0.1], - "colsample-bytree": [0.05] - }, - { - "algorithm": "gbt", - "dataset": [ - { - "source": "csv", - "name": "mortgage1Q", - "training": - { - "x": "data/mortgage_x.csv", - "y": "data/mortgage_y.csv" - } - } - ], - "n-estimators": [100], - "objective": ["reg:squarederror"], - "tree-method": ["gpu_hist"], - "max-depth": [8], - "scale-pos-weight": [2], - "learning-rate": [0.1], - "subsample": [1], - "reg-alpha": [0.9], - "reg-lambda": [1], - "min-child-weight": [0], - "max-leaves": [256] - }, - { - "algorithm": "gbt", - "dataset": [ - { - "source": "csv", - "name": "airline-ohe", - "training": - { - "x": "data/airline-ohe_x_train.csv", - "y": "data/airline-ohe_y_train.csv" - } - } - ], - "reg-alpha": [0.9], - "max-bin": [256], - "scale-pos-weight": [2], - "learning-rate": [0.1], - "subsample": [1], - "reg-lambda": [1], - "min-child-weight": [0], - "max-depth": [8], - "max-leaves": [256], - "n-estimators": [1000], - "objective": ["binary:logistic"], - "tree-method": ["gpu_hist"] - }, - { - "algorithm": "gbt", - 
"dataset": [ - { - "source": "csv", - "name": "higgs1m", - "training": - { - "x": "data/higgs1m_x_train.csv", - "y": "data/higgs1m_y_train.csv" - } - } - ], - "reg-alpha": [0.9], - "max-bin": [256], - "scale-pos-weight": [2], - "learning-rate": [0.1], - "subsample": [1], - "reg-lambda": [1], - "min-child-weight": [0], - "max-depth": [8], - "max-leaves": [256], - "n-estimators": [1000], - "objective": ["binary:logistic"], - "tree-method": ["gpu_hist"], - "inplace-predict": [""] - }, - { - "algorithm": "gbt", - "dataset": [ - { - "source": "csv", - "name": "msrank", - "training": - { - "x": "data/mlsr_x_train.csv", - "y": "data/mlsr_y_train.csv" - } - } - ], - "max-bin": [256], - "learning-rate": [0.3], - "subsample": [1], - "reg-lambda": [2], - "min-child-weight": [1], - "min-split-loss": [0.1], - "max-depth": [8], - "n-estimators": [200], - "objective": ["multi:softprob"], - "tree-method": ["gpu_hist"] - } - ] -} diff --git a/configs/xgb_mb_cpu_config.json b/configs/xgb_mb_cpu_config.json deleted file mode 100755 index 0c8128aef..000000000 --- a/configs/xgb_mb_cpu_config.json +++ /dev/null @@ -1,114 +0,0 @@ -{ - "omp_env": ["OMP_NUM_THREADS", "OMP_PLACES"], - "common": { - "lib": ["modelbuilders"], - "data-format": ["pandas"], - "data-order": ["F"], - "dtype": ["float32"], - "count-dmatrix": [""] - }, - "cases": [ - { - "algorithm": "xgb_mb", - "dataset": [ - { - "source": "csv", - "name": "mortgage1Q", - "training": - { - "x": "data/mortgage_x.csv", - "y": "data/mortgage_y.csv" - } - } - ], - "n-estimators": [100], - "objective": ["reg:squarederror"], - "tree-method": ["hist"], - "max-depth": [8], - "scale-pos-weight": [2], - "learning-rate": [0.1], - "subsample": [1], - "reg-alpha": [0.9], - "reg-lambda": [1], - "min-child-weight": [0], - "max-leaves": [256] - }, - { - "algorithm": "xgb_mb", - "dataset": [ - { - "source": "csv", - "name": "airline-ohe", - "training": - { - "x": "data/airline-ohe_x_train.csv", - "y": "data/airline-ohe_y_train.csv" - } - } - ], - "reg-alpha": [0.9], - "max-bin": [256], - "scale-pos-weight": [2], - "learning-rate": [0.1], - "subsample": [1], - "reg-lambda": [1], - "min-child-weight": [0], - "max-depth": [8], - "max-leaves": [256], - "n-estimators": [1000], - "objective": ["binary:logistic"], - "tree-method": ["hist"] - }, - { - "algorithm": "xgb_mb", - "dataset": [ - { - "source": "csv", - "name": "higgs1m", - "training": - { - "x": "data/higgs1m_x_train.csv", - "y": "data/higgs1m_y_train.csv" - } - } - ], - "reg-alpha": [0.9], - "max-bin": [256], - "scale-pos-weight": [2], - "learning-rate": [0.1], - "subsample": [1], - "reg-lambda": [1], - "min-child-weight": [0], - "max-depth": [8], - "max-leaves": [256], - "n-estimators": [1000], - "objective": ["binary:logistic"], - "tree-method": ["hist"], - "enable-experimental-json-serialization": ["False"] - }, - { - "algorithm": "xgb_mb", - "dataset": [ - { - "source": "csv", - "name": "msrank", - "training": - { - "x": "data/mlsr_x_train.csv", - "y": "data/mlsr_y_train.csv" - } - } - ], - "max-bin": [256], - "learning-rate": [0.3], - "subsample": [1], - "reg-lambda": [2], - "min-child-weight": [1], - "min-split-loss": [0.1], - "max-depth": [8], - "n-estimators": [200], - "objective": ["multi:softprob"], - "tree-method": ["hist"] - } - ] -} diff --git a/configs/xgboost/xgb_cpu_additional_config.json b/configs/xgboost/xgb_cpu_additional_config.json new file mode 100644 index 000000000..a3f738c00 --- /dev/null +++ b/configs/xgboost/xgb_cpu_additional_config.json @@ -0,0 +1,155 @@ +{ + "common": { + "lib": "xgboost", 
+ "data-format": "pandas", + "data-order": "F", + "dtype": "float32", + "algorithm": "gbt", + "tree-method": "hist", + "count-dmatrix":"", + "max-depth": 8, + "learning-rate":0.1, + "reg-lambda": 1, + "max-leaves": 256 + }, + "cases": [ + { + "objective": "binary:logistic", + "scale-pos-weight": 2.1067817411664587, + "dataset": [ + { + "source": "npy", + "name": "airline", + "training": + { + "x": "data/airline_x_train.npy", + "y": "data/airline_y_train.npy" + }, + "testing": + { + "x": "data/airline_x_test.npy", + "y": "data/airline_y_test.npy" + } + } + ] + }, + { + "objective": "binary:logistic", + "scale-pos-weight": 173.63348001466812, + "dataset": [ + { + "source": "npy", + "name": "bosch", + "training": + { + "x": "data/bosch_x_train.npy", + "y": "data/bosch_y_train.npy" + }, + "testing": + { + "x": "data/bosch_x_test.npy", + "y": "data/bosch_y_test.npy" + } + } + ] + }, + { + "objective": "multi:softmax", + "dataset": [ + { + "source": "npy", + "name": "covtype", + "training": + { + "x": "data/covtype_x_train.npy", + "y": "data/covtype_y_train.npy" + }, + "testing": + { + "x": "data/covtype_x_test.npy", + "y": "data/covtype_y_test.npy" + } + } + ] + }, + { + "objective": "binary:logistic", + "scale-pos-weight": 2.0017715678375363, + "dataset": [ + { + "source": "npy", + "name": "epsilon", + "training": + { + "x": "data/epsilon_x_train.npy", + "y": "data/epsilon_y_train.npy" + }, + "testing": + { + "x": "data/epsilon_x_test.npy", + "y": "data/epsilon_y_test.npy" + } + } + ] + }, + { + "objective": "binary:logistic", + "scale-pos-weight": 578.2868020304569, + "dataset": [ + { + "source": "npy", + "name": "fraud", + "training": + { + "x": "data/fraud_x_train.npy", + "y": "data/fraud_y_train.npy" + }, + "testing": + { + "x": "data/fraud_x_test.npy", + "y": "data/fraud_y_test.npy" + } + } + ] + }, + { + "objective": "binary:logistic", + "scale-pos-weight": 1.8872389605086624, + "dataset": [ + { + "source": "npy", + "name": "higgs", + "training": + { + "x": "data/higgs_x_train.npy", + "y": "data/higgs_y_train.npy" + }, + "testing": + { + "x": "data/higgs_x_test.npy", + "y": "data/higgs_y_test.npy" + } + } + ] + }, + { + "objective": "reg:squarederror", + "dataset": [ + { + "source": "npy", + "name": "year_prediction_msd", + "training": + { + "x": "data/year_prediction_msd_x_train.npy", + "y": "data/year_prediction_msd_y_train.npy" + }, + "testing": + { + "x": "data/year_prediction_msd_x_test.npy", + "y": "data/year_prediction_msd_y_test.npy" + } + } + ] + } + ] +} diff --git a/configs/xgboost/xgb_cpu_main_config.json b/configs/xgboost/xgb_cpu_main_config.json new file mode 100644 index 000000000..f5a2c4b67 --- /dev/null +++ b/configs/xgboost/xgb_cpu_main_config.json @@ -0,0 +1,211 @@ +{ + "common": { + "lib": "xgboost", + "data-format": "pandas", + "data-order": "F", + "dtype": "float32", + "algorithm": "gbt", + "tree-method": "hist", + "count-dmatrix":"" + }, + "cases": [ + { + "dataset": [ + { + "source": "npy", + "name": "abalone", + "training": + { + "x": "data/abalone_x_train.npy", + "y": "data/abalone_y_train.npy" + }, + "testing": + { + "x": "data/abalone_x_test.npy", + "y": "data/abalone_y_test.npy" + } + } + ], + "learning-rate": 0.03, + "max-depth": 6, + "n-estimators": 1000, + "objective": "reg:squarederror" + }, + { + "dataset": [ + { + "source": "npy", + "name": "airline-ohe", + "training": + { + "x": "data/airline-ohe_x_train.npy", + "y": "data/airline-ohe_y_train.npy" + }, + "testing": + { + "x": "data/airline-ohe_x_test.npy", + "y": "data/airline-ohe_y_test.npy" + } + } + 
], + "reg-alpha": 0.9, + "max-bin": 256, + "scale-pos-weight": 2, + "learning-rate": 0.1, + "subsample": 1, + "reg-lambda": 1, + "min-child-weight": 0, + "max-depth": 8, + "max-leaves": 256, + "n-estimators": 1000, + "objective": "binary:logistic" + }, + { + "dataset": [ + { + "source": "npy", + "name": "higgs1m", + "training": + { + "x": "data/higgs1m_x_train.npy", + "y": "data/higgs1m_y_train.npy" + }, + "testing": + { + "x": "data/higgs1m_x_test.npy", + "y": "data/higgs1m_y_test.npy" + } + } + ], + "reg-alpha": 0.9, + "max-bin": 256, + "scale-pos-weight": 2, + "learning-rate": 0.1, + "subsample": 1, + "reg-lambda": 1, + "min-child-weight": 0, + "max-depth": 8, + "max-leaves": 256, + "n-estimators": 1000, + "objective": "binary:logistic", + "enable-experimental-json-serialization": "False", + "inplace-predict": "" + }, + { + "dataset": [ + { + "source": "npy", + "name": "letters", + "training": + { + "x": "data/letters_x_train.npy", + "y": "data/letters_y_train.npy" + }, + "testing": + { + "x": "data/letters_x_test.npy", + "y": "data/letters_y_test.npy" + } + } + ], + "learning-rate": 0.03, + "max-depth": 6, + "n-estimators": 1000, + "objective": "multi:softprob" + }, + { + "dataset": [ + { + "source": "npy", + "name": "mlsr", + "training": + { + "x": "data/mlsr_x_train.npy", + "y": "data/mlsr_y_train.npy" + } + } + ], + "max-bin": 256, + "learning-rate": 0.3, + "subsample": 1, + "reg-lambda": 2, + "min-child-weight": 1, + "min-split-loss": 0.1, + "max-depth": 8, + "n-estimators": 200, + "objective": "multi:softprob", + "single-precision-histogram": "" + }, + { + "dataset": [ + { + "source": "npy", + "name": "mortgage1Q", + "training": + { + "x": "data/mortgage1Q_x_train.npy", + "y": "data/mortgage1Q_y_train.npy" + } + } + ], + "n-estimators": 100, + "objective": "reg:squarederror", + "max-depth": 8, + "scale-pos-weight": 2, + "learning-rate": 0.1, + "subsample": 1, + "reg-alpha": 0.9, + "reg-lambda": 1, + "min-child-weight": 0, + "max-leaves": 256 + }, + { + "dataset": [ + { + "source": "npy", + "name": "plasticc", + "training": + { + "x": "data/plasticc_x_train.npy", + "y": "data/plasticc_y_train.npy" + }, + "testing": + { + "x": "data/plasticc_x_test.npy", + "y": "data/plasticc_y_test.npy" + } + } + ], + "n-estimators": 60, + "objective": "multi:softprob", + "max-depth": 7, + "subsample": 0.7, + "colsample-bytree": 0.7 + }, + { + "dataset": [ + { + "source": "npy", + "name": "santander", + "training": + { + "x": "data/santander_x_train.npy", + "y": "data/santander_y_train.npy" + }, + "testing": + { + "x": "data/santander_x_test.npy", + "y": "data/santander_y_test.npy" + } + } + ], + "n-estimators": 10000, + "objective": "binary:logistic", + "max-depth": 1, + "subsample": 0.5, + "eta": 0.1, + "colsample-bytree": 0.05, + "single-precision-histogram": "" + } + ] +} diff --git a/configs/xgboost/xgb_gpu_config.json b/configs/xgboost/xgb_gpu_config.json new file mode 100644 index 000000000..506ac0cfd --- /dev/null +++ b/configs/xgboost/xgb_gpu_config.json @@ -0,0 +1,208 @@ +{ + "common": { + "lib": "xgboost", + "data-format": "cudf", + "data-order": "F", + "dtype": "float32", + "algorithm": "gbt", + "tree-method": "gpu_hist", + "count-dmatrix":"" + }, + "cases": [ + { + "dataset": [ + { + "source": "npy", + "name": "abalone", + "training": + { + "x": "data/abalone_x_train.npy", + "y": "data/abalone_y_train.npy" + }, + "testing": + { + "x": "data/abalone_x_test.npy", + "y": "data/abalone_y_test.npy" + } + } + ], + "learning-rate": 0.03, + "max-depth": 6, + "n-estimators": 1000, + 
"objective": "reg:squarederror" + }, + { + "dataset": [ + { + "source": "npy", + "name": "airline-ohe", + "training": + { + "x": "data/airline-ohe_x_train.npy", + "y": "data/airline-ohe_y_train.npy" + }, + "testing": + { + "x": "data/airline-ohe_x_test.npy", + "y": "data/airline-ohe_y_test.npy" + } + } + ], + "reg-alpha": 0.9, + "max-bin": 256, + "scale-pos-weight": 2, + "learning-rate": 0.1, + "subsample": 1, + "reg-lambda": 1, + "min-child-weight": 0, + "max-depth": 8, + "max-leaves": 256, + "n-estimators": 1000, + "objective": "binary:logistic" + }, + { + "dataset": [ + { + "source": "npy", + "name": "higgs1m", + "training": + { + "x": "data/higgs1m_x_train.npy", + "y": "data/higgs1m_y_train.npy" + }, + "testing": + { + "x": "data/higgs1m_x_test.npy", + "y": "data/higgs1m_y_test.npy" + } + } + ], + "reg-alpha": 0.9, + "max-bin": 256, + "scale-pos-weight": 2, + "learning-rate": 0.1, + "subsample": 1, + "reg-lambda": 1, + "min-child-weight": 0, + "max-depth": 8, + "max-leaves": 256, + "n-estimators": 1000, + "objective": "binary:logistic", + "inplace-predict": "" + }, + { + "dataset": [ + { + "source": "npy", + "name": "letters", + "training": + { + "x": "data/letters_x_train.npy", + "y": "data/letters_y_train.npy" + }, + "testing": + { + "x": "data/letters_x_test.npy", + "y": "data/letters_y_test.npy" + } + } + ], + "learning-rate":0.03, + "max-depth": 6, + "n-estimators": 1000, + "objective": "multi:softprob" + }, + { + "dataset": [ + { + "source": "npy", + "name": "mlsr", + "training": + { + "x": "data/mlsr_x_train.npy", + "y": "data/mlsr_y_train.npy" + } + } + ], + "max-bin": 256, + "learning-rate": 0.3, + "subsample": 1, + "reg-lambda": 2, + "min-child-weight": 1, + "min-split-loss": 0.1, + "max-depth": 8, + "n-estimators": 200, + "objective": "multi:softprob" + }, + { + "dataset": [ + { + "source": "npy", + "name": "mortgage1Q", + "training": + { + "x": "data/mortgage1Q_x_train.npy", + "y": "data/mortgage1Q_y_train.npy" + } + } + ], + "n-estimators": 100, + "objective": "reg:squarederror", + "max-depth": 8, + "scale-pos-weight": 2, + "learning-rate": 0.1, + "subsample": 1, + "reg-alpha": 0.9, + "reg-lambda": 1, + "min-child-weight": 0, + "max-leaves": 256 + }, + { + "dataset": [ + { + "source": "npy", + "name": "plasticc", + "training": + { + "x": "data/plasticc_x_train.npy", + "y": "data/plasticc_y_train.npy" + }, + "testing": + { + "x": "data/plasticc_x_test.npy", + "y": "data/plasticc_y_test.npy" + } + } + ], + "n-estimators": 60, + "objective": "multi:softprob", + "max-depth": 7, + "subsample": 0.7, + "colsample-bytree": 0.7 + }, + { + "dataset": [ + { + "source": "npy", + "name": "santander", + "training": + { + "x": "data/santander_x_train.npy", + "y": "data/santander_y_train.npy" + }, + "testing": + { + "x": "data/santander_x_test.npy", + "y": "data/santander_y_test.npy" + } + } + ], + "n-estimators": 10000, + "objective": "binary:logistic", + "max-depth": 1, + "subsample": 0.5, + "eta": 0.1, + "colsample-bytree": 0.05 + } + ] +} diff --git a/cuml_bench/README.md b/cuml_bench/README.md index e65f11432..e36e77f3b 100644 --- a/cuml_bench/README.md +++ b/cuml_bench/README.md @@ -1,6 +1,6 @@ ## How to create conda environment for benchmarking -`conda create -n bench -c rapidsai -c conda-forge python=3.7 cuml pandas cudf` +`conda create -n bench -c rapidsai -c conda-forge python=3.7 scikit-learn cuml pandas cudf tqdm` ## Algorithms parameters diff --git a/daal4py_bench/README.md b/daal4py_bench/README.md index 85c7831df..c1c940ef0 100644 --- a/daal4py_bench/README.md +++ 
b/daal4py_bench/README.md @@ -1,7 +1,7 @@ ## How to create conda environment for benchmarking -`conda create -n bench -c intel python=3.7 daal4py pandas scikit-learn` +`conda create -n bench -c intel python=3.7 daal4py pandas scikit-learn tqdm` ## Algorithms parameters diff --git a/datasets/load_datasets.py b/datasets/load_datasets.py index e16c6c918..5fad3ac4b 100755 --- a/datasets/load_datasets.py +++ b/datasets/load_datasets.py @@ -18,26 +18,50 @@ import logging import os import sys +from pathlib import Path +from typing import Callable, Dict -from .loader import (a9a, codrnanorm, connect, covertype, gisette, ijcnn, - klaverjas, mnist, sensit, skin_segmentation) +from .loader_classification import (a_nine_a, airline, airline_ohe, bosch, + census, codrnanorm, epsilon, fraud, + gisette, higgs, higgs_one_m, ijcnn, + klaverjas, santander, skin_segmentation) +from .loader_multiclass import (connect, covertype, covtype, letters, mlsr, + mnist, msrank, plasticc, sensit) +from .loader_regression import abalone, mortgage_first_q, year_prediction_msd -dataset_loaders = { - "a9a": a9a, +dataset_loaders: Dict[str, Callable[[Path], bool]] = { + "a9a": a_nine_a, + "abalone": abalone, + "airline": airline, + "airline-ohe": airline_ohe, + "bosch": bosch, + "census": census, + "codrnanorm": codrnanorm, + "connect": connect, + "covertype": covertype, + "covtype": covtype, + "epsilon": epsilon, + "fraud": fraud, "gisette": gisette, + "higgs": higgs, + "higgs1m": higgs_one_m, "ijcnn": ijcnn, - "skin_segmentation": skin_segmentation, "klaverjas": klaverjas, - "connect": connect, + "letters": letters, + "mlsr": mlsr, "mnist": mnist, + "mortgage1Q": mortgage_first_q, + "msrank": msrank, + "plasticc": plasticc, + "santander": santander, "sensit": sensit, - "covertype": covertype, - "codrnanorm": codrnanorm, + "skin_segmentation": skin_segmentation, + "year_prediction_msd": year_prediction_msd, } -def try_load_dataset(dataset_name, output_directory): - if dataset_name in dataset_loaders.keys(): +def try_load_dataset(dataset_name: str, output_directory: Path) -> bool: + if dataset_name in dataset_loaders: try: return dataset_loaders[dataset_name](output_directory) except BaseException as ex: @@ -60,11 +84,11 @@ def try_load_dataset(dataset_name, output_directory): args = parser.parse_args() if args.list: - for key in dataset_loaders.keys(): + for key in dataset_loaders: print(key) sys.exit(0) - root_dir = os.environ['DATASETSROOT'] + root_dir = Path(os.environ['DATASETSROOT']) if args.datasets is not None: for val in dataset_loaders.values(): diff --git a/datasets/loader.py b/datasets/loader.py deleted file mode 100755 index 45690c8ae..000000000 --- a/datasets/loader.py +++ /dev/null @@ -1,423 +0,0 @@ -# =============================================================================== -# Copyright 2020-2021 Intel Corporation -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-# =============================================================================== - -import logging -import os -from urllib.request import urlretrieve - -import numpy as np -import pandas as pd -from sklearn.datasets import fetch_openml -from sklearn.model_selection import train_test_split - - -def a9a(dataset_dir=None): - """ - Author: Ronny Kohavi","Barry Becker - libSVM","AAD group - Source: original - Date unknown - Cite: http://archive.ics.uci.edu/ml/datasets/Adult - - Classification task. n_classes = 2. - a9a X train dataset (39073, 123) - a9a y train dataset (39073, 1) - a9a X test dataset (9769, 123) - a9a y test dataset (9769, 1) - """ - dataset_name = 'a9a' - os.makedirs(dataset_dir, exist_ok=True) - - X, y = fetch_openml(name='a9a', return_X_y=True, - as_frame=False, data_home=dataset_dir) - X = pd.DataFrame(X.todense()) - y = pd.DataFrame(y) - - y[y == -1] = 0 - - logging.info('a9a dataset is downloaded') - logging.info('reading CSV file...') - - x_train, x_test, y_train, y_test = train_test_split( - X, y, test_size=0.2, random_state=11) - for data, name in zip((x_train, x_test, y_train, y_test), - ('x_train', 'x_test', 'y_train', 'y_test')): - filename = f'{dataset_name}_{name}.csv' - data.to_csv(os.path.join(dataset_dir, filename), - header=False, index=False) - logging.info(f'dataset {dataset_name} ready.') - return True - - -def ijcnn(dataset_dir=None): - """ - Author: Danil Prokhorov. - libSVM,AAD group - Cite: Danil Prokhorov. IJCNN 2001 neural network competition. - Slide presentation in IJCNN'01, - Ford Research Laboratory, 2001. http://www.geocities.com/ijcnn/nnc_ijcnn01.pdf. - - Classification task. n_classes = 2. - ijcnn X train dataset (153344, 22) - ijcnn y train dataset (153344, 1) - ijcnn X test dataset (38337, 22) - ijcnn y test dataset (38337, 1) - """ - dataset_name = 'ijcnn' - os.makedirs(dataset_dir, exist_ok=True) - - X, y = fetch_openml(name='ijcnn', return_X_y=True, - as_frame=False, data_home=dataset_dir) - X = pd.DataFrame(X.todense()) - y = pd.DataFrame(y) - - y[y == -1] = 0 - - logging.info(f'{dataset_name} dataset is downloaded') - logging.info('reading CSV file...') - - x_train, x_test, y_train, y_test = train_test_split( - X, y, test_size=0.2, random_state=42) - for data, name in zip((x_train, x_test, y_train, y_test), - ('x_train', 'x_test', 'y_train', 'y_test')): - filename = f'{dataset_name}_{name}.csv' - data.to_csv(os.path.join(dataset_dir, filename), - header=False, index=False) - logging.info(f'dataset {dataset_name} ready.') - return True - - -def skin_segmentation(dataset_dir=None): - """ - Abstract: - The Skin Segmentation dataset is constructed over B, G, R color space. - Skin and Nonskin dataset is generated using skin textures from - face images of diversity of age, gender, and race people. - Author: Rajen Bhatt, Abhinav Dhall, rajen.bhatt '@' gmail.com, IIT Delhi. - - Classification task. n_classes = 2. 
- skin_segmentation X train dataset (196045, 3) - skin_segmentation y train dataset (196045, 1) - skin_segmentation X test dataset (49012, 3) - skin_segmentation y test dataset (49012, 1) - """ - dataset_name = 'skin_segmentation' - os.makedirs(dataset_dir, exist_ok=True) - - X, y = fetch_openml(name='skin-segmentation', - return_X_y=True, as_frame=True, data_home=dataset_dir) - y = y.astype(int) - y[y == 2] = 0 - - logging.info(f'{dataset_name} dataset is downloaded') - logging.info('reading CSV file...') - - x_train, x_test, y_train, y_test = train_test_split( - X, y, test_size=0.2, random_state=42) - for data, name in zip((x_train, x_test, y_train, y_test), - ('x_train', 'x_test', 'y_train', 'y_test')): - filename = f'{dataset_name}_{name}.csv' - data.to_csv(os.path.join(dataset_dir, filename), - header=False, index=False) - logging.info(f'dataset {dataset_name} ready.') - return True - - -def klaverjas(dataset_dir=None): - """ - Abstract: - Klaverjas is an example of the Jack-Nine card games, - which are characterized as trick-taking games where the the Jack - and nine of the trump suit are the highest-ranking trumps, and - the tens and aces of other suits are the most valuable cards - of these suits. It is played by four players in two teams. - - Task Information: - Classification task. n_classes = 2. - klaverjas X train dataset (196045, 3) - klaverjas y train dataset (196045, 1) - klaverjas X test dataset (49012, 3) - klaverjas y test dataset (49012, 1) - """ - dataset_name = 'klaverjas' - os.makedirs(dataset_dir, exist_ok=True) - - X, y = fetch_openml(name='Klaverjas2018', return_X_y=True, - as_frame=True, data_home=dataset_dir) - - y = y.cat.codes - logging.info(f'{dataset_name} dataset is downloaded') - logging.info('reading CSV file...') - - x_train, x_test, y_train, y_test = train_test_split( - X, y, train_size=0.2, random_state=42) - for data, name in zip((x_train, x_test, y_train, y_test), - ('x_train', 'x_test', 'y_train', 'y_test')): - filename = f'{dataset_name}_{name}.csv' - data.to_csv(os.path.join(dataset_dir, filename), - header=False, index=False) - logging.info(f'dataset {dataset_name} ready.') - return True - - -def connect(dataset_dir=None): - """ - Source: - UC Irvine Machine Learning Repository - http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.htm - - Classification task. n_classes = 3. - connect X train dataset (60801, 126) - connect y train dataset (60801, 1) - connect X test dataset (49012, 126) - connect y test dataset (49012, 1) - """ - dataset_name = 'connect' - os.makedirs(dataset_dir, exist_ok=True) - - X, y = fetch_openml(name='connect-4', version=1, return_X_y=True, - as_frame=False, data_home=dataset_dir) - X = pd.DataFrame(X.todense()) - y = pd.DataFrame(y) - y = y.astype(int) - - logging.info(f'{dataset_name} dataset is downloaded') - logging.info('reading CSV file...') - - x_train, x_test, y_train, y_test = train_test_split( - X, y, test_size=0.1, random_state=42) - for data, name in zip((x_train, x_test, y_train, y_test), - ('x_train', 'x_test', 'y_train', 'y_test')): - filename = f'{dataset_name}_{name}.csv' - data.to_csv(os.path.join(dataset_dir, filename), - header=False, index=False) - logging.info(f'dataset {dataset_name} ready.') - return True - - -def mnist(dataset_dir=None): - """ - Abstract: - The MNIST database of handwritten digits with 784 features. - It can be split in a training set of the first 60,000 examples, - and a test set of 10,000 examples - Source: - Yann LeCun, Corinna Cortes, Christopher J.C. 
Burges - http://yann.lecun.com/exdb/mnist/ - - Classification task. n_classes = 10. - mnist X train dataset (60000, 784) - mnist y train dataset (60000, 1) - mnist X test dataset (10000, 784) - mnist y test dataset (10000, 1) - """ - dataset_name = 'mnist' - - os.makedirs(dataset_dir, exist_ok=True) - - X, y = fetch_openml(name='mnist_784', return_X_y=True, - as_frame=True, data_home=dataset_dir) - y = y.astype(int) - X = X / 255 - - logging.info(f'{dataset_name} dataset is downloaded') - logging.info('reading CSV file...') - - x_train, x_test, y_train, y_test = train_test_split( - X, y, test_size=10000, shuffle=False) - for data, name in zip((x_train, x_test, y_train, y_test), - ('x_train', 'x_test', 'y_train', 'y_test')): - filename = f'{dataset_name}_{name}.csv' - data.to_csv(os.path.join(dataset_dir, filename), - header=False, index=False) - logging.info(f'dataset {dataset_name} ready.') - return True - - -def sensit(dataset_dir=None): - """ - Abstract: Vehicle classification in distributed sensor networks. - Author: M. Duarte, Y. H. Hu - Source: [original](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets) - - Classification task. n_classes = 2. - sensit X train dataset (196045, 3) - sensit y train dataset (196045, 1) - sensit X test dataset (49012, 3) - sensit y test dataset (49012, 1) - """ - dataset_name = 'sensit' - os.makedirs(dataset_dir, exist_ok=True) - - X, y = fetch_openml(name='SensIT-Vehicle-Combined', - return_X_y=True, as_frame=False, data_home=dataset_dir) - X = pd.DataFrame(X.todense()) - y = pd.DataFrame(y) - y = y.astype(int) - - logging.info(f'{dataset_name} dataset is downloaded') - logging.info('reading CSV file...') - - x_train, x_test, y_train, y_test = train_test_split( - X, y, test_size=0.2, random_state=42) - for data, name in zip((x_train, x_test, y_train, y_test), - ('x_train', 'x_test', 'y_train', 'y_test')): - filename = f'{dataset_name}_{name}.csv' - data.to_csv(os.path.join(dataset_dir, filename), - header=False, index=False) - logging.info(f'dataset {dataset_name} ready.') - return True - - -def covertype(dataset_dir=None): - """ - Abstract: This is the original version of the famous - covertype dataset in ARFF format. - Author: Jock A. Blackard, Dr. Denis J. Dean, Dr. Charles W. Anderson - Source: [original](https://archive.ics.uci.edu/ml/datasets/covertype) - - Classification task. n_classes = 7. - covertype X train dataset (390852, 54) - covertype y train dataset (390852, 1) - covertype X test dataset (97713, 54) - covertype y test dataset (97713, 1) - """ - dataset_name = 'covertype' - os.makedirs(dataset_dir, exist_ok=True) - - X, y = fetch_openml(name='covertype', version=3, return_X_y=True, - as_frame=True, data_home=dataset_dir) - y = y.astype(int) - - logging.info(f'{dataset_name} dataset is downloaded') - logging.info('reading CSV file...') - - x_train, x_test, y_train, y_test = train_test_split( - X, y, test_size=0.2, random_state=42) - for data, name in zip((x_train, x_test, y_train, y_test), - ('x_train', 'x_test', 'y_train', 'y_test')): - filename = f'{dataset_name}_{name}.csv' - data.to_csv(os.path.join(dataset_dir, filename), - header=False, index=False) - logging.info(f'dataset {dataset_name} ready.') - return True - - -def codrnanorm(dataset_dir=None): - """ - Abstract: Detection of non-coding RNAs on the basis of predicted secondary - structure formation free energy change. - Author: Andrew V Uzilov,Joshua M Keegan,David H Mathews. 
- Source: [original](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets) - - Classification task. n_classes = 2. - codrnanorm X train dataset (390852, 8) - codrnanorm y train dataset (390852, 1) - codrnanorm X test dataset (97713, 8) - codrnanorm y test dataset (97713, 1) - """ - dataset_name = 'codrnanorm' - os.makedirs(dataset_dir, exist_ok=True) - - X, y = fetch_openml(name='codrnaNorm', return_X_y=True, - as_frame=False, data_home=dataset_dir) - X = pd.DataFrame(X.todense()) - y = pd.DataFrame(y) - - logging.info(f'{dataset_name} dataset is downloaded') - logging.info('reading CSV file...') - - x_train, x_test, y_train, y_test = train_test_split( - X, y, test_size=0.2, random_state=42) - for data, name in zip((x_train, x_test, y_train, y_test), - ('x_train', 'x_test', 'y_train', 'y_test')): - filename = f'{dataset_name}_{name}.csv' - data.to_csv(os.path.join(dataset_dir, filename), - header=False, index=False) - logging.info(f'dataset {dataset_name} ready.') - return True - - -def gisette(dataset_dir=None): - """ - GISETTE is a handwritten digit recognition problem. - The problem is to separate the highly confusable digits '4' and '9'. - This dataset is one of five datasets of the NIPS 2003 feature selection challenge. - - Classification task. n_classes = 2. - gisette X train dataset (6000, 5000) - gisette y train dataset (6000, 1) - gisette X test dataset (1000, 5000) - gisette y test dataset (1000, 1) - """ - dataset_name = 'gisette' - os.makedirs(dataset_dir, exist_ok=True) - - cache_dir = os.path.join(dataset_dir, '_gisette') - os.makedirs(cache_dir, exist_ok=True) - - domen_hhtp = 'http://archive.ics.uci.edu/ml/machine-learning-databases/' - - gisette_train_data_url = domen_hhtp + '/gisette/GISETTE/gisette_train.data' - filename_train_data = os.path.join(cache_dir, 'gisette_train.data') - if not os.path.exists(filename_train_data): - urlretrieve(gisette_train_data_url, filename_train_data) - - gisette_train_labels_url = domen_hhtp + '/gisette/GISETTE/gisette_train.labels' - filename_train_labels = os.path.join(cache_dir, 'gisette_train.labels') - if not os.path.exists(filename_train_labels): - urlretrieve(gisette_train_labels_url, filename_train_labels) - - gisette_test_data_url = domen_hhtp + '/gisette/GISETTE/gisette_valid.data' - filename_test_data = os.path.join(cache_dir, 'gisette_valid.data') - if not os.path.exists(filename_test_data): - urlretrieve(gisette_test_data_url, filename_test_data) - - gisette_test_labels_url = domen_hhtp + '/gisette/gisette_valid.labels' - filename_test_labels = os.path.join(cache_dir, 'gisette_valid.labels') - if not os.path.exists(filename_test_labels): - urlretrieve(gisette_test_labels_url, filename_test_labels) - - logging.info('gisette dataset is downloaded') - logging.info('reading CSV file...') - - num_cols = 5000 - - df_train = pd.read_csv(filename_train_data, header=None) - df_labels = pd.read_csv(filename_train_labels, header=None) - num_train = 6000 - x_train = df_train.iloc[:num_train].values - x_train = pd.DataFrame(np.array([np.fromstring( - elem[0], dtype=int, count=num_cols, sep=' ') for elem in x_train])) - y_train = df_labels.iloc[:num_train].values - y_train = pd.DataFrame((y_train > 0).astype(int)) - - num_train = 1000 - df_test = pd.read_csv(filename_test_data, header=None) - df_labels = pd.read_csv(filename_test_labels, header=None) - x_test = df_test.iloc[:num_train].values - x_test = pd.DataFrame(np.array( - [np.fromstring(elem[0], dtype=int, count=num_cols, sep=' ') for elem in x_test])) - y_test = 
df_labels.iloc[:num_train].values - y_test = pd.DataFrame((y_test > 0).astype(int)) - - for data, name in zip((x_train, x_test, y_train, y_test), - ('x_train', 'x_test', 'y_train', 'y_test')): - filename = f'{dataset_name}_{name}.csv' - data.to_csv(os.path.join(dataset_dir, filename), - header=False, index=False) - - logging.info('dataset gisette ready.') - return True diff --git a/datasets/loader_classification.py b/datasets/loader_classification.py new file mode 100644 index 000000000..be981952e --- /dev/null +++ b/datasets/loader_classification.py @@ -0,0 +1,598 @@ +# =============================================================================== +# Copyright 2020-2021 Intel Corporation +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# =============================================================================== + +import logging +import os +import subprocess +from pathlib import Path +from typing import Any + +import numpy as np +import pandas as pd +from sklearn.datasets import fetch_openml, load_svmlight_file +from sklearn.model_selection import train_test_split + +from .loader_utils import retrieve + + +def a_nine_a(dataset_dir: Path) -> bool: + """ + Author: Ronny Kohavi","Barry Becker + libSVM","AAD group + Source: original - Date unknown + Site: http://archive.ics.uci.edu/ml/datasets/Adult + + Classification task. n_classes = 2. 
+ a9a X train dataset (39073, 123) + a9a y train dataset (39073, 1) + a9a X test dataset (9769, 123) + a9a y test dataset (9769, 1) + """ + dataset_name = 'a9a' + os.makedirs(dataset_dir, exist_ok=True) + + X, y = fetch_openml(name='a9a', return_X_y=True, + as_frame=False, data_home=dataset_dir) + X = pd.DataFrame(X.todense()) + y = pd.DataFrame(y) + + y[y == -1] = 0 + + logging.info(f'{dataset_name} is loaded, started parsing...') + + x_train, x_test, y_train, y_test = train_test_split( + X, y, test_size=0.2, random_state=11) + for data, name in zip((x_train, x_test, y_train, y_test), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True + + +def airline(dataset_dir: Path) -> bool: + """ + Airline dataset + http://kt.ijs.si/elena_ikonomovska/data.html + + TaskType:binclass + NumberOfFeatures:13 + NumberOfInstances:115M + """ + dataset_name = 'airline' + os.makedirs(dataset_dir, exist_ok=True) + + url = 'http://kt.ijs.si/elena_ikonomovska/datasets/airline/airline_14col.data.bz2' + local_url = os.path.join(dataset_dir, os.path.basename(url)) + if not os.path.isfile(local_url): + logging.info(f'Started loading {dataset_name}') + retrieve(url, local_url) + logging.info(f'{dataset_name} is loaded, started parsing...') + + cols = [ + "Year", "Month", "DayofMonth", "DayofWeek", "CRSDepTime", + "CRSArrTime", "UniqueCarrier", "FlightNum", "ActualElapsedTime", + "Origin", "Dest", "Distance", "Diverted", "ArrDelay" + ] + + # load the data as int16 + dtype = np.int16 + + dtype_columns = { + "Year": dtype, "Month": dtype, "DayofMonth": dtype, "DayofWeek": dtype, + "CRSDepTime": dtype, "CRSArrTime": dtype, "FlightNum": dtype, + "ActualElapsedTime": dtype, "Distance": + dtype, + "Diverted": dtype, "ArrDelay": dtype, + } + + df: Any = pd.read_csv(local_url, names=cols, dtype=dtype_columns) + + # Encode categoricals as numeric + for col in df.select_dtypes(['object']).columns: + df[col] = df[col].astype("category").cat.codes + + # Turn into binary classification problem + df["ArrDelayBinary"] = 1 * (df["ArrDelay"] > 0) + + X = df[df.columns.difference(["ArrDelay", "ArrDelayBinary"]) + ].to_numpy(dtype=np.float32) + y = df["ArrDelayBinary"].to_numpy(dtype=np.float32) + del df + X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=77, + test_size=0.2, + ) + for data, name in zip((X_train, X_test, y_train, y_test), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True + + +def airline_ohe(dataset_dir: Path) -> bool: + """ + Dataset from szilard benchmarks: https://github.com/szilard/GBM-perf + TaskType:binclass + NumberOfFeatures:700 + NumberOfInstances:10100000 + """ + dataset_name = 'airline-ohe' + os.makedirs(dataset_dir, exist_ok=True) + + url_train = 'https://s3.amazonaws.com/benchm-ml--main/train-10m.csv' + url_test = 'https://s3.amazonaws.com/benchm-ml--main/test.csv' + local_url_train = os.path.join(dataset_dir, os.path.basename(url_train)) + local_url_test = os.path.join(dataset_dir, os.path.basename(url_test)) + if not os.path.isfile(local_url_train): + logging.info(f'Started loading {dataset_name} train') + retrieve(url_train, local_url_train) + if not os.path.isfile(local_url_test): + logging.info(f'Started loading {dataset_name} test') + retrieve(url_test, local_url_test) + 
logging.info(f'{dataset_name} is loaded, started parsing...') + + sets = [] + labels = [] + + categorical_names = ["Month", "DayofMonth", + "DayOfWeek", "UniqueCarrier", "Origin", "Dest"] + + for local_url in [local_url_train, local_url_test]: + df = pd.read_csv(local_url, nrows=1000000 + if local_url.endswith('train-10m.csv') else None) + X = df.drop('dep_delayed_15min', axis=1) + y: Any = df["dep_delayed_15min"] + + y_num = np.where(y == "Y", 1, 0) + + sets.append(X) + labels.append(y_num) + + n_samples_train = sets[0].shape[0] + + X_final: Any = pd.concat(sets) + X_final = pd.get_dummies(X_final, columns=categorical_names) + sets = [X_final[:n_samples_train], X_final[n_samples_train:]] + + for data, name in zip((sets[0], sets[1], labels[0], labels[1]), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) # type: ignore + logging.info(f'dataset {dataset_name} is ready.') + return True + + +def bosch(dataset_dir: Path) -> bool: + """ + Bosch Production Line Performance data set + https://www.kaggle.com/c/bosch-production-line-performance + + Requires Kaggle API and API token (https://github.com/Kaggle/kaggle-api) + Contains missing values as NaN. + + TaskType:binclass + NumberOfFeatures:968 + NumberOfInstances:1.184M + """ + dataset_name = 'bosch' + os.makedirs(dataset_dir, exist_ok=True) + + filename = "train_numeric.csv.zip" + local_url = os.path.join(dataset_dir, filename) + + if not os.path.isfile(local_url): + logging.info(f'Started loading {dataset_name}') + args = ["kaggle", "competitions", "download", "-c", + "bosch-production-line-performance", "-f", filename, "-p", str(dataset_dir)] + _ = subprocess.check_output(args) + logging.info(f'{dataset_name} is loaded, started parsing...') + X = pd.read_csv(local_url, index_col=0, compression='zip', dtype=np.float32) + y = X.iloc[:, -1].to_numpy(dtype=np.float32) + X.drop(X.columns[-1], axis=1, inplace=True) + X_np = X.to_numpy(dtype=np.float32) + X_train, X_test, y_train, y_test = train_test_split(X_np, y, random_state=77, + test_size=0.2, + ) + for data, name in zip((X_train, X_test, y_train, y_test), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True + + +def census(dataset_dir: Path) -> bool: + """ + # TODO: add a loading instruction + """ + return False + + +def codrnanorm(dataset_dir: Path) -> bool: + """ + Abstract: Detection of non-coding RNAs on the basis of predicted secondary + structure formation free energy change. + Author: Andrew V Uzilov,Joshua M Keegan,David H Mathews. + Source: [original](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets) + + Classification task. n_classes = 2.
+ codrnanorm X train dataset (390852, 8) + codrnanorm y train dataset (390852, 1) + codrnanorm X test dataset (97713, 8) + codrnanorm y test dataset (97713, 1) + """ + dataset_name = 'codrnanorm' + os.makedirs(dataset_dir, exist_ok=True) + + X, y = fetch_openml(name='codrnaNorm', return_X_y=True, + as_frame=False, data_home=dataset_dir) + X = pd.DataFrame(X.todense()) + y = pd.DataFrame(y) + + logging.info(f'{dataset_name} is loaded, started parsing...') + + x_train, x_test, y_train, y_test = train_test_split( + X, y, test_size=0.2, random_state=42) + for data, name in zip((x_train, x_test, y_train, y_test), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True + + +def epsilon(dataset_dir: Path) -> bool: + """ + Epsilon dataset + https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html + + TaskType:binclass + NumberOfFeatures:2000 + NumberOfInstances:500K + """ + dataset_name = 'epsilon' + os.makedirs(dataset_dir, exist_ok=True) + + url_train = 'https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary' \ + '/epsilon_normalized.bz2' + url_test = 'https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary' \ + '/epsilon_normalized.t.bz2' + local_url_train = os.path.join(dataset_dir, os.path.basename(url_train)) + local_url_test = os.path.join(dataset_dir, os.path.basename(url_test)) + + if not os.path.isfile(local_url_train): + logging.info(f'Started loading {dataset_name}, train') + retrieve(url_train, local_url_train) + if not os.path.isfile(local_url_test): + logging.info(f'Started loading {dataset_name}, test') + retrieve(url_test, local_url_test) + logging.info(f'{dataset_name} is loaded, started parsing...') + X_train, y_train = load_svmlight_file(local_url_train, + dtype=np.float32) + X_test, y_test = load_svmlight_file(local_url_test, + dtype=np.float32) + X_train = X_train.toarray() + X_test = X_test.toarray() + y_train[y_train <= 0] = 0 + y_test[y_test <= 0] = 0 + + for data, name in zip((X_train, X_test, y_train, y_test), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True + + +def fraud(dataset_dir: Path) -> bool: + """ + Credit Card Fraud Detection contest + https://www.kaggle.com/mlg-ulb/creditcardfraud + + Requires Kaggle API and API token (https://github.com/Kaggle/kaggle-api) + Contains missing values as NaN. 
+ + TaskType:binclass + NumberOfFeatures:28 + NumberOfInstances:285K + """ + dataset_name = 'fraud' + os.makedirs(dataset_dir, exist_ok=True) + + filename = "creditcard.csv" + local_url = os.path.join(dataset_dir, filename) + + if not os.path.isfile(local_url): + logging.info(f'Started loading {dataset_name}') + args = ["kaggle", "datasets", "download", "mlg-ulb/creditcardfraud", "-f", + filename, "-p", str(dataset_dir)] + _ = subprocess.check_output(args) + logging.info(f'{dataset_name} is loaded, started parsing...') + + df = pd.read_csv(local_url + ".zip", dtype=np.float32) + X = df[[col for col in df.columns if col.startswith('V')]].to_numpy(dtype=np.float32) + y = df['Class'].to_numpy(dtype=np.float32) + X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=77, + test_size=0.2, + ) + for data, name in zip((X_train, X_test, y_train, y_test), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True + + +def gisette(dataset_dir: Path) -> bool: + """ + GISETTE is a handwritten digit recognition problem. + The problem is to separate the highly confusable digits '4' and '9'. + This dataset is one of five datasets of the NIPS 2003 feature selection challenge. + + Classification task. n_classes = 2. + gisette X train dataset (6000, 5000) + gisette y train dataset (6000, 1) + gisette X test dataset (1000, 5000) + gisette y test dataset (1000, 1) + """ + dataset_name = 'gisette' + os.makedirs(dataset_dir, exist_ok=True) + + cache_dir = os.path.join(dataset_dir, '_gisette') + os.makedirs(cache_dir, exist_ok=True) + + domen_hhtp = 'http://archive.ics.uci.edu/ml/machine-learning-databases/' + + gisette_train_data_url = domen_hhtp + '/gisette/GISETTE/gisette_train.data' + filename_train_data = os.path.join(cache_dir, 'gisette_train.data') + if not os.path.exists(filename_train_data): + retrieve(gisette_train_data_url, filename_train_data) + + gisette_train_labels_url = domen_hhtp + '/gisette/GISETTE/gisette_train.labels' + filename_train_labels = os.path.join(cache_dir, 'gisette_train.labels') + if not os.path.exists(filename_train_labels): + retrieve(gisette_train_labels_url, filename_train_labels) + + gisette_test_data_url = domen_hhtp + '/gisette/GISETTE/gisette_valid.data' + filename_test_data = os.path.join(cache_dir, 'gisette_valid.data') + if not os.path.exists(filename_test_data): + retrieve(gisette_test_data_url, filename_test_data) + + gisette_test_labels_url = domen_hhtp + '/gisette/gisette_valid.labels' + filename_test_labels = os.path.join(cache_dir, 'gisette_valid.labels') + if not os.path.exists(filename_test_labels): + retrieve(gisette_test_labels_url, filename_test_labels) + + logging.info(f'{dataset_name} is loaded, started parsing...') + + num_cols = 5000 + + df_train = pd.read_csv(filename_train_data, header=None) + df_labels = pd.read_csv(filename_train_labels, header=None) + num_train = 6000 + x_train_arr = df_train.iloc[:num_train].values + x_train = pd.DataFrame(np.array([np.fromstring( + elem[0], dtype=int, count=num_cols, sep=' ') for elem in x_train_arr])) # type: ignore + y_train_arr = df_labels.iloc[:num_train].values + y_train = pd.DataFrame((y_train_arr > 0).astype(int)) + + num_train = 1000 + df_test = pd.read_csv(filename_test_data, header=None) + df_labels = pd.read_csv(filename_test_labels, header=None) + x_test_arr = df_test.iloc[:num_train].values + x_test = pd.DataFrame(np.array( + 
[np.fromstring( + elem[0], + dtype=int, count=num_cols, sep=' ') # type: ignore + for elem in x_test_arr])) + y_test_arr = df_labels.iloc[:num_train].values + y_test = pd.DataFrame((y_test_arr > 0).astype(int)) + + for data, name in zip((x_train, x_test, y_train, y_test), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data.to_numpy()) + logging.info('dataset gisette is ready.') + return True + + +def higgs(dataset_dir: Path) -> bool: + """ + Higgs dataset from UCI machine learning repository + https://archive.ics.uci.edu/ml/datasets/HIGGS + + TaskType:binclass + NumberOfFeatures:28 + NumberOfInstances:11M + """ + dataset_name = 'higgs' + os.makedirs(dataset_dir, exist_ok=True) + + url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz' + local_url = os.path.join(dataset_dir, os.path.basename(url)) + if not os.path.isfile(local_url): + logging.info(f'Started loading {dataset_name}') + retrieve(url, local_url) + logging.info(f'{dataset_name} is loaded, started parsing...') + + higgs = pd.read_csv(local_url) + X = higgs.iloc[:, 1:].to_numpy(dtype=np.float32) + y = higgs.iloc[:, 0].to_numpy(dtype=np.float32) + X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=77, + test_size=0.2, + ) + for data, name in zip((X_train, X_test, y_train, y_test), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True + + +def higgs_one_m(dataset_dir: Path) -> bool: + """ + Higgs dataset from UCI machine learning repository + https://archive.ics.uci.edu/ml/datasets/HIGGS + + Only the first 1.5M samples are taken + + TaskType:binclass + NumberOfFeatures:28 + NumberOfInstances:1.5M + """ + dataset_name = 'higgs1m' + os.makedirs(dataset_dir, exist_ok=True) + + url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz' + local_url = os.path.join(dataset_dir, os.path.basename(url)) + if not os.path.isfile(local_url): + logging.info(f'Started loading {dataset_name}') + retrieve(url, local_url) + logging.info(f'{dataset_name} is loaded, started parsing...') + + nrows_train, nrows_test, dtype = 1000000, 500000, np.float32 + data: Any = pd.read_csv(local_url, delimiter=",", header=None, + compression="gzip", dtype=dtype, nrows=nrows_train+nrows_test) + + data = data[list(data.columns[1:])+list(data.columns[0:1])] + n_features = data.shape[1]-1 + train_data = np.ascontiguousarray(data.values[:nrows_train, :n_features], dtype=dtype) + train_label = np.ascontiguousarray(data.values[:nrows_train, n_features], dtype=dtype) + test_data = np.ascontiguousarray( + data.values[nrows_train: nrows_train + nrows_test, : n_features], + dtype=dtype) + test_label = np.ascontiguousarray( + data.values[nrows_train: nrows_train + nrows_test, n_features], + dtype=dtype) + for data, name in zip((train_data, test_data, train_label, test_label), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True + + +def ijcnn(dataset_dir: Path) -> bool: + """ + Author: Danil Prokhorov. + libSVM,AAD group + Cite: Danil Prokhorov. IJCNN 2001 neural network competition. + Slide presentation in IJCNN'01, + Ford Research Laboratory, 2001. http://www.geocities.com/ijcnn/nnc_ijcnn01.pdf.
+ + Classification task. n_classes = 2. + ijcnn X train dataset (153344, 22) + ijcnn y train dataset (153344, 1) + ijcnn X test dataset (38337, 22) + ijcnn y test dataset (38337, 1) + """ + dataset_name = 'ijcnn' + os.makedirs(dataset_dir, exist_ok=True) + + X, y = fetch_openml(name='ijcnn', return_X_y=True, + as_frame=False, data_home=dataset_dir) + X = pd.DataFrame(X.todense()) + y = pd.DataFrame(y) + + y[y == -1] = 0 + + logging.info(f'{dataset_name} is loaded, started parsing...') + + x_train, x_test, y_train, y_test = train_test_split( + X, y, test_size=0.2, random_state=42) + for data, name in zip((x_train, x_test, y_train, y_test), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True + + +def klaverjas(dataset_dir: Path) -> bool: + """ + Abstract: + Klaverjas is an example of the Jack-Nine card games, + which are characterized as trick-taking games where the Jack + and nine of the trump suit are the highest-ranking trumps, and + the tens and aces of other suits are the most valuable cards + of these suits. It is played by four players in two teams. + + Task Information: + Classification task. n_classes = 2. + klaverjas X train dataset (196308, 32) + klaverjas y train dataset (196308, 1) + klaverjas X test dataset (785233, 32) + klaverjas y test dataset (785233, 1) + """ + dataset_name = 'klaverjas' + os.makedirs(dataset_dir, exist_ok=True) + + X, y = fetch_openml(name='Klaverjas2018', return_X_y=True, + as_frame=True, data_home=dataset_dir) + + y = y.cat.codes + logging.info(f'{dataset_name} is loaded, started parsing...') + + x_train, x_test, y_train, y_test = train_test_split( + X, y, train_size=0.2, random_state=42) + for data, name in zip((x_train, x_test, y_train, y_test), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True + + +def santander(dataset_dir: Path) -> bool: + """ + # TODO: add a loading instruction + """ + return False + + +def skin_segmentation(dataset_dir: Path) -> bool: + """ + Abstract: + The Skin Segmentation dataset is constructed over B, G, R color space. + Skin and Nonskin dataset is generated using skin textures from + face images of diversity of age, gender, and race people. + Author: Rajen Bhatt, Abhinav Dhall, rajen.bhatt '@' gmail.com, IIT Delhi. + + Classification task. n_classes = 2.
+ skin_segmentation X train dataset (196045, 3) + skin_segmentation y train dataset (196045, 1) + skin_segmentation X test dataset (49012, 3) + skin_segmentation y test dataset (49012, 1) + """ + dataset_name = 'skin_segmentation' + os.makedirs(dataset_dir, exist_ok=True) + + X, y = fetch_openml(name='skin-segmentation', + return_X_y=True, as_frame=True, data_home=dataset_dir) + y = y.astype(int) + y[y == 2] = 0 + + logging.info(f'{dataset_name} is loaded, started parsing...') + + x_train, x_test, y_train, y_test = train_test_split( + X, y, test_size=0.2, random_state=42) + for data, name in zip((x_train, x_test, y_train, y_test), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True diff --git a/datasets/loader_multiclass.py b/datasets/loader_multiclass.py new file mode 100644 index 000000000..69b1da1e6 --- /dev/null +++ b/datasets/loader_multiclass.py @@ -0,0 +1,290 @@ +# =============================================================================== +# Copyright 2020-2021 Intel Corporation +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# =============================================================================== + +import logging +import os +import tarfile +from pathlib import Path +from typing import Any + +import numpy as np +import pandas as pd +from sklearn.datasets import fetch_covtype, fetch_openml +from sklearn.model_selection import train_test_split + +from .loader_utils import count_lines, read_libsvm_msrank, retrieve + + +def connect(dataset_dir: Path) -> bool: + """ + Source: + UC Irvine Machine Learning Repository + http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.htm + + Classification task. n_classes = 3. + connect X train dataset (60801, 126) + connect y train dataset (60801, 1) + connect X test dataset (6756, 126) + connect y test dataset (6756, 1) + """ + dataset_name = 'connect' + os.makedirs(dataset_dir, exist_ok=True) + + X, y = fetch_openml(name='connect-4', return_X_y=True, + as_frame=False, data_home=dataset_dir) + X = pd.DataFrame(X.todense()) + y = pd.DataFrame(y) + y = y.astype(int) + + logging.info(f'{dataset_name} is loaded, started parsing...') + + x_train, x_test, y_train, y_test = train_test_split( + X, y, test_size=0.1, random_state=42) + for data, name in zip((x_train, x_test, y_train, y_test), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True + + +def covertype(dataset_dir: Path) -> bool: + """ + Abstract: This is the original version of the famous + covertype dataset in ARFF format. + Author: Jock A. Blackard, Dr. Denis J. Dean, Dr. Charles W. Anderson + Source: [original](https://archive.ics.uci.edu/ml/datasets/covertype) + + Classification task. n_classes = 7. 
+ covertype X train dataset (390852, 54) + covertype y train dataset (390852, 1) + covertype X test dataset (97713, 54) + covertype y test dataset (97713, 1) + """ + dataset_name = 'covertype' + os.makedirs(dataset_dir, exist_ok=True) + + X, y = fetch_openml(name='covertype', version=3, return_X_y=True, + as_frame=True, data_home=dataset_dir) + y = y.astype(int) + + logging.info(f'{dataset_name} is loaded, started parsing...') + + x_train, x_test, y_train, y_test = train_test_split( + X, y, test_size=0.4, random_state=42) + for data, name in zip((x_train, x_test, y_train, y_test), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True + + +def covtype(dataset_dir: Path) -> bool: + """ + Cover type dataset from UCI machine learning repository + https://archive.ics.uci.edu/ml/datasets/covertype + + y contains 7 unique class labels from 1 to 7 inclusive. + TaskType:multiclass + NumberOfFeatures:54 + NumberOfInstances:581012 + """ + dataset_name = 'covtype' + os.makedirs(dataset_dir, exist_ok=True) + + logging.info(f'Started loading {dataset_name}') + X, y = fetch_covtype(return_X_y=True) # pylint: disable=unexpected-keyword-arg + logging.info(f'{dataset_name} is loaded, started parsing...') + + X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=77, + test_size=0.2, + ) + for data, name in zip((X_train, X_test, y_train, y_test), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True + + +def letters(dataset_dir: Path) -> bool: + """ + http://archive.ics.uci.edu/ml/datasets/Letter+Recognition + + TaskType:multiclass + NumberOfFeatures:16 + NumberOfInstances:20000 + """ + dataset_name = 'letters' + os.makedirs(dataset_dir, exist_ok=True) + + url = ('http://archive.ics.uci.edu/ml/machine-learning-databases/' + + 'letter-recognition/letter-recognition.data') + local_url = os.path.join(dataset_dir, os.path.basename(url)) + if not os.path.isfile(local_url): + logging.info(f'Started loading {dataset_name}') + retrieve(url, local_url) + logging.info(f'{dataset_name} is loaded, started parsing...') + + letters = pd.read_csv(local_url, header=None) + X = letters.iloc[:, 1:].values + y: Any = letters.iloc[:, 0] + y = y.astype('category').cat.codes.values + + X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0) + + for data, name in zip((X_train, X_test, y_train, y_test), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True + + +def mlsr(dataset_dir: Path) -> bool: + """ + # TODO: add a loading instruction + """ + return False + + +def mnist(dataset_dir: Path) -> bool: + """ + Abstract: + The MNIST database of handwritten digits with 784 features. + It can be split in a training set of the first 60,000 examples, + and a test set of 10,000 examples + Source: + Yann LeCun, Corinna Cortes, Christopher J.C. Burges + http://yann.lecun.com/exdb/mnist/ + + Classification task. n_classes = 10.
+ mnist X train dataset (60000, 784) + mnist y train dataset (60000, 1) + mnist X test dataset (10000, 784) + mnist y test dataset (10000, 1) + """ + dataset_name = 'mnist' + + os.makedirs(dataset_dir, exist_ok=True) + + X, y = fetch_openml(name='mnist_784', return_X_y=True, + as_frame=True, data_home=dataset_dir) + y = y.astype(int) + X = X / 255 + + logging.info(f'{dataset_name} is loaded, started parsing...') + + x_train, x_test, y_train, y_test = train_test_split( + X, y, test_size=10000, shuffle=False) + for data, name in zip((x_train, x_test, y_train, y_test), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True + + +def msrank(dataset_dir: Path) -> bool: + """ + Dataset from szilard benchmarks: https://github.com/szilard/GBM-perf + + TaskType:multiclass + NumberOfFeatures:137 + NumberOfInstances:1.2M + """ + dataset_name = 'msrank' + os.makedirs(dataset_dir, exist_ok=True) + url = "https://storage.mds.yandex.net/get-devtools-opensource/471749/msrank.tar.gz" + local_url = os.path.join(dataset_dir, os.path.basename(url)) + unzipped_url = os.path.join(dataset_dir, "MSRank") + if not os.path.isfile(local_url): + logging.info(f'Started loading {dataset_name}') + retrieve(url, local_url) + if not os.path.isdir(unzipped_url): + logging.info(f'{dataset_name} is loaded, unzipping...') + tar = tarfile.open(local_url, "r:gz") + tar.extractall(dataset_dir) + tar.close() + logging.info(f'{dataset_name} is unzipped, started parsing...') + + sets = [] + labels = [] + n_features = 137 + + for set_name in ['train.txt', 'vali.txt', 'test.txt']: + file_name = os.path.join(unzipped_url, set_name) + + n_samples = count_lines(file_name) + with open(file_name, 'r') as file_obj: + X, y = read_libsvm_msrank(file_obj, n_samples, n_features, np.float32) + + sets.append(X) + labels.append(y) + + sets[0] = np.vstack((sets[0], sets[1])) + labels[0] = np.hstack((labels[0], labels[1])) + + sets = [np.ascontiguousarray(sets[i]) for i in [0, 2]] + labels = [np.ascontiguousarray(labels[i]) for i in [0, 2]] + + for data, name in zip((sets[0], sets[1], labels[0], labels[1]), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True + + +def plasticc(dataset_dir: Path) -> bool: + """ + # TODO: add a loading instruction + """ + return False + + +def sensit(dataset_dir: Path) -> bool: + """ + Abstract: Vehicle classification in distributed sensor networks. + Author: M. Duarte, Y. H.
Hu + Source: [original](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets) + + Multiclass classification task + sensit X train dataset (78822, 100) + sensit y train dataset (78822, 1) + sensit X test dataset (19706, 100) + sensit y test dataset (19706, 1) + """ + dataset_name = 'sensit' + os.makedirs(dataset_dir, exist_ok=True) + + X, y = fetch_openml(name='SensIT-Vehicle-Combined', + return_X_y=True, as_frame=False, data_home=dataset_dir) + X = pd.DataFrame(X.todense()) + y = pd.DataFrame(y) + y = y.astype(int) + + logging.info(f'{dataset_name} is loaded, started parsing...') + + x_train, x_test, y_train, y_test = train_test_split( + X, y, test_size=0.2, random_state=42) + for data, name in zip((x_train, x_test, y_train, y_test), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True diff --git a/datasets/loader_regression.py b/datasets/loader_regression.py new file mode 100644 index 000000000..c19cdf55c --- /dev/null +++ b/datasets/loader_regression.py @@ -0,0 +1,102 @@ +# =============================================================================== +# Copyright 2020-2021 Intel Corporation +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# =============================================================================== + +import logging +import os +from pathlib import Path +from typing import Any + +import numpy as np +import pandas as pd +from sklearn.model_selection import train_test_split + +from .loader_utils import retrieve + + +def abalone(dataset_dir: Path) -> bool: + """ + https://archive.ics.uci.edu/ml/machine-learning-databases/abalone + + TaskType:regression + NumberOfFeatures:8 + NumberOfInstances:4177 + """ + dataset_name = 'abalone' + os.makedirs(dataset_dir, exist_ok=True) + + url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data' + local_url = os.path.join(dataset_dir, os.path.basename(url)) + if not os.path.isfile(local_url): + logging.info(f'Started loading {dataset_name}') + retrieve(url, local_url) + logging.info(f'{dataset_name} is loaded, started parsing...') + + abalone: Any = pd.read_csv(local_url, header=None) + abalone[0] = abalone[0].astype('category').cat.codes + X = abalone.iloc[:, :-1].values + y = abalone.iloc[:, -1].values + + X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0) + + for data, name in zip((X_train, X_test, y_train, y_test), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True + + +def mortgage_first_q(dataset_dir: Path) -> bool: + """ + # TODO: add a loading instruction + """ + return False + + +def year_prediction_msd(dataset_dir: Path) -> bool: + """ + YearPredictionMSD dataset from UCI repository + https://archive.ics.uci.edu/ml/datasets/yearpredictionmsd + + TaskType:regression + NumberOfFeatures:90 + NumberOfInstances:515345 + """ + dataset_name = 'year_prediction_msd' + os.makedirs(dataset_dir, exist_ok=True) + + url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00203/YearPredictionMSD.txt' \ + '.zip' + local_url = os.path.join(dataset_dir, os.path.basename(url)) + if not os.path.isfile(local_url): + logging.info(f'Started loading {dataset_name}') + retrieve(url, local_url) + logging.info(f'{dataset_name} is loaded, started parsing...') + + year = pd.read_csv(local_url, header=None) + X = year.iloc[:, 1:].to_numpy(dtype=np.float32) + y = year.iloc[:, 0].to_numpy(dtype=np.float32) + + X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False, + train_size=463715, + test_size=51630) + + for data, name in zip((X_train, X_test, y_train, y_test), + ('x_train', 'x_test', 'y_train', 'y_test')): + filename = f'{dataset_name}_{name}.npy' + np.save(os.path.join(dataset_dir, filename), data) + logging.info(f'dataset {dataset_name} is ready.') + return True diff --git a/datasets/loader_utils.py b/datasets/loader_utils.py new file mode 100755 index 000000000..29366eccb --- /dev/null +++ b/datasets/loader_utils.py @@ -0,0 +1,76 @@ +# =============================================================================== +# Copyright 2020-2021 Intel Corporation +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and +# limitations under the License. +# =============================================================================== + +import re +from urllib.request import urlretrieve + +import numpy as np +import tqdm + +pbar: tqdm.tqdm = None + + +def _show_progress(block_num: int, block_size: int, total_size: int) -> None: + global pbar + if pbar is None: + pbar = tqdm.tqdm(total=total_size / 1024, unit='kB') + + downloaded = block_num * block_size + if downloaded < total_size: + pbar.update(block_size / 1024) + else: + pbar.close() + pbar = None + + +def retrieve(url: str, filename: str) -> None: + urlretrieve(url, filename, reporthook=_show_progress) + + +def read_libsvm_msrank(file_obj, n_samples, n_features, dtype): + X = np.zeros((n_samples, n_features)) + y = np.zeros((n_samples,)) + + counter = 0 + + regexp = re.compile(r'[A-Za-z0-9]+:(-?\d*\.?\d+)') + + for line in file_obj: + line = str(line).replace("\\n'", "") + line = regexp.sub(r'\g<1>', line) + line = line.rstrip(" \n\r").split(' ') + + y[counter] = int(line[0]) + X[counter] = [float(i) for i in line[1:]] + + counter += 1 + if counter == n_samples: + break + + return np.array(X, dtype=dtype), np.array(y, dtype=dtype) + + +def _make_gen(reader): + b = reader(1024 * 1024) + while b: + yield b + b = reader(1024 * 1024) + + +def count_lines(filename): + with open(filename, 'rb') as f: + f_gen = _make_gen(f.read) + return sum(buf.count(b'\n') for buf in f_gen) diff --git a/runner.py b/runner.py index a77c29a51..c4cba2449 100755 --- a/runner.py +++ b/runner.py @@ -18,35 +18,12 @@ import json import logging import os -import pathlib import socket import sys +from typing import Any, Dict, List, Union import datasets.make_datasets as make_datasets import utils -from datasets.load_datasets import try_load_dataset - - -def generate_cases(params): - ''' - Generate cases for benchmarking by iterating of - parameters values - ''' - global cases - if len(params) == 0: - return cases - prev_length = len(cases) - param_name = list(params.keys())[0] - n_param_values = len(params[param_name]) - cases = cases * n_param_values - dashes = '-' if len(param_name) == 1 else '--' - for i in range(n_param_values): - for j in range(prev_length): - cases[prev_length * i + j] += f' {dashes}{param_name} ' \ - + f'{params[param_name][i]}' - del params[param_name] - generate_cases(params) - if __name__ == '__main__': parser = argparse.ArgumentParser() @@ -54,7 +31,7 @@ def generate_cases(params): default='configs/config_example.json', help='Path to configuration files') parser.add_argument('--dummy-run', default=False, action='store_true', - help='Run configuration parser and datasets generation' + help='Run configuration parser and datasets generation ' 'without benchmarks running') parser.add_argument('--no-intel-optimized', default=False, action='store_true', help='Use no intel optimized version. ' @@ -69,7 +46,6 @@ def generate_cases(params): help='Create an Excel report based on benchmarks results. 
' 'Need "openpyxl" library') args = parser.parse_args() - env = os.environ.copy() logging.basicConfig( stream=sys.stdout, format='%(levelname)s: %(message)s', level=args.verbose) @@ -78,7 +54,7 @@ def generate_cases(params): # make directory for data if it doesn't exist os.makedirs('data', exist_ok=True) - json_result = { + json_result: Dict[str, Union[Dict[str, Any], List[Any]]] = { 'hardware': utils.get_hw_parameters(), 'software': utils.get_sw_parameters(), 'results': [] @@ -90,51 +66,39 @@ def generate_cases(params): with open(config_name, 'r') as config_file: config = json.load(config_file) - if 'omp_env' not in config.keys(): - config['omp_env'] = [] # get parameters that are common for all cases common_params = config['common'] for params_set in config['cases']: - cases = [''] params = common_params.copy() params.update(params_set.copy()) algorithm = params['algorithm'] libs = params['lib'] + if not isinstance(libs, list): + libs = [libs] del params['dataset'], params['algorithm'], params['lib'] - generate_cases(params) + cases = utils.generate_cases(params) logging.info(f'{algorithm} algorithm: {len(libs) * len(cases)} case(s),' f' {len(params_set["dataset"])} dataset(s)\n') for dataset in params_set['dataset']: if dataset['source'] in ['csv', 'npy']: - train_data = dataset["training"] - file_train_data_x = train_data["x"] - paths = f'--file-X-train {file_train_data_x}' - if 'y' in dataset['training'].keys(): - file_train_data_y = train_data["y"] - paths += f' --file-y-train {file_train_data_y}' - if 'testing' in dataset.keys(): - test_data = dataset["testing"] - file_test_data_x = test_data["x"] - paths += f' --file-X-test {file_test_data_x}' - if 'y' in dataset['testing'].keys(): - file_test_data_y = test_data["y"] - paths += f' --file-y-test {file_test_data_y}' - if 'name' in dataset.keys(): - dataset_name = dataset['name'] - else: - dataset_name = 'unknown' - - if not utils.is_exists_files([file_train_data_x]): - directory_dataset = pathlib.Path(file_train_data_x).parent - if not try_load_dataset(dataset_name=dataset_name, - output_directory=directory_dataset): - logging.warning(f'Dataset {dataset_name} ' - 'could not be loaded. \n' - 'Check the correct name or expand ' - 'the download in the folder dataset.') - continue - + dataset_name = dataset['name'] if 'name' in dataset else 'unknown' + if 'training' not in dataset or \ + 'x' not in dataset['training'] or \ + not utils.find_the_dataset(dataset_name, + dataset['training']['x']): + logging.warning( + f'Dataset {dataset_name} could not be loaded. 
\n' + 'Check the correct name or expand the download in ' + 'the folder dataset.') + continue + paths = '--file-X-train ' + dataset['training']["x"] + if 'y' in dataset['training']: + paths += ' --file-y-train ' + dataset['training']["y"] + if 'testing' in dataset: + paths += ' --file-X-test ' + dataset["testing"]["x"] + if 'y' in dataset['testing']: + paths += ' --file-y-test ' + dataset["testing"]["y"] elif dataset['source'] == 'synthetic': class GenerationArgs: classes: int @@ -151,7 +115,7 @@ class GenerationArgs: gen_args = GenerationArgs() paths = '' - if 'seed' in params_set.keys(): + if 'seed' in params_set: gen_args.seed = params_set['seed'] else: gen_args.seed = 777 @@ -161,10 +125,10 @@ class GenerationArgs: gen_args.type = dataset['type'] gen_args.samples = dataset['training']['n_samples'] gen_args.features = dataset['n_features'] - if 'n_classes' in dataset.keys(): + if 'n_classes' in dataset: gen_args.classes = dataset['n_classes'] cls_num_for_file = f'-{dataset["n_classes"]}' - elif 'n_clusters' in dataset.keys(): + elif 'n_clusters' in dataset: gen_args.clusters = dataset['n_clusters'] cls_num_for_file = f'-{dataset["n_clusters"]}' else: @@ -179,7 +143,7 @@ class GenerationArgs: gen_args.filey = f'{file_prefix}y-train{file_postfix}' paths += f' --file-y-train {gen_args.filey}' - if 'testing' in dataset.keys(): + if 'testing' in dataset: gen_args.test_samples = dataset['testing']['n_samples'] gen_args.filextest = f'{file_prefix}X-test{file_postfix}' paths += f' --file-X-test {gen_args.filextest}' @@ -204,26 +168,20 @@ class GenerationArgs: logging.warning('Unknown dataset source. Only synthetics datasets ' 'and csv/npy files are supported now') - omp_env = utils.get_omp_env() no_intel_optimize = \ '--no-intel-optimized ' if args.no_intel_optimized else '' for lib in libs: - env = os.environ.copy() - if lib == 'xgboost': - for var in config['omp_env']: - env[var] = omp_env[var] for i, case in enumerate(cases): command = f'python {lib}_bench/{algorithm}.py ' \ + no_intel_optimize \ + f'--arch {hostname} {case} {paths} ' \ + f'--dataset-name {dataset_name}' - while ' ' in command: - command = command.replace(' ', ' ') + command = ' '.join(command.split()) logging.info(command) if not args.dummy_run: case = f'{lib},{algorithm} ' + case stdout, stderr = utils.read_output_from_command( - command, env=env) + command, env=os.environ.copy()) stdout, extra_stdout = utils.filter_stdout(stdout) stderr = utils.filter_stderr(stderr) @@ -233,8 +191,8 @@ class GenerationArgs: stderr += f'CASE {case} EXTRA OUTPUT:\n' \ + f'{extra_stdout}\n' try: - json_result['results'].extend( - json.loads(stdout)) + if isinstance(json_result['results'], list): + json_result['results'].extend(json.loads(stdout)) except json.JSONDecodeError as decoding_exception: stderr += f'CASE {case} JSON DECODING ERROR:\n' \ + f'{decoding_exception}\n{stdout}\n' diff --git a/sklearn_bench/README.md b/sklearn_bench/README.md index b21da94da..8cca0f81d 100644 --- a/sklearn_bench/README.md +++ b/sklearn_bench/README.md @@ -1,15 +1,14 @@ - -## How to create conda environment for benchmarking +# How to create conda environment for benchmarking If you want to test scikit-learn, then use ```bash pip install -r sklearn_bench/requirements.txt # or -conda install -c intel scikit-learn scikit-learn-intelex pandas +conda install -c intel scikit-learn scikit-learn-intelex pandas tqdm ``` -## Algorithms parameters +## Algorithms parameters You can launch benchmarks for each algorithm separately. 
The tables below list all supported parameters for each algorithm: @@ -27,7 +26,8 @@ You can launch benchmarks for each algorithm separately. The tables below list a - [SVC](#svc) - [train_test_split](#train_test_split) -#### General +### General + | parameter Name | Type | default value | description | | ----- | ---- |---- |---- | |num-threads|int|-1| The number of threads to use| @@ -50,14 +50,14 @@ You can launch benchmarks for each algorithm separately. The tables below list a |seed|int|12345|Seed to pass as random_state| |dataset-name|str|None|Dataset name| +### DBSCAN -#### DBSCAN | parameter Name | Type | default value | description | | ----- | ---- |---- |---- | | epsilon | float | 10 | Radius of neighborhood of a point| | min_samples | int | 5 | The minimum number of samples required in a 'neighborhood to consider a point a core point | -#### RandomForestClassifier +### RandomForestClassifier | parameter Name | Type | default value | description | | ----- | ---- |---- |---- | @@ -70,7 +70,7 @@ You can launch benchmarks for each algorithm separately. The tables below list a | min-impurity-decrease | float | 0 | Needed impurity decrease for node splitting | | no-bootstrap | store_false | True | Don't control bootstraping | -#### RandomForestRegressor +### RandomForestRegressor | parameter Name | Type | default value | description | | ----- | ---- |---- |---- | @@ -84,13 +84,13 @@ You can launch benchmarks for each algorithm separately. The tables below list a | no-bootstrap | action | True | Don't control bootstraping | | use-sklearn-class | action | | Force use of sklearn.ensemble.RandomForestClassifier | -#### pairwise_distances +### pairwise_distances | parameter Name | Type | default value | description | | ----- | ---- |---- |---- | | metric | str | cosine | *cosine* or *correlation* Metric to test for pairwise distances | -#### KMeans +### KMeans | parameter Name | Type | default value | description | | ----- | ---- |---- |---- | @@ -99,7 +99,7 @@ You can launch benchmarks for each algorithm separately. The tables below list a | maxiter | inte | 100 | Maximum number of iterations | | n-clusters | int | | The number of clusters | -#### KNeighborsClassifier +### KNeighborsClassifier | parameter Name | Type | default value | description | | ----- | ---- |---- |---- | @@ -108,13 +108,13 @@ You can launch benchmarks for each algorithm separately. The tables below list a | method | str | brute | Algorithm used to compute the nearest neighbors | | metric | str | euclidean | Distance metric to use | -#### LinearRegression +### LinearRegression | parameter Name | Type | default value | description | | ----- | ---- |---- |---- | | no-fit-intercept | action | True | Don't fit intercept (assume data already centered) | -#### LogisticRegression +### LogisticRegression | parameter Name | Type | default value | description | | ----- | ---- |---- |---- | @@ -125,7 +125,7 @@ You can launch benchmarks for each algorithm separately. The tables below list a | C | float | 1.0 | Regularization parameter | | tol | float | None | Tolerance for solver | -#### PCA +### PCA | parameter Name | Type | default value | description | | ----- | ---- |---- |---- | @@ -133,7 +133,7 @@ You can launch benchmarks for each algorithm separately. 
The tables below list a | n-components | int | None | The number of components to find | | whiten | action | False | Perform whitening | -#### Ridge +### Ridge | parameter Name | Type | default value | description | | ----- | ---- |---- |---- | @@ -141,7 +141,7 @@ You can launch benchmarks for each algorithm separately. The tables below list a | solver | str | auto | Solver used for training | | alpha | float | 1.0 | Regularization strength | -#### SVC +### SVC | parameter Name | Type | default value | description | | ----- | ---- |---- |---- | @@ -152,7 +152,7 @@ You can launch benchmarks for each algorithm separately. The tables below list a | tol | float | 1e-16 | Tolerance passed to sklearn.svm.SVC | | probability | action | True | Use probability for SVC | -#### train_test_split +### train_test_split | parameter Name | Type | default value | description | | ----- | ---- |---- |---- | diff --git a/sklearn_bench/requirements.txt b/sklearn_bench/requirements.txt index d25373e5e..28c7de80d 100755 --- a/sklearn_bench/requirements.txt +++ b/sklearn_bench/requirements.txt @@ -2,3 +2,4 @@ scikit-learn pandas scikit-learn-intelex openpyxl +tqdm diff --git a/utils.py b/utils.py index 40eef714e..5593ef443 100755 --- a/utils.py +++ b/utils.py @@ -15,24 +15,25 @@ # =============================================================================== import json -import logging -import multiprocessing import os +import pathlib import platform import subprocess import sys +from typing import Any, Dict, List, Tuple, Union, cast +from datasets.load_datasets import try_load_dataset -def filter_stderr(text): + +def filter_stderr(text: str) -> str: # delete 'Intel(R) Extension for Scikit-learn usage in sklearn' messages - fake_error_message = 'Intel(R) Extension for Scikit-learn* enabled ' + \ - '(https://github.com/intel/scikit-learn-intelex)' - while fake_error_message in text: - text = text.replace(fake_error_message, '') - return text + fake_error_message = ('Intel(R) Extension for Scikit-learn* enabled ' + + '(https://github.com/intel/scikit-learn-intelex)') + + return ''.join(text.split(fake_error_message)) -def filter_stdout(text): +def filter_stdout(text: str) -> Tuple[str, str]: verbosity_letters = 'EWIDT' filtered, extra = '', '' for line in text.split('\n'): @@ -50,14 +51,13 @@ def filter_stdout(text): return filtered, extra -def is_exists_files(files): - for f in files: - if not os.path.isfile(f): - return False - return True +def find_the_dataset(name: str, fullpath: str) -> bool: + return os.path.isfile(fullpath) or try_load_dataset( + dataset_name=name, output_directory=pathlib.Path(fullpath).parent) -def read_output_from_command(command, env=os.environ.copy()): +def read_output_from_command(command: str, + env: Dict[str, str] = os.environ.copy()) -> Tuple[str, str]: if "PYTHONPATH" in env: env["PYTHONPATH"] += ":" + os.path.dirname(os.path.abspath(__file__)) else: @@ -67,104 +67,74 @@ def read_output_from_command(command, env=os.environ.copy()): return res.stdout[:-1], res.stderr[:-1] -def _is_ht_enabled(): +def parse_lscpu_lscl_info(command_output: str) -> Dict[str, str]: + res: Dict[str, str] = {} + for elem in command_output.strip().split('\n'): + splt = elem.split(':') + res[splt[0]] = splt[1] + return res + + +def get_hw_parameters() -> Dict[str, Union[Dict[str, Any], float]]: + if 'Linux' not in platform.platform(): + return {} + + hw_params: Dict[str, Union[Dict[str, str], float]] = {'CPU': {}} + # get CPU information + lscpu_info, _ = read_output_from_command('lscpu') + lscpu_info = ' 
'.join(lscpu_info.split()) + for line in lscpu_info.split('\n'): + k, v = line.split(": ")[:2] + if k == 'CPU MHz': + continue + cast(Dict[str, str], hw_params['CPU'])[k] = v + + # get RAM size + mem_info, _ = read_output_from_command('free -b') + mem_info = mem_info.split('\n')[1] + mem_info = ' '.join(mem_info.split()) + hw_params['RAM size[GB]'] = int(mem_info.split(' ')[1]) / 2 ** 30 + + # get Intel GPU information try: - cpu_info, _ = read_output_from_command('lscpu') - cpu_info = cpu_info.split('\n') - for el in cpu_info: - if 'Thread(s) per core' in el: - threads_per_core = int(el[-1]) - if threads_per_core > 1: - return True - else: - return False - return False - except FileNotFoundError: - logging.info('Impossible to check hyperthreading via lscpu') - return False - - -def get_omp_env(): - cpu_count = multiprocessing.cpu_count() - omp_num_threads = str(cpu_count // 2) if _is_ht_enabled() else str(cpu_count) - - omp_env = { - 'OMP_PLACES': f'{{0}}:{cpu_count}:1', - 'OMP_NUM_THREADS': omp_num_threads - } - return omp_env - - -def parse_lscpu_lscl_info(command_output): - command_output = command_output.strip().split('\n') - for i in range(len(command_output)): - command_output[i] = command_output[i].split(':') - return {line[0].strip(): line[1].strip() for line in command_output} - - -def get_hw_parameters(): - hw_params = {} - - if 'Linux' in platform.platform(): - # get CPU information - lscpu_info, _ = read_output_from_command('lscpu') - hw_params.update({'CPU': parse_lscpu_lscl_info(lscpu_info)}) - if 'CPU MHz' in hw_params['CPU'].keys(): - del hw_params['CPU']['CPU MHz'] - - # get RAM size - mem_info, _ = read_output_from_command('free -b') - mem_info = mem_info.split('\n')[1] - while ' ' in mem_info: - mem_info = mem_info.replace(' ', ' ') - mem_info = int(mem_info.split(' ')[1]) / 2 ** 30 - hw_params.update({'RAM size[GB]': mem_info}) - - # get Intel GPU information - try: - lsgpu_info, _ = read_output_from_command( - 'lscl --device-type=gpu --platform-vendor=Intel') - device_num = 0 - start_idx = lsgpu_info.find('Device ') - while start_idx >= 0: - start_idx = lsgpu_info.find(':', start_idx) + 1 - end_idx = lsgpu_info.find('Device ', start_idx) - platform_info = parse_lscpu_lscl_info(lsgpu_info[start_idx:end_idx]) - hw_params.update({f'GPU Intel #{device_num + 1}': platform_info}) - device_num += 1 - start_idx = end_idx - except (FileNotFoundError, json.JSONDecodeError): - pass - - # get Nvidia GPU information - try: - gpu_info, _ = read_output_from_command( - 'nvidia-smi --query-gpu=name,memory.total,driver_version,pstate ' - '--format=csv,noheader') - gpu_info = gpu_info.split(', ') - hw_params.update({ - 'GPU Nvidia': { - 'Name': gpu_info[0], - 'Memory size': gpu_info[1], - 'Performance mode': gpu_info[3] - } - }) - except (FileNotFoundError, json.JSONDecodeError): - pass + lsgpu_info, _ = read_output_from_command( + 'lscl --device-type=gpu --platform-vendor=Intel') + device_num = 0 + start_idx = lsgpu_info.find('Device ') + while start_idx >= 0: + start_idx = lsgpu_info.find(':', start_idx) + 1 + end_idx = lsgpu_info.find('Device ', start_idx) + hw_params[f'GPU Intel #{device_num + 1}'] = parse_lscpu_lscl_info( + lsgpu_info[start_idx: end_idx]) + device_num += 1 + start_idx = end_idx + except (FileNotFoundError, json.JSONDecodeError): + pass + # get Nvidia GPU information + try: + gpu_info, _ = read_output_from_command( + 'nvidia-smi --query-gpu=name,memory.total,driver_version,pstate ' + '--format=csv,noheader') + gpu_info_arr = gpu_info.split(', ') + hw_params['GPU 
Nvidia'] = {
+            'Name': gpu_info_arr[0],
+            'Memory size': gpu_info_arr[1],
+            'Performance mode': gpu_info_arr[3]
+        }
+    except (FileNotFoundError, json.JSONDecodeError):
+        pass
     return hw_params
 
 
-def get_sw_parameters():
+def get_sw_parameters() -> Dict[str, Dict[str, Any]]:
     sw_params = {}
     try:
         gpu_info, _ = read_output_from_command(
             'nvidia-smi --query-gpu=name,memory.total,driver_version,pstate '
             '--format=csv,noheader')
-        gpu_info = gpu_info.split(', ')
-
-        sw_params.update(
-            {'GPU_driver': {'version': gpu_info[2]}})
+        info_arr = gpu_info.split(', ')
+        sw_params['GPU_driver'] = {'version': info_arr[2]}
         # alert if GPU is already running any processes
         gpu_processes, _ = read_output_from_command(
             'nvidia-smi --query-compute-apps=name,pid,used_memory '
@@ -179,14 +149,35 @@ def get_sw_parameters():
     try:
         conda_list, _ = read_output_from_command('conda list --json')
         needed_columns = ['version', 'build_string', 'channel']
-        conda_list = json.loads(conda_list)
-        for pkg in conda_list:
+        conda_list_json: List[Dict[str, str]] = json.loads(conda_list)
+        for pkg in conda_list_json:
             pkg_info = {}
             for col in needed_columns:
-                if col in pkg.keys():
-                    pkg_info.update({col: pkg[col]})
-            sw_params.update({pkg['name']: pkg_info})
+                if col in pkg:
+                    pkg_info[col] = pkg[col]
+            sw_params[pkg['name']] = pkg_info
     except (FileNotFoundError, json.JSONDecodeError):
         pass
 
     return sw_params
+
+
+def generate_cases(params: Dict[str, Union[List[Any], Any]]) -> List[str]:
+    '''
+    Generate cases for benchmarking by iterating over the parameter values
+    '''
+    commands = ['']
+    for param, values in params.items():
+        if isinstance(values, list):
+            prev_len = len(commands)
+            commands *= len(values)
+            dashes = '-' if len(param) == 1 else '--'
+            for command_num in range(prev_len):
+                for value_num in range(len(values)):
+                    commands[prev_len * value_num + command_num] += ' ' + \
+                        dashes + param + ' ' + str(values[value_num])
+        else:
+            dashes = '-' if len(param) == 1 else '--'
+            for command_num in range(len(commands)):
+                commands[command_num] += ' ' + dashes + param + ' ' + str(values)
+    return commands
diff --git a/xgboost_bench/README.md b/xgboost_bench/README.md
index 2b4e93ec5..45f27be87 100644
--- a/xgboost_bench/README.md
+++ b/xgboost_bench/README.md
@@ -1,16 +1,17 @@
-## How to create conda environment for benchmarking
+# How to create conda environment for benchmarking
 
 ```bash
 pip install -r xgboost_bench/requirements.txt
 # or
-conda install -c conda-forge xgboost pandas
+conda install -c conda-forge xgboost scikit-learn pandas tqdm
 ```
 
-## Algorithms parameters 
+## Algorithms parameters
 
 You can launch benchmarks for each algorithm separately. The table below lists all supported parameters for each algorithm.
 
-#### General
+### General
+
 | parameter Name | Type | default value | description |
 | ----- | ---- |---- |---- |
 |num-threads|int|-1| The number of threads to use|
@@ -33,7 +34,7 @@ You can launch benchmarks for each algorithm separately.
The table below lists a |seed|int|12345|Seed to pass as random_state| |dataset-name|str|None|Dataset name| -#### GradientBoostingTrees +### GradientBoostingTrees | parameter Name | Type | default value | description | | ----- | ---- |---- |---- | diff --git a/xgboost_bench/gbt.py b/xgboost_bench/gbt.py index c903e6008..0c44acfaa 100644 --- a/xgboost_bench/gbt.py +++ b/xgboost_bench/gbt.py @@ -15,7 +15,6 @@ # =============================================================================== import argparse -import os import bench import numpy as np @@ -34,57 +33,60 @@ def convert_xgb_predictions(y_pred, objective): return y_pred -parser = argparse.ArgumentParser(description='xgboost gradient boosted trees ' - 'benchmark') +parser = argparse.ArgumentParser(description='xgboost gradient boosted trees benchmark') + -parser.add_argument('--n-estimators', type=int, default=100, - help='Number of gradient boosted trees') -parser.add_argument('--learning-rate', '--eta', type=float, default=0.3, - help='Step size shrinkage used in update ' - 'to prevents overfitting') -parser.add_argument('--min-split-loss', '--gamma', type=float, default=0, - help='Minimum loss reduction required to make' - ' partition on a leaf node') -parser.add_argument('--max-depth', type=int, default=6, - help='Maximum depth of a tree') -parser.add_argument('--min-child-weight', type=float, default=1, - help='Minimum sum of instance weight needed in a child') -parser.add_argument('--max-delta-step', type=float, default=0, - help='Maximum delta step we allow each leaf output to be') -parser.add_argument('--subsample', type=float, default=1, - help='Subsample ratio of the training instances') parser.add_argument('--colsample-bytree', type=float, default=1, help='Subsample ratio of columns ' 'when constructing each tree') -parser.add_argument('--reg-lambda', type=float, default=1, - help='L2 regularization term on weights') -parser.add_argument('--reg-alpha', type=float, default=0, - help='L1 regularization term on weights') -parser.add_argument('--tree-method', type=str, required=True, - help='The tree construction algorithm used in XGBoost') -parser.add_argument('--scale-pos-weight', type=float, default=1, - help='Controls a balance of positive and negative weights') +parser.add_argument('--count-dmatrix', default=False, action='store_true', + help='Count DMatrix creation in time measurements') +parser.add_argument('--enable-experimental-json-serialization', default=True, + choices=('True', 'False'), help='Use JSON to store memory snapshots') parser.add_argument('--grow-policy', type=str, default='depthwise', help='Controls a way new nodes are added to the tree') -parser.add_argument('--max-leaves', type=int, default=0, - help='Maximum number of nodes to be added') +parser.add_argument('--inplace-predict', default=False, action='store_true', + help='Perform inplace_predict instead of default') +parser.add_argument('--learning-rate', '--eta', type=float, default=0.3, + help='Step size shrinkage used in update ' + 'to prevents overfitting') parser.add_argument('--max-bin', type=int, default=256, help='Maximum number of discrete bins to ' 'bucket continuous features') +parser.add_argument('--max-delta-step', type=float, default=0, + help='Maximum delta step we allow each leaf output to be') +parser.add_argument('--max-depth', type=int, default=6, + help='Maximum depth of a tree') +parser.add_argument('--max-leaves', type=int, default=0, + help='Maximum number of nodes to be added') +parser.add_argument('--min-child-weight', 
type=float, default=1,
+                    help='Minimum sum of instance weight needed in a child')
+parser.add_argument('--min-split-loss', '--gamma', type=float, default=0,
+                    help='Minimum loss reduction required to make'
+                         ' partition on a leaf node')
+parser.add_argument('--n-estimators', type=int, default=100,
+                    help='The number of gradient boosted trees')
 parser.add_argument('--objective', type=str, required=True,
                     choices=('reg:squarederror', 'binary:logistic',
                              'multi:softmax', 'multi:softprob'),
-                    help='Control a balance of positive and negative weights')
-parser.add_argument('--count-dmatrix', default=False, action='store_true',
-                    help='Count DMatrix creation in time measurements')
-parser.add_argument('--inplace-predict', default=False, action='store_true',
-                    help='Perform inplace_predict instead of default')
+                    help='Specifies the learning task')
+parser.add_argument('--reg-alpha', type=float, default=0,
+                    help='L1 regularization term on weights')
+parser.add_argument('--reg-lambda', type=float, default=1,
+                    help='L2 regularization term on weights')
+parser.add_argument('--scale-pos-weight', type=float, default=1,
+                    help='Controls a balance of positive and negative weights')
 parser.add_argument('--single-precision-histogram', default=False,
                     action='store_true',
                     help='Build histograms instead of double precision')
-parser.add_argument('--enable-experimental-json-serialization', default=True,
-                    choices=('True', 'False'), help='Use JSON to store memory snapshots')
+parser.add_argument('--subsample', type=float, default=1,
+                    help='Subsample ratio of the training instances')
+parser.add_argument('--tree-method', type=str, required=True,
+                    help='The tree construction algorithm used in XGBoost')
 
 params = bench.parse_args(parser)
+# Replace the benchmark-wide default seed (12345) with XGBoost's default seed (0)
+if params.seed == 12345:
+    params.seed = 0
 
 # Load and convert data
 X_train, X_test, y_train, y_test = bench.load_data(params)
@@ -119,9 +121,6 @@ def convert_xgb_predictions(y_pred, objective):
 if params.threads != -1:
     xgb_params.update({'nthread': params.threads})
 
-if 'OMP_NUM_THREADS' in os.environ.keys():
-    xgb_params['nthread'] = int(os.environ['OMP_NUM_THREADS'])
-
 if params.objective.startswith('reg'):
     task = 'regression'
     metric_name, metric_func = 'rmse', bench.rmse_score
@@ -133,6 +132,11 @@ def convert_xgb_predictions(y_pred, objective):
         params.n_classes = y_train[y_train.columns[0]].nunique()
     else:
         params.n_classes = len(np.unique(y_train))
+
+    # The covtype dataset has one more class than appears in the training data
+    if params.dataset_name == 'covtype':
+        params.n_classes += 1
+
     if params.n_classes > 2:
         xgb_params['num_class'] = params.n_classes
 
@@ -170,4 +174,4 @@ def predict():
                         params=params, functions=['gbt.fit', 'gbt.predict'],
                         times=[fit_time, predict_time], accuracy_type=metric_name,
                         accuracies=[train_metric, test_metric], data=[X_train, X_test],
-                        alg_instance=booster)
+                        alg_instance=booster, alg_params=xgb_params)
diff --git a/xgboost_bench/requirements.txt b/xgboost_bench/requirements.txt
index 1540ec04f..79bc07cc5 100755
--- a/xgboost_bench/requirements.txt
+++ b/xgboost_bench/requirements.txt
@@ -2,3 +2,4 @@ scikit-learn
 pandas
 xgboost
 openpyxl
+tqdm
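
Note on the case-generation change above: the new `utils.generate_cases` replaces the old recursive `generate_cases` in `runner.py` and expands a config case's parameter dictionary into one CLI-argument suffix per parameter combination, which `runner.py` then prefixes with `python {lib}_bench/{algorithm}.py --arch {hostname} ... {paths}`. A minimal usage sketch follows; the parameter names and values are illustrative only, not taken from a shipped config.

```python
# Illustrative sketch of utils.generate_cases; the 'max-depth' and
# 'learning-rate' parameters here are hypothetical examples.
import utils

params = {'max-depth': [4, 8], 'learning-rate': 0.1}
cases = utils.generate_cases(params)

# A list value produces one case per element; a scalar value is appended to
# every case, so `cases` is:
#   [' --max-depth 4 --learning-rate 0.1',
#    ' --max-depth 8 --learning-rate 0.1']
# runner.py then builds one benchmark command per case, e.g.
#   'python xgboost_bench/gbt.py --arch <hostname> --max-depth 4 ...'
print(cases)
```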