
Xgb datasets adding #60


Merged

Merged 39 commits into IntelPython:master from the xgb-nvidia-datasets branch on Apr 26, 2021

Conversation

RukhovichIV

This PR needs to be reviewed after #58 is merged. That way there won't be 57 changed files here :D

@RukhovichIV force-pushed the xgb-nvidia-datasets branch from c4871fa to 670c289 on March 30, 2021 12:46
@SmirnovEgorRu changed the title from "Xgb nvidia datasets adding" to "Xgb datasets adding" Mar 30, 2021
if not os.path.isfile(local_url):
    logging.info(f'Started loading {dataset_name}')
    os.system(
        "kaggle competitions download -c bosch-production-line-performance -f " +
Contributor

We have a Python kaggle module (pip install kaggle).
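For instance, a rough sketch of what calling that module instead of os.system could look like (the downloaded file name below is an assumption, not taken from this PR):

import logging
from pathlib import Path


def download_bosch(dataset_dir: Path) -> None:
    # the kaggle package authenticates on import using ~/.kaggle/kaggle.json
    import kaggle
    logging.info('Started loading bosch')
    kaggle.api.competition_download_file(
        'bosch-production-line-performance',
        'train_numeric.csv.zip',  # assumed file name
        path=str(dataset_dir))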

Contributor

We need to think about error handling: if the kaggle API can't be found, we shouldn't interrupt the overall benchmark execution.
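A minimal sketch of such a fallback, assuming the loaders' convention of returning bool (names are illustrative):

import logging


def load_bosch_safely(dataset_dir) -> bool:
    try:
        # importing kaggle raises OSError when API credentials are missing
        import kaggle
        kaggle.api.competition_download_files(
            'bosch-production-line-performance', path=str(dataset_dir))
        return True
    except (ImportError, OSError) as err:
        # skip this dataset instead of aborting the whole benchmark run
        logging.warning(f'Could not load bosch via kaggle: {err}')
        return False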

Author

I agree about the need for error handling, but I'm not sure we should put pip install kaggle here. That command alone won't help any user, because the kaggle module also has to be configured before use. And the configuration data isn't public, is it?


return True


def higgs(dataset_dir: Path) -> bool:
Contributor

What size?

Usually used:
10M train and 1M test
or
10.5M train and 0.5M test
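For reference, the 10.5M/0.5M convention is usually produced by slicing the full 11M-row HIGGS file; a sketch (the UCI URL is the well-known source, but verify it, as the actual loader may differ):

import pandas as pd

url = ('https://archive.ics.uci.edu/ml/machine-learning-databases/'
       '00280/HIGGS.csv.gz')
higgs = pd.read_csv(url, header=None)  # 11M rows: label + 28 features

# common convention: first 10.5M rows for training, last 0.5M for testing
train, test = higgs.iloc[:10500000], higgs.iloc[10500000:]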

Author

I guess it's just the whole dataset. I'll find out some info about it.

@@ -1,4 +1,4 @@
#===============================================================================
# ===============================================================================
Contributor

It would be nice to split this file at least into:

  • classification datasets
  • regression datasets

Contributor

+1, please refer to https://github.com/catboost/benchmarks/blob/master/training_speed/data_loader.py

It would be nice to have abalone and letters in the benchmarks as well.

Author

Split into 3 files (clf, reg, mult).
Added these two datasets as well.
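For illustration, a sketch of how per-file loader registries might look after such a split (the file and dict names here are assumptions, not taken from the PR):

# datasets/loader_classification.py (hypothetical file name)
CLASSIFICATION_DATASETS = {
    "a9a": a_nine_a,
    "covertype": covertype,
}

# datasets/loader_regression.py (hypothetical file name)
REGRESSION_DATASETS = {
    "abalone": abalone,
    "year": year,
}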

"covertype": covertype,
"codrnanorm": codrnanorm,
"skin_segmentation": skin_segmentation,
"year": year,
Contributor

year_prediction_msd

Author

Changed

"count-dmatrix":"",
"algorithm": "gbt",
"tree-method": "hist",
"num-threads": 56
Contributor

Must be removed

Author

Done

return True


def fraud(dataset_dir: Path) -> bool:
Contributor

It would be nice to follow the existing good practice of writing some info about each dataset: its source and size. Some info is available here: https://github.com/catboost/benchmarks/blob/master/training_speed/data_loader.py

The same goes for the other datasets.
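For example, a minimal sketch of the kind of loader docstring this suggests (the figures below are the well-known public ones for the Kaggle credit-card fraud data, but verify them against the actual source):

def fraud(dataset_dir: Path) -> bool:
    """
    Credit Card Fraud Detection
    https://www.kaggle.com/mlg-ulb/creditcardfraud

    Classification task. n_classes = 2.
    Full dataset: 284807 samples, 30 features.
    """
    ...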

Author

Added this info.

@RukhovichIV force-pushed the xgb-nvidia-datasets branch from fd8e84b to 4be3720 on April 15, 2021 08:12
@RukhovichIV
Author

@PetrovKP, @SmirnovEgorRu, @Alexsandruss
Could you please check this PR?

"data-format": ["pandas"],
"data-order": ["F"],
"dtype": ["float32"]
"lib": "modelbuilders",
Contributor

Add a note to the README that parameters might be set with either a single value or a list of values.
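For instance, a hypothetical config fragment for that README note, showing a single value next to a list of values (the keys mirror the config style above but are only illustrative):

{
    "common": {
        "data-format": "pandas",
        "dtype": ["float32", "float64"]
    }
}

A single value presumably defines one benchmark case, while a list expands into one case per value.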

Author

Done earlier.

bench.py Outdated
@@ -484,7 +486,7 @@ def print_output(library, algorithm, stages, params, functions,
     output = []
     for i in range(len(stages)):
         result = gen_basic_dict(library, algorithm, stages[i], params,
-                                data[i], alg_instance, alg_params)
+                                data[i], alg_instance, alg_params if i == 0 else None)
Contributor

It's not clear why only the first stage has alg_params.

Author

It seems like in all benchmarks, all stages of a given case have the same parameters. So, since the parameter list is usually quite long, we can reduce the length of the benchmark output by printing this section only once.

Contributor

@Alexsandruss Apr 26, 2021

@RukhovichIV, the Excel report generator filters benchmark cases based on their parameters; the output should not be shortened or the generator won't work correctly.

Author

Rolled back that change, but I'm very upset about it :(

Comment on lines 24 to 25
from sklearn.datasets import fetch_openml, load_svmlight_file
from sklearn.model_selection import train_test_split
Contributor

I guess scikit-learn should be mentioned in the README as a requirement for data loading.
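For context, a minimal sketch of why scikit-learn is needed at data-loading time (the dataset name is illustrative, not necessarily one this PR fetches via OpenML):

from sklearn.datasets import fetch_openml

# downloads and caches a dataset from openml.org
X, y = fetch_openml('a9a', return_X_y=True, as_frame=False)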

Author

Added this requirement to all the READMEs.

Contributor

Add it to cuML's README and check that it's listed in every *_bench/requirements.txt.

Author

Done.

@@ -0,0 +1,155 @@
{
Contributor

What is nvda?

I think 4 configs for xgboost is a lot. I would create a subfolder in configs (algorithms/xgboost/ or xgboost/) and move all the xgboost configs there.

Author

Created a subfolder for the xgboost configs.
Changed nvda to nvidia :)

from .loader_utils import retrieve


def a_nine_a(dataset_dir: Path) -> bool:
Contributor

A problem with the number? Where?

Author

It's not good practice to name variables or functions using numbers, and I think this looks better. In fact, the dataset names are still unchanged; it's only the function naming that differs.

@PetrovKP added the datasets (Extension or fix load dataset), enhancement (New feature or request), and extend (Extend benchmarks) labels Apr 16, 2021
|dataset| List[[Dataset Object](#dataset-object)] | **REQUIRED** input data specifications. |
|**specific algorithm parameters**| Union[int, float, str, List[int], List[float], List[str]] | other specific algorithm parameters. The list of supported parameters can be found here |

### **Important:** feel free to move any parameter from **cases** to **common** section since this parameter is common for all cases
Contributor

This shouldn't be a section title.

Plus, the wording is unclear (who should move parameters from one section to another?)

Author

I've changed it a little bit.
I've also removed "The list of supported parameters can be found here", since such lists don't seem to be implemented yet.

                    help='Minimum loss reduction required to make'
                         ' partition on a leaf node')
parser.add_argument('--n-estimators', type=int, default=100,
                    help='Number of gradient boosted trees')
Contributor

Suggested change
-                    help='Number of gradient boosted trees')
+                    help='The number of gradient boosted trees')

Author

Applied in ad176e5

@@ -0,0 +1,155 @@
{
Contributor

You have 2 configs for CPU. Can we merge them into one?
Also, I'd prefer to rename all the things with "nvidia" in the name.

Contributor

@PetrovKP left a comment

Nice work, thanks!

@@ -133,6 +132,11 @@ def convert_xgb_predictions(y_pred, objective):
        params.n_classes = y_train[y_train.columns[0]].nunique()
    else:
        params.n_classes = len(np.unique(y_train))

    # BE VERY CAREFUL ON IT!! It should only work for COVTYPE DATASET
Contributor

Why?)
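For context, an assumption rather than an answer from the thread: such a covtype-only branch usually exists because Covertype labels are 1-based (classes 1 to 7) while XGBoost expects 0-based class labels, so loaders remap them:

import numpy as np

# Covertype ships with classes 1..7; XGBoost multiclass objectives
# expect labels in [0, num_class), so shift them down by one
y = np.asarray([1, 2, 7])
y_zero_based = y - 1  # -> [0, 1, 6]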

@Alexsandruss merged commit d0444a6 into IntelPython:master Apr 26, 2021
Labels: datasets (Extension or fix load dataset), enhancement (New feature or request), extend (Extend benchmarks)