
Xgb datasets adding #60


Merged

Merged 39 commits into IntelPython:master from the xgb-nvidia-datasets branch on Apr 26, 2021

Conversation

RukhovichIV

This PR needs to be reviewed after #58 is merged. That way there won't be 57 changed files here :D

@RukhovichIV force-pushed the xgb-nvidia-datasets branch from c4871fa to 670c289 on March 30, 2021 12:46
@SmirnovEgorRu changed the title from "Xgb nvidia datasets adding" to "Xgb datasets adding" Mar 30, 2021
if not os.path.isfile(local_url):
    logging.info(f'Started loading {dataset_name}')
    os.system(
        "kaggle competitions download -c bosch-production-line-performance -f " +
Contributor

We have a Python kaggle module (pip install kaggle).
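For instance, a rough sketch of what calling that module instead of os.system could look like (the downloaded file name below is an assumption, not taken from this PR):

import logging
from pathlib import Path


def download_bosch(dataset_dir: Path) -> None:
    # the kaggle package authenticates on import using ~/.kaggle/kaggle.json
    import kaggle
    logging.info('Started loading bosch')
    kaggle.api.competition_download_file(
        'bosch-production-line-performance',
        'train_numeric.csv.zip',  # assumed file name
        path=str(dataset_dir))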

Contributor

We need to think about error handling: if the kaggle API can't be found, we shouldn't interrupt the overall benchmark execution.
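A minimal sketch of such a fallback, assuming the loaders' convention of returning bool (names are illustrative):

import logging


def load_bosch_safely(dataset_dir) -> bool:
    try:
        # importing kaggle raises OSError when API credentials are missing
        import kaggle
        kaggle.api.competition_download_files(
            'bosch-production-line-performance', path=str(dataset_dir))
        return True
    except (ImportError, OSError) as err:
        # skip this dataset instead of aborting the whole benchmark run
        logging.warning(f'Could not load bosch via kaggle: {err}')
        return False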

Author

I agree about the need for error handling, but I'm not sure we should put pip install kaggle here. That command alone won't help any user, because the kaggle module also has to be configured before use. And the configuration data isn't public, is it?


return True


def higgs(dataset_dir: Path) -> bool:
Contributor

What size?

Usually used:
10M train and 1M test
or
10.5M train and 0.5M test
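For reference, the 10.5M/0.5M convention is usually produced by slicing the full 11M-row HIGGS file; a sketch (the UCI URL is the well-known source, but verify it, as the actual loader may differ):

import pandas as pd

url = ('https://archive.ics.uci.edu/ml/machine-learning-databases/'
       '00280/HIGGS.csv.gz')
higgs = pd.read_csv(url, header=None)  # 11M rows: label + 28 features

# common convention: first 10.5M rows for training, last 0.5M for testing
train, test = higgs.iloc[:10500000], higgs.iloc[10500000:]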

Author

I guess it's just the whole dataset. I'll find out some info about it.

@@ -1,4 +1,4 @@
#===============================================================================
# ===============================================================================
Contributor

It would be nice to split this file at least into:

  • classification datasets
  • regression datasets

Contributor

+1, please refer to https://github.com/catboost/benchmarks/blob/master/training_speed/data_loader.py

It would be nice to have abalone and letters in the benchmarks as well.

Author

Split into 3 files (clf, reg, mult).
Added these two datasets as well.
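For illustration, a sketch of how per-file loader registries might look after such a split (the file and dict names here are assumptions, not taken from the PR):

# datasets/loader_classification.py (hypothetical file name)
CLASSIFICATION_DATASETS = {
    "a9a": a_nine_a,
    "covertype": covertype,
}

# datasets/loader_regression.py (hypothetical file name)
REGRESSION_DATASETS = {
    "abalone": abalone,
    "year": year,
}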

"covertype": covertype,
"codrnanorm": codrnanorm,
"skin_segmentation": skin_segmentation,
"year": year,
Contributor

year_prediction_msd

Author

Changed

"count-dmatrix":"",
"algorithm": "gbt",
"tree-method": "hist",
"num-threads": 56
Contributor

Must be removed

Author

Done

return True


def fraud(dataset_dir: Path) -> bool:
Contributor

It would be nice to follow the existing good practice of writing some info about each dataset: its source and size. Some info is available here: https://github.com/catboost/benchmarks/blob/master/training_speed/data_loader.py

The same goes for the other datasets.
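For example, a minimal sketch of the kind of loader docstring this suggests (the figures below are the well-known public ones for the Kaggle credit-card fraud data, but verify them against the actual source):

def fraud(dataset_dir: Path) -> bool:
    """
    Credit Card Fraud Detection
    https://www.kaggle.com/mlg-ulb/creditcardfraud

    Classification task. n_classes = 2.
    Full dataset: 284807 samples, 30 features.
    """
    ...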

Author

Added this info.

@RukhovichIV force-pushed the xgb-nvidia-datasets branch from fd8e84b to 4be3720 on April 15, 2021 08:12
@RukhovichIV
Author

@PetrovKP, @SmirnovEgorRu, @Alexsandruss
Could you please check this PR?

"data-format": ["pandas"],
"data-order": ["F"],
"dtype": ["float32"]
"lib": "modelbuilders",
Contributor

Add a note to the README that parameters might be set with either a single value or a list of values.
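For instance, a hypothetical config fragment for that README note, showing a single value next to a list of values (the keys mirror the config style above but are only illustrative):

{
    "common": {
        "data-format": "pandas",
        "dtype": ["float32", "float64"]
    }
}

A single value presumably defines one benchmark case, while a list expands into one case per value.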

Author

Done earlier.

bench.py Outdated
@@ -484,7 +486,7 @@ def print_output(library, algorithm, stages, params, functions,
     output = []
     for i in range(len(stages)):
         result = gen_basic_dict(library, algorithm, stages[i], params,
-                                data[i], alg_instance, alg_params)
+                                data[i], alg_instance, alg_params if i == 0 else None)
Contributor

It's not clear why only the first stage has alg_params.

Author

It seems like in all benchmarks, all stages of a given case have the same parameters. So, since the parameter list is usually quite long, we can reduce the length of the benchmark output by printing this section only once.

Contributor

@Alexsandruss Apr 26, 2021

@RukhovichIV, the Excel report generator filters benchmark cases based on their parameters; the output should not be shortened or the generator won't work correctly.

Author

Rolled back that change, but I'm very upset about it :(

Comment on lines 24 to 25
from sklearn.datasets import fetch_openml, load_svmlight_file
from sklearn.model_selection import train_test_split
Contributor

I guess scikit-learn should be mentioned in the README as a requirement for data loading.
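For context, a minimal sketch of why scikit-learn is needed at data-loading time (the dataset name is illustrative, not necessarily one this PR fetches via OpenML):

from sklearn.datasets import fetch_openml

# downloads and caches a dataset from openml.org
X, y = fetch_openml('a9a', return_X_y=True, as_frame=False)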

Author

Added this requirement to all the READMEs.

Contributor

Add it to cuML's README and check that it's listed in every *_bench/requirements.txt.

Author

Done.

@@ -0,0 +1,155 @@
{
Contributor

What is nvda?

I think 4 configs for xgboost is a lot. I would create a subfolder in configs (algorithms/xgboost/ or xgboost/) and move all the xgboost configs there.

Author

Created a subfolder for the xgboost configs.
Changed nvda to nvidia :)

from .loader_utils import retrieve


def a_nine_a(dataset_dir: Path) -> bool:
Contributor

A problem with the number? Where?

Author

It's not good practice to name variables or functions using numbers, and I think this looks better. In fact, the dataset names are still unchanged; it's only the function naming that differs.

@PetrovKP added the datasets (Extension or fix load dataset), enhancement (New feature or request), and extend (Extend benchmarks) labels Apr 16, 2021
|dataset| List[[Dataset Object](#dataset-object)] | **REQUIRED** input data specifications. |
|**specific algorithm parameters**| Union[int, float, str, List[int], List[float], List[str]] | other specific algorithm parameters. The list of supported parameters can be found here |

### **Important:** feel free to move any parameter from **cases** to **common** section since this parameter is common for all cases
Contributor

This shouldn't be a section title.

Plus, the wording is unclear (who should move parameters from one section to another?)

Author

I've changed it a little bit.
I've also removed "The list of supported parameters can be found here", since such lists don't seem to be implemented yet.

                    help='Minimum loss reduction required to make'
                         ' partition on a leaf node')
parser.add_argument('--n-estimators', type=int, default=100,
                    help='Number of gradient boosted trees')
Contributor

Suggested change
-                    help='Number of gradient boosted trees')
+                    help='The number of gradient boosted trees')

Author

Applied in ad176e5

@@ -0,0 +1,155 @@
{
Contributor

You have 2 configs for CPU. Can we merge them into one?
Also, I'd prefer to rename all the things with "nvidia" in the name.

Contributor

@PetrovKP left a comment

Nice work, thanks!

@@ -133,6 +132,11 @@ def convert_xgb_predictions(y_pred, objective):
        params.n_classes = y_train[y_train.columns[0]].nunique()
    else:
        params.n_classes = len(np.unique(y_train))

    # BE VERY CAREFUL ON IT!! It should only work for COVTYPE DATASET
Contributor

Why?)
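For context, an assumption rather than an answer from the thread: such a covtype-only branch usually exists because Covertype labels are 1-based (classes 1 to 7) while XGBoost expects 0-based class labels, so loaders remap them:

import numpy as np

# Covertype ships with classes 1..7; XGBoost multiclass objectives
# expect labels in [0, num_class), so shift them down by one
y = np.asarray([1, 2, 7])
y_zero_based = y - 1  # -> [0, 1, 6]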

@Alexsandruss merged commit d0444a6 into IntelPython:master Apr 26, 2021
Labels: datasets (Extension or fix load dataset), enhancement (New feature or request), extend (Extend benchmarks)