Xgb datasets adding #60
Conversation
Merge (conflicts: sklearn_bench/dbscan.py, sklearn_bench/kmeans.py, sklearn_bench/linear.py, sklearn_bench/log_reg.py)
Force-pushed from c4871fa to 670c289
datasets/loader.py
Outdated
if not os.path.isfile(local_url):
    logging.info(f'Started loading {dataset_name}')
    os.system(
        "kaggle competitions download -c bosch-production-line-performance -f " +
We have python kaggle module (pip install kaggle)
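For illustration, the same download through the Python module might look roughly like this (a sketch; the exact file name is truncated in the snippet above, so the one here is an assumption):

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json
api.competition_download_file('bosch-production-line-performance',
                              'train_numeric.csv.zip',  # assumed file name
                              path=str(dataset_dir))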
We need to think about error handling: if we can't find the kaggle API, we shouldn't interrupt the overall benchmark execution.
I agree about the need for error handling, but I'm not sure we should put pip install kaggle here. This command alone won't help any user, because the kaggle module also needs to be configured before use. And the configuration data isn't public, is it?
https://github.com/RukhovichIV/scikit-learn_bench/blob/xgb-nvidia-datasets/datasets/load_datasets.py#L62
And by the way, we already have such handling there.
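A minimal sketch of what such a guard can look like (not the exact code from the linked file; names follow the snippets in this PR):

import logging
from pathlib import Path


def bosch(dataset_dir: Path) -> bool:
    try:
        import kaggle  # raises if the API credentials are not configured
    except Exception as e:
        logging.warning(f'Failed to load bosch dataset: {e}. '
                        'Skipping it without interrupting other benchmarks.')
        return False
    # ... download and preprocess as in the snippet above ...
    return True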
datasets/loader.py
Outdated
return True


def higgs(dataset_dir: Path) -> bool:
What size? Usually one of these splits is used:
10M train and 1M test, or
10.5M train and 0.5M test.
I guess it's just the whole dataset. I'll find out some info about it.
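For reference, the full HIGGS dataset is 11M rows; a sketch of loading it whole and applying one of the splits mentioned above (UCI URL; the bound shown is the 10.5M/0.5M variant and is an assumption here):

import pandas as pd

url = ('https://archive.ics.uci.edu/ml/machine-learning-databases/'
       '00280/HIGGS.csv.gz')
data = pd.read_csv(url, header=None, dtype='float32')  # label is column 0
y, X = data.iloc[:, 0], data.iloc[:, 1:]
X_train, X_test = X.iloc[:10_500_000], X.iloc[10_500_000:]
y_train, y_test = y.iloc[:10_500_000], y.iloc[10_500_000:]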
datasets/loader.py
Outdated
@@ -1,4 +1,4 @@
-#===============================================================================
+# ===============================================================================
It would be nice to split this file at least into:
- classification datasets
- regression datasets
+1, please refer to https://github.com/catboost/benchmarks/blob/master/training_speed/data_loader.py
It would also be nice to have abalone and letters in the benchmarks.
Split into 3 files (clf, reg, mult). Also added these two datasets.
datasets/load_datasets.py
Outdated
"covertype": covertype, | ||
"codrnanorm": codrnanorm, | ||
"skin_segmentation": skin_segmentation, | ||
"year": year, |
year_prediction_msd
Changed
configs/xgb_cpu_config.json
Outdated
"count-dmatrix":"", | ||
"algorithm": "gbt", | ||
"tree-method": "hist", | ||
"num-threads": 56 |
Must be removed
Done
datasets/loader.py
Outdated
return True


def fraud(dataset_dir: Path) -> bool:
It would be nice to follow the existing good practice of writing some info about each dataset: its source and size. Some info is available here: https://github.com/catboost/benchmarks/blob/master/training_speed/data_loader.py
The same goes for the other datasets.
Added such info
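For example, in the style of the catboost loader linked above (an illustrative sketch; the figures are from the public Kaggle page and should be double-checked):

def fraud(dataset_dir: Path) -> bool:
    """
    Credit Card Fraud Detection dataset
    https://www.kaggle.com/mlg-ulb/creditcardfraud

    Binary classification task; 284807 rows, 30 features.
    """
    ...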
Force-pushed from fd8e84b to 4be3720
@PetrovKP, @SmirnovEgorRu, @Alexsandruss
configs/lgbm_mb_cpu_config.json
Outdated
"data-format": ["pandas"], | ||
"data-order": ["F"], | ||
"dtype": ["float32"] | ||
"lib": "modelbuilders", |
Add a note to the README that parameters may be set either as a single value or as a list of values.
done earlier
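For illustration, the convention as a config fragment (hypothetical values; a list expands into one benchmark case per value):

{
    "data-format": "pandas",
    "data-order": ["F", "C"],
    "dtype": ["float32", "float64"]
}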
bench.py
Outdated
@@ -484,7 +486,7 @@ def print_output(library, algorithm, stages, params, functions,
     output = []
     for i in range(len(stages)):
         result = gen_basic_dict(library, algorithm, stages[i], params,
-                                data[i], alg_instance, alg_params)
+                                data[i], alg_instance, alg_params if i == 0 else None)
It's not clear why only the first stage gets alg_params.
It seems that in all benchmarks, every stage of a case has the same parameters. Since the parameter list is usually quite long, we can shorten the benchmark output by printing this section only once.
@RukhovichIV, the Excel report generator filters benchmark cases based on their parameters; the output must not be shortened, or the generator won't work correctly.
Rolled back that change, but I'm very upset about it :(
datasets/loader_clf.py
Outdated
from sklearn.datasets import fetch_openml, load_svmlight_file
from sklearn.model_selection import train_test_split
I guess scikit-learn should be mentioned in the README as a requirement for data loading.
Added this requirement to all READMEs.
Add it to cuML's README as well, and check that it's listed in every *_bench/requirements.txt.
done
configs/xgb_cpu_nvda_config.json
Outdated
@@ -0,0 +1,155 @@
{
What is nvda?
Also, I think 4 configs for xgboost is a lot. I would create a subfolder in configs (algorithms/xgboost/ or just xgboost/) and move all the xgboost configs there.
Created a subfolder for the xgboost configs and changed nvda to nvidia :)
datasets/loader_clf.py
Outdated
from .loader_utils import retrieve


def a_nine_a(dataset_dir: Path) -> bool:
A problem with the number? Where?
It's not good practice to name variables or functions using digits, and I think this looks better. The dataset names themselves are unchanged; it's only the function naming.
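So the public dataset names stay as config keys; only the Python identifiers avoid digits (illustrative sketch):

dataset_loaders = {
    'a9a': a_nine_a,
    'higgs': higgs,
    'fraud': fraud,
}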
configs/README.md
Outdated
|dataset| List[[Dataset Object](#dataset-object)] | **REQUIRED** input data specifications. |
|**specific algorithm parameters**| Union[int, float, str, List[int], List[float], List[str]] | other specific algorithm parameters. The list of supported parameters can be found here |

### **Important:** feel free to move any parameter from **cases** to **common** section since this parameter is common for all cases
This shouldn't be a section title. Also, the wording is unclear (who should move parameters from one section to another?).
I've changed it a little bit. I've also removed "The list of supported parameters can be found here", since such lists don't seem to be implemented yet.
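For illustration, the idea that title was describing, as a hypothetical config fragment (a parameter shared by every case can live in the common section instead of being repeated per case; the exact schema here is assumed):

{
    "common": {
        "lib": "xgboost",
        "data-format": "pandas",
        "dtype": "float32"
    },
    "cases": [
        {
            "algorithm": "gbt",
            "tree-method": "hist"
        }
    ]
}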
xgboost_bench/gbt.py
Outdated
help='Minimum loss reduction required to make'
     ' partition on a leaf node')
parser.add_argument('--n-estimators', type=int, default=100,
                    help='Number of gradient boosted trees')
- help='Number of gradient boosted trees')
+ help='The number of gradient boosted trees')
Applied in ad176e5
…bench into xgb-nvidia-datasets Conflicts: utils.py
Co-authored-by: Ekaterina Mekhnetsova <mekkatya@gmail.com>
…scikit-learn_bench into xgb-nvidia-datasets Conflicts: configs/README.md
@@ -0,0 +1,155 @@
{
You have 2 configs for CPU. Can we merge them into one?
Also, I'd prefer to rename all the things with "nvidia" in the name.
Nice work, thanks!
xgboost_bench/gbt.py
Outdated
@@ -133,6 +132,11 @@ def convert_xgb_predictions(y_pred, objective):
     params.n_classes = y_train[y_train.columns[0]].nunique()
 else:
     params.n_classes = len(np.unique(y_train))
+
+# BE VERY CAREFUL ON IT!! It should only work for COVTYPE DATASET
Why? :)
This PR should be reviewed after #58 is merged; that way there won't be 57 changed files here :D