add user guide for ant-xgboost #772

Merged
merged 9 commits into from
Sep 5, 2019

@sperlingxx (Contributor) commented Sep 3, 2019

fix #746

@wangkuiyi (Collaborator) left a comment

Thanks for this PR. I think we need a clear goal for this document. It contains too much about Ant-XGBoost, which would be more appropriate in the Ant-XGBoost repo than here in the SQLFlow repo.

It is titled "user guide", but it doesn't cover building and setting up SQLFlow with the Ant-XGBoost codegen. It would also help to link to the Jupyter Notebook setup, so users can follow along and type the examples.

It would be easier to understand if we explained novel concepts, like the XGBoost Estimator, before showing how to call them.

Also, please follow the Markdown syntax used on GitHub: https://guides.github.com/features/mastering-markdown/

@@ -0,0 +1,265 @@
### _user guide:_ Ant-XGBoost on sqlflow

Collaborator

`###` => `#`

@@ -0,0 +1,265 @@
### _user guide:_ Ant-XGBoost on sqlflow

#### Overview

Collaborator

`####` => `##`

While the current `auto_train` method is a very simple approach, we are working on better strategies to further scale up hyperparameter tuning in XGBoost training.


### Helpful Backports of XGBoost master

Collaborator

Everything above is about Ant-XGBoost and should be part of the Ant-XGBoost documentation rather than SQLFlow's. I would recommend moving the above content to github.com/alipay/ant-xgboost and putting a link in this document that points to that repo.


<br>

## Quick Start

Collaborator

This document is supposed to be about the design of antxgboost_codegen.go. But there is no discussion about this code generator?


<br>

## Overall SQL Syntax

Collaborator

I think this document is not about SQLFlow's syntax extensions to SQL. Do you want to explain the part of the SQLFlow syntax that the Ant-XGBoost codegen uses?

Contributor Author

@wangkuiyi Yes, I want to inform users about the overall SQLFlow syntax related to Ant-XGBoost, so we have renamed this section to "Overall SQL Syntax for AntXGBoost".

@sperlingxx (Contributor Author) commented Sep 4, 2019

@wangkuiyi Thanks for the comments! I have refined this guide and added a tutorial for AntXGBoost.

@sperlingxx sperlingxx mentioned this pull request Sep 5, 2019
13 tasks
[Ant-XGBoost](https://github.com/alipay/ant-xgboost) is a fork of [dmlc/xgboost](https://github.com/dmlc/xgboost), maintained by active contributors of dmlc/xgboost at Alipay Inc.

Ant-XGBoost extends `dmlc/xgboost` with the capability of running on Kubernetes and automatic hyper-parameter estimation.
In particular, Ant-XGBoost includes `auto_train` methods for automatic training and introduces an additional parameter, `convergence_criteria`, for a generalized early-stopping strategy.

Collaborator

Maybe we can add links that explain `auto_train` and `convergence_criteria`, so that users can understand these concepts clearly.

Contributor Author

Done.
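
For readers of this thread who have not met early stopping before, here is a minimal sketch of the general idea that a parameter such as `convergence_criteria` generalizes. It is not Ant-XGBoost's implementation: the function name, the `patience` and `min_delta` parameters, and the stopping rule itself are assumptions chosen only for illustration.

```python
from typing import Iterable, Optional


def first_converged_round(
    eval_losses: Iterable[float],
    patience: int = 3,
    min_delta: float = 0.0,
) -> Optional[int]:
    """Return the 0-based round at which training would stop, or None.

    Hypothetical patience rule (an assumption, not Ant-XGBoost's criteria):
    stop once the validation loss has failed to beat the best value seen so
    far by more than `min_delta` for `patience` consecutive rounds.
    """
    best = float("inf")
    rounds_without_gain = 0
    for round_idx, loss in enumerate(eval_losses):
        if loss < best - min_delta:
            # Meaningful improvement: remember it and reset the counter.
            best = loss
            rounds_without_gain = 0
        else:
            rounds_without_gain += 1
            if rounds_without_gain >= patience:
                return round_idx
    # The loss kept improving often enough; no early stop was triggered.
    return None
```

With `patience=3`, a loss sequence that stalls after its third value stops three rounds later, while a sequence that keeps decreasing never triggers the rule.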


## Tutorial
We provide an [interactive tutorial](../example/jupyter/tutorial_antxgb.ipynb) via jupyter notebook, which can be run out-of-the-box in [sqlflow playground](https://play.sqlflow.org).
If you want to run it locally, you need to install sqlflow first. You can learn how to install sqlflow at [here](../doc/installation.md).

Collaborator

sqlflow => SQLFlow

Contributor Author

Done


* xgboost.Regressor

Estimator for regression task, set `train.objective` to `reg:squarederror`(`reg:linear`).

Collaborator

Do users need to set objective=train.objective in the WITH clause or not? If not, what would the value of objective be?

Contributor Author

Done
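
As background for the objective named in this thread: in dmlc/xgboost, `reg:squarederror` is the squared-error regression objective, and `reg:linear` is its older, deprecated name. The sketch below shows the per-example gradient and Hessian that a squared-error loss of the form 0.5 * (pred - label)^2 produces. It is an illustration only, with a hypothetical function name, and it does not answer whether the SQLFlow codegen sets the objective by default.

```python
from typing import List, Sequence, Tuple


def squared_error_grad_hess(
    preds: Sequence[float], labels: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Gradient and Hessian of 0.5 * (pred - label)**2 for each example.

    Illustrative helper (hypothetical name), not XGBoost's own code.
    """
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    # d/dpred of 0.5 * (pred - label)^2 is (pred - label).
    grad = [p - y for p, y in zip(preds, labels)]
    # The second derivative is the constant 1 for every example.
    hess = [1.0] * len(preds)
    return grad, hess
```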

### Columns

#### Feature Columns
For now, two feature column schemas are available.

Collaborator

two feature column schemas
=>
two kinds of feature columns?

Contributor Author

Done


The first is the `dense schema`, which concatenates numeric table columns transparently, such as `COLUMN f1, f2, f3, f4`.

The second is the `sparse key-value schema`, which receives a sparse feature string formatted like `$k1:$v1,$k2:$v2,...`.

Collaborator

What do k and v mean?

Contributor Author

Done
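
To make the two kinds of feature columns quoted above concrete, here is a rough sketch of how a row could be read under each. The thread does not record what `k` and `v` stand for, so the sketch assumes `k` is an integer feature index and `v` is a numeric feature value. Both helper names are hypothetical and are not part of SQLFlow or Ant-XGBoost.

```python
from typing import Dict, List, Sequence


def dense_row(values: Sequence[float]) -> List[float]:
    """Dense case: numeric columns (e.g. f1..f4) concatenated in order."""
    return [float(v) for v in values]


def parse_sparse_kv(text: str) -> Dict[int, float]:
    """Sparse case: parse 'k1:v1,k2:v2,...' into {index: value}.

    Assumption: keys are integer feature indices, values are numbers.
    """
    features: Dict[int, float] = {}
    for pair in text.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty input or a trailing comma
        key, sep, value = pair.partition(":")
        if not sep:
            raise ValueError(f"expected 'key:value', got {pair!r}")
        features[int(key)] = float(value)
    return features
```

Under these assumptions, the string `0:1.5,3:2,7:0.25` describes a row whose features 0, 3, and 7 are set and whose other features are absent.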

@Yancey0623 (Collaborator) left a comment

LGTM!

@sperlingxx merged commit 8fe6cde into sql-machine-learning:develop Sep 5, 2019
@sperlingxx deleted the antxgb_doc branch September 5, 2019 12:48
Successfully merging this pull request may close these issues.

Add documentation on XgBoost models
4 participants