# _User Guide:_ Ant-XGBoost on SQLFlow

## Overview

[Ant-XGBoost](https://github.com/alipay/ant-xgboost) is a fork of [dmlc/xgboost](https://github.com/dmlc/xgboost) maintained by active contributors of dmlc/xgboost at Alipay Inc.

Ant-XGBoost extends `dmlc/xgboost` with the capability of running on Kubernetes and with automatic hyper-parameter estimation.
In particular, Ant-XGBoost includes `auto_train` methods for automatic training and introduces an additional parameter, `convergence_criteria`, for a generalized early stopping strategy.
See the Supplementary section for more details about automatic training and the generalized early stopping strategy.

## Tutorial

We provide an [interactive tutorial](../example/jupyter/tutorial_antxgb.ipynb) as a Jupyter notebook, which runs out-of-the-box in the [SQLFlow playground](https://play.sqlflow.org).
To run it locally, install SQLFlow first; see the [installation guide](../doc/installation.md).

## Concepts

### Estimators

We provide various XGBoost estimators for a better user experience.
All of them are case-insensitive and share the same prefix `xgboost`. They are listed below; a minimal usage example follows the list.

* xgboost.Estimator

  A general estimator; when using it, `train.objective` must be defined explicitly.

* xgboost.Classifier

  Estimator for classification tasks; works together with `train.num_class` and defaults to binary classification.

* xgboost.BinaryClassifier

  Estimator for binary classification tasks; `train.objective` is set to `binary:logistic`.

* xgboost.MultiClassifier

  Estimator for multi-class classification tasks; `train.objective` is set to `multi:softprob`. It should be used with `train.num_class` > 2.

* xgboost.Regressor

  Estimator for regression tasks; `train.objective` is set to `reg:squarederror` (formerly `reg:linear`).

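As a quick illustration, a minimal training statement that picks one of these estimators might look like the sketch below. The table, column, and model names (`training_table`, `c1`..`c3`, `label_col`, `my_antxgb_model`) are placeholders chosen for this guide, not fixed names; see the full syntax section below for all optional clauses.

```sql
SELECT c1, c2, c3, label_col FROM training_table
TRAIN xgboost.BinaryClassifier
COLUMN c1, c2, c3
LABEL label_col
INTO my_antxgb_model;
```
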
### Columns

* Feature Columns

  For now, two kinds of feature columns are available (see the combined example after this list).

  The first is the `dense schema`, which concatenates numeric table columns transparently, such as `COLUMN f1, f2, f3, f4`.

  The second is the `sparse key-value schema`, which accepts a LIBSVM-style key-value string formatted like `$k1:$v1,$k2:$v2,...`.
  This schema is marked with the keyword `SPARSE`, such as `COLUMN SPARSE(col1)`.

* Label Column

  Following general SQLFlow syntax, the label clause of Ant-XGBoost is written as `LABEL $label_col`.

* Group Column

  In training mode, a group column can be declared in a separate column clause. The group column is identified by the keyword `group`, such as `COLUMN ${group_col} FOR group`.

* Weight Column

  Like the group column, a weight column is identified by the keyword `weight`, such as `COLUMN ${weight_col} FOR weight`.

* Result Columns

  The schema of the straightforward result (class id for classification tasks, score for regression tasks) follows general SQLFlow syntax: `PREDICT ${output_table}.${result_column}`.

  In addition, we provide supplementary information about XGBoost predictions. It can be configured with `pred.attributes`.

  * append columns

    Columns of the prediction data table that need to be appended to the result table, such as an id column or a label column.

    The syntax is `pred.append_columns = [$col1, $col2, ...]`.

  * classification probability

    The probability of the chosen class; only works in classification tasks.

    The syntax is `pred.prob_column = ${col}`.

  * classification detail

    A JSON string holding the probability distribution over all classes, formatted like `{$class_id:$class_prob,...}`; only works in classification tasks.

    The syntax is `pred.detail_column = ${col}`.

  * encoding of leaf indices

    The predicted leaf index in each tree; the indices are joined, in order, into a string with the format `$id_1,$id_2,...`.

    The syntax is `pred.encoding_column = ${col}`.

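The training-side column clauses above can be combined in a single statement. The sketch below only illustrates the clause syntax; `ranking_table`, `kv_features`, `query_id`, `sample_weight`, `relevance`, and `my_ranking_model` are hypothetical names invented for this guide, not SQLFlow keywords.

```sql
SELECT kv_features, query_id, sample_weight, relevance FROM ranking_table
TRAIN xgboost.Regressor
COLUMN SPARSE(kv_features)
COLUMN query_id FOR group
COLUMN sample_weight FOR weight
LABEL relevance
INTO my_ranking_model;
```
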
### Attributes

There are two kinds of attributes: `train.attributes` and `pred.attributes`.
`train.attributes`, which start with the prefix `train.`, only work in training mode.
`pred.attributes`, which start with the prefix `pred.`, only work in prediction mode.

All attributes are optional, except that `train.objective` must be defined when training with `xgboost.Estimator`.

#### Available train.attributes

* [General Params](https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters)
  * train.booster
  * train.verbosity

* [Tree Booster Params](https://xgboost.readthedocs.io/en/latest/parameter.html#parameters-for-tree-booster)
  * train.eta
  * train.gamma
  * train.max_depth
  * train.min_child_weight
  * train.max_delta_step
  * train.subsample
  * train.colsample_bytree
  * train.colsample_bylevel
  * train.colsample_bynode
  * train.lambda
  * train.alpha
  * train.tree_method
  * train.sketch_eps
  * train.scale_pos_weight
  * train.grow_policy
  * train.max_leaves
  * train.max_bin
  * train.num_parallel_tree

* [Learning Task Params](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters)
  * train.objective
  * train.eval_metric
  * train.seed
  * train.num_round
    * The number of boosting rounds
  * train.num_class
    * The number of label classes in a classification task

* AutoTrain Params
  * train.convergence_criteria
    * see the Supplementary section for more details
  * train.auto_train
    * see the Supplementary section for more details

#### Available pred.attributes

* pred.append_columns
* pred.prob_column
* pred.detail_column
* pred.encoding_column

## Overall SQL Syntax for Ant-XGBoost

### Training Syntax

```sql
-- standard select clause
SELECT ... FROM ${TABLE_NAME}
-- train clause
TRAIN xgboost.${estimatorType}
WITH
    [optional] ${train.attributes}
    ......
    ......
COLUMN ${feature_columns}
[optional] COLUMN ${group_column} FOR group
[optional] COLUMN ${weight_column} FOR weight
LABEL ${label_column}
INTO ${model};
```

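As a concrete sketch of the training syntax, the statement below trains a regressor with a few numeric `train.attributes`. The table (`house_price`), its columns, the attribute values, and the model name are hypothetical and only illustrate the clause layout.

```sql
SELECT sqft, num_rooms, age, price FROM house_price
TRAIN xgboost.Regressor
WITH
    train.eta = 0.3,
    train.max_depth = 6,
    train.num_round = 100
COLUMN sqft, num_rooms, age
LABEL price
INTO my_price_model;
```
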
### Prediction Syntax

```sql
-- standard select clause
SELECT ... FROM ${TABLE_NAME}
-- pred clause
PREDICT ${output_table}.${result_column}
WITH
    [optional] ${pred.attributes}
    ......
USING ${model};
```
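
And a matching prediction sketch, again with hypothetical table, column, and model names, showing a couple of `pred.attributes`:

```sql
SELECT id, f1, f2, f3 FROM apply_table
PREDICT result_table.class_id
WITH
    pred.append_columns = [id],
    pred.prob_column = prob
USING my_antxgb_model;
```
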

## Supplementary

### Generalized Early Stopping Strategy

`dmlc/xgboost` stops when there is no significant improvement in the most recent `n` boosting rounds, where `n` is a configurable parameter.
In Ant-XGBoost, we generalize this strategy and call the new strategy the convergence test.
We keep track of the series of metric values and determine whether the series has converged or not.
There are three main parameters to test convergence: `minNumPoints`, `n`, and `c`.
Only after the series is at least `minNumPoints` long does it become eligible for the convergence test.
Once a series is at least `minNumPoints` long, we find the index `idx` of the best metric value so far.
We say a series has converged if `idx + n < size * c`, where `size` is the current number of points in the series.
The intuition is that the best metric value should have peaked (or bottomed out) with a wide margin; a worked example follows the list below.

With `n` and `c` we can implement complex convergence rules, but there are two common cases.

* `n > 0` and `c = 1.0`

  This reduces to the standard early stopping strategy employed by dmlc/xgboost.

* `n = 0` and `c in [0, 1]`

  For example, `n = 0` and `c = 0.8` means there should be at least 20% of the points after the best metric value. A smaller value of `c` leads to a more conservative convergence test. This rule tests convergence in an adaptive way; for problems where the metric values are noisy and improve slowly, it has a better chance of finding the optimal model.
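
As a worked illustration of the second rule (the numbers here are made up for this guide): with `n = 0` and `c = 0.8`, a series of `size = 50` evaluation points whose best value occurs at `idx = 35` satisfies `35 + 0 < 50 * 0.8 = 40`, so training is considered converged; had the best value occurred at `idx = 45`, training would continue.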

In addition, the convergence test understands the optimization direction of all built-in metrics, so there is no need to set the `maximize` parameter (it defaults to `false`, and forgetting to set it often leads to strange behavior when the metric should be maximized).

### AutoTrain

With the convergence test, we implement a simple `auto_train` method. There are several components in `auto_train`:

* Automatic parameter validation, setting, and rewriting

  Setting the right parameters in XGBoost is not easy. For example, when working with `binary:logistic`, one should not set `num_class` to 2 (otherwise XGBoost will fail with an exception).

  In Ant-XGBoost, we validate parameters to make sure they are consistent with each other, e.g., `num_class = 3` together with `objective = binary:logistic` will raise an exception.

  In addition, we try our best to understand the input parameters from the user and automatically set or rewrite some parameters in `auto_train` mode.
  For example, when the feature dimension is very high, building a single tree is very inefficient,
  so we automatically set `colsample_bytree` to make sure at most 2000 features are used to build each tree. Note that automatic parameter rewriting is only turned on in `auto_train` mode; in standard `train`, we only validate parameters and the behavior is fully controlled by the user.

* Automatic training

  With the convergence test, the number of trees in a boosted ensemble becomes a less important parameter;
  one can always set a very large number and rely on the convergence test to figure out the right number of trees.
  The most important parameters to tune in boosted trees now become the learning rate and the maximum depth.
  In Ant-XGBoost, we employ grid search with early stopping to efficiently search for the best model structure;
  unpromising learning rates or depths are skipped entirely.

While the current `auto_train` method is a very simple approach, we are working on better strategies to further scale up hyperparameter tuning in XGBoost training.
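
As a hedged sketch of how automatic training might be requested in a query: the `train.auto_train` attribute is listed above, but its exact value format is not specified in this guide, so the value below is an assumption, as are the table, column, and model names.

```sql
SELECT f1, f2, f3, label_col FROM training_table
TRAIN xgboost.Classifier
WITH
    -- assumed boolean syntax; check the interactive tutorial for the exact format
    train.auto_train = true
COLUMN f1, f2, f3
LABEL label_col
INTO my_auto_model;
```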
