# _user guide:_ Ant-XGBoost on SQLFlow

## Overview

[Ant-XGBoost](https://github.com/alipay/ant-xgboost) is a fork of [dmlc/xgboost](https://github.com/dmlc/xgboost) maintained by active contributors to dmlc/xgboost at Alipay Inc.

Ant-XGBoost extends `dmlc/xgboost` with the ability to run on Kubernetes and with automatic hyper-parameter estimation.
In particular, Ant-XGBoost includes an `auto_train` method for automatic training and introduces an additional parameter, `convergence_criteria`, for a generalized early stopping strategy.
See the Supplementary section for more details about automatic training and the generalized early stopping strategy.

## Tutorial
We provide an [interactive tutorial](../example/jupyter/tutorial_antxgb.ipynb) as a Jupyter notebook, which runs out of the box in the [SQLFlow playground](https://play.sqlflow.org).
If you want to run it locally, you need to install SQLFlow first; see the [installation guide](../doc/installation.md).

## Concepts
### Estimators
We provide various XGBoost estimators for a better user experience.
Their names are case-insensitive and share the prefix `xgboost`. They are listed below, followed by a short example.

* xgboost.Estimator

  General estimator; when using it, `train.objective` must be defined explicitly.

* xgboost.Classifier

  Estimator for classification tasks; works with `train.num_class`. The default is binary classification.

* xgboost.BinaryClassifier

  Estimator for binary classification tasks; `train.objective` is fixed to `binary:logistic`.

* xgboost.MultiClassifier

  Estimator for multi-class classification tasks; `train.objective` is fixed to `multi:softprob`. It requires `train.num_class` > 2.

* xgboost.Regressor

  Estimator for regression tasks; `train.objective` is fixed to `reg:squarederror` (formerly `reg:linear`).

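For instance, the following two training statements are equivalent ways to train a binary classifier; the table, column, and model names are hypothetical, and attribute quoting is illustrative:

```sql
-- with the general estimator, the objective must be set explicitly
SELECT * FROM train_table
TRAIN xgboost.Estimator
WITH train.objective = "binary:logistic"
COLUMN f1, f2, f3
LABEL label
INTO my_xgb_model;

-- the dedicated binary classifier implies binary:logistic
SELECT * FROM train_table
TRAIN xgboost.BinaryClassifier
COLUMN f1, f2, f3
LABEL label
INTO my_xgb_model;
```
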
### Columns

* Feature Columns

  For now, two kinds of feature columns are available (a combined example is given at the end of this list).

  The first is the `dense schema`, which transparently concatenates numeric table columns, such as `COLUMN f1, f2, f3, f4`.

  The second is the `sparse key-value schema`, which accepts LIBSVM-style key-value strings formatted like `$k1:$v1,$k2:$v2,...`.
  This schema is marked with the keyword `SPARSE`, such as `COLUMN SPARSE(col1)`.

* Label Column

  Following general SQLFlow syntax, the label clause of Ant-XGBoost is written as `LABEL ${label_col}`.

* Group Column

  In training mode, a group column can be declared in a separate column clause. It is identified by the keyword `group`, such as `COLUMN ${group_col} FOR group`.

* Weight Column

  Like the group column, a weight column is identified by the keyword `weight`, such as `COLUMN ${weight_col} FOR weight`.

* Result Columns

  The primary result (the class id for classification tasks, the score for regression tasks) follows general SQLFlow syntax: `PREDICT ${output_table}.${result_column}`.

  In addition, we provide supplementary information about the XGBoost prediction. It can be configured with `pred.attributes`.

  * append columns

    Columns of the prediction data table that should be appended to the result table, such as an id column or the label column.

    The syntax is `pred.append_columns = [$col1, $col2, ...]`.

  * classification probability

    The probability of the predicted class; it only works for classification tasks.

    The syntax is `pred.prob_column = ${col}`.

  * classification detail

    A JSON string holding the probability distribution over all classes, formatted like `{$class_id:$class_prob,...}`; it only works for classification tasks.

    The syntax is `pred.detail_column = ${col}`.

  * encoding of leaf indices

    The predicted leaf index of each tree; the indices are joined in order into a string with the format `$id_1,$id_2,...`.

    The syntax is `pred.encoding_column = ${col}`.

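A minimal sketch combining the column clauses above (all table, column, and model names are hypothetical, and attribute separators/quoting are illustrative):

```sql
-- training: sparse key-value features, plus optional group and weight columns
SELECT * FROM train_table
TRAIN xgboost.BinaryClassifier
COLUMN SPARSE(kv_features)
COLUMN query_id FOR group
COLUMN sample_weight FOR weight
LABEL label
INTO my_xgb_model;

-- prediction: the class id is written to result_table.class_id; the id column
-- is copied over and the probability of the predicted class is stored in prob
SELECT * FROM pred_table
PREDICT result_table.class_id
WITH
  pred.append_columns = [id],
  pred.prob_column = prob
USING my_xgb_model;
```
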
### Attributes

There are two kinds of attributes, `train.attributes` and `pred.attributes`.
`train.attributes`, which start with the prefix `train.`, only apply in training mode.
`pred.attributes`, which start with the prefix `pred.`, only apply in prediction mode.

All attributes are optional, except that `train.objective` must be defined when training with `xgboost.Estimator`.

#### Available train.attributes

* [General Params](https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters)
  * train.booster
  * train.verbosity

* [Tree Booster Params](https://xgboost.readthedocs.io/en/latest/parameter.html#parameters-for-tree-booster)
  * train.eta
  * train.gamma
  * train.max_depth
  * train.min_child_weight
  * train.max_delta_step
  * train.subsample
  * train.colsample_bytree
  * train.colsample_bylevel
  * train.colsample_bynode
  * train.lambda
  * train.alpha
  * train.tree_method
  * train.sketch_eps
  * train.scale_pos_weight
  * train.grow_policy
  * train.max_leaves
  * train.max_bin
  * train.num_parallel_tree

* [Learning Task Params](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters)
  * train.objective
  * train.eval_metric
  * train.seed
  * train.num_round
    * The number of rounds for boosting
  * train.num_class
    * The number of label classes in a classification task

* AutoTrain Params
  * train.convergence_criteria
    * see the Supplementary section for more details
  * train.auto_train
    * see the Supplementary section for more details

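A `WITH` clause can combine several of these attributes; the statement below is an illustrative sketch (table, column, and model names are hypothetical, and attribute separators/quoting are illustrative):

```sql
SELECT * FROM train_table
TRAIN xgboost.MultiClassifier
WITH
  train.num_class = 3,
  train.eta = 0.1,
  train.max_depth = 6,
  train.subsample = 0.8,
  train.eval_metric = "mlogloss",
  train.num_round = 300
COLUMN f1, f2, f3, f4
LABEL label
INTO my_xgb_model;
```
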
#### Available pred.attributes

* pred.append_columns
* pred.prob_column
* pred.detail_column
* pred.encoding_column

## Overall SQL Syntax for Ant-XGBoost
### Training Syntax
```sql
-- standard select clause
SELECT ... FROM ${TABLE_NAME}
-- train clause
TRAIN xgboost.${estimatorType}
WITH
  [optional] ${train.attributes}
  ......
  ......
COLUMN ${feature_columns}
[optional] COLUMN ${group_column} FOR group
[optional] COLUMN ${weight_column} FOR weight
LABEL ${label_column}
INTO ${model};
```
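
A concrete training statement might look like the following sketch (table, column, and model names are hypothetical, and attribute separators/quoting are illustrative):

```sql
SELECT * FROM boston_housing_train
TRAIN xgboost.Regressor
WITH
  train.max_depth = 5,
  train.num_round = 200
COLUMN crim, zn, indus, nox, rm, age
LABEL medv
INTO my_housing_model;
```
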
### Prediction Syntax
```sql
-- standard select clause
SELECT ... FROM ${TABLE_NAME}
-- pred clause
PREDICT ${output_table}.${result_column}
WITH
  [optional] ${pred.attributes}
  ......
USING ${model};
```
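
A concrete prediction statement might look like the following sketch (table, column, and model names are hypothetical; all `pred.attributes` are optional and their separators/quoting are illustrative):

```sql
SELECT * FROM credit_pred_data
PREDICT credit_result.class_id
WITH
  pred.append_columns = [id],
  pred.detail_column = class_probs,
  pred.encoding_column = leaf_ids
USING my_xgb_model;
```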

## Supplementary
### Generalized Early Stopping Strategy
`dmlc/xgboost` stops training when there has been no significant improvement in the most recent `n` boosting rounds, where `n` is a configurable parameter.
In Ant-XGBoost, we generalize this strategy and call the new strategy the convergence test.
We keep track of the series of metric values and determine whether the series has converged.
There are three main parameters in the convergence test: `minNumPoints`, `n` and `c`.
A series becomes eligible for the convergence test only once it contains at least `minNumPoints` points.
Once a series is at least `minNumPoints` long, we find the index `idx` of the best metric value so far.
We say the series has converged if `idx + n < size * c`, where `size` is the current number of points in the series.
The intuition is that the best metric value should have peaked (or bottomed out) with a wide margin.

With `n` and `c` we can implement complex convergence rules, but there are two common cases (a worked example follows the list).
* `n > 0` and `c = 1.0`

  This reduces to the standard early stopping strategy employed by dmlc/xgboost.

* `n = 0` and `c in [0, 1]`

  For example, `n = 0` and `c = 0.8` means that at least 20% of the points must come after the best metric value. A smaller `c` leads to a more conservative convergence test. This rule tests convergence in an adaptive way; for problems where the metric values are noisy and improve slowly, it has a better chance of finding the optimal model.

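As a worked example (the numbers are illustrative), suppose `minNumPoints = 10`, `n = 0` and `c = 0.8`, and the series currently holds `size = 100` metric values with the best one so far at index `idx = 70`. Then `70 + 0 < 100 * 0.8 = 80`, so the series is declared converged; had the best value appeared at `idx = 85`, training would continue.
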
In addition, the convergence test knows the optimization direction of all built-in metrics, so there is no need to set the `maximize` parameter (it defaults to `false`, and forgetting to set it often leads to strange behavior when the metric should be maximized).


### AutoTrain
With the convergence test, we implement a simple `auto_train` method. There are several components in `auto_train` (a usage sketch follows this list):
* Automatic parameter validation, setting and rewriting

  Setting the right parameters in XGBoost is not easy. For example, when working with `binary:logistic`, one should not set `num_class` to 2 (otherwise XGBoost will fail with an exception).

  In Ant-XGBoost, we validate parameters to make sure they are all consistent with each other; e.g., `num_class = 3` together with `objective = binary:logistic` will raise an exception.

  In addition, we try our best to understand the input parameters from the user and automatically set or rewrite some parameters in `auto_train` mode.
  For example, when the feature dimension is very high, building a single tree is very inefficient,
  so we automatically set `colsample_bytree` to make sure at most 2000 features are used to build each tree. Note that automatic parameter rewriting is only turned on in `auto_train` mode; in standard `train` mode, we only validate parameters and the behavior is fully controlled by the user.

* Automatic training

  With the convergence test, the number of trees in a boosted ensemble becomes a less important parameter;
  one can always set a very large number and rely on the convergence test to figure out the right number of trees.
  The most important parameters to tune in boosted trees now become the learning rate and the maximum depth.
  In Ant-XGBoost, we employ grid search with early stopping to efficiently search for the best model structure;
  unpromising learning rates or depths are skipped entirely.

  While the current `auto_train` method is a very simple approach, we are working on better strategies to further scale up hyperparameter tuning in XGBoost training.
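
Enabling automatic training is then a matter of setting the corresponding attributes in the `WITH` clause. The sketch below is illustrative only: the table, column, and model names are hypothetical, and the exact value formats accepted by `train.auto_train` and `train.convergence_criteria` are not specified in this guide.

```sql
SELECT * FROM train_table
TRAIN xgboost.BinaryClassifier
WITH
  train.auto_train = true,      -- assumed boolean-style value; turns on auto_train mode
  train.num_round = 1000        -- a generous upper bound; the convergence test decides when to stop
COLUMN f1, f2, f3
LABEL label
INTO my_auto_model;
```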