XGBoost tutorial (#820)

Yancey0623 · web-flow · commit 0a09dfd359a2 · 2019-09-16T17:01:39.000+08:00
* xgboost tutorial

* update by comment

* update by comment

* update
diff --git a/example/jupyter/.gitignore b/example/jupyter/.gitignore
@@ -0,0 +1 @@
+.ipynb_checkpoints
diff --git a/example/jupyter/tutorial_xgboost.ipynb b/example/jupyter/tutorial_xgboost.ipynb
@@ -0,0 +1,357 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# XGBoost on SQLFlow Tutorial\n",
+    "\n",
+    "This tutorial shows how to train an XGBoost model and use it for prediction in SQLFlow. You can find more SQLFlow usage in the [User Guide](https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/user_guide.md). In this tutorial you will learn how to:\n",
+    "- Train an XGBoost model to fit the Boston housing dataset; and\n",
+    "- Predict the housing price using the trained model.\n",
+    "\n",
+    "\n",
+    "## The Dataset\n",
+    "\n",
+    "This tutorial uses the [Boston Housing](https://www.kaggle.com/c/boston-housing) dataset for demonstration.\n",
+    "The dataset contains 506 rows and 14 columns. The meaning of each column is as follows:\n",
+    "\n",
+    "Column | Description \n",
+    "-- | -- \n",
+    "crim|per capita crime rate by town.\n",
+    "zn|proportion of residential land zoned for lots over 25,000 sq.ft.\n",
+    "indus|proportion of non-retail business acres per town.\n",
+    "chas|Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).\n",
+    "nox|nitrogen oxides concentration (parts per 10 million).\n",
+    "rm|average number of rooms per dwelling.\n",
+    "age|proportion of owner-occupied units built prior to 1940.\n",
+    "dis|weighted mean of distances to five Boston employment centres.\n",
+    "rad|index of accessibility to radial highways.\n",
+    "tax|full-value property-tax rate per \\$10,000.\n",
+    "ptratio|pupil-teacher ratio by town.\n",
+    "b|1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.\n",
+    "lstat|lower status of the population (percent).\n",
+    "medv|median value of owner-occupied homes in \\$1000s.\n",
+    "\n",
+    "We split the dataset into a training table (`boston.train`) and a test table (`boston.test`), used to train the model and to predict with it, respectively. SQLFlow automatically splits the training data into training and validation sets during training."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "+---------+---------+------+-----+---------+-------+\n",
+       "|  Field  |   Type  | Null | Key | Default | Extra |\n",
+       "+---------+---------+------+-----+---------+-------+\n",
+       "|   crim  |  float  | YES  |     |   None  |       |\n",
+       "|    zn   |  float  | YES  |     |   None  |       |\n",
+       "|  indus  |  float  | YES  |     |   None  |       |\n",
+       "|   chas  | int(11) | YES  |     |   None  |       |\n",
+       "|   nox   |  float  | YES  |     |   None  |       |\n",
+       "|    rm   |  float  | YES  |     |   None  |       |\n",
+       "|   age   |  float  | YES  |     |   None  |       |\n",
+       "|   dis   |  float  | YES  |     |   None  |       |\n",
+       "|   rad   | int(11) | YES  |     |   None  |       |\n",
+       "|   tax   | int(11) | YES  |     |   None  |       |\n",
+       "| ptratio |  float  | YES  |     |   None  |       |\n",
+       "|    b    |  float  | YES  |     |   None  |       |\n",
+       "|  lstat  |  float  | YES  |     |   None  |       |\n",
+       "|   medv  |  float  | YES  |     |   None  |       |\n",
+       "+---------+---------+------+-----+---------+-------+"
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "%%sqlflow\n",
+    "describe boston.train;"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "+---------+---------+------+-----+---------+-------+\n",
+       "|  Field  |   Type  | Null | Key | Default | Extra |\n",
+       "+---------+---------+------+-----+---------+-------+\n",
+       "|   crim  |  float  | YES  |     |   None  |       |\n",
+       "|    zn   |  float  | YES  |     |   None  |       |\n",
+       "|  indus  |  float  | YES  |     |   None  |       |\n",
+       "|   chas  | int(11) | YES  |     |   None  |       |\n",
+       "|   nox   |  float  | YES  |     |   None  |       |\n",
+       "|    rm   |  float  | YES  |     |   None  |       |\n",
+       "|   age   |  float  | YES  |     |   None  |       |\n",
+       "|   dis   |  float  | YES  |     |   None  |       |\n",
+       "|   rad   | int(11) | YES  |     |   None  |       |\n",
+       "|   tax   | int(11) | YES  |     |   None  |       |\n",
+       "| ptratio |  float  | YES  |     |   None  |       |\n",
+       "|    b    |  float  | YES  |     |   None  |       |\n",
+       "|  lstat  |  float  | YES  |     |   None  |       |\n",
+       "|   medv  |  float  | YES  |     |   None  |       |\n",
+       "+---------+---------+------+-----+---------+-------+"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "%%sqlflow\n",
+    "describe boston.test;"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Fit Boston Housing Dataset\n",
+    "\n",
+    "First, let's train an XGBoost regression model to fit the Boston housing dataset. We train the model for 30 rounds\n",
+    "with the `squarederror` loss function, so the SQLFlow extended SQL can look like this:\n",
+    "\n",
+    "``` sql\n",
+    "TRAIN xgboost.gbtree\n",
+    "WITH\n",
+    "    train.num_boost_round=30,\n",
+    "    objective=\"reg:squarederror\"\n",
+    "```\n",
+    "\n",
+    "`xgboost.gbtree` is the estimator name, and `gbtree` is one of the XGBoost boosters. You can find more information in the [XGBoost parameter documentation](https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters).\n",
+    "\n",
+    "We can specify the training data columns in the `COLUMN` clause, and the label with the `LABEL` keyword:\n",
+    "\n",
+    "``` sql\n",
+    "COLUMN crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat\n",
+    "LABEL medv\n",
+    "```\n",
+    "\n",
+    "To save the trained model, we can use the `INTO` clause to specify a model name:\n",
+    "\n",
+    "``` sql\n",
+    "INTO sqlflow_models.my_xgb_regression_model\n",
+    "```\n",
+    "\n",
+    "Second, let's use a standard SQL statement to fetch the training data from the table `boston.train`:\n",
+    "\n",
+    "``` sql\n",
+    "SELECT * FROM boston.train\n",
+    "```\n",
+    "\n",
+    "Finally, the following is the complete SQLFlow train statement for this regression task. You can run it in the cell:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[03:44:56] 387x13 matrix with 5031 entries loaded from train.txt\n",
+      "\n",
+      "[03:44:56] 109x13 matrix with 1417 entries loaded from test.txt\n",
+      "\n",
+      "[0]\ttrain-rmse:17.0286\tvalidation-rmse:17.8089\n",
+      "\n",
+      "[1]\ttrain-rmse:12.285\tvalidation-rmse:13.2787\n",
+      "\n",
+      "[2]\ttrain-rmse:8.93071\tvalidation-rmse:9.87677\n",
+      "\n",
+      "[3]\ttrain-rmse:6.60757\tvalidation-rmse:7.64013\n",
+      "\n",
+      "[4]\ttrain-rmse:4.96022\tvalidation-rmse:6.0181\n",
+      "\n",
+      "[5]\ttrain-rmse:3.80725\tvalidation-rmse:4.95013\n",
+      "\n",
+      "[6]\ttrain-rmse:2.94382\tvalidation-rmse:4.2357\n",
+      "\n",
+      "[7]\ttrain-rmse:2.36361\tvalidation-rmse:3.74683\n",
+      "\n",
+      "[8]\ttrain-rmse:1.95236\tvalidation-rmse:3.43284\n",
+      "\n",
+      "[9]\ttrain-rmse:1.66604\tvalidation-rmse:3.20455\n",
+      "\n",
+      "[10]\ttrain-rmse:1.4738\tvalidation-rmse:3.08947\n",
+      "\n",
+      "[11]\ttrain-rmse:1.35336\tvalidation-rmse:3.0492\n",
+      "\n",
+      "[12]\ttrain-rmse:1.22835\tvalidation-rmse:2.99508\n",
+      "\n",
+      "[13]\ttrain-rmse:1.15615\tvalidation-rmse:2.98604\n",
+      "\n",
+      "[14]\ttrain-rmse:1.11082\tvalidation-rmse:2.96433\n",
+      "\n",
+      "[15]\ttrain-rmse:1.01666\tvalidation-rmse:2.96584\n",
+      "\n",
+      "[16]\ttrain-rmse:0.953761\tvalidation-rmse:2.94013\n",
+      "\n",
+      "[17]\ttrain-rmse:0.905753\tvalidation-rmse:2.91569\n",
+      "\n",
+      "[18]\ttrain-rmse:0.870137\tvalidation-rmse:2.89735\n",
+      "\n",
+      "[19]\ttrain-rmse:0.800778\tvalidation-rmse:2.87206\n",
+      "\n",
+      "[20]\ttrain-rmse:0.757704\tvalidation-rmse:2.86564\n",
+      "\n",
+      "[21]\ttrain-rmse:0.74058\tvalidation-rmse:2.86587\n",
+      "\n",
+      "[22]\ttrain-rmse:0.66901\tvalidation-rmse:2.86224\n",
+      "\n",
+      "[23]\ttrain-rmse:0.647195\tvalidation-rmse:2.87395\n",
+      "\n",
+      "[24]\ttrain-rmse:0.609025\tvalidation-rmse:2.86069\n",
+      "\n",
+      "[25]\ttrain-rmse:0.562925\tvalidation-rmse:2.87205\n",
+      "\n",
+      "[26]\ttrain-rmse:0.541676\tvalidation-rmse:2.86275\n",
+      "\n",
+      "[27]\ttrain-rmse:0.524815\tvalidation-rmse:2.87106\n",
+      "\n",
+      "[28]\ttrain-rmse:0.483566\tvalidation-rmse:2.86129\n",
+      "\n",
+      "[29]\ttrain-rmse:0.460363\tvalidation-rmse:2.85877\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%sqlflow\n",
+    "SELECT * FROM boston.train\n",
+    "TRAIN xgboost.gbtree\n",
+    "WITH\n",
+    "    objective=\"reg:squarederror\",\n",
+    "    train.num_boost_round = 30\n",
+    "COLUMN crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat\n",
+    "LABEL medv\n",
+    "INTO sqlflow_models.my_xgb_regression_model;"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Predict the Housing Price\n",
+    "After training the regression model, let's predict the housing price using the trained model.\n",
+    "\n",
+    "First, we can specify the trained model with the `USING` clause:\n",
+    "\n",
+    "``` sql\n",
+    "USING sqlflow_models.my_xgb_regression_model\n",
+    "```\n",
+    "\n",
+    "Then, we can specify the prediction result table and the column that holds the predicted value with the `PREDICT` clause:\n",
+    "\n",
+    "``` sql\n",
+    "PREDICT boston.predict.medv\n",
+    "```\n",
+    "\n",
+    "And use a standard SQL statement to fetch the prediction data:\n",
+    "\n",
+    "``` sql\n",
+    "SELECT * FROM boston.test\n",
+    "```\n",
+    "\n",
+    "Finally, the following is the complete SQLFlow prediction statement:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[03:45:18] 10x13 matrix with 130 entries loaded from predict.txt\n",
+      "\n",
+      "Done predicting. Predict table : boston.predict\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%sqlflow\n",
+    "SELECT * FROM boston.test\n",
+    "PREDICT boston.predict.medv\n",
+    "USING sqlflow_models.my_xgb_regression_model;"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's take a look at the prediction results; the `medv` column holds the predicted prices."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "+---------+-----+-------+------+-------+-------+------+--------+-----+-----+---------+--------+-------+---------+\n",
+       "|   crim  |  zn | indus | chas |  nox  |   rm  | age  |  dis   | rad | tax | ptratio |   b    | lstat |   medv  |\n",
+       "+---------+-----+-------+------+-------+-------+------+--------+-----+-----+---------+--------+-------+---------+\n",
+       "|  0.2896 | 0.0 |  9.69 |  0   | 0.585 |  5.39 | 72.9 | 2.7986 |  6  | 391 |   19.2  | 396.9  | 21.14 | 21.9436 |\n",
+       "| 0.26838 | 0.0 |  9.69 |  0   | 0.585 | 5.794 | 70.6 | 2.8927 |  6  | 391 |   19.2  | 396.9  |  14.1 | 21.9667 |\n",
+       "| 0.23912 | 0.0 |  9.69 |  0   | 0.585 | 6.019 | 65.3 | 2.4091 |  6  | 391 |   19.2  | 396.9  | 12.92 | 22.9708 |\n",
+       "| 0.17783 | 0.0 |  9.69 |  0   | 0.585 | 5.569 | 73.5 | 2.3999 |  6  | 391 |   19.2  | 395.77 |  15.1 | 22.6373 |\n",
+       "| 0.22438 | 0.0 |  9.69 |  0   | 0.585 | 6.027 | 79.7 | 2.4982 |  6  | 391 |   19.2  | 396.9  | 14.33 | 21.9439 |\n",
+       "| 0.06263 | 0.0 | 11.93 |  0   | 0.573 | 6.593 | 69.1 | 2.4786 |  1  | 273 |   21.0  | 391.99 |  9.67 | 24.0095 |\n",
+       "| 0.04527 | 0.0 | 11.93 |  0   | 0.573 |  6.12 | 76.7 | 2.2875 |  1  | 273 |   21.0  | 396.9  |  9.08 |   25.0  |\n",
+       "| 0.06076 | 0.0 | 11.93 |  0   | 0.573 | 6.976 | 91.0 | 2.1675 |  1  | 273 |   21.0  | 396.9  |  5.64 | 31.6326 |\n",
+       "| 0.10959 | 0.0 | 11.93 |  0   | 0.573 | 6.794 | 89.3 | 2.3889 |  1  | 273 |   21.0  | 393.45 |  6.48 | 26.8375 |\n",
+       "| 0.04741 | 0.0 | 11.93 |  0   | 0.573 |  6.03 | 80.8 | 2.505  |  1  | 273 |   21.0  | 396.9  |  7.88 | 22.5877 |\n",
+       "+---------+-----+-------+------+-------+-------+------+--------+-----+-----+---------+--------+-------+---------+"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "%%sqlflow\n",
+    "SELECT * FROM boston.predict;"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}