diff --git a/example/jupyter/.gitignore b/example/jupyter/.gitignore
new file mode 100644
index 0000000000..763513e910
--- /dev/null
+++ b/example/jupyter/.gitignore
@@ -0,0 +1 @@
+.ipynb_checkpoints
diff --git a/example/jupyter/tutorial_xgboost.ipynb b/example/jupyter/tutorial_xgboost.ipynb
new file mode 100644
index 0000000000..85e08e34fe
--- /dev/null
+++ b/example/jupyter/tutorial_xgboost.ipynb
@@ -0,0 +1,357 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# XGBoost on SQLFlow Tutorial\n",
+    "\n",
+    "This is a tutorial on how to train and use an XGBoost model for prediction in SQLFlow. You can find more SQLFlow usage in the [User Guide](https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/user_guide.md). In this tutorial you will learn how to:\n",
+    "- Train an XGBoost model to fit the Boston housing dataset; and\n",
+    "- Predict the housing price using the trained model.\n",
+    "\n",
+    "\n",
+    "## The Dataset\n",
+    "\n",
+    "This tutorial uses the [Boston Housing](https://www.kaggle.com/c/boston-housing) dataset for demonstration.\n",
+    "The dataset contains 506 rows and 14 columns; the meaning of each column is as follows:\n",
+    "\n",
+    "Column | Explanation \n",
+    "-- | -- \n",
+    "crim|per capita crime rate by town.\n",
+    "zn|proportion of residential land zoned for lots over 25,000 sq.ft.\n",
+    "indus|proportion of non-retail business acres per town.\n",
+    "chas|Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).\n",
+    "nox|nitrogen oxides concentration (parts per 10 million).\n",
+    "rm|average number of rooms per dwelling.\n",
+    "age|proportion of owner-occupied units built prior to 1940.\n",
+    "dis|weighted mean of distances to five Boston employment centres.\n",
+    "rad|index of accessibility to radial highways.\n",
+    "tax|full-value property-tax rate per \$10,000.\n",
+    "ptratio|pupil-teacher ratio by town.\n",
+    "black|1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.\n",
+    "lstat|lower status of the population (percent).\n",
+    "medv|median value of owner-occupied homes in $1000s.\n",
+    "\n",
+    "We have split the dataset into a training table and a test table, used to train and evaluate the model respectively. During training, SQLFlow automatically splits the training data into train and validation sets."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "+---------+---------+------+-----+---------+-------+\n",
+       "| Field   | Type    | Null | Key | Default | Extra |\n",
+       "+---------+---------+------+-----+---------+-------+\n",
+       "| crim    | float   | YES  |     | None    |       |\n",
+       "| zn      | float   | YES  |     | None    |       |\n",
+       "| indus   | float   | YES  |     | None    |       |\n",
+       "| chas    | int(11) | YES  |     | None    |       |\n",
+       "| nox     | float   | YES  |     | None    |       |\n",
+       "| rm      | float   | YES  |     | None    |       |\n",
+       "| age     | float   | YES  |     | None    |       |\n",
+       "| dis     | float   | YES  |     | None    |       |\n",
+       "| rad     | int(11) | YES  |     | None    |       |\n",
+       "| tax     | int(11) | YES  |     | None    |       |\n",
+       "| ptratio | float   | YES  |     | None    |       |\n",
+       "| b       | float   | YES  |     | None    |       |\n",
+       "| lstat   | float   | YES  |     | None    |       |\n",
+       "| medv    | float   | YES  |     | None    |       |\n",
+       "+---------+---------+------+-----+---------+-------+"
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "%%sqlflow\n",
+    "describe boston.train;"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "+---------+---------+------+-----+---------+-------+\n",
+       "| Field   | Type    | Null | Key | Default | Extra |\n",
+       "+---------+---------+------+-----+---------+-------+\n",
+       "| crim    | float   | YES  |     | None    |       |\n",
+       "| zn      | float   | YES  |     | None    |       |\n",
+       "| indus   | float   | YES  |     | None    |       |\n",
+       "| chas    | int(11) | YES  |     | None    |       |\n",
+       "| nox     | float   | YES  |     | None    |       |\n",
+       "| rm      | float   | YES  |     | None    |       |\n",
+       "| age     | float   | YES  |     | None    |       |\n",
+       "| dis     | float   | YES  |     | None    |       |\n",
+       "| rad     | int(11) | YES  |     | None    |       |\n",
+       "| tax     | int(11) | YES  |     | None    |       |\n",
+       "| ptratio | float   | YES  |     | None    |       |\n",
+       "| b       | float   | YES  |     | None    |       |\n",
+       "| lstat   | float   | YES  |     | None    |       |\n",
+       "| medv    | float   | YES  |     | None    |       |\n",
+       "+---------+---------+------+-----+---------+-------+"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "%%sqlflow\n",
+    "describe boston.test;"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Fit Boston Housing Dataset\n",
+    "\n",
+    "First, let's train an XGBoost regression model to fit the Boston housing dataset. We train the model for `30 rounds` with the `squarederror` loss function, so the SQLFlow extended SQL looks like:\n",
+    "\n",
+    "``` sql\n",
+    "TRAIN xgboost.gbtree\n",
+    "WITH\n",
+    "    train.num_boost_round=30,\n",
+    "    objective=\"reg:squarederror\"\n",
+    "```\n",
+    "\n",
+    "`xgboost.gbtree` is the estimator name, where `gbtree` is one of the XGBoost boosters; you can find more information [here](https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters).\n",
+    "\n",
+    "We can specify the training data columns in the `COLUMN` clause, and the label with the `LABEL` keyword:\n",
+    "\n",
+    "``` sql\n",
+    "COLUMN crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat\n",
+    "LABEL medv\n",
+    "```\n",
+    "\n",
+    "To save the trained model, we can use the `INTO` clause to specify a model name:\n",
+    "\n",
+    "``` sql\n",
+    "INTO sqlflow_models.my_xgb_regression_model\n",
+    "```\n",
+    "\n",
+    "Second, let's use a standard SQL statement to fetch the training data from the table `boston.train`:\n",
+    "\n",
+    "``` sql\n",
+    "SELECT * FROM boston.train\n",
+    "```\n",
+    "\n",
+    "Finally, the following is the complete SQLFlow TRAIN statement of this regression task; you can run it in the cell:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[03:44:56] 387x13 matrix with 5031 entries loaded from train.txt\n",
+      "\n",
+      "[03:44:56] 109x13 matrix with 1417 entries loaded from test.txt\n",
+      "\n",
+      "[0]\ttrain-rmse:17.0286\tvalidation-rmse:17.8089\n",
+      "\n",
+      "[1]\ttrain-rmse:12.285\tvalidation-rmse:13.2787\n",
+      "\n",
+      "[2]\ttrain-rmse:8.93071\tvalidation-rmse:9.87677\n",
+      "\n",
+      "[3]\ttrain-rmse:6.60757\tvalidation-rmse:7.64013\n",
+      "\n",
+      "[4]\ttrain-rmse:4.96022\tvalidation-rmse:6.0181\n",
+      "\n",
+      "[5]\ttrain-rmse:3.80725\tvalidation-rmse:4.95013\n",
+      "\n",
+      "[6]\ttrain-rmse:2.94382\tvalidation-rmse:4.2357\n",
+      "\n",
+      "[7]\ttrain-rmse:2.36361\tvalidation-rmse:3.74683\n",
+      "\n",
+      "[8]\ttrain-rmse:1.95236\tvalidation-rmse:3.43284\n",
+      "\n",
+      "[9]\ttrain-rmse:1.66604\tvalidation-rmse:3.20455\n",
+      "\n",
+      "[10]\ttrain-rmse:1.4738\tvalidation-rmse:3.08947\n",
+      "\n",
+      "[11]\ttrain-rmse:1.35336\tvalidation-rmse:3.0492\n",
+      "\n",
+      "[12]\ttrain-rmse:1.22835\tvalidation-rmse:2.99508\n",
+      "\n",
+      "[13]\ttrain-rmse:1.15615\tvalidation-rmse:2.98604\n",
+      "\n",
+      "[14]\ttrain-rmse:1.11082\tvalidation-rmse:2.96433\n",
+      "\n",
+      "[15]\ttrain-rmse:1.01666\tvalidation-rmse:2.96584\n",
+      "\n",
+      "[16]\ttrain-rmse:0.953761\tvalidation-rmse:2.94013\n",
+      "\n",
+      "[17]\ttrain-rmse:0.905753\tvalidation-rmse:2.91569\n",
+      "\n",
+      "[18]\ttrain-rmse:0.870137\tvalidation-rmse:2.89735\n",
+      "\n",
+      "[19]\ttrain-rmse:0.800778\tvalidation-rmse:2.87206\n",
+      "\n",
+      "[20]\ttrain-rmse:0.757704\tvalidation-rmse:2.86564\n",
+      "\n",
+      "[21]\ttrain-rmse:0.74058\tvalidation-rmse:2.86587\n",
+      "\n",
+      "[22]\ttrain-rmse:0.66901\tvalidation-rmse:2.86224\n",
+      "\n",
+      "[23]\ttrain-rmse:0.647195\tvalidation-rmse:2.87395\n",
+      "\n",
+      "[24]\ttrain-rmse:0.609025\tvalidation-rmse:2.86069\n",
+      "\n",
+      "[25]\ttrain-rmse:0.562925\tvalidation-rmse:2.87205\n",
+      "\n",
+      "[26]\ttrain-rmse:0.541676\tvalidation-rmse:2.86275\n",
+      "\n",
+      "[27]\ttrain-rmse:0.524815\tvalidation-rmse:2.87106\n",
+      "\n",
+      "[28]\ttrain-rmse:0.483566\tvalidation-rmse:2.86129\n",
+      "\n",
+      "[29]\ttrain-rmse:0.460363\tvalidation-rmse:2.85877\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%sqlflow\n",
+    "SELECT * FROM boston.train\n",
+    "TRAIN xgboost.gbtree\n",
+    "WITH\n",
+    "    objective=\"reg:squarederror\",\n",
+    "    train.num_boost_round = 30\n",
+    "COLUMN crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat\n",
+    "LABEL medv\n",
+    "INTO sqlflow_models.my_xgb_regression_model;"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Predict the housing price\n",
+    "After training the regression model, let's predict the house price using the trained model.\n",
+    "\n",
+    "First, we can specify the trained model with the `USING` clause: \n",
+    "\n",
+    "```sql\n",
+    "USING sqlflow_models.my_xgb_regression_model\n",
+    "```\n",
+    "\n",
+    "Then, we can specify the prediction result table with the `PREDICT` clause:\n",
+    "\n",
+    "``` sql\n",
+    "PREDICT boston.predict.medv\n",
+    "```\n",
+    "\n",
+    "And use a standard SQL statement to fetch the prediction input data:\n",
+    "\n",
+    "``` sql\n",
+    "SELECT * FROM boston.test\n",
+    "```\n",
+    "\n",
+    "Finally, the following is the complete SQLFlow PREDICT statement:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[03:45:18] 10x13 matrix with 130 entries loaded from predict.txt\n",
+      "\n",
+      "Done predicting. Predict table : boston.predict\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%sqlflow\n",
+    "SELECT * FROM boston.test\n",
+    "PREDICT boston.predict.medv\n",
+    "USING sqlflow_models.my_xgb_regression_model;"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's take a glance at the prediction results."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "+---------+-----+-------+------+-------+-------+------+--------+-----+-----+---------+--------+-------+---------+\n",
+       "| crim    | zn  | indus | chas | nox   | rm    | age  | dis    | rad | tax | ptratio | b      | lstat | medv    |\n",
+       "+---------+-----+-------+------+-------+-------+------+--------+-----+-----+---------+--------+-------+---------+\n",
+       "| 0.2896  | 0.0 | 9.69  | 0    | 0.585 | 5.39  | 72.9 | 2.7986 | 6   | 391 | 19.2    | 396.9  | 21.14 | 21.9436 |\n",
+       "| 0.26838 | 0.0 | 9.69  | 0    | 0.585 | 5.794 | 70.6 | 2.8927 | 6   | 391 | 19.2    | 396.9  | 14.1  | 21.9667 |\n",
+       "| 0.23912 | 0.0 | 9.69  | 0    | 0.585 | 6.019 | 65.3 | 2.4091 | 6   | 391 | 19.2    | 396.9  | 12.92 | 22.9708 |\n",
+       "| 0.17783 | 0.0 | 9.69  | 0    | 0.585 | 5.569 | 73.5 | 2.3999 | 6   | 391 | 19.2    | 395.77 | 15.1  | 22.6373 |\n",
+       "| 0.22438 | 0.0 | 9.69  | 0    | 0.585 | 6.027 | 79.7 | 2.4982 | 6   | 391 | 19.2    | 396.9  | 14.33 | 21.9439 |\n",
+       "| 0.06263 | 0.0 | 11.93 | 0    | 0.573 | 6.593 | 69.1 | 2.4786 | 1   | 273 | 21.0    | 391.99 | 9.67  | 24.0095 |\n",
+       "| 0.04527 | 0.0 | 11.93 | 0    | 0.573 | 6.12  | 76.7 | 2.2875 | 1   | 273 | 21.0    | 396.9  | 9.08  | 25.0    |\n",
+       "| 0.06076 | 0.0 | 11.93 | 0    | 0.573 | 6.976 | 91.0 | 2.1675 | 1   | 273 | 21.0    | 396.9  | 5.64  | 31.6326 |\n",
+       "| 0.10959 | 0.0 | 11.93 | 0    | 0.573 | 6.794 | 89.3 | 2.3889 | 1   | 273 | 21.0    | 393.45 | 6.48  | 26.8375 |\n",
+       "| 0.04741 | 0.0 | 11.93 | 0    | 0.573 | 6.03  | 80.8 | 2.505  | 1   | 273 | 21.0    | 396.9  | 7.88  | 22.5877 |\n",
+       "+---------+-----+-------+------+-------+-------+------+--------+-----+-----+---------+--------+-------+---------+"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "%%sqlflow\n",
+    "SELECT * FROM boston.predict;"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}