Skip to content

SQL for machine learning

Yi Wang edited this page Oct 19, 2018 · 3 revisions

Fields, Columns, and the Label

Let us start with a very simple case.

Suppose that we want to regress the salary with respect the age and the gender, we'd train a model using the following SQL statement:

SELECT age, gender, salary
FROM   engineer_info, engineer_payment
WHERE  engineer_info.id = engieer_payment.id
TRAIN  DNNRegressor
WITH   hidden_units = [10, 30]
COLUMN clip(age, 18, 65), gender, cross(clip(age, 18, 65), gender)
LABEL  salary
INTO   my_first_model
;

This generates a table my_first_model, which encode

  1. inputs: age, gender
  2. columns: clipped age, gender, and the cross of clipped age and gender
  3. the label: salary

We see that we need both SELECT to specify the fields to retrieve and COLUMN for fields-to-feature mapping.


Given this model, we can infer the salary for any other group of people. For example, the execution of the following statement

SELECT id, age, sex
FROM   another_company_employee_info
INFER  my_first_model
COLUMN age, vocab(sex, ["Female", "Male"])
LABEL  expected_salary
INTO   a_new_table

should generate a new table a_new_table with fields:

  1. id, age, sex from SELECT, and
  2. expected_salary from LABEL

Again, we need COLUMN in addition to SELECT to map different field names, and even field values, to the features acceptable by the model.

Clone this wiki locally