Skip to content

Implement cluster model by subclassing of tensorflow.keras #16

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Sep 17, 2019
Merged

Implement cluster model by subclassing of tensorflow.keras #16

merged 9 commits into from
Sep 17, 2019

Conversation

BlackPoint-CX
Copy link
Contributor

Has mostly implement cluster model by subclassing of tensorflow.keras.
Still work in progress.

TODO:

  1. Implement logic of saving and loading weights.
  2. Implement modification of codegen in sqlflow.

Copy link
Contributor

@tonyyang-svail tonyyang-svail left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent PR.

metric['ari'] = np.round(ari(y, y_pred), 5)
return q, metric

def cluster_train_loop(self, x, y, batch_size=256, maxiter=8000, update_interval=150, tol=0.001):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the usage of y? I thought the cluster model is trained on unlabeled data.

Copy link
Contributor Author

@BlackPoint-CX BlackPoint-CX Sep 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

y is used for evaluating the performance of cluster model when training with labeled data.
You are right. It is really necessary to remove this or setting it as None by default.

print(dec.display_model_info(verbose=2))

if hasattr(dec, 'cluster_train_loop'):
dec.cluster_train_loop(x=x, y=y,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if it is possible to use the model.fit API to training the model, and put the early stopping strategy as a callback function? So that the code generation template in SQLFlow would be simpler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am entangled with this.

In classical scenario, when subclassing the keras.Model class, user only need to define the structure of model in __init__ and the forward pass in the call method. All the other operations such as compile and fit are inherited from keras.Model.

However in cluster model, the inherited fit cannot fit with such training process.
Cluster model uses auxiliary target distribution p as latent label for training and it keeps changing during training. The inherited fit always requires static y. Based on these I used train_on_batch instead of fit.

q, metric = self.evaluate(x, y)
p = self.target_distribution(q) 
...
loss = self.train_on_batch(x=x[idx], y=p[idx])

@@ -0,0 +1,520 @@
/Users/didi/Develop/VirtualEnvs/virtualenv_3.6.5/bin/python3.6 /Users/didi/Develop/PycharmProjects/temp_requirements/src/order_by_project/sqlflow/clustering_model/version_03/dec_demo_03.py
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this log file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, got it.

pretrain_initializer='glorot_uniform',
loss=None):
"""
Implement cluster model mostly based on DEC.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a link to DEC?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure that the cluster model is DEC but they are really close.
I am entangled with whether this link will mislead users.

self.get_layer(name='clustering').set_weights([self.kmeans.cluster_centers_])
self.display_model_info()
index, loss, p = 0, 0., None
y_pred_last = self.kmeans.fit_predict(self.encoded_input)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is redundancy, L175 can return the y_pred_last.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Will remove it.

Yancey0623
Yancey0623 previously approved these changes Sep 11, 2019
Copy link
Collaborator

@Yancey0623 Yancey0623 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the discussion with @wenjing , the output prediction result group by group_id, so maybe we need to implement a function sqlflow_predict() function that outputs the result which group by group_id.

@Yancey0623 Yancey0623 mentioned this pull request Sep 11, 2019


@timelogger
def train_evaluate(datasource='mnist'):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove train_evaluate function in the model definition file, and add a unit test in the tests folder.

self.label = [0 for _ in range(50)] + [1 for _ in range(50)]
feature_columns = [tf.feature_column.numeric_column(key) for key in
self.features]
self.model = sqlflow_models.DNNClassifier(feature_columns=feature_columns)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to test sqlflow_models.DeepEmbeddingCluster here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to test sqlflow_models.DeepEmbeddingCluster here.

Have added in new commit. Tks for remind.

Copy link
Collaborator

@Yancey0623 Yancey0623 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI failed logs:

    import sqlflow_models
sqlflow_models/__init__.py:4: in <module>
    from .deep_embedding_cluster import DeepEmbeddingClusterModel
sqlflow_models/deep_embedding_cluster.py:19: in <module>
    from sklearn.cluster import KMeans
E   ModuleNotFoundError: No module named 'sklearn'

You can fix it by adding the dependency in https://github.com/sql-machine-learning/models/blob/develop/setup.py#L24

self.y_pred_last = self.kmeans.fit_predict(self.encoded_input)
print('{} Done init centroids by k-means.'.format(datetime.now()))

def cluster_train_loop(self, x):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can make cluster_train_loop to be more general:

def sqlflow_trai_loop(self, dataset, epochs, verbos):
    ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can make cluster_train_loop to be more general:

def sqlflow_trai_loop(self, dataset, epochs, verbos):
    ...

Actually I have thought about this.
epochs is always used to define how many times to train on the whole dataset. However in this case, the auxiliary target distribution p needs to be updated after update_interval iterations (update_interval always smaller than the number of batches in one epoch). Since the pre-train process can fetch parameter pretrain_epochs from construct function, and pre_train should be consider as sub-train-process, I did not add epochs as parameter here.
verbose can be add here for control to display more details. I can add this later.

Copy link
Collaborator

@Yancey0623 Yancey0623 Sep 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I see that epochs is not necessary for this cause, but we need to support the custom train loop function by editing template_tf.go in SQLFlow, and I think epochs is useful for the most of the scenes, so maybe we can to add epochs argument with a default value.

@BlackPoint-CX BlackPoint-CX changed the title WIP : Implement cluster model by subclassing of tensorflow.keras Implement cluster model by subclassing of tensorflow.keras Sep 17, 2019
Copy link
Collaborator

@Yancey0623 Yancey0623 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the excellent PR!

@weiguoz weiguoz merged commit 73adf6a into sql-machine-learning:develop Sep 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants