
Commit 7f3d0a6

adds option to add external information sources like model callbacks (#22)

* adds option to add external information sources like model callbacks
* refactor model callback
* adds pytorch adapter
* adds pytorch callback
* adds huggingface callback
* update README

1 parent 22cadbf · commit 7f3d0a6

9 files changed: +547 −45 lines changed

README.md

Lines changed: 195 additions & 33 deletions
```diff
@@ -12,11 +12,14 @@ This is the official Python SDK for [*refinery*](https://github.com/code-kern-ai
 - [Fetching lookup lists](#fetching-lookup-lists)
 - [Upload files](#upload-files)
 - [Adapters](#adapters)
-  - [HuggingFace](#hugging-face)
-  - [Sklearn](#sklearn)
-  - [Rasa](#rasa)
-- [What's missing?](#whats-missing)
-- [Roadmap](#roadmap)
+  - [Sklearn](#sklearn-adapter)
+  - [PyTorch](#pytorch-adapter)
+  - [HuggingFace](#hugging-face-adapter)
+  - [Rasa](#rasa-adapter)
+- [Callbacks](#callbacks)
+  - [Sklearn](#sklearn-callback)
+  - [PyTorch](#pytorch-callback)
+  - [HuggingFace](#hugging-face-callback)
 - [Contributing](#contributing)
 - [License](#license)
 - [Contact](#contact)
```
````diff
@@ -122,7 +125,35 @@ Alternatively, you can `rsdk push <path-to-your-file>` via CLI, given that you h
 
 ### Adapters
 
-#### Hugging Face
+#### Sklearn Adapter
+You can use *refinery* to directly pull data into a format you can use to build [sklearn](https://github.com/scikit-learn/scikit-learn) models. This can look as follows:
+
+```python
+from refinery.adapter.sklearn import build_classification_dataset
+from sklearn.tree import DecisionTreeClassifier
+
+data = build_classification_dataset(client, "headline", "__clickbait", "distilbert-base-uncased")
+
+clf = DecisionTreeClassifier()
+clf.fit(data["train"]["inputs"], data["train"]["labels"])
+
+pred_test = clf.predict(data["test"]["inputs"])
+accuracy = (pred_test == data["test"]["labels"]).mean()
+```
+
+By the way, we highly recommend combining this with [Truss](https://github.com/basetenlabs/truss) for easy model serving!
+
+#### PyTorch Adapter
+If you want to build a [PyTorch](https://github.com/pytorch/pytorch) network, you can build the `train_loader` and `test_loader` as follows:
+
+```python
+from refinery.adapter.torch import build_classification_dataset
+train_loader, test_loader, encoder, index = build_classification_dataset(
+    client, "headline", "__clickbait", "distilbert-base-uncased"
+)
+```
+
+#### Hugging Face Adapter
 Transformers are great, but oftentimes you want to finetune them for your downstream task. With *refinery*, you can do so easily by letting the SDK build the dataset for you, which you can use as a plug-and-play base for your training:
 
 ```python
````
````diff
@@ -175,25 +206,7 @@ trainer.train()
 trainer.save_model("path/to/model")
 ```
 
-#### Sklearn
-You can use *refinery* to directly pull data into a format you can apply for building [sklearn](https://github.com/scikit-learn/scikit-learn) models. This can look as follows:
-
-```python
-from refinery.adapter.sklearn import build_classification_dataset
-from sklearn.tree import DecisionTreeClassifier
-
-data = build_classification_dataset(client, "headline", "__clickbait", "distilbert-base-uncased")
-
-clf = DecisionTreeClassifier()
-clf.fit(data["train"]["inputs"], data["train"]["labels"])
-
-pred_test = clf.predict(data["test"]["inputs"])
-accuracy = (pred_test == data["test"]["labels"]).mean()
-```
-
-By the way, we can highly recommend to combine this with [Truss](https://github.com/basetenlabs/truss) for easy model serving!
-
-#### Rasa
+#### Rasa Adapter
 *refinery* is a perfect fit for building chatbots with [Rasa](https://github.com/RasaHQ/rasa). We've built an adapter with which you can easily create the required Rasa training data directly from *refinery*.
 
 To do so, do the following:
````
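The steps themselves predate this commit, so the diff elides them. For context, a minimal sketch of the adapter call the section refers to; the `build_intent_yaml` helper name and its arguments are assumptions based on the SDK's adapter layout, not part of this diff:

```python
from refinery.adapter import rasa

# writes Rasa-formatted training data (the `nlu:` block shown in the next hunk)
# to a YAML file that `rasa train` can consume; helper name is an assumption
rasa.build_intent_yaml(client, "headline", "__clickbait")
```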
````diff
@@ -278,18 +291,167 @@ nlu:
 
 Please make sure to also create the further necessary files (`domain.yml`, `data/stories.yml` and `data/rules.yml`) if you want to train your Rasa chatbot. For further reference, see their [documentation](https://rasa.com/docs/rasa).
 
-#### What's missing?
-Let us know what open-source/closed-source NLP framework you are using, for which you'd like to have an adapter implemented in the SDK. To do so, simply create an issue in this repository with the tag "enhancement".
 
+### Callbacks
+If you want to feed your production model's predictions back into *refinery*, you can do so with any version greater than [1.2.1](https://github.com/code-kern-ai/refinery/releases/tag/v1.2.1).
 
-## Roadmap
-- [ ] Register heuristics via wrappers
-- [ ] Up/download zipped projects for versioning via DVC
-- [x] Add project upload
-- [x] Fetch project statistics
+To do so, we provide a generic interface and framework-specific classes.
 
+#### Sklearn Callback
+If you want to train a scikit-learn model and feed its outputs back into refinery, you can do so easily as follows:
+
+```python
+from sklearn.linear_model import LogisticRegression
+clf = LogisticRegression()  # we use this as an example, but you can use any model implementing predict_proba
+
+from refinery.adapter.sklearn import build_classification_dataset
+data = build_classification_dataset(client, "headline", "__clickbait", "distilbert-base-uncased")
+clf.fit(data["train"]["inputs"], data["train"]["labels"])
+
+from refinery.callbacks.sklearn import SklearnCallback
+callback = SklearnCallback(
+    client,
+    clf,
+    "clickbait",
+)
+
+# executing this will call the refinery API with batches of size 32, so your data is pushed to the app
+callback.run(data["train"]["inputs"], data["train"]["index"])
+callback.run(data["test"]["inputs"], data["test"]["index"])
+```
+
+#### PyTorch Callback
+For PyTorch, the procedure is very similar. You can do it as follows:
+
+```python
+from refinery.adapter.torch import build_classification_dataset
+train_loader, test_loader, encoder, index = build_classification_dataset(
+    client, "headline", "__clickbait", "distilbert-base-uncased"
+)
+
+# build your custom model and train it here - example:
+import torch.nn as nn
+import torch
+
+# number of input features (length of the embedding vector)
+input_dim = 768
+# number of classes (unique values of y)
+output_dim = 2
+
+class Network(nn.Module):
+    def __init__(self):
+        super(Network, self).__init__()
+        self.linear1 = nn.Linear(input_dim, output_dim)
+
+    def forward(self, x):
+        x = torch.sigmoid(self.linear1(x))
+        return x
+
+clf = Network()
+criterion = nn.CrossEntropyLoss()
+optimizer = torch.optim.SGD(clf.parameters(), lr=0.1)
+
+epochs = 2
+for epoch in range(epochs):
+    running_loss = 0.0
+    for i, data in enumerate(train_loader, 0):
+        inputs, labels = data
+        # set optimizer to zero grad to remove previous batch gradients
+        optimizer.zero_grad()
+        # forward propagation
+        outputs = clf(inputs)
+        loss = criterion(outputs, labels)
+        # backward propagation
+        loss.backward()
+        # optimize
+        optimizer.step()
+        running_loss += loss.item()
+        # display statistics
+        print(f"[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.5f}")
+        running_loss = 0.0
+
+# with this model trained, you can use the callback
+from refinery.callbacks.torch import TorchCallback
+callback = TorchCallback(
+    client,
+    clf,
+    "clickbait",
+    encoder
+)
+
+# and just execute this
+callback.run(train_loader, index["train"])
+callback.run(test_loader, index["test"])
+```
+
+#### HuggingFace Callback
+Collect the dataset and train your custom transformer model as follows:
+
+```python
+from refinery.adapter import transformers
+dataset, mapping, index = transformers.build_classification_dataset(client, "headline", "__clickbait")
+
+# train a model here; we're simplifying this by just using an existing model w/o retraining
+from transformers import pipeline
+pipe = pipeline("text-classification", model="distilbert-base-uncased")
+
+# if you're interested in what training looks like, look into the above HuggingFace adapter
+
+# you can now apply the callback
+from refinery.callbacks.transformers import TransformerCallback
+callback = TransformerCallback(
+    client,
+    pipe,
+    "clickbait",
+    mapping
+)
+
+callback.run(dataset["train"]["headline"], index["train"])
+callback.run(dataset["test"]["headline"], index["test"])
+```
+
+#### Generic Callback
+This one is your fallback if you have a very custom solution; otherwise, we recommend you look into the framework-specific classes.
+
+```python
+from refinery.callbacks.inference import ModelCallback
+from refinery.adapter.sklearn import build_classification_dataset
+from sklearn.linear_model import LogisticRegression
+
+data = build_classification_dataset(client, "headline", "__clickbait", "distilbert-base-uncased")
+clf = LogisticRegression()
+clf.fit(data["train"]["inputs"], data["train"]["labels"])
+
+# you can build initialization functions that set states of objects you use in the pipeline
+def initialize_fn(inputs, labels, **kwargs):
+    return {"clf": kwargs["clf"]}
+
+# postprocessing shifts the model outputs into a format accepted by our API
+def postprocessing_fn(outputs, **kwargs):
+    named_outputs = []
+    for prediction in outputs:
+        pred_index = prediction.argmax()
+        label = kwargs["clf"].classes_[pred_index]
+        confidence = prediction[pred_index]
+        named_outputs.append([label, confidence])
+    return named_outputs
+
+callback = ModelCallback(
+    client,
+    "my-custom-regression",
+    "clickbait",
+    inference_fn=clf.predict_proba,
+    initialize_fn=initialize_fn,
+    postprocessing_fn=postprocessing_fn
+)
+
+# executing this will call the refinery API with batches of size 32
+callback.initialize_and_run(data["train"]["inputs"], data["train"]["index"])
+callback.run(data["test"]["inputs"], data["test"]["index"])
+```
 
-If you want to have something added, feel free to open an [issue](https://github.com/code-kern-ai/refinery-python-sdk/issues).
 
 ## Contributing
 Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.
````

refinery/__init__.py

Lines changed: 31 additions & 0 deletions
```diff
@@ -183,6 +183,37 @@ def get_record_export(
         msg.good(f"Downloaded export to {download_to}")
         return df
 
+    def post_associations(
+        self,
+        associations,
+        indices,
+        name,
+        label_task_name,
+        source_type: Optional[str] = "heuristic",
+    ):
+        """Posts associations to the server.
+
+        Args:
+            associations (List[Dict[str, str]]): List of associations to post.
+            indices (List[str]): List of indices to post to.
+            name (str): Name of the association set.
+            label_task_name (str): Name of the label task.
+            source_type (Optional[str], optional): Source type of the associations. Defaults to "heuristic".
+        """
+        url = settings.get_associations_url(self.project_id)
+        api_response = api_calls.post_request(
+            url,
+            {
+                "associations": associations,
+                "indices": indices,
+                "name": name,
+                "label_task_name": label_task_name,
+                "source_type": source_type,
+            },
+            self.session_token,
+        )
+        return api_response
+
     def post_file_import(
         self, path: str, import_file_options: Optional[str] = ""
     ) -> bool:
```
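The callback classes above ultimately push their predictions through this endpoint. For illustration, a direct call could look like the sketch below; the `[label, confidence]` payload shape mirrors what the README's `postprocessing_fn` produces, but the exact schema expected by the API is an assumption here:

```python
# assumes `client` and `data` from the README examples above;
# `named_outputs` is a [label, confidence] list such as postprocessing_fn returns
named_outputs = [["clickbait", 0.92], ["not clickbait", 0.74]]  # made-up values

client.post_associations(
    associations=named_outputs,
    indices=list(data["test"]["index"][:2]),  # record indices the predictions belong to
    name="my-custom-regression",  # name under which the source shows up in the app
    label_task_name="clickbait",
)
```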

refinery/adapter/torch.py

Lines changed: 68 additions & 0 deletions
```diff
@@ -0,0 +1,68 @@
+import numpy as np
+import torch
+from torch.utils.data import Dataset, DataLoader
+from sklearn import preprocessing
+from .sklearn import (
+    build_classification_dataset as sklearn_build_classification_dataset,
+)
+from typing import Any, Dict, Optional, Tuple
+from refinery import Client
+
+
+class Data(Dataset):
+    def __init__(self, X, y, encoder):
+        # need to convert float64 to float32, else we get:
+        # RuntimeError: expected scalar type Double but found Float
+        self.X = torch.FloatTensor(X)
+        # need to convert the labels to Long, else we get:
+        # RuntimeError: expected scalar type Long but found Float
+        y_encoded = encoder.transform(y.values)
+        self.y = torch.from_numpy(y_encoded).type(torch.LongTensor)
+        self.len = self.X.shape[0]
+
+    def __getitem__(self, index):
+        return self.X[index], self.y[index]
+
+    def __len__(self):
+        return self.len
+
+
+def build_classification_dataset(
+    client: Client,
+    sentence_input: str,
+    classification_label: str,
+    config_string: Optional[str] = None,
+    num_train: Optional[int] = None,
+    batch_size: Optional[int] = 32,
+) -> Tuple[DataLoader, DataLoader, preprocessing.LabelEncoder, Dict[str, Any]]:
+    """
+    Builds a classification dataset from a refinery client and a config string.
+
+    Args:
+        client (Client): Refinery client
+        sentence_input (str): Name of the column containing the sentence input.
+        classification_label (str): Name of the label; if this is a task on the full record, enter the string as "__<label>". Else, input it as "<attribute>__<label>".
+        config_string (Optional[str], optional): Config string for the TransformerSentenceEmbedder. Defaults to None; if None is provided, the text will not be embedded.
+        num_train (Optional[int], optional): Number of training examples to use. Defaults to None; if None is provided, all examples will be used.
+        batch_size (Optional[int], optional): Batch size for the dataloaders. Defaults to 32.
+
+    Returns:
+        Tuple[DataLoader, DataLoader, preprocessing.LabelEncoder, Dict[str, Any]]: Tuple of train and test dataloaders, the label encoder, and a dict holding the train and test record indices.
+    """
+    data = sklearn_build_classification_dataset(
+        client, sentence_input, classification_label, config_string, num_train
+    )
+
+    le = preprocessing.LabelEncoder()
+    le.fit(data["train"]["labels"].values)
+
+    train_data = Data(data["train"]["inputs"], data["train"]["labels"], le)
+    test_data = Data(data["test"]["inputs"], data["test"]["labels"], le)
+
+    train_loader = DataLoader(dataset=train_data, batch_size=batch_size)
+    test_loader = DataLoader(dataset=test_data, batch_size=batch_size)
+
+    index = {"train": data["train"]["index"], "test": data["test"]["index"]}
+
+    return train_loader, test_loader, le, index
```
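A quick smoke test of the new adapter (a sketch; assumes a reachable *refinery* instance with the clickbait project used throughout the README, and the `Client` constructor arguments are placeholders):

```python
from refinery import Client
from refinery.adapter.torch import build_classification_dataset

client = Client("your-username", "your-password", "your-project-id")  # placeholder credentials
train_loader, test_loader, encoder, index = build_classification_dataset(
    client, "headline", "__clickbait", "distilbert-base-uncased"
)

# each batch is a (float32 embeddings, int64 labels) pair
X_batch, y_batch = next(iter(train_loader))
print(X_batch.shape, y_batch.shape)  # e.g. torch.Size([32, 768]) torch.Size([32])

# the encoder maps class indices back to the original label strings
print(encoder.inverse_transform(y_batch.numpy()[:5]))
```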
