adds option to add external information sources like model callbacks #22

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged: 6 commits, Sep 13, 2022
228 changes: 195 additions & 33 deletions README.md
@@ -12,11 +12,14 @@ This is the official Python SDK for [*refinery*](https://github.com/code-kern-ai
- [Fetching lookup lists](#fetching-lookup-lists)
- [Upload files](#upload-files)
- [Adapters](#adapters)
  - [Sklearn](#sklearn-adapter)
  - [PyTorch](#pytorch-adapter)
  - [HuggingFace](#hugging-face-adapter)
  - [Rasa](#rasa-adapter)
- [Callbacks](#callbacks)
  - [Sklearn](#sklearn-callback)
  - [PyTorch](#pytorch-callback)
  - [HuggingFace](#hugging-face-callback)
  - [Generic](#generic-callback)
- [Contributing](#contributing)
- [License](#license)
- [Contact](#contact)
@@ -122,7 +125,35 @@ Alternatively, you can `rsdk push <path-to-your-file>` via CLI, given that you h

### Adapters

#### Sklearn Adapter
You can use *refinery* to pull data directly in a format ready for building [sklearn](https://github.com/scikit-learn/scikit-learn) models. This can look as follows:

```python
from refinery.adapter.sklearn import build_classification_dataset
from sklearn.tree import DecisionTreeClassifier

data = build_classification_dataset(client, "headline", "__clickbait", "distilbert-base-uncased")

clf = DecisionTreeClassifier()
clf.fit(data["train"]["inputs"], data["train"]["labels"])

pred_test = clf.predict(data["test"]["inputs"])
accuracy = (pred_test == data["test"]["labels"]).mean()
```
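
The returned `data` is a nested dictionary with a `train` and a `test` split. A minimal sketch to inspect it (key names as used above; `index` holds the record references that the callbacks below rely on):

```python
# peek at the structure of the returned dataset
for split in ["train", "test"]:
    print(split, "->", len(data[split]["inputs"]), "examples")
    print(data[split]["labels"][:3])  # first few labels
    print(data[split]["index"][:3])   # record indices, used by the callbacks
```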

By the way, we highly recommend combining this with [Truss](https://github.com/basetenlabs/truss) for easy model serving!

#### PyTorch Adapter
If you want to train a [PyTorch](https://github.com/pytorch/pytorch) network, you can create the `train_loader` and `test_loader` as follows:

```python
from refinery.adapter.torch import build_classification_dataset
train_loader, test_loader, encoder, index = build_classification_dataset(
client, "headline", "__clickbait", "distilbert-base-uncased"
)
```
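
The loaders yield `(inputs, labels)` batches that you can feed straight into a training loop. A quick sanity check of shapes and dtypes (a sketch; with the defaults above, batches hold 32 sentence embeddings of size 768):

```python
# fetch a single batch to verify what the loaders produce
inputs, labels = next(iter(train_loader))
print(inputs.shape, inputs.dtype)  # e.g. torch.Size([32, 768]), torch.float32
print(labels.shape, labels.dtype)  # e.g. torch.Size([32]), torch.int64

# the encoder maps class indices back to the original label names
print(encoder.classes_)
```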

#### Hugging Face Adapter
Transformers are great, but oftentimes you want to fine-tune them for your downstream task. With *refinery*, you can do so easily by letting the SDK build a dataset that serves as a plug-and-play base for your training:

```python
# ... unchanged lines collapsed in the diff view (@@ -175,25 +206,7 @@) ...
trainer.train()
trainer.save_model("path/to/model")
```
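
Once saved, you can load the fine-tuned model again for inference, for example via the `pipeline` API (a sketch; `path/to/model` is the directory used above, and we assume the tokenizer used during training):

```python
from transformers import pipeline

# load the fine-tuned model from disk and classify a new headline
classifier = pipeline(
    "text-classification",
    model="path/to/model",
    tokenizer="distilbert-base-uncased",
)
print(classifier("You won't believe what happened next!"))
```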

#### Rasa Adapter
*refinery* is a great fit for building chatbots with [Rasa](https://github.com/RasaHQ/rasa). We've built an adapter that lets you create the required Rasa training data directly from *refinery*.

To do so, do the following:
@@ -278,18 +291,167 @@ nlu:

Please make sure to also create the further necessary files (`domain.yml`, `data/stories.yml` and `data/rules.yml`) if you want to train your Rasa chatbot. For further reference, see their [documentation](https://rasa.com/docs/rasa).

### Callbacks
If you want to feed your production model's predictions back into *refinery*, you can do so with any version greater than [1.2.1](https://github.com/code-kern-ai/refinery/releases/tag/v1.2.1).

To do so, the SDK provides a generic interface as well as framework-specific classes.
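
All of these follow the same pattern: construct the callback with your client, your model, and the label task, then push predictions with `run(inputs, index)`.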

#### Sklearn Callback
If you want to train a scikit-learn model and feed its outputs back into *refinery*, you can do so easily as follows:

```python
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression() # we use this as an example, but you can use any model implementing predict_proba

from refinery.adapter.sklearn import build_classification_dataset
data = build_classification_dataset(client, "headline", "__clickbait", "distilbert-base-uncased")
clf.fit(data["train"]["inputs"], data["train"]["labels"])

from refinery.callbacks.sklearn import SklearnCallback
callback = SklearnCallback(
    client,
    clf,
    "clickbait",
)

# executing this will call the refinery API with batches of size 32, so your data is pushed to the app
callback.run(data["train"]["inputs"], data["train"]["index"])
callback.run(data["test"]["inputs"], data["test"]["index"])
```
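
Essentially, each `run` call batches the inputs (32 records at a time), feeds them through the model's `predict_proba`, and posts the predicted label together with its confidence for each record index back to *refinery*.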

#### PyTorch Callback
For PyTorch, the procedure is very similar:

```python
from refinery.adapter.torch import build_classification_dataset
train_loader, test_loader, encoder, index = build_classification_dataset(
client, "headline", "__clickbait", "distilbert-base-uncased"
)

# build and train your custom model here - example:
import torch
import torch.nn as nn

# number of input features (size of the sentence embeddings)
input_dim = 768
# number of classes (unique values of y)
output_dim = 2

class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.linear1 = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        # return raw logits; nn.CrossEntropyLoss applies log-softmax itself
        return self.linear1(x)

clf = Network()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(clf.parameters(), lr=0.1)

epochs = 2
for epoch in range(epochs):
    running_loss = 0.0
    for i, data in enumerate(train_loader):
        inputs, labels = data
        # clear the gradients left over from the previous step
        optimizer.zero_grad()
        # forward propagation
        outputs = clf(inputs)
        loss = criterion(outputs, labels)
        # backward propagation
        loss.backward()
        # optimize
        optimizer.step()
        running_loss += loss.item()
    # display the average loss per batch for this epoch
    print(f"[epoch {epoch + 1}] loss: {running_loss / (i + 1):.5f}")

# with this model trained, you can use the callback
from refinery.callbacks.torch import TorchCallback
callback = TorchCallback(
    client,
    clf,
    "clickbait",
    encoder,  # the label encoder returned by build_classification_dataset
)

# and just execute this
callback.run(train_loader, index["train"])
callback.run(test_loader, index["test"])
```

#### HuggingFace Callback
Collect the dataset and train your custom transformer model as follows:

```python
from refinery.adapter import transformers
dataset, mapping, index = transformers.build_classification_dataset(client, "headline", "__clickbait")

# train a model here; we simplify this by just using an existing model without fine-tuning
from transformers import pipeline
pipe = pipeline("text-classification", model="distilbert-base-uncased")

# if you're interested in what a training run looks like, see the Hugging Face adapter above

# you can now apply the callback
from refinery.callbacks.transformers import TransformerCallback
callback = TransformerCallback(
    client,
    pipe,
    "clickbait",
    mapping,
)

callback.run(dataset["train"]["headline"], index["train"])
callback.run(dataset["test"]["headline"], index["test"])
```

#### Generic Callback
This is your fallback if you have a highly custom setup; otherwise, we recommend the framework-specific classes above.

```python
from refinery.callbacks.inference import ModelCallback
from refinery.adapter.sklearn import build_classification_dataset
from sklearn.linear_model import LogisticRegression

data = build_classification_dataset(client, "headline", "__clickbait", "distilbert-base-uncased")
clf = LogisticRegression()
clf.fit(data["train"]["inputs"], data["train"]["labels"])

# you can build initialization functions that set states of objects you use in the pipeline
def initialize_fn(inputs, labels, **kwargs):
    return {"clf": kwargs["clf"]}

# postprocessing shifts the model outputs into a format accepted by our API
def postprocessing_fn(outputs, **kwargs):
    named_outputs = []
    for prediction in outputs:
        pred_index = prediction.argmax()
        label = kwargs["clf"].classes_[pred_index]
        confidence = prediction[pred_index]
        named_outputs.append([label, confidence])
    return named_outputs

callback = ModelCallback(
    client,
    "my-custom-regression",
    "clickbait",
    inference_fn=clf.predict_proba,
    initialize_fn=initialize_fn,
    postprocessing_fn=postprocessing_fn,
)

# executing this will call the refinery API with batches of size 32
callback.initialize_and_run(data["train"]["inputs"], data["train"]["index"])
callback.run(data["test"]["inputs"], data["test"]["index"])
```
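
Note the difference between the two calls: `initialize_and_run` first executes your `initialize_fn` to set up shared state (here, making the classifier available to the postprocessing step) and then processes the batch, while the following `run` call reuses that state.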

If you want to have something added, feel free to open an [issue](https://github.com/code-kern-ai/refinery-python-sdk/issues).

## Contributing
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.
31 changes: 31 additions & 0 deletions refinery/__init__.py
@@ -183,6 +183,37 @@ def get_record_export(
        msg.good(f"Downloaded export to {download_to}")
        return df

    def post_associations(
        self,
        associations,
        indices,
        name,
        label_task_name,
        source_type: Optional[str] = "heuristic",
    ):
        """Posts associations (e.g. model predictions) to the server.

        Args:
            associations (List[Dict[str, str]]): List of associations to post.
            indices (List[str]): List of indices to post to.
            name (str): Name of the association set.
            label_task_name (str): Name of the label task.
            source_type (Optional[str], optional): Source type of the associations. Defaults to "heuristic".

        Returns:
            The raw API response.
        """
        url = settings.get_associations_url(self.project_id)
        api_response = api_calls.post_request(
            url,
            {
                "associations": associations,
                "indices": indices,
                "name": name,
                "label_task_name": label_task_name,
                "source_type": source_type,
            },
            self.session_token,
        )
        return api_response

    def post_file_import(
        self, path: str, import_file_options: Optional[str] = ""
    ) -> bool:
68 changes: 68 additions & 0 deletions refinery/adapter/torch.py
@@ -0,0 +1,68 @@
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn import preprocessing
from .sklearn import (
    build_classification_dataset as sklearn_build_classification_dataset,
)
from typing import Any, Dict, Optional, Tuple
from refinery import Client


class Data(Dataset):
    def __init__(self, X, y, encoder):
        # convert float64 to float32, otherwise we get:
        # RuntimeError: expected scalar type Double but found Float
        self.X = torch.FloatTensor(X)
        # convert the encoded labels to Long, otherwise we get:
        # RuntimeError: expected scalar type Long but found Float
        y_encoded = encoder.transform(y.values)
        self.y = torch.from_numpy(y_encoded).type(torch.LongTensor)
        self.len = self.X.shape[0]

    def __getitem__(self, index):
        return self.X[index], self.y[index]

    def __len__(self):
        return self.len


def build_classification_dataset(
    client: Client,
    sentence_input: str,
    classification_label: str,
    config_string: Optional[str] = None,
    num_train: Optional[int] = None,
    batch_size: Optional[int] = 32,
) -> Tuple[DataLoader, DataLoader, preprocessing.LabelEncoder, Dict[str, Any]]:
    """
    Builds a classification dataset from a refinery client and a config string.

    Args:
        client (Client): Refinery client
        sentence_input (str): Name of the column containing the sentence input.
        classification_label (str): Name of the label; if this is a task on the full record, enter it as "__<label>". Else, input it as "<attribute>__<label>".
        config_string (Optional[str], optional): Config string for the TransformerSentenceEmbedder. Defaults to None; if None is provided, the text will not be embedded.
        num_train (Optional[int], optional): Number of training examples to use. Defaults to None; if None is provided, all examples will be used.
        batch_size (Optional[int], optional): Batch size for the dataloaders. Defaults to 32.

    Returns:
        Tuple[DataLoader, DataLoader, preprocessing.LabelEncoder, Dict[str, Any]]: Train and test dataloaders, the label encoder, and a dict holding the train and test indices.
    """
    data = sklearn_build_classification_dataset(
        client, sentence_input, classification_label, config_string, num_train
    )

    le = preprocessing.LabelEncoder()
    le.fit(data["train"]["labels"].values)

    train_data = Data(data["train"]["inputs"], data["train"]["labels"], le)
    test_data = Data(data["test"]["inputs"], data["test"]["labels"], le)

    train_loader = DataLoader(dataset=train_data, batch_size=batch_size)
    test_loader = DataLoader(dataset=test_data, batch_size=batch_size)

    index = {"train": data["train"]["index"], "test": data["test"]["index"]}

    return train_loader, test_loader, le, index