Commit 4ba7adc

Adds adapters for Hugging Face and Sklearn (#18)

* adds automated huggingface dataset creation
* upgrade version
* update README and refactor

1 parent f7d4f26 · commit 4ba7adc

7 files changed: +242 -17 lines changed

README.md

Lines changed: 74 additions & 1 deletion

@@ -1,6 +1,6 @@
 [![refinery repository](https://uploads-ssl.webflow.com/61e47fafb12bd56b40022a49/62cf1c3cb8272b1e9c01127e_refinery%20sdk%20banner.png)](https://github.com/code-kern-ai/refinery)
 [![Python 3.9](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-390/)
-[![pypi 1.0.2](https://img.shields.io/badge/pypi-1.0.2-yellow.svg)](https://pypi.org/project/refinery-python-sdk/1.0.2/)
+[![pypi 1.1.0](https://img.shields.io/badge/pypi-1.1.0-yellow.svg)](https://pypi.org/project/refinery-python-sdk/1.1.0/)

 This is the official Python SDK for [*refinery*](https://github.com/code-kern-ai/refinery), the **open-source** data-centric IDE for NLP.

@@ -12,6 +12,8 @@ This is the official Python SDK for [*refinery*](https://github.com/code-kern-ai
 - [Fetching lookup lists](#fetching-lookup-lists)
 - [Upload files](#upload-files)
 - [Adapters](#adapters)
+  - [Hugging Face](#hugging-face)
+  - [Sklearn](#sklearn)
   - [Rasa](#rasa)
 - [What's missing?](#whats-missing)
 - [Roadmap](#roadmap)

@@ -120,6 +122,77 @@ Alternatively, you can `rsdk push <path-to-your-file>` via CLI, given that you h

 ### Adapters

+#### 🤗 Hugging Face
+Transformers are great, but often you want to fine-tune them for your downstream task. With *refinery*, you can do so easily by letting the SDK build the dataset for you, which you can then use as a plug-and-play base for your training:
+
+```python
+from refinery.adapter import transformers
+dataset, mapping = transformers.build_classification_dataset(client, "headline", "__clickbait")
+```
+
+From here, you can follow the [fine-tuning example](https://huggingface.co/docs/transformers/training) provided in the official Hugging Face documentation. A next step could look as follows:
+
+```python
+from transformers import (
+    AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
+)
+import numpy as np
+from datasets import load_metric
+
+tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
+
+def tokenize_function(examples):
+    return tokenizer(examples["headline"], padding="max_length", truncation=True)
+
+tokenized_datasets = dataset.map(tokenize_function, batched=True)
+model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
+metric = load_metric("accuracy")
+
+def compute_metrics(eval_pred):
+    logits, labels = eval_pred
+    predictions = np.argmax(logits, axis=-1)
+    return metric.compute(predictions=predictions, references=labels)
+
+training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")
+
+small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
+small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
+
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=small_train_dataset,
+    eval_dataset=small_eval_dataset,
+    compute_metrics=compute_metrics,
+)
+
+trainer.train()
+
+trainer.save_model("path/to/model")
+```
+
+#### Sklearn
+You can use *refinery* to pull data directly into a format you can use to build [sklearn](https://github.com/scikit-learn/scikit-learn) models. This can look as follows:
+
+```python
+from refinery.adapter.sklearn import build_classification_dataset
+from sklearn.tree import DecisionTreeClassifier
+
+data = build_classification_dataset(client, "headline", "__clickbait", "distilbert-base-uncased")
+
+clf = DecisionTreeClassifier()
+clf.fit(data["train"]["inputs"], data["train"]["labels"])
+
+pred_test = clf.predict(data["test"]["inputs"])
+accuracy = (pred_test == data["test"]["labels"]).mean()
+```
+
+By the way, we highly recommend combining this with [Truss](https://github.com/basetenlabs/truss) for easy model serving!

 #### Rasa
 *refinery* is a perfect fit for building chatbots with [Rasa](https://github.com/RasaHQ/rasa). We've built an adapter with which you can easily create the required Rasa training data directly from *refinery*.
refinery/__init__.py

Lines changed: 11 additions & 1 deletion

@@ -111,6 +111,8 @@ def get_record_export(
         num_samples: Optional[int] = None,
         download_to: Optional[str] = None,
         tokenize: Optional[bool] = True,
+        keep_attributes: Optional[List[str]] = None,
+        dropna: Optional[bool] = False,
     ) -> pd.DataFrame:
         """Collects the export data of your project (i.e. the same data as if you would export in the web app).

@@ -155,6 +157,12 @@ def get_record_export(
                 "There are no attributes that can be tokenized in this project."
             )

+        if keep_attributes is not None:
+            df = df[keep_attributes]
+
+        if dropna:
+            df = df.dropna()
+
         if download_to is not None:
             df.to_json(download_to, orient="records")
             msg.good(f"Downloaded export to {download_to}")

@@ -263,7 +271,9 @@ def __monitor_task(self, upload_task_id: str) -> None:
         if print_success_message:
             msg.good("File upload successful.")
         else:
-            msg.fail("Upload failed. Please look into the UI notification center for more details.")
+            msg.fail(
+                "Upload failed. Please look into the UI notification center for more details."
+            )

     def __get_task(self, upload_task_id: str) -> Dict[str, Any]:
         api_response = api_calls.get_request(
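
The two new export parameters are the hooks the adapter utilities build on. A minimal usage sketch, assuming an already authenticated `Client` instance named `client` and a project with a `headline` attribute and a `__clickbait` labeling task:

```python
# keep only the headline text and its weakly supervised label,
# and drop records where either value is missing
df = client.get_record_export(
    tokenize=False,
    keep_attributes=["headline", "__clickbait__WEAK_SUPERVISION"],
    dropna=True,
)
print(df.head())
```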

refinery/adapter/sklearn.py

Lines changed: 44 additions & 0 deletions

@@ -0,0 +1,44 @@
+from typing import Any, Dict, Optional
+from embedders.classification.contextual import TransformerSentenceEmbedder
+from refinery import Client
+from refinery.adapter.util import split_train_test_on_weak_supervision
+
+
+def build_classification_dataset(
+    client: Client,
+    sentence_input: str,
+    classification_label: str,
+    config_string: Optional[str] = None,
+) -> Dict[str, Dict[str, Any]]:
+    """
+    Builds a classification dataset from a refinery client and a config string.
+
+    Args:
+        client (Client): Refinery client
+        sentence_input (str): Name of the column containing the sentence input.
+        classification_label (str): Name of the label; if this is a task on the full record, enter it as "__<label>". Otherwise, enter it as "<attribute>__<label>".
+        config_string (Optional[str], optional): Config string for the TransformerSentenceEmbedder. Defaults to None; if None is provided, the text will not be embedded.
+
+    Returns:
+        Dict[str, Dict[str, Any]]: Contains the train and test datasets, with embedded inputs.
+    """
+
+    df_train, df_test, _ = split_train_test_on_weak_supervision(
+        client, sentence_input, classification_label
+    )
+
+    if config_string is not None:
+        embedder = TransformerSentenceEmbedder(config_string)
+        inputs_train = embedder.transform(df_train[sentence_input].tolist())
+        inputs_test = embedder.transform(df_test[sentence_input].tolist())
+    else:
+        inputs_train = df_train[sentence_input].tolist()
+        inputs_test = df_test[sentence_input].tolist()
+
+    return {
+        "train": {
+            "inputs": inputs_train,
+            "labels": df_train["label"],
+        },
+        "test": {"inputs": inputs_test, "labels": df_test["label"]},
+    }
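
Since omitting `config_string` leaves the inputs as raw strings rather than embeddings, the same dataset also fits a classical sklearn text pipeline. A minimal sketch, assuming an authenticated `client` and the `headline`/`__clickbait` setup from the README:

```python
from refinery.adapter.sklearn import build_classification_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# no config string: inputs stay raw text instead of transformer embeddings
data = build_classification_dataset(client, "headline", "__clickbait")

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(data["train"]["inputs"], data["train"]["labels"])
print(clf.score(data["test"]["inputs"], data["test"]["labels"]))
```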

refinery/adapter/transformers.py

Lines changed: 47 additions & 0 deletions

@@ -0,0 +1,47 @@
+import os
+from refinery import Client
+from refinery.adapter.util import split_train_test_on_weak_supervision
+from datasets import load_dataset
+
+
+def build_classification_dataset(
+    client: Client, sentence_input: str, classification_label: str
+):
+    """Build a classification dataset from a refinery client, usable for Hugging Face fine-tuning.
+
+    Args:
+        client (Client): Refinery client
+        sentence_input (str): Name of the column containing the sentence input.
+        classification_label (str): Name of the label; if this is a task on the full record, enter it as "__<label>". Otherwise, enter it as "<attribute>__<label>".
+
+    Returns:
+        Hugging Face dataset and the label-to-id mapping.
+    """
+
+    df_train, df_test, label_options = split_train_test_on_weak_supervision(
+        client, sentence_input, classification_label
+    )
+
+    mapping = {k: v for v, k in enumerate(label_options)}
+
+    df_train["label"] = df_train["label"].apply(lambda x: mapping[x])
+    df_test["label"] = df_test["label"].apply(lambda x: mapping[x])
+
+    hash_val = hash(str(client.project_id))
+    train_file_path = f"{hash_val}_train_file.csv"
+    test_file_path = f"{hash_val}_test_file.csv"
+
+    df_train.to_csv(train_file_path, index=False)
+    df_test.to_csv(test_file_path, index=False)
+
+    dataset = load_dataset(
+        "csv", data_files={"train": train_file_path, "test": test_file_path}
+    )
+
+    if os.path.exists(train_file_path):
+        os.remove(train_file_path)
+
+    if os.path.exists(test_file_path):
+        os.remove(test_file_path)
+
+    return dataset, mapping
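
The returned `mapping` assigns an integer id to each label string (for example, it might look like `{"clickbait": 0, "regular": 1}`; the actual labels depend on your project). A short sketch of how you might read model outputs back into label names:

```python
from refinery.adapter.transformers import build_classification_dataset

dataset, mapping = build_classification_dataset(client, "headline", "__clickbait")

# invert label -> id so predicted class ids can be decoded into label strings
inverse_mapping = {v: k for k, v in mapping.items()}

predicted_class_id = 0  # e.g. np.argmax(logits, axis=-1) for a single example
print(inverse_mapping[predicted_class_id])
```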

refinery/adapter/util.py

Lines changed: 47 additions & 0 deletions

@@ -0,0 +1,47 @@
+from typing import List, Tuple
+from refinery import Client
+import pandas as pd
+
+
+def split_train_test_on_weak_supervision(
+    client: Client, _input: str, _label: str
+) -> Tuple[pd.DataFrame, pd.DataFrame, List[str]]:
+    """
+    Splits the data into a train set (weakly supervised data) and a test set (manually labeled data).
+    Overlapping data is removed from the train set.
+
+    Args:
+        client (Client): Refinery client
+        _input (str): Name of the column containing the sentence input.
+        _label (str): Name of the label; if this is a task on the full record, enter it as "__<label>". Otherwise, enter it as "<attribute>__<label>".
+
+    Returns:
+        Tuple[pd.DataFrame, pd.DataFrame, List[str]]: The train and test dataframes and the label name options.
+    """
+
+    label_attribute_train = f"{_label}__WEAK_SUPERVISION"
+    label_attribute_test = f"{_label}__MANUAL"
+
+    df_train = client.get_record_export(
+        tokenize=False,
+        keep_attributes=[_input, label_attribute_train],
+        dropna=True,
+    ).rename(columns={label_attribute_train: "label"})
+
+    df_test = client.get_record_export(
+        tokenize=False,
+        keep_attributes=[_input, label_attribute_test],
+        dropna=True,
+    ).rename(columns={label_attribute_test: "label"})
+
+    df_train = df_train.drop(df_test.index)
+
+    label_options = list(
+        set(df_test.label.unique().tolist() + df_train.label.unique().tolist())
+    )
+
+    return (
+        df_train.reset_index(drop=True),
+        df_test.reset_index(drop=True),
+        label_options,
+    )
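
A quick sketch of the contract, assuming an authenticated `client` on a project that has both weak supervision results and some manual labels:

```python
df_train, df_test, label_options = split_train_test_on_weak_supervision(
    client, "headline", "__clickbait"
)

# train rows carry weakly supervised labels, test rows carry manual labels;
# records that appear in both exports are kept only in the test set
print(df_train.shape, df_test.shape, label_options)
```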

requirements.txt

Lines changed: 9 additions & 7 deletions

@@ -1,7 +1,9 @@
-numpy==1.22.3
-pandas==1.4.2
-requests==2.27.1
-boto3==1.24.26
-botocore==1.27.26
-spacy==3.3.1
-wasabi==0.9.1
+numpy
+pandas
+requests
+boto3
+botocore
+spacy
+wasabi
+embedders
+datasets

setup.py

Lines changed: 10 additions & 8 deletions

@@ -10,7 +10,7 @@

 setup(
     name="refinery-python-sdk",
-    version="1.0.2",
+    version="1.1.0",
     author="jhoetter",
     author_email="johannes.hoetter@kern.ai",
     description="Official Python SDK for Kern AI refinery.",

@@ -34,13 +34,15 @@
     package_dir={"": "."},
     packages=find_packages("."),
     install_requires=[
-        "numpy==1.22.3",
-        "pandas==1.4.2",
-        "requests==2.27.1",
-        "boto3==1.24.26",
-        "botocore==1.27.26",
-        "spacy==3.3.1",
-        "wasabi==0.9.1",
+        "numpy",
+        "pandas",
+        "requests",
+        "boto3",
+        "botocore",
+        "spacy",
+        "wasabi",
+        "embedders",
+        "datasets",
     ],
     entry_points={
         "console_scripts": [
