
Commit 3281990

feat(data-manager): Added a data manager class
1 parent 1e23453 commit 3281990

4 files changed: +583 -0 lines changed
Lines changed: 239 additions & 0 deletions
@@ -0,0 +1,239 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Manager\n",
    "When doing active learning we have our Original Data (OD), Labeled Data (LD) and Unlabeled Data (UD),\n",
    "where UD and LD are subsets of OD.\n",
    "The active learner operates on UD and returns indices relative to it. We want to store those indices with respect\n",
    "to OD, and sometimes see the subset of labels of LD. (The subset of labels of UD is null.)\n",
    "\n",
    "That's a fancy way of saying there is a lot of bookkeeping to be done, and this class solves that by doing it for you.\n",
    "\n",
    "The main idea is that we store a mask (labeled_mask) of the indices that have been labeled and then expose UD, LD\n",
    "and the labels by using fancy indexing with that mask. The manager exposes an add_labels method which lets the\n",
    "user add labels indexed with respect to UD, and it will adjust the indices so that they match OD.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparation\n",
    "In this part we prepare the data and learners, all normal stuff you've seen in other examples.\n",
    "One difference is that we're working with text data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "This example shows how to use the new data manager class.\n",
    "For clarity, all the setup has been moved into functions and\n",
    "the core is in the __main__ section, which is commented.\n",
    "\n",
    "Also look at prepare_manager to see how a DataManager is instantiated.\n",
    "\"\"\"\n",
    "\n",
    "from sklearn.datasets import fetch_20newsgroups\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from modAL.datamanager import DataManager\n",
    "import numpy as np\n",
    "import matplotlib as mpl\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from functools import partial\n",
    "\n",
    "from modAL.models import ActiveLearner\n",
    "from modAL.batch import uncertainty_batch_sampling\n",
    "\n",
    "RANDOM_STATE_SEED = 123\n",
    "np.random.seed(RANDOM_STATE_SEED)\n",
    "BATCH_SIZE = 5\n",
    "N_QUERIES = 50\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define Utility Functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def prepare_data():\n",
    "    SKIP_SIZE = 50  # Skip to make the example go fast.\n",
    "    docs, original_labels = fetch_20newsgroups(return_X_y=True)\n",
    "    docs_train = docs[::SKIP_SIZE]\n",
    "    original_labels_train = original_labels[::SKIP_SIZE]\n",
    "    docs_test = docs[1::SKIP_SIZE]  # Offset by one means no overlap\n",
    "    original_labels_test = original_labels[1::SKIP_SIZE]  # Offset by one means no overlap\n",
    "    return docs_train, original_labels_train, docs_test, original_labels_test\n",
    "\n",
    "\n",
    "def prepare_features(docs_train, docs_test):\n",
    "    vectorizer = TfidfVectorizer(\n",
    "        stop_words=\"english\", ngram_range=(1, 3), max_df=0.9, max_features=5000\n",
    "    )\n",
    "    vectors_train = vectorizer.fit_transform(docs_train).toarray()\n",
    "    vectors_test = vectorizer.transform(docs_test).toarray()\n",
    "    return vectors_train, vectors_test\n",
    "\n",
    "\n",
    "def prepare_learner():\n",
    "    estimator = RandomForestClassifier()\n",
    "    preset_batch = partial(uncertainty_batch_sampling, n_instances=BATCH_SIZE)\n",
    "    learner = ActiveLearner(estimator=estimator, query_strategy=preset_batch)\n",
    "    return learner\n",
    "\n",
    "\n",
    "def make_pretty_summary_plot(performance_history):\n",
    "    with plt.style.context(\"seaborn-white\"):\n",
    "        fig, ax = plt.subplots(figsize=(8.5, 6), dpi=130)\n",
    "\n",
    "        ax.plot(performance_history)\n",
    "        ax.scatter(range(len(performance_history)), performance_history, s=13)\n",
    "\n",
    "        ax.xaxis.set_major_locator(\n",
    "            mpl.ticker.MaxNLocator(nbins=N_QUERIES + 3, integer=True)\n",
    "        )\n",
    "        ax.xaxis.grid(True)\n",
    "\n",
    "        ax.yaxis.set_major_locator(mpl.ticker.MaxNLocator(nbins=10))\n",
    "        ax.yaxis.set_major_formatter(mpl.ticker.PercentFormatter(xmax=1))\n",
    "        ax.set_ylim(bottom=0, top=1)\n",
    "        ax.yaxis.grid(True, linestyle=\"--\", alpha=1 / 2)\n",
    "\n",
    "        ax.set_title(\"Incremental classification accuracy\")\n",
    "        ax.set_xlabel(\"Query iteration\")\n",
    "        ax.set_ylabel(\"Classification Accuracy\")\n",
    "\n",
    "        plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Instantiate The Data Manager\n",
    "Here we instantiate the manager. We pass it the feature vectors we'll be training on as well as the original documents (so we can easily index them)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def prepare_manager(vectors_train, docs_train):\n",
    "    manager = DataManager(vectors_train, sources=docs_train)\n",
    "    return manager\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using The Manager"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs_train, original_labels_train, docs_test, original_labels_test = prepare_data()\n",
    "vectors_train, vectors_test = prepare_features(docs_train, docs_test)\n",
    "manager = prepare_manager(vectors_train, docs_train)\n",
    "learner = prepare_learner()\n",
    "performance_history = []\n",
    "# performance_history.append(learner.score(docs_test, original_labels_test))\n",
    "\n",
    "for i in range(N_QUERIES):\n",
    "    # Check if there are more examples that are not labeled. If not, break.\n",
    "    if manager.unlabeld.size == 0:\n",
    "        break\n",
    "\n",
    "    # Query the learner as usual; since we are using a batch sampling strategy,\n",
    "    # indices_to_label is an array.\n",
    "    indices_to_label, query_instance = learner.query(manager.unlabeld)\n",
    "    labels = []  # Hold a list of the new labels\n",
    "    for ix in indices_to_label:\n",
    "        # Here is the tricky part that the manager solves. The indices are given with\n",
    "        # respect to the unlabeled data, but we want to work with them with respect to\n",
    "        # the original data. The manager makes this almost transparent.\n",
    "        # Map the index that is with respect to the unlabeled data back to an index\n",
    "        # with respect to the whole dataset.\n",
    "        original_ix = manager.get_original_index_from_unlabeled_index(ix)\n",
    "        # print(manager.sources[original_ix])  # Show the original data so we can decide what to label\n",
    "        # Now we can look up the label in the original set of labels without any bookkeeping.\n",
    "        y = original_labels_train[original_ix]\n",
    "        # We create a Label instance, a tuple of index and label.\n",
    "        # The index should be with respect to the unlabeled data; the add_labels function\n",
    "        # will automatically calculate the offsets.\n",
    "        label = (ix, y)\n",
    "        labels.append(label)\n",
    "    # Insert them all at once.\n",
    "    manager.add_labels(labels)\n",
    "    # Note that if you need to add labels with indices that respect the original dataset\n",
    "    # you can do manager.add_labels(labels, offset_to_unlabeled=False)\n",
    "    # Now teach as usual.\n",
    "    learner.teach(manager.labeled, manager.labels)\n",
    "    performance_history.append(learner.score(vectors_test, original_labels_test))\n",
    "\n",
    "# Finally make a nice plot\n",
    "make_pretty_summary_plot(performance_history)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
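
The DataManager implementation itself is not part of this diff, but the bookkeeping the notebook describes can be sketched in a few lines of NumPy: a boolean labeled_mask over the original data, fancy indexing to expose the labeled and unlabeled views, and a map from unlabeled-relative indices back to original indices. The sketch below is illustrative rather than the committed class; the class name MaskDataManagerSketch is hypothetical, and only the attribute and method names the example calls (unlabeld, labeled, labels, add_labels, get_original_index_from_unlabeled_index) are taken from the code above.

import numpy as np


class MaskDataManagerSketch:
    """Minimal sketch of the mask bookkeeping described above; not the committed DataManager."""

    def __init__(self, data, sources=None):
        self.data = np.asarray(data)
        self.sources = sources  # Original documents, kept for easy inspection
        self.labeled_mask = np.zeros(len(self.data), dtype=bool)  # True where labeled
        self._labels = np.full(len(self.data), None, dtype=object)  # None means unlabeled

    @property
    def unlabeld(self):  # Spelling matches the attribute the example uses
        return self.data[~self.labeled_mask]

    @property
    def labeled(self):
        return self.data[self.labeled_mask]

    @property
    def labels(self):
        # Labels of the labeled subset only; the unlabeled subset's labels stay None
        return self._labels[self.labeled_mask]

    def get_original_index_from_unlabeled_index(self, ix):
        # Positions, with respect to the original data, of the rows still unlabeled
        return np.flatnonzero(~self.labeled_mask)[ix]

    def add_labels(self, labels, offset_to_unlabeled=True):
        # Resolve all indices before mutating the mask, so that earlier insertions
        # do not shift the meaning of later unlabeled-relative indices
        if offset_to_unlabeled:
            labels = [(self.get_original_index_from_unlabeled_index(ix), y) for ix, y in labels]
        for original_ix, y in labels:
            self._labels[original_ix] = y
            self.labeled_mask[original_ix] = True

This is the core of why the example can hand add_labels indices relative to manager.unlabeld: the mask is the single source of truth, and every view is derived from it by fancy indexing.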
Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
"""
This example shows how to use the new data manager class.
For clarity, all the setup has been moved into functions and
the core is in the __main__ section, which is commented.

Also look at prepare_manager to see how a DataManager is instantiated.
"""

from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from modAL.datamanager import DataManager
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from functools import partial

from modAL.models import ActiveLearner
from modAL.batch import uncertainty_batch_sampling

RANDOM_STATE_SEED = 123
np.random.seed(RANDOM_STATE_SEED)
BATCH_SIZE = 5
N_QUERIES = 50


def prepare_data():
    SKIP_SIZE = 50  # Skip to make the example go fast.
    docs, original_labels = fetch_20newsgroups(return_X_y=True)
    docs_train = docs[::SKIP_SIZE]
    original_labels_train = original_labels[::SKIP_SIZE]
    docs_test = docs[1::SKIP_SIZE]  # Offset by one means no overlap
    original_labels_test = original_labels[1::SKIP_SIZE]  # Offset by one means no overlap
    return docs_train, original_labels_train, docs_test, original_labels_test


def prepare_features(docs_train, docs_test):
    vectorizer = TfidfVectorizer(
        stop_words="english", ngram_range=(1, 3), max_df=0.9, max_features=5000
    )
    vectors_train = vectorizer.fit_transform(docs_train).toarray()
    vectors_test = vectorizer.transform(docs_test).toarray()
    return vectors_train, vectors_test


def prepare_manager(vectors_train, docs_train):
    manager = DataManager(vectors_train, sources=docs_train)
    return manager


def prepare_learner():
    estimator = RandomForestClassifier()
    preset_batch = partial(uncertainty_batch_sampling, n_instances=BATCH_SIZE)
    learner = ActiveLearner(estimator=estimator, query_strategy=preset_batch)
    return learner


def make_pretty_summary_plot(performance_history):
    with plt.style.context("seaborn-white"):
        fig, ax = plt.subplots(figsize=(8.5, 6), dpi=130)

        ax.plot(performance_history)
        ax.scatter(range(len(performance_history)), performance_history, s=13)

        ax.xaxis.set_major_locator(
            mpl.ticker.MaxNLocator(nbins=N_QUERIES + 3, integer=True)
        )
        ax.xaxis.grid(True)

        ax.yaxis.set_major_locator(mpl.ticker.MaxNLocator(nbins=10))
        ax.yaxis.set_major_formatter(mpl.ticker.PercentFormatter(xmax=1))
        ax.set_ylim(bottom=0, top=1)
        ax.yaxis.grid(True, linestyle="--", alpha=1 / 2)

        ax.set_title("Incremental classification accuracy")
        ax.set_xlabel("Query iteration")
        ax.set_ylabel("Classification Accuracy")

        plt.show()


if __name__ == "__main__":
    docs_train, original_labels_train, docs_test, original_labels_test = prepare_data()
    vectors_train, vectors_test = prepare_features(docs_train, docs_test)
    manager = prepare_manager(vectors_train, docs_train)
    learner = prepare_learner()
    performance_history = []
    # performance_history.append(learner.score(docs_test, original_labels_test))

    for i in range(N_QUERIES):
        # Check if there are more examples that are not labeled. If not, break.
        if manager.unlabeld.size == 0:
            break

        # Query the learner as usual; since we are using a batch sampling strategy,
        # indices_to_label is an array.
        indices_to_label, query_instance = learner.query(manager.unlabeld)
        labels = []  # Hold a list of the new labels
        for ix in indices_to_label:
            # Here is the tricky part that the manager solves. The indices are given with
            # respect to the unlabeled data, but we want to work with them with respect to
            # the original data. The manager makes this almost transparent.
            # Map the index that is with respect to the unlabeled data back to an index
            # with respect to the whole dataset.
            original_ix = manager.get_original_index_from_unlabeled_index(ix)
            # print(manager.sources[original_ix])  # Show the original data so we can decide what to label
            # Now we can look up the label in the original set of labels without any bookkeeping.
            y = original_labels_train[original_ix]
            # We create a Label instance, a tuple of index and label.
            # The index should be with respect to the unlabeled data; the add_labels function
            # will automatically calculate the offsets.
            label = (ix, y)
            labels.append(label)
        # Insert them all at once.
        manager.add_labels(labels)
        # Note that if you need to add labels with indices that respect the original dataset
        # you can do manager.add_labels(labels, offset_to_unlabeled=False)
        # Now teach as usual.
        learner.teach(manager.labeled, manager.labels)
        performance_history.append(learner.score(vectors_test, original_labels_test))
    # Finally make a nice plot
    make_pretty_summary_plot(performance_history)
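
To make the index mapping concrete, here is a toy run against the MaskDataManagerSketch shown after the notebook above (the data and labels are made up for illustration; the committed DataManager is expected to behave the same way for these calls):

import numpy as np

data = np.arange(10).reshape(5, 2)  # Five rows of "features"
mgr = MaskDataManagerSketch(data, sources=["a", "b", "c", "d", "e"])

# Label positions 0 and 2 of the unlabeled view; it initially matches the
# original order, so these resolve to original rows 0 and 2.
mgr.add_labels([(0, "spam"), (2, "ham")])

print(mgr.labeled.shape)   # (2, 2) -> rows 0 and 2
print(mgr.unlabeld.shape)  # (3, 2) -> rows 1, 3 and 4
print(list(mgr.labels))    # ['spam', 'ham']

# After those two insertions, unlabeled index 0 points at original row 1, not row 0:
print(mgr.get_original_index_from_unlabeled_index(0))  # 1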
