
machine learning introduction


The file machine_learning_101.pdf helps people with no machine learning background better understand the basics of machine learning.

What is machine learning

Machine learning is the science of getting computers to learn from data in order to make decisions or predictions.

Machine learning uses algorithms to build a model from a training set, in order to make predictions or decisions without being explicitly programmed to perform the task.

Supervised learning

The machine learning algorithm learns on a labeled dataset.
Learning by examples.

labeled dataset examples

The iris dataset and the Titanic dataset are labeled datasets.

The iris dataset contains 150 records under five attributes: petal length, petal width, sepal length, sepal width and species.
It consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor).
Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.
Based on the combination of these four features, we can distinguish the species.

The Titanic had 2,224 passengers on board, and more than 1,500 of them died.
This dataset provides each passenger's name, sex, age, class (1st, 2nd, 3rd), port of embarkation (Cherbourg, Queenstown, Southampton), ... and indicates whether the passenger survived or died.
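
For instance, the iris dataset ships with scikit-learn, so its features and labels can be inspected directly (a minimal sketch; the Titanic dataset is not bundled with scikit-learn and would have to be downloaded separately):

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)       # (150, 4): 150 samples, 4 features
print(iris.feature_names)    # sepal/petal length and width, in cm
print(iris.target_names)     # ['setosa' 'versicolor' 'virginica']
print(iris.target[:5])       # species labels, encoded as 0, 1, 2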

Unsupervised learning

The machine learning algorithm learns on an unlabeled dataset.

k-means clustering and DBSCAN are unsupervised clustering machine learning algorithms.
They group data that has not been previously labelled, classified or categorized.

Clustering

Clustering uses unsupervised learning (a dataset without labels): it creates regions in space without being given any labels.
Clustering divides the data points into groups, such that data points in the same group are more similar to each other than to data points in other groups.
A group (cluster) is basically a collection of data points gathered together based on their similarity.


Classification

Classification categorizes data points into the desired class.
There is a distinct number of classes.
Classes are sometimes called targets, labels or categories.
A classification algorithm takes a training set as input and outputs a classifier, which predicts the class of any new data point.

Classification uses supervised learning.
The machine learning algorithm learns on a labeled dataset: the labels are known from the training set.

KNN (k-nearest neighbors) and Support vector classifier (SVC) are supervised learning algorithms for classification.

machine learning model

Once a machine learning model is built with a training set, it can be used to process new data points to make predictions or decisions.

k-Fold Cross-Validation

CV can be used to test a model.
It helps to estimate the model performance.
It gives an indication of how well the model generalizes to unseen data.
CV uses a single parameter called k.
It works like this:
it splits the dataset into k groups.
For each unique group:

  • Take the group as a test data set
  • Take the remaining groups as a training data set
  • Use the training set to build the model, and then evaluate it on the test set

Example:
A dataset with 6 data points: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
The first step is to pick a value for k in order to determine the number of folds used to split the dataset.
Here, we will use a value of k=3, so we split the dataset into 3 groups. Each group will have an equal number of 2 observations.

For example:

  • Fold1: [0.5, 0.2]
  • Fold2: [0.1, 0.3]
  • Fold3: [0.4, 0.6]

Three models are built and evaluated:

  • Model1: Trained on Fold1 + Fold2, Tested on Fold3
  • Model2: Trained on Fold2 + Fold3, Tested on Fold1
  • Model3: Trained on Fold1 + Fold3, Tested on Fold2
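
The same procedure can be sketched with scikit-learn's KFold class (the fold contents below depend on the random shuffle, so they will not match the folds listed above exactly):

from sklearn.model_selection import KFold

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
kf = KFold(n_splits=3, shuffle=True, random_state=1)

# each iteration holds out one fold as the test set and trains on the other two
for i, (train_idx, test_idx) in enumerate(kf.split(data), start=1):
    train = [data[j] for j in train_idx]
    test = [data[j] for j in test_idx]
    print("Model%d: trained on %s, tested on %s" % (i, train, test))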

Signal vs Noise

The "signal" is the true underlying pattern that you wish to learn from the data. "Noise", on the other hand, refers to the irrelevant information in a dataset.

The algorithm can end up "memorizing the noise" instead of finding the signal.
The model will then make predictions based on that noise.
So it will perform poorly on new/unseen data.

Model fitting

The sample data used to build the model should be representative of the data you would expect to find in the actual population.
A model that is well-fitted produces more accurate outcomes.
A well fitted model will perform well on new/unseen data.
A well fitted model will generalize well from the training data to unseen data.

Overfitting

A model that has learned the noise instead of the signal is considered overfitted.
This overfit model will then make predictions based on that noise.
It will perform poorly on new/unseen data.
The overfit model doesn’t generalize well from the training data to unseen data.

How to Detect Overfitting

We can’t know how well a model will perform on new data until we actually test it.
To address this, we can split our initial dataset into separate training and test subsets.

  • The training sets are used to build the models.
  • The test sets are put aside as "unseen" data to evaluate the models.
    This method helps to know how well the model will perform on new data (i.e. to estimate the model's performance).
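
A minimal sketch of this train/test split with scikit-learn, using the iris dataset and a k-NN classifier as an arbitrary example model:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# hold out 30% of the data as "unseen" test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

print("training accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))   # a much lower test score suggests overfitting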

k-Fold Cross-Validation and overfitting

CV gives an indication of how well the model generalizes to unseen data.
CV does not prevent overfitting in itself, but it may help in identifying a case of overfitting.
It evaluates the model on unseen data, using each part of the training set in turn as a validation set.
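
A minimal sketch with scikit-learn's cross_val_score, which runs k-fold cross-validation and returns one score per fold (here on the iris dataset with a linear SVC, as an arbitrary example):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel='linear')

scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validation
print(scores)                               # one accuracy score per fold
print("mean accuracy:", scores.mean())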

How to Prevent Overfitting

Detecting overfitting is useful, but it doesn’t solve the problem.

To prevent overfitting, train your algorithm with more data. It won’t work every time, but training with more data can help algorithms detect the signal better. Of course, that’s not always the case: if we just add more noisy data, this technique won’t help. That’s why you should always ensure your data is clean and relevant.

To prevent overfitting, improve the data by removing irrelevant features.
Not all features contribute to the prediction. Removing features of low importance can improve accuracy, and reduce overfitting. Training time can also be reduced.
Imagine a dataset with 300 columns and only 250 rows. That is a lot of features for only very few training samples. So, instead of using all features, it’s better to use only the most important ones. This will make the training process faster. It can help to prevent overfitting because the model doesn’t need to use all the features.
So, rank the features and eliminate the less important ones.

The python library scikit-learn provides a feature selection module which helps identify the most relevant features of a dataset.
Examples:

  • The class VarianceThreshold removes the features with low variance, i.e. the features with a variance lower than a configurable threshold.
  • The class RFE (Recursive Feature Elimination) selects features by recursively considering smaller and smaller sets of features. It first trains the classifier on the initial set of features, then trains it again on smaller and smaller feature sets. After each training, the importance of each feature is computed and the least important feature is eliminated from the current set. The procedure is repeated until the desired number of features is reached. RFE is able to find the combination of features that contribute to the prediction. You just need to import RFE from sklearn.feature_selection and indicate the number of features to select and which classifier model to use. A sketch of both classes follows this list.
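
A minimal sketch of both classes on the iris dataset (the threshold value and the number of features to select are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, RFE
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# VarianceThreshold: drop the features whose variance is below the threshold
selector = VarianceThreshold(threshold=0.2)
X_reduced = selector.fit_transform(X)
print("features kept by VarianceThreshold:", X_reduced.shape[1])

# RFE: recursively eliminate features, using a linear SVC and keeping the 2 best
rfe = RFE(estimator=SVC(kernel='linear'), n_features_to_select=2)
rfe.fit(X, y)
print("selected features:", rfe.support_)   # boolean mask over the features
print("feature ranking:", rfe.ranking_)     # 1 means selected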

machine learning algorithms

LinearSVC

LinearSVC is a python class from the Scikit Learn library.

>>> from sklearn.svm import LinearSVC

LinearSVC performs classification.
LinearSVC finds a linear separator: a line separating the classes.
There are many possible linear separators; LinearSVC chooses the optimal one, i.e. the one that maximizes our confidence, which is the one that maximizes the geometric margin, the distance between the separator and the closest data point.

Support vectors are the data points that are closest to the line.
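
A minimal sketch of training a LinearSVC on the iris dataset and classifying a new data point (the new measurements are made up for illustration):

from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

clf = LinearSVC(max_iter=10000)   # a higher max_iter helps the solver converge
clf.fit(X, y)
print(clf.predict([[5.0, 3.5, 1.4, 0.2]]))   # predicted class for a new data point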

SVM.png

Support vector classifier

Support vector machines (svm) is a set of supervised learning methods in the Scikit Learn library.
Support vector classifier (SVC) is a python class capable of performing classification on a dataset.
The class SVC is in the module svm of the Scikit Learn library

>>> from sklearn.svm import SVC
>>> clf = SVC(kernel='linear')

SVC with parameter kernel='linear' is similar to LinearSVC

SVC with parameter kernel='linear' finds the linear separator that maximizes the distance between itself and the closest/nearest data point
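
Continuing the snippet above, a minimal sketch that fits the classifier on the iris dataset and inspects its support vectors (the new measurements are made up for illustration):

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

clf = SVC(kernel='linear')
clf.fit(X, y)
print(clf.support_vectors_.shape)            # (number of support vectors, number of features)
print(clf.predict([[6.3, 2.9, 5.6, 1.8]]))   # predicted class for a new data point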

k-nearest neighbors

k-NN classification is a supervised learning algorithm: it learns from a labeled dataset.
K is an integer.
To classify a new data point, this algorithm calculates the distance between the new data point and the other data points.
The distance can be Euclidean, Manhattan, ....
Once the algorithm knows the K closest neighbors of the new data point, it takes the most common class among these K neighbors and assigns it to the new data point.
So the new data point is assigned to the class to which the majority of its K nearest neighbors belong.
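
A minimal sketch of k-NN classification with scikit-learn, using k=5 and the default Euclidean distance on the iris dataset (the new data point is made up for illustration):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=5)   # k=5
knn.fit(X, y)

new_point = [[5.9, 3.0, 4.2, 1.5]]
print(knn.predict(new_point))       # most common class among the 5 nearest neighbors
print(knn.kneighbors(new_point))    # distances to and indices of those 5 neighbors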

KNN.png

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

It is an unsupervised machine learning algorithm.
It is a density-based clustering algorithm.
It groups datapoints that are in regions with many nearby neighbors.
It groups datapoints in such a way that datapoints in the same cluster are more similar to each other than those in other clusters.
Clusters are dense groups of points.
Clusters are dense regions in the data space, separated by regions of lower density
If a point belongs to a cluster, it should be near to lots of other points in that cluster.
It marks datapoints in lower density regions as outliers.
It works like this: First, we choose two parameters, a number epsilon (distance) and a number minPoints (minimum cluster size).
epsilon is a letter of the Greek alphabet.
We then begin by picking an arbitrary point in our dataset.
If there are at least minPoints data points within a distance of epsilon from this data point (including the point itself), this is a high-density region and a cluster is formed: we consider all of those points to be part of the same cluster.
We then expand that cluster by checking all of the new points and seeing if they too have more than minPoints points within a distance of epsilon, growing the cluster recursively if so.
Eventually, we run out of points to add to the cluster.
We then pick a new arbitrary point and repeat the process.
Now, it's entirely possible that a point we pick has fewer than minPoints points in its epsilon range, and is also not a part of any other cluster: in that case, it's considered a "noise point" (outlier) not belonging to any cluster.
epsilon and minPoints remain the same while the algorithm is running.
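
A minimal sketch with scikit-learn's DBSCAN class, where eps is epsilon and min_samples is minPoints (the data points are made up to show two dense regions and one outlier):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],   # dense region -> first cluster
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1],   # dense region -> second cluster
              [4.0, 15.0]])                         # isolated point -> noise

db = DBSCAN(eps=0.5, min_samples=3)
labels = db.fit_predict(X)
print(labels)   # cluster index for each data point; -1 marks a noise point (outlier)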

DBSCAN.png

k-means clustering

k-means clustering splits n data points into k groups (called clusters), with k ≤ n.

kmeans.png

A cluster is a group of data points.
Each cluster has a center, called the centroid.
A cluster centroid is the mean of a cluster (average across all the data points in the cluster).
The radius of a cluster is the maximum distance between all the points and the centroid.

radius.png

Distance between clusters = distance between centroids.
k-means clustering uses a basic iterative process.
Each data point is assigned to the cluster with the nearest centroid.
The objective is to find the most compact partitioning of the data set into k partitions.
k-means makes compact clusters.
It minimizes the radius of clusters.
The objective is to minimize the variance within each cluster.
Clusters are well separated from each other.
It maximizes the average inter-cluster distance.
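
A minimal sketch of k-means with scikit-learn, splitting 6 made-up 2-D data points into k=2 clusters:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(X)
print(km.labels_)                      # cluster assigned to each data point
print(km.cluster_centers_)             # the centroid (mean) of each cluster
print(km.predict([[0, 0], [12, 3]]))   # nearest-centroid assignment for new points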

k-means clusters tend to be of the same size. Here, size refers to the area, not to the number of elements. Two clusters of the same area do not have to have the same number of elements (except if your data set has uniform density).
kmeans_mouse.png
The tendency of k-means to produce equal-sized clusters leads to bad results here

Clone this wiki locally