In Machine Learning (ML), computers apply statistical learning techniques to automatically identify patterns in data. These techniques can be used to make highly accurate predictions.
What is ML all about?
Algorithms that can learn from observational data and make predictions based on it.
Learning:
- Unsupervised Learning
- Supervised Learning
Unsupervised learning:
The model is not given any answers to learn from; it must make sense of the data given only the observations themselves. We are looking for latent variables (some property we didn't know existed originally).
Supervised learning:
The data the algorithm learns from comes with the correct "answers". We give the model training data from which it can learn. The resulting model is then used to predict the answer for new, unseen values. Example: we can train a model for predicting car prices based on car attributes using historical sales data. The model can then predict the optimal price for new cars that haven't been sold before.
Reality
To illustrate, we can create the data sets ourselves (random samples), like when we created 2 variables with data that depends linearly on each other. But in reality this is not the case. So what do we do?
We collect a set of "answers" (i.e. correct data). Say we are predicting car prices: we collect historical sales data, then perform our regression on this data (we call it the training data) and try to fit a curve.
Regression can be thought of as a form of supervised machine learning.
Evaluating supervised learning:
If we have a set of training data that includes the values we are trying to predict, we split the data into two data sets:
- A training set
- A test set
We then train the model using the training set. We then measure (using r-squared or some other metric) the model's accuracy by asking it to predict values for the test set and comparing those predictions to the known true values.
So the beauty of supervised learning is this concept of train/test, where we split the labelled data (the "correct" answers) into two sets, a train set and a test set. This helps us detect and prevent overfitting.
Test/Train in practice:
- Need to ensure both sets are large enough to contain representatives of all the variances and outliers
- The data sets must be selected randomly (i.e. the partition of the data into train and test sets must be done randomly)
- Train/test is a great way to guard against overfitting
But it can still give us overfitting, say if the sample sizes are too small. There is a way around that too, to protect against overfitting: it's called K-fold cross validation.
K-fold Cross Validation (a way to protect against overfitting):
- We split the data into K segments, not just 2 data sets
- Each segment takes a turn as the test set while we train a model on the remaining segments
- We measure each model's performance against its held-out test segment
- We then take the average performance, e.g. the average r-squared score (train/test can still give us overfitting, and the K-fold method protects against that limitation: because we have multiple training runs, even if one of them overfits, the others will average it out). A minimal scikit-learn sketch follows below.
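Here is a minimal sketch of K-fold cross validation using scikit-learn's cross_val_score (the LinearRegression model, the synthetic data and the choice of 5 folds are illustrative assumptions, not part of the notes above):
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
# Hypothetical data: one feature with a roughly linear relationship to the target
X = np.random.rand(100, 1)
y = 3 * X[:, 0] + np.random.normal(0, 0.1, 100)
model = LinearRegression()
# cv=5 splits the data into 5 segments; each segment takes a turn as the test set
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(scores)         # r-squared for each fold
print(scores.mean())  # average r-squared across the folds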
Train/Test in python
Regression is a form of supervised machine learning, so let's take a polynomial regression example and use train/test to find the right degree of polynomial for the data set.
We will start by creating a data set that we want to build a model for (in this case using polynomial regression):
import numpy as np
from pylab import *
np.random.seed(2)
pageSpeeds = np.random.normal(3.0, 1.0, 100)
purchaseAmount = np.random.normal(50.0, 30.0, 100) / pageSpeeds
scatter(pageSpeeds, purchaseAmount)
Now we will split the data in two - 80% of it will be used for “training” our model, and the other 20% for testing it. This way we can avoid overfitting.
trainX = pageSpeeds[:80]
testX = pageSpeeds[80:]
trainY = purchaseAmount[:80]
testY = purchaseAmount[80:]
""" We actually have to randomize the selectioon of train set and test set. So we need to shuffle the data. It can do it with random.shuffle method.
But in this case it is not required as the original data is itself randomized and hence slicing will also be random """
Here is our training dataset: scatter(trainX, trainY)
And our test dataset: scatter(testX, testY)
Now we will try to fit an 8th-degree polynomial to this data (which is almost certainly overfitting, given what we know about how it was generated!)
x = np.array(trainX)
y = np.array(trainY)
p4 = np.poly1d(np.polyfit(x, y, 8))
Let's plot our polynomial (best-fit curve) against the training data:
import matplotlib.pyplot as plt
xp = np.linspace(0, 7, 100)
axes = plt.axes()
axes.set_xlim([0,7])
axes.set_ylim([0, 200])
plt.scatter(x, y)
plt.plot(xp, p4(xp), c='r')
plt.show()
Let's plot our polynomial (best-fit curve) against the test data:
testx = np.array(testX)
testy = np.array(testY)
axes = plt.axes()
axes.set_xlim([0,7])
axes.set_ylim([0, 200])
plt.scatter(testx, testy)
plt.plot(xp, p4(xp), c='r')
plt.show()
Let's now compute the r-squared score to check the quality of our best fit:
from sklearn.metrics import r2_score
r2 = r2_score(testy, p4(testx))
# see how we are using the test data set to test the model that we have trained from our train set
print(r2)
# returns -2.62352440155e+24
The r-squared score on the test data is kind of horrible! This tells us that our model isn't all that great.
Note:
from sklearn.metrics import r2_score
r2 = r2_score(np.array(trainY), p4(np.array(trainX)))
print(r2)
#returns 0.642706951469
The model fits the training set better, which is why it gives an r-squared score of 0.64 there. So even though this model fits the training set, it does not fit the test set at all (r-squared score of about -2.6e+24). That's the beauty of the train/test approach: it helps us figure out which fit will actually generalize for predictions.
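As a quick illustration of using train/test to pick the right polynomial degree, here is a small sketch (added for illustration, reusing the x, y, testx, testy arrays defined above): fit each candidate degree on the training set and score it on the test set.
for degree in range(1, 9):
    model = np.poly1d(np.polyfit(x, y, degree))
    print(degree, r2_score(testy, model(testx)))
# the degree with the highest test r-squared is the best candidate for this data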
If we are working with a pandas DataFrame (i.e. tabular, labeled data), scikit-learn has a built-in train_test_split function to make randomly splitting the data into a train set and a test set easy.
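For example, a minimal sketch (the DataFrame contents and column names below are made up for illustration):
import pandas as pd
from sklearn.model_selection import train_test_split
# hypothetical car data: two features and the price we want to predict
df = pd.DataFrame({'mileage': [10000, 50000, 20000, 80000, 30000],
                   'age':     [1, 5, 2, 8, 3],
                   'price':   [20000, 9000, 17000, 5000, 15000]})
X = df[['mileage', 'age']]
y = df['price']
# 80/20 random split; random_state just makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)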
Bayesian methods:
Ex:Spam classifier in mail
Naive Bayes:
How does it work? Example: let's express the probability of an email being spam given that it contains the word "free":
p(spam given free) = p(spam) p(free given spam) / p(free)
The numerator is the probability of a message being spam and containing the word "free". The denominator is the overall probability of an email containing the word "free".
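A toy illustration of that calculation with made-up counts (the numbers below are hypothetical, not from any real data set):
# hypothetical counts from a training set of 1000 emails
n_emails = 1000
n_spam = 300              # so p(spam) = 0.3
n_free = 120              # emails containing "free", so p(free) = 0.12
n_free_given_spam = 90    # spam emails containing "free", so p(free given spam) = 0.3
p_spam = n_spam / float(n_emails)
p_free = n_free / float(n_emails)
p_free_given_spam = n_free_given_spam / float(n_spam)
p_spam_given_free = p_spam * p_free_given_spam / p_free
print(p_spam_given_free)  # 0.75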
What about the other words?
- We construct p(spam given word) for every meaningful word we encounter during training
- We then multiply these together when analyzing a new email to get the probability of it being spam. It is called Naive Bayes because we assume there is no relationship between the words; we just treat each word as independent.
Spam classifier using python:
- Use scikit-learn
- The CountVectorizer lets us operate on lots of words at once, and MultinomialNB does all the heavy lifting of Naive Bayes
We will train it on a known set of spams and hams (non-spams), so this is supervised learning
Let's load our training data into a pandas DataFrame that we can play with:
import os
import io
import numpy
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
def readFiles(path):
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            path = os.path.join(root, filename)
            inBody = False
            lines = []
            f = io.open(path, 'r', encoding='latin1')
            for line in f:
                if inBody:
                    lines.append(line)
                elif line == '\n':
                    inBody = True
            f.close()
            message = '\n'.join(lines)
            yield path, message

def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)
    return DataFrame(rows, index=index)
data = DataFrame({'message': [], 'class': []})
data = data.append(dataFrameFromDirectory('full_path/emails/spam', 'spam'))
data = data.append(dataFrameFromDirectory('full_path/emails/ham', 'ham'))
Let's have a look at that DataFrame: data.head()
Now we will use a CountVectorizer to split up each message into its list of words, and throw that into a MultinomialNB classifier. Call fit() and we have a trained spam filter ready to go!
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)
#data['message'].values gives back an ndarray which has values of the message column
classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)
Now let's try predicting some new emails:
examples = ['Earn free house now!!!', "Hi Sailesh, how about a cup of coffee tomorrow?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
""" prediction contains:
array(['spam', 'ham'],
dtype='|S4')
"""
So the first mail in the examples list is classified as spam, and the second as ham.
K-means clustering (Unsupervised learning):
We use it when we want to group data into clusters, maybe movie genres or demographics of people. It attempts to split the data into K groups that are closest to K centroids. It is unsupervised learning: it uses only the positions of each data point. Examples: Where do millionaires live? What genres of movies naturally fall out of the data?
Procedure:
- First it chooses K random centroids
- Then it finds the distance between each point and each centroid and assigns each point to the cluster of its nearest centroid
- Now, within each cluster, it recalculates the centroid as the mean of that cluster's points
- Repeat the assignment step. Keep iterating until the centroids at the nth iteration are almost the same as at the (n-1)th iteration (i.e. until we converge). A compact from-scratch sketch of this loop follows below.
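Here is a compact from-scratch sketch of that loop (NumPy only; the function name, the fixed iteration count and the seed are assumptions for illustration, and it does not handle the corner case of a cluster losing all of its points):
import numpy as np
def kmeans(points, k, iterations=10):
    rng = np.random.RandomState(0)
    # step 1: pick K random points as the initial centroids
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iterations):
        # step 2: assign each point to the cluster of its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 3: move each centroid to the mean of the points assigned to it
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels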
Limitations:
- Choosing K is hard. How can we choose K? One solution is to increase K until we stop getting large reductions in the total squared error (the distances from each point to its centroid)
- Choice of initial centroids
- The random choice of initial centroids can yield different results
- Run it a few times to make sure our initial solutions are not wacky
- Labelling the clusters
- K-means does not attempt to assign any meaning to the clusters we find
- It's up to us to dig into the data and try to determine that
K-means clustering using python
Let's make some fake data that includes people clustered by income and age, randomly:
from numpy import random, array
#Create fake income/age clusters for N people in k clusters
def createClusteredData(N, k):
    random.seed(10)
    pointsPerCluster = float(N)/k
    X = []
    for i in range(k):
        incomeCentroid = random.uniform(20000.0, 200000.0)
        ageCentroid = random.uniform(20.0, 70.0)
        for j in range(int(pointsPerCluster)):
            X.append([random.normal(incomeCentroid, 10000.0), random.normal(ageCentroid, 2.0)])
    X = array(X)
    return X
We will now use K-means to rediscover these clusters using unsupervised learning:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from numpy import random
data = createClusteredData(100, 5)
model = KMeans(n_clusters=5)
# Note here we are scaling the data to normalize it! Important for good results.
# ages range from 20 to 70 while incomes range from 20000 to 200000, so scale() will normalize them
model = model.fit(scale(data))
# We can look at the clusters each data point was assigned to
print(model.labels_)
# And we will visualize it:
plt.figure(figsize=(8, 6))
plt.scatter(data[:,0], data[:,1], c=model.labels_.astype(float))
plt.show()
Entropy:
A measure of the disorder of a data set, i.e. how similar or different the items in it are. Entropy is 0 if all the classes in the data are the same; the entropy is high if they are all different.
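As a quick sketch, the usual formula is -sum(p * log2(p)) over the class proportions p (the helper below and its sample counts are added for illustration):
import numpy as np
def entropy(class_counts):
    p = np.array(class_counts, dtype=float)
    p = p / p.sum()        # convert counts to class proportions
    p = p[p > 0]           # skip empty classes (log of 0 is undefined)
    return -np.sum(p * np.log2(p))
print(entropy([10, 0]))    # 0.0 -- every sample is in the same class
print(entropy([5, 5]))     # 1.0 -- maximally mixed for two classes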
Decision tree:
- We can train on training data to generate a flow chart to make decisions from
- We can actually construct a flow chart to help us decide on a classification with machine learning (this is called a decision tree)
- A good technique when we have a lot of attributes to classify something by
- Another form of supervised learning
- One of the most interesting applications of machine learning
- Gives a flow chart of how to make a decision
Give it some sample data and the resulting classification and out comes a tree!
Decision tree example:
Let's build a system to filter out new resumes based on historical data
- So we are going to form a decision tree which takes a resume and decides whether the person is suitable for the job
- We train the decision tree on the historical data, which is a database of some important attributes of job candidates, where we know which ones were hired and which ones weren't
- We then predict whether a new candidate will get hired based on it
How does a decision tree work?
- At each step, find the attribute we can use to partition the data set so as to minimize the entropy of the data at the next step
- This makes the data more and more uniform as we traverse down the flowchart
The fancy name for this algorithm is ID3. It is a greedy algorithm (a small sketch of the split choice follows below).
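A hypothetical sketch of that split choice: pick the attribute whose split yields the lowest weighted entropy, i.e. the highest information gain (the hired labels and the "internship" attribute below are made up):
import numpy as np
def entropy(labels):
    # same idea as the entropy sketch earlier, computed from a label array
    _, counts = np.unique(labels, return_counts=True)
    p = counts.astype(float) / counts.sum()
    return -np.sum(p * np.log2(p))
# made-up training labels: hired (1) / not hired (0), and one candidate attribute to split on
hired      = np.array([1, 1, 1, 0, 0, 0, 1, 0])
internship = np.array([1, 1, 1, 1, 0, 0, 0, 0])
parent_entropy = entropy(hired)
split_entropy = sum((internship == v).mean() * entropy(hired[internship == v]) for v in (0, 1))
print(parent_entropy - split_entropy)  # information gain of splitting on "internship"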
Limitation?
Decision trees are very susceptible to overfitting
How do we overcome this?
We overcome it with a method called random forests:
- We select random samples of data from the training set
- For each of these samples a decision tree is built. Each tree also uses a different random subset of the attributes, which makes the algorithm much better.
- Now we let the trees "vote" on the final classification. The process of randomly selecting multiple data sets (i.e. randomly resampling the input data) is called bootstrap aggregating, or bagging
Decision trees using python:
First we will load some data set. Note how we use pandas to convert a csv file into a DataFrame:
import numpy as np
import pandas as pd
from sklearn import tree
# we specify the full path of the historical data in the input_file variable
input_file = "full_path/data_set.csv"
# we create a data frame object from the input file
df = pd.read_csv(input_file, header = 0)
#to see the first few rows of the data set call df.head()
df.head()
First, we need to convert the values in the data frame to numbers, as the scikit-learn library works only with numbers, not with words or categories. So we will map the categorical data to numerical values. In the real world, we also need to think about how to deal with unexpected or missing data!
By using map(), we know we will get NaN for unexpected values.
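A small sketch of that mapping step (the category codes below are hypothetical, since the actual columns of the data set are not shown here):
d = {'Y': 1, 'N': 0}
df['target_column'] = df['target_column'].map(d)     # anything not in d becomes NaN
d = {'BS': 0, 'MS': 1, 'PhD': 2}
df['predictor 1'] = df['predictor 1'].map(d)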
features = list(df.columns[:n])  # n = the number of feature (predictor) columns
#features is now ['predictor 1', 'predictor 2', 'predictor 3', ..., 'predictor n']
#we now actually construct the decision tree
#let's assign y to be what we are trying to predict, i.e. the target column
y = df["target_column"]
X = df[features] #collection of all of the data in all of the feature columns
clf = tree.DecisionTreeClassifier() #we create the classifier first
clf = clf.fit(X,y) #now we use the classifier to fit the data
Now let's display the decision tree:
from IPython.display import Image
from sklearn.externals.six import StringIO
import pydot
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data,
feature_names=features)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
Random forests using python:
We will use a random forest of 10 decision trees to predict employment of specific candidate profiles:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X, y)
"""when we are dealing with random forests, we dont have to walk throught the tree by hand,
we use the predict function and pass in the row dataframe"""
#Predict employment of an employed 10-year veteran
print(clf.predict([[10, 1, 4, 0, 0, 0]]))
#Predict employment of an unemployed 10-year veteran
print(clf.predict([[10, 0, 4, 0, 0, 0]]))
#we get the result as 0 or 1
Random forests is essentially an ensemble learning technique.
Ensemble learning:
It just means we use multiple models to try to solve the same problem, and let them vote on the results. Random forests is an example of ensemble learning.
Random forests uses bagging (bootstrap aggregating) to implement ensemble learning, meaning many models are built by training on randomly-drawn subsets of the data.
Boosting is an alternate technique: each subsequent model in the ensemble boosts attributes that address the data mis-classified by the previous model.
A bucket of models trains several different models on the training data and picks the one that works best against the test data.
Stacking runs multiple models at once on the data and combines the results together (this is how the Netflix prize was won!).
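As a small illustration of the "let them vote" idea, here is a sketch using scikit-learn's VotingClassifier (it reuses the X and y from the decision tree example above; the particular choice of three models is an assumption for illustration):
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
voter = VotingClassifier(estimators=[
    ('tree', DecisionTreeClassifier()),
    ('forest', RandomForestClassifier(n_estimators=10)),
    ('nb', GaussianNB()),
])
voter = voter.fit(X, y)                       # each model trains on the same problem
print(voter.predict([[10, 1, 4, 0, 0, 0]]))   # the prediction is a majority vote of the three models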
Advanced ensemble learning:
Bayes Optimal Classifier
- Theoretically the best, but almost always impractical
Bayesian Parameter Averaging
- Attempts to make the Bayes Optimal Classifier practical, but it is still misunderstood, susceptible to overfitting, and often outperformed by the simpler bagging approach
Bayesian Model Combination
- Tries to address all of those problems
- But in the end it is about the same as using cross-validation to find the best combination of models.
Support Vector Machines (SVM):
Works well for classifying higher-dimensional data (data with lots of features)
K-means clustering worked on 2-dimensional data (age and income), while SVM can work on higher-dimensional data. It can be used to classify data sets with a lot of features.
- Finds higher-dimensional support vectors across which to divide the data (mathematically, these support vectors define hyperplanes)
- Under the hood, it uses the kernel trick to represent the data in higher-dimensional spaces in order to find the hyperplanes
SVM is a supervised learning technique, so we train it on a set of training data to predict new data. This makes it different from K-means clustering (which is unsupervised learning): SVM needs training data, i.e. a set of "correct" classifications to learn from.
Support Vector Classification (SVC):
- SVC simply means classifying data using SVM
- We can use different kernels with SVC. Some work better than others for a given data set.
SVC using python
Let's create the same fake income/age clustered data that we used for our K-means clustering example:
import numpy as np
#Create fake income/age clusters for N people in k clusters
def createClusteredData(N, k):
    pointsPerCluster = float(N)/k
    X = []
    y = []
    for i in range(k):
        incomeCentroid = np.random.uniform(20000.0, 200000.0)
        ageCentroid = np.random.uniform(20.0, 70.0)
        for j in range(int(pointsPerCluster)):
            X.append([np.random.normal(incomeCentroid, 10000.0), np.random.normal(ageCentroid, 2.0)])
            y.append(i)
    X = np.array(X)
    y = np.array(y)
    return X, y
Let's plot it:
from pylab import *
(X, y) = createClusteredData(100, 5)
plt.figure(figsize=(8, 6))
plt.scatter(X[:,0], X[:,1], c=y.astype(float))
plt.show()
Now we will use linear SVC to partition our graph into clusters:
from sklearn import svm, datasets
C = 1.0
svc = svm.SVC(kernel='linear', C=C).fit(X, y)
# we are using the linear kernel here in this case
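The same call accepts other kernels; as a quick added sketch, we could also try the RBF kernel on the same data and compare the resulting cluster boundaries:
svc_rbf = svm.SVC(kernel='rbf', C=C).fit(X, y)  # non-linear decision boundaries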
By setting up a dense mesh of points in the grid and classifying all of them, we can render the regions of each cluster as distinct colors:
def plotPredictions(clf):
    # build a dense grid of (income, age) points covering the plot area
    xx, yy = np.meshgrid(np.arange(0, 250000, 10),
                         np.arange(10, 70, 0.5))
    # classify every point in the grid
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    plt.figure(figsize=(8, 6))
    Z = Z.reshape(xx.shape)
    # draw the classification regions, then overlay the original points
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
    plt.scatter(X[:,0], X[:,1], c=y.astype(float))
    plt.show()

plotPredictions(svc)
Or just use predict for a given point:
svc.predict([[200000, 40]])
#returns array([4])
# means a person of age 40 with 200000 income falls under category 4
Recommender systems:
Systems which can recommend stuff to people based on what everybody else did
User-based collaborative filtering:
- Build a matrix of things each user bought/viewed/rated
- Compute similarity scores between users (a small sketch of this step follows the list)
- Find users similar to you
- Recommend stuff they have bought/viewed/rated that you haven't yet
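A hypothetical sketch of the "similarity scores between users" step, using cosine similarity on a tiny made-up user-by-item rating matrix:
import numpy as np
# rows are users, columns are items; 0 means "not rated" (made-up numbers)
ratings = np.array([
    [5, 4, 0, 1],   # user A
    [4, 5, 0, 2],   # user B -- tastes similar to A
    [1, 0, 5, 4],   # user C -- very different tastes
], dtype=float)
def cosineSimilarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cosineSimilarity(ratings[0], ratings[1]))  # high score: A and B are similar users
print(cosineSimilarity(ratings[0], ratings[2]))  # low score: A and C are not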
Problems with User-based CF:
- People are fickle. Their tastes change.
- There are usually many more people than things (by focusing on finding similar people rather than similar things, we are making the computational problem harder)
- People do crazy things (shilling attacks).