Types of data:
- Numerical
- Categorical
- Ordinal
Numerical data:
Data that is quantitatively measurable. Eg: heights of people, page load times, stock prices, gas in tank, age in years, money spent in a store
Types:
- Discrete data (integer based)
- Continuous data (infinite range of possibilities)
Categorical data:
Data with no inherent mathematical meaning
Eg: gender, yes/no questions, race, state of residence, political party, product category
We often assign numbers to categories in order to represent them more compactly, but the numbers have no mathematical meaning.
Ordinal data:
Sort of a mix of the two types above. Eg: movie ratings on a scale of 1-5. These values are not just categories; they have mathematical meaning, ie a 5 is better than a 1 (5 for an excellent movie, 1 for the worst).
Statistics:
Mean:
- AKA Average
- Sum of all values / number of samples
Median:
- Sort the values and take the one at the midpoint
Note: If the number of elements in the data set is even, then the median is the average of the two midpoint elements.
Median is less susceptible to outliers than the mean.
Mean household income in the US is $72,000 while the median is only $50,000, as the mean is skewed by a few billionaires.
Mode:
- Most common value in the data set (not relevant to continuous data)
- Mode is the value with the highest frequency
Mean, median and mode using python
Mean:
- 27,000: mean (center) of the normal distribution
- 15,000: standard deviation
- 10,000: number of data points to generate
import numpy as np
incomes = np.random.normal(27000,15000,10000) #creates data
np.mean(incomes)
Plot the histogram:
import matplotlib.pyplot as plt
plt.hist(incomes,50) # 50 is the no of buckets for histogram
plt.show()
Median:
np.median(incomes) #computes the median
Mode:
ages = np.random.randint(18,high=90,size=500)
#generates 500 random integers between 18 and 90
from scipy import stats
stats.mode(ages)
Variance:
Describes the spread, or shape, of the data distribution.
Variance (sigma^2) is simply the average of the squared differences from the mean.
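Worked example (made-up numbers): for the data set 1, 4, 5, 4, 8 the mean is 4.4, the squared differences from the mean are 11.56, 0.16, 0.36, 0.16 and 12.96, and the variance is their average: 25.2/5 = 5.04.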
Standard deviation:
- Square root of variance
- Standard deviation is sigma
- It is one way to identify outliers (see the sketch below)
- mean +/- one standard deviation gives us a 'typical' range of values
- Data points that lie more than one standard deviation from the mean can be considered unusual
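A minimal sketch of flagging unusual points, reusing the incomes array generated above (one sigma is just the cutoff from the bullet above; two or three sigma is also common):
mean, sigma = incomes.mean(), incomes.std()
unusual = incomes[np.abs(incomes - mean) > sigma] # points outside mean +/- one standard deviation
print(len(unusual), 'of', len(incomes), 'points lie more than one sigma from the mean')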
Nuance:
If we are working with a subset of data (a sample) instead of the entire data set (the population), then we should use the sample variance instead of the population variance. For the sample variance we divide the sum of squared differences from the mean by N-1 instead of N.
Variance and standard deviation in python:
Standard deviation: incomes.std()
Variance: incomes.var()
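To see the N vs N-1 nuance in numpy (a quick sketch; ddof=1 switches numpy to the sample formula):
incomes.var() # population variance, divides by N
incomes.var(ddof=1) # sample variance, divides by N-1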
Probability density and mass function:
We have already seen the normal distribution; it is an example of a probability distribution function.
Probability density function:
A way to visualize the probability of continuous data.
With continuous data there is a range of values that can occur, so the probability of any one exact value occurring is infinitesimally small. Hence we need to think about the probability of a range of values occurring.
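A quick sketch of 'probability of a range' using the normal distribution (norm.cdf gives the area under the PDF up to a point):
from scipy.stats import norm
p = norm.cdf(1) - norm.cdf(-1) # probability that a standard normal value falls between -1 and 1
print(p) # about 0.68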
Probability mass function:
A way to visualize the probability of discrete data.
Common data distributions:
Uniform distribution:
There is a flat, constant probability of a value occurring anywhere within a given range of values.
import numpy as np
import matplotlib.pyplot as plt
values = np.random.uniform(-10.0, 10.0, 100000)
# generates 100,000 points of uniform distribution between -10 and 10
plt.hist(values, 50)
plt.show()
Normal/Gaussian distribution:
We have already seen this; normal and Gaussian are the same thing.
Let's visualize the probability density function:
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
x = np.arange(-3, 3, 0.001)
plt.plot(x, norm.pdf(x))
plt.show()
Exponential probability density function (PDF) / power law:
import numpy as np
from scipy.stats import expon
import matplotlib.pyplot as plt
x = np.arange(0, 10, 0.001) # generate values from 0 to 10 in steps of 0.001
plt.plot(x, expon.pdf(x)) # x-axis values are x, y-axis values are the exponential pdf of x
plt.show()
Binomial probability mass function:
import numpy as np
from scipy.stats import binom
import matplotlib.pyplot as plt
n, p = 10, 0.5
x = np.arange(0, 11) # the binomial is discrete, so evaluate the pmf at the integers 0 to 10
plt.plot(x, binom.pmf(x, n, p)) # x-axis values are x, y-axis values are the pmf of x
plt.show()
Poisson probability mass function:
Ex: My website gets on average 500 visits per day. What are the odds of getting 550?
import numpy as np
from scipy.stats import poisson
import matplotlib.pyplot as plt
mu = 500
x = np.arange(400, 600) # integer visit counts, since the Poisson distribution is discrete
plt.plot(x, poisson.pmf(x, mu))
plt.show()
Percentile:
- In a data set, the x-th percentile is the value below which x% of the values fall
- The 50th percentile is the median (half of the data points fall below it and half above)
Percentile using python:
np.percentile(dataset,50)
# so this gets the 50th percentile value
#note: the 50th percentile is the same as the median
The 90th percentile is the value below which 90% of the data falls.
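A quick sketch, reusing the incomes array generated earlier:
np.percentile(incomes, 90) # value below which 90% of incomes fall
np.percentile(incomes, 50) # same value as np.median(incomes)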
Moments:
- A way to measure the shape of a data distribution, ie its probability density function
- The nth moment is denoted mu_n
The first moment is the mean.
The second moment is the variance.
The third moment is called skew. It is basically a measure of how lopsided a distribution is: a longer tail on the left means negative skew, and a longer tail on the right means positive skew.
The fourth moment is called kurtosis. It measures how thick the tails are and how sharp the peak is. Example: higher, sharper peaks have higher kurtosis.
Moments in python:
First moment: np.mean(dataset)
Second moment: np.var(dataset)
Third moment: import scipy.stats as sp; sp.skew(dataset)
Fourth moment: sp.kurtosis(dataset)
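Putting it together on generated data (a sketch; dataset here is just a placeholder name):
import numpy as np
import scipy.stats as sp
dataset = np.random.normal(0, 0.5, 10000)
print(np.mean(dataset)) # first moment, near 0
print(np.var(dataset)) # second moment, near 0.25
print(sp.skew(dataset)) # third moment, near 0 for symmetric data
print(sp.kurtosis(dataset)) # fourth moment, near 0 for a normal distribution (Fisher definition)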
Matplotlib Basics:
Draw a line graph:
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
x = np.arange(-3,3,0.001)
plt.plot(x,norm.pdf(x))
plt.show()
Multiple plots on one graph:
All you have to do is call plt.plot() multiple times to draw multiple curves before calling plt.show()
plt.plot(x,norm.pdf(x))
plt.plot(x,norm.pdf(x,1.0,0.5)) #pdf with mean of 1.0 and standard deviation of 0.5
plt.show()
Save it to file:
So instead of using plt.show(), use:
plt.savefig('complete_path/myplot.png',format='png')
Adjust the axis:
axes = plt.axes() #get the axes
axes.set_xlim([-5,5]) #sets the x limit, ie range of x from -5 to 5
axes.set_ylim([0,1.0]) #sets the y limit, ie range of y from 0 to 1.0
#we can also set the tick marks in y and x axis
axes.set_xticks([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5])
axes.set_yticks([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
plt.plot(x, norm.pdf(x))
plt.plot(x, norm.pdf(x, 1.0, 0.5))
plt.show()
Add a grid:
axes.grid()
Change line types and colors:
plt.plot(x,norm.pdf(x),'b-') #'b-' sets the style of the line: a blue solid line ('-' means solid)
#'r:' means a red dotted line
#similarly we have 'r--' (red dashed) and 'r-.' (red dash-dot)
Labelling axes:
plt.xlabel('time')
plt.ylabel('rate')
Adding a legend:
#a legend is simply a key for the graph, ie if you have 2 curves you can name them ['graph1name', 'graph2name']
plt.legend(['graph1name', 'graph2name'], loc=4)
#loc sets the location of the legend; 4 means bottom right
XKCD style (comic book like style):
plt.xkcd()
#so now we go into the xkcd mode
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
plt.xticks([])
plt.yticks([])
ax.set_ylim([-30, 10])
data = np.ones(100)
data[70:] -= np.arange(30)
plt.annotate(
'THE DAY I REALIZED\nI COULD COOK BACON\nWHENEVER I WANTED',
xy=(70, 1), arrowprops=dict(arrowstyle='->'), xytext=(15, -10))
plt.plot(data)
plt.xlabel('time')
plt.ylabel('my overall health')
Pie chart:
#this takes us out of xkcd mode and back to the default style
plt.rcdefaults()
values = [12, 55, 4, 32, 14]
colors = ['r', 'g', 'b', 'c', 'm']
explode = [0, 0, 0.2, 0, 0] #explodes the Russia segment of the pie by 20%
labels = ['India', 'United States', 'Russia', 'China', 'Europe']
plt.pie(values, colors= colors, labels=labels, explode = explode)
plt.title('Student Locations')
plt.show()
Bar chart:
values = [12, 55, 4, 32, 14]
colors = ['r', 'g', 'b', 'c', 'm'] #c is cyan and m is magenta
plt.bar(range(0,5), values, color= colors) #x values are 0 through 4; the heights come from the values list
plt.show()
Scatter plot:
from numpy.random import randn
X = randn(500)
Y = randn(500)
plt.scatter(X,Y)
plt.show()
Box and Whisker plot:
The box represents the interquartile range (IQR), where 50% of the data resides, ie 25% on each half of the box.
We define outliers in a box and whisker plot as anything beyond 1.5 x IQR from the box.
uniformSkewed = np.random.rand(100) * 100 - 40 # create uniform data between -40 and 60
#now we create some outliers, both high and low
high_outliers = np.random.rand(10) * 50 + 100
low_outliers = np.random.rand(10) * -50 - 100
#now we add the outliers to the data set, ie we concatenate
data = np.concatenate((uniformSkewed, high_outliers, low_outliers))
#we plot the graph
plt.boxplot(data)
plt.show()
Covariance and correlation:
Say we have 2 attributes and want to know whether or not they are related.
Covariance:
Measures how 2 variables are related to each other, ie how 2 variables vary in tandem from their means
Measuring covariance mathematically:
- Think of the data sets for the 2 variables as high dimensional vectors
- Convert these to vectors of deviations from the mean
- Take the dot product of the 2 vectors (related to the cosine of the angle between them)
- Divide by the sample size
Interpreting covariance is hard:
We know a covariance close to 0 means there is not much correlation between the 2 variables, and a large covariance means there is correlation. But how large is large? We can't say in absolute terms, and that's where correlation comes in.
Correlation:
Just divide the covariance by the standard deviations of both variables, and that normalizes things:
- correlation of 0: no correlation
- correlation of 1: perfect correlation
- correlation of -1: perfect inverse correlation (ie if one value increases the other decreases and vice versa)
Covariance and correlation in python:
Covariance the hard way:
Let's find the covariance ourselves from first principles.
import numpy as np
from pylab import *
def de_mean(x):
    xmean = mean(x)
    return [xi - xmean for xi in x]
def covariance(x, y):
    n = len(x)
    return dot(de_mean(x), de_mean(y)) / (n-1)
pageSpeeds = np.random.normal(3.0, 1.0, 1000)
purchaseAmount = np.random.normal(50.0, 10.0, 1000)
scatter(pageSpeeds, purchaseAmount)
covariance(pageSpeeds, purchaseAmount)
Correlation the hard way:
Similarly, let's find the correlation from first principles.
def correlation(x, y):
    stddevx = x.std()
    stddevy = y.std()
    return covariance(x,y) / stddevx / stddevy #we divide by both standard deviations to normalize
correlation(pageSpeeds, purchaseAmount)
Easy way (using numpy):
np.corrcoef(pageSpeeds, purchaseAmount)
""" It returns a matrix of the correlation coefficients between every
combination of the arrays passed in"""
# similarly we have np.cov func to compute the covariance directly
Note: Correlation doesn't imply causality! If two variables x and y have perfect correlation (ie 1), it doesn't mean that x causes y or y causes x.
Conditional probability:
- Probability of something happening, given that something else has happened
- p(B given A) means the probability of B happening given that A has occurred
- p(B given A) = p(A,B)/p(A)
- p(A,B) is the probability of both A and B occurring together (the joint probability)
If the probability of E given F, ie P(E given F), is darn close to P(E), then E and F are not related to each other, ie they are independent variables.
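A small simulation sketch of the formula (hypothetical events: A = first of two coin flips is heads, B = both flips are heads, so P(B given A) should come out near 0.5):
import numpy as np
flips = np.random.randint(0, 2, (100000, 2)) # 100,000 trials of two fair coin flips
A = flips[:, 0] == 1 # event A: first flip is heads
B = (flips[:, 0] == 1) & (flips[:, 1] == 1) # event B: both flips are heads
pA = A.mean() # P(A), about 0.5
pAB = (A & B).mean() # P(A,B), about 0.25
print(pAB / pA) # P(B given A) = P(A,B)/P(A), about 0.5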
Bayes theorem:
Based on conditional probability: P(A given B) = P(A) * P(B given A) / P(B)
Example: drug testing. Even if P(B given A) is high, it does not mean P(A given B) is high.
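A quick numeric sketch with made-up rates (assume 1% of people use a drug, the test catches users 99% of the time, and it false-positives on 1% of non-users; A = user, B = positive test):
p_a = 0.01 # P(A): fraction of drug users
p_b_given_a = 0.99 # P(B given A): test sensitivity
p_b = p_b_given_a * p_a + 0.01 * (1 - p_a) # P(B): overall rate of positive tests
print(p_a * p_b_given_a / p_b) # P(A given B) = 0.5, so a positive test is only a coin flip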