PROBABILITY & STATISTICS IN PYTHON
from Learning Python for Data Analysis and Visualization by Jose Portilla https://www.udemy.com/learning-python-for-data-analysis-and-visualization/
Notes by Michael Brothers
Companion to the file Python for Data Analysis.
Table of Contents
Discrete Uniform Distributions
    Setting up a Discrete Uniform Distribution using Scipy
Continuous Uniform Distributions
    Setting up a Continuous Uniform Distribution using Scipy
Binomial Distribution
    Setting up a Binomial Distribution using Scipy
Poisson Distribution
    Setting up a Poisson Distribution using Scipy
Normal Distribution
    Setting up a Normal Distribution using Scipy
Sampling
T-Distributions
Hypothesis Testing
Chi-Square Distribution and the Chi-Square Test
Bayes' Theorem
Beta Distribution
APPENDIX I: Mathematical Symbols & Conventions
Standard Imports:
from __future__ import division   # only if using Python 2; must be the first statement
import numpy as np
from numpy.random import randn
import pandas as pd
from scipy import stats
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Discrete Uniform Distributions
A random variable X has a discrete uniform distribution if each of the n values in its range x1, x2, x3, ..., xn has equal probability. The probability mass function f(xi) and the variance (for n consecutive integer values) are:
$$f(x_i) = \frac{1}{n} \qquad \sigma^2 = \frac{n^2 - 1}{12}$$
Setting up a Discrete Uniform Distribution using Scipy on the roll of a fair die:
from scipy.stats import randint
roll_options = [1,2,3,4,5,6]
low, high = 1, 7                       # high is exclusive, so go to 7 to include 6
mean, var = randint.stats(low, high)   # .stats can also return skew & kurtosis
print('The mean is %2.1f, and the variance is %2.2f' %(mean, var))
The mean is 3.5, and the variance is 2.92
Check by hand: μ = (6+1)/2 = 3.5 and σ² = ((6-1+1)² - 1)/12 = 2.92
Make a PMF bar plot using matplotlib:
plt.bar(roll_options, randint.pmf(roll_options, low, high));
[Figure: simple bar plot (blue) with wide, easily read bars. roll_options sets the x-values (1,2,3,4,5,6); every bar has height 0.167.]
For more info: http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.randint.html
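The mean and variance quoted above can be cross-checked without scipy. The sketch below is only an illustration under stated assumptions: the helper names `discrete_uniform_stats` and `brute_force_stats` are made up for this note, the values are assumed to be consecutive integers a..b, and the syntax is Python 3.

```python
from fractions import Fraction

def discrete_uniform_stats(a, b):
    """Closed-form mean and variance for the integers a..b (hypothetical helper)."""
    n = b - a + 1                      # number of equally likely values
    mean = Fraction(a + b, 2)
    var = Fraction(n * n - 1, 12)      # (n^2 - 1) / 12
    return mean, var

def brute_force_stats(a, b):
    """Same quantities computed directly from the definition."""
    values = list(range(a, b + 1))
    n = len(values)
    mean = Fraction(sum(values), n)
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

die_mean, die_var = discrete_uniform_stats(1, 6)
print(float(die_mean), round(float(die_var), 2))   # 3.5 2.92
```

Using exact fractions keeps the comparison between the closed form and the brute-force definition free of floating-point noise.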
Continuous Uniform Distributions
For a continuous distribution, probability is calculated as the area under the probability density function; in the case of uniformity, that is the area under a horizontal straight line. For a uniform distribution on the interval [a, b], the probability density function f(x) and variance σ² are as follows:
$$f(x) = \frac{1}{b - a} \qquad \sigma^2 = \frac{(b - a)^2}{12}$$
Setting up a Continuous Uniform Distribution using Scipy
from scipy.stats import uniform
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Let's set an A and B
A=19
B=27
# Set x as 1000 linearly spaced points about A and B
x = np.linspace(A-5, B+5, 1000)
# Note: linspace can't be used to plot perfectly vertical lines
# Use uniform(loc=start point, scale=endpoint - start point)
rv = uniform(loc=A, scale=B-A)
# Plot the PDF of that uniform distribution
plt.plot(x, rv.pdf(x));
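Because the density is a horizontal line, any interval probability is just the area of a rectangle. The following is a minimal sketch in plain Python 3 using the same A=19 and B=27; the function names and the interval 20 to 23 are illustrative assumptions, not part of the course material.

```python
def uniform_pdf(x, a, b):
    """Density of a uniform distribution on [a, b]: 1/(b-a) inside, 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_interval_prob(lo, hi, a, b):
    """P(lo <= X <= hi): width of the overlap with [a, b] times the height 1/(b-a)."""
    lo, hi = max(lo, a), min(hi, b)    # clip the interval to the support
    return max(hi - lo, 0.0) / (b - a)

A, B = 19, 27
prob_20_to_23 = uniform_interval_prob(20, 23, A, B)   # 3 / 8
variance = (B - A) ** 2 / 12                          # (b - a)^2 / 12
print(prob_20_to_23, round(variance, 3))              # 0.375 5.333
```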
Binomial Distribution
A Bernoulli trial is a random experiment in which there are only two possible outcomes: success and failure. A series of n trials follows a binomial distribution so long as the probability of success p remains constant and the trials are independent of one another. The probability mass function of a binomial distribution is:
$$P(X = k) = C(n, k)\, p^k (1 - p)^{n-k} \quad \text{where} \quad C(n, k) = \frac{n!}{k!\,(n - k)!}$$
Setting up a Binomial Distribution using Scipy
For this example, a basketball player takes an average of 11 shots per game, with a 72% probability of success.
n, p = 11, .72
from scipy.stats import binom
mean, var = binom.stats(n, p)   # .stats can also return skew & kurtosis
print('Mean = %1.2f StdDev = %1.2f' %(mean, var**0.5))
Mean = 7.92 StdDev = 1.49
In practice we might round these to whole numbers.
Plotting the probability mass function using matplotlib:
x = range(n+1)
Y = binom.pmf(x,n,p)
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(x,Y,'o');
So what's the probability of making exactly 8 shots in a game?
binom.pmf(8,11,.72) returns 0.261588 (26%)
NOTE: Y is a 12-item numpy array. Y.sum() returns 1.0000000000000007 (that is, 1 up to floating-point rounding).
Just for fun, let's plot two binomial distributions together. A second player averages 15 shots per game with a 48% probability of success.
n1, p1 = 11, .72
n2, p2 = 15, .48
x1 = range(n1+1)
Y = binom.pmf(x1,n1,p1)
x2 = range(n2+1)
Z = binom.pmf(x2,n2,p2)
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(x1,Y,'o')
plt.plot(x2,Z,'x',color='r');
To see the first player's skew and kurtosis values:
mean, var, skew, kurt = binom.stats(n1, p1, moments='mvsk')
print('skew = %s and kurtosis = %s' %(skew, kurt))
skew = -0.295468420143 and kurtosis = -0.0945165945166
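As a cross-check on the scipy results in this section, the probability mass function can be evaluated straight from the formula. This is a sketch, not the course's code: `binom_pmf` is a hypothetical helper, `math.comb` assumes Python 3.8 or later, and the skew and excess-kurtosis lines use the standard closed forms for a binomial distribution.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 11, 0.72
q = 1 - p

pmf_8 = binom_pmf(8, n, p)                               # exactly 8 made shots
total = sum(binom_pmf(k, n, p) for k in range(n + 1))    # should be 1 up to rounding

mean = n * p                                             # 7.92
std = math.sqrt(n * p * q)                               # about 1.49
skew = (1 - 2 * p) / math.sqrt(n * p * q)                # about -0.2955
excess_kurtosis = (1 - 6 * p * q) / (n * p * q)          # about -0.0945

print('P(X=8) = %.6f' % pmf_8)
print('mean = %.2f, std = %.2f' % (mean, std))
print('skew = %.6f, kurtosis = %.6f' % (skew, excess_kurtosis))
```

The kurtosis reported by `binom.stats` is excess kurtosis (zero for a normal distribution), which is why the small negative value here matches.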
For more info: 1.) http://en.wikipedia.org/wiki/Binomial_distribution 2.) http://stattrek.com/probability-distributions/binomial.aspx 3.) http://mathworld.wolfram.com/BinomialDistribution.html
Poisson Distribution
Considers the number of discrete events or occurrences over a specified interval or continuum (e.g. time, distance, etc.). The Poisson distribution takes just one argument, the mean λ, and assigns a probability to every count k from zero to infinity.
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
where e = Euler's number = 2.718…
Setting up a Poisson Distribution using Scipy
On average, 10 customers visit a restaurant during lunchtime rush. What is the probability of seeing exactly 7 customers?
from scipy.stats import poisson
mu = 10                         # our known mean for the interval ("lambda" is a reserved word in Python, so don't use it as a variable name)
x = 7                           # our target for the probability mass function
mean, var = poisson.stats(mu)   # returns the mean (mu) and variance (if we want them)
odds = poisson.pmf(x,mu)        # evaluates the probability mass function at x
print('There is a %2.2f%% chance that exactly %d customers show up at the lunch rush' %(100*odds, x))
There is a 9.01% chance that exactly 7 customers show up at the lunch rush
Note: poisson.pmf(0,10) returns 0.0000454 (about 4.5 thousandths of a percent chance that no people show up)
Plotting the Poisson Distribution
k = range(30)
Y = poisson.pmf(k,mu)
import matplotlib.pyplot as plt
%matplotlib inline
plt.bar(k,Y);
Applying a Cumulative Distribution Function
So what is the probability that more than 10 customers arrive? We need to sum up the value of every bar past the 10 customers bar. We can do this by using a Cumulative Distribution Function (CDF). This describes the probability that a random variable X with a given probability distribution (such as the Poisson in this case) will be found to have a value less than or equal to x. If we use the CDF to calculate the probability of 10 or fewer customers showing up, we can subtract that probability from the total probability space, which is 1 (the sum of all probabilities).
x = poisson.cdf(10,10)
print('The probability that more than 10 customers show up is %2.2f%%.' %(100*(1-x)))
The probability that more than 10 customers show up is 41.70%.
For more info: 1.)http://en.wikipedia.org/wiki/Poisson_distribution#Definition 2.)http://stattrek.com/probability-distributions/poisson.aspx 3.)http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.poisson.html
Normal Distribution
The distribution is defined by the probability density function equation:
$$f(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2} z^2} \quad \text{where} \quad z = \frac{x - \mu}{\sigma}$$
The area under the curve = 1.
Setting up a Normal Distribution using Scipy
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
mu = 0
std = 1
X = np.arange(-4,4,.01)         # grab 800 datapoints
Y = stats.norm.pdf(X,mu,std)
plt.plot(X,Y);
Alternatively, you can pull a "random, normal" set of datapoints from numpy:
import numpy as np
mu, sigma = 0, 0.1
norm_set = np.random.normal(mu,sigma,1000)
And plot them using matplotlib:
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(norm_set, bins=50)
To be clear, Y is a set of y-values that fit a curve, and norm_set is a set of random x-values:
Y[300] returned 0.24197 and norm_set[300] returned 0.24952
Y[301] returned 0.24439 and norm_set[301] returned -0.08108
For more info:
1.) http://en.wikipedia.org/wiki/Normal_distribution
2.) http://mathworld.wolfram.com/NormalDistribution.html
3.) http://stattrek.com/probability-distributions/normal.aspx
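The density value quoted for Y[300] (where X is -1.0) and the claim that the area under the curve equals 1 can both be checked with the standard library alone. This sketch assumes Python 3.8 or later for `statistics.NormalDist`; the Riemann-sum step size of 0.01 and the range of -8 to 8 are arbitrary illustrative choices.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Height of the curve at x = -1.0, which is the point X[300] in the notes
density_at_minus_1 = std_normal.pdf(-1.0)

# Area under the curve, approximated by summing thin rectangles from -8 to 8
step = 0.01
total_area = sum(std_normal.pdf(-8 + i * step) * step for i in range(1600))

# Probability of landing within one standard deviation of the mean
within_one_sigma = std_normal.cdf(1.0) - std_normal.cdf(-1.0)

print('pdf(-1) = %.5f' % density_at_minus_1)
print('area    = %.5f' % total_area)
print('P(|z| <= 1) = %.4f' % within_one_sigma)
```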
Sampling
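When taking a sample of size n from a population, the sample mean is an unbiased estimate of the population mean: its expected value equals the population mean. When taking repeated samples, the sampling distribution of the sample mean will be approximately normal about the population mean, provided n is large enough. The standard deviation of the sample mean, also called the standard error, is calculated as
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$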
This means that to achieve half the measurement error, the sample size must be quadrupled. For more info: https://en.wikipedia.org/wiki/Sampling_distribution
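A small simulation makes the square-root rule concrete. The sketch below is an illustration under assumed values (a normal population with mean 50 and standard deviation 10, 2000 repeated samples, a fixed seed); none of these numbers come from the course, and `sample_mean_spread` is a hypothetical helper.

```python
import random
import statistics

def sample_mean_spread(n, mu=50.0, sigma=10.0, repeats=2000, seed=0):
    """Draw `repeats` samples of size n and return the std dev of their means."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    means = []
    for _ in range(repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(sum(sample) / n)
    return statistics.pstdev(means)

se_25 = sample_mean_spread(25)     # theory: 10 / sqrt(25)  = 2.0
se_100 = sample_mean_spread(100)   # theory: 10 / sqrt(100) = 1.0

print('n=25:  observed spread of sample means = %.2f' % se_25)
print('n=100: observed spread of sample means = %.2f' % se_100)
print('ratio = %.2f' % (se_25 / se_100))
```

The observed values should land close to 2.0 and 1.0, so quadrupling the sample size roughly halves the spread of the sample mean.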
T-Distributions
For previous distributions the sample size was assumed large (n>30). The Student's t-distribution allows for use of small samples, but does so by sacrificing certainty with a margin-of-error trade-off. The t-distribution takes into account the sample size using n-1 degrees of freedom, which means there is a different t-distribution for every different sample size. If we plot the t-distribution against a normal distribution, we see the tails get heavier as the peak gets 'squished' down. It's important to note that as n gets larger, the t-distribution converges to a normal distribution.
Setting up a T-Distribution using Scipy
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy import stats
from scipy.stats import t
# Create x range
x = np.linspace(-5,5,100)
# Create the t distribution with scipy
rv = t(3)                        # this creates a "frozen" RV object with a shape parameter (df=3)
norm1 = stats.norm.pdf(x,0,1)
# Plot the PDF versus the x range (I added a normal distribution for comparison)
plt.plot(x, norm1, color='r', label='normal dist')
plt.plot(x, rv.pdf(x), label='t distribution')
plt.legend(loc='upper right');
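The "squished peak, heavier tails" behaviour and the convergence to the normal curve can be checked numerically. The sketch below evaluates the standard closed-form t density with the standard library instead of scipy; `t_pdf` and `normal_pdf` are hypothetical helpers, and `math.lgamma` is used only to avoid overflow at large degrees of freedom.

```python
import math

def t_pdf(x, df):
    """Student's t density with df degrees of freedom (closed form via log-gamma)."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef) * (1 + x * x / df) ** (-(df + 1) / 2)

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

for df in (3, 30, 1000):
    print('df=%4d  peak=%.4f  tail at x=3: %.5f' % (df, t_pdf(0, df), t_pdf(3, df)))
print('normal   peak=%.4f  tail at x=3: %.5f' % (normal_pdf(0), normal_pdf(3)))
```

With df=3 the peak is lower and the density at x=3 is several times larger than the normal's; by df=1000 the two curves are nearly indistinguishable.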
For more info: 1.) http://en.wikipedia.org/wiki/Student%27s_t-distribution 2.) http://mathworld.wolfram.com/Studentst-Distribution.html 3.) http://stattrek.com/probability-distributions/t-distribution.aspx 4.) http://docs.scipy.org/doc/scipy-0.17.0/reference/tutorial/stats.html
5.) http://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.t.html
Trivia: the t-distribution was devised by William Sealy Gosset under the pseudonym Student
Hypothesis Testing Refer to the jupyter notebook for exercises on testing a null hypothesis, choosing between t- and z- scores, and avoiding Type I and Type II errors. For more info: 1.) Wolfram 2.) Stat Trek 3.) Wikipedia 4.) Probability Book Sample
Chi-Square Distribution and the Chi-Square Test (also written as χ² test)
Refer to the jupyter notebook for an exercise on testing the fairness of a pair of dice after observing 500 rolls.
from scipy import stats
chisq, p = stats.chisquare(observed, expected)   # observed and expected counts come from the notebook exercise
print('The chi-squared test statistic is %.2f' %chisq)
print('The p-value for the test is %.2f' %p)
The chi-squared test statistic is 9.89
The p-value for the test is 0.45
With such a high p-value, we have no reason to doubt the fairness of the dice.
For more info: 1. Wikipedia 2. Stat trek 3. Khan Academy 4. http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.chisquare.html
Bayes' Theorem
Assesses the likelihood of A given B when you know the overall likelihood of A, of B, and the likelihood of B given A.
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$
When you don't know the overall likelihood of B, use
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B|A)\,P(A) + P(B|\text{not } A)\,P(\text{not } A)}$$
Refer to the jupyter notebook for an exercise involving a test for breast cancer. For more info: 1.) Simple Explanation using Legos 2.) Wikipedia 3.) Stat Trek
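The second form of the theorem is the one that applies to a screening-test question, because the overall rate of positive results is usually not given. The numbers below are hypothetical and chosen only for illustration (1% prevalence, 90% sensitivity, 8% false-positive rate); they are not the figures from the notebook, and `bayes_posterior` is a made-up helper name.

```python
def bayes_posterior(p_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) using the expanded denominator P(B|A)P(A) + P(B|not A)P(not A)."""
    p_not_a = 1 - p_a
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a   # total probability of B
    return p_b_given_a * p_a / p_b

# Hypothetical screening test: A = has the condition, B = test is positive
prevalence = 0.01            # P(A), assumed
sensitivity = 0.90           # P(B|A), assumed
false_positive_rate = 0.08   # P(B|not A), assumed

posterior = bayes_posterior(prevalence, sensitivity, false_positive_rate)
print('P(condition | positive test) = %.1f%%' % (100 * posterior))
```

Under these assumed inputs the posterior is only about 10%, because the many false positives among healthy people outnumber the true positives.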
Beta Distribution
The beta distribution is a family of continuous probability distributions defined on the interval [0, 1], parametrized by two positive shape parameters, denoted by α and β, that appear as exponents of the random variable and control the shape of the distribution.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy import stats
a = 0.5
b = 0.5
x = np.arange(0.01, 1, 0.01)
y = stats.beta.pdf(x, a, b)
plt.plot(x, y)
plt.title('Beta: a=%.1f, b=%.1f' %(a,b))
plt.xlabel('x')
plt.ylabel('Probability density')
For more info: https://en.wikipedia.org/wiki/Beta_distribution
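To see how α and β "appear as exponents", the density can be written out directly. This sketch uses the standard closed form x^(α-1) (1-x)^(β-1) / B(α, β) with only the standard library; `beta_pdf` is a hypothetical helper and the sample points are arbitrary.

```python
import math

def beta_pdf(x, a, b):
    """Beta density: x^(a-1) * (1-x)^(b-1) / B(a, b), for 0 < x < 1."""
    beta_fn = math.gamma(a) * math.gamma(b) / math.gamma(a + b)   # normalizing constant B(a, b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_fn

a, b = 0.5, 0.5
for x in (0.01, 0.25, 0.5, 0.75, 0.99):
    print('x=%.2f  density=%.4f' % (x, beta_pdf(x, a, b)))
```

With a = b = 0.5 both exponents are negative, so the density rises steeply toward 0 and 1 and dips to 2/π at x = 0.5, which is the U shape the plot above shows. With a = b = 1 the same function reduces to the flat uniform density.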
APPENDIX I: Mathematical Symbols & Conventions:
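Greek alphabet
Α α alpha
Β β beta
Γ γ gamma
Δ δ delta
Ε ε epsilon
Ζ ζ zeta
Η η eta
Θ θ theta
Ι ι iota
Κ κ kappa
Λ λ lambda
Μ μ mu
Ν ν nu
Ξ ξ xi (ksi)
Ο ο omicron
Π π pi
Ρ ρ rho
Σ σ sigma
Τ τ tau
Υ υ upsilon
Φ φ phi
Χ χ chi
Ψ ψ psi
Ω ω omega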
Number   Prefix   Symbol
10^1     deka-    da
10^2     hecto-   h
10^3     kilo-    k
10^6     mega-    M
10^9     giga-    G
10^12    tera-    T
10^15    peta-    P
10^18    exa-     E
10^21    zetta-   Z
10^24    yotta-   Y

Number   Prefix   Symbol
10^-1    deci-    d
10^-2    centi-   c
10^-3    milli-   m
10^-6    micro-   µ
10^-9    nano-    n
10^-12   pico-    p
10^-15   femto-   f
10^-18   atto-    a
10^-21   zepto-   z
10^-24   yocto-   y
P(A|B)   probability of A given B
A ∈ B    A is an element of B
c ⊂ C    c is a subset of C
A ∩ B    the intersection of A and B
A ∪ B    the union of A and B
A ∧ B    logical A and B
A ∨ B    logical A or B