Python for Data Analytics
Lectures 1 & 2: The Python Language and Environment
Rodrigo Belo
Spring 2015
1
-
Introduction
2
-
Instructor
Rodrigo Belo
Researcher at Carnegie Mellon University and at Católica-Lisbon, Portugal
PhD in Technological Change and Entrepreneurship from Carnegie Mellon University
Research Interests: Social Networks and Technology in Educational Settings
Background: Undergraduate degree in Computer Science and Engineering, 5 years as Software Engineer
Email: [email protected]
3
-
Course Description
This course introduces Python as a tool to collect, process and analyze large data sets from a variety of sources to create information that guides business decision making
4
-
Course Description
Students will become familiar with Python as a language and as a platform to integrate different technologies and techniques for data analytics, including:
Collection of online information;
Tools and strategies for data storage; and
Data analysis methods.
5
-
Course Description
Each class will start with the introduction of a concept or tool and end with in-class hands-on exercises using example datasets.
Throughout the course students will apply these techniques to do their Homework and their Term Project.
6
-
Learning Objectives
Upon completion of this course, the student will be able to:
1 Use Python as a general-purpose programming language
2 Collect data available online in an automated fashion
3 Process and store data in the appropriate format for future analysis
4 Apply data analytics tools to extract relevant information
7
-
Source Materials
Textbooks:
1 Main: McKinney (2012), Python for Data Analysis, O'Reilly
2 Other: Russell (2011), Mining the Social Web, O'Reilly
Online references:
1 Python 2 Documentation: https://docs.python.org/2/
2 pandas online reference: http://pandas.pydata.org/pandas-docs/stable/
3 ggplot online reference: http://ggplot.yhathq.com
8
-
Grading
Individual Assignments: 40%
Assignments will be done by individual students and posted on Blackboard. Specific assignments will appear approx. 1 week prior to due date.
Term Project: 30%
The term project will be done in 2 or 3 person teams and will involve the application of the methods mentioned in the class.
Students will identify a question they would like to answer using publicly available data, gather the data from an online source, store it and analyze it using some of the methods shown in class.
Final Exam: 30%
May 6, 6pm
9
-
Late Work
If a work is delivered t seconds late, its score is adjusted by multiplying it by
$1 - \left( \frac{t}{24 \cdot 5 \cdot 60 \cdot 60} \right)^4$
[Figure: maximum grade (0 to 100) as a function of the number of days late (0 to 5)]
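The late-work rule can be tried out numerically. The sketch below reads the multiplier as 1 - (t / (24 * 5 * 60 * 60))^4, which matches a plot that falls from 100 to 0 over five days. The function name and the clamping at zero for work more than five days late are assumptions added for illustration, not part of the slide.

```python
SECONDS_IN_FIVE_DAYS = 24 * 5 * 60 * 60  # denominator of the formula


def late_multiplier(t):
    """Return the score multiplier for work delivered t seconds late."""
    fraction = float(t) / SECONDS_IN_FIVE_DAYS
    # Assumption: never go below zero once five days have passed
    return max(0.0, 1.0 - fraction ** 4)


one_day = 24 * 60 * 60
print(late_multiplier(0))            # delivered on time
print(late_multiplier(one_day))      # one day late
print(late_multiplier(5 * one_day))  # five days late
```

Because of the fourth power, the penalty is very small on the first day and grows quickly towards the fifth.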
10
-
Basic Concepts and Environment
11
-
Why Python?
Python is one of the most popular dynamic languages, along with Ruby, Perl, R, and others
Python has a large and active scientific computing community
Adoption of Python has increased significantly since the 2000s both in industry and in the academic community
Python started as a general-purpose programming language, but data manipulation libraries make it a first-class citizen in data manipulation and analysis
Excellent choice as a single language for building data-centric applications
12
-
Python as Glue
Python integrates easily with C, C++, and FORTRAN, languages in which many routines are implemented
Most programs consist of small portions of code where most of the time is spent, and large portions of glue code that doesn't run often
In many cases the execution time of glue code is irrelevant
Python can be used both as a prototyping language and as a production language
13
-
Python Essentials
Some of the essential Python libraries and tools:
NumPy
SciPy
pandas
ggplot
IPython
14
-
Python Essentials: NumPy
NumPy (Numerical Python) is the foundational package for scientific computing in Python. It provides, among other things:
A fast and efficient multidimensional array object: ndarray
Functions for performing element-wise computations with arrays or mathematical operations between arrays
Linear algebra operations, Fourier transform, and random number generation
Tools for connecting C, C++, and Fortran code to Python
15
-
Python Essentials: SciPy
SciPy is a collection of packages addressing a number of different standard problem domains in scientific computing:
scipy.integrate: numerical integration routines and differential equation solvers
scipy.linalg: linear algebra routines and matrix decompositions extending beyond those provided in numpy.linalg
scipy.optimize: function optimizers (minimizers) and root finding algorithms
scipy.signal: signal processing tools
scipy.sparse: sparse matrices and sparse linear system solvers
scipy.stats: standard continuous and discrete probability distributions (density functions, samplers, continuous distribution functions), various statistical tests, and more descriptive statistics
16
-
Python Essentials: pandas
pandas provides data structures and functions designed to make working with structured data fast, easy and expressive
DataFrame is the primary object of this library
a two-dimensional object that resembles a table with rows and columns
    meat[:5]

            date  beef  veal  pork  lamb_and_mutton  broilers  other_chicken  turkey
    0 1944-01-01   751    85  1280               89       NaN            NaN     NaN
    1 1944-02-01   713    77  1169               72       NaN            NaN     NaN
    2 1944-03-01   741    90  1128               75       NaN            NaN     NaN
    3 1944-04-01   650    89   978               66       NaN            NaN     NaN
    4 1944-05-01   681   106  1029               78       NaN            NaN     NaN
17
-
Python Essentials: ggplot
ggplot is a graphics library that makes it very easy to create graphics
    from ggplot import *
    ggplot(aes(x='date', y='beef'), data=meat) + \
        geom_line() + \
        stat_smooth(colour='blue', span=0.2)
[Figure: line plot of beef (0 to 3000) against date (1945 to 2005) with a smoothed trend line]
18
-
Python Essentials: IPython
IPython is the component that ties everything together. Aside from the standard terminal shell, IPython provides:
IPython notebook: HTML notebook for connecting to IPython through a web browser
GUI console with inline plotting, multiline editing and syntax highlighting
Infrastructure for interactive parallel and distributed computing
19
-
Installation and Setup
Mac OS X and Linux distributions come with a Python distribution, but not necessarily with all the required libraries
New users can install Anaconda (http://continuum.io/downloads) or Canopy (https://store.enthought.com/downloads/)
To install IPython (and Python) follow the instructions on http://ipython.org/install.html
You will need IPython notebook
20
-
Python 2 and Python 3
The Python community is currently undergoing a transition from the Python 2 series of interpreters to the Python 3 series
Until the appearance of Python 3.0, all Python code was backwards compatible
The community decided that in order to move the language forward, certain backwards incompatible changes were necessary
21
-
Python 2 and Python 3
Python 3.x is a cleaned up version of Python 2.x
Many inconsistencies were removed in the new version
    2.x: print "The answer is", 2*2
    3.x: print("The answer is", 2*2)
More details at http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/tutorials/key_differences_between_python_2_and_3.ipynb
However, there is still a considerable amount of code written in Python 2.x, making it the de facto standard
In this course we will be using Python 2.x
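The print difference can be sidestepped in small scripts. The sketch below builds the whole message as one string and passes that single value to print, a form that behaves the same under 2.x and 3.x. The variable names are illustrative and not taken from the slides.

```python
answer = 2 * 2

# Format first, then print one value: a parenthesised single expression
# is valid for the 2.x print statement and for the 3.x print function.
message = "The answer is {0}".format(answer)
print(message)
```

Printing several comma-separated values inside the parentheses is where the two series diverge, because 2.x treats them as a tuple.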
22
-
Integrated Development Environments (IDEs)
There are many editors and IDEs that you can use to edit Python
PyDev (plugin for Eclipse)
Python Tools for Visual Studio
PyCharm
IPython notebook
Emacs
Vim
You can find more IDEs on https://wiki.python.org/moin/IntegratedDevelopmentEnvironments
23
-
IPython: An Interactive Computing and Development Environment
24
-
IPython Basics - Prompt
    $ ipython --pylab
    Python 2.7.6 | 64-bit | (default, Jun 4 2014, 16:42:26)
    Type "copyright", "credits" or "license" for more information.

    IPython 2.1.0 -- An enhanced Interactive Python.
    ?         -> Introduction and overview of IPython's features.
    %quickref -> Quick reference.
    help      -> Python's own help system.
    object?   -> Details about 'object', use 'object??' for extra details.
    Using matplotlib backend: MacOSX

    In [1]: 3 + 4
    Out[1]: 7

    In [2]: data = {i : randn() for i in range(8)}

    In [3]: data
    Out[3]:
    {0: 0.36680003627745555,
     1: 0.5231034512314581,
     2: 0.6300895261779402,
     3: -0.9115682057027865,
     4: -1.7244460134107902,
     5: 0.3829479256814315,
     6: 0.4718660373870812,
     7: -0.23438875074129756}

    In [4]: data[3]
    Out[4]: -0.9115682057027865
25
-
IPython Basics - Tab Completion
    In [7]: da<Tab>
    data         datetime            datetime_data
    date2num     datetime64
    datestr2num  datetime_as_string

    In [7]: data
    Out[7]:
    {0: 0.0016908926460949773,
     1: 0.39596065989527957,
     2: -0.9295711814640477,
     3: 2.1076302341719058,
     4: -0.6391315204450737,
     5: 1.7496783252859787,
     6: -0.5307855278794061,
     7: 0.38045583368270064}
26
-
IPython Basics - Introspection
Using a question mark (?) before or after a variable will display some general information about the object:
    In [3]: b?
    Type:        list
    String form: [1, 2, 3, 45]
    Length:      4
    Docstring:
    list() -> new empty list
    list(iterable) -> new list initialized from iterable's items
? can also be used before or after a function name
27
-
IPython Basics - Introspection
? has a final usage, which is for searching the IPython namespace in a manner similar to the standard UNIX or Windows command line:
    In [4]: import numpy as np

    In [5]: np.*load*?
    np.load
    np.loads
    np.loadtxt
    np.pkgload
28
-
IPython Basics - The %run Command
Any file can be run as a Python program inside the environment of your IPython session using the %run command
    # ipython_script_test.py
    def my_function(x, y, z):
        return (x + y) / z

    aa = 5

    %run ipython_script_test
    print aa
    print my_function(3.0, 4, 5)

    5
    1.4
29
-
IPython Basics - The %paste Command
The %paste command pastes code copied to the clipboard, keeping its indentation
The following code will not work if simply pasted:
    x = 5
    y = 7
    if (x > 5):
        x += 1

        y = 8

    >>> x = 5
    y = 7
    if (x > 5):
        x += 1

        y = 8
    >>> ... ... >>> >>>
    >>> y
    8
    >>> %paste
    x = 5
    y = 7
    if (x > 5):
        x += 1

        y = 8
    ## -- End pasted text --
    >>> y
    7
    >>>
30
-
IPython Basics - Interacting with the OS
IPython provides very strong integration with the operating system shell:
    Command                 Description
    output = !cmd args      Run cmd and store the stdout in output
    %alias alias_name cmd   Define an alias for a system (shell) command
    %bookmark               Utilize IPython's directory bookmarking system
    %cd directory           Change system working directory to passed directory
    %pwd                    Return the current system working directory
    %dirs                   Return a list containing the current directory stack
    %dhist                  Print the history of visited directories
    %env                    Return the system environment variables as a dict
31
-
IPython Basics - IPython GUI
Starting an IPython GUI:
    ipython qtconsole --pylab=inline
32
-
IPython Basics - IPython Notebook
Starting the IPython notebook server:
    ipython notebook --pylab=inline
33
-
Python Language
34
-
Python as a Calculator - Basic Math
Python can be used as a basic calculator
Addition and subtraction
    print 2 + 4
    print 8.1 - 5

    6
    3.1
Multiplication
    print 5 * 4
    print 3.1 * 2

    20
    6.2
Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
35
-
Python as a Calculator - Basic Math
Integer division is not the same as float division
Float division
    print 4.0 / 2.0
    print 1.0/3.1

    2.0
    0.322580645161
Integer division
    print 4 / 2
    print 1/3

    2
    0
Careful when performing integer division
Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
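To avoid surprises with integer division, one option is to be explicit about which kind of division is meant. The sketch below uses the floor-division operator // and a float conversion, both of which give the same results in Python 2.x and 3.x. The variable names are illustrative. Python 2 also offers `from __future__ import division` to make / behave as in Python 3.

```python
# Floor division: always discards the fractional part
floor_quotient = 1 // 3

# Converting one operand to float forces true division
true_quotient = float(1) / 3

print(floor_quotient)
print(true_quotient)
```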
36
-
Python as a Calculator - Basic Math
Exponentiation
    print 3.**2
    print 3**2
    print 2**0.5

    9.0
    9
    1.41421356237
Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
37
-
Advanced Mathematical Operations
Some more advanced mathematical operations require the numpy package
Square Root
    import numpy as np
    print np.sqrt(2)

    1.41421356237
Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
38
-
Exponential and logarithmic functions
Exponential
    import numpy as np
    print np.exp(1)

    2.71828182846
Logarithm
    import numpy as np
    print np.log(10)
    print np.log10(10)  # base 10

    2.30258509299
    1.0
Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
39
-
Variable Assignment
The equal sign (=) is used to assign a value to a variable
    width = 20
    height = 5 * 9
    width * height

    900
40
-
Python Language - Types
41
-
Boolean
Python has a built-in boolean type:
    print width == 20
    print width == 30

    True
    False
42
-
Strings
Strings can be enclosed in single quotes or double quotes
Single quotes
    'Hello World'

    'Hello World'

    'Isn\'t it nice to have a computer that talks to you?'

    "Isn't it nice to have a computer that talks to you?"
Double quotes
    "Hello World"

    'Hello World'

    "Isn't it nice to have a computer that talks to you?"

    "Isn't it nice to have a computer that talks to you?"
43
-
Strings
You can concatenate strings with the + sign:
    "Hello " + "World"

    'Hello World'

    aa = "Hello "
    bb = "World"
    aa + bb

    'Hello World'
44
-
Strings
Strings are immutable:
    aa = "Hello " + "World"
    print aa
    aa[5] = 'R'

    Hello World
    Traceback (most recent call last):
      File "", line 3, in <module>
        aa[5] = 'R'
    TypeError: 'str' object does not support item assignment
45
-
Strings
You can use triple quotes for strings that span multiple lines
    print """\
    Hello
    -----
    World
    """

    Hello
    -----
    World
Triple quotes are often used to provide function documentation
46
-
Strings
Strings can be indexed (subscripted), with the first character having index 0
    mystring = "Hello World"
    print mystring[0]
    print mystring[6:10]

    H
    Worl
There is no separate character type. A character is simply a string of size one
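Indexing and immutability combine in a common pattern: since a string cannot be changed in place, a modified copy is built from slices. The following is a minimal sketch of that idea, and the variable names are illustrative.

```python
mystring = "Hello World"

first_word = mystring[:5]   # characters at positions 0 to 4
last_char = mystring[-1]    # negative indices count from the end

# Build a new string instead of assigning to mystring[6]
replaced = mystring[:6] + "R" + mystring[7:]

print(first_word)
print(last_char)
print(replaced)
```

The original string is left untouched, because every slice and concatenation creates a new string object.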
47
-
Lists
Lists are a compound data type in Python
can be written as a list of comma-separated values (items) between square brackets
might contain items of different types
    squares = [1, 4, 9, 16, 25]
    squares

    [1, 4, 9, 16, 25]
48
-
Lists
Lists can be indexed like strings
    squares = [1, 4, 9, 16, 25]
    print squares[1]
    print squares[-3]
    print squares[-3:]

    4
    9
    [9, 16, 25]
Lists are mutable (unlike strings)
    letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
    print letters
    letters[2:5] = ['C', 'D', 'E']  # replace some values
    print letters
    letters[2:5] = []  # now remove them
    print letters

    ['a', 'b', 'c', 'd', 'e', 'f', 'g']
    ['a', 'b', 'C', 'D', 'E', 'f', 'g']
    ['a', 'b', 'f', 'g']
49
-
Lists
Lists can be used as stacks:
    stack = [3, 4, 5]
    stack.append(6)
    stack.append(7)
    stack

    [3, 4, 5, 6, 7]

    stack.pop()

    7

    stack

    [3, 4, 5, 6]
50
-
Tuples
A tuple is like a list, but it is written as comma-separated values without square brackets (enclosing parentheses are optional).
Tuples are immutable; you cannot change their values.
    a = 3, 4, 5, [7, 8], 'cat'
    print a[0], a[-1]
    a[-1] = 'dog'

    3 cat
    Traceback (most recent call last):
      File "", line 3, in <module>
        a[-1] = 'dog'
    TypeError: 'tuple' object does not support item assignment
Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
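Immutability applies to the tuple's own slots, not to the objects stored in them. The sketch below reuses the tuple from the slide to show that the list inside it can still grow. The flag name assignment_failed is an illustrative choice.

```python
a = 3, 4, 5, [7, 8], 'cat'

# Replacing an item of the tuple raises TypeError
try:
    a[-1] = 'dog'
    assignment_failed = False
except TypeError:
    assignment_failed = True

# The list stored at index 3 is itself mutable
a[3].append(9)

print(assignment_failed)
print(a[3])
```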
51
-
Sets
A set is an unordered collection with no duplicate elements
    basket = ['apple', 'orange', 'apple', 'pear', 'orange', 'banana']
    fruit = set(basket)  # create a set without duplicates
    fruit

    {'apple', 'banana', 'orange', 'pear'}
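Sets also support fast membership tests and the usual mathematical set operations. The sketch below extends the basket example, where the second set, citrus, is a hypothetical addition for illustration. The results are sorted before printing because sets are unordered.

```python
basket = ['apple', 'orange', 'apple', 'pear', 'orange', 'banana']
fruit = set(basket)
citrus = set(['orange', 'lemon'])  # hypothetical second set

is_member = 'orange' in fruit      # membership test
common = sorted(fruit & citrus)    # intersection
non_citrus = sorted(fruit - citrus)  # difference

print(is_member)
print(common)
print(non_citrus)
```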
52
-
Dictionaries
A dictionary can be thought of as an unordered set of key: value pairs
    phone_list = {'jack': 4123324098, 'jill': 4120294139}
    phone_list

    {'jack': 4123324098, 'jill': 4120294139}

    phone_list['rodrigo'] = 4120293473
    phone_list

    {'jack': 4123324098, 'jill': 4120294139, 'rodrigo': 4120293473}
You can access all the keys and values of a dictionary:
    print phone_list.keys()
    print phone_list.values()

    ['rodrigo', 'jill', 'jack']
    [4120293473, 4120294139, 4123324098]
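Two everyday dictionary operations are looping over the entries and looking up a key that may be absent. The sketch below shows both on the phone list. The keys are sorted first because the ordering of a dictionary should not be relied upon, and the variable names are illustrative.

```python
phone_list = {'jack': 4123324098, 'jill': 4120294139}

# Visit the entries in a predictable order
for name in sorted(phone_list):
    print('{0}: {1}'.format(name, phone_list[name]))

has_jack = 'jack' in phone_list      # key membership test
missing = phone_list.get('rodrigo')  # returns None instead of raising KeyError
```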
53
-
Python Language - Control Structures
54
-
Control Flows
if statements
    x = 42
    if x > 10:
        print x
    else:
        print 10

    42
for statements
    words = ['cat', 'window', 'defenestrate']
    for w in words:
        print w, len(w)

    cat 3
    window 6
    defenestrate 12

    a = ['Mary', 'had', 'a', 'little', 'lamb']
    for i in range(len(a)):
        print i, a[i]

    0 Mary
    1 had
    2 a
    3 little
    4 lamb
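The range(len(a)) loop is often written with the built-in enumerate, which yields the index and the item together. The sketch below is an equivalent form of the last loop. The lines list is an illustrative helper, and the output is formatted into single strings so the code also runs unchanged on Python 3.

```python
a = ['Mary', 'had', 'a', 'little', 'lamb']

lines = []
for i, word in enumerate(a):
    # i is the position, word is a[i]
    lines.append('{0} {1}'.format(i, word))

print('\n'.join(lines))
```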
55
-
Python Language - Functions
56
-
Defining Functions
You can create functions using the keyword def
    import numpy as np

    def f(x):
        return x**3 - np.log(x)

    print f(3)
    print f(5.1)

    25.9013877113
    131.02175946
Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
57
-
Defining Functions
Functions can receive more than one argument
    def func(x, y):
        "return product of x and y"
        return x * y

    print func(2, 3)

    6
Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
58
-
Functions - Optional and Keyword Arguments
You can create default values for arguments:
    def func(a, n=2):
        "compute the nth power of a"
        return a**n

    # three different ways to call the function
    print func(2)
    print func(2, 3)
    print func(2, n=4)

    4
    8
    16
Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
59
-
Functions - Optional and Keyword Arguments
Defining a function with two optional arguments
    def func(a=1, n=2):
        "compute the nth power of a"
        return a**n

    # three different ways to call the function
    print func()
    print func(2, 4)
    print func(n=4, a=2)

    1
    16
    16
Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
60
-
Functions - Optional and Keyword Arguments
A function can accept an arbitrary number of positional arguments with the *args syntax:
    def func(*args):
        total = 0  # avoid shadowing the built-in sum
        for arg in args:
            total += arg
        return total

    print func(1, 2, 3, 4)

    10
Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
61
-
Functions - Optional and Keyword Arguments
A function can accept an arbitrary number of keyword arguments with the **kwargs syntax:
    def func(**kwargs):
        for kw in kwargs:
            print '{0} = {1}'.format(kw, kwargs[kw])

    func(t1=6, color='blue')

    color = blue
    t1 = 6
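The three kinds of parameters can be combined in one signature, in the order: named parameters, then *args, then **kwargs. The sketch below uses a hypothetical function, describe, to show where each argument of a call ends up. Keyword names are sorted so the result does not depend on dictionary ordering.

```python
def describe(first, *args, **kwargs):
    """Return a string summarising how the arguments were received."""
    parts = ['first={0}'.format(first), 'args={0}'.format(args)]
    for kw in sorted(kwargs):
        parts.append('{0}={1}'.format(kw, kwargs[kw]))
    return ', '.join(parts)


summary = describe(1, 2, 3, color='blue', t1=6)
print(summary)
```

Here 1 binds to first, the remaining positional values are collected into the tuple args, and the keyword arguments are collected into the dictionary kwargs.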
62
-
Lambda Functions
You can define "lambda" functions, which are also known as inline or anonymous functions.
The syntax is lambda var: f(var)
    print map(lambda x: x**2, [0, 1, 2])

    [0, 1, 4]
Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
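Lambda functions are most often passed as short arguments to other functions. The sketch below uses one as the sort key for the built-in sorted, and repeats the map call from the slide wrapped in list() so that it also returns a list on Python 3. The word list is illustrative.

```python
words = ['cat', 'window', 'defenestrate', 'a']

# Sort by word length instead of alphabetically
by_length = sorted(words, key=lambda w: len(w))

# list() is a no-op on Python 2, where map already returns a list
squares = list(map(lambda x: x**2, [0, 1, 2]))

print(by_length)
print(squares)
```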
63
-
Nested Functions
You can nest functions inside of functions
    def wrapper(x):
        a = 4
        def func(x, a):
            return a * x

        return func(x, a)

    print wrapper(5)

    20
Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
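A nested function can also read variables of the enclosing function directly and be returned as a value. The sketch below is a variation on the wrapper example under that assumption. The name make_multiplier is hypothetical.

```python
def make_multiplier(a):
    """Return a function that multiplies its argument by a."""
    def func(x):
        # a is looked up in the enclosing function's scope
        return a * x
    return func


times4 = make_multiplier(4)
print(times4(5))
```

Each call to make_multiplier produces an independent function that remembers its own value of a.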
64
-
Functional Programming Tools
filter
    def f(x):
        return x % 3 == 0 or x % 5 == 0

    filter(f, range(2, 25))

    [3, 5, 6, 9, 10, 12, 15, 18, 20, 21, 24]
map
    def cube(x): return x*x*x

    map(cube, range(10))

    [0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
reduce
    def add(x, y): return x+y

    reduce(add, range(10))

    45
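These three tools behave slightly differently in Python 3, where filter and map return iterators and reduce lives in the functools module. The sketch below is one way to write the same examples so that they give lists under both 2.x and 3.x. The result variable names are illustrative.

```python
from functools import reduce  # also available in Python 2.6 and later


def f(x):
    return x % 3 == 0 or x % 5 == 0


def cube(x):
    return x * x * x


def add(x, y):
    return x + y


# list() turns the Python 3 iterators into lists
multiples = list(filter(f, range(2, 25)))
cubes = list(map(cube, range(10)))
total = reduce(add, range(10))

print(multiples)
print(cubes)
print(total)
```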
65
-
List Comprehensions
List comprehensions provide a shortcut to create lists from existing structures:
    squares = [x**2 for x in range(10)]

    print squares

    [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
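A comprehension can also carry a condition, which lets it stand in for a filter or map call. The sketch below rewrites two of the functional-programming examples in that style and adds a dictionary comprehension, the form used earlier at the IPython prompt. The variable names are illustrative.

```python
# Equivalent to filtering range(2, 25) with a predicate
multiples = [x for x in range(2, 25) if x % 3 == 0 or x % 5 == 0]

# Equivalent to mapping a cube function over range(10)
cubes = [x**3 for x in range(10)]

# Dictionary comprehension (Python 2.7 and later)
square_of = {x: x**2 for x in range(5)}

print(multiples)
print(cubes)
print(square_of[4])
```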
66
-
Python Language - Class System
67
-
Class Objects
Class objects support two kinds of operations: attribute references and instantiation.
    class MyClass:
        """A simple example class"""
        i = 12345
        def f(self):
            return 'hello world'

    x = MyClass()
    print x.i
    print x.f()

    12345
    hello world
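In MyClass the attribute i belongs to the class and is shared by all instances. Per-instance data is usually set in an __init__ method, which runs at instantiation. The sketch below illustrates this with a hypothetical Greeter class that is not part of the slides.

```python
class Greeter(object):
    """A small example class with per-instance state."""

    def __init__(self, name):
        # Called automatically by Greeter(...); stores data on the instance
        self.name = name

    def greet(self):
        return 'hello ' + self.name


x = Greeter('world')
y = Greeter('class')
print(x.greet())
print(y.greet())
```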
68
-
Exercises
69