Python for Data Analytics

Python for Data Analytics Lectures 1 & 2: The Python Language and Environment Rodrigo Belo [email protected] Spring 2015 1


  • Python for Data Analytics

    Lectures 1 & 2: The Python Language and Environment

    Rodrigo Belo [email protected]

    Spring 2015

    1

  • Introduction

    2

  • Instructor

    Rodrigo Belo

    Researcher at Carnegie Mellon University and at Católica-Lisbon, Portugal

    PhD in Technological Change and Entrepreneurship from Carnegie Mellon University

    Research Interests: Social Networks and Technology in Educational Settings

    Background: Undergraduate degree in Computer Science and Engineering, 5 years as Software Engineer

    Email: [email protected]

    3

  • Course Description

    This course introduces Python as a tool to collect, process, and analyze large data sets from a variety of sources to create information that guides business decision-making

    4

  • Course Description

    Students will become familiar with Python as a language and as a platform to integrate different technologies and techniques for data analytics, including:

    Collection of online information;

    Tools and strategies for data storage; and

    Data analysis methods.

    5

  • Course Description

    Each class will start with the introduction of a concept or tool and end with in-class hands-on exercises using example datasets.

    Throughout the course students will apply these techniques in their Homework and their Term Project.

    6

  • Learning Objectives

    Upon completion of this course, the student will be able to:

    1 Use Python as a general-purpose programming language

    2 Collect data available online in an automated fashion

    3 Process and store data in the appropriate format for future analysis

    4 Apply data analytics tools to extract relevant information

    7

  • Source Materials

    Textbooks:

    1 Main: McKinney (2012), Python for Data Analysis, O'Reilly

    2 Other: Russell (2011), Mining the Social Web, O'Reilly

    Online references:

    1 Python 2 Documentation: https://docs.python.org/2/

    2 pandas online reference: http://pandas.pydata.org/pandas-docs/stable/

    3 ggplot online reference: http://ggplot.yhathq.com

    8

  • Grading

    Individual Assignments: 40%

    Assignments will be done by individual students and posted on Blackboard. Specific assignments will appear approx. 1 week prior to the due date.

    Term Project: 30%

    The term project will be done in teams of 2 or 3 people and will involve the application of the methods covered in class.

    Students will identify a question they would like to answer using publicly available data, gather the data from an online source, store it, and analyze it using some of the methods shown in class.

    Final Exam: 30%

    May 6, 6pm

    9

  • Late Work

    If a piece of work is delivered t seconds late, its score is adjusted by multiplying it by

    $$1 - \left( \frac{t}{24 \cdot 5 \cdot 60 \cdot 60} \right)^4$$

    [Figure: maximum grade (0 to 100) plotted against the number of days late (0 to 5).]
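
    The late-work rule can be sketched in a few lines of Python. This is an illustration rather than the course's grading script: the function and constant names are hypothetical, and clamping the factor to the range 0 to 1 (so that work more than five days late scores zero rather than a negative value) is an assumption the slide does not state.

```python
# Late-work multiplier: 1 - (t / (24 * 5 * 60 * 60)) ** 4, with t in seconds.
# Names below are illustrative, not taken from the course materials.
SECONDS_IN_FIVE_DAYS = 24 * 5 * 60 * 60  # 432000
ONE_DAY = 24 * 60 * 60


def late_multiplier(t_seconds):
    """Return the factor by which a score is multiplied."""
    # Assumption: on-time (or early) work keeps its full score,
    # and the factor never drops below zero.
    t = max(0.0, float(t_seconds))
    return max(0.0, 1.0 - (t / SECONDS_IN_FIVE_DAYS) ** 4)


def adjusted_score(score, t_seconds):
    """Apply the late-work multiplier to a raw score."""
    return score * late_multiplier(t_seconds)


for days in range(6):
    line = "{0} day(s) late: {1:.2f}".format(days, adjusted_score(100, days * ONE_DAY))
    print(line)
```

    The fourth power keeps the penalty small at first (a perfect score still earns about 99.8 after one day) and steep near the five-day mark, where the score reaches zero.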

    10

  • Basic Concepts and Environment

    11

  • Why Python?

    Python is one of the most popular dynamic languages, along with Ruby, Perl, R, and others

    Python has a large and active scientific computing community

    Adoption of Python has increased significantly since the 2000s, both in industry and in the academic community

    Python started as a general-purpose programming language, but data manipulation libraries make it a first-class citizen in data manipulation and analysis

    Excellent choice as a single language for building data-centric applications

    12

  • Python as Glue

    Python integrates easily with C, C++, and FORTRAN, languages in which many routines are implemented

    Most programs consist of small portions of code where most of the time is spent, and large portions of glue code that doesn't run often

    In many cases the execution time of glue code is irrelevant

    Python can be used both as a prototyping language and as a production language

    13

  • Python Essentials

    Some of the essential Python libraries and tools:

    NumPy

    SciPy

    pandas

    ggplot

    IPython

    14

  • Python Essentials: NumPy

    NumPy (Numerical Python) is the foundational package for scientific computing in Python. It provides, among other things:

    A fast and efficient multidimensional array object: ndarray

    Functions for performing element-wise computations with arrays or mathematical operations between arrays

    Linear algebra operations, Fourier transforms, and random number generation

    Tools for integrating C, C++, and Fortran code with Python

    15

  • Python Essentials: SciPy

    SciPy is a collection of packages addressing a number of different standard problem domains in scientific computing:

    scipy.integrate: numerical integration routines and differential equation solvers

    scipy.linalg: linear algebra routines and matrix decompositions extending beyond those provided in numpy.linalg

    scipy.optimize: function optimizers (minimizers) and root-finding algorithms

    scipy.signal: signal processing tools

    scipy.sparse: sparse matrices and sparse linear system solvers

    scipy.stats: standard continuous and discrete probability distributions (density functions, samplers, continuous distribution functions), various statistical tests, and more descriptive statistics

    16

  • Python Essentials: pandas

    pandas provides data structures and functions designed to make working with structured data fast, easy, and expressive

    DataFrame is the primary object of this library

    a two-dimensional object that resembles a table with rows and columns

        meat[:5]

                 date  beef  veal  pork  lamb_and_mutton  broilers  other_chicken  turkey
        0  1944-01-01   751    85  1280               89       NaN            NaN     NaN
        1  1944-02-01   713    77  1169               72       NaN            NaN     NaN
        2  1944-03-01   741    90  1128               75       NaN            NaN     NaN
        3  1944-04-01   650    89   978               66       NaN            NaN     NaN
        4  1944-05-01   681   106  1029               78       NaN            NaN     NaN

    17

  • Python Essentials: ggplot

    ggplot is a graphics library that makes it easy to create graphics

        from ggplot import *

        ggplot(aes(x='date', y='beef'), data=meat) + \
            geom_line() + \
            stat_smooth(colour='blue', span=0.2)

    [Figure: line plot of beef (0 to 3000) against date (1945 to 2005).]

    18

  • Python Essentials: IPython

    IPython is the component that ties everything together. Aside from the standard terminal-based IPython shell, it provides:

    IPython notebook: HTML notebook for connecting to IPython through a web browser

    GUI console with inline plotting, multiline editing, and syntax highlighting

    Infrastructure for interactive parallel and distributed computing

    19

  • Installation and Setup

    Mac OS X and Linux distributions come with a Python distribution, but not necessarily with all the required libraries

    New users can install Anaconda (http://continuum.io/downloads) or Canopy (https://store.enthought.com/downloads/)

    To install IPython (and Python) follow the instructions on http://ipython.org/install.html

    You will need IPython notebook

    20

  • Python 2 and Python 3

    The Python community is currently undergoing a transition from the Python 2 series of interpreters to the Python 3 series

    Until the appearance of Python 3.0, Python code was, for the most part, backwards compatible

    The community decided that, in order to move the language forward, certain backwards-incompatible changes were necessary

    21

  • Python 2 and Python 3

    Python 3.x is a cleaned up version of Python 2.x

    Many inconsistencies were removed in the new version

        2.x: print "The answer is", 2*2
        3.x: print("The answer is", 2*2)

    More details at http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/tutorials/key_differences_between_python_2_and_3.ipynb

    However, there is still a considerable amount of code written in Python 2.x, making it the de facto standard

    In this course we will be using Python 2.x
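
    Code can often be written so that it behaves the same on both series. The sketch below is one way to do that for printing, under the assumption that each print call receives a single, already-formatted string; the variable names are illustrative.

```python
import sys

# Format the message first, then print one parenthesized value.
# On Python 2 the parentheses are just grouping; on Python 3 they are
# the call to the print() function. The output is the same either way.
message = "The answer is {0}".format(2 * 2)
print(message)

major = sys.version_info[0]  # 2 or 3, depending on the interpreter
print("Running on Python {0}".format(major))
```

    Python 2.6 and later also accept `from __future__ import print_function` as the first statement of a file, which makes the 3.x form of print available with several arguments.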

    22

  • Integrated Development Environments (IDEs)

    There are many editors and IDEs that you can use to edit Python

    PyDev (plugin for Eclipse)

    Python Tools for Visual Studio

    PyCharm

    IPython notebook

    Emacs

    Vim

    You can find more IDEs at https://wiki.python.org/moin/IntegratedDevelopmentEnvironments

    23

  • IPython: An Interactive Computing and Development Environment

    24

  • IPython Basics - Prompt

        $ ipython --pylab
        Python 2.7.6 | 64-bit | (default, Jun 4 2014, 16:42:26)
        Type "copyright", "credits" or "license" for more information.

        IPython 2.1.0 -- An enhanced Interactive Python.
        ?         -> Introduction and overview of IPython's features.
        %quickref -> Quick reference.
        help      -> Python's own help system.
        object?   -> Details about 'object', use 'object??' for extra details.
        Using matplotlib backend: MacOSX

        In [1]: 3 + 4
        Out[1]: 7

        In [2]: data = {i : randn() for i in range(8)}

        In [3]: data
        Out[3]:
        {0: 0.36680003627745555,
         1: 0.5231034512314581,
         2: 0.6300895261779402,
         3: -0.9115682057027865,
         4: -1.7244460134107902,
         5: 0.3829479256814315,
         6: 0.4718660373870812,
         7: -0.23438875074129756}

        In [4]: data[3]
        Out[4]: -0.9115682057027865

    25

  • IPython Basics - Tab Completion

        In [7]: da
        data          datetime        datetime_data
        date2num      datetime64
        datestr2num   datetime_as_string

        In [7]: data
        Out[7]:
        {0: 0.0016908926460949773,
         1: 0.39596065989527957,
         2: -0.9295711814640477,
         3: 2.1076302341719058,
         4: -0.6391315204450737,
         5: 1.7496783252859787,
         6: -0.5307855278794061,
         7: 0.38045583368270064}

    26

  • IPython Basics - Introspection

    Using a question mark (?) before or after a variable will display some general information about the object:

        In [3]: b?
        Type:        list
        String form: [1, 2, 3, 45]
        Length:      4
        Docstring:
        list() -> new empty list
        list(iterable) -> new list initialized from iterable's items

    ? can also be used before or after a function name

    27

  • IPython Basics - Introspection

    ? has a final usage, which is for searching the IPython namespace in a manner similar to the standard UNIX or Windows command line: a pattern containing the wildcard (*) lists all matching names

        In [4]: import numpy as np

        In [5]: np.*load*?
        np.load
        np.loads
        np.loadtxt
        np.pkgload

    28

  • IPython Basics - The %run Command

    Any file can be run as a Python program inside the environment of your IPython session using the %run command

        # ipython_script_test.py

        def my_function(x, y, z):
            return (x + y) / z

        aa = 5

    After the script runs, the names it defines are available in the session:

        %run ipython_script_test
        print aa
        print my_function(3.0, 4, 5)

        5
        1.4

    29

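  • IPython Basics - The %paste Command

    The %paste command pastes code copied to the clipboard, keeping its indentation

    The following code will not behave as intended if simply pasted, because the blank line inside the if block ends the block at the interactive prompt:

        x = 5
        y = 7
        if (x > 5):
            x += 1

            y = 8

    Pasted directly, y ends up as 8:

        >>> x = 5
        y = 7
        if (x > 5):
            x += 1

            y = 8
        >>> ... ... >>> >>>
        >>> y
        8

    With %paste, the block is executed as written and y stays 7:

        >>> %paste
        x = 5
        y = 7
        if (x > 5):
            x += 1

            y = 8
        ## -- End pasted text --
        >>> y
        7

    30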

  • IPython Basics - Interacting with the OS

    IPython provides very strong integration with the operating system shell:

        Command                  Description
        output = !cmd args       Run cmd and store the stdout in output
        %alias alias_name cmd    Define an alias for a system (shell) command
        %bookmark                Utilize IPython's directory bookmarking system
        %cd directory            Change system working directory to passed directory
        %pwd                     Return the current system working directory
        %dirs                    Return a list containing the current directory stack
        %dhist                   Print the history of visited directories
        %env                     Return the system environment variables as a dict

    31

  • IPython Basics - IPython GUI

    Starting an IPython GUI:

        ipython qtconsole --pylab=inline

    32

  • IPython Basics - IPython Notebook

    Starting the IPython notebook server:

        ipython notebook --pylab=inline

    33

  • Python Language

    34

  • Python as a Calculator - Basic Math

    Python can be used as a basic calculator

    Addition and subtraction

        print 2 + 4
        print 8.1 - 5

        6
        3.1

    Multiplication

        print 5 * 4
        print 3.1 * 2

        20
        6.2

    Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html

    35

  • Python as a Calculator - Basic Math

    Integer division is not the same as float division

    Float division

        print 4.0 / 2.0
        print 1.0/3.1

        2.0
        0.322580645161

    Integer division

        print 4 / 2
        print 1/3

        2
        0

    Be careful when performing integer division: in Python 2, dividing one integer by another rounds the result down to an integer

    Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
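
    One way to avoid surprises with division is to state the intent explicitly. The sketch below, with illustrative variable names, uses `//` when an integer result is wanted and converts one operand with `float()` when a fractional result is wanted; both forms give the same answer on Python 2 and Python 3.

```python
a, b = 1, 3

floor_result = a // b        # floor division: always rounds down
float_result = float(a) / b  # one float operand forces float division

print(floor_result)   # 0
print(float_result)   # approximately 0.333

# Floor division rounds toward negative infinity, not toward zero.
negative_floor = -7 // 2
print(negative_floor)  # -4
```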

    36

  • Python as a Calculator - Basic Math

    Exponentiation

        print 3.**2
        print 3**2
        print 2**0.5

        9.0
        9
        1.41421356237

    Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html

    37

  • Advanced Mathematical Operations

    Some more advanced mathematical operations require the numpy package

    Square Root

        import numpy as np
        print np.sqrt(2)

    1.41421356237

    Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html

    38

  • Exponential and logarithmic functions

    Exponential

        import numpy as np
        print np.exp(1)

        2.71828182846

    Logarithm

        import numpy as np
        print np.log(10)
        print np.log10(10)  # base 10

        2.30258509299
        1.0

    Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html

    39

  • Variable Assignment

    The equal sign (=) is used to assign a value to a variable

        width = 20
        height = 5 * 9
        width * height

    900

    40

  • Python Language - Types

    41

  • Boolean

    Python has a built-in boolean type:

        print width == 20
        print width == 30

        True
        False

    42

  • Strings

    Strings can be enclosed in single quotes or double quotes

    Single quotes

        'Hello World'

        'Hello World'

        'Isn\'t it nice to have a computer that talks to you?'

        "Isn't it nice to have a computer that talks to you?"

    Double quotes

        "Hello World"

        'Hello World'

        "Isn't it nice to have a computer that talks to you?"

        "Isn't it nice to have a computer that talks to you?"

    43

  • Strings

    You can concatenate strings with the + sign:

        "Hello " + "World"

        'Hello World'

        aa = "Hello "
        bb = "World"
        aa + bb

        'Hello World'

    44

  • Strings

    Strings are immutable:

        aa = "Hello " + "World"
        print aa
        aa[5] = 'R'

        Hello World
        Traceback (most recent call last):
          File "", line 3, in
            aa[5] = 'R'
        TypeError: 'str' object does not support item assignment
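
    Because a string cannot be changed in place, the usual approach is to build a new string from pieces of the old one. The following is a minimal sketch with illustrative variable names; it performs the replacement that the failed assignment to `aa[5]` was attempting.

```python
aa = "Hello " + "World"

# Everything before index 5, the new character, everything after index 5.
bb = aa[:5] + "R" + aa[6:]
print(bb)  # HelloRWorld

# str.replace also returns a new string and leaves the original alone.
cc = aa.replace("World", "There")
print(cc)  # Hello There
print(aa)  # Hello World
```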

    45

  • Strings

    You can use triple quotes for strings that span multiple lines

        print """\
        Hello
        -----
        World"""

        Hello
        -----
        World

    Triple quotes are often used to provide function documentation

    46

  • Strings

    Strings can be indexed (subscripted), with the first character having index 0

        mystring = "Hello World"
        print mystring[0]
        print mystring[6:10]

        H
        Worl

    There is no separate character type. A character is simply a string of size one
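
    Indexing and slicing have a few more forms that are worth seeing side by side. This short sketch (variable names are illustrative) shows negative indices, open-ended slices, and a step, and checks that a single character is itself a string.

```python
mystring = "Hello World"

last_char = mystring[-1]     # negative indices count from the end: 'd'
first_word = mystring[:5]    # omitted start means "from the beginning": 'Hello'
last_word = mystring[-5:]    # omitted stop means "to the end": 'World'
every_other = mystring[::2]  # a step of 2 takes every second character: 'HloWrd'

print(last_char)
print(first_word)
print(last_word)
print(every_other)

# A single character is a string of length one.
print(isinstance(mystring[0], str) and len(mystring[0]) == 1)  # True
```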

    47

  • Lists

    Lists are a compound data type in Python

    can be written as a list of comma-separated values (items) between square brackets

    might contain items of different types

        squares = [1, 4, 9, 16, 25]
        squares

    [1, 4, 9, 16, 25]

    48

  • Lists

    Lists can be indexed like strings

        squares = [1, 4, 9, 16, 25]
        print squares[1]
        print squares[-3]
        print squares[-3:]

        4
        9
        [9, 16, 25]

    Lists are mutable (unlike strings)

        letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
        print letters
        letters[2:5] = ['C', 'D', 'E']  # replace some values
        print letters
        letters[2:5] = []  # now remove them
        print letters

        ['a', 'b', 'c', 'd', 'e', 'f', 'g']
        ['a', 'b', 'C', 'D', 'E', 'f', 'g']
        ['a', 'b', 'f', 'g']

    49

  • Lists

    Lists can be used as stacks:

        stack = [3, 4, 5]
        stack.append(6)
        stack.append(7)
        stack

        [3, 4, 5, 6, 7]

        stack.pop()

    7

    stack

    [3, 4, 5, 6]

    50

  • Tuples

    A tuple is like a list, but it is written as comma-separated values without square brackets.

    Tuples are immutable; you cannot change their values.

        a = 3, 4, 5, [7, 8], 'cat'
        print a[0], a[-1]
        a[-1] = 'dog'

        3 cat
        Traceback (most recent call last):
          File "", line 3, in
            a[-1] = 'dog'
        TypeError: 'tuple' object does not support item assignment

    Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
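
    Two further properties of tuples are easy to check in code. The sketch below (variable names are illustrative) unpacks a tuple into separate names, catches the error raised by item assignment, and shows that a mutable object stored inside a tuple can still be changed.

```python
a = 3, 4, 5, [7, 8], 'cat'

# Unpacking: one name per element, in order.
first, second, third, inner, last = a
print(first)  # 3
print(last)   # cat

# Item assignment raises TypeError, which can be caught like any exception.
try:
    a[-1] = 'dog'
    assignment_failed = False
except TypeError as err:
    assignment_failed = True
    print("TypeError: {0}".format(err))

# The tuple itself is fixed, but the list it holds is still mutable.
inner.append(9)
print(a[3])  # [7, 8, 9]
```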

    51

  • Sets

    A set is an unordered collection with no duplicate elements

        basket = ['apple', 'orange', 'apple', 'pear', 'orange', 'banana']
        fruit = set(basket)  # create a set without duplicates
        fruit

        {'apple', 'banana', 'orange', 'pear'}
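
    Sets also support fast membership tests and the usual mathematical set operations. The sketch below reuses the basket from the slide; the second set, `citrus`, is a made-up example. Results are passed through `sorted` only because a set has no defined order.

```python
basket = ['apple', 'orange', 'apple', 'pear', 'orange', 'banana']
fruit = set(basket)
citrus = set(['orange', 'lemon'])  # hypothetical second set

print('orange' in fruit)       # True
print('lemon' in fruit)        # False

common = sorted(fruit & citrus)      # intersection
only_fruit = sorted(fruit - citrus)  # difference
either = sorted(fruit | citrus)      # union

print(common)      # ['orange']
print(only_fruit)  # ['apple', 'banana', 'pear']
print(either)      # ['apple', 'banana', 'lemon', 'orange', 'pear']
```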

    52

  • Dictionaries

    A dictionary can be thought of as an unordered set of key: value pairs

        phone_list = {'jack': 4123324098, 'jill': 4120294139}
        phone_list

        {'jack': 4123324098, 'jill': 4120294139}

        phone_list['rodrigo'] = 4120293473
        phone_list

        {'jack': 4123324098, 'jill': 4120294139, 'rodrigo': 4120293473}

    You can access all the keys and values of a dictionary:

        print phone_list.keys()
        print phone_list.values()

        ['rodrigo', 'jill', 'jack']
        [4120293473, 4120294139, 4123324098]
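
    A few other everyday dictionary operations, sketched with the same phone list: iterating in a predictable order, testing for a key, looking up a key that may be absent, and deleting an entry. The name 'maria' is a made-up key used only to show a failed lookup.

```python
phone_list = {'jack': 4123324098, 'jill': 4120294139}

# Iterating over a dictionary yields its keys; sorted() fixes the order.
for name in sorted(phone_list):
    print("{0}: {1}".format(name, phone_list[name]))

has_jack = 'jack' in phone_list    # membership tests look at keys
missing = phone_list.get('maria')  # get() returns None instead of raising KeyError

del phone_list['jill']             # remove an entry by key
remaining = sorted(phone_list)

print(has_jack)   # True
print(missing)    # None
print(remaining)  # ['jack']
```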

    53

  • Python Language - Control Structures

    54

  • Control Flows

    if statements

        x = 42
        if x > 10:
            print x
        else:
            print 10

        42

    for statements

        words = ['cat', 'window', 'defenestrate']
        for w in words:
            print w, len(w)

        cat 3
        window 6
        defenestrate 12

        a = ['Mary', 'had', 'a', 'little', 'lamb']
        for i in range(len(a)):
            print i, a[i]

        0 Mary
        1 had
        2 a
        3 little
        4 lamb
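
    When both the position and the item are needed, the built-in `enumerate` is a common alternative to looping over `range(len(a))`. A minimal sketch, collecting the lines in a list (an illustrative choice) so the result can be inspected as well as printed:

```python
a = ['Mary', 'had', 'a', 'little', 'lamb']

lines = []
for i, word in enumerate(a):  # yields (index, item) pairs
    lines.append("{0} {1}".format(i, word))

print("\n".join(lines))
```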

    55

  • Python Language - Functions

    56

  • Defining Functions

    You can create functions using the keyword def

        import numpy as np

        def f(x):
            return x**3 - np.log(x)

        print f(3)
        print f(5.1)

        25.9013877113
        131.02175946

    Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html

    57

  • Defining Functions

    Functions can receive more than one argument

        def func(x, y):
            "return product of x and y"
            return x * y

        print func(2, 3)

    6

    Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html

    58

  • Functions - Optional and Keyword Arguments

    You can create default values for arguments:

        def func(a, n=2):
            "compute the nth power of a"
            return a**n

        # three different ways to call the function
        print func(2)
        print func(2, 3)
        print func(2, n=4)

        4
        8
        16

    Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html

    59

  • Functions - Optional and Keyword Arguments

    Defining a function with two optional arguments

        def func(a=1, n=2):
            "compute the nth power of a"
            return a**n

        # three different ways to call the function
        print func()
        print func(2, 4)
        print func(n=4, a=2)

        1
        16
        16

    Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html

    60

  • Functions - Optional and Keyword Arguments

    A function can accept an arbitrary number of positional arguments with the *args syntax:

        def func(*args):
            total = 0  # named total to avoid shadowing the built-in sum
            for arg in args:
                total += arg
            return total

        print func(1, 2, 3, 4)

    10

    Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html

    61

  • Functions - Optional and Keyword Arguments

    A function can accept an arbitrary number of keyword arguments with the **kwargs syntax:

        def func(**kwargs):
            for kw in kwargs:
                print '{0} = {1}'.format(kw, kwargs[kw])

        func(t1=6, color='blue')

        color = blue
        t1 = 6
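
    The two forms can be combined in one signature, and the same star syntax works in the other direction to unpack a list or dictionary into a call. The function below is a hypothetical helper written for this illustration; keyword arguments are sorted only to make the output order predictable.

```python
def describe(*args, **kwargs):
    """Return a readable summary of the arguments received."""
    parts = [str(arg) for arg in args]  # positional arguments, in order
    parts += ["{0}={1}".format(key, kwargs[key]) for key in sorted(kwargs)]
    return ", ".join(parts)


direct = describe(1, 2, color='blue', t1=6)
print(direct)  # 1, 2, color=blue, t1=6

# * and ** also unpack existing containers at the call site.
values = [1, 2]
options = {'t1': 6, 'color': 'blue'}
unpacked = describe(*values, **options)
print(unpacked)  # 1, 2, color=blue, t1=6
```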

    62

  • Lambda Functions

    You can define "lambda" functions, which are also known as inline or anonymous functions.

    The syntax is lambda var: f(var)

        print map(lambda x: x**2, [0, 1, 2])

    [0, 1, 4]

    Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
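
    Lambda functions are most often passed as arguments to other functions. A small sketch with illustrative names: one lambda serves as a sort key, another is applied with `map`. Wrapping `map` in `list` keeps the result a list on Python 3, where `map` returns an iterator.

```python
words = ['cat', 'window', 'defenestrate']

# Sort by word length, longest first; the lambda computes the sort key.
by_length = sorted(words, key=lambda w: len(w), reverse=True)
print(by_length)  # ['defenestrate', 'window', 'cat']

squares = list(map(lambda x: x ** 2, [0, 1, 2]))
print(squares)    # [0, 1, 4]
```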

    63

  • Nested Functions

    You can nest functions inside of functions

        def wrapper(x):
            a = 4
            def func(x, a):
                return a * x

            return func(x, a)

        print wrapper(5)

    20

    Source: John Kitchin http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
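
    A nested function can also read variables from the enclosing function directly, and the outer function can return it for later use (a closure). This is a sketch of that variation, not part of the original example; `make_multiplier` and `times4` are illustrative names.

```python
def make_multiplier(a):
    def multiply(x):
        return a * x  # 'a' is taken from the enclosing call
    return multiply   # return the inner function itself, not its result


times4 = make_multiplier(4)
times10 = make_multiplier(10)

print(times4(5))   # 20
print(times10(5))  # 50
```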

    64

  • Functional Programming Tools

    filter

        def f(x):
            return x % 3 == 0 or x % 5 == 0

        filter(f, range(2, 25))

        [3, 5, 6, 9, 10, 12, 15, 18, 20, 21, 24]

    map

        def cube(x): return x*x*x

        map(cube, range(10))

        [0, 1, 8, 27, 64, 125, 216, 343, 512, 729]

    reduce

        def add(x, y): return x+y

        reduce(add, range(10))

    45
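
    These examples rely on Python 2 behavior. As a hedged portability sketch: on Python 3, `filter` and `map` return iterators and `reduce` is no longer a built-in, so the version below wraps the first two in `list` and imports `reduce` from `functools` (where it is also available on Python 2.6 and later). The result names are illustrative.

```python
from functools import reduce  # works on Python 2.6+ and Python 3


def f(x):
    return x % 3 == 0 or x % 5 == 0


def cube(x):
    return x * x * x


def add(x, y):
    return x + y


multiples = list(filter(f, range(2, 25)))  # keep items for which f is true
cubes = list(map(cube, range(10)))         # apply cube to every item
total = reduce(add, range(10))             # fold the items into one value

print(multiples)
print(cubes)
print(total)
```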

    65

  • List Comprehensions

    List comprehensions provide a shortcut to create lists from existing structures:

    squares = [x**2 for x in range(10)]

    print squares

    [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
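
    A comprehension can also filter with an `if` clause, which covers many uses of `map` and `filter` in a single expression. The sketch below (illustrative names) shows a filtered list comprehension next to the equivalent loop, plus a dictionary comprehension, which is available from Python 2.7 onward.

```python
# Squares of the even numbers below 10.
even_squares = [x ** 2 for x in range(10) if x % 2 == 0]

# The same result written as an explicit loop.
loop_result = []
for x in range(10):
    if x % 2 == 0:
        loop_result.append(x ** 2)

# Dictionary comprehension: map each word to its length.
words = ['cat', 'window', 'defenestrate']
lengths = {w: len(w) for w in words}

print(even_squares)        # [0, 4, 16, 36, 64]
print(lengths['window'])   # 6
```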

    66

  • Python Language - Class System

    67

  • Class Objects

    Class objects support two kinds of operations: attribute references and instantiation.

        class MyClass:
            """A simple example class"""
            i = 12345
            def f(self):
                return 'hello world'

        x = MyClass()
        print x.i
        print x.f()

        12345
        hello world
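
    Most classes also give each instance its own state through an `__init__` method, which runs at instantiation. The class below is a hypothetical example written for this illustration (it is not from the course materials); it contrasts a class attribute shared by all instances with attributes set per instance.

```python
class Account(object):
    """Hypothetical example: a class with per-instance state."""

    interest = 0.02  # class attribute, shared by every instance

    def __init__(self, owner, balance=0):
        # Instance attributes, set separately for each object.
        self.owner = owner
        self.balance = balance

    def deposit(self, amount):
        """Add amount to the balance and return the new balance."""
        self.balance += amount
        return self.balance


jack = Account('jack', 100)
jill = Account('jill')

print(jack.deposit(50))   # 150
print(jill.balance)       # 0
print(Account.interest)   # 0.02
```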

    68

  • Exercises

    69
