Data Handling using Pandas -1


Transcript of Data Handling using Pandas -1

Page 1: Data Handling using Pandas -1

Data Handling using Pandas -1

Page 2: Data Handling using Pandas -1

Python Library – Pandas

Pandas is one of the most popular Python packages for data science. It offers powerful and flexible data structures that make data analysis and manipulation easy, and it simplifies importing and analyzing data. Pandas builds on packages like NumPy and matplotlib to give us a single, convenient place for data analysis and visualization work.

Page 3: Data Handling using Pandas -1

Basic Features of Pandas

• Helps a lot in keeping track of our data.
• With a pandas DataFrame, we can have different data types (float, int, string, datetime, etc.) all in one place.
• Built-in functionality for easy grouping, easy joins of data, and rolling windows.
• Good I/O capabilities; for example, data can be pulled from a MySQL database directly into a DataFrame.
• With pandas, you can use patsy for R-style syntax in doing regressions.
• Tools for loading data into in-memory data objects from different file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and subsetting of large data sets.
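As a rough illustration of a few of the features listed above (mixed column types, missing-data handling, grouping and pivoting), here is a minimal sketch. The table contents and column names are invented for the example and do not come from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records; all names and numbers are made up.
sales = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount": [100.0, np.nan, 80.0, 120.0],  # one missing value
})

# Grouping with integrated missing-data handling: sum() skips NaN.
total_by_city = sales.groupby("city")["amount"].sum()

# Reshaping/pivoting: one row per city, one column per quarter.
wide = sales.pivot(index="city", columns="quarter", values="amount")

print(total_by_city)
print(wide)
```

The missing Delhi/Q2 amount stays as NaN in the pivoted table, while the grouped total simply ignores it.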

Page 4: Data Handling using Pandas -1

Pandas – Installation/Environment Setup

The pandas module doesn't come bundled with standard Python. If we install the Anaconda Python distribution, pandas is installed by default.

Steps for Anaconda installation and use:
1. Visit the site https://www.anaconda.com/download/
2. Download the appropriate Anaconda installer.
3. After the download, install it.
4. During installation, check the options for setting the path and for all users.
5. After installation, start the Spyder utility of Anaconda from the Start menu.
6. Type import pandas as pd in the left pane (temp.py).
7. Then run it.
8. If no error is shown, pandas is installed.
9. Like the default temp.py, we can create another .py file for a new program from the New Window option of the File menu.

Page 5: Data Handling using Pandas -1

Pandas – Installation/Environment Setup

Pandas can also be installed in a standard Python distribution, using the following steps.
1. If we are using Windows, a service pack must be installed on the computer. If it is not installed, we will not be able to install pandas in the existing standard Python (which is already installed), so install it first (google it).
2. We can check this through the Properties option of the My Computer icon.
3. Now install the latest version (any one above 3.4) of Python.

Page 6: Data Handling using Pandas -1

Pandas – Installation/Environment Setup

4. Now move to the Scripts folder of the Python distribution in the command prompt (opened through the cmd command of Windows).
5. Execute the following commands in the command prompt one after another, waiting for each installation to finish.
>pip install numpy
>pip install six
>pip install pandas
Now we will be able to use pandas in the standard Python distribution.
6. Type import pandas as pd in the Python (IDLE) shell.
7. If it executes without error, pandas is installed on your system.

Page 7: Data Handling using Pandas -1

Data Structures in Pandas

A data structure is a specialized way of storing data so that a specific type of functionality can be applied to it.
Two important data structures of pandas are Series and DataFrame.
1. Series
A Series is a one-dimensional, array-like structure with homogeneous data. For example, a Series may hold a collection of integers.
Basic features of a Series are
❖ Homogeneous data
❖ Size immutable
❖ Values of data mutable
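A small sketch of two of the Series features listed above, homogeneous data and mutable values. The numbers are arbitrary and chosen only for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30])

# Homogeneous data: the whole Series shares a single dtype.
print(s.dtype)

# Values are mutable: an element can be overwritten in place.
s[1] = 99
print(s.tolist())

# The assignment above changes a value, not the length.
print(len(s))
```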

Page 8: Data Handling using Pandas -1

2. DataFrame
A DataFrame is like a two-dimensional array with heterogeneous data.
Basic features of a DataFrame are
❖ Heterogeneous data
❖ Size mutable
❖ Data mutable

Sr. No.  Admn No  Student Name            Class  Section  Sex   Date of Birth
1        001284   NIDHI MANDAL            I      A        Girl  07/08/2010
2        001285   SOUMYADIP BHATTACHARYA  I      A        Boy   24/02/2011
3        001286   SHREYAANG SHANDILYA     I      A        Boy   29/12/2010

Page 9: Data Handling using Pandas -1

Pandas Series

A Series is like a one-dimensional array capable of holding data of any type (integer, string, float, Python objects, etc.).
A Series can be created using the constructor.
Syntax :- pandas.Series(data, index, dtype, copy)
Note the capital S in Series, which is the constructor.
A Series can be created from
1. A Python sequence
2. An array (ndarray)
3. A dict
4. A scalar value or constant

Page 10: Data Handling using Pandas -1

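Create an Empty Series
e.g.
import pandas as pd # import the pandas library; pd is the reference name
s = pd.Series(dtype='float64') # dtype given explicitly; without it, newer pandas versions default to object
print(s)
Output
Series([], dtype: float64)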

Page 11: Data Handling using Pandas -1

Create Series using Python Sequence

Syntax

seriesobject = pd.Series(pythonsequence)

e.g.

import pandas as pd

s = pd.Series(range(5))

print(s)

Output

range(5) generates the sequence 0, 1, 2, 3, 4.
The default index starts from 0.

Page 12: Data Handling using Pandas -1

Create Series using Python Sequence

Syntax

seriesobject = pd.Series(pythonsequence)

e.g.

import pandas as pd

s = pd.Series([3.5,5.,6.5,8.])

print(s)

Output

Given list is a sequence with 4 values

Page 13: Data Handling using Pandas -1

Create a Series from ndarray
Without index
e.g.
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)
Output
The default index starts from 0.

Page 14: Data Handling using Pandas -1

Create a Series from ndarray
With index position
e.g.
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)
Output
The defined index starts from 100.

Page 15: Data Handling using Pandas -1

Create a Series from ndarray

Using arange function

e.g.

import pandas as pd

import numpy as np

nda=np.arange(3,13,3.5)

print(nda)

s = pd.Series(nda)

print(s)

Output
The ndarray produced by the arange function is printed first, followed by the Series built from it.

Page 16: Data Handling using Pandas -1

Create a Series from dictionary
Eg.1 (without index)
import pandas as pd
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)
Output
The index is taken from the dictionary keys, so it starts from 'a'.

Page 17: Data Handling using Pandas -1

Create a Series from dictionary
Eg.2 (with index)
import pandas as pd
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a']) # 'd' has no matching key, so its value is NaN
print(s)
Output
The Series follows the defined index.

Page 18: Data Handling using Pandas -1

Create a Series from Scalar
eg.
import pandas as pd
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)
Output
Here 5 is repeated 4 times, once for each given index label.

Page 19: Data Handling using Pandas -1

Create a Series with NaN Value
eg.
import pandas as pd
import numpy as np
s = pd.Series([6.5,np.nan,2.34]) # np.nan works across NumPy versions; the np.NaN alias was removed in NumPy 2.0
print(s)
Output
Using NumPy's nan we can define a NaN value.

Page 20: Data Handling using Pandas -1

Create a Series with index(es) as well as Data
eg.
import pandas as pd
arr=[31,28,31,30]
month=['Jan','Feb','Mar','Apr']
s = pd.Series(data=arr,index=month)
print(s)
Output
Using index and data we can create a Series object.

Page 21: Data Handling using Pandas -1

eg.

Create a Series with Data Type, index(es) and Data

import pandas as pd

import numpy as np

arr=[31,28,31,30]

month=['Jan','Feb','Mar','Apr']

s = pd.Series(data=arr, index=month, dtype=np.float64)

print(s)

Output

Using index, data and data type we can create series object

Page 22: Data Handling using Pandas -1

Series Object Attributes

Some common attributes, accessed as <seriesobject>.<AttributeName>:

Attribute              Description
Series.index           Returns the index of the Series
Series.values          Returns the data as an ndarray
Series.dtype           Returns the dtype object of the underlying data
Series.shape           Returns a tuple of the shape of the underlying data
Series.nbytes          Returns the number of bytes in the underlying data
Series.ndim            Returns the number of dimensions
Series.size            Returns the number of elements
Series.dtype.itemsize  Returns the size in bytes of one element (older pandas versions also offered Series.itemsize, which has since been removed)
Series.hasnans         Returns True if there are any NaN values
Series.empty           Returns True if the Series object is empty

Page 23: Data Handling using Pandas -1

Attributes of Series object

eg.
import pandas as pd
s=pd.Series([10,20,30,40,50,60])
print(s)
print(s.index)
print(s.values)
print(s.dtype) # data type
print(s.shape) # shape tuple: (number of rows,)
print(s.nbytes) # size of the whole Series in bytes
print(s.ndim)
print(s.size) # number of elements
print(s.dtype.itemsize) # size of one element in bytes
print(s.hasnans)
print(s.empty)
Output

Page 24: Data Handling using Pandas -1

Head and Tail Functions of Series Object
Head function
e.g.
import pandas as pd1
s = pd1.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print (s.head(3))
Output
Returns only the first 3 elements of the Series.

Page 25: Data Handling using Pandas -1

Tail function
e.g.
import pandas as pd1
s = pd1.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print (s.tail(3))
Output
Returns only the last 3 elements of the Series.

Page 26: Data Handling using Pandas -1

Accessing Data from Series with indexing and slicing

e.g.
import pandas as pd1
s = pd1.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print (s.iloc[0]) # element at position 0 (plain s[0] relies on a positional fallback that newer pandas versions deprecate)
print (s[:3]) # first 3 elements
print (s[-3:]) # slicing for the last 3 elements
Output

Page 27: Data Handling using Pandas -1

Retrieve Data Using Label as (Index)

e.g.

import pandas as pd1

s = pd1.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

print (s[['c','d']])

Output

Page 28: Data Handling using Pandas -1

Retrieve Data from selection

There are three methods for data selection:
▪ loc gets rows (or columns) with particular labels from the index.
▪ iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
▪ ix usually tries to behave like loc but falls back to behaving like iloc if a label is not present in the index. ix is deprecated (and no longer available in recent pandas releases); the use of loc and iloc is encouraged instead.

Page 29: Data Handling using Pandas -1

Retrieve Data from selection

e.g.

import pandas as pd

import numpy as np

s = pd.Series(np.nan, index=[49,48,47,46,45, 1, 2, 3, 4, 5])

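print(s.iloc[:3]) # by position: the first 3 rows (labels 49, 48, 47)
print(s.loc[:3]) # by label: every row up to and including label 3
# print(s.ix[:3]) # ix is deprecated and fails in recent pandas releases; use loc or iloc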

Output

Page 30: Data Handling using Pandas -1

Mathematical operations with Series
e.g.
import pandas as pd1
s = pd1.Series([1,2,3])
t = pd1.Series([1,2,4])
u=s+t # element-wise addition
print (u)
u=s*t # element-wise multiplication
print (u)
Output

Page 31: Data Handling using Pandas -1

Difference between NumPy array and Series objects
1. With ndarrays, a vector operation is possible only when the arrays have compatible shapes. With Series objects, values are aligned on matching index labels, and NaN is returned for labels that do not match.
2. In an ndarray, the index always starts from 0 and is always numeric. In a Series, the index can be of any type, including numbers, and need not start from 0.
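The first difference can be sketched as follows: ndarray arithmetic works by position, while Series arithmetic aligns on index labels and yields NaN where a label is missing on one side. The labels and values here are illustrative only.

```python
import numpy as np
import pandas as pd

# ndarray arithmetic is positional: element 0 pairs with element 0, and so on.
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
print(a + b)

# Series arithmetic aligns on index labels before adding.
s = pd.Series([1, 2, 3], index=["x", "y", "z"])
t = pd.Series([10, 20, 30], index=["y", "z", "w"])
u = s + t

# Labels present in only one Series ("x" and "w") come out as NaN.
print(u)
```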

Page 32: Data Handling using Pandas -1

DataFrame Data Structure

A DataFrame is a pandas structure which stores data in 2D form. It is a 2-dimensional labeled array: an ordered collection of columns, where columns can store different kinds of data. A 2D array is a collection of rows and columns, where each row and column has a definite index starting from 0. In the given diagram, there are 5 rows and 5 columns, with row and column indexes running from 0 to 4. Each cell has an address such as A[2][1] or A[1][4], as shown in the diagram.

Page 33: Data Handling using Pandas -1

Characteristics of DataFrame

Characteristics of a DataFrame are as follows-
• It has 2 indexes or 2 axes.
• It is somewhat like a spreadsheet, where the row index is called the index and the column index is called the column name.
• Indexes can be made of numbers, strings or letters.
• It is possible to have any kind of data in columns.
• Its values are mutable and can be changed anytime.
• The size of a DataFrame is also mutable, i.e. the number of rows and columns can be increased or decreased anytime.
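These characteristics can be tried out on a tiny DataFrame. This is only a sketch; the names, ages and row labels below are placeholder data chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Freya", "Mohak"], "age": [10, 12]},
    index=["r1", "r2"],  # row labels can be strings
)

# Columns can hold different kinds of data.
print(df.dtypes)

# Values are mutable: change one cell by row and column label.
df.loc["r1", "age"] = 11

# Size is mutable: add a column, then add a row.
df["grade"] = ["A", "B"]
df.loc["r3"] = ["Dwivedi", 13, "C"]

print(df)
print(df.shape)
```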

Page 34: Data Handling using Pandas -1

Creation and presentation of DataFrame

A DataFrame object can be created by passing data in 2D format.
Syntax:-
<dataFrameObject> = pd.DataFrame(<a 2D Data Structure>,
[columns=<column sequence>], [index=<index sequence>])
Note the capital D and F in DataFrame, which is the constructor.
You can create a DataFrame in various ways by passing data values, such as:
• 2D dictionaries
• 2D ndarrays
• Series type object
• Another DataFrame object

Page 35: Data Handling using Pandas -1

Basic Features of DataFrame
It is a two-dimensional data structure, just like any table (with rows and columns).
• Columns may be of different types
• Size can be changed (mutable)
• Labeled axes (rows / columns)
• Arithmetic operations on rows and columns
It can be created using the constructor
pandas.DataFrame(data, index, columns, dtype, copy)

Page 36: Data Handling using Pandas -1

Create DataFrame

A DataFrame can be created from the following:

❑ Lists

❑ dict

❑ Series

❑ Numpy ndarrays

❑ Another DataFrame

Create an Empty DataFrame

e.g.

import pandas as pd1

df1 = pd1.DataFrame()

print(df1)

Output

Page 37: Data Handling using Pandas -1

Create a DataFrame from Lists

e.g.1

import pandas as pd1

data1 = [1,2,3,4,5]

df1 = pd1.DataFrame(data1)
print (df1)

Output

Page 38: Data Handling using Pandas -1

e.g.2

import pandas as pd1

data1 = [['Freya',10],['Mohak',12],['Dwivedi',13]]

df1 =pd1.DataFrame(data1,columns=['Name','Age'])

print (df1)

Output

Page 39: Data Handling using Pandas -1

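To store the numeric values as float, convert the numeric column after construction. (Older pandas versions accepted dtype=float in the constructor and converted only the columns that could be cast; newer versions may raise an error because the 'Name' column holds strings.)
e.g.3
import pandas as pd1
data1 = [['Freya',10],['Mohak',12],['Dwivedi',13]]
df1 = pd1.DataFrame(data1,columns=['Name','Age'])
df1['Age'] = df1['Age'].astype(float) # store the ages as float
print (df1)
Output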

Page 40: Data Handling using Pandas -1

Create a DataFrame from Dict of ndarrays / Lists

e.g.1

import pandas as pd

data = {'Name':['Freya','Mohak'],'Age':[9,10]}

df = pd.DataFrame(data)

print (df)

Output

Page 41: Data Handling using Pandas -1

To set index labels, write the line below as the 3rd statement of the program; the number of labels must match the number of rows.
df = pd.DataFrame(data,index=['rank1','rank2'])

e.g.2

import pandas as pd

data = {'Name':['Freya','Mohak'],'Age':[9,10]}

df = pd.DataFrame(data,index=['rank1','rank2'])

print (df)

Output

Page 42: Data Handling using Pandas -1

Create a DataFrame from List of Dicts

e.g.1

import pandas as pd

data = [{'x': 1, 'y': 2},{'x': 5, 'y': 4, 'z': 5}]

df = pd.DataFrame(data)

print (df)

Output

Page 43: Data Handling using Pandas -1

Create a DataFrame from List of Dictionaries with index

e.g.1

import pandas as pd

data = [{'x': 1, 'y': 2},{'x': 5, 'y': 4, 'z': 5}]

df = pd.DataFrame(data, index=['first', 'second'])

print (df)

Output

Page 44: Data Handling using Pandas -1

Create a DataFrame from Dictionary of Series

e.g.1
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),

'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

print (df)

Output

Column Selection ->

print (df['one'])

Output

Page 45: Data Handling using Pandas -1

Adding a new column by passing a Series: ->
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print (df)
Output
Adding a new column using the existing column values
df['four']=df['one']+df['three']
print (df)
Output

Page 46: Data Handling using Pandas -1

Create a DataFrame from .txt file
Suppose we have a text file './inputs/dist.txt' containing:
1 1 12.92
1 2 90.75
1 3 60.90
2 1 71.34
Pandas is shipped with built-in reader methods. For example, the pandas.read_table method is a good way to read a tabular data file (optionally in chunks).
import pandas
df = pandas.read_table('./inputs/dist.txt', sep=r'\s+', names=('A', 'B', 'C'))
print(df)
This creates a DataFrame object with column A of type int64, B of int64 and C of float64. (sep=r'\s+' splits on whitespace; the older delim_whitespace=True option is deprecated in recent pandas releases.)
Output
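To try the whitespace-delimited read without creating a file on disk, the same four rows can be held in memory. This is a sketch, not the slide's original program: io.StringIO stands in for the text file, and sep=r"\s+" is used as the whitespace separator.

```python
import io
import pandas as pd

# Stand-in for the text file: the same four rows held in memory.
text = """1 1 12.92
1 2 90.75
1 3 60.90
2 1 71.34
"""

# sep=r"\s+" splits on runs of whitespace; names supplies the column labels.
df = pd.read_table(io.StringIO(text), sep=r"\s+", names=["A", "B", "C"])

print(df)
print(df.dtypes)
```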

Page 47: Data Handling using Pandas -1

Create a DataFrame from a CSV (comma-separated values) file / import data from a CSV file

e.g.

Suppose filename.csv file contains following data

Date,"price","factor_1","factor_2"

2012-06-11,1600.20,1.255,1.548

2012-06-12,1610.02,1.258,1.554

import pandas as pd

# Read data from file 'filename.csv'

# (in the same directory that your python program is based)

# Control delimiters, rows, column names with read_csv

data = pd.read_csv("filename.csv")

# Preview the first 1 line of the loaded data

data.head(1)

Page 48: Data Handling using Pandas -1

DataFrame Attributes
• When we create a DataFrame object, all information related to it, such as size and data type, can be accessed through attributes.
<DataFrame Object>.<attribute name>
• Some attributes are -

Attribute  Description
index      Shows the index of the DataFrame.
columns    Shows the column labels of the DataFrame.
axes       Returns both axes, i.e. index and columns.
dtypes     Returns the data types of the data contained in the DataFrame.
size       Returns the number of elements in the object.
shape      Returns a tuple of the dimensions of the DataFrame.
values     Returns the NumPy form of the DataFrame.
empty      Indicates whether the DataFrame is empty or not.
ndim       Returns an int representing the number of axes / array dimensions.
T          Transposes index and columns.

Page 49: Data Handling using Pandas -1

DataFrame Attributes
import pandas as pd
df=pd.DataFrame([[10,20,30],[40,50,60]])
print(df)
print(df.index)
print(df.columns)
print(df.axes)
print(df.dtypes)
print(df.size)
print(df.shape)
print(df.values)
print(df.empty)
print(df.ndim)
print(df.T)

Page 50: Data Handling using Pandas -1

Column addition
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
print(df)
c = [7,8,9]
df['C'] = c
print(df)
Output

Page 51: Data Handling using Pandas -1

Column Deletion
del df['A'] # deleting the first column using the del statement
df.pop('B') # deleting another column using the pop() method
print(df)
Output
Rename columns
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df.rename(columns={"A": "a", "B": "c"}, inplace=True)
print(df)
Output

Page 52: Data Handling using Pandas -1

Row Selection, Addition, and Deletion
#Selection by Label

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),

'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df= pd.DataFrame(d)

print(df)

print (df.loc['b'])

Output

Page 53: Data Handling using Pandas -1

#Selection by integer location

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),

'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

print (df.iloc[2])

Slice rows: multiple rows can be selected using the ':' operator.

print (df[2:4])

Output

Output

Page 54: Data Handling using Pandas -1

Addition of Rows
import pandas as pd1
df1 = pd1.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd1.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df1 = pd1.concat([df1, df2]) # DataFrame.append was removed in pandas 2.0; concat works across versions
print (df1)
Output
Deletion of Rows
# Drop rows with label 0
df1 = df1.drop(0)
print(df1)
Output

Page 55: Data Handling using Pandas -1

Iterate over rows in a dataframe

e.g.

import pandas as pd1
raw_data1 = {'name': ['freya', 'mohak'], 'age': [10, 1],
             'favorite_color': ['pink', 'blue'], 'grade': [88, 92]}
df1 = pd1.DataFrame(raw_data1, columns = ['name', 'age', 'favorite_color', 'grade'])
# iterrows() yields (index label, row as a Series) pairs
for index, row in df1.iterrows():
    print (row["name"], row["age"])

Output

Page 56: Data Handling using Pandas -1

Head & Tail
head() returns the first n rows (observe the index values). The default number of rows to display is five, but you may pass a custom number. tail() returns the last n rows.
e.g.
import pandas as pd
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
     'Age':pd.Series([25,26,25,23,30,29,23]),
     'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
df = pd.DataFrame(d)
print ("Our data frame is:")
print (df)
print ("The first two rows of the data frame are:")
print (df.head(2))
Output

Page 57: Data Handling using Pandas -1

Indexing a DataFrame using .loc[ ] :

This function selects data by the label of the rows and columns.

import pandas as pd

import numpy as np

df = pd.DataFrame(

np.random.randn(8, 4),

index = ['a','b','c','d','e','f','g','h'],

columns = ['A', 'B', 'C', 'D'])

print("Use of loc")

print(df.loc[:,'A'])

Output

Page 58: Data Handling using Pandas -1

Accessing a DataFrame with a boolean index:
In order to access a DataFrame with a boolean index, we have to create a DataFrame whose index contains boolean values, that is, True or False.
import pandas as pd
# named data rather than dict, to avoid shadowing the built-in dict type
data = {'name':["Mohak", "Freya", "Roshni"],
        'degree': ["MBA", "BCA", "M.Tech"],
        'score':[90, 40, 80]}
df = pd.DataFrame(data, index = [True, False, True])
print(df.loc[True])
Output

Page 59: Data Handling using Pandas -1

Binary operation over dataframe with series

e.g.

import pandas as pd

x = pd.DataFrame({0: [1,2,3], 1: [4,5,6], 2: [7,8,9] })

y = pd.Series([1, 2, 3])

new_x = x.add(y, axis=0)

print(new_x)

Output
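The axis argument decides how the Series is lined up against the DataFrame. The sketch below reuses the same x and y and compares axis=0 (match the Series index against row labels) with axis=1 (match it against column labels); the variable names by_rows and by_cols are introduced here only for illustration.

```python
import pandas as pd

x = pd.DataFrame({0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]})
y = pd.Series([1, 2, 3])

# axis=0: y[i] is added to every value in row i.
by_rows = x.add(y, axis=0)

# axis=1: y[j] is added to every value in column j.
by_cols = x.add(y, axis=1)

print(by_rows)
print(by_cols)
```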

Page 60: Data Handling using Pandas -1

Binary operation over dataframe with dataframe

import pandas as pd

x = pd.DataFrame({0: [1,2,3], 1: [4,5,6], 2: [7,8,9] })

y = pd.DataFrame({0: [1,2,3], 1: [4,5,6], 2: [7,8,9] })

new_x = x.add(y, axis=0)

print(new_x)

Output

Page 61: Data Handling using Pandas -1

Merging/joining dataframe

e.g.

import pandas as pd

left = pd.DataFrame({'id':[1,2], 'Name': ['anil', 'vishal'],

'subject_id':['sub1','sub2']})

right = pd.DataFrame({'id':[1,2], 'Name': ['sumer', 'salil'],

'subject_id':['sub2','sub4']})

print (pd.merge(left,right,on='id'))

Merging/combining dataframe(different styles)

print(pd.merge(left, right, on='subject_id', how='left')) #left join

Output

Output

Page 62: Data Handling using Pandas -1

Merging/combining dataframe(different styles)

print(pd.merge(left, right, on='subject_id', how='right')) #right join

print(pd.merge(left, right, how='outer', on='subject_id')) #outer join

print(pd.merge(left, right, on='subject_id', how='inner')) # inner join

Output

Output

Output
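One way to compare the four join styles is to count the rows each one returns. The sketch below rebuilds the same left and right frames; the row_counts dictionary is a helper introduced here for illustration. Only 'sub2' appears in both frames, so the inner join keeps a single row, while the outer join keeps every subject_id from either side.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "Name": ["anil", "vishal"],
                     "subject_id": ["sub1", "sub2"]})
right = pd.DataFrame({"id": [1, 2], "Name": ["sumer", "salil"],
                      "subject_id": ["sub2", "sub4"]})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, on="subject_id", how=how)
    row_counts[how] = len(merged)
    print(how)
    print(merged)

print(row_counts)
```

Because both frames also share the id and Name columns, pandas adds the suffixes _x and _y to tell the two sides apart in the merged result.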

Page 63: Data Handling using Pandas -1

Sorting Data in DataFrame

import pandas as pd
cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
        'Price': [22000,25000,27000,35000] }
df = pd.DataFrame(cars)
#Sorting in ascending order
print(df.sort_values(by=['Brand']))
Output
#Sorting in descending order
print(df.sort_values(by=['Brand'], ascending=False))
Output

Page 64: Data Handling using Pandas -1

Missing Data and Filling Values

import pandas as pd
a=[[2,5,6,7,8],[8],[10,4],[5,8,9]]
df=pd.DataFrame(a) # shorter rows are padded with NaN
print(df)
#Replacing missing values with 0 using the fillna() function
df.fillna(0, inplace=True)
print(df)

Output

Page 65: Data Handling using Pandas -1

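# Apply each of the following to a DataFrame that still contains missing values;
# once fillna(0) has run, there is nothing left to fill.
#Replacing missing values with a constant, column-wise
df.fillna({0:-8, 1:-10, 2:-5}, inplace=True)
print(df)
Output
#Filling missing values by carrying the previous row's value forward (forward fill)
df.ffill(inplace=True)
print(df)
Output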
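The filling strategies are easier to compare side by side on a small Series. This is a minimal sketch with made-up values; interpolate() is included only for contrast, since forward fill repeats the last valid value and does not interpolate.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled_const = s.fillna(0)    # replace NaN with a constant
filled_fwd = s.ffill()        # carry the last valid value forward
filled_lin = s.interpolate()  # linear interpolation between valid neighbours

print(filled_const.tolist())
print(filled_fwd.tolist())
print(filled_lin.tolist())
```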

Page 66: Data Handling using Pandas -1

Concatenate two DataFrame objects with identical columns
import pandas as pd
df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
print(df1)
df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])
print(df2)
print(pd.concat([df1, df2]))
Output

Page 67: Data Handling using Pandas -1

Export Pandas DataFrame to a CSV File

e.g.

import pandas as pd

cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
        'Price': [2200000,2500000,2700000,3500000] }
df = pd.DataFrame(cars, columns= ['Brand', 'Price'])
df.to_csv (r'D:\export_dataframe.csv', index = False, header=True) # raw string keeps the backslash literal

print (df)
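To check what to_csv writes without touching the D: drive, the export can be sent to an in-memory buffer and read back. This is a sketch under that assumption; io.StringIO replaces the file path, and the names buffer, csv_text and restored are introduced here for illustration.

```python
import io
import pandas as pd

cars = {"Brand": ["Honda Civic", "Toyota Corolla", "Ford Focus", "Audi A4"],
        "Price": [2200000, 2500000, 2700000, 3500000]}
df = pd.DataFrame(cars, columns=["Brand", "Price"])

# Write the CSV text into a buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False, header=True)
csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back to confirm the round trip.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

With index=False the row labels are left out, so the first line of the CSV is just the column header.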