Real World Data Analysis PANDAS · 5.03.2019 · Real World Data Analysis PANDAS PYTHON PACKAGE...
Real World Data Analysis: PANDAS
PYTHON PACKAGE FOR FAST, FLEXIBLE, AND EXPRESSIVE DATA STRUCTURES
DR. SYED IMTIYAZ HASSAN
ASSISTANT PROFESSOR, DEPARTMENT OF CSE, JAMIA HAMDARD (DEEMED TO BE UNIVERSITY), NEW DELHI, INDIA.
https://[email protected]://www.jamiahamdard.edu
INTRODUCTION
pandas is a Python package providing fast, flexible, and expressive data structures.
Designed to make working with “relational” or “labeled” data both easy and intuitive.
Prepared from:
https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html
WELL SUITED FOR
Tabular data with heterogeneously-typed columns.
Ordered and unordered (not necessarily fixed-frequency) time series data.
Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
Any other form of observational / statistical data sets.
The data actually need not be labeled at all to be placed into a pandas data structure.
DATA STRUCTURES
Series: 1D labeled homogeneously-typed array.
DataFrame: General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns.
import numpy as np
import pandas as pd
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
SERIES
Create a Series by passing a list of values, letting pandas create a default integer index.
import numpy as np
import pandas as pd
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
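As a minimal sketch of what the default index looks like: when no index is passed, pandas labels the values 0 to n-1, and the presence of np.nan makes the dtype float64. The print calls are only illustrative.

```python
import numpy as np
import pandas as pd

# No index is given, so pandas builds a default integer index 0..5.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

print(list(s.index))   # labels generated by pandas
print(s.dtype)         # float64, because NaN is a float value
print(s.isna().sum())  # number of missing entries
```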
OBJECT CREATION
Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns.
NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column.
Datetime index:
dates = pd.date_range('20130101', periods=6)
dates
Labeled columns:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
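The dtype difference can be illustrated with a small sketch. The column names `visits` and `ratio` are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype: mixing ints and floats upcasts everything.
arr = np.array([[1, 2.5], [3, 4.5]])
print(arr.dtype)

# A DataFrame tracks one dtype per column (hypothetical column names).
mixed = pd.DataFrame({'visits': [1, 3], 'ratio': [2.5, 4.5]})
print(mixed.dtypes)

# Converting back to NumPy forces one common dtype again.
as_array = mixed.to_numpy()
print(as_array.dtype)
```

This is also why df.to_numpy() can be expensive on frames with mixed column types: pandas has to find one dtype that can hold every column.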
OBJECT CREATION
Create a DataFrame by passing a dict of objects that can be converted to a series-like structure.
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2
df2.dtypes
VIEWING DATA
df.head()
df.tail(3)
df.index
df.columns
df.describe()
df.T
df.to_numpy()
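A short sketch of the viewing calls listed above, run on a deterministic stand-in for the random frame used in the slides. The np.arange data is an assumption made so the results are reproducible.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
# Deterministic stand-in for the random frame used on the slides.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

print(df.head())            # first 5 rows by default
print(df.tail(3))           # last 3 rows
print(df.index)             # DatetimeIndex of row labels
print(df.columns)           # column labels
summary = df.describe()     # count, mean, std, min, quartiles, max per column
print(summary)
print(df.T.shape)           # transpose swaps rows and columns
print(df.to_numpy().shape)  # underlying data as a NumPy array
```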
SORTING
By axis:
df.sort_index(axis=1, ascending=False)
By values:
df.sort_values(by='B')
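The two kinds of sorting can be sketched on a tiny frame. The values below are made up for illustration and are not from the slides.

```python
import pandas as pd

# Small hand-written frame so the sorted order is easy to verify by eye.
df = pd.DataFrame({'A': [3.0, 1.0, 2.0], 'B': [0.5, -1.0, 2.0]},
                  index=pd.date_range('20130101', periods=3))

# Sort along axis=1, i.e. reorder the column labels in descending order.
by_axis = df.sort_index(axis=1, ascending=False)

# Sort the rows by the values in column B (ascending by default).
by_value = df.sort_values(by='B')

print(by_axis)
print(by_value)
```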
SELECTION
Getting
Selection by Label: df.loc, df.at
Selection by Position: df.iloc, df.iat
Boolean Indexing
GETTING
Selecting a single column, which yields a Series, equivalent to df.A.
df['A']
df.A
Selecting via [], which slices the rows.
df[0:3]
df['20130102':'20130104']
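A minimal sketch of both uses of [], assuming a deterministic frame in place of the random one from the slides. Note that slicing by date labels includes both endpoints, while slicing by position excludes the stop.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

col = df['A']                           # single column -> Series
first_three = df[0:3]                   # positional row slice, stop excluded
by_date = df['20130102':'20130104']     # label row slice, both ends included

print(type(col).__name__)
print(first_three)
print(by_date)
```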
SELECTION BY LABEL
Selecting on a multi-axis by label.
df1 = pd.DataFrame(np.random.randn(6, 4))
df1.loc[0]
df.loc[dates[0]]
df.loc[:, ['A', 'B']]
df.loc['20130102':'20130104', ['A', 'B']]
df.loc['20130102', ['A', 'B']]
df.loc[dates[0], 'A']
df.at[dates[0], 'A']  # fast access to a scalar, equivalent to the previous line
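The label-based selections can be sketched as follows, again assuming a deterministic frame so the results are checkable.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

row = df.loc[dates[0]]                               # one row as a Series
sub = df.loc['20130102':'20130104', ['A', 'B']]      # label slice includes both endpoints
scalar_loc = df.loc[dates[0], 'A']                   # scalar via loc
scalar_at = df.at[dates[0], 'A']                     # same scalar via the faster at accessor

print(row)
print(sub)
print(scalar_loc, scalar_at)
```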
SELECTION BY POSITION
df.iloc[3]
df.iloc[3:5, 0:2]
df.iloc[[1, 2, 4], [0, 2]]
df.iloc[1:3, :]
df.iloc[:, 1:3]
df.iat[1, 1]
df.iloc[1, 1]
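A sketch of the positional selections on a deterministic frame (an assumption, so that each result can be read off directly). Integer slices follow Python and NumPy semantics, so the stop position is excluded.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

row = df.iloc[3]                       # fourth row as a Series
block = df.iloc[3:5, 0:2]              # rows 3-4, columns 0-1
picked = df.iloc[[1, 2, 4], [0, 2]]    # explicit lists of positions
cell_iat = df.iat[1, 1]                # fast scalar access by position
cell_iloc = df.iloc[1, 1]              # same scalar via iloc

print(block)
print(picked)
```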
BOOLEAN INDEXING
df[df.A > 0]
df[df > 0]
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
df2[df2['E'].isin(['two', 'four'])]
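The three forms of boolean indexing behave differently, which a small sketch with hand-picked values (not from the slides) makes visible: a column condition filters rows, a whole-frame condition keeps the shape and inserts NaN, and isin() filters by membership.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, -2.0, 3.0], 'B': [-1.0, 5.0, -6.0]})

pos_rows = df[df.A > 0]    # keeps only rows where column A is positive
masked = df[df > 0]        # same shape; non-matching cells become NaN

df2 = df.copy()
df2['E'] = ['one', 'two', 'four']
picked = df2[df2['E'].isin(['two', 'four'])]   # rows whose E value is in the list

print(pos_rows)
print(masked)
print(picked)
```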
SETTING
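# Setting a new column automatically aligns the data by the index.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
s1
df['F'] = s1
# Setting values by label.
df.at[dates[0], 'A'] = 0
# Setting values by position.
df.iat[0, 1] = 0
df
# A where operation with setting.
df2 = df.copy()
df2[df2 > 0] = -df2
df2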
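A minimal sketch of index alignment when setting a column, assuming a deterministic frame of positive values. Because s1 starts one day later than df, the first row of F has no matching label and becomes NaN, and the last value of s1 is dropped.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(1.0, 25.0).reshape(6, 4), index=dates, columns=list('ABCD'))

# s1 is indexed from 2013-01-02, one day after df starts.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
df['F'] = s1               # aligned on the index, not assigned by position

df.at[dates[0], 'A'] = 0   # set by label
df.iat[0, 1] = 0           # set by position (row 0, column B)

df2 = df.copy()
df2[df2 > 0] = -df2        # negate every positive cell; NaN and 0 are untouched

print(df)
print(df2)
```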
MISSING DATA
df1 = df.copy()
Drop any rows that have missing data.
df1.dropna(how='any')
Filling missing data.
df1.fillna(value=5)
Get the Boolean mask where values are NaN.
pd.isna(df1)
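The three operations can be sketched on a small frame with two missing cells. The data is made up for illustration. Each call returns a new object and leaves df1 unchanged.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4.0, 5.0, np.nan]})

dropped = df1.dropna(how='any')   # drop rows containing at least one NaN
filled = df1.fillna(value=5)      # replace every NaN with 5
mask = pd.isna(df1)               # True where a value is missing

print(dropped)
print(filled)
print(mask)
```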
OPERATIONS
Stats
Apply
Concat
Join
Append
Grouping
STATS
Operations in general exclude missing data.
df.mean()
Same operation on the other axis:
df.mean(axis=1)
APPLY
Applying functions to the data.
df.apply(np.cumsum)
df.apply(lambda x: x.max() - x.min())
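A sketch of how missing data is excluded from statistics, using a small frame with one NaN (illustrative values, not from the slides).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, np.nan], 'B': [4.0, 6.0, 8.0]})

col_means = df.mean()          # per column; the NaN in A is skipped
row_means = df.mean(axis=1)    # per row; the last row averages only B
running = df.apply(np.cumsum)  # cumulative sum down each column
spread = df.apply(lambda x: x.max() - x.min())  # range of each column

print(col_means)
print(row_means)
print(running)
print(spread)
```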
HISTOGRAM
s = pd.Series(np.random.randint(0, 7, size=10))
s
s.value_counts()
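Because the slide uses random integers, the counts differ on every run. A sketch with fixed values (an assumption for reproducibility) shows the shape of the result: the distinct values become the index, and the counts are sorted in descending order.

```python
import pandas as pd

s = pd.Series([4, 2, 1, 2, 4, 2])

# Distinct values become the index; counts are sorted from most to least frequent.
counts = s.value_counts()
print(counts)
```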
CONCAT
df = pd.DataFrame(np.random.randn(10, 4))
df
pieces = [df[:3], df[3:7], df[7:]]
pieces
pd.concat(pieces)
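A quick sketch confirming that concatenating the row slices reproduces the original frame. Deterministic data is assumed here so the comparison is exact.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(40.0).reshape(10, 4))

# Split the rows into three pieces, then stack them back together.
pieces = [df[:3], df[3:7], df[7:]]
rebuilt = pd.concat(pieces)

print(rebuilt.equals(df))
```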
JOIN
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
right
pd.merge(left, right, on='key')
JOIN
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
left
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
right
pd.merge(left, right, on='key')
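The two JOIN examples differ only in whether the key is repeated. A sketch placing them side by side (the variable names `dup` and `uniq` are hypothetical): with duplicate keys every left row pairs with every matching right row, giving 2 x 2 = 4 rows, while unique keys give one row per key.

```python
import pandas as pd

# Duplicate keys: each 'foo' on the left matches both 'foo' rows on the right.
left_dup = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right_dup = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
dup = pd.merge(left_dup, right_dup, on='key')

# Unique keys: one matching row per key.
left_uniq = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right_uniq = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
uniq = pd.merge(left_uniq, right_uniq, on='key')

print(dup)
print(uniq)
```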
APPEND
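df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df
s = df.iloc[3]
s
df.append(s, ignore_index=True)  # pandas < 2.0 only; DataFrame.append was removed in pandas 2.0
pd.concat([df, s.to_frame().T], ignore_index=True)  # equivalent in current pandas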
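DataFrame.append is no longer available from pandas 2.0 onward, so the following sketch appends the same row with pd.concat instead. Deterministic data is assumed so the result can be checked.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(32.0).reshape(8, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[3]   # the row to append, as a Series

# Turn the Series into a one-row frame, then stack it under df.
# ignore_index=True renumbers the rows 0..8.
appended = pd.concat([df, s.to_frame().T], ignore_index=True)

print(appended)
```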
GROUPING
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df
df.groupby('A').sum()
df.groupby(['A', 'B']).sum()
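A sketch of grouping with small integer values (made up for illustration) so the sums can be verified by hand. The numeric columns are selected explicitly here, which is a choice of this example so that only C and D are summed.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'one'],
                   'C': [1, 2, 3, 4],
                   'D': [10, 20, 30, 40]})

# One group per distinct value of A.
by_a = df.groupby('A')[['C', 'D']].sum()

# Grouping by two columns yields a hierarchical (MultiIndex) result.
by_ab = df.groupby(['A', 'B'])[['C', 'D']].sum()

print(by_a)
print(by_ab)
```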
PLOTTING
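ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))  # ts as defined in the pandas getting-started guide
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df
df.plot()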
DATA FILES
| Format Type | Data Description | Reader | Writer |
|---|---|---|---|
| text | CSV | read_csv | to_csv |
| text | JSON | read_json | to_json |
| text | HTML | read_html | to_html |
| text | Local clipboard | read_clipboard | to_clipboard |
| binary | MS Excel | read_excel | to_excel |
| binary | HDF5 Format | read_hdf | to_hdf |
| binary | Feather Format | read_feather | to_feather |
| binary | Parquet Format | read_parquet | to_parquet |
| binary | Msgpack | read_msgpack | to_msgpack |
| binary | Stata | read_stata | to_stata |
| binary | SAS | read_sas | |
| binary | Python Pickle Format | read_pickle | to_pickle |
| SQL | SQL | read_sql | to_sql |
| SQL | Google Big Query | read_gbq | to_gbq |
df.to_csv('foo.csv')
pd.read_csv('foo.csv')
df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
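A minimal sketch of a CSV round trip, written to a temporary directory so that no file is left behind. The temporary path and the small frame are assumptions of this example. It also shows a common surprise: to_csv writes the index as the first column, so a plain read_csv brings it back as an extra column unless index_col=0 is passed.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [0.5, 1.5, 2.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'foo.csv')
    df.to_csv(path)                             # the index is written as the first column
    naive = pd.read_csv(path)                   # index comes back as a column named 'Unnamed: 0'
    restored = pd.read_csv(path, index_col=0)   # use the first column as the index again

print(list(naive.columns))
print(restored)
```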
THANK YOU