Real World Data Analysis: PANDAS
PYTHON PACKAGE FOR FAST, FLEXIBLE, AND EXPRESSIVE DATA STRUCTURES
DR. SYED IMTIYAZ HASSAN
ASSISTANT PROFESSOR, DEPARTMENT OF CSE, JAMIA HAMDARD (DEEMED TO BE UNIVERSITY), NEW DELHI, INDIA.
https://[email protected]://www.jamiahamdard.edu
INTRODUCTION
pandas is a Python package providing fast, flexible, and expressive data structures.
It is designed to make working with “relational” or “labeled” data both easy and intuitive.
Prepared from:
https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html
WELL SUITED FOR
Tabular data with heterogeneously-typed columns.
Ordered and unordered (not necessarily fixed-frequency) time series data.
Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
Any other form of observational / statistical data sets.
The data need not be labeled at all to be placed into a pandas data structure.
DATA STRUCTURES
Series: 1D labeled homogeneously-typed array.
DataFrame: General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns.
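As a minimal sketch of the two structures, the example below builds one of each. The names prices and menu and their values are made up for illustration and do not come from the slides.

```python
import pandas as pd

# A Series is one-dimensional and stores a single dtype for all values.
prices = pd.Series([9.5, 12.0, 7.25], index=["tea", "coffee", "juice"])

# A DataFrame is two-dimensional; each column keeps its own dtype.
menu = pd.DataFrame({
    "item": ["tea", "coffee", "juice"],
    "price": [9.5, 12.0, 7.25],
    "in_stock": [True, False, True],
})

# Size-mutable: a column can be added after construction.
menu["discounted"] = menu["price"] * 0.9

print(prices.dtype)
print(menu.dtypes)
```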
SERIES
Create a Series by passing a list of values, letting pandas create a default integer index.
import numpy as np
import pandas as pd
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
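The sketch below inspects the Series created above and, for contrast, builds a second Series with explicit labels. The name labeled and its index values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

# The missing value (NaN) is a float, so the whole Series is stored as float64.
print(s.dtype)

# With no index given, pandas assigns the integer positions 0..5 as labels.
print(list(s.index))

# An explicit index can be passed instead of the default one.
labeled = pd.Series([1, 3, 5], index=["a", "b", "c"])
print(labeled["b"])
```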
OBJECT CREATION
Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns.
dates = pd.date_range('20130101', periods=6)   # datetime index
dates
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))   # labeled columns
df
NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column.
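The difference between one dtype per array and one dtype per column can be seen in a small sketch. The frame mixed and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Two columns with different dtypes in the same DataFrame.
mixed = pd.DataFrame({"count": [1, 2, 3], "ratio": [0.5, 1.5, 2.5]})
print(mixed.dtypes)

# Converting to NumPy forces a single dtype that can hold every value,
# so the integer column is upcast to float.
arr = mixed.to_numpy()
print(arr.dtype)
```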
OBJECT CREATION
Create a DataFrame by passing a dict of objects that can be converted to a series-like structure.
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2
df2.dtypes   # each column keeps its own dtype
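VIEWING DATA
df.head()       # first five rows
df.tail(3)      # last three rows
df.index        # row labels
df.columns      # column labels
df.describe()   # quick statistical summary of each numeric column
df.T            # transpose
df.to_numpy()   # underlying data as a NumPy array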
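A small deterministic frame makes the viewing calls easier to follow. In this sketch the values come from np.arange rather than the random data used on the slides, which is an assumption made so the results are reproducible.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

first_rows = df.head()     # five rows by default
last_rows = df.tail(3)     # the last three rows
summary = df.describe()    # count, mean, std, min, quartiles, max per column
transposed = df.T          # columns become rows
values = df.to_numpy()     # plain 2-D NumPy array, labels dropped

print(summary.loc["mean"])
```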
SORTING
By Axis
df.sort_index(axis=1, ascending=False)   # sort the column labels in descending order
By Values
df.sort_values(by='B')                   # sort the rows by the values in column B
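The two kinds of sorting can be compared on a tiny frame. The frame and its labels below are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"A": [3, 1, 2], "B": [0.2, 0.9, 0.5]}, index=["x", "y", "z"])

# Sorting by axis reorders the labels; axis=1 means the column labels.
by_columns = frame.sort_index(axis=1, ascending=False)

# Sorting by values reorders the rows according to one column's contents.
by_values = frame.sort_values(by="B")

print(list(by_columns.columns))
print(list(by_values.index))
```

Both calls return a new object and leave frame itself unchanged.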
SELECTION
Getting
Selection by Label: df.loc, df.at
Selection by Position: df.iloc, df.iat
Boolean Indexing
GETTING
Selecting a single column, which yields a Series, equivalent to df.A:
df['A']
Selecting via [], which slices the rows:
df[0:3]
df['20130102':'20130104']
SELECTION BY LABEL
Selecting on a multi-axis by label.
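df1 = pd.DataFrame(np.random.randn(6, 4))
df1.loc[0]                                   # row with label 0
df.loc[dates[0]]                             # cross-section for the first date
df.loc[:, ['A', 'B']]                        # all rows, columns A and B
df.loc['20130102':'20130104', ['A', 'B']]    # label slices include both endpoints
df.loc['20130102', ['A', 'B']]               # a single row, reduced in dimension
df.loc[dates[0], 'A']                        # a scalar value
df.at[dates[0], 'A']                         # fast scalar access, equivalent to the previous line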
SELECTION BY POSITION
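df.iloc[3]                   # fourth row, by integer position
df.iloc[3:5, 0:2]            # integer slices exclude the end position, as in Python
df.iloc[[1, 2, 4], [0, 2]]   # lists of integer positions
df.iloc[1:3, :]              # row slice, all columns
df.iloc[:, 1:3]              # column slice, all rows
df.iloc[1, 1]                # a scalar value
df.iat[1, 1]                 # fast scalar access, equivalent to the previous line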
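One difference between the two selection styles is easy to miss: label slices with loc include both endpoints, while position slices with iloc exclude the end. A minimal sketch, using np.arange values instead of random data so the results are reproducible:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label slice: 2013-01-02 through 2013-01-04, both included (three rows).
by_label = df.loc["20130102":"20130104", ["A", "B"]]

# Position slice: rows 1, 2, 3 (position 4 is excluded) and columns 0, 1.
by_position = df.iloc[1:4, 0:2]

print(by_label.equals(by_position))

# Scalar access by label and by position.
print(df.at[dates[0], "A"], df.iat[1, 1])
```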
BOOLEAN INDEXING
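df[df.A > 0]   # rows where column A is positive
df[df > 0]     # values that fail the condition become NaN
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
df2[df2['E'].isin(['two', 'four'])]   # filter rows with isin()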
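The three forms of boolean indexing behave differently, which a small frame with known values makes visible. The data below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [-1.0, 2.0, -3.0, 4.0],
                   "E": ["one", "two", "three", "four"]})

# A boolean Series selects whole rows.
positive_rows = df[df["A"] > 0]

# A boolean DataFrame keeps the shape and puts NaN where the condition fails.
numeric = df[["A"]]
masked = numeric[numeric > 0]

# isin() builds the boolean Series from a list of allowed values.
picked = df[df["E"].isin(["two", "four"])]

print(positive_rows)
print(masked)
print(picked)
```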
SETTING
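s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
s1
df['F'] = s1               # a new column is aligned to df by index
df.at[dates[0], 'A'] = 0   # set a value by label
df.iat[0, 1] = 0           # set a value by position
df
df2 = df.copy()
df2[df2 > 0] = -df2        # a where operation with setting
df2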
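Setting a new column from a Series aligns the values by index rather than by position. In the sketch below s1 starts one day later than the frame, so the first row has no match. The single column A is a simplification of the slide's frame.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame({"A": np.arange(6.0)}, index=dates)

# s1 covers 2013-01-02 .. 2013-01-07, one day later than df.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

# Values are matched on the date labels: 2013-01-01 gets NaN,
# and the value for 2013-01-07 is dropped because df has no such row.
df["F"] = s1

print(df)
```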
MISSING DATA
df1 = df.copy()
Drop any rows that have missing data.
df1.dropna(how='any')
Filling missing data.
df1.fillna(value=5)
Get the Boolean mask where values are NaN.
pd.isna(df1)
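The three missing-data operations can be tried on a frame where the positions of the NaN values are known in advance. The values below are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                    "B": [4.0, 5.0, np.nan]})

dropped = df1.dropna(how="any")   # keeps only rows without any NaN
filled = df1.fillna(value=5)      # replaces every NaN with 5
mask = pd.isna(df1)               # True where a value is missing

print(dropped)
print(filled)
print(mask)
```

Each call returns a new object, so df1 still contains its missing values afterwards.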
OPERATIONS
Stats
Apply
Concat
Join
Append
Grouping
STATS
Operations in general exclude missing data.
df.mean()
Same operation on the other axis:
df.mean(axis=1)
APPLY
Applying functions to the data.
df.apply(np.cumsum)
df.apply(lambda x: x.max() - x.min())
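A short sketch with known values shows how the statistics skip missing data and how apply passes each column to a function. The frame is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, np.nan],
                   "B": [4.0, 6.0, 8.0]})

# Column means; the NaN in A is ignored, so A averages 1.0 and 2.0 only.
col_means = df.mean()

# Row means; the last row has a single valid value.
row_means = df.mean(axis=1)

# apply() calls the function once per column.
ranges = df.apply(lambda x: x.max() - x.min())

print(col_means)
print(row_means)
print(ranges)
```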
HISTOGRAM
s = pd.Series(np.random.randint(0, 7, size=10))
s
s.value_counts()
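With fixed values instead of random ones, the result of value_counts() can be checked by hand. The numbers below are illustrative.

```python
import pandas as pd

s = pd.Series([2, 4, 2, 6, 4, 2])

# The result is indexed by the distinct values and sorted by frequency,
# most frequent first.
counts = s.value_counts()

print(counts)
```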
CONCAT
df = pd.DataFrame(np.random.randn(10, 4))
df
pieces = [df[:3], df[3:7], df[7:]]   # break the frame into pieces
pieces
pd.concat(pieces)                    # reassemble the pieces row-wise
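Splitting a frame and concatenating the pieces should give back the original. The sketch below checks this on deterministic data, which is an assumption made so that the comparison is reproducible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(40).reshape(10, 4))

# Three row slices that together cover the whole frame.
pieces = [df[:3], df[3:7], df[7:]]

# concat stacks the pieces along the rows and keeps their original index labels.
rebuilt = pd.concat(pieces)

print(rebuilt.equals(df))
```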
JOIN
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
right
pd.merge(left, right, on='key')
When the key values are duplicated, as here, the merge returns every left/right combination (four rows). With unique keys, each row matches exactly once:
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
left
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
right
pd.merge(left, right, on='key')
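The two merge examples can be placed side by side to compare their row counts. The variable names below are chosen for this sketch only.

```python
import pandas as pd

# Duplicate keys: each 'foo' on the left pairs with each 'foo' on the right.
left_dup = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right_dup = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
merged_dup = pd.merge(left_dup, right_dup, on="key")

# Unique keys: each row finds exactly one partner.
left_uniq = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right_uniq = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
merged_uniq = pd.merge(left_uniq, right_uniq, on="key")

print(len(merged_dup), len(merged_uniq))
```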
APPEND
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df
s = df.iloc[3]
s
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df, s.to_frame().T], ignore_index=True)
GROUPING
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df
df.groupby('A')[['C', 'D']].sum()   # sum the numeric columns within each group of A
df.groupby(['A', 'B']).sum()        # grouping by several columns forms a hierarchical index
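Using integers in place of random numbers makes the grouped sums easy to verify. The shortened frame below is an illustrative assumption, not the data from the slides.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "B": ["one", "one", "two", "one"],
                   "C": [1, 2, 3, 4]})

# One group per distinct value of A.
by_a = df.groupby("A")["C"].sum()

# One group per distinct (A, B) pair; the result has a two-level index.
by_ab = df.groupby(["A", "B"])["C"].sum()

print(by_a)
print(by_ab)
```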
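PLOTTING
# ts is not defined on the slides; this definition is assumed from the pandas tutorial they are based on
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df
df = df.cumsum()   # running totals, so each column plots as a random walk
df.plot()          # one line per column, with labels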
DATA FILES
Format Type   Data Description     Reader           Writer
text          CSV                  read_csv         to_csv
text          JSON                 read_json        to_json
text          HTML                 read_html        to_html
text          Local clipboard      read_clipboard   to_clipboard
binary        MS Excel             read_excel       to_excel
binary        HDF5 Format          read_hdf         to_hdf
binary        Feather Format       read_feather     to_feather
binary        Parquet Format       read_parquet     to_parquet
binary        Msgpack              read_msgpack     to_msgpack
binary        Stata                read_stata       to_stata
binary        SAS                  read_sas
binary        Python Pickle Format read_pickle      to_pickle
SQL           SQL                  read_sql         to_sql
SQL           Google Big Query     read_gbq         to_gbq
Note: read_msgpack and to_msgpack were removed in pandas 1.0.
df.to_csv('foo.csv')                                                    # write to a CSV file
pd.read_csv('foo.csv')                                                  # read it back
df.to_excel('foo.xlsx', sheet_name='Sheet1')                            # write to an Excel file
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])   # read it back
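A CSV round trip can be sketched without touching the file system by writing to an in-memory buffer. The use of io.StringIO and the sample frame are assumptions made for this example; the slides write to foo.csv instead.

```python
import io
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

# to_csv writes the index as an unnamed first column by default.
buffer = io.StringIO()
df.to_csv(buffer)
text = buffer.getvalue()

# Read back as-is: the old index shows up as a data column.
naive = pd.read_csv(io.StringIO(text))

# Read back with index_col=0: the first column becomes the index again.
restored = pd.read_csv(io.StringIO(text), index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Passing index=False to to_csv is another option when the index carries no information.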
THANK YOU