Data Manipulation - Fabrice...

86
Data Manipulation Fabrice Rossi CEREMADE Université Paris Dauphine 2019

Transcript of Data Manipulation - Fabrice...

Page 1: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Data Manipulation

Fabrice Rossi

CEREMADEUniversité Paris Dauphine

2019

Page 2: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Data Manipulation

In this courseI tabular dataI elementary extension to

multiple-table dataI data transformation

I wranglingI filteringI ordering

I data aggregation andsummary

I tidy data and reshaping

In other coursesI database management

systemI data modelsI relational dataI unstructured data

2

Page 3: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Data Model

In this courseI a data set is

I a (finite) set of entities (a.k.a. objects, instances, subjects)I each entity is described by its values with respect to a fix set of

variables (a.k.a. attributes)I in practice a data set is a table with

I a row per entityI a column per variable

ExtensionI multiple-table dataI a data set = several tables

3

Page 4: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

age job marital education default balance housing1 30 unemployed married primary no 1787 no2 33 services married secondary no 4789 yes3 35 management single tertiary no 1350 yes4 30 management married tertiary no 1476 yes5 59 blue-collar married secondary no 0 yes6 35 management single tertiary no 747 no7 36 self-employed married tertiary no 307 yes8 39 technician married secondary no 147 yes9 41 entrepreneur married tertiary no 221 yes

10 43 services married primary no -88 yes11 39 services married secondary no 9374 yes12 43 admin. married secondary no 264 yes13 36 technician married tertiary no 1109 no14 20 student single secondary no 502 no15 31 blue-collar married secondary no 360 yes16 40 management married tertiary no 194 no17 56 technician married secondary no 4073 no18 37 admin. single tertiary no 2317 yes19 25 blue-collar single primary no -221 yes20 31 services married secondary no 132 no

4

Page 5: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Variable types

NumericalI essentially “physical”

measurementsI integer or decimalI easier to handle than the

other types

CategoricalI a.k.a. Nominal (factors and

levels in R)I finite number of values

(called categories ormodalities)

I might be ordered

Dates and timesI very important in numerous

applicationsI notoriously difficult to handleI use specific libraries!

Short textsI a.k.a. stringsI could be handled as

categorical dataI specific processing in some

casesI do not confuse them with full

texts

5

Page 6: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

Bank datasetI sources

I https://archive.ics.uci.edu/ml/datasets/Bank%2BMarketing

I http://hdl.handle.net/1822/14838

I data typesI age: integerI balance: integerI education: categorical semi orderedI most of the others: categorical with some binary

6

Page 7: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Data Management

Data manipulation softwareI typical examples: R with tidyverse or python with pandasI limited automatic support for enforcing complex data models

I declarative support for broad typesI constraints can be checked explicitly

very complex constraints can be enforcederror/bug pronedifficult to read

I documentation is needed

7

Page 8: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Outline

Introduction

Data transformation

Data grouping and summarizing

Tidy data

Multiple data tables

8

Page 9: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Subsetting

Working on a subpopulationI a.k.a by “removing” rowsI several motivations

I speed (for large data sets)I robustness (removing outliers)I modeling

I generally called filteringI declarative approach in R and python

I give me the subset of the data that fulfills some conditionsI supported by comparison and Boolean operators

9

Page 10: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

Bank data setMarried clients with secondary education level in their thirties (agebetween 30 and 39 included)

Python (pandas)bank[ (bank.marital == 'married') &

(bank.education == 'secondary') &(bank.age >= 30) &(bank.age < 40) ]

R (dplyr)

bank %>% filter(marital == "married",education == "secondary",age >= 30,age < 40)

10

Page 11: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

Bank data setMarried clients with secondary education level in their thirties (agebetween 30 and 39 included)

Python (pandas)bank[ (bank.marital == 'married') &

(bank.education == 'secondary') &(bank.age >= 30) &(bank.age < 40) ]

R (dplyr)

bank %>% filter(marital == "married",education == "secondary",age >= 30,age < 40)

10

Page 12: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Computational considerations

Running timeI filtering is a row oriented operationI naive implementation

I browse the data row by rowI keep a row if the conditions are fulfilled

I run time proportional to the number of rows in the dataI can be improved in some cases via indexing

Do not program it yourself!I far less efficientI less readable

11

Page 13: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Computational considerations

Running timeI filtering is a row oriented operationI naive implementation

I browse the data row by rowI keep a row if the conditions are fulfilled

I run time proportional to the number of rows in the dataI can be improved in some cases via indexing

Do not program it yourself!I far less efficientI less readable

11

Page 14: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Dropping Variables

Column oriented subsettingI two main motivations

I to restrict the data set to variable types compatible with sometechnique

I to restrict the data set to meaningful variables for an automatedanalysis (e.g. clustering or predictive modeling)

I simple declarative approach

12

Page 15: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

Bank data setKeep some numerical variables

Python (pandas)bank[['age', 'balance', 'day', 'duration']]

R (dplyr)

bank %>% select(age, balance, day, duration)

13

Page 16: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

Bank data setKeep some numerical variables

Python (pandas)bank[['age', 'balance', 'day', 'duration']]

R (dplyr)

bank %>% select(age, balance, day, duration)

13

Page 17: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Ordering

SortingI standard sorting featureI multiple criteria

Pythonbank.sort_values(by=['age',

'balance'])

Rbank %>% arrange(age, balance)

14

Page 18: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Transformation of Variables

Variable OperationsI modifying a variableI adding new variables from

other sourcesI computing new variables

(based on existing ones)

Data WranglingI low level transformationI recodingI extraction and mergingI etc.

Data ManagementI context variablesI enforcing data model

Preparing AnalysisI e.g. recoding categorical to

numericalI or quantifying numerical

variablesI scaling, normalizationI merging categories

15

Page 19: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Computed variables

PrincipleI row oriented calculationI create a new variable using the existing onesI e.g. duration from starting and ending timesI combines nicely with aggregation/summary functions

SupportI numerous statistical summary functions (column oriented)I column oriented arithmetic (e.g. sum of columns)I column oriented logical operations (e.g. comparison)I function application (e.g. to each row)

16

Page 20: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

Bank data setBinary variable telling whether some client has some characteristicsI more than average mean annual balanceI at least one loan

Pythonbank['moreavg'] = bank['balance'] > bank['balance'].mean()bank['oneormoreloan'] = (bank['loan'] == 'yes') | (bank['housing'] == 'yes')

Rbank %>% mutate(moreavg = balance > mean(balance))bank %>% mutate(oneormoreloan = loan=="yes" | housing=="yes")

17

Page 21: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

One Hot EncodingI typical preparatory transformationI categorical variable turn into a set of binary variablesI e.g. Gender=Male or Female transformed into GenderMale and

GenderFemale (binary)

Pythonbank.join(pd.get_dummies(bank['education']))bank.join(pd.get_dummies(bank['education'])).drop('education',axis=1)

Rbank %>%

bind_cols(as.data.frame(model.matrix(~education-1,data=bank))) %>%select(-education)

18

Page 22: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

age job marital education default balance housing1 30 unemployed married primary no 1787 no2 33 services married secondary no 4789 yes3 35 management single tertiary no 1350 yes4 30 management married tertiary no 1476 yes5 59 blue-collar married secondary no 0 yes6 35 management single tertiary no 747 no7 36 self-employed married tertiary no 307 yes8 39 technician married secondary no 147 yes9 41 entrepreneur married tertiary no 221 yes

10 43 services married primary no -88 yes11 39 services married secondary no 9374 yes12 43 admin. married secondary no 264 yes13 36 technician married tertiary no 1109 no14 20 student single secondary no 502 no15 31 blue-collar married secondary no 360 yes16 40 management married tertiary no 194 no17 56 technician married secondary no 4073 no18 37 admin. single tertiary no 2317 yes19 25 blue-collar single primary no -221 yes20 31 services married secondary no 132 no

19

Page 23: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

age job marital primary secondary tertiary unknown1 30 unemployed married 1 0 0 02 33 services married 0 1 0 03 35 management single 0 0 1 04 30 management married 0 0 1 05 59 blue-collar married 0 1 0 06 35 management single 0 0 1 07 36 self-employed married 0 0 1 08 39 technician married 0 1 0 09 41 entrepreneur married 0 0 1 0

10 43 services married 1 0 0 011 39 services married 0 1 0 012 43 admin. married 0 1 0 013 36 technician married 0 0 1 014 20 student single 0 1 0 015 31 blue-collar married 0 1 0 016 40 management married 0 0 1 017 56 technician married 0 1 0 018 37 admin. single 0 0 1 019 25 blue-collar single 1 0 0 020 31 services married 0 1 0 0

20

Page 24: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Improving Representation

Data ManagementI convert values to proper

types, e.g.I integer only to language

supported integerI data as string to language

supported dateI string to semantic content

I proprer encoding of missingdata

I add diagnostic data

Nominal dataI important particular caseI frequently represented by

stringsI loss of efficiencyI cannot leverage automatic

handling (in R in particular)

21

Page 25: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

Data RepresentationI make sure that nominal variables are recognized as suchI yes/no nominal variables can be encoded as logical variables

Pythonfor myvar in ['job', 'marital', 'education']:

bank[myvar] = bank[myvar].astype('category')for myvar in ['default', 'housing', 'loan']:

bank[myvar] = bank[myvar] == 'yes'

Rbank %>% mutate_at(vars(job,marital,education), ~(as.factor(.)))bank %>% mutate_at(vars(default,housing,loan), ~(. == "yes"))

22

Page 26: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

Unknown valuesI specific “unknown” category in the bank dataset (for several

variables)I not considered “special” by the software while it should be

PythonNumpy provides a special value nan for not availablebank.replace("unknown",np.nan)

RSimilar special value NA

bank %>% mutate_all(~replace(., . == "unknown", NA))

23

Page 27: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Outline

Introduction

Data transformation

Data grouping and summarizing

Tidy data

Multiple data tables

24

Page 28: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Conditional analysis

Finding dependencies and linksOne of the main goal of data analysis, e.g.I predictive models: links between target variables and explanatory

variablesI frequent patterns: variables that are frequently non zero at the

same timeI etc.

Conditional summariesI chose one or more variablesI for each possible combination of the values of the chosen

variablesI find all corresponding objects in the data setI compute a summary of the other variables on this subset

25

Page 29: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Mechanism

X Y4 C2 B2 C3 C4 B1 C4 C4 A3 B3 A1 A1 B1 A3 B2 C

Split

Y XA 4A 3A 1A 1

Y XB 2B 4B 3B 1B 3

Y XC 4C 2C 3C 1C 4C 2

Apply

Y SumA 9

Y SumB 13

Y SumC 16

CombineY SumA 9B 13C 16

26

Page 30: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

Bank datasetI balance conditioned on the response to the marketing campaignI age conditioned on marital status and education level

Pythonbank.groupby('y')['balance'].median()bank.groupby(['marital', 'education'])['age'].mean()

Rbank %>% group_by(y) %>% summarize(median_balance = median(balance))bank %>% group_by(marital, education) %>%

summarize(mean_age = mean(age))

27

Page 31: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

Balance versus marketing

y median_balanceno 419.50yes 710.00

Age versus marital andeducation

marital education mean_agedivorced primary 51.39divorced secondary 43.50divorced tertiary 45.15divorced missing 50.38married primary 47.51married secondary 42.40married tertiary 41.78married missing 48.44single primary 37.01single secondary 33.05single tertiary 34.51single missing 34.65

28

Page 32: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Pivot Table

Conditioning by two variablesI useful special caseI we can leverage standard tabular representation

I compute a standard aggregated table with two conditioning variablesand a single aggregate

I “pivot” the tableI remove one of conditioning variable and the aggregateI creates as many columns as they are values of the removed variableI use the aggregate to populate cells

29

Page 33: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Mechanism

X Y Z2 B U2 C V3 A U2 B U1 C U4 C V3 B U4 C U1 B V3 A U2 A V4 A U3 A V4 B U3 B U

S-A-C

Y Z SumA U 10A V 5B U 14B V 1C U 5C V 6

PivotY/Z U VA 10 5B 14 1C 5 6

30

Page 34: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

Age versus marital and education

marital education mean_agedivorced primary 51.39divorced secondary 43.50divorced tertiary 45.15divorced missing 50.38married primary 47.51married secondary 42.40married tertiary 41.78married missing 48.44single primary 37.01single secondary 33.05single tertiary 34.51single missing 34.65

marital/education primary secondary tertiary missingdivorced 51.39 43.50 45.15 50.38married 47.51 42.40 41.78 48.44single 37.01 33.05 34.51 34.65

31

Page 35: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Howto

PythonSpecific support for Pivot tablesbank.pivot_table('age', index = 'marital', columns = 'education')

RA particular case of table reshaping

bank %>% group_by(marital, education) %>%summarize(mean_age = mean(age)) %>% spread(key = education,value = mean_age)

32

Page 36: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Multidimensional analysis (MDA)

Pivot (hyper)cubeI a pivot table with more than 2 “dimensions” (e.g. a pivot cube)I specific vocabulary:

I a dimension: a variable with a finite set of possible valuesI a measure: a numerical variableI a cell contains aggregate values for objects with given values for the

dimension

a very convoluted way of presenting conditional analysisrich possibilities when the set of values of a “dimension” isstructured (e.g. postcodes)rich support with specific OLAP software (not in this part of thecourse)

33

Page 37: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

Bank data setI Possible dimensions: job, marital, education, housing, loanI Possible measures: age and balanceI A cell (such as unemployed, married, primary education, no

housing and no loan) contains the average age and the medianbalance for the persons with the specified values on thedimensions

Tabular point of view

job marital education housing loan mean_age median_balanceadmin. divorced primary no no 57.00 1.00admin. divorced primary no yes 56.00 0.00admin. divorced primary yes no 57.00 179.00admin. divorced secondary no no 41.80 432.00admin. divorced secondary no yes 45.83 175.00

and 337 additional rows...

34

Page 38: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

Tabular viewy housing loan nno no no 1394no no yes 267no yes no 1958no yes yes 381yes no no 283yes no yes 18yes yes no 195yes yes yes 25

MDA view

y="no"housing/loan no yesno 1394 267yes 1958 381

y="yes"housing/loan no yesno 283 18yes 195 25

arranged as a cube!

35

Page 39: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Outline

Introduction

Data transformation

Data grouping and summarizing

Tidy data

Multiple data tables

36

Page 40: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Tidy Data

Concept introduced by Hadley Wickham the Tidy Data paper

DefinitionA data set made of several data tables is tidy if

1. each variable forms a column of a table2. each observation forms a row of a table3. each type of observational unit forms a table

Observational unitI a particular type of observations in a data setI e.g.

I personsI daily behavior of personsI etc.

37

Page 41: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

Bank data setI mixes person information and marketing campaign information

(and economic variables in an extended version!)I marketing campaign data

I last contact dataI contacts during the campaignI summary of previous campaigns!

I clearly untidyI violates property 3I multiple types of observational unit in a single table

38

Page 42: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

Pivot tablehousing/loan no yesno 1677 285yes 2153 406

I intrinsically untidyI columns are not variables but variable values!

Summary table

loan housing nno no 1677no yes 2153yes no 285yes yes 406

I tidy dataI new observational unit: groups of observations!

39

Page 43: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

Pivot tablehousing/loan no yesno 1677 285yes 2153 406

I intrinsically untidyI columns are not variables but variable values!

Summary table

loan housing nno no 1677no yes 2153yes no 285yes yes 406

I tidy dataI new observational unit: groups of observations!

39

Page 44: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Tidying Data

Splitting or JoiningI observational unit levelI splitting

I separates different observational unit into several tablesI filtering/selecting + linking

I joiningI merges several tables about the same observational unitI specific tools (see the last part of this course)

40

Page 45: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Tidying Data

Gathering or SpreadingI variable/observation levelI specific toolsI gathering

I reduces the number of columnsI merges several columns in a single one that corresponds to a proper

variableI spreading

I increases the number of columnsI splits a column into several ones that correspond to proper variables

41

Page 46: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example (Splitting)

Splitting bank marketing dataSteps:

1. add an identifier to each row (to identify clients)2. select variables associated to each observational unit, keeping the

id2.1 persons2.2 current campaign (i.e. last contact)2.3 previous campaigns

3. clean the tables (remove useless rows, e.g. when no previouscampaign is available)

42

Page 47: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example (Splitting)

PythonI pandas data frames have always row identifiersI splitting and cleaning

bank.persons = bank[['age', 'job', 'marital', 'education', 'default','balance', 'housing', 'loan']]

bank.current = bank[['contact', 'day', 'month', 'duration','campaign']]

bank.previous = bank[bank['pdays'] != -1][['pdays', 'previous','poutcome']]

43

Page 48: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example (Splitting)

R1. identifier

bank.tidy <- bank %>% mutate(key = 1:nrow(bank))

2. selection with cleanupbank.persons <- bank.tidy %>% select(key, age, job, marital,

education, default, balance, housing, loan)bank.current <- bank.tidy %>% select(key, contact, day, month,

duration, campaign)bank.previous <- bank.tidy %>% filter(pdays != -1) %>% select(key,

pdays, previous, poutcome)

44

Page 49: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Spreading data

Replace variable encoded on rows by columnsI operates on two variables in the original table: a key and a valueI each value taken by the key becomes a columnI the value variable is used to fill the column

Original table

X Y Z1 A 21 B 32 A 42 B 5

Spread tableY is the key, Z is the value

X A B1 2 32 4 5

45

Page 50: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example (Spreading)

CalIt2 datasetI flow (in number of persons) in and out a buildingI direction encoded in the flow column

flow date time count7 2005-07-24 00:00:00 09 2005-07-24 00:00:00 07 2005-07-24 00:30:00 19 2005-07-24 00:30:00 07 2005-07-24 01:00:00 09 2005-07-24 01:00:00 07 2005-07-24 01:30:00 09 2005-07-24 01:30:00 07 2005-07-24 02:00:00 09 2005-07-24 02:00:00 0

I untidy: flow is not a variable!I spreading is needed

46

Page 51: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example (Spreading)

CalIt2 datasetI flow (in number of persons) in and out a buildingI tidy version

date time entering leaving2005-07-24 00:00:00 0 02005-07-24 00:30:00 1 02005-07-24 01:00:00 0 02005-07-24 01:30:00 0 02005-07-24 02:00:00 0 02005-07-24 02:30:00 2 02005-07-24 03:00:00 0 02005-07-24 03:30:00 0 02005-07-24 04:00:00 0 02005-07-24 04:30:00 0 0

47

Page 52: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example (Spreading)

PythonI spreading can be done by leveraging the indexing systemI hierarchical indexing in this case

calit.set_index(['date', 'time'], inplace = True)tcalit = calit.pivot(columns='flow')tcalit.columns = tcalit.columns.to_flat_index()tcalit.rename(columns={('count', 7):'entering',

('count', 9): 'leaving'},inplace=True)

tcalit.reset_index(inplace = True)

I this is somewhat convoluted

48

Page 53: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example (Spreading)

PythonI spreading can be also be done with pivot_table

tcalit = calit.pivot_table(values='count',index=['date', 'time'],columns='flow')

tcalit.rename(columns={7: 'entering', 9: 'leaving'},inplace=True)

tcalit.reset_index(inplace=True)

I a bit simpler

RStandard use of the spread function

calit %>% spread(flow, count) %>% rename(entering = `7`, leaving = `9`)

49

Page 54: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Gathering data

GatherI gathering is the reverse of spreadingI it reduces the number of columns by encoding them as a series of

rows and two new columns/variablesI the new key variable encode the gathered columns while the new

value variable contains the original value

Original table

Gather X, Y and ZW X Y Za 1 2 3b 5 6 7

Spread table

W K Va X 1a Y 2a Z 3b X 5b Y 6b Z 7

50

Page 55: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example (Gathering)

Sales transaction data setI weekly product salesI one row per product, one column per week: product view

Product_Code W0 W1 W2 W3 W4 W5P1 11.00 12.00 10.00 8.00 13.00 12.00P2 7.00 6.00 3.00 2.00 7.00 1.00P3 7.00 11.00 8.00 9.00 10.00 8.00P4 12.00 8.00 13.00 5.00 9.00 6.00P5 8.00 5.00 13.00 11.00 6.00 7.00P6 3.00 3.00 2.00 7.00 6.00 3.00P7 4.00 8.00 3.00 7.00 8.00 7.00P8 8.00 6.00 10.00 9.00 6.00 8.00P9 14.00 9.00 10.00 7.00 11.00 15.00P10 22.00 19.00 19.00 29.00 20.00 16.00

with 53 columns and 811 rows

51

Page 56: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example (Gathering)

Gather the weeksI gather all the columns except the product oneI one observation: (product, week)

Product_Code Week QuantityP1 1 11P2 1 7P3 1 7P4 1 12P5 1 8P6 1 3P7 1 4P8 1 8P9 1 14P10 1 22

with 42162 more rows

52

Page 57: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example (Gathering)

PythonUsing the melt functionprodlong = pd.melt(prodperweek, 'Product_Code')prodlong.rename(columns={'variable': 'Week',

'value': 'Quantity'},inplace=True)

prodlong['Week'] = pd.to_numeric(prodlong['Week'].str[1:])+1

RStandard use of the gather function

prodperweek %>% gather(Week, Quantity, -Product_Code) %>%mutate(Week = parse_number(Week) + 1)

53

Page 58: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Tidy Data

Modeling assumptionsI data are tidy only with respect to some modeling assumptionsI what is an observation?

I maybe the most important assumptionI very frequently associated to an independence assumption

Practical aspectsI the data format must be adapted to the toolI data mining and machine learning

I generally limited to a single tableI one might need to merge tables (e.g. bank data set)

I column oriented software (R)I have limited row oriented capabilitiesI might need a specific untidy representation

54

Page 59: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Tidy Data

ExamplesI bank marketing: several object types or only one (person +

marketing action)I flow in/out a building

I the half-hour bidirectional flow is an observationI or a day is an observation

I weekly salesI product point of view (original data set)I product × week point of viewI week point of view

55

Page 60: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Outline

Introduction

Data transformation

Data grouping and summarizing

Tidy data

Multiple data tables

56

Page 61: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Multiple data tables

Tidy dataI one table per observational

unitI complex data

I multiple observational units(e.g. persons andproducts)

I multiple tables!

DifficultiesI the vast majority of data

analysis methods are limitedto single tables

I complex real world data usemultiple table!

I a core data manipulationtask: join multiple tables intodata analysis oriented tables

57

Page 62: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

Loan application data setI https://relational.fit.cvut.cz/dataset/Financial

I 8 tables includingI client tableI account tableI credit card tableI loan tableI etc.

I open ended data set: no specific goal

58

Page 63: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Associations and keys

Relational dataI tables must be related one to anotherI in the relational model a table is a relationI some relations describe entities while others describe links

between entities

KeysI a key is a (set of) variable(s) that uniquely identifies an entityI a primary key does that in the relation/table that describe the entityI a foreign key does that in another relation/table

59

Page 64: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Loan application data set

Client tableclient_id gender birth_date district_id

1 F 1970-12-13 182 M 1945-02-04 13 F 1940-10-09 14 M 1956-12-01 55 F 1960-07-03 5

KeysI primary key client_idI foreign key

district_id

Account tableaccount_id district_id date

1 18 1995-03-242 1 1993-02-263 5 1997-07-074 12 1996-02-215 15 1997-05-30

KeysI primary key account_idI foreign key

district_id

60

Page 65: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Loan application data set

Disposition tabledisp_id client_id account_id type

1 1 1 OWNER2 2 2 OWNER3 3 2 DISPONENT4 4 3 OWNER5 5 3 DISPONENT

KeysI primary key disp_idI foreign keys client_id

and account_id

Link tableI the disposition table/relation is a typical example of a link tableI it associates clients with accountsI the link is also an entity as it has characteristics

61

Page 66: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Loan application data set

District tabledistrict_id Name Region Inhabitants A5 A6 A7 A8

1 Hl.m. Praha Prague 1204953 0 0 0 12 Benesov central Bohemia 88884 80 26 6 23 Beroun central Bohemia 75232 55 26 4 14 Kladno central Bohemia 149893 63 29 6 25 Kolin central Bohemia 95616 65 30 4 16 Kutna Hora central Bohemia 77963 60 23 4 27 Melnik central Bohemia 94725 38 28 1 38 Mlada Boleslav central Bohemia 112065 95 19 7 1

Direct linksI no link table from accounts and clients to the district tableI district_id is used as a foreign key in the account and client

tables

62

Page 67: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Joining tables

Main operationsI we need to build unique tables that gather information from

separate onesI this is done via join operations

I identifying matching entities in two different tablesI generating tables which combine variables from said tables for

matched entities

ExampleI joining client information with account informationI joining client information with district information

63

Page 68: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Basic case

PrincipleI input

I two tables A and BI A contains a variable V which is a foreign keyI B contains the same variable V as its primary key

I resultI a table with all the variables in A and B (no repeat)I such that each entity in A is merged with the entity in B referenced

by the value of V

64

Page 69: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Example

Client with district informationI join the client table with the district tableI district_id: foreign key in the client table, primary key in the

district tableI (part of the) resultclient_id gender birth_date district_id Name Region Inhabitants A5 A6 A7

1 F 1970-12-13 18 Pisek south Bohemia 70699 60 13 22 M 1945-02-04 1 Hl.m. Praha Prague 1204953 0 0 03 F 1940-10-09 1 Hl.m. Praha Prague 1204953 0 0 04 M 1956-12-01 5 Kolin central Bohemia 95616 65 30 45 F 1960-07-03 5 Kolin central Bohemia 95616 65 30 4

Pythonpd.merge(client, district)

Rclient %>% inner_join(district)

Common semantics: natural joinI common columns/variables are considered as a key

65

Page 70: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Join types

Missing keysI missing foreign keysI unreferenced foreign keysI wrong foreign keys

How to build the join table?I intersection approachI missing data approach

Inner joinI most common solutionI keeps only full rowsI discard rows that would have

missing values

Outer joinsI produce tables with missing

dataI full or asymmetric (left or

right) joins

66

Page 71: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Join types

Missing keysI missing foreign keysI unreferenced foreign keysI wrong foreign keys

How to build the join table?I intersection approachI missing data approach

Inner joinI most common solutionI keeps only full rowsI discard rows that would have

missing values

Outer joinsI produce tables with missing

dataI full or asymmetric (left or

right) joins

66

Page 72: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

All cases

Left tablex y

-3 14 25 NA

Missing foreign key

Right tabley z1 a2 b3 c

Unreferenced key

Inner joinx y z

-3 1 a4 2 b

Only full rows

Full outer joinx y z

-3 1 a4 2 b5 NA NA

NA 3 c

All combinations

Left outer joinx y z

-3 1 a4 2 b5 NA NA

All rows from the lefttable

Right outer joinx y z

-3 1 a4 2 b

NA 3 c

All rows from theright table

67

Page 73: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

All cases

Left tablex y

-3 14 25 NA

Missing foreign key

Right tabley z1 a2 b3 c

Unreferenced key

Inner joinx y z

-3 1 a4 2 b

Only full rows

Full outer joinx y z

-3 1 a4 2 b5 NA NA

NA 3 c

All combinations

Left outer joinx y z

-3 1 a4 2 b5 NA NA

All rows from the lefttable

Right outer joinx y z

-3 1 a4 2 b

NA 3 c

All rows from theright table

67

Page 74: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

All cases

Left tablex y

-3 14 25 NA

Missing foreign key

Right tabley z1 a2 b3 c

Unreferenced key

Inner joinx y z

-3 1 a4 2 b

Only full rows

Full outer joinx y z

-3 1 a4 2 b5 NA NA

NA 3 c

All combinations

Left outer joinx y z

-3 1 a4 2 b5 NA NA

All rows from the lefttable

Right outer joinx y z

-3 1 a4 2 b

NA 3 c

All rows from theright table

67

Page 75: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

All cases

Left tablex y

-3 14 25 NA

Missing foreign key

Right tabley z1 a2 b3 c

Unreferenced key

Inner joinx y z

-3 1 a4 2 b

Only full rows

Full outer joinx y z

-3 1 a4 2 b5 NA NA

NA 3 c

All combinations

Left outer joinx y z

-3 1 a4 2 b5 NA NA

All rows from the lefttable

Right outer joinx y z

-3 1 a4 2 b

NA 3 c

All rows from theright table

67

Page 76: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Implementation

PythonI merge functionI pd.merge(left, right):

inner joinI parameters

I how: join type among'left', 'right','outer', 'inner'

I on: column name(s) for thejoin

I many others

RI merge in base RI dplyr:

I several functions in withexplicit names:inner_join,full_join, left_join,right_join

I by parameter: columnname(s) for the join

left %>% full_join(right)

68

Page 77: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Multiple joins

Loan dataI table with clients and

accountsI difficulties

I link tableI duplicate variable name

district_id

SolutionI two joinsI columns renaming

Pythonda = pd.merge(disposition, account)da.rename(columns={

'district_id':'acc_district_id'},

inplace=True)fulldata = pd.merge(da, client)

Rdisposition %>% inner_join(account) %>%

rename(acc_district_id=district_id) %>%

inner_join(client)

69

Page 78: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Multiple joins

Loan dataI table with clients and

accountsI difficulties

I link tableI duplicate variable name

district_id

SolutionI two joinsI columns renaming

Pythonda = pd.merge(disposition, account)da.rename(columns={

'district_id':'acc_district_id'},

inplace=True)fulldata = pd.merge(da, client)

Rdisposition %>% inner_join(account) %>%

rename(acc_district_id=district_id) %>%

inner_join(client)

69

Page 79: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Result

disp_id client_id account_id type1 1 1 OWNER2 2 2 OWNER3 3 2 DISPONENT4 4 3 OWNER5 5 3 DISPONENT6 6 4 OWNER7 7 5 OWNER8 8 6 OWNER9 9 7 OWNER

10 10 8 OWNER11 11 8 DISPONENT12 12 9 OWNER13 13 10 OWNER14 14 11 OWNER15 15 12 OWNER16 16 12 DISPONENT17 17 13 OWNER18 18 13 DISPONENT19 19 14 OWNER20 20 15 OWNER

account_id district_id date1 18 1995-03-242 1 1993-02-263 5 1997-07-074 12 1996-02-215 15 1997-05-306 51 1994-09-277 60 1996-11-248 57 1995-09-219 70 1993-01-27

10 54 1996-08-2811 76 1995-10-1012 21 1997-04-1513 76 1997-08-1714 47 1996-11-2715 70 1993-10-0216 12 1997-09-2317 1 1997-01-0818 43 1993-05-2619 21 1995-04-0720 74 1996-08-24

70

Page 80: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Result

disp_id client_id account_id type acc_district_id date1 1 1 OWNER 18 1995-03-242 2 2 OWNER 1 1993-02-263 3 2 DISPONENT 1 1993-02-264 4 3 OWNER 5 1997-07-075 5 3 DISPONENT 5 1997-07-076 6 4 OWNER 12 1996-02-217 7 5 OWNER 15 1997-05-308 8 6 OWNER 51 1994-09-279 9 7 OWNER 60 1996-11-24

10 10 8 OWNER 57 1995-09-2111 11 8 DISPONENT 57 1995-09-2112 12 9 OWNER 70 1993-01-2713 13 10 OWNER 54 1996-08-2814 14 11 OWNER 76 1995-10-1015 15 12 OWNER 21 1997-04-1516 16 12 DISPONENT 21 1997-04-1517 17 13 OWNER 76 1997-08-1718 18 13 DISPONENT 76 1997-08-1719 19 14 OWNER 47 1996-11-2720 20 15 OWNER 70 1993-10-02

71

Page 81: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Result

client_id type acc_district_id date1 OWNER 18 1995-03-242 OWNER 1 1993-02-263 DISPONENT 1 1993-02-264 OWNER 5 1997-07-075 DISPONENT 5 1997-07-076 OWNER 12 1996-02-217 OWNER 15 1997-05-308 OWNER 51 1994-09-279 OWNER 60 1996-11-24

10 OWNER 57 1995-09-2111 DISPONENT 57 1995-09-2112 OWNER 70 1993-01-2713 OWNER 54 1996-08-2814 OWNER 76 1995-10-1015 OWNER 21 1997-04-1516 DISPONENT 21 1997-04-1517 OWNER 76 1997-08-1718 DISPONENT 76 1997-08-1719 OWNER 47 1996-11-2720 OWNER 70 1993-10-02

client_id gender birth_date district_id1 F 1970-12-13 182 M 1945-02-04 13 F 1940-10-09 14 M 1956-12-01 55 F 1960-07-03 56 M 1919-09-22 127 M 1929-01-25 158 F 1938-02-21 519 M 1935-10-16 60

10 M 1943-05-01 5711 F 1950-08-22 5712 M 1981-02-20 4013 F 1974-05-29 5414 F 1942-06-22 7615 F 1918-08-28 2116 M 1919-02-25 2117 M 1934-10-13 7618 F 1931-04-05 7619 M 1942-12-28 4720 M 1979-01-04 46

72

Page 82: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Result

disp_id client_id account_id type acc_district_id date gender birth_date district_id1 1 1 OWNER 18 1995-03-24 F 1970-12-13 182 2 2 OWNER 1 1993-02-26 M 1945-02-04 13 3 2 DISPONENT 1 1993-02-26 F 1940-10-09 14 4 3 OWNER 5 1997-07-07 M 1956-12-01 55 5 3 DISPONENT 5 1997-07-07 F 1960-07-03 56 6 4 OWNER 12 1996-02-21 M 1919-09-22 127 7 5 OWNER 15 1997-05-30 M 1929-01-25 158 8 6 OWNER 51 1994-09-27 F 1938-02-21 519 9 7 OWNER 60 1996-11-24 M 1935-10-16 60

10 10 8 OWNER 57 1995-09-21 M 1943-05-01 5711 11 8 DISPONENT 57 1995-09-21 F 1950-08-22 5712 12 9 OWNER 70 1993-01-27 M 1981-02-20 4013 13 10 OWNER 54 1996-08-28 F 1974-05-29 5414 14 11 OWNER 76 1995-10-10 F 1942-06-22 7615 15 12 OWNER 21 1997-04-15 F 1918-08-28 2116 16 12 DISPONENT 21 1997-04-15 M 1919-02-25 2117 17 13 OWNER 76 1997-08-17 M 1934-10-13 7618 18 13 DISPONENT 76 1997-08-17 F 1931-04-05 7619 19 14 OWNER 47 1996-11-27 M 1942-12-28 4720 20 15 OWNER 70 1993-10-02 M 1979-01-04 46

73

Page 83: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Advanced join topics

Support for real world issuesI key selectionI variable renamingI filtering join (dplyr R only)I enforcing/checking unicity of keys (python only)I index based join (python only)I etc.

74

Page 84: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Licence

This work is licensed under a Creative CommonsAttribution-ShareAlike 4.0 International License.

http://creativecommons.org/licenses/by-sa/4.0/

75

Page 85: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Version

Last git commit: 2019-12-11By: Fabrice Rossi ([email protected])Git hash: af2cee4da140c15fb47c0ff45a9ff8b1028fcbd0

76

Page 86: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse

Changelog

I November 2019: added multiple table dataI October 2019: initial version

77