Data Science Patterns: Preparing Data for Agile Data Science (BrightTalk Webinar)
Data Science Patterns
PREPARING DATA FOR AGILE DATA SCIENCE
Copyright Enda Ridge 2015
#GuerrillaAnalytics http://guerrilla-analytics.net
What You Will Learn
• Why you must identify and mitigate disruptions in projects
• What Data Science patterns are and how to use them effectively
How this will help you
• Data Scientists: you need to ‘think in patterns’
• Developers: you will productionise these patterns
• Managers and Directors: you need this capability in a high-performing team
What I’ve Learned
PhD: ‘Design of Experiments for Tuning Algorithms’
Boutique Consultancy
Forensic Data Analytics
Senior Manager, Professional Services
Head of Algorithms
2004 2008 2010 2012 2015
No matter the industry, teams were always plagued by the same problem …
Time was wasted preparing and revisiting data instead of delivering real Data Science value
Teams Need ‘Guerrilla Analytics’
Data
• Extraction
• Receipt
• Loading
Analytics
• Transform
• Algorithms
• Consolidate
Insight
• Reporting
• Work Products
Disruptions
Solution: Maintain Data Provenance
Data
Code
Business Domain
Agile Data Preparation Capability
Agility
3. Recognize & Implement Patterns
2. Supporting Tools
1. Simple Conventions
Data
WHAT IT LOOKS LIKE
WHAT IT SHOULD LOOK LIKE
What Raw Data Looks Like

Relational Data

Customer
firstName  lastName  age  addressID
John       Smith     25   340
Jane       Doe       36   158

Address
addressID  StreetAddress  City      State  postCode
340        21 2nd Street  New York  NY     10021
341        Main Street    Boston    MA     34041

JSON
{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
  },
  "phoneNumber": [
    { "type": "home", "number": "212 555-1234" },
    { "type": "fax", "number": "646 555-4567" }
  ]
}
What Raw Data Looks Like

Machine Data
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/5star2000.gif HTTP/1.0" 200 4005 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:50 -0400] "GET /pics/5star.gif HTTP/1.0" 200 1031 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /pics/a2hlogo.jpg HTTP/1.0" 200 4282 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
Data Scientists Need Data To Look Like This

Artist        Track           Week  Date        Rank
2 Pac         Baby Don’t Cry  1     2000-02-26  87
2 Pac         Baby Don’t Cry  2     2000-03-02  82
2 Pac         Baby Don’t Cry  3     2000-03-11  72
2 Pac         Baby Don’t Cry  4     2000-03-18  77
2 Pac         Baby Don’t Cry  5     2000-03-25  87
2 Pac         Baby Don’t Cry  6     2000-04-01  94
2 Pac         Baby Don’t Cry  7     2000-04-08  99
3 Doors Down  Kryptonite      1     2000-04-08  68
3 Doors Down  Kryptonite      2     2000-04-15  67
3 Doors Down  Kryptonite      3     2000-04-22  66

• One row per observation
• One variable per column
‘Tidy Data’, H. Wickham, Journal of Statistical Software 2014
Data Scientists Need Data To Look Like This
• Easier to describe relationships between variables (columns) than between rows
  2000-04-01 is the 6th week
Data Scientists Need Data To Look Like This
• Easier to describe relationships between variables (columns) than between rows
• Easier to do comparisons between groups of observations than between groups of columns
  Min, max, first, Nth, average, median
What Data Scientists Need Data To Look Like
• Variables organized by role
• Experiment design (fixed) on left
• Measurements on right
• De-normalised inefficiencies are OK!
Patterns are “Recurring solutions to common problems”
Architecture
Software
Data Science ?
Patterns: ‘Recurring solutions to common problems’
Joining Data
• Collecting
• Unique ID
• Map rename
• Fuzzy join
• Stacking
Transformation
• Duplicates
• Outliers
• Sampling
Tidying Data
• Sort
• Filter
• Derived variables
• Aggregations
• Pivot and unpivot
• Roll and unroll
• Previous/Next N
• Split-Apply-Combine
Pattern Matching
• Regular Expressions
Joining Patterns
• Collecting
• Unique ID
• Map rename
• Fuzzy join
• Stacking
Joining Pattern: Collecting Datasets
Situation: log files
2015-10-01.log
2015-10-02.log
2015-10-03.log
2015-10-04.log
2015-10-05.log
2015-10-06.log
2015-10-07.log
…
Capability
• Pull datasets by name (if you have a convention)
• Pull datasets by content
• Index and search
Schema
Joining Pattern: Collecting Datasets
Benefit
• Sampling: test and train
• Experimenting: factors
• Exploring: what’s in there?
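The "pull datasets by name" capability can be sketched as below. This is only an illustration, assuming the files follow the YYYY-MM-DD.log convention shown on the slide; the function name `collect_by_name` and the train/test date split are hypothetical choices, not something the talk prescribes.

```python
from datetime import date


def collect_by_name(filenames, start, end, suffix=".log"):
    """Return the files whose date-stamped name falls within [start, end]."""
    selected = []
    for name in filenames:
        if not name.endswith(suffix):
            continue
        stem = name[: -len(suffix)]
        try:
            stamp = date.fromisoformat(stem)
        except ValueError:
            continue  # the name does not follow the convention, so skip it
        if start <= stamp <= end:
            selected.append(name)
    return sorted(selected)


# Seven daily logs as on the slide, plus one file that breaks the convention.
files = [f"2015-10-{day:02d}.log" for day in range(1, 8)] + ["notes.txt"]

# Hypothetical split: first five days to train on, last two to test on.
train = collect_by_name(files, date(2015, 10, 1), date(2015, 10, 5))
test = collect_by_name(files, date(2015, 10, 6), date(2015, 10, 7))
```

Because the naming convention carries the date, sampling a test and train set becomes a one-line selection rather than a manual file hunt.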
Joining Pattern: Unique IDs
Situation: data refreshes

WEEK 1
Day_ID  date        Amt      Act
3477    2014-03-16  150,000  SETTLE
4598    2014-03-17  45,000   AMEND
…       …           …        …

WEEK 3.5
Day_ID  date        Amt      Act
3477    2014-03-16  150,000  SETTLE
4598    2014-03-17  45,001   AMEND
…       …           …        …

Capability
• Need to uniquely identify records, even when IDs exist in the data
• Hash functions turn a large amount of data into a ‘unique’ string
MD5(Guerrilla Analytics)  3b04a8085df05752e24c095f036c44f3
MD5(guerrilla analytics)  8f1438b18748981180e10b8c1365e4d9
Joining Pattern: Unique IDs

WEEK 1
Day_ID  date        Amt      Act     Hash_id
3477    2014-03-16  150,000  SETTLE  244072c9f78f59f7ca0ca93426db98da
4598    2014-03-17  45,000   AMEND   613b4ddfc4db2436e8b8deda26bc3c25
…       …           …        …

WEEK 3.5
Day_ID  date        Amt      Act     Hash_id
3477    2014-03-16  150,000  SETTLE  244072c9f78f59f7ca0ca93426db98da
4598    2014-03-17  45,001   AMEND   03a0d5e5646bfe60fce679a87ef4cd34
…       …           …        …
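A minimal sketch of the hash-ID idea follows. It assumes the row's fields are joined with a "|" separator before hashing; the slides do not say how the Hash_id values were built, so the digests produced here will not match the ones shown. The helper name `row_hash` is hypothetical.

```python
import hashlib

COLUMNS = ["Day_ID", "date", "Amt", "Act"]


def row_hash(row, columns=COLUMNS):
    """Hash the chosen fields of a record into a 32-character MD5 hex string."""
    joined = "|".join(str(row[col]) for col in columns)  # separator is an assumption
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


week_1 = [
    {"Day_ID": 3477, "date": "2014-03-16", "Amt": "150,000", "Act": "SETTLE"},
    {"Day_ID": 4598, "date": "2014-03-17", "Amt": "45,000", "Act": "AMEND"},
]
week_3_5 = [
    {"Day_ID": 3477, "date": "2014-03-16", "Amt": "150,000", "Act": "SETTLE"},
    {"Day_ID": 4598, "date": "2014-03-17", "Amt": "45,001", "Act": "AMEND"},
]

# Records in the refresh whose content hash was not seen in the earlier delivery.
old_hashes = {row_hash(row) for row in week_1}
changed = [row["Day_ID"] for row in week_3_5 if row_hash(row) not in old_hashes]
```

Here `changed` picks out record 4598, whose amount moved from 45,000 to 45,001 even though its Day_ID stayed the same.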
Joining Pattern: Map rename
Situation: lots of renaming

Day_ID  Cust        Amt      Act
3477    2014-03-16  150,000  SETTLE
4598    2014-03-17  45,000   AMEND
…       …           …        …

id      customer    amount   event
3477    2014-03-16  150,000  SETTLE
4598    2014-03-17  45,000   AMEND
…       …           …        …
Joining Pattern: Map rename
Situation: lots of renaming
Pattern

dataset  from    to
trades   Day_ID  id
trades   Amt     amount
…        …       …
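One way to read the mapping table is as data that drives the renaming, so that no column name is hard-coded in the transformation itself. The sketch below assumes that reading; `rename_columns` is a hypothetical helper, and only the two mapping rows visible on the slide are included, so the other columns keep their original names.

```python
# (dataset, from, to) rows copied from the slide; further rows are elided there.
RENAME_MAP = [
    ("trades", "Day_ID", "id"),
    ("trades", "Amt", "amount"),
]


def rename_columns(dataset_name, rows, rename_map):
    """Rename the columns of one dataset using the shared mapping table."""
    lookup = {src: dst for name, src, dst in rename_map if name == dataset_name}
    return [
        {lookup.get(column, column): value for column, value in row.items()}
        for row in rows
    ]


trades = [
    {"Day_ID": 3477, "Cust": "2014-03-16", "Amt": "150,000", "Act": "SETTLE"},
    {"Day_ID": 4598, "Cust": "2014-03-17", "Amt": "45,000", "Act": "AMEND"},
]
renamed = rename_columns("trades", trades, RENAME_MAP)
```

Adding another rename then means adding a row to the mapping, not editing code.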
Transformation Patterns
• Duplicates
• Outliers
• Sampling
Transformation Pattern: Duplicates
Situation: repeated data. Can’t decide what to remove

Artist        Track           Week  Date        Rank
2 Pac         Baby Don’t Cry  1     2000-02-26  87
2 Pac         Baby Don’t Cry  2     2000-03-02  82
2 Pac         Baby Don’t Cry  2     2000-03-02  82
2 Pac         Baby Don’t Cry  3     2000-03-11  72
2 Pac         Baby Don’t Cry  4     2000-03-18  77
2 Pac         Baby Don’t Cry  5     2000-03-25  87
2 Pac         Baby Don’t Cry  6     2000-04-01  94
2 Pac         Baby Don’t Cry  7     2000-04-08  99
3 Doors Down  Kryptonite      1     2000-03-02  82

Capability
• Tag repeating records
• Hold out and review in critical applications
• Tag records that repeat across arbitrary columns
Transformation Pattern: Duplicates

Artist        Track           Week  Date        Rank  Dupe_Full_id
2 Pac         Baby Don’t Cry  1     2000-02-26  87    1
2 Pac         Baby Don’t Cry  2     2000-03-02  82    2
2 Pac         Baby Don’t Cry  2     2000-03-02  82    2
2 Pac         Baby Don’t Cry  3     2000-03-11  72    3
2 Pac         Baby Don’t Cry  4     2000-03-18  77    4
2 Pac         Baby Don’t Cry  5     2000-03-25  87    5
2 Pac         Baby Don’t Cry  6     2000-04-01  94    6
2 Pac         Baby Don’t Cry  7     2000-04-08  99    7
3 Doors Down  Kryptonite      1     2000-03-02  82    8

Give duplicate groups an ID. Don’t delete!
Transformation Pattern: Duplicates

Artist        Track           Week  Date        Rank  Dupe_Full_id  Dupe_rank_date
2 Pac         Baby Don’t Cry  1     2000-02-26  87    1             1
2 Pac         Baby Don’t Cry  2     2000-03-02  82    2             2
2 Pac         Baby Don’t Cry  2     2000-03-02  82    2             2
2 Pac         Baby Don’t Cry  3     2000-03-11  72    3             3
2 Pac         Baby Don’t Cry  4     2000-03-18  77    4             4
2 Pac         Baby Don’t Cry  5     2000-03-25  87    5             5
2 Pac         Baby Don’t Cry  6     2000-04-01  94    6             6
2 Pac         Baby Don’t Cry  7     2000-04-08  99    7             7
3 Doors Down  Kryptonite      1     2000-03-02  82    8             2

Give multiple duplicate groups their own IDs.
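The tagging shown in the tables above can be sketched as follows: every distinct combination of the chosen columns gets a group ID in order of first appearance, and no row is deleted. The function name `tag_duplicate_groups` is hypothetical, and reading Dupe_rank_date as "duplicates on Date and Rank" is an assumption drawn from the column name and the IDs on the slide.

```python
COLUMNS = ("Artist", "Track", "Week", "Date", "Rank")

RAW = [
    ("2 Pac", "Baby Don't Cry", 1, "2000-02-26", 87),
    ("2 Pac", "Baby Don't Cry", 2, "2000-03-02", 82),
    ("2 Pac", "Baby Don't Cry", 2, "2000-03-02", 82),
    ("2 Pac", "Baby Don't Cry", 3, "2000-03-11", 72),
    ("2 Pac", "Baby Don't Cry", 4, "2000-03-18", 77),
    ("2 Pac", "Baby Don't Cry", 5, "2000-03-25", 87),
    ("2 Pac", "Baby Don't Cry", 6, "2000-04-01", 94),
    ("2 Pac", "Baby Don't Cry", 7, "2000-04-08", 99),
    ("3 Doors Down", "Kryptonite", 1, "2000-03-02", 82),
]
rows = [dict(zip(COLUMNS, values)) for values in RAW]


def tag_duplicate_groups(rows, key_columns, tag_column):
    """Add a group ID column; rows sharing the key columns share an ID."""
    group_ids = {}
    tagged = []
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in group_ids:
            group_ids[key] = len(group_ids) + 1  # IDs follow first appearance
        tagged.append({**row, tag_column: group_ids[key]})
    return tagged


# Full-row duplicates first, then duplicates on an arbitrary subset of columns.
tagged = tag_duplicate_groups(rows, COLUMNS, "Dupe_Full_id")
tagged = tag_duplicate_groups(tagged, ("Date", "Rank"), "Dupe_rank_date")

full_ids = [row["Dupe_Full_id"] for row in tagged]
rank_date_ids = [row["Dupe_rank_date"] for row in tagged]
```

Since the rows are only tagged, a reviewer can hold out any group with more than one member and decide later what, if anything, to remove.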
Pattern Matching
• Regular Expressions
Pattern Matching
Situation: getting content from large amounts of text
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
Capability: Find and extract arbitrary groups of text

ip        123.123.123.123
datetime  26/Apr/2000:00:23:48 -0400
verb      GET
target    /pics/wpaper.gif HTTP/1.0
return    200
etc.      http://www.jafsoft.com/asctortf/
Pattern Matching: Regular Expressions
Situation: getting content from large amounts of text
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
Regular Expression:
/^(\S+) \S+ \S+ \[([^\]]+)\] "([A-Z]+)[^"]*" \d+ \d+ "[^"]*" "([^"]*)"$/m
In words: from the beginning of the line, give me 1 to 3 digits, immediately followed by a dot, immediately followed by 1 to 3 digits … etc., up until I encounter the first “ - -”
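As a rough illustration of extracting those groups, the sketch below applies a variant of the slide's expression using Python's `re` module. The named groups and the extra captures (target, status, size, referrer) are additions for readability; they are not part of the expression shown in the talk.

```python
import re

# Variant of the slide's expression with named groups (names are assumptions).
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<datetime>[^\]]+)\] '
    r'"(?P<verb>[A-Z]+) (?P<target>[^"]*)" '
    r'(?P<status>\d+) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)

line = (
    '123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] '
    '"GET /pics/wpaper.gif HTTP/1.0" 200 6248 '
    '"http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"'
)

match = LOG_PATTERN.match(line)
# An empty dict signals a line that does not fit the expected layout.
fields = match.groupdict() if match else {}
```

Lines that fail to match return no fields, which makes malformed log entries easy to hold out for review instead of silently dropping them.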
Tidying Data Patterns
• Sort
• Filter
• Derived variables
• Aggregations
• Pivot and unpivot
• Roll and unroll
• Previous/Next N
• Split-Apply-Combine
Tidying Data Pattern: Split-apply-combine

Artist        Track           Week  Date        Rank
2 Pac         Baby Don’t Cry  1     2000-02-26  87
2 Pac         Baby Don’t Cry  2     2000-03-02  82
2 Pac         Baby Don’t Cry  3     2000-03-11  72
2 Pac         Baby Don’t Cry  4     2000-03-18  77
2 Pac         Baby Don’t Cry  5     2000-03-25  87
2 Pac         Baby Don’t Cry  6     2000-04-01  94
2 Pac         Baby Don’t Cry  7     2000-04-08  99
3 Doors Down  Kryptonite      1     2000-04-08  68
3 Doors Down  Kryptonite      2     2000-04-15  67
3 Doors Down  Kryptonite      3     2000-04-22  66

Capability
Apply arbitrary functions to arbitrary groups
Example
What was each artist’s lowest rank per month (i.e. their best track)?
Split-apply-combine: SPLIT
[Figure: two copies of the chart table above]
Split-apply-combine: APPLY
[Figure: two copies of the chart table above]
Split-apply-combine: COMBINE

Artist        Track           Week  Date        Rank
2 Pac         Baby Don’t Cry  1     2000-02-26  87
2 Pac         Baby Don’t Cry  3     2000-03-11  72
2 Pac         Baby Don’t Cry  6     2000-04-01  94
3 Doors Down  Kryptonite      3     2000-04-22  66
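The three steps can be sketched in plain Python as below, assuming the groups are (artist, calendar month) as the example question suggests. The function name is hypothetical; in practice a dataframe library or a SQL GROUP BY would play the same role.

```python
from collections import defaultdict

# (Artist, Track, Week, Date, Rank) rows from the chart table.
CHART = [
    ("2 Pac", "Baby Don't Cry", 1, "2000-02-26", 87),
    ("2 Pac", "Baby Don't Cry", 2, "2000-03-02", 82),
    ("2 Pac", "Baby Don't Cry", 3, "2000-03-11", 72),
    ("2 Pac", "Baby Don't Cry", 4, "2000-03-18", 77),
    ("2 Pac", "Baby Don't Cry", 5, "2000-03-25", 87),
    ("2 Pac", "Baby Don't Cry", 6, "2000-04-01", 94),
    ("2 Pac", "Baby Don't Cry", 7, "2000-04-08", 99),
    ("3 Doors Down", "Kryptonite", 1, "2000-04-08", 68),
    ("3 Doors Down", "Kryptonite", 2, "2000-04-15", 67),
    ("3 Doors Down", "Kryptonite", 3, "2000-04-22", 66),
]


def lowest_rank_per_artist_month(rows):
    # SPLIT: one group per (artist, YYYY-MM).
    groups = defaultdict(list)
    for row in rows:
        artist, date = row[0], row[3]
        groups[(artist, date[:7])].append(row)
    # APPLY: keep the row with the lowest rank value in each group.
    best = [min(group, key=lambda row: row[4]) for group in groups.values()]
    # COMBINE: gather the per-group results back into one ordered table.
    return sorted(best, key=lambda row: (row[0], row[3]))


result = lowest_rank_per_artist_month(CHART)
```

Swapping `min` for another function (first, Nth, median) changes the APPLY step without touching the split or the combine.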
Tidying Data Pattern: Unroll (and roll up)
Situation: data on one line

customer_id  session_id  basket
34567        12          45;67;235;9920fD
1232134      2           1345t;456234t

Capability
Get data into a Tidy format
Tidying Data Pattern: Unroll (and roll up)
Situation: data on one line

customer_id  session_id  basket            basket_item  item_order
34567        12          45;67;235;9920fD  45           1
34567        12          45;67;235;9920fD  67           2
34567        12          45;67;235;9920fD  235          3
34567        12          45;67;235;9920fD  99           4

SELECT customer_id, session_id, unnest(string_to_array(basket, ';')) AS basket_item FROM TheTable
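For readers working outside a database, the same unroll (and its inverse, roll up) might look like the sketch below. The basket value "45;67;235;99" is a simplified sample rather than the exact string on the slide, and both function names are hypothetical.

```python
def unroll(rows, column="basket", separator=";"):
    """Emit one row per item in a delimited column, recording its position."""
    unrolled = []
    for row in rows:
        items = row[column].split(separator)
        for position, item in enumerate(items, start=1):
            unrolled.append({**row, "basket_item": item, "item_order": position})
    return unrolled


def roll_up(rows, key_columns, separator=";"):
    """Inverse of unroll: rebuild the delimited string for each key."""
    grouped = {}
    for row in sorted(rows, key=lambda r: r["item_order"]):
        key = tuple(row[col] for col in key_columns)
        grouped.setdefault(key, []).append(row["basket_item"])
    return {key: separator.join(items) for key, items in grouped.items()}


sessions = [{"customer_id": 34567, "session_id": 12, "basket": "45;67;235;99"}]
tidy = unroll(sessions)
rebuilt = roll_up(tidy, ("customer_id", "session_id"))
```

Recording `item_order` during the unroll is what makes the roll up lossless, and it is also the input the Nth item pattern relies on.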
Tidying Data Pattern: Nth item
Situation: items have an order

customer_id  session_id  access_point         access
34567        12          45;67;235;99;235;99  45
34567        12          45;67;235;99;235;99  67
34567        12          45;67;235;99;235;99  235
34567        12          45;67;235;99;235;99  99
34567        12          45;67;235;99;235;99  235
34567        12          45;67;235;99;235;99  99
Tidying Data Pattern: Nth item
Situation: items have an order

customer_id  session_id  access_point         access  order
34567        12          45;67;235;99;235;99  45      1
34567        12          45;67;235;99;235;99  67      2
34567        12          45;67;235;99;235;99  235     3
34567        12          45;67;235;99;235;99  99      4
34567        12          45;67;235;99;235;99  235     5
34567        12          45;67;235;99;235;99  99      6

SELECT row_number() over (partition by customer_id, session_id) AS order FROM TheTable
Tidying Data Pattern: Nth item
Situation: items have an order
Can I see when users are flipping between access points?
Gap=1
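One possible reading of the "flipping" question is sketched below: a flip is flagged when a user returns to an access point after visiting `gap` other point(s) in between. That interpretation of "Gap=1" is an assumption, as is the function name `flip_positions`; the slide does not define either.

```python
def flip_positions(accesses, gap=1):
    """Return 1-based order positions where the user returns to an access point.

    A position is flagged when the item equals the one `gap + 1` steps earlier
    and differs from the item immediately before it.
    """
    step = gap + 1
    positions = []
    for index in range(step, len(accesses)):
        returned = accesses[index] == accesses[index - step]
        moved = accesses[index] != accesses[index - 1]
        if returned and moved:
            positions.append(index + 1)
    return positions


# The unrolled access column for customer 34567, session 12.
accesses = "45;67;235;99;235;99".split(";")
flips = flip_positions(accesses)
```

For this session the flagged positions are the 5th and 6th accesses, where the user bounces between points 235 and 99.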
Chaining of Patterns
1. Sort
2. Unroll
3. Nth item
4. Split-Apply-Combine
With high pattern maturity, focus is no longer on details of the ‘standard’ pattern.
Complex evolving code is easier to maintain.
Summing up
• Guerrilla Analytics requires agile teams
• Data Science Patterns are recurring solutions to data preparation problems
• Capability to recognize and implement patterns is key for high performance
• Pattern groups: Join, Transform, Pattern matching, Tidying, Chaining
Agility
3. Recognize & Implement Patterns
2. Supporting Tools
1. Simple Conventions
Find out more
@Enda_Ridge
http://guerrilla-analytics.net