Database Data Mining: Practical R Enterprise and Oracle Advanced Analytics
Husnu Sensoy [email protected]
Global Maksimum Data & Information Technologies
October 2, 2012
Content
1 Introduction
2 Oracle R Enterprise in Practice
    Data Visualization
    A Bit of Probability and Information Theory
    Optimization
    Text Analysis & Decision Trees
3 Wrap up
Who am I?
- Founder at Global Maksimum Data & Information Technologies
- in BI Domain
- Oracle Magazine DBA of the Year in 2009
Global Maksimum Data & Information Technologies
- A group of people who know what they are doing, focused mainly on data and the transformation of data into information.
- Complex Event Processing
    - 1.2 million events per second on 2x2 Socket Nehalem Blades
- Data Mining
    - Churn prediction models for telcos
    - Marketing target selection models
- Large-Scale Database Management System Projects
    - 120+ TB Exadata migration from UNIX systems
    - Exadata Master Class across the EMEA region for Exadata customers, Oracle partners, and Oracle staff in the region
Advanced Analytics
- In the first version of BI we just filter rows, project columns, aggregate them using some functions, and give only what the customer asks for.
- Once we focus on machine-generated data, or Big Data, dealing with the data as we did before becomes more and more fruitless.
- That is mainly because there is only a tiny amount of information available in this pile of data.
- So it requires better tricks, automation, and post-analysis capabilities.
In-database Advanced Analytics
- For the enterprise, 80% of data mining activity is feature engineering.
- Feature engineering requires an iterative process of
    - Filtering data (WHERE)
    - Aggregating data (GROUP BY)
    - Transforming data (CASE, DECODE, COALESCE, etc.)
- It is almost impossible to maintain an integrated mining environment (scripts, files, metafiles, etc.) outside the database.
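To make the three steps concrete, here is a minimal base R sketch on a made-up call-record table. The data, the column names (customer, minutes, type) and the 50-minute threshold are assumptions for illustration only; in the database the same steps would be expressed with WHERE, GROUP BY and CASE.

```r
# Hypothetical call-detail records (illustrative data only)
calls <- data.frame(
  customer = c("a", "a", "b", "b", "b", "c"),
  minutes  = c(12, 0, 45, 30, 5, 0),
  type     = c("voice", "voice", "voice", "data", "voice", "data"),
  stringsAsFactors = FALSE
)

# Filter (WHERE): keep only records with actual usage
used <- subset(calls, minutes > 0)

# Aggregate (GROUP BY): total minutes per customer
features <- aggregate(minutes ~ customer, data = used, FUN = sum)

# Transform (CASE): bucket customers by usage level (assumed threshold)
features$segment <- ifelse(features$minutes >= 50, "heavy", "light")

print(features)
```

Each step produces a new table, which is why the process tends to be iterative: a new filter or a new bucket definition means re-running everything downstream.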
Oracle Advanced Analytics Toolkit
- SQL-2003 & Extensions
- Oracle Data Mining
- Oracle Spatial Extensions
- Flow-based mining with SQL Developer
- Oracle R Enterprise
- R is a free software environment for statistical computing and graphics.
- Most newcomers (young data scientists) who have recently graduated, or are about to graduate, from top universities use R.
- Batteries are included.
- Runs on all modern platforms.
Oracle R Enterprise
- The data you can process with standard R is limited by the amount of memory available on the server running R.
- To work around this, people implement their own solutions to store data outside memory, or fall back on data sampling techniques.
- ORE is an extension to standard R that adds Oracle steroids to it.
- The basic idea is to off-load R commands seamlessly to Oracle Database or Oracle Big Data Appliance.
This session
This session is not an R tutorial but rather a flyover of some possible solutions to real-life scenarios using R.
If you need an R tutorial, please refer to
- Rob Kabacoff. R in Action. Manning, 2010
- Oracle R Enterprise Training 2 - Introduction to R
- R Studio
Data Visualization
- Advanced data analysis usually starts and ends with data visualization.
- Before modeling anything, data scientists use graphs and charts to figure out the behaviour of the data.
- After modeling, they again turn to charts to report the results.
- R supports tens of different charting and graphing packages. To mention just two of them:
    - lattice is used to generate conditioned graphs (a.k.a. trellis graphs).
    - ggplot2 is used to make graph generation more consistent in R.
Histogram
- Do you see any significant pattern in the distribution?
- Do you like the way the histogram is represented?

    source("~/r-snipplets/oow2012/mydata.r", local=TRUE)
    dataset = generateCustomer()
    h = hist(dataset$BillperPeriod, freq=TRUE,
             ylab="Number of Customers",
             xlab="Bill Amount",
             main="Bill Amount Distribution")
Remove the Outliers
Do you see any significant pattern in the distribution?

    source("~/r-snipplets/oow2012/mydata.r", local=TRUE)
    dataset = generateCustomer()

    # Drop the rows whose value in `column` lies above the q-th quantile
    nooutlier = function(data, column, q=0.99, inclusive=TRUE){
      cutoff = quantile(data[, column], na.rm=TRUE,
                        probs=q, names=FALSE)
      if (inclusive){
        pruned = subset(data, data[, column] <= cutoff)
      } else {
        pruned = subset(data, data[, column] < cutoff)
      }
      pruned
    }

    pruned = nooutlier(dataset, "BillperPeriod", 0.99)
    h = hist(pruned$BillperPeriod, freq=TRUE,
             ylab="Number of Customers",
             xlab="Bill Amount",
             main="Bill Amount Distribution")
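Since generateCustomer() is not part of this transcript, the effect of quantile pruning can be sketched on synthetic data instead. The log-normal bill amounts below are an assumption chosen only because they are right-skewed; they are not the data used in the talk.

```r
set.seed(42)  # reproducible synthetic data

# Hypothetical stand-in for the customer data: right-skewed bill amounts
bills <- data.frame(BillperPeriod = rlnorm(1000, meanlog = 3, sdlog = 1))

# Cutoff at the 99th percentile of the bill amounts
cutoff <- quantile(bills$BillperPeriod, probs = 0.99, names = FALSE)

# Keep only the rows at or below the cutoff
pruned_bills <- bills[bills$BillperPeriod <= cutoff, , drop = FALSE]

nrow(bills) - nrow(pruned_bills)  # number of rows dropped
max(bills$BillperPeriod)          # largest value before pruning
max(pruned_bills$BillperPeriod)   # largest value after pruning
```

Only about 1% of the rows are removed, yet the range of the histogram's x axis shrinks considerably, which is what makes the bulk of the distribution visible.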
Conditional Histograms

    source("~/r-snipplets/oow2012/mydata.r", local=TRUE)
    source("~/r-snipplets/oow2012/commons.r", local=TRUE)
    dataset = generateCustomer()
    pruned = nooutlier(dataset, "BillperPeriod", 0.99)

    library(lattice)
    histogram(~ BillperPeriod | UsingServiceX,
              data=pruned)
Too Many Columns to Visualize

    source("~/r-snipplets/oow2012/mydata.r", local=TRUE)
    source("~/r-snipplets/oow2012/commons.r", local=TRUE)
    dataset = generateCustomer()
    head(dataset)
    pruned = nooutlier(dataset, "BillperPeriod", 0.99)

    library(lattice)
    histogram(~ BillperPeriod | CarBrand,
              data=pruned)
A Bit of Probability and Information Theory
Comparing Histograms
- We need a way to calculate the similarity between those histograms.
- A strong tool from information theory, the Kullback-Leibler divergence, allows us to define a distance measure between two distributions.

    # Share of rows falling between consecutive quantiles of data[, col],
    # smoothed by sf so that no bin is exactly zero
    equiwidth = function(data, col, n=10, sf=1e-6){
      qlist = quantile(data[, col], na.rm=TRUE,
                       probs=(1:n)/n,
                       names=FALSE)
      result = c()
      for (edge in qlist){
        # cumulative share of rows at or below this bin edge
        result = c(result,
                   nrow(subset(data, data[, col] <= edge)) / nrow(data))
      }
      # differences of the cumulative shares give the per-bin shares
      result[1:n] - c(0, result[1:(n-1)]) + rep(sf, n)
    }
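For two histograms to be comparable bin by bin, both samples have to be counted against the same bin edges. The sketch below is one possible way to do that, not the function from the slide: it takes the edges from the deciles of a baseline sample and reuses them for a subgroup. The samples, the helper name bin_proportions and the smoothing factor are all assumptions for illustration.

```r
set.seed(1)
baseline_x <- rnorm(1000, mean = 50, sd = 10)  # hypothetical baseline sample
group_x    <- rnorm(200,  mean = 60, sd = 10)  # hypothetical subgroup sample

n <- 10
# Bin edges: deciles of the baseline, opened up to +/-Inf so that
# subgroup values outside the baseline range are still counted
edges <- quantile(baseline_x, probs = seq(0, 1, length.out = n + 1),
                  names = FALSE)
edges[1] <- -Inf
edges[n + 1] <- Inf

# Share of x in each bin, plus a small smoothing factor sf
bin_proportions <- function(x, edges, sf = 1e-6) {
  counts <- table(cut(x, breaks = edges))
  as.numeric(counts) / length(x) + sf  # sf keeps empty bins away from zero
}

p_base  <- bin_proportions(baseline_x, edges)
p_group <- bin_proportions(group_x, edges)
round(rbind(p_base, p_group), 3)
```

By construction the baseline has roughly one tenth of its mass in every bin, so any departure from that in the subgroup row reflects a difference between the two distributions.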
KL Divergence & Symmetry
- $D_{KL}(P \| Q) = \sum_i P(i) \ln \frac{P(i)}{Q(i)}$
- Notice that $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$
- So we simply take the average of the two to obtain a symmetric measure.

    # Symmetrised KL divergence: average of both directions (log base 2)
    kldistance = function(dist1, dist2){
      kl1 = 0.0
      for (i in 1:length(dist1)){
        kl1 = kl1 + dist1[i] * log(dist1[i] / dist2[i], 2)
      }
      kl2 = 0.0
      for (i in 1:length(dist1)){
        kl2 = kl2 + dist2[i] * log(dist2[i] / dist1[i], 2)
      }
      (kl1 + kl2) / 2
    }
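The asymmetry and the effect of averaging are easy to check on a two-bin example. This is a minimal vectorised sketch; the function names kl and kl_sym and the two distributions are made up for illustration, and the logarithm is base 2, so values are in bits.

```r
# KL divergence in bits; p and q are probability vectors of equal length
# with no zero entries
kl <- function(p, q) sum(p * log2(p / q))

# Symmetrised version: the average of the two directions
kl_sym <- function(p, q) (kl(p, q) + kl(q, p)) / 2

p <- c(0.5, 0.5)
q <- c(0.25, 0.75)

kl(p, q)      # direction P -> Q
kl(q, p)      # direction Q -> P, a different value
kl_sym(p, q)  # equals kl_sym(q, p)
kl_sym(p, p)  # 0 for identical distributions
```

The "no zero entries" condition is the reason a small smoothing factor is added to every bin before distributions are compared: a single empty bin would otherwise make the divergence infinite or undefined.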
Top 5 Car Brands whose Owners Diverge from Baseline

    Brand      KL
    Lancia     8.969125
    Lincoln    8.969125
    Proton     7.572549
    Daewoo     7.572549
    Pontiac    6.421267

    ddf = NULL
    baseline = equiwidth(pruned, "BillperPeriod")
    for (brand in dataset[!duplicated(dataset[, c('CarBrand')]), 1]){
      brandDist = equiwidth(subset(pruned,
                                   pruned[, 'CarBrand'] == brand),
                            "BillperPeriod")
      ddf = rbind(ddf,
                  data.frame(carbrand=brand,
                             kl=kldistance(baseline,
                                           brandDist)))
    }
    head(ddf[order(ddf$kl, decreasing=TRUE), ])
Optimization
Problem Definition
- We have a terrain covered by several stations, and each point on the terrain has one of the following statuses:
    GREEN   Region is in the line of sight (LoS) of at least one station.
    YELLOW  Region is in the LoS of at least one station but far away.
    RED     Region is out of LoS.
- For a fixed number of stations, we need to cover as much as we can.
Model Sketch Up [1]
1 Define a function to calculate the ratio of green zones on the terrain.
2 Give this function to one of the optimization routines of R (the Nelder-Mead technique), which can handle non-smooth target functions.
3 Get the optimal station distribution.

    targetfunc = function(observer){
      # one row per station: (x, y)
      m = matrix(data=observer, ncol=2, byrow=TRUE)

      # Compute merged status of all observers
      mergedstatus <- rep("red", length(terr$height))
      for (i in seq_len(nrow(m))){
        terr$dist2observer = distance(terr, c(m[i, ], 7))
        status = LoS(terr, c(m[i, ], 7), maxDist)
        mergedstatus = updatestatus(mergedstatus, status)
      }

      # number of green points (proportional to the green ratio)
      sum(mergedstatus == "green")
    }

    # fnscale=-1 turns optim's default minimization into maximization
    result <- optim(observers, targetfunc,
                    control=list(fnscale=-1, trace=5,
                                 REPORT=1))

[1] Refer to LoS Analysis (Part 4)
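The helpers distance, LoS and updatestatus and the terr data are not included in this transcript, so the following is a simplified, self-contained sketch of the same idea rather than the model from the talk. It assumes a flat 20 x 20 grid and a circular station range in place of a real line-of-sight computation; the grid size, radius and starting positions are all made up.

```r
# Hypothetical flat terrain: a 20 x 20 grid of points
grid <- expand.grid(x = 1:20, y = 1:20)
radius <- 6  # assumed station range

# Number of grid points within range of at least one station.
# `stations` is a flat vector (x1, y1, x2, y2, ...), as optim expects.
coverage <- function(stations) {
  m <- matrix(stations, ncol = 2, byrow = TRUE)
  covered <- rep(FALSE, nrow(grid))
  for (i in seq_len(nrow(m))) {
    d <- sqrt((grid$x - m[i, 1])^2 + (grid$y - m[i, 2])^2)
    covered <- covered | (d <= radius)
  }
  sum(covered)
}

start <- c(3, 3, 4, 4)  # two stations, poorly placed in one corner
res <- optim(start, coverage, method = "Nelder-Mead",
             control = list(fnscale = -1))  # fnscale = -1 means maximise

coverage(start)  # covered points before optimisation
res$value        # covered points after optimisation
matrix(res$par, ncol = 2, byrow = TRUE)  # station coordinates found
```

The objective is a count, so it is piecewise constant and has no useful gradient. That is the reason for choosing a derivative-free method such as Nelder-Mead; the result is a local optimum and depends on the starting positions.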
    [Figure: 1 Station (54%)]
    [Figure: 3 Stations (83%)]
    [Figure: 6 Stations (99%)]
Text Analysis & Decision Trees
Problem Definition
- Given a string that subscribers have written incorrectly, either intentionally or by mistake, how can we build a model which can deduce the most probable string among 3 possibilities (or choose not to make any decision)?

Our legitimate strings are mom, dad, and brother. And we have

    brothe          → brother
    bro             → brother
    brother1        → brother
    p               → ?
    1234            → ?
    mom.i.came.home → mom
    mmomyy          → mom
    dad[atwork]     → dad
    dod             → dad
Model Sketch Up
1 Do some feature engineering
    - Length of the string
    - Prefix flag (3 attributes, one for each legitimate string)
    - Contains flag (3 attributes, one for each legitimate string)
    - Anything else?
2 Build a classifier to classify those texts based on those features.
3 Evaluate your classifier.
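The feature engineering step can be sketched in base R as follows. The helper name make_features and the column names it produces are assumptions for illustration; the generateText() function used in the talk is not part of this transcript and may build its features differently.

```r
legit <- c("mom", "dad", "brother")

# Build length, prefix and contains features for a vector of raw strings
make_features <- function(x) {
  out <- data.frame(original = x, length = nchar(x),
                    stringsAsFactors = FALSE)
  for (w in legit) {
    out[[paste0("prefix_", w)]] <- startsWith(x, w)           # starts with w
    out[[paste0("instr_", w)]]  <- grepl(w, x, fixed = TRUE)  # contains w
  }
  out
}

feats <- make_features(c("brother1", "mom.i.came.home", "dod", "1234"))
print(feats)
```

Note that "dod" gets no flag at all: exact prefix and substring tests cannot see a typo inside the word, which is the gap that an "anything else?" feature has to fill.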
First Model

    source("~/r-snipplets/oow2012/mydata.r", local=TRUE)
    df = generateText()

    library(rpart)

    # grow tree
    fit <- rpart(corrected ~ length + prefixBrother + prefixDad +
                   prefixMom + instrBrother +
                   instrDad + instrMom,
                 method="class", data=df)

    table(pred = predict(fit, df, type="class"),
          true = df$corrected)

             true
    pred      ?  brother  dad  mom
    ?        20        0    0    0
    brother   0       30   10    0
    dad       0        0   10    0
    mom       0        0    0   20
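The confusion matrix can be summarised in a couple of numbers. The sketch below simply re-enters the counts shown above and derives overall accuracy and per-class recall; the variable names are illustrative.

```r
# Confusion matrix of the first model (rows: predicted, columns: true)
labels <- c("?", "brother", "dad", "mom")
cm <- matrix(c(20,  0,  0,  0,
                0, 30, 10,  0,
                0,  0, 10,  0,
                0,  0,  0, 20),
             nrow = 4, byrow = TRUE,
             dimnames = list(pred = labels, true = labels))

accuracy <- sum(diag(cm)) / sum(cm)  # share of correct predictions
recall   <- diag(cm) / colSums(cm)   # share recovered per true class

accuracy
recall
```

Accuracy is 80 out of 90, and all of the errors come from one cell: half of the true "dad" strings are predicted as "brother". Because predict() is called on the same df the tree was fitted on, these are training-set figures.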
More Feature Engineering using Jaro-Winkler Algorithm
Jaro-Winkler distance is a string similarity measure that can be used for fuzzy string matching and is resilient to typos.

    library(RecordLinkage)

    enhanced = data.frame(df,
                          momScore = jarowinkler("mom", df$orginal),
                          dadScore = jarowinkler("dad", df$orginal),
                          brotherScore = jarowinkler("brother", df$orginal))

             true
    pred      ?  brother  dad  mom
    ?        20        0    0    0
    brother   0       30    0    0
    dad       0        0   20    0
    mom       0        0    0   20
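To show what such a score looks like without depending on the RecordLinkage package, here is a from-scratch sketch of the standard Jaro-Winkler similarity in base R. It is an illustration of the textbook definition (prefix weight 0.1, prefix length capped at 4), not the package's implementation, and the function name jaro_winkler is made up.

```r
jaro_winkler <- function(a, b, p = 0.1) {
  s1 <- strsplit(a, "")[[1]]
  s2 <- strsplit(b, "")[[1]]
  n1 <- length(s1)
  n2 <- length(s2)
  if (n1 == 0 || n2 == 0) return(0)

  # Characters match if equal and no further apart than this window
  window <- max(floor(max(n1, n2) / 2) - 1, 0)
  matched1 <- rep(FALSE, n1)
  matched2 <- rep(FALSE, n2)
  for (i in seq_len(n1)) {
    lo <- max(1, i - window)
    hi <- min(n2, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!matched2[j] && s1[i] == s2[j]) {
        matched1[i] <- TRUE
        matched2[j] <- TRUE
        break
      }
    }
  }
  m <- sum(matched1)
  if (m == 0) return(0)

  # Transpositions: matched characters that appear in a different order
  t <- sum(s1[matched1] != s2[matched2]) / 2
  jaro <- (m / n1 + m / n2 + (m - t) / m) / 3

  # Winkler bonus for a common prefix of up to 4 characters
  l <- 0
  for (i in seq_len(min(4, n1, n2))) {
    if (s1[i] == s2[i]) l <- l + 1 else break
  }
  jaro + l * p * (1 - jaro)
}

# Score the typo "dod" against the three legitimate strings
scores <- sapply(c("mom", "dad", "brother"),
                 function(w) jaro_winkler(w, "dod"))
scores
names(which.max(scores))
```

Unlike the exact prefix and contains flags, the score for "dod" is highest against "dad", which gives the decision tree a feature that separates this typo from the other two strings.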
Conclusion
- R contains lots of libraries to help you model a physical phenomenon in any way you like and visualize it.
- Oracle R Enterprise makes it possible to handle large volumes of data without changing the basics of your R environment.
- Don't take ODM and Oracle R Enterprise as alternatives to each other, but rather as complementary solutions to the same problem.
Question & Answer
Stay in Touch
http://husnusensoy.wordpress.com
@husnusensoy