DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2...

96
DATA MINING USING R PROGRAMMING LANGUAGE Mamdouh Alenezi 1

Transcript of DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2...

Page 1: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

DATA MINING USING R PROGRAMMING LANGUAGE

Mamdouh Alenezi

1

Page 2: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

QUESTIONS

Have you heard of R?

Have you ever used R in your work?

Do you know data mining and its algorithms and techniques?

2

Page 3: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

HTTP://WWW.R-PROJECT.ORG/ DOWNLOAD R

3

Page 4: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

OUTLINE

What is R? and Why R?

Data Mining in R

Data Import/Export in R

Data Exploration and Visualization

Classification with R

Clustering with R

Text Mining in R

4

Page 5: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

WHAT IS R? AND WHY R?

5

Page 6: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

WHAT IS R?

R is a free software environment for statistical computing and graphics.

R can be easily extended with around 6,000 packages available on CRAN3.

Many other packages provided on Bioconductor, R-Forge, GitHub, etc.

6

Page 7: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

WHY DO DATA SCIENCE WITH R?

Most widely used Data Mining and Machine Learning Package

Machine Learning

Statistics

Software Engineering and Programming with Data

But not the nicest of languages for a Computer Scientist!

Free (Libre) Open Source Statistical Software

All modern statistical approaches

Many/most machine learning algorithms

Opportunity to readily add new algorithms

7

Page 8: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

HOW POPULAR IS R? DISCUSSION LIST TRAFFIC

Sum of monthly email traffic on each software’s main listserv discussion list.

8

Page 9: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

HOW POPULAR IS R? DISCUSSION TOPICS

9

Page 10: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

WHY R?

R was ranked no. 1 in the KDnuggets 2014 poll on Top Languages for analytics, data mining, data science (actually R has been no. 1 in 2011, 2012 & 2013!).

http://www.kdnuggets.com/polls/2014/languages-analytics-data-mining-data-science.html

10

Page 11: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

DATA MINING IN R

11

Page 12: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

DATA MINING

A data driven analysis to uncover otherwise unknown but useful patterns in large datasets, to discover new knowledge and to develop predictive models, turning data and information into knowledge and (one day perhaps) wisdom, in a timely manner.

12

Page 13: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

DATA MINING

Application of Machine Learning

Statistics

Software Engineering and Programming with Data

Effective Communications and Intuition

…..

To Datasets that vary by: Volume, Velocity, Variety, Value, Veracity

To discover new knowledge

To improve business outcomes

To deliver better tailored services

13

Page 14: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

BASIC TOOLS: DATA MINING ALGORITHMS

Cluster Analysis (kmeans, wskm)

Association Analysis (arules)

Linear Discriminant Analysis (lda)

Logistic Regression (glm)

Decision Trees (rpart, wsrpart)

Random Forests (randomForest, wsrf)

Boosted Stumps (ada)

Neural Networks (nnet)

Support Vector Machines (kernlab)

That’s a lot of tools to learn in R!

Many with different interfaces and options.

14

Page 15: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

PREDICTIVE MODELLING: CLASSIFICATION

Goal of classification is to build models (sentences) in a knowledge representation (language) from examples of past decisions.

The model is to be used on unseen cases to make decisions.

Often referred to as supervised learning.

Common approaches: decision trees; neural networks; logistic regression; support vector machines.

15

Page 16: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

MAJOR CLUSTERING APPROACHES

Partitioning algorithms (kmeans, pam, clara, fanny)

Hierarchical algorithms: (hclust, agnes, diana)

Density-based algorithms

Grid-based algorithms

Model-based algorithms: (mclust for mixture of Gaussians)

16

Page 17: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

LANGUAGE: DECISION TREES

Knowledge representation: A flow-chart-like tree structure

Internal nodes denotes a test on a variable

Branch represents an outcome of the test

Leaf nodes represent class labels or class distribution

17

Page 18: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

DATA SCIENTISTS ARE PROGRAMMERS OF DATA

Data Scientists Desire. . .

Scripting

Transparency

Repeatability

Sharing

18

Page 19: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

SOCIAL NETWORK ANALYSIS WITH R

Packages: igraph, sna

Centrality measures: degree(), betweenness(), closeness(), transitivity()

Clusters: clusters(), no.clusters()

Cliques: cliques(), largest.cliques(), maximal.cliques(), clique.number()

Community detection: fastgreedy.community(), spinglass.community()

19

Page 20: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

R AND BIG DATA

Hadoop Hadoop (or YARN) - a framework that allows for the distributed processing of large data sets across clusters

of computers using simple programming models

R Packages: RHadoop, RHIPE

Spark Spark - a fast and general engine for large-scale data processing, which can be 100 times faster than

Hadoop

SparkR - R frontend for Spark

H2O H2O - an open source in-memory prediction engine for big data science

R Package: h2o

MongoDB MongoDB - an open-source document database

R packages: rmongodb, RMongo

20

Page 21: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

R AND HADOOP

Packages: RHadoop, Rhive

RHadoop is a collection of R packages:

rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster

rhdfs - connect to Hadoop Distributed File System (HDFS)

rhbase - connect to the NoSQL HBase database

. . .

You can play with it on a single PC (in standalone or pseudo-distributed mode), and your code developed on that will be able to work on a cluster of PCs (in full-distributed mode)!

21

Page 22: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

DATA IMPORT/EXPORT IN R

22

Page 23: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

DATA IMPORT AND EXPORT

Read data from and write data to

R native formats (incl. Rdata and RDS)

CSV files

EXCEL files

ODBC databases

SAS databases

23

Page 24: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

SAVE AND LOAD R OBJECTS

save(): save R objects into a .Rdata file

load(): read R objects from a .Rdata file

rm(): remove objects from R

24

Page 25: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

SAVE AND LOAD R OBJECTS - MORE FUNCTIONS

save.image(): save current workspace to a file

It saves everything!

readRDS(): read a single R object from a .rds file

saveRDS(): save a single R object to a file

Advantage of readRDS() and saveRDS(): You can restore the data under a different object name.

Advantage of load() and save(): You can save multiple R objects to one file.

25

Page 26: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

IMPORT FROM AND EXPORT TO .CSV FILES write.csv(): write an R object to a .CSV file

read.csv(): read an R object from a .CSV file

26

Page 27: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

IMPORT FROM AND EXPORT TO EXCEL FILES

Package xlsx: read, write, format Excel 2007 and Excel 97/2000/XP/2003 files

27

Page 28: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

READ FROM DATABASES

Package RODBC: provides connection to ODBC databases.

Function odbcConnect(): sets up a connection to database

sqlQuery(): sends an SQL query to the database

odbcClose() closes the connection.

Functions sqlFetch(), sqlSave() and sqlUpdate(): read, write or update a table in an ODBC database

28

Page 29: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

IMPORT DATA FROM SAS

Package foreign provides function read.ssd() for importing SAS datasets (.sas7bdat files) into R.

29

Page 30: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

DATA EXPLORATION AND VISUALIZATION

30

Page 31: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

DATA EXPLORATION AND VISUALIZATION WITH R

Data Exploration and Visualization

Summary and stats

Various charts like pie charts and histograms

Exploration of multiple variables

Level plot, contour plot and 3D plot

Saving charts into files of various formats

31

Page 32: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

SIZE AND STRUCTURE OF DATA

32

Page 33: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

ATTRIBUTES OF DATA

33

Page 34: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

FIRST ROWS OF DATA

34

Page 35: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

A SINGLE COLUMN

The first 10 values of Sepal.Length

35

Page 36: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

SUMMARY OF DATA Function summary()

numeric variables: minimum, maximum, mean, median, and the first (25%) and third (75%) quartiles

categorical variables (factors): frequency of every level

36

Page 37: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

37

Page 38: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

MEAN, MEDIAN, RANGE AND QUARTILES

Mean, median and range: mean(), median(), range()

Quartiles and percentiles: quantile()

38

Page 39: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

VARIANCE AND HISTOGRAM

39

Page 40: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

DENSITY

40

Page 41: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

PIE CHART Frequency of factors: table()

41

Page 42: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

BAR CHART

42

Page 43: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

CORRELATION Covariance and correlation: cov() and cor()

43

Page 44: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

AGGREATION Stats of Sepal.Length for every Species with aggregate()

44

Page 45: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

BOXPLOT The bar in the middle is median.

The box shows the interquartile range (IQR), i.e., range between the 75% and 25% observation.

45

Page 46: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

SCATTER PLOT

46

Page 47: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

A MATRIX OF SCATTER PLOTS

47

Page 48: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

3D SCATTER PLOT

48

Page 49: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

HEAT MAP

Calculate the similarity between different flowers in the iris data with dist() and then plot it with a heat map

49

Page 50: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

CONTOUR

contour() and filled.contour() in package graphics

contourplot() in package lattice

50

Page 51: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

3D SURFACE

51

Page 52: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

PARALLEL COORDINATES

52

Page 53: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

SAVE CHARTS TO FILES

Save charts to PDF and PS files: pdf() and postscript()

BMP, JPEG, PNG and TIFF files: bmp(), jpeg(), png() and tiff()

Close files (or graphics devices) with graphics.off() or dev.off() after plotting

53

Page 54: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

CLASSIFICATION WITH R

54

Page 55: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

CLASSIFICATION WITH R

Decision trees: rpart, party

Random forest: randomForest, party

SVM: e1071, kernlab

Neural networks: nnet, neuralnet, RSNNS

Performance evaluation: ROCR

55

Page 56: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

THE IRIS DATASET

56

Page 57: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

BUILD A DECISION TREE

57

Page 58: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

58

Page 59: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

PREDICTION

59

Page 60: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

R PACKAGES FOR RANDOM FOREST

Package randomForest

very fast

cannot handle data with missing values

a limit of 32 to the maximum number of levels of each categorical attribute

Package party: cforest()

not limited to the above maximum levels

slow

needs more memory

60

Page 61: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

TRAIN A RANDOM FOREST

61

Page 62: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

62

Page 63: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

ERROR RATE OF RANDOM FOREST

63

Page 64: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

VARIABLE IMPORTANCE

64

Page 65: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

CLUSTERING WITH R

65

Page 66: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

CLUSTERING WITH R

k-means: kmeans(), kmeansruns()

k-medoids: pam(), pamk()

Hierarchical clustering: hclust(), agnes(), diana()

DBSCAN: fpc

BIRCH: birch

66

Page 67: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

K-MEANS CLUSTERING

67

Page 68: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

68

Page 69: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

THE K-MEDOIDS CLUSTERING

Difference from k-means: a cluster is represented with its center in the k-means algorithm, but with the object closest to the center of the cluster in the k-medoids clustering.

more robust than k-means in presence of outliers

PAM (Partitioning Around Medoids) is a classic algorithm for k-medoids clustering.

The CLARA algorithm is an enhanced technique of PAM by drawing multiple samples of data, applying PAM on each sample and then returning the best clustering. It performs better than PAM on larger data.

Functions pam() and clara() in package cluster

Function pamk() in package fpc does not require a user to choose k.

69

Page 70: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

CLUSTERING WITH PAMK()

Two clusters:

“setosa"

a mixture of “versicolor" and “virginica"

70

Page 71: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

71

Page 72: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

HIERARCHICAL CLUSTERING OF THE IRIS DATA

72

Page 73: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

73

Page 74: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

TEXT MINING

74

Page 75: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

TEXT MINING WITH R

Text mining: tm

Topic modelling: topicmodels, lda

Word cloud: wordcloud

Twitter data access: twitteR

75

Page 76: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

TEXT MINING

unstructured text data

text categorization

text clustering

entity extraction

sentiment analysis

document summarization

. . .

76

Page 77: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

TEXT MINING OF TWITTER DATA WITH R

1. extract data from Twitter

2. clean extracted data and build a document-term matrix

3. find frequent words and associations

4. create a word cloud to visualize important words

5. text clustering

6. topic modelling

77

Page 78: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

RETRIEVE TWEETS

Retrieve recent tweets by @RDataMining

78

Page 79: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

TEXT CLEANING

79

Page 80: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

80

Page 81: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

81

Page 82: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

82

Page 83: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

83

Page 84: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

84

Page 85: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

FREQUENT WORDS AND ASSOCIATIONS

85

Page 86: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

86

Page 87: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

87

Page 88: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

88

Page 89: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

89

Page 90: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

WORD CLOUD

90

Page 91: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

CLUSTERING

91

Page 92: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

92

Page 93: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

93

Page 94: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

94

Page 95: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

TOPIC MODELLING

95

Page 96: DATA MINING USING RR AND HADOOP Packages: RHadoop, Rhive RHadoop is a collection of R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect

96