Webinar: The rmongodb R package

Webinar: The rmongodb R package. Dr. rer. nat. Markus Schmidberger, January 30th, 2014. Email: [email protected] Twitter: @cloudHPC

description

A free one-hour webinar providing a general introduction to the "rmongodb" R package (https://github.com/mongosoup/rmongodb) which provides a methodology to connect the MongoDB database (http://www.mongodb.com/) and the R statistical computing environment (http://www.r-project.org).

Transcript of Webinar: The rmongodb R package

Page 1: Webinar: The rmongodb R package

Webinar: The rmongodb R package

Dr. rer. nat. Markus Schmidberger

January 30th, 2014

Email: [email protected]

Twitter: @cloudHPC

Page 2: Webinar: The rmongodb R package

Outline

Introduction to Big Data, MongoDB, MongoSoup, R

Introduction to R database packages such as rmongodb

rmongodb Live Demo

Summary & Outlook & Questions

Page 3: Webinar: The rmongodb R package

Big Data

Wikipedia: … a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing. …

storing

processing

Page 4: Webinar: The rmongodb R package

Storing: NoSQL - MongoDB

NoSQL: databases using looser consistency models to store data

MongoDB: the most popular NoSQL database system

document oriented

JSON-like documents with dynamic schemas

http://docs.mongodb.org/manual/reference/sql-comparison/

Page 5: Webinar: The rmongodb R package

MongoDB - some commands

db.collection.find()
db.collection.find().pretty()
db.collection.find( { _id: 5 } )
db.collection.find( { pop: { $gt: 25 } } )
db.collection.insert( { item: "card", pop: 15 } )
db.collection.ensureIndex( { orderDate: 1, zipcode: -1 } )
db.collection.update( { _id: 1 }, { $set: { "name": "Warner" } } )

Page 6: Webinar: The rmongodb R package

MongoSoup

German MongoDB as a Service

cloudControl Add-On

running on AWS EU-Region or in Munich (Germany)

all features available: shared / dedicated hosting, replicaset, sharding

24/7 support available

Page 7: Webinar: The rmongodb R package

Processing: Analyzing with R and Hadoop

backward-looking analysis is outdated

today: quasi real-time analysis

tomorrow: forward-looking predictive analysis

more complex methods, more data available, more processing time required

efficient processing technology required: R, Hadoop, …

see my Strata London 2013 tutorial “Big Data Analyses with R”

Page 8: Webinar: The rmongodb R package

Introduction to R

R is a free software environment for statistical computing and graphics

offers tools to manage and analyze data

standard statistical methods are implemented

compiles and runs under different OS

support via huge community

Page 9: Webinar: The rmongodb R package

One statistical example

kmeans(dat, 4)

K-means clustering with 4 clusters of sizes 17, 30, 22, 31

Cluster means:
      [,1]    [,2]
1  0.02846 -0.3379
2  0.76616  1.0020
3  1.37160  0.9707
4 -0.06849  0.1409

Clustering vector:
 [1] 4 2 4 4 1 1 4 1 4 4 1 4 4 4 4 4 1 4 4 2 4 4 4 4 4 4 4 1 4 4 1 1 1 1 2
[36] 1 1 4 4 4 1 1 4 4 4 1 1 1 4 4 3 2 3 2 3 3 2 2 3 2 3 2 2 3 2 2 3 2 2 3
[71] 3 2 2 3 3 2 2 2 2 2 2 2 3 2 2 4 3 2 3 2 2 3 3 3 3 3 3 2 3 2

Within cluster sum of squares by cluster:
[1] 1.836 4.660 1.994 3.047
(between_SS / total_SS = 84.1 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"
[5] "tot.withinss" "betweenss"    "size"         "iter"
[9] "ifault"

Page 11: Webinar: The rmongodb R package

# cl is assumed to hold the k-means fit, i.e. cl <- kmeans(dat, 4)
plot(dat, col = cl$cluster, cex = 2, pch = 16)
points(cl$centers, col = 1:4, pch = 13, cex = 4)
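The slides do not show how `dat` was created. The following is a minimal, self-contained sketch of the same workflow; the simulated data (two Gaussian clouds of 50 points each) and the seed are assumptions made for illustration, so the cluster sizes and centers will differ from the output printed on the slides.

```r
# Assumption: simulate 100 two-dimensional points, since the slides do not define `dat`
set.seed(42)  # arbitrary seed, only to make the sketch repeatable
dat <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),  # 50 points around (0, 0)
  matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)   # 50 points around (1, 1)
)

# Fit k-means with 4 clusters and keep the result for plotting
cl <- kmeans(dat, 4)
print(cl$size)     # number of points per cluster
print(cl$centers)  # one row of coordinates per cluster center

# Colour each point by its cluster, then overlay the cluster centers
plot(dat, col = cl$cluster, cex = 2, pch = 16)
points(cl$centers, col = 1:4, pch = 13, cex = 4)
```

Storing the fit in `cl` is what makes `cl$cluster` and `cl$centers` available to the plotting calls.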

Page 12: Webinar: The rmongodb R package

R and Databases

SQL provides a standard language to filter, aggregate, group, sort data

SQL in new places: Hive, Impala, …

many R packages to connect to the SQL world

R stores relational data in data.frames (extended lists)

Page 13: Webinar: The rmongodb R package

data(iris)

head(iris[,1:3], n=3)

  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4
3          4.7         3.2          1.3

class(iris)

[1] "data.frame"
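To illustrate the remark that data.frames are extended lists, here is a small sketch using only base R. The variable names `sepal` and `col_lengths` are introduced for illustration.

```r
data(iris)

# A data.frame is a list whose elements are the columns
print(is.list(iris))   # TRUE
print(length(iris))    # 5 columns

# List-style access returns a single column as a plain vector
sepal <- iris[["Sepal.Length"]]
print(head(sepal, 3))

# What makes it "extended": every column has the same length (the row count)
col_lengths <- vapply(iris, length, integer(1))
print(col_lengths)
```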

Page 14: Webinar: The rmongodb R package

R package: sqldf

running SQL statements on R data frames

library(sqldf)

sqldf("select Sepal_Length, Sepal_Width, Petal_Length from iris limit 2")

  Sepal_Length Sepal_Width Petal_Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4

sqldf("select count(*) from iris")

  count(*)
1      150

Page 15: Webinar: The rmongodb R package

Other relational R packages

RMySQL

RPostgreSQL

ROracle

RJDBC

RODBC

RSQLite (SQLite engine is included)

One big problem: all packages read the full query results into R memory

Page 16: Webinar: The rmongodb R package

R and MongoDB

on CRAN there are two packages to connect R with MongoDB

rmongodb supported by MongoDB, Inc.

powerful for big data

RMongo

easy to use

limited functionality

reads full query results into R memory

Page 17: Webinar: The rmongodb R package

R package: RMongo

library(RMongo)

mongo <- mongoDbConnect("cc_JwQcDLJSYQJb", "dbs001.mongosoup.de", 27017)

dbAuthenticate(mongo, username="JwQcDLJSYQJb", password="RSXPkUkXXXXX")

dbShowCollections(mongo)

[1] "zips"           "ccp"            "system.users"   "system.indexes"
[5] "test_data"

Page 18: Webinar: The rmongodb R package

dbGetQuery(mongo, "zips", "{'state':'AL'}", skip=0, limit=5)

   X_id state                       loc   pop       city
1 35004    AL  [ -86.51557 , 33.584132]  6055      ACMAR
2 35005    AL [ -86.959727 , 33.588437] 10616 ADAMSVILLE
3 35006    AL [ -87.167455 , 33.434277]  3205      ADGER
4 35007    AL [ -86.812861 , 33.236868] 14218   KEYSTONE
5 35010    AL [ -85.951086 , 32.941445] 19942   NEW SITE

Page 19: Webinar: The rmongodb R package

dbInsertDocument(mongo, "test_data", '{"foo": "bar", "size": 5 }')

[1] "ok"

# e.g. no command to remove collections

# e.g. no command to create indices

dbDisconnect(mongo)

Page 20: Webinar: The rmongodb R package

R package: rmongodb

developed on top of the MongoDB supported C driver

new maintainer: [email protected]

new repository: https://github.com/mongosoup/rmongodb

please provide feedback or contribute via Pull Requests

Page 21: Webinar: The rmongodb R package

library(rmongodb)

mongo <- mongo.create(host="dbs001.mongosoup.de",
                      db="cc_JwQcDLJSYQJb",
                      username="JwQcDLJSYQJb",
                      password="RSXPkUkXXXXX")

mongo

[1] 0

attr(,"mongo")

<pointer: 0x102e4aac0>

attr(,"class")

[1] "mongo"

attr(,"host")

[1] "dbs001.mongosoup.de"

attr(,"name")

[1] ""

Page 22: Webinar: The rmongodb R package

attr(,"username")

[1] "JwQcDLJSYQJb"

attr(,"password")

[1] "RSXPkUkXXXXX"

attr(,"db")

[1] "cc_JwQcDLJSYQJb"

attr(,"timeout")

[1] 0
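Once a connection object like the one above exists, documents are typically read through a cursor rather than loaded all at once. The following is a hedged sketch, not the demo code from the webinar: the helper name `fetch_documents` and its arguments are hypothetical, and it assumes the rmongodb functions `mongo.bson.from.JSON`, `mongo.find`, `mongo.cursor.next`, `mongo.cursor.value`, `mongo.bson.to.list` and `mongo.cursor.destroy` behave as documented in the package.

```r
# Hypothetical helper (not part of rmongodb): walk a cursor and collect each
# matching document as an R list. `ns` is the namespace "database.collection".
fetch_documents <- function(mongo, ns, query_json, limit = 5L) {
  query  <- rmongodb::mongo.bson.from.JSON(query_json)
  cursor <- rmongodb::mongo.find(mongo, ns, query, limit = limit)
  # Release the server-side cursor even if an error occurs below
  on.exit(rmongodb::mongo.cursor.destroy(cursor), add = TRUE)

  docs <- list()
  while (rmongodb::mongo.cursor.next(cursor)) {
    # Convert the current BSON document to a plain R list
    docs[[length(docs) + 1L]] <-
      rmongodb::mongo.bson.to.list(rmongodb::mongo.cursor.value(cursor))
  }
  docs
}

# Usage, assuming `mongo` is a live connection created with mongo.create():
# if (rmongodb::mongo.is.connected(mongo)) {
#   al <- fetch_documents(mongo, "cc_JwQcDLJSYQJb.zips", '{"state":"AL"}')
# }
```

Because the loop handles one document per iteration, the body could just as well aggregate or filter each document instead of collecting them all, which is the point of cursor-based access for large collections.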

Page 23: Webinar: The rmongodb R package

Live Demo

Live Demo with RStudio and MongoSoup

Page 24: Webinar: The rmongodb R package

JSON <-> BSON <-> R

new functionality in development

still problems with sub-documents and JSON arrays

using jsonlite package helps

library(rmongodb)

library(jsonlite)

Page 25: Webinar: The rmongodb R package

bson <- mongo.bson.from.JSON('{"state":"AL"}')

bson

state : 2 AL

list <- mongo.bson.to.list(bson)

list

$state

[1] "AL"

toJSON(list)

[1] "{ \"state\" : [ \"AL\" ] }"
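The output above shows the scalar "AL" coming back as a one-element JSON array. One way to avoid this, sketched below under the assumption that a reasonably recent jsonlite is installed, is the `auto_unbox` argument of `toJSON`. The variable names are illustrative, and the exact whitespace of the generated JSON may differ from the slide output depending on the jsonlite version.

```r
library(jsonlite)

doc <- list(state = "AL")

# Default behaviour: length-1 vectors are encoded as JSON arrays
boxed <- toJSON(doc)
print(boxed)

# auto_unbox = TRUE encodes length-1 vectors as JSON scalars instead
unboxed <- toJSON(doc, auto_unbox = TRUE)
print(unboxed)

# The same approach builds a MongoDB-style query document from nested lists
query <- toJSON(list(pop = list(`$gt` = 25)), auto_unbox = TRUE)
print(query)

# Round trip: parsing the unboxed JSON gives back the original list
back <- fromJSON(as.character(unboxed))
print(back$state)
```

A string produced this way can be handed to `mongo.bson.from.JSON` in place of a hand-written JSON literal.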

Page 26: Webinar: The rmongodb R package

Summary

R is a powerful statistical tool to analyse many different kinds of data

R can access databases

MongoDB and rmongodb ready for Big Data

some open issues for simple usability

Page 27: Webinar: The rmongodb R package

Outlook

Fixing JSON to BSON issues

Provide efficient functionality for converting MongoDB data to data.frames

Use new mongodb-c library

a lot of work: re-engineering rmongodb back-end

-> more speed, more functionality

continue developing the dplyrmongodb package: https://github.com/schmidb/dplyrmongodb

Page 28: Webinar: The rmongodb R package

Questions & Answers

thanks a lot for your attention

demo code available as vignette in the rmongodb package on GitHub

Email: [email protected]

Twitter: @cloudHPC