Webinar: The rmongodb R package

Post on 26-Jan-2015

description

A free one-hour webinar giving a general introduction to the "rmongodb" R package (https://github.com/mongosoup/rmongodb), which connects the MongoDB database (http://www.mongodb.com/) with the R statistical computing environment (http://www.r-project.org).

Transcript of Webinar: The rmongodb R package

Webinar: The rmongodb R package

Dr. rer. nat. Markus Schmidberger

January 30th, 2014

Email: markus@mongosoup.de

Twitter: @cloudHPC

Outline

Introduction to Big Data, MongoDB, MongoSoup, R

Introduction to R database packages such as rmongodb

rmongodb Live Demo

Summary & Outlook & Questions

Big Data

Wikipedia: … a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing. …

storing

processing

Storing: NoSQL - MongoDB

NoSQL: databases using looser consistency models to store data

MongoDB most popular NoSQL database system

document oriented

JSON-like documents with dynamic schemas

http://docs.mongodb.org/manual/reference/sql-comparison/

MongoDB - some commands

db.collection.find()

db.collection.find().pretty()

db.collection.find( { _id: 5 } )

db.collection.find( { pop: { $gt: 25 } } )

db.collection.insert( { item: "card", pop: 15 } )

db.collection.ensureIndex( { orderDate: 1, zipcode: -1 } )

db.collection.update( { _id: 1 }, { $set: { "name": "Warner" } } )
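
From R, the same query and update documents can be written as nested lists and serialised to JSON. The following is a minimal sketch, assuming the jsonlite package is installed; the variable names are illustrative and do not come from the webinar.

```r
library(jsonlite)

# Query document for: db.collection.find( { pop: { $gt: 25 } } )
# Backticks are needed because $gt is not a syntactic R name.
query <- list(pop = list(`$gt` = 25))

# Update document for: { $set: { "name": "Warner" } }
update <- list(`$set` = list(name = "Warner"))

# auto_unbox = TRUE writes length-one vectors as scalars instead of arrays.
query_json  <- toJSON(query, auto_unbox = TRUE)
update_json <- toJSON(update, auto_unbox = TRUE)

print(query_json)   # {"pop":{"$gt":25}}
print(update_json)  # {"$set":{"name":"Warner"}}
```

Building the documents as R lists avoids hand-written JSON strings and the quoting mistakes that come with them.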

MongoSoup

German MongoDB as a Service

cloudControl Add-On

running on AWS EU-Region or in Munich (Germany)

all features available: shared / dedicated hosting, replicaset, sharding

24/7 support available

Processing: Analyzing with R and Hadoop

backward-looking analysis is outdated

today: quasi real-time analysis

tomorrow: forward-looking predictive analysis

more complex methods, more data available, more processing time required

efficient processing technology required: R, Hadoop, …

see my Strata London 2013 tutorial “Big Data Analyses with R”

Introduction to R

R is a free software environment for statistical computing and graphics

offers tools to manage and analyze data

standard statistical methods are implemented

compiles and runs on different operating systems

support via huge community

One statistical example

(cl <- kmeans(dat, 4))

K-means clustering with 4 clusters of sizes 17, 30, 22, 31

Cluster means:

[,1] [,2]

1 0.02846 -0.3379

2 0.76616 1.0020

3 1.37160 0.9707

4 -0.06849 0.1409

Clustering vector:

[1] 4 2 4 4 1 1 4 1 4 4 1 4 4 4 4 4 1 4 4 2 4 4 4 4 4 4 4 1 4 4 1 1 1 1 2

[36] 1 1 4 4 4 1 1 4 4 4 1 1 1 4 4 3 2 3 2 3 3 2 2 3 2 3 2 2 3 2 2 3 2 2 3

[71] 3 2 2 3 3 2 2 2 2 2 2 2 3 2 2 4 3 2 3 2 2 3 3 3 3 3 3 2 3 2

Within cluster sum of squares by cluster:

[1] 1.836 4.660 1.994 3.047

(between_SS / total_SS = 84.1 %)

Available components:

[1] "cluster" "centers" "totss" "withinss"

[5] "tot.withinss" "betweenss" "size" "iter"

[9] "ifault"

plot(dat, col = cl$cluster, cex=2, pch=16)

points(cl$centers, col = 1:4, pch = 13, cex = 4)
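
The slide uses `dat` without showing where it comes from. The sketch below builds a hypothetical stand-in (100 synthetic points around four centres, not the webinar's data) so the clustering call can be reproduced with base R alone.

```r
# Hypothetical stand-in for `dat`: 100 two-dimensional points around four centres.
set.seed(1)
centres <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1))
dat <- do.call(rbind, lapply(seq_len(nrow(centres)), function(i) {
  cbind(rnorm(25, mean = centres[i, 1], sd = 0.1),
        rnorm(25, mean = centres[i, 2], sd = 0.1))
}))

# nstart = 10 restarts the algorithm from several random initialisations
# and keeps the best solution.
cl <- kmeans(dat, centers = 4, nstart = 10)

cl$size                                   # points per cluster
cl$centers                                # estimated cluster means
round(100 * cl$betweenss / cl$totss, 1)   # between_SS / total_SS in percent
```

The `plot()` and `points()` calls from the slide work unchanged on this `dat` and `cl`.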

R and Databases

SQL provides a standard language to filter, aggregate, group, and sort data

SQL in new places: Hive, Impala, …

many R packages to connect to the SQL world

R stores relational data in data.frames (extended lists)

data(iris)

head(iris[,1:3], n=3)

Sepal.Length Sepal.Width Petal.Length

1 5.1 3.5 1.4

2 4.9 3.0 1.4

3 4.7 3.2 1.3

class(iris)

[1] "data.frame"
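
The remark that data.frames are extended lists can be checked directly in base R. This short sketch treats each column as one list element:

```r
data(iris)

is.list(iris)                  # TRUE: a data.frame is built on a list
typeof(iris)                   # "list"
length(iris)                   # 5: one list element per column

# Columns can be accessed like list elements.
iris[["Sepal.Length"]][1:3]    # 5.1 4.9 4.7

# Each element (column) has its own type.
sapply(iris, class)
```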

R package: sqldf

running SQL statements on R data frames

library(sqldf)

sqldf("select Sepal_Length, Sepal_Width, Petal_Length from iris limit 2")

Sepal_Length Sepal_Width Petal_Length

1 5.1 3.5 1.4

2 4.9 3.0 1.4

sqldf("select count(*) from iris")

count(*)

1 150

Other relational R packages

RMySQL

RPostgreSQL

ROracle

RJDBC

RODBC

RSQLite (SQLite engine is included)

One big problem: all packages read the full query results into R memory
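
One common way to limit memory use, where a backend supports it, is to fetch a result set in chunks instead of all at once. The sketch below is not from the webinar; it assumes the DBI and RSQLite packages are installed and uses an in-memory copy of `iris` purely as illustrative data.

```r
library(DBI)

# In-memory SQLite database holding a copy of iris (illustrative data only).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "iris", iris)

res <- dbSendQuery(con, "SELECT * FROM iris")
rows_seen <- 0
total_sepal_length <- 0
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 50)   # at most 50 rows are held in memory at once
  rows_seen <- rows_seen + nrow(chunk)
  total_sepal_length <- total_sepal_length + sum(chunk[["Sepal.Length"]])
}
dbClearResult(res)
dbDisconnect(con)

rows_seen                         # number of rows processed
total_sepal_length / rows_seen    # mean computed chunk by chunk
```

This only helps when the analysis can be expressed as a running aggregate; methods that need all rows at once still require the full result in memory.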

R and MongoDB

on CRAN there are two packages to connect R with MongoDB

rmongodb supported by MongoDB, Inc.

powerful for big data

RMongo

easy to use

limited functionality

reads full query results in R memory

R package: RMongo

library(RMongo)

mongo <- mongoDbConnect("cc_JwQcDLJSYQJb", "dbs001.mongosoup.de", 27017)

dbAuthenticate(mongo, username="JwQcDLJSYQJb", password="RSXPkUkXXXXX")

dbShowCollections(mongo)

[1] "zips" "ccp" "system.users" "system.indexes"

[5] "test_data"

dbGetQuery(mongo, "zips", "{'state':'AL'}", skip=0, limit=5)

   X_id state                        loc   pop       city

1 35004    AL   [ -86.51557 , 33.584132]  6055      ACMAR

2 35005    AL  [ -86.959727 , 33.588437] 10616 ADAMSVILLE

3 35006    AL  [ -87.167455 , 33.434277]  3205      ADGER

4 35007    AL  [ -86.812861 , 33.236868] 14218   KEYSTONE

5 35010    AL  [ -85.951086 , 32.941445] 19942   NEW SITE

dbInsertDocument(mongo, "test_data", '{"foo": "bar", "size": 5 }')

[1] "ok"

# e.g. no command to remove collections

# e.g. no command to create indices

dbDisconnect(mongo)

R package: rmongodb

developed on top of the MongoDB supported C driver

new maintainer: markus@mongosoup.de

new repository: https://github.com/mongosoup/rmongodb

please provide feedback or contribute via pull requests

library(rmongodb)

mongo <- mongo.create(host="dbs001.mongosoup.de", db="cc_JwQcDLJSYQJb", username="JwQcDLJSYQJb", password="RSXPkUkXXXXX")

mongo

[1] 0

attr(,"mongo")

<pointer: 0x102e4aac0>

attr(,"class")

[1] "mongo"

attr(,"host")

[1] "dbs001.mongosoup.de"

attr(,"name")

[1] ""

attr(,"username")

[1] "JwQcDLJSYQJb"

attr(,"password")

[1] "RSXPkUkXXXXX"

attr(,"db")

[1] "cc_JwQcDLJSYQJb"

attr(,"timeout")

[1] 0
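
Once connected, rmongodb returns query results through a cursor, so documents can be processed one at a time rather than loaded in full. The helper below is a hypothetical sketch (the name `fetch_documents` and its arguments are not part of the package or the webinar); it assumes rmongodb is installed and that `mongo` is a live connection such as the one created above.

```r
# Hypothetical helper: collect up to `limit` documents matching `query`.
# `ns` is "<database>.<collection>", e.g. "cc_JwQcDLJSYQJb.zips".
# `query` is a non-empty named list, e.g. list(state = "AL").
fetch_documents <- function(mongo, ns, query, limit = 5L) {
  stopifnot(rmongodb::mongo.is.connected(mongo))

  cursor <- rmongodb::mongo.find(
    mongo, ns,
    query = rmongodb::mongo.bson.from.list(query),
    limit = as.integer(limit)
  )
  # Release the server-side cursor even if an error occurs below.
  on.exit(rmongodb::mongo.cursor.destroy(cursor))

  docs <- list()
  # mongo.cursor.next() advances the cursor and returns FALSE at the end.
  while (rmongodb::mongo.cursor.next(cursor)) {
    bson <- rmongodb::mongo.cursor.value(cursor)
    docs[[length(docs) + 1]] <- rmongodb::mongo.bson.to.list(bson)
  }
  docs
}

# Usage, assuming the connection and the "zips" collection from the slides:
# al <- fetch_documents(mongo, "cc_JwQcDLJSYQJb.zips", list(state = "AL"))
```

For large result sets, the loop body would process each document directly instead of appending it to `docs`.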

Live Demo

Live Demo with RStudio and MongoSoup

JSON <-> BSON <-> R

new functionality in development

still problems with sub-documents and JSON arrays

using jsonlite package helps

library(rmongodb)

library(jsonlite)

bson <- mongo.bson.from.JSON('{"state":"AL"}')

bson

state : 2 AL

lst <- mongo.bson.to.list(bson)

lst

$state

[1] "AL"

toJSON(lst)

[1] "{ \"state\" : [ \"AL\" ] }"
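
The array in this output appears because jsonlite treats every R vector, including one of length one, as a JSON array by default. A minimal sketch of the difference, assuming a current jsonlite version (the exact whitespace of the output differs from the slide):

```r
library(jsonlite)

doc <- list(state = "AL")

boxed   <- toJSON(doc)                     # {"state":["AL"]}
unboxed <- toJSON(doc, auto_unbox = TRUE)  # {"state":"AL"}

print(boxed)
print(unboxed)

# Reading the unboxed JSON back gives a list with a plain character value.
back <- fromJSON(unboxed)
back$state
```

With `auto_unbox = TRUE` genuine length-one arrays are also written as scalars, so the option suits documents where that distinction does not matter.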

Summary

R is a powerful statistical tool for analysing many different kinds of data

R can access databases

MongoDB and rmongodb ready for Big Data

some open issues for simple usability

Outlook

Fixing JSON to BSON issues

Provide efficient conversion from MongoDB to data.frames

Use new mongodb-c library

a lot of work: re-engineering rmongodb back-end

-> more speed, more functionality

continue developing the dplyrmongodb package: https://github.com/schmidb/dplyrmongodb

Questions & Answers

thanks a lot for your attention

demo code available as vignette in the rmongodb package on GitHub

Email: markus@mongosoup.de

Twitter: @cloudHPC