NotaQL Is Not a Query Language! It's for Data Transformation on Wide-Column Stores

Post on 08-Aug-2015


… Is Not A Query Language! It's For Data Transformation on Wide-Column Stores

M. Sc. Johannes Schildgen, 2015-07-08

schildgen@cs.uni-kl.de

2

"A DBA walks into a NoSQL bar, but turns and leaves because he couldn't find a table"

3

Column Families

RowId info children

4

RowId  info                                 children
Peter  born: 1965, cmpny: IBM, salary: 70k  Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa   born: 1997, school: BSIT

Column Families

info: (Peter, 1965, IBM, 70k), (Lisa, 1997, BSIT)
children: (€10, €5, €0, €7)

5

HBase API

put 'pers', 'Carl', 'info:born', '1982'

put 'pers', 'Carl', 'info:school', 'BSIT'

put 'pers', 'Carl', 'info:school', 'BUIT'

get 'pers', 'Carl'
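The effect of these shell commands can be mimicked with a nested dictionary. This is only a minimal Python sketch, not the HBase client API: the put and get helpers and the in-memory tables dictionary are hypothetical, and the sketch ignores HBase's cell versioning, so the second put to info:school simply replaces the first value.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in: table -> row id -> "family:qualifier" -> value
tables = defaultdict(lambda: defaultdict(dict))

def put(table, row_id, column, value):
    """Write one cell; a later put to the same cell replaces the value."""
    tables[table][row_id][column] = value

def get(table, row_id):
    """Return all cells of one row as a plain dict."""
    return dict(tables[table][row_id])

put('pers', 'Carl', 'info:born', '1982')
put('pers', 'Carl', 'info:school', 'BSIT')
put('pers', 'Carl', 'info:school', 'BUIT')  # overwrites 'BSIT'

carl = get('pers', 'Carl')
print(carl)
```

No schema is declared anywhere: the columns of a row are whatever has been put into it.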

6

Jaspersoft HBase QL

{
  "tableName": "pers",
  "deserializerClass": "com.jaspersoft…DefaultDeserializer",
  "filter": {
    "SingleColumnValueFilter": {
      "family": "info",
      "qualifier": "school",
      "compareOp": "EQUAL",
      "comparator": {
        "SubstringComparator": {
          "substr": "BSIT"
        }
      }
    }
  }
}

$\sigma_{school = 'BSIT'}(pers)$

http://community.jaspersoft.com/wiki/jaspersoft-hbase-query-language

7

Phoenix

SELECT * FROM pers WHERE school = 'BSIT'

$\sigma_{school = 'BSIT'}(pers)$

"Parent of each person?"

https://github.com/forcedotcom/phoenix

8

9

Input Table Output Table

10

Input Cell: (RowID, Column, Value)

Output Cell: (RowID, Column, Value)

11

Input Cell: (_r, _c, _v)

Output Cell: (_r, _c, _v)

12

Input Cell: _r = Lisa, _c = born, _v = 1997

Output Cell: _r = Lisa, _c = born, _v = 1997

OUT._r <- IN._r,
OUT.born <- IN.born;

$\pi_{born}(pers)$

13

Input Cell: _r = Lisa, _c = born, _v = 1997

Output Cell: _r = Lisa, _c = born, _v = 1997

OUT._r <- IN._r,
OUT.born <- IN.born,
OUT.school <- IN.school;

$\pi_{born, school}(pers)$

14

Input Cell: _r = Lisa, _c = born, _v = 1997

Output Cell: _r = Lisa, _c = born, _v = 1997

OUT._r <- IN._r,
OUT.$(IN._c) <- IN._v;

$\sigma_{school = 'BSIT'}(pers)$

15

Input Cell: _r = Lisa, _c = born, _v = 1997

Output Cell: _r = Lisa, _c = born, _v = 1997

IN-FILTER: school='BSIT',
OUT._r <- IN._r,
OUT.$(IN._c) <- IN._v;

Row predicate: school='BSIT'

$\sigma_{school = 'BSIT'}(pers)$

16

That was: Selection and Projection
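The selection and projection scripts above can be imitated in a few lines of Python. This is a hedged sketch of the cell-wise idea, not the NotaQL implementation: the transform function, its row_pred and columns parameters, and the dictionary layout are all assumptions made for illustration, and salaries are written in thousands.

```python
# Sample 'pers' table from the slides: row id -> {column: value}
pers = {
    'Peter': {'born': 1965, 'cmpny': 'IBM', 'salary': 70},
    'Lisa': {'born': 1997, 'school': 'BSIT'},
}

def transform(table, row_pred=lambda row: True, columns=None):
    """Copy cells to an output table, like OUT._r <- IN._r, OUT.$(IN._c) <- IN._v.

    row_pred plays the role of IN-FILTER; columns=None copies every column,
    otherwise only the listed columns are copied (projection).
    """
    out = {}
    for row_id, row in table.items():
        if not row_pred(row):
            continue  # the row violates the row predicate
        for col, val in row.items():
            if columns is None or col in columns:
                out.setdefault(row_id, {})[col] = val
    return out

# Projection on 'born'
projected = transform(pers, columns={'born'})
# Selection school = 'BSIT'
selected = transform(pers, row_pred=lambda row: row.get('school') == 'BSIT')
print(projected)
print(selected)
```

Because each output cell is derived from one input cell, rows with different column sets need no special handling.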

17

Now: Grouping

18

Input Cell: _r = Peter, _c = cmpny, _v = IBM

Output Cell: _r = IBM, _c = salsum, _v = 645k

Salary sum of each company.

OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.salary);

19

RowId  info
Eve    born: 1965, cmpny: IBM, salary: 70k
Carl   born: 1966, cmpny: IBM, job: intern
Julia  born: 1967, cmpny: IBM, salary: 80k
Lisa   born: 1997, school: BSIT, salary: 1k

OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.salary);

Intermediate cells: (IBM, salsum, 70k), (IBM, salsum, 80k)

RowId  info
IBM    salsum: 150k
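The grouping on this slide can be sketched in Python as follows. The assignment of the four data rows to Eve, Carl, Julia and Lisa follows the order in which they appear in the transcript, and the rule that a row lacking either cmpny or salary contributes nothing is an assumption chosen because it reproduces the 150k result. The salary_sum function is hypothetical; salaries are in thousands.

```python
from collections import defaultdict

pers = {
    'Eve': {'born': 1965, 'cmpny': 'IBM', 'salary': 70},
    'Carl': {'born': 1966, 'cmpny': 'IBM', 'job': 'intern'},
    'Julia': {'born': 1967, 'cmpny': 'IBM', 'salary': 80},
    'Lisa': {'born': 1997, 'school': 'BSIT', 'salary': 1},
}

def salary_sum(table):
    """Sketch of OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary)."""
    sums = defaultdict(int)
    for row in table.values():
        # Assumption: rows without cmpny or without salary yield no output cell.
        if 'cmpny' in row and 'salary' in row:
            sums[row['cmpny']] += row['salary']
    return {company: {'salsum': total} for company, total in sums.items()}

result = salary_sum(pers)
print(result)
```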

20

Advanced Transformations: More Filters

21

RowId  info                                 children
Peter  born: 1965, cmpny: IBM, salary: 70k  Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa   born: 1997, school: BSIT

info: (Peter, 1965, IBM, 70k), (Lisa, 1997, BSIT)
children: (€10, €5, €0, €7)

22

RowId  info                                 children
Peter  born: 1965, cmpny: IBM, salary: 70k  Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa   born: 1997, school: BSIT

OUT._r <- IN._r,
OUT.$(IN._c) <- IN._v;

23

RowId  info                                 children
Peter  born: 1965, cmpny: IBM, salary: 70k  Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa   born: 1997, school: BSIT

IN-FILTER: COL_COUNT(children)>0,
OUT._r <- IN._r,
OUT.$(IN._c) <- IN._v;

24

RowId  info                                 children
Peter  born: 1965, cmpny: IBM, salary: 70k  Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa   born: 1997, school: BSIT

IN-FILTER: COL_COUNT(children)>0,
OUT._r <- IN._r,
OUT.$(IN.children._c) <- IN._v;

25

RowId  info                                 children
Peter  born: 1965, cmpny: IBM, salary: 70k  Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa   born: 1997, school: BSIT

IN-FILTER: COL_COUNT(children)>0,
OUT._r <- IN._r,
OUT.$(IN.children._c?(@>5)) <- IN._v;

Cell predicate: ?(@>5)

26

RowId  info                                 children
Peter  born: 1965, cmpny: IBM, salary: 70k  Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa   born: 1997, school: BSIT

IN-FILTER: COL_COUNT(children)>0,
OUT._r <- IN._r,
OUT.$(IN.children._c?(!Carl)) <- IN._v;

Cell predicate: ?(!Carl)
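The difference between the row predicate and the two cell predicates can be sketched in Python. Reading @ as "the value of the current cell" and !Carl as "the column name is not Carl" is an assumption about the slide notation; copy_children and the nested dictionary layout are hypothetical, and the euro amounts are plain integers.

```python
pers = {
    'Peter': {
        'info': {'born': 1965, 'cmpny': 'IBM', 'salary': 70},
        'children': {'Lisa': 5, 'Carl': 0, 'Susi': 10, 'Toni': 7},
    },
    'Lisa': {
        'info': {'born': 1997, 'school': 'BSIT'},
        'children': {},
    },
}

def copy_children(table, cell_pred):
    """Copy cells of the 'children' family that satisfy cell_pred."""
    out = {}
    for row_id, families in table.items():
        children = families.get('children', {})
        if len(children) == 0:  # row predicate: COL_COUNT(children) > 0
            continue
        for name, amount in children.items():
            if cell_pred(name, amount):  # cell predicate, checked per cell
                out.setdefault(row_id, {})[name] = amount
    return out

more_than_5 = copy_children(pers, lambda name, amount: amount > 5)   # ?(@>5)
not_carl = copy_children(pers, lambda name, amount: name != 'Carl')  # ?(!Carl)
print(more_than_5)
print(not_carl)
```

The row predicate drops Lisa's row as a whole, while the cell predicates only drop individual cells of Peter's row.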

27

NotaQL Transformation Platform: MapReduce

28

map(rowId, row)

1. Row violates row pred.? yes: Stop.
2. Has more columns? no: Stop.
3. Cell violates cell pred.? yes: back to step 2.
4. Map IN.{_r,_c,_v}, fetched columns and constants to r, c and v.
5. emit((r, c), v), then back to step 2.

Example input row:

RowId  info
Peter  born: 1965, cmpny: IBM, salary: 70k

Output cell: row IBM, salsum: 70k

Emitted pair: ((IBM, salsum), 70k)

29

reduce((r,c), {v})

put(r, c, aggregateAll(v))

Stop

((IBM, salsum), {70k, 80k, 10k})

((IBM, salsum), 160k)
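The map and reduce steps can be sketched together in plain Python for the salary-sum script. This is a single-process illustration under stated assumptions, not the NotaQL platform or the Hadoop API: map_row, reduce_cell and run are hypothetical names, the shuffle phase is a dictionary, and the rows 'Julia' and 'Max' are made up so that the values 70k, 80k and 10k from the slide appear (salaries in thousands).

```python
from collections import defaultdict

pers = {
    'Peter': {'born': 1965, 'cmpny': 'IBM', 'salary': 70},
    'Julia': {'cmpny': 'IBM', 'salary': 80},   # hypothetical row
    'Max': {'cmpny': 'IBM', 'salary': 10},     # hypothetical row
    'Lisa': {'born': 1997, 'school': 'BSIT'},
}

def map_row(row_id, row):
    """Map step for OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary)."""
    if 'cmpny' in row and 'salary' in row:
        # emit((r, c), v)
        yield (row['cmpny'], 'salsum'), row['salary']

def reduce_cell(key, values):
    """Reduce step: aggregate all values that share the same (r, c)."""
    r, c = key
    return r, c, sum(values)

def run(table):
    groups = defaultdict(list)
    for row_id, row in table.items():
        for key, value in map_row(row_id, row):
            groups[key].append(value)       # stands in for the shuffle phase
    output = {}
    for key, values in groups.items():
        r, c, v = reduce_cell(key, values)
        output.setdefault(r, {})[c] = v     # put(r, c, aggregateAll(v))
    return output

result = run(pers)
print(result)
```

Keying the intermediate pairs by (r, c) means every output cell is aggregated independently of the others.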

30

Advanced Transformations: Graph Algorithms

31

RowId  info                                 children
Peter  born: 1965, cmpny: IBM, salary: 70k  Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa   born: 1997, school: BSIT

info: (Peter, 1965, IBM, 70k), (Lisa, 1997, BSIT)
children: (€10, €5, €0, €7)

32

Input Cell: _r = Peter, _c = Lisa, _v = €5

Output Cell: _r = Lisa, _c = Peter, _v = €5

"Parent of each person?"

OUT._r <- IN.children._c,
OUT.$(IN._r) <- IN._v;

33

RowId      info                       links
Wikipedia  crawled: 17:35, pr: 0.333  Twitter: -, Google: -
Twitter    crawled: 17:36, pr: 0.333  Google: -
Google     crawled: 17:36, pr: 0.333  Wikipedia: -

OUT._r <- IN.links._c,
OUT.incoming.$(IN._r) <- IN._v;

34

RowId      info                       links                  incoming
Wikipedia  crawled: 17:35, pr: 0.333  Twitter: -, Google: -  Google: -
Twitter    crawled: 17:36, pr: 0.333  Google: -              Wikipedia: -
Google     crawled: 17:36, pr: 0.333  Wikipedia: -           Twitter: -, Wikipedia: -

OUT._r <- IN.links._c,
OUT.incoming.$(IN._r) <- IN._v;

Reversing the graph
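What the script does to the link graph can be sketched in Python: every edge from a source page to a target page becomes an entry for the source in the target's incoming list. The reverse_graph function and the adjacency-list layout are assumptions for illustration; the cell values ("-" on the slide) are left out.

```python
# Outgoing links from the slide: page -> pages it links to
links = {
    'Wikipedia': ['Twitter', 'Google'],
    'Twitter': ['Google'],
    'Google': ['Wikipedia'],
}

def reverse_graph(links):
    """Sketch of OUT._r <- IN.links._c, OUT.incoming.$(IN._r) <- IN._v."""
    incoming = {}
    for source, targets in links.items():
        for target in targets:
            # The column name (target) becomes the row id,
            # the row id (source) becomes the column name.
            incoming.setdefault(target, []).append(source)
    return incoming

incoming = reverse_graph(links)
print(incoming)
```

The same swap of row id and column name answers "Parent of each person?" on the children column family.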

35

RowId      info                       links                  incoming
Wikipedia  crawled: 17:35, pr: 0.333  Twitter: -, Google: -  Google: -
Twitter    crawled: 17:36, pr: 0.333  Google: -              Wikipedia: -
Google     crawled: 17:36, pr: 0.333  Wikipedia: -           Twitter: -, Wikipedia: -

OUT._r <- IN.links._c,
OUT.info.pr <- SUM(IN.pr/COL_COUNT(links));

36

RowId      info                       links                  incoming
Wikipedia  crawled: 17:35, pr: 0.333  Twitter: -, Google: -  Google: -
Twitter    crawled: 17:36, pr: 0.167  Google: -              Wikipedia: -
Google     crawled: 17:36, pr: 0.5    Wikipedia: -           Twitter: -, Wikipedia: -

OUT._r <- IN.links._c,
OUT.info.pr <- SUM(IN.pr/COL_COUNT(links));

PageRank
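One iteration of this script can be checked with a short Python sketch. Like the slide, it uses no damping factor; pagerank_step and the dictionary layout are hypothetical, and skipping pages without outgoing links is an added assumption to avoid a division by zero.

```python
links = {
    'Wikipedia': ['Twitter', 'Google'],
    'Twitter': ['Google'],
    'Google': ['Wikipedia'],
}
pr = {'Wikipedia': 1 / 3, 'Twitter': 1 / 3, 'Google': 1 / 3}

def pagerank_step(pr, links):
    """Sketch of OUT._r <- IN.links._c, OUT.info.pr <- SUM(IN.pr/COL_COUNT(links))."""
    new_pr = {}
    for page, targets in links.items():
        if not targets:
            continue  # assumption: pages without links distribute nothing
        share = pr[page] / len(targets)  # IN.pr / COL_COUNT(links)
        for target in targets:
            new_pr[target] = new_pr.get(target, 0.0) + share  # SUM per target
    return new_pr

new_pr = pagerank_step(pr, links)
print({page: round(value, 3) for page, value in new_pr.items()})
```

Wikipedia splits its 0.333 between Twitter and Google, so Twitter ends at 0.167, Google at 0.167 + 0.333 = 0.5, and Wikipedia receives Google's full 0.333.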

37

Advanced Transformations: Text Processing

38

RowId      info
Wikipedia  crawled: 17:35, pr: 0.333, body: all information can be found here
Twitter    crawled: 17:36, pr: 0.167, body: click here for more information

OUT._r <- IN._r,
OUT.words <- COUNT(IN.body.split(' '));

39

RowId      info
Wikipedia  crawled: 17:35, pr: 0.333, body: all information can be found here, words: 6
Twitter    crawled: 17:36, pr: 0.167, body: click here for more information, words: 5

OUT._r <- IN._r,
OUT.words <- COUNT(IN.body.split(' '));

Word Count
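The word counts on this slide can be reproduced with a small Python sketch. Only the body column is modelled, and word_count is a hypothetical helper that splits on single spaces as the script does.

```python
# body column of each row, as on the slide
pages = {
    'Wikipedia': 'all information can be found here',
    'Twitter': 'click here for more information',
}

def word_count(pages):
    """Sketch of OUT._r <- IN._r, OUT.words <- COUNT(IN.body.split(' '))."""
    return {row_id: {'words': len(body.split(' '))}
            for row_id, body in pages.items()}

counts = word_count(pages)
print(counts)
```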

40

RowId      info
Wikipedia  crawled: 17:35, pr: 0.333, body: all information can be found here, words: 6
Twitter    crawled: 17:36, pr: 0.167, body: click here for more information, words: 5

OUT._r <- IN.body.split(' '),
OUT.$(IN._r) <- COUNT(*);

RowId  info
here   Wikipedia: 1, Twitter: 1

Term Index
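The term index inverts the table: each word becomes a row id and each page becomes a column holding the number of occurrences. A hedged Python sketch, with term_index as a hypothetical helper and only the body column modelled:

```python
from collections import Counter

pages = {
    'Wikipedia': 'all information can be found here',
    'Twitter': 'click here for more information',
}

def term_index(pages):
    """Sketch of OUT._r <- IN.body.split(' '), OUT.$(IN._r) <- COUNT(*)."""
    index = {}
    for row_id, body in pages.items():
        # Count how often each word occurs in this page's body.
        for word, n in Counter(body.split(' ')).items():
            index.setdefault(word, {})[row_id] = n
    return index

index = term_index(pages)
print(index['here'])
```

The slide shows only the row for "here"; the sketch builds one such row for every word.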

41

Conclusion

Selection, Projection
Grouping, Aggregation
Schema-Flexible
Horizontal Aggregation
Metadata / Data
Graph Processing
Text Processing

SQL

42

Thank you!