Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin...
![Page 1: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/1.jpg)
Big Data Management – Challenges and Opportunities –
an Incomplete Survey
Jiaheng Lu, Renmin University of China
Joint work with Yu Liu
Tutorial on HotDB
![Page 2: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/2.jpg)
Tutorial objectives
• Big data challenges
• New principles of big data management
• Big data management research
– Indexes
– Transactions
– Architecture
– Applications
– Benchmarks
![Page 3: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/3.jpg)
Big data challenge
• Big data
– Science data
– Finance data
– Streaming data
– Internet data
![Page 4: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/4.jpg)
Big data management challenge
The growth in database transactions and volumes has a large impact on response times.
Source: http://www.codefutures.com/database-sharding/
![Page 5: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/5.jpg)
Many techniques have evolved:
• Master/Slave
• Cluster Computing
• Table Partitioning
• Federated Tables
![Page 6: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/6.jpg)
Four new principles in big data management
![Page 7: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/7.jpg)
New principle in big data management (1)
• Partition everything and use key-value storage
• Divide all things in order to manage them
• First normal form (1NF) cannot be satisfied
![Page 8: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/8.jpg)
New principle in big data management (2)
• Embrace inconsistency
• Tolerating differences leads to great harmony
• ACID properties are not satisfied
![Page 9: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/9.jpg)
New principle in big data management (3)
• Back up everything with three copies
• A wily hare needs three burrows to rest easy
• Guarantee 99.999999% safety
![Page 10: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/10.jpg)
New principle in big data management (4)
• Scalable and high performance
• Plan for an ocean of data, with the capacity to hold it all
![Page 11: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/11.jpg)
Big data management
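• Partition everything: divide all things in order to manage them
• Embrace inconsistency: tolerating differences leads to great harmony
• Back up data with three copies: a wily hare needs three burrows to rest easy
• Scalable and high performance: plan for an ocean of data, with the capacity to hold it all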
![Page 12: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/12.jpg)
Big Data Management
Indexes on Big Data
Transaction on Big Data
Processing Architecture on Big Data
Applications in MapReduce Parallel Processing
Benchmark of Big Data Management System
![Page 13: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/13.jpg)
Related Papers
[Bar chart: number of related papers per year (2009, 2010, 2011) at SIGMOD, VLDB, and ICDE; vertical axis from 0 to 14.]
![Page 14: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/14.jpg)
Related Papers
[Bar chart: number of related papers per topic (Index on Big Data, Transaction on Big Data, Architecture, Applications, Benchmark) for 2009, 2010, and 2011; vertical axis from 0 to 4.5.]
![Page 15: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/15.jpg)
Big data papers (incomplete data)
• Indexes on Big Data: ~4 papers
• Transaction on Big Data: 4~5 papers
• Processing Architecture on Big Data: 6~7 papers
• Applications in MapReduce Parallel Processing: 6~7 papers
• Benchmark of Big Data Management System: 3~4 papers
![Page 16: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/16.jpg)
![Page 17: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/17.jpg)
Indexes on Big Data
• Construct indexes that can be maintained incrementally.
• Avoid bottlenecks in tree-like structures so that reads and writes can proceed concurrently.
![Page 18: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/18.jpg)
Distributed B-Tree
Goal: perform consistent concurrent updates while allowing highly concurrent reads
M. K. Aguilera, W. Golab, et al. A Practical Scalable Distributed B-Tree. VLDB 2008
Indexes on Big Data
![Page 19: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/19.jpg)
Distributed B-Tree
Three techniques:
• Transactions with optimistic concurrency control
• Lazy replication of version numbers at clients
• Eager replication of version numbers at servers
M. K. Aguilera, W. Golab, et al. A Practical Scalable Distributed B-Tree. VLDB 2008
Indexes on Big Data
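To make the first technique concrete, here is a minimal sketch of optimistic concurrency control driven by version numbers. It is an illustration only, not the implementation from the cited paper: the `VersionedStore` class, its `read` and `commit` methods, and the node identifiers are all hypothetical names, and a single in-memory dictionary stands in for the servers.

```python
class VersionedStore:
    """Hypothetical in-memory stand-in for servers holding B-tree nodes."""

    def __init__(self):
        self._data = {}  # node_id -> (version, value)

    def read(self, node_id):
        # Unknown nodes are reported as version 0 with no value.
        return self._data.get(node_id, (0, None))

    def commit(self, read_set, write_set):
        """Validate the versions seen by the transaction, then apply writes.

        read_set maps node_id -> version observed while reading.
        write_set maps node_id -> new value.
        Returns False (nothing written) if any observed version is stale.
        """
        for node_id, seen_version in read_set.items():
            current_version, _ = self.read(node_id)
            if current_version != seen_version:
                return False  # conflict detected: the caller should retry
        for node_id, value in write_set.items():
            current_version, _ = self.read(node_id)
            self._data[node_id] = (current_version + 1, value)
        return True


store = VersionedStore()
version, _ = store.read("leaf-1")
first = store.commit({"leaf-1": version}, {"leaf-1": "k=1"})
# A second writer that still holds the old version number must fail and retry.
second = store.commit({"leaf-1": version}, {"leaf-1": "k=2"})
```

No locks are held while the transaction reads; conflicts are caught only at commit time, which is why cheap access to up-to-date version numbers matters for this approach.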
![Page 20: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/20.jpg)
• Use a BATON overlay to support range queries
• Local B+-tree index & Cloud Global (CG) index
• Publish only a few local index nodes to the global index to achieve high throughput and concurrency
Sai Wu, Dawei Jiang, et al. Efficient B-tree Based Indexing for Cloud Data Processing. VLDB 2010
Indexes on Big Data
![Page 21: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/21.jpg)
BATON overlay
Steps to retrieve data:
1. Search in the BATON tree (lookup());
2. For all overlapping nodes in the global index, find the corresponding nodes (and local indexes);
3. Search in the local B+-tree index to retrieve the data.
Sai Wu, Dawei Jiang, et al. Efficient B-tree Based Indexing for Cloud Data Processing. VLDB 2010
Indexes on Big Data
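The three retrieval steps can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the system from the cited paper: a flat Python list of key ranges stands in for the BATON overlay, a sorted list stands in for each local B+-tree, and the names `LocalIndex`, `build_global_index`, and `range_query` are hypothetical.

```python
import bisect


class LocalIndex:
    """Sorted list standing in for one node's local B+-tree (assumption)."""

    def __init__(self, items):
        self.items = sorted(items)              # list of (key, value) pairs
        self.keys = [k for k, _ in self.items]

    def key_range(self):
        return self.keys[0], self.keys[-1]

    def range_search(self, lo, hi):
        start = bisect.bisect_left(self.keys, lo)
        end = bisect.bisect_right(self.keys, hi)
        return self.items[start:end]


def build_global_index(local_indexes):
    # Each node publishes only its key range, not every local entry.
    return [(idx.key_range(), idx) for idx in local_indexes]


def range_query(global_index, lo, hi):
    results = []
    # Steps 1 and 2: find the published ranges that overlap [lo, hi].
    for (node_lo, node_hi), local in global_index:
        if node_lo <= hi and lo <= node_hi:
            # Step 3: search only the matching nodes' local indexes.
            results.extend(local.range_search(lo, hi))
    return sorted(results)


node_a = LocalIndex([(1, "a"), (5, "b"), (9, "c")])
node_b = LocalIndex([(12, "d"), (20, "e")])
cg_index = build_global_index([node_a, node_b])
hits = range_query(cg_index, 5, 12)
```

In a real overlay the linear scan over published ranges would be replaced by a tree lookup; the point here is only the division of labour between the global and local levels.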
![Page 22: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/22.jpg)
![Page 23: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/23.jpg)
The CAP Theorem
Consistency
Partition tolerance
Availability
![Page 24: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/24.jpg)
The CAP Theorem
Once a writer has written, all readers will see that write
Consistency
Partition tolerance
Availability
![Page 25: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/25.jpg)
The CAP Theorem
System is available during software and hardware upgrades and node failures.
Consistency
Partition tolerance
Availability
![Page 26: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/26.jpg)
The CAP Theorem
A system can continue to operate in the presence of network partitions.
Consistency
Partition tolerance
Availability
![Page 27: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/27.jpg)
The CAP Theorem
Theorem: You can have at most two of these properties for any shared-data system
Consistency
Partition tolerance
Availability
![Page 28: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/28.jpg)
Consistency
• Two kinds of consistency:
– strong consistency – ACID (Atomicity, Consistency, Isolation, Durability)
– weak consistency – BASE (Basically Available, Soft-state, Eventual consistency)
![Page 29: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/29.jpg)
A tailor
[Figure: RDBMS, labeled with 3NF, TRANSACTION, LOCK, ACID, SAFETY.]
![Page 30: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/30.jpg)
“Not all data need to be treated at the same level of consistency.”
Consistency Rationing
• Goal: minimize the overall cost of operations in the cloud
• Define consistency guarantees on the data instead of at the transaction level
• Switch consistency guarantees automatically at runtime
• 3 categories
T. Kraska, M. Hentschel, et al. Consistency Rationing in the Cloud: Pay only when it matters. VLDB 2009
Transaction on Big Data
![Page 31: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/31.jpg)
Transaction on Big Data
Category A: Serializable
• Consistency violation results in large penalty costs
Category B: Adaptive
• Trade-off between cost per operation & consistency level
• Switch between session consistency and serializability at runtime
Category C: Session Consistency
• (Temporary) inconsistency is acceptable
• Read-your-own-writes monotonicity
• Converge & achieve eventual consistency at some interval
T. Kraska, M. Hentschel, et al. Consistency Rationing in the Cloud: Pay only when it matters. VLDB 2009
![Page 32: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/32.jpg)
Category B: trade-off between cost per operation & consistency level
• General Policy: "A higher consistency level needs to be provided when the conflict (update) rate is high."
• Time Policy: as the "deadline" approaches, more commits.
• Fixed Threshold Policy (for numeric types)
• Dynamic Policy (for numeric types)
Y: sum of update values
T. Kraska, M. Hentschel, et al. Consistency Rationing in the Cloud: Pay only when it matters. VLDB 2009
Transaction on Big Data
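As a rough illustration of how a Category B record might switch guarantees at runtime, the sketch below shows one plausible reading of a fixed threshold policy for a numeric value such as a stock level. The rule itself (use serializability once the value is at or below a threshold, session consistency otherwise) and the names `choose_consistency` and `threshold` are assumptions made for this example, not definitions taken from the slide.

```python
def choose_consistency(current_value, threshold):
    """Pick a consistency level for one update of a numeric record.

    Assumed rule: when the value is close to a critical limit
    (at or below `threshold`), a conflict is expensive, so pay for
    serializability; otherwise the cheaper session consistency is used.
    """
    if current_value <= threshold:
        return "serializable"
    return "session"


# Plenty of stock left: cheap guarantee. Nearly sold out: strong guarantee.
plenty = choose_consistency(current_value=500, threshold=10)
scarce = choose_consistency(current_value=3, threshold=10)
```

A dynamic policy would presumably adjust the threshold per record from observed update behaviour instead of fixing it in advance.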
![Page 33: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/33.jpg)
• Datalog and coordination complexity: theoretical results from a PODS perspective
(PODS keynote 2011 Joseph M. Hellerstein, UC Berkeley)
![Page 34: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/34.jpg)
Datalog
• Main expressive advantage: recursive queries.
• More convenient for analysis: papers look better.
• Without recursion but with negation it is equivalent in power to relational algebra.
• Has affected real practice (e.g., recursion in SQL3, magic sets transformations).
![Page 35: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/35.jpg)
Datalog
• Example Datalog program:
parent(bill,mary).
parent(mary,john).
ancestor(X,Y) :- parent(X,Y).
ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).
?- ancestor(bill,X)
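The example program can be evaluated bottom-up by applying the two rules until no new facts appear. The sketch below does this naively in Python; the function names `ancestors` and `query` are hypothetical, and real Datalog engines use more efficient strategies (for instance semi-naive evaluation).

```python
def ancestors(parent_facts):
    """Naive fixpoint evaluation of the two ancestor rules."""
    # Rule 1: ancestor(X,Y) :- parent(X,Y).
    ancestor = set(parent_facts)
    while True:
        # Rule 2: ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).
        derived = {
            (x, y)
            for (x, z) in parent_facts
            for (z2, y) in ancestor
            if z == z2
        }
        if derived <= ancestor:      # nothing new: fixpoint reached
            return ancestor
        ancestor |= derived


def query(ancestor_facts, who):
    """Answer ?- ancestor(who, X)."""
    return sorted(y for (x, y) in ancestor_facts if x == who)


parents = {("bill", "mary"), ("mary", "john")}
answer = query(ancestors(parents), "bill")  # ['john', 'mary']
```

The loop only ever adds facts, which is the monotonic behaviour that the conjectures on the following slides build on.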
![Page 36: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/36.jpg)
Joseph’s Conjecture (1)
• CONJECTURE 1. Consistency And Logical Monotonicity (CALM).
• A program has an eventually consistent, coordination-free execution strategy if and only if it is expressible in (monotonic) Datalog.
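The intuition behind CALM can be illustrated with a toy experiment: a monotonic computation (accumulating a set) gives the same answer for every message arrival order, while a computation that uses negation ("no ack seen yet") does not. This is only an informal demonstration, not a proof and not Dedalus; the message format and both function names are assumptions made for the example.

```python
from itertools import permutations


def monotonic_union(messages):
    """Monotonic: the result only grows as messages arrive."""
    seen = set()
    for message in messages:
        seen.add(message)
    return seen


def unacked_on_arrival(messages):
    """Non-monotonic: flags a request if no ack has been seen *so far*."""
    acked, flagged = set(), set()
    for kind, ident in messages:
        if kind == "ack":
            acked.add(ident)
        elif ident not in acked:   # negation over what has not arrived yet
            flagged.add(ident)
    return flagged


msgs = [("req", 1), ("ack", 1)]
orders = list(permutations(msgs))
monotonic_results = {frozenset(monotonic_union(o)) for o in orders}
negation_results = {frozenset(unacked_on_arrival(o)) for o in orders}
# One outcome for the monotonic program; two for the one using negation.
outcome_counts = (len(monotonic_results), len(negation_results))
```

Making the second function deterministic would require waiting until all acks are known to have arrived, which is exactly the kind of coordination the conjecture refers to.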
![Page 37: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/37.jpg)
Joseph’s Conjecture (2)
• CONJECTURE 2. Causality Required Only for Non-monotonicity (CRON).
• Program semantics require causal message ordering if and only if the messages participate in non-monotonic derivations.
![Page 38: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/38.jpg)
Joseph’s Conjecture (3)
• CONJECTURE 3. The minimum number of Dedalus timesteps required to evaluate a program on a given input data set is equivalent to the program’s Coordination Complexity.
![Page 39: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/39.jpg)
Joseph’s Conjecture (4)
• CONJECTURE 4. Any Dedalus program P can be rewritten into an equivalent temporally-minimized program P’ such that each inductive or asynchronous rule of P’ is necessary: converting that rule to a deductive rule would result in a program with no unique minimal model.
![Page 40: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/40.jpg)
Circumstance has presented a rare opportunity—call it an imperative—for the database community to take its place in the sun, and help create a new environment for parallel and distributed computation to flourish.
------Joseph M. Hellerstein (UC Berkeley)
![Page 41: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/41.jpg)
![Page 42: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/42.jpg)
Processing Architecture on Big Data
Make MapReduce more powerful, especially on complicated analysis
Merge cloud computing systems and PDBMSs
![Page 43: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/43.jpg)
MapReduce online testing platform
• cloudcomputing.ruc.edu.cn
• Automatic evaluation of Hadoop MapReduce code
• Theoretical questions
![Page 44: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/44.jpg)
Open MapReduce testing platform: cloudcomputing.ruc.edu.cn
![Page 45: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/45.jpg)
“Sort-merge implementation in Hadoop poses fundamental barrier to incremental one-pass analysis”
New Hash-Based Platform
Processing Architecture on Big Data
B. Li, E. Mazur, et al. A Platform for Scalable One-Pass Analytics using MapReduce. SIGMOD 2011
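To see why hashing suits incremental one-pass analysis, consider a group-by count. A sort-merge pipeline cannot emit a group's result until the sort has finished, whereas a hash table can update and report a running aggregate as each record arrives. The sketch below shows only that general idea in plain Python; it is not the platform from the cited paper, and `incremental_counts` is a hypothetical name.

```python
from collections import defaultdict


def incremental_counts(records):
    """Yield (key, running_count) after every record, in a single pass."""
    counts = defaultdict(int)   # in-memory hash table keyed by group
    for key in records:
        counts[key] += 1
        yield key, counts[key]  # an up-to-date answer is available at once


stream = ["click", "view", "click"]
running = list(incremental_counts(stream))
```

With a real data volume the hash table may not fit in memory, so a practical design also has to decide which keys to keep resident; that aspect is left out of this sketch.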
![Page 46: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/46.jpg)
• Fast Join Processing in Data Warehouse
• Partitioning Data into Vertical Groups Dynamically
Y. Lin, D. Agrawal, et al. Llama: Leveraging Columnar Storage for Scalable Join Processing in the MapReduce Framework. SIGMOD 2011
Processing Architecture on Big Data
![Page 47: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/47.jpg)
• Fast Join Processing in Data Warehouse
• Partitioning Data into Vertical Groups Dynamically
• Concurrent Join
• More Map-side Joins
• Basic patterns: Star Pattern & Chain Pattern
Processing Architecture on Big Data
![Page 48: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/48.jpg)
![Page 49: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/49.jpg)
HadoopDB
• Combination of parallel DBMS (performance) and MapReduce (scalability, fault tolerance)
• Communication layer: MapReduce
• Nodes: single-node DBMS instances
• SMS Planner: SQL to MapReduce to SQL
Processing Architecture on Big Data
![Page 50: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/50.jpg)
![Page 51: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/51.jpg)
A. Okcan, M. Riedewald. Processing Theta-Joins using MapReduce. SIGMOD 2011
Discusses algorithms for theta-joins (including inequality joins)
Applications in MapReduce Parallel Processing
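One way to run an arbitrary theta-join on MapReduce is to view the join as a matrix of S rows by T columns, cut it into a grid of regions, and give each region to one reducer. The sketch below simulates that idea in a single process: every S tuple is sent to all regions of one randomly chosen row band, every T tuple to all regions of one randomly chosen column band, so each (s, t) pair meets in exactly one region. It is a simplified illustration in the spirit of the cited paper rather than its algorithm; the function names and the fixed 2x2 grid are assumptions.

```python
import random


def map_phase(S, T, rows, cols, rng):
    """Assign tuples to regions of a rows x cols grid over the join matrix."""
    regions = {(r, c): ([], []) for r in range(rows) for c in range(cols)}
    for s in S:
        r = rng.randrange(rows)          # random row band for this S tuple
        for c in range(cols):            # replicate across the whole band
            regions[(r, c)][0].append(s)
    for t in T:
        c = rng.randrange(cols)          # random column band for this T tuple
        for r in range(rows):
            regions[(r, c)][1].append(t)
    return regions


def reduce_phase(regions, theta):
    """Each region joins its local tuples with the arbitrary predicate."""
    output = []
    for s_part, t_part in regions.values():
        output.extend((s, t) for s in s_part for t in t_part if theta(s, t))
    return output


def theta_join(S, T, theta, rows=2, cols=2, seed=0):
    rng = random.Random(seed)
    return reduce_phase(map_phase(S, T, rows, cols, rng), theta)


pairs = sorted(theta_join([1, 5, 7], [2, 6], lambda s, t: s < t))
```

The random assignment needs no knowledge of the join condition, which is what makes the scheme usable for inequality joins; the price is that each input tuple is replicated to several reducers.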
![Page 52: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/52.jpg)
R. Vernica, M. J. Carey, et al. Efficient Set-Similarity Joins Using MapReduce. SIGMOD 2010
• Uses the MapReduce framework to perform set-similarity joins, i.e., given two files (or one), find all pairs of records (a, b) such that a and b are similar (sim(a, b) > t).
• Gives algorithms that cope with large amounts of data, along with an experimental evaluation.
Applications in MapReduce Parallel Processing
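A small single-process sketch of a set-similarity self-join is given below. It assumes Jaccard similarity for sim(a, b) and a non-negative threshold t, and it generates candidates by grouping records on shared tokens, which is a deliberately simplified stand-in for the filtering techniques in the cited paper. The names `jaccard` and `similarity_self_join` and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    return len(a & b) / len(a | b)


def similarity_self_join(records, t):
    """records: dict of record id -> set of tokens. Returns pairs with sim > t."""
    # "Map": emit (token, record id) so records sharing a token are grouped.
    by_token = defaultdict(set)
    for rid, tokens in records.items():
        for token in tokens:
            by_token[token].add(rid)
    # "Reduce" per token: every two records in a group form a candidate pair.
    candidates = set()
    for rids in by_token.values():
        candidates.update(combinations(sorted(rids), 2))
    # Verification: compute the real similarity for each candidate once.
    return sorted(
        (a, b) for a, b in candidates if jaccard(records[a], records[b]) > t
    )


docs = {
    1: {"big", "data", "index"},
    2: {"big", "data", "join"},
    3: {"cloud", "storage"},
}
similar = similarity_self_join(docs, t=0.4)  # [(1, 2)]
```

Records that share no token have Jaccard similarity 0, so no qualifying pair is missed when t is non-negative; frequent tokens, however, create very large groups, which is the problem that more selective filtering is meant to address.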
![Page 53: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/53.jpg)
![Page 54: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/54.jpg)
Benchmark of Big Data Management System
Comparison of the performance between MapReduce paradigm and parallel DBMSs
PERFORMANCE PDBMSs >> MR systems (except data loading)
Comparison: Schema Support, Indexing, Programming Model, Data Distribution, Execution Strategy, Flexibility, Fault Tolerance
A. Pavlo, E. Paulson, et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD 2009
![Page 55: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/55.jpg)
![Page 56: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/56.jpg)
How do architectures affect the performance of database applications in the cloud, especially for OLTP?
D. Kossmann, T. Kraska, et al. An Evaluation of Alternative Architectures for Transaction Processing in the Cloud. SIGMOD 2010
Benchmark of Big Data Management System
![Page 57: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/57.jpg)
![Page 58: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/58.jpg)
![Page 59: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/59.jpg)
Conclusion
• Big Data Management: a hot DB topic
• Research topics: indexing, transaction, join, architecture, application, benchmark
![Page 60: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/60.jpg)
References
• Sai Wu, Dawei Jiang, et al. Efficient B-tree Based Indexing for Cloud Data Processing. VLDB 2010
• David Chiu, A. Shetty, et al. Evaluating and Optimizing Indexing Schemes for a Cloud-based Elastic Key-Value Store. 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2011
• J. Wang, S. Wu, et al. Indexing Multi-dimensional Data in a Cloud System. SIGMOD 2010
• D. Kossmann, T. Kraska, et al. An Evaluation of Alternative Architectures for Transaction Processing in the Cloud. SIGMOD 2010
• T. Kraska, M. Hentschel, et al. Consistency Rationing in the Cloud: Pay only when it matters. VLDB 2009
• H. T. Vo, C. Chen, et al. Towards Elastic Transactional Cloud Storage with Range Query Support. VLDB 2010
• H. Kllapi, E. Sitaridi, et al. Schedule Optimization for Data Processing Flows on the Cloud. SIGMOD 2011
• M. K. Aguilera, W. Golab, et al. A Practical Scalable Distributed B-Tree. VLDB 2008
![Page 61: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/61.jpg)
References
• E. Friedman, P. Pawlowski, et al. SQL/MapReduce: A Practical approach to self-describing, polymorphic, and parallelizable user-defined functions. VLDB 2009
• R. Vernica, M. J. Carey, et al. Efficient Set-Similarity Joins Using MapReduce. SIGMOD 2010
• S. Blanas, J. M. Patel, et al. A Comparison of Join Algorithms for Log Processing in MapReduce. SIGMOD 2010
• D. Logothetis, K. Yocum. Ad-Hoc Data Processing in the Cloud. VLDB 2008
• B. Panda, J. S. Herbach, et al. PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. VLDB 2009
• A. Okcan, M. Riedewald. Processing Theta-Joins using MapReduce. SIGMOD 2011
• K. Morton, M. Balazinska, et al. ParaTimer: A Progress Indicator for MapReduce DAGs. SIGMOD 2010
• Y. Cao, C. Chen, et al. ES2: A Cloud Data Storage System for Supporting Both OLTP and OLAP. ICDE 2011
• K. Morton, A. Friesen, et al. Estimating the Progress of MapReduce Pipelines. ICDE 2010
![Page 62: Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.](https://reader036.fdocuments.us/reader036/viewer/2022062511/551a7cc0550346e0158b47bf/html5/thumbnails/62.jpg)
References
• W. Lang, J. M. Patel. Energy Management for MapReduce Clusters. VLDB 2010
• T. Nykiel, M. Potamias, et al. MRShare: Sharing Across Multiple Queries in MapReduce. VLDB 2010
• C. Olston, G. Chiou, et al. Nova: Continuous Pig/Hadoop Workflows. SIGMOD 2011
• Y. Lin, D. Agrawal, et al. Llama: Leveraging Columnar Storage for Scalable Join Processing in the MapReduce Framework. SIGMOD 2011
• B. Li, E. Mazur, et al. A Platform for Scalable One-Pass Analytics using MapReduce. SIGMOD 2011
• D. G. Campbell, G. Kakivaya, et al. Extreme Scale with Full SQL Language Support in Microsoft SQL Azure. SIGMOD 2010
• A. Abouzeid, K. B-Pawlikowski, et al. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. VLDB 2009
• Y. Xu, P. Kostamaa, et al. Integrating Hadoop and Parallel DBMS. SIGMOD 2010
• J. A. Q-Ruiz, C. Pinkel, et al. RAFT at Work: Speeding-Up MapReduce Applications under Task and Node Failures. SIGMOD 2011
• A. Pavlo, E. Paulson, et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD 2009