ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spark with Alluxio at Spark...
ALLUXIO (FORMERLY TACHYON): UNIFY DATA AT MEMORY SPEED
- EFFECTIVE USING SPARK WITH ALLUXIO
Spark Summit at Boston, Feb. 2017
Haoyuan Li, Alluxio Inc.
HISTORY
• Started at UC Berkeley AMPLab in summer 2012
• Originally named Tachyon
• Rebranded to Alluxio in early 2016
• Open sourced in 2013
• Apache License 2.0
• Latest stable release: Alluxio 1.4.0
• Alluxio 1.5.0 planned for Q2 2017
FASTEST GROWING BIG DATA PROJECTS
• Fastest growing open-source project in the big data ecosystem
• 400+ contributors from 100+ organizations
• Running in large production clusters
• Community members are welcome!
[Chart: Popular Open Source Projects’ Growth; x-axis: Months; y-axis: Number of Contributors]
INDUSTRY ADOPTION
BIG DATA ECOSYSTEM YESTERDAY
BIG DATA ECOSYSTEM TODAY
BIG DATA ECOSYSTEM ISSUES
BIG DATA ECOSYSTEM WITH ALLUXIO
Unifying Data at Memory Speed
[Diagram: Alluxio between compute frameworks and storage systems]
Application-facing interfaces: Native File System, Hadoop Compatible File System, Native Key-Value Interface, FUSE Compatible File System
Storage-facing interfaces: HDFS Interface, Amazon S3 Interface, Swift Interface, GlusterFS Interface
WHY ALLUXIO
Co-located compute and data with memory-speed access to data
Virtualized across different storage systems under a unified namespace
Scale-out architecture
File system API, software only
BENEFITS
Unification: New workflows across any data in any storage system
Performance: Orders of magnitude improvement in run time
Flexibility: Choice in compute and storage – grow each independently, buy only what is needed
#1 – ACCELERATING REMOTE STORAGE I/O
• Scenario: Compute and Storage Separation
• Meet different compute and storage hardware requirements
• Scale compute and storage independently
• Store data in traditional filers/SANs and object stores
• Analyze existing data with Big Data compute frameworks
• Limitation
• Accessing data requires remote I/O
I/O WITHOUT ALLUXIO
[Diagram: Spark and Storage, with the labels "Low latency, memory throughput" and "High latency, network throughput"]
I/O WITH ALLUXIO
[Diagram: Spark, Alluxio, Storage]
Keeping data in Alluxio accelerates data access
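The idea on this slide can be sketched as a read-through cache. This is a toy model in plain Python, not Alluxio's implementation or API: `RemoteStore`, `ReadThroughCache`, and the sample path are hypothetical names chosen for illustration. The first read of a path goes to the slow remote store; later reads are served from local memory.

```python
class RemoteStore:
    """Stand-in for remote storage; counts how often it is actually read."""

    def __init__(self, files):
        self._files = dict(files)
        self.reads = 0

    def read(self, path):
        self.reads += 1  # each call models one remote I/O
        return self._files[path]


class ReadThroughCache:
    """Toy memory layer between compute and storage (hypothetical, not Alluxio's API)."""

    def __init__(self, store):
        self._store = store
        self._memory = {}

    def read(self, path):
        if path not in self._memory:
            # Cache miss: pay the remote I/O cost once, then keep the data in memory.
            self._memory[path] = self._store.read(path)
        return self._memory[path]


store = RemoteStore({"/logs/day1": b"a,b,c"})
cache = ReadThroughCache(store)

first = cache.read("/logs/day1")   # goes to remote storage
second = cache.read("/logs/day1")  # served from memory
print(store.reads)  # 1
```

Only one remote read is counted even though the path is read twice, which is the effect the slide attributes to keeping data in Alluxio.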
CASE STUDY: BAIDU
The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds.
- Baidu
RESULTS
• Data queries are now 30x faster with Alluxio
• Alluxio cluster runs stably, providing over 50TB of RAM space
• By using Alluxio, batch queries usually lasting over 15 minutes were transformed into an interactive query taking less than 30 seconds
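The figures quoted on this slide can be checked with simple arithmetic. The sketch below uses only the numbers given above; the helper name `speedup` is an assumption for illustration. Because the batch figure is a lower bound ("over 15 minutes") and the interactive figure is an upper bound ("less than 30 seconds"), those two numbers imply a speedup of at least 30x, while the matching endpoints in the quote (100-150 seconds down to 10-15 seconds) correspond to about 10x.

```python
def speedup(before_seconds, after_seconds):
    """Ratio of run time before to run time after."""
    return before_seconds / after_seconds


# Batch query: over 15 minutes before, under 30 seconds after.
batch = speedup(15 * 60, 30)

# Quote: 100-150 seconds with Spark SQL alone, 10-15 seconds with Alluxio.
quote_low = speedup(100, 10)
quote_high = speedup(150, 15)

print(batch, quote_low, quote_high)  # 30.0 10.0 10.0
```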
Baidu’s PMs and analysts run interactive queries to gain insights into their products and business
• 200+ nodes deployment
• 2+ petabytes of storage
• Mix of memory + HDD
[Diagram: ALLUXIO, Baidu File System]
#2 – SHARING DATA AT MEMORY-SPEED AMONG APPLICATIONS
• Scenario: Data Sharing Architecture
• Pipelines: output of one job is input of the next job
• Applications, jobs, and contexts reading the same data
• Limitation
• Sharing data requires I/O
SHARING WITHOUT ALLUXIO
[Diagram: Spark, MapReduce, and Spark sharing data through Storage, with the labels "Network I/O" and "Disk I/O"]
I/O slows down sharing
SHARING WITH ALLUXIO
[Diagram: Spark, MapReduce, Spark, Alluxio, Storage]
Memory-speed sharing
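In practice, jobs share data through Alluxio by reading and writing the same `alluxio://` path. The sketch below only builds such a path: the host name `alluxio-master`, the directory `/pipeline/stage1`, and the helper `alluxio_uri` are assumptions for illustration, and 19998 is the default Alluxio master port. The Spark calls appear as comments because they need a running cluster with the Alluxio client on the classpath.

```python
def alluxio_uri(path, host="alluxio-master", port=19998):
    """Build an alluxio:// URI for an absolute path in the Alluxio namespace."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    return f"alluxio://{host}:{port}{path}"


shared = alluxio_uri("/pipeline/stage1")
print(shared)  # alluxio://alluxio-master:19998/pipeline/stage1

# Job 1 (Spark) would write its output to the shared path, for example:
#   df.write.parquet(shared)
# Job 2 (another Spark application) would read the same path:
#   spark.read.parquet(shared)
```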
CASE STUDY: BARCLAYS
Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity.
- Barclays
RESULTS
• Barclays workflow iteration time decreased from hours to seconds
• Alluxio enabled workflows that were impossible before
• By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds
Barclays uses query and machine learning to train models for risk management
• 6 node deployment
• 1TB of storage
• Memory only
[Diagram: ALLUXIO, Relational Database]
#3 – UNIFYING DATA ACCESS FROM DIFFERENT STORAGE
• Scenario: Multiple Storage Systems
• Most enterprises have multiple storage systems
• New (better, faster, cheaper) storage systems arise
• Limitation
• Accessing data from different systems requires different APIs
ACCESSING DATA THROUGH ALLUXIO
[Diagram: Spark, MapReduce, and Spark access Storage A, Storage B, and Storage C through Alluxio]
Flexible, simple
No application changes, new mount point
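The "new mount point" idea can be sketched as a mount table that maps paths in one namespace to locations in different storage systems. This is a toy model, not Alluxio's implementation: the class `MountTable`, the mount points, and the host and bucket names are all hypothetical. Resolution picks the longest matching mount point, so applications keep using one path scheme while storage is added underneath.

```python
class MountTable:
    """Toy unified namespace: maps mount points to under-storage URIs (hypothetical)."""

    def __init__(self):
        self._mounts = {}

    def mount(self, mount_point, storage_uri):
        # Normalize trailing slashes; the root mount point stays "/".
        self._mounts[mount_point.rstrip("/") or "/"] = storage_uri.rstrip("/")

    def resolve(self, path):
        # Pick the longest mount point that is a prefix of the path.
        best = None
        for mount_point in self._mounts:
            prefix = mount_point if mount_point.endswith("/") else mount_point + "/"
            if path == mount_point or path.startswith(prefix):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        # Replace the mount point with the storage URI, keeping the rest of the path.
        return self._mounts[best] + path[len(best.rstrip("/")):]


table = MountTable()
table.mount("/", "hdfs://namenode:8020/alluxio")
table.mount("/s3", "s3a://example-bucket/data")

print(table.resolve("/tables/users"))  # hdfs://namenode:8020/alluxio/tables/users
print(table.resolve("/s3/logs/day1"))  # s3a://example-bucket/data/logs/day1
```

Adding a third storage system in this model is one more `mount` call; code that calls `resolve` does not change.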
CASE STUDY: QUNAR
We’ve been running Alluxio in production for over 9 months. Alluxio’s unified namespace enables different applications and frameworks to easily interact with data from different storage systems.
- Qunar
RESULTS
• Efficient data sharing among Spark Streaming, Spark batch, and Flink jobs
• Improved the performance of their system with 15x – 300x speedups
• Tiered storage feature manages storage resources including memory and HDD
• 200+ nodes deployment
• 6 billion logs (4.5 TB) daily
• Mix of Memory + HDD
Qunar uses real-time machine learning for their website ads.
SUMMARY
• Adopted by industry leaders
• Unified, memory-speed data access across compute frameworks and storage systems
• Rapidly growing open-source community
JOIN THE COMMUNITY
Contribute @ www.alluxio.org/contribute
Get started @ goo.gl/55ApFx
Contact: [email protected]
Twitter: @haoyuan
Websites: www.alluxio.com and www.alluxio.org
Thank you! We are hiring!
Demo: Spark + Alluxio + S3
Alluxio Unified Namespace