LVC20-303 - State of Big Data and Data · 2020. 12. 22. · BDDS-54 - Workload setup on Database...
LVC20-303 - State of Big Data and Data Science on ARM
- Ganesh Raju, Tech Lead, Big Data and Data Science, Linaro
Agenda
● Big Data Ecosystem
● High Level Goals
● Misconceptions on ARM
● Approach
● Team’s Achievements
● General Pain Points
● Current Status in ARM World
● Roadmap
Big Data and Data Science Ecosystem
Big Data in itself is a huge ecosystem. It is just too large, complex and redundant. It has too many standards, too many engines, too many vendors.
Categorizing Big Data Components
● Core Components
● Operational Components
● Data Ingestion
● Streaming
● Data Warehousing
● NoSQL
● File formats
● Dashboards
● Security/Governance
● Data Science Tools / Machine Learning Components
● Notebooks
High Level Goals
1. ARM is a first-class citizen in all Big Data and Data Science projects
   a. Build and port
   b. Set up CI on ARM hardware
   c. Automated tests
   d. Multi-arch Docker images
2. Benchmark against x86
3. Optimize with AArch64 advantages
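One practical detail behind multi-arch images and multi-arch CI is that the same CPU family goes by different names: `uname` reports `aarch64` and `x86_64`, while container tooling commonly uses `arm64` and `amd64`. The following is a minimal sketch of normalizing those names, not anything taken from the talk. The alias table, the helper names `normalize_arch` and `image_tag`, and the repository name `example/bigdata` are all hypothetical.

```python
import platform

# Hypothetical alias table: machine strings as reported by the OS,
# mapped to the architecture names commonly used in container image tags.
_ARCH_ALIASES = {
    "aarch64": "arm64",
    "arm64": "arm64",
    "x86_64": "amd64",
    "amd64": "amd64",
    "ppc64le": "ppc64le",
}


def normalize_arch(machine: str) -> str:
    """Return a canonical architecture name for a raw machine string."""
    key = machine.strip().lower()
    if key not in _ARCH_ALIASES:
        raise ValueError(f"unrecognized architecture: {machine!r}")
    return _ARCH_ALIASES[key]


def image_tag(repo: str, version: str, machine: str) -> str:
    """Build a per-architecture image tag such as 'repo:1.0-arm64'."""
    return f"{repo}:{version}-{normalize_arch(machine)}"


if __name__ == "__main__":
    # platform.machine() returns the machine string of the running host.
    print(image_tag("example/bigdata", "1.0", platform.machine()))
```

Failing loudly on an unknown architecture is a deliberate choice here: a CI job that silently falls back to x86 would hide exactly the kind of gap the goals above are meant to close.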
Misconceptions
ARM is Raspberry Pi? Projects are unfamiliar with the ARM platform.
ARM is not production ready. ARM hardware is unavailable.
It’s Java, so it should run anywhere! Dependencies lack ARM support.
Additional effort is required for testing. Lack of interest in working on ARM.
Approach
➢ Top to Bottom Approach
  ○ Operational Component - Apache Ambari
  ○ Ambari Mpack
  ○ Apache Bigtop
➢ Bottom to Top Approach
  ○ Core components - Apache Hadoop, Spark, HBase, Hive
  ○ Other Apache Projects like Apache Arrow, Beam
  ○ Other Projects - NiFi, MiniFi, etc.
  ○ Data science Projects - TensorFlow, Anaconda, H2O, etc.
Apache Bigtop
Bigtop is a comprehensive project for packaging, testing, configuring, and installing many Big Data components.
Originally, releases and CI were available only for x86 and PowerPC.
To run on Arm, many hacks and much manual tuning of configurations were needed.
● Details: Linaro Big Data team webpage, https://collaborate.linaro.org/display/BDTS/Documentation
Bigtop - Supports >25 Hadoop Ecosystem Components
Bigtop - Foundation for many commercial Hadoop Distros/services
Bigtop - Achievements
Apache Bigtop Contributions (BDDS-11)
● BDDS-8 - Apache Ambari mpack
● BDDS-8 - Add ElasticSearch to Apache Bigtop
● A number of patches to upgrade components
● Upstream CI
● Integration tests and smoke tests
● Linaro leading the effort
● Recognition for contributions
  ○ Jun He is recognized as Chair of the Bigtop PMC
  ○ Jun He has filled the Release Manager (RM) role for Bigtop
  ○ Yuqi Gu has been recognized as a maintainer for Bigtop
Apache Bigtop on AArch64 Timeline
Timeline dates: 2016-04, 2017-03, 2017-11, 2018-03, 2018-11, 2019-06
Milestones:
● Build setup in Linaro
● v1.2.1 released with a lot of AArch64 patches
● v1.3.0 - Officially, ARM is a First Class Citizen; Jun He - Release Manager
● Successful build on Ubuntu
● AArch64 CI nodes added
● v1.4 released
● v1.5
Bigtop Smoke Test CI matrix
Bigtop Distro Matrix and Components
Distro ARM x86 PPC
CentOS 7 & 8
Debian 9 & 10
Fedora 31
Ubuntu 16.04 & 18.04
OpenSuse 42.3
Hadoop Spark HBase Hive
Flink ElasticSearch LogStash Kibana
Kafka Solr Ambari Flume
Giraph Gpdb Ignite Alluxio
Livy Mahout Oozie Phoenix
Qfs Sqoop Tez YCSB
Zookeeper Zeppelin Hama Tajo
Apache Bigtop: v1.5
Upcoming in a few weeks!
New component additions:
- ElasticSearch v5.6.14, Logstash, Kibana v5.4.1
Version bumps:
- Hadoop 2.10.5, Spark 2.4.5, HBase v1.5.0, Hive v2.3.6, Kafka 2.4.0, Flume 1.9.0, Alluxio 1.8.2, Giraph v1.2.0, Ignite v2.7.6, Livy v0.7.0, Phoenix v4.15.0, Solr v6.6.6, Tez v0.9.2, Zeppelin v0.8.2, Zookeeper v3.4.13
Components removed:
- Apex, Hama, Tajo
New features:
- Integration tests
- Smoke tests
- More built-in test coverage
  - Hive, Flink, Giraph, Zeppelin, etc.
- A lot of improvements and bug fixes!
What is Apache Ambari
➢ Platform Independent
➢ Pluggable component
➢ Version Management and Upgrade
➢ Extensibility
➢ Failure Recovery
➢ Security
Usage of Apache Ambari
Provisioning of Big Data cluster
Monitoring of Hadoop Cluster
Management of Hadoop Cluster
Security of Hadoop Cluster
Achievements
● Build and Port - The majority of projects already have ARM bits available
  ○ Apache Pulsar, Phoenix, NiFi, MiniFi, Airflow, Beam, etc.
● CI with upstream
  ○ Bigtop, Hadoop, Spark, HBase, Hive, Flink, etc.
● Workload setup and Demo
  ○ ELK Stack - ElasticSearch, Logstash and Kibana
  ○ H2O and Sparkling Water
  ○ Apache Ambari
  ○ Apache Drill
● Benchmarking
  ○ HiBench
● Optimization
  ○ E.g., Arrow CRC32 and ARM-specific optimization
● Helping University of Michigan
  ○ Cluster running Bigtop: petabyte size, Twitter data, 20 GB of tweets / day
  ○ Ambari and Bigtop
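The slides do not describe what the Arrow CRC32 optimization actually changed, so the following is only a reference sketch of what a CRC-32 checksum computes, not the Arrow patch. It is a plain bit-at-a-time implementation of the standard reflected CRC-32 (polynomial 0xEDB88320); the function name `crc32_bitwise` is hypothetical. AArch64 provides dedicated CRC32 instructions, and work of this shape, one shift-and-XOR step per input bit, is presumably what such instructions replace in hardware.

```python
import zlib

# Reflected form of the CRC-32 polynomial used by zlib, gzip and Ethernet.
_POLY = 0xEDB88320


def crc32_bitwise(data: bytes) -> int:
    """Compute CRC-32 one bit at a time (slow, but easy to follow)."""
    crc = 0xFFFFFFFF  # initial value: all ones
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right; if the dropped bit was 1, fold in the polynomial.
            crc = (crc >> 1) ^ (_POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF  # final inversion


if __name__ == "__main__":
    sample = b"123456789"
    # Cross-check the software loop against zlib's implementation.
    print(hex(crc32_bitwise(sample)), hex(zlib.crc32(sample)))
```

Comparing against `zlib.crc32` is a cheap way to validate any replacement implementation, hardware-accelerated or not, before benchmarking it.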
CI setup with other projects
Project CI link
Apache Bigtop https://ci.bigtop.apache.org/computer/
Apache Hadoop https://builds.apache.org/view/H-L/view/Hadoop/job/Hadoop-qbt-linux-ARM-trunk/
Apache Spark https://amplab.cs.berkeley.edu/jenkins/label/spark-arm/
Apache HBase https://builds.apache.org/job/HBase-Nightly-ARM/
Apache Hive https://builds.apache.org/job/Hive-linux-ARM-trunk
Apache Flink https://status.openlabtesting.org/builds?job_name=flink-build-and-test-arm64-core-and-tests
Apache Kudu https://logs.openlabtesting.org/logs/periodic-kudu-mail/github.com/apache/kudu/master/kudu-build-test-arm64-in-docker/4df6de9/
ElasticSearch Stack https://ci.linaro.org/view/All/job/bigdata-elasticsearch/
Apache Arrow https://travis-ci.org/github/apache/arrow/jobs/728491410
Apache Drill https://ci.linaro.org/view/All/job/ldcg-bigdata-apache-drill/
Apache Impala http://status.openlabtesting.org/job/impala-build-test-arm64
Tensorflow http://status.openlabtesting.org/builds?job_name=tensorflow-arm64-release-build-v2.1.0-py36
PyTorch https://snapshots.linaro.org/hpc/python/pytorch/3/
Pain Points
● Dependency issues
  ○ Native binaries: protobuf, phantomjs, …
  ○ Jars with native binaries embedded: leveldb-jni, ignite-shmem, jffi, snappy-java, …
  ○ Version mismatch: slf4j, log4j, log4j2, …
● Cyclic references take a lot of effort to fix
● It takes time to convince projects
  ○ Protobuf and PhantomJS issue
  ○ Bazel issue
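The "jars with native binaries embedded" problem is why "it's Java, so it runs anywhere" fails in practice: a JAR is a zip archive and can carry shared libraries built for one architecture only. Below is a minimal sketch of how such JARs could be spotted, assuming a simple heuristic based on file suffixes and path names. The helper names, the suffix list, and the marker strings are assumptions for illustration, not a tool used by the team.

```python
import io
import zipfile

# Heuristic assumptions: typical native-library suffixes and the strings
# that usually appear in paths of AArch64 builds.
NATIVE_SUFFIXES = (".so", ".dylib", ".dll", ".jnilib")
ARM64_MARKERS = ("aarch64", "arm64")


def native_entries(jar):
    """List archive entries that look like native libraries.

    `jar` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(jar) as archive:
        return [name for name in archive.namelist()
                if name.lower().endswith(NATIVE_SUFFIXES)]


def has_arm64_native(entries):
    """Return True if any entry path mentions an AArch64 build."""
    return any(marker in entry.lower()
               for entry in entries for marker in ARM64_MARKERS)


if __name__ == "__main__":
    # Build a tiny in-memory JAR that ships only an x86_64 library.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        jar.writestr("org/example/Demo.class", b"")
        jar.writestr("native/linux/x86_64/libdemo.so", b"")
    buf.seek(0)
    found = native_entries(buf)
    print(found, "arm64 build present:", has_arm64_native(found))
```

The path-name check is deliberately crude: it misses versioned names such as `libfoo.so.1` and libraries stored without an architecture in the path, where inspecting the ELF header of each entry would be more reliable.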
Team’s Current Scope (Next 6 months)
● Building and porting Big Data and Data Science projects on ARM64.
  ○ BDDS-7 - Apache Bigtop v1.5 Release
  ○ Start Apache Bigtop v1.6 work
    ■ Hadoop 3 upgrade
    ■ Ambari mpack as top level component
  ○ BDDS-54 - Workload setup on Database components like Kudu, Impala, Presto and Cassandra
  ○ BDDS-12 - Kerberos and Security components like Ranger, Knox and Atlas
● Utilize Apache Arrow in Apache Spark
● Arrow Memory optimization and fix
● BDDS-262 - RocksDB performance issue fix
  ○ RocksDB v5.17+ has >8% performance regression
● BDDS-17 - Apache Airflow Workload end to end Setup and Demo
● BDDS-252 - Apache Pulsar Workload end to end Setup and demo
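A figure such as the ">8% performance regression" quoted for RocksDB is a relative drop in throughput between two versions. The sketch below shows that arithmetic, assuming throughput is measured in operations per second where higher is better. The function names are hypothetical, the numbers in the example are made up, and only the 8% threshold comes from the slide.

```python
def regression_percent(baseline_ops: float, candidate_ops: float) -> float:
    """Relative throughput drop of candidate vs. baseline, in percent.

    Positive values mean the candidate is slower; negative means faster.
    """
    if baseline_ops <= 0:
        raise ValueError("baseline throughput must be positive")
    return (baseline_ops - candidate_ops) * 100.0 / baseline_ops


def is_regression(baseline_ops: float, candidate_ops: float,
                  threshold_percent: float = 8.0) -> bool:
    """Flag a drop larger than the threshold (8% mirrors the slide)."""
    return regression_percent(baseline_ops, candidate_ops) > threshold_percent


if __name__ == "__main__":
    # Made-up throughput numbers, purely for illustration.
    old_version, new_version = 100_000.0, 90_000.0
    print(regression_percent(old_version, new_version),
          is_regression(old_version, new_version))
```

A check like this is most useful when run per architecture, since a change can regress on AArch64 while staying flat on x86.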
Roadmap
● Bigtop
  ○ Hadoop 3 upgrade
  ○ JDK 11 integration
  ○ Ambari Mpack
  ○ Kubernetes support
  ○ Add Beam, Arrow, Storm, NiFi, MiniFi, Presto
  ○ Add Data science tools
● Build and Port:
  ○ Databases: ArangoDB, Hawq, Accumulo, Geode, Parquet-MR, Thrift, Gobblin, etc.
● ARM Optimization
  ○ Benchmarking
  ○ SVE and SIMD optimization
● Data Science
  ○ MLOps, Spark-ML, FlinkML, Horovod, Hopsml, BigDL, PyTorch, Scikit-Learn, NumPy, Keras, MXNet
  ○ Anaconda
● HPDA
  ○ Hadoop and Spark on RDMA. RoCE+Spark
  ○ Hadoop on Ceph
● End to End Use case
HPDA – High Performance Data Analysis
● 23% of HPC system usage is currently HPDA
  ○ Machine learning
  ○ Stochastic modeling / Monte Carlo – explore large problem spaces
  ○ MapReduce/Hadoop, graph analytics, knowledge discovery
RDMA Big Data Proposal
● RDMA could give over 40% performance boost for Big Data
● Develop and test plugins for Hadoop components, such as MapReduce and HDFS, to accelerate Hadoop by using RDMA (Remote Direct Memory Access) technology on the ARM64 platform
Thanks
Linaro BDDS team:
Ganesh Raju - Tech Lead, Linaro
Yuqi Gu - Assignee, ARM
Jun He - Member Engineer, ARM
Thanks to OpenEuler, Packet and ARM for their contributions