Aditya Birla Group: ETL pipelining and analytics using Spark on AWS EMR


Executive Summary

The ABG analytics team wanted to migrate their on-premise architecture to the AWS cloud. HashedIn, as an advanced partner with AWS, helped them shift their processes to AWS.


Problem Statement

We helped ABG resolve two problems:

Business Requirements

Data Warehouse migration from Teradata to AWS - ABG had an on-premise data center running on Teradata servers, which they wanted to move to the AWS ecosystem in order to run their analytics at scale.

Root Cause Analysis optimization using Spark on AWS EMR - They wanted to move their analytics from R Server to Spark on AWS EMR.

One of the ABG group companies, which manufactures aluminium coils, collects data from sensors on the manufacturing line. This data is huge: about 1,000 variables every 10 ms (~2.5 KB of data per sample). The data is dumped in .HDF5 and .dat format files.
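As a back-of-envelope check on those numbers (a sketch only, assuming decimal units and continuous logging; compression and line downtime would explain a lower observed growth rate):

```python
# Back-of-envelope check of the quoted sensor data rate.
# Assumptions (not from the source): "KB"/"GB" are decimal units and the
# manufacturing line logs continuously.
BYTES_PER_SAMPLE = 2.5 * 1000      # ~2.5 KB per sample (1,000 variables)
SAMPLES_PER_SECOND = 1000 / 10     # one sample every 10 ms
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_day = BYTES_PER_SAMPLE * SAMPLES_PER_SECOND * SECONDS_PER_DAY
gb_per_day = bytes_per_day / 1e9
print(f"raw rate = {gb_per_day:.1f} GB/day per line")  # 21.6 GB/day raw
```

The raw rate works out to roughly 21.6 GB per day per line, which is consistent in order of magnitude with the ~10 GB/day growth noted under Key Requirements.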

The analytics team at ABG wanted to perform an RCA on these files and needed an efficient and scalable architecture, leveraging distributed computing using Spark on AWS EMR.

In addition, one of the ABG companies has an on-premise data warehouse. This data warehouse runs on shared infrastructure and is very slow, and therefore isn't of much use to the business. ABG would like to move this database to AWS.

End Objectives

Migrate a database of up to 200 GB from on-premise to AWS, using an ETL pipeline that is scalable and easy to reuse for future transfers.

Convert the given sensor data from HDF5 to a suitable format, and set up a running EMR cluster to analyze the data and perform the root cause analysis.

Key Requirements

The ETL pipeline needs to connect to Teradata via AWS Direct Connect and migrate the required database to AWS RDS instances in the most efficient and scalable manner.

Analysis of the HDF5 data files was being done on single machines running R Server. This process needed to be parallelized on an EMR cluster using PySpark. The data was growing at a rate of ~10 GB per day.
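The fan-out itself is simple: each sensor file can be analyzed independently, so the job is an embarrassingly parallel map over files. On EMR this was done with PySpark; as a minimal local stand-in for the same shape (with a hypothetical `process_file` in place of the real R models), one could write:

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def process_file(samples):
    """Placeholder per-file analysis; a real job would parse one HDF5 file
    and run the RCA model over its sensor readings instead."""
    return {"n": len(samples), "mean": mean(samples)}

# Hypothetical stand-ins for a batch of decoded sensor files.
files = [[1.0, 2.0, 3.0], [10.0, 20.0], [5.0]]

# Fan the files out across workers; on EMR the equivalent is roughly
# sc.parallelize(paths).map(process_file).collect().
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_file, files))

print(results)
```

Because no file depends on another, adding EMR nodes scales the analysis near-linearly with the ~10 GB/day growth.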



Solution Approach

Our Solution Structure

Impact and Involvement of Stakeholders

The proposed solutions would enable the analytics team at ABG to scale up their processes without worrying about hardware requirements, and the workloads would be efficiently distributed on EMR, helping the team derive analytical models more quickly and efficiently.

This would have a direct impact on the overall scalability and productivity of the analytical processes.

For the migration use case, the proposed solution was to use AWS Glue over an AWS Direct Connect connection. This is a completely serverless approach in which each ETL job is independent of the others, avoiding server failures and conflicts when multiple people transfer data from the same source simultaneously.
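For illustration, the shape of a Glue job definition for such a transfer might look like the following (a sketch only; the job name, script path, connection names, and capacity values are hypothetical, not taken from the actual project):

```json
{
  "Name": "teradata-to-rds-migration",
  "Role": "GlueServiceRoleForMigration",
  "Command": {
    "Name": "glueetl",
    "ScriptLocation": "s3://example-bucket/scripts/teradata_to_rds.py",
    "PythonVersion": "3"
  },
  "Connections": {
    "Connections": ["teradata-jdbc-over-direct-connect", "rds-target"]
  },
  "MaxCapacity": 10.0,
  "GlueVersion": "2.0"
}
```

Each job defined this way runs serverlessly and independently, which is what allows multiple users to trigger transfers from the same source concurrently.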

For the signal analyses, the proposal was to convert the HDF5 files to the Apache Parquet format, since Parquet is more Spark-friendly and, given its columnar storage, more efficient to index on AWS S3 and query using Athena.
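The columnar point is worth making concrete: an HDF5 dump arrives as row-oriented samples, while Parquet stores each variable as its own column chunk, so a query touching a handful of sensors reads only those columns. A toy illustration of the pivot (the variable names are made up; a real file has ~1,000 of them):

```python
# Row-oriented samples, as they might be decoded from an HDF5 dump
# (hypothetical variable names, for illustration only).
rows = [
    {"ts": 0,  "temp": 310.2, "speed": 1.5},
    {"ts": 10, "temp": 310.4, "speed": 1.6},
    {"ts": 20, "temp": 310.9, "speed": 1.4},
]

# Columnar layout: one contiguous array per variable, which is how Parquet
# lays the data out on S3 and why Athena can scan only the columns a query needs.
columns = {name: [row[name] for row in rows] for name in rows[0]}
print(columns["temp"])  # [310.2, 310.4, 310.9]
```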

Solution Dynamics and Interactions

AWS Glue connected to the data center through a Direct Connect link in a private VPC, making it possible to perform any required transformations on the data using PySpark scripts.

The converted Parquet files were indexed on S3, and via Athena we were able to connect the data to Tableau as well, so that analysts could perform the required analytical operations.
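Once the Parquet files are on S3, exposing them to Athena (and from there to Tableau via the Athena connector) takes only an external table definition. A sketch, with hypothetical table, column, and bucket names:

```sql
-- Register the Parquet files on S3 as an Athena table (names are illustrative).
CREATE EXTERNAL TABLE sensor_readings (
  ts     BIGINT,
  temp   DOUBLE,
  speed  DOUBLE
)
STORED AS PARQUET
LOCATION 's3://example-bucket/sensor-parquet/';

-- Athena reads only the referenced columns, thanks to Parquet's columnar layout.
SELECT ts, temp FROM sensor_readings WHERE temp > 320 LIMIT 100;
```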

(Figure: Apache Spark (PySpark))



HashedIn has helped many promising firms across the globe by building customized solutions that give users a completely hassle-free experience. Let us know if you have a specific problem or use case where we can provide more information or consult with you.

https://hashedin.com/contact-us/

Business Outcomes

We successfully ran the migrations from the data center to AWS RDS instances using AWS Glue. The process was slower than some single-server ETL tools such as Talend, but it was serverless, making it the better choice when multiple people need to run jobs independently.

We successfully set up an EMR cluster, replicated some of the analytical models written in R using PySpark, and benchmarked the whole process. This gave ABG the confidence to move their analytical processes to AWS EMR, making them scalable to real-time analytics in the future as well.