On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

On the move with Big Data Hadoop, Pig, Sqoop, SSIS… Stéphane Fréchette Thursday February 13, 2014

description

How is Big Data moved around? How are you planning to move it? This session focuses on familiar and not-so-familiar tools you can use today for moving and integrating Big Data. It also outlines the underlying technologies and platform (an introduction to Big Data, Hadoop, HDInsight and tools). We will compare the options, discuss how they can work with your existing Hadoop and Windows Azure environment, and provide some guidance on when and how to use each of these tools.

Transcript of On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

Page 1: On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

On the move with Big Data
Hadoop, Pig, Sqoop, SSIS…

Stéphane Fréchette
Thursday February 13, 2014

Page 2: On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

Who am I?

My name is Stéphane Fréchette

SQL Server MVP - I’m a Database & Business Intelligence Professional and Founder | CEO. I have a passion for architecting, designing and building solutions that matter.

Self-proclaimed Open Data Hacker/Advocate. I founded Gatineau Ouverte, a citizen-led initiative that aims to promote open access to the civic data of the city of Gatineau.

Twitter: @sfrechette
Blog: stephanefrechette.com
Email: [email protected]

Page 3: On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

Session Outline

• What is Big Data?
• Apache Hadoop
• Hadoop Ecosystem
• Windows Azure HDInsight
• On the move…
• SSIS, Sqoop, Pig
• Demos
• Resources

Page 4: On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)


What is Big Data?

Page 5: On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

Apache Hadoop

• Open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
• Designed to scale up from single servers to thousands of machines, each offering local computation and storage

Page 6: On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

Hadoop Ecosystem

• Core components:
  • HDFS (Hadoop Distributed File System) -> Storage
  • MapReduce -> Processing
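To make the "MapReduce -> Processing" bullet concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases, using word count as the example. It is plain Python with no Hadoop involved; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample data are assumptions made for illustration, not part of the session material.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence, in one process, on in-memory lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    sample = ["big data on the move", "move big data"]
    print(word_count(sample))
```

On a real cluster the map and reduce calls would run in parallel on the machines that hold the HDFS blocks; this sketch only shows the shape of the computation.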

Page 7: On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

What is Pig?

• Write complex MapReduce jobs using a simple scripting language (Pig Latin)

• A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs

• Pig compiles Pig Latin scripts into MapReduce jobs on the fly

http://pig.apache.org
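As a rough illustration of the dataflow style that Pig Latin expresses, the sketch below mirrors a four-statement script (LOAD, FILTER, GROUP, FOREACH ... GENERATE) in plain Python. The Pig Latin lines appear only as comments; the file name, field names, threshold and sample data are all hypothetical, and nothing here runs on Pig or Hadoop.

```python
import csv
import io
from collections import defaultdict
from typing import Dict

# Hypothetical input: one "user,bytes" record per line.
SAMPLE = "alice,120\nbob,80\nalice,300\nbob,150\n"


def total_bytes_by_user(text: str, min_bytes: int = 100) -> Dict[str, int]:
    # logs = LOAD 'logs.csv' USING PigStorage(',') AS (user:chararray, bytes:int);
    rows = [(user, int(size)) for user, size in csv.reader(io.StringIO(text))]

    # big = FILTER logs BY bytes > 100;
    big = [(user, size) for user, size in rows if size > min_bytes]

    # by_user = GROUP big BY user;
    # totals  = FOREACH by_user GENERATE group, SUM(big.bytes);
    totals: Dict[str, int] = defaultdict(int)
    for user, size in big:
        totals[user] += size
    return dict(totals)


if __name__ == "__main__":
    print(total_bytes_by_user(SAMPLE))
```

Each Python step corresponds to one relation in the script, which is the point of the language: the author describes a chain of transformations and leaves the MapReduce plumbing to Pig.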

Page 8: On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

What is Sqoop?

• Command-line interface application to transfer bulk data between Hadoop and relational datastores

http://sqoop.apache.org
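Since Sqoop is driven from the command line, a transfer is usually described by a handful of arguments. The sketch below only assembles an argument list for `sqoop import` and prints it; it does not run Sqoop. The helper name `sqoop_import_command`, the server, database, user, table and target directory are hypothetical placeholders; the flags used (`--connect`, `--username`, `-P`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop import arguments.

```python
import shlex
from typing import List


def sqoop_import_command(
    jdbc_url: str,
    username: str,
    table: str,
    target_dir: str,
    num_mappers: int = 1,
) -> List[str]:
    """Build (but do not execute) a `sqoop import` argument list."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,          # JDBC connection string of the source database
        "--username", username,
        "-P",                           # prompt for the password instead of embedding it
        "--table", table,               # relational table to copy
        "--target-dir", target_dir,     # HDFS directory that receives the files
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]


if __name__ == "__main__":
    cmd = sqoop_import_command(
        # Hypothetical SQL Server instance and database.
        jdbc_url="jdbc:sqlserver://myserver:1433;databaseName=Sales",
        username="etl_user",
        table="Orders",
        target_dir="/data/orders",
        num_mappers=4,
    )
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps the JDBC URL, which contains semicolons, from being split by a shell if the list is later handed to `subprocess.run`.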

Page 9: On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

What is Hive?

• A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis

• Provides an SQL-like language called HiveQL to query data

• Enables integration between Hadoop and BI and visualization tools

http://hive.apache.org
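To give a feel for the kind of summarization query the slide refers to, the sketch below runs a GROUP BY aggregate against an in-memory table. Note the loud assumption: it uses Python's built-in `sqlite3` as a stand-in engine, not Hive, and the `sales` table, its columns and the sample rows are hypothetical. The query itself uses only constructs (SELECT, COUNT, SUM, GROUP BY, ORDER BY) that HiveQL shares with standard SQL.

```python
import sqlite3
from typing import Iterable, List, Tuple

# Aggregate query written with constructs common to SQL and HiveQL.
QUERY = """
SELECT region, COUNT(*) AS orders, SUM(amount) AS total
FROM sales
GROUP BY region
ORDER BY region
"""


def summarize(rows: Iterable[Tuple[str, int]]) -> List[Tuple[str, int, int]]:
    """Load rows into an in-memory SQLite table (stand-in for Hive) and summarize them."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(summarize([("east", 10), ("west", 5), ("east", 7)]))
```

In Hive the same statement would be compiled into jobs that scan files in HDFS, which is why the familiar SQL surface matters for BI and visualization tools.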

Page 10: On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

What is SSIS?

• SQL Server Integration Services is a platform for data integration and workflow applications. A fast and flexible tool used for data extraction, transformation, and loading (ETL).
• Contains a rich set of built-in tasks and transformations; tools for constructing packages…
• Used to solve complex business problems

Page 11: On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

Windows Azure HDInsight

• HDInsight is a Hadoop-based service from Microsoft that brings a 100 percent Apache Hadoop solution to the cloud
• Based on the Hortonworks Data Platform
• Scalable, on-demand service

Page 12: On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

Demos (let’s move some data…)

Page 13: On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

Resources

• Apache Projects (list with links) http://bit.ly/MfpLtE
• Windows Azure HDInsight http://bit.ly/1dnlAX1
• HDInsight Tutorials and Guide http://bit.ly/LWRYol
• Hortonworks Sandbox 2.0 http://bit.ly/1gkkCte
• Hortonworks Tutorial Gallery http://bit.ly/1nvMAEX
• Microsoft JDBC Driver 4.0 for SQL Server http://bit.ly/1kEgJ7O
• Microsoft Hive ODBC Driver http://bit.ly/NFkhcH
• GitHub: WindowsAzure / azure-content http://bit.ly/1hfthlF
• SSIS Custom Task – Disorderly Data (Ken Ross) http://bit.ly/1nvIH2G
• GitHub https://github.com/kzhen/SSISHDFS

Page 14: On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

What Questions Do You Have?

Page 15: On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

Thank You
For attending this session