On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

Post on 11-May-2015

description

How is Big Data moved around? How are you planning to move it? This session focuses on familiar and not-so-familiar tools you can use today for moving and integrating Big Data. It also outlines the underlying technologies and platform (an introduction to Big Data, Hadoop, HDInsight and related tools). We will compare the options, discuss how they can work with your existing Hadoop and Windows Azure environment, and provide some guidance on when and how to use each of these tools.

Transcript of On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

On the move with Big Data
Hadoop, Pig, Sqoop, SSIS…

Stéphane Fréchette
Thursday, February 13, 2014

Who am I?

My name is Stéphane Fréchette

SQL Server MVP. I’m a Database & Business Intelligence Professional and a Founder | CEO. I have a passion for architecting, designing and building solutions that matter.

Self-proclaimed Open Data Hacker/Advocate. I founded Gatineau Ouverte, a citizen-led initiative that aims to promote open access to the civic data of the city of Gatineau.

Twitter: @sfrechette
Blog: stephanefrechette.com
Email: stephanefrechette@ukubu.com

Session Outline

• What is Big Data?
• Apache Hadoop
• Hadoop Ecosystem
• Windows Azure HDInsight
• On the move…
• SSIS, Sqoop, Pig
• Demos
• Resources


What is Big Data?

Apache Hadoop

• Open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models

• Designed to scale up from single servers to thousands of machines, each offering local computation and storage

Hadoop Ecosystem

• Core components:
  • HDFS (Hadoop Distributed File System) -> Storage
  • MapReduce -> Processing
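To make the "MapReduce -> Processing" half of that list concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps, using the classic word-count example. It is plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration. On a real cluster the framework runs many mappers and reducers in parallel over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does
    between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data on the move", "move big data"]))
```

The point of the model is that `map_phase` and `reduce_phase` only ever see one record or one key at a time, which is what lets the framework spread the work across machines.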

What is Pig?

• Write complex MapReduce jobs using a simple script language (Pig Latin)

• A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs

• Pig translates Pig Latin scripts and compiles them into MapReduce jobs on the fly

http://pig.apache.org
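As a rough illustration of what that translation means, the sketch below shows a small Pig Latin script (in the comment) next to a hand-written map and reduce pair that computes the same result in plain Python. The relation and field names (`logs`, `user`, `bytes`), the threshold of 100 and the Python function names are all hypothetical. The real Pig compiler builds and optimizes its own execution plan, so this is not what Pig actually generates.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical Pig Latin script being illustrated:
#   logs    = LOAD 'logs' AS (user:chararray, bytes:int);
#   big     = FILTER logs BY bytes > 100;
#   grouped = GROUP big BY user;
#   totals  = FOREACH grouped GENERATE group, SUM(big.bytes);


def mapper(record):
    """FILTER runs on the map side: keep only rows with bytes > 100
    and emit them keyed by user."""
    user, nbytes = record
    if nbytes > 100:
        yield user, nbytes


def reducer(user, values):
    """GROUP plus SUM runs on the reduce side: one call per user."""
    return user, sum(values)


def run_job(records):
    """Simulate the job locally: map, sort by key (the shuffle), reduce."""
    mapped = sorted(kv for record in records for kv in mapper(record))
    return [
        reducer(user, [value for _, value in group])
        for user, group in groupby(mapped, key=itemgetter(0))
    ]


if __name__ == "__main__":
    sample = [("ann", 150), ("bob", 50), ("ann", 200), ("bob", 300)]
    print(run_job(sample))
```

Four lines of Pig Latin stand in for the mapper, reducer and job wiring, which is the main appeal of the language.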

What is Sqoop?

• Command-line interface application to transfer bulk data between Hadoop and relational datastores

http://sqoop.apache.org
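A typical Sqoop transfer is a single command line. The sketch below assembles a `sqoop import` invocation in Python so the arguments are easy to read. The flags used (`--connect`, `--username`, `-P`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop import arguments. The host name, database, table and HDFS path are made-up placeholders, and the helper `sqoop_import_command` is hypothetical. The command is only printed, not executed.

```python
import shlex


def sqoop_import_command(jdbc_url, username, table, target_dir, num_mappers=4):
    """Build the argument list for importing one relational table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,          # JDBC connection string of the source database
        "--username", username,
        "-P",                           # prompt for the password at run time
        "--table", table,               # source table to import
        "--target-dir", target_dir,     # destination directory in HDFS
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]


if __name__ == "__main__":
    cmd = sqoop_import_command(
        jdbc_url="jdbc:sqlserver://dbhost:1433;databaseName=Sales",  # placeholder
        username="etl_user",                                         # placeholder
        table="Orders",                                              # placeholder
        target_dir="/data/sales/orders",                             # placeholder
        num_mappers=2,
    )
    # Quote each argument so the printed line is safe to paste into a shell.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Moving data the other way, from HDFS back into a relational table, uses `sqoop export` with `--export-dir` in place of `--target-dir`.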

What is Hive?

• A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis

• Provides an SQL-like language called HiveQL to query data

• Integration between Hadoop and BI and visualization tools

http://hive.apache.org

What is SSIS?

• SQL Server Integration Services is a platform for data integration and workflow applications. A fast and flexible tool used for data extraction, transformation, and loading (ETL).

• Contains a rich set of built-in tasks and transformations; tools for constructing packages…

• Used to solve complex business problems

Windows Azure HDInsight

• HDInsight is a Hadoop-based service from Microsoft that brings a 100 percent Apache Hadoop solution to the cloud

• Based on the Hortonworks Data Platform

• Scalable, on-demand service

Demos (let’s move some data…)

Resources

• Apache Projects (list with links) http://bit.ly/MfpLtE
• Windows Azure HDInsight http://bit.ly/1dnlAX1
• HDInsight Tutorials and Guide http://bit.ly/LWRYol
• Hortonworks Sandbox 2.0 http://bit.ly/1gkkCte
• Hortonworks Tutorial Gallery http://bit.ly/1nvMAEX
• Microsoft JDBC Driver 4.0 for SQL Server http://bit.ly/1kEgJ7O
• Microsoft Hive ODBC Driver http://bit.ly/NFkhcH
• GitHub: WindowsAzure / azure-content http://bit.ly/1hfthlF
• SSIS Custom Task – Disorderly Data (Ken Ross) http://bit.ly/1nvIH2G
• GitHub https://github.com/kzhen/SSISHDFS

What Questions Do You Have?

Thank You
For attending this session