Bash badawi big data training resources

Document by Bash Badawi, December 30, 2016

Please feel free to share this document; I kindly ask that you reference the source. Email me if you need further documentation or have questions or suggestions. Twitter: @bashbadawi. See also my LinkedIn profile, my 4-part Big Data article series on LinkedIn comparing vendors, stacks, etc., and my blog on WordPress. Some of the content is lifted from various sources, all of them verifiable data scientists. Unfortunately, I do not have the references to include in this document. If you are a content provider whose material I used, please email me so that I can include you in the document. Use the Table of Contents to navigate to the desired resources.

About Me: A Computer Science/Math graduate with a recent Master’s degree in Business/Software Economics and a veteran of the IT industry with over 20 years of experience.


Contents
Hadoop Training Resources
Machine Learning Resources
Big Data Lambda Architecture
The 40 data science techniques
Data Science - DSC Resources From Analytics Bridge
Additional Reading
4 Ways to Spot a Fake Data Scientist
Unstructured Data Definition
Resources
You’re Not a Data Scientist
Skills needed to be a Data Scientist
Technical Skills: Analytics
Technical Skills: Computer Science
Non-Technical Skills
My Data Science profile which you might want to use in your resume
Microsoft Big Data Market Play – HDInsight
HDInsight on Linux (Preview)
HDInsight on Windows
Apache Hadoop
Apache Hadoop - Learn more about the Apache Hadoop software library, a framework that allows for the distributed processing of large datasets across clusters of computers.
HDFS - Learn more about the architecture and design of the Hadoop Distributed File System, the primary storage system used by Hadoop applications.
MapReduce Tutorial - Learn more about the programming framework for writing Hadoop applications that rapidly process large amounts of data in parallel on large clusters of compute nodes.
SQL Database on Azure
Azure SQL Database - MSDN documentation for SQL Database.
Management Portal for SQL Database - A lightweight and easy-to-use database management tool for managing SQL Database in the cloud.
Adventure Works for SQL Database - Download page for a SQL Database sample database.
Microsoft Business Intelligence (for HDInsight on Windows)
Connect Excel to Hadoop with Power Query
Connect Excel to Hadoop with the Microsoft Hive ODBC Driver
Microsoft Cloud Platform
Learn about SQL Server Reporting Services
Try HDInsight solutions for big-data analysis (for HDInsight on Windows)
Analyze HVAC sensor data
Use Hive with HDInsight to analyze website logs
Analyze sensor data in real-time with Storm and HBase in HDInsight (Hadoop)
HDInsight HBase overview MSDN
What is HDInsight HBase in Azure?
How is data managed in HDInsight HBase?
Scenarios: What are the use cases for HBase?
Next steps
Get started with Apache HBase in HDInsight
Learn how to create HBase tables and query HBase tables by using Hive in HDInsight.
NOTE: HBase (version 0.98.0) is only available for use with HDInsight 3.1 clusters on HDInsight (based on Apache Hadoop and YARN 2.4.0). For version information, see What’s new in the Hadoop cluster versions provided by HDInsight?
Prerequisites
Provision an HBase cluster
To provision an HBase cluster by using the Azure portal
NOTE:


Hadoop Training Resources
1. http://www.youtube.com/playlist?list=PLF82F6499E89E1BAE
2. Someone started a website for the Hadoop Ecosystem. http://hadoopecosystem.whatazoo.com/ http://hadoopecosystem.whatazoo.com/home/training
3. https://www.linkedin.com/redirect?url=http%3A%2F%2Fsatya-hadoop%2Eblogspot%2Ecom%2F2013%2F03%2Fhadoop-training-institutes-in-india%2Ehtml&urlhash=sJuS&_t=tracking_disc
4. http://university.cloudera.com/training/apache_hive_and_pig/hive_and_pig.html
5. http://www.linalis.com/en/training/planning
6. https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4
7. http://cloudwick.com/training/
8. http://www.learningtree.com/courses/1250/introduction-to-big-data/
9. www.bisptrainings.com
10. http://www.udemy.com
11. http://catechnologies.in/big-data.html
12. http://www.mapr.com/academy/
13. DatumFora also offers live online instructor-led Hadoop courses. Check it out at http://www.datumfora.com/#!online-hadoop-course-oct-26-27/c137j and save 20% when registering with promo code LNKD20.
14. http://www.datumfora.com/#!2-day-hadoop-class-oct-19-20/cf4u
15. http://www.ambaricloud.com/
16. http://www.mapr.com/academy/
17. http://www.datumfora.com/#!upcoming-classes/ct0e
18. http://www.learningtree.com/courses/1250/introduction-to-big-data/
19. http://cloudwick.com/training/
20. http://www.linalis.com/en/training/planning http://university.cloudera.com/training/apache_hive_and_pig/hive_and_pig.html
21. http://www.mapr.com/products/download
22. http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support?action=show&redirect=Distribution
23. http://hortonworks.com/blog/install-hadoop-windows-hortonworks-data-platform-2-0/
24. http://hortonworks.com/hdp/downloads/
25. Try the tutorial at http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/ and read more about Spark GA on HDP at http://hortonworks.com/blog/announcing-apache-spark-now-ga-on-hortonworks-data-platform/
26. http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/


Machine Learning Resources


Big Data Lambda Architecture
Posted on September 5, 2012 by dbtube

In order to meet the challenges of Big Data, you must rethink data systems from the ground up. You will discover that some of the most basic ways people manage data in traditional systems like the relational database management system (RDBMS) are too complex for Big Data systems. The simpler, alternative approach is a new paradigm for Big Data. In this article based on chapter 1, author Nathan Marz shows you this approach he has dubbed the “lambda architecture.”

This article is based on Big Data, to be published in Fall 2012. This eBook is available through the Manning Early Access Program (MEAP). Download the eBook instantly from manning.com. All print book purchases include free digital formats (PDF, ePub and Kindle). Visit the book’s page for more information. This content is being reproduced here by permission from Manning Publications.

Author: Nathan Marz

Computing arbitrary functions on an arbitrary dataset in real time is a daunting problem. There is no single tool that provides a complete solution. Instead, you have to use a variety of tools and techniques to build a complete Big Data system. The lambda architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer.
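To make the three layers concrete, here is a minimal single-process sketch in Python. It is only an illustration under stated assumptions: the page-view counting task, the event shape, and every name in it (build_batch_view, SpeedLayer, query) are hypothetical and do not come from the book or from any particular framework.

```python
from collections import Counter


def build_batch_view(master_dataset):
    """Batch layer (illustrative): recompute a view from the full master dataset."""
    return Counter(event["page"] for event in master_dataset)


class SpeedLayer:
    """Speed layer (illustrative): incrementally track events newer than the last batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def ingest(self, event):
        self.realtime_view[event["page"]] += 1


def query(page, batch_view, realtime_view):
    """Serving side (illustrative): answer a query by merging both views."""
    return batch_view[page] + realtime_view[page]


# Hypothetical data: three events already covered by the batch run, one arriving later.
master_dataset = [{"page": "/home"}, {"page": "/home"}, {"page": "/about"}]
batch_view = build_batch_view(master_dataset)

speed = SpeedLayer()
speed.ingest({"page": "/home"})

total_home = query("/home", batch_view, speed.realtime_view)
print(total_home)  # 3
```

The point of the split is that the batch view can always be rebuilt from scratch, while the speed layer only has to cover the short window of data the latest batch run has not yet processed.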


The 40 data science techniques
1. Linear Regression
2. Logistic Regression
3. Jackknife Regression *
4. Density Estimation
5. Confidence Interval
6. Test of Hypotheses
7. Pattern Recognition
8. Clustering - (aka Unsupervised Learning)
9. Supervised Learning
10. Time Series
11. Decision Trees
12. Random Numbers
13. Monte-Carlo Simulation
14. Bayesian Statistics
15. Naive Bayes
16. Principal Component Analysis - (PCA)
17. Ensembles
18. Neural Networks
19. Support Vector Machine - (SVM)
20. Nearest Neighbors - (k-NN)
21. Feature Selection - (aka Variable Reduction)
22. Indexation / Cataloguing *
23. (Geo-) Spatial Modeling
24. Recommendation Engine *
25. Search Engine *
26. Attribution Modeling *
27. Collaborative Filtering *
28. Rule System
29. Linkage Analysis
30. Association Rules
31. Scoring Engine
32. Segmentation
33. Predictive Modeling
34. Graphs
35. Deep Learning
36. Game Theory
37. Imputation
38. Survival Analysis
39. Arbitrage
40. Lift Modeling
41. Yield Optimization
42. Cross-Validation
43. Model Fitting
44. Relevancy Algorithm *
45. Experimental Design

The number of techniques is higher than 40 because we updated the article, and added additional ones.


Data Science - DSC Resources From Analytics Bridge

Career: Training | Books | Cheat Sheet | Apprenticeship | Certification | Salary Surveys | Jobs

Knowledge: Research | Competitions | Webinars | Our Book | Members Only | Search DSC

Buzz: Business News | Announcements | Events | RSS Feeds

Misc: Top Links | Code Snippets | External Resources | Best Blogs | Subscribe | For Bloggers

Additional Reading
What statisticians think about data scientists
Data Science Compared to 16 Analytic Disciplines
10 types of data scientists
91 job interview questions for data scientists
50 Questions to Test True Data Science Knowledge
24 Uses of Statistical Modeling
21 data science systems used by Amazon to operate its business
Top 20 Big Data Experts to Follow (Includes Scoring Algorithm)
5 Data Science Leaders Share their Predictions for 2016 and Beyond
50 Articles about Hadoop and Related Topics
10 Modern Statistical Concepts Discovered by Data Scientists
Top data science keywords on DSC
4 easy steps to becoming a data scientist
22 tips for better data science
How to detect spurious correlations, and how to find the real ones
17 short tutorials all data scientists should read (and practice)
High versus low-level data science

Reference: @DataScienceCtrl | @AnalyticBridge


4 Ways to Spot a Fake Data Scientist
I’m here to tell you that from all of my conversations with data scientists and “data scientists” I’ve discovered four telltale signs that a professional is not a true data scientist:

1. Lack of a highly quantitative advanced degree – It’s incredibly rare for someone without an advanced quantitative degree to have the technical skills necessary to be a data scientist. In our data science salary report we found that 88% of data scientists have at least a Master’s degree, and 46% have a Ph.D. The areas of study may vary, but the vast majority are very rigorous quantitative, technical, or scientific programs, including Math, Statistics, Computer Science, Engineering, Economics, and Operations Research.

2. No concrete examples of experience with unstructured data – Lists of tools such as Hadoop, Python, and AWS need to be accompanied by projects that show those skills being put to good use. If a professional cannot provide clear examples of their experience with unstructured data, or mentions data science projects, but keeps their involvement very vague, then they are probably not a data scientist. If their specific role in or impact on a Big Data project is unclear, that is cause for concern.

3. Purely academic or research background – Now, this is not to say that someone with a stellar academic or research background won’t make a great data scientist, but a key component to being a data scientist in a corporate setting is business acumen. Understanding how findings affect business goals and delivering actionable insights to leaders is critical to a data scientist’s success. Many research academics have exceptional data skills, but without strong business savvy they are not data scientists… yet.

4. List of basic business skills – If I see a list of tools on a “data scientist” resume like Omniture, Google Analytics, SPSS, Excel, or any other Microsoft Office tool, you can be sure that I will take a harder look at whether or not this professional makes the grade. These skills are basic business qualifications that are insufficient for most data science positions, and by themselves are not indicative of a true data scientist.

Unstructured Data Definition
Unstructured Data (or unstructured information) refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.

Resources
1. Advanced Degree – More Data Science programs are popping up to serve the current demand, but there are also many Mathematics, Statistics, and Computer Science programs.

2. MOOCs – Coursera, Udacity, and Codecademy are good places to start.

3. Certifications – KDnuggets has compiled an extensive list.

4. Bootcamps – For more information about how this approach compares to degree programs or MOOCs, check out this guest blog from the data scientists at Datascope Analytics.

5. Kaggle – Kaggle hosts data science competitions where you can practice, hone your skills with messy, real-world data, and tackle actual business problems. Employers take Kaggle rankings seriously, as they can be seen as relevant, hands-on project work.

6. LinkedIn Groups – Join relevant groups to interact with other members of the data science community.

7. Data Science Central and KDnuggets – Data Science Central and KDnuggets are good resources for staying at the forefront of industry trends in data science.

8. The Burtch Works Study: Salaries of Data Scientists – If you’re looking for more information about the salaries and demographics of current data scientists, be sure to download our data scientist salary study.


You’re Not a Data Scientist
The IT biz has historically rebranded job titles based upon what’s trending: today’s Software Architects were once known as Designers or Systems Engineers. Nothing is trending faster and louder than predictive analytics, machine learning, deep learning and AI. So it’s our turn to rebrand data geeks as data scientists. Now don’t get me wrong: some of these folks are legit Data Scientists, but the majority are not. I guess I’m a purist – calling yourself a scientist indicates that you practice science following a scientific method. You create hypotheses, test them against experimental results and, after proving or disproving the conjecture, move on or iterate.

Skills needed to be a Data Scientist

Technical Skills: Analytics

1. Education – Data scientists are highly educated – 88% have at least a Master’s degree and 46% have PhDs – and while there are notable exceptions, a very strong educational background is usually required to develop the depth of knowledge necessary to be a data scientist. Their most common fields of study are Mathematics and Statistics (32%), followed by Computer Science (19%) and Engineering (16%).

2. SAS and/or R – In-depth knowledge of at least one of these analytical tools; for data science, R is generally preferred.

Technical Skills: Computer Science

3. Python Coding – Python is the most common coding language I typically see required in data science roles, along with Java, Perl, or C/C++.

4. Hadoop Platform – Although this isn’t always a requirement, it is heavily preferred in many cases. Having experience with Hive or Pig is also a strong selling point. Familiarity with cloud tools such as Amazon S3 can also be beneficial.

5. SQL Database/Coding – Even though NoSQL and Hadoop have become a large component of data science, it is still expected that a candidate will be able to write and execute complex queries in SQL.

6. Unstructured data – It is critical that a data scientist be able to work with unstructured data, whether it is from social media, video feeds or audio.

Non-Technical Skills

7. Intellectual curiosity – No doubt you’ve seen this phrase everywhere lately, especially as it relates to data scientists. Frank Lo describes what it means, and talks about other necessary “soft skills” in his guest blog posted a few months ago.

8. Business acumen – To be a data scientist you’ll need a solid understanding of the industry you’re working in, and know what business problems your company is trying to solve. In terms of data science, being able to discern which problems are important to solve for the business is critical, in addition to identifying new ways the business should be leveraging its data.

9. Communication skills – Companies searching for a strong data scientist are looking for someone who can clearly and fluently translate their technical findings to a non-technical team, such as the Marketing or Sales departments. A data scientist must enable the business to make decisions by arming them with quantified insights, in addition to understanding the needs of their non-technical colleagues in order to wrangle the data appropriately. Check out our recent flash survey for more information on communication skills for quantitative professionals.


My Data Science profile which you might want to use in your resume


Microsoft Big Data Market Play – HDInsight
I highly recommend HDInsight for Windows developers who do not work on Linux.

Machine Learning on Azure abstracts away a lot of the Big Data complexity and allows you to jump up to the final analysis levels: for example, 6-7 steps in Hadoop become 2 steps in HDInsight.

HDInsight on Linux (Preview)
Get started with HDInsight on Linux - A quick-start tutorial for provisioning HDInsight Hadoop clusters on Linux and running sample Hive queries.

Provision HDInsight on Linux using custom options - Learn how to provision an HDInsight Hadoop cluster on Linux by using custom options through the Azure Management Portal, Azure cross-platform command line, or Azure

Working with HDInsight on Linux - Get some quick tips on working with Hadoop Linux clusters provisioned on Azure.

Manage HDInsight clusters using Ambari - Learn how to monitor and manage your Linux-based Hadoop on HDInsight cluster by using Ambari Web, or the Ambari REST API.

HDInsight on Windows
HDInsight documentation - The documentation page for Azure HDInsight with links to articles, videos, and more resources.

Learning map for HDInsight - A guided tour of Hadoop documentation for HDInsight.

Get started with Azure HDInsight - A quick-start tutorial for using Hadoop in HDInsight.

Run the HDInsight samples - A tutorial on how to run the samples that ship with HDInsight.

Azure HDInsight SDK - Reference documentation for the HDInsight SDK.

Apache Hadoop
Apache Hadoop - Learn more about the Apache Hadoop software library, a framework that allows for the distributed processing of large datasets across clusters of computers.

HDFS - Learn more about the architecture and design of the Hadoop Distributed File System, the primary storage system used by Hadoop applications.

MapReduce Tutorial - Learn more about the programming framework for writing Hadoop applications that rapidly process large amounts of data in parallel on large clusters of compute nodes.
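For readers new to the MapReduce programming model mentioned above, the following is a minimal word-count sketch in plain Python. It runs in a single process and does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, chosen only to mirror the map, shuffle, and reduce stages.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three stages in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "Big Hadoop data"])
print(counts)  # {'big': 2, 'data': 2, 'hadoop': 1}
```

On a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data across the network; the sketch keeps only the data flow.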

SQL Database on Azure
Azure SQL Database - MSDN documentation for SQL Database.

Management Portal for SQL Database - A lightweight and easy-to-use database management tool for managing SQL Database in the cloud.

Adventure Works for SQL Database - Download page for a SQL Database sample database.


Microsoft Business Intelligence (for HDInsight on Windows)
Familiar business intelligence (BI) tools - such as Excel, PowerPivot, SQL Server Analysis Services, and SQL Server Reporting Services - retrieve, analyze, and report data integrated with HDInsight by using either the Power Query add-in or the Microsoft Hive ODBC Driver.

These BI tools can help in your big-data analysis:

Connect Excel to Hadoop with Power Query

Learn how to connect Excel to the Azure Storage account that stores the data associated with your HDInsight cluster by using Microsoft Power Query for Excel.

Connect Excel to Hadoop with the Microsoft Hive ODBC Driver

Learn how to import data from HDInsight with the Microsoft Hive ODBC Driver.

Microsoft Cloud Platform

Learn about Power BI for Office 365, download the SQL Server trial, and set up SharePoint Server 2013 and SQL Server BI.

Learn more about SQL Server Analysis Services.

Learn about SQL Server Reporting Services

Try HDInsight solutions for big-data analysis (for HDInsight on Windows)
Analyze data from your organization to gain insights into your business. Here are some examples:

Analyze HVAC sensor data
Learn how to analyze sensor data by using Hive with HDInsight (Hadoop), and then visualize the data in Microsoft Excel. In this sample, you'll use Hive to process historical data produced by HVAC systems to see which systems can't reliably maintain a set temperature.

Use Hive with HDInsight to analyze website logs
Learn how to use HiveQL in HDInsight to analyze website logs to get insight into the frequency of visits in a day from external websites, and a summary of website errors that the users experience.

Analyze sensor data in real-time with Storm and HBase in HDInsight (Hadoop)
Learn how to build a solution that uses a Storm cluster in HDInsight to process sensor data from Azure Event Hubs, and then displays the processed sensor data as near-real-time information on a web-based dashboard.

To try Hadoop on HDInsight, see "Get started" articles in the Explore section on the HDInsight documentation page. To try more advanced examples, scroll down to the Analyze section.
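The HVAC analysis described above (finding systems that cannot reliably hold a set temperature) boils down to a group-and-aggregate query. The sketch below shows that logic in plain Python rather than Hive. It is not the HDInsight sample itself: the field names (system_id, target_temp, actual_temp), the 5-degree tolerance, and the sample readings are all assumptions made for illustration.

```python
from collections import defaultdict


def unreliable_systems(readings, tolerance=5.0):
    """Return the systems whose mean absolute deviation from target exceeds the tolerance."""
    deviations = defaultdict(list)
    for reading in readings:
        # Distance between what the thermostat asked for and what was measured.
        gap = abs(reading["actual_temp"] - reading["target_temp"])
        deviations[reading["system_id"]].append(gap)
    return sorted(
        system
        for system, gaps in deviations.items()
        if sum(gaps) / len(gaps) > tolerance
    )


# Hypothetical historical readings for two systems.
readings = [
    {"system_id": "A", "target_temp": 70, "actual_temp": 71},
    {"system_id": "A", "target_temp": 70, "actual_temp": 69},
    {"system_id": "B", "target_temp": 70, "actual_temp": 80},
    {"system_id": "B", "target_temp": 70, "actual_temp": 62},
]

flagged = unreliable_systems(readings)
print(flagged)  # ['B']
```

In Hive the same idea would typically be expressed as a GROUP BY over the system identifier with an aggregate on the temperature gap; the Python version just makes the steps explicit.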


HDInsight HBase overview MSDN
HBase is an Apache, open-source, NoSQL database that is built on Hadoop. HBase provides random access and strong consistency for large amounts of unstructured and semistructured data. It was modeled on Google's BigTable, and it is a column-family-oriented database. Data is stored in the rows of a table, and data within a row is grouped by column family. HBase is a schema-less database in the sense that neither the columns nor the type of data stored in them need to be defined before using them. The open-source code scales linearly to handle petabytes of data on thousands of nodes. It can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop ecosystem.

What is HDInsight HBase in Azure?
HDInsight HBase is offered as a managed cluster that is integrated into the Azure environment. The clusters are configured to store data directly in Azure Blob storage, which provides low latency and increased elasticity in performance and cost choices. This enables customers to build interactive websites that work with large datasets, to build services that store sensor and telemetry data from millions of end points, and to analyze this data with Hadoop jobs. HBase and Hadoop are good starting points for big data projects in Azure; in particular, they can enable real-time applications to work with large datasets.

The HDInsight implementation leverages the scale-out architecture of HBase to provide automatic sharding of tables, strong consistency for reads and writes, and automatic failover. Performance is enhanced by in-memory caching for reads and high-throughput streaming for writes. Virtual network provisioning is also available for HDInsight HBase. For details, see Provision HDInsight clusters on Azure Virtual Network.

How is data managed in HDInsight HBase?
Data can be managed in HBase by using the Create, Get, Put, and Scan commands from the HBase shell. Data is written to the database by using put and read by using get. The scan command is used to obtain data from multiple rows in a table. Data can also be managed using the HBase C# API, which provides a client library on top of the HBase REST API. An HBase database can also be queried by using Hive. For an introduction to these programming models, see Get started using HBase with Hadoop in HDInsight. Co-processors are also available, which allow data processing in the nodes that host the database.

Scenarios: What are the use cases for HBase?
The canonical use case for which BigTable (and by extension, HBase) was created was web search. Search engines build indexes that map terms to the web pages that contain them. But there are many other use cases that HBase is suitable for, several of which are itemized in this section.

Key-value store
HBase can be used as a key-value store, and it is suitable for managing message systems. Facebook uses HBase for their messaging system, and it is ideal for storing and managing Internet communications. WebTable uses HBase to search for and manage tables that are extracted from webpages.

Sensor data
HBase is useful for capturing data that is collected incrementally from various sources. This includes social analytics, time series, keeping interactive dashboards up-to-date with trends and counters, and managing audit log systems. Examples include Bloomberg trader terminal and the Open Time Series Database (OpenTSDB), which stores and provides access to metrics collected about the health of server systems.

Real-time query
Phoenix is a SQL query engine for Apache HBase. It is accessed as a JDBC driver, and it enables querying and managing HBase tables by using SQL.

HBase as a platform
Applications can run on top of HBase by using it as a datastore. Examples include Phoenix, OpenTSDB, Kiji, and Titan. Applications can also integrate with HBase. Examples include Hive, Pig, Solr, Storm, Flume, Impala, Spark, Ganglia, and Drill.
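The put, get, and scan operations described earlier in this overview, along with the column-family data model, can be illustrated with a toy in-memory table. This is a teaching sketch only: ToyTable is a hypothetical class, not the HBase shell, the HBase C# API, or any real client library, and the sample rows and column names are made up.

```python
class ToyTable:
    """Toy model of a column-family table: row key -> {"family:qualifier": value}."""

    def __init__(self, column_families):
        # Column families are fixed up front; qualifiers inside them are free-form.
        self.column_families = set(column_families)
        self.rows = {}

    def put(self, row_key, column, value):
        """Write one cell; the column must belong to a declared family."""
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        """Read every cell of a single row (empty dict if the row is absent)."""
        return dict(self.rows.get(row_key, {}))

    def scan(self, start_row="", stop_row=None):
        """Yield rows in sorted key order; stop_row is treated as exclusive in this sketch."""
        for key in sorted(self.rows):
            if key < start_row:
                continue
            if stop_row is not None and key >= stop_row:
                break
            yield key, dict(self.rows[key])


table = ToyTable(["personal", "office"])
table.put("1000", "personal:name", "Ada")
table.put("1000", "office:phone", "555-0100")
table.put("2000", "personal:name", "Grace")

row = table.get("1000")
later_rows = list(table.scan(start_row="1500"))
print(row)
print(later_rows)
```

Note how no column other than the family has to be declared before a put, which is the sense in which the overview calls HBase schema-less.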

Next steps

Get started using HBase with Hadoop in HDInsight

Provision HDInsight clusters on Azure Virtual Network

Configure HBase replication in HDInsight

Analyze Twitter sentiment with HBase in HDInsight

Use Maven to build Java applications that use HBase with HDInsight (Hadoop)


Get started with Apache HBase in HDInsight
Learn how to create HBase tables and query HBase tables by using Hive in HDInsight.

HBase is a low-latency NoSQL database that allows online transactional processing of big data. HBase is offered as a managed cluster that is integrated into the Azure environment. The clusters are configured to store data directly in Azure Blob storage, which provides low latency and increased elasticity in performance and cost choices. This enables customers to build interactive websites that work with large datasets, to build services that store sensor and telemetry data from millions of end points, and to analyze this data with Hadoop jobs. For more information about HBase and the scenarios it can be used for, see HDInsight HBase overview.

NOTE: HBase (version 0.98.0) is only available for use with HDInsight 3.1 clusters on HDInsight (based on Apache Hadoop and YARN 2.4.0). For version information, see What’s new in the Hadoop cluster versions provided by HDInsight?

Prerequisites
Before you begin this tutorial, you must have the following:

An Azure subscription: For more information about obtaining a subscription, see Purchase Options, Member Offers, or Free Trial.

An Azure storage account: For instructions, see How To Create a Storage Account.

A workstation with Visual Studio 2013 installed: For instructions, see Installing Visual Studio.

Provision an HBase cluster
NOTE: The steps in this article create an HDInsight cluster by using basic configuration settings. For information about other cluster configuration settings (such as using Azure virtual network or a metastore for Hive and Oozie), see Provision Hadoop clusters in HDInsight by using custom options.

To provision an HBase cluster by using the Azure portal
1. Sign in to the Azure portal.

2. Click NEW in the lower left, and then click DATA SERVICES > HDINSIGHT > HBASE.

You can also use the CUSTOM CREATE option. (The above is the older classic portal; the below is the new portal, which uses the Resource Manager construct.)

3. Enter CLUSTER NAME, CLUSTER SIZE, CLUSTER USER PASSWORD, and STORAGE ACCOUNT.

The default HTTP USER NAME is admin. You can customize the name by using the CUSTOM CREATE option.

WARNING: For high availability of HBase services, you must provision a cluster that contains at least three nodes. This ensures that, if one node goes down, the HBase data regions are available on other nodes.

4. Click the checkmark icon in the lower right to create the HBase cluster.

NOTE: After an HBase cluster is deleted, you can create another HBase cluster by using the same default blob. The new cluster will pick up the HBase tables you created in the original cluster.