Microsoft Analytics Platform System
The turnkey modern data warehouse appliance
Stefan Cronjaeger
June 2014
Agenda
• Modern Data Warehouse
  • Big Data
  • Application examples
• Analytics Platform System
  • Architecture
  • Hadoop
• Integration of Hadoop and APS
  • APS with external Hadoop Clusters
  • APS with Hadoop in the Cloud
  • APS with integrated Hadoop
Big Data: Variety, Velocity, Volume … and Analytics
Sensor and machine log
Business apps
Web
Social Media
What to do with the data
Forecast
Geo analysis
Customer interaction
Churn
Customer segmentation
Shopping basket & Recommendation
Keywords & Sentiment
Scoring & Outlier
Examples of sentiment analysis: not only marketing
Browse blogs, Twitter, news articles, newsgroups
Extract key words, pairs of key words, sentiments
Analyze and correlate
Campaign supervision
• Political campaigns and keywords
• Marketing campaigns
• Trend analysis
Quality assurance
• Analyse internal technical discussion groups
• Get early warning of possible technical issues
Supply chain for fashion
• Look in fashion blogs and discussion groups
• Forecast demand of specific fashion articles
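The "extract key words, pairs of key words, sentiments" step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word lists `POSITIVE` and `NEGATIVE` and the helper names are hypothetical, and a real system would use a proper tokenizer and a trained sentiment model rather than a tiny lexicon.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"great", "love", "good"}
NEGATIVE = {"broken", "bad", "slow"}


def tokens(text):
    """Lower-case words with surrounding punctuation stripped."""
    stripped = (word.strip(".,!?").lower() for word in text.split())
    return [word for word in stripped if word]


def keyword_pairs(posts):
    """Count how often two distinct words appear in the same post."""
    pairs = Counter()
    for post in posts:
        words = sorted(set(tokens(post)))
        pairs.update(combinations(words, 2))
    return pairs


def sentiment(text):
    """Positive words minus negative words; above zero reads as positive."""
    words = tokens(text)
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
```

Correlating the pair counts and sentiment scores over time is what would drive the campaign, quality and demand use cases listed above.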
Structured data: Fraud detection in large amounts of financial data – where to look
Not all digits are equal!
About 130 years ago, Simon Newcomb observed that more numbers start with the digit 1 than with any other digit. The effect was later rediscovered by Benford.
The idea:
Take a set of numbers (e.g., a balance sheet), compare their leading digits with the distribution such numbers usually follow, and look for deviations
Application:
Tax fraud in balance sheets. Actually used by auditors
Manipulated numbers in scientific publications
Fraud in elections, election campaign financing, …
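A minimal sketch of such a check, assuming the standard statement of Benford's law (the expected share of leading digit d is log10(1 + 1/d)). The function names are illustrative, and a real audit would add a statistical test on top of the raw deviations.

```python
import math
from collections import Counter


def benford_expected(digit):
    """Expected share of numbers whose leading digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)


def leading_digit(value):
    """First significant digit of a non-zero number, ignoring sign and scale."""
    text = f"{abs(value):.15e}"  # scientific notation, e.g. '4.200000000000000e+03'
    return int(text[0])


def benford_deviation(values):
    """Map each digit 1-9 to (observed share - expected share)."""
    digits = [leading_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    total = len(digits)
    return {d: counts.get(d, 0) / total - benford_expected(d) for d in range(1, 10)}


# Powers of two are a classic series that follows Benford's law closely.
natural = benford_deviation([2 ** n for n in range(1, 101)])
# Numbers that all start with 5 deviate strongly and would merit a closer look.
suspicious = benford_deviation([5000 + n for n in range(100)])
```

Large positive or negative entries in the returned dictionary indicate where to look first.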
An application of Benford’s law
Bernhard Rauch, Max Göttsche, Gernot Brähler & Thomas Kronfeld (2014). Deficit versus social statistics: empirical evidence for the effectiveness of Benford’s law. Applied Economics Letters, 21:3, 147-151
Differences in number statistics for EU reporting of Social Data and Deficit data by country
PDW Logical Architecture
Database “host” Servers
Control Host Node
Direct Attached Storage Nodes
Control Node (virtualized)
Compute/Storage Nodes (virtualized)
Client Queries
Virtualization spare
• All servers are virtualization hosts
  • Running Windows Server 2012
• Control and compute nodes are virtual
  • All run SQL Server 2012
• Control node spreads data and workload across compute nodes
• Data loads are in parallel and take advantage of the power of all nodes
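How a control node might spread rows across compute nodes can be illustrated with hash distribution. This is a sketch under assumptions: the hash function (SHA-256 here) and the helper names are chosen for illustration and are not PDW's actual internals.

```python
import hashlib
from collections import defaultdict


def node_for_key(key, node_count):
    """Deterministically map a distribution key to a compute node index."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def distribute(rows, key_column, node_count):
    """Group rows by the compute node that owns their key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for_key(row[key_column], node_count)].append(row)
    return dict(nodes)


rows = [{"id": i, "amount": i * 10} for i in range(1000)]
placement = distribute(rows, "id", node_count=8)
```

Because every node owns a disjoint slice of the rows, loads and scans can run on all nodes at once without shared state.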
Fast Infiniband interconnection
Scalability: Massively Parallel and Shared nothing
Smallest (0TB) To Largest (5PB)
• Start small with a few Terabyte warehouse
• Add capacity up to 5 Petabytes
[Diagram: capacity scale from 0 TB to 5 PB, growing in "Add Capacity" steps]
Just grow by adding scale units
An SMP system would have needed to be completely reconfigured
• The Base Unit has approximate useable storage capacity of 75 TB, based on 5:1 compression.
• 3 additional Scale Units can fit into 1 rack, for up to 300 TB of useable storage.
• Multiple racks can be configured for more useable storage.
• The 1 TB drives can be replaced with 2 TB or 3 TB drives, for double or triple capacity. However, multiple Scale Units will provide better performance compared to one Base Unit with larger hard drives. For example, 3 Scale Units with 1 TB drives will perform much better than 1 Base Unit with 3 TB drives.
• Backup Node and Landing Zone (ETL Storage) are not included. The customer can order whatever they want for backup purposes, and install it themselves.
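The sizing arithmetic above can be written down as a small sketch. It assumes that every scale unit adds the same roughly 75 TB as the Base Unit (which is what 300 TB for four units implies), that capacity grows linearly with drive size, and that each unit carries 2 compute nodes as on the hardware slide. The function names are illustrative.

```python
USABLE_TB_PER_UNIT = 75      # Base Unit with 1 TB drives at 5:1 compression
COMPUTE_NODES_PER_UNIT = 2   # each unit brings 2 compute nodes
MAX_UNITS_PER_RACK = 4       # Base Unit plus 3 additional Scale Units


def usable_capacity_tb(units, drive_tb=1):
    """Approximate useable storage for a number of units and a drive size."""
    return units * USABLE_TB_PER_UNIT * drive_tb


def compute_nodes(units):
    """Compute nodes available to run queries in parallel."""
    return units * COMPUTE_NODES_PER_UNIT


full_rack_tb = usable_capacity_tb(MAX_UNITS_PER_RACK)   # 300 TB
three_small = (usable_capacity_tb(3, drive_tb=1), compute_nodes(3))
one_large = (usable_capacity_tb(1, drive_tb=3), compute_nodes(1))
```

Three units with 1 TB drives and one unit with 3 TB drives hold the same 225 TB, but the former spreads the work over 6 compute nodes instead of 2, which is the reason given for the performance difference.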
2 InfiniBand FDR 36 Port Switches
2 Ethernet Switches 5120-24 G
Control Node DL360p
Failover Node DL360p
Base Unit for 2 nodes: 2 ProLiant DL360p Compute Nodes, Storage Block (D6000), 70 drives
1st Scale Unit for 4 nodes: 2 ProLiant DL360p Compute Nodes, Storage Block (D6000), 70 drives
2nd Scale Unit for 6 nodes: 2 ProLiant DL360p Compute Nodes, Storage Block (D6000), 70 drives
3rd Scale Unit for 8 nodes: 2 ProLiant DL360p Compute Nodes, Storage Block (D6000), 70 drives
For customer use
Software
Windows Server 2012: Control Node, Mgmt. Node and Compute Nodes run in a virtualized environment
System Center 2012: Single user interface for management of PDW, OS, BI, custom apps and private cloud
SQL Server 2012 inside: Visual Studio Data Tools, Power View directly on PDW
xVelocity: In-memory execution, clustered columnstore
Big Data Integration: Polybase (T-SQL query to Hadoop, external tables on Hadoop)
Workload Management: Workload classes
Microsoft
Distributed, scalable system on commodity hardware composed of:
• HDFS—distributed file system
• MapReduce—programming model
• Others: HBase, R, Pig, Hive, Flume, Mahout, Avro, Zookeeper
[Diagram: Hadoop ecosystem components: HBase (column DB), Hive, Mahout, Oozie, Sqoop, HBase/Cassandra/Couch/MongoDB, Avro, Zookeeper, Pig, Flume, Cascading, R, Ambari, HCatalog]
Hadoop = MapReduce + HDFS
What is Hadoop?
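The MapReduce programming model can be illustrated with the usual word-count example. This is a single-process sketch of the three phases (map, shuffle, reduce), not Hadoop's API; in a real cluster the map and reduce calls run in parallel on the nodes that hold the HDFS blocks.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "big warehouse"])
```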
Control Node
Failover Node
Hadoop Head Node
Hadoop redundant Head Node
PDW region
PDW scale unit
Hadoop region
Hadoop region
For customer use
APS: Parallel Data Warehouse and HDInsight region
Configurable:
• Minimum 1 PDW region
• Additional PDW scale units
• Additional HDI scale units
HDI region overview
• In a nutshell, it’s an HDI instance running on an appliance.
• HDInsight is the Microsoft-branded Hortonworks distribution.
• An integrated appliance for running a PDW region and an HDI region
• PDW is offered as a stand-alone workload on the appliance.
• HDI is offered only as an add-on to PDW, as a scale unit
• Based on V2 hardware.
• H/A for the Head Node is provided via Windows Failover Clustering (WFC); Data Node H/A is provided via HDFS/MapReduce mechanisms
• Security add-ons to address security issues which are not covered by standard Hadoop
• Support for multiple user accounts
Single T-SQL query model for PDW and Hadoop with rich features of T-SQL including joins without ETL
Leverages the power of MPP to enhance query execution performance
Supports Windows Azure HDInsight to enable new hybrid cloud scenarios
Query non-Microsoft Hadoop distributions such as Hortonworks and Cloudera
Query Hadoop data with T-SQL using PolyBase
Bringing the worlds of big data and the data warehouse together for users and IT
SQL Server Parallel Data Warehouse
Cloudera
Hortonworks (Windows, Linux)
Windows Azure HDInsight
PolyBase
Microsoft HDInsight
Select… Result set
Big data insights for any user
Native Microsoft BI integration to create new insights with familiar tools
No IT intervention required
Analyze PDW and Hadoop data in the same view
Allow any users to create new insights with familiar tools
Leverages high adoption of Excel, Power View, Power Pivot, and SSAS
Power Users
Data Scientists
Everyone else using Microsoft BI tools
Listening to SQL customers – ShinSeGae
Investing in an online shopping website (‘Korea’s Amazon’)
o SQL Server 2012 PDW & HDP 1.3/HDP 2.0 on Linux
What they want
1. ‘We want to perform complex data mining on customer purchase data – basket analysis’.
2. ‘We want to understand the social media data (reviews/Twitter) – specifically around our products & stores’.
3. ‘We will use Hadoop to keep all of our data ~ envisioned to be around 480 TB. PDW will be the efficient analysis engine for the hot data’.
4. ‘PDW & Polybase are much faster than Hive’.
5. ‘We’re interested in using data mining cloud services in Azure (hybrid scenarios)’.
Microsoft NDA - Material
Listening to SQL customers – TeleCom
‘Understanding network quality’
o SQL Server 2012 PDW & Cloudera 4.5 on Linux
What they want
1. ‘We collect millions of network records for quality assessment and capacity planning – on a daily basis’.
2. ‘Hadoop will be used for storage and ETL of these network record files’.
3. ‘PDW for more operational analysis, ad-hoc analysis, operational reports’.
4. ‘We are using Polybase along with Oozie-based orchestration for a seamless & automated integration’.
Polybase for integrating with various Hadoop distributions
• Support of Hortonworks’ HDP 1.x & 2.x (Windows Server and Linux)
• Support of Cloudera’s CDH 4.x (on Linux)
Push-down computation w/ AU1 release
• Pushing computation where data resides (Hadoop as query execution & processing aid)
• Transparent for users – no need to learn map/reduce
• Seamless query experience through external tables + simplified & parallelized ETL through T-SQL (CTAS for import & CETAS for export)
Integration with 3rd party tools and Microsoft insights/BI layer
• Existing applications simply work
• External tables populated through application layer like regular tables
SQL Server Security Model
• ‘You decide who sees what type of data’
• SQL Server permission model adapted for each Polybase object: external table, data source, and file format
[Diagram labels: Microsoft APS Polybase; APS control & data nodes; Polybase/APS query engine; External Table; External Data Source; External File Format; Your Apps; PowerPivot & PowerView; Social Apps; Sensor & RFID; Mobile Apps; Web Apps]
Solution Architecture – Integration with external Hadoop cluster (1)
[Diagram: Microsoft APS Polybase, with the same labels as the preceding architecture diagram]
Querying Hadoop data
SELECT user_name FROM ClickStream cs, PDW_User u WHERE cs.user_IP = u.user_IP AND cs.url = 'www.microsoft.com';
Creating external table, data source, file format
CREATE EXTERNAL DATA SOURCE [HDP2.0] WITH (TYPE = Hadoop, LOCATION = 'hdfs://HDP:8020', JOB_TRACKER_LOCATION = 'HDP:50300');
CREATE EXTERNAL FILE FORMAT MyRCFile WITH (FORMAT_TYPE = 'RCFile', SERDE_METHOD = 'LazyBinarySerDe');
CREATE EXTERNAL TABLE Clickstream (url varchar(50), event_date date) WITH (DATA_SOURCE = [HDP2.0], LOCATION = '/employees/employee.txt', FILE_FORMAT = MyRCFile);
Persistently exporting & importing
CREATE EXTERNAL TABLE Web_Sales WITH (LOCATION = '/TPCDS/web_sales/', DATA_SOURCE = [HDP2.0], FILE_FORMAT = MyRCFile) AS SELECT u.* FROM PDW_User u;
CREATE TABLE PDW_Sales WITH (DISTRIBUTION = HASH(id)) AS SELECT * FROM Web_Sales;
T-SQL Examples – Integration with external Hadoop cluster (2)
[Diagram labels: ShinSeGae solution architecture]
• Online Shopping Mall SSG.com (renewal); Tracking Log Servers; Message Queues (Kafka, open source); Complex Event Processing (Storm)
• HDP 1.3 on Linux (5-10 servers); raw/cold data
• 1. Web log data (160 GB daily): external Polybase tables A, B, C
• 2. Unstructured/semi-structured text data (board/SNS/internal text, weather..): external Polybase tables D, E, F
• 3. Company emails (mails): external Polybase tables G, H, I
• Polybase queries; 10 GB Ethernet
• APS/PDW; EDW; Operational Data Store; recent/hot data stored in PDW
• Campaign; Recommendation engine & personalized advertising; Analytic information (right customer targeting)
• EIS; OLAP (Tabular); Data Mining; Visualization (Silverlight); BI analyst
Solution Architecture (Details) – ShinSeGae
[Diagram labels: Telecom solution architecture]
• Cloudera’s CDH 4 on HP (18+ servers); raw/cold data (petabyte of network logs)
• Capturing network logs (>300 GB per day): external Polybase tables A, B, C
• High-frequency event processing (network logs)
• Oozie workflows; remote procedure calls via stored procedures to trigger Polybase queries
• HCatalog; usage of Hive’s metadata stores
• Polybase queries; Infiniband
• APS/PDW; EDW; Operational Data Store; hot operational PDW data
• Network quality analysis; Capacity Planning; Visualization (PowerPivot)
• BI analyst/Planner/Decision-maker
Solution Architecture (Details) – Telcom
Listening to SQL customers (5) – Government
‘Bridging the gap between cloud & on-prem’
o Current POC - SQL Server 2012 PDW & HDInsight Azure
What they want
1. ‘HDInsight/Hadoop in the cloud to store and massage our raw data (XML files) generated by our web-application’.
2. ‘PDW to keep the data on-prem (legal requirement) and to have an efficient query engine for analysis purposes’.
3. ‘Polybase is a great way of accessing our files in the cloud via simple T-SQL’.
4. ‘With this solution, we can allow web users to quickly ask questions while the heavy, more complex business analysis is accomplished by PDW users’.
[Diagram labels: On-premises or “private cloud”; Microsoft Azure; Azure Storage; Azure HDInsight; Microsoft APS Polybase; Your Apps; Microsoft or 3rd party Applications; Public Internet]
Polybase as key integrative feature
• Integration with external Hadoop, HDInsight region & Azure Storage
Data aging strategies
• Aging of cold data to Azure Storage
• APS & HDInsight region for hot & warm data
Query hot data & cold aged data
• APS as modern cloud end-point for Azure
• Seamless querying of hot & cold data through APS
• APS as gateway allowing users to query all on-prem data via PowerBI and
APS control & data nodes
CREATE EXTERNAL DATA SOURCE WASB WITH (TYPE = Hadoop, LOCATION = 'wasbs://[email protected]');
CREATE EXTERNAL TABLE clickstream_HDInsights (url varchar(50), event_date date) WITH (DATA_SOURCE = WASB, LOCATION = '/input/log1.txt', FILE_FORMAT = MyDelimitedText);
T-SQL examples
SELECT * FROM clickstream_HDInsights, PDW_Table;
Azure Express Route
Solution Architecture – Hybrid Scenarios
[Diagram labels: Government solution architecture]
• Microsoft BI stack; IBM Cognos
• Polybase queries
• APS/PDW; EDW; Operational Data Store
• PDW/APS for fast query response & data processing of hot data
• Public Internet or Azure Express Route
• Transforming to large text files of ~10 GB each (external WASB tables)
• HDI on Azure; HDI tools for data transformation
• Cheap data store, alternative to an on-prem Hadoop solution
• Azure Blob Storage
• Web Application for Tax Filing (e-invoice)
• Web apps generating tons of smaller XML files (~7 KB each)
• Other Web Feeds
Solution Architecture (Details) – Government
Listening to SQL customers (6) – Beverage & Vending Machines
‘What are you drinking? Why is the machine down?’
o POC - SQL Server/APS with PDW & HDI region
What they want
1. ‘We want a complete solution stack – we do not have Hadoop experts in-house and don’t have the money to get them’.
2. ‘We want to store all raw data coming from vending machines in Hadoop’.
3. ‘360-degree view of all our data – structured customer data & unstructured data coming from vending machines’.
4. ‘Predictive maintenance of machines’.
[Diagram labels: Unified Microsoft APS with PDW & HDI region; APS control & data nodes; HDI name & data nodes; Distributed & replicated table; External Table; Your Apps; PowerPivot & PowerView; Social Apps; Sensor & RFID; Mobile Apps; Web Apps]
Unified appliance
• Multi-workload support with PDW and HDInsight region
• HDInsight powered by HDP bits
• No need to deal with multiple support teams (‘better together’)
Seamless & performant query experience through Polybase
• External tables can be used for HDI data
• PDW data nodes connected via high-speed network (Infiniband) to Hadoop data nodes
Simplified management & monitoring
• One consistent monitoring experience through appliance management tools
T-SQL examples
CREATE EXTERNAL DATA SOURCE HDI_R WITH (TYPE = Hadoop, LOCATION = 'hdfs://HTUKIA-C-HHN01:8020', JOB_TRACKER_LOCATION = 'HTUKIA-C-HHN01:50300');
CREATE EXTERNAL TABLE HDI_Region (url varchar(50), event_date date) WITH (DATA_SOURCE = HDI_R, LOCATION = '/input/log1.txt', FILE_FORMAT = MyDelimitedText);
SELECT * FROM HDI_Region, PDW_Table;
Solution Architecture – Unified APS appliance
[Diagram labels: internal Microsoft data scientists]
• APS with PDW & HDI region; Full Rack PDW (PDW region); 1 scale unit HDI region; Infiniband; Polybase queries
• msn.com log files; Microsoft servers log files; analyzing ~3 TB web traffic
• Data scientist group 1: using chaining of Hive queries & PowerQuery via Hive ODBC
• Data scientist group 2: using Polybase for existing tooling (T-SQL, BI tools), performing processing of complex analytical queries & consistent management experience
• Analytical queries via SSDT; PowerQuery/PowerView/PowerMap
• Secure Gateway & AD Integration
• System Center & Admin Console
Solution Architecture (Details) – Internal Microsoft Data Scientist
Microsoft Digital Crime Unit
Part of Microsoft LCA (Legal and Corporate Affairs), mandated to help protect the Internet
DCU’s Challenge:
Effectively combating digital crime requires the collection of huge amounts of data from multiple sources.
DCU needs to be able to:
• Process 10s of TBs daily and house PBs of data historically (accessible as needed)
• House 100s of terabytes from multiple sources that are easily queryable.
• Use leading edge business intelligence and visualization tools.
DCU Big Data Solution
[Diagram labels: Data Sources (sinkholes, passive DNS, files, 3rd party security info, ...); Hadoop 30 node cluster on Windows; SSIS; MSFT SQL StreamInsight; HP EDW Appliance; MSFT PDW; 500 TB SAN storage; HP Business Decision Appliance; SharePoint, SSRS, SSAS, PowerView, PowerPoint; Azure; SQL Azure; SSRS; PowerView; Excel with PowerPivot; Embedded BI; Predictive Analytics; DCU investigators and analysts; corporate security officers]
Microsoft Digital Crime Unit
[Diagram: processing chain Hadoop, SSIS, PDW, SSAS, Microsoft BI; labels: Drop; Extract, Load, Transform (shown twice); Data Source for BI]
Microsoft Digital Crime Unit (currently being implemented)
– Part of Microsoft LCA (Legal and Corporate Affairs), mandated to help protect the Internet
– Effectively combating digital crime requires the collection of huge amounts of data from multiple sources.
• Process 10s of TBs daily and house PBs of data historically (accessible as needed)
• House 100s of terabytes from multiple sources that are easily queryable.
• Use leading edge business intelligence and visualization tools.
– 30 Node Hadoop on Windows Server
– Control Rack and 10 Node PDW Data Rack
– HP BDA (Business Decision Appliance) upgraded to SQL 2012
– BI Voyage currently implementing PDW and BI portions of the project.
Why 2 Storage Platforms?
Hadoop | Parallel Data Warehouse
Storage capacity in the petabytes | Storage capacity in the 100s of terabytes
Simplified load, just drop unstructured or semi-structured files | ETL process more complex to transform data into reporting-optimized DB structures
No optimization of queries | Structures can be optimized for common query patterns
Queried by IT professionals | Queried by business analysts
Complex and slow to query multiple sources at once | Optimized for fast query against key data from multiple sources
Hadoop is DCU’s Centralized Data Warehouse. Simple load and high capacity make it optimal for storing huge volumes of data.
PDW is DCU’s Data Mart platform. Easily accessible, intuitive data structures, and blazing fast for querying data.
APS Differentiators
• Part of a product family: from SQL Server standalone to cloud service offerings
• TCO: Very low, especially when looking at the whole bundle: ETL (SSIS), PDW, data marts (SQL Server) and analytics (SSAS, SSRS)
• Appliance: Much lower effort for DBAs
• Microsoft product stack integration – SSIS, SSAS, SSRS, PowerPivot, System Center, integration with Cloud services
• Linear Scaling via Shared Nothing
• xVelocity: Column Store and In-Memory execution
• Polybase: Integration with Big Data and Hadoop
• HDInsight integrated: fast Infiniband interconnect, management and security
“Microsoft exhibits one of the best value propositions on the market with a low cost and a highly favorable price/performance ratio”- Gartner, February 2012
• Store data in columnar format for massive compression
• Load data into or out of memory for next-generation performance
• Updateable and clustered for real-time trickle loading
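Why a columnar layout compresses well can be shown with run-length encoding. This is an illustration of the general idea only: the slides do not describe the encodings xVelocity actually uses, and the sample rows below are made up.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


# Made-up sample rows: (country, product)
rows = [
    ("DE", "shoes"),
    ("DE", "shoes"),
    ("DE", "hats"),
    ("US", "hats"),
    ("US", "hats"),
]

# Storing each column separately puts similar values next to each other.
country_column = [country for country, _ in rows]
product_column = [product for _, product in rows]

encoded_country = run_length_encode(country_column)
encoded_product = run_length_encode(product_column)
```

Five stored values per column shrink to two pairs each, and a query that reads only one column never touches the other.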
Up to 100x faster queries
Updatable clustered columnstore vs. table with customary indexing
Up to 15x more compression
Columnstore
Parallel query execution
Query
Results
BI Tools
SSRS / SSAS
SQL Server SMP
Concurrency that fuels rapid adoption
Great performance with mixed workloads
Analytics Platform System
ETL/ELT with SSIS, DQS, MDS
ERP CRM LOB APPS
ETL/ELT with DWLoader
Hadoop / Big Data
PDW
HDInsight
PolyBase
Ad hoc queries
MEC, a global media agency, uses SQL Server PDW with in-memory technology to cut query time—helping marketers unlock the value of their data.
SQL Server Analytics Platform System gives us massively parallel advantages. Whereas it would take up to four hours to run queries scaling across multiple nodes, now it takes just minutes.
Lower energy costs and usage
Reduce tuning efforts while retaining high performance
Simplify management with built in System Center
Reduce the data center footprint
Value through a single flexible appliance solution
Why Analytics Platform System when I have SQL Server?
Accelerate time to value and insights with no forklift required for scaling out
Single appliance solution
PDW
HDInsight
PolyBase
Value through a single flexible appliance solution
Why Analytics Platform System when I have SQL Server?
Your choice of hardware
Co-engineered with HP, Dell and Quanta best practices
Leading performance with commodity hardware
Pre-configured, built, tuned software and hardware
Integrated support plan with a single Microsoft contact
PDW
HDInsight
PolyBase
CROSSMARK needed faster and more detailed insight into terabytes of information about product supply and demand. They deployed a turnkey business intelligence solution from Microsoft and HP that is based on the Microsoft SQL Server Parallel Data Warehouse.
People can instantly create their own reports with SQL Server Power View and PowerPivot for Excel and … they can build those reports 50 percent to many times faster compared with the previous system.