Human Face of Big Data - Archive...Big Data Analytics portfolio including SAP Real-time Data...

1 © 2013 SAP AG or an SAP affiliate company. All rights reserved.

Welcome!

Big Data - Introduction to SAP Big Data Technologies

Big Data - Streaming Analytics

Big Data - Smarter Data Virtualization

Big Data - Gain New Insight from Hadoop

Big Data - Spatial Data Processing for Richer Insights

Big Data - Text Analytics

SAP Big Data Webinar Series

© 2013 SAP AG or an SAP affiliate company. All rights reserved. 3

Speaker Introduction

Yuvaraj Athur Raghuvir, is a Senior Director in the SAP

HANA Platform Solution Management at SAP. He leads the

Big Data Analytics portfolio including SAP Real-time Data

Platform.

Yuvaraj has over 14 years of experience spanning Business

Applications, Business Analytic Solutions, Architecture and

Engineering.

Presented by: Yuvaraj Athur Raghuvir, SAP HANA Platform July 24 2013

Gain New insight from Hadoop


SHOPPERS GET FASHION ADVICE THAT FITS THEIR STYLE


Big Data Economics

Predictive incl. data mining and machine learning

“Unstructured” Analysis Incl. text, media, spatial etc.

Scalable Storage

Streaming Data incl. Sensors, Social and Mobile

Urgent Need


Open Source Community Gift: Apache Hadoop

Predictive incl. data mining and machine learning

“Unstructured” Analysis Incl. text, media, spatial etc.

Scalable Storage

Streaming Data incl. Sensors, Social and Mobile

Urgent Need

Apache Hadoop

► [Commons]

► [Projects]

► [Distributions]

► [Data Scientists]


Hadoop Possibilities

Source: 1) Gartner Blog http://blogs.gartner.com/merv-adrian/2013/02/23/hadoop-2013-part-three-platforms/ retrieved on July 17 2013 2) Gartner Blog http://blogs.gartner.com/merv-adrian/2013/03/08/hadoop-2013-part-four-players/ retrieved on July 17 2013 3) GitHub query “Hadoop” retrived on July 17 2013

Amazon has reported that it started 2 million Elastic MapReduce (EMR) clusters – in a single year.

Pushing New Boundaries! Growing Popularity

among Vendors ► Storage ► Compute ► Network ► High Performance

Computing Vendors

Dynamic Community with over 2500+ Hadoop related projects in GitHub.


Hadoop Community

Source: GitHub query “Hadoop” retrived on July 17 2013

Complexity!

2500+ Community Projects!!

Source: GigaOm Infographic, Mar 5, retrieved on July 17 2013

Highly Dynamic & Evolving Ecosystem

Projects & Alternatives

Source: Gartner Blog http://blogs.gartner.com/merv-adrian/2013/02/21/hadoop-2013-part-two-projects/ retreived on July 17 2013

Gartner Infographic

GigaOm Infographic


The Big Data Phenomenon Big Data is more than just Hadoop…

Exploding data volumes

Accelerating data velocity

Increasing data variety

Business Trends Technology Trends

Storage / Memory / CPU advances

Data Mining/Predictive analysis

In-memory computing

Hadoop & distributed MPP

Complex event processing

“Enterprise” Big

Data


Big Data Challenges Business and technical needs

Business Needs Quick insights from all business-relevant data Pick the right action among many choices

Action 2

Action 1

Action 3

Quick Insight Right Action All Relevant Big Data

100101 011010 100101

Technical Challenges

Cost of store vs. cost to process data considerations

Pressure to gain insight quickly from data

Diversity of data formats makes it difficult to analyze

Need to have disparate technologies interoperate across enterprise

Determine the right action among many valuable insights across variety, volume, velocity and technology is difficult


How to Capitalize on the Big Data Opportunity and Address Big Data Technical Challenges?

To deploy an integrated data processing framework

To enable real-time, actionable insights in business process

context

To derive new value from INFORMATION

Optimize data management in each phase of the information lifecycle process

Regardless of data source, processing technologies, latency challenges, number of user demands

Marry business process insights from structured data analysis with deep pattern, behavior analysis of unstructured data

Enable decision making based on multi-factor considerations, not just instinct/experience

Focus on deriving new value from data by enabling new business and technology use cases previously not feasible

Augment existing business scenarios with new data insights to enable better decision


Hadoop

Data storage (Hadoop Distributed File system)

Job Management

Computation Engine(s)

BI and analytics software from SAP

In-memory

Disk-based data ware-house (SAP Sybase IQ)

… and/or ...

Analytic engine

Analytic engine

Non-SAP solutions

SAP® Solutions

Data warehouse/database (SAP HANA, SAP Sybase IQ/SAP Sybase

ASE)

SAP Data Services

SAP Business Suite

Other SAP solutions

Hadoop as a flexible data store

Hadoop as a simple database

Hadoop as a processing engine

Hadoop for data analytics Reference Data

Streaming Data

Enterprise Data

Transaction Data

Social Media

Enterprise Scenarios with Hadoop


Hadoop


Job Management



Reference Data

Streaming Data

Enterprise Data

Transaction Data

Social Media

Hadoop as a Simple Database

SAP® Solutions


ASE)

SAP Data Services

SAP Business Suite

Other SAP solutions

Scenarios of Use include: ► Extract-Transform-Load from other systems

to Hadoop. ► SAP Data Services provides ETL

support from Hadoop to SAP HANA. ► Store & Retrieve Structured data based on

projects like Hive. ► Depending on the scenario, scalable

analytical systems like Sybase IQ can also be considered.

► Handle large documents as “Blobs” to do retrievals or analytics later.

► Use as a near-line store for offloading data that is considered “cold” or “frozen”

► Data lifecycle management is typically manual.

Focus: Storage and Retrieval of data from Hadoop typically using interfaces provided by HIVE or direct HDFS access

Hadoop as a simple database


Hadoop


Job Management



Reference Data

Streaming Data

Enterprise Data

Transaction Data

Social Media

Hadoop as a Processing Engine

SAP® Solutions


ASE)

SAP Data Services

SAP Business Suite

Other SAP solutions

Scenarios of Use include: ► Data Enrichment.

► Push down of Text Data Transforms from Data Services is an example

► Data Pattern Analysis. ► This is an emerging space across

new data forms. ► Convergence between procedural

data science and declarative access patterns are evolving.

Focus: Distributed Compute leveraging the Map-Reduce computation framework of Hadoop on large distributed data sets

Hadoop as a processing engine


Hadoop


Job Management



Reference Data

Streaming Data

Enterprise Data

Transaction Data

Social Media

Hadoop for Data Analytics

Scenarios of Use include: ► Cross DM analytics.

► Practical only when performance from Hadoop is acceptable

► Stand-alone analytics. ► Emerging area to use Hadoop as the

direct data store for analytics

Focus: A combination of storage and delegated analytics supporting two approaches: • Two-Phase Analytics: Background

processing engine refine and feeding data • Federated Queries: Client side federation

across data stores

BI and analytics software from SAP

In-memory Disk-based data ware-house (SAP Sybase IQ)

… and/or ...

Analytic engine Analytic

engine

Hadoop for data analytics

© 2013 SAP AG. All rights reserved. 17

SAP HANA SP6 - smart data access capability Data virtualization for on-premise and hybrid cloud environments

Benefits Enables access to remote data access

just like “local” table Provides SAP HANA to SAP HANA queries Smart query processing including query decomposition

with predicate push-down, functional compensation Supports data location agnostic development No special syntax to access heterogeneous data sources Non-disruptive evolution

Heterogeneous data sources SAP HANA to Hadoop (Hive) Teradata SAP Sybase ASE SAP Sybase IQ

Transactions + Analytics

Teradata

Hadoop

SAP HANA

ASE

IQ

SAP HANA

Virtual Tables HANA Tables

New

© 2013 SAP AG. All rights reserved. 18

Example Reference Architecture : Machine-to-Machine Infrastructure Run-Time Architecture

Stream

Batch

Synchronization

Cloud / Backend

M2M Application Server Application and

Analytical Services

M2M Services

Core Services Industry-Specific Services

End User Apps

Mobile App Web App Dashboard

Edge

Data Acquisition & Processing

Device Management

SQLA / UltraLite

Data Persistence

Device

Real-Time Data Platform

ESP Event Processing

SQLA Data Synchronization

HANA

“Hot Data”

ASE / IQ “Warm and Cold Data”

Predictive Models

Data Models

Hadoop

Big Data Sets

Cellular M2M revenue opportunity projected to reach $1.2 Trillion by 2020

Meter Data Management will exceed $420 Million by 2020 with a CAGR of 16.8%[2]

Annual smart meter shipments to surpass 140 million units worldwide by 2016, representing a CAGR of 32.9%.[3]

Picture: Beecham Research 1. GSMA; 2.Pike Research; 3. IDC Energy Insights

Healthcare

Public Sector

Industrial

MRI, PDAs

Implants, Surgical Equipment

Pumps, Monitors

Telemedicine, etc.

Tolls, etc.

Planes, Signage

Vehicles, Lights, Ships Tanks, Fighter Jets

Battlefield Comms

Jeeps, Cars, Ambulances

Breakdown, Lone Worker Homeland Security Environ. Monitoring,

etc.

Pumps, Valves, Vats, Conveyors, Pipelines

Motors, Drives, Converting, Fabrication

Assembly/Packaging, Vessels/Tanks, etc.

Turbines

Windmills

UPS

Batteries

Generators

Meters, Drills

Fuel Cells, etc.

Beyond Business Networks – Internet of Things

DRIVERS DIVERTED BEFORE FATAL ACCIDENTS HAPPEN


Hadoop Commons : Core / Common Components

Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.

MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The “Map” function divides a query into multiple parts and processes data at the node level. The “Reduce” function aggregates the results of the “Map” function to determine the “answer” to the query.

Hive: Hive is a Hadoop-based data warehousing-like framework originally developed by Facebook. It allows users to write queries in a SQL-like language caled HiveQL, which are then converted to MapReduce. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as Microstrategy, Tableau, Revolutions Analytics, etc.

Pig: Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL.)

HBase: HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. EBay and Facebook use HBase heavily.

Source: http://wikibon.org/wiki/v/HBase,_Sqoop,_Flume_and_More:_Apache_Hadoop_Defined retrieved on Jul 17 2013

Flume: Flume is a framework for populating Hadoop with data. Agents are populated throughout ones IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.

Oozie: Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages – such as Map Reduce, Pig and Hive -- then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed.


Hadoop Commons : Core / Common Components

Ambari: Ambari is a web-based set of tools for deploying, administering and monitoring Apache Hadoop clusters. It's development is being led by engineers from Hortonwroks, which include Ambari in its Hortonworks Data Platform.

Avro: Avro is a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing removed procedure calls.

Mahout: Mahout is a data mining library. It takes the most popular data mining algorithms for performing clustering, regression testing and statistical modeling and implements them using the Map Reduce model.

Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.

HCatalog: HCatalog is a centralized metadata management and sharing service for Apache Hadoop. It allows for a unified view of all data in Hadoop clusters and allows diverse tools, including Pig and Hive, to process any data elements without needing to know physically where in the cluster the data is stored.

BigTop: BigTop is an effort to create a more formal process or framework for packaging and interoperability testing of Hadoop's sub-projects and related components with the goal improving the Hadoop platform as a whole.

Source: http://wikibon.org/wiki/v/HBase,_Sqoop,_Flume_and_More:_Apache_Hadoop_Defined retrieved on Jul 17 2013

Human Face of Big Data - Archive...Big Data Analytics portfolio including SAP Real-time Data...

Documents

Transcript of Human Face of Big Data - Archive...Big Data Analytics portfolio including SAP Real-time Data...