A Technical Introduction to Big Data Analytics

Pethuru Raj PhD

Infrastructure Architect

IBM Global Cloud Center of Excellence (CoE)

IBM India, Bangalore

E-mail: [email protected]

The Business Intelligence (BI) in the Pre-Big Data Era

The Business Intelligence (BI) in the Post-Big Data Era

The Classification of the IT Trends

• The Technology Space - There is a cornucopia of technologies (Computing, Connectivity, Miniaturization, Middleware, Sensing, Actuation, Perception, Analyses, Knowledge Engineering, etc.)

• The Process Space – With new kinds of services, applications, data, infrastructures, and devices joining mainstream IT, fresh process consolidation, orchestration, governance, and management mechanisms are emerging. That is, process excellence is the ultimate aim.

• The Infrastructure Space – Infrastructure consolidation, convergence, centralization, federation, automation, and sharing methods clearly indicate the infrastructure trends in the computing and communication disciplines. Physical infrastructures are turning into virtual infrastructures. The two major infrastructure types are:

◦ System Infrastructure (Compute, Storage, & Network)

◦ Application Infrastructure – Integration Backbones, Platforms (Design, Development, Deployment, Delivery, Management, etc.), Messaging Middleware, Databases (SQL and NoSQL), etc.

• The Architecture Space – Service-oriented architecture (SOA), event-driven architecture (EDA), model-driven architecture (MDA), resource-oriented architecture (ROA), and so on are the leading architectural patterns.

• The Device Space – Devices are evolving fast (slim & sleek, handy & trendy, mobile, wearable, implantable, portable, etc.). Everyday machines are connected with one another as well as with the remote Web / Cloud.

• Data Space – Data are being produced in an automated and massive manner

The Tectonic Trends Towards the Ensuing Knowledge Era

1. Data is being positioned as the strategic asset for any organization

2. Analytics has been an important ingredient for worldwide business enterprises to

◦ Strategize and Plan Ahead

◦ Take Informed Decisions

◦ Proceed with Confidence and Clarity (Insights-driven Enterprises)

With the arrival of newer technologies, the capabilities and competencies of analytics have been consistently on the climb.

In sync with big data, platforms, and infrastructures, big insights will become the norm for worldwide organizations.

For any Strategic and Sustainable Transformation

Leverage Data Assets Insightfully

Optimize Infrastructure Technologically

Innovate Processes Consistently

Assimilate Architectures Appropriately

Choose Technologies Carefully

Ensure Accessibility, Simplicity & Consumability Cognitively

The Principal Sources for Big Data

The Convergence of Technologies lays a profound foundation for Large-scale Data Generation

Social Media

Cloud Computing

Mobile

Internet of Things

The Extreme Connectivity enables Data Generation in Heaps

The Deeper and Broader Integration pours out Big Data

• Device to Device (D2D) Integration

• Device to Enterprise (D2E) Integration - In order to have remote and real-time monitoring, management, repair, and maintenance, and for enabling decision-support and expert systems, ground-level heterogeneous devices have to be synchronized with control-level enterprise packages such as ERP, SCM, CRM, KM, etc.

• Device to Cloud (D2C) Integration - As most enterprise systems are moving to clouds, device to cloud (D2C) connectivity is gaining importance.

• Cloud to Cloud (C2C) Integration – Disparate, distributed, and decentralised clouds are getting connected to provide better prospects.

The Interconnectivity of Devices generates Large-scale Fast Data

The Technology Cluster Stack

Sensors, Actuators, Controllers, Tags, Stickers, Consumer Electronics, Appliances, Devices, Machines, Utensils, Instruments, Gadgets, Smart Materials

Service-oriented device middleware for message routing, enrichment, adaptation, etc.

Applications, Services, Data Sources, Packages, Platforms, Middleware, etc.

Clouds (Consolidated, Centralized / Federated, Virtualized, Automated and Shared Infrastructures)

Stack layers, from the Physical World to the Cyber World: Physical Devices, Device Middleware, Virtual Applications & Platforms, Virtual Infrastructures

Some Tidbits on the Enormity of Data

The Unequivocal Result: the Data-driven World

Business Transactions, Interactions, Operations, and Analytical data

System Infrastructure Log files

Social & People data

Customer, Product, Sales and other business data

Machine and Sensor Data

Scientific Experimentation & Observation Data (Genetics, Particle Physics, Climate Modeling, Drug Discovery, etc.)

Why Is Big Data Strategically Significant for Businesses?

Big Data brings in

Enhanced Business Value through better performance and productivity

Bigger and Bigger Insights through a host of newer Analytics and Use Cases

Big Data: The Business Value

What to Do with Big Data?

Big Data Big Insights

Aggregate all kinds of distributed, different and decentralized data

Analyze the formatted and formalized data

Articulate the extracted actionable intelligence

Act based on the insights delivered and raise the bar for futuristic analytics (real-time, predictive, prescriptive, and personal analytics)

Accentuate business performance and productivity

Big Data Analytics: Key Drivers and Applications

The Drivers for Big Data Analysis

1. There is an Exponential Growth in Data Generation due to

◦ The continued increase in diverse and distributed data sources

2. The Maturity, Stability and Convergence of Technologies - Data Virtualization, Management, Storage, Transmission, Analysis and Visualization Techniques, Tips, and Tools

3. The Massive Adoption and Adaptation of Cloud Infrastructures (Compute, Storage and Network)

4. The Realization of more comprehensive, accurate, and speedier Knowledge Discovery and Dissemination Platforms and Processes

5. Enhanced Business Value

6. Newer Types of Analytics

◦ Domain-specific Analytics (Customer Sentiment, Social, Security, Retail, Fraud Detection Analysis, etc.) and

◦ Generic Analytics (Predictive, Prescriptive, High-Performance, Real-time, Smarter Analytics, etc.)

The Reference Architectures for Big Data Analytics

The Emerging and Evolving Analytics

The Traditional Business Analytics

The Next-Generation Business Analytics

Social Media and Network Analytics

Machine Data Analytics - Use Cases

Here are a few ROI examples from a 1% improvement in productivity across different industries:

Commercial aviation industry – a 1% improvement in fuel savings would yield savings of $30 billion over 15 years.

Utilities – across the global gas-fired power plant fleet, a 1% improvement could yield $66 billion in fuel savings.

Global health care industry – a 1% efficiency gain from reducing process inefficiencies globally could yield more than $63 billion in health care savings.

Railway networks – freight moved across the world's rail networks, if improved by 1%, could yield another gain of $27 billion in fuel savings.

Upstream oil and gas exploration – a 1% improvement in capital utilization in upstream oil and gas exploration and development could total $90 billion in avoided or deferred capital expenditures.

The convergence of intelligent devices, intelligent networks, and intelligent decisioning (insight vs. hindsight analytics) is paving the foundation for the next spurt of growth and productivity gains.

Machine Data Analytics – Use Cases

Machine Data Analytics

Batch Vs Real-time Analytics

How Does Real-Time Analytics Work?

The Real-time Analytics Architecture

In-Memory Data Analytics

In-Memory Computing Reference Architecture

Context-Aware Analytics

Big Data Analytics: The Key Platforms

Big Data Analytics: The Platforms

Analytical, Distributed, Scalable and Parallel Databases

Data warehouses, Data Marts, etc.

In-Memory Systems (SAP HANA, etc.)

In-Database Systems (SAS, etc.)

Distributed File Systems (HDFS)

Hadoop Implementations (Cloudera, MapR, Hortonworks, Apache Hadoop, DataStax, etc.)

NoSQL & Hybrid Databases

Parallel DBMS

Standard relational tables and SQL

◦ Indexing, compression, caching, I/O sharing

◦ Tables partitioned over nodes

◦ Transparent to the user

Meets performance requirements

◦ Needs a highly skilled DBA

Flexible query interfaces

◦ UDFs vary across implementations

Fault tolerance

◦ Does not score so well

Assumption: failures are rare

Assumption: dozens of nodes in clusters

MapReduce Programming Model & Hadoop Platforms

MapReduce is a programming model which specifies:

◦ A map function that processes a key/value pair to generate a set of intermediate key/value pairs,

◦ A reduce function that merges all intermediate values associated with the same intermediate key.

Hadoop comprises large-scale, distributed, elastic, and fault-tolerant data processing and storage modules.

◦ It is a MapReduce implementation for processing large data sets over thousands of nodes.

◦ Maps and Reduces run independently of each other over blocks of data distributed across a cluster.
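The map and reduce roles described above can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop's API: the names `map_fn`, `reduce_fn`, and `run_mapreduce`, and the word-count task itself, are assumptions chosen for the example.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_fn(_key: str, line: str) -> Iterator[tuple[str, int]]:
    """Map: turn one input key/value pair into intermediate (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)


def run_mapreduce(records: dict[str, str]) -> dict[str, int]:
    # Map phase: every record is processed independently of the others.
    intermediate: dict[str, list[int]] = defaultdict(list)
    for key, value in records.items():
        for out_key, out_value in map_fn(key, value):
            # Shuffle step: group intermediate values by intermediate key.
            intermediate[out_key].append(out_value)
    # Reduce phase: every intermediate key is reduced independently.
    return dict(reduce_fn(k, v) for k, v in intermediate.items())


# Hypothetical input: two "blocks" of text, as if stored on different nodes.
blocks = {"block-0": "big data big insights", "block-1": "data drives insights"}
word_counts = run_mapreduce(blocks)
print(word_counts)
```

Because each `map_fn` call sees only its own record and each `reduce_fn` call sees only one key, a real framework is free to run them on different machines; the loop here merely stands in for that distribution.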

The Hadoop Architecture

How Does Hadoop Function?

The Hadoop-based Big Data Business Analytics

Why Hadoop?

Better application development productivity through a more flexible data model;

Greater ability to scale dynamically to support more users and data;

Improved performance to satisfy expectations of users wanting highly responsive applications and to allow more complex processing of data.

Scalability to large data volumes:

◦ Scan 100 TB on 1 node @ 50 MB/sec = 23 days

◦ Scan on 1000-node cluster = 33 minutes

Divide-And-Conquer (i.e., data partitioning)

Cost-efficiency

◦ Commodity nodes (cheap, but unreliable)

◦ Commodity network

◦ Automatic fault-tolerance (fewer administrators)

◦ Easy to use (fewer programmers)

Satisfies fault tolerance

Works in heterogeneous environments
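The scan-time figures quoted under "Scalability to large data volumes" can be reproduced with simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), perfectly even data partitioning, and no coordination overhead; the function name `scan_seconds` is made up for the illustration.

```python
def scan_seconds(data_bytes: float, nodes: int, mb_per_sec_per_node: float = 50.0) -> float:
    """Time to scan data_bytes when it is split evenly across `nodes` parallel readers."""
    throughput = nodes * mb_per_sec_per_node * 1e6  # aggregate bytes per second
    return data_bytes / throughput


HUNDRED_TB = 100e12  # 100 TB in bytes (decimal units, an assumption)

single_node = scan_seconds(HUNDRED_TB, nodes=1)
cluster = scan_seconds(HUNDRED_TB, nodes=1000)

print(f"1 node:     {single_node / 86400:.1f} days")
print(f"1000 nodes: {cluster / 60:.1f} minutes")
```

Under these idealized assumptions the single node needs about 23 days and the 1000-node cluster about 33 minutes, matching the slide. Real clusters fall short of linear speed-up because of scheduling, network transfer, and stragglers.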

NoSQL Databases

NoSQL encompasses a wide variety of database technologies that were developed in response to a rise in the volume of data stored about users, objects, and products, the frequency with which this data is accessed, and performance and processing needs.

Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.

Graph stores are used to store information about networks, such as social connections. Graph stores include Neo4J and HyperGraphDB.

Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or "key"), together with its value. Examples of key-value stores are Riak and Voldemort. Some key-value stores, such as Redis, allow each value to have a type, such as "integer", which adds functionality.

Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.
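The four data models above can be pictured with ordinary Python structures. This is only a mental model: the store names, keys, and sample values are invented for the illustration, and none of it reflects how any particular product lays out data on disk.

```python
# Document model: one key paired with a nested structure.
document_store = {
    "user:42": {
        "name": "Asha",
        "tags": ["analytics", "hadoop"],    # key-array pair
        "address": {"city": "Bangalore"},   # nested document
    }
}

# Key-value model: an attribute name (key) together with its value.
key_value_store = {"session:9f3": "user:42", "page_views": 17}

# Wide-column model: the values of one column are kept together.
wide_column_store = {
    "name": {"user:42": "Asha", "user:43": "Ravi"},
    "city": {"user:42": "Bangalore", "user:43": "Chennai"},
}

# Graph model: nodes plus typed edges, e.g. social connections.
graph_store = {
    "user:42": {"follows": ["user:43"]},
    "user:43": {"follows": []},
}


def column_values(store: dict, column: str) -> list:
    """Read one whole column without touching the other columns."""
    return list(store[column].values())


def neighbours(store: dict, node: str, edge: str) -> list:
    """Follow one edge type out of a node."""
    return store[node].get(edge, [])


print(column_values(wide_column_store, "city"))
print(neighbours(graph_store, "user:42", "follows"))
```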

Cassandra (Facebook) (CQL is the query language)

BigTable (Google)

Dynamo (Amazon)

RIAK (SoftLayer) (Apache Lucene)

MongoDB

CouchDB (UNQL is the query language)

Relational Vs. NoSQL Databases

SQL Databases

The relational model takes data and separates it into many interrelated tables. Tables reference each other through foreign keys.

The relational model minimizes the amount of storage space required, because each piece of data is only stored in one place. However, space efficiency comes at the expense of increased complexity when looking up data. The desired information needs to be collected from many tables (often hundreds in today's enterprise applications) and combined before it can be provided to the application. When writing data, the write needs to be coordinated and performed on many tables.

Developers generally use object-oriented programming languages to build applications. It's usually most efficient to work with data that's in the form of an object with a complex structure consisting of nested data, lists, arrays, etc. The relational data model provides a very limited data structure that doesn't map well to the object model. Instead, data must be stored in and retrieved from tens or even hundreds of interrelated tables. Object-relational frameworks provide some relief, but the fundamental impedance mismatch still exists between the way an application would like to see its data and the way it's actually stored in a relational database.

NoSQL Databases

NoSQL databases have a very different model. For example, a document-oriented NoSQL database takes the data you want to store and aggregates it into documents using the JSON format. Each JSON document can be thought of as an object to be used by your application. A JSON document might, for example, take all the data stored in a row that spans 20 tables of a relational database and aggregate it into a single document/object.

Aggregating this information may lead to duplication of information, but since storage is no longer cost prohibitive, the resulting data model flexibility, ease of efficiently distributing the resulting documents, and read and write performance improvements make it an easy trade-off for web-based applications.

Document databases, on the other hand, can store an entire object in a single JSON document and support complex data structures. This makes it easier to conceptualize data as well as write, debug, and evolve applications, often with fewer lines of code.
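To make the contrast concrete, the following sketch holds the same data both ways: as three small "tables" linked by foreign keys, and as one aggregated JSON document. The table names, columns, and the helper `build_customer_document` are hypothetical and stand in for a real schema.

```python
import json

# Relational view: three interrelated tables referencing each other by key.
customers = [{"customer_id": 1, "name": "Asha"}]
orders = [{"order_id": 10, "customer_id": 1}, {"order_id": 11, "customer_id": 1}]
order_items = [
    {"order_id": 10, "sku": "A-1", "qty": 2},
    {"order_id": 10, "sku": "B-7", "qty": 1},
    {"order_id": 11, "sku": "A-1", "qty": 5},
]


def build_customer_document(customer_id: int) -> dict:
    """Follow the foreign keys once and aggregate the result into one document."""
    customer = next(c for c in customers if c["customer_id"] == customer_id)
    doc = {"name": customer["name"], "orders": []}
    for order in orders:
        if order["customer_id"] != customer_id:
            continue
        items = [
            {"sku": item["sku"], "qty": item["qty"]}
            for item in order_items
            if item["order_id"] == order["order_id"]
        ]
        doc["orders"].append({"order_id": order["order_id"], "items": items})
    return doc


# Document view: the application reads and writes this single nested object.
customer_doc = build_customer_document(1)
print(json.dumps(customer_doc, indent=2))
```

In the relational layout every read repeats the joins performed inside `build_customer_document`; in the document layout that work is done once at write time, at the cost of possibly duplicating data shared between documents.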

Relational Vs. NoSQL Databases

SQL Databases

Relational technology requires strict definition of a schema prior to storing any data into a database. Changing the schema once data is inserted is a big deal. Want to start capturing new information not previously considered? Want to make rapid changes to application behavior requiring changes to data formats and content? With relational technology, changes like these are extremely disruptive and frequently avoided.

An RDBMS supports scale-up, which follows from the fundamentally centralized, shared-everything architecture of relational database technology.

Enhancement techniques include

1. Sharding

2. Denormalizing

3. Distributed caching

NoSQL Databases

NoSQL databases, especially document databases, are typically schemaless, allowing you to freely add fields to JSON documents without having to first define changes. The format of the data being inserted can be changed at any time, without application disruption. This allows application developers to move quickly to incorporate new data into their applications.

NoSQL databases use a cluster of standard physical or virtual servers to store data and support database operations. They support the following:

◦ Auto-sharding

◦ Data replication

◦ Distributed query support – "Sharding" a relational database can reduce, or in certain cases eliminate, the ability to perform complex data queries. NoSQL database systems are designed to retain their query expressiveness even when distributed across hundreds of servers.

◦ Integrated caching – Data is cached in system memory, transparently to the application developer and the operations team. With relational technology, by contrast, the caching tier is usually a separate infrastructure tier that must be developed for, deployed on separate servers, and explicitly managed by the ops team.
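Auto-sharding and distributed query support can be illustrated with a toy collection that routes each document to a shard by hashing its key and answers queries by asking every shard. This is a sketch under simplifying assumptions, not the mechanism of any specific database: the class `ShardedCollection` and its methods are invented here, and the shards are plain in-process dictionaries rather than servers.

```python
import hashlib
from typing import Callable, Optional


class ShardedCollection:
    """Toy document collection spread over several in-process shards."""

    def __init__(self, shard_count: int) -> None:
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key: str) -> int:
        # A stable hash decides placement, so callers never pick a shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key: str, document: dict) -> None:
        self.shards[self._shard_for(key)][key] = document

    def get(self, key: str) -> Optional[dict]:
        return self.shards[self._shard_for(key)].get(key)

    def query(self, predicate: Callable[[dict], bool]) -> list:
        """Scatter the predicate to every shard, then gather and merge the matches."""
        matches = []
        for shard in self.shards:
            matches.extend(doc for doc in shard.values() if predicate(doc))
        return matches


users = ShardedCollection(shard_count=4)
for i in range(20):
    city = "Bangalore" if i % 2 == 0 else "Chennai"
    users.put(f"user:{i}", {"id": i, "city": city})

in_bangalore = users.query(lambda doc: doc["city"] == "Bangalore")
print(sorted(doc["id"] for doc in in_bangalore))
```

Modulo placement is the simplest possible choice; it forces most keys to move when the shard count changes, which is why production systems generally prefer schemes such as consistent hashing or range partitioning.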

The Capability Comparison of Different Analytical Platforms

The Big Data Analytics Infrastructures

Big Data Analytics – The Emerging Infrastructures

Analytic, Scalable, Parallel and Distributed Databases & Data Warehouses - Hardware Appliances (MPP and SMP)

In-Memory Compute Infrastructures (SAP HANA on IBM Power 7)

In-Database Compute Infrastructures (SAS Teradata, etc.)

Expertly Integrated Systems (IBM PureData System for Hadoop, Analytics, etc.)

Clouds (public, private and hybrid) comprising bare metal servers and virtual machines (VMs)

In-Memory Data Grid (IMDG)

An IMDG is a distributed, non-relational data or object store. It can span more than one server.

Reading from memory is more than 3,300 times faster than reading from disk. A simple calculation suggests that a set of information that takes an hour to read from disk would take just over a second to read from memory (3,600 seconds / 3,300 ≈ 1.1 seconds).

This approach brings data to the cloud, where the application can interact with it, and the application is completely shielded from the complexity of having to persist or replicate data back to the on-premise store.

The use of an IMDG also means that while the data is available in the cloud, it is held only in memory and is never stored on a disk in the cloud.

IMDGs usually support linear scaling to handle high loads, data partitioning, redundancy, and automatic data recovery in case of failures.
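The partitioning, redundancy, and recovery properties listed above can be sketched with a toy grid that keeps a primary and a backup copy of each entry on different nodes. Everything here is an assumption made for illustration: the class `InMemoryDataGrid`, the byte-sum placement hash, and the "re-place everything" recovery strategy are far simpler than what real IMDG products do.

```python
class InMemoryDataGrid:
    """Toy IMDG: each entry lives in memory on a primary node and one backup node."""

    def __init__(self, node_names: list) -> None:
        if len(node_names) < 2:
            raise ValueError("need at least two nodes to hold a backup copy")
        self.nodes = {name: {} for name in node_names}  # node name -> in-memory store

    def _placement(self, key: str) -> tuple:
        names = sorted(self.nodes)
        primary = sum(key.encode("utf-8")) % len(names)  # deterministic toy hash
        backup = (primary + 1) % len(names)              # always a different node
        return names[primary], names[backup]

    def put(self, key: str, value: object) -> None:
        for name in self._placement(key):
            self.nodes[name][key] = value

    def get(self, key: str) -> object:
        primary, _backup = self._placement(key)
        return self.nodes[primary][key]

    def fail_node(self, name: str) -> None:
        """Simulate losing a node, then restore two copies of every entry."""
        if len(self.nodes) < 3:
            raise RuntimeError("cannot keep two copies with fewer than two survivors")
        del self.nodes[name]  # that node's memory is gone
        survivors = {}
        for store in self.nodes.values():
            survivors.update(store)  # every entry still has at least one copy
        for store in self.nodes.values():
            store.clear()
        for key, value in survivors.items():
            self.put(key, value)  # re-partition across the remaining nodes


grid = InMemoryDataGrid(["node-a", "node-b", "node-c"])
for i in range(10):
    grid.put(f"k{i}", i)

grid.fail_node("node-b")
recovered = [grid.get(f"k{i}") for i in range(10)]
print(recovered)
```

Because the two copies of an entry never share a node, losing any single node leaves at least one copy of everything, and the grid can rebuild full redundancy from the survivors.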

The Big Data Analytics in Clouds

The Types of Big Data Analytics in Cloud

Big Data Analytics in Clouds

Why Big Data Analytics in Clouds?

Agility & Affordability - No large capital investment in infrastructure. Just use and pay.

Hadoop Platforms in Clouds - Deploying and using any Hadoop platform (generic or specific, open or commercial-grade, etc.) is fast.

NoSQL Databases in Clouds - NoSQL databases are made available in clouds.

WAN Optimization Technologies - There are WAN optimization products and platforms for efficiently transmitting data over the Internet infrastructure.

Business Applications in Clouds - As enterprise information systems (EISs), high-performance computing systems, and data storage, social, device, and sensor clouds move to public clouds, big data analytics in remote, Internet-scale clouds makes sense.

Cloud Integrators, Brokers & Orchestrators – There are products and platforms for seamless interoperability among different and distributed systems, services, and data.

Entering into the Hybrid World

1. The Traditional Analytical Systems (Data Warehouse) Vs. The Big Data Analytical Systems (Hadoop)

2. The Traditional Databases (RDBMS) Vs. The NoSQL Databases

3. The Scalable, Distributed, Parallel RDBMS Vs. The NoSQL Databases

The Hybrid World

The Data Analytics: the Converged Architecture

Big Data Analytics Solution Architectures for Different Industry Segments

Big Data Insights for Media Industry – A Solution Architecture

Social Network Analytics – A Solution Architecture

Big Data Analytics: the Summary

Digitalization, service-enablement, extreme connectivity, distribution, commoditization, consumerization, industrialization, etc. are the brewing trends towards big data.

Data volume, variety, velocity, and variability are on the rise, signalling heightened data value. This development is due to the diversity and multiplicity of data sources.

Data capturing, transmission, cleansing, filtering, formatting, and storage tasks, tools, and technologies are maturing fast.

Big Data platforms, patterns, practices, products, processes and infrastructures are being developed to streamline big data analytics

The Big Picture

Enterprise Space

Embedded Space

Cloud Space

Integration Bus

A Sample List of Book Chapters

Pethuru Raj PhD [email protected]

www.peterindia.net

http://www.linkedin.com/in/peterindia

https://www.facebook.com/sweetypeter
