Big data and you


Transcript of Big data and you

Big Data and You – May 2015 Edition

Objectives

This document is designed to introduce Big Data and Analytics. Instead of being a deep-dive technical paper or a product portfolio, it is a friendly educational presentation (easily and quickly read) for specialists, architects, PMs and managers (*). One simple goal (but a complex and time-consuming exercise): if you read this paper, you learn something, and then you will want to get more details to become an expert. Yes, You Can Big Data

Table of Contents

1. Introduction
2. Definition
3. BI principles
4. Chronology
5. Hadoop I
6. Hadoop II
7. Hadoop Ecosystem
8. BI vs Big Data
9. Hadoop patterns
10. Hadoop Market

Introduction

2012 was the big data marketing buzz, 2013 was big data technical enablement, and 2014 was the year of big data projects. Now European customers are massively deploying big data (and still analytics) projects. It is time to become an expert, to guide our customers and talk with the Big Data ecosystem to fill the Big Data skills gap.

(*) This paper doesn’t pretend to be exhaustive on the Big Data subject, nor is it intended to recommend a precise and specific architecture for architects, give performance and technical details for specialists, or serve as a marketing campaign. It doesn’t assume or require any prior knowledge of Big Data.

11. BD&A vendors
12. Competition
13. In Memory
14. Streams
15. BigInsights
16. Architecture
17. Positioning
18. Why Power?
19. Contacts
20. New!

Author # [email protected]


IBM Montpellier Client Center [email protected]


What is Data Analysis ? Why Analysing Data ?

Analysis of data is a process of inspecting, cleaning, transforming, and modelling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making.
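As a minimal illustration, the inspect/clean/transform/model steps above can be sketched in a few lines of Python (the readings below are hypothetical, purely for illustration):

```python
# A minimal sketch of the inspect -> clean -> transform -> model steps,
# using hypothetical temperature readings (not from any real dataset).
raw = ["21.5", "19.0", "n/a", "23.4", "", "20.1"]

# Inspect & clean: keep only values that parse as numbers
clean = [float(v) for v in raw if v.replace(".", "", 1).isdigit()]

# Transform: convert Celsius to Fahrenheit
fahrenheit = [c * 9 / 5 + 32 for c in clean]

# Model: a simple average supports a decision ("is it warm enough?")
avg_c = sum(clean) / len(clean)
decision = "warm" if avg_c > 20 else "cool"
```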

Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains, such as :

• Business Intelligence / Analytics
• Data Mining / Predictive Tools
• Big Data
• Data Integration / Data Visualisation
• And so on…

IT technologies and computer sciences are evolving. Yesterday, when IBM, Honeywell, Sperry, ICL, Xerox, Digital or Olivetti were the IT leaders, CPU and Memory were the key differentiators.

Today, when IBM, Google, SAP and Oracle are the IT leaders, the ultimate differentiator is being able to make more informed choices with confidence, to anticipate and shape business outcomes. As company and industry leaders, you absolutely need deeper insight from your information to beat your competitors:

• Which customers are thinking of leaving?
• Which transactions are fraudulent?
• Detect life-threatening conditions in time to intervene

Let’s make it simpler – An example

Analytics = transforming data into (sexy) information to make (intelligent) decision

Weather forecast: you must decide which boots to take to go to Paris.

You are not an expert at all (temperature, pressure, cyclone = raw data), but you can decide based on a weather map (report/analysis).

!message : Data is the new oil requiring Mining, Refining and Delivering


Definition

What is Business Intelligence ?

Business analytics (BA) refers to the skills, technologies and practices for continuous, iterative exploration and investigation of past business performance to gain insight and drive business planning. Business analytics focuses on developing new insights and understanding of business performance based on data and statistical methods. In contrast, business intelligence (BI) traditionally focuses on using a consistent set of metrics to both measure past performance and guide business planning, which is also based on data and statistical methods.

Big Data is a broad term for data sets so large or complex that they are difficult (or too expensive) to process using traditional data processing applications. Challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and information privacy.

What is Big Data ?

!message : Big Data creates new opportunities to extend Analytics for higher value


The 3 Vs of Big Data (Volume, Velocity, Variety) are often extended with a 4th V (Value) and a 5th V (Veracity).


OLTP versus OLAP


BI Reference Architecture

Reporting solutions display data in either a synthesized or a detailed view that is easy to understand for the end user (data mining: discovering interesting/useful patterns/relationships in large volumes of data – analyzing the past to predict the future).

Data warehouse: the central database in which data are stored and can be restructured to answer business needs.

ETL: unifies data from heterogeneous data sources (extracting the useful data) and consolidates them into a unique destination database (cleansing and modifying the data according to the desired output).
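The ETL pattern can be sketched end-to-end in a few lines of Python. The sources, field names and the country-normalization rule below are all hypothetical, invented only to show the extract / transform / load stages:

```python
import csv, io, sqlite3

# Hypothetical heterogeneous sources (illustrative only)
crm_csv = "id,name,country\n1, Alice ,fr\n2,Bob,FR\n"
web_rows = [{"id": 3, "name": "Carol", "country": "France"}]

def normalize_country(c):
    # Toy cleansing rule: unify spellings of the same country code
    return "FR" if c.strip().lower() in ("fr", "france") else c.strip().upper()

# Extract: pull rows out of both sources into a common shape
extracted = list(csv.DictReader(io.StringIO(crm_csv))) + [
    {k: str(v) for k, v in r.items()} for r in web_rows
]

# Transform: cleanse whitespace, normalize country codes
transformed = [
    (int(r["id"]), r["name"].strip(), normalize_country(r["country"]))
    for r in extracted
]

# Load: consolidate into a unique destination database
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT, country TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?, ?)", transformed)
count = db.execute("SELECT COUNT(*) FROM customers WHERE country='FR'").fetchone()[0]
```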

Good to know !

People very often associate BI with the reporting/data mining tool, because this is the “visible” part of the iceberg. But this is a misnomer: BI refers to the full set of tools, such as Reporting, Data Warehouse and ETL. For your information, ~70% of the costs and efforts in BI projects are about the data warehouse, the most important (but hidden) part of the “iceberg”.

Star Schema: optimized for SQL read requests. The fact table (the metrics of the reports) sits in the middle, surrounded by dimension tables (the report axes) = On-Line Analytical Processing (OLAP).

3NF Schema: optimized for flexibility and storage-space savings = On-Line Transaction Processing (OLTP).
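The star-schema idea can be shown with a toy in-memory SQLite database: a fact table of metrics joined to a dimension table along one analysis axis (the tables and figures below are invented for illustration):

```python
import sqlite3

# A toy star schema: one fact table surrounded by a dimension table
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'boots'), (2, 'hats');
INSERT INTO fact_sales  VALUES (1, 100.0), (1, 50.0), (2, 20.0);
""")

# A typical OLAP-style read: aggregate the fact metrics along a dimension axis
rows = db.execute("""
SELECT d.category, SUM(f.amount)
FROM fact_sales f JOIN dim_product d ON f.product_id = d.product_id
GROUP BY d.category ORDER BY d.category
""").fetchall()
# rows -> [('boots', 150.0), ('hats', 20.0)]
```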

How does Analytics work ? What does OLAP mean ?

!message : BI/Analytics is the way to transform raw data into decision/information


First steps – 1958
IBM Journal article “A Business Intelligence System” (Hans Peter Luhn): birth of the wording “Business Intelligence”; first tools for automatic methods, providing alert services (for scientists)

1970
First MIS solutions – Management Information System. Static, non-flexible, no analysis features

1980
First EIS software – Executive Information System: a more sophisticated MIS with simulations, reports, forecasts

1990
The BI concept is officially formalized by Howard Dresner, a Gartner Group analyst. Birth of Business Performance Management (BPM / EPM)

2005 – 2010
Strong BI market consolidation – big, major IT acquisitions:
• Oracle acquired Siebel (Reporting – 6B$), Hyperion (EPM – 4B$), Sunopsis (ETL – 1B$)
• SAP acquired Business Objects (Reporting – 7B$), Sybase (DW – 6B$), Fuzi (ETL)
• IBM bought Cognos (Reporting – 5B$), Netezza (DW – 2B$), Ascential (ETL – 1B$)
• Yahoo and Google faced terrible performance issues with DW architectures – the need to rethink the data analysis approach led to the birth of Hadoop

2012 and beyond
Birth of Big Data


A little bit of history ?

!message : Analytics has evolved from business initiative to business imperative


Why Hadoop ?


Performance issue: consider that over the past decade:
- CPU speed performance has increased 8 to 10 times
- DRAM speed performance has increased 7 to 9 times
- Network speed performance has increased 100 times
- Bus speed performance has increased 8 to 10 times
- Hard disk drive speed performance has increased ONLY 1.2 times

NoSQL: Not Only SQL

A mechanism for storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling, finer control over availability and, most importantly, COST.
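The horizontal-scaling idea behind many NoSQL stores can be sketched as a toy key-value store that shards keys across nodes by hash (the node names and API below are hypothetical, not any real product's interface):

```python
import hashlib

# A toy key-value "store" sharded across nodes by key hash, sketching the
# horizontal-scaling idea behind many NoSQL systems (names are hypothetical).
NODES = ["node-a", "node-b", "node-c"]
shards = {n: {} for n in NODES}

def node_for(key):
    # Deterministically map each key to one node
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

def put(key, value):
    shards[node_for(key)][key] = value

def get(key):
    return shards[node_for(key)].get(key)

put("user:42", {"name": "Alice"})
```

Adding capacity means adding nodes and redistributing keys, rather than buying a bigger single server, which is the cost argument above.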

!message : Hadoop meets the need for new scalable architectures, providing business efficiency and flexibility over the existing relational data model


How does it work ?

Apache Hadoop is a set of algorithms (an open-source software framework written in Java) for distributed storage and distributed processing of very large data sets (Big Data) on computer clusters built from commodity hardware. The core of Apache Hadoop consists of a storage part (Hadoop Distributed File System (HDFS)) and a processing part (MapReduce). Hadoop splits files into large blocks and distributes the blocks amongst the nodes in the cluster. To process the data, Hadoop Map/Reduce transfers code (specifically Jar files) to nodes that have the required data, which the nodes then process in parallel.

This approach takes advantage of data locality to allow the data to be processed faster and more efficiently via distributed processing than by using a more conventional supercomputer architecture that relies on a parallel file system where computation and data are connected via high-speed networking
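The MapReduce pattern described above can be sketched as a single-process word count; a real Hadoop job would run the same map/shuffle/reduce phases in parallel across cluster nodes, with the blocks below standing in for HDFS blocks:

```python
from collections import defaultdict
from itertools import chain

# A single-process sketch of the MapReduce word-count pattern.
def map_phase(line):
    # Map: emit (key, 1) for every word in a block
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key (what the framework does between phases)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate the grouped values per key
    return key, sum(values)

blocks = ["big data big", "data and you"]          # stand-ins for HDFS blocks
pairs = chain.from_iterable(map_phase(b) for b in blocks)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts -> {'big': 2, 'data': 2, 'and': 1, 'you': 1}
```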

Would you like to appear like an expert?

HDFS default replication: 3x. HDFS default block size: 128 MB. HDFS sits on top of a native Linux filesystem (ext4, ext3). Slave nodes run the HDFS data node and the MapReduce task tracker; master nodes run the HDFS name node and the MR job tracker. The secondary name node performs namespace checkpointing (despite its name, it is not a hot-standby High Availability node).

!message : Volume and Variety challenges have led to the creation of new data processing models: MapReduce and HDFS


YARN (“Hadoop 2”) decouples MapReduce's resource management and scheduling capabilities, enabling Hadoop to support more varied processing approaches and applications (interactive SQL, real-time streaming, batch processing).


Flume was created to allow you to flow data from a source into your Hadoop® environment.

ZooKeeper provides a centralized infrastructure and services that enable synchronization across a cluster. ZooKeeper maintains common objects needed in large cluster environments like configuration information, hierarchical naming space …

HBase is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many big data use cases

Some folks at Facebook developed Hive™, allowing SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements

Oozie simplifies workflow and coordination between jobs. It provides users with the ability to define actions and dependencies between actions.

Pig, initially developed at Yahoo!, allows people to focus more on analyzing large data sets and spend less time writing mapper and reducer programs.

Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop.

Mahout takes the most popular data mining algorithms for clustering, regression testing and statistical modeling and implements them using the MapReduce model.

Ambari is a web-based set of tools for deploying, administering and monitoring Apache Hadoop clusters

!message : The HDFS file system is not restricted to MapReduce jobs. It can be used for other applications, many of which are under development at Apache


Different Approaches

Don’t get us wrong: there is no bad or good approach, and there is no magical approach. There are different approaches, for different needs and results.

With the BI approach, Business Users determine what question to ask (business hypothesis) and the IT team structures the data (specific selected data in a data warehouse) to answer the question.

With the Big Data approach, IT delivers a platform (with all the data) to enable creative discovery, and Business Users explore what questions could be asked.

Different Architectures

BI architecture: Application server and Database server are separated, Network is still in the middle, Data have to go through the network.

Big Data architecture: the analysis program runs where the data is – functions go through the network, not the data. This is highly scalable and flexible by design.

Different Objectives

Hadoop is one of the multiple facets of Big Data. This facet (Hadoop) is designed to run huge (Volume) “read” batches, in an extremely cost-saving way, for unstructured data (Variety).

!message : Do not compare apples and oranges: you (still) need both


Technical Hadoop Patterns

Big Data Exploration

Find, visualize, understand all big data to improve decision making

Enhanced 360° View of the Customer

Extend existing customer views (MDM, CRM, etc) by incorporating additional internal and external information sources

Operations Analysis

Analyze a variety of machine data for improved business results

Data Warehouse Augmentation

Integrate big data and data warehouse capabilities to increase operational efficiency

Security/Intelligence Extension

Lower risk, detect fraud and monitor cyber security in real-time

Big Data Business Use Cases

Keep in Mind

The term Big Data is a bit of a misnomer. Big Data does not refer only to huge volumes of data or to Hadoop; there are many other patterns using streams or in-memory solutions.

!message : Big Data analytics is applied across all industries, with different use cases


Hadoop has been most rapidly adopted by the government, banking, finance, IT and ITES, and insurance sectors

Geographical analysis of the market seems to suggest that North America is the leading revenue generating market and will continue to remain so till 2020.

Hadoop hardware-based solution providers have been the highest recipients of venture capital funding. Recent times have witnessed a steep demand for real-time, operational analytics.

!message : In the 1990s, new performing hardware was the differentiator for companies to compete. Nowadays, big data is the key competitive differentiator


Sources: Hortonworks study (2014); Wikibon figures (2013)


The market for Big Data & Analytics solutions has exploded

The race is hot and complex:

Every vendor is jumping in

Alternatives from everywhere

Startups proliferate

Partnerships

No other vendor has what IBM has:

– Software/ Hardware

– Services / Research

– Cloud, Mobile, Social

Yet just having ‘everything’ does not make for a market leader

Based primarily on the 2012 Wikibon report/forecast: http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2012-2017

!message : The race is hot, Every vendor is jumping in, Alternatives from everywhere, Startups proliferate, how do we differentiate in such a crowded market?


4 major distributions of Hadoop have spawned ecosystems of partners developing data management and analytic solutions for Big Data.

!message : IBM is a global Big Data and Analytics leader, with the industry's most comprehensive, enterprise-class solutions and broadest portfolio


In-Memory - good timing for an old idea

Largely driven by the big data phenomenon, in-memory computing is a powerful, transformative IT trend to meet high-performance analytics expectations and data visualization needs. In-memory solutions should not be confused with conventional DBMSs storing data in disk blocks cached in memory.

“In-Memory” database technology has been around for over a decade. Traditionally it was used in a limited number of operational application workloads (FSS trading, Telco billing, HPC, embedded devices), but 2011 saw an inflection point: increased focus and ‘push’ by SAP. With an in-memory database, all information is initially loaded into memory. This eliminates the need for optimized databases, indexes, aggregates and the design of cubes and star schemas. The arrival of column-centric databases, which store similar information together, allowed storing data more efficiently with greater compression and faster read access, reducing the amount of memory needed to perform a query and increasing processing speed. That’s why column-based technology is very often associated with in-memory technology.
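A toy example shows why column-centric storage compresses so well: values of the same column stored together run-length-encode far better than row-ordered data (the rows below are invented for illustration):

```python
# A toy illustration of column-centric storage and compression.
rows = [("FR", 10), ("FR", 20), ("FR", 30), ("DE", 40), ("DE", 50)]

# Column store: all values of one column are kept together
country_column = [r[0] for r in rows]

def run_length_encode(values):
    # Collapse runs of identical adjacent values into (value, count) pairs
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded

compressed = run_length_encode(country_column)
# compressed -> [('FR', 3), ('DE', 2)] : 2 entries instead of 5 values
```

Scanning the compressed column also touches less memory per query, which is the read-speed argument made above.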

Column Based Technology

Volume: as users/data increase, the RAM needed also increases = hardware costs

Velocity : real time analytics, operational analytics

!message : Big Data analytics can benefit from these very large in-memory systems for velocity (since memory has become cheaper)


Deal with Terabytes of data each second

Work with application, sensor and internet data, video/audio

Deliver insight in microseconds to analytical applications

Support complex scenarios using C++ or Java code

Streams is tailor made for companies who need to process data from non-traditional sources, with huge volumes of data, and need results very, very quickly, integrated with existing analytics investments

Stream computing is a different paradigm. Traditionally, data is accessed using queries that pull it from a data storage device such as a data warehouse or database – which is still valid for many requirements.

The new stream computing paradigm brings data to the query – data is pushed or flows through the analytics. This is required for many new use cases in big data

Here’s a little more on how streams works and what you can do with it.

In a Streams graph, each operator receives an input stream, performs some action on the data, and emits an output stream to the next operator.

You can fuse data from multiple streams, modify it, annotate it, perform an analytics operation on it, or classify it.
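The operator-graph idea can be sketched with plain Python generators (not the actual Streams API): data is pushed through a chain of operators, each transforming its input stream into an output stream. The readings and threshold are hypothetical:

```python
# A generator-based sketch of the stream-computing idea: data flows through
# a graph of operators rather than being pulled by queries from storage.
def source():
    # Emits hypothetical raw sensor readings
    for reading in [3, 18, 7, 25, 11]:
        yield reading

def annotate(stream):
    # Operator: enrich each tuple with an alert flag
    for value in stream:
        yield {"value": value, "alert": value > 15}

def filter_alerts(stream):
    # Operator: keep only the alerting events
    for event in stream:
        if event["alert"]:
            yield event

alerts = list(filter_alerts(annotate(source())))
# alerts -> [{'value': 18, 'alert': True}, {'value': 25, 'alert': True}]
```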

!message : Velocity challenges have led to the creation of a new data computing paradigm and solution: streaming, bringing effective real time at microsecond latency


Hadoop is an open-source implementation and, although very well maintained and doing the “job” for companies, it implies a risk. Like Linux, major IT companies provide Hadoop distributions. IBM took Hadoop and ruggedized it for enterprises, adding enterprise features such as performance, resilience and IBM experience (BigSheets, Big SQL, GPFS…) while remaining 100% on open standards. We call it BigInsights, running on x86, Power Systems and Mainframe (Linux).

2 editions : basic edition (100% open source – free) and Enterprise Edition

BigSheets - a big data visualization capability that enables end users to collect, explore and uncover actionable insights through a commonly understood spreadsheet experience (drag and drop, clicks without any Java or Hadoop skills)

Adaptive MapReduce – an already proven product from Platform Computing (HPC acquisition), rewriting the MapReduce engine in C++ (no garbage collection, faster memory management), allowing:
• Optimized shuffle and map sort
• Separated resource management and job scheduling
• Shared memory leveraged across JVMs, eliminating data movement

BigSQL – SQL on Hadoop is challenging (wide variety of data, MR is batch-oriented). BigSQL provides native, fully compliant SQL access to data stored in BigInsights, real JDBC/ODBC drivers, and optimization based on a Massively Parallel Processing (MPP) architecture, drawn from DB2 experience.

Spectrum Scale – GPFS FPO (File Placement Optimizer): scalable, high-performance and highly reliable, a product with 20+ years of experience, with many advantages over HDFS:
• POSIX compliant
• No single point of failure
• Multi-tenant
• HA/DR solutions

IBM BigInsights for Apache Hadoop v4 has just been released, based on the ODP initiative.

Version 3.0 – Enterprise Edition

!message : IBM Hadoop strategy : better analytics tooling that is easier to use + commitment to Hadoop open source (ODP initiative)


How are leading companies transforming their data and analytics environment to take advantage of Big Data and provide faster, better insights at reduced costs within their existing Enterprise Data Warehouses ?


!message : The foundational schematic to bring analytics to all stages in the data lifecycle can be overlaid with specific products that provide the functions


Systems of Record: structured data from operational systems

Systems of Engagement: data that “connects” companies with their customers, partners and employees

Systems of Insight: diverse data types that combine structured and unstructured data for business insight

!message : Transformational benefit / business outcomes come from the integration of new data sources with traditional corporate data to find new insights


[Diagram components: In-Memory, Hadoop, EDW, Appliance]


Important to keep in mind

Big Data (BigInsights, Cognos, SPSS, …) can run on IBM System z. Customers can take advantage of co-locating business data and OLAP data, managing high-speed transactions and complex queries for real-time operational analytics on a single integrated platform, and benefit from the performance, resiliency and quality of service of the IBM Mainframe for critical businesses, as many banking/insurance customers do.

!message : The infrastructure is a foundational piece to IBM’s perspective of delivering capabilities and offerings for BD&A

Hadoop is Linux – Linux is Power. Hadoop is cheap – Power is cheap.

Hadoop ecosystem – PowerLinux market acceptance

Power advantages for Big Data: Linux on Power runs the same commands as Linux on x86, and versions release on the same date.

Linux on Power makes up 17.6% of the Top500 most powerful Linux systems (with 5 in the top 10).

POWER8 increases the performance, reliability and availability lead over Intel – an alternative to Intel.

The OpenPOWER Foundation brings rapid innovation to the Power platform for open Linux.

Little Endian support makes porting Linux-on-x86 applications even easier.

POWER8's design point is big data (more threads, more cache, more bandwidth, CAPI…).

Intel's design point spans multiple markets (smartphones, tablets, desktop PCs, servers…).


IBM Big Data Resources – WW Competency Centers – Big Data & Analytics Links

Web sites: ibm.com/Hadoop, Information Management Acceleration Zone, PowerLinux Big Data

IBM communities: IBM Systems Big Data and Analytics, BDSC practitioner wiki, IBM Analytics Global Big Data & Analytics Client References

IBM developerWorks: https://www.ibm.com/developerworks/analytics/

Please, please, help us improve this document – if you have any comments or ideas, feel free to send an email.

http://bigdatauniversity.com/
http://wikibon.org/wiki/v/Category:Big_Data
http://en.wikipedia.org/wiki/Apache_Hadoop
http://www.slideshare.net/search/slideshow?searchfrom=header&q=big+data

[INFO] Based on three years of experience on big data projects, and after many weeks of intensive work compiling several presentations given to customers or at conferences and synthesizing concepts, the objective of this educational paper is to clarify some of the concepts and solutions around Big Data, in order to better understand the related challenges and opportunities. There may be (so many) typing errors, mistakes, misleading words or missing concepts, so please be kind.


> Strong history of leadership in open source & standards: IBM has always believed in standardizing the interfaces to IT and application infrastructure components (SQL, Eclipse, OpenPOWER…)
> Supports our commitment to open-source currency in all future releases
> Accelerates IBM innovation within Hadoop & surrounding applications
> Expecting Hortonworks and Pivotal distribution adoption on PowerLinux

> The current ecosystem is challenged and slowed by fragmented and duplicated efforts. The ODP Core will take the guesswork out of the process and accelerate many use cases by running on a common platform, freeing up enterprises and ecosystem vendors to focus on building business-driven applications.


!message : ODP is clearly a major and strategic choice in Open community to accelerate Hadoop adoption and grow BigInsights and PowerLinux ecosystem / ISV

NEW AND/OR HOT !!! OPEN DATA PLATFORM


What is Open Data Platform (ODP) ?

> An open-source, non-profit entity, focused on and committed to evolving the current state of the platform and delivering a Foundation-certified, packaged and tested Reference Distribution

Why Open Data Platform (ODP) ?

Where to position ODP vs Apache ?

> ODP supports the Apache (ASF) mission

> ASF provides a governance model around individual projects without looking at the ecosystem
> ODP aims to provide a vendor-led, consistent packaging model for core Apache components as an ecosystem

Why IBM is involved in ODP ?


!message : IBM's fundamental cloud strategy: a complete cloud offering, balancing control and simplicity.


NEW AND /OR HOT !!! Big Data/Analytics and Cloud

Customer Data Center (On-Premises)

Cloud Data Center (Off Premises)

SIMPLICITY

CONTROL

PureData for Analytics, DB2 BLU, InfoSphere BigInsights

Cloudant, dashDB, SoftLayer

Cloudant

Distributed NoSQL “data layer”, powering web, mobile & IoT since 2009

Available as a fully-managed DBaaS, managed by you on-premises or hybrid

Transactional JSON “document” database

Spreads data across data centers & devices

Ideal for apps that require:

> Massive, elastic scalability

> High availability

> Geo-location services

> Full-text search

> Occasionally connected users

dashDB – data warehouse and analytics as a service on the cloud

• Next-generation In-Memory
• Columnar
• SIMD Hardware Acceleration
• Actionable Compression
• Support for OLAP SQL extensions
• Connection to common 3rd-party BI tools

dashDB keeps data warehouse infrastructure out of your way, allowing you to benefit from:


!message : Spark is positioned as a fast and general engine for Big Data. It generalizes the MapReduce model and could be poised to replace MapReduce


NEW AND/OR HOT !!! SPARK

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms.

Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone (native Spark cluster), Hadoop YARN, or Apache Mesos. For distributed storage, Spark can interface with a wide variety of systems, including Hadoop Distributed File System (HDFS), Cassandra, OpenStack Swift, and Amazon S3. Spark also supports a pseudo-distributed local mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in this scenario, Spark runs on a single machine with one executor per CPU core.
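The benefit of Spark's in-memory primitives for iterative workloads can be illustrated with a plain-Python analogy (this is not Spark's actual API): load the data once, then reuse it across many passes, the idea behind Spark's cache()/persist():

```python
# A plain-Python analogy (NOT real Spark code) for why in-memory caching
# helps iterative jobs: the data is loaded once, then reused repeatedly.
dataset = None
loads = 0

def load_from_disk():
    # Stand-in for an expensive read of HDFS blocks
    global loads
    loads += 1
    return list(range(1_000))

def cached():
    # Sketch of the cache()/persist() idea: load once, then reuse from memory
    global dataset
    if dataset is None:
        dataset = load_from_disk()
    return dataset

# Ten "iterations" of an algorithm reuse the cached data: one load, not ten
totals = [sum(x * i for x in cached()) for i in range(10)]
```

A disk-based two-stage MapReduce job would instead re-read its input on every iteration, which is the performance gap the paragraph above describes.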

Spark had over 465 contributors in 2014, making it the most active project in the Apache Software Foundation and among Big Data open source projects


!message : From an application point of view, the data lake challenge is to be a unique and unified data repository, queryable like a black box


NEW AND /OR HOT !!! DATA LAKE ARCHITECTURE

IDC in late 2014 stated “By 2017 unified data platform architecture will become the foundation of BDA strategy. The unification will occur across information management, analysis, and search technology.”

A Data reservoir is a data lake that provides data to an organization for a variety of analytics processing including:

• Discovery and exploration of data

• Simple ad hoc analytics

• Complex analysis for business decisions

• Reporting

• Real-time analytics

It is possible to deploy analytics into the data reservoir to generate additional insight from the data loaded into it.

A data reservoir manages shared repositories of information for analytical purposes.

Each Data Reservoir Repository is optimized for a particular type of processing.

• Real-time analytics, deep analytics (such as data mining), exploratory analytics, OLAP, reporting, …

Example – Creating a logical warehouse

Information virtualization hides the complexities of where the data is located. Here different repositories are being used to host different workloads, but this complexity is hidden by the information virtualization layer.