Working with BigData
Agenda
- Background: Internet Scale Data
- Hadoop
- BigInsights / BigSheets
- Demos
New Intelligence – Internet Scale Data
Enormous amounts of both structured and unstructured data are being created every day (~15 petabytes, 8x the content of all US libraries)
– NYSE, 1 TB/day
– FB, 20+ TB/day
– CERN/LHC, 40 TB/day (15 PB/year)
Companies are recognizing the potential of leveraging the broader web, as well as their internal content, for business intelligence
This content is an untapped source of business insights & intelligence
- 100 years baseball stats + 100 years weather data
Separating signal from noise is imperative for successful value extraction
Need ways to harness “big data”
Network collective intelligence --> web scale data
Extracting Signal from Noise --> gathering, extract & explore
Reduction in latency --> actionable insight = competitive advantage
Emerging Big Data Patterns
Pattern: Business Question
- Computational Journalism: A BBC editor analyzing the UK Parliament web site, identifying MPs and voting records over 10+ years
- Telecom: How can I analyze customer call records to predict customer churn?
- IT Systems Management: Identifying intrusion patterns via log files
- Evidence-Based Medicine: Perform predictive cost analysis based on previous patterns of treatment. How can we reduce processing time from 100 hrs to 1 hr?
- Retail Business Planner: How can I use social media to assess brand and sentiment, and identify high-value customers?
- Web Archiving: What have I collected? How can I offer research tools that can analyze the content? How can I possibly categorize the content?
- Bioinformatics: How can I dramatically reduce my computation costs when analyzing genetic content?
New Opportunities
• Answer formerly unanswerable questions
• Formulate new questions
• Make more informed, evidence-based decisions
• Democratize your data and unleash information
• Visualize invisible knowledge
The Origins of Hadoop
In 2004 Google publishes seminal whitepapers on a new programming paradigm to handle data at Internet Scale (Google processes upwards of 20 PB per day using Map/Reduce)
http://research.google.com/people/sanjay/index.html
The Apache Foundation launches Hadoop – An Open-Source implementation of Google Map/Reduce and the distributed Google FileSystem
Google and IBM create the “Cloud Computing Academic Initiative” to teach Hadoop and Internet Scale Computing Skills to the next generation of Engineers
The Origins of Hadoop: The Post-Web 2.0 Era
– An Explosion of Data on the Internet and in the Enterprise
– 1000 GB = 1 terabyte; 1000 terabytes = 1 petabyte
How do we handle unstructured data ?
How do we process the volume ? A need to process 100 TB datasets
– On 1 node:
• Scanning @ 50 MB/s = 23 days (MTBF = 3 years)
– On a 1000-node cluster:
• Scanning @ 50 MB/s = 33 mins (MTBF = 1 day)
Need a framework for distribution (Efficient, Reliable, Easy to use)
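The scan-time figures above follow from simple arithmetic. A quick sketch, assuming the decimal convention (1 TB = 1,000,000 MB) that the slide's numbers imply:

```python
# Time to scan a 100 TB dataset sequentially at 50 MB/s per node.
DATASET_TB = 100
RATE_MB_S = 50

total_mb = DATASET_TB * 1_000_000      # 100 TB = 100,000,000 MB
one_node_s = total_mb / RATE_MB_S      # seconds on a single node
cluster_s = one_node_s / 1000          # 1000 nodes scanning in parallel

print(f"1 node:     {one_node_s / 86400:.1f} days")    # ~23.1 days
print(f"1000 nodes: {cluster_s / 60:.1f} minutes")     # ~33.3 minutes
```

The MTBF figures work the same way in reverse: a thousand machines, each with a three-year mean time between failures, will see a failure roughly every day, which is why the framework must treat node failure as routine.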
So, What is Hadoop?
- A framework for running applications (aka jobs) on large clusters built of commodity hardware, capable of processing petabytes of data.
- A framework that transparently provides applications with both reliability and data motion, and ensures data locality.
- It implements a computational paradigm named Map/Reduce, where the application is divided into self-contained units of work, each of which may be executed or re-executed on any node in the cluster.
- It provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
- Node failures are automatically handled by the framework.
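The Map/Reduce paradigm described above can be illustrated with the classic word-count example. This is a minimal single-process sketch of the three phases (in a real Hadoop job, the map and reduce functions run in parallel across the cluster and the framework performs the shuffle):

```python
from collections import defaultdict

# Map phase: each "node" turns its chunk of input into (key, value) pairs.
def map_phase(chunk):
    for word in chunk.split():
        yield word.lower(), 1

# Shuffle: group all values by key (Hadoop does this between map and reduce).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each key's values are combined independently, which is why
# reducers are self-contained units of work that can run on any node.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data needs big clusters", "hadoop handles big data"]
pairs = [p for chunk in chunks for p in map_phase(chunk)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])   # 3
```

Because each map and reduce call depends only on its own input, a failed unit of work can simply be re-executed elsewhere, which is the property the slide's reliability claim rests on.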
Hadoop Ecosystem
Apache Hadoop
IBM: Common Component Community Participation
- Structured store: HBase
- Emerging: non-MR models (e.g. Hama), hybrid (e.g. HadoopDB), "realtime" (e.g. Hadoop Online)
- Tooling: Karmasphere, Cloudera, Apache
- Scripting / query: Pig, Jaql, Hive; workflow: Cascading, Oozie
- Etc.: app co-ordination: ZooKeeper; data collection: Chukwa; machine learning libs: Mahout
• The Apache Hadoop community is vibrant, innovative, far-reaching & evolving
• Hadoop technologies have solved major problems at scale
• Committers: Yahoo, Cloudera, Facebook
Graph © Yahoo!
Introducing The InfoSphere BigInsights Portfolio
• The new offerings are powered by Apache Hadoop, an open-source technology designed for the analysis of big volumes of data. With the new portfolio, IBM builds on this open-source technology with its software expertise to deliver business solutions for analyzing terabyte- and petabyte-sized quantities of data.
• The new portfolio consists of specific Big Data analytics solutions that can be used by business professionals and easily deployed by IT professionals in data center and cloud configurations. It includes:
• A package of Apache Hadoop software and services designed to help IT professionals quickly get started with Big Data analytics, including design, installation, integration and monitoring of this open-source technology. The package helps organizations quickly build and deploy custom analytics and workloads to capture insight from Big Data that can then be integrated into existing database, data warehouse and business intelligence infrastructures.
• A software technology preview called BigSheets, designed to help business professionals extract, annotate and visually uncover insights from vast amounts of information quickly and easily through a Web browser. BigSheets includes a plug-in framework extension for analytic engines and visualization software such as Many Eyes.
Understanding Our BigInsights Stack
[Architecture diagram] Layers: IBM Distribution of Apache Hadoop; BigInsights Core / Enabling Infrastructure; BigInsights Application Server; Applications & Solutions (partners / community).
Components shown: install & configuration, JAQL, monitoring, management console, DB & warehouse integration, Toro, Metatracker, unstructured analytics (SystemT), SPSS mining and scoring, Karmasphere, Gumshoe, next-generation credit risk analytics, custom applications.
BigSheets (included in BigInsights).
How Streams, Relational and Hadoop Systems Fit
[Architecture diagram] Three complementary paths:
- In-motion analytics (RTAP) over non-traditional / non-relational data sources, producing ultra-low-latency results
- At-rest data analytics in databases & warehouses (OLTP / OLAP) over traditional / relational data sources
- Hadoop at Internet scale, over both traditional / relational and non-traditional / non-relational data sources, for data analytics, data operations & model building
BigSheets
What is it?
- A cloud application used by the domain expert for performing ad-hoc analytics at web scale on unstructured and structured content
- Puts Map/Reduce & Hadoop to work for the line-of-business user
How does it work?
- Gather content either statically (e.g. crawl) or dynamically through connectors
- Extract local or "document"-level information (e.g. a congressperson's name), cleanse, normalize
- Explore, analyze, annotate and navigate content; filter on existing and new relationships; generate results and visualize
- Iterate at any and all steps
- Uses a browser-based visual front end with a spreadsheet metaphor to create worksheets for exploring/visualizing the big data
BigSheets: Logical View
[Architecture diagram]
- User interface & front-end server (JSP container: Jetty + JDBC): create, monitor, visualize, extend
- BigSheets job server (standalone BigSheets + Hadoop job controller); import/export
- Map/Reduce (Hadoop) on the distributed file system (HDFS): IBM Hadoop Common Component
- Pig (scripting Map/Reduce), Nutch (web crawler): Apache projects
- Other tools (LanguageWare, ICM, etc.): IBM analytic products
- REST API for customer choice of analytic service/engine
- REST API for choice of visualization
- Export content as feeds, JSON, CSV, ...
Legend: IBM products and Apache enabling projects.
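The export path in the logical view turns worksheet rows into feeds, JSON or CSV. A minimal sketch of the JSON-to-CSV half of that, using Python's standard library (the record fields here are illustrative, not the real BigSheets API):

```python
import csv
import io
import json

# Hypothetical worksheet rows as a JSON export might deliver them
# (field names are made up for illustration).
records_json = '[{"mp": "A. Smith", "votes": 412}, {"mp": "B. Jones", "votes": 388}]'
records = json.loads(records_json)

# Re-export the same rows as CSV.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["mp", "votes"])
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```

The point of exposing both formats is interoperability: JSON suits the REST APIs mentioned above, while CSV drops straight into spreadsheets and warehouse load tools.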
British Library
Overview:
- Currently archiving a subset of the UK web domain due to copyright constraints (time-consuming, labor-intensive categorization and analysis)
- 2010 legislation will enable the archive to increase from 5,000 sites to 4 million
Problem:
- The current model does not scale: 30 people are currently needed for the 5,000 sites!
- We have centuries of newspaper content (optical character recognition); how do we seamlessly aggregate it into a single logical collection?
Business Questions:
- What have we collected?
- Automatic categorization
- Descriptive and predictive analysis (e.g. given the 2005 and 2009 elections, the 2010 elections will...)
- How can we enable the research community to derive deeper insights from the data?
- Collaboration on digital research projects is fundamentally changing
Data Characteristics:
- Estimated beginning data size: 128 TB raw web content, 384 TB working set within the Hadoop environment; total data capacity required well over a petabyte
- Cluster size: starting with dozens of nodes, expanding as necessary
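The jump from 128 TB raw to a 384 TB working set is worth a note: the slide does not say so, but the 3x factor matches HDFS's default block replication factor of 3 (each block stored on three nodes for fault tolerance). A one-line check under that assumption:

```python
# Assumed: the working-set figure reflects HDFS's default replication of 3.
raw_tb = 128
replication_factor = 3
working_set_tb = raw_tb * replication_factor
print(working_set_tb)   # 384
```

Replication is also what lets the framework re-run a failed unit of work on another node that already holds a copy of the data, preserving the data locality mentioned earlier.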
Summary and Experiences to Date
- BigSheets is able to quickly demonstrate business value in harnessing Internet-scale data using the power of Hadoop
- In customer trials, the spreadsheet metaphor meant no additional training was necessary: customers were able to drive almost immediately without needing to understand schemas, data/table partitioning, query languages, etc.
- Best practices are emerging; a cornerstone is that transforms applied to the data should create new worksheets/datasets
- Companies are envisioning many applications where retaining (and periodically updating) the original data is key
- Performing analytics on unstructured/semi-structured data implies iterative interpretations of the data, i.e. a flexible schema, set and readjusted at runtime
DEMOS