Working with BigData
Agenda
- Background: Internet Scale Data
- Hadoop
- BigInsights / BigSheets
- Demos
New Intelligence – Internet Scale Data
Enormous amounts of both structured and unstructured data are being created every day (~15 petabytes, 8x the content of all US libraries)
– NYSE, 1 TB/day
– FB, 20+ TB/day
– CERN/LHC, 40 TB/day (15 PB/year)
Companies are recognizing the potential of leveraging the broader web, as well as their internal content, for business intelligence
This content is an untapped source of business insights & intelligence
- 100 years baseball stats + 100 years weather data
Separating signal from noise is imperative for successful value extraction
Need ways to harness “big data”
Network collective intelligence --> web scale data
Extracting Signal from Noise --> gathering, extract & explore
Reduction in latency --> actionable insight = competitive advantage
Emerging Big Data Patterns
Pattern: Business Question
- Computational Journalism: A BBC editor analyzing the UK Parliament web site, identifying MPs and voting records over 10+ years
- Telecom: How can I analyze customer call records to predict customer churn?
- IT Systems Management: Identifying intrusion patterns via log files
- Evidence-Based Medicine: Perform predictive cost analysis based on previous patterns of treatment. How can we reduce processing time from 100 hrs to 1 hr?
- Retail Business Planner: How can I use social media to assess brand and sentiment, and identify high-value customers?
- Web Archiving: What have I collected? How can I offer research tools that can analyze the content? How can I possibly categorize the content?
- Bioinformatics: How can I dramatically reduce my computation costs when analyzing genetic content?
New Opportunities
• Answer formerly unanswerable questions
• Formulate new questions
• Make more informed, evidence-based decisions
• Democratize your data and unleash information
• Visualize invisible knowledge
The Origins of Hadoop
In 2004 Google publishes seminal whitepapers on a new programming paradigm to handle data at Internet Scale (Google processes upwards of 20 PB per day using Map/Reduce)
http://research.google.com/people/sanjay/index.html
The Apache Foundation launches Hadoop – An Open-Source implementation of Google Map/Reduce and the distributed Google FileSystem
Google and IBM create the “Cloud Computing Academic Initiative” to teach Hadoop and Internet Scale Computing Skills to the next generation of Engineers
The Origins of Hadoop: The Post-Web 2.0 Era
– An Explosion of Data on the Internet and in the Enterprise
– 1000 GB = 1 terabyte; 1000 terabytes = 1 petabyte
How do we handle unstructured data ?
How do we process the volume ? A need to process 100 TB datasets
– On 1 node:
• Scanning @ 50 MB/s = 23 days (MTBF = 3 years)
– On a 1000-node cluster:
• Scanning @ 50 MB/s = 33 mins (MTBF = 1 day)
Need a framework for distribution (Efficient, Reliable, Easy to use)
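The scan-time figures above follow from simple arithmetic. A quick sketch, assuming the decimal convention (1 TB = 1,000,000 MB) that the slide's numbers imply:

```python
# Time to scan a 100 TB dataset sequentially at 50 MB/s per node.
DATASET_TB = 100
RATE_MB_S = 50

total_mb = DATASET_TB * 1_000_000      # 100 TB = 100,000,000 MB
one_node_s = total_mb / RATE_MB_S      # seconds on a single node
cluster_s = one_node_s / 1000          # 1000 nodes scanning in parallel

print(f"1 node:     {one_node_s / 86400:.1f} days")    # ~23.1 days
print(f"1000 nodes: {cluster_s / 60:.1f} minutes")     # ~33.3 minutes
```

The MTBF figures work the same way in reverse: a thousand machines, each with a three-year mean time between failures, will see a failure roughly every day, which is why the framework must treat node failure as routine.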
So, What is Hadoop?
- A framework for running applications (aka jobs) on large clusters built of commodity hardware, capable of processing petabytes of data.
- A framework that transparently provides applications with both reliability and data motion, and ensures data locality.
- It implements a computational paradigm named Map/Reduce, where the application is divided into self-contained units of work, each of which may be executed or re-executed on any node in the cluster.
- It provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
- Node failures are automatically handled by the framework.
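The Map/Reduce paradigm described above can be illustrated with the classic word-count example. This is a minimal single-process sketch of the three phases (in a real Hadoop job, the map and reduce functions run in parallel across the cluster and the framework performs the shuffle):

```python
from collections import defaultdict

# Map phase: each "node" turns its chunk of input into (key, value) pairs.
def map_phase(chunk):
    for word in chunk.split():
        yield word.lower(), 1

# Shuffle: group all values by key (Hadoop does this between map and reduce).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each key's values are combined independently, which is why
# reducers are self-contained units of work that can run on any node.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data needs big clusters", "hadoop handles big data"]
pairs = [p for chunk in chunks for p in map_phase(chunk)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])   # 3
```

Because each map and reduce call depends only on its own input, a failed unit of work can simply be re-executed elsewhere, which is the property the slide's reliability claim rests on.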
Hadoop Ecosystem
Apache Hadoop
IBM: Common Component Community Participation
- Structured store: HBase
- Emerging: non-MR models (e.g. Hama), hybrid (e.g. HadoopDB), "realtime" (e.g. Hadoop Online)
- Tooling: Karmasphere, Cloudera, Apache
- Scripting / query: Pig, Jaql, Hive; workflow: Cascading, Oozie
- Etc.: app co-ordination: ZooKeeper; data collection: Chukwa; machine learning libs: Mahout
• The Apache Hadoop community is vibrant, innovative, far-reaching & evolving
• Hadoop technologies have solved major problems at scale
• Committers: Yahoo, Cloudera, Facebook
Graph © Yahoo!
Introducing The InfoSphere BigInsights Portfolio
• The new offerings are powered by Apache Hadoop, an open-source technology designed for the analysis of big volumes of data. With the new portfolio, IBM builds on this open-source technology with its software expertise to deliver business solutions for analyzing terabyte- and petabyte-sized quantities of data.
• The new portfolio consists of specific Big Data analytics solutions that can be used by business professionals and easily deployed by IT professionals in data center and cloud configurations. It includes:
• A package of Apache Hadoop software and services designed to help IT professionals quickly get started with Big Data analytics, including design, installation, integration and monitoring of this open-source technology. The package helps organizations quickly build and deploy custom analytics and workloads to capture insight from Big Data that can then be integrated into existing database, data warehouse and business intelligence infrastructures.
• A software technology preview called BigSheets, designed to help business professionals extract, annotate and visually uncover insights from vast amounts of information quickly and easily through a Web browser. BigSheets includes a plug-in framework extension for analytic engines and visualization software such as Many Eyes.
Understanding Our BigInsights Stack
[Architecture diagram] Layers: IBM Distribution of Apache Hadoop; BigInsights Core / Enabling Infrastructure; BigInsights Application Server; Applications & Solutions (partners / community).
Components shown: install & configuration, JAQL, monitoring, management console, DB & warehouse integration, Toro, Metatracker, unstructured analytics (SystemT), SPSS mining and scoring, Karmasphere, Gumshoe, next-generation credit risk analytics, custom applications.
BigSheets (included in BigInsights).
How Streams, Relational and Hadoop Systems Fit
[Architecture diagram] Three complementary paths:
- In-motion analytics (RTAP) over non-traditional / non-relational data sources, producing ultra-low-latency results
- At-rest data analytics in databases & warehouses (OLTP / OLAP) over traditional / relational data sources
- Hadoop at Internet scale, over both traditional / relational and non-traditional / non-relational data sources, for data analytics, data operations & model building
BigSheets
What is it?
- A cloud application used by the domain expert for performing ad-hoc analytics at web scale on unstructured and structured content
- Puts Map/Reduce & Hadoop to work for the line-of-business user
How does it work?
- Gather content either statically (e.g. crawl) or dynamically through connectors
- Extract local or "document"-level information (e.g. a congressperson's name), cleanse, normalize
- Explore, analyze, annotate and navigate content; filter on existing and new relationships; generate results and visualize
- Iterate at any and all steps
- Uses a browser-based visual front end with a spreadsheet metaphor to create worksheets for exploring/visualizing the big data
BigSheets: Logical View
[Architecture diagram]
- User interface & front-end server (JSP container: Jetty + JDBC): create, monitor, visualize, extend
- BigSheets job server (standalone BigSheets + Hadoop job controller); import/export
- Map/Reduce (Hadoop) on the distributed file system (HDFS): IBM Hadoop Common Component
- Pig (scripting Map/Reduce), Nutch (web crawler): Apache projects
- Other tools (LanguageWare, ICM, etc.): IBM analytic products
- REST API for customer choice of analytic service/engine
- REST API for choice of visualization
- Export content as feeds, JSON, CSV, ...
Legend: IBM products and Apache enabling projects.
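The export path in the logical view turns worksheet rows into feeds, JSON or CSV. A minimal sketch of the JSON-to-CSV half of that, using Python's standard library (the record fields here are illustrative, not the real BigSheets API):

```python
import csv
import io
import json

# Hypothetical worksheet rows as a JSON export might deliver them
# (field names are made up for illustration).
records_json = '[{"mp": "A. Smith", "votes": 412}, {"mp": "B. Jones", "votes": 388}]'
records = json.loads(records_json)

# Re-export the same rows as CSV.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["mp", "votes"])
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```

The point of exposing both formats is interoperability: JSON suits the REST APIs mentioned above, while CSV drops straight into spreadsheets and warehouse load tools.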
British Library
Overview:
- Currently archiving a subset of the UK web domain due to copyright constraints (time-consuming, labor-intensive categorization and analysis)
- 2010 legislation will enable the archive to increase from 5,000 sites to 4 million
Problem:
- The current model does not scale: 30 people are currently needed for the 5,000 sites!
- We have centuries of newspaper content (optical character recognition); how do we seamlessly aggregate it into a single logical collection?
Business Questions:
- What have we collected?
- Automatic categorization
- Descriptive and predictive analysis (e.g. given the 2005 and 2009 elections, the 2010 elections will...)
- How can we enable the research community to derive deeper insights from the data?
- Collaboration on digital research projects is fundamentally changing
Data Characteristics:
- Estimated beginning data size: 128 TB raw web content, 384 TB working set within the Hadoop environment; total data capacity required well over a petabyte
- Cluster size: starting with dozens of nodes, expanding as necessary
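The jump from 128 TB raw to a 384 TB working set is worth a note: the slide does not say so, but the 3x factor matches HDFS's default block replication factor of 3 (each block stored on three nodes for fault tolerance). A one-line check under that assumption:

```python
# Assumed: the working-set figure reflects HDFS's default replication of 3.
raw_tb = 128
replication_factor = 3
working_set_tb = raw_tb * replication_factor
print(working_set_tb)   # 384
```

Replication is also what lets the framework re-run a failed unit of work on another node that already holds a copy of the data, preserving the data locality mentioned earlier.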
Summary and Experiences to Date
- BigSheets is able to quickly demonstrate business value in harnessing Internet-scale data using the power of Hadoop
- In customer trials, the spreadsheet metaphor meant no additional training was necessary: customers were able to drive almost immediately without needing to understand schemas, data/table partitioning, query languages, etc.
- Best practices are emerging; a cornerstone is that transforms applied to the data should create new worksheets/datasets
- Companies are envisioning many applications where retaining (and periodically updating) the original data is key
- Performing analytics on unstructured/semi-structured data implies iterative interpretations of the data, i.e. a flexible schema, set and readjusted at runtime
DEMOS