MetaScale Kognitio Hadoop Webinar
Transcript of the MetaScale Kognitio Hadoop webinar
Webinar: Make Big Data Easy with the Right Tools and Talent
October 2012
- MetaScale Expertise and Kognitio Analytics Accelerate Hadoop for Organizations Large and Small
Today’s webinar
• 45 minutes with 15 minutes Q&A
• We will email you a link to the slides
• Feel free to use the Q & A feature
Agenda
• Opening introduction
• MetaScale Expertise
  – Case study – Sears Holdings
• Kognitio Analytics – Hadoop acceleration explained
• Summary
• Q&A

Presenters:
• Michael Hiskey, VP Marketing & Business Development, Kognitio
• Dr. Phil Shelley, CEO, MetaScale; CTO, Sears Holdings

Host:
• Roger Gaskell, CTO, Kognitio
Big Data ≠ Hadoop
Big Data is high volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making
• Volume – (not only) size
• Velocity – speed of input/output
• Variety – lots of data sources
• Value – not the SIZE of your data, but what you can DO with it!
OK, so you’ve decided to put data in Hadoop...
Now what?
Dr. Phil Shelley, CEO, MetaScale; CTO, Sears Holdings
Where Did We Start at Sears?
Where Did We Start?
• Issues with meeting production schedules
• Multiple copies of data, no single point of truth
• ETL complexity, cost of software and cost to manage
• Time taken to set up ETL data sources for projects
• Latency in data, up to weeks in some cases
• Enterprise Data Warehouses unable to handle the load
• Mainframe workload over-consuming capacity
• IT budgets not growing – BUT data volumes escalating
Why Hadoop?
Traditional Databases & Warehouses vs. Hadoop
An Ecosystem
Enterprise Integration
• Data Sourcing – Connecting to legacy source systems; loaders and tools (speed considerations); batch or near-real time
• Enterprise Data Model – Establish a model and an enterprise data strategy early
• Data Transformations – The end of ETL as we know it
• Data Re-use – Drive re-use of data; a single point of truth is now a possibility
• Data Consumption and User Interaction – Consume data in place wherever possible; move data only if you have to; exporting to legacy systems can be done, but it duplicates data; loaders and tools (speed considerations); how will your users interact with the data?
Rethink Everything
• The way you capture data
• The way you store data
• The structure of your data
• The way you analyze data
• The costs of data storage
• The size of your data
• What you can analyze
• The speed of analysis
• The skills of your team
• The way users interact with data
The Learning from our Journey
• Big Data tools are here and ready for the Enterprise
• An Enterprise Data Architecture model is essential
• Hadoop can handle Enterprise workload
  – To reduce strain on legacy platforms
  – To reduce cost
  – To bring new business opportunities
• Must be part of an overall data strategy
• Not to be underestimated
• The solution must be an Eco-System
  – There has to be a simple way to consume the data
Hadoop Strengths & Weaknesses?
Strengths:
• Cost-effective platform
• Powerful / fast data processing environment
• Good at standard reporting
• Flexibility: programmable, any data type
• Huge scalability

Weaknesses:
• Barriers to entry: lots of engineering and coding
• High ongoing coding requirements
• Difficult to access with standard BI/analytical tools
• Ad hoc complex analytics difficult
• Too slow for interactive analytics
Reference Architecture
What is an “In-memory” Analytical Platform?
• A DBMS where all of the data of interest, or specific portions of it, has been permanently pre-loaded into random access memory (RAM)
• Not a large cache:
  – Data is held in structures that take advantage of the properties of RAM – NOT copies of frequently used disk blocks
  – The database's query optimiser knows at all times exactly which data is in memory (and which is not)
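The cache distinction above can be sketched in a few lines of Python. This is an illustrative toy, not Kognitio's implementation; the names (`InMemoryCatalog`, `pin`, `plan`) are ours. The point is that tables are pinned into RAM deliberately, so the planner knows their residency before it plans a query, rather than discovering it as a side effect of recent access.

```python
# Illustrative sketch: an in-memory analytical platform vs. a cache.
class InMemoryCatalog:
    def __init__(self):
        self.pinned = {}    # tables deliberately pre-loaded into RAM
        self.on_disk = {}   # tables that remain external / on disk

    def pin(self, name, rows):
        """Explicitly load ('pin') a table into RAM."""
        self.pinned[name] = list(rows)

    def register_external(self, name, rows):
        """Leave a table on disk, but make it visible to the planner."""
        self.on_disk[name] = rows

    def plan(self, name):
        # The optimiser knows exactly which data is in memory (and which is
        # not) -- unlike a cache, where residency depends on recent access.
        return "memory-scan" if name in self.pinned else "external-scan"

catalog = InMemoryCatalog()
catalog.pin("sales", [("a", 1), ("b", 2)])
catalog.register_external("web_logs", [("x", 9)])
print(catalog.plan("sales"))     # memory-scan
print(catalog.plan("web_logs"))  # external-scan
```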
In-Memory Analytical Database Management
Not a large cache:
• No disk access during query execution
  – Temporary tables in RAM
  – Result sets in RAM
• In-memory means in high-speed RAM – NOT slow flash-based SSDs that mimic mechanical disks

For more information:
• Gartner: "Who's Who in In-Memory DBMSs", Roxanne Edjlali and Donald Feinberg, 10 Sept 2012, www.gartner.com/id=2151315
Why In-memory: RAM is Faster Than Disk (Really!)
Actually, this is only part of the story. Analytics completely changes the workload characteristics on the database:
• Simple reporting and transactional processing is all about "filtering" the data of interest
• Analytics is all about complex "crunching" of the data once it is filtered
• Crunching needs processing power and consumes CPU cycles
• Storing data on physical disks severely limits the rate at which data can be provided to the CPUs
• Accessing data directly from RAM allows much more CPU power to be deployed
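The bandwidth argument above can be put in back-of-envelope form. The throughput figures below are illustrative assumptions, not measurements from the webinar; the shape of the result is what matters:

```python
# Back-of-envelope sketch: how many CPU cores can each data source keep fed?
# All three figures are assumptions chosen only to illustrate the gap.
DISK_BANDWIDTH_MB_S = 200        # assumed: one mechanical disk, sequential scan
RAM_BANDWIDTH_MB_S = 20_000      # assumed: aggregate RAM bandwidth of one server
CRUNCH_RATE_PER_CORE_MB_S = 500  # assumed: data one core can "crunch" per second

def cores_kept_busy(source_bandwidth_mb_s):
    """Number of CPU cores this data source can keep fully fed."""
    return source_bandwidth_mb_s // CRUNCH_RATE_PER_CORE_MB_S

print(cores_kept_busy(DISK_BANDWIDTH_MB_S))  # 0  -> disk can't even feed one core
print(cores_kept_busy(RAM_BANDWIDTH_MB_S))   # 40 -> RAM lets far more CPU be deployed
```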
Analytics is about "CRUNCHING" through Data
• To understand what is happening in the data:
  – Joins
  – Sorts
  – Aggregations
  – Grouping
  – Analytical functions
• This work is CPU cycle-intensive
• Analytical platforms are therefore CPU-bound
  – Assuming disk I/O speeds are not a bottleneck
  – In-memory removes the disk I/O bottleneck
• The more complex the analytics, the more pronounced this becomes
For Analytics, the CPU is King
• The key metric of any analytical platform should be GB/CPU
  – It needs to effectively utilize all available cores
  – Hyper-threads are NOT the equivalent of cores
• Interactive/ad hoc analytics: THINK data-to-core ratios ≈ 10 GB of data per CPU core
• Every cycle is precious – CPU cores need to be used efficiently
  – Techniques such as "dynamic machine code generation"
• Careful – the performance impact of compression:
  – Makes disk-based databases go faster
  – Makes in-memory databases go slower
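The ~10 GB-per-core ratio above lends itself to a quick sizing sketch. The rule of thumb is the slide's; the helper function is ours:

```python
# Sizing sketch using the slide's ~10 GB-of-data-per-CPU-core rule of thumb
# for interactive/ad hoc analytics.
GB_PER_CORE = 10

def cores_needed(data_gb):
    """Cores needed to serve data_gb interactively at ~10 GB per core."""
    return -(-data_gb // GB_PER_CORE)  # ceiling division

print(cores_needed(2_000))  # 200 cores for a 2 TB in-memory layer
```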
Speed & Scale are the Requirements
• Memory & CPU on an individual server = NOWHERE near enough for big data
  – Moore's Law: the power of a processor doubles every two years
  – Data volumes: double every year!!
• The only way to keep up is to parallelise, or scale out:
  – Combine the RAM of many individual servers
  – Many CPU cores, spread across many CPUs, housed in many individual computers
• Every CPU core in every server needs to be efficiently involved in every query
  – Data is split across all the CPU cores
  – All database operations need to be parallelised with no points of serialisation
  – This is true MPP
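The MPP pattern described above can be sketched in miniature: split the data across workers, have every worker compute a partial aggregate with no serial section, then combine the partials. Threads here stand in for the many cores and servers of a real MPP engine; this is a pattern sketch, not an engine:

```python
# Minimal MPP-style sketch: split -> parallel partial aggregates -> combine.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each "core" aggregates only its own slice of the data.
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Data is split across all the workers...
    chunks = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # ...every worker takes part in every query...
        partials = list(pool.map(partial_sum, chunks))
    # ...and only the tiny final combine step is serial.
    return sum(partials)

print(parallel_sum(list(range(1_000))))  # 499500
```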
Hadoop Connectivity

Kognitio – External Tables
– Data held on disk in other systems can be seen as non-memory-resident tables by Kognitio users
– Users can select which data they wish to "suck" into memory, using a GUI or scripts
– Kognitio seamlessly pulls data out of the source system into Kognitio memory
– All managed via SQL

Kognitio – Hadoop Connectors
– Two types:
  • HDFS Connector
  • Filter Agent Connector
– Designed for high speed:
  • Multiple parallel load streams
  • Demonstrable 14 TB+/hour load rates
Tight Hadoop Integration

HDFS Connector
• Connector defines access to the HDFS file system
• External table accesses row-based data in HDFS
• Dynamic access, or "pin" data into memory
• The complete HDFS file is loaded into memory

Filter Agent Connector
• Connector uploads an agent to the Hadoop nodes
• Query passes selections and relevant predicates to the agent
• Data filtering and projection take place locally on each Hadoop node
• Only data of interest is loaded into memory, via parallel load streams
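The difference between the two connector styles can be sketched as follows. The function names and sample rows are ours, not the product's API: the HDFS-connector path pulls whole rows into memory, while the filter-agent path applies the predicate and projection on the Hadoop side so only the data of interest travels:

```python
# Illustrative sketch of the two connector styles (names are hypothetical).
rows = [
    {"store": "A", "region": "east", "sales": 120},
    {"store": "B", "region": "west", "sales": 80},
    {"store": "C", "region": "east", "sales": 200},
]

def load_whole_file(source):
    """HDFS-connector style: the complete file is pulled into memory."""
    return list(source)

def load_filtered(source, predicate, columns):
    """Filter-agent style: filter and project locally, ship only matches."""
    return [{c: r[c] for c in columns} for r in source if predicate(r)]

full = load_whole_file(rows)
slim = load_filtered(rows, lambda r: r["region"] == "east", ["store", "sales"])
print(len(full), len(slim))  # 3 2
```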
Not Only SQL
Kognitio V8 External Scripts
– Run third-party scripts embedded within SQL
  • Perl, Python, Java, R, SAS, etc.
  • One-to-many rows in, zero-to-many rows out, or one-to-one

    create interpreter perlinterp
    command '/usr/bin/perl' sends 'csv' receives 'csv';

    select top 1000 words, count(*)
    from (external script using environment perlinterp
          receives (txt varchar(32000))
          sends (words varchar(100))
          script S'endofperl(
            while(<>){
              chomp();
              s/[\,\.\!\_\\]//g;
              foreach $c (split(/ /)){
                if($c =~ /^[a-zA-Z]+$/) { print "$c\n" }
              }
            }
          )endofperl'
          from (select comments from customer_enquiry)) dt
    group by 1 order by 2 desc;

This reads long comment text from the customer_enquiry table; the in-line Perl converts each long text into an output stream of words (one word per row), and the query selects the top 1000 words by frequency using standard SQL aggregation.
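Since the slide lists Python among the supported interpreters, a hypothetical Python stand-in for the in-line Perl above might look like this. The function name is ours; in Kognitio the rows would arrive on stdin as CSV, while here we demo a single row:

```python
# Hypothetical Python equivalent of the in-line Perl word-splitter:
# one input row in, zero-to-many word rows out, alphabetic words only.
import re

def words_from_line(line):
    """Strip punctuation, then keep only purely alphabetic words."""
    line = re.sub(r"[,.!_\\]", "", line.rstrip("\n"))
    return [w for w in line.split(" ") if re.fullmatch(r"[A-Za-z]+", w)]

for word in words_from_line("Great service, will return!"):
    print(word)  # Great / service / will / return, one per row
```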
Hardware Requirements for In-memory Platforms
• Hadoop = industry-standard servers
• Careful to avoid vendor lock-in
• Off-the-shelf, low-cost servers match neatly with Hadoop
  – Intel or AMD CPU (x86)
  – No special components
• Ethernet network
• Standard OS
Benefits of an In-memory Analytical Platform
• A seamless in-memory analytical layer on top of your data persistence layer(s):
  – Analytical queries that used to run in hours and minutes now run in minutes and seconds (often sub-second)
  – High query throughput = massively higher concurrency
  – Flexibility
    • Enables greater query complexity
    • Users freely interact with data
    • Use preferred BI tools (relational or OLAP)
  – Reduced complexity
    • Administration de-skilled
    • Reduced data duplication
The Learning from our Journey
• Big Data tools are here and ready for the Enterprise
• An Enterprise Data Architecture model is essential
• Hadoop can handle Enterprise workload
  – To reduce strain on legacy platforms
  – To reduce cost
  – To bring new business opportunities
• Must be part of an overall data strategy
• Not to be underestimated
• The solution must be an Eco-System
  – There has to be a simple way to consume the data
www.kognitio.com
kognitio.com/blog
twitter.com/kognitio
linkedin.com/companies/kognitio
facebook.com/kognitio
youtube.com/user/kognitio
Dr. Phil Shelley, CEO, MetaScale; CTO, Sears Holdings
Michael Hiskey, Vice President, Marketing & Business Development, [email protected]
Phone: +1 (855) KOGNITIO
Upcoming Web Briefings: kognitio.com/briefings