Determine the Right Analytic Database: A Survey of New Data Technologies


Description

There has been an explosion in database technology designed to handle big data and deep analytics from both established vendors and startups. This session will provide a quick tour of the primary technology innovations and systems powering the analytic database landscape—from data warehousing appliances and columnar databases to massively parallel processing and in-memory technology. The goal is to help you understand the strengths and limitations of these alternatives and how they are evolving so you can select technology that is best suited to your organization and needs. Presentation from the O'Reilly Strata conference, February 2011.

Transcript of Determine the Right Analytic Database: A Survey of New Data Technologies

Page 1: Determine the Right Analytic Database: A Survey of New Data Technologies

Determine the Right Analytic Database: A Survey of New Data Technologies

O’Reilly Strata Conference, February 1, 2011

Mark R. Madsen, http://ThirdNature.net, Twitter: @markmadsen

Atomic Avenue #1 by Glen Orbik

Presenter
Presentation Notes
Determine the Right Analytic Database : A Survey of Data Technologies and Products There has been an explosion in database technology designed to handle big data and deep analytics from both established vendors and startups. This session will provide a quick tour of the primary technology innovations and systems powering the analytic database landscape—from data warehousing appliances and columnar databases to massively parallel processing and in-memory technology. The goal is to help you understand the strengths and limitations of these alternatives and how they are evolving so you can select technology that is best suited to your organization and needs. Mark Madsen, Third Nature Mark spent the past two decades working on analysis and decision support projects in many industries. He is the founder of Third Nature, a research and consulting firm focused on emerging technology and practices in analytics, BI and information management. Mark is also an award-winning former CTO and consultant who frequently speaks at US and European conferences.
Page 2

Key Questions
▪ What technologies are available?

▪What are they good for?

▪ How do you decide which to use?


But first: why are analytic databases available now?

Presenter
Presentation Notes
Image: spices.jpg - http://flickr.com/photos/oberazzi/387992959/
Page 3

Consequences of Commoditization: Data Volume

Chart: data generated rising over time through chipping, GPS, RFID, sensors, and spimes, with a "You are here" marker.

Presenter
Presentation Notes
What do all these things have in common? What is the result? Data. Lots and lots of data. Everyone’s already heard a lot of this already so there’s not a lot of reason to go into detail. Timing-wise, it’s still early for some things, but they’ve been around for a long while already. I was climbing around in tree canopies and looking at networked micro-sensors for temperature, humidity and light in 2002, and working with GPS phone kits in 2003. A lot has progressed since then. Ignoring the Orwellian undertones of bugged money and dusting peaceful protesters for later identification and “questioning”, think about what this means. It means that you’ll never get lost. We’ll raise a generation of kids who always know where they are. RFID means you can always be found. The government will always know where you are. RFID tags in retail goods means we can track the entire supply chain, know how long everything spent everywhere it was. We can even track the lifecycle of things, from the point of origin to their disposal at a landfill or, in the case of recyclables, where they were returned. This data could be used to prevent injection of fake drugs into the supply chain, since each bottle is trackable from source to destination. It makes inventory and supply chain management easier and more accurate. It allows us to monitor product lifecycles. But that’s just dumb RFIDs. What if you chip every product with a sensor, data store, CPU and wireless I/O? Everything can interact, or report its state. Products can alert you to their expiration. Imagine the joys of smart soup: variable pricing that retailers could put us through as inventories on shelves go down, or our shopping habits are exposed in retail time. “Normally he shops around 6:00 PM but it’s 7:00 tonight, and when he shops late he spends less time in the store and spends more per item than usual. Let’s jack up the price on everything a few cents.” Some web sites do this today. 
It’s possible it could be more intrusive in the future. Bruce Sterling came up with a name for these information-containing products a few years ago. He calls them “spimes”. From our mundane work perspective, we’re talking about a thousand-fold increase in the volume of data we’ll be collecting. The average data volume today is doubling roughly every 12 to 18 months. Are you prepared to handle even a ten-fold increase in a few years? Will our networks, databases, schema designs and integration processes handle this load? Is the data integration architecture we’re using now the one that will be required? Image: n/a
Page 4

An Unexpected Consequence of Data Volumes

Sums, counts and sorted results only get you so far.

Page 5

An Unexpected Consequence of Data Volumes

Our ability to collect data is still outpacing our ability to derive meaning from it.

Page 6

Don’t worry about it. We’ll just buy more hardware.

CPUs, memory and storage track to very similar curves

Page 7

RIP Moore’s Law: it nearly ground to a halt for silicon integrated circuits about four years ago.

Page 8

Technology Has Changed (a lot) But We Haven’t

Chart: calculations per second per $1,000, 1900–2000, on a log scale from 10^-6 to 10^10, spanning the mechanical, relay, vacuum tube, transistor, and integrated circuit eras. Data: Ray Kurzweil, 2001.

Current DW architecture and methods start here, in the mid-1980s: a 10,000x improvement since then.

Page 9

Moore’s Law via the Lens of the Industry Analyst

Chart: CPU speed rising over time.

Presenter
Presentation Notes
Let’s look at Moore’s law as we did with transportation speed. Chip speed has been doubling roughly every 24 months. The volume of data we collect has been rising about twice as fast. Storage capacity has been on an even faster rise. So let’s forecast like an industry analyst. Line 1 = CPU speed. Line 2 = power consumption, which tracks the CPUs pretty well. Line 3 = heat generation. All of which means that somewhere around 2017 you’ll need your own nuclear reactor, and the CPU will reach the surface temperature of the sun and melt a hole in your desk.
Page 10

Moore’s Law: Power Consumption

Chart: power use rising over time, annotated 2017.

Page 11

Moore’s Law: Heat Generation

Chart: CPU temperature rising over time, annotated 2017.

Page 12

Conclusion #1: Your own nuclear reactor by 2017

Chart: power use rising over time, annotated 2017.

Page 13

Conclusion #2: You Will Need a New Desk in 2017

Chart: power use rising over time, annotated 2017.

Presenter
Presentation Notes
Dude, you’re getting a Dell!
Page 14

Problem: linear extrapolation

“If the automobile had followed the same development as the computer, a Rolls-Royce would today cost $100, get a million miles per gallon, and explode once a year killing everyone inside.”

Robert Cringely

Chart: anything over time, the extrapolated curve vs. reality, which flattens.

Presenter
Presentation Notes
“If the automobile had followed the same development as the computer, a Rolls-Royce would today cost $100, get a million miles per gallon, and explode once a year killing everyone inside.” Robert Cringely Transportation speed followed a geometrically increasing curve. Then it flattened. Everything we track is actually a sigmoid curve. With computers, at some point we hit atomic size constraints and power consumption limits. It’s just that we don’t apply common sense to the extrapolation.
Page 15

Multicore performance is not a linear extrapolation.

Presenter
Presentation Notes
Image: Ford Nucleon
Page 16

Technology Maturity (time + engineering effort)

New Technology Evolution Means New Problems

1970 1980 1990 2000 2010 2020

Uniprocessor and custom CPU era

Symmetric multi‐processing era

Massively parallel era

Early engineering phase: exploring, learning, inventing

Investment phase: improving, perfecting, applying

Core problems solved


Presenter
Presentation Notes
Within the IC market we’ve gone through two changes and are on to a third. From a hardware perspective the core problem, the CPU, was solved a few years ago. The rest of the system, the operating system, and the software needed to take advantage of it were not.
Page 17

What’s different?

Parallelism

We’re not getting more CPU power, but more CPUs.

There are too many CPUs relative to other resources, creating an imbalance in hardware platforms.

Most software is designed for a single worker, not high degrees of parallelism, and won’t scale well.

Page 18

Core problem: software is not designed for parallel work

Databases must be designed to permit local work with minimal global coordination and data redistribution.

Presenter
Presentation Notes
weaver peru.jpg - http://flickr.com/photos/slack12/442373910/
Page 19

SOME TECHNOLOGY INNOVATIONS

Page 20

Storage Improvements

For data workloads, disk throughput is still key.

Improvements:
▪ Spinning disks at $0.05/GB
▪ Solid-state disks remove some latencies; read speeds of ~250 MB/sec
▪ SSD capacity still rising
▪ Card storage (PCIe), e.g. FusionIO at 1.5 GB/sec
▪ SSD is still costly, at $2/GB up to $30/GB

Page 21

Compression Applied to Stored Data

10x compression means 1 disk I/O can read 10x as much data, stretching your current hardware investment

But it eats CPU and memory.

YMMV
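A back-of-the-envelope model of the tradeoff (all numbers here are assumptions for illustration, not benchmarks):

```python
# Rough effective-throughput model for reading compressed data.
disk_mb_per_sec = 100        # assumed raw sequential read throughput
compression_ratio = 10       # 10x: one I/O carries 10x the logical data
decompress_mb_per_sec = 400  # assumed CPU decompression rate (logical MB/s)

# Effective logical read rate is limited by the slower of the disk
# (scaled up by the ratio) and the CPU's ability to decompress.
effective = min(disk_mb_per_sec * compression_ratio, decompress_mb_per_sec)
# Here the disk would deliver 1,000 logical MB/s but the CPU caps it at
# 400 MB/s: compression moved the bottleneck from I/O to CPU.
```

This is the "YMMV" in a nutshell: the win depends on whether your workload was I/O-bound to begin with and how much CPU headroom is left to spend on decompression.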

Page 22

Scale‐up vs. Scale‐out Parallelism

Uniprocessor environments required chip upgrades.

SMP servers can grow to a point, then it’s a forklift upgrade to a bigger box.

MPP servers grow by adding more nodes.

Page 23

Database and Hardware Deployment Models

Three levels of software-hardware integration:
▪ Database appliance (specialized hardware and software)
▪ Preconfigured (commodity) hardware with software
▪ Software on generic hardware

Then there are the hardware‐database parallel models:


Diagram: shared everything (one database over one OS instance), shared disk (one database over several OS instances sharing storage), and shared nothing (several database nodes, each with its own OS and storage).

Page 24

In‐Memory Processing

1. Maybe not as fast as you think. Depends entirely on the database (e.g. VectorWise)

2. So far, applied mainly to shared‐nothing models

3. Very large memories are more applicable to shared‐nothing than shared‐memory systems

Box-limited (e.g. 2 TB max) vs. limited by node scaling (e.g. 16 nodes at 512 GB each = 8 TB)

4. Still an expensive way to get performance
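A quick sizing sketch for points 3 and 4 (the raw data size, compression ratio, and working-memory overhead factor below are assumptions for illustration):

```python
# Does a dataset fit in memory on an MPP configuration? (illustrative)
nodes = 16
gb_per_node = 512
raw_data_tb = 2
compression_ratio = 4  # assumed columnar compression
overhead_factor = 2    # assumed working memory for sorts, joins, intermediates

total_memory_tb = nodes * gb_per_node / 1024                    # 8.0 TB
needed_tb = raw_data_tb / compression_ratio * overhead_factor   # 1.0 TB
fits = total_memory_tb >= needed_tb                             # fits easily
```

The same arithmetic against a single box capped at 2 TB of RAM shows why very large memories favor shared-nothing: you scale the memory pool by adding nodes rather than buying a bigger box.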

Page 25

Columnar Databases

ID | Name | Salary
1 | Marge Inovera | $50,000
2 | Anita Bath | $120,000
3 | Nadia Geddit | $36,000

Column blocks: [Marge Inovera, Anita Bath, Nadia Geddit], [$50,000, $120,000, $36,000], [1, 2, 3]

In a row-store model these three rows would be stored in sequential order as shown here, packed into a block.

In a column-store database they would be divided by columns and stored in different blocks.

Not just changing the storage layout. Also involves changes to the execution engine and query optimizer.
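The two layouts can be sketched schematically in plain Python (illustrative only; real engines pack values into fixed-size disk blocks):

```python
rows = [
    (1, "Marge Inovera", 50_000),
    (2, "Anita Bath", 120_000),
    (3, "Nadia Geddit", 36_000),
]

# Row store: each entry holds a complete row, packed in sequence.
row_blocks = [list(r) for r in rows]

# Column store: each block holds one column's values for many rows.
col_blocks = {
    "id":     [r[0] for r in rows],
    "name":   [r[1] for r in rows],
    "salary": [r[2] for r in rows],
}

# A query touching only salary reads one column block instead of every
# row; same values for the same keys within a column also compress better.
avg_salary = sum(col_blocks["salary"]) / len(col_blocks["salary"])
```

Reassembling a full row from a column store requires fetching from every column block, which is why small random retrievals favor the row layout.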

Page 26

Column Stores Rule the TPC‐H Benchmark

Page 27

Columnar Advantages and Disadvantages

+ Reduced I/O for queries not reading all columns

+ Better compression characteristics, meaning database size < raw data size (unlike row store) and less I/O

+ Ability to operate on compressed data, improving overall system performance

+ Less manual tuning

‐ Slower inserts and updates (causing ELT and trickle‐feed problems*)

‐ Worse for small retrievals and random I/O

‐ Uses more system memory and CPU

Page 28

Advanced Analytic Methods

Explosion of Analytic Techniques:
▪ Machine learning
▪ Statistics
▪ Numerical methods
▪ Text mining & text analytics
▪ Rules engines & constraint programming
▪ Information theory & IR
▪ Visualization
▪ GIS

Presenter
Presentation Notes
There are many areas of research in the analytics world. These are some of the primary categories we’re interested in. Too much to cover here, and many different types of infrastructure, so we’ll look at the more common usage, focusing on algorithms and data mining. For commercial / organizational purposes, we need a better grouping than the academic branches of research or the seemingly arbitrary organization of techniques. In all cases we want to do one thing: create a model that will fit data that either hasn’t been seen before, or hasn’t been used for the desired purpose, or is a better model than what was used previously. The techniques available in any one category may be used for similar purposes, or similar data. In many cases the techniques are specific to certain kinds of data, or have performance characteristics that make them more suitable for one activity or another. There is no single answer to what should be used in a given situation. “Best” depends on the data, scale, problem, cost, and value of the solution.
Page 29

Map-Reduce is a parallel programming framework that makes it easier to write code that runs across a distributed computing environment; it is not a database.

So how do I query the database?

It’s not a database, it’s a key-value store!

OK, it’s not a database. How do I query it?

You write a distributed map-reduce function in Erlang.

Did you just tell me to go to hell?

I believe I did, Bob.

Presenter
Presentation Notes
Cartoon: fault-tolerance
Page 30

What’s Different

No database

No schema

No metadata

No query language*

Good for:▪ Processing lots of complex or non‐relational data

▪ Batch processing for very large amounts of data

* Hive, Hbase, Pig, others
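The map/shuffle/reduce pattern itself can be sketched in a few lines of single-process Python (illustrative only; a real framework distributes each phase across nodes and handles failures):

```python
from collections import defaultdict

def map_phase(doc):
    # Emit (key, value) pairs independently for each input record.
    for word in doc.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group values by key; in a cluster this is the network/sort step.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Combine each key's values into a final result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big analytics", "big deal"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))  # word counts across all docs
```

The appeal is that map and reduce are trivially parallel; the framework supplies the hard parts (distribution, shuffling, fault tolerance) that the cartoon on the previous slide jokes about.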

Presenter
Presentation Notes
Image: automat purple2.jpg - http://flickr.com/photos/alaina/288199169/
Page 31

Using MapReduce / Hadoop


Hadoop is one implementation of MapReduce. There are different variations with different performance and resource characteristics, e.g. Dryad, CGL-MR, and MPI variants.

Hadoop is only part of the solution. You need more for enterprise deployment. Cloudera’s distribution for Hadoop shows what a complete environment could look like.

Image: Cloudera

Presenter
Presentation Notes
This is why people pay Cloudera. To do this on your own is very expensive and only makes sense if you absolutely need it and it’s vital to your business, or if you have a specific one-off problem of high enough value.
Page 32

How Hadoop fits into a traditional BI environment

Diagram: source environments (databases, documents, flat files, XML, queues, ERP, applications) feed the data warehouse via file loads and ETL; developers, analysts, and end users consume it through development tools and IDEs, analysis tools, and BI applications.

Page 33

Data stores that augment or replace relational access and storage models with other methods.

Different storage models:
• Key-value stores
• Column families
• Object / document stores
• Graphs

Different access models:
• SQL (rarely)
• Programming API
• get/put
Reality: they mostly suck for BI & analytics.

Analytic DB vendors are coming from the other direction:
• Aster Data – SQL wrapped around MR
• EMC (Greenplum) – MR on top of the database

NoSQL theoretically = “not only sql”, in reality…
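Why get/put access mostly sucks for BI can be shown in a few lines (an in-process dict stands in for a key-value store; the order records are invented): any aggregate turns into a full client-side scan, work a SQL engine would do next to the data.

```python
# A key-value store exposes only get/put on opaque values.
kv = {}

def put(key, value):
    kv[key] = value

def get(key):
    return kv.get(key)

put("order:1", {"region": "west", "amount": 120})
put("order:2", {"region": "east", "amount": 80})
put("order:3", {"region": "west", "amount": 50})

# The moral equivalent of SELECT SUM(amount) WHERE region = 'west',
# but done by hand: every value is pulled to the client and scanned.
west_total = sum(v["amount"] for v in kv.values()
                 if v["region"] == "west")
```

At three records this is harmless; at a billion, the scan-and-filter work the database would have done server-side all lands on the client and the network.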


Page 34

Some realities to consider

Cheap performance?
▪ Do you have 20 blades lying around unused?

▪ How much concurrency?

▪ How much effort to write queries? Debug them?

▪ Performance comparisons: 10x slower on the same hardware?

The key is the workload type and its scale.


Page 35

Do you really need a rack of blades for computing?

Graphics co‐processors have been used for certain problems for years.

They offer a single-system solution to offload very large compute-intensive problems.

With current technology: an order-of-magnitude cost reduction and an order-of-magnitude performance increase (for compute-intensive problems).

We’ve barely started with this.

Page 36

Other Options for analytic software deployment

The basic models.

1. Separate tools and systems (MapReduce and NoSQL are a simple variation on this theme)

2. Integrated with a database

3. Embedded in a database

The primary arguments about deployment models center on whether to take data to the code or code to the data.


Presenter
Presentation Notes
Wanted to review that to provide the full context, but we’re going to focus primarily on the deployment side of the problem. Hardware deployment: linked to hw, addition of service, challenges of data movement for hosted Image: open_air_market_bologna - http://flickr.com/photos/pattchi/181259150/
Page 37

Leveraging the Database

Levels of database integration:
▪ Native DB connector
▪ External integration
▪ Internal integration
▪ Embedded

+ Less data movement
+ Possible dev process support
+ Hardware / environment savings
+ Possible “sandboxing” support
- Limitations on techniques

37

Presenter
Presentation Notes
Level 0: none (export, import), basically the separate servers model. Level 1: native connector that takes advantage of database features, e.g. cube views, SQL extensions, partition awareness, etc. Level 2: goes beyond connector, query gen from tool functions (e.g. turn fn into SQL, like a regression into windowed sql), sending output via updates / inserts to db directly (abstracts problem like an etl tool does so don’t need to worry about the db) Level 3: code executes resident with database, variations of UDFs Level 4: Embedded in the database itself, functionality is native (e.g. not UDF-based, although it’s a subtle distinction and meaningless in some cases)
Page 38

In‐database Execution

You can do a lot with standards‐compliant SQL

If the database has UDFs, you can code too (but it’s harder)

Parallel support for UDFs varies

Some vendors build functions directly into the database (usually scalar)

Iterative algorithms (ones that converge on a solution) are problematic, more so in MPP
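A minimal sketch of the scalar-UDF model, using Python's built-in sqlite3 as a stand-in for a vendor's UDF facility (the churn_score formula and the customers table are invented for illustration): the scoring function runs inside the SQL engine, next to the data, instead of exporting rows to a separate tool.

```python
import sqlite3

def churn_score(tenure_months, support_calls):
    # Toy scalar scoring model: short tenure and many support calls
    # both push the score toward 1.0.
    return round(min(1.0, support_calls / 10 + 1 / (1 + tenure_months)), 2)

conn = sqlite3.connect(":memory:")
# Register the Python function as a 2-argument SQL function.
conn.create_function("churn_score", 2, churn_score)

conn.execute("CREATE TABLE customers (id INT, tenure INT, calls INT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, 24, 1), (2, 2, 8)])

# The score is computed in-database, invoked from ordinary SQL.
scored = conn.execute(
    "SELECT id, churn_score(tenure, calls) FROM customers ORDER BY 2 DESC"
).fetchall()
```

Scalar functions like this parallelize easily because each row is independent; it is the iterative, state-carrying algorithms mentioned above that resist this model, especially on MPP systems.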


Presenter
Presentation Notes
Done in SQL (you can do a lot) DIY UDF and UDT coding (you can do a lot, but it’s harder), Basic fns supported by DB, usually UDF model (vendor, oem, e.g. sas, most working with fuzzy) Parallel support for UDF varies, scalar fns not hard, table fns or look forward/back harder, incredibly difficult for parallel tricks & stashing of intermediate results, why leave to fuzzy Some vendors don’t allow you to write them, others make it very difficult (e.g. sybase, ibm) Fns built into database directly What works today? E.g. iterative problem solving challenges 3rd party tools built in for AA, e.g. SAS but limited to scoring, Fuzzy, way more
Page 39

What are the factors in the decision?

User concurrency: one job or many.
Repetition is a key element:
▪ Execute once and apply (build a response or mortality model)
▪ Many executions daily (web cross-sells)

In-process or batch?
▪ Batch and use results – segment, score
▪ In-process reacts on demand – detect fraud, recommend

In‐process requires thinking about how it integrates with the calling application. (SQL sometimes not your friend)


Presenter
Presentation Notes
Many in-process models still have a batch construction requirement.
Page 40

MATCHING THE PROBLEMS TO TECHNOLOGIES

Page 41

The problem of size is three problems of volume.

Number of users!

Computations!

Amount of data!

Page 42

H

Presenter
Presentation Notes
Little Data, Lots of Data, So What? Hydrogen. There’s lots of it around. Simple atom, lightest one we’ve got. Nothing to see here.
Page 43

Lots of H

“More” can become a qualitative rather than quantitative difference

Presenter
Presentation Notes
Enough hydrogen and something different happens.
Page 44

Really lots of H

“Databases are dead!” – famous last words

Presenter
Presentation Notes
Eventually, enough of anything is substantively different. And so it is with data, e.g. Google search, translation. The web guys with their really lots of data all followed Google with their prediction that databases are dead. You just can’t scale them. Maybe not for Google-sized search indexing, but then we’re not search indexing, are we? In fact, Google runs Oracle for internal data warehousing on several hundred terabytes of data. Databases are dead, so say the new kids on the block with their cloud storage and their real-time XML message feeds. You can’t scale a relational database. But you can. Eventually enough of something becomes something different. Enough fast parallel hardware and proper design mean you can scale a database, whether it’s a petabyte Oracle instance or a cloud-deployed columnar database. http://www.flickr.com/photos/badastronomy/3176565627/
Page 45

Hardware Architectures and Deployment

Compute and data sizes are the key requirements


Chart: computations (MF, GF, TF, PF) vs. data volume (<10s GB to PB). Regions from small to large: PC; shared everything or shared disk; shared nothing; MR and related.

Page 46

Hardware Architectures and Deployment


Chart: computations (MF, GF, TF, PF) vs. data volume (<10s GB to PB).

Today’s reality, and true for a while in most businesses.

The bulk of the market resides here!

Presenter
Presentation Notes
Your data volume is likely not as big as you think it is. 85% of the BI market < 5 TB, bulk < 1 TB. Yes it will grow, but not as fast as people think. More likely to grow in the processing space than the data, simply because most companies are still well below the 10TB level and even with many additions of data, they aren’t close to it yet. The market reality is that large corporations and internet-based companies are the only ones really pushing the data volume space (in the business world).
Page 47

Hardware Architectures and Deployment


Chart: computations (MF, GF, TF, PF) vs. data volume (<10s GB to PB).

Today’s reality, and true for a while in most businesses.

The bulk of the market resides here!

…but analytics pushes many things into the MPP zone.

Presenter
Presentation Notes
The processing space is the focus. Even with a terabyte or two of data, the performance of traditional databases is rarely up to the task of large-scale data mining workloads like looking at all product affinities in a mid-size retailer on more than a semi-annual basis. These analytic workloads call for more processing power as well as higher I/O throughput, and often create the need for MPP solutions.
Page 48

The real question: why do you want a new platform?

Trouble doing what you already do today
▪ Poor response times
▪ Not meeting availability deadlines

Doing more of what you do today
▪ Adding users, mining more data

Doing something new with your data
▪ Data mining, recommendations, embedded real-time process support

What’s desired is possible but limited by the cost of supporting or growing the existing environment.


Page 49

The World According to Gartner: One Magical Quadrant 

SQL Server 2008 R2 (PDW): official production customers?

EMC / Greenplum: SQL limitations; memory / concurrency issues

Ingres: OLTP database

Illuminate: SQL limitations; very limited scalability

Sun: MySQL for a DW, is this a joke?


Magic Quadrant for Data Warehouse Database Management Systems

Page 50

The assumption of the warehouse as a database is gone


Diagram: technology classes by data type and state.
Traditional tabular or structured data: databases (data at rest), streaming DBs/engines (data in motion).
Non-traditional data (logs, audio, documents): parallel programming platforms (data at rest), message streams (data in motion).


Presenter
Presentation Notes
Any architecture now will have multiple repositories for data, multiple technologies to cope with the different needs. The primary technology classes line up like this. For most BI programs, the low hanging fruit has been picked. The BI market is changing and BI programs, skills and architectures need to change with it. That means learning about the storage and processing technologies and architectures, and how they can be put together.
Page 51

Data Access Differences

Basic data access styles:
▪ Standard BI and reporting
▪ Dashboards / scorecards
▪ Operational BI
▪ Ad-hoc query and analysis
▪ Batch analytics
▪ Embedded analytics

Data loading styles:
▪ Refresh
▪ Incremental
▪ Constant

Presenter
Presentation Notes
Throughput and response time requirements aren’t enough. You need to understand the workload so you can choose the right technology. Not all technologies are applicable for the same workload. Characteristics of each: (1) moderate to large data access but aggregated/summarized, repetitive query patterns, many different queries, periods of high concurrency, often over the same data; (2) high repetition, highly summarized, small number of different queries, possibly many large queries for comparison (prior year, moving avg, etc.); (3) less repetition, sequences of activity from large data access to highly selective, exploratory, different data sets; (4) large data access with full detail retrieved, or large data access processed in the database, intermediate result storage in large volume; (5) high concurrency, small repetitive queries but not usually over the same data, small retrieval, minimal aggregation, fixed query patterns.
Page 52

Evaluating ADB Options

Storage style:
▪ Files, tables, columns, cubes, KV

Storage type:
▪ Memory, disk, hybrid, compressed

Scaling model:
▪ SMP, clustered, MPP, distributed

Deployment model:
▪ Appliance, cloud, SaaS, on-premise

Data access model:
▪ SQL, MapReduce, R, languages, etc.

License options:
▪ CPU, data size, subscription


Presenter
Presentation Notes
Image: bored_girl.jpg - http://www.flickr.com/photos/alejandrosandoval/280691168/
Page 53

What’s it going to cost? A small sample at list:

| Solution | Pricing model | Price/unit | 1 TB solution | Remarks |
| --- | --- | --- | --- | --- |
| Dataupia | Node | $19,500/2TB | $19,500 | You can’t buy a 1 TB Satori server |
| Kickfire (out of business) | Data volume (raw) | $50,000/TB | $50,000 | Includes MySQL 5.1 Enterprise |
| Vertica | Data volume (raw) | $100,000/TB | $200,000 | Based on 5 nodes, $20,000 each |
| ParAccel | Data volume (raw) | $100,000/TB | $200,000 | Based on 5 nodes, $20,000 each |
| EXASOL | Data volume (active) | $1,350/GB (€1,000/GB) | $350,000* | Based on 4 nodes, $20,000 each |
| Teradata | Node | $99,000/TB | $99,000** | Based on 2550 base configuration |

* 1 TB raw ≈ 200 GB active; ** realistic configuration likely 2x this price
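As a rough back-of-the-envelope check on the table above, the list prices can be normalized to an effective $/TB of raw data. The figures come from the table; the normalization itself is illustrative arithmetic, not vendor-published pricing:

```python
# Sketch: normalize the list prices in the table above to $/TB of raw data.
# Prices are the "1 TB solution" figures; EXASOL is omitted because it is
# priced on active data, which isn't directly comparable.
solutions = {
    "Dataupia": (19_500, 2.0),   # (list price in $, raw TB covered)
    "Kickfire": (50_000, 1.0),
    "Vertica":  (200_000, 1.0),
    "ParAccel": (200_000, 1.0),
    "Teradata": (99_000, 1.0),
}

def cost_per_tb(price, tb):
    """Effective cost per raw terabyte."""
    return price / tb

# Print cheapest to most expensive per raw TB.
for name in sorted(solutions, key=lambda n: cost_per_tb(*solutions[n])):
    price, tb = solutions[name]
    print(f"{name:10s} ${cost_per_tb(price, tb):>10,.0f}/TB")
```

The spread is the point: on a per-raw-TB basis the same "1 TB" problem varies by an order of magnitude depending on pricing model.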

Page 54: Determine the Right Analytic Database: A Survey of New Data Technologies

Factors and Tradeoffs

The core tradeoff is not always money for performance.

What else do you trade?

• Load time
• Trickle feeds
• New ETL tools
• New BI tools
• Operational complexity:
  • Data integration and management
  • Backups
  • Hardware maintenance


Page 55: Determine the Right Analytic Database: A Survey of New Data Technologies

The Path to Performance

1. Laborware – tuning

2. Upgrade – try to solve the problem without changing out the database

3. Extend – add an ADB or Hadoop cluster to the environment to offload a specific workload

4. Replace – out with the old, in with the new


Presenter
Presentation Notes
Everyone with database performance problems goes through the same stages:
1. Tune the database and the operating system
2. Throw money at the problem (buy more hardware)
3. Redesign the database and queries
4. Tune some more
You may be ready for a change if you’ve done all of these.
path_vecchia.jpg ‐ http://www.flickr.com/photos/funadium/2320388358/
Page 56: Determine the Right Analytic Database: A Survey of New Data Technologies

One Word: PoC!

Presenter
Presentation Notes
Image: fast kids truck peru.jpg - http://flickr.com/photos/zerega/1029076197/
Page 57: Determine the Right Analytic Database: A Survey of New Data Technologies

The Future

Assuming the database market embraces MPP, you have compute power that exceeds what the database itself needs.

Why not execute the code at the data?

Even without MPP, moving to in‐database analytic processing is a future direction and is workable for a large number of people.
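A small sketch of the idea of executing code at the data, using SQLite as a stand-in for an analytic database (the table and figures are invented for illustration):

```python
# Sketch: "execute the code at the data" -- push an aggregation into the
# database instead of pulling every detail row to the client.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("west", 100.0), ("west", 50.0), ("east", 75.0)])

# In-database: one small result row per group crosses the wire.
in_db = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

# Client-side: every detail row is shipped out, then summed locally.
client = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    client[region] = client.get(region, 0.0) + amount

assert in_db == client  # same answer; very different data movement
```

On an MPP system the same principle applies with in-database analytic functions or UDFs: the computation runs on the nodes that hold the data, and only results move.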


Presenter
Presentation Notes
The system can easily be overbalanced (e.g. by adding GPUs or lots of local memory).
Page 58: Determine the Right Analytic Database: A Survey of New Data Technologies

Thank you!

Page 59: Determine the Right Analytic Database: A Survey of New Data Technologies

Image Attributions

Thanks to the people who supplied the images used in this presentation:

Atomic Avenue #1 by Glen Orbik ‐ http://www.orbikart.com/gallery/displayimage.php?album=4&pos=5
spices.jpg ‐ http://flickr.com/photos/oberazzi/387992959/
Black hole galaxy ‐ http://www.flickr.com/photos/badastronomy/3176565627/
weaver peru.jpg ‐ http://flickr.com/photos/slack12/442373910/
rc toy truck.jpg ‐ http://flickr.com/photos/texas_hillsurfer/2683650363/
automat purple2.jpg ‐ http://flickr.com/photos/alaina/288199169/
open_air_market_bologna ‐ http://flickr.com/photos/pattchi/181259150/
bored_girl.jpg ‐ http://www.flickr.com/photos/alejandrosandoval/280691168/
path_vecchia.jpg ‐ http://www.flickr.com/photos/funadium/2320388358/
fast kids truck peru.jpg ‐ http://flickr.com/photos/zerega/1029076197/

Page 60: Determine the Right Analytic Database: A Survey of New Data Technologies

What’s best for which types of problems?*


Shared nothing will be best for solving large data problems, regardless of workload or concurrency.

Column‐stores will improve query response time problems for most traditional query and aggregation workloads.

Row‐stores will be better for operational BI or embedded BI.

Fast storage always makes things better, but is only cost‐effective for medium scale or smaller data.

Compression will help everyone, but column‐stores more than row‐stores because of how the engines work.
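A minimal sketch of why that claim holds, using run-length encoding as a stand-in for the engine-specific compression schemes real column stores use:

```python
# Sketch: run-length encoding on a column. Column stores keep one
# attribute's values together (often sorted), so runs are long and RLE
# is very effective; row stores interleave attributes, so the same
# values rarely sit next to each other.
def rle_encode(column):
    """Compress a sequence into [value, run_length] pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

# A sorted "state" column: 9 values collapse to 3 runs.
column = ["CA"] * 4 + ["NY"] * 3 + ["TX"] * 2
print(rle_encode(column))  # [['CA', 4], ['NY', 3], ['TX', 2]]
```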

Map‐Reduce and distributed filesystems offer the advantages of a schema‐less storage and analytic layer that can process data before loading results into relational databases.
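A toy illustration of that pattern in plain Python (the log lines are invented; Hadoop would distribute the same map/shuffle/reduce steps across a cluster):

```python
# Sketch: map/shuffle/reduce over schema-less text, producing a
# relational-shaped result that could be loaded into a warehouse table.
from collections import defaultdict

raw_lines = [
    "2011-01-31 GET /index.html 200",
    "2011-01-31 GET /report 500",
    "2011-02-01 GET /index.html 200",
]

# Map: parse unstructured input into (key, value) pairs.
mapped = []
for line in raw_lines:
    date, _method, _path, _status = line.split()
    mapped.append((date, 1))

# Shuffle: group values by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: aggregate each group into one output row.
counts = {date: sum(values) for date, values in grouped.items()}
print(counts)  # {'2011-01-31': 2, '2011-02-01': 1}
```

No schema was declared anywhere; structure is imposed at read time, which is exactly what makes this layer a good front end to a relational store.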

SMP and in‐memory will be better for high complexity problems under moderate data scale, shared‐nothing and MR for large data scale.

*The answer is always “it depends”

Page 61: Determine the Right Analytic Database: A Survey of New Data Technologies

About the Presenter

Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, analytics and performance management. Mark is an award-winning author, architect and former CTO whose work has been featured in numerous industry publications. During his career Mark received awards from the American Productivity & Quality Center, TDWI, Computerworld and the Smithsonian Institution. He is an international speaker, contributing editor at Intelligent Enterprise, and manages the open source channel at the Business Intelligence Network. For more information or to contact Mark, visit http://ThirdNature.net.

Page 62: Determine the Right Analytic Database: A Survey of New Data Technologies

About Third Nature

Third Nature is a research and consulting firm focused on new and emerging technology and practices in business intelligence, data integration and information management. If your question is related to BI, open source, web 2.0 or data integration, then you're in the right place.

Our goal is to help companies take advantage of information-driven management practices and applications. We offer education, consulting and research services to support business and IT organizations as well as technology vendors.

We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating the products rather than vendor market positions.