© Prof. Dr. -Ing. Wolfgang Lehner
Modern Analytical Database Technology
Wolfgang Lehner
Aalborg
Oct-28, 2014
| 2
Data, data, everywhere…The situation today
Unstructured, coming from sources that haven't been mined before
Compounded by internet, social media, cloud computing, mobile devices, digital images…
Exponential: every 2 days we create as much data as from the dawn of civilisation to 2003*
Hard to keep up: communication operators managing petabyte scale expect 10-100x data growth in the next 5 years**
| 3
Smart Everything
Smart Everything: smart "things", smart places, smart networks, smart services, smart solutions
"Smart-*" infrastructure
Physical and digital worlds collide!
Need to make things smart! Requirements for "Smart Everything": interactive ("tangible"), i.e. low latency; high volume, i.e. high throughput
| 4
Big Data Analytics…
… this is soooo 2012!
| 5
…from smart phone to smart lenses
http://ngm.nationalgeographic.com
novel Big Data Analytics apps with ms-response time incorporating local context as well as global state
Your personal coupon arrived!
Buy x get y free
| 6
Example: Via Della Conciliazione
Source: http://www.spiegel.de/panorama/bild-889031-473266.html
| 7
Example: Via Della Conciliazione
| 8
..beyond traditional applications
Shopping Application
Product Recommendations
Record transactions, weblogs
Refine Recommendations
Optimize the application
Mining of user transactions and recommendation history
User comments on e-retail site
Inventory, user transactions, other data sources, weblog data
Identify buying patterns, user likes/dislikes
| 9
Current State and Overall Question
Observation
„Things" are generating lots of data
Big Data Analytics: FIND THE NEEDLE IN THE HAYSTACK
+ You don't know if there is a needle at all
+ The needle may turn out to be a nail
Infrastructure
Massive computing power in cloud/cluster environments
Huge variety of „mobile/distributed" devices
Significant computing power in "mobile" devices
Massive memory capacity: "disk is tape", (NV)RAM is king
Significant communication capabilities
Question
What are (some of) the core challengesand opportunities for database management?
| 10
…but the world is radically changing
Main driver – main memory-centric data management
Algorithms as well as data-structures are optimized according to underlying infrastructure
Data Crunching versus Number Crunching
In the past, number crunching ruled HPC: LINPACK benchmark, FLOPS (floating point operations per second), TOP500, http://www.top500.org
Data crunching catches up: kernel of graph algorithms, TEPS (traversed edges per second), Graph 500, http://www.graph500.org
Chart: complexity vs. data volume, covering Reporting, OLAP, Data Mining (classification, association rules, …), Forecasting, Data Imputation, Simulation
11
> A Look at Hardware Trends
i7-2600: 6 HW cores; Xeon E7: 20 HW threads; Intel Phi
Increasing number of cores
Increasing main memory capacity*
CPU/GPU hybrids, FPGA (Field Programmable Gate Array)
„Parallelism" is the name of the game!
„Main Memory" is the new disk!(?)
* stable RAM will be an additional game changer
12
> Impact on Database Systems
…a plea for specialized DB systems
Implications for the Elephants
They are selling "one size fits all"
Which is 30-year-old legacy technology that is good at nothing
13
> Impact on Database Systems
How to architecturally define systems satisfying both requirements?
M. Stonebraker
Extreme data Extreme performance
Dynamo
"Three things are important in the database world: performance, performance and performance." (Bruce Lindsey)
| 14
Challenges for Database Systems
Extreme Data
Extreme Performance
Extreme Flexibility
Flexibility in Database Systems
+ during deployment time (schema definition)
- during database lifetime (schema evolution)
- during query runtime (scheduling, …)
- resource consumption (elasticity, …)
| 15
Flexibility from 10,000 feet
applications
role-based object models
querying Web-Tables
data comes first, schema comes second
Demand flexibility
Open Data platforms
Database System
| 16
Flexibility from 10,000 feet
operating system & hardware
MegaCore systems, TeraByte capacity, FPGAs/FPPAs, GPUs, Non-Volatile Memory
Provide flexibility
Database System
| 17
In a nutshell
…shift from
disk-centric database architectures
to
main-memory centric architectures
a) Traditional architecture: persistent storage holds the database (primary data, index structures, materialized views, dictionaries, …) and the log; transient main memory holds the buffer pool, the log buffer, and runtime data.
b) Main-memory-centric architecture: the database (primary data, index structures, materialized views, dictionaries, …) lives in transient main memory together with the log buffer and runtime data; persistent storage keeps only the log.
18
>
Part 1: Pros/cons of main memory-centric
Part 2: Pros/cons of column orientation
| 19
Speed in Relation...
| 20
BUT – there is no free lunch
Memory Wall
There is an increasing gap between CPU and memory speeds. CPUs spend much of their time waiting for memory.
DRAM characteristics
Dynamic RAM is comparably slow. Memory needs to be refreshed periodically (ca. every 64 ms). (Dis-)charging a capacitor takes time.
21
> Role of Caches
Main-memory access has become a performance bottleneck for many computer applications: bandwidth, latency, address translation (TLB)
Cache memories can reduce the memory latency when the requested data is found in the cache
Some systems also use a 3rd level cache.
Caches resemble the buffer manager but are controlled by hardware
| 22
The Role of Caches
Caches – the sunny side
Memory is physically accessed at cache-line granularity, e.g., 64 bytes; sequential memory access therefore makes full use of every line transferred.
23
> Memory Performance Comparison
Motivation for CPU-cache aware data structures
Is memory the new disk?
• Some characteristics are very similar, e.g., random vs. sequential access
• Memory architecture complicates things!
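The "memory is the new disk" comparison can be made concrete with back-of-envelope arithmetic: random access is latency-bound, sequential access is bandwidth-bound. The figures below are assumed, typical 2014-era server numbers, not measurements from the slides.

```python
# Assumed hardware figures (illustrative, not from the slides):
LATENCY_NS = 100        # random DRAM access latency
BANDWIDTH_GBS = 10      # sustained sequential memory bandwidth
CACHE_LINE = 64         # bytes per memory transfer

n = 100_000_000         # touch 100M cache lines = 6.4 GB of data

# Random access: every cache line pays the full access latency.
random_s = n * LATENCY_NS * 1e-9
# Sequential access: the prefetcher streams lines at full bandwidth.
sequential_s = n * CACHE_LINE / (BANDWIDTH_GBS * 1e9)

print(f"random: {random_s:.1f} s, sequential: {sequential_s:.2f} s")
```

Under these assumptions the same data volume takes about 10 s with random access but well under 1 s sequentially, which mirrors the classic disk seek-vs-scan gap.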
| 24
UMA vs. NUMA
Uniform Memory Access (UMA) Non-Uniform Memory Access (NUMA)
| 25
NUMA Architecture
Different NUMA Systems
Here: AMD
| 26
Low-Level Measurements
AMD 8 Sockets
| 27
NUMA Architecture
Different NUMA Systems
SGI UV 2000
64 Sockets 512 Cores
| 28
Low-Level Measurements
SGI 64 Sockets
| 29
Scalability Evaluation
SGI UV 2000
512 Cores
64 Sockets
8TB RAM
30
>
Part 1: Pros/cons of main memory-centric
Part 2: Pros/cons of column orientation
| 31
From DSM to Column-Stores
1985: DSM (Decomposition Storage Model)
Proposed as an alternative to NSM (N-ary Storage Model)
Decomposes relations vertically into binary (ID, value) relations
2 indexes: clustered on ID, non-clustered on value
Speeds up queries projecting few columns
Disadvantages: storage overhead for storing tuple IDs, expensive tuple reconstruction costs
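A minimal sketch of the DSM idea described above: decompose an n-ary relation into one (surrogate ID, value) relation per attribute, then reconstruct tuples by joining on the surrogate ID. Function names are illustrative; the duplicated tuple IDs and the reconstruction loop show exactly where DSM's two disadvantages come from.

```python
def decompose(relation, attributes):
    """NSM tuples -> one binary (tuple_id, value) relation per attribute (DSM).
    Note the storage overhead: every attribute relation repeats the tuple IDs."""
    return {
        attr: [(tid, row[i]) for tid, row in enumerate(relation)]
        for i, attr in enumerate(attributes)
    }

def reconstruct(dsm, attributes):
    """Tuple reconstruction: re-join the binary relations on the surrogate ID.
    This per-tuple stitching is the 'expensive reconstruction' cost."""
    n = len(dsm[attributes[0]])
    return [tuple(dsm[attr][tid][1] for attr in attributes) for tid in range(n)]

emp = [("Ann", 3000), ("Bob", 2500)]
dsm = decompose(emp, ["name", "salary"])
# A query projecting only 'salary' now scans just dsm["salary"].
assert dsm["salary"] == [(0, 3000), (1, 2500)]
assert reconstruct(dsm, ["name", "salary"]) == emp
```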
Database System Architecture
| 32
From DSM to Column-Stores
Late 90s – 2000s: Focus on main-memory performance
MonetDB
PAX (Partition Attributes Across): retains NSM I/O pattern, optimizes cache-to-RAM communication
2005: the (re)birth of column-stores
New hardware and application realities: faster CPUs, larger memories, disk bandwidth; multi-terabyte data warehouses
New approach: combine several techniques (read-optimized storage, fast multi-column access, disk/CPU efficiency, light-weight compression)
Used in read-oriented environments (OLAP)
Some column store systems
MonetDB, C-Store, Sybase IQ, SAP HANA, Infobright, Exasol, X100/VectorWise, …
| 33
Row-Storage vs Column-Storage
Row storage:
• easy to add/modify a record
• logical entity (row) corresponds to a single physical memory block
• single log entry for before image (BI) as well as after image (AI)
• might read unnecessary data
• addressing via "TID", the tuple identifier (segment + type + block + index)

Column storage:
• only need to read in relevant data; useful for wide tables and selective reads
• tuple writes require multiple accesses: split tuples into different column chunks, add to different data structures, perform multiple log entries
• alternative addressing methods: via RID as well as via positional addressing
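The "only read relevant data" advantage can be quantified with cache-line arithmetic. Assuming 128-byte tuples and one 8-byte attribute of interest (illustrative numbers, not from the slides), the sketch counts how many 64-byte cache lines a full-column scan must transfer in each layout.

```python
CACHE_LINE = 64  # bytes, as on most x86 CPUs

def lines_touched(num_tuples, tuple_bytes, attr_bytes, columnar):
    """Cache lines transferred when scanning one attribute of every tuple."""
    if columnar:
        # Column layout: the attribute's values are densely packed,
        # so only num_tuples * attr_bytes bytes move through the cache.
        total = num_tuples * attr_bytes
    else:
        # Row layout: a sequential scan streams every tuple in full,
        # dragging all other attributes along with the one we need.
        total = num_tuples * tuple_bytes
    return -(-total // CACHE_LINE)  # ceiling division

rows = lines_touched(1_000_000, tuple_bytes=128, attr_bytes=8, columnar=False)
cols = lines_touched(1_000_000, tuple_bytes=128, attr_bytes=8, columnar=True)
print(rows, cols, rows // cols)  # row layout moves 16x more data here
```

With these numbers the row scan touches 2,000,000 lines against 125,000 for the column scan, a factor of tuple_bytes / attr_bytes = 16.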
| 34
Dictionary Compression
Basic Idea
Dictionary as indirection step to map application values (integers, floats, strings, …) to internal „tokens“ (valueIDs)
The resulting valueID sequence can then potentially be compressed further using RLE, etc.
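A minimal sketch of the two steps just described: map application values to integer valueIDs through a sorted dictionary, then run-length-encode the valueID sequence. Helper names are illustrative; a real column store would additionally bit-pack the IDs.

```python
def dict_encode(values):
    """Dictionary compression: application values -> internal valueIDs."""
    dictionary = sorted(set(values))              # sorted dictionary
    code = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [code[v] for v in values]

def rle(ids):
    """Run-length encoding of the valueID sequence: (valueID, run_length)."""
    runs = []
    for v in ids:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1                      # extend the current run
        else:
            runs.append([v, 1])                   # start a new run
    return [(v, n) for v, n in runs]

col = ["DE", "DE", "DE", "DK", "DK", "DE"]
dictionary, ids = dict_encode(col)
assert dictionary == ["DE", "DK"]
assert ids == [0, 0, 0, 1, 1, 0]
assert rle(ids) == [(0, 3), (1, 2), (0, 1)]
```

The dictionary indirection also lets predicates run on the small integer IDs instead of the original strings.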
| 35
Hybrid Storage Architecture
Use of compression implies two stores
Write-optimized store (WOS): delta store
Read-optimized (compressed) store (ROS): main store
update/insert/delete
REDOlog
savepoint data area
common unified table access methods
Merge/Tuple mover
Delta store: dictionary-compressed, unsorted dictionary, efficient B-tree structures
Main store: compression schemes according to existing data distribution, sorted dictionary, optimized for HW scans
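The delta/main split can be sketched in a few lines: inserts append cheaply to an uncompressed delta, a periodic merge rebuilds the main store with a sorted dictionary, and a unified scan reads both. Class and method names are illustrative, not the API of any particular system.

```python
class HybridColumn:
    def __init__(self):
        self.main_dict = []   # sorted dictionary of the read-optimized main store
        self.main_ids = []    # dictionary-compressed valueIDs
        self.delta = []       # write-optimized store: plain append buffer

    def insert(self, value):
        self.delta.append(value)          # cheap: no re-encoding of the main store

    def merge(self):
        """Tuple mover: fold the delta into a freshly encoded main store."""
        values = [self.main_dict[i] for i in self.main_ids] + self.delta
        self.main_dict = sorted(set(values))
        code = {v: i for i, v in enumerate(self.main_dict)}
        self.main_ids = [code[v] for v in values]
        self.delta = []

    def scan(self):
        """Common unified table access across main and delta."""
        return [self.main_dict[i] for i in self.main_ids] + self.delta

c = HybridColumn()
for v in ["b", "a", "b"]:
    c.insert(v)
c.merge()
assert c.main_dict == ["a", "b"]          # sorted dictionary after merge
c.insert("c")                             # lands in the delta, main untouched
assert c.scan() == ["b", "a", "b", "c"]
```

The design choice mirrors the slide: writes never pay compression cost up front; the merge amortizes it in the background.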
36
>
Trends and Challenges
37
> Trends in Hardware
…different trends: hardware pushes software (!)
Significant and permanent architectural rewrites/adaptations necessary
10 years ago: main memory centric = shift in the storage hierarchy + # of cores
Next big wave: storage-class memory (SCM)
Directly (byte) addressable
Slightly slower than traditional RAM
Persistent
Massive impact on persistency mechanisms (logging etc.) and recovery
Coburn, J.; Caulfield, A.; Akel, A.: NV-Heaps: Making Persistent Objects Fast and Safe with Next-Generation, Non-Volatile Memories
| 38
SCM - Architectural Challenges
CPU
I/O Controller
Memory Controller
SCM
DRAM
SCM
Storage Controller
SCM
DISK
1
2
3
[Source]: Implications of SCM on Software Architecture, C. Mohan, IBM Fellow
39
> NV-RAM-based DB Architecture
No distinction between volatile and non-volatile RAM
System has to ensure physical and logical consistency still some recovery mechanisms required
Recovery and physical design may fall together.
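Why physical consistency becomes the system's problem can be shown with a tiny simulation: with byte-addressable persistent memory, a store only becomes durable once its cache line is flushed, so updates must be explicitly ordered. The model below is illustrative (a crude stand-in for CLFLUSH/SFENCE semantics), not a real PMem API.

```python
class SimPMem:
    """Toy model: stores land in a volatile cache; only flushed lines survive."""
    def __init__(self):
        self.pmem = {}    # durable state
        self.cache = {}   # volatile, dirty state

    def store(self, addr, val):
        self.cache[addr] = val

    def flush(self, addr):
        # analogue of cache-line flush + fence: force the value to durability
        self.pmem[addr] = self.cache[addr]

    def crash(self):
        self.cache = {}   # all unflushed stores are lost

# Unsafe protocol: the 'valid' flag reaches PMem before the data it guards.
bad = SimPMem()
bad.store("data", 42)
bad.store("valid", True); bad.flush("valid")
bad.crash()
# Torn state after recovery: the flag says valid, but the data is gone.
assert bad.pmem == {"valid": True}

# Safe protocol: an explicit persistency barrier orders data before flag.
good = SimPMem()
good.store("data", 42); good.flush("data")
good.store("valid", True); good.flush("valid")
good.crash()
assert good.pmem == {"data": 42, "valid": True}
```

This ordering discipline is exactly what replaces the classic WAL "log before data" rule once the database itself lives in NV-RAM.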
a) Traditional architecture: database and log on persistent storage; buffer pool, log buffer, and runtime data in transient main memory.
b) SCM-enabled architecture: the database (static and dynamic data, index structures, materialized views, dictionaries) resides directly in non-volatile main memory; runtime data remains in transient main memory. The persistency bar moves up the storage hierarchy.
40
> The downside of NV-RAM
Experiments
LEFT: scan performance (SIMD) on DRAM and SCM, with and without prefetching
RIGHT: Skip List read/write performance on DRAM and SCM
| 41
(Another) Hybrid Store?
Recovery performance
Depends on the type of database objects residing in NVRAM: primary data, index data, …
Balance between read/write penalty and recovery time
Experiment
Different recovery schemes. TATP scale factor 500, 4 users. The database is crashed at second 15.
42
> Customizable Processor Model
Today’s Database Systems
Fat cores (area & power), few HW adaptions, CMOS scaling
Database Processors
Processors built from scratch, long development cycles, high development costs
Our Approach
HW/SW codesign, customizable processor, application-specific ISA extensions, tool flow enabling short HW development cycles
Basic Core:Tensilica LX4
43
> HW/SW-DB-CoDesign
… impact on chip design (Example: set intersection)
Chart: throughput [million elements per second] vs. selectivity [in %] for the DBA_1LSU, DBA_1LSU_EIS, and DBA_2LSU_EIS processor variants, each with and without partial loading.
MiniFinal processor:
+1 Load-Store unit
Data bus: 32->128 bit
+ Partial loading
+ Extended ISA
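For reference, the set-intersection kernel used in the example above looks like this in software: a merge-style intersection of two sorted ID lists. This is an illustrative reimplementation; its inner compare-and-advance loop is the part the extra load-store unit, wider data bus, and extended ISA accelerate in hardware.

```python
def intersect_sorted(a, b):
    """Merge-based intersection of two sorted lists of IDs."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])      # match: emit and advance both cursors
            i += 1; j += 1
        elif a[i] < b[j]:
            i += 1                # advance the cursor behind the other
        else:
            j += 1
    return out

assert intersect_sorted([1, 3, 5, 7], [3, 4, 5, 8]) == [3, 5]
```

Throughput of this kernel depends on selectivity (the fraction of matching elements), which is why the chart sweeps selectivity from 0 to 100%.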
44
> Trend: Building Applications
Novel types of applications
„Timeless software", eternal betas, etc.
Agile development, no downtime, light-weight release cycles, etc.
„Apps"-style applications: small specialized applications, specific extensions, etc.
Challenge
How to „talk“ to a data management system?
SQL is far too limited; need sophisticated support for DSLs, need for comprehensive business object description
Extensible parser and execution framework required
Full support for multi-tenancy and lifecycle management
45
> Trend: Towards a Platform
The Data Management Platform
= the traditional DB system
= modern „DB" systems (SQL, NewSQL, NoSQL, …)
- Entity/document/graph-centric data model
- Concurrency models
- Storage representations
- Support for different DSLs/APIs
- Lifecycle management
- Common security
- HA requirements
- Multi-tenancy isolation
46
> Trend: Towards a Platform
Example: SAP HANA Platform
Consume
Process
SAP HANA Platform
ApplicationsAnalytics
Landscape management
HANA In-Memory
Modeling & lifecycle management
Hadoop
Roles, security, governance, compliance, audits
Replication Framework
Data Services
Transactional, Planning & Simulation
Graph Analytical
Machine Learning& Predictive
Native HANA Apps & Services
Spatial
ESP IM
Extended Storage (IQ)
Tiered Storage (Hot-warm-cold)
Smart Data Access
Text, Social Media Processing
Exploration, Dashboards, Reports, Charting, Visualization
| 47
Trend: Tight Integration with HCI
| 48
Summary
Data Management – The Big Picture
Gains significant importance and relevance in research as well as in industry
"Big Data" is a core driver for this research field
Main Memory Centric Data Management
React to the changing environment in hardware and software
…a complete architectural rewrite WAS necessary!
…and a 2nd wave is just in front of us!
Outlook
Hardware is pushing and enabling novel architectural design
…complete/partial architectural rewrite IS and WILL BE necessary!
Data management research positioned as an umbrella for many disciplines (core database technology, compiler construction, chip design, communication, efficient algorithm design, …)