Post on 07-Jan-2017
gluent.com 1
In-Memory Execution for Databases
Tanel Poder
a long time computer performance geek
Intro: About me
• Tanel Põder
• Oracle Database Performance geek (18+ years)
• Exadata Performance geek
• Linux Performance geek
• Hadoop Performance geek
• CEO & co-founder:
Expert Oracle Exadata book
(2nd edition is out now!)
Instant promotion
Gluent as a data virtualization layer
[Diagram labels: Gluent; Oracle, Teradata, NoSQL, Big Data Sources, MSSQL; App X, App Y, App Z]
Open Data Formats!
Gluent Advisor
1. Analyzes DB storage use and access patterns for safe offloading
2. 500+ databases analyzed
3. 10+ PB analyzed – 81% offloadable
4. 2-24x query speedup
Interested in analyzing your database?
http://gluent.com/whitepapers
"Tape is dead, disk is tape, flash is disk, RAM locality is king"
Jim Gray, 2006
http://research.microsoft.com/en-us/um/people/gray/talks/flash_is_good.ppt
Seagate Cheetah 15k RPM disk specs
200 MB/sec!
Spinning disk IO throughput
• B-Tree index-walking disk-based RDBMS
  • 15000 rpm spinning disks
  • ~200 random IOPS per disk
  • ~8 kB read per random IO
  • 8 kB * 200 IOPS = 1.6 MB/sec per disk
• Full scanning based workloads
  • Potentially much more data to access & filter
  • Partition pruning, zone maps, storage indexes help to skip data [1]
  • Scan only required columns (formats with large chunk sizes)
  • Sequential IO rate up to 200 MB/sec per disk
[1] http://www.dbms2.com/2013/05/27/data-skipping/
However, index scans can read only a subset of data
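The per-disk arithmetic on this slide can be checked with a short sketch. The input figures are the ones quoted above; the function and variable names are illustrative only, and decimal units (1 MB = 1000 kB) are assumed because that is what makes 8 kB * 200 come out as 1.6 MB/sec.

```python
# Per-disk throughput for the two access patterns on the slide.
# All input figures come from the slide; names are illustrative.

KB_PER_MB = 1000  # assumption: decimal units, matching 8 kB * 200 = 1.6 MB/sec


def random_io_mb_per_sec(iops: int, io_size_kb: int) -> float:
    """Throughput of index-style random reads, in MB/sec."""
    return iops * io_size_kb / KB_PER_MB


random_mb = random_io_mb_per_sec(iops=200, io_size_kb=8)
sequential_mb = 200.0  # sequential IO rate quoted for the same disk

print(f"random IO:     {random_mb:.1f} MB/sec per disk")
print(f"sequential IO: {sequential_mb:.1f} MB/sec per disk")
print(f"ratio:         {sequential_mb / random_mb:.0f}x")
```

On these figures a sequential scan moves roughly 125 times more data per second than random 8 kB reads from the same disk.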
Scanning a bunch of spinning disks can keep your CPUs really busy!
* Not even talking about flash or RAM here!
A simple query bottlenecked by CPU
9 GB scanned, processed in 7 seconds:
~1300 MB/s in PX
~80 MB/s per slave
A complex query bottlenecked by CPU
Complex query: much more CPU spent on aggregations, joins.
9 GB processed in 1.5 minutes
9 GB / 90 seconds = ~100 MB/s in PX
6 MB/s per slave
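A small sketch reproduces the throughput figures for both queries. Two things here are assumptions rather than facts from the slides: 1 GB is taken as 1024 MB, and the parallel degree of 16 is only inferred from the ratio of the total and per-slave rates.

```python
# Scan throughput for the simple and the complex parallel query.
# Assumptions (not stated on the slides): 1 GB = 1024 MB, and 16 PX slaves,
# inferred from ~1300 MB/s total vs ~80 MB/s per slave.

ASSUMED_PX_SLAVES = 16


def throughput_mb_s(gigabytes: float, seconds: float) -> float:
    """Average data processing rate in MB/s."""
    return gigabytes * 1024 / seconds


simple = throughput_mb_s(9, 7)     # 9 GB in 7 seconds
complex_ = throughput_mb_s(9, 90)  # 9 GB in 1.5 minutes

for name, total in (("simple", simple), ("complex", complex_)):
    per_slave = total / ASSUMED_PX_SLAVES
    print(f"{name:8s} {total:7.1f} MB/s total, {per_slave:5.1f} MB/s per slave")
```

Both queries read the same 9 GB, so the drop in MB/s comes entirely from the extra CPU work per byte, not from the storage.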
If disks and storage subsystems are getting so fast, why all the buzz around in-memory database systems?
* Can't we just cache the old database files in RAM?
A simple Data Retrieval test!
• Retrieve 1% of the rows out of an 8 GB table:
SELECT COUNT(*)
     , SUM(order_total)
FROM orders
WHERE warehouse_id BETWEEN 500 AND 510;
The Warehouse IDs range between 1 and 999
Test data generated by the SwingBench tool
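The "1% of rows" figure follows from the predicate and the ID range. A minimal sketch, assuming warehouse IDs are spread uniformly across the rows (the slides do not state the distribution):

```python
# Expected selectivity of: warehouse_id BETWEEN 500 AND 510
# Assumption: IDs 1..999 are uniformly distributed across the table's rows.


def between_selectivity(low: int, high: int, min_id: int, max_id: int) -> float:
    """Fraction of distinct IDs matched by an inclusive BETWEEN range."""
    matched = high - low + 1      # BETWEEN includes both endpoints
    total = max_id - min_id + 1
    return matched / total


selectivity = between_selectivity(500, 510, 1, 999)
print(f"{selectivity:.2%} of rows expected to match")
```

Eleven of the 999 IDs match, which is where the roughly 1% comes from.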
Data Retrieval: Test Results
• Remember, this is a very simple scanning + filtering query:

TESTNAME                   PLAN_HASH   ELA_MS   CPU_MS      LIOS  BLK_READ
------------------------- ---------- -------- -------- --------- ---------
test1: index range scan *   16715356   265203    37438    782858    511231
test2: full buffered */ C  630573765   132075    48944   1013913    849316
test3: full direct path *  630573765    15567    11808   1013873   1013850
test4: full smart scan */  630573765     2102      729   1013873   1013850
test5: full inmemory scan  630573765      155      155        14         0
test6: full buffer cache   630573765     7850     7831   1014741         0

Test 5 & Test 6 run entirely from memory
Source: http://www.slideshare.net/tanelp/oracle-database-inmemory-option-in-action
But why 50x difference in CPU usage?
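The 50x figure can be read straight off the table. The sketch below copies the ELA_MS and CPU_MS columns and compares every test against the in-memory scan; the dictionary layout is just a convenient way to hold the numbers.

```python
# ELA_MS and CPU_MS values copied from the test results table.
results = {
    "test1: index range scan":   {"ela_ms": 265203, "cpu_ms": 37438},
    "test2: full buffered":      {"ela_ms": 132075, "cpu_ms": 48944},
    "test3: full direct path":   {"ela_ms": 15567,  "cpu_ms": 11808},
    "test4: full smart scan":    {"ela_ms": 2102,   "cpu_ms": 729},
    "test5: full inmemory scan": {"ela_ms": 155,    "cpu_ms": 155},
    "test6: full buffer cache":  {"ela_ms": 7850,   "cpu_ms": 7831},
}

baseline = results["test5: full inmemory scan"]
for name, r in results.items():
    cpu_x = r["cpu_ms"] / baseline["cpu_ms"]
    ela_x = r["ela_ms"] / baseline["ela_ms"]
    print(f"{name:28s} CPU {cpu_x:7.1f}x   elapsed {ela_x:7.1f}x")

# Both of these tests read nothing from disk (BLK_READ = 0).
cpu_ratio = results["test6: full buffer cache"]["cpu_ms"] / baseline["cpu_ms"]
print(f"buffer cache vs in-memory CPU: {cpu_ratio:.1f}x")
```

Test 5 and Test 6 both run from RAM, so the gap between them is purely CPU work on differently laid out data.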
"Tape is dead, disk is tape, flash is disk, RAM locality is king"
Jim Gray, 2006
http://research.microsoft.com/en-us/um/people/gray/talks/flash_is_good.ppt
Latency Numbers Every Programmer Should Know

Latency Comparison Numbers
--------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms

Source: https://gist.github.com/jboner/2841832
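The relative costs in the table's right-hand column can be recomputed from the raw latencies. This sketch uses a subset of the rows, copied from the table; the helper name is illustrative.

```python
# Latencies in nanoseconds, copied from the table above (subset).
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7,
    "Main memory reference": 100,
    "Read 1 MB sequentially from memory": 250_000,
    "Read 1 MB sequentially from SSD": 1_000_000,
    "Read 1 MB sequentially from disk": 20_000_000,
}


def times_slower(slow: str, fast: str) -> float:
    """How many times longer `slow` takes than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


pairs = [
    ("L2 cache reference", "L1 cache reference"),
    ("Main memory reference", "L1 cache reference"),
    ("Read 1 MB sequentially from SSD", "Read 1 MB sequentially from memory"),
    ("Read 1 MB sequentially from disk", "Read 1 MB sequentially from memory"),
]
for slow, fast in pairs:
    print(f"{slow} = {times_slower(slow, fast):.0f}x {fast}")
```

The main memory vs L1 cache row is the one that matters for the rest of the talk: a cache miss to RAM costs about 200 L1 hits.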
CPU = fast
CPU L2/L3 cache in between
RAM = slow
RAM access is the bottleneck of modern computers
Waits for RAM access show up as CPU usage in monitoring tools
Want to wait less? Do it less!
CPU & cache friendly data structures are key!
[Diagram: row-format data block. Headers and ITL entries; row directory (#0 offset, #1 offset, #2 offset, #3 offset, …); rows #0 to #8, …, each with a row header. Each row: Hdr byte, Lock byte, CC byte, then repeating Col. len + Column data pairs.]
• OLTP: Block -> Row -> Column format
  • 8 kB blocks
  • Great for writes, changes
• Field-length encoding
  • Reading column #100 requires walking through all preceding columns
• Columns (with similar values) not densely packed together
• Not CPU cache friendly for analytics!
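Why reading column #100 means walking all preceding columns can be shown with a toy encoder. This is a simplified sketch, not Oracle's actual row format: it assumes a single length byte before each column value and omits the row header bytes; `encode_row` and `read_column` are hypothetical helpers.

```python
# Toy field-length encoded row: [1-byte length][data] repeated per column.
# Simplified assumption; real row formats carry extra header bytes.


def encode_row(values):
    """Pack string values as length-prefixed fields."""
    out = bytearray()
    for value in values:
        data = value.encode("utf-8")
        if len(data) > 255:
            raise ValueError("value too long for a 1-byte length prefix")
        out.append(len(data))
        out += data
    return bytes(out)


def read_column(row, col_idx):
    """Return (value, fields_skipped) for the zero-based column index."""
    pos = 0
    skipped = 0
    for _ in range(col_idx):
        pos += 1 + row[pos]   # no offset table: hop over each earlier field
        skipped += 1
    length = row[pos]
    value = row[pos + 1:pos + 1 + length].decode("utf-8")
    return value, skipped


row = encode_row([f"v{i}" for i in range(100)])
print(read_column(row, 0))    # first column: nothing to skip
print(read_column(row, 99))   # 100th column: 99 length bytes read first
```

Each hop depends on the length byte just read, so the walk is a chain of dependent memory reads rather than one direct lookup.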
Scanning columnar data structures
[Diagram: scanning a column in a row-oriented data block vs. scanning a column in a column-oriented compression unit (columns col1 to col6)]
Read filter column(s) first. Access only projected columns if matches found.
Reduced memory traffic. More sequential RAM access, SIMD on adjacent data.
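The "filter column first, projected columns only on a match" idea can be sketched with plain Python lists standing in for column vectors. The column names follow the earlier orders query; the sample values and the `customer` column are made up for illustration, and no SIMD is involved here.

```python
# Column-oriented layout: one list per column (hypothetical sample data).
columns = {
    "warehouse_id": [499, 500, 505, 511, 510],
    "order_total":  [10.0, 20.0, 30.0, 40.0, 50.0],
    "customer":     ["a", "b", "c", "d", "e"],   # never touched by the query
}


def count_and_sum(cols, filter_col, low, high, project_col):
    """COUNT(*) and SUM(project_col) for rows with low <= filter_col <= high."""
    # Step 1: scan only the filter column, collecting matching positions.
    matches = [i for i, v in enumerate(cols[filter_col]) if low <= v <= high]
    # Step 2: touch the projected column only at the matching positions.
    total = sum(cols[project_col][i] for i in matches)
    return len(matches), total


print(count_and_sum(columns, "warehouse_id", 500, 510, "order_total"))
```

A row-oriented scan would have to step through every column of every row to answer the same query.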
How to measure this stuff?
CPU Performance Counters on Linux

# perf stat -d -p PID sleep 30

 Performance counter stats for process id '34783':

      27373.819908 task-clock                #    0.912 CPUs utilized
    86,428,653,040 cycles                    #    3.157 GHz
    32,115,412,877 instructions              #    0.37  insns per cycle
                                             #    2.39  stalled cycles per insn
     7,386,220,210 branches                  #  269.828 M/sec
        22,056,397 branch-misses             #    0.30% of all branches
    76,697,049,420 stalled-cycles-frontend   #   88.74% frontend cycles idle
    58,627,393,395 stalled-cycles-backend    #   67.83% backend cycles idle
       256,440,384 cache-references          #    9.368 M/sec
       222,036,981 cache-misses              #   86.584 % of all cache refs
       234,361,189 LLC-loads                 #    8.562 M/sec
       218,570,294 LLC-load-misses           #   93.26% of all LL-cache hits
        18,493,582 LLC-stores                #    0.676 M/sec
         3,233,231 LLC-store-misses          #    0.118 M/sec
     7,324,946,042 L1-dcache-loads           #  267.589 M/sec
       305,276,341 L1-dcache-load-misses     #    4.17% of all L1-dcache hits
        36,890,302 L1-dcache-prefetches      #    1.348 M/sec

      30.000601214 seconds time elapsed

Measure what's going on inside a CPU!
Metrics explained in my blog entry: http://bit.ly/1PBIlde
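The derived figures in the right-hand column of the perf output can be recomputed from the raw counters. The values below are copied from the output above; the dictionary keys and function name are illustrative. The stall ratio uses the frontend stall count, which is the one that reproduces the 2.39 shown in this sample.

```python
# Counter values copied from the perf stat output above.
counters = {
    "task_clock_ms": 27373.819908,
    "cycles": 86_428_653_040,
    "instructions": 32_115_412_877,
    "stalled_cycles_frontend": 76_697_049_420,
    "cache_references": 256_440_384,
    "cache_misses": 222_036_981,
}


def derived_metrics(c):
    """Recompute the ratios perf prints next to the raw counters."""
    return {
        "ghz": c["cycles"] / (c["task_clock_ms"] / 1000) / 1e9,
        "ipc": c["instructions"] / c["cycles"],
        "stalled_per_insn": c["stalled_cycles_frontend"] / c["instructions"],
        "cache_miss_pct": 100 * c["cache_misses"] / c["cache_references"],
    }


metrics = derived_metrics(counters)
for name, value in metrics.items():
    print(f"{name:18s} {value:8.3f}")
```

An IPC of about 0.37 with most cache references missing is the signature of a process that spends its "CPU time" waiting on RAM.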
Testing data access path differences on Oracle 12c
SELECT COUNT(cust_valid) FROM customers_nopart c WHERE cust_id > 0
Run the same query on the same dataset stored in different formats/layouts.
Full details: http://blog.tanelpoder.com/2015/11/30/ram-is-the-new-disk-and-how-to-measure-its-performance-part-3-cpu-instructions-cycles/
Test result data: http://bit.ly/1RitNMr
CPU instructions used for scanning/counting 69M rows
Average CPU instructions per row processed
• Knowing that the table has about 69M rows, I can calculate the average number of instructions issued per row processed
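The per-row calculation is a single division. In the sketch below only the row count comes from the slide; the instruction count is a made-up placeholder, not a measured result from the talk.

```python
ROWS = 69_000_000  # approximate row count quoted on the slide


def instructions_per_row(total_instructions: int, rows: int = ROWS) -> float:
    """Average CPU instructions issued per row processed."""
    return total_instructions / rows


# Hypothetical counter value for illustration only (not a measured result).
example_instructions = 6_900_000_000
print(f"{instructions_per_row(example_instructions):.1f} instructions per row")
```

In practice the input would be the `instructions` counter reported by perf stat for the session running the query.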
CPU cycles consumed (full scans only)
CPU efficiency (Instructions-per-Cycle)
Yes, modern superscalar CPUs can execute multiple instructions per cycle
Reducing memory writes within SQL execution
• Old approach:
  1. Read compressed data chunk
  2. Decompress data (write data to temporary memory location)
  3. Filter out non-matching rows
  4. Return data
• New approach:
  1. Read and filter compressed columns
  2. Decompress only required columns of matching rows
  3. Return data
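The two approaches can be contrasted with a toy dictionary-encoded column. Dictionary encoding is used here only as a stand-in for whatever compression a real engine applies; the data and function names are hypothetical. Each function returns the matching values plus a count of how many values it had to materialize in memory.

```python
# Dictionary-encoded column: small integer codes plus a lookup dictionary.
# Hypothetical data; dictionary encoding stands in for real compression.
dictionary = ["AMS", "LON", "NYC"]
codes = [0, 2, 1, 2, 2, 0, 1, 2]


def old_approach(codes, dictionary, wanted):
    """Decompress everything into a temporary buffer, then filter."""
    decompressed = [dictionary[c] for c in codes]   # one write per row
    matches = [v for v in decompressed if v == wanted]
    return matches, len(decompressed)


def new_approach(codes, dictionary, wanted):
    """Filter on the compressed codes, decompress matching rows only."""
    wanted_code = dictionary.index(wanted)          # translate predicate once
    matching = [c for c in codes if c == wanted_code]
    matches = [dictionary[c] for c in matching]     # one write per match
    return matches, len(matches)


print(old_approach(codes, dictionary, "NYC"))
print(new_approach(codes, dictionary, "NYC"))
```

Both return the same rows, but the second materializes only the matches instead of every value in the chunk.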
Memory reads & writes during internal processing
[Chart, unit = MB. Annotations:]
• Read only requested columns
• Rows counted from chunk headers
• Scan compressed data: few memory writes
Past & Future
Some commercial column store history
• Disk-optimized column stores
  • Expressway 103 / Sybase IQ (early '90s)
  • MonetDB (early '90s)
  • Oracle Hybrid Columnar Compression (disk/OLTP optimized)
  • …
• Memory-optimized column stores
  • …
  • SAP HANA (December 2010)
  • IBM DB2 with BLU Acceleration (June 2013)
  • Oracle Database 12c with In-Memory Option (July 2014)
  • …
* Not addressing memory-optimized OLTP/row-stores here
Future-proof Open Data Formats!
• Disk-optimized columnar data structures
  • Apache Parquet
    • https://parquet.apache.org/
  • Apache ORC
    • https://orc.apache.org/
• Memory / CPU-cache optimized data structures
  • Apache Arrow
    • Not only storage format
    • … also a cross-system/cross-platform IPC communication framework
    • https://arrow.apache.org/
Future
1. RAM gets cheaper + bigger, not necessarily faster
2. CPU caches get larger
3. RAM blends with storage and becomes non-volatile
4. IO subsystems (flash) get even closer to CPUs
5. IO latencies shrink
6. The latency difference between non-volatile storage and volatile RAM shrinks - new database layouts!
7. CPU cache is king – new data structures needed!
References
• Slides & Video of this presentation:
  • http://www.slideshare.net/tanelp
  • https://vimeo.com/gluent
• Index range scans vs full scans:
  • http://blog.tanelpoder.com/2014/09/17/about-index-range-scans-disk-re-reads-and-how-your-new-car-can-go-600-miles-per-hour/
• RAM is the new disk series:
  • http://blog.tanelpoder.com/2015/08/09/ram-is-the-new-disk-and-how-to-measure-its-performance-part-1/
  • https://docs.google.com/spreadsheets/d/1ss0rBG8mePAVYP4hlpvjqAAlHnZqmuVmSFbHMLDsjaU/
Thanks!
http://gluent.com/whitepapers
We are hiring developers & data engineers!!!
http://blog.tanelpoder.com
tanel@tanelpoder.com
@tanelpoder