Source: cis.csuohio.edu/~sschung/cis611/Nandithacis611termpaper.pdf

Presented by

Nanditha Thinderu

• Enterprise systems are highly distributed and heterogeneous, which makes administration a complex task.

• Application Performance Management (APM) tools were developed to retrieve information about failure rates and resource utilization.

• An APM platform has to monitor this big data with a tight resource budget and fast response times.

• APM refers to monitoring and managing enterprise software systems.

• There are two monitoring approaches:

• Black-box approach

• API-based approach

• By capturing every method invocation in an enterprise system, APM tools can generate a vast amount of data.

• An APM data record consists of a metric name, a value, and a timestamp.

• In the storage system, queries fall into two major types:

• Single-value lookups to retrieve the most current value

• Small scans for retrieving system health information

APM record structure: Metric Name, Value, Min, Max, Timestamp, Duration (a minimal Java sketch of such a record follows).
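Purely as an illustration, one APM sample with the fields above could be modeled as a small Java value class. All class, field, and metric names below are assumptions made for this sketch, not code from the paper.

    // Hypothetical model of one APM data record (names are illustrative only).
    public class ApmSample {
        final String metricName;  // e.g. "host42.cpu.utilization" (made-up metric)
        final double value;       // measured value
        final double min;         // minimum within the aggregation interval
        final double max;         // maximum within the aggregation interval
        final long timestamp;     // sample time in epoch milliseconds
        final long duration;      // length of the aggregation interval in ms

        ApmSample(String metricName, double value, double min, double max,
                  long timestamp, long duration) {
            this.metricName = metricName;
            this.value = value;
            this.min = min;
            this.max = max;
            this.timestamp = timestamp;
            this.duration = duration;
        }
    }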

• The Yahoo! Cloud Serving Benchmark (YCSB) is used to evaluate key-value stores with APM data properties.

• Because APM data is append-only, we define five workloads: R, RW, W, RS, and RSW.

• The benchmark comprises a data generator, a workload generator, and drivers for several key-value stores (a driver skeleton is sketched below).

• The goal was not only a pure performance comparison but also a broad overview of the available solutions.
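Each driver is a store-specific subclass of YCSB's abstract DB class. The skeleton below follows the signatures of the classic com.yahoo.ycsb.DB API as a rough sketch; the store-specific bodies (and the actual drivers used in the paper) are omitted.

    import java.util.HashMap;
    import java.util.Set;
    import java.util.Vector;
    import com.yahoo.ycsb.ByteIterator;
    import com.yahoo.ycsb.DB;

    // Skeleton of a YCSB driver; a return value of 0 signals success in this API.
    public class MyStoreClient extends DB {
        @Override
        public int read(String table, String key, Set<String> fields,
                        HashMap<String, ByteIterator> result) {
            // single-value lookup: fetch one record and copy its fields into 'result'
            return 0;
        }

        @Override
        public int scan(String table, String startkey, int recordcount,
                        Set<String> fields,
                        Vector<HashMap<String, ByteIterator>> result) {
            // small scan: read 'recordcount' records starting at 'startkey'
            return 0;
        }

        @Override
        public int insert(String table, String key,
                          HashMap<String, ByteIterator> values) {
            // append-only APM data maps naturally to inserts
            return 0;
        }

        @Override
        public int update(String table, String key,
                          HashMap<String, ByteIterator> values) {
            return 0;
        }

        @Override
        public int delete(String table, String key) {
            return 0;
        }
    }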

• The data stores used can be classified into three categories:

• Key-value stores: Project Voldemort and Redis

• Extensible record stores: HBase and Cassandra

• Scalable relational stores: MySQL Cluster and VoltDB

• We used HBase v0.90.4 running on top of Hadoop v0.20.205.0.

• Because HBase uses HDFS, it also requires the installation and configuration of Hadoop.

• Tables in HBase are accessed through its client API.
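As a minimal sketch of that client API (against the HBase 0.90 Java client; the table name, column family, and row-key layout here are assumptions, not the paper's schema):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            HTable table = new HTable(conf, "usertable");      // table name assumed

            // Write one sample; row key = metric name + timestamp (assumed layout).
            byte[] rowKey = Bytes.toBytes("host42.cpu.utilization:1339600000000");
            Put put = new Put(rowKey);
            put.add(Bytes.toBytes("data"), Bytes.toBytes("value"), Bytes.toBytes("0.75"));
            table.put(put);

            // Single-value lookup of the same row.
            Result result = table.get(new Get(rowKey));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("data"), Bytes.toBytes("value"))));

            table.close();
        }
    }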

• We used the then-recent Cassandra 1.0.0.rc2 and the default RandomPartitioner, which distributes the data across the nodes randomly.

• The Cassandra YCSB client requires only a single column family to store all fields, each field corresponding to a column.

• Cassandra is a decentralized, symmetric system and employs consistent hashing to distribute values across the nodes.
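One way to illustrate that one-column-family layout from Java is the sketch below, which uses the Hector client library; the paper's YCSB client may use a different client interface, and the cluster, keyspace, column family, and key names are assumptions for this example.

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class CassandraExample {
        public static void main(String[] args) {
            Cluster cluster = HFactory.getOrCreateCluster("apm", "localhost:9160");
            Keyspace keyspace = HFactory.createKeyspace("usertable", cluster);

            // All fields of a record become columns of one column family ("data").
            Mutator<String> mutator =
                    HFactory.createMutator(keyspace, StringSerializer.get());
            String key = "host42.cpu.utilization:1339600000000";
            mutator.addInsertion(key, "data", HFactory.createStringColumn("value", "0.75"));
            mutator.addInsertion(key, "data", HFactory.createStringColumn("min", "0.10"));
            mutator.execute();

            HFactory.shutdownCluster(cluster);
        }
    }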

• We used Project Voldemort 0.90.1 with the embedded BerkeleyDB storage engine; a Voldemort YCSB client was already implemented, and configuration was easy for the most part.

• Voldemort is a highly scalable storage system with a simpler design compared to a relational database.
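For reference, a put/get round trip against Voldemort's native Java client looks roughly like the sketch below (the bootstrap URL, store name, and key are assumptions for this example):

    import voldemort.client.ClientConfig;
    import voldemort.client.SocketStoreClientFactory;
    import voldemort.client.StoreClient;
    import voldemort.client.StoreClientFactory;
    import voldemort.versioning.Versioned;

    public class VoldemortExample {
        public static void main(String[] args) {
            StoreClientFactory factory = new SocketStoreClientFactory(
                    new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
            StoreClient<String, String> client = factory.getStoreClient("usertable");

            // Store and retrieve one value; Voldemort wraps values with version metadata.
            client.put("host42.cpu.utilization:1339600000000", "0.75");
            Versioned<String> value = client.get("host42.cpu.utilization:1339600000000");
            System.out.println(value.getValue());

            factory.close();
        }
    }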

• We used Redis v2.4.2, since the cluster version was in an unstable state and could not run a complete test.

• We updated the default Redis YCSB client to use a sharded Jedis connection pool (ShardedJedisPool).

• For data storage, the client uses a Redis hash map as well as a sorted set.
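The sketch below mirrors that design with the Jedis client: record fields go into a Redis hash, and the record key is additionally added to a sorted set so that scans are possible. For simplicity it uses a single-node JedisPool instead of the sharded pool, and all key and field names are assumptions.

    import java.util.HashMap;
    import java.util.Map;
    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.JedisPool;
    import redis.clients.jedis.JedisPoolConfig;

    public class RedisExample {
        public static void main(String[] args) {
            JedisPool pool = new JedisPool(new JedisPoolConfig(), "localhost");
            Jedis jedis = pool.getResource();
            try {
                String key = "host42.cpu.utilization:1339600000000";

                // Record fields are stored in a hash under the record key.
                Map<String, String> fields = new HashMap<String, String>();
                fields.put("value", "0.75");
                fields.put("timestamp", "1339600000000");
                jedis.hmset(key, fields);

                // The key is also indexed in a sorted set to support scans.
                jedis.zadd("_indices", key.hashCode(), key);

                System.out.println(jedis.hgetAll(key));
            } finally {
                pool.returnResource(jedis);
            }
            pool.destroy();
        }
    }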

• We used VoltDB v2.1.3 with the default configuration.

• A YCSB client driver for VoltDB that connects to all servers in the cluster was implemented.
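VoltDB is accessed through stored procedures; a call through its Java client looks roughly like this sketch (the server host names, procedure name, and parameters are assumptions for illustration):

    import org.voltdb.client.Client;
    import org.voltdb.client.ClientFactory;
    import org.voltdb.client.ClientResponse;

    public class VoltDbExample {
        public static void main(String[] args) throws Exception {
            Client client = ClientFactory.createClient();

            // Connect to every server in the cluster, as the driver above does.
            client.createConnection("server1");
            client.createConnection("server2");

            // Invoke a stored procedure that inserts one sample.
            ClientResponse response = client.callProcedure(
                    "InsertSample", "host42.cpu.utilization", 1339600000000L, 0.75);
            if (response.getStatus() == ClientResponse.SUCCESS) {
                System.out.println("insert acknowledged");
            }

            client.close();
        }
    }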

• We used MySQL v5.5.17 with InnoDB as the storage engine.

• A generic RDBMS YCSB client, which connects to the databases using JDBC, was used.
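A minimal JDBC round trip of the kind such a client performs is sketched below (the connection URL, credentials, table, and column names are assumptions, not the benchmark's actual schema):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class JdbcExample {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/ycsb", "ycsb", "secret");

            // Insert one record (append-only APM data means mostly inserts).
            PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO usertable (ycsb_key, field0) VALUES (?, ?)");
            insert.setString(1, "host42.cpu.utilization:1339600000000");
            insert.setString(2, "0.75");
            insert.executeUpdate();

            // Single-value lookup by key.
            PreparedStatement select = conn.prepareStatement(
                    "SELECT field0 FROM usertable WHERE ycsb_key = ?");
            select.setString(1, "host42.cpu.utilization:1339600000000");
            ResultSet rs = select.executeQuery();
            while (rs.next()) {
                System.out.println(rs.getString("field0"));
            }

            conn.close();
        }
    }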

• Workload R is the most read-intensive workload, with 95% reads and only 5% writes. Latencies and throughput are presented on a logarithmic scale.

• Redis has the highest throughput.

• HBase has the highest read latency.

• Cassandra has the highest write latency.

• In the second experiment, workload RW is used, which has 50% writes.

• VoltDB achieves the highest throughput for one node, which is slightly lower than for workload R.

• For write latency, HBase and MySQL show important differences compared to workload R.

• Workload W is the one closest to the APM use case.

• It has a 99% write rate.

• The throughput results are similar to workload RW.

• For read latency, the apparent change is the high latency of HBase.

• For write latency, HBase has increased significantly.

• Workload RS has 47% read, 47% scan, and 6% write operations.

• MySQL has the best throughput for a single node.

• Cassandra and HBase obtain a linear increase in throughput with the number of nodes.

• Workload RSW has 50% reads, of which 25% are scans.

• Most of the results are similar to workload RS.

• In this experiment we used 8 nodes of each system.

• The results are calculated for workload R.

• We observe varying latencies for the different key-value stores.

• The write latencies show a similar development for Cassandra, Voldemort, and Redis.

• The most storage-efficient system is HBase.

• Redis and VoltDB are omitted here, as they do not store data on disk.

• Cassandra stores the data most efficiently.

• The disk usage can be reduced by compression.

• A series of tests was conducted on cluster D.

• The throughput increases for all systems with higher ratios.

• Project Voldemort has the best read latency.

• HBase has a low write latency and is the best for workload RW.

• Cassandra: It achieves the highest throughput for the maximum number of nodes, and its performance is best for high write rates.

• HBase: HBase's throughput is the lowest for one node but increases linearly with the number of nodes. It has a low write latency; however, its read latency is much higher than that of the other systems.

• Project Voldemort: Its read and write latencies are low, similar to each other, and stable.

• MySQL: It achieved a high throughput; however, its latency decreases with the number of nodes.

• Redis: It has a high throughput that exceeds all other systems for read-intensive workloads, but its latencies decrease for both read and write operations.

• VoltDB: The performance is high for a single instance, but it never achieved a throughput increase with more than one node.

• We optimized each system for our workload and tested it with a number of open connections that was four times higher than the number of cores in the host CPUs.

• Higher numbers of connections led to congestion and slowed the systems down considerably, while lower numbers did not fully utilize the systems.

• This configuration resulted in an average request-processing latency that was much higher than in previously published performance measurements.

• Since our use case does not have the strict latency requirements that are common in online applications and similar environments, the latencies in most results are still adequate.