Cassandra - PHP

Cassandra

Integrating Cassandra into your project

Tuesday 12 November 2013

Maurits Lawende

• Working at Dutch Open Projects (DOP) since 2007

• Development and technical design for challenging Drupal sites

• Development of SaaS solutions in PHP & NodeJS

ToDoToDay

• Data versus information

• History and usage of Cassandra

• How to use Cassandra

• Developments

Data versus information

Celko, J. (1999). Data and databases

SQL is designed for information

DBMS knows how to use your data

SQL is designed for flexibility

Not even a single line on scalability

SQL

Nearly 40 years of experience

SQL

Never designed for scalability

Alexa top 10

• Google

• Facebook

• YouTube

• Yahoo

• Baidu

• Wikipedia

• QQ.com

• LinkedIn

• Live.com

• Twitter

Alexa top 10

• Google (BigTable)

• Facebook (MySQL)

• YouTube (MySQL)

• Yahoo

• Baidu (HyperTable)

• Wikipedia (MySQL)

• QQ.com

• LinkedIn (Voldemort)

• Live.com

• Twitter (MySQL)

Cassandra users

• Facebook (+ Redis & HBase & MySQL)

• Twitter (+ MySQL)

• Reddit (+ Postgres)

• Digg (+ Redis)

• Bit.ly (+ MongoDB)

• Netflix

Jeff Hammerbacher

Left Facebook in 2008

Back to basics

Don’t think SQL

Key/value store

Evolved towards tables

Just data

• No joins

• Limited sorting capabilities

• No aggregation, grouping, subqueries whatsoever

Schemaless

• Fixed column families (not tables), but:

• Dynamic column names

Operations in Cassandra 1.0

• CREATE KEYSPACE name

• USE name

• CREATE COLUMN FAMILY name

• DROP KEYSPACE name

• DROP COLUMN FAMILY name

• SET columnfamily['row']['column'] = 'value';

• GET columnfamily['row']

• LIST columnfamily

• DEL columnfamily['row']

• DEL columnfamily['row']['column']

• post['uuid']['title'] = 'First post!';

• user['mau']['firstname'] = 'Maurits';

• user['mau']['lastname'] = 'Lawende';

Row “uuid” in post: title = First post!

Row “mau” in user: firstname = Maurits, lastname = Lawende

Sorted by row key, then column name (all ascending)

• post[‘uuid’][‘user’] = ‘mau’;

How to get a list of blogs by “mau”?

WHERE user = 'mau'

Bad Request: No indexed columns present in by-columns clause with Equal operator

Sequential scans are rejected

Bad Request: Order by is currently only supported on the clustered columns of the PRIMARY KEY

Bad Request: ORDER BY is only supported when the partition key is restricted by an EQ or an IN.

ORDER BY date DESC LIMIT 10

Only possible when user and date are in the primary key

Predictable performance

No performance degradation after data growth

• user['mau']['post001'] = 'uuid';

Any order and limit

• post['uuid']['user'] = 'uuid';

No uuid IN (...) or ORs

• user['mau']['post001:uuid'] = 'First post!';

• user['mau']['post002:uuid'] = 'Second post!';

Only one query required to get the user profile with the latest posts
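The wide-row layout above can be imitated with plain PHP arrays to see why a single read is enough. This is only an in-memory sketch, not Cassandra code: the `$user` array, the `uuid-a` style identifiers and the `latestPosts()` helper are hypothetical, and it assumes the column names sort as plain strings.

```php
<?php
// In-memory stand-in for a single wide row: user['mau'].
// Each column name combines a sortable post number with the post uuid.
$user = array(
    'mau' => array(
        'post002:uuid-b' => 'Second post!',
        'post001:uuid-a' => 'First post!',
        'post003:uuid-c' => 'Third post!',
    ),
);

// Hypothetical helper: return the newest $limit posts from one row.
function latestPosts(array $row, $limit)
{
    krsort($row);                              // sort columns by name, descending
    return array_slice($row, 0, $limit, true); // keep only the first $limit columns
}

$latest = latestPosts($user['mau'], 2);
print_r($latest);
```

Because the titles are stored in the column values, no second lookup in a post table is needed to render the list.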

Limits: row key 64 KB, column name 64 KB, column value 2 GB

2 billion cells per row

Beauty?

• Dirty in the SQL world, but;

• It’s a best practice in Big Data

• Don’t think of it as a relational database

• No strict rules on how to use it, just push it to the limits

Each row is a snapshot of data meant to satisfy a given query, sort of like a materialized view.

Storage in a cluster

Cluster structures

Master-slave

Master-master

Sharding

HDFS / GlusterFS

HyperTable

Dynamo

No master or single point of failure

Every node is (nearly) identical

Distribution and replication

Token ring: 0 to 2^127

Distribution and replication

Client can connect to any node
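A minimal sketch of how a token ring can map a key to its replicas, assuming the simplest placement rule: the first node whose token is greater than or equal to the key's token owns it, and the next nodes clockwise hold the copies. The node names, the tiny 0 to 99 token range and the `replicasFor()` helper are hypothetical and only serve the illustration.

```php
<?php
// Hypothetical ring: node name => token. Real tokens are far larger.
$ring = array('a' => 0, 'b' => 25, 'c' => 50, 'd' => 75);

// Return the names of the nodes that store a key with the given token.
function replicasFor($keyToken, array $ring, $replicationFactor)
{
    asort($ring);                       // order the nodes by token
    $names = array_keys($ring);
    $tokens = array_values($ring);
    $count = count($names);

    // First node with token >= key token; wrap to the lowest token otherwise.
    $first = 0;
    foreach ($tokens as $i => $token) {
        if ($keyToken <= $token) {
            $first = $i;
            break;
        }
    }

    // Walk clockwise to collect the replicas.
    $replicas = array();
    for ($k = 0; $k < min($replicationFactor, $count); $k++) {
        $replicas[] = $names[($first + $k) % $count];
    }
    return $replicas;
}

print implode(', ', replicasFor(30, $ring, 2)) . "\n";
```

Since every node can compute this mapping, a client may send its request to any node and have it forwarded.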

Seed nodes

• Required for bootstrapping nodes

• Define 2 or 3 seed nodes per cluster

Extending the ring

• Assign a token for new node

• Configure seed node host

• Start Cassandra on new node

Consistency

Writing data

• Hinted handoff

• Write to commit log

• Write in memory

• Write to disk (together with timestamp)

Write consistency

• Choose from ANY, ONE, TWO, THREE, QUORUM, ALL

• QUORUM = floor((replication factor / 2) + 1)
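The quorum formula can be checked with a few lines of PHP. The helper names below are hypothetical. The second function encodes the usual rule of thumb that a read is guaranteed to overlap with the latest write when the number of replicas written plus the number of replicas read exceeds the replication factor.

```php
<?php
// QUORUM = floor(replication factor / 2) + 1
function quorum($replicationFactor)
{
    return intdiv($replicationFactor, 2) + 1;
}

// True when every read is guaranteed to touch at least one replica
// that received the latest write.
function readsSeeLatestWrite($written, $read, $replicationFactor)
{
    return $written + $read > $replicationFactor;
}

foreach (array(1, 3, 5) as $rf) {
    printf("RF %d: quorum = %d\n", $rf, quorum($rf));
}
```

With a replication factor of 3, writing and reading at QUORUM touches 2 + 2 replicas, so the sets always overlap; writing and reading at ONE does not give that guarantee.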

Read consistency

• Choose from ONE, TWO, THREE, QUORUM, ALL

• Most recent copy is returned
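"Most recent copy" refers to the timestamp stored with every write. A rough sketch of that reconciliation step, assuming each replica response carries a value and a timestamp; the array layout and the `mostRecent()` helper are made up for this example and are not part of any client library.

```php
<?php
// Pick the response with the highest write timestamp (last write wins).
function mostRecent(array $responses)
{
    $winner = null;
    foreach ($responses as $response) {
        if ($winner === null || $response['timestamp'] > $winner['timestamp']) {
            $winner = $response;
        }
    }
    return $winner;
}

// Hypothetical answers from three replicas; node2 has a newer write.
$responses = array(
    array('replica' => 'node1', 'value' => 'Maurits',   'timestamp' => 1384250000),
    array('replica' => 'node2', 'value' => 'Maurits L', 'timestamp' => 1384250060),
    array('replica' => 'node3', 'value' => 'Maurits',   'timestamp' => 1384250000),
);

$result = mostRecent($responses);
print $result['value'] . "\n";
```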

Read repair

• Compares data with 2 other replicas in the background

• Fixes inconsistent and missing data

• Runs on 10% of all reads

Node repair

• Gradually compares all data on a node with its replicas

• Required in conjunction with read repair to fix “forgotten deletes”

ACID properties

• Atomic: a transaction completes successfully or is rolled back entirely

• Consistent: transactions never invalidate the database state

• Isolated: transactions are processed as if they ran sequentially

• Durable: completed actions are persistent

CAP theorem

• Consistency

• Availability

• Partition tolerance

Impossible to achieve all three

Eventual consistency

Not guaranteed to be consistent, but becomes consistent later

Eventual consistency

• Best effort

• Consistency is not always more important than speed and scalability (doesn’t require locking)

• Configurable consistency level, but no transaction support

Surrogate keys

Say bye to sequences

Not consistent across cluster

Counters are for counting

Native support for UUIDs

f47ac10b-58cc-4372-a567-0e02b2c3d479
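Without sequences, keys are generated on the client. One way to do that in PHP is a random (version 4) UUID; this is a sketch assuming PHP 7 or later for `random_bytes()`, and the `uuid4()` function name is our own.

```php
<?php
// Generate a random (version 4) UUID, no database round trip needed.
function uuid4()
{
    $bytes = random_bytes(16);
    $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40); // set version to 4
    $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80); // set the RFC 4122 variant
    // 32 hex characters, grouped as 8-4-4-4-12
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
}

print uuid4() . "\n";
```

Any node or web server can generate such keys independently, which is why they suit a cluster better than an auto-increment column.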

Cassandra 1.2

• No longer schemaless

• Introduced CQL3

• No wide tables anymore

Collections

• Lists

• Maps

• Sets

• user['mau']['posts'] = 'uuid';

• CREATE TABLE user ( username text PRIMARY KEY, posts list<uuid>);

• UPDATE user SET posts = posts + ['uuid'] WHERE username = 'mau'; (append)

• UPDATE user SET posts = ['uuid'] + posts WHERE username = 'mau'; (prepend)

• CREATE TABLE user ( username text PRIMARY KEY, emails set<text>);

• UPDATE user SET emails = emails + {'mail@example.com'} WHERE username = 'mau';

• CREATE TABLE user ( username text PRIMARY KEY, attending map<timestamp,text>);

• UPDATE user SET attending['2013-11-12'] = 'PHPMeetup' WHERE username = 'mau';

• DELETE attending['2013-12-05'] FROM user WHERE username = 'mau';

Limits on collections

• 64K

• Whole collection loaded in memory when reading / writing

• Not an alternative to wide tables!

No size check in CQL

SET list = list + ['...']

Wide tables in CQL3

• CREATE TABLE tweets ( tweet_id uuid PRIMARY KEY, author varchar, body varchar);

• CREATE TABLE timeline ( user_id varchar, tweet_id uuid, author varchar, body varchar, PRIMARY KEY (user_id, tweet_id))

Wide tables in CQL3

Partition user_id = mau: uuid:author = anne, uuid:body = Tweet from Anne

Partition user_id = mike: uuid:author = david, uuid:body = Tweet from David

For schemaless lovers:

CREATE TABLE name ( rowkey varchar, columnname varchar, value blob, PRIMARY KEY (rowkey, columnname));

Secondary index

• CREATE INDEX name ON table (column);

• High memory usage when used with high cardinality

Iteration

• SELECT * FROM users

Iteration

• SELECT * FROM users LIMIT 10 OFFSET 100

unpredictable performance

Iteration

• SELECT token(username), username, country, age FROM user

Iteration

• SELECT token(username), username, country, age FROM user WHERE token(username) > 23947239 LIMIT 10
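Token-based iteration boils down to remembering the token of the last row and asking for everything after it. A small sketch that only builds the CQL string for each page; `nextPageQuery()` is a hypothetical helper, and executing the query is left to whatever client is in use.

```php
<?php
// Build the query for the next page of users.
// $lastToken is the token(username) value of the last row already seen,
// or null for the first page.
function nextPageQuery($limit, $lastToken = null)
{
    $cql = 'SELECT token(username), username, country, age FROM user';
    if ($lastToken !== null) {
        $cql .= ' WHERE token(username) > ' . (int) $lastToken;
    }
    return $cql . ' LIMIT ' . (int) $limit;
}

print nextPageQuery(10) . "\n";
print nextPageQuery(10, 23947239) . "\n";
```

Each page starts directly at a known position in the ring, so the cost of fetching page 1000 is the same as that of page 1, unlike an OFFSET.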

Queries are always controlled by one node

Even if data from 100 nodes is involved

MapReduce

Or just “MapRed”

MapReduce

• array_map

• array_reduce

map()

• Processes a subset of the data

• array_map(function($v) { return strtoupper($v); }, array('a', 'b'))

reduce()

• Merge results from the mapping function

• array_reduce(array(1, 2, 3), function($a, $b) { return $a + $b; });

MapReduce

[Diagram: many map() calls run in parallel on subsets of the data; their output is merged into one result]

Wordcount

$data = array('red green blue', 'orange blue', 'purple green');

// map: count the words in each line separately
$data = array_map(function($v) {
  $words = array();
  foreach (explode(' ', $v) as $word)
    $words[$word] = isset($words[$word]) ? $words[$word] + 1 : 1;
  return $words;
}, $data);

// reduce: merge the per-line counts into one total
$data = array_reduce($data, function($a, $b) {
  foreach ($a as $word => $count)
    $b[$word] = isset($b[$word]) ? $b[$word] + $count : $count;
  return $b;
}, array());

array('red' => 1, 'green' => 2, 'blue' => 2, 'orange' => 1, 'purple' => 1)

ORDER BY value LIMIT 5

$data = array(array(4,5,2), array(62,35,1), array(74,56,2,34));

// map: sort each subset and keep its 5 smallest values
$data = array_map(function($v) {
  sort($v);
  return array_slice($v, 0, 5);
}, $data);

// reduce: merge the partial results and keep the 5 smallest overall
$data = array_reduce($data, function($a, $b) {
  $v = array_merge($a, $b);
  sort($v);
  return array_slice($v, 0, 5);
}, array());

array(1, 2, 2, 4, 5)

Remember

• Getting information is a bumpy road in big data

• Use MapRed to transform data into information

MapReduce

• No native support in Cassandra

• MapReduce possible with Hadoop (requires Java programming)

input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;

STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

SELECT v['ip'], COUNT(1) AS cnt FROM www_access GROUP BY v['ip'] ORDER BY cnt DESC LIMIT 30

Pig and Hive

• Using MapReduce

• No(t very) predictable performance

• Good for analysis

Hack your own

• Not too difficult

• Data can be split into subsets by filtering on tokens

• Application must run on all MapRed nodes

• Probably better performance than Pig / Hive
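Splitting the data by token could look like the sketch below. It assumes 64-bit PHP and a partitioner whose tokens are signed 64-bit integers (the Murmur3 partitioner that became the default in Cassandra 1.2); with another partitioner the bounds differ. `splitTokenRange()` is a hypothetical helper, and each worker would use its own range in a `WHERE token(key) >= start AND token(key) <= end` clause.

```php
<?php
// Split the full signed 64-bit token range into $parts contiguous,
// inclusive [start, end] ranges without overflowing PHP integers.
function splitTokenRange($parts)
{
    if ($parts < 2) {
        return array(array(PHP_INT_MIN, PHP_INT_MAX));
    }
    // Approximate width of one range: 2^64 / $parts.
    $step = intdiv(PHP_INT_MAX, $parts) * 2;

    $ranges = array();
    $start = PHP_INT_MIN;
    for ($i = 0; $i < $parts; $i++) {
        $isLast = ($i === $parts - 1);
        // The last range absorbs the rounding remainder.
        $end = $isLast ? PHP_INT_MAX : $start + $step - 1;
        $ranges[] = array($start, $end);
        if (!$isLast) {
            $start = $end + 1;
        }
    }
    return $ranges;
}

foreach (splitTokenRange(4) as $range) {
    printf("%d .. %d\n", $range[0], $range[1]);
}
```

Running one worker per range, ideally on the node that owns that range, gives the "map" step; merging the partial results is the "reduce" step.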

Interfaces / protocols

• Thrift

• Binary protocol (1.2+)

• Gossip (internode communication)

Thrift

• Something like SOAP in a binary format

• Tool which generates libraries based on definition files

• Supports many languages (incl. PHP, JS, NodeJS, C, Java, Python, Ruby, ...)

• Also used by HyperTable, HBase, Accumulo and ElasticSearch

• Sole interface before 1.2

Thrift

• No support for collections

Binary protocol

• Recommended protocol for Cassandra 1.2

• Few client libraries available

• No binary connectors were available for PHP

https://github.com/mauritsl/php-cassandra

php-cassandra

require('lib/cassandra/Cassandra.php');
use Cassandra\Connection as Cassandra;

$connection = new Cassandra('localhost', 'keyspace');

$rows = $connection->query('SELECT * FROM user');
foreach ($rows as $row) {
  print $row->firstname;
  print $row->listfield[0];
}

$rows->count();
$rows->getColumns();

Scaling applications

Rule 1: Don’t ask for NoSQL drivers for a CMS

Cassandra does not fit all (same story for every NoSQL solution)

Every page (or API call) should require only a few queries, ideally just one

Static versus Dynamic data

• Static: information that doesn’t change very often

• E.g. translations

• May go in an RDBMS or local storage (files?)

• Dynamic: many changes

• Changes must be visible on all nodes

• Use Cassandra

Local versus Global data

• Logging

• Separate logs per node

• Cache

• Sometimes no need to share cache between nodes

• Statistics

• Can be kept local for a limited time

Local versus Global data

• Sessions

• Dependent on session stickiness

Caching

• Memcache is recommended for local cache

• Cassandra can be used for global cache

• Has a TTL feature

INSERT INTO ... (...) VALUES (...) USING TTL 86400

What about files?

• Use Hadoop Distributed File System (HDFS) or GlusterFS

• Or use Cassandra

What about files?

• Split files in chunks to avoid hotspots and save the heap

• Not uncommon to have files in Cassandra

• github.com/Netflix/astyanax

• Gigabytes are OK, but do not store terabytes
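Chunking a file before storing it can be sketched in a few lines. The 64 KB chunk size is an arbitrary choice for this example, not a recommendation from the slides, and `chunkFile()` and `joinChunks()` are hypothetical helpers; each chunk would be written as its own column or row.

```php
<?php
// Hypothetical chunk size in bytes.
const CHUNK_SIZE = 65536;

// Split file contents into an ordered list of chunks.
function chunkFile($contents, $chunkSize = CHUNK_SIZE)
{
    if ($contents === '') {
        return array();
    }
    return str_split($contents, $chunkSize);
}

// Reassemble the original contents from chunks read back in order.
function joinChunks(array $chunks)
{
    return implode('', $chunks);
}

$contents = str_repeat('x', 150000);
$chunks = chunkFile($contents);
printf("%d chunks\n", count($chunks));
```

Small chunks keep individual reads and writes cheap on the heap, and spreading them over several row keys avoids one node holding the entire file.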

Maximum size of cluster?

• No satisfactory answer

• Probably more dependent on network equipment

• Rack awareness helps here

• Facebook: 150 node cluster, 50TB data (2010)

• Easou: 400 node cluster, 300TB data (300 million images)

Minimum size of a cluster?

• Can run on a single node

• 4GB RAM recommended

• Runs fine on 1GB RAM

“Hot data” should fit in RAM

Installing Cassandra

• Install JDK (Oracle Java recommended, but OpenJDK works OK)

• Add Cassandra repository

• apt-get install cassandra

• Set listen and seed address (IP address of node and seed)

• (Re)start Cassandra

Last words...

Data versus information

Data structure is naturally responsive for information

Predictable performance

History and usage

Jeff Hammerbacher

How to use it

Schema design, CQL3 and limits

Developments

CQL3 and binary protocol

Thank you!

Questions?
