Scaling Web Applications With Cassandra Presentation
-
introduction to cassandra
eben hewitt
september 29, 2010
web 2.0 expo
new york city
-
director, application architecture at a global corp
focus on SOA, SaaS, Events
i wrote this
@ebenhewitt
-
agenda
  context
  features
  data model
  api
-
nosql big data
  mongodb
  couchdb
  tokyo cabinet
  redis
  riak
what about?
  Poet, Lotus, Xindice
  they've been around forever
  rdbms was once the new kid
-
innovation at scale
google bigtable (2006)
  consistency model: strong
  data model: sparse map
  clones: hbase, hypertable
amazon dynamo (2007)
  O(1) dht
  consistency model: client tune-able
  clones: riak, voldemort
cassandra ~= bigtable + dynamo
-
proven
The Facebook stores 150TB of data on 150 nodes
web 2.0
used at Twitter, Rackspace, Mahalo, Reddit, Cloudkick, Cisco, Digg, SimpleGeo, Ooyala, OpenX, others
-
cap theorem
consistency
  all clients have same view of data
availability
  writeable in the face of node failure
partition tolerance
  processing can continue in the face of network failure (crashed router, broken network)
-
daniel abadi: pacelc
-
write consistency
  Level                        Description
  ZERO                         Good luck with that
  ANY                          1 replica (hints count)
  ONE                          1 replica. read repair in bkgnd
  QUORUM (DCQ for RackAware)   (N / 2) + 1
  ALL                          N = replication factor

read consistency
  Level                        Description
  ZERO                         Ummm
  ANY                          Try ONE instead
  ONE                          1 replica
  QUORUM (DCQ for RackAware)   Return most recent TS after (N / 2) + 1 report
  ALL                          N = replication factor
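
The QUORUM rows can be worked through with a little arithmetic. The sketch below computes (N / 2) + 1 with integer division and checks the commonly cited overlap rule R + W > N, which is not stated on the slide and is included here as background. The class and method names are hypothetical, not part of any Cassandra API.

```java
public class QuorumMath {
    // Quorum size for a replication factor N, as on the slide: (N / 2) + 1,
    // using integer division.
    static int quorum(int replicationFactor) {
        return (replicationFactor / 2) + 1;
    }

    // Background rule (an addition, not from the slide): a read is guaranteed
    // to touch at least one replica holding the latest acknowledged write
    // when R + W > N.
    static boolean readOverlapsWrite(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int n = 3;
        int q = quorum(n);
        System.out.println("N=" + n + " quorum=" + q);
        System.out.println("QUORUM read + QUORUM write overlap: " + readOverlapsWrite(q, q, n));
        System.out.println("ONE read + ONE write overlap: " + readOverlapsWrite(1, 1, n));
    }
}
```

With N = 3, reading and writing at QUORUM always overlap, while reading and writing at ONE may not.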
-
agenda
  context
  features
  data model
  api
-
cassandra properties
tuneably consistent
very fast writes
highly available
fault tolerant
linear, elastic scalability
decentralized/symmetric
~12 client languages
  Thrift RPC API
~automatic provisioning of new nodes
O(1) dht
big data
-
write op
-
Staged Event-Driven Architecture
A general-purpose framework for high concurrency & load conditioning
Decomposes applications into stages separated by queues
Adopt a structured approach to event-driven concurrency
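
To make "stages separated by queues" concrete, here is a minimal sketch of two stages connected by bounded queues. It illustrates the general idea only and is not Cassandra's internal code; the stage handlers (trim, upper-case), the queue sizes, and the stop sentinel are all assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

public class SedaSketch {
    // Sentinel event that tells a stage to shut down (an assumption of this sketch).
    private static final String STOP = "__stop__";

    // One stage: a worker thread that takes events from its inbox queue,
    // applies a handler, and puts the result on the next stage's queue.
    static Thread stage(BlockingQueue<String> in, BlockingQueue<String> out,
                        Function<String, String> handler) {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String event = in.take();
                    if (event.equals(STOP)) {
                        out.put(STOP); // pass the shutdown signal downstream
                        return;
                    }
                    out.put(handler.apply(event));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        return worker;
    }

    // Push events through two stages and collect what comes out the far end.
    static List<String> run(List<String> events) {
        // Bounded queues between stages provide simple load conditioning:
        // a full queue blocks the producer.
        BlockingQueue<String> first = new LinkedBlockingQueue<>(16);
        BlockingQueue<String> second = new LinkedBlockingQueue<>(16);
        BlockingQueue<String> done = new LinkedBlockingQueue<>();
        Thread a = stage(first, second, String::trim);
        Thread b = stage(second, done, String::toUpperCase);
        try {
            for (String e : events) {
                first.put(e);
            }
            first.put(STOP);
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        List<String> results = new ArrayList<>();
        for (String r : done) {
            if (!r.equals(STOP)) {
                results.add(r);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        events.add("  get ");
        events.add("put  ");
        System.out.println(run(events));
    }
}
```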
-
instrumentation
-
data replication
-
partitioner smack-down
Random
  system will use MD5(key) to distribute data across nodes
  even distribution of keys from one CF across ranges/nodes
Order Preserving
  key distribution determined by token
  lexicographical ordering
  required for range queries: scan over rows like a cursor in an index
  can specify the token for this node to use
  scrabble distribution
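
The difference between the two partitioners can be sketched as two ways of turning a row key into a token, followed by the same ring lookup. This is an illustration under assumptions, not Cassandra's partitioner code: the node names, the ring tokens, and the rule "the first node whose token is >= the key's token owns the key" are choices made for the example.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

public class PartitionerSketch {
    // Random-style token: the MD5 digest of the key, read as a non-negative integer.
    static BigInteger md5Token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Ring lookup: the first node whose token is >= the key's token,
    // wrapping around to the first node on the ring.
    static <T> String ownerOf(T token, TreeMap<T, String> ring) {
        Map.Entry<T, String> entry = ring.ceilingEntry(token);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        // Random: tokens split the 128-bit MD5 space into three equal ranges.
        BigInteger max = BigInteger.ONE.shiftLeft(128);
        TreeMap<BigInteger, String> randomRing = new TreeMap<>();
        randomRing.put(max.divide(BigInteger.valueOf(3)), "node-a");
        randomRing.put(max.multiply(BigInteger.valueOf(2)).divide(BigInteger.valueOf(3)), "node-b");
        randomRing.put(max, "node-c");

        // Order preserving: the key itself is the token, so ranges are lexicographic.
        TreeMap<String, String> orderedRing = new TreeMap<>();
        orderedRing.put("h", "node-a");
        orderedRing.put("p", "node-b");
        orderedRing.put("z", "node-c");

        for (String key : new String[] {"alison", "eben", "jane"}) {
            System.out.println(key + " random=" + ownerOf(md5Token(key), randomRing)
                    + " ordered=" + ownerOf(key, orderedRing));
        }
    }
}
```

With the ordered ring, neighbouring keys land on the same node, which is what makes range scans possible and also what makes the distribution uneven.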
-
agenda
  context
  features
  data model
  api
-
structure
-
keyspace
~= database
typically one per application
some settings are configurable only per keyspace
-
column family
group records of similar kind
not same kind, because CFs are sparse tables
ex: User, Address, Tweet, PointOfInterest, HotelRoom
-
think of cassandra as row-oriented
each row is uniquely identifiable by key
rows group columns and super columns
-
column family
[diagram: rows key123 and key456, each holding a different set of columns; labels shown: user=eben, n=42, nickname=The Situation, user=alison, icon=]
-
json-like notation
User {
  123 : { email: [email protected], icon: },
  456 : { email: [email protected], location: The Danger Zone }
}
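
One way to picture this notation is as a map of row keys to sorted maps of columns. The sketch below models a sparse column family that way; it is a mental model only, not how Cassandra stores data, and the class name, method names, and example.com addresses are placeholders invented for the example.

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class ColumnFamilySketch {
    // Column family: row key -> (column name -> value), columns kept sorted by name.
    private final SortedMap<String, SortedMap<String, String>> rows = new TreeMap<>();

    // Set one column in one row, creating the row on first use.
    void set(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Read one column; returns null when the row or the column does not exist.
    String get(String rowKey, String column) {
        SortedMap<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    public static void main(String[] args) {
        ColumnFamilySketch user = new ColumnFamilySketch();
        // Rows are sparse: each row carries only the columns it needs.
        user.set("123", "email", "eben@example.com");   // placeholder address
        user.set("123", "icon", "");
        user.set("456", "email", "alison@example.com"); // placeholder address
        user.set("456", "location", "The Danger Zone");

        System.out.println(user.get("456", "location"));
        System.out.println(user.get("123", "location")); // row 123 has no such column
    }
}
```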
-
0.6 example
$ cassandra -f
$ bin/cassandra-cli
cassandra> connect localhost/9160
cassandra> set Keyspace1.Standard1['eben']['age']='29'
cassandra> set Keyspace1.Standard1['eben']['email']='[email protected]'
cassandra> get Keyspace1.Standard1['eben']['age']
=> (column=6e616d65, value=39, timestamp=1282170655390000)
-
a column has 3 parts
name
  byte[]
  determines sort order
  used in queries
  indexed
value
  byte[]
  you don't query on column values
timestamp
  long (clock)
  last write wins conflict resolution
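
"Last write wins" can be sketched as a comparison of timestamps between two versions of the same column. The class below is hypothetical and uses String values for readability where the slide says byte[]; how ties are broken here (keep the first argument) is an assumption of the sketch.

```java
public class LastWriteWins {
    // A column as described on the slide: name, value, and a client-supplied timestamp.
    static final class Column {
        final String name;
        final String value;
        final long timestamp;

        Column(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Keep whichever version carries the larger timestamp.
    // On equal timestamps this sketch keeps the first argument (an assumption).
    static Column resolve(Column a, Column b) {
        return b.timestamp > a.timestamp ? b : a;
    }

    public static void main(String[] args) {
        Column older = new Column("email", "old@example.com", 1000L);
        Column newer = new Column("email", "new@example.com", 2000L);
        // Arrival order does not matter; only the timestamps do.
        System.out.println(resolve(older, newer).value);
        System.out.println(resolve(newer, older).value);
    }
}
```

Because the timestamp comes from the client, clients whose clocks disagree can cause an older write to win.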
-
column comparators
  byte
  utf8
  long
  timeuuid
  lexicaluuid
ex: lat/long
-
super column
super columns group columns under a common name
-
PointOfInterest (super column family)
[diagram: row 10017 holds super columns Central Park (desc=Fun to walk in.) and Empire State Bldg (phone=212. 555.11212, desc=Great view from 102nd floor!); row 85255 holds super column Phoenix Zoo]
-
PointOfInterest {
  key: 85255 {
    Phoenix Zoo { phone: 480-555-5555, desc: They have animals here. },
    Spring Training { phone: 623-333-3333, desc: Fun for baseball fans. },
  }, //end phx
  key: 10019 {
    Central Park { desc: Walk around. It's pretty. },
    Empire State Building { phone: 212-777-7777, desc: Great view from 102nd floor. }
  } //end nyc
}
(callouts: super column family, key, super column, column; flexible schema)
-
about super column families
sub-column names in a SCF are not indexed
top level columns (SCF Name) are always indexed
often used for denormalizing data from standard CFs
-
agenda
  context
  features
  data model
  api
-
slice predicate
data structure describing columns to return
SliceRange
  start column name
  finish column name (can be empty to stop on count)
  reverse
  count (like LIMIT)
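
The four SliceRange fields can be illustrated against a single row held in a sorted map. This is a sketch of the semantics only, not the Thrift SliceRange class: the method name, the sample row, and the use of String column names are assumptions, and an empty string stands for "unbounded".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class SliceSketch {
    // Return up to `count` column names between start and finish (both inclusive).
    // An empty start or finish means "no bound on that side". When reversed is true,
    // the row is walked in descending order, so start should sort after finish.
    // A finish that lies before start in walk order is rejected by the map
    // with an IllegalArgumentException.
    static List<String> slice(NavigableMap<String, String> row, String start, String finish,
                              boolean reversed, int count) {
        NavigableMap<String, String> view = reversed ? row.descendingMap() : row;
        if (!start.isEmpty()) {
            view = view.tailMap(start, true);
        }
        if (!finish.isEmpty()) {
            view = view.headMap(finish, true);
        }
        List<String> names = new ArrayList<>();
        for (String name : view.keySet()) {
            if (names.size() >= count) {
                break; // count works like LIMIT
            }
            names.add(name);
        }
        return names;
    }

    // A small example row with five columns (hypothetical data).
    static NavigableMap<String, String> sampleRow() {
        NavigableMap<String, String> row = new TreeMap<>();
        row.put("age", "29");
        row.put("email", "eben@example.com");
        row.put("icon", "");
        row.put("location", "The Danger Zone");
        row.put("nickname", "The Situation");
        return row;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> row = sampleRow();
        System.out.println(slice(row, "email", "location", false, 10)); // [email, icon, location]
        System.out.println(slice(row, "", "", false, 2));               // [age, email]
        System.out.println(slice(row, "location", "email", true, 2));   // [location, icon]
    }
}
```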
-
read api
get() : Column
  get the Col or SC at given ColPath
  COSC cosc = client.get(key, path, CL);
get_slice() : List
  get Cols in one row, specified by SlicePredicate:
  List results = client.get_slice(key, parent, predicate, CL);
multiget_slice() : Map
  get slices for list of keys, based on SlicePredicate
  Map results = client.multiget_slice(rowKeys, parent, predicate, CL);
get_range_slices() : List
  returns multiple Cols according to a range
  range is startkey, endkey, starttoken, endtoken:
  List slices = client.get_range_slices(parent, predicate, keyRange, CL);
-
write api
client.insert(userKeyBytes, parent, new Column("band".getBytes(UTF8), "Funkadelic".getBytes(), clock), CL);
batch_mutate
  void batch_mutate(map, CL)
remove
  void remove(byte[], ColumnPath column_path, Clock, CL)
-
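batch_mutate
//create param
Map<byte[], Map<String, List<Mutation>>> mutationMap = new HashMap<byte[], Map<String, List<Mutation>>>();

//create Cols for Muts
Column nameCol = new Column("name".getBytes(UTF8), "Funkadelic".getBytes("UTF-8"), new Clock(System.nanoTime()));
//wrap the Column so it can go into a Mutation
ColumnOrSuperColumn nameCosc = new ColumnOrSuperColumn();
nameCosc.column = nameCol;
Mutation nameMut = new Mutation();
nameMut.column_or_supercolumn = nameCosc; //also phone, etc

Map<String, List<Mutation>> muts = new HashMap<String, List<Mutation>>();
List<Mutation> cols = new ArrayList<Mutation>();
cols.add(nameMut);
cols.add(phoneMut);
muts.put(CF, cols);
//outer map key is a row key; inner map key is the CF name
mutationMap.put(rowKey.getBytes(), muts);
//send to server
client.batch_mutate(mutationMap, CL);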
-
raw thrift: for masochists only
pycassa (python)
fauna (ruby)
hector (java)
pelops (java)
kundera (JPA)
hectorSharp (C#)
-
what about
  SELECT WHERE
  ORDER BY
  JOIN ON
  GROUP
?
-
rdbms: domain-based model what answers do I have?
cassandra: query-based model what questions do I have?
-
SELECT WHERE
cassandra is an index factory
USER
  Key: UserID
  Cols: username, email, birth date, city, state
How to support this query?
  SELECT * FROM User WHERE city = 'Scottsdale'
Create a new CF called UserCity:
USERCITY
  Key: city
  Cols: IDs of the users in that city.
  Also uses the Valueless Column pattern
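
The UserCity idea can be sketched with two in-memory maps standing in for the two column families. The application writes to both on every insert, and the index row stores user IDs as column names with empty values. All class and method names here are hypothetical; a real client would issue these writes through the Thrift API or a client library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class UserCityIndex {
    // USER column family: user id -> columns.
    private final Map<String, Map<String, String>> user = new HashMap<>();
    // USERCITY column family: city -> (user id -> empty value).
    // The column name carries the data; the value is unused ("valueless column").
    private final Map<String, SortedMap<String, String>> userCity = new HashMap<>();

    // Write the user row and the matching index entry together.
    void insertUser(String id, String username, String city) {
        Map<String, String> cols = new HashMap<>();
        cols.put("username", username);
        cols.put("city", city);
        user.put(id, cols);
        userCity.computeIfAbsent(city, c -> new TreeMap<>()).put(id, "");
    }

    // Equivalent of: SELECT * FROM User WHERE city = ?
    // Read one index row, then use its column names as user ids.
    List<String> usersIn(String city) {
        List<String> ids = new ArrayList<>();
        SortedMap<String, String> row = userCity.get(city);
        if (row != null) {
            ids.addAll(row.keySet());
        }
        return ids;
    }

    public static void main(String[] args) {
        UserCityIndex index = new UserCityIndex();
        index.insertUser("user1", "eben", "Scottsdale");
        index.insertUser("user2", "alison", "Scottsdale");
        index.insertUser("user3", "jane", "Phoenix");
        System.out.println(index.usersIn("Scottsdale")); // [user1, user2]
    }
}
```

The cost of this pattern is that every write touches two column families, and the application is responsible for keeping them in step.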
-
SELECT WHERE pt 2
Use an aggregate key
  state:city: { user1, user2 }
Get rows between AZ: & AZ; for all Arizona users
Get rows between AZ:Scottsdale & AZ:Scottsdale1 for all Scottsdale users
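
The two range scans work because ';' sorts immediately after ':' and because "AZ:Scottsdale1" sorts after "AZ:Scottsdale". The sketch below reproduces them over a sorted map standing in for rows stored under an order-preserving partitioner. The sample rows and the half-open range (start inclusive, end exclusive) are assumptions of the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class AggregateKeyScan {
    // Rows keyed by "state:city", kept in key order as an
    // order-preserving partitioner would keep them (hypothetical data).
    static TreeMap<String, List<String>> sampleIndex() {
        TreeMap<String, List<String>> rows = new TreeMap<>();
        rows.put("AZ:Phoenix", Arrays.asList("user3"));
        rows.put("AZ:Scottsdale", Arrays.asList("user1", "user2"));
        rows.put("NY:New York", Arrays.asList("user4"));
        return rows;
    }

    // Collect users from every row whose key k satisfies startKey <= k < endKey.
    static List<String> usersBetween(TreeMap<String, List<String>> rows,
                                     String startKey, String endKey) {
        List<String> users = new ArrayList<>();
        for (List<String> ids : rows.subMap(startKey, true, endKey, false).values()) {
            users.addAll(ids);
        }
        return users;
    }

    public static void main(String[] args) {
        TreeMap<String, List<String>> rows = sampleIndex();
        // All Arizona users: every key that starts with "AZ:" sorts before "AZ;".
        System.out.println(usersBetween(rows, "AZ:", "AZ;"));                      // [user3, user1, user2]
        // All Scottsdale users.
        System.out.println(usersBetween(rows, "AZ:Scottsdale", "AZ:Scottsdale1")); // [user1, user2]
    }
}
```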
-
ORDER BY
Rows are placed according to their Partitioner:
  Random: MD5 of key
  Order-Preserving: actual key
Rows are sorted by key, regardless of partitioner
Columns are sorted according to CompareWith or CompareSubcolumnsWith
-
is cassandra a good fit?
you need really fast writes
you need durability
you have lots of data
  > GBs
  >= three servers
your app is evolving
  startup mode, fluid data structure
loose domain data
  points of interest
your programmers can deal
  documentation
  complexity
  consistency model
  change
  visibility tools
your operations can deal
  hardware considerations
  can move data
  JMX monitoring
-
thank you!
@ebenhewitt