Cassandra for Python Developers


A high-level introduction to Apache Cassandra, followed by an introduction to pycassa, the Python client library for Cassandra. Presented at PyTexas 2011 by Tyler Hobbs.

Transcript of Cassandra for Python Developers

Page 1: Cassandra for Python Developers

Cassandra for Python Developers

Tyler Hobbs

Page 2: Cassandra for Python Developers

History

2008: Open-sourced by Facebook
2009: Apache Incubator
2010: Top-level Apache project
2010: DataStax founded

Page 3: Cassandra for Python Developers

Strengths

Scalable
– 2x Nodes == 2x Performance

Reliable (Available)
– Replication that works
– Multi-DC support
– No single point of failure

Page 4: Cassandra for Python Developers

Strengths

Fast
– 10-30k writes/sec, 1-10k reads/sec

Analytics
– Integrated Hadoop support

Page 5: Cassandra for Python Developers

Weaknesses

No ACID transactions
– Don't need these as often as you'd think

Limited support for ad-hoc queries
– You'll give these up anyway when sharding an RDBMS

Generally complements another system
– Not intended to be one-size-fits-all

Page 6: Cassandra for Python Developers

Clustering

Every node plays the same role
– No masters, slaves, or special nodes

Page 7: Cassandra for Python Developers

Clustering

[Figure: cluster nodes labeled 0, 10, 20, 30, 40, 50]

Page 8: Cassandra for Python Developers

Clustering

[Figure: cluster nodes labeled 0, 10, 20, 30, 40, 50. Key: "www.google.com"]

Page 9: Cassandra for Python Developers

Clustering

[Figure: cluster nodes labeled 0, 10, 20, 30, 40, 50. Key: "www.google.com"; md5("www.google.com") gives 14]

Page 10: Cassandra for Python Developers

Clustering

[Figure: cluster nodes labeled 0, 10, 20, 30, 40, 50. Key: "www.google.com"; md5("www.google.com") gives 14]

Page 11: Cassandra for Python Developers

Clustering

[Figure: cluster nodes labeled 0, 10, 20, 30, 40, 50. Key: "www.google.com"; md5("www.google.com") gives 14]

Page 12: Cassandra for Python Developers

Clustering

[Figure: cluster nodes labeled 0, 10, 20, 30, 40, 50. Key: "www.google.com"; md5("www.google.com") gives 14]

Replication Factor = 3
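The placement shown on the clustering slides can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: the ring size of 60, the function names, and the rule "first node at or after the token, then the next nodes clockwise" are assumptions made for illustration.

```python
import bisect
import hashlib

# Assumption: a toy ring of 60 positions with six nodes at the tokens shown
# on the slides. Real clusters use a far larger token space.
RING_SIZE = 60
NODE_TOKENS = [0, 10, 20, 30, 40, 50]


def token_for_key(key, ring_size=RING_SIZE):
    """Hash a row key with MD5 and fold the digest onto the toy ring."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % ring_size


def replicas_for_token(token, node_tokens=NODE_TOKENS, replication_factor=3):
    """Return the nodes holding a token: the first node at or after it,
    then the next nodes clockwise, wrapping past the highest token."""
    ordered = sorted(node_tokens)
    start = bisect.bisect_left(ordered, token) % len(ordered)
    count = min(replication_factor, len(ordered))
    return [ordered[(start + i) % len(ordered)] for i in range(count)]


# The slides place "www.google.com" at token 14; the toy hash above will
# generally land somewhere else, so the slide's value is used directly here.
print(replicas_for_token(14))   # [20, 30, 40]
print(replicas_for_token(55))   # wraps around: [0, 10, 20]
print(replicas_for_token(token_for_key("www.google.com")))
```

With a replication factor of 3, each key lives on three consecutive nodes, so losing any single node still leaves two copies.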

Page 13: Cassandra for Python Developers

Clustering

Client can talk to any node

Page 14: Cassandra for Python Developers

Data Model

Keyspace
– A collection of Column Families
– Controls replication settings

Column Family
– Kinda resembles a table

Page 15: Cassandra for Python Developers

Column Families

Static
– Object data

Dynamic
– Pre-calculated query results

Page 16: Cassandra for Python Developers

Static Column Families

Users

zznate     password: *    name: Nate
driftx     password: *    name: Brandon
thobbs     password: *    name: Tyler
jbellis    password: *    name: Jonathan    site: riptano.com

Page 17: Cassandra for Python Developers

Dynamic Column Families

Following

[Figure: rows keyed by zznate, driftx, thobbs, and jbellis. Each row holds one column per followed user; column names shown include driftx, thobbs, mdennis, zznate, pcmanus, and xedin.]

Page 18: Cassandra for Python Developers

Dynamic Column Families

Timeline of tweets by a user
Timeline of tweets by all of the people a user is following
List of comments sorted by score
List of friends grouped by state
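One way to picture a dynamic column family holding pre-calculated query results is the sketch below. It uses plain Python dicts in place of Cassandra, and every name in it (the two timeline dicts, `post_tweet`, `read_timeline`, the follower sample data) is hypothetical. The idea it illustrates: write each tweet into every reader's row at post time, with the timestamp as the column name, so that a read is a single sorted row fetch.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for two dynamic column families.
# Row key -> {column name: column value}; column names are tweet timestamps.
user_timeline = defaultdict(dict)   # tweets written by a user
home_timeline = defaultdict(dict)   # tweets by everyone a user follows

# Follower lists, keyed by the user being followed (sample data).
followers = {"thobbs": ["driftx", "zznate"], "driftx": ["zznate"]}


def post_tweet(author, timestamp, text):
    """Write the tweet once per reader so that reads need no joins."""
    user_timeline[author][timestamp] = text
    for follower in followers.get(author, []):
        home_timeline[follower][timestamp] = "%s: %s" % (author, text)


def read_timeline(timeline, user, count=10):
    """Return the newest `count` columns, mimicking a reversed column slice."""
    row = timeline[user]
    newest_first = sorted(row, reverse=True)[:count]
    return [(ts, row[ts]) for ts in newest_first]


post_tweet("thobbs", 1001, "hello from PyTexas")
post_tweet("driftx", 1002, "nice talk")
print(read_timeline(home_timeline, "zznate"))
# [(1002, 'driftx: nice talk'), (1001, 'thobbs: hello from PyTexas')]
```

The sketch ignores timestamp collisions; two tweets with the same timestamp would overwrite each other in a follower's row.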

Page 19: Cassandra for Python Developers

Pycassa

Python client library for Cassandra

Open Source (MIT License)
– www.github.com/pycassa/pycassa

Users
– Reddit
– ~10k GitHub downloads of every version

Page 20: Cassandra for Python Developers

Installing Pycassa

easy_install pycassa
– or pip

Page 21: Cassandra for Python Developers

Basic Layout

pycassa.pool
– Connection pooling

pycassa.columnfamily
– Primary module for the data API

pycassa.system_manager
– Schema management

Page 22: Cassandra for Python Developers

The Data API

RPC-based API

Rows are like a sorted list of (name, value) tuples
– Like a dict, but sorted by the names
– OrderedDicts are used to preserve sorting

Page 23: Cassandra for Python Developers

Inserting Data

>>> from pycassa.pool import ConnectionPool
>>> from pycassa.columnfamily import ColumnFamily
>>>
>>> pool = ConnectionPool("MyKeyspace")
>>> cf = ColumnFamily(pool, "MyCF")
>>>
>>> cf.insert("key", {"col_name": "col_value"})
>>> cf.get("key")
{"col_name": "col_value"}

Page 24: Cassandra for Python Developers

Inserting Data

>>> columns = {"aaa": 1, "ccc": 3}
>>> cf.insert("key", columns)
>>> cf.get("key")
{"aaa": 1, "ccc": 3}
>>>
>>> # Updates are the same as inserts
>>> cf.insert("key", {"aaa": 42})
>>> cf.get("key")
{"aaa": 42, "ccc": 3}
>>>
>>> # We can insert anywhere in the row
>>> cf.insert("key", {"bbb": 2, "ddd": 4})
>>> cf.get("key")
{"aaa": 42, "bbb": 2, "ccc": 3, "ddd": 4}
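The insert behaviour on this slide (an insert is a merge, and reads come back sorted by column name) can be imitated without a cluster. The sketch below is a hypothetical in-memory model, not pycassa: `rows`, `insert`, and `get` are stand-ins written only to mirror the slide's results.

```python
from collections import OrderedDict

# Hypothetical in-memory model of one column family: row key -> columns.
rows = {}


def insert(key, columns):
    """Merge columns into the row; existing names are overwritten."""
    rows.setdefault(key, {}).update(columns)


def get(key):
    """Return the row's columns ordered by column name."""
    return OrderedDict(sorted(rows[key].items()))


insert("key", {"aaa": 1, "ccc": 3})
insert("key", {"aaa": 42})             # update an existing column
insert("key", {"bbb": 2, "ddd": 4})    # add columns anywhere in the row
print(list(get("key").items()))
# [('aaa', 42), ('bbb', 2), ('ccc', 3), ('ddd', 4)]
```

Cassandra itself resolves competing writes to the same column by timestamp; this model simply lets the latest call win.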

Page 25: Cassandra for Python Developers

Fetching Data

>>> cf.get("key")
{"aaa": 42, "bbb": 2, "ccc": 3, "ddd": 4}
>>>
>>> # Get a set of columns by name
>>> cf.get("key", columns=["bbb", "ddd"])
{"bbb": 2, "ddd": 4}

Page 26: Cassandra for Python Developers

Fetching Data

>>> # Get a slice of columns
>>> cf.get("key", column_start="bbb",
...        column_finish="ccc")
{"bbb": 2, "ccc": 3}
>>>
>>> # Slice from "ccc" to the end
>>> cf.get("key", column_start="ccc")
{"ccc": 3, "ddd": 4}
>>>
>>> # Slice from "bbb" to the beginning
>>> cf.get("key", column_start="bbb",
...        column_reversed=True)
{"bbb": 2, "aaa": 42}

Page 27: Cassandra for Python Developers

Fetching Data

>>> # Get the first two columns in the row
>>> cf.get("key", column_count=2)
{"aaa": 42, "bbb": 2}
>>>
>>> # Get the last two columns in the row
>>> cf.get("key", column_reversed=True,
...        column_count=2)
{"ddd": 4, "ccc": 3}
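The slicing rules from the fetching slides can be checked against a small pure-Python model. `slice_row` below is a hypothetical helper, not part of pycassa; it borrows the keyword names from the slides and assumes that both bounds are inclusive and that an empty string means "no bound".

```python
from collections import OrderedDict


def slice_row(row, column_start="", column_finish="",
              column_reversed=False, column_count=100):
    """Toy model of a column slice over a row sorted by column name.

    Empty start/finish strings mean "unbounded"; both bounds are inclusive.
    """
    names = sorted(row, reverse=column_reversed)
    selected = []
    for name in names:
        if column_reversed:
            before_start = column_start != "" and name > column_start
            past_finish = column_finish != "" and name < column_finish
        else:
            before_start = column_start != "" and name < column_start
            past_finish = column_finish != "" and name > column_finish
        if before_start:
            continue                    # not yet inside the slice
        if past_finish or len(selected) >= column_count:
            break                       # walked off the end of the slice
        selected.append((name, row[name]))
    return OrderedDict(selected)


row = {"aaa": 42, "bbb": 2, "ccc": 3, "ddd": 4}
print(list(slice_row(row, column_start="bbb", column_finish="ccc").items()))
print(list(slice_row(row, column_start="bbb", column_reversed=True).items()))
print(list(slice_row(row, column_reversed=True, column_count=2).items()))
```

The three calls reproduce the slide results: "bbb" through "ccc", "bbb" back to "aaa", and the last two columns in reverse order.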

Page 28: Cassandra for Python Developers

Fetching Multiple Rows

>>> columns = {"col": "val"}
>>> cf.batch_insert({"k1": columns,
...                  "k2": columns,
...                  "k3": columns})
>>>
>>> # Get multiple rows by name
>>> cf.multiget(["k1", "k2"])
{"k1": {"col": "val"}, "k2": {"col": "val"}}
>>>
>>> # You can get slices of each row, too
>>> cf.multiget(["k1", "k2"], column_start="bbb")
…

Page 29: Cassandra for Python Developers

Fetching a Range of Rows

>>> # Get a generator over all of the rows
>>> for key, columns in cf.get_range():
...     print key, columns
"k1" {"col": "val"}
"k2" {"col": "val"}
"k3" {"col": "val"}
>>>
>>> # You can get slices of each row
>>> cf.get_range(column_start="bbb")
…

Page 30: Cassandra for Python Developers

Fetching Rows by Secondary Index

>>> from pycassa.index import *
>>>
>>> # Build up our index clause to match
>>> exp = create_index_expression("name", "Joe")
>>> clause = create_index_clause([exp])
>>> matches = users.get_indexed_slices(clause)
>>>
>>> # matches is a generator over matching rows
>>> for key, columns in matches:
...     print key, columns
"13" {"name": "Joe", "nick": "thatguy2"}
"257" {"name": "Joe", "nick": "flowers"}
"98" {"name": "Joe", "nick": "fr0d0"}

Page 31: Cassandra for Python Developers

Deleting Data

>>> # Delete a whole row
>>> cf.remove("key1")
>>>
>>> # Or selectively delete columns
>>> cf.remove("key2", columns=["name", "date"])

Page 32: Cassandra for Python Developers

Connection Management

pycassa.pool.ConnectionPool
– Takes a list of servers
  • Can be any set of nodes in your cluster
– pool_size, max_retries, timeout
– Automatically retries operations against other nodes
  • Writes are idempotent!
– Individual node failures are transparent
– Thread safe
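The retry behaviour described on this slide can be sketched as a small loop. This is not pycassa's code: `run_with_failover`, `NodeDown`, `write_column`, and the host names are all hypothetical, and only the `max_retries` name is borrowed from the slide. The sketch shows why idempotent writes matter: a request that may or may not have reached a failing node can be sent again to another node without harm.

```python
import random


class NodeDown(Exception):
    """Raised by the hypothetical `operation` when a node is unreachable."""


def run_with_failover(servers, operation, max_retries=5):
    """Try `operation(server)` on one server, retrying on others on failure.

    Retrying is only safe when the operation is idempotent, which is the
    property the slide points out for Cassandra writes.
    """
    candidates = list(servers)
    if not candidates:
        raise ValueError("at least one server is required")
    random.shuffle(candidates)          # spread load across the cluster
    last_error = None
    for attempt in range(max_retries + 1):
        server = candidates[attempt % len(candidates)]
        try:
            return operation(server)
        except NodeDown as error:
            last_error = error          # remember the failure and move on
    raise last_error


def write_column(server):
    """Stand-in for a real request; pretend node-b is down."""
    if server == "node-b:9160":
        raise NodeDown(server)
    return "written via " + server


servers = ["node-a:9160", "node-b:9160", "node-c:9160"]
result = run_with_failover(servers, write_column)
print(result)
```

Because the starting node is chosen at random, the printed result names either node-a or node-c; the caller never sees that node-b failed.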

Page 33: Cassandra for Python Developers

Async Options

eventlet
– Just need to monkeypatch socket and threading

Twisted
– Use Telephus instead of Pycassa
– www.github.com/driftx/telephus
– Less friendly, documented, etc

Page 34: Cassandra for Python Developers

Tyler Hobbs
@tylhobbs
