Post on 22-Nov-2014
@ebenhewitt
10.14.10
strange loop
st louis
adopting apache cassandra
• i wrote this
agenda
• context
• features
• data model
• api
“If I had asked the people what they wanted, they would have said ‘faster horses’”.
--Henry Ford
so it turns out, there’s a lot of data in the world…
• Google processes 8 EB of data every year
– 24 PB every day
– 1 PB is a quadrillion bytes
– 1 EB is 1000 PB (1024 PB in binary units)
• eBay
– 50 TB of new data every day
• World of Warcraft
– uses 1.3 PB to store the game
• Chevron
– 2 TB of data every day
• WalMart’s Customer Database
– 2004: 0.5 PB = 500 TB
The movie Avatar required 1PB storage
…or the equivalent of a single MP3
…if that MP3 was 32 years long
it ain’t getting any smaller
• 2006: 166 exabytes
• 2010: >1000 exabytes
how do you scale relational databases?
1. tune queries
2. indexes
3. vertical scaling
– works for a time
– eventually need to add boxes
4. shard
– create a horizontal partition (how to join now?)
– argh
5. denormalize
6. now you have new problems
– data replication, consistency
– master/slave (SPOF)
7. update configuration management
– start doing undesirable things (turn off journaling)
– caching
the nosql value proposition:
• sql sux
• rdbms sux
• throw out everything you know
• run around like a crazy person
“nosql” “big data”
• mongodb
• couchdb
• tokyo cabinet
• redis
• riak
• what about?
– Poet, Lotus, Xindice
– they’ve been around forever
– rdbms was once the new kid…
what is cassandra?
• distributed
• decentralized
• fault tolerant
• elastic
• durable
• database
cassandra.apache.org
daughter of Priam & Hecuba
innovation at scale
google bigtable (2006)
• consistency model: strong
• data model: sparse map
• clones: hbase, hypertable
• column family, sequential writes, bloom filters, linear insert performance
• CP
amazon dynamo (2007)
• consistency model: client tune-able
• data model: key-value
• O(1) dht
• clones: riak, voldemort
• symmetric p2p, gossip
• AP
proven
• SimpleGeo: >50 Large EC2 instances
• Digg: 3TB of data
• The Facebook stores 150TB of data on 150 nodes
• US Government has 400 nodes for analytics in intelligence community in partnership with Digital Reasoning
• Used at Twitter, Rackspace, Mahalo, Reddit,
no free lunch
• no transactions
• no joins
• no ad hoc queries
agenda
• context
• features
• data model
• api
cassandra properties
• tuneably consistent
• durable, fault tolerant
• very fast writes
• highly available
• linear, elastic scalability
• decentralized/symmetric
• ~12 client languages
– Thrift RPC API
• ~automatic provisioning of new nodes
• O(1) dht
• big data
consistency
• consistency
– all clients have same view of data
• availability
– writeable in the face of node failure
• partition tolerance
– processing can continue in the face of network failure (crashed router, broken
daniel abadi: pacelc
partition! trade-off A & C
normal condition: tradeoff latency & consistency
write consistency
Level     Description
ZERO      Good luck with that
ANY       1 replica (hints count)
ONE       1 replica. read repair in bkgnd
QUORUM    (N / 2) + 1
ALL       N = replication factor

read consistency
Level     Description
ZERO      Ummm…
ANY       Try ONE instead
ONE       1 replica
QUORUM    Return most recent TS after (N / 2) + 1 report
ALL       N = replication factor
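The QUORUM rows above reduce to a little integer arithmetic. The sketch below is a hypothetical helper, not part of the Cassandra API; the class and method names are made up for illustration. It also encodes the general quorum rule (not stated on the slide) that a read is guaranteed to see the latest write when the write and read replica counts together exceed the replication factor.

```java
// Hypothetical helper for illustration only; not a Cassandra class.
final class QuorumSketch {

    /** Replicas that must respond at QUORUM: (N / 2) + 1, using integer division. */
    static int quorum(int replicationFactor) {
        if (replicationFactor < 1) {
            throw new IllegalArgumentException("replication factor must be >= 1");
        }
        return replicationFactor / 2 + 1;
    }

    /** True when every read set shares at least one replica with every write set (W + R > N). */
    static boolean overlaps(int writeReplicas, int readReplicas, int replicationFactor) {
        return writeReplicas + readReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int n = 3;
        int q = quorum(n);
        System.out.println("N=" + n + " quorum=" + q);
        System.out.println("QUORUM write + QUORUM read overlap: " + overlaps(q, q, n));
        System.out.println("ONE write + ONE read overlap: " + overlaps(1, 1, n));
    }
}
```

With N = 3, QUORUM is 2, so a QUORUM write followed by a QUORUM read always touches at least one common replica, while ONE plus ONE does not.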
durability
fast writes: staged eda
• A general-purpose framework for high concurrency & load conditioning
• Decomposes applications into stages separated by queues
• Adopt a structured approach to event-driven concurrency
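To make "stages separated by queues" concrete, here is a minimal two-stage pipeline in plain Java. It is only a sketch of the staged event-driven idea, not Cassandra's internal stage code; the stage names, the sentinel value, and the queue sizes are all assumptions chosen for the example. The bounded queues are what provide load conditioning: a fast producer blocks when a slower stage falls behind.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative only: two stages, each with its own thread, joined by bounded queues.
final class StagedPipelineSketch {
    // Sentinel that tells a stage to shut down; assumes no real event equals it.
    private static final String STOP = "__stop__";

    static List<String> run(List<String> events) {
        BlockingQueue<String> parseQueue = new ArrayBlockingQueue<>(4);
        BlockingQueue<String> writeQueue = new ArrayBlockingQueue<>(4);
        List<String> written = new ArrayList<>();

        // Stage 1: normalize each event, then hand it to the next stage's queue.
        Thread parseStage = new Thread(() -> {
            try {
                for (String e = parseQueue.take(); !e.equals(STOP); e = parseQueue.take()) {
                    writeQueue.put(e.trim().toLowerCase(Locale.ROOT));
                }
                writeQueue.put(STOP);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        });

        // Stage 2: "write" the event (here, just append it to a list).
        Thread writeStage = new Thread(() -> {
            try {
                for (String e = writeQueue.take(); !e.equals(STOP); e = writeQueue.take()) {
                    written.add(e);
                }
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        });

        parseStage.start();
        writeStage.start();
        try {
            for (String e : events) {
                parseQueue.put(e); // blocks when the stage is saturated
            }
            parseQueue.put(STOP);
            parseStage.join();
            writeStage.join();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", ex);
        }
        return written;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(" SET age ", "SET email")));
    }
}
```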
highly available
agenda
• context
• features
• data model
• api
structure
keyspace
• settings (eg, partitioner)
column family…
• settings (eg, comparator, type [Std])
column…
• name
• value
• timestamp
keyspace
• ~= database
• typically one per application
• some settings are configurable only per keyspace
– partitioner
• configured in XML, in YAML, or via the API
create a keyspace
//Create Keyspace
KsDef k = new KsDef();
k.setName(keyspaceName);
k.setReplication_factor(1);
k.setStrategy_class("org.apache.cassandra.locator.RackUnawareStrategy");
List<CfDef> cfDefs = new ArrayList<CfDef>();
k.setCf_defs(cfDefs);

//Connect to Server
TTransport tr = new TSocket(HOST, PORT);
TFramedTransport tf = new TFramedTransport(tr); //new default
TProtocol proto = new TBinaryProtocol(tf);
Cassandra.Client client = new Cassandra.Client(proto);
tr.open();
partitioner smack-down
Random
• system will use MD5(key) to distribute data across nodes
• even distribution of keys from one CF across ranges/nodes
Order Preserving
• key distribution determined by token
• lexicographical ordering
• can specify the token for this node to use
• ‘scrabble’ distribution
• required for range queries
– scan over rows like cursor in index
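The two partitioners differ only in how a row key becomes a token. The sketch below is a simplified stand-in, assuming a ring where each node owns the range ending at its token; the class, the node names, and the ring layout are hypothetical, and the MD5 token here is the plain unsigned digest rather than Cassandra's exact token encoding.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: key -> token -> owning node on a ring.
final class PartitionerSketch {

    /** Random-style token: MD5 of the key, read as a non-negative 128-bit integer. */
    static BigInteger md5Token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Owner is the first node whose token is >= the key's token, wrapping around the ring. */
    static <T extends Comparable<T>> String ownerOf(T token, TreeMap<T, String> ring) {
        Map.Entry<T, String> entry = ring.ceilingEntry(token);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        // Order-preserving style: the key itself is the token, so ranges are lexicographic.
        TreeMap<String, String> orderedRing = new TreeMap<>();
        orderedRing.put("g", "node1");
        orderedRing.put("n", "node2");
        orderedRing.put("z", "node3");
        System.out.println("eben -> " + ownerOf("eben", orderedRing));
        System.out.println("kim  -> " + ownerOf("kim", orderedRing));

        // Random style: hashing scatters neighbouring keys across the token space.
        System.out.println("md5(eben) = " + md5Token("eben").toString(16));
    }
}
```

Because the order-preserving ring keeps neighbouring keys on the same node range, a scan from one key to the next is a contiguous walk; hashing destroys that ordering, which is why range queries need the order-preserving partitioner.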
column family
• group records of similar kind
• CFs are sparse tables
• ex:
– Tweet
– Address
– Customer
– PointOfInterest
column family
[diagram: a column family with row keys (key123, key456) mapping to columns (user=eben, nickname=The Situation, n=42; user=alison, icon=)]
json-like notation
User {
  123 : { user: eben, nickname: The Situation },
  456 : { user: alison, icon: , : The Danger Zone }
}
think of cassandra as row-oriented
• each row is uniquely identifiable by key
• rows group columns and super columns
a column has 3 parts
1. name
– byte[]
– determines sort order
– used in queries
– indexed
2. value
– byte[]
– you don’t query on column values
3. timestamp
– long (clock)
– last-write-wins conflict resolution
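The three parts and the last-write-wins rule can be sketched in a few lines. This is an illustrative stand-in, not the Thrift-generated Column class; the class name, the `resolve` helper, and the sample values are assumptions. Ties are resolved here by keeping the first argument, which is a simplification.

```java
import java.nio.charset.StandardCharsets;

// Illustrative stand-in for a column: name, value, timestamp.
final class ColumnSketch {
    final byte[] name;     // determines sort order within the row
    final byte[] value;    // opaque bytes; not queried
    final long timestamp;  // client-supplied clock used for conflict resolution

    ColumnSketch(String name, String value, long timestamp) {
        this.name = name.getBytes(StandardCharsets.UTF_8);
        this.value = value.getBytes(StandardCharsets.UTF_8);
        this.timestamp = timestamp;
    }

    String valueAsString() {
        return new String(value, StandardCharsets.UTF_8);
    }

    /** Last write wins: the version with the higher timestamp survives. */
    static ColumnSketch resolve(ColumnSketch a, ColumnSketch b) {
        return b.timestamp > a.timestamp ? b : a;
    }

    public static void main(String[] args) {
        ColumnSketch older = new ColumnSketch("email", "e@e.com", 1000L);
        ColumnSketch newer = new ColumnSketch("email", "eben@e.com", 2000L);
        // The argument order does not matter; the newer timestamp wins either way.
        System.out.println(resolve(older, newer).valueAsString());
        System.out.println(resolve(newer, older).valueAsString());
    }
}
```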
get started
$ cassandra -f
$ bin/cassandra-cli
cassandra> connect localhost/9160
cassandra> set Keyspace1.Standard1['eben']['age']='29'
cassandra> set Keyspace1.Standard1['eben']['email']='e@e.com'
cassandra> get Keyspace1.Standard1['eben']['age']
=> (column=6e616d65, value=29,
column comparators
• byte
• utf8
• long
• timeuuid (version 1)
• lexicaluuid (any, usually version 4)
• <pluggable>
– ex: lat/long
super columns
super columns group columns under a common name
<<SCF>> PointOfInterest
key 10017
  <<SC>> Central Park
    desc=Fun to walk in.
  <<SC>> Empire State Bldg
    phone=212.555.11212
    desc=Great view from 102nd floor!
key 63112
  <<SC>> The Loop
    phone=314.555.11212
    desc=Home of Strange Loop!
PointOfInterest {
  key: 85255 {
    Phoenix Zoo { phone: 480-555-5555, desc: They have animals here. },
    Spring Training { phone: 623-333-3333, desc: Fun for baseball fans. },
  }, //end phx
  key: 10019 {
    Central Park { desc: Walk around. It's pretty. },
    Empire State Building { phone: 212-777-7777, desc: Great view from 102nd floor. }
  } //end nyc
}
[diagram labels: super column family, key, super column, column; flexible schema]
about super column families
• sub-column names in a SCF are not indexed
– top level columns (SCF Name) are always indexed
• often used for denormalizing data from standard CFs
rdbms: domain-based model
what answers do I have?
big query language
cassandra: query-based model
what questions do I have?
replication
• configurable replication factor
• replica placement strategy
– rack unaware: Simple Strategy
– rack aware: Old Network Topology Strategy
– data center shard: Network Topology Strategy
agenda
• context
• features
• data model
• api
slice predicate
• data structure describing columns to return
– SliceRange
• start column name (byte[])
• finish column name (can be empty to stop on count)
• reverse
• count (like LIMIT)
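The four SliceRange fields behave much like a bounded, optionally reversed scan over a sorted map. The sketch below models one row as a `NavigableMap` and is only an approximation of the idea: it uses String column names instead of byte[], treats both bounds as inclusive, and uses an empty string to mean "no bound". The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative only: a slice over the sorted columns of a single row.
final class SliceSketch {

    /** Returns up to `count` column names between start and finish, in scan order. */
    static List<String> slice(NavigableMap<String, String> row,
                              String start, String finish,
                              boolean reversed, int count) {
        // A reversed slice walks the columns in descending comparator order.
        NavigableMap<String, String> view = reversed ? row.descendingMap() : row;
        if (!start.isEmpty()) {
            view = view.tailMap(start, true);   // begin at start (inclusive)
        }
        if (!finish.isEmpty()) {
            view = view.headMap(finish, true);  // end at finish (inclusive)
        }
        List<String> names = new ArrayList<>();
        for (String name : view.keySet()) {
            if (names.size() >= count) {
                break;                          // count works like LIMIT
            }
            names.add(name);
        }
        return names;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> row = new TreeMap<>();
        for (String name : new String[] {"a", "b", "c", "d", "e"}) {
            row.put(name, "value-" + name);
        }
        System.out.println(slice(row, "b", "", false, 2));  // open-ended, stops on count
        System.out.println(slice(row, "d", "b", true, 10)); // reversed, bounded
    }
}
```

Note that in the reversed case the start name is the larger one, since the scan runs backwards from start to finish.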
read api
• get() : Column
– get the Col or SC at given ColPath
COSC cosc = client.get(key, path, CL);
• get_slice() : List<ColumnOrSuperColumn>
– get Cols in one row, specified by SlicePredicate:
List<ColumnOrSuperColumn> results = client.get_slice(key, parent, predicate, CL);
• multiget_slice() : Map<key, List<CoSC>>
– get slices for list of keys, based on SlicePredicate
Map<byte[],List<ColumnOrSuperColumn>> results = client.multiget_slice(rowKeys, parent, predicate, CL);
• get_range_slices() : List<KeySlice>
– returns multiple Cols according to a range
– range is startkey, endkey, starttoken, endtoken:
List<KeySlice> slices = client.get_range_slices(
insert
client.insert(userIDKey, cp,
    new Column("name".getBytes(UTF8), "George Clinton".getBytes(UTF8), clock),
    CL);
delete
String columnFamily = "Standard1";
byte[] key = "k2".getBytes(); //row key

Clock clock = new Clock(System.currentTimeMillis());

ColumnPath colPath = new ColumnPath();
colPath.column_family = columnFamily;
colPath.column = "b".getBytes();

client.remove(key, colPath, clock, ConsistencyLevel.ALL);
batch_mutate
Map<byte[], Map<String, List<Mutation>>> mutationMap = new HashMap<byte[], Map<String, List<Mutation>>>();

List<Mutation> mutationList = new ArrayList<Mutation>();
mutationList.add(mutation);

Map<String, List<Mutation>> m = new HashMap<String, List<Mutation>>();
m.put(columnFamily, mutationList);

//just for this row key, though we could add more
mutationMap.put(key, m);
client.batch_mutate(mutationMap, ConsistencyLevel.ALL);
raw thrift: for masochists
• pycassa (python)
• Telephus (twisted python)
• fauna/cassandra gem (ruby)
• hector (java)
• pelops (java)
• kundera (JPA)
• hectorSharp (C#)
what about…
SELECT WHERE
ORDER BY
JOIN ON
GROUP?
SELECT WHERE
cassandra is an index factory
<<cf>> USER
Key: UserID
Cols: username, email, birth date, city, state
How to support this query?
SELECT * FROM User WHERE city = 'Scottsdale'
Create a new CF called UserCity:
<<cf>> USERCITY
Key: city
SELECT WHERE pt 2
• Use an aggregate key state:city: { user1, user2 }
• Get rows between AZ: & AZ; for all Arizona users
• Get rows between AZ:Scottsdale & AZ:Scottsdale1 for all Scottsdale users
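The aggregate-key trick works because ';' sorts immediately after ':' in byte order, so every key that starts with "AZ:" falls between "AZ:" and "AZ;". The sketch below imitates that range scan with a sorted map; it assumes keys are kept in lexicographic order (as with the order-preserving partitioner), and the class name, sample cities, and user ids are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative only: a hand-built "index" column family keyed by state:city.
final class AggregateKeySketch {

    /** Collects the users from every row whose key lies in [startKey, endKey]. */
    static List<String> usersBetween(TreeMap<String, List<String>> userCity,
                                     String startKey, String endKey) {
        List<String> users = new ArrayList<>();
        for (List<String> row : userCity.subMap(startKey, true, endKey, true).values()) {
            users.addAll(row);
        }
        return users;
    }

    public static void main(String[] args) {
        TreeMap<String, List<String>> userCity = new TreeMap<>();
        userCity.put("AZ:Phoenix", List.of("user3"));
        userCity.put("AZ:Scottsdale", List.of("user1", "user2"));
        userCity.put("CA:Oakland", List.of("user4"));

        // All Arizona users: every key with the "AZ:" prefix sorts before "AZ;".
        System.out.println(usersBetween(userCity, "AZ:", "AZ;"));
        // All Scottsdale users.
        System.out.println(usersBetween(userCity, "AZ:Scottsdale", "AZ:Scottsdale1"));
    }
}
```

The cost of this design is that the application maintains the index rows itself on every user write; the read then becomes a single contiguous range scan.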
ORDER BY
Rows
• are placed according to their Partitioner:
– Random: MD5 of key
– Order-Preserving: actual key
• are sorted by key, regardless of partitioner
Columns
• are sorted according to CompareWith or CompareSubcolumnsWith
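Since columns are stored sorted by the column family's comparator, the "ORDER BY" is fixed when the column family is defined. The sketch below shows the same three column names ordered two ways. It is only an analogy built on `java.util.Comparator`: the two comparator constants are hypothetical stand-ins for a utf8-style and a long-style CompareWith setting, and Java's natural String order matches byte order only for ASCII names like these.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Illustrative only: the comparator chosen for a column family fixes column order.
final class ComparatorSketch {
    // Stand-in for a text comparator: lexicographic order.
    static final Comparator<String> UTF8_LIKE = Comparator.naturalOrder();
    // Stand-in for a long comparator: names are interpreted as numbers.
    static final Comparator<String> LONG_LIKE =
            Comparator.comparingLong((String s) -> Long.parseLong(s));

    /** Inserts the names into a sorted row and returns them in stored order. */
    static List<String> sortedNames(Comparator<String> comparator, String... names) {
        TreeMap<String, String> row = new TreeMap<>(comparator);
        for (String name : names) {
            row.put(name, "");
        }
        return new ArrayList<>(row.keySet());
    }

    public static void main(String[] args) {
        System.out.println(sortedNames(UTF8_LIKE, "9", "10", "100"));
        System.out.println(sortedNames(LONG_LIKE, "9", "10", "100"));
    }
}
```

Under the text comparator "10" and "100" sort before "9"; under the numeric comparator the order is 9, 10, 100. A slice over numeric or time-based column names only returns a meaningful range when the comparator matches how the names are encoded.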
data
• skinny rows, wide rows (billions of columns)
• denormalize known queries
– secondary index support in 0.7
• client join others
• 2 caching layers: row, index
is cassandra a good fit?
• sub-millisecond writes
• you need durability
• you have lots of data: > GBs, >= three servers
• growing data over time
• your app is evolving
– startup mode, fluid data structure
• loose domain data – “points of interest”
• multi data-center
• your programmers can deal
– documentation
– complexity
– consistency model
– change
– visibility tools
• your operations can deal
– hardware considerations
– can move data
– JMX monitoring
use cases
• jboss.org/infinispan – data grid cache
• log data stream
• hotelier
– points of interest
– guests
• geospatial
• travel
– segment analytics
With Hadoop!
• BI w/o ETL
• raptr.com – storage & analytics for gaming stats
• imagini
– visual quizzes for publishers
– real time for 100s of millions of users
coming in 0.7
• secondary indexes
• hadoop improvements
• large row support ( > 2GB)
• dynamic routing around slow nodes
YOU ALREADY HAVE THE RIGHT DATABASE TODAY
FOR THE APPLICATION YOU HAVE TODAY
what would you do if scale wasn’t a problem?
@ebenhewitt
cassandra.apache.org
“An invention has to make sense in the world in which it is finished, not the world in which it is started.”
--Ray Kurzweil