Distributed, scalable database management system (SZTAKI)
Friday, July 13, 2012
Molnár András, Garzó András
http://cassandra.apache.org/
Origin
● “In Greek mythology, Cassandra was the daughter of King Priam and Queen Hecuba of Troy. Cassandra was so beautiful that the god Apollo gave her the ability to see the future. But when she refused his amorous advances, he cursed her such that she would still be able to accurately predict everything that would happen—but no one would believe her. Cassandra foresaw the destruction of her city of Troy, but was powerless to stop it. The Cassandra distributed database is named for her. I speculate that it is also named as kind of a joke on the Oracle at Delphi, another seer for whom a database is named.“
(Cassandra: The Definitive Guide)
CAP
● Bigtable
● HBase
● Voldemort
● Cassandra
● RDBMS's
(Cassandra: The Definitive Guide)
“Tuneable consistency”
set consistency level ...
(Cassandra: The Definitive Guide)
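A rough way to reason about tuneable consistency: with replication factor N, a read that waits for R replicas and a write that waits for W replicas are guaranteed to overlap in at least one replica when R + W > N. The sketch below only illustrates that arithmetic. The function names are hypothetical, and only the level names ONE, QUORUM and ALL come from Cassandra.

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must answer for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"level not covered by this sketch: {level}")


def read_overlaps_write(read_level: str, write_level: str, replication_factor: int) -> bool:
    """True when every read set shares at least one replica with every write set."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, QUORUM reads combined with QUORUM writes overlap, while ONE combined with ONE does not, which is the trade-off between latency and consistency that the level setting exposes.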
What does it say about itself?
● “Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web.”
● “That’s exactly 50 words. Of course, if you were to recite that to your boss in the elevator, you'd probably get a blank look in return.”
(Cassandra: The Definitive Guide)
Indexed, schema-free, row-oriented store
When is it worth using? (e.g.)
● Cassandra vs RDBMS
– large data volume, server cluster
– ...
● Cassandra vs HBase
– always writable
– ...
● Cassandra vs Voldemort
– dozens or hundreds of columns
– ...
Use case examples
● Large deployments
– many nodes
● Lots of writes, statistics & analysis
– always writable
● Geographical distribution
– data locality
● Evolving applications
– no strict schema
(Cassandra: The Definitive Guide)
Development status
● current version: 1.1.2, released 2012-07-02
● [v1.1.1 released: 2012-06-04]
Who supports it?
● Third-party solution providers, e.g.
Cassandra wiki
Who uses it?
● “Twitter is using Cassandra for analytics: for real-time analytics, for geolocation and places of interest data, and for data mining over the entire user store.
● Mahalo uses it for its primary near-time data store.
● Facebook still uses it for inbox search, though they are using a proprietary fork.
● Digg uses it for its primary near-time data store.
● Rackspace uses it for its cloud service, monitoring, and logging.
● Reddit uses it as a persistent cache.
● Cloudkick uses it for monitoring statistics and analytics.
● Ooyala uses it to store and serve near real-time video analytics data.
● SimpleGeo uses it as the main data store for its real-time location infrastructure.
● Onespot uses it for a subset of its main data store.”
(Cassandra: The Definitive Guide)
Who uses it?
● “Cassandra is in use at Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Digg, CloudKick, Ooyala, and more companies that have large, active data sets.”
● “The largest known Cassandra cluster has over 300 TB of data in over 400 machines.”
● More: http://www.datastax.com/cassandrausers
(cassandra.apache.org)
Data Model
● Sparse table
(Cassandra: The Definitive Guide)
Data Model
● Super column family feature (“becoming deprecated”)
● “Not officially deprecated, but not highly recommended either”
Ed Anuff: Cassandra Indexing Techniques
Column
● name
– byte[]
– Queried against (predicates)
– Determines sort order
● value
– byte[]
– Opaque to Cassandra
● timestamp
– long
– Conflict resolution (Last Write Wins)
(Cassandra: The Definitive Guide)
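The column triple and the Last Write Wins rule can be sketched in a few lines. This is only an illustration: the `Column` class and the `reconcile` function are hypothetical names, and ties on equal timestamps are resolved arbitrarily here, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes      # queried against, determines sort order
    value: bytes     # opaque payload
    timestamp: int   # supplied by the client, e.g. microseconds since the epoch


def reconcile(a: Column, b: Column) -> Column:
    """Return the surviving version of one column: the newer timestamp wins."""
    if a.name != b.name:
        raise ValueError("can only reconcile two versions of the same column")
    # Simplification: on equal timestamps this sketch keeps the first argument.
    return a if a.timestamp >= b.timestamp else b
```

Because the winner depends only on the timestamps, two replicas that see the same pair of versions in a different order still converge on the same value.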
Column sorting
● Column names are stored in sorted order according to the value of compare_with:
– AsciiType, BytesType, LexicalUUIDType,
– IntegerType, LongType,
– TimeUUIDType,
– UTF8Type
– CompositeType ...
– Custom ...
(Cassandra: The Definitive Guide)
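Why the comparator matters can be seen with three column names sorted two ways. The snippet uses plain Python sort keys as stand-ins for a byte-wise and a numeric comparator; it does not use Cassandra's comparator classes or their on-disk encodings.

```python
names = ["9", "10", "100"]

# Byte-wise comparison, in the spirit of BytesType or AsciiType.
as_bytes = sorted(names, key=lambda n: n.encode("ascii"))

# Numeric comparison, in the spirit of LongType or IntegerType.
as_numbers = sorted(names, key=int)

print(as_bytes)    # ['10', '100', '9']
print(as_numbers)  # ['9', '10', '100']
```

A range slice over column names follows this order, so the comparator chosen for a column family decides which columns a "from X to Y" query returns.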
Row storing & sorting
● Column sorting is controllable, but key sorting isn’t; row keys always sort in byte order.
● Rows are stored in an order defined by the partitioner (for example, with RandomPartitioner, they are in random order, etc.).
(Cassandra: The Definitive Guide)
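How a hash-based partitioner spreads rows can be sketched as follows. RandomPartitioner derives a token from an MD5 hash of the row key; everything else here (the 2^128 token range, the four evenly spaced nodes, the function names) is an assumption made for illustration and does not reproduce Cassandra's exact token arithmetic.

```python
import bisect
import hashlib


def md5_token(row_key: bytes) -> int:
    """Map a row key to a position on the ring (simplified token derivation)."""
    return int.from_bytes(hashlib.md5(row_key).digest(), "big")


def owner(row_key: bytes, ring: list) -> str:
    """ring is a sorted list of (token, node_name) pairs.

    The row belongs to the first node whose token is >= the key's token,
    wrapping around to the first node past the end of the ring.
    """
    tokens = [token for token, _ in ring]
    index = bisect.bisect_left(tokens, md5_token(row_key))
    return ring[index % len(ring)][1]


RING_SIZE = 2 ** 128  # assumed range of the simplified token
ring = [(i * RING_SIZE // 4, f"node{i}") for i in range(4)]
```

Calling `owner(b"jsmith", ring)` returns one of the four hypothetical node names. Adjacent keys such as `b"jsmith"` and `b"jsmiti"` hash to unrelated tokens, which is why rows come back in an apparently random order under this partitioner.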
Alternate indexes
● Native secondary indexes
● Wide rows as lookup and grouping tables
● Custom secondary indexes
Ed Anuff: Cassandra Indexing Techniques
Example
● User
– Stores users
– Keyed on a unique ID (UUID).
– Columns for username and password
● Username
– Indexes users
– Keyed on username
– One column, the unique UUID for user
Username: Username → UUID
User: UUID → Username, Password
Eric Evans: Hands-on Cassandra
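The User and Username column families above amount to a primary table plus a hand-maintained index. A minimal in-memory sketch, using dictionaries in place of column families (the function names and the lowercase column names are hypothetical):

```python
import uuid

users = {}      # User column family: row key = user UUID
usernames = {}  # Username column family: row key = username


def add_user(username: str, password: str) -> uuid.UUID:
    """Write the user row and the index row that points back to it."""
    user_id = uuid.uuid4()
    users[user_id] = {"username": username, "password": password}
    usernames[username] = {"uuid": user_id}
    return user_id


def find_user(username: str) -> dict:
    """Two key lookups: username -> UUID, then UUID -> user columns."""
    user_id = usernames[username]["uuid"]
    return users[user_id]
```

The application, not the database, keeps the two column families in step, so every write path that creates or renames a user has to update both.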
Example
● Friends
– Maps a user to the users (s)he follows
– Keyed on user ID
– Columns for each user being followed
● Followers
– Maps a user to those following her/him
– Keyed on username
– Columns for each user following
Friends: UUID → Follows (followees)
Followers: Username → Followers (followers)
Eric Evans: Hands-on Cassandra
Example
● Tweets
– Stores tweets and maps them to users
– Keyed on a unique identifier
– Columns: unique identifier, user ID, body of the tweet, timestamp
Tweets: TweetID → UUID, Body, Timestamp
Eric Evans: Hands-on Cassandra
Example
● Timeline
– The materialized view of Tweets for a user.
– Keyed on user ID
– Columns that map timestamps to Tweet ID
● Userline
– The collection of Tweets attributed to a user
– Keyed on user ID
– Columns that map timestamps to Tweet ID
UUID → TweetID (tweetIDsOf)
UUID → TweetID (tweetIDsTo)
Eric Evans: Hands-on Cassandra
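The Timeline is a materialized view: it is filled in at write time so that reading it is a single-row lookup. The sketch below illustrates that fan-out on write with dictionaries. The function names are hypothetical, the followers map is keyed on a plain user ID for brevity, and the tweet is written only to the followers' timelines, which is an assumption of this sketch.

```python
from collections import defaultdict

tweets = {}                    # Tweets: tweet ID -> columns
userline = defaultdict(dict)   # Userline: user ID -> {timestamp: tweet ID}
timeline = defaultdict(dict)   # Timeline: user ID -> {timestamp: tweet ID}
followers = defaultdict(set)   # Followers: user ID -> IDs of users following


def post_tweet(tweet_id, user_id, body, timestamp):
    """Store the tweet once, then reference it from every row that shows it."""
    tweets[tweet_id] = {"user_id": user_id, "body": body, "timestamp": timestamp}
    userline[user_id][timestamp] = tweet_id
    for follower_id in followers[user_id]:
        timeline[follower_id][timestamp] = tweet_id


def read_timeline(user_id, limit=10):
    """Newest tweets first, resolved through the Tweets column family."""
    row = timeline[user_id]
    newest_first = sorted(row, reverse=True)[:limit]
    return [tweets[row[ts]]["body"] for ts in newest_first]
```

For example, after `followers["alice"].add("bob")` and two calls to `post_tweet` by alice, `read_timeline("bob")` returns both bodies, newest first. The cost of a post grows with the number of followers, which is the price paid for cheap reads.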
Example 2
(Cassandra: The Definitive Guide)
Design patterns
● Materialized View,
● Valueless Column,
● Aggregate Key,
● ... ?
(Cassandra: The Definitive Guide)
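Two of these patterns fit in a few lines. As commonly described, a valueless column stores its information in the column name and leaves the value empty, and an aggregate key fuses several scalar values into one row key with a separator. The names and sample data below are hypothetical.

```python
# Valueless column: the column name is the data, the value stays empty.
followers_row = {"bob": b"", "carol": b""}


def is_follower(row: dict, name: str) -> bool:
    """Membership is answered by the presence of the column name alone."""
    return name in row


# Aggregate key: several scalars fused into one row key with a separator.
def aggregate_key(*parts: str, sep: str = ":") -> str:
    if any(sep in part for part in parts):
        raise ValueError("separator must not appear inside a key part")
    return sep.join(parts)
```

For instance, `aggregate_key("hotel", "Phoenix")` yields `"hotel:Phoenix"`, a single row key that groups everything belonging to that pair.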
Wide Rows
● “If your data model has no rows with over a hundred columns, you’re either doing something wrong or shouldn’t be using Cassandra”
– wide row for grouping
– wide row as a simple index
– composite column names, e.g.
Ed Anuff: Cassandra Indexing Techniques
Indexes = {
  "User_Keys_By_Last_Name" : {
    {"adams", 1} : "e5d...",
    {"anderson", 1} : "e5f...",
    {"anderson", 2} : "e71...",
    ...
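A wide row used as an index works because composite column names sort component by component, so all entries sharing a first component sit next to each other. The sketch below models the row as a dictionary keyed on tuples. The function name is hypothetical, and the truncated key values are copied from the example as opaque strings.

```python
import bisect

# Column names are (last_name, sequence) tuples; values are user keys.
user_keys_by_last_name = {
    ("adams", 1): "e5d...",
    ("anderson", 1): "e5f...",
    ("anderson", 2): "e71...",
}


def slice_by_last_name(row: dict, last_name: str) -> list:
    """Return the values of all columns whose first name component matches."""
    names = sorted(row)  # Cassandra keeps names sorted; this sketch sorts on read
    start = bisect.bisect_left(names, (last_name,))
    result = []
    for name in names[start:]:
        if name[0] != last_name:
            break
        result.append(row[name])
    return result
```

The second component only makes the column name unique, so several users with the same last name can coexist in the row without overwriting one another.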
Model the queries first
● “Start with your queries. Ask what queries your application will need, and model the data around that instead of modeling the data first, as you would in the relational world.”
(Cassandra: The Definitive Guide)
How it works: writes
(Cassandra: The Definitive Guide)
How it works: reads
(Cassandra: The Definitive Guide)
Limitations
● All data for a single row must fit (on disk) on a single machine in the cluster. Because row keys alone are used to determine the nodes responsible for replicating their data, the amount of data associated with a single key has this upper bound.
● A single column value may not be larger than 2GB. (However, large values are read into memory when requested, so in practice "small number of MB" is more appropriate. [might be changed?])
● The maximum number of columns per row is 2 billion.
● The key (and column names) must be under 64K bytes.
● no subcolumn indexing [might be changed?]
(Cassandra wiki)
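The numeric limits above could be enforced on the client before a write is sent. The checker below is a hypothetical helper, not part of any Cassandra client library, and it assumes that "2GB" means 2 × 1024³ bytes and "64K" means 64 × 1024 bytes.

```python
MAX_NAME_BYTES = 64 * 1024            # keys and column names must stay under this
MAX_VALUE_BYTES = 2 * 1024 ** 3       # assumed reading of the 2GB value limit
MAX_COLUMNS_PER_ROW = 2_000_000_000   # 2 billion columns per row


def check_column(row_key: bytes, name: bytes, value: bytes, columns_in_row: int) -> list:
    """Return a list of violated limits; an empty list means the write looks fine."""
    problems = []
    if len(row_key) >= MAX_NAME_BYTES:
        problems.append("row key is not under 64K bytes")
    if len(name) >= MAX_NAME_BYTES:
        problems.append("column name is not under 64K bytes")
    if len(value) > MAX_VALUE_BYTES:
        problems.append("column value is larger than 2GB")
    if columns_in_row >= MAX_COLUMNS_PER_ROW:
        problems.append("row already holds the maximum number of columns")
    return problems
```

Since large values are read into memory when requested, an application would probably want a much lower threshold of its own, in the range of a few megabytes.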
Hadoop Map/Reduce
● Map/Reduce jobs can be written against data stored in Cassandra
● Pig and Hive can also work on Cassandra data
Installation, sample run
● See README.txt
– tar -zxvf apache-cassandra-$VERSION.tar.gz
– ... (set up the log and lib directories)
– ... (edit the config file or leave it at the defaults)
– server: bin/cassandra -f
– CLI client: bin/cassandra-cli --host localhost
Installation, sample run
● cassandra-cli database commands, e.g.
● create keyspace Keyspace1;  ~ schema / database
● use Keyspace1;
● create column family Users with .....  ~ table (no explicit schema)
● set Users[jsmith][first] = 'John';  ~ insert/update: set columns of row with id jsmith
● set Users[jsmith][last] = 'Smith';
● set Users[jsmith][age] = long(42);
● get Users[jsmith];  ~ select * from Users where id=jsmith
– => (column=last, value=Smith, timestamp=1287604215498000)
– => (column=first, value=John, timestamp=1287604214111000)
– => (column=age, value=42, timestamp=1287604216661000)
– Returned 3 results.
● del Users[jsmith][age];  ~ delete / update set null: remove columns of a row or the full row
● del Users[jsmith];
Our own experience
● On one machine, with one node ...
● On multiple machines ...
The end.
Thank you for your attention!