Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

Polyglot Persistence in the Real World Anton Yazovskiy Thumbtack Technology

Upload: thumbtacktech

Post on 29-Jun-2015


DESCRIPTION

This talk focuses on building a system from scratch, showing how to run analytical queries in near real time while still getting the benefits of Cassandra's high-performance database engine. The key subjects of the talk are:
● The splendors and miseries of NoSQL
● Apache Cassandra use cases
● Difficulties of using MapReduce directly in Cassandra
● Amazon cloud solutions: Elastic MapReduce and S3
● “Real-enough” time analysis
In particular, the talk dives into ways of handling different kinds of semi-ad-hoc queries when using Cassandra and the pitfalls of designing a schema around a specific analytics use case. Some attention is paid to time series data in particular, which can present a real problem when using column-family or key-value databases.

TRANSCRIPT

Page 1: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

Polyglot Persistence in the Real World

Anton Yazovskiy Thumbtack Technology

Page 2: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

• Software Engineer at Thumbtack Technology
• an active user of various NoSQL solutions
• consulting with a focus on scalability
• a significant part of my work is advising people on which solutions to use and why
• big fan of Big Data and clouds

Page 3: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

• NoSQL – not a silver bullet
• Choices that we make
• Cassandra: operational workload
• Cassandra: analytical workload
• The best of both worlds
• Some benchmarks
• Conclusions

Page 4: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

• well-known ways to scale
• scale in/out, scale by function, data denormalization
• really works
• each has disadvantages
• mostly manual process (newSQL)

http://qsec.deviantart.com

Page 5: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

• solves exactly these kinds of problems
• rapid application development
• aggregate
• schema flexibility
• auto-scale-out
• auto-failover
• amount of data it is able to handle
• shared-nothing architecture, no SPOF
• performance

Page 6: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

• splendors and miseries of aggregate
• CAP theorem dilemma

Consistency

Partition Tolerance Availability

Page 7: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

Analytical Operational

Consistency Availability

Performance Reliability

Page 8: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

Analytical Operational

Consistency Availability

Performance Reliability

I want it all

Page 9: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

(released by Facebook in 2008)

• elastic scalability & linear performance *
• dynamic schema
• very high write throughput
• tunable per-request consistency
• fault-tolerant design
• multiple datacenter and cloud readiness
• CAS transaction support *

* http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-cassandra

Page 10: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

• Large data set on commodity hardware
• Tradeoff between speed and reliability
• Heavy-write workload
• Time-series data

http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-cassandra

Page 11: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

Cassandra

Operational

Reliability Performance

Analytical

Small demo after this slide

Page 12: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

TIMESTAMP | FIELD 1 | …
12344567 | DATA

SERVER 1
12326346 | DATA
13124124 | DATA
13237457 | DATA

SERVER 2
13627236 | DATA

• range queries across the cluster are expensive
• unless you shard by timestamp
• but sharding by timestamp becomes a bottleneck for a heavy-write workload

select * from table where timestamp > 12344567 and timestamp < 13237457

Page 13: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

• all columns are sorted by name
• row – aggregate item (never sharded)

Column Family

row key 1: column 1 = value 1.1 | column 2 = value 1.2 | column 3 = value 1.3 | .. | column N = value 1.N
row key 2: column 1 = value 2.1 | column 2 = value 2.2 | ... | column M = value 2.M

Super columns are discouraged and omitted here

get slice

get range

+ combinations of these queries + composite columns

get key

Page 14: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

• all columns are sorted by name
• row – aggregate item (never sharded)

SERVER 1
  row key 1: timestamp | timestamp | timestamp | timestamp
  row key 2: timestamp | timestamp | timestamp
  row key 3: timestamp
SERVER 2
  row key 4: timestamp | timestamp | timestamp | timestamp
  row key 5: timestamp | timestamp

get_slice("row key 1", from:"timestamp 1", null, 11)

get_slice(row_key, from, to, count)

Page 15: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

get_slice("row key 1", from:"timestamp 1", null, 11)
get_slice("row key 1", from:"timestamp 11", null, 11)   (Next page)
get_slice("row key 1", null, to:"timestamp 11", 11)     (Prev. page)

• all columns are sorted by name
• row – aggregate item (never sharded)

SERVER 1
  row key 1: timestamp | timestamp | timestamp | timestamp
  row key 2: timestamp | timestamp | timestamp
  row key 3: timestamp
SERVER 2
  row key 4: timestamp | timestamp | timestamp | timestamp
  row key 5: timestamp | timestamp

get_slice(row_key, from, to, count)
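
The paging calls on this slide can be tried out without a cluster. The following is a minimal sketch, assuming a single row modelled as a sorted in-memory map; `WideRow` and `getSlice` are hypothetical names that only mimic the `get_slice(row_key, from, to, count)` shape from the slide and are not a Cassandra client API. It relies on the same idea as the slide: ask for 11 columns, show 10, and use the 11th as the start of the next page.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** In-memory stand-in for one wide row; column names (timestamps) stay sorted. */
final class WideRow {
    private final NavigableMap<Long, String> columns = new TreeMap<>();

    void put(long timestamp, String value) {
        columns.put(timestamp, value);
    }

    /**
     * Hypothetical analogue of get_slice(row_key, from, to, count) for this row.
     * Bounds are inclusive and null means "open". When only 'to' is given the
     * columns are walked in reverse, which is what a "previous page" needs.
     */
    List<Long> getSlice(Long from, Long to, int count) {
        if (from != null && to != null && from > to) {
            return Collections.emptyList();
        }
        NavigableMap<Long, String> view = columns;
        if (from != null) {
            view = view.tailMap(from, true);
        }
        if (to != null) {
            view = view.headMap(to, true);
        }
        if (from == null && to != null) {
            view = view.descendingMap();
        }
        List<Long> names = new ArrayList<>();
        for (Long name : view.keySet()) {
            if (names.size() == count) {
                break;
            }
            names.add(name);
        }
        return names;
    }

    public static void main(String[] args) {
        WideRow row = new WideRow();
        for (long t = 1; t <= 25; t++) {
            row.put(t, "event-" + t);
        }
        // Ask for 11 columns, show 10; the 11th marks where the next page starts.
        List<Long> first = row.getSlice(1L, null, 11);
        List<Long> next = row.getSlice(first.get(10), null, 11);
        // Walk backwards from the first column of the current page.
        List<Long> prev = row.getSlice(null, next.get(0), 11);
        System.out.println(first.subList(0, 10));
        System.out.println(next.subList(0, 10));
        System.out.println(prev);
    }
}
```

Because the columns of a row are sorted and the row is never split across servers, each page is one contiguous read from a single node.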

Page 16: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

• Time-range with filter:
  • “get all events for User J from N to M”
  • “get all success events for User J from N to M”
  • “get all events for all users from N to M”

Page 17: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

• Time-range with filter:
  • “get all events for User J from N to M”
  • “get all success events for User J from N to M”
  • “get all events for all users from N to M”

events::success::User_123: timestamp 1 = value 1
events::success: timestamp 1 = value 1
events::User_123: timestamp 1 = value 1
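
One way to read the three rows above is that the application writes each event several times, once per row key that a later query will need. The sketch below only builds those keys; `EventRowKeys`, `forEvent`, and its parameters are hypothetical names chosen for illustration, and the fan-out on write is an assumption rather than something the slides state.

```java
import java.util.Arrays;
import java.util.List;

/** Builds the row keys one event is stored under, following the "events::..." naming on the slide. */
final class EventRowKeys {
    private static final String SEP = "::";

    /** status (e.g. "success") and userId (e.g. "User_123") are illustrative inputs. */
    static List<String> forEvent(String status, String userId) {
        return Arrays.asList(
                "events" + SEP + status + SEP + userId, // events with this status for one user
                "events" + SEP + status,                // events with this status for all users
                "events" + SEP + userId);               // all events for one user
    }

    public static void main(String[] args) {
        // Each key names a row; the event timestamp would be the column name in each of them.
        for (String key : forEvent("success", "User_123")) {
            System.out.println(key);
        }
    }
}
```

With keys like these, every query in the list becomes a time-range slice over a single row, at the cost of writing the same event more than once.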

Page 18: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

• Counters:
  • “get # of events for User J grouped by hour”
  • “get # of events for User J grouped by day”

events::success::User_123: 1380400000 = 14 | 1380403600 = 42
events::User_123: 1380400000 = 842 | 1380403600 = 1024

(group by day – same, but in a different column family for TTL support)
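
The hourly counters can be sketched as a map from row key to per-hour counts, where the column name is the start of the hour bucket. This is an in-memory illustration only: `HourlyCounters` and its methods are hypothetical, the buckets are assumed to be aligned to whole hours of epoch seconds, and a real deployment would use Cassandra counter columns instead of Java maps.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for a counter column family: row key -> (hour bucket -> count). */
final class HourlyCounters {
    static final long HOUR_SECONDS = 3600L;

    private final Map<String, TreeMap<Long, Long>> rows = new HashMap<>();

    /** Rounds an epoch-seconds timestamp down to the start of its hour. */
    static long hourBucket(long epochSeconds) {
        return Math.floorDiv(epochSeconds, HOUR_SECONDS) * HOUR_SECONDS;
    }

    /** Adds one event to the bucket that contains the given timestamp. */
    void increment(String rowKey, long epochSeconds) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .merge(hourBucket(epochSeconds), 1L, Long::sum);
    }

    /** Returns the count for one bucket, or 0 if nothing was recorded. */
    long count(String rowKey, long bucketStart) {
        TreeMap<Long, Long> row = rows.get(rowKey);
        return row == null ? 0L : row.getOrDefault(bucketStart, 0L);
    }

    public static void main(String[] args) {
        HourlyCounters counters = new HourlyCounters();
        counters.increment("events::User_123", 7_205L);
        counters.increment("events::User_123", 9_000L);
        counters.increment("events::User_123", 10_800L);
        System.out.println(counters.count("events::User_123", 7_200L));
        System.out.println(counters.count("events::User_123", 10_800L));
    }
}
```

A "grouped by day" variant would differ only in the bucket size, which matches the slide's note that the daily counters live in a separate column family.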

Page 19: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

• row key should consist of a combination of fields with high cardinality of values:
  • name, id, etc.
  • boolean values are a bad option
  • composite columns are a good option for this
• timestamp may help to spread historical data
• otherwise, scalability will not be linear

Page 20: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

In theory – possible in real-time
• average, 3-dimensional filters, group by, etc.

But:
• hard to tune data model
• lack of aggregation options
• aggregation by historical data

Page 21: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

“I want interactive reports”

Cassandra

“Reports could be a little bit out of date, but I want to control this delay value”

Auto update somehow

Page 22: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

• Impact on production system, or
• Higher total cost of ownership
• Difficulties with scalability
• Hard to support with multiple clusters

http://www.datastax.com/docs/0.7/map_reduce/hadoop_mr

Page 23: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce
Page 24: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

http://aws.amazon.com

Page 25: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

• Hadoop tech stack
• Automatic deployment
• Management API
• Temporal cluster
• Amazon S3 as data storage *

* copy from S3 to EMR HDFS and back

Page 26: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

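// Describe the cluster: Hadoop version, node count and instance types
JobFlowInstancesConfig instances = ..;
instances.setHadoopVersion(..);
instances.setInstanceCount(dataNodeCount + 1);   // data nodes plus one master
instances.setMasterInstanceType(..);
instances.setSlaveInstanceType(..);

// Build the job flow request and attach the step that runs the jar
RunJobFlowRequest req = ..(name, instances);
req.withSteps(new StepConfig(name, jar));

// Submit the request; this starts a new cluster
AmazonElasticMapReduce emr = ..;
emr.runJobFlow(req);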

Page 27: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

Execute job on running cluster:

StepConfig stepConfig = new StepConfig(name, jar);

// Add the step to an existing job flow instead of starting a new cluster
AddJobFlowStepsRequest addReq = …;
addReq.setJobFlowId(jobFlowId);
addReq.setSteps(Arrays.asList(stepConfig));

AmazonElasticMapReduce emr = ..;
emr.addJobFlowSteps(addReq);

Page 28: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

• cluster lifecycle: Long-Running or Transient
• cold start = ~20 min
• tradeoff: cluster cost vs. availability
• compression and combiner tuning can speed up jobs considerably
• common problems for all big data processing tools: monitoring, testability and debugging (MRUnit, local Hadoop, smaller data set)

Page 29: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce
Page 30: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce
Page 31: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

Long txId = null;   // declared outside try so the catch block can see it
try {
    txId = cassandra.persist(entity);
    sql.insert(some);
    sql.update(someElse);
    cassandra.commit(txId);
    sql.commit();
} catch (Exception e) {
    sql.rollback();
    if (txId != null) {
        cassandra.rollback(txId);
    }
}

Page 32: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

insert into CHANGES (key, commited, data)
values ('tx_id-58e0a7d7-eebc', 'false', ..)

update CHANGES set commited = 'true'
where key = 'tx_id-58e0a7d7-eebc'

delete from CHANGES
where key = 'tx_id-58e0a7d7-eebc'
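
One reading of these three statements is that they back the persist, commit and rollback calls from the previous slide: a change is inserted as uncommitted, flipped to committed once the SQL side succeeds, and deleted on rollback. The sketch below models that reading with an in-memory map. `ChangeLog` and all of its methods are hypothetical, and the rule that readers ignore uncommitted rows is an assumption that the slides do not spell out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** In-memory stand-in for the CHANGES table. */
final class ChangeLog {
    /** One row: the committed flag plus the payload. */
    static final class Change {
        boolean committed;
        final String data;

        Change(String data) {
            this.data = data;
        }
    }

    private final Map<String, Change> changes = new HashMap<>();

    /** Mirrors: insert into CHANGES (key, commited, data) values (txId, 'false', data). */
    String persist(String data) {
        String txId = "tx_id-" + UUID.randomUUID();
        changes.put(txId, new Change(data));
        return txId;
    }

    /** Mirrors: update CHANGES set commited = 'true' where key = txId. */
    void commit(String txId) {
        Change change = changes.get(txId);
        if (change == null) {
            throw new IllegalStateException("unknown transaction " + txId);
        }
        change.committed = true;
    }

    /** Mirrors: delete from CHANGES where key = txId. */
    void rollback(String txId) {
        changes.remove(txId);
    }

    /** Assumed read rule: only committed changes are visible. */
    boolean isVisible(String txId) {
        Change change = changes.get(txId);
        return change != null && change.committed;
    }

    public static void main(String[] args) {
        ChangeLog log = new ChangeLog();
        String txId = log.persist("entity payload");
        System.out.println(log.isVisible(txId)); // not yet committed
        log.commit(txId);
        System.out.println(log.isVisible(txId)); // committed
    }
}
```

This is not a distributed transaction: if the process dies between the Cassandra commit and the SQL commit, the two stores can still disagree, so some form of cleanup for stale uncommitted rows would be needed.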

Page 33: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

non-production setup:
• 3 nodes (cassandra)
• m1.medium EC2 instance
• 1 data center
• 1 app instance

I numbers

Page 34: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

real-time metrics update (sync):
• average latency: 60 msec
• process > 2,000 events per second
• generate > 1,000 reports per second

real-time metrics update (async):
• process > 15,000 events per second

uploading to AWS S3: slow, but multi-threading helps *

it is more than enough, but what if …
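
The remark that multi-threading helps with slow S3 uploads can be illustrated with a fixed thread pool that uploads chunks in parallel. This is a sketch under assumptions: `ParallelUpload` is a hypothetical helper, the `uploader` callback is a placeholder for whatever actually sends one chunk (it is not an AWS SDK call), and the thread count is an arbitrary tuning knob.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Runs a per-chunk upload callback on a fixed pool of worker threads. */
final class ParallelUpload {
    /** Uploads every chunk, waits for all of them, and returns how many were submitted. */
    static int uploadAll(List<String> chunks, Consumer<String> uploader, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String chunk : chunks) {
                pending.add(pool.submit(() -> uploader.accept(chunk)));
            }
            for (Future<?> future : pending) {
                future.get(); // surfaces any failure from the worker thread
            }
            return pending.size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("upload interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("upload failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("part-0000", "part-0001", "part-0002");
        // The callback only prints here; a real one would send the chunk to storage.
        int uploaded = uploadAll(chunks, chunk -> System.out.println("uploaded " + chunk), 2);
        System.out.println(uploaded + " chunks done");
    }
}
```

Since each upload spends most of its time waiting on the network, running several at once mainly hides latency; the useful number of threads depends on bandwidth and chunk size.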

Page 35: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

• distributed systems force you to make decisions
  • systems like Cassandra trade consistency for speed
• CAP theorem is oversimplified
  • you have many more options
• polyglot persistence can make this world a better place
  • do not try to hammer every nail with the same hammer

Page 36: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

• Cassandra – great for time series data and heavy-write workload…
• ... but use cases should be clearly defined

Page 37: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

• Amazon S3 – is great
  • simple, slow, but predictable storage
• Amazon EMR
  • integration with S3 – great
  • very good API, but …
  • … it isn't a magic trick and requires knowledge of Hadoop and the skills to use it effectively

Page 38: Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce

/**
[email protected]
@yazovsky
www.linkedin.com/in/yazovsky
*/

/**
http://www.thumbtack.net
http://thumbtack.net/whitepapers
*/