Benchmarking Cloud-based Tagging Services

Tanu Malik, Kyle Chard, Ian Foster
Computation Institute, Argonne National Laboratory and University of Chicago


DESCRIPTION

Workshop on CloudDB, in conjunction with ICDE, 2014

TRANSCRIPT

Page 1: Benchmarking Cloud-based Tagging Services


Benchmarking Cloud-based Tagging Services
Tanu Malik, Kyle Chard, Ian Foster
Computation Institute, Argonne National Laboratory and University of Chicago

Page 2: Benchmarking Cloud-based Tagging Services


Quiz

• Blacksmith’s tools
• X-Y axes as a tool for calculus
• Bayes’ theorem as a tool for probability theory

Page 3: Benchmarking Cloud-based Tagging Services


Tagging: A tool for information management

Page 4: Benchmarking Cloud-based Tagging Services


A Tagging Service

• Suppose we want to build a large-scale tagging service that allows tagging of data files
1. What is the best way to store this large number of tags and their values in a database?
2. Which cloud-based database offering should we choose?

Page 5: Benchmarking Cloud-based Tagging Services


Storing Tags: Many Possibilities

• resource-tag-value is a 3-tuple, like entity-attribute-value (see the sketch below)
– noSQL stores
– triple stores
– relational stores
o horizontal schema
o triple schema
o vertically partitioned schema / decomposed storage model / columnar schema
o attribute schema
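
To make these layout options concrete, here is a minimal sketch of the three relational layouts for the same resource-tag-value facts, using SQLite from Python; the table and column names (horizontal, triple, tag_color, ...) are hypothetical illustrations, not the benchmark’s actual schemas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Horizontal schema: one wide table, one column per tag definition.
# Sparse data shows up as NULLs in the unused tag columns.
cur.execute("""CREATE TABLE horizontal (
    resource_id INTEGER PRIMARY KEY,
    color TEXT,     -- tag 'color'
    quality REAL,   -- tag 'quality'
    reviewed TEXT   -- tag 'reviewed'
)""")

# Triple schema: one row per (resource, tag, value) triple,
# as in entity-attribute-value stores.
cur.execute("""CREATE TABLE triple (
    resource_id INTEGER,
    tag TEXT,
    value TEXT
)""")

# Vertically partitioned / decomposed / columnar schema: one
# narrow two-column table per tag, holding only non-NULL values.
cur.execute("CREATE TABLE tag_color (resource_id INTEGER, value TEXT)")
cur.execute("CREATE TABLE tag_quality (resource_id INTEGER, value REAL)")

# The same fact, stored three ways:
cur.execute("INSERT INTO horizontal (resource_id, color) VALUES (1, 'red')")
cur.execute("INSERT INTO triple VALUES (1, 'color', 'red')")
cur.execute("INSERT INTO tag_color VALUES (1, 'red')")
conn.commit()
```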

Page 6: Benchmarking Cloud-based Tagging Services


Which cloud-based offering to choose from?

• No clear consensus regarding the performance of cloud-based database offerings for sparse datasets
• Motivates a tagging benchmark and an infrastructure that uses this benchmark to compare various cloud-based database offerings
– determine which cloud-based platform provides the most efficient support for dynamic, sparse data

Page 7: Benchmarking Cloud-based Tagging Services


Outline

• Related Work
• The Tagging Model
• Workload Generation
• Framework for Evaluating Tagging Workloads
• Experiments

Page 8: Benchmarking Cloud-based Tagging Services


Related Work

• Twitter hashtags vs. tags in del.icio.us, Flickr
– messages are tagged, resulting in transient, trending social groups
– resources are tagged with semantically meaningful keywords
– tags with values
• Cloud benchmarks
– The CloudStone benchmark
o Web 2.0 social application in which event resources are tagged with comments and ratings as tags
o considers database tuning as a complex process
– OLTP-Bench
o infrastructure for monitoring performance and resource consumption of a cloud-based database using a variety of relational workloads
o does not support tagging workloads

Page 9: Benchmarking Cloud-based Tagging Services


The Tagging Model

Page 10: Benchmarking Cloud-based Tagging Services


The Tagging Model

• 3 modes to associate tag-value pairs with resources (sketched below):
– blind: users do not know the tags assigned to a resource
– viewable: users can see the tags associated with a resource
– suggestive: the service suggests possible tags to users
• Access control over resources, tags, and their values:
– assign permissions on resources, similar to Unix-style permissions
– policies assigned at the individual and group level
– policies range from allowing users to tag only the resources they created, to allowing use of tags by any user
– policies on the creation or removal of a resource
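
As a rough illustration of the modes and Unix-style permissions described above, here is a sketch under assumed semantics (not the authors’ implementation; the class and function names are hypothetical):

```python
from dataclasses import dataclass, field
from enum import Enum

class Mode(Enum):
    BLIND = "blind"            # tagger cannot see existing tags
    VIEWABLE = "viewable"      # tagger can see existing tags
    SUGGESTIVE = "suggestive"  # service proposes candidate tags

@dataclass
class Resource:
    owner: str
    group: str
    perms: int = 0o644              # Unix-style owner/group/other bits
    tags: dict = field(default_factory=dict)

def can_tag(user: str, groups: set, res: Resource) -> bool:
    """Assumed policy: the 'write' bit grants permission to add a tag."""
    if user == res.owner:
        return bool(res.perms & 0o200)
    if res.group in groups:
        return bool(res.perms & 0o020)
    return bool(res.perms & 0o002)

r = Resource(owner="alice", group="lab", perms=0o640)
print(can_tag("alice", set(), r))   # True: owner may tag
print(can_tag("bob", {"lab"}, r))   # False: group bit is read-only
```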

Page 11: Benchmarking Cloud-based Tagging Services


Key Observations

• Some resources are more frequently tagged than others
– empirical studies have shown that a power law holds for this phenomenon (see the sampling sketch below)
• The number of distinct tags owned by a user is relatively small
– the probability that a Flickr user has more than 750 distinct tags is roughly 0.1%
• Distinct tag usage increases with the number of users and resources in the system
– with a higher correlation coefficient for the number of resources than for users

J. Huang et al., “Conversational tagging in Twitter,” Hypertext and Hypermedia, ACM, 2010.
C. Marlow et al., “HT06, tagging paper, taxonomy, Flickr, academic article, to read,” Hypertext and Hypermedia, ACM, 2006.
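
The power-law observation can be made concrete with a small sampling sketch, assuming numpy is available; the exponent here is illustrative, not a value from the talk:

```python
import numpy as np

rng = np.random.default_rng(42)
NUM_RESOURCES = 10_000

# Zipf-like popularity: the resource of rank r is chosen with
# probability proportional to r**(-alpha).
alpha = 1.1  # illustrative exponent, not taken from the studies cited
ranks = np.arange(1, NUM_RESOURCES + 1)
weights = ranks ** -alpha
probs = weights / weights.sum()

# Draw the resources targeted by the next five tagging operations;
# low indices (popular resources) dominate the sample.
print(rng.choice(NUM_RESOURCES, size=5, p=probs))
```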

Page 12: Benchmarking Cloud-based Tagging Services


Workload Generation

• No separate data and query generation phases
– database tables and their attributes, i.e., tags, are decided by the incoming user workload
• Session-based, closed-loop Markov-chain process in Python

Page 13: Benchmarking Cloud-based Tagging Services


Workload Generation

• Transition probabilities in the Markov chain (a sketch follows below):
– Some resources are more frequently tagged than others
o when a user queries for a resource to be tagged, the chosen resource is dictated by a power law
– The number of distinct tags owned by a user is relatively small
o the transition probability of reusing an existing tag is higher than the probability of creating a new tag definition
– Distinct tag usage increases with the number of users and resources in the system
o internal transition probabilities in the tagging state are a function of the previous state
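
Here is a minimal sketch of a session-based, closed-loop Markov-chain generator in this spirit; the states and transition probabilities are hypothetical placeholders, not the benchmark’s actual chain:

```python
import random

# Hypothetical session states and transition probabilities.
# Reusing an existing tag is made more likely than defining a new one,
# reflecting the observation about distinct tags per user.
TRANSITIONS = {
    "start":     [("search", 0.6), ("tag", 0.4)],
    "search":    [("tag", 0.5), ("search", 0.3), ("end", 0.2)],
    "tag":       [("reuse_tag", 0.7), ("new_tag", 0.1), ("end", 0.2)],
    "reuse_tag": [("tag", 0.5), ("end", 0.5)],
    "new_tag":   [("tag", 0.5), ("end", 0.5)],
}

def run_session(rng: random.Random) -> list[str]:
    """Walk the chain until 'end'. Closed loop: each operation
    completes before the next state is chosen."""
    state, trace = "start", []
    while state != "end":
        trace.append(state)
        states, probs = zip(*TRANSITIONS[state])
        state = rng.choices(states, weights=probs)[0]
    return trace

print(run_session(random.Random(7)))
```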

Page 14: Benchmarking Cloud-based Tagging Services


Workload Generation

• Search for resources (illustrative SQL for several of these query classes is sketched below)
• Find all other tags or popular tag-value pairs on a given resource
• Suggest tags for a selected resource
• Find resources that are “linked” to a set of resources
• Find resources where value like ‘V’ exists
– this query fetches all tags with which the user is associated
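
Against the triple layout sketched earlier, some of these query classes might look as follows; this is an assumed illustration over the hypothetical triple(resource_id, tag, value) table, not the benchmark’s actual SQL:

```python
# Illustrative SQL over the hypothetical triple(resource_id, tag, value)
# table; parameter markers stand in for user-supplied values.

# Find all tags and values on a given resource.
TAGS_ON_RESOURCE = """
SELECT tag, value FROM triple WHERE resource_id = ?
"""

# Find resources "linked" to a pair of resources: here interpreted as
# sharing at least one (tag, value) pair with either of them.
LINKED_RESOURCES = """
SELECT DISTINCT t2.resource_id
FROM triple t1 JOIN triple t2
  ON t1.tag = t2.tag AND t1.value = t2.value
WHERE t1.resource_id IN (?, ?) AND t2.resource_id NOT IN (?, ?)
"""

# Find resources where a value like 'V' exists.
VALUE_LIKE = """
SELECT DISTINCT resource_id FROM triple WHERE value LIKE ?
"""
```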

Page 15: Benchmarking Cloud-based Tagging Services


Framework for Evaluating Tagging Datasets

• Tagging modifies the underlying database schema, adding attributes or inserting values (see the sketch below)
– defining new tags increases the sparseness of the dataset
– reusing existing tags or adding new values decreases the sparseness of the database
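
Concretely, under a horizontal layout a new tag definition is a schema change, while reusing a tag only inserts values; a self-contained sketch with hypothetical tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE horizontal (resource_id INTEGER PRIMARY KEY, color TEXT)")
cur.execute("CREATE TABLE triple (resource_id INTEGER, tag TEXT, value TEXT)")
cur.execute("INSERT INTO horizontal VALUES (1, 'red')")

# Defining a NEW tag: under the horizontal layout this is a DDL change
# that adds a mostly-NULL column, increasing sparseness ...
cur.execute("ALTER TABLE horizontal ADD COLUMN license TEXT")
# ... while under the triple layout it is just another row.
cur.execute("INSERT INTO triple VALUES (1, 'license', 'CC-BY')")

# REUSING an existing tag only fills in values, decreasing sparseness.
cur.execute("UPDATE horizontal SET license = 'CC-BY' WHERE resource_id = 1")
conn.commit()
```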

Page 16: Benchmarking Cloud-based Tagging Services


Framework for Evaluating Tagging Datasets

• Given a tagging (query and tag) workload, generate a variety of schemas for a given database system
• Measure the sparseness of the generated tagging dataset using a sparseness metric s
– low s => vertically partitioned schema
– high s => horizontal schema
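
The exact formula for s is not reproduced here; one definition consistent with the low-s/high-s guidance above is the fill fraction of the equivalent horizontal table (populated cells over total cells). The sketch below rests on that assumption:

```python
def sparseness_metric(rows: list[dict], all_tags: list[str]) -> float:
    """ASSUMED definition of s: fraction of populated cells in the
    equivalent horizontal table (rows x tags); sparser data => lower s."""
    if not rows or not all_tags:
        return 0.0
    populated = sum(1 for row in rows for t in all_tags if t in row)
    return populated / (len(rows) * len(all_tags))

# Example: 3 resources, 3 known tags, 4 populated cells => s ~= 0.44.
rows = [{"color": "red"},
        {"color": "blue", "quality": 0.9},
        {"reviewed": "yes"}]
print(sparseness_metric(rows, ["color", "quality", "reviewed"]))
```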

Page 17: Benchmarking Cloud-based Tagging Services


Experimental Setup

• OLTP-Bench: client infrastructure that executes web workloads on relational databases deployed on the cloud
– the workload generator feeds into the centralized Workload Manager
– the Manager generates a work queue, which is consumed by a user-specified number of threads to control parallelism and concurrency (sketched below)
– all the threads currently run on a single machine
• We experiment with three databases, referred to in the results as DB_A, DB_B, and DB_C:
– MySQL with InnoDB, version 5.6
– Postgres, version 9.2
– SQL Server 2010
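
As a rough sketch of the queue-and-threads execution model just described (hypothetical code, not OLTP-Bench’s actual implementation):

```python
import queue
import threading

work_queue: "queue.Queue[str]" = queue.Queue()
for op in ["tag r1", "search 'red'", "tag r2", "suggest r1"]:
    work_queue.put(op)

def worker(worker_id: int) -> None:
    # Each thread pulls operations until the queue drains, so the
    # thread count controls parallelism and concurrency.
    while True:
        try:
            op = work_queue.get_nowait()
        except queue.Empty:
            return
        print(f"worker {worker_id}: executing {op}")
        work_queue.task_done()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```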

Page 18: Benchmarking Cloud-based Tagging Services


EC2 setup

• Server DBMSs were deployed within the same geographical region
• A single instance was dedicated to the workers and to collecting statistics; another instance ran the DBMS server
• For each DBMS server, we flushed the system’s buffers before each experiment
• As recommended by OLTP-Bench, to mitigate noise in cloud environments, we ran all of our experiments without restarting our EC2 instances (where possible), executed the benchmarks multiple times, and averaged the results

Page 19: Benchmarking Cloud-based Tagging Services


Comparing DB SaaS providers

• Given a schema and a tagging workload, which database and cloud offering provides the best trade-off between performance (good overall throughput and low latency) and cost?

Page 20: Benchmarking Cloud-based Tagging Services


[Figure: benchmark results for s = 0.37]

Page 21: Benchmarking Cloud-based Tagging Services


[Figure: benchmark results for s = 0.67]

Page 22: Benchmarking Cloud-based Tagging Services


DB_A and different machine sizes

Page 23: Benchmarking Cloud-based Tagging Services


Conclusion

• Tagging is becoming popular in Web 2.0 applications as a tool for managing large volumes of information
• It is also being introduced as part of scientific workflows and datasets
• Running a high-performance tagging service can be tricky
– right data model
– right database
• We proposed a benchmark for a tagging service
– it is open source, though not yet released
– [email protected]
• Results show the benefits of using a benchmark