Crowd Mining: Mining Association Rules from the Crowd


Crowd Mining

Tova Milo

Data Everywhere

The amount and diversity of data being generated and collected is exploding:

Web pages, sensor data, satellite pictures, DNA sequences, …


From Data to Knowledge


Buried in this flood of data are the keys to

- New economic opportunities
- Discoveries in medicine, science and the humanities
- Improving productivity & efficiency

However, raw data alone is not sufficient! We can only make sense of our world by turning this data into knowledge and insight.

The research frontier


• Knowledge representation
• Knowledge collection, transformation, integration, sharing
• Knowledge discovery

We will focus today on human knowledge

Think of humanity and its collective mind expanding…

The engagement of crowds of Web users for data procurement


Background - Crowd (Data) sourcing


General goals of crowdsourcing

• Main goal – “Outsourcing” a task (data collection & processing) to a crowd of users

• Kinds of tasks
  – Can be performed by a computer, but inefficiently
  – Can be performed by a computer, but inaccurately
  – Can’t be performed by a computer


So, are we done?

MoDaS (Mob Data Sourcing)

• Crowdsourcing has tons of potential, yet limited success so far
  – Difficulty of managing huge volumes of data, and users of questionable quality and reliability

• Every single initiative had to battle, almost from scratch, the same non-trivial challenges

• The ad hoc solutions, even when successful, are application-specific and rarely shareable

=> Needed: solid scientific foundations for Web-scale data sourcing, with a focus on knowledge management using the crowd


Challenges – (very) brief overview

• What questions to ask? [SIGMOD13, VLDB13]

• How to define & determine correctness of answers? [WWW12, ICDE11]

• Who to ask? How many people? How to best use the resources? [ICDE12, VLDB13, ICDT13, ICDE13]


[Diagram: Data Mining, Data Cleaning, Probabilistic Data, Optimizations and Incremental Computation]

A simple example – crowd data sourcing (Qurk)


Picture | Name
(photo) | Lucy
(photo) | Don
(photo) | Ken
…       | …

The goal: Find the names of all the women in the people table

SELECT name FROM people p WHERE isFemale(p)

isFemale(%name, %photo)
  Question: “Is %name a female?” (shown with %photo)
  Answers: “Yes” / “No”
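To make this concrete, here is a minimal sketch of how such a crowd-powered predicate could drive the selection. The ask_crowd helper, the majority vote, and all names are illustrative assumptions, not Qurk's actual API.

from collections import Counter

def ask_crowd(question, photo):
    """Post one yes/no question to a crowd platform (stub)."""
    raise NotImplementedError

def is_female(name, photo, votes=3):
    """Ask several workers; resolve disagreement by majority vote."""
    answers = [ask_crowd('Is %s a female?' % name, photo) for _ in range(votes)]
    return Counter(answers).most_common(1)[0][0] == "Yes"

def select_women(people):
    """SELECT name FROM people p WHERE isFemale(p), evaluated via the crowd."""
    return [p["name"] for p in people if is_female(p["name"], p["photo"])]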


Crowd Mining: Crowdsourcing in an open world

• Human knowledge forms an open world

• Assume we want to find out what is interesting and important in some domain area

Folk medicine, people’s habits, …

• What questions to ask?


Back to classic databases...

• Significant data patterns are identified using data mining techniques.

• A useful type of pattern: association rules
  – E.g., stomach ache → chamomile

• Queries are dynamically constructed in the learning process

• Is it possible to mine the crowd?


Turning to the crowd

Let us model the history of every user as a personal database

• Every case = a transaction consisting of items

• Not recorded anywhere – a hidden DB

– It is hard for people to recall many details about many transactions!

– But … they can often provide summaries, in the form of personal rules

“To treat a sore throat I often use garlic”


Treated a sore throat with garlic and oregano leaves…

Treated a sore throat and low fever with garlic and ginger …

Treated a heartburn with water, baking soda and lemon…

Treated nausea with ginger, the patient experienced sleepiness… …

Two types of questions

• Free recollection (mostly simple, prominent patterns) → open questions

• Concrete questions (may be more complex) → closed questions

We use the two types in an interleaved manner.

Closed: “When a patient has both headaches and fever, how often do you use a willow tree bark infusion?”

Open: “Tell me how you treat a particular illness”

“I typically treat nausea with ginger infusion”


Contributions (at a very high level)

• A formal model for crowd mining: the allowed questions and the interpretation of answers; personal rules and their overall significance

• A framework of the generic components required for mining the crowd

• Significance and error estimations [and how these change if we ask more questions…]

• Crowd-mining algorithms

• [Implementation & benchmark, on both synthetic & real data]


The model: User support and confidence

• A set of users U

• Each user u ∈ U has a (hidden!) transaction database Du

• Each rule X → Y is associated with, for each user u:
  – user support: the fraction of transactions in Du containing both X and Y
  – user confidence: among the transactions in Du containing X, the fraction that also contain Y
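A minimal sketch of these two measures, assuming the standard association-rule definitions over a user's transaction list (the helper names are illustrative):

def user_support(Du, X, Y):
    """The fraction of u's transactions containing every item of X and Y."""
    XY = set(X) | set(Y)
    return sum(XY <= t for t in Du) / len(Du)

def user_confidence(Du, X, Y):
    """Among u's transactions containing X, the fraction also containing Y."""
    with_X = [t for t in Du if set(X) <= t]
    return sum(set(Y) <= t for t in with_X) / len(with_X) if with_X else 0.0

# One user's hidden DB, as in the folk-medicine example:
Du = [{"sore throat", "garlic"}, {"sore throat", "honey"}, {"nausea", "ginger"}]
print(user_support(Du, {"sore throat"}, {"garlic"}))     # 0.33...
print(user_confidence(Du, {"sore throat"}, {"garlic"}))  # 0.5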


Model for closed and open questions

• Closed questions: X →? Y
  – Answer: (approximate) user support and confidence

• Open questions: ? →? ?
  – Answer: an arbitrary rule with its user support and confidence

“I typically have a headache once a week. 90% of the time, coffee helps.”


Significant rules

• Significant rules: rules where the mean user support and mean user confidence are above specified thresholds Θs, Θc

• Goal: estimate rule significance while asking the smallest possible number of questions to the crowd


Framework components

• One generic framework for crowd-mining

• One particular choice of implementation of all black boxes

• We do not claim any optimality

• But we validate by experiments


Estimating the mean distribution

• Treating the current answers as a random sample of a hidden distribution, we can approximate the distribution of the hidden mean as (approximately) normal, with mean μ and covariance Σ/K, where
  – μ is the sample average
  – Σ is the sample covariance
  – K is the number of collected samples

• In a similar manner we estimate the hidden distribution
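A minimal numeric sketch of this step, assuming each answer arrives as a (support, confidence) pair and using the central-limit approximation N(μ, Σ/K) for the hidden mean:

import numpy as np

def mean_distribution(answers):
    """answers: a K x 2 list of (user support, user confidence) samples.
    Returns mu and Sigma/K, the normal approximation of the hidden mean
    (requires K >= 2 for a sample covariance)."""
    a = np.asarray(answers, dtype=float)
    mu = a.mean(axis=0)               # sample average
    Sigma = np.cov(a, rowvar=False)   # sample covariance
    return mu, Sigma / len(a)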

Rule significance and error probability

• Define Mr as the probability mass above both thresholds for rule r

• r is significant if Mr is greater than 0.5

• The error probability is the remaining mass, min(Mr, 1 − Mr)

• Estimate how the error will change if another question is asked

• Choose the rule with the largest expected error reduction
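Continuing the sketch above, Mr and the error probability can be computed from the normal approximation by inclusion-exclusion over its CDF (one illustrative way to do it, not necessarily the paper's implementation):

from scipy.stats import norm, multivariate_normal

def significance(answers, theta_s, theta_c):
    """M_r = P(mean support > theta_s and mean confidence > theta_c)."""
    mu, cov = mean_distribution(answers)
    F_joint = multivariate_normal(mu, cov, allow_singular=True).cdf([theta_s, theta_c])
    F_s = norm(mu[0], max(cov[0, 0], 1e-12) ** 0.5).cdf(theta_s)  # marginal CDFs
    F_c = norm(mu[1], max(cov[1, 1], 1e-12) ** 0.5).cdf(theta_c)
    M_r = 1 - F_s - F_c + F_joint
    return M_r, M_r > 0.5, min(M_r, 1 - M_r)  # mass, significant?, error prob.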

Completing the picture (first attempt…)

• Which rules should be considered?
  Similarly to classic data mining (e.g., Apriori): start with small rules, then expand to rules similar to significant rules

• Should we ask an open or a closed question?
  Similarly to sequential sampling: use some fixed ratio of open/closed questions to balance the tradeoff between precision and recall (a sketch of the resulting loop follows)
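Here, ask_open, ask_closed, error_reduction and expand_candidates are assumed stubs standing for the pieces described above, and significance() is reused from the earlier sketch; this is an illustrative loop, not the paper's algorithm.

import random

def ask_open(): raise NotImplementedError            # -> (rule, (supp, conf))
def ask_closed(rule): raise NotImplementedError      # -> (supp, conf)
def error_reduction(rule): raise NotImplementedError # expected drop in error prob.
def expand_candidates(candidates): pass              # Apriori-style growth (stub)

def crowd_mine(budget, open_ratio=0.3, theta_s=0.5, theta_c=0.5):
    candidates = {}  # rule -> list of (support, confidence) answers
    for _ in range(budget):
        if not candidates or random.random() < open_ratio:
            rule, answer = ask_open()                    # free recollection
            candidates.setdefault(rule, []).append(answer)
        else:
            rule = max(candidates, key=error_reduction)  # best expected error reduction
            candidates[rule].append(ask_closed(rule))
        expand_candidates(candidates)
    return {r for r, a in candidates.items()
            if len(a) >= 2 and significance(a, theta_s, theta_c)[1]}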


Completing the picture (second attempt…)

How to measure the efficiency of crowd mining algorithms?

• Two distinguished cost factors:
  – Crowd complexity: the number of crowd queries used by the algorithm
  – Computational complexity: the complexity of computing the crowd queries and processing the answers

[A crowd-complexity lower bound is trivially a computational-complexity lower bound]

• There is a tradeoff between the complexity measures
  – Naïve question selection -> more crowd questions



Semantic knowledge can save work

Given a taxonomy of is-a relationships among items, e.g., espresso is a coffee:

frequent({headache, espresso}) ⇒ frequent({headache, coffee})

Advantages
• Allows inference on itemset frequencies
• Allows avoiding semantically equivalent itemsets: {espresso}, {espresso, coffee}, {espresso, beverage}… (a sketch of both uses follows)
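The taxonomy is assumed here to be encoded as a child-to-parent dictionary; an illustrative encoding, not the paper's data structures.

PARENT = {"espresso": "coffee", "coffee": "beverage"}

def ancestors(item):
    while item in PARENT:
        item = PARENT[item]
        yield item

def generalizes(general, specific):
    """True if every item of `general` is an item of `specific` or an
    ancestor of one, so frequent(specific) implies frequent(general)."""
    return all(any(g == s or g in set(ancestors(s)) for s in specific)
               for g in general)

def is_redundant(itemset):
    """An itemset containing an item together with one of its own
    ancestors, like {espresso, coffee}, is equivalent to the smaller one."""
    return any(a in itemset for item in itemset for a in ancestors(item))

# frequent({headache, espresso}) => frequent({headache, coffee}):
assert generalizes({"headache", "coffee"}, {"headache", "espresso"})
assert is_redundant({"espresso", "coffee"})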


Complexity boundaries

Given a taxonomy Ψ, we measure complexity in terms of

• the input Ψ: its size, shape, width…

• the output frequent itemsets, represented compactly by the
  – Maximal Frequent Itemsets (MFI)
  – Minimal Infrequent Itemsets (MII)
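For a small item universe, the MFI and MII can be computed by brute force once itemset frequencies are known; a sketch under that assumption (in the crowd setting, is_frequent would be answered by crowd questions):

from itertools import combinations

def mfi_and_mii(items, is_frequent, max_size=3):
    all_sets = [frozenset(c) for k in range(1, max_size + 1)
                for c in combinations(items, k)]
    frequent = {s for s in all_sets if is_frequent(s)}
    # MFI: frequent itemsets with no frequent strict superset
    mfi = {s for s in frequent if not any(s < t for t in frequent)}
    # MII: infrequent itemsets all of whose strict subsets are frequent
    mii = {s for s in all_sets if s not in frequent and
           all(frozenset(c) in frequent
               for k in range(1, len(s)) for c in combinations(s, k))}
    return mfi, mii

# Toy run: everything containing "c" is infrequent
print(mfi_and_mii(["a", "b", "c"], lambda s: "c" not in s, max_size=2))
# -> ({frozenset({'a', 'b'})}, {frozenset({'c'})})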



• Notations:

– |Ψ| - the taxonomy size

– |I(Ψ)| - the number of itemsets (modulo equivalences)

– |S(Ψ)| - the number of possible solutions


Now, back to the bigger picture…


The user’s question in natural language:

“I’m looking for activities to do in a child-friendly attraction in New York, and a good restaurant nearby”

Some of the answers:

“You can go bike riding in Central Park and eat at Maoz Vegetarian. Tips: Rent bikes at the boathouse”

“You can go visit the Bronx Zoo and eat at Pine Restaurant. Tips: Order antipasti at Pine. Skip dessert and go for ice cream across the street”

ID | Transaction details
1  | <Football> doAt <Central_Park>
2  | <Biking> doAt <Central_Park>. <BaseBall> doAt <Central_Park>. <Rent_Bikes> doAt <Boathouse>
3  | <Falafel> eatAt <Maoz Veg.>
4  | <Antipasti> eatAt <Pine>
5  | <Visit> doAt <Bronx_Zoo>. <Antipasti> eatAt <Pine>

Crowd Mining Query language (based on SPARQL and DMQL)


FIND association rules RELATED TO x, y+, z, u?, p?, v?
WHERE {$x instanceOf <Attraction>.
       $x inside <NYC>.
       $x hasLabel “Child Friendly”.
       $y subClassOf* <Activity>.
       $z instanceOf <Restaurant>.
       $z nearBy $x. ...}
MATCHING
  ( {} => {([] eatAt $z.)}
    WITH support THRESHOLD = 0.007 )
AND
  ( {([] doAt $x)} => {($y doAt $x),($u $p $v)}
    WITH support THRESHOLD = 0.01
    WITH confidence THRESHOLD = 0.2
    RETURN MFI )

Notes on the query:
• RELATED TO determines the group size for each variable
• SPARQL-like WHERE clause; $x is a variable
• subClassOf* matches a path of length 0 or more
• The left-hand and right-hand parts of the rule are defined as SPARQL patterns
• Mining parameters: thresholds, output granularity, etc.
• Several rule patterns can be mined, joined by AND/OR

(Closed) Questions to the crowd


• How often do you do something in Central Park? { ( [] doAt <Central_Park> ) } → Supp=0.7

• How often do you eat at Maoz Vegetarian? { ( [] eatAt <Maoz_Vegeterian> ) } → Supp=0.1

• How often do you swim in Central Park? { ( <Swim> doAt <Central_Park> ) } → Supp=0.3

(Open) Questions to the crowd


• What do you do in Central Park? { ( $y doAt <Central_Park> ) } → $y = Football, supp = 0.6

• What else do you do when biking in Central Park? { (<Biking> doAt <Central_Park>), ($y doAt <Central_Park>) } → $y = Football, supp = 0.6

• Complete: When I go biking in Central Park… { (<Biking> doAt <Central_Park>), ( $u $p $v ) } → $u = <Rent_Bikes>, $p = doAt, $v = <BoatHouse>, supp = 0.6

Example of extracted association rules

• <Ball Games, doAt, Central_Park> (Supp: 0.4, Conf: 1)
  ⇒ <[], eatAt, Maoz_Vegeterian> (Supp: 0.2, Conf: 0.2)

• <visit, doAt, Bronx_Zoo> (Supp: 0.2, Conf: 1)
  ⇒ <[], eatAt, Pine_Restaurant> (Supp: 0.2, Conf: 0.2)


Can we trust the crowd ?


Common solution: ask multiple times.

We may get different answers:
– Legitimate diversity
– Wrong answers/lies


Things are non-trivial …


• Different experts for different areas

• “Difficult” questions vs. “simple” questions

• Data is added and updated all the time

• Optimal use of resources (both machines and humans)

Solutions are based on:
– Statistical mathematical models
– Declarative specifications
– Provenance

Summary


The crowd is an incredible resource! Many challenges:
• (Very) interactive computation
• A huge amount of data
• Varying quality and trust

“Computers are useless, they can only give you answers” - Pablo Picasso

But, as it seems, they can also ask us questions!

Thanks


Antoine Amarilli, Yael Amsterdamer, Rubi Boim, Susan Davidson, Ohad Greenshpan, Benoit Gross, Yael Grossman, Ezra Levin, Ilia Lotosh, Slava Novgordov, Neoklis Polyzotis, Sudeepa Roy, Pierre Senellart, Amit Somech, Wang-Chiew Tan…

EU-FP7 ERC grant: MoDaS (Mob Data Sourcing)
