ITCS 6010 Data Integration
Presentation on Social Web
By: Garima Indurkhya, Jay Parikh, Shraddha Herlekar, Vikrant Naik


Post on 29-Dec-2015


Paper 1
The Structure of Collaborative Tagging Systems
Authors: Golder, S. and Huberman, B., 2005.

Contents
- What is tagging?
- Tagging & Taxonomy
- Aspects of Classification
- Kinds of Tags
- Case Study: Del.icio.us

What is Tagging?
- Marking content with descriptive terms
- Examples:
  - Catalog indexing by a librarian
  - Keywords describing a blog entry or a photo on the web
- Collaborative tagging: the practice of allowing anyone to freely attach keywords, or tags, to content
- Social bookmark managers:
  - Del.icio.us (http://del.icio.us)
  - Flickr (http://www.flickr.com)
  - CiteULike (http://www.citeulike.org/)
  - Cloudalicious (http://cloudalicio.us/)

Tagging & Taxonomy
Tagging
- Non-hierarchical
- Tags describe the content they are attached to
- A tag-based search returns a great variety of things simultaneously
- Example: the tags for an article about cats in Africa could be cats, africa, animals, cheetahs, etc.

Taxonomy
- Hierarchical
- Example: the taxonomy for the article about cats in Africa places it at one position in a hierarchy (the hierarchy figure is not preserved in this transcript)
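The contrast between the two schemes can be sketched in a few lines of Python. This is only an illustration: the tag set comes from the example above, while the category path and both helper functions are hypothetical, since the original hierarchy figure did not survive.

```python
# Flat tags: an unordered set of labels with no parent/child relation.
article_tags = {"cats", "africa", "animals", "cheetahs"}

# Hypothetical taxonomy path (illustrative only, not taken from the slides):
# the item sits at exactly one position in a hierarchy.
article_path = ("Animals", "Mammals", "Cats", "African cats")


def matches_tag(tags, wanted):
    """Tag search: the item matches if it carries the tag at all."""
    return wanted in tags


def under_category(path, prefix):
    """Taxonomy search: the item matches only if its path starts with the prefix."""
    return path[:len(prefix)] == tuple(prefix)
```

With tags, the article is found under "africa" as easily as under "cats"; with the single hypothetical path, it is found only by walking down from "Animals".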

Aspects of Classification
Problems to be considered while classifying:
- Semantic
  - Polysemy
  - Synonymy
- Cognitive
  - Basic level variation
- Sense making

Kinds of Tags
Several kinds of functions are performed by tags for bookmarks:
- Identifying what (or who) it is about
- Identifying what it is
- Identifying who owns it
- Identifying qualities or characteristics
- Self reference
- Task organizing

Case Study: Del.icio.us
- Collaborative tagging system for the web
- Social bookmark manager
- Storage of personal bookmarks
- Public nature of bookmarks

Paper 2
On the Selection of Tags for Tag Clouds
Authors: P. Venetis et al., WSDM, 2011.

Contents
- Tag Cloud
- System Model
- Properties of Tag Cloud
- Algorithms to generate Tag Clouds
- User Models for Tag Clouds
- Experimental Evaluation of algorithms
- Evaluation of User Models
- Conclusion

Tag Cloud
Definition: a visual representation of social tags, organized into a paragraph-style layout, usually in alphabetical order, where the relative size and weight of the font for each tag correspond to the relative frequency of its use.

- Compact
- Three dimensions at a time:
  - alphabetical order
  - size indicating importance
  - the tags themselves

Tag Cloud
Tag cloud for our example "cats in Africa"
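The layout rule in the definition (alphabetical order, font size following frequency) can be sketched as follows. The tag counts, the point-size range and the linear scaling are assumptions made for illustration; the slides do not say how sizes were computed.

```python
def font_sizes(tag_counts, min_pt=10, max_pt=32):
    """Map each tag's count to a font size by linear interpolation."""
    lo, hi = min(tag_counts.values()), max(tag_counts.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {
        tag: round(min_pt + (count - lo) * (max_pt - min_pt) / span)
        for tag, count in tag_counts.items()
    }


def layout(tag_counts):
    """Return (tag, size) pairs in alphabetical order."""
    sizes = font_sizes(tag_counts)
    return [(tag, sizes[tag]) for tag in sorted(sizes)]


# Hypothetical usage counts for the "cats in Africa" example.
counts = {"cats": 40, "africa": 25, "animals": 10, "cheetahs": 5}
cloud = layout(counts)
```

The result carries the three dimensions named on the slide: the tags themselves, their alphabetical position, and a size that signals importance.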

Tag Cloud
Uses of tag clouds:
- Summarizing web search results
- Summarizing results over biomedical databases
- Summarizing results of structured queries

Tag Cloud
Example of a tag cloud for summarizing web search results

System Model
Terminology:
- C = set of objects (e.g. web pages or articles)
- T = set of tags
- Cq = set of objects for query q
- |Cq| = number of objects in Cq
- Tq = set of tags for query q
- Aq(t) = association set of tag t: the objects of Cq associated with t. For every tag t ∈ Tq there is at least one c ∈ Cq in Aq(t), and Aq(t) ⊆ Cq
- S ⊆ Tq = set of tags in the tag cloud
- |S| = number of tags in the tag cloud

Functions:
- Partial (scoring) function s(t, c) : T × C → [0, 1]
- Similarity function sim(·, ·) : C × C → [0, 1]
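As a rough sketch of the terminology, the sets Cq, Tq and Aq(t) can be derived from a toy collection. The objects, tags and the rule "an object matches the query if it carries the query tag" are assumptions for illustration only.

```python
# Hypothetical toy data: object -> set of tags attached to it.
object_tags = {
    "c1": {"history", "roman"},
    "c2": {"history", "greek"},
    "c3": {"history", "roman", "art"},
    "c4": {"physics"},
}


def results_for(query_tag):
    """Cq: objects matching the query (assumed: objects carrying the query tag)."""
    return {c for c, tags in object_tags.items() if query_tag in tags}


def association_sets(Cq):
    """Aq(t) for every tag t attached to at least one object of Cq."""
    A = {}
    for c in Cq:
        for t in object_tags[c]:
            A.setdefault(t, set()).add(c)
    return A


Cq = results_for("history")
Aq = association_sets(Cq)
Tq = set(Aq)  # tags that occur on at least one result object
```

By construction every association set is a non-empty subset of Cq, which matches the note that Aq(t) ⊆ Cq.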

The scoring function establishes a partial ordering between the objects of Cq: if the score of ci is higher than that of cj, we say that ci is ranked higher than cj.

The similarity function takes two objects as input and outputs a number between 0 and 1. The higher the number, the more similar the objects.

Properties of Tag Cloud
Extent of S
- The cardinality of S: ext(S) = |S|
- The larger ext(S) is, the more topics are covered by S.

Coverage of S
- The scored size of the objects associated with S, where |Cq|s,q is the sum of the scores of every c ∈ Cq.
- Since it is more important to include high-ranked objects than low-ranked objects in a tag cloud, we select tags whose objects have a high scored size, which gives a high coverage.
- cov(S) lies between 0 and 1; the closer cov(S) is to 1, the more of the high-ranked objects are covered by the tag cloud.
- Example: for the query "history courses in Stanford", there are more Roman history courses than Ethiopian history courses, so we include the tag "roman" in the tag cloud since it covers more objects than the tag "ethiopian".
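Extent and coverage can be sketched as below. The coverage formula is an assumption consistent with the description above: the scored size of the union of the association sets of S, divided by the scored size of Cq. The scores and association sets are made-up values.

```python
def scored_size(objects, score):
    """|X|_{s,q}: sum of the query scores over the objects in X."""
    return sum(score[c] for c in objects)


def ext(S):
    """Extent: the number of tags in the tag cloud."""
    return len(S)


def cov(S, Aq, Cq, score):
    """Assumed coverage: scored size of all objects reachable via S, over that of Cq."""
    covered = set().union(*(Aq[t] for t in S))
    return scored_size(covered, score) / scored_size(Cq, score)


# Hypothetical result set with query scores and two association sets.
Cq = {"c1", "c2", "c3", "c4"}
score = {"c1": 0.5, "c2": 0.25, "c3": 0.125, "c4": 0.125}
Aq = {"roman": {"c1", "c2"}, "ethiopian": {"c4"}}
```

In this toy data "roman" reaches the two highest-scored objects, so a cloud containing it has much higher coverage than one containing only "ethiopian".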

Properties of Tag Cloud
Overlap of S
- The extent of redundancy among the tags of S

Cohesiveness of S
- How closely related the objects in each association set of S are
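A minimal sketch of both properties follows. The slides do not preserve the formulas, so two assumptions are made: overlap is averaged over tag pairs and normalized by the smaller association set, and cohesiveness is the mean pairwise similarity inside each association set, averaged over the tags. The toy similarity function and data are hypothetical.

```python
from itertools import combinations


def over(S, Aq):
    """Average pairwise overlap; each pair is normalized by the smaller set (assumption)."""
    pairs = list(combinations(sorted(S), 2))
    if not pairs:
        return 0.0
    total = sum(
        len(Aq[a] & Aq[b]) / min(len(Aq[a]), len(Aq[b])) for a, b in pairs
    )
    return total / len(pairs)


def coh(S, Aq, sim):
    """Average over tags in S of the mean pairwise similarity inside Aq(t)."""
    per_tag = []
    for t in S:
        pairs = list(combinations(sorted(Aq[t]), 2))
        if pairs:  # singleton sets have no pairs and are skipped (assumption)
            per_tag.append(sum(sim(a, b) for a, b in pairs) / len(pairs))
    return sum(per_tag) / len(per_tag) if per_tag else 0.0


# Hypothetical data: association sets and a topic label per object.
Aq = {
    "programming languages": {"c1", "c2"},
    "compilers": {"c2", "c3"},
    "databases": {"c4", "c5"},
}
topic = {"c1": "pl", "c2": "pl", "c3": "pl", "c4": "db", "c5": "ai"}


def same_topic(a, b):
    """Toy similarity: 1.0 for objects on the same topic, else 0.0."""
    return 1.0 if topic[a] == topic[b] else 0.0
```

Here "programming languages" and "compilers" share an object and therefore overlap, while "programming languages" and "databases" do not.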

If multiple tags are associated with the same object c, we say that they overlap. over(S) lies between 0 and 1; the closer over(S) is to 1, the more the tags in the tag cloud overlap. Example: for the query "computer science", choose "programming languages" and "databases" for the tag cloud instead of "programming languages" and "compilers", as the latter two tags overlap.

Cohesiveness is the average of the (intra) similarities between the objects in the association sets of the tags in S. coh(S) lies between 0 and 1; the closer coh(S) is to 0, the more dissimilar the objects shown through the tag cloud are.

Properties of Tag Cloud
Relevance of S
- Relevance between the tags in S and the original query q

Popularity of S
- A tag is more popular if it is associated with many objects in Cq.
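The two properties on this slide can be sketched as follows. Popularity of a tag is taken directly as |Aq(t)|. The relevance measure is an assumption, since the slide gives no formula: the share of all objects carrying a tag that fall inside the query results, averaged over the tags of the cloud. `A_all` (association sets over the whole collection) and the data are hypothetical.

```python
def tag_popularity(t, Aq):
    """Popularity of a tag within the query results: |Aq(t)|."""
    return len(Aq[t])


def tag_relevance(t, Aq, A_all):
    """Assumed measure: share of all objects tagged t that fall inside Cq."""
    return len(Aq[t]) / len(A_all[t])


def rel(S, Aq, A_all):
    """Relevance of a tag cloud, taken here as the mean over its tags."""
    return sum(tag_relevance(t, Aq, A_all) for t in S) / len(S)


# Hypothetical data for the query "computer science".
Aq = {"relational database": {1, 2, 3}, "theory": {4, 5}}
A_all = {
    "relational database": {1, 2, 3, 20},
    "theory": {4, 5, 21, 22, 23, 24, 25, 26},
}
```

In this toy data most objects tagged "relational database" belong to the query results, whereas "theory" is spread across many unrelated objects, so the former scores higher on relevance.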

rel(S) lies between 0 and 1; the closer rel(S) is to 1, the more related the tags in the tag cloud are to the query. Example: for the query "computer science", include the tag "relational database" in the cloud instead of the tag "theory".

Properties of Tag Cloud
Independence of S
- Tags are independent if they refer to dissimilar objects

Balance of S
- Ratio of the size of the smallest association set to the size of the largest association set among the tags of a tag cloud S.
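Balance follows directly from the definition above and can be sketched in one function. The course counts are invented to mirror the "history courses in Stanford" example; independence is left out because its formula is not given in the slides.

```python
def bal(S, Aq):
    """Balance: smallest association set size over largest, across the tags in S."""
    sizes = [len(Aq[t]) for t in S]
    return min(sizes) / max(sizes)


# Hypothetical association sets (objects are just integer ids).
Aq = {
    "greek": set(range(0, 8)),         # 8 courses
    "roman": set(range(8, 18)),        # 10 courses
    "european": set(range(18, 38)),    # 20 courses
    "ethiopian": {38},                 # 1 course
}
```

With these numbers the cloud {greek, roman} is close to balanced, while {european, ethiopian} is dominated by one tag.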

ind(S) lies between 0 and 1; the closer ind(S) is to 1, the more independent the tags are. Independence and cohesiveness seem to be complementary to each other. Example: "programming" and "software" are not independent.

bal(S) lies between 0 and 1; the closer bal(S) is to 1, the more balanced the tag cloud is. Example: for the query "history courses in Stanford", compare the tag clouds {Greek, Roman} and {European, Ethiopian}. "Ethiopian" points to very few courses, hence the second tag cloud is imbalanced.

Algorithms to generate Tag Clouds
- Single- vs. multi-objective tag selection, e.g. achieving high popularity, getting more coverage, being more cohesive, incorporating relevance
- Input to algorithms: Cq, Tq and S ⊆ Tq

Algorithms to generate Tag Clouds
Popularity algorithm (POP)
- The most common algorithm in social information sharing
- A tag is more popular if it is associated with many objects in Cq.
- It allows users to see what other people are mostly interested in sharing.
- For a query q and parameter k, the algorithm returns the top k tags in Tq according to their |Aq(t)|.
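The selection rule in the last bullet can be sketched as a single sort. The alphabetical tie-breaking and the toy association sets are assumptions added to make the example deterministic.

```python
def pop_algorithm(Aq, k):
    """Return the k tags with the largest association sets (ties broken alphabetically)."""
    ranked = sorted(Aq, key=lambda t: (-len(Aq[t]), t))
    return ranked[:k]


# Hypothetical association sets for one query.
Aq = {
    "history": {1, 2, 3, 4},
    "art": {1, 2, 3},
    "roman": {1, 2},
    "greek": {3},
}
top_two = pop_algorithm(Aq, 2)
```

Note that POP looks only at set sizes, so it can pick tags whose objects largely coincide, as "history" and "art" do here.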

Tf-idf based algorithms (TF, WTF)
- A score w for each tag t is obtained by aggregating the values of a function f(q, t, c).
- f(q, t, c) = s(t, c) (tf-idf method, TF)
- f(q, t, c) = s(t, c) · s(q, c) (weighted tf-idf method, WTF)
- The algorithm returns the k tags with the highest scores.
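A small sketch of the two scoring variants follows. It assumes that the aggregation is a sum over the objects in Aq(t), which the slide does not state; the score tables `s_tag` and `s_query` hold made-up values for s(t, c) and s(q, c).

```python
def tag_scores(Aq, s_tag, s_query=None):
    """w(t) = sum over c in Aq(t) of f(q, t, c) (summation is an assumption).

    TF:  f = s(t, c)             (s_query is None)
    WTF: f = s(t, c) * s(q, c)   (s_query given)
    """
    scores = {}
    for t, objects in Aq.items():
        scores[t] = sum(
            s_tag[(t, c)] * (s_query[c] if s_query is not None else 1.0)
            for c in objects
        )
    return scores


def top_k(scores, k):
    """Return the k highest-scoring tags (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical scores.
Aq = {"roman": {"c1", "c2"}, "art": {"c3"}}
s_tag = {("roman", "c1"): 0.5, ("roman", "c2"): 0.25, ("art", "c3"): 1.0}
s_query = {"c1": 0.5, "c2": 0.5, "c3": 0.25}

tf_scores = tag_scores(Aq, s_tag)
wtf_scores = tag_scores(Aq, s_tag, s_query)
```

In this toy data TF prefers "art", but once each object is weighted by its query score, WTF prefers "roman", which illustrates how the weighting pulls the cloud toward tags on highly ranked results.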

Maximum Coverage Algorithm (COV)
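The body of this slide is not preserved, so the following is not the presenters' algorithm. It is a sketch of the standard greedy heuristic for maximum coverage, shown only as one plausible way to pick k tags that cover as much scored size as possible; the function name, data and tie-breaking are assumptions.

```python
def greedy_cov(Aq, score, k):
    """Pick up to k tags, each time taking the tag that adds the most uncovered scored size."""
    chosen, covered = [], set()
    candidates = set(Aq)

    def gain(t):
        # Scored size of the objects that tag t would newly cover.
        return sum(score[c] for c in Aq[t] - covered)

    for _ in range(k):
        # sorted() makes ties deterministic: the alphabetically first tag wins.
        best = max(sorted(candidates), key=gain, default=None)
        if best is None or gain(best) == 0:
            break  # nothing left that improves coverage
        chosen.append(best)
        covered |= Aq[best]
        candidates.remove(best)
    return chosen


# Hypothetical association sets; every object has score 1.
Aq = {
    "history": {1, 2, 3},
    "roman": {1, 2},
    "art": {4},
    "greek": {3},
}
score = {1: 1, 2: 1, 3: 1, 4: 1}
```

On this data a size-based choice would take "history" and "roman", whereas the greedy coverage choice takes "history" and "art", because "roman" adds no object that "history" does not already cover.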

User Models for Tag Clouds
- Build an ideal user satisfaction model
- Use this model to compare tag clouds

Base model: Coverage
- The probability that an object covered by the tag cloud is of the user's interest is r·p, while the probability that any other object is of the user's interest is p.

User Models for Tag Clouds
Extensions of the base model:
- Incorporating relevance
- Incorporating cohesiveness
- Incorporating overlap
- Taking into account scores
- Closing comment

Each extension has the same form as the base model: an object in a given set is of the user's interest with a model-specific probability, and every other object is of the user's interest with probability p. For some extensions the condition concerns an object c that is contained in one association set and in no other association set. (The sets and probability expressions were formulas on the slides and are not preserved in this transcript.)

Experimental Evaluation

Datasets:
- CourseRank: all the words in the course title, course description and user comments are taken as the tags of the object.
- del.icio.us: consists of URLs and the tags that users assigned to them on del.icio.us.

Graph: CourseRank is denser than del.icio.us: the number of tags that have a particular number of related objects (|Aq(t)|) is higher for CourseRank than for del.icio.us.

Experimental Evaluation of algorithms: CourseRank
- Most metrics are not correlated
- Only coverage and popularity are correlated
- High coverage might not be highly relevant
- Algorithms impact metrics differently

To understand their impact, we ran each algorithm on each of our queries in Q for both datasets and evaluated all the metrics we have defined.

Dataset: CourseRank
- Coverage: COV is highest; POP is good.
- Overlap: all algorithms yield similar overlap. POP is a little low, owing to many popular tags like "history" and "art", but maybe with no significant impact.

Experimental Evaluation of algorithms: CourseRank
- Cohesiveness and relevance: WTF does well; TF performs average.
- Balance and popularity: POP maximizes the popularity metric and is good in balance.

Experimental Evaluation of algorithms: del.icio.us
- Similar, but the overall range of values for the coverage metric is around 0.2-0.8, much lower than for the CourseRank dataset.

Impact on failure probability
- Algorithms impact failure probability differently

- As r grows, the objects covered by the tag cloud are more likely to be of interest.
- CourseRank: COV is best, followed by POP, TF and then WTF.
- del.icio.us: the performance of all algorithms is closer; the failure probability is much larger and varies more.

Evaluation of User Models
- The x-axis is the difference in failure probability within a pair of tag clouds; the y-axis shows the percentage of queries for which our model predicted the best tag cloud, out of those pairs on which the workers came to an agreement.
- 80% are predicted correctly, even when the difference in failure probability is small.
- 100% are predicted correctly for a difference of 0.15-0.25, so when the workers agree, the model gives the best tag cloud.

Conclusion
- Metrics are generally not correlated, so different important aspects of a tag cloud are covered.

- COV is the best algorithm for finding a tag cloud, followed by POP
- POP works well with relevance and cohesiveness!
- The user model is a useful tool to identify the tag clouds preferred by users

Future Work
- Extend the model to capture the balance metric
- Construct an algorithm to minimize failure probability for a dataset and a given extent
- Take into account items with unassigned and spam tags

Thank you!