![Page 1: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/1.jpg)
Flickr Tag Analysis
Ahmet Iscen
![Page 2: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/2.jpg)
Outline
Social Media
What is Flickr?
Flickr Photos
Association Rule
Latent Semantic Analysis
Latent Dirichlet Allocation
Conclusions
![Page 3: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/3.jpg)
Social Media
An important part of our daily lives today
If Twitter were a country, it would be the 12th largest in the world
Two new members sign up to LinkedIn every second
![Page 4: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/4.jpg)
What is Flickr?
Image and video hosting
Acquired by Yahoo! in 2005
51 million registered members and 80 million unique visitors as of June 2011
6 million photos
Widely used by researchers
![Page 5: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/5.jpg)
Flickr
![Page 6: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/6.jpg)
Dataset
Xirong Li's Flickr-3.5M Dataset
3,500,000 images
570,000 unique tags
270,000 unique user-ids
Randomly selected 250,000 images with their tags
http://staff.science.uva.nl/~xirong/index.php?n=DataSet.Flickr3m
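One way to draw a fixed-size random subset from a collection that is too large to load at once is reservoir sampling. The sketch below is only an illustration under that assumption: `reservoir_sample` and the toy records are hypothetical and are not part of the dataset's tooling, and the slides do not say how the 250,000 images were actually drawn.

```python
import random

def reservoir_sample(records, k, seed=0):
    """Return k records chosen uniformly at random from an iterable (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample

# Hypothetical stand-in for (photo id, tags) records.
toy_records = [("photo%d" % n, ["tag%d" % (n % 7)]) for n in range(1000)]
subset = reservoir_sample(toy_records, k=25)
```

The records are read once and only `k` of them are kept in memory, which is why this approach suits a multi-million image collection.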
![Page 7: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/7.jpg)
Challenges
Tags totally depend on the user
Can be extremely noisy
Huge range of possible words
Examples:
milos tasic milosevrodjendan verjaardagmilos
desember 2005
tmo
![Page 8: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/8.jpg)
Preprocessing
Eliminate stopwords (a, for, the, etc.)
Eliminate extreme words (those that appear in fewer than 20 photos or in more than 80% of the photos)
Porter Stemmer (only for association rule mining)
Convert everything to lowercase
Eliminate tags with fewer than 2 letters or more than 20 letters
Eliminate numerical tags
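The filtering steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used for the slides: the function names, the tiny stopword list, and the toy photos are assumptions, and Porter stemming is left out because it needs an external stemmer.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the list used in the slides).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def clean_tags(tags):
    """Lowercase tags and drop stopwords, numeric tags, and tags outside 2-20 characters."""
    cleaned = []
    for tag in tags:
        tag = tag.lower().strip()
        if tag in STOPWORDS or tag.isdigit():
            continue
        if len(tag) < 2 or len(tag) > 20:
            continue
        cleaned.append(tag)
    return cleaned

def drop_extremes(photos, min_photos=20, max_fraction=0.8):
    """Drop tags found in fewer than min_photos photos or in more than max_fraction of them."""
    doc_freq = Counter(tag for tags in photos for tag in set(tags))
    limit = max_fraction * len(photos)
    keep = {tag for tag, n in doc_freq.items() if min_photos <= n <= limit}
    return [[tag for tag in tags if tag in keep] for tags in photos]

photos = [
    ["Paris", "the", "2005", "France"],
    ["paris", "x", "Eiffel"],
    ["PARIS", "france", "averyveryverylongtagname123"],
]
cleaned = [clean_tags(tags) for tags in photos]
# Thresholds are shrunk here so the three-photo toy corpus keeps something.
filtered = drop_extremes(cleaned, min_photos=2, max_fraction=1.0)
```

On the real corpus the defaults (`min_photos=20`, `max_fraction=0.8`) correspond to the thresholds listed above.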
![Page 9: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/9.jpg)
Association Rule Mining
RapidMiner
[york] --> [new] (confidence: 0.910) Support: 0.04
[geolat, geolon] --> [geotag] (confidence: 0.986) Support: 0.03
[hors, lotharlez] --> [caballo, cheval, hestur] (confidence: 0.846) Support: 0.03
[paard] --> [hors, lotharlenz, zirg] (confidence: 0.802) Support: 0.03
[hors, paard] --> [lotharlenz, zirg] (confidence: 0.802) Support: 0.03
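To read rules such as these, it helps to recall the standard definitions: the support of a rule is the fraction of photos that contain every tag in the rule, and the confidence of A --> B is support(A and B) divided by support(A). The sketch below computes both on a toy tag list; the function names and the data are illustrative assumptions and do not reproduce the RapidMiner run.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so report zero confidence
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base

photos = [
    ["new", "york", "nyc"],
    ["new", "york"],
    ["york", "england"],
    ["new", "zealand"],
    ["beach"],
]
rule_support = support(photos, ["york", "new"])          # 2 of the 5 photos
rule_confidence = confidence(photos, ["york"], ["new"])  # 2 of the 3 photos tagged "york"
```

Under these definitions, a rule like [york] --> [new] with confidence 0.910 means that about 91% of the photos tagged "york" are also tagged "new".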
![Page 10: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/10.jpg)
Association Rule Mining
Poor results.
Probably due to noise and variance in the data.
Takes too much time to process the words and find rules.
Need to find alternative methods
![Page 11: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/11.jpg)
Latent Semantic Analysis
Same as LSI (the name used in the IR field)
SVD on the term-document matrix to reduce dimensionality
Words are compared by taking the cosine of the angle between the vectors formed by any two rows (one row per word).
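The idea can be sketched with NumPy on a toy matrix whose rows are tags and whose columns are photos. The matrix, the choice of `k = 2`, and the helper `cosine` are assumptions made for illustration only; they are not the data or settings behind the results in these slides.

```python
import numpy as np

tags = ["paris", "france", "cat", "kitten"]
# Rows are tags, columns are photos (toy co-occurrence counts).
X = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 0],
], dtype=float)

# Thin SVD, then keep only the k strongest latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
tag_vectors = U[:, :k] * s[:k]  # one k-dimensional vector per tag

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_paris_france = cosine(tag_vectors[0], tag_vectors[1])
sim_paris_cat = cosine(tag_vectors[0], tag_vectors[2])
```

In this toy example "paris" and "france" share photos and end up pointing in nearly the same direction, while "paris" and "cat" never co-occur and end up close to orthogonal.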
![Page 12: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/12.jpg)
Implementation
Gensim, a topic modeling toolkit
Python
Tested different corpus and topic sizes
![Page 13: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/13.jpg)
Latent Semantic Analysis
250,000 photos, 20 topics
topic #0: 0.997*"wedding" + 0.047*"family" + 0.023*"friends" + 0.022*"party" + 0.019*"reception" + 0.013*"california" + 0.011*"ceremony" + 0.009*"india" + 0.008*"church" + 0.008*"sanfrancisco"
topic #11: 0.491*"newyork" + -0.463*"china" + 0.448*"nyc" + -0.233*"beach" + 0.174*"newyorkcity" + 0.146*"italy" + -0.132*"friends" + -0.123*"flowers" + 0.119*"new" + -0.117*"beijing"
topic #4: 0.586*"paris" + -0.524*"family" + 0.417*"france" + 0.186*"london" + 0.178*"party" + -0.169*"halloween" + 0.156*"europe" + -0.121*"japan" + 0.103*"travel" + 0.063*"birthday"
topic #1: 0.701*"halloween" + 0.588*"party" + 0.169*"friends" + 0.165*"family" + 0.157*"birthday" + 0.126*"japan" + 0.071*"christmas" + 0.059*"london" + 0.058*"travel" + 0.055*"beach"
![Page 14: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/14.jpg)
Latent Semantic Analysis
250,000 photos, 50 topics
topic #10: -0.655*"friends" + 0.633*"china" + 0.221*"travel" + 0.166*"beijing" + 0.136*"party" + -0.088*"beach" + 0.075*"vacation" + 0.071*"greatwall" + 0.070*"shanghai" + -0.066*"flowers"
topic #28: -0.580*"india" + -0.323*"trip" + 0.279*"nature" + 0.262*"snow" + -0.258*"dog" + -0.224*"sunset" + 0.200*"winter" .
topic #20: -0.527*"cat" + 0.511*"sunset" + 0.266*"sky" + -0.242*"california" + -0.209*"sanfrancisco" + 0.198*"clouds" + -0.167*"beach" + -0.156*"flower" + -0.149*"cats" + -0.132*"dog"
topic #17: -0.323*"california" + -0.272*"sanfrancisco" + 0.269*"cat" + 0.254*"horse" + 0.211*"pferd" + 0.207*"cheval" + 0.205*"caballo" + 0.205*"paard" + 0.204*"hest" + 0.204*"cavalo"
![Page 15: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/15.jpg)
Latent Semantic Analysis
250,000 photos, 100 topics
topic #29: 0.689*"australia" + 0.279*"sydney" + -0.233*"nature" + 0.220*"trip" + -0.209*"france" + -0.187*"india" + -0.175*"snow" + 0.157*"new" + 0.144*"paris" + -0.134*"winter"
topic #58: 0.401*"geotagged" + 0.385*"geolat" + 0.380*"geolon" + -0.261*"people" + 0.259*"day" + 0.198*"england" + 0.191*"newzealand" + -0.178*"canada" + 0.168*"water" + -0.144*"portrait"
topic #45: 0.406*"fall" + 0.398*"park" + 0.315*"october" + -0.291*"animals" + 0.289*"autumn" + -0.262*"art" + 0.182*"leaves" + -0.175*"zoo" + -0.163*"sky" + 0.132*"garden"
topic #85: -0.673*"hongkong" + 0.221*"florida" + 0.221*"singapore" + 0.209*"winter" + 0.174*"museum" + -0.170*"boston" + -0.165*"scotland" + -0.153*"prague" + 0.153*"cats" + -0.136*"island"
![Page 16: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/16.jpg)
Latent Semantic Analysis
Notice the negative weights.
Hard to interpret
LSA is not a probabilistic method, so the weights cannot be read as probabilities
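One reason the signs are hard to interpret is that SVD fixes each component only up to a sign flip, and components after the first have to be orthogonal to it. The NumPy sketch below illustrates both points on a small non-negative toy matrix; the matrix is an assumption for illustration and is unrelated to the Flickr data.

```python
import numpy as np

# Toy non-negative count matrix (rows: tags, columns: photos).
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 3, 1],
    [0, 1, 1, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Flip the sign of the second component on both sides of the factorization.
U_flipped, Vt_flipped = U.copy(), Vt.copy()
U_flipped[:, 1] *= -1
Vt_flipped[1, :] *= -1

original = U @ np.diag(s) @ Vt
flipped = U_flipped @ np.diag(s) @ Vt_flipped
same_reconstruction = bool(np.allclose(original, flipped))

# The second component is orthogonal to the first, so on this matrix it
# has to mix positive and negative weights.
second_has_both_signs = bool((U[:, 1] > 0).any() and (U[:, 1] < 0).any())
```

Both factorizations describe the data equally well, so the sign of a weight within a topic carries meaning only relative to the other weights in that same topic.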
![Page 17: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/17.jpg)
Latent Dirichlet Allocation
Expectation-Maximization
Each document is a mixture of topics
Find the posterior for topics in the E-step: p(topic t | document d)
Then update the assignment of the current word in the M-step: p(word w | topic t)
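The alternation between the two steps can be sketched with a simplified relative of LDA. The code below is plain EM for a topic mixture without Dirichlet priors (usually called PLSA), a deliberately swapped-in simplification: it is not full LDA, and it is not what Gensim runs internally, since Gensim's `LdaModel` is based on online variational Bayes. The function name `plsa_em` and the toy counts are assumptions for illustration.

```python
import numpy as np

def plsa_em(counts, num_topics, iterations=50, seed=0):
    """EM for a topic mixture without Dirichlet priors (PLSA).

    counts: (num_docs, num_words) array of tag counts.
    Returns p(topic | document) and p(word | topic).
    """
    rng = np.random.default_rng(seed)
    num_docs, num_words = counts.shape
    eps = 1e-12  # guards against division by zero

    # Random starting distributions, normalized so each row sums to 1.
    p_t_d = rng.random((num_docs, num_topics))
    p_t_d /= p_t_d.sum(axis=1, keepdims=True)
    p_w_t = rng.random((num_topics, num_words))
    p_w_t /= p_w_t.sum(axis=1, keepdims=True)

    for _ in range(iterations):
        # E-step: posterior over topics for every (document, word) pair.
        joint = p_t_d[:, :, None] * p_w_t[None, :, :]  # (docs, topics, words)
        resp = joint / (joint.sum(axis=1, keepdims=True) + eps)

        # M-step: re-estimate both distributions from expected counts.
        expected = counts[:, None, :] * resp           # (docs, topics, words)
        p_w_t = expected.sum(axis=0)
        p_w_t /= p_w_t.sum(axis=1, keepdims=True) + eps
        p_t_d = expected.sum(axis=2)
        p_t_d /= p_t_d.sum(axis=1, keepdims=True) + eps
    return p_t_d, p_w_t

# Toy counts: four photos, four tags.
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)
doc_topics, topic_words = plsa_em(counts, num_topics=2)
```

Unlike the LSA weights, every row of `doc_topics` and `topic_words` is a probability distribution, so the values are non-negative and sum to one.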
![Page 18: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/18.jpg)
Latent Dirichlet Allocation
250,000 photos, 20 topics
topic #13: 0.088*party + 0.072*halloween + 0.027*lake + 0.024*boat + 0.022*home + 0.019*park + 0.018*river + 0.016*ice + 0.015*spring + 0.014*birthday
topic #3: 0.046*trip + 0.044*vacation + 0.044*sanfrancisco + 0.040*california + 0.026*road + 0.024*cats + 0.018*school + 0.018*cruise + 0.014*ca + 0.014*old
topic #8: 0.051*paris + 0.042*france + 0.027*july + 0.027*4th + 0.025*music + 0.022*car + 0.021*rock + 0.020*dogs + 0.020*concert + 0.016*geotagged
![Page 19: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/19.jpg)
Latent Dirichlet Allocation
250,000 photos, 50 topics
topic #7: 0.111*sunset + 0.108*beach + 0.089*holiday + 0.047*fun + 0.029*smile + 0.028*forest + 0.023*rose + 0.020*wood + 0.019*disneyland + 0.019*costarica
topic #14: 0.141*vacation + 0.046*san + 0.037*francisco + 0.034*sports + 0.020*hockey + 0.020*top + 0.019*cake + 0.014*cafe + 0.013*biking + 0.013*ruins
topic #23: 0.112*trip + 0.070*bridge + 0.057*road + 0.048*blue + 0.048*building + 0.042*film + 0.035*orange + 0.022*university + 0.021*telephone + 0.018*sky
topic #29: 0.124*party + 0.110*friends + 0.085*christmas + 0.045*rock + 0.038*lake + 0.038*ireland + 0.031*castle + 0.026*africa + 0.025*live + 0.025*music
![Page 20: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/20.jpg)
Latent Dirichlet Allocation
250,000 photos, 100 topics
topic #10: 0.109*hawaii + 0.093*island + 0.060*la + 0.030*photoshop + 0.027*walk + 0.026*hdr + 0.024*maui + 0.023*us + 0.019*fountain + 0.018*beach
topic #24: 0.172*house + 0.106*architecture + 0.077*festival + 0.068*airplane + 0.038*flying + 0.029*flight + 0.026*air + 0.025*aircraft + 0.021*aviation + 0.020*airshow
topic #34: 0.231*vacation + 0.159*trip + 0.136*lake + 0.095*florida + 0.088*birds + 0.062*san + 0.051*francisco + 0.015*yellowstone + 0.015*kayak + 0.015*maltay
topic #70: 0.114*november + 0.074*thanksgiving + 0.050*soccer + 0.048*polarbear + 0.048*ski + 0.041*basketball + 0.035*safari + 0.034*bear + 0.023*wien + 0.021*flood
![Page 21: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/21.jpg)
Conclusions
LSA and LDA are more useful for analyzing tags than Association Rule Mining
There is no “best” number of topics
Human interpretation might still be required
![Page 22: Flickr Tag Analysis](https://reader035.vdocuments.net/reader035/viewer/2022062517/56813050550346895d95fe31/html5/thumbnails/22.jpg)
Future Work
Increase the corpus size to 1,000,000 documents
Analyze Flickr groups as well