Character-based Neural Embeddings for Tweet Clustering


TRANSCRIPT

Page 1: Character-based  Neural Embeddings for Tweet Clustering

Character-based Neural Embeddings for Tweet Clustering

Svitlana Vakulenko, Lyndon Nixon, Mihai Lupu

Vienna University of Economics and Business
TU Wien (Vienna University of Technology)
MODUL Technology

The 5th International Workshop on Natural Language Processing for Social Media (SocialNLP)

In conjunction with EACL 2017
April 3, 2017 in Valencia, Spain

Vakulenko et al. (Wirtschaftsuniversität Wien) SocialNLP@EACL2017 Valencia, Spain 1 / 18

Page 2: Character-based  Neural Embeddings for Tweet Clustering

#WEDNTREADWORDS


Page 3: Character-based  Neural Embeddings for Tweet Clustering

#CMABRIGDEUINERVTISYEFECT


Page 4: Character-based  Neural Embeddings for Tweet Clustering

Character-based Neural Embeddings

Language modeling [Sutskever et al., 2011] [Kim et al., 2016]

Natural Language Generation [Goyal et al., 2016]

Word spelling correction [Sakaguchi et al., 2017]

Part-of-speech tagging [dos Santos and Zadrozny, 2014]

Information extraction [Qi et al., 2014]

Text classification [Zhang et al., 2015]


Page 5: Character-based  Neural Embeddings for Tweet Clustering

Tweet2Vec: bi-GRU RNN [Dhingra et al., 2016]

Picture credits: Tobias Fink
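As a rough illustration of what a character-level bidirectional GRU encoder computes, the sketch below reads a tweet one character at a time in both directions and joins the two final hidden states into one vector. It is not the Tweet2Vec implementation of [Dhingra et al., 2016]: the weights are random and untrained, bias terms are omitted, and the alphabet, the dimensions, the names `GRUCell` and `embed_tweet`, and the choice to concatenate the two final states are all assumptions made for this example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Hypothetical minimal GRU cell (no biases, random untrained weights)."""

    def __init__(self, rng, input_dim, hidden_dim):
        # One (W, U) pair per gate: reset "r", update "z", candidate "h".
        self.params = {
            gate: (rand_matrix(rng, hidden_dim, input_dim),
                   rand_matrix(rng, hidden_dim, hidden_dim))
            for gate in "rzh"
        }

    def step(self, x, h):
        w_r, u_r = self.params["r"]
        w_z, u_z = self.params["z"]
        w_h, u_h = self.params["h"]
        r = [sigmoid(a + b) for a, b in zip(matvec(w_r, x), matvec(u_r, h))]
        z = [sigmoid(a + b) for a, b in zip(matvec(w_z, x), matvec(u_z, h))]
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b)
                for a, b in zip(matvec(w_h, x), matvec(u_h, gated))]
        # New state: interpolate between the old state and the candidate.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def embed_tweet(text, char_vectors, fwd, bwd, hidden_dim):
    # Look up one vector per known character; unknown characters are skipped.
    xs = [char_vectors[c] for c in text if c in char_vectors]
    h_f = [0.0] * hidden_dim
    for x in xs:                 # left-to-right pass
        h_f = fwd.step(x, h_f)
    h_b = [0.0] * hidden_dim
    for x in reversed(xs):       # right-to-left pass
        h_b = bwd.step(x, h_b)
    return h_f + h_b             # assumption: concatenate both final states


rng = random.Random(0)
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 .#@"
char_dim, hidden_dim = 8, 4
char_vectors = {c: [rng.uniform(-0.1, 0.1) for _ in range(char_dim)]
                for c in alphabet}
fwd = GRUCell(rng, char_dim, hidden_dim)
bwd = GRUCell(rng, char_dim, hidden_dim)
vec = embed_tweet("mt . gox goes dark", char_vectors, fwd, bwd, hidden_dim)
print(len(vec))
```

Because the encoder never splits the text into words, spelling variants and unseen tokens still map to a vector; a trained model would learn the weights instead of sampling them.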


Page 6: Character-based  Neural Embeddings for Tweet Clustering

Breaking News Detection from Twitter Stream

EU projects: SocialSensor, REVEAL, Pheme, InVID ...

Picture credits: AppAdvice, Mind The Gap Public Relations


Page 7: Character-based  Neural Embeddings for Tweet Clustering

SNOW Data Challenge [Papadopoulos et al., 2014]

Dataset: 1M tweets collected over 24 hours, related to major events (Syria, terror, Ukraine, bitcoin), annotated with 59 reference topics, e.g.:

25-02-14 18:00
Nigeria children killed in attack on school
Nigeria,children,killed,attack,school,Boko Haram
438372486808629250,438373272439123968,438373225320697856

Winner: aggressive filtering and hierarchical clustering [Ifrim et al., 2014]
Precision: 0.56 Recall: 0.36 F-Measure: 0.4


Page 8: Character-based  Neural Embeddings for Tweet Clustering

Results

Interval  Tweets  Model      Dimensions  #Clusters  Homogeneity  Completeness  V-Measure
18:00     10,344  Tweet2Vec  500         3026       0.9958       0.9453        0.9699
                  TweetTerm  433         66-79      0.9277       1             0.9625
22:00     14,471  Tweet2Vec  500         5292       1            0.9601        0.9796
                  TweetTerm  589         93-118     0.9385       0.9969        0.9668
23:15     8,231   Tweet2Vec  500         3986       1            0.98          0.9899
                  TweetTerm  565         67-142     0.8062       0.9978        0.8918
01:00     5,123   Tweet2Vec  500         2242       1            0.8877        0.9405
                  TweetTerm  721         71-111     0.8104       1             0.8953
01:30     4,589   Tweet2Vec  500         2091       1            0.8762        0.934
                  TweetTerm  635         64-78      0.8024       1             0.8903

Table: Results of clustering evaluation on the English-language dataset
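The three scores in the table can be read with the help of a small worked example. The sketch below assumes the standard entropy-based definitions of homogeneity and completeness, with V-measure as their harmonic mean; the slides do not state which implementation was used, and the function names and toy labels are made up for illustration.

```python
from collections import Counter
from math import log


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given), estimated from co-occurrence counts."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g])
                for (g, _), c in joint.items())


def v_measure(homogeneity, completeness):
    # Harmonic mean of the two scores.
    total = homogeneity + completeness
    return 0.0 if total == 0 else 2 * homogeneity * completeness / total


def homogeneity_completeness_v(classes, clusters):
    h_classes = entropy(classes)
    h_clusters = entropy(clusters)
    # Homogeneity: each cluster contains members of a single class.
    hom = 1.0 if h_classes == 0 else (
        1.0 - conditional_entropy(classes, clusters) / h_classes)
    # Completeness: all members of a class land in the same cluster.
    com = 1.0 if h_clusters == 0 else (
        1.0 - conditional_entropy(clusters, classes) / h_clusters)
    return hom, com, v_measure(hom, com)


# Toy case: two reference topics, each split into two pure clusters.
classes = ["a", "a", "b", "b"]
clusters = [0, 1, 2, 3]
hom, com, v = homogeneity_completeness_v(classes, clusters)
print(round(hom, 4), round(com, 4), round(v, 4))
```

In the toy case every cluster is pure, so homogeneity is 1, while splitting each topic across two clusters costs completeness. The Tweet2Vec rows show the same pattern: thousands of clusters, homogeneity at or near 1, and lower completeness. Plugging the 18:00 Tweet2Vec row into `v_measure(0.9958, 0.9453)` gives about 0.9699, matching the table.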


Page 9: Character-based  Neural Embeddings for Tweet Clustering

TweetTerm: results sample

obama : michelle and i were saddened to hear of the passing of harold ramis ...
touching tribute to ghostbusters star harold ramis from comic artist
on the joyful comedy of harold ramis
major tokyo-based bitcoin exchange mt . gox goes dark
"bitcoin exchange giant mt . gox goes dark — popular science "

obesity rate for young children plummets 43 % in a decade
the national obesity rate for young children dropped 43 % over the past decade

diplomatic pressure is unlikely to reverse uganda’s cruel anti-gay law
provisions of arizona proposed anti-gay law
even mitt romney wants arizona’s governor to veto the state’s anti-gay bill
icymi : arizona pizzeria response to state anti-gay bill

amazing debate nic ! well done !
well done 4 -0
well done ! i find running so difficult . feel proud !
well done him :-)
well done nicola my money is on you you done it well tonight ??

Table: Similarity patterns in tweets discovered by TweetTerm.


Page 10: Character-based  Neural Embeddings for Tweet Clustering

TweetTerm results

Similarity patterns: word-level N-grams

Word-based approach is inflexible: mt.gox vs. mtgox

Especially evident for social media posts with diverse vocabulary
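The mt.gox vs. mtgox mismatch can be made concrete with a simple overlap measure. The sketch below compares the two spellings once as whole word tokens and once as sets of character bigrams, using Jaccard similarity. This is only an illustration of why character-level views tolerate spelling variants; it is neither the TweetTerm nor the Tweet2Vec method, and the helper names are assumptions.

```python
def jaccard(a, b):
    # Jaccard similarity of two collections, treated as sets.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def char_ngrams(text, n=2):
    # All overlapping character n-grams of the string.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Word level: the two spellings are simply different tokens.
word_sim = jaccard(["mt.gox"], ["mtgox"])

# Character level: the bigrams "mt", "go" and "ox" are shared.
char_sim = jaccard(char_ngrams("mt.gox"), char_ngrams("mtgox"))

print(word_sim, char_sim)  # prints "0.0 0.5"
```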


Page 11: Character-based  Neural Embeddings for Tweet Clustering

Tweet2Vec: results sample

video : bitcoin : mtgox exchange goes offline - bitcoin , a virtual currency ...
the slow-motion collapse of mt . gox is bitcoin’s first financial crisis ...
Disastro bitcoin : mt . gox cessa ogni attivite ... : mt . gox , il più grande cambiavalute bitco ...

Correct

california couple finds time capsules worth $10 million
californian couple finds $10 million worth of gold coins in tin can

Correct

ukraine puts off vote on new government despite eu pleas for quick action - washington post ...
ukraine truce shattered , death toll hits 67 - kiev (reuters) - ukraine suffered its bloodiest day ...
ukraine fighting leaves at least 18 dead as kiev barricades burn - clashes in ukraine ...

Partial

are you going to come on his network and get poor ratings too ?
are you sold on the waffle taco ?

Incorrect

the chromecast app flood has started by
the importance of emotion in design by

Incorrect


Page 12: Character-based  Neural Embeddings for Tweet Clustering

Tweet2Vec results

sensitive to the order of symbols

also uncovers syntactic patterns instead of semantics

would benefit from stop-word removal, e.g. an analogue to the IDF weighting scheme

black-box document and similarity representation
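For reference, classical word-level IDF down-weights terms that occur in many documents. The sketch below computes idf(t) = log(N / df(t)) over four tweets taken from the incorrect clusters in the results sample. It shows the ordinary word-based scheme only; how an analogue would be built into a character-based neural model is left open on the slides, and the function name is an assumption.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Map each token to log(N / df), where df counts documents containing it."""
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc.split()))
    return {tok: math.log(n_docs / count) for tok, count in df.items()}


tweets = [
    "are you going to come on his network and get poor ratings too ?",
    "are you sold on the waffle taco ?",
    "the chromecast app flood has started by",
    "the importance of emotion in design by",
]
idf = idf_weights(tweets)

# Frequent function words get low weights, rare content words get high ones.
for token in ("the", "are", "by", "waffle", "chromecast"):
    print(token, round(idf[token], 3))
```

Under such a weighting, the shared openers and endings ("are you", "by") that pulled unrelated tweets together would contribute less to similarity than rarer content words such as "waffle" or "chromecast".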


Page 13: Character-based  Neural Embeddings for Tweet Clustering

Directions for Future Work

1 Syntax
  - filter out syntactic patterns
  - eliminate stop-words
  - e.g. develop an analogue for the IDF weighting scheme for neural networks, i.e. an aggregation step

2 Semantics
  - explore semantic similarity patterns, e.g. paraphrases and synonyms
  - leverage pre-trained word-embeddings? e.g. Word2Vec, GloVe
  - combine word-based semantics with the character-based similarity
  - e.g. “construct a representation by concatenating a word and a character embedding” [Hashimoto et al., 2016]

3 Dataset
  - extend experiments to a larger multi-lingual dataset of tweets

4 Baseline
  - compare with word-based neural network model


Page 14: Character-based  Neural Embeddings for Tweet Clustering

Questions!


Page 15: Character-based  Neural Embeddings for Tweet Clustering

Bibliography I

Dhingra, B., Zhou, Z., Fitzpatrick, D., Muehl, M., and Cohen, W. W. (2016).
Tweet2vec: Character-based distributed representations for social media.
In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany.

dos Santos, C. N. and Zadrozny, B. (2014).
Learning character-level representations for part-of-speech tagging.
In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, 21-26 June, 2014, Beijing, China, pages 1818–1826.


Page 16: Character-based  Neural Embeddings for Tweet Clustering

Bibliography II

Goyal, R., Dymetman, M., and Gaussier, E. (2016).
Natural language generation through character-based RNNs with finite-state prior knowledge.
In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 1083–1092.

Hashimoto, K., Xiong, C., Tsuruoka, Y., and Socher, R. (2016).
A joint many-task model: Growing a neural network for multiple NLP tasks.
CoRR, abs/1611.01587.

Ifrim, G., Shi, B., and Brigadir, I. (2014).
Event Detection in Twitter using Aggressive Filtering and Hierarchical Tweet Clustering.
In Papadopoulos, S., Corney, D., and Aiello, L. M., editors, Proceedings of the SNOW 2014 Data Challenge co-located with 23rd International World Wide Web Conference (WWW 2014), April 8, 2014, Seoul, Korea, pages 33–40.


Page 17: Character-based  Neural Embeddings for Tweet Clustering

Bibliography III

Kim, Y., Jernite, Y., Sontag, D., and Rush, A. M. (2016).
Character-aware neural language models.
In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 2741–2749.

Papadopoulos, S., Corney, D., and Aiello, L. M. (2014).
SNOW 2014 Data Challenge: Assessing the Performance of News Topic Detection Methods in Social Media.
In Papadopoulos, S., Corney, D., and Aiello, L. M., editors, Proceedings of the SNOW 2014 Data Challenge co-located with 23rd International World Wide Web Conference (WWW 2014), April 8, 2014, Seoul, Korea, pages 1–8.

Qi, Y., Das, S. G., Collobert, R., and Weston, J. (2014).
Deep learning for character-based information extraction.
In Advances in Information Retrieval - 36th European Conference on IR Research, ECIR 2014, Amsterdam, The Netherlands, April 13-16, 2014. Proceedings, pages 668–674.


Page 18: Character-based  Neural Embeddings for Tweet Clustering

Bibliography IV

Sakaguchi, K., Duh, K., Post, M., and Durme, B. V. (2017).
Robsut wrod reocginiton via semi-character recurrent neural network.
In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 3281–3287.

Sutskever, I., Martens, J., and Hinton, G. E. (2011).
Generating text with recurrent neural networks.
In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 1017–1024.

Zhang, X., Zhao, J., and LeCun, Y. (2015).
Character-level convolutional networks for text classification.
In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 649–657.
