
  • Prediction of Interest for Dynamic Profile of Twitter User

    Elisafina Siswanto

    School of Electrical Engineering and Informatics

    Institut Teknologi Bandung Bandung, Indonesia

    [email protected]

    Masayu Leylia Khodra

    School of Electrical Engineering and Informatics

    Institut Teknologi Bandung Bandung, Indonesia

    [email protected]

    Luh Joni Erawati Dewi

    Universitas Pendidikan Ganesha

    [email protected]

    Abstract: Numerous studies have explored the Twitter social network, and some have tried to predict the interest of a user or the topic of a tweet. In this study, we investigate the best classification model for determining a user's interests from the bio and a collection of tweets. We use supervised classification with lexical features. Two approaches are proposed: classification of the user's tweets using a multilabel classification method, and classification based on specific accounts. The experimental results show that the specific accounts approach leads to better accuracy.

    Keywords: interest, topic, classification, Twitter, lexical, machine learning

    I. INTRODUCTION

    The rapid growth of social networks such as Twitter in the last few years has produced a large volume of user-generated text. In 2014, 650 million Twitter users had been recorded, and the average number of tweets per day totaled 58 million [1]. Such a large quantity of information can be useful in various fields, especially product marketing. However, the profile of a Twitter user is incomplete; it contains only the name, location, website, and bio. This information is too limited to determine market segments for product marketing.

    Twitter makes it easy for researchers to collect tweet data. This has encouraged research on predicting the profile of a Twitter user from the user's tweets. Some studies identify users' personalities (introvert, extrovert, etc.) based on the interaction among users on Twitter [2] or determine the user's category in an event [3]. Others derive demographic information about the user [4], determine the topic of every tweet [5], and so forth. The user profile can be divided into two parts: the static profile and the dynamic profile [6]. The static profile is the part that rarely or never changes, such as sex, age group, and marital status. The dynamic profile is the part that frequently changes over time, such as the user's interests. Numerous studies have explored tweets in English [4,7-9] and in Indonesian [10-11]. However, no study has yet explored the dynamic profile of Indonesian-speaking users.

    This research aims to predict the user's interests, that is, the topics in which the Twitter user is interested, based on the tweet content or the bio of the user. The user's interests belong to the dynamic profile because they are changeable. Predicting the user's interests is useful for various needs, especially in marketing. If someone's interests can be identified, products can be marketed to those who are interested in them; as a result, the time and expense needed for marketing can be reduced.

    In this study, we assumed that the topics in which a Twitter user was interested were the topics that appeared frequently among all of the user's tweets. Unlike Michelson and Macskassy, who predicted the topic of English tweets using a knowledge-based method built on Wikipedia [5], we predicted the topic of every tweet using a model built with a supervised learning method.

    We focused on Indonesian tweets because Indonesia has the fifth-largest number of Twitter users in the world. In addition, research on the topics of Indonesian-language tweets is still rare. So far, two studies have predicted the static profile of Twitter users, one for the gender attribute [10] and one for the age and employment attributes [11]; none has explored the dynamic profile.

    The prediction was made using two approaches: multilabel classification of the user's tweets, and classification based on other Twitter users, namely specific accounts. Lexical features such as word n-grams and character n-grams, and preprocessing adapted from sentiment analysis of Indonesian tweets [12], were used for both approaches. We examined various machine learning and feature selection methods using the WEKA library [13].

    The following section discusses related work. Section 3 explains how the dynamic profile was predicted. Section 4 describes the data used in this research. Section 5 describes the preprocessing stages. The feature types, feature selection, and machine learning methods are described in Section 6. The experimental results are presented in Section 7. The last section contains the conclusions and suggestions for developing this study further.

    2014 International Conference of Advanced Informatics: Concept, Theory and Application (ICAICTA)

    978-1-4799-5100-0/14/$31.00 ©2014 IEEE

    II. RELATED WORKS

    If a user posts tweets about a particular topic, it can be assumed that he/she is interested in that topic. Numerous studies have been conducted to obtain topics from tweet data in English. One way to obtain the topic of a tweet is a knowledge-based method using Wikipedia [5,14]. Another way is an unsupervised learning method, especially Latent Dirichlet Allocation (LDA) [15-17]. Takahashi assumed that a topic might emerge not only when a tweet was written, but also through the behavior of other users, for example, replies or retweets [18].

    Only a few studies have analyzed Indonesian tweets. So far, these studies have covered sentiment analysis [12,19], extraction of information on traffic jams in Bandung [20], and information on online transactions in Indonesia [21]; some explained the trends of Indonesian topics [22], one predicted the gender of users using lexical and sociolinguistic features [10], and another predicted the age and job of users based on tweet data [11]. Only [10,11] predicted the static profile of Indonesian-speaking users, and none predicted their dynamic profile.

    Supervised learning is the common technique used to analyze Indonesian tweets [10-12,19-21]. Supervised learning on tweet data generally takes one of two forms: each tweet is regarded as one document, or all of a user's tweets are combined into one large document.

    The basic difference between supervised and unsupervised methods is that in supervised learning the dataset has been labeled with classes, whereas in unsupervised learning the dataset is not labeled and instances are grouped according to the distance between them. In unsupervised learning, labels are assigned after the groups have been formed. Studies that use a machine learning approach to determine the topics of a Twitter user, as explained above, generally use unsupervised learning. Unsupervised learning with LDA only yields N topics, each made up of keywords. The weakness of such a method is that the labeling of a new class is done after the groups or clusters are made; therefore, it is possible that no group represents a topic that was expected to appear.

    III. RESEARCH METHOD

    In marketing, every product usually has a topic or group to which it belongs. As an illustration, mobile phones belong to the technology category. Because the target categories are known in advance, the dynamic profile in the present study is formed using a supervised learning method.

    In the first stage of supervised learning, the training data went through preprocessing and feature extraction. The extracted features were then passed to the machine learning stage, which produced a classification model. Next, this model was used to classify new data, which had gone through the same preprocessing and feature extraction stages.

    The fields in which users were interested were divided into Business and Finance, Sports, Technology, Entertainment (including music, film, and games), Health and Beauty, Travel, Automotive, Family, Flora and Fauna, Politics and Law, and Unknown. This division was based on a summary and modification of the news categories of various news sites in Indonesia, for example detik.com, kompas.com, Yahoo News, and okezone.com. It was also adjusted to the product categories on trading sites such as Kaskus and Tokobagus. The Unknown class was the group of tweets that did not belong to the other 10 classes and did not have any particular topic, for example, "selamat pagi semua" ("good morning, all").

    The user's interests in the present study were predicted using two approaches. The first approach used multilabel classification. It was applied because, in the data obtained through the questionnaire, a user might choose more than one interest, meaning that one document containing all of the user's tweets could have more than one label. This required multilabel classification. The second approach used specific accounts. It was chosen because each tweet has a topic of its own, whereas the questionnaire result only contained the user's name and the list of topics he/she was interested in. The questionnaire data were too limited to label the topic of each tweet; therefore, specific accounts were used to construct the classification model. A specific account is a Twitter account that writes tweets about a particular topic. As an illustration, the account detikSport contains tweets related to sports.

    A. Multilabel Classification Approach

    The first stage in the multilabel classification approach was to obtain labeled data through a questionnaire containing the username and one or more areas in which the user was interested; the Unknown class was not used in the multilabel classification. Then the bio of every username and a maximum of the last 500 tweets were retrieved using the Twitter API [23]. Every tweet went through various combinations of preprocessing and feature types (word n-grams and character n-grams).

    In this approach, three dataset variations were built: a dataset in which one instance contained the bio only; a dataset in which one instance contained a maximum of 500 tweets only; and a dataset in which one instance contained the bio and a maximum of 500 tweets. Afterwards, the string data were converted into word vectors using the StringToWordVector filter of WEKA.

    To search for the best model, the dataset was divided into training data and testing data in a 75% to 25% ratio. The dataset was then tested using various machine learning and feature selection methods in order to determine the best model, using WEKA and the multilabel library MEKA [24] with Binary Relevance classification.
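The Binary Relevance strategy mentioned above can be illustrated with a minimal Python sketch. This is not the MEKA implementation used in the study: the `KeywordClassifier` class is a hypothetical toy stand-in for a real base learner such as SVM, and the three example documents are invented for illustration only. The point is the decomposition itself: one independent yes/no classifier is trained per interest label, and the predicted label set is the union of the positive decisions.

```python
from typing import Callable, Dict, Sequence, Set


class KeywordClassifier:
    """Toy binary classifier (hypothetical stand-in for SVM, Naive Bayes, or C4.5).

    It remembers the tokens that occur only in positive training documents and
    predicts True when a new document contains at least one of them.
    """

    def fit(self, docs: Sequence[str], targets: Sequence[bool]) -> "KeywordClassifier":
        positive: Set[str] = set()
        negative: Set[str] = set()
        for doc, target in zip(docs, targets):
            (positive if target else negative).update(doc.lower().split())
        self.keywords = positive - negative
        return self

    def predict(self, doc: str) -> bool:
        return any(token in self.keywords for token in doc.lower().split())


class BinaryRelevance:
    """Trains one independent binary classifier per label."""

    def __init__(self, labels: Sequence[str],
                 make_classifier: Callable[[], KeywordClassifier]):
        self.labels = list(labels)
        self.make_classifier = make_classifier
        self.models: Dict[str, KeywordClassifier] = {}

    def fit(self, docs: Sequence[str], label_sets: Sequence[Set[str]]) -> "BinaryRelevance":
        for label in self.labels:
            # Turn the multilabel target into a yes/no target for this label.
            targets = [label in labels for labels in label_sets]
            self.models[label] = self.make_classifier().fit(docs, targets)
        return self

    def predict(self, doc: str) -> Set[str]:
        # The predicted label set is every label whose classifier says yes.
        return {label for label, model in self.models.items() if model.predict(doc)}


# Invented toy data: one document per user, each with one or more interests.
docs = [
    "nonton bola dan main game",
    "gadget baru dan laptop",
    "liga bola malam ini",
]
label_sets = [{"Sports", "Entertainment"}, {"Technology"}, {"Sports"}]

model = BinaryRelevance(
    ["Sports", "Technology", "Entertainment"], KeywordClassifier
).fit(docs, label_sets)
print(model.predict("jadwal bola"))
```

Because each label is learned independently, a rare label such as Family sees very few positive examples, which is one plausible way for class imbalance to affect this kind of model.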

    B. Classification using Specific Accounts

    The second approach applied in this research used specific accounts. Each tweet was assumed to have one topic, and the user's interests were predicted as the topics whose percentage of appearance exceeded a particular threshold value. The labeled data were the tweets of specific accounts, in which one instance contained a tweet and its topic.

    In this approach, first, N tweets were taken from the specific accounts for each topic. Then every tweet went through various combinations of preprocessing and feature types (word n-grams and character n-grams). After that, the string data were converted into word vectors using the StringToWordVector filter of WEKA. Next, we divided the dataset into training data and testing data in a 75% to 25% ratio. This dataset was then tested using various machine learning and feature selection methods to determine the best model.
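The 75% to 25% division can be sketched as follows. The text does not say whether the instances were shuffled or how the split was drawn, so the seeded random shuffle and the function name `train_test_split` are assumptions made for this illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(instances: Sequence[T], train_fraction: float = 0.75,
                     seed: int = 0) -> Tuple[List[T], List[T]]:
    """Split instances into training and testing parts.

    Assumption: instances are shuffled with a fixed seed before the cut,
    so the split is random but reproducible.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = train_test_split(list(range(100)))
print(len(train), len(test))
```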

    In the next step, we used the best model to determine the minimal threshold value for the percentage of appearance of a topic. First, a maximum of the last 500 tweets were taken for every user from the questionnaire, and the topic of each tweet was classified using the model obtained earlier. If a user had 200 tweets, then there would be 200 topics as the result of classification. Next, the percentage of appearance of every topic was calculated; the Unknown topic was excluded. As an illustration, the percentages of appearance of User A's topics were as follows: Sports 10%, Technology 20%, Business and Finance 5%, Health and Beauty 5%, Entertainment 20%, Family 10%, Travel 10%, and Automotive 20%.
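The percentage calculation can be sketched in Python as below. The function name `topic_percentages` is hypothetical, and the sketch assumes that tweets classified as Unknown are dropped before the percentages are computed, so that the remaining topics sum to 100% as in the User A illustration; the text only states that the Unknown topic was excluded.

```python
from collections import Counter
from typing import Dict, Iterable


def topic_percentages(predicted_topics: Iterable[str],
                      excluded: str = "Unknown") -> Dict[str, float]:
    """Percentage of appearance of each topic among a user's classified tweets.

    Assumption: tweets labeled with the excluded topic are removed from both
    the counts and the denominator.
    """
    counts = Counter(topic for topic in predicted_topics if topic != excluded)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {topic: 100.0 * count / total for topic, count in counts.items()}


# One predicted topic per tweet, as produced by the per-tweet classifier.
predictions = ["Sports"] * 1 + ["Technology"] * 3 + ["Unknown"] * 6
print(topic_percentages(predictions))
```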

    After the percentages of occurrence of the topics were obtained, the threshold value could be determined. This value was the minimal percentage of appearance a topic needed in order to be included in the user's list of interests. Referring to the example above, if the threshold was 15%, then the dynamic profile of User A would contain Technology, Entertainment, and Automotive. The threshold value was determined by dividing the questionnaire data into 75% training data and 25% testing data. Next, for the training data, the percentages of the topics that matched each user's stated interests were collected. This collection of percentages for true interests was used to determine the threshold value. In the present study, three candidate threshold values were used: the average, the median, and the upper quartile. The testing data were used to test these three values to find the most accurate threshold.
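The thresholding step can be sketched as follows, using the User A figures from the text. The function names are hypothetical. Two details are assumptions: a topic is included when its percentage is greater than or equal to the threshold, and the quartile is computed with the default interpolation of Python's `statistics.quantiles`, which may differ from the calculation used in the study.

```python
from statistics import mean, median, quantiles
from typing import Dict, List, Set


def threshold_candidates(true_interest_percentages: List[float]) -> Dict[str, float]:
    """Candidate thresholds from the percentages of users' true interests
    collected on the training data (needs at least two values)."""
    upper_quartile = quantiles(true_interest_percentages, n=4)[2]
    return {
        "average": mean(true_interest_percentages),
        "median": median(true_interest_percentages),
        "upper_quartile": upper_quartile,
    }


def predict_interests(percentages: Dict[str, float], threshold: float) -> Set[str]:
    """Topics whose percentage of appearance reaches the threshold
    (assumption: the comparison is inclusive)."""
    return {topic for topic, pct in percentages.items() if pct >= threshold}


# User A from the text.
user_a = {
    "Sports": 10, "Technology": 20, "Business and Finance": 5,
    "Health and Beauty": 5, "Entertainment": 20, "Family": 10,
    "Travel": 10, "Automotive": 20,
}
print(predict_interests(user_a, threshold=15))

# Invented training percentages, only to show the three candidates.
candidates = threshold_candidates([10.0, 20.0, 30.0, 40.0])
print(candidates)
```

Each candidate would then be applied to the testing users with `predict_interests`, and the one with the highest accuracy kept.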

    IV. DATA

    We distributed a questionnaire in which each user was asked to write his/her username and one or more interests in order to create the tweet dataset. A maximum of the last 500 tweets were taken from every user, and only Indonesian tweets were kept. The distribution of the interest data of the 299 users can be seen in Table I. The specific accounts used in the specific accounts approach can be seen in Table II.

    V. PREPROCESSING

    One of the characteristics of tweet data is a high level of noise, so the data must be preprocessed to reduce it. The preprocessing stage was a modification of the preprocessing phase used in Indonesian sentiment analysis research [12]. Tokenization, case folding, clear number, and convert number were used without any modification. Stopword removal was modified in the stopword dictionary used: in the present study, the dictionary contained general Indonesian stopwords, Twitter-specific words, and the short forms of these stopwords. Feature normalization was modified into the normalization of URLs, mentions, and hashtags. RT deletion was not used, as a retweet can contain keywords of a topic.

    In the present study, we added four preprocessing steps. The first used a dictionary of slang words to change non-standard words into standard ones. As an illustration, the slang word "cemunguut" has the standard form "semangat" (being encouraged). The second was the deletion of duplicated letters, which deleted letters in a word that were the same as the preceding letter. For example, "assssiiiiiikkkk bangeeeeet" was changed into "asik banget" (being passionate about something).

    TABLE I. DISTRIBUTION OF INTEREST DATA

    Class                Total user
    Business & Finance   86
    Sports               154
    Technology           148
    Entertainment        199
    Health and Beauty    88
    Travel               116
    Automotive           34
    Family               13
    Flora and Fauna      47
    Politics and Law     32

    TABLE II. SPECIFIC ACCOUNTS FOR EACH CLASS

    Class                Specific Accounts
    Business & Finance   YukBisnisCom; Kelaspengusaha; EntrepreneursID; KemenkeuRI; KontanKEUANGAN
    Sports               WartaOlahRaga; Detiksport; media_olahraga
    Technology           KompasTekno; InfoKomputer; InfoTekno; Lip6Tekno
    Entertainment        MusikKL; blitzmegaplex; cinema21; GameStation_ID; Gemoper
    Health and Beauty    TrikSehat; Sehatcantik; SehatPlus; info_cantik; TentangFashion
    Travel               detikTravel; MyHotelMyResort; TravelersIndo; Detikfood; Resep_Kuliner; WarisanKuliner; ceritaperut
    Automotive           Otomotifweekly; Otoneters; kompasOTOMOTIF; detikoto; 99otomotif; TipsOtomotif; otopedia
    Family               Mommiesdaily; BicaraAnak
    Flora and Fauna      Dokterhewanku; pertanian_ID
    Politics and Law     Politikjabar; Rumahpemilu; SeputarPemilu; POLITIKindo; Klinikhukum; HukumOnline
    Unknown              PEPATAHKU; kata2bijak; pepatah; MTLovenHoney; TweetRAMALAN


  • TABLE III. PREPROCESSING RESULT

    Preprocessing / Before Preprocessing / After Preprocessing

    Features Normalization (base preprocessing)
    Before: "Lo Mahasiswa ? Follow @KampusTwit Yuk, Twit Gaya Mahasiswa yang bakal bikin perut lo sakit karena ketawa. #AsliBikinNgakak"
    After:  lo mahasiswa follow yuk twit gaya mahasiswa yang bakal bikin perut lo sakit karena ketawa asli bikin ngakak

    Clear Number and Convert Number
    Before: "@Ucupsetiawan "@JawabJUJUR: Mention teman2 kamu yang ngomong nya 4l4y banget? #JJ | @melani_angelina""
    After:  mention teman kamu yang ngomong nya alay banget jj

    Stopword Removal
    Before: "Pacarmu minta password kamu? Putusin aja. Pacaran kok gak menghargai kehidupan pribadi orang lain. Gitu."
    After:  pacarmu minta password putusin pacaran menghargai pribadi

    Dictionary of Slang Words
    Before: "yuhuui, cemunguuut bro ^^, "@DenyAxen: alhamdulilah , lumayan design sekitar 35% lagi kelar .. semangat!!! lanjut minggu akakaka...""
    After:  yuhuui cemunguuut saudara alhamdulilah lumayan design sekitar 35 lagi kelar semangat lanjut minggu akakaka

    The Deletion of Duplication
    Before: "@hels_healey supeeer sekali,betul sekali, pinteeerr,siapa dulu dong bapaknya...hahahahahaha"
    After:  super sekali betul sekali pinter siapa dulu dong bapaknya haha

    Vowel Deletion
    Before: "@hels_healey @hikari_lita @nobitakarai sudah hel, borgol juga lepas lagi, kegedean borgolnya juga....hahaha...v"
    After:  sdh hl brgl jg lps lg kgdn brglny jg hhh v

    The third added step was the deletion of duplicated syllables. It handled expressions of laughter such as "hahahahaha" or "wkwkwkw", which users write with different lengths. Such an expression is generally written using two distinct letters (one repeated syllable). For example, "hahahaha" and "hahahahaha" would both be changed into "haha". In the last step, we removed all the vowels from the document, leaving short forms of the words. We employed this step because one form of noise on Twitter is the large number of short forms made by deleting vowels. Samples of the preprocessing results can be seen in Table III.
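Three of the added preprocessing steps can be approximated with regular expressions, as in the sketch below. The function names and exact patterns are assumptions rather than the rules used in the study. The sketch expects lowercase input (that is, after case folding) and does not cover the slang dictionary, which requires a word list not given in the text.

```python
import re


def collapse_repeated_letters(text: str) -> str:
    """Delete every letter that repeats the letter immediately before it."""
    return re.sub(r"([a-z])\1+", r"\1", text)


def collapse_repeated_syllables(text: str) -> str:
    """Shorten a two-letter syllable repeated three or more times to two
    repetitions, e.g. a long laugh becomes 'haha'."""
    return re.sub(r"([a-z])(?!\1)([a-z])(?:\1\2){2,}", r"\1\2\1\2", text)


def remove_vowels(text: str) -> str:
    """Remove all vowels, leaving consonant-only short forms."""
    return re.sub(r"[aeiou]", "", text)


print(collapse_repeated_letters("assssiiiiiikkkk bangeeeeet"))
print(collapse_repeated_syllables("hahahahaha"))
print(remove_vowels("sudah hel borgol juga lepas lagi"))
```

Note that the letter rule also shortens legitimately doubled letters, which matches the description in the text but loses some information.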

    VI. FEATURE AND MACHINE LEARNING

    The features used in the present study are lexical features, because the classification was based on the user's tweets. In addition, tweets do not follow a standard form and have a high level of noise; as a result, it is difficult to apply syntactic, semantic, or pragmatic features. In this research, we tested three types of features. The first is lexical (word) unigrams, as in [10]. The second is lexical (word) bigrams, as used in [4,7-9]. The last is character n-grams, as in [7-9], tested with n = 3 and n = 4.
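The three feature types can be illustrated with a short Python sketch. The helper names are hypothetical, and the sketch assumes that character n-grams are taken over the whole preprocessed string, including spaces; the text does not specify whether they cross word boundaries.

```python
from typing import List


def word_ngrams(tokens: List[str], n: int) -> List[str]:
    """Word n-grams: n = 1 gives unigrams, n = 2 gives bigrams."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text: str, n: int) -> List[str]:
    """Character n-grams over the whole string (the study tested n = 3 and 4)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


tokens = "lo mahasiswa follow".split()
print(word_ngrams(tokens, 1))
print(word_ngrams(tokens, 2))
print(char_ngrams("bola", 3))
```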

    The extracted features were then filtered using a feature selection method. In order to determine the best one, we compared three feature selection methods. The first is minimal frequency, with minimal appearance values of 3, 5, or 7, as in [10]. We also employed the Information Gain and Chi-Squared methods, which were used in [8,9], to select the best 500 features.
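Minimal-frequency selection can be sketched as below. The function name is hypothetical, and the sketch assumes that frequency means the total number of occurrences of a feature across all documents; the text does not state whether total or per-document counts were used.

```python
from collections import Counter
from typing import Iterable, List, Set


def select_by_min_frequency(documents: Iterable[List[str]], min_count: int) -> Set[str]:
    """Keep features that occur at least min_count times over all documents
    (assumption: occurrences are counted in total, not per document)."""
    counts = Counter(feature for doc in documents for feature in doc)
    return {feature for feature, count in counts.items() if count >= min_count}


# Invented toy documents, each already converted to a list of features.
documents = [["bola", "gol"], ["bola", "liga"], ["bola", "gol", "gol"]]
print(select_by_min_frequency(documents, min_count=3))
```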

    In the present study, we also compared several machine learning methods. The first is Naive Bayes, which is widely used in text classification research [7,9,10]. The second is the Support Vector Machine (SVM), as in [4,7,10]. Lastly, we tried the Decision Tree method with the C4.5 algorithm, which had not been used in previous profile classification research.

    VII. EXPERIMENT

    In the present study, the experiment was carried out to determine the best method for predicting Twitter users' interests. The first experiment determined the best machine learning and feature selection methods. After that, testing was conducted to determine the best type of features. Next, the best preprocessing was determined using the learning method, feature selection method, and feature type found in the previous experiments. We tested various combinations of preprocessing steps: clear and convert number, deletion of duplicated letters and syllables, stopword removal, slang word conversion, and vowel removal.

    For the multilabel classification approach in particular, we compared whether the bio only, the tweets only, or the combination of bio and tweets led to the best performance. For the specific accounts approach in particular, we added an experiment to find the threshold value that produced the highest accuracy. In the final stage, the two approaches were compared in order to determine the best approach.

    A. Multilabel Classification Approach

    As explained in the previous part, the questionnaire yielded data from 299 users, which were divided into 75% training data and 25% testing data. The distribution of the training and testing data for the multilabel classification approach can be seen in Table IV.

    From Table V, it can be seen that the best machine learning method was SVM with feature selection using minimal frequency 3 for the combined bio and tweet dataset and for the tweet-only dataset, and minimal frequency 7 for the bio dataset. The most suitable feature type for predicting the dynamic profile was the character n-gram for the bio dataset and the combined bio and tweet dataset, and the word unigram for the tweet dataset. The best preprocessing for the three datasets increased accuracy only slightly, that is, by less than 1%.

    TABLE IV. DISTRIBUTION OF TRAINING AND TESTING DATA

    Class                Training Data   Testing Data
    Business & Finance   65              21
    Sports               118             36
    Technology           115             33
    Entertainment        151             48
    Health and Beauty    69              19
    Travel               83              33
    Automotive           32              2
    Family               10              3
    Flora and Fauna      37              10
    Politics and Law     22              10


  • TABLE V. TESTING RESULT OF MULTILABEL CLASSIFICATION APPROACH

    Dataset       Machine Learning            Type of Feature            Preprocess
    Bio + Tweet   SVM + min freq 3 (70.89%)   4char (70.89%)             Slang_StopWord (71.11%)
    Bio           SVM + min freq 7 (70.98%)   3char (72.88%)             Number_Duplicate (73.15%)
    Tweet         SVM + min freq 3 (70.89%)   unigram lexical (70.89%)   Duplicate_StopWord (71.33%)

    From Table V, it can be seen that the best model for predicting the dynamic profile used the dataset containing the bio only. The reason is that many users wrote keywords for the topics they preferred in their bio.

    B. Classification using Specific Accounts

    The classification with specific accounts was tested using several N values (the number of tweets for every class), that is, 100, 200, and 500. The testing result showed that the higher the N value, the higher the accuracy. Therefore, in the testing undertaken to determine the feature type and the best preprocessing, the dataset used contained 5000 tweets per class.

    From the testing result, it was found that the best model used SVM, feature selection with minimal frequency 3, word bigram features, and preprocessing with the dictionary of slang words, leading to a model accuracy of 79.28%. This model was then used to find the best threshold. From the threshold testing result in Table VI, it can be seen that the best threshold value for predicting interests using specific accounts was 3.8%.

    C. Comparison of Multilabel Classification Approach and Specific Accounts

    The comparison of the best models of the two approaches can be seen in Table VII. On the test data, the multilabel method always produced the same predicted interests, that is, "Sports", "Entertainment", and "Technology"; the other classes never appeared in the prediction result. This resulted from the imbalanced data seen in Table I. On the other hand, classification with specific accounts produced varied predictions, although it needed a small threshold value, that is, 3.8%. Therefore, the best method for predicting interests was classification with specific accounts.

    TABLE VI. TESTING RESULT OF THRESHOLD VALUE

    Threshold | Accuracy (%)
    Average = 13.9% | 49.78
    Median = 9.4% | 59.42
    Upper Quartile = 3.8% | 75.22
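The thresholding step evaluated in Table VI can be sketched as follows, assuming that each tweet of a user has already been assigned one class by the specific-accounts classifier and that a class counts as an interest when its share of the user's tweets reaches the threshold. The function name, the use of `>=`, and the "Politics" label are assumptions for illustration, not details confirmed by the study.

```python
from collections import Counter


def predict_interests(tweet_labels, threshold=0.038):
    """Return the classes whose share of a user's tweets is at least threshold.

    tweet_labels: one predicted class per tweet (assumed classifier output).
    threshold: minimum share of tweets, e.g. 0.038 for the 3.8% value in Table VI.
    """
    if not tweet_labels:
        return []
    counts = Counter(tweet_labels)
    total = len(tweet_labels)
    return sorted(label for label, n in counts.items() if n / total >= threshold)


if __name__ == "__main__":
    # Hypothetical user: 90 sports tweets, 6 technology tweets, 4 politics tweets.
    labels = ["Sports"] * 90 + ["Technology"] * 6 + ["Politics"] * 4
    print(predict_interests(labels, threshold=0.038))
    print(predict_interests(labels, threshold=0.094))
```

With a small threshold, minority topics in a user's tweets still surface as interests, whereas a larger threshold keeps only the dominant topic.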

    TABLE VII. COMPARISON OF BEST MODEL

    Method | Machine Learning | Type of Feature | Preprocess | Threshold
    Multilabel (bio) | SVM + min freq 7 (70.98%) | 3char (72.88%) | Number_Duplicate (73.15%) | -
    Specific Accounts | SVM + min freq 3 (72.76%), total tweets per class increased to 5000 | bigram lexical (79.28%) | Slang (79.28%) | Threshold = 3.8%, accuracy on questionnaire data = 75.22%

    VIII. SUMMARY AND FUTURE WORK

    This study was conducted to predict users' interests based on their tweets and bio data. From the test results of the multilabel classification approach and the specific accounts approach, we conclude that the best machine learning method was SVM, the best feature selection method was minimum frequency, and the preprocessing of tweet data did not contribute significantly to accuracy. Comparing the two approaches, the better one was the classification with specific accounts. Although the accuracy of both methods was fairly high, above 70%, the imbalance of the questionnaire data caused the classification result to lean toward the majority class (in multilabel classification) or to require a very small threshold value (in the classification with specific accounts).

    This study can be developed further. One improvement is to add a process that handles the imbalanced data in multilabel classification. Because only lexical features were used in this study, further work could also explore syntactic features, that is, how a user composes the sentences of a tweet.

    IX. REFERENCES

    [1] (2014, May) Statistic Brain. [Online]. "http://www.statisticbrain.com/twitter-statistics/"

    [2] Daniele Quercia, M. Kosinski, David Stillwell, and Jon Crowcroft, "Our Twitter Profiles, Our Selves: Predicting Personality with Twitter," in Privacy, Security, Risk and Trust (PASSAT), 2011 IEEE Third International Conference on and 2011 IEEE Third International Conference on Social Computing (SocialCom), Boston, MA, 2011, pp. 180-185.

    [3] Munmun De Choudhury, Nicholas Diakopoulos, and Mor Naaman, "Unfolding the Event Landscape on Twitter: Classification and Exploration of User Categories," in Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work, ACM, 2012, pp. 241-244.

    [4] Delip Rao, David Yarowsky, Abhishek Shreevats, and Manaswi Gupta, "Classifying Latent User Attributes in Twitter," in Proceedings of the 2nd international workshop on Search and mining user-generated contents, ACM, 2010, pp. 37-44.

    [5] Matthew Michelson and Sofus A Macskassy, "Discovering Users' Topics of Interest on Twitter: a First Look," in Proceedings of the fourth workshop on Analytics for noisy unstructured text data, ACM, 2010, pp. 73-80.

    [6] Manel Mezghani, Corinne Amel Zayani, Ikram Amous, and Faiez Gargouri, "A user profile modelling using social annotations: a survey," in Proceedings of the 21st international conference companion on World Wide Web, ACM, 2012, pp. 969-976.

    [7] John D. Burger, John Henderson, George Kim, and Guido Zarrella, "Discriminating Gender on Twitter," in Conference on Empirical Methods in Natural Language Processing, 2011, pp. 1301-1309.

    [8] William Deitrick et al., "Gender Identification on Twitter Using the Modified Balanced Winnow," Communications & Network, vol. 4, no. 3, 2012.

    [9] Zachary Miller, Brian Dickinson, and Wei Hu, "Gender Prediction on Twitter Using Stream Algorithms with N-gram Character Features," International Journal of Intelligence Science, vol. 2, p. 143, 2012.

    [10] Yudi Wibisono and Naufal Faruqi, "Penentuan Gender Otomatis Berdasarkan Isi Microblog Memanfaatkan Fitur Sosiolinguistik," Jurnal Cybermatika, vol. 1, no. 1, 2013.

    [11] Elisafina Siswanto and Masayu Leylia Khodra, "Predicting latent attributes of Twitter user by employing lexical features," in Information Technology and Electrical Engineering (ICITEE), 2013 International Conference, IEEE, 2013, pp. 176-180.

    [12] Ismail Sunni and Dwi Hendratmo Widyantoro, "Analisis Sentimen dan Ekstraksi Topik Penentu Sentimen pada Opini Terhadap Tokoh Publik," Jurnal Sarjana Institut Teknologi Bandung Bidang Teknik Elektro dan Informatika, vol. 1, no. 2, pp. 200-206, Juli 2012.

    [13] University of Waikato. (2013, June) Weka 3: Data Mining Software in Java. [Online]. "http://www.cs.waikato.ac.nz/ml/weka/"

    [14] Chunliang Lu, Wai Lam, and Yingxiao Zhang, "Twitter User Modeling and Tweets Recommendation Based on Wikipedia Concept Graph," in Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.

    [15] Jianshu Weng, Ee-Peng Lim, Jing Jiang, and Qi He, "Finding Topic-Sensitive Influential Twitterers," in Proceedings of the third ACM international conference on Web search and data mining, ACM, 2010, pp. 261-270.

    [16] Wayne Xin Zhao et al., "Comparing Twitter and Traditional Media using Topic Models," in Advances in Information Retrieval. Springer Berlin Heidelberg, 2011, pp. 338-349.

    [17] Claudia Wagner, Vera Liao, Peter Pirolli, Les Nelson, and Markus Strohmaier, "It's not in Their Tweets: Modeling Topical Expertise of Twitter Users," in Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom), IEEE, 2012, pp. 91-100.

    [18] Toshimitsu Takahashi, Ryota Tomioka, and Kenji Yamanishi, "Discovering Emerging Topics in Social Streams via Link Anomaly Detection," in Data Mining (ICDM), 2011 IEEE 11th International Conference, IEEE, 2011, pp. 1230-1235.

    [19] Aqsath Rasyid Naradhipa and Ayu Purwarianti, "Sentiment classification for Indonesian message in social media," in Cloud Computing and Social Networking (ICCCSN) International Conference, 2012, pp. 1-5.

    [20] Muhammad Hasby and Masayu Leylia Khodra, "Optimal Path Finding based on Traffic Information Extraction from Twitter," in ICT for Smart Society (ICISS), 2013 International Conference, IEEE, 2013, pp. 1-5.

    [21] Masayu Leylia Khodra and Ayu Purwarianti, "Ekstraksi Informasi Transaksi Online pada Twitter," Jurnal Cybermatika, vol. 1, no. 1, 2013.

    [22] Yosef Ardhito Winatmoko and Masayu Leylia Khodra, "Automatic Summarization of Tweets in Providing Indonesian," in 4th International Conference on Electrical Engineering and Informatics, ICEEI, 2013, pp. 1027-1033.

    [23] (2013, June) Twitter Developers. [Online]. "https://dev.twitter.com/"

    [24] Jesse Read and Peter Reutemann. (2014, March) MEKA: A Multi-label Extension to WEKA. [Online]. "http://meka.sourceforge.net/"

    [25] Sri Krisna Endarnoto, Sonny Pradipta, Anto Satriyo Nugroho, and James Purnama, "Traffic Condition Information Extraction & Visualization from Social Media Twitter for Android Mobile Application," in Electrical Engineering and Informatics (ICEEI) International Conference, 2011, pp. 1-4.

    2014 International Conference of Advanced Informatics: Concept, Theory and Application (ICAICTA)
