
In the Service of Online Order: Tackling Cyber-Bullying with Machine Learning and Affect Analysis

Michal Ptaszynski (1), Pawel Dybala (2), Tatsuaki Matsuba (3), Fumito Masui (4), Rafal Rzepka (2), Kenji Araki (2), Yoshio Momouchi (5)

(1) JSPS Research Fellow / High-Tech Research Center, Hokkai-Gakuen University, Japan
(2) Graduate School of Information Science and Technology, Hokkaido University, Japan
(3) Graduate School of Engineering, Mie University, Japan
(4) Department of Computer Science, Kitami Institute of Technology, Japan
(5) Department of Electronics and Information Engineering, Faculty of Engineering, Hokkai-Gakuen University, Japan

1. Introduction

The notion of online security is commonly applied to the Internet and networks. It deals with all sorts of undesirable entities or events. Some of the most well known issues of online security include hacking, cracking, data theft and online espionage. However, for several years, a problem that has become much more visible and therefore influential and socially harmful is the exploitation of online open communication means, such as BBS forum boards or social networks, to convey harmful and disturbing information. In the USA, a great focus on this issue began in 2001 after the 9/11 terrorist attack. However, similar cases had been noticed before in other countries on a smaller scale. In Japan, on which this research is focused, a great social disturbance was caused by cases of criminals or hijackers sending alarming messages on the Internet just before committing a crime. One famous case of this kind happened in May 2000, a year before 9/11 in the USA, when a frustrated young man sent a message on the popular Japanese BBS forum 2channel, informing readers he was going to hijack a bus, just before he proceeded with his plan. A growing number of similar cases around the world opened a public debate on whether such suspicious messages could not be spotted early enough to prevent the crimes from happening [1] and on the freedom of speech on the Internet in general [2]. Some of the famous research in this matter was done by the team of Hsinchun Chen, who started a project aiming to analyze the Dark Web [3, 4]. Another research of this kind was performed by Gerstenfeld [5], who focused on extremist groups.

However, there has been little research performed on a problem less lethal, although equally serious, namely online slandering and bullying of private persons, known generally as "cyber-bullying". In Japan the problem has become serious enough to be noticed by the Ministry of Education, Culture, Sports, Science and Technology (later: MEXT) [6]. At present, school personnel and members of Parent-Teacher Associations (PTA) have started Online Patrol to spot Web sites and blogs containing such inappropriate contents. However, the countless amount of such data makes the job an uphill task. Moreover, the Online Patrol is performed manually and as volunteer work. Therefore we started this research to help the Online Patrol members. The final goal of this research is to create a machine Online Patrol crawler that can automatically spot cyber-bullying cases on the Web and report them to the police. In this paper we present some of the first results of this research. We first focused on developing a systematic approach to spotting online cyber-bullying entries automatically to ease the burden of the Online Patrol volunteers. In our approach we perform affect analysis of these contents to find distinctive features for cyber-bullying entries. The feature specified as the most characteristic for cyber-bullying is used in training of a machine learning algorithm for spotting the malicious contents.

The paper outline is as follows. In Section 2 we describe the problem of cyber-bullying in more detail. In Section 3 we present the detailed description of the affect analysis systems used in this research. Section 4 contains results of affect analysis of cyber-bullying data and proposes a feature to apply in training of the machine learning algorithm. In Section 5 we present the prototype SVM-based method for detecting the cyber-bullying entries and in Section 6 we evaluate it. Finally, in Section 7 we conclude the paper and provide some hints on further work in this area.

2. What is Cyber-Bullying?

Although the problem of sending harmful messages on the Internet has existed for several years, it has been officially defined and named as cyber-bullying only recently. The National Crime Prevention Council in the USA states that cyber-bullying happens "when the Internet, cell phones or other devices are used to send or post text or images intended to hurt or embarrass another person". Other definitions, such as the one by Bill Belsey, a teacher and an anti-cyber-bullying activist, say that cyber-bullying "involves the use of information and communication technologies to support deliberate, repeated, and hostile behaviour by an individual or group, that is intended to harm others" [7].

Some of the first robust research on cyber-bullying was done by Hinduja and Patchin, who performed numerous surveys about the subject in the USA [8, 9]. They found out that the harmful information may include threats, sexual remarks, pejorative labels or false statements aimed at humiliation. When posted on a BBS forum or a social network, such as Facebook, it may disclose personal data of the victim. The data which contains humiliating information about the victim defames or ridicules the victim personally.

    2.1 Cyber-bullying and Online Patrol in Japan

In Japan, after several cases of suicides of cyber-bullying victims who could not bear the humiliation, MEXT has considered the problem serious enough to start a movement against it. In a manual for spotting and handling the cases of cyber-bullying [6], the Ministry puts a great importance on early spotting of the suspicious entries and messages, and distinguishes several types of cyber-bullying noticed in Japan. These are:


1. Cyber-bullying appearing on BBS forums, blogs and on private profile web-sites:
(a) Entries containing libelous, slanderous or abusive contents;
(b) Disclosing personal data of natural persons without their authorization;
(c) Entries and humiliating online activities performed in the name of another person;

2. Cyber-bullying appearing in electronic mail:
(a) E-mails directed to a certain person (child), containing libelous, slanderous or abusive contents;
(b) E-mails in the form of chain letters containing libelous, slanderous or abusive contents;
(c) E-mails sent in the name of another person, containing humiliating contents.

In this research we focused mostly on the cases of cyber-bullying that appear on informal Web sites of Japanese secondary schools. Informal school Web sites are Web sites where school pupils gather to exchange information about school subjects, dates of tests, etc. However, as was noticed by Watanabe and Sunayama [10], on such pages there has been a rapid growth of entries containing insulting or slandering information about other pupils or even teachers. Cases like that make pupils uncomfortable with using the Web sites and cause undesirable misunderstandings.

Figure 1. Route of Online Patrol: (1) detection of cyber-bullying activity; (2) saving the URL of the Web site; (3) printing out the Web site containing the cyber-bullying activity (in the case of cell-phone sites, taking a photo of the screen containing the entry); (4) sending a deletion request for the entry to the Web site administrator or Internet provider; (5) informing the police or the Legal Affairs Bureau; (6) confirming the deletion of the entry containing the cyber-bullying activity.

To deal with such malicious entries, a movement of Online Patrol (OP) was founded. Participants of this movement are usually school personnel and PTA members. Based on the MEXT definition of cyber-bullying, they read through all available entries, decide whether an entry is dangerous or not and, if necessary, send a deletion request to the Web page administrator. Finally, they report about the event to the police. The typical Online Patrol route is presented in Figure 1.

(Other terms used are "cyber-harassment" and "cyber-stalking".)


Unfortunately, at the present state of affairs, the school personnel and PTA members taking part in Online Patrol perform all tasks manually as voluntary work, beginning from reading the countless numbers of entries and deciding about the appropriateness of their contents, through printing out or taking photos of the pages containing cyber-bullying entries, and finally sending the reports and deletion requests to the appropriate organs. Moreover, since the number of entries rises by the day, surveillance of the whole Web becomes an uphill task for the small number of patrol members.

With this research we aim to create a Web crawler capable of performing this difficult task instead of humans, or at least of easing the burden of the Online Patrol volunteers.

3. Affect Analysis of Cyber-Bullying

As was shown by Chen and colleagues [3, 4], analysis of affect intensity of Dark Web Forums often helps in specifying the character of the forum. Aiming to find any dependencies between expressing emotions and cyber-bullying activities, we performed a study and analysed the affective level of the cyber-bullying data.

For contrastive affect analysis we obtained Online Patrol data containing both cyber-bullying activities and normal entries. The data used in affect analysis contained 1,495 harmful and 1,504 non-harmful entries. The affect analysis was performed with the ML-Ask system for affect analysis of textual input in Japanese and the CAO system for analysis of emoticons, both developed by Ptaszynski and colleagues [16, 18].

3.1 ML-Ask: Affect Analysis System

Figure 2. Flow chart of the ML-Ask system

ML-Ask (eMotive eLements / Emotive Expressions Analysis System) is a system developed for analyzing the emotive contents of utterances. It provides not only information on emotions expressed in the input, but also linguistic information on what sentence elements represent which emotive features and what is their grammatical classification. The system uses a two-step procedure: 1) analyzing the general emotiveness of an utterance by detecting emotive elements, or emotemes, expressed by the speaker and classifying the utterance as emotive or non-emotive; 2) recognizing the particular emotion types by extracting expressions of particular emotions from the utterance. This analysis is based on Ptaszynski's [21] idea of two-part classification of realizations of emotions in language into:

1) Emotive elements or emotemes: elements conveyed in an utterance indicating that the speaker was emotionally involved in the utterance, but not detailing the specific emotions. The same emotive element can express different emotions depending on context. This group is linguistically realized by subgroups such as interjections, exclamations, mimetic expressions, or vulgar language. Examples are: sugee (great!), wakuwaku (heart pounding), -yagaru (a vulgarization of a verb);



The manually-selected emotive element database contains representations of non-verbal emotive elements and is divided in this way into the subgroups described above.
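The two-step procedure of ML-Ask described above can be illustrated with a minimal sketch; the emoteme list, the emotive-expression lexicon and the emotion labels below are illustrative stand-ins, not the actual ML-Ask databases.

# A minimal sketch of ML-Ask's two-step procedure (illustrative keyword lists,
# not the actual ML-Ask databases).

EMOTEMES = ["sugee", "wakuwaku", "yagaru", "!", "nante", "naa"]   # emotive elements
EMOTIVE_EXPRESSIONS = {                                           # expression -> emotion type
    "kimochi ii": "joy",
    "akirameru": "dislike",
    "agameru": "fondness",
}

def analyze(utterance: str) -> dict:
    """Step 1: detect emotemes -> emotive / non-emotive.
       Step 2: if emotive, extract expressions of particular emotions."""
    emotemes = [e for e in EMOTEMES if e in utterance]
    if not emotemes:
        return {"emotive": False, "emotemes": [], "emotions": []}
    emotions = [emo for expr, emo in EMOTIVE_EXPRESSIONS.items() if expr in utterance]
    return {"emotive": True, "emotemes": emotemes, "emotions": emotions}

print(analyze("Kyo wa nante kimochi ii hi nanda !"))
# {'emotive': True, 'emotemes': ['!', 'nante'], 'emotions': ['joy']}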

3.1.2 Contextual Valence Shifters in ML-Ask

One of the problems in the procedure described above was confusing in some cases the valence polarity of emotive expressions. The cause of this problem was extracting from the utterance only the emotive expression keywords without their grammatical context. One case of such an input is presented below in example (3). In this sentence the emotive expression is the verb akirameru (to give up [verb]), but the phrase -cha ikenai (Don't- [particle+verb]) suggests that the speaker is in fact negating and forbidding the emotion expressed literally. Such phrases are called Contextual Valence Shifters (CVS).

However, using only the CVS analysis we would be able to find out the appropriate valence of emotions conveyed in the utterance, but we would not know the exact emotion type. To specify the emotion types after changing polarity with CVS, we applied the idea of the 2-dimensional model of affect [24], which assumes that all emotions can be described in 2 dimensions: the emotion's valence polarity (positive/negative) and activation (activated/deactivated). An example of a positive-activated emotion could be "excitement"; a positive-deactivated emotion is, e.g., "relief" (see Figure 3).

Emotion types distinguished by Nakamura [22] were mapped on this model and their affiliation to one of the spaces determined. The emotion types with ambiguous affiliation were mapped on two possible fields. When a CVS structure is discovered, ML-Ask changes the valence polarity of the detected emotion. The appropriate emotion after valence changing is determined as the one with valence polarity and activation parameters different to the contrasted emotion (note arrows in Figure 3).
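This valence-switching step can be sketched as follows; the quadrant assignments and the CVS pattern list below are a simplified illustration of the mapping described above, not the actual ML-Ask resources.

# Sketch of CVS-based valence switching on the 2-dimensional model of affect
# (valence x activation); only a few of Nakamura's emotion types and one CVS
# pattern are covered, for illustration.

AFFECT_SPACE = {            # emotion type -> (valence, activation)
    "joy":      ("positive", "moderately activated"),
    "fondness": ("positive", "moderately activated"),
    "relief":   ("positive", "deactivated"),
    "dislike":  ("negative", "moderately activated"),
    "fear":     ("negative", "activated"),
    "gloom":    ("negative", "deactivated"),
}
CVS_PATTERNS = ["-cha ikenai", "-nakute mo ii"]   # phrases negating the literal emotion

def shift_valence(emotion: str, sentence: str) -> str:
    """If the sentence contains a CVS, return an emotion type from the opposite
    valence field with the same activation (a simplified reading of Figure 3)."""
    if not any(cvs.strip("-") in sentence for cvs in CVS_PATTERNS):
        return emotion
    valence, activation = AFFECT_SPACE[emotion]
    target = "positive" if valence == "negative" else "negative"
    for other, (v, a) in AFFECT_SPACE.items():
        if v == target and a == activation:
            return other
    return emotion

print(shift_valence("dislike", "Akirame-cha ikenai yo!"))   # -> 'joy'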

3.2 CAO: Emoticon Analysis System

Since one of the most popular strategies of expressing emotions in online communication is using emoticons, we supported ML-Ask with the emoticon analysis system CAO. It is a system for estimation of emotions conveyed through emoticons reported by Ptaszynski and colleagues [18]. Emoticons are sets of symbols widely used in text-based online communication to convey emotions. CAO, or emotiCon Analysis and decOding of affective information, extracts an emoticon from an input (a sentence) and determines specific emotion types expressed by it using a three-step procedure. Firstly, it matches the input with a predetermined raw emoticon database containing over ten thousand emoticons. The emoticons which could not be estimated with only the database are automatically divided into semantic areas, such as representations of "mouth" or "eyes", based on the idea of kinemes, or minimal meaningful body movements, applied from the theory of kinesics [19, 20]. The areas are automatically annotated according to their co-occurrence in the database. The annotation is firstly based on the eye-mouth-eye triplet. If no triplet is found, all semantic areas are estimated separately. This provides hints about potential groups of expressed emotions, giving the system a coverage of over 3 million possibilities. CAO is used as a supporting procedure in ML-Ask to improve performance of the affect analysis system on utterances which do not include emotive expressions, like example (2) below.

(1) Kyo wa nante kimochi ii hi nanda!
[Today:TOP ML:nante MX:joy day:SUB ML:nanda ML:!]
Translation: Today is such a nice day!

(2) Iyaー, sore wa sugoi desu neー! ・0・
[ML:iya this:TOP ML:sugoi COP ML:neー ML:! ML:・0・ MX:joy]
Translation: Whoa, that's great! ・0・

(3) Akirame-cha ikenai yo!
[MX:dislike CVS:-cha ikenai → joy ML:yo ML:!]
Translation: Don't give up!



(4) Hitoribocchi nante iya da...
[MX:sadness ML:nante-da MX:dislike COP ML:...]
Translation: Being alone sucks...

4. Affect Analysis Results


At first we calculated the number of all emotive entries among both sets of data. There were 956 emotive entries among 1,495 harmful (63.95%) and 1,029 among 1,504 non-harmful (68.42%) entries. The difference was not high and therefore the number of emotive entries cannot be considered a highly distinctive feature. However, we made the first assumption, that harmful data are less emotively emphasized than non-harmful. This is a reasonable assumption, since cyber-bullying is often based on irony or sarcasm, which is not highly emotive, however deliberately uses some amount of emotive information to slander the object of the sarcasm. To confirm this thesis we performed other comparisons.


We calculated emotive values of all emotive entries. Although the number of emotive entries and the emotive value both relate to the idea of emotiveness of a corpus in general, they are not directly related. One can easily imagine the difference between one corpus that consists of many slightly emotive entries (high number of emotive entries, but low average emotive value) and another corpus with a small number of highly emotive entries (low number of emotive entries, but high average emotive value). The approximated emotive value for harmful and non-harmful data was 1.47 and 1.5, respectively. Here also the difference is not high, although, since both values are not directly related, this can moderately support the thesis.


Table 1. Examples of cyber-bullying entries with the results of affect analysis

Entry: ">>104 Senzuri koite shinu nante, sonna hageshii senzuri sugee naa. 'Senzuri masutaa' toshite isshou agamete yaru yo."
Analysis results: Emotemes: nante, sugee, naa, -yo, senzuri; Emotive expression: agameru (worship, respect) - fondness.
Translation: ">>104 Dying by 'flicking the bean'? Cannot imagine how one could 'flick the bean' so fiercely. I'll worship you forever, as a 'master-bator'."

Entry: "2-nen no tsutsuji no onna meccha busu, suki na hito barashimashou ka? 1-nen no ano ko desu yo ne? Kimogatteru nde yamete agete kudasai."
Analysis results: Emotemes: meccha, yo, ne, -nde, kimogatteru; Emotive expression: suki (like) - fondness.
Translation: "Ya wanna know who likes 2nd-grade ugly azalea girls? It's that 1st-grader, isn't it? He looks disgusted, so leave him mercifully in peace."

Entry: "Aitsu wa busakute se ga takai dake no onna. ... se takai dake ya no ni ... otoko-zuki ... anna onna owatteru."
Analysis results: Emotive expression: -zuki (-lover; an amateur of ...) - fondness.
Translation: "She's just tall and apart of that she's so freakin' ugly, and despite of that she's such a cock-loving slut, she's finished already."

Entry: "Shinde kureeee, daibu kiraware-mono de yuumei, subete ga itaitashii..."
Analysis results: Emotemes: eee (syllable prolongation), ... (ellipsis); Emotive expressions: kiraware-mono (disliked by others) - dislike, itaitashii (pathetic, pitiful) - gloom, sadness.
Translation: "Please, dieee, you're so famous for being disliked by everyone, everything in you is pathetic..."

Next, we took a closer look at the extracted emotemes. There are four groups of emotemes distinguished in ML-Ask: i) interjections, ii) exclamations, iii) vulgarities and iv) mimetic expressions (gitaigo in Japanese), arranged in order of their emotional weight. The distribution of the extracted emotemes within both entry sets is represented in Table 2. There were relatively more emotemes with high emotive weight in non-harmful data and more low-weight emotemes in harmful data, which is another confirmation of the thesis presented at the beginning of this section.

The biggest difference between the numbers of extracted emotemes was found for vulgarities, which can be regarded as a unique and distinctive feature of cyber-bullying entries. This is one of the reasons we used it as the main feature in the machine


learning system, described later in Section 5.

A similar difference, although lower, appeared also in mimetic expressions, which could be included in a further study on enlarging the lexicon for the machine learning system. As for interjections and exclamations, although they appeared in large numbers in both datasets, more of them appeared in non-harmful entries. This could be caused by the fact that these two emoteme types are used to express emotional attitude in a straightforward way. Therefore, although there certainly are cyber-bullying cases where the victims are slandered straightforwardly, non-emotional and cold sarcasm is also a frequent phenomenon.

Table 2. Distribution of the extracted emotemes within both entry sets (e.g. mimetic expressions: 7 in non-harmful vs. 23 in harmful entries)

Another comparison was made with the closer study of emotive utterances. ML-Ask is capable of detecting emotive utterances with a high reliability (kappa = 0.8), although it specifies particular emotion types with low Recall (although with high Precision) [16]. It is due to the use of a lexicon (Nakamura's dictionary, cf. [22]) that is somewhat out of date. Therefore there is a certain number of samples always described as emotive but with unspecified emotion types. Such a phenomenon is however reasonable from the linguistic point of view, since there are many sentences that are emotive, although the emotion they convey depends on their context.

We first compared how many there were specified vs. unspecified emotive utterances. The result was 13.18% vs. 86.82% for cyber-bullying and 11.95% vs. 88.05% for the normal entries. The higher ratio of specified emotion types in cyber-bullying data might suggest that people more often use traditional emotive expressions to slander people than they do to express emotions usually.

non-harmful            harmful
joy             32     fondness        49
fondness        21     joy             12
relief          18     relief           9
fear            11     anger            8
gloom/sadness    7     gloom/sadness    7
surprise         6     excitement       3
excitement       5     fear             3
anger            3     shame            3
shame            0     surprise         3

Table 3. The number of particular emotive expressions extracted by ML-Ask from both datasets


Valence of emotions

negative emotion              non-harmful   harmful
gloom/sadness                      7            7
fear                              11            3
anger                              3            8
dislike                           49           56
SUM                               70           74

negative/positive emotion (both possible)
excitement                         5            3
shame                              0            3
surprise                           6            3
SUM                               11            9

positive emotion
joy                               32           12
fondness                          21           49
relief                            18            9
SUM                               71           70
SUM (fondness excluded)           50           21

Activation of emotions

deactivated emotion           non-harmful   harmful
gloom/sadness                      7            7
relief                            18            9
SUM                               25           16

moderately activated emotion
joy                               32           12
fondness                          21           49
dislike                           49           56
SUM                              102          117

activated emotion
shame                              0            3
excitement                         5            3
fear                              11            3
anger                              3            8
surprise                           6            3
SUM                               25           20

Table 4. Comparison of tendencies in annotation of emotion types with regard to the two-dimensional affect space
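The group sums in Table 4 can be reproduced directly from the per-emotion counts by grouping the emotion types according to their position in the affect space; a short check for the valence dimension:

# Recomputing the valence sums of Table 4 from the per-emotion counts
# (non-harmful, harmful) extracted by ML-Ask.

counts = {
    "joy": (32, 12), "fondness": (21, 49), "relief": (18, 9),
    "gloom/sadness": (7, 7), "fear": (11, 3), "anger": (3, 8), "dislike": (49, 56),
    "excitement": (5, 3), "shame": (0, 3), "surprise": (6, 3),
}
valence = {
    "negative": ["gloom/sadness", "fear", "anger", "dislike"],
    "negative/positive": ["excitement", "shame", "surprise"],
    "positive": ["joy", "fondness", "relief"],
}

for group, emotions in valence.items():
    non_harmful = sum(counts[e][0] for e in emotions)
    harmful = sum(counts[e][1] for e in emotions)
    print(f"{group}: non-harmful={non_harmful}, harmful={harmful}")
# negative: non-harmful=70, harmful=74
# negative/positive: non-harmful=11, harmful=9
# positive: non-harmful=71, harmful=70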

On the other hand, fondness scored unexpectedly higher in the cyber-bullying dataset. Detailed analysis revealed that people would often express strong sarcasm with the use of positive expressions. Some examples of such entries have been represented in Table 1.

Finally, we compared tendencies in the annotated emotion types with regard to the two-dimensional affect space. The results are represented in Table 4. In the valence dimension, negative emotions were annotated most often on harmful data, and positive emotions on non-harmful data, which is a reasonable and predictable result. The differences were not that obvious; however, after exclusion of fondness, which, as mentioned above, was often used mostly in sarcasm, the differences became clearer. The smallest difference was observed in emotion types which can be classified as both positive and negative, as their valence usually depends on the particular context.

As for the activation dimension, non-harmful data was annotated as more vivid in the groups of deeply activated and deeply deactivated emotions. On the other hand, harmful data was annotated more often with moderately activated


emotions, which provides another proof for the thesis set at the beginning of this section, namely, that slandering is often expressed in non-emotional statements. Moreover, we were able to determine the linguistic feature most distinguishable for cyber-bullying activities, namely, vulgarities. This is an important clue in the development of the machine learning algorithm for cyber-bullying detection.

    5. Machine Learning Method For Cyber-Bullying Detection

In this section we describe a machine learning method developed to handle cyber-bullying activities. The method consists of several stages, including creation of a lexicon of vulgar, slanderous and abusive words, a slanderous information detection module, ranking of the information according to the level of its harmfulness, and visualization of the harmful information. The system flow chart is represented in Figure 4. The creation of the system was separated into two general phases: the training phase and the processing (test) phase. Below we present the details of each phase.

1. Training phase:

(a) Crawling the school Web sites,
(b) Detecting manually cyber-bullying entries,
(c) Extraction of vulgar words and adding them to the lexicon,
(d) Estimating word similarity with Levenshtein distance,
(e) Part of speech analysis,
(f) Training with SVM.

2. Processing (test) phase:

(a) Crawling the school Web sites,
(b) Detecting the cyber-bullying entries with the SVM model,
(c) Part of speech analysis of the detected harmful entry,
(d) Estimating word similarity with Levenshtein distance,
(e) Marking and visualisation of the key sentence.

    5.1 Defining a Cyber-bullying Entry

Basing on the definition of cyber-bullying provided by MEXT [6] as well as the results of affect analysis from Section 4, we created our own working definition of a cyber-bullying entry. With this definition we aimed to embrace the features that need to be dealt with in research such as ours.

Firstly, the data needs to be appropriately divided and weighted. In our definition we divided entries into three types. A Normal (N) entry is an entry not containing any harmful information and thus not needing any intervention. A Doubtful (D) entry contains information that might be harmful, but it needs further analysis to make the decision whether to delete it or not. Finally, a Harmful (H) entry is a type of entry towards which there is no doubt that it contains harmful information and the decision to delete it can be made without further analysis. We used the above tripartite annotation to classify different types of entry contents.

The MEXT definition assumes that cyber-bullying happens when a person is personally offended on the Web. This includes disclosing the person's name, personal information and other areas of privacy. Therefore, as the first feature distinguishable for a cyber-bullying entry we define NAMES. This includes such information as:

• Names and surnames of people (e.g. "Michal Ptaszynski")

    H -When a person's name can be clearly distinguished


• Initials and nicknames (e.g. "M.P.", "Mich*** Ptasz***ski", "Mr. P.")
H - When a person's identity can be clearly distinguished
D - When a person's identity cannot be clearly distinguished

• Names of institutions and affiliations (e.g. "That JSPS Fellow from Hokkai-Gakuen University")
H - When a person's identity can be clearly distinguished
D - When a person's identity cannot be clearly distinguished

As the second type of feature distinguishable for cyber-bullying we define PRIVATE INFORMATION. This includes:

• Addresses, phone numbers, etc. (e.g. "Minami 26, Nishi 11, Chuo-ku, Sapporo, 064-0926, Japan", or "+81-11-551-2951")
H - When the information refers to a private person
D - When the information is public or refers to a public entity

• Questions about private persons (e.g. "Who is that tall foreigner walking around lately in the High-Tech corridor?")
H - Always considered as undesirable and harmful, including situations in which the object of the question cannot be clearly identified

• Entries revealing personal information (e.g. "I heard that guy is responsible for the new project")
H - When a person's identity can be clearly distinguished
D - When a person's identity cannot be clearly distinguished

Literature on cyber-bullying indicates vulgarities as the first distinctive feature of cyber-bullying activity [8, 9]. We were also able to prove this statement in Section 4. The biggest difference in extraction of linguistic features from both cyber-bullying and normal data was noticed for vulgarities. Therefore we selected VULGARITIES as the keywords distinguishable for cyber-bullying. Vulgarities are obscene or vulgar words which connote offences against particular persons or society. Examples of such words in English are shit, fuck, bitch. Examples in Japanese include words like uzai (freaking annoying), or kimoi (freaking ugly). In our research we divided the entries containing vulgarities into two types, namely:

• Entries containing any form of vulgar or offensive language
D - Even when the object cannot be identified


    • Quarrels between two or more users

D - Even when it happens between two anonymous users, a quarrel can lead to revealing personal information or other forms of cyber-bullying

All entries not containing any of the above information are classified as normal (N).

    5.1.1 Definition Testing

To test whether the definition is coherent, we performed an experiment. From the overall 2,999 entries (1,495 harmful and 1,504 normal) we extracted randomly a sample of 500 entries. Then we asked six human annotators (5 males and 1 female) to annotate the sample according to the above definition. Finally, we calculated Cohen's kappa agreement coefficient for all participants. The agreement coefficient was 0.67, which is regarded as strong, which confirms that the definition is coherent. However, in some cases the participants did not agree completely, mostly in cases of vulgarity. Some words could be considered as vulgar or not depending on one's subculture and usual vocabulary (e.g. "gangster" or "pimp" can be perceived as normal in a hip-hop subculture). This indicates that the definition should be complemented in the future with more strict definitions of vulgarities.
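The agreement check can be sketched as follows; since Cohen's kappa is defined for pairs of annotators, one straightforward reading of the procedure is to average the pairwise kappa values over all annotator pairs (the annotations below are made up for illustration).

# Sketch: average pairwise Cohen's kappa over six annotators using the
# tripartite N/D/H annotation (toy annotations, for illustration only).
from itertools import combinations
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[l] * freq_b[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

annotations = [                       # one list of N/D/H labels per annotator
    list("NNHDHNHD"), list("NNHDHNHN"),
    list("NDHDHNHD"), list("NNHDDNHD"),
    list("NNHDHNDD"), list("NNHDHDHD"),
]
kappas = [cohen_kappa(a, b) for a, b in combinations(annotations, 2)]
print(f"mean pairwise kappa = {sum(kappas) / len(kappas):.2f}")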


5.2 Construction of Vulgarity Lexicon

Vulgar words are often not detected by part of speech (POS) taggers, or are marked as "unknown word". We decided to create a list of vulgarities and add it to the basic lexicon of the POS tagger (in this research we use the standard POS tagger for Japanese, MeCab [17]). This was performed according to the procedure described below. At first we performed a study on the vulgar keywords constituting the harmful information. We obtained a set of informal school Web pages, read them and performed a manual categorization into "harmful" and "non-harmful" entries based on the MEXT classification of cyber-bullying. From these Web pages we obtained 1,495 harmful entries including 255 unique vulgar keywords distinguishable for cyber-bullying activity. The extracted keywords were finally added to the list of vulgarities. Finally, we added grammatical information to the extracted keywords and added them to the POS tagger lexicon. An example of the addition of grammatical information is presented in Table 5.

kimoi (freaking ugly)
POS: Adjective;
Headword: kimoi (hit-rate: 294);
Reading: kimoi;
Pronunciation: kimoi;
Conjugated form: uninflected;

Table 5. Example of a newly registered word
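A registered entry like the one in Table 5 can be represented, for instance, as a simple record added to a user lexicon; the field names below mirror Table 5 and are only an illustrative data structure, not the tagger's real dictionary format.

# Illustrative representation of a newly registered vulgar word (cf. Table 5);
# the field layout is a sketch, not the POS tagger's actual dictionary format.
from dataclasses import dataclass

@dataclass
class LexiconEntry:
    headword: str
    pos: str
    reading: str
    pronunciation: str
    conjugated_form: str
    hit_rate: int          # how often the keyword occurred in the harmful entries

vulgarity_lexicon = [
    LexiconEntry(headword="kimoi", pos="Adjective", reading="kimoi",
                 pronunciation="kimoi", conjugated_form="uninflected", hit_rate=294),
]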

5.3 Estimation of Word Similarity with Levenshtein Distance

Users would often change the spelling of words and write them in an unnormalized way. It is a part of the jargonization of the language used online. Examples of this phenomenon in English would be writing phrases like "c. U" in the meaning of "See you [later]!" (usually at the end of e-mails or online messages), or "brah" in the meaning of "bro[ther], friend" (cf. http://www.internetslang.com), etc. Some examples of such colloquial transformations in the Japanese online language are shown in Table 6.

original word                    colloquial transformation
kimoi (freaking ugly, gross)     kimosu, kishoi, kisho, ...
uzai (freaking annoying)         uzee, UZAI, uzakkoi, ...

Table 6. Examples of colloquial transformations in the Japanese online language

With this variation, words having the same meaning would be classified as separate samples, which would cause hit-rate dispersion. Therefore, to unify the same words written with slightly different spellings, we calculated the similarity of the extracted words. The similarity was calculated using Levenshtein distance [11] in a way similar to [12, 13]. The Levenshtein distance between two strings is calculated as the minimum number of operations required to transform one string into another, where the available operations are only deletion, insertion or substitution of a single character.

However, as Japanese is transcribed using three character types: Chinese characters (kanji), encapsulating from one to several syllables, and two additional syllabaries (katakana and hiragana), calculating the distance would become imprecise. To solve this problem, every word was automatically transformed into its alphabetical transcription. An example of distance calculation is presented on the back-transformation of the word kimosu to its original spelling kimoi in Table 7. The distance between both words is equal to 2.

transformed word     performed operation
kimosu               -
kimoiu               substitution of 's' to 'i'; distance = 1
kimoi                deletion of final 'u'; distance = 2

Table 7. Example of Levenshtein distance calculation between kimosu and kimoi
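The distance computation itself can be sketched as follows; the romanization step is assumed to have been done already by a separate transliteration tool, so the function below works directly on alphabetical transcriptions such as those in Table 7.

# Levenshtein distance between two romanized words (cf. Table 7:
# kimosu -> kimoi has distance 2).

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kimosu", "kimoi"))    # 2
print(levenshtein("uzai", "uzakkoi"))    # a colloquial variant of uzai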



5.4 SVM-Based Classification of Cyber-bullying

To classify the entries as either harmful (cyber-bullying) or non-harmful, we used Support Vector Machines. Support Vector Machines (SVMs) are a method of supervised machine learning developed by Vapnik [14] and used for classification of data. They are defined as follows. With a set of training samples divided into two categories A and B, the SVM training algorithm generates a model for prediction of whether test samples belong to either category A or B. In the traditional description of the SVM model, samples are represented as points in space (vectors). SVM constructs a hyperplane, in a space of a higher dimension than the base one, with the largest distance to the nearest training data points (support vectors). The larger the margin, the lower the generalization error of the classifier. Since SVM has been successfully used for text classification [15], we decided to use it in this research as well. In our research the category A contains cyber-bullying cases and the category B contains all other cases, which do not contain socially harmful information. As the software for building SVM models we used SVMlight (ver. 6.02, http://svmlight.joachims.org).

Figure 4. Flow chart of the cyber-bullying activity detecting system


5.5 Extraction of Key Sentences

In the process of automation of Online Patrol, apart from the classification of cyber-bullying entries, there is a need to appropriately determine how harmful a certain entry is. A ranking according to the harmfulness of entries is important to detect the most dangerous cases. In our approach an entry is considered the more harmful, the more vulgar keywords appear in the entry.

The harmfulness of an entry is calculated using T-score. T-score is a measure that answers the question of how confident one can be that the association measured between two words is an actual collocation and not a matter of chance. The higher the occurrence frequency a word has in a corpus, the higher is the value of its T-score. The T-score of a word associating with words A and B is calculated according to the equation below:


T = (a - b) / √a   (1)

where
a = [word co-occurrence frequency],
b = ([occurrence of word A] * [occurrence of word B]) / [number of all words in the corpus].

We calculate the harmfulness of the whole entry as a sum of T-scores calculated for all vulgar words. This way, the more frequently occurring words there are in the entry, the higher rank the entry achieves in the ranking of harmfulness.
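The harmfulness ranking described above can be sketched as follows; the corpus statistics used here are toy numbers and the vulgar-word list is assumed to come from the lexicon of Section 5.2.

# Sketch of the T-score based harmfulness ranking (equation 1):
# T = (a - b) / sqrt(a), with a = observed co-occurrence frequency and
# b = expected co-occurrence = occ(A) * occ(B) / N.
from math import sqrt
from itertools import combinations

def t_score(cooc: int, occ_a: int, occ_b: int, n_words: int) -> float:
    expected = occ_a * occ_b / n_words
    return (cooc - expected) / sqrt(cooc)

def entry_harmfulness(entry_words, vulgar_words, cooc, occ, n_words):
    """Sum of T-scores over all pairs of vulgar words found in the entry."""
    found = [w for w in entry_words if w in vulgar_words]
    score = 0.0
    for a, b in combinations(found, 2):
        if (a, b) in cooc:
            score += t_score(cooc[(a, b)], occ[a], occ[b], n_words)
    return score

# Toy corpus statistics (illustration only).
occ = {"kimoi": 300, "shine": 600, "uzai": 250}
cooc = {("kimoi", "shine"): 11, ("uzai", "kimoi"): 7}
entry = ["aitsu", "kimoi", "shi", "shine"]
print(entry_harmfulness(entry, set(occ), cooc, occ, n_words=100_000))   # ~2.77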

6. Evaluation of the Method

To verify the performance of the method we evaluated three procedures:

1. Classification of cyber-bullying entries with SVM,
2. Word similarity calculation with Levenshtein distance,
3. Extraction of key sentences.

6.1 Evaluation of SVM Model

To apply SVM to detect harmful information from unofficial school BBS sites, we needed to prepare the data for training the SVM model. At first we performed morphological analysis of the BBS entries to be used as the training data. For every part of speech from the analyzed and parsed data we used as features the POS labels with the original strings of characters. As the features in the part of speech identification we used parts of speech like nouns (person's name, or other than a name), verbs and adjectives. In the identification of whole entries we used feature sets consisting of the features of each part of speech and the strings of characters containing the whole entry. Based on the amount of identified features, the SVM model calculates the probability of affiliation of a character string to a certain class.

As the features for training we used several combinations of main and additional features. Main features included: (1) words with POS, (2) words only, (3) POS only. Additional features included: (A) occurrence frequency, calculated as in equation 2, (B) relative frequency, calculated as in equation 3, (C) inverse document frequency (IDF), calculated as in equation 4, and (D) term frequency-inverse document frequency (TF-IDF), calculated as in equation 5.

(A) Occurrence frequency = frequency of a POS within one document   (2)

(B) Relative frequency = (A) / frequency of the POS within all documents   (3)

(C) IDF = log(number of all entries / number of entries containing the POS + 1)   (4)

(D) TF-IDF = (A) * (C)   (5)
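A sketch of how the additional feature values (A)-(D) could be computed for a given word+POS token is given below; the token lists are toy data and the exact weighting passed to SVMlight is an assumption.

# Sketch of the additional feature values (A)-(D) from equations (2)-(5),
# computed for one feature (a word+POS token) with respect to the first entry.
from math import log

def feature_values(feature, entries):
    """entries: list of token lists. Returns (A) occurrence frequency,
    (B) relative frequency, (C) IDF and (D) TF-IDF as in equations (2)-(5)."""
    doc = entries[0]
    occurrence = doc.count(feature)                                   # (A), eq. 2
    total = sum(e.count(feature) for e in entries)
    relative = occurrence / total if total else 0.0                   # (B), eq. 3
    docs_with = sum(feature in e for e in entries)
    idf = log(len(entries) / docs_with + 1) if docs_with else 0.0     # (C), eq. 4
    tf_idf = occurrence * idf                                         # (D), eq. 5
    return occurrence, relative, idf, tf_idf

entries = [["uzai/ADJ", "yatsu/N", "uzai/ADJ"],
           ["kimoi/ADJ", "yatsu/N"],
           ["uzai/ADJ", "shine/V"]]
print(feature_values("uzai/ADJ", entries))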

As the training data we used all 2,999 entries gathered during an actual Online Patrol, from which human annotators (Online Patrol members) classified 1,495 entries as harmful and 1,504 as non-harmful. We did not, however, apply the string similarity calculation to the SVM model, in order to evaluate both techniques separately.



Figure 5. Results of the experiments (Precision and Recall) for the different feature combinations (1-A through 3-D)

The above conditions were applied to test the model trained with SVMlight. In the evaluation we calculated the system's result as a balanced F-score, using 10-fold cross validation for Precision and Recall. In 10-fold cross validation the data is first broken into 10 sets of size n/10. Then, 9 datasets are used for training and 1 as a test set. This procedure is repeated 10 times and the overall score is the mean accuracy from all 10 tests. The balanced F-score, Precision (P) and Recall (R) are calculated as follows:

F-score = 2 * (P * R) / (P + R)   (6)

where Precision = s/n and Recall = s/c, and

s = cases correctly classified by the system as harmful,
n = all cases classified by the system as harmful,
c = all harmful cases.
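The evaluation measures can be computed as in the short sketch below; the fold splitting itself is omitted, the counts s, n, c follow the definitions above, and the example counts are illustrative values chosen to reproduce the reported Precision and Recall.

# Balanced F-score from Precision and Recall (equation 6),
# with s, n, c defined as in the text.

def precision_recall_f(s: int, n: int, c: int):
    """s: correctly classified as harmful, n: all classified as harmful,
    c: all harmful cases."""
    p = s / n
    r = s / c
    f = 2 * p * r / (p + r)
    return p, r, f

# Illustrative counts chosen to reproduce the reported Precision/Recall:
p, r, f = precision_recall_f(s=712, n=800, c=890)
print(f"P={p:.2%}  R={r:.2%}  F={f:.1%}")   # P=89.00%  R=80.00%  F=84.3%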

The results are represented in Figure 5. The results for group 3 (only POS) were the lowest. Groups 1 (words+POS) and 2


(words only) were comparable. The highest Recall appeared in the experiments with Relative Frequency. However, the Precision in this case was not ideal (close to 65%). Experiments with IDF (C) and TF-IDF (D) gave the highest results, with Recall close to 80% and Precision close to 90%. The highest score of all was for the combination words+POS with TF-IDF (Precision = 89%, Recall = 80%, F-score = 84.3%).

6.2 Evaluation of Similarity Calculation

When preparing the conditions for evaluation of the word similarity calculation method with Levenshtein distance, we noticed that the larger the threshold, the higher the probability of matching a word with a completely different meaning. Therefore we performed an optimization of the similarity calculation algorithm. In the optimization we applied the two heuristic rules shown in Table 8.


    Figure 6. Precision of similarity calculation before and after applying the heuristic rules

Rule                                       Example
1. Deletion of syllable prolongations      kimooi → kimoi
2. Unification of word's first letter      In the case of uzai we consider only the words beginning with 'u'

Table 8. Two heuristic rules applied in optimization of the similarity calculation algorithm

On BBS sites where informal language is widely used, such as unofficial school Web sites, prolonging of syllables is used mostly to express changes in user attitude or mood, usually highlighting the unofficial character of the entry. However, such an operation has no influence on the semantics of a word. Therefore, before the phase of similarity calculation we added a rule deleting all syllable prolongations.

The second rule was unification of the first letters of matched words. This was done because in the Japanese language only the final letters change during conjugation. Therefore we could assume that it is irrelevant to calculate similarity for words differing in their first letters.
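The two heuristic rules can be added as a preprocessing step before the distance check, roughly as in the sketch below; the set of prolongation characters handled here is an assumption, and the distance function is the same dynamic-programming routine sketched in Section 5.3.

# Sketch: applying the two heuristic rules of Table 8 before the
# Levenshtein-distance comparison (threshold = 2 as in the evaluation).
import re

def levenshtein(a: str, b: str) -> int:   # same DP as in the Section 5.3 sketch
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def normalize(word: str) -> str:
    """Rule 1: delete syllable prolongations (long-vowel mark, repeated vowels)."""
    word = word.replace("ー", "")                  # Japanese prolongation mark
    return re.sub(r"([aeiou])\1+", r"\1", word)    # kimooi -> kimoi

def similar(word: str, lexicon_word: str, threshold: int = 2) -> bool:
    a, b = normalize(word), normalize(lexicon_word)
    if a[:1] != b[:1]:                             # Rule 2: first letters must match
        return False
    return levenshtein(a, b) <= threshold

print(similar("kimooi", "kimoi"))   # True  (rule 1 removes the prolongation)
print(similar("uzee", "kimoi"))     # False (rule 2: different first letters)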

The results for Precision of similarity calculation before and after applying the heuristic rules are presented in Figure 6. The Precision was greatly improved after applying the rules. With the threshold set to 2, the Precision before applying the rules was 58.9% and was improved to 85.0%.


6.3 Evaluation of Key Sentence Extraction

The first calculation of T-scores for pairs of vulgar words (Table 9) was strongly biased towards pairs of identical words repeated within the same entries. To solve this problem we considered the same vulgar words appearing in one entry as one word, since such words may appear as collocations. In such cases the identical numerous collocations are counted only once. The change in the results is represented in Table 10. The results indicate the bias has disappeared. However, the T-score became too small, causing many word pairs to appear in the same place in the ranking and making the ranks difficult to set. This is caused by the fact that words with the same meaning appear on the Web sites in different transcriptions and therefore there is a large number of word sets with a small occurrence.

We solved this problem by calculating word similarity before calculating the T-score. The results are represented in Table 11. Due to the word similarity calculation a rise in T-score was observed and the number of diversified word pairs increased, which made rank setting much easier. There were 202 vulgar word pairs, of which 40 pairs occurred more than twice.

word A                  word B       co-occurrence   T-score
baka [h] (stupid)       baka [h]          861         29.34
shine [K] (fuck you)    shine [K]         552         23.49
shine [h]               shine [h]          58          7.62
tarashi [k] (pimp)      shine [K]          17          4.12
busu [k] (ugly bitch)   shine [K]          16          4.00
kimosu [h]              shine [h]          16          4.00
shine [h]               busu [h]            6          2.45

Table 9. The results of T-score calculation for sets of vulgar words. As mentioned in Section 5.3, words in Japanese can usually be transcribed in three different systems: hiragana, katakana and kanji. The differences in transcription are represented in the table as markers after the words, with [h] for hiragana, [k] for katakana and [K] for kanji.

word A                 word B       co-occurrence   T-score
shine [K]              shine [K]          7           2.65
shine [h]              shine [h]          4           2.00
pashiri [k] (loser)    shine [K]          3           1.73
debu [k]               kiero [K]          3           1.73
kiero [K] (get lost)   kiero [K]          2           1.41
shine [K]              kiero [K]          2           1.41
uzai [h]               kimoi [k]          2           1.41

Table 10. Change in the results of T-score calculation for sets of vulgar words, when two or more identical words were considered as one


word A             word B        co-occurrence   T-score
shine [K]          shine [K]          11           3.32
kimoi [h]          shine [K]          11           3.32
kimoi [h]          busaiku [h]         8           2.83
uzai [h]           kimoi [h]           7           2.65
panko [k] (slut)   panko [k]           6           2.45
kimoi [h]          kimoi [h]           6           2.45
busaiku [h]        busaiku [h]         6           2.45

Table 11. Change in the results of T-score calculation for sets of vulgar words, with calculated word similarity

6.4 Discussion

This time, in the morphological analysis of vulgar words we used only a small set of manually discovered words, which we added to the lexicon. It is difficult to include all existing vulgar words, and more such vocabulary will appear in the future as well. Therefore, as further work we need to develop a method for automatic extraction of vulgar vocabulary from the Internet.

The results of the SVM model used to distinguish between harmful and non-harmful information were 89% Precision and 80% Recall. However, on the unofficial school Web pages used as the data in this research there were numerous entries consisting of only one sentence, or even one word. Therefore the feature set for training was not sufficient and the overall result was not ideal (F-score = 84.3%).

As for word similarity calculation, many vulgar words are short and a change of even one letter might cause a change of meaning. Therefore, two different words would be matched as similar by Levenshtein distance when the threshold is too wide. This problem might be solved by automatically setting the threshold according to the word length.

As for the extraction of key sentences, although we were able to calculate a non-biased T-score for vulgar expressions and set the ranking, over 80% of vulgar words appeared only once. This caused over half of the cases to be attached with similar ranks. This problem could be solved by increasing the amount of training data or applying a different method of rank setting.

    7. Conclusions and Future Work

In this paper we presented research on cyber-bullying, a new social problem that emerged recently together with the development of social networking portals, etc. Cyber-bullying consists of sending messages containing slanderous expressions, harmful for other people, or verbally bullying other people in front of the rest of the online community. In Japanese society, on which we focused in particular, this problem is particularly vivid on unofficial school Web sites. To handle the problem, teachers and PTA members voluntarily perform Online Patrol to spot and delete the online entries harmful for other people. Unfortunately, there is already an enormous number of cyber-bullying cases and the number keeps growing, which makes the Online Patrol an uphill task when performed manually.

We started this research to create an artificial Online Patrol agent. Looking for clues to select linguistic features to be used in the machine learning algorithm, we performed comparative affect analysis of the cyber-bullying data and normal entries. As a result, we noticed that the harmful data were less emotively emphasized than the non-harmful. The thesis is reasonable, since the harmful entries are written with premeditation and aim not at expressing one's own emotions, but at evoking in other online community members negative emotions against the victim of the cyber-bullying. The results of comparing different dimensions of emotional emphasis suggested the thesis was true, although the reliability of the proof was not satisfactory and further analysis on more robust data is necessary. Another discovery, although an expected one, was that positive emotions appeared more often in non-harmful data and negative emotions appeared more often in harmful data. However, detailed analysis revealed that, especially for fondness, the expressions of positive emotions are often used in a strongly sarcastic meaning. Therefore there is a need to analyze the data taking into consideration also other dimensions than the


valence and the activation of an emotion. As one of such means we plan to apply Ptaszynski et al.'s [25, 26] system for contextual affect analysis, verifying whether an emotion expressed in an utterance is appropriate for its context. We assume this will help in developing a sufficient model of formalization of cyber-bullying activities. In the first step of the research, we were able to select the most distinctive linguistic feature for cyber-bullying, namely, vulgarities. We used this feature in the creation of the machine learning classifier for cyber-bullying detection.

We created the machine learning-based system for cyber-bullying detection and evaluated it. First, we created a lexicon of vulgar words distinctive for cyber-bullying entries. To recognize the vulgar words written in an informal or jargonized way, we calculated word similarity with Levenshtein distance. As the result, with the threshold equal to 2, the system was able to correctly determine the similar words with 85% Precision. The Support Vector Machine model trained for discrimination between harmful and non-harmful data achieved results of 89% Precision and 80% Recall (with balanced F-score = 84.3%). Finally, the ranking of the online entries according to their harmfulness was set using T-score. We were able to eliminate undesirable bias in the rank setting; unfortunately, the procedure is not yet ideal.

Since new vulgar words appear frequently, we need to find a way to automatically extract new vulgarities from the Internet to keep the lexicon up to date. There is also a need for more experiments with the system, including its different variations and improvements (e.g. Levenshtein distance threshold optimization).

The problems concerning online security have been escalating ever since the birth of the Internet. Some of them are widely known to society, such as spam e-mails or hacking; however, more and more such problems, including cyber-bullying, appear by the day and social consciousness about them is not yet sufficient. Solutions have been developed for some of these problems (e.g., automatic spam e-mail detection, firewall protection, etc.), while others, like cyber-bullying, only keep escalating. Recognising such problems and developing remedies is an urgent matter and deserves to come into the focus of Artificial Intelligence.


8. Acknowledgements

This research was supported by a (JSPS) KAKENHI Grant-in-Aid for JSPS Fellows (Project number: 22-00358) and a Research Project on Development of an Internet Word Book System for Dynamic Following of Information Transition (Project number: 0500833). The authors thank Mr. Motoki Matsumura from the Human Rights Research Institute Against All Forms of Discrimination and Racism-MIE for providing data from unofficial school Web sites.


9. References

[1] Robinson, L. (2005). Debating the events of September 11th: Discursive and interactional dynamics in three online fora. Journal of Computer-Mediated Communication, 10 (4), article 4.

[2] Leets, L. (2001). Responses to Internet hate sites: Is speech too free in cyberspace? Communication Law and Policy, 6 (2), p. 287-317.

[3] Chen, H., Qin, J., Reid, E., Chung, W., Zhou, Y., Xi, W., Lai, G., Elhourani, T., Bonillas, A., Wang, F., Sageman, M. (2004). The dark web portal: Collecting and analyzing the presence of domestic and international terrorist groups on the web. In: Proc. 7th IEEE Int. Conf. on Intelligent Transportation Systems, Washington, DC, p. 106-111.

[4] Abbasi, A., Chen, H. (2007). Affect Intensity Analysis of Dark Web Forums. In: IEEE Intelligence and Security Informatics, p. 282-288.

[5] Gerstenfeld, P. B., Grant, D. R., Chiang, C. P. (2003). Hate online: A content analysis of extremist Internet sites. Anal. of Soc. Issues and Pub. Policy, 3 (1), 29-44.

[6] Ministry of Education, Culture, Sports, Science and Technology (MEXT) (2008). 'Netto j-o no ijime' ni kansuru tai-o manyuaru jirei sh-u (gakk-o, ky-oin muke) ['Bullying on the Net' Manual for handling and collection of cases (for schools and teachers)] (in Japanese). MEXT.

[7] Belsey, B. Cyberbullying: An Emerging Threat to the "Always On" Generation. http://www.cyberbullying.ca/pdf/Cyberbullying_Presentation_Description.pdf

[8] Patchin, J. W., Hinduja, S. (2006). Bullies move beyond the schoolyard: A preliminary look at cyberbullying. Youth Violence and Juvenile Justice, 4 (2), p. 148-169.

[9] Hinduja, S., Patchin, J. W. (2009). Bullying beyond the schoolyard: Preventing and responding to cyberbullying. Corwin Press.

[10] Watanabe, H., Sunayama, W. (2006). Denshi keijiban ni okeru y-uza no seishitsu no hy-oka [User nature evaluation on BBS] (in Japanese). IEICE Technical Report, 105 (652), KBSE, p. 25-30.

[11] Levenshtein, V. I. (1965). Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Doklady Akademii Nauk SSSR, 163 (4), 845-848.

[12] Minoru, H., Hirohide, E. (1997). Nihongo OCR bun ni okeru eiji, katakana no superu ayamari teisei-hou [Spelling Correction Method for English and Katakana in Japanese OCR Text] (in Japanese). Transactions of Information Processing Society of Japan, 38 (7), 1317-1327.

[13] Ryozo, K., Koudai, H., Tatsuya, S. (2005). Production Rule wo mochiita shisutemu hyougen to koshou shindan e no ouyou [Modeling and Fault Diagnosis of Controlled Plant based on Production Rule] (in Japanese). The Robotics and Mechatronics Conference 2005, p. 16.

[14] Vapnik, V. (1998). Statistical Learning Theory. Springer.

[15] Hirotoshi, T., Takafumi, M., Masahiko, H. (1998). Support Vector Machine ni yoru tekisuto bunrui [Text Categorization Using Support Vector Machines] (in Japanese). IPSJ SIG Notes, 98 (99), p. 173-180.

[16] Ptaszynski, M., Dybala, P., Rzepka, R., Araki, K. (2009). Affecting Corpora: Experiments with Automatic Affect Annotation System - A Case Study of the 2channel Forum. In: Proc. of PACLING-09, p. 223-228.

[17] Kudo, T. (2001). MeCab: Yet Another Part-of-Speech and Morphological Analyzer. http://mecab.sourceforge.net

[18] Ptaszynski, M., Maciejewski, J., Dybala, P., Rzepka, R., Araki, K. (2010). CAO: A Fully Automatic Emoticon Analysis System. In: Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10), p. 1026-1032.

[19] Birdwhistell, R. L. (1952). Introduction to kinesics: An annotation system for analysis of body motion and gesture. Univ. of Kentucky Press.

[20] Birdwhistell, R. L. (1970). Kinesics and Context. University of Pennsylvania Press, Philadelphia.

[21] Ptaszynski, M. (2006). Boisterous language. Analysis of structures and semiotic functions of emotive expressions in conversation on Japanese Internet bulletin board forum '2channel' (in Japanese). M.A. Dissertation, UAM, Poznan.

[22] Nakamura, A. (1993). Kanjo hyogen jiten [Dictionary of Emotive Expressions] (in Japanese).