CoLiOS - Corpus Linguistic Open Source

Alexandru-Lucian Gînscă (1), Adrian Iftene (1), Marius Corîci (2). ConsILR Conference, 8-9 December, Bucharest, Romania, National Museum of Romanian Literature (MNLR). (1) “Al. I. Cuza” University of Iasi, Romania, Faculty of Computer Science. (2) Intelligentics, Cluj-Napoca, Romania.

Upload: marius-corici

Post on 04-Dec-2014

Page 1: CoLiOS - Corpus Linguistic Open Source

Alexandru-Lucian Gînscă (1), Adrian Iftene (1), Marius Corîci (2)

ConsILR Conference, 8-9 December, Bucharest, Romania
National Museum of Romanian Literature (MNLR)

(1) “Al. I. Cuza” University of Iasi, Romania, Faculty of Computer Science

(2) Intelligentics, Cluj-Napoca, Romania

Page 2: CoLiOS - Corpus Linguistic Open Source

◦ Motivation
◦ Existing Sentiment Corpora
◦ Files Sources
◦ Annotations
◦ Annotation Process
◦ Corpus Statistics
◦ Evaluation Metrics Proposal
◦ Conclusions

ConsILR Conference, 8-9 December, MNLR, Bucharest

Page 3: CoLiOS - Corpus Linguistic Open Source

Sentiment Analysis, or Opinion Mining, has been a hot topic for some time in the Web 2.0 era.

Building robust Sentiment Analysis systems requires resources for training and evaluating them.

No such Sentiment Corpus is currently available for Romanian.

We intend to make our corpus publicly available, free of charge for individual researchers and research centers.


Page 4: CoLiOS - Corpus Linguistic Open Source


Existing Sentiment Corpora: MPQA opinion corpus, Large Movie Review Dataset, SentiWordNet, The JDPA Sentiment Corpus, UMass Amherst Linguistics Sentiment Corpora

Languages: English, German, Italian, Chinese, Japanese

Page 5: CoLiOS - Corpus Linguistic Open Source


Romanian online publications:
◦ Online newspapers (MediaFax, Romania Libera, etc.)
◦ Blogs (Chinezu.eu, Zoso.ro, etc.)
◦ News portals (Realitatea.net, StirileProTv.ro, etc.)

Category: Telecommunications

Companies: Orange, Vodafone, Cosmote and so on.

Page 6: CoLiOS - Corpus Linguistic Open Source


<paragraph id=""></paragraph>

<sentimentGroup value="" id_group=""></sentimentGroup>

-4 <= value <= 4

<entity type="" sentiment="" id_entity="" id_group=""></entity>

-4 <= value <= 4
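As a minimal sketch of how annotations in this format could be read and range-checked, the Python below parses a small invented sample. Only the tag and attribute names come from the slide. The `<document>` wrapper, the nesting of tags inside `<paragraph>`, the sample text, all attribute values, and the helper names `check_range` and `read_scores` are assumptions made for illustration. The sketch also assumes that the second range on the slide applies to the `sentiment` attribute of `<entity>`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the wrapper element, text and attribute values are invented.
SAMPLE = """
<document>
  <paragraph id="1">
    <entity type="Company" sentiment="3" id_entity="e1" id_group="g1">Orange</entity>
    <sentimentGroup value="3" id_group="g1">offers very good coverage</sentimentGroup>
  </paragraph>
</document>
"""


def check_range(raw, low=-4, high=4):
    """Convert an attribute string to int and enforce low <= score <= high."""
    score = int(raw)
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    return score


def read_scores(xml_text):
    """Return ({id_group: value}, {id_entity: sentiment}) for one annotated text."""
    root = ET.fromstring(xml_text)
    groups = {
        g.get("id_group"): check_range(g.get("value"))
        for g in root.iter("sentimentGroup")
    }
    entities = {
        e.get("id_entity"): check_range(e.get("sentiment"))
        for e in root.iter("entity")
    }
    return groups, entities


groups, entities = read_scores(SAMPLE)
print(groups, entities)
```

Rejecting out-of-range scores at load time keeps later statistics from silently absorbing annotation typos.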

Page 7: CoLiOS - Corpus Linguistic Open Source


Page 8: CoLiOS - Corpus Linguistic Open Source


Linking sentiment groups to entities
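One plausible reading of the `id_group` attribute, which appears on both the `<sentimentGroup>` and `<entity>` tags, is that it acts as the key joining an entity to its sentiment group. The sketch below follows that reading. It is an assumption, not a documented rule of the corpus, and the sample text, the attribute values, the convention that an unlinked entity carries an empty `id_group`, and the function name `link_entities` are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: one linked entity (e1) and one unlinked entity (e2).
SAMPLE = """<paragraph id="1">
<entity type="Company" sentiment="2" id_entity="e1" id_group="g1">Vodafone</entity>
<sentimentGroup value="2" id_group="g1">launched an attractive offer</sentimentGroup>
<entity type="Company" sentiment="0" id_entity="e2" id_group="">Cosmote</entity>
</paragraph>"""


def link_entities(xml_text):
    """Map each sentiment group id to the ids of the entities that point to it."""
    root = ET.fromstring(xml_text)
    known_groups = {g.get("id_group") for g in root.iter("sentimentGroup")}
    links = defaultdict(list)
    for entity in root.iter("entity"):
        group_id = entity.get("id_group")
        # Skip entities with no link or with a link to an unknown group.
        if group_id and group_id in known_groups:
            links[group_id].append(entity.get("id_entity"))
    return dict(links)


print(link_entities(SAMPLE))
```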

Page 9: CoLiOS - Corpus Linguistic Open Source

We consider the following major categories: City, Organization, Company, Country and Person. Additionally, we consider categories such as Brand, Product and Publication.

For almost all major categories we consider subcategories:
◦ For Cities we consider Romanian, European, American and Other Cities
◦ For Organizations we consider Parties, Faculties, Universities, Ministries, etc.
◦ For People we consider Sportsmen, Politicians, Males, Females, etc.


Page 10: CoLiOS - Corpus Linguistic Open Source

11 annotators (1st year master students in computational linguistics at FII, UAIC)

As the annotation tool we chose Serna (http://www.syntext.com/products/serna/): open source, flexible, easy to use, intuitive

Method 1: process the chosen files with our tools and automatically add annotations for named entities and for sentiments

Method 2: process only at paragraph level


Page 11: CoLiOS - Corpus Linguistic Open Source


Page 12: CoLiOS - Corpus Linguistic Open Source


Page 13: CoLiOS - Corpus Linguistic Open Source


Page 14: CoLiOS - Corpus Linguistic Open Source


Page 15: CoLiOS - Corpus Linguistic Open Source

◦ 11 annotators
◦ 1 week span
◦ 110 files
◦ 1988 paragraphs
◦ 2044 sentiment groups
◦ 4301 entities
◦ 1101 links between entities and sentiment groups
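To put these counts in proportion, the short sketch below derives a few averages from them. The counts are the ones reported on the slide. The choice of ratios, the dictionary keys and the helper name `ratio` are illustrative assumptions, and the files-per-annotator figure is only an average, since the slide does not say how files were distributed.

```python
# Counts as reported on the slide.
STATS = {
    "annotators": 11,
    "files": 110,
    "paragraphs": 1988,
    "sentiment_groups": 2044,
    "entities": 4301,
    "links": 1101,
}


def ratio(numerator, denominator):
    """Average of one count per unit of another, rounded to two decimals."""
    return round(STATS[numerator] / STATS[denominator], 2)


averages = {
    "files_per_annotator": ratio("files", "annotators"),
    "paragraphs_per_file": ratio("paragraphs", "files"),
    "sentiment_groups_per_paragraph": ratio("sentiment_groups", "paragraphs"),
    "entities_per_paragraph": ratio("entities", "paragraphs"),
    "links_per_entity": ratio("links", "entities"),
}

for name, value in averages.items():
    print(f"{name}: {value}")
```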


Page 16: CoLiOS - Corpus Linguistic Open Source


Page 17: CoLiOS - Corpus Linguistic Open Source


Page 18: CoLiOS - Corpus Linguistic Open Source

Sentiment group precision

Precision for named entities and sentiment group links
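The formulas for these two metrics did not survive in the transcript. As a hedged illustration only, the sketch below uses the standard definition of precision (the share of system outputs that also appear in the gold file), which may differ from the authors' exact formulas. Representing a sentiment group as a `(paragraph id, start, end)` span and a link as an `(id_entity, id_group)` pair is a hypothetical choice, as is the function name `precision`.

```python
def precision(system_items, gold_items):
    """Standard precision: |system AND gold| / |system| (0.0 if system is empty)."""
    system_items, gold_items = set(system_items), set(gold_items)
    if not system_items:
        return 0.0
    return len(system_items & gold_items) / len(system_items)


# Hypothetical sentiment groups, identified by (paragraph id, start, end).
system_groups = {("p1", 0, 5), ("p1", 10, 14), ("p2", 3, 9)}
gold_groups = {("p1", 0, 5), ("p2", 3, 9), ("p2", 12, 20)}

# Hypothetical entity-to-group links, identified by (id_entity, id_group).
system_links = {("e1", "g1"), ("e2", "g1")}
gold_links = {("e1", "g1")}

print(precision(system_groups, gold_groups))
print(precision(system_links, gold_links))
```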


Page 19: CoLiOS - Corpus Linguistic Open Source


Relaxed precision for sentiment group value

CG = the set of correctly identified sentiment groups
VF(SSG) = the value of the sentiment group as given by the system
VG(SSG) = the value of the sentiment group from the gold file
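The formula itself is missing from the transcript, so the following is only one plausible interpretation built from the symbols CG, VF and VG defined above: a correctly identified group counts as a hit when the system value is within a tolerance of the gold value, and the score is the share of hits in CG. The `tolerance` parameter, its default of 1, the function name and the sample values are assumptions, not the authors' definition.

```python
def relaxed_precision(cg, vf, vg, tolerance=1):
    """Share of groups in CG whose system value is within `tolerance` of gold."""
    if not cg:
        return 0.0
    hits = sum(1 for ssg in cg if abs(vf[ssg] - vg[ssg]) <= tolerance)
    return hits / len(cg)


# Hypothetical values on the -4..4 scale for four correctly identified groups.
cg = ["g1", "g2", "g3", "g4"]
vf = {"g1": 3, "g2": -2, "g3": 0, "g4": 4}  # system values
vg = {"g1": 4, "g2": -2, "g3": 2, "g4": 1}  # gold values

print(relaxed_precision(cg, vf, vg))
print(relaxed_precision(cg, vf, vg, tolerance=0))
```

With a tolerance of 0 the measure reduces to exact agreement on the value, which is why the relaxed variant is the more forgiving of the two.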

Page 20: CoLiOS - Corpus Linguistic Open Source


Average deviation for sentiment group value

CG = the set of correctly identified sentiment groups
VF(SSG) = the value of the sentiment group as given by the system
VG(SSG) = the value of the sentiment group from the gold file
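Since the formula is not preserved in the transcript, the sketch below assumes the most common reading of an average deviation over the symbols defined above: the mean of |VF(SSG) - VG(SSG)| across all groups in CG. This is an assumption rather than the authors' stated formula, and the function name and sample values are hypothetical.

```python
def average_deviation(cg, vf, vg):
    """Mean absolute difference between system and gold values over CG."""
    if not cg:
        return 0.0
    return sum(abs(vf[ssg] - vg[ssg]) for ssg in cg) / len(cg)


# Hypothetical values on the -4..4 scale for four correctly identified groups.
cg = ["g1", "g2", "g3", "g4"]
vf = {"g1": 3, "g2": -2, "g3": 0, "g4": 4}  # system values
vg = {"g1": 4, "g2": -2, "g3": 2, "g4": 1}  # gold values

print(average_deviation(cg, vf, vg))
```

Under this reading a lower score is better, and 0 means the system matched the gold value on every correctly identified group.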

Page 21: CoLiOS - Corpus Linguistic Open Source

The importance of a Corpus for Sentiment Analysis for Romanian.

The annotation format and methodology.

Comparison between our proposal and existing Sentiment Corpora.


Page 22: CoLiOS - Corpus Linguistic Open Source
