discourse annotation for arabic 2


Upload: arabicnlpimamu2013

Post on 11-Jun-2015


Page 1: Discourse annotation for arabic 2

Survey on Discourse Annotation for Arabic

A. Algarni, H. Alharbi and N. Almutairy
Supervisor: Dr. A. Alsaif

April 23, 2013

Kingdom of Saudi Arabia
Ministry of Higher Education

Imam Mohammed Ibn Saud Islamic University
College of Computer and Information Sciences

CS465 - Natural Language Processing

المملكة العربية السعودية
وزارة التعليم العالي

جامعة الإمام محمد بن سعود الإسلامية
كلية علوم الحاسب ونظم المعلومات

عال 465 – معالجة اللغات الطبيعية

Page 2: Discourse annotation for arabic 2

Outline

Introduction
The Leeds Arabic Discourse Treebank
Discourse Connective Recognition
Discourse Relation Recognition
Semantic-Based Segmentation
Discourse Segmentation Based on Rhetorical Methods
A Comprehensive Taxonomy of Arabic Discourse Coherence Relations

Page 3: Discourse annotation for arabic 2

Introduction

Linguistic annotation covers any descriptive or analytic notations applied to raw language data.

Annotated discourse corpora can be very useful for theoretical studies and contribute to the development of NLP applications.

Page 4: Discourse annotation for arabic 2

Applications

Information extraction
Question answering
Summarization
Machine translation, generation

Page 5: Discourse annotation for arabic 2

Discourse Relations and Discourse Connectives

Discourse Relation: the way in which two arguments (text segments) are logically connected. Examples: Temporal, Comparison, Causal, Expansion, etc.

Discourse Connective (DC): a lexical marker used to link two abstract objects in a text.

Abstract Object (AO): abstract objects in discourse are things like propositions, events, facts and opinions.

Argument (Arg): a text span expressing an abstract object and linked by a DC.
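The definitions above can be pictured as a small record type. The following is only a minimal sketch: the class names, the character-offset representation and the relation label are assumptions made for illustration, not the storage format of any particular treebank or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


@dataclass(frozen=True)
class ConnectiveAnnotation:
    connective: Span      # the discourse connective (DC)
    arg1: Span            # first argument (Arg1)
    arg2: Span            # second argument (Arg2)
    relations: frozenset  # one or more relation labels


def find_span(text, phrase):
    """Return the span of the first occurrence of phrase in text."""
    start = text.index(phrase)  # raises ValueError if phrase is absent
    return Span(start, start + len(phrase))


def span_text(text, span):
    """Return the substring covered by span."""
    return text[span.start:span.end]


# Toy English sentence; the relation label is illustrative only.
text = "Children might be tired and feel sleepy"
ann = ConnectiveAnnotation(
    connective=find_span(text, "and"),
    arg1=find_span(text, "Children might be tired"),
    arg2=find_span(text, "feel sleepy"),
    relations=frozenset({"EXPANSION.Conjunction"}),
)
print(span_text(text, ann.connective))
```

Storing offsets rather than copied strings keeps each argument tied to its position in the text, which matters when the same word occurs more than once.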

Page 6: Discourse annotation for arabic 2

The Leeds Arabic Discourse Treebank

• First effort towards producing an Arabic Discourse Treebank was introduced in 2011 by A. Alsaif and K. Markert.
• Collected a large set of Arabic discourse connectives using text analysis and corpus-based techniques.
• Final list contains 107 discourse connectives.

Page 7: Discourse annotation for arabic 2

Types of Discourse connectives

Page 8: Discourse annotation for arabic 2

Types of Relations

Page 9: Discourse annotation for arabic 2

Types of Relations Cont..

COMPARISON.Similarity:

Page 10: Discourse annotation for arabic 2

Arabic Discourse Annotation Tool (ADA) and Annotation Process

Page 11: Discourse annotation for arabic 2

Annotation Methodology

1. Measuring whether annotators agree on the binary decision of whether an item constitutes a discourse connective in context.

2. Measuring whether annotators agree on which discourse relation an identified connective expresses. Annotators may assign a set of relations to a single connective.

Page 12: Discourse annotation for arabic 2

Results

Agreement in task 1 is highly reliable (N=23331): percentage agreement of 0.95 and kappa of 0.88.

Agreement in task 2 (relation assignment) is relatively low (N=5586): percentage agreement of 0.66, kappa of 0.57, and alpha of 0.58.
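To make the difference between percentage agreement and kappa concrete, here is a minimal sketch of Cohen's kappa for two annotators. The slides do not say which kappa variant was used, so treat the formula choice as an assumption. The labels below are invented toy data, not the corpus annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap given each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both annotators use one identical label")
    return (observed - expected) / (1 - expected)


# Invented toy data: is the item a discourse connective ("DC") or not?
annotator_1 = ["DC", "DC", "not", "not"]
annotator_2 = ["DC", "not", "not", "not"]
print(cohen_kappa(annotator_1, annotator_2))
```

On this toy data the annotators agree on 3 of 4 items (0.75), but half of that agreement is expected by chance, so kappa is lower than the raw percentage. The reported figures show the same pattern.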

Page 13: Discourse annotation for arabic 2

Discourse Connective Recognition

To distinguish between discourse and non-discourse usage of a connective.

Example: once, while.

A. Alsaif and K. Markert (2011) introduced a connective identifier for Arabic based on syntactic features.

Page 14: Discourse annotation for arabic 2

Discourse Connective Recognition by A. Alsaif and K. Markert (2011)

Features:
Surface Features (SConn)
Lexical features of surrounding words (Lex)

Example:

[الاطفال ممكن ان يصابوا بالإرهاق]Arg1 [و]DC [ان يشعروا بالنعاس]Arg2 اذا لم يناموا جيدا.

[Children might be tired]Arg1 [and]DC [feel sleepy]Arg2 during school time if they did not sleep well.

Page 15: Discourse annotation for arabic 2

Discourse Connective Recognition by A. Alsaif and K. Markert (2011), cont.

Features:
Part of Speech features (POS)
Syntactic category of related phrases (Syn) (e.g.: المدرسة كبيرة وجميلة / the school is very large and beautiful)
Al-Masdar feature
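As a rough illustration of what surface and lexical features for a connective candidate might look like, the sketch below builds a feature dictionary from a token list. The feature names, the position categories and the window size are all hypothetical. The slides do not spell out the exact feature definitions, and the POS, Syn and Al-Masdar features are left out because they need a tagger or parser.

```python
def connective_features(tokens, index, window=1):
    """Build a toy feature dict for the connective candidate at tokens[index]."""
    if not 0 <= index < len(tokens):
        raise IndexError("index outside the token list")
    # Hypothetical surface feature: where the candidate sits in the sentence.
    if index == 0:
        position = "initial"
    elif index == len(tokens) - 1:
        position = "final"
    else:
        position = "medial"
    features = {"conn": tokens[index], "sconn_position": position}
    # Hypothetical lexical features: neighbouring words, padded at the edges.
    for offset in range(1, window + 1):
        left, right = index - offset, index + offset
        features[f"lex_prev_{offset}"] = tokens[left] if left >= 0 else "<s>"
        features[f"lex_next_{offset}"] = tokens[right] if right < len(tokens) else "</s>"
    return features


tokens = "Children might be tired and feel sleepy".split()
feats = connective_features(tokens, tokens.index("and"))
print(feats)
```

A dictionary of named features like this can be handed to most off-the-shelf classifiers after one-hot encoding.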

Page 16: Discourse annotation for arabic 2

Results

Discourse Connective Recognition by A. Alsaif and K. Markert (2011), cont.

Model     Features                          Accuracy  Kappa
Baseline  (not Conn)                        68.9      0
M1        Conn only                         75.7      0.48
Tokenization by white space + auto tagger:
M2        Conn+SConn+Lex                    85.6      0.62
M3        Conn+SConn+Lex+POS                87.6      0.69
M4        Conn+SConn+Lex+POS+Masdar         88.5      0.70
ATB-based features:
M5        Conn+SConn+Lex                    86.2      0.65
M6        Conn+SConn+Lex+Syn/POS            91.2      0.79
M7        Conn+SConn+Lex+Syn/POS+Masdar     92.4      0.82
M8        Conn+SConn+Syn                    91.2      0.79
M9        SConn+Lex+Syn+Masdar              91.2      0.79

Page 17: Discourse annotation for arabic 2

Discourse Relation Recognition

To identify the type of the relation.

A. Alsaif and K. Markert (2011) introduced the first algorithms to automatically identify relations for Arabic.

Page 18: Discourse annotation for arabic 2

Discourse Relation Recognition by A. Alsaif and K. Markert (2011)

Features:
Connective features
Words and POS of arguments
Masdar
Tense and Negation
Length, Distance and Order Features
Argument Parent
Production Rules

Page 19: Discourse annotation for arabic 2

Results

Model     Features                                       Accuracy  Kappa
All connectives (6039):
Baseline  (CONJUNCTION)                                  52.5      0
M1        Conn only (1)                                  77.2      0.60
M2        Conn+Conn f+Arg f (37)                         78.7      0.66
M3        Conn+Conn f+Arg f+Production rules (1237)      78.3      0.65
Excluding wa at BOP (3813):
Baseline  (CONJUNCTION)                                  35        0
M1        Conn only (1)                                  74.3      0.65
M2        Conn+Conn f+Arg f (37)                         77.0      0.69
M3        Conn+Conn f+Arg f+Production rules (1237)      76.7      0.69

Page 20: Discourse annotation for arabic 2

Results

Model     Features                     Accuracy  Kappa
All connectives (6039):
Baseline  (EXPANSION)                  62.4      0
M1        Conn only (1)                88.7      0.78
M2        Conn+Conn f+Arg f (37)       88.7      0.78
Excluding wa at BOP (3813):
Baseline  (EXPANSION)                  41.8      0
M1        Conn only (1)                82.7      0.74
M2        Conn+Conn f+Arg f (37)       83.5      0.75
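The baseline rows always predict the single most frequent label, which explains why they can reach a sizeable accuracy while kappa stays at 0. The following sketch shows this on invented toy labels (not the corpus data): for a constant prediction, observed agreement and chance agreement coincide, so the numerator of kappa vanishes.

```python
from collections import Counter


def baseline_agreement(gold):
    """Return (majority label, observed agreement, chance agreement)
    for a baseline that always predicts the most frequent gold label."""
    n = len(gold)
    gold_freq = Counter(gold)
    label, count = gold_freq.most_common(1)[0]
    predicted = [label] * n
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    pred_freq = Counter(predicted)
    expected = sum(gold_freq[k] * pred_freq[k] for k in gold_freq) / (n * n)
    return label, observed, expected


# Invented toy label distribution.
gold = ["EXPANSION"] * 5 + ["TEMPORAL"] * 2 + ["COMPARISON"]
label, observed, expected = baseline_agreement(gold)
kappa = (observed - expected) / (1 - expected)
print(label, observed, expected, kappa)
```

Here the baseline accuracy is 5/8, chance agreement is also 5/8, and kappa is therefore 0. The same reasoning applies to the CONJUNCTION and EXPANSION baselines in the tables.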

Page 21: Discourse annotation for arabic 2

Semantic-Based Segmentation of Arabic Texts

Corpus Analysis

Definition: Let L be a list of candidate segment connectors. Each element c in L is classified, based on its effect on text segmentation, as either active or passive.

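Examples:

1. [تعتزم إدارة الجامعة إنشاء قسم جديد في الكلية] [هنالك بعض التقارير التي تؤكد إنشاء هذا القسم]
([The university administration intends to establish a new department in the college] [there are some reports confirming the establishment of this department])

2. [تعتزم إدارة الجامعة إنشاء قسم جديد في الكلية] و [هنالك بعض التقارير التي تؤكد إنشاء هذا القسم] و [لكن لم يحدد موعدا لذلك]
([The university administration intends to establish a new department in the college] and [there are some reports confirming the establishment of this department] and [but no date has been set for that])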

Page 22: Discourse annotation for arabic 2

Segmentation Process

1. Identifying the connectors that indicate complete segments.
2. Locating the active connectors.
3. Resolving the case where adjacent active connectors exist.
4. Setting the segment boundaries.
5. Creating the final list of segments.
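The steps above can be sketched in a few lines. This is only an illustration under strong assumptions: the set of active connectors is hypothetical and fixed in advance, tokens are transliterated placeholders, a run of adjacent active connectors is resolved by keeping the first one as the boundary, and step 1 (connectors that indicate complete segments) is omitted. The original system may resolve these cases differently.

```python
def segment(tokens, active):
    """Split a token list into segments that start at active connectors."""
    # Step 2: locate the active connectors.
    hits = {i for i, tok in enumerate(tokens) if tok in active}
    # Step 3: in a run of adjacent active connectors, keep only the first.
    boundaries = sorted(i for i in hits if i - 1 not in hits)
    # Step 4: a connector at position 0 does not split anything.
    boundaries = [i for i in boundaries if i > 0]
    # Step 5: cut the token list at the boundaries.
    segments, start = [], 0
    for b in boundaries:
        segments.append(tokens[start:b])
        start = b
    segments.append(tokens[start:])
    return segments


# Placeholder tokens; "wa" and "lakin" stand in for active connectors.
tokens = "s1a s1b wa s2a s2b wa lakin s3a".split()
segments = segment(tokens, active={"wa", "lakin"})
print(segments)
```

Each segment after the first begins with the connector that opened it, so the connector stays available for later relation labelling.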

Page 23: Discourse annotation for arabic 2

Discussion

To evaluate the segmentation process, the authors collected ten essays.

Each essay ranges between 500 and 700 words.

After running the segmentation process, they gave the output to judges, who evaluated it in terms of two factors: correct hits and incorrect hits.

Page 24: Discourse annotation for arabic 2

Discussion (cont.)

Essay  Correct hits  Incorrect hits
1      33            0
2      15            1
3      25            0
4      23            1
5      20            0
6      29            1
7      26            1
8      33            2
9      26            0
10     22            0
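The per-essay counts can be summed to get an overall picture. The totals and the share computed below are derived here from the table and are not figures reported on the slides.

```python
# (essay, correct hits, incorrect hits), copied from the table above.
results = [
    (1, 33, 0), (2, 15, 1), (3, 25, 0), (4, 23, 1), (5, 20, 0),
    (6, 29, 1), (7, 26, 1), (8, 33, 2), (9, 26, 0), (10, 22, 0),
]

total_correct = sum(correct for _, correct, _ in results)
total_incorrect = sum(incorrect for _, _, incorrect in results)
# Share of judged hits that were correct, over all ten essays.
correct_share = total_correct / (total_correct + total_incorrect)

print(total_correct, total_incorrect, round(correct_share, 4))
```

Note that this share only covers boundaries the system proposed. The table gives no count of missed boundaries, so recall cannot be derived from it.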

Page 25: Discourse annotation for arabic 2

Arabic Discourse Segmentation Based on Rhetorical Methods

This method depends on the meaning of the connector "و" in Arabic.

There are six types of "و", classified into two classes, "Fasl" and "Wasl":

"Fasl": a segmenting position.
"Wasl": not a segmenting position; the connector links the text.

Page 26: Discourse annotation for arabic 2

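Types of Connector "و"

Class Fasl:
و القسم (oath): الأساتذة انهم يعلمون التلاميذ العلم والله ليقدمون عملا عظيما. (The teachers, they teach the pupils knowledge; by God, they do great work.)
و ربّ ("rubba", perhaps): الشباب ليسوا وحدهم الذين يعانون ورب سائل يقول: لماذا ركزتم على الشباب من بين طبقات المجتمع؟ (Young people are not the only ones who suffer, and perhaps someone will ask: why did you focus on young people among all groups in society?)
و الاستئناف (resumption): يعاني المراهقون من بعض المشكلات النفسية. والمجتمع عامة به سلبيات أخرى كثيرة (Teenagers suffer from some psychological problems. And society in general has many other negatives.)

Class Wasl:
و الحال (circumstantial): دخل المدرس الفصل وهو يبتسم. (The teacher entered the classroom smiling.)
و المعية (accompaniment): جلس الحبيبان وضوء القمر. (The two lovers sat in the moonlight.)
و العطف (conjunction): سافر محمد وخالد (Mohammed and Khalid travelled.)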

Page 27: Discourse annotation for arabic 2

The Arabic sentence Segmentation System

Page 28: Discourse annotation for arabic 2

Feature Extraction

• The following are the features of "و المعية": X3 = noun and X7 = accusative mark.

Page 29: Discourse annotation for arabic 2

Experiment and Results

They used 1200 instances for training and 293 instances for testing.

On the test set, 290 instances were classified correctly and 3 incorrectly.

Results:
Recall: 94.68%
Precision: 96.82%
Accuracy: 98.98%
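The reported accuracy follows directly from the test counts, as this short check shows. Recall and precision cannot be recomputed the same way, because the slides do not give a per-class breakdown of the 293 test instances.

```python
correct, incorrect = 290, 3  # test-set counts from the slide

# Accuracy is the share of all test instances classified correctly.
accuracy = 100 * correct / (correct + incorrect)
print(round(accuracy, 2))
```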

Page 30: Discourse annotation for arabic 2

A Comprehensive Taxonomy of Arabic Discourse Coherence Relations

Coherence relations are classified into two types: explicit relations and implicit relations.

Coherence relations   Example
Explicit relations    I am very happy because I got excellent marks in exams.
Implicit relations    I am very happy. I got excellent marks in exams.
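A very rough way to picture the explicit/implicit split is to check whether a known connective appears between the two statements. The sketch below does only that. The connective list is a hypothetical stand-in, and a plain lookup cannot tell discourse usage from non-discourse usage of a word, so it is an illustration rather than a working classifier.

```python
import re


def relation_type(text, connectives):
    """Label a text 'explicit' if any listed connective occurs, else 'implicit'."""
    tokens = re.findall(r"\w+", text.lower())
    return "explicit" if any(tok in connectives for tok in tokens) else "implicit"


# Hypothetical connective list, for illustration only.
connectives = {"because", "but", "although"}

print(relation_type("I am very happy because I got excellent marks in exams.", connectives))
print(relation_type("I am very happy. I got excellent marks in exams.", connectives))
```

In the second sentence pair the causal link is still understood by the reader, which is exactly what makes implicit relations harder to annotate and to recognise automatically.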

Page 31: Discourse annotation for arabic 2

The procedure of creating an Arabic Taxonomy of Coherence Relations

Page 32: Discourse annotation for arabic 2

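Examples of Implicit Arabic relations

"Impossible condition / الشرط المستحيل":
(ولا يدخلون الجنة حتى يلج الجمل في سم الخياط)
(Nor will they enter Paradise until the camel passes through the eye of the needle.)

"Cascaded questioning / الاستفهام المكرر":
(أفرأيتم ما تحرثون؟ أأنتم تزرعونه أم نحن الزارعون؟)
(Have you seen what you sow? Is it you who make it grow, or are We the grower?)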

Page 33: Discourse annotation for arabic 2

Results

They obtained a set of 47 Arabic coherence relations.

Coherence relations                               Result
From English coherence relations                  31
Additional Arabic explicit coherence relations    12
Arabic implicit relations                         4

Page 34: Discourse annotation for arabic 2

Conclusion

Discourse annotation is a very fertile field with many NLP applications. For Arabic, challenges remain due to the lack of annotated corpora and studies.

Page 35: Discourse annotation for arabic 2

Thank You