

Dependency Parsing and Empty Category Detection in Hindi Language

Thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science (by Research) in

Computer Science & Engineering

by

KUKKADAPU PUNEETH
200802020

[email protected]

International Institute of Information Technology
Hyderabad - 500 032, INDIA

March 2016


Copyright © PUNEETH KUKKADAPU, 2016

All Rights Reserved


International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “Dependency Parsing and Empty Category Detection in Hindi Language” by PUNEETH KUKKADAPU, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date Adviser: Prof. Dipti Misra Sharma


To my Family and Friends


Acknowledgments

I would like to sincerely thank Prof. Dipti Misra Sharma. This thesis would not have been possible without her guidance. I must thank her for her support, time and encouragement. Her deep linguistic insights helped me a lot in my thesis.

I should also thank Dr. Prashanth Mannem, who gave direction to my thesis work. He is a very talented person and a friendly guide who always aims high in his work and in his students. This thesis would not have been possible without his guidance.

I have also enjoyed working with Prof. Rajeev Sangal on my Dual Degree projects. I was thrilled with his approaches and ideas when it comes to solving tasks in Natural Language Processing. He is not only a great professor and advisor but also a very empathetic person.

I would also like to thank Dr. Samar Husain and Dr. Sriram, who taught us the NLP course and gave me the motivation to pursue the field further. I would also like to thank Prof. Radhika Mamidi for her support and encouragement, and my seniors Bharat Ram Ambati, Aswarth Abilash Dara, Phani Gadde, Prudhvi Kosaraju and Prashanth Kolachina for their support.

I had a great time working with my friends Arjun Reddy, Deepak Malladi, Jayendra Rakesh, Vinay Bhargav, Nitesh and Sarvesh. We all started in LTRC in the same semester, and it was a great pleasure working with them.

Deepak and Arjun helped me with my academics and research work throughout my time in IIIT-Hyderabad and I am very thankful to them. I would also like to thank my friends Sri Kalyan, Nagarjuna, Rahul, Deepak and Jayendra Rakesh, who made my life in IIIT-Hyderabad very memorable.

Finally, I thank my mother, grandfather and sister for their constant support in my life.


Abstract

Parsing Indian languages has always been a challenging task. In recent years various approaches have been explored for improving parsing accuracy for Hindi and other Indian languages. In this work, we present our experiments to improve dependency parsing accuracy for the Hindi language as part of the COLING-MTPIL 2012 shared task. We explored three data-driven parsers over a large feature pool consisting of morphological, chunk and syntactic features. We tried different parser configurations by varying parsing strategies, classifiers and feature templates. We explored the adoption of the Turbo parser for parsing Indian languages, and in addition also explored other data-driven parsers, Malt and MST. We also experimented with getting the best out of these parsers using two combination approaches. We selected the best configuration for each set of data and were able to produce the best average accuracy in the shared task. We achieved a best result of 96.50% unlabeled attachment score (UAS), 92.90% labeled accuracy (LA) and 91.49% labeled attachment score (LAS) using the voting method on data with gold POS tags. On data with automatic POS tags, we achieved a best result of 93.99% UAS, 90.04% LA and 87.84% LAS.

The second part of this thesis focuses on using statistical dependency parsing techniques to detect NULLs, or Empty Categories, in sentences. In these experiments we worked with the Hindi dependency treebank. Some rule-based approaches had been tried before to detect empty heads for Hindi, but statistical learning for automatic prediction had not been demonstrated. In our approach we introduce complex labels into the data to predict Empty Categories in sentences. We mapped the problem of Empty Category prediction to data-driven parsing and explored various parsers to find the best one for this approach. The motivation comes from the use of data-driven parsing to solve other tasks in Natural Language Processing. The system was able to predict Empty Categories with a decent F-score of 76.26. We also discuss the shortcomings and difficulties of this approach.


Contents

1 Introduction
   1.1 Parsing
   1.2 Contribution
   1.3 Outline

2 Background and Related Work
   2.1 About Indian Languages
      2.1.1 Free Word Order and Vibhakti
      2.1.2 Non-Projectivity
      2.1.3 Morphologically Rich
   2.2 Dependency Parsing
   2.3 Treebanks
      2.3.1 About Hindi Treebank
      2.3.2 SSF Format
      2.3.3 CoNLL Format
   2.4 Paninian Framework
   2.5 About Malt Parser
   2.6 About MST Parser
   2.7 Constraint Parsing
      2.7.1 Hard Constraints
      2.7.2 Soft Constraints
      2.7.3 Turbo Parser
   2.8 Evaluation Metrics

3 Ensemble Approach for Parsing
   3.1 Experiments on Indian Language Dependency Parsing
   3.2 Coling MTPIL Shared Task
      3.2.1 Data
   3.3 Experiments and Results
      3.3.1 Parser Settings
         3.3.1.1 Malt
         3.3.1.2 MST
         3.3.1.3 Turbo
      3.3.2 Voting
      3.3.3 Blending
      3.3.4 Feature Set
      3.3.5 Evaluation Results
      3.3.6 Error Analysis
   3.4 Summary

4 Empty Category Detection and Insertion
   4.1 What is an Empty Category?
   4.2 Why is an Empty Category Present in a Treebank?
   4.3 Empty Categories in Other Languages
      4.3.1 Empty Categories in Penn Treebank
   4.4 Related Work
   4.5 Empty Category Annotation and Detection in Hindi Treebank
      4.5.1 Multi-Layered Treebank
      4.5.2 Types of Empty Categories
   4.6 Empty Category Motivation
   4.7 Approach
      4.7.1 Pre-Processing
      4.7.2 Data-driven Parsing
      4.7.3 Post-processing
      4.7.4 Rules
   4.8 Experiments and Results
      4.8.1 Parser Settings
      4.8.2 Features and Template
      4.8.3 Data
      4.8.4 Results and Discussion

5 Conclusions
   5.1 Future Work

Appendix A: Appendix

Bibliography


List of Figures

2.1 An Example of a Non-Projective sentence.
2.2 An Example of a Projective sentence.
2.3 Dependency structure for the sentence ‘Ram ate an apple’.
2.4 An Example of a Hindi sentence annotated in CoNLL format.
2.5 Levels in the Paninian model. [7]
3.1 Dependency tree structure output of Malt Parser.
3.2 Dependency tree structure output of MST Parser.
3.3 Dependency tree structure output of Turbo Parser.
4.1 An Example of a Hindi sentence annotated with a NULL category.
4.2 An Example of a Hindi sentence with a NULL NP category.
4.3 An Example of a Hindi sentence with a NULL VP category with backward gapping.
4.4 An Example of a Hindi sentence with a NULL VP category with forward gapping.
4.5 An Example of a Hindi sentence with a NULL CC category.
4.6 Pre-Processing.
4.7 Process.


List of Tables

2.1 Important karaka relations prescribed in Panini’s framework.
3.1 Accuracies on data annotated with gold-standard POS tags. UAS, LA and LAS denote Unlabeled Attachment Score, Labeled Accuracy and Labeled Attachment Score respectively.
3.2 Accuracies on data annotated with automatic POS tags.
4.1 Recall comparison of rule-based and statistical approaches.
A.1 Important karaka relations prescribed in Panini’s framework.


Chapter 1

Introduction

1.1 Parsing

Parsing helps in understanding the relations between words in a sentence. It plays an important role in many applications such as machine translation, word-sense disambiguation, search engines and dialogue systems. Parsers are mainly classified into two categories: grammar-driven and data-driven. The combination of these two led to the development of hybrid parsers.

The availability of annotated corpora in recent years has enabled data-driven parsers to achieve considerable success. The availability of phrase-structure treebanks for English [31] has driven the development of many efficient parsers. But parsing morphologically rich, free word order (MoR-FWO) languages is a different task altogether, and most Indian languages fall into this category. Previous work on free word order languages [23, 7] suggested that dependency parsing can handle such languages better than constituency parsing. This led to the development of dependency treebanks, and dependency annotation using the Paninian framework was started for Indian languages [6]. In recent years there has been a lot of interest in data-driven dependency parsing of Indian languages (IL), owing to the availability of annotated corpora, leading to high-accuracy parsers [37, 34, 32, 30].

Parsing MoR-FWO languages is a challenging task [24, 25]. Despite the availability of dependency treebanks for some MoR-FWO languages, the state-of-the-art parsers for these languages perform worse than those for fixed word order languages like English [38]. Past experiments on MoR-FWO languages using different parsers have shown that a number of factors contribute to the performance of a parser [38, 21, 35]. In this thesis we focus on improving parsing accuracy for a MoR-FWO language and on using these parsing techniques to identify Empty Categories in a treebank.

1.2 Contribution

There are two major contributions of this thesis:


• Building the best-performing parsing system for Hindi as part of the COLING-MTPIL Tools Contest 2012.

• Using a statistical dependency parser to automatically detect Empty Categories in the Hindi language.

The objective of the COLING 2012 MTPIL workshop was to bring together Machine Translation (MT) and parsing researchers to showcase their work on Indian languages and to exploit the synergies to interconnect state-of-the-art Indian language MT and parsing research. As part of this workshop, the organizers hosted a dependency parsing shared task for Hindi, releasing a part of the Hindi Dependency Treebank (HDT) containing gold-standard morphological analyses, part-of-speech tags, chunks and dependency relations labeled in the computational Paninian framework. The evaluation considered the standard dependency-tree-based measures over both gold-standard and automatic parts of speech. We explored three different parsers for this task and also came up with an approach to combine them. We report the best accuracies for both the gold-standard and automatically generated versions of the data.

As the second contribution, we used statistical dependency parsing techniques to address the problem of Empty Category detection and recovery in the Hindi language. Since it is not a rule-based approach, it can also be applied to other Indian languages. There has not been much research on the detection of Empty Categories in Hindi using statistical approaches.

1.3 Outline

Chapter 2: This chapter discusses the Hindi language, why parsing Hindi is a difficult task, and the challenges involved. It also describes the Hindi treebank and the dependency formalism used to build it, i.e., the Paninian framework. The chapter also covers dependency parsing with a focus on the Paninian framework and the various parsers we have explored in our experiments. It describes in detail Malt parser, a statistical data-driven parser used in many experiments throughout the thesis, and also describes Turbo parser.

Chapter 3: This chapter introduces parsing for the Hindi language and discusses related work from the past few years. It covers the COLING shared task and the approaches and parsers we explored in our experiments on Indian language dependency parsing. It also discusses the grammatical features used for the language and the best-performing settings for each parser. Finally, we conclude with the evaluation and error analysis of our approach. Publications: Ensembling various dependency parsers: Adopting turbo parser for Indian languages (24th International Conference on Computational Linguistics, COLING-MTPIL, 2012).

Chapter 4: In this chapter, the definition and usage of Empty Categories are discussed. It also covers Empty Categories in the Hindi treebank as well as in other languages such as English, earlier work on the annotation and detection of these Empty Categories for Hindi, and briefly the corresponding work done for other languages. We also cover the various types of Empty Categories in the Penn treebank


and the Hindi treebank. We also discuss the approach we have taken to apply statistical methods. We explored Malt parser and Turbo parser for this problem and discuss the best settings for each parser.

Finally, we conclude with our results and an error analysis of our approach. Publications: A statistical approach to prediction of empty categories in Hindi dependency treebank (Fourth Workshop on Statistical Parsing of Morphologically Rich Languages, 2013).

Chapter 5: In this chapter we conclude the thesis with a summary.


Chapter 2

Background and Related Work

In this chapter, we start by briefly discussing Indian languages and some of their important features, which will help us understand the nature of these languages. We then discuss treebanks, and the Hindi treebank in detail. We also describe the Paninian framework, which is used for parsing Indian languages, and then introduce the different parsing strategies we exploited in our work. We then briefly explain Malt parser and Turbo parser, which we used at many places throughout the thesis.

2.1 About Indian Languages

Indian languages can be broadly classified into two groups: Indo-Aryan languages and Dravidian languages. Languages such as Hindi, Marathi, Urdu and Bangla belong to the Indo-Aryan group; languages like Telugu, Tamil, Kannada and Malayalam come under the Dravidian group. All these Indian languages show certain characteristics which pose challenges for parsing. In this section we discuss some of these characteristics.

2.1.1 Free Word Order and Vibhakti

Indian languages have relatively free word order: the constituents of a sentence can occur in any order without affecting the gross meaning of the sentence. The emphasis and other characteristics, like the naturalness of the sentence, may differ between a sentence and its variants obtained by changing the order of words.

The post-position markers and surface case endings of nouns are collectively referred to as the vibhakti of nouns. Vibhakti plays a key role in identifying semantic relationships in Indian languages: it conveys significant grammatical information such as subject, object and instrument. Consider the following English sentence:

Example: Ram beats Mohan.

For the above sentence, there are multiple possible Hindi renderings with different word orders:

1. Words: raama mohana ko piiTataa hai
   Gloss: ram mohan beats

2. Words: mohana ko raama piiTataa hai
   Gloss: mohan ram beats

In the above sentences Mohan has the same vibhakti, ko. Even if the positions of raama and mohana ko are interchanged, the semantic relations of Ram and Mohan with the verb do not change. However, if the vibhakti ko is attached to raama instead, the semantic roles change and the meaning becomes: Mohan beats Ram.

1. Words: mohana raama ko piiTataa hai
   Gloss: mohan ram beats

2. Words: raama ko mohana piiTataa hai
   Gloss: ram mohan beats

2.1.2 Non-Projectivity

Non-projectivity occurs when dependents do not immediately follow or precede their heads in a sentence [44]. As discussed earlier, Hindi is a free word order language, and therefore non-projective sentences occur frequently in the Hindi treebank. Each sentence in the Hindi treebank can be expressed as a dependency tree, which can be formally defined as follows: a dependency tree D = (V, E, ≺) is a directed graph with V a set of nodes, E a set of edges expressing a dependency relation on V, and ≺ a linear order on V. Every dependency tree satisfies two properties: a) it is acyclic, and b) all nodes have in-degree 1, except the root node, which has in-degree 0 [15].

Condition of Projectivity: A dependency tree D = (V, E, ≺) is projective if it satisfies the following condition: for every edge i → j and every node k lying between i and j in the linear order (i ≺ k ≺ j or j ≺ k ≺ i), k ∈ Subtree(i). Otherwise D is non-projective.
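The projectivity condition can be checked directly from a head-index array. The following is a minimal sketch under the definition above; the function and variable names are our own, and the two example trees are invented for illustration.

```python
# A tree over tokens 1..n is given as a head array: heads[j] = i means
# edge i -> j, and heads[j] = 0 marks the root. Index 0 is a dummy slot.

def subtree(heads, i):
    """All nodes dominated by i, including i itself (Subtree(i))."""
    nodes = {i}
    changed = True
    while changed:
        changed = False
        for j in range(1, len(heads)):
            if heads[j] in nodes and j not in nodes:
                nodes.add(j)
                changed = True
    return nodes

def is_projective(heads):
    """For every edge i -> j, every node k strictly between i and j
    in the linear order must lie in Subtree(i)."""
    for j in range(1, len(heads)):
        i = heads[j]
        if i == 0:
            continue  # root edge spans nothing by convention
        lo, hi = min(i, j), max(i, j)
        dominated = subtree(heads, i)
        for k in range(lo + 1, hi):
            if k not in dominated:
                return False
    return True

print(is_projective([0, 2, 0, 4, 2]))   # no arc crosses another -> True
print(is_projective([0, 4, 4, 1, 0]))   # edge 1->3 skips node 2 -> False
```

A non-projective tree is exactly one where some arc, drawn above the sentence, must cross another, which matches the "crossing arcs" diagnostic used in Figures 2.1 and 2.2.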

The examples below depict the difference between a projective and a non-projective sentence. Crossing arcs in the dependency structure are the sign of non-projectivity.

• Example of a Non-Projective Sentence is shown in Figure 2.1:

Figure 2.1 An Example of a Non-Projective sentence.

• Example of a Projective Sentence is shown in Figure 2.2:


Figure 2.2 An Example of a Projective sentence.

2.1.3 Morphologically Rich

Morphologically rich languages (MRLs) express multiple levels of information already at the word level. The lexical information for each word form in an MRL may include additional information concerning the grammatical function of the word in the sentence, its grammatical relations to other words, pronominal clitics, inflectional affixes and so on. Expressing such functional information morphologically allows for a high degree of word-order variation. Furthermore, lexical items appearing in different syntactic contexts may be realized in different forms. This leads to a high level of word-form variation and complicates lexical acquisition from small corpora [45].

Example: In Hindi, the word ladakA (boy) has inflectional forms such as ladaka (masculine-direct-singular), ladake (masculine-oblique-singular), ladake (masculine-direct-plural) and ladakoM (masculine-oblique-plural). The lexical item thus conveys gender, case and number information.

A parsing model for MRLs requires recognizing the morphological information in each word form. Due to the high level of morphological variation, however, data-driven systems are not guaranteed to observe all morphological variants of a word form in a given annotated corpus.

2.2 Dependency Parsing

Dependency is the notion that linguistic units, e.g. words, are connected to each other by directed links. The verb is considered the structural center of the clause. All other lexical items are either directly or indirectly connected to the verb via these directed links, which are called dependencies. Structure is determined by the relations between the lexical items in the sentence.

Figure 2.3 shows the dependency structure of the sentence ‘Ram ate an apple’. In general, dependency parsing can be broadly divided into grammar-driven and data-driven dependency parsing [39]. Most modern grammar-driven dependency parsers eliminate parse structures which do not satisfy a set of grammatical constraints for the language; these constraints mostly change from one language to another. Data-driven dependency parsers, in contrast, use a corpus to build a probabilistic model for disambiguation. Nevertheless, many data-driven parsers also combine a dependency formalism with the probabilistic model.
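The dependency structure of Figure 2.3 can be encoded as the per-token (head, label) pairs that a data-driven parser predicts. The following is an illustrative sketch; the relation labels here are generic English-style names, not the treebank's karaka tagset.

```python
# The tree for 'Ram ate an apple': the verb 'ate' is the structural
# center (root); every other word attaches to it directly or indirectly.
# Head index 0 denotes the artificial ROOT; token ids are 1-based.
parse = [
    (1, "Ram",   2, "subj"),   # Ram   <- ate
    (2, "ate",   0, "root"),   # ate   is the root verb
    (3, "an",    4, "det"),    # an    <- apple
    (4, "apple", 2, "obj"),    # apple <- ate
]

def children(parse, head_id):
    """Word forms directly dependent on the token with id head_id."""
    return [form for tid, form, head, label in parse if head == head_id]

print(children(parse, 2))   # -> ['Ram', 'apple']
```

Because every token except the root has exactly one head, this flat list of (head, label) pairs is equivalent to the tree drawing, which is why dependency parsing reduces to predicting one head and one label per word.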


Figure 2.3 Dependency structure for the sentence ‘Ram ate an apple’.

2.3 Treebanks

Treebanks are text corpora in which each sentence is associated with a parse tree, i.e. annotated with a syntactic structure. They are linguistic resources in which each sentence can be enriched by explicitly marking morphological, lexical and semantic information for each lexical item. Treebanks are useful resources for any kind of corpus-based linguistic research related to syntax, and they are also used in various natural language processing applications such as parsers, machine translation systems and morphological analyzers. Some notable treebank development efforts are the Penn Treebank for the English language and the Prague Dependency Treebank for the Czech language. Treebanks can be created manually or semi-automatically. In the semi-automatic approach, parsers are used to obtain syntactic structures for the sentences in the corpus, and linguists then check and correct the output structures given by the parser.

2.3.1 About Hindi Treebank

The Hindi Dependency Treebank is an NSF-funded project at the Language Technologies Research Centre, IIIT-Hyderabad, in collaboration with the University of Colorado Boulder, the University of Massachusetts at Amherst, the University of Washington and Columbia University, New York. It is part of a Multi-Representational and Multi-Layered Treebank for Hindi.

As discussed in the earlier section, characteristics of Hindi such as free word order, rich morphology and non-projectivity pose challenges for parsing. To overcome these challenges with data-driven parsing approaches, and to evaluate parsers for accuracy, a large corpus is required. Previous research showed that some of these challenges can be better handled using a dependency-based annotation scheme rather than a phrase-based one [23, 43, 7]. As a result, dependency annotation for Hindi based on the Paninian framework was used for building the treebank [6]. The dependency relations in the treebank are syntactico-semantic in nature, where the main


verb is the primary binding constituent of the sentence. The participants in an action are labeled with karaka relations.

The treebank has dependency relation, verb-argument and phrase structure (PS) representations. The dependency treebank contains information such as the POS, morph, chunk and dependency label for each lexical item. Manual annotation of the dependency treebank includes annotation of the part-of-speech (POS) tag and morphological information for each word, division of the sentence into chunks, identification of chunk labels, and marking of the dependency relations between words. These are explained in detail below.

• Part-of-Speech Information: A POS tag is assigned to each lexical item based on its definition and context. POS tags in the Hindi dependency treebank are annotated for each node in the sentence following the POS and chunk annotation guidelines.

• Morph Information: Information pertaining to the morphological features of the nodes is also encoded, using the Shakti Standard Format (SSF). There are eight mandatory feature attributes for each node: root, category, gender, number, person, case, post-position (for a noun) or tense-aspect-modality (for a verb), and suffix.

• Chunk Information: A chunk is a group of words in a sentence which are syntactically correlated. The process of identifying and dividing the sentence into such word groups is called chunking. All the words in a chunk have the same chunk label; the naming convention for chunks is based on the annotation guidelines in [14].

• Dependency Information: After POS, morph and chunk annotation, dependency annotation is done following the set of dependency guidelines in [14]. This information is encoded at the syntactico-semantic level following the Paninian dependency framework.

This Hindi dependency treebank is available in two formats: CoNLL-X format and SSF (Shakti Standard Format). We have used the CoNLL-X format data for all our experiments. An example of a sentence in each format is given below.

2.3.2 SSF Format

The SSF format has four columns: token id, token/chunk boundaries, POS/chunk tags and feature structure. A detailed description of SSF can be found in [13]. An example of a Hindi sentence in SSF format is given below.

1. AKirakAra RB <fs af='AKirakAra,adv,,,,,,' name='AKirakAra' posn='10' chunkId='RBP' drel='sent-adv:kara' chunkType='head:RBP'>

2. ise PRP <fs af='yaha,pn,any,sg,3,o,ko,ko' name='ise' posn='20' chunkId='NP' drel='k2:kara' chunkType='head:NP'>

3. kAbU NN <fs af='kAbU,n,m,sg,3,o,0 meM,0' name='kAbU' posn='30' chunkId='NP2' drel='k7:kara' vpos='vib 2' chunkType='head:NP2'>

4. meM PSP <fs af='meM,psp,,,,,,' name='meM' posn='40' drel='lwg psp:kAbU' chunkType='child:NP2'>

5. kara VM <fs af='kara,v,m,sg,any,,0 le+yA jA+yA1,0' name='kara' posn='50' chunkId='VGF' chunkType='head:VGF' voicetype='active' vpos='tam 2 3' stype='declarative'>

6. liyA VAUX <fs af='le,v,m,sg,any,,yA,yA' name='liyA' posn='60' drel='lwg vaux:kara' chunkType='child:VGF'>

7. gayA VAUX <fs af='jA,v,m,sg,any,,yA1,yA1' name='gayA' posn='70' drel='lwg cont:kara' chunkType='child:VGF'>

8. . SYM <fs af='.,punc,,,,,,' name='.' posn='80' chunkId='BLK' drel='rsym:kara' chunkType='head:BLK'>

2.3.3 CoNLL Format

CoNLL format [36] is the standard format used in the CoNLL shared tasks on dependency parsing. It is a ten column format. A short description of these columns is given below:

1. ID : Token counter, starting at 1 for each new sentence.

2. FORM : Word form or punctuation symbol.

3. LEMMA : Lemma or stem (depending on the particular data set) of word form, or an underscore if not available.

4. CPOSTAG : Coarse-grained part-of-speech tag, where the tagset depends on the language.

5. POSTAG : Fine-grained part-of-speech tag, where the tagset depends on the language, or identical to the coarse-grained part-of-speech tag if not available.

6. FEATS : Unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar (|), or an underscore if not available.

7. HEAD : Head of the current token, which is either a value of ID or zero ('0'). Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero.

8. DEPREL : Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.


9. PHEAD : Projective head of the current token, which is either a value of ID or zero ('0'), or an underscore if not available. Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero. The dependency structure resulting from the PHEAD column is guaranteed to be projective (but is not available for all languages), whereas the structures resulting from the HEAD column will be non-projective for some sentences of some languages (but are always available).

10. PDEPREL : Dependency relation to the PHEAD, or an underscore if not available. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.

Figure 2.4 An example of a Hindi sentence annotated in CoNLL format.
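As an illustration, a CoNLL-X token line can be split into these ten fields with a few lines of code. This is a minimal sketch; the sample line and its field values are invented for illustration.

```python
# Minimal sketch: parse one token line of a CoNLL-X file into a dict.
# Field names follow the ten-column description above; the sample line
# below is invented for illustration.
COLUMNS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
           "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]

def parse_token_line(line):
    fields = line.rstrip("\n").split("\t")
    token = dict(zip(COLUMNS, fields))
    token["ID"] = int(token["ID"])
    token["HEAD"] = int(token["HEAD"])
    # FEATS is a vertical-bar separated list, or '_' if not available
    token["FEATS"] = [] if token["FEATS"] == "_" else token["FEATS"].split("|")
    return token

line = "1\traama\traama\tn\tNNP\tcat-n|gen-m\t4\tk1\t_\t_"
tok = parse_token_line(line)
print(tok["FORM"], tok["HEAD"], tok["DEPREL"])  # raama 4 k1
```

A full reader would additionally split the file on blank lines to recover sentence boundaries.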

2.4 Paninian Framework

Panini developed a computational grammar formalism suitable for free word order languages. As the majority of Indian languages have free word order, the Paninian formalism has been successfully applied to them. The Paninian approach describes how to extract karaka relations, which are syntactico-semantic, for a given sentence. The notion of karaka relations is central to Paninian Grammar. These relations define syntactico-semantic relations between verbs and other constituents in a sentence. The Paninian approach was first formulated for Sanskrit about 2000 years ago. The main karaka relations listed in Paninian Grammar are given in A.1. For the complete tagset used in the Hindi treebank, refer to appendix A.

Figure 2.5 depicts the four levels of a sentence according to Paninian model.

• The surface level is any given sentence in word form or uttered form.


Figure 2.5 Levels in the Paninian model [7].

• At the vibhakti level, there are local word groups with case endings, preposition or postposition markers. At this level, the sentence consists of noun groups and verb groups based on vibhakti. A noun group consists of a noun, its vibhakti and possible adjectives. A verb group consists of a main verb followed by auxiliary verbs. The vibhakti of a verb also carries information about the tense, aspect and modality of the verb. This level has only syntactic information regarding the sentence.

• The Karaka level has both syntactic and semantic information about the given sentence.

• The semantic level corresponds to the actual meaning of the sentence.

At the karaka level, we have karaka relations and verb-verb relations. Karaka relations are syntactico-semantic relations between the verbs and other related elements in a sentence. They capture a certain level of semantics which is somewhat similar to thematic relations, but different from them [7]. This level of semantics is important syntactically and is reflected in the surface form of the sentences. Hence an annotation scheme based on the Paninian framework was proposed by [6].

karaka label   Description
k1             karta (doer/agent/subject)
k2             karma (object/patient)
k3             karana (instrument)
k4             sampradaana (recipient)
k5             apaadaana (source)
k7             (location in time/space)

Table 2.1 Important karaka relations prescribed in Panini’s framework.

In the Paninian approach, the verb is the main entity of the sentence. The sentence has a main verb and all other lexical entities are treated as modifiers of this main verb. The entities modifying the verb participate in the action specified by the verb. The participant relations with the verb are called karaka. The notion of karaka incorporates the local semantics of the verb in a sentence, while also taking cues from the surface level morpho-syntactic information [46]. The annotation scheme carries out the analysis of each sentence taking the verb as the central, binding element of the sentence.

The analysis of a sentence generally begins with the verb, which makes demands for its arguments. The arguments are identified taking the verb's meaning into consideration. Their relationship with the verb is established using karaka and other relations. The discovery procedure for dependency relations depends on morpho-syntactic information. The verb generally selects the karta or the karma based on its TAM (tense, aspect and modality) marker. This selection is shown syntactically either via agreement or some case markings. There exists, therefore, a TAM-vibhakti correspondence that can help identify certain relations. Other relations can also be identified based on similar surface cues.

Example: raama ne raavana ko tiira se maara.

Words: raama ne    raavana ko    tiira se    maara
Gloss: Ram         Ravan         arrow       killed

In the above sentence, the main verb is maara (killed). According to the Paninian approach, the verb selects the karta and the karma. The postposition markers indicate the roles of the other nouns in the sentence. raama (Ram) is the karta of the sentence, which can be identified using the vibhakti ne. raavana (Ravan) is the karma of the sentence, which can be identified using the vibhakti ko. tiira (arrow) is the karana of the sentence, which can be identified using the vibhakti se. Thus vibhakti plays a crucial role in identifying karaka roles in a sentence.
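The vibhakti cues used in this example can be sketched as a lookup table. This is a toy illustration only; the mapping and the function are our own simplification, since real karaka assignment also depends on the verb's demand frame, agreement and TAM marker.

```python
# Toy illustration of the vibhakti cues described above. This is a
# gross simplification: real karaka assignment also depends on the
# verb's demand frame, agreement and its TAM marker.
VIBHAKTI_CUES = {
    "ne": "k1",   # karta (agent), cued by ergative 'ne'
    "ko": "k2",   # karma (patient)
    "se": "k3",   # karana (instrument)
    "meM": "k7",  # location in time/space
}

def guess_karaka(noun, postposition):
    """Return a (noun, karaka-label) guess from the postposition alone."""
    return noun, VIBHAKTI_CUES.get(postposition, "unknown")

for noun, psp in [("raama", "ne"), ("raavana", "ko"), ("tiira", "se")]:
    print(guess_karaka(noun, psp))
```

Applied to the example sentence, this recovers k1, k2 and k3 for raama, raavana and tiira respectively.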

2.5 About Malt Parser

Malt Parser implements the transition based approach to dependency parsing [47]. A transition-based algorithm builds a parse tree by a sequence of actions, scoring each action individually. It has two essential components:

• A transition system which maps sentences to dependency trees

• A classifier which predicts the next transition for every possible system configuration

Malt parser uses four transitions:

• Shift: push the next word in the buffer onto the stack.

• Left-Arc: add an arc from the topmost word on the stack, w1, to the second-topmost word, w2, and pop w2.

• Right-Arc: add an arc from the second-topmost word on the stack, w2, to the topmost word, w1, and pop w1.

• Reduction: pop the stack.
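The four transitions can be sketched as operations on a stack, an input buffer and an arc set. The following is a minimal illustration of the descriptions above, not Malt's actual implementation (which adds dependency labels, preconditions and a learned classifier to choose transitions); the example words and the hand-picked transition sequence are invented.

```python
# Minimal sketch of the four transitions listed above, operating on a
# stack, an input buffer and a set of (head, dependent) arcs. Malt's
# real system adds labels, preconditions and a classifier that chooses
# the next transition; here the transition sequence is given by hand.
def step(action, stack, buffer, arcs):
    if action == "shift":          # push the next buffer word onto the stack
        stack.append(buffer.pop(0))
    elif action == "left_arc":     # arc from top (w1) to second-top (w2), pop w2
        w1, w2 = stack[-1], stack[-2]
        arcs.add((w1, w2))
        del stack[-2]
    elif action == "right_arc":    # arc from second-top (w2) to top (w1), pop w1
        w1, w2 = stack[-1], stack[-2]
        arcs.add((w2, w1))
        stack.pop()
    elif action == "reduce":       # pop the stack
        stack.pop()

stack, buffer, arcs = [], ["raama", "phala", "khaatA"], set()
for action in ["shift", "shift", "shift", "left_arc", "left_arc"]:
    step(action, stack, buffer, arcs)
print(sorted(arcs))  # [('khaatA', 'phala'), ('khaatA', 'raama')]
```

The sequence leaves the verb on the stack with both nouns attached to it as dependents.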


Classifiers can be induced from treebank data using different machine learning methods. Malt's classifier predicts the next transition based on a feature vector. The feature model is a setting which can be changed to suit the data and may use different features of the data, such as POS tags, word forms etc. Malt parser provides two classifiers, namely Liblinear and LibSVM. In addition to this, Malt also provides an option to use external libsvm packages. The task of the classifier is to predict the optimal transition based on the existing configuration.

Malt parser provides different transition-based parsing algorithms. Some of them are specifically developed for parsing sentences with non-projective structures. The algorithms that Malt parser provides include Arc Eager, Arc Standard, Covington and Stack.

2.6 About MST Parser

MST parser implements the graph based approach to dependency parsing. For a given input sentence, a graph-based algorithm finds the parse tree with the maximum score among all possible outputs; scores are calculated for all possible complete trees. Graph-based models define a space of candidate dependency trees for the input sentence.

• Learning: Induce a model for scoring a candidate tree.

• Parsing: Find a tree with the highest score given the model.

In the learning phase, it induces a model by learning weights from the given sentences and dependency trees. It considers all the possible trees for a given sentence and adjusts the weights on the edges of the dependency tree. It uses a perceptron learning algorithm for finding the weights.
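The weight adjustment described here can be sketched as a structured perceptron update over arc features. This is a simplification; the feature design and the exact online algorithm used by the MST parser differ, and the toy feature here is just the (head POS, dependent POS) pair.

```python
from collections import Counter

# Sketch of one perceptron-style weight update over arc features, as a
# simplification of the learning phase described above. The only
# feature used here is the (head POS, dependent POS) pair of each arc.
def arc_features(tree):
    # tree: iterable of (head_pos, dep_pos) POS-tag pairs, one per arc
    return Counter(tree)

def perceptron_update(weights, gold_tree, predicted_tree, lr=1.0):
    # reward features of the gold tree, penalize those of the wrong guess
    for feat, count in arc_features(gold_tree).items():
        weights[feat] += lr * count
    for feat, count in arc_features(predicted_tree).items():
        weights[feat] -= lr * count
    return weights

w = Counter()
gold = [("VM", "NNP"), ("VM", "NN")]
pred = [("VM", "NNP"), ("NN", "NNP")]   # one wrong arc
perceptron_update(w, gold, pred)
print(w[("VM", "NN")], w[("NN", "NNP")])  # 1.0 -1.0
```

Arcs present in both trees cancel out, so only the disagreeing arcs move the weights.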

Arc-factored model:
X: an input sentence.
Y: a candidate dependency tree.
x_i → x_j: a dependency edge from word i to word j.
Φ(X): the set of possible dependency trees over X.

Y* = argmax_{Y ∈ Φ(X)} score(Y | X)
   = argmax_{Y ∈ Φ(X)} Σ_{(x_i → x_j) ∈ Y} score(x_i → x_j)

score(x_i → x_j) may or may not be a probability. It is computed as the dot product of a weight vector and an arc feature vector:

score(x_i → x_j) = w · f(x_i → x_j)
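For a very short sentence, the argmax over Φ(X) can be computed by brute force, which makes the arc-factored model concrete. This is illustrative only: the arc scores below are invented, and real parsers use the Eisner or Chu-Liu-Edmonds algorithms rather than enumeration.

```python
import itertools

# Brute-force illustration of the argmax over Phi(X): enumerate every
# head assignment for an n-word sentence (0 = artificial root), keep
# only the acyclic ones, and return the tree whose arc-factored score
# is highest. Real parsers use Eisner's or the Chu-Liu-Edmonds
# algorithm instead of enumeration.
def is_tree(heads):
    # heads[i] is the head of word i+1; follow head links up from each word
    for i in range(1, len(heads) + 1):
        seen, node = set(), i
        while node != 0:
            if node in seen:
                return False          # found a cycle
            seen.add(node)
            node = heads[node - 1]
    return True

def best_tree(n, score):
    candidates = (
        heads
        for heads in itertools.product(range(n + 1), repeat=n)
        if all(h != i + 1 for i, h in enumerate(heads)) and is_tree(heads)
    )
    return max(candidates,
               key=lambda hs: sum(score.get((h, d + 1), -1e9)
                                  for d, h in enumerate(hs)))

# invented arc scores for a 3-word sentence
score = {(0, 3): 10, (3, 1): 8, (3, 2): 7, (1, 2): 5, (0, 1): 2}
print(best_tree(3, score))  # (3, 3, 0): words 1 and 2 attach to word 3
```

The winning assignment attaches words 1 and 2 to word 3, which itself attaches to the root, for a total score of 25.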


MST parser uses a non-projective dependency parsing algorithm. It constructs the dependency structure by building a graph over all possible dependency arcs and extracting a maximum spanning tree out of this graph. It also supports projective parsing. It uses the Chu-Liu-Edmonds MST algorithm for the non-projective case and the Eisner algorithm for projective parsing.

MST parser has the strength of exact inference, but as the size of the feature set increases it takes more time in learning and in parsing a given sentence; in other words, it is constrained in terms of the choice of features. Malt parser is deterministic, but it can accommodate a larger set of features, which certainly is an advantage. At each step of parsing, it makes a single decision and chooses one of the four transition actions based on the current context, the next input words, the stack and the existing arcs. One drawback of deterministic parsing is error propagation: once an incorrect action is made, the output parse will be incorrect regardless of the subsequent actions, since there is no going back after an incorrect action.

2.7 Constraint Parsing

Constraint parsing is an approach where grammar rules written by linguists, along with some context dependent rules, are programmed as constraints. A parse structure should satisfy the given constraints to be considered a valid tree. These rules can vary in number depending on the language in consideration. Constraints can be classified into two types: hard constraints and soft constraints. Constraint based parsing using integer programming has been explored for Indian languages before [11, 10].

2.7.1 Hard Constraints

Hard constraints, or H-constraints, are grammatical features of a language which cannot be violated. H-constraints comprise lexical and structural knowledge of the language. The H-constraints are converted into an integer programming problem and solved; the solution is a valid parse or a set of valid parses. There can be multiple parses which satisfy these H-constraints. H-constraints alone may not be able to rule out multiple valid parse structures for a given sentence, and hence more information is required to reduce the ambiguities.

2.7.2 Soft Constraints

Soft constraints, or S-constraints, are learnt as weights from an annotated treebank. The weights are used to score the parse structures, and the tree with the highest overall score is selected as the best parse. They reflect the preferences that a language has towards various linguistic phenomena, and they are used to prioritize the parses and select the best one.

Satisfying the hard constraints is a necessary condition for a parse structure, but it may not be sufficient. Soft constraints can be applied on top of hard constraints to select the best parse structure among all the parse structures which satisfy the hard constraints. Both hard and soft constraints together can be thought of as the grammar of a language.

2.7.3 Turbo Parser

Turbo parser is an example of a constraint based parser, which formulates non-projective dependency parsing as a polynomial-sized integer linear program. It can encode prior knowledge as hard constraints, and it can also learn soft constraints from data. In particular, its model is able to learn correlations among neighboring arcs, word valency, and tendencies toward projective parses. The model parameters are learned in a max-margin framework by using a linear programming relaxation [32].

Turbo parser trains a second-order non-projective parser with features for arcs, consecutive siblings and grandparents. The default training algorithm is cost-augmented MIRA (margin-infused relaxed algorithm), but other options are also provided. The decoder uses the AD3 algorithm. The parser also trains a probabilistic model for unlabeled arc-factored pruning, which speeds up parsing by reducing the number of possible arcs. In situations where speed is more important than accuracy, a simple arc-factored model can be preferred; in this mode the parser speeds up the process by not applying the first order pruner.

Turbo parser provides three settings for training: basic, standard and full.

• Basic: enables arc-factored parts.

• Standard: enables arc-factored parts, consecutive sibling parts and grandparent parts.

• Full: enables arc-factored parts, consecutive sibling parts, grandparent parts, arbitrary sibling parts and head bi-gram parts.

2.8 Evaluation Metrics

• Unlabeled Attachment Score (UAS): Proportion of tokens assigned the correct head.

• Labeled Attachment Score (LAS): Proportion of tokens assigned with the correct head and thecorrect dependency type.

• Label Score (LS): Proportion of tokens assigned the correct dependency type.
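These three metrics can be computed directly from per-token (head, label) pairs; a minimal sketch follows, with invented gold and predicted values.

```python
# Minimal sketch: UAS, LS and LAS computed from per-token gold and
# predicted (head, label) pairs, following the definitions above.
def attachment_scores(gold, pred):
    assert len(gold) == len(pred)
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n   # correct head
    ls  = sum(g[1] == p[1] for g, p in zip(gold, pred)) / n   # correct label
    las = sum(g == p for g, p in zip(gold, pred)) / n         # both correct
    return uas, ls, las

gold = [(2, "k1"), (0, "main"), (2, "k2")]
pred = [(2, "k1"), (0, "main"), (2, "k7")]   # wrong label on token 3
print(attachment_scores(gold, pred))  # UAS 1.0, LS and LAS 2/3
```

In this toy case every head is correct, so UAS is 1.0, while the single wrong label lowers both LS and LAS to 2/3.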


Chapter 3

Ensemble Approach for parsing

3.1 Experiments on Indian Language Dependency Parsing

Hindi is a morphologically rich and relatively free word order language (MoR-FWO). Parsing MoR-FWO languages such as Czech, Turkish etc. is a difficult task because of their non-configurational nature. Previous research showed that dependency parsing performs better than phrase based parsing for such languages [23, 7]. Dependency annotation for Hindi is based on the Paninian framework, and this annotation scheme is used for building the treebank [6]. In recent years data driven parsing of Hindi has shown good results; the availability of annotated corpora is an important factor in this development [37, 34, 32, 30, 4]. Parsing experiments for Hindi have also been tried using rule-based approaches and hybrids of rule-based and data-driven approaches [9].

There have been several shared tasks on dependency parsing for Indian languages [24, 25]. The best performing parser [3] in the ICON-2009 NLP Tools Contest and the best performing parser [28] in the ICON-2010 Tools Contest used the transition based Malt parser, exploring various features, and also reported the results of the MST parser [3]. [27] used the approach of blending various configurations of the Malt parser, and [52] used a voting approach, combining multiple parsing systems into a hybrid parser. [2] used the bidirectional parser of [42]. Here, the parser does a best first search for every sentence and selects the most confident relation at each step without following a specific direction (either left-to-right or right-to-left).

In [28] we explored dependency parsing for three MoR-FWO languages: Hindi, Telugu and Bangla. We explored the Malt parser over a large feature pool with different parsing strategies. For Hindi, we show that using the output of a shallow parser, which reflects various chunk information, as features helps in improving the accuracies. A chunk is a minimal phrase consisting of co-related, inseparable words or entities, such that intra chunk dependencies are not distorted [12]. From the set of morphological information provided along with the data, vibhakti and TAM have helped in improving accuracies [8, 5]. For Hindi, we experimented with incorporating local morpho-syntactic information similar to earlier work in [4]. The local morpho-syntactic information reflects various chunk information in the form of chunk type, head/non-head information, chunk boundary information and distance to the end of the chunk. From these, the best chunk features for the coarse and fine grained data sets are derived. We used a forward selection approach in determining the best combination from the available pool of features. We also explored the best parser settings and features for the Telugu and Bangla languages, and report the best Malt parser settings and features for each language and type.

3.2 Coling MTPIL shared task

As part of our experiments on Indian language dependency parsing, the best chunk features for Hindi, the best parser settings with the Malt parser and the results with this configuration were discussed earlier in chapter 2. For this shared task on Hindi dependency parsing, we explored a new parser, the Turbo parser [33]. We also explored the MST parser with the previous best configuration for Hindi [3]. We tried combining various parsing systems using two different methods, i.e., simple voting [52] and the blending method [40].

3.2.1 Data

The training and development datasets were released by the organizers as part of the shared task in two different settings: one is the manually annotated data with POS tags, chunks and other information such as gender, number, person etc., whereas the other contains only automatic POS tags without any other information. The training set contains 12,041 sentences (268,093 words) and the development dataset contains 1,233 sentences (26,416 words). The size of the data set is significantly larger compared to the previous shared tasks on Hindi dependency parsing [24, 25]. Some of the sentences had been removed due to the presence of errors in them.

The test dataset contains 1,828 sentences (39,775 words). For the final system, we combined the training and development datasets into one dataset and used this set for training. We tried different types of features and selected the best feature set by tuning on the development dataset.

3.3 Experiments and Results

The parser settings and feature set related to each parser are explained in section 3.3.1.

3.3.1 Parser Settings

3.3.1.1 Malt

Malt parser provides two learning algorithms, LIBSVM and LIBLINEAR. It also gives various options for parsing algorithms, and we experimented with the nivre-eager, nivre-standard and stack-proj parsing algorithms. Finally, we chose the nivre-standard parsing algorithm and the LIBSVM learning algorithm by tuning on the development data set. We also combined the training and development datasets and performed five fold cross validation on this combined dataset to select the best template, classifier and algorithm for both settings of the data.

3.3.1.2 MST

MST parser [34] is a non-projective dependency parser, which reduces natural language dependency parsing to finding maximum spanning trees in directed graphs. Dependency structure models are based on large-margin discriminative training methods. Projective parsing is also supported. The main features of this parser are:

• First and second order projective and non projective parsing.

• Perceptron and k-best MIRA training.

The setting that performed best for the MST parser is second order non-projective with a beam width (k-best parses) of 5 and the default 10 iterations. Tuning the MST parser in second order non-projective mode is hard since it is computationally intensive.

3.3.1.3 Turbo

Turbo parser provides three settings for training: basic, standard and full. We experimented with all three models for both data settings, and the Turbo parser in full mode consistently performs better than the other models, as it trains a second-order non-projective parser.

For the two different data settings, manually annotated data and automatic POS data, we used the same feature set used in [33]. For the gold-standard POS data setting, we added chunk based features as previously mentioned. We also experimented with clause based features, but we did not get much performance gain using them. Turbo parser uses the concept of supported and unsupported features to mitigate, to some extent, the effect of having a large number of features.

3.3.2 Voting

In the case of voting, once the outputs from all the parsing systems are obtained, each dependency relation that has the maximum number of votes from the various systems is included in the output. In case of a tie, the dependency relation predicted by the highest accuracy parser is picked for the final output.

Example: unhoMne kahA ki masale ko Apasa meM sulaJA liyA jAegA.

For the above sentence, the dependency tree structure output by the Malt parser is shown in Figure 3.1, the output of the MST parser in Figure 3.2 and the output of the Turbo parser in Figure 3.3. The only difference among the three dependency trees is that for the word 'masale' the dependency relation label output by the Turbo parser and the Malt parser is 'k2', whereas the label output by the MST parser is 'k7'. In this scenario, the voting approach chooses 'k2' as the dependency relation label for the word 'masale', because two out of the three parsers output 'k2'. The same approach is applied while choosing the parent for each word.
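The per-token voting just described can be sketched as follows. This is a toy version: the parser outputs are invented, and ties are broken in favour of the first parser listed, i.e. the most accurate one.

```python
from collections import Counter

# Sketch of per-token voting over three parser outputs. Each output is
# a list of (head, label) pairs, one per word; on a tie, the prediction
# of the first (highest-accuracy) parser in the list wins.
def vote(outputs):
    combined = []
    for token_preds in zip(*outputs):
        counts = Counter(token_preds)
        top = counts.most_common(1)[0][1]
        tied = {p for p, c in counts.items() if c == top}
        # first parser in priority order whose prediction is among the tied
        combined.append(next(p for p in token_preds if p in tied))
    return combined

# invented outputs mirroring the 'masale' example: two parsers say k2,
# one says k7, so k2 wins the vote
turbo = [(2, "k2"), (0, "main")]
malt  = [(2, "k2"), (0, "main")]
mst   = [(2, "k7"), (0, "main")]
print(vote([turbo, malt, mst]))  # [(2, 'k2'), (0, 'main')]
```

Listing the parsers in decreasing order of accuracy implements the tie-breaking rule described above.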

Figure 3.1 Dependency tree structure output of Malt Parser

3.3.3 Blending

One drawback inherent in the voting method is that the dependency tree resulting for each sentence may not be fully connected, as we include one dependency relation at a time rather than a whole dependency tree. In order to mitigate this drawback, the concept of blending was introduced by [40] and the software has been released as MaltBlender [21]. In this approach, a graph is built from the dependency relations obtained from the various outputs, and a maximum spanning tree is then extracted from it. The tool also provides various options for selecting the weights for the different parsers; these weights are determined by tuning on the development dataset.
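The graph-building step of blending can be sketched as follows. This is a simplified sketch: the parser weights and outputs are invented, and the maximum-spanning-tree extraction that MaltBlender then performs over this graph is omitted.

```python
from collections import defaultdict

# Sketch of the graph-building step of blending: every arc proposed by
# some parser is added to a weighted graph with that parser's weight,
# and arcs proposed by several parsers accumulate weight. A maximum-
# spanning-tree algorithm (e.g. Chu-Liu-Edmonds) is then run over this
# graph to obtain a fully connected tree; that step is omitted here.
def build_vote_graph(outputs, weights):
    graph = defaultdict(float)              # (head, dependent) -> weight
    for arcs, w in zip(outputs, weights):
        for dep, head in enumerate(arcs, start=1):
            graph[(head, dep)] += w
    return dict(graph)

# invented head sequences for a 2-word sentence and invented weights
turbo = [2, 0]          # head of word 1 is word 2; word 2 is the root
malt  = [2, 0]
mst   = [0, 1]
graph = build_vote_graph([turbo, malt, mst], weights=[1.0, 1.0, 0.5])
print(graph)
```

Because whole trees compete inside the spanning-tree step, the final output is guaranteed to be connected, unlike plain voting.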

In this work, we used the Malt, MST and Turbo parsers for parser combination using the above mentioned two approaches. To the best of our knowledge, the Turbo parser had not previously been applied to Indian languages.

3.3.4 Feature set

The number of features a dependency parser uses is typically huge. The selection of features has a large impact on both the run-time complexity and the performance of a parser. We tried various uni-gram and bi-gram features related to words, lemmas, POS tags, coarse POS tags, vibhakti (post-positional marker), TAM (tense, aspect and modality) and other available morphological information. In addition to these features, we also used chunk features such as chunk head, chunk distance and chunk boundary information. Finally, we also experimented with some clause-based features, such as head/child of a clause and clausal boundary information.

Figure 3.2 Dependency tree structure output of MST Parser

3.3.5 Evaluation Results

Table 3.1 lists the accuracies on the data setting with gold POS tags using the various parsers. It also lists the accuracy obtained when the parsers are combined using the simple voting method and the blending method. It can be observed from the table that the Turbo parser performs best on all the evaluation metrics among the single parsers, whereas the voting method performs best overall. The results submitted for this data setting using the Turbo parser were ranked second in terms of LAS and LS, but first in UAS. If we take the voting results, then our system ranks first on all the evaluation metrics when compared to the other systems submitted for this shared task.

Table 3.2 gives the accuracies for the test data with automatic POS tags. The result on this data setting submitted using the Turbo parser is significantly higher than all the other methods we tried and all the other submitted systems. The voting and blending systems did not improve over the Turbo parser because of the lower accuracies of Malt and MST. It can be inferred from the results that the voting and blending systems benefit when the accuracies produced by the parsers are comparable to each other; when the difference is very high, it hurts their performance.


Figure 3.3 Dependency tree structure output of Turbo Parser

With Gold Part-Of-Speech Tags

Method    UAS     LA      LAS
Malt      93.32%  90.56%  88.86%
MST       94.88%  88.13%  86.45%
Turbo     96.37%  92.14%  90.83%
Voting    96.50%  92.90%  91.49%
Blending  96.34%  92.83%  91.49%

Table 3.1 Accuracies on data annotated with gold standard POS tags. UAS, LA and LAS denote the Unlabeled Attachment Score, Labeled Accuracy Score and Labeled Attachment Score respectively.

3.3.6 Error Analysis

The presence of 20.4% non-projective sentences (469 arcs) in the test data reflects the complexity of developing a high accuracy parser for Hindi. When we observed the performance of the individual parsers on these specific arcs, the accuracy ranged from 27-28%. One reason is the inherent complexity of such constructions; another might be the small number of training examples involving non-projectivity.

The gold-standard data contains chunk information marked for every word, i.e., whether it is the head or a child of a particular chunk. Dependency relations within a chunk are called intra-chunk dependency relations, whereas those across chunks are called inter-chunk dependency relations. Figure 3.3.6 shows the accuracies of the dependency labels (that are frequent in the data) for inter-chunk dependency relations; these are the ones that are hard to predict. It can be observed in part (a) of Figure 3.3.6 that when the accuracies of the individual parsers are comparable, we get a good result from the voting and blending approaches. In part (b), we can see a drop in the accuracies on


With Automatic Part-Of-Speech Tags

Method    UAS     LA      LAS
Malt      81.23%  76.76%  73.69%
MST       91.02%  84.76%  82.55%
Turbo     93.99%  90.04%  87.84%
Voting    93.24%  89.01%  86.62%
Blending  93.47%  89.15%  87.05%

Table 3.2 Accuracies on data annotated with automatic POS tags

the voting and blending approaches, since the difference in accuracies between the parsers is high. The analysis of dependency labels on intra-chunk dependencies is not shown, since there is not much difference in accuracy between the parsing models.

3.4 Summary

In this chapter we have presented our experiments with various dependency parsers: Malt, MST and Turbo. Our experiments cover two versions of the Hindi treebank, gold-standard and automatic. Our main contribution is the adaptation of the Turbo parser to Hindi, which yielded good results on the automatic version of the treebank. In day-to-day tasks such as machine translation, gold-standard data is not available, so we need a dependency parser which provides good accuracy on automatically POS tagged sentences. As discussed earlier, the automatic version of the Hindi data contains only POS tag information. In our experiments the Turbo parser achieves 87.84% LAS on the automatic dataset, whereas on the gold dataset the best score is 91.49%; this 3.65% gain is achieved with all the features and the blending of the parsers mentioned. The Turbo parser alone on the gold dataset gives an accuracy of 90.83%, a 2.99% gain achieved through gold POS tags, morph, vibhakti, TAM and chunk information. Turning to the automatically POS tagged data without any additional information such as morph or chunk, the Turbo parser has 87.84% LAS whereas the next best MST parser has 82.55% LAS, a significant difference of 5.29%. For this dataset the Turbo parser also has better accuracy than the blending approach, which has 87.05% LAS. This can be expected, because the blending approach gives good results only if all the participating parsers have comparable accuracies; the Malt parser reported 73.69% LAS, a large difference of about 14.15% compared to the Turbo parser. Hence the blending approach did not yield better results here, unlike on the other dataset, where blending exceeded all the participating parsers in accuracy.


Chapter 4

Empty category detection and insertion

4.1 What is an Empty Category?

An Empty category is a nominal element which does not have any phonological content and is therefore unpronounced [1]. Empty categories are annotated in sentences to ensure a linguistically plausible structure. They include traces, such as Wh-traces which indicate movement operations in interrogative sentences, and dropped pronouns which indicate missing pronouns in places where pronouns are normally expected.

Empty categories play a crucial role in the annotation framework of the Hindi dependency treebank [6, 14]. If the dependency structure of a sentence does not form a fully connected tree, an Empty category (denoted by NULL in the Hindi treebank) is inserted into the sentence. In the Hindi dependency treebank, an Empty category always has at least one child. Traditional parsing algorithms do not insert Empty categories and require them to be part of the input; in the treebank they are manually annotated. In real-time scenarios, such as translation between languages, it is not possible to add the Empty categories to the sentences manually. We therefore require an approach which can identify the presence of these Empty categories and insert them at appropriate positions in the sentence.

Figure 4.1 shows an example of a Hindi sentence annotated with a NULL category. The English translation of this sentence is, “It's not fixed what his big bank will do”.

4.2 Why is an Empty Category present in a treebank?

The use of empty categories to represent the syntactic structure of a sentence is a feature of generative linguistics, and they represent an important source of information in treebanks annotated in this linguistic fashion. The Penn treebank [31], the Chinese treebank [49] and the Arabic treebank have also used Empty categories in their respective annotation schemes to preserve the syntactic structure and the information provided by these elements.


Figure 4.1 An Example of a Hindi sentence annotated with a NULL category.

An Empty category is also useful to indicate and mark the location of a dislocated phrase, so that the same structure can be created as when the phrase is not dislocated; it also allows easy extraction of the predicate-argument structure.

In Natural Language applications like Machine Translation, Empty category identification may be necessary. Unless these are identified and marked accordingly, the sentence in the target language cannot be constructed properly, or the possible ambiguities in the target language may increase. For example, in languages like Chinese, subject pronouns are routinely dropped; the meaning of a sentence may not change much even in the absence of these pronouns. But when such a sentence is translated into another language, the dropped pronouns may have to be explicitly marked and replaced with suitable pronouns or noun phrases if the target language does not allow dropped pronouns, in order to represent the syntactic structure.

4.3 Empty Categories in other languages

Many treebanks include empty nodes in parse trees to represent dislocated phrases and dropped elements. For example, relative clause markers in the Penn treebank marked by traces are examples of dislocated phrases. Dropped pronouns in the Korean treebank [22] and the Chinese treebank [50] are also marked by empty nodes. In languages such as Chinese, Japanese, and Korean, pronouns are frequently dropped when their presence is easily understood; these languages are called pro-drop languages. Dropped pronouns are quite a common phenomenon in them: in the Chinese treebank they occur once in every four sentences on average, and in the Korean treebank they are even more frequent, occurring in almost every sentence on average. Translating these pro-drop languages into languages such as English, where pronouns are regularly retained, can be problematic because the English pronouns have to be generated from nothing.


4.3.1 Empty categories in Penn treebank

Empty categories are annotated in the Penn treebank in various types, such as PROs, Wh-movement, Topicalization, Ellipsed Predicates etc. We discuss some of these types below, with an example for each category [19].

• PROs:

The most frequent Empty category element in the English treebank (NP *). Example: the missing subject of imperative sentences, “(NP *) Go Away”.

The second use is to mark passivization. Example: “(NP-1 Dante) was led (NP *-1) by Virgil.”

The third use of PRO is in what linguists call control and raising constructions.

(S (NP-SBJ-3 Everyone)
   (VP seems
       (S (NP-SBJ *-3)
          (VP to
              (VP dislike
                  (NP Drew Barrymore))))))

An example of (NP *) in a raising construction. Here the (NP *) marks that the proposition which seems to be the case is “everyone dislikes Drew Barrymore.”

• Wh-movement: Traces of wh-movement ((NP *T*) with antecedents of category WHNP, WHADVP, WHADJP and WHPP) are used in the closely related instances of questions and relative clauses to indicate in which argument or adjunct position the wh-word should be interpreted.

• Topicalization: This occurs when an element is displaced from its usual position and put at the front of the sentence. A *T* with other sorts of antecedents (e.g. NP, ADVP, VP, etc.) is used to indicate topicalization.

• Ellipsed Predicates: In some sentences a predicate such as a VP, PP-PRD etc. is missing or has moved to another location. Such categories are called Ellipsed Predicates. *?* is used to indicate the missing predicate. For example, in the sentence below:

“Acting would help him better than talking (VP *?*),” which is to say, “Acting would help him better than talking would help him.” In this sentence the VP predicate is missing and is hence indicated by *?*.


4.4 Related Work

There has been a considerable amount of work on null element restoration in languages such as English. Researchers have taken different approaches to this problem.

Johnson's approach [26] uses a pattern matching algorithm for recovering empty nodes and marking their respective dislocated constituents in phrase structure trees. The patterns are minimal connected treelets containing an empty node and all other nodes co-indexed with it. In the training phase, the system goes through each tree in the corpus and, for every null element, extracts the minimal connected tree which contains the null element and every node co-indexed with it (a pattern). This step results in a list of patterns and a count of occurrences of each pattern. Next, the system counts how many times each pattern matches in the treebank, called the match value. Note that since matching ignores empty categories in the pattern, a pattern may match places in the treebank which are identical to it except for the null elements. The patterns are then pruned based on counts and match values. In the application phase, to restore empty categories to a tree, the system does a pre-order traversal. At each node, it checks which patterns, if any, match, and applies the highest ranked one. To apply a pattern, it replaces the matching subtree with the contents of the pattern, renumbering null element indices if necessary to prevent accidental collision with co-indexation already in the tree.
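The application phase of such a pattern matcher can be sketched roughly as follows. This is an illustration only, not Johnson's actual system: it assumes trees are (label, children) tuples, empty nodes carry the label "*", and patterns are keyed by their empty-node-free signature.

```python
# Rough sketch of a Johnson-style application phase (illustrative names).
# Trees are (label, children) tuples; "patterns" maps an empty-node-free
# subtree signature to the corresponding treelet with the empty node restored.

def strip_empties(tree):
    """Signature of a subtree with empty nodes (labelled '*') removed."""
    label, children = tree
    kept = tuple(strip_empties(c) for c in children if c[0] != "*")
    return (label, kept)

def restore(tree, patterns):
    """Pre-order traversal: at each node, apply the matching pattern if any."""
    sig = strip_empties(tree)
    if sig in patterns:
        tree = patterns[sig]          # replace subtree with pattern contents
    label, children = tree
    return (label, tuple(restore(c, patterns) for c in children))

# Toy pattern: an S missing its subject NP gets an empty '*' subject.
patterns = {("S", (("VP", ()),)): ("S", (("*", ()), ("VP", ())))}
print(restore(("S", (("VP", ()),)), patterns))
# ('S', (('*', ()), ('VP', ())))
```

Index renumbering and pattern ranking, which the real system needs, are omitted here.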

Campbell [17] used a rule based approach for this task. Campbell's system is straightforward: it walks through a tree in pre-order traversal, and at each node it attempts to apply the rules. Each of these rules makes a decision based on a logical combination of linguistic predicates; he mentions passivization, finiteness, headedness, function words, and syntactic function as particularly important pieces of information. This approach uses sentences represented in the form of phrase structure trees.

There are many approaches to the recovery of empty categories in treebanks. For the Penn treebank, both ML based [18, 41] and rule based approaches [17] have been explored. Some approaches, such as [51], recover empty categories as a post-processing step, after parsing the text.

4.5 Empty category annotation and detection in Hindi treebank

[16] have earlier discussed the use of empty categories in the design of the Hindi/Urdu treebank. They explain and justify the types of ECs used in the Hindi/Urdu treebank, and also explain the process of creating the treebank in steps. The Hindi treebank is available in three layers.

As per the annotation of the Hindi treebank, a high-level distinction can be made between the various kinds of empty categories on the basis of whether they signify displacement or not. Empty categories that indicate displacement are called Traces. Traces are created as a result of movement of constituents in the sentence. A trace is always co-indexed with another unit in the sentence; it signifies that the co-indexed constituent was previously in the location of the trace and got displaced at a later stage of the sentence's derivation.


All other empty categories are grouped under the label Silent. These empty categories do not mark displacement; rather, they represent missing syntactic elements. The label Silent refers to any lexical element, i.e., a head of a word group or a dependent in a word group, which is missing. The missing lexical item can be implicit in the context or may refer to some other element in the discourse. Example: empty subject pronouns. Consider the sentences below:
A: Are you hungry?
B: Don't know.
Here, “A” is a question asked of another person, and “B” is the reply to “A”. In the reply “B” the pronoun “I” is omitted.

4.5.1 Multi-Layered Treebank

The first layer of representation is the Dependency structure. In this stage, each sentence is annotated using the Paninian grammar model; we have already discussed this annotation scheme in chapter 2. In this stage, empty categories are inserted into the Dependency structure if the empty category element has any dependents.

The second layer of representation is PropBanking. In this stage, each verb has a corresponding frame file containing information about the verb's arguments. This layer adds semantic information on top of the Dependency structure layer. Karaka relations are also included in these frame files. The Empty categories added in this stage are core arguments of the verb, such as subject and object. As the frame file provides information about the arguments of the verb, adding empty categories for missing core arguments is a simpler task. The empty categories inserted in this step are also of the Silent type.

The third layer of representation is Phrase structure. This layer is obtained by an automatic Dependency structure to Phrase structure conversion process that takes the Dependency structure layer and the PropBank layer as input and generates a Phrase structure as output [48]. Empty categories in this layer can be of type Trace or Silent.

In our experiments we use the Hindi dependency treebank, i.e., the first of the three layers of representation described above. In this layer, the Empty categories required to complete a tree are inserted; they are manually annotated. An empty category inserted in this layer always has at least one child, so empty categories in this layer of the treebank do not occur as leaf nodes. The cases where Empty categories are inserted are:

• Empty Head of NP. This is indicated by the label *Head-NP* and is inserted when the head of an NP is missing. As shown in Figure 4.2, an Empty category is inserted to complete the tree structure, as it has the dependent piilii.

• Empty Head of VP. This is indicated by the label *Head-VP* and is inserted when the head of a VP is missing. The dependents of the missing verb are attached to this Empty category.


• Empty Subject with a predicative adjective and a ‘ki’ complement clause. This is indicated by the label *pro*; an Empty subject is inserted so that the head of the ‘ki’ complement clause gets attached to it.

• Empty Conjunction head. A missing conjunction head is indicated by the label *CONJ* and is inserted when a conjunction is missing; its conjuncts are attached to the inserted empty category node. As shown in Figure 4.5, the conjunction is missing, so an empty category is inserted and its conjuncts are attached to this empty node.

4.5.2 Types of Empty Categories

Previous work on Empty category detection for Hindi data was done by [20], a rule based approach for the detection of Empty categories which also presented a detailed analysis of the different types of Empty categories present in the Hindi treebank. They used hand-crafted rules to identify each type of Empty category. Being rule based, this approach is language specific.

[20] have discussed the different types of Empty categories in the Hindi treebank in detail. The main types of Empty categories are:

• Empty Subject, where a clause is dependent on the missing subject (NP) of the verb, denoted as NULL NP or NULL PRP. Figure 4.2 depicts an example of a sentence with an Empty Subject.

raam ne niilii shirt khariid-ii aur mohan ne piilii NULL khariidii.
Ram erg blue shirt buy and Mohan erg yellow NULL buy.
'Ram bought the blue shirt and Mohan bought the yellow (one).'

Figure 4.2 An Example of a Hindi sentence with a NULL NP category.

• Backward Gapping, where the verb (VM) is absent in the clause that occurs before a co-ordinating conjunct, denoted as NULL VM. Figure 4.3 depicts an example of a sentence with backward gapping.

doosare nambara para misa roosa natasha NULL aur tiisare nambara para misa lebanan sendra rahiim.
second position on miss Russia Natasha NULL and third position on miss Lebanon Sandra were.
'Miss Russia Natasha stood second and Miss Lebanon Sandra was third.'


Figure 4.3 An Example of a Hindi sentence with a NULL VM category with backward gapping.

• Forward Gapping, where the verb (VM) is absent in the clause that occurs after a co-ordinating conjunct, denoted as NULL VM. Figure 4.4 depicts an example of a sentence with forward gapping.

divaalii ke dina jua Kele magara NULL gar me yaa hotala me.
Diwali GEN day gamble play but NULL home in or hotel in.
'Played gamble on Diwali day but was it at home or hotel.'

Figure 4.4 An Example of a Hindi sentence with a NULL VM category with forward gapping.

• Conjunction Ellipsis, where the Conjunction (CC) is absent in the sentence, denoted as NULL CC. Figure 4.5 depicts an example of a sentence with Conjunction Ellipsis; in this case the conjunction “and” is missing from the sentence.

bacce bare ho-ga-ye-hai NULL kisii ki baat nahiin maante.
children big become NULL anyone GEN advice not accept.
'The children have grown big (and) do not listen to anyone.'

4.6 Motivation for Empty Categories

The primary reason why linguistic theories postulate Empty categories is that it allows for simpler descriptions. Often we find that sentences with certain empty elements have essentially the same properties, such as interpretation, case-marking and agreement, as the corresponding sentences where the element is


present in the sentence. In such cases, assuming that the element which is absent in the sentence is in fact realized by an empty category allows for simpler analyses.

Figure 4.5 An Example of a Hindi sentence with a NULL CC category.

Another important reason for postulating Empty categories comes from the demands of Natural language processing applications like Information extraction, question answering, Machine translation and related semantic tasks. The more precise and detailed our predicate argument structures are, the more complete our event descriptions will be, and therefore the more effective our semantic processing techniques will be. As discussed in the earlier section, Empty categories are manually annotated in the dependency treebank. In real-world scenarios, we cannot expect an empty category node to be explicitly marked in a given sentence. In such cases, the dependency tree structures for such sentences may not be the same as those in which the corresponding lexical item is present. Hence the dependency tree structure may not be semantically correct, or the parser output may not give complete tree structures.

The aim is to investigate the problem of automatically detecting Empty categories in sentences using statistical dependency parsing techniques and to shed some light on the challenges of this problem. As data-driven parsing of the Hindi language has achieved good results, we try to use this approach to predict Empty categories in a sentence. In this approach the information about Empty categories is encoded into the label set of the structure. In these experiments we have used only projective sentences from the treebank: non-projectivity makes it difficult to identify the exact position of Empty categories when they are introduced into the sentence.

4.7 Approach

The presence of Empty categories in treebanks and their importance has already been discussed in the earlier sections, as has the work done on identifying these empty categories in the Hindi treebank and other languages. We therefore want to build a system that uses statistical dependency parsing techniques to address the problem of Empty category detection. There are two reasons for choosing this approach:


• Rule based approaches have been tried out earlier on the Hindi treebank, but no statistical approaches have been used for Hindi.

• We have achieved a significant improvement in statistical dependency parsing of Hindi, so we have a parser with good accuracy on hand and a corpus of decent size.

There are three main steps involved in this process.

4.7.1 Pre-Processing

In the first step, we encode information about the presence of Empty categories in a sentence into the dependency relation label set of the sentence. If NULLs are present in a sentence, we remove them from the respective sentence in the treebank. The dependents (children) of a NULL category are attached to the parent of the NULL category, and their respective labels are combined with the dependency label of the NULL category; the combined label indicates the presence of a NULL and also marks such words or tokens as children of the NULL category. Instead of just combining the labels, we also add a sense of direction to the complex label, which indicates whether the position of the NULL is to the right or left of this token in the sentence; subsequently the NULLs are also detached from their parent nodes. A complex label in a sentence therefore indicates the presence of a NULL category in the sentence.

Figure 4.6 Pre Processing

Example: Null-label_r_dep-label is a generic form of a complex label. In this format, 'r' indicates that a NULL instance is to the right of this token, Null-label is the dependency relation label joining the NULL instance and its parent, and dep-label is the dependency relation label joining the current token or word to its parent, which is a NULL instance. Figure 4.6 illustrates this step.
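The encoding step described above can be sketched as follows. This is a minimal illustration, not the actual implementation: token fields follow a CoNLL-style id/form/head/deprel convention, the '_' separator is an assumption, and re-indexing of token ids after NULL removal is omitted.

```python
# Minimal sketch of the pre-processing step: drop each NULL token, re-attach
# its children to the NULL's parent, and fold the NULL's own label plus a
# direction flag into each child's label ("<Null-label>_<dir>_<dep-label>").

def encode_nulls(tokens):
    """tokens: dicts with 1-based 'id', 'form', 'head', 'deprel'."""
    nulls = {t["id"]: t for t in tokens if t["form"] == "NULL"}
    out = []
    for t in tokens:
        if t["id"] in nulls:
            continue                      # the NULL token itself is removed
        if t["head"] in nulls:
            null = nulls[t["head"]]
            direction = "r" if null["id"] > t["id"] else "l"
            t = dict(t, head=null["head"],
                     deprel=f"{null['deprel']}_{direction}_{t['deprel']}")
        out.append(t)
    return out

# Toy sentence: 'piilii' hangs off a NULL, which hangs off the verb.
sent = [{"id": 1, "form": "piilii",    "head": 2, "deprel": "k1"},
        {"id": 2, "form": "NULL",      "head": 3, "deprel": "ccof"},
        {"id": 3, "form": "khariidii", "head": 0, "deprel": "root"}]
print(encode_nulls(sent))
```

After the transformation, 'piilii' attaches directly to the verb with the complex label ccof_r_k1, which records both the NULL's label and the fact that the NULL stood to the token's right.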


Figure 4.7 Process

4.7.2 Data-driven parsing

In the second step, a data-driven parser is trained using the training data (with complex dependency relation labels); when this parser model is used on the test data, it predicts the complex labels in the output. We tried different data-driven parsers, namely Malt [37], Turbo [33] and MST [34], which were shown earlier to perform well for Hindi parsing [29], and found that the Malt parser performs better than the rest on this data with complex labels.

4.7.3 Post-processing

In the final step, post-processing is applied to the output predicted by the parser in the previous step. The presence of NULLs is identified using the complex labels, and their position in the sentence is identified using the sense of direction in these labels (i.e., whether the NULL instance is to the left 'l' or right 'r' of the token). During the insertion of NULLs into the sentence, the projectivity of the sentence must be preserved. Keeping this constraint intact and using the direction information from the dependency relation labels, NULLs are introduced into the sentence. Figure 4.7 illustrates this step.

The exact position where the NULL token is to be inserted is determined as follows:

• Tokens with complex labels are the children of a NULL token: the Children of NULL.

• The parent of a current token with a complex label will be the parent of the NULL category which is introduced: the Parent of NULL.

• Using the direction information in the label (right or left), the NULL is positioned after the span of the current token is completed and before the span of its parent token begins; this gives the Position of the NULL token.

This approach handles the cases where a sentence has more than one NULL category. Cases where there is more than one NULL category and one NULL word occurs as a child of another NULL category are also covered; such cases are seen when the number of words in the sentence is high.
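The decoding step can be sketched in the same spirit. Again this is an illustration under stated assumptions: complex labels have the form null-label_direction_dep-label, plain labels never contain an embedded _l_ or _r_, and the projectivity bookkeeping and token-id renumbering are simplified away.

```python
# Hedged sketch of the post-processing step: a complex label signals a NULL
# whose parent is the flagged token's head; the NULL is placed after the
# token's span for 'r' and before it for 'l'.
import re

COMPLEX = re.compile(r"^(.+)_([lr])_(.+)$")

def decode_nulls(tokens):
    """tokens: dicts with 'form', 'deprel', 'head' (parent dict or None)."""
    out = []
    for t in tokens:
        m = COMPLEX.match(t["deprel"])
        if not m:
            out.append(t)
            continue
        null_label, direction, dep_label = m.groups()
        # Recreate the NULL: it takes over the token's old parent and label.
        null = {"form": "NULL", "deprel": null_label, "head": t["head"]}
        t["deprel"], t["head"] = dep_label, null
        out.extend([t, null] if direction == "r" else [null, t])
    return out

sent = [{"form": "piilii", "deprel": "ccof_r_k1", "head": None},
        {"form": "khariidii", "deprel": "root", "head": None}]
decoded = decode_nulls(sent)
print([t["form"] for t in decoded])   # ['piilii', 'NULL', 'khariidii']
```

The direction flag decides on which side of the flagged token the NULL surfaces; grouping several children of one NULL under a single inserted node would need an extra pass.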

The advantage of using a statistical approach rather than a rule based approach to predict NULLs is that it can easily be used to predict NULLs in other MoR-FWO languages. The problem with this approach is that it cannot handle Empty categories occurring as leaf nodes (or terminal nodes in the dependency tree)
or as root nodes. As we have mentioned earlier, the dependency annotation scheme of the Hindi language does not allow Empty categories to occur as leaf nodes (or terminal nodes). But if Empty categories occur as root nodes in the dependency tree, such cases are not handled by our approach.

4.7.4 Rules

We have formulated some additional rules based on the behavior of the system to boost the accuracy on some categories, and have also identified some lexical cues which aid this task.

• For NULL CC, "," is a lexical cue in the case of a CC with the vibhakti ki or a subordinating conjunct. If there are multiple clauses separated by "," after the "ki" clause, then a NULL CC is inserted in the sentence before the beginning of the last clause.

• In sentences where there is no VG at all, we insert a NULL VG before the end of the sentence.
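The second rule reduces to a simple check. The sketch below is illustrative only; the 'VG*' chunk-tag convention is an assumption about the chunk tagset, not taken from the thesis.

```python
# Hedged sketch of the second rule: if no verb-group chunk appears anywhere
# in the sentence, flag it so a NULL VG can be inserted before sentence end.

def needs_null_vg(chunk_tags):
    """True when the sentence contains no verb-group chunk at all."""
    return not any(tag.startswith("VG") for tag in chunk_tags)

print(needs_null_vg(["NP", "NP", "CCP"]))   # True  -> insert NULL VG
print(needs_null_vg(["NP", "VGF"]))         # False -> a verb group is present
```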

4.8 Experiments and Results

4.8.1 Parser settings

As mentioned earlier, we used the Malt parser for our experiments. We also explored the Turbo parser and the MST parser on this dataset, but Malt performed better than the others.

The Malt parser provides two learning algorithms, LIBSVM and LIBLINEAR. We experimented with both, and LIBSVM gave better results for our data. It also provides various parsing algorithms; we experimented with the nivre-eager, nivre-standard and stack-proj parsing algorithms, and nivre-eager showed the best results in our experiments.

4.8.2 Features and Template

The feature model is the template which governs learning from the given training data. We observed that the feature model used by [28] performs best.

To get the best results in the second step (data-driven parsing), we experimented with the various features provided in the data. [28, 25] showed the best features that can be used in the FEATS column of the CoNLL-X format. These features, vibhakti (post-positional marker), TAM (tense, aspect and modality), and chunk features such as chunk head, chunk distance and chunk boundary information, have proved effective in parsing Hindi, and our results on the overall accuracy of the data are consistent with theirs.
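For illustration, such features might be packed into the FEATS column of a CoNLL-X token line roughly as follows. The key names ('vib', 'tam', 'chunkId') and the sample label are assumptions made for this sketch, not the exact keys of the released data.

```python
# Sketch of packing vibhakti/TAM/chunk features into the CoNLL-X FEATS field
# as '|'-separated key-value pairs (key names are illustrative assumptions).

def feats_column(vibhakti, tam, chunk_head, chunk_id):
    feats = {"vib": vibhakti, "tam": tam,
             "chunkType": "head" if chunk_head else "child",
             "chunkId": chunk_id}
    return "|".join(f"{k}-{v}" for k, v in feats.items())

# ID  FORM  LEMMA  CPOS  POS  FEATS  HEAD  DEPREL (first 8 CoNLL-X columns)
line = "\t".join(["3", "ne", "ne", "PSP", "PSP",
                  feats_column("ne", "0", False, "NP2"), "2", "lwg__psp"])
print(line)
```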


Type of NULL Category    Recall (Rule Based Approach)    Recall (Statistical Approach)
NULL VM                  60.2                            50
NULL CC                  46.8                            69.45
NULL NN                  89.8                            88.89
Total                    69.8                            69.23

Table 4.1 Recall comparison of rule based and statistical approaches.

4.8.3 Data

We have used the Hindi dependency treebank released as part of the COLING-MTPIL 2012 shared task. It is a manually annotated dataset with POS tags, chunks and other information such as gender, number and person. The training set contains 12,041 sentences (2,68,093 words) and the development set contains 1,233 sentences (26,416 words).

The test set contains 1,828 sentences (39,775 words). For the final system, we combined the training and development sets into one dataset and used it for training. We tried the different types of features mentioned above and selected the best feature set by tuning on the development set.

4.8.4 Results and Discussion

The results obtained on the test set are shown below, and the accuracy on each type of Empty category is given in Table 4.1. Table 4.1 also compares the Recall of the rule based approach of [20] and our statistical approach on the different types of empty categories. Note that the Recall reported for the rule based approach was on a different Hindi dataset, smaller in size than the dataset used in our experiments.

The results obtained by using this approach on the test set, including all the Empty category types, are as follows:

Precision = 84.9, Recall = 69.23, F-measure = 76.26

In the computation of the above results, the exact positions of the NULLs in the sentences are not considered.
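As a quick sanity check, the F-measure follows from precision and recall as their harmonic mean:

```python
# F = 2PR / (P + R); with the reported precision and recall this evaluates
# to roughly 76.27, matching the reported F-measure of 76.26 up to rounding.
p, r = 84.9, 69.23
f = 2 * p * r / (p + r)
print(round(f, 2))
```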

These values indicate the efficiency of the system in identifying the presence of Empty categories. Moreover, the approach inserted the NULLs at their exact positions with a precision of more than 85%, i.e., of all the NULL instances it inserted correctly, 85% were inserted at the exact positions in the sentences.

The approach was able to insert NULL NP tokens with good accuracy, but it had a tough time predicting NULL VM tokens. This is consistent with the conclusions of [20] about Empty categories in the Hindi treebank.

In the case of NULL VM categories we have observed some inconsistency in the annotation of these sentences. Among sentences which have multiple clauses with the main verb (VM) token missing, certain
sentences are annotated with a NULL VM for each clause where the main verb (VM) token is missing, and certain sentences are annotated with one NULL VM for all the clauses with a main verb missing. This may be one reason for the drop in accuracy when predicting NULL VM tokens. The main reason for the low accuracy, as we have observed, is that the parser's output on these complex labels is poor: the test data contains 202 complex labels whereas the parser was able to predict only 102 of them, a huge drop in accuracy for complex labels. The overall accuracy of the parser on the test data (only projective sentences) is high: 91.11% (LAS), 95.86% (UAS) and 92.65% (LS). The low accuracy of the parser on complex labels may be due to the small number of such instances compared to the size of the corpus. Another reason may be that the introduction of complex labels significantly increases the size of the label set, making it difficult for the parser to learn the rare labels.


Chapter 5

Conclusions

The whole work can be divided into two major sections:

• Parsing of Hindi Dependency Treebank

• Empty category detection in Treebank

In the first section, we briefly discussed exploring the Malt parser on the Indian languages Hindi, Telugu and Bangla. The section further discussed experiments on Hindi using different data-driven parsers: Malt, MST and Turbo. We explored these three parsers over a large feature pool and with various parser settings, and also explored different approaches using these three parsers, voting and blending, to get the best out of them. We were able to produce the best accuracy on the two Hindi datasets provided as part of the COLING-2012 MTPIL shared task on parsing. The Turbo parser had not previously been explored for Hindi, and it gave high accuracy on one of the datasets provided. We also discussed the drawbacks of the voting approach and why blending is preferred to voting even though voting gives better accuracy. We observed that the voting accuracy improves when the accuracies of the individual parsers are comparable to each other. Our system achieved a best result of 96.50% UAS, 92.90% LA and 91.49% LAS using the voting method on the data with gold POS tags. On the data with automatic POS tags, we achieved a best result of 93.99% UAS, 90.04% LA and 87.84% LAS.

In the second section, we presented a statistical approach to Empty category prediction using data-driven parsing. This section also discussed Empty categories in general and in other treebanks, and discussed in detail the types of Empty categories in the Hindi treebank and their importance in dependency parsing. The motivation for a statistical approach came from the experiments on Hindi dependency parsing discussed earlier in chapter 3. Since we now have a data-driven parser for Hindi with an accuracy above 90%, we decided to map the task of Empty category detection in the Hindi treebank to the task of dependency parsing. We presented an approach where the presence of an empty category is incorporated into the dependency relation label, and discussed the challenges involved in such a task. Using this state-of-the-art parser, we achieved a decent F-score of 76.26 in predicting Empty categories. We explored all the parsers discussed in chapter 3 to identify the most suitable parser for this approach. A statistical approach is preferable to a rule based approach because it can be applied to other MoR-FWO languages, provided the data-driven parser for the respective language has decent accuracy; moreover, such a statistical approach had not been tried for Hindi previously. We can also identify rules based on lexical cues and other morphological and chunk information in the treebank to further improve the accuracy of this task.

5.1 Future Work

In the empty category detection approach, the main reason for the low accuracy we observed is that the parser's prediction accuracy is low for these complex labels. We aim to improve this accuracy and thereby also improve the accuracy of detecting empty categories in sentences. The approach can be further improved by identifying more lexical cues for inserting empty categories in the post-processing step, thus improving the precision of the system. We should also try this approach on a larger dataset with a significant number of NULL instances. The approach should also be applied to other MoR-FWO languages so that performance can be compared across languages. We are also interested in studying the effect of our system's output on parsing.


Appendix A

Appendix


No.   Tag        Name (Description)
1.1   k1         karta (doer/agent/subject)
1.2   pk1        prayojaka karta (causer)
1.3   jk1        prayojya karta (causee)
1.4   mk1        madhyastha karta (mediator-causer)
1.5   k1s        vidheya karta (karta samanadhikarana)
2.1   k2         karma (object/patient)
2.2   k2p        karma (goal, destination)
2.3   k2g        gauna karma (secondary karma)
2.4   k2s        karma samanadhikarana (object complement)
3     k3         karana (instrument)
4.1   k4         sampradaana (recipient)
4.2   k4a        anubhava karta (experiencer)
5.1   k5         apaadaana (source)
5.2   k5prk      prakruti apadana (source material in verbs denoting change of state)
6.1   k7t        kaalaadhikarana (location in time)
6.2   k7p        deshadhikarana (location in space)
6.3   k7         vishayaadhikarana (location elsewhere)
7     k*u        saadrishya (similarity)
8.1   r6         shashthi (possessive)
8.2   r6-k1      karta or karma of a conjunct
8.3   r6v        (kA relation between a noun and a verb)
9     adv        kriyaavisheshana (manner adverbs only)
10    sent-adv   sentential adverbs
11    rd         prati (direction)
12    rh         hetu (cause-effect)
13    rt         taadarthya (purpose)
14.1  ras-k*     upapada sahakaarakatwa (associative)
14.2  ras-neg    negation in associatives
15    rs         relation samanadhikaran (noun elaboration)
16    rsp        relation for duratives
17    rad        address words
18    nmod relc, jjmod relc, rbmod relc   relative clauses, jo-vo constructions
19    nmod       noun modifier (including participles)
20    vmod       verb modifier
21    jjmod      modifiers of the adjectives
22    pof        part-of relation
23    ccof       conjunct-of relation
24    fragof     fragment of
25    enm        enumerator

Table A.1 Important karaka relations prescribed in Panini’s framework.


Related Publications

• Kukkadapu, Puneeth, Deepak Kumar Malladi, and Aswarth Dara. “Ensembling various dependency parsers: Adopting Turbo parser for Indian languages.” 24th International Conference on Computational Linguistics. 2012.

• Kukkadapu, Puneeth, and Prashanth Mannem. “A statistical approach to prediction of empty categories in Hindi dependency treebank.” Fourth Workshop on Statistical Parsing of Morphologically Rich Languages. 2013.

• Kosaraju, Prudhvi, Sruthilaya Reddy Kesidi, Vinay Bhargav Reddy Ainavolu, and Puneeth Kukkadapu. “Experiments on Indian language dependency parsing.” Proceedings of the ICON10 NLP Tools Contest: Indian Language Dependency Parsing. 2010.

