named entity recognition

Named Entity Recognition Sobha Lalitha Devi AU-KBC Research Centre Chennai


Post on 17-Jan-2016


Page 1: Named Entity Recognition

Named Entity Recognition

Sobha Lalitha Devi

AU-KBC Research Centre, Chennai

Page 2: Named Entity Recognition

Named Entity(NE) Recognition

• What is an NE and what is not an NE
• How to identify NE
• Tagset and Annotation Guidelines
• Methods used in developing NER

04/21/23 2IIIT Summer School

Page 3: Named Entity Recognition

Why do NER?

• Key part of an Information Extraction system
• Robust handling of proper names is essential for many applications such as summarization, IR, anaphora resolution, ...
• Pre-processing for different classification levels
• Information filtering
• Information linking

Page 4: Named Entity Recognition

What is NER?

• NER involves the identification of proper names in texts and their classification into a set of predefined categories of interest.
• Three universally accepted categories: person, location and organisation.
• Other common tasks: recognition of date/time expressions, measures (percent, money, weight etc.), email addresses etc.
• Other domain-specific entities: names of drugs, genes, medical conditions, names of ships, bibliographic references etc.

Page 5: Named Entity Recognition


NER Definition

• Named entity recognition (NER), also known as entity identification (EI) or entity extraction, is the task of locating atomic elements in text and classifying them into predefined categories such as names of persons, organizations and locations, expressions of time, quantities, monetary values, percentages, etc.

John sold 5 companies in 2002.

<ENAMEX TYPE="PERSON">John</ENAMEX> sold <NUMEX TYPE="QUANTITY">5</NUMEX> companies in <TIMEX TYPE="DATE">2002</TIMEX>.
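The tagged sentence above can be reproduced with a few lines of Python. This is only a sketch: the character offsets are written by hand (a real NER system would predict them), and the function name `tag_text` is made up for illustration.

```python
from typing import List, Tuple

# (start, end, element, type): character offsets into the sentence.
# The offsets are hand-written here; predicting them is the actual NER task.
Span = Tuple[int, int, str, str]

def tag_text(text: str, spans: List[Span]) -> str:
    """Wrap each non-overlapping span in a MUC-style element."""
    out = []
    cursor = 0
    for start, end, element, etype in sorted(spans):
        out.append(text[cursor:start])  # untouched text before the entity
        out.append(f'<{element} TYPE="{etype}">{text[start:end]}</{element}>')
        cursor = end
    out.append(text[cursor:])  # trailing text after the last entity
    return "".join(out)

sentence = "John sold 5 companies in 2002."
spans = [
    (0, 4, "ENAMEX", "PERSON"),
    (10, 11, "NUMEX", "QUANTITY"),
    (25, 29, "TIMEX", "DATE"),
]
tagged = tag_text(sentence, spans)
print(tagged)
```

The sketch assumes the spans do not overlap; nested entities need a different representation.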

Page 6: Named Entity Recognition

What is not NER?

• NER is not event recognition.
• NER does not create templates.
• NER does not perform co-reference or entity linking,
– though these processes are often implemented alongside NER as part of a larger IE system.
• NER is not just matching text strings against pre-defined lists of names. It recognises entities which are being used as entities in a given context.
• NER is not an easy task!

Page 7: Named Entity Recognition

Named Entity and Philosophy of Language

• Proper names are defined by
– Descriptivist theory of names
• Frege, Russell, Ludwig Wittgenstein and John Searle
– Causal theory of reference
• Saul Kripke

Page 8: Named Entity Recognition

Descriptivist's theory of Names

Proper names either are synonymous with descriptions, or have their reference determined by virtue of the name's being associated with a description or cluster of descriptions that an object uniquely satisfies.

Causal theory of Reference

Proper names refer to an object by virtue of a causal connection with the object, as mediated through communities of speakers. That is, proper names, in contrast to descriptions, are rigid designators.

Rigid designators: a proper name refers to the named object in every possible world in which the object exists.

Descriptions: a description may designate different objects in different possible worlds.

Page 9: Named Entity Recognition

Proper Names and Definite Descriptions

• In a sentence involving a proper name, the name can be replaced by a contextually appropriate description.

e.g.: Otto von Bismarck can be known or described as the first Chancellor of the German Empire.

Kripke argues that definite descriptions are generally not rigid designators, because a description need not pick out the same object in every possible world.

More on Kripke's view of proper names in Naming and Necessity (1980).

Page 10: Named Entity Recognition

What is a Named Entity

• Named Entities are
– A noun phrase
– Rigid designators: a rigid designator denotes the same thing in all possible worlds in which that thing exists, and does not designate anything else in those possible worlds in which that thing does not exist

Page 11: Named Entity Recognition

Examples: not a Named Entity & a Named Entity

• Hotel & Taj Hotel

• Flower & Rose Flower

• Beach & Kovalam Beach

• Airport & Indira Gandhi International airport

• The School & Good Shepherd School

• Prime Minister & Mr. Manmohan Singh

Page 12: Named Entity Recognition

Some problems in identifying NE

• Variation of NEs
– Manmohan Singh, Manmohan, Dr. Manmohan Singh
• Ambiguity of NE types
– 1945 (date vs. time)
– Washington (location vs. person)
– May (person vs. month)
– Tata (person vs. organization)

Page 13: Named Entity Recognition

Ambiguity Examples

• Person vs Location
– Sir C. P. Ramaswamy was the Divan of Travancore (Per)
– Sir C. P. Ramaswamy Road is in Chennai (Loc)
• Person vs Organization
– Anil Ambani opened Reliance Fresh (Per)
– Reliance Fresh is under Anil Ambani Group Ltd (Org)

Page 14: Named Entity Recognition

More complex problems in NER

Issues of style, structure, domain, genre etc.
– Punctuation, spelling, spacing, formatting, ... all have an impact

Dept. of Computing and Information Science
Manchester Metropolitan University
Manchester
United Kingdom

> Tell me more about Leonardo
> Da Vinci

Page 15: Named Entity Recognition

Problems in NE Task Definition

• Category definitions are intuitively quite clear, but there are many grey areas.

• Many of these grey areas are caused by metonymy.

Person vs. Artefact
Organisation vs. Location
Company vs. Artefact
Location vs. Organisation

Page 16: Named Entity Recognition

Tagset for Named Entity

• The ACE tagset is hierarchical
– ACE: Automatic Content Extraction
• The CLIA tagset is hierarchical, similar to ACE
– Developed for two domains
• Tourism and Health

Page 17: Named Entity Recognition

TAGSET

• ENAMEX
– Person
• Individual
– Family name
– Title
• Group
– Organization
• Government
• Public/private company
• Religious
• Non-government
– Political Party
– Para military
– Charitable
– Association
• GPE (Geo-political Social Entity)
• Media
– Location
• Place
– District
– City
– State
– Nation
– Continent
• Address
• Water-bodies
• Landscapes
• Celestial Bodies
– Manmade
» Religious Places
» Roads/Highways
» Museum
» Theme parks/Parks/Gardens
» Monuments
• Facilities
– Hospitals
• Institutes
• Library
– Hotel/Restaurants/Lodges
– Plant/Factories
– Police Station/Fire Services
– Public Comfort Stations
– Airports
– Ports
– Bus-Stations
• Locomotives
• Artifacts
– Implements
– Ammunition
– Paintings
– Sculptures
– Clothes
– Gems & Stones
• Entertainment
– Dance
– Music
– Drama/Cinema
– Sports
– Events/Exhibitions/Conferences
• Cuisines
• Animals
• Plants

Page 18: Named Entity Recognition

Tagset Continued

• NUMEX
– Distance
– Money
– Quantity
– Count
• TIMEX
– Time
– Date
– Day
– Period

Tagset Counts

First-level tags: 3

Second-level tags: 43

Third-level tags: 40

Total: 86

Page 19: Named Entity Recognition

How to Annotate

• 1. ENAMEX
– 1.1 Person
• 1.1.1 Individual

These refer to the names of individual persons, including the names of fictional characters found in stories, novels etc.

Tag Structure: <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL"> abc </ENAMEX>

Examples:

English: <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">Abdul Kalam</ENAMEX>

Page 20: Named Entity Recognition

Annotation continued

1.1.1.1 Family name

A person name often includes a family name. Whenever an individual name occurs with a family name, the part of the name that is the family name must be tagged specifically with the subtag "FAMILYNAME", nested inside the tag for the full name.

Tag Structure: <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL" SUBTYPE_2="FAMILYNAME"> abc </ENAMEX>

Examples:

English: <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">Lalu Prasad <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL" SUBTYPE_2="FAMILYNAME">Yadav</ENAMEX></ENAMEX>
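The nesting rule for family names can be checked mechanically. The sketch below builds the nested annotation as a string and parses it with Python's standard XML parser to confirm it is well formed. The helper name `annotate_person` is hypothetical, and the sketch assumes names contain no XML special characters such as `<` or `&`.

```python
import xml.etree.ElementTree as ET

def annotate_person(given: str, family: str) -> str:
    """Nest a FAMILYNAME element inside the INDIVIDUAL element."""
    inner = (
        '<ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL" '
        f'SUBTYPE_2="FAMILYNAME">{family}</ENAMEX>'
    )
    return (
        '<ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">'
        f"{given} {inner}</ENAMEX>"
    )

markup = annotate_person("Lalu Prasad", "Yadav")

# ET.fromstring raises ParseError if the tags are not properly nested.
root = ET.fromstring(markup)
full_name = "".join(root.itertext())
family_names = [
    element.text
    for element in root.iter("ENAMEX")
    if element.get("SUBTYPE_2") == "FAMILYNAME"
]
print(full_name, family_names)
```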

Page 21: Named Entity Recognition

NE Types

NE TYPES: ENAMEX, NUMEX, TIMEX

The Named Entity hierarchy is divided into three major classes: Entity Name, Time and Numerical expressions.

Page 22: Named Entity Recognition

Entity Types


Page 23: Named Entity Recognition

Persons are entities limited to humans. A person may be a single individual or a group. Individual refers to the name of an individual person. Group refers to a set of individuals.

Location entities are limited to geographical entities such as countries, cities, continents and landmasses, bodies of water, and geological formations.

Organization entities are limited to corporations, agencies, and other groups of people defined by an established organizational structure.

Entity Name Types

Page 24: Named Entity Recognition

En: [Sita]PERSON is working at [HCL]ORGANIZATION, which is in [Chennai] LOCATION

Ta: [Seetha] PERSON [chennaiyilrukkira] LOCATION [HCLlil] ORGANIZATION velaiseikirAl.
En: Sita Chennai HCL Working

Ml: [Seetha] PERSON [chennaiyillula] LOCATION [HCLlil] ORGANIZATION jolicheyyunnu.
En: Sita Chennai HCL Working

Hi: [Seetha] PERSON [HCL] ORGANIZATION main kaam kar raha hai, jo [chennai] LOCATION main hain.
En: Sita HCL work is which Chennai in

Examples for Entity Name Types

Page 25: Named Entity Recognition

Facilities are limited to buildings and other permanent man-made structures and real estate improvements such as hospitals, airports, colleges, libraries etc.

En: [Appolo Hospital] FACILITY is in [Chennai] LOCATION

Ta: [Appallo maruthuvamanAi]FACILITY [Chennaiyil]LOCATION irukkirathu

Ml: [Appolo Asupathri]FACILITY [chennaiyil]LOCATION aaN

Hi: [Appolo aspathaal]FACILITY [chennai]LOCATION mein haim.

Entity Name Types

Page 26: Named Entity Recognition

A locomotive entity is a physical device primarily designed to move an object from one location to another, by carrying, pulling, or pushing the transported object.

En: [Ananthapuri Express]LOCOMOTIVE departs from [Chennai] LOCATION at [7.30pm] TIME.

Hi: [Ananthapuri express] LOCOMOTIVE [Chennai] LOCATION se [rAth 7.30] TIME ko ravana hoga

Ml: [Ananthapuri eksprass] LOCOMOTIVE [chennaiyilninn] LOCATION [raathri 7.30 maNikk] TIME puRappetum.

Ta: [Ananthapuri viraivu rayil] LOCOMOTIVE [chennaiyilirunthu] LOCATION [iRavu 7.30 maNikku] TIME puRappatukirathu

Entity Name Types

Page 27: Named Entity Recognition

Artifact entities are objects or things produced or shaped by human craft, such as tools, weapons/ammunition, art paintings, clothes, ornaments and medicines.

En: [Vinayaga Statue] ARTIFACT is looking beautiful

Ta: [Vinayakarin Silai] ARTIFACT pArpatharkku alakAkAkairukkirathu

Ml: [ganapathi vigraham]ARTIFACT baMgiyaayi irikkunnu.

Hi: [Vinayaka moorthi] ARTIFACT achi lagh rahi haim.

Entity Name Types


Page 28: Named Entity Recognition

Entertainment entities denote activities that are diverting and hold human attention or interest, giving pleasure, happiness or amusement, especially performances of some kind such as dance, music, sports and events.

En: [Flower Exhibition] ENTERTAINMENT is held at [Hyderabad]LOCATION

Ta: [Malar kankAtchi] ENTERTAINMENT [hyderabaadil]LOCATION Nadaiperukirathu

Ml: [pushpa pradarshanam] ENTERTAINMENT [hyderabaadil] LOCATION natakkunnu

Hi: [phool pradarshnii] ENTERTAINMENT [hyderabad] LOCATION meN Ayojith kiyaa jAthA hai

Entity Name Types

Page 29: Named Entity Recognition

Materials refer to the names of food items, cuisines, chemicals and cosmetics.

En: [Honey]MATERIALS is good for face

Ta: [ThEn]MATERIALS mukaththiRku nallathu

Ml: [Madhu] MATERIALS mukaththinu nallathAN

Hi: [Shahad] MATERIALS chehare ke liye achcha hai.

Entity Name Types


Page 30: Named Entity Recognition

Organisms: These are the names of different animal species, including birds and reptiles, of viruses and bacteria, and of herbs, medicinal plants, shrubs, trees, fruits, flowers etc.

En: [Peacock] ORGANISM is the national bird of [India] LOCATION

Ta: [Mayil] ORGANISM [InthiyAvin] LOCATION thEciyappaRavai Akum.

Ml: [Mayil] ORGANISM [indyayute] LOCATION raashtrapakshi AN.

Hi: [Mor] ORGANISM [bhaarath] LOCATION kaa raashtrIya pakshi hai.

Entity Name Types


Page 31: Named Entity Recognition

Disease: Names of diseases, symptoms, diagnoses and treatments come under this type.

En: Smoking causes [Cancer] DISEASE

Ta: PukaippithithalAl [puRRuNoi] DISEASE varukiRathu

Ml : pukavali [aRbhudham] DISEASE uNtAkkunnu

Hi: dhumrapan [kaansar] DISEASE ka kaaraN banaatha hai.

Entity Name Types


Page 32: Named Entity Recognition

Numerical Expressions

NUMEX: DISTANCE, QUANTITY, COUNT, MONEY

Page 33: Named Entity Recognition

Distance refers to distance measures such as kilometers, centimeters, meters, acres, feet etc.
Example: 10 cm., twenty feet, 15 hectares

Money specifies different currency values such as rupee, euro, dinar, dollar etc.
Example: Rs. 1000, 250 Euro, $160

Count denotes the number (or count) of items, articles, things etc.
Example: 5 subjects, 12 students, 20 books

Quantity: measurements such as liters, tons, grams, volts etc. come under this category.
Example: 20 litres, 22 kg, 50g, 100 volts
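The four NUMEX subtypes lend themselves to simple pattern rules. The sketch below is one possible set of regular expressions. The unit lists are assumptions taken from the examples above, not the guideline's full inventory, and they follow the slide in filing acres and hectares under Distance.

```python
import re

NUMBER = r"\d[\d,]*(?:\.\d+)?"

# Rules are tried in order; the first match wins. COUNT is last because
# "number + any word" would otherwise swallow the unit-based types.
NUMEX_RULES = [
    ("MONEY", re.compile(
        rf"^(?:(?:Rs\.?|\$)\s*{NUMBER}|{NUMBER}\s*(?:rupees?|euros?|dinars?|dollars?))$",
        re.I)),
    ("DISTANCE", re.compile(
        r"^\w+\s*(?:km|kilometers?|cm\.?|centimeters?|meters?|acres?|feet|hectares?)$",
        re.I)),
    ("QUANTITY", re.compile(
        rf"^{NUMBER}\s*(?:litres?|liters?|tons?|kg|g|grams?|volts?)$",
        re.I)),
    ("COUNT", re.compile(r"^\d+\s+[A-Za-z]+$")),
]

def classify_numex(expression):
    """Return the NUMEX subtype of an expression, or None if no rule fires."""
    expression = expression.strip()
    for label, pattern in NUMEX_RULES:
        if pattern.match(expression):
            return label
    return None

for example in ["10 cm.", "Rs. 1000", "12 students", "50g"]:
    print(example, "->", classify_numex(example))
```

Rules like these only classify an expression that has already been isolated; finding its boundaries in running text is a separate step.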

Numerical Expressions


Page 34: Named Entity Recognition

Time Expressions

TIMEX: MONTH, DATE, TIME, YEAR, PERIOD, DAY, SPECIAL DAY

Page 35: Named Entity Recognition

Temporal expressions are entities that refer to time, date, year, month and day.

Time: These refer to expressions of time in their different forms, including hours, minutes and seconds.
Example
5 o'clock in the morning
9.30 a.m.
Evening 6.30 p.m.

Date: This refers to expressions of date in different forms, such as 13/12/2001. This also includes month, date and year.
Example
August 15 1947
1956 September 11

Temporal Expressions

Page 36: Named Entity Recognition

Day: These are expressions that convey days in a year. They can also include days occurring weekly, fortnightly, monthly, quarterly, biennially etc.

Example
Sunday
Tomorrow
Today
Yesterday

Special Day: refers to special days in a year

Example
Gandhi Jayanthi
Rama Navami

Temporal Expressions

Page 37: Named Entity Recognition

Period: refers to expressions that express a duration of time, time periods or time intervals.

Example
17th century
10 minutes
10 a.m. to 12 p.m.
One year

Temporal Expressions

Page 38: Named Entity Recognition

Methodologies

Methods:

1) Rule Based

2) Machine Learning
– Hidden Markov Model (HMM)
– Naïve Bayes Classifier
– Maximum Entropy Markov Model (MEMM)
– Conditional Random Fields (CRF)

3) Hybrid Approach

Page 39: Named Entity Recognition

The following are the major challenges encountered in Indian languages:
• Agglutination
• Ambiguity
– Between proper and common nouns
– Between named entities
• Lack of capitalization

Challenges of NER in Indian Languages

Page 40: Named Entity Recognition

Agglutination

In Dravidian languages, words consist of a lexical root to which one or more affixes are attached.

Example in Tamil:

1) Ta: Ramanaiththavira

(other than Raman)

2) Ta: Cevvaiyandru

(On Tuesday)

3) Ta: Inthiyavilllula

(In India)

4) Ta: KannanaippaRRikkondu

(hold onto Kannan)

Challenges of NER in Indian Languages


Page 41: Named Entity Recognition

Example in Malayalam:

1) Ml: hemayiluNtaayirunna

(that which Hema have)

2) Ml: Chennaiyilethunna

(reach in Chennai)

3) Ml: arabikatalinaBimukhamaayi

(towards the arabian sea)

4) Ml: kaaSiyilekkozhukunna

( flowing towards kaaSi)
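Because the name is fused with its affixes, an exact dictionary lookup on the whole token fails. One simple workaround, sketched below, is to look for the longest gazetteer entry that is a prefix of the token. The gazetteer contents and the function name `find_root` are illustrative assumptions. The sketch ignores sandhi changes at the root boundary and matches case-insensitively, which discards distinctions that this transliteration scheme encodes through capital letters.

```python
# Toy gazetteer (assumed for illustration); keys are lower-cased roots.
GAZETTEER = {
    "chennai": "LOCATION",
    "hema": "PERSON",
    "raman": "PERSON",
}

def find_root(token):
    """Return (root, label, remainder) for the longest gazetteer entry
    that is a prefix of the token, or None if there is none."""
    lowered = token.lower()
    for end in range(len(lowered), 0, -1):  # try longest prefixes first
        candidate = lowered[:end]
        if candidate in GAZETTEER:
            return candidate, GAZETTEER[candidate], token[end:]
    return None

for token in ["Chennaiyilethunna", "hemayiluNtaayirunna", "Ramanaiththavira"]:
    print(token, "->", find_root(token))
```

Prefix matching over-generates: any common word that happens to begin with a gazetteer entry is also matched, so in practice a morphological analyser or contextual features are still needed.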

Challenges of NER in Indian Languages


Page 42: Named Entity Recognition

Ambiguity

Indian languages suffer comparatively more from the ambiguity that exists between common and proper nouns and among named entities themselves. In some cases the same word can refer to different named entity types. Such instances can be recognized using contextual information.

Examples:
Hi: Akash - Person name and Sky
Hi: Sooraj - Person name and Sun
Hi: Chaanth – Moon and Silver
Hi: Aam – Mango and Common
Ml: Roopa – Person name and Rupee
Ml: Madhu – Person name and Honey
Ml: Mala – Person name and Garland

Challenges of NER in Indian Languages

Page 43: Named Entity Recognition

Ta: Thinkal - Day and Month

Ta: Malar - Person name and Flower

Ta: Chevvai - Day and planet

Ta: Shakthi – Person name and Power

Ta: MAlai – Evening and Garland

Ta & Ml: Velli – Silver, Planet, Day

Challenges of NER in Indian Languages


Page 44: Named Entity Recognition

Spelling Variation: Because of differing writing styles, the same entity is represented by various word forms. In Tamil, Sanskrit letters such as “ja”, “sha”, “sri” and “Ha” are replaced by “sa”, “ciri” and “ka”.

Example:

Roja can be written as Rosa

Srimathi - cirimathi

Raja - rasa

ShajahAn - sajakAn
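One way to cope with such variants is to map every spelling to a canonical key before comparing names. The sketch below applies the replacements listed above in a fixed order. Reading the list as ja/sha → sa, sri → ciri and ha → ka is an assumption inferred from the examples, and lower-casing discards the long-vowel marking of the transliteration.

```python
# Replacement pairs, applied in this order ("sha" before "ja" and "ha",
# so that "sha" is not split up by the shorter patterns).
REPLACEMENTS = [("sri", "ciri"), ("sha", "sa"), ("ja", "sa"), ("ha", "ka")]

def variant_key(name):
    """Map a transliterated name to a key shared by its spelling variants."""
    key = name.lower()
    for source, target in REPLACEMENTS:
        key = key.replace(source, target)
    return key

pairs = [("Roja", "Rosa"), ("Srimathi", "cirimathi"),
         ("Raja", "rasa"), ("ShajahAn", "sajakAn")]
for original, variant in pairs:
    print(original, variant, variant_key(original) == variant_key(variant))
```

The key is only a comparison device, not a proposed spelling: "ShajahAn" and "sajakAn" both map to "sasakan", which is neither of the attested forms.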

Challenges of NER in Indian Languages


Page 45: Named Entity Recognition

Lack of Capitalization

In English and some other European languages, capitalization is an important feature for identifying proper nouns and plays a major role in NE identification. Indian languages, unlike English, have no concept of capitalization.

Challenges of NER in Indian Languages

Page 46: Named Entity Recognition

Nested Entities: These are named entities that occur within other named entities. They are also called embedded entities.

Ta: [[Mathurai] LOCATION [MeenAtchi Amman]PERSON Koyil]RELPLACE

En: Mathurai Meenatchi Amman Temple

Ml: [[Nittoor] PERSON Srinivasa rao] PERSON

En : Nitoor Srinivasa rao

Hi: [[Rajeev] PERSON MArg] ROAD

En : Rajeev Road
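The bracket notation used in these examples can be turned into labelled character spans with a small stack-based parser. This is a sketch that assumes well-formed input in which every closing bracket is followed by an upper-case label. The function name `parse_nested` is hypothetical.

```python
import re

# An opening bracket, a closing bracket followed by its label, or plain text.
TOKEN = re.compile(r"\[|\]\s*([A-Z]+)|[^\[\]]+")

def parse_nested(annotated):
    """Return (plain_text, spans); spans are (start, end, label) offsets
    into plain_text, inner entities listed before the ones enclosing them."""
    text_parts = []
    length = 0          # length of the plain text collected so far
    open_starts = []    # stack of start offsets for currently open brackets
    spans = []
    for match in TOKEN.finditer(annotated):
        token = match.group(0)
        if token == "[":
            open_starts.append(length)
        elif token.startswith("]"):
            spans.append((open_starts.pop(), length, match.group(1)))
        else:
            text_parts.append(token)
            length += len(token)
    return "".join(text_parts), spans

text, spans = parse_nested(
    "[[Mathurai] LOCATION [MeenAtchi Amman]PERSON Koyil]RELPLACE"
)
for start, end, label in spans:
    print(label, repr(text[start:end]))
```

Keeping spans as offsets rather than inline tags lets the outer and inner entities coexist without one overwriting the other.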

Nested Entities


Page 47: Named Entity Recognition

Approaches in Named Entity Recognition

• Dictionary Look-up

• Rule based ( Using lexical, contextual and morphological information)

• Maximum entropy theory based

• Hidden Markov Model

• Conditional Random Fields

• Hybrid methods (Statistical + Linguistic)

Page 48: Named Entity Recognition

Dictionary (Gazetteers) Look-up Approach

• Uses dictionaries (gazetteers) for identifying NEs
• Gazetteer contains NEs from all domains
• Advantages
– Very simple approach
– Gives very high precision
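A gazetteer look-up can be as small as the sketch below, which scans a token list and greedily prefers the longest matching entry. The gazetteer contents, the example sentence and the `max_len` limit are assumptions made for illustration.

```python
def gazetteer_tag(tokens, gazetteer, max_len=4):
    """Greedy longest-match look-up; returns (start, end, label) token spans."""
    spans = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in gazetteer:
                spans.append((i, i + n, gazetteer[phrase]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return spans

GAZETTEER = {
    "taj hotel": "FACILITY",
    "kovalam beach": "LOCATION",
    "chennai": "LOCATION",
}

tokens = "We stayed at Taj Hotel in Chennai".split()
matches = gazetteer_tag(tokens, GAZETTEER)
print(matches)
```

The look-up is blind to context: an entry such as "May" or "Washington" would receive the same label wherever it occurs, which is why string matching alone is not NER.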

Page 49: Named Entity Recognition

Disadvantages of Dictionary Approach

• Preparing an exhaustive dictionary is a tedious and expensive process.

• The dictionary must cover the different spellings of the same name.

Page 50: Named Entity Recognition

Rule Based Approach

• Rule Based System
– Needs more rules to tag all kinds of NE
• Advantages:
– Rich and expressive rules
– Good results
• Disadvantages:
– Requires extensive experience and grammatical knowledge
– Experts to craft rules are expensive
– Highly domain specific (not portable to a new domain)

Page 51: Named Entity Recognition

General difficulties

"Italy's business world was rocked by the announcement last Thursday that Mr. Verdi would leave his job as vice-president of Music Masters of Milan, Inc. to become operations director of Arthur Andersen."

• Capitalization is useless for the first word
• 's is not part of the name "Italy"
• Date is "last Thursday", not "Thursday"
• Milan is location, not organization
• Arthur Andersen is organization, not person

Page 52: Named Entity Recognition

Rules success and failure

Title Capitalized_Word → Title Person_Name
Correct: Mr. Jones
Incorrect: Mrs. Field's Cookies (corporation)

Month_name number_less_than_32 → Date
Correct: February 28
Incorrect: Long March 3 (a Chinese rocket)

From Date to Date → Date
Correct: from August 3 to August 9
Incorrect: I moved my trip from April to June (two separate dates)


Page 53: Named Entity Recognition

Statistical based approach

• Need to identify features
• Feature selection has to be correct for all types of NE
• Development of a tagged corpus
• The corpus should contain all types of tags in appropriate numbers
• A domain-based corpus has to be generated.
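As an illustration of what "identifying features" means in practice, the sketch below turns each token into a feature dictionary of the kind a sequence classifier could consume. The particular features, the POS tags and the gazetteer are assumptions chosen for the example, not a recommended feature set.

```python
def token_features(tokens, pos_tags, i, gazetteer):
    """Build a feature dictionary for the token at position i."""
    word = tokens[i].lower()
    return {
        "word": word,
        "pos": pos_tags[i],
        "word+pos": f"{word}/{pos_tags[i]}",
        "suffix3": word[-3:],              # crude stand-in for morphology
        "in_gazetteer": word in gazetteer,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "<EOS>",
    }

tokens = ["Sita", "is", "working", "at", "HCL"]
pos_tags = ["NNP", "VBZ", "VBG", "IN", "NNP"]   # hand-assigned for the example
gazetteer = {"hcl", "chennai"}

features = [token_features(tokens, pos_tags, i, gazetteer)
            for i in range(len(tokens))]
print(features[4])
```

Capitalization is deliberately absent from the features, since it is not available in Indian-language text.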


Page 54: Named Entity Recognition

Automated approaches

Address the drawbacks of hand-coded systems

Automated training
• Human-annotated training data (with desired output standards)
• Annotation requires less effort and expertise than hand-coding rules
• Annotation accuracy
• Two annotators for checking, third annotator to resolve disputes
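The two-plus-one annotation scheme can be sketched as follows: labels on which the first two annotators agree are kept, and the third annotator's label is used wherever they differ. The label sequences are invented for the example, and "O" for "not an entity" is an assumed convention.

```python
def adjudicate(ann_a, ann_b, ann_c):
    """Merge two annotations token by token; the third resolves disputes.
    Returns (final_labels, disputed_positions)."""
    final, disputed = [], []
    for i, (a, b, c) in enumerate(zip(ann_a, ann_b, ann_c)):
        if a == b:
            final.append(a)
        else:
            final.append(c)
            disputed.append(i)
    return final, disputed

def agreement_rate(ann_a, ann_b):
    """Fraction of tokens on which the two annotators agree."""
    same = sum(a == b for a, b in zip(ann_a, ann_b))
    return same / len(ann_a)

ann_a = ["PERSON", "O", "LOCATION", "O"]
ann_b = ["PERSON", "O", "ORGANIZATION", "O"]
ann_c = ["PERSON", "O", "LOCATION", "O"]

final, disputed = adjudicate(ann_a, ann_b, ann_c)
print(final, disputed, agreement_rate(ann_a, ann_b))
```

Raw agreement is the simplest accuracy check; it does not correct for agreement that would occur by chance.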


Page 55: Named Entity Recognition

Literature Survey

1) Named Entity Recognition was one of the tasks defined in the Message Understanding Conference (MUC) 6.

2) A survey on Named Entity Recognition was done by Nadeau and Sekine (2007).

3) Techniques used include:
- rule based technique by Krupka (1998)
- maximum entropy by Borthwick (1998)
- Hidden Markov Model by Bikel (1997)
- bootstrapping approach using concept based seeds (Niu et al., 2003)
- hybrid approaches, such as rule based tagging for certain entities such as date, time and percentage, and a maximum entropy based approach for entities like location and organization (Srihari et al., 2000)

4) The Stanford NER software (Finkel et al., 2005) uses linear chain CRFs in its NER engine. It identifies three classes of NEs, viz., Person, Organization and Location.

Page 56: Named Entity Recognition

Arulmozhi, P. and Sobha, L. (2006). HMM-based Part of Speech Tagger for Relatively Free Word Order Language. Advances in Natural Language Processing, Research in Computing Science Journal, Mexico, Volume 18, pp. 37-48.

Bikel, D. M., Miller, S., Schwartz, R. and Weischedel, R. (1997). Nymble: A high-performance learning name-finder. In Fifth Conference on Applied Natural Language Processing. pp. 194-201.

Borthwick, A., Sterling, J., Agichtein, E. and Grishman, R. (1998). Description of the MENE Named Entity System. In Seventh Message Understanding Conference (MUC-7).

Chen, W. Zhang, Y. and Isahara, H. (2006). Chinese Named Entity Recognition with Conditional Random Fields. In Fifth SIGHAN Workshop on Chinese Language Processing, Sydney. pp.118-121.

Ekbal, A. Bandyopadhyay, S. (2009). A Conditional Random Field Approach for Named Entity Recognition in Bengali and Hindi. Linguistic Issues in Language Technology, 2(1). pp.1-44.

References


Page 57: Named Entity Recognition

Finkel, J. N. Grenager, T. and Manning, C. (2005). Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005). pp. 363-370.

Finkel, J. Dingare, S. Nguyen, H. Nissim, M. Sinclair, G. and Manning, C. (2004). Exploiting Context for Biomedical Entity Recognition: from Syntax to the Web. In Joint Workshop on Natural Language Processing in Biomedicine and its Applications, (NLPBA), Geneva, Switzerland.

Gali, K. Surana, H. Vaidya, A. Shishtla, P. Sharma, D. M. (2008). Aggregating Machine Learning and Rule Based Heuristics for Named Entity Recognition. In Workshop on NER for South and South East Asian Languages, IJCNLP-08, Hyderabad, India.

Kumar, K. N. Santosh, G. S. K. Varma, V. (2011). A Language-Independent Approach to Identify the Named Entities in under-resourced languages and Clustering Multilingual Documents. In International Conference on Multilingual and Multimodal Information Access Evaluation, University of Amsterdam, Netherlands.

Lafferty, J. McCallum, A. Pereira, F. (2001). Conditional Random Fields for segmenting and labeling sequence data. In ICML-01, pp. 282-289.

Loinaz, I.A. Uriarte, O. A. Ramos, N. E. Castro, M. I. F. D (2006). Lessons from the Development of Named Entity Recognizer for Basque. Natural Language Processing, 36. pp. 25 – 37.

McCallum, A. and Li, W. (2003). Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. In Seventh Conference on Natural Language Learning (CoNLL).

References


Page 58: Named Entity Recognition

Nadeau, David and Sekine, S. (2007) A survey of named entity recognition and classification. Linguisticae Investigationes 30(1). pp.3–26.

Niu, C. Li, W. Ding, J. Srihari, R. K. (2003). Bootstrapping for Named Entity Tagging using Concept-based Seeds. In HLT-NAACL’03, Companion Volume, Edmonton, AT. pp.73-75.

Pandian, S. Lakshmana, Geetha, T. V. and Krishna. (2007). Named Entity Recognition in Tamil using Context-cues and the E-M algorithm. In the Proceedings of the 3rd Indian International Conference on Artificial Intelligence, Pune, India. pp. 1951 -1958.

Sasidhar, B., Yohan, P.M., Babu, V.A., Govarhan, A.(2011). A Survey on Named Entity Recognition in Indian Languages with particular reference to Telugu. J. International Journal of Computer Science Issues, Volume. 8, pp. 1694-0814 .

Sobha, L., Vijay Sundar Ram. R. (2006). "Noun Phrase Chunker for Tamil", In Proceedings of Symposium on Modeling and Shallow Parsing of Indian Languages, Indian Institute of Technology, Mumbai, pp 194-198.

Srihari, R.K. Niu, C. Yu, L. (2000). A Hybrid Approach for Named Entity Recognition in Indian Languages. In 6th Applied Natural Language Conference, pp. 247-254

Gupta, S. and Bhattacharyya, P. (2010). Think globally, apply locally: using distributional characteristics for Hindi named entity identification. In 2010 Named Entities Workshop, Association for Computational Linguistics Stroudsburg, PA, USA

Vijayakrishna, R. and Sobha, L. (2008). Domain focused Named Entity for Tamil using Conditional Random Fields. In IJCNLP-08 workshop on NER for South and South East Asian Languages, Hyderabad, India. pp. 59-66

References


Page 59: Named Entity Recognition

Literature Survey

Indian Languages:

5) Named Entity Recognition for Hindi, Bengali, Oriya, Telugu and Urdu (some of the major Indian languages) was addressed as a shared task in the NERSSEAL workshop of IJCNLP. The tagset used there consisted of 12 tags.

6) Vijayakrishna & Sobha (2008) worked on a domain focused Tamil Named Entity Recognizer for the tourism domain using CRF. It handles nested tagging of named entities with a hierarchical tagset containing 106 tags. They used the root of words, POS, combined word and POS, and a dictionary of named entities as features to build the system.

7) Pandian et al. (2007) built a Tamil NER system using contextual cues and the E-M algorithm.

8) The NER system of Gali et al. (2008), built for the NERSSEAL-2008 shared task, combines machine learning techniques with language specific heuristics. The system was tested on five languages (Telugu, Hindi, Bengali, Urdu and Oriya) using CRF followed by post-processing that involves some heuristics.

Page 60: Named Entity Recognition

Thank you
