Automatic Hebrew Vocalization

By: Eran Tomer. Advisor: Prof. Michael Elhadad


Post on 05-Jan-2016


DESCRIPTION

Automatic Hebrew Vocalization. By: Eran Tomer. Advisor: Prof. Michael Elhadad. Natural Language Processing. The computational linguistics field attempts to model and study languages using computational techniques.

TRANSCRIPT

Page 1: Automatic Hebrew Vocalization

Automatic Hebrew Vocalization (ניקוד אוטומטי)

By: Eran Tomer
Advisor: Prof. Michael Elhadad

Page 2: Automatic Hebrew Vocalization

Natural Language Processing

The field of computational linguistics models and studies languages using computational techniques.

The challenges confronted by computational linguistics researchers include:
- Machine translation
- Automatic text summarization
- Speech-to-text
- Text-to-speech
- Etc.

Outline: Introduction; Motivation & Objectives; Research Questions; Previous work; Background; Generation; Syllable segmentation; Verb classification; Discussion & Future work

Page 3: Automatic Hebrew Vocalization

Hebrew Natural Language Processing

Two factors make NLP tasks difficult for Hebrew:
- Lack of large-scale annotated resources: supervised learning is generally hard to apply.
- High ambiguity rate: a given Hebrew word may have many different meanings and pronunciations, e.g. ספר, שלט, שערה, משנה.

Page 4: Automatic Hebrew Vocalization

Related Work

Resources:
- The Hebrew TreeBank: 5,000 segmented and morphologically tagged sentences
- Mila: various corpora, lexicons and some NLP tools

Word segmentation, example:
התה שכאן מאתמול: [ה][תה] [ש][כאן] [מ][אתמול]

Morphological tagging, example (אכול את התפוח):
- אכול: Verb, Singular, Masculine, 2nd person, Imperative
- את: Preposition
- ה: Determiner
- תפוח: Noun, Singular, Masculine

Page 5: Automatic Hebrew Vocalization

Motivation

- Development of a Hebrew text-to-speech (TTS) system: a vocalized and syllabified word may be used as a normalized form for a Hebrew TTS system.
- Generation of vocalized text for teaching: vocalized inflected words are widely used for teaching but are difficult to obtain (they do not appear in dictionaries).
- Improving automatic translation systems. Examples of unvocalized text that is mistranslated:
  - אתה תמונה לתפקיד, translated as "You picture the job" (intended: "You will be appointed to the position")
  - אני עובד משני עד רביעי, translated as "I work two to four" (intended: "I work Monday to Wednesday")

Page 6: Automatic Hebrew Vocalization

Objectives

- Generation: automatically producing fully vocalized verb inflections with the corresponding morphological attributes.
- Syllable segmentation: automatically segmenting vocalized words into syllables.
- Unknown verb classification: classifying verbs into their corresponding patterns, and automatically selecting an inflection schema for an unknown verb.

Page 7: Automatic Hebrew Vocalization

Research Questions

The Hebrew verb:
- How complex must the computational model be in order to generate the full morphology and vocalization of verbs?
- How much lexical knowledge, and how many exceptions, are required to cover the Hebrew verb lexicon?

Syllable segmentation:
- How complex is syllable segmentation?
- What level of knowledge is required for successful segmentation?

Page 8: Automatic Hebrew Vocalization

Previous Work

Vocalization
- Free systems: Snopi Automatic Nikud, Nikuda
- Academic systems: Kontorovich (2001), Gal (2002), Spiegel & Volk (2003)
- Commercial systems: Nakdan Text (Melingo), Auto Nikud, Nakdanit

Syllable segmentation
- Academic systems: Finkel & Stump (2002)
- Commercial systems: Kolan (Melingo)

Generation
- Free systems: Hspell (2002)
- Academic systems: Finkel & Stump (2002), Dannélls and Camilleri (2010)

Page 9: Automatic Hebrew Vocalization

Background – Hebrew Vocalization

Vowels vs. consonants
- Consonant letters are either vocalized by a Shva, or left non-vocalized at the end of a word. There are two types of Shva: Na and Nach.
- A letter that functions as a vowel is vocalized by one of the following vowel and semi-vowel signs:

A: Kamats, Patah, Hataf Patah
E: Segol, Tsere, Hataf Segol
U: Kubuts, Shuruk
O: Holam, Holam Male, Kamats Katan, Hataf Kamats
I: Hirik

Examples of Shva Na and Shva Nach: נבלבל (1st person, plural), יבלבל (3rd person, singular)

Page 10: Automatic Hebrew Vocalization

Background – Hebrew Vocalization

Diacritic signs may change the pronunciation of letters:
- Dagesh: emphasizes letters, yet in Modern Hebrew it affects the pronunciation of ב, כ and פ only.
- Mapik: denotes a consonantal (emphasized) Hey at the end of a word.
- Shin dots: distinguish the pronunciation of ש as SH (שׁ) or S (שׂ).

Dagesh Kal vocalizes ב, ג, ד, כ, פ, ת:
- At the beginning of a word
- After a Shva Nach

Dagesh Hazak vocalizes any letter other than א, ה, ח, ע, ר:
- Following certain linguistic phenomena
- In some noun/verb patterns

Page 11: Automatic Hebrew Vocalization

Background – Hebrew Vocalization

- Syllables: Hebrew words are composed of syllables; a syllable is a phonological unit that is pronounced in one effort.
- Stress: Hebrew words follow one of two stress schemes:
  - Milel (מלעיל): the syllable before the last is stressed
  - Milra (מלרע): the last syllable is stressed
- Deficient spelling vs. plene spelling: in many cases there is more than one valid way to spell a given Hebrew word.

Page 12: Automatic Hebrew Vocalization

Background – Hebrew Vocalization

The syllables and vowels rule (כלל ההברות והתנועות)

Require: a syllable s (stressed or non-stressed)
if s is a non-stressed syllable then
    if s is an open syllable then
        vocalize s with a long vowel
    else
        vocalize s with a short vowel
else
    in most cases s should be vocalized with a long vowel, yet the number of exceptions is considerable

Sound group: long vowel / short vowel
A: Kamats / Patah
E: Tsere, Tsere Male / Segol
I: Hirik Male / Hirik
U: Shuruk / Kubuts
O: Holam Male / Kamats Katan
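The rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the `VOWELS` mapping are assumptions (the mapping uses the traditional long/short vowel names), and the stressed case returns the usual choice only, ignoring the many exceptions the slide mentions.

```python
# Traditional long/short vowel names per sound group (an assumption of this
# sketch, not data taken from the authors' system).
VOWELS = {
    "A": {"long": "Kamats", "short": "Patah"},
    "E": {"long": "Tsere", "short": "Segol"},
    "I": {"long": "Hirik Male", "short": "Hirik"},
    "U": {"long": "Shuruk", "short": "Kubuts"},
    "O": {"long": "Holam Male", "short": "Kamats Katan"},
}


def choose_vowel_length(stressed: bool, open_syllable: bool) -> str:
    """Apply the syllables-and-vowels rule to one syllable."""
    if not stressed:
        # Non-stressed: open syllables take a long vowel, closed ones a short vowel.
        return "long" if open_syllable else "short"
    # Stressed: usually long; exceptions are not modeled here.
    return "long"


def choose_vowel(sound: str, stressed: bool, open_syllable: bool) -> str:
    """Return the vowel name for a sound group under the rule."""
    return VOWELS[sound][choose_vowel_length(stressed, open_syllable)]
```

For example, `choose_vowel("A", stressed=False, open_syllable=False)` selects Patah for a closed, non-stressed syllable.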

Page 13: Automatic Hebrew Vocalization

Background – Hebrew Vocalization

Examples (word: syllable segmentation; stressed syllable)
- עכבר: עכ-בר; stressed syllable בר
- נהר: נ-הר; stressed syllable הר
- לילה: לי-לה; stressed syllable לי (exception)
- דלת: ד-לת; stressed syllable ד (exception)


Decision diagram: Stressed syllable? Yes: usually long. No: open syllable? Yes: long. No: short.

Page 14: Automatic Hebrew Vocalization

Background – Hebrew Vocalization

Verbs: morphological attributes
- Tense: Past, Beinoni (Participle), Present, Future, Imperative
- Person: First, Second, Third
- Number: Singular, Plural
- Gender: Masculine, Feminine, Both

Verb (פועל) patterns
- Light patterns (הבניינים הקלים): Paal, Nifal, Hifil, Hufal
- Heavy patterns (הבניינים הכבדים): Piel, Pual, Hitpael

Page 15: Automatic Hebrew Vocalization

Background – Hebrew Vocalization

- The Hebrew paradigms: Hebrew verbs are clustered into several paradigms, which are characterized by the manner in which their verbs inflect:
  - Complete paradigms (גזרות השלמים)
  - Crippled paradigms (גזרות נחות)
  - Defective paradigms (גזרות חסרות)
  - Etc.
- Inflection tables: paradigms are further partitioned into about 300 specific inflection tables, which describe the inflections of specific verb families.

Page 16: Automatic Hebrew Vocalization

Inflection tables - example

Background – Hebrew Vocalization

Page 17: Automatic Hebrew Vocalization

Datasets

- Verb list: over 4,000 manually gathered verbs
  - Each verb is given in one form: deficient spelling, past tense, masculine, singular, 3rd person
  - Shin dots are indicated
  - The corresponding inflection table is indicated for each verb
- Morphologically analyzed corpora: about 50 million fully morphologically disambiguated words
  - Material from the "Haaretz" newspaper, the "Tapuz" website, the "Knesset" discussions and other resources

Page 18: Automatic Hebrew Vocalization

Generation

Method: we implemented 264 inflection tables, each of which:
- Takes: a verb (v) from our verb list dataset, together with its corresponding inflection table
- Returns: vocalized inflections of v with appropriate morphological tags

Results: a list of more than 240,000 vocalized verb forms with appropriate morphological attributes.

Evaluation: a sample of over 15,000 inflected verbs was manually validated, with 99.4% accuracy.
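To make the idea of an inflection table concrete, here is a minimal sketch, not the authors' implementation. It assumes a hypothetical table, `PAST_SUFFIXES`, that pairs unvocalized past-tense suffixes with morphological tags, and it ignores vocalization and all irregular behavior; a real table would also encode the vowel pattern of every form.

```python
# Final-form letters and their regular counterparts; a final letter must be
# converted back when a suffix is attached after it.
FINAL_TO_REGULAR = {"ך": "כ", "ם": "מ", "ן": "נ", "ף": "פ", "ץ": "צ"}

# Hypothetical inflection table: (unvocalized suffix, morphological tag).
PAST_SUFFIXES = [
    ("תי", "PAST+FIRST+MF+SINGULAR"),
    ("ת", "PAST+SECOND+M+SINGULAR"),
    ("ת", "PAST+SECOND+F+SINGULAR"),
    ("", "PAST+THIRD+M+SINGULAR"),
    ("ה", "PAST+THIRD+F+SINGULAR"),
    ("נו", "PAST+FIRST+MF+PLURAL"),
    ("תם", "PAST+SECOND+M+PLURAL"),
    ("תן", "PAST+SECOND+F+PLURAL"),
    ("ו", "PAST+THIRD+M+PLURAL"),
    ("ו", "PAST+THIRD+F+PLURAL"),
]


def inflect(base, table):
    """Return (form, tag) pairs for a base verb under one inflection table."""
    forms = []
    for suffix, tag in table:
        stem = base
        # Replace a word-final letter form before attaching a suffix.
        if suffix and stem[-1] in FINAL_TO_REGULAR:
            stem = stem[:-1] + FINAL_TO_REGULAR[stem[-1]]
        forms.append((stem + suffix, tag))
    return forms


forms = inflect("פצפץ", PAST_SUFFIXES)
```

Each of the 264 tables in the described system would play the role of `PAST_SUFFIXES` for one verb family, which is why exceptional verbs need tables of their own.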

Page 19: Automatic Hebrew Vocalization

Generation – results sample

C-20, פצפץ:
פִצְפַצְתִי, PAST+FIRST+MF+SINGULAR+COMPLETE
פִצְפַצְתָ, PAST+SECOND+M+SINGULAR+COMPLETE
פִצְפַצְתְ, PAST+SECOND+F+SINGULAR+COMPLETE
פִצְפֵץ, PAST+THIRD+M+SINGULAR+COMPLETE
פִצְפְצָה, PAST+THIRD+F+SINGULAR+COMPLETE
פִצְפַצְנו, PAST+FIRST+MF+PLURAL+COMPLETE
פִצְפַצְתם, PAST+SECOND+M+PLURAL+COMPLETE
פִצְפַצְתן, PAST+SECOND+F+PLURAL+COMPLETE
פִצְפְצו, PAST+THIRD+M+PLURAL+COMPLETE
פִצְפְצו, PAST+THIRD+F+PLURAL+COMPLETE
…

Page 20: Automatic Hebrew Vocalization

Syllable segmentation

Method: syllable segmentation requires Shva classification.
- Shva Na marks a syllable start*
- Shva Nach denotes a syllable end*
- Each syllable includes exactly one vowel*

* According to the Even-Shoshan dictionary

We implemented two Shva classification schemes:
- A heuristic approach, after Rabbi Eliyahu Behor
- Shva classification according to the base tense form
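Once every Shva has been classified, the three principles above are enough to cut a word into syllables. The sketch below is an illustration under assumptions, not the authors' code: a word is represented as a list of `(letter, mark)` pairs, where `mark` is `"V"` for any vowel, `"NA"` or `"NACH"` for a classified Shva, and `None` for an unmarked letter. This representation and the function name `segment` are hypothetical.

```python
def segment(units):
    """Split a word into syllables.

    units: list of (letter, mark) pairs; mark is "V" (vowel), "NA",
    "NACH" or None (no mark).
    """
    syllables = []
    current = ""
    has_vowel = False
    for letter, mark in units:
        # A Shva Na starts a syllable; a vowel starts one only when the
        # current syllable already holds its single vowel.
        starts_new = mark == "NA" or (mark == "V" and has_vowel)
        if starts_new and current:
            syllables.append(current)
            current, has_vowel = "", False
        current += letter
        if mark == "V":
            has_vowel = True
    if current:
        syllables.append(current)
    return "-".join(syllables)


akhbar = [("ע", "V"), ("כ", "NACH"), ("ב", "V"), ("ר", None)]
nahar = [("נ", "V"), ("ה", "V"), ("ר", None)]
```

With these inputs, `segment(akhbar)` yields עכ-בר and `segment(nahar)` yields נ-הר. A letter with Shva Nach or no mark simply stays in the current syllable, which is how a Shva Nach ends up closing it.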

Page 21: Automatic Hebrew Vocalization

Syllable segmentation

Heuristic approach

By Behor, a Shva is a Shva Na if:
- It vocalizes the first letter of the word
- It follows another Shva and is not at the end of the word
- It follows a long, stressed vowel (stress information is needed)
- It vocalizes a letter with Dagesh Hazak (the Dagesh type is needed)
- It vocalizes the first of two identical letters (many exceptions)

By our (adapted) heuristic, a Shva is a Shva Na if:
- It vocalizes the first letter of the word
- It follows another Shva and is not at the end of the word
- It follows a long vowel

A Shva is a Shva Nach if:
- It is followed by another Shva

In any other case, we use Shva Nach as the default.
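The adapted heuristic can be sketched as a small function. This is a hedged illustration: the input format (one mark per letter, `"SHVA"`, `"LONG"`, `"SHORT"` or `None`) and the order in which the rules are tested are assumptions of this sketch, since the slide does not state a precedence between them.

```python
def classify_shvas(marks):
    """Label each Shva as "NA" or "NACH"; other positions get None.

    marks: one entry per letter, either "SHVA", "LONG", "SHORT" or None.
    """
    labels = []
    last = len(marks) - 1
    for i, mark in enumerate(marks):
        if mark != "SHVA":
            labels.append(None)
            continue
        prev_mark = marks[i - 1] if i > 0 else None
        next_mark = marks[i + 1] if i < last else None
        if i == 0:
            label = "NA"        # vocalizes the first letter of the word
        elif next_mark == "SHVA":
            label = "NACH"      # followed by another Shva
        elif prev_mark == "SHVA" and i != last:
            label = "NA"        # follows a Shva and is not word-final
        elif prev_mark == "LONG":
            label = "NA"        # follows a long vowel
        else:
            label = "NACH"      # default
        labels.append(label)
    return labels
```

For a word with two adjacent Shvas in mid-word, such as marks `["SHORT", "SHVA", "SHVA", None, "LONG"]`, the first Shva is labeled Nach and the second Na.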

Page 22: Automatic Hebrew Vocalization

Syllable segmentation

Shva classification according to the base tense form:
- Through our generation mechanism, we can correlate verb inflections with their corresponding base-tense form
- A Shva present in the base-tense form is a Shva Nach
- Otherwise the Shva is a Shva Na

Matching inflections to base-tense forms:
- We use a dynamic programming string matching algorithm
- Operation costs were customized to be character dependent, respecting the Hebrew inflectional model

Alignment example, operation sequence: I I C R C C C C C C C C I R
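A dynamic programming alignment with character-dependent costs can be sketched as a weighted edit distance with a backtrace. The cost values and the `AFFIX_LETTERS` set below are assumptions chosen for illustration (the set is the traditional group of Hebrew affix letters); the weights actually used by the authors are not given in the slides.

```python
# Letters that commonly appear in inflectional affixes (an assumption of
# this sketch); inserting or deleting them is made cheaper.
AFFIX_LETTERS = set("האמנתיו")


def indel_cost(ch):
    return 0.5 if ch in AFFIX_LETTERS else 1.0


def sub_cost(a, b):
    return 0.0 if a == b else 1.0


def align(inflected, base):
    """Return (cost, ops) aligning an inflected form to its base form.

    ops uses C (copy), R (replace), I (letter only in the inflection)
    and D (letter only in the base form).
    """
    n, m = len(inflected), len(base)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + indel_cost(inflected[i - 1])
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + indel_cost(base[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub_cost(inflected[i - 1], base[j - 1]),
                cost[i - 1][j] + indel_cost(inflected[i - 1]),
                cost[i][j - 1] + indel_cost(base[j - 1]),
            )
    # Walk back from the end to recover one optimal operation sequence.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub_cost(
            inflected[i - 1], base[j - 1]
        ):
            ops.append("C" if inflected[i - 1] == base[j - 1] else "R")
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + indel_cost(inflected[i - 1]):
            ops.append("I")
            i -= 1
        else:
            ops.append("D")
            j -= 1
    return cost[n][m], "".join(reversed(ops))


result = align("למדתי", "למד")
```

Letters aligned by a copy operation can then inherit information from the base-tense form, such as whether a Shva was already present there.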

Page 23: Automatic Hebrew Vocalization

Syllable segmentation

Results: thanks to our generation model, we obtain about 240,000 highly accurate vocalized verbs. We applied our two approaches to produce two lists of verbs segmented into syllables:
- By our heuristic approach (based on Behor's heuristic)
- By our customized string matching algorithm

Evaluation: a sample of 300 segmented verbs was validated:
- Heuristic approach: 81% word accuracy and 85.92% syllable accuracy
- String matching approach: 99.33% word accuracy and 99.5% syllable accuracy
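The two measures can be computed as in the sketch below. The slides do not define syllable accuracy, so the definition used here is an assumption: a gold syllable counts as correct when the prediction contains a syllable covering exactly the same character span. On a sample of 300 words, 81% word accuracy corresponds to 243 fully correct words and 99.33% to 298.

```python
def spans(segmented):
    """Return the set of (start, end) character spans of the syllables."""
    result, start = set(), 0
    for syllable in segmented.split("-"):
        result.add((start, start + len(syllable)))
        start += len(syllable)
    return result


def evaluate(predicted, gold):
    """Return (word accuracy, syllable accuracy) for parallel lists."""
    words_ok = sum(p == g for p, g in zip(predicted, gold))
    syllables_total = sum(len(spans(g)) for g in gold)
    syllables_ok = sum(len(spans(p) & spans(g)) for p, g in zip(predicted, gold))
    return words_ok / len(gold), syllables_ok / syllables_total


# Toy transliterated data, for illustration only.
gold = ["ga-lam-ti", "na-har"]
predicted = ["gal-am-ti", "na-har"]
scores = evaluate(predicted, gold)
```

Here one of the two words is fully correct and three of the five gold syllables are recovered.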

Page 24: Automatic Hebrew Vocalization

Syllable segmentation – results sample

Verb: גלם

גו-למ-תי, ג-למ-תי
גו-למ-ת, ג-למ-ת
גו-למת, ג-למת
גו-למ-תם, ג-למ-תם
גו-למ-תן, ג-למ-תן
גו-למו, ג-למו
גו-לם, ג-לם
גו-למה, ג-למה
גו-למ-נו, ג-למ-נו

Page 25: Automatic Hebrew Vocalization

Verb classification to patterns

Method: we implemented an SVM classifier that:
- Takes: a non-vocalized verb (v)
- Returns: the pattern corresponding to v

The SVM uses:
- Dataset: over 2,700 verbs from our verb list; 70% are used for training and 30% for testing
- Features: word length, letter positions, guttural letter positions

Evaluation: 90.25% of the verbs were classified correctly into their corresponding Hebrew pattern.
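The three feature types can be extracted as in the following sketch. The feature names and the choice of `GUTTURALS` (the four letters א, ה, ח, ע) are assumptions made for illustration; the slides do not specify the exact encoding.

```python
# Guttural letters (an assumption of this sketch; some grammars also
# group ר with them).
GUTTURALS = set("אהחע")


def extract_features(verb):
    """Map a non-vocalized verb to a sparse feature dictionary."""
    features = {"length": len(verb)}
    for position, letter in enumerate(verb):
        # Which letter occupies which position.
        features[f"letter_{position}={letter}"] = 1
        # Positions occupied by guttural letters.
        if letter in GUTTURALS:
            features[f"guttural_{position}"] = 1
    return features


features = extract_features("אכל")
```

Dictionaries of this kind could be turned into vectors and passed to an SVM implementation, for instance scikit-learn's `DictVectorizer` followed by `SVC`, although the slides do not say which library was used.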

Page 26: Automatic Hebrew Vocalization

Unknown verb classification to inflection tables

Method: we implemented an SVM classifier that:
- Takes: a non-vocalized verb (v)
- Returns: the inflection table corresponding to v

The SVM uses:
- Dataset: over 2,700 verbs from our verb list; 70% are used for training and 30% for testing
- Features: word length, letter positions, guttural letter positions, and corpus-level features (from the 50M-word morphologically disambiguated corpus)

Evaluation:
- Without corpus-level features: 68.63% accuracy
- With corpus-level features: 70.08% accuracy

Page 27: Automatic Hebrew Vocalization

Discussion

The Hebrew verb inflectional model

Q: How complex must the computational model be in order to generate the full morphology and vocalization of verbs?
A: By implementing 264 inflection tables we achieve 99.4% accuracy.

Q: How much lexical knowledge, and how many exceptions, are required to cover the Hebrew verb lexicon?
A: The 264 implemented inflection tables include many exception tables, each of which describes the inflectional model of only a few verbs.
- Our more general model, which classifies unknown verbs, yields 70% accuracy (selecting 1 inflection table out of the total 264 tables).
- For comparison, the baseline of always choosing the most frequent inflection table yields only 34% accuracy.
- A rough estimate shows that over 93% of the verbs in a large corpus exist in our dataset; moreover, most unknown verbs are either misspelled or falsely tagged as verbs.
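The most-frequent-table baseline mentioned above is straightforward to compute. The sketch below uses made-up table labels and is only meant to show what the 34% figure measures: the accuracy obtained by always predicting the inflection table that is most common in the training data.

```python
from collections import Counter


def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    hits = sum(label == most_common_label for label in test_labels)
    return hits / len(test_labels)


# Hypothetical inflection-table labels, for illustration only.
baseline = majority_baseline(["A", "A", "B"], ["A", "B", "B", "A"])
```

A classifier is informative only to the extent that it beats this number, which is why the 70% result is compared against it.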

Page 28: Automatic Hebrew Vocalization

Discussion

Syllable segmentation

Q: How complex is syllable segmentation?
A: Contrary to what traditional grammars suggest, a few simple rules do not provide highly accurate segmentation. We achieved 99.3% word accuracy and 99.5% syllable accuracy through Shva classification.

Q: What level of knowledge is required for successful syllable segmentation?
A: Using the vocalized word only, we achieve correct word segmentation with 81% accuracy. Using the base tense form as well improves word accuracy to 99.3%.

This improvement suggests:
- Hebrew phonology uses a constructive process, which derives inflections from base tense forms.
- Inflections are not generated in a pipeline process, in which morphology would first generate inflections that are later segmented into phonological units.

Page 29: Automatic Hebrew Vocalization

Future work

- Generation
  - Implementing rare inflection tables
  - Implementing inflection tables for nouns
- Syllable segmentation
  - Searching for optimal Hebrew string matching weights
  - Machine learning of syllable segmentation

Page 30: Automatic Hebrew Vocalization

Future work

- Unknown verb classification
  - Using vocalized corpora to extract corpus-level features
  - Performing feature selection
  - Classification of vocalized verbs into inflection tables
  - Classification of inflections into inflection tables
  - Exploring the SVM parameters
- Automatic vocalization: we hope to obtain a substantial vocalized corpus (the Aviv encyclopedia), which will enable:
  - Setting a baseline for automatic vocalization using a modern vocalized corpus
  - Improving the baseline through supervised learning

Page 31: Automatic Hebrew Vocalization

The End