

Rosette for Arabic Search-Based Applications

SOLUTIONS

Accurate Text Analysis for the Complexity of the Arabic Language

The rapid growth of Arabic electronic content has increased the need for Arabic-savvy search. The latest generation of Arabic search techniques draws on advances in natural language processing, taking search beyond simple string comparisons to a more intelligent search that can understand that kitaab (“book”) is similar to kutub (“books”) by analyzing the lemma of each word.

THE ROSETTE SOLUTION

Rosette is designed to use a variety of algorithms so that the best approach can be applied to each language’s specific requirements. Depending on the language, a combination of lexical data, heuristic rules, and statistical models is implemented to provide the best accuracy and speed for all applications.

ROSETTE COMPONENTS

The Rosette linguistics platform delivers a unified application programming interface (API) that provides access to its various linguistic capabilities. Search solutions typically use the following components:

• Rosette Language Identifier (RLI) identifies text in 55 languages and 39 legacy encodings.

• Rosette Core Library for Unicode (RCLU) transcodes text between 168 legacy encodings and UTF-8, UTF-16, or UTF-32.

• Rosette Base Linguistics (RBL) returns the morphological analysis of the text, enabling high-accuracy, full-text search.

• Rosette Entity Extractor (REX) extracts meaningful entities such as names, places, organizations, and dates from text.

• Rosette Name Indexer (RNI) matches foreign personal names across writing systems and languages.

• Rosette Name Translator (RNT) translates foreign personal names into English.

• Rosette Chat Translator (RCT) translates Arabic chat text (known as Arabizi) into native Arabic script.
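The transcoding step described for RCLU can be illustrated with a minimal sketch that uses only Python's built-in codecs. This is not the Rosette API; the function name `transcode` and the choice of Windows-1256 (`cp1256`) as the legacy encoding are assumptions made for illustration.

```python
def transcode(data: bytes, source_encoding: str, target_encoding: str = "utf-8") -> bytes:
    """Decode bytes from a legacy encoding and re-encode them in a Unicode encoding."""
    text = data.decode(source_encoding)   # legacy bytes -> Unicode string
    return text.encode(target_encoding)   # Unicode string -> target bytes


# Hypothetical input: the Arabic word for "book" stored as Windows-1256 bytes.
word = "\u0643\u062A\u0627\u0628"         # kaf, teh, alef, beh
legacy_bytes = word.encode("cp1256")
utf8_bytes = transcode(legacy_bytes, "cp1256", "utf-8")
```

In a search pipeline, a step like this would typically run before tokenization so that all later stages see a single Unicode representation.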

KEY FEATURES

Rosette provides the most advanced capabilities commercially available, whether for searching within a language or across multiple languages. Key features include:

• Language Identification automatically classifies documents by language and encoding.

[Figure: language identification example with text labeled Arabic, Persian, and Urdu]

• Segmentation and Tokenization determine the boundaries of unique lexical tokens in input data, including locating punctuation and special characters.

• Part-of-Speech Identification tags each word’s part of speech, such as noun, verb, or preposition.

[Figure: tokenization and part-of-speech tagging of a sample sentence, with tokens tagged NOUN, VERB, PREP, NUM, and ADJ]

• Normalization unifies words with various spellings.

• Root Identification returns root forms of input words.

• Stemming removes prefixes and suffixes from input words.

• Lemmatization returns the dictionary base forms for inflected verbs, nouns, pronouns, or adjectives.

[Figure: sample words labeled with their normalization, root, stem, and lemma]
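As a rough illustration of why Arabic, Persian, and Urdu can be told apart even though they share a script, the sketch below checks for letters that are characteristic of each language. It is a crude heuristic under stated assumptions, not how Rosette identifies languages (the datasheet attributes that to lexical data, heuristic rules, and statistical models); the function name `guess_language` and the letter sets are illustrative choices. Urdu text that happens to contain none of the Urdu-specific letters would be mislabeled as Persian.

```python
# Letters used in Urdu but not in standard Arabic or Persian (illustrative set).
URDU_ONLY = set("\u0679\u0688\u0691\u06BA\u06BE\u06D2")     # tteh, ddal, rreh, noon ghunna, heh doachashmee, yeh barree
# Letters shared by Persian and Urdu but absent from standard Arabic.
PERSO_ARABIC = set("\u067E\u0686\u0698\u06AF\u06A9\u06CC")  # peh, tcheh, jeh, gaf, keheh, Farsi yeh


def guess_language(text: str) -> str:
    """Guess Arabic, Persian, or Urdu from characteristic letters (toy heuristic)."""
    chars = set(text)
    if chars & URDU_ONLY:
        return "Urdu"
    if chars & PERSO_ARABIC:
        return "Persian"
    if any("\u0600" <= ch <= "\u06FF" for ch in chars):
        return "Arabic"
    return "unknown"
```

A production identifier would score whole documents statistically rather than rely on the presence of single letters.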


VISIT www.basistech.com WRITE [email protected] CALL 617-386-2090

One Alewife Center, Cambridge, MA 02140

171 Second Street, San Francisco, CA 94105

2553 Dulles View Drive, Herndon, VA 20171

9-6 Nibancho, Chiyoda-ku, Tokyo 102-0084, Japan

© 2013 Basis Technology Corporation. “Basis Technology”, “Geoscope”, “Odyssey Digital Forensics”, “Rosette”, and “We put the World in the World Wide Web” are registered trademarks of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2013-01-04)

WHY NORMALIZATION?

Certain words that appear with different spellings often refer to one unique word. These different spellings can be due to various orthographic variations caused either by the writer’s stylistic choice or by the writer’s neglect of the complex orthographic rules of Arabic. The normalization process provided by Rosette unifies orthographic variation including:

• Words with additional vocalization marks such as: ــ vs. ــ

• Words containing certain letters with dots added or removed such as: ـــــ vs. ـــــ

• Words (including ambiguous cases) containing certain letters with symbols added or removed such as:

ٱـد vs. اـد

داؤود vs. داوود

ـا vs. (ـا or ـا or ـا)
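The kinds of variation listed above can be sketched as a small normalization function. This is a minimal illustration, not Rosette's normalization; the function name `normalize_arabic` and the specific mappings are assumptions. In particular, folding teh marbuta to heh and alef maksura to yeh is an aggressive choice that a real system would apply with more care, since those cases can be ambiguous.

```python
import re

# Vocalization marks: fathatan through sukun (U+064B-U+0652) and superscript alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
_TATWEEL = "\u0640"  # elongation character, carries no meaning

# Letter variants folded to a single form (illustrative, assumed mappings).
_LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0671": "\u0627",  # alef wasla            -> bare alef
    "\u0624": "\u0648",  # waw with hamza        -> waw
    "\u0649": "\u064A",  # alef maksura          -> yeh (dots added)
    "\u0629": "\u0647",  # teh marbuta           -> heh (dots removed)
})


def normalize_arabic(text: str) -> str:
    """Remove vocalization marks and tatweel, then fold letter variants."""
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    return text.translate(_LETTER_MAP)
```

Applying the same function to both indexed text and queries lets spelling variants such as the two forms of داوود shown above match each other.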

WHY PART-OF-SPEECH TAGGING?

Words often carry a different part of speech (POS) depending on their position within a sentence. The word in the following sentence is pronounced differently and has a different POS when it is found at the beginning of the sentence rather than at the end:

WHY ROOT IDENTIFICATION?

Words are derived by applying morphological patterns to the most basic form, called the root. Words such as دروس (“lessons”), دارس (“student”), and مدرسة (“school”) share the common base notion of “learning” represented by the root درس (“to learn”).
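To make the idea of a shared root concrete, the sketch below tests whether the consonants of a root appear, in order, inside a word. This is a naive heuristic for illustration only, not how Rosette identifies roots; the function name `contains_root` is hypothetical, and the test will produce false positives on words that merely happen to contain the same letters in sequence.

```python
def contains_root(word: str, root: str) -> bool:
    """Return True if the letters of `root` occur in `word` in the same order."""
    remaining = iter(word)
    # `letter in remaining` consumes the iterator up to the first match,
    # so each root letter must be found after the previous one.
    return all(letter in remaining for letter in root)


ROOT_DRS = "\u062F\u0631\u0633"            # d-r-s, "to learn"
LESSONS = "\u062F\u0631\u0648\u0633"       # "lessons"
STUDENT = "\u062F\u0627\u0631\u0633"       # "student"
SCHOOL = "\u0645\u062F\u0631\u0633\u0629"  # "school"
BOOK = "\u0643\u062A\u0627\u0628"          # "book", a different root
```

Indexing words under their root lets a query for one derived form surface the others, at the cost of broader and therefore looser matches.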

STEMMING VS. LEMMATIZATION

If your application attempts to match the word كتاب (“a book”), then a stemming approach will retrieve not only كتب (“books”) and كتابان (“two books”) but also كتابة (“writing”), which is a weak match to “book”. However, by using lemmatization the retrieved words will be limited to كتب (“books”) and كتابان (“two books”).

Word Meaning Root Lemma Stem

writing ك ت ب ب

ب book ك ت ب ب ب

books ك ت ب ب
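The difference in retrieval behavior can be sketched with a toy index. The analyses below are hand-written for illustration and are not actual Rosette output; the transliterations (kitaabaan, kitaaba), the stem value "ktb", and the function name `retrieve` are all assumptions.

```python
# Hypothetical analyses keyed by transliterated word; not actual Rosette output.
ANALYSES = {
    "kitaab":    {"meaning": "book",      "lemma": "kitaab",  "stem": "ktb"},
    "kutub":     {"meaning": "books",     "lemma": "kitaab",  "stem": "ktb"},
    "kitaabaan": {"meaning": "two books", "lemma": "kitaab",  "stem": "ktb"},
    "kitaaba":   {"meaning": "writing",   "lemma": "kitaaba", "stem": "ktb"},
}


def retrieve(query: str, key: str) -> list[str]:
    """Return the other indexed words whose `key` value matches the query's."""
    target = ANALYSES[query][key]
    return sorted(w for w, a in ANALYSES.items() if a[key] == target and w != query)


by_stem = retrieve("kitaab", "stem")    # also pulls in "writing"
by_lemma = retrieve("kitaab", "lemma")  # only forms of "book"
```

Matching on the stem favors recall, while matching on the lemma favors precision, which is the trade-off this section describes.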

WHY STEMMING?

Stemming is quite useful when analyzing named entities, which must not be analyzed the same way as general text. In Arabic, prepositions and conjunctions often attach to named entities, making exact string matches impossible in search-based applications. Therefore, stemming is used to strip away meaningless affixes within named entities.
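A minimal sketch of this idea strips single-letter prepositions and conjunctions from the front of a token until what remains matches a known name. It is not Rosette's stemmer; the function name `match_name`, the list of attached particles, and the small name set are assumptions for illustration.

```python
# Single-letter particles that attach to the following word: wa, fa, bi, li, ka.
PROCLITICS = ("\u0648", "\u0641", "\u0628", "\u0644", "\u0643")


def match_name(token: str, known_names: set[str], max_strip: int = 2):
    """Strip up to `max_strip` leading particles; return the matched name or None."""
    candidate = token
    for _ in range(max_strip + 1):
        if candidate in known_names:
            return candidate
        if len(candidate) > 1 and candidate.startswith(PROCLITICS):
            candidate = candidate[1:]   # drop one attached particle and retry
        else:
            break
    return None


MUHAMMAD = "\u0645\u062D\u0645\u062F"
names = {MUHAMMAD}                                  # hypothetical name list
to_muhammad = "\u0644" + MUHAMMAD                   # "to Muhammad"
and_by_muhammad = "\u0648\u0628" + MUHAMMAD         # "and by Muhammad"
```

Checking the name list before stripping means a name that itself begins with one of these letters is matched as-is rather than truncated.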

EXPLORE FURTHER

For more information or to request an evaluation, please call us at 617-386-2090 or 800-697-2062, or write [email protected]. We will be happy to assist you in evaluating the performance of our products on your data.