
The Digital Form of the Thesaurus Dictionary of the Romanian Language

Dan CRISTEA (1, 2), Marius RĂSCHIP (1), Corina FORĂSCU (1, 3), Gabriela HAJA (4), Cristina FLORESCU (4), Bogdan ALDEA (1, 4), Elena DĂNILĂ (4)

(1) Faculty of Computer Science, Al.I. Cuza University of Iaşi
(2) Institute for Computer Science, Romanian Academy, Iaşi
(3) Institute for Artificial Intelligence, Romanian Academy, Bucharest
(4) Institute of Romanian Philology "A. Philippide", Iaşi branch of Romanian Academy

Cultural heritage

Preserving identity in a United Europe...

Machine-readable dictionaries: resources for human-computer communication

Textual resources today

Voices in favor of the electronic form of textbooks:
the Gutenberg project
the Google book search
or the DARPA project GALE

CLARIN: the new Pan-European initiative for Common Language Resources and Technology Infrastructure

Computerized Dictionaries in The World

English: Oxford Advanced Learner's Dictionary, Collins, Merriam-Webster
French: Trésor de la Langue Française informatisé
Italian: Tesoro della Lingua Italiana delle origini
Spanish: Diccionario de la Lengua Española
Portuguese: Língua Portuguesa On-Line
Romanian: Romanian Academy Explanatory Dictionary - WebDEX, DEX on-line

Presentation

About the dictionary
Benefits of an electronic form of it
An acquisition methodology
Upgrading

The Thesaurus Dictionary of Romanian: since 1913

The big thesaurus dictionary of the Romanian language

Two series:

Dictionary of the Academy (DA)
published between 1913 and 1949
including the entries A-C, D-De, F-K, L-lojniţă

Dictionary of the Romanian Language (DLR)
planned to be finalized in 2007
rest of the entries

SERIES  PUBLICATION YEAR  TOME  PART  LETTER        NO. OF PAGES  NO. OF WORD ENTRIES
DA      1913              I     1     A – B         716           cca 10000
DA      1940              I     2     C             1064          cca 16000
DA      1949              I     3     D – de        90            cca 1300
DA      1934              II    1     F – I         936           cca 14000
DA      1937              II    2     J             66            cca 990
DA      1937-1949         II    2     L – lojniţă   174           cca 2600
DLR     1965-1968         VI    -     M             1076          9653
DLR     1971              VII   1     N             584           5493
DLR     1969              VII   2     O             400           3622
DLR     1972              VIII  1     P             357           4006
DLR     1974              VIII  2     P             334           3783
DLR     1977              VIII  3     P             253           2727
DLR     1980              VIII  4     P             393           4537
DLR     1984              VIII  5     P             523           4680
DLR     1975              IX    -     R             641           7255
DLR     1986              X     1     S             388           3540
DLR     1987              X     2     S             300           2212
DLR     1990              X     3     S             347           2692
DLR     1992              X     4     S             371           2757
DLR     1994              X     5     S             726           5725
DLR     1978              XI    1     Ş             271           4528
DLR     1982              XI    2     T             376           5027
DLR     1983              XI    3     T             387           4202
DLR     1994              XII   1     Ţ             240           3856
DLR     2002              XII   2     U             468           2347
DLR     1997              XIII  1     V             326           1747
DLR     2002              XIII  2     V             426           2396
DLR     2005              XIII  3     V, W, X, Y    588           2365
DLR     2000              XIV   -     Z             409           4088
TOTAL                                               cca 13230     cca 138128

DTLR: statistics

DA
6 volumes
3,000 pages
about 45,000 entries

DLR
23 volumes
10,000 pages
15 letters (out of 28)
more than 93,000 entries
3,200,000 examples: about 88% of the text
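As a consistency check, the per-volume figures can be summed and compared with the rounded statistics. The sketch below is not part of the original slides: the variable and function names are illustrative, and the page and entry counts are copied from the volume table (the DA entry counts are approximate, "cca", in the source).

```python
# (pages, word entries) per volume, copied from the volume table.
# DA entry counts are approximate ("cca") in the source.
DA = [(716, 10000), (1064, 16000), (90, 1300), (936, 14000), (66, 990), (174, 2600)]
DLR = [
    (1076, 9653), (584, 5493), (400, 3622), (357, 4006), (334, 3783),
    (253, 2727), (393, 4537), (523, 4680), (641, 7255), (388, 3540),
    (300, 2212), (347, 2692), (371, 2757), (726, 5725), (271, 4528),
    (376, 5027), (387, 4202), (240, 3856), (468, 2347), (326, 1747),
    (426, 2396), (588, 2365), (409, 4088),
]


def totals(volumes):
    """Return (total pages, total entries) for a list of volumes."""
    pages = sum(p for p, _ in volumes)
    entries = sum(e for _, e in volumes)
    return pages, entries


for name, volumes in (("DA", DA), ("DLR", DLR), ("DTLR", DA + DLR)):
    pages, entries = totals(volumes)
    print(f"{name}: {len(volumes)} volumes, {pages} pages, {entries} entries")
```

The sums agree with the rounded figures on the statistics slide and with the TOTAL row of the table.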

DTLR: format

VINILÍN s. n. Material plastic, realizat mai ales sub formă de foi, obţinut din policlorură de vinil plastifiată. Cf. L. ROM. 1959, nr. 5, 6. Un produs din ce în ce mai… răspîndit şi care are ca material de bază gazele sub diferite forme sînt masele plastice (vinilina, nylon etc.). GEOLOGIA, 48, cf. DC, DN³, DREV, V. BREBAN, D. G. Mi-a dat spre tipărire… două volume masive, legate în vinilin negru. ROMÂNIA LITERARĂ, 1993, nr. 3, 12/1, cf. DEX².

– Şi: (rar) vinilínă s. f.
– Din vinil.

Notations

DA and DLR
eDA and eDLR: electronic versions of DA and DLR
DTLR: the union of DA and DLR
eDTLR (Dictionary Thesaurus of the Romanian Language in electronic form): the electronic version of these two parts, where no content changes are operated
eDTLR+: the updated eDTLR, in which eDA has been upgraded to reflect the current language
eDTLR++

DTLR: content

Includes:
popular, casual, literary and artistic speech
special terminologies
regionalisms, archaisms, popular technical terms, argotic terms and personal creations
words used in specialized languages and styles (children games, riddles, magic spells), phrases, expressions, proverbs and sayings
compounds and derivatives as separate entries

DTLR: content

Definitions of linguistic facts:
genus proximum and differentia specifica
semantic values through synonyms, hypernyms and oppositions
pattern definitions for words that appear in semantic series, abstract nouns, diminutives, augmentatives, names of animals and plants
grammatical functions

The benefits of eDTLR

As an updated word stock of Romanian for:
computational lexicography
computational lexicology
computational morphology
semantics
ontologies in Semantic Web
publication, distribution and access

eDTLR for computational lexicography

re-edit the whole dictionary (A-Z), including unification, correction and updating
extract examples for missing words in eDA
useful environment dedicated to lexicographers
correct format inconsistencies between different volumes
define standard formats for certain types of entries: geological terms, months and days, names of plants...
possibility to generate various types of dictionaries: orthographic, pronunciation, frequency, etymological; valences, collocations, phraseological, proverbs, citations; onomasiological or semasiological; neologisms, loan-word/foreign-word, jargon/slang etc.

eDTLR for computational lexicology

alignment of word senses with those in other dictionaries
example: part of the letter V against RoWN:
73 common word entries
DLR includes 2,300 senses and RoWN 100 synsets

eDTLR for computational morphology

the extremely rich collection of Romanian terms -> the completion of the inflectional paradigms
supplement the corpora used for learning a Romanian language model -> POS-tagging and lemmatisation

eDTLR for computational semantics

the vast collection of examples for word senses -> used to train a word sense disambiguation program
semantic roles of verbs and nouns derived from verbs, as in FrameNet
for Semantic Web: sense definitions -> formalised concepts in (lexicalised) domain ontologies

eDTLR for publication, distribution and access

the dictionary can be published more cheaply by electronic means
sophisticated indexes can link word occurrences, including outside the dictionary itself, in other linguistic thesauri or in other languages
most significant achievement: could be made available on the Internet for the widest possible audience

An acquisition methodology for eDLR

a dictionary double page -> scanning -> digital photography format

An acquisition methodology for eDLR

digital photography format -> preparation for OCR (optical character recognition) -> still digital photography format

The two pages are split, deskewed, cleared of black margins and downscaled from 600 dpi to 300 dpi.
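Three of the preparation steps (splitting, margin clearing, downscaling) can be sketched in a few lines. This is only an illustration, not the project's actual tooling: it assumes a page image is a nested list of grayscale values (0 = black, 255 = white), the function names and the darkness threshold are hypothetical, and deskewing is left out because it needs a proper imaging library.

```python
def split_double_page(img):
    """Split a double-page image (list of pixel rows) down the middle."""
    mid = len(img[0]) // 2
    left = [row[:mid] for row in img]
    right = [row[mid:] for row in img]
    return left, right


def crop_black_margins(img, threshold=50):
    """Keep the box bounded by the outermost rows and columns that are not entirely dark."""
    def is_black(pixels):
        return all(p < threshold for p in pixels)

    rows = [i for i, row in enumerate(img) if not is_black(row)]
    if not rows:
        return []
    top, bottom = rows[0], rows[-1]
    cols = [j for j in range(len(img[0]))
            if not is_black([img[i][j] for i in range(top, bottom + 1)])]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in img[top:bottom + 1]]


def downscale_by_two(img):
    """Halve the resolution (e.g. 600 dpi -> 300 dpi) by averaging 2x2 blocks."""
    out = []
    for i in range(0, len(img) - 1, 2):
        row = []
        for j in range(0, len(img[0]) - 1, 2):
            block = img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]
            row.append(block // 4)
        out.append(row)
    return out


# Toy double page: two 6x6 pages, each a white 4x4 area inside a black border.
BLACK, WHITE = 0, 255
page_row = [BLACK] + [WHITE] * 4 + [BLACK]
border_row = [BLACK] * 6
page = [border_row] + [page_row] * 4 + [border_row]
double_page = [row + row for row in page]

left, right = split_double_page(double_page)
prepared = [downscale_by_two(crop_black_margins(p)) for p in (left, right)]
print(prepared[0])
```

Averaging 2x2 blocks is one simple way to halve the resolution; a production pipeline would more likely rely on an imaging library's resampling filters.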

An acquisition methodology for eDLR

digital photography format -> OCR -> HTML

<p><b>VINILÍN</b> s.n. Material plastic, realizat mai ales sub formă de foi, obţinut din policlorură de vinil plastifiată. Cf. L.ROM. 1959, nr. 5, 6. <i>Un produs din ce în ce mai răspîndit şi care are ca material de baza gazele sub diferite forme sînt masele plastice (vinilina, nylon etc.).</i> GEOLOGIA, 48, cf. DC, DN<sup>3</sup>, DREV,V. BREBAN, D. G. <i>Mi-a dat spre tipărire două volume masive, legate în vinilin negu. </i>ROMÂNIA LITERARĂ, 1993, nr. 3, 12/1, cf. DEX<sup>2</sup>.</p>

<p>– Şi: (rar) <b>vinilína</b> s.f.</p><p>– Din <b>vinil.</b></p>

How to minimise the time of correction?

Using experts?
15-35 min per page
250 pages (letter Ţ): 125 hours (25 days)
a total of 663 days
they are expensive
and they get bored...
and even they leave errors behind...

Using volunteers?
what?
and how to impose quality?
and how to prevent the proliferation of unfinished versions?
or posting freely before establishing IPR

Use volunteers plus experts

A large collaborative correction phase
students of our University
twice or three times

The few remaining errors are corrected by experts

A collaborative business

Prevent unauthorised retrieval

A correcting interface

sava -> să vă

Citations: lack of meaning, source of errors

BUDAI DELEANU, Ţ.
CANTEMIR, I.I.I.
MAIOR, IST.
MOLNAR, RET.
PETROVICI, P.
…

recovery by pattern matching against a list
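A minimal sketch of such a recovery step, assuming the matching is approximate string comparison against the list of known source abbreviations. The slides do not say which matching technique was used; `difflib` from the Python standard library is a stand-in here, and the function name, the 0.8 cutoff and the garbled inputs are hypothetical.

```python
import difflib

# Source abbreviations as segmented from the slide; the real list is much longer.
KNOWN_SIGLA = [
    "BUDAI DELEANU, Ţ.",
    "CANTEMIR, I.I.I.",
    "MAIOR, IST.",
    "MOLNAR, RET.",
    "PETROVICI, P.",
]


def recover_siglum(ocr_text, known=KNOWN_SIGLA, cutoff=0.8):
    """Return the closest known abbreviation, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(ocr_text, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical OCR misreadings: "l" for "I", "T" for "Ţ", "0" for "O".
print(recover_siglum("BUDAl DELEANU, T."))
print(recover_siglum("MAI0R, IST."))
print(recover_siglum("GEOLOGIA"))
```

Because citations carry no lexical meaning, a spell-checker cannot help; a closed list of valid abbreviations gives the corrector something to match against, and anything below the cutoff can be left for a human.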

Instead of parallel corrections...

a voting system

send to the expert...
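For comparison, the voting scheme that parallel corrections would require can be sketched as follows. This is an illustration under simplifying assumptions: the parallel versions are taken to have the same number of whitespace-separated tokens, and the function name and the `min_agreement` parameter are hypothetical.

```python
from collections import Counter


def vote(versions, min_agreement=2):
    """Token-wise majority vote over parallel corrections of the same line.

    Returns (tokens, disputed); disputed lists the positions where no reading
    reached min_agreement, i.e. the ones that would be sent to the expert.
    """
    tokenised = [v.split() for v in versions]
    if len({len(t) for t in tokenised}) != 1:
        raise ValueError("versions must have the same number of tokens")
    result, disputed = [], []
    for pos, readings in enumerate(zip(*tokenised)):
        reading, count = Counter(readings).most_common(1)[0]
        if count >= min_agreement:
            result.append(reading)
        else:
            result.append(readings[0])  # placeholder until the expert decides
            disputed.append(pos)
    return result, disputed


tokens, disputed = vote([
    "legate în vinilin negru",
    "legate în vinilin negu",
    "legate in vinilin negru",
])
print(" ".join(tokens), disputed)
```

Each page has to be corrected several times in full before any vote can take place.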

...use cascade corrections

send to the expert...
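A cascade can be pictured as a simple chain in which each corrector starts from the previous corrector's output. The sketch below is only an illustration: the two "volunteer" functions and the sample sentence are made up, and real correctors are of course people working in the correcting interface.

```python
def cascade(text, correctors):
    """Pass the text through the correctors in order.

    Each corrector takes the current text and returns its corrected version;
    the last version is what would be sent to the expert.
    """
    history = [text]
    for correct in correctors:
        history.append(correct(history[-1]))
    return history[-1], history


# Hypothetical correctors standing in for two successive volunteers.
def first_volunteer(text):
    return text.replace("negu", "negru")


def second_volunteer(text):
    return text.replace("sava", "să vă")


final, history = cascade("sava dau volume legate în vinilin negu",
                         [first_volunteer, second_volunteer])
print(final)
```

Unlike voting, no alignment or merging of versions is needed, since only one current version exists at any time.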

Extract fields

HTML format -> pattern matching -> XML format

<entry id="1510">
  <title>VINIL<accent>I</accent>N</title>
  <POS>s.n.</POS>
  <sense id="1">
    <def>Material plastic, realizat mai ales sub formă de foi, obţinut din policlorură de vinil plastifiată.</def>
    <examples>
      <example no="1">Un produs din ce în ce mai răspîndit şi care are ca material de bază gazele sub diferite forme sînt masele plastice (vinilina, nylon etc.).</example>
      <example no="2">Mi-a dat spre tipărire două volume masive, legate în vinilin negru.</example>
    </examples>
  </sense>
</entry>
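The pattern-matching step from HTML to XML might look like the sketch below. It is a simplification under stated assumptions, not the project's extractor: the regular expressions, the function name and the shortened sample entry are illustrative, only one sense is handled, the `<accent>` markup of the title is not produced, and the definition is assumed to end at the first "Cf.".

```python
import re
import xml.etree.ElementTree as ET

ENTRY_RE = re.compile(
    r"<p><b>(?P<title>[^<]+)</b>\s*"   # headword in bold
    r"(?P<pos>s\.\s?[nfm]\.)\s*"       # part of speech, e.g. "s.n."
    r"(?P<definition>.*?)\s*Cf\.",     # definition runs up to the first "Cf."
    re.S,
)
EXAMPLE_RE = re.compile(r"<i>(.*?)</i>", re.S)  # quoted examples are in italics


def html_to_entry(html, entry_id):
    """Build an <entry> element from the OCR-produced HTML of one entry."""
    m = ENTRY_RE.search(html)
    if m is None:
        raise ValueError("entry pattern not found")
    entry = ET.Element("entry", id=str(entry_id))
    ET.SubElement(entry, "title").text = m.group("title")
    ET.SubElement(entry, "POS").text = m.group("pos")
    sense = ET.SubElement(entry, "sense", id="1")
    ET.SubElement(sense, "def").text = m.group("definition")
    examples = ET.SubElement(sense, "examples")
    for no, text in enumerate(EXAMPLE_RE.findall(html), start=1):
        ET.SubElement(examples, "example", no=str(no)).text = text.strip()
    return entry


# Shortened version of the VINILÍN entry used on the slides.
HTML = (
    "<p><b>VINILÍN</b> s.n. Material plastic, realizat mai ales sub formă de foi, "
    "obţinut din policlorură de vinil plastifiată. Cf. L.ROM. 1959, nr. 5, 6. "
    "<i>Un produs din ce în ce mai răspîndit sînt masele plastice.</i> GEOLOGIA, 48. "
    "<i>Mi-a dat spre tipărire două volume masive, legate în vinilin negru. </i>"
    "ROMÂNIA LITERARĂ, 1993, nr. 3, 12/1.</p>"
)

entry = html_to_entry(HTML, 1510)
print(ET.tostring(entry, encoding="unicode"))
```

Typographic cues (bold headword, italic examples) carry most of the structure, which is why OCR errors in the markup matter as much as errors in the text.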

Continue the corrections

XML format -> browsing and editing interface (DLRex: a lexicographer interface) -> corrected XML: eDLR

Process DA

DA double pages -> same processing as for DLR -> corrected XML: eDA

eDA ∪ eDLR = eDTLR

Update the content of eDA

eDA -> DLRex -> updated eDA

Puşcariu:
ABÁC s. a. "Abaque; boulier-compteur". – Tablă grafică pentru calcul. | Spec. Aparat de numeraţiune pentru copii, format dintr’un cadru de lemn cu zece vergele paralele, pe care sânt înşirate câte zece sfere mici mobile, de colori diferite. – N. din fran.

update orthography: sânt -> sunt

archaisms, old forms in definitions: numeraţiune, colori

definitions:
MDA:
abac sn – Instrument pentru calcule aritmetice elementare, format dintr-un cadru de lemn cu vergele pe care se pot deplasa bile.
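The orthography update and the detection of old forms can be sketched as two small text passes. This is an assumption-laden illustration rather than a description of DLRex: the function names are hypothetical, the mapping contains only the pair shown on the slides, and the old forms are merely flagged, since choosing their modern replacement is a lexicographer's decision.

```python
import re

# Spelling updates; only the pair shown on the slides is included.
ORTHOGRAPHY = {"sânt": "sunt"}
# Old forms highlighted on the slides; they are flagged, not replaced.
ARCHAISMS = {"numeraţiune", "colori"}

WORD_RE = re.compile(r"\w+")


def update_orthography(text):
    """Replace whole words found in ORTHOGRAPHY, leaving all other text untouched."""
    return WORD_RE.sub(lambda m: ORTHOGRAPHY.get(m.group(0), m.group(0)), text)


def flag_archaisms(text):
    """Return the listed old forms that occur in the text, in order of appearance."""
    return [w for w in WORD_RE.findall(text) if w in ARCHAISMS]


DEFINITION = (
    "Aparat de numeraţiune pentru copii, format dintr’un cadru de lemn cu zece "
    "vergele paralele, pe care sânt înşirate câte zece sfere mici mobile, "
    "de colori diferite."
)

print(update_orthography(DEFINITION))
print(flag_archaisms(DEFINITION))
```

Matching whole words keeps the substitution from touching longer words that merely contain "sânt"; capitalised forms would need extra entries in the mapping.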

Update the content of eDA

the collection of examples in eDLR: new words, new word senses, examples

eDA -> updated eDA

eDTLR+

updated eDA ∪ eDLR = eDTLR+

But... it is already outdated!

A factory of words

How to keep the dictionary consistent with the evolution of language

Decisions with respect to synchronising the dictionary with the latest language developments:
when to accept a new word in language
when to consider a word as obsolete (out of use)
when to accept a new sense of a word
when to consider a sense of a word as out of use

Lexicographic activities related to new words/senses:
build definitions
select examples
detect the first and the last mention
establish etymologies
...

eDTLR++: the synchronic eDTLR+

Organisation, criteria and resources

The Parliament: a legislative initiative recommending that all publishing houses offer their texts for research
The eDTLR++ Committee: establishes the selection criteria for the authorised (re)sources
A computer program: sorts the resources by register: literary, stylistic, domain, author, publishing date, etc.
A computer program: continuously selects the resources that satisfy the Committee's selection criteria and also conform to the registers
The eDTLR++ Committee: establishes criteria for a word/sense to be considered as "new" or "out-of-use"

eDTLR++: the synchronic eDTLR+

Processing

Program: POS-tags and lemmatises the selected documents/resources, sorts the lemmas according to their frequency of occurrence
Program: applies the criteria to classify words/senses as "new" or "out-of-use"
Lexicographers: validate or reject
Program: creates updated versions of eDTLR++
Lexicographers helped by adequate interfaces: modify the automatically generated dictionary, wherever needed
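The automatic part of this loop can be sketched as follows. Everything specific here is an assumption made for illustration: the input is taken to be already POS-tagged and lemmatised, the function names are hypothetical, and the two thresholds (`min_count`, `last_seen_before`) are placeholders for criteria that, in the plan above, the eDTLR++ Committee would set. The sample corpus is made up.

```python
from collections import Counter


def rank_lemmas(documents):
    """documents: list of (year, [lemma, ...]) pairs, already lemmatised.

    Returns (lemma, count) pairs, most frequent first.
    """
    counts = Counter(lemma for _, lemmas in documents for lemma in lemmas)
    return counts.most_common()


def propose_changes(documents, dictionary, min_count=3, last_seen_before=1950):
    """Propose 'new' and 'out-of-use' candidates for lexicographers to validate.

    A lemma is proposed as new if it is absent from the dictionary and occurs
    at least min_count times; a dictionary word is proposed as out-of-use if
    its latest attestation is older than last_seen_before (or it never occurs).
    """
    counts = Counter()
    last_seen = {}
    for year, lemmas in documents:
        for lemma in lemmas:
            counts[lemma] += 1
            last_seen[lemma] = max(year, last_seen.get(lemma, year))
    new = sorted(l for l, c in counts.items()
                 if l not in dictionary and c >= min_count)
    out_of_use = sorted(l for l in dictionary
                        if last_seen.get(l, float("-inf")) < last_seen_before)
    return new, out_of_use


# Made-up miniature corpus and dictionary.
documents = [
    (1930, ["abac", "numeraţiune"]),
    (2005, ["abac", "blog", "blog"]),
    (2006, ["blog", "vinilin"]),
]
dictionary = {"abac", "numeraţiune", "vinilin"}

print(rank_lemmas(documents)[0])
print(propose_changes(documents, dictionary))
```

The program only proposes candidates; acceptance or rejection stays with the lexicographers, as in the steps above.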

Conclusion

Offer eDTLR to the public!

Acknowledgements

Romanian Academy: Marius Sala and Viorel Barbu
"Al.I.Cuza" University of Iasi: Dumitru Oprea
The CNCSIS grant 1815: "The Dictionary of the Romanian Language in electronic format. Studies regarding its acquisition" (2003-2005)
The INTAS grant 05-104-7633 RolTech "Platform For Romanian Language Technology: Resources, Tools And Interfaces" (running)
PIM SRL, Iasi