Phonologic and Syllabic Patterns of Brazilian Portuguese Extracted from a G2P Decoder-Parser


    International Journal of Advanced Computer Science, Vol. 3, No. 8, Pp. 416-427, Aug., 2013.

    Manuscript Received: 21 Apr. 2013
    Revised: 26 May 2013
    Accepted: 5 Jul. 2013
    Published: 15 Jul. 2013

    Keywords: Brazilian Portuguese, Structure, Patterns, Phonology, Syllable.

    Abstract: This paper presents the distribution of Brazilian Portuguese phoneme patterns, as obtained from an automatic, grammar rule-based grapheme-to-phoneme converter. The software Nhenhém was used for data treatment: written texts were decoded into phonological symbols, forming a corpus that was subjected to statistical analysis. The results support the high level of predictability of the distribution of Brazilian Portuguese phonemes, with consonant-vowel (CV) as the canonical syllable pattern and 'CV.CV as the canonical stress pattern. The efficiency of a grapheme-to-phoneme converter based entirely on rules is also demonstrated. These results are presented and discussed, along with some aspects of the Nhenhém converter and parser.

    1. Introduction

    The challenging problem of how alphabetic systems represent the phonology of a given language [1] is the issue discussed here. We illustrate it with empirical evidence based on statistical analysis of the distribution of Brazilian Portuguese phonemes and syllable structure. In addition, questions of prosody are addressed, with some comments on the spelling agreement signed in 2009 by seven countries where Portuguese is the official language. This agreement, whose goal is to standardize Portuguese spelling, will probably take effect in 2016.

    The patterns presented were obtained with the software Nhenhém [2], from the analysis produced by an automatic, grammar rule-based grapheme-to-phoneme converter of Brazilian Portuguese written texts. The program was also extended to work as a syllable parser. The presentation is preceded by a description of the relation between the Portuguese writing system and the phonological one, and of the main problems the programmers had to solve in order to write the algorithms. Some principles of the Portuguese spelling system, together with some of the theories that guided the construction of the converter, support the discussion. Nhenhém supplied all transcriptions used in this article.

    This work is supported by CAPES, an entity of the Brazilian government for the qualification of human resources. Vera Vasilévski, Federal University of Santa Catarina (UFSC); Leonor Scliar-Cabral, UFSC; Márcio José Araújo, Federal Technological University of Paraná (UTFPR).

    2. Spoken and Written Language

    Science and History [1] state that oral verbal language develops spontaneously wherever traces of humanization are found, whereas written language is an invention whose intensive and systematic learning is necessary in most cases [3]. Linguistic evolution is not just a matter of phonological and phonetic change; however, changes often start as modifications of pronunciation [1]. Consequently, oppositions fade and disappear, creating homonyms, and new words are introduced to avoid the resulting ambiguity of signs [4]. Languages are in perpetual change, although they appear to be at rest. The distance between the oral and the written systems keeps growing, since the latter is conservative and subject to literary traditions.

    In alphabetic systems, one or more letters (graphemes) represent the phonemes. Those units, which belong to the second articulation, distinguish meaning in writing, but the representation is not one-to-one, because of the distance between the oral and the written systems already mentioned. Another divergent principle is also at work: the etymological one. Since many spellings are based on etymological origin [3], writing does not represent the oral system faithfully. Spoken and written language each have their own laws and ways.

    A. Phonetics and Phonology

    While Phonetics is concerned with describing speech sounds (phones) from the point of view of their articulation, perception and physical properties, Phonology studies the phonemes of a language, that is, classes of sounds abstractly represented in the minds of a linguistic community. Phonemic transcription is therefore broad (general), covering all possible phonetic variations of each phoneme. The aim of Phonology is deep invariance, while Phonetics looks for surface variation.

    There are many schools of Phonology. The first was the Prague Circle, which introduced the functionalist approach, meaning, in this case, that only phonetic differences that cause differences of meaning are relevant. Perception of those differences is a psychic process and implies disregarding any similar phonetic difference that does not produce a different meaning.


    International Journal Publishers Group (IJPG)


    Phonology abstracts away from the physical properties of sounds, which are the field of Phonetics. In the terms of Glossematics, Phonetics studies the expression of sounds (the substance of sounds in their multiplicity and variation), and Phonology studies the form (relations, classes, and the abstract nature that takes place in the mind) [4].

    Since alphabetic principles are based on the representation of phonemes, any automatic program must start from the phonological description of the respective language, which is the case of the Brazilian Portuguese phonological system transcribed here.

    B. Brazilian Portuguese spelling system

    We present and discuss here some of the most important rules of the spelling system.

    Portuguese is a syllable-stressed language, i.e., the vast majority of Portuguese words have a stressed syllable, leaving aside clitics, which are few but are the most frequently used words (for instance, articles, most prepositions, some pronouns and conjunctions, and accusative pronouns). However, the stressed syllable is not marked graphically in the most frequent type of stressed word (words stressed on the penultimate syllable): following Occam's razor, only the stress of the less frequent patterns is registered graphically.

    Marking stress graphically is a powerful cue for the reader, because it guides him or her in matching the written word with its oral representation in the mental lexicon. The criteria for graphically marking Portuguese words are the following: a) the syllable on which stress falls; b) the last vowel, followed or not by "s"; c) the last consonant; d) the difference between diphthong and hiatus. Details and examples are given below.

    The stress diacritics of Portuguese are the acute ("chapéu", hat) and the circumflex ("você", you). A morphosyntactic diacritic (which does not signal stress), the grave accent, marks the fusion of the preposition "a" with the definite article or demonstrative pronoun "a/as", or with the same vowel beginning the demonstrative pronouns "aquela(s)", "aquele(s)", yielding "à/às", "àquela(s)", "àquele(s)". For instance, "Fui à casa da Maria" (I went to Mary's home), "Vamos àquele lugar" (Let's go to that place).

    In Portuguese, stress may fall on the last, penultimate, antepenultimate or, much more rarely, the fourth-to-last syllable of the phonological word, for example, "núpcias" (wedding) /nu.p.si.aS/ [5]. The phonological word in Portuguese is well defined, and its distinctive mark is stress [5]. Thus, the stress position clearly reveals the distinctive vowel [6].

    The position of stress does not depend on the phonemic structure of the word. No word ending in Portuguese imposes a particular stress, but there is a termination which is more frequent, although such frequency cannot be determined phonologically [6]. However, the characteristic Portuguese stress falls on the penultimate syllable, which gives Portuguese a bass rhythm. Nevertheless, Brazilian Portuguese has more words stressed on the last syllable than European Portuguese, because it incorporated words from the African and Indigenous languages spoken by those who lived alongside the Portuguese colonists in the past [5]. Even so, the influence of Indigenous and African languages was only lexical, since no phoneme belonging to them was borrowed by the Brazilian Portuguese phonemic system [25].

    Another characteristic that makes the Portuguese system of marking the stressed syllable in writing effective is that it was guided by phonological intuition. The main stress of Portuguese words is registered graphically according to the frequency of the pattern in the language. The most frequent word pattern is 'C(C)V.C(C)V(s)#, where the last vowel must be "a", "e" or "o". These words do not receive any written mark of stress, e.g., "mesa" (table) /me.za/, "escreves" (you write) /iS.kr.viS/, "livro" (book) /li.vru/. The pattern 'C(C)V(s)# is the second most frequent: the last written vowel must be "a", "e" or "o". If the last vowel is [-high, -low], it receives a circumflex, e.g., "avô" (grandfather) /a.vo/; if the last vowel is [+low], it receives an acute accent, e.g., "sofá" (sofa) /so.fa/, "cafés" (coffees) /ka.fS/, "vovó" (grandma) /vo.v/. On the other hand, if the last stressed vowel is "i" or "u", as in "abacaxi" (pineapple) /a.ba.ka.i/ and "caju" (cashew) /ka.u/, the word does not receive any diacritic.

    In Brazil, in most sociolinguistic varieties, the unstressed final vowels spelled "e" and "o" are neutralized in favor of /i/ and /u/, respectively, when pronounced. This neutralization happens because, when the penultimate or antepenultimate syllable of the word carries the stress, the last syllable is reduced: "gente" (people) /.ti/, "carro" (car) /ka.u/.

    Words ending in descending diphthongs without any diacritic must be read with stress on the last syllable: "plebeu" (commoner) /ple.bew/, "união" (union) /u.ni.w/. If stress falls on the penultimate syllable of a word ending in a descending diphthong, the stressed vowel is marked with a diacritic: "pônei" (pony) /po.ne/.

    In Portuguese, all words stressed on the antepenultimate syllable have that syllable marked graphically, since this pattern is the least frequent: "número" (number), "cálida" (warm, fem.), "zênite" (zenith) /nu.me.ru/, /ka.li.da/, /ze.ni.ti/.

    One example of the morphosyntactic function of a diacritic occurs with two verbs, "ter" (to have) and "vir" (to come), and their derivatives, in the third person plural of the present indicative ("têm", "vêm", "contêm", "provêm") [3], where the circumflex indicates plural, since the third person singular forms are "tem", "vem", "contém", "provém". The pronunciation, however, does not change, since the singular and plural forms are homophones: "vem", "vêm" /v/, /v/.

    In summary, the Portuguese written system of marking stress is based on the principle of economy (Occam's razor), in that the most frequent pattern, 'CV.CV(s), is the one that does not receive a diacritic. It thus facilitates decoding, although it is more complicated for coding, especially as it is not properly understood by teachers and, therefore, by students. This system has lost some of its qualities based on phonological intuition, owing to diachronic changes in the oral system and the lack of spelling rules that reflect those changes: the 1991 (2009) agreement made the situation worse. We will come back to this point.
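The reading rules summarized in this section can be sketched in a few lines of Python. This is an illustrative simplification and not Nhenhém's algorithm: the function name `stressed_syllable`, the already syllabified input, and the reduced rule set are assumptions made for the example. The sketch checks for a written acute or circumflex first, then treats a tilde in the last syllable as final stress (as in "união"), then assigns penultimate stress to unmarked words ending in "a", "e" or "o" (optionally followed by "s"), and final stress to everything else. Endings such as "-em" and "-am" and other special cases are ignored.

```python
import unicodedata

ACUTE, CIRCUMFLEX, TILDE = "\u0301", "\u0302", "\u0303"


def _marks(syllable: str) -> set[str]:
    """Combining diacritics found in a syllable (after decomposition)."""
    return {ch for ch in unicodedata.normalize("NFD", syllable)
            if unicodedata.combining(ch)}


def stressed_syllable(syllables: list[str]) -> int:
    """Index of the stressed syllable of an orthographically syllabified word."""
    # 1. A written acute or circumflex marks the stressed syllable directly.
    for i, syl in enumerate(syllables):
        if _marks(syl) & {ACUTE, CIRCUMFLEX}:
            return i
    last = len(syllables) - 1
    if last == 0:
        return 0
    # 2. Simplifying assumption: a tilde in the final syllable means final stress.
    if TILDE in _marks(syllables[-1]):
        return last
    # 3. Unmarked words ending in a, e, o (+ optional s): penultimate stress.
    if syllables[-1].rstrip("s").endswith(("a", "e", "o")):
        return last - 1
    # 4. Any other unmarked ending (i, u, diphthong, consonant): final stress.
    return last


for word in (["me", "sa"], ["a", "vô"], ["a", "ba", "ca", "xi"], ["nú", "me", "ro"]):
    print("-".join(word), stressed_syllable(word))
```

The order of the checks mirrors the economy principle: the most frequent pattern needs no mark, so it is reached only after every marked case has been ruled out.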

    C. The Portuguese syllable

    The syllable is the higher-level unit in which phonemes (vowels and consonants) combine to allow enunciation [6]. Syllable division is studied in depth by Phonology, and the types of syllable structure characterize languages. The basic phonemic structure is the syllable, not the phoneme (Jakobson, 1967 apud [5]). The syllable in Portuguese can be understood as a set of positions (slope or onset, core or nucleus, and decline or coda) to be occupied by specific phonemes. The nucleus is the only obligatory position in Portuguese and must always be occupied by a vowel, which is the predominant phoneme of the syllable. The onset is occupied by one or more consonants and may be absent. Further restrictions apply to the coda, which in Brazilian Portuguese accepts only certain consonants and the semivowels /j/ and /w/; the coda can also be empty.

    A basic syllabic schema is displayed in Fig. 1. Although it lacks a few optionality marks, and so needs some refinement, it gives an idea of the phonological syllable structure of Brazilian Portuguese.

    In Brazilian Portuguese, the so-called free or open syllables, which end in a vowel, predominate. They include simple syllables (V) and open complex syllables (CV). Locked or closed syllables are those ending in consonants (VC, CV(C)C). They are much less frequent in Brazilian Portuguese, and there are severe constraints on which consonants may occupy this position [5]. The most complex syllables in Portuguese are those that end with two or three phonemes: CCVVC ("claus.tro.fo.bi.a" /klawS.tro.fo.bi.a/), CCVCC ("trans.mu.ta.ção" /traNS.mu.ta.sawN/ ~ /trS.mu.ta.sw/), and CVCCC ("gangs.te.ris.mo" /gaN.gS.te.riS.mu/ ~ /g.gS.te.riS.mu/). The CVCCC syllable is only orthographic, since it breaks into two phonological syllables, that is, CVC.CVC or CV.CVC. For the last two examples, two phonological interpretations are possible: the first assumes a nasal consonant in coda position and no nasal vowels, while the second assumes nasal vowels and no nasal consonant phoneme in coda position (what the second interpretation admits is the existence of phonetic variants, or allophones, conditioned by the following consonant).

    One of the most important pieces of evidence in favor of the second interpretation is that a velar nasal consonant is produced whenever the following onset is a velar consonant, for instance, in the word "canga". There is no velar nasal phoneme in Portuguese, nor is it possible to commute any of the so-called nasal consonants in internal coda position, in the same context (minimal pairs), so as to produce a change of meaning. Nhenhém's syllable parsing adopts the second interpretation.

    The sequence CCCV is not valid in Brazilian Portuguese, although it is valid in European Portuguese [14]. A foreign word like "stress" is pronounced [is.tr.si], so its written form in Brazil is "estresse".
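The syllable types named in this section (V, CV, CCVVC and so on) can be derived mechanically from a parsed transcription. The sketch below is only an illustration under stated assumptions: one character per phoneme, the ASCII symbols used in the examples above, semivowels /j/ and /w/ counted as V (as in the CCVVC label given for /klawS/), and an "open" syllable defined as one ending in a full vowel. The function names are hypothetical and are not part of Nhenhém.

```python
from collections import Counter

VOWELS = set("aeiou")      # assumption: oral vowels only, one symbol each
SEMIVOWELS = set("jw")     # counted as V in the skeleton, as in CCVVC


def cv_skeleton(syllable: str) -> str:
    """Map each phoneme symbol of a syllable to C or V."""
    return "".join(
        "V" if ch in VOWELS or ch in SEMIVOWELS else "C" for ch in syllable
    )


def is_open(syllable: str) -> bool:
    """Assumption: a syllable is open when its last symbol is a full vowel."""
    return syllable[-1] in VOWELS


def syllable_types(parsed_word: str) -> Counter:
    """Count the CV skeletons in a dot-separated phonological parse."""
    return Counter(cv_skeleton(s) for s in parsed_word.split("."))


print(syllable_types("klawS.tro.fo.bi.a"))
```

Summing such counters over a whole corpus gives the kind of syllable-pattern distribution that the statistical reports describe.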

    In general, syllable delimitation in Portuguese is clear, but there are three cases in which it fluctuates. These are three groups of vowel contexts in which an unstressed high vowel may be considered either a semivowel, belonging to a diphthong, or a vowel, forming a hiatus [6]: a) /i/ or /u/ preceded or followed by another unstressed vowel ("variedade", "saudade", "cuidado"); b) /i/ or /u/ followed by a stressed vowel ("piano", "viola"); and c) /i/ or /u/ followed by an unstressed vowel at the end of the word ("índia", "assíduo"). Phonetically, these can be understood as diphthongs or hiatuses in free variation, with no distinctive opposition. Phonologically, however, the syllable boundary is variable but not significant. In Brazilian Portuguese, they are better understood as hiatuses (/va.ri.e.da.di/, /pi..nu/, /vi..la/, /.di.a/, /a.si.du.u/), except when the second vowel is "i" or "u", in which case they are better understood as diphthongs: /saw.da.di/, /ku.da.du/.

    The above explanation is part of the theory that underlies the Nhenhém decoder and its parsing procedures.

    Fig. 1 Brazilian Portuguese syllabic-phonologic schema [22]. V: vowel; C: consonant; { } braces indicate that the phonemes inside them may be combined with the respective phoneme on the left or on the right; ( ) parentheses indicate that the phonemes inside them may or may not occur in that position; | | archiphoneme; / / phoneme.


    3. Methodology, discussion and results

    In this section, we present the automatic decoder Nhenhém and the methodology applied to the corpus together, because the two are closely related. They are followed by the results and their discussion.

    A. The decoder Nhenhém

    The word that gives the program its name, "nhenhém", comes from the Tupi language, spoken by several indigenous peoples who lived and still live in Brazil, and means the endless repetition of lip movements producing sounds, as the voice does; an analogue of that word would therefore be "blah-blah-blah".

    Nhenhém (/./) is a computer program that decodes Brazil's official writing system into phonological symbols, performs syllable division, and marks prosody. This program was used for translating, editing, grouping, and searching the corpus.

    What inspired the development of the software, in 2008, was the high level of transparency of the Brazilian Portuguese alphabetic system, although there are some problems, namely the fact that each of the graphemes "e" and "o" represents two different vowels, /e/, // and /o/, //, respectively. So, the hypothesis that the system is highly predictable guided the building of rule-based software that automatically converts graphemes into phonemes.

    Methodologically, the development of the application combines Computational Linguistics, Corpus Linguistics, Statistics, Phonology, and Phonetics. Since the planning of the program combined proper methodology and linguistic theory, the software could be built in a programming language that is not specifically designed for the treatment of human language. The symbols Nhenhém uses for the conversions are displayed in Tab. 1.

    The software reads relatively large amounts of data and produces phonological reports together with statistical reports. In tests on a properly assembled phonological corpus, the application reached no less than 98% accuracy on the portion of the Brazilian writing system that is predictable by decoding rules. In relation to the writing system as a whole, correctness is not less than 95%. It is known that, to implement the rules for certain groups, it is important to identify the syllabic unit [13], [14]; however, the first version of Nhenhém [2] reached at least 95% accuracy without recognizing the syllabic unit. This accuracy was measured by testing several texts with the program. Now that the Nhenhém parser is ready to approach this issue properly, many new possibilities for language research are available. Besides this performance, the program also reaches at least 99% precision in marking word stress. These results confirm the hypothesis and attest to the high level of predictability of the Brazilian alphabetic system, thanks to its phonological basis. They also corroborate that the Brazilian alphabetic system represents prosody in a logical, accurate, economical and effective manner.

    TABLE 1 NHENHÉM LETTERS, DIGRAPHS AND CORRESPONDING PHONEMES

    Graph  Phon   Example
    á      //     água (water)
    à      //     àquela (to which)
    â      //     lâmpada (light bulb)
    ã      //     maçã (apple)
    é      //     pé (foot)
    é      //     contém (it contains)
    ê      /e/    lêvedo (barm)
    ê      //     têmpora, ênfase (temple, emphasis)
    e      //     era (era)
    e      /i/    elefante (elephant)
    í      /i/    lívido (livid)
    í      //     límpido, índio (clear, Indian)
    i      /j/    peito (breast)
    i      //     muito (much)
           //     ad(i)vento (advent)
    ó      //     pó (powder)
    õ      //     anões (dwarfs)
    ô      /o/    pôs (it put, past)
    ô      //     cômputo, cônscio (calculation, conscious)
    o      //     somente (only)
    o      /o/    comente (you comment)
    o      /w/    mão (hand)
    o      /u/    pato (duck)
    u      /w/    pau, taquara (wood, bamboo)
    ú      /u/    útil (useful)
    ú      //     cúmplice, anúncio (accomplice, ad)
    ü      /w/    cinqüenta (fifty)
    c      /s/    cebola (onion)
    c      /k/    acudir (to help)
    ch     //     achar (to find)
    g      //     gente, agir (people, to act)
    gu     /g/    guerra, guitarra (war, guitar)
    h             hoje, ah (today, oh)
    j      //     janela (window)
    l      /w/    anzol (hook)
    l      /l/    lençol (sheet), inclusão (inclusion)
    lh     //     malha (mesh)
    lh     /l/    filhinho (sonny)
    m      /m/    miar (to meow)
    n      /n/    ano (year)
    nh     //     ninho (nest)
    qu     /k/    quente, caqui (hot, khaki)
    q      /k/    aqüático (aquatic)
    r      /r/    cera, prata (wax, silver)
    r      |R|    amor (love)
    r      //     melro, enredo (blackbird, plot)
    r      //     rosto (face)
    rr     //     amarrar (to tie)
    s      /s/    sapo (frog)
    s      |S|    mosca, lesma (fly, snail)
    ss     /s/    assar (to bake)
    sc     /s/    fascinante (fascinating)
    sç     /s/    cresça (it grows up)
    s      /z/    asa (wing)
    x      /kS/   táxi (taxi)
    x      |S|    expor (to expose)
    x      /z/    exato (exact)
    xc     /s/    exceção (exception)
    z      /z/    azedo (acid)
    z      |S|    luz (light)


    The program does not cover every aspect of translating written texts into phonological transcription, because there are some exceptions in the Portuguese writing system. For instance, the values of the letter "x" are not all predictable by rule. It can be decoded as five different phonemes: //, /s/, /z/, /kS/, |S|. For example: "graxa", "sintaxe", "exame", "nexo", "texto" /gra.a/, /s.ta.si/, /e.z.mi/, /n.k.su/, /teS.tu/. Among these five possibilities, two are predictable: the value /z/, when "x" comes after an initial "e" and begins the syllable, and the value // when "x" ends a syllable and is followed by a syllable-initial unvoiced consonant. Another predictable value is //, when "x" begins the word.
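The predictable contexts for "x" described above lend themselves to a small rule function. The following is a hedged sketch, not Nhenhém's implementation: the function name `predict_x`, the letter-based test for "unvoiced consonant", and the output labels are assumptions. The archiphoneme is written "S" following the example "texto" /teS.tu/, and the word-initial value is written with the IPA symbol ʃ, which may differ from the symbol Nhenhém uses. The digraph "xc" before "e" or "i" is left to a separate digraph rule, as in Tab. 1.

```python
import unicodedata

VOWELS = set("aeiou")
UNVOICED_LETTERS = set("ptcfqs")  # assumption: letters spelling unvoiced consonants


def _base(letter: str) -> str:
    """Strip diacritics so that accented vowels are still treated as vowels."""
    return unicodedata.normalize("NFD", letter)[:1]


def predict_x(word: str, i: int):
    """Return a rule-predictable value for the x at index i, or None."""
    if word[i] != "x":
        raise ValueError("index does not point at the letter x")
    nxt = _base(word[i + 1]) if i + 1 < len(word) else ""
    after = _base(word[i + 2]) if i + 2 < len(word) else ""
    if i == 0:
        return "ʃ"                      # x begins the word
    if i == 1 and word[0] == "e" and nxt in VOWELS:
        return "z"                      # initial e + x beginning the syllable
    if nxt == "c" and after in {"e", "i"}:
        return None                     # digraph xc, handled elsewhere
    if nxt in UNVOICED_LETTERS:
        return "S"                      # x closes the syllable before an unvoiced consonant
    return None                         # not predictable by these rules


for w in ("exame", "texto", "graxa", "nexo"):
    print(w, predict_x(w, w.index("x")))
```

Returning `None` for the remaining cases keeps the unpredictable values ("graxa", "sintaxe", "nexo") visible, so that they can be resolved by a lexicon or by manual editing.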

    There are also some cases of ambiguity, for instance, the value of the letter "s" after "b", e.g., "observar" (to observe) /o.b.seR.vaR/, "obséquio" (favor) /o.b.z.ki.u/. So, we consider that this "s" represents an archiphoneme: /o.b.SeR.vaR/ and /o.b.S.ki.u/ [16].

    Morphology can also give rise to unpredictable situations. For example, the prefix "trans-", which means across, causes a pronunciation ambiguity: "transamazônica" (trans+amazônica) is correctly decoded as /tr.za.ma.zo.ni.ka/, but "transiberiana" (trans+siberiana) was decoded as */tr.zi.be.ri..na/ instead of /tr.si.be.ri..na/, because there is resyllabification. This problem can only be solved by combining morphological and phonological rules in the program. We examined this issue in depth in previous works [1], [15], [24], and managed to fix it in 2011 [22].

    Furthermore, the [+low] vowels // and // are written "e" and "o", as mentioned, which makes their values hard to predict, since they share their spelling with /e/ and /o/. When they are stressed and also marked graphically, the conversion is correct. The reduction of pre-tonic and post-tonic vowels is also not properly addressed in the Nhenhém algorithm. It is worth pointing out that this reduction is subjective, since most of the time it depends on the speaker's linguistic variety. Thus, as long as all distinctive features are preserved, whatever the variety, it is an issue for Phonetics, not for Phonology.

    Moreover, we decided to treat the so-called rising or ascending diphthong as a hiatus [5], [10]; therefore, words ending with it are decoded as stressed on the antepenultimate syllable: "ósseo" /.si.u/, "história" /iS.t.ri.a/, "náusea" /naw.zi.a/, "ócio" /.si.u/.

    In 2010, Nhenhém was ported to another programming language, which allowed us to improve its performance (Fig. 2). We extended the main algorithm, and the system became able to provide the phonological syllabic division. As a consequence, we also obtained the spelling syllabic division, with at least 99% accuracy. It thus became easy to mark the stressed syllable, whereas the 2008 version marked only the stressed vowel. We used this renewed algorithm to build an automatic syllable parser for Brazilian Portuguese [15]. We had to deal with the syllabification of words containing hyphens, such as "beija-flor" (hummingbird), "pé-de-moleque" (a peanut candy), and "dever-se-ia" (verb "to have a duty", third person singular, past future indicative, synthetic passive voice, with tmesis), a problem we solved [16].

    In addition, we built an interface between Nhenhém and the software Laça-palavras [17], [18], which is used for linguistic research. Furthermore, we used the Nhenhém prosodic-phonological algorithm to build a program for speech therapy [19], consulting the specific literature [20]. This program has been presented [19], [21], commented on [22], and tested [23]; the results were encouraging [23].

    Fig. 2 Main screen of the program Nhenhém 2012, integration with parsing [22]

    The text is converted as the user types or pastes it. Pasted texts must have simple formatting, that is, no capital letters. The stressed vowel is marked on the user's command. Fig. 2 shows the Nhenhém decoder-parser at work. In the field "Saída 1", the input text appears converted into phonological symbols and parsed; the stressed syllable is indicated by the prosody mark before its first symbol. In the field "Saída 2", the text appears orthographically parsed. There is only one mistake in the converted text of Fig. 2: the word "correto" (correct) should be decoded as /ko..tu/.

    The Nhenhém user can automatically convert anything from a single word to a 20-page text, and then edit, save, search and print it. As the conversion is estimated to be at least 95% accurate, the user can edit the remaining 5% (or less) of the text that was converted incorrectly, converting, replacing and inserting symbols, and adjusting to dialects. The program also allows several texts to be stored in a database for use in statistical reports.

    B. Basic functioning of the automatic syllable parser

    As shown in the flowchart (Fig. 3), the phonological syllable parsing of a word depends on the preceding phoneme. If the current phoneme is a consonant or a vowel, and the preceding one is a vowel, a semivowel or an archiphoneme, then the current phoneme occupies a syllable boundary position. Consequently, the syllable division marker, a dot, is inserted before it. For instance, the graphic word "angústias" (anguishes) is converted into its phonological form /g'uStiaS/ and then parsed into its phonological syllables as /.'guS.ti.aS/.

    Fig. 3 Syllable parsing basic computational process


    This word has four vowels, //, /u/, /i/ and /a/, which means that it has four syllables; therefore, the program must insert three syllable markers in it. The first syllable ends before the consonant /g/, since the phoneme that precedes it, //, is a possible syllable boundary. Then, since there are more phonemes, the search for the second syllable boundary begins. The second syllable ends before the consonant /t/, because it is preceded by the archiphoneme |S|, which is also a possible syllable boundary. Next comes another phoneme, the vowel /i/, which means that there is a third syllable. After it comes the phoneme /a/, which is preceded by another vowel, the /i/ just mentioned, which is a possible syllable boundary, as seen. This leads the program to mark the end of the third syllable before /a/. Since /a/ is a vowel, it belongs to another syllable, that is, the fourth one, which ends with |S|. As there are no more vowels in the word, it has no more syllables, and the parsing procedure is complete. In this process, the program takes into account all the rules presented in section 2.C.
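The boundary rule walked through above can be sketched as a single left-to-right pass. This is a minimal illustration of the basic rule only, not the code shown in Fig. 4: the function name `parse_syllables` and the symbol inventory are assumptions, each phoneme is taken to be one character, "ã" stands in for the nasal vowel of the example, and the refinements of section 2.C (for instance, consonant clusters after an open syllable) are not handled. The stress mark is assumed to precede the stressed vowel in the input and is moved to the start of its syllable in the output.

```python
STRESS = "'"
VOWELS = set("aeiou\u00e3\u00f5")   # a, e, i, o, u plus nasal a and o (assumed symbols)
SEMIVOWELS = set("jw")
ARCHIPHONEMES = set("SR")
BOUNDARY_LEFT = VOWELS | SEMIVOWELS | ARCHIPHONEMES


def parse_syllables(transcription: str) -> str:
    """Insert syllable dots into an unparsed phonological transcription."""
    stress_at = transcription.find(STRESS)     # index of the stressed vowel, -1 if none
    symbols = transcription.replace(STRESS, "")
    syllables = [""]
    stressed = None
    prev = ""
    for i, ch in enumerate(symbols):
        # A consonant or vowel that follows a vowel, semivowel or
        # archiphoneme starts a new syllable.
        starts_new = (
            prev in BOUNDARY_LEFT
            and ch not in SEMIVOWELS
            and ch not in ARCHIPHONEMES
        )
        if starts_new:
            syllables.append("")
        syllables[-1] += ch
        if i == stress_at:
            stressed = len(syllables) - 1
        prev = ch
    if stressed is not None:
        syllables[stressed] = STRESS + syllables[stressed]
    return ".".join(syllables)


print(parse_syllables("\u00e3g'uStiaS"))
```

Because every decision looks only at the current and the preceding symbol, the pass is linear in the length of the word, which is consistent with a parser that runs every time the input text changes.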

    The code snippet shown in Fig. 4 is part of the form belonging to the Nhenhém parser displayed in Fig. 2. This code is executed every time the input text changes, and it returns both the phonological syllabic division and the spelling syllabic division. Line 81, which is highlighted, calls the parsing method.

    The parsing method takes four parameters, as shown in Fig. 5:

    _inOrto is the original entry typed by the user ("conhecer", to know);
    _inFono is the phonological transcription of the orthographic input (keseR);
    _sFono receives the result of the phonological syllabic parsing (k.e.seR);
    _sRev receives the result of the orthographic syllabic division (co-nhe-cer).

    The processes responsible for returning the last two parameters are quite complex, so they will be the subject of future work.

    Fig. 4 Code snippet of the parser.

    Fig. 5 The four parameters of the parser.

    C. Phonologic-syllabic Corpus

    In order to test Nhenhém, and also to investigate phonologic and syllabic patterns of Brazilian Portuguese from written texts, we assembled a corpus of six articles published in 2007 in a Brazilian dentistry journal. They are technical and scientific texts, revised and updated, which were not produced for use in linguistic research [26], [27]. The six texts were pre-edited individually in a text editor before being pasted into Nhenhém. Foreign words, words containing graphemes that do not belong to the Portuguese writing system, and measurement units were eliminated, as were some acronyms. Some of them could be replaced by their spelled-out forms. The system itself excludes punctuation, hyphens, quotation marks, and some other symbols, so these do not need to be treated beforehand.

    To reduce the chance of conversion errors, care must be taken to ensure that the texts are perfectly readable by Nhenhém. After this preparation, the corpus texts were pasted into the program, converted, printed, checked, edited, rechecked, and saved for research. The exceptions were located and edited so as to obtain correct transcriptions. Numbers (ages, dates, centuries) were replaced by their spelled-out forms.
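Part of this preparation is mechanical and could be scripted. The sketch below is a hypothetical helper, not a component of Nhenhém: it lowercases the text (pasted texts must contain no capital letters) and replaces punctuation and quotation marks with spaces. The set of dropped symbols is an assumption; foreign words, acronyms and numbers still require the manual editing described above.

```python
import string

# Assumption: ASCII punctuation plus curly quotation marks are dropped.
DROP = set(string.punctuation) | set("\u201c\u201d\u2018\u2019")


def prepare_text(text: str) -> str:
    """Lowercase the text, drop punctuation and collapse whitespace."""
    lowered = text.lower()
    kept = "".join(" " if ch in DROP else ch for ch in lowered)
    return " ".join(kept.split())


print(prepare_text('Os "Dentes", de 2007!'))
```

Numbers are left untouched here on purpose: replacing them by their spelled-out forms depends on whether they denote ages, dates or centuries.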

    D. Statistical Reports: The Patterns

    The six texts were loaded to generate statistical reports on phonemic and syllabic distribution. The figures presented below are taken directly from these reports and, as such, are reliable.

    1) The Phonemic Patterns: After conversion, the corpus totaled 70,811 phonemes, distributed into 33,510 syllabic phonemes (vowels), 3,069 non-syllabic phonemes (semivowels), and 34,232 consonant phonemes. These numbers represent 47.32%, 4.33%, and 48.34% of the total, respectively.
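The percentages above follow directly from the three counts, as the short check below shows. The variable names are illustrative; only the counts come from the text.

```python
# Phoneme counts reported for the six-text corpus.
counts = {"vowels": 33510, "semivowels": 3069, "consonants": 34232}

total = sum(counts.values())
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

print(total)   # 70811
print(shares)  # {'vowels': 47.32, 'semivowels': 4.33, 'consonants': 48.34}
```

The rounded shares add up to 99.99%, which is only a rounding effect.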

    The distribution of the vowel phonemes is: Tongue position: 42.95% front, 57.05% back; Tongue height: 43.75% high, 25.07% mid, 31.18% low; Airstream way (the route taken by the air flow during vocalization): 87.50% oral, 12.50% nasal; Lip rounding: 29.25% rounded, 70.75% unrounded. It is worth remembering that all vowels are voiced.

    The distribution of consonants is: Manner of articulation: 51.77% occlusive and 48.23% constrictive, the latter distributed as follows: 61.68% fricative, 29.39% vibrating, 8.93% lateral; Place of articulation: 64.67% front, 16.50% back, 18.83% labial; Airstream way: 90.57% oral, 9.43% nasal (oral and nasal); Phonation: 48.14% unvoiced, 51.86% voiced. The consonantal archiphonemes |S| and |R| are not included in the numbers concerning phonation, because they result from the neutralization of features.

    The statistical report also provides the individual distribution of phonemes, which Tab. 2 displays for the corpus as a whole. To confirm the results, we tested just one of the six texts of the corpus (10,904 phonemes), whose numbers we present in detail (Fig. 6).

    It can be seen that the distribution of the main features is very similar, as are the other numbers provided by this report, which indicates phonemic patterns. A journalistic text of 8,454 phonemes was also prepared and tested individually with Nhenhém, and the results were similar, with differences of around 1% to 1.5%. Hence, the results, and the numbers that show the phonologic patterns of Brazilian Portuguese, seem reliable. The individual distribution for the text whose patterns are shown in Fig. 6 was presented before [24], and can be consulted for comparison with the numbers in Tab. 2.

    TABLE 2 CORPUS PHONEME INDIVIDUAL DISTRIBUTION

    We tried to find another program, or even a study, that approaches this issue in a similar way, that is, one that classifies segments according to their features and reports such statistics from a corpus, but we did not find any. So, for the time being, we cannot make comparisons to confirm the reliability of the numbers presented.

    We comment on some of the results here, although much more could be said about them. Back (posterior) vowels occur about 14 percentage points more often than front vowels (57.05% versus 42.95%). The most frequent back vowels are /a/ and /u/; among the front vowels, the most frequent is /i/, which occurs less than 2 percentage points less often than /a/ (10.74% versus 12.50%, Tab. 2). Thus, the vowel that occurs most in Portuguese is /a/, followed by /i/.

    The semivowel // occurs only in the word "muito" (many, much) /mu.tu/ and its derived forms. This is the only symbol used by Nhenhém that may not be read correctly by all computers. We have been studying a way to change it, so as to overcome the slight disturbance it may bring to the decoder-parser. Tests have shown that the other symbols are read as normal text by any computer.

    The // is counted together with /i/, since the former occurs when a word contains a sequence of two consonants that ordinarily do not form a coda (decline) and that belong to different syllables. In this case, the epenthetic // occurs while the sequence is pronounced. This inserted phoneme thus works as the core of a phonological syllable: "opção" (option), "cacto" (cactus) /o.p.sw/, /ka.k.tu/.

    Ph     Q       %          Ph     Q       %
    /a/    8851    12.50%     /n/    1304    1.84%
    /i/    7587    10.74%     /z/    1152    1.63%
    /u/    4618    6.56%      //     1084    1.53%
    /t/    4464    6.30%      /v/    934     1.32%
    /d/    4124    5.82%      /f/    887     1.25%
    /e/    3861    5.45%      //     774     1.09%
    /S/    3538    5.00%      //     677     0.96%
    /s/    3177    4.49%      /b/    568     0.80%
    /r/    2961    4.18%      //     560     0.79%
    /k/    2754    3.89%      //     551     0.78%
    /o/    2571    3.63%      //     516     0.73%
    /p/    2208    3.12%      //     406     0.57%
    //     1966    2.78%      /g/    375     0.53%
    /w/    1964    2.74%      //     212     0.30%
    /m/    1849    2.61%      //     90      0.13%
    /l/    1419    2.00%      //     75      0.09%
    /R/    1341    1.89%      //     55      0.07%
    //     1317    1.86%      //     21      0.03%

    Total: 36      70811   100%


    Looking at the oral and nasal features of both consonants and vowels, we can see that Portuguese is predominantly oral: around 89% of its phonemes are oral and only around 11% are nasal. The most frequent nasal phonemes are the vowel // (13th place in occurrence) and the consonant /m/ (15th place in occurrence). Thus, the 12 most frequent phonemes of Portuguese are oral.
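The figure of around 89% can be checked by pooling the oral shares reported for vowels (87.50% of 33,510) and consonants (90.57% of 34,232), weighted by their counts. The Python sketch below does this. Leaving the semivowels out is an assumption, since no oral/nasal split is given for them, and `pooled_share` is an illustrative name.

```python
def pooled_share(groups: list) -> float:
    """Count-weighted mean of percentage shares.

    groups: list of (share_in_percent, phoneme_count) pairs.
    """
    total = sum(count for _, count in groups)
    return sum(share * count for share, count in groups) / total


# (oral share in %, number of phonemes) for vowels and consonants.
oral = pooled_share([(87.50, 33510), (90.57, 34232)])
nasal = 100 - oral
print(f"oral: {oral:.1f}%  nasal: {nasal:.1f}%")
```

Under these assumptions the pooled oral share is slightly above 89%, which agrees with the rounded figures in the text.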

    Among the consonant phonemes, occlusives and constrictives are balanced, although occlusives occur around 3% more often than constrictives. The two most frequent consonants of Portuguese are the occlusives /t/ and /d/. Another noteworthy fact is that the fricative sound //, the nasal occlusive //, and the posterior lateral sound // are actually rare in this language. Nevertheless, most of them appear in all six texts of the corpus, because they occur in some very common words, like "chegar", "deixar" (to arrive, to leave) /e.gaR/, /de.aR/, "conhecer", "tamanho" (to know, size) /k.e.seR/, /ta.m.u/, and "trabalho" (work) /tra.ba.u/. The only sounds that do not occur are // in one text and // in another.

    From the results, we find that the phonemic distribution of Brazilian Portuguese is balanced, since vowels and consonants each account for roughly 50% of the phonemes.

    The semivowels reveal the number of diphthongs (the real ones, that is, falling or decreasing diphthongs), since semivowels occur only in this case. Diphthongs of the kind vowel+/w/ are more frequent than those of the kind vowel+/j/: the first kind appears at least 64% more often than the second.

    Furthermore, it is reasonable to hypothesize that CV (consonant+vowel) is the most common syllable pattern of Brazilian Portuguese, which leads us to address the Brazilian Portuguese syllable.

    2) The Syllable Patterns: The phonological-syllabic report, which groups the six texts of the corpus by syllabic frequency, reveals 628 syllable types (which suggests that Brazilian Portuguese does not have many more than that) and shows that the corpus as a whole consists of 33,960 syllables. Tab. 3 shows the 30 most common syllable types.

    According to Tab. 3, the most frequent syllable of Brazilian Portuguese is formed by the vowel /a/, which, besides occupying any position in words and combining with any consonant, consonant cluster, or semivowel, also forms a syllable by itself. Moreover, it is the feminine singular determiner, a preposition, an accusative pronoun, and a demonstrative. It represents around 5% of the syllables of the corpus, and Tab. 4 confirms this.
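A syllable-frequency report of this kind can be sketched in Python as follows. This is not the Nhenhém parser: it assumes that the input is already syllabified, with syllables separated by "." as in the transcriptions shown in this paper, and both the function name and the toy input are made up for illustration.

```python
from collections import Counter


def syllable_frequencies(transcriptions: list) -> list:
    """Count syllable types in dot-separated transcriptions.

    Returns (syllable, occurrences, percent) tuples, most frequent first.
    """
    counter = Counter()
    for word in transcriptions:
        # Drop the surrounding slashes, then split on the syllable boundary.
        counter.update(s for s in word.strip("/").split(".") if s)
    total = sum(counter.values())
    return [(syl, n, 100 * n / total) for syl, n in counter.most_common()]


# Toy input, not corpus data.
report = syllable_frequencies(["/ka.za/", "/a.ka/", "/ka.ta/"])
for syl, n, pct in report:
    print(f"/{syl}/ {n} {pct:.2f}%")
```

The length of the returned list is the number of syllable types, and the sum of the occurrence column is the number of syllable tokens, which correspond to the two totals quoted for the corpus (628 and 33,960).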

    Since 22 of the 30 syllables presented are open complex syllables, that is, CV syllables, we can conclude that CV is the most frequent syllable pattern of Brazilian Portuguese. There are two syllables with the nasal sound //, which confirms this sound as the most frequent among the nasal vowels, as the phonemic report indicates.
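The classification of syllables into patterns such as CV can be illustrated with a small Python sketch. The symbol inventory here is a simplifying assumption (only the oral vowels a, e, i, o, u and the semivowels j, w are recognized, and every other symbol counts as a consonant), so it does not reproduce the parser's own classification.

```python
VOWELS = set("aeiou")      # assumed inventory: oral vowels only
SEMIVOWELS = set("jw")     # assumed semivowel symbols


def cv_skeleton(syllable: str) -> str:
    """Map each symbol to V (vowel), G (semivowel) or C (anything else)."""
    skeleton = []
    for symbol in syllable:
        if symbol in VOWELS:
            skeleton.append("V")
        elif symbol in SEMIVOWELS:
            skeleton.append("G")
        else:
            skeleton.append("C")
    return "".join(skeleton)


def is_open(syllable: str) -> bool:
    """Treat a syllable as open when it ends in a vowel (an assumption)."""
    return cv_skeleton(syllable).endswith("V")


# The ten most frequent syllable types listed in Tab. 3.
top_ten = ["a", "di", "si", "ti", "u", "du", "ta", "da", "i", "tu"]
cv_count = sum(cv_skeleton(s) == "CV" for s in top_ten)
print(cv_count, all(is_open(s) for s in top_ten))
```

For these ten syllables the sketch finds seven CV syllables and three V syllables, all of them open.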

    Fig. 6 Nhenhém statistical report: general distribution [22], [24].


    TABLE 3 THE 30 MOST FREQUENT SYLLABLES OF BRAZILIAN PORTUGUESE

    Concerning the least frequent syllables, we found that 97 types occur only once (e.g. /kR/, /pR/, /liR/, /e/, /vre/, /loR/, /ka/, /braS/, /kwaR/); 58 types occur twice (e.g. /uS/, /vw/, /moR/, /p/, /baS/, /row/, /tuR/, /glo/, /baR/, /tr/); and 31 types occur three times (e.g. /neS/, /saR/, /niR/, /vriS/, /bru/, /bR/, /dr/, /deR/, /k/, /ke/). See Tab. 1 for some examples.

    There is no closed syllable among the 10 most frequent ones in any of the six texts of the corpus (Tab. 4). No complex slope (consonant clusters and velar+/w/) occurs among the 30 most frequent syllables (Tab. 3).

    Although the parser was created in 2010, this is the first time it has been used for research and its results shown. The syllabic report confirms the phoneme distribution report and opens up many possibilities for language research. However, a deeper analysis of the numbers presented is left for further studies.

    TABLE 4 THE 10 MOST FREQUENT SYLLABLES OF EACH TEXT OF THE CORPUS

    The Nhenhém decoder made it possible to know the distribution of phonemic patterns in Brazilian Portuguese, and the parser made it possible to understand how these phonemes combine in a linguistic unit larger than the phoneme: the syllable.

    E. The Spelling Agreement of 1991 (2009)

    Some changes will occur in Brazilian Portuguese spelling due to the spelling agreement already mentioned, according to which at least seven of the countries where Portuguese is spoken must use the same spelling from 2013 on. Although most users, including the press, have adopted the new rules, the Brazilian government postponed to 2016 the requirement to use the new orthographic rules.

    The most important change for Brazilian Portuguese orthography is the elimination of the diaeresis (trema). Consequently, the value of the digraphs qu- and gu- becomes unpredictable. Thus, "agüentar" (to stand) and

    Text 1                       Text 4
    Syll.   Q      %             Syll.   Q      %
    /a/     303    4.74%         /di/    167    4.91%
    /di/    243    3.80%         /a/     165    4.85%
    /si/    218    3.41%         /si/    110    3.23%
    /u/     212    3.31%         /ka/    94     2.76%
    /ta/    180    2.81%         /ti/    84     2.47%
    /tu/    173    2.70%         /da/    79     2.32%
    /du/    168    2.63%         /u/     77     2.26%
    /ti/    167    2.61%         /ri/    63     1.85%
    /m/     154    2.41%         /i/     60     1.76%
    /i/     118    1.84%         /ta/    53     1.56%

    Text 2                       Text 5
    Syll.   Q      %             Syll.   Q      %
    /a/     233    4.38%         /a/     269    5.65%
    /di/    179    3.37%         /ti/    147    3.09%
    /si/    168    3.16%         /di/    145    3.05%
    /ti/    156    2.94%         /si/    136    2.86%
    /du/    140    2.63%         /i/     111    2.33%
    /ta/    128    2.41%         /u/     105    2.21%
    /u/     127    2.39%         /du/    104    2.19%
    /ra/    104    1.96%         /tu/    97     2.04%
    /k/     102    1.92%         /e/     96     2.02%
    /i/     95     1.79%         /na/    94     1.98%

    Text 3                       Text 6
    Syll.   Q      %             Syll.   Q      %
    /a/     316    4.52%         /a/     424    5.98%
    /di/    248    3.55%         /di/    304    4.29%
    /si/    225    3.22%         /si/    260    3.67%
    /ti/    180    2.57%         /ti/    185    2.61%
    /se/    164    2.34%         /da/    150    2.12%
    /du/    155    2.22%         /u/     144    2.03%
    /da/    153    2.19%         /i/     128    1.81%
    /ta/    148    2.12%         /ta/    114    1.61%
    /u/     148    2.12%         /du/    111    1.57%
    /ra/    142    2.03%         /tu/    108    1.52%

    Syllable types   Occurrences   %
    /a/              1710          5.04%
    /di/             1286          3.79%
    /si/             1117          3.29%
    /ti/             919           2.71%
    /u/              813           2.39%
    /du/             722           2.13%
    /ta/             694           2.04%
    /da/             642           1.89%
    /i/              631           1.86%
    /tu/             612           1.80%
    /ra/             574           1.69%
    /k/              554           1.63%
    /m/              537           1.58%
    /pa/             490           1.44%
    /ka/             467           1.38%
    /sw/             455           1.34%
    /ri/             432           1.27%
    /na/             428           1.26%
    /li/             397           1.17%
    /se/             386           1.14%
    /e/              384           1.13%
    /o/              362           1.07%
    /e/              346           1.02%
    /d/              328           0.97%
    /eS/             305           0.90%
    /te/             291           0.86%
    /zi/             288           0.85%
    /ma/             286           0.84%
    /ki/             280           0.82%
    /duS/            278           0.82%
    ...              ...           ...
    /eR/             1             0.003%

    Total: 628       33960         100%


    "eqüino" (equine), correctly decoded as /a.gw.taR/ and /e.kwi.nu/, will be spelled "aguentar" and "equino", generating the translations */a.g.taR/ and */e.ki.nu/. Even so, in this case the syllable division remains correct, although the syllabic structure affected by the new rules, of course, does not. In Brazil, the diaeresis (trema) is still very commonly used, even by government institutions. For this reason, Nhenhém will preserve this resource in its algorithm.

    This means that the alphabetic system loses transparency, that is, it loses one of the rules that make it predictable; therefore, reading (decoding) is impaired. Other changes interfere less with the automatic translation (see [22] for more details), and none of them disturbs the prosody system.
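The loss of transparency can be made concrete with a toy Python rule for the digraphs gu- and qu-. This is only a sketch of the general idea, not Nhenhém's rule set: the function `decode_velar_digraphs` is hypothetical, it rewrites nothing but these digraphs before "e" or "i", and it leaves every other grapheme untouched.

```python
import re
import unicodedata


def decode_velar_digraphs(word: str) -> str:
    """Rewrite gü/qü as gw/kw, and gu/qu before e or i as g/k."""
    # Normalize so that "ü" is a single code point, then lowercase.
    word = unicodedata.normalize("NFC", word).lower()
    # With the diaeresis, the u is pronounced as a semivowel.
    word = word.replace("gü", "gw").replace("qü", "kw")
    # Without it, the rule can only read gu/qu before e or i as digraphs.
    word = re.sub(r"gu(?=[ei])", "g", word)
    word = re.sub(r"qu(?=[ei])", "k", word)
    return word


for spelling in ["agüentar", "aguentar", "eqüino", "equino"]:
    print(spelling, "->", decode_velar_digraphs(spelling))
```

With the diaeresis, the rule recovers the pronounced u; with the new spellings, the same rule drops it, which mirrors the incorrect translations mentioned above. A purely rule-based converter would then need an exception list for such words.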

    4. Conclusion and Outlooks

    The experience of building, testing, and using Nhenhém has shown how difficult the linguistic reading and conversion of electronic text is. The phonemic level is the easiest to systematize, and only a few questions regarding it remain unsolved; the difficulty is greater at the syllable level, but it can also be regarded as overcome; the morphological level comes next, and a little progress has already been made in this field; and then comes syntax, which is more intricate. The complexity of each level may be attenuated by the systematization of the previous levels, because each one takes advantage of the systematization of the other. So, converters like Nhenhém are a step toward future work on levels that transcend the phoneme, as we did for the syllable.

    Some decisions taken in building the system are objectionable to some and noteworthy to others, as are some of the theories chosen. However, this was not optional. The choices came from the needs imposed by the programming and, within that, from the objectivity and intelligibility of existing theories, and from the beliefs and intuitions of teachers, students, and other language users. The efficiency of Nhenhém confirms the usefulness of the theories adopted.

    Now that automatic syllable parsing is done, the project goes on. The systematization of the Brazilian Portuguese syllable made it possible to start addressing morphology, as we said. Still, there is a lot left to study about the syllable, since the reports are available. In this sense, one question to be examined in depth relates to prosody, a task that has already started [28]. In addition, some adjustments and increments are still to be made in the parser, in order to optimize editing by the user.

    Some of the next steps are building a voice synthesizer from Nhenhém and improving Nhenhém Fonoaud, the program for speech therapy, which already benefits from automatic syllable division. The program supports the analysis of processes that occur in the child's phonological system, through automatic phonological transcription carried out alongside recorded samples of the child's speech. Thus, the data rely on a phonemic representation of speech, produced automatically by the Nhenhém phonological-prosodic algorithm. NhFonoaud is designed for dealing with phonological tests, using words purposely grouped to analyze specific aspects of speech and the phenomena involved in its development. It is therefore an application for assisting speech therapy and, with it, language acquisition research. We are now making this program able to create graphics from the records and to automatically identify the phonological processes involved in the child's speech.

    Back to the algorithm, we are still working on rules to reduce the 5% (probably much less now) failure rate in conversion. Since the conversion tool successfully exploits the close correspondence between orthographic representation and pronunciation in Brazilian Portuguese, at the phoneme and syllable levels, it has proved useful in a wide range of applications.

    References

    [1] S. Silva Neto. História da língua portuguesa. Fifth Edition. Rio de Janeiro: Presença, 1988.

    [2] V. Vasilévski. Construção de um programa computacional para suporte à pesquisa em fonologia do português do Brasil, (2008). PhD Thesis, Federal University of Santa Catarina, Florianópolis, Brazil.

    [3] L. Scliar-Cabral. Princípios do sistema alfabético do português do Brasil. São Paulo: Contexto, 2003a.

    [4] B. Malmberg. A fonética: teoria e aplicações, (1993). Caderno de Estudos Lingüísticos, no. 25, pp. 7-24.

    [5] J. M. Câmara Jr. Estrutura da língua portuguesa. 16th Edition. Petrópolis: Vozes, 1986.

    [6] J. M. Câmara Jr. Problemas de Lingüística descritiva. 16th Edition. Petrópolis: Vozes, 1997.

    [7] J. M. Câmara Jr. Para o estudo da fonêmica portuguesa. Second Edition. Rio de Janeiro: Padrão, 1977.

    [8] M. Said Ali. Gramática secundária e Gramática histórica da língua portuguesa. Third Edition. Brasília: Editora da UnB, 1964.

    [9] E. Bechara. Moderna gramática portuguesa. 19th Edition. São Paulo: Cia. Editora Nacional, 1973.

    [10] L. Bisol. O ditongo da perspectiva da fonologia atual, (1989). Revista Delta, vol. 5, no. 2, pp. 185-224.

    [11] L. C. Cagliari. Análise fonológica: introdução à teoria e à prática. Campinas: Mercado das Letras, 2002.

    [12] International Phonetic Alphabet (IPA). 2013. http://www.langsci.ucl.ac.uk/ipa/ipachart.html

    [13] J. J. Almeida, A. Simões. Text to speech: A rewriting system approach, (2001). Procesamiento del Lenguaje Natural, vol. 27, pp. 247-255.

    [14] S. Candeias, F. Perdigão. Conversor de grafemas para fones baseado em regras para português, (2008). L. Costa, D. Santos, N. Cardoso (Eds.). Perspectivas sobre a Linguateca/Actas do encontro Linguateca: 10 anos, n. 14, pp. 99-104.

    [15] V. Vasilévski. Divisão silábica automática de texto escrito baseada em princípios fonológicos, (2010). Anais do III Encontro de Pós-graduação em Letras da UFS (ENPOLE), São Cristóvão, Sergipe, Brazil.

    [16] V. Vasilévski. O hífen na separação silábica automática, (2011). Revista do Simpósio de Estudos Lingüísticos e Literários SELL, vol. 1, no. 3, pp. 657-676.

    [17] V. Vasilévski, M. J. Araújo. Laça-palavras: an electronic system for the Description of Brazilian Portuguese. Florianópolis: LAPLE-UFSC, 2010-2013. http://www./sites.google.com/ site/sisnhenhem/

    [18] L. Scliar-Cabral, V. Vasilévski. Descrição do português com auxílio de programa computacional de interface, (2011). Anais da II Jornada de Descrição do Português (JDP), Cuiabá, Brazil.

    [19] H. F. Blasi, V. Vasilévski. Programa piloto para transcrição fonética automática na clínica fonoaudiológica, (2011). Documentos para el XVI Congresso Internacional de la ALFAL, Universidad de Alcalá, Alcalá de Henares/Madrid.

    [20] L. Scliar-Cabral. Guia prático de alfabetização. São Paulo: Contexto, 2003b.

    [21] T. M. Garcez, H. F. Blasi, V. Vasilévski. Aplicação do programa piloto para transcrição fonética automática na clínica fonoaudiológica, (2011). Anais do 19. Congresso Brasileiro e 8. Congresso Internacional de Fonoaudiologia. São Paulo, Brazil. http://www.sbfa.org.br/portal/ suplementorsbfa

    [22] V. Vasilévski. Descodificación automática de la lengua escrita de Brasil basada en reglas fonológicas. Saarbrücken: Editorial Académica Española, 2012.

    [23] V. Vasilévski, M. J. Araújo, H. F. Blasi. A Brazilian Portuguese Phonological-prosodic Algorithm Applied to Deviant Language Acquisition: A Case Study, (2013). Paper to be presented.

    [24] V. Vasilévski. Phonologic Patterns of Brazilian Portuguese: a grapheme to phoneme converter based study, (2012b). Proceedings of the EACL, Workshop on Computational Models of Language Acquisition and Loss. University of Avignon, France.

    [25] J. M. Câmara Jr. Línguas européias de ultramar: o português do Brasil, (1972). In: C. E. F. Uchôa (org.). Dispersos de J. Mattoso Câmara Jr. Rio de Janeiro: Fundação Getúlio Vargas, pp. 71-93.

    [26] J. Sinclair. Corpus, concordance, collocation. Oxford: Oxford University Press, 1991.

    [27] G. Leech. Corpora and theories of linguistics performance, (1992). J. Svartvik (Org.). Directions in corpus linguistics, Berlim: Mouton de Gruyter.

    [28] V. Vasilévski, M. J. Araújo. Um Algoritmo Prosódico para Português do Brasil, (2013). Paper to be presented.

    Vera Vasilévski: Post-PhD Student on Language Acquisition (Phonology and Morphology) at Federal University of Santa Catarina (UFSC/CAPES), Brazil. Professor at State University of Ponta Grossa (UEPG), Paraná, Brazil. Research Group Emergent Linguistic Productivity (CNPq).

    Leonor Scliar-Cabral: Prof. Emeritus at UFSC, Dr. in Linguistics/Psycholinguistics at University of São Paulo, Brazil. Honorary President of the International Society of Applied Psycholinguistics (ISAPL). Responsible for the Research Group Emergent Linguistic Productivity (CNPq/CAPES).

    Márcio José Araújo: System Developer Analyst at Topdata Automation Systems, Electrical Engineering graduate student at Federal Technological University of Paraná, Brazil. Natural Language Processing programs developer. Research Group Emergent Linguistic Productivity.