Ryan Turner | COM-4450.001 | ORTIZ

Machine Translation: From Human Translation to Automatic Language Translation

Post on 11-Apr-2017



Introduction

Language is a sophisticated system of written symbols and oral speech that humans use to communicate with one another. This ability is particular to the human species, as no other animal on the planet has the capability to create such a complex system of communication and understanding. Language has created and defined what we know as human culture and society. “In annuals of Anthropology, language is considered as a primary tool for studying the culture of a civilization, what we speak influences what we think, what we feel and what we believe.” (Ashraf, 25). The importance of language to any given culture is immense, and because there are over 5,000 different dialects and languages[1] in the world, there must be a way not only to communicate with different cultures but also to understand their history. The only way to interact with other cultures is to understand and translate a vast number of languages into the primary language of the observer or communicator. As globalization increases, accurate translation between languages becomes essential to furthering relationships in both the political and business spheres. Human translators require a lot of training and knowledge of different cultures to be able to translate accurately. Human translation is a viable option in many circumstances, but with such an influx of information and documents in so many languages, driven by the growth of internet sharing and communication, there needs to be a faster and more easily accessible option than human translators alone. Machine translation (MT), also called automatic language translation, is a leading technology in which computer programs analyze language structures and source texts to create a translation in a target language with little to no human interaction. MT tools available on the internet include Babel Fish, Google Translate, Babylon and StarDict, which are only capable of giving rough translations without human editing. This is because the technology has yet to catch up with the complexity of many different languages. Understanding how language systems work, and the related terminology, is the first step toward understanding how MT works. The most commonly used MT systems will then be described with their advantages and disadvantages so that there can be a better understanding of where the technology still needs to advance. By highlighting the disadvantages of these MT systems and the complexity of translating languages in general, possible solutions will be explained for a future technology that is more accurate and requires less human interaction.

[1] Roughly 6,500 languages are spoken, but about 2,000 of them have fewer than 1,000 speakers.

Language and Translation Terminology

To understand how machine translation works, one must first understand how language is constructed and what the human translation process consists of. Human language generally consists of two main parts: a lexicon and a grammar, or set of rules. A lexicon is the stock of words in a language together with their meanings. Grammar is the set of rules that allows speakers to combine the words of the lexicon into meaningful, coherent sentences. The translation process involves decoding the meaning of the source text, both lexically and grammatically, and then re-coding this meaning into the target language. This re-coding must follow the lexicon and the grammatical rules of the target language. The complexity of translation lies in the fact that many languages do not share similar grammar rules, and their lexicons do not map onto the same meanings. The grammatical features that must be considered include the types of words (nouns, verbs, adjectives, pronouns, prepositions, etc.), the functions of the words, their case markings and, finally, their gender. To understand a language one must also know how it is structured and how its rules of grammar are applied. This requires in-depth knowledge of the culture the language comes from, so that its semantics[2], syntax[3], idioms[4] and ambiguous words, which only have meaning in context, are understood. Many languages have words with several meanings that can only be translated given the context of the rest of the sentence. Given this complexity, many human translators are able to translate only a few source languages into even fewer target languages. MT has been evolving to address this problem, reducing the human involvement in translation and making accurate translation among hundreds of languages much easier and faster.
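The point about ambiguous words can be made concrete with a small sketch. The lexicon below is a toy English-to-Spanish example invented for illustration; the cue words and the overlap-counting heuristic are assumptions, not a method taken from the cited sources.

```python
# Toy English -> Spanish lexicon. The entries and cue words are invented
# for illustration only.
LEXICON = {
    "bank": [
        {"translation": "banco", "cues": {"money", "loan", "account"}},
        {"translation": "orilla", "cues": {"river", "water", "shore"}},
    ],
}


def translate_word(word, sentence_words):
    """Pick the sense whose cue words overlap most with the rest of the sentence."""
    senses = LEXICON.get(word)
    if senses is None:
        return word  # unknown word: leave it untranslated
    context = set(sentence_words) - {word}
    # On a tie (for example, no cue word present) the first listed sense wins.
    best = max(senses, key=lambda sense: len(sense["cues"] & context))
    return best["translation"]


print(translate_word("bank", "she sat on the river bank".split()))
print(translate_word("bank", "he opened a bank account".split()))
```

The same source word receives two different translations depending on the surrounding words, which is the kind of contextual knowledge a human translator applies without noticing.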

Rule-Based Machine Translation

There are a few main MT systems used to translate language automatically. One of them is Rule-Based Machine Translation (RBMT). RBMT combines three different systems of translation: transfer-based, interlingual and dictionary-based machine translation. Interlingual and dictionary-based translation are used most often in RBMT, unless the source and target languages have no interlingual standard. Interlingual machine translation first translates the source language into an independent representation, the interlingua, that is separate from any particular language. From this independent representation the text is then converted into the target language. If the source language and the target language do not share an interlingua, the source language is first translated into an intermediate representation of the meaning of the sentence. From there it is transferred to the target language through dictionary-based translation. The difference between the two is that the interlingual system has a standard operational method, which rules out certain language pairings, while transfer-based translation uses only an intermediate representation derived from the source language. If a language pairing has an interlingual translation, the overall RBMT result will be more accurate. Once the source language has been processed through either the transfer-based or the interlingual system, the resulting intermediate or independent representation goes through dictionary-based translation to produce the final text in the target language. Dictionary-based translation is as simple as translating the source language word for word into the target language. If the source text were only dictionary-translated, without either transfer-based or interlingual translation, the result would miss all the grammatical, syntactic and semantic rules as well as the target language’s idioms and morphemes[5] (Wilks, 29). RBMT is well suited to consistent and predictable translations that follow the grammatical rules of the language pairing. Its disadvantages include a lack of fluency and the fact that its translations rarely catch exceptions to the rules of a given language (Ashraf, 28).

[2] The study of linguistic development by classifying and examining changes in meaning and form.

[3] The study of the patterns of formation of sentences and phrases from words.

[4] An expression whose meaning is not predictable from the usual meanings of its constituent elements, as kick the bucket or hang one's head, or from the general grammatical rules of a language, as the table round for the round table, and that is not a constituent of a larger expression of like characteristics.
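The gap between a purely dictionary-based translation and one that also applies a grammatical rule can be sketched in a few lines. The dictionary, the part-of-speech tags and the single adjective-noun reordering rule below are hypothetical English-to-Spanish examples chosen for illustration; a real RBMT system would hold many thousands of entries and rules.

```python
# Hypothetical English -> Spanish dictionary and part-of-speech tags.
DICTIONARY = {"the": "el", "red": "rojo", "car": "coche"}
POS = {"the": "DET", "red": "ADJ", "car": "NOUN"}


def word_for_word(sentence):
    """Dictionary-based translation: replace each word, keep the source order."""
    return " ".join(DICTIONARY.get(w, w) for w in sentence.lower().split())


def reorder_adjectives(words):
    """Transfer rule (assumed): English ADJ NOUN becomes Spanish NOUN ADJ."""
    out = list(words)
    i = 0
    while i < len(out) - 1:
        if POS.get(out[i]) == "ADJ" and POS.get(out[i + 1]) == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return out


def rule_based_translate(sentence):
    """Apply the transfer rule first, then the dictionary lookup."""
    words = reorder_adjectives(sentence.lower().split())
    return " ".join(DICTIONARY.get(w, w) for w in words)


print(word_for_word("the red car"))
print(rule_based_translate("the red car"))
```

The word-for-word output keeps the English word order, while the rule-based output places the adjective after the noun as Spanish grammar requires. The rule fires every time it matches, which is also why a system built this way is consistent but misses exceptions.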

Statistical-Based Machine Translation

Another method is Statistical-Based Machine Translation (SBMT), which analyzes words and sentences using statistics. It is a methodology that uses statistical data drawn from bilingual corpora to create a translation (Ashraf, 28). A text corpus is a very large, structured set of stored and processed texts used for statistical analysis. These statistical analyses are based on how often words and constructions occur in the source and target languages and on the rules that apply to them. The SBMT method can only work, however, if there is enough data for both languages, source and target. This means that enough data concerning each language must be stored and analyzed before a translation can be created. “Building statistical translation models is a fast process, however, the innovation depends intensely on existing multilingual corpora. At least 2 million words for a particular space and considerably more for general dialect are needed” (Ashraf, 28). The problem with SBMT is that, to be accurate, it needs an existing set of analyzed linguistic data, and processing that data is very CPU-dependent. SBMT works from a probability distribution: for a given piece of source text, it selects the target-language rendering that has the highest probability of carrying the same meaning. In other words, the source text is translated according to the probability that a candidate translation corresponds to it in the target language, if and only if the data for the language pairing is present and the probability is high enough. SBMT requires the statistical data to be within the domain of the translator’s inquiry, and if the language pairing does not have enough data the translation can be quite unpredictable. Another problem with SBMT is that the system does not actually know the grammatical, syntactic or semantic rules of either language but merely relies on large amounts of statistical information from bilingual corpora. The two great strengths of SBMT are that, given enough stored and analyzed information on the language pairing, its translations from source to target language are very fluent and good at “catching exceptions to rules” (Ashraf, 28).

[5] Any of the minimal grammatical units of a language, each constituting a word or meaningful part of a word, that cannot be divided into smaller independent grammatical parts, as the, write, or the -ed of waited.
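The probability-based selection described above can be illustrated with a minimal sketch. It assumes a tiny, invented list of aligned English-Spanish word pairs standing in for a bilingual corpus, estimates translation probabilities by relative frequency, and declines to translate when the data is missing or the probability falls below a threshold. The names `train`, `translate` and `min_prob` are hypothetical, and real SBMT systems use far richer models than this.

```python
from collections import Counter, defaultdict

# Invented aligned word pairs, standing in for a large bilingual corpus.
ALIGNED_PAIRS = [
    ("house", "casa"), ("house", "casa"), ("house", "hogar"),
    ("book", "libro"), ("book", "libro"), ("book", "reservar"),
]


def train(pairs):
    """Estimate P(target | source) by relative frequency of aligned pairs."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        counts[source][target] += 1
    model = {}
    for source, target_counts in counts.items():
        total = sum(target_counts.values())
        model[source] = {t: n / total for t, n in target_counts.items()}
    return model


def translate(word, model, min_prob=0.5):
    """Return the most probable translation, or None without enough support."""
    candidates = model.get(word)
    if not candidates:
        return None  # the corpus holds no data for this word
    best, prob = max(candidates.items(), key=lambda item: item[1])
    return best if prob >= min_prob else None


model = train(ALIGNED_PAIRS)
print(translate("house", model))
print(translate("tree", model))
```

Nothing in the model encodes a grammatical rule. It only counts what appeared in the data, so its quality depends entirely on how much relevant data it was given.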


Example-Based Machine Translation

Example-Based Machine Translation (EBMT) is very similar to SBMT in that it compares the language pairings of the translation. The difference is that EBMT uses no probability analysis but instead relies on previously translated sentences between the language pairing. The bilingual corpus serves as a source of analogies, so previously translated data must exist in the system before a new translation between source and target language can be created. EBMT draws its analogies from previously translated phrases rather than complete sentences (Daybelge, 296). To translate a complete sentence from source to target language, the system must first hold previous translations of the phrases within that sentence. Once these phrases are found in the system, they are put together to form the full translation. The inherent problem with EBMT is that the server being used must hold a data log of previously translated sentences that are similar to the desired translation. There must be not only many previous translations between the proposed language pairing but also translations that are very close in meaning to the text being translated. For this reason EBMT can translate accurately for only a very limited number of languages.
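The phrase lookup and recombination described above can be sketched as follows. The example base is a handful of invented English-Spanish phrase pairs, and the greedy longest-match segmentation is an assumption made for brevity, not the ranking method of the cited paper.

```python
# Invented example base of previously translated phrases (English -> Spanish).
EXAMPLES = {
    ("good", "night"): "buenas noches",
    ("my", "friend"): "mi amigo",
    ("thank", "you"): "gracias",
}


def ebmt_translate(sentence, examples=EXAMPLES):
    """Cover the sentence with stored phrases, longest match first, and join them."""
    words = sentence.lower().split()
    max_len = max(len(phrase) for phrase in examples)
    output = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = tuple(words[i:i + n])
            if phrase in examples:
                output.append(examples[phrase])
                i += n
                break
        else:
            # No stored example covers this part of the sentence.
            return None
    return " ".join(output)


print(ebmt_translate("good night my friend"))
print(ebmt_translate("good evening my friend"))
```

The second call returns `None` because no stored example contains "good evening", which illustrates the limitation noted above: without closely matching prior translations, the system has nothing to build from.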

Hybrid Machine Translation

Each of the machine translation systems described above has its advantages and disadvantages, which leaves much to be desired when it comes to the accuracy and linguistic knowledge of machine translation systems. This dissatisfaction with individual techniques has led to what is called Hybrid Machine Translation (HMT) (). HMT is the technique of combining multiple machine translation systems into one engine in the hope that each system makes up for the disadvantages of the other. The most common HMT methods combine RBMT and SBMT. One such method, called Statistical Rule Generation, first generates grammatical rules through a statistical analysis of its database, provided that it holds enough information on the language pairing. This statistical method aims to mimic a rule-based method of translation by analyzing data and forming from it the rules of grammar, syntax, etc. for the language pairing. Unlike SBMT, this method follows the rules it has generated even in cases of ambiguity, so its translations often lack fluency and miss exceptions to the rules of the languages. For this reason this hybrid method is only capable of creating accurate translations if the rules of each language are similar or the languages share a close etymological background[6].
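As a rough illustration of how rules might be generated from statistics, the sketch below counts how source-language part-of-speech patterns were rendered in the target language and keeps the majority rendering as a rule. The aligned patterns, the `min_share` threshold and the function name are all hypothetical; this is one simple way to read the idea, not a description of a specific published system.

```python
from collections import Counter

# Invented aligned part-of-speech patterns: (source pattern, target pattern).
ALIGNED_PATTERNS = [
    (("ADJ", "NOUN"), ("NOUN", "ADJ")),
    (("ADJ", "NOUN"), ("NOUN", "ADJ")),
    (("ADJ", "NOUN"), ("ADJ", "NOUN")),
    (("DET", "NOUN"), ("DET", "NOUN")),
]


def generate_rules(patterns, min_share=0.6):
    """Keep, for each source pattern, its most frequent target pattern as a rule."""
    by_source = {}
    for source, target in patterns:
        by_source.setdefault(source, Counter())[target] += 1
    rules = {}
    for source, counts in by_source.items():
        target, n = counts.most_common(1)[0]
        # Only promote a pattern to a rule if it dominates the observed data.
        if n / sum(counts.values()) >= min_share:
            rules[source] = target
    return rules


print(generate_rules(ALIGNED_PATTERNS))
```

Once a rule is generated, it is applied everywhere. The one example in the data where the adjective stayed before the noun is discarded, which mirrors the loss of exceptions noted above.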

The most accurate hybrid machine translation method is the Multi-pass system. In this method the source text is processed multiple times through both RBMT and SBMT. The process starts with RBMT, a step often called the pre-process, in which the source text is analyzed against the rules of its language. RBMT then creates either an independent representation (interlingual) or, if the language pairing does not share an interlingua, an intermediate representation of the sentence (transfer). This rule-based output is then passed to a statistical machine translation system that analyzes both the RBMT pre-processed translation and the information in its database on the original language pairing. This spares the RBMT system from having to complete the translation from the interlingual or transfer-based representation and allows two forms of statistical analysis to create a more accurate translation. The Multi-pass system does require the RBMT and SBMT databases to work together, which in turn requires more disk space and CPU usage. HMT may need more processing power and information to create automatic language translations, but combining the two most popular methods of translation has allowed for the most reliable and accurate translations to date ().

[6] A chronological account of the birth and development of a particular word or element of a word, often delineating its spread from one language to another and its evolving changes in form and meaning.
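A simplified two-pass pipeline in the spirit of the Multi-pass system might look like the sketch below. The first pass applies one assumed reordering rule and looks up candidate translations; the second pass picks the candidate sequence that scores highest against invented target-language bigram counts. All data, names and the scoring function are assumptions made for illustration.

```python
from itertools import product

# Pass 1 data (rule-based): hypothetical lexicon with alternative translations.
RULE_LEXICON = {
    "the": ["el"],
    "red": ["rojo", "tinto"],
    "wine": ["vino"],
    "car": ["coche"],
}
POS = {"red": "ADJ", "wine": "NOUN", "car": "NOUN"}

# Pass 2 data (statistical): invented bigram counts from target-language text.
BIGRAM_COUNTS = {
    ("el", "vino"): 50, ("el", "coche"): 60,
    ("vino", "tinto"): 40, ("vino", "rojo"): 2,
    ("coche", "rojo"): 25,
}


def rule_based_pass(sentence):
    """Reorder ADJ NOUN to NOUN ADJ, then list the candidates for each word."""
    words = sentence.lower().split()
    i = 0
    while i < len(words) - 1:
        if POS.get(words[i]) == "ADJ" and POS.get(words[i + 1]) == "NOUN":
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 2
        else:
            i += 1
    return [RULE_LEXICON.get(w, [w]) for w in words]


def statistical_pass(candidate_lists):
    """Choose the candidate sequence with the highest total bigram count."""
    def score(sequence):
        return sum(BIGRAM_COUNTS.get(pair, 0) for pair in zip(sequence, sequence[1:]))
    return " ".join(max(product(*candidate_lists), key=score))


def multi_pass_translate(sentence):
    return statistical_pass(rule_based_pass(sentence))


print(multi_pass_translate("the red wine"))
print(multi_pass_translate("the red car"))
```

Here the rule-based pass supplies the correct word order, and the statistical pass resolves a choice the rules cannot make: "red" becomes "tinto" next to "vino" but "rojo" next to "coche". Holding both the lexicon and the counts in memory also hints at why the combined approach needs more storage and processing than either method alone.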

Conclusion

Machine translation and automatic language translation systems are becoming much more accurate, with less human involvement, as technology and data collection continue to advance. The complexity of language and the number of different languages, dialects and cultures in the world make automatic language translation very hard, but because communication matters so much for culture, politics, business and learning, it has become necessary to create a system that helps people around the world connect. The ultimate goal is to have absolutely no human involvement in translating from one language to another, whether from text to text or speech to speech (Grap, 12.6). This would mean that any text scanned onto a computer or found on the internet could be automatically translated into any language in the world, with the full meaning of the source text preserved in the target language. With this type of technology humans would be able to bridge the gap of communication between cultures, which would advance the knowledge and understanding of the human species in general.


Bibliography

Ashraf, Neeha, and Manzoor Ahmad. "Machine Translation Techniques and Their Comparative Study." International Journal of Computer Applications 125.7 (2015): 25-31. Web.

Daybelge, Turhan, and Ilyas Cicekli. "A Ranking Method for Example Based Machine Translation Results by Learning from User Feedback." Applied Intelligence 35.2 (2010): 296-321. Web.

Grap, Hannah. "Automated language translation ... a solution to public sector communication requirements." Summit Magazine Sept. 2009. General OneFile. Web. 4 May 2016.

Park, Eun-Jin, Oh-Woog Kwon, Kangil Kim, and Young-Kil Kim. "Classification-Based Approach for Hybridizing Statistical and Rule-Based Machine Translation." ETRI Journal 37.3 (2015): 541-50. Web.

Wilks, Yorick. Machine Translation: Its Scope and Limits. New York: Springer, 2008. Print.