LTI Brochure



    Language Technologies Institute

    Carnegie Mellon

phonological phenomena * multilingual text extraction * topic tracking * machine translation * information retrieval * computational linguistics * machine learning * biosequence modeling * computer-assisted language learning * question answering * information theory * technology-supported education * language modeling


Thus it may be true that the way to translate from Chinese to Arabic, or from Russian to Portuguese, is not to attempt the direct route... Perhaps the way is to descend, from each language, down to the common base of human communication--the real but as yet undiscovered universal language--and then re-emerge by whatever particular route is convenient.

- Warren Weaver

"Language technologies" in French (Technologies de la langue), Spanish (Tecnologías del lenguaje), and German (Sprachtechnologie).


Contents

Overview
Ongoing Research
Academic Programs
LTI Courses and Admissions
Faculty

Edited by C. Adèle Weitz
© 2004 Carnegie Mellon University


    Overview


The Language Technologies Institute (LTI) in the School of Computer Science at Carnegie Mellon University conducts research and provides graduate education in all aspects of language technologies, including computational linguistics, machine translation, speech recognition and synthesis, statistical language modeling, information retrieval and web search engines, text mining, information management, digital libraries, intelligent tutoring, and, more recently, bio-sequence/bio-language structure and function analysis (genome, proteome). The LTI combines linguistic approaches with machine learning and corpus-based methods, depending on the scientific questions investigated and project needs.

The LTI was established in 1996, combining the Center for Machine Translation (CMT), which was founded in 1986, and other areas of computational language research. The LTI contains a unique mix of theoretical and systems-building researchers specializing in various aspects of computer science, artificial intelligence, computational linguistics and machine learning, and provides a rich and diverse environment for collaboration among faculty, graduate students, visiting scholars, and research staff. As part of the School of Computer Science at CMU, LTI faculty and students collaborate closely with members of the Computer Science Department, the Center for Automated Learning and Discovery, the Robotics Institute, the Institute for Software Research International, and the Human-Computer Interaction Institute. Collaborative research areas include mobile computing, computational biology, multi-agent systems, cognitive modeling, intelligent tutoring systems, multi-media interfaces, text and data mining, artificial intelligence systems, and machine learning theory and algorithms.

The LTI offers both a Master's and a PhD in Language Technologies. The curricula of the graduate programs are based on a set of core courses that include linguistic and statistical methods for language analysis, fundamental computer science, and in-depth coverage of focus areas in language technology such as machine translation, information retrieval, and speech recognition. Students benefit from a modular set of laboratory courses, in which they learn the basics of natural language technology through intensive hands-on practice. In addition, students have the opportunity to expand their education with courses from the other institutes in the School of Computer Science (listed above), including courses in algorithms, artificial intelligence, computer systems, machine learning, statistics, and computational biology.

In addition to fundamental and theoretical research, the LTI very much focuses on large-scale challenges of consequence to industry, government, or society at large, often spawning start-up companies or international-scale projects. The original LYCOS search engine, for instance, was created at CMU, as was the original VIVISIMO meta-search engine. The C-STAR international speech-translation consortium and the Universal Library project were also initiated at CMU, the goal of the latter being the dissemination of the collected works of humankind worldwide with free universal access.

The LTI offers and encourages collaboration with industry, national and international, ranging from industrial affiliates and in-residence visiting researchers to targeted industrial education programs and longer-term joint R&D projects. Such projects have produced practical results, including high-accuracy machine translations for Caterpillar Inc. and the Condor search engine used in Korea.



    Carnegie Mellon University and Pittsburgh

Located in Pittsburgh, Pennsylvania, Carnegie Mellon University is well known for its interdisciplinary research centers and institutes, such as the Pittsburgh Supercomputing Center, the Software Engineering Institute, the Data Storage Systems Center, the Information Networking Institute, the Engineering Design Research Center, the Institute for Complex Engineered Systems, the Robotics Institute, the Human-Computer Interaction Institute, the Language Technologies Institute, the Systems and Security Center, and the Entertainment Technology Center. In these organizations, researchers and faculty from diverse disciplines collaborate on problems, benefiting from varied viewpoints and scientific approaches.

Once the greatest steel-producing city in the United States, today Pittsburgh is a medium-size modern city. With growing high-technology industries such as computer science and biotechnology, Pittsburgh's energy reflects the vision of the future more than the shadow of the past. Many high-technology companies are locating research labs in Pittsburgh, including Intel, Seagate, Rand, Hyundai, and others. Some might be surprised to learn that Pittsburgh is economically diverse, visually exciting, architecturally active, ethnically rich, and educationally innovative.

Carnegie Mellon is well integrated into its Pittsburgh surroundings. Ten minutes east of the downtown business district, the 103-acre campus is situated in Oakland, the educational and medical mecca of the city. In addition to Carnegie Mellon, there are four other institutions of higher education in this section of the city, which provide a wide range of educational opportunities. Adjacent to campus is the 500-acre Schenley Park, complete with public golf course, tennis courts, outdoor pool, ice-skating rink, and numerous jogging, mountain biking, and cross-country skiing trails.


Ongoing Research

Research Methodology

The modeling of human language lies at the confluence of linguistics, artificial intelligence, statistics, machine learning, and cognitive science, and has been the focus of intense research in the past few decades.

Research efforts at the LTI pursue a variety of approaches, from linguistic and knowledge-based methods to supervised and unsupervised statistical, corpus-driven machine learning techniques.

In all of these approaches, domain knowledge and theory (linguistic or statistical) inspire the general technical approach to solving a language technology problem, and suggest the overall structure of the model. Data is then used to estimate model parameters, and to further refine the model, suggest salient features, optimize parameters, and, ultimately, assess the quality and viability of the approach. The general research paradigm is much like that used in other areas of science and engineering: a model is formulated and used to make predictions, and those predictions are then evaluated in an effort to improve the model in an iterative process.

A Linguistic Model that is refined with Machine Learning

Another example is the framework developed within the AVENUE project for inferring transfer rules for Machine Translation from bilingual data. The learned rules are mappings between the syntactic structures of the two languages, and are used in a runtime Machine Translation system to produce translations. Researchers at the LTI have created a theoretical framework that defines the kinds of models, i.e., grammar rules, that can be learned from data. A collection of specific instances of such rules, constituting a transfer grammar for a particular pair of languages, is then inferred from a small, balanced, word-aligned corpus of human-translated sentences. The framework addresses questions about how rules can combine to cover more complex structures, what types of linguistic features can be expressed in the rules and passed from one rule to another, and so forth. The resulting model (i.e., set of transfer rules) for a particular language pair can be refined in an active learning feedback loop, where users correct the output of the Machine Translation system, and the system automatically uses the corrections, augmented with active learning methods, to refine the underlying rule set.
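To make the idea concrete, the sketch below shows one way a learned transfer rule might be represented and applied in code. The rule format, field names, and the toy English-Spanish adjective-noun example are illustrative assumptions, not AVENUE's actual formalism.

```python
# Illustrative only: a toy transfer-rule representation and application,
# not AVENUE's actual rule formalism.

from dataclasses import dataclass

@dataclass
class TransferRule:
    source_pattern: list   # source-side constituent sequence
    target_order: list     # indices giving the target-side reordering
    constraints: dict      # feature constraints passed between rules

# Toy rule: English "ADJ N" maps to Spanish "N ADJ", with the noun and
# adjective required to agree in gender and number on the target side.
rule = TransferRule(
    source_pattern=["ADJ", "N"],
    target_order=[1, 0],
    constraints={"agreement": ["gender", "number"]},
)

def apply_rule(rule, translated_constituents):
    """Reorder translated constituents according to the rule."""
    return [translated_constituents[i] for i in rule.target_order]

# "red house" -> translated pieces ["roja", "casa"], reordered target-side
print(apply_rule(rule, ["roja", "casa"]))  # ['casa', 'roja']
```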

Fig. 1: Research methodology integrating data, knowledge, and model.

Fig. 3: Data, knowledge, model, and predictions in AVENUE's translation rule learning system.

A Statistical Model that can be refined with expert knowledge

As one example, sequential models such as hidden Markov models are of great use in language technologies such as speech recognition and information extraction from text and web documents. Although often effective, hidden Markov models make strong independence assumptions that can limit their predictive performance. Research by LTI faculty has led to alternative frameworks, such as Conditional Random Fields, that make fewer independence assumptions and allow more expert knowledge to be incorporated as features into the model. These models have been applied to information extraction, parsing, and even image analysis, and have more recently been adapted to biological sequence analysis problems such as gene finding and protein super-secondary structure prediction.
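To give a flavor of how expert knowledge enters such models, here is a minimal sketch of the linear-chain scoring idea behind Conditional Random Fields. The feature functions and weights are invented for illustration; a real system learns the weights from annotated data and normalizes these scores over all possible label sequences.

```python
# Illustrative sketch of linear-chain CRF scoring: the (unnormalized)
# score of a label sequence is a weighted sum of feature functions over
# adjacent label pairs and the observed input. Features and weights here
# are invented; real systems learn weights from annotated data.

def f_capitalized_is_name(prev_label, label, words, i):
    return 1.0 if label == "NAME" and words[i][0].isupper() else 0.0

def f_name_follows_title(prev_label, label, words, i):
    return 1.0 if label == "NAME" and prev_label == "TITLE" else 0.0

FEATURES = [(f_capitalized_is_name, 1.5), (f_name_follows_title, 2.0)]

def sequence_score(labels, words):
    """Unnormalized log-score of a label sequence given the words."""
    score = 0.0
    prev = "START"
    for i, label in enumerate(labels):
        for feature, weight in FEATURES:
            score += weight * feature(prev, label, words, i)
        prev = label
    return score

words = ["Senator", "Clinton", "spoke"]
print(sequence_score(["TITLE", "NAME", "O"], words))  # 3.5
print(sequence_score(["O", "O", "O"], words))         # 0.0
```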

Fig. 2: Biosequence to 3D structure predictions in biolinguistics.

[Figure flows: Fig. 1: data + knowledge → model → predictions. Fig. 2: amino acid sequence + protein structural knowledge → learned structural mapping model → structural predictions. Fig. 3: bilingual data + theoretical learning framework → learned translation rules → machine translation output.]

Machine Translation

Translation of human language was one of the very first tasks attempted by the developers of the first digital computers in the 1950s. Over fifty years later, fully automatic Machine Translation (MT) remains one of the most difficult and challenging topics of research within Artificial Intelligence. With the emergence of universal access to information enabled by today's internet, language has become a critical barrier to global information access and communication, and the need for MT is greater than ever before. The LTI originated as the Center for Machine Translation in the mid-1980s, and MT continues to be a prominent sub-discipline of research within the LTI. The LTI is unique in the breadth of MT problems and approaches that are being investigated and pursued in the context of a variety of research projects, and in the number of faculty and researchers involved in MT research.

Part of the excitement of MT as a research field lies in its wide range of challenges, from cutting-edge applications that are commercially feasible today, through techniques that could have practical application within a few years, to problems that will not be fully solved until the advent of true Artificial Intelligence. The ultimate goal of this area can be characterized as machine translation that is: (1) general purpose (any topic or domain); (2) high quality (human quality or better); and (3) fully automatic. Remarkably, our current MT capabilities can reasonably satisfy any two of these three criteria, but we cannot yet meet all three at once. Our KANT project produces fully automatic, high-quality translations for information dissemination in well-defined technical domains such as electric power utility management and heavy equipment technical documentation (as in the Catalyst application for Caterpillar). Our Example-Based MT and Statistical MT systems can produce fully automatic translation in broad or unlimited domains, but have not yet approached or surpassed human quality levels. In other projects, such as BABYLON and Diplomat/Tongues, successful multi-lingual communication is achieved by augmenting this limited-quality MT with human interaction in order to help resolve translation errors.

There is currently active research being conducted within the LTI on all of the major approaches to MT. Each of these approaches has some unique strengths but also inherent weaknesses and limitations. Consequently, different approaches are suitable for different scenarios. Our main research thrusts are in machine learning approaches to MT, including corpus-based approaches such as Generalized Example-Based MT and Statistical MT systems that have focused primarily on Chinese-to-English and Arabic-to-English MT. Additionally, we conduct ongoing work on Multi-Engine MT (MEMT), combining the results of different MT techniques in order to exploit each technique's strong points.

Since traditional rule-based approaches to MT require lengthy development cycles, and corpus-driven MT requires large amounts of pre-translated parallel text for training, the LTI is investigating alternative MT paradigms for minority languages, such as Quechua and Mapudungun. The goal of the AVENUE project is to produce MT systems requiring neither extended human development cycles (too costly) nor huge parallel corpora (not available). We are investigating methods such as unsupervised learning of complex morphology, and transfer-rule induction from limited numbers of selected word-aligned phrases and sentences, via machine learning methods such as seeded version spaces. Although we have had initial success, much of the challenge remains before us.

Another aspect in which MT systems differ is in their input/output modes: text versus speech. The JANUS project, for instance, combines speech recognition with language translation in the large. Other projects, such as the Speechalator, produced limited-scope, speech-to-speech MT on a hand-held device that also includes fluent speech synthesis. The area of speech-to-speech MT is still young and growing, with technical difficulties such as how to translate from a lattice of sentence recognition hypotheses produced by the speech recognizer, rather than simply a single known sentence as input. Human factors issues include clarification dialogs when the recognition or the translation is problematic, and how to train users implicitly to use the system for maximum effectiveness.

Finally, we have some non-traditional projects, such as investigating dolphin language (we kid thee not), and whether it can be interpreted.

    Types of Machine Translation

[Figure: the machine translation pyramid. A source sentence (Arabic) may be translated into the target (English) directly (SMT, EBMT); via transfer rules over the output of syntactic parsing; or via an interlingua, reached through syntactic parsing and semantic analysis on the source side and rendered through sentence planning and text generation on the target side.]


Tutoring Systems

In collaboration with the Human-Computer Interaction Institute, the LTI faculty carries out research in intelligent tutoring, focusing on explorations of the role of language in learning and learning research, and specifically how language technology can be used to support that endeavor. A major thrust of this research is to explore issues related to eliciting and responding to productive student explanation behavior. It involves many broader issues, such as influencing student expectations, motivation, and learning orientation. This interdisciplinary research agenda involves five primary foci:

* Controlled experimentation and analysis of student interactions with each other as well as with human tutors and computer tutors, in order to explore the stimuli that encourage productive student behavior, appropriate learning orientation, and, ultimately, effective learning

* Analysis of think-aloud protocols in learning scenarios in order to better understand the process of learning

* Basic research in language technology to support the semi-automatic analysis of language interactions in learning scenarios (text classification, automatic essay grading, etc.)

* Basic research in dialogue technology to enable interaction in natural language in tutorial environments between humans and computer tutors, or to support interactions in natural language between human learners in collaborative settings (robust language understanding, dialogue management, etc.)

* Development of easy-to-use tools for building scalable language interaction interfaces and tutorial environments more generally (semantic knowledge source authoring tools, etc.)

Tutorial dialogue is a unique, intensely dynamic form of instruction that allows for a high degree of expressiveness and sensitivity, both in the tutor's adaptation of material to the individual needs of students and in the opportunity it creates for students to make their thinking transparent to the tutor. State-of-the-art tutorial dialogue systems have focused on leading students through directed lines of reasoning to support conceptual understanding, clarifying procedures, or coaching the generation of explanations for justifying solutions, problem-solving steps, predictions about complex systems, or understanding of computer architecture. Evaluations of state-of-the-art tutorial dialogue systems provide a powerful proof-of-concept, demonstrating conclusively that the language technology exists for supporting productive learning interactions in natural language. Carnegie Mellon researchers are at the forefront of this movement, both in terms of producing landmark systems and widely used resources.

A current thrust of this work at the LTI involves pushing this technology into new areas, such as supporting design activities in an exploratory learning environment. An important part of this work involves the development of the DReSDeN tutorial dialogue planner to manage a range of mixed-initiative tutorial dialogue strategies, including negotiation dialogue for encouraging students to develop the skills to ask themselves important questions leading to a thoughtful, reflective decision-making process. Another current thrust of this work involves the optimization of time management in tutorial dialogue interactions.

In order for tutorial dialogue systems to become widespread and have a real impact on education, it is imperative that they become easier to build. Another active area of research at the LTI is building tools to facilitate this process, such as Carmel-Tools for authoring domain-specific knowledge sources for robust processing of student explanations. Thus, beyond producing technology to be used to address our own theoretical questions about learning interactions, we aim to produce reusable technology that can facilitate the work of other researchers pursuing their own related questions. Our ultimate goal is to produce resources that are simple enough to be used by non-AI researchers and practitioners, such as education researchers, domain experts, and instructors, and thus to put the power of tutorial dialogue in the hands of those with the pedagogical expertise to maximize its effectiveness and meet the real needs of students.



The Universal Library

The central goal of the Universal Library is to digitize, index, and make universally available all published works of humankind, including books, periodicals, artwork, and music. A further goal is to provide value-added information services such as automated summarization, reading assistants, full content search, and translation over the internet in any language. Imagine the situation in which every researcher, every teacher, every citizen, and even every schoolchild would have everything ever written at her fingertips, regardless of what country she lives in, her economic status, the school she attended, or her native language. The amplification of human potential would be vast, and would lead to the ultimate democratization of information and knowledge. Like Rome, the Universal Library is not built in a day; rather, it is a project for the ages, requiring constant improvement and enrichment.

We estimate that approximately 100 million different books have been published in the history of the world, but only a tiny fraction are available digitally. The remainder must first be scanned, a labor-intensive chore.

Toward that end, with the support of the National Science Foundation, we have organized the Million Book Project, a joint effort of CMU and the governments of India and China, with other countries such as Egypt and Turkey joining the cause. Hundreds of digital scanners have been distributed to centers in those countries, where personnel furnished by their governments scan books for two shifts per day. As of fall 2004, about 100,000 volumes have been scanned. These are passed through optical character recognition software, indexed, and added to the Universal Library. Approximately half of the scanned books are in English; the remainder are in a wide range of other languages. The first million books are expected to be complete by the end of 2006, at which point we will embark on the Ten Million Book Project.

The Language Technologies Institute and ISRI provide indexing software and storage infrastructure for the Million Book Project. The CMU Libraries furnish metadata, archiving, and copyright clearance support. A number of research projects are underway to explore applications and uses for the Universal Library. One of them, the Universal Dictionary, is an effort to build a database of every word in every language. This will serve as a basic resource for machine translation and multilingual searching. We are also exploring new methods of navigating huge text spaces. As the size of the collection grows, the limitations of keyword search, particularly for multilingual queries, become severe. What is needed is a language-independent search method that is able to retrieve based on concepts rather than specific terms, which suggests a multimodal rather than a pure text-based interface.

Copyright is a major barrier to free distribution of content. The vast majority of works ever published are still in copyright. Of these, more than 90% are out of print, which means that they produce no revenue for either the author or the publisher. We are endeavoring to encourage publishers to allow the Universal Library to scan their out-of-print books and permit them to be viewed on the Internet and retrieved through search engines. Publishers who do this often find an increased demand for their books. We are also working with the government of India to develop a new copyright statute that would provide funds, analogous to the UK Public Lending Right, to be distributed to copyright owners whose works are accessed on the Internet, with micro-payments to be provided through public funds. Eventually, the availability of such payments could remove further obstacles to offering copyrighted material.

The Universal Library enjoys cooperative relationships with other institutions, including the Internet Archive, the Digital Library Federation, and the National Academy Press.

CALL - Computer Aided Language Learning

Learning a new language is a process of trial and error, where the student observes language, tries to imitate it, and finds out how good the imitation was. Until recently, the best ways to learn a new language were either by going to the country or by having a personal tutor. First attempts to create language learning software lacked feedback and authenticity. This is where language technologies are starting to provide a potent alternative. The use of speech recognition has enabled students to speak to a system and find out exactly what phonetic and prosodic errors were made and how to correct them. Using modeling of native and non-native speech and knowledge of the native language of the student, we have developed powerful pinpointing techniques to discover where errors lie in elicited speech. The use of natural grammars allows us to analyze student writing and suggest corrections. Natural language grammars also allow us to read along with a student and give help in understanding a passage upon request. By using information retrieval, we can find appropriate texts for a student's level of reading and lexical and grammar knowledge, and can give other researchers the tools to determine how hard a new text can be and still be effective; for example, what percentage of new words can be in a text that still allows the student to generalize the meaning. For learning research, members of the LTI faculty have projects with experts in Intelligent Tutoring from the Human-Computer Interaction Institute as well as with psychologists and language learning specialists at Carnegie Mellon and the University of Pittsburgh.


    Computational Biology


Figure: Crystal structure of pectate lyase (PDB ID 1EE6), a triple beta helix protein (www.rcsb.org).

Large amounts of genomic and protein sequence data for Homo sapiens and other organisms have become available, together with a growing body of correlated protein structure and function data, creating an opportunity for addressing the sequence mapping and structure folding problems with increasingly sophisticated data-driven (statistical and computational) methods to discover, characterize, and model regularities and outliers in the biological data. Machine learning methods, with large amounts of data, have led to multiple breakthroughs in language technologies such as automatic speech recognition, document classification, information extraction, statistical machine translation, and other challenging natural language processing tasks. Our research exploits the analogy with mapping words to meaning via syntax in order to decipher the fundamental meaningful building blocks of the language of biological sequences, mapping sequence via its structure to its underlying function.

The goal is to derive new hypotheses to correlate these building blocks with structural, dynamic, and functional meaning for different living organisms in terms of folding, activity, interactions, and pathways. For example, one of the important challenges we try to tackle is the prediction of super-secondary structures such as the beta helix (see Figure). We work with super-secondary protein structure experimentalists so that the hypotheses generated from this computational approach can be tested by wet lab experiments.

In another attempt to relate protein primary sequence to its structure and function, we have engaged in a project to infer evolutionary selectional pressure from residue conservation in multiple sequence alignments of protein families. Many measures have been proposed for quantifying the overall degree of sequence conservation in a multiple sequence alignment of proteins. However, these measures fail to identify which particular properties are conserved at each position in the alignment. We derived an algorithm for systematically identifying the conservation of specific physical-chemical properties in individual positions in a multiple sequence alignment. We have applied our method to the diverse GPCR family and demonstrated the computational significance of the properties we have identified by successfully using them to predict whether specific amino acids will occur in particular positions in the alignment. We have also used our method to annotate Rhodopsin, a well-characterized member of the GPCR family, with a selectional pressure profile, which allowed us to biologically interpret our findings. We further applied our method to a multiple sequence alignment of an HIV-I protein, and are gearing up to apply it to a large set of protein families, including crystallins and various globins. Looking ahead, we plan to refine our method by incorporating phylogenetic histories and separating mutation.
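As a simplified illustration of the general idea (not the LTI's published algorithm), one can measure how tightly a single physical-chemical property clusters at each alignment column; a low variance suggests the property is conserved there even when the residues themselves differ:

```python
# Simplified illustration: per-column conservation of one physical-chemical
# property (hydrophobicity, abridged Kyte-Doolittle scale) in a toy multiple
# sequence alignment. Not the LTI's published algorithm.

KD_HYDROPHOBICITY = {
    "I": 4.5, "V": 4.2, "L": 3.8, "A": 1.8, "G": -0.4,
    "S": -0.8, "D": -3.5, "K": -3.9, "R": -4.5,
}

def column_property_variance(alignment, col):
    """Low variance = the property is conserved at this column."""
    values = [KD_HYDROPHOBICITY[seq[col]] for seq in alignment]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

alignment = ["IVAD", "LVAK", "VIAR"]  # toy 3-sequence alignment
for col in range(4):
    print(col, round(column_property_variance(alignment, col), 2))
# Columns 0-1 conserve hydrophobicity even though the residues differ;
# column 3 looks "conserved" on this scale only because D, K, and R are all
# hydrophilic, which is why examining several specific properties (charge,
# size, etc.) is more informative than any single score.
```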

The above examples illustrate the power of combining computational linguistics, statistical machine learning, and bio-sequence/structure/function discovery; we expect to tackle other interesting problems in this field, such as protein-protein interaction predictions and aspects of immune system modeling at the molecular level.



    RADAR: Reflective Agent with Distributed Adaptive Reasoning

The RADAR project is an example of a large-scale interdisciplinary research project conducted at the LTI. RADAR is a five-year research project that spans many units in Carnegie Mellon's School of Computer Science, including many LTI researchers. The overall goal is to develop a software-based cognitive personal assistant that could anticipate the needs of its superiors. RADAR will help busy managers work more effectively by coping adaptively with various tasks, ranging from the routine to complex problem solving. This new technology should be equally valuable to managers in industry, academia, and government. RADAR will help its human master in many ways: scheduling meetings, allocating resources, maintaining a project web site, producing coherent reports from disorganized snippets of information, and dealing with the constant flood of e-mail. Additionally, all of the capabilities of RADAR will be tested jointly in a crisis management task, such as re-organizing a conference on very short notice after losing its planned venue.

The key scientific challenge is to endow RADAR with enough flexibility, learning capabilities, and general knowledge to handle all these diverse tasks, including requests and situations that were not anticipated by RADAR's designers. When faced with a surprising new request, RADAR might not know how to proceed, but it should do something sensible. Perhaps it can weave together fragments of old plans to create a new one that fits the current situation. Perhaps it can simply ask for advice. But even turning free-form advice into an executable plan of action is a difficult research problem. The technologies underlying RADAR range from planning and problem solving (e.g., unifying hard and soft time and space constraints), to natural language processing (e.g., extracting useful information from email streams), to dynamically adaptable user interaction (e.g., when and how to ask for advice or offer suggestions). All of these capabilities depend on machine learning technology: learning from human advice, learning by observation, learning by active experimentation, and transferring the results of learning across problems and domains.


Speech Processing

Speech is the most natural way for humans to communicate, and we find it so easy to use that most of us are surprised to learn how complex the processing of spoken language actually is. As one of our goals at the LTI is to make speech communication with and through computers more useful, we work on improving the fundamental technologies of automatic speech processing, i.e., speech recognition and speech synthesis. We also develop new technologies using those components, such as speech-to-speech translation, spoken dialog systems, audio-based information extraction and retrieval, and computer-aided language learning.

Speech Recognition

Automatic Speech Recognition is the process of decoding a spoken speech signal into a written form, that is, a sequence of words. To do this, the analog speech signal needs to be digitized and then, for efficiency reasons, reduced to its essential relevant information, which is mainly done by a form of frequency decomposition. (A spectrogram representation of a conversational speech waveform shows the energy level present at a given frequency as the brightness of the colors.) The final representation of speech in the computer is a stream of parameter vectors over time. These vectors are classified into phonemes, the smallest linguistically distinct sounds of a language. For this purpose, prototypes of these phonemes (so-called acoustic models) are trained beforehand. With the help of the pronunciation dictionary, which relates each word to a concatenation of phonemes, the speech decoder can find possible word candidates, and in combination with the language model, the most likely sequence of words is chosen to transcribe the spoken speech signal.

Humans are able to understand equally well the articulated read speech of a TV news anchor and our over-excited friend calling from a loud party. What seems so easy for us is very difficult for machines: part of the difficulty lies in the fact that the speech signal can be heavily affected by background noises, channel distortions, or cross-talk, but also in the fact that spoken speech varies in speaking style, speed, and content. More difficulties arise in speech recognition because different words might be pronounced the same (as in "two", "to", and "too"), one word might be pronounced differently (such as "the" in "the teen" vs. "the adult"), and also because speech is spoken continuously, so it provides no natural segmentation. For instance, the same phonetic sequence can be segmented into two different word sequences: "This machine can recognize speech" or "This machine can wreck a nice beach". Which sequence is picked depends on the expectation of the listener.

In order to learn the knowledge in the components of the automatic speech recognizer, namely the acoustic models, the pronunciation dictionary, and the language model, today's speech recognition algorithms must use data from which those models are trained. Thus, the acoustic model learns the most likely way people pronounce particular phonemes in particular contexts. The pronunciation dictionary models the most likely sequences of phonemes that build words, and the language model learns the most likely sequences of words that build sentences. The language model is statistically trained and scores all of the possible phrases that could have been spoken. It differs from more traditional parsing techniques, although they may overlap, since speech is less likely to be in traditional linguistic sentences.
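The decoding process just described is conventionally summarized by the standard noisy-channel formulation (textbook notation, not specific to the LTI's systems): the recognizer selects the word sequence

$$\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\,P(W)$$

where $A$ is the stream of acoustic parameter vectors, $P(A \mid W)$ is computed from the acoustic models via the pronunciation dictionary, and $P(W)$ is the language model score.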

Speech Synthesis

Speech synthesis is the process of generating natural-sounding and appropriate speech from text or other computer-readable formats. The task can be viewed in three parts:

Text Analysis: General text may contain numbers, abbreviations, and other non-standard words that require proper treatment if they are to be pronounced intelligibly. In English, the string of digits 1984 has several pronunciations depending on whether it is a year ("nineteen eighty four"), a quantity ("one thousand nine hundred (and) eighty four"), or a telephone number, which can be pronounced "one nine eight four".
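A minimal sketch of this disambiguation step, assuming the semantic context (year, quantity, telephone number) has already been determined by text analysis; the function names and the restriction to four-digit strings are illustrative simplifications:

```python
# Illustrative sketch (not the LTI's actual text-analysis module): expanding
# the digit string "1984" according to an assumed semantic context.

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digits(n):
    """Render 0-99 as words."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

def expand(digits, context):
    n = int(digits)
    if context == "phone":                      # "one nine eight four"
        return " ".join(ONES[int(d)] for d in digits)
    if context == "year" and len(digits) == 4:  # "nineteen eighty four"
        return two_digits(n // 100) + " " + two_digits(n % 100)
    # default: a cardinal quantity, here limited to 4-digit numbers
    words = []
    if n >= 1000:
        words.append(ONES[n // 1000] + " thousand")
        n %= 1000
    if n >= 100:
        words.append(ONES[n // 100] + " hundred")
        n %= 100
    if n:
        words.append(two_digits(n))
    return " ".join(words)

print(expand("1984", "year"))      # nineteen eighty four
print(expand("1984", "quantity"))  # one thousand nine hundred eighty four
print(expand("1984", "phone"))     # one nine eight four
```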

Linguistic Analysis: Once given the words, we still require the pronunciations. This can be done by a pronunciation dictionary. However, no matter how large the dictionary is, we will still encounter words outside of its vocabulary due to neologisms, names, etc. Therefore, a letter-to-sound rule system is also required. Prosody, including tune, duration, and phrasing, comprises the components that make speech interesting. There are many ways to pronounce words; recreating the prosody and style (e.g., polite or urgent) makes the speech more understandable and more acceptable to users.

Waveform Generation: Currently, the most common way of constructing waveforms from phonetic and prosodic descriptions is by concatenating short pieces of pre-recorded natural speech and modifying their prosody to match the desired form. Traditional approaches record all phone-phone transitions in a language (called diphones). Although this technique is robust, more general unit selection synthesis, in which the database contains more varied speech with multiple examples of phones in various contexts, together with an appropriate selection algorithm, seems to offer promise of much higher quality speech.

Speech Processing Research Projects

Throughout the LTI's speech processing research there are always two directions that influence the work: knowledge-driven and data-driven. A substantial amount of knowledge is required in order to build such systems. Knowledge of acoustic phonetics, pronunciation, linguistics, signal processing, etc., is needed to define the framework within which we are working: for example, how to find the phoneme set for languages that have not yet been studied, how to find the pronunciation of words that are not found in a dictionary, or how to include knowledge of syntax and semantics to aid speech processing. Human language complexity and variability is such that no hand-written rules can cover all cases; thus our knowledge-based techniques are also closely coupled with statistically based methods. A common theme appearing throughout all the LTI's work is developing and applying machine learning techniques to appropriately defined knowledge-driven frameworks to improve the usefulness of the work. These interdisciplinary approaches encourage sharing of techniques over different projects: language modeling techniques may also be used in text summarization; novel machine learning techniques may be applied to speech problems. The LTI's speech research allows standard components to be used in other, larger projects, thus making them more useful, but also offering greater challenges in the application of speech and language that lead to more fundamental research. In the following we list a selection of projects and applications which are currently under development at the LTI:

Speech-to-Speech Translation: The Consortium for Speech Translation Research (C-STAR) is a speech-to-speech translation effort developed jointly between CMU and international partners from Japan, Korea, Italy, France, and China. Here, speech recognition must deal with many languages, recognize and translate in real time, and handle many different users. Recent work investigates deploying such systems on resource-constrained mobile devices, and improving the robustness and quality of domain-dependent translation. In the project STR-Dust (Speech Translation for Domain Unlimited Spontaneous Communication Tasks), we push the limits of today's speech translation coverage to rather unlimited domains, such as meetings, lectures, and news.

Multilingual Speech Recognition: Since speech is the most natural means to allow communication across language and culture barriers, speech recognizers in many different languages are an essential prerequisite for making speech-driven communication applications attractive and available to the public. The GlobalPhone project focuses on the rapid deployment of speech recognizers in many languages, i.e., on reducing the required effort in terms of time and cost to build such recognizers, thus enabling support for languages for which few or no resources are available.

Meeting Summarization: A microphone records a multi-person meeting. Off-line speech recognition technology transcribes the meeting, including the difficult task of separating the voices and identifying the speakers. Information retrieval technology is then used to index the data so we can answer queries such as "Find the part where Bob and Jane talked about next year's budget."

Dialog Systems: The CMU DARPA Communicator project allows experiments in mixed-initiative dialog between humans and machines via the telephone in the domain of flight, hotel, and car rental information. This requires accurate, real-time recognition across the reduced-bandwidth and potentially noisy telephone channel, real-time access to networked information, natural language generation, parsing, synthesis, and a dialog manager.

Computer-Aided Language Learning: The Fluency project applies speech recognition techniques to aid non-native speakers in pronunciation.

Language Modeling: Classical language modeling techniques collect statistics on word trigrams (three-word sequences), but because there are many words in the language, and even more trigrams, collecting enough data to find all examples is difficult. Thus, smoothing and back-off techniques are often required. A number of new language modeling techniques are being investigated within the LTI, including class-based language models that consider whole word classes, not just individual words, for n-grams. Additionally, new statistical modeling techniques such as maximum entropy are being developed to give better estimates of the probability distribution of word sequences.
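A minimal sketch of the trigram idea, using simple linear interpolation as the smoothing scheme (one common choice; the weights below are assumed rather than tuned on held-out data, as a real system would do):

```python
# Minimal interpolated trigram language model sketch. Linear interpolation
# is one common smoothing choice; the weights below are assumed, not tuned.

from collections import Counter

corpus = "the cat sat on the mat . the cat ran .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

def prob(w1, w2, w3, lambdas=(0.6, 0.3, 0.1)):
    """P(w3 | w1 w2) as a weighted mix of trigram, bigram, and unigram
    relative frequencies, so unseen trigrams still get probability mass."""
    l3, l2, l1 = lambdas
    p3 = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    p2 = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    p1 = unigrams[w3] / N
    return l3 * p3 + l2 * p2 + l1 * p1

print(prob("the", "cat", "sat"))  # seen trigram: high probability
print(prob("the", "cat", "ran"))  # competing continuation
```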

Synthesis using Festival, FestVox, and Flite: To ensure that our speech synthesis work is available to the widest range of users, we work with the University of Edinburgh's Festival Speech Synthesis System, a free software synthesis toolkit and engine. We have also produced the FestVox tools for building new voices in new languages, allowing the construction of both general voices and domain-specific voices. Also, with the small-footprint CMU Flite system, synthesis can be used on any platform.

Robust Speech Recognition: Current automatic speech recognition systems are limited in their ability to adapt to the effects of new speakers, difficult acoustical environments, non-native accents, and spontaneous speech production. Researchers at the LTI are carrying out a broad program of research to improve the robustness of automatic speech recognition using a variety of techniques.

Information Retrieval and Text Mining



Web search engines are a well-known type of Information Retrieval (IR) system that uses statistical inference to locate documents that satisfy an information need. Greater search engine accuracy is always useful, but finding information is no longer the most important IR problem. Today's tools often deliver much more information than a person can read easily. IR and Text Mining research at the LTI involves a wide range of issues related to finding, validating, organizing, summarizing, analyzing, and communicating large amounts of information. Our goal is to enable people to routinely base decisions on far more information than is practical today.

Machines cannot understand the meaning of a multimedia document in the way that a human can, but many useful tasks can be accomplished with limited forms of understanding. Statistical corpus analysis, probabilistic inference, and machine learning are the tools of IR and Text Mining research. Research at the LTI is grounded in theory and tested in large-scale applications. Consequently, research projects focus on everything from basic theory to software engineering. Several representative examples are described below.

Research on Advanced IR Architectures develops systems that combine standard search queries with detailed, long-term user and task models and highly structured documents. Document structure may indicate how the document is organized (e.g., XML), or it may be provided by language analysis tools (e.g., named entities, part-of-speech, syntactic parsing). This research supports LTI projects on open-corpus language tutoring, such as the REAP reading comprehension project. Much of this research is distributed via the open-source Lemur Toolkit.

The Distributed Information Retrieval project studies environments, such as the Web, large corporate networks, and peer-to-peer networks, in which thousands of search engines are available. Cooperation cannot be assumed, so robust techniques are required for automatically characterizing search engines, selecting among them, searching them, and integrating results retrieved from different sources. Distributed IR (federated search) appears in the real world in peer-to-peer networks, which contain searchable digital libraries (leaf nodes) and directory services (hub nodes) that route messages and merge results from different sites.

Translingual Information Retrieval uses queries in one language (e.g., English) to find documents in other languages (e.g., German, Chinese, and Arabic). Traditional machine translation methods do not work well when queries are short, out of context, and not sentences. Our research focuses on corpus-based translation of query terms by learning empirical associations among multilingual lexicons from translation mates (documents, paragraphs, passages, or sentences), and by mapping queries and documents to a conceptual interlingua that bridges the language barrier.

The Language Technologies Institute pioneered research in automated Text Summarization with the Maximum Marginal Relevance metric and its application to user-profile-relevant document summarization. Research also focuses on summarizing dialog, clusters of topically related documents, and automated generation of briefings from corpora.

The REAP project uses word histograms to model how children use language at different ages. These allow the Lemur search engine to select texts that use vocabulary a particular child is likely to understand.



Text mining research at the LTI consists of four major related categories, listed below.

* Information extraction from free text involves tasks such as detecting entities (people, places, organizations, etc.), detecting roles (e.g., Clinton as senator, or Carter as peace envoy), and detecting relations (who does what to whom). One concrete accomplishment is a software package called Minorthird for learning-based information extraction (using HMMs, CRFs, and other techniques). Information extraction is used in question answering, in topic detection and tracking, and in cognitive agents (e.g., in the RADAR project). Information extraction may also be used to instantiate tables (e.g., what product is sold by whom at what price) for classical data mining.

* Topic detection and tracking (TDT) utilizes supervised learning to track topics or events defined by example texts, and unsupervised learning to detect the emergence of new topics or events in news streams. The latter entails detecting novelty, i.e., determining that a textual description of an event indicates a novel one, even if it contains points in common with earlier, different events of like type. TDT is inherently a dynamic time-series learning challenge, where topics may drift over time, and events initiate and fade from the news, or morph into other events.

* Text categorization and filtering is a form of supervised machine learning, where texts are assigned categories (e.g., emails to folders, web pages to taxonomic classes, books to catalog codes): first statistical classifiers are trained from example corpora, and then new texts are classified with the trained classifiers (see the sketch after this list). Filtering is a form of dynamic text categorization, where the categories are defined implicitly and evolve over time.

* Large public comment databases are a feature of modern democratic societies. Email and the Web make it easy for people to express their opinions about proposed government policies and regulations, consumer products, and a wide range of other topics. Popular and controversial topics can quickly generate hundreds of thousands of comments. Text mining research at the LTI includes interactive methods of organizing, summarizing, and exploring large databases of unstructured text comments. Tools of this type can increase the responsiveness and transparency of democratic governments, and allow companies to better track customer opinions about products and services.
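The sketch referenced above illustrates the supervised train-then-classify loop with a multinomial naive Bayes classifier, one standard choice; the brochure does not commit to a particular algorithm, and the tiny corpus is invented:

```python
# Illustrative text categorization with multinomial naive Bayes and
# add-one smoothing; one standard choice, not the LTI's specific method.

import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (tokens, label). Returns a classify(tokens) closure."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = sum(class_counts.values())

    def classify(tokens):
        best, best_score = None, float("-inf")
        for label in class_counts:
            # log prior + log likelihood with add-one smoothing
            score = math.log(class_counts[label] / total_docs)
            denom = sum(word_counts[label].values()) + len(vocab)
            for w in tokens:
                score += math.log((word_counts[label][w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

    return classify

classify = train([
    ("cheap pills buy now".split(), "spam"),
    ("meeting agenda attached".split(), "work"),
])
print(classify("buy cheap pills".split()))  # spam
```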

The Lemur Toolkit for Language Modeling and Information Retrieval is an open-source software toolkit developed at the LTI and the University of Massachusetts. The heart of Lemur is a set of indexing methods that support a wide range of IR capabilities. Lemur includes multilingual search engines that are based on several probabilistic and vector-space retrieval models. Its powerful query language and support for text annotations and document structure make it particularly useful for question answering, language tutoring, and other research at the LTI. It also includes a variety of other IR capabilities, such as federated search, text summarization, and document clustering. Lemur is used in universities and research laboratories around the world. For more information see www.lemurproject.org.

Part of text mining relates strongly to information retrieval, using notions such as inverted indexing and text-to-text similarity metrics: cosine similarity in high-dimensional vector spaces, and generative similarity in statistical language modeling approaches. The LTI is active in all aspects of text mining, including computational methods that scale to large real-world challenges, and that apply to different languages.
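As a concrete illustration of the vector-space notion mentioned above, here is a minimal cosine similarity over raw term-frequency vectors (real systems typically add tf-idf weighting and inverted indexes for scale):

```python
# Minimal cosine similarity between two documents represented as raw
# term-frequency vectors; tf-idf weighting is omitted for brevity.

import math
from collections import Counter

def cosine(doc_a, doc_b):
    a, b = Counter(doc_a), Counter(doc_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine("machine translation of text".split(),
             "statistical machine translation".split()))  # ~0.58
```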

Knowledge-Based NLP and Question Answering

The LTI has a long history of research in knowledge-based natural language processing and computational linguistics, dating back to Carbonell's work on knowledge-based interlingual machine translation and Tomita's work on efficient natural language parsing techniques, when the precursor of the LTI was Carnegie Mellon's Center for Machine Translation. Of particular note are the KANT and KANTOO systems developed by Nyberg, Mitamura, and Carbonell, which brought high-accuracy interlingua machine translation to large-scale practical use for translating technical literature at Caterpillar Inc. into several languages.

This line of work is characterized by careful linguistic analysis, large-scale knowledge engineering, and solid system building. More recently, knowledge-based systems have been combined with machine learning, as in the AVENUE project, where translation transfer rules are learned from a minimal number of word-aligned translation pairs via new techniques such as seeded-version-space learning. The current pure knowledge-based projects at the LTI are:

Knowledge Acquisition from Natural Language Text

The knowledge acquisition bottleneck has long been decried as one of the limiting factors for applications of artificial intelligence: how can we get all of the appropriate world knowledge into the computer so that it can solve problems of practical significance in a new domain? In our research on knowledge acquisition from text, we are working to define a formal mapping between specific structures in natural language and corresponding meaning representations in a formal representation (e.g., frame logic). The goal of CMU's contribution to the HALO-II project is to reduce the cost of encoding knowledge for a problem-solving system by making it possible to acquire knowledge directly from an existing text (e.g., from a textbook). Current work focuses on acquiring various types of knowledge (ontologies, rules, processes, etc.) from college textbooks in domains such as Biology, Chemistry, and Physics.

Open-Domain Question Answering for Multi-Lingual Text Collections

As the size of available on-line text collections grows ever larger, simple search engines are becoming less and less effective at helping users find answers in on-line documents. More advanced question-answering (QA) systems, such as the LTI's JAVELIN project, use NLP techniques (segmentation, stemming, parsing, semantic interpretation, unification, etc.) to a) understand the underlying meaning of the questions they are posed, and b) find the most likely answers in the target collection(s). Information gathering becomes a collaborative process, in which the system and user work together in an ongoing dialog to refine the search for ever-better answers. In addition to research on basic parsing and interpretation of unrestricted text, we are also actively working on: a) information gathering dialogs; b) sophisticated multi-layer retrieval strategies; c) a range of approaches to information extraction (pattern-based, statistical, etc.); and d) synthesis of answers from multiple answer candidates. The team is also investigating the use of dynamic planning in QA (e.g., to explore first the strategies that are most likely to yield a good answer when processing time is limited), and the use of reasoning and belief networks to piece together individual pieces of information from several documents.

New Bill of Rights:

Get the right information
To the right people
At the right time
On the right medium
In the right language
With the right level of detail



Academic Programs

The LTI currently offers two graduate degree programs: a PhD in Language Technologies and a Master's in Language Technologies. We also participate in an undergraduate Linguistics minor (language technologies track). The LTI offers shorter certificate programs in Language Technologies as well; please contact us for further information on these.

PhD in Language & Information Technologies

The PhD in Language and Information Technologies is a research-oriented degree program consisting of the following components: successful completion of a set of courses, mastery of certain proficiencies, and a program of research, directed by a faculty advisor, culminating in a PhD thesis.

PhD Curriculum

A student working towards a PhD in Language and Information Technologies must successfully complete at least six courses in the LTI and two courses from any department in the School of Computer Science (for more course information, see LTI Courses).

Of these eight courses, the student must take at least one from each of the LTI Focus Areas (Linguistic, Computer Science, Task Orientation, and Statistical/Learning) and must take at least two lab courses, which involve hands-on work in one of four different areas (Speech, Machine Translation, Information Retrieval, and Natural Language Processing). The lab modules are self-paced, with Teaching Assistant and faculty guidance. Students are encouraged to consider taking additional elective courses beyond the eight required. Students may select additional courses from the LTI, from related courses in the Computer Science Department, or from other related CMU or University of Pittsburgh departments. Areas of possible interest include Speech, Linguistics, Statistics, and Human-Computer Interaction.

    Proficiencies

The following skills must be demonstrated in the course of graduate study, with flexibility in the form and timing of their demonstration:

Writing: Satisfied by producing a peer-reviewed conference paper or a written report that at least two SCS faculty certify as being of conference-paper quality. The topic of the paper may be the student's research results, a comprehensive survey of a research area, a linguistic analysis paper, or any other pertinent topic.

Presentation: Satisfied via a public presentation of good quality, such as an external conference presentation or an internal seminar presentation reviewed by several faculty members.

Programming: Satisfied by demonstrating competence in computer programming of language technology; this is normally satisfied in the course of the student's research, but could also be satisfied via explicit apprenticeship if desired.

Teaching: Satisfied by two successful Teaching Assistantships (TAs), as determined by the faculty members for whom the student serves as TA. Typical TA responsibilities include planning a portion of the syllabus, developing exercises, and delivering some lectures under faculty supervision. Of the two TA-ships, typically one will be for an undergraduate class and one will be for a graduate class.


    Student Evaluation

Following the long-standing SCS tradition, the LTI does not focus only on courses or exams, and does not have a fixed timeline for completion of the PhD degree, although our target is five years. Instead, we carry out an individualized student evaluation at the end of each semester, based on research performance, classes, and other contributions. For each student, we write a letter indicating whether they are making satisfactory progress towards completing their degree. Students are in good standing as long as they are making satisfactory progress.

    Financial Support

All PhD and most Masters students accepted into the Language Technologies Institute are awarded a Research Fellowship for the academic year, covering full tuition and a living allowance, usually renewable for the duration of the program, as long as the student maintains good standing.

Students are encouraged to apply for support from outside Carnegie Mellon (fellowships, foreign government grants, etc.). As an incentive to seek funding from other sources, a supplement is provided to the stipend of any student who obtains outside support.


    Research and PhD Thesis

It is expected that all PhD students engage in active research from their first semester. Moreover, advisor selection should occur within 1-2 months of entering the PhD program, with the option to change at a later time. Roughly half of a student's time should be allocated to research and half to courses until the coursework is completed.

Once the coursework is completed, the student should begin to move towards a thesis topic, in consultation with the student's advisor. Once a suitable topic is defined, the student prepares and presents a dissertation proposal, based on their initial work on that topic. The dissertation proposal is normally expected at the end of the third year, and describes the general area of investigation and the specific problem(s) to be addressed, a clear argument for the significance of the problem, relevant past work, expected scientific contributions of the proposed work, and a projected timeline for completion. A dissertation committee consisting of the advisor, at least two other CMU faculty in language technologies, and at least one external member should be formed prior to the proposal. The dissertation itself, normally completed during the fifth year, includes a detailed description of all the work done, including its clear evaluation and the final scientific contributions. The thesis is then defended in a public oral presentation. A successful defense results in the awarding of the PhD degree.

Master of Language Technologies

The Master of Language Technologies (MLT) is a professional degree that is normally completed in two years. Students choose an individualized curriculum from a flexible set of courses and self-paced laboratory modules that cover linguistic and statistical approaches and basic computer science. The curriculum is usually tailored to emphasize a specialty in one of three language technology areas: Machine Translation, Information Retrieval, or Speech Technology. Directed research is an integral part of the MLT program; each MLT student carries out research under the guidance of a faculty advisor.

With some modifications and enhancements, the MLT curriculum also forms the course-based component of the PhD Program. The more research-oriented MLT students are encouraged to apply for continuing studies in the PhD program, with most of their MLT courses and hands-on work being credited towards the PhD.

    Master of LT Thesis Option

A Masters Thesis Option is available for students who wish to demonstrate independent research ability during their enrollment in the LTI Masters program. Students who choose the Masters Thesis Option will be expected to follow thesis guidelines that are similar in character to those for the LTI PhD. The Masters thesis requirements are less rigorous, however, since the Masters dissertation is expected to be defined, completed, and publicly defended in less than one year.

    Master of LT Curriculum

The curriculum for the MLT consists of a minimum of 120 course units at a senior or graduate level. From these 120 units, six courses must be LTI courses and two other courses must be SCS courses. There are additional constraints on course selection, required in order to meet SCS-wide Masters requirements. A concentrated form of this degree may be completed in one year without the research component.



LTI Courses and Admissions

Sample Course Descriptions

We briefly describe here the main focus of a sample of our courses; we list these in numerical order. To see a complete list of current courses and course descriptions, please see: http://www.lti.cs.cmu.edu/Courses/

11-682 Human Language Technologies (Words for Nerds): During the last decade computers have begun to understand human languages. Web search engines, language analysis programs, machine translation systems, speech recognition, and speech synthesis are used every day by tens of millions of people in a wide range of situations and applications. This course covers the fundamental statistical and symbolic algorithms that enable computers to work with human language, from text processing to understanding speech and language.

11-711 Algorithms for NLP: A graduate-level course on the computational properties of natural languages and the fundamental algorithms for the symbolic processing of natural languages.

11-717 LT for Computer-Aided Language Learning: This course studies the design and implementation of CALL systems that use Language Technologies such as Speech Synthesis and Recognition, Machine Translation, and Information Retrieval.

11-721 Grammar and Lexicon: A graduate-level course on linguistic data analysis and theory, focusing on methodologies that are suitable for computational implementations. The course covers major syntactic and morphological phenomena in a variety of languages. The emphasis is on examining both the diversity of linguistic structures and the constraints on variation across languages.

11-723 Formal Semantics: A graduate-level course on formal linguistic semantics: Given a syntactic analysis of a natural language utterance, how can one assign the correct meaning representation to it, using a formal logical system?


11-731 Machine Translation: A graduate-level course surveying the history, techniques, and research topics in the field of Machine Translation.

11-741 Information Retrieval: This course studies the theory, design, and implementation of text-based information systems. The IR core components of the course include important retrieval models (Boolean, vector space, probabilistic, inference net, language modeling), clustering algorithms, automatic text categorization, and experimental evaluation. A variety of current research topics are also covered, including cross-lingual retrieval, document summarization, machine learning, and topic detection and tracking.
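As a taste of the simplest of these retrieval models, here is a minimal sketch of vector-space ranking by cosine similarity (the toy documents and query are invented; real systems add tf-idf weighting, stemming, and inverted indexes):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two raw term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

docs = {"d1": "speech recognition and speech synthesis",
        "d2": "text retrieval and text categorization"}
query = Counter("speech synthesis".split())
vectors = {name: Counter(text.split()) for name, text in docs.items()}
for name in sorted(vectors, key=lambda n: -cosine(query, vectors[n])):
    print(name, round(cosine(query, vectors[name]), 3))  # d1 ranks above d2
```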

11-743 Advanced IR Seminar/Lab: This is a seminar that focuses on current research in Information Retrieval. The seminar covers recent research on subjects such as retrieval models, text classification, information gathering, fact extraction, information visualization, summarization, text data-mining, information filtering, collaborative filtering, question answering systems, and portable information systems.

11-751 Speech Recognition: This course provides an introduction to the theoretical foundations, essential algorithms, major approaches, experimental strategies and current state-of-the-art systems in speech recognition.

11-752 Phonetics, Prosody, Perception and Synthesis: This course offers insight into how human perception of speech relates to the physical properties of the signals. It covers practical aspects of speech and includes hands-on projects. The second half of this course concentrates on speech synthesis and building of synthetic voices.

11-761 Language and Statistics: This course covers some of the central themes and techniques that have emerged in statistical methods for language technologies and natural language processing.

11-791/792 Software Engineering for Language Technologies I/II: This two-course sequence combines classroom material and assignments in the fundamentals of software engineering (11-791) with a self-paced, faculty-supervised directed project (11-792). The two courses cover all elements of project design, implementation, evaluation and documentation.


    Faculty

Alan W Black
Associate Research Professor
BS Computer Science, Coventry Polytechnic, 1984
MS Knowledge Based Systems, University of Edinburgh, 1986
PhD Artificial Intelligence, University of Edinburgh, 1993


speech synthesis * speech-to-speech translation * spoken dialog systems

machine translation * cross-language information retrieval * topic tracking

Ralf Brown
Senior Systems Scientist
BS Computer Science, Towson University, 1986
PhD Computer Science, Carnegie Mellon University, 1993

Jamie Callan
Associate Professor
BA Applications of Computer Science, Univ. of Connecticut, 1984
MS Computer & Information Science, Univ. of Massachusetts, 1987
PhD Computer Science, Univ. of Massachusetts, 1993

information retrieval * adaptive information filtering * text data mining

Alan Black has created practical implementations of computational theories of speech and language. After a wide background in morphology, language modeling in speech recognition, and computational semantics, he now works in all aspects of speech generation. As an author of the free software Festival Speech Synthesis System, he researched text analysis, prosodic modeling, waveform generation, and architectural issues in synthesis systems. His work targets data-driven computational models that allow synthesizers to capture speaker style. Specifically, he studies data-driven prosodic models and automatic building of voices in English and other languages. To allow spoken output anywhere, he also deploys this work on handheld computers, specifically addressing rapid development of voices in new languages, modeling of speaker individuality, and evaluation of voice quality.

Professor Black's teaching is very practical; his courses involve significant exercises that allow students to gain experience in building synthetic voices, statistically trained models, etc. After some practical experience it is easier to understand the underlying theoretical issues and their relative importance.

    www.cs.cmu.edu/~awb

Ralf Brown's research interests cover several areas of language technology, such as reference resolution, disambiguation, corpus-based machine translation, cross-language information retrieval, and topic tracking in news. His recent research has focused on Example-Based Machine Translation and its applications, particularly in the context of multi-engine translation systems, and on topic tracking in news. He also works with machine-learning techniques for extracting patterns from parallel text in order to build translation systems with less training material.

Current and recent projects include RADD (Rapidly-Adaptable Data-Driven Machine Translation), AVENUE (machine translation for languages with few resources), Topical Novelty Detection in the TDT (Topic Detection and Tracking) program for detecting new events in the news and tracking their evolution, TONGUES (rapid development of bi-directional speech-to-speech translation systems), and MUCHMORE (cross-language information retrieval in the medical domain).

    www.cs.cmu.edu/~ralf

Jamie Callan is interested in a wide range of information retrieval and text mining topics. In recent years his research has focused on the four problems listed below.

Federated Search (Distributed IR): Provide access to many search engines through a single search interface; includes peer-to-peer search. Research topics include learning what each engine contains, selecting which to search, searching them, and integrating results from different sources.

Adaptive Document Filtering: Monitor information streams to find documents that satisfy an information need. The system should learn a person's information needs, rapidly identify desired documents, and distinguish between novel and redundant information.

Large-Scale Text Analysis: Develop tools for rapidly analyzing large text datasets. For example, when a government agency receives 100,000 comments about a new regulation, it needs to know which groups commented, what topics were discussed, and what supporting evidence was cited.

IR for Language Applications: Search engines are increasingly used in question answering and language tutoring systems. Such applications require rich text annotation (e.g., syntax, named entity), complex queries, and retrieval models that combine varied forms of evidence.

His students initially work closely with him to study specific ideas while learning research skills and IR. As students gain expertise, they develop their own interests and have more freedom in exploring them.

    www.cs.cmu.edu/~callan



Robert Frederking
Senior Systems Scientist; Director, LTI Graduate Programs
BS Computer Engineering, Case Western Reserve University, 1977
PhD Computer Science (AI), Carnegie Mellon University, 1986

artificial intelligence * machine learning * computational geometry

speech-to-speech MT * rapid-development wide-coverage MT * question answering

Eugene Fink
Systems Scientist
BS, Mount Allison University, 1991
MS, University of Waterloo, 1992
PhD, Carnegie Mellon University, 1999

multimedia analysis * multimedia interfaces * Informedia digital video library

Alex Hauptmann
Senior Systems Scientist
BA Psychology, Johns Hopkins University, 1982
MA Psychology, Johns Hopkins University, 1982
Diplom Computer Science, Technische Universität Berlin, 1984
PhD Computer Science, Carnegie Mellon University, 1991

Eugene Fink's research interests are in various aspects of artificial intelligence, including machine learning, planning, problem solving, automated problem reformulation, e-commerce applications, medical applications, and theoretical foundations of artificial intelligence. His interests also include computational geometry and algorithm theory.

He is currently working on an intelligent system for automated allocation of offices and related resources, in both crisis and routine situations. This work is part of the RADAR project, aimed at creating a general-purpose assistant for office managers. He is also working on techniques for identification of both known and surprising patterns in large-scale databases, and applying these techniques to homeland security. This work is part of the ARGUS project, which is a joint research project involving Carnegie Mellon and Dynamix Technologies.

    www.cs.cmu.edu/~eugene

Bob Frederking's primary research area has been machine translation applications that do not currently permit the use of purely knowledge-based techniques. This includes rapidly developing Machine Translation (MT) for new languages and translating text and speech that are not limited to a narrow, well-defined domain. Our main technical approach in this area is Multi-Engine MT (MEMT). MEMT applies several different MT techniques to the same text, and then attempts to select the best results from each technique. He developed and implemented the initial chart-based dynamic-programming technique for merging the results from the different engines and our current merging technique, which uses statistical language modeling to select among the different technique outputs. He has also been involved in LTI projects in Cross-Language Information Retrieval, Question Answering, and Information Extraction from email, among other things.
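To illustrate just the selection step of this multi-engine idea, here is a minimal sketch (the toy bigram scores and engine outputs are invented; the actual MEMT work merges partial hypotheses with a chart-based dynamic program rather than choosing among whole sentences as this sketch does):

```python
# Toy bigram log-probabilities (invented for illustration); a real system
# uses a statistical language model trained on large corpora.
BIGRAM_LOGPROB = {
    ("<s>", "the"): -0.5, ("the", "meeting"): -1.0,
    ("meeting", "starts"): -1.2, ("starts", "now"): -1.5,
    ("now", "</s>"): -0.7, ("<s>", "meeting"): -2.5,
    ("meeting", "begin"): -3.0, ("begin", "now"): -2.8,
}
UNSEEN = -6.0  # crude penalty for bigrams the model has never seen

def lm_score(sentence: str) -> float:
    """Total bigram log-probability of a candidate translation."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(BIGRAM_LOGPROB.get(pair, UNSEEN)
               for pair in zip(tokens, tokens[1:]))

def select_best(candidates: list[str]) -> str:
    """Pick the engine output the language model finds most fluent."""
    return max(candidates, key=lm_score)

# Hypothetical outputs from three MT engines for one source sentence:
outputs = ["the meeting starts now", "meeting begin now",
           "the meeting now starts"]
print(select_best(outputs))  # -> "the meeting starts now"
```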

Professor Frederking believes that successful advising and teaching hinge largely on successful communication: presenting advice (or a lecture), understanding what (if anything) the student is having trouble with, and then providing the information or guidance that he or she needs to resolve any difficulties. As the Chair of the LTI's graduate programs, he is the default advisor for students who are not project-supported.

    www.cs.cmu.edu/~ref/

Alex Hauptmann's research aims to design and build intelligent programs that process large volumes of multimedia data, including text, image, video, and audio, and make the data useful for other applications, so as to improve speech recognition, image understanding, NLP, machine learning, question answering and IR. The challenge is to find the right data, to process it into a suitable form for training, learning, or re-use, and to build mechanisms that can successfully utilize this data.

This work takes place in the context of the Informedia digital video project, which aims to achieve machine understanding of video and film media, including all aspects of search, retrieval, visualization and summarization in both current and archival content collections. The base technology developed under Informedia combines speech, image and natural language understanding to automatically transcribe, segment and index linear video for intelligent search and image retrieval.

    www.cs.cmu.edu/~alex


John Lafferty
Professor (CSD, LTI)
BA, Middlebury College, 1982
MS, Princeton University, 1984
PhD Mathematics, Princeton University, 1986

natural language processing * machine learning * information theory

machine translation * spoken language understanding * machine learning

Alon Lavie
Associate Research Professor
BA Computer Science, Israel Institute of Technology, 1987
MS Computer Science, Carnegie Mellon University, 1993
PhD Computer Science, Carnegie Mellon University, 1996

Judith Klein-Seetharaman
Assistant Professor, Dept. of Pharmacology, Univ. of Pittsburgh School of Medicine; Research Scientist, LTI
Diplom in Biology, Univ. of Cologne, Germany, 1995
Diplom in Chemistry, Univ. of Cologne, Germany, 1996
PhD Biological Chemistry, MIT, 2000

computational biology/bioinformatics * biochemistry/biophysics * structural biology

How does sequence map to structure and function of proteins in different organisms? Dr. Klein-Seetharaman takes a linguistically inspired view of this question, in analogy to "How do words map to meaning in natural languages?", using stochastic language modeling technologies. Computational models are validated experimentally by interdisciplinary (biochemical and biophysical, in particular NMR spectroscopic) studies of purified proteins and model peptide sequences. The emphasis lies on testing predicted sequence dependence on structural and dynamic aspects of folding/misfolding and functional properties of proteins. Specific proteins that are expressed, purified and studied experimentally in Dr. Klein-Seetharaman's laboratory include the G-protein coupled receptor rhodopsin, the glutamate receptors and the epidermal growth factor receptor. These systems function in diverse signal transduction pathways, but resemble each other in their mechanism of action. Each receptor undergoes substantial conformational changes during the signaling process, and the investigation of the precise molecular details of these changes is instrumental to elucidating the molecular mechanism of signaling by these molecules.

    www.cs.cmu.edu/~judithks

The central focus of John Lafferty's research is machine learning, including algorithms, theory, and statistical methods for learning from data. The motivating applications for this work most often come from text and natural language processing, information retrieval, and other areas of language technologies. For example, in recent work with his colleagues he has studied approximate inference algorithms for a family of mixture models appropriate for document collections, and applied the algorithms to automatically extract the subtopic structure of scientific articles. Over several years Professor Lafferty has been involved in the development of a language modeling approach to information retrieval, including a general approach to IR based on decision theory. In other work he is researching learning algorithms for sequential and graph-structured data, using a framework called conditional random fields for combining the strengths of graphical models with discriminative classification methods such as support vector machines and logistic regression.
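To give a flavor of the language modeling approach to retrieval, here is a minimal sketch of query-likelihood ranking with Dirichlet smoothing (a textbook formulation rather than any specific CMU system; the documents, query, and smoothing parameter are invented for illustration):

```python
import math
from collections import Counter

# Tiny invented corpus; a real system indexes large collections.
docs = {
    "d1": "machine learning for natural language".split(),
    "d2": "statistical language models for information retrieval".split(),
}
collection = [w for words in docs.values() for w in words]
coll_counts, coll_len = Counter(collection), len(collection)

def query_likelihood(query: list[str], doc: list[str], mu: float = 10.0) -> float:
    """Log P(query | doc) under a Dirichlet-smoothed unigram document model."""
    counts, n = Counter(doc), len(doc)
    score = 0.0
    for w in query:
        p_coll = coll_counts[w] / coll_len  # collection background model
        # (assumes every query word appears somewhere in the collection)
        score += math.log((counts[w] + mu * p_coll) / (n + mu))
    return score

query = "language retrieval".split()
ranked = sorted(docs, key=lambda d: query_likelihood(query, docs[d]), reverse=True)
print(ranked)  # documents ordered by how likely they are to "generate" the query
```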

    www.cs.cmu.edu/~lafferty

Alon Lavie's main areas of research are Machine Translation (MT) of both text and speech, and Spoken Language Understanding (SLU). His most active current research is on the design and development of new approaches to Machine Translation for languages with limited amounts of data resources. He has also worked extensively on the design and development of Speech-to-Speech Machine Translation systems and on robust parsing algorithms for analysis of spoken language.

Professor Lavie is co-PI of the AVENUE project (funded by NSF/ITR), where we are developing a general framework for building prototype MT systems for languages for which only scarce amounts of data and linguistic resources are available. He also works on parsing algorithms for spoken language analysis of databases of transcribed spoken language (such as CHILDES). He was co-PI of the Nespole! and C-STAR speech translation projects and of the LingWear and Babylon mobile speech translation projects, where he directed the design and development of the analysis and translation components.

He is the principal instructor of the graduate-level course on "Algorithms for NLP". He also teaches the section on "Natural Language Processing" for the "Introduction to Human Language Technologies" course, and supervises the Lab in NLP course (11-712) at the LTI.

    www.cs.cmu.edu/~alavie



Teruko Mitamura
Associate Research Professor; LTI Finance Director
MA Linguistics, University of Pittsburgh, 1985
PhD Linguistics, University of Pittsburgh, 1989

knowledge-based MT * question answering * Japanese NLP and dialog systems

Eric Nyberg
Associate Professor
BA Computer Science, Boston University, 1983
PhD Computational Linguistics, Carnegie Mellon University, 1992

machine translation * integrated information management * software engineering

Lori Levin
Associate Research Professor
BA Linguistics, University of Pennsylvania, 1979
PhD Linguistics, MIT, 1986

minority languages * machine translation * interlingua representations * lexicons

Lori Levin works on linguistic issues in machine translation of spoken and written language. Her career-long research interest is the design of multi-lingual systems that accommodate typologically diverse languages. LTI's AVENUE project focuses on translation of languages with scarce data resources. Developing a machine translation system typically requires a level of economic and human resources that may not be available for all languages. Research on MT for minor languages combines linguistic typology and machine learning to automate the production of machine translation systems for new languages.

She is also part of a consortium for designing semantic interlingual representations of text meaning. We are using multi-parallel corpora (multiple versions of the same text) to home in on what is common among sentences that are supposed to convey the same meaning. In addition to the interlingua design, the consortium is producing annotated multi-parallel corpora, tools for annotation, and evaluation metrics.

Her other interests include computer-assisted language learning, especially tools to assist second language readers with comprehension of authentic texts.

    www.cs.cmu.edu/~lsl

Professor Mitamura's research focuses on the following projects:

JAVELIN-II (open-domain, multilingual question answering): A system which combines NLP, planning, IR and MT to answer natural language questions and refine the search strategy in consultation with the user.

CAMMIA (Conversational Agent for Multilingual Mobile Information Access): A system which extends VoiceXML with NLP and dialog management to support dynamic multi-task dialogs in Japanese and English.

KANT (Knowledge-based Accurate Natural Language Translation): A project founded in 1991 for the research and development of large-scale, practical translation systems for technical documentation. KANT uses a controlled vocabulary and grammar for each source language, and explicit, yet focused semantic models for each technical domain to achieve very high accuracy in translation.

    Teruko Mitamura teaches the courses Machine Translation, Grammars and Lexicons, and LT for CALL.

    www.cs.cmu.edu/~teruko

    Eric Nyberg's research at LTI is currently focused on three main areas:

Open-Domain Question Answering. The JAVELIN project combines natural language dialog, information retrieval, text understanding, fact extraction, and probabilistic reasoning to answer complex questions about entities, relationships and events expressed in unstructured text.

Conversational Agents for Mobile Multilingual Information Access. The CAMMIA project is creating speech dialog systems for robust, multi-task dialogs in mobile environments such as car navigation systems.

Knowledge-Based Machine Translation. Since the late 1980s he has worked on controlled language, document checking and machine translation for technical documentation; the current system, KANTOO, is now in use at Caterpillar, Inc.

Professor Nyberg also teaches a two-course series on software engineering and information technology, where students learn about software analysis, design, and construction in the context of real-world team projects.

    www.cs.cmu.edu/~ehn


Roni Rosenfeld
Professor
BS Mathematics and Physics, Tel-Aviv University, 1985
MS Computer Science, Carnegie Mellon University, 1991
PhD Computer Science, Carnegie Mellon University, 1994

statistical language modeling * speech recognition/interfaces * machine learning

Carolyn Penstein Rosé
Research Scientist
BS Computer Science, University of California at Irvine, 1992
MS Computational Linguistics, Carnegie Mellon University, 1994
PhD Language and Information Technologies, Carnegie Mellon University, 1997

robust language understanding * technology-supported education

spoken language interaction * speech recognition * interface design

Alex Rudnicky
Principal Systems Scientist
BS Psychology, McGill University, 1975
MS Psychology, Carnegie Mellon University, 1976
PhD Psychology, Carnegie Mellon University, 1980

Carolyn Penstein Rosé's primary research objective is to develop and apply language technology (i.e., robust language understanding technology and dialogue management technology) to enable effective computer-based and computer-supported instruction. The important role of students making their thinking explicit through verbal explanation is well established. Thus, a major thrust of her current research is to explore issues related to eliciting and responding to student explanation behavior. However, many of the underlying issues, such as influencing student expectations, motivation and learning orientation, transcend the specific input modality. She is the PI for two tutorial dialogue projects, namely CycleTalk for thermodynamics tutoring and Calculategy for calculus tutoring. She is also co-PI for a physics tutoring project headed up by Kurt VanLehn at the University of Pittsburgh.

She has served as a co-instructor for Grammar Formalisms and the Master's of HCI Project Course. Professor Penstein Rosé is also the primary instructor of the Conversational Interfaces course, which is jointly listed in LTI and HCI.

    www.cs.cmu.edu/~cprose/

Professor Rosenfeld's research spans two key areas:

Computational Molecular Biology and, more specifically, Computational Biolinguistics. Many of the problems in this area involve statistical modeling of long sequences of building blocks (nucleotides or amino acids) and their relationship to proteins and their function. This is very similar to the problem of modeling natural language: long sequences of letters or words, and their relationship to the deep structure and meaning of sentences. He is currently working to detect and characterize specific selectional pressure in proteins (a toy sketch of this sequences-as-text analogy follows the second area below).

Speech interaction with PDAs, web portals, and robots is now feasible. But what is the ideal style for human-machine speech communication? Natural language interfaces are easy for people, yet they are brittle, difficult to develop, and they strain recognition technology. Furthermore, by trying to emulate people, they fail to communicate the functional limitations of the machine. Are there better alternatives? The Speech Graffiti (aka USI) project is designing and evaluating new speech-based interaction paradigms.
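To make the biolinguistics analogy concrete, here is a minimal sketch that treats amino-acid sequences exactly like text for statistical modeling (the sequences and smoothing constants are invented for illustration; this is not Professor Rosenfeld's actual method):

```python
from collections import Counter

# Invented amino-acid fragments; real work uses large protein databases.
training = ["MKVLAAGIV", "MKVIAAGLV", "MKLLAAGIV"]

# Train a bigram model over residues, exactly as one would over letters.
bigrams = Counter(p for seq in training for p in zip(seq, seq[1:]))
unigrams = Counter(r for seq in training for r in seq[:-1])

def bigram_prob(a: str, b: str, alpha: float = 0.1, alphabet: int = 20) -> float:
    """Laplace-smoothed P(next residue = b | current residue = a)."""
    return (bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * alphabet)

def sequence_score(seq: str) -> float:
    """Product of bigram probabilities: higher = more 'grammatical'."""
    score = 1.0
    for a, b in zip(seq, seq[1:]):
        score *= bigram_prob(a, b)
    return score

# A sequence resembling the training data scores far higher than a random one.
print(sequence_score("MKVLAAGIV") > sequence_score("WWWCCHHPP"))  # True
```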

    www.cs.cmu.edu/~roni

Alex Rudnicky's research centers on interactive systems that use speech. He is interested in the following problems:

Speech systems that learn: his research attempts to develop a process that, given an abstract specification of capabilities, supports the automatic configuration of a speech system for an interactive task, and then supports incremental learning over the life of the application.