NLP Research at Internet Age
An Overview of NLP at Microsoft Research Asia
Ming Zhou
Manager of Natural Language Group
Microsoft Research Asia
Trends of Internet Services
• Ecosystem for working with third-party apps
  – Apple Apps, Facebook, Twitter, Baidu, Sina, QQ
• Real-time content collection and search
  – Twitter, Facebook, Del.icio.us, NYT, YouTube
• Mobile search
  – Contextual intent understanding
  – Towards decision making and action taking
• Social power
  – Social tags (like) for general search engines
  – Search engines in SNS
  – Social QA
Impact and Challenge to NLP Research
• Impact
  – Biggest database ever: connects data
  – Biggest social network: connects people
  – Harnessing collective intelligence
  – Contextual information processing: user, user's social network, location, time
  – Real-time information processing: collection, indexing, operation without delay
• Challenge
  – How to leverage data, people, and contextual information to reach real-time information processing?
Problems of Traditional NLP Approaches (NLP 1.0)
• Deep in individual component technologies, but these are reaching their upper bounds
• Little consideration of scenarios, users' needs, and market needs
• Serious data sparseness with human annotation
• Evaluation bottleneck
• Slow deployment
• Lack of an effective framework for involving users' feedback
New Strategy of NLP (NLP 2.0)
• Data collection from the web
• Domain-specific and open IE
• Contextual NLP
• Maximize at the system level, not at the individual component level
• Earlier deployment on the Internet
• Make the best use of social factors
Our Vision and Task
• Advanced NLP technologies
  – Word breaker, POS tagging, chunking, syntactic parser, semantic role labeling, speller, query suggestion, summarization
  – Chinese, Japanese, English
• Multi-language information access
  – Statistical machine translation
  – Multi-language search
• Semantic computing
  – Sentiment analysis, event extraction, ontology learning
  – Understanding query intent and document
  – Contextual NLP
Understand user and document in any language, for any device and any application
MSRA NLP Research Overview
[Figure: diagram of applications, component technologies, and data]
• Applications
  – Chinese IME, Japanese IME, query speller, English writing wizard, pocket translator, news search, Twitter search, general web search, comparison shopping, meta data extraction, resume routing, couplet generation, chatbot
• Component techs
  – Text analysis: skeleton parser, named entity identification, POS tagging, SLM
  – Machine translation: translation evaluation, translation knowledge acquisition, web mining for MT, SMT
  – Information extraction: annotation tool, machine learning, term extraction
  – Information retrieval: paraphrasing, vertical search, cross-language IR, NLP-enriched indexing and search, query-doc relevance, text mining
• Data
  – Areas: NLP (C, J, E), MT (C, J, E), IR and IE (C, J, E)
  – Resources: MRD, parsing lexicon, tagged corpus, balanced corpus, translation lexicon, bilingual corpus, bilingual tagged corpus
Research Accomplishment
• Awards
  – MSRA Best Research Team (2010)
  – Finalist of WSJ Asian Innovation Awards (2010)
  – MS ARD Best Project (Engkoo)
  – MSRA Best Innovation (1998-2008): IME and Chinese couplets
• Academic impact
  – Best result in NIST 2008 SMT, CWMT 2008 and CWMT 2009
  – Best result in SIGHAN 2006 bakeoff on Chinese word segmentation
  – Best result in cross-language information retrieval in TREC-9, NTCIR-III
  – 40 ACL papers, 9 SIGIR papers, 17 COLING papers (2000-2010)
  – PC chair, area chair of ACL
• Collaboration with universities
  – HIT joint lab on NLP, Speech and Search; Tsinghua joint lab on Media and Network
  – 400 interns in 12 years
  – Summer schools since 2001
  – PhD supervisors at universities
Summer School on Information Extraction (Harbin, June, 2005)
Cheng Niu: Information extraction
Frank Seide: Speech information extraction and search
Hwee Tou Ng: Advanced topics of information extraction
Chin-Yew Lin: Information extraction for automatic summarization
Projects based on NLP 2.0
• Engkoo: Web-based English learning service
  – Data mining from the web
• Chinese couplets
  – Include users' power in system evolution
• Semantic analysis and search of micro-blogging
  – Move to SNS, mobile
Engkoo
Parallel data mining from the web
Video: http://video.sina.com.cn/v/b/37417609-1286528122.html
Rapidly Changing Language
• Approximately 1.5 billion people speak English as a primary, secondary or business language
• China: The largest “English speaking” country with 250 million English learners and USD 60 billion annual expenses
• Problem: Live language: new words, new meanings
Key Insight: With billions of translated web pages and sharable repositories of language data growing every day, the Internet holds the sum of human language knowledge
www.engkoo.com
Major Features:
• Endless Lexicon with Native Definitions
• State-of-the-Art Machine Translation (NIST OpenMT Winner)
• Real-time Interactive Alignment
• Human-Like TTS & Phonetic Search
Microsoft Products:
• Bing
• Office
• MSN
Massive Dictionary Mined from the Web
Fresh and Diverse Examples
Advanced Search with Sentence Analysis
Sentences Classification
Learn Contextual Usage with Word Alignment
Hints for Easily Confused Words
Knowledge Mining Pipeline
[Figure: pipeline diagram]
• Stages: web mining, linguistic parsing, knowledge mining, multi-level indexing
• Data: mined data, parsed data, linguistic knowledge, indexed data
• Models: machine translation model, paraphrasing model
Tokenizing:
  he could hardly afford to waste that golden time.
  他 无法 浪费 那样 的 好 时光。
Skeleton parsing:
  (Tsub~he~afford) (ModAdv~hardly~afford) (Tobj~waste~afford) (Tobj~time~waste) (AdjAttrib~golden~time)
  (Tsub~他~浪费) (ModAdv~无法~浪费) (Tobj~浪费~时光) (AdjAttrib~好~时光)
Alignment:
  he(他) could hardly afford to(无法) waste(浪费) that(那样的) golden(好) time(时光)
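The parsing and alignment notations above can be read mechanically. The following is a minimal sketch, assuming only the textual notation shown on this slide: triples written as (relation~word~word) and aligned spans written as english(chinese). The names Triple, parse_triples, and parse_alignment are hypothetical helpers for illustration, not part of Engkoo.

```python
import re
from typing import List, NamedTuple, Tuple


class Triple(NamedTuple):
    """One skeleton-parse triple, kept in the order it is written on the slide."""
    relation: str
    first: str
    second: str


# Matches "(Rel~w1~w2)" and tolerates stray spaces around the words.
_TRIPLE = re.compile(r"\(\s*([^~()]+?)\s*~\s*([^~()]+?)\s*~\s*([^~()]+?)\s*\)")
# Matches "english words(chinese)" spans in the alignment line.
_ALIGNED = re.compile(r"(.+?)\(\s*([^()]+?)\s*\)\s*")


def parse_triples(text: str) -> List[Triple]:
    """Extract every (relation~word~word) triple from a line of parser output."""
    return [Triple(*m.groups()) for m in _TRIPLE.finditer(text)]


def parse_alignment(text: str) -> List[Tuple[str, str]]:
    """Extract (english span, chinese span) pairs from an alignment line."""
    return [(src.strip(), tgt.strip()) for src, tgt in _ALIGNED.findall(text)]


english = parse_triples("(Tsub~he~afford) (ModAdv~hardly~afford) (Tobj~waste~afford)")
chinese = parse_triples("(Tsub~ 他 ~ 浪费 ) (ModAdv~ 无法 ~ 浪费 )")
pairs = parse_alignment("he( 他 ) could hardly afford to( 无法 ) waste( 浪费 )")
```

Keeping the triples as plain tuples makes it easy to use them later as dictionary keys or index terms.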
1. Word's idiomatic usage
  • Verb~Noun (decline~offer)
  • Verb~Adv (greatly~improve)
  • Adj~Noun (arduous~task)
  • Adv~Adj (extremely~bad)
2. Paraphrasing
  • turn_on~light, switch_on~light
  • laborious~task, hard~task
  • deeply~moved, deeply~touched
3. Collocation translations
  • 订~计划, make~plan
  • 订~旅馆, book~room
  • 订~杂志, subscribe to~magazine
Parallel sentence:
  He could hardly afford to waste that golden time.
  他无法浪费那样的好时光。
1. Single word
  “he”, “could”, “hardly”, “afford”, etc.
  “他”, “无法”, “浪费”, etc.
2. Single word with its POS
  “he_Pron”, “could_Verb”, “hardly_Adv”, etc.
  “他_Pron”, “无法_Adv”, “浪费_Verb”, etc.
3. Collocation
  “Tsub~he~afford”, “Tobj~time~waste”, etc.
  “Tsub~他~浪费”, “ModAdv~无法~浪费”, etc.
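The three index levels listed above can be sketched as one inverted index whose keys are words, word_POS strings, and collocation strings. This is a minimal illustration, assuming an in-memory dictionary and sentence ids as postings; index_keys and build_index are hypothetical names, and the real Engkoo index structure is not described in the slides.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Token = Tuple[str, str]          # (word, POS tag)
Colloc = Tuple[str, str, str]    # (relation, word, word)


def index_keys(tokens: Iterable[Token], collocations: Iterable[Colloc]) -> Set[str]:
    """Produce the three levels of keys for one sentence."""
    keys: Set[str] = set()
    for word, pos in tokens:
        keys.add(word)                    # level 1: single word
        keys.add(f"{word}_{pos}")         # level 2: word with its POS
    for rel, first, second in collocations:
        keys.add(f"{rel}~{first}~{second}")  # level 3: collocation
    return keys


def build_index(sentences: List[Tuple[List[Token], List[Colloc]]]) -> Dict[str, Set[int]]:
    """Map every key to the set of sentence ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for sentence_id, (tokens, collocations) in enumerate(sentences):
        for key in index_keys(tokens, collocations):
            index[key].add(sentence_id)
    return index


index = build_index([
    ([("he", "Pron"), ("could", "Verb"), ("hardly", "Adv")],
     [("Tsub", "he", "afford"), ("Tobj", "time", "waste")]),
    ([("他", "Pron"), ("无法", "Adv"), ("浪费", "Verb")],
     [("Tsub", "他", "浪费"), ("ModAdv", "无法", "浪费")]),
])
```

A query such as hardly_Adv or Tobj~time~waste then resolves to a single dictionary lookup, which is what lets one search box serve plain-word, POS-constrained, and collocation queries.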
Chinese Couplets
Include users' power in system evolution
Chinese Couplets (http://duilian.msra.cn)
http://video.sina.com.cn/v/b/10937201-1452530713.html
FS and SS Share the Same Style
Repetition of pronunciations (音韵联)
风 (wind) --- 水 (water)
吹 (blow) --- 使 (make)
荞 (buckwheat) --- 舟 (ship)
动 (wave) --- 流 (go)
桥 (bridge) --- 洲 (island)
未 (not) --- 不 (not)
动 (wave) --- 流 (go)
FS and SS Share the Same Style
Decomposition of characters (拆字联)
有 (have) --- 缺 (lack)
子 (son) --- 鱼 (fish)
有 (have) --- 缺 (lack)
女 (daughter) --- 羊 (mutton)
方 (so) --- 敢 (dare)
称 (call) --- 叫 (call)
好 (good) --- 鲜 (fresh)
女 + 子 = 好 ; 鱼 + 羊 = 鲜
FS and SS Share the Same Style
Person name (人名联), palindrome (回文联)
板桥 (Banqiao) --- 东坡 (Dongpo)
造 (produce) --- 居 (live)
桥 (bridge) --- 坡 (mountain)
板 (board) --- 东 (east)
• Banqiao (板桥) and Dongpo (东坡) are famous men of letters
• Reading from top to bottom is identical to reading from bottom to top
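SS Generation Process
[Figure: generating an SS for the FS 海阔凭鱼跃 (sea wide allow fish jump)]
• Candidate words: 山 (hill), 天 (sky), 高 (high), 深 (deep), 任 (permit), 倚 (depend), 虫 (insect), 鸟 (bird), 虎 (tiger), 飞 (fly), 舞 (dance), 鸣 (tweedle)
• Candidate phrases: 天高 (sky high), 山高 (hill high), 鸟飞 (bird fly), 虎啸 (tiger roar)
• Stages: SMT decoding, linguistic filtering, reranking
• Candidate lists shown in the figure:
  – 山高任鸟飞, 天高任鸟鸣, 天高任鸟飞, 山高靠虎啸, 山高任虎啸, 山深任鸟飞, 天高任花香, ……
  – 天高任鸟飞, 山高任鸟飞, 天高任鸟鸣, 天高任鸟舞, 山深任鸟飞, 山高任花香, 天高任花香, ……
  – 山高任鸟飞, 天高任鸟鸣, 天高任鸟飞, 山深任鸟飞, 天高任花香, 天高任鸟舞, 山高任花香, ……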
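The decoding step in this process can be illustrated with a toy monotone phrase-based decoder that scores each candidate by a weighted sum of feature functions, which is the ranking rule of a log-linear model. Everything below is an assumption made for illustration: the phrase table, the segmentation of the FS, the two feature functions, and the weights are hypothetical stand-ins for trained translation and language models, which the slides do not specify.

```python
import heapq
from itertools import product
from typing import Callable, Dict, List, Sequence

# Hypothetical phrase table: FS phrase -> candidate SS phrases.
PHRASE_TABLE: Dict[str, List[str]] = {
    "海阔": ["山高", "天高", "山深"],
    "凭": ["任", "倚"],
    "鱼跃": ["鸟飞", "虎啸", "鸟鸣"],
}
# Hypothetical stand-in for a language model: phrases treated as fluent.
KNOWN_PHRASES = {"天高", "山高", "鸟飞", "虎啸"}

Feature = Callable[[str, str], float]


def phrase_count(fs: str, ss: str) -> float:
    """Toy fluency feature: number of known two-character phrases in the SS."""
    return float(sum(ss[i:i + 2] in KNOWN_PHRASES for i in range(len(ss) - 1)))


def length_match(fs: str, ss: str) -> float:
    """Toy feature: 1 when FS and SS have the same number of characters."""
    return 1.0 if len(fs) == len(ss) else 0.0


def decode(fs_phrases: Sequence[str], phrase_table: Dict[str, List[str]],
           features: Sequence[Feature], weights: Sequence[float],
           n_best: int = 10) -> List[str]:
    """Enumerate monotone phrase-by-phrase translations and keep the N best."""
    fs = "".join(fs_phrases)
    options = [phrase_table[phrase] for phrase in fs_phrases]
    candidates = ("".join(parts) for parts in product(*options))

    def score(ss: str) -> float:
        # Log-linear ranking: weighted sum of feature values.
        return sum(w * h(fs, ss) for w, h in zip(weights, features))

    return heapq.nlargest(n_best, candidates, key=score)


top = decode(["海阔", "凭", "鱼跃"], PHRASE_TABLE,
             [phrase_count, length_match], [1.0, 1.0], n_best=8)
```

Couplets translate position by position without reordering, which is why a monotone enumeration is a reasonable simplification here; a real decoder would use beam search rather than a full product over the options.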
SS Generation Approach
• A multi-phase SMT approach
  – Phase 1: a phrase-based log-linear model
  – Phase 2: some linguistic filters
  – Phase 3: a Ranking SVM
[Figure: FS input → phrase-based log-linear model → N-best candidates → linguistic filters → Ranking SVM model → SS output]
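The three phases can be chained as in the sketch below. It is a minimal illustration, not the actual system: the decoder and the ranking function are passed in as callables, and the two filters shown (equal length, and the same pattern of repeated characters) are assumptions suggested by the "FS and SS share the same style" examples. The slides do not list the real filters.

```python
from typing import Callable, List, Tuple


def repetition_pattern(sentence: str) -> Tuple[int, ...]:
    """For each character, the position where it first occurred.

    千江有水千江月 gives (0, 1, 2, 3, 0, 1, 6): positions 4 and 5 repeat 0 and 1.
    """
    first_seen = {}
    return tuple(first_seen.setdefault(ch, i) for i, ch in enumerate(sentence))


def passes_filters(fs: str, ss: str) -> bool:
    """Phase 2 (assumed filters): same length and same repetition pattern."""
    return len(fs) == len(ss) and repetition_pattern(fs) == repetition_pattern(ss)


def generate_second_sentences(
    fs: str,
    decode: Callable[[str, int], List[str]],   # phase 1: hypothetical N-best decoder
    rank_score: Callable[[str, str], float],   # phase 3: hypothetical ranking function
    n_best: int = 50,
) -> List[str]:
    candidates = decode(fs, n_best)                                  # phase 1
    kept = [ss for ss in candidates if passes_filters(fs, ss)]       # phase 2
    return sorted(kept, key=lambda ss: rank_score(fs, ss), reverse=True)  # phase 3


# Stub decoder and scorer, for demonstration only.
def stub_decode(fs: str, n: int) -> List[str]:
    return ["万里无云万里星", "万里无云千山星", "万里无云"][:n]


def stub_score(fs: str, ss: str) -> float:
    return 0.0


result = generate_second_sentences("千江有水千江月", stub_decode, stub_score)
```

A trained Ranking SVM with a linear kernel reduces to a weight vector applied to candidate features, so in practice rank_score would be a dot product rather than the constant used by the stub.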
Great Examples
• FS: 月落乌啼霜满天
• SS: 风吹雁过雨连宵
• FS: 千江有水千江月
• SS: 万里无云万里星
• FS: 秦淮河桨声灯影
• SS: 松花江水色月光
• FS: 此木为柴山山出 (此 + 木 = 柴 ; 山 + 山 = 出)
• SS: 白水作泉日日昌 (白 + 水 = 泉 ; 日 + 日 = 昌)
User Log for Model Enhancement
• Motivation
  – Training data is not adequate
  – The user log is big (60k/m), increasing, and diverse
• What logs we record
  – User inputs
  – User-finalized couplets
    • Second sentences selected from the candidates provided by our system
    • User-modified second sentences
User’s Log Analysis
Data source: log from http://couplet.msra.cn
Time period: Aug. 31-Oct. 9, 2006

Number of input sentences                     12,322
Number of unique input sentences               6,698
Users directly select from system output       3,459
Users manually modify system output              606
Save as favorite couplets                        109
Invalid user input                               618
No second sentence generated                   2,211
Banner generation                              2,687
Select the generated banner as favorite          428
No banner output                                 265
New Framework with Log Data
[Figure: first sentence input, source-channel model, N-best candidates, re-ranking, second sentence output]
• Data: training data, log data
• Features shown: translation model, language model, mutual information, user operation
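One plausible way for user operations to feed the re-ranker is to turn each logged choice into pairwise preferences, the training format a Ranking SVM consumes. The sketch below is an assumption about how such logs could be used, not a description of the deployed system. LogRecord and preference_pairs are hypothetical names, and the record fields mirror only what the slides say is logged (user input, candidates offered, finalized second sentence).

```python
from typing import List, NamedTuple, Tuple


class LogRecord(NamedTuple):
    first_sentence: str      # user input
    candidates: List[str]    # second sentences offered by the system
    final_choice: str        # sentence the user selected or edited into


def preference_pairs(records: List[LogRecord]) -> List[Tuple[str, str, str]]:
    """Return (first sentence, preferred SS, less preferred SS) triples.

    The finalized sentence is treated as preferred over every other candidate
    that was shown. A user-modified sentence may not be among the candidates,
    in which case it is preferred over all of them.
    """
    pairs = []
    for record in records:
        for candidate in record.candidates:
            if candidate != record.final_choice:
                pairs.append((record.first_sentence, record.final_choice, candidate))
    return pairs


log = [
    LogRecord("海阔凭鱼跃", ["山高任鸟飞", "天高任鸟飞"], "天高任鸟飞"),
    LogRecord("海阔凭鱼跃", ["山高任鸟飞"], "天高任鸟舞"),
]
pairs = preference_pairs(log)
```

Each triple would then be converted to a feature-difference vector (for example, differences in translation model, language model, and mutual information scores) before training.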
Twitter Search
Move to social internet and mobile
[Figure: tweet analysis pipeline from raw data to semantic search]
• Input: tweets, raw data, noise filtering
• Individual tweet: text normalization, classification, sentence boundary detection, NE recognition, dependency parsing, co-reference, semantic role labeling, sentiment analysis
• A collection of tweets: tweets cluster, statistical relationship learning, news & images link extraction, community extraction, user influence measure, hot tag and topic extraction, popular tweet extraction, top video, music, and artists extraction
• Multi-level indexing
• Semantic search
Conclusion
• Internet trends and impacts on NLP
• NLP 2.0 strategy
• Web data mining: Engkoo
• User's power: Couplets
• SNS and mobile: Twitter search