AI-Based Language Learning Tools
TRANSCRIPT
Oct.28.2017
Ewa Szymanska, PhD
Head of Rakuten Institute of Technology Singapore
2
Source: https://unsplash.com/ by Element5 Digital
3
“I am watching shows in Chinese to get used to ‘actual’ spoken Mandarin, and not just what I see in my textbooks.”
- VIKI user
4
* Images from Rakuten VIKI, Rakuten TV
5
1.8 billion people are learning foreign languages
Source: The Washington Post: https://www.washingtonpost.com/news/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts
Languages with most native speakers
Most commonly studied foreign languages
6
Online individual language learning market is growing at 12% CAGR
Source: Rosetta Stone Investor Day 2017
7
I. Entertaining Content II. Global Users III. Technology
*Photo by Jakob Owens on Unsplash
8
1 Interactive subtitles
2 Quizzes
3 Video dictionary
* Images from Rakuten VIKI
9
1 Interactive subtitles
Fast adoption: 30,000 DAU (daily active users)
High engagement: Korean Learn Mode users view 10% more than the Viki average
High satisfaction: 83 NPS (net promoter score)
*cnet.com @ CBS Interactive Inc. Apr 13, 2017; Keia.org, Korean Economic Institute, Apr 2017; Forbes Oct 24, 2017; The Verge, Sep 28, 2017
Shows availability
Learn Chinese (Japan): “Daughter Back”, “Return of Happiness”, “Ice and Fire of Youth”
Learn Korean (USA): “My Love from the Star”, “Boys Over Flowers”, “Descendants of the Sun”
[ Learn Mode collection on viki.com ]
11
2 Drama Vocab Quiz [ languagequiz.viki.com ]
• 60,000+ quizzes taken
• 35,000+ users completed the quiz
• Very positive social media engagement
12
3 Video-based Dictionary
Integrate with the classroom curriculum:
13
“If you talk to a man in a language he understands, that goes to his head.
If you talk to him in his language, that goes to his heart.”
- Nelson Mandela
14
Oct 28, 2017
Stanley Kok
Principal Research Scientist
Rakuten Institute of Technology (Singapore)
16
Splitting a sentence into pieces, each preserving its original semantics
你是辣妹,也是名门贵族
你 是 辣妹 , 也是 名门贵族: you are (a) hot chick and also (of) the gentry
你 是 辣妹 , 也是 名门贵 族: you are (a) hot chick and also tribe
17
努力的人才会成功
努力 的 人 才 会 成功: only hardworking people will succeed
努力 的 人才 会 成功: hardworking talent will succeed
18
Tokenization
19
Dictionary
Lookup
20
Many open-source tokenizers available
Good, but not perfect
Different mistakes
Why not use more (or all) of them to improve tokenization?
Strengths of one tokenizer overcome the shortcomings of another
21
How to quantify “goodness” of tokenization?
Take human learner’s perspective
#Dictionary look-ups needed to understand all tokens
Non-existent tokens assumed to need large #lookups (10)
你 是 辣妹 (you / are / hot chick): 1 + 1 + 1 = 3
你 是 辣 妹 (you / are / spicy / younger sister): 1 + 1 + 1 + 1 = 4
你 是辣 妹 (you / ? / younger sister): 1 + 10 + 1 = 12
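The look-up cost on this slide can be sketched in a few lines of Python. This is a minimal sketch, assuming the dictionary is a plain set of known tokens; the names `lookup_cost` and `MISSING_COST` are illustrative and do not come from the talk.

```python
MISSING_COST = 10  # look-ups charged for a token that is not in the dictionary


def lookup_cost(tokens, dictionary):
    """Number of dictionary look-ups a learner needs to understand all tokens."""
    return sum(1 if token in dictionary else MISSING_COST for token in tokens)


# Toy dictionary (assumption): every token except 是辣 exists.
dictionary = {"你", "是", "辣妹", "辣", "妹"}

candidates = [
    ["你", "是", "辣妹"],
    ["你", "是", "辣", "妹"],
    ["你", "是辣", "妹"],
]
costs = [lookup_cost(tokens, dictionary) for tokens in candidates]
best = min(candidates, key=lambda tokens: lookup_cost(tokens, dictionary))
```

With this toy dictionary the three candidates cost 3, 4, and 12 look-ups, matching the slide, and the first tokenization is selected.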
22
Can do better than picking the lowest-cost tokenization from the tokenizers
Treat common tokens as “anchor points”
Pick best tokens from remaining ones
23
你 是 辣妹 也是 名门贵 族: you are hot chick / and also tribe (15)
你 是辣 妹 也是 名门贵族: you younger sister / and also (of) the gentry (14)
你 是 辣妹 也是 名门贵族 (5)
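One way to read the anchor-point idea is sketched below. It is an assumption about the method, not the speaker's implementation: token boundaries shared by every tokenizer act as anchors, and between two neighbouring anchors the cheapest piece from any tokenizer is kept. All function names are hypothetical.

```python
MISSING_COST = 10  # look-ups charged for a token that is not in the dictionary


def lookup_cost(tokens, dictionary):
    return sum(1 if t in dictionary else MISSING_COST for t in tokens)


def boundaries(tokens):
    """Character offsets at which this tokenization cuts the sentence."""
    position, cuts = 0, {0}
    for token in tokens:
        position += len(token)
        cuts.add(position)
    return cuts


def piece(tokens, start, end):
    """Tokens of this tokenization that lie between two character offsets."""
    position, inside = 0, []
    for token in tokens:
        if start <= position and position + len(token) <= end:
            inside.append(token)
        position += len(token)
    return inside


def combine(tokenizations, dictionary):
    """Keep shared boundaries as anchors; pick the cheapest piece in between."""
    anchors = sorted(set.intersection(*(boundaries(t) for t in tokenizations)))
    result = []
    for start, end in zip(anchors, anchors[1:]):
        pieces = [piece(t, start, end) for t in tokenizations]
        result.extend(min(pieces, key=lambda p: lookup_cost(p, dictionary)))
    return result


# Toy dictionary (assumption): 名门贵 and 是辣 are not real words.
dictionary = {"你", "是", "辣妹", "妹", "也是", "名门贵族", "族"}
first = ["你", "是", "辣妹", "也是", "名门贵", "族"]
second = ["你", "是辣", "妹", "也是", "名门贵族"]
combined = combine([first, second], dictionary)
```

On the slide's example the two inputs cost 15 and 14 look-ups, while the combined result 你 是 辣妹 也是 名门贵族 costs 5.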
24
Dictionaries are important for language learning
Manual approach provides a high-quality dictionary, but is not scalable
About 7000 languages in the world
About 49 million possible bilingual dictionaries (7000 × 6999 ordered language pairs)
Thus need an automatic approach
25
Lots of online dictionaries available
Could we automatically learn new dictionaries from them?
Focus on Chinese-English (C-E) & Korean-English (K-E) bilingual dictionaries
26
Lots of dictionaries online
Some are C-E and K-E, but many are not
Many dictionaries are C-X and X-E
Use language X as bridge/pivot
C-X + X-E => C-E, e.g.,
辣妹 -> fille sexy + fille sexy -> hot chick
=> 辣妹 -> hot chick
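The two-hop pivot step can be sketched as a composition of two dictionaries. This is a minimal sketch, assuming each dictionary maps a headword to a set of translations; the function name `compose` and the toy entries are illustrative only.

```python
from collections import defaultdict


def compose(source_to_pivot, pivot_to_target):
    """Chain a C-X and an X-E dictionary through the pivot language X."""
    result = defaultdict(set)
    for source, pivots in source_to_pivot.items():
        for pivot in pivots:
            result[source].update(pivot_to_target.get(pivot, ()))
    # Drop source words whose pivot translations have no target entry.
    return {source: targets for source, targets in result.items() if targets}


# Toy Chinese-French and French-English dictionaries (assumption).
chinese_french = {"辣妹": {"fille sexy"}, "海豚": {"dauphin"}}
french_english = {"fille sexy": {"hot chick"}}

chinese_english = compose(chinese_french, french_english)
```

Entries whose pivot word is missing from the second dictionary, such as 海豚 here, produce no C-E entry. A pivot word with several senses can introduce wrong pairs, which is consistent with the induced dictionaries being less than 100% correct.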
27
Take 2 hops for now
Chinese-English dictionary has 750K entries
90% correct
Korean-English dictionary has 100K entries
99% correct
28
Learn bilingual dictionary using
Seed lexicon
Monolingual data (plentiful)
Maps bilingual phrases to vector space
Example pairs: dolphin / 海豚, Tokyo / 东京, Sushi / 寿司
29
30
31
Artifact of standard machine translation pipeline
Parallel sentences aligned word for word
Compute probability of mapping tokens of a source language to those of a target language
A correct source token will be more consistently aligned to its corresponding target token(s)
Add high-probability mappings to dictionary
32
Chinese English P(C|E) P(E|C) AveProb
辣妹 hot chick 0.8 0.9 0.85
是辣 is curry 0.1 0.1 0.1
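The table can be reproduced in spirit from alignment counts. The sketch below assumes the word alignments are already available as (Chinese, English) pairs, estimates both conditional probabilities by relative frequency, and keeps pairs whose average reaches a threshold. The threshold of 0.5 and the function name are assumptions, not values from the talk.

```python
from collections import Counter


def mine_dictionary(aligned_pairs, threshold=0.5):
    """Keep (source, target) pairs whose averaged alignment probability is high."""
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(source for source, _ in aligned_pairs)
    target_counts = Counter(target for _, target in aligned_pairs)
    entries = {}
    for (source, target), count in pair_counts.items():
        p_target_given_source = count / source_counts[source]  # P(E|C)
        p_source_given_target = count / target_counts[target]  # P(C|E)
        average = (p_target_given_source + p_source_given_target) / 2
        if average >= threshold:
            entries[(source, target)] = average
    return entries


# Toy alignments (assumption): 辣妹 is aligned to "girl" once by mistake.
aligned = (
    [("辣妹", "hot chick")] * 4
    + [("辣妹", "girl")]
    + [("女孩", "girl")] * 4
)
entries = mine_dictionary(aligned)
```

In this toy data the consistent pairs average 0.9 and are kept, while the one-off alignment 辣妹 / girl averages 0.2 and is dropped.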
33
Chinese-English Dictionary
3 million Chinese tokens (Jan’17)
89% in dictionary
Korean-English Dictionary
4 million Korean tokens (Jan’17)
86% in dictionary
34
[Chart: #KoreanTokens vs. #Definitions (x-axis: 1 to 39, y-axis: 0 to 500,000)]
[Chart: #ChineseTokens vs. #Definitions (x-axis: 1 to 28, y-axis: 0 to 450,000)]
35
Match parallel sentences to
Phrase table
Dictionary
36
他放弃梦想
He gave up his dreams
Chinese English AveProb
放弃 gave up his 0.74
放弃 quit 0.83
放弃 abdicate 0.68
Phrase Table
37
他放弃梦想
He gave up his dreams
Chinese English AveProb
放弃 gave up his 0.74
放弃 quit 0.83
放弃 abdicate 0.68
Phrase Table
Best Match
他放弃梦想
He gave up his dreams
best match
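The slides do not spell out how the best match is chosen. One plausible reading, sketched below as an assumption, is to keep only phrase-table candidates whose English side actually occurs in the parallel English sentence, and to prefer the highest AveProb among them. The helper names are hypothetical.

```python
def contains(sentence_tokens, phrase_tokens):
    """True if the phrase occurs as a contiguous token sequence in the sentence."""
    n = len(phrase_tokens)
    return any(
        sentence_tokens[i:i + n] == phrase_tokens
        for i in range(len(sentence_tokens) - n + 1)
    )


def best_match(english_sentence, candidates):
    """Pick the most probable candidate that appears in the parallel sentence."""
    sentence_tokens = english_sentence.lower().split()
    matches = [
        (probability, phrase)
        for phrase, probability in candidates
        if contains(sentence_tokens, phrase.lower().split())
    ]
    return max(matches)[1] if matches else None


# Phrase-table rows for 放弃, copied from the slide.
phrase_table = {"放弃": [("gave up his", 0.74), ("quit", 0.83), ("abdicate", 0.68)]}
match = best_match("He gave up his dreams", phrase_table["放弃"])
```

Under this reading, "quit" is skipped despite its higher AveProb because it does not occur in "He gave up his dreams".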
38
Chinese English AveProb
放弃 gave up his 0.74
放弃 quit 0.83
放弃 abdicate 0.68
Phrase Table
best match
Chinese English
放弃 abandon
放弃 give up
放弃 abdicate
Dictionary
Drama Vocabulary Quiz
Liling Tan
Rakuten Institute of Technology (Singapore)
28 Oct 2017 @ Rakuten Tech. Conference
40
Overview
• Introduction
• Demo
• How Did We Create the Quiz?
41
Introduction
• Quizzes are fun and could be viral
• But manually creating quizzes is tedious
• We created #DramaVocabQuiz that generates new vocabulary quizzes automatically
42
43
44
45
46
47
48
How Do We Generate Quizzes Automatically?
49
Korean Drama Word List
• The word 미남 [minam] “handsome guy” can be followed by multiple suffixes at once -이시라구요 [-issilaguyo] to form a single word meaning “someone said that he is handsome”.
• We only extract the root word 미남 [minam], and count it as a unique word type
50
Korean Drama Word List
51
Korean Drama Word List
52
Korean Drama Word List
53
Splitting Word List into 3 Difficulty Levels
54
Generate the Distractors
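• Distractor 1: Select from the 5th to 20th closest words to the question word (cosine similarity)
• Distractor 2: Use Distractor 1 as the negative and the question word as the positive, then select from the 1st to 20th closest words (cosmul, the 3CosMul objective of Levy and Goldberg)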
References:
• Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In ICLR.
• Omer Levy and Yoav Goldberg. 2014. Linguistic Regularities in Sparse and Explicit Word Representations. In CoNLL.
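The two distractor rules can be sketched without any embedding library. This is a minimal sketch under stated assumptions: the toy 2-D vectors are made up, the random choice within each rank window is an assumption about how a single word is picked, and the function names are hypothetical. The cosmul score follows the multiplicative form of Levy and Goldberg, with similarities shifted to the range 0 to 1.

```python
import math
import random


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def nearest_by_cosine(word, vectors):
    """All other words, most similar first."""
    others = [w for w in vectors if w != word]
    return sorted(others, key=lambda w: cosine(vectors[w], vectors[word]), reverse=True)


def nearest_by_cosmul(positive, negative, vectors, eps=1e-6):
    """Words close to `positive` and far from `negative` (multiplicative score)."""
    def score(w):
        pos = (1 + cosine(vectors[w], vectors[positive])) / 2
        neg = (1 + cosine(vectors[w], vectors[negative])) / 2
        return pos / (neg + eps)
    others = [w for w in vectors if w not in (positive, negative)]
    return sorted(others, key=score, reverse=True)


def pick_distractors(question, vectors, rng, first_window=(5, 20), second_window=(1, 20)):
    low, high = first_window
    first = rng.choice(nearest_by_cosine(question, vectors)[low - 1:high])
    low, high = second_window
    second = rng.choice(nearest_by_cosmul(question, first, vectors)[low - 1:high])
    return first, second


# Toy embeddings (assumption); real vectors would come from a trained model.
vectors = {
    "미남": (1.0, 0.1), "미녀": (0.9, 0.2), "남자": (0.8, 0.4), "여자": (0.7, 0.5),
    "친구": (0.5, 0.5), "사랑": (0.2, 1.0), "학교": (0.1, 0.9),
}
first, second = pick_distractors(
    "미남", vectors, random.Random(0), first_window=(2, 4), second_window=(1, 3)
)
```

The rank windows are shrunk in the usage example only because the toy vocabulary has seven words. Skipping the very closest neighbours for Distractor 1 plausibly avoids near-synonyms of the answer.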
55
Language Learners Like Quizzes!
• 60,000+ quizzes taken
• 35,000+ unique users completed the quiz
• 16% of the users repeated the quiz
56
Word Frequency is a Good Indicator of Difficulty
[Chart by difficulty level (Easy, Medium, Hard); y-axis: 0 to 10]
Easy = Frequent words
Medium = Less frequent words
Hard = Least frequent words
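A frequency-based split into three levels can be sketched as follows. The talk does not say where the cut-offs lie, so equal-sized thirds of the ranked word types are an assumption here, as is the function name.

```python
from collections import Counter


def split_by_frequency(words, levels=("easy", "medium", "hard")):
    """Rank word types by frequency and cut the ranking into equal parts."""
    ranked = [word for word, _ in Counter(words).most_common()]
    size = -(-len(ranked) // len(levels))  # ceiling division
    return {
        level: ranked[i * size:(i + 1) * size]
        for i, level in enumerate(levels)
    }


# Toy drama word list (assumption): root words with made-up counts.
words = (
    ["사랑"] * 6 + ["친구"] * 5 + ["학교"] * 4
    + ["미남"] * 3 + ["약속"] * 2 + ["운명"] * 1
)
levels = split_by_frequency(words)
```

The most frequent word types land in the easy level and the rarest in the hard level.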
57
Conclusion
Watch Drama, Learn Language
Quiz: https://languagequiz.viki.com
Techblog: https://techblog.rakuten.co.jp/2017/05/26/lang-quiz/
Oct.28.2017
Pang Zineng
Senior Technologist
Rakuten Institute of Technology Singapore
59
60
Web Search: pages
In-Video Search: clips
61
Web Search
• The metadata of the site
• The metadata of the page
• The word tokens in the page
• The topic of the page
• The originality of the page
• Hyperlinks (PageRank)
Roles: site identifier, page identifier, content ranking, search relevancy
In-Video Search
• The metadata of the video
• The metadata of this clip (timestamp, length, URI, etc.)
• The caption text of the clip
• The frames & audio signal
• Complexity of the sentence
• Diversity of the clips
Roles: video identifier, clip identifier, search relevancy, content ranking
62
Job:
• Make some data ready for consumption.
Questions:
• How does the data come?
• What needs to be done for it to be ready?
• How will the data be consumed?
[Diagram components: Data provider, FTP, API, Raw Data, Trigger / monitor function, Pre-processing function, database, Data access function, Data consumer]
63
Job:
• Let outsiders use a function.
Questions:
• How frequently will the function be used?
• What data does the function need?
[Diagram components: Web Application, API Endpoint, API Cache, Request Queue, Application logic, Application Cache, Internal/External Data]
64
Motion Dictionary
[Diagram components: Rakuten VIKI video contents, Rakuten TV video contents, Other video contents, Search function, 3rd Party Platform]
65
[Diagram: Interactive Subtitles (version 2) and Interactive Subtitles (version 3). Functions: dictionary function, voice function, tokenization function. Data: Japanese, Korean, and Chinese Dictionary Data; Japanese, Korean, and Chinese Tokenization Data. Components are labeled as 3rd party solution, open source framework, or In-house solution.]
66
[Diagram: Interactive Subtitles (version 2) and Interactive Subtitles (version 3). Functions: dictionary function, voice function, tokenization function. Data: Japanese, Korean, and Chinese Dictionary Data; Japanese, Korean, and Chinese Tokenization Data; Global Dictionary Data; Global Tokenization Data. Components are labeled as 3rd party solution, open source framework, or In-house solution.]
67
[Diagram: Vocab Quiz (version 1). Function: Take Quiz function. Data: Chinese Quiz Data, Korean Quiz Data.]
68
[Diagram: Vocab Quiz (version 2). Functions: Take Quiz function, voice function. Data: Chinese Quiz Data, Korean Quiz Data.]
69
“Fast iteration in R&D wouldn’t be possible if we had many things bundled or coupled.”
-- Pang
Vocab Quiz
• https://languagequiz.viki.com/
Learn Mode (PC/Mac only)
• https://www.viki.com/collections/316981l-learn-the-basics-chinese
• https://www.viki.com/collections/316939l-learn-the-basics-korean
Motion Dictionary
• TBD