ai based language learning tools

Oct.28.2017 Ewa Szymanska, PhD Head of Rakuten Institute of Technology Singapore

Upload: rakuten-inc

Post on 21-Jan-2018


TRANSCRIPT

Page 1: AI based language learning tools

Oct.28.2017

Ewa Szymanska, PhD

Head of Rakuten Institute of Technology Singapore

Page 2: AI based language learning tools

Source: https://unsplash.com/ by Element5 Digital

Page 3: AI based language learning tools

“I am watching shows in Chinese to get used to ‘actual’ spoken Mandarin, and not just what I see in my textbooks” - VIKI user

Page 4: AI based language learning tools

* Images from Rakuten VIKI, Rakuten TV

Page 5: AI based language learning tools

5

1.8 billion people are learning foreign languages

Source: The Washington Post: https://www.washingtonpost.com/news/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts

Charts: Languages with most native speakers; Most commonly studied foreign languages

Page 6: AI based language learning tools

6

Online individual language learning market is growing at 12% CAGR

Source: Rosetta Stone Investor Day 2017

Page 7: AI based language learning tools

7

I. Entertaining Content II. Global Users III. Technology

*Photo by Jakob Owens on Unsplash

Page 8: AI based language learning tools

8

1 Interactive subtitles

2 Quizzes

3 Video dictionary

* Images from Rakuten VIKI

Page 9: AI based language learning tools

9

1 Interactive subtitles

Fast adoption: 30,000 DAU (daily active users)

High engagement: Korean Learn Mode users view 10% more than Viki average

High satisfaction: 83 NPS (net promoter score)

*cnet.com @ CBS Interactive Inc. Apr 13, 2017; Keia.org, Korean Economic Institute, Apr 2017; Forbes Oct 24, 2017; The Verge, Sep 28, 2017

Page 10: AI based language learning tools

Shows availability

Learn Chinese (Japan): “Daughter Back”, “Return of Happiness”, “Ice and Fire of Youth”

Learn Korean (USA): “My Love from the Star”, “Boys Over Flowers”, “Descendants of the Sun”

* Images from Rakuten VIKI

[ Learn Mode collection on viki.com ]

Page 11: AI based language learning tools

11

2 Drama Vocab Quiz [ languagequiz.viki.com ]

• 60,000+ quizzes taken

• 35,000+ users completed the quiz

• Very positive social media engagement

Page 12: AI based language learning tools

12

3 Video-based Dictionary

Integrate with the classroom curriculum:

Page 13: AI based language learning tools

13

“If you talk to a man in a language he understands, that goes to his head. If you talk to him in his language, that goes to his heart.”

- Nelson Mandela

Page 14: AI based language learning tools

14

Page 15: AI based language learning tools

Oct 28, 2017

Stanley Kok

Principal Research Scientist

Rakuten Institute of Technology (Singapore)

Page 16: AI based language learning tools

Splitting a sentence into pieces, each preserving its original semantics

你是辣妹,也是名门贵族

你 是 辣妹 , 也是 名门贵族: you are (a) hot chick and also (of) the gentry

你 是 辣妹 , 也是 名门贵 族: you are (a) hot chick and also tribe

Page 17: AI based language learning tools

17

努力的人才会成功

努力 的 人 才 会 成功: only hardworking people will succeed

努力 的 人才 会 成功: hardworking talent will succeed

Page 18: AI based language learning tools

18

Page 19: AI based language learning tools

Tokenization

Dictionary Lookup

Page 20: AI based language learning tools

20

Many open-source tokenizers available

Good, but not perfect

Each makes different mistakes

Why not use more (or all) of them to improve tokenization?

Strengths of one tokenizer overcome shortcomings of another

Page 21: AI based language learning tools

21

How to quantify “goodness” of tokenization?

Take human learner’s perspective

Number of dictionary look-ups needed to understand all tokens

Non-existent tokens assumed to need a large number of look-ups (10)

你 是 辣妹: you / are / hot chick: 1 + 1 + 1 = 3

你 是 辣 妹: you / are / spicy / younger sister: 1 + 1 + 1 + 1 = 4

你 是辣 妹: you / ? / younger sister: 1 + 10 + 1 = 12
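The look-up cost on this slide can be sketched in a few lines of Python. This is only an illustration: the toy dictionary and the names `TOY_DICTIONARY` and `lookup_cost` are assumptions, and only the penalty of 10 for a non-existent token comes from the slide.

```python
# Toy dictionary (assumed for illustration): tokens a learner could look up.
TOY_DICTIONARY = {
    "你": "you",
    "是": "are",
    "辣妹": "hot chick",
    "辣": "spicy",
    "妹": "younger sister",
}

MISSING_TOKEN_COST = 10  # from the slide: penalty for a token not in the dictionary


def lookup_cost(tokens, dictionary=TOY_DICTIONARY, missing_cost=MISSING_TOKEN_COST):
    """One look-up per known token, a fixed penalty per unknown token."""
    return sum(1 if tok in dictionary else missing_cost for tok in tokens)


candidates = [
    ["你", "是", "辣妹"],
    ["你", "是", "辣", "妹"],
    ["你", "是辣", "妹"],
]
costs = [lookup_cost(c) for c in candidates]
best = min(candidates, key=lookup_cost)  # the tokenization with the fewest look-ups
```

With this toy dictionary the three candidates score 3, 4 and 12, matching the sums on the slide.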

Page 22: AI based language learning tools

22

Can do better than picking lowest cost tokenization from tokenizers

Treat common tokens as “anchor points”

Pick best tokens from remaining ones

Page 23: AI based language learning tools

23

你 是 辣妹 也是 名门贵 族: you are hot chick and also tribe (15)

你 是辣 妹 也是 名门贵族: you ? younger sister and also (of) the gentry (14)

你 是 辣妹 也是 名门贵族 (5)
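One way to read the anchor-point idea is sketched below. It is an interpretation, not the speakers' implementation: tokens on which every tokenizer agrees (same character span) are kept as anchors, and between anchors the cheapest segment from any tokenizer is chosen. The function names and the toy dictionary are assumptions; the penalty of 10 is from the earlier slide.

```python
MISSING_TOKEN_COST = 10  # penalty for tokens absent from the dictionary


def lookup_cost(tokens, dictionary):
    return sum(1 if tok in dictionary else MISSING_TOKEN_COST for tok in tokens)


def token_spans(tokens):
    """Return (start, end) character offsets for each token."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def combine_tokenizations(tokenizations, dictionary):
    """Keep tokens shared by all tokenizers; fill each gap with the cheapest option."""
    text = "".join(tokenizations[0])
    assert all("".join(t) == text for t in tokenizations), "same sentence expected"
    span_lists = [token_spans(t) for t in tokenizations]
    anchors = sorted(set(span_lists[0]).intersection(*map(set, span_lists[1:])))
    combined, cursor = [], 0
    # A zero-length sentinel at the end flushes the final gap.
    for start, end in anchors + [(len(text), len(text))]:
        if start > cursor:
            options = [
                [text[s:e] for s, e in spans if cursor <= s and e <= start]
                for spans in span_lists
            ]
            combined.extend(min(options, key=lambda toks: lookup_cost(toks, dictionary)))
        if end > start:
            combined.append(text[start:end])
        cursor = end
    return combined


TOY_DICTIONARY = {"你", "是", "辣妹", "妹", "也是", "族", "名门贵族"}  # assumed entries
tokenizer_a = ["你", "是", "辣妹", "也是", "名门贵", "族"]
tokenizer_b = ["你", "是辣", "妹", "也是", "名门贵族"]
combined = combine_tokenizations([tokenizer_a, tokenizer_b], TOY_DICTIONARY)
```

Here 你 and 也是 act as anchors; the two inputs cost 15 and 14 look-ups, while the combined result costs 5.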

Page 24: AI based language learning tools

24

Dictionaries are important for language learning

Manual approach provides high-quality dictionary, but is not scalable

About 7000 languages in the world

About 49 million possible bilingual dictionaries (7000 × 7000 language pairs)

Thus need automatic approach

Page 25: AI based language learning tools

25

Lots of online dictionaries available

Could we automatically learn new dictionaries from them?

Focus on Chinese-English (C-E) & Korean-English (K-E) bilingual dictionaries

Page 26: AI based language learning tools

26

Lots of dictionaries online

Some are C-E and K-E, but many are not

Many dictionaries are C-X and X-E

Use language X as bridge/pivot

C-X + X-E => C-E, e.g.,

辣妹 -> fille sexy + fille sexy -> hot chick => 辣妹 -> hot chick
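The bridge step amounts to composing two look-up tables. A minimal sketch, assuming each dictionary is a plain mapping from a word to a list of translations (the function name `pivot_dictionary` is hypothetical):

```python
from collections import defaultdict


def pivot_dictionary(source_to_pivot, pivot_to_target):
    """Compose C-X and X-E entries into C-E entries via the pivot language X."""
    result = defaultdict(set)
    for source_word, pivot_words in source_to_pivot.items():
        for pivot_word in pivot_words:
            result[source_word].update(pivot_to_target.get(pivot_word, ()))
    # Drop source words whose pivot translations were not found in X-E.
    return {src: sorted(targets) for src, targets in result.items() if targets}


c_x = {"辣妹": ["fille sexy"]}      # Chinese -> French
x_e = {"fille sexy": ["hot chick"]}  # French -> English
c_e = pivot_dictionary(c_x, x_e)
```

Because a pivot word can be ambiguous, composed entries can be wrong, which is consistent with the accuracy figures below 100% reported on the next slide.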

Page 27: AI based language learning tools

27

Take 2 hops for now

Chinese-English dictionary has 750K entries

90% correct

Korean-English dictionary has 100K entries

99% correct

Page 28: AI based language learning tools

28

Learn bilingual dictionary using

Seed lexicon

Monolingual data (plentiful)

Maps bilingual phrases to vector space

dolphin / 海豚

Tokyo / 东京

Sushi / 寿司
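The slide does not say how the shared space is learned, so the sketch below only shows the last step: once phrases from both languages sit in one vector space, a candidate translation is the nearest neighbour in the other language. The three-dimensional vectors are invented toy values, and `cosine` and `translate` are hypothetical helper names.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, assumed to be already mapped into a shared space.
english = {"dolphin": [0.9, 0.1, 0.0], "Tokyo": [0.0, 0.9, 0.1], "Sushi": [0.1, 0.0, 0.9]}
chinese = {"海豚": [0.8, 0.2, 0.1], "东京": [0.1, 0.8, 0.0], "寿司": [0.0, 0.1, 0.8]}


def translate(word, source_vectors, target_vectors):
    """Return the target-language phrase closest to `word` by cosine similarity."""
    query = source_vectors[word]
    return max(target_vectors, key=lambda t: cosine(query, target_vectors[t]))


learned = {zh: translate(zh, chinese, english) for zh in chinese}
```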

Page 29: AI based language learning tools

29

Page 30: AI based language learning tools

30

Page 31: AI based language learning tools

31

Artifact of standard machine translation pipeline

Parallel sentences aligned word for word

Compute probability of mapping tokens of a source language to those of a target language

A correct source token will be more consistently aligned to its corresponding target token(s)

Add high-probability mappings to dictionary

Page 32: AI based language learning tools

32

Chinese   English     P(C|E)   P(E|C)   AveProb
辣妹      hot chick   0.8      0.9      0.85
是辣      is curry    0.1      0.1      0.1
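The AveProb column is the mean of the two translation probabilities. The sketch below reproduces the two rows and filters them with a cut-off; the threshold value of 0.5 is an assumption for illustration, since the slides only say that high-probability mappings are added.

```python
def average_probability(p_c_given_e, p_e_given_c):
    """AveProb: mean of P(C|E) and P(E|C)."""
    return (p_c_given_e + p_e_given_c) / 2


THRESHOLD = 0.5  # hypothetical cut-off, not stated in the slides

# (Chinese, English, P(C|E), P(E|C)) rows from the slide.
candidates = [
    ("辣妹", "hot chick", 0.8, 0.9),
    ("是辣", "is curry", 0.1, 0.1),
]

dictionary = {}
for chinese, english, p_ce, p_ec in candidates:
    if average_probability(p_ce, p_ec) >= THRESHOLD:
        dictionary.setdefault(chinese, []).append(english)
```

Only 辣妹 -> hot chick survives the filter; the mis-tokenized 是辣 is dropped.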

Page 33: AI based language learning tools

33

Chinese-English Dictionary

3 million Chinese tokens (Jan’17)

89% in dictionary

Korean-English Dictionary

4 million Korean tokens (Jan’17)

86% in dictionary

Page 34: AI based language learning tools

34

[Chart: #KoreanTokens vs. #Definitions]

[Chart: #ChineseTokens vs. #Definitions]

Page 35: AI based language learning tools

35

Match parallel sentences to

Phrase table

Dictionary

Page 36: AI based language learning tools

36

他放弃梦想

He gave up his dreams

Phrase Table

Chinese   English       AveProb
放弃      gave up his   0.74
放弃      quit          0.83
放弃      abdicate      0.68

Page 37: AI based language learning tools

37

他放弃梦想

He gave up his dreams

Phrase Table

Chinese   English       AveProb
放弃      gave up his   0.74
放弃      quit          0.83
放弃      abdicate      0.68

Best Match

Page 38: AI based language learning tools

他放弃梦想

He gave up his dreams

best match

Phrase Table

Chinese   English       AveProb
放弃      gave up his   0.74
放弃      quit          0.83
放弃      abdicate      0.68

best match

Dictionary

Chinese   English
放弃      abandon
放弃      give up
放弃      abdicate
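The slides do not state the matching criterion, so the following is only one plausible sketch. It assumes a phrase-table entry matches when its English side occurs in the English sentence, and a dictionary definition matches when it occurs after a crude lemma normalization. The `LEMMAS` map and all function names are hypothetical.

```python
LEMMAS = {"gave": "give"}  # hypothetical stand-in for a real lemmatizer


def normalize(text):
    return [LEMMAS.get(tok, tok) for tok in text.lower().split()]


def contains_phrase(sentence_tokens, phrase_tokens):
    n = len(phrase_tokens)
    return any(
        sentence_tokens[i:i + n] == phrase_tokens
        for i in range(len(sentence_tokens) - n + 1)
    )


def best_phrase_table_match(sentence, entries):
    """Among entries found verbatim in the sentence, pick the highest AveProb."""
    tokens = sentence.lower().split()
    matches = [(prob, eng) for eng, prob in entries
               if contains_phrase(tokens, eng.lower().split())]
    return max(matches)[1] if matches else None


def best_dictionary_match(sentence, definitions):
    """Return the first definition found in the lemma-normalized sentence."""
    tokens = normalize(sentence)
    for definition in definitions:
        if contains_phrase(tokens, normalize(definition)):
            return definition
    return None


sentence = "He gave up his dreams"
phrase_table = [("gave up his", 0.74), ("quit", 0.83), ("abdicate", 0.68)]
definitions = ["abandon", "give up", "abdicate"]

phrase_match = best_phrase_table_match(sentence, phrase_table)
dictionary_match = best_dictionary_match(sentence, definitions)
```

Note that the highest AveProb entry overall ("quit") is not chosen, because it does not occur in this particular sentence.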

Page 39: AI based language learning tools

Drama Vocabulary Quiz

Liling Tan

Rakuten Institute of Technology (Singapore)

28 Oct 2017 @ Rakuten Tech. Conference

Page 40: AI based language learning tools

40

Overview

• Introduction

• Demo

• How did We Create the Quiz?

Page 41: AI based language learning tools

41

Introduction

• Quizzes are fun and could be viral

• But manually creating quizzes is tedious

• We created #DramaVocabQuiz that generates new vocabulary quizzes automatically

Page 42: AI based language learning tools

42

Page 43: AI based language learning tools

43

Page 44: AI based language learning tools

44

Page 45: AI based language learning tools

45

Page 46: AI based language learning tools

46

Page 47: AI based language learning tools

47

Page 48: AI based language learning tools

48

How do we Generate Quizzes Automatically?

Page 49: AI based language learning tools

49

Korean Drama Word List

• The word 미남 [minam] “handsome guy” can be followed by multiple suffixes at once, e.g. -이시라구요 [-issilaguyo], to form a single word meaning “someone said that he is handsome”.

• We only extract the root word 미남 [minam], and count it as a unique word type
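To make the counting concrete, here is a deliberately simplified sketch. A production system would presumably use a Korean morphological analyzer; the hard-coded `KNOWN_SUFFIXES` list and the `root_of` helper are assumptions that only cover the example on the slide.

```python
from collections import Counter

# Hypothetical suffix list covering only the slide's example.
KNOWN_SUFFIXES = ["이시라구요"]


def root_of(word):
    """Strip the longest known suffix, keeping at least one character of root."""
    for suffix in sorted(KNOWN_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


subtitle_words = ["미남이시라구요", "미남", "미남"]
counts = Counter(root_of(w) for w in subtitle_words)
```

All three surface forms collapse to the single word type 미남, so its frequency is 3.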

Page 50: AI based language learning tools

50

Korean Drama Word List

Page 51: AI based language learning tools

51

Korean Drama Word List

Page 52: AI based language learning tools

52

Korean Drama Word List

Page 53: AI based language learning tools

53

Splitting Word List into 3 Difficulty Levels

Page 54: AI based language learning tools

54

Generate the Distractors

• Distractor 1: Select from the 5th to 20th closest words to the question word (cosine similarity)

• Distractor 2: Use Distractor 1 as negative and the question word as positive, then select from the 1st to 20th closest words (cosmul)

References:

• Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In ICLR.

• Omer Levy and Yoav Goldberg. 2014. Linguistic Regularities in Sparse and Explicit Word Representations. In CoNLL.
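The two ranking steps can be sketched without any embedding library. The code below uses invented two-dimensional toy vectors and English toy words; the rank windows are parameters because the toy vocabulary is far smaller than 20 words. The cosmul score follows the 3CosMul form from Levy and Goldberg (2014), with similarities shifted into [0, 1] and a small epsilon in the denominator. All names are hypothetical.

```python
import math


def unit(degrees):
    """Toy 2-D unit vector at the given angle."""
    rad = math.radians(degrees)
    return [math.cos(rad), math.sin(rad)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def ranked_by_cosine(word, vectors):
    others = [w for w in vectors if w != word]
    return sorted(others, key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)


def ranked_by_cosmul(positive, negative, vectors, eps=0.001):
    def score(w):
        pos = (1 + cosine(vectors[w], vectors[positive])) / 2
        neg = (1 + cosine(vectors[w], vectors[negative])) / 2
        return pos / (neg + eps)
    others = [w for w in vectors if w not in (positive, negative)]
    return sorted(others, key=score, reverse=True)


def rank_window(ranked, first, last):
    """Words at 1-based ranks first..last inclusive."""
    return ranked[first - 1:last]


vectors = {
    "handsome": unit(0), "pretty": unit(10), "cute": unit(20), "kind": unit(-30),
    "tall": unit(40), "apple": unit(90), "run": unit(180),
}
question = "handsome"
pool_1 = rank_window(ranked_by_cosine(question, vectors), 2, 4)  # slide uses 5..20
distractor_1 = pool_1[0]
pool_2 = rank_window(ranked_by_cosmul(question, distractor_1, vectors), 1, 3)  # slide uses 1..20
distractor_2 = pool_2[0]
```

Skipping the very closest neighbours for Distractor 1 avoids near-synonyms of the answer, and using Distractor 1 as the negative pushes Distractor 2 away from it so the options are not all alike.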

Page 55: AI based language learning tools

55

Language Learners Like Quizzes!

• 60,000+ quizzes taken

• 35,000+ unique users completed the quiz

• 16% of the users repeated the quiz

Page 56: AI based language learning tools

56

Word Frequency is a Good Indicator of Difficulty

[Chart: bars for Easy, Medium and Hard; vertical axis from 0 to 10]

Easy = Frequent words

Medium = Less Frequent words

Hard = Least Frequent words
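A frequency-based split into three levels might look like the sketch below. The slides do not give the cut-off points, so dividing the ranked list into equal thirds is an assumption, as are the function name and the toy counts.

```python
import math
from collections import Counter


def split_by_frequency(counts):
    """Rank words from most to least frequent and cut the list into three tiers."""
    ranked = [w for w, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    third = math.ceil(len(ranked) / 3)
    return {
        "easy": ranked[:third],             # most frequent words
        "medium": ranked[third:2 * third],  # less frequent words
        "hard": ranked[2 * third:],         # least frequent words
    }


toy_counts = Counter({"a": 9, "b": 8, "c": 5, "d": 4, "e": 1, "f": 1})
levels = split_by_frequency(toy_counts)
```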

Page 57: AI based language learning tools

57

Conclusion

Watch Drama, Learn Language

Quiz: https://languagequiz.viki.com

Techblog: https://techblog.rakuten.co.jp/2017/05/26/lang-quiz/

Page 58: AI based language learning tools

Oct.28.2017

Pang Zineng

Senior Technologist

Rakuten Institute of Technology Singapore

Page 59: AI based language learning tools

* Images from Rakuten VIKI

Page 60: AI based language learning tools

60

Web Search: pages

In-Video Search: clips

* Images from Rakuten VIKI

Page 61: AI based language learning tools

61

Web Search (site identifier, page identifier, search relevancy, content ranking)

• The metadata of the site

• The metadata of the page

• The word tokens in the page

• The topic of the page

• The originality of the page

• Hyperlinks (page rank)

In-Video Search (video identifier, clip identifier, search relevancy, content ranking)

• The metadata of the video

• The metadata of this clip (timestamp, length, URI, etc.)

• The caption text of the clip

• The frames & audio signal

• Complexity of the sentence

• Diversity of the clips

* Images from Rakuten VIKI
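As a rough illustration of how the in-video signals could fit together, the sketch below searches clip captions for a word, prefers shorter sentences as a stand-in for low complexity, and keeps one clip per video as a stand-in for diversity. Both proxies, the `Clip` type and `search_clips` are assumptions, not the system described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    video_id: str         # video identifier
    start_seconds: float  # clip identifier within the video (timestamp)
    caption: str          # caption text of the clip


def search_clips(query, clips, limit=3):
    """Return clips whose caption contains the query word, simple and diverse first."""
    hits = [c for c in clips if query in c.caption.lower().split()]
    hits.sort(key=lambda c: len(c.caption.split()))  # shorter sentence = simpler (assumed)
    seen_videos, results = set(), []
    for clip in hits:
        if clip.video_id in seen_videos:  # at most one clip per video (assumed)
            continue
        seen_videos.add(clip.video_id)
        results.append(clip)
    return results[:limit]


clips = [
    Clip("v1", 12.0, "he gave up"),
    Clip("v1", 40.0, "he never gave up his dreams"),
    Clip("v2", 5.0, "she gave him a gift"),
    Clip("v3", 9.0, "good morning"),
]
results = search_clips("gave", clips)
```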

Page 62: AI based language learning tools

62

Job:

• Make some data ready for consumption.

Questions:

• How does the data arrive?

• What needs to be done for it to be ready?

• How will the data be consumed?

Diagram components: Data provider, FTP, API, Raw Data, Trigger / monitor function, Pre-processing function, database, Data access function, Data consumer

Page 63: AI based language learning tools

63

Job:

• Let outsiders use a function.

Questions:

• How frequently will the function be used?

• What data does the function need?

Diagram components: Web Application, API Endpoint, API Cache, Request Queue, Application logic, Application Cache, Internal/External Data
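The interplay of request queue, cache and application logic can be sketched as follows. The class name `CachedEndpoint` and its methods are hypothetical; the slide only names the components, not how they are wired.

```python
from collections import deque


class CachedEndpoint:
    """Queue incoming requests and answer repeats from a cache."""

    def __init__(self, application_logic):
        self._logic = application_logic
        self._cache = {}       # API cache: request -> response
        self._queue = deque()  # request queue
        self.logic_calls = 0   # how often the application logic actually ran

    def submit(self, request):
        self._queue.append(request)

    def process_all(self):
        responses = []
        while self._queue:
            request = self._queue.popleft()
            if request not in self._cache:
                self.logic_calls += 1
                self._cache[request] = self._logic(request)
            responses.append(self._cache[request])
        return responses


endpoint = CachedEndpoint(str.upper)  # str.upper stands in for real application logic
for word in ["a", "b", "a"]:
    endpoint.submit(word)
responses = endpoint.process_all()
```

How often the function is called, the first question on the slide, determines how much a cache like this saves: here three requests trigger only two calls to the logic.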

Page 64: AI based language learning tools

64

Motion Dictionary

Diagram components: Rakuten TV video contents, Rakuten VIKI video contents, Other video contents, Search function, 3rd Party Platform

* Images from Rakuten VIKI

Page 65: AI based language learning tools

65

Interactive Subtitles (version 2) and Interactive Subtitles (version 3)

Diagram components:

• dictionary function, with Japanese Dictionary Data, Korean Dictionary Data, Chinese Dictionary Data

• voice function

• tokenization function, with Korean Tokenization Data, Chinese Tokenization Data, Japanese Tokenization Data

• Source labels: 3rd party solution (twice), open source framework (four times), In-house solution (twice, shown with a second set of Korean Tokenization Data and Chinese Tokenization Data)

* Images from Rakuten VIKI

Page 66: AI based language learning tools

66

Interactive Subtitles (version 2) and Interactive Subtitles (version 3)

Diagram components:

• dictionary function, with Japanese Dictionary Data, Korean Dictionary Data, Chinese Dictionary Data

• voice function

• tokenization function, with Japanese Tokenization Data (open source framework)

• Global Tokenization Data (In-house solution)

• Global Dictionary Data (In-house solution)

• Korean Tokenization Data and Chinese Tokenization Data (In-house solution for each)

• Other source labels: 3rd party solution (twice), open source framework

* Images from Rakuten VIKI

Page 67: AI based language learning tools

67

Vocab Quiz (version 1)

Diagram components: Take Quiz function, Chinese Quiz Data, Korean Quiz Data

* Images from Rakuten VIKI

Page 68: AI based language learning tools

68

Vocab Quiz (version 2)

Diagram components: Take Quiz function, voice function, Chinese Quiz Data, Korean Quiz Data

* Images from Rakuten VIKI

Page 69: AI based language learning tools

69

Fast iteration in R&D won’t be possible if we had many things bundled or coupled. -- Pang

Vocab Quiz

• https://languagequiz.viki.com/

Learn Mode (PC/Mac only)

• https://www.viki.com/collections/316981l-learn-the-basics-chinese

• https://www.viki.com/collections/316939l-learn-the-basics-korean

Motion Dictionary

• TBD