AI-Based Language Learning Tools

Post on 21-Jan-2018


Oct.28.2017

Ewa Szymanska, PhD

Head of Rakuten Institute of Technology Singapore

2

Source: https://unsplash.com/ by Element5 Digital

3

“I am watching shows in Chinese to get used to ‘actual’ spoken Mandarin, and not just what I see in my textbooks.”

- VIKI user

4

* Images from Rakuten VIKI, Rakuten TV

5

1.8 billion people are learning foreign languages

Source: The Washington Post: https://www.washingtonpost.com/news/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts

[Charts: languages with the most native speakers; most commonly studied foreign languages]

6

Online individual language learning market is growing at 12% CAGR

Source: Rosetta Stone Investor Day 2017

7

I. Entertaining Content II. Global Users III. Technology

*Photo by Jakob Owens on Unsplash

8

Interactive subtitles

Video dictionary

Quizzes

* Images from Rakuten VIKI

9

1 Interactive subtitles

Fast adoption: 30,000 DAU (daily active users)

High engagement: Korean Learn Mode users view 10% more than the Viki average

High satisfaction: 83 NPS (net promoter score)

*cnet.com @ CBS Interactive Inc. Apr 13, 2017; Keia.org, Korean Economic Institute, Apr 2017; Forbes Oct 24, 2017; The Verge, Sep 28, 2017

Shows availability

Learn Chinese (Japan): “Daughter Back”, “Return of Happiness”, “Ice and Fire of Youth”

Learn Korean (USA): “My Love from the Star”, “Boys Over Flowers”, “Descendants of the Sun”

* Images from Rakuten VIKI

[ Learn Mode collection on viki.com ]

11

2 Drama Vocab Quiz [ languagequiz.viki.com ]

• 60,000+ quizzes taken

• 35,000+ users completed the quiz

• Very positive social media engagement

12

3 Video-based Dictionary

Integrate with the classroom curriculum:

13

“If you talk to a man in a language he understands, that goes to his head. If you talk to him in his language, that goes to his heart.”

- Nelson Mandela

14

Oct 28, 2017

Stanley Kok

Principal Research Scientist

Rakuten Institute of Technology (Singapore)

16

你是辣妹,也是名门贵族

Splitting a sentence into pieces, each preserving its original semantics

你 是 辣妹 , 也是 名门贵族 → you are (a) hot chick and also (of) the gentry

你 是 辣妹 , 也是 名门贵 族 → you are (a) hot chick and also tribe

17

努力的人才会成功

努力 的 人 才 会 成功 → only hardworking people will succeed

努力 的 人才 会 成功 → hardworking talent will succeed

18

Tokenization

19

Dictionary Lookup

20

Many open-source tokenizers available

Good, but not perfect

Different mistakes

Why not use more (or all) of them to improve tokenization?

The strengths of one tokenizer overcome the shortcomings of another

21

How to quantify “goodness” of tokenization?

Take a human learner’s perspective

Count the number of dictionary look-ups needed to understand all tokens

Tokens not in the dictionary are assumed to need a large number of look-ups (10)

你 是 辣妹 → you / are / hot chick: 1 + 1 + 1 = 3

你 是 辣 妹 → you / are / spicy / younger sister: 1 + 1 + 1 + 1 = 4

你 是辣 妹 → you / ? / younger sister: 1 + 10 + 1 = 12
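The look-up cost above can be sketched in a few lines of Python. This is a minimal illustration, not the speakers' implementation: the toy `DICTIONARY` set and the names `lookup_cost` and `UNKNOWN_COST` are assumptions made for the example, while the penalty of 10 and the three candidate tokenizations come from the slide.

```python
# Hypothetical toy dictionary; a real system would consult a full lexicon.
DICTIONARY = {"你", "是", "辣妹", "辣", "妹"}

# Penalty from the slide for a token that is not in the dictionary.
UNKNOWN_COST = 10


def lookup_cost(tokens):
    """Number of dictionary look-ups a learner needs to understand all tokens."""
    return sum(1 if token in DICTIONARY else UNKNOWN_COST for token in tokens)


# The three candidate tokenizations of 你是辣妹 shown on the slide.
candidates = [
    ["你", "是", "辣妹"],
    ["你", "是", "辣", "妹"],
    ["你", "是辣", "妹"],
]

costs = [lookup_cost(tokens) for tokens in candidates]
best = min(candidates, key=lookup_cost)

print(costs)  # [3, 4, 12]
print(best)   # ['你', '是', '辣妹']
```

Under this measure, the tokenization that needs the fewest look-ups is preferred.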

22

Can do better than picking the lowest-cost tokenization from the tokenizers

Treat common tokens as “anchor points”

Pick the best tokens from the remaining ones

23

你 是 辣妹 也是 名门贵 族 → you are hot chick and also tribe (15)

你 是辣 妹 也是 名门贵族 → you younger sister and also (of) the gentry (14)

你 是 辣妹 也是 名门贵族 (5)
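One way to read the anchor-point idea is sketched below. It is an assumption about the mechanics, not the speakers' code: tokens that every tokenizer produces at the same character positions are treated as anchors, and in each gap between anchors the lowest-cost segment from any tokenizer is kept. The toy `DICTIONARY` and the names `cost`, `spans`, and `combine` are hypothetical.

```python
# Hypothetical toy dictionary; tokens outside it cost 10 look-ups, as on the slides.
DICTIONARY = {"你", "是", "辣妹", "妹", "也是", "名门贵族", "族"}
UNKNOWN_COST = 10


def cost(tokens):
    """Look-up cost of a token sequence."""
    return sum(1 if t in DICTIONARY else UNKNOWN_COST for t in tokens)


def spans(tokens):
    """Convert tokens to (start, end) character offsets."""
    out, pos = [], 0
    for t in tokens:
        out.append((pos, pos + len(t)))
        pos += len(t)
    return out


def combine(tokenizations):
    """Keep tokens shared by all tokenizers; fill each gap with the cheapest segment."""
    text = "".join(tokenizations[0])
    assert all("".join(t) == text for t in tokenizations), "inputs must cover the same text"
    span_lists = [spans(t) for t in tokenizations]
    anchors = sorted(set.intersection(*(set(s) for s in span_lists)))

    result, pos = [], 0
    # A zero-length sentinel anchor at the end flushes the final gap.
    for start, end in anchors + [(len(text), len(text))]:
        if start > pos:
            # Each tokenizer's tokens that fall inside the gap [pos, start).
            segments = [
                [text[a:b] for a, b in s if pos <= a and b <= start]
                for s in span_lists
            ]
            result.extend(min(segments, key=cost))
        if end > start:
            result.append(text[start:end])
        pos = end
    return result


tokenizer_a = ["你", "是", "辣妹", "也是", "名门贵", "族"]  # cost 15
tokenizer_b = ["你", "是辣", "妹", "也是", "名门贵族"]      # cost 14
merged = combine([tokenizer_a, tokenizer_b])
print(merged, cost(merged))
```

Here 你 and 也是 are the anchors; the first gap is taken from one tokenizer and the last gap from the other, which gives the cost-5 result shown on the slide.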

24

Dictionaries are important for language learning

A manual approach provides high-quality dictionaries, but is not scalable

About 7000 languages in the world

About 49 million possible bilingual dictionaries (7000 × 7000 language pairs)

Thus need an automatic approach

25

Lots of online dictionaries available

Could we automatically learn new dictionaries from them?

Focus on Chinese-English (C-E) & Korean-English (K-E) bilingual dictionaries

26

Lots of dictionaries online

Some are C-E and K-E, but many are not

Many dictionaries are C-X and X-E

Use language X as bridge/pivot

C-X + X-E => C-E, e.g.,

辣妹 -> fille sexy + fille sexy -> hot chick

=> 辣妹 -> hot chick
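The pivot step can be sketched as a composition of two dictionaries. This is a minimal illustration under stated assumptions: each dictionary is a plain mapping from a word to a set of translations, and the function name `compose` is hypothetical. The single example entry is the one from the slide.

```python
from collections import defaultdict


def compose(source_to_pivot, pivot_to_target):
    """Chain a C-X dictionary with an X-E dictionary to obtain C-E entries."""
    result = defaultdict(set)
    for source, pivots in source_to_pivot.items():
        for pivot in pivots:
            # A pivot word missing from the second dictionary contributes nothing.
            result[source].update(pivot_to_target.get(pivot, ()))
    # Drop source words for which no target translation was reached.
    return {source: targets for source, targets in result.items() if targets}


chinese_french = {"辣妹": {"fille sexy"}}
french_english = {"fille sexy": {"hot chick"}}

chinese_english = compose(chinese_french, french_english)
print(chinese_english)  # {'辣妹': {'hot chick'}}
```

Ambiguous pivot words can introduce wrong entries in a chain like this, so the output of such a step would normally still need checking.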

27

Take 2 hops for now

Chinese-English dictionary has 750K entries, 90% correct

Korean-English dictionary has 100K entries, 99% correct

28

Learn bilingual dictionary using

Seed lexicon

Monolingual data (plentiful)

Maps bilingual phrases to a vector space

[Figure: word pairs in the vector space: dolphin / 海豚, Tokyo / 东京, sushi / 寿司]

29

30

31

Artifact of standard machine translation pipeline

Parallel sentences aligned word for word

Compute probability of mapping tokens of a source language to those of a target language

A correct source token will be more consistently aligned to its corresponding target token(s)

Add high-probability mappings to dictionary

32

Chinese | English | P(C|E) | P(E|C) | AveProb
辣妹 | hot chick | 0.8 | 0.9 | 0.85
是辣 | is curry | 0.1 | 0.1 | 0.1
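The AveProb column is consistent with the arithmetic mean of the two translation probabilities. The sketch below shows how high-probability mappings might then be added to a dictionary. The cut-off `THRESHOLD = 0.5` is an assumption for illustration only; the slides do not state the value used.

```python
def average_probability(p_c_given_e, p_e_given_c):
    """Mean of the two directional translation probabilities."""
    return (p_c_given_e + p_e_given_c) / 2


# Hypothetical cut-off; the slides do not give the actual threshold.
THRESHOLD = 0.5

# (Chinese, English, P(C|E), P(E|C)) rows from the table above.
phrase_table = [
    ("辣妹", "hot chick", 0.8, 0.9),
    ("是辣", "is curry", 0.1, 0.1),
]

dictionary = {}
for chinese, english, p_ce, p_ec in phrase_table:
    if average_probability(p_ce, p_ec) >= THRESHOLD:
        dictionary.setdefault(chinese, []).append(english)

print(dictionary)  # {'辣妹': ['hot chick']}
```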

33

Chinese-English Dictionary

3 million Chinese tokens (Jan’17)

89% in dictionary

Korean-English Dictionary

4 million Korean tokens (Jan’17)

86% in dictionary

34

[Chart: #KoreanTokens vs. #Definitions (x-axis 1 to 39, y-axis 0 to 500,000)]

[Chart: #ChineseTokens vs. #Definitions (x-axis 1 to 28, y-axis 0 to 450,000)]

35

Match parallel sentences to

Phrase table

Dictionary

36

他放弃梦想

He gave up his dreams

Phrase Table
Chinese | English | AveProb
放弃 | gave up his | 0.74
放弃 | quit | 0.83
放弃 | abdicate | 0.68

37

他放弃梦想

He gave up his dreams

Phrase Table
Chinese | English | AveProb
放弃 | gave up his | 0.74
放弃 | quit | 0.83
放弃 | abdicate | 0.68

Best match

38

Phrase Table
Chinese | English | AveProb
放弃 | gave up his | 0.74
放弃 | quit | 0.83
放弃 | abdicate | 0.68

Best match

Dictionary
Chinese | English
放弃 | abandon
放弃 | give up
放弃 | abdicate
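The slides do not spell out how the best match is chosen. One plausible reading, sketched here as an assumption, is that a phrase-table entry only qualifies if its English side actually occurs in the parallel English sentence, and the highest AveProb among the qualifying entries wins. The helper names `contains_phrase` and `best_match` are hypothetical.

```python
def contains_phrase(sentence_tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside sentence_tokens."""
    n = len(phrase_tokens)
    return any(
        sentence_tokens[i:i + n] == phrase_tokens
        for i in range(len(sentence_tokens) - n + 1)
    )


def best_match(english_sentence, entries):
    """Pick the highest-probability entry whose English side appears in the sentence."""
    sentence_tokens = english_sentence.lower().split()
    matching = [
        (english, prob)
        for english, prob in entries
        if contains_phrase(sentence_tokens, english.lower().split())
    ]
    return max(matching, key=lambda entry: entry[1], default=None)


# Phrase-table entries for 放弃 from the slide.
phrase_table = {"放弃": [("gave up his", 0.74), ("quit", 0.83), ("abdicate", 0.68)]}

match = best_match("He gave up his dreams", phrase_table["放弃"])
print(match)  # ('gave up his', 0.74)
```

Under this reading, "quit" is skipped despite its higher probability because it does not occur in the sentence. Linking "gave up his" to the dictionary entry "give up" would additionally need lemmatization, which this sketch leaves out.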

Drama Vocabulary Quiz

Liling Tan

Rakuten Institute of Technology (Singapore)

28 Oct 2017 @ Rakuten Tech. Conference

40

Overview

• Introduction

• Demo

• How Did We Create the Quiz?

41

Introduction

• Quizzes are fun and could be viral

• But manually creating quizzes is tedious

• We created #DramaVocabQuiz that generates new vocabulary quizzes automatically

42

43

44

45

46

47

48

How Do We Generate Quizzes Automatically?

49

Korean Drama Word List

• The word 미남 [minam] “handsome guy” can be followed by multiple suffixes at once, such as -이시라구요 [-issilaguyo], to form a single word meaning “someone said that he is handsome”.

• We extract only the root word 미남 [minam] and count it as one unique word type.

50


53

Splitting Word List into 3 Difficulty Levels

54

Generate the Distractors

• Distractor 1: Select from the 5th to 20th closest words to the question word (cosine)

• Distractor 2: Using Distractor 1 as the negative example and the question word as the positive example, select from the 1st to 20th closest words (cosmul)
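The two ranking steps can be sketched in plain Python. This is an illustration under assumptions, not the speakers' code: the 2-dimensional vectors and English stand-in words are made up for the example, the function names are hypothetical, and the multiplicative score follows the general form described by Levy and Goldberg (2014) with a small `eps` in the denominator. A real setup would use trained word embeddings and the rank windows from the slide (5th to 20th, 1st to 20th).

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_by_cosine(word, vectors):
    """Other words ordered from most to least similar to `word`."""
    others = [w for w in vectors if w != word]
    return sorted(others, key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)


def rank_by_cosmul(positive, negative, vectors, eps=1e-6):
    """Words close to `positive` and far from `negative` (multiplicative objective)."""
    def score(w):
        pos = (1 + cosine(vectors[w], vectors[positive])) / 2
        neg = (1 + cosine(vectors[w], vectors[negative])) / 2
        return pos / (neg + eps)

    others = [w for w in vectors if w not in (positive, negative)]
    return sorted(others, key=score, reverse=True)


# Made-up toy embeddings, for illustration only.
vectors = {
    "handsome": (1.0, 0.2),
    "pretty": (0.9, 0.4),
    "friend": (0.7, 0.6),
    "tall": (0.6, 0.8),
    "river": (0.1, 1.0),
}

question = "handsome"
ranked = rank_by_cosine(question, vectors)

# The slide uses ranks 5 to 20; the toy vocabulary is tiny, so ranks 2 to 3 are used here.
distractor_1_pool = ranked[1:3]
distractor_1 = distractor_1_pool[0]  # a real quiz would probably sample from the pool

distractor_2_pool = rank_by_cosmul(question, distractor_1, vectors)
print(distractor_1, distractor_2_pool)
```

Skipping the very closest neighbours for the first distractor plausibly avoids near-synonyms that would make the question unfair, while the second step steers away from the first distractor so the options differ from each other.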

References:

• Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In ICLR.

• Omer Levy and Yoav Goldberg. 2014. Linguistic Regularities in Sparse and Explicit Word Representations. In CoNLL.

55

Language Learners Like Quizzes!

• 60,000+ quizzes taken

• 35,000+ unique users completed the quiz

• 16% of the users repeated the quiz

56

Word Frequency is a Good Indicator of Difficulty

[Bar chart over the Easy, Medium, and Hard levels; y-axis from 0 to 10]

Easy = Frequent words

Medium = Less frequent words

Hard = Least frequent words
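A frequency-based split into three levels might look like the sketch below. The equal-thirds cut is an assumption made for the example, since the slides do not say where the level boundaries were drawn, and the English tokens are stand-ins for root words extracted from drama subtitles.

```python
import math
from collections import Counter


def split_into_levels(tokens):
    """Rank word types by frequency and cut the ranking into three bands."""
    counts = Counter(tokens)
    # Most frequent first; ties are broken alphabetically so the result is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    size = math.ceil(len(ranked) / 3)
    return {
        "easy": ranked[:size],
        "medium": ranked[size:2 * size],
        "hard": ranked[2 * size:],
    }


# Hypothetical token stream, for illustration only.
tokens = (
    ["love"] * 6 + ["friend"] * 5 + ["school"] * 4
    + ["promise"] * 3 + ["destiny"] * 2 + ["amnesia"]
)

levels = split_into_levels(tokens)
print(levels)
```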

57

Conclusion

Watch Drama, Learn Language

Quiz: https://languagequiz.viki.com

Techblog: https://techblog.rakuten.co.jp/2017/05/26/lang-quiz/

Oct.28.2017

Pang Zineng

Senior Technologist

Rakuten Institute of Technology Singapore

59

* Images from Rakuten VIKI

60

Web Search: pages

In-Video Search: clips

* Images from Rakuten VIKI

61

Web Search

• The metadata of the site
• The metadata of the page
• The word tokens in the page
• The topic of the page
• The originality of the page
• Hyperlinks (page rank)

(site identifier, page identifier, content ranking, search relevancy)

In-Video Search

• The metadata of the video
• The metadata of this clip (timestamp, length, URI, etc.)
• The caption text of the clip
• The frames & audio signal
• Complexity of the sentence
• Diversity of the clips

(video identifier, clip identifier, search relevancy, content ranking)

* Images from Rakuten VIKI

62

Job:

• Make some data ready for consumption.

Questions:

• How does the data come?

• What needs to be done for it to be ready?

• How will the data be consumed?

[Diagram components: data provider, FTP / API, raw data, trigger / monitor function, pre-processing function, database, data access function, data consumer]

63

Job:

• Let outsiders use a function.

Questions:

• How frequently will the function be used?

• What data does the function need?

[Diagram components: web application, API endpoint, API cache, request queue, application logic, application cache, internal/external data]
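How a cache and a request queue can sit in front of the application logic is sketched below. Everything here is an assumption for illustration: the class name `ApiEndpoint`, its methods, and the single-process design are hypothetical, and the slide only names the components without describing how they interact.

```python
from collections import deque


class ApiEndpoint:
    """Toy endpoint: requests are queued, and repeated requests are served from a cache."""

    def __init__(self, application_logic):
        self.application_logic = application_logic
        self.cache = {}       # stands in for the API cache
        self.queue = deque()  # stands in for the request queue
        self.logic_calls = 0  # counts how often the real logic had to run

    def submit(self, request):
        """Accept a request without processing it yet."""
        self.queue.append(request)

    def process_all(self):
        """Drain the queue in arrival order, computing each distinct request once."""
        responses = []
        while self.queue:
            request = self.queue.popleft()
            if request not in self.cache:
                self.cache[request] = self.application_logic(request)
                self.logic_calls += 1
            responses.append(self.cache[request])
        return responses


# Hypothetical application logic: pretend to look up a word.
endpoint = ApiEndpoint(lambda word: f"definition of {word}")
for word in ["辣妹", "放弃", "辣妹"]:
    endpoint.submit(word)

responses = endpoint.process_all()
print(responses, endpoint.logic_calls)
```

How often the function is called, one of the questions on the slide, is what decides whether a cache like this pays off.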

64

Motion Dictionary

[Diagram components: Rakuten TV video contents, Rakuten VIKI video contents, other video contents, search function, 3rd party platform]

* Images from Rakuten VIKI

65

Interactive Subtitles (version 2) and Interactive Subtitles (version 3)

[Diagram components: dictionary function, voice function, and tokenization function; Japanese, Korean, and Chinese dictionary data; Japanese, Korean, and Chinese tokenization data; sources labeled 3rd party solution, open source framework, and in-house solution]

* Images from Rakuten VIKI

66

Interactive Subtitles (version 2) and Interactive Subtitles (version 3)

[Diagram components: dictionary function, voice function, and tokenization function; Japanese, Korean, and Chinese dictionary data; Japanese, Korean, and Chinese tokenization data; Global Tokenization Data and Global Dictionary Data; sources labeled 3rd party solution, open source framework, and in-house solution]

* Images from Rakuten VIKI

67

Vocab Quiz (version 1)

[Diagram components: take quiz function, Chinese quiz data, Korean quiz data]

* Images from Rakuten VIKI

68

Vocab Quiz (version 2)

[Diagram components: take quiz function, voice function, Chinese quiz data, Korean quiz data]

* Images from Rakuten VIKI

69

Fast iteration in R&D won’t be possible if we had many things bundled or coupled.

-- Pang

Vocab Quiz

• https://languagequiz.viki.com/

Learn Mode (PC/Mac only)

• https://www.viki.com/collections/316981l-learn-the-basics-chinese

• https://www.viki.com/collections/316939l-learn-the-basics-korean

Motion Dictionary

• TBD
