Similarity Metrics for Japanese Kanji

Lars Yencken / 99designs. Maths and Science Meetup, 30th Nov 2012

Upload: larsyencken

Post on 14-May-2015

DESCRIPTION

The Japanese writing system is a big barrier to learners. This talk at the Melbourne Maths and Science Meetup is about kanji similarity metrics, which can help learners by indicating confusable pairs, or by supporting new types of dictionary search.

TRANSCRIPT

Page 1: Similarity metrics for Japanese kanji

Similarity Metrics for Japanese Kanji

Lars Yencken / 99designs

Maths and Science Meetup, 30th Nov 2012

Page 2: Similarity metrics for Japanese kanji

Linguistics

Computer Science

Computational Linguistics

Page 3: Similarity metrics for Japanese kanji

Relative difficulty of languages

Page 4: Similarity metrics for Japanese kanji

DIFFICULTY OF LEARNING LANGUAGES

FOREIGN SERVICE INSTITUTE, US DEPARTMENT OF STATE

Page 5: Similarity metrics for Japanese kanji

DIFFICULTY OF LEARNING LANGUAGES

FOREIGN SERVICE INSTITUTE, US DEPARTMENT OF STATE

Closely related to English

█▌ 575-600 class hours

Afrikaans, Danish, Dutch, French, Italian, Norwegian, Portuguese, Romanian, Spanish, Swedish

Page 6: Similarity metrics for Japanese kanji

DIFFICULTY OF LEARNING LANGUAGES

FOREIGN SERVICE INSTITUTE, US DEPARTMENT OF STATE

Closely related to English

█▌ 575-600 class hours

Afrikaans, Danish, Dutch, French, Italian, Norwegian, Portuguese, Romanian, Spanish, Swedish

Significant linguistic and/or cultural differences

███ 1100 class hours

Albanian, Amharic, Armenian, Azerbaijani, Bengali, Bosnian, Bulgarian, Burmese, Croatian, Czech, Estonian, Finnish, Georgian, Greek, Hebrew, Hindi, Hungarian, Icelandic, Khmer, Lao, Latvian, Lithuanian, Macedonian, Mongolian, Nepali, Pashto, Persian (Dari, Farsi, Tajik), Polish, Russian, Serbian, Sinhalese, Slovak, Slovenian, Tagalog, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Xhosa, Zulu

Page 13: Similarity metrics for Japanese kanji

DIFFICULTY OF LEARNING LANGUAGES

FOREIGN SERVICE INSTITUTE, US DEPARTMENT OF STATE

Closely related to English

█▌ 575-600 class hours

Afrikaans, Danish, Dutch, French, Italian, Norwegian, Portuguese, Romanian, Spanish, Swedish

Significant linguistic and/or cultural differences

███ 1100 class hours

Albanian, Amharic, Armenian, Azerbaijani, Bengali, Bosnian, Bulgarian, Burmese, Croatian, Czech, Estonian, Finnish, Georgian, Greek, Hebrew, Hindi, Hungarian, Icelandic, Khmer, Lao, Latvian, Lithuanian, Macedonian, Mongolian, Nepali, Pashto, Persian (Dari, Farsi, Tajik), Polish, Russian, Serbian, Sinhalese, Slovak, Slovenian, Tagalog, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Xhosa, Zulu

Exceptionally difficult for native English speakers

██████ 2200 class hours

Arabic, Cantonese, Mandarin, Japanese, Korean

Page 15: Similarity metrics for Japanese kanji

Page 16: Similarity metrics for Japanese kanji

持/mo(tsu)/ "to carry"

Page 17: Similarity metrics for Japanese kanji

持 挂 拝

Page 18: Similarity metrics for Japanese kanji

distance(持, 挂) = ???

Page 19: Similarity metrics for Japanese kanji

The space of kanji

Page 20: Similarity metrics for Japanese kanji
Page 21: Similarity metrics for Japanese kanji

dog

dough

log

Page 22: Similarity metrics for Japanese kanji

持挂

拝土

Page 23: Similarity metrics for Japanese kanji

Approaches

Page 24: Similarity metrics for Japanese kanji

Compare images

Page 25: Similarity metrics for Japanese kanji

持挂
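The slides do not say which image metric was used. As a minimal sketch, assuming each glyph has already been rendered to a same-sized black-and-white bitmap, one simple choice is the fraction of pixels that differ. The 5x5 grids below are hypothetical toy data, not real renderings of any kanji.

```python
# Hypothetical toy bitmaps standing in for rendered glyphs ('#' = ink).
GLYPH_A = [
    "#.###",
    "#..#.",
    "#####",
    "#..#.",
    "#.##.",
]
GLYPH_B = [
    "#.###",
    "#..#.",
    "#.###",
    "#..#.",
    "#.#..",
]


def pixel_distance(a, b):
    """Fraction of pixels that differ between two same-sized bitmaps."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have the same dimensions")
    total = sum(len(row) for row in a)
    differing = sum(
        pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)
    )
    return differing / total


print(pixel_distance(GLYPH_A, GLYPH_B))  # 2 of 25 pixels differ
```

A pixel-wise measure like this is sensitive to small shifts in position or stroke thickness, which is one reason to also consider the structural comparisons on the following slides.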

Page 26: Similarity metrics for Japanese kanji
Page 27: Similarity metrics for Japanese kanji

Compare components

Page 28: Similarity metrics for Japanese kanji

扌, 土, 寸

彳, 土, 寸
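The two component lists above share two of their three parts. One way to turn that overlap into a number, assuming components are treated as an unordered set, is the Jaccard distance. The choice of Jaccard is an assumption here; the slides do not name a specific set measure.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of components."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(a | b)


first = ["扌", "土", "寸"]   # component list from the slide
second = ["彳", "土", "寸"]  # component list from the slide
print(jaccard_distance(first, second))  # 0.5
```

Treating components as a set ignores where each one sits in the character, so two kanji with the same parts in different positions would come out as identical.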

Page 29: Similarity metrics for Japanese kanji

Compare strokes

Page 30: Similarity metrics for Japanese kanji

P R O S P E R I T Y

P R O P E R T I E S

Page 31: Similarity metrics for Japanese kanji

P R O S P E R I T Y

P R O P E R T I E S

distance: 6
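The figure of 6 depends on how substitutions are counted. The sketch below is a standard dynamic-programming edit distance with a configurable substitution cost (the `sub_cost` parameter is introduced here for illustration). Charging 2 for a substitution, that is, counting it as a delete plus an insert, reproduces the 6 on the slide; unit-cost Levenshtein distance gives 4 for the same pair. Which convention the talk used is an assumption.

```python
def edit_distance(a, b, sub_cost=1):
    """Edit distance between two sequences via dynamic programming."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                                # delete x
                curr[j - 1] + 1,                            # insert y
                prev[j - 1] + (0 if x == y else sub_cost),  # match / substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("PROSPERITY", "PROPERTIES", sub_cost=2))  # 6
print(edit_distance("PROSPERITY", "PROPERTIES", sub_cost=1))  # 4
```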

Page 32: Similarity metrics for Japanese kanji

3, 11a, 2a, 2a

3, 11a, 2a, 2a, 2a

distance: 1
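The same idea carries over from letters to strokes if each stroke code is treated as a single symbol. A minimal sketch, assuming the codes are comma-separated tokens as shown on the slide (the helper names are hypothetical):

```python
def parse_strokes(text):
    """Split a comma-separated stroke-code string into a list of tokens."""
    return [token.strip() for token in text.split(",") if token.strip()]


def sequence_edit_distance(a, b):
    """Unit-cost edit distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = curr
    return prev[-1]


first = parse_strokes("3, 11a, 2a, 2a")
second = parse_strokes("3, 11a, 2a, 2a, 2a")
print(sequence_edit_distance(first, second))  # 1
```

Tokenising first matters: comparing the raw strings character by character would count the extra ", 2a" as four edits rather than one extra stroke.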

Page 33: Similarity metrics for Japanese kanji

Compare trees

Page 34: Similarity metrics for Japanese kanji

[Tree diagrams for two kanji; glyphs lost in extraction]

Page 35: Similarity metrics for Japanese kanji

[Tree diagrams for two kanji; glyphs lost in extraction]

tree edit distance
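The trees from the slide did not survive in this transcript, so the example below uses a hypothetical decomposition built from the component lists on the "Compare components" slide, with "⿰" as an assumed root label for a left-right layout. The function is a memoised form of the usual recursive definition of ordered tree edit distance (delete a node, insert a node, or relabel a node), offered as a sketch rather than as the implementation used in the talk.

```python
from functools import lru_cache


def tree(label, *children):
    """Build an immutable tree node: (label, tuple_of_children)."""
    return (label, tuple(children))


def forest_size(forest):
    """Total number of nodes in a tuple of trees."""
    return sum(1 + forest_size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f:
        return forest_size(g)  # insert everything in g
    if not g:
        return forest_size(f)  # delete everything in f
    (v_label, v_kids), (w_label, w_kids) = f[-1], g[-1]
    return min(
        forest_distance(f[:-1] + v_kids, g) + 1,  # delete rightmost root of f
        forest_distance(f, g[:-1] + w_kids) + 1,  # insert rightmost root of g
        forest_distance(f[:-1], g[:-1])           # pair the two roots up
        + forest_distance(v_kids, w_kids)
        + (v_label != w_label),                   # relabel if they differ
    )


def tree_edit_distance(a, b):
    return forest_distance((a,), (b,))


# Hypothetical decompositions (assumed for illustration only).
left = tree("⿰", tree("扌"), tree("寺", tree("土"), tree("寸")))
right = tree("⿰", tree("彳"), tree("寺", tree("土"), tree("寸")))
print(tree_edit_distance(left, right))  # 1: relabel 扌 as 彳
```

Unlike a flat component set, the tree keeps track of how parts nest, so a shared sub-tree such as 寺 counts as shared structure rather than as two unrelated matching components.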

Page 36: Similarity metrics for Japanese kanji

So what works?

Page 37: Similarity metrics for Japanese kanji
Page 38: Similarity metrics for Japanese kanji
Page 39: Similarity metrics for Japanese kanji

Thanks!