Similarity Metrics for Japanese Kanji
DESCRIPTION
The Japanese writing system is a big barrier to learners. This talk at the Melbourne Maths and Science Meetup is about kanji similarity metrics, which can help learners by indicating confusable pairs, or by supporting new types of dictionary search.
TRANSCRIPT
Similarity Metrics for Japanese Kanji
Lars Yencken / 99designs
Maths and Science Meetup, 30th Nov 2012
Linguistics
Computer Science
Computational Linguistics
Relative difficulty of languages
DIFFICULTY OF LEARNING LANGUAGES
FOREIGN SERVICE INSTITUTE, US DEPARTMENT OF STATE
Closely related to English
█▌ 575-600 class hours
Afrikaans, Danish, Dutch, French, Italian, Norwegian, Portuguese, Romanian, Spanish, Swedish
Significant linguistic and/or cultural differences
███ 1100 class hours
Albanian, Amharic, Armenian, Azerbaijani, Bengali, Bosnian, Bulgarian, Burmese, Croatian, Czech, Estonian, Finnish, Georgian, Greek, Hebrew, Hindi, Hungarian, Icelandic, Khmer, Lao, Latvian, Lithuanian, Macedonian, Mongolian, Nepali, Pashto, Persian (Dari, Farsi, Tajik), Polish, Russian, Serbian, Sinhalese, Slovak, Slovenian, Tagalog, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Xhosa, Zulu
Exceptionally difficult for native English speakers
██████ 2200 class hours
Arabic, Cantonese, Mandarin, Japanese, Korean
持 /mo(tsu)/ "to carry"
持 挂 拝
distance(持, 挂) = ???
The space of kanji
dog
dough
log
持 挂
拝 土
Approaches
Compare images
持 挂
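One simple way to compare images is to count the pixels that differ between two same-sized bitmaps. The sketch below assumes this metric; the talk does not say which image metric it uses. The two 5x5 grids are made-up stand-ins, not real renderings of 持 and 挂, and the name `pixel_distance` is hypothetical.

```python
# Toy 5x5 bitmaps standing in for rendered glyphs (hypothetical data,
# not real kanji renderings). "1" = ink, "0" = background.
GLYPH_A = [
    "10100",
    "11111",
    "10100",
    "10111",
    "10010",
]
GLYPH_B = [
    "10100",
    "11111",
    "10100",
    "10101",
    "10111",
]


def pixel_distance(a, b):
    """Return the fraction of pixels that differ between two bitmaps."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have the same shape")
    total = sum(len(row) for row in a)
    differing = sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )
    return differing / total


print(pixel_distance(GLYPH_A, GLYPH_B))
```

A real version would first render both characters in the same font and size, so that differences come from the shapes rather than from scaling or alignment.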
Compare components
持: 扌, 土, 寸
待: 彳, 土, 寸
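The two component lists above share two of their three parts. One way to turn that overlap into a number is the Jaccard distance between the component sets. This is an assumed choice of metric; the slide does not name one. The function name is hypothetical, and reading the lists as belonging to 持 and 待 is an assumption.

```python
# Component lists as given on the slide; the kanji keys are an assumed reading.
COMPONENTS = {
    "持": ["扌", "土", "寸"],
    "待": ["彳", "土", "寸"],
}


def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two collections of components."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # two empty sets are treated as identical
    return 1 - len(set_a & set_b) / len(set_a | set_b)


print(jaccard_distance(COMPONENTS["持"], COMPONENTS["待"]))  # 0.5
```

Sets ignore where each component sits and how often it occurs, so two kanji built from the same parts in a different layout would score as identical under this sketch.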
Compare strokes
P R O S P E R I T Y
P R O P E R T I E S
distance: 6
3, 11a, 2a, 2a
3, 11a, 2a, 2a, 2a
distance: 1
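Both examples above are sequence edit distances: one over letters, one over stroke-type codes. A minimal sketch of the standard dynamic-programming calculation follows. The function name and the `sub_cost` parameter are illustrative, not taken from the talk. With unit costs the two words come out 4 apart. Charging 2 for a substitution (one delete plus one insert) gives 6, which matches the figure on the slide, although the slide does not say which cost scheme it uses. The stroke sequences differ by a single inserted `2a` under either scheme.

```python
def edit_distance(a, b, sub_cost=1):
    """Edit distance between two sequences (insert and delete cost 1 each).

    sub_cost is the price of replacing one symbol with another; it is an
    illustrative parameter, not something specified in the talk.
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty sequence takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                             # delete x
                curr[j - 1] + 1,                         # insert y
                prev[j - 1] + (0 if x == y else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


# Stroke-type sequences from the slide.
strokes_a = ["3", "11a", "2a", "2a"]
strokes_b = ["3", "11a", "2a", "2a", "2a"]

print(edit_distance(strokes_a, strokes_b))                    # 1
print(edit_distance("PROSPERITY", "PROPERTIES"))              # 4
print(edit_distance("PROSPERITY", "PROPERTIES", sub_cost=2))  # 6
```

Because the function works on any sequence, the same code handles strings of letters and lists of stroke codes without change.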
Compare trees
tree edit distance
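A rough sketch of the idea, assuming each kanji is written as a tree of components. This is a simplified top-down variant (in the style of Selkow's algorithm), where inserting or deleting a child removes its whole subtree. It is not the general tree edit distance, and it is not necessarily what the talk used. The `Node` class, the function names, and the decompositions of 持 and 待 are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree.children)


def tree_distance(s, t):
    """Top-down tree distance: relabel costs 1, and inserting or deleting
    a child costs the size of that child's whole subtree."""
    cost = 0 if s.label == t.label else 1
    a, b = s.children, t.children
    # d[i][j]: cost of turning the first i children of s into the first j of t.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = d[i - 1][0] + size(a[i - 1])
    for j in range(1, len(b) + 1):
        d[0][j] = d[0][j - 1] + size(b[j - 1])
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a[i - 1]),                       # delete subtree
                d[i][j - 1] + size(b[j - 1]),                       # insert subtree
                d[i - 1][j - 1] + tree_distance(a[i - 1], b[j - 1]),  # edit in place
            )
    return cost + d[len(a)][len(b)]


def temple():
    # Assumed decomposition: 寺 = 土 over 寸.
    return Node("寺", [Node("土"), Node("寸")])


motsu = Node("持", [Node("扌"), temple()])
matsu = Node("待", [Node("彳"), temple()])

print(tree_distance(motsu, matsu))  # 2: relabel the root, relabel the left child
```

Unlike a flat set of components, the tree keeps track of how the parts nest, so kanji that share a large sub-structure such as 寺 come out closer than ones that only share scattered pieces.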
So what works?
Thanks!