How Unidecoder transliterates UTF-8 to ASCII
DESCRIPTION
Slides of my talk at Paris.rb on 2014-11-07. How does UTF-8 work? How can we leverage it to convert Chinese, Russian or any other non-ASCII characters to ASCII? Here is what the Unidecoder gem does.
TRANSCRIPT
Unidecoder
Simon Courtois - @happynoff
Transliteration
你好
Ni Hao
ПРИВЕТ
PRIVIeT
How does it work?
At the beginning there was ASCII
A 65, B 66, C 67
a 97, b 98, c 99
Bit weights: 64 32 16 8 4 2 1
a 97 = 11 00001
A 65 = 10 00001
b 98 = 11 00010
B 66 = 10 00010
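The codes and bit patterns above can be checked from Ruby. This is a small sketch; the helper name swap_ascii_case is made up for illustration and only makes sense for ASCII letters.

```ruby
# Print the ASCII code and the 7-bit pattern of a few letters.
%w[a A b B].each do |char|
  code = char.ord                     # "a".ord => 97
  bits = code.to_s(2).rjust(7, "0")   # 97 => "1100001"
  puts "#{char} #{code} #{bits}"
end

# Hypothetical helper: upper and lower case letters differ only
# by the bit worth 32, so flipping that bit swaps the case.
def swap_ascii_case(char)
  (char.ord ^ 32).chr
end

puts swap_ascii_case("a")   # A
puts swap_ascii_case("B")   # b
```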
Then… 8-bit computers!
So every country had its own encoding(s)!
All was fine until…
The World Wide Web
UTF-8 to the rescue!
Everything on 32 bits?
Bad idea
c a f é
\0
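To see why a fixed 32-bit encoding is wasteful, here is a rough sketch that encodes the word from the slide as UTF-32. It assumes the \0 on the slide refers to the zero bytes such an encoding produces for every character in the ASCII and Latin-1 range.

```ruby
word = "caf\u00E9"                 # "café", with é as the single code point U+00E9
utf32 = word.encode("UTF-32BE")    # 4 bytes per character, big-endian

# Show the 4 bytes used by each character.
utf32.bytes.each_slice(4) { |quad| p quad }

puts utf32.bytesize         # 16 bytes in UTF-32
puts word.bytesize          # 5 bytes in UTF-8
puts utf32.bytes.count(0)   # 12 of the 16 bytes are \0
```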
A better idea
A 65 = 0 1000001
110 XXXXX 10 XXXXXX
1110 XXXX 10 XXXXXX 10 XXXXXX
110 10000 10 011111
10000 011111 = 1055 = П
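The two-byte pattern can be rebuilt by hand and compared with what Ruby stores. This is a minimal sketch: utf8_two_bytes is a hypothetical helper and it only covers code points that fit the 110 XXXXX 10 XXXXXX form.

```ruby
# Hypothetical helper: encode a code point in U+0080..U+07FF
# as two UTF-8 bytes, following the 110 XXXXX 10 XXXXXX pattern.
def utf8_two_bytes(code_point)
  unless (0x80..0x7FF).cover?(code_point)
    raise ArgumentError, "code point does not fit in two bytes"
  end

  first  = 0b11000000 | (code_point >> 6)           # 110 + top 5 bits
  second = 0b10000000 | (code_point & 0b00111111)   # 10 + low 6 bits
  [first, second]
end

bytes = utf8_two_bytes(1055)
puts bytes.map { |b| b.to_s(2) }.join(" ")   # 11010000 10011111
puts bytes == "П".bytes                      # true
```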
So, how does unidecoder work?
How do we go from П to P ?
Start from a string like “П”
Unpack it: "П".unpack("U")
[1055] = 00000100 00011111
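A quick way to try the unpacking step in irb. Note that "U" alone returns only the first code point, while "U*" returns all of them; the 16-bit formatting is just for display.

```ruby
first_only = "П".unpack("U")        # only the first code point
all_points = "ПРИВЕТ".unpack("U*")  # every code point in the string

p first_only   # [1055]
p all_points   # [1055, 1056, 1048, 1042, 1045, 1058]

# Show 1055 as 16 bits, split into a high byte and a low byte.
bits = format("%016b", first_only.first)
puts "#{bits[0, 8]} #{bits[8, 8]}"   # 00000100 00011111
```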
4 31
4 → x04 → x04.yml
Index: 0 1 2 … 31
Value: Ie Io Dj … P
How to obtain 4 and 31?
unpacked = 1055
0000010000011111
unpacked >> 8
00000100 = 4
unpacked & 255
0000010000011111 & 0000000011111111 = 0000000000011111 = 31
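Putting the shift, the mask and the table lookup together might look like the sketch below. It is not the gem's actual code: the method names are made up, the table only contains the four entries shown on the slide instead of loading x04.yml, and the file name format is assumed from that single example.

```ruby
# Hypothetical stand-in for the contents of x04.yml,
# limited to the entries visible on the slide.
def sample_x04_table
  { 0 => "Ie", 1 => "Io", 2 => "Dj", 31 => "P" }
end

# Split a code point into its high byte (which table)
# and its low byte (which entry in that table).
def split_code_point(code_point)
  high = code_point >> 8    # drop the low 8 bits
  low  = code_point & 255   # keep only the low 8 bits
  [high, low]
end

# Assumed naming scheme: 4 => "x04.yml".
def table_file_name(high)
  format("x%02x.yml", high)
end

unpacked = "П".unpack("U").first
high, low = split_code_point(unpacked)

puts high                     # 4
puts low                      # 31
puts table_file_name(high)    # x04.yml
puts sample_x04_table[low]    # P
```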
Brain fried yet? Advertising time!
www.tinci.fr
Web Development
Software Development
Consulting & Support
@tincihq
Resources
Characters, Symbols and the Unicode Miracle:
bit.ly/why-utf8
Slides: bit.ly/unidecoder
Unidecoder: github.com/norman/unidecoder
Thank you!
Simon Courtois - @happynoff