How Unidecoder Transliterates UTF-8 to ASCII

Unidecoder Simon Courtois - @happynoff

Posted on 25-Jun-2015

DESCRIPTION

Slides of my talk at Paris.rb on 2014-11-07. How does UTF-8 work? How can it be leveraged to convert Chinese, Russian or any other non-ASCII characters to ASCII? Here is what the Unidecoder gem does.

TRANSCRIPT

Page 1: How Unidecoder Transliterates UTF-8 to ASCII

Unidecoder

Simon Courtois - @happynoff

Page 2: How Unidecoder Transliterates UTF-8 to ASCII

Transliteration

Page 3: How Unidecoder Transliterates UTF-8 to ASCII

你好

Ni Hao

Page 4: How Unidecoder Transliterates UTF-8 to ASCII

ПРИВЕТ

PRIVIeT

Page 5: How Unidecoder Transliterates UTF-8 to ASCII

How does it work?

Page 6: How Unidecoder Transliterates UTF-8 to ASCII

At the beginning there was ASCII

Page 7: How Unidecoder Transliterates UTF-8 to ASCII

A 65    B 66    C 67

a 97    b 98    c 99

Page 8: How Unidecoder Transliterates UTF-8 to ASCII

a  97  11 00001

A  65  10 00001

Bit values: 64 32 16 8 4 2 1

Page 9: How Unidecoder Transliterates UTF-8 to ASCII

b  98  11 00010

B  66  10 00010

Bit values: 64 32 16 8 4 2 1
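The tables above can be reproduced in Ruby. This is only a small sketch: `toggle_case` is a hypothetical helper added for illustration, and it is only meaningful for the ASCII letters A-Z and a-z.

```ruby
# Print the code and the 7-bit binary form of a few ASCII letters.
%w[A a B b].each do |char|
  code = char.ord                     # "A".ord => 65
  bits = code.to_s(2).rjust(7, "0")   # 7 bits, as in the slides
  puts "#{char} #{code} #{bits}"
end

# Upper-case and lower-case letters differ only by the bit worth 32.
# Hypothetical helper, valid for ASCII letters only.
def toggle_case(char)
  (char.ord ^ 32).chr
end

puts toggle_case("a")   # prints "A"
puts toggle_case("B")   # prints "b"
```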

Page 10: How Unidecoder Transliterates UTF-8 to ASCII

Then… 8-bit computers!

Page 11: How Unidecoder Transliterates UTF-8 to ASCII

So every country had its own

encoding(s)!

Page 12: How Unidecoder Transliterates UTF-8 to ASCII

All was fine until…

Page 13: How Unidecoder Transliterates UTF-8 to ASCII

The World Wide Web

Page 14: How Unidecoder Transliterates UTF-8 to ASCII

UTF-8 to the rescue!

Page 15: How Unidecoder Transliterates UTF-8 to ASCII

Everything on 32 bits?

Page 16: How Unidecoder Transliterates UTF-8 to ASCII

Bad idea

c a f é

Page 17: How Unidecoder Transliterates UTF-8 to ASCII

Bad idea

f é c a

Page 18: How Unidecoder Transliterates UTF-8 to ASCII

Bad idea

\0
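One way to see why a fixed 32-bit encoding is wasteful is to count bytes. The sketch below assumes the "\0" on the slide refers to the zero bytes that a 32-bit encoding puts in front of every ASCII character, which programs that treat "\0" as an end-of-string marker would stumble over. The word "café" is the slide's own example; UTF-32BE is used here as a stand-in for "everything on 32 bits".

```ruby
word  = "caf\u00E9"                 # "café", with é as a single code point
utf32 = word.encode("UTF-32BE")     # every character stored on 4 bytes

puts word.bytesize                  # 5 bytes in UTF-8
puts utf32.bytesize                 # 16 bytes in UTF-32
puts utf32.bytes.count(0)           # 12 of those 16 bytes are zero

# First character, "c": three zero bytes followed by 99.
p utf32.bytes.first(4)              # [0, 0, 0, 99]
```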

Page 19: How Unidecoder Transliterates UTF-8 to ASCII

A better idea

A 65 010 00001

110 XXXXX 10 XXXXXX

1110 XXXX 10 XXXXXX 10 XXXXXX
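The three byte patterns above can be turned into a small encoder. This is a hand-written sketch for illustration, not code from the gem: `utf8_bytes` is a hypothetical name, and it only covers the one-, two- and three-byte forms shown on the slide (no four-byte sequences, no validation of surrogates).

```ruby
# Hypothetical helper: encode a code point below 0x10000 into UTF-8 bytes.
def utf8_bytes(code_point)
  if code_point < 0x80
    [code_point]                                    # 0XXXXXXX
  elsif code_point < 0x800
    [0b11000000 | (code_point >> 6),                # 110XXXXX
     0b10000000 | (code_point & 0b111111)]          # 10XXXXXX
  elsif code_point < 0x10000
    [0b11100000 | (code_point >> 12),               # 1110XXXX
     0b10000000 | ((code_point >> 6) & 0b111111),   # 10XXXXXX
     0b10000000 | (code_point & 0b111111)]          # 10XXXXXX
  else
    raise ArgumentError, "4-byte sequences are not covered by this sketch"
  end
end

p utf8_bytes(65)     # [65]
p utf8_bytes(1055)   # [208, 159], the same as "П".bytes
```

Comparing the result with Ruby's own `String#bytes` is a quick way to check the sketch against the real encoder.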

Page 20: How Unidecoder Transliterates UTF-8 to ASCII

A better idea

110 XXXXX 10 XXXXXX

110 10000 10 011111

Page 21: How Unidecoder Transliterates UTF-8 to ASCII

A better idea

10000 011111 = 1055 = П
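The same decoding can be done by hand in Ruby: strip the `110` and `10` markers from the two bytes of "П" and glue the remaining bits together. The variable names are illustrative only.

```ruby
bytes = "\u041F".bytes                 # "П" => [208, 159]

high = bytes[0] & 0b00011111           # drop the 110 marker => 10000
low  = bytes[1] & 0b00111111           # drop the 10 marker  => 011111

code_point = (high << 6) | low         # 10000 011111
puts code_point                        # 1055
puts [code_point].pack("U")            # П
```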

Page 22: How Unidecoder Transliterates UTF-8 to ASCII

So, how does unidecoder work?

Page 23: How Unidecoder Transliterates UTF-8 to ASCII

How do we go from П to P ?

Page 24: How Unidecoder Transliterates UTF-8 to ASCII

Start from a string like “П”

Page 25: How Unidecoder Transliterates UTF-8 to ASCII

Unpack it

"П".unpack("U")

[1055] = 00000100 00011111

4 (00000100) and 31 (00011111)
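For reference, here is the unpacking step on its own. `"U"` reads a single UTF-8 character as a code point and `"U*"` reads all of them; `String#codepoints` gives the same numbers for a UTF-8 string.

```ruby
p "П".unpack("U")          # [1055]

word = "ПРИВЕТ"
p word.unpack("U*")        # [1055, 1056, 1048, 1042, 1045, 1058]
p word.codepoints          # same list for a UTF-8 string
```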

Page 26: How Unidecoder Transliterates UTF-8 to ASCII

4 → x04 → x04.yml

Index: 0   1   2   …  31

Entry: Ie  Io  Dj  …  P
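The lookup can be pictured as an array indexed by the low byte, with one array per high byte. The sketch below is an assumption-laden stand-in, not the gem's code: `X04` holds only the four entries visible on the slide, `TABLES`, `table_file` and `lookup` are hypothetical names, and the "?" fallback for missing entries is invented for the example.

```ruby
# Hypothetical stand-in for the contents of x04.yml (slide entries only).
X04 = Array.new(256)
X04[0]  = "Ie"
X04[1]  = "Io"
X04[2]  = "Dj"
X04[31] = "P"

TABLES = { 4 => X04 }

# File name inferred from the slide: 4 becomes "x04.yml".
def table_file(section)
  format("x%02x.yml", section)
end

# Return the ASCII replacement, or "?" (an assumption) when unknown.
def lookup(section, offset)
  table = TABLES[section]
  return "?" unless table && table[offset]
  table[offset]
end

puts table_file(4)    # x04.yml
puts lookup(4, 31)    # P
```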

Page 27: How Unidecoder Transliterates UTF-8 to ASCII

How to obtain 4 and 31?

Page 28: How Unidecoder Transliterates UTF-8 to ASCII

unpacked = 1055

00000100 00011111

unpacked >> 8

00000000 00000100 (the low byte 00011111 is shifted out)

4
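The shift can be checked directly in Ruby. `section` is just an illustrative name for the high byte.

```ruby
unpacked = 1055
section  = unpacked >> 8                  # drop the 8 low bits

puts unpacked.to_s(2).rjust(16, "0")      # 0000010000011111
puts section.to_s(2).rjust(16, "0")       # 0000000000000100
puts section                              # 4
```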

Page 29: How Unidecoder Transliterates UTF-8 to ASCII

How to obtain 4 and 31?

Page 30: How Unidecoder Transliterates UTF-8 to ASCII

unpacked = 1055

00000100 00011111

unpacked & 255

00000100 00011111 & 00000000 11111111 = 00000000 00011111

31
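Masking with 255 keeps only the 8 low bits. As a sanity check, the sketch below also shows that the shift and the mask together are the same as dividing by 256 and keeping quotient and remainder; `split_code_point` is a hypothetical helper name.

```ruby
# Hypothetical helper: split a code point into [high byte, low byte].
def split_code_point(code_point)
  [code_point >> 8, code_point & 255]
end

unpacked = 1055
section, offset = split_code_point(unpacked)

puts offset                                         # 31
puts [section, offset] == unpacked.divmod(256)      # true
puts ((section << 8) | offset) == unpacked          # true
```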

Page 31: How Unidecoder Transliterates UTF-8 to ASCII

Brain fried yet?

Advertising time!

Page 32: How Unidecoder Transliterates UTF-8 to ASCII

www.tinci.fr

Web Development

Software Development

Consulting & Support

@tincihq

Page 33: How Unidecoder Transliterates UTF-8 to ASCII

Resources

Characters, Symbols and the Unicode Miracle:

bit.ly/why-utf8

Slides: bit.ly/unidecoder

Unidecoder: github.com/norman/unidecoder

Page 34: How Unidecoder Transliterates UTF-8 to ASCII

Thank you!

Simon Courtois - @happynoff