How Unidecoder transliterates UTF-8 to ASCII

Post on 25-Jun-2015

DESCRIPTION

Slides of my talk at Paris.rb on 2014-11-07. How does UTF-8 work? How can it be leveraged to convert Chinese, Russian, or any other non-ASCII text to ASCII? Here is what the Unidecoder gem does.

TRANSCRIPT

Unidecoder

Simon Courtois - @happynoff

Transliteration

你好

Ni Hao

ПРИВЕТ

PRIVIeT

How does it work?

At the beginning there was ASCII

A 65   B 66   C 67

a 97   b 98   c 99

        64 32 16  8  4  2  1

a  97    1  1  0  0  0  0  1

A  65    1  0  0  0  0  0  1

b  98    1  1  0  0  0  1  0

B  66    1  0  0  0  0  1  0
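The bit patterns on these slides can be reproduced in Ruby. The following is a small sketch, not something taken from the talk; the helper name `ascii_case_bit_differs?` is made up for illustration. It prints each letter with its code and 7-bit binary form, and checks that a lowercase letter and its uppercase form differ only in the bit of weight 32.

```ruby
# Print each character with its ASCII code and its 7-bit binary form.
%w[a A b B].each do |char|
  code = char.ord
  puts format("%s %3d %07b", char, code, code)
end

# Hypothetical helper: true when the two letters differ only in the
# bit of weight 32 (the second bit from the left in the 7-bit form).
def ascii_case_bit_differs?(lower, upper)
  (lower.ord ^ upper.ord) == 32
end
```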

Then… 8-bit computers!

So every country had its own

encoding(s)!

All was fine until…

The World Wide Web

UTF-8 to the rescue!

Everything on 32 bits?

Bad idea

c a f é

Bad idea

f é c a

\0

Bad idea
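One way to see the cost of a fixed 32-bit encoding is to compare byte counts in Ruby. This is an illustrative sketch, assuming the slides' objection is about wasted space and embedded zero bytes; the variable names are arbitrary.

```ruby
word = "caf\u00E9"                    # "café", stored as UTF-8

utf32 = word.encode("UTF-32BE")       # every character takes 4 bytes
utf8_size  = word.bytesize            # 3 ASCII bytes + 2 bytes for "é"
utf32_size = utf32.bytesize           # 4 characters * 4 bytes
zero_bytes = utf32.bytes.count(0)     # padding bytes that equal \0

puts "UTF-8: #{utf8_size} bytes, UTF-32: #{utf32_size} bytes, " \
     "of which #{zero_bytes} are zero"
```

Those zero bytes are a problem for any code that treats `\0` as the end of a string.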

A better idea

A 65 0 1000001

110 XXXXX 10 XXXXXX

1110 XXXX 10 XXXXXX 10 XXXXXX

A better idea

110 XXXXX 10 XXXXXX

110 10000 10 011111

A better idea

10000 011111 = 1055 = П
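The two-byte pattern can be checked by hand in Ruby. The sketch below is not part of Unidecoder; `encode_two_bytes` and `decode_two_bytes` are hypothetical helpers that only cover the `110 XXXXX 10 XXXXXX` case (code points 0x80 to 0x7FF).

```ruby
# Hypothetical helper: encode a code point in 0x80..0x7FF as two UTF-8 bytes.
def encode_two_bytes(code_point)
  unless (0x80..0x7FF).cover?(code_point)
    raise ArgumentError, "expected a code point in 0x80..0x7FF"
  end

  lead         = 0b11000000 | (code_point >> 6)        # 110 + top 5 bits
  continuation = 0b10000000 | (code_point & 0b111111)  # 10  + low 6 bits
  [lead, continuation]
end

# Hypothetical helper: strip the 110 and 10 markers and join the payload bits.
def decode_two_bytes(lead, continuation)
  ((lead & 0b00011111) << 6) | (continuation & 0b00111111)
end

bytes = encode_two_bytes(1055)
puts bytes.map { |b| format("%08b", b) }.join(" ")
```

Ruby's own encoder agrees: `"\u041F".bytes` returns the same two bytes for П.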

So, how does unidecoder work?

How do we go from П to P?

Start from a string like “П”

Unpack it

"П".unpack("U")

[1055] = 00000100 00011111

00000100 = 4, 00011111 = 31

4 → x04 → x04.yml

0 → Ie, 1 → Io, 2 → Dj, …, 31 → P
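The lookup described on these slides can be sketched as follows. This is not Unidecoder's actual code: `X04_TABLE`, `TABLES`, `table_file`, and `lookup` are hypothetical names, the table is an in-memory stand-in for x04.yml holding only the four entries visible on the slide, and the two-digit hexadecimal file name is an assumption based on the single example "x04.yml".

```ruby
# Hypothetical stand-in for x04.yml: 256 slots, with only the
# entries shown on the slide filled in.
X04_TABLE = Array.new(256)
X04_TABLE[0]  = "Ie"
X04_TABLE[1]  = "Io"
X04_TABLE[2]  = "Dj"
X04_TABLE[31] = "P"

# Table number => table. A real implementation would load these from files.
TABLES = { 4 => X04_TABLE }

# Assumption: the file name is the table number as two hex digits.
def table_file(table_number)
  format("x%02x.yml", table_number)
end

# Split the code point into a table number and an index, then look it up.
# Returns nil when this sketch has no entry for the character.
def lookup(char)
  code_point = char.unpack("U").first
  table_number, index = code_point.divmod(256)
  table = TABLES[table_number]
  table && table[index]
end

puts lookup("\u041F")  # П, code point 1055 = 4 * 256 + 31
```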

How to obtain 4 and 31?

unpacked = 1055 = 0000010000011111

unpacked >> 8 = 00000100 (the low byte 00011111 is shifted out)

4

How to obtain 4 and 31?

unpacked = 1055

0000010000011111

unpacked & 255

0000010000011111 & 0000000011111111 = 0000000000011111

31
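Both operations fit in a few lines of Ruby. This sketch only mirrors the arithmetic on the slides; the variable names `high` and `low` are chosen here for illustration.

```ruby
unpacked = "\u041F".unpack("U").first  # code point of П

high = unpacked >> 8    # drop the low 8 bits, keeping the high byte
low  = unpacked & 255   # mask with 11111111, keeping the low byte

puts format("%016b >> 8  = %08b (%d)", unpacked, high, high)
puts format("%016b & 255 = %08b (%d)", unpacked, low, low)
```

Shifting `high` back and combining it with `low`, as in `(high << 8) | low`, gives the original code point again.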

Brain fried yet? Advertising time!

www.tinci.fr

Web Development

Software Development

Consulting & Support

@tincihq

Resources

Characters, Symbols and the Unicode Miracle:

bit.ly/why-utf8

Slides: bit.ly/unidecoder

Unidecoder: github.com/norman/unidecoder

Thank you!

Simon Courtois - @happynoff
