encoding nightmares (and how to avoid them)


Upload: kenneth-farrall

Post on 11-Apr-2017

TRANSCRIPT

Page 1: Encoding Nightmares (and how to avoid them)

ENCODING NIGHTMARES
And how to avoid them

Page 2: Encoding Nightmares (and how to avoid them)

PHILADELPHIA SOFTWARE LOCALIZATION MEETUP

Welcome to our kickoff event! For more information, visit the meetup site at: https://www.meetup.com/Philadelphia-Software-Localization-Meetup/

Page 3: Encoding Nightmares (and how to avoid them)

PLAN OF TALK

Encoding Nightmares
Character Encoding and the Modern Tower of Babel
Rise of Unicode
Rules of Thumb to Avoid Nightmares
Tricks of the Trade
Discussion

Page 4: Encoding Nightmares (and how to avoid them)

TAIWANESE WEBSITE FAIL

Page 5: Encoding Nightmares (and how to avoid them)

DZONGKHA (BHUTANESE) AS WINDOWS-1252

Page 6: Encoding Nightmares (and how to avoid them)
Page 7: Encoding Nightmares (and how to avoid them)

CORRUPTED DOCUMENT, DATA LOSS

Page 8: Encoding Nightmares (and how to avoid them)

ENCODING NIGHTMARES CAN LEAD TO …

Confusion
Missed deadlines
Software bugs
Data corruption
Embarrassment

Page 9: Encoding Nightmares (and how to avoid them)

CHARACTER ENCODING AND THE MODERN TOWER OF BABEL

Page 10: Encoding Nightmares (and how to avoid them)

BINARY LANGUAGE

The bit: two states (0, 1)
Represented by switches that are “on” (1) or “off” (0) (yes, no)
Grouped together, bits represent more states
n bits = 2^n states
8 bits = 1 byte = 256 states
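The doubling rule is easy to check in a few lines of Python. This is a minimal sketch; the helper name `states` is made up for illustration.

```python
def states(n_bits: int) -> int:
    """Number of distinct values that n_bits bits can represent."""
    return 2 ** n_bits


# 7 bits is the ASCII range, 8 bits is one byte, 16 bits is two bytes.
for n in (1, 7, 8, 16):
    print(f"{n:2d} bits -> {states(n):,} states")
```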

Page 11: Encoding Nightmares (and how to avoid them)

BINARY CHARACTER ENCODING

ASCII character encoding
Associates binary strings with English letters, numbers, punctuation, etc.
How many needed?
Uses 128 distinct binary numbers (0 to 127, 7 bits), each mapped to a member of the ASCII character set
Defined in the ASCII “code page”
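To make the mapping concrete, the sketch below prints the ASCII number and 7-bit pattern for each character of a short string. The sample text is arbitrary.

```python
text = "Hi!"

for ch in text:
    code = ord(ch)  # the number ASCII assigns to this character
    print(ch, code, format(code, "07b"))  # 7 bits are enough for 0..127

encoded = text.encode("ascii")
# Every ASCII byte has its top (8th) bit clear.
assert all(byte < 128 for byte in encoded)
```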

Page 12: Encoding Nightmares (and how to avoid them)

EUROPEAN LANGUAGES NEED MORE SPACE

German, French, and other languages needed more than 128 characters
Started to use the 8th bit (doubles the possibilities)
256 positions in these 8-bit character maps
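A small sketch of why the 8th bit mattered: the é in "café" has no ASCII number, but it fits in a single byte in an 8-bit map such as ISO 8859-1. The sample word is arbitrary.

```python
word = "café"

try:
    word.encode("ascii")
    fits_ascii = True
except UnicodeEncodeError:
    # é has no slot among ASCII's 128 positions.
    fits_ascii = False

latin1_bytes = word.encode("iso-8859-1")
print(fits_ascii)        # whether 7 bits were enough
print(latin1_bytes)      # one byte per character
print(latin1_bytes[-1])  # the byte for é, which uses the 8th bit
```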

Page 13: Encoding Nightmares (and how to avoid them)

CHINESE, JAPANESE, KOREAN (CJK) NEED EVEN MORE

In Chinese, 2,000 distinct characters is often considered a minimum threshold for literacy.
Large dictionaries record more than 40,000 characters, many of them found only in rare, historical literature.
Japanese uses about 2,000 ideographic characters borrowed from Chinese, mixed with its own phonetic scripts.
Modern Korean relies mainly on a phonetic script and much less on the broader set of Chinese characters.
256 characters is not enough for any of them.

Page 14: Encoding Nightmares (and how to avoid them)

DOUBLE BYTE CHARACTER ENCODINGS

Two bytes, 16 bits
2^16 = 65,536 possible characters
Some bits are used as signals, so these encodings can’t actually store all 65,536

https://r12a.github.io/scripts/tutorial/part2/ (Creative Commons license)
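As an illustration of the "signal" idea, the sketch below encodes a mixed string with Shift JIS, one of the double-byte encodings. The ASCII letter takes one byte, and each kanji takes two, with the first byte of each pair set above the ASCII range to signal that a second byte follows. The sample string is arbitrary.

```python
text = "A日本"
encoded = text.encode("shift_jis")

print(len(text), "characters ->", len(encoded), "bytes")
print(encoded.hex(" "))

# Byte 0 is plain ASCII; byte 1 is a lead byte announcing a two-byte character.
ascii_byte, first_lead_byte = encoded[0], encoded[1]
```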

Page 15: Encoding Nightmares (and how to avoid them)

NUMBER OF ENCODINGS EXPLODES

ISO 646, ASCII, EBCDIC, CP37, CP930, CP1047, ISO 8859-1, ISO 8859-2, ISO 8859-3, ISO 8859-4, ISO 8859-5, ISO 8859-6, ISO 8859-7, ISO 8859-8, ISO 8859-9, ISO 8859-10, ISO 8859-11, ISO 8859-12, ISO 8859-13, ISO 8859-14, ISO 8859-15, CP437, CP720, CP737, CP850, CP852, CP855, CP857, CP858, CP860, CP861, CP862, CP863, CP865, CP866, CP869, CP872, Windows-1250, Windows-1251, Windows-1252, Windows-1253, Windows-1254, Windows-1255, Windows-1256, Windows-1257, Windows-1258, Mac OS Roman, Shift JIS, GB 2312, GBK, Taiwan Big5, KOI8-R, KOI8-U, KOI7 …

Page 16: Encoding Nightmares (and how to avoid them)

1980S: THE COMPUTING TOWER OF BABEL

The same binary sequence represents entirely different characters in different encodings
Sharing documents across borders becomes very difficult
Unintelligible files (a common experience during the early days of the web)
Hard to create a document containing multiple languages
Double-byte encodings increase the likelihood of encoding nightmares and add complexity to them
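The first point can be reproduced directly. The sketch below decodes the same three bytes with five of the encodings from the previous slide; the bytes were chosen arbitrarily (they spell "äöü" in ISO 8859-1).

```python
raw = bytes([0xE4, 0xF6, 0xFC])

code_pages = ("iso-8859-1", "iso-8859-5", "iso-8859-7", "cp437", "koi8_r")

# Same bytes, five different readings.
readings = {name: raw.decode(name) for name in code_pages}
for name, reading in readings.items():
    print(f"{name:12s} {reading}")
```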

Page 17: Encoding Nightmares (and how to avoid them)

THE NIGHTMARE

If you open and save a file with the wrong character encoding, you can change it permanently. Important data may then be irretrievable.
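The difference between a recoverable mistake and a permanent one can be sketched as follows. It assumes a UTF-8 file opened with the wrong encoding and then saved; the sample text and variable names are illustrative only.

```python
original = "谢谢"
utf8_bytes = original.encode("utf-8")

# Wrong guess, but lossless: every byte maps to some ISO 8859-1 character,
# so the garbled text can still be turned back into the original bytes.
mojibake = utf8_bytes.decode("iso-8859-1")
recovered = mojibake.encode("iso-8859-1").decode("utf-8")

# Wrong guess, lossy: bytes the decoder cannot read are swapped for U+FFFD
# when the file is opened, and saving writes those replacements out.
damaged = utf8_bytes.decode("ascii", errors="replace")
resaved = damaged.encode("utf-8")

print(recovered == original)  # the first mistake can be undone
print(resaved == utf8_bytes)  # the second cannot: the original bytes are gone
```

Whether a real application behaves like the first or the second case depends on how it handles bytes it cannot decode, which is why the safest policy is not to save at all.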

Page 18: Encoding Nightmares (and how to avoid them)

RISE OF UNICODE

Page 19: Encoding Nightmares (and how to avoid them)

WHAT IS UNICODE?

Global, unified “solution” to the character encoding Tower of Babel
One big encoding table for all the world’s characters
All linguistic symbols have a unique, defined “code point”
Capacity for more than 1 million characters

Page 20: Encoding Nightmares (and how to avoid them)

UNICODE CONSORTIUM

Non-profit corporation with global members from industry, government, academia, and other NGOs
Approves new characters for registration as official Unicode
Works closely with W3C and ISO

Page 21: Encoding Nightmares (and how to avoid them)

MORE ON UNICODE

Abstract characters, not glyphs
Broken into 17 planes (each with 65,536 code points): the Basic Multilingual Plane + 16 other planes
Room for more than 1 million individual characters, so lots of room for growth
A code point is NOT a specific binary encoding of that number (UTF-8 differs from UTF-16)
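The sketch below separates the two ideas: a character's code point (and the plane it falls in) versus the bytes that UTF-8 and UTF-16 use to store it. The helper name `describe` and the sample characters are illustrative.

```python
import sys

PLANES = 17
CODE_POINTS_PER_PLANE = 65_536
TOTAL_CODE_POINTS = PLANES * CODE_POINTS_PER_PLANE

# Python exposes the highest code point, which matches the 17-plane total.
assert TOTAL_CODE_POINTS == sys.maxunicode + 1


def describe(ch: str) -> dict:
    """Code point, plane, and two different byte encodings of one character."""
    code_point = ord(ch)
    return {
        "code_point": f"U+{code_point:04X}",
        "plane": code_point // CODE_POINTS_PER_PLANE,
        "utf-8": ch.encode("utf-8").hex(" "),
        "utf-16": ch.encode("utf-16-be").hex(" "),
    }


for ch in ("A", "é", "谢", "😀"):
    print(ch, describe(ch))
```

The code point stays the same in every row; only the stored bytes differ between the two encodings.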

Page 22: Encoding Nightmares (and how to avoid them)

VERSION 9.0 (JUNE 21 2016)

Adds exactly 7,500 characters, for a total of 128,172 characters:
Osage, a Native American language
Nepal Bhasa, a language of Nepal
Fulani and other African languages
Tangut, a major historic script of China
72 emoji characters, such as new smilies and people, animals and nature, and food and drink

Page 23: Encoding Nightmares (and how to avoid them)

STILL NOT UBIQUITOUS!

Pre-Unicode encodings are very much still in use:
Legacy operating systems
Popular applications
MS Office products
And even within Unicode, nightmares are still possible (UTF-?)

Page 24: Encoding Nightmares (and how to avoid them)

4 RULES OF THUMB

Page 25: Encoding Nightmares (and how to avoid them)

LIMIT YOUR APPLICATIONS

Every app in the chain has the potential to corrupt.
Make sure nobody opens the file “just to take a look.”

Page 26: Encoding Nightmares (and how to avoid them)

USE UTF-8

For websites and mobile apps, almost always the right choice
If a resource uses a different encoding, use iconv or a similar tool to convert
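Where iconv is not at hand, the same conversion can be sketched in Python. This assumes you already know the source encoding; the function name `convert_to_utf8` is hypothetical. It writes to a separate file, so the original is never modified, and it works on raw bytes so line endings are left alone.

```python
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, source_encoding: str) -> None:
    """Re-encode src (in source_encoding) as UTF-8 and write it to dst."""
    raw = src.read_bytes()
    # Raises UnicodeDecodeError if the bytes are not valid in source_encoding,
    # which is safer than silently writing garbage.
    text = raw.decode(source_encoding)
    dst.write_bytes(text.encode("utf-8"))
```

A call might look like `convert_to_utf8(Path("legacy.txt"), Path("legacy.utf8.txt"), "cp1252")`, where both file names are placeholders.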

Page 27: Encoding Nightmares (and how to avoid them)

KNOW YOUR METADATA

HTML 4 form:
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>

HTML5 form:
<head>
  <meta charset="UTF-8">
</head>
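To audit what a page claims about itself, a rough sketch like the one below can pull the declared charset out of either form. The regular expression and the function name `declared_charset` are illustrative, not a full HTML parser, and a charset sent in the HTTP Content-Type header would take precedence over the tag. Only the first 1024 bytes are searched, which is where the HTML specification expects the declaration to appear.

```python
import re

# Matches charset=... inside a <meta> tag, with or without quotes.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def declared_charset(html_bytes: bytes):
    """Return the charset named in a <meta> tag, lowercased, or None."""
    match = META_CHARSET.search(html_bytes[:1024])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

Comparing this declared value with the encoding the file is actually saved in is a quick way to catch a mismatch before a browser does.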

Page 28: Encoding Nightmares (and how to avoid them)

KNOW THE DIFFERENCE BETWEEN CHARACTERS AND GLYPHS

Technically, Unicode encodes characters, not glyphs or fonts.
Characters can be thought of as the base shape, while glyphs and fonts are particular appearances of those characters, including combinations of “root” characters that appear as one symbol, like the é.
This distinction can be important when you are diagnosing a character display problem, but the boundary can be fuzzy. Ä, for example, is a complete character with its own code point, but it can also be stored as two code points that combine the base character A with a combining umlaut.
You may have the correct encoding, but the particular font you are using may not have the glyphs needed to display the encoded character.
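The two ways of storing Ä can be seen with Python's standard `unicodedata` module. In this sketch the two strings look identical on screen but compare as different until they are normalized to the same form; the helper name `same_text` is illustrative.

```python
import unicodedata

composed = "\u00C4"     # Ä as one code point
decomposed = "A\u0308"  # A followed by a combining diaeresis (umlaut)


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to composed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(len(composed), len(decomposed))   # different numbers of code points
print(composed == decomposed)           # a naive comparison sees two strings
print(same_text(composed, decomposed))  # normalized comparison sees one
```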

Page 29: Encoding Nightmares (and how to avoid them)

TRICKS OF THE TRADE

Page 30: Encoding Nightmares (and how to avoid them)

CHECK AND CONVERT ENCODING

Some text editors and standalone utilities can guess the encoding, convert it (iconv, for example), or both
Libraries are available (Mozilla Universal Charset Detector, International Components for Unicode)
They can often guess correctly, but they are imperfect
Some tools allow you to check large sets of files in batches
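The detector libraries named above use statistical models. The sketch below is a much cruder stand-in that uses only the Python standard library: it checks for a byte order mark, then tests whether the bytes are valid UTF-8, then tries a caller-supplied list of fallbacks. The function name `guess_encoding` and the default fallback are assumptions, and it ignores UTF-32. Like the real tools, it can guess wrong.

```python
import codecs


def guess_encoding(data: bytes, fallbacks=("cp1252",)) -> str:
    """Return a plausible encoding name for data, or "unknown"."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        # Valid UTF-8 is rarely produced by accident in legacy encodings.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    for name in fallbacks:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return "unknown"
```

Running it over every file in a folder gives a quick batch report of which files are not UTF-8 yet.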

Page 31: Encoding Nightmares (and how to avoid them)

UTF-8 WITH BOM?

BOM = Byte Order Mark
Essentially a signal to the receiver that the string is Unicode
Can be prepended to files by otherwise “neutral” apps like Windows Notepad
Can trip up various programming languages and introduce garbage (PHP, for example)
Could show up in a text editor (if misinterpreted) as a series of stray characters, such as ï»¿
Use an editor (such as Sublime Text) or an encoding converter to convert to straight UTF-8
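If no suitable editor is available, the three BOM bytes can be removed in code. A minimal sketch, with a hypothetical helper name `strip_utf8_bom` and a made-up sample file body:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if it has one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "<?php echo 'hi';".encode("utf-8")

# What the BOM looks like when an editor misreads it as Windows-1252.
print(with_bom[:3].decode("cp1252"))

# Python's "utf-8-sig" codec drops the BOM while decoding.
print(with_bom.decode("utf-8-sig"))
print(strip_utf8_bom(with_bom))
```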



Page 32: Encoding Nightmares (and how to avoid them)

SPREADSHEET TIP

Careful with CSV and Excel
Excel often mangles CSV encoding
Use Google Docs (or a Mac) to save CSV as Excel and then convert back to CSV
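When the CSV is produced by a script, a different workaround from the round trip above is to write it as UTF-8 with a BOM, which Excel commonly treats as a hint that the file is UTF-8. Treat this as an assumption to test against your own Excel version, and note that it deliberately adds the BOM that the previous slide warns about, so it only suits files headed for Excel. The function names are illustrative.

```python
import csv


def write_csv_for_excel(path, rows) -> None:
    """Write rows as UTF-8 with a leading BOM ("utf-8-sig")."""
    with open(path, "w", encoding="utf-8-sig", newline="") as handle:
        csv.writer(handle).writerows(rows)


def read_csv_utf8(path) -> list:
    """Read a UTF-8 CSV, tolerating a leading BOM if one is present."""
    with open(path, encoding="utf-8-sig", newline="") as handle:
        return list(csv.reader(handle))
```

Reading with "utf-8-sig" works whether or not the BOM is there, so the same reader handles files that came back from Excel and files that never went near it.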

Page 33: Encoding Nightmares (and how to avoid them)

TOOLS

Will post to our discussion page at the Meetup site. Add your own!

Page 34: Encoding Nightmares (and how to avoid them)

DISCUSSION

Questions? Tips? Horror Stories?

Page 35: Encoding Nightmares (and how to avoid them)

THANK YOU!

Merci – Gracias – Danke

Grazie – Obrigado شكرا 谢谢 당신을 감사하십시오 ありがとう

www.mtmlinguasoft.com