encoding nightmares (and how to avoid them)


PHILADELPHIA SOFTWARE LOCALIZATION MEETUP

Welcome to our kickoff event! For more information, visit the meetup site at: https://www.meetup.com/Philadelphia-Software-Localization-Meetup/

https://www.meetup.com/Philadelphia-Software-Localization-Meetup/events/237857906/


PLAN OF TALK
- Encoding Nightmares
- Character Encoding and the Modern Tower of Babel
- Rise of Unicode
- Rules of Thumb to Avoid Nightmares
- Tricks of the Trade
- Discussion

TAIWANESE WEBSITE FAIL

DZONGKHA (BHUTANESE) AS WINDOWS-1252

CORRUPTED DOCUMENT, DATA LOSS

ENCODING NIGHTMARES CAN LEAD TO …
- Confusion
- Missed deadlines
- Software bugs
- Data corruption
- Embarrassment

CHARACTER ENCODING AND THE MODERN TOWER OF BABEL

BINARY LANGUAGE
- The bit has two states (0, 1)
- Represented by switches that are “on” (1) or “off” (0), like yes and no
- Grouped together, bits represent more states
- n bits = 2^n states
- 8 bits = 1 byte = 256 states
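The relationship between bits and states is easy to check by hand. The snippet below is only an illustrative sketch; the function name `states` is ours and does not come from the talk.

```python
def states(n_bits: int) -> int:
    """Return how many distinct values n_bits binary digits can represent."""
    return 2 ** n_bits


# A few sizes that matter for character encodings:
# 7 bits (ASCII), 8 bits (one byte), 16 bits (two bytes).
for n in (1, 7, 8, 16):
    print(f"{n:>2} bits -> {states(n):>6} states")
```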

BINARY CHARACTER ENCODING
- ASCII character encoding: associates binary strings with English letters, numbers, etc.
- How many are needed?
- ASCII uses 128 distinct 7-bit binary numbers (0 to 127), each mapped to a member of the ASCII character set
- The mapping is defined in the ASCII “code page”
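As a rough illustration of the ASCII mapping, the sketch below prints the 7-bit pattern for a few characters using Python's built-in `ord`. The helper name `ascii_bits` is hypothetical and only serves the example.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit binary pattern ASCII assigns to a single character."""
    code = ord(ch)
    if code > 127:
        # Anything above 127 is outside ASCII and needs another encoding.
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "07b")


for ch in "Hi!":
    print(ch, ord(ch), ascii_bits(ch))

# Pure ASCII text encodes to one byte per character.
print("Hi!".encode("ascii"))
```

Calling `ascii_bits("é")` raises `ValueError`, which is the problem the next slides describe: ASCII has no slot for accented letters.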

EUROPEAN LANGUAGES NEED MORE SPACE

- German, French, and other languages needed more than 128 characters
- Encodings started to use the 8th bit, which doubles the possibilities
- 256 slots in these 8-bit character maps

CHINESE, JAPANESE, KOREAN (CJK) NEED EVEN MORE
- In Chinese, 2,000 distinct characters is often considered a minimum threshold for literacy. Several thousand characters are in common use, and large dictionaries list 40,000 or more, including tens of thousands found only in rare or historical literature.
- Japanese uses about 2,000 ideographic characters borrowed from Chinese, mixed with its own phonetic scripts.
- Modern Korean is written mostly in a phonetic script and relies much less on the broader set of Chinese characters.
- 256 characters is not enough for any of them.

DOUBLE BYTE CHARACTER ENCODINGS

- Two bytes, 16 bits
- 2^16 = 65,536 possible characters
- Some bits are used as signals, so these encodings can’t actually store all 65,536
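To make the two-byte idea concrete, here is a small sketch using the legacy CJK codecs that ship with Python (`shift_jis`, `big5`, `gbk`). The sample characters are our own choice. In these encodings a plain ASCII letter still takes one byte, while a CJK character takes two, and the first of the two bytes has its high bit set, which is one of the "signal" bits mentioned above.

```python
samples = [
    ("shift_jis", "あ"),  # Japanese hiragana
    ("big5", "中"),       # Traditional Chinese
    ("gbk", "中"),        # Simplified Chinese
]

for encoding, ch in samples:
    encoded = ch.encode(encoding)
    ascii_encoded = "A".encode(encoding)
    # The lead byte of a two-byte character is >= 0x80, so a decoder can
    # tell it apart from a one-byte ASCII character.
    print(
        f"{encoding:>9}: {ch!r} -> {encoded.hex(' ')} ({len(encoded)} bytes), "
        f"'A' -> {ascii_encoded.hex(' ')} ({len(ascii_encoded)} byte)"
    )
```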

https://r12a.github.io/scripts/tutorial/part2/ (Creative Commons license)

NUMBER OF ENCODINGS EXPLODES

ISO 646, ASCII, EBCDIC, CP37, CP930, CP1047, ISO 8859-1, ISO 8859-2, ISO 8859-3, ISO 8859-4, ISO 8859-5, ISO 8859-6, ISO 8859-7, ISO 8859-8, ISO 8859-9, ISO 8859-10, ISO 8859-11, ISO 8859-12, ISO 8859-13, ISO 8859-14, ISO 8859-15, CP437, CP720, CP737, CP850, CP852, CP855, CP857, CP858, CP860, CP861, CP862, CP863, CP865, CP866, CP869, CP872, Windows-1250, Windows-1251, Windows-1252, Windows-1253, Windows-1254, Windows-1255, Windows-1256, Windows-1257, Windows-1258, Mac OS Roman, Shift JIS, GB 2312, GBK, Taiwan Big5, KOI8-R, KOI8-U, KOI7 …

1980S: THE COMPUTING TOWER OF BABEL
- The same binary sequence represents entirely different characters in different encodings
- Sharing documents across borders becomes very difficult
- Unintelligible files (a common experience during the early days of the web)
- Hard to create a document containing multiple languages
- Double-byte encodings increase the likelihood of encoding nightmares and add to their complexity
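The "same binary sequence, different characters" problem can be reproduced in a few lines. This sketch decodes one byte, 0xE9, with several of the legacy encodings listed earlier, using the codec names Python gives them. The choice of byte and encodings is ours.

```python
raw = bytes([0xE9])  # a single byte with the 8th bit set

encodings = ("latin-1", "cp437", "koi8_r", "iso8859_7", "cp1251")

# Map each encoding name to the character it assigns to this byte.
decoded = {name: raw.decode(name) for name in encodings}

for name, ch in decoded.items():
    print(f"{name:>10}: {ch}  (U+{ord(ch):04X})")
```

The bytes on disk never change; only the interpretation does, which is why a file with no reliable encoding label is a guess waiting to go wrong.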

THE NIGHTMARE
If you open and save a file with the wrong character encoding, you can change it permanently. Important data may then be irretrievable.
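A minimal sketch of the difference between a harmless misreading and a permanent one, assuming a hypothetical UTF-8 file whose content is the string below. Merely opening the bytes with the wrong encoding is reversible as long as nothing is saved; saving through an encoding that cannot represent the characters is not.

```python
original = "Café 東京"
utf8_bytes = original.encode("utf-8")

# Case 1: the bytes are misread as Latin-1 but never re-saved in a lossy way.
# Latin-1 maps every byte to a character, so the mistake can be undone.
misread = utf8_bytes.decode("latin-1")
recovered = misread.encode("latin-1").decode("utf-8")
print(repr(misread), "->", repr(recovered))

# Case 2: the text is saved in an encoding that lacks the characters.
# Each unsupported character is replaced by "?", and the original is gone.
saved_as_ascii = original.encode("ascii", errors="replace")
lossy = saved_as_ascii.decode("ascii")
print(repr(lossy))
```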

RISE OF UNICODE

WHAT IS UNICODE?
- Global, unified “solution” to the character encoding Tower of Babel
- One big encoding table for all the world’s characters
- Every linguistic symbol has a unique, defined “code point”
- Capacity for more than 1 million characters

UNICODE CONSORTIUM
- Non-profit corporation with global members from industry, government, academia, and other NGOs
- Approves new characters for registration as official Unicode
- Works closely with W3C and ISO

MORE ON UNICODE
- Abstract characters, not glyphs
- Broken into planes (each with 65,536 code points): the Basic Multilingual Plane + 16 other planes
- Room for more than 1 million individual characters (17 × 65,536 = 1,114,112 code points), so lots of room for growth
- A code point is a number, NOT a specific binary encoding of that number (UTF-8 differs from UTF-16)
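The distinction between a code point and its binary encoding can be shown with two characters of our own choosing: the euro sign, which sits in the Basic Multilingual Plane, and an emoji from plane 1. The helper `describe` is a hypothetical name used only for this sketch.

```python
import sys


def describe(ch: str) -> dict:
    """Show one character's code point, plane, and bytes in three encodings."""
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:04X}",
        "plane": cp >> 16,  # each plane holds 65,536 code points
        "utf-8": ch.encode("utf-8").hex(" "),
        "utf-16-be": ch.encode("utf-16-be").hex(" "),
        "utf-32-be": ch.encode("utf-32-be").hex(" "),
    }


for ch in ("€", "😀"):
    print(ch, describe(ch))

# 17 planes of 65,536 code points each give the total capacity.
print(17 * 65536, sys.maxunicode + 1)
```

The code point stays the same in every row; only the byte sequences differ, which is why "it's Unicode" does not by itself say how a file should be read.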

VERSION 9.0 (JUNE 21, 2016)
- Adds exactly 7,500 characters, for a total of 128,172 characters:
  - Osage, a Native American language
  - Nepal Bhasa, a language of Nepal
  - Fulani and other African languages
  - Tangut, a major historic script of China
  - 72 emoji characters, such as new smileys and people, animals and nature, and food and drink

STILL NOT UBIQUITOUS!
- Pre-Unicode encodings are very much still in use:
  - Legacy operating systems
  - Popular applications
  - MS Office products
- And even within Unicode, nightmares are still possible (UTF-?)

4 RULES OF THUMB

LIMIT YOUR APPLICATIONS
- Every app in the chain has the potential to corrupt.
- Make sure nobody opens the file “just to take a look.”

USE UTF-8
- For websites and mobile apps, almost always the right choice
- If a resource uses a different encoding, use iconv or a similar tool to convert it
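For cases where iconv is not at hand, the same conversion can be sketched in Python. This assumes you already know the source encoding; the function name `to_utf8` and the sample text are ours.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8.

    Decoding is strict, so a wrong source_encoding tends to fail loudly
    instead of silently producing garbage.
    """
    return data.decode(source_encoding).encode("utf-8")


# Roughly what `iconv -f WINDOWS-1252 -t UTF-8` does on the command line.
legacy = "Grüße".encode("cp1252")
converted = to_utf8(legacy, "cp1252")
print(legacy.hex(" "), "->", converted.hex(" "))
```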

KNOW YOUR METADATA
Older HTML form:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>

HTML5 form:
<head>
<meta charset="UTF-8">
</head>
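To check what a page actually declares, something like the following sketch can read either form of the meta tag using Python's standard `html.parser`. The class and function names are hypothetical, and a real browser applies more rules than this (HTTP headers, for instance, take precedence over the tag).

```python
from html.parser import HTMLParser
from typing import Optional


class CharsetSniffer(HTMLParser):
    """Remember the first charset declared in a <meta> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if "charset" in attributes:
            # HTML5 form: <meta charset="UTF-8">
            self.charset = attributes["charset"]
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # Older form: content="text/html; charset=UTF-8"
            for part in (attributes.get("content") or "").split(";"):
                key, _, value = part.strip().partition("=")
                if key.lower() == "charset" and value:
                    self.charset = value


def declared_charset(html: str) -> Optional[str]:
    sniffer = CharsetSniffer()
    sniffer.feed(html)
    return sniffer.charset


print(declared_charset('<head><meta charset="UTF-8"></head>'))
```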

KNOW THE DIFFERENCE BETWEEN CHARACTERS AND GLYPHS
- Technically, Unicode encodes characters, not glyphs or fonts.
- Characters can be thought of as the abstract base shapes, while glyphs and fonts are particular appearances of those characters, including combinations of “root” characters that appear as one symbol, like the é.
- This distinction can be important when you are diagnosing a character display problem, but the boundary can be fuzzy. Ä, for example, is a complete character with its own code point (U+00C4), but it can also be stored as two code points that combine the base character A with a combining umlaut (U+0308).
- You may have the correct encoding, but the particular font you are using may not have the glyphs needed to display the encoded character.
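The two ways of storing Ä look identical on screen but compare as different strings. The sketch below uses Python's standard `unicodedata` module to show the difference and to convert between the two forms; the variable names are ours.

```python
import unicodedata

precomposed = "\u00C4"   # Ä as a single code point
decomposed = "A\u0308"   # A followed by a combining diaeresis

# Same appearance, different code point sequences.
print(precomposed, decomposed, precomposed == decomposed)
print(len(precomposed), len(decomposed))

# Normalization converts between the forms so comparisons behave.
composed_again = unicodedata.normalize("NFC", decomposed)
split_again = unicodedata.normalize("NFD", precomposed)
print(composed_again == precomposed, split_again == decomposed)
```

When two strings that look the same fail to match in a search or a database lookup, mismatched normalization forms are worth ruling out before blaming the encoding.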

TRICKS OF THE TRADE

CHECK AND CONVERT ENCODING
- Some text editors guess the encoding, and stand-alone utilities (like iconv) convert from one encoding to another
- Libraries are available (Mozilla Universal Charset Detector, International Components for Unicode)
- They can often guess correctly, but they are imperfect
- Some tools allow you to check large sets of files in batches
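The detection libraries named above rely on statistical models. As a much cruder illustration of why guessing is imperfect, the sketch below simply tries a list of candidate encodings in order and returns the first that decodes without error. The function name and the default candidate list are assumptions made for this example, not how any of those libraries work.

```python
from typing import Optional, Sequence


def guess_encoding(
    data: bytes,
    candidates: Sequence[str] = ("utf-8", "cp1252", "latin-1"),
) -> Optional[str]:
    """Return the first candidate encoding that decodes data without error."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


print(guess_encoding("日本語".encode("utf-8")))
print(guess_encoding("Grüße".encode("cp1252")))
# Pure ASCII bytes are valid in every candidate, so the answer is
# just whichever encoding was listed first.
print(guess_encoding(b"plain text"))
```

A successful decode only proves the bytes are legal in that encoding, not that the result is the text the author meant, so the order of candidates matters a great deal.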

UTF-8 WITH BOM?
- BOM = Byte Order Mark
- Essentially a signal to the receiver of a message that the string is Unicode
- Can be prepended to files by otherwise “neutral” apps like Windows Notepad
- Can trip up various programming languages and introduce garbage (PHP, for example)
- Could show up in a text editor (if misinterpreted) as this series of characters: ï»¿
- Use an editor (such as Sublime Text) or an encoding converter to convert to straight UTF-8 (without BOM)
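The three stray characters are the UTF-8 BOM bytes (EF BB BF) read as Windows-1252. The sketch below shows that, then removes the BOM in two ways using Python's standard `codecs` module. The helper name `strip_utf8_bom` is ours.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Misreading the BOM as Windows-1252 produces the familiar garbage.
print(with_bom.decode("cp1252"))

# Option 1: strip the bytes before handing the data to other tools.
print(strip_utf8_bom(with_bom))

# Option 2: decode with "utf-8-sig", which drops a BOM when it finds one.
print(repr(with_bom.decode("utf-8-sig")))
```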

SPREADSHEET TIP
- Be careful with CSV and Excel
- Excel often mangles CSV encoding
- Use Google Docs (or a Mac) to save the CSV as Excel and then convert back to CSV
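When the CSV is generated by your own code, one commonly reported workaround (an assumption on our part, not something from the talk) is to write it as UTF-8 with a BOM, since Excel tends to recognize such files as UTF-8. This trades against the BOM advice above, so it is best reserved for files whose only consumer is Excel. The function name and sample rows are hypothetical.

```python
import csv
import io


def csv_bytes_for_excel(rows) -> bytes:
    """Serialize rows to CSV and encode them as UTF-8 with a leading BOM."""
    buffer = io.StringIO(newline="")  # keep the csv module's own line endings
    csv.writer(buffer).writerows(rows)
    # "utf-8-sig" prepends the BOM bytes EF BB BF when encoding.
    return buffer.getvalue().encode("utf-8-sig")


data = csv_bytes_for_excel([["name", "city"], ["José", "東京"]])
print(data[:3].hex(" "))
print(data.decode("utf-8-sig"))
```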

TOOLS
Will post to our discussion page at the Meetup site. Add your own!

DISCUSSION

Questions? Tips? Horror Stories?

THANK YOU!

Merci – Gracias – Danke

Grazie – Obrigado شكرا 谢谢 감사합니다 ありがとう

www.mtmlinguasoft.com