Using regular expressions to handle non-ASCII text

TRANSCRIPT

  • Slide 1
  • Using regular expressions to handle non-ASCII text
  • Slide 2
  • A motivating example
  • Slide 3
  • Program which puts data into database Create a simple MySQL table Write a program which accepts a string from a form and appends it to the database table We will use it on the next few slides
  • Slide 4
  • First interaction We use the form to submit the string Fred is here Checking the database shows that the string was correctly stored
  • Slide 5
  • Second interaction We use the form to submit the string ‘Fred is here’ said Tom The program claims it handled the string correctly But, checking the database shows that the slanted apostrophes look funny The problem stems from the way the slanted apostrophes are encoded The confusion is because ’ is not a standard ASCII character It is not the same as the basic apostrophe ' which is a standard ASCII character
  • Slide 6
  • Third interaction The problem is even worse if we are developing a website to support customers who use languages besides English Suppose we use the form to submit a Chinese string The program claims it handled the string correctly But, checking the database shows something strange The Chinese characters have been converted to HTML entity numbers
  • Slide 7
  • Fourth interaction Actually, the treatment of Chinese is not as bad as what happens if we use the program to handle other Latin-script languages Suppose we use the form to submit the Polish word znaków The program claims it handled the string correctly But, checking the database shows something strange about the way the letter ó is handled
  • Slide 8
  • An interlude To see the root of the problem, we need to understand how characters are handled We will return to the use of regular expressions in website programming, but first we must look at character encoding
  • Slide 9
  • Character encoding
  • Slide 10
  • A file containing a Polish word (part 1) Let's use Notepad to create a new file containing the Polish word znaków (which means symbols, signs or characters)
  • Slide 11
  • A file containing a Polish word (part 2) Notepad allows us to save the file in different formats, which it calls ANSI, Unicode, Unicode big-endian and UTF-8
  • Slide 12
  • Comparing the formats We can use XVI-32 to examine the different files The ANSI file contains 6 bytes The so-called Unicode file contains 14 bytes although Microsoft calls the format used in this file 'Unicode', the proper name for the format is UTF-16LE, where LE means 'little-endian' The so-called Unicode big-endian file also contains 14 bytes the proper name for the format used in this file is UTF-16BE, where BE means 'big-endian' The UTF-8 file contains 10 bytes ANSI was developed for English script UTF-16LE, UTF-16BE and UTF-8 are implementations of an approach called Unicode, which was developed to support all language scripts Let's examine these four formats
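If XVI-32 is not to hand, a few lines of PHP can produce the same kind of byte dump. This is a minimal sketch, not part of the slides; the file name znakow-ansi.txt is a placeholder for whichever of the four Notepad files you want to inspect.

    <?php
    // Minimal hex dump: print the length of a file and its bytes in hex,
    // so the four Notepad formats can be compared byte by byte.
    $bytes = file_get_contents('znakow-ansi.txt');   // placeholder file name
    echo strlen($bytes), " bytes\n";
    echo trim(strtoupper(chunk_split(bin2hex($bytes), 2, ' '))), "\n";
    // For the ANSI file this prints: 6 bytes / 7A 6E 61 6B F3 77
    ?>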
  • Slide 13
  • The ANSI format
  • Slide 14
  • Viewing the ANSI file in XVI32 The file contains 6 bytes, one for each character The English characters z, n, a, k and w are encoded using ASCII codes - byte values in the range 00 to 7F But the code for ó is based on an extension to ASCII called Windows-1252, which uses byte values in the range 80 to FF thus, ó is represented as F3 Extensions to ASCII which use the values 80 to FF for various purposes are often called "code pages", and Windows-1252 (also known as Microsoft Windows Latin-1) is often called CP-1252. By the way, Windows-1252 is often confused with a similar, but slightly different, character code, ISO 8859-1 (a.k.a. ISO Latin-1)
  • Slide 15
  • Code pages The CP-1252 or Microsoft Windows Latin-1 "code page" is only one of many different ways of using byte values in the range 80 through FF Different code pages support different languages. CP-1251, for example, uses byte values 80 through FF for Cyrillic, the alphabet used in Russian, Bulgarian, Serbian, Macedonian, Kazakh, Kyrgyz, Tajik,... When I lived in Thailand, the computers all used CP-874; this supports the Thai alphabet In CP-874, a byte value that represents a Cyrillic letter in CP-1251 actually represents ๒ (the Thai numeral for two - it is pronounced 'sawng') Using different code pages was OK when files generated in one culture were never used outside that culture But it's no good when a file generated in a country whose computers use one code page is opened in a country where computers use another code page It is also a problem when one needs to deal with different languages in one document This motivated the development of Unicode
  • Slide 16
  • Unicode
  • Slide 17
  • Code points Unicode is an abstract code as we shall see later, Unicode can be implemented in various ways In Unicode, each symbol is represented by an abstract code point A code point is usually written in the form U+ followed by a sequence of hex digits, for example U+007A The U+ is actually meant to remind us of the set union symbol, referring to the fact that Unicode is meant to be a union of character sets Unicode provides enough code points for 1,114,112 symbols However, most of these code points are still unused which is why its promoters are reasonably confident that it will always provide enough code points to support all symbols likely to be developed or, at least, all symbols developed by members of our species!
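As a quick aside (not on the slides), PHP can move between a character and the U+ notation for its code point. A minimal sketch, assuming PHP 7.2 or later with the mbstring extension, and assuming the script itself is saved as UTF-8:

    <?php
    // Convert a character to its code point and back (mbstring assumed).
    printf("U+%04X\n", mb_ord('ó', 'UTF-8'));   // U+00F3
    echo mb_chr(0x007A, 'UTF-8'), "\n";         // z
    ?>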
  • Slide 18
  • Planes Unicode is intended to cope with all symbols existing or likely to be developed It provides enough code points for 1,114,112 symbols This huge set of code points is divided into 17 "planes", each of which contains 65,536 (2^16) code points Plane 0, the Basic Multilingual Plane (BMP), contains code points for almost (but not quite) all symbols used in current languages Plane 1, the Supplementary Multilingual Plane (SMP), contains historic scripts (hieroglyphs, cuneiform, Minoan Linear B), musical notation, mathematical alpha-numerics, emoticons and game symbols (playing cards, dominoes). Plane 2, the Supplementary Ideographic Plane (SIP), is used for some Chinese, Japanese and Korean symbols that are not in Plane 0 Planes 3-13 are still unused Plane 14, the Supplementary Special-purpose Plane (SSP), contains special-purpose non-graphical characters Planes 15 and 16, the Supplementary Private Use Areas, are available for use by entities outside the Unicode Consortium
  • Slide 19
  • Writing code points Code points in the Basic Multilingual Plane (BMP) are written as U+ followed by four hex digits for example, the code point for the letter z is written as U+007A Code points outside the BMP are written using U+ followed by five or six hex digits, as required, for example, the LANGUAGE-TAG character in Plane 14 is written as U+E0001 while one private-use character in Plane 16 is written as U+10FFFD
  • Slide 20
  • Blocks Within the Basic Multilingual Plane, code points are grouped in contiguous ranges called blocks Each block has its own unique and descriptive name Example blocks: Basic Latin, Latin-1 Supplement, Greek and Coptic, Cyrillic, Armenian, Hebrew, Arabic, Arabic Supplement, Tibetan, Ogham Blocks contain contiguous code points but may be of different sizes While the Basic Latin block contains 128 code points, the Cyrillic block contains 256 code points but the Armenian block contains only 96 and the Ogham block contains only 32 code points
  • Slide 21
  • Where to find details of these blocks Unicode.org maintains a list of all blocks at http://www.unicode.org/charts/ Clicking on a block name gives you a PDF file for the block For example, the next slide shows the PDF file for the Ogham block
  • Slide 22
  • Example PDF file for a Unicode block The PDF file for a Unicode block gives the following information for each symbol in the block a picture of the symbol its code point a descriptive name for the symbol
  • Slide 23
  • Backward compatibility Unicode was designed to be compatible with ASCII Thus, the Basic Latin block contains all 128 ASCII standard characters Each ASCII code maps directly to a Unicode code point in this block For example, the letter z, whose ASCII code is 7A, has the code point U+007A The letter n, whose ASCII code is 6E, has the code point U+006E And so on
  • Slide 24
  • Backward compatibility (contd.) The Latin-1 Supplement block also contains 128 code points Some, but not all, of these code points are similar to the codes in the Windows-1252 (Microsoft Windows Latin-1) code page Those code points in the Latin-1 Supplement block which do map directly to Windows-1252 codes include the code points for Latin letters with accents and other common diacritical marks such as umlauts Thus, the accented letter ó, which has the Windows-1252 code of F3, has the Unicode code point U+00F3
  • Slide 25
  • Implementations of Unicode Unicode is an abstract code Various implementations include UTF-32 UTF-16 UTF-8
  • Slide 26
  • UTF-32 UTF-32 is a fixed-length encoding of Unicode Every code point is directly encoded using 32 bits, or four bytes
  • Slide 27
  • UTF-16 UTF-16 is a 16-bit encoding of Unicode Unlike UTF-32, it is a variable-length encoding code points are encoded with one or two 16-bit code-units, that is, in UTF-16 a code point is encoded as either two or four bytes
  • Slide 28
  • UTF-8 Like UTF-16, UTF-8 is a variable-length encoding It uses different numbers of bytes for different code points Code points for the most common characters, the English letters, are represented as single bytes Less common characters are represented as two bytes Rarer characters are represented as three or more bytes This means that, for text dominated by ASCII characters, UTF-8 is the most space-efficient representation of Unicode We will see it in more detail later
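To make the size difference concrete, here is a minimal sketch (not from the slides, mbstring assumed, script saved as UTF-8) that measures the same Polish word in the three encodings:

    <?php
    // Compare the byte length of the same string in UTF-8, UTF-16 and UTF-32.
    $s = 'znaków';
    echo strlen($s), " bytes in UTF-8\n";                                   // 7
    echo strlen(mb_convert_encoding($s, 'UTF-16LE', 'UTF-8')), " bytes in UTF-16LE\n";  // 12
    echo strlen(mb_convert_encoding($s, 'UTF-32LE', 'UTF-8')), " bytes in UTF-32LE\n";  // 24
    ?>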
  • Slide 29
  • Examining the Notepad formats To put some flesh on this, let's examine the various formats in which Notepad stores the small file we saw earlier
  • Slide 30
  • The so-called Unicode format in Notepad As we shall see, this format is actually a form of UTF-16 Its proper name is UTF-16LE
  • Slide 31
  • Viewing the "Unicode" file in XVI32 (part 1) The file contains 14 bytes The first two bytes contain a byte order mark (BOM), which will be explained on a later slide Then, each character is encoded as two bytes
  • Slide 32
  • Viewing the "Unicode" file in XVI32 (part 2) The byte order mark is stored at the start of a file to tell programs whether the file is written in little-endian or big-endian format The Unicode code point for the byte order mark is U+FEFF Note that the BOM is actually stored is our file as FFFE FFFE is the little-endian version of FEFF, so it tells us that the file is stored in little-endian format So we know that each character in the rest of the file is encoded in little-endian format
  • Slide 33
  • Viewing the "Unicode" file in XVI32 (part 3) The first two bytes of the file, the BOM, tell us that file is stored in little-endian format Then, each character is encoded as two bytes, in little-endian format The Unicode code point for z is U+007A But, because the file is in little- endian format, the code point for z is stored in the file as 7A 00 The Unicode code point for n is U+006E but this little-endian file stores it as 6E 00 And so on for the other characters
  • Slide 34
  • Viewing the "Unicode" file in XVI32 (part 4) Unicode was designed to be compatible with ASCII, so the Basic Latin block contains all 128 ASCII standard characters, each ASCII code mapping directly to a Unicode code point z, whose ASCII code is 7A, has the code point U+007A and appears as 7A 00 in this little-endian file n, whose ASCII code is 6E, has the code point U+006E and appears as 6E 00 in this little-endian file and so on for a, k and w
  • Slide 35
  • Viewing the "Unicode" file in XVI32 (part 5) The Latin-1 Supplement block also contains 128 code points Some, but not all, of these code points are similar to the codes in the Windows-1252 (Microsoft Windows Latin-1) code page The Windows-1252 codes for common Latin letters with accents or other diacritical marks do map directly to code points in the Latin-1 Supplement block Thus, the code point for , which has the Windows-1252 code of F3, has the code point U+00F3 and appears as F3 00 in this little-endian file
  • Slide 36
  • Big-endian Unicode
  • Slide 37
  • Viewing the Unicode big-endian file in XVI32 The proper name for this format is UTF-16BE The file has 14 bytes The first two bytes contain the byte order mark and, then, each of the six characters is encoded as two bytes The fact that the byte order mark, U+FEFF, is stored as FE FF tells us that the file is in big-endian format Thus, the code point for z, U+007A, is stored as 00 7A; the code point for n, U+006E, is stored as 00 6E; and so on
  • Slide 38
  • UTF-8
  • Slide 39
  • A UTF-8 file represents characters using a space-efficient representation of Unicode code points It uses different numbers of bytes for different code points Code points for the most common characters, the English letters, are represented as single bytes Less common characters are represented as two bytes Rarer characters are represented as three or more bytes
  • Slide 40
  • UTF-8 (contd.) Single-byte codes are used for the Unicode points U+0000 through U+007F Thus, the UTF-8 codes for these characters are exactly the same as the corresponding ASCII codes. As we shall see, these single-byte codes can be easily distinguished from the first bytes of multi-byte codes The high-order bit of the single-byte codes is always 0 As we shall see, the high-order bit in the first byte of a multi-byte code is always 1
  • Slide 41
  • UTF-8 (contd.) Each of the first 128 characters in UTF-8 needs only one byte This covers all ASCII (English) characters Each of the next 1920 characters needs two bytes This covers the remainder of almost all Latin-derived alphabets It also covers the Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Maldivian alphabets It also covers the so-called Combining Diacritical Marks, which can be used to construct new letters as well as providing an alternative way of specifying the standard accented letters that are already covered above Each of the remaining characters in the 65,536-character Basic Multilingual Plane (which contains nearly all characters used in living languages) needs three bytes Four bytes are needed for characters in the other Unicode planes; these include less-common characters in the Chinese, Japanese and Korean scripts as well as various historic scripts and mathematical symbols
  • Slide 42
  • UTF-8 (contd.) As seen before, the high-order bit of a single-byte code is always 0 A multi-byte code consists of a leading byte and one or more continuation bytes. The leading byte has two or more high-order 1s followed by a 0, while continuation bytes all have '10' in the high-order position. The number of high-order 1s in the leading byte of a multi-byte sequence indicates the number of bytes in the sequence so the length of the sequence can be determined without examining the continuation bytes. The remaining bits of the encoding are used for the bits of the code point being encoded, padded with high-order 0s if necessary. The high-order bits go in the leading byte, lower-order bits in succeeding continuation bytes. The number of bytes in the encoding is the minimum required to hold all the significant bits of the code point. We shall see an example of multi-byte coding in a later slide
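These rules are mechanical enough to code directly. The following is a minimal sketch (not from the slides) of a UTF-8 encoder for a single code point, written in PHP:

    <?php
    // Encode one Unicode code point into its UTF-8 bytes, following the
    // leading-byte / continuation-byte rules described above.
    function utf8_encode_code_point(int $cp): string {
        if ($cp <= 0x7F) {                       // 1 byte:  0xxxxxxx
            return chr($cp);
        } elseif ($cp <= 0x7FF) {                // 2 bytes: 110xxxxx 10xxxxxx
            return chr(0xC0 | ($cp >> 6))
                 . chr(0x80 | ($cp & 0x3F));
        } elseif ($cp <= 0xFFFF) {               // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return chr(0xE0 | ($cp >> 12))
                 . chr(0x80 | (($cp >> 6) & 0x3F))
                 . chr(0x80 | ($cp & 0x3F));
        } else {                                 // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return chr(0xF0 | ($cp >> 18))
                 . chr(0x80 | (($cp >> 12) & 0x3F))
                 . chr(0x80 | (($cp >> 6) & 0x3F))
                 . chr(0x80 | ($cp & 0x3F));
        }
    }

    echo strtoupper(bin2hex(utf8_encode_code_point(0x00F3))), "\n";   // C3B3 (ó)
    echo strtoupper(bin2hex(utf8_encode_code_point(0xFEFF))), "\n";   // EFBBBF (the BOM)
    ?>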
  • Slide 43
  • Viewing the UTF-8 file in XVI32 (part 1) As we shall see later, the first three bytes in this file, EF BB BF, form a "byte order mark", although this mark is not needed or recommended by the Unicode standard The next four bytes, 7A 6E 61 6B, look like ASCII codes - they contain the single-byte UTF-8 encodings of U+007A, U+006E, U+0061 and U+006B (znak) The next two bytes, C3 B3, contain a two-byte encoding of ó, as explained in the next slide The last byte contains a single-byte encoding of U+0077 (w)
  • Slide 44
  • Viewing the UTF-8 file in XVI32 (part 2) The two-byte UTF-8 encoding for ó, which has the code point U+00F3, is as follows Since there are two bytes in the code, the leading byte is of the form 110x xxxx and the continuation byte has the form 10xx xxxx U+00F3 has the following bits, 0000 0000 1111 0011 The significant bits in this code point are 1111 0011 There is room for 6 bits in the continuation byte, so it can contain the six low-order bits 11 0011, so this byte becomes 1011 0011, which is B3 The two high-order bits, 11, will be placed in the leading byte But there is room for 5 bits in the leading byte so these two bits must be padded with three high-order 0s So the leading byte becomes 110 000 11, that is 1100 0011, which is C3 So the UTF-8 code for ó is C3 B3
  • Slide 45
  • Viewing the UTF-8 file in XVI32 (part 3) We can now see that the first three bytes in this file, EF BB BF, are the UTF-8 encoding of the Unicode byte order mark U+FEFF Since there are three bytes in the code, the leading byte is of the form 1110 xxxx and each of the two continuation bytes has the form 10xx xxxx So the bytes are 1110 xxxx 10xx xxxx 10xx xxxx All bits in the U+FEFF code point are significant 1111 1110 1111 1111 There is room for 6 bits in the last byte, so it can contain the six lowest-order bits 11 1111, so this byte becomes 1011 1111, which is BF The next six bits, 1110 11, will be placed in the middle byte, so this becomes 1011 1011, which is BB The leading byte gets the highest-order bits, becoming 1110 1111, which is EF So the UTF-8 code for the byte order mark is EF BB BF
  • Slide 46
  • Let's check our understanding by considering some other languages
  • Slide 47
  • A web page in Hebrew Consider this page http://www.haaretz.co.il/news/politics/1.2151492 Let's copy the first word in the headline, ראש (it's pronounced 'rosh' and means head, leader, boss, chief)
  • Slide 48
  • Let's save this word in a UTF-8 file Start a new document in Notepad Paste the word we have just copied And save the file using the UTF-8 format
  • Slide 49
  • Inspecting the UTF-8 Open the file with XVI32 The first three bytes, EF BB BF, are familiar They are the UTF-8 encoding of the Unicode code point for the byte order mark
  • Slide 50
  • Inspecting the UTF-8 (contd.) There are six remaining bytes in the file, D7 A8 D7 90 D7 A9 So we suspect there are two bytes per character, but let's check Look at the first byte, D7, in binary format 1101 0111 It must be a leading byte in a multi-byte code, because its first bit is a 1 Indeed, it must be the first byte in two-byte code, because its first bits are 110 So the first character in the file has a two-byte UTF-8 code, D7 A8 Let's compute the Unicode code point and see the character
  • Slide 51
  • Inspecting the UTF-8 (contd.) The first character in the file has a two-byte UTF-8 code, D7 A8 In binary, this is 1101 0111 1010 1000 Using the colour code in the figure 1101 0111 1010 1000 Thus, the data bits are 1 0111 10 1000 Organized into nibbles, this is 101 1110 1000 So the code point is 0000 0101 1110 1000 That is, U+05E8 On the next slide, we will check what character this is
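The decoding steps above can also be expressed as a short PHP function. A minimal sketch, not part of the slides, which assumes well-formed UTF-8 input:

    <?php
    // Read a leading byte, work out the sequence length from its high-order 1s,
    // then collect the data bits from the continuation bytes.
    function utf8_first_code_point(string $bytes): int {
        $b0 = ord($bytes[0]);
        if ($b0 < 0x80) {                    // 0xxxxxxx -> 1-byte code
            return $b0;
        } elseif (($b0 & 0xE0) === 0xC0) {   // 110xxxxx -> 2-byte code
            $cp = $b0 & 0x1F; $len = 2;
        } elseif (($b0 & 0xF0) === 0xE0) {   // 1110xxxx -> 3-byte code
            $cp = $b0 & 0x0F; $len = 3;
        } else {                             // 11110xxx -> 4-byte code
            $cp = $b0 & 0x07; $len = 4;
        }
        for ($i = 1; $i < $len; $i++) {      // each continuation byte: 10xxxxxx
            $cp = ($cp << 6) | (ord($bytes[$i]) & 0x3F);
        }
        return $cp;
    }

    printf("U+%04X\n", utf8_first_code_point("\xD7\xA8"));   // U+05E8
    ?>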
  • Slide 52
  • Checking the character (part 1) Use the code point, U+05E8, in an HTML entity number, Save the HTML file And view the file in a browser...
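The check can also be done with one small PHP page rather than a hand-written HTML file. A minimal sketch, not from the slides:

    <?php
    // Render the code point U+05E8 via an HTML entity number, and also decode
    // the entity back to UTF-8 bytes so we can compare with the file contents.
    header('Content-Type: text/html; charset=UTF-8');
    echo '<p>&#x05E8;</p>';                                                // rendered by the browser as ר
    echo bin2hex(html_entity_decode('&#x05E8;', ENT_QUOTES, 'UTF-8'));     // d7a8
    ?>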
  • Slide 53
  • Checking the character (part 2) The first two bytes after the byte order mark were D7 A8 They were the UTF-8 encoding of the Unicode code point U+05E8 When used in an HTML entity number, this is rendered as the Hebrew letter ר (pronounced resh), the first letter in the word ראש (pronounced rosh) Notice that Notepad is clever enough to display this letter on the right, even though it is the first letter in the file This is because Notepad recognizes that these letters are from an alphabet in which words are written right-to-left
  • Slide 54
  • Exercise As an exercise, check the last four bytes in the file That is, check that D7 90 is the UTF-8 encoding of the code point for the Hebrew letter א (pronounced aleph) and that D7 A9 is the UTF-8 encoding of the code point for the Hebrew letter ש (pronounced shin)
  • Slide 55
  • A word in Arabic Consider the word حسن (pronounced hasan - it means 'good') Let's save it in a UTF-8 file
  • Slide 56
  • Inspecting the UTF-8 Open the file with XVI32 Again, the first three bytes, EF BB BF, encode the byte order mark The next byte, D8, is the binary 1101 1000 Since its first bits are 110, it must be the leading byte of a two-byte code So the code is D8 AD In binary, this is 1101 1000 1010 1101 So the data bits are 110 0010 1101 That is 0000 0110 0010 1101 So the code point is U+062D
  • Slide 57
  • Checking the character (part 1) Use the code point, U+062D, in an HTML entity number, Save the HTML file And view the file in a browser...
  • Slide 58
  • Checking the character (part 2) This does not look right - the letter ح does not look like the first (right-most) letter in the word حسن, which looks like حـ However, this is simply a result of the fact that Notepad is clever enough to recognize that the Arabic letter, ha, has several forms When ha is written as an isolated letter, it is written ح When ha is written at the start of a word, it is written حـ The letter also has two other forms, for when it appears in the middle of a word and when it appears at the end of a word.
  • Slide 59
  • Inspecting the UTF-8 (contd.) The sixth byte in the file, D8, is the binary 1101 1000 Since its first bits are 110, it must be the leading byte of a two-byte code So the code is D8 B3 In binary, this is 1101 1000 1011 0011 So the data bits are 110 0011 0011 That is 0000 0110 0011 0011 So the code point is U+0633
  • Slide 60
  • Checking the character (part 1) Use the code point, U+0633, in an HTML entity number, Save the HTML file And view the file in a browser...
  • Slide 61
  • Checking the character (part 2) Again, this may not look perfect, but it is correct the letter س does not look exactly like the middle letter in the word حسن, which looks like ـسـ Again, this is simply a result of the fact that Notepad is clever enough to recognize that the Arabic letter, sin, has several forms When sin is written as an isolated letter, it is written س When sin is written in the middle of a word, it is written ـسـ The letter also has two other forms, for when it appears at the start of a word and when it appears at the end of a word. The next slide shows how clever Firefox is at rendering Arabic
  • Slide 62
  • Checking a character sequence (part 1) Use the two code points, U+062D and U+0633, in an HTML file And view the file in a browser...
  • Slide 63
  • Checking a character sequence (part 2) Notice that Firefox now renders the letter ha using its initial-position form, so it looks like it does in Notepad The letter sin still looks different in Firefox than in Notepad This is because Notepad shows the medial-position form of the letter while Firefox shows the final-position form of the letter Let's see what happens if we encode all three Arabic letters of the word hasan in a HTML file
  • Slide 64
  • Inspecting the UTF-8 (contd.) The eighth byte in the file, D9, is the binary 1101 1001 Since its first bits are 110, it must be the leading byte of a two-byte code So the code is D9 86 In binary, this is 1101 1001 1000 0110 So the data bits are 110 0100 0110 That is 0000 0110 0100 0110 So the code point is U+0646
  • Slide 65
  • Checking a character sequence (part 1) Use the three code points, U+062D, U+0633 and U+0646, in an HTML file And view the file in a browser...
  • Slide 66
  • Checking a character sequence (part 2) Notice that Firefox renders all three letters exactly as they appear in Notepad Firefox uses the initial-position version of ha, the medial-position of sin and the final-position version of nun
  • Slide 67
  • Now, let's try Chinese The sentence 我是爱尔兰人 means I am Irish or, literally, I am Ireland person It is pronounced wo shi ai er lan ren Let's use Notepad to save it in a UTF-8 file
  • Slide 68
  • Inspecting the UTF-8 (contd.) Again, the first three bytes encode the byte order mark The fourth byte in the file, E6, is the binary 1110 0110 Since its first bits are 1110, it must be the leading byte of a three-byte code So the code is E6 88 91 In binary, this is 1110 0110 1000 1000 1001 0001 So the data bits are 0110 0010 0001 0001 So the code point is U+6211
  • Slide 69
  • Checking the character (part 1) Use the code point, U+6211, in an HTML file And view the file in a browser...
  • Slide 70
  • Checking the character (part 2) We can now see that U+6211 is the code point for the Chinese character 我, which means I or me
  • Slide 71
  • Exercise As an exercise, check the last fifteen bytes in the file E6 98 AF E7 88 B1 E5 B0 94 E5 85 B0 E4 BA BA There are five further Chinese characters in the file Are all of their code points encoded as three-byte codes? Or do some of them have shorter, and others longer, codes? What is the code point for the character 人 (pronounced ren, it means person)?
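If you want to check your answers to the exercise, a minimal PHP sketch (not from the slides, PHP 7.2+ with mbstring assumed) can split those bytes into characters and print each code point:

    <?php
    // Split a UTF-8 byte string into characters and print each one's code point.
    $bytes = "\xE6\x98\xAF\xE7\x88\xB1\xE5\xB0\x94\xE5\x85\xB0\xE4\xBA\xBA";
    foreach (preg_split('//u', $bytes, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        printf("%s = U+%04X\n", $char, mb_ord($char, 'UTF-8'));
    }
    ?>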
  • Slide 72
  • The same characters can be specified in different ways
  • Slide 73
  • Consider the two HTML pages shown below ABCD1.html looks the same as ABCD2.html But...
  • Slide 74
  • The two identical-looking pages have different source code The text, ABCD, is specified directly in the source code for ABCD1.html But The same text, ABCD, is specified using HTML entity numbers in the source code for ABCD2.html
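A minimal PHP sketch (not the slides' actual ABCD1.html and ABCD2.html) of the same idea: the two output lines below look identical in a browser, but one is written with characters and the other with HTML entity numbers.

    <?php
    header('Content-Type: text/html; charset=UTF-8');
    echo "<p>ABCD</p>\n";                    // the characters written directly
    echo "<p>&#65;&#66;&#67;&#68;</p>\n";    // the same characters as HTML entity numbers (A=65 ... D=68)
    ?>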
  • Slide 75
  • Bear this in mind... When we are looking at the next few web pages, remember that the same characters can be specified in different ways
  • Slide 76
  • Encoding characters in HTML form data
  • Slide 77
  • A simple form-submission program The program, formInput1.php, which is shown below, displays a form Then...
  • Slide 78
  • A simple form-submission program The program, formInput1.php, which is shown below, displays a form Then, when the user submits a string, it...
  • Slide 79
  • A simple form-submission program The program, formInput1.php, which is shown below displays a form Then, when the user submits a string, it...... reports the string it received
  • Slide 80
  • Another simple form-submission program The program, formInput2.php, which is shown below, also displays a form Then...
  • Slide 81
  • Another simple form-submission program The program, formInput2.php, which is shown below, also displays a form Then, when the user submits a string, it...
  • Slide 82
  • Another simple form-submission program The program, formInput2.php, which is shown below, also displays a form Then, when the user submits a string, it...... also reports the string it received
  • Slide 83
  • Both reports look similar But...
  • Slide 84
  • ... the source texts for the two report pages are different
  • Slide 85
  • What's the cause of the difference? The difference between the source texts for the two reports must lie in some difference between the source codes of the two programs formInput1.php and formInput2.php Let's compare the two programs
  • Slide 86
  • Difference between the two programs The only difference between the two programs is that... formInput2.php encodes every page it delivers in UTF-8 This means that its form also encodes the user's input in UTF-8 It also means that the source text for the report page is encoded in UTF-8... which means that the non-ASCII character appears correctly
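The slides do not reproduce the source of formInput2.php, so the following is only a hedged sketch of the kind of change being described: declare UTF-8 in the HTTP header and in the page itself, so the browser also submits the form data as UTF-8. The field name userString is an assumption.

    <?php
    // Tell the browser the page (and therefore the form submission) is UTF-8.
    header('Content-Type: text/html; charset=UTF-8');
    ?>
    <!DOCTYPE html>
    <html>
    <head><meta charset="UTF-8"></head>
    <body>
      <form method="post" accept-charset="UTF-8">
        <input type="text" name="userString">    <!-- field name assumed, not from the slides -->
        <input type="submit" value="Submit">
      </form>
    <?php
    if (isset($_POST['userString'])) {
        // Echo the received string back, escaping it for safe HTML output.
        echo '<p>You entered: ', htmlspecialchars($_POST['userString'], ENT_QUOTES, 'UTF-8'), '</p>';
    }
    ?>
    </body>
    </html>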
  • Slide 87
  • Let's see how formInput1.php handles Chinese input The form cannot encode the user's Chinese string in UTF-8 So it encodes the characters in the string as HTML entity numbers we can see this in the source code for the report page
  • Slide 88
  • Now see how formInput2.php handles Chinese input The form is able to encode the user's Chinese string in UTF-8 So the Chinese characters, themselves, appear in the source text for the report page
  • Slide 89
  • Checking that the string received really is UTF-8 This program, formInput3.php, creates a new file called userString and writes a copy of the received string into it We can then download this new file and examine its contents with XVI32
  • Slide 90
  • Notice that the new file, userString, contains 18 bytes These are exactly the same as the bytes after the byte order mark in chinese.txt, the file that was created by Notepad when we stored the same Chinese string So our program does indeed receive (and store) characters encoded in UTF-8
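Again, formInput3.php's source is not shown on the slides; a hedged sketch of the extra step it adds might be as simple as this (field name assumed as before):

    <?php
    // Copy the raw bytes received from the form into a file called userString,
    // which can then be downloaded and examined with XVI32.
    if (isset($_POST['userString'])) {
        file_put_contents('userString', $_POST['userString']);
    }
    ?>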
  • Slide 91
  • The moral is... If there is any possibility that users of your web pages will enter non-English characters in their form submissions, make sure that you encode your web pages in UTF-8 Indeed, you should always encode your web pages in UTF-8
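One further point, not spelled out on the slides but relevant to the motivating MySQL example: the database connection should also be told to expect UTF-8, otherwise correctly received bytes can still be mangled on the way into the table. A hedged sketch using mysqli; the credentials, table and column names are placeholders.

    <?php
    // Store a UTF-8 form submission in MySQL without corrupting it.
    $db = new mysqli('localhost', 'user', 'password', 'test');          // placeholder credentials
    $db->set_charset('utf8mb4');                                        // treat the connection as UTF-8
    $stmt = $db->prepare('INSERT INTO messages (text) VALUES (?)');     // placeholder table/column
    $stmt->bind_param('s', $_POST['userString']);                       // field name assumed, as before
    $stmt->execute();
    ?>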
  • Slide 92
  • Diacritical marks in Unicode
  • Slide 93
  • Diacritical marks The writing systems for many languages use symbols that are intended to modify other symbols They are usually called diacritical marks (from a Greek word, διακριτικός, diakritikos, which means distinguishing) Irish uses such marks to distinguish long vowels from short vowels, for example, á from a Other European languages use different diacritical marks on vowels, for example, to distinguish letters such as à, â, ä and ã from a Some European languages use diacritical marks on consonants French distinguishes ç from c Spanish distinguishes ñ from n Slavic languages distinguish letters such as š and ś from s In fact, the Irish writing system used diacritical marks on consonants until recently, indicating lenited consonants with a diacritical dot, as in ḃ, ċ, ḋ, ḟ, ġ, ṁ, ṗ, ṡ, ṫ Many non-European writing systems also use diacritical marks, for various purposes
  • Slide 94
  • Interlude Unicode and Insular Script
  • Slide 95
  • Examples of old Irish script Old Irish script is still visible on street signs in Cork city The photographs below were taken in November 2013 The first letter in the second word in the first photograph shows a consonant with a diacritical mark Both photographs show several special letter forms from the Insular Script, which was developed in the 600s and was still in use when I was in school For more info, see http://en.wikipedia.org/wiki/Insular_script These letter forms are all supported by Unicode
  • Slide 96
  • Unicode support for Insular Script The special letter forms of Insular Script are supported in the Latin Extended-D block Details about the individual letters can be found at codepoints.net For example, http://codepoints.net/U+A779 gives details about Ꝺ, that is, U+A779 LATIN CAPITAL LETTER INSULAR D Note that although Unicode supports Insular Script, not all fonts do - one that does is Quivira http://www.quivira-font.com/downloads.php
  • Slide 97
  • Using special fonts in HTML The HTML file below specifies the code point for the capital D in Insular Script But Firefox does not display the letter properly This is not a fault in Firefox It simply means that the client machine on which Firefox is running does not have a font which supports this Unicode code point We can overcome this limitation by using a feature of CSS3
  • Slide 98
  • Using the @font-face command CSS3 provides the @font-face command for telling a browser to load a special font which may not be available on the client machine Below, the browser is told to load the Quivira font which is available in a file called Quivira.ttf in the same server directory as the HTML file The browser can now render the Insular Capital D correctly
  • Slide 99
  • Back to Diacritical marks in Unicode
  • Slide 100
  • Diacritical marks in digital typography Many computerized typography systems treat diacritically-marked characters (such as à, ñ or š) as completely distinct from base characters (such as a, n or s) However, some systems do not provide separate symbols for diacritically-marked characters Instead, they provide special diacritical symbols (~ ` etc.) which may be combined with base symbols to produce the same visual appearance to the reader Unicode tries to subsume these different types of system Thus, for example, Unicode provides two different ways of encoding à It provides U+00E0, the code point for the composite character à But it also allows us to achieve the same visual appearance by appending U+0300, the code point for `, to U+0061, the code point for a In Unicode jargon, code points for diacritical marks, like U+0300 the code point for `, are called combining marks As we shall see later, there is a special regular expression notation for dealing with combining marks
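A minimal sketch (not from the slides, PHP 7+ for the \u{} string escape) showing that the two encodings of à really are different byte sequences, even though they render the same:

    <?php
    $composed   = "\u{00E0}";            // à as a single precomposed code point
    $decomposed = "a\u{0300}";           // a followed by the combining grave accent
    echo bin2hex($composed), "\n";       // c3a0
    echo bin2hex($decomposed), "\n";     // 61cc80
    var_dump($composed === $decomposed); // bool(false): byte-wise they differ
    ?>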
  • Slide 101
  • An experiment Let's see, close-up, the effect of appending U+0300, the code point for `, to U+0061, the code point for a We will create a UTF-8 encoded binary file using XVI32 and open it with Notepad Since Notepad seems to like an initial byte order mark in UTF-8 format, the first three bytes in our file will be EF BB BF Since U+0061 belongs to the Basic Latin block, this code point is encoded in UTF-8 as one byte, namely 61 So far, then, our file contains four bytes EF BB BF 61 Now we must encode U+0300 in UTF-8 Hex 0300 is 0000 0011 0000 0000 in binary Its ten significant bits are 11 0000 0000 A two-byte code, 110xxxxx 10xxxxxx, provides space for eleven bits Padding the ten bits with a leading 0, we get 1100 1100 1000 0000, that is CC 80 in hex So our complete file is EF BB BF 61 CC 80 Let's create it with XVI32 and then open it in Notepad
  • Slide 102
  • An experiment (contd.) A UTF-8 file containing a byte order mark followed by U+0061 (code point for the letter a) and U+0300 (the code point for the combining mark `), contains six bytes EF BB BF 61 CC 80
  • Slide 103
  • An experiment (contd.) A UTF-8 file containing a byte order mark followed by U+0061 (code point for the letter a) and U+0300 (the code point for the combining mark `), contains six bytes EF BB BF 61 CC 80 Creating this with XVI32, we get
  • Slide 104
  • An experiment (contd.) A UTF-8 file containing a byte order mark followed by U+0061 (code point for the letter a) and U+0300 (the code point for the combining mark `), contains six bytes EF BB BF 61 CC 80 Creating this with XVI32, we get Saving this in a file called aGrave.txt, we get
  • Slide 105
  • An experiment (contd.) A UTF-8 file containing a byte order mark followed by U+0061 (code point for the letter a) and U+0300 (the code point for the combining mark `), contains six bytes EF BB BF 61 CC 80 Creating this with XVI32, we get Saving this in a file called aGrave.txt, we get Opening this file in Notepad, we get
  • Slide 106
  • List of Unicode combining marks The complete (as of 2013) list of Unicode combining marks is available at http://www.unicode.org/charts/PDF/U0300.pdf We are allowed to combine these in any order we like, although not all rendering software may display them correctly
  • Slide 107
  • List of Unicode combining marks The complete (as of 2013) list of Unicode combining marks is available at http://www.unicode.org/charts/PDF/U0300.pdf We are allowed to combine these in any order we like, although not all rendering software may display them correctly Let's try combining a few of them
  • Slide 108
  • List of Unicode combining marks The complete (as of 2013) list of Unicode combining marks is available at http://www.unicode.org/charts/PDF/U0300.pdf We are allowed to combine these in any order we like, although not all rendering software may display them correctly Let's try combining a few of them Let's try appending both U+0300 and U+0333 to U+0061
  • Slide 109
  • Another experiment Let's see how Notepad handles the result of appending two combining marks, U+0300 and U+0333, to U+0061, the code point for a We will create a UTF-8 encoded binary file using XVI32 and open it with Notepad We already know that the byte sequence for an initial byte order mark followed by U+0061 and U+0300 is EF BB BF 61 CC 80 Now we must encode U+0333 in UTF-8 Hex 0333 is 0000 0011 0011 0011 in binary Its ten significant bits are 11 0011 0011 A two-byte code, 110xxxxx 10xxxxxx, provides space for eleven bits Padding the ten bits with a leading 0, we get 1100 1100 1011 0011, that is CC B3 in hex So our complete file is EF BB BF 61 CC 80 CC B3
  • Slide 110
  • Another experiment Let's see how Notepad handles the result of appending two combining marks, U+0300 and U+0333, to U+0061, the code point for a We will create a UTF-8 encoded binary file using XVI32 and open it with Notepad We already know that the byte sequence for an initial byte order mark followed by U+0061 and U+0300 is EF BB BF 61 CC 80 Now we must encode U+0333 in UTF-8 Hex 0333 is 0000 0011 0011 0011 in binary Its ten significant bits are 11 0011 0011 A two-byte code, 110xxxxx 10xxxxxx, provides space for eleven bits Padding the ten bits with a leading 0, we get 1100 1100 1011 0011, that is CC B3 in hex So our complete file is EF BB BF 61 CC 80 CC B3 Creating it with XVI32 and opening it in Notepad, we see that Notepad cannot render the result perfectly
  • Slide 111
  • Yet another experiment But Firefox can render perfectly the result of appending two combining marks, U+0300 and U+0333, to U+0061, the code point for a Let's create a UTF-8 file without the (unnecessary) byte order mark. Its contents will be 61 CC 80 CC B3 Let's call the file aNovelWithoutBOM.txt and upload it to a server The PHP program below delivers the content of this file to Firefox, which can render it very well
  • Slide 112
  • Character Equivalence We have seen that Unicode provides different ways of encoding diacritically-marked characters Such a character may be encoded using different code point sequences a sequence containing a single code point (for a composite character) a sequence of several code points (one for a base character, followed by one or more further code points for combining marks) Some way is needed to determine whether or not two code point sequences represent the same diacritically-marked character That is, some way is needed to determine whether or not two code point sequences are equivalent, so that programs can compare sequences, organize them alphabetically and search for them The Unicode standard defines two kinds of sequence equivalence canonical equivalence a weaker notion called compatibility equivalence The standard also defines normalization algorithms which, for each type of equivalence, produce a unique code point sequence from all equivalent sequences For more details, see http://unicode.org/reports/tr15/
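PHP's intl extension exposes these normalization algorithms through the Normalizer class. A minimal sketch (not from the slides; intl and PHP 7 assumed):

    <?php
    $composed   = "\u{00E0}";    // à as one code point
    $decomposed = "a\u{0300}";   // à as base letter + combining mark

    // NFC (canonical composition) maps both spellings to the precomposed form:
    var_dump(Normalizer::normalize($decomposed, Normalizer::FORM_C) === $composed);   // bool(true)
    // NFD (canonical decomposition) maps both spellings to the decomposed form:
    var_dump(Normalizer::normalize($composed, Normalizer::FORM_D) === $decomposed);   // bool(true)
    ?>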
  • Slide 113
  • XML and UTF-8
  • Slide 114
  • Several versions of the same document Consider this short XML document François Hollande Nicolas Paul Stéphane Sarközy de Nagy-Bocsa I won! We will store it in two formats, ANSI and UTF-8, and see how Firefox handles the result Later, we will develop a slightly better version of the document and also store it in both formats
  • Slide 115
  • ANSI version of the document We have stored the document in an ANSI-encoded file called memorandumStoredInANSIformat.xml Firefox objects to the non-ASCII ç character
  • Slide 116
  • UTF-8 version of the document We have stored the document in a UTF-8-encoded file called memorandumStoredInUTF8format.xml Firefox handles the file properly It has no trouble with the ç, é or ö characters So, XML files which contain non-ASCII characters should be stored in UTF-8 So make sure you use a text editor which supports UTF-8 encoding
  • Slide 117
  • Improved version of the document Consider this version of the XML document François Hollande Nicolas Paul Stéphane Sarközy de Nagy-Bocsa I won! It has an encoding attribute in its XML declaration (a prolog of the form <?xml version="1.0" encoding="UTF-8"?>) We will store it in two formats, ANSI and UTF-8, and see how Firefox handles the result
  • Slide 118
  • ANSI version of the document We have stored the document in an ANSI-encoded file called memorandumUTF8storedInANSI.xml There is a discrepancy between the encoding attribute and the actual encoding used in the file Not surprisingly, Firefox objects to the non-ASCII ç character
  • Slide 119
  • UTF-8 version of the document We have stored the document in a UTF-8-encoded file called memorandumUTF8storedInUTF8.xml Firefox handles the file properly Summary: The actual encoding matters more than the encoding attribute, but it is better to use the attribute
  • Slide 120
  • Another example Here is a memo from Xi Jinping to Bo Xilai We have stored the document in a UTF-8-encoded file Firefox handles the file properly With UTF-8, your XML documents can contain text in any script
  • Slide 121
  • Yet another example Indeed, with UTF-8, even your tag names can be in any script For example, here is another version of the memo from Xi Jinping to Bo Xilai While the content is in Chinese, the tag names are in Cyrillic You should always use UTF-8
  • Slide 122
  • Unicode and operating systems
  • Slide 123
  • Modern operating systems support Unicode When an operating system supports Unicode, file names can contain any Unicode code point However, even though an operating system may support Unicode in file names, this does not mean that all applications will display them properly For example, below, Firefox shows that two files in a folder on a Linux server have Chinese characters in their names But these are not rendered correctly if we use the Linux ls command
  • Slide 124
  • Unicode implementations in operating systems Different operating systems provide different implementations of Unicode in file names Unix/Linux uses UTF-8 in file names Windows NTFS uses UTF-16
  • Slide 125
  • UTF-8 and URLs
  • Slide 126
  • Any script can be used in URLs You already know that any ASCII symbol can be encoded in a URL using the % character followed by two hex digits giving the ASCII code of the symbol In fact, this use of ASCII is just a special case Any Unicode symbol can be used, URL-encoded, in URLs This means that a URL can contain any script Below, see the English Wikipedia page for Bo Xilai The URL contains his name in English On the next slide we will see the corresponding Chinese page
  • Slide 127
  • Chinese characters in URLs Below, see the Chinese Wikipedia page for Bo Xilai The URL contains his name in Chinese, url-encoded using UTF-8 Firefox displays the Chinese characters in the URL box But, even though the URL contains Chinese characters, Internet Explorer (at least MSIE 8, the version on my desktop) does not show them in the URL box Instead it shows the url-encoded UTF-8, three bytes for each character, as we would expect: https://zh.wikipedia.org/wiki/%E8%96%84%E7%86%99%E6%9D%A5
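A minimal PHP sketch (not from the slides; the script must itself be saved as UTF-8): percent-encoding the UTF-8 bytes of the name reproduces the path segment shown in the URL above.

    <?php
    // Percent-encode the UTF-8 bytes of the Chinese name, three bytes per character.
    echo rawurlencode('薄熙来'), "\n";   // %E8%96%84%E7%86%99%E6%9D%A5
    ?>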
  • Slide 128
  • Handling Unicode in regular expressions
  • Slide 129
  • Handling combining marks with regular expressions As we shall see, regular expressions can handle UTF-8 text, although treating combining marks can be tricky PHP, for example, supports Unicode in regular expressions when the u modifier is appended to the pattern
  • Slide 130
  • The dot operator and Unicode Regular expression engines treat a single Unicode code point as a single character Thus the dot meta-character will match any single Unicode code point (except the line-break character)
  • Slide 131
  • The dot operator and combining marks Remember that the character à can be represented as one code point (U+00E0) or, using a combining mark, as two (U+0061 U+0300) The dot operator will not treat a character followed by a combining mark as one character It will match only the base character Instead, the meta-sequence \X will match any Unicode character, be it a single code point or a sequence
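A minimal PHP sketch of that behaviour (PHP 7+ for the \u{} escape; the u modifier makes preg_match treat both pattern and subject as UTF-8):

    <?php
    $decomposed = "a\u{0300}";    // à written as a + combining grave accent (two code points)

    var_dump(preg_match('/^.$/u',  $decomposed));   // int(0): . matches a single code point only
    var_dump(preg_match('/^\X$/u', $decomposed));   // int(1): \X matches the whole grapheme cluster
    ?>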
  • Slide 132
  • To be continued See http://www.regular-expressions.info/unicode.html
  • Slide 133
  • Unicode in XML and other Markup Languages http://www.w3.org/TR/unicode-xml/ Regular Expression Matching in XSLT 2 http://www.xml.com/pub/a/2003/06/04/tr.html Support for XSLT 2.0 Browsers so far do not support XSLT 2.0 natively; however, it is possible to perform XSLT 2.0 transformations via Saxon-B (Java) or, more recently, Saxon-CE, an attempt to port Saxon 9 to client-side JavaScript so that it can be used within browsers See http://www.saxonica.com/ce/index.xml for the open source Saxon-CE