xml and localization
DESCRIPTION
An overview of XML and how it is used in the localization world
TRANSCRIPT
XML and LOCALIZATION
An overview by @Fantpmas from @YamagataEurope
What is XML? And why do you people love acronyms so much?
XML stands for eXtensible Markup Language
You can write your own language/dialect
A language to store data in a human readable format
XML is designed to carry data, not to display data like HTML
XML doesn't do anything on its own, nada, zilch!
A sample XML document (Don't worry it's all plain text)
The root element
3 child elements
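The sample document from the slide is not preserved in this transcript. As a stand-in, here is a minimal sketch with invented element names (`note`, `to`, `from`, `body`), parsed with Python's standard library to show the root element and its three children:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: element names and text are invented
# for illustration, they do not come from the original slide.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<note>
  <to>Translator</to>
  <from>Project manager</from>
  <body>Please translate by Friday.</body>
</note>"""

# Parse from bytes so the encoding declaration is honoured.
root = ET.fromstring(SAMPLE.encode("utf-8"))

# The root element wraps everything else in the document.
child_names = [child.tag for child in root]

print(root.tag)      # name of the root element
print(child_names)   # names of its child elements, in document order
```

It really is all plain text: the parser only turns that text into a tree.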
An XML element in detail
Start tag End tag
Attribute
Element content
Attribute value
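To make the parts of an element concrete, here is a small sketch using a hypothetical `segment` element with one attribute (the name, attribute and text are assumptions made for the example):

```python
import xml.etree.ElementTree as ET

# Hypothetical element, annotated:
#   <segment lang="en">Hello world</segment>
#   ^start tag                    ^end tag
#            ^attribute  ^element content
element = ET.fromstring('<segment lang="en">Hello world</segment>')

name = element.tag                    # element name, shared by start and end tag
attributes = element.attrib           # dictionary of attribute names and values
attribute_value = element.get("lang") # value of one attribute
content = element.text                # text between start tag and end tag

print(name, attributes, attribute_value, content)
```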
XML elements can be empty
is the same as
Self-closing element
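A quick way to convince yourself that both notations are equivalent is to parse them and compare the result. The element name `linebreak` below is an assumption for the example:

```python
import xml.etree.ElementTree as ET

# Two spellings of the same empty element (hypothetical name).
long_form = ET.fromstring("<linebreak></linebreak>")
short_form = ET.fromstring("<linebreak/>")


def is_empty(element):
    """True when the element has no child elements and no text."""
    return len(element) == 0 and not element.text


# Once parsed, the two forms cannot be told apart.
same_serialization = ET.tostring(long_form) == ET.tostring(short_form)

print(is_empty(long_form), is_empty(short_form), same_serialization)
```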
There are rules to follow
When all rules are abided by, the XML is well-formed
XML well-formedness rules (not exhaustive)
• There must be a root element
• Elements must follow naming rules
• All elements must be closed
• Element names are case sensitive
• Elements must be properly nested
• Attribute values must be quoted
• An attribute can appear only once in the same start tag
• Some characters cannot be used as such
• Entities must be declared
There must be a root element
Elements must follow naming rules
Names can only start with
• A letter (in any language, including accented letters)
• A colon
• An underscore
筆者 筆者
Elements must follow naming rules
Names cannot contain
• White spaces
• Most punctuation characters, except colon, underscore, hyphen, dot and middle dot
• Symbol characters
筆 者 筆 者
All elements must be closed
Element names are case sensitive
Elements must be properly nested
Attribute values must be quoted
Single or double quotes
Attention to those darn quotes
If the value is delimited by double quotes, you cannot use a literal double quote inside the attribute value (use &quot; instead). The same applies to single quotes (use &apos;).
Attributes must be unique in tags
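The rules so far can be checked mechanically: a conforming parser refuses anything that is not well-formed. The sketch below feeds a few invented snippets to Python's standard parser; the helper name `is_well_formed` and all sample documents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical snippets, one per rule discussed above.
EXAMPLES = {
    "well-formed": '<doc><p id="1">text</p></doc>',
    "non-ASCII name": "<筆者>text</筆者>",
    "single-quoted attribute": "<doc><p id='1'>text</p></doc>",
    "no single root": "<p>one</p><p>two</p>",
    "element not closed": "<doc><p>text</doc>",
    "case mismatch": "<doc><P>text</p></doc>",
    "bad nesting": "<doc><b><i>text</b></i></doc>",
    "unquoted attribute": "<doc><p id=1>text</p></doc>",
    "duplicate attribute": '<doc><p id="1" id="2">text</p></doc>',
    "white space in name": "<doc><my element>text</my element></doc>",
}

results = {label: is_well_formed(text) for label, text in EXAMPLES.items()}

for label, ok in results.items():
    print(f"{label}: {'OK' if ok else 'NOT well-formed'}")
```

Only the first three snippets are accepted; every other one breaks exactly one rule.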
Some characters cannot be used
• < and & need to be escaped into entities: &lt; and &amp;
• Most control characters (such as backspace; tab, line feed and carriage return are allowed)
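In practice you rarely escape by hand. As a sketch, Python's `xml.sax.saxutils.escape` replaces `&`, `<` and `>` with their entities, and the parser turns them back into the original characters. The sample text and the helper `parses` are invented for the example; the last two lines illustrate that a backspace is rejected while a tab is accepted.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical text containing both characters that must be escaped.
raw_text = "Fish & chips < 5 EUR"

escaped = escape(raw_text)                  # entities instead of & and <
document = "<item>" + escaped + "</item>"
round_trip = ET.fromstring(document).text   # parser restores the characters


def parses(xml_text):
    """Return True if the text is accepted by the parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


unescaped_ok = parses("<item>" + raw_text + "</item>")  # raw & and < break the XML
backspace_ok = parses("<item>a\x08b</item>")            # control character U+0008
tab_ok = parses("<item>a\tb</item>")                    # tab is allowed

print(escaped, round_trip, unescaped_ok, backspace_ok, tab_ok)
```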
A word about entities
Entities are used to represent characters, or a sequence of characters that needs to be repeated throughout a document
Syntax: &name; (ampersand, entity name, semicolon)
Predefined XML entities
5 predefined character entities, only 2 are obligatory
&lt; < less than
&gt; > greater than
&amp; & ampersand
&apos; ' apostrophe
&quot; " quotation mark
Entities must be declared
Except for the predefined entities, all entities must be declared in the Document Type Definition
Entity
DTD Entity declaration
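The slide's own example is missing from this transcript, so here is a hedged sketch of the idea. The entity name `company` and its replacement text are assumptions. The first document declares the entity in an internal DTD; the second uses it without a declaration and is rejected by the parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical entity "company", declared in an internal DTD subset.
DECLARED = """<!DOCTYPE doc [
  <!ENTITY company "Example Corp">
]>
<doc>Translated by &company;</doc>"""

# Same entity reference, but nothing declares it.
UNDECLARED = "<doc>Translated by &company;</doc>"

# The parser expands the declared entity into its replacement text.
declared_text = ET.fromstring(DECLARED).text

try:
    ET.fromstring(UNDECLARED)
    undeclared_error = None
except ET.ParseError as exc:
    undeclared_error = str(exc)   # parser complains about the unknown entity

print(declared_text)
print(undeclared_error)
```

Entity expansion is also a known attack surface, so be careful with documents from untrusted sources.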
Other constructs
• XML declaration
• Stylesheet declaration
• Document Type declaration
• Comments
• CDATA
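A single invented document can show all five constructs at once. In the sketch below the stylesheet name `preview.xsl` and the element names are assumptions; the default ElementTree parser silently skips the declarations and the comment, and hands back the CDATA content as ordinary text.

```python
import xml.etree.ElementTree as ET

# Hypothetical document showing the five constructs listed above.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="preview.xsl"?>
<!DOCTYPE doc>
<!-- A comment: not part of the data -->
<doc>
  <code><![CDATA[if (a < b && b < c) { run(); }]]></code>
</doc>"""

root = ET.fromstring(DOC)

# Inside CDATA, < and & do not need to be escaped; the parser returns
# the raw characters as the element's text.
cdata_text = root.find("code").text

print(root.tag)
print(cdata_text)
```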
Document Type Definition
A DTD defines the structure of an XML document
How to declare DTDs
DTDs can be internal
DTD
How to declare DTDs
DTDs can be external
XML Schema
XML Schema (*.xsd) is an XML based alternative to DTD
DTDs in the localization world
Don't be scared, but XML really is everywhere
• TMX
• TBX
• XLIFF
• TTX
• SRX
• QT Linguist TS
• DITA
• ...
Encoding
All XML parsers must support at least UTF-8 and UTF-16
Default encoding is UTF-8
It is always a good idea to specify the encoding
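Specifying the encoding means writing it in the XML declaration on the first line. As a sketch (assuming Python 3.8 or later for the `xml_declaration` argument, and an invented `city` element), this is how a serializer can be asked to do it:

```python
import xml.etree.ElementTree as ET

# Hypothetical element with non-ASCII content.
root = ET.Element("city")
root.text = "東京"

# Ask the serializer for UTF-8 bytes with an explicit XML declaration.
utf8_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)

# The declaration is everything up to the first "?>".
declaration = utf8_bytes.split(b"?>")[0] + b"?>"

# Reading the bytes back gives the original text.
round_trip = ET.fromstring(utf8_bytes).text

print(declaration.decode("ascii"))
print(round_trip)
```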
Byte Order Mark
A character to indicate the byte order of an XML document
In UTF-8 it's optional and not even recommended
In UTF-16 it's used to indicate endianness: little-endian or big-endian
If you see these at the start of a file, something's wrong:
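The characters shown on the slide are not preserved in this transcript. A common symptom, assumed here, is a UTF-8 BOM displayed by a tool that reads the file as Windows-1252. The sketch below shows what that looks like and how to detect a BOM from the first bytes of a file; the function name `detect_bom` is made up for the example.

```python
import codecs


def detect_bom(data):
    """Return the encoding suggested by a leading BOM, or None.

    Note: a UTF-32 little-endian BOM also starts with FF FE, so this
    simple check would report it as UTF-16 little-endian.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16 little-endian"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16 big-endian"
    return None


# What a UTF-8 BOM looks like when misread as Windows-1252.
bom_misread = codecs.BOM_UTF8.decode("cp1252")

with_bom = "<doc/>".encode("utf-8-sig")   # UTF-8 with BOM
without_bom = "<doc/>".encode("utf-8")    # UTF-8 without BOM

print(repr(bom_misread))
print(detect_bom(with_bom), detect_bom(without_bom))
```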
Complementary technologies
What? There's more of this geek stuff!?
Extensible Stylesheet Language Transformations (XSLT)
It's XML to transform another XML document!
XSL Transformations
Input: XML. Output: (X)HTML, XML or TXT
How to apply an XSLT
Declare the stylesheet in the XML file itself
Use an application like XMLSpy or xmlstarlet
XSLT localization examples
• Convert a TTX to a two-column HTML or CSV
• Convert a TMX to a TBX
• Convert a TMX to a TXT (for spell-check in MS Word)
• Convert multilingual XML to TMX/TBX
• Generate HTML preview for XML in SDL Trados Studio
• Prepare XML files for translation
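To give a feel for the TMX-to-TXT case without requiring an XSLT processor, here is the same kind of transformation sketched in plain Python with ElementTree. This is a stand-in technique, not XSLT. The TMX sample is heavily trimmed and invented (a real TMX header carries more attributes), and `tmx_to_text` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, trimmed-down TMX file.
TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open the file</seg></tuv>
      <tuv xml:lang="nl"><seg>Open het bestand</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Save the file</seg></tuv>
      <tuv xml:lang="nl"><seg>Sla het bestand op</seg></tuv>
    </tu>
  </body>
</tmx>"""


def tmx_to_text(tmx_source, language):
    """Collect the segments of one language, one per line."""
    root = ET.fromstring(tmx_source)
    lines = []
    for tuv in root.iter("tuv"):
        if tuv.get(XML_LANG) == language:
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also picks up text inside inline tags.
                lines.append("".join(seg.itertext()))
    return "\n".join(lines)


print(tmx_to_text(TMX, "nl"))
```

The resulting plain text can then be pasted into a word processor for spell-checking.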
XPath
It's a query language to select nodes from an XML document
It's used in XSLT
Will select all elements that have an attribute called
and whose value is
And also in SDL Trados Studio file types
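The attribute name and value from the slide's expression are missing in this transcript, so the sketch below assumes a hypothetical `translate` attribute with the value `no`. Python's ElementTree supports only a subset of XPath, but it is enough for the pattern "all elements that have a given attribute with a given value":

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the translate attribute is an assumption.
DOC = """<manual>
  <title translate="yes">User guide</title>
  <partnumber translate="no">AB-1234</partnumber>
  <para translate="yes">Press the <ui translate="no">Start</ui> button.</para>
</manual>"""

root = ET.fromstring(DOC)

# .//*                 any element, at any depth below the root
# [@translate='no']    that has a translate attribute with value "no"
locked = root.findall(".//*[@translate='no']")

locked_names = [element.tag for element in locked]
locked_texts = [element.text for element in locked]

print(locked_names)
print(locked_texts)
```

A query like this is one way to tell a tool which parts of a file must not be sent for translation.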
Is XML good for localization?
Yes, but not always
XML is great for localization
• Unicode supported by default
• Metadata gives more information about content
• Separates content from formatting (to some extent)
• Human readable
• Easily transformable using XSLT
• Excellent for single-sourcing
But bad XML is bad
• Translatable content in attributes
• No metadata to distinguish between types of content, e.g. mixed languages, translatable vs. not translatable
• CDATA is just plain cheating
• Bad implementations of standards (XLIFF)
And also
• Multilingual XML can be challenging (XSLT can help)
東京
• Big files and one-liners can cause processing problems (pretty-printing can help)
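Pretty-printing a one-liner is a one-step job with most XML tools. As a sketch, Python's `xml.dom.minidom` can do it; the sample document is invented for the example.

```python
from xml.dom import minidom

# Hypothetical one-liner: the whole document sits on a single line.
ONE_LINER = "<doc><section><para>First</para><para>Second</para></section></doc>"

# Re-serialize with one element per line and two-space indentation.
pretty = minidom.parseString(ONE_LINER).toprettyxml(indent="  ")

line_count_before = len(ONE_LINER.splitlines())
line_count_after = len(pretty.splitlines())

print(pretty)
print(line_count_before, "line before,", line_count_after, "lines after")
```

Be careful with mixed content: added white space inside translatable text can change the meaning of a segment, so pretty-print before translation starts, not in the middle of a project.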
Tools, tools, tools
• Altova XMLSpy: all-round XML editor
• Altova DiffDog: compare XML files
• xmlstarlet: command line XML toolkit
• EditPad Pro for all encoding/BOM matters
"Specification is only theory. In practice, there is only the parser."
@Tnkrd