getting started with icu vladimir weinstein eric mader steven r. loomis

43
Getting Started with ICU Vladimir Weinstein Eric Mader Steven R. Loomis

Upload: ursula-manning

Post on 01-Jan-2016

219 views

Category:

Documents


0 download

TRANSCRIPT

  • Getting Started with ICUVladimir WeinsteinEric MaderSteven R. Loomis

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    AgendaGetting & setting up ICU4CUsing conversion engineUsing break iterator engineGetting & setting up ICU4JUsing collation engineUsing message formatsExample analysis

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Getting ICU4Chttp://ibm.com/software/globalization/icuGet the latest releaseGet the binary packageSource download for modifying build optionsCVS for bleeding edge read instructions

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Setting up ICU4CUnpack binariesIf you need to build from sourceWindows: MSVC .Net 2003 Project, CygWin + MSVC 6, just CygWinUnix: runConfigureICUmake installmake check

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Testing ICU4CWindows - run: cintltst, intltest, iotestUnix - make check (again)See it for yourself:

    #include #include "unicode/utypes.h"#include "unicode/ures.h"

    main() { UErrorCode status = U_ZERO_ERROR; UResourceBundle *res = ures_open(NULL, "", &status); if(U_SUCCESS(status)) { printf("everything is OK\n"); } else { printf("error %s opening resource\n", u_errorName(status)); } ures_close(res);}

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Conversion Engine - OpeningICU4C uses open/use/close paradigmOpen a converter:

    UErrorCode status = U_ZERO_ERROR;UConverter *cnv = ucnv_open(encoding, &status);if(U_FAILURE(status)) {/* process the error situation, die gracefully */}Almost all APIs use UErrorCode for statusCheck the error code!

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    What Converters are Availableucnv_countAvailable() get the number of available convertersucnv_getAvailable get the name of a particular converterLot of frameworks allow this examination

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Converting Text Chunk by Chunkchar buffer[DEFAULT_BUFFER_SIZE];char *bufP = buffer;len = ucnv_fromUChars(cnv, bufP, DEFAULT_BUFFER_SIZE, source, sourceLen, &status);if(U_FAILURE(status)) {if(status == U_BUFFER_OVERFLOW_ERROR) {status = U_ZERO_ERROR;bufP = (UChar *)malloc((len + 1) * sizeof(char));len = ucnv_fromUChars(cnv, bufP, DEFAULT_BUFFER_SIZE, source, sourceLen, &status);} else {/* other error, die gracefully */}}/* do interesting stuff with the converted text */

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Converting Text Character by CharacterWorks only from code page to UnicodeUChar32 result;char *source = start;char *sourceLimit = start + len;while(source < sourceLimit) {result = ucnv_getNextUChar(cnv, &source, sourceLimit, &status);if(U_FAILURE(status)) {/* die gracefully */}/* do interesting stuff with the converted text */}

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Converting Text Piece by Piecewhile((!feof(f)) && ((count=fread(inBuf, 1, BUFFER_SIZE , f)) > 0) ) { source = inBuf; sourceLimit = inBuf + count; do { target = uBuf; targetLimit = uBuf + uBufSize; ucnv_toUnicode(conv, &target, targetLimit, &source, sourceLimit, NULL, feof(f)?TRUE:FALSE, /* pass 'flush' when eof */ /* is true (when no more data will come) */ &status); if(status == U_BUFFER_OVERFLOW_ERROR) { // simply ran out of space we'll reset the // target ptr the next time through the loop. status = U_ZERO_ERROR; } else { // Check other errors here and act appropriately } text.append(uBuf, target-uBuf); count += target-uBuf; } while (source < sourceLimit); // while simply out of space }

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Clean up!Whatever is opened, needs to be closedConverters use ucnv_closeSample uses conversion to convert code page data from a file

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Text Boundary AnalysisProcess of locating linguistic boundaries while formatting and processing textMany usesRelatively straightforward for EnglishHard for some other languages:Chinese and JapaneseThaiHindi

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Break Iteration - IntroductionCharacter boundaries: grapheme clustersWord boundaries: word counting, double click selectionLine break boundaries: where to break a lineSentence break boundaries: sentence counting, triple click selectionICU class - BreakIterator

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Break Iteration starting statesPoints to a boundary between two charactersIndex of character following the boundaryUse current() to get the boundaryUse first() to set iterator to start of textUse last() to set iterator to end of text

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Break Iteration - NavigationUse next() to move to next boundaryUse previous() to move to previous boundaryReturns DONE if cant move boundary

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Break Itaration Checking a positionUse isBoundary() to see if position is boundaryUse preceeding() to find boundary at or beforeUse following() to find boundary at or after

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Break Iteration - OpeningUse the factory methods:Locale locale = ; // locale to use for break iteratorsUErrorCode status = U_ZERO_ERROR;

    BreakIterator *characterIterator = BreakIterator::createCharacterInstance(locale, status);

    BreakIterator *wordIterator = BreakIterator::createWordInstance(locale, status);

    BreakIterator *lineIterator = BreakIterator::createLineInstance(locale, status);

    BreakIterator *sentenceIterator = BreakIterator::createSentenceInstance(locale, status);Dont forget to check the status!

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Set the textWe need to tell the iterator what text to use:UnicodeString text;

    readFile(file, text);wordIterator->setText(text);Reuse iterators by calling setText() again.

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Break Iteration - Counting words in a file:int32_t countWords(BreakIterator *wordIterator, UnicodeString &text){ U_ERROR_CODE status = U_ZERO_ERROR; UnicodeString word; UnicodeSet letters(UnicodeString("[:letter:]"), status);

    int32_t wordCount = 0; int32_t start = wordIterator->first();

    for(int32_t end = wordIterator->next(); end != BreakIterator::DONE; start = end, end = wordIterator->next()) { text->extractBetween(start, end, word); if(letters.containsSome(word)) { wordCount += 1; } }

    return wordCount;}

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Break Iteration Breaking linesint32_t previousBreak(BreakIterator *breakIterator, UnicodeString &text, int32_t location){ int32_t len = text.length();

    while(location < len) { UChar c = text[location];

    if(!u_isWhitespace(c) && !u_iscntrl(c)) { break; }

    location += 1; }

    return breakIterator->previous(location + 1);}

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Break Iteration Cleaning upUse delete to delete the iteratorsdelete characterIterator;delete wordIterator;delete lineIterator;delete sentenceIterator;

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Useful LinksHomepage: http://ibm.com/software/globalization/icuAPI documents and User guide: http://ibm.com/software/globalization/icu/documents.jsp

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Getting ICU4JEasiest pick a .jar file off download section on http://ibm.com/software/globalization/icuUse the latest version if possibleFor sources, download the source .jarFor bleeding edge, use the latest CVS see site for instructions

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Setting up ICU4JCheck that you have the appropriate JDK versionTry the test code (ICU4J 3.0 or later):

    import com.ibm.icu.util.ULocale;import com.ibm.icu.util.UResourceBundle;public class TestICU {public static void main(String[] args) { UResourceBundle resourceBundle = UResourceBundle.getBundleInstance(null, ULocale.getDefault()); }}

    Add ICUs jar to classpath on command lineRun the test suite

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Building ICU4JNeed ant in addition to JDKUse ant to buildWe also like Eclipse

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Collation EngineMore on collation tomorrow!Used for comparing stringsInstantiation:ULocale locale = new ULocale("fr");Collator coll = Collator.getInstance(locale);// do useful things with the collatorLives in com.ibm.icu.text.Collator

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    String ComparisonWorks fastYou get the result as soon as it is readyUse when you dont need to compare same strings many timesint compare(String source, String target);

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Sort KeysUsed when multiple comparisons are requiredIndexes in data basesICU4J has two classesCompare only sort keys generated by the same type of a collator

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    CollationKey classJDK compatibleSaves the original stringCompare keys with compareTo methodGet the bytes with toByteArray methodWe used CollationKey as a key for a TreeMap structure

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    RawCollationKey classDoes not store the original stringGet it by using getRawCollationKey methodMutable class, can be reusedSimple and lightweight

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Message Format - IntroductionAssembles a user message from partsSome parts fixed, some supplied at runtimeOrder different for different languages:English: My Aunts pen is on the table.French: The pen of my Aunt is on the table.Pattern string defines how to assemble parts:English: {0}''s {2} is {1}.French: {2} of {0} is {1}.Get pattern string from resource bundle

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Message Format - ExampleString person = ; // e.g. My AuntString place = ; // e.g. on the tableString thing = ; // e.g. pen

    String pattern = resourceBundle.getString(personPlaceThing);MessageFormat msgFmt = new MessageFormat(pattern);Object arguments[] = {person, place, thing);String message = msgFmt.format(arguments);

    System.out.println(message);

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Message Format Different data typesWe can also format other data types, like datesWe do this by adding a format type:String pattern = On {0, date} at {0, time} there was {1}.;MessageFormat fmt = new MessageFormat(pattern);Object args[] = {new Date(System.currentTimeMillis()), // 0 a power failure // 1 };

    System.out.println(fmt.format(args));On Jul 17, 2004 at 2:15:08 PM there was a power failure.This will output:

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Message Format Format stylesAdd a format style:String pattern = On {0, date, full} at {0, time, full} there was {1}.;MessageFormat fmt = new MessageFormat(pattern);Object args[] = {new Date(System.currentTimeMillis()), // 0 a power failure // 1 };

    System.out.println(fmt.format(args));On Saturday, July 17, 2004 at 2:15:08 PM PDT there was a power failure.This will output:

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Message Format Format style details

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Message Format No format typeIf no format type, data formatted like this:

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Message Format Counting filesPattern to display number of files:There are {1, number, integer} files in {0}.String pattern = resourceBundle.getString(fileCount);MessageFormat fmt = new MessageFormat(fileCountPattern);String directoryName = ;Int fileCount = ;Object args[] = {directoryName, new Integer(fileCount)};

    System.out.println(fmt.format(args));There are 1,234 files in myDirectory.Code to use the pattern:This will output messages like:

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Message Format Problems counting filesIf theres only one file, we get:There are 1 files in myDirectory.Could fix by testing for special case of one fileBut, some languages need other special cases:Dual formsDifferent form for no filesEtc.

    27th Internationalization and Unicode Conference

    Getting Started with ICU

    Message Format Choice formatChoice format handles all of thisUse special format element:

    There {1, choice, 0#are no files| 1#is one file| 1