EACC to Unicode Migration

Ki Tat LAM
Head of Library Systems
The Hong Kong University of Science and Technology Library
lblkt@ust.hk

OCLC CJK Users Group 2006 Annual Meeting
April 8, 2006, San Francisco

Last revised: 8 April 2006

Contents

Migrating systems from EACC to Unicode environments
  Why migrate?
  What has been done?
  HKIUG Unicode Initiatives
Issues
  EACC/Unicode mapping table
  Round-trip crosswalk
  Improving searching with TSVCC linking
  Font display

An Observation …

曆 [Calendar]

歷 [History]

历 Simplified form of 曆 and 歷

曆法历法

[System for determining the beginning, length and divisions of a year]

曆法 was incorrectly displayed as 歷法. Is it a data entry error? A display problem? Or something else?

Why Migrate?

EACC (East Asian Character Code, ANSI Z39.64-1989) was introduced into the CJK library community by RLG in the early 1980s (known as REACC at that time)
It was an important milestone: for the first time, we had a unified C-J-K standard with a relatively large character set (about 16,000 characters) for use in bibliographic records

Why Migrate? [cont.]

By adopting EACC as an alternate character set in MARC 21 (at that time called USMARC), libraries with East Asian collections were able to share and use CJK cataloging records via the OCLC and RLIN cataloging platforms
However, great effort is required for integrated library systems (ILS) to make use of the EACC-based CJK data in the records

Why Migrate? [cont.]

Communicating in EACC is extremely difficult because EACC was never supported in the mainstream IT environment
  EACC is hardly supported by operating systems, fonts, input methods, editors, etc., either in the old days or today
  EACC is also unlikely to be supported by web browsers in the current Internet era
Why? EACC's three-byte coding structure is alien to the binary computing world

Why Migrate? [cont.]

Due to its unpopularity, EACC became a frozen standard, and there is no way to fix errors or add characters
If EACC is stored natively in the bibliographic database, then in order to input and display CJK characters at the application layers (such as the OPAC and record editor), the ILS has to rely on lossy mapping tables to map EACC to other character encodings (e.g. BIG5, GB, JIS, KSC and UTF-8)

Why Migrate? [cont.]

Unicode comes to the rescue
  Single standard for written texts of almost all languages in the world
  Has more than 96,000 characters, most of which are CJK
  An active standard, with constant updates
  Widely adopted and supported in the current IT environment: major operating systems and web browsers, plus many devices and applications, speak the Unicode language

Why Migrate? [cont.]

After more than 25 years of EACC influence, it is unlikely that all library systems and data can be migrated overnight to the Unicode mainstream
A period of parallel operation is anticipated, with co-existing EACC and Unicode bibliographic data interchanged among systems, resulting in confusion and data loss
Even after systems have migrated to Unicode, there are still problems that require attention

What has been done?

MARC 21 specifications for MARC-8 and UCS/Unicode environments
LC's code tables for mapping between MARC-8 and Unicode
OCLC WorldCat migration to Unicode platform
OCLC Connexion's Unicode support
LC's Voyager upgrade
INNOPAC/Millennium
HKIUG Unicode Initiatives

MARC 21 Specifications

In 2000, the Library of Congress issued:
  Specifications to distinguish the encoding of MARC 21 records in the original (MARC-8) environment and in the new UCS/Unicode environment [http://www.loc.gov/marc/specifications/speccharintro.html]
MARC-8 means characters are encoded in one 8-bit byte (e.g. ASCII) or three 8-bit bytes (e.g. EACC)

A MARC 21 bibliographic record in ISO 2709 format viewed in Notepad, showing CJK characters encoded in EACC in the MARC-8 environment

EACC bytes (hex): 21 62 62, 21 39 25, 21 30 21
Characters: 黃, 大, 一
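The three-byte structure shown on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `split_eacc` and the tiny `SLIDE_MAPPING` table are hypothetical, the three code-to-character pairs are the ones given on the slide, and the sketch ignores the escape sequences that MARC-8 uses to switch character sets.

```python
def split_eacc(data: bytes) -> list[str]:
    """Split a run of EACC bytes into three-byte codes written as hex strings."""
    if len(data) % 3 != 0:
        raise ValueError("EACC data must be a multiple of three bytes")
    return [data[i:i + 3].hex().upper() for i in range(0, len(data), 3)]


# The nine bytes shown on the slide.
SAMPLE = bytes.fromhex("216262213925213021")

# Hypothetical three-entry table, taken only from the slide's example.
SLIDE_MAPPING = {"216262": "黃", "213925": "大", "213021": "一"}

codes = split_eacc(SAMPLE)
text = "".join(SLIDE_MAPPING[code] for code in codes)

# What a plain text editor shows: each byte is read as one ASCII character.
as_ascii = SAMPLE.decode("ascii")

print(codes)     # three-byte codes
print(text)      # characters from the slide
print(as_ascii)  # the same bytes read as ASCII
```

Reading the same bytes as ASCII gives a string of punctuation, letters and digits, which is why an EACC record looks like gibberish in a general-purpose editor.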

MARC 21 Specifications [cont.]

UCS/Unicode Environment [http://www.loc.gov/marc/specifications/speccharucs.html]
  Use UTF-8 as the character encoding
  Leader position 9 contains value "a"
  Field 066 (Character Sets Present) is not needed
  The script identification information in subfield 6 (Linkage) can be dropped
  Lengths are specified by number of 8-bit bytes, rather than number of characters
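Two of these points lend themselves to a quick check in code: the leader flag and the byte-based lengths. The sketch below is illustrative only; the helper names and the sample leader string (including its record length) are made up for the example.

```python
def is_unicode_record(leader: str) -> bool:
    """Return True when leader position 9 (zero-based) holds the value 'a'."""
    return len(leader) == 24 and leader[9] == "a"


def length_in_bytes(value: str) -> int:
    """Length as counted in the UCS/Unicode environment: UTF-8 bytes."""
    return len(value.encode("utf-8"))


# Hypothetical 24-character leaders; only position 9 differs.
UNICODE_LEADER = "00714cam a2200205 a 4500"
MARC8_LEADER = "00714cam  2200205 a 4500"

name = "黃大一"
print(is_unicode_record(UNICODE_LEADER), is_unicode_record(MARC8_LEADER))
print(len(name), length_in_bytes(name))  # characters versus UTF-8 bytes
```

Each of the three CJK characters takes three bytes in UTF-8, so a length computed by counting characters would understate the field length by a factor of three here.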

MARC 21 Specifications [cont.]

  Unicode combining rule for diacritics, i.e. combining marks follow rather than precede the character they modify
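The reordering this rule implies can be sketched as follows. The sketch assumes the text has already been transcoded to Unicode code points but is still in the mark-before-base order; the function name `marks_after_base` is hypothetical, and `unicodedata` is the Python standard library module.

```python
import unicodedata


def marks_after_base(chars: str) -> str:
    """Move each run of combining marks so it follows the next base character."""
    out = []
    pending = []
    for ch in chars:
        if unicodedata.combining(ch):
            pending.append(ch)       # hold the mark until its base arrives
        else:
            out.append(ch)           # base character first
            out.extend(pending)      # then the marks that preceded it
            pending.clear()
    out.extend(pending)              # trailing marks with no base are kept as-is
    return "".join(out)


# "resume" with a combining acute accent placed before each "e" it modifies.
mark_first = "r\u0301esum\u0301e"
unicode_order = marks_after_base(mark_first)
composed = unicodedata.normalize("NFC", unicode_order)
print(composed)
```

After reordering, standard NFC normalization can compose each base and mark pair into a single precomposed character where one exists.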

A MARC 21 bibliographic record in ISO2709 format viewed in Notepad, showing CJK characters encoded in UTF-8 in UCS/Unicode environment

MARC 21 Specifications [cont.]

LC issued code tables for mapping between MARC-8 and UCS/Unicode:
  Not only for EACC, but also for other Latin and non-Latin scripts such as ANSEL, Hebrew, Cyrillic, Arabic and Greek
  Provide essential information for an ILS's Unicode implementation

MARC 21 Specifications [cont.]

UNICODE-MARC Discussion List [http://listserv.loc.gov/listarch/unicode-marc.html]
  Since July 2005
  Active discussion on issues concerning Unicode implementation in MARC 21
  Some of the discussion was summarized as MARC Proposal 2006-04, "Technique for conversion of Unicode to MARC-8," which was approved by MARBI in January 2006, with changes [http://www.loc.gov/marc/marbi/2006/2006-04.html]

OCLC WorldCat and Connexion

WorldCat: migrated to Oracle with Unicode support
Released Connexion client software
  Unicode-based, running on Windows
  Comprehensive CJK support
  Relies on Windows' IME for input of CJK characters
  Export and import of records in both MARC-8 and UCS/Unicode environments

LC's Catalog

Its Voyager system was upgraded recently to provide Unicode support
Capable of displaying and searching CJK data in 880 fields
Allows export of records in MARC-8 and Unicode environments
LC issued a cataloging policy position paper on the Unicode implementation at LC (March 2006), with details on the current implementation and future opportunities [http://www.loc.gov/catdir/cpso/unicode.pdf]

INNOPAC/Millennium

INNOPAC has supported EACC, and CJK in general, since its implementation at HKUST Library 15 years ago
Millennium clients run on Windows XP with Unicode support
CJK records are stored in EACC internally, but the system provides an option to migrate the storage to Unicode
HKIUG Unicode Task Force is working with the vendor to improve the Unicode storage

HKIUG Unicode Initiatives

HKIUG: Hong Kong Innovative Users Group
  Founded in 1996
  Members from all 15 INNOPAC libraries in Hong Kong and Macau, including the eight Hong Kong government-funded universities
HKIUG Unicode Initiatives: since 2003, working closely with the ILS vendor (Innovative Interfaces Inc.) to improve INNOPAC/Millennium's CJK support

HKIUG Unicode Initiatives [cont.]

Achievements:
  Developed the HKIUG version of the EACC to Unicode mapping table
  Resolved the EACC to Unicode multi-mapping problem
  Developed TSVCC (Traditional, Simplified, Variant Chinese Characters) linking tables
HKIUG Unicode Task Force: maintains the Unicode and TSVCC tables and assists the vendor on Unicode migration; members from CUHK, CITYU, HKUST and HKU

Migration Issues

The need for an EACC/Unicode mapping table
Multi-mapping and round-trip failure problems
TSVCC linking
Font display problem

HKIUG EACC/Unicode Table

First released in September 2003; last revised in August 2005
Contains:
  15672 EACC characters
  7043 pure CCCII characters
Mapping for EACC characters follows LC as much as possible
The 7043 "Pure CCCII" characters have no EACC equivalent; they are included to avoid too many missing characters

HKIUG EACC/Unicode Table [cont.]

Identified:
  160 multi-mapping linked cases, e.g. [example image not reproduced]
  49 multi-mapping unlinked cases, e.g. [example image not reproduced]
These cause failure in round-trip crosswalk

Round-trip Crosswalk Failure

[Diagram: Library (EACC) on one side, OCLC WorldCat (Unicode) on the other, connected by "Import to OCLC" and "Export from OCLC"]

1. Library contributes 历 in EACC {274349}, which is the simplified form of 曆
2. Connexion finds {274349} in the mapping table and stores 历 in Unicode as U+5386
3. Connexion finds {274349} and {27462A} in the mapping table and decides to output 历 in EACC {27462A}
4. Library receives 历 in EACC {27462A}, which is the simplified form of 歷

[Screenshots: U+5386; Export]

Export output is {27 46 2A}: incorrect!
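The failure can be reproduced with a four-entry mapping table. The sketch below is a simplified illustration, not Connexion's actual logic: the four EACC codes and their characters come from the slides, while the reverse-table rule ("the last EACC code listed wins") is an arbitrary assumption standing in for whatever choice the exporting system makes.

```python
from collections import defaultdict

# EACC code -> Unicode character, using only codes quoted in this talk.
EACC_TO_UNICODE = {
    "214349": "\u66c6",  # 曆
    "274349": "\u5386",  # 历 (simplified form of 曆)
    "21462A": "\u6b77",  # 歷
    "27462A": "\u5386",  # 历 (simplified form of 歷)
}

# Reverse table. ASSUMPTION: when several EACC codes map to one Unicode
# character, the last one in the table silently replaces the earlier ones.
UNICODE_TO_EACC = {char: code for code, char in EACC_TO_UNICODE.items()}


def round_trip(eacc_code: str) -> str:
    """Import an EACC code into Unicode, then export it back to EACC."""
    return UNICODE_TO_EACC[EACC_TO_UNICODE[eacc_code]]


def multi_mapped() -> dict[str, list[str]]:
    """List Unicode characters reachable from more than one EACC code."""
    groups = defaultdict(list)
    for code, char in EACC_TO_UNICODE.items():
        groups[char].append(code)
    return {char: codes for char, codes in groups.items() if len(codes) > 1}


for code in EACC_TO_UNICODE:
    result = round_trip(code)
    status = "ok" if result == code else "CHANGED"
    print(code, "->", result, status)
```

Whatever tie-break rule the exporter uses, one of the two EACC codes for U+5386 cannot survive the round trip, because the Unicode side no longer records which code the character came from.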

TSVCC Linking

When searching 历法 ("Li fa"), you will prefer to retrieve records that have:
  历法
  曆法
where 曆 and 历 have a Traditional-Simplified relationship
Similarly, when searching 屏, you will prefer to retrieve its Variant 屛
This requires linking T, S and V forms during searching
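One way to picture this kind of linking is query expansion: every linked character in the search term is replaced by each member of its group, and all resulting strings are searched. The sketch below is a toy illustration under that assumption; the two link groups come from the examples on this slide, and the names `LINK_GROUPS`, `expand` and `search` are hypothetical.

```python
from itertools import product

# Toy link groups taken from the slide's examples.
LINK_GROUPS = [
    {"曆", "历"},  # Traditional / Simplified
    {"屏", "屛"},  # Variant forms
]

# Character -> sorted list of all forms linked to it.
LINKED = {ch: sorted(group) for group in LINK_GROUPS for ch in group}


def expand(term: str) -> set[str]:
    """Return every spelling of the term obtainable by swapping linked forms."""
    options = [LINKED.get(ch, [ch]) for ch in term]
    return {"".join(combo) for combo in product(*options)}


def search(term: str, titles: list[str]) -> list[str]:
    """Return titles containing any linked spelling of the term."""
    variants = expand(term)
    return [t for t in titles if any(v in t for v in variants)]


titles = ["曆法", "历法", "歷史"]
print(search("历法", titles))             # with linking
print([t for t in titles if "历法" in t])  # exact match only
```

Expansion grows multiplicatively with the number of linked characters in the term, so a production system would more likely normalize both the index and the query to one representative form instead.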

In LC's Online Catalog, searching title 曆法 will retrieve 3 hits.

Searching with 历, the simplified form of 曆, will however retrieve 3 other hits.

慈禧太後? Excuse me, are these typos? Shouldn't it be 慈禧太后?

Google is capable of linking 餘 and 余

TSVCC Linking [cont.]

HKIUG Unicode Task Force constructed two versions of TSVCC Linking tables
  EACC Version [released November 2004]
  Unicode Version [draft created March 2006]
for ILSs that store characters in EACC and in Unicode respectively

TSVCC Linking [cont.]

EACC Version
  Table M (80 entries): linking relationship is not purely from EACC, e.g.
    214349 曆 | 274349 历 | 2D4349 暦 | 21462A 歷 | 27462A 历 | 4B462A 歴 | #U+5386 multi-mapped 27462A,274349
  Table V (3065 entries): linking relationship is purely from EACC, e.g.
    21306C 仇 | 2D306C 讎 | 33306C 讐 | 4B306C 雠
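In the two sample entries, linked EACC codes appear to share their last two bytes and differ only in the first byte. Assuming that pattern is what "purely from EACC" refers to (an inference from these examples, not a statement from the talk), a small parser can show why the Table V entry forms one group while the Table M entry needs an extra link. The function names are hypothetical and the entry format is taken from the slide.

```python
from collections import defaultdict

TABLE_M_ENTRY = ("214349 曆 | 274349 历 | 2D4349 暦 | 21462A 歷 | "
                 "27462A 历 | 4B462A 歴 | #U+5386 multi-mapped 27462A,274349")
TABLE_V_ENTRY = "21306C 仇 | 2D306C 讎 | 33306C 讐 | 4B306C 雠"


def parse_entry(line: str) -> list[tuple[str, str]]:
    """Parse 'CODE char | CODE char | #comment' into (code, char) pairs."""
    pairs = []
    for field in line.split("|"):
        field = field.strip()
        if not field or field.startswith("#"):
            continue  # skip empty fields and the trailing comment
        code, char = field.split()
        pairs.append((code, char))
    return pairs


def group_by_last_two_bytes(pairs: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group characters whose EACC codes share the second and third bytes."""
    groups = defaultdict(list)
    for code, char in pairs:
        groups[code[2:]].append(char)
    return dict(groups)


v_groups = group_by_last_two_bytes(parse_entry(TABLE_V_ENTRY))
m_groups = group_by_last_two_bytes(parse_entry(TABLE_M_ENTRY))
print(v_groups)  # one group: the link is visible in the EACC codes alone
print(m_groups)  # two groups, joined only by the multi-mapping of U+5386
```

The Table M entry splits into a 4349 group and a 462A group, which matches its comment: the two groups are tied together by the multi-mapped simplified character rather than by the code structure.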

TSVCC Linking [cont.]

Unicode Version
  Still in draft construction
  So far has 3061 entries, e.g.
    U+5C5B 屛 | U+5C4F 屏 | U+6452 摒 | #EACC link ([27/21]415A) AND Variant form of U+5C4F is U+5C5B
    U+965D 陝 | U+965C 陜 | U+9655 陕 | #EACC link ([23/29]4A44) AND Simplified form of U+965D is U+9655
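For a system that stores Unicode, entries in this format could be turned into a normalization map applied to both index keys and queries. The sketch below is one possible reading, not the HKIUG or vendor implementation: it assumes the first code point of each entry serves as the representative form, reads characters from the `U+` code points (treating the glyph column as documentation only), and uses hypothetical function names.

```python
TABLE_LINES = [
    "U+5C5B 屛 | U+5C4F 屏 | U+6452 摒 | #EACC link ([27/21]415A) "
    "AND Variant form of U+5C4F is U+5C5B",
    "U+965D 陝 | U+965C 陜 | U+9655 陕 | #EACC link ([23/29]4A44) "
    "AND Simplified form of U+965D is U+9655",
]


def parse_entry(line: str) -> list[str]:
    """Return the characters of one entry, decoded from their U+ code points."""
    chars = []
    for field in line.split("|"):
        field = field.strip()
        if not field or field.startswith("#"):
            continue  # skip the trailing comment
        code, _glyph = field.split()
        chars.append(chr(int(code[2:], 16)))
    return chars


def build_normalizer(lines: list[str]) -> dict[str, str]:
    """Map every linked character to the first character of its entry."""
    table = {}
    for line in lines:
        group = parse_entry(line)
        for char in group:
            table[char] = group[0]  # ASSUMPTION: first form is representative
    return table


def normalize(text: str, table: dict[str, str]) -> str:
    """Replace each linked character with its representative form."""
    return "".join(table.get(ch, ch) for ch in text)


NORMALIZER = build_normalizer(TABLE_LINES)
key_a = normalize("\u5c4f\u98a8", NORMALIZER)  # 屏 + 風
key_b = normalize("\u5c5b\u98a8", NORMALIZER)  # 屛 + 風
print(key_a == key_b)
```

Normalized keys are suitable for matching only; displaying them in place of the original text would show forms the record never contained.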

TSVCC Linking [cont.]

Plan to include linking of New/Old forms in the TSVCC Unicode Version, e.g. [example image not reproduced]

TSVCC Linking [cont.]

Results of implementing TSVCC Linking:
  Improvement in searching: higher recall
  Trade-off: lower precision
  If search results are sorted/displayed in TSVCC normalized form, misleading and inaccurate display may occur, such as the OCLC Connexion browse list display problem mentioned previously

Font Issues

Do not believe in "What you see is what you have", because what you see varies with fonts!
For example, the following glyphs have different code points in EACC: [glyph images not reproduced]

Font Issues [cont.]

But in Unicode, they are assigned the same code points. Depending on the font in use, you will see different glyphs: [glyph images not reproduced]

Conclusion

How far are we?
  Both LC and OCLC have done enormous work in enabling and promoting the use of Unicode in MARC records
  ILS vendors are working very hard to implement and enhance Unicode support
  Libraries and CJK experts are providing advice and suggesting solutions

Conclusion [cont.]

We have reviewed various migration issues:
  The need for an accurate EACC/Unicode mapping table
  Extending to non-EACC characters
  Multi-mappings and round-trip failure
  TSVCC Linking
  Font display issues

Conclusion [cont.]

The failure of round-trip crosswalk between systems will continue to be a problem until everyone interchanges MARC records purely in Unicode. This will only happen when the majority of systems store and use data natively in Unicode.
Unlike EACC, Unicode does not have a built-in linking relationship. Implementing TSVCC is essential for improving searching.

Additional References

Assessment of Options for Handling Full Unicode Character Encodings in MARC 21. Part 1: New Scripts (January 2004) and Part 2: Issues (June 2005). [http://www.loc.gov/marc/marbi/list-report.html]
Aliprand, Joan M. The structure and content of MARC 21 records in the Unicode environment. Information Technology and Libraries, v.24, no.4, December 2005, p.170-179.
Wong, Philip and K.T. Lam. HKIUG's Unicode projects: untangling the chaotic codes. HKIUG Annual Meeting 2005. [http://hdl.handle.net/1783.1/2429]

Thank You!