digital data archives in the humanities

Pacific and Regional Archive for Digital Sources in Endangered Cultures. Digital data archives in the humanities: issues for participating in the semantic web, the case of PARADISEC. Linda Barwick, University of Sydney. APAN Semantic Web workshop, Bangkok, 27 January 2005.



Page 1: Digital data archives in the humanities

Pacific and Regional Archive for Digital Sources in Endangered Cultures

Digital data archives in the humanities

Linda Barwick, University of Sydney
APAN Semantic Web workshop, Bangkok, 27 January 2005

Issues for participating in the semantic web: the case of PARADISEC

Page 2: Digital data archives in the humanities

Endangered languages

•Over 2000 of the world’s 6000 languages in the Asia-Pacific region

•Number likely to fall to a few hundred by 2100 (UNESCO)

•Australian researchers active in region since 1950s - making unique recordings of unrepeatable events

•Recordings now themselves endangered (format obsolescence, media deterioration, loss of metadata)

Page 3: Digital data archives in the humanities

PARADISEC’s mission

•To preserve and make accessible Australian researchers’ field recordings of endangered languages and musics from the Asia-Pacific region

•Preservation: to adopt world’s best practice standards and formats to maximise sustainability and future useability of the collection

•Access: to take advantage of emerging information and communication technologies to maximise access to our collection by both researchers and cultural heritage communities

Page 4: Digital data archives in the humanities

PARADISEC structure

CIs: Cliff Goddard, Hugh de Ferranti

CIs: Steve Bird, Nick Evans, Cathy Falk, Janet Fletcher, John Hajek

CIs: Andrew Pawley, John Bowden, Malcolm Ross, Alan Rumsey

CIs: William Foley, Allan Marett, Jane Simpson

Project Manager (Metadata guru): Nick Thieberger

Audio Archiving Unit - Director: Linda Barwick; Audio: Frank Davey; Project Liaison: Amanda Harris

Store account - web interface: Stuart Hungerford

Page 5: Digital data archives in the humanities


Networking

•Main campuses (University of Sydney, University of Melbourne, Australian National University) connected by Grangenet (next generation research network, 10Gbps connections)

•Pay subscription, not traffic costs

•Satellite campus UNE connected by AARnet (Australian research and education network - currently billed traffic cost, 155Mbps connection)

•Both with connections to APAN community (Asia Pacific Advanced Networks) - potential for linking to regional and international R&E networks - potential traffic costs an issue
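
To give a feel for what the two quoted line rates mean in practice, the sketch below estimates how long one terabyte would take to move over each link. It is only back-of-envelope arithmetic: the 1 TB payload, the decimal definition of a terabyte and the assumption of a fully available link with no protocol overhead are illustrative choices, not figures from the slides.

```python
def transfer_hours(size_bytes: float, link_bits_per_second: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to move size_bytes over a link at the given line rate.

    efficiency is the usable fraction of the line rate (1.0 = ideal link).
    """
    seconds = (size_bytes * 8) / (link_bits_per_second * efficiency)
    return seconds / 3600


TB = 10**12  # decimal terabyte (assumption; the slides do not say which unit is meant)

# Line rates as quoted on the slide.
GRANGENET_BPS = 10 * 10**9  # 10Gbps
AARNET_BPS = 155 * 10**6    # 155Mbps

for name, rate in [("Grangenet", GRANGENET_BPS), ("AARnet", AARNET_BPS)]:
    hours = transfer_hours(1 * TB, rate)
    print(f"{name}: {hours:.1f} hours per TB at full line rate")
```

Passing a lower `efficiency` (for example 0.5) gives a more cautious estimate for a shared link.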

Page 6: Digital data archives in the humanities


Storage

•Australian Partnership for Advanced Computing National Facility Mass Data Storage System - Hierarchical Storage Manager system

•Funded by consortium of Australian higher education bodies

•Tape robot system - can handle 1.2PB

•PARADISEC will add 2-3TB per year once satellite ingest commissioned

•Current horizon of facility 2008 - project PARADISEC collection up to 9TB by then

•Will need to apply to host material/share data from other collections

Page 7: Digital data archives in the humanities


Software

•Initial metadata database in Filemaker Pro 6 with periodic XML dumps for OLAC static harvesting

•Currently being ported to MySQL/PHP to allow dynamic harvesting and other functionality

•Python software for managing repository and website (Stuart Hungerford, ANU)

•Developing Java-based geographic search interface (TimeMap)

•All based on Open Source tools
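
As an illustration of what a periodic XML dump of catalogue rows could look like, here is a minimal sketch that serialises one row as flat Dublin Core elements. The field names, the `record` wrapper element and the sample values are hypothetical; a real OLAC record uses OLAC's own container and extensions, which are not shown here, and this is not PARADISEC's export code.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def item_to_dc_xml(item: dict) -> str:
    """Serialise one catalogue row (a plain dict) as a flat Dublin Core record."""
    record = ET.Element("record")  # hypothetical wrapper, not the OLAC container
    for field in ("identifier", "title", "creator", "language", "date", "format"):
        value = item.get(field)
        if value:  # skip empty fields rather than emit empty elements
            ET.SubElement(record, f"{{{DC_NS}}}{field}").text = value
    return ET.tostring(record, encoding="unicode")


# Hypothetical sample row; the values are placeholders, not real catalogue data.
sample = {
    "identifier": "EXAMPLE-001",
    "title": "Field recording (placeholder title)",
    "creator": "Example Collector",
    "format": "audio/x-wav",
}
print(item_to_dc_xml(sample))
```

A static dump would simply write one such record per catalogue row into a file that the harvester fetches on its own schedule.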

Page 8: Digital data archives in the humanities

Audio Ingest

•Initially ingested as raw WAV on AudioCube 5 Dell 670 workstations running Wavelab (2005 will add remote Pyramix workstations)

•Masters 24-bit 96kHz Broadcast WAV Format (uncompressed audio with encapsulated metadata)

•Some lower rate (e.g. if digital original 16-bit 48kHz from DAT)

•WAV > BWF by Quadriga audio archiving software

•Derivatives produced by batch processing - CD-audio quality (16-bit, 44.1kHz) and mp3 quality (128kbps)
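
The storage cost of each format on this slide follows directly from sample rate, bit depth and channel count. The sketch below works out bytes per hour for each one. The channel count is an assumption (stereo), file headers and embedded metadata are ignored, and the mp3 figure is taken to mean 128 kilobits per second.

```python
def pcm_bytes_per_hour(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Uncompressed PCM payload for one hour of audio, ignoring file headers."""
    return sample_rate_hz * (bit_depth // 8) * channels * 3600


def compressed_bytes_per_hour(bitrate_bps: int) -> int:
    """Size of one hour of constant-bitrate audio such as mp3."""
    return bitrate_bps // 8 * 3600


CHANNELS = 2  # assumption: stereo; the slide does not state the channel count

master = pcm_bytes_per_hour(96_000, 24, CHANNELS)   # 24-bit 96kHz archival master
dat = pcm_bytes_per_hour(48_000, 16, CHANNELS)      # 16-bit 48kHz DAT original
cd = pcm_bytes_per_hour(44_100, 16, CHANNELS)       # CD-audio quality derivative
mp3 = compressed_bytes_per_hour(128_000)            # assumed 128 kbit/s mp3 derivative

for label, size in [("master", master), ("DAT", dat), ("CD", cd), ("mp3", mp3)]:
    print(f"{label}: {size / 10**9:.2f} GB per hour")
```

Under these assumptions an archival master is roughly three times the size of its CD-quality derivative, which is why the masters dominate the storage budget.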

Page 9: Digital data archives in the humanities

Digital preservation

•“Azoulay” server partitioned for working files and archive partition for sealed masters - current capacity 750GB (>3TB in 2005)

•Sealed masters archived to 100GB data tapes on University of Sydney LTO Mass Data Storage System (high-low watermark script) - duplicate data tapes kept at 2 locations on campus

•Sealed masters mirrored to Australian Partnership for Advanced Computing national Store facility (Canberra) nightly

•Password-protected online access to Store facility
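
The slide mentions a high-low watermark script without describing it. The sketch below shows the general idea of such a policy, not the script actually used: when disk usage rises above a high watermark, files are selected for migration to tape until usage falls to a low watermark. The thresholds, the least-recently-accessed ordering and the `StoredFile` type are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class StoredFile(NamedTuple):
    name: str
    size: int           # bytes
    last_access: float  # seconds since the epoch


def select_for_migration(files: List[StoredFile], capacity: int,
                         high: float = 0.9, low: float = 0.7) -> List[StoredFile]:
    """Pick files to move to tape once usage exceeds the high watermark.

    Returns the least recently accessed files whose removal brings usage down
    to the low watermark, or an empty list if usage has not passed the high one.
    """
    used = sum(f.size for f in files)
    if used <= high * capacity:
        return []
    selected = []
    for f in sorted(files, key=lambda f: f.last_access):
        if used <= low * capacity:
            break
        selected.append(f)
        used -= f.size
    return selected


# Hypothetical example: a 100-byte disk holding three files, 95% full.
files = [
    StoredFile("a.wav", 40, 1.0),
    StoredFile("b.wav", 30, 2.0),
    StoredFile("c.wav", 25, 3.0),
]
print([f.name for f in select_for_migration(files, capacity=100)])
```

Using two thresholds rather than one keeps the script from running on every small write: migration starts only when the disk is nearly full and then frees a useful amount of space in one pass.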

Page 10: Digital data archives in the humanities

Data repository contents

•Repository totals 21 January 2005

•total files: 2714

•total items: 847

•total size: 1.1TB

•total hours audio: 668 hours

•file types: .wav, .mp3 (1210 each); .tif (171), .jpg (46), .pdf (34), .txt (3), .rtf (8), .xml (32)


Page 11: Digital data archives in the humanities

Data repository collections

Bradley (5hr), Brotchie (15hr), Capell (9hr)*, Corris (6hr), Crowther (2hr), Donohue (3hr), Dutton (266hr), Fedden (7hr), Foley (23hr), Gardner (56hr), Kartomi (2hr)*, Loughnane (9hr), Lawton (3hr), Laycock (29hr)

McElhanon (41hr), McIntyre (10hr), Margetts (17hr), Poignant (2hr), Rumsey (20hr), San Roque (1hr), Sam (6hr), Tepano (19hr), Thieberger (39hr), Toulmin (35hr), Voorhoeve (33hr), Wurm (11)*, Evans (Hons thesis), Thieberger (PhD thesis)

* Ingestion ongoing January 2005

[Chart: PDSC Jan 2005, by collection ID: AB1, AC1, AM2, AM3, AM4, AR1, AR2, BE1, CLV1, DB1, DG3, DL1, KM1, LS1, LSR1, MC1, MC2, MD1, MK2, MT1, NT1, P130_19, RL1, RL2, RP1, SAW2, SF1, TD1, TT1, WF1]

Page 12: Digital data archives in the humanities

PARADISEC Repository Languages November 2004

PAPUA N. GUINEA: Abau, Ambonese Pidgin, Angoram (Kanduanuin), Angoram (Moim dialect), Aomie, Arapesh, Arifama, Aunalei, Auwim, Awomo, Ba, Balawaia, Barai, Baruga, Barupu (Warapu), Be'anivia, Biage, Bibo, Binandere, Bodinumu, Boera, Boine, Boku, Boridi, Bouxula, Brat, Momire, Buin, Burum, Chimba, Chirima, Daga, Darava, Dawawa, Dedua, Dima, Dimadima, Dina, Doga, Domu, Doromu, Doura, Efogi, Efogi Dialects, Emo, Enivilogo, Fore, Fuyugey, Gabadi, Ginuman, Gwedena, Herei, Hiae Motu, Hiri Motu, Hube, Hula, I'ai, Ikega, Ioma, Isaka (Krisa), Kaipi, Kairi, Kambot, Kanga, Karama, Karawari Lg (Ambinwari), Karukaru, Kâte, Kinalaknga, Kimi, Kiriwina, Koiari, Koita, Koitabu, Kokila, Kokoro, Komba, Kopar, Koriki, Koriko, Kosorong, Kovai, Kovio, Kubuirubu, Kuman, Kumukio, Kuni, Kunimaipa, Kwale, Laimodo, Mada'a, Magi, Mâgobineng, Magore, Maisin, Maiwa, Managalas, Manam, Manubara, Manumu, Mapei, Mapena, Mari, Maria, Mekeo, Melpa, Mian, Mid-Wahgi, Migabac, Mindik, Miniafa, Mogoni, Mom, Mor, Motu, Muhiang Arapesh, Nabak, Naga, Namanadza, Naoro, Nara, New Ireland Pidgin, Ngala, Nomu, Notu, Ondoro, One (Onne), Onjab, Ono, Opao, Orokaiva, Orokolo, Ouma, Paiwa, Police Motu, Porome, Qld Pidgin, Rabuka, Raepa Tati, Saliba, Samo, Sene, Sepik Tok Pisin, Sialum, Sinaugoro, Sona, Suau, Suku, Surai, Taboro, Tairuma, Tauade, Tobo, Tok Pisin, Tolai, Uberi, Ubir, Ubir Gonjoe, Vesilogo, Vioribaiwa, Wamora, Wangun, Wiga, Wosera, Yele, Yewudu, Yimas, Yoba

COOK ISLANDS: Rarotongan, Pukapuka

FRENCH POLYNESIA: Tahitian

CHILE: Rapa Nui

PALAU: Palauan

SOLOMONS: Babatana, Ririo, Ruviana, Varese, Lau, Santa Cruz

INDONESIA: Asmat, Brat, Hatam, Inanwatan, Manikion, Moi, Ningrum, Sahu, Sebyar, Tinam, Todahe, Tok Pisin, Yahadian

INDIA: Rajbangsi

NEW CALEDONIA: Dehu

VANUATU: South Efate, Bislama, Lelepa

FIJI: Lauan

TONGA: Tongan

Page 13: Digital data archives in the humanities

Sample item interface


Page 14: Digital data archives in the humanities

Sample item interface

Page 15: Digital data archives in the humanities

Sample catalog metadata

Page 16: Digital data archives in the humanities

Metadata January 2005

•1800 items (recordings or theses) digitised or assessed for digitisation (1629 findable online via metadata repository)

•254 languages from 39 countries in Asia-Pacific

•Cassettes: 1256 hours

•Reel-to-reel tapes: 417,356 metres of tape

•Video: 356 hours

Page 17: Digital data archives in the humanities

Open Language Archives Community (OLAC)

http://www.language-archives.org

•Sub-community of Open Archives Initiative

•Worldwide virtual library of language resources

•PARADISEC one of 29 participating archives

AIMS

•develop consensus on best current practice for digital archiving of language resources

•develop network of interoperating repositories & services for housing & accessing such resources

Page 18: Digital data archives in the humanities

Metadata OLAC harvest
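
OLAC harvesting builds on the Open Archives Initiative's harvesting protocol (OAI-PMH), in which a harvester requests records from a repository over plain HTTP. The sketch below only constructs such a request URL. The base URL is a placeholder, and it assumes the repository exposes its records under the `olac` metadata prefix.

```python
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str = "olac",
                     from_date: str = "") -> str:
    """Build an OAI-PMH ListRecords request URL.

    from_date, if given, restricts the harvest to records changed since then.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


BASE_URL = "http://archive.example.org/oai"  # hypothetical repository endpoint

print(list_records_url(BASE_URL, from_date="2005-01-01"))
```

The optional `from` argument is what makes incremental harvesting possible: the harvester asks only for records changed since its last visit.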

Page 19: Digital data archives in the humanities

DELAMAN connections www.delaman.org

lacito.vjf.cnrs.fr/archivage

www.uaf.edu/anlc/

emeld.org

www.ailla.utexas.org

paradisec.org.au

www.arts.auckland.ac.nz/ant

www.aiatsis.gov.au

www.hrelp.org/archive/

www.mpi.nl/DOBES

Page 20: Digital data archives in the humanities
Page 21: Digital data archives in the humanities

General Ontology for Linguistic Description

Page 22: Digital data archives in the humanities

Music Description Ontologies?

•Much more complicated situation because of commercial music industry interests

•Most ontologies designed for commercial music (albums, tracks, composers etc) or Western music notation (diatonic scale etc)

•Most recent ethnomusicological discourse concentrates on social context rather than description or analysis and is suspicious of universalist approaches

•Some current initiatives e.g. EU MusicNetwork

Page 23: Digital data archives in the humanities

Issues for semantic web

•Small-scale specialist archive with few staff and precarious funding - not resourced for huge amount of work for RDF markup

•Curator-intensive - cannot be readily automated

•Need to motivate and involve researchers and communities in description as well as high-level ICT advisors

•Present highest priority: salvage of endangered media

•Lack of appropriate ontologies especially for music
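
To make "RDF markup" concrete, the sketch below turns item-level metadata into RDF statements in N-Triples form, using Dublin Core element names as predicates. It covers only the mechanical part. The item URI and field values are hypothetical, and choosing terms for finer-grained description (the ontology problem raised above) is exactly what it does not solve.

```python
DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace


def escape_literal(text: str) -> str:
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def item_to_ntriples(subject_uri: str, fields: dict) -> str:
    """Emit one N-Triples statement per metadata field of an item."""
    lines = []
    for name, value in fields.items():
        lines.append(f'<{subject_uri}> <{DC}{name}> "{escape_literal(value)}" .')
    return "\n".join(lines)


# Hypothetical item; the URI and values are placeholders, not catalogue data.
print(item_to_ntriples(
    "http://example.org/item/1",
    {"title": "Field recording (placeholder)", "creator": "Example Collector"},
))
```

Generating statements like these from existing whole-item metadata can be automated; describing the content inside each recording is the curator-intensive part.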

Page 24: Digital data archives in the humanities

But…

•We have a good foundation - well-structured data and metadata (for whole-item level) conforming to international standards

•We are in conversation with international disciplinary communities through OLAC, EMELD, DELAMAN

•Our collection is of high cultural heritage and scholarly value, of interest to international community

•We are motivated to learn more from other large-scale distributed digital data archives

Page 25: Digital data archives in the humanities

PARADISEC gratefully acknowledges support from:

•Partner Universities (Sydney, Melbourne, ANU, UNE)

•Australian Research Council LIEF scheme

•Australian Partnership for Sustainable Repositories (SORRT testbed)

•Australian Partnership for Advanced Computing

•Grangenet

•ANU Internet Futures


Page 26: Digital data archives in the humanities

Contact us

•http://www.paradisec.org.au

•[email protected] (Director)

•[email protected] (Project Manager)

Page 27: Digital data archives in the humanities

Relevant URLs

•PARADISEC website http://paradisec.org.au/

•PARADISEC repository login http://store.apac.edu.au/cgi-bin/pdsc-v3.0.cgi/login

•PARADISEC streaming trial http://paradisec.org.au/streamingtrial.html

•Transcript page image trial http://www.austehc.unimelb.edu.au/~gavan/lana/hdms.htm

•EMELD General Ontology for Linguistic Description http://www.emeld.org/tools/ontology.cfm