Twenty Years of Language Resource Development and Distribution: A Progress Report on LDC Activities Christopher Cieri, Marian Reed, Denise DiPersio, Mark Liberman University of Pennsylvania, Linguistic Data Consortium {ccieri, mreed, dipersio, myl} AT ldc.upenn.edu


Twenty Years of Language Resource

Development and Distribution: A Progress Report on LDC Activities

Christopher Cieri, Marian Reed, Denise DiPersio, Mark Liberman

University of Pennsylvania, Linguistic Data Consortium

{ccieri, mreed, dipersio, myl} AT ldc.upenn.edu

Page 2: Models

[Timeline: 1993 to 2012]

20 Membership Years, currently running from January to December

Page 3: Models

[Timeline: 1993 to 2012]

Member

In any single membership year, LDC releases 30-36 corpora

(in early years, range was 14-50 corpora/year)

Page 4: Models

[Timeline: 1993 to 2012]

2 basic membership types: Standard & Subscription

3 member types: Non-Profit, Government, Commercial

Page 5: Models

[Timeline: 1993 to 2012]

Non-Profit Organization

Standard Membership

Fee: $2400 to help sustain the Consortium

Select any 16 data sets

Research Use

Ongoing Rights

Member

Page 6: Models

[Timeline: 1993 to 2012]

Member

Non-Profit Organization

Subscription Membership

Fee: $3850 to help sustain the Consortium

All data sets * 2 copies shipped automatically

Research Use

Ongoing Rights
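As a rough illustration of the two non-profit membership fees, the cost per data set can be worked out directly. This is a minimal sketch, not LDC's own calculation: the function name `cost_per_data_set` is hypothetical, the fees ($2400 and $3850) and the 16 data sets come from these slides, and the 30-36 corpora per year is the release range quoted on an earlier slide.

```python
def cost_per_data_set(fee: float, data_sets: int) -> float:
    """Return the membership fee divided by the number of data sets received."""
    if data_sets <= 0:
        raise ValueError("data_sets must be positive")
    return fee / data_sets


# Standard membership: $2400 for any 16 data sets (slide figures).
standard = cost_per_data_set(2400, 16)

# Subscription membership: $3850 for all data sets released in the year,
# assuming the 30-36 corpora per year quoted on an earlier slide.
subscription_best = cost_per_data_set(3850, 36)
subscription_worst = cost_per_data_set(3850, 30)

print(f"standard: ${standard:.2f} per data set")
print(f"subscription: ${subscription_best:.2f} to ${subscription_worst:.2f} per data set")
```

Under these assumptions a Standard member pays $150 per data set, and a Subscription member pays somewhat less per data set in a typical release year.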

Page 7: Value

[Timeline: 1993 to 2012]

Members

Members in 1996 and 1997 paid a $2000 fee for each year

Received 18 CALLHOME data sets, which cost $5,000,000 to create

plus 36 other corpora including:

Switchboard-1 Release 2

The CMU Kids Corpus

1996 Speaker Recognition Benchmark

Boston University Radio Speech Corpus

ROI = 1250x just on the CALLHOMEs ($5,000,000 of data for $4,000 in fees)

Development cost per single corpus: min = $42,000, max = $2,000,000

Lowest possible ROI = 153x
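The return figures on this slide read most naturally as multiples of the fees paid. A minimal sketch of that arithmetic follows; the function name `roi_multiple` is hypothetical. The CALLHOME inputs are the slide's own figures. The slide does not show how 153 was derived, so the second calculation uses one pairing that happens to reproduce it, assumed here rather than confirmed: the early-years minimum of 14 corpora at the minimum development cost, set against the $3850 subscription fee.

```python
def roi_multiple(value_received: float, fees_paid: float) -> float:
    """Return value received as a multiple of fees paid."""
    if fees_paid <= 0:
        raise ValueError("fees_paid must be positive")
    return value_received / fees_paid


# Slide figures: two membership years at $2000 each, and
# $5,000,000 to create the 18 CALLHOME data sets.
callhome = roi_multiple(5_000_000, 2 * 2000)

# Assumption (not stated on the slide): 14 corpora at the minimum
# development cost of $42,000 each, against a $3850 fee.
floor = roi_multiple(14 * 42_000, 3850)

print(f"CALLHOME alone: {callhome:.0f}x the fees paid")
print(f"assumed worst case: about {floor:.0f}x the fees paid")
```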

Page 8: Alternatives

[Timeline: 1993 to 2012]

Government Membership: same fees as Non-Profit, different agreement

Commercial Membership: higher fees, commercial rights

Many corpora can be licensed individually but at greater unit cost.

Can’t even afford $2400?

Wait, how did you get here, then?

LDC is a Consortium, an organization of organizations, established for their mutual benefit.

LDC sometimes trades data sets for other Language Resources or services, provided it offers the Consortium a positive return on their investment.

Let’s talk!

OK. Never mind.

Page 9: Origin & Model

Linguistic Data Consortium established 1992 via open, competitive DARPA solicitation, won by U. Penn.

centralize distribution, archiving of language data

manage licenses & distribution practice

Business Model: developed by overseers from government, industry and academia

DARPA funding covered operations, corpus creation for 5 years

required to be self-sufficient via annual membership fees, data licenses

new grants fund LR creation, not maintenance; NSF, NIST early supporters

Data Sources: donations, funded projects, community initiatives and LDC initiatives

Membership: members provide annual support, generally fees, sometimes data or services

receive ongoing rights to data published in years when they support LDC

reduced fees on older corpora, extra copies

access to LDC Online

Page 10: Benefits

Uniform licensing within & across research communities

4 basic user license types, 1000s of instances

~100 provider arrangements

no significant copyright issues in 20 years of operations

several independent issues resolved

Cost Sharing

relieves funding agencies of distribution costs, concerns

provides vast amounts of data to members

LDC annual membership benefit: ~30 corpora

development cost for 1 corpus ≥ (LDC membership fee * 10 | 100 | 1000)

Stable research infrastructure

LRs permanently accessible, across multiple platform changes

terms of use & distribution methods standardized & simple

members’ access to data is ongoing

any patches available via same methods

tools, specifications, papers distributed without fee

Pages 11-29: Models

[Figure, built up over these slides: ten Provider labels and ten Member labels; the last slide adds a Data Center label]

Page 30: IPR Intermediary

IPR intermediary improves combinatorics

Providers + Members

NOT Providers * Members

But much more

speaks for a group of 3168 organizations

greater experience negotiating IPR than any single member

dedicated staff, trained and experienced

linguists, computer scientists can focus on what they do best

works with researchers who have high-value contacts

spreads cost of any data acquisition over user base

consistent, attractive terms to providers and users

peace-of-mind for providers

clarity for members
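The combinatorics point at the top of this slide can be sketched in a few lines. This is an illustration under a simplifying assumption, not a description of LDC's actual contracts: it counts one agreement per provider-member pair without an intermediary, and one agreement per party with one. The function names are hypothetical. The counts of 10 are purely illustrative; ~100 provider arrangements and 3168 organizations are figures quoted in this deck.

```python
def direct_agreements(providers: int, members: int) -> int:
    """Every member negotiates with every provider: providers * members."""
    return providers * members


def intermediary_agreements(providers: int, members: int) -> int:
    """Each party signs once with the intermediary: providers + members."""
    return providers + members


# Illustrative small case: ten providers, ten members.
small_direct = direct_agreements(10, 10)
small_hub = intermediary_agreements(10, 10)

# Deck figures: ~100 provider arrangements, 3168 organizations.
large_direct = direct_agreements(100, 3168)
large_hub = intermediary_agreements(100, 3168)

print(f"10 x 10: {small_direct} direct vs {small_hub} via intermediary")
print(f"100 x 3168: {large_direct} direct vs {large_hub} via intermediary")
```

The gap widens as either side grows, which is why the additive model matters more at LDC's scale than in the small case.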

Page 31: Benchmarking

Since inception in 1992, LDC has distributed

>84000 copies

>1300 titles

>3168 organizations

>70 countries

About half of the titles are e-corpora

developed for technology evaluation programs

released generally after use in the relevant communities

64 titles added to Catalog since last LREC

>4 years of publications “in queue”!!!

8309 academic papers relying on LDC Corpora

search for such papers is ~ 60% complete

Page 32: Benchmarking

Page 33: Benchmarking

Page 34: LDC Roles

distribution & archiving

language resource production, including quality control

intellectual property rights and license management

human subject protocol management

data collection

annotation and lexicon building

creation of tools, specifications, best practices

knowledge transfer: documentation, metadata, consulting, training

corpus creation research (meta-research) and academic publication

resource coordination in large multisite programs

serving multiple research communities as funding panelists, workshop participants and oversight committee members

Page 35: LDC Structure

Mark Liberman, Director

Christopher Cieri, Executive Director

Christopher Walker, Manager, Software Development

Mohamed Maamouri, Sr. Research Administrator

Andrea Mazzucchi, Manager Systems

Stephanie Strassel, Sr. Assoc Dir Collection, Annotation

Denise DiPersio, Assoc Dir, External Relations

Mark Mandel, Researcher

Natalia Bragilevskaya, Manager LDC/IRCS RBO

Karina Czoka, Fiscal Coordinator

Ikeila Turner, Office Manager

Daniel Jacquette, Publications Manager

Marian Reed, Communications Manager

Name, Webmaster

Eleftheria Ahtaridis, Membership Manager

Andrew McMackin, SysAdmin

Wayne Hill, SysAdmin

Miguel Reynoso, Help Desk Coordinator

Yiwola Awoyale, Researcher

Moussa Bamba, Researcher

Seth Kulick, Researcher

Ann Bies, Res. Project Manager

Justin Mott, Lead Annotator

Dave Graff, Internal Consultant

Jon Wright, Lead Programmer

Kevin Walker, Lead Programmer

Haejoong Lee, Lead Programmer

John Malamon, Sr. Programmer/Analyst

Preston Cabe, Sr. Programmer/Analyst

Chris Caruso, Programmer/Analyst

Brendan Callahan, Programmer/Analyst

Will Haun, Programmer/Analyst

Robert Parker, Sr. Programmer/Analyst

Brian Gainor, Programmer/Analyst

Ann Sawyer, Research Project Coord

Lauren Summers, Research Project Coord

Xiaoyi Ma, Lead Programmer

Steve Grimes, Sr. Programmer/Analyst

Mike Ciul, Programmer/Analyst

Programmer/Analyst

Programmer/Analyst

Zhiyi Song, Sr. Project Manager

Kira Griffit, Res. Project Manager

Amanda Morris, Res. Project Manager

Kari van der Clouet, Sr. Project Manager

Xuansong Li, Research Administrator

Safa Ismael, Research Project Coord

Ramez Zakhary, Coord/Lead Annotator

Dalal Zakhary, Coord/Lead Annotator

Alonso Indacochea, Coord/Lead Annotator

Joe Ellis, Res. Project Manager

Jennifer Garland, Research Project Coord

Linda Brandschain, Res. Project Manager

Karen Jones, Research Project Coord

Stephanie Sessa, Coord/Lead Annotator

Neville Ryant, Research Programmer

Data Manager

Page 36: Grants in Data

Page 37: Grants in Data

Significant concern & policy about an ill-understood phenomenon

a crypto-zoology of HLT researchers

LDC Principle: no one with a bona fide research agenda and a genuine lack of ability to contribute will go without data

26 free data sets (click What’s Free on home page)

numerous, numerous arrangements to get data to needy researchers

Formalization:

grants in data each semester

requirements: data use statement, letter of support from advisor

Grants

2010: 8 corpora

2011: 24 corpora

2012: 8 corpora

CS, EE, oriental studies, second language acquisition and teaching

$40,000 awarded to date

Page 38: Programs

[Table columns: Program, Goal, NW, WB, BN/C, CTS, IV, Vid, OTHER. Only the Program and Goal columns survive.]

CALLHOME: STT

CALLFRIEND: LR

SWITCHBOARD: STT

Mixer: SR

LCTL: Translingual IR, MT

TDT: STT, MT, IR

TIDES: STT, MT, IR, IE

EARS: STT

GALE: STT, MT, IR, IE, SUM

MADCAT: HR

MR: QA

RATS: STT

BOLT: MT

BEST: SR

ALADDIN: Video ED

DOE Reading Enh.: Language Learning

DOE Dictionaries: Language Learning

LDC Online: Access

Net-DC: Networking

TalkBank: Networking

Bio-IE: IE

SCOTUS: Access, Diarization

Digging into Data: Mining

PNG/BOLD: Fieldwork

Page 39: LDC Data in NIST Evaluations

96 97 98 99 00 01 02 03 04 05 06 07 08 09 10 11

LRE ✓ ✓ ✓ ✓ ✓ ✓

SRE ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓

BN Re ✓ ✓ ✓ ✓

CTS Re ✓ ✓ ✓ ✓

SDR ✓ ✓ ✓

TDT ✓ ✓ ✓ ✓ ✓ ✓ ✓

ACE ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓

OpenMT ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓

DUC ✓ ✓ ✓ ✓ ✓ ✓ ✓

RT ✓ ✓ ✓ ✓ ✓ ✓ ✓

STD ✓

GALE Trans ✓ ✓ ✓ ✓ ✓ ✓

MetricsMaTr ✓ ✓

MADCAT ✓ ✓ ✓ ✓

TAC KBP ✓ ✓ ✓

TRECVid SED ✓ ✓ ✓ ✓

TRECVid MED ✓ ✓

Page 40: Data Collection

news text, journals, financial documents

web text: newsgroups, blogs, discussion fora

email, chat, SMS, tweets

biomedical text & abstracts

printed, handwritten & hybrid documents

broadcast news & conversation, podcasts

conversational telephone speech

lectures, interviews, meetings, field interviews

read & prompted speech

task oriented speech, role play, speech in noise

web video

animal vocalizations

Page 41: Adaptation: Annotation

data scouting, selection, triage

audio-audio alignment; bandwidth, signal quality, language, dialect, program, speaker

quick & careful transcription

segmentation & alignment at story, turn, sentence, word level

orthographic & phonetic script normalization

phonetic, dialect, sociolinguistic feature & supralexical

document zoning, handwriting transcription, OCR

tokenization and tagging of morphology, part-of-speech, gloss

syntactic, semantic, discourse function, disfluency, sense disambiguation

fine and coarse-grained topic, relevance, novelty, entailment

identification, classification of mentions in text of entities, relations, events, time, location & co-reference

knowledgebase population

single & multi-document summarization of various lengths from titles-200

translation, multiple translation, edit distance, translation post-editing, quality control

alignment of translated text at document, sentence, phrase & word levels

physics of gesture

identification, classification of entities and events in video

Page 42: Program Services

assess program needs: sponsors, developers, evaluators

develop timelines for LR creation and system evaluation

translate “wish lists” into feasible action plans

coordinate LR activities across & among programs

maintain data matrices of LR features and availability

maintain optimization, stabilization of data requirements

incorporate improving technology into data production

rapidly catalog, license, replicate, distribute program LRs

broaden program impact through general distribution

protect restricted data

Page 43: Cost Models

[Table residue. Phases: Development, Internal Distribution, External Distribution, Maintenance. Models: Consortial, Early CT, DARPA, NSF. Cell values: User Pays, Sponsor Pays.]

Page 44: Conclusion

Data Centers must adapt, maintain their role in LR sharing

Data Centers alone offer

dedicated labor force

specialized equipment

special training

needed to

fulfill their mission of lowering barriers to LR access

simplify discovery

guarantee longevity

reduce cost

Page 45: Conclusion

LDC is large & expanding (with increasing circumspection)

Much wheel reinvention in context of new initiatives

Offer services that allow HLT researchers to focus on HLT

Expanding in

volume

diversity

quality

languages

data types

annotations

services