

Background Information
• Duration: 15 years (2009-2023)
• Responsible organization: Academy of Sciences in Hamburg, Germany
• Realization: Institute of German Sign Language and Communication of the Deaf (University of Hamburg)
• Funding: € 8.4 million from the German Academies of Science Programme, plus additional resources provided by the University of Hamburg
• 150 person-years of work
• Team: 9 researchers, 4 Deaf research assistants, 1 technical staff, up to 18 student research assistants
• Goals:
  • Reference corpus of German Sign Language (DGS)
  • Corpus-based Dictionary of DGS - German

Involvement of the Deaf Community
• 328 informants
• 22 regional contact persons (find informants, raise public awareness of the project within the language community in their region)
• Focus group (planned): approximately 20-25 deaf experts (regionally rooted representatives of the language community) provide support and validate analyses
• Feedback: web-based portal (planned) to receive feedback from members of the language community, e.g. on usage and regional variation of lexical items
• Deaf team members, researchers and student co-workers
• Flow of information between project and community through presentations at local Deaf clubs, focus group, website, Facebook etc.

The Reference Corpus
• 328 informants: men and women, 4 age groups, 13 regions
• 350-400 hours of footage
• 2.25 million tokens (estimated)
• 500 TB raw data (expected)
• Metadata on informants’ linguistic and social background and studio session (IMDI standard)
• Tokenized, lemmatized and annotated
• Uses: basis for dictionary entries, language documentation, resource for basic linguistic research, resource for Deaf studies (texts on Deaf experiences and lives, Deaf culture), signed texts usable for sign language teaching
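The size estimates above can be related to each other with some simple arithmetic. The sketch below only divides the figures stated on the poster; it assumes (this is not stated in the source) that the estimated tokens and the expected raw data are spread evenly over all hours of footage, and the names `TOKENS`, `HOURS`, `RAW_TB` and `per_hour` are illustrative.

```python
# Figures taken from the poster; everything derived from them is a rough estimate.
TOKENS = 2_250_000   # estimated number of tokens
HOURS = (350, 400)   # lower and upper bound for hours of footage
RAW_TB = 500         # expected raw data in terabytes (all camera streams together)


def per_hour(total, hours):
    """Average amount per hour of footage, assuming an even spread."""
    return total / hours


# One value for each bound of the footage estimate.
tokens_per_hour = [per_hour(TOKENS, h) for h in HOURS]
tb_per_hour = [per_hour(RAW_TB, h) for h in HOURS]

for hours, tokens, tb in zip(HOURS, tokens_per_hour, tb_per_hour):
    print(f"{hours} h: about {tokens:.0f} tokens and {tb:.2f} TB per hour of footage")
```

Under these assumptions the estimates correspond to roughly 5,600 to 6,400 tokens per hour of footage, or about 1.6 to 1.8 tokens per second.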

The Dictionary
• Corpus-based
• Descriptive
• In electronic form
• 6000 sign entries (planned)
• Bidirectional: search via sign form or written word
• Sign entries including information on form, meaning, grammar, variants and usage
• Examples of use taken from the corpus
• Cross references to related and similar signs
• Dictionary grammar
• We are currently experimenting with a search-by-sample function.
• To be published in 2023

The iLex Environment
• Transcription and annotation tool
• Works with several synchronized video streams allowing the user to switch between different perspectives
• Integrated lexical database supports token-type-matching
• Metadata integrated into the database
• Multi-user approach
• Analyses via SQL statements
• Support of lexicographic workflow (work in progress)
• Support of quality assurance (work in progress)
• Export functions to ELAN, Quicktime with subtitles, HTML etc.
• Integrates video processing
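To illustrate what an analysis via SQL statements over an integrated lexical database might look like, here is a minimal sketch. The two-table layout (`types`, `tokens`), the column names and the sample glosses are assumptions made for this example and are not the actual iLex schema; an in-memory SQLite database stands in for the project database.

```python
import sqlite3

# Hypothetical layout, NOT the actual iLex schema: one row per type
# (lexical entry) and one row per token tagged in a transcript.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE types (id INTEGER PRIMARY KEY, gloss TEXT NOT NULL);
CREATE TABLE tokens (
    id INTEGER PRIMARY KEY,
    type_id INTEGER REFERENCES types(id),  -- NULL means not yet lemmatized
    transcript TEXT NOT NULL,
    start_ms INTEGER NOT NULL,
    end_ms INTEGER NOT NULL
);
""")

# Invented sample data for illustration only.
conn.executemany(
    "INSERT INTO types (id, gloss) VALUES (?, ?)",
    [(1, "HOUSE"), (2, "GO"), (3, "GOOD")],
)
conn.executemany(
    "INSERT INTO tokens (type_id, transcript, start_ms, end_ms) VALUES (?, ?, ?, ?)",
    [(1, "t01", 0, 400), (2, "t01", 400, 900),
     (1, "t02", 100, 520), (None, "t02", 520, 800)],
)


def tokens_per_type(connection):
    """Count the tokens matched to each type, most frequent first."""
    rows = connection.execute("""
        SELECT ty.gloss, COUNT(tok.id) AS n
        FROM types AS ty
        LEFT JOIN tokens AS tok ON tok.type_id = ty.id
        GROUP BY ty.id, ty.gloss
        ORDER BY n DESC, ty.gloss
    """)
    return rows.fetchall()


def unlemmatized_count(connection):
    """Count the tokens that have not been matched to any type yet."""
    return connection.execute(
        "SELECT COUNT(*) FROM tokens WHERE type_id IS NULL"
    ).fetchone()[0]


frequencies = tokens_per_type(conn)
pending = unlemmatized_count(conn)
```

Keeping tokens and types in the same database is what makes such queries possible: a frequency count or a list of unmatched tokens is a single join rather than an export-and-merge step.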

Studio Setup
• Mobile studio
• 7 cameras for 5 recording perspectives
  • A1 and B1: front views of informants: HD cameras & stereoscopic cameras
  • A2 and B2: bird's-eye views of informants: HD cameras
  • C: whole scene: HD camera
• 12 computers

Preliminary Basic Vocabulary
• Basic vocabulary of DGS and German
• Based on evaluation of published sign collections (not on corpus data)
• Signs verified by focus group and web-based public feedback
• To be published in 2013
• To be replaced by the general dictionary in 2023

[Figure: project timeline, 2009-2023. Phases shown: data collection; basic transcription; detailed transcription; lemma selection, analysis & compilation of dictionary entries; feedback & consultation of focus group; production. Milestones shown: publication of the preliminary basic vocabulary; publication of the public corpus; publication of the first corpus-based electronic dictionary DGS–German.]

Poster presented at the Theoretical Issues in Sign Language Research (TISLR) 10 Conference, Sept 30 - Oct 2, 2010 at Purdue University, Indiana, USA. The research leading to these results has received funding from the German Academies of Science Programme.


Data Collection
• 2009-2011
• 13 regions at 12 locations (mobile studio)
• 328 informants balanced for sex, age, region
• Studio session:
  • 2 informants (peer-to-peer situation)
  • Moderated by Deaf contact person
  • Duration: one day (approximately 5.5 hours of filming plus breaks)
• Elicitation tasks (about 20 different tasks)
  • Various stimuli (e.g. signed texts, movies, pictures, words)
  • Different subject areas to cover basic vocabulary
  • Various text types: e.g. conversation, discussion, description, re-telling, planning
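Balancing informants for sex, age and region can be checked by counting informants per combination of the three variables. The sketch below is one possible way to do this, not the project's procedure: the `Informant` record, the region labels and the sample data are hypothetical, and it assumes the three variables are fully crossed, which the poster does not state. Under that assumption, 2 sexes × 4 age groups × 13 regions gives 104 cells, or about 3 informants per cell for 328 informants.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product

# Age groups as labelled in the poster's chart; region labels are hypothetical.
AGE_GROUPS = ("18-30", "31-45", "46-60", "61+")
SEXES = ("f", "m")


@dataclass(frozen=True)
class Informant:
    """Minimal, hypothetical metadata record for one informant."""
    sex: str
    age_group: str
    region: str


def cell_counts(informants):
    """Count informants per (sex, age group, region) combination."""
    return Counter((i.sex, i.age_group, i.region) for i in informants)


def imbalance(informants, sexes, age_groups, regions):
    """Difference between the fullest and the emptiest cell (0 = perfectly balanced)."""
    counts = cell_counts(informants)
    values = [counts.get(cell, 0) for cell in product(sexes, age_groups, regions)]
    return max(values) - min(values)


# Invented sample: one informant per cell in two regions, plus one extra person.
regions = ("R1", "R2")
sample = [Informant(s, a, r) for s, a, r in product(SEXES, AGE_GROUPS, regions)]
sample.append(Informant("f", "18-30", "R1"))

spread = imbalance(sample, SEXES, AGE_GROUPS, regions)
```

A spread of 0 would mean every cell holds the same number of informants; during recruitment, cells with the lowest counts indicate where further informants are still needed.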

[Figure: studio layout showing Informant A, Informant B and the moderator, with camera positions A1, A2, B1, B2 and C.]

[Figure: sign language acquisition age (0-3, 4-6, >6) displayed for the age groups of informants (18-30, 31-45, 46-60, 61+); number of informants so far: 92.]

Transcription and Annotation
Translation
• Translation into German, segmentation into utterances
Basic Transcription
• Transcription / annotation carried out by student research assistants
• Supervised and checked by native signers
• Tokenization (segmentation into single signs)
• Lemmatization (token-type matching: identification and tagging of lexical items via glossing), tagging of productive signs and other signs
• Further specifications:
  • Variant, modified and deviant sign forms
  • Mouthings
Detailed Transcription
• Approximately 50% of the basic transcriptions will be transcribed in more depth as needed for analysis and dictionary production
• Differentiation of phonological variants, grammatical sign forms (e.g. plural, negation, modifications), use of space
• Coding of contextual meaning
• Syntactic categories
• Sign context
• Mouth gesture, (lexical) facial expressions
• Sub-sentence phrase structure
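The two core steps of basic transcription, tokenization and lemmatization, can be sketched as a small data model. This is a simplified illustration under stated assumptions, not the project's annotation scheme: the `Token` record, the contiguous time boundaries, the category labels and the sample glosses are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    """One segmented sign in a transcript (hypothetical record layout)."""
    start_ms: int
    end_ms: int
    gloss: Optional[str] = None    # filled in by lemmatization
    category: str = "unmatched"    # becomes "lexical" once matched to a type


def tokenize(boundaries_ms):
    """Step 1: turn sorted boundary timestamps into consecutive tokens.

    Assumes signs follow each other without gaps, which is a simplification.
    """
    return [Token(start, end) for start, end in zip(boundaries_ms, boundaries_ms[1:])]


def lemmatize(tokens, annotations, lexicon):
    """Step 2: match each token to a type via the gloss chosen by the annotator.

    Tokens whose gloss is not a known type stay "unmatched"; in the workflow
    described above, an annotator would tag these as productive or other signs.
    """
    for token, gloss in zip(tokens, annotations):
        if gloss in lexicon:
            token.gloss = gloss
            token.category = "lexical"
    return tokens


# Invented example: three signs, the second has no matching lexical type.
lexicon = {"HOUSE", "GO"}
tokens = lemmatize(tokenize([0, 400, 900, 1300]), ["HOUSE", None, "GO"], lexicon)
glosses = [t.gloss for t in tokens]
categories = [t.category for t in tokens]
```

Separating segmentation from type matching mirrors the order of the steps in the list above: the time spans exist first, and glosses and further annotations are attached to them afterwards.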

The Public Corpus
• Selected parts of the reference corpus (approx. 50 hrs) will be made publicly accessible (including English translation and basic transcription/annotation)

Analysis and Compilation of Dictionary Entries
• Analysis of spatial and grammatical behaviour of signs, contextual meaning, form variation, usage
• Abstraction from corpus data and other information (feedback, focus group) to give a general description of lexical signs, their forms, meanings and uses, variation, dialectal information

[Figure: project workflow diagram. Elements shown: data collection; German translation & basic transcription; detailed transcription; reference corpus; English translation; public corpus; published collections of signs; selection of signs; verification; preliminary basic vocabulary; focus group & web-based feedback; additional information on signs & uses; analysis; analysis & compilation into entries; sign entries; dictionary grammar; dictionary including dictionary grammar. Key labels: used for, task management, influence on.]

[Figure: iLex screenshot with the labels: list of sign entries (types); sign entry (type); tokens (of a type); transcript; token tag; drag & drop. Numbered steps: 1. segmentation (tokenizing); 2. token-type-matching (lemmatizing); 3. further annotations.]

INSTITUT FÜR DEUTSCHE GEBÄRDENSPRACHE UND KOMMUNIKATION GEHÖRLOSER

The DGS Corpus Project: Development of a Corpus Based Electronic Dictionary German Sign Language – German

Dolly Blanck, Thomas Hanke, Ilona Hofmann, Sung-Eun Hong, Olga Jeziorski, Thimo Kleyboldt, Lutz König, Susanne König, Reiner Konrad, Gabriele Langer, Rie Nishio, Christian Rathmann, Stephanie Vorwerk, Sven Wagner

University of Hamburg, Institute of German Sign Language and Communication of the Deaf