basic natural language processing for swiss german texts

27
CorpusLab, URPP “Language and Space” Basic natural language processing for Swiss German texts Tanja Samardˇ zi´ c 09/06/2017 Page 1

Upload: others

Post on 31-Oct-2021

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Basic natural languageprocessing for Swiss GermantextsTanja Samardzic

09/06/2017 Page 1

Page 2: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Long-term contributionNoemi AepliFatima StadlerYves ScherrerElvira Glaser

Funding

Hasler Foundation grant No 16038

UZH URPP ’Language and Space’

Agreement with Spitch

Specific tasks

Henning BeywlChristof BlessAlexandra BunzliMatthias FriedliAnne GohringNoemi GrafAnja Hasse

Gordon HeathAgnes KolmerMike LinggPatrick MachlerEva PetersUliana Petrunina

Janine Richner-SteinerHana RuchBeni RuefPhillip StrobelSimone UeberwasserAlexandra Zoller

09/06/2017 Basic natural language processing for Swiss German texts Page 2

Page 3: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Data

Page 4: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Oral history project ArchiMob

09/06/2017 Basic natural language processing for Swiss German texts Page 4

Page 5: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

The ArchiMob corpus sample

09/06/2017 Basic natural language processing for Swiss German texts Page 5

Page 6: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Some numbers

44 documents selected by Janine Richner-Steiner and Matthias Friedli,supervised by Elvira Glaser

Release 1.0 (2016):

– 34 documents, around 500 000 word tokens

– 23/44 documents transcribed in the period 2004–2014

– 11/44 documents transcribed in 2015, in collaboration with Spitch

Next release (2017):

– 43 documents, around 650 000 word tokens

– 6/44 documents transcribed in 2016

– 3/44 in progress

09/06/2017 Basic natural language processing for Swiss German texts Page 6

Page 7: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Format

Page 8: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Current format

09/06/2017 Basic natural language processing for Swiss German texts Page 8

Page 9: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Content

je ja ITJde dann ADVhet hat VAFINme man PISno noch ADVgluegt gelugt VVPPtankt gedacht VVPPdasch das ist PDS+ez jetzt ADVde der ARTgenneraal general NN

jaa ja ITJdas das PDSischsch ist VAFINen en PPERez jetzt ADV

09/06/2017 Basic natural language processing for Swiss German texts Page 9

Page 10: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Transcription

Page 11: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Transcription

je ja ITJde dann ADVhet hat VAFINme man PISno noch ADVgluegt gelugt VVPPtankt gedacht VVPPdasch das ist PDS+ez jetzt ADVde der ARTgenneraal general NN

jaa ja ITJdas das PDSischsch ist VAFINen en PPERez jetzt ADV

09/06/2017 Basic natural language processing for Swiss German texts Page 11

Page 12: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Manual transcription

1. 16 documents - Nisus Writer– No segmentation (only turns)– No text to speech alignment– Converted into XML, added segmentation and alignment

2. 7 documents - FOLKER (Schmidt, 2012)– Segmented into chunks of 4-10 seconds– XML and alignment output

3. 11 documents - EXMARaLDA (Schmidt, 2012)– same as FOLKER, just more convenient

09/06/2017 Basic natural language processing for Swiss German texts Page 12

Page 13: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Some details

– Based on Dieth guidelines, but gradually simplified

– Utterance as the basic unit

– Turns not explicitly annotated

– Inconsistence in writing (pronouns and clitics)

– Pauses, repetitions

– Incomprehensible speech

09/06/2017 Basic natural language processing for Swiss German texts Page 13

Page 14: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Normalisation

je ja ITJde dann ADVhet hat VAFINme man PISno noch ADVgluegt gelugt VVPPtankt gedacht VVPPdasch das ist PDS+ez jetzt ADVde der ARTgenneraal general NN

jaa ja ITJdas das PDSischsch ist VAFINen en PPERez jetzt ADV

09/06/2017 Basic natural language processing for Swiss German texts Page 14

Page 15: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Approach

– Manual normalisation of 6 documents, VARD2 and IGT

– Automatic normalisation– Character-level machine translation (CSMT) with MOSES– Training on the 6 manually normalised documents

09/06/2017 Basic natural language processing for Swiss German texts Page 15

Page 16: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

CSMT

Translation model: p(normalised |transcribed)

i s c h s c h i s t

Language model: p(normalisedi |normalisedi−1)

i s t

09/06/2017 Basic natural language processing for Swiss German texts Page 16

Page 17: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Current state of the art

Yves Scherrer and Nikola Ljubesic (KONVENS 2016)

– Larger translation units (utterances instead of words)

– Language model augmented with German spoken data

– Improved tuning

– Result: 90.46 % accuracy

09/06/2017 Basic natural language processing for Swiss German texts Page 17

Page 18: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Part-of-speech tagging

Page 19: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Part-of-speech

je ja ITJde dann ADVhet hat VAFINme man PISno noch ADVgluegt gelugt VVPPtankt gedacht VVPPdasch das ist PDS+ez jetzt ADVde der ARTgenneraal general NN

jaa ja ITJdas das PDSischsch ist VAFINen en PPERez jetzt ADV

09/06/2017 Basic natural language processing for Swiss German texts Page 19

Page 20: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Tagger development

STTS+ tag set

Train Test % Acc. % OOV

TuBa-D/S Normalised 70.31 24.21Starting NOAH Original 60.56 30.72

Removed TuBa-D/S Normalised 70.68 24.21punctuation NOAH Original 73.09 30.72

NOAH +Adapted ArchiMob Original 90.09 –

09/06/2017 Basic natural language processing for Swiss German texts Page 20

Page 21: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Current activities

Tagger adaptation:

– Active learning: gradually add ArchiMob data in the train set

– CRF tagger

09/06/2017 Basic natural language processing for Swiss German texts Page 21

Page 22: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Speech-to-text

Page 23: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Speech-to-text

Acoustic model: p(transcribed |sound)

/.../ /.../ /.../ /.../ /.../ /.../ /.../ das ischsch en ez

Language model: p(transcribedi |transcribedi−1)

das ischsch en ez

09/06/2017 Basic natural language processing for Swiss German texts Page 23

Page 24: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Approach

– Improving Spitch prototype with new language models

– Our own speech-to-text development with Kaldi

– Manual transcription

09/06/2017 Basic natural language processing for Swiss German texts Page 24

Page 25: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Next steps

Page 26: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Next steps

– Continue transcription, PoS tagging, normalisation

– Neural transducers (deep learning) for normalisation

– Subword language models for speech-to-text

– New data

09/06/2017 Basic natural language processing for Swiss German texts Page 26

Page 27: Basic natural language processing for Swiss German texts

CorpusLab, URPP “Language and Space”

Your feedback!