Introduction - Universitetet i Oslo

Introduction

• Contrastive analysis without sentence alignment

• Additional layers of annotation will give better predictors

• Interesting analyses can be done already, even on a simple data set

• Cross syntactic, morphological and word order parameters to highlight expected differences

• Does language matter more than text sample?

Data set

lang., book  copula   comp     genatr   genatrfirst  xadvpart  relpron  particles
cu Luke      0.03537  0.01681  0.03585  0.10619      0.03093   0.01380  0.05266
cu Mark      0.02929  0.02009  0.02159  0.12500      0.03558   0.01159  0.05557
cu Matt      0.03113  0.01783  0.02762  0.10909      0.02674   0.01280  0.07432
got Luke     0.02072  0.01673  0.09243  0.03448      0.01514   0.01116  0.01116
got Mark     0.01990  0.03216  0.03452  0.14019      0.03560   0.00903  0.02646
got Matt     0.01915  0.02885  0.03618  0.11111      0.02814   0.01135  0.03689
grc Cor      0.02362  0.01379  0.04842  0.17576      0.00763   0.00983  0.10754
...          ...      ...      ...      ...          ...       ...      ...

Clause types: COMP and copula

• Occurrence of COMP as a measure of complexity

• Varying use of null copulas

• Auxiliaries or not?

• Prediction: Latin should stick out

Clause types

[Figure: scatter plot of copula (x-axis) against comp (y-axis); each point is one book in one language, labelled with the book name.]

Genitive attributes

• OCS has very restricted use of genitive attributes

• Genitives may be preposed or postposed, but to varying extent in the different languages

• Prediction: OCS should stick out

Genitive attributes

[Figure: scatter plot of genatr (x-axis) against genatrfirst (y-axis); each point is one book in one language, labelled with the book name.]

Embedded predication

• Greek has a very large share of XADV participles

• The other languages sometimes replace these by relative clauses (Latin in particular)

• Prediction: Greek and Latin should be neatly separated

Embedded predication

[Figure: scatter plot of xadvpart (x-axis) against relpron (y-axis); each point is one book in one language, labelled with the book name.]

Correspondence analysis

• Two-way frequency plots allow us to consider factors pairwise

• Statistical modelling allows us to visualise the contributions of multiple factors in one plot

• Correspondence analysis: reduce a multi-way comparison to a two-dimensional similarity space

• Differences between rows and between columns are converted into distances (close = similar, distant = different)

• Are there systematic differences in the frequencies of various grammatical phenomena as a function of language and book?
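The reduction described in the bullets above can be sketched in a few lines. This is a minimal illustration of simple correspondence analysis via a singular value decomposition, not the procedure used to produce the plots in these slides. The function name `correspondence_analysis` and the toy counts are assumptions made for the example.

```python
import numpy as np

def correspondence_analysis(table):
    """Simple correspondence analysis of a non-negative two-way table.

    Returns the first two principal coordinates of rows and columns and
    the share of total inertia carried by each of the two axes.
    """
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                    # correspondence matrix
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses
    expected = np.outer(r, c)          # what independence would predict
    S = (P - expected) / np.sqrt(expected)   # standardised residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    inertia = sv ** 2
    explained = inertia / inertia.sum()
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    return row_coords[:, :2], col_coords[:, :2], explained[:2]

# Toy counts, invented for illustration: 4 text units x 3 features.
toy = [
    [30, 10, 5],
    [28, 12, 6],
    [8, 25, 20],
    [6, 27, 22],
]
row_xy, col_xy, share = correspondence_analysis(toy)
print(row_xy)   # rows with similar profiles end up close together
print(share)    # share of inertia on Factor 1 and Factor 2
```

Rows and columns are placed in the same two-dimensional space, so a text unit that lies near a feature label uses that feature more than the overall average would predict.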

Correspondence analysis

[Figure: correspondence analysis plot; Factor 1 (55.4 %) on the x-axis, Factor 2 (18.7 %) on the y-axis. Points are the books in each language; the variables copula, comp, genatr, genatrfirst, xadvpart, relpron and particles are plotted in the same space.]

Observations

• Language matters more than text sample

• Some languages are more closely grouped than others: why?

• The case of Gothic Luke — more than one Wulfila?

Doing comparative syntax with morphology

• We can look at how parts of speech are combined in the texts

• We extract sequences of three POS-tags

Sample sentence

et  vocavit  nomen  eius  Iesum
C   V        N      P     N

→ C.V.N, V.N.P, N.P.N
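As a minimal sketch of the extraction step, the sample sentence can be turned into trigrams with a sliding window of width three. The function name `pos_trigrams` is an assumption for illustration; the tags are those given on the slide.

```python
def pos_trigrams(tags):
    """Return every run of three consecutive POS tags, joined with dots."""
    return [".".join(tags[i:i + 3]) for i in range(len(tags) - 2)]

# Tags for "et vocavit nomen eius Iesum", as given on the slide.
tags = ["C", "V", "N", "P", "N"]
print(pos_trigrams(tags))
```

A sentence of n tags yields n - 2 trigrams, so sentences with fewer than three words contribute none.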

Trigrams

• If we do this to the entire corpus, we get more than 200,000 trigrams

• There are 1076 unique trigrams

• This makes it possible to compare word order across the languages

• Prediction: Word order should be very similar, except for the article in Greek
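To compare units, the trigram counts have to be put on a common scale. The sketch below builds one row of relative frequencies per (language, book) unit. Normalising within each unit is an assumption, since the slides do not say how the frequencies were computed, and the names `trigram_profile` and `profile_table` as well as the toy tag sequences are invented for the example.

```python
from collections import Counter

def trigram_profile(tag_sequences):
    """Relative frequency of each POS trigram over a list of tagged sentences."""
    counts = Counter()
    for tags in tag_sequences:
        for i in range(len(tags) - 2):
            counts[".".join(tags[i:i + 3])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tri: n / total for tri, n in counts.items()}

def profile_table(units):
    """Build a units x trigrams table; trigrams absent from a unit get 0.0."""
    profiles = {name: trigram_profile(sents) for name, sents in units.items()}
    columns = sorted({tri for p in profiles.values() for tri in p})
    rows = {name: [p.get(tri, 0.0) for tri in columns]
            for name, p in profiles.items()}
    return columns, rows

# Toy input: these tag sequences are invented for illustration only.
units = {
    ("la", "Matthew"): [["C", "V", "N", "P", "N"]],
    ("grc", "Matthew"): [["C", "V", "S", "N", "P"]],
}
columns, rows = profile_table(units)
print(columns)
for name, values in rows.items():
    print(name, values)
```

Every unit gets a value for every trigram seen anywhere in the corpus, which is why a table of this kind is mostly zeroes.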

Trigrams

lang., book    I+A+N  G+I+D  I+A+A  I+A+C  A+D+A              I+A+V  ...
grc, Matthew   0      0      0      0      0.000134156157768  0      ...
grc, Mark      0      0      0      0      0.000113442994895  0      ...
...            ...    ...    ...    ...    ...                ...    ...

• Hopeless dataset: 1076 observations for each of 50 units

• Lots of zeroes and other 'useless' information

• But we can reduce the 1076 observations to three axes and still account for 57.6% of the variation in the data
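A reduction of this kind can be sketched as a principal component analysis computed from a singular value decomposition. This is an illustration under assumptions: the slides do not say whether the columns were centred or scaled, the function name `pca` is hypothetical, and the matrix below is random placeholder data with the same shape as the trigram table (50 units, 1076 columns), so its explained share will not match the figure on the slide.

```python
import numpy as np

def pca(X, n_components=3):
    """Project the rows of X onto the first principal components.

    Returns the component scores and the share of total variance
    carried by each retained component.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # centre every column
    U, sv, Vt = np.linalg.svd(Xc, full_matrices=False)
    variance = sv ** 2
    explained = variance / variance.sum()
    scores = U[:, :n_components] * sv[:n_components]
    return scores, explained[:n_components]

# Placeholder data only: random values, not the trigram frequencies.
rng = np.random.default_rng(0)
X = rng.random((50, 1076))
scores, explained = pca(X, n_components=3)
print(scores.shape)       # one row of three coordinates per unit
print(explained.sum())    # share of variance kept by the three axes
```

The sum of the three explained shares plays the role of the percentage quoted in the bullet above.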

Three-dimensional data

                  PC 1         PC 2         PC 3
grc, Matthew     -16.284392   11.4117630     3.0072437
grc, Mark        -15.057998   14.0906852     3.3003379
grc, Luke        -15.186215   11.1504303     3.3776230
grc, Revelation  -16.630761   13.4693890   -23.9870676
la, Matthew       12.826632   15.2713093     1.8494721
la, Mark           8.726130    7.0003690     2.3731114
la, Luke          10.654470   10.0395966     4.0825441
la, Revelation    13.216987   19.3034364   -29.2234370
...               ...         ...           ...

Two-dimensional view

[Figure: two-dimensional plot of the trigram data; Factor 1 (38.9 %) on the x-axis, Factor 2 (7 %) on the y-axis. Points are the books in each language, plotted together with the POS trigrams (I.A.N, G.I.D, A.D.A, ...).]

Without the Greek

[Figure: the same kind of plot with the Greek texts left out; Factor 1 (12.9 %) on the x-axis, Factor 2 (12 %) on the y-axis. Points are the books in the remaining languages, plotted together with the POS trigrams (I.A.N, R.G.N, I.C.R, ...).]

Some observations

• Contrary to what we saw in the first part, text sample matters more than language when it comes to trigrams

• Word order is slavishly transferred from the Greek to the other texts and is of little use to PROIEL

• But could be used in authorship studies!

What about you?

[Figure: two-factor plot; Factor 1 (70.7 %) on the x-axis, Factor 2 (29.3 %) on the y-axis. Point labels are personal or user names (Anastasia, angelika, dag, eirik, sean and others), plotted together with labels including NULL, emptyV and emptyC.]