Introduction - Universitetet i oslo · Introduction Contrastive analysis without sentence alignment...
Introduction
• Contrastive analysis without sentence alignment
• Additional layers of annotation will give better predictors
• Interesting analyses can be done already, even on a simple data set
• Cross syntactic, morphological and word order parameters to highlight expected differences
• Does language matter more than text sample?
Data set
lang.  book  copula   comp     genatr   genatrfirst  xadvpart  relpron  particles
cu     Luke  0.03537  0.01681  0.03585  0.10619      0.03093   0.01380  0.05266
cu     Mark  0.02929  0.02009  0.02159  0.12500      0.03558   0.01159  0.05557
cu     Matt  0.03113  0.01783  0.02762  0.10909      0.02674   0.01280  0.07432
got    Luke  0.02072  0.01673  0.09243  0.03448      0.01514   0.01116  0.01116
got    Mark  0.01990  0.03216  0.03452  0.14019      0.03560   0.00903  0.02646
got    Matt  0.01915  0.02885  0.03618  0.11111      0.02814   0.01135  0.03689
grc    Cor   0.02362  0.01379  0.04842  0.17576      0.00763   0.00983  0.10754
...
Clause types: COMP and copula
• Occurrence of COMP as a measure of complexity
• Varying use of null copulas
• Auxiliaries or not?
• Prediction: Latin should stick out
Clause types
[Figure: scatter plot of copula (x-axis) against comp (y-axis); points are labelled by book name.]
Genitive attributes
• OCS has very restricted use of genitive attributes
• Genitives may be preposed or postposed, but to varying extent in the different languages
• Prediction: OCS should stick out
Genitive attributes
[Figure: scatter plot of genatr (x-axis) against genatrfirst (y-axis); points are labelled by book name.]
Embedded predication
• Greek has a very large share of XADV participles
• The other languages sometimes replace these by relative clauses (Latin in particular)
• Prediction: Greek and Latin should be neatly separated
Embedded predication
[Figure: scatter plot of xadvpart (x-axis) against relpron (y-axis); points are labelled by book name.]
Correspondence analysis
• Two-way frequency plots allow us to consider factors pairwise
• Statistical modelling allows us to visualise the contributions of multiple factors in one plot
• Correspondence analysis: reduce a multi-way comparison to a two-dimensional similarity space
• Differences between rows and between columns are converted into distances (close = similar, distant = different)
• Are there systematic differences in the frequencies of various grammatical phenomena as a function of language and book?
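The idea of converting differences between rows into distances can be sketched with the chi-square distance between row profiles, which is the distance that correspondence analysis approximates in its low-dimensional plot. This is only an illustration, not the analysis behind the slides: the function name chi_square_distances and the toy counts are invented, and the sketch assumes a table of raw counts in which no column is all zeroes.

```python
from math import sqrt

def chi_square_distances(table):
    """Pairwise chi-square distances between the row profiles of a count table."""
    grand_total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    n_cols = len(table[0])
    # Column masses: each column's share of the grand total
    col_masses = [sum(row[j] for row in table) / grand_total for j in range(n_cols)]
    # Row profiles: each row rescaled so that it sums to 1
    profiles = [[x / total for x in row] for row, total in zip(table, row_totals)]
    n = len(profiles)
    return [[sqrt(sum((profiles[a][j] - profiles[b][j]) ** 2 / col_masses[j]
                      for j in range(n_cols)))
             for b in range(n)] for a in range(n)]

# Toy counts, invented for illustration: rows are texts, columns are features
table = [
    [30, 10, 60],   # text A
    [28, 12, 60],   # text B, similar to A
    [10, 50, 40],   # text C, different from both
]
dist = chi_square_distances(table)
print(dist[0][1] < dist[0][2])  # A is closer to B than to C
```

Dividing by the column mass gives rare features more weight than a plain Euclidean distance would, which is why two texts can end up far apart on the strength of an infrequent construction.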
Correspondence analysis
[Figure: correspondence analysis plot, Factor 1 (55.4 %) against Factor 2 (18.7 %); points are labelled by book name, together with the variables copula, comp, genatr, genatrfirst, xadvpart, relpron and particles.]
Observations
• Language matters more than text sample
• Some languages are more closely grouped than others: why?
• The case of Gothic Luke — more than one Wulfila?
Doing comparative syntax with morphology
• We can look at how parts of speech are combined in the texts
• We extract sequences of three POS-tags
Sample sentence
et  vocavit  nomen  eius  Iesum
C   V        N      P     N
→ C.V.N, V.N.P, N.P.N
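The extraction step for the sample sentence can be sketched as a sliding window of width three over the tag sequence. The function name pos_trigrams is invented for this illustration, and the sketch assumes trigrams are taken within a single sentence; the slides do not say whether they may cross sentence boundaries.

```python
def pos_trigrams(tags):
    """Return every run of three consecutive POS tags, joined by dots."""
    return [".".join(tags[i:i + 3]) for i in range(len(tags) - 2)]

# Tags of the sample sentence: et vocavit nomen eius Iesum
tags = ["C", "V", "N", "P", "N"]
trigrams = pos_trigrams(tags)
print(trigrams)  # ['C.V.N', 'V.N.P', 'N.P.N']
```

A sentence of n tags yields n - 2 trigrams, so sentences shorter than three words contribute nothing.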
Trigrams
• If we do this to the entire corpus, we get more than 200,000 trigrams
• There are 1076 unique trigrams
• This makes it possible to compare word order across the languages
• Prediction: Word order should be very similar, except for the article in Greek
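One way to make the texts comparable is to turn the trigram counts of each unit (language and book) into relative frequencies over a shared vocabulary. The sketch below assumes that each value is a trigram's count divided by the total number of trigrams in that unit; the slides do not state the normalisation. The function names and the toy corpus are invented for illustration.

```python
from collections import Counter

def trigram_profile(sentences):
    """Relative frequency of each POS trigram in one text (a list of tag lists)."""
    counts = Counter()
    for tags in sentences:
        counts.update(".".join(tags[i:i + 3]) for i in range(len(tags) - 2))
    total = sum(counts.values())
    return {tri: n / total for tri, n in counts.items()} if total else {}

def frequency_table(corpus):
    """One row per unit over the shared trigram vocabulary; absent trigrams get 0."""
    profiles = {unit: trigram_profile(sents) for unit, sents in corpus.items()}
    vocabulary = sorted({tri for p in profiles.values() for tri in p})
    rows = {unit: [p.get(tri, 0.0) for tri in vocabulary]
            for unit, p in profiles.items()}
    return vocabulary, rows

# Toy data, invented for illustration only
corpus = {
    ("la", "Matthew"): [["C", "V", "N", "P", "N"]],
    ("grc", "Matthew"): [["C", "V", "S", "N"], ["V", "S", "N"]],
}
vocabulary, rows = frequency_table(corpus)
print(vocabulary)
print(rows[("grc", "Matthew")])
```

Because every unit gets a value for every trigram in the vocabulary, most cells are zero once the vocabulary grows large.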
Trigrams
lang., book   I+A+N  G+I+D  I+A+A  I+A+C  A+D+A              I+A+V  ...
grc, Matthew  0      0      0      0      0.000134156157768  0      ...
grc, Mark     0      0      0      0      0.000113442994895  0      ...
...
• Hopeless dataset: 1076 observations for each of 50 units
• Lots of zeroes and other 'useless' information
• But we can reduce the 1076 observations to three axes and still account for 57.6% of the variance in the data
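The reduction to a few axes can be illustrated with a principal component analysis, on the assumption (suggested by the "PC" column labels in the slides, but not stated) that this is the method used. The sketch computes the share of total variance captured by the first k components. To stay self-contained it finds the components by power iteration with deflation, which is a stand-in chosen for this example and not necessarily how the original figures were produced; the function name explained_share and the toy data are invented.

```python
from math import sqrt

def explained_share(rows, k, iterations=500):
    """Share of total variance captured by the first k principal components."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Covariance matrix of the centred columns
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    total = sum(cov[j][j] for j in range(m))
    captured = 0.0
    for _ in range(k):
        # Power iteration converges towards the leading eigenvector of cov
        v = [1.0] * m
        for _step in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
            norm = sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # nothing left to extract
            v = [x / norm for x in w]
        cv = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        eigenvalue = sum(a * b for a, b in zip(v, cv)) / sum(a * a for a in v)
        captured += eigenvalue
        # Deflation: remove the component just found before looking for the next
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(m)]
               for i in range(m)]
    return captured / total

# Toy data, invented for illustration: four units, three variables
data = [[1.0, 2.0, 0.0], [2.0, 4.1, 0.0], [3.0, 5.9, 1.0], [4.0, 8.0, 0.0]]
share_one = explained_share(data, 1)
share_two = explained_share(data, 2)
print(round(share_one, 3), round(share_two, 3))
```

In the toy data the first two columns are almost proportional, so a single axis already captures most of the variance; with 1076 sparse trigram columns the share per axis is much lower.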
Three-dimensional data
                 PC 1        PC 2        PC 3
grc, Matthew     -16.284392  11.4117630    3.0072437
grc, Mark        -15.057998  14.0906852    3.3003379
grc, Luke        -15.186215  11.1504303    3.3776230
grc, Revelation  -16.630761  13.4693890  -23.9870676
la, Matthew       12.826632  15.2713093    1.8494721
la, Mark           8.726130   7.0003690    2.3731114
la, Luke          10.654470  10.0395966    4.0825441
la, Revelation    13.216987  19.3034364  -29.2234370
...
Two-dimensional view
[Figure: two-dimensional plot, Factor 1 (38.9 %) against Factor 2 (7 %); points are labelled by book name and by POS trigram (for example I.A.N, G.I.D, A.D.A).]
Without the Greek
[Figure: two-dimensional plot without the Greek texts, Factor 1 (12.9 %) against Factor 2 (12 %); points are labelled by book name and by POS trigram (for example I.A.N, R.G.N, I.C.R).]
Some observations
• Contrary to what we saw in the first part, text sample matters more than language when it comes to trigrams
• Word order is slavishly transferred from the Greek to the other texts and is of little use to PROIEL
• But could be used in authorship studies!