Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected...
Transcript of Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected...
![Page 1: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/1.jpg)
Are you talking Bernoulli to me? Significance testing and burstiness of words in text corpora
Jefrey Lijffijt Doctoral Student and Researcher Department of Information and Computer Science, Aalto University
Helsinki Institute for Information Technology (HIIT) Finnish Centre of Excellence for Algorithmic Data Analysis Research (ALGODAN) Finnish Doctoral Programme in Computational Sciences (FICS)
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
1
![Page 2: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/2.jpg)
My research group • Panagiotis Papapetrou
(post-doc) • Kai Puolamäki
(acting group leader) • Heikki Mannila
(vice president) Department of Information and Computer Science
Aalto University
Outside collaborators • Tanja Säily
(PhD student) • Terttu Nevalainen
(academy professor)
Department of Modern Languages
University of Helsinki
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
2
![Page 3: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/3.jpg)
Background research group
• Main topic: (algorithmic aspects of) data mining
• Pattern mining – binary matrices and sequences
• Significance testing of data mining results – Patterns, classification, clustering etc. – Permutation testing and constrained randomization
• Swap randomization for binary matrices (Gionis et al. 2007) • Swap randomization for real-valued matrices (Ojala et al. 2008,
Ojala 2010)
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
3
![Page 4: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/4.jpg)
Outline
• Background and motivation
• Statistical tests for comparing corpora
• Comparing statistical tests for comparing corpora
• Directions for further research
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
4
![Page 5: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/5.jpg)
Background and motivation
11/11/2011 Jyväskylä
5
Are you talking Bernoulli to me? Jefrey Lijffijt
![Page 6: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/6.jpg)
Initial motivation: repetitions of words
• General concept: priming – People repeat themselves and each other – Evidence for priming of words, phrases, syntax – Whole field of study – Open problems
• Unclear at instance level • Unknown what ‘constructs’ can be primed
• Working concept: persistence = unexpected repetition – Interpretation free
• Main question: what is (un-) expected?
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
6
![Page 7: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/7.jpg)
What is (un-) expected?
• Too general question, lots of factors in text/speech – Genre/register – Speaker/writer
• Age • Gender • Background
– Topic – Audience – …
• Huge natural variation
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
7
![Page 8: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/8.jpg)
What is (un-) expected?
• Simpler questions, addressed today
1. Modeling expected behavior of words using inter-arrival times
2. How to take into account natural variation when comparing corpora
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
8
![Page 9: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/9.jpg)
Definitions
• Corpus is a set of documents
• Document is a sequence of words
• Frequency defined as
• Size defined as
!
S = {D1,…,DN}
!
Di = (Di,1,…,Di,Di )
!
fr(w j ,Di ) = Ij=1Di" wi = Di, j( )
!
fr(w j ,S) = fr(w j ,Di )i=1S"
!
size(S) = Dii=1N"
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
9
![Page 10: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/10.jpg)
Definitions
• Bag-of-words assumption: words are generated from a multinomial distribution with fixed parameters –
€
p1,…, pn , such that pii=1n∑ =1
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
10
![Page 11: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/11.jpg)
Example: text segmentation based on piecewise constant parts
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
11
• Blue line is incremented when we encounter an instance
• Green line is piecewise constant segmentation
• Is the Poisson distribution a good model for this data?
![Page 12: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/12.jpg)
What is (un-) expected?
• Frequency distribution differs greatly per word – Depends on frequency and burstiness/dispersion
0 0.02 0.04 0.06 0.08 0.10
125
250
375
500
Normalized frequency
Num
ber o
f tex
ts
for (n = 879020)
0 0.02 0.04 0.06 0.08 0.1Normalized frequency
I (n = 868907)
1145
Data: British National Corpus, 4049 texts
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
12
![Page 13: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/13.jpg)
Modeling expected behavior using inter-arrival times
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
13
• Count space between consecutive occurrences (of and)
Finnair believes that it will be able to resume its scheduled service to and from New York on Monday, after
two days of cancellations caused by hurricane Irene. All three airports serving New York City have been
closed because of the hurricane and Finnair was forced to cancel flights on Saturday and Sunday. The
airline is not certain when its scheduled service can be resumed, but the assumption is that Monday
afternoon's flight from Helsinki will depart. Some Finnair passengers whose final destination is not New York
have been rerouted and some have delayed travel plans. The company has also offered ticket holders a
refund. YLE
• IATand = {29, 9, 39, 29}
• Hypothesis: this captures the behavior pattern of words
![Page 14: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/14.jpg)
Modeling expected behavior using inter-arrival times
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
14
• Denote the occurrence positions
• jth inter-arrival time
• nth inter-arrival time
•
• IATand = {29, 9, 39, 29}
€
qi1,…,qi
n =14,43,52,91
€
ai, j = qij+1 − qi
j , for j =1,…,n −1
€
ai,n = qi1 + size(S) − qi
n
€
size(S) =106
![Page 15: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/15.jpg)
Modeling expected behavior using inter-arrival times (Altmann et al. 2009)
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
15
• Fit Weibull (stretched exponential) distribution to IAT
– α > 0 is the scale parameter – β > 0 is the shape parameter
• β = 1 exponential distribution
• Predict word class based on β
€
f (x) =βα
xα
⎛ ⎝ ⎜
⎞ ⎠ ⎟ β−1
e−(x /α )β
![Page 16: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/16.jpg)
What is (un-) expected?
• Frequency distribution differs greatly per word – Depends on frequency and burstiness/dispersion
0 0.02 0.04 0.06 0.08 0.10
125
250
375
500
Normalized frequency
Num
ber o
f tex
ts
for (n = 879020)
0 0.02 0.04 0.06 0.08 0.1Normalized frequency
I (n = 868907)
1145
Data: British National Corpus, 4049 texts
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
16
βML = 0.93 βML = 0.57
![Page 17: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/17.jpg)
What is (un-) expected?
• Bag-of-words prediction for I is quite poor
0 0.01 0.02 0.03 >=0.040
250
500
750
1000
1250
Normalized frequency
Num
ber o
f tex
ts
I (n = 868907)
Observed distributionPredicted by bag of words model
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
17
![Page 18: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/18.jpg)
All models are wrong, but some are useful (G.E.P. Box, several articles) • Linguists have done fine with statistical tests based on
bag-of-words assumption – Paper on log-likelihood test for comparing corpora (Dunning
1993) has 1585 citations
• Information retrieval is based on burstiness of semantically rich words – Term frequency-inverse document frequency
€
idf (wi ) = log S{D∈ S :wi ∈D}
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
18
![Page 19: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/19.jpg)
Statistical tests for comparing corpora
11/11/2011 Jyväskylä
19
Are you talking Bernoulli to me? Jefrey Lijffijt
![Page 20: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/20.jpg)
Problem setting
• Given two corpora S and T
• Find all words that are significantly more frequent in S than in T, or vice versa
• Is this difference statistically significant?
Are you talking Bernoulli to me? Jefrey Lijffijt
Word Freq in S Freq in T time 13,072 16,112 Total 7,196,688 8,365,458
11/11/2011 Jyväskylä
20
![Page 21: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/21.jpg)
Motivation
• Find differences between groups – Speaker groups of different ages
• S = 20−30 , T = 40−50 – Genres
• S = newspaper, T = magazines – Author gender
• S = male, T = female – Time periods
• S = 1600−1639, T = 1640−1681
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
21
![Page 22: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/22.jpg)
Definitions revisited
• Corpus is a set of documents
• Document is a sequence of words
• Frequency defined as
• Size defined as
€
S = {D1,…,DN}
€
Di = (Di,1,…,Di,Di )
€
fr(w j ,Di ) = Ij=1Di∑ wi = Di, j( )
€
fr(w j ,S) = fr(w j ,Di )i=1S∑
€
size(Di ) = Di
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
22
€
size(S) = Dii=1N∑
![Page 23: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/23.jpg)
Definitions revisited
• Normalized frequency
and or
• Bag-of-words assumption: words are generated from a multinomial distribution with fixed parameters –
• We are interested in only one parameter
€
p1,…, pn , such that pii=1n∑ =1
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
23
€
φDi (w j ) =fr(w j ,Di )size(Di )
€
φS (w j ) =fr(w j ,S)size(S)
€
φ S (w j ) =φDi (w j )i=1
S∑S
€
pi = φS (wi )
![Page 24: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/24.jpg)
Data (1/2)
• British National Corpus (BNC), XML edition (2007) – General language corpus
• Restrict to fiction prose genre
• Compare male (S) against female authors (T) – 7.2 M versus 8.4 M words – 203 versus 206 texts (stories and book parts)
• Preprocessing: remove punctuation, ignore multi-word tags, include titles etc.
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
24
![Page 25: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/25.jpg)
Data (2/2)
• Corpus of Early English Correspondence (CEEC) – Contains personal letters – 1410 – 1681 – Version with normalized spelling
• Compare periods 1600-1639 and 1640-1681 – 1.2+ M words – 3000+ letters
• Preprocessing: remove punctuation from all words, include titles etc.
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
25
![Page 26: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/26.jpg)
Formal problem setting
• Input: – Two corpora: S and T – A significance threshold: α (0 < α < 1)
• Word q is significant at level α if and only if p ≤ α
• p gives the probability for the normalized frequency of word q being equal in S and T – H0: ϕ(q,S) = ϕ(q,T) – H1: ϕ(q,S) ≠ ϕ(q,T) – Two-tailed tests: test direction separately
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
26
![Page 27: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/27.jpg)
Pearson’s χ2-test • Assume all words are independent
– Bag-of-words model
• Significance test using 2x2 table
•
• With Yates’ correction
•
Word Freq in S Freq in T time 13,072 16,112 Total 7,196,688 8,365,458
Are you talking Bernoulli to me? Jefrey Lijffijt
€
X 2 =(Oi − Ei )
2
Eii=12∑
!
p " 0.00000066
!
X 2 ~ "12
11/11/2011 Jyväskylä
27
€
X 2 =(Oi − Ei − 0.5)
2
Eii=12∑
![Page 28: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/28.jpg)
Log-likelihood ratio test (Dunning 1993) • Assume all words are independent
– Bag-of-words model
• Significance test using 2x2 table
•
•
•
Word Freq in S Freq in T time 13,072 16,112 Total 7,196,688 8,365,458
€
λ =Bin(kS ,nS , pi
S+T ) ⋅ Bin(kT ,nT , piS+T )
Bin(kS ,nS , piS ) ⋅ Bin(kT ,nT , pi
T )
Are you talking Bernoulli to me? Jefrey Lijffijt
€
Bin(k,n, pi ) =nk⎛
⎝ ⎜ ⎞
⎠ ⎟ pi
k 1− pi( )n−k
€
p ≈ 0.00000061
€
−2logλ ~ χ12
11/11/2011 Jyväskylä
28
![Page 29: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/29.jpg)
Welch’s T-test
• Based on frequency distribution of q over documents
• Allows for unequal variances in the two sets
• , where s is the sample variance
• t ~ t-distribution with v degrees of freedom
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
29
€
t =φ S −φ TsS2
S+sT2
T
![Page 30: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/30.jpg)
Wilcoxon rank-sum test (Mann-Whitney U test) • Based on frequency distribution of q over documents
• Order all texts by frequency • R1 and R2 are the smaller and larger rank sums
•
• For small , consult table, for large
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
30
€
U = R1 −N1(N1 +1)
2
€
U ~ Ν(µU ,σU )
€
U
€
µU =N1N2
2, σU =
N1N2(N1 + N2 +1)12
![Page 31: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/31.jpg)
Inter-arrival time test (Lijffijt et al. 2011)
• Produce random corpora: S’1, …, S’R and T’1, …, T’R • Use two-tailed version of empirical p-value* (North et al.
2002)
•
where
•
Are you talking Bernoulli to me? Jefrey Lijffijt
€
p2 =1+ R ⋅ 2 ⋅ min(p1,1− p1)
1+ R
€
p1 =H( fr(q,Si
' ) ≤ fr(q,Ti' ))i=1
R∑R
11/11/2011 Jyväskylä
31
€
H(x ≤ y) =
10.50
if x < yif x = yif x > y
⎧
⎨ ⎪
⎩ ⎪
![Page 32: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/32.jpg)
Inter-arrival time test (Lijffijt et al. 2011)
• Count space between consecutive occurrences IAT distribution
• Resampling of S and T 1. Pick random first index from 2. Sample random inter-arrival time from 3. Repeat 2. until size of S exceeded
•
•
X X X 1 2 3 … size(S)
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
32
€
g(x)
€
f (x)
€
f (x) =βα
xα
⎛ ⎝ ⎜
⎞ ⎠ ⎟ β−1
e−(x /α )β
€
g(x) = C⋅ x⋅ f (x) s.t. C⋅ x⋅ f (x)x∑ =1
€
G(x) =1−Γ 1+
1β, xα
⎛ ⎝ ⎜ ⎞ ⎠ ⎟ β⎛
⎝ ⎜
⎞
⎠ ⎟ ⎟
Γ 1+1β
⎛
⎝ ⎜
⎞
⎠ ⎟
![Page 33: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/33.jpg)
Inter-arrival time test (Lijffijt et al. 2011)
• Count space between consecutive occurrences IAT distribution
• Resampling of S and T 1. Pick random first index from 2. Sample random inter-arrival time from 3. Repeat 2. until size of S exceeded
• Alternatively, is the empirical IAT-distribution – Lijffijt et al. (2011) suggests Weibull is often not a good fit
•
X X X
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
33
€
g(x)
€
f (x)
€
g(x) = C⋅ x⋅ f (x) s.t. C⋅ x⋅ f (x)x∑ =1€
f (x)
1 2 3 … size(S)
![Page 34: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/34.jpg)
Bootstrap test (Lijffijt et al. 2011)
• Resampling based on word frequency distribution – Sample |S| texts (with replacement) from S or T
• Produce random corpora: S’1, …, S’R and T’1, …, T’R • Use two-tailed version of empirical p-value
•
•
Are you talking Bernoulli to me? Jefrey Lijffijt
!
p2 =1+ R " 2 " min(p1,1# p1)
1+ R
11/11/2011 Jyväskylä
34
€
p1 =H(φ
Si' (q) ≤ φTi'
(q))i=1R∑
R
![Page 35: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/35.jpg)
Comparison for time (β = 0.88)
• pχ2 = 0.00000066 • plog-likelihood = 0.00000061 • pt-test = 0.048 • prank-sum = 0.028 • pIAT = 0.00030 • pbootstrap = 0.021
Word Freq in S Freq in T time 13,072 16,112 Total 7,196,688 8,365,458
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
35
![Page 36: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/36.jpg)
Example: frequency thresholds (Lijffijt et al. 2011) • α = 0.01 in a text of 2000 words
• Smaller β gives larger differences
Word Freq in BNC (x106)
Weibull β Binomial Weibull Inter-arrival
Bootstrap
a 2.2 1.01 61 61 72
for 0.9 0.93 29 30 37
I 0.9 0.57 29 48 110
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
36
![Page 37: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/37.jpg)
Experiments (Säily et al., forthcoming)
• Compare four methods using CEEC – χ2-test – Log-likelihood ratio test – Inter-arrival time test – Bootstrap test
• Compute p-values for all words – H0: normalized frequency in 1600-1639 and 1640-1681 are
equal – H1: normalized frequency in 1600-1639 and 1640-1681 are not
equal
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
37
![Page 38: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/38.jpg)
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
p value Inter arrival time
pva
lue
Log
likel
ihoo
d
0 0.10
0.1
![Page 39: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/39.jpg)
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
p value Bootstrap
pva
lue
Log
likel
ihoo
d
0 0.10
0.1
![Page 40: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/40.jpg)
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
p value Chi squared
pva
lue
Log
likel
ihoo
d
0 0.10
0.1
![Page 41: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/41.jpg)
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
p value Chi squared
pva
lue
Inte
rar
rival
tim
e
0 0.10
0.1
![Page 42: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/42.jpg)
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
p value Chi squared
pva
lue
Boot
stra
p
0 0.10
0.1
![Page 43: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/43.jpg)
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
p value Inter arrival time
pva
lue
Boot
stra
p
0 0.10
0.1
![Page 44: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/44.jpg)
Examples of words with ≥ 10-fold difference
Are you talking Bernoulli to me? Jefrey Lijffijt
Word % 1600-1639 % 1640-1681 p LL p Boot him 0.540 0.507 .011 .17 we 0.280 0.305 .011 .18 mr 0.289 0.317 .0040 .10 horse 0.019 0.026 .0091 .091 prince 0.027 0.020 .0065 .076 goods 0.013 0.019 .0091 .099 patent 0.008 0.004 .0051 .12 pound 0.019 0.013 .0090 .11 merchant 0.007 0.004 .010 .11 li 0.007 0.003 .0048 .095
11/11/2011 Jyväskylä
44
![Page 45: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/45.jpg)
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
p value Bootstrap
pva
lue
Log
likel
ihoo
d
0 0.10
0.1
10-fold distances the other way around do not occur
![Page 46: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/46.jpg)
Most frequent significant words
Are you talking Bernoulli to me? Jefrey Lijffijt
Word % 1600-1639 % 1640-1681 p LL p Boot my 1.75 1.45 < .0001 .0001 that 1.48 1.72 < .0001 .0001 your 1.38 1.12 < .0001 .0001 it 1.16 1.33 < .0001 .0001 is 0.92 1.04 < .0001 .0001 and 3.35 2.95 < .0001 .0001 with 0.89 0.79 < .0001 .0001 but 0.78 0.95 < .0001 .0001 in 1.50 1.74 < .0001 .0001
• 292 words significant at α = 0.05 in all four methods*
11/11/2011 Jyväskylä
46
![Page 47: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/47.jpg)
Conclusion part 1
• Bag-of-words model poorly represents frequency distributions (especially of bursty words)
• New models: inter-arrival times and bootstrap test – More conservative p-values – Weibull β predicts difference between models
• Not covered so far: - Correction for multiple hypotheses
• How do we really know which test is better?
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
47
![Page 48: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/48.jpg)
Comparing statistical tests for comparing corpora
11/11/2011 Jyväskylä
48
Are you talking Bernoulli to me? Jefrey Lijffijt
![Page 49: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/49.jpg)
Uniformity of p-values
• By definition p-values should be uniform – Pr(p ≤ x) = x
• Test is conservative iff p-value are too high – Pr(p ≤ x) < x
• Test is anti-conservative iff p-values are too low – Pr(p ≤ x) > x
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
49
![Page 50: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/50.jpg)
Testing uniformity of p-values
• Experiment using random splitting of data – BNC fiction-prose subcorpus (2000 word samples from each text)
• Do for each word with frequency ≥ 50 – Assign half of the texts to S and other half to T – Compute p-value for each of the six methods – Repeat 500 times
• Data is generated under the null hypothesis
• Use Kolmogorov-Smirnov test to compute p-value for uniformity of p-values for each word for each test – Total of 3,302⋅6 = 19,812 p-values
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
50
![Page 51: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/51.jpg)
Testing uniformity of p-values
• Use Kolmogorov-Smirnov test to compute p-value for uniformity of p-values for each word for each test – Total of 3,302⋅6 = 19,812 p-values
• Use Bonferroni correction for multiple hypothesis – We really do not want any false-positives
• α = 0.01
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
51
![Page 52: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/52.jpg)
DPnorm: normalized measure of dispersion (Gries 2008, Lijffijt & Gries to appear) • Difference between expected and observed frequencies
• oi = relative number of occurrences in Di
• ei = relative size of Di
•
•
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
52
€
DP =oi − eii=1
N∑2
€
DPnorm =DP
1−mini (ei )
ei 0.4 0.4 0.2
oi 0.0 0.0 1.0
|oi-ei| 0.4 0.4 0.8
€
DP =1.6 /2 = 0.8
€
DPnorm = 0.8 /(1− 0.2) =1
![Page 53: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/53.jpg)
0
0.2
0.4
0.6
0.8
1
DPn
orm
Chi squared (57.6% rejected) Likelihood ratio (65.0% rejected) T test (4.8% rejected)
50 500 5000 50000 5000000
0.2
0.4
0.6
0.8
1
DPn
orm
Word frequency
Wilcoxon rank sum (3.6% rejected)
500 5000 50000Word frequency
Inter arrival (16.3% rejected)
50 500 5000 50000 500000Word frequency
Bootstrapping (3.6% rejected)
Uniformity under random text assignment
![Page 54: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/54.jpg)
0
0.2
0.4
0.6
0.8
1
DPn
orm
Chi squared (45.4% rejected) Likelihood ratio (65.0% rejected) T test (4.7% rejected)
50 500 5000 50000 5000000
0.2
0.4
0.6
0.8
1
DPn
orm
Word frequency
Wilcoxon rank sum (3.4% rejected)
500 5000 50000Word frequency
Inter arrival (16.3% rejected)
50 500 5000 50000 500000Word frequency
Bootstrapping (3.6% rejected)
Anti-conservativeness under random text assignment
![Page 55: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/55.jpg)
0
0.2
0.4
0.6
0.8
1
DP
norm
Chi squared (68.9% rejected) Likelihood ratio (10.4% rejected) T test (10.3% rejected)
50 500 5000 50000 5000000
0.2
0.4
0.6
0.8
1
DP
norm
Word frequency
Wilcoxon rank sum (10.3% rejected)
500 5000 50000Word frequency
Inter arrival (0.0% rejected)
50 500 5000 50000 500000Word frequency
Bootstrapping (0.0% rejected)
Uniformity under random word assignment
![Page 56: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/56.jpg)
0
0.2
0.4
0.6
0.8
1
DP
norm
Chi squared (0.0% rejected) Likelihood ratio (3.9% rejected) T test (3.9% rejected)
50 500 5000 50000 5000000
0.2
0.4
0.6
0.8
1
DP
norm
Word frequency
Wilcoxon rank sum (3.9% rejected)
500 5000 50000Word frequency
Inter arrival (0.0% rejected)
50 500 5000 50000 500000Word frequency
Bootstrapping (0.0% rejected)
Anti-conservativeness under random word assignment
![Page 57: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/57.jpg)
0 200 4000
0.2
0.4
0.6
0.8
1Chi squared (7.9176e 25)
sample index
pva
lue
0 200 4000
0.2
0.4
0.6
0.8
1
trip (41)
Likelihood ratio (3.0402e 07)
sample indexp
valu
e0 200 400
0
0.2
0.4
0.6
0.8
1T test (3.1573e 07)
sample index
pva
lue
0 200 4000
0.2
0.4
0.6
0.8
1Wilcoxon rank sum (3.1573e 07)
sample index
pva
lue
0 200 4000
0.2
0.4
0.6
0.8
1Inter arrival (0.00033271)
sample index
pva
lue
0 200 4000
0.2
0.4
0.6
0.8
1Bootstrapping (0.080167)
sample index
pva
lue
Discrete test problem
![Page 58: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/58.jpg)
0 200 4000
0.2
0.4
0.6
0.8
1Chi squared (0.20308)
sample index
pva
lue
0 200 4000
0.2
0.4
0.6
0.8
1
would (2590)
Likelihood ratio (0.53725)
sample indexp
valu
e0 200 400
0
0.2
0.4
0.6
0.8
1T test (0.53647)
sample index
pva
lue
0 200 4000
0.2
0.4
0.6
0.8
1Wilcoxon rank sum (0.53648)
sample index
pva
lue
0 200 4000
0.2
0.4
0.6
0.8
1Inter arrival (0.386)
sample index
pva
lue
0 200 4000
0.2
0.4
0.6
0.8
1Bootstrapping (0.33949)
sample index
pva
lue
Discrete test problem
![Page 59: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/59.jpg)
Conclusion part 2
• Bag-of-words based test should not be used by linguists
• Bootstrap test and Wilcoxon give best results
• T-test gives similar results – T-test is faster to compute
• Open problems – A fair test for all tests? – Why is resampling S and T required in bootstrap test?
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
59
![Page 60: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/60.jpg)
Further research
11/11/2011 Jyväskylä
60
Are you talking Bernoulli to me? Jefrey Lijffijt
![Page 61: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/61.jpg)
• Segmentation of text – Find change-points in sequence of inter-arrival times
• Clustering of words – We know burstiness is related to parts-of-speech – Can we group words based on inter-arrival times?
• Stochastic model for inter-arrival times – Current best (Weibull) does not fit that well – How to take into account correlations in inter-arrival time test
Further research: Burstiness and inter-arrival times
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
61
![Page 62: Are you talking Bernoulli to me? - UGentjlijffij/presentations/Jyvaskyla.pdf · Modeling expected behavior using inter-arrival times Are you talking Bernoulli to me? Jefrey Lijffijt](https://reader036.fdocuments.us/reader036/viewer/2022080721/5f7a3712fa11425a274840ba/html5/thumbnails/62.jpg)
References • Altmann, E.G., Pierrehumbert, J.B., Motter, A.E. (2009). Beyond word frequency:
Bursts, lulls, and scaling in the temporal distribution of words, PLoS ONE, 4(11):e7678.
• Dunning, T. (1993). Accurate Methods for the Statistics of Surprise and Coincidence, Computational Linguistics, 19:61-74.
• Lijffijt, J., Papapetrou, P., Puolamäki, K., Mannila, H. (2011). Analyzing Word Frequencies in Large Text Corpora using Inter-arrival Times and Bootstrapping. In ECML PKDD 2011, Part II, pp. 341-357.
• North, B. V., Curtis, D. and Sham, P. C. (2002). A note on the calculation of empirical p-values from Monte Carlo procedures. The American Journal of Human Genetics, 71(2): 439-441.
• Gries, S. Th. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4): 403-437.
• Lijffijt, J. and Gries, S. Th. (to appear). Correction to 'Dispersions and adjusted frequencies in corpora'. International Journal of Corpus Linguistics.
• Slides available soon at http://users.ics.tkk.fi/lijffijt
Are you talking Bernoulli to me? Jefrey Lijffijt
11/11/2011 Jyväskylä
62