On the Waring distribution, the Glänzel-Schubert model and their applications - a historiette
Timo Koski
Dept. of Math., KTH Royal Institute of Technology
Åbo, 22nd August 2013, Seminar in Honor of Göran Högnäs
August 20, 2013
Abo 22nd August 2013, Seminar in Honor of Goran Hognas Waring Distribution
Background
This lecture is based on
T. Koski, E. Sandström, & U. Sandström (2011): Estimating Research Productivity from a Zero-Truncated Distribution. Proceedings of the 13th Conference of the International Society for Scientometrics and Informetrics, Vols 1 and 2, pp. 747-755.
This is a piece of argument about the Swedish government official report
Resurser för kvalitet (2007). Slutbetänkande av Resursutredningen. SOU 2007:81,
where Ulf Sandström, Indek/KTH, and Erik Sandström contributed the bibliometric model. The report and its recommendations are also known under the acronym RUT 2.
Background
The system for funding allocation to public research institutions presented by the Swedish government in October 2008 was based on RUT 2. Needless to say, this generated both public and private conflict of opinion, where expressions like 'mathysteri' were spotted. There will be more about the public conflict at the end of the lecture, if time permits.
Mathysteri ?
The Waring method
An important part of the bibliometric method invoked in RUT 2 is a statistical estimate of how many active researchers there are in the Nordic countries, an estimate made with what has come to be known (at least in Sweden) as the Waring method.
This estimate is needed to take into account the fact that there are different traditions in various disciplines on publishing and in publishing in ISI journals [1].
[1] as listed in Thomson Reuters
Statement by The Government Offices of Sweden
The proposal of Resursutredningen (= RUT 2, author's remark) means that the direct appropriations are distributed according to academia's own criteria for what constitutes good education and research, and according to the students' own informed choices. As a result, the state neither can nor should steer how the resources are distributed between the higher education institutions. It therefore becomes important that this model is managed and quality-assured by an academically well-qualified intermediary body outside the Government Offices.
A report from an 'academically well-qualified intermediary body'
by J. Froberg, M. Gunnarsson, A. Jonsson and S. Karlsson; Avdelningen för forskningspolitisk analys, Vetenskapsrådet.
A report from an 'academically well-qualified intermediary body'
The critical comments in the cited report from Avdelningen för forskningspolitisk analys/VR are heavy. This talk is a rejoinder to some of those. Let us begin from the beginning.
Or how Edward Waring broke out of the academic universe
Edward Waring (1736-1798) held the Lucasian Chair of Mathematics in the University of Cambridge.
E. Waring, Miscellanea Analytica (1762): Waring's Formula
Expansion in inverse factorials, due to Edward Waring:
$$\frac{1}{x-\alpha} = \sum_{r=0}^{\infty} \frac{\alpha_{(r)}}{x_{(r+1)}}, \qquad x > \alpha > 0, \tag{1}$$
where $\alpha_{(r)}$ is the ascending factorial
$$\alpha_{(r)} = \alpha \cdot (\alpha+1) \cdots (\alpha+r-1) = \frac{\Gamma(\alpha+r)}{\Gamma(\alpha)},$$
where we used the well-known recursion formula of the Gamma function
$$\Gamma(z+1) = z\,\Gamma(z). \tag{2}$$
For derivations of (1), see
N.E. Nörlund: Vorlesungen über Differenzenrechnung. Verlag Julius Springer, Berlin, 1924, p. 261.
L.M. Milne-Thomson: The Calculus of Finite Differences. MacMillan, London, 1951, p. 291.
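Waring's expansion (1) can be checked numerically. The following is a minimal sketch, not part of the original lecture; the function name `waring_partial_sum` and the chosen values of x and α are illustrative assumptions. Successive terms are built from the ratio α_(r+1)/x_(r+2) divided by α_(r)/x_(r+1), which equals (α + r)/(x + r + 1).

```python
def waring_partial_sum(x: float, alpha: float, n_terms: int) -> float:
    """Partial sum of sum_r alpha_(r) / x_(r+1), the right-hand side of (1)."""
    total = 0.0
    term = 1.0 / x  # r = 0: alpha_(0) / x_(1) = 1 / x
    for r in range(n_terms):
        total += term
        # ratio of consecutive terms: (alpha + r) / (x + r + 1)
        term *= (alpha + r) / (x + r + 1)
    return total


if __name__ == "__main__":
    x, alpha = 5.0, 1.0  # hypothetical values with x > alpha > 0
    print(waring_partial_sum(x, alpha, 2000), 1.0 / (x - alpha))
```

The series converges slowly when x − α is small, so more terms are needed in that case.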
J.O. Irwin (1963): the Waring distribution
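Let us rewrite (1) with $\rho = x - \alpha$:
$$\frac{1}{\rho} = \sum_{r=0}^{\infty} \frac{\alpha_{(r)}}{(\rho+\alpha)_{(r+1)}}, \tag{3}$$
i.e.,
$$1 = \sum_{r=0}^{\infty} \rho \cdot \frac{\alpha_{(r)}}{(\rho+\alpha)_{(r+1)}}, \tag{4}$$
and we have discovered a probability distribution $(p_r)_{r=0}^{\infty}$ on the non-negative integers
$$p_r = \rho \cdot \frac{\alpha_{(r)}}{(\rho+\alpha)_{(r+1)}}, \qquad r = 0, 1, 2, \ldots \tag{5}$$
J.O. Irwin: The place of mathematics in medical and biological sciences. J. R. Statistical Society, A, vol. 126, 1963, pp. 1-14.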
The Waring distribution: the recursion formula
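The Waring distribution $(p_r)_{r=0}^{\infty}$ is, by the recursion formula for $\Gamma(z)$ in (2),
$$p_r = \begin{cases} \rho \cdot \dfrac{\alpha_{(0)}}{(\rho+\alpha)_{(1)}} = \rho\, \dfrac{\Gamma(\alpha)/\Gamma(\alpha)}{\Gamma(\alpha+\rho+1)/\Gamma(\alpha+\rho)} = \dfrac{\rho}{\alpha+\rho}, & r = 0, \\[2ex] \dfrac{\alpha+(r-1)}{\alpha+\rho+r}\, p_{r-1}, & r = 1, 2, \ldots \end{cases} \tag{6}$$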
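As a sanity check on (5) and (6), the sketch below computes the Waring probabilities both from the closed form (written with Gamma functions) and from the recursion, and compares them. The function names are illustrative assumptions, and War(3, 1) is chosen only because it is one of the examples plotted in the talk.

```python
from math import exp, lgamma, log


def waring_pmf_closed(k: int, rho: float, alpha: float) -> float:
    """p_k = rho * alpha_(k) / (rho+alpha)_(k+1), written via log-Gamma."""
    return exp(log(rho)
               + lgamma(alpha + k) - lgamma(alpha)
               + lgamma(rho + alpha) - lgamma(rho + alpha + k + 1))


def waring_pmf_recursive(k_max: int, rho: float, alpha: float) -> list:
    """p_0, ..., p_{k_max} from the recursion (6)."""
    probs = [rho / (alpha + rho)]
    for r in range(1, k_max + 1):
        probs.append(probs[-1] * (alpha + r - 1) / (alpha + rho + r))
    return probs


if __name__ == "__main__":
    probs = waring_pmf_recursive(10, 3.0, 1.0)
    for k, p in enumerate(probs):
        print(k, p, waring_pmf_closed(k, 3.0, 1.0))
```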
The Waring distribution: the mean
Irwin found, amongst many other things, the mean $\mu$ of the distribution as
$$\mu = \frac{\alpha}{\rho - 1} \quad \text{if } \rho > 1. \tag{7}$$
War(ρ, α)
We say that $X$ is a random variable that has the Waring distribution with parameters $\rho$ and $\alpha$, if
$$\Pr(X = k) = p_k, \qquad k = 0, 1, 2, \ldots,$$
with the $p_k$ given in (6). We state this as
$$X \sim \mathrm{War}(\rho, \alpha).$$
A weak form of Power Law
$$p_k = P(X = k) \approx k^{-(1+\rho)}, \quad \text{as } k \to \infty. \tag{8}$$
We can call $\rho$ the tail parameter, as it controls the tail of the distribution. The graphs above depict $p_k$ for War(3, 1) (blue) and War(2, 1) (green) as functions of $k$.
W-C. Chen: On the weak form of Zipf's law. Journal of Applied Probability, 17, 1980, pp. 611-622.
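The tail behaviour in (8) can be illustrated numerically. From (5) and the standard asymptotics of ratios of Gamma functions, $k^{1+\rho} p_k$ tends to the constant $\rho\,\Gamma(\alpha+\rho)/\Gamma(\alpha)$. This constant is not stated on the slide and is added here as a known consequence of (5). A minimal sketch with illustrative function names:

```python
from math import exp, lgamma, log


def waring_pmf(k: int, rho: float, alpha: float) -> float:
    """Closed form (5) of the Waring probabilities via log-Gamma."""
    return exp(log(rho)
               + lgamma(alpha + k) - lgamma(alpha)
               + lgamma(rho + alpha) - lgamma(rho + alpha + k + 1))


def tail_constant(rho: float, alpha: float) -> float:
    """Limit of k**(1 + rho) * p_k as k grows: rho * Gamma(alpha+rho) / Gamma(alpha)."""
    return rho * exp(lgamma(alpha + rho) - lgamma(alpha))


def scaled_tail(k: int, rho: float, alpha: float) -> float:
    """k**(1 + rho) * p_k, which should flatten out for large k."""
    return waring_pmf(k, rho, alpha) * float(k) ** (1.0 + rho)


if __name__ == "__main__":
    for k in (10, 100, 1000, 10000):
        print(k, scaled_tail(k, 3.0, 1.0), tail_constant(3.0, 1.0))
```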
Zipf-Lotka Law
$$P(X = k) = c \cdot k^{-2}, \qquad k = M, M+1, \ldots \tag{9}$$
where $c$ is the normalization constant. Zipf-Lotka's Law was empirically found as a bibliometric distribution on the number of authors making $k$ contributions. The basic discovery of Lotka was that the publication frequencies are skew distributions.
Yule - Simon Distribution
The state probabilities of birth-and-death processes are a source of power law distributions. G.U. Yule established a model (a pure birth process) to explain the observed size distribution of genera with respect to the number of species. Yule obtained a special case of the following probability mass function due to H.A. Simon:
$$q_k = \delta\, B(\delta+1, k), \qquad k = 1, 2, \ldots, \tag{10}$$
where $\delta > 0$ and $B(\delta+1, k)$ is the Beta function, i.e., $B(x, y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}$. It can be easily checked that if $X \sim \mathrm{War}(\rho, \alpha)$, then
$$\Pr(X = k \mid X > 0) \to \rho\, B(\rho+1, k)$$
as $\alpha \to 0$.
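The limit above can be verified numerically for a small value of α. The sketch below uses $\Pr(X > 0) = 1 - p_0 = \alpha/(\alpha+\rho)$, which follows from (6). The function names and the choice α = 1e-6 are illustrative assumptions.

```python
from math import exp, lgamma, log


def waring_pmf(k: int, rho: float, alpha: float) -> float:
    """Closed form (5) of the Waring probabilities via log-Gamma."""
    return exp(log(rho)
               + lgamma(alpha + k) - lgamma(alpha)
               + lgamma(rho + alpha) - lgamma(rho + alpha + k + 1))


def waring_given_positive(k: int, rho: float, alpha: float) -> float:
    """Pr(X = k | X > 0) for k >= 1, using Pr(X > 0) = alpha / (alpha + rho)."""
    return waring_pmf(k, rho, alpha) * (alpha + rho) / alpha


def yule_simon_pmf(k: int, delta: float) -> float:
    """q_k = delta * B(delta + 1, k) as in (10)."""
    return delta * exp(lgamma(delta + 1) + lgamma(k) - lgamma(delta + k + 1))


if __name__ == "__main__":
    rho, alpha = 2.0, 1e-6  # small alpha approximates the limit alpha -> 0
    for k in range(1, 6):
        print(k, waring_given_positive(k, rho, alpha), yule_simon_pmf(k, rho))
```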
Generating the Waring distribution
Hierarchic:
Draw p from the Beta (prior) distribution with parameters α and ρ.
Draw a value of X from the geometric distribution with parameter p.
Then X ∼ War(ρ, α).
Thus the Waring distribution is in Bayesian statistics known as the Beta-Geometric distribution.
A. Schubert & W. Glänzel: A Dynamic Look at a Class of Skew Distributions. Scientometrics, vol. 6, no 3, 1984, pp. 149-167.
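The hierarchic scheme above can be sketched with the Python standard library. The slide does not fix the parameter ordering of the Beta distribution or the support of the geometric distribution, so the sketch assumes one convention that reproduces (5): the success probability p has density proportional to $p^{\rho-1}(1-p)^{\alpha-1}$, and X counts failures before the first success, so that $\Pr(X = k \mid p) = p(1-p)^k$. The function name `sample_waring` is hypothetical.

```python
import math
import random


def sample_waring(rho: float, alpha: float, rng: random.Random) -> int:
    """One draw from War(rho, alpha) as a Beta mixture of geometric variables.

    Assumed convention (not spelled out in the source): p ~ Beta(rho, alpha)
    in the usual (a, b) ordering, and X | p is geometric on {0, 1, 2, ...}.
    """
    p = rng.betavariate(rho, alpha)
    while not 0.0 < p < 1.0:  # guard against boundary values
        p = rng.betavariate(rho, alpha)
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    # Inverse transform: Pr(X >= k | p) = (1 - p)**k
    return math.floor(math.log(u) / math.log1p(-p))


if __name__ == "__main__":
    rng = random.Random(2013)
    draws = [sample_waring(3.0, 1.0, rng) for _ in range(100000)]
    print(draws.count(0) / len(draws))  # compare with rho / (alpha + rho) = 0.75
```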
A.G.M. McKendrick on Modeling and Repetitive Events (1926)
In the majority of the processes with which one is concerned in the study of the medical sciences, one has to deal with assemblages of individuals, be they living or be they dead, which become affected according to some characteristic. They may meet and exchange ideas, the meeting may result in the transference of some infectious disease, and so forth.
The life of each individual consists of a train of such incidents, one following the other. From another point of view each member of the human community consists of an assemblage of cells.
A.G.M. McKendrick: Applications of mathematics in medical problems. Proceedings of the Edinburgh Mathematical Society, 44, 1926, 98-130.
These cells react and interact amongst each other, and each individual lives a life which may be again considered as a succession of events, one following the other. If one thinks of these individuals, be they human beings or be they cells, as moving in all sorts of dimensions, reversibly or irreversibly, continuously or discontinuously, by unit stages or per saltum, then the method of their movement becomes a study in kinetics, and can be approached by the methods ordinarily adopted in the study of such systems.
Glänzel & Schubert (1984): Postulates for repetitive events
New elements (with no occurrence) may enter the system at a rate proportional to the actual total number of elements in the system.
The chance for occurrence of the event grows linearly with the number of events already occurred (the (linear) Matthew effect: "For whosoever hath, to him shall be given, and he shall have more abundance", Matthew 13:12, King James translation).
Elements have an equal chance to drop out of the system independently of the number of prior occurrences of the event.
An Infinite Array of Cells
$x_i$ = the number of elements in cell nr. $i$, $x = \sum_{i=0}^{\infty} x_i$
The Postulates
$x_i$ = the content of cell nr. $i$, $x = \sum_{i=0}^{\infty} x_i$.
NEW ELEMENTS: Rate of the external source is proportional to the total content:
$$s = \sigma x, \qquad \sigma > 0. \tag{11}$$
THE MATTHEW EFFECT: the higher the cell index, the more facile is further transfer:
$$f_i = (\alpha + \beta i)\, x_i, \qquad \alpha > 0, \ \beta \ge 0. \tag{12}$$
UNIFORM LEAKAGE: proportional to the cell content:
$$g_i = \gamma x_i, \qquad \gamma \ge 0. \tag{13}$$
The New Europe
In the setting of Glänzel and Schubert, $x_i$ is the frequency of authors (e.g., in some field of science in a country) with $i$ published papers. The postulates tell that there is a cumulative advantage in higher levels of productivity. The parameter $\sigma$ is proportional to the total number of authors and is the rate of the external source emitting new authors. The leakage parameter turns out not to influence the equilibrium state, but influences the rate of convergence (when present) to the stationary solution.
The Kinetics
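$$\dot{x}_0 = s - f_0 - g_0$$
$$\dot{x}_i = f_{i-1} - f_i - g_i = (\alpha + \beta(i-1))\,x_{i-1} - (\alpha + \beta i + \gamma)\,x_i, \qquad i = 1, 2, \ldots$$
Summing over all cells, the transfer terms telescope, which yields
$$\dot{x} = (\sigma - \gamma)\,x.$$
Let us set $p_i \overset{\text{def}}{=} \frac{x_i}{x}$. Then
$$\dot{p}_i = \frac{d}{dt}\left(\frac{x_i}{x}\right) = (\alpha + \beta(i-1))\,p_{i-1} - (\alpha + \beta i + \sigma)\,p_i$$
$$\dot{p}_0 = \sigma - (\alpha + \sigma)\,p_0$$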
The Stationary Solution
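$$\dot{p}_i = \frac{d}{dt}\left(\frac{x_i}{x}\right) = (\alpha + \beta(i-1))\,p_{i-1} - (\alpha + \beta i + \sigma)\,p_i$$
$$\dot{p}_0 = \sigma - (\alpha + \sigma)\,p_0$$
The stationary solution with $\dot{p}_i = \dot{p}_0 = 0$ is thus clearly
$$p_i = \begin{cases} \dfrac{\sigma}{\alpha+\sigma}, & i = 0, \\[2ex] \dfrac{\alpha + \beta\cdot(i-1)}{\alpha + \beta\cdot i + \sigma}\, p_{i-1}, & i = 1, 2, \ldots \end{cases} \tag{14}$$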
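To see the kinetics settle on (14), one can integrate the equations for the relative frequencies numerically. The sketch below uses a plain explicit Euler scheme on the first few cells. Because each $\dot{p}_i$ depends only on $p_{i-1}$ and $p_i$, cutting the array off at a finite index does not disturb the lower cells. The step size, the parameter values and the starting state (all elements in cell 0) are illustrative assumptions.

```python
def stationary_solution(sigma, alpha, beta, n_cells):
    """p_0, ..., p_{n_cells-1} from the recursion (14)."""
    p = [sigma / (alpha + sigma)]
    for i in range(1, n_cells):
        p.append(p[-1] * (alpha + beta * (i - 1)) / (alpha + beta * i + sigma))
    return p


def integrate_relative_frequencies(sigma, alpha, beta, n_cells, dt, n_steps):
    """Explicit Euler integration of the equations for dp_i/dt."""
    p = [1.0] + [0.0] * (n_cells - 1)  # assumed start: everything in cell 0
    for _ in range(n_steps):
        dp = [sigma - (alpha + sigma) * p[0]]
        for i in range(1, n_cells):
            dp.append((alpha + beta * (i - 1)) * p[i - 1]
                      - (alpha + beta * i + sigma) * p[i])
        p = [pi + dt * dpi for pi, dpi in zip(p, dp)]
    return p


if __name__ == "__main__":
    args = (3.0, 1.0, 0.5, 15)  # hypothetical sigma, alpha, beta, number of cells
    limit = integrate_relative_frequencies(*args, dt=0.01, n_steps=3000)
    for a, b in zip(limit, stationary_solution(*args)):
        print(a, b)
```

The step size must satisfy dt · (α + β·i + σ) < 1 for the largest cell index kept, otherwise the Euler iteration can become unstable.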
The Stationary Solution
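With $\beta = 0$ (no Matthew effect) we have
$$p_0 = \frac{\sigma}{\alpha+\sigma}, \qquad p_r = \frac{\alpha}{\alpha+\sigma}\, p_{r-1} = \ldots = \left(\frac{\alpha}{\alpha+\sigma}\right)^{r} \frac{\sigma}{\alpha+\sigma},$$
i.e., a geometric distribution with the parameter $\frac{\sigma}{\alpha+\sigma}$. Here $\Pr(\text{publication}) = \frac{\alpha}{\alpha+\sigma}$.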
Change of parameters
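We re-parametrize
$$\rho \leftrightarrow \frac{\sigma}{\beta}, \qquad \alpha \leftrightarrow \frac{\alpha}{\beta}.$$
Then
$$p_r = \begin{cases} \dfrac{\rho}{\alpha+\rho}, & r = 0, \\[2ex] \dfrac{\alpha+(r-1)}{\alpha+\rho+r}\, p_{r-1}, & r = 1, 2, \ldots \end{cases} \tag{15}$$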
The Waring Distribution
Thus the stationary solution of the cell system in terms of the relative frequencies of the cell contents
$$p_r = \begin{cases} \dfrac{\rho}{\alpha+\rho}, & r = 0, \\[2ex] \dfrac{\alpha+(r-1)}{\alpha+\rho+r}\, p_{r-1}, & r = 1, 2, \ldots \end{cases}$$
is nothing but the Waring distribution.
The condition on the tail parameter $\rho > 1$ for existence of the mean is equivalent to $\sigma > \beta$, which means that the rate of infusion of new authors is higher than the rate of transfer of authors to higher publication numbers.
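The change of parameters can be checked by comparing the recursion (14) in the original rates with the Waring recursion in the re-scaled parameters. In the sketch below the function names are illustrative, and the expression α/(σ − β) for the mean is obtained by substituting the re-parametrization into (7); it is a derived convenience, not a formula from the slides.

```python
def stationary_solution(sigma, alpha, beta, n_cells):
    """Stationary relative frequencies from (14), in the original rates."""
    p = [sigma / (alpha + sigma)]
    for i in range(1, n_cells):
        p.append(p[-1] * (alpha + beta * (i - 1)) / (alpha + beta * i + sigma))
    return p


def waring_pmf(rho, a, n_cells):
    """Waring probabilities from (15), in the re-scaled parameters."""
    p = [rho / (a + rho)]
    for r in range(1, n_cells):
        p.append(p[-1] * (a + r - 1) / (a + rho + r))
    return p


def stationary_mean(sigma, alpha, beta):
    """Mean a / (rho - 1) with rho = sigma / beta, a = alpha / beta (needs sigma > beta)."""
    if sigma <= beta:
        return float("inf")
    return alpha / (sigma - beta)


if __name__ == "__main__":
    sigma, alpha, beta = 3.0, 1.0, 0.5  # hypothetical rates with beta > 0
    print(stationary_solution(sigma, alpha, beta, 5))
    print(waring_pmf(sigma / beta, alpha / beta, 5))
    print(stationary_mean(sigma, alpha, beta))
```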
The New Europe
Consider $x_i$, the frequency of authors (e.g., in some field of science in a country during a period) with $i$ published papers. Let the relative frequency $p_i$ of authors with $i$ papers be computed from data without any model of production. Furthermore, one can agree that the mean of the publication distribution is some kind of measure of scientific productivity (in that field and country). Then
$$\mu = \sum_{i=1}^{i_{\max}} i\, p_i.$$
However, any reasonable estimate of productivity must involve the notion of potential authors, i.e., those doing research but not publishing during the period under consideration.
Estimation of the frequency of zero
The publication productivity data is by its very definition zero-truncated, i.e., there is no information on those who are not publishing (in a certain period of time). We shall now find a way to estimate the frequency of zero from zero-truncated (or, truncated to the left at one) data using the Waring distribution.
A.G.M. McKendrick: example of estimation of zero frequency
The problem of estimation of zero frequency (under a Poisson model) from zero-truncated data was first considered by A.G.M. McKendrick, recognized also for the McKendrick-Von Foerster partial differential equation. McKendrick (1926) was considering a case of estimating the number of individuals in an Indian village who were susceptible to infection but did not develop the symptoms. He developed a differential equation and solved it to get the negative binomial distribution, from which he obtained the Poisson distribution as a limiting case.
A.G.M. McKendrick: Applications of mathematics in medical problems. Proceedings of the Edinburgh Mathematical Society, 44, 1926, 98-130.
A pioneering example of estimation of zero frequency
McKendrick developed a moment estimator to find the number of individuals who were susceptible to infection but did not develop the symptoms. His data contained the number of individuals that did not develop the symptoms, thus including the immune ones.
We shall also use a kind of moment estimator of zero frequency, using a remarkable characterization of the Waring distribution by truncated means.
X.L. Meng: The EM algorithm and medical studies: A historical link. Statistical Methods in Medical Research, 6, 1997, 23.
Left-truncation of War(ρ, α)
The proof (omitted) of the following theorem is an application of the recursion of the Waring probabilities and the Gamma function.
Theorem
If $X \sim \mathrm{War}(\rho, \alpha)$, then
$$\Pr(X = n + i \mid X \ge n) = \Pr(Y = i), \qquad i = 0, 1, 2, \ldots \tag{16}$$
where $Y \sim \mathrm{War}(\rho, \alpha + n)$.
By (16) we have $\Pr(X - n = i \mid X \ge n) = \Pr(Y = i)$ and thus
$$E[X - n \mid X \ge n] = E[Y]$$
and since $Y \sim \mathrm{War}(\rho, \alpha + n)$, we get by (7)
$$E[Y] = \frac{\alpha + n}{\rho - 1}. \tag{17}$$
Hence
$$E[X - n \mid X \ge n] = \frac{\alpha + n}{\rho - 1}.$$
But $E[X - n \mid X \ge n] = E[X \mid X \ge n] - n$. Thus we have found that if $X \sim \mathrm{War}(\rho, \alpha)$, then
$$E[X \mid X \ge n] = \mu + n \cdot \mu_1, \qquad n = 0, 1, \ldots \tag{18}$$
where $\mu = \frac{\alpha}{\rho-1}$ (as it should) and $\mu_1 = \frac{\rho}{\rho-1}$. In fact (18) is a characterization of War(ρ, α).
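The left-truncation property (16) and the affine form (18) lend themselves to a quick numerical check. In the sketch below, $\Pr(X \ge n)$ is computed as one minus a finite sum, so no infinite series is needed. The function names and the test values are illustrative assumptions.

```python
def waring_pmf(k_max, rho, alpha):
    """p_0, ..., p_{k_max} from the recursion (6)."""
    probs = [rho / (alpha + rho)]
    for r in range(1, k_max + 1):
        probs.append(probs[-1] * (alpha + r - 1) / (alpha + rho + r))
    return probs


def left_truncated_pmf(n, i_max, rho, alpha):
    """Pr(X = n + i | X >= n) for i = 0, ..., i_max."""
    probs = waring_pmf(n + i_max, rho, alpha)
    tail = 1.0 - sum(probs[:n])  # Pr(X >= n)
    return [probs[n + i] / tail for i in range(i_max + 1)]


def truncated_mean(n, rho, alpha):
    """E[X | X >= n] = mu + n * mu_1 as in (18); requires rho > 1."""
    mu = alpha / (rho - 1.0)
    mu_1 = rho / (rho - 1.0)
    return mu + n * mu_1


if __name__ == "__main__":
    rho, alpha, n = 3.0, 1.0, 5
    shifted = waring_pmf(10, rho, alpha + n)  # pmf of Y ~ War(rho, alpha + n)
    for a, b in zip(left_truncated_pmf(n, 10, rho, alpha), shifted):
        print(a, b)
    print(truncated_mean(n, rho, alpha))
```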
A Characterization of War(ρ, α)
The following theorem is due to Glänzel and Schubert. A simplified proof is given by Dimaki and Xekalaki.
Theorem
$X \sim \mathrm{War}(\rho, \alpha)$ if and only if
$$E[X \mid X \ge k] = \mu + k \cdot \mu_1, \qquad k = 0, 1, \ldots \tag{19}$$
where $\mu\ (= E[X] = E[X \mid X \ge 0])$ is given in (7) and $\mu_1 = \frac{\rho}{\rho-1}$.
W. Glänzel, A. Telcs & A. Schubert: Characterization by truncated moments and its application to Pearson-type systems. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 66, 1984, pp. 173-183.
C. Dimaki & E. Xekalaki: Towards a unification of certain characterizations by conditional expectations. Annals of the Institute of Statistical Mathematics, 48, 1996, pp. 157-168.
A First Characterization of War(ρ, α)
The simplified proof of the characterization above applies another characterization.
Theorem
$X \sim \mathrm{War}(\rho, \alpha)$ if and only if
$$P(X > r) = \frac{\alpha + r}{\rho}\, P(X = r), \qquad r = 0, 1, \ldots \tag{20}$$
Proof: ⇐: We assume that (20) is true for all $r = 0, 1, \ldots$. For $r = 0$ we have that $1 - P(X = 0) = P(X > 0) = \frac{\alpha}{\rho} P(X = 0)$, which is solved by $P(X = 0) = \rho/(\alpha + \rho)$, and this equals by (15) the probability of zero for $X \sim \mathrm{War}(\rho, \alpha)$.
Next
$$P(X > r + 1) = P(X > r) - P(X = r + 1)$$
and by (20)
$$= \frac{\alpha + r}{\rho}\, P(X = r) - P(X = r + 1).$$
In other words
$$P(X > r + 1) = \frac{\alpha + r}{\rho}\, P(X = r) - P(X = r + 1).$$
But we assume (20), so that
$$P(X > r + 1) = \frac{\alpha + r + 1}{\rho}\, P(X = r + 1).$$
Thus it must hold that
$$\frac{\alpha + r + 1}{\rho}\, P(X = r + 1) = \frac{\alpha + r}{\rho}\, P(X = r) - P(X = r + 1)$$
$$\Leftrightarrow \quad (\alpha + r + 1)\, P(X = r + 1) = (\alpha + r)\, P(X = r) - \rho\, P(X = r + 1)$$
$$\Leftrightarrow \quad P(X = r + 1) = \frac{\alpha + r}{\alpha + \rho + r + 1}\, P(X = r),$$
which is the recursion for the Waring probabilities in (6).
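The tail identity (20) can also be confirmed numerically for a given parameter pair. The sketch below computes $P(X > r)$ as one minus the cumulative sum of the probabilities from (6) and reports the largest deviation from the right-hand side of (20). The function names and the parameter values are illustrative.

```python
def waring_pmf(k_max, rho, alpha):
    """p_0, ..., p_{k_max} from the recursion (6)."""
    probs = [rho / (alpha + rho)]
    for r in range(1, k_max + 1):
        probs.append(probs[-1] * (alpha + r - 1) / (alpha + rho + r))
    return probs


def tail_identity_gap(rho, alpha, r_max):
    """Largest |P(X > r) - (alpha + r) / rho * P(X = r)| over r = 0, ..., r_max."""
    probs = waring_pmf(r_max, rho, alpha)
    cdf = 0.0
    worst = 0.0
    for r, p_r in enumerate(probs):
        cdf += p_r
        survival = 1.0 - cdf  # P(X > r)
        worst = max(worst, abs(survival - (alpha + r) / rho * p_r))
    return worst


if __name__ == "__main__":
    print(tail_identity_gap(3.0, 1.0, 50))
    print(tail_identity_gap(2.5, 0.7, 50))
```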
Waring regression
With regard to the Waring model of scientific productivity we want to estimate the parameter $\mu$, which is the mean of the Waring distribution. We use the left-truncated mean from (19) as an affine function of $k$,
$$E[X \mid X \ge k] = \mu + k \cdot \mu_1, \qquad k = 0, 1, \ldots$$
Let $y_k$ be the left-truncated sample mean, i.e., an estimate of $E[X \mid X \ge k]$, and let $k_{\max}$ be the maximum number of publications in the data. Then we write
$$y_k = \mu + k \cdot \mu_1 + e_k, \qquad k = 1, \ldots, k_{\max},$$
where the $e_k$ are random deviations (or residuals) of $y_k$ from the 'true' regression line. Then, by fitting a straight line by (weighted) least squares, we may estimate the intercept $\mu$ and the regression coefficient $\mu_1$.
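The regression step can be sketched as follows. The input is a zero-truncated frequency table mapping k to the number of authors with exactly k papers. The weights are an assumption made for illustration: each $y_k$ is weighted by the number of authors with at least k papers, which is not necessarily the weighting used in the cited literature. Function names and the toy data are hypothetical.

```python
def left_truncated_means(freq):
    """Return (ks, means, counts) with means[j] = y_k for k = ks[j].

    freq maps k >= 1 to the number of authors with exactly k papers.
    counts[j] is the number of authors with at least ks[j] papers.
    """
    k_max = max(k for k, n in freq.items() if n > 0)
    ks, means, counts = [], [], []
    n_tail = 0       # authors with at least k papers
    papers_tail = 0  # papers written by those authors
    for k in range(k_max, 0, -1):
        n_tail += freq.get(k, 0)
        papers_tail += k * freq.get(k, 0)
        ks.append(k)
        means.append(papers_tail / n_tail)
        counts.append(n_tail)
    return ks[::-1], means[::-1], counts[::-1]


def weighted_line_fit(xs, ys, ws):
    """Weighted least squares fit of y = intercept + slope * x."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


if __name__ == "__main__":
    toy = {1: 60, 2: 20, 3: 10, 4: 5, 6: 3, 9: 2}  # hypothetical counts
    ks, ys, ws = left_truncated_means(toy)
    mu_hat, mu1_hat = weighted_line_fit(ks, ys, ws)  # assumed weights: tail counts
    print(mu_hat, mu1_hat)
```

With real data the fit needs at least two distinct values of k, otherwise the slope is undefined.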
Waring regression
A final question is to obtain a figure of the uncertainty of the estimate of $\mu$. There seems to be no immediate analytic procedure for assessment of this uncertainty, as, in particular, we should perhaps not assume the homoscedasticity and independence of the residuals.
The Waring Method
Consider the left-truncated publication means $y_k$, $k = 1, \ldots, k_{\max}$, of some field of research or some university,
$$y_k = \mu + k \cdot \mu_1 + e_k, \qquad k = 1, \ldots, k_{\max}.$$
Find the estimate $\hat{\mu}$ of the intercept $\mu$ (= the average number of publications per person) and then
$$\text{number of potential authors} = \frac{\text{number of papers}}{\hat{\mu}}.$$
There is a certain vagueness in the concept of potential authors in the literature. We now incorporate the frequency of zero.
Productivity . . .
The publication productivity data is by its very definition zero-truncated, i.e., there is no information on those who are not publishing (in a certain period of time). But as is clear from (19), the Waring distribution is not hampered by the truncation. As observed above, the expression (19) gives a way of finding, by linear regression against $k$, the estimated intercept $\hat{\mu}$. This can be used to estimate the frequency of zero via $r = 0$ in (15): substituting $\rho = (\alpha + \mu)/\mu$, which follows from (7), into $p_0 = \rho/(\alpha + \rho)$ gives
$$\hat{p}_0 = \frac{\hat{\alpha} + \hat{\mu}}{\hat{\alpha}(\hat{\mu} + 1) + \hat{\mu}} \tag{21}$$
where $\hat{\alpha}$ is the estimate of $\alpha$ from data.
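A small sketch of how the regression output could be turned into the quantities above. The slide does not say how $\hat{\alpha}$ is obtained; the sketch assumes it is recovered from the fitted line by inverting $\mu = \alpha/(\rho-1)$ and $\mu_1 = \rho/(\rho-1)$, which gives $\rho = \mu_1/(\mu_1 - 1)$ and $\alpha = \mu/(\mu_1 - 1)$. All function names are hypothetical.

```python
def waring_parameters_from_line(mu_hat, mu1_hat):
    """Invert mu = alpha/(rho-1), mu_1 = rho/(rho-1) (assumed estimation route)."""
    if mu1_hat <= 1.0:
        raise ValueError("the slope must exceed 1 for a Waring distribution with finite mean")
    rho_hat = mu1_hat / (mu1_hat - 1.0)
    alpha_hat = mu_hat / (mu1_hat - 1.0)
    return rho_hat, alpha_hat


def zero_frequency(mu_hat, alpha_hat):
    """Estimated share of zero-publication authors, formula (21)."""
    return (alpha_hat + mu_hat) / (alpha_hat * (mu_hat + 1.0) + mu_hat)


def potential_authors(n_papers, mu_hat):
    """Number of potential authors = number of papers / estimated mean."""
    return n_papers / mu_hat


if __name__ == "__main__":
    mu_hat, mu1_hat = 0.5, 1.5  # hypothetical regression output
    rho_hat, alpha_hat = waring_parameters_from_line(mu_hat, mu1_hat)
    print(rho_hat, alpha_hat, zero_frequency(mu_hat, alpha_hat))
    print(potential_authors(100, mu_hat))
```

For the hypothetical values shown, the recovered parameters correspond to War(3, 1), whose zero probability ρ/(α + ρ) agrees with the value returned by (21).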
Data and methods
A weakness of earlier empirical tests of the Waring based estimates is the lack of precise test data. In order to produce a satisfactory empirical dataset for testing the accuracy of Waring based estimates of the zero class frequency, a known publication frequency distribution that includes zero values has to be created. Next, the creation of a publication frequency dataset is described, which is based on figures concerning researchers at two Swedish universities. The different distributions created below are based on a selection of potential authors, i.e. categories of people that we expect could publish research papers.
Data and methods
Those potential authors who have not published during the selected time period (but possibly in another time period) form the zero class of the publication frequency distribution. It should be noted that the frequency distributions will vary depending on which categories of people are selected. We include professors, researchers and senior lecturers in the potential author definition.
Employee data concerning the time period of 2005-2007 were obtained from two Swedish universities.
Data and methods
A selection of 729 and 949 potential authors from the respective universities was thereby obtained. Publication data was downloaded for each potential author from the Web of Science and compiled into a table where the number of publications (article, letter and review) associated with each potential author was listed. In addition, the number of first author publications and reprint publications by each potential author was extracted (the table is omitted here).
Data and methods
Furthermore, a random author was selected for each of the downloaded publications. This was achieved by randomly selecting a single author from the author list of each publication, resulting in a selection of one author per publication. The number of randomly selected authorships of each potential author was added to the table of publication frequencies. For each university and authorship type (first, reprint, all and random), a publication frequency distribution, i.e. the number of authors having one publication, two publications and so forth, was compiled (the table is omitted here). The zero-frequencies were removed to form zero-truncated samples.
Waring based estimation of the population mean
The publication frequency distributions described above are zero-truncated samples: the zero frequencies are missing. The objective of the method presented and tested is to estimate these zero frequencies.
Extraction of left truncated sample means. The result is a set of data points ranging from one (zero-truncated) to the maximum value of the distribution, with increasing means.
Fitting of straight line. The data points are plotted and a straight line is fitted through the points using weighted least squares regression. The weights in
A. Telcs, W. Glänzel and A. Schubert: Characterization and statistical test using truncated expectations for a class of skew distributions. Mathematical Social Sciences, 10, 1985, 169-178.
are used.
Simplified Waring estimation. The estimation presented above may be simplified and calculated based only on the sample mean and the share of one-frequencies.
Agreement between expected and estimated productivity
The potential author data set provides a full population of publication frequencies, including zero frequencies. This provides for detailed comparisons of true and estimated population means. However, the data set is rather small and the estimates can therefore be expected to be unstable.
Waring Regression SLU-first author data
This is a fact !
Avdelningen for forskningspolitisk analys
Let us assign manually and arbitrarily some additional publications to the most productive authors in the data set in the Figure above:
This is one of the criticisms levelled by Avdelningen för forskningspolitisk analys against the Waring method! Another is that the regression fit is improved if the extreme values on the right are removed, as has been done by Telcs et al., but this removal has no basis in the theory.
Agreement between expected and estimated productivity
But when we use the estimate of zero frequency:
Agreement between expected and estimated productivity
The estimated values are for the most part very good. In many cases the estimations are within a 5 % margin from the expected values. Only in a few cases are the estimates considerably far from the expected values (>20 %). Compared to the Simplified Waring, the full version of Waring performs slightly better.
Testing the Reliability of the Estimate
The results above indicate that Waring based estimates of the zero frequency in general produce good results. A concern is, however, that the estimates will be very sensitive to small variations in the sample, i.e. that the reliability of the estimations will be low. That is certainly the case in the estimates provided above since they are based on relatively small samples (500-1000 authors).
Testing the Reliability of the Estimate
In cases where we do not need to know the zero class, larger samples can be created. In the following we have created a test of the reliability of the estimations when larger samples are used.
Memories
Error Analysis
To test the error margin on a larger sample, a second data set has been compiled. From the Web of Science, publications with Nordic addresses published between 2003 and 2006 were downloaded. A selection of Nordic authors was obtained by extracting first authors and reprint authors from the downloaded publications and connecting these to the designated addresses. Authors with non-Nordic addresses were removed. The restriction to first and reprint authors was necessary since other authors could not be associated with specific addresses.
The names of the selected set of author fractions were manually adjusted to distinguish between homonyms and to harmonize author fractions relating to the same person. The number of first and reprint authorships of each distinctive author represented in the data were extracted and compiled into a table.
Abo 22nd August 2013, Seminar in Honor of Goran Hognas Waring Distribution
Error Analysis
Each publication was assigned to one of seven fields based on the classification of ISI subject categories. Following this, each distinct author was assigned to the field where the author had most publications. In cases where the number of publications was equal for two fields, one field was randomly selected. For each field, a publication frequency distribution, i.e. the number of authors having one publication, two publications and so forth, was compiled. The number of potential authors with zero publications in the selected time period is not included.
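The field assignment and the compilation of the frequency distributions can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used for the Web of Science data: the input format (a mapping from author to a list of field labels, one per publication), the function names and the field labels are all hypothetical, and the sketch assumes that an author is counted with his or her total number of publications.

```python
import random
from collections import Counter

def assign_field(pub_fields, rng):
    """Return the field with most publications; ties are broken at random."""
    tally = Counter(pub_fields)
    top = max(tally.values())
    tied = sorted(f for f, n in tally.items() if n == top)
    return rng.choice(tied)

def frequency_distributions(author_pubs, seed=0):
    """Map each field to a Counter {number of publications: number of authors}.

    author_pubs: dict author -> list of field labels, one per publication
    (hypothetical input format). Authors with zero publications never appear,
    so every distribution is zero-truncated.
    """
    rng = random.Random(seed)
    dist = {}
    for author, pub_fields in author_pubs.items():
        field = assign_field(pub_fields, rng)
        # Assumption: the author is counted with the total number of publications.
        dist.setdefault(field, Counter())[len(pub_fields)] += 1
    return dist

# Toy example with made-up authors and fields.
toy = {
    "author_a": ["physics", "physics", "chemistry"],
    "author_b": ["chemistry"],
    "author_c": ["physics"],
}
tables = frequency_distributions(toy)
```

In the toy example no author is tied between two fields, so the result does not depend on the seed.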
Bootstrap
A bootstrap technique for this is to resample, say B times, the authorship data (described above), thus creating replicate authorship data sets. For each replicate data set one calculates the regression line, obtaining intercept estimates µ1, . . . , µB, from which one can calculate the empirical distribution of the estimate of the intercept, its bootstrap mean and bootstrap standard deviation, and find fractiles to compute a bootstrap confidence interval for the intercept.
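The resampling scheme can be sketched as follows. This is a minimal illustration and not the code behind the results in the lecture. As a stand-in for the regression line it assumes a least-squares fit of the left-truncated sample means (the mean of all counts >= k) against the truncation point k. The function names, the choice k = 1, ..., 5 and the synthetic geometric data are likewise assumptions made for the example.

```python
import numpy as np

def truncated_mean_intercept(counts, k_max=5):
    """Intercept of a least-squares line through (k, mean of counts >= k).

    Assumption: this stands in for the regression described in the lecture.
    """
    counts = np.asarray(counts)
    # Use only truncation points for which some observations remain.
    ks = [k for k in range(1, k_max + 1) if np.any(counts >= k)]
    if len(ks) < 2:
        raise ValueError("need at least two truncation points with data")
    means = [counts[counts >= k].mean() for k in ks]
    slope, intercept = np.polyfit(ks, means, 1)
    return intercept

def bootstrap_intercept(counts, B=1000, alpha=0.05, seed=0, k_max=5):
    """Resample authors with replacement B times and summarise the intercepts."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    reps = np.empty(B)
    for b in range(B):
        sample = rng.choice(counts, size=counts.size, replace=True)
        reps[b] = truncated_mean_intercept(sample, k_max)
    # Fractiles of the bootstrap distribution give a percentile interval.
    lo, hi = np.quantile(reps, [alpha / 2, 1 - alpha / 2])
    return {"mean": reps.mean(), "sd": reps.std(ddof=1),
            "ci": (lo, hi), "replicates": reps}

# Synthetic zero-truncated publication counts (not the Web of Science data).
data = np.random.default_rng(1).geometric(0.4, size=800)
result = bootstrap_intercept(data, B=500)
```

The entry `result["replicates"]` holds µ1, . . . , µB, so a histogram of it shows the bootstrap distribution of the intercept. The percentile interval is only one of several ways to form a bootstrap confidence interval.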
The Bootstrap Distribution for the Intercept for Physics
Confidence intervals of estimated productivity
Bootstrapped population mean estimates were computed for each zero-truncated distribution of the field data set, and confidence intervals were calculated using the bootstrap.
The confidence intervals show that the differences between the different fields are in most cases small. For the social science fields, however, the confidence interval is large, which shows that a large share of zeros makes estimation difficult (as expected). Still, the results indicate that Waring based estimations of productivity can be used for field comparisons.
Confidence Intervals for nine fields of science
The Question from Avdelningen for forskningspolitisk analys (the Division for Research Policy Analysis)
Historiett
Historiette is a diminutive form of the French word histoire, and a literary term used in French since the 1700s. A historiette is a short story, like an anecdote. For readers of Swedish literature the term is almost exclusively identified with a book (a collection of short stories) by Hjalmar Soderberg titled "Historietter", first published in 1898.
Finally: Bibliometrics
. . . a statistical approach to master the flood of scientific information and to analyse and to understand the characteristics of big science by measuring quantitative aspects of communication in science and by providing the results to scientists and users outside the scientific community. Monitoring, description and modelling of the production, dissemination and use of knowledge was originally in the foreground.
. . . In the following two decades after 1980 bibliometrics was characterised by a shift towards science-policy and research-management application (W. Glanzel: The perspective shift, 2006)
The present historiett: A case of science policy
The system for funding allocation to public research institutions presented by the Swedish government in October 2008 was based on RUT 2. This was applied starting in 2009. In 2011 another report, an assignment from the Government, was authored by the then Chancellor of Swedish universities
A. Flodstrom: Prestationsbaserad resurstilldelning for universitet och hogskolor. U2011/7356/UH, 252 pages.
who, amongst other things, recommended abandoning the Waring method (as being controversial).
The present historiett: A comment on U2011/7356/UH
SULF has major objections in principle to using quality indicators as a basis for resource allocation . . . reject(s) the current performance-based model (= RUT 2) as well as the changes to it proposed by the investigator (= A. Flodstrom).
The present historiett: A comment on U2011/7356/UH (cont'd)
SULF firmly rejects . . . a one-sided use of simple measures of past activity, such as the amount of external funding, the number of publications or citations, as a basis for the government's allocation of resources. The foremost reason for this is that such measures have a systematically conserving effect, to the benefit of research along the mainstream, that is, more research on what we already know. The use of such indicators in the allocation of research resources amounts to a confusion of quantity with quality.
Thank You ! (Hippopotamus= ? in Swedish)
A.G.M. McKendrick again
. . . individuals, may meet and exchange ideas, the meeting may result in the transference of . . .
The life of each individual consists of a train of such incidents, one following the other . . .
react and interact amongst each other, and each individual lives a life which may be again considered as a succession of events, one following the other. If one thinks of these individuals, . . ., as moving in all sorts of dimensions, reversibly or irreversibly, continuously or discontinuously
Succession of events: Happy years as Prof. emeritus! Thank you, Goran!