Efficiently calculating inbreeding on large pedigrees databases

ARTICLE IN PRESS

Contents lists available at ScienceDirect

Information Systems

Information Systems 34 (2009) 469–492

0306-43

doi:10.1

� Cor

E-m

sfm15@

journal homepage: www.elsevier.com/locate/infosys

Efficiently calculating inbreeding on large pedigrees databases

Brendan Elliott �, En Cheng, Stephen Mayes, Z. Meral Ozsoyoglu

Electrical Engineering and Computer Science Department, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, OH 44106, USA

a r t i c l e i n f o

Article history:

Received 29 January 2009

Accepted 6 February 2009

Recommended by: L. Wongpedigree graph structures. Inbreeding is the quantitative measure of the genetic

relationship between two individuals. The inbreeding coefficient is related to the

Keywords:

Inbreeding coefficients

Pedigree

NodeCodes

Family NodeCodes

79/$ - see front matter & 2009 Elsevier B.V. A

016/j.is.2009.02.002

responding author. Tel.: +1216 368 2802; fax:

ail addresses: [email protected] (B. Elliott), exc92@

case.edu (S. Mayes), [email protected] (Z.M. Ozso

a b s t r a c t

We consider pedigree data structured in the form of a directed acyclic graph, and use an

encoding scheme, called NodeCodes, for expediting the evaluation of queries on

probability that both copies of any given gene are received from the same ancestor. In

this paper we discuss the evaluation of the inbreeding coefficient of a given individual

using NodeCodes and propose a new encoding scheme, Family NodeCodes, which is

further optimized for pedigree graphs. We implemented and tested these approaches on

both synthetic and real pedigree data in terms of performance and scalability.

Experimental results show that the use of NodeCodes provides a good alternative for

queries involving the inbreeding coefficient, with significant improvements over the

traditional iterative evaluation methods (up to 10.1 times faster), and Family NodeCodes

further improves this to 77.1 times faster while using 91% less space than regular

NodeCodes.

& 2009 Elsevier B.V. All rights reserved.

1. Introduction

According to Encyclopedia Britannica, a Pedigree is a‘‘record of ancestry or purity of breed.’’ Pedigrees arehierarchical hereditary structures and are typically repre-sented as directed acyclic graphs. Stud books (listings ofpedigrees for horses, dogs, etc.) and herdbooks (recordsfor cattle, swine, sheep, etc.) are maintained by govern-mental or private record associations or breed organiza-tions in many countries. In human genetics, pedigreediagrams are utilized to trace the inheritance of a specifictrait, abnormality, or disease, calculate disease risk factors,identify individuals at risk, and facilitate genetic counsel-ing. In addition to medical genetics, pedigrees are alsocommonly used in animal breeding (e.g., horse racing andpet breeding), plant studies (self-pollinated plant breed-ing), and genealogical studies. A sample pedigree diagram

ll rights reserved.

+1216 368 6888.

case.edu (E. Cheng),

yoglu).

is shown in Fig. 1, where rectangles represent male, circlesrepresent female.

As data collection and storage technology are becom-ing more readily available at a lower cost, the size andvariety of usable pedigree data has been increasing at ahigh rate. There are already large, heavily used pedigreedata collections such as The Utah Population Database [2]with 1.6 million genealogy records. The types of queriesthat are posed over pedigree data are also becoming moreelaborate than simple ancestor/descendant types ofqueries as the advances in medical genetics require morecomplex methodologies for the analysis of pedigree data[19]. The Jagelman Registries of Cleveland Clinic [3] is animportant example of a pedigree data collection, which isheavily used by medical and genetic researchers for theanalysis of hereditary structure and identifying riskfactors for inherited colon cancer. A list of examplepedigree queries is given in Fig. 2. Thus, there is a needfor scalable data management techniques for storing andquerying pedigree data due to both increasing volume ofavailable pedigree data, and increasing use of pedigreedata analysis in medical genetics for hereditary diseases.

www.sciencedirect.com/science/journal/is

www.elsevier.com/locate/infosys

dx.doi.org/10.1016/j.is.2009.02.002

mailto:[email protected]




ARTICLE IN PRESS

B. Elliott et al. / Information Systems 34 (2009) 469–492470

We have recently introduced a pedigree data modeland a pedigree query language (PQL) having an XPath-likesyntax for querying pedigree data [1]. PQL uses a compactand easy to understand syntax and enables the expressionof complicated queries that traverse the pedigree struc-ture. We use a special encoding scheme called NodeCodesfor expediting query evaluation on the pedigree structure.With PQL, the pedigree query evaluation framework willallow queries on pedigree data in Fig. 2 to be expressedeffectively and provided a way to evaluate these queriesefficiently. One of the important queries on pedigree datais to check if there is inbreeding in the ancestors of anindividual, and compute the inbreeding coefficient ofindividuals. The inbreeding coefficient of an individual isone of the parameters in calculating the cancer risk of anindividual in hereditary cancers [19]. Inbreeding is alsoimportant for wildlife conservationists and purebredlivestock. Inbreeding coefficients are calculated routinelyfor animals included in national genetic evaluations foryield traits [20], and animal selection [20]. Anothergenealogical measurement, the kinship of two individualsis defined [7] as the inbreeding coefficient of the(hypothetical) child of these two individuals. Hence themethods for evaluating the inbreeding coefficient are alsodirectly applicable to determining kinship of individuals.Efficient evaluation of inbreeding is important for largesize pedigrees. It is especially important for real-timeapplications such as genetic counseling. Even if theinbreeding coefficients of individuals are pre-calculatedand stored in the database, queries such as those involvingkinship determination requires an ad-hoc computation ofinbreeding coefficient.

In the preliminary version of this work [31], wepresented algorithms to compute the degree of inbreedingof an individual using NodeCodes, by evaluating Wright’sinbreeding coefficient formula [6] (which is also used

n1 n2

n4 n5n3

n8

n7n6

n10n9 n12n11

n13

Fig. 1. Sample pedigree diagram.

Q1: Find the coefficient of inbreeding for all individuals in a pedigreQ2: Determine the kinship of two individuals. Q3: Find all first-degree (FDR) and second-degree (SDR) relatives oQ4: Find all individuals in a pedigree with two or more ancestors diaancestors diagnosed with cancer on the paternal side. Q5: Find all individuals in a pedigree diagnosed with cancer with at lage 35. Q6: Display all family members at high risk based on Church’s clinic

Fig. 2. Example ped

in PedHunter [7]). In order to test the scalabilityand efficiency of query evaluation using NodeCodes onvarious sizes of pedigree data, we generated and usedsynthetic pedigree data based on an earlier work by VesaOllikainen [5].

In this paper, we propose a method to reduce the spacerequirements of NodeCodes based on representing pedi-grees as graphs of families instead of graphs of individuals.We create a modified encoding scheme, Family NodeCodes,which allows for efficient evaluation of inbreeding queriesat a fraction of the storage costs, and present an algorithmto compute inbreeding using these new codes. We presentexperimental results on inbreeding coefficient calculationfor both synthetic and real pedigree data [4], comparingpedigree query evaluation using NodeCodes, FamilyNodeCodes, and the traditional approach of using iterativequeries. Our results show that inbreeding coefficientevaluation using Family NodeCodes gives a significantimprovement over both the previously used naıveapproach and the previously introduced individual-levelNodeCodes, while greatly reducing the number of codesrequired. Furthermore, the improvements are scalablewith pedigree size and are still observed even in extremecases where every individual remarries.

The main contributions of this paper are as follows:

(i)

e.

f an ingnose

east 2

ally d

igree

An efficient and scalable query evaluation scheme forcalculating the inbreeding coefficient efficiently onpedigree data using NodeCodes.

(ii)
Introduction of Family NodeCodes, which labelfamilies instead of individuals and allow identifica-tion of family features like number of parents andremarriages from the codes alone while also usingmuch less space than traditional NodeCodes.
(iii)
An improved query evaluation scheme for calculatinginbreeding using Family NodeCodes.
(iv)
A methodology for generating synthetic pedigreedata with remarriages.
(v)
Experimental results demonstrating significant per-formance gains for queries with inbreeding coeffi-cient calculation over the traditional approach forpedigree queries through the use of NodeCodes andfurther improvements by using Family NodeCodes.
The rest of the paper is organized as follows. Section 2gives a brief overview of the related work on pedigree datamanagement, and evaluation of pedigree queries, inbreed-ing coefficient calculation, and geneolgical numberingsystems. In Section 3, the graph structure for pedigreegraphs and the utilization of NodeCodes for query

dividual who were diagnosed with cancer before age 50. d ancestors on the maternal side of the family, but no

descendants diagnosed with at least three polyps before

efined scoring system (number of FDR, SDR, age) [19].

queries.

ARTICLE IN PRESS

B. Elliott et al. / Information Systems 34 (2009) 469–492 471

evaluation is presented. Inbreeding coefficient and itsevaluation using NodeCodes are presented in Section 4.Family-level pedigree graph, construction of family Node-Codes and their utilization for evaluating inbreeding queriesare presented in Section 5. We also present an analyticalcomparison of family-level pedigree graphs with pedigreegraphs with individuals, and the NodeCodes based on thesegraphs, in terms of size of the graphs, and space overheadrequired for their utilization in Section 5. In order to test thescalability of our approach as the pedigree data grows, wedesigned and implemented a population simulator togenerate arbitrarily large pedigrees. We implementedand tested evaluating inbreeding queries by NodeCodes,and Family-level NodeCodes using both real and synth-etic pedigree data, both for query performance and forscalability. The synthetic data generation is presented inSection 6, and experimental results together with theexperimental set-up are presented in Section 7. Finally,our concluding remarks are given in Section 8.

2. Previous work

Much of the recent work in pedigree data managementhas focused on pedigree visualization and not on efficientquery evaluation or support for complex queries. Thisincludes PViN [8] and Peditree [9], both of which aredesigned to display large pedigrees from relationaldatabases, with PViN focusing on the use of moderngraphics capabilities for fast drawing, and PediTreeproviding support for performing calculations on thedata, including inbreeding estimation. However, the mostcomplicated query described by both systems is simplyretrieving all ancestors/descendants of an individual byiteratively performing an SQL query for every individualancestor/descendant to check for additional relatives.

Other academic work includes PedHunter [7,25], whichfocuses on the task of verifying pedigrees and to that endprovides functions (each using C language and one ormore SQL queries) for finding relatives and for testing howsets of individuals are related as well as for calculatingsome statistics. Answering a couple of structural queriesusing SQL is described in [10], but it involves explicitlycreating temporary relations. While an early SQL-likelanguage supporting path expressions is given in [7], theevaluation is largely based on following pointers.

There are also a few other commercial and academicsoftware packages for calculating inbreeding coefficients,including FSpeed [26], LaoTzu’s Animal Register [27],CompuPed [28], Oklahoma State University’s Departmentof Animal Science [29], and Cyrillic [12]. One of the mostpopular commercial packages is Cyrillic 2.1, which cancalculate inbreeding and crossovers from phenotype data,but provides no support for structure-based queryingand only supports pedigrees of up to 10,000 individuals.Cyrillic 3 supports larger pedigrees, but simply usesMENDEL [14] for pedigree analysis such as inbreeding,which uses a technique that is known to only work forcalculating inbreeding of small to medium-sized pedi-grees [15]. Another popular commercial pedigree softwareproduct is Progeny [13]. Progeny 6 only supports simpleBoolean queries on data fields and does not seem to

support calculating more complex statistics beyondsimple aggregate operations.

Several genealogical numbering systems have beendeveloped that at first glance appear quite similar toNodeCodes. These include the Henry Numbering System[21] and the d’Aboville System [22], which is based on theHenry Numbering System. The Henry Numbering Systemassigns an individual or progenitor a single number orletter. Then, each of the children of this individual isnumbered by appending the individual’s number with asequential numbering of the children. For example, if thechosen individual is numbered 1, then this individual’sfirst child will be numbered 11, the second child will benumbered 12, and so on. The Henry Numbering System islimited, unless letters are used, to numbering only up tonine children. Therefore, the d’Aboville System wascreated to avoid this issue by placing periods in betweenthe different generation’s numbers in an individual’snumbering. In the previous example, the second childwould have been numbered 1.2 in the d’Aboville system.These systems appear to be very similar to the NodeCodesbut there are several important differences including:(i) These numbering systems were designed to displaynicely formatted ancestor or descendant reports fromcomputer software [23]. Therefore, they choose anindividual from anywhere in the pedigree and only assignone number per descendant in a manner that makes thepedigree presentable to an end user. The NodeCodes aredesigned to assist queries against pedigree data bypotentially assigning more than one code per individualand allow for large improvements over existing iterativemethods of querying. (ii) The NodeCodes are permanentlyassigned to each individual based on their position in thepedigree while the other numbering systems dynamicallyassign numberings based on the report that has beenrequested, and (iii) Unlike these systems, NodeCodes alsohas the ability to express gender as well as a path from aprogenitor. Furthermore, Family NodeCodes can indicatewhen remarriages have occurred and how many knownparents are in a family.

A Pedigree Query Language (PQL) and evaluationof pedigree queries using NodeCodes is presented in[1]. However, this earlier work only described how Node-Codes can be used to evaluate some basic query predicates.The process for updating NodeCodes when changes aremade to a pedigree is discussed in [30]. The preliminaryversion of this work discussed how to calculate theinbreeding coefficient with NodeCodes [31] Utilizing Node-Codes for inbreeding queries, while being very efficient forsmaller pedigrees do not scale well due to large storageoverhead as the size of the pedigree data grows to be verylarge. This paper introduces Family NodeCodes, whichreduce the storage overhead, and improve the scalabilitywhile also significantly improving query performance.

3. Query evaluation

3.1. Pedigree graph structure and NodeCodes

We use a directed acyclic graph to represent thepedigree data, where the nodes represent individuals and

ARTICLE IN PRESS


directed edges represent parent–child relationships.Nodes are also labeled by gender, in addition to node IDsand other optional information. We use a pedigree graphlabeling system, NodeCodes, to expedite queries requiringgraph traversal. NodeCodes is a graph encoding schemeoriginally proposed for encoding single source directedgraphs [16,17], which was later adapted to encodepedigree graphs [1]. Using NodeCodes, each node of agraph is assigned labels which are sequences of integersand delimiters. The integers represent the sibling order,and the delimiters denote the generations as well asindicating the gender of the node. We use ‘‘.’’, ‘‘,’’, and ‘‘;’’to denote female, male and unknown, respectively. Anexample pedigree graph labeled with NodeCodes is shownin Fig. 3. First the progenitors (nodes with in-degree 0) arelabeled (we may consider adding a virtual root r andmaking all progenitors children of r). For each node u inthe graph, the set of NodeCodes of u, denoted NC(u), areassigned using a depth-first-search traversal starting fromthe source node as follows:

�

0.01,01,1

If u is the virtual root node r, then NC(u) contains onlyone element, the empty string.
� Let u be a node with NC(u), and v0, v1,y, vk be u’s
children in sibling order, then for each x in NC(u), acode xi� is added to NC(vi), where 0pipk, and �indicates the gender of the individual represented bynode vi.Note that the virtual root node is not actually used, andthe progenitors of the pedigree have NodeCodes i�where i ¼ 0,1,y, where the order indicates the inser-tion order of progenitors (0 is the first). Fig. 3 shows apedigree graph labeled with NodeCodes.Some important properties of NodeCodes that weutilize in the rest of this paper are:

p0 p1

p7

p2 p3

p4

p5 p6

1,0.

0.0,1,0,

1,1.

1,2.0.0,0.1.1,0,0.1.1,1.0.1.

0.0,0.1,0,0.1,1.0.

,0.0,,0.0,.0.0,

1,2.0;0.0,0.1.0;1,0,0.1.0;1,1.0.1.0;0.0,0.0,0;1,0,0.0,0;1,1.0.0,0;

Node Inbreedingp0 0 p1 0 p2 0 p3 0 p4 0.125 p5 0 p6 0.25 p7 0.203125

Fig. 3. Example pedigree with NodeCodes.

�
A NodeCode can be assigned to one and only one nodein the pedigree graph. � Let Progenitors(v) denote the set of nodes which are
the progenitor ancestors of a node v, and NC(v) denotethe set of NodeCodes of v in a pedigree graph. Then,there is a one-to-one correspondence between theNodeCodes in NC(v) and the set of all paths from u inProgenitors(v) to v. This implies that the number ofNodeCodes of a node v is the same as the number ofpaths from the virtual source node to v, wherejNC(v)jXjProgenitors(v)j. Actually, if there is no in-breeding (crossover of paths) then jNC(v)j ¼jProgenitors(v)j.

3.2. Query evaluation

Using NC(v), the ancestors, descendants, parents andchildren of a given node v can be determined. For a givennode v, finding ancestors of v requires considering all theNodeCodes in NC(v), while any NodeCode from NC(v) issufficient to find the descendants of v [1]. More specifi-cally, let nv ¼ a1�a2�yam�1�am� be a NodeCode of anode v, where 2pm. Then ‘‘a1�a2�yam�1.’’ and‘‘a1�a2�yam�1,’’ are in NC(mother(v)) and NC(father(v))respectively. Similarly, ‘‘a1�a2�yam�1�am�am+1�’’ is inNC(children(v)) for each NodeCode nv of v, where am+1 isan integer, and ‘‘�’’ is ‘‘.’’ for daughters and ‘‘,’’ for sons.Given NC(v), the NodeCodes of all descendants, allancestors, siblings, and other relatives of v can bedetermined in a similar fashion. Given a query thattraverses the pedigree graph structure, ancestors, descen-dants, parents and children of individuals can all beidentified by string manipulations using NodeCodes, andsuch queries can be transformed into SQL queriesaccordingly. More complicated structural relationships,e.g., first degree relatives, or second degree relatives) arealso translated into SQL using these basic steps.

We model pedigrees, individuals and NodeCodes in ourrelational database with the following tables:

Pedigree(PedigreeID, Name)Individual(IndividualID, PedigreeID, MotherID, FatherID,

Gender, Name, DOB, DOD, etc.)NodeCodes(PedigreeID, NodeCode, IndividualID)

A simple example of an SQL query on this schema tofind all descendants of individual k using NodeCodes is:

SELECT nr.IndividualID

FROM NodeCodes nk, NodeCodes nr

WHERE nk.IndividualID ¼ k

AND nk.PedigreeID ¼ nr.PedigreeID

AND nk.NodeCode+‘_%’ LIKE nr.NodeCode

The full details for query evaluation for traditionalpedigree queries involving hereditary relationships, trans-lating such queries into simple SQL queries usingNodeCodes is given in [1]. In the next section, we present

ARTICLE IN PRESS


how queries involving inbreeding can be evaluated onpedigree data using NodeCodes.

4. Inbreeding calculations

Inbreeding for an individual n in a given pedigree graphP is defined as the case where the father and mother of n

share a common ancestor in P. In our representation of thepedigree structure, this can be represented as an undir-ected cycle in P that includes n. According to thisdefinition, whether or not parents have inbreeding does

not directly determine whether their offspring has inbreed-ing. A case can be easily constructed where both parentsof an individual n1 have a high degree of inbreeding, butn1 does not have inbreeding at all (i.e. when none of theirancestors are related). Still, the source of the genes ofindividual n is his parents. Therefore, a person must haveparents sharing at least one common ancestor in order tohave inbreeding. The NodeCodes of n effectively captureall ancestors that pass genes to n. Hence, given theNodeCodes of n and the parents of n, it can be determinedif n has inbreeding in the context of a pedigree tree P. Twonodes have a common ancestor if they have NodeCodessharing at least one prefix. Thus, for an individual to haveinbreeding, both parents of the individual must haveNodeCodes sharing a common prefix. Likewise, if twoindividuals share at least one common NodeCode prefixthen they will also have a nonzero kinship score.

Theorem 1. An individual has inbreeding if and only if themother and father of the individual have NodeCodessharing a common prefix.

In the example in Fig. 3, consider the node p7 which has

parents p5 and p6. Since p5 and p6 have NodeCodes sharing

a common prefix, e.g. ‘‘1,’’, p7 has inbreeding. Similarly,

since p7 has inbreeding, i.e. p7’s parents have common

ancestors, its parents have NodeCodes sharing the same

prefix.

4.1. Inbreeding coefficient

It is useful to be able to quantify the level of inbreedingin the genotype of an individual, because the level ofinbreeding determines the likelihood that the individual ishomozygous in certain characteristics. The inbreedingcoefficient, first defined by Wright in 1922 [6], is an effortto define the level of inbreeding in the genotype of anindividual quantitatively. The inbreeding coefficient of anindividual is a function of the number and location of thecommon ancestors of both parents of this individual in thegiven pedigree. Wright’s coefficient of inbreeding isformulated as follows:

Fx ¼

P 1

2

� �rþsþ1

ð1þ FAÞ

" #if x has inbreeding

0 otherwise

8><>:

9>=>;

In this formula, x is the individual for whom we arecalculating the inbreeding coefficient, and A is thecommon ancestor of the parents of x. Fx is the inbreedingcoefficient for x, and FA is the inbreeding coefficient for A.

r is the number of generations between x and A from thematernal side, and s is from the paternal side. Thesummation is performed over all common ancestors ofthe parents of x, and for all pairs of paths that are notoverlapping from the common ancestor to the parents.

Definition 4.1 ((Overlapping paths):). Let p1 and p2 be apair of paths from a common ancestor n to mother m andfather f of an individual. The pair p1, p2 is non-overlappingif n is the only node in common to p1 and p2. Otherwise,the pair of paths p1, p2 is overlapping (with crossover), andthe nodes in common to paths p1 and p2 other than n arecalled the crossover nodes, which are also commonancestors for m and f.

An example of a pair of overlapping paths is {p1 ¼ p1-

p2-p4-p5; p2 ¼ p1-p3-p4-p6} regarding the par-

ents of p7 where p4 is a crossover node. An example of a

pair of non-overlapping paths is {p1 ¼ p1-p2-p4-p5;

p2 ¼ p1-p6} with respect to the parents of p7 in Fig. 3.

Note that the formula is recursively called for each

inbreeding root, i.e. common ancestor. If an individual

does not have inbreeding (i.e. her parents do not share a

common ancestor) her inbreeding coefficient will be 0.

4.2. Calculating inbreeding coefficient with NodeCodes

If an individual has inbreeding, the inbreeding coeffi-cient is calculated using Wright’s formula, calling itselfrecursively for all inbreeding roots. The general outline forcalculating the inbreeding coefficient of an individual p

using NodeCodes is as follows:

Algorithm Inbreeding Coefficient

Input: NodeCodes NC(p)

Output: Inbreeding coefficient of p.

1.
Find the NodeCodes of mother m and father f of individual p.
2. I
dentify common ancestors of mother and father.
3. F
or each common ancestor c
a.
Find the set of pairs of paths from c to mother and father.
b.
Identify non-overlapping pairs of paths
c.
Find the number of generations between the common ancestor
and the individual for the non-overlapping pairs of paths.

d.
Compute the inbreeding coefficient.
In this algorithm, step 1 is trivial (see Section 3 and[1]). Step 3.c is just finding the path length and step 3.d isapplying the formula. We will explain the other stepsbelow:

4.2.1. Identifying common ancestors

Given an individual p, with mother m and father f, thisstep requires matching NC(m) with NC(f) having thelongest common prefix for matching sets.

Definition 4.2 ((Longest Common Prefix for match):). LetM and F be (sub)sets of the mother and father codes. Thenx is the longest common prefix for matching M and F, ifthere is no x0 where x is a prefix of x0, and x0 is a commonprefix of all mi in M and all fi in F.

We use the notation x ¼ LCP(M, F) to denote that x is the

LCP for matching sets M and F. Alternatively, given x, we

can find the sets of matched mother NodeCodes M, and

father NodeCodes F. We use the notation MPS(x) ¼ (M, F)

ARTICLE IN PRESS


to denote that M and F is the matched pair of sets of

NodeCodes with the longest common prefix x.

Example 4.1. Consider the individual p7 in the pedigree inFig. 3. The mother and father of p7 are p6 and p5,respectively. The sets of mother and father NodeCodesare: NC(p5) ¼ {a, b, c}, and NC(p6) ¼ {d, e, f, g} where a

¼ /0000S, b ¼ /1000S, c ¼ /1100S, d ¼ /0001S,e ¼ /1001S, f ¼ /1101S, g ¼ /12S. (Note that, we ignorethe gender delimiters in the NodeCodes for simplicity ofpresentation in this section.)

w ¼ /000S is a longest common prefix for p5 and p6

where pair of matched sets for w is MPS(/000S) ¼

({a},{d}).

x ¼ /110S is a longest common prefix for p5 and p6

where pair of matched sets for x is MPS(/110S) ¼ ({c},{f}).

y ¼ /11S is not a longest common prefix for p5 and p6

since MPS(/11S) ¼MPS(/110S).

However, z ¼ /1S is a longest common prefix for p5

and p6 since it is the longest shared prefix for the sets

of mother and father codes it matches, i.e. MPS(/1S) ¼

({b, c}, {e, f, g}).

Note that, if x ¼ LCP(M, F) then x is a NodeCode of the

common ancestor for mother m and father f (by Theorem

1), and the sets of paths represented by M and F are the

paths from this common ancestor to m and f, respectively.

We have the following lemma and theorems stating the

properties of longest common prefix for matching mother

and father NodeCodes, which follow from the definitions.

Lemma 1. Let x and y be two longest common prefixes formatching mother m and father f of an individual, whereMPS(x) ¼ (Mx, Fx) and MPS(y) ¼ (My, Fy). Then, x is a prefixof y if and only if Mx+My, and Fx+Fy.

Theorem 2. Let x and y be two longest common prefixesfor matching mother m and father f of an individual, and x

is a prefix of y. Then,

(a)
the nodes corresponding to codes x and y are bothcommon ancestors of m and f, and
(b)
the node for x is an ancestor for the node for y (i.e. y isthe younger ancestor).
Theorem 3. Let p, q be two ancestors of both m and f suchthat p is also an ancestor of q, and let x be a NodeCode forp, where x ¼ LCP(Mx, Fx). Then, there exists a NodeCode y

for q such that

(a)
x is a prefix of y, and (b) y is a longest common prefix for matching sets of
codes for m and f, i.e., y ¼ LCP(My, Fy) for nonemptysets My, and Fy for m and f, respectively.

Due to the limitation of space, please refer to [32] for

the proofs of all the lemmas and theorems.

4.2.2. Identifying pairs of paths from common ancestors

Let n be a common ancestor of m and f of an individual,xi, 1pipk, be the NodeCodes of n such that xi ¼ LCP(Mxi,

Fxi) for some nonempty subsets Mxi, and Fxi of m and f

codes, respectively. Since n is a common ancestor, byTheorem 1, there is at least one such xi, i.e., iX1. Let x beany one of such xi’s. Then, the set of pairs of paths from n

to m and f can be represented as

PPðn; xÞ ¼ fða; bÞjx ¼ LCPðMx; FxÞ and a 2 Mx and b 2 Fxg

Note that each such common prefix xi gives a duplicaterepresentation for the set of pairs of paths from n to m

and f.

Example 4.2. Again for the individual p7 in Fig. 3, andusing the abbreviations for codes in Example 4.1., considerthe following longest common prefixes for parents p5

and p6:

x1 ¼ h000i;MPSðx1Þ ¼ ðfag; fdgÞ

x2 ¼ h100i;MPSðx2Þ ¼ ðfbg; fegÞ

x3 ¼ h110i;MPSðx3Þ ¼ ðfcg; ff gÞ

Note that all 3 of the shared prefixes are in NC(p4)indicating that p4 is a common ancestor for p5 and p6.

The set of paths from p4 to p5 and p6 is

PPðp4; x1Þ ¼ fða;dÞg

where a represents the path /p4, p5S, and d represents

the path /p4, p6S when we replace the NodeCodes with

node identifiers. Similarly, PP(p4, x2) and PP(p4, x3) repre-

sent the same pairs of paths, so any one of the xi’s can be

used to identify the paths from the common ancestor p4.

4.2.3. Identifying overlapping pairs of paths

As discussed above, a pair of paths from a commonancestor n to m and f are overlapping if they have acrossover node o (i.e. o is another common ancestor of m

and f, and o is younger than n), and overlapping pairs needs

to be excluded from inbreeding computations.Let n be a common ancestor of m and f, and x be a

NodeCode for n such that x ¼ LCP(M, F). Let a pair of codes(a, b), where a in M, b in F, represents an overlapping path.Then, there is a crossover node o, and from Theorems 1and 3, there is a shared prefix x0, where x0 ¼ LCP(Mx0, Fx0), o

is the node having x0 as its NodeCode, and x is a prefix of x0.Since (a, b) is overlapping, it is in the set of pairs of paths,

from n to m and f, with crossover node o.

First we compute a representation for those pairs ofpaths from n via crossover node o corresponding to sharedprefix x, denoted by PP(o, x). PP(o, x) is computed by takingthe union of PP(o, xi) for all xi that are in NC(o) where x is aprefix of xi. Then, the set of pairs of overlapping paths fromn to nodes m and f with crossover are

SOPðn; o; xÞ ¼ fða; bÞj9tm; tf ðða; tf Þ 2 in PPðo; xÞ

and ðtm; bÞ 2 in PPðo; xÞÞ

Example 4.3. p1 and p4 are both common ancestors of m

and f. Ignoring gender marks for simplicity.

NCðp4Þ ¼ fh000i; h100i; h110ig; NCðp1Þ ¼ fh1ig.

LCPðfb; cg; fe; f ; ggÞ ¼ h1i ¼ x

Longest common prefixes x0i, where x is a prefix of arex01 ¼ /100S and x02 ¼ /110S, both are members of NC(p4).

ARTICLE IN PRESS


Thus the crossover node is p4:

PPðp4; xÞ ¼ PPðp4; x01Þ [ PPðp4; x

02Þ

¼ PPðp4; h100iÞ [ PPðp4; h110iÞ

¼ fðb; eÞ; ðc; f Þg

Then the set of overlapping pairs of paths from p1 to m

and f with crossover p4 is

SOPðp1;p4; xÞ ¼ fðb; eÞ; ðb; f Þ; ðe; cÞ; ðe; f Þg

The set of pairs of paths matched with x, is {(b, e), (b, f),(b, g), (c, e), (c, f), (c, g)} (that is all pairs (a, b) where a in Mx

and b in Fx where MPS(x) ¼ (Mx, Fx)). Excluding theoverlapping pairs, the remaining pairs are {(b, g), (c, g)},representing pairs of non-overlapping paths (/p1, p2, p4,p5S, /p1, p6S), and (/p1, p3, p4, p5S, /p1, p6S) which willbe used in inbreeding coefficient computation.

The following example illustrates the overall algorithm

for inbreeding coefficient calculation using NodeCodes.

Example 4.4. Let us calculate the inbreeding coefficientfor p7 for pedigree in Fig. 3. From the NodeCodes of p5 andp6, we find two common ancestors: p1 with the longestcommon prefix /1S and p4 with the common prefixes/000S, /100S, and /110S. Examining the commonancestor (inbreeding root) p4 first, we use the prefix/000S and find the matching mother code /0000S, andthe father code /0001S (which have no overlappingpaths), giving us r ¼ 1 and s ¼ 1, which can be pluggedinto the formula along with p4’s inbreeding (see the tablein Fig. 3) to get 0.125�(1+0.125) ¼ 0.140625 contributedto the score from p4. For p1, we have two father codesand three mother codes beginning with /1S, resulting in6 pairs of paths from p1 to parents p5 and p6, seeExample 4.3. After eliminating the overlapping paths viathe crossover p4, as shown in Example 4.3, calculating thescores from the remaining two pairs of paths gives us0.03125 for each one (0.0625 total for p1), and so we findthe coefficient of inbreeding for p7 as 0.203125.

4.2.4. Complexity of algorithm

Let h be the height of the pedigree graph (i.e. the lengthof the longest ancestor, descendant path between any twonodes in the pedigree), k be the number of NodeCodesof the individual whose inbreeding coefficient is beingcalculated, and s is the length of the longest NodeCode ofthe individual. Then, step 1 takes k operations. In step 2,the common ancestors of parents are found by prefix-matching pairs of NodeCodes from the mother and fatherNodeCodes. This can be done by sorting the father andmother NodeCodes and then matching them by longestcommon prefix in a merge operation. This requiresO(klogk) time for sort, and O(s�k) time for merge andprefix match. Identifying overlapping paths in step 3.b,takes O(c�s�k) where c is the number of commonancestors of parents of the individual. Calculating thepath length in step 3.c is simply bounded by the length ofthe paths to the individual from its common ancestors,which is bounded by the length of the longest NodeCodeof the individual, so it is O(s).

Using memoization, repeated computations of theinbreeding coefficient can be avoided, in case the samecommon ancestor is encountered multiple times duringrecursive calls. Thus computing the inbreeding coefficientof each common ancestor once, the total cost forcomputing the inbreeding coefficient of an individual isO(c�s�k+klogk) where c, s, k5n, and n is the total numberof individuals in a given pedigree. In real data, the largestpedigree consists of 118 individuals spanning 8 genera-tions with an average of 1.86 NodeCodes per individual. Insynthetic data, the largest pedigree consists of 195 197individuals spanning 19 generations with an average of26.42 NodeCodes per individual.

In PedHunter [7], the inbreeding coefficient is calcu-lated by kinship and the implementation is using iterativeSQL queries and memorization with a kinship table withthe time cost of O(n2) and the space cost of O(n2) where n

is the total number of individuals in a given pedigree.In summary, our NodeCodes-based method for calcu-

lating inbreeding coefficient is much faster than previousiterative methods.

5. Family-level graph

5.1. Motivation

In previous sections, we used a directed acyclic graphto model the individual-level pedigree graph. As theamount of genealogy information grows quickly, the sizeof the individual-level graph will increase equally fast. Forexample, given a synthetic pedigree of 195197 indivi-duals, the total number of NodeCodes for the individual-level graph is 5157 369 (see Fig. 18). This growth will notscale well as genealogy data grows. In this section, wepropose a family-level pedigree graph, where the nodesrepresent families instead of individuals, to represent thesame data as the individual-level graph using fewer nodesand edges, with the motivation to decrease the size andthe storage overhead of using NodeCodes.

5.2. Family-level graph structure

Family-level pedigree graph is a directed acyclic graphwhere nodes represent families. We define a family as aunique set of two parents, a mother and a father (one orboth may be missing from the data), and a set of children.The directed edges represent relationships betweenfamilies and there is an edge between two nodes if thereis a shared individual between the two families corre-sponding to these nodes. More specifically, for anyindividual, there is an edge from the family node wherethe individual is a child, to each of the family node(s)where that individual is a parent. Given family-levelpedigree graph, the in-degree of each family node, exceptfor families with paternal progenitors, is two; with oneincoming edge on the maternal side and the second oneon the paternal side. Note that if an individual has neithera parent nor a child, the individual is typically not givena family node and is left out of the family-level graph.

ARTICLE IN PRESS


An example (individual) pedigree graph and its corre-sponding family-level pedigree graph are shown in Fig. 4.

Given an individual i, we can obtain his or her mother’sID and father’s ID from the pedigree data in our database.Using both parents’ IDs, we can generate a familyidentifier and determine whether or not it exists in thefamily list. If the family identifier exists, then we need addi to the corresponding family’s children list. Otherwise, wegenerate a family node and set its value based on themother ID, father ID, and i.

There are a couple of design issues to consider whenconstructing family nodes for progenitors. The first dealswith isolated progenitors, i.e. individuals with no parents andno children. We do not create a family node for suchprogenitors (having this individual as a child) since theseindividuals are not connected to others and would thusnever be included in the results of any queries that look forparent, child, ancestor, or descendant relationships (includ-ing inbreeding). Thus, as a family graph optimization, weexclude these individuals from the graph. The second dealswith the remarried progenitors. We create a family node forremarried progenitors having the individual as a child. Suchfamily nodes contain a virtual parent. Using these virtualparent nodes helps the execution of the inbreedingalgorithm, and other queries by providing links betweenall of the remarried progenitors’ families back to one singlenode where that remarried progenitor is a child.

Algorithm Family-level Graph Construction

Input: Pedigree p, Optimization Flag opt, Virtual Parent Flag virtual

Output: A Family-level Graph for p

1.
Find all individuals in pedigree p.
2.
For each individual i in pedigree p,
a.
If both the mother ID and father ID of i are NULL, add i to the
progenitor list.

b.
Otherwise, using the mother ID and father ID of i, generate a
unique family ID.

-
If family ID exists, add i to the child list of the corresponding
family node.

-
Otherwise, create a family node based on family ID, add family
ID to the list of families, and initialize the family’s information

based on individual i.

3.
If opt is FALSE, create family nodes for each isolated progenitor
individual.

4.
If virtual is TRUE, create a virtual family for each remarried
progenitor.

5.3. Scalability of family-level pedigree graphs

Our motivation for family-level pedigree graphs is thatthey can scale better to large pedigrees as compared toindividual pedigree graphs. First we show estimatedreduction in the number of nodes and the number ofedges with family-level graph as compared with indivi-dual-level graph. Experimental results confirming thescalability of family-level graphs on both real pedigreedata and synthetic data are presented in Section 7.

In order to estimate the number of nodes, and thenumber of edges, for an individual-level graph and afamily-level graph, we have the following simplifyingassumptions on the pedigree data:

(1)
number of female and male individuals is the same, (2) each family has two parents,
(3)
each family has K children, (4) an individual can have multiple marriages, and the
average number of marriages for an individual TR.

The set of individuals in a pedigree can be classified intothree groups—progenitors (individuals with no parents),lowest descendants (individuals with no children), andmiddle-level individuals (those with both parent and child).Let N be the total number of individuals, Np be the numberof progenitors, Nd be the number of lowest descendants,and Nm be the number of middle-level individuals. Then,

N ¼ Np þ Nd þ Nm (1)

For the individual pedigree graph, there are N nodes and2�(Nm+Nd) edges. Since we know Nm+Nd ¼ N�Np, if theratio of Np to N is very small, we can say the number ofedges is 2�N. For the family-level pedigree graph, let M

denote the number of families, i.e., the number of nodes.Assuming that (a) each family consists of K+2 membersand (b) every middle-level individual appears in twofamilies (once as a child and once as a parent) (c) everyprogenitor appears in one family as a parent (d) everylowest descendant appears in one family as a child, so itturns out that N ¼ Np+Nd+Nm individuals result inNp+Nd+2�Nm instantiations of family members. Then therelationship between M and N is as follows:

M ¼Np þ Nd þ 2nNm

K þ 2¼

N þ Nm

K þ 2o

2nN

K þ 2(2)

Given M families, we can say that there are 2�M edgesin terms of the family pedigree graph, if we assume theratio of the number of progenitor families to M is verysmall. Based on the above analysis, we can compute theratio of reduction in terms of number of nodes, andnumber of edges of family pedigree graph, compared withindividual pedigree graph. The ratio is expressed asfollows:

#nodesðedgesÞ of family graph

#nodesðedgesÞ of individual graph¼

M

No

2

K þ 2(3)

This confirms our intuition that family-level pedigreegraphs are expected to be smaller in size (in terms ofnumber of edges, and number of nodes) than individual-level pedigree graphs. As a consequence of being smallerin size, family-level pedigree graphs also facilitate sig-nificantly more scalable utilization of NodeCodes. That is,using the family-level pedigree graph, NodeCodes encod-ing paths between family nodes can be utilized todetermine the inbreeding coefficient of individuals aswell as other pedigree-related queries. Utilizing family-level NodeCodes for the same pedigree data requiressignificantly less number of NodeCodes, which alsorequires less space. Estimated improvements on spacerequirements for family-level NodeCodes is presented inSection 5.5, after presenting how NodeCodes are adaptedto family-level pedigree graphs in the next section.

5.4. Family-level NodeCodes

Due to the creation of the new family-level graph, wealso must redefine our NodeCodes to work with it. To

ARTICLE IN PRESS

P0 , –

P1, P2

P3 , P4

P5

P2 , –

P4

P1 , –

P3

F0

F3

F2F1

P0 , –

P1

P3 , P4

P5

P1 , –

P4

P1 , –

P3

F0

F3

F2F1

Fig. 5. Common ancestor is (a) the parents in a family node or (b) a child in a family node.

P1, P0

P3, P4

P8, P7

P11

P6, P7

P10

P4, P5

P8, P9

P2, P3

P7

F0

F3 F4

F1

P0 P1

P2 P3 P4 P5

P6 P7 P8 P9

P10 P11

F2

Fig. 4. (a) An individual pedigree graph GI ¼ (VI, EI) and (b) a family-level pedigree graph GF ¼ (VF, EF).


properly encode the new family-level graph withNodeCodes, we need to consider a couple of issues thatarise with the new method. For some of our queries andalgorithms, we require detailed knowledge about thepaths and the contents of the family nodes. We would liketo be able to obtain this information without extradatabase lookups. Therefore, it will require a few extraencodings to be contained within the NodeCodes so thatwe can retain all of the important information. The basicmethod and structure presented in Section 3 will remainthe same with a few minor additions.

The first change is in the placement of the genderdelimiter codes. In contrast to the original graph, nodes inthe family graph are inherently genderless and instead theedges are assigned a gender based on the parent’s gender.Instead of placing the gender delimiter after the indivi-dual’s sibling number in the family, we now place it before

the family’s number to represent the edge. Note that in

this model, NodeCodes for a family effectively label the

parents in the family as they end with the parents’ genderand the parents’ position within the parent families.

Next, in order to keep the same path information ascould be determined by the previous individual-levelgraph’s NodeCodes, we need to know more about remar-riages. For example, consider the two graphs given in Fig. 5:

Using the NodeCodes on the individual-level graph ofthe family graph in Fig. 5(a), we could easily determinethat there are two non-overlapping paths from P0 to P5’sparents of length 2 (P3 and P4 are cousins). In the samefashion, we could determine with the individual-levelgraph of Fig. 5(b) that there are two non-overlappingpaths from P1 to P5’s parents of length 1 (P3 and P4 arehalf-siblings). However, when the graph is abstracted tothe family-level, using traditional NodeCodes would losethis information since there is no way to indicate thedifference between the two graphs. In case (a), all the

ARTICLE IN PRESS


parent(s) in F0 would be common ancestors (if P0’shusband was known it would also be a commonancestor); in case (b), the specific remarried child P1 inF0 is the single common ancestor. We refer to these casesfor the family graph as [all] parent inbreeding and [single]

child inbreeding, respectively. To ensure that childinbreeding is identified correctly, we continue to use thesibling number of the individual within the family to identifychild families instead of assigning a new numbers to eachchild family. Then, only if a child is involved in more thanone marriage, a remarriage identifier is added to eachNodeCode originating from that child directly after thesibling/progenitor number. This identifier consists of aremarriage marker (‘j’) followed by the marriage number.For individuals who are only married once, adding themarriage number would simply waste space as it wouldnever be used; thus we leave it off in the common case ofa single marriage. Thus, given the two example graphs, theNodeCodes for graph (a) would not change while theNodeCodes for graph (b) would change starting at F1 andF2. For example, in graph (b), if F0’s code is x, then F1

becomes x.0j0 and F2 becomes x.0j1; the common prefixbetween the two is therefore ‘‘x.0j’’ and ending in ‘j’ tellsus that we have a child inbreeding case. With thismodification, our algorithms can recognize this changeand be able to determine the non-overlapping paths basedsolely on the NodeCodes (described in Section 5.6 below).

Another problem exists when attempting to determinethe number of parents within a given family. Consider thefollowing situations:

When there are no incoming edges, only a singleprogenitor NodeCode is assigned to that family node;therefore making each of the three family nodes (a), (b),and (c) of Fig. 6 have the same NodeCodes while eachnode has a different number of parents. Note that case (a)is a virtual node for a remarried progenitor. When there isonly one incoming edge, the particular family node willonly receive NodeCodes from one family; thereforemaking it impossible to tell whether or not one parent(d) or two parents (e) exist. Note that in case (e), P1 is aprogenitor, but will not contribute any NodeCodes, sincethe family is already labeled by the code from P0’s parentfamily. When there are two incoming edges, as in cases (f)and (g), there are two sets of NodeCodes, one of which willbe marked as female by the gender delimiter ‘.’ of the

– , –...

P0*, –...

0 Incoming Edges

P0 , –...

1 Incomi

P0*, P1*...

Fig. 6. Cases for number of incoming ed

mother’s incoming edge and one which will be marked asmale by the gender delimiter ‘,’ of the father’s incomingedge. In cases (f) and (g), the unmodified NodeCodes canbe used to determine the number of parents. However, inorder to allow us to distinguish cases (a)/(b)/(c) and cases(d)/(e), we append a missing parent marker to theNodeCode before the gender delimiter. If it is missing asingle parent (cases (b) and (d)), the ‘$’ marker is used. If itis missing both parents (the virtual parent case (a)), the ‘!’marker is used. In practice, the vast majority of familynodes contain two parents (cases (c), (e), (f), and (g)),since we usually have either data about both parents or nodata about either parent. It is thus much more efficient interms of space to only mark the rare cases when parentsare missing and otherwise imply two parents.

In summary, to solve the above problems, we intro-duced three modifications to family-level NodeCodeencoding versus the previous individual-level encoding.First, we introduced virtual parent nodes for progenitorindividuals who have multiple marriages. Second, weintroduced a remarriage marker that allows the relativefamily position of common ancestors to be identified,which helps to categorize them as either parent inbreed-ing or child inbreeding as well as to distinguish a specificchild who has multiple marriages. Third, we introduce amissing parent marker for 1-parent families and 0-parentfamilies (virtual parent for remarried progenitor), whichhelps to identify common ancestors from parent positionand assists in identifying the non-overlapping pairs ofpaths as well.

After noting these changes, we can present the basicpattern of a family-level NodeCode. Let p be a progenitorfamily number, sn be the nth step in the NodeCode. Thebasic pattern is NC ¼ ps1s2ysk, where there may be 0 ormore steps (zero steps becomes simply p). Let � be theparent edge gender delimiter (‘.’ or ‘,’), i is the individual’ssibling number in the family, r is the remarriage marker(either empty or ‘‘jq’’, where q is the marriage number)and m is the missing parent code (empty, ‘$’ or ‘!’). Eachstep is thus s ¼ �irm.

For a family-level pedigree graph G, we generate theNodeCodes and set NC(u) for each node u in the followingprocedure. Fig. 7 is an example of a family-level encodingof a pedigree. As an optimization, we do not assignNodeCodes to isolated families (with no parent or child

P0 , P1*...

ng Edge ...P0

...P1

2 Incoming Edges

P0 , P1...

...P0 , P1

P0 , P1...

* Progenitor

ges and actual number of parents.

ARTICLE IN PRESS

F0


families), since their pedigree structure is entirely en-coded within their family node.

– , –

P1

0!
Algorithm Family-level Graph Encoding
Input: Family-level Pedigree Graph G

Output: A set NC(u) for each node u in G

1. A
dd a virtual root S.
2. A

– , P1P , P

F1 F2

dd an edge between S and each progenitor family with child

families. Assume there are K progenitor families. For each

progenitor family node p, set NC(p) ¼ i, where i ¼ 0,y, K�1

according to a depth-first traversal sibling order.
0 10!,0|0 3. I
P3P2

F3

0!,0|1$

f there is an outgoing edge e from a family node p to a family node

u, then a family node p must have a child i, who is a parent in node

u. If the individual i is a mother in node u, then the edge e is called

the maternal link. Otherwise, it is called the paternal link. d(e) ¼ {‘‘.’’ ,

‘‘,’’}. Specifically, ‘‘.’’ denotes the maternal link and ‘‘,’’ denotes the

paternal link.

4. I

P2 , P3

P4

f the individual i has remarriages, t(u) ¼ c(i)+‘‘j’’+r(u), where c(i)

denotes the child order of the individual i in family node p, and r(u)

is a remarriage number regarding individual i for node u. Otherwise,

t(u) ¼ c(i).

5. I
f u is 2-parent node, then m(u) ¼ ‘‘‘‘. If u is 1-parent node, then
m(u) ¼ ‘‘$’’. Otherwise, m(u) ¼ ‘‘!’’.
F5F4
0!,0|0.0
6. I
P4 , – P4 , P10!,0|1$,0

f u is neither S nor a progenitor family node, for either of u’s parent

family node p, for any code nc(p) in NC(p), add [nc(p)d(e)t(u)m(u)] to

NC(u).

P5 , P6

P7

P5 P6

F6

0!,0|0.0.0|0$.0

0!,0|1$,0.0|0$.0

0!,0|2,0

0!,0|0.0.0|1,0

0,!0|1$,0.0|1,0

0!,0|2

0!,0|0.0.0|1

0!,0|1$,0.0|1

0!,0|0.0.0|0$

0!,0|1$,0.0|0$

Fig. 7. A pedigree graph with family node encoding.

As a note, we add a final NodeCode terminator symbol‘#’ to the end of each family NodeCode for correctevaluation. However, for clarity of presentation we haveignored it in our discussion and assume it is added to theend of each code before it is stored in the database.

Finally, to update our relational database representa-tion to store family-level pedigree graphs, we now use thefollowing schema:

Pedigree(PedigreeID, Name)Individual(IndividualID, PedigreeID, MotherID, FatherID,

Gender, Name, DOB, DOD, etc.)FamilyNodes(PedigreeID, FamilyID, IndividualMotherID,

IndividualFatherID)FamilyNodeChildren(PedigreeID, FamilyID, ChildIndivi-

dualID)FamilyNodeCodes(PedigreeID, NodeCode, FamilyID)

In comparison with the schema in Section 3.2, theNodeCodes table has been removed and replaced withtables describing the family nodes, the children in thefamily nodes and the family NodeCodes. With the properSQL query modifications, queries such as finding alldescendants, etc. can be adapted to execute on this newschema, although the details of this are outside the scopeof this paper.

5.5. Scalability of family-level NodeCodes

In this section we compare the utilization of family-level pedigree graphs and pedigree graphs with indivi-duals in terms of number of NodeCodes required in bothapproaches. In order to determine the number ofNodeCodes, we first estimate the number of generationsin a pedigree with N individuals. Let G denote the numberof generations, K be the number of children in a family,

and Np, Nd, and Nm be the numbers of progenitors, lowestdescendants, and middle-level individuals, respectively.

5.5.1. Individual-level NodeCodes

With the same assumptions on the pedigree data as inSection 5.3, the number of individuals at each generationcan be estimated as follows:

Gen 0: Np individuals, each individual has 1 NodeCodesGen 1: Np�(K/2) individuals, each individual have 2NodeCodesGen 2: Np�(K/2)2 individuals, each individual have 4NodeCodesyy yy

Gen G: Np�(K/2)G individuals, each individual have 2G

NodeCodes

N ¼ Np þ Np �K

2þ Np �

K

2

� �2

þ � � � þ Np �K

2

� �G

(4)

Based on formula (4), we can derive the number ofgenerations G in terms of N, Np, and K and express G

ARTICLE IN PRESS


as follows:

G ¼ logK=2ðK=2� 1Þ � N þ Np

Np� 1 (5)

Given the number of individuals at every generationand the number of NodeCodes for each generation’sindividual, we can calculate the number of NodeCodesfor N individuals in terms of individual node encoding:

# of NodeCodes ¼ Np þ Np � K þ Np � K2þ � � � þ Np � KG

¼NpðK

ðGþ1Þ� 1Þ

K � 1(6)

5.5.2. Family-level NodesCodes without remarriages

Similar to the individual-level calculation, we cancalculate the number of families for each generation.Now we first focus on a simple situation, where noremarriages take place.

Gen 0: Np individuals, Np/2 families, each family has 1NodeCodesGen 1: Np�(K/2) individuals, (Np/2)�(K/2) families,each family has 2 NodeCodesGen 2: Np�(K/2)2 individuals, (Np/2)�(K/2)2 families,each family have 4 NodeCodesyy yy

Gen (G�1): Np�(K/2)G�1 individuals, (Np/2)�(K/2)G�1

families, each family have 2G�1 NodeCodesGen G: Np�(K/2)G individuals, who are lowest descen-dants.

Based on the number of families at every generation andthe number of NodeCodes for each generation’s family, wecan calculate the number of NodeCodes for N individualsin terms of Family NodeCodes:

# of Family NodeCodes ¼Np

2þ

Np

2� K þ

Np

2� K2

þ � � � þNp

2� KG�1

¼ðNp=2ÞðKG

� 1Þ

K � 1(7)

5.5.3. Family-level NodeCodes with remarriages

Based on the simple marrying once scenario, we canderive the number of families and the number ofNodeCodes for a complicated scenario, where an indivi-dual can have remarriages, and the number of remarriagesfor one individual is at most TR. Now we calculate thenumber of NodeCodes for a worst case.

Gen 0: Np individuals, Np/2�TR families, each familyhas 1 NodeCodesGen 1: Np�K/2 individuals, (Np/2)�(K/2)�TR families,each family has 2 NodeCodesGen 2: Np�(K/2)2 individuals, (Np/2)�(K/2)2�TR fa-milies, each family have 4 NodeCodesyy yy

Gen (G�1): Np�(K/2)G�1 individuals, (Np/2)�(K/

2)G�1�TR families, each family have 2G�1 NodeCodesGen G: Np�(K/2)G individuals, who are lowest descen-dants.

and the number of NodeCodes for each generation’s
Based on the number of families at every generation
family, we can calculate the number of NodeCodes forN individuals in terms of Family NodeCodes:

# of Family NodeCodes ¼Np

2� TR þ

Np

2� TR � K

þNp

2� TR � K2

þ � � � þNp

2� TR � KG�1

¼Np=2ðKG

� 1Þ

K � 1� TR

(8)

According to this analysis of individual-level encodingand family-level encoding with/without remarriages, wecan compute the ratio of reduction in terms of number ofNodeCodes in the family pedigree graph compared withthe NodeCodes in the individual pedigree graph. Withoutremarriages, the ratio is expressed as follows:

#NodeCodes of family graph

#NodeCodes of individual graph¼

1

2 � K(9)

With TR remarriages, the worst case for family graph,the ratio is computed as follows:

#NodeCodes of family graph

#NodeCodes of individual graph¼

TR

2 � K(10)

This confirms that when TRo2�K, or when the averagenumber of marriages is less than twice the averagenumber of children, utilizing family-level NodeCodes forthe same pedigree data requires less number of Node-Codes and thus also less space. From formula (10), we cantell the number of marriages for an individual TR is animportant factor with regards to the reduction ratio of thenumber of NodeCodes. However, it seems quite reasonablein practice for TR to be much smaller than 2�K on typicalpedigrees. To further evaluate the effect that TR exerted onthe space requirement of family graph, we generated asynthetic remarriage data set and present the results inSection 7.4.

5.6. Calculating inbreeding coefficient using family

NodeCodes

In this section, we discuss the changes required toadapt the algorithm for calculating the inbreedingcoefficient (Section 4) utilizing the family-level Node-Codes (Section 5.4). Most of the changes involve deter-mining and using the relative position of an individual’scommon ancestors within their family nodes and identi-fying the number of parents in their families to ensure thecorrect number of paths are found. The proposedmodifications to NodeCodes enable the algorithm toobtain this information from the individual’s NodeCodeswithout additional database queries.

An important property of family nodes is that all

children in the family have the same inbreeding coefficient

value. Since these children have the same parents by thedefinition of family node, they have the exact samecommon ancestors and paths to these common ancestors.Thus, we can introduce the Family Inbreeding Coefficient

algorithm that operates on family nodes instead ofindividual nodes.

ARTICLE IN PRESS


5.6.1. Algorithm overview

When using the Family Inbreeding Coefficient algorithmto find the inbreeding of an individual i, we must first findthe family where i is a child. Let this family be F ¼ FC(i).Running the algorithm on F will return the inbreedingcoefficient shared by all children in the family, including i.Thus, InbreedingCoefficient(i) and FamilyInbeedingCoeffi-cient(FC(i)) are equivalent. Note that for half-siblings, whomay have different inbreeding coefficients, will be indifferent family nodes that happen to have a parent incommon. The high-level outline for calculating theinbreeding coefficient of family F based on family-levelNodeCodes is discussed in this section along with the keysections; the full algorithm pseudo-code is given inAppendix 1.

Algorithm: Family Inbreeding Coefficient

Input: Family NodeCodes NC(F)

Output: Inbreeding coefficient value common to all children in family F

1.
Partition which NodeCodes in NC(F) are from the mother individual m
and father individual f

2.
Identify the common ancestors of m and f, C, by their family nodes FA
their relative family positions

3.
Detect parents in the common ancestor family nodes FA that are not
represented by any NodeCodes

4.
For each common ancestor c in C
a. F
ind the set of pairs of paths from c to m and f
b. I
dentify and eliminate overlapping pairs of paths
c. F
ind the number of generations between c and the parents of F
d. I
f c is a child in FA, then use NC(FA) to recursively calculate the
inbreeding coefficient of c

e. O
therwise, c is a parent in FA and c’s inbreeding is found using the
NodeCodes from FA’s parent family (where c is a child)

f. C
ompute the inbreeding coefficient
Identifying mother & father NodeCodes: The indivi-dual’s NodeCodes can be partitioned into mother andfather codes by the last gender delimiter in the code,which represents the mother link or the father link.

Example 5.1. Using the pedigree in Fig. 7, the mothercodes of family F6 (corresponding to individual P5) are‘‘0!,0j0.0.0j0$.0’’ and ‘‘ 0!,0j1$,0.0j0$.0’’ and the fathercodes (corresponding to individual P6) are ‘‘0!,0j0.0.0j1,0’’,‘‘0,!0j1$,0.0j1,0’’, and ‘‘0!,0j2,0’’.

Finding least common ancestors with family &

position: Similar to Sections 4.2.1. and 4.2.2, we can find

the NodeCodes that represent all the common ancestor

families by finding all of the longest common prefix for

matching sets of the set of mother family codes NC(m) and

set of father family codes NC(f). This can be done by

identifying all prefixes that are shared between at least

one mother and one father code and then eliminating any

prefix x that is shorter but maps to the same set of mother

and father codes as a longer prefix y where x is itself a

prefix of y.

More formally, let SM and SF be (sub)sets of NC(m) and

NC(f) respectively. Let x ¼ LCP(SM, SF) denote that x is the

longest common prefix for matching SM with SF.

If x ends in ‘‘j’’ (the remarriage marker), it is in the form

x ¼ y+c(k)+‘‘j’’, and thus y is a NodeCode of the family

YAFA and c(k) is the child number in family Y of common

ancestor C.

Otherwise, x ¼ y, where y is a NodeCode of family YAFA

and all parents in family Y are common ancestors.

Furthermore, if y ends with ‘‘$’’, then we know Y is a 1-

parent family; if y ends with ‘‘!’’ it is a virtual family node;

otherwise Y is a 2-parent family.

Thus, we know which individual(s) are the common

ancestor(s) in the family and their relative position in the

family node (which uniquely identifies them without

knowing the individual’s real database ID). The rest of the

algorithm therefore uniquely identifies common ancestors

by the pair /family ID, positionS.

Example 5.2. The longest common prefixes for matchingsets of F6 in Fig. 7 are ‘‘0!,0j’’ (which matches all five codesand corresponds to a child inbreeding case in the 1st childof family F0, which is P1) and ‘‘0!,0j0.0.0j’’ and‘‘0!,0j1$,0.0j’’ (which both match two codes each andcorrespond to a child inbreeding in the 1st child of familyF3, which is P4). As before, an ID lookup by NodeCodes isrequired to realize that the family NodeCodes of these twocodes (‘‘0!,0|0.0’’ and ‘‘0!,0j1$,0’’) both refer to the samefamily, F3. This gives us two common ancestors: the 1stchild in F0 and the 1st child in F3. From the remarriagemarkers (or lack thereof), we know that F0 is a virtualfamily node and F3 should have 2 parents.

Detecting parents without NodeCodes: When theinbreeding is found to be in the all parents, we need tocorrectly know how many parents are in the family nodeto ensure all paths are identified. With a missing parentmarker, we can accurately tell if the family node is a 0-, 1-,or 2-parent node. In the process of finding the unique setof common ancestors (the previous step), we actuallyidentified all of the common ancestor family’s NodeCodes.By examining the last gender delimiter of these ancestorcodes, we can identify how many parents are representedin the codes. The difference between the actual number ofparents and the number of parents found in the codes willreveal if we have parents who are progenitors notrepresented by NodeCodes, as in Fig. 6, cases (c) and (e).In both of these cases, the missing parent is a progenitorwhose code was effectively redundant for recordingancestor/descendant information. Since these parentswith missing codes are progenitors (thus with unknownparents), they are considered to have an inbreedingcoefficient of 0.0. If we already know the paths and pathlengths to the other common ancestor parent in the family(which is represented by a code), then we can use thosepath lengths and the inbreeding coefficient of zeroappropriately adjust the final inbreeding score to takeinto account the unrepresented parent.

Identifying overlapping pairs of paths: The funda-mental principle of this step is identical to the methoddescribed in Section 4.2.3 for finding crossover individualsthe individual-level NodeCodes; however, we now identi-fy crossover individuals by both family ID and familyposition. The only subtle change is that if both commonancestors are in the same family, there is only a crossover

ARTICLE IN PRESS


if one of them is a parent in the family and the other one isa child in the family. If the common ancestors are the twoparents or two different children, then there is nocrossover between them.

Example 5.3. In Fig. 7, F3 is a descendant family of F0 andso the overlap is detected in essentially the same way as inExample 4.3.

Finding path lengths: The length of each path from aparent to a common ancestor can be calculated based onthe mother/father code and the NodeCode prefix corre-sponding to the common ancestor. Path length iscalculated as follows:

Algorithm: Path Length

Input: parent NodeCode p, ancestor prefix a

Output: path length between the parent and the common ancestor

1. I
f a ends with ‘‘j’’ (child inbreeding) then
a.
Remove up to and including the last gender delimiter from a
b.
Return
(CountGenderDelimiters(p)�CountGenderDelimiters(a))�1

2. O
therwise (parent inbreeding), return
CountGenderDelimiters(p)�CountGenderDelimiters(a)

Example 5.4. In the Fig. 7 example, find the path lengthbetween the mother P5 and common ancestor P1 (withcode ‘‘0!0j’’) using the path represented by NodeCode‘‘0!,0j0.0.0j0$.0’’. In this case, p ¼ ‘‘0!,0j0.0.0j0$.0’’ anda ¼ ‘‘0!0j’’, so we have a case of child inbreeding andPathLength ¼ (CountGenderDelimiters(‘‘0!,0j0.0.0j0$.0’’)-�CountGenderDelimiters(‘‘0!’’))�1 ¼ (4�0)�1 ¼ 3.

Calculating common ancestor inbreeding: In order tocalculate Wright’s formula [6], we need to recursivelycalculate the inbreeding of common ancestors. On thefamily graph, there are now two cases: the commonancestor is a child in the family (child inbreeding) or aparent in the family (parent inbreeding). The familyinbreeding algorithm calculates the inbreeding coefficientfor the children in the family. For common ancestor c infamily Y, if we have child inbreeding (c is a child in Y), thenwe can simply call the Family Inbreeding Coefficientalgorithm recursively on family Y. However, in the case ofparent inbreeding, we have to use the family where theparent is a child as the input to the algorithm and we so faronly found the family where they are a parent. Thus, wemust recursively call the algorithm using the parentfamily of c (following the father link upwards if c is thefather or the mother link upwards if c is the mother). TheID of this family can be retrieved by taking one ofthe codes of that parent (from the codes of Y that wehave already found) and removing the last step (up to andincluding the last gender code).

Note that there are a couple of optimizations possiblehere that can avoid database lookups. When the recursivecall is made, a naıve implementation of the algorithmwould use the family ID as the algorithm’s parameter andthen first load the NodeCodes of that family from thedatabase. The family ID of the common ancestor would befound and passed as the parameter in the recursive call.However, we have actually already calculated the Node-Codes for that common ancestor while trying to determine

the unique common ancestors and so if the algorithm ischanged to use the set of NodeCodes as a parameter, thenthis extra database lookup can be avoided. We call thisoptimization common ancestor NodeCode calculation from

descendant NodeCodes (or simply the NCCalc optimizationfor short).

The second optimization is to cache the resultsfrom previous inbreeding calculations at the family-level.Later calls to calculate the inbreeding of another indivi-dual in the same pedigree can benefit from previousresults being stored. Furthermore, later calls to calculateother children in the same family can be answered fromthe cache with only the family ID lookup required; in ourreal data set, 70% of 1721 families had more than one childand would thus greatly benefit from this caching. Whencompared to the other alternative methods, only O(jFj)memory (where F is the set of all families) is used forstoring these results, versus O(jNj) for the individual-levelNodeCodes-based method and O(jN2

j) for the iterativemethod.

6. Synthetic data generation

In order to test the scalability of our approach forquerying pedigrees, we implemented a population simu-lator to allow us to generate arbitrarily large pedigrees.Our overall algorithm is based on the algorithm forgenerating populations with overlapping generations inChapter 4 of [5] along with the parameters given inappendix B of [18] to model the relatively isolated FinnishKainuu subpopulation and its growth during the years1500–2000. The parameters include: starting/ending year,initial population size, initial age distribution, marriageprobability, maximum age at pregnancy, expected numberof children by time period, immigration rate, and prob-ability of death by time period and age group. For oursimulation, we added parameters for the maximumnumber of individuals to generate, the average marriagerate to allow remarriages, and a set of probabilities forgenerating individuals with cancer/polyp data. This sec-tion contains an overview of the generation algorithm,which varies slightly from the one in [5].

First, an initial population of progenitors is createdaccording to the initial age distribution. When anindividual is created, their gender is not initially specified,but their year of death and if/when/how many times theywill marry is randomly determined along with whether ornot they will get cancer and when. An annual update cyclecalculates the deaths, marriages, births, and immigrantsonce for each year of the simulation until the ending yearor the maximum number of individuals is reached. Afterindividuals who died during the year are removed fromthe live population, the number of new immigrants iscalculated as a percent of the previous population size andthese immigrants are generated close to their marriageage (for the purpose of generating pedigrees, individualswho never marry are not interesting). Then, all individualspast their marriage age are randomly assigned partnerswith one individual then assigned to be male and theother to be female. After this, any individuals who have

ARTICLE IN PRESS


not reached their target number of marriages will bedivided into male/female pools and randomly assigned toadditional marriages (couples). Since the number ofchildren per individual should vary, but the overallexpected number of children should be average, thenumber of births per year is calculated by summing overthe probabilities that a child will be born this year for allliving females above their marriage age (based onexpected number of children and the number of yearsbetween their marriage and their death). This sum resultsin the number of children expected to be born this yearand these children are randomly distributed amongavailable couples.

For our experiments, we added a post-processing stepthat removes entirely disconnected individuals (i.e. noparents or children) and then randomly assign the gendersof the remaining individuals.

7. Experiments

In this section we show the effectiveness of ourframeworks for inbreeding query evaluation by makingcomparisons among the performance of family-levelNodeCodes, individual-level NodeCodes, and an iterativemethod used in existing systems [7–9]. First, we willevaluate the space savings of Family NodeCodes versusindividual-level NodeCodes. We examine the performanceof inbreeding queries using data from the ClevelandClinic’s Familial Polyposis Registry [4], the very largesynthetic pedigrees, and synthetic pedigrees with varyingmarriage rates. Finally, we show the space savings ofFamily NodeCodes versus individual-level NodeCodes.

7.1. Experimental data

We used two synthetic data sets, one to test scalability

in larger data sets and one to show the effect ofremarriages. For the scalability data set, twelve pedigreeswere generated by running the simulator from the year1500 with an initial population of 30, an immigration rateof 1%, exactly one marriage per individual, and the samerandom seed, while the maximum number of individualsto generate was raised from 100 to 204 800 by doublingthe size each time. Thus, each larger pedigree is a supersetof the small ones. The final generated pedigrees rangedfrom 77 individuals spanning 3 generations with anaverage of 1.24 NodeCodes per individual for the smallestto 195 197 individuals spanning 19 generations with anaverage of 26.42 NodeCodes per individual for the largest.The number and size of the NodeCodes for these pedigreesis shown in Fig. 18. For the remarriage data set, elevenpedigrees where generated with remarriage rates between1.0 and 2.0 increasing by 0.1 each time. All of thesepedigrees where generated with the maximum numberof individuals set at 1600 with the other settings thesame as for the scalability data set; the number andsize of the NodeCodes for these pedigrees is shown inFig. 19.

The Cleveland Clinic’s (CCF) Familial Polyposis Registry[4], which we used as our real data set, is the largest

inherited colorectal cancer registry in the United State andthe second largest in the world. At present, the PolyposisRegistry database contains pedigrees of 750 families and11 350 patient histories recorded in the past 25 years atCCF. It captures complex pedigree and clinical data such asdemographic characteristics, pedigree relations, and thedistribution of polyps, cancer sites, surgical procedures,and medical treatments. We performed our experimentson the available good pedigrees in this dataset, whichconsisted of 654 pedigrees containing 8345 individuals.The largest pedigree consisted of 118 individuals spanning8 generations with an average of 1.86 NodeCodes perindividual.

7.2. Experimental setup

We tested the effectiveness of our method using C#2005 and SQLServer 2005. We implemented the Node-Code labeling algorithm and used strings to storeNodeCodes with the sibling numbers encoded in a base-64 representation. All queries were run on cold cache andthe test machine was a 2.0 GHz Pentium 4 with 1 GB ramrunning Windows XP.

We compared the execution time required to calculatethe coefficient of inbreeding by the iterative method fordescribed in [7] (source code available [25]) and theindividual-level NodeCode algorithm described in Section4 and the family-level algorithm described in Section 5.6.While memoization is used by all three implementationsto avoid calculating inbreeding coefficients for the sameindividual more than once, the standard iterative methodcalculates inbreeding through kinship [7,25] and thusrequires an O(n2) table in memory for n individuals, whileNodeCodes at the individual-level only really benefitsfrom memoization on the actual inbreeding scores (i.e.O(n)) and at the family-level the memory requirement isO(f), where f is the number of family nodes. For largepedigrees, the O(n2) space required for the kinshipmemoization can become prohibitively large, so weexamined both the ad-hoc query case where eachinbreeding query is executed with no memoization datayet available at query start time and the cached/bulkcalculation case where the inbreeding coefficients arecalculated for an entire pedigree and full memoization isused. In the cached/bulk case, individuals who areprocessed later can benefit from previously storedinbreeding/kinship coefficients. Hash tables were used inplace of arrays/matrices for memoization data to reducememory usage (especially for the iterative method).

We analyzed the effects of pedigree size (# indivi-duals), the inbreeding coefficient value, the depth ofindividuals in the pedigree (the longest path between theindividual and a progenitor), and the remarriage rate.

7.3. Experimental results: inbreeding query processing

In the first experiment, 100 random individuals wereselected from each of our 12 synthetic pedigrees (exceptfor the smallest one which only contained 77 individuals).For each individual, the query was run on cold cache

ARTICLE IN PRESS


starting with no initial memoization data to show how thecost of ad-hoc querying increases with pedigree size forthe iterative algorithm, individual-level NodeCodes andFamily NodeCodes. We will refer to the previous methodas iterative and we will abbreviate individual-levelNodeCodes as NC, Family NodeCodes as FamNC, andFamily NodeCodes with ancestor NodeCode calculation(versus loading them from the database; see Section 5.6)as FamNCCalc.

Fig. 8 shows the average time per query for eachpedigree (as the pedigree size doubles). As can be seen,the average time per query grew increasingly larger forthe iterative method compared to either NC or FamNC asthe pedigree size increased, from a comparable amount oftime on the small pedigrees (p3000 individuals) to twiceas much time per query than NC on the largest. AmongNodeCodes-based methods, FamNC performed up to 1.6times better than NC on large pedigrees (X50 000individuals) and 3.3 times faster than iterative. FamNC-Calc provided only a modest improvement over FamNC onthe largest pedigree (3% faster). Since FamNCCalc wastypically only around 2–3% faster for our experiments onlarger pedigrees (a comparatively small difference com-pared to NC and iterative), we have chosen to simply use

00

50100150200250300350400450

Aver

age

Tim

e pe

r Que

ry (m

s)

Individuals in Pedigree

Iterative NC FamNC FamNCCalc

50000 100000 150000 200000

Fig. 8. Effect of pedigree size on average query time in synthetic

pedigrees (100 random individ. each).

0.001

10

100

1000

10000

100000

Num

ber o

f Ind

ivid

uals

Inbreeding Coefficient Value

Largest Synthetic Pedigree (195k individuals)

0.05 0.10 0.15 0.20 0.25

Fig. 9. Inbreeding coeffic

FamNC to represent both methods on most of our latergraphs.

For our next experiment, we generated 1000 randomindividuals from the largest synthetic pedigree andinvestigated the effect of different parameters on querytime, including inbreeding coefficient value and the depthof the individual in the pedigree.

For the inbreeding coefficient value, we expected thatindividuals with larger inbreeding coefficients would havemore NodeCodes, causing NC and FamNC to suffer slightly.In Fig. 10, which shows the average query time for eachdistinct inbreeding value, we see that for most values ofthe inbreeding coefficient, NC and FamNC outperformedthe iterative method, especially FamNC, from a modestimprovement in the lowest values (0–1�10�8) to adramatic improvement for many of the mid-range values(1�10�8 to 1�10�5). As expected, for a few of the highestinbreeding coefficient values, we see slightly worseperformance by NC. However, FamNC still performedbetter than both iterative and NC for even those highestinbreeding coefficient values. The improvement resultsfrom reductions in the number of Family NodeCodes,compared with individual-level NodeCodes. Also note thatthere were few individuals with high inbreeding coeffi-cients (see Fig. 9) and so the results for the lower (morecommon) values are more significant.

The distribution of inbreeding coefficient valuesshowed that in the largest synthetic pedigree, 94.2% ofindividuals had no inbreeding, and in the real data, thesezero inbreeding scores where 98.7% of the individuals. Thehistograms for the frequencies of inbreeding values in thelargest synthetic pedigree and all the real data (aggre-gated) are shown in Fig. 9. Note that the value distribu-tions are similar for the real and synthetic data (thehorizontal axis is a log scale for both).

Now, if we examine the overall effect of each inbreed-ing coefficient value on the total time (as a sum instead ofan average per value), we clearly see in Fig. 11 that the vastmajority of the computing time is spent computinginbreeding scores for individuals with no inbreeding, whichiterative method spends about three times more timeprocessing as NC (2.8 times faster) and FamNC (3.3 timesfaster) need only one or two disk seeks (to get the

0.001

10

100

1000

10000All Real Pedigrees (8k individuals)

Num

ber o

f Ind

ivid

uals

Inbreeding Coefficient Value0.05 0.10 0.15 0.20 0.25 0.30

ient distributions.

ARTICLE IN PRESS

1E-110

500

1000

1500

2000

2500

Aver

age

Tim

e P

er Q

uery

(ms)


Iterative NC FamNC

1E-10 1E-9 1E-8 1E-7 1E-6 1E-5 1E-4 1E-3 0.01

Fig. 10. Effect of inbreeding coefficient on average query time in largest synthetic pedigree (1000 random individ.).

1E-110

500

1000

1500

2000

100000150000200000250000300000

Tota

l Que

ry T

ime

(ms)


Iterative NC FamNC

1E-10 1E-9 1E-8 1E-7 1E-6 1E-5 1E-4 1E-3 0.01

Fig. 11. Effect of inbreeding coefficient on total query time in largest synthetic pedigree (1000 random individ.)


NodeCodes of the individual) to determine if the indivi-dual’s parents have any common ancestors.

Next, we examined the effect of the depth of theindividual in the pedigree (number of steps in the longestNodeCode) on the query time and this time enabled theuse of caching (memoization) to enable subsequentcalculations to take advantage of previous inbreedingresults. Fig. 12 shows how the average time per querygrows as the individual’s depth increases. We can see thatFamNC scales best among the three methods, 8.9 timesfaster than including iterative and 1.3 times faster thanNC. The reason for this is that NC and FamNC both skipintermediate generations and can jump straight to thecommon ancestors and FamNC needs to process less codesthan NC.

When the total time per depth level (Fig. 13) isexamined instead of the expected time per individual, itcan be seen that generations 12–15 (again, this is out of 16generations) of the pedigree are the most expensive forthis synthetic data set, and it is at this point that theiterative method is as much as 3.7 times slower than NCand 7.6 times slower than FamNC (generation 14).

Our next experiment examines the how well themethods scale in the bulk calculation case, where wecalculate the inbreeding coefficient for every individual inthe pedigree and previously calculated values are mem-oized for use by later inbreeding calculations so work isnot repeated. This should cut down on a reasonableamount of redundant work, especially for the iterativemethod. We repeated this experiment for each pedigree in

ARTICLE IN PRESS

0

200

400

600

800

1000

1200

1400

1600

Aver

age

Tim

e P

er Q

uery

(ms)

Depth of Individual in Pedigree

Iterative - Cache NC - Cache FamNC - Cache

2 4 6 8 10 12 14 16

Fig. 12. Effect of depth on average query time in largest synthetic pedigree (1000 random individ.).

00

20000

40000

60000

80000

100000

Tota

l Que

ry T

ime

(ms)

Depth of Individual in Pedigree

Iterative - Cache NC - Cache FamNC - Cache

2 4 6 8 10 12 14 16

Fig. 13. Effect of depth on total query time in largest synthetic pedigree (1000 random individ.).


the scalability data set with the cache cleared before eachrun. According to Fig. 14, while NC is 10.1 times faster thaniterative, it is clearly shown that FamNCCalc scales bestamong three methods, 1.1 times faster than FamNC, 7.7times faster than NC, and 77.1 times faster than iterative.

Next, we compared the results from our bulk experi-ment on synthetic data with the results from all the realpedigrees, which are all relatively small in comparison;the results are shown in Fig. 15. For this data set, we cansee that the iterative method performs best for very smallpedigrees (o35 individuals) and that the NodeCodes-based methods have a small per-pedigree overhead ofroughly 80 ms. As we have shown from the largersynthetic data, once the pedigrees get a bit larger this

small cost is soon overcome by the cost incurred by theshear number of queries required for the iterative method.When we investigated those large pedigrees (115–125individuals), we found that in our data, there is only twopedigrees that have more than 115 individuals and nopedigrees in the 76–114 range. This is why the averagequery time goes up rapidly for in pedigrees size of115–125. Once the pedigrees start to reach this size range,the NodeCodes-based methods begin to do better thaniterative. However, we can also see that the differencebetween FamNC and FamNCCalc makes more of adifference at these smaller scales as FamNCCalc beginsto outperform iterative for pedigrees as small as 46–55individuals (Fig. 16).

ARTICLE IN PRESS

00

500

1000

1500

2000

2500

3000

3500

40002500030000

Tim

e (s

)

# Individuals in Pedigree


28914

50000 100000 150000 200000

Fig. 14. Effect of size on bulk pedigree inbreeding calculation time in synthetic pedigrees.

1-50

200

400

600

800

1000

# Individuals 594 1841 2604 919 1265 436 174 279 233# Pedigrees 332 198 61 25 25 6 3 2 2

Aver

age

Tim

e P

er P

edig

ree

(ms)

Pedigree Size


6-15 16-25 26-35 36-45 46-55 56-65 66-75 115-125

Fig. 15. Effect of pedigree size on average bulk inbreeding calculation time in real pedigrees.


Finally, we also evaluated the effect of multiplemarriages on bulk inbreeding calculation time in oursynthetic remarriage data set, which varies the averagenumber of marriages per individual from 1.0 to 2.0.As can be seen, FamNC is 2.2–2.5 times faster than NCand 3.3–4.2 times faster than iterative for these relativesmall pedigrees (approximately 1600 individuals). Notethat FamNC is relatively consistent, even through thehigher marriage rates will approximately double thenumber of Family NodeCodes (see Section 7.4). This isbecause the NodeCodes-based inbreeding algorithmsare not doing table scans or iteratively walking much of

the pedigree. Instead, they just load the NodeCodes at thebeginning and do ID lookups for each unique commonancestor prefix. Since the number of individuals is fixed andthe depth is similar, then the query time is almost same,even though the average number of marriage rises from 1.0or 2.0.

7.4. Experimental results: scalability of family-level

NodeCodes

In this section we present experimental results com-paring family-level NodeCodes with those based on

ARTICLE IN PRESS

1.00

2000

4000

6000

8000

10000

12000

14000

16000

Tim

e (m

s)

Average Number of Marriages

Iterative NC FamNC

1.2 1.4 1.6 1.8 2.0

Fig. 16. Effect of multiple marriages on bulk inbreeding calculation time in synthetic remarriage pedigree.

Family - No Opt, No Virtual

Family - Opt,No Virtual Family - Opt, Virtual

Nodes Edges NodeCodes Nodes Edges NodeCodes Nodes Edges NodeCodes

0.444 0.148 0.252 0.206 0.141 0.110 0.209 0.155 0.116

Fig. 17. The ratio of family-level graph/individual-level graph on real data.

00

1000000

2000000

3000000

4000000

5000000

6000000

0 MB

20 MB

40 MB

60 MB

80 MB

100 MB

120 MBN

odeC

odes

Spa

ce

# N

odeC

odes

Individuals in Pedigree

# Individual NodeCodes

# Family NodeCodes Individual NodeCode Space Family NodeCode Space

50000 100000 150000 200000

Fig. 18. Effect of pedigree size on NodeCodes in synthetic pedigrees (no remarriages).


individuals in terms of the number and size of theNodeCodes. The real and synthetic pedigrees describedin Section 7.1 are used. Experimental results confirm that

Family NodeCodes scale significantly better than theoriginal NodeCodes described in [1] by requiring signifi-cantly fewer codes and space.

ARTICLE IN PRESS

1.00

1000

2000

3000

4000

5000

6000

0 KB

5 KB

10 KB

15 KB

20 KB

25 KB

30 KB

35 KB

40 KB

45 KB

Nod

eCod

es S

pace

# N

odeC

odes

Average # of Marriages

# Individual NodeCodes # Family NodeCodes

Individual NodeCode Space Family NodeCode Space

1.2 1.4 1.6 1.8 2.0

Fig. 19. Effect of multiple marriages on NodeCodes in synthetic remarriage pedigrees (about 1600 individ. each).


Results for the real pedigrees are shown in Fig. 17, whichpresents the ratios of the number of nodes, the number ofedges, and the number of NodeCodes using the family-levelnode encoding as compared with the original individual-level node encoding. Without the optimizations for dis-connected individuals and families or virtual nodes we seethe family graph as having a 74.8% reduction in the numberof NodeCodes as the individual graph. However, onceoptimization is introduced, it increase up to a 89.0%reduction, which only drops slightly (to 88.4%) when virtualparent families are added to allow for correct evaluation ofinbreeding queries on remarried progenitors.

The results for the scalability data set are presented inFig. 18 and show that the number of NodeCodes requiredis only 3.7–9.0% of the individual graph corresponding toonly 3.1–9.0% of the size. We can thus conclude that thespace requirements of the family-level graph scale muchbetter for very large pedigrees.

Finally, we evaluated the effect of multiple marriageson this improvement rate by using the remarriage dataset. The space results for this data set are show in Fig. 19.As expected, the number and size of family-level Node-Codes increase with the average number of marriages, andthe total length of NodeCodes increases faster than thenumber of NodeCodes since we introduced the remarriagemarkers. However, even in the extreme case when mostindividuals are married twice, there is still a significantimprovement in space with 67.9% reduction in thenumber of codes and 51.3% reduction in the total size.We can thus conclude that the space savings of FamilyNodeCodes are significant, even with in the presence ofremarriages.

8. Conclusions

We have presented a Query Evaluation methodusing NodeCodes for inbreeding coefficient queries and

then further improved upon in by introducing FamilyNodeCodes, which are both faster and use significantlyless space for large pedigrees. We also implementedand tested these query evaluation systems using bothreal data and synthetic data of various sizes andremarriage rates to test scalability. Experimental resultsdemonstrate that NodeCodes, especially Family Node-Codes, provide a good alternative for query evaluationon pedigree data, especially for pedigrees of sizes be-tween 100 and 200 000, and for queries that requiretraversing a large portion of the pedigree like thecalculation of the coefficient of inbreeding. For the sizesof pedigrees we examined, typical improvements werebetween 3.3 and 10.1 times faster for NodeCodes andup to 77.1 times faster for Family NodeCodes, with anincreasing benefit as pedigree size increased. In addition,Family NodeCodes can reduce space requirements by91–94% versus NodeCodes on the individual-based pedi-gree graph. Furthermore, only O(n) memory is requiredby NodeCodes for memoization versus O(n2) requiredfor the existing method; Family NodeCodes improveson this by requiring only O(f) memory, where f is thenumber of families. Our future work includes investigat-ing the use of NodeCodes for other types of complexpedigree queries and developing a full-scale systemfor storing, querying and visualizing pedigree data,which would also allow the visual construction of PQLqueries [1].

Acknowledgements

We would like to thank Elena Manilich,Dr. James Church, and the Cleveland Clinic’s FamilialPolyposis Registry [4] for kindly allowing us to usetheir data for this study and S. Fatih Akgul who madesignificant contributions to preliminary versions of thiswork.

ARTICLE IN PRESS


Appendix 1. Family NodeCodes Inbreeding Algorithm Pseudo-code

1.
Inbreeding Coefficient (Individual)
2.
Family’Load the family where the individual is a child from the DB
3.
if(individual is missing a parent orfamily is missing a parent family)
4.
return0.0
5.
if(inbreeding for Family.ID is cached)
6.
returncached value
7.
NodeCodes’Load family’s NodeCodes from the database
8.
returnFamilyInbreedingCoefficient(Family.ID, NodeCodes)
9.
FamilyInbreedingCoefficient(FamilyID, NodeCodes)
10.
if(inbreeding for FamilyID is cached)
11.
returncached value
12.
if(NodeCodes is empty)
13.
return0.0
14.
. 15. // Find NodeCode prefixes representing common ancestor’s NodeCodes
16.
MotherCodes’NodeCodes where last gender delimiter is ‘.’
17.
FatherCodes’NodeCodes where last gender delimiter is ‘,’
18.
if(MotherCodes is empty orFatherCodes is empty)
19.
return0.0 // No parent family
20.
// Note: LCP’ matches prefixes at the following delimiters: . , |
21.
CommonAncestorNC’Find all LCP’(MotherCodes, FatherCodes)
22.
if(CommonAncestorNC is empty)
23.
return0.0 // No common ancestors
24.
. 25. // Identify unique common ancestors and their relative family position
26.
CommonAncestors’an Empty Set
27.
foreach(AncestorNC in CommonAncestorNC)
28.
AncFamilyID’Lookup FamilyID from AncestorNC in the database
29.
if(AncestorNC ends with ‘|’)
30.
Position’Last child number
31.
elseif(AncestorNC contains no gender delimiter)
32.
Position’‘progenitor parent’
33.
elseif(AncestorNC’s last gender delimiter is ‘.’)
34.
Position’‘mother’
35.
elseif(AncestorNC’s last gender delimiter is ‘,’)
36.
Position’‘father’
37.
endif
38.
Ancestor’newAncestor
39.
if(CommonAncestors does not contain the key oAncFamilyID, Position4)
40.
Ancestor.FamilyID’AncFamilyID
41.
Ancestor.Position’Position
42.
Ancestor.CodelessParent’false
43.
Add Ancestor to CommonAncestors with key oAncFamilyID, Position4 44. else
45.
Get Ancestor from CommonAncestors with key oAncFamilyID, Position4 46. endif
47.
Add AncestorNC to Ancestor.NCs
48.
Add MotherCodes starting with AncestorNC to Ancestor.MotherPaths
49.
Add FatherCodes starting with AncestorNC to Ancestor.FatherPaths
50.
. 51. // Mark all ancestors as having a codeless parent if there is a parent
52.
// in the family not represented by a NodeCode. If this is the case,
53.
// then one of the parents will be a progenitor married to a
54.
// non-progenitor (which has a code from its parents).
55.
foreach(unique AncFamilyID in CommonAncestors.Keys)
56.
Find the number of codeless parents by using the missing parent marker
57.
foreach(Ancestor in CommonAncestors with AncFamilyID)
58.
foreach(Path in Ancestor.MotherPaths’Ancestor.FatherPaths)

ARTICLE IN PRESS


59.
if(last gender delimiter ¼ ‘.’)
60.
Mark mother as found
61.
if(last gender delimiter ¼ ‘,’)
62.
Mark father as found
63.
if(one less parent found than expected)
64.
foreach(Ancestor in CommonAncestors with AncFamilyID)
65.
if(Ancestor.Position is a parent)
66.
Ancestor.CodelessParent’true
67.
. 68. // Determine which ancestors are on which paths
69.
foreach(Ancestor inCommonAncestors)
70.
foreach(AncestorCode inAncestor.NCs)
71.
foreach(PossibleDescendant inCommonAncestors)
72.
if(PossibleDescendant ¼ Ancestor)
73.
continue
74.
foreach(PossibleDescendantCode in PossibleDescendantNC)
75.
if(AncestorCode is not a prefix of PossibleDescendantCode)
76.
continue
77.
if(AncestorFamilyID ¼ PossibleDescendantFamilyID
and (Ancestor.Pos a ‘parent

orPossibleDescendant.Pos a ‘child’))

78.
continue
79.
foreach(Path in Ancestor.MotherPaths[Ancestor.FatherPaths)
80.
if (PossibleDescendantCode is a prefix of Path)
81.
Add PossibleDescendant to Path Crossovers
82.
. 83. // Calculate Wright’s Equation
84.
sum’0.0
85.
foreach(Ancestor in CommonAncestors)
86.
foreach(oMotherPath, FatherPath4inAncestor.MotherPaths�Ancestor.FatherPaths)
87.
if(MotherPath.Crossovers\FatherPath.Crossovers is not empty)
88.
continue// Overlapping path
89.
if (Ancestor.Pos ¼ ‘progenitor parent’) // No inbreeding
90.
inbreeding’0.0
91.
else if(Ancestor.Pos ¼ ‘parent’)
92.
// Shorten path by 1 to get family ancestor is child
93.
NCs’RemoveLastCode(Ancestor.NCs)
94.
FamilyID’Lookup Family ID from AncestorNC in database
95.
Inbreeding’FamilyInbreedingCoefficient(FamilyID, NCs)
96.
else// Ancestor’s position is ‘child’ so it can be used as is
97.
NCs’Ancestor.NCs
98.
FamilyID’Ancestor.FamilyID
99.
Inbreeding’FamilyInbreedingCoefficient(FamilyID, NCs)
100.
endif
101.
sum’sum+((0.5 ^ (MotherPath.Length+FatherPath.Length+1))
%(1.0+Inbreeding))

102.
if(Ancestor.CodelessParent)
103.
sum’sum+((0.5 ^ (MotherPath.Length+FatherPath.Length+1))
%(1.0+0.0))

104.
Cache sum as result for FamilyID
105.
returnsum
References

[1] B. Elliott, S.F. Akgul, Z.M. Ozsoyoglu, E. Manilich. A framework forquerying pedigree data, in: Proceedings of SSDBM 2006, vol. 18,2006, pp. 71–80.

[2] Pedigree and Population Resource: Utah Population Database./http://www.hci.utah.edu/groups/ppr/S.

[3] /http://www.clevelandclinic.org/registries/S.

[4] /http://colorectal.ccf.orgS.

[5] Vesa Ollikainen, Simulation techniques for disease gene localization

in isolated populations, Ph.D. Thesis, University of Helsinki, Finland,

2002.

[6] Sewall Wright, Coefficients of inbreeding and relationship, The

American Naturalist 56 (645) (1922).

[7] R. Agarwala, L.G. Biesecker, K.A. Hopkins, C.A. Francomano, A.A.

Schaffer, Software for constructing and verifying pedigrees within

http://www.hci.utah.edu/groups/ppr/

http://www.clevelandclinic.org/registries/

http://colorectal.ccf.org

ARTICLE IN PRESS


large genealogies and an application to the old order Amish ofLancaster County, Genome Research 98 (8) (1998) 211–221.

[8] E.A. Wernert, J. Lakshmipathy, PViN: a scalable and flexible systemfor visualizing pedigree databases, in: Proceedings of the 2005 ACMSymposium on Applied Computing, vol. 5, 2005, pp. 115–122.

[9] R.V. Berloo, R.C.B. Hutten, Peditree: pedigree database analysis andvisualization for breeding and science, Journal of Heredity 96(2005) 465–468.

[10] A. Yount, The use of relational database commands in retrieval ofpedigree information, Journal of Medical Systems 11 (2–3) (1987).

[12] Cyrillic Software, /http://www.cyrillicsoftware.com/S.[13] Progeny Software, /http://www.progeny2000.com/S.[14] K. Lange, R. Cantor, S. Horvath, M. Perola, C. Sabatti, J. Sinsheimer, E.

Sobel, Mendel version 4.0: a complete package for the exact geneticanalysis of discrete traits in pedigree and population data sets,American Journal of Human Genetics 69 (suppl.) (2001) A1886.

[15] MENDAL Manual, /http://www.genetics.ucla.edu/software/mendelS.[16] T. Bozkaya, N. Balkir, T. Lee, Efficient evaluation of path algebra

expressions, CWRU Technical Report (1997).[17] L. Sheng, Z.M. Ozsoyoglu, G. Ozsoyoglu, A graph query language and

its query processing, in: Proceedings of 1999 ICDE Conference, 1999.[18] H. Toivonen, P. Onkamo, K. Vasko, V. Ollikainen, P. Sevon, H.

Mannila, M. Herr, J. Kere, Data mining applied to linkagedisequilibrium mapping, American Journal of Human Genetics 67(2000) 133–145.

[19] J.M. Church, A scoring system for the strength of a family history ofcolorectal cancer, Diseases of the Colon & Rectum 48 (5) (2005)889–896.

[20] G.R. Wiggans, P.M. Vanraden, J. Zuurbier, Calculations and use ofinbreeding coefficients for genetic evaluation of united states dairycattle, Journal of Dairy Science 78, 1584–1590.

[21] Encyclopedia of Genealogy: Henry Numbers, /http://www.eogen.com/HenryNumbersS.

[22] Encyclopedia of Genealogy: d’Aboville Numbers, /http://www.eogen.com/dAbovilleNumbersS.

[23] R. Pence, Numbering Systems in Genealog, Based on materialspresented to the National Genealogical Society’s Computer InterestGroup, 1995, /http://www.saintclair.org/numbers/S.

[25] PedHunter, /http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/pedhunter.htmlS.

[26] FSpeed, /http://www.tenset.co.uk/fspeed/S.[27] LaoTzu’s Animal Register, /http://animalregister.net/S.[28] CompuPed, /http://www.mbfs.com/compuped/S.[29] Oklahoma State University’s Department of Animal Science’s

Computer Software, /http://www.ansi.okstate.edu/software/S.[30] Brendan Elliott, S. Fatih Akgul, Z. Meral Ozsoyoglu, Efficient evaluation

of queries on pedigree data, CWRU Technical Report, 2006.[31] B. Elliott, S.F. Akgul, S. Mayes, Z.M. Ozsoyoglu, Efficient evaluation of

inbreeding queries on pedigree data, in: Proceedings of SSDBM 2007.[32] The proofs of all the lemmas and theorems, /http://nashua.cwru.

edu/pedigree/docsS.

http://www.cyrillicsoftware.com/

http://www.progeny2000.com/

http://www.genetics.ucla.edu/software/mendel

http://www.eogen.com/HenryNumbers

http://www.eogen.com/HenryNumbers

http://www.eogen.com/dAbovilleNumbers

http://www.eogen.com/dAbovilleNumbers

http://www.saintclair.org/numbers/

http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/pedhunter.html

http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/pedhunter.html

http://www.tenset.co.uk/fspeed/

http://animalregister.net/

http://www.mbfs.com/compuped/

http://www.ansi.okstate.edu/software/

http://nashua.cwru.edu/pedigree/docs

http://nashua.cwru.edu/pedigree/docs

Efficiently calculating inbreeding on large pedigrees databases

Documents

Transcript of Efficiently calculating inbreeding on large pedigrees databases