ll l l 0.4 inarbitrarypopulationstructures 0
Transcript of ll l l 0.4 inarbitrarypopulationstructures 0
●●
●
●●
●
●●
●
●●
●
●
●●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●●
●●
●
●
●
●●
●●
●●● ●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●● ●
●
●
●
●
●●
●
●
●
●●
●
●●
●
●
●
●●
●
●
● ●
● ● ●
●●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●●
●
●
●
●
●
●
●●
●
●
●●
●●
●● ●
●●
● ●
●
●
●
●
●
●
●
●
●
●
0.2
0.3
0.4
Inbr
eedi
ng c
oeff.
Relatedness and differentiationin arbitrary population structures
Alejandro Ochoa, John D. Storey Lab, Princeton University7 DrAlexOchoa � viiia.org/research/ R [email protected]
1 / 35
My research areas / Contributions1. Stratified False Discovery Rate (FDR)
I Finding: per-stratum local FDR maximizes power controlling overall FDRI Improved power in protein domain predictionI Identified protein classes with problematic statistics
2. Genetic and other factors controlling the placebo responseI Collaboration with Otsuka PharmaceuticalI Mixed-effects modeling of longitudinal response in drug trials for
Schizophrenia, Bipolar Disorder, and Major Depressive DisorderI Genome-wide association study for individual-specific placebo response
3. Kinship and FST for arbitrary population structuresI Motivation: world-wide human population structureI Generalized definitions and modelsI Novel bias calculations validated by simulationsI Novel non-parametric estimator with greatly improved accuracy
2 / 35
My research areas / Contributions1. Stratified False Discovery Rate (FDR)
I Finding: per-stratum local FDR maximizes power controlling overall FDRI Improved power in protein domain predictionI Identified protein classes with problematic statistics
2. Genetic and other factors controlling the placebo responseI Collaboration with Otsuka PharmaceuticalI Mixed-effects modeling of longitudinal response in drug trials for
Schizophrenia, Bipolar Disorder, and Major Depressive DisorderI Genome-wide association study for individual-specific placebo response
3. Kinship and FST for arbitrary population structuresI Motivation: world-wide human population structureI Generalized definitions and modelsI Novel bias calculations validated by simulationsI Novel non-parametric estimator with greatly improved accuracy
2 / 35
My research areas / Contributions1. Stratified False Discovery Rate (FDR)
I Finding: per-stratum local FDR maximizes power controlling overall FDRI Improved power in protein domain predictionI Identified protein classes with problematic statistics
2. Genetic and other factors controlling the placebo responseI Collaboration with Otsuka PharmaceuticalI Mixed-effects modeling of longitudinal response in drug trials for
Schizophrenia, Bipolar Disorder, and Major Depressive DisorderI Genome-wide association study for individual-specific placebo response
3. Kinship and FST for arbitrary population structuresI Motivation: world-wide human population structureI Generalized definitions and modelsI Novel bias calculations validated by simulationsI Novel non-parametric estimator with greatly improved accuracy
2 / 35
Why study relatedness?
Human geneticsis fascinating!
Pop. structureconfoundsassociationstudies (GWAS)
Heritability ofcomplex traits
Animal and plantbreeding
3 / 35
Why study relatedness?
Human geneticsis fascinating!
Pop. structureconfoundsassociationstudies (GWAS)
Heritability ofcomplex traits
Animal and plantbreeding
3 / 35
Why study relatedness?
Human geneticsis fascinating!
Pop. structureconfoundsassociationstudies (GWAS)
Heritability ofcomplex traits
Animal and plantbreeding
3 / 35
Why study relatedness?
Human geneticsis fascinating!
Pop. structureconfoundsassociationstudies (GWAS)
Heritability ofcomplex traits
Animal and plantbreeding
3 / 35
Single Nucleotide Polymorphism (SNP) data
⇒
Genotype xijCC 0CT 1TT 2
⇒
Individuals
Loci
0 2 2 1 1 0 10 2 1 0 12 ...
X
4 / 35
Single Nucleotide Polymorphism (SNP) data
⇒
Genotype xijCC 0CT 1TT 2
⇒
Individuals
Loci
0 2 2 1 1 0 10 2 1 0 12 ...
X
4 / 35
Single Nucleotide Polymorphism (SNP) data
⇒
Genotype xijCC 0CT 1TT 2
⇒
Individuals
Loci
0 2 2 1 1 0 10 2 1 0 12 ...
X4 / 35
Hardy-Weinberg Equillibrium (HWE): Binomial draws
xij = genotype at locus i for individual j .
pTi = frequency of reference allele at locus i , (ancestral) population T .
Under HWE:
Pr(xij = 2|pTi ) =(pTi)2,
Pr(xij = 1|pTi ) = 2pTi(1− pTi
),
Pr(xij = 0|pTi ) =(1− pTi
)2.
HWE not valid under population structure!
5 / 35
Hardy-Weinberg Equillibrium (HWE): Binomial draws
xij = genotype at locus i for individual j .
pTi = frequency of reference allele at locus i , (ancestral) population T .
Under HWE:
Pr(xij = 2|pTi ) =(pTi)2,
Pr(xij = 1|pTi ) = 2pTi(1− pTi
),
Pr(xij = 0|pTi ) =(1− pTi
)2.
HWE not valid under population structure!
5 / 35
Hardy-Weinberg Equillibrium (HWE): Binomial draws
xij = genotype at locus i for individual j .
pTi = frequency of reference allele at locus i , (ancestral) population T .
Under HWE:
Pr(xij = 2|pTi ) =(pTi)2,
Pr(xij = 1|pTi ) = 2pTi(1− pTi
),
Pr(xij = 0|pTi ) =(1− pTi
)2.
HWE not valid under population structure!
5 / 35
Goal: measure dependence structure of genotype matrix columns
IndividualsLo
ci0 2 2 1 1 0 10 2 1 0 12 ...
X
High-dimensional binomial data
Population structure⇒ dependence between individuals (columns)
Linkage disequilibrium⇒ dependence between loci (rows)
6 / 35
Goal: measure dependence structure of genotype matrix columns
IndividualsLo
ci0 2 2 1 1 0 10 2 1 0 12 ...
X
High-dimensional binomial data
Population structure⇒ dependence between individuals (columns)
Linkage disequilibrium⇒ dependence between loci (rows)
6 / 35
Goal: measure dependence structure of genotype matrix columns
IndividualsLo
ci0 2 2 1 1 0 10 2 1 0 12 ...
X
High-dimensional binomial data
Population structure⇒ dependence between individuals (columns)
Linkage disequilibrium⇒ dependence between loci (rows)
6 / 35
Model parametersIBD(T ): “Identical By Descent” for ancestral population T — shared coin flips
f Tj : Inbreeding coefficientPr. that the two alleles at a random locus of individual j are IBD(T )
Var(xij |T ) = 2pTi(1− pTi
)(1 + f Tj )
ϕTjk : Kinship coefficient
Pr. that two alleles, one at random from each of individuals j and k , at onerandom locus are IBD(T )
Cov(xij , xik |T ) = 4pTi(1− pTi
)ϕTjk
FST: Fixation indexPr. that two random alleles in a subpopulation at a random locus are IBD(T )
7 / 35
Model parametersIBD(T ): “Identical By Descent” for ancestral population T — shared coin flips
f Tj : Inbreeding coefficientPr. that the two alleles at a random locus of individual j are IBD(T )
Var(xij |T ) = 2pTi(1− pTi
)(1 + f Tj )
ϕTjk : Kinship coefficient
Pr. that two alleles, one at random from each of individuals j and k , at onerandom locus are IBD(T )
Cov(xij , xik |T ) = 4pTi(1− pTi
)ϕTjk
FST: Fixation indexPr. that two random alleles in a subpopulation at a random locus are IBD(T )
7 / 35
Model parametersIBD(T ): “Identical By Descent” for ancestral population T — shared coin flips
f Tj : Inbreeding coefficientPr. that the two alleles at a random locus of individual j are IBD(T )
Var(xij |T ) = 2pTi(1− pTi
)(1 + f Tj )
ϕTjk : Kinship coefficient
Pr. that two alleles, one at random from each of individuals j and k , at onerandom locus are IBD(T )
Cov(xij , xik |T ) = 4pTi(1− pTi
)ϕTjk
FST: Fixation indexPr. that two random alleles in a subpopulation at a random locus are IBD(T )
7 / 35
Model parametersIBD(T ): “Identical By Descent” for ancestral population T — shared coin flips
f Tj : Inbreeding coefficientPr. that the two alleles at a random locus of individual j are IBD(T )
Var(xij |T ) = 2pTi(1− pTi
)(1 + f Tj )
ϕTjk : Kinship coefficient
Pr. that two alleles, one at random from each of individuals j and k , at onerandom locus are IBD(T )
Cov(xij , xik |T ) = 4pTi(1− pTi
)ϕTjk
FST: Fixation indexPr. that two random alleles in a subpopulation at a random locus are IBD(T )
7 / 35
Existing approaches
1. FST estimationI For independent subpopulations only!I Weir-Cockerham (WC) estimator (1984) — 15K citations!I “Hudson” pairwise estimator (2013) tweaks WCI BayeScan (2008) — 1.2K citations
2. Kinship estimationI “Standard” kinship estimator (1950s)
I Used by most modern GWAS approaches that control for population structure(PCA, LMM, adj. χ2; top paper 6K citations)
I GCTA heritability estimation (2 papers: 4K citations)I Our novel finding: accuracy requires unstructured population
(a minority of closely-related individuals)
8 / 35
Existing approaches
1. FST estimationI For independent subpopulations only!I Weir-Cockerham (WC) estimator (1984) — 15K citations!I “Hudson” pairwise estimator (2013) tweaks WCI BayeScan (2008) — 1.2K citations
2. Kinship estimationI “Standard” kinship estimator (1950s)
I Used by most modern GWAS approaches that control for population structure(PCA, LMM, adj. χ2; top paper 6K citations)
I GCTA heritability estimation (2 papers: 4K citations)I Our novel finding: accuracy requires unstructured population
(a minority of closely-related individuals)
8 / 35
Dataset: Human Origins
●●
●●●
●●
●
●●●
●●
●
●
●●
●●
●●
●●
●
●●●
●
●
●●
●
●
●●
●
●●
●●●
●● ●
● ●
● ●
●
●●●
●●
●●● ●
●●●
●
●●●
●●
●●
●
●
●●
●
●●
●●
● ●●
●
●●
●
●
●
●●
●●●
●
●●●●●●●●
●
● ●●
●●●
●●●●
●●●
●●●
●
●●●
●
●●●
●
●
●●
●
●
●
●●●
●
●
● ●
●
●●
●●
●
●●
● ●
●
●●
● ●●
●
●●
●●
●●
●
●●●
●
●●
●
●
●
●●●●
● ●● ●
●
●●
●●
●
● ●●
●
●● ●●
●
●●
●
●●●
●●
●
●
●●
●
●
●●
●●
●
●●
●●
●●●● ●
●●●
●●
●
● ●
●●
●●
●●
●
●
●
●
●
Ju_hoan_South
Ju_hoan_North
Taa_West
Taa_East
Taa_North
NaroGui
Hoan
Xuun
GanaTshwa
Khomani
Nama
Haiom
Kgalagadi
Shua
Khwe
Mbuti
Biaka
BantuSA
Tswana
Damara
Himba
Wambo
YorubaEsan
Mende
BantuKenya
Luhya
MandenkaGambian
Luo
Dinka
Kikuyu
Hadza
Sandawe
Masai
Datog
Somali
OromoJew_Ethiopian
Saharawi
MoroccanMozabite
Algerian Tunisian
Libyan Egyptian
Yemeni
Jew_Libyan
Jew_TunisianJew_Moroccan
Jew_Yemenite
Saudi
BedouinBBedouinA
PalestinianJordanian
SyrianLebanese_Muslim
Lebanese_Christian
Jew_Turkish
Assyrian
Druze
Lebanese
Jew_iraqi
Iranian_Bandari
Iranian
Jew_Iranian
Turkish
Jew_Ashkenazi
CypriotMaltese
Canary_Islander
Italian_SouthSicilian
Romanian
FrenchN
SpanishSW Greek
Italian_North
SpanishNE
FrenchS
Sardinian
Basque
Orcadian
English
Icelandic
Norwegian
Irish
Irish_UlsterScottish
Shetlandic
GermanSorb
Polish
Czech
Croatian
Hungarian
AlbanianBulgarian
Mordovian
FinnishRussian
Estonian
Ukrainian
BelarusianLithuanian
Armenian
Jew_GeorgianGeorgian
Abkhasian
Adygei
North_Ossetian
Chechen
Lezgin
KumykBalkar
Nogai
Makrani
BrahuiBalochi
Jew_Cochin
Tajik
Kalash
Pathan
Sindhi
Burusho
Brahmin_TiwariGujarati
Punjabi
Vishwabrahmin
Lodhi
Mala
BengaliKharia
Onge
Chuvash
Turkmen Uzbek
Hazara
Uygur
Tubalar
Mansi
Selkup
Kyrgyz
Altaian
Tuvinian
Nganasan
Dolgan
Yakut
Xibo
Hezhen
OroqenUlchi
Even
Yukagir
ItelmenKoryak
Chukchi
Eskimo
AleutAleut_Tlingit
Kusunda
Burmese
CambodianThai
Kalmyk
Ami
Atayal
Malay
Kinh
Vietnamese
Dai
Lahu
NaxiYi
TujiaHan
MiaoShe
Tu
Mongola
Daur
JapaneseKorean
Chipewyan
CreeAlgonquin
Ojibwa
Pima
Mixe
Mixtec
Zapotec
Mayan
Inga
Kaqchikel
Cabecar
Piapoco
Karitiana
Surui
Bolivian
Aymara
Quechua
Guarani
Chilote
Australian
Nasoi
Papuan
Ata
Baining_Malasait
Baining_Marabu
Bajo
Buka
Dusun
Ilocano
Kankanaey
Kol_New_Britain
Kove
Kuot_Kabil
Kuot_LamalauaLavongaiLebbo Madak
MamusiMamusi_Paleabu
Mangseng
Manus
Melamela
Mengen
Murut Mussau
Nakanai_Bileki
Nakanai_Loso
Nailik
Notsi
Saposa
Sulka
Tagalog
Teop
Tigak
Tolai
Visayan
Subpopulations
SAfricaMAfricaNAfricaMiddleEastEuropeCaucasusSAsiaNAsiaEAsiaAmericasOceania
2,922 indivs. from 244 locs. — 593,124 loci — SNP chipLazaridis et al. (2014), (2016); Skoglund et al. (2016)
9 / 35
Ju_h
oan_
Sou
thJu
_hoa
n_N
orth
Taa_
Wes
tTa
a_E
ast
Taa_
Nor
thN
aro
Gui
Hoa
nX
uun
Gan
aT
shw
aK
hom
ani
Nam
aH
aiom
Kga
laga
diS
hua
Khw
eM
buti
Bia
kaB
antu
SA
Tsw
ana
Dam
ara
Him
baW
ambo
Yoru
baE
san
Men
deB
antu
Ken
yaLu
hya
Man
denk
aG
ambi
an Luo
Din
kaK
ikuy
uH
adza
San
daw
eM
asai
Dat
ogS
omal
iO
rom
oJe
w_E
thio
pian
Sah
araw
iM
oroc
can
Moz
abite
Alg
eria
nTu
nisi
anLi
byan
Egy
ptia
nYe
men
iJe
w_L
ibya
nJe
w_T
unis
ian
Jew
_Mor
occa
nJe
w_Y
emen
iteS
audi
Bed
ouin
BB
edou
inA
Pal
estin
ian
Jord
ania
nS
yria
nLe
bane
se_M
uslim
Leba
nese
_Chr
istia
nJe
w_T
urki
shA
ssyr
ian
Dru
zeLe
bane
seJe
w_i
raqi
Iran
ian_
Ban
dari
Iran
ian
Jew
_Ira
nian
Turk
ish
Jew
_Ash
kena
ziC
yprio
tM
alte
seC
anar
y_Is
land
erIta
lian_
Sou
thS
icili
anR
oman
ian
Fre
nchN
Spa
nish
SW
Gre
ekIta
lian_
Nor
thS
pani
shN
EF
renc
hSS
ardi
nian
Bas
que
Orc
adia
nE
nglis
hIc
elan
dic
Nor
weg
ian
Iris
hIr
ish_
Uls
ter
Sco
ttish
She
tland
icG
erm
anS
orb
Pol
ish
Cze
chC
roat
ian
Hun
garia
nA
lban
ian
Bul
garia
nM
ordo
vian
Fin
nish
Rus
sian
Est
onia
nU
krai
nian
Bel
arus
ian
Lith
uani
anA
rmen
ian
Jew
_Geo
rgia
nG
eorg
ian
Abk
hasi
anA
dyge
iN
orth
_Oss
etia
nC
hech
enLe
zgin
Kum
ykB
alka
rN
ogai
Mak
rani
Bra
hui
Bal
ochi
Jew
_Coc
hin
Tajik
Kal
ash
Pat
han
Sin
dhi
Bur
usho
Bra
hmin
_Tiw
ari
Guj
arat
iP
unja
biV
ishw
abra
hmin
Lodh
iM
ala
Ben
gali
Kha
riaO
nge
Chu
vash
Turk
men
Uzb
ekH
azar
aU
ygur
Tuba
lar
Man
siS
elku
pK
yrgy
zA
ltaia
nTu
vini
anN
gana
san
Dol
gan
Yaku
tX
ibo
Hez
hen
Oro
qen
Ulc
hiE
ven
Yuka
gir
Itelm
enK
orya
kC
hukc
hiE
skim
oA
leut
Ale
ut_T
lingi
tK
usun
daK
alm
ykB
urm
ese
Cam
bodi
anT
hai
TuM
ongo
laD
aur
Kin
hV
ietn
ames
eD
aiLa
huN
axi
Yi
Tujia
Han
Mia
oS
heJa
pane
seK
orea
nA
mi
Ata
yal
Kan
kana
eyM
urut
Dus
unIlo
cano
Taga
log
Vis
ayan
Baj
oM
alay
Lebb
oC
hipe
wya
nC
ree
Alg
onqu
inO
jibw
aP
ima
Mix
eM
ixte
cZ
apot
ecM
ayan
Inga
Kaq
chik
elC
abec
arP
iapo
coK
ariti
ana
Sur
uiB
oliv
ian
Aym
ara
Que
chua
Gua
rani
Chi
lote
Mus
sau
Sap
osa
Buk
aTe
opT
igak
Aus
tral
ian
Kov
eM
elam
ela
Nak
anai
_Bile
kiM
angs
eng
Man
usN
asoi
Nai
likN
otsi
SW
_Bou
gain
ville
Kuo
t_K
abil
Kuo
t_La
mal
aua
Lavo
ngai
Mad
akTo
lai
Men
gen
Mam
usi
Mam
usi_
Pal
eabu
Nak
anai
_Los
oA
taS
ulka
Kol
_New
_Brit
ain
Bai
ning
_Mal
asai
tB
aini
ng_M
arab
uP
apua
n
Ju_hoan_SouthJu_hoan_North
Taa_WestTaa_East
Taa_NorthNaro
GuiHoanXuunGana
TshwaKhomani
NamaHaiom
KgalagadiShuaKhweMbutiBiaka
BantuSATswanaDamara
HimbaWamboYoruba
EsanMende
BantuKenyaLuhya
MandenkaGambian
LuoDinka
KikuyuHadza
SandaweMasaiDatog
SomaliOromo
Jew_EthiopianSaharawi
MoroccanMozabiteAlgerianTunisian
LibyanEgyptian
YemeniJew_Libyan
Jew_TunisianJew_MoroccanJew_Yemenite
SaudiBedouinBBedouinA
PalestinianJordanian
SyrianLebanese_Muslim
Lebanese_ChristianJew_Turkish
AssyrianDruze
LebaneseJew_iraqi
Iranian_BandariIranian
Jew_IranianTurkish
Jew_AshkenaziCypriotMaltese
Canary_IslanderItalian_South
SicilianRomanian
FrenchNSpanishSW
GreekItalian_North
SpanishNEFrenchS
SardinianBasque
OrcadianEnglish
IcelandicNorwegian
IrishIrish_Ulster
ScottishShetlandic
GermanSorb
PolishCzech
CroatianHungarian
AlbanianBulgarian
MordovianFinnish
RussianEstonian
UkrainianBelarusianLithuanianArmenian
Jew_GeorgianGeorgian
AbkhasianAdygei
North_OssetianChechen
LezginKumykBalkarNogai
MakraniBrahui
BalochiJew_Cochin
TajikKalashPathanSindhi
BurushoBrahmin_Tiwari
GujaratiPunjabi
VishwabrahminLodhiMala
BengaliKhariaOnge
ChuvashTurkmen
UzbekHazaraUygur
TubalarMansi
SelkupKyrgyzAltaian
TuvinianNganasan
DolganYakutXibo
HezhenOroqen
UlchiEven
YukagirItelmenKoryak
ChukchiEskimo
AleutAleut_Tlingit
KusundaKalmyk
BurmeseCambodian
ThaiTu
MongolaDaurKinh
VietnameseDai
LahuNaxi
YiTujiaHan
MiaoShe
JapaneseKorean
AmiAtayal
KankanaeyMurut
DusunIlocanoTagalogVisayan
BajoMalayLebbo
ChipewyanCree
AlgonquinOjibwa
PimaMixe
MixtecZapotec
MayanInga
KaqchikelCabecarPiapoco
KaritianaSurui
BolivianAymara
QuechuaGuaraniChilote
MussauSaposa
BukaTeop
TigakAustralian
KoveMelamela
Nakanai_BilekiMangseng
ManusNasoiNailikNotsi
SW_BougainvilleKuot_Kabil
Kuot_LamalauaLavongai
MadakTolai
MengenMamusi
Mamusi_PaleabuNakanai_Loso
AtaSulka
Kol_New_BritainBaining_MalasaitBaining_Marabu
Papuan
SAfrica MAfrica NAfrica MiddleEast Europe Caucasus SAsia NAsia EAsia Americas Oceania
SA
fric
aM
Afr
ica
NA
fric
aM
iddl
eEas
tE
urop
eC
auca
sus
SA
sia
NA
sia
EA
sia
Am
eric
asO
cean
ia
00.
10.
20.
3
Kin
ship
Indi
vidu
als
Our new kinshipestimatesGenotypes from “Human Origins”(Lazaridis et al. 2014, 2016;Skoglund et al. 2016)
Edited from Ephert [CC BY-SA 3.0], viaWikimedia Commons
*Inbreeding coeffs. on diagonal
10 / 35
Ju_h
oan_
Sou
thJu
_hoa
n_N
orth
Taa_
Wes
tTa
a_E
ast
Taa_
Nor
thN
aro
Gui
Hoa
nX
uun
Gan
aT
shw
aK
hom
ani
Nam
aH
aiom
Kga
laga
diS
hua
Khw
eM
buti
Bia
kaB
antu
SA
Tsw
ana
Dam
ara
Him
baW
ambo
Yoru
baE
san
Men
deB
antu
Ken
yaLu
hya
Man
denk
aG
ambi
an Luo
Din
kaK
ikuy
uH
adza
San
daw
eM
asai
Dat
ogS
omal
iO
rom
oJe
w_E
thio
pian
Sah
araw
iM
oroc
can
Moz
abite
Alg
eria
nTu
nisi
anLi
byan
Egy
ptia
nYe
men
iJe
w_L
ibya
nJe
w_T
unis
ian
Jew
_Mor
occa
nJe
w_Y
emen
iteS
audi
Bed
ouin
BB
edou
inA
Pal
estin
ian
Jord
ania
nS
yria
nLe
bane
se_M
uslim
Leba
nese
_Chr
istia
nJe
w_T
urki
shA
ssyr
ian
Dru
zeLe
bane
seJe
w_i
raqi
Iran
ian_
Ban
dari
Iran
ian
Jew
_Ira
nian
Turk
ish
Jew
_Ash
kena
ziC
yprio
tM
alte
seC
anar
y_Is
land
erIta
lian_
Sou
thS
icili
anR
oman
ian
Fre
nchN
Spa
nish
SW
Gre
ekIta
lian_
Nor
thS
pani
shN
EF
renc
hSS
ardi
nian
Bas
que
Orc
adia
nE
nglis
hIc
elan
dic
Nor
weg
ian
Iris
hIr
ish_
Uls
ter
Sco
ttish
She
tland
icG
erm
anS
orb
Pol
ish
Cze
chC
roat
ian
Hun
garia
nA
lban
ian
Bul
garia
nM
ordo
vian
Fin
nish
Rus
sian
Est
onia
nU
krai
nian
Bel
arus
ian
Lith
uani
anA
rmen
ian
Jew
_Geo
rgia
nG
eorg
ian
Abk
hasi
anA
dyge
iN
orth
_Oss
etia
nC
hech
enLe
zgin
Kum
ykB
alka
rN
ogai
Mak
rani
Bra
hui
Bal
ochi
Jew
_Coc
hin
Tajik
Kal
ash
Pat
han
Sin
dhi
Bur
usho
Bra
hmin
_Tiw
ari
Guj
arat
iP
unja
biV
ishw
abra
hmin
Lodh
iM
ala
Ben
gali
Kha
riaO
nge
Chu
vash
Turk
men
Uzb
ekH
azar
aU
ygur
Tuba
lar
Man
siS
elku
pK
yrgy
zA
ltaia
nTu
vini
anN
gana
san
Dol
gan
Yaku
tX
ibo
Hez
hen
Oro
qen
Ulc
hiE
ven
Yuka
gir
Itelm
enK
orya
kC
hukc
hiE
skim
oA
leut
Ale
ut_T
lingi
tK
usun
daK
alm
ykB
urm
ese
Cam
bodi
anT
hai
TuM
ongo
laD
aur
Kin
hV
ietn
ames
eD
aiLa
huN
axi
Yi
Tujia
Han
Mia
oS
heJa
pane
seK
orea
nA
mi
Ata
yal
Kan
kana
eyM
urut
Dus
unIlo
cano
Taga
log
Vis
ayan
Baj
oM
alay
Lebb
oC
hipe
wya
nC
ree
Alg
onqu
inO
jibw
aP
ima
Mix
eM
ixte
cZ
apot
ecM
ayan
Inga
Kaq
chik
elC
abec
arP
iapo
coK
ariti
ana
Sur
uiB
oliv
ian
Aym
ara
Que
chua
Gua
rani
Chi
lote
Mus
sau
Sap
osa
Buk
aTe
opT
igak
Aus
tral
ian
Kov
eM
elam
ela
Nak
anai
_Bile
kiM
angs
eng
Man
usN
asoi
Nai
likN
otsi
SW
_Bou
gain
ville
Kuo
t_K
abil
Kuo
t_La
mal
aua
Lavo
ngai
Mad
akTo
lai
Men
gen
Mam
usi
Mam
usi_
Pal
eabu
Nak
anai
_Los
oA
taS
ulka
Kol
_New
_Brit
ain
Bai
ning
_Mal
asai
tB
aini
ng_M
arab
uP
apua
n
Ju_hoan_SouthJu_hoan_North
Taa_WestTaa_East
Taa_NorthNaro
GuiHoanXuunGana
TshwaKhomani
NamaHaiom
KgalagadiShuaKhweMbutiBiaka
BantuSATswanaDamara
HimbaWamboYoruba
EsanMende
BantuKenyaLuhya
MandenkaGambian
LuoDinka
KikuyuHadza
SandaweMasaiDatog
SomaliOromo
Jew_EthiopianSaharawi
MoroccanMozabiteAlgerianTunisian
LibyanEgyptian
YemeniJew_Libyan
Jew_TunisianJew_MoroccanJew_Yemenite
SaudiBedouinBBedouinA
PalestinianJordanian
SyrianLebanese_Muslim
Lebanese_ChristianJew_Turkish
AssyrianDruze
LebaneseJew_iraqi
Iranian_BandariIranian
Jew_IranianTurkish
Jew_AshkenaziCypriotMaltese
Canary_IslanderItalian_South
SicilianRomanian
FrenchNSpanishSW
GreekItalian_North
SpanishNEFrenchS
SardinianBasque
OrcadianEnglish
IcelandicNorwegian
IrishIrish_Ulster
ScottishShetlandic
GermanSorb
PolishCzech
CroatianHungarian
AlbanianBulgarian
MordovianFinnish
RussianEstonian
UkrainianBelarusianLithuanianArmenian
Jew_GeorgianGeorgian
AbkhasianAdygei
North_OssetianChechen
LezginKumykBalkarNogai
MakraniBrahui
BalochiJew_Cochin
TajikKalashPathanSindhi
BurushoBrahmin_Tiwari
GujaratiPunjabi
VishwabrahminLodhiMala
BengaliKhariaOnge
ChuvashTurkmen
UzbekHazaraUygur
TubalarMansi
SelkupKyrgyzAltaian
TuvinianNganasan
DolganYakutXibo
HezhenOroqen
UlchiEven
YukagirItelmenKoryak
ChukchiEskimo
AleutAleut_Tlingit
KusundaKalmyk
BurmeseCambodian
ThaiTu
MongolaDaurKinh
VietnameseDai
LahuNaxi
YiTujiaHan
MiaoShe
JapaneseKorean
AmiAtayal
KankanaeyMurut
DusunIlocanoTagalogVisayan
BajoMalayLebbo
ChipewyanCree
AlgonquinOjibwa
PimaMixe
MixtecZapotec
MayanInga
KaqchikelCabecarPiapoco
KaritianaSurui
BolivianAymara
QuechuaGuaraniChilote
MussauSaposa
BukaTeop
TigakAustralian
KoveMelamela
Nakanai_BilekiMangseng
ManusNasoiNailikNotsi
SW_BougainvilleKuot_Kabil
Kuot_LamalauaLavongai
MadakTolai
MengenMamusi
Mamusi_PaleabuNakanai_Loso
AtaSulka
Kol_New_BritainBaining_MalasaitBaining_Marabu
Papuan
SAfrica MAfrica NAfrica MiddleEast Europe Caucasus SAsia NAsia EAsia Americas Oceania
SA
fric
aM
Afr
ica
NA
fric
aM
iddl
eEas
tE
urop
eC
auca
sus
SA
sia
NA
sia
EA
sia
Am
eric
asO
cean
ia
−0.0
50
0.05
0.1
0.15
Kin
ship
Indi
vidu
als
Standard kinshipestimatesGenotypes from “Human Origins”(Lazaridis et al. 2014, 2016;Skoglund et al. 2016)
Edited from Ephert [CC BY-SA 3.0], viaWikimedia Commons
11 / 35
Only our new estimator is accurate in simulations
True KinshipA New estimateB Standard est.C
00.
10.
2
Kin
ship
Indi
vidu
als
12 / 35
Population-level inbreeding increases with distance from Africa
●●
●
●●
●
●●
●
●●
●
●
●●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●●
●●
●
●
●
●●
●●
●●● ●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●● ●
●
●
●
●
●●
●
●
●
●●
●
●●
●
●
●
●●
●
●
● ●
● ● ●
●●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●●
●
●
●
●
●
●
●●
●
●
●●
●●
●● ●
●●
● ●
●
●
●
●
●
●
●
●
●
●
0.2
0.3
0.4
Inbr
eedi
ng c
oeff.
13 / 35
Differentiation (FST) previously underestimated
0.0 0.1 0.2 0.3 0.4 0.5
020
40
Inbreeding Coefficient
Den
sity
SAfricaMAfricaNAfricaMiddleEastEuropeCaucasusSAsiaNAsiaEAsiaAmericasOceania
Subpopulations
SAfricaMAfricaNAfricaMiddleEastEuropeCaucasusSAsiaNAsiaEAsiaAmericasOceania
FST estimates
BayeScanWCHudsonKNew
14 / 35
Only our new method estimates generalized FST accurately
0.00
0.05
0.10
Ind. Subpops.A
Wei
r−C
ocke
rham
Hud
sonK
Bay
eSca
n
Sta
ndar
dK
insh
ip
New
− − −−
− AdmixtureB
Wei
r−C
ocke
rham
Hud
sonK
Bay
eSca
n
Sta
ndar
dK
insh
ip
New
− −−
−
−
−
True FST
FSTindep limit
Estimates−
FS
T e
stim
ate
FST estimator
15 / 35
Recently-admixed populations
African-Americans
Baharian et al. (2016)
Hispanics
Moreno-Estrada et al. (2013)16 / 35
Admixed siblings from different populations?
Lucy and Maria, UK
Ochoa brothers, MX
High Admixture LD:
Moreno-Estrada et al. (2013)
Solution: treat every individual as its own population!
17 / 35
Admixed siblings from different populations?
Lucy and Maria, UK
Ochoa brothers, MX
High Admixture LD:
Moreno-Estrada et al. (2013)
Solution: treat every individual as its own population!
17 / 35
Admixed siblings from different populations?
Lucy and Maria, UK
Ochoa brothers, MX
High Admixture LD:
Moreno-Estrada et al. (2013)
Solution: treat every individual as its own population!
17 / 35
Admixed siblings from different populations?
Lucy and Maria, UK
Ochoa brothers, MX
High Admixture LD:
Moreno-Estrada et al. (2013)
Solution: treat every individual as its own population!17 / 35
Dataset: 1000 Genomes Project (2013)
●
●
●
●
●
● ●
●
●
●●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
ASW
ACB
ESN
GWD
LWK
MSLYRI
GBR
FIN
IBS
TSI
CEU
BEB
GIH
ITU
PJL
STU
CDX
CHB
JPT
KHV
CHS
CLM
MXL
PEL
PUR
Subpopulations
SAfricaEuropeSAsiaEAsiaAmr
2,504 indivs. from 26 locs. — 20,417,698 loci (asc. in YRI) — WGS trios, etc.
18 / 35
00.
10.
20.
30.
4
Kin
ship
Indi
vidu
als
0.0
0.5
1.0
Individuals
Anc
estr
y fr
ac.
x
PU
RC
LMP
EL
MX
L
Pop
ulat
ion
x
AF
RA
MR
EU
R
Anc
estr
y
Kinship driven byadmixture in Hispanics
Our new kinship estimates
Genotypes from the 1000 Genomes Project (2013)
19 / 35
Comparison of population structures in simulation
T
S1
f TS1
S2
f TS2
...SK
f TSK
Indep. Subpops.T
S1 S2 ... SK
A1 A2 ... An
Admixture
00.
10.
2
Kin
ship
Indi
vidu
als
20 / 35
FST in the independent subpopulation model
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
0.4
0.6
Alle
le fr
eque
ncy
Illustration.
FST =Var(pSi∣∣T)
pTi(1− pTi
) .Here FST = proportion of variance explained by pop. structure
21 / 35
FST in the independent subpopulation model
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
0.4
0.6
Alle
le fr
eque
ncy
Illustration.
FST =Var(pSi∣∣T)
pTi(1− pTi
) .Here FST = proportion of variance explained by pop. structure
21 / 35
Wright’s FST
T = Total, S = Subpopulation, I = Individual.
Total inbreeding: FIT =1|S |∑j∈S
f Tj ,
Local inbreeding: FIS =1|S |∑j∈S
f Sj ,
Structural inbreeding: FST =FIT − FIS1− FIS
.
22 / 35
Our generalized FSTNeed new “local” subpopulations Lj (separates total from local inbreeding):(
1− f Tj)
=(1− f
Ljj
)(1− f TLj
).
Generalized FST: applicable to arbitrary population structures, equals previousdefinition for non-overlapping subpopulations:
FST =n∑
j=1
wj fTLj.
Mean heterozygosity in a structured population:
H̄i =1n
n∑j=1
Pr(xij = 1|T ) = 2pTi(1− pTi
)(1− FST) .
23 / 35
Our generalized FSTNeed new “local” subpopulations Lj (separates total from local inbreeding):(
1− f Tj)
=(1− f
Ljj
)(1− f TLj
).
Generalized FST: applicable to arbitrary population structures, equals previousdefinition for non-overlapping subpopulations:
FST =n∑
j=1
wj fTLj.
Mean heterozygosity in a structured population:
H̄i =1n
n∑j=1
Pr(xij = 1|T ) = 2pTi(1− pTi
)(1− FST) .
23 / 35
Our generalized FSTNeed new “local” subpopulations Lj (separates total from local inbreeding):(
1− f Tj)
=(1− f
Ljj
)(1− f TLj
).
Generalized FST: applicable to arbitrary population structures, equals previousdefinition for non-overlapping subpopulations:
FST =n∑
j=1
wj fTLj.
Mean heterozygosity in a structured population:
H̄i =1n
n∑j=1
Pr(xij = 1|T ) = 2pTi(1− pTi
)(1− FST) .
23 / 35
FST measures population structure / differentiation
●●
●
●●
●
●●
●
●●
●
●
●●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●●
●●
●
●
●
●●
●●
●●● ●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●● ●
●
●
●
●
●●
●
●
●
●●
●
●●
●
●
●
●●
●
●
● ●
● ● ●
●●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●●
●
●
●
●
●
●
●●
●
●
●●
●●
●● ●
●●
● ●
●
●
●
●
●
●
●
●
●
●
0.2
0.4
Alle
le fr
eque
ncy
Median diff. SNP in Human Origins (rs2650044; given MAF ≥ 10%).
F̂WCST ≈ 0.0961 using Weir-Cockerham estimator and K = 244.
24 / 35
FST measures population structure / differentiation
●●
●
●●
●
●●
●
●●
●
●
●●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●●
●●
●
●
●
●●
●●
●●● ●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●● ●
●
●
●
●
●●
●
●
●
●●
●
●●
●
●
●
●●
●
●
● ●
● ● ●
●●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●●
●
●
●
●
●
●
●●
●
●
●●
●●
●● ●
●●
● ●
●
●
●
●
●
●
●
●
●
●
0.2
0.4
Alle
le fr
eque
ncy
Median diff. SNP in Human Origins (rs2650044; given MAF ≥ 10%).
F̂WCST ≈ 0.0961 using Weir-Cockerham estimator and K = 244.
24 / 35
Comparison of population structures in simulation
T
S1
f TS1
S2
f TS2
...SK
f TSK
Indep. Subpops.T
S1 S2 ... SK
A1 A2 ... An
Admixture
00.
10.
2
Kin
ship
Indi
vidu
als
25 / 35
Our admixture simulation (R package ‘bnpsd’ on CRAN)Intermediate subpop. diff.
FS
T
01
A
0.0
0.2
Intermediate subpop. spread
dens
ity
B
Admixture proportions
ance
stry
01
C
Discrete subpop. approx.
ance
stry
01
D
position in 1D geography
26 / 35
Kinship model for genotypes
symbol meaningT ref ancestral populationi locus index
j , k individual indexespTi ref allele frequencyxij genotype (num ref alleles)ϕTjk kinship of j , k
f Tj inbreeding of j
Statistical model:
E[xij |T ] = 2pTi ,
Var(xij |T ) = 2pTi(1− pTi
)(1 + f Tj ),
Cov(xij , xik |T ) = 4pTi(1− pTi
)ϕTjk .
(Wright 1921, 1951; Malécot 1948; Jacquard 1970).
27 / 35
Problem: common estimators not consistent under structure
Estimate of ancestral allele frequency:
p̂Ti =12
n∑j=1
wjxij
Variance asymptotically non-zero under population structure:
Var(p̂Ti∣∣T) = pTi
(1− pTi
)ϕ̄T
(for independent individuals ϕ̄T = 12n(1 + FST). ⇒ nEff ≈ 4 in Human Origins!)
Therefore, naive estimators that use p̂Ti (next) are not consistent!
28 / 35
Problem: common estimators not consistent under structure
Estimate of ancestral allele frequency:
p̂Ti =12
n∑j=1
wjxij
Variance asymptotically non-zero under population structure:
Var(p̂Ti∣∣T) = pTi
(1− pTi
)ϕ̄T
(for independent individuals ϕ̄T = 12n(1 + FST). ⇒ nEff ≈ 4 in Human Origins!)
Therefore, naive estimators that use p̂Ti (next) are not consistent!
28 / 35
Problem: common estimators not consistent under structure
Estimate of ancestral allele frequency:
p̂Ti =12
n∑j=1
wjxij
Variance asymptotically non-zero under population structure:
Var(p̂Ti∣∣T) = pTi
(1− pTi
)ϕ̄T
(for independent individuals ϕ̄T = 12n(1 + FST). ⇒ nEff ≈ 4 in Human Origins!)
Therefore, naive estimators that use p̂Ti (next) are not consistent!
28 / 35
Bias in standard kinship estimator
ϕ̂T ,stdjk =
m∑i=1
(xij − 2p̂Ti
) (xik − 2p̂Ti
)4
m∑i=1
p̂Ti(1− p̂Ti
) , p̂Ti =12
n∑j=1
wjxij .
Bias varies by j , k :
ϕ̂T ,stdjk
a.s.−−−→m→∞
ϕTjk − ϕ̄T
j − ϕ̄Tk + ϕ̄T
1− ϕ̄T.
True KinshipA New estimateB Standard estimateC Limit of standardD
00.
10.
2
Kin
ship
Indi
vidu
als
29 / 35
Bias in standard kinship estimator
ϕ̂T ,stdjk =
m∑i=1
(xij − 2p̂Ti
) (xik − 2p̂Ti
)4
m∑i=1
p̂Ti(1− p̂Ti
) , p̂Ti =12
n∑j=1
wjxij .
Bias varies by j , k :
ϕ̂T ,stdjk
a.s.−−−→m→∞
ϕTjk − ϕ̄T
j − ϕ̄Tk + ϕ̄T
1− ϕ̄T.
True KinshipA New estimateB Standard estimateC Limit of standardD
00.
10.
2
Kin
ship
Indi
vidu
als
29 / 35
Bias in standard kinship estimator
ϕ̂T ,stdjk =
m∑i=1
(xij − 2p̂Ti
) (xik − 2p̂Ti
)4
m∑i=1
p̂Ti(1− p̂Ti
) , p̂Ti =12
n∑j=1
wjxij .
Bias varies by j , k :
ϕ̂T ,stdjk
a.s.−−−→m→∞
ϕTjk − ϕ̄T
j − ϕ̄Tk + ϕ̄T
1− ϕ̄T.
True KinshipA New estimateB Standard estimateC Limit of standardD
00.
10.
2
Kin
ship
Indi
vidu
als
29 / 35
Our new estimator (R package ‘popkin’ on CRAN)
Step 1: “pre-adjusted” kinship estimator with uniform bias.
ϕ̂T ,preadjjk =
m∑i=1
(xij − 1)(xik − 1)− 1
4m∑i=1
p̂Ti(1− p̂Ti
) + 1 a.s.−−−→m→∞
ϕTjk − ϕ̄T
1− ϕ̄T,
Step 2: Estimate minimum kinship, use to unbias “step 1” estimates.
ϕ̂T ,preadjmin
a.s.−−−→m→∞
− ϕ̄T
1− ϕ̄T, ϕ̂T ,new
jk =ϕ̂T ,preadjjk − ϕ̂T ,preadj
min
1− ϕ̂T ,preadjmin
a.s.−−−→m→∞
ϕTjk .
This yields consistent f̂ T ,newj , F̂ new
ST estimators!
30 / 35
Our new estimator (R package ‘popkin’ on CRAN)
Step 1: “pre-adjusted” kinship estimator with uniform bias.
ϕ̂T ,preadjjk =
m∑i=1
(xij − 1)(xik − 1)− 1
4m∑i=1
p̂Ti(1− p̂Ti
) + 1 a.s.−−−→m→∞
ϕTjk − ϕ̄T
1− ϕ̄T,
Step 2: Estimate minimum kinship, use to unbias “step 1” estimates.
ϕ̂T ,preadjmin
a.s.−−−→m→∞
− ϕ̄T
1− ϕ̄T, ϕ̂T ,new
jk =ϕ̂T ,preadjjk − ϕ̂T ,preadj
min
1− ϕ̂T ,preadjmin
a.s.−−−→m→∞
ϕTjk .
This yields consistent f̂ T ,newj , F̂ new
ST estimators!
30 / 35
Performance of new estimator
True KinshipA New estimateB Standard estimateC Limit of standardD
00.
10.
2
Kin
ship
Indi
vidu
als
31 / 35
Bias in FST estimators for independent subpopulationsPrevious estimator for n subpopulations, simplified for known AFs (πij):
F̂ indepST =
m∑i=1
σ̂2i
m∑i=1
p̂Ti(1− p̂Ti
)+ 1
nσ̂2i
,
p̂Ti =1n
n∑j=1
πij , σ̂2i =1
n − 1
n∑j=1
(πij − p̂Ti
)2.
Estimator is biased in dependent subpopulations:
F̂ indepST
a.s.−−−→m→∞
FST − 1n−1
(nθ̄T − FST
)1− 1
n−1
(nθ̄T − FST
) .
32 / 35
Bias in FST estimators for independent subpopulationsPrevious estimator for n subpopulations, simplified for known AFs (πij):
F̂ indepST =
m∑i=1
σ̂2i
m∑i=1
p̂Ti(1− p̂Ti
)+ 1
nσ̂2i
,
p̂Ti =1n
n∑j=1
πij , σ̂2i =1
n − 1
n∑j=1
(πij − p̂Ti
)2.
Estimator is biased in dependent subpopulations:
F̂ indepST
a.s.−−−→m→∞
FST − 1n−1
(nθ̄T − FST
)1− 1
n−1
(nθ̄T − FST
) .
32 / 35
Only our new method estimates generalized FST accurately
0.00
0.05
0.10
Ind. Subpops.A
Wei
r−C
ocke
rham
Hud
sonK
Bay
eSca
n
Sta
ndar
dK
insh
ip
New
− − −−
− AdmixtureB
Wei
r−C
ocke
rham
Hud
sonK
Bay
eSca
n
Sta
ndar
dK
insh
ip
New
− −−
−
−
−
True FST
FSTindep limit
Estimates−
FS
T e
stim
ate
FST estimator
33 / 35
The future: improved kinship has repercussions across genetics!
Accurate andefficientestimation,admixturemodeling
Associationstudies, selectiontests
Bias inheritability ofcomplex traits
Animal and plantbreeding
34 / 35
Acknowledgments
John D. StoreyAndrew BassIrineo CabrerosWei HaoRiley Skeen-Gaar
Neo Christopher ChungUniversity of Warsaw
Funding:National Institutes of HealthOtsuka Pharmaceutical
Lewis-Sigler Institute for Integrative Genomics
35 / 35