Utilization of the Scythe C++ open source library for statistical geocomputation in livestock landscape genomics
Sylvie Stucki (1), Saif Agha (2), Menghua Li (3), Stéphane Joost (1)
(1) Geographic Information Systems Laboratory, EPFL, Lausanne
(2) Animal Production Dept, Faculty of Agriculture, Ain Shams University, Cairo, Egypt
(3) MTT Agrifood Research Finland, Biotechnology and Food Research, Jokioinen, Finland
October 26, 2012
Outline
Introduction
Methods and tools
Case study
Ongoing developments
Conclusion
Introduction
Population genetics
Founded by Fisher, Haldane and Wright ∼ 1930
Change in allele frequency under evolutionary forces
Variable part of genome ∼ 0.1%
Neutral vs adaptive variation
● Outlier detection
  ○ Neutral and adaptive markers behave differently
  ○ Summary statistics between populations (FST)
  ○ Simulate neutral distributions or coalescent time (Lositan)
● Correlative methods
  ○ Organisms adapt to their habitat
  ○ Link genetic variation with environmental features
  ○ Landscape genetics (MatSAM)
Landscape genetics
[Figure: Genetics, environment, Geo]
Methods and tools
Landscape genomics
Search for direct associations between genome and environment
Spatial coincidence
Molecular markers + environmental parameters
Association models
Multiple univariate logistic regressions ⇒ keep the most significant ones
[Figure: marker GG_OARX_63571789.1 (y-axis, 0.0 to 1.0) against sunshine in September, in % of day length (x-axis, 30 to 80)]
Logistic regressions
For a specific marker, model with q variables, at each location i:
\[ \ln\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + x_{i1}\beta_1 + \cdots + x_{iq}\beta_q = x_i^T \beta \]
$\pi_i$: probability of presence of the marker
$x_i$: vector of environmental variables (with a leading 1, so that $x_i^T \beta$ includes $\beta_0$)
$\beta$: vector of $q + 1$ coefficients
● Maximum likelihood estimation
● Log-likelihood ratio (G) and Wald tests with Bonferroni correction
● Model is significant ⇔ both tests reject H0
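The model and tests above can be sketched for a single univariate model (q = 1). This is plain Python, not the Scythe-based code of Samβada: the function names, the Newton-Raphson solver and the toy sunshine/presence data are assumptions made for illustration, and the p-values assume 1 degree of freedom for both tests.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Maximum likelihood fit of ln(pi / (1 - pi)) = b0 + b1 * x (Newton-Raphson).

    Returns b0, b1 and the estimated variance of b1.
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p          # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w              # Fisher information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    # Variance of b1: diagonal entry of the inverse information matrix,
    # evaluated just before the last (negligible) update.
    return b0, b1, h00 / det

def log_likelihood(x, y, b0, b1):
    total = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(b0 + b1 * xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def chi2_pvalue_1df(score):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(score / 2.0))

def is_significant(g_score, wald_score, alpha, n_models):
    """Significant only if both tests reject H0 at the Bonferroni-corrected level."""
    corrected = alpha / n_models
    return (chi2_pvalue_1df(g_score) < corrected
            and chi2_pvalue_1df(wald_score) < corrected)

# Toy data (hypothetical): sunshine in % of day length, marker presence (1) or absence (0)
sunshine = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
presence = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

b0, b1, var_b1 = fit_logistic(sunshine, presence)
mean_presence = sum(presence) / len(presence)
null_ll = sum(yi * math.log(mean_presence) + (1 - yi) * math.log(1.0 - mean_presence)
              for yi in presence)          # constant-only model
g_score = 2.0 * (log_likelihood(sunshine, presence, b0, b1) - null_ll)
wald_score = b1 * b1 / var_b1
```

Requiring both tests to reject H0, as `is_significant` does, follows the rule stated on the slide; the correction divides the significance level by the number of models tested.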
Samβada
● Successor of MatSAM (which runs in MATLAB)
● Open source stand-alone application
  easy to install
● Written in C++
  for Windows, Unix and Mac OS X
● Uses the Scythe Statistical Library [1]
  open-source library for matrix calculations
● Currently in console mode
  graphical interface on the wish list
  simple parameter file
[1] scythe.wustl.edu/
Input files
[Figure: Genetics, environment, Geo]
HEADERS Y
WORDDELIM ";"
NUMVARENV 14
NUMMARK 140
NUMINDIV 385
IDINDIV "Ind"
COLSUPENV "x" "y"
DIMMAX 1
SAVETYPE END ALL
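A small sketch of how the parameter entries above could be read. It assumes one keyword per line followed by its values, which is inferred from the slide and not taken from the Samβada documentation; `parse_parameters` is a hypothetical helper, not part of Samβada.

```python
import shlex

def parse_parameters(text):
    """Map each keyword (first token of a line) to the list of its values."""
    params = {}
    for line in text.splitlines():
        tokens = shlex.split(line)   # handles quoted values such as ";"
        if tokens:
            params[tokens[0]] = tokens[1:]
    return params

# Entries copied from the slide; the one-keyword-per-line layout is an assumption
example = '''HEADERS Y
WORDDELIM ";"
NUMVARENV 14
NUMMARK 140
NUMINDIV 385
IDINDIV "Ind"
COLSUPENV "x" "y"
DIMMAX 1
SAVETYPE END ALL
'''

params = parse_parameters(example)
n_markers = int(params["NUMMARK"][0])
n_individuals = int(params["NUMINDIV"][0])
```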
Current stage
Release is imminent
● Multivariate models
● Measure spatial autocorrelation
  ○ Global and local Moran’s I for both molecular and environmental data
  ○ Export results as shapefile with shapelib [2]
● Two times faster than MatSAM
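For reference, global Moran's I can be sketched as below. This is an illustration of the statistic only, not Samβada's code: the binary "chain" weighting (each point is a neighbour of the next one on a line) and the toy values are assumptions, and any other spatial weighting scheme could be substituted.

```python
def morans_i(values, weights):
    """Global Moran's I for values observed at n locations.

    weights[i][j] is the spatial weight between locations i and j.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    squares = sum(d * d for d in dev)
    return (n / total_weight) * (cross / squares)

def chain_weights(n):
    """Binary weights: 1 between consecutive points on a line, 0 otherwise."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
            for i in range(n)]

# A smooth gradient gives positive autocorrelation, an alternating pattern negative
gradient = morans_i([1, 2, 3, 4], chain_weights(4))
alternating = morans_i([1, -1, 1, -1], chain_weights(4))
```

The same function applies to a binary marker (presence or absence) or to an environmental variable, since only the values and the weights enter the formula.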
Open source
● Promote landscape genomics
● Share our tools to speed up research
● Licensing is annoying
[2] shapelib.maptools.org/
Case study
Context
International Sheep Genetics Consortium [3]
"... find genes associated with production, quality and disease traits"
Natural selection in livestock?
Domestication ∼ 10’000 years ago in the Middle East
Herds migrated worldwide with humans
Traits of interest: robustness and disease resistance → local adaptation
Still the main breeding strategy in developing countries
[3] www.sheephapmap.org
World is changing fast
Artificial selection ∼ 50 years
Cosmopolitan breeds (Sheep: Merino, Cow: Holstein)
● highly productive
● demanding (food, medicine)
● low genetic diversity, inbreeding
● prone to epizootics
Crossbreeding between traditional and cosmopolitan breeds
Fitness of offspring is questioned
Conservation in livestock
Preserve genetic diversity to ensure future farming
Genetic data
Provided by the Sheep HapMap Project (ISGC)
1479 animals from 37 local breeds
Genotyping
● 50k SNP BeadChip array
● 49’034 SNPs
● 147’102 binary markers after recoding
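The recoding step can be illustrated as follows. The sketch assumes that each biallelic SNP yields three binary markers, one per genotype, which is inferred from 49’034 × 3 = 147’102 and from marker names such as GG_OARX_63571789.1; the function `recode_snp`, the alleles and the sample genotypes are hypothetical, and missing genotypes are not handled.

```python
def recode_snp(snp_name, alleles, genotypes):
    """Recode one biallelic SNP into three presence/absence markers.

    alleles:   pair of allele letters, e.g. ("A", "G")
    genotypes: one two-letter genotype per individual, e.g. "AG"
    """
    first, second = alleles
    states = [first + first, first + second, second + second]
    markers = {}
    for state in states:
        key = sorted(state)   # "AG" and "GA" are the same genotype
        markers[f"{state}_{snp_name}"] = [
            1 if sorted(g) == key else 0 for g in genotypes
        ]
    return markers

# Hypothetical genotypes for five animals at one SNP
markers = recode_snp("OARX_63571789.1", ("A", "G"), ["AA", "AG", "GA", "GG", "AA"])
n_binary_markers = 49_034 * 3
```

Each binary marker can then serve as the response of a logistic regression, since it records presence (1) or absence (0) of one genotype per animal.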
Environmental data
Individual positions not available for each breed
Breeds were located at the centroid of their spatial distributions
105 environmental variables
● Climatic Research Unit: 8 environmental variables (monthly + yearly means)
● SRTM30: altitude
Results
15’456’105 models
Threshold
p-value = 0.001
Bonferroni p = $6.47 \times 10^{-11}$
Score = 42.67
Significant models
1’425’159 models
58’290 markers
Conservative test
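The thresholds above can be checked with a few lines of arithmetic. The sketch assumes the score threshold is the chi-square quantile with 1 degree of freedom (as for a univariate model) at the Bonferroni-corrected level; the bisection search and the function names are illustrative choices.

```python
import math

def chi2_pvalue_1df(score):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(score / 2.0))

def score_threshold(p_target, low=0.0, high=200.0):
    """Score whose p-value equals p_target, found by bisection."""
    for _ in range(200):
        mid = (low + high) / 2.0
        if chi2_pvalue_1df(mid) > p_target:
            low = mid    # p-value still too large: need a higher score
        else:
            high = mid
    return (low + high) / 2.0

n_models = 15_456_105          # number of models on the slide
alpha = 0.001                  # nominal p-value threshold on the slide
bonferroni_p = alpha / n_models
threshold = score_threshold(bonferroni_p)
```

Under these assumptions, `bonferroni_p` and `threshold` agree with the slide's 6.47 × 10^-11 and 42.67 to the precision shown.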
[Figure: counts of significant models and significant markers (y-axis "Counting", 0 to 250000) against Wald score (x-axis, 100 to 400)]
Marker              Env_1          Gscore  WaldScore  Correlation  P (Lositan)
GG_OARX_63571789.1  SUN-September  960.48  474.39     -            0.995558
GG_OARX_63571789.1  SUN-July       967.51  452.89     -            0.995558
GG_OARX_63571789.1  SUN-August     955.00  449.16     -            0.995558
GG_OARX_63571789.1  DTR-September  879.24  447.13     -            0.995558
GG_OARX_63571789.1  RDO-August     841.10  439.61     +            0.995558
AA_OARX_63571789.1  SUN-August     796.92  458.60     +            0.995558
AA_OARX_63571789.1  RDO-August     803.47  458.28     -            0.995558
AA_s43015.1         SUN-October    687.88  401.91     -            0.994685
AA_s43015.1         RDO-January    656.77  390.10     +            0.994685
AA_OARX_59578440.1  RDO-August     601.51  397.51     +            0.999846
Ongoing developments
Geographically Weighted Regression
● Analysis of spatially varying relationships
  (Fotheringham, Brunsdon and Charlton. Geographically Weighted Regression. Wiley, 2002)
● Better understanding of local processes
● Possibly sort out false positive detections
● Implementation almost finished
● Synthesise and interpret local coefficients
  ○ Already many studies on linear models
  ○ Generalised linear models framework is a bit different
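The idea of spatially varying coefficients can be sketched with a locally weighted linear fit. This is a simplification: it uses a linear model, whereas the slides target logistic models, and the Gaussian kernel, the bandwidth and the toy data are assumptions chosen for illustration.

```python
import math

def gaussian_weights(target, locations, bandwidth):
    """Kernel weights: samples close to the target location count more."""
    return [math.exp(-0.5 * (math.dist(target, loc) / bandwidth) ** 2)
            for loc in locations]

def local_linear_fit(target, locations, x, y, bandwidth):
    """Weighted least squares fit of y = a + b * x around one target location."""
    w = gaussian_weights(target, locations, bandwidth)
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Toy data: the relationship between x and y differs in the west and in the east
locations = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (11, 0), (12, 0), (13, 0)]
x = [0, 1, 2, 3, 0, 1, 2, 3]
y = [0, 1, 2, 3, 0, 3, 6, 9]     # slope 1 in the west, slope 3 in the east

_, slope_west = local_linear_fit((1.5, 0), locations, x, y, bandwidth=2.0)
_, slope_east = local_linear_fit((11.5, 0), locations, x, y, bandwidth=2.0)
```

A single global regression would average the two regimes, while the local fits recover a different coefficient at each location, which is the behaviour the bullets above refer to.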
High performance computing
Whole genome sequencing
● Project NextGen
● Several million SNPs (before recoding)
● Huge number of models
● Focus on univariate logistic regressions
Grid computing
● CoreSAM
● Small version of Samβada, written in C
● Up to 7 times faster
Conclusion
● Landscape genomics
  ○ Molecular data
  ○ Environmental information and spatial analysis
  ○ Study local adaptation
  ○ Clues about biological functions of markers
● Results are consistent with outlier detection methods
  ○ Direct association studies based on individuals
  ○ Scalable for whole genome sequencing
  ○ Requirement: records of sampling locations
● Samβada: open-source integrated solution
  ○ Help evolutionary biologists
  ○ Advise conservation managers