Utilization of the Scythe C++ open source library for statistical geocomputation in livestock landscape genomics

Sylvie Stucki [1], Saif Agha [2], Menghua Li [3], Stéphane Joost [1]

[1] Geographic Information Systems Laboratory, EPFL, Lausanne
[2] Animal production dept, Faculty of Agriculture, Ain Shams University, Cairo, Egypt
[3] MTT Agrifood Res Finland, Biotechnol & Food Res, Jokioinen, Finland

October 26, 2012

Slides: ogrs2012.heig-vd.ch/public/ogrs2012/slides/Stucki.pdf



Outline

Introduction

Methods and tools

Case study

Ongoing developments

Conclusion


Introduction

Population genetics

Founded by Fisher, Haldane and Wright ∼ 1930
Change in allele frequency under evolutionary forces
Variable part of genome ∼ 0.1%

Neutral vs adaptive variation

● Outlier detection
  ○ Neutral and adaptive markers behave differently
  ○ Summary statistics between populations (FST)
  ○ Simulate neutral distributions or coalescent time (Lositan)

● Correlative methods
  ○ Organisms adapt to their habitat
  ○ Link genetic variation with environmental features
  ○ Landscape genetics (MatSAM)


Landscape genetics

[Figure: diagram with the labels "Genetics", "Environment" and "Geo"]


Methods and tools

Landscape genomics

Search for direct associations between genome and environment

Spatial coincidence

Molecular markers + Environmental parameters

Association models

Multiple univariate logistic regressions
⇒ Keep the most significant ones

[Figure: presence of marker GG_OARX_63571789.1 (y-axis, 0.0 to 1.0) against sunshine in September (x-axis, 30 to 80 % of daylength)]


Logistic regressions

For a specific marker, model with q variables, at each location i

$$\ln\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + x_{i1}\beta_1 + \cdots + x_{iq}\beta_q = x_i^T \beta$$

$\pi_i$: probability of presence of the marker
$x_i$: vector of environmental variables
$\beta$: vector of q + 1 coefficients

● Maximum likelihood estimation
● Log-likelihood ratio (G) and Wald tests with Bonferroni correction
● Model is significant ⇔ Both tests reject $H_0$
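The estimation and testing steps listed above can be sketched for the univariate case (q = 1). This is a minimal illustration in plain Python, not the Samβada implementation: the function names, the Newton-Raphson solver and the toy data are assumptions made for the example, and the threshold 42.67 is the score quoted on the results slide.

```python
import math


def _likelihood_terms(b0, b1, x, y):
    """Log-likelihood, score vector and information matrix of the logit model."""
    loglik = g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        eta = b0 + b1 * xi                      # linear predictor x_i^T beta
        p = 1.0 / (1.0 + math.exp(-eta))        # pi_i
        loglik += yi * eta - math.log1p(math.exp(eta))
        g0 += yi - p
        g1 += (yi - p) * xi
        w = p * (1.0 - p)
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return loglik, (g0, g1), (h00, h01, h11)


def fit_univariate_logistic(x, y, max_iter=100, tol=1e-10):
    """Maximum likelihood fit by Newton-Raphson; returns b0, b1, G and Wald scores."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        _, (g0, g1), (h00, h01, h11) = _likelihood_terms(b0, b1, x, y)
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det     # first row of H^-1 g
        step1 = (h00 * g1 - h01 * g0) / det     # second row of H^-1 g
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    loglik, _, (h00, h01, h11) = _likelihood_terms(b0, b1, x, y)
    var_b1 = h00 / (h00 * h11 - h01 * h01)      # variance of b1 from H^-1
    # Null model: intercept only, so pi is the observed marker frequency.
    n, n1 = len(y), sum(y)
    p0 = n1 / n
    loglik_null = n1 * math.log(p0) + (n - n1) * math.log(1.0 - p0)
    g_score = 2.0 * (loglik - loglik_null)      # log-likelihood ratio statistic
    wald_score = b1 * b1 / var_b1               # Wald statistic for b1
    return b0, b1, g_score, wald_score


def is_significant(g_score, wald_score, threshold):
    """A model is kept only if both tests exceed the corrected threshold."""
    return g_score > threshold and wald_score > threshold


# Toy data (hypothetical): sunshine percentage and marker presence.
x_demo = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
y_demo = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
b0_demo, b1_demo, g_demo, wald_demo = fit_univariate_logistic(x_demo, y_demo)
print(b0_demo, b1_demo, g_demo, wald_demo)
```

With ten observations the toy scores stay far below the corrected threshold; the sketch only shows how the two statistics are obtained from one fit.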


Samβada

● Successor of MatSAM, which runs in MATLAB
● Open source stand-alone application, easy to install
● Written in C++ for Windows, Unix and Mac OS X
● Uses the Scythe Statistical Library [1], an open-source library for matrix calculations
● Currently in console mode
  ○ graphical interface on the wish list
  ○ simple parameter file

[1] scythe.wustl.edu/


Input files

[Figure: diagram with the labels "Genetics", "Environment" and "Geo"]

Parameter file:

HEADERS Y
WORDDELIM ";"
NUMVARENV 14
NUMMARK 140
NUMINDIV 385
IDINDIV "Ind"
COLSUPENV "x" "y"
DIMMAX 1
SAVETYPE END ALL
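To make the structure of the parameter file concrete, here is a small reading sketch. It assumes one keyword per line followed by its values, with quoted strings kept as single tokens; the `parse_params` helper is hypothetical and is not part of Samβada.

```python
import shlex

# Parameter file content as shown on the slide, one keyword per line (assumed layout).
PARAM_TEXT = '''HEADERS Y
WORDDELIM ";"
NUMVARENV 14
NUMMARK 140
NUMINDIV 385
IDINDIV "Ind"
COLSUPENV "x" "y"
DIMMAX 1
SAVETYPE END ALL
'''


def parse_params(text):
    """Map each keyword to the list of values that follow it on the same line."""
    params = {}
    for line in text.splitlines():
        tokens = shlex.split(line)   # honours double quotes, e.g. ";" or "Ind"
        if not tokens:
            continue                 # skip blank lines
        params[tokens[0]] = tokens[1:]
    return params


params = parse_params(PARAM_TEXT)
print(params["NUMMARK"], params["COLSUPENV"])
```

Values are kept as strings here; converting `NUMVARENV`, `NUMMARK` and `NUMINDIV` to integers would be a natural next step.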


Current stage

Release is imminent

● Multivariate models
● Measure spatial autocorrelation
  ○ Global and local Moran's I for both molecular and environmental data
  ○ Export results as shapefile with shapelib [2]
● Two times faster than MatSAM
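As an illustration of the spatial autocorrelation measures mentioned above, the sketch below computes global and local Moran's I from a list of values and a spatial weight matrix. The function names and the four-point example are assumptions for this example; the weighting scheme actually used by Samβada is not given in the slides.

```python
def morans_i(values, weights):
    """Global Moran's I for values observed at n locations.

    weights[i][j] is the spatial weight between locations i and j.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_w = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / total_w) * (cross / sum(d * d for d in dev))


def local_morans_i(values, weights):
    """Local Moran's I (one value per location); they sum to total weight * global I."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    m2 = sum(d * d for d in dev) / n
    return [dev[i] / m2 * sum(weights[i][j] * dev[j] for j in range(n))
            for i in range(n)]


# Four locations on a line, neighbours share a binary weight (hypothetical data).
demo_values = [1.0, 2.0, 3.0, 4.0]
demo_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
global_i = morans_i(demo_values, demo_weights)
local_i = local_morans_i(demo_values, demo_weights)
print(global_i, local_i)
```

A positive global value, as in this smoothly increasing example, indicates that neighbouring locations carry similar values.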

Open source

● Promote landscape genomics
● Share our tools to speed up research
● Licensing is annoying

[2] shapelib.maptools.org/


Case study

Context

International Sheep Genetics Consortium [3]

"... find genes associated with production, quality and disease traits"

Natural selection in livestock?

Domestication ∼ 10'000 years ago in the Middle East
Herds migrated worldwide with humans
Traits of interest: robustness and disease resistance → local adaptation
Still the main breeding strategy in developing countries

[3] www.sheephapmap.org


World is changing fast

Artificial selection ∼ 50 years
Cosmopolitan breeds (Sheep: Merino, Cow: Holstein)
● highly productive
● demanding (food, medicine)
● low genetic diversity, inbreeding
● prone to epizootics

Crossbreeding between traditional and cosmopolitan breeds
Fitness of offspring is questioned

Conservation in livestock
Preserve genetic diversity to ensure future farming


Genetic data

Provided by the Sheep HapMap Project (ISGC)
1479 animals from 37 local breeds

Genotyping

● 50k SNP BeadChip array
● 49'034 SNPs
● 147'102 binary markers after recoding
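The marker count above is three times the SNP count, which fits a recoding of each biallelic SNP into one binary marker per genotype. The sketch below illustrates that idea under this assumption; the `recode_snp` helper and the sample genotypes are hypothetical, and the naming pattern simply mimics marker names such as GG_OARX_63571789.1 from the results table.

```python
def recode_snp(snp_name, genotypes, alleles):
    """Turn one biallelic SNP into three presence/absence markers.

    genotypes: one two-letter genotype per individual, e.g. "AG".
    alleles:   the two alleles of the SNP, e.g. ("A", "G").
    """
    a, b = sorted(alleles)
    labels = [a + a, a + b, b + b]
    # Sort the letters so that "GA" and "AG" count as the same genotype.
    normalised = ["".join(sorted(g)) for g in genotypes]
    return {
        f"{label}_{snp_name}": [1 if g == label else 0 for g in normalised]
        for label in labels
    }


demo_markers = recode_snp("OARX_63571789.1", ["AA", "AG", "GA", "GG"], ("A", "G"))
for name, presence in demo_markers.items():
    print(name, presence)

# Three markers per SNP reproduces the count on the slide.
n_binary_markers = 49034 * 3
```

Each resulting column can then serve as the binary response of one logistic regression.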


Environmental data

Individual positions not available for each breed
Breeds were located at the centroid of their spatial distributions

105 environmental variables

● Climatic Research Unit: 8 environmental variables (monthly + yearly means), i.e. 8 × 13 = 104 variables
● SRTM30: altitude


Results

15’456’105 models

Threshold
● p-value = 0.001
● Bonferroni p = 6.47 × 10^-11
● Score = 42.67

Significant models
● 1'425'159 models
● 58'290 markers
● Conservative test
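The threshold values above can be reproduced with a few lines of arithmetic. The sketch assumes that the score threshold is the critical value of a chi-square distribution with one degree of freedom at the Bonferroni-corrected level, obtained here by squaring a standard normal quantile; the function names are illustrative.

```python
from statistics import NormalDist


def bonferroni_level(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


def chi2_1df_critical(p):
    """Critical value of a chi-square(1) variable with upper-tail probability p.

    A chi-square(1) variable is a squared standard normal, so the critical
    value is the square of the two-sided normal quantile.
    """
    z = NormalDist().inv_cdf(p / 2.0)
    return z * z


n_models = 15456105                      # number of models given on the slide
p_corrected = bonferroni_level(0.001, n_models)
score_threshold = chi2_1df_critical(p_corrected)
print(p_corrected, score_threshold)
```

Under these assumptions the corrected level and the score agree with the 6.47 × 10^-11 and 42.67 quoted on the slide.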

[Figure: count of significant models and significant markers (y-axis "Counting", 0 to 250000) against Wald score (x-axis, 100 to 400)]


Marker              Env_1          Gscore  WaldScore  Correlation  P (Lositan)
GG_OARX_63571789.1  SUN-September  960.48  474.39     -            0.995558
GG_OARX_63571789.1  SUN-July       967.51  452.89     -            0.995558
GG_OARX_63571789.1  SUN-August     955.00  449.16     -            0.995558
GG_OARX_63571789.1  DTR-September  879.24  447.13     -            0.995558
GG_OARX_63571789.1  RDO-August     841.10  439.61     +            0.995558
AA_OARX_63571789.1  SUN-August     796.92  458.60     +            0.995558
AA_OARX_63571789.1  RDO-August     803.47  458.28     -            0.995558
AA_s43015.1         SUN-October    687.88  401.91     -            0.994685
AA_s43015.1         RDO-January    656.77  390.10     +            0.994685
AA_OARX_59578440.1  RDO-August     601.51  397.51     +            0.999846


Ongoing developments

Geographically Weighted Regression

● Analysis of spatially varying relationships (Fotheringham, Brunsdon and Charlton. Geographically Weighted Regression. Wiley, 2002)
● Better understanding of local processes
● Possibly sort out false positive detections
● Implementation almost finished
● Synthesise and interpret local coefficients
  ○ Already many studies on linear models
  ○ Generalised linear models framework is a bit different


High performance computing

Whole genome sequencing

● Project NextGen
● Several million SNPs (before recoding)
● Huge number of models
● Focus on univariate logistic regressions

Grid computing

● CoreSAM
● Small version of Samβada, written in C
● Up to 7 times faster


Conclusion

● Landscape genomics
  ○ Molecular data
  ○ Environmental information and spatial analysis
  ○ Study local adaptation
  ○ Clues about biological functions of markers

● Results are consistent with outlier detection methods
  ○ Direct association studies based on individuals
  ○ Scalable for whole genome sequencing
  ○ Requirement: records of sampling locations

● Samβada: open-source integrated solution
  ○ Help evolutionary biologists
  ○ Advise conservation managers


End

Thank you!

Contact: [email protected]
