Alexei Fedorov, Ph.D.

Associate Professor

Head of Bioinformatics Lab

Department of Medicine

Vice Director

Program in Bioinformatics and Genomics/Proteomics

Tel: (419)‑383‑5270

Email: [email protected]

http://bpg.utoledo.edu/~afedorov/lab/

May 2011

Bioinformatics Lab in 2013-2014

PhD students

Shuhao Qiu

Masters students

Ahmed Al-Khudair

Current grants

NSF Career Development 2007-2012 “Investigation of intron cellular roles”

MAJOR GOAL: Bioinformatics Investigation of the Human Genome

Education in Bioinformatics (TWO TYPES OF STUDENTS)

• Computer/math background gain experience in Biology (Sam, Andy)

• Biological background gain experience in programming (Dave, Maryam)

• Example of a computational project: Binary-abstracted Markov models and their application to sequence classification http://etd.ohiolink.edu/view.cgi?acc_num=mco1271271172

http://bpg.utoledo.edu/~sshepard/defense/ video

Genomic MRI http://bpg.utoledo.edu/gmri/

http://www.jove.com/Details.php?ID=2663

Job prospects (example: Ashwin Prakash)

PhD – November 2011, HSC UT

PhD research fellow -- from January 2011

Johns Hopkins School of Medicine

Declined offers:

• Cold Spring Harbor Laboratory

• Baylor College of Medicine

The PI’s students received the following awards:

• Jason Bechtel, Outstanding MSBS student in 2008 at HSC UT.

• Theodor Rais, Second/Third Poster award by Ohio Bioinformatics Consortium, 2009.

• Samuel Shepard, Outstanding PhD student in 2010 at HSC UT.

• Lorraine Walters, Undergraduate Research Recognition Award, UT May 2012.

• Arnab Saha-Mandal, 1) Outstanding MSBS student in 2013 at HSC UT; and 2) Canadian Institute of Health Research fellowship support ($20,000).

• Jasmine Serpen, 1) Ohio Governor's Thomas Edison Award for Excellence in Biotechnology & Biomedical Technologies-1st place; and 2) OSERA Biomedical Research/Bioengineering Award-1st place (for high school students).

Program in Bioinformatics and Genomics/Proteomics (BPG)

• http://hsc.utoledo.edu/depts/bioinfo/

• BPG offers a Certificate in association with the degrees of Doctor of Philosophy (Ph.D.) or Doctor of Medicine (M.D.). BPG also offers a Master of Science in Biomedical Sciences (MSBS).

Two courses in Spring semester:

• Application of Bioinformatics, Proteomics, and Genomics (BIPG 640) or “Advanced Bioinformatics” (should be taken after “Fundamental Bioinformatics” of Dr. Trumbly)

• Introduction to Bioinformatic Computation (BIPG 610). The main goal of this course is to provide basic programming skills to biological and medical students who may lack a background in computer science. Programming is taught through important biological examples, with a particular focus on the Perl language.

No programming skills are required!

In the “Introduction to Bioinformatic Computation” course, rather than doing “cookbook” lab exercises, students work on real-world, challenging problems whose resolution advances the field of genome biology. In addition to programming and other bioinformatic skills, students learn how to present the final product of bioinformatic research and how to write a scientific paper on the subject.

• In 2005 the class developed a program to identify novel genes for non-coding RNAs in humans and other mammals. This work resulted in an article in Nucleic Acids Research (publication 44 below), coauthored by the students who actively worked on the project.

• In 2006 the class created a novel public database (ASMD) and a novel computational resource, “Splicing Potential”. Ten students were co-authors on two manuscripts (publications 51 and 52 below).

• In 2007 the class participated in the “Genomic MRI” project. Seven of these students are co-authors of a paper in BMC Genomics, 2008 (publication 53 below).

• The 2008 class continued the “Genomic MRI” project. They performed whole-genome comparisons for human, chimpanzee, and macaque, and analyzed the distribution of 4 million SNPs inside and outside MRI regions. The results are being prepared for publication in Genome Research with 6 students among the authors.

Publications with IBC students

54. Prakash A., Shepard S., Mileyeva-Biebesheimer O., He J., Hart B., Chen M., Amarachintha S., Bechtel J., Fedorov A. Molecular forces shaping human genomic sequence at mid-range scales. BMC Genomics 2009, 10:513.

53. Bechtel J.M., Wittenschlaeger T., Dwyer T., Song J., Arunachalam S., Ramakrishnan S.K., Shepard S., Fedorov A. Genomic mid-range inhomogeneity correlates with an abundance of RNA secondary structures. BMC Genomics 2008, 9:284.

52. Bechtel J. M., Rajesh P., Ilikchyan I., Deng Y., Mishra P.K., Wang G., Wu X., Afonin K., Grose W., Wang Y., Khuder S., and Fedorov A. Calculation of Splicing Potential from the Alternative Splicing Mutation Database. BMC Research Notes 2008, 1:4.

51. Bechtel J. M., Rajesh P., Ilikchyan I., Deng Y., Mishra P.K., Wang G., Wu X., Afonin K., Grose W., Wang Y., Khuder S., and Fedorov A. The Alternative Splicing Mutation Database: a hub for investigations of alternative splicing using mutational evidence. BMC Research Notes 2008, 1:3.

44. Fedorov A, Stombaugh J., Harr M.W., Yu S., Nasalean L., Shepelev V. Computer identification of snoRNA genes using a Mammalian Orthologous Intron Database. Nucl. Acids Res. 2005. 33, 4578-4583.

http://www.utoledo.edu/centers/brim/index.html

COURSE: Bioinformatics of Biomarkers and Individualized Medicine, Spring 2012

• Course time line: 14 weeks

• No prerequisites; recommended: introduction to bioinformatics and molecular biology

• Reserve materials: None

• Unit 1 Biomarker discovery and validation

• Unit 2 Individualized Medicine

Investigation of the human genome

BASE COUNT 846302 a 578512 c 575805 g 843114 t 1703 othersORIGIN 1 gaattcaaaa aagaaagaca atgacttgta gctgaagcta tgatcaggaa aagatggggt 61 ggacggcatt tgagaaaatc aggacagtgg tgtacttatc aaataagaag atctgggcag 121 aagattgttg aaaaagcaga cacagcactg agtagcagca tggagcagaa aagcataagg 181 aacaagtagt gcagtgtgcc tgaacatagg atgggaaatt aggaaagata aatggaggct 241 gactgtggga agccttacat tccaggctta gtggaataag taaatattta aatctcatga 301 gttcttttct ctctgctttc tatttttcac gacctgaact cacctcccag tgaggagatg 361 tttccaccta gcactaaaca gtaactagtt cagactatat atttaaaaaa aaaaaaaaaa 421 aaaaaaaaaa gcagaacagc tcagatcatc cagtgaagtg gtgctactat tatactatta 481 acggggagat gaaagccaga taagatggag aagtaggaaa tttacgaaac attttaaaag 541 aaaatttatt tattcatcaa tatttacata aatgtttatt aattctaagt actatagtag 601 gcacccattt attactttca aaaattgaca atatacaagt taataaaatc atattagttt 661 cctcttctaa taaaattatc tcactcaaat tcatataact aaaaatacat ttaataaatt 721 ttatttttaa aatataggcc acttctactc tattcatttt tgcacttaac attctcttgc 781 tttcaaaaat gtatgaaaaa tttcagttta gtccccacca aatctcaatt tagaccccgg 841 ataaagagta aataaattaa agagctgtca gaattaaaac actactacag gtctccttca 901 ctttatggca tagatgaagg caggaaatac tggctgaaaa ttttgtttat gtcaaagatt 961 ttgatgatta ccatcagaga tctgatatct cagggaagaa aagcctttca tataccactt 1021 aaaaaattct gccaggcgcg gtggctcacg cctgtaatcc cagcactttg ggaggctgag 1081 gtgggcagat cacctgaggt cagaagttcg agaccagcct gaccaacatg gagaaaccct 1141 gtctctacta aaaatacaaa atcagccggg cgtggtggcg catgcctgta atcccagcta 1201 cttgggaggc tgaggcagga gaatcacttg aacccaggag gcagaggttg cggtgagccg 1261 agatcacacc attgcactcc agcctgggca acaagggcga aactctgtct caaaaaaaaa 1321 aaaacttctg gggaaatggt ggcctgcctt gtaacatcta tgtgtcttag agggccatgg 1381 tatgacaccc ttgggcagtc atttatagag tccttccctg accagggaat catcctgcca

... after the first 50 pages ..

141601 cagcaccaaa tcctctcatt gcctttttaa aaaatgttgt ccaatttaac atcaagacac 141661 tgtccatgca atctgttgaa aaatctggct atttgcaaac aaagaaaaaa tgtatagcct 141721 cccacactat atatcaaaat aaacccaagt gtataaaaga gaaaatttta agtgaaacca 141781 aaacttgaaa atattgagat gaatattagt tagagctttg agtaggaaag gattttttga 141841 acagataacc aacagaggaa gtcagaaaac agtaatcatt tccttaatga aaatacaaaa 141901 cttaagtact tcaaaaaagt cattacaata cttaaaaacc ttacaacaat catgtggaaa 141961 gcatttatta caaataattc agaaaaagga tttatatccc taataactaa agaagtgagg 142021 aagaatgcta agatcacatt ttttaaaaag tagctaaagg ataatataaa tgactaacag 142081 acctgaggaa aaaagctaac ctcacaagta ttcaaccaaa taaaataacc tcgagatacc 142141 acttaaaaac ctatcgaaat aacgaagtgt ttggaaaatg acaagattca aaatctggta 142201 agagcagcat ttttccccat tgtggaggga gtgtgtaaat tggtgtggtc tttctgaaaa 142261 gcaattaggc aatcttgtat caaaaatctt caaagtgttc ttactctttg atgaagaatt 142321 ccacacgtgt gaatcctaaa acaattaaaa gtatgaacat atttttatgc acaaagatgt 142381 ttagccaaaa ggaaaacgac ctaaatgacg aatgatgtgc aactgcatgg ataaattgtt 142441 gtatatcaaa atgatgaaat attttgcagc tttgaaaagg taattttgaa aaaactttaa 142501 agacctcaaa aatgcccaaa atatattaat tgaaaaggat acaaaacttt attatttcac 142561 tacgtaatga aacagaatac agttgatcct tgaacaacgc tggtttgaac tgcactcgtc 142621 cacttacatt cagatttttt tctttttgct tttttttttt gagacgaagt ctcactctgt 142681 cacccaggct ggagggcagt ggcaccattc tggctcacta caacctgcgt ataccaggtt 142741 caagcaattc tcctgcctca gcctcccaag tagctggaat tacaggcgcc tgtcaccacg 142801 tccagctaat ttttgtattt ttagtagaga cggagtttca ccatgttggc caggctggtc 142861 tcgaactcct ggcctcaagt aatccacctg cctcagcctc ccaaagtgct gggattacag 142921 gcatcagccg ggtgcggtgg cttatgcctg caatcccatc ctggctaaca cggtgaaacc 142981 ctgtctctac taaaatacaa aaaattagct gagtgtggtg gcacatgcct atagttccag 143041 ctacttggga ggctgaggga tgagaattgc ttgaacctgg gaggcagagg ttgcagtgag 143101 ccgagatcac accactgtac tccagcctgg gcaacagagc aagactccat ctcaaaaaaa 143161 aaaaaaaaaa aaaaaagaaa aagaaaaaga aaaaggtatg ttatgaatgc agaaagtata 143221 tgttgatgct agtctattgt 
gtaatttacc accataaaat atacacaggt ctattataga 143281 agttaaaatg tatcaaaatg tatacacaaa cacttagaga tagtacatgg tatcattccc 143341 agttgagaaa aatgtaagca aacatgaaga tgcagtatta aatcataact gtataaaatt

... after next 200 pages

683041 ggaggtgggg agcgcctctg cccagccgcc ccatctggga ggtggggagc gcctctgtcc 683101 agccaccaac ccatctggga agtgaggagc gcctctgcct ggccaccccg tctgggaagt 683161 gaggagcacc tctgccgggc tgccccgtct gggaagtgtt cccaacagct ctgaagagac 683221 agcgaccatc gagaatgggc catgatgacg atggtggttt tgtcgaaaag aaaaggggga 683281 aatgtgggga aaagaaagag agatcagatt gttactgtgt ctgtgtagaa agaagtagac 683341 ataggagact ccattttgtt ctgtactaag aaaaattctt ctgccttggg atgctgttaa 683401 tctataacct tacccccaaa cccctgctct ctgaaacatg tgctgtgtca actcagggtt 683461 aaatggatta agggcgatgc aagatgtgct ttgttaaaca gatgcttgaa gacagaaaaa 683521 aaaaaagaaa gagaaaaaaa aaatcattga aggattattt atgccctatg gcatcccttt 683581 ctccaacact tgtcacctaa tgaccaggga tcaataccca caaatacagt aagacctatt 683641 tttaaaggtt ttcagcttaa ctgttttgtc tcttaataaa tttttatata ggaaaaaaaa 683701 aagaatgttg aatattggcc cccactctct tctggcttgt agagtttctg cagagagatc 683761 cactgttagt ctgatggctt ccctttgtgg gtaacccagt ctttctttct gcccttaaca 683821 ttttttcctt catttcaacc atggtgaatc tgacaattat gtgtcttggt gttgctcttc 683881 tcaaggagta tctttgtggt gttctctgta tttcctgaat ttgaatattg gcctgtgtgg 683941 ataggttggg gaagttctcc tggataatat cctgaagagt gttttccaac ttggttccat 684001 tctcccagtc actttcaggt acaccaatca aatgtaggtt tggtcttttc acatagtccc 684061 atatttcttg gaggctttgc tcattccttt tcattctttt ttctctaatc ttgtcttcaa 684121 gctttatttc attaagttag tttatatttg actgtgcttt atacttgaca aagcactttc 684181 acatttcttg tcttttttgg gcctgataat tactctgcaa gttaaaaagg aaaaactcca 684241 agtaccatta cgctccgtga ggacagggac tattttgttc attgttgcaa cctaagcact 684301 taatatgttg cctggtccag agtagatact catatataaa tacttgctga ataaagggat 684361 gaatgggtgg gtggttagat gaatggaatt tgccttaatt ttcaagatgg attcaatttc 684421 caattccact tactggtgag aagccttgtc taagtcttta aaccttactt tcctcatcta 684481 taaaacagtg acaatgatat tgtttctgct accacaatgg aaaaaaggac agaattactt 684541 agtgtcatag tgatcaggaa taaagccagg gcttgaagca tctcctgatt cctagggcat 684601 tgtttgtccc aatgtatatg gcagagggag aaagaaaacc gttgagtctt aatctgtcag 684661 gcactatttt atgaacttta 
aaatcctcat agcagggcca ggtgcagtgg ctcacacctg 684721 taatcccagc actttgggag gccaaggcag gcagatcact tgaggtcagg accagcctgt 684781 ccaacgtggt gaaaccacat ctctactaaa aatacaaaaa ttagccaggc gtggtggtgc 684841 atgcctataa tcccagctac ttgggaggct gaggcaggag aaatgcttga acctgggagg 684901 cagaggttgt ggtgagctga gattgtgcca ctgtactcca gcctgggcaa cagaacaaga

Human chromosome 1

4,814,628 lines = 100,000 pages = 100 books (1000 pages each)
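The arithmetic behind this comparison can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for the example, and the lines-per-page figure is not stated on the slide but is implied by its rounded page count.

```python
def implied_lines_per_page(total_lines, total_pages):
    """Lines that must fit on one page for the slide's page count to hold."""
    return total_lines / total_pages


def books_needed(total_pages, pages_per_book):
    """Number of books of a given size needed to hold all the pages."""
    return total_pages // pages_per_book


# Figures taken from the slide.
LINES = 4_814_628
PAGES = 100_000
PAGES_PER_BOOK = 1_000

print(round(implied_lines_per_page(LINES, PAGES), 1))  # roughly 48 lines per page
print(books_needed(PAGES, PAGES_PER_BOOK))
```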

Nature 2012, Sept 6th, v.489, p 46

Lab 2013

The 1000 Genomes Project: A guide to your ancestry

The pattern of human genetic variation is believed to be a key to understanding human population history and diversity. The 1000 Genomes Project has sequenced 1092 genomes from different populations. By identifying the sequences that correspond to populations such as LWK, GBR, JPT, and FIN, we aim to learn more about population genetic patterns and to get a picture of the genetic diversity within these populations. This project uses the 1000 Genomes Project catalogue of human genetic variation to calculate and compare genetic differences among 14 populations. I am presenting the results that our bioinformatics lab team has obtained so far; we are preparing them for a paper. We used Perl programs to compute the number of differences between the genomes of every pair of individuals from the 1000 Genomes Project, for the following 14 populations:

• ASW HapMap African ancestry individuals from SW US
• CEU CEPH individuals
• CHB Han Chinese in Beijing
• CHS Han Chinese South
• CLM Colombian in Medellin, Colombia
• FIN HapMap Finnish individuals from Finland
• GBR British individuals from England and Scotland
• IBS Iberian populations in Spain
• JPT Japanese individuals
• LWK Luhya individuals
• MXL HapMap Mexican individuals from LA, California
• PUR Puerto Rican in Puerto Rico
• TSI Toscani individuals
• YRI Yoruba individuals
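The pairwise comparison described above can be sketched as follows. This is a minimal Python illustration, not the lab's Perl code: the genotype encoding (one character per variant site), the toy individuals, and the function names are all assumptions made for the example.

```python
from itertools import combinations


def count_differences(genotypes_a, genotypes_b):
    """Count the variant sites at which two individuals differ."""
    if len(genotypes_a) != len(genotypes_b):
        raise ValueError("both individuals must cover the same sites")
    return sum(1 for a, b in zip(genotypes_a, genotypes_b) if a != b)


def pairwise_differences(individuals):
    """Return {(id1, id2): differences} for every unordered pair."""
    return {
        (id1, id2): count_differences(individuals[id1], individuals[id2])
        for id1, id2 in combinations(sorted(individuals), 2)
    }


# Hypothetical toy data: 0, 1, 2 = number of alternate alleles at each site.
toy = {
    "IND_A": "0120210",
    "IND_B": "0120200",
    "IND_C": "2100012",
}

for pair, n in pairwise_differences(toy).items():
    print(pair, n)
```

With N individuals this produces N(N-1)/2 pairs, so the number of comparisons grows quadratically with the number of sequenced genomes.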

The graph above illustrates the distribution of genetic differences among the 14 populations. The X axis shows the number of differences (2.7 million to 5.5 million). The Y axis shows the number of pairs (a pair is two individuals compared by counting the genetic differences between their genomes).
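A distribution of this kind can be built by binning the per-pair difference counts. The sketch below is an assumption about how such a histogram might be computed, not the code used for the figure; the bin width of 100,000 is chosen only for illustration. The four input values are LWK pair totals that appear later in this deck.

```python
from collections import Counter


def histogram(pair_differences, bin_width=100_000):
    """Map the lower edge of each bin to the number of pairs falling in it."""
    counts = Counter((d // bin_width) * bin_width for d in pair_differences)
    return dict(sorted(counts.items()))


# Four LWK pair totals taken from the table of relatives in this deck.
lwk_totals = [2756691, 2777456, 2848500, 3004459]

for lower_edge, n_pairs in histogram(lwk_totals).items():
    print(lower_edge, n_pairs)
```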

Figure 2: The graph below shows the 14 populations grouped into 4 distinct origins, which we call 4 ancestries: 1_African, 2_Hybrid, 3_European, 4_Asian.

Figure 3: The three populations of African origin have distributions of total differences that lie close to each other. The LWK population (Luhya individuals) includes some pairs with almost half the usual number of differences (2.7 million to 4.8 million). Almost all of these pairs have been declared as siblings or other relatives. Some of them are not declared as relatives by the 1000 Genomes Project, so our results suggest that the project may contain some undeclared relatives.

We further examined some populations for declared relationships between individuals; relatives showed the smallest differences in their genetic variation. For example, in the LWK population, as shown in the table below, relatives fall at the top of the list when the total differences are sorted from lowest to highest. The green-highlighted cells mark individuals declared as related in the 1000 Genomes Project appendix. For the pairs that are not highlighted, we suggest that they are related in some way that has not been declared by the 1000 Genomes Project.

No.  ID1_LWK  ID2_LWK  Total_LWK differences  Relationship
1    NA19374  NA19373  2756691                Siblings
2    NA19352  NA19347  2777456                Siblings
3    NA19470  NA19443  2848500                Aunt/Uncle
4    NA19397  NA19396  2871776                Siblings
5    NA19444  NA19434  3004459                Siblings
6    NA19334  NA19331  3007478                ?
7    NA19382  NA19381  3070661                uncertain parent/child relationship
8    NA19453  NA19445  3077137                ?
9    NA19470  NA19469  3111728                Niece/Nephew
10   NA19331  NA19313  3119208                ?
11   NA19382  NA19380  3970915                Half Siblings
12   NA19453  NA19444  4106949                ?
13   NA19334  NA19313  4178970                Unknown relation
14   NA19469  NA19443  4236592                Niece/Nephew
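The screening step behind this table, sorting pairs by total differences and picking out low-difference pairs with no declared relationship, might look like the sketch below. The rows are a subset of the table above, with "?" entries represented as `None`. The cutoff value and the function name are hypothetical and chosen only for the example; the deck does not state a numeric threshold.

```python
def undeclared_candidates(pairs, cutoff):
    """Return low-difference pairs that have no declared relationship."""
    ranked = sorted(pairs, key=lambda pair: pair[2])
    return [
        (id1, id2, diffs)
        for id1, id2, diffs, relationship in ranked
        if diffs <= cutoff and relationship is None
    ]


# Subset of the LWK table; None stands for "?" (no declared relationship).
lwk_rows = [
    ("NA19374", "NA19373", 2756691, "Siblings"),
    ("NA19470", "NA19443", 2848500, "Aunt/Uncle"),
    ("NA19331", "NA19313", 3119208, None),
    ("NA19382", "NA19380", 3970915, "Half Siblings"),
    ("NA19453", "NA19444", 4106949, None),
    ("NA19469", "NA19443", 4236592, "Niece/Nephew"),
]

CUTOFF = 4_000_000  # hypothetical threshold, not taken from the deck

print(undeclared_candidates(lwk_rows, CUTOFF))
```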

Figure 4: The CLM, PUR, and MXL populations show a very wide distribution, ranging from 3.1 million to 4.86 million differences. Our results indicate that these populations have a wide range of mixed ancestry. The PUR population has a second peak on the right side (4.74 million to 4.9 million); we expect that these individuals have a different ancestry. Further investigation of these individuals is under way to determine where their ancestry comes from.

Figure 5: Populations FIN, GBR, TSI, CEU, and IBS. All of these populations are of European origin. The IBS population shows a very low curve because only 13 individuals from this population have been sequenced.

Figure 6: The populations of Asian origin are closely related, as shown by their very similar distributions, which range between 3.4 million and 3.69 million differences.

We are further investigating the pairs of individuals with the highest numbers of differences, because we suggest that these individuals may have a different origin. We examined the 40 highest-difference pairs in some populations and found that certain individuals appeared in these pairs far more often than others. An example is shown in the figure below.

The list below gives the CLM pairs with the highest genetic differences. Looking at the individuals in these pairs, we noticed that some of them appear significantly more often than others, as shown in the list of repeats on the right. HG01551 and HG01342 each appear 20 times among the highest-difference pairs, while others appear only 2 or 3 times. We therefore decided to investigate the possibility that these individuals have a different origin.

• HG01551 4479513 HG01136
• HG01365 4480834 HG01342
• HG01342 4481529 HG01250
• HG01551 4481637 HG01250
• HG01551 4483529 HG01375
• HG01551 4485279 HG01125
• HG01488 4487693 HG01342
• HG01366 4488647 HG01342
• HG01551 4490996 HG01259
• HG01342 4493212 HG01271
• HG01342 4493218 HG01277
• HG01377 4494064 HG01342
• HG01462 4494414 HG01390
• HG01551 4496682 HG01365
• HG01461 4497146 HG01342
• HG01342 4498051 HG01125
• HG01551 4499694 HG01148
• HG01551 4499713 HG01345
• HG01375 4500523 HG01342
• HG01551 4501432 HG01134
• HG01551 4503181 HG01495
• HG01389 4506393 HG01342
• HG01342 4508562 HG01148
• HG01551 4510222 HG01377
• HG01342 4514486 HG01134
• HG01551 4519187 HG01389
• HG01342 4520380 HG01124
• HG01440 4527415 HG01342
• HG01342 4533004 HG01275
• HG01342 4535490 HG01272
• HG01551 4537772 HG01272
• HG01551 4541901 HG01488
• HG01551 4542804 HG01461
• HG01551 4558088 HG01462
• HG01551 4561600 HG01275
• HG01390 4562418 HG01342
• HG01462 4564478 HG01342
• HG01551 4577349 HG01440
• HG01551 4608288 HG01390
• HG01551 4678948 HG01342
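Counting how often each individual appears among the highest-difference pairs is a simple tally. The sketch below is an illustration in Python rather than the lab's own code, and the function name is made up for the example. It uses only the first six pairs of the list above to stay short.

```python
from collections import Counter


def appearance_counts(pairs):
    """Count how many of the given pairs each individual takes part in."""
    counts = Counter()
    for id1, _differences, id2 in pairs:
        counts[id1] += 1
        counts[id2] += 1
    return counts


# First six pairs of the CLM list above: (individual, differences, individual).
top_pairs_excerpt = [
    ("HG01551", 4479513, "HG01136"),
    ("HG01365", 4480834, "HG01342"),
    ("HG01342", 4481529, "HG01250"),
    ("HG01551", 4481637, "HG01250"),
    ("HG01551", 4483529, "HG01375"),
    ("HG01551", 4485279, "HG01125"),
]

for individual, n in appearance_counts(top_pairs_excerpt).most_common(3):
    print(individual, n)
```

Run over all 40 pairs in the list, the same tally gives 20 appearances each for HG01551 and HG01342, which matches the counts reported in the text.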

The idea was to take the frequently repeated high-difference individuals together with 10 controls from the same population, chosen because they showed an average number of genetic differences within that population. We then randomly took individuals from other populations and calculated the genetic differences between our group (10 controls plus 2 high-repeat individuals) and one control individual from each other population.

The comparison below is between the 10 CLM controls plus the 2 high-repeat individuals (HG01551 and HG01342) and one control individual from the YRI population (Yoruba individuals, African ancestry). HG01551 and HG01342 had the lowest differences, indicating that these two individuals might be of African origin.

We further compared the same CLM group with an individual from an African population (LWK) and an individual from an Asian population (CHS).

The two high-repeat individuals showed the lowest genetic difference against the LWK individual and the highest difference against the CHS individual. This suggests that these two individuals from the CLM population are of African origin.
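The logic of this control comparison can be sketched as a ranking check: compute each CLM individual's difference against one reference individual and see whether the two candidates rank lowest. The difference counts below are invented purely to illustrate the check and are not results from the study; the control names and function names are also hypothetical.

```python
def rank_against_reference(diffs_to_reference):
    """Sort individuals by their difference count to one reference individual."""
    return sorted(diffs_to_reference, key=diffs_to_reference.get)


def candidates_are_closest(diffs_to_reference, candidates):
    """True if the candidates occupy the lowest ranks against the reference."""
    ranked = rank_against_reference(diffs_to_reference)
    return set(ranked[:len(candidates)]) == set(candidates)


# Hypothetical difference counts against a single YRI individual.
toy_vs_yri = {
    "HG01551": 4_100_000,
    "HG01342": 4_150_000,
    "CONTROL_1": 4_900_000,
    "CONTROL_2": 4_950_000,
}

print(candidates_are_closest(toy_vs_yri, ["HG01551", "HG01342"]))
```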

CLM - LWK

CLM - CHS

Conclusions

• Total variants showed substantial geographic differentiation.

• The total number of differences distinguishes populations that are more geographically and ancestrally remote.

• Populations are grouped by the predominant component of ancestry: Europe (CEU, TSI, GBR, FIN and IBS), Africa (YRI, LWK and ASW), East Asia (CHB, JPT and CHS) and the Americas (MXL, CLM and PUR).

• Relatives within the same population have significantly fewer genotype differences (almost half the number) than non-relatives.

• The study of human genetic variation has evolutionary significance. It can help to understand ancient human population migrations as well as how different human groups are biologically related to one another.