An Introduction to Bioinformatics (high-school version) Ying Xu Institute of Bioinformatics, and...

Date posted: 19-Dec-2015

Page 1: An Introduction to Bioinformatics (high-school version) Ying Xu Institute of Bioinformatics, and Biochemistry and Molecular Biology Department University.

An Introduction to Bioinformatics (high-school version)

Ying Xu

Institute of Bioinformatics, and Biochemistry and Molecular Biology Department

University of Georgia

[email protected]@bmb.uga.edu

Page 2:

The Basics

genes

cell chromosome

ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggta…………………………………

genome and sequencing

protein

metabolic pathway/network

Page 3:

Bioinformatics (or computational biology)

• This interdisciplinary science … is about providing computational support to studies on linking the behavior of cells, organisms and populations to the information encoded in the genomes

– Temple Smith

ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggta…………………………………

Page 4:

Information Encoded in Genomes

• What information? And how to find and interpret it?

• Working molecules (proteins, RNAs) in our cells

ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggta…………………………………

bacterial cell

Page 5:

Information Encoded in Genomes

• How to find where protein-encoding genes are in a genome?

• A genome is like a book written in “words” consisting of 4 letters (A, C, G, T), and each protein-encoding gene is like an instruction about how the protein is made

• People have found that six-letter words (e.g., AAGTGC) occur at different frequencies in genes than in non-gene regions

ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggta…………………………
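The idea of six-letter word frequencies can be sketched in a few lines of Python. This is only an illustration: the function name `word_frequencies`, the `step` parameter, and the toy sequence are assumptions made for this example and do not come from the slides.

```python
from collections import Counter

def word_frequencies(dna, k=6, step=1):
    """Return {word: fraction of all words} for the k-letter words in dna.

    step=1 slides the window one letter at a time (overlapping words);
    step=k reads the sequence as consecutive, non-overlapping words.
    """
    dna = dna.upper()
    words = [dna[i:i + k] for i in range(0, len(dna) - k + 1, step)]
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# Made-up toy sequence, read as three consecutive six-letter words.
region = "AAGTGCAAGTGCTTTAAA"
freqs = word_frequencies(region, step=6)
for word, fraction in sorted(freqs.items()):
    print(word, round(fraction, 3))
```

Running the same function on known gene regions and on known non-gene regions would give the two frequency tables that the following slides compare.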

Page 6:

Information Encoded in Genomes

Frequency in genes (AAA ATT) = 1.4%; Frequency in non-genes (AAA ATT) = 5.2%
Frequency in genes (AAA GAC) = 1.9%; Frequency in non-genes (AAA GAC) = 4.8%
Frequency in genes (AAA TAG) = 0.0%; Frequency in non-genes (AAA TAG) = 6.3%
….

AAAATTAAAATTAAAGACAAAATTAAAGACAAACACAAAATTAAATAGAAATAGAAAATT …..

Is this a gene or non-gene region if you have to make a bet?
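One way to make the bet is to add up, word by word, the logarithm of (frequency in genes / frequency in non-genes). The sketch below uses only the three frequencies listed on this slide. Two choices here are assumptions of this example, not statements from the slide: the word AAACAC has no listed frequency, so it is skipped, and a word with 0.0% frequency in genes is treated as decisive evidence for "non-gene".

```python
import math

# Frequencies in percent, copied from the slide: word -> (in genes, in non-genes)
FREQ = {
    "AAAATT": (1.4, 5.2),
    "AAAGAC": (1.9, 4.8),
    "AAATAG": (0.0, 6.3),
}

# The sequence shown on the slide (without the trailing dots).
REGION = "AAAATTAAAATTAAAGACAAAATTAAAGACAAACACAAAATTAAATAGAAATAGAAAATT"

def bet(region, freq):
    """Return (verdict, score, skipped words) for consecutive 6-letter words."""
    score = 0.0
    skipped = []
    for i in range(0, len(region) - 5, 6):
        word = region[i:i + 6]
        if word not in freq:
            skipped.append(word)  # assumption: ignore words with no listed frequency
            continue
        in_gene, in_non_gene = freq[word]
        if in_gene == 0:
            # Assumption: a word never seen in genes settles the bet.
            return "non-gene", -math.inf, skipped
        score += math.log(in_gene / in_non_gene)
    return ("gene" if score > 0 else "non-gene"), score, skipped

verdict, score, skipped = bet(REGION, FREQ)
print(verdict, score, skipped)
```

Every listed word is more common in non-gene regions, so each one pushes the score below zero even before AAATAG is reached.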

Page 7:

Information Encoded in Genomes

• Preference model:
– for each 6-letter word X (e.g., AAA AAA), calculate its frequencies in gene and non-gene regions, FC(X), FN(X)
– calculate X’s preference value P(X) = log (FC(X)/FN(X))

• Properties:
– P(X) is 0 if X has the same frequency in gene and non-gene regions
– P(X) is positive if X has a higher frequency in gene than in non-gene regions; the larger the difference, the more positive the score is
– P(X) is negative if X has a higher frequency in non-gene than in gene regions; the larger the difference, the more negative the score is

• Gene prediction: given a DNA region, calculate the sum of P(X) values for all 6-letter words X in the region;
– if the sum is larger than zero, predict “gene”
– otherwise predict non-gene
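The preference model above can be sketched as a small Python program. This is a minimal illustration under stated assumptions, not a real gene finder: the training sequences are made up, words are read with an overlapping sliding window, words never seen in training score 0, and a pseudocount (not mentioned on the slide) is added so that log of zero never occurs.

```python
import math
from collections import Counter

K = 6  # word length used on the slide

def words(seq, k=K):
    """All overlapping k-letter words in seq (an assumption of this sketch)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train(gene_seqs, non_gene_seqs, pseudocount=1.0):
    """Return P(X) = log(FC(X) / FN(X)) for every word seen in training."""
    fc = Counter(w for s in gene_seqs for w in words(s))
    fn = Counter(w for s in non_gene_seqs for w in words(s))
    vocab = set(fc) | set(fn)
    total_c = sum(fc.values()) + pseudocount * len(vocab)
    total_n = sum(fn.values()) + pseudocount * len(vocab)
    return {
        w: math.log(((fc[w] + pseudocount) / total_c)
                    / ((fn[w] + pseudocount) / total_n))
        for w in vocab
    }

def predict(region, pref):
    """Sum P(X) over the region; predict 'gene' only if the sum is above zero."""
    score = sum(pref.get(w, 0.0) for w in words(region))
    return ("gene" if score > 0 else "non-gene"), score

# Made-up training data, for illustration only.
pref = train(["ATGGCCGCCGCCGCC"], ["ATATATATATATATA"])
print(predict("GCCGCCGCC", pref))
print(predict("ATATATATA", pref))
```

A real model would be trained on many known genes and non-gene regions; the scoring rule stays the same.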

Page 8:

Information Encoded in Genomes

• You just learned your first bioinformatics method for gene prediction – congratulations!

Page 9:

Information Encoded in Genomes

• Ok, we have now learned how to find genes encoded in a genome

• How do we find out what they do (their biological functions, e.g. sensors, transporters, regulators, enzymes)?

Page 10:

Information Encoded in Genomes

• People have observed that similar protein sequences tend to have similar functions

• Over the years, many genes have been thoroughly studied in different organisms, e.g., human, mouse, fly, …., rice, …
– their biological functions have been identified and documented

• For a new protein, scientists can possibly predict its function by identifying well-studied proteins in other organisms that have high sequence similarity to it
– This works for ~60% of genes in a newly sequenced genome
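The "similar sequence, similar function" idea can be sketched as follows. Everything specific here is an assumption made for illustration: the protein sequences and function labels are made up, the 0.6 threshold is arbitrary, and Python's general-purpose `difflib.SequenceMatcher` stands in for the dedicated sequence-alignment tools that scientists actually use.

```python
from difflib import SequenceMatcher

# Made-up "well-studied" proteins: sequence -> documented function.
KNOWN = {
    "MKTLLVAGGSAALA": "transporter",
    "MSDNERRKQLEWAV": "regulator",
}

def similarity(a, b):
    """Crude similarity between 0 and 1 (a stand-in for a real alignment score)."""
    return SequenceMatcher(None, a, b).ratio()

def predict_function(query, known=KNOWN, threshold=0.6):
    """Copy the function of the most similar known protein, if similar enough."""
    best_seq = max(known, key=lambda seq: similarity(query, seq))
    score = similarity(query, best_seq)
    label = known[best_seq] if score >= threshold else None
    return label, score

print(predict_function("MKTLLVAGGTAALA"))  # one letter differs from a known protein
print(predict_function("WWWWWWWWWW"))      # resembles nothing in the table
```

When no known protein is similar enough, the sketch returns `None`, which mirrors the slide's point that this approach does not work for every gene.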

Page 11:

Information Encoded in Genomes

• Scientists have developed computational techniques for
– identifying regulatory signals that control gene transcription
– predicting protein-protein interactions
– elucidating biological networks for a particular function
– …... and elucidating much other information

Page 12:

Information Encoded in Genomes

E. coli O157 and O111 are pathogenic to humans while E. coli K12 is not;

Can we tell why? Which genes or pathways in E. coli O157 and O111 are responsible for the pathogenicity?

Page 13:

Information Encoded in Genomes

[Figure: panels labeled E. coli K-12, E. coli O157, B. pseudomallei, P. furiosus, Random seq, human chromosome #1]

Page 14:

Information Encoded in Genomes

Red: prokaryotes

Blue: eukaryotes

Green: plastids

Orange: plasmids

Black: mitochondria

x-axis: average of variations of the K-mer frequencies,

y-axis: average barcode similarity among fragments of a genome

Page 15:

Information Encoded in Genomes

• Yes, biologists can derive a lot of information from genomes now

• … but we are far from fully understanding any genome yet, even for the simplest living organisms, bacteria

• We can clearly use new ideas from bright young minds – interested in doing bioinformatics?

Page 16:

Linking Genome Information to Biological Systems Behaviors

• To fully understand cellular behaviors, we need to
– elucidate information encoded in the genome, and
– understand how working molecules, encoded by the genome, behave according to the physical laws on earth!

ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaag…………………………

gene

protein

Page 17:

Key Drivers of Bioinformatics

• The Human Genome Project has fundamentally changed biological science

• A key consequence of the genome project is that scientists learned they can produce biological data on a massive scale
– genome sequences
– microarray data for gene expression levels
– yeast two-hybrid systems for protein-protein interactions
– …… and other “high-throughput” biological data

These data reflect the cellular states, molecular structures and functions, in complex ways

Page 18:

Key Drivers of Bioinformatics

• … and let bioinformaticians (help to) decipher the meaning of these data, like in genome sequences

• Together, high-throughput probing technologies and bioinformatics are transforming biological science into a new science more like physics

Page 19:

Key Drivers of Bioinformatics

• Like physics, where general rules and laws are taught at the start, biology will surely be presented to future generations of students as a set of basic systems ....... duplicated and adapted to a very wide range of cellular and organismic functions, following basic evolutionary principles constrained by Earth’s geological history.

– Temple Smith, Current Topics in Computational Molecular Biology

Page 20:

Biomarker Identification

• Our goal is to identify markers in blood that can tell if a person has a particular form of cancer

…… in a similar fashion to doing a pregnancy test using a test kit, possibly at home

Page 21:

Biomarker Identification

• Microarray gene expression data allow comparative analyses of gene expression patterns in cancer versus normal tissues

on cancer tissues

on normal tissues

Finding genes showing maximum difference in their expression levels between cancer and normal tissues
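Finding the genes with the largest expression difference can be sketched as below. The gene names and expression values are hypothetical, and ranking by the difference of average expression is a simplification chosen for this example; real analyses typically also apply statistical tests.

```python
from statistics import mean

# Hypothetical expression levels: gene -> (cancer samples, normal samples)
EXPRESSION = {
    "geneA": ([9.1, 8.7, 9.4], [2.0, 2.3, 1.9]),
    "geneB": ([5.0, 5.2, 4.9], [5.1, 4.8, 5.0]),
    "geneC": ([1.2, 1.0, 1.1], [6.5, 6.9, 6.2]),
}

def rank_by_difference(expression):
    """Sort genes by |mean(cancer) - mean(normal)|, largest difference first."""
    diffs = {
        gene: mean(cancer) - mean(normal)
        for gene, (cancer, normal) in expression.items()
    }
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)

for gene, diff in rank_by_difference(EXPRESSION):
    direction = "higher in cancer" if diff > 0 else "lower in cancer"
    print(gene, round(diff, 2), direction)
```

Genes at the top of the list with a positive difference are the candidates for "highly expressed in cancer".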

Page 22:

Biomarker Identification

proteins A, …, Z highly expressed in cancer

Page 23:

Biomarker Identification

• Question: Can we predict which of these tissue marker proteins can get secreted into blood circulation so we can get markers in blood?

• Through literature search, we found over proteins being secreted into blood circulation due to various physiological conditions

• We then trained a “classifier” to identify “features” that distinguish between proteins that can be secreted into blood and proteins that cannot
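To give a feel for what training a classifier means, here is a very small sketch. The slides do not say which classifier or which protein features were used, so both are assumptions: the example uses a nearest-centroid rule and two made-up numeric features per protein.

```python
from math import dist
from statistics import mean

# Made-up feature vectors (two hypothetical numeric properties per protein).
SECRETED = [(0.9, 0.2), (0.8, 0.3), (0.85, 0.25)]
NOT_SECRETED = [(0.2, 0.7), (0.3, 0.8), (0.25, 0.9)]

def centroid(points):
    """Average point of a group: the mean of each feature."""
    return tuple(mean(axis) for axis in zip(*points))

def train(positives, negatives):
    """'Training' here just stores the average point of each class."""
    return centroid(positives), centroid(negatives)

def classify(features, model):
    """Assign the class whose average point is closer to the new protein."""
    pos_center, neg_center = model
    if dist(features, pos_center) < dist(features, neg_center):
        return "secreted"
    return "not secreted"

model = train(SECRETED, NOT_SECRETED)
print(classify((0.7, 0.3), model))
print(classify((0.3, 0.8), model))
```

The general pattern is the same for more powerful classifiers: learn from labeled examples, then apply the learned rule to new proteins.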

Page 24:

Biomarker Identification

• We have developed a classifier to distinguish blood-secretory proteins from other proteins

• On a test set with 52 positive and 3,629 negative examples, our classifier achieves
– 89.6% sensitivity, 98.5% specificity and 94% AUC
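Sensitivity and specificity are simple ratios, sketched below. The counts in the example are hypothetical round numbers chosen for illustration; they are not the counts behind the figures reported on the slide.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real positives that the classifier correctly calls positive."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of real negatives that the classifier correctly calls negative."""
    return true_neg / (true_neg + false_pos)

# Hypothetical counts, not taken from the slide.
tp, fn = 45, 5      # 50 secreted proteins: 45 found, 5 missed
tn, fp = 950, 50    # 1000 other proteins: 950 rejected, 50 wrongly flagged

print(f"sensitivity = {sensitivity(tp, fn):.1%}")
print(f"specificity = {specificity(tn, fp):.1%}")
```

High specificity matters a great deal when negatives vastly outnumber positives, as in the test set described above, because even a small false-positive rate can produce many false alarms.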

Page 25:

Biomarker Identification

• The predicted marker proteins can be validated using mass spectrometry experiments

Page 26:

Biomarker Identification

• If successful, it will be possible to test for cancer using a test-kit like pregnancy test-kits

Page 27:

Take-Home Message

• Biological science is under rapid transformation because of high-throughput measurement technologies and bioinformatics

• As an emerging field, bioinformatics is about using computational techniques to solve biological problems, and represents the future of biology

Page 28:

THANK YOU!