Introduction to Bioinformatics and Gene Expression...
![Page 1: Introduction to Bioinformatics and Gene Expression ...math.usu.edu/jrstevens/bioinf/1.Technologies.pdf · 1 Introduction to Bioinformatics and Gene Expression Technologies Utah State](https://reader036.fdocuments.us/reader036/viewer/2022062505/5ede04b8ad6a402d666947df/html5/thumbnails/1.jpg)
Introduction to Bioinformatics and Gene Expression Technologies
Utah State University – Fall 2019
Statistical Bioinformatics (Biomedical Big Data)
Notes 1
Vocabulary
Gene: hereditary DNA sequence at a specific location on chromosome (that “does something”)
Genetics: study of heredity & variation in organisms
Genome: an organism’s total genetic content (full DNA sequence)
Genomics: study of organisms in terms of their genome
Vocabulary
Protein: sequence of amino acids that “does something”
Proteomics: study of all of the proteins that can come from an organism’s genome
Phylogeny: the evolutionary or historical development of an organism (or its DNA sequence)
Phylogenetics: the study of an organism’s phylogeny
Phenotype: the physical characteristic of interest in each individual – for example, plant height, disease status, or embryo type
Vocabulary
Bioinformatics: the collection, organization, & analysis of large-scale, complex biological data
Statistical Bioinformatics: the application of statistical approaches to bioinformatics, especially in identifying significant changes (in sequences, expression patterns, etc.) that are biologically relevant (especially in affecting the phenotype, or in response to some treatment)
RNA-Seq Example: 8 patients, 56,621 genes
8 heart tissue samples
  4 control (no heart disease)
  4 cardiomyopathy (heart disease)
    2 restrictive (contracts okay, relaxes abnormally)
    2 dilated (enlarged left ventricle)
These “Naples” data made public by Institute of Genetics and Biophysics (Naples, Italy)
                Ctrl_3 RCM_3 Ctrl_4 DCM_4 Ctrl_5 RCM_5 Ctrl_6 DCM_6
ENSG00000000003    308   498    362   554    351   353    220   309
ENSG00000000005      3   164      2    43     13    83     22    16
ENSG00000000419   1187  1249   1096  1303    970   863    637   684
ENSG00000000457    163   239    168   195    153   194     44   117
ENSG00000000460     63   108     83   109     87    43     54    51
ENSG00000000938    369   328    272   669   1216   193    861   292
...
url <- "http://www.stat.usu.edu/jrstevens/bioinf/naples.csv"
naples <- read.csv(url, row.names=1)
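A minimal sketch of how the sample labels in the column names could be turned into a condition factor, which most later analyses need. The sample names are copied from the table above so the code runs without network access; the object names `samples`, `condition`, and `disease` are illustrative, and reading RCM and DCM as restrictive and dilated cardiomyopathy is an assumption based on the sample description.

```r
# Sample names as they appear in the column header of the Naples table
samples <- c("Ctrl_3", "RCM_3", "Ctrl_4", "DCM_4",
             "Ctrl_5", "RCM_5", "Ctrl_6", "DCM_6")

# Condition label = the text before the underscore
# (assumed: Ctrl = control, RCM = restrictive, DCM = dilated)
condition <- factor(sub("_.*$", "", samples),
                    levels = c("Ctrl", "RCM", "DCM"))

# Simple two-level indicator: heart disease (TRUE) vs. control (FALSE)
disease <- condition != "Ctrl"

table(condition)
data.frame(sample = samples, condition = condition, disease = disease)
```

With the full data loaded, the same `sub()` call could be applied to `colnames(naples)` instead of the hand-typed vector.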
Common statistical research objectives
Test each gene (row) for “differential expression” between conditions
  Ctrl vs. non-Ctrl
  Dilated vs. Restrictive
  Restrictive vs. Ctrl
  etc.
Test specific groups of genes (with a known common function) for overall expression differences between conditions
  Which functions are differentially active between Ctrl and non-Ctrl, for example?
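To make the first objective concrete, here is a rough sketch of a gene-by-gene test of Ctrl vs. non-Ctrl, using only the six genes shown in the RNA-Seq example. The Welch t-test on log2(count + 1) is chosen purely for illustration and is an assumption of this sketch, not the method of these notes; count-based methods are normally preferred for RNA-Seq data. The names `counts`, `is_ctrl`, `log_counts`, `pvals`, and `log_fc` are hypothetical.

```r
# Counts for the six genes shown in the RNA-Seq example (typed in by hand)
counts <- rbind(
  ENSG00000000003 = c( 308,  498,  362,  554,  351, 353, 220, 309),
  ENSG00000000005 = c(   3,  164,    2,   43,   13,  83,  22,  16),
  ENSG00000000419 = c(1187, 1249, 1096, 1303,  970, 863, 637, 684),
  ENSG00000000457 = c( 163,  239,  168,  195,  153, 194,  44, 117),
  ENSG00000000460 = c(  63,  108,   83,  109,   87,  43,  54,  51),
  ENSG00000000938 = c( 369,  328,  272,  669, 1216, 193, 861, 292)
)
colnames(counts) <- c("Ctrl_3", "RCM_3", "Ctrl_4", "DCM_4",
                      "Ctrl_5", "RCM_5", "Ctrl_6", "DCM_6")

# Which columns are control samples?
is_ctrl <- grepl("^Ctrl", colnames(counts))

# Log scale tames the skewness of raw counts; +1 avoids log2(0)
log_counts <- log2(counts + 1)

# One Welch two-sample t-test per gene (row): non-Ctrl vs. Ctrl
pvals <- apply(log_counts, 1, function(y) {
  t.test(y[!is_ctrl], y[is_ctrl])$p.value
})

# Mean log2 difference (non-Ctrl minus Ctrl) for each gene
log_fc <- rowMeans(log_counts[, !is_ctrl]) - rowMeans(log_counts[, is_ctrl])

round(cbind(log_fc, pvals), 3)
```

With only four samples per group, any single p-value here is weak evidence, and testing all 56,621 genes this way would also call for a multiple-testing adjustment.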
HW data: bovine oviduct
“Effect of lactation and location relative to the corpus luteum on the transcriptome of the bovine oviduct epithelium”
Biology prerequisite: opposable thumb(s)
Why does context matter?
  Interdisciplinary collaborations require communication, curiosity, and learning new fields
  Statistical collaborator vs data analyst (2019 Vance & Smith ASCCR framework)
NCBI GEO database: GSE124110
url <- "http://www.math.usu.edu/jrstevens/bioinf/GSE124110_readCounts_oviduct.txt"
bovine <- read.table(url, header=TRUE)
Central Dogma of Molecular Biology
A road map to bioinformatics
Central Dogma; Technology; Genomic Hypothesis; Type of Study or Analysis
Gene; Genome Sequencing; Genotype; QTL
mRNA transcript; Transcript Profiling; Transcriptome; Microarrays or Next-Gen Sequencing (Epigenetics / methylation)
Protein; Protein quantification and function; Proteome; Protein Microarrays or Proteomics
Phenotype
(From introductory lecture by RW Doerge at 2013 Joint Statistical Meetings)
“Alphabets”
DNA sequences defined by nucleotides (4)
DNA sequence → mRNA sequence → Protein sequence
Protein sequences defined by amino acids (20)
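As a small illustration of the two alphabets, the sketch below transcribes a made-up DNA coding-strand sequence into mRNA and translates it three bases at a time. The sequence `dna` is hypothetical, and `codon_table` holds only the four codons needed here, a tiny subset of the 64 (4^3) codons in the standard genetic code.

```r
# Hypothetical coding-strand DNA sequence (4-letter alphabet: A, C, G, T)
dna <- "ATGGCCAAGTGA"

# Transcription: the mRNA matches the coding strand with U in place of T
mrna <- chartr("T", "U", dna)

# Split the mRNA into consecutive codons of three bases each
starts <- seq(1, nchar(mrna), by = 3)
codons <- substring(mrna, starts, starts + 2)

# Partial codon table: only the codons that occur in this toy example
codon_table <- c(AUG = "Met", GCC = "Ala", AAG = "Lys", UGA = "Stop")

# Translation: look up each codon's amino acid
protein <- unname(codon_table[codons])

mrna
codons
protein
```

Because 64 codons map onto 20 amino acids plus stop signals, several codons code for the same amino acid.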
General assumption of gene expression technology
Use mRNA transcript abundance level as a measure of the level of “expression” for the corresponding gene in the biological sample
Proportional to degree of gene expression
Side note: a “methylated” gene is silenced (no expression)
How to measure mRNA abundance?
Several different approaches with similar themes:
  Affymetrix GeneChip
  Nimblegen array
  Two-color cDNA array
  More modern: next-generation sequencing (NGS)
Representation of genes on slide
  Small portion of gene (“oligo”)
  Larger sequence of gene
  Blank slate (NGS)
oligonucleotide arrays
General DNA sequencing
Sanger
  1970s – today
  most reliable, but expensive
Next-generation [high-throughput] (NGS):
  Genome Sequencer FLX (GS FLX, by 454 Sequencing)
  Illumina’s Solexa Genome Analyzer
  Applied Biosystems SOLiD platform
  others …
Key aspect: sequence (and identify) all sequences present
Common features of NGS technologies (1)
fragment prepared genomic material
  biological system’s RNA molecules: RNA-Seq
  DNA or RNA interaction regions: ChIP-Seq, HITS-CLIP
  others …
sequence these fragments (at least partially)
  produces HUGE data files (~10 million fragments sequenced)
Common features of NGS technologies (2)
align sequenced fragments with reference sequence
  usually, a known target genome (gigo…)
  alignment tools: ELAND, MAQ, SOAP, Bowtie, others
  often done with command-line tools
  still a major computational challenge
count number of fragments mapping to certain regions
  usually, genes
  these read counts linearly approximate target transcript abundance
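The counting step can be sketched on toy data: given where each fragment aligned and where each gene sits on the reference, count the fragments falling inside each gene. Everything below is made up for illustration (the `genes` coordinates, the `read_pos` alignment positions, and the rule of assigning a read by its start position only); real pipelines work from aligner output and genome annotation files.

```r
# Hypothetical gene regions on one reference sequence
genes <- data.frame(
  gene  = c("geneA", "geneB", "geneC"),
  start = c(100, 500, 900),
  end   = c(300, 700, 1200)
)

# Hypothetical alignment start positions of ten sequenced fragments
read_pos <- c(120, 150, 299, 450, 510, 650, 699, 950, 1000, 1300)

# For each gene, count the reads whose position lies within [start, end]
read_counts <- sapply(seq_len(nrow(genes)), function(i) {
  sum(read_pos >= genes$start[i] & read_pos <= genes$end[i])
})
names(read_counts) <- genes$gene

read_counts

# Reads that fall in no annotated gene are left uncounted
sum(read_counts) < length(read_pos)
```

Each column of a count table such as the Naples data holds the result of this kind of tally for one sample.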
Here, RNA-Seq:
recall central dogma: DNA → mRNA → protein → action
  quantify [mRNA] transcript abundance
Isolate RNA from cells, fragment at random positions, and copy into cDNA
Attach adapters to ends of cDNA fragments, and bind to flow cell (Illumina has glass slide with 8 such lanes – so can process 8 samples on one slide)
Amplify cDNA fragments in certain size range (e.g., 200-300 bases) using PCR, giving clusters of the same fragment
Sequence – base-by-base for all clusters in parallel
https://www.youtube.com/watch?v=-7GK1HXwCtE
(originally illumina.com download)
Then align and map …
For sequence at each cluster, compare to [align with] reference genome; file format:
  millions of clusters per lane
  approx. 1 GB file size per lane
For regions of interest in reference genome (genes, here), count number of clusters mapping there
  requires well-studied and well-documented genome
A short word on bioinformatic technologies
“Never marry a technology, because it will always leave you.”
  – Scott Tingey, Director of Genetic Discovery at DuPont (shared in RW Doerge 2013 introductory overview lecture at 2013 JSM)
In this class, we will discuss only a couple of technologies, emphasizing their recurring statistical issues
  These are perpetual (and compounding)
A Rough Timeline of Technologies
(1995+) Microarrays
  require probes fixed in advance – only set up to detect those
(2005+) Next-Generation Sequencing (NGS)
  typically involves amplification of genomic material (PCR)
  (can bias GC-rich regions; also may lead to misassemblies and gaps [2018])
(2010+) Third-Generation Sequencing
  “next-next-generation” – Pac Bio, Ion Torrent
  no amplification needed – can sequence single molecule
  longer reads possible; still (2013 ; 2016) showing high errors
(2012+) Nanopore-Based Sequencing [very promising]
  Oxford Nanopore, Genia, others
  bases identified as whole molecule slips through nanoscale hole (like threading a needle); coupled with disposable cartridges; still (2013 ; 2016) under development (2018)
(?+) more …
Differ in how sequencing done; subsequent post-alignment statistical analysis basically same
Affymetrix Technology – GeneChip
Each gene is represented by a unique set of probe pairs (usually 12-20 probe pairs per probe set)
Each spot on array represents a single probe (with millions of copies)
These probes are fixed to the array
(Image courtesy Affymetrix, www.affymetrix.com)
Affymetrix Technology – Expression
(Images courtesy Affymetrix, www.affymetrix.com)
A tissue sample is prepared so that its mRNA has fluorescent tags; wait for hybridization; scan to light tag
Affymetrix GeneChip
Image courtesy Affymetrix, www.affymetrix.com
Cartoon Representations (originally from Affymetrix outreach)
Animation 1: GeneChip structure (1 min.)
Animation 2: Measuring gene expression (2.5 min.)
Images; Affymetrix data is probe intensity
Images courtesy Affymetrix, www.affymetrix.com
Full Array Image; Close-up of Array Image
How to analyze data meaningfully?
Consider (for any technology):
  Data quality
  Data distribution
  Data format & organization
  Appropriateness of measurement methods (& variance)
  Sources of variability (and their types)
  Appropriate models to account for sources of variability and address question of interest
  Meaning of P-values and appropriate tests of significance
  Statistical significance vs. biological relevance
  Appropriate and useful representation of results
Many useful tools available from Bioconductor
The Bioconductor Project
“Bioconductor is an open source and open development software project for the analysis and comprehension of genomic data”
Not just for RNA-Seq or microarray data
Like a living family of software packages, changing with needs
Core team mainly at Fred Hutchinson Cancer Research Center, plus many other U.S. and international institutions
Source: www.bioconductor.org
Main Features of the Bioconductor Project
Use of R
Documentation and reproducible research
Statistical and graphical methods
Annotation
Short courses
Open source
Open development
Source: www.bioconductor.org
What will we do in this class?
Learn basics of a few major Bioconductor tools
Focus on statistical issues
Discuss recent developments
Learn to discuss all of this in context