Biology and Bioinformatics
Gabor T. Marth
Department of Biology, Boston College
BI820 – Seminar in Quantitative and Computational Problems in Genomics
The animal cell
DNA – the carrier of the genetic code
DNA organization – chromosomes
Translation of genetic information
DNA sequencing informatics
DNA organization
Genome annotation
De novo gene prediction
Similarity-based gene prediction
Gene localization
Genetic mapping
Gene function
Expression analysis
Protein structure
RNA structure
Protein structure prediction
RNA structure prediction
DNA evolution
Evolution of chromosome organization
Evolution of gene structure
Evolution of DNA sequence
Comparative genomics
Phylogenetics
Mechanisms of molecular evolution
Sequence variations
• the Human Genome Project produced a reference genome sequence that is about 99.9% identical to the genome of any individual human
• sequence variations make our genetic makeup unique
SNP
• Single-nucleotide polymorphisms (SNPs) are most abundant, but other types of variations exist and are important
Why do we care about variations?
phenotypic differences
demographic history
inherited diseases
How do we find polymorphisms?
• look at multiple sequences from the same genome region
• diverse sequence resources can be used: EST, WGS, BAC
• diversion: sequencing informatics
SNP discovery -- Methods
Sequence clustering
Cluster refinement
Multiple alignment
SNP detection
SNP discovery – Computer tools
>CloneX
ACGTTGCAACGTGTCAATGCTGCA
>CloneY
ACGTTGCAACGTGTCAATGCTGCA
ACCTAGGAGACTGAACTTACTG
ACCTAGGAGACCGAACTTACTG
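The last step of the pipeline, SNP detection, can be illustrated on the two 22-base fragments above. This is only a naive sketch: it assumes the sequences are already aligned and simply reports columns where the bases disagree. It ignores base quality values and sequencing error, which real discovery tools take into account. The function name `candidate_snps` is made up for this example.

```python
def candidate_snps(aligned):
    """Return (position, alleles) for each alignment column with more than one base.

    `aligned` is a list of equal-length, already aligned sequences.
    Positions are 1-based. Gap ('-') and unknown ('N') characters are ignored.
    """
    if len({len(s) for s in aligned}) > 1:
        raise ValueError("sequences must be aligned to the same length")
    hits = []
    for pos, column in enumerate(zip(*aligned), start=1):
        alleles = {base.upper() for base in column} - {"-", "N"}
        if len(alleles) > 1:
            hits.append((pos, sorted(alleles)))
    return hits


# The two aligned fragments from the slide.
fragments = [
    "ACCTAGGAGACTGAACTTACTG",
    "ACCTAGGAGACCGAACTTACTG",
]
print(candidate_snps(fragments))  # [(12, ['C', 'T'])]
```

A mismatch found this way is only a candidate: it may be a true polymorphism, a sequencing error, or a misalignment of paralogous regions.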
~ 30,000 clones
25,901 clones (7,122 finished, 18,779 draft with base quality values)
21,020 clone overlaps (124,356 fragment overlaps)
507,152 high-quality candidate SNPs (validation rate 83-96%)
Marth et al., Nature Genetics 2001
SNP discovery – Mining Projects
SNP databases and characteristics
• access to variation data
• SNP properties
• reliability of information
• characterizing known polymorphic sites in sample collections – genotyping
Where do variations come from?
• sequence variations are the result of mutation events
• mutations are propagated down through generations
[Figure: genealogy descending from the MRCA, with the sequences TAAAAAT and TAACAAT at its nodes and tips]
Mutation rate
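[Figure: two genealogies, each descending from an MRCA. First pair of sequences: accgttatgtaga, accgctatgtaga (1 difference). Second pair: actgttatgtaga, accgctatataga (3 differences).]
• higher mutation rate (µ) gives rise to more SNPs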
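The link between mutation rate and SNP count can be made concrete with a standard textbook result that is not stated on the slides: under a neutral model with constant population size, the expected number of segregating sites in n sampled sequences is θ(1 + 1/2 + ... + 1/(n-1)), where θ = 4Nµ per site for a diploid population. The sketch below assumes that model; all parameter values are hypothetical.

```python
def expected_segregating_sites(n_samples, pop_size, mu_per_site, n_sites):
    """Expected number of SNPs among n_samples sequences of n_sites bases.

    Assumes the standard neutral model with a constant diploid population
    of effective size pop_size and per-site, per-generation mutation rate
    mu_per_site (Watterson's expectation).
    """
    theta = 4 * pop_size * mu_per_site * n_sites
    harmonic = sum(1.0 / i for i in range(1, n_samples))
    return theta * harmonic


# Hypothetical values: 10 sequences of 10 kb, N = 10,000.
low = expected_segregating_sites(10, 10_000, 1e-8, 10_000)
high = expected_segregating_sites(10, 10_000, 2e-8, 10_000)
print(f"mu = 1e-8: {low:.1f} expected SNPs")
print(f"mu = 2e-8: {high:.1f} expected SNPs")
```

Under these assumptions the expected SNP count is proportional to µ, so doubling the mutation rate doubles it.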
Recombination
[Figure: recombination, illustrated on copies of the sequence accgttatgtaga]
Demographic history
small (effective) population size N
large (effective) population size N
• different world populations have varying long-term effective population sizes (e.g. African N is larger than European)
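One way to see why effective population size matters is the classical drift result, assumed here and not taken from the slides, that expected heterozygosity in an idealized diploid population decays by a factor of (1 - 1/(2N)) per generation when there is no new mutation. The sketch below compares a small and a large population; the starting value and sizes are hypothetical.

```python
def expected_heterozygosity(h0, pop_size, generations):
    """Expected heterozygosity after `generations` of drift alone.

    Assumes an idealized (Wright-Fisher) diploid population of constant
    effective size pop_size, with no mutation, selection, or migration.
    """
    return h0 * (1.0 - 1.0 / (2.0 * pop_size)) ** generations


# Hypothetical comparison: same starting diversity, different N.
for n in (1_000, 10_000):
    h = expected_heterozygosity(0.5, n, 2_000)
    print(f"N = {n}: heterozygosity after 2000 generations = {h:.3f}")
```

The smaller population loses variation faster, which is one reason populations with different long-term effective sizes carry different amounts of sequence variation.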
Modeling
[Figure: population size histories from past to present (stationary, expansion, collapse, bottleneck), each shown with its MD (simulation) and AFS (direct form) histograms]
Ancestral inference
[Figure: histogram over minor allele count (1 to 10); annotations: bottleneck; modest but uninterrupted expansion]
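As a reference point for histograms of this kind, the sketch below computes the expected distribution of minor allele counts under the simplest history, a stationary population with neutral mutations. It relies on the standard result that the expected number of sites where the derived allele appears i times in n chromosomes is proportional to 1/i. The sample size of 20 is a hypothetical choice that yields minor allele counts 1 to 10; it is not taken from the slides.

```python
def expected_folded_afs(n_chromosomes):
    """Expected share of SNPs at each minor allele count, 1..n//2.

    Assumes a neutral, constant-size (stationary) population, where the
    unfolded spectrum is proportional to 1/i. Folding adds the classes
    i and n - i, because the minor allele may be ancestral or derived.
    """
    n = n_chromosomes
    weights = {}
    for i in range(1, n // 2 + 1):
        w = 1.0 / i + 1.0 / (n - i)
        if i == n - i:  # the middle class would otherwise be counted twice
            w /= 2
        weights[i] = w
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


for count, share in expected_folded_afs(20).items():
    print(f"minor allele count {count:2d}: {share:.3f}")
```

Departures of an observed spectrum from this baseline, such as an excess or deficit of rare alleles, are the kind of signal used to infer expansion, collapse, or a bottleneck.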
The signatures of selection
• mutations under selection influence the genealogy itself; for neutral mutations, the mutation process and the genealogy are decoupled
Association and haplotype structure
[Figure: r2 plotted against recombination fraction and distance (kb, log scale) for European, Asian, and African American samples]
“linkage disequilibrium”
“haplotype blocks”
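The r2 statistic commonly used to measure linkage disequilibrium can be computed directly from phased haplotypes. The sketch below uses the usual definition, r2 = D^2 / (pA(1-pA) pB(1-pB)) with D = pAB - pA pB, and assumes both sites are biallelic. The haplotypes and the function name are hypothetical.

```python
def r_squared(haplotypes, site_a, site_b):
    """r2 between two biallelic sites in a list of phased haplotypes.

    Allele frequencies are measured for the alleles carried by the first
    haplotype; r2 does not depend on which allele is chosen as reference.
    """
    n = len(haplotypes)
    ref_a = haplotypes[0][site_a]
    ref_b = haplotypes[0][site_b]
    p_a = sum(h[site_a] == ref_a for h in haplotypes) / n
    p_b = sum(h[site_b] == ref_b for h in haplotypes) / n
    p_ab = sum(h[site_a] == ref_a and h[site_b] == ref_b for h in haplotypes) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic in the sample")
    return d * d / denom


# Hypothetical two-site haplotypes.
linked = ["AC", "AC", "GT", "GT"]         # alleles always travel together
independent = ["AC", "AT", "GC", "GT"]    # every combination equally common
print(r_squared(linked, 0, 1))       # 1.0
print(r_squared(independent, 0, 1))  # 0.0
```

Values near 1 mean that one site predicts the other almost perfectly; runs of such sites are what the term "haplotype blocks" refers to.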
Computer simulations: the Coalescent
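A minimal sketch of a coalescent simulation follows, assuming the standard Kingman coalescent for a constant-size population without recombination. Time is measured in units of 2N generations, so that k lineages coalesce at rate k(k-1)/2, and neutral mutations fall on the tree as a Poisson process with mean θ/2 per unit of branch length. The function names and parameter values are hypothetical, and this is not the simulation software used for the results in the slides.

```python
import random


def poisson(mean, rng):
    """Draw a Poisson variate by counting rate-1 arrivals in [0, mean]."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_coalescent(n_samples, theta, rng):
    """Return (time to MRCA, total branch length, number of mutations)."""
    tmrca = 0.0
    tree_length = 0.0
    k = n_samples
    while k > 1:
        # Waiting time until the next pair of the k lineages coalesces.
        wait = rng.expovariate(k * (k - 1) / 2)
        tmrca += wait
        tree_length += k * wait  # k branches each grow by `wait`
        k -= 1
    n_mutations = poisson(theta / 2 * tree_length, rng)
    return tmrca, tree_length, n_mutations


rng = random.Random(42)
for _ in range(3):
    t, length, snps = simulate_coalescent(10, 4.0, rng)
    print(f"TMRCA = {t:.2f}, tree length = {length:.2f}, SNPs = {snps}")
```

Because mutations are placed on the tree only after the genealogy has been drawn, the sketch also shows the decoupling of genealogy and mutation that holds for neutral variation.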
Medical utility?
clinical phenotype
molecular markers
?
functional understanding
Mapping disease-causing loci
genetic linkage
association between allele and phenotype
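A simple way to test for association between an allele and a phenotype is a chi-square test on a 2x2 table of allele counts in cases and controls. The sketch below uses the standard Pearson statistic with one degree of freedom and no continuity correction. The counts are invented for illustration and the function name is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Rows: cases, controls. Columns: counts of allele 1, allele 2.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Invented counts: allele 1 is seen 30 times in cases, 10 times in controls.
stat = chi_square_2x2(30, 10, 10, 30)
print(stat)          # 20.0
print(stat > 3.841)  # True: exceeds the 5% critical value for 1 degree of freedom
```

A significant statistic indicates association only. The tested marker may be in linkage disequilibrium with the causal variant and need not be causal itself.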
Forensic applications