The Human Genome
3,000,000,000 bases
The raw dataNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGATCTGATAAGTCCCAGGACTTCAGAAGagctgtgagaccttggccaagtcacttcctccttcagGAACATTGCAGTGGGCCTAAGTGCCTCCTCTCGGGACTGGTATGGGGACGGTCATGCAATCTGGACAACATTCACCTTTAAAAGTTTATTGATCTTTTGTGACATGCACGTGGGTTCCCAGTAGCAAGAAACTAAAGGGTCGCAGGCCGGTTTCTGCTAATTTCTTTAATTCCAAGACAGTCTCAAATATTTTCTTATTAACTTCCTGGAGGGAGGCTTATCATTCTCTCTTTTGGATGATTCTAAGTACCAGCTAAAATACAGCTATCATTCATTTTCCTTGATTTGGGAGCCTAATTTCTTTAATTTAGTATGCAAGAAAACCAATTTGGAAATATCAACTGTTTTGGAAACCTTAGACCTAGGTCATCCTTAGTAAGATcttcccatttatataaatacttgcaagtagtagtgccataattaccaaacataaagccaactgagatgcccaaagggggccactctccttgcttttcctcctttttagaggatttatttcccatttttcttaaaaaggaagaacaaactgtgccctagggtttactgtgtcagaacagagtgtgccgattgtggtcaggactccatagcatttcaccattgagttatttccgcccccttacgtgtctctcttcagcggtctattatctccaagagggcataaaacactgagtaaacagctcttttatatgtgtttcctggatgagccttcttttaattaattttgttaagggatttcctctagggccactgcacgtcatggggagtcacccccagacactcccaattggccccttgtcacccaggggcacatttcagctAtttgtaaaacctgaaatcactagaaaggaatgtctagtgacttgtgggggccaaggcccttgttatggggatgaaggctcttaggtggtagccctccaagagaatagatggtgAatgtctcttttcagacattaaaggtgtcagactctcagttaatctctcctagatccaggaaaggcctagaaaaggaaggcctgactgcattaatggagattctctccatgtgcaaaatttcctccacaaaagaaatccttgcagggccattttaatgtgttggccctgtgacagccatttcaaaatatgtcaaaaaatatattttggagtaaaatactttcattttccttcagagtctgctgtcgtatgatgccataccagagtcaggttggaaagtaagccacattatacagcgttaacctaaaaaaacaaaaaactgtctaacaagattttatggtttatagagcatgattccccggacacattagatagaaatctgggcaagagaagaaaaaaaggtcagagtttaatcctcaTTCCTAAGTTAtgtaaaccaaaaataaaattctgaagatgtcctgatcatctgaatggacccttcctctggaccagggcattccaaagttaacctgaaaattggtttgggccatgatgggaagggaggtttggatatgcctcattatgccctcttccctttcagaattcaggaaaagccaacc
agcattaacatcaacacagattttcagatcttaggtttctttccgatctattctctctgaaccctgctacctggaggcttcatctgcataataaaactttagtctccacaaccccttatcttaccccagacattcctttctattgataataactctttcaaccaattgccaatcagggtatgtttaaatctacctatgacctggaagcccccactttgcaccctgagatcaaaccagtgcaaatcttatatgtattgatttgtcAATGAAAACAGTCAAAGCCagtcaggcacagtggctcatgcctgtaatcccagcactttgggaggctgaggcgggtagatcacctgaggtcaggagttcgacaccagcctggccaacatggtgaaaccccgtccctactaaaatacaaaaattagcccagcttggtggtgggcacctgtaatcttagctactgcagagactgaggcaggagaatcgcttgaacccaggaggtggaggttgcagtgacctgagattttgccattgcactccagcctgggcaacagagcaagactctatctcaaaaaacaaacaaacaaacaaacaaacaaacaaacTgtcaaaatctgtacagtatgtgaagagatttgttctgaaccaaatatgaatgaccatggtccatgacacagccctcagaagaccctgagaacatgtgcccaaggtggtcacagtgcatcttagttttgtacattttagggagatatgagacttcagtcaaatacatttttaaaaaatacattggttttgtccagaaagccagaaccactcaaagcaggggtttccaggttataagtagatttaaaatttttctgattgacaattggttgaaagagttgtcaatagaaaggaatgtctgcattgtgacaagaggttgtggagaccaagtttctgtcatgcagatgaagccttcaggtagcaggcttccaagataacaggttgtaaatagttcttatcagacttaaGTTCTGTGGAGACGTAAAATGAGGCATATCTGACCTCCACTTccaaaaacatctgagacaggtctcagttaattaagaaagtttgttctgcctagtttaaggacatgcccatgacactgcctcaggaggtcctgacagcatgtgcccaaggtggtcaggatacagcttgcttctatatattttagggagaaaatacatcaGCCtgtaaacaaaaaattaaattctaaggtccctgaaccatctgaatgggctttcttctaggccagggcactctaaaattgaagaacctgaacattcctttctattgataatactttcagccagttgagcccattcagaCCACAGCAAGGTGCCAGGCCAGGCAAGGGCTGACTTGAGATACCTGCCAGATGAGTCACTGGCAAAAGGTGCTGCTCCCTGGTGAGGGAGAAACACCAGGGGCTGGGAGAGGCCCAGAAGGCTCTGAAGGAGTTTTGGTTTGGCTGGCCATGTGTGCAATTAGCGTGATGAGCTCTGACATGGCCTTGCATGGACGGATTGGGCAGG
A’s T’s C’s and G’s and N’s
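As a rough illustration of what can be read straight off the raw letters, the sketch below counts each base in a sequence string and reports the GC fraction and the fraction of uncalled bases (N). The function name and the example string are hypothetical. Treating lowercase letters as soft-masked repeats is an assumption about how the sequence on the slide was formatted, not something the slides state.

```python
from collections import Counter

def base_composition(seq):
    """Summarise a raw sequence string (hypothetical helper, not from the slides)."""
    counts = Counter(seq.upper())
    called = sum(counts[b] for b in "ACGT")  # bases that were actually determined
    gc = counts["G"] + counts["C"]
    size = len(seq)
    return {
        "counts": {b: counts[b] for b in "ACGTN"},
        "gc_fraction": gc / called if called else 0.0,
        "n_fraction": counts["N"] / size if size else 0.0,
        # Assumption: lowercase marks soft-masked repeats.
        "lowercase_fraction": sum(c.islower() for c in seq) / size if size else 0.0,
    }

# Short made-up example in the same style as the slide's sequence.
summary = base_composition("NNNNGATCTGATAAgagctgtgaGGAAC")
print(summary["counts"], round(summary["gc_fraction"], 3))
```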
Composition of the human genome
• Nearly half the genome is repeats
• Only approximately 1.5% is known coding genes
• Unknown functional fraction?!
The repeat content: jumping genes
1. Transposition-derived repeats
2. Inactive retroposed cellular genes
3. Simple repeats (microsatellites)
4. Segmental duplications
5. Tandem repeats (telomere, centromere)
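Of the repeat classes listed above, simple repeats are the easiest to find by direct pattern matching. The sketch below is a minimal illustration using a regular expression with a back-reference. The function name and the default thresholds (units of 1 to 6 bases, at least 5 copies) are assumptions chosen for the example, not values from the slides, and real repeat annotation uses dedicated tools.

```python
import re

def find_simple_repeats(seq, min_unit=1, max_unit=6, min_copies=5):
    """Return (start, unit, copies) for perfect tandem runs of a short unit."""
    # ([ACGT]{a,b}?) captures the shortest candidate unit; \1{n,} requires
    # at least n further copies immediately after it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits

print(find_simple_repeats("TTGACACACACACACATT"))
```

Only perfect (uninterrupted) repeats are reported, so real microsatellites with a single mismatched copy would be split or missed.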
Fewer genes than expected
GeneSweep – Ewan Birney (Wellcome Trust Sanger Institute)
The happy winner: Lee Rowen of the Institute for Systems Biology, with 25,947 genes.
Genome complexity
Regulatory elements
Promoters, enhancers, repressors…
This is where it gets complicated.
Alternative splicing
56% for humans, 22% for worms
Variation among chromosomes
Initial sequencing and analysis of the human genome
International Human Genome Sequencing Consortium Nature 409, 860 - 921 (15 February 2001)
• Overall recombination rate dependent on chromosome length.
• Large variation in gene density between chromosomes.
• Difference in organisation
Variation within chromosomes
Figure panels: recombination, GC content, gene density
The genome is non-random in its organisation
Recombination – High at telomeres
GC – Variation at many scales - Isochores
Gene Density – Organisation by function
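Variation in GC content along a chromosome is usually looked at in windows. The sketch below computes the GC fraction in fixed-size windows of a sequence string. The function name and the window sizes are assumptions for illustration; changing the window size is one simple way to look at the "many scales" mentioned above.

```python
def gc_windows(seq, window=1000, step=None):
    """Return (start, gc_fraction) per window; fraction is None if the window is all N."""
    step = step or window  # default: non-overlapping windows
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window].upper()
        called = sum(chunk.count(b) for b in "ACGT")  # ignore N and other symbols
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, gc / called if called else None))
    return results

# Tiny made-up example with a 4-base window.
print(gc_windows("GGCCAATT", window=4))
```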
New observations
• Variation at multiple scales within and between chromosomes
• Only twice as many genes as flies and worms – but more proteins
• Genes have arrived from bacteria and transposable elements
• Transposons appear inactive, and LTR elements probably are too (Alus are found in GC-rich regions)
• Most mutations occur in males (higher mutation rate)
• GC-poor regions correspond to dark bands.
• Recombination rates are higher at telomeres
• Lots of variation between individuals
Human Genome Project starts: 1990
Draft human genome completed: 2001
Fewer gaps: 147,821 (2001) down to 341
More continuity: 81 kb (2001) up to 38,500 kb
Gene-rich regions completed: 2003
Each chromosome compiled and annotated: 2006!
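The slides do not say how "continuity" was measured. A common summary for assemblies is the N50: the length such that pieces of at least that length contain half of all assembled bases. The sketch below assumes that kind of statistic purely for illustration; the function name and the example lengths are hypothetical.

```python
def n50(lengths):
    """Length L such that pieces of length >= L hold at least half of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input

# Made-up contig lengths (in kb): a few long pieces dominate the statistic.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Closing gaps merges short pieces into long ones, which is why fewer gaps and higher continuity go together.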
Go home?
• Error rate of ~1 per 100,000 bases
• 2.85 billion bases
• Covers ~99% of the euchromatic genome.
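To put the error rate in context, the figures above can be combined in a line of arithmetic. The sketch below is only that calculation; the function name is hypothetical and it assumes errors are spread evenly across the sequence.

```python
def expected_errors(n_bases, bases_per_error):
    """Expected number of erroneous bases, assuming a uniform error rate."""
    return n_bases / bases_per_error

# 2.85 billion bases at about 1 error per 100,000 bases.
print(expected_errors(2_850_000_000, 100_000))
```

So even a sequence described as finished would be expected to carry errors numbering in the tens of thousands, which is one reason to hesitate before going home.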
Completing the Human Genome
New builds: Build 36, May 2006
Build 35, May 2004
Build 34, July 2003
Build 33, April 2003
Not quite finished
December 2001 - NCBI 28
July 2003 - NCBI 34
Chromosome 1
Segmental duplications
- allow genes to diversify and acquire novel functions.
• Duplication of a gene from one to many positions on the chromosome.
• A pericentric inversion follows a duplication of two genes
Chromosomes 2 and 4
Gene deserts
Megabase sized genomic segments containing no known coding genes.
(some show conservation)
Role of these regions?
Lowest recombination rates of all the autosomes
Chromosome 3
Lowest rate of segmental duplication
Large inversion since our common ancestor with chimpanzees.
Chromosome 7
Complex repeat patterns and fragile locations
Williams-Beuren syndrome associated with a large deletion (1.6Mb).
Lots of repetitive and duplicated DNA.
What is the true sequence?
“It is characterized by a distinctive, "elfish" facial appearance, along with a low nasal
bridge; an unusually cheerful demeanor and ease with strangers, coupled with
unpredictably occurring negative outbursts; mental retardation coupled with an unusual facility with language; a love for music; and
cardiovascular problems, such as supravalvular aortic stenosis and transient
hypercalcemia.”
Chromosome 10
Multi-species alignment – gene involved in cancer
Conservation indicates the location of functional elements. Some are known genes. Others aren’t – higher levels of conservation!
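One simple way to turn a multi-species alignment into a conservation signal is to ask, for each column, what fraction of species share the most common base. The sketch below does exactly that. It is a deliberately crude score with hypothetical names and a made-up alignment; real conservation tracks use phylogeny-aware models rather than plain identity.

```python
from collections import Counter

def column_conservation(alignment):
    """Per-column fraction of rows sharing the most common base (gaps count against)."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        bases = [b.upper() for b in column if b != "-"]
        if not bases:
            scores.append(0.0)  # column is all gaps
            continue
        top_count = Counter(bases).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

# Three made-up aligned sequences standing in for three species.
print(column_conservation(["ACGT", "ACGA", "AC-T"]))
```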
Chromosome 19
Very high gene density
Increase in all classes of known genes.
26 genes per megabase.
What is special about this chromosome?
It also has a high recombination rate, repeat density and GC content.
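Gene density is just a count divided by a length. The sketch below shows the calculation; the function name is hypothetical and the input numbers are round, made-up values chosen only so that the result matches the 26 genes per megabase quoted above.

```python
def genes_per_megabase(gene_count, length_bp):
    """Gene density in genes per megabase of sequence."""
    if length_bp <= 0:
        raise ValueError("length_bp must be positive")
    return gene_count / (length_bp / 1_000_000)

# Hypothetical inputs: 1,300 genes on 50 Mb of sequence.
print(genes_per_megabase(1300, 50_000_000))
```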
Chromosomes 12 and 3
Recombination rate variation
Knowing the physical positions of variants allows recombination rates to be estimated
Male and female rates differ
Fine scale variation
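Once markers have both a physical position (in bases) and a genetic map position (in centimorgans), the local recombination rate is the slope between neighbouring markers. The sketch below computes that rate in cM per Mb. The function name and the marker values are made up for illustration, and the input is assumed to be sorted by physical position.

```python
def recombination_rates(markers):
    """markers: list of (physical_bp, genetic_cM) sorted by physical position.

    Returns (start_bp, end_bp, rate_cM_per_Mb) for each adjacent pair.
    """
    rates = []
    for (bp1, cm1), (bp2, cm2) in zip(markers, markers[1:]):
        megabases = (bp2 - bp1) / 1_000_000
        if megabases <= 0:
            raise ValueError("markers must be sorted with distinct positions")
        rates.append((bp1, bp2, (cm2 - cm1) / megabases))
    return rates

# Made-up markers: the first interval recombines twice as fast as the second.
print(recombination_rates([(0, 0.0), (1_000_000, 1.0), (3_000_000, 2.0)]))
```

Running the same calculation on separate male and female genetic maps is one way to see how the two rates differ.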
NCBI www.ncbi.nlm.nih.gov/genome/guide/human/
• Part of the National Institutes of Health.
• Has a number of important associated projects.
• Mr NCBI – David Lipman.
Ensembl www.ensembl.org/Homo_sapiens/
• A joint project between EMBL and the Sanger Institute.
• Primarily funded by the Wellcome Trust.
• Mr Ensembl – Ewan Birney.
UCSC genome.ucsc.edu/cgi-bin/hgGateway
• Based at the University of California, Santa Cruz.
• Largely funded by the NHGRI.
• Mr UCSC – David Haussler.
Where is the data available?
• Compositional: base composition, insertions and deletions, segmental duplications, repeats, transposable elements
• Functional: genes, regulatory elements, gene expression
• Evolutionary: species comparison, variation data, population genetic analysis
What data is available?
Mapping and Sequencing Tracks: Base Position, Chromosome Band, STS Markers, FISH Clones, Recomb Rate, Map Contigs, Assembly, Gap, Coverage, BAC End Pairs, Fosmid End Pairs, GC Percent, WSSD Duplication, Short Match, Restr Enzymes
Phenotype and Disease Associations: RGD QTL, Human Mutation
Genes and Gene Prediction Tracks: Known Genes, CCDS, RefSeq Genes, Other RefSeq, MGC Genes, Vega Genes, Vega Pseudogenes, Ensembl Genes, AceView Genes, ECgene Genes, N-SCAN, SGP Genes, Geneid Genes, Genscan Genes, Exoniphy, Augustus Genes, Retroposed Genes, Superfamily, Yale Pseudo, EvoFold, sno/miRNA, ExonWalk
mRNA and EST Tracks: Human mRNAs, Spliced ESTs, Human ESTs, Other mRNAs, Other ESTs, H-Inv, TIGR Gene Index, UniGene, Gene Bounds, Alt-Splicing
Expression and Regulation: Allen Brain, GNF Atlas 2, GNF Ratio, Affy HuEx 1.0, Affy U133
• Human chromosomes are numbered
• Arms are labelled p and q
• Regions labelled ascending from centromere.
• Bases numbered from beginning of small arm to end of long arm.
Orientation
Microsatellites and repeats
Transposable elements
• Important in many common diseases
• Some of the most polymorphic loci
• Make up a large proportion of the genome
Annotation - Repeats
mRNA evidence
Protein evidence
Gene prediction
EST evidence
Predicted transcripts - known, novel
Manually annotated genes
• Different levels of evidence for genes
• Based on homology
• Based on expression
• Based on prediction
Annotation - genes
Expression
Levels & Tissues
Regulatory Elements
• Regulatory elements might be important in complex diseases
• Microarray technology is generating expression data on a large scale
Annotation – Expression and Regulation
Expression varies in space and time
Cross Species
Within Humans
Annotation – Evolutionary
Variation is the most important feature of the genome!?
(issues - alignment)
(issues - ascertainment)
Encyclopedia of DNA Elements - ENCODE
• Variation group – SNPs, indels
• Function group – Promoters, transcription and binding
• Chromatin group – Chromatin modification, replication origins
• Multiple sequence alignment – Conservation vs Constraint
Aim: Understand everything possible about these regions.
1% of genome
14 manually chosen regions
(Alpha & beta globin, HOXA, FOXP2 and CFTR)
Plus 30 randomly selected regions
Human Variation
SNPs – most common variation in the human genome
10 million common variants.
Synonymous and non-synonymous variation
Information in the density of SNPs.
Information in the frequency of SNPs.
Information in the correlation between SNPs.
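The split between synonymous and non-synonymous variation comes down to whether a base change alters the encoded amino acid. The sketch below classifies a single-base change within one codon using the standard genetic code. The function name is hypothetical, and the example assumes the codon and reading frame are already known, which in practice depends on the gene annotation.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_snp(codon, position, alt_base):
    """Classify a change at `position` (0, 1 or 2) of `codon` to `alt_base`."""
    codon = codon.upper()
    mutated = codon[:position] + alt_base.upper() + codon[position + 1:]
    same = CODON_TABLE[codon] == CODON_TABLE[mutated]
    return "synonymous" if same else "non-synonymous"

print(classify_snp("CTT", 2, "C"))  # CTT and CTC both encode leucine
print(classify_snp("GAG", 1, "T"))  # GAG (glutamate) to GTG (valine)
```

Changes that create a stop codon are reported as non-synonymous here; a fuller annotation would usually report them separately.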
HapMap Project
2002 HapMap phase I begins
• Three populations
- (YRI) Yoruba in Ibadan, Nigeria: 90
- (CEU) Utah, USA: 90
- (CHB) Han Chinese in Beijing: 45
- (JPT) Japanese in Tokyo: 44
• Approximately 1 million SNPs
2005 Phase I complete, phase II begins
• Increase from 1 million to ~4.6 million SNPs
2006 Phase II complete, “phase III” begins
• Additional 6 populations
• Kenya, African Americans, Mexican Americans, Italy, India
• Linkage Disequilibrium information is an important tool
• Population genetic annotation is often sample specific
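Linkage disequilibrium between two SNPs is often summarised as r², the squared correlation between alleles on the same haplotype. The sketch below computes it from a list of phased haplotypes with alleles coded 0 and 1. The function name and the toy data are hypothetical. Because the result depends entirely on which haplotypes are sampled, it also illustrates why this kind of annotation is sample specific.

```python
def ld_r_squared(haplotypes):
    """haplotypes: list of (allele_at_snp1, allele_at_snp2), alleles coded 0 or 1.

    Returns r^2, or None if either SNP is monomorphic in the sample.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    return d * d / denom

print(ld_r_squared([(0, 0), (0, 0), (1, 1), (1, 1)]))  # alleles always travel together
print(ld_r_squared([(0, 0), (0, 1), (1, 0), (1, 1)]))  # alleles are independent
```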
The International HapMap
Learning from studies of human variation
• Can learn about how genetic diversity is structured across the globe
• Identify regions which have been under recent positive selection
• Identify recombination hotspots
Hot Topics
• MicroRNAs
20-mers of RNA that perform a diversity of roles – e.g. regulating mRNA levels
• Structural variation
The genome is full of polymorphic insertions and deletions, from 1 kb to a megabase
• Genome-wide association studies
Millions of pounds are being spent on scanning the genome for loci showing association with disease status.
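At a single SNP, the simplest form of an association scan compares allele counts in cases and controls. The sketch below computes the chi-square statistic (1 degree of freedom) for a 2x2 table of allele counts. The function name and the counts are made up. A real study repeats a test like this across very many SNPs, so it also has to correct for multiple testing and for population structure, neither of which is shown here.

```python
def allelic_chi_square(case_a, case_b, control_a, control_b):
    """Chi-square statistic for a 2x2 table of allele counts (cases vs controls)."""
    n = case_a + case_b + control_a + control_b
    row_case, row_control = case_a + case_b, control_a + control_b
    col_a, col_b = case_a + control_a, case_b + control_b
    denom = row_case * row_control * col_a * col_b
    if denom == 0:
        raise ValueError("every row and column of the table needs a non-zero total")
    diff = case_a * control_b - case_b * control_a
    return n * diff * diff / denom

# Made-up counts: allele A is more common in cases (60 of 100) than controls (40 of 100).
print(allelic_chi_square(60, 40, 40, 60))
```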
Chromosomes X and Y
Sex chromosomes