Sequence analysis of CpG islands reveals possible functional correlation between genes and its CpG island sequence
Henry Hyun-il Paik
Bioinformatics, School of Informatics
Indiana University
Outline
• What CpG islands are
• The Known Relations between CpG islands and Genes
• Motivation and Goal
• Data set
• Procedures
• Results
• Discussion
What are CpG islands?
• CpG dinucleotides are rare in mammalian DNA
• DNA methylation occurs only at CpG sites
• Methylated cytosines may be converted to thymine by deamination over evolutionary time
– CpG → TpG
• CpG islands are short stretches of DNA with a higher frequency of the CG sequence
• They are usually not methylated
What are CpG islands?
• Definition from Gardiner-Garden & Frommer
– At least 200 bases long
– G+C content: > 50%
– Observed CpG/expected CpG ratio: >= 0.6
• Definition from Takai & Jones
– Longer than 500 bp
– G+C content: > 55%
– Observed CpG/expected CpG ratio: >= 0.65
– With this definition, CpG islands are more likely to be associated with the 5’ regions of genes, and most Alu repeats are excluded
• There are about 29,000 such regions in the human genome
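The criteria above can be checked for a single candidate sequence with a few lines of Python. This is a minimal sketch, not the tool used in the study: the function names `cpg_stats` and `is_cpg_island` are hypothetical, the expected CpG count is taken as (number of C × number of G) / length, and the length test is treated as "at least `min_len`" for both definitions.

```python
def cpg_stats(seq):
    """Return (length, G+C fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so this counts every CpG
    gc_fraction = (c + g) / n if n else 0.0
    # Assumption: expected CpG count = (#C * #G) / length
    expected = c * g / n if n else 0.0
    ratio = cpg / expected if expected else 0.0
    return n, gc_fraction, ratio


def is_cpg_island(seq, min_len=200, min_gc=0.50, min_ratio=0.60):
    """Defaults follow the Gardiner-Garden & Frommer thresholds listed above."""
    n, gc_fraction, ratio = cpg_stats(seq)
    return n >= min_len and gc_fraction > min_gc and ratio >= min_ratio


def is_cpg_island_takai_jones(seq):
    """Same test with the Takai & Jones thresholds listed above."""
    return is_cpg_island(seq, min_len=500, min_gc=0.55, min_ratio=0.65)
```

For example, `is_cpg_island(candidate)` applies the first definition to a string `candidate`, and `is_cpg_island_takai_jones(candidate)` applies the stricter one.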
CpG islands & Genes
• CpG islands located in the promoter regions of genes can play important roles in gene silencing
• Housekeeping genes
– Almost all housekeeping genes are associated with at least one CpG island
– These CpG islands start 5’ to the transcription start site and cover one or more exons and introns
• Tissue-specific genes
– About 40% of tissue-specific genes are associated with islands
– The position of these islands is not as strongly biased toward the transcription start site as in the housekeeping genes
CpG islands & Genes
• Not all CpG islands are associated with genes
– Ioshikhes & Zhang determined features that discriminate promoter-associated from non-associated CpG islands
• There are methylation-prone and methylation-resistant CpG islands
– Feltus et al. found patterns that discriminate methylation-prone from methylation-resistant CpG islands
CpG islands & Genes
[Figure: positions of CpG islands (CpGi) relative to a gene: promoter CpG islands at the 5’ end, CpG islands in the gene body, and CpG islands at the 3’ end]
Motivation and Objective
• Our project was inspired by these ideas
• The mechanical definition of a CpG island follows the Gardiner-Garden & Frommer definition as it is
– At least 200 bases long
– G+C content: > 50%
– Observed CpG/expected CpG ratio: >= 0.6
• We tried to find the “semantic meaning” of CpG islands: the correlation between CpG islands and gene functions
• Are there any significant CpGi patterns related to gene functions?
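The mechanical definition can be turned into a scan over a longer sequence. The sketch below is one common way to do this, assumed here for illustration rather than taken from the study: slide a fixed 200-base window along the sequence, keep windows that meet the G+C and observed/expected thresholds, and merge overlapping hits. The function name `find_islands` and the `step` parameter are hypothetical.

```python
def find_islands(seq, window=200, step=1, min_gc=0.50, min_ratio=0.60):
    """Return merged (start, end) intervals, 0-based and end-exclusive,
    covered by windows that satisfy the G+C and observed/expected CpG thresholds."""
    seq = seq.upper()
    islands = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        c, g = w.count("C"), w.count("G")
        cpg = w.count("CG")
        gc_fraction = (c + g) / window
        expected = c * g / window  # assumption: expected CpG = (#C * #G) / window length
        if gc_fraction > min_gc and expected > 0 and cpg / expected >= min_ratio:
            end = start + window
            if islands and start <= islands[-1][1]:
                # Overlaps or touches the previous hit: extend it
                islands[-1] = (islands[-1][0], end)
            else:
                islands.append((start, end))
    return islands
```

Because any window that passes is kept whole, the reported boundaries extend somewhat into the flanking sequence; a real island finder would typically trim or re-evaluate the merged region.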
Motivation and Objective
[Figure: Gene 1 with its CpG island (CpGi 1), and Gene 2 with its CpG island (CpGi 2)]
We assume that gene 1 and gene 2 have similar functions
1) Then the gene 1 sequence and the gene 2 sequence are probably similar.
2) Our goal is to find CpGi patterns that appear when genes have similar functions
Data Set
• Reference: Larsen F., Gundersen G., Lopez L., Prydz H. CpG island as Gene Markers in the Human Genome. Genomics 13:1095-1107 (1992)
• Total number of entries: 1711
• Entries with no islands: 1212
• Entries with islands: 499
• Total number of islands: 928
• The length of CpG islands
– Average size of islands: 465 bp
– Shortest detectable island: 200 bp
– Largest island: 3340 bp
Expression of gene  Number  Number associated with islands
Widespread          217     216 (99%)
Limited             719     261 (36%)
A snapshot of the data set
Procedures
[Flowchart]
• Clustering: FASTA all-to-all comparison, then clustering by BAG
• Motif (pattern) discovery & search for each cluster: MEME, then MAST
• Database search with CpG island patterns: BLAST
Clustering
• We use the clustering program BAG by Sun Kim
• We compare each CpG island to all other CpG islands using FASTA; the results are the input to BAG
• BAG makes clusters based on sequence similarity
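To illustrate the general idea of turning all-to-all comparison results into clusters, here is a minimal sketch. It is a stand-in, not BAG: it simply links any two sequences whose pairwise score reaches a threshold and reports the connected groups (single-linkage clustering via union-find). The function name, the `(id_a, id_b, score)` hit format, and the `min_score` threshold are all assumptions made for the example.

```python
def cluster_by_similarity(ids, hits, min_score):
    """Group sequence ids into clusters.

    ids:  iterable of sequence identifiers
    hits: iterable of (id_a, id_b, score) tuples from an all-to-all comparison
    Two ids end up in the same cluster if a chain of hits with
    score >= min_score connects them.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in hits:
        if a != b and score >= min_score:
            parent[find(a)] = find(b)

    clusters = {}
    for i in parent:
        clusters.setdefault(find(i), []).append(i)
    # Largest clusters first, members sorted for stable output
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))
```

A call such as `cluster_by_similarity(island_ids, fasta_hits, min_score=100)` would return a list of clusters, each a sorted list of island ids; the threshold value here is arbitrary.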
Motif Discovery & Search
• MEME discovers patterns for each cluster
• To assess the significance of a pattern, MAST searches all CpG islands with the pattern
• The E-value shows how significant the pattern is and how often it occurs
• Profiles are made to represent each cluster
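The idea of a profile that represents a cluster and is then searched against other sequences can be sketched with a simple position weight matrix. This is an illustration only, not what MEME or MAST compute internally, and it produces raw log-odds scores rather than E-values. The function names, the pseudocount, and the uniform background of 0.25 per base are assumptions.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # log2 of (observed frequency / assumed uniform background)
        pwm.append({base: math.log2(counts[base] / total / background)
                    for base in "ACGT"})
    return pwm


def best_hit(pwm, seq):
    """Return (start, score) of the best-scoring window in seq, or None."""
    width = len(pwm)
    best = None
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows with ambiguous bases such as N
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if best is None or score > best[1]:
            best = (start, score)
    return best
```

Scanning every CpG island with `best_hit` and comparing scores inside and outside the cluster gives a rough picture of how specific a profile is.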
BLAST
• The entire GenBank was searched with the CpG island profile, not with the gene sequence
• We see how efficiently the profile finds genes that have similar functions
• This verifies the validity of the profile
Results
• Of the 115 clusters in total, 26 have members with similar gene functions
• These 26 clusters are divided into two categories depending on CpGi location
– 18 clusters have CpGi’s in the coding region
– 8 clusters have CpGi’s in the promoter region
Results
• One example from CpGi in the gene body
• Cluster # 18 : Human heat-shock protein HSP70B' gene
– MEME
– MAST
– Profile sequence: ATCATCGCCAACGACCAGGGCAACCGCACCACCCCCAGCTACGTGGCCTT
– BLAST
Results
• One example from promoter CpGi
• Cluster # 25 : Human gene for creatine kinase B
– MEME
– MAST
– Profile sequence: GAGGAGTCCTACGAAGTGTTCAAGGATCTCTTCGACCCCATCATTGAGGA
– BLAST
Gene & CpG islands in promoter region
Cluster  Description                                  Acc. No.
7        Human MAGE-4a antigen (MAGE4a) gene          U10687.1 U10687.3 U10687.4 U10687.2 U10687.5
14       Aldose Reductase gene                        M59856.1 L14440.1
25       Human creatine kinase                        M60806.1 X15334.1
72_73    Human metallothionein gene                   M10942.1(arti) J03910.1 M13003.1 K01383.1
79_80    Human gene for neurofilament subunit         X05608.1(arti) X15306.1 Y00067.2
85       Phenylethanolamine N-methyltransferase gene  J03280.1 X52730.1
92       Human U1 small nuclear RNA pseudogene        M14387.1 M28010.1 M28011.1
96       Human trichohyalin (TRHY) gene               L09190.1 L09190.3
Gene & CpG islands in CDS
Cluster  Description                             Acc. No.
9        alpha 2 adrenergic receptor gene        D13538.1 M23533.2 M34041.1 M67439.1(arti) M83181.1 M28269.1 X13556.1
10       actin gene                              M19283.2 M20543.2
13       alkaline phosphatase gene               J03252.1 J03930.1 M31008.2
18       Human heat shock protein                M19645.1 ARTI M59830.1 M11717.1 X51757.1
32       Neurophysin gene                        X62890.1 M11166.1 M11186.1
41       Human v-erbA related ear-2 gene         X12794.1 X12795.1
52       histone H1 (H1F4) gene                  X57130.1 M60748.1 X57129.1
53       histone H3 gene                         X57128.1 M60746.1 M26150.1
54       Human histone H4 (H4) gene              X60482.1 X60483.1 X60484.1 X00091.1 X00038.1 M16707.1 M60749.1 X60487.1 X67081.1 X60486.1
56       serotonin receptor gene                 K02405.1 K02773.1 ARTI K01499.1 X02228.1 M77285.1
58       Human histone H2b gene                  M60751.1 X57985.1 X00088.1
59       Human histone H2a gene                  M60752.1 X00089.1
64       Human heat shock protein                X03901.1 L39370.1
69       proto oncogene (JUN)                    J04111.1 M29039.1
87       Human beta-tubulin pseudogene           X00734.5 J00315.1
90       H.sapiens gene for 28S rRNA V8 region   X69341.1 X69358.1 X69357.1 M11167.1
91       Human POU domain factor (Brn-3a) gene   U10063.1 U10061.1
Discussion
• The BLAST results imply that CpG islands in both the promoter region and the CDS are good markers for gene sequences
• Even though there are only a small number of promoter CpG islands, they represent their clusters significantly
• Since many CpG islands tend to cover exons, they can be used to identify transcripts
• More data are needed to support this result and to build generic patterns
Acknowledgement
• Dr. Sun Kim
• Dr. Paul Ma
• Arvind
• Bioperl community
Comments & Questions