Insilico Gene Analysis
Transcript of Insilico Gene Analysis
-
8/8/2019 Insilico Gene Analysis
1/34
iabt
In silico Gene Analysis
-
Outline
Introduction
Alignment
ORF searching
3D protein modeling
Case study
-
INTRODUCTION
What is a gene?
A length of DNA that codes for a particular protein or, in certain cases, a functional or structural RNA molecule
What are the essential components of a gene?
Initiation codon
Introns and exons (in eukaryotes)
Stop codon
Regulatory sequences
-
INTRODUCTION .
Essential features of a gene that are considered for in silico gene analysis
All proteins are built from 20 amino acids (each with a one-letter code)
Stop codons are also fixed: TAA, TAG and TGA
Intron boundaries: GU-AG (GT-AG in the DNA sequence)
Codon usage differs from organism to organism
All nucleotide sequences essentially contain A, T, G and C
Initiation codon is fixed: ATG
-
FILE FORMATS
FASTA format
>XM_414949 | Gallus gallus | alpha 2 globin
MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYF
IG format
; comment
; comment
XM_414949
MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYF1
GDE format
%XM_414949 | Gallus gallus | alpha 2 globin
MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYF
NBRF/PIR format
>P1; XM_414949 | Gallus gallus | alpha 2 globin
MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYF
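As an illustration of how the FASTA layout above can be read by a program, here is a minimal sketch in Python. The function name `parse_fasta` is hypothetical and not part of any tool named in these slides. It assumes a header line starting with ">" followed by one or more sequence lines.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>XM_414949 | Gallus gallus | alpha 2 globin
MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYF
"""
records = parse_fasta(example)
```

The other formats differ mainly in the marker that introduces the header (";" comment lines for IG, "%" for GDE, ">P1;" for NBRF/PIR), so the same loop could be adapted by changing the header test.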
-
ALIGNMENTS
An alignment is the result of comparing two or more gene or protein sequences to determine their degree of base or amino acid similarity
TYPES OF ALIGNMENT
Pairwise alignment, multiple alignment
Local alignment, global alignment
-
>NG_000007 |chromosome 11| beta hemoglobin|Homo sapiens
REFERENCE SEQUENCE
atggtgcatctgactcctgaggagaagtctgccgttactgccctgtggggcaaggtgaacgtggatgaagttggtggtgaggccctgggcaggctgctggtggtctacccttggacccagag
gttctttgagtcctttggggatctgtccactcctgatgctgttatgggcaaccctaaggtgaaggctcatggcaagaaagtgctcggtgcctttagtgatggcctggctcacctggacaacctcaagggcacctttgccacactgagtgagctgcactgtgacaagctgcacgtggatcctgagaacttcaggctcctgggcaacgtgctggtctgtgtgctggcccatcactttggcaaagaattcaccccaccagtgcaggctgcctatcagaaagtggtggctggtgtggctaatgccctggcccacaagtatcactaa
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
-
PAIRWISE ALIGNMENT
Two sequences are compared at a time
Sequences may be nucleotide-nucleotide or amino acid-amino acid
The alignment may be gapped or ungapped
Examples: BLAST and FASTA
Two algorithms
Smith-Waterman algorithm (local alignment)
Needleman-Wunsch algorithm (global alignment)
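To make the global alignment idea concrete, here is a minimal sketch of Needleman-Wunsch in Python. The scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not values taken from these slides. Real tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("ACGT", "AGT")
```

Smith-Waterman uses the same table but clamps every cell at zero and starts the traceback from the highest-scoring cell, which is what makes the alignment local.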
-
BLAST (Basic Local Alignment Search Tool)
Pairwise local alignment
Developed by Stephen Altschul, Warren Gish and David Lipman
BLAST searches for short matches of a fixed length W between the query and sequences in the database
Stages in search
If the query and a database sequence share a common word, BLAST extends an ungapped alignment from that word in both directions
BLAST then performs a gapped alignment between the query sequence and the database sequence
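The word-matching and ungapped-extension stages can be sketched as follows. This is a simplified illustration, not the BLAST implementation. It uses exact word matches only, and the scoring values and the `drop` cut-off are assumptions chosen for the example.

```python
def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where a word of length w matches exactly."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds


def extend_ungapped(query, subject, i, j, w=3, match=1, mismatch=-1, drop=2):
    """Extend a seed in both directions without gaps.

    Extension in a direction stops once the running score falls `drop`
    below the best score seen. Returns (best_score, query_start, query_end).
    """
    best = score = w * match
    best_right = k = 0
    while i + w + k < len(query) and j + w + k < len(subject):
        score += match if query[i + w + k] == subject[j + w + k] else mismatch
        k += 1
        if score > best:
            best, best_right = score, k
        elif best - score >= drop:
            break
    score = best
    best_left = k = 0
    while i - k - 1 >= 0 and j - k - 1 >= 0:
        score += match if query[i - k - 1] == subject[j - k - 1] else mismatch
        k += 1
        if score > best:
            best, best_left = score, k
        elif best - score >= drop:
            break
    return best, i - best_left, i + w + best_right


query, subject = "ACGTTGCA", "TTACGTTGAA"
seeds = find_seeds(query, subject)
hsp = extend_ungapped(query, subject, *seeds[0])
```

Here the first seed extends to cover `query[0:6]`, the kind of ungapped segment that a later gapped alignment stage would start from.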
-
BLAST .
It treats the whole database as one sequence and aligns the query sequence against it
High-scoring segment pairs (HSPs)
-
BLAST ..
Low complexity region
-
FASTA
Pairwise local alignment
Developed by David J. Lipman and William R. Pearson in 1985
It looks for identically matching words of a set length, called ktup
It identifies a single high-scoring region
It compares each database sequence individually with the query sequence
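The ktup word search can be illustrated with a short sketch that counts identical words per diagonal (query position minus database position) and reports the diagonal with the most hits. The function name `best_diagonal` is hypothetical. The sketch covers only this first word-matching step, under the assumption that a dense diagonal marks the high-scoring region.

```python
from collections import Counter


def best_diagonal(query, subject, ktup=2):
    """Return (diagonal, hit_count) for the diagonal with most ktup matches, or None."""
    positions = {}
    for i in range(len(query) - ktup + 1):
        positions.setdefault(query[i:i + ktup], []).append(i)
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], []):
            diagonals[i - j] += 1  # hits on one diagonal share the same offset
    if not diagonals:
        return None
    return diagonals.most_common(1)[0]


hit = best_diagonal("MVLSAADK", "GGMVLSAADK", ktup=2)
```

A diagonal of -2 here means the matching region starts two residues later in the database sequence than in the query.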
-
FASTA .
It aligns each database sequence individually with the query sequence
The E value is calculated differently from BLAST
E = Np
-
PROTEIN MATRICES
C 1
S 0 1
T 0 0 1
P 0 0 0 1
A 0 0 0 0 1
G 0 0 0 0 0 1
N 0 0 0 0 0 0 1
D 0 0 0 0 0 0 0 1
E 0 0 0 0 0 0 0 0 1
Q 0 0 0 0 0 0 0 0 0 1
H 0 0 0 0 0 0 0 0 0 0 1
R 0 0 0 0 0 0 0 0 0 0 0 1
K 0 0 0 0 0 0 0 0 0 0 0 0 1
M 0 0 0 0 0 0 0 0 0 0 0 0 0 1
I 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
L 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1V 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
F 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
Y 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
W 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
C S T P A G N D E Q H R K M I L V F Y W
C 12
S 0 2
T -2 1 3
P -1 1 0 6
A -2 1 1 1 2
G -3 1 0 -1 1 5
N -4 1 0 -1 0 0 2
D -5 0 0 -1 0 1 2 4
E -5 0 0 -1 0 0 1 3 4
Q -5 -1 -1 0 0 -1 1 2 2 4
H -3 -1 -1 0 -1 -2 2 1 4 3 6
R -4 0 -1 0 -2 -3 0 -1 -1 1 2 6
K -5 0 0 -1 -1 -2 1 0 0 1 0 3 5
M -5 -1 -1 -2 -1 -3 -2 -3 -2 -1 -2 0 0 6
I -3 0 0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 2 5
L -6 -2 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6
V -2 0 0 -1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4
F -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1 9
Y 0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2 -1 -1 -2 7 10
W -8 -5 5 -6 -6 -7 4 7 7 5 3 2 -3 -4 -5 -2 -6 0 0 17
C S T P A G N D E Q H R K M I L V F Y W
Associated substitution matrix: PAM250 matrix
[Figure axis labels: SUBJECT, QUERY]
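A lower-triangular matrix like the ones above is symmetric, so a pair is looked up with the larger index as the row and the smaller as the column. The sketch below shows this for the first five residues (C, S, T, P, A), using the values from the PAM250 table on this slide. The function names are hypothetical.

```python
ORDER = "CSTPA"
# First five rows of the lower-triangular PAM250 table shown on the slide.
LOWER = [
    [12],
    [0, 2],
    [-2, 1, 3],
    [-1, 1, 0, 6],
    [-2, 1, 1, 1, 2],
]


def pair_score(x, y):
    """Score one residue pair; the matrix is symmetric, so order does not matter."""
    i, j = ORDER.index(x), ORDER.index(y)
    if i < j:
        i, j = j, i
    return LOWER[i][j]


def ungapped_score(a, b):
    """Sum pair scores over two equal-length, already aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(x, y) for x, y in zip(a, b))


total = ungapped_score("CAT", "CTS")  # C-C, A-T, T-S
```

With the identity matrix the same lookup would return 1 for identical residues and 0 otherwise, which is why PAM-style matrices can reward conservative substitutions that an identity matrix ignores.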
-
GAPS AND PENALTIES
Constant penalty: usually 1
Proportional penalty: depends on the length of the gap
Affine penalty: gap opening penalty + gap extension penalty
S = (alignment score from matrix) - (gap penalty)
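The three penalty schemes can be compared with a small sketch. The default costs (opening 10, extension 1) are illustrative assumptions. The affine form here charges the opening penalty for the first gap position and the extension penalty for each further position, and some programs define it slightly differently.

```python
def constant_penalty(length, cost=1):
    """Same cost for any gap, regardless of its length."""
    return cost if length > 0 else 0


def proportional_penalty(length, per_position=1):
    """Cost grows linearly with gap length."""
    return per_position * length


def affine_penalty(length, gap_open=10, gap_extend=1):
    """Opening cost for the first position, extension cost for each further one."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


def alignment_score(matrix_score, gap_lengths, penalty=affine_penalty):
    """S = score from the substitution matrix minus the total gap penalty."""
    return matrix_score - sum(penalty(g) for g in gap_lengths)


s = alignment_score(50, [1, 3])  # two gaps, of length 1 and 3
```

Under the affine scheme one long gap costs less than several short gaps of the same total length.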
-
RESTRICTION SITES
-
MULTIPLE ALIGNMENT
More than two sequences
Gaps are frequent
Always global alignment
-
WHY DO WE NEED MULTIPLE ALIGNMENT?
Searching for homology between sequences
Characterizing protein families: conserved domains, promoters, etc.
Designing special probes, degenerate primers, etc.
Required in protein modeling
Helps in predicting the secondary and tertiary structure of a new sequence
Input for constructing a phylogenetic tree
-
MULTIPLE ALIGNMENT ALGORITHMS
Hierarchical method (Clustal W)
Divide and conquer method
[Figure: sequences A, B, C, D and E]
-
MULTIPLE ALIGNMENT .
Gaps
Conserved region
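Given sequences that are already aligned, conserved positions can be picked out column by column. The sketch below is a minimal illustration with a hypothetical function name and made-up example sequences. It treats a column as conserved only if it has no gap and every sequence has the same residue.

```python
def conserved_columns(aligned):
    """Return indices of gap-free columns where all sequences agree."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    result = []
    for col, residues in enumerate(zip(*aligned)):
        if "-" not in residues and len(set(residues)) == 1:
            result.append(col)
    return result


# Hypothetical alignment of three short sequences ("-" marks a gap).
alignment = ["MV-LSA", "MVHLTA", "MV-LSA"]
columns = conserved_columns(alignment)
```

Runs of adjacent conserved columns correspond to the conserved regions marked on the slide.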
-
CONSERVED DOMAIN SEARCH
Conserved domain
Some of the sequence (20 %) is missing in BLAST at the C-terminal end
-
SOFTWARE AVAILABLE
Clustal W / X
BioEdit
Q align
CLC Free Workbench
Gene tool
Vector NTI
NCBI server
EMBL server
-
PHYLOGENETIC ANALYSIS
Sequences should be correct and should originate from the specified source
Sequences should be homologous
Each position in an alignment should be homologous across all sequences in that alignment
Sequences should not be contaminated, e.g. nuclear sequences mixed with organelle genome sequences
-
PHYLOGENETIC ANALYSIS.
Tree building methods
Distance method: UPGMA, NJ
Character based method: maximum parsimony method, maximum likelihood method
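As an illustration of a distance method, here is a minimal UPGMA sketch. It repeatedly merges the two closest clusters and averages their distances to the remaining clusters, weighted by cluster size. The taxon names and distances are invented for the example. Branch lengths are not computed.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a tree as nested tuples.

    names: list of taxon labels.
    dist:  dict mapping a pair (a, b) of labels to their distance.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of taxa in it
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        for other in list(sizes):
            if other == a or other == b:
                continue
            # Size-weighted average distance from the new cluster to `other`.
            d[frozenset((merged, other))] = (
                d[frozenset((a, other))] * sizes[a]
                + d[frozenset((b, other))] * sizes[b]
            ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


# Hypothetical pairwise distances between four taxa.
distances = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
    ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10,
}
tree = upgma(["A", "B", "C", "D"], distances)
```

Neighbour joining uses the same distance input but corrects each distance for how far the two clusters are from all the others before choosing a pair, so it does not assume a constant rate of change.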
-
NEIGHBOUR JOINING METHOD
[Figure: two trees of the taxa S, P, G, At, L, Ao and H]
-
ORF SEARCHING
Molecular biology background
An ORF contains the following features
Initiation codon
Stop codon
Intron boundaries
Defined codon usage
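The start and stop codon features can be turned into a simple ORF scan. The sketch below is an illustration under stated assumptions: it searches only the three forward reading frames, ignores introns and codon usage, and uses the standard genetic code. The function names and the `min_codons` parameter are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the usual T, C, A, G codon ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) positions of ATG...stop ORFs in the three forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))  # end includes the stop codon
                start = None
    return orfs


def translate(orf):
    """Translate an ORF into one-letter amino acids, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE[orf[pos:pos + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# First nine codons of the beta hemoglobin reference sequence, with an added stop codon.
dna = "ccatggtgcatctgactcctgaggagaagtaagg"
orfs = find_orfs(dna)
peptide = translate(dna[orfs[0][0]:orfs[0][1]])
```

For eukaryotic genomic DNA this scan is only a starting point, since intron boundaries would also have to be located.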
-
ORF FINDING ALGORITHMS
Content-based method
Site-based method
Comparative method
-
ORF FINDING ALGORITHMS
Text information
Graphical view
-Hemoglobin gene
-
SOFTWARE AVAILABLE
GENSCAN
Gene tool
CLC Free Workbench
-
PROTEIN THREE DIMENSIONAL MODELING
Comparative modeling
Fold recognition
Ab initio prediction
-
COMPARATIVE PROTEIN MODELING
Start
Identify related structure
Select template
Align target sequence with template structure
Build model for target
Evaluate the model
Model OK?
YES: end
NO: repeat from an earlier step
-
COMPARATIVE PROTEIN MODELING
[Figure labels: Bovine hemoglobin, Human hemoglobin, Beta chain]
-
SOFTWARE AVAILABLE
Cn3D
Bioediter
DeepView / Swiss-PdbViewer
-
OTHER METHODS
Fold recognition
Also called protein threading
It uses a library of models
The model is constructed based on the library information
Ab initio prediction
Uses thermodynamics and quantum mechanics