IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis
sequence of file formats in bioinformatics
Uploaded by nadeem-akhter
SEQUENCE FILE FORMATS
Introduction
Data is stored in a biological database either as sequences or in molecular form.
Each database represents its data in its own file format.
File formats fall into two categories: sequence database formats and molecular database formats.
Sequence file formats
GenBank flat-file format
FASTA format
Multi-FASTA format
GCG format
GCG-MSF format
EMBL format
Clustal format
Swiss-Prot format
GenBank flat-file format
Used by NCBI.
It is divided into three parts:
Header: a brief, precise introductory part.
Features: all genes in the sequence, the location of genes in the genome, protein products, coding regions, etc.
Sequence: starts with ORIGIN and ends with //, for example:
ORIGIN atcgatcgatgcgctat
//
Description of GenBank flat-file identifiers
HEADERS
Locus, Definition, Accession, Version, Dbsource: dates for creation and modifications
Keywords
Source, Organism
References, Authors, Title, Journal, Medline ID: all published sources
Comment
FEATURES
SEQUENCE
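The three-part layout of a flat-file record (header, FEATURES, then the sequence after ORIGIN, closed by //) can be sketched in Python. This is a minimal illustration rather than a full GenBank parser; the function name `split_genbank_record` and the shortened sample record are hypothetical.

```python
def split_genbank_record(text):
    """Split one GenBank flat-file record into header, features and sequence."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("//"):           # record terminator
            break
        if line.startswith("FEATURES"):     # feature table begins
            current = "features"
            continue
        if line.startswith("ORIGIN"):       # sequence data begins
            current = "sequence"
            continue
        sections[current].append(line)
    # Sequence lines carry position numbers and spaces; keep letters only.
    raw = "".join(sections["sequence"])
    sequence = "".join(ch for ch in raw if ch.isalpha())
    return {
        "header": sections["header"],
        "features": sections["features"],
        "sequence": sequence,
    }


# Hypothetical, heavily shortened record used only for illustration.
SAMPLE_RECORD = """\
LOCUS       EXAMPLE1                  17 bp    DNA     linear
DEFINITION  Example sequence.
ACCESSION   EXAMPLE1
FEATURES             Location/Qualifiers
     source          1..17
ORIGIN
        1 atcgatcgat gcgctat
//
"""

record = split_genbank_record(SAMPLE_RECORD)
print(record["sequence"])
```

Real records contain many more header keywords and nested feature qualifiers, which this sketch keeps only as raw lines.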
Retrieved from NCBI
FASTA format
One-line header that starts with > followed by the name of the gene.
The sequence of the gene or protein follows.
Blank spaces, paragraph marks and numerals are all ignored.
An asterisk (*) marks the end of the sequence.
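The reading rules for a FASTA record can be sketched as a small Python function: take the name from the header line, then drop spaces, line breaks, numerals and a closing asterisk from the sequence. The function name `read_fasta` and the shortened sample are assumptions made for illustration. Keeping letters only is a simplification, since it would also drop gap characters.

```python
def read_fasta(text):
    """Return (name, sequence) for a single-record FASTA text."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must start with a '>' header line")
    name = lines[0][1:].strip()
    body = "".join(lines[1:])
    # Keep letters only: drops spaces, numerals and a trailing '*'.
    sequence = "".join(ch for ch in body if ch.isalpha())
    return name, sequence


# Short illustrative record (not a complete gene sequence).
SAMPLE_FASTA = """\
>p53
ctcgaggggc ctagacattg
21 gccagggtgt ccccttccta*
"""

name, sequence = read_fasta(SAMPLE_FASTA)
print(name, len(sequence))
```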
FASTA Format
>p53
ctcgaggggc ctagacattg ccctccagag agagcaccca acaccctcca ggcttgaccg
61 gccagggtgt ccccttccta ccttggagag agcagcccca gggcatcctg cagggggtgc
121 tgggacacca gctggccttc aaggtctctg cctccctcca gccaccccac tacacgctgc
181 tgggatcctg gatctcagct ccctggccga caacactggc aaactcctac tcatccacga
241 aggccctcct gggcatggtg gtccttccca gcctggcagt ctgttcctca cacaccttgt
301 tagtgcccag cccctgaggt tgcagctggg ggtgtctctg aagggctgtg agcccccagg
361 aagccctggg gaagtgcctg ccttgcctcc ccccggccct gccagcgcct ggctctgccc*
Multi-FASTA Format
An aggregation of FASTA records of the kind described above.
Multiple sequences follow one after the other in a single file.
Accepted by several databases and tools, such as Clustal W and Multalin.
Multi-FASTA format
>jhuma
gccagggtgt ccccttccta ccttggagag agcagcccca gggcatcctg cagggggtgc
>bhuma
gccagggtgt ccccttccta ccttggagag agcagcccca gggcatcctg cagggggtgc
>puma
gccagggtgt ccccttccta ccttggagag agcagcccca gggcatcctg cagggggtgc
>zuma
gccagggtgt ccccttccta ccttggagag agcagcccca gggcatcctg cagggggtgc
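Because every record in a multi-FASTA file begins with its own > header, the file can be split by watching for that character. The sketch below is a minimal Python illustration; the generator name `read_multi_fasta` and the two-record sample are hypothetical.

```python
def read_multi_fasta(text):
    """Yield (name, sequence) pairs from multi-FASTA text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)   # emit the previous record
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line.replace(" ", ""))
    if name is not None:
        yield name, "".join(chunks)           # emit the last record


# Two shortened records, for illustration only.
SAMPLE = """\
>jhuma
gccagggtgt ccccttccta
>bhuma
gccagggtgt ccccttccta
"""

records = dict(read_multi_fasta(SAMPLE))
print(list(records))
```

Collecting the pairs into a dictionary assumes the record names are unique; a list of pairs would be the safer choice otherwise.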
GCG Format
GCG: Genetics Computer Group.
The first line states the sequence type: !!NA_SEQUENCE 1.0 for nucleic acids or !!AA_SEQUENCE 1.0 for proteins.
A simple format that gives just the sequence of the gene or protein.
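Since the first line of a GCG file carries a type tag, a program can decide whether it holds a nucleic acid or a protein sequence before reading further. The Python sketch below checks only that tag, assuming the tags are spelled `!!NA_SEQUENCE` and `!!AA_SEQUENCE`; the function name `gcg_sequence_type` is hypothetical and the rest of the GCG layout is not handled.

```python
def gcg_sequence_type(text):
    """Return 'nucleic acid', 'protein' or None from a GCG file's first line."""
    lines = text.lstrip().splitlines()
    first_line = lines[0].strip() if lines else ""
    if first_line.startswith("!!NA_SEQUENCE"):
        return "nucleic acid"
    if first_line.startswith("!!AA_SEQUENCE"):
        return "protein"
    return None   # no recognised type tag


print(gcg_sequence_type("!!NA_SEQUENCE 1.0\n"))
print(gcg_sequence_type("!!AA_SEQUENCE 1.0\n"))
```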
GCG format
GCG-MSF Format
Holds multiple sequences: the sequence names, the sequences and their alignment.
The word PileUp indicates that the file contains multiple sequences.
The mandatory word MSF tells that the file is an MSF GCG file and not just a GCG file.
Comments are terminated with //, followed by 2 consecutive blank lines and then the multiple sequences.
GCG MSF Format
EMBL Format
Sequence format of the European Molecular Biology Laboratory (EMBL) database.
Starts with an ID (identification) line.
Ends with // as terminator.
Each line type has its own format.
Used to record various forms of data, e.g. DNA, RNA, gene and protein sequences.
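The two boundary rules, a record starts with an ID line and ends with //, are enough to split a file of several entries into individual records. The following Python sketch shows that idea only; the function name `split_embl_records` and the sample entries are hypothetical, and the individual line types are not interpreted.

```python
def split_embl_records(text):
    """Split EMBL-style text into records that start with ID and end with //."""
    records, current = [], None
    for line in text.splitlines():
        if line.startswith("ID"):
            current = [line]                 # a new record begins
        elif line.startswith("//"):
            if current is not None:
                records.append(current)      # terminator closes the record
            current = None
        elif current is not None:
            current.append(line)
    return records


# Hypothetical, shortened entries used only for illustration.
SAMPLE_EMBL = """\
ID   EXAMPLE1; hypothetical entry
DE   Example sequence one.
SQ   Sequence 10 BP;
     atcgatcgat
//
ID   EXAMPLE2; hypothetical entry
DE   Example sequence two.
//
"""

embl_records = split_embl_records(SAMPLE_EMBL)
print(len(embl_records))
```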
EMBL format
Clustal Format
Clustal is among the most widely used sequence alignment tools.
Versions: CLUSTAL W and CLUSTAL X.
The format holds aligned protein or gene sequences.
Clustal X
Swiss-Prot format
Protein sequence database
ID: identification
AC: accession number
DE: description
GN: gene name
OS: organism species
OG: organelle
OC: organism classification
OX: organism taxonomy cross-reference
RN: reference number
RP: reference position
Continued…
RC: reference comment
RX: reference cross-reference
RA: reference author
RT: reference title
RL: reference location
CC: comments
DR: database cross-reference
KW: keyword
FT: feature table
SQ: sequence
//: termination line
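The two-letter line codes lend themselves to a simple lookup table. The Python sketch below labels each coded line of an entry with its meaning; the names `LINE_CODES` and `describe_lines` and the sample entry are hypothetical, and CC is labeled as comments, which is how the Swiss-Prot documentation describes that line type.

```python
# Two-letter Swiss-Prot line codes and their meanings.
LINE_CODES = {
    "ID": "identification",
    "AC": "accession number",
    "DE": "description",
    "GN": "gene name",
    "OS": "organism species",
    "OG": "organelle",
    "OC": "organism classification",
    "OX": "organism taxonomy cross-reference",
    "RN": "reference number",
    "RP": "reference position",
    "RC": "reference comment",
    "RX": "reference cross-reference",
    "RA": "reference author",
    "RT": "reference title",
    "RL": "reference location",
    "CC": "comments",
    "DR": "database cross-reference",
    "KW": "keyword",
    "FT": "feature table",
    "SQ": "sequence",
    "//": "termination line",
}


def describe_lines(text):
    """Return (code, meaning, content) for each line with a known code."""
    described = []
    for line in text.splitlines():
        code = line[:2]
        if code in LINE_CODES:
            described.append((code, LINE_CODES[code], line[2:].strip()))
    return described


# Hypothetical, shortened entry used only for illustration.
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN
AC   X00000;
DE   Hypothetical example protein.
OS   Homo sapiens.
SQ   SEQUENCE
     MEEPQSDPSV
//
"""

described = describe_lines(SAMPLE_ENTRY)
for code, meaning, content in described:
    print(code, meaning, content)
```

The indented sequence data line carries no code, so it is skipped here; a fuller reader would attach it to the preceding SQ line.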
Sequence conversion tools
Several software tools have been designed by … ?
The aim of these tools is to convert one sequence format into another.
Some of the software widely used for sequence interconversion:
ReadSeq
GCG
SeqVerter
Seqret
ReadSeq
Developed by Dr. D. G. Gilbert.
Automated conversion.
Supports 18 file formats, which can be interconverted.
Assignment
FASTA
Multi-FASTA
Flat file
GCG format
EMBL
Clustal
Swiss-Prot
Make each file by this Friday and send them as attachments in an email.
Molecular file formats
continued…