Accessing molecular biology information * Genome browsers - UCSC * NCBI * Galaxy

Page 1: Accessing molecular biology information * Genome browsers ...bio.lundberg.gu.se/courses/vt13/bmm2_ncbi_galaxy_2013_print.pdfoktober‐2011 $ 0.09 $ 7 743. The 1000 Genomes Project

Accessing molecular biology information

* Genome browsers - UCSC

* NCBI

* Galaxy

Flow of genetic information and databases

[Figure: DNA (5' to 3'; exons 1 to 4 separated by introns) -> transcription -> RNA -> splicing, polyA addition -> mature mRNA (5' UTR, coding sequence, 3' UTR) -> translation -> protein -> folding. Databases shown alongside: Genbank/EMBL; GenPept / SwissProt / UniProt; Protein Data Bank (PDB).]

Databases, cont.

Redundancy at GenBank => RefSeq

Many sequences are represented more than once in GenBank

2003 RefSeq collection: curated secondary database, non-redundant, selected organisms

• Genome DNA (assemblies)
• Transcripts (RNA)
• Protein

RefSeq vs GenBank

RefSeq | GenBank
Curated | Not curated
NCBI creates from existing data | Author submits
NCBI revises as new data emerge | Only author can revise
Single records for each molecule of major organisms | Multiple records from same loci common
- | Records can contradict each other
Limited to model organisms | No limit to species included
Akin to review articles | Akin to primary literature

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook

Archives with nucleotide sequences

• GenBank/EMBL
• Sequence Read Archive (NCBI)
• GEO (Gene Expression Omnibus, NCBI)
• 1000 Genomes Project
• The Cancer Genome Atlas
• The Cancer Genome Project
• Species-specific databases like
  – FlyBase
  – WormBase
  – Saccharomyces Genome Database

Genome sequencing using a shotgun approach

Sequenced eukaryotic genomes

Sequencing going wild ...

BGI: "capacity to sequence the equivalent of 1,600 complete human genomes each day"

"BGI and BGI Americas aim to build a library of digital life, which includes 1,000 plant and animal reference genomes, 10,000 microorganism genomes".

"Million genome project": sequencing of one million Chinese individuals

[Figure: NCBI Sequence Read Archive size by year, 2008 to 2013; vertical axis from 0 to 1e+15]

NCBI Sequence Read Archive, 2013: a total of ~10^15 nt

http://www.genome.gov/sequencingcosts/

Date             Cost per Mb of DNA Sequence   Cost per Genome
September 2001   $5,292.39                     $95,263,072
March 2002       $3,898.64                     $70,175,437
September 2002   $3,413.80                     $61,448,422
March 2003       $2,986.20                     $53,751,684
October 2003     $2,230.98                     $40,157,554
January 2004     $1,598.91                     $28,780,376
April 2004       $1,135.70                     $20,442,576
July 2004        $1,107.46                     $19,934,346
October 2004     $1,028.85                     $18,519,312
January 2005     $974.16                       $17,534,970
April 2005       $897.76                       $16,159,699
July 2005        $898.90                       $16,180,224
October 2005     $766.73                       $13,801,124
January 2006     $699.20                       $12,585,659
April 2006       $651.81                       $11,732,535
July 2006        $636.41                       $11,455,315
October 2006     $581.92                       $10,474,556
January 2007     $522.71                       $9,408,739
April 2007       $502.61                       $9,047,003
July 2007        $495.96                       $8,927,342
October 2007     $397.09                       $7,147,571
January 2008     $102.13                       $3,063,820
April 2008       $15.03                        $1,352,982
July 2008        $8.36                         $752,080
October 2008     $3.81                         $342,502
January 2009     $2.59                         $232,735
April 2009       $1.72                         $154,714
July 2009        $1.20                         $108,065
October 2009     $0.78                         $70,333
January 2010     $0.52                         $46,774
April 2010       $0.35                         $31,512
July 2010        $0.35                         $31,125
October 2010     $0.32                         $29,092
January 2011     $0.23                         $20,963
April 2011       $0.19                         $16,712
July 2011        $0.12                         $10,497
October 2011     $0.09                         $7,743
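To get a feel for how steep this decline is, the first and last rows of the cost table can be turned into a fold change and an average halving time. This is only a rough sketch: it assumes the whole period follows one exponential trend (the table itself shows a much sharper drop from 2008 onward), and it sets the day of month to 1 purely for convenience.

```python
import math
from datetime import date

# First and last rows of the cost table (cost per genome, USD).
# The day of month is an arbitrary placeholder; the table gives only month and year.
first_date, first_cost = date(2001, 9, 1), 95_263_072
last_date, last_cost = date(2011, 10, 1), 7_743


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end."""
    return (end.year - start.year) * 12 + (end.month - start.month)


fold_decrease = first_cost / last_cost
elapsed_months = months_between(first_date, last_date)

# Average halving time under the simplifying assumption of one exponential trend.
halving_months = elapsed_months / math.log2(fold_decrease)

print(f"fold decrease in cost per genome: {fold_decrease:,.0f}")
print(f"elapsed time: {elapsed_months} months")
print(f"average halving time: {halving_months:.1f} months")
```

Swapping in other rows (for example October 2007 and October 2011) gives the halving time for a shorter window.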

www.1000genomes.org

The 1000 Genomes Project is the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation.

As with other major human genome reference projects, data from the 1000 Genomes Project will be made available quickly to the worldwide scientific community through freely accessible public databases. The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied.

Many other sites have sequences available for downloading

http://cancergenome.nih.gov/

The Cancer Genome Atlas (TCGA) is a landmark research program supported by the National Cancer Institute and National Human Genome Research Institute at the National Institutes of Health. TCGA researchers will identify the genomic changes in more than 20 different types of human cancer. By comparing the DNA in samples of normal tissue and cancer tissue taken from the same patient, researchers can identify changes specific to that particular cancer.

TCGA is analyzing hundreds of samples for each type of cancer. By looking at many samples from many different patients, researchers will gain a better understanding of what makes one cancer different from another cancer. This is important because even two patients with the same type of cancer may experience very different outcomes or respond very differently to treatments. By connecting specific genomic changes with specific outcomes, researchers will be able to develop more effective, individualized ways of helping each cancer patient.

http://www.sanger.ac.uk/genetics/CGP/

The identification of genes that are mutated and hence drive oncogenesis has been a central aim of cancer research since the advent of recombinant DNA technology. The Cancer Genome Project is using the human genome sequence and high throughput mutation detection techniques to identify somatically acquired sequence variants/mutations and hence identify genes critical in the development of human cancers. This initiative will ultimately provide the paradigm for the detection of germline mutations in non-neoplastic human genetic diseases through genome-wide mutation detection approaches.

Archives with nucleotide sequences

• GenBank/EMBL
• Sequence Read Archive (NCBI)
• GEO (Gene Expression Omnibus, NCBI)
• 1000 Genomes Project
• The Cancer Genome Atlas
• The Cancer Genome Project
• Species-specific databases like
  – FlyBase
  – WormBase
  – Saccharomyces Genome Database

NCBI

* Nucleotide
* Protein
* Structure
* PubMed
* OMIM (genetic diseases)
* dbSNP
* Taxonomy browser

NCBI databases

[Figure: DNA -> transcription -> RNA -> translation -> protein -> folding -> protein 3D structure with specific biological function. NCBI databases shown alongside: "Nucleotide", "Protein", "Structure".]

Database formats - EMBL and Genbank

EMBL format

ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X64011.1
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
KW   sod gene; superoxide dismutase.
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
XX
RN   [1]
RX   MEDLINE; 92140371.
RA   Haas A., Goebel W.;
RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT   functional complementation in Escherichia coli and characterization of the
RT   gene product.";
RL   Mol. Gen. Genet. 231:313-322(1992).
XX
RN   [2]
RP   1-756
RA   Kreft J.;
RT   ;
RL   Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL   J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am
RL   Hubland, 8700 Wuerzburg, FRG
XX
DR   SWISS-PROT; P28763; SODM_LISIV.
XX

FH   Key             Location/Qualifiers
FH
FT   source          1..756
FT                   /db_xref="taxon:1638"
FT                   /organism="Listeria ivanovii"
FT                   /strain="ATCC 19119"
FT   RBS             95..100
FT                   /gene="sod"
FT   terminator      723..746
FT                   /gene="sod"
FT   CDS             109..717
FT                   /db_xref="SWISS-PROT:P28763"
FT                   /transl_table=11
FT                   /gene="sod"
FT                   /EC_number="1.15.1.1"
FT                   /product="superoxide dismutase"
FT                   /protein_id="CAA45406.1"
FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
FT                   HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
FT                   IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
FT                   DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
     gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
     ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
     gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca       240
     ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt       300
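The CDS feature and the /translation qualifier in this record can be checked against each other. The sketch below is illustrative only: it uses just the 300 nt printed on the slide (the full entry is 756 bp), so it translates only the part of CDS 109..717 that falls within those 300 nt. The names `CODON_TABLE`, `SEQ` and `translate_region` are made up for this example. The codon table is the standard genetic code; translation table 11, named in the record, assigns the same amino acids and differs in which codons may serve as start codons.

```python
from itertools import product

# Standard genetic code with codons ordered t, c, a, g at each position.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# The five sequence lines shown in the record, copied with their spacing.
SEQ = (
    "cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat"
    "gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa"
    "ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg"
    "gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca"
    "ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt"
).replace(" ", "")


def translate_region(seq: str, start: int, end: int) -> str:
    """Translate seq[start..end] (1-based, inclusive), ignoring a trailing partial codon."""
    region = seq[start - 1:end]
    usable = len(region) - len(region) % 3
    return "".join(CODON_TABLE[region[i:i + 3]] for i in range(0, usable, 3))


# CDS is annotated as 109..717; only positions up to 300 are available here.
cds_start, cds_end = 109, min(717, len(SEQ))
protein = translate_region(SEQ, cds_start, cds_end)
print(len(SEQ), "nt available;", len(protein), "codons translated")
print(protein)
```

The printed peptide should match the beginning of the /translation qualifier, which is a quick way to confirm that feature coordinates are 1-based and inclusive.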

Examples of feature table elements

* to represent a coding sequence that is constructed from a range of exons:
CDS join(1886..1922,2272..2319,3563..3675,4750..4878)

* to represent a coding sequence on the complementary strand of DNA:
CDS complement(1159..2577)
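The two location forms on this slide can be read programmatically. The following is a minimal sketch, not a full parser for the feature table location syntax: it handles only plain `start..end` ranges, optionally wrapped in `join(...)` and/or `complement(...)`, and the function names are made up for this example.

```python
import re


def parse_location(location: str):
    """Return (strand, [(start, end), ...]) for a simple feature location string."""
    text = location.strip()
    strand = "+"
    if text.startswith("complement(") and text.endswith(")"):
        strand = "-"
        text = text[len("complement("):-1]
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    segments = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", text)]
    return strand, segments


def total_length(segments) -> int:
    """Number of bases covered; coordinates are 1-based and inclusive."""
    return sum(end - start + 1 for start, end in segments)


for loc in (
    "join(1886..1922,2272..2319,3563..3675,4750..4878)",
    "complement(1159..2577)",
):
    strand, segments = parse_location(loc)
    length = total_length(segments)
    print(loc, "strand", strand, "length", length, "codons", length // 3)
```

Both example locations cover a multiple of three bases, as expected for a complete coding sequence.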

EMBL and Genbank formats


Common sequence formats

1. Genbank
2. EMBL
3. FASTA

>X12345 Y098TR gene
CGTATCTTACGAGCTACTACGAGGTCTTATCGGACGAGCGACT...

4. FASTQ

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGT
+
!''*((((***+))%%%++)(%%%%).1***-+*
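A FASTQ record has four lines: an `@` header, the sequence, a `+` separator, and one quality character per base. The sketch below reads the example record and converts the quality characters to numeric scores. It assumes the common Phred+33 encoding (score = ASCII code minus 33), which the slide does not state, and `parse_fastq_record` is a name made up for this example.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 encoding; the slide does not specify one


def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (read_id, sequence, quality_scores)."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a four-line FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    scores = [ord(ch) - PHRED_OFFSET for ch in quality]
    return header[1:], sequence, scores


record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*",
]

read_id, sequence, scores = parse_fastq_record(record)
print(read_id, len(sequence), "bases")
print("quality scores:", scores)
```

Under this encoding the first base, with quality character `!`, has score 0, the lowest possible value.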

Search Field [Qualifier] / Definition

Accession [ACCN]
Contains the unique accession number of the sequence or record, assigned to the nucleotide, protein, structure, genome record, or PopSet by a sequence database builder. The Structure database accession index contains the PDB IDs but not the MMDB IDs.

All Fields [ALL]
Contains all terms from all searchable database fields in the database.

Author Name [AUTH]
Contains all authors from all references in the database records. The format is last name space first initial(s), without punctuation (e.g., marley jf).

Feature Key [FKEY]
Contains the biological features assigned or annotated to the nucleotide sequences and defined in the DDBJ/EMBL/GenBank Feature Table (http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html). Not available for the Protein or Structure databases.

Journal Name [JOUR]
Contains the name of the journal in which the data were published. Journal names are indexed in the database in abbreviated form (e.g., J Biol Chem). Journals are also indexed by their ISSNs. Browse the index if you do not know the ISSN or are not sure how a particular journal name is abbreviated.

Modification Date [MDAT]
Contains the date that the most recent modification to that record is indexed in Entrez, in the format YYYY/MM/DD (e.g., 1999/08/05). A year alone (e.g., 1999) retrieves all records modified in that year; a year and month (e.g., 1999/03) retrieves all records modified in that month that are indexed in Entrez.

Organism [ORGN]
Contains the scientific and common names for the organisms associated with protein and nucleotide sequences.

Properties [PROP]
Contains properties of the nucleotide or protein sequence. For example, the Nucleotide database's Properties index includes molecule types, publication status, molecule locations, and GenBank divisions. A Properties index is not available in the Structure database.

Publication Date [PDAT]
Contains the date that records are released into Entrez, in the format YYYY/MM/DD (e.g., 1999/08/05). It is the date the

A selection of search fields using NCBI Entrez.
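Field qualifiers are appended to a search term in square brackets and can be combined with Boolean operators such as AND. As a rough illustration, the sketch below builds such a query and the corresponding request URL for NCBI's E-utilities `esearch` endpoint. E-utilities is not mentioned in the slides and is brought in here as an assumption about how one might run the search from a script; the helper names are made up, and the code only constructs the URL without contacting NCBI.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: the public E-utilities esearch endpoint; not part of the slides.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded_term(value: str, qualifier: str) -> str:
    """Attach a field qualifier such as ORGN or ACCN to a search value."""
    return f"{value}[{qualifier}]"


def build_esearch_url(db: str, terms) -> str:
    """Combine (value, qualifier) pairs with AND and return the request URL."""
    query = " AND ".join(fielded_term(value, qualifier) for value, qualifier in terms)
    return f"{ESEARCH_URL}?{urlencode({'db': db, 'term': query})}"


# Example: the Listeria ivanovii sod record used earlier in the slides.
url = build_esearch_url(
    "nucleotide",
    [("Listeria ivanovii", "ORGN"), ("X64011", "ACCN")],
)
print(url)

# Decode the query string again to show the search term as Entrez would see it.
print(parse_qs(urlsplit(url).query)["term"][0])
```

The same fielded query string can be pasted directly into the search box of the NCBI Nucleotide database.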