Automatic annotation of biological sequences: ontologies and pipeline systems

Arthur Gruber

Instituto de Ciências Biomédicas, Universidade de São Paulo

AG-ICB-USP

Sequence annotation

• Annotation is the process of adding information to a DNA sequence.

• The information is usually anchored to DNA coordinates.

• Features can be repeats, genes, promoters, protein domains, etc.

• Features can be linked to other databases, e.g. Pfam or PubMed.

Public databases

• GenBank, EMBL and DDBJ.

• All three databases update each other automatically.

Feature table

• http://www.ncbi.nlm.nih.gov/projects/collab/FT/

• Format definition

• Covers DDBJ/EMBL/GenBank

• Defines all accepted annotation terms and hierarchy

Annotation file

Contains:

• A header with:
  • Information about the sequence
  • Organism
  • Authors
  • References
  • Comments

• A feature table containing:
  • Sequence features and coordinates

ID   PFMAL1P4 standard; DNA; INV; 66441 BP.
XX
AC   AL031747;
XX
SV   AL031747.8
XX
DT   24-SEP-1998 (Rel. 57, Created)
DT   27-APR-2000 (Rel. 63, Last updated, Version 13)
XX
DE   Plasmodium falciparum DNA from MAL1P4
XX
KW   HTG; rifin; telomere; var; var-like hypothetical protein.
XX
OS   Plasmodium falciparum (malaria parasite P. falciparum)
OC   Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium.
XX
RN   [1]
RA   Oliver K., Bowman S., Churcher C., Harris B., Harris D., Lawson D.,
RA   Quail M., Rajandream M., Barrell B.;
RT   ;
RL   Submitted (24-SEP-1998) to the EMBL/GenBank/DDBJ databases.
RL   P.falciparum Genome Sequencing Consortium, The Sanger Centre, Wellcome
RL   Trust Genome Campus, Hinxton, Cambridge CB10 1S.

Header (EMBL)
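
The two-letter line codes of an EMBL header (ID, AC, DT, OS, ...) make the record easy to split into fields. The following is only a minimal sketch, not part of the original slides: the function name `parse_embl_header` and the shortened sample text are illustrative assumptions, and real entries are better handled with an established parser.

```python
from collections import defaultdict

def parse_embl_header(text):
    """Group EMBL header lines by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        # Skip empty lines and the "XX" spacer lines between blocks.
        if not code.strip() or code == "XX":
            continue
        fields[code].append(value)
    return dict(fields)

# Shortened copy of the header shown on the slide.
SAMPLE = """\
ID   PFMAL1P4 standard; DNA; INV; 66441 BP.
XX
AC   AL031747;
XX
DT   24-SEP-1998 (Rel. 57, Created)
DT   27-APR-2000 (Rel. 63, Last updated, Version 13)
XX
OS   Plasmodium falciparum (malaria parasite P. falciparum)
"""

header = parse_embl_header(SAMPLE)
print(header["AC"])        # ['AL031747;']
print(len(header["DT"]))   # 2
```

Codes that occur more than once, such as DT, are kept as a list so that no line is lost.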

LOCUS       PFMAL1P4 66442 bp DNA linear INV 02-DEC-2004
DEFINITION  Plasmodium falciparum DNA from MAL1P4, complete sequence.
ACCESSION   AL031747 AL844501
VERSION     AL031747.9 GI:23477012
KEYWORDS    HTG; rifin; telomere; var; var-like hypothetical protein.
SOURCE      Plasmodium falciparum 3D7
  ORGANISM  Plasmodium falciparum 3D7
            Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium.
REFERENCE   1
  AUTHORS   Hall,N., Pain,A., Berriman,M., Churcher,C., Harris,B., Harris,D.,
  TITLE     Sequence of Plasmodium falciparum chromosomes 1, 3-9 and 13
  JOURNAL   Nature 419 (6906), 527-531 (2002)
  PUBMED    12368867
REFERENCE   2
  AUTHORS   Oliver,K., Pain,A., Berriman,M., Bowman,S., Churcher,C., Harris,B.,
            Harris,D., Lawson,D., Quail,M., Rajandream,M., Hall,N. and Barrell,B.
  TITLE     Direct Submission
  JOURNAL   Submitted (24-SEP-1998) P.falciparum Genome Sequencing Consortium,
            The Sanger Centre, Wellcome Trust Genome Campus, Hinxton,
            Cambridge CB10 1SA, UK
COMMENT     On Oct 2, 2002 this sequence version replaced gi:7670004.
            For more information about this sequence or the Malaria Project,
            see http://www.sanger.ac.uk/Projects/P_falciparum.

NCBI Header

Feature

• Region of DNA that was annotated with a key and qualifiers

• Keys: CDS, intron, misc_feature, etc.

• Qualifier: notes or extra information about a feature

e.g. exon (key) /gene="adh" (qualifier)

Feature keys

attenuator, C_region, CAAT_signal, CDS, conflict, D-loop, D_segment, enhancer, exon, GC_signal, gene, iDNA, intron, J_segment, LTR, mat_peptide, misc_binding

misc_difference, misc_feature, misc_recomb, misc_RNA, misc_signal, misc_structure, modified_base, mRNA, N_region, old_sequence, polyA_signal, polyA_site, precursor_RNA, prim_transcript

primer_bind, promoter, protein_bind, RBS, repeat_region, repeat_unit, rep_origin, rRNA, S_region, satellite, scRNA, sig_peptide, snRNA, snoRNA, source, stem_loop, STS, TATA_signal, terminator

transit_peptide, tRNA, unsure, V_region, V_segment, variation, 3'clip, 3'UTR, 5'clip, 5'UTR, -10_signal, -35_signal

Feature qualifier

Additional information about a feature

/allele="text"
/citation=[number]
/codon=(seq:"text",aa:<amino_acid>)
/codon_start=<1
/db_xref="<database>:<identifier>"
/EC_number="text"
/evidence=<evidence_value>
/exception="text"
/function="text"
/gene="text"
/label=feature_label
/map="text"
/note="text"
/number=unquoted
/product="text"
/protein_id="<identifier>"
/pseudo
/standard_name="text"
/translation="text"
/transl_except=(pos:<base_range>,aa:<amino_acid>)
/transl_table
/usedin=accnum:feature_label

Features (EMBL)

Features (NCBI)

FEATURES             Location/Qualifiers
     source          1..66442
                     /organism="Plasmodium falciparum 3D7"
                     /mol_type="genomic DNA"
                     /isolate="3D7"
                     /db_xref="taxon:36329"
                     /chromosome="1"
     repeat_region   1..583
                     /note="telomeric repeat"
     repeat_region   584..1641
                     /note="14bp repeat"
     gene            join(29733..34985,36111..37349)
                     /gene="MAL1P4.01"
                     /note="synonyms: PFA0005w, VAR"
     CDS             join(29733..34985,36111..37349)
                     /gene="MAL1P4.01"
                     /note="Subtelomeric var gene Pfam hit to PF03011 Similar to
                     Plasmodium falciparum VaR, mal1p4.01 vaR SWALL:Q9NFB6
                     (EMBL:AL031747) (2163 aa) fasta scores: E(): 0, 100% id in
                     2163 aa"
                     /codon_start=1
                     /product="erythrocyte membrane protein 1 (PfEMP1)"
                     /protein_id="CAB89209.1"
                     /db_xref="GI:7670005"
                     /db_xref="GOA:Q9NFB6"
                     /db_xref="UniProtKB/TrEMBL:Q9NFB6"
                     /translation="MVTQSSGGGAAGSSGEEDAKHVLDEFGQQVYNEKVEKYANSKIY
                     KEALKGDLSQASILSELAGTYKPCALEYEYYKHTNGGGKGKRYPCTELGEKVEPRFSDTLGGQCTNKKIEGNKYIKGKDVGACAPYRRLHLCSHNLESIQ
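
The location string of the CDS above, `join(29733..34985,36111..37349)`, can be turned into coordinate pairs with a short regular expression. This is a hedged sketch rather than a full parser: the helper names `parse_location` and `total_length` are illustrative assumptions, and only plain forward-strand ranges and `join(...)` are handled (no `complement(...)`, partial markers or remote references).

```python
import re

def parse_location(location):
    """Return (start, end) pairs from 'a..b' or 'join(a..b,c..d)' locations."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]

def total_length(segments):
    """Sum the lengths of 1-based, inclusive segments."""
    return sum(end - start + 1 for start, end in segments)

cds = parse_location("join(29733..34985,36111..37349)")
print(cds)                # [(29733, 34985), (36111, 37349)]
print(total_length(cds))  # 6492
```

The spliced length of 6492 bp corresponds to 2164 codons, i.e. 2163 amino acids plus a stop codon, which agrees with the "2163 aa" quoted in the /note qualifier.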

CDS features

• CDS stands for coding sequence and is used to denote genes and pseudogenes.

• These features are automatically translated on submission, and the resulting protein is added to the protein databases.

/note

• The note field contains all the evidence for a gene call, plus anything else:
  • Similarity (FASTA or BLAST)
  • Domain/motif information (Pfam, TMHMM, etc.)
  • Unusual features (repeats, amino acid richness)

/product

• The name of the gene product, e.g. alcohol dehydrogenase

• Unless there is proof, we must qualify the name:
  • Putative
  • Possible

• Always be conservative, e.g. "putative dehydrogenase" or "dehydrogenase-like protein"

• The only piece of annotation added to the protein databases.

Naming protocols

• Hypothetical protein: unknown function and no homology

• Conserved hypothetical protein: unknown function WITH homology

• Alcohol dehydrogenase-like: looks a bit like it, but may not be

• Putative alcohol dehydrogenase: probably an alcohol dehydrogenase

• Alcohol dehydrogenase: previously characterised and shown to be an alcohol dehydrogenase in this organism

/gene

• The gene name, e.g. ADH1

• Only transfer a gene name if it is meaningful

• Never transfer a gene name like PfB0024.

• Is it a gene family? Make sure no two genes have the same name.

Transitive Annotation

• AKA annotation catastrophe

• Junk in = junk out

• Mis-annotations spread through incorrect database submissions.

How can we standardize the annotation terms?

Through a dynamic controlled vocabulary

So what does that mean?

From a practical view, an ontology is the representation of something we know about. "Ontologies" consist of a representation of things that are detectable or directly observable, and the relationships between those things.

Ontology Structure

cell
  membrane
    mitochondrial membrane
    chloroplast membrane
  chloroplast
    chloroplast membrane

Directed Acyclic Graph (DAG): multiple parentage allowed

GO topology

• The ontologies are structured as directed acyclic graphs

• These are similar to hierarchies but differ in that a more specialized term (child) can be related to more than one less specialized term (parent).

• For example, hexose biosynthetic process has two parents, hexose metabolic process and monosaccharide biosynthetic process.
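
A DAG with multiple parentage can be sketched as a mapping from each term to its direct parents. The fragment below is an illustrative assumption built around the hexose example, not a copy of the real GO graph; the names `PARENTS` and `ancestors` are hypothetical.

```python
# Each term maps to its direct parents; a term may have more than one.
PARENTS = {
    "hexose biosynthetic process": [
        "hexose metabolic process",
        "monosaccharide biosynthetic process",
    ],
    "hexose metabolic process": ["monosaccharide metabolic process"],
    "monosaccharide biosynthetic process": ["monosaccharide metabolic process"],
    "monosaccharide metabolic process": [],
}

def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

for name in sorted(ancestors("hexose biosynthetic process", PARENTS)):
    print(name)
```

In a strict hierarchy every term would have a single parent; here the shared grandparent is reached along two different paths but is reported only once.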

True Path Violations Create Incorrect Definitions

"The pathway from a child term all the way up to its top-level parent(s) must always be true."

• chromosome part_of nucleus

• mitochondrial chromosome is_a chromosome

• Following both links implies that a mitochondrial chromosome is part_of the nucleus. A mitochondrial chromosome is not part of a nucleus!

Corrected structure:

• nuclear chromosome is_a chromosome, and nuclear chromosome part_of nucleus

• mitochondrial chromosome is_a chromosome, and mitochondrial chromosome part_of mitochondrion
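
The true path rule can be checked mechanically by walking upward from a term and collecting every structure it is implied to be part of. The sketch below is an assumption-laden illustration (the function `inferred_part_of` and the two small dictionaries are hypothetical, not GO tooling): it shows that the first structure wrongly places the mitochondrial chromosome in the nucleus, while the corrected one does not.

```python
def inferred_part_of(term, is_a, part_of):
    """Walk upward through is_a and part_of links from `term` and
    collect every part_of target met on the way."""
    wholes, seen = set(), set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(is_a.get(current, []))      # climb to superclasses
        for whole in part_of.get(current, []):   # record containing structures
            wholes.add(whole)
            stack.append(whole)
    return wholes

# Structure with the true path violation.
bad_is_a = {"mitochondrial chromosome": ["chromosome"]}
bad_part_of = {"chromosome": ["nucleus"]}

# Corrected structure.
good_is_a = {
    "nuclear chromosome": ["chromosome"],
    "mitochondrial chromosome": ["chromosome"],
}
good_part_of = {
    "nuclear chromosome": ["nucleus"],
    "mitochondrial chromosome": ["mitochondrion"],
}

print(inferred_part_of("mitochondrial chromosome", bad_is_a, bad_part_of))
print(inferred_part_of("mitochondrial chromosome", good_is_a, good_part_of))
```

The first call returns `{'nucleus'}`, which is biologically false; the second returns `{'mitochondrion'}`.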

GO Definitions: Each GO term has 2 Definitions

• A definition written by a biologist: necessary and sufficient conditions; written definition (not computable)

• Graph structure: necessary conditions; formal (computable)

Term-term relationship

• is_a
  • The is_a relationship is a simple class-subclass relationship, where A is_a B means that A is a subclass of B

• For example, nuclear chromosome is_a chromosome.

GO:0043232 : intracellular non-membrane-bound organelle
  GO:0005694 : chromosome
    GO:0000228 : nuclear chromosome

Term-term relationship

• part_of
  • C part_of D means that whenever C is present, it is always a part of D, but C does not always have to be present

• For example, periplasmic flagellum part_of periplasmic space

GO:0044464 : cell part
  GO:0042995 : cell projection
    GO:0019861 : flagellum
      GO:0009288 : flagellin-based flagellum
        GO:0055040 : periplasmic flagellum

GO:0042597 : periplasmic space
  GO:0055040 : periplasmic flagellum

Current Ontologies

• Molecular function: tasks performed by individual gene products

• Biological process: broad biological goals accomplished by ordered assemblies of molecular functions

• Cellular component: subcellular structures, locations and macromolecular complexes

Search result for toxin

Relationships in GO

• "is_a"

• "part_of"

GO paths to terms

GO definitions

Pyruvate dehydrogenase

Why the interest in GO?

● Universal ontology

● Functional classification scheme with many different levels in a DAG

● Widespread interest from the scientific community

● Mappings to Swiss-Prot (SP) keywords already exist, as do gene product annotations for some organisms

GO Evidence codes

• Experimental Evidence Codes
  • EXP: Inferred from Experiment
  • IDA: Inferred from Direct Assay
  • IPI: Inferred from Physical Interaction
  • IMP: Inferred from Mutant Phenotype
  • IGI: Inferred from Genetic Interaction
  • IEP: Inferred from Expression Pattern

• Computational Analysis Evidence Codes
  • ISS: Inferred from Sequence or Structural Similarity
  • ISO: Inferred from Sequence Orthology
  • ISA: Inferred from Sequence Alignment
  • ISM: Inferred from Sequence Model
  • IGC: Inferred from Genomic Context
  • RCA: Inferred from Reviewed Computational Analysis

• Author Statement Evidence Codes
  • TAS: Traceable Author Statement
  • NAS: Non-traceable Author Statement

• Curator Statement Evidence Codes
  • IC: Inferred by Curator
  • ND: No biological Data available

• Automatically-assigned Evidence Codes
  • IEA: Inferred from Electronic Annotation

• Obsolete Evidence Codes
  • NR: Not Recorded
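
Because evidence codes fall into a few groups, a small lookup table is enough to separate manually reviewed annotations from automatically assigned ones. This is a sketch under stated assumptions: the grouping follows the list above, while the group labels, the function `evidence_group` and the sample (term, code) pairs are hypothetical and for illustration only.

```python
EVIDENCE_GROUPS = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "computational": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "author": {"TAS", "NAS"},
    "curator": {"IC", "ND"},
    "automatic": {"IEA"},
    "obsolete": {"NR"},
}

def evidence_group(code):
    """Return the group name for an evidence code, or None if unknown."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    return None

# Hypothetical (GO term, evidence code) pairs, not real annotations.
annotations = [
    ("GO:0005694", "IDA"),
    ("GO:0000228", "IEA"),
    ("GO:0043232", "ISS"),
]

# Keep only annotations that were not assigned automatically.
reviewed = [(term, code) for term, code in annotations
            if evidence_group(code) != "automatic"]
print(reviewed)
```

Filtering out IEA is a common first step when only curated evidence should support a conclusion.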

Current Mappings to GO

• Consortium mappings: MGD, SGD, FlyBase

• Swiss-Prot keywords

• EC numbers

• InterPro entries

• Medline ID

• Commercial companies: CompuGen, Proteome

InterPro-to-GO

EC number-to-GO

SP keyword-to-GO

GO doesn’t cover…

• Gene products: e.g. cytochrome c is not in the ontologies, but attributes of cytochrome c, such as oxidoreductase activity, are.

• Processes, functions or components that are unique to mutants or diseases: e.g. oncogenesis is not a valid GO term because causing cancer is not the normal function of any gene.

• Attributes of sequence such as intron/exon parameters: these are not attributes of gene products and will be described in a separate sequence ontology (see Sequence Ontology).

• Protein domains or structural features.

• Protein-protein interactions.

• Environment, evolution and expression.

• Anatomical or histological features above the level of cellular components, including cell types.

Sequence Ontology

• The four major aspects of the complete Sequence Ontology are:
  • located sequence features, for objects that can be located on sequence in coordinates,
  • sequence attributes, for describing the properties of features,
  • consequences of mutation, for the annotation of the effects of a mutation,
  • chromosome variation, to describe large-scale variations.

Sequence Ontology

• How to edit an ontology file?
  • OBO-Edit: an ontology editor for biologists

• OBO-Edit compliant format

Generic feature format 3

• Generic format for sequence annotation interchange

• Tab-delimited text file

• Represents features in a hierarchical view

• Uses a controlled vocabulary that is compliant with the Sequence Ontology

• The tab-delimited file has 9 columns:
  • Column 1: "seqid"
  • Column 2: "source"
  • Column 3: "type"
  • Columns 4 & 5: "start" and "end"
  • Column 6: "score"
  • Column 7: "strand": the strand of the feature, + for the positive strand (relative to the landmark), - for the minus strand
  • Column 8: "phase"
  • Column 9: "attributes"
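
The nine columns map directly onto a small record structure. The following is a minimal sketch, assuming a well-formed feature line; the function name `parse_gff3_line` is a hypothetical helper, and directives, comments and most validation are left out.

```python
from urllib.parse import unquote

COLUMNS = ("seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes")

def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(COLUMNS, fields))
    record["start"], record["end"] = int(record["start"]), int(record["end"])
    attributes = {}
    if record["attributes"] != ".":
        # Column 9 holds tag=value pairs separated by ';';
        # a value may itself be a comma-separated list.
        for pair in record["attributes"].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[key] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record

line = "ctg123\t.\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA00001;Parent=gene00001;Name=EDEN.1"
rec = parse_gff3_line(line)
print(rec["type"], rec["start"], rec["end"], rec["strand"])
print(rec["attributes"]["Parent"])
```

Attribute values are kept as lists because tags such as Parent can carry several values.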

How to annotate these splicing variants using Sequence Ontology terms and GFF3?

• The annotated genome region is named "ctg123"

• A gene named EDEN extends from coordinates 1000 to 9000

• The gene encodes three alternatively spliced variants: EDEN.1, EDEN.2 and EDEN.3

• Transcript EDEN.3 has two alternative translation start points

• There is a transcription factor binding site (a promoter element) located 50 bp upstream of the transcriptional start site of EDEN.1

##gff-version 3
##sequence-region ctg123 1 1497228
ctg123 . gene            1000 9000 . + . ID=gene00001;Name=EDEN
ctg123 . TF_binding_site 1000 1012 . + . ID=tfbs00001;Parent=gene00001
ctg123 . mRNA            1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123 . mRNA            1050 9000 . + . ID=mRNA00002;Parent=gene00001;Name=EDEN.2
ctg123 . mRNA            1300 9000 . + . ID=mRNA00003;Parent=gene00001;Name=EDEN.3
ctg123 . exon            1300 1500 . + . ID=exon00001;Parent=mRNA00003
ctg123 . exon            1050 1500 . + . ID=exon00002;Parent=mRNA00001,mRNA00002
ctg123 . exon            3000 3902 . + . ID=exon00003;Parent=mRNA00001,mRNA00003
ctg123 . exon            5000 5500 . + . ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003
ctg123 . exon            7000 9000 . + . ID=exon00005;Parent=mRNA00001,mRNA00002,mRNA00003
ctg123 . CDS             1201 1500 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123 . CDS             3000 3902 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123 . CDS             5000 5500 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123 . CDS             7000 7600 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123 . CDS             1201 1500 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123 . CDS             5000 5500 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123 . CDS             7000 7600 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123 . CDS             3301 3902 . + 0 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123 . CDS             5000 5500 . + 1 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123 . CDS             7000 7600 . + 1 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123 . CDS             3391 3902 . + 0 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123 . CDS             5000 5500 . + 1 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123 . CDS             7000 7600 . + 1 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
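
The Parent attributes in the example define the feature hierarchy, and an exon shared by several transcripts simply lists several parents. The sketch below rebuilds that hierarchy from a hand-copied subset of the example (type, ID and parents only); the names `FEATURES`, `children_by_parent` and `print_tree` are illustrative assumptions.

```python
from collections import defaultdict

# (type, ID, parents) copied from the gene, mRNA and exon lines of the example.
FEATURES = [
    ("gene", "gene00001", []),
    ("TF_binding_site", "tfbs00001", ["gene00001"]),
    ("mRNA", "mRNA00001", ["gene00001"]),
    ("mRNA", "mRNA00002", ["gene00001"]),
    ("mRNA", "mRNA00003", ["gene00001"]),
    ("exon", "exon00001", ["mRNA00003"]),
    ("exon", "exon00002", ["mRNA00001", "mRNA00002"]),
    ("exon", "exon00003", ["mRNA00001", "mRNA00003"]),
    ("exon", "exon00004", ["mRNA00001", "mRNA00002", "mRNA00003"]),
    ("exon", "exon00005", ["mRNA00001", "mRNA00002", "mRNA00003"]),
]

def children_by_parent(features):
    """Index features under each of their parents."""
    children = defaultdict(list)
    for ftype, fid, parents in features:
        for parent in parents:
            children[parent].append((ftype, fid))
    return children

def print_tree(fid, children, depth=0):
    """Print the features below `fid`, indented by depth."""
    for ftype, child_id in children.get(fid, []):
        print("  " * depth + f"{ftype} {child_id}")
        print_tree(child_id, children, depth + 1)

tree = children_by_parent(FEATURES)
print("gene gene00001")
print_tree("gene00001", tree, depth=1)
```

In the printed tree, transcript mRNA00002 carries exons 2, 4 and 5 only, which is how the skipped exon of EDEN.2 is expressed.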

• If you write a GFF file, you can test it! There is an online validator:

http://dev.wormbase.org/db/validate_gff3/validate_gff3_online

Testing the GFF3 Validator

Let’s change the feature names

Annotation viewing and editing: Artemis

• Artemis is a free genome viewer and annotation tool developed by Kim Rutherford (Sanger Institute, UK).

• It allows visualization of sequence features and the results of analyses in the context of the sequence and its six-frame translation.

• Artemis is written in Java and is available for UNIX, GNU/Linux, BSD, Macintosh and MS-Windows systems.

• It can read complete EMBL and GenBank database entries, or sequences in FASTA or raw format. Extra sequence features can be in EMBL, GenBank or GFF format.

AG-FMVZ-USP