Discovery and revision of Arabidopsis genes by proteogenomics

Natalie E. Castellana, Samuel H. Payne, Zhouxin Shen, Mario Stanke,* Vineet Bafna, and Steven P. Briggs

University of California San Diego; *Institute for Microbiology and Genetics, Göttingen, Germany

Proc. Natl. Acad. Sci. USA 105: 21034-21038 (2008).
Limitations of gene annotation

• Based on evidence of transcripts.
• Depends on gene-finding/protein-prediction algorithms.
• How do we define genes?
• Models suffer from errors in reading frame and exon definition.
• Rare transcripts? Noise?
• Arabidopsis is the best-annotated plant genome, and other plant genomes are annotated relative to Arabidopsis.

Types of alternative splicing

What did Castellana et al. do to detect gene model errors?

• Isolated Arabidopsis proteins from different tissues.
• Analyzed tryptic peptides by tandem mass spectrometry.
• Determined sequences for 144,079 distinct peptides.
• Confirmed gene models for 40% (12,769) of annotated genes (assuming a gene total of 31,922).
• Found 18,024 novel peptides, suggesting that 13% of the proteome was missing or incorrect.
• Added or corrected 1,473 genes/proteins, leaving 1 to 4% of protein-coding genes unidentified.
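The percentages above can be re-derived from the counts on the slide. The short check below is only a sketch of that arithmetic; the variable names are illustrative, and the denominators are my reading of the slide (confirmed genes over the assumed gene total, novel peptides over all distinct peptides).

```python
# Counts quoted on the slide.
distinct_peptides = 144_079
novel_peptides = 18_024
confirmed_genes = 12_769
assumed_gene_total = 31_922


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Share of annotated genes with a peptide-confirmed model.
confirmed_pct = percent(confirmed_genes, assumed_gene_total)

# Share of distinct peptides that are absent from the annotation.
# Assumption: this ratio is what the slide rounds to "13%".
novel_pct = percent(novel_peptides, distinct_peptides)

print(f"confirmed gene models: {confirmed_pct:.1f}%")
print(f"novel peptides:        {novel_pct:.1f}%")
```

The first ratio comes out at 40.0% and the second at about 12.5%, so the slide's 13% appears to be a rounded-up figure.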

Proteins

• Protein extracts of four Arabidopsis organs (leaf, root, flower, silique) and the cell culture MM2d.
• Phosphoproteins from MM2d were enriched using TiO2.
• Sodium orthovanadate (Na3VO4) was used as a phosphatase inhibitor.
• Cysteines were reduced and alkylated.
• Proteins were digested with trypsin.
• Peptides were separated by high-resolution 3D-LC (RP1, SCX, RP2) in 45 runs, producing 144,079 tryptic peptides.
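To make the trypsin step concrete, here is a minimal in silico digest. It assumes the common textbook rule (cleave after K or R, but not when the next residue is P) and no missed cleavages; the function name, the `min_length` filter and the toy sequence are hypothetical and are not taken from the paper.

```python
import re


def tryptic_digest(protein: str, min_length: int = 1) -> list[str]:
    """Split a protein sequence into tryptic peptides.

    Assumed rule: cleave C-terminal to K or R unless the next residue is P.
    Missed cleavages are not modelled. `min_length` is an illustrative
    filter for peptides too short to be informative.
    """
    # Zero-width split: position preceded by K/R and not followed by P.
    pieces = re.split(r"(?<=[KR])(?!P)", protein.upper())
    return [p for p in pieces if len(p) >= min_length]


# Toy sequence (not a real Arabidopsis protein).
peptides = tryptic_digest("AAKGGGRPCCKDD")
print(peptides)
```

In the toy sequence the R is followed by P, so no cut is made there and the middle peptide spans that site.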

Mass Spectrometry (MS)

From Wikipedia: ionized molecules or molecule fragments are measured by their mass-to-charge ratios (m/z).

1) The components of the sample are ionized by an electron beam, which results in the formation of charged particles (ions).

2) The ions are directed into electric and/or magnetic fields.

3) The mass-to-charge ratio of the particles is computed from their motion as they transit through the electromagnetic fields.

4) The ions, which in step 3 were sorted according to m/z, are detected.
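As a worked illustration of the mass-to-charge ratio, the sketch below computes m/z for a protonated peptide ion, the kind produced by electrospray rather than by an electron beam. It uses standard monoisotopic residue masses and assumes unmodified residues (no cysteine alkylation mass is added, since the slides do not name the reagent); the function name is illustrative.

```python
# Standard monoisotopic residue masses in daltons (unmodified residues).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide (free N- and C-termini)
PROTON = 1.007276  # mass of each charge-carrying proton


def peptide_mz(sequence: str, charge: int) -> float:
    """Return m/z of the [M + zH]^z+ ion for an unmodified peptide."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER
    return (neutral_mass + charge * PROTON) / charge


# The same peptide appears at different m/z values for different charges.
for z in (1, 2, 3):
    print(z, round(peptide_mz("SAMPLER", z), 4))
```

Because the mass is divided by the charge, a doubly charged ion shows up at roughly half the m/z of the singly charged one.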

Mass Spectrometers consist of three modules:

1) An ion source, which can convert gas-phase sample molecules into ions (or, in the case of electrospray ionization, move ions that exist in solution into the gas phase).

2) A mass analyzer, which sorts the ions by their masses by applying electromagnetic fields.

3) A detector, which measures the value of an indicator quantity and thus provides data for calculating the abundances of each ion present.

A quadrupole time-of-flight hybrid tandem mass spectrometer.

Multiple stages of mass analysis can be accomplished with MS steps separated in space or in time. In tandem-in-space mass spectrometry, the separation elements are physically separate. These elements can be sectors, transmission quadrupoles, or time-of-flight analyzers.

ESI is electrospray ionization

MALDI is matrix-assisted laser desorption/ionization

Work flow

©2008 by National Academy of Sciences

Castellana N. E. et al. PNAS (2008) 105:21034-21038

Acquisition of Spectra

• Peptides were charged by electrospray ionization.
• Spectra were acquired by LTQ linear ion trap tandem mass spectrometry.
• 21 million spectra were acquired. The data are archived in Tranche (http://tranche.proteomecommons.org).
• Spectra were searched against three reference databases: TAIR 7, a six-frame translation of the genome, and ab initio gene predictions using AUGUSTUS and exon prediction.
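A six-frame translation turns a DNA sequence into the six possible protein readings (three offsets on each strand), so that peptides can be matched without relying on any gene model. The sketch below shows the idea on a toy sequence using the standard genetic code; the function names and frame labels are illustrative, and it is not the pipeline used in the paper.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; '*' marks a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict[str, str]:
    """Return translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

In a real search the translated frames would typically be split at stop codons and digested in silico before spectra are matched against them.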

Number of assigned spectra, distinct peptides, and proteins in different samples and organs.

Baerenfaller et al. (2008) Science 320: 938-941.

Plant tissue                Spectra   Distinct peptides   Proteins   Avg. mol. mass (kDa)
Differentiated organs       465,836              64,219     10,902                   54.6
  Roots                      71,516              27,546      6,125                   55.0
    Roots 10 days            38,476              20,301      5,159                   55.7
    Roots 23 days            33,040              16,984      4,466                   54.3
  Leaves                     80,186              20,417      4,853                   57.5
    Cotyledons               39,419              13,628      3,665                   58.2
    Juvenile leaves          40,767              14,437      3,892                   57.8
  Flowers                   147,650              33,192      7,040                   57.4
    Flower buds              54,588              19,467      5,104                   58.5
    Open flowers             57,861              20,205      5,215                   59.0
    Carpels                  35,201              13,393      3,946                   56.7
  Siliques                   79,589              23,054      5,779                   54.6
  Seeds                      86,895              13,901      3,789                   54.7
Cell culture                324,345              49,842      8,698                   57.3
  Dark                      149,051              34,551      6,547                   59.7
  Light                     143,583              32,656      6,474                   59.8
  Light; small               31,711              15,318      4,472                   43.2
Total                       790,181              86,456     13,029                   54.7
TAIR7                                                       27,029                   45.9

65% of all peptides were detected in only one organ; 1.3% were identified in all organs.

Some Peptide Bookkeeping

Total peptides                                                  144,079
Peptides in TAIR 7 annotation                                   126,055
Peptides not in TAIR                                             18,024
Peptides not in TAIR but uniquely located in the genome          16,348

New intergenic “clusters”                                         1,765 (genes)
Former noncoding pseudogenes                                      561 genes (31%)
Never recognized as genes before due to inadequate support        331 genes (20%)
Uniquely identified by peptides                                   198 genes
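The first split in this bookkeeping (peptides in the annotation versus peptides outside it) can be pictured as a simple membership test against the annotated proteome. The sketch below uses plain substring matching on toy sequences; the function name and data are hypothetical, and the actual study assigned peptides by database search of spectra rather than by string lookup.

```python
def classify_peptides(
    peptides: list[str], annotated_proteins: list[str]
) -> tuple[list[str], list[str]]:
    """Split peptides into (annotated, novel).

    A peptide counts as annotated if it occurs as an exact substring of
    any annotated protein sequence; everything else is reported as novel.
    """
    annotated, novel = [], []
    for pep in peptides:
        if any(pep in prot for prot in annotated_proteins):
            annotated.append(pep)
        else:
            novel.append(pep)
    return annotated, novel


# Toy data (not real TAIR 7 sequences).
proteome = ["MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLK"]
observed = ["LSDFGQETSYIEDNCNQNGAISLIFSLK", "MSTAVLENPGLGR", "VVTTEELLR"]

annotated, novel = classify_peptides(observed, proteome)
print(len(annotated), "annotated;", len(novel), "novel")

# The slide's own subtraction: total minus annotated gives the novel count.
assert 144_079 - 126_055 == 18_024
```

Novel peptides found this way would still need to be placed on the genome, for example via a six-frame translation, before they could support a new or revised gene model.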

Fig. S1. Discovery Curve, showing the number of distinct peptides matching to TAIR7 recovered as a function of the number of annotated spectra. The discovery curve is separated to show the contribution of each individual dataset.

Novel gene discovery

Castellana N. E. et al. PNAS 2008;105:21034-21038

©2008 by National Academy of Sciences

(A) A cluster of 13 uniquely located peptides that do not overlap a current gene model (Chr3). The prediction track shows the single-exon gene model produced by AUGUSTUS.

(B) The predicted sequence shows strong homology to a thylakoid lumen family protein (sp|P82658|TL19_ARATH). It also shows strong similarity to proteins in both grapevine (emb|CAO40861.1, a hypothetical gene) and rice (Os08g0504500, a cDNA-derived gene).
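One simple way to picture how uniquely located peptides form a "cluster" is to merge peptide hits that lie close together on the same chromosome. The sketch below does this with a fixed distance threshold; the `max_gap` value, the coordinates and the function name are assumptions made for illustration, and the paper's actual clustering criteria may differ.

```python
def cluster_peptides(
    hits: list[tuple[str, int, int]], max_gap: int = 1000
) -> list[dict]:
    """Group peptide hits (chrom, start, end) into clusters.

    Hits on the same chromosome are merged when the next hit starts no
    more than `max_gap` bases after the current cluster ends.
    `max_gap` is a hypothetical threshold, not a value from the paper.
    """
    clusters: list[dict] = []
    for chrom, start, end in sorted(hits):
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == chrom and start - last["end"] <= max_gap:
            last["end"] = max(last["end"], end)
            last["n_peptides"] += 1
        else:
            clusters.append(
                {"chrom": chrom, "start": start, "end": end, "n_peptides": 1}
            )
    return clusters


# Toy coordinates (not real genome positions).
hits = [
    ("Chr3", 100, 130),
    ("Chr3", 500, 530),
    ("Chr3", 5000, 5030),
    ("Chr1", 10, 40),
]
clusters = cluster_peptides(hits)
for c in clusters:
    print(c)
```

A cluster supported by many peptides, like the 13-peptide example in the figure, is a stronger candidate for a new gene than an isolated hit.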

Intergenic Regions

64% of intergenic clusters overlap annotated pseudogenes or transposons.

Annotated pseudogenes may be incorrectly truncated, and have missing exons.

Transposons may contain protein-coding genes unrelated to transposon activity (gene hitch-hiking).

A large number (7,442) of small ORFs have been found as transcripts from intergenic regions*. 155 of these have predicted peptides.

*Hanada et al. (2007) Genome Research 17:632-640.

Peptides overlapping a predicted transposable element gene

Castellana N. E. et al. PNAS 2008;105:21034-21038

©2008 by National Academy of Sciences

Five peptides overlap an annotated transposable element gene. The inferred protein is 56% identical to a ubiquitin-like protease.

Gene refinement: new exons, boundary change, exon skipping, modified translation start and stop sites.

A majority are novel exons: 60% are within introns, and 40% are in UTRs.

26 cases may actually be a single exon.

Exon extension and shortening are equally frequent.

Using the peptide evidence, AUGUSTUS predicts altered transcripts in 695 genes.

In 130 cases, peptide variation indicates new isoforms.

Refined Gene Model

4 novel peptides map in the 5’ UTR and the first exon of a protein kinase.

Castellana N. E. et al. PNAS 2008;105:21034-21038

©2008 by National Academy of Sciences

New gene models from identified peptides

Baerenfaller et al. (2008) Science 320: 938-941.

New gene models from identified peptides

Baerenfaller et al. (2008) Science 320: 938-941.

Take home lessons

MS is a powerful adjunct to genomics and transcriptomics.

More precise definition of coding genes.

Proteomics is becoming more quantitative and less expensive.

MS can provide absolute protein quantitation.

Likely to play an increasing role in “omic” research.

Proteomics people will want more respect.

References

• Katja Baerenfaller, Jonas Grossmann, Monica A. Grobei, Roger Hull, Mattias Hirsch-Hoffman, Shaul Yalovsky, Phillip Zimmermann, Ueli Grossniklaus, Wilhelm Gruissem, Sacha (2008). Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics. Science 320: 938-941.

• Stephen Tanner, Zhouxin Shen, Julio Ng, Liliana Florea, Roderic Guigó, Steven Briggs, and Vineet Bafna (2007). Improving gene annotation using peptide mass spectrometry. Genome Res. 17: 231-239.

• Kousuke Hanada, Xu Zhang, Justin O. Borevitz, Wen-Hsiung Li, and Shin-Han Shiu (2007). A large number of novel coding small open reading frames in the intergenic regions of the Arabidopsis thaliana genome are transcribed and/or under purifying selection. Genome Res. 17: 632-640.