Large scale genomes comparisons Practical sessions Fredj Tekaia Institut Pasteur [email protected] EMBO Bioinformatic and Comparative Genome Analysis Course Stazione Zoologica Anton Dohrn, Naples, Italy May 7 - 19, 2012

Plan for the practical sessions

Three yeast proteomes:
• Saccharomyces cerevisiae (SACE: 5863 protein sequences)
• Candida glabrata (CAGL: 5202 protein sequences)
• Zygosaccharomyces rouxii (ZYRO: 4991 protein sequences)

- data from ftp.ncbi.nlm.nih.gov/genomes/Fungi

For each proteome we will perform the following:

Data preparation:
- Transform the protein identifiers into simpler identifiers;
- Split the whole protein sequence database into single protein sequences;

Intra-species comparisons:
- Compare the proteome to itself, using blastp (with adequate options);
- Get for each protein its best significant match (presented in table form);
- Get for each protein all its significant matches (presented in table form);
- For each protein, calculate the number of its significant matches;

Interspecies comparisons:
- Perform all pair-wise proteome comparisons. For each pair:
- Get for each protein its best significant hit in the other proteome;
- Get for each protein all its significant hits in the other proteome;
- For each protein, calculate the number of its significant matches;

Multiple comparisons:
- Extract all pairs of proteins that are Reciprocally Best Hits (Venn diagram);

CIRCOS:
- Prepare a table describing the relationships between genomes, to be used with circos.
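For the CIRCOS step, one possible way to build the table is sketched below in sh/awk. This is only a sketch under stated assumptions: it assumes the Circos link format with one link per line (chr1 start1 end1 chr2 start2 end2), gene coordinates taken from the .ptt files, and a two-column file of related protein pairs (for instance RBH pairs). The function names ptt_coords and circos_links and chromosome labels such as sace_A are hypothetical; the labels must match whatever karyotype file is used.

```sh
#!/bin/sh
# ptt_coords LABEL FILE.ptt
# Prints "identifier label start end" for every gene line of a .ptt file.
# LABEL is a chromosome name chosen by the user (hypothetical, e.g. sace_A).
ptt_coords() {
    awk -v label="$1" '
        # Gene lines start with a location such as 1807..2169;
        # column 6 (Synonym) holds the simple identifier.
        $1 ~ /^[0-9]+[.][.][0-9]+$/ {
            split($1, p, /[.][.]/)
            print $6, label, p[1], p[2]
        }' "$2"
}

# circos_links PAIRS COORDS
# PAIRS : two columns, one pair of related proteins per line
# COORDS: output of ptt_coords, concatenated for all chromosomes
# Prints one link per pair whose two members both have coordinates.
circos_links() {
    awk 'FILENAME == ARGV[1] { chr[$1] = $2; s[$1] = $3; e[$1] = $4; next }
         ($1 in chr) && ($2 in chr) {
             print chr[$1], s[$1], e[$1], chr[$2], s[$2], e[$2]
         }' "$2" "$1"
}
```

A possible use: run ptt_coords once per .ptt file, append the results to a single coordinates file, then call circos_links with the pairs file and that coordinates file.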

Plan for the practical sessions:

-use 3 yeast species (SACE, CAGL, ZYRO).

-data from ftp.ncbi.nlm.nih.gov/genomes/Fungi

• Prepare the protein sequence data in suitable FASTA format
  (requires data transformation)

• Compare each proteome to itself (duplication - paralogs)

• Compare each proteome to another proteome (RBH - orthologs)

• Prepare a file for the visualization of protein similarities (circos).

This requires writing sh and perl (or other) scripts.

Saccharomyces cerevisiae (SACE): 16 chromosomes

Species                   Code  Chromosome  Size (bp)  #genes
Saccharomyces cerevisiae  SACE  A             230218      94
Saccharomyces cerevisiae  SACE  B             813184     406
Saccharomyces cerevisiae  SACE  C             316620     161
Saccharomyces cerevisiae  SACE  D            1531933     754
Saccharomyces cerevisiae  SACE  E             576874     277
Saccharomyces cerevisiae  SACE  F             270161     126
Saccharomyces cerevisiae  SACE  G            1090940     527
Saccharomyces cerevisiae  SACE  H             562643     281
Saccharomyces cerevisiae  SACE  I             439888     207
Saccharomyces cerevisiae  SACE  J             745751     357
Saccharomyces cerevisiae  SACE  K             666816     312
Saccharomyces cerevisiae  SACE  L            1078177     508
Saccharomyces cerevisiae  SACE  M             924431     460
Saccharomyces cerevisiae  SACE  N             784333     393
Saccharomyces cerevisiae  SACE  O            1091291     536
Saccharomyces cerevisiae  SACE  P             948066     464

Candida glabrata (CAGL): 13 chromosomes

Species           Code  Chromosome  Size (bp)  #genes
Candida glabrata  CAGL  A             491328     200
Candida glabrata  CAGL  B             502101     212
Candida glabrata  CAGL  C             558804     230
Candida glabrata  CAGL  D             651701     283
Candida glabrata  CAGL  E             687738     278
Candida glabrata  CAGL  F             927101     383
Candida glabrata  CAGL  G             992211     434
Candida glabrata  CAGL  H            1050361     460
Candida glabrata  CAGL  I            1100349     462
Candida glabrata  CAGL  J            1195132     514
Candida glabrata  CAGL  K            1302831     556
Candida glabrata  CAGL  L            1455689     575
Candida glabrata  CAGL  M            1402899     615

Zygosaccharomyces rouxii (ZYRO): 7 chromosomes

Species                   Code  Chromosome  Size (bp)  #genes
Zygosaccharomyces rouxii  ZYRO  A            1114666     580
Zygosaccharomyces rouxii  ZYRO  B            1388208     706
Zygosaccharomyces rouxii  ZYRO  C            1464093     774
Zygosaccharomyces rouxii  ZYRO  D            1496342     768
Zygosaccharomyces rouxii  ZYRO  E             881646     416
Zygosaccharomyces rouxii  ZYRO  F            1554288     806
Zygosaccharomyces rouxii  ZYRO  G            1865392     914

SACE
-rw-r----- 1 tekaia staff  54600 Apr 25 11:54 NC_001133.faa
-rw-r----- 1 tekaia staff   5273 Apr 25 11:57 NC_001133.ptt
-rw-r----- 1 tekaia staff 233863 Apr 25 11:54 NC_001134.faa
-rw-r----- 1 tekaia staff  22362 Apr 25 11:57 NC_001134.ptt
-rw-r----- 1 tekaia staff  85632 Apr 25 11:54 NC_001135.faa
-rw-r----- 1 tekaia staff   9043 Apr 25 11:57 NC_001135.ptt
-rw-r----- 1 tekaia staff 436412 Apr 25 11:54 NC_001136.faa
-rw-r----- 1 tekaia staff  41636 Apr 25 11:57 NC_001136.ptt
-rw-r----- 1 tekaia staff 152613 Apr 25 11:54 NC_001137.faa
-rw-r----- 1 tekaia staff  15210 Apr 25 11:57 NC_001137.ptt
-rw-r----- 1 tekaia staff  71415 Apr 25 11:54 NC_001138.faa
-rw-r----- 1 tekaia staff   7115 Apr 25 11:57 NC_001138.ptt
-rw-r----- 1 tekaia staff 303249 Apr 25 11:54 NC_001139.faa
-rw-r----- 1 tekaia staff  28954 Apr 25 11:57 NC_001139.ptt
-rw-r----- 1 tekaia staff 156585 Apr 25 11:53 NC_001140.faa
-rw-r----- 1 tekaia staff  15544 Apr 25 11:57 NC_001140.ptt
-rw-r----- 1 tekaia staff 119694 Apr 25 11:53 NC_001141.faa
-rw-r----- 1 tekaia staff  11384 Apr 25 11:57 NC_001141.ptt
-rw-r----- 1 tekaia staff 213993 Apr 25 11:53 NC_001142.faa
-rw-r----- 1 tekaia staff  19732 Apr 25 11:57 NC_001142.ptt
-rw-r----- 1 tekaia staff 184175 Apr 25 11:53 NC_001143.faa
-rw-r----- 1 tekaia staff  17048 Apr 25 11:57 NC_001143.ptt
-rw-r----- 1 tekaia staff 302218 Apr 25 11:53 NC_001144.faa
-rw-r----- 1 tekaia staff  28180 Apr 25 11:57 NC_001144.ptt
-rw-r----- 1 tekaia staff 267545 Apr 25 11:53 NC_001145.faa
-rw-r----- 1 tekaia staff  25329 Apr 25 11:57 NC_001145.ptt
-rw-r----- 1 tekaia staff 223148 Apr 25 11:53 NC_001146.faa
-rw-r----- 1 tekaia staff  21558 Apr 25 11:57 NC_001146.ptt
-rw-r----- 1 tekaia staff 304338 Apr 25 11:53 NC_001147.faa
-rw-r----- 1 tekaia staff  29393 Apr 25 11:57 NC_001147.ptt
-rw-r----- 1 tekaia staff 266238 Apr 25 11:53 NC_001148.faa
-rw-r----- 1 tekaia staff  25450 Apr 25 11:57 NC_001148.ptt

Chromosome labels: A, B, C, D, ….. (one .faa/.ptt pair per chromosome)

NC_001133.faa:
>gi|6319249|ref|NP_009332.1| Pau8p
MVKLTSIAAGVAAIAATASATTTLAQSDERVNLVELGVYVSDIRAHLAQYYMFQAAHPTETYPVEVAEAVFNYGDFTTMLTGIAPDQVTRMITGVPWYSSRLKPAISSALSKDGIYTIAN
>gi|33438754|ref|NP_878038.1| hypothetical protein YAL067W-A
MPIIGVPRCLIKPFSVPVTFPFSVKKNIRILDLDPRTEAYCLSLNSVCFKRLPRRKYFHLLNSYNIKRVLGVVYC
>gi|6319250|ref|NP_009333.1| Seo1p
MYSIVKEIIVDPYKRLKWGFIPVKRQVEDLPDDLNSTEIVTISNSIQSHETAENFITTTSEKDQLHFETSSYSEHKDNVNVTRSYEYRDEADRPWWRFFDEQEYRINEKERSHNKWYSWFKQGTSFKEKKLLIKLDVLLAFYSCIAYWVKYLDTVNINNAYVSGMKEDLGFQGNDLVHTQVMYTVGNIIFQLPFLIYLNKLPLNYVLPSLDLCWSLLTVGAAYVNSVPHLKAIRFFIGAFEAPSYLAYQYLFGSFYKHDEMVRRSAFYYLGQYIGILSAGGIQSAVYSSLNGVNGLEGWRWNFIIDAIVSVVVGLIGFYSLPGDPYNCYSIFLTDDEIRLARKRLKENQTGKSDFETKVFDIKLWKTIFSDWKIYILTLWNIFCWNDSNVSSGAYLLWLKSLKRYSIPKLNQLSMITPGLGMVYLMLTGIIADKLHSRWFAIIFTQVFNIIGNSILAAWDVAEGAKWFAFMLQCFGWAMAPVLYSWQNDICRRDAQTRAITLVTMNIMAQSSTAWISVLVWKTEEAPRYLKGFTFTACSAFCLSIWTFVVLYFYKRDERNNAKKNGIVLYNSKHGVEKPTSKDVETLSVSDEK
>gi|6319252|ref|NP_009335.1| hypothetical protein YAL065C
MNSATSETTTNTGAAETTTSTGAAETKTVVTSSISRFNHAETQTASATDVIGHSSSVVSVSETGNTKSLITSGLSTMSQQPRSTPASSIIGSSTASLEISTYVGIANGLLTNNGISVFISTVLLAIVW
……….

NC_001133.ptt:
SACE S288c chromosome I, complete sequence. - 1..230218
94 proteins
Location      Strand  Length  PID        Gene  Synonym    Code  COG  Product
1807..2169    -       120     6319249    PAU8  YAL068C    -     -    Pau8p
2480..2707    +       75      33438754   -     YAL067W-A  -     -    hypothetical protein
7235..9016    -       593     6319250    SEO1  YAL067C    -     -    Seo1p
11565..11951  -       128     6319252    -     YAL065C    -     -    hypothetical protein
12046..12426  +       126     6319253    -     YAL064W-B  -     -    hypothetical protein
13363..13743  -       126     7839146    -     YAL064C-A  -     -    hypothetical protein
21566..21850  +       94      330443360  -     YAL064W    -     -    hypothetical protein
…..


NC_001133.faa:
>YAL068C Pau8p
MVKLTSIAAGVAAIAATASATTTLAQSDERVNLVELGVYVSDIRAHLAQYYMFQAAHPTETYPVEVAEAVFNYGDFTTMLTGIAPDQVTRMITGVPWYSSRLKPAISSALSKDGIYTIAN
>YAL067W-A hypothetical protein YAL067W-A
MPIIGVPRCLIKPFSVPVTFPFSVKKNIRILDLDPRTEAYCLSLNSVCFKRLPRRKYFHLLNSYNIKRVLGVVYC
>YAL067C Seo1p
MYSIVKEIIVDPYKRLKWGFIPVKRQVEDLPDDLNSTEIVTISNSIQSHETAENFITTTSEKDQLHFETSSYSEHKDNVNVTRSYEYRDEADRPWWRFFDEQEYRINEKERSHNKWYSWFKQGTSFKEKKLLIKLDVLLAFYSCIAYWVKYLDTVNINNAYVSGMKEDLGFQGNDLVHTQVMYTVGNIIFQLPFLIYLNKLPLNYVLPSLDLCWSLLTVGAAYVNSVPHLKAIRFFIGAFEAPSYLAYQYLFGSFYKHDEMVRRSAFYYLGQYIGILSAGGIQSAVYSSLNGVNGLEGWRWNFIIDAIVSVVVGLIGFYSLPGDPYNCYSIFLTDDEIRLARKRLKENQTGKSDFETKVFDIKLWKTIFSDWKIYILTLWNIFCWNDSNVSSGAYLLWLKSLKRYSIPKLNQLSMITPGLGMVYLMLTGIIADKLHSRWFAIIFTQVFNIIGNSILAAWDVAEGAKWFAFMLQCFGWAMAPVLYSWQNDICRRDAQTRAITLVTMNIMAQSSTAWISVLVWKTEEAPRYLKGFTFTACSAFCLSIWTFVVLYFYKRDERNNAKKNGIVLYNSKHGVEKPTSKDVETLSVSDEK
>YAL065C hypothetical protein YAL065C
MNSATSETTTNTGAAETTTSTGAAETKTVVTSSISRFNHAETQTASATDVIGHSSSVVSVSETGNTKSLITSGLSTMSQQPRSTPASSIIGSSTASLEISTYVGIANGLLTNNGISVFISTVLLAIVW
……….


Final sequence format

Write a perl/sh script that systematically replaces the sequence identifiers.

Follow the instructions in the PS document.

Notations:

Sequence and genome files:

We consider sequences and databases in “fasta” format.

DB.pep (extension “.pep” for protein databases);
Example: GSACE.pep, for the Saccharomyces cerevisiae protein database.

seq.prt (extension “.prt” for protein sequences);
Example: YAL063C.prt

Scripts:
script.pl (extension “.pl” for perl scripts);
script.scr (extension “.scr” for unix shell scripts);
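With these conventions, splitting a database (DB.pep) into single-sequence files (seq.prt) could look like the sketch below. It is not taken from the course material: the function name split_fasta is hypothetical, and the sketch assumes the identifiers have already been simplified (for example YAL068C), since each identifier is used directly as a file name.

```sh
#!/bin/sh
# split_fasta DB.pep OUTDIR
# Writes each sequence of a multi-FASTA file to OUTDIR/<identifier>.prt,
# where <identifier> is the first word of the header line without ">".
split_fasta() {
    db=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    awk -v dir="$outdir" '
        /^>/ {
            if (out != "") close(out)   # close the previous file first
            out = dir "/" substr($1, 2) ".prt"
        }
        out != "" { print > out }       # header and sequence lines
    ' "$db"
}
```

Example call: split_fasta GSACE.pep SACE_prt would create one .prt file per protein in the directory SACE_prt (a directory name chosen here only for illustration).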

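#!/usr/bin/env perl
# Use: replaceid.pl NC_000962.ptt NC_000962.faa
# Output is written to NC_xx.pep

use strict;
use warnings;

my $PTT = $ARGV[0];    # ncbi ptt file
my $FAA = $ARGV[1];    # ncbi faa file

# Remove the ".ptt" extension to get the output file prefix
my $CHR = substr($PTT, 0, length($PTT) - 4);
open(OUT, ">", "$CHR.pep") || die "can't write $CHR.pep";

# %PID is an associative array: PID (column 4) => Synonym (column 6)
my %PID;
open(IN, "<", $PTT) || die "can't find $PTT";
while (<IN>) {
    my @tab = split(/\s+/, $_);    # @tab is a list of values
    $PID{$tab[3]} = $tab[5] if defined $tab[5];
} # while
close(IN);

open(IN2, "<", $FAA) || die "can't open $FAA";
while (<IN2>) {
    if (m/^>/) {
        # Header line: >gi|PID|ref|accession| description
        my @tab = split(/[\|]/, $_);
        print OUT ">$PID{$tab[1]}$tab[4]";
    } else {
        print OUT $_;              # sequence line, copied unchanged
    } # if
} # while
close(IN2);
close(OUT);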

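#!/bin/sh
# Run replaceid.pl on every .ptt/.faa pair in the current directory

for file in *.ptt
do
    # Remove the extension to get the chromosome prefix (e.g. NC_001133)
    NC=`echo $file | sed -e "s/\..*//g"`
    replaceid.pl $NC.ptt $NC.faa
done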

Comparing one proteome vs itself

query    subject  %id    length  mismatches  gaps  …  e-value  score
YAL005C  YLL024C  98.19  607     11          0     …  0.0      1041
YAL005C  YER103W  83.58  609     97          2     …  0.0      889
YAL005C  YBL075C  81.94  609     107         2     …  0.0      888
YAL005C  YJL034W  64.74  604     209         3     …  0.0      702
YAL005C  YDL229W  64.02  567     198         4     …  2e-176   613
YAL005C  YNL209W  63.84  567     199         4     …  4e-176   613
YAL005C  YJR045C  51.06  611     281         9     …  1e-136   481
YAL005C  YEL030W  49.43  615     285         11    …  8e-130   459
YAL005C  YLR369W  49.27  548     254         8     …  1e-125   445
YAL005C  YPL106C  35.85  371     230         4     …  9e-63    236
YAL005C  YBR169C  35.85  371     230         4     …  1e-57    219
YAL005C  YHR064C  31.90  373     242         5     …  2e-48    188
YAL005C  YKL073W  24.55  501     343         10    …  2e-28    122
YAL007C  YOR016C  75.00  180     42          1     …  8e-61    227
YAL012W  YGL184C  32.69  413     240         13    …  2e-46    181
YAL012W  YLR303W  30.37  438     243         12    …  6e-34    139
YAL012W  YHR112C  29.90  398     236         14    …  2e-29    125
YAL012W  YFR055W  27.99  293     199         7     …  2e-27    117
YAL015C  YOL043C  50.88  285     140         0     …  9e-82    298
YAL017W  YOL045W  62.10  694     224         6     …  0.0      771
……….

All hits: every significant match is listed, so a protein has multiple lines if it has multiple matches.
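A table of this kind, and the number of significant matches per protein, could be obtained along the following lines. This is a sketch under assumptions, not the course's own script: it assumes the BLAST+ programs makeblastdb and blastp are installed, the e-value threshold 1e-5 is only an illustrative choice for "significant", and the function names self_blast and count_hits are hypothetical.

```sh
#!/bin/sh
# self_blast DB.pep OUT.tab
# Compares a proteome to itself and writes tabular output (12 columns).
# Requires BLAST+; the 1e-5 threshold is an assumption, adjust as needed.
self_blast() {
    makeblastdb -in "$1" -dbtype prot &&
    blastp -query "$1" -db "$1" -evalue 1e-5 -outfmt 6 -out "$2"
}

# count_hits OUT.tab
# Prints "protein<TAB>number of distinct matches", ignoring the trivial
# match of a protein to itself and repeated lines for the same pair.
count_hits() {
    awk '$1 != $2 && !seen[$1 SUBSEP $2]++ { n[$1]++ }
         END { for (q in n) print q "\t" n[q] }' "$1" | sort
}
```

Example: self_blast GSACE.pep SACE_SACE.tab followed by count_hits SACE_SACE.tab (the output file name is illustrative). Proteins without any match other than themselves do not appear in the count.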

Comparing one proteome vs itself

query    subject  %id    length  mismatches  gaps  …  e-value  score
YAL005C  YLL024C  98.19  607     11          0     …  0.0      1041   <- best hit
YAL005C  YER103W  83.58  609     97          2     …  0.0      889
YAL005C  YBL075C  81.94  609     107         2     …  0.0      888
YAL005C  YJL034W  64.74  604     209         3     …  0.0      702
YAL005C  YDL229W  64.02  567     198         4     …  2e-176   613
YAL005C  YNL209W  63.84  567     199         4     …  4e-176   613
YAL005C  YJR045C  51.06  611     281         9     …  1e-136   481
YAL005C  YEL030W  49.43  615     285         11    …  8e-130   459
YAL005C  YLR369W  49.27  548     254         8     …  1e-125   445
YAL005C  YPL106C  35.85  371     230         4     …  9e-63    236
YAL005C  YBR169C  35.85  371     230         4     …  1e-57    219
YAL005C  YHR064C  31.90  373     242         5     …  2e-48    188
YAL005C  YKL073W  24.55  501     343         10    …  2e-28    122
YAL007C  YOR016C  75.00  180     42          1     …  8e-61    227    <- best hit
YAL012W  YGL184C  32.69  413     240         13    …  2e-46    181    <- best hit
YAL012W  YLR303W  30.37  438     243         12    …  6e-34    139
YAL012W  YHR112C  29.90  398     236         14    …  2e-29    125
YAL012W  YFR055W  27.99  293     199         7     …  2e-27    117
YAL015C  YOL043C  50.88  285     140         0     …  9e-82    298    <- best hit
YAL017W  YOL045W  62.10  694     224         6     …  0.0      771    <- best hit
……….

Best hits: for each protein, the line with the highest score (the first line listed for that protein).
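Extracting the best hit of each protein from the table of all hits can be sketched as follows. The function name best_hits is hypothetical, and the sketch assumes tabular BLAST output in which the last column is the score; when two hits of the same protein have equal scores, the first one encountered is kept.

```sh
#!/bin/sh
# best_hits OUT.tab
# Keeps, for each query protein (column 1), the line with the highest
# score (last column). Hits of a protein to itself are skipped.
best_hits() {
    awk '$1 != $2 && (!($1 in best) || $NF + 0 > score[$1]) {
             best[$1] = $0
             score[$1] = $NF + 0
         }
         END { for (q in best) print best[q] }' "$1" | sort
}
```

Because the comparison is made on the score itself, the result does not depend on the order of the lines in the input file.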

Comparing one proteome vs a different proteome

query    subject       %id    length  mismatches  gaps  …  e-value  score
YAL001C  CAGL0A00803g  42.26  1188    623         20    …  0.0      823
YAL002W  CAGL0A00781g  31.31  1217    798         20    …  3e-167   584
YAL003W  CAGL0F08547g  74.52  208     50          2     …  2e-59    223
YAL005C  CAGL0G03795g  93.41  607     40          0     …  0.0      993
YAL005C  CAGL0G03289g  85.39  609     86          2     …  0.0      899
YAL005C  CAGL0D02948g  64.24  604     212         3     …  0.0      684
YAL005C  CAGL0K04741g  64.90  567     193         4     …  2e-179   624
YAL005C  CAGL0C05379g  64.90  567     193         4     …  2e-179   624
YAL005C  CAGL0I03322g  50.90  613     283         9     …  2e-135   477
YAL005C  CAGL0I01496g  50.08  613     288         9     …  1e-134   475
YAL005C  CAGL0G04917g  46.07  573     291         8     …  6e-121   429
YAL005C  CAGL0M06083g  35.31  371     232         4     …  4e-58    220
YAL005C  CAGL0L10560g  32.26  372     241         4     …  5e-51    197
YAL005C  CAGL0F06369g  22.37  599     406         16    …  4e-20    94.7
YAL007C  CAGL0C02761g  70.17  181     51          2     …  3e-58    219
YAL009W  CAGL0C02717g  70.10  204     61          0     …  1e-81    296
YAL010C  CAGL0C02695g  47.37  494     225         5     …  1e-111   398
YAL011W  CAGL0H06391g  38.42  596     318         9     …  1e-74    275
YAL012W  CAGL0H06369g  85.24  393     55          2     …  0.0      659
YAL012W  CAGL0L06094g  35.20  392     226         13    …  4e-54    206
……….

All hits: every significant hit in the other proteome is listed, so a protein may have multiple hits.

Comparing one proteome vs a different proteome

query    subject       %id    length  mismatches  gaps  …  e-value  score
YAL001C  CAGL0A00803g  42.26  1188    623         20    …  0.0      823    <- best hit
YAL002W  CAGL0A00781g  31.31  1217    798         20    …  3e-167   584    <- best hit
YAL003W  CAGL0F08547g  74.52  208     50          2     …  2e-59    223    <- best hit
YAL005C  CAGL0G03795g  93.41  607     40          0     …  0.0      993    <- best hit
YAL005C  CAGL0G03289g  85.39  609     86          2     …  0.0      899
YAL005C  CAGL0D02948g  64.24  604     212         3     …  0.0      684
YAL005C  CAGL0K04741g  64.90  567     193         4     …  2e-179   624
YAL005C  CAGL0C05379g  64.90  567     193         4     …  2e-179   624
YAL005C  CAGL0I03322g  50.90  613     283         9     …  2e-135   477
YAL005C  CAGL0I01496g  50.08  613     288         9     …  1e-134   475
YAL005C  CAGL0G04917g  46.07  573     291         8     …  6e-121   429
YAL005C  CAGL0M06083g  35.31  371     232         4     …  4e-58    220
YAL005C  CAGL0L10560g  32.26  372     241         4     …  5e-51    197
YAL005C  CAGL0F06369g  22.37  599     406         16    …  4e-20    94.7
YAL007C  CAGL0C02761g  70.17  181     51          2     …  3e-58    219    <- best hit
YAL009W  CAGL0C02717g  70.10  204     61          0     …  1e-81    296    <- best hit
YAL010C  CAGL0C02695g  47.37  494     225         5     …  1e-111   398    <- best hit
YAL011W  CAGL0H06391g  38.42  596     318         9     …  1e-74    275    <- best hit
YAL012W  CAGL0H06369g  85.24  393     55          2     …  0.0      659    <- best hit
YAL012W  CAGL0L06094g  35.20  392     226         13    …  4e-54    206
……….

Best hits: for each protein, the line with the highest score in the other proteome (the first line listed for that protein).
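Once best hits are available in both directions (for example SACE vs CAGL and CAGL vs SACE), Reciprocally Best Hits can be extracted as in the sketch below. The function name rbh and the file layout are assumptions: each input file is expected to hold one best hit per line, with the query in column 1 and its best hit in column 2.

```sh
#!/bin/sh
# rbh AB.best BA.best
# AB.best: best hits of proteome A in proteome B (query, best hit, ...)
# BA.best: best hits of proteome B in proteome A
# Prints "a<TAB>b" when b is the best hit of a AND a is the best hit of b.
rbh() {
    awk 'FILENAME == ARGV[1] { back[$1] = $2; next }   # B -> A best hits
         ($2 in back) && back[$2] == $1 { print $1 "\t" $2 }
        ' "$2" "$1" | sort
}
```

Running rbh for each of the three pairs of proteomes gives the pair lists needed for the multiple-comparison (Venn diagram) step.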

Follow the document: Tekaia_EMBO2012_PS.pdf