Transcript
Page 1: P4 2017 io
Page 2: P4 2017 io

FBW

07-11-2017

Wim Van Criekinge

Page 4: P4 2017 io

BPC 2017

Page 5: P4 2017 io

Recap

if condition:
    statements
[elif condition:
    statements] ...
else:
    statements

while condition:
    statements

for var in sequence:
    statements

break

continue

Strings

REGULAR EXPRESSIONS
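
As a quick refresher, the recap items above can be combined in one small script. This is only an illustrative sketch: the peptide in `seq` is made up for the example and is not one of the course sequences.

```python
import re

seq = "MKNITSAPRKGSWNGTP"  # made-up peptide, for illustration only

# for + if/elif/else: net charge from K/R (+1) and D/E (-1)
charged = 0
for aa in seq:
    if aa in "KR":
        charged += 1
    elif aa in "DE":
        charged -= 1
    else:
        continue           # neutral residue: go straight to the next one

# while + break: index of the first proline
i = 0
while i < len(seq):
    if seq[i] == "P":
        break
    i += 1

# regular expression: N, then any residue except P, then S or T
hits = [m.start() for m in re.finditer(r"N[^P][ST]", seq)]
print(charged, i, hits)
```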

Page 6: P4 2017 io

Devhints.io

Page 7: P4 2017 io

Towards a protein prosite scanner

N-{P}-[ST]-{P}.

[RK](2)-x-[ST].

[ST]-x-[RK].

[ST]-x(2)-[DE].

[RK]-x(2,3)-[DE]-x(2,3)-Y.

G-{EDRKHPFYW}-x(2)-[STAGCN]-{P}.

x-G-[RK]-[RK].

C-x-[DN]-x(4)-[FY]-x-C-x-C.

E-x(2)-[ERK]-E-x-C-x(6)-[EDR]-x(10,11)-[FYA]-[YW].

[DEQGSTALMKRH]-[LIVMFYSTAC]-[GNQ]-[LIVMFYAG]-[DNEKHS]-S-[LIVMST]-{PCFY}-[STAGCPQLIVMF]-[LIVMATN]-[DENQGTAKRHLM]-[LIVMWSTA]-[LIVGSTACR]-{LPIY}-{VY}-[LIVMFA].

[KRHQSA]-[DENQ]-E-L>.

R-G-D.

[AG]-x(4)-G-K-[ST].

D-{W}-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-x(2)-[DE]-[LIVMFYW].

[EQ]-{LNYH}-x-[ATV]-[FY]-{LDAM}-{T}-W-{PG}-N.

[LIVM]-x-[SGNL]-[LIVMN]-[DAGHENRS]-[SAGPNVT]-x-[DNEAG]-[LIVM]-x-[DEAGQ]-x(4)-[LIVM]-x-[LM]-[SAG]-[LIVM]-[LIVMT]-[WS]-x(0,1)-[LIVM](2).

[FY]-C-[RH]-[NS]-x(7,8)-[WY]-C.

C-x-C-x(2)-{V}-x(2)-G-{C}-x-C.

C-x(2)-P-F-x-[FYWIV]-x(7)-C-x(8,10)-W-C-x(4)-[DNSR]-[FYW]-x(3,5)-[FYW]-x-[FYWI]-C.

[LIFAT]-{IL}-x(2)-W-x(2,3)-[PE]-x-{VF}-[LIVMFY]-[DENQS]-[STA]-[AV]-[LIVMFY].

[KRH]-x(2)-C-x-[FYPSTV]-x(3,4)-[ST]-x(3)-C-x(4)-C-C-[FYWH].

C-x(4,5)-C-C-S-x(2)-G-x-C-G-x(3,4)-[FYW]-C.

[LIVMFYG]-[ASLVR]-x(2)-[LIVMSTACN]-x-[LIVM]-{Y}-x(2)-{L}-[LIV]-[RKNQESTAIY]-[LIVFSTNKH]-W-[FYVC]-x-[NDQTAH]-x(5)-[RKNAIMW].

C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.

L-x(6)-L-x(6)-L-x(6)-L.

C-x(2)-C-x(1,2)-[DENAVSPHKQT]-x(5,6)-[HNY]-[FY]-x(4)-C-x(2)-C-x(2)-F(2)-x-R.

[LIVMFE]-[FY]-P-W-M-[KRQTA].

L-M-A-[EQ]-G-L-Y-N.

IRED_1R-P-C-x(11)-C-V-S.

[RKQ]-R-[LIM]-x-[LF]-G-[LIVMFY]-x-Q-x-[DNQ]-V-G.

[KR]-x(1,3)-[RKSAQ]-N-{VL}-x-[SAQ](2)-{L}-[RKTAENQ]-x-R-{S}-[RK].

[LIVMF](2)-D-E-A-D-[RKEN]-x-[LIVMFYGSTN].

[KRQ]-[LIVMA]-x(2)-[GSTALIV]-{FYWPGDN}-x(2)-[LIVMSA]-x(4,9)-[LIVMF]-x-{PLH}-[LIVMSTA]-[GSTACIL]-{GPK}-{F}-x-[GANQRF]-[LIVMFY]-x(4,5)-[LFY]-x(3)-[FYIVA]-{FYWHCM}-{PGVI}-x(2)-[GSADENQKR]-x-[NSTAPKL]-[PARL].

Scan for these PROSITE patterns in your 4 sequences

Hint: translate the patterns to regexes and then scan
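
One possible way to follow the hint is sketched below. The function names `prosite_to_regex` and `scan` are made up for this example, and the translation assumes only the common PROSITE elements: `-` separators, `x`, `[...]`, `{...}`, repeat counts such as `x(2,3)`, the `<` and `>` anchors, and the closing period. Rarer constructs, such as an anchor inside square brackets, are not handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression string."""
    regex = pattern.strip().rstrip(".")                  # drop the closing period
    regex = regex.replace("-", "")                       # element separators
    regex = regex.replace("{", "[^").replace("}", "]")   # {P}    -> [^P]
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,3) -> x{2,3}
    regex = regex.replace("x", ".")                      # x = any residue
    regex = regex.replace("<", "^").replace(">", "$")    # N- and C-terminal anchors
    return regex

def scan(pattern, sequence):
    """Return (start, end, matched text) for each non-overlapping hit."""
    regex = re.compile(prosite_to_regex(pattern))
    sequence = sequence.replace(" ", "")                 # sequences may contain spaces
    return [(m.start(), m.end(), m.group(0)) for m in regex.finditer(sequence)]

print(prosite_to_regex("[RK]-x(2,3)-[DE]-x(2,3)-Y."))
# First residues of SEQ1, scanned with the N-{P}-[ST]-{P} pattern
print(scan("N-{P}-[ST]-{P}.", "MGNLFENCTHRYSFEYIYENCTNTTNQ"))
```

Note that `re.finditer` reports non-overlapping matches only, so a site that starts inside a previous hit is skipped; a lookahead-based regex would be needed to report those as well.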

Page 8: P4 2017 io

reuse galacto.py in github

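import re

Consensus_pattern = "G-R-x-N-[LIV]-I-G-[DE]-H-x-D-Y"
# Turn the PROSITE pattern into a regex: drop the "-" separators
# and let x match any residue
pattern = Consensus_pattern.replace("-", "")
pattern = pattern.replace("x", "[A-Z]")
#print(pattern)

# sequences is assumed to be a list of sequence strings (as in galacto.py)
count = 0
for s in sequences:
    count = count + 1
    print("searching seq", count)
    s = s.replace(" ", "")    # remove the spaces between sequence blocks
    matches = re.finditer(pattern, s)
    for match in matches:
        print(match.group(0), "from: ", match.start(), "to: ", match.end())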

Page 9: P4 2017 io

>SEQ1

MGNLFENCTHRYSFEYIYENCTNTTNQCGLIRNVASSIDVFHWLDVYISTTIFVISGILNFYCLFIALYT YYFLDNETRKHYVFVLSRFLSSILVIISLLVLESTLFSESLSPTFAYYAVAFSIYDFSMDTLFFSYIMIS LITYFGVVHYNFYRRHVSLRSLYIILISMWTFSLAIAIPLGLYEAASNSQGPIKCDLSYCGKVVEWITCS LQGCDSFYNANELLVQSIISSVETLVGSLVFLTDPLINIFFDKNISKMVKLQLTLGKWFIALYRFLFQMT NIFENCSTHYSFEKNLQKCVNASNPCQLLQKMNTAHSLMIWMGFYIPSAMCFLAVLVDTYCLLVTISILK SLKKQSRKQYIFGRANIIGEHNDYVVVRLSAAILIALCIIIIQSTYFIDIPFRDTFAFFAVLFIIYDFSILSLLGSFTGVAM MTYFGVMRPLVYRDKFTLKTIYIIAFAIVLFSVCVAIPFGLFQAADEIDGPIKCDSESCELIVKWLLFCI ACLILMGCTGTLLFVTVSLHWHSYKSKKMGNVSSSAFNHGKSRLTWTTTILVILCCVELIPTGLLAAFGK SESISDDCYDFYNANSLIFPAIVSSLETFLGSITFLLDPIINFSFDKRISKVFSSQVSMFSIFFCGKR

>SEQ2

MLDDRARMEA AKKEKVEQIL AEFQLQEEDL KKVMRRMQKE MDRGLRLETH EEASVKMLPT YVRSTPEGSE VGDFLSLDLG GTNFRVMLVK VGEGEEGQWS VKTKHQMYSI PEDAMTGTAE MLFDYISECI SDFLDKHQMK HKKLPLGFTF SFPVRHEDID KGILLNWTKG FKASGAEGNN VVGLLRDAIK RRGDFEMDVV AMVNDTVATM ISCYYEDHQC EVGMIVGTGC NACYMEEMQN VELVEGDEGR MCVNTEWGAF GDSGELDEFL LEYDRLVDES SANPGQQLYE KLIGGKYMGE LVRLVLLRLV DENLLFHGEA SEQLRTRGAF ETRFVSQVES DTGDRKQIYN ILSTLGLRPS TTDCDIVRRA CESVSTRAAH MCSAGLAGVI NRMRESRSED VMRITVGVDG SVYKLHPSFK ERFHASVRRL TPSCEITFIE SEEGSGRGAA LVSAVACKKA CMLGQ

>SEQ3

MESDSFEDFLKGEDFSNYSYSSDLPPFLLDAAPCEPESLEINKYFVVIIYVLVFLLSLLGNSLVMLVILY SRVGRSGRDNVIGDHVDYVTDVYLLNLALADLLFALTLPIWAASKVTGWIFGTFLCKVVSLLKEVNFYSGILLLACISVDRY LAIVHATRTLTQKRYLVKFICLSIWGLSLLLALPVLIFRKTIYPPYVSPVCYEDMGNNTANWRMLLRILP QSFGFIVPLLIMLFCYGFTLRTLFKAHMGQKHRAMRVIFAVVLIFLLCWLPYNLVLLADTLMRTWVIQET CERRNDIDRALEATEILGILGRVNLIGEHWDYHSCLNPLIYAFIGQKFRHGLLKILAIHGLISKDSLPKDSRPSFVGSSSGH TSTTL

>SEQ4

MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA

sequences

Page 10: P4 2017 io

Reading Files

name = open("filename")

– opens the given file for reading, and returns a file object

name.read() - file's entire contents as a string

name.readline() - next line from file as a string

name.readlines() - file's contents as a list of lines

– the lines from a file object can also be read using a for loop

>>> f = open("hours.txt")
>>> f.read()
'123 Susan 12.5 8.1 7.6 3.2\n456 Brad 4.0 11.6 6.5 2.7 12\n789 Jenn 8.0 8.0 8.0 8.0 7.5\n'
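
The three reading methods can be compared side by side. The sketch below first writes a small `hours.txt` into a temporary folder so that it runs anywhere; that setup, and the use of `with` to close the files automatically, are additions for this example rather than part of the slide.

```python
import os
import tempfile

# Create a small hours.txt in a temporary folder so the example is self-contained
folder = tempfile.mkdtemp()
path = os.path.join(folder, "hours.txt")
with open(path, "w") as f:
    f.write("123 Susan 12.5 8.1 7.6 3.2\n"
            "456 Brad 4.0 11.6 6.5 2.7 12\n"
            "789 Jenn 8.0 8.0 8.0 8.0 7.5\n")

with open(path) as f:
    whole = f.read()          # entire contents as one string

with open(path) as f:
    first = f.readline()      # first line, including the trailing "\n"
    second = f.readline()     # each call moves on to the next line

with open(path) as f:
    lines = f.readlines()     # list with one string per line

print(repr(first))
print(len(lines), "lines")
```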

Page 11: P4 2017 io

File Input Template

• A template for reading files in Python:

name = open("filename")
for line in name:
    statements

>>> input = open("hours.txt")
>>> for line in input:
...     print(line.strip())    # strip() removes \n
...
123 Susan 12.5 8.1 7.6 3.2
456 Brad 4.0 11.6 6.5 2.7 12
789 Jenn 8.0 8.0 8.0 8.0 7.5
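
The same template becomes useful once each line is split into fields. The sketch below assumes the layout seen in `hours.txt` (an ID, a name, then any number of hours) and writes its own copy of the file to a temporary folder; the variable names are made up for the example.

```python
import os
import tempfile

# Self-contained setup: write a copy of hours.txt to a temporary folder
path = os.path.join(tempfile.mkdtemp(), "hours.txt")
with open(path, "w") as f:
    f.write("123 Susan 12.5 8.1 7.6 3.2\n"
            "456 Brad 4.0 11.6 6.5 2.7 12\n"
            "789 Jenn 8.0 8.0 8.0 8.0 7.5\n")

totals = {}
with open(path) as infile:
    for line in infile:
        fields = line.split()            # split on whitespace
        if not fields:
            continue                     # skip empty lines
        name = fields[1]
        hours = [float(x) for x in fields[2:]]
        totals[name] = sum(hours)

for name in totals:
    print(name, "%.1f" % totals[name])
```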

Page 12: P4 2017 io

Writing Files

name = open("filename", "w")

name = open("filename", "a")

– opens file for write (deletes previous contents), or

– opens file for append (new data goes after previous data)

name.write(str) - writes the given string to the file

name.close() - saves file once writing is done

>>> out = open("output.txt", "w")
>>> out.write("Hello, world!\n")
>>> out.write("How are you?")
>>> out.close()

>>> open("output.txt").read()
'Hello, world!\nHow are you?'
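
A variant of the same example shows the difference between the `"w"` and `"a"` modes. It uses `with`, which closes the file automatically at the end of the block, and writes to a temporary folder so nothing on disk is overwritten; both choices are assumptions of this sketch, not part of the slide.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "output.txt")

# "w" creates the file (or empties an existing one)
with open(path, "w") as out:
    out.write("Hello, world!\n")
    out.write("How are you?")

# "a" keeps the existing contents and adds new data at the end
with open(path, "a") as out:
    out.write("\nFine, thanks.\n")

with open(path) as f:
    contents = f.read()
print(repr(contents))
```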

Page 13: P4 2017 io

https://prosite.expasy.org

Page 14: P4 2017 io

• Where to put the files ?

Page 15: P4 2017 io

Swiss-Knife.py

• Using a database as input! Parse the entire Swiss-Prot collection

– How many entries are there?

– Average protein length (in aa and MW)

– Relative frequency of amino acids

• Compare to the ones used to construct the PAM scoring matrices from 1978 – 1991
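
A minimal sketch of how such a script could be organised is given below. It is not the course's Swiss-Knife.py: the function names are made up, the two FASTA entries are invented test data held in memory, and molecular weight is left out. For the real database, an opened FASTA file would be passed to `summarize` instead of the `io.StringIO` sample.

```python
import io
from collections import Counter

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:               # do not forget the last entry
        yield header, "".join(chunks)

def summarize(handle):
    """Return (number of entries, average length, residue frequencies).

    Assumes the file contains at least one non-empty sequence.
    """
    n_entries = 0
    total_length = 0
    counts = Counter()
    for header, seq in read_fasta(handle):
        n_entries += 1
        total_length += len(seq)
        counts.update(seq)               # count every residue letter
    freqs = {aa: n / total_length for aa, n in counts.items()}
    return n_entries, total_length / n_entries, freqs

# Two invented entries, only to show the function running
sample = io.StringIO(">sp|TEST1|made-up entry 1\nMKKLA\nGG\n"
                     ">sp|TEST2|made-up entry 2\nMAL\n")
n, avg_len, freqs = summarize(sample)
print(n, avg_len, sorted(freqs.items()))
```

Reading entry by entry keeps memory use low, which matters for a file of several hundred megabytes.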

Page 16: P4 2017 io

Amino acid frequencies

AA  1978   1991
L   0.085  0.091
A   0.087  0.077
G   0.089  0.074
S   0.070  0.069
V   0.065  0.066
E   0.050  0.062
T   0.058  0.059
K   0.081  0.059
I   0.037  0.053
D   0.047  0.052
R   0.041  0.051
P   0.051  0.051
N   0.040  0.043
Q   0.038  0.041
F   0.040  0.040
Y   0.030  0.032
M   0.015  0.024
H   0.034  0.023
C   0.033  0.020
W   0.010  0.014

Second step: Frequencies of Occurrence
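
To compare frequency tables, the values can be stored in dictionaries and subtracted residue by residue. The sketch below uses the 1978 and 1991 columns from the table above; the same approach could be applied to frequencies computed from Swiss-Prot. The variable names are made up for the example.

```python
freq_1978 = {"L": 0.085, "A": 0.087, "G": 0.089, "S": 0.070, "V": 0.065,
             "E": 0.050, "T": 0.058, "K": 0.081, "I": 0.037, "D": 0.047,
             "R": 0.041, "P": 0.051, "N": 0.040, "Q": 0.038, "F": 0.040,
             "Y": 0.030, "M": 0.015, "H": 0.034, "C": 0.033, "W": 0.010}
freq_1991 = {"L": 0.091, "A": 0.077, "G": 0.074, "S": 0.069, "V": 0.066,
             "E": 0.062, "T": 0.059, "K": 0.059, "I": 0.053, "D": 0.052,
             "R": 0.051, "P": 0.051, "N": 0.043, "Q": 0.041, "F": 0.040,
             "Y": 0.032, "M": 0.024, "H": 0.023, "C": 0.020, "W": 0.014}

# Change per residue, then rank by the size of the change
changes = {aa: freq_1991[aa] - freq_1978[aa] for aa in freq_1978}
ranked = sorted(changes, key=lambda aa: abs(changes[aa]), reverse=True)

for aa in ranked[:5]:
    print(aa, "%+.3f" % changes[aa])
```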

Page 17: P4 2017 io

Getting the database

FASTA: Uniprot_sprot.fasta – 268 MB

TEXT: Uniprot_sprot.dat – zipped (560 MB), unzipped (3 GB)

http://www.ebi.ac.uk/uniprot/download-center
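
Files of this size are best read one line at a time, and a compressed download can be read without unzipping it first. The sketch below counts FASTA entries this way using the standard `gzip` module. The function name `count_entries` and the tiny `sample.fasta.gz` it creates are made up so the example runs on its own; the path of the real download would be passed in instead.

```python
import gzip
import os
import tempfile

def count_entries(path):
    """Count FASTA entries in a plain or gzip-compressed file, line by line."""
    opener = gzip.open if path.endswith(".gz") else open
    n = 0
    with opener(path, "rt") as handle:   # "rt" = read as text
        for line in handle:
            if line.startswith(">"):     # every entry starts with a header line
                n += 1
    return n

# Self-contained demo: write a tiny compressed FASTA file with invented entries
path = os.path.join(tempfile.mkdtemp(), "sample.fasta.gz")
with gzip.open(path, "wt") as out:
    out.write(">sp|TEST1|made-up entry 1\nMKKLAGG\n")
    out.write(">sp|TEST2|made-up entry 2\nMAL\n")

n_entries = count_entries(path)
print(n_entries, "entries")
```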