Searching (and manipulating) your data

Page 1: Searching (and manipulating) your data

Searching (and manipulating) your data

Page 2: Searching (and manipulating) your data
Page 3: Searching (and manipulating) your data

>A06662 Synthetic nucleotide sequence of the human GSH transferase pi gene. : Location:1..1000
UGGGACCAGUCAGCAGAGGCAGCGUGUGUGCGCGUGCGUGUGCGUGUGUGUGCGUGUGUGUGUGUACGCUUGCAUUUGUGUCGGGUGGGUAAGGAGAUAGAGAUGGGCGGGCAGUAGGCCCAGGUCCCGAAGGCCUUGAACCCACUGGUUUGGAGUCUCCUAAGGGCAAUGGGGGCCAUUGAGAAGUCUGAACAGGGCUGUGUCUGAAUGUGAGGUCUAGAAGGAUCCUCCAGAGAAGCCAGCUCUAAAGCUUUUGCAAUCAUCUGGUGAGAGAACCCAGCAAGGAUGGACAGGCAGAAUGGAAUAGAGAUGAGUUGGCAGCUGAAGUGGACAGGAUUUGGUACUAGCCUGGUUGUGGGGAGCAAGCAGAGGAGAAUCUGGGACUCUGGUGGUCUGGCCUGGGGCAGACGGGGGUGUCUCAGGGGCUGGGAGGGAUGAGAGUAGGAUGAUACAUGGUGGUGUCUGGCAGGAGGCGGGCAAGGAUGACUAUGUGAAGGCACUGCCCGGGCAACUGAAGCCUUUUGAGACCCUGCUGUCCCAGAACCAGGGAGGCAAGACCUUCAUUGUGGGAGACCAGGUGAGCAUCUGGCC

UGG : W

>A06662_protein
W

Page 4: Searching (and manipulating) your data

GAC : D

>A06662_protein
WD

Page 5: Searching (and manipulating) your data

CAG : Q

>A06662_protein
WDQ

Page 6: Searching (and manipulating) your data

UCA : S

>A06662_protein
WDQS

Page 7: Searching (and manipulating) your data

GCA : A

>A06662_protein
WDQSA

Page 8: Searching (and manipulating) your data

GAG : E

>A06662_protein
WDQSAE

Page 9: Searching (and manipulating) your data

GCA : A

>A06662_protein
WDQSAEA

Page 10: Searching (and manipulating) your data

GCG : A

>A06662_protein
WDQSAEAA

Page 11: Searching (and manipulating) your data

UGU : C

>A06662_protein
WDQSAEAAC
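The codon-by-codon walk shown on pages 3 to 11 can be sketched in a few lines. This is only an illustration, written for Python 3 (the slides themselves use Python 2). The name `first_codons` is a hypothetical cut-down table holding just the codons met so far, and `rna` is the first 27 bases of A06662.

```python
# Hypothetical mini table: only the codons that appear on pages 3-11.
first_codons = {
    'UGG': 'W', 'GAC': 'D', 'CAG': 'Q', 'UCA': 'S',
    'GCA': 'A', 'GAG': 'E', 'GCG': 'A', 'UGU': 'C',
}

# First 27 bases (nine codons) of the A06662 sequence.
rna = 'UGGGACCAGUCAGCAGAGGCAGCGUGU'

protein = ''
for i in range(0, len(rna), 3):
    codon = rna[i:i + 3]              # take three bases at a time
    protein += first_codons[codon]    # look up the amino acid
    print(codon, ':', first_codons[codon], '->', protein)
```

Each pass of the loop corresponds to one slide: the next codon is read, looked up, and appended to the growing protein string.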

Page 12: Searching (and manipulating) your data

codonAMINO = {
    'GCU':'A', 'GCC':'A', 'GCA':'A', 'GCG':'A',
    'CGU':'R', 'CGC':'R', 'CGA':'R', 'CGG':'R', 'AGA':'R', 'AGG':'R',
    'UCU':'S', 'UCC':'S', 'UCA':'S', 'UCG':'S', 'AGU':'S', 'AGC':'S',
    'AUU':'I', 'AUC':'I', 'AUA':'I',
    'UUA':'L', 'UUG':'L', 'CUU':'L', 'CUC':'L', 'CUA':'L', 'CUG':'L',
    'GGU':'G', 'GGC':'G', 'GGA':'G', 'GGG':'G',
    'GUU':'V', 'GUC':'V', 'GUA':'V', 'GUG':'V',
    'ACU':'T', 'ACC':'T', 'ACA':'T', 'ACG':'T',
    'CCU':'P', 'CCC':'P', 'CCA':'P', 'CCG':'P',
    'AAU':'N', 'AAC':'N',
    'GAU':'D', 'GAC':'D',
    'UGU':'C', 'UGC':'C',
    'CAA':'Q', 'CAG':'Q',
    'GAA':'E', 'GAG':'E',
    'CAU':'H', 'CAC':'H',
    'AAA':'K', 'AAG':'K',
    'UUU':'F', 'UUC':'F',
    'UAU':'Y', 'UAC':'Y',
    'AUG':'M',
    'UGG':'W',
    'UAG':'STOP', 'UGA':'STOP', 'UAA':'STOP'
}

Page 13: Searching (and manipulating) your data

codonAMINO = {
    'GCU':'A', 'GCC':'A', 'GCA':'A', 'GCG':'A',
    'CGU':'R', 'CGC':'R', 'CGA':'R', 'CGG':'R', 'AGA':'R', 'AGG':'R',
    'UCU':'S', 'UCC':'S', 'UCA':'S', 'UCG':'S', 'AGU':'S', 'AGC':'S',
    'AUU':'I', 'AUC':'I', 'AUA':'I',
    'UUA':'L', 'UUG':'L', 'CUU':'L', 'CUC':'L', 'CUA':'L', 'CUG':'L',
    'GGU':'G', 'GGC':'G', 'GGA':'G', 'GGG':'G', 'AAU':'N', 'AAC':'N',
    'GUU':'V', 'GUC':'V', 'GUA':'V', 'GUG':'V', 'GAU':'D', 'GAC':'D',
    'ACU':'T', 'ACC':'T', 'ACA':'T', 'ACG':'T', 'UGU':'C', 'UGC':'C',
    'CCU':'P', 'CCC':'P', 'CCA':'P', 'CCG':'P', 'CAA':'Q', 'CAG':'Q',
    'GAA':'E', 'GAG':'E', 'CAU':'H', 'CAC':'H', 'AAA':'K', 'AAG':'K',
    'UUU':'F', 'UUC':'F', 'UAU':'Y', 'UAC':'Y', 'AUG':'M', 'UGG':'W',
    # 'AUG' appears twice: the later value ('START') replaces 'M'
    'AUG':'START', 'UAG':'STOP', 'UGA':'STOP', 'UAA':'STOP'
}

>>> codonAMINO['GCU']
'A'
>>> codonAMINO['AUG']
'START'
>>> for k in codonAMINO.keys():
...     print k, codonAMINO[k]
GUC V
AUA I
GUA V
GUG V
ACU T
AAC N
etc.

Page 14: Searching (and manipulating) your data

Dictionaries

Dictionaries are unordered collections of objects

Dictionaries are structures for mapping immutable objects (keys) to arbitrary objects (values)

d = {key1:value1, key2:value2,…,keyN:valueN}

Lists and dictionaries cannot be used as dictionary keys, because they are mutable!

Keys must be unique, i.e. the same key cannot be associated with more than one value
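The two rules above (immutable keys, unique keys) can be checked directly. The following is a small sketch for Python 3; the tuple key `('chainA', 12)` and the short peptide strings are made-up values used only for illustration.

```python
# Strings and tuples are immutable, so they work as keys.
d = {'pep1': 'MGSNK', ('chainA', 12): 'active site'}

# A list is mutable and therefore unhashable: using it as a key fails.
try:
    d[['chainA', 12]] = 'active site'
    list_key_error = None
except TypeError as err:
    list_key_error = err

# Keys are unique: assigning to an existing key replaces its value,
# it does not add a second pair.
d['pep1'] = 'RSLEP'

print(len(d), d['pep1'])
print('list as key ->', type(list_key_error).__name__)
```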

Page 15: Searching (and manipulating) your data
Page 16: Searching (and manipulating) your data

>>> d = {'pep1':'MGSNKSKPKDASQRRRSLEPAENVHGAGG',
...      'pep2':'RSLEPAENVHGAGGGAFPASQTPS'}
>>> len(d)
2

>>> d['pep1']
'MGSNKSKPKDASQRRRSLEPAENVHGAGG'

>>> d['pep3'] = 'ASADGHRGPSAAFAPAAA'
>>> d
{'pep1' : 'MGSNKSKPKDASQRRRSLEPAENVHGAGG', 'pep2' : 'RSLEPAENVHGAGGGAFPASQTPS', 'pep3' : 'ASADGHRGPSAAFAPAAA'}

Page 17: Searching (and manipulating) your data

>>> del d['pep2']
>>> d
{'pep1' : 'MGSNKSKPKDASQRRRSLEPAENVHGAGG', 'pep3' : 'ASADGHRGPSAAFAPAAA'}

>>> d.clear()
>>> d
{}

Page 18: Searching (and manipulating) your data

>>> d = {"a":1, "b":2, "c":3}
>>> d.keys()          # list of dictionary keys
['a', 'c', 'b']

>>> keys = d.keys()
>>> keys.sort()       # sort the list of keys in place
>>> keys
['a', 'b', 'c']

>>> d.values()        # list of dictionary values
[1, 3, 2]

>>> d.items()         # list of (key, value) tuples
[('a', 1), ('c', 3), ('b', 2)]

>>> d.has_key("a")    # True if d has key "a", else False
True
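The session above relies on Python 2 behaviour: `keys()`, `values()` and `items()` return lists there, and `has_key()` exists. As a hedged sketch of the same operations in Python 3, where those methods return views and `has_key()` has been removed, one might write:

```python
d = {'a': 1, 'b': 2, 'c': 3}

keys = sorted(d)             # sorted list of keys (views have no .sort())
values = list(d.values())    # values as a list
items = list(d.items())      # list of (key, value) tuples
has_a = 'a' in d             # replaces d.has_key('a')

print(keys, values, items, has_a)
```

The `in` test also works in Python 2, so it is the more portable way to check for a key.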

Page 19: Searching (and manipulating) your data

Exercise

Using the codonAMINO dictionary from tgac.py, translate the sequence in rna_seq.fasta. Start with a single reading frame. Then try all reading frames.

Page 20: Searching (and manipulating) your data

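from tgac import codonAMINO

F = open('rna_seq.fasta')
Out = open('protein_seq.fasta', 'w')

# Read the FASTA file: write the new header, collect the sequence
seq = ''
for line in F:
    if line[0] == '>':
        header = line.split()
        geneID = header[0]
        Out.write(geneID + '_protein\n')
    else:
        seq = seq + line.strip()

# Translate the first reading frame, codon by codon
prot = ''
for i in range(0, len(seq), 3):
    codon = seq[i:i+3]
    if codon in codonAMINO:
        prot = prot + codonAMINO[codon]
    else:
        prot = prot + '*'    # codon not in the dictionary

Out.write(prot + '\n')
F.close()
Out.close()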

Page 21: Searching (and manipulating) your data

from tgac import codonAMINO

F = open('rna_seq.fasta')
Out = open('protein_seq.fasta', 'w')

seq = ''
for line in F:
    if line[0] == '>':
        header = line.split()
        geneID = header[0]
        Out.write(geneID + '_protein\n')
    else:
        seq = seq + line.strip()

# Translate each of the three forward reading frames
for j in range(3):
    Out.write(str(j) + "-frame\n")
    prot = ''
    for i in range(j, len(seq), 3):
        codon = seq[i:i+3]
        if codon in codonAMINO:
            prot = prot + codonAMINO[codon]
        else:
            prot = prot + '*'    # codon not in the dictionary
    Out.write(prot + '\n')

F.close()
Out.close()
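The translation loop in the script above can also be wrapped in a function, which makes it easy to try on a short string without any files. This is a sketch for Python 3, not the course solution: `translate` and `mini_table` are hypothetical names, and `mini_table` is a deliberately incomplete stand-in for the full codonAMINO dictionary.

```python
def translate(seq, table, frame=0):
    """Translate seq codon by codon, starting at offset frame (0, 1 or 2)."""
    protein = ''
    for i in range(frame, len(seq), 3):
        # Unknown or incomplete codons become '*', as in the script above.
        protein += table.get(seq[i:i + 3], '*')
    return protein

# Hypothetical cut-down table; the full dictionary would be used in practice.
mini_table = {'UGG': 'W', 'GAC': 'D', 'CAG': 'Q', 'GGG': 'G', 'ACC': 'T'}

rna = 'UGGGACCAG'
frames = [translate(rna, mini_table, j) for j in range(3)]
for j, prot in enumerate(frames):
    print(str(j) + '-frame', prot)
```

Here `table.get(codon, '*')` does the lookup and the fallback in one step, replacing the explicit key test followed by indexing.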

Page 22: Searching (and manipulating) your data

Remove redundancy

Page 23: Searching (and manipulating) your data

How many different objects?

How many unique objects?

Page 24: Searching (and manipulating) your data

Are the two groups identical?

What is the intersection of the two groups?

Page 25: Searching (and manipulating) your data

Q5XXA6
Q9Y5P2
Q14667
O75387
Q8WV07
Q8CH62
Q9GZY1
Q9NQQ7
Q8VCX2
Q7Z769
Q8CH62
Q14667
Q9NQQ7
Q14667
Q9Y5P2

Q7Z769
Q8CH62
Q9GZY1
Q9NQQ7
Q14667
Q5XXA6
Q9Y5P2
Q14667
O75387
Q9Y5P2
Q8WV07
Q8VCX2
Q8CH62
Q14667
Q9NQQ7

Page 26: Searching (and manipulating) your data
Page 27: Searching (and manipulating) your data

Sets

Sets are unordered collections of unique objects: they are not sequence-like objects, and they cannot contain identical elements.

• Sets do not support indexing and slicing
• The in and not in operators can be used to test an element for membership in a set
• Sets are useful for removing duplicates
• Set operations: intersection, union, difference, symmetric difference
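The properties listed above can be tried out on a short list. The sketch below is written for Python 3; the four accession numbers are taken from the lists on page 25, while `'P12345'` is a made-up identifier used only to show a failed membership test.

```python
ids = ['Q14667', 'Q9Y5P2', 'Q14667', 'O75387']

unique_ids = set(ids)             # duplicates collapse to one element
print(len(ids), len(unique_ids))

print('O75387' in unique_ids)     # membership test
print('P12345' not in unique_ids)

# Sets are unordered, so indexing is not supported.
try:
    unique_ids[0]
    index_error = None
except TypeError as err:
    index_error = err
print(type(index_error).__name__)
```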

Page 28: Searching (and manipulating) your data

Create a new set

To create a set, call the built-in set(x), where x is a sequence-like object (string, tuple, list).

add(x) adds the single element x to the set
update(x) adds every element of the sequence-like object x to the set
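A short sketch of creating a set and then growing it with `add` and `update`, written for Python 3. The string `'GATTACA'` is an arbitrary example, not taken from the course data.

```python
s = set('GATTACA')        # from a string: one element per distinct character
print(sorted(s))

s.add('U')                # add a single element
s.update(['N', 'G'])      # add each element of a list; 'G' is already present
print(sorted(s))
```

`sorted()` is used only to print the elements in a predictable order, since a set itself has none.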

Page 29: Searching (and manipulating) your data

S1.union(S2) or S1 | S2

The union of two sets S1 and S2 creates a new set with the elements from both S1 and S2.

>>> S1 = set(['a','b','c'])
>>> S2 = set(['c','d','e'])
>>> S1.union(S2)
set(['a', 'c', 'b', 'e', 'd'])
>>> S1 | S2
set(['a', 'c', 'b', 'e', 'd'])

Page 30: Searching (and manipulating) your data

S1.intersection(S2) or S1 & S2

The intersection of two sets S1 and S2 creates a new set with the elements common to S1 and S2

>>> S1 = set(['a','b','c'])
>>> S2 = set(['c','d','e'])
>>> S1.intersection(S2)
set(['c'])
>>> S1 & S2
set(['c'])

Page 31: Searching (and manipulating) your data

S1.symmetric_difference(S2) or S1 ^ S2

The symmetric difference of two sets S1 and S2 creates a new set with the elements that are in either S1 or S2, but not in both

>>> S1 = set(['a','b','c'])
>>> S2 = set(['c','d','e'])
>>> S1.symmetric_difference(S2)
set(['a', 'b', 'e', 'd'])
>>> S1 ^ S2
set(['a', 'b', 'e', 'd'])

Page 32: Searching (and manipulating) your data

S1.difference(S2) or S1 - S2

The difference of two sets S1 and S2 creates a new set with elements in S1 but not in S2

>>> S1 = set(['a','b','c'])
>>> S2 = set(['c','d','e'])
>>> S1.difference(S2)
set(['a', 'b'])
>>> S1 - S2
set(['a', 'b'])
>>> S2 - S1
set(['e', 'd'])
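With these operations, the questions asked on pages 23 and 24 can be answered for the two accession lists on page 25. The following is one possible sketch in Python 3; the variable names `group1`, `group2`, `s1` and `s2` are arbitrary.

```python
group1 = ('Q5XXA6 Q9Y5P2 Q14667 O75387 Q8WV07 Q8CH62 Q9GZY1 Q9NQQ7 '
          'Q8VCX2 Q7Z769 Q8CH62 Q14667 Q9NQQ7 Q14667 Q9Y5P2').split()
group2 = ('Q7Z769 Q8CH62 Q9GZY1 Q9NQQ7 Q14667 Q5XXA6 Q9Y5P2 Q14667 '
          'O75387 Q9Y5P2 Q8WV07 Q8VCX2 Q8CH62 Q14667 Q9NQQ7').split()

s1, s2 = set(group1), set(group2)

print(len(group1), len(s1))    # total entries vs. unique entries
print(s1 == s2)                # same elements, ignoring order and repeats?
print(len(s1 & s2))            # size of the intersection
print(s1 - s2, s1 ^ s2)        # both empty when the groups hold the same IDs
```

For these two lists, each group holds 15 entries but only 10 unique accession numbers, and the two sets compare equal.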