Introduction to Python for Biologists Lecture 2 This Lecture Stuart Brown Associate Professor NYU School of Medicine


Introduction to Python for Biologists

Lecture 2

This Lecture

Stuart Brown, Associate Professor

NYU School of Medicine

Learning Objectives

• Flow control (if/else) and operators
• For loops
• Recursion
• Reading and writing files (File I/O)
• Creating custom functions with def
• Dictionaries

Flow control

• Programs need to make decisions and to loop in a controlled way (repeating operations a specific number of times).

Decision operators: if, elif, else
Looping operators:

    for x in list:
    while a < 10:

For Loops

• For loops iterate (step) through a list one element at a time.
• In Python, loops and decisions are set off by a colon and an indent.
• Python 'for' syntax is very simple, but you must indent the statements in the loop correctly.

>>> my_list = ['G', 'A', 'hat', 'cat']
>>> concat = ""    # this is an empty string
>>> for i in my_list:
...     concat = concat + i
...
>>> print(concat)
GAhatcat

Loop through a String

• For loops work on strings as if they were a list of characters.

>>> my_dna = 'ATGCGTA'
>>> for i in my_dna:
...     print(i)
...
A
T
G
C
G
T
A

if/else example

>>> my_DNA = "ATGCGTA"
>>> if my_DNA.find("GC") != -1:    # find() returns -1 when the substring is absent
...     print("GC is found")
... else:
...     print("No GC found")
...
GC is found

Operators

• Operators include the basic math functions: +, -, /, *, ** (raise to a power)
• Comparisons: >, <, >=, <=, ==, !=
• Boolean operators: and, or, not
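A short sketch of how these operators behave, assuming Python 3 (where / always returns a float and // is floor division). The variable names and values are made up for illustration.

```python
# Arithmetic operators (Python 3 semantics assumed)
seq_length = 21
codons = seq_length / 3      # true division always gives a float: 7.0
whole = seq_length // 3      # floor division gives an int: 7
squared = seq_length ** 2    # raise to a power: 441

# Comparison and Boolean operators combine into a single test
gc_fraction = 0.52           # illustrative value, not real data
is_balanced = gc_fraction >= 0.4 and gc_fraction <= 0.6
is_short = not seq_length > 100

print(codons, whole, squared, is_balanced, is_short)
```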

Example

    dna = 'GATCCGGTTACTACGACCTGA'
    count_G = 0
    count_A = 0
    for base in dna:
        if base == 'G':
            count_G += 1
        elif base == 'A':
            count_A += 1
    print('G= ' + str(count_G) + ' ' + 'A= ' + str(count_A))

Functions

• More complex operations are handled by functions.
• They can deal with file I/O, more complex math, or other manipulations of data.
• Functions use parentheses to act on some data object, and may take additional parameters. Methods are functions that belong to an object and are called with a dot.

    print(x)
    open('filename', 'r')
    filehandle.read()
    my_list.append(42)
    filehandle.write(data)
    len(my_dna)

Range

• range(start, stop[, step]) generates a sequence of integers (in Python 3, wrap it in list() to see the values)
  – Starts at zero by default
  – A range does not include the stop number
  – Step is optional

>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(range(4, 11, 2))    # from 4 up to (but not including) 11, in steps of 2
[4, 6, 8, 10]

• range() is often used as part of a for loop to step through a list while keeping track of which item number you are working on:

>>> a = ['Mary', 'had', 'a', 'little', 'lamb']
>>> for i in range(len(a)):
...     print(i, a[i])
...
0 Mary
1 had
2 a
3 little
4 lamb
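As an alternative to range(len(a)), the built-in enumerate() hands back the index and the item together. This is a minimal sketch of the same loop; the `numbered` list is only there to collect the results for inspection.

```python
a = ['Mary', 'had', 'a', 'little', 'lamb']

# enumerate() yields (index, item) pairs, so no range(len(a)) is needed
numbered = []
for i, word in enumerate(a):
    numbered.append(str(i) + ' ' + word)
    print(i, word)
```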

List Comprehension

• A list comprehension creates a list using an expression and a for loop. An optional if clause can be included.

    # create a list of squares < 50
    squares = []
    for x in range(10):
        if (x**2) < 50:
            squares.append(x**2)
    print(squares)    # [0, 1, 4, 9, 16, 25, 36, 49]

    # create a list of squares < 50 with a list comprehension
    squares = [x**2 for x in range(10) if (x**2) < 50]
    print(squares)    # [0, 1, 4, 9, 16, 25, 36, 49]
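The same pattern works on sequence data. The sketch below filters a DNA string with the if clause and uses the result to estimate GC content. The names `gc_bases` and `gc_fraction` are illustrative, and Python 3 division is assumed.

```python
dna = 'GATCCGGTTACTACGACCTGA'

# keep only the bases that are G or C
gc_bases = [base for base in dna if base in 'GC']

# fraction of the sequence that is G or C (Python 3: / gives a float)
gc_fraction = len(gc_bases) / len(dna)

print(len(gc_bases), round(gc_fraction, 3))
```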

Custom Functions

• In Python, users can create their own functions, which act like subroutines
• or use functions within code written by others (known as modules)

    def g_count(dna):        # function takes a string as input
        count = 0
        for base in dna:
            if base == 'G':
                count += 1
        return count         # function returns an integer

ATG finder

>>> def find_ATG(dna):
...     if dna.find("ATG"):
...         return ("ATG is found")
...     else:
...         return ("No ATG found")
...
>>> my_dna = 'TATGCGTA'
>>> find_ATG(my_dna)
'ATG is found'

Bonus point if you find and fix some of the bugs in this code
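One possible answer, offered as a sketch rather than the intended solution: str.find() returns the index of the match, or -1 when there is no match, so using its result directly in an if gives the wrong answer both when ATG sits at index 0 and when ATG is absent. Comparing against -1 avoids both problems.

```python
def find_ATG(dna):
    # str.find() returns the index of the match, or -1 when there is none,
    # so compare against -1 rather than relying on truthiness.
    if dna.find("ATG") != -1:
        return "ATG is found"
    else:
        return "No ATG found"

print(find_ATG("TATGCGTA"))   # ATG at index 1
print(find_ATG("ATGCGTA"))    # ATG at index 0, which a bare `if` would treat as False
print(find_ATG("TTTCGTA"))    # no ATG; find() gives -1, which a bare `if` would treat as True
```

Writing the test as `if "ATG" in dna:` would be an equally valid fix.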

Recursion

• Now that you can make custom functions …
  – what would happen if you wrote a function that called itself?

    def countdown(n):
        if n <= 0:
            print("Blastoff!")
        else:
            print(n)
            countdown(n - 1)

• Of course, you should avoid creating an infinite loop …

    def plustwo(n):
        print(n)
        plustwo(n + 2)    # be careful running this: it has no stopping condition (Python eventually raises RecursionError)

Fibonacci

• Computer scientists use recursion often; it is less common in bioinformatics applications.
• Rosalind (http://rosalind.info) has several sections that explore algorithms in computational biology and beyond.
  – There is a nice (fairly simple) problem about Fibonacci numbers: http://rosalind.info/problems/fibo/
  – Give it a try (in Python, of course).

    def Fib(x):
        if x == 0:
            return 0
        elif x == 1:
            return 1
        elif x > 1:
            return Fib(x - 1) + Fib(x - 2)

• Why is this program such a bad idea?
• How can you do it better using a simple list to store the Fib series?
• This is also a good introduction to computational complexity. Bioinformatics often deals with large data and complex computations, so the speed of computing for a given task is an important issue.
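One way to approach the list-based version, as a sketch: the recursive Fib() recomputes the same values over and over, so the number of calls grows very quickly with x. Storing each value in a list means every number is computed once. The function name `fib_series` is hypothetical.

```python
def fib_series(n):
    """Return the list [F(0), F(1), ..., F(n)], built iteratively."""
    series = [0, 1]
    for i in range(2, n + 1):
        # each new value reuses the two values already stored in the list
        series.append(series[i - 1] + series[i - 2])
    return series[:n + 1]    # the slice handles n = 0 correctly

print(fib_series(10))
```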

File I/O

• Usually your programs will get input data from a text file, and you will want to write output to a file rather than dump it on the screen ("standard output", "stdout").
• In Python, a file must be opened before reading or writing. The open file is assigned to a variable called a 'handle'; the program then reads from or writes to the handle.
• The .read() method captures the whole contents of the file in a single string.
• .close() the file when you are done with it.

    file1 = open('human_pep.fasta')
    Hum_pep = file1.read()
    gene_count = Hum_pep.count('>')
    file1.close()

with open() as f

• A nicer way to open a file is to use the with/as keywords and an indented block. This automatically closes the file when the indented block is completed.

>>> with open('human_pep.fasta') as file1:
...     Hum_pep = file1.read()
...

Write output to a file

• To create an output file, open a file (give it any name you want) with the 'w' option and assign it to a variable name.
• Then use the write() method. write() takes a single string, which you can build with string methods, concatenation, etc. inside the parentheses. Unlike print(), it does not add a newline at the end.

    output = open('humpep_count.txt', 'w')
    output.write('Gene Count: ' + str(gene_count))
    output.close()
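The with/as form works for writing too. The sketch below assumes a placeholder `gene_count` value and writes into a temporary directory (an illustrative choice, so the example leaves nothing behind in your working folder).

```python
import os
import tempfile

gene_count = 3   # placeholder value for illustration

# Write into a temporary directory so the sketch leaves no stray files behind
out_path = os.path.join(tempfile.mkdtemp(), 'humpep_count.txt')

with open(out_path, 'w') as output:
    # write() needs one string and adds no newline, so append '\n' explicitly
    output.write('Gene Count: ' + str(gene_count) + '\n')

# Read the file back to confirm what was written
with open(out_path) as check:
    print(check.read())
```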

Read a file line by line with a for loop

• readlines() captures a file as a list of lines (rather than all in one big string); then you can loop over the list of lines.

    my_file = open('human_dna.fasta')
    human_seq = my_file.readlines()
    for line in human_seq:
        print(len(line))

• Or you can iterate over lines in the file directly with a for loop:

    my_file = open('human_dna.fasta')
    for line in my_file:
        print(len(line))
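Each line read from a file still ends with its newline character, so len(line) counts one more character than the sequence itself. The sketch below builds a tiny stand-in FASTA file (made-up content, hypothetical file name) and totals the sequence length while skipping the header line.

```python
import os
import tempfile

# Build a tiny stand-in FASTA file (hypothetical content) so the sketch is runnable
fasta_path = os.path.join(tempfile.mkdtemp(), 'example.fasta')
with open(fasta_path, 'w') as handle:
    handle.write('>seq1 example\nGATCCGGT\nTACTACG\n')

seq_length = 0
with open(fasta_path) as my_file:
    for line in my_file:
        line = line.rstrip('\n')       # drop the trailing newline before measuring
        if line.startswith('>'):
            continue                   # skip the FASTA header line
        seq_length += len(line)

print(seq_length)
```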

Dictionaries

• Dictionaries contain key-value pairs. (Called a "hash" or "map" in some other programming languages)

    my_dict1 = {'ATT': 'I', 'CTT': 'L', 'GTT': 'V', 'TTT': 'F'}

• Very useful for lookup lists of things like the amino acid codon table or k-mer lists
• Designed to give very fast random-access lookup of the key and return the corresponding value
• Keys must be unique and hashable (strings, numbers, or tuples, for example); values can be anything
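A brief sketch of the basic dictionary operations, using the small codon dictionary from this slide. The 'NNN' key and the 'X' default are made-up values for illustration.

```python
my_dict1 = {'ATT': 'I', 'CTT': 'L', 'GTT': 'V', 'TTT': 'F'}

print(my_dict1['CTT'])             # look up a value by its key

my_dict1['ATG'] = 'M'              # add a new key-value pair

# .get() returns a default instead of raising KeyError for a missing key
unknown = my_dict1.get('NNN', 'X')

print('GTT' in my_dict1, unknown, len(my_dict1))
```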

Zip makes a dictionary

• Rather than type a dictionary, you can build a dictionary from two lists using zip(). zip() pairs up the items, and dict() turns the pairs into a dictionary.

>>> list1 = ['GAT', 'CAT', 'TAT', 'AAT']
>>> list2 = [1, 2, 3, 4]
>>> list(zip(list1, list2))
[('GAT', 1), ('CAT', 2), ('TAT', 3), ('AAT', 4)]
>>> dict(zip(list1, list2))
{'GAT': 1, 'CAT': 2, 'TAT': 3, 'AAT': 4}

Check and add to dictionary

• Another useful application of a dictionary is to build a non-redundant list.
  – For each item, check if it is in the dictionary; if not, add it to the dictionary.
  – You can count occurrences at the same time.

Example: count DNA dimers

    DNA = 'GATCCGGTTACTACGACCTGAGAT'
    Dimers = {}                      # create an empty dictionary
    for x in range(len(DNA)):
        di = DNA[x:(x+2)]
        if di in Dimers:
            Dimers[di] += 1          # add one to count for di
        else:
            Dimers[di] = 1           # add di to Dimers dict
    print(Dimers)

Bonus point if you find and fix the bugs in this code
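One possible fix, as a sketch rather than the official answer: looping over every index lets the final slice contain a single base, so a one-letter "dimer" ends up in the dictionary. Stopping one position early keeps every slice two bases long.

```python
DNA = 'GATCCGGTTACTACGACCTGAGAT'
Dimers = {}                          # create an empty dictionary
for x in range(len(DNA) - 1):        # stop one short so every slice has 2 bases
    di = DNA[x:x + 2]
    if di in Dimers:
        Dimers[di] += 1              # add one to the count for di
    else:
        Dimers[di] = 1               # add di to the Dimers dict
print(Dimers)
```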

Challenge Assignment

• Write a function that translates a DNA string into protein.
• In your function, use a dictionary with triplet codons as keys and amino acids as values.
• Begin translation at the first ATG codon.
• Write a program that uses your translate function to open and translate a file that contains a single DNA sequence as text, and write the output to another text file.

Zip a codon table (save yourself some typing)

codons= ['ttt', 'ttc', 'tta', 'ttg', 'tct', 'tcc', 'tca', 'tcg', 'tat', 'tac', 'taa', 'tag', 'tgt', 'tgc', 'tga', 'tgg', 'ctt', 'ctc', 'cta', 'ctg', 'cct', 'ccc', 'cca', 'ccg', 'cat', 'cac', 'caa', 'cag', 'cgt', 'cgc', 'cga', 'cgg', 'att', 'atc', 'ata', 'atg', 'act', 'acc', 'aca', 'acg', 'aat', 'aac', 'aaa', 'aag', 'agt', 'agc', 'aga', 'agg', 'gtt', 'gtc', 'gta', 'gtg', 'gct', 'gcc', 'gca', 'gcg', 'gat', 'gac', 'gaa', 'gag', 'ggt', 'ggc', 'gga', 'ggg']

amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'

>>> codon_table = dict(zip(codons, amino_acids))

Very nice Python code by Peter Collingridge: http://www.petercollingridge.co.uk/python-bioinformatics-tools/codon-table
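To save even more typing, the codon list itself can be generated with a list comprehension. This sketch assumes the same t, c, a, g ordering as the list on this slide, which is what makes the amino acid string line up. The `bases` variable is introduced here for illustration.

```python
bases = 'tcag'   # same base order as the codon list on the slide

# every combination of three bases, in t, c, a, g order: 'ttt', 'ttc', 'tta', ...
codons = [a + b + c for a in bases for b in bases for c in bases]

amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
codon_table = dict(zip(codons, amino_acids))

# keys are lower case, so lower-case any input codon before looking it up
print(codon_table['ATG'.lower()], codon_table['taa'], len(codon_table))
```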

Re-use Code vs Write New
A little break for a philosophical debate

• When should you find and re-use code written by others and when should you write your own?
• In bioinformatics, many of the problems you will encounter with data have been faced by other people.
  – A great deal of code has been written and shared in public repositories.
  – Some of this code has been published and cited in the literature.
  – Don't try to re-write BLAST (unless you really, really have to).
• If you can't find code to do exactly what you want, should you adapt existing code or write your own?
  – There are challenges to figuring out someone else's code.
  – New code that depends on programs written by others is very fragile.
  – There are challenges to validating your own code when using it to analyze and publish scientific data.
  – There is value in building your own repository of code elements from scratch that work and fit together in a way that is intuitive for you.

Some Statistics in Python

• NumPy has some basic statistics functions that work on arrays.

>>> import numpy as np
>>> squares = [x**2 for x in range(10) if (x**2) < 50]
>>> sq = np.array(squares)
>>> np.mean(sq)
17.5
>>> np.median(sq)
12.5
>>> np.std(sq)
16.680827317612277

Other NumPy functions

NumPy has:
• linear algebra
• trigonometry
• logarithms
• polynomials
• Fourier transforms
• random sampling
• permutations
• sorting
• and distributions (normal, Poisson, hypergeometric, logistic, gamma, negative binomial, etc.)

SciPy

• SciPy builds on NumPy and provides many more complex mathematical, statistical, and scientific data analysis functions.

>>> import antigravity

Summary

• Flow control (if/else) and operators
• For loops
• Recursion
• Reading and writing files (File I/O)
• Creating custom functions with def
• Dictionaries


Next Lecture: Biopython