BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145...

45
BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg ([email protected]) 4103 Nat Sci 2 - office hours Tu, Th 3:30-5:00 (or by appointment) phone 824-8573 TA – Bassem Shoucri ([email protected]) 4351 Nat Sci 2, 824-6873, 3116 – office hours M 2-4 lectures will be posted on web pages after lecture http:// blumberg.bio.uci.edu/biod145-w2015 http:// blumberg-serv.bio.uci.edu/biod145-w2015 No office hours on Thursday 2/12

Transcript of BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145...

Page 1: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

BioSci D145 lecture 6 page 1 ©copyright Bruce Blumberg 2011. All rights reserved

BioSci D145 Lecture #6

• Bruce Blumberg ([email protected])– 4103 Nat Sci 2 - office hours Tu, Th 3:30-5:00 (or by appointment)– phone 824-8573

• TA – Bassem Shoucri ([email protected])– 4351 Nat Sci 2, 824-6873, 3116 – office hours M 2-4

• lectures will be posted on web pages after lecture – http://blumberg.bio.uci.edu/biod145-w2015– http://blumberg-serv.bio.uci.edu/biod145-w2015

– No office hours on Thursday 2/12

Page 2: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

Midterm results

Mean: 25.3 Median: 27 Maximum: 33 Minimum: 18 Std. dev.: 4.3

BioSci D145 lecture 6 page 2 ©copyright Bruce Blumberg 2011. All rights reserved

Page 3: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

Functional Genomics - The challenge: Many new genes of unknown function• Where/when are they expressed?

– Known genes (e.g. from genome projects)• Gene chips (Affymetrix)• Microarrays (Oligo, cDNA, protein) (Iyer paper week 4)

– Novel genes• Expression profiling (3 papers this week)

– Genomic tiling microarrays (Kapranov)– SAGE and related approaches (RIKEN)– Massively parallel sequencing (RNAseq) (t’Hoen)

• Which genes regulate what other genes? • Epigenetic modification of gene expression (week 7 papers)• What is the phenotype of loss-of-function? (week 8 papers)

– Genome wide siRNA (Boutros)– Genome wide synthetic lethal screens (Luo)– CRISPR/Cas (Gilbert)

• What do they interact with (week 9 papers)• ‘Omic approaches and humanized mouse models (week 10 papers)

BioSci D145 lecture 6 page 3 ©copyright Bruce Blumberg 2011. All rights reserved

Page 4: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

Routes to gene identification

• Genome sequences are minimally useful without annotation– Annotation = description, biological information

• Functional annotation – information on the function• Structural annotation – identification of genes, sequence

elements– Much annotation is done automatically today

• Via sequence comparisons with various databases– Gene sequences– ESTs

• Algorithms predict promoters, splicing, polyadenylation sites and, most importantly ORFs

• ORFs – open reading frames are putative proteins– Algorithms miss in both directions– Source of much disagreement

• Field of bioinformatics has grown to encompass many types of analysis related to gene function– www.igb.uci.edu

BioSci D145 lecture 6 page 4 ©copyright Bruce Blumberg 2011. All rights reserved

Page 5: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

How are genes identified?

• Random– EST sequencing, select interesting looking gene– Large scale expression analysis

• http://xenopus.nibb.ac.jp/• From protein sequences

– Antibody screening– Reverse translate and oligo screen

• Functional cloning– Finding a gene by using a functional assay

• Positional cloning– Find a gene by where it is located, what it is near

• By similarity to other sequences– Gene family– Cross-species– Computer based equivalents

• Bioinformatic analysis that relates back to functional or positional cloning

BioSci D145 lecture 6 page 5 ©copyright Bruce Blumberg 2011. All rights reserved

Page 6: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

How are genes identified?

• Functional cloning (aka expression cloning) – identifying by a functional assay– What are functional assays?

• Enzyme activity – kinases (add PO4 to proteins)• Ligand binding – peptide hormone (e.g. glucagon) receptors• Transport (ions, sugars, etc) – e.g., intestinal glucose

transporters• Mutant rescue – restore function to a cell or embryo

– Introduce cDNA library pools (~10,000 cDNAs) • via transfection, microinjection, infection• Perform functional assays

– Robust, sensitive, accurate is key

– positive pools are subdivided and retested to obtain pure cDNAs

• cycle is repeated until single clones obtained

– Applications – many enzymestransporters and growth factorreceptors cloned this way

BioSci D145 lecture 6 page 6 ©copyright Bruce Blumberg 2011. All rights reserved

Page 7: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

How are genes identified?

• Positional cloning – identifying by where the gene is located– from genetic linkage map– by walking from nearby sequence tags (ESTs, STS, STC, etc)– Gene trap techniques (week 8)– Interspecific backcrosses (Mus musculus vs M. spretus)– When possible – try to rescue phenotype with candidate region

• Positional cloning – ok you have identified a region where your gene of interest may be located - now what?– How do we figure out what genes are in this region without

knowing function?– General problem for annotation of sequences

• Genome sequencing vis a vis positional cloning

BioSci D145 lecture 6 page 7 ©copyright Bruce Blumberg 2011. All rights reserved

Page 8: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

How are genes identified? - Case studies

• Duchenne muscular dystrophy (DMD) first gene positionally cloned– One group did genomic subtraction cloning

• Strategy enriched for regions lost in DMD patient• Made a library and tested clones by Southern blot to normal

and DMD DNA– 2nd group cloned breakpoints

• Girl with translocation between X and 21• 21 was rich in rRNA genes so made a

radiation hybrid panel from patient • Identified hybrid cell carrying the breakpoint

– made a genomic library from it• Screened library for clones with both rRNA genes and X

chromosome specific sequences• Mapped this genomic DNA to male patients with DMD and

found deletions in many of them– DMD gene is largest known – 2.4 megabases– cDNA cloning followed – protein is dystrophin

BioSci D145 lecture 6 page 8 ©copyright Bruce Blumberg 2011. All rights reserved

Page 9: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

How are genes identified? (contd)

• Duchenne muscular dystrophy (contd)– Today – for a sequenced organism – just go to the database

identify sequences in region of interest and verify by Southern or PCR as above

• Or look in large insert libraries with breakpoints• Or do cDNA subtraction between tissues from normal

individual and DMD individual– Presumes knowledge of source of mutation, i.e., the

defect resides in the affected tissue– Would not detect a defect in inducing factor from other

tissue

BioSci D145 lecture 6 page 9 ©copyright Bruce Blumberg 2011. All rights reserved

Page 10: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

How are genes identified? (contd)

• Ways to identify genes in regions– Cross-species hybridization

• Probe another species with this genomic region• coding sequences are conserved -> should see hybridization

where genes are• What do you think are limitations to this approach?

– Species must be sufficiently different to reduce “noise” from overall sequence conservation

» mouse vs human probably not great» Human vs frog or fish probably good

– Must be sufficiently similar for genes to be conserved» Human vs frog or fish probably good» Humans vs yeast only good for common genes

– Target species region needs to be well characterized

• Computer parallels – compare sequence to be annotated with annotated sequence from a different organism

– e.g., human with Drosophila– Unknown bacterium with E. coli, etc.

BioSci D145 lecture 6 page 10 ©copyright Bruce Blumberg 2011. All rights reserved

Page 11: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

How are genes identified? (contd)

• Ways to identify genes in regions (contd)– Hybridization to known genes or coding materials

• What are some examples?

– Hybridize to mRNA (Northern blots)– Hybridize to cDNA libraries (must be right tissue, cell or

stage)– Hybridization to genomic tiling microarrays– Capture cDNAs or mRNAs from solution• Computer based parallels

BioSci D145 lecture 6 page 11 ©copyright Bruce Blumberg 2011. All rights reserved

– Compare with expressed sequences from other species– Compare with ESTs

Page 12: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

How are genes identified? (contd)

• Ways to identify genes in regions (contd)– Identify features found in typical promoters

• What are promoters?

• CpG islands – regions in eukaryotic genes that are hypomethylated

– Undermethylated– Remember that methylation of DNA typically inhibits gene

expression– Digest with enzymes that have CG in recognition site that

would be inhibited if methylated, e.g., SacII CCGCGG, run gel to check

» If nonmethylated (expressed) enzyme will cut, region will be hypersensitive, get chopped up.

» If methylated (not expressed) enzyme will not cut and region will not get digested

• DNAse I hypersensitive sites– Similar principle – transcriptionally active DNA is “open”– If open, it is more sensitive to DNAse I than non-active

DNA– Test by digestion and gel electrophoresis

Regions 5’ to a gene that are required for expression

BioSci D145 lecture 6 page 12 ©copyright Bruce Blumberg 2011. All rights reserved

Page 13: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

How are genes identified? (contd)

• The problem with all of these methods is that experiments are required– What do we do when sequences are coming in at the rate of tens

of gigabases/month?– Need large-scale, robust, computerized methods to identify genes

and annotate sequences!

• All bioinformatics depends on databases– UCI bioinformatics groups are among the best at designing and

constructing databases• http://www.igb.uci.edu/tools/databases.html

• Three major databases of sequences (automatically duplicated)– GENBANK http://www.ncbi.nlm.nih.gov/Genbank/index.html– DNA Databank of Japan http://www.ddbj.nig.ac.jp/– European Molecular Biology Laboratory (EMBL)

http://www.ebi.ac.uk/embl/index.html

BioSci D145 lecture 6 page 13 ©copyright Bruce Blumberg 2011. All rights reserved

Page 14: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=retrieve&dopt=default&list_uids=1756

Dystrophin as an example

BioSci D145 lecture 6 page 14 ©copyright Bruce Blumberg 2011. All rights reserved

Page 15: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

Genome annotation – how to identify genes?

• Gene identification/prediction is important but difficult– Large variety of methods and algorithms to predict exons– To identify genes must first identify open reading frames (ORFs)

• When dealing with cDNAs – look for regions that code for proteins

– Do all genes code for proteins?– Correct reading frame for a sequence is assumed to be

largest with no stop codons (TGA, TAA, TAG)• Lots of tricks can be employed

– Codon frequency for an organism» Coding sequences follow codon usage» Noncoding sequences do not, often have lots of stop

codons– Consensus sites

» Kozak translational initiation CCRCCATGG• What is a very important consideration when searching

sequences to predict ORFs?– Sequence must be accurate

» Incorrect base calls are troublesome» But indels (insertions or deletions) are disastrous

BioSci D145 lecture 6 page 15 ©copyright Bruce Blumberg 2011. All rights reserved

Page 16: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

Genome annotation – how to identify genes (contd)?

• Computer based gene prediction methods– Two major methods are in use

• Homology searches– Compare with other known sequences

• Ab initio (from the beginning) prediction– Use algorithms to recognize common features and predict

genes» Promoters» Splice sites» Polyadenylation sites» ORFs

– Generally, microbial genomes are much easier to annotate – WHY?

• Simply identify ORFS > 300 bp (100 amino acids)– Works very well– But can miss small coding sequences– Must run on both strands because there are shadow genes

(overlap on two strands)• Using a variety of programs, can predict genes in bacterial

genomes– Venter Sargasso sea paper

BioSci D145 lecture 6 page 16 ©copyright Bruce Blumberg 2011. All rights reserved

Page 17: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

Genome annotation – how to identify genes (contd)?

• Computer based gene prediction methods (contd)– Huge variety of programs

available• Neural networks –

attempt to model learning process– Build decision trees, use probabilistic reasoning

• Rule-based systems– Rules often not clear – Have trouble with exceptions

• Hidden Markov models– Break sequences down into small units based on

statistical analysis of composition» Hexamers appear to be optimal size to search

– Classify sequences into types or “states”– Identify transitions between states– Very useful for large number of purposes

BioSci D145 lecture 6 page 17 ©copyright Bruce Blumberg 2011. All rights reserved

Page 18: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

Genome annotation – how to identify genes (contd)?

• Computer based gene prediction methods (contd)– Training sets are used to

“teach” programs howto solve problems

• Training set is actual data – genes with known features– Programs use training sets to classify new data

• Neural networks use training data to build decision trees• Rule-based systems use training data to generate rules• HMM build table of probabilities for states and transitions

– Pierre Baldi in IGB is UCI expert in machine learning

• How well do gene predicting programs work?– Extremely well on bacterial genomes– Fairly well on simply eukaryotic genomes– Variable on complex genomes– Rule of thumb – use a group of programs and look for areas of

agreement among them– The best current programs combine ab initio predictions with

similarity data to define a probability model

BioSci D145 lecture 6 page 18 ©copyright Bruce Blumberg 2011. All rights reserved

Page 19: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

Identification of gene function

• You have identified a gene – what is its function?– Always look for similarity to known sequences

• Swiss-prot is fairly well annotated • GENBANK translated database is most complete• BLAST is tool to use• Amino acid searches more sensitive than nucleotide searches

– Because identical amino acid sequences might only be 67% identical at nucleotide level

– What might you find?• Match may predict biochemical and physiological function

– e.g., a known enzyme from another organism• Match may predict biochemical function only

– e.g., a kinase• Match a gene from another organism with no known function

– May match ESTs or ORFs from other organisms• Match a known gene with partly characterized function

– Search leads to clarification of function – NifS in book• Might not match anything at all

– Expect this will happen less and less

BioSci D145 lecture 6 page 19 ©copyright Bruce Blumberg 2011. All rights reserved

Page 20: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

Up-regulated by TTNPB (> 1.5 Fold, p < 0.01, n=334)

Transcriptional15% (53)

Hypothetical26% (90)

Housekeeping15% (52)

Extrcellular matrix3% (10)

Tumor suppressor1% (2)

Unidentified21% (73)

Signal tranduction8% (26)

Miscellaneous1% (4)

Retinoid metabolism1% (3)

Cytoskeleton4% (14)

Neural2% (7)

Energy metabolism3% (9)

Cytoskeleton

Energy metabolism

Extrcellular matrix

Housekeeping

Hypothetical

Neural

Retinoid metabolism

Signal tranduction

Transcriptional

Tumor suppressor

Unidentified

Miscellaneous

BioSci D145 lecture 6 page 20 ©copyright Bruce Blumberg 2011. All rights reserved

Page 21: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

Identification of gene function (contd)

• You have identified a gene – what is its function? (contd)– Does the sequence contain an obvious functional motif?

• Homeobox or other consensus DNA binding domain?• Kinase domain?• Serine protease, etc.

– InterPro database allows one to compare a protein sequence with whole family of structural databases

• http://www.ebi.ac.uk/interpro/• HICAICGDRSSGKHYGVYSCEGCKGFFKRTVRKDLTYTCRDSKDCMI

DKRQRNRCQYCRYQKCLAMGM• http://

www.ebi.ac.uk/interpro/sequencesearch/iprscan5-S20150210-213958-0238-95988468-pg

• Other sorts of similarity searches• Identify protein secondary structure motifs

– Alpha helix, beta sheets, hydrophobicity– Amphipathic helices– Overall polarity of sequences– Not used much

BioSci D145 lecture 6 page 21 ©copyright Bruce Blumberg 2011. All rights reserved

Page 22: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

Identification of gene function (contd)

• You have identified a gene – what is its function? (contd)– Gene ontology – highly

structured vocabulary for gene classification

• Genes are classified using this vocabulary

• Relates protein function with cellular or organismal functions

– Nucleic acid binding– Cell division

BioSci D145 lecture 6 page 22 ©copyright Bruce Blumberg 2011. All rights reserved

Page 23: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

Genome annotation

• Extremely important as number of sequences increases– Goals are to identify

• all of the sequences• all of the features of each sequence• All of the functions of the identified genes

– Sometimes annotation does not agree with known function• Human error• New and updated information not propagated to database• Inaccurate sequencing• Sometimes annotation is correct but protein lacks function

under certain conditions (e.g., need cofactors)– Gold standard for functional analysis is loss-of-function analysis

• Most accurate annotation– Common to have “annotation jamborees” where biologists and

bioinformaticians come together to annotate new sequences• Xenopus tropicalis jamboree was in Spring 2006• But many genes and gene models are still unannotated

BioSci D145 lecture 6 page 23 ©copyright Bruce Blumberg 2011. All rights reserved

Page 24: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

Which genes regulate what other genes?

• The biggest defect of expression microarrays or transcript profiling is that neither can distinguish direct targets from indirect targets– Which genes are a primary response to the treatment vs.– Which ones have one or more intermediates?

• How can we approach and solve this important problem?

• All eukaryotic DNA occurs as chromatin (DNA+histone and other proteins)

– Chromatin conformation influences whether transcription can occur

BioSci D145 lecture 6 page 24 ©copyright Bruce Blumberg 2011. All rights reserved

– Identify the genes to which a transcription factor binds– Identify to which genes RNA polymerase II is recruited

• And from which it is dismissed

Page 25: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

Which genes regulate what other genes?

—Open chromatin is accessible to transcriptional machinery• DNA is unmethylated• Histones are methylated and acetylated (acetylase activates)• Many transcriptional co-activators methylate and acetylate

histones and other chromatin associated proteins− Opens the chromatin

BioSci D145 lecture 6 page 25 ©copyright Bruce Blumberg 2011. All rights reserved

Page 26: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

Which genes regulate what other genes?

—Closed chromatin is inaccessible to transcriptional machinery• DNA is methylated• Some histone tails are methylated, others not• Transcriptional co-repressors recruit histone deacetylases

(HDACs) which lead to chromatin condenstation• Chromatin condensation leads to gene silencing

• Identification of chromatin-localized proteins is diagnostic for direct target genes of transcription factors

− Most common application

• Identification of promoters to which RNA Pol II is recruited upon some treatment is diagnostic for genes directly upregulated by the treatment

− Trickier but very useful

• Identification of promoters from which RNA Pol II is dismissed upon some treatment is diagnostic for genes downregulated by the treatment

BioSci D145 lecture 6 page 26 ©copyright Bruce Blumberg 2011. All rights reserved

Page 27: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

Chromatin immunoprecipitation - ChIP

• ChIP is the only method for large-scale identification of direct transcriptional targets

• General strategy– Crosslink proteins to nearby DNA with formaldehyde

• Works for about 2 angstrom distances• What does this say about the specificity of the interaction?

— Only specific interactions will be reflected in crosslinks

BioSci D145 lecture 6 page 27 ©copyright Bruce Blumberg 2011. All rights reserved

Page 28: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

• ChIP - general strategy (contd)– Break chromatin into small chunks by sonicating– Typically want ~500 bp fragments– Evaluate sonication quality and extent by gel electrophoresis to

ensure that size range is obtained• Needs MUCH optimization

Chromatin immunoprecipitation - ChIP

BioSci D145 lecture 6 page 28 ©copyright Bruce Blumberg 2011. All rights reserved

Page 29: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

• ChIP - general strategy (contd)– Precipitate chromatin with antibody against protein of interest

Chromatin immunoprecipitation - ChIP

• Bind antibody, then capture complex with protein G/protein A beads

• Reverse crosslinks – and remove proteins with proteinase K digestion

• Purify DNA away from proteins• Evaluate enrichment of individual

candidate binding sites by PCR

BioSci D145 lecture 6 page 29 ©copyright Bruce Blumberg 2011. All rights reserved

Page 30: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

• Flavors of ChIP commonly in use– Standard ChIP – one antibody, few targets analyzed

• Most commonly used method– ChIP-chip – chromatin immunoprecipitation on chip

• Recovered fragments are used to probe microarray of genomic DNA

• Allows identification of novel binding sites• Requires good genomic microarrays

Chromatin immunoprecipitation - ChIP

— Whole genome requires MANY chips (at least 7 for human and mouse) = EXPENSIVE

— may not be available for your target organism

— Affymetrix, Agilent, Nimblegen are sources

BioSci D145 lecture 6 page 30 ©copyright Bruce Blumberg 2011. All rights reserved

Page 31: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

• Flavors of ChIP commonly in use– ChIP-sequencing (ChIP-seq) – chromatin immunoprecipitation

sequencing• Massively parallel sequencing of recovered fragments• Unbiased method to identify transcription factor binding sites• Price is fast converging on ChIP-chip• Requires excellent, well-characterized antibody!

Chromatin immunoprecipitation - ChIP

BioSci D145 lecture 6 page 31 ©copyright Bruce Blumberg 2011. All rights reserved

Page 32: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

Computer-based methods that may help to identify binding sites• Phylogenetic footprinting – What is it?

– Powerful method to identify regulatory elements in DNA sequences

– Central assumption is that protein coding sequences evolve much more slowly than DNA sequences (or DNA sequences evolve faster)

• Due to selective pressure on protein function– Sequences conserved in related organisms likely to be functional– Species selection-

• Must be sufficiently diverged that functional domains stand out

• Sufficiently conserved to that they can be identified– A variety of algorithms exist – typical approach is to use multiple

programs and look for what is found in common.

BioSci D145 lecture 6 page 32 ©copyright Bruce Blumberg 2011. All rights reserved

Page 33: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

Comparative genomics (contd)

• Phylogenetic footprinting comparison of zebrafish, mouse and xenopus caudal orthologs (cdx4, cdx4, Xcad3)– A number of putative conserved elements identified including– TTCATTTGAATGCAAATGTA– Absolutely conserved in all 3 promoters– Compare with database

• http://www.ncbi.nlm.nih.gov/blast/

– Also found in human cdx4 – a good validation of the result

• As more genomes are sequenced and compared, phylogenetic footprinting becomes a very powerful filter to identify potentially conserved regulatory sequences– ECR browser offers precomputed comparisons of conserved

elements• http://ecrbrowser.dcode.org/

BioSci D145 lecture 6 page 33 ©copyright Bruce Blumberg 2011. All rights reserved

Page 34: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

Routes to gene identification

• Genome sequences are minimally useful without annotation– Annotation = description, biological information

• Functional annotation – information on the function• Structural annotation – identification of genes, sequence

elements– Much annotation is done automatically today – “gene models”

• Via sequence comparisons with various databases– Gene sequences– ESTs

• Algorithms predict promoters, splicing, polyadenylation sites and, most importantly ORFs

• ORFs – open reading frames are putative proteins– Algorithms miss in both directions– Source of much disagreement

• Field of bioinformatics has grown to encompass many types of analysis related to gene function– www.igb.uci.edu

BioSci D145 lecture 6 page 34 ©copyright Bruce Blumberg 2011. All rights reserved

Page 35: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

BioSci D145 lecture 6 page 35 ©copyright Bruce Blumberg 2011. All rights reserved

1. (8 points) You are a 4th year senior at UCI and had hoped to stay for a 5th year so that you could double major in Molecular Biology and Biochemistry together with Pharmaceutical Sciences. Sadly, the powers that be in the School of Biological Sciences have decreed that no one can stay for more than 4 years irrespective of the reason. In order to graduate on time, you need to take an advanced molecular biology laboratory. The class is a new one at UCI and involves the entire class collaborating on a project to sequence and characterize the genome of an as yet uncharacterized organism. UCI is part of the Genome 10K project that aims to sequence 10,000 vertebrate genomes. An enterprising young assistant professor has secured support from NSF to fund this course. This year’s project is to sequence the genome of Myrmecophaga tridactyla - the giant anteater (why are you surprised?). This is a one quarter long project so there is no time to waste.

a) (2 points) The department chair read Craig Venter's autobiography and suggests that the EST (expressed sequence tag, cDNA) project use a 384-capillary sequencer (donated by Craig Venter), together with cycle sequencing to generate 200,000 ESTs. Is this a good idea? Explain why, or why not.

Not a good idea. Standard sequencing can certainly generate 200K ESTs, but in more like a year using a single machine and Sanger sequencing (200,000/384 = 520 runs). Therefore, it will be too slow and you will not graduate. It will probably cost much more than the existing budget, too.

Page 36: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

BioSci D145 lecture 6 page 36 ©copyright Bruce Blumberg 2011. All rights reserved

Amanda is right – you need a new department chair. The only hope of generating a map in the time that is available will be to use HAPPY mapping. I would map the ESTs being identified in part a) to genomic DNA using this PCR based method. It will be tedious, but doable. HAPPY mapping uses isolated genomic DNA and PCR to determine how close 2 markers are to each other in the genome.

1b) (3 points) Your project is to generate a map of the anteater genome. The department chair (who is full of ideas) suggests that you conduct large-scale BAC end sequencing and order these by sequentially hybridizing individual BACs to the library. Amanda, the course director, thinks this is a bad idea. Who is right? Amanda says that you must select the method to use and generate a good map by the end of the quarter, otherwise you are not going to pass the course or graduate. Explain briefly why you chose the method you did and in succinct terms highlight the main features and strengths of the method

Bill is correct – only nextgen sequencing will give yield enough sequence to assemble the genome in the time available. Sequencing by hybridization is only really applicable for resequencing regions that you already know the sequence of. The essential feature of nextgen sequencing is that, depending on which flavor you choose, it can generate ~108-109 bases per run. Thus, you should be able to sequence the genome to a high degree of completion without many runs.

1. c) (3 points) Your significant other is in the group that will generate the actual genome sequence and asks your advice about how to proceed. Bill, the group leader, proposes to use nextgen sequencing while George, a very vocal member of the group read Jurassic Park and insists that sequencing by hybridization is the best and fastest approach (after all, they cloned dinosaurs!). Who is correct? Why? What key features of the chosen method might enable the sequence to be generated and assembled in 10 weeks?

Page 37: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

BioSci D145 lecture 6 page 37 ©copyright Bruce Blumberg 2011. All rights reserved

2. (12 points) Since you were able to compete your tasks in the molecular biology lab course and graduated on time, you have been recruited to NASA where you will be part of a team that characterizes the biological material returned from a probe that has been secretly sent to collect samples from Europa – a moon of Jupiter that was long suspected to have liquid water under a crust of ice. Indeed, Europa has an ocean under the ice and NASA has quietly returned a sample to Earth for characterization.

2a) (3 points) The first task is to determine whether there are any living organisms in the sample. How might you go about identifying how many different organisms are in the sample with a high probability of not missing any? What is the main assumption you must make in order for this to be successful?This is based on the Venter paper. You will want to extract DNA from the sample and

sequence it directly. The main assumption you have to make is that these organisms will contain DNA that is sufficiently similar to that of terrestrial organisms so that you can sequence it using standard methods and assemble into genomes by computer. If you are lucky it will be and a strategy similar to whole genome shotgun that Venter used will be best. However, it is 2015 so you would want to use one of the nextgen methods.

Page 38: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

BioSci D145 lecture 6 page 38 ©copyright Bruce Blumberg 2011. All rights reserved

2b) (3 points) You are lucky - your sample yields 11,279 sequence assemblies, 8017 of which are complete, or nearly complete. Are these genomes likely to be prokaryotic or eukaryotic? Why? How will you determine whether these sequences are bona fide residents of Europa vs. contaminants introduced on Earth by a sloppy scientist (not you, of course) ?

If you got 8017 complete genomes, then these are probably small, making it likely that they will be prokaryotic. I will compare these sequences with those available in the GENBANK database and determine whether they are identical to any organisms found on Earth. If they are not, then they have likely originated on Europa. If some are identical, or highly similar to terrestrial organisms, then you would suspect contamination.

After sequencing the bacteria, I would develop a RT-PCR assay using Taqman or similar technology to quickly identify these sequences in any sort of sample you receive. Such technology uses custom primers to specifically and quantitatively identify matching sequences in any number of samples. PCR makes it very sensitive. Or, if you had microarrays containing Europan sequences, you could use those.

2c) (3 points) Unfortunately, your colleagues on the project develop a strange illness. Orange fur grows all over their bodies and after two weeks, they develop high fevers, become delirious and demand to be BioSci majors at UCI. Sadly, they die shortly afterward. You are the kind of person who makes lemonade out of lemons, instead of complaining that they are not sweet enough. It is tragic that your colleagues died, but the identification of Europan DNA sequences on their skin suggests that the microorganisms can be cultured. Sure enough, you are able to culture two types of bacteria from the skin of the first deceased patient. Design a diagnostic test to determine whether someone carries these Europan bacteria on their skin, or whether it is present in any sort of random sample to be tested? Note the key features of your assay

Page 39: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

BioSci D145 lecture 6 page 39 ©copyright Bruce Blumberg 2011. All rights reserved

3d) (3 point) Clearly, causing a person to develop orange skin and later die is a serious problem. Describe a relatively thorough approach to identify and track the types of changes over time induced in patients by exposure to the Europan organisms ?

The approach I was looking for was that of the Chen paper. Since you have learned about genomic and transcriptomic analyses, it would be appropriate to perform whole genome sequencing to rule out any mutations induced by the infection and transcriptome analysis to identify changes in gene expression over time. It would be best if you had pre-infection samples to compare with those after the person was presumed to be infected.

Page 40: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

BioSci D145 lecture 6 page 40 ©copyright Bruce Blumberg 2011. All rights reserved

3. (8 points) The plot thickens. Patients, doctors and nurses in the local hospital begin to show signs of the disease – orange fur. Word gets out to the media and suddenly Anderson Cooper is there broadcasting from in front of the hospital, bravely not wearing hazmat gear. He is muy macho but not very bright (and does not look good in a hazmat suit anyway). Intriguingly, some of the patients do not become delirious and die, they simply have orange fur.

3a) (2 points) The first thing you need to figure out is how the contamination spread from the original patient to others. You recall that your sarcastic D145 professor said that doctors are the major source of infections in hospitals, but you aspire to be a doctor and don't believe him. The patients were immediately put into isolation after arriving in the hospital so it is very unlikely that the contamination spread without personal contact. How could you test whether the Europan microorganisms in Dr. Quackenbush (the attending physician in the ER) are the same as in the initial patients from NASA and in the others who subsequently become affected? Dr. Quackenbush appears to be unaffected.Perhaps the best way to do this is to sequence microorganisms collected from the various patients and swabs from Dr. Quackenbush’s body. If the sequences are identical, AND present on Dr. Quackenbush, then one appropriate conclusion would be that he is an intermediate carrier of the infection. You could also use high resolution fingerprinting to compare the strains with each other. Expression microarrays are not sufficient to discriminate closely related species as these are likely to be.

Page 41: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

BioSci D145 lecture 6 page 41 ©copyright Bruce Blumberg 2011. All rights reserved

3b) (3 points) Intriguingly, Dr. Quackenbush appears to have the Europan microorganisms on his hands and on his lab coat and gloves, but remains unaffected. Nevertheless, he is placed into isolation and observed. As time passes and more data are collected, you notice that there are 3 phenotypes among people who show evidence of Europan microorganisms on their skin (detected with the test you developed in 2c). These are a) orange fur, delirious, then dead, b) orange fur but otherwise healthy, and c) unaffected. This sounds to you like a classic case of a single gene that differs when it is homozygous for one allele (sensitive, i.e. dead), or the other allele (resistant), or heterozygous (orange but alive). You perform whole genome association studies and quickly discover that one region on chromosome 20 (about 1 megabase long) is highly associated with resistance to the Europan microorganisms. How might you determine whether this region is transcribed into mRNA and, if so, how large the transcripts are and how much of the transcript is present?Make RNA from skin of affected people and perform Northern analysis using the region from chromosome 20 as a probe to detect whether transcripts are present and what size they are. Add a standard to determine how much is present. Northern is the only way to detect size but you could use QRT-PCR to quantitate with appropriate standards. Many people wanted to use RNA-seq and then determine which sequences mapped to this region. I gave credit for this, but this would be equivalent to killing a fly with a missle.

Page 42: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

BioSci D145 lecture 6 page 42 ©copyright Bruce Blumberg 2011. All rights reserved

3c) (3 points)You have identified several families where one parent is totally resistant to the infection, the other parent gets orange fur and the children are either resistant or get orange fur but do not die. Your analysis in b) above has shown that there are no differences in the coding capacity of this region between resistant, partially resistant or susceptible individuals and that the sequence of the affected region does not change significantly among patients. The data suggest that a variation in copy number of some part of the candidate region might be responsible for resistance. Outline how you would go about identifying copy number variations in this region, determining whether patients have the CNV or not and if CNVs are linked with the disease?This would use an approach similar to that in the Redon paper. 1) Use SNP or chromosome tiling microarrays to identify CNV regions and 2) test this in all sorts of patients. 3) If the CNV is linked with the disease, then the patients should segregate into groups where the number of copies is correlated with the severity of disease.

Page 43: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

BioSci D145 lecture 6 page 43 ©copyright Bruce Blumberg 2011. All rights reserved

4. (7 points) As you might expect, Anderson Cooper has now developed orange fur. CNN is desperate to know whether or not he will survive and the ongoing drama is generating big ratings. They need to know how long the ratings bonanza will continue so that they can project how much they can charge for commercials. Meanwhile, Larry King has come out of retirement to anchor the news story since he is so old that no one is worried whether he will contract the disease (plus it makes him look brave).

4a) (1 point) How could you determine whether Anderson will die, or simply remain orange (other than waiting and observing what happens) ?Assuming you showed that CNVs are associated with resistance in 2c, use the same method to determine which group Anderson falls into.

Page 44: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

BioSci D145 lecture 6 page 44 ©copyright Bruce Blumberg 2011. All rights reserved

Two methods come to mind. The first would be gene expression microarray analysis. You would prepare mRNA from both cell populations, label it and probe appropriate microarrays. You would then compare the genes that are expressed in normal skin stem cell and the infected, hair follicle stem cell-like cells. Genes that are expressed differently will be the ones that you focus your attention on. You could also perform exhaustive nextgen sequencing of cDNA derived from the mRNA to identify which sequences are different between the 2 populations. This method has the benefit that it would detect microorganism encoded transcripts that would not appear in your analysis otherwise. Of course, you need a genome assembly to map the RNA-seq reads against.

4b) (3 points) Continuing to make lemonade out of lemons, you hypothesize that understanding how the Europan microorganisms infect skin cells and lead to the generation of orange fur might lead you to a cure for baldness and give you financial security forever. You found that the microorganisms behave somewhat like Chlamydia and grow as intracellular parasites in skin stem cells. Your analysis suggests that the infected skin stem cells are instead transformed into a cell type like hair follicle stem cells. Assuming that you can isolate both cell types in pure form, outline how would you determine what transcripts are expressed in normal skin stem cells compared with infected stem cells?

Page 45: BioSci D145 lecture 6 page 1 © copyright Bruce Blumberg 2011. All rights reserved BioSci D145 Lecture #6 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci.

BioSci D145 lecture 6 page 45 ©copyright Bruce Blumberg 2011. All rights reserved

First you need to culture skin cells from regions with orange fur (infected) and regions of normal skin (presumably uninfected). The most brute force method would be to conduct genome sequencing from both cells and focus on where LINE elements are located in each. If they are in different places, then you will have proven that they are moving around. Alternatively, you could use FISH to show that the chromosomal patterns of LINE localization are different between the two cell types. Credit was given for a variety of creative answers.

4b) c) (3 points) Another possible hypothesis is that something about the infection of skin cells with the Europan microorganisms has mobilized one or more LINE elements (retrotransposon-like repeated sequences) allowing them to hop around the genome and cause the strange effects observed. How could you determine whether LINE elements in cultured skin cells infected with Europan organisms compared with uninfected cells, move from their original location in the genome to a new location?