Transcript: Advances in Cancer Genomics - Science | AAAS

Advances in Cancer Genomics

[0:00:00]

Sean Sanders: Hello and welcome to today’s Science/AAAS live webinar. My name is Sean Sanders and I’m the commercial editor at Science Magazine.

Slide 1

The topic for today’s webinar is “Advances in Cancer Genomics”. Cancer is often characterized by the deregulation or dysregulation of the normal control pathways for cellular growth and/or apoptosis. Despite many advances, traditional genomic research has sometimes been hampered by technological limitations or excessive cost. With next generation genomic platforms, scientists are now able to cost‐effectively assay individual cancer genomes and characterize them in terms of the global genetic, epigenetic, and transcriptional changes. In‐depth characterization of these events and the relationships between them will lead to better understanding of the mechanisms of tumorigenesis, metastasis, and therapeutic response. In this timely webinar, our panel of distinguished scientists will share their latest advances in cancer genomics and offer their views on the road ahead for this important area of research.

In the studio today, I’m joined by Dr. Sean Grimmond from the Institute
for Molecular Bioscience at The University of Queensland in Australia. Next, we have Dr. John McPherson from the Ontario Institute for Cancer Research in Toronto, Canada. And our third guest today is Dr. David Wheeler from Baylor College of Medicine in Houston, Texas.

A reminder to everyone watching that to see an enlarged version of any
of the slides, you just need to click the enlarge slides button located underneath the slide window of your web console. You can also download a PDF copy of all of the slides by using the download slides button. If you’re joining us live, you can submit a question to the panel at any time by simply typing it into the ask‐a‐question box on the bottom left of your viewing console below the video screen and clicking the submit button. I’ll do my best, as usual, to get to as many of the questions as possible. Please do keep them short and to the point.

Finally, I’d like to thank Applied Biosystems, a Division of Life Technologies Corporation, for their sponsorship of today’s webinar.

Slide 2

Now, it is my great pleasure to introduce our first speaker today.
Associate Professor Sean Grimmond completed his Ph.D. at the University of Queensland in Australia before doing his postdoctoral training at the MRC Mammalian Genetics Unit, in Oxford in the United
Kingdom. After spending time at the Queensland Institute of Medical Research in Brisbane, Australia, he moved to the University of Queensland in 2001, where he is currently a Principal Research Fellow at The Institute for Molecular Bioscience and leads Australia’s International Cancer Genomics Consortium research program into pancreatic and ovarian cancer. The central focus of Dr. Grimmond’s research is capturing genomic, transcriptomic, and epigenomic data from model systems and pathological states, and defining the underlying molecular events controlling cell differentiation, organogenesis, and cancer. In recent years, Dr. Grimmond’s lab has pioneered approaches for studying the mammalian mRNA and microRNA transcriptomes at single nucleotide resolution through multigigabase scale sequencing.

Welcome, Dr. Grimmond.

Dr. Sean Grimmond: Thank you, Sean.

Slide 3

Yes. So, what I’m going to present here today is really a review of how next generation sequencing has revolutionized one aspect of genomics. And I’ll be talking about how this is being applied to studying the transcriptome.

And if we think about the last decade, we’ve really seen transcriptomics being the premier tool for making sense of the underlying genetics of biological and pathological states. And it’s been really powerful in giving us biomarkers that we can correlate, especially with the likes of the phenotypes in cancer.

Slide 4

In the case of cancer transcriptomics, the sequence‐based approaches are really carrying out very similar studies to what we’d performed with arrays. And we can really break them into three areas. The first is the process of studying locus activity, or actually measuring just gene activity across the entire genome. The second aspect is that sequence approaches allow us to measure transcript‐specific expression. And we can now do this in a much more sensitive and a far more specific manner than we can with any array‐based approach. And then, the third important area is sequence content. So, not only are we looking at the activity of these loci, we can now look for pathological events that might be happening in this sequence. So, we’re talking about splicing events. We’re talking about mutations that may be correlated with a disease.

Slide 5

So, just first to highlight why we’re making the transition from microarrays to sequence‐based approaches, I’m showing here on this slide two scatter plots. The top is showing the comparison of two cell
lines using microarray profiling. The bottom panel is showing the same samples being compared by RNA‐seq. And the red and green spots that we can see there correspond to differentially expressed genes in each sample. And one thing we can certainly see is that the genes that are labeled gray in the bottom left‐hand panels of each one of those scatter plots are the genes that are really at our limits of detection.

[0:04:56]

Slide 6

Now, if we actually start to compare what happens when we take the genes that were detected in the microarray experiment and look at them in the RNA‐seq experiment, that is, we take the genes that were identified in the top left panel and look at how they would perform in the opposing panel, we find that the array genes, even those at the limit of detection on the array, are now in the mid‐range to high‐level detection range of the RNA‐seq. And conversely, genes that were being reliably detected in the RNA‐seq in the bottom left‐hand panel are often in the noise level that we would see in the microarray experiment. So, the bottom line here is we’re detecting up to 40% more genes when we use the RNA‐seq approach, and we’re also doing this in a quantitative fashion.

So, if we use this sequence data, we can decide how sophisticated we
want to be. We can just measure gene activity, but we can also go down and look at the individual transcript level. And I’ll describe to you how that’s being carried out at the moment.

Slide 7

On the slide here, what we can see is a diagram of about four transcripts
coming from a locus. And traditionally if we were to measure this with a microarray, we might have a probe just to the 3’ end that would only measure three or four of these transcripts.

Slide 8

If we were a little more sophisticated, we could start to make probes that
might identify unique exonic sequences. But, indeed, what we do with a sequence‐based approach is we look for diagnostic or unique sequences for every transcript here, whether it be exonic sequences or the junction sequences of these exons.

Slide 9

And if we actually measure the frequency at which those tags map to these regions, where these two exons would come together, we can measure that transcript‐specific expression. So, if we think about taking that one step further, we can now look at gene expression as we would with an array, but also in a genomic context.
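
As a rough illustration of the counting just described, the sketch below tallies tags that hit diagnostic exon junctions and reports a count per transcript. The junction map, the transcript labels, and the tag data are all hypothetical, and a real pipeline would also normalize for library size and junction mappability.

```python
from collections import Counter

# Hypothetical map: each diagnostic junction identifies exactly one transcript.
DIAGNOSTIC_JUNCTIONS = {
    ("exon1", "exon2"): "transcript_A",
    ("exon1", "exon3"): "transcript_B",
    ("exon2", "exon4"): "transcript_C",
}

def transcript_counts(junction_hits):
    """Count tags per transcript from a list of (exon, exon) junction hits.

    Hits on junctions that are not diagnostic for a single transcript are
    ignored, because they cannot be assigned unambiguously.
    """
    counts = Counter()
    for junction in junction_hits:
        transcript = DIAGNOSTIC_JUNCTIONS.get(junction)
        if transcript is not None:
            counts[transcript] += 1
    return counts

# Made-up tags: three support transcript_A, one supports transcript_B,
# and one hits a junction that is not in the diagnostic set.
hits = [("exon1", "exon2")] * 3 + [("exon1", "exon3")] + [("exon3", "exon4")]
counts = transcript_counts(hits)
```

Restricting the tally to junctions that are unique to one transcript is what makes the count transcript specific rather than locus wide.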

Slide 10

In the bottom panel here, I’m showing you just the red peaks along the graph of the various types of transcripts that can come from this locus. We can look at sequence tags that are matching those exons. And we can also look at this combination of exon usage to define the actual splice variants that may be there.

Slide 11

Just to give you an example here, looking in one cell line at the locus for VEGF receptor 1, the line at the bottom is the full‐length transcript. There is strong transcriptional evidence both from exon coverage and exon junction usage that the full‐length transcript is expressed.

Slide 12

Slide 13

But if we actually go through and tease out this information, we can also find strong evidence of variant isoforms being expressed. In this case, a smaller transcript, which encodes a secreted decoy receptor. And, indeed, a splice variant, which also generates a second decoy receptor. So, this locus is actually making three different proteins when it is active in this context. So, something we wouldn’t capture with arrays normally.

Slide 14

The power of being able to detect this variant expression is really driven by our ability to identify exon usage. And this panel here is just showing you how we go about doing that these days. On the left‐hand side, we can see that if we take all previously defined alternative splicing events from the likes of the ESTs and cDNAs now in our public databases, we can make up a good panel of diagnostic events that correspond to full‐length transcripts.

We’re also now able to take all the exons from each locus, though, and start to build up combinations in silico. And in this way, we can look for novel events. So, we’re thinking about what might be theoretically possible with the known exons, and let’s just see if we can find evidence of those guys coming together.
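
The in silico combination step can be sketched as follows, under the assumption that exons are given in genomic order and that a junction is simply an ordered pair of exons. The exon names and the list of known junctions are hypothetical.

```python
from itertools import combinations

def candidate_junctions(exons, known_junctions):
    """Return exon pairs (in genomic order) not already seen in known transcripts.

    `exons` must be listed in genomic order, so every pair produced by
    combinations() joins an upstream exon to a downstream one.
    """
    all_pairs = set(combinations(exons, 2))
    return sorted(all_pairs - set(known_junctions))

exons = ["e1", "e2", "e3", "e4"]
known = [("e1", "e2"), ("e2", "e3"), ("e3", "e4")]
novel = candidate_junctions(exons, known)
print(novel)  # [('e1', 'e3'), ('e1', 'e4'), ('e2', 'e4')]
```

Each candidate pair would then be turned into a junction sequence and searched against the tags; that search is not shown here.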

The third way is to actually look for completely novel exons. So, we look
for clusters of tags, and then we see how they might connect to other exons from neighboring genes.

And just to highlight the benefits of these approaches, the known complexity gives us a measurement of known transcripts. The theoretical approaches and the transcriptome discovery approach are actually very useful for looking for aberrant splicing and indeed the possibility of detecting gene fusions. And this is where there has been a chromosomal
rearrangement, which leads to the generation of a transcript that we would never see in a normal cell.

Slide 15

I should also say that when we’ve generated this data we also have sequence content. So, if we go through and characterize that content, it’s also possible for us to identify mutations. And just to give you an example here of how we would go about doing that: when you think about sequencing the transcriptome, we have very deep coverage of many of the genes when they’re expressed at large numbers of copies per cell. And if we generate about a hundred million reads from an experiment, and we look for unique tags that correspond to the exonic sequences, and we find sequence variants, we can stack those up together and make a call as to whether or not there is a variant within that sample.
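
A minimal sketch of "stacking up" the tags at one position and making a call is shown below. The depth and allele-fraction thresholds are illustrative assumptions, not values given in the talk, and real callers also weigh base quality and mapping quality.

```python
from collections import Counter

def call_variant(reference_base, observed_bases, min_depth=8, min_fraction=0.2):
    """Return the best-supported non-reference base at one position, or None.

    observed_bases: the base each overlapping tag shows at this position.
    min_depth and min_fraction are hypothetical cut-offs for illustration.
    """
    depth = len(observed_bases)
    if depth < min_depth:
        return None  # too few tags to make any call
    alt_counts = Counter(b for b in observed_bases if b != reference_base)
    if not alt_counts:
        return None  # every tag agrees with the reference
    alt_base, alt_count = alt_counts.most_common(1)[0]
    if alt_count / depth >= min_fraction:
        return alt_base
    return None  # alternate base too rare; more likely a sequencing error

heterozygous = call_variant("A", list("AAAAAGGGGG"))
likely_error = call_variant("A", list("AAAAAAAAAG"))
low_coverage = call_variant("A", list("AG"))
```

The deep coverage of highly expressed genes mentioned above is what makes the depth requirement easy to meet for those genes and hard to meet for weakly expressed ones.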

Once those variants have been identified, we can then go through and look at how they affect the open reading frame. And then it comes down to the task of trying to determine whether that non‐synonymous change or indeed that insertion or deletion is likely to be pathogenic or creating a true mutation. And a lot of that work is really ongoing now, where we can define these mutations really quite quickly and then have to try to rank them as to which ones may be expressed variants and which ones may be disease related.

[0:10:15]

Just finally, I’d like to say that while the messenger RNA transcriptome has proved to be far more complicated than one transcript per gene, we’re seeing that the transcriptome is opening up to whole new areas. There are non‐coding RNAs. There is also a very small RNA population that we know could control target genes. And in the case of the microRNAs, we’re actively pursuing this in cancer as well for exactly the same reasons: biomarkers as well as genes that may be driving biology.

Slide 16

The slide I’ve just presented here is to show you the sort of average results we would get from a sequencing experiment whereby, sequencing the small RNA population, we’ll find about 45% of the tags from the sample will correspond to microRNAs.

If we look in the top right‐hand panel, this is just to highlight the size distributions we’ll see of microRNAs from these experiments, the majority of tags being 21 to 22 base pairs in length. And once we’ve identified those tags, we can then go on to use them to identify markers that correspond to a specific tissue, a specific state. And, indeed, we can start to actually model individual microRNAs and try to work out what targets they might be hitting.
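
The size distribution in that panel amounts to a histogram of tag lengths. A small sketch, using made-up tags rather than data from the talk:

```python
from collections import Counter

def length_distribution(tags):
    """Return {tag length: fraction of tags} for a list of sequence tags."""
    counts = Counter(len(tag) for tag in tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: counts[length] / total for length in sorted(counts)}

# Hypothetical small RNA tags of length 21, 22, 22, and 23.
tags = ["A" * 21, "C" * 22, "G" * 22, "T" * 23]
distribution = length_distribution(tags)
print(distribution)  # {21: 0.25, 22: 0.5, 23: 0.25}
```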

And the bottom right‐hand panel corresponds to the classical oncomiR, miR‐17‐5p, where we’re looking at its targets and whether or not they’re expressed within a particular sample. So, we’re actively working to survey these and make atlases of these in our tumors as well.

Slide 17

Just in closing, I’d say that this whole transcriptome sequencing is proving
to be extremely powerful, both in studying gene activity but also transcriptome discovery.

Slide 18

Slide 19

We can use this to study sequence content so we can rapidly get a handle
on expressed mutations. And really, this sort of transcriptomics can also be applied to the microRNA fraction of cells.

Sean Sanders: Great. Thank you very much, Dr. Grimmond.

Slide 21

So, we’re going to move right along to our next speaker and that today is Dr. John McPherson.

Slide 22

Shortly after completing his postdoctoral studies at the University of
California, Irvine, Dr. McPherson attained a faculty position and established, as co‐director, the National Human Genome Research Center Chromosome 5 Genome Center in 1993. Three years later, Dr. McPherson relocated to the Washington University Genome Sequencing Center where as Co‐Director, he and his colleagues played a significant role in the Human Genome Sequencing Project, including pioneering many large‐scale mapping and sequencing technologies. In 2003, Dr. McPherson joined the Human Genome Sequencing Center at the Baylor College of Medicine in Houston, Texas, where he established a high throughput sequencing pipeline, aimed particularly at investigating cancer genomes. Now at the Ontario Institute for Cancer Research where he is director of cancer genomics, Dr. McPherson is leading his team to make OICR one of the top 10 sequencing centers in the world.

Welcome, Dr. McPherson.

Dr. John McPherson: Thank you, Sean.

Slide 23

On the first slide, I’ve just outlined the programs of the OICR, just to make the point that although today we’re primarily going to be talking about cancer genomics and bioinformatics, it’s part of a whole package in order to bring this to fruition in the clinic.

Slide 24

Next‐gen platforms have certain advantages. Not only, as shown at the very bottom there, do they produce huge amounts of data, just unprecedented amounts of data as you’ll see, but they also have some advantages in the upfront pipeline. One library is all you need to sequence thousands and thousands of transcripts.

In the interest of time, I’m not going to go through all these points ‘cause
we have a very limited window here and we want to get to your questions. But you’ll see as we go along that this is really changing how we’re doing research.

Slide 25

This profiles some of the commercial platforms that are available right now and sort of where they’re at. Mostly, I’ll be talking about the Applied Biosystems/SOLiD and the Illumina/GAII. The 454 instrument is also listed there with some of its current capabilities.

On average, these are short‐read instruments, short reads being 50 to 100 base pairs, although the 454 can do longer than that with fewer read numbers. But there’s just a huge number of reads. Notice that the scales on this are log scales. And this is changing dramatically. By the end of the year, I think, all these numbers are going to double or triple.

Slide 26

The question that I get asked a lot is, you know, should you go for longer reads or more reads. And, I think, one of the points I want to make today is that it’s really important to think about the experiment that you’re doing. Longer reads cost more, and increasing throughput has some cost to it as well but probably less. And, I think, you have to really think of the context of the experiment.

For example, Sean talked about microRNA sequencing. These are short
sequences and obviously then there’s no point in doing long reads. It’s wasted effort. So, which platform and how many reads and what length you want is really a matter of experimental design.

[0:15:00]

Slide 27

This shows you what we have at the OICR right now. We have seven of
the GAIIs and we have seven of the SOLiD 3s. That’s 21 flow cells in total. At maximum capacity, this is about 800 billion bases per month. It’s a lot of data. And we have 1.2 petabytes of storage for that and 1600 cores for analyzing that.

Slide 28

The applications that we’re involved in are listed here. Pretty much everything that was done previously in genomics is now very much ported over to these next‐gen applications. You’re going to hear about
whole genome sequencing from David. Structural variation, I’m going to comment on it a bit. You’ve heard about transcriptomes and I’ll also touch on epigenomics.

Slide 29

So, before I talk about structural variants, a note for today and for the questions: when we talk about mate‐pairs and paired‐ends, a lot of people use them interchangeably. Just to make a distinction here, paired‐ends are where you’re sequencing the opposite ends of a single fragment of DNA. Mate‐pairs are made by taking larger fragments, circularizing these, capturing the ends, and then sequencing those. So, you get a larger reach out.

Slide 30

What do you use those for? If you sequence a genome using mate‐pairs, for example, and then map them back to a reference, you’ll see in the upper left‐hand corner that you get a profile of fragment sizes just from building the library. And what you’re interested in are those areas in orange, the outliers: the ones that appear to be too small for your average fragment size or too large. And those should be randomly distributed in the genome.

But if you see a cluster of them, that points to a structural variant, and down at the bottom are listed the types of things you can find. If they’re too close together, it would represent an insertion. If they’re too far apart, a deletion. You can also find inversions and translocations.
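
The outlier logic can be sketched as below. The insertion and deletion rules follow the description above; treating a mate‐pair whose ends land on different chromosomes as a translocation signal, and one with unexpected orientation as an inversion signal, is a common convention that is assumed here rather than stated in the talk. The size limits and the minimum cluster support are hypothetical parameters.

```python
from collections import Counter

def classify_mate_pair(chrom_a, chrom_b, distance, orientation_ok, min_size, max_size):
    """Label one mapped mate-pair relative to the expected library size range."""
    if chrom_a != chrom_b:
        return "translocation"
    if not orientation_ok:
        return "inversion"
    if distance < min_size:
        return "insertion"   # ends map closer together than the library allows
    if distance > max_size:
        return "deletion"    # ends map farther apart than the library allows
    return "normal"

def cluster_calls(pairs, min_size, max_size, min_support=3):
    """Return anomaly labels supported by at least min_support pairs in a region.

    pairs: iterable of (chrom_a, chrom_b, distance, orientation_ok) tuples
    that all map to the same region. Isolated outliers are ignored, since
    they are expected to occur at random across the genome.
    """
    labels = Counter(
        classify_mate_pair(*pair, min_size, max_size) for pair in pairs
    )
    return {label for label, n in labels.items()
            if label != "normal" and n >= min_support}

region = (
    [("chr8", "chr8", 9000, True)] * 4     # clustered, too far apart
    + [("chr8", "chr8", 3000, True)] * 10  # within the expected range
    + [("chr8", "chr8", 500, True)]        # a single stray outlier
)
calls = cluster_calls(region, min_size=2000, max_size=4000)
print(calls)  # {'deletion'}
```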

Slide 31

Another area I want to talk about is capturing specific regions of the genome. I refer to it as direct selection. In the early ‘90s, this was done with biotinylated fragments from BACs and pulling out cDNAs, and this is really the same principle. But you hear it referred to as hybrid selection or genome partitioning, depending on the company.

There are sort of two flavors, and one is capturing on a solid support. This is using microarrays where the oligos that are attached to the microarray are specific to the region you’re after. You hybridize the DNA and then you elute that off. So, it’s like an array CGH experiment except you elute off and sequence the DNA.

The other is you can do this in‐solution with probes and this is the Agilent SureSelect approach. And I’ll give some examples of both.

Slide 32

This is showing a 600 kb region. So, this is targeting a region on 8q24. The blue is the read depth that was obtained across the region. The long red bar is the region that we’re after and the shorter red bar shows where probes are able to be developed. On the bottom there, you’ll see
the repeat content. And occasionally, where there’s a repeat, you can’t design a probe.

Slide 33

We just zoomed in on a region of that. You can see that what I’ve plotted now is just the Tm of the oligos to show where they are. And there’s one right in the middle. It’s a single oligo by itself. And you can actually see there was captured material. So, it’s a very efficient method to get coverage of this specific region. It doesn’t capture the entire region in many genomes (this is an example from a human genome), again due to the repeats.

Slide 34

Using the Agilent SureSelect system, which is the in‐solution capture, you can do similar things. This is actually directed at exons in this case. This is just one gene. And the black bars represent the covered gene. You can see that every exon is represented and covered well.

Slide 35

Just to touch on epigenomics. Many people are familiar with ChIP‐chip, and this is using chromatin IP to pull down regions of the genome that are associated with various proteins. In this case, I’m showing a histone modification. You can cross‐link the DNA to the histones, pull the DNA down with the antibodies, and then release the DNA. And typically, that was put on a microarray, but now, you can just sequence directly.

And the advantage here is the output of these sequencers. You get such coverage that it’s really just a counting exercise. And then you map them to the genome.
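
The "counting exercise" can be illustrated by binning mapped read positions along a chromosome. This is a simplified sketch: the bin width and the read positions are made up, and a real analysis would compare against a control and call peaks statistically.

```python
from collections import Counter

def bin_counts(read_starts, bin_size=200):
    """Count mapped read start positions per fixed-width genomic bin.

    Returns a Counter keyed by bin index, where bin i covers positions
    [i * bin_size, (i + 1) * bin_size).
    """
    return Counter(pos // bin_size for pos in read_starts)

# Hypothetical start positions of reads pulled down with one antibody.
coverage = bin_counts([10, 150, 250, 399, 400], bin_size=200)
print(dict(coverage))  # {0: 2, 1: 2, 2: 1}
```

Bins with many more reads than their neighbors are the candidate regions associated with the protein or histone mark being pulled down.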

Slide 36

And this is just an example. The states don’t matter here. This is just a cell line under two different states. On the top is looking at a certain histone modification. And you can see between the two states, just by showing where the DNA maps to, that there’s a difference. And depending on what marker you’re looking at here, these have a significance of permissive or non‐permissive expression.

Slide 37

I want to mention the International Cancer Genome Consortium. This is a group of international labs that are getting together to try and reduce some of the redundancy and have more uniform standards for collecting data. The goal overall of the consortium is to look at about 50 different tumor types and about 500 samples per tumor. And to put it in context, using the tumor and the normal, that is something like doing 50,000 human genome projects.
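
The 50,000 figure follows directly from the numbers quoted, assuming each sample contributes two genomes (the tumor and its matched normal):

```python
tumor_types = 50          # tumor types targeted by the consortium
samples_per_type = 500    # samples per tumor type
genomes_per_sample = 2    # tumor plus matched normal (assumed pairing)

total_genomes = tumor_types * samples_per_type * genomes_per_sample
print(total_genomes)  # 50000
```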

Slide 38

This slide represents the groups that are involved currently who have stepped up and joined the consortium. Sean is in the Australian group there and doing pancreas. And you can see that we are also doing pancreas. This clearly isn’t meant to be a land grab where certain groups are doing certain tumors. We’re working very closely with Sean and the idea here is to get as close as we can to the 500 tumors completely sequenced.

[0:19:55]

Slide 39

I want to make a comment on data analysis. The amount of data being produced is enormous, not only in the sheer volume of it but in the analysis it requires. And for anyone in bioinformatics out there, we need help. There’s clearly a disconnect now, I think; we’re just completely overwhelming the bioinformatics capability. It’s getting better, but we definitely can use more tools.

Slide 40

And lastly, just the people to thank. This is my group and the bioinformatics groups at the OICR. The website is there at the bottom if you want more information about the OICR.

And thank you for listening.

Slide 41

Sean Sanders: Great. Thank you very much, Dr. McPherson.

Slide 42

So, our final speaker today is Dr. David Wheeler. Dr. Wheeler completed
his Bachelor of Science degree at University of Maryland, College Park, going on to a Master of Science in biochemistry and a Ph.D. in molecular genetics, both at George Washington University in Washington, D.C. He carried out his postdoctoral training at Brandeis University in Waltham, Massachusetts before moving to Baylor College of Medicine in Houston, Texas, to pursue his growing interest in the new area of computational biology. Dr. Wheeler was director of the Molecular Biology Computation Resource at Baylor for 10 years and in 2001 joined the Human Genome Sequencing Center there. He is currently director of bioinformatics and cancer genomics in the Human Genome Sequencing Center, where he develops methods for discovery of genome variation in human and animal populations using DNA sequencing technologies with the goal of relating polymorphisms to human disease, especially cancer. Dr. Wheeler is also on the editorial board of the Journal of Genome Research.

Thank you very much.

Dr. David Wheeler: Thank you, Sean.

Slide 43

Good afternoon, everybody. I think at the outset, it’ll be interesting to step back just for a second to see where we’ve come from and where we’re going with all this DNA sequencing.

It was 2003 that the first human genome was completed at a cost of $3B.
That was about $1 per base of the human genome. It took about 10 years to do that work. By 2007, we had sequenced the first individual human genome using a next generation technology at a cost of about $2M and about two months’ worth of work.

So, the rate at which DNA sequence is being generated is rapidly
increasing and the costs are decreasing. We expect that by the time the third generation sequencers are available, sometime around 2010, that we’ll be sequencing human genomes for about $10,000 to $1000. And this will be accomplished in a matter of a week, maybe two weeks per genome.

Slide 44

So, this shows the platforms that are in use in the Human Genome Sequencing Center. We have all three currently commercially available platforms. And the increase in DNA sequence is evident in this curve. Within about a year, we’ll be producing over 1 billion bases per month. And this is being driven primarily by improvements in the sequencing platforms in current use rather than the addition of new platforms. So, these curves project the technological development that these companies are putting into this endeavor.

Slide 45

So, why do we sequence the cancer genome and what are we looking for? We’re looking for somatic mutations in the DNA. We’re looking for epigenetic changes and we’re looking for changes in expression. My colleagues have covered the second two of these issues. I’ll spend some time on somatic mutation in DNA today.

One of the most challenging aspects of this is that the scale of the variation is so broad. We’re looking at scales that go from a single base to the entire genome when we’re looking at the cancer genome. The other issue is the wide variety of next generation instruments that produce DNA sequencing reads of varying length and varying characteristics. And combining all that information together is becoming a challenge.

Slide 46

Because of the large amount of data that’s being produced now, there has been a plethora of new tools being introduced to analyze the data. And the primary type of analysis that we’re doing today is comparing the reads that are generated back to a reference genome.

When we were producing the first human genome, the actual assembly of the genome was the important thing. Now, the comparison is the important thing. And so there are specific tools being developed by software engineers to handle this specific problem.

[0:25:11]

Slide 47

This again shows the variation of scale. You’ll notice that the length scale here is logarithmic, going from a single base up to chromosome length of around 100 million bases. The individual SNPs and mutations are at one end of this; then we look for mini‐ or microsatellite variation, insertion and deletion variation, and translocations, and also copy number changes, focal amplification, and chromosome aneuploidy.

On the right‐hand side of this slide are the different DNA sequencing
methodologies that we apply to detect this variation. So, for the largest scales, we can simply look at how the reads from a whole‐genome shotgun experiment cover the reference genome. For translocations, insertions, deletions, inversions, and other structural rearrangements like that, we look at the way that paired‐end reads map to the genome. And we look for anomalies in those paired‐ends as John McPherson discussed.

Then at the smallest level, we look within the reads themselves to see base variation within the reads.

The challenge here is that the reads have error rates in them. And the error rates are generally much larger than the variation rate, the true variation rate in the genome. So, we need sophisticated tools for distinguishing errors from real variation.

Slide 48

So, this image shows what a copy number alteration experiment would
look like. What happens here is that the number of reads across each chromosome is counted for both the tumor and the matched normal sequence for the same patient.

And where the histogram deflects above the central line, that’s an
indication of amplification. Where the histogram deflects below the central line, that’s an indication of deletion. And you can see that, for example, chromosome 7 is broadly amplified in this particular patient. This is a patient with brain cancer. Chromosome 9 and chromosome 10 show large‐scale deletions and chromosome 19 also shows amplification.
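
One common way to turn tumor and normal read counts into the kind of deflection described above is a per‐bin log2 ratio, sketched below. Using a log2 ratio, scaling each sample by its total read count, and adding a pseudocount are assumptions made for illustration; the talk does not say how the plotted values were computed. The counts are made up.

```python
import math

def log2_ratios(tumor_counts, normal_counts, pseudocount=1.0):
    """Per-bin log2(tumor / normal) after scaling each sample by its total reads.

    Values above 0 suggest amplification in the tumor; values below 0
    suggest deletion. The pseudocount avoids taking the log of zero.
    """
    tumor_total = sum(tumor_counts)
    normal_total = sum(normal_counts)
    ratios = []
    for t, n in zip(tumor_counts, normal_counts):
        t_scaled = (t + pseudocount) / tumor_total
        n_scaled = (n + pseudocount) / normal_total
        ratios.append(math.log2(t_scaled / n_scaled))
    return ratios

# Four hypothetical bins: unchanged, amplified, deleted, unchanged.
tumor = [100, 200, 0, 100]
normal = [100, 100, 100, 100]
ratios = log2_ratios(tumor, normal)
```

In this toy input the second bin comes out close to +1 (roughly a doubling) and the third bin is strongly negative, which is the pattern a homozygous deletion would give.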

Slide 49

If we zoom in on those chromosomes, at chromosome 7, you can see the general amplification of that chromosome. And you can also see a specific amplification at 7p11.2, which is where the EGFR gene is located.
So, this patient has a very extensive amplification of the EGFR gene, which is a key gene sending mitogenic signals into the cell to accelerate its division rate.

On chromosome 9p21, you can see a homozygous deletion of the CDKN2A gene, a critical gene for regulating the cell division cycle.

Slide 50

So, now, we’ll talk about SNPs and mutations, going back to the other end of the scale. And in the effort to detect mutations, I mentioned that there is an error rate in the DNA sequence that’s much larger than the variation rate that we find in normal and in diseased tissues.

Slide 51

So, when we discover differences, it’s important to validate those differences, and we usually do that using a second technology, a second chemistry. So, if we were to generate sequence on the SOLiD sequencer and collect variation, we would then validate it by perhaps a 454 sequencer or Sanger sequencing.

Slide 52

As these technologies are evolving, in the Human Genome Sequencing Center, we are sequencing to deep coverage, between 20x and 30x, using the SOLiD platform, and we’re simultaneously sequencing 6x to 10x coverage using the 454 sequencer. And this provides a global validation when we see variation on both platforms.

[0:30:12]

Slide 53

This shows our overall pipeline for processing this data. Whenever we are determining a genome sequence, we have to look at both the tumor and the normal. And so, in this particular case, we collected read data from the tumor and the normal at 15x coverage. We mapped the reads back to the reference genome using Corona Lite, which is software from AB, and detected all the variants. With these variants, we are able to electronically create a probe list and do an experiment that’s much like a SNP array, except we do it electronically, and we call this e‐Genotyping: we can look back in the reads from the 454 platform and find the variation in the reads without having to remap all those reads to the genome. That gives us a list of valid variation in both the tumor and the normal. And then by subtracting normal variation from the tumor variation, we get a mutation list.
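
The last two steps of that pipeline can be sketched as set operations. The variant tuples, the probe sequence, and the reads below are hypothetical, and exact substring matching is used only as a stand‐in for e‐Genotyping, whose actual implementation is not described in the talk.

```python
def probe_supported(probe, reads, min_reads=2):
    """True if the probe sequence occurs verbatim in at least min_reads raw reads.

    This checks reads directly, without mapping them to a reference.
    """
    return sum(probe in read for read in reads) >= min_reads

def somatic_candidates(tumor_variants, normal_variants):
    """Variants validated in the tumor but absent from the matched normal."""
    return set(tumor_variants) - set(normal_variants)

# Hypothetical validated variants as (chromosome, position, ref, alt).
tumor = {("chr7", 100, "A", "G"), ("chr9", 200, "C", "T")}
normal = {("chr9", 200, "C", "T")}
mutations = somatic_candidates(tumor, normal)
print(mutations)  # {('chr7', 100, 'A', 'G')}

# Hypothetical probe carrying the variant allele, checked against raw reads.
reads = ["TTACGTGAA", "ACGTGCC", "GGGG"]
validated = probe_supported("ACGTG", reads)
```

Variants shared by tumor and normal are inherited rather than somatic, which is why they drop out in the subtraction.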

Slide 54

In results from one of our first experiments with brain cancer, we found over 2.2 million variants at high stringency and very high accuracy. This led, through the e‐Genotyping validation, to a collection of almost 6000 somatic mutations, of which 5 would cause a missense change.

Slide 55

And this next slide shows a list of seven of those missense changes that
could potentially be involved in cancer. These are in genes that have been known in other cancers to be involved in the disease.

Slide 56

So, thinking again about the current costs of the human genome, we’re at
a point where for approximately $100,000 today, we can do a complete genome. But that’s still fairly expensive for processing large numbers of samples. So, for this application, we are turning to the capture techniques for targeted sequencing that John McPherson mentioned.

Slide 57 In the Genome Center, we use the NimbleGen platform. And we have

generated whole exon chips for capturing all the exons of the known genes. And specifically, isolating them and sequencing the exons to deep coverage.

Slide 58 This shows the profile of a typical experiment normalizing all of the

targets to a length of 100. And you can see that the coverage is quite uniform over the target area in blue, and that there is some bleed over into what we call the buffer. The fragments of DNA that are being captured are quite large, on the order of about 300 to 500 bases, and so that results in some extension to the flanking sequences.

Slide 59 So, with this technology, we have been sequencing pancreatic

adenocarcinoma. And again, comparing tumor to normal, we currently have in our experiments about 3000 missense and nonsense mutations. The coverage is still low at this point. But as we increase the coverage, the accuracy of this experiment will improve.

However, currently, we have found three mutations that are also in the

COSMIC database. And so, finding mutations that have been seen before is a strong indication that they are going to be real. And so we see three genes that are known to be involved in cancer that have also been mutated in this patient.

Slide 60 So, in summary, deep sequence coverage by next‐generation methods is

accurately discovering mutations related to cancer. A multi‐platform approach is yielding rapid validation of our results. And the e‐Genotyping method that we have developed rapidly and efficiently assesses raw sequencing data for known SNPs and mutations.

Thank you.


Sean Sanders: Great. Thank you very much, Dr. Wheeler. Slide 61 So, we’re going to jump right into the Q&A session now. Thank you all for

your presentations. We’ve had a number of questions coming in, but I’m going to start with one that came in through email, and I’ll pass it over to Dr. Grimmond first. The viewer asks how much depth of coverage is necessary or adequate for cancer transcriptome analysis, for example detection of coding SNPs?

[0:35:18] Dr. Sean Grimmond: Okay. So, in the case of transcriptome analysis, the depth really depends

on the question that you want to ask. If you’re just looking at gene activity, similar to a microarray experiment, you might only require about 10 million reads. But if you want to get complete coverage of all the exons that are being expressed, you probably need in the order of 100 to 150 million reads from that sample.

You know, at that depth, the genes that are expressed at 100 copies/cell,

you will have excellent coverage of those. And, indeed, for the ones that are highly expressed, a thousand copies per cell, you have enormous coverage. But you’ll start to get down to the required depth that you need on the rarer transcripts.

Sean Sanders: Uh‐hum. Dr. McPherson?

Dr. John McPherson: Just to turn it into a DNA genome question as well, so the coverage there

David mentioned about 30x coverage as a target. And that’s pretty typical these days, about 30x coverage of a genome on average. And then to call a SNP in a region, you need 8 to 10x coverage. That is just part of a sampling issue, to make sure that you’ve seen both alleles.
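
[Editorial illustration, not part of the spoken discussion] The sampling argument behind the 8 to 10x figure can be made concrete with a small calculation. This is a minimal sketch under a simplifying assumption that is not stated in the discussion: at a heterozygous site, each read independently samples one of the two alleles with probability 0.5, and sequencing errors are ignored. The function name `prob_both_alleles_seen` is hypothetical.

```python
def prob_both_alleles_seen(depth: int) -> float:
    """Probability that `depth` reads at a heterozygous site include
    at least one read from each allele, assuming every read samples
    either allele independently with probability 0.5."""
    if depth < 2:
        return 0.0  # one read (or none) cannot show two alleles
    # All reads come from the same allele with probability 2 * 0.5**depth.
    return 1.0 - 2.0 * 0.5 ** depth


for depth in (2, 4, 8, 10):
    print(depth, prob_both_alleles_seen(depth))
```

Under this idealized model the chance of missing one allele drops below 1% at 8x. Real variant callers usually require several supporting reads per allele, so this should be read as a lower bound on the depth needed, not as a calling rule.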

For application to cancer, you may have to go much deeper due to the

heterogeneity of the tumor population and/or contamination with normal.

Sean Sanders: Okay. Dr. Wheeler?

Dr. David Wheeler: Yes. To that, I might add, in the capture technology, we’re actually going

much deeper now. We’re trying to get up to about 150x average coverage. That’s because the minimum coverage that you need to call a mutation, as John said, is about 8 to 10x. And this discovery is driven by the minimum coverage. So, you need a high average coverage to get this


minimum coverage across most of the targets. So, it’s somewhat analogous to this situation with the RNA‐seq.
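
[Editorial illustration, not part of the spoken discussion] The relationship between average coverage and the fraction of positions that reach a minimum depth can be sketched as follows. This is a rough sketch that assumes per-position depth follows a Poisson distribution, which is an idealization introduced here and not something the speakers state. Capture experiments are generally less uniform than this model, which is consistent with the much higher average targets mentioned above. The function name `fraction_at_or_above` is hypothetical.

```python
import math


def fraction_at_or_above(mean_depth: float, min_depth: int) -> float:
    """P(X >= min_depth) for X ~ Poisson(mean_depth)."""
    below = sum(
        math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - below


# Expected fraction of positions with at least 8 reads at several average depths.
for mean in (10, 30, 150):
    print(mean, round(fraction_at_or_above(mean, 8), 4))
```

Even in this best case, an average depth close to the minimum leaves a sizeable share of positions under-covered, so the average has to sit well above the calling threshold.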

Sean Sanders: Uh‐hum. Okay. So, I had a question come in from one of our viewers about recent

articles in the New England Journal of Medicine. And there’s been a couple of questions about this, about some controversy about whether ‐‐ especially the whole genome sequence, the whole genome association studies are valid to ‐‐ you know, so this viewer asks, do they have limited value. And what is your opinion. So, maybe we’ll start with Dr. McPherson.

Dr. John McPherson: It’s a good question. Certainly, historically, there have been a lot of

association studies that haven’t panned out. And I think it was due to small numbers ‐‐ small sample size. Now, people are banding together, very large populations that they’re studying, finding peaks, they’re getting them independently validated in other populations. And I think that the regions that are being identified are real and going to pan out.

Sean Sanders: Uh‐hum.

Dr. John McPherson: The next question is, what’s the next step? How do you get down to what

is the actual causative mutation? You’re looking at some marker that is nearby. But how do we find the actual causative one.

Sean Sanders: Uh‐hum. Okay. Dr. Wheeler, maybe you would like to add something.

Dr. David Wheeler: Yes. Just to draw a contrast between the whole genome association and

what we’ve been talking about here. So with the GWAS, as they’re known, we’re looking for genetic predisposition to a disease. And so, that’s a slightly different question than somatic mutations that we’ve been talking about where we compare the tumor and the normal within a given patient looking for somatic changes.

Sean Sanders: Okay. Excellent. A question about next‐gen sequencing. Where are the bottlenecks right

now? What are the issues? I see Dr. Grimmond smiling. We’ll throw that at you then.

Dr. Sean Grimmond: I might start and I’ll claim the high ground, which is bioinformatics. The

biggest challenge at this moment in time is the bioinformatics required to take such a large amount of data and turn that data into knowledge. So,


from the mapping through to the identification of variants to making calls on those variants, and then giving some biological insight to that is an enormous exercise.

And, I think, what we’ve seen here is with the likes of the ICGC, we need

to be doing this on a large scale. So, we have a large number of patients. So, we can start to correlate those events with disease, which means the entire initiative is enormous.

Sean Sanders: Okay.

Dr. John McPherson: I think that another bottleneck is actually getting good samples to

analyze. If we want to do large numbers, we need large numbers of good quality samples and with proper informed consent for large‐scale sequencing. And so, a lot of the studies that we’re embarking on now, we’re not using some of the old collections that have been around. We’re starting to collect again. And something like pancreatic cancer that Sean and I are very interested in and David as well, it’s a fairly uncommon cancer. And so getting the collections to look at 500 is going to be a challenge in itself.

[0:40:10] Sean Sanders: Okay. Great. So, to follow on from that and to something that Dr.

Wheeler was talking about earlier about comparing tumor and normal samples. We have a question asking whether you can address the relative impacts of carefully typing the tumor and carefully selecting the patient’s genetic background and heritage, and looking at the environment as well. So, how do those have an impact on your sample collection and your analysis?

Dr. John McPherson: I guess, I can start on that. Start with environment. Environment is

obviously very important in doing cancer research. When we collect the samples, there’s a lot of extra information and clinical data that also comes along with that. But from the point of view of what we’re looking at, it’s sort of the downstream analysis. And in targeting right now, we’re looking at more of the somatic changes, the changes, mutations that have occurred in the tumor. I don’t know Sean if you want to pick up on that?

Dr. Sean Grimmond: Yeah, I think that the prime focus of some of these cancer genome

initiatives at the moment is to make that atlas of somatic events. I think, it’s very important to recognize that there are going to be environmental components and other components driving these diseases. We need to


capture that information. Once the atlas has been made for a large number of samples, then we can start to do those correlations.

Sean Sanders: Uh‐hum.

Dr. John McPherson: Yes. I think to that I might add that most cancers have a variety of

subtypes. And one of the things we’re trying to do with these mutational atlases is to help further sub‐classify those tumors. And so as the data‐‐ as more data is being generated and we become aware of more subtypes, I think we will be able to ask more and more specific questions with the sequencing data. So, we’re just in the infancy of this new field right now. And so, the viewer raises a very important question that I think will unfold over the next several years.

Sean Sanders: Okay. Excellent. A more specific question on the normalization of data from

transcriptome sequencing, how do you do this?

Dr. Sean Grimmond: So, the transcriptome data can be treated like normal array data

provided there are some normalizations, which you wouldn’t normally see. To go into the details of how the transcriptome is typically sequenced, we take all the RNA and we shatter it into small pieces. And that’s done out of convenience so that we can then sequence it with the next‐generation platforms.

The challenge there is that if a transcript is very, very long, you can get a

large number of tags. And if a transcript is very, very short, you’ll likely get fewer fragments. So, there is some normalization that you can do to take into account transcript length. But apart from that, once that data is collected, we would then, I guess, equalize the sequence depths. So, if we have two samples, a normal sample and a tumor sample, we would make sure that when we start to do the comparisons, once they’ve been length normalized, we have similar numbers of tags. So, we’d look at the relative abundance per 10 million reads or whatever. And then it can just be treated like standard sequence data.
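
[Editorial illustration, not part of the spoken discussion] The two normalizations described above, for transcript length and for sequencing depth, can be sketched as follows. This is a minimal sketch and not the speaker's actual method: the function name `normalize_counts`, the choice of scaling per kilobase, and the toy transcripts are assumptions made for illustration. The scaling per 10 million reads follows the figure mentioned in the answer.

```python
from typing import Dict


def normalize_counts(
    counts: Dict[str, int],
    lengths_bp: Dict[str, int],
    per_reads: float = 10_000_000,
) -> Dict[str, float]:
    """Scale raw tag counts by transcript length (per kilobase) and by
    sequencing depth (per `per_reads` total tags in the sample)."""
    total_tags = sum(counts.values())
    depth_factor = total_tags / per_reads
    return {
        name: count / (lengths_bp[name] / 1000) / depth_factor
        for name, count in counts.items()
    }


# Toy sample (invented numbers): the long transcript collects four times
# the tags purely because it is four times as long.
raw_counts = {"long_tx": 800, "short_tx": 200}
transcript_lengths = {"long_tx": 4000, "short_tx": 1000}

normalized = normalize_counts(raw_counts, transcript_lengths)
print(normalized)
```

After normalization the two toy transcripts have the same value, which reflects equal abundance once length is accounted for. Applying the same function to a tumor and a normal sample puts both on a common per‐10‐million‐read scale before comparison.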

Sean Sanders: Great. So, here’s one for you. Do you think next‐gen sequencing will replace

microarray technology? So, Dr. McPherson, you want to start with that?

Dr. John McPherson: In many areas, I think it has already.


Sean Sanders: Uh‐hum.

Dr. John McPherson: But not completely. It’s still ‐‐ for example, copy number analysis, it’s still

very cheap to do a microarray compared to sequencing. And we are doing microarray analysis for copy number. You can get copy number out of the sequence data, but you have to generate a lot of sequence to cover it.

So, I think, in many applications it has been replaced. ChIP‐chip, for

example, has been replaced by ChIP‐seq. The advantages clearly are that you don’t have to decide what to put on the microarray. So, it’s true discovery versus an analysis of what you think you already know.

Sean Sanders: Okay. Dr. Wheeler?

Dr. David Wheeler: So, the technology of solid‐phase analysis of expression and array CGH,

those technologies are advancing as well and becoming less and less expensive as well. So, there’s kind of a horserace going on and it’s hard to predict what will happen too far into the future.

We’ve seen sequencing make further and further inroads into what used

to be clearly the domain of these solid‐phase methodologies. I think, as John said, that those inroads will continue for a while. It’s hard to know whether one will ever completely replace the other.

[0:45:07] Sean Sanders: Okay. Dr.‐‐

Dr. Sean Grimmond: I’d make a quick comment there that I think that arrays are still vastly

more scalable than sequencing. You can analyze the data on your laptop and, at the moment, I need a supercomputer to look at the sequence data. So, there are some real differences in the ease with which these platforms can be used. I think that from an expression point of view, arrays certainly have some legs yet for some years.

Sean Sanders: Uh‐hum.

Dr. Sean Grimmond: Because we can process a large number of samples and then dig deep

with the sequence approach. If you put the two together, you then know what you’re seeing on the arrays, and you use each tool depending on the task that you’re going to take on.

Sean Sanders: Okay.


It seems that 5’ untranslated regions are not sequenced as much as 3’ UTRs. Do you agree with this and do you have any comments on the issue? Dr. Wheeler, you want to start?

Dr. David Wheeler: So, in the whole genome sequencing context, of course, we see

everything. In the whole exon capture that I was talking about earlier, currently we’re only looking at the ‐‐ or we’re primarily looking at the coding sequence. The coding sequence today is what we understand and can interpret the easiest. So, as time goes on and we understand better and better the signals in both the 3’ and 5’ end, I think, the capture methodologies will start to include those regions.

Sean Sanders: Dr. McPherson?

Dr. John McPherson: I think that ‐‐ I’d almost sort of disagree with that statement. I think that ‐‐

not David’s, but the question and the way it was phrased. I think that the 5’ ends are sampled frequently now. They weren’t in the past because a lot of the techniques were actually collecting 3’ tags for example.

Sean Sanders: Uh‐hum.

Dr. John McPherson: But certainly, on the capture arrays and other methods of isolating

specific things, we’re definitely looking at the 5’ ends of genes. And certainly, as Sean said with the transcriptome, if you look at the entire transcriptome, you’re getting everything.

Dr. Sean Grimmond: I might make one point there that if you have a look at some of the RNA‐

seq work that is out there at the moment, quite often, you do see an underrepresentation of the 5’ ends. And there’s a number of methodological reasons for that. If you pull the RNA from the 3’ end and there’s some degradation, you’ll see an underrepresentation of the 5’ end. But all the methodologies are pushing to equalize that out these days.

Sean Sanders: Uh‐hum. So, I’m going to give you a rather prosaic question from a high school

student. They wanted to ‐‐ clearly an animal lover. They wanted to know if you can do this on humans, can you do it on animals?

Dr. John McPherson: Do the sequencing?

Sean Sanders: The types of experiments you’re doing here.


Dr. John McPherson: Certainly. And in fact, the cancer sequencing that we’re doing, we are

actually ‐‐ there are mouse models that could be used. And we are actually doing sequencing of xenografts, which are tumors that are implanted in mice. But DNA is DNA: you can do this for animals, you can do this for bacteria, you can do this for anything.

Sean Sanders: And the results you’re getting from those studies are useful in human?

Dr. John McPherson: We think so.

Sean Sanders: Right. [Laughter] Actually, that brings us, I guess, to cost as well and we had a question

that asked about the ‐‐ as the cost of whole genome sequencing continues to fall, will targeted approaches continue to be employed or do you think we’re just going to be sequencing whole genomes? And, I guess, this goes to the data issue as well. Maybe, Dr. Wheeler, you’ll start us off.

Dr. David Wheeler: Yes. I think that for ‐‐ the objective will be to use a given patient’s tumor

DNA to help diagnose and to help outline a course of treatment for the patient. And to do that, we will most likely be going with whole genome sequencing. I can’t say that the targeted approaches wouldn’t have research application or application in some arenas and in some cancers. But as the cost gets low enough to do the whole sequence, I think, that’s the direction we’ll go.

Sean Sanders: Uh‐hum.

Dr. John McPherson: And I think that the targeted approaches won’t necessarily go away. The

question asked a lot is as sequencing gets cheaper and cheaper, you know, why bother selecting regions. But I still think that there’ll be many, many instances where if you do a targeted approach, you can pool a hundred samples. And for the price of sequencing one, you can sequence a hundred. And again, it goes back to experimental design. I think you have to decide what is it that you want to look at.

Sean Sanders: Right.

Dr. Sean Grimmond: I’d also add that talking about the GWAS

experiments, before we can find a region that predisposes some individuals on a population basis, you may have a region of the genome


you want to look at in a large number of samples. And I think this is an excellent way to go forward with that.

[0:50:04] Sean Sanders: Great. The next question is from, I believe, a colleague of Dr. Wheeler’s at

Baylor. Do you believe that the heterogeneity of cancer mutations is greater than that expected for other genetic diseases?

Dr. David Wheeler: Certainly, the heterogeneity in tumors is fairly extensive. And that’s an

area that we’ll be able to really delve into deeply with the next‐gen and especially as we go to true single molecule sequencing. So, yes, cancer is more heterogeneous. And that’s a whole area that remains to be explored now.

Dr. John McPherson: I think too that some of the complex diseases are very heterogeneous as

well. And, you know, using GWAS studies or what have you, we’re identifying genes that are important in disease. But they account for very small effects in general. We’re not finding ‐‐ except in a few cases, we’re not finding the gene that causes this disease in a complex disease.

I think because of that, there’s a lot of room to go. If you add them all up,

we’re still only understanding, in many diseases, only 10% or 15% of the underlying cause of disease. So, there’s a lot of heterogeneity yet to be explored in lots of other diseases as well.

Sean Sanders: Okay. Our next question is about sample preparation and we’ve talked

about using multiple different platforms to look at samples. And the viewer asked whether using these multiple platform validation methods identifies sample preparation bias, and if this is something that you’ve seen. Dr. McPherson?

Dr. John McPherson: There are biases in all of the methods and I think the important thing is to

start to understand them, and understand the biases in your data. And whether they’re complementary is what we’re exploring.

So, we have two different platforms. There’s a certain overhead involved

in maintaining two different platforms and the sample preps are slightly different. They’re very similar in many steps as well. And we’re trying to combine as many steps as we can so we have more of a single pathway that diverges in a couple of points. But, overall, I think you just have to


understand the biases and expect there are biases. And our hope is that they’ll be somewhat complementary.

Sean Sanders: Uh‐hum. Okay. A question about single cell sequencing, is it possible to sequence

single cells? And there have also been some questions about cancer stem cells. So maybe we can address that as well. Dr. Grimmond?

Dr. Sean Grimmond: There’s been some work very recently looking at the sequencing of the

transcriptome from single cells, which is recently published. And that certainly opens up a lot of excitement for us to be able to take individual cancer cells and start to address the overall heterogeneity we see in the transcriptome. I still believe it’s very challenging too. There has been some work done in bacteria, on single cells in genome work, but once we go to the size of mammalian genomes, it’s still very challenging to work at that level.

Dr. John McPherson: Yeah, I agree. I think it’s difficult to get comprehensive analysis on a

single cell at this point in time. And I used to think myself that the Holy Grail was to be able to do all this work on a single cell. And the more I think about it, I think that, you know, I’m very happy with a small population of cells that I can analyze. With a single cell there are no replicates. You get one data point and you’re going to have to average across multiple cells, I think, anyways.

So, I think, from a research standpoint of understanding how a cell

operates, it’s a very important tool that continues to be developed. From the standpoint of what I’m trying to do, I’ll be happy if we can get down to hundreds of cells rather than a single cell.

Sean Sanders: Uh‐hum.

Dr. David Wheeler: Yeah, I think this goes back to the issue of heterogeneity and tumors are

infamous for their broad heterogeneity. And there are specific questions we can ask if we can get at the single cell level. So, I think, there is a strong motivation to get to the single cell level eventually, and perhaps some of the third generation sequencers will be able to give us that technology.

Sean Sanders: Great. So, we haven’t spoken a lot about epigenetics or epigenetic changes so

I’ll throw one question on bisulphite sequencing. The viewer says when


we do bisulphite sequencing to monitor DNA methylation what is the percentage of the genome that cannot be assessed because of reduced complexity? So, who’d like to take that one on? [Laughs]

Dr. John McPherson: I’ll jump in, I guess. I don’t know the exact answer to that. Certainly,

when the read lengths were quite short, it was a severe problem in trying to map back to the genome. Read lengths are getting longer. There are tricks in the works that allow you to map things better. So, I’m not sure at this point what percentage is going to be not accessible. I think in the end, it won’t be that much different than trying to map the unconverted reads back.

[0:55:01] Sean Sanders: Uh‐hum.

Dr. John McPherson: But right now, I really don’t know what the percentage is. It’s certainly

lower than looking at regular reads, but I still think it’s actually quite high.

Sean Sanders: Okay. I have a question about the use of paraffin‐embedded tumor samples.

Are any of you using these types of samples and what kind of success have you achieved?

Dr. David Wheeler: So, we have some very large collections of tissue specimens that are

paraffin embedded and this is a very challenging area right now. There’s very strong motivation to try to be able to do DNA sequencing of this material because the collections are so extensive. It’s difficult. The challenge comes from the degradation of the DNA. But there’s a lot of active research going on to try to minimize the damage to the DNA.

At the current time, recovery yields are pretty good. And so, it’s

becoming practical to take this DNA and use it in the context of exon capture, and researchers are also successfully obtaining RNA. So, I think, the technology will continue to improve in this area and we’ll begin using these more frequently.

Sean Sanders: Okay. Another issue we haven’t really touched on is personalized medicine. And

we had a question about whether advances thus far ‐‐ and I know you’ve said this, this field is still in its infancy. But have you seen any advances that have moved towards the clinic and have changed the way that cancers are treated? So, Dr. Grimmond, would you like to start?


Dr. Sean Grimmond: Yeah. I mean, I think, as you said, we really are at the information

gathering phase. But there are some trailblazers out there at the moment who are showing the potential of this work. I mean, there have been some fantastic presentations recently at the AGBT meeting in February where Marco Marra was showing that it’s possible to start screening individual cancers and take some of the guesswork out of what therapeutic might work or might not work.

So, if you look at a tumor, you look at the genes that are expressed, you

look at the pathways that are likely to be susceptible to treatment with a drug versus not, you can actually try to come up with a regime to address that. And, indeed, it’s going on now in some of the leading translational hospitals here in the US to target and sequence large numbers of genes for which we have drugs that can target those pathways.

Sean Sanders: Uh‐hum.

Dr. Sean Grimmond: And the idea was to take some of that guesswork out. And I think that

will be the first step that we see towards personal medicine.

Sean Sanders: Dr. McPherson?

Dr. John McPherson: I don’t know if I have much to add to that. That kind of sums it up. I think

that more and more we’re starting to understand the pathway nature of cancer, in that you see a single mutation in one cancer, you see a different tumor and it’s got a different mutation, but they’re connected by this pathway connectivity.

And what we want is to get to the point where we can understand that

pathway so that the drugs that are specific to pathways can be used as a frontline. In many severe cancers, you don’t have a lot of time to make that decision. And right now, the care is sort of a primary treatment and then if that’s not working then you go to the secondary treatment. It’d be nice if you could tailor it a little more and go straight to the one that is more likely to work.

Dr. David Wheeler: So, there are some targeted therapies today based on a few known

genes, EGFR, ERBB2 and other kinases. And through the sequencing that we’ve done, large‐scale broad sequencing across thousands of genes, we can see the feasibility of this. And so, I would say in the near future, this will become routine, but we’re not really there yet and so…


Sean Sanders: So, we’re almost out of time, but I’m going to ask you a final question based on this. So, we’ve seen where we’d like to get to. What technologies do you think are coming up down the road and how are things going to change to get us there? So, maybe we’ll start with Dr. Wheeler and work our way back.

Dr. David Wheeler: Well, as I’ve mentioned already, we’re seeing the first machines that do

true single molecule sequencing. I think there are advances coming in microfluidics that might make single cell analysis practical in the not too distant future. There are other third generation sequencing technologies that will produce very long reads, that will address some of the mapping issues that John referred to earlier. So, what I think…

[1:00:07] Sean Sanders: Okay. John?

Dr. John McPherson: Yeah, I think David covered most of them. But sample preparation, I

think, microfluidics is a good example that if you’d make sample preparation easier, you know, ideally from a little bit of blood and you can get the DNA and sequence the genome. Certainly, advances in the technology for higher throughput and still faster, it takes quite a long time to sequence a genome now. And cost, the cost has to drop dramatically before it’s done routinely.

Sean Sanders: Okay. Dr. Grimmond?

Dr. Sean Grimmond: Yeah. I would say the biggest theory is we need advancement in scale.

We still need more sequence per run. I guess that ties to cost as well. The next next‐generation sequencers, single molecule sequencers, will potentially have very long reads so we can rapidly assemble. And, indeed, I guess the bioinformatics: being able to assemble de novo rather than mapping to a reference, being able to do that from the ground up, is really what’s needed, I think, to make this a viable thing for personal genomics.

Sean Sanders: And, I guess, carrying around our own petabyte drives as well to carry our

information on. Okay. So, we are out of time. I’d like to thank our wonderful speakers for

being with us today and for the superb discussion they’ve provided: Dr. Sean Grimmond from the Institute for Molecular Bioscience at the University of Queensland, Dr. John McPherson from OICR in Toronto, and Dr. David Wheeler from Baylor College of Medicine in Houston.


Thank you also to our online viewers for the great questions. Sorry, we did not have time to get to all of them.

Please go to the URL at the bottom of your slide viewer now if you’d like

to learn more about the products related to today’s discussion. And look out for more webinars from Science in the near future available at www.sciencemag.org/webinar. We encourage you to share your thoughts with us about this webinar by sending an email to the address now up in your slide viewer, [email protected].

Again, thank you to all the participants and to Applied Biosystems for

their kind sponsorship of today’s educational seminar. Goodbye. Thank you. [1:02:07] End of Audio
