Data and Python in Biology at PyData NYC 2015
![Page 1: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/1.jpg)
How big data is transforming biology and how we are using Python to make sense of it all
Maria Nattestad
Computational biology PhD student
PyData NYC 2015
![Page 2: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/2.jpg)
Overview
Genome sequencing
Using Python to study cancer
Personal genomics
![Page 3: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/3.jpg)
Overview
Genome sequencing
Using Python to study cancer
Personal genomics
![Page 4: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/4.jpg)
Your genome
46 strings of A, T, C, and G for a total of about 6 billion characters
male
![Page 5: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/5.jpg)
Mutations in the genome can lead to cancer and other diseases
Over 20,000 genes are scattered all over the genome.
The genome is the instruction manual for creating a living thing.
Some changes in the genome do nothing or encode normal variation like hair color, while others can cause disease.
![Page 6: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/6.jpg)
Illumina = “Next-generation sequencing”
Sanger = The original
Human Genome Project publishes first draft
![Page 7: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/7.jpg)
Big Data
3000 Rice Genomes Project
![Page 8: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/8.jpg)
Sequencing by the numbers
• The human genome is 6 billion letters [ATCG]
• No technology exists that can read an entire chromosome from end to end
• Illumina sequencing produces reads of about 100 letters each
• If the genome were random, 100 letters would be enough to place each read uniquely
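The claim about random genomes can be checked with a little arithmetic. The following is a minimal sketch, assuming every position is an independent, uniform draw from the four letters and ignoring the reverse strand; the function names are illustrative.

```python
# Back-of-the-envelope check: how often would one specific read appear by
# chance in a uniformly random genome of 6 billion letters?
# Assumption: each position is an independent, uniform draw from A, T, C, G.
GENOME_LENGTH = 6_000_000_000
READ_LENGTH = 100


def expected_chance_matches(read_length, genome_length=GENOME_LENGTH):
    """Expected number of positions where a fixed read matches by chance."""
    positions = genome_length - read_length + 1
    return positions * 0.25 ** read_length


def min_unique_length(genome_length=GENOME_LENGTH):
    """Shortest read length with fewer than one expected chance match."""
    length = 1
    while expected_chance_matches(length, genome_length) >= 1:
        length += 1
    return length


print(expected_chance_matches(READ_LENGTH))  # vanishingly small
print(min_unique_length())                   # far shorter than 100
```

Under these assumptions a read much shorter than 100 letters would already be unique, so 100 letters leaves a wide margin. Real genomes are not random, so that margin does not hold in practice.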
![Page 9: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/9.jpg)
The genome is not random
ATCGATCAT?ATCGATCATA
repeats
Because of this the human genome STILL has gaps
![Page 10: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/10.jpg)
Repeats make it harder to assemble the genome puzzle
[Figure: segments A, B, C, and D joined through copies of a repeat R, drawn as an assembly graph]
If a repeat is longer than the reads
![Page 11: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/11.jpg)
Long-read DNA Sequencing
Pacific Biosciences
Oxford Nanopore MinION
>10X as expensive as next-generation (Illumina) sequencing
>100X read length
![Page 12: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/12.jpg)
Resolving repeats with long-read sequencing
[Figure: long reads that span repeat R together with flanking segments A, B, C, and D]
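A toy example may help show why read length matters here. This sketch uses made-up sequences and exact string matching rather than a real aligner: a read that falls inside the repeat matches in two places, while a read long enough to include flanking sequence matches only once.

```python
# Toy illustration with made-up sequences: short reads inside a repeat are
# ambiguous, long reads that reach into the flanks are not.

def find_all(genome, read):
    """Return every start position where read occurs in genome."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


REPEAT = "ATCGATCAT"
genome = "GGTTA" + REPEAT + "CCAGT" + REPEAT + "TTGAC"

short_read = REPEAT[1:8]   # lies entirely inside the repeat
long_read = genome[3:16]   # covers the first repeat copy plus flanking letters

print(find_all(genome, short_read))  # two candidate positions
print(find_all(genome, long_read))   # one position
```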
![Page 13: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/13.jpg)
Overview
Genome sequencing
Using Python to study cancer
Personal genomics
![Page 14: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/14.jpg)
How the human genome changes during cancer
Normal human genome
![Page 15: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/15.jpg)
How the human genome changes during cancer
(Davidson et al., 2000)
80 chromosomes instead of 46
Cancer genome
Cell line derived from a woman with metastatic breast cancer in 1971. The tumor cells have been grown and studied in the lab ever since.
![Page 16: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/16.jpg)
Split-read variant calling
chromosome 1
chromosome 2
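The idea of a split read can be sketched as follows: one part of a read matches chromosome 1 and the rest matches chromosome 2, which points to a junction between the two. This is a toy version using exact string matching on made-up sequences; real pipelines rely on an aligner, and `find_split` and `min_part` are hypothetical names.

```python
# Toy split-read detection using exact string matching.
# The chromosome sequences are invented for illustration.

def find_split(read, chrom_a, chrom_b, min_part=4):
    """Return (split, pos_a, pos_b) if read[:split] occurs in chrom_a and
    read[split:] occurs in chrom_b; return None if no such split exists."""
    for split in range(min_part, len(read) - min_part + 1):
        pos_a = chrom_a.find(read[:split])
        pos_b = chrom_b.find(read[split:])
        if pos_a != -1 and pos_b != -1:
            return split, pos_a, pos_b
    return None


chromosome_1 = "ATCGCCTAGGATTC"
chromosome_2 = "GTCCATAGCTTGAA"

# A read crossing a junction: first part from chromosome 1, second from chromosome 2.
read = chromosome_1[2:8] + chromosome_2[4:10]
print(find_split(read, chromosome_1, chromosome_2))

# A read taken entirely from chromosome 1 has no valid split.
print(find_split(chromosome_1[0:12], chromosome_1, chromosome_2))
```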
![Page 17: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/17.jpg)
A simple gene fusion
Gene1
Gene2
Gene1 Gene2
![Page 18: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/18.jpg)
A complex gene fusion
Gene1
Gene2
Gene1 Gene2
![Page 19: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/19.jpg)
SplitThreader: A new Python graph library for representing rearranged genomes
[Figure: CHR 1 (ATCGCCTA) and CHR 2 (GTCCATAG) drawn as a graph of sequence nodes (ATCG, CCGA, ATAG, GTCC) with edges labeled 8, 10, and 2]
![Page 20: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/20.jpg)
Class structure of SplitThreader
[Figure: class diagram of a Graph made up of Nodes, Ports, and Edges]
Once you enter a node, you must exit out the other side like a tunnel.
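One way to picture this class structure is sketched below. These classes only mirror the names on the slide (Graph, Node, Port, Edge); every method, attribute, and node name is a hypothetical illustration and not SplitThreader's actual code or API.

```python
# Hypothetical sketch of a port-based genome graph.
# Not SplitThreader's real implementation; names are for illustration only.

class Port:
    """One end of a node. Edges attach to ports, not to nodes directly."""

    def __init__(self, node, side):
        self.node = node
        self.side = side  # "start" or "end"
        self.edges = []


class Node:
    """A stretch of sequence with a port at each end."""

    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence
        self.ports = {"start": Port(self, "start"), "end": Port(self, "end")}

    def opposite(self, port):
        """Tunnel rule: entering through one port means leaving by the other."""
        other = "end" if port.side == "start" else "start"
        return self.ports[other]


class Edge:
    """A weighted connection between two ports."""

    def __init__(self, port_a, port_b, weight):
        self.ports = (port_a, port_b)
        self.weight = weight
        port_a.edges.append(self)
        port_b.edges.append(self)

    def other_end(self, port):
        return self.ports[1] if port is self.ports[0] else self.ports[0]


class Graph:
    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, name, sequence):
        self.nodes[name] = Node(name, sequence)
        return self.nodes[name]

    def connect(self, name_a, side_a, name_b, side_b, weight):
        edge = Edge(self.nodes[name_a].ports[side_a],
                    self.nodes[name_b].ports[side_b], weight)
        self.edges.append(edge)
        return edge


graph = Graph()
graph.add_node("chr1_left", "ATCG")
graph.add_node("chr2_right", "GTCC")
graph.connect("chr1_left", "end", "chr2_right", "start", weight=2)

# Enter chr1_left at its start, tunnel through, then follow the edge out.
entry = graph.nodes["chr1_left"].ports["start"]
exit_port = graph.nodes["chr1_left"].opposite(entry)
next_port = exit_port.edges[0].other_end(exit_port)
print(next_port.node.name, next_port.side)
```

Attaching edges to ports rather than to nodes is what makes the tunnel rule easy to enforce: a traversal always knows which side it came in on, so it can only continue from the opposite side.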
![Page 21: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/21.jpg)
Biological insights from SplitThreader
Depth-first search or breadth-first search
Gene fusion finding
History of mutations
![Page 22: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/22.jpg)
Using SplitThreader to find a gene fusion
CYTH1
EIF3H
Goal is to find a path like this:
CYTH1 EIF3H
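A path search of this kind can be sketched as a breadth-first search that respects the tunnel rule. The graph below is invented for illustration; only the gene names CYTH1 and EIF3H come from the slide, and `add_edge` and `find_fusion_path` are hypothetical helpers, not SplitThreader functions.

```python
from collections import deque

# Hypothetical sketch: fusion finding as breadth-first search over ports.
# A port is a (node, side) pair; edges connect ports in both directions.

def add_edge(edges, port_a, port_b):
    """Record an undirected connection between two ports."""
    edges.setdefault(port_a, []).append(port_b)
    edges.setdefault(port_b, []).append(port_a)


def find_fusion_path(edges, start_node, target_node):
    """Return the list of nodes on a shortest path, or None if there is none.

    After entering a node through one port, the search may only leave
    through the opposite port (the tunnel rule).
    """
    opposite = {"start": "end", "end": "start"}
    queue = deque()
    seen = set()
    for side in ("start", "end"):
        queue.append(((start_node, side), [start_node]))
        seen.add((start_node, side))
    while queue:
        (node, entry_side), path = queue.popleft()
        if node == target_node:
            return path
        exit_port = (node, opposite[entry_side])
        for next_port in edges.get(exit_port, []):
            if next_port not in seen:
                seen.add(next_port)
                queue.append((next_port, path + [next_port[0]]))
    return None


edges = {}
add_edge(edges, ("CYTH1", "end"), ("chr17_segment", "start"))   # made-up reference adjacency
add_edge(edges, ("CYTH1", "end"), ("chr8_segment", "end"))      # made-up rearrangement
add_edge(edges, ("chr8_segment", "start"), ("EIF3H", "start"))  # made-up reference adjacency

print(find_fusion_path(edges, "CYTH1", "EIF3H"))
print(find_fusion_path(edges, "chr17_segment", "EIF3H"))
```

The second call finds no path: reaching EIF3H from `chr17_segment` would require entering CYTH1 at its end and leaving through the same port, which the tunnel rule forbids.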
![Page 23: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/23.jpg)
Too many copies of Her2 contribute to making cancer worse
Sequencing
Actual genome
Her2
Too much Her2
Too much signal to divide
Too many cell divisions
Cancer grows
![Page 24: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/24.jpg)
About 40 copies of Her2 gene scattered around the genome
Her2 gene
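The slides do not say how the copy count was obtained. One common approach, shown here purely as an assumption, is to compare sequencing depth over a region with the depth over regions believed to be at a known copy number. The depths below are made up, and `baseline_copies=2` is itself an assumption that may not hold in a cancer genome.

```python
# Assumed approach, not taken from the talk: estimate copy number from the
# ratio of read depth in a region to read depth in a baseline region.

def estimate_copy_number(region_depth, baseline_depth, baseline_copies=2):
    """Scale the baseline copy number by the depth ratio."""
    if baseline_depth <= 0:
        raise ValueError("baseline_depth must be positive")
    return baseline_copies * region_depth / baseline_depth


# Made-up depths: 600x over the gene versus 30x over a two-copy baseline.
print(estimate_copy_number(600.0, 30.0))
```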
![Page 25: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/25.jpg)
Her2
Chr 17: 83 Mb
8 Mb
![Page 26: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/26.jpg)
Her2
Her2
![Page 27: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/27.jpg)
Her2
8 Mb
Chromosome 17
![Page 28: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/28.jpg)
Her2
Chr 17
Chr 8
1. Healthy chromosome 17
2. Sequence copied into chromosome 8
3. Subsequence copied within chromosome 8
4. Complex variant and inverted duplication within chromosome 8
5. Subsequence copied within chromosome 8
![Page 29: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/29.jpg)
SplitThreader is open source on GitHub
https://github.com/marianattestad/splitthreader
Visualization with D3.js is underway!
Contributions are very welcome
![Page 30: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/30.jpg)
Overview
Genome sequencing
Using Python to study cancer
Personal genomics
![Page 31: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/31.jpg)
Personal genomics
SNP chip
• 23andMe
• About $100
• Captures tiny mutations scientists already know to look for
Sequencing
• Illumina, SureGenomics
• About $1,000
• Captures large and small mutations even if completely novel and unexpected
![Page 32: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/32.jpg)
Personal genomics debates
• Should the government allow these companies to give people their genomic data?
  – How about interpreting the health risks?
• Is sharing your own genome breaking your family’s privacy?
![Page 33: Data and Python in Biology at PyData NYC 2015](https://reader033.fdocuments.us/reader033/viewer/2022042722/58a7c5651a28ab6b5a8b565d/html5/thumbnails/33.jpg)
THANK YOU