Enabling Genomic BIG DATA with Content Centric Networking
J.J. Garcia-Luna-Aceves, UC Santa Cruz


Example Today: Cancer Genome Hub (CGHub)

- CGHub's purpose was to store the genomes sequenced as part of The Cancer Genome Atlas (TCGA) project.
- At about 300 GB per genome, this translated to about 17,000 genomes over the 44-month lifetime of the project.
- The transmission requirements of the archiving effort reached a sustained rate of 17 Gbps by the end of the 44-month project (a back-of-the-envelope check of these figures appears below).

Cancer Genomics Hub (UCSC) is Housed in SDSC CoLo: Large Data Flows to End Users

[Figure: cumulative TBs of CGHub files downloaded, reaching roughly 30 PB as outbound link capacity grew from 1G to 8G to 15G. Data source: David Haussler, Brad Smith, UCSC.]

Example Today (continued)

- CGHub had to use current technology:
  - Data organization and search: XML schema definitions
  - Security: existing techniques (HTTPS)
  - Big data transfer: modified BitTorrent (GeneTorrent, or GT)
- HTTPS and BitTorrent are problematic:
  - No caching with HTTPS
  - TCP limitations percolate to the multiple connections opened under BT or GT
  - A potential playground for DDoS?

The Future of Genomic BIG DATA

- Is the Internet ready to support personalized medicine?
- Is the future of genomic data really different? If not, what technology would be limiting progress?
- First: genomic data are really BIG DATA. Personalized medicine will make genomic data volumes explode, and many other applications of genomic data will develop.
- Even if one site or a few mirrors are used for a personal genome, it still has to be uploaded.

Is Technology Ready in 5-10 Years?

- Communication, storage, and computing technologies are not the problem:
  - Production optical links run at 1 Tbps (http://www.lightreading.com/document.asp?doc_id=188442)
  - Individual hosts will be able to transmit at 100 Gbps
  - I/O throughput can keep up with network speeds (i.e., disks will be able to handle 100 Gbps = 12.5 GB/s)
  - Memory and processing costs will continue to decline.

Networking is The BIG PROBLEM for Genomic BIG DATA

- The speed of light will not increase, but the number of genomic data repositories, and the distances between them, will.
- The Internet protocol stack was not designed for BIG DATA transfer over paths with large bandwidth-delay products:
  - TCP throughput (see the throughput sketch below)
  - DDoS vulnerabilities (e.g., SYN flooding)
  - Caching vs. privacy (e.g., HTTPS)
  - Static directory services (e.g., DNS vs. content directories)

Sobering Results for Today's Internet

- TCP and its variations (e.g., BT) cannot be the baseline to support big-data genomics.
- Storage must be used to reduce bandwidth-delay products.
- Simulation results:
  - 4-day simulation
  - 20 locations
  - 40 Gbps links with 5 to 25 ms latency
  - average node degree of 5
  - [Figure: throughput of TCP (client/server) vs. a content-centric approach]

Internetworking BeND

- The TCP/IP architecture must change for BIG DATA, but how?
- Content Centric Networking (CCN) architectures such as NDN and CCNx have been proposed.
- The main advantage of CCN solutions is caching (see the caching sketch below).
- But NDN and CCNx are still at early stages of development.
- Big Data Networking is all about the bandwidth-delay product, not about replacing IP addresses with names.
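
A quick sanity check of the CGHub figures quoted above, as a minimal Python sketch. The constants are the numbers from the slides; the 30-day month length is an assumption.

```python
# Back-of-the-envelope check of the CGHub figures quoted in the slides.
GENOME_SIZE_GB = 300      # ~300 GB per sequenced genome
NUM_GENOMES = 17_000      # genomes archived over the project
PROJECT_MONTHS = 44       # lifetime of the archiving effort

total_gb = GENOME_SIZE_GB * NUM_GENOMES
print(f"Total archive: {total_gb / 1e6:.1f} PB")   # ~5.1 PB

# Average rate needed just to ingest the archive over 44 months
# (assuming 30-day months).
seconds = PROJECT_MONTHS * 30 * 24 * 3600
print(f"Average ingest rate: {total_gb * 8 / seconds:.2f} Gbps")  # ~0.36 Gbps
```

Ingest alone averages well under 1 Gbps, so the sustained 17 Gbps the slide reports presumably reflects total traffic including downloads to end users, consistent with the roughly 30 PB of cumulative downloads in the figure above.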
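To make the bandwidth-delay-product objection concrete, here is a minimal sketch of two textbook limits on a single TCP flow: the window limit (one window per round trip) and the Mathis et al. loss-limited estimate. The formulas are standard TCP analysis, not taken from the talk; the link parameters match the simulation slide, treating the 25 ms latency figure as a round-trip time.

```python
import math

def bdp_bytes(rate_bps, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return rate_bps * rtt_s / 8

def window_limited_bps(window_bytes, rtt_s):
    """A TCP sender can transmit at most one window per round trip."""
    return window_bytes * 8 / rtt_s

def mathis_bps(mss_bytes, rtt_s, loss_rate):
    """Mathis et al. steady-state throughput of a Reno-style flow under random loss."""
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5 / loss_rate)

rate_bps, rtt_s = 40e9, 0.025   # 40 Gbps link, 25 ms RTT (simulation slide)
print(f"BDP: {bdp_bytes(rate_bps, rtt_s) / 1e6:.0f} MB must be in flight")
print(f"64 KB window: {window_limited_bps(64 * 1024, rtt_s) / 1e6:.0f} Mbps")
print(f"Mathis, p = 1e-6: {mathis_bps(1460, rtt_s, 1e-6) / 1e9:.2f} Gbps")
```

Filling a 40 Gbps, 25 ms path requires about 125 MB in flight, and even a one-in-a-million loss rate caps a classic TCP flow near 0.6 Gbps, under 2% of the link. Opening many connections (the BitTorrent/GeneTorrent approach) only spreads this limitation across flows, which is the point of the "TCP limitations percolate" bullet above.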
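The caching advantage the last slide attributes to CCN can be illustrated with a toy example. This is a sketch of the idea only, not NDN or CCNx code, and the names and classes here are hypothetical: requests are addressed to content names, so any router that has cached the named data can answer without forwarding upstream, whereas with HTTPS the payload is encrypted end to end and intermediaries have nothing reusable to cache.

```python
# Toy illustration of name-based in-network caching (not real NDN/CCNx code).

class Origin:
    """The publisher holding the authoritative copy of each named object."""
    def __init__(self, objects):
        self.objects = objects

    def request(self, name):
        return self.objects[name]

class Router:
    """Forwards requests by content name, caching data on the return path."""
    def __init__(self, upstream):
        self.content_store = {}   # name -> cached data
        self.upstream = upstream  # next hop toward the origin

    def request(self, name):
        if name in self.content_store:
            return self.content_store[name]   # hit: no upstream traffic
        data = self.upstream.request(name)    # miss: forward by name
        self.content_store[name] = data       # cache for later requesters
        return data

origin = Origin({"/genomes/TCGA/sample42/chunk0": b"ACGT" * 4})
edge = Router(upstream=Router(upstream=origin))
edge.request("/genomes/TCGA/sample42/chunk0")  # first fetch reaches the origin
edge.request("/genomes/TCGA/sample42/chunk0")  # answered from the edge cache
```

For genomic workloads, where many sites repeatedly download the same reference files, such in-network copies are exactly the "storage used to reduce bandwidth-delay products" that the sobering-results slide calls for.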