Transcript of "Big Data," a webinar by George K. Thiruvathukal, PhD, IEEE Computer Society, Loyola University Chicago.

Big Data

George K. Thiruvathukal, PhD
IEEE Computer Society, Loyola University Chicago


Evolution of the tat gene from HIV isolates taken from the US 1990-2009

We will come back to this in the case study.


Topics

• What is Big Data?
• The Sliding Scale of Big Data
• Brief Observations about Computing Education and Big Data
• Sources of Big Data
• Emerging Technologies/Techniques
  – NoSQL approaches (MongoDB)
  – Private Clouds and OpenStack
  – Post-Java Era and Scala
  – Using Python as Glue Language
  – RESTful Thinking
• Case Study: Building a Genomic Data Warehouse to Study HIV (and Other Virus) Evolution
• Future Directions
• Acknowledgments


Big Data "Defined"

"Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective and incorporates a moving definition of how big a dataset needs to be in order to be considered big data—i.e., we don't define big data in terms of being larger than a certain number of terabytes (thousands of gigabytes). We assume that over time the definition of datasets that qualify as big data will also increase. Also note that the definition can vary by sector, depending on what kinds of software tools are commonly available and what sizes of datasets are common in a particular industry. With those caveats, big data in many sectors today will range from a few dozen terabytes to multiple petabytes (thousands of terabytes)."

Source: Big data: The next frontier for innovation, competition, and productivity (McKinsey and Company), http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation


McKinsey Report on Big Data

• $600 to buy a hard drive that stores all the world's music
• 5 billion mobile phones in use
• 30 billion Facebook posts
• 40% growth in global data generated per year
  – Corresponding 5% growth in global IT spending
• 235 TB of data collected by the US Library of Congress by April 2011
• 15 of 17 sectors in the USA store more data per company than the Library of Congress
• $300 billion in potential value to US health care
• €250 billion in potential value to the European public sector
• Continued importance to retail operations (not just Wal-Mart anymore!)
• Shortage of analytical talent
• Shortage of data-savvy managers


The Audacity of Storage

Source: Seagate Electronics Web Site (seagate.com)


Units of Measurement

How many of these 4 TB drives does each unit require?

• Terabyte – 0.25
• Petabyte – 250
• Exabyte – 250,000
• Zettabyte – 250,000,000
• Yottabyte – 250,000,000,000

• We're well on our way to exascale and beyond in most domains. Big Data today implies exascale thinking.
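The drive counts above follow from simple arithmetic; a quick back-of-the-envelope sketch (decimal SI prefixes assumed, matching the slide's figures):

```python
# Back-of-the-envelope check of the drive counts above: how many 4 TB
# drives does one of each unit require (decimal SI prefixes)?
DRIVE_TB = 4
units_in_tb = {
    "terabyte": 1,
    "petabyte": 10**3,
    "exabyte": 10**6,
    "zettabyte": 10**9,
    "yottabyte": 10**12,
}
drives_needed = {name: tb / DRIVE_TB for name, tb in units_in_tb.items()}
# e.g. drives_needed["terabyte"] == 0.25, drives_needed["petabyte"] == 250.0
```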


The CS Education Challenge

• Big data skills are still not a primary focus at most universities
• "Computational thinking" is good, but inclusion of "data science" is even better
  – Traditional HPC is more computationally driven
  – Cloud computing is more application/data driven
• Need to teach "out of core" techniques to excel in the data-driven world
• Need to teach "web/distributed scale" in order to query the data effectively
• Various initiatives are beginning to connect the dots
  – EduPar, to learn parallel/distributed computing principles early
  – Emergence of "data science" academic programs
• In short, we need to rethink whether CS accurately describes what we are doing!


Sources of Big Data

• social media: tweets, likes (+1s), etc.
  – http://www.fastcompany.com/3013208/these-amazing-twitter-metadata-visualizations-will-blow-your-mind
• user-generated content like photos and videos
• VoIP traffic
• customer and B2B transactions
• GPS-equipped cell phones
• mobile devices
• system logs of all kinds
• RFID tag sensors embedded in everything from airport runways to casino chips
• e-mail
• …mostly unstructured -> documents as opposed to relations (a la RDBMS)
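The last bullet, documents rather than relations, can be made concrete with a toy record. All field names below are invented for illustration; the point is that a single nested document captures what a relational design would normalize into several joined tables:

```python
import json

# Hypothetical social-media record: one self-contained document holds
# nested, variable-shape data that a relational design would normalize
# into several joined tables (users, posts, tags, ...).
post = {
    "user": {"id": 42, "handle": "@example"},
    "text": "big data!",
    "tags": ["bigdata", "nosql"],  # variable-length list, no join table needed
    "geo": None,                   # optional field, no rigid column schema
}

roundtrip = json.loads(json.dumps(post))  # documents serialize naturally
```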


Distributed Design Principles are Vitally Important to Big Data

• NoSQL is emerging as the way to operate at web (distributed) scale
• Transparency principles (from Coulouris, Dollimore, Kindberg):
  – Access transparency: enables local and remote objects to be accessed using identical operations
  – Location transparency: location of resources is hidden
  – Migration transparency: resources can move without changing names
  – Replication transparency: users cannot tell how many copies exist
  – Concurrency transparency: multiple users can share resources automatically
  – Parallelism transparency: activities can happen in parallel without the user knowing about it
  – Failure transparency: concealment of faults
• Modern NoSQL databases employ most of these, especially when combined with cloud computing.


MongoDB: Secret Sauce of Many Big Data Environments

• Document storage model (JSON/BSON)
• Ad hoc/dynamic schemas (by default)
  – JSON does not imply total absence of a schema
  – http://json-schema.org/
• Distributed design principles/sharding
  – Replication
  – High availability
  – Distributed query processing
• Full indexing support
• Atomic updates and transactions (if required)
• Map/Reduce a la Hadoop (JavaScript)
• Other embedding possible with JavaScript (built around Node.js technology)
• File storage via GridFS
• …capabilities not limited to MongoDB; see http://couchdb.apache.org/ (same organization behind Hadoop)
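A minimal sketch of the document model, and of the point that a dynamic schema is not the absence of a schema. This uses only plain Python and the standard library's json module, no MongoDB server; the field names mirror the case study later in the talk, and the validator is a lightweight stand-in for a real json-schema.org validator:

```python
import json

# Illustrative "document" in the MongoDB sense: nested, ad hoc structure.
# Field names mirror the HIV case study later in the talk.
doc = {
    "accession": "HM067748",
    "gene": "gag",
    "country": "China",
    "features": [{"type": "CDS", "start": 747, "end": 2226}],
}

# "Dynamic schema" does not mean "no schema": a lightweight required-keys
# check in the spirit of json-schema.org, standard library only.
def validate(document, required=("accession", "gene")):
    return all(key in document for key in required)

# What a driver would serialize (MongoDB actually stores BSON, not JSON text).
encoded = json.dumps(doc)
assert validate(json.loads(encoded))
```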


Distributed Architecture of MongoDB

• mongod – the main database process, one per shard
  – Replica sets allow for failover execution with a revolving master model
• mongos – routers that expose the collection of shards as a single server
• Config servers – servers that contain metadata about the cluster and which chunks of data reside on each shard
• MongoDB allows reasonable operation to continue (often in read-write mode) in the presence of single daemon/node failures


MongoDB Sharding/Replicas


Sharding (Literature)

• BigTable
  – http://research.google.com/archive/bigtable.html
• Yahoo! Hosted Data Serving Platform
  – http://www.brianfrankcooper.net/pubs/pnuts.pdf


NoSQL vs. and/or Hadoop

• Doesn't need to be either-or; could be "and".
• Hadoop
  – Ideal for periodic jobs that do map-reduce (e.g., ETL or building a data warehouse from multiple sources)
  – Transparent support for clustered execution
• NoSQL (e.g., MongoDB, a front-runner in many projects)
  – Built-in aggregation, including map-reduce support a la Hadoop
  – Sharding allows for distributed query processing from any node (and in any language)
• Additional reading:
  – http://docs.mongodb.org/ecosystem/use-cases/hadoop/
  – http://docs.mongodb.org/manual/tutorial/map-reduce-examples/


OpenStack IaaS (Infrastructure as a Service)


OpenStack Components

• Object Store: storage and retrieval of files
• Image: catalog/repository of virtual disk images
• Compute: virtual servers on demand
• Dashboard: user interface for accessing all components
• Identity: authentication/authorization
• Network: network connectivity as a service (on demand as well)
• Block Storage: block storage for guest VMs (similar to iSCSI)
• http://en.wikipedia.org/wiki/OpenStack


Emerging Case Study

• Phylogenetic Analysis of HIV-1
  – Evolution of HIV happens so rapidly that we need to build an online analytical system for understanding it in space and time.
• Components/Pipeline
  – Genbank data ETL (Scala + Python); import into MongoDB warehouse
  – RESTful querying to slice/dice gene information into FASTA (the format used by alignment tools)
  – Example of server-side JavaScript map-reduce embedded analytics using MongoDB
  – Alignment and visualization using existing (offline) tools
• http://ecommons.luc.edu/cs_facpubs/68/
• Our approach is best summarized: use the best tools and languages for the task you are trying to do (polyglot)
  – The case study uses 3 programming languages, MongoDB, a web services framework (Flask), existing bioinformatics tools, and a test VMware cluster for hosting it all before moving to an IaaS solution.


Working with Genbank Data

• Not completely unstructured, but messy nevertheless.
• Often contains errors, owing to the complexity of the domain.
• Errors in most cases are innocuous. We can correct them offline and add to the warehouse later (or whenever).
• Parsers present unwanted complexity; see the API at http://www.biojava.org/docs/api16/org/biojava/bio/seq/io/SeqIOTools.html
• Our approach: transform the Genbank data directly into MongoDB "documents" (JSON objects) for a posteriori and long-term analysis
  – …and never think about Genbank format again!
• Focus is on extracting features of interest, although our future efforts will be to transform the entire Genbank corpus into documents for easier parsing/processing (for understanding other viruses, etc.)
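The "Genbank record to MongoDB documents" transformation can be sketched as follows. This is plain Python with invented field names (the real pipeline uses BioJava and Scala, shown later); the accession, country, and gag positions echo the HM067748 example that follows, but the origin string here is synthetic:

```python
# Plain-Python sketch of "Genbank record -> per-gene MongoDB documents".
# Field names are invented for illustration; the real pipeline uses
# BioJava/Scala. Positions echo the HM067748 gag gene, but the origin
# string below is a synthetic stand-in for the real 9680 bp sequence.
record = {
    "accession": "HM067748",
    "source": {"country": "China", "collection_date": "16-Oct-2006", "note": ""},
    "genes": [{"gene": "gag", "start": 747, "end": 2226}],
    "origin": "atgc" * 2420,  # 9680 characters of fake nucleotides
}

def gene_documents(rec):
    """Flatten one record into the per-gene documents the warehouse stores."""
    for g in rec["genes"]:
        yield {
            "accession": rec["accession"],
            "gene": g["gene"],
            "country": rec["source"]["country"],
            "date": rec["source"]["collection_date"],
            # Genbank positions are 1-based and inclusive; Python is 0-based.
            "sequence": rec["origin"][g["start"] - 1 : g["end"]],
        }

docs = list(gene_documents(record))
```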


LOCUS       HM067748                9680 bp    DNA     linear   VRL 27-JUN-2010
DEFINITION  HIV-1 isolate nx2 from China, complete genome.
ACCESSION   HM067748
VERSION     HM067748.1  GI:298919707
KEYWORDS    .
SOURCE      Human immunodeficiency virus 1 (HIV-1)
  ORGANISM  Human immunodeficiency virus 1
            Viruses; Retro-transcribing viruses; Retroviridae;
            Orthoretrovirinae; Lentivirus; Primate lentivirus group.
REFERENCE   1  (bases 1 to 9680)
  AUTHORS   Miao,W., Liu,Y., Wang,Z., Zhuang,D., Bao,Z., Li,H., Liu,S., Li,L. and Li,J.
  TITLE     Sequence and characterization of full-length genome of two HIV-1 strains isolated from two infected patients in China
  JOURNAL   Unpublished
FEATURES             Location/Qualifiers
     source          1..9680
                     /organism="Human immunodeficiency virus 1"
                     /proviral
                     /mol_type="genomic DNA"
                     /isolate="nx2"
                     /host="Homo sapiens"
                     /db_xref="taxon:11676"
                     /country="China"
                     /collection_date="16-Oct-2006"
     LTR             1..591
     gene            747..2226
                     /gene="gag"
                     /note="gag protein"
     gene            <2030..5030
                     /gene="pol"
                     /note="pol protein"
     gene            4975..5556

Most of the data in this flat file is not important to our study, Watson. We need to import the bold fields and extract the data from the DNA in the next slide. (Both of these slides are one Genbank data file.)

Crick and Watson, DNA (1953)


ORIGIN
    1 ttgatttgtg ggtctatcac acacaaggct acttccctga ttggcacaac tacacaccgg
   61 gaccagggac cagattcccg ctgacttttg ggtggtgctt caagctagta ccagttgacc
  121 caagggaagt agaagaggcc agcgaaggag aagacaacag tttgctacac cctgtctgcc
  181 agcatggaat ggaggatgaa cacagagaag tgttaaagtg gaagtttgac agccaattag
  241 catacagaca ctgggcccgc gagctacatc cggagtttta caagaactgc tgatacagaa
  301 gggactttcc gcgggacttt ccaccagggc gttccgggag gtgtggtctg ggcggtactg
  361 ggagtggtca accctcagat gctgcatata agcagctgct ttgcgcctgt accgggtctc
  421 ttagttagac cagatctgag cctgggagct ctctggctag ctaggaaccc actgcttaag
  481 cctcaataaa gcttgccttg agtgctctga gcagtgtgtg cccatctgtt gtgtgactct
  541 ggtaactaga gatccctcag acccttgtgg cagtgtggaa aatctctagc agtggcgccc
  601 gaacaggggc aagaaaagga aaatgagacc cgaggggatt tcttgacgca ggactcggct
  661 tgctgaagtg cactcggcaa gaggcgagag gggcgactgg tgagtacgcc aattttattt
  721 gactagcgga ggctagaagg agagagatgg gtgcgagagc gtcaatatta agaggggaaa
  781 aattggataa atgggaaaga attaggttaa ggccaggggg aaagaaacac tatctgctaa
  841 aacacatagt atgggcaagc agagagctgg aaaaatttgc acttaaccct ggccttttag
  901 agacatcaga aggatgtaag caaataataa aacagctaca accagctctt cagacaggaa
  961 cagaggaact taaatcatta tacaacacag tagcagttct ctattgtgta catgaaaaaa
 1021 tagacatacg agacaccaaa gaagccttag acaagataga agaagaacaa aataaatgtc
 1081 agcagaaaac acagcaggca aaaaaggatg atgagaaggt tagtcaaaat tatcctatag
 1141 tgcagaatct ccaagggcac atggtacatc agcctctatc acctagaact ttaattgcat
 1201 gggtaatagt agtggacaga gaagactcct tagctcagaa gtaatacccc tgttcacagc
 1261 ataatcagaa ggagccaccc cacaagatct aaactccatg ttaaatacag tagggcgaca
 1321 tcaagcagct atgcaaatgt taaaagatac catcaatgga gaggctgcag aatgagatag
 1381 attgcatcca gtgcatgcag ggccagtggc accaggccag atgagagaac caaggggtag
 1441 tgacatagca ggaactacta gtactctcca ggagcaaata ggatggatga caaataatcc
 1501 acctatccca gtaggagaaa tctataaaag atggataatc gtcggattaa ataaattagt
[...]
 9541 gcctgggagc tctctggcta gctaggaacc cactgcttaa gcctcaataa agcttggctt
 9601 gagtgctctg agcagtgtgt gcccatctgt tgtgtgactc tggtaactag agatccctca
 9661 gacccttgtg gcagtgtgga

One of the genes of interest (gag) is at positions 747..2226. Need to extract lots of these for our HIV repository!


Basic Structure

• Genbank is a collection of sequences
• A sequence is a collection of features
• Sequences and features have annotations (e.g., simple key/value pairs, possibly having ad hoc structure within)
• We use Scala (an emerging object-functional language; runs on the JVM) to parse this format.
  – The task is naturally suited to the stream-oriented facilities found in functional languages
  – Support for "failure" as a concept allows processing to continue meaningfully
  – Leverages the existing BioJava library, which is adapted to Scala
• The Scala code writes a delimiter-separated file, which is postprocessed by Python to create/update MongoDB entries.


Scala Genbank Parsing

• The next few slides show
  – The Scala Genbank file parser (shows how to extract the entire corpus of files as a stream)
  – The Python postprocessor to transform the flattened stream produced by Scala into a MongoDB collection (Python is a great glue language!)
  – The Python RESTful API to query the collection for offline analytics (using Clustal and visualization tools)
  – An embedded analytics example using native map-reduce with JavaScript in MongoDB
• The full source for what we're doing is available from our Bitbucket repository at:
  – https://bitbucket.org/loyolachicagocs_bioi/hiv-biojava-scala
• This is still a work in progress but is being used to build our computational biology data warehouse.


Scala Genbank File Parser (Importer)


object TryBio {
  case class SourceInformation(country: String, collectionDate: String, note: String)
  case class SequenceInformation(accession: String, origin: String)
  case class GeneInformation(gene: String, start: Int, end: Int)

  /**
   * Converts an annotation to a properly typed Scala map.
   */
  implicit def annotationAsScalaMap(annotation: Annotation) =
    annotation.asMap.asInstanceOf[JMap[String, String]].asScala

  def processFile(file: java.io.FileReader) = {
    val sequences: JIterator[Sequence] = SeqIOTools.readGenbank(new BufferedReader(file))
    for {
      seq <- sequences.asScala
      seqInfo <- getSequenceInformation(seq)
      sourceInfo <- getSourceInformation(seq)
      gene <- getGenes(seq)
    } {
      val fields = List(seqInfo.accession, gene.gene, sourceInfo.country,
        sourceInfo.collectionDate, sourceInfo.note,
        seqInfo.origin.substring(gene.start, gene.end))
      println(fields.mkString("|"))
    }
  }

  def main(args: Array[String]) {
    for (arg <- args) {
      val f = new FileReader(arg)
      processFile(f)
      f.close()
    }
  }
}


def getSequenceInformation(sequence: Sequence): Option[SequenceInformation] =
  for {
    // returns None for sequences without accession so they get skipped in main
    acc <- sequence.getAnnotation get "ACCESSION"
    origin = sequence.seqString
  } yield SequenceInformation(acc, origin)

def getSourceInformation(sequence: Sequence): Option[SourceInformation] =
  for {
    // returns None for non-source sequences so they get skipped in main
    f <- sequence.features.asScala.find { _.getType == "source" }
    a = f.getAnnotation
  } yield SourceInformation(
    a.getOrElse("country", UNKNOWN_COUNTRY),
    a.getOrElse("collection_date", UNKNOWN_DATE),
    a.getOrElse("note", UNKNOWN_NOTE))

private val allowedGenes = Set("gag", "pol", "env", "tat", "vif", "rev", "vpr", "vpu", "nef")

def getGenes(sequence: Sequence): Iterator[GeneInformation] =
  for {
    f <- sequence.features.asScala
    // skip features without gene annotation
    g <- f.getAnnotation get "gene"
    if f.getType == "CDS" && (allowedGenes contains g)
    l = f.getLocation
  } yield GeneInformation(g, l.getMin - 1, l.getMax - 1)


Using Python to Import Genbank Stream into Mongo


def main():
    mongo_db_name = sys.argv[1]

    # Assume Mongo is running on localhost at its defaults
    client = MongoClient()
    db = client[mongo_db_name]

    if db.posts.count() > 0:
        print("Mongo database %s is not empty. Please create a new one." % mongo_db_name)
        sys.exit(1)

    for line in sys.stdin:
        (accession, gene, country, date, note, sequence) = line.strip().split("|")[:6]
        document = {
            'accession': clean(accession),
            'gene': clean(gene),
            'country': clean(country),
            'date': clean(date),
            'note': clean(note),
            'sequence': sequence
        }
        db.posts.insert(document)

    print("Wrote %d documents" % db.posts.count())


RESTful Services using Python and Flask Micro Web Framework


RESTful Queries for Querying the Genbank HIV Corpus

• The RESTful architectural style exposes the collection as a discoverable hierarchy of resources:
  – <base>/genbank: returns the datasets that we've imported (hiv is the only one right now)
  – <base>/genbank/<collection>: returns the list of discovered genes
  – <base>/genbank/<collection>/<gene>: returns a FASTA file for all files where <gene> was present
  – <base>/genbank/<collection>/unknown/<thing>: produces a report of what data were not imported (<thing> is a code indicating country, date, or notes that are needed to support the previous three common queries)
• Self-hosted live sandbox (not OpenStack based yet)
  – http://tirtha.cs.luc.edu:5000/genbank/


Queries (Live)

• Show datasets
  – http://tirtha.cs.luc.edu:5000/genbank/
  – You'll see "hiv", perhaps others (test databases)
• Show genes within the "hiv" collection
  – http://tirtha.cs.luc.edu:5000/genbank/hiv/
  – You'll see gene names, e.g., gag, env, …
• Show FASTA file for offline alignment/analytics (eventually we'll embed it in our web services)
  – http://tirtha.cs.luc.edu:5000/genbank/hiv/env (gives FASTA for the env gene across ALL datasets)
  – http://tirtha.cs.luc.edu:5000/genbank/hiv/gag (gives FASTA for the gag gene across ALL datasets)
• These queries are all achieved using a RESTful service, written using a Python micro web framework and MongoDB directly. We'll look at the code.


@app.route("/genbank")
def get_databases():
    client = MongoClient()
    db_names = client.database_names()
    text = '\n'.join(db_names)
    print("DB Names", text)
    resp = Response(text, status=200, mimetype='text/plain')
    return resp


@app.route("/genbank/<collection>")
def get_collection_gene_names(collection):
    text = '\n'.join(get_collection_genes(collection))
    resp = Response(text, status=200, mimetype='text/plain')
    return resp

def get_collection_genes(collection):
    client = MongoClient()
    db = client[collection]
    return db.posts.distinct('gene')


FASTATEMPLATE = """>%(accession)s|%(gene)s|%(country)s|%(date)s|%(note)s
%(sequence)s
"""

def get_fasta(collection, gene):
    client = MongoClient()
    db = client[collection]
    cursor = db.posts.find({ 'gene' : gene })
    fasta = StringIO.StringIO()
    for item in cursor:
        fasta.write(FASTATEMPLATE % item)
    text = fasta.getvalue()
    fasta.close()
    return text

@app.route("/genbank/<collection>/<gene>")
def get_collection_gene(collection, gene):
    resp = Response(get_fasta(collection, gene), status=200, mimetype='text/plain')
    return resp


Doing Hadoop Map-Reduce Style Processing directly (via Mongo Shell)


Map-Reduce Mongo Style

• JSON is the native storage format of MongoDB.
• JavaScript is the native query language.
• Much of what Hadoop does can be done in JavaScript without writing full Java programs/classes.
• Map function: projects a list of key/value pairs from JSON documents (by selecting attributes of interest)
• Reduce function: iterates over the keys and/or values of interest using JavaScript libraries for aggregate operations (or your own code)
• Fully interactive execution makes it easy to test code without launching executable jobs to do it
  – Doesn't fully replace Hadoop (no job control), but that can be addressed with off-the-shelf job schedulers/load balancers (say, in a cluster).
• The following example shows how to create a map-reduce computation to determine the average length of the nucleotide sequences discovered in our Genbank data set.
• This can run in parallel/distributed mode in a sharded configuration.
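The shape of this computation can also be sketched in plain Python: the map phase emits (accession, sequence) pairs, and the reduce phase averages the sequence lengths per accession. The documents below are invented toy data:

```python
from collections import defaultdict

# Map-reduce semantics in plain Python: map emits (accession, sequence)
# pairs; reduce averages sequence lengths per accession. Toy documents.
posts = [
    {"accession": "A1", "sequence": "atgc" * 10},
    {"accession": "A1", "sequence": "atgc" * 20},
    {"accession": "B2", "sequence": "at" * 5},
]

def map_phase(docs):
    grouped = defaultdict(list)
    for doc in docs:  # corresponds to emit(this.accession, this.sequence)
        grouped[doc["accession"]].append(doc["sequence"])
    return grouped

def reduce_phase(accession, sequences):  # corresponds to computeAvgSequenceLength
    return sum(len(s) for s in sequences) / float(len(sequences))

results = {k: reduce_phase(k, v) for k, v in map_phase(posts).items()}
# results == {"A1": 60.0, "B2": 10.0}
```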


Map/Reduce Using MongoDB

var computeAvgSequenceLength = function(accession, sequences) {
    var total = 0;
    for (var i = 0; i < sequences.length; i++) {
        total += sequences[i].length;
    }
    return (total + 0.0) / sequences.length;
}

var emitByAccessionSequence = function() {
    emit(this.accession, this.sequence);
}

db.results.remove()
db.posts.mapReduce(
    emitByAccessionSequence,
    computeAvgSequenceLength,
    { out : "results" }
)

var results = db.results.find()

while (results.hasNext()) {
    var result = results.next();
    print("average(", result['_id'], ") = ", result['value']);
}


Output

mongo localhost:27017/hiv mapreduce.js

average( JN235962 ) = 1694.5714285714287
average( JN235963 ) = 1705.857142857143
average( JN235964 ) = 1703.7142857142858
average( JN235965 ) = 1563.7142857142858
average( JN248316 ) = 1443.375
average( JN248317 ) = 1446
average( JN248318 ) = 1570
average( JN248319 ) = 1425
average( JN248320 ) = 1431.857142857143
average( JN248321 ) = 1591.4444444444443
average( JN248322 ) = 1584.3333333333333
average( JN248323 ) = 1441.875
average( JN248324 ) = 1430.142857142857
average( JN248325 ) = 1579
average( JN248326 ) = 1558.8333333333333
average( JN248327 ) = 1439.625
average( JN248328 ) = 1562.5714285714287
average( JN248329 ) = 1558.4444444444443
average( JN248330 ) = 1570.5555555555557
average( JN248331 ) = 1451.375
…


Early Visualizations of HIV Evolution (from our warehouse/web services)

• The pipeline presented thus far allows us to use existing tools/services to do the visualization.
• The BioEdit workstation tool is used to show a colorized view/alignment of sequence data obtained by gene.
• The Dendroscope library is used to show the hierarchical decomposition (the phylogenetic tree) of how the virus has evolved.
• Details are beyond the scope of this talk, but we know we can get from here to a real-time, longitudinal view of what HIV is doing.
• Future work will be to embed all of this as RESTful services and use a cluster to render any visualization on demand.


Colorized view of FASTA data (acquired by web service/Mongo collection)


Examples of trees – evolution of the tat gene from HIV isolates taken from the US 1990-2009


Same tree… different view


Same tree… different view


Future Directions

• Deploy to an OpenStack-based private cloud (in progress)
• Import the entire Genbank corpus and other genomics data sets
• Incorporate alignment as embedded analytics to precompute visualizations of interest
• Integrate visualization into the web services
• Work on a new predictive piece to identify emerging threats/mutations
• Early results suggest MongoDB can do most queries on important slices of data (virus/gene) in fractions of a second, including map-reduce style
  – We're hoping to import the entire Genbank corpus after getting our private cloud established


Acknowledgments

• Debbie Sims and colleagues at the IEEE Computer Society (for the opportunity to give this webinar)
• Catherine Putonti, Loyola University Chicago (Biology and Computer Science)
• Steven Reisman (graduate student, Computer Science) for work on the longitudinal visualizations
• Joe Kaylor and Konstantin Läufer, Loyola University Chicago, for pairing on the Scala and RESTful services work
• Manish Parashar, Rutgers University (Computer Science), for discussing our shared view of big data
• Rusty Eckman (Northrop Grumman) for his helpful input and feedback