Big Data George K. Thiruvathukal, PhD IEEE Computer Society, Loyola University Chicago.
Big Data
George K. Thiruvathukal, PhD
IEEE Computer Society, Loyola University Chicago
Evolution of the tat gene from HIV isolates taken from the US 1990-2009
We will come back to this in the case study.
Topics
• What is Big Data?
• The Sliding Scale of Big Data
• Brief Observations about Computing Education and Big Data
• Sources of Big Data
• Emerging Technologies/Techniques
  – NoSQL approaches (MongoDB)
  – Private Clouds and OpenStack
  – Post-Java Era and Scala
  – Using Python as Glue Language
  – RESTful Thinking
• Case Study: Building a Genomic Data Warehouse to Study HIV (and other virus) Evolution
• Future Directions
• Acknowledgments
Big Data "Defined"
"Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective and incorporates a moving definition of how big a dataset needs to be in order to be considered big data; i.e., we don't define big data in terms of being larger than a certain number of terabytes (thousands of gigabytes). We assume that, over time, the definition of datasets that qualify as big data will also increase. Also note that the definition can vary by sector, depending on what kinds of software tools are commonly available and what sizes of datasets are common in a particular industry. With those caveats, big data in many sectors today will range from a few dozen terabytes to multiple petabytes (thousands of terabytes)."
Source: "Big data: The next frontier for innovation, competition, and productivity" (McKinsey and Company)
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
McKinsey Report on Big Data
• $600 to buy a hard drive that stores all the world's music
• 5 billion mobile phones in use
• 30 billion Facebook posts
• 40% growth in global data generated per year
  – Corresponding 5% growth in global IT spending
• 235 TB of data collected by the US Library of Congress by April 2011
• 15 of 17 sectors in the USA store more data than the Library of Congress
• $300 billion business to US health care
• 250 billion euro to the European public sector
• Continued importance to retail operations (not just Wal-Mart anymore!)
• Shortage of analytical talent
• Shortage of data-savvy managers
The Audacity of Storage
Source: Seagate Electronics Web Site (seagate.com)
Units of Measurement
• Terabyte – 0.25 of these 4 TB drives
• Petabyte – 250
• Exabyte – 250,000
• Zettabyte – 250,000,000
• Yottabyte – 250,000,000,000
• We’re well on our way to exascale and beyond in most domains. Big Data today implies exascale thinking.
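The drive counts in the list above follow from simple decimal arithmetic; here is a quick sketch to verify them, assuming 4 TB drives and the decimal, vendor-style units (1 PB = 1,000 TB, and so on):

```python
# How many 4 TB drives does each unit of storage require?
DRIVE_TB = 4

def drives_needed(unit_in_tb):
    """Number of 4 TB drives needed to hold unit_in_tb terabytes."""
    return unit_in_tb / DRIVE_TB

print(drives_needed(1))              # terabyte:  0.25
print(drives_needed(1_000))          # petabyte:  250.0
print(drives_needed(1_000_000))      # exabyte:   250000.0
print(drives_needed(1_000_000_000))  # zettabyte: 250000000.0
```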
The CS Education Challenge
• Big data skills still not a primary focus of most universities
• "Computational thinking" good, but inclusion of "Data Science" even better
  – Traditional HPC more computationally driven
  – Cloud computing more application/data driven
• Need to teach about "out of core" to excel in the data-driven world
• Need to teach about "web/distributed scale" in order to query the data effectively
• Various initiatives beginning to connect the dots
  – EduPar, to teach parallel/distributed computing principles early
  – Emergence of "data science" academic programs
• In short, we need to rethink whether CS accurately describes what we are doing!
Sources of Big Data
• social media: tweets, likes (+1s), etc.
  – http://www.fastcompany.com/3013208/these-amazing-twitter-metadata-visualizations-will-blow-your-mind
• user-generated content like photos and videos
• VOIP traffic
• customer and B2B transactions
• GPS-equipped cell phones
• mobile devices
• system logs of all kinds
• RFID tag sensors embedded in everything from airport runways to casino chips
• e-mail
• …mostly unstructured -> documents as opposed to relations (a la RDBMS)
Distributed Design Principles are Vitally Important to Big Data
• NoSQL emerging as the way to operate at web (distributed) scale
• Transparency principles (from Coulouris, Dollimore, Kindberg):
  – Access transparency: enables local and remote objects to be accessed using identical operations
  – Location transparency: location of resources is hidden
  – Migration transparency: resources can move without changing names
  – Replication transparency: users cannot tell how many copies exist
  – Concurrency transparency: multiple users can share resources automatically
  – Parallelism transparency: activities can happen in parallel without the user knowing about it
  – Failure transparency: concealment of faults
• Modern NoSQL databases employ most of these, especially when combined with cloud computing.
MongoDB: Secret Sauce of Many Big Data Environments
• Document storage model (JSON/BSON)
• Ad hoc/dynamic schemas (by default)
  – JSON does not imply total absence of a schema
  – http://json-schema.org/
• Distributed design principles/sharding
  – Replication
  – High availability
  – Distributed query processing
• Full indexing support
• Atomic updates and transactions (if required)
• Map/Reduce a la Hadoop (JavaScript)
• Other embedding possible with JavaScript (built around Node.js technology)
• File storage via GridFS
• …capabilities not limited to MongoDB; see http://couchdb.apache.org/ (same organization behind Hadoop)
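A tiny illustration of the document model and ad hoc schemas, using plain Python dictionaries rather than a live MongoDB instance (the accession XX000001 is invented for this example; HM067748 is a real record used later in the talk):

```python
# Documents in one MongoDB collection need not share a field set.
doc_a = {"accession": "HM067748", "gene": "gag", "country": "China"}
doc_b = {"accession": "XX000001", "gene": "env"}  # no country: still a valid document

collection = [doc_a, doc_b]

# Queries simply tolerate missing attributes rather than failing:
countries = [d.get("country", "unknown") for d in collection]
print(countries)  # ['China', 'unknown']
```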
Distributed Architecture of MongoDB
• Mongod – main database process, one per shard
  – Replica sets allow for failover execution with a revolving master model
• Mongos – routers that expose the collection of shards as a single server
• Config servers – servers that contain metadata about the cluster and which chunks of data reside on each shard
• MongoDB allows reasonable operation to continue (often in read-write mode) in the presence of single daemon/node failures
MongoDB Sharding/Replicas
Sharding (Literature)
• BigTable– http://research.google.com/archive/bigtable.html
• Yahoo! Hosted Data Serving Platform– http://www.brianfrankcooper.net/pubs/pnuts.pdf
NoSQL vs. and/or Hadoop
• Doesn't need to be either-or; could be "and".
• Hadoop
  – Ideal for periodic jobs that do map-reduce (e.g., ETL or a data warehouse built from multiple sources)
  – Transparent support for clustered execution
• NoSQL (e.g., MongoDB, a front-runner in many projects)
  – Built-in aggregation, including map-reduce support a la Hadoop
  – Sharding allows for distributed query processing from any node (and in any language)
• Additional reading:
  – http://docs.mongodb.org/ecosystem/use-cases/hadoop/
  – http://docs.mongodb.org/manual/tutorial/map-reduce-examples/
OpenStack IaaS (I=infrastructure)
OpenStack Components
• Object Store: storage and retrieval of files
• Image: catalog/repository of virtual disk images
• Compute: virtual servers on demand
• Dashboard: user interface for accessing all components
• Identity: authentication/authorization
• Network: network connectivity as a service (on demand as well)
• Block Storage: block storage for guest VMs (similar to iSCSI)
• http://en.wikipedia.org/wiki/OpenStack
Emerging Case Study
• Phylogenetic Analysis of HIV-1
  – Evolution of HIV happens so rapidly that we need an online analytical system for understanding it in space and time.
• Components/Pipeline
  – Genbank data ETL (Scala + Python); import into a MongoDB warehouse
  – RESTful querying to slice/dice gene information into FASTA (the format used by alignment tools)
  – Example of server-side JavaScript map-reduce embedded analytics using MongoDB
  – Alignment and visualization using existing (offline) tools
• http://ecommons.luc.edu/cs_facpubs/68/
• Our approach is best summarized: use the best tools and languages for the task you are trying to do (polyglot)
  – The case study uses three programming languages, MongoDB, a web services framework (Flask), existing bioinformatics tools, and a test VMware cluster for hosting it all before moving to an IaaS solution.
Working with Genbank Data
• Not completely unstructured, but messy nevertheless.
• Records often contain errors, owing to the complexity of the domain.
• Errors in most cases are innocuous; we can correct them offline and add to the warehouse later (or whenever).
• Parsers present unwanted complexity; see the API at http://www.biojava.org/docs/api16/org/biojava/bio/seq/io/SeqIOTools.html
• Our approach: transform the Genbank data directly into MongoDB "documents" (JSON objects) for a posteriori and long-term analysis
  – …and never think about the Genbank format again!
• Focus is on extracting features of interest, although our future effort will be to transform the entire Genbank corpus into documents for easier parsing/processing (for understanding other viruses, etc.)
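A hedged sketch of that transformation: once the fields of interest are pulled out of the flat file, the record is just a JSON object. The values come from the HM067748 record shown on the next slide, except the sequence, which is truncated here for illustration.

```python
import json

# "Genbank record -> MongoDB document": after extraction, the flat-file
# format is gone for good.
record = {
    "accession": "HM067748",
    "gene": "gag",
    "country": "China",
    "date": "16-Oct-2006",
    "sequence": "atgggtgcgagagcgtca",  # truncated for illustration
}

document = json.dumps(record)        # what gets stored (as BSON) in Mongo
print(json.loads(document)["gene"])  # gag
```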
LOCUS       HM067748                9680 bp    DNA     linear   VRL 27-JUN-2010
DEFINITION  HIV-1 isolate nx2 from China, complete genome.
ACCESSION   HM067748
VERSION     HM067748.1  GI:298919707
KEYWORDS    .
SOURCE      Human immunodeficiency virus 1 (HIV-1)
  ORGANISM  Human immunodeficiency virus 1
            Viruses; Retro-transcribing viruses; Retroviridae;
            Orthoretrovirinae; Lentivirus; Primate lentivirus group.
REFERENCE   1  (bases 1 to 9680)
  AUTHORS   Miao,W., Liu,Y., Wang,Z., Zhuang,D., Bao,Z., Li,H., Liu,S., Li,L. and Li,J.
  TITLE     Sequence and characterization of full-length genome of two HIV-1
            strains isolated from two infected patients in China
  JOURNAL   Unpublished
FEATURES             Location/Qualifiers
     source          1..9680
                     /organism="Human immunodeficiency virus 1"
                     /proviral
                     /mol_type="genomic DNA"
                     /isolate="nx2"
                     /host="Homo sapiens"
                     /db_xref="taxon:11676"
                     /country="China"
                     /collection_date="16-Oct-2006"
     LTR             1..591
     gene            747..2226
                     /gene="gag"
                     /note="gag protein"
     gene            <2030..5030
                     /gene="pol"
                     /note="pol protein"
     gene            4975..5556
Most of the data in this flat file is not important to our study, Watson. We need to import the bold fields and extract the data from the DNA on the next slide. (Both of these slides are one Genbank data file.)
Crick and Watson, DNA (1953)
ORIGIN
        1 ttgatttgtg ggtctatcac acacaaggct acttccctga ttggcacaac tacacaccgg
       61 gaccagggac cagattcccg ctgacttttg ggtggtgctt caagctagta ccagttgacc
      121 caagggaagt agaagaggcc agcgaaggag aagacaacag tttgctacac cctgtctgcc
      181 agcatggaat ggaggatgaa cacagagaag tgttaaagtg gaagtttgac agccaattag
      241 catacagaca ctgggcccgc gagctacatc cggagtttta caagaactgc tgatacagaa
      301 gggactttcc gcgggacttt ccaccagggc gttccgggag gtgtggtctg ggcggtactg
      361 ggagtggtca accctcagat gctgcatata agcagctgct ttgcgcctgt accgggtctc
      421 ttagttagac cagatctgag cctgggagct ctctggctag ctaggaaccc actgcttaag
      481 cctcaataaa gcttgccttg agtgctctga gcagtgtgtg cccatctgtt gtgtgactct
      541 ggtaactaga gatccctcag acccttgtgg cagtgtggaa aatctctagc agtggcgccc
      601 gaacaggggc aagaaaagga aaatgagacc cgaggggatt tcttgacgca ggactcggct
      661 tgctgaagtg cactcggcaa gaggcgagag gggcgactgg tgagtacgcc aattttattt
      721 gactagcgga ggctagaagg agagagatgg gtgcgagagc gtcaatatta agaggggaaa
      781 aattggataa atgggaaaga attaggttaa ggccaggggg aaagaaacac tatctgctaa
      841 aacacatagt atgggcaagc agagagctgg aaaaatttgc acttaaccct ggccttttag
      901 agacatcaga aggatgtaag caaataataa aacagctaca accagctctt cagacaggaa
      961 cagaggaact taaatcatta tacaacacag tagcagttct ctattgtgta catgaaaaaa
     1021 tagacatacg agacaccaaa gaagccttag acaagataga agaagaacaa aataaatgtc
     1081 agcagaaaac acagcaggca aaaaaggatg atgagaaggt tagtcaaaat tatcctatag
     1141 tgcagaatct ccaagggcac atggtacatc agcctctatc acctagaact ttaattgcat
     1201 gggtaatagt agtggacaga gaagactcct tagctcagaa gtaatacccc tgttcacagc
     1261 ataatcagaa ggagccaccc cacaagatct aaactccatg ttaaatacag tagggcgaca
     1321 tcaagcagct atgcaaatgt taaaagatac catcaatgga gaggctgcag aatgagatag
     1381 attgcatcca gtgcatgcag ggccagtggc accaggccag atgagagaac caaggggtag
     1441 tgacatagca ggaactacta gtactctcca ggagcaaata ggatggatga caaataatcc
     1501 acctatccca gtaggagaaa tctataaaag atggataatc gtcggattaa ataaattagt
[...]
     9541 gcctgggagc tctctggcta gctaggaacc cactgcttaa gcctcaataa agcttggctt
     9601 gagtgctctg agcagtgtgt gcccatctgt tgtgtgactc tggtaactag agatccctca
     9661 gacccttgtg gcagtgtgga
One of the genes of interest (gag) is at positions 747..2226. We need to extract lots of these for our HIV repository!
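Extracting a gene means slicing the ORIGIN string, and Genbank coordinates are 1-based and inclusive while Python string indexing is 0-based with an exclusive end. A toy sketch of the off-by-one handling (the sequence here is invented; real records run to thousands of bases):

```python
def extract_gene(origin, start, end):
    """Slice a gene from the ORIGIN string given 1-based, inclusive bounds."""
    return origin[start - 1:end]

origin = "aaacccgggttt"
print(extract_gene(origin, 4, 6))  # ccc (bases 4..6, inclusive)
```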
Basic Structure
• Genbank is a collection of sequences
• A sequence is a collection of features
• Sequences and features have annotations (e.g., simple key/value pairs, possibly having ad hoc structure within)
• We use Scala (an emerging object-functional language; runs on the JVM) to parse this format.
  – The task is naturally suited to the stream-oriented facilities found in functional languages
  – Support for "failure" as a concept allows processing to continue meaningfully
  – Leverages the existing BioJava library, which is adapted to Scala
• The Scala code writes a delimiter-separated file which is postprocessed by Python to create/update MongoDB entries.
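The hand-off format between the two languages can be sketched as a simple round trip; the field order matches the importer shown later, and the record values here are illustrative:

```python
# One pipe-delimited line per extracted gene, produced by Scala and
# consumed by Python.
FIELDS = ("accession", "gene", "country", "date", "note", "sequence")

def to_line(record):
    return "|".join(record[f] for f in FIELDS)

def from_line(line):
    return dict(zip(FIELDS, line.strip().split("|")))

rec = {"accession": "HM067748", "gene": "gag", "country": "China",
       "date": "16-Oct-2006", "note": "", "sequence": "atgggtgcg"}
assert from_line(to_line(rec)) == rec  # lossless round trip
```

The format assumes "|" never occurs inside a field value, which holds for the fields extracted here.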
Scala Genbank Parsing
• The next few slides show
  – The Scala Genbank file parser (shows how to extract the entire corpus of files as a stream)
  – A Python postprocessor to transform the flattened stream produced by Scala into a MongoDB collection (Python is a great glue language!)
  – A Python RESTful API to query the collection for offline analytics (using Clustal and visualization tools)
  – An embedded analytics example using native map-reduce with JavaScript in MongoDB
• The full source for what we're doing is available from our Bitbucket repository at:
  – https://bitbucket.org/loyolachicagocs_bioi/hiv-biojava-scala
• This is still a work in progress but is being used to build our computational biology data warehouse.
Scala Genbank File Parser (Importer)
object TryBio {
  case class SourceInformation(country: String, collectionDate: String, note: String)
  case class SequenceInformation(accession: String, origin: String)
  case class GeneInformation(gene: String, start: Int, end: Int)

  /**
   * Converts an annotation to a properly typed Scala map.
   */
  implicit def annotationAsScalaMap(annotation: Annotation) =
    annotation.asMap.asInstanceOf[JMap[String, String]].asScala

  def processFile(file: java.io.FileReader) = {
    val sequences: JIterator[Sequence] = SeqIOTools.readGenbank(new BufferedReader(file))
    for {
      seq <- sequences.asScala
      seqInfo <- getSequenceInformation(seq)
      sourceInfo <- getSourceInformation(seq)
      gene <- getGenes(seq)
    } {
      val fields = List(seqInfo.accession, gene.gene, sourceInfo.country,
        sourceInfo.collectionDate, sourceInfo.note,
        seqInfo.origin.substring(gene.start, gene.end))
      println(fields.mkString("|"))
    }
  }

  def main(args: Array[String]) {
    for (arg <- args) {
      val f = new FileReader(arg)
      processFile(f)
      f.close()
    }
  }
}

  def getSequenceInformation(sequence: Sequence): Option[SequenceInformation] =
    for {
      // returns None for sequences without accession so they get skipped in main
      acc <- sequence.getAnnotation get "ACCESSION"
      origin = sequence.seqString
    } yield SequenceInformation(acc, origin)

  def getSourceInformation(sequence: Sequence): Option[SourceInformation] =
    for {
      // returns None for non-source sequences so they get skipped in main
      f <- sequence.features.asScala.find { _.getType == "source" }
      a = f.getAnnotation
    } yield SourceInformation(
      a.getOrElse("country", UNKNOWN_COUNTRY),
      a.getOrElse("collection_date", UNKNOWN_DATE),
      a.getOrElse("note", UNKNOWN_NOTE))

  private val allowedGenes =
    Set("gag", "pol", "env", "tat", "vif", "rev", "vpr", "vpu", "nef")

  def getGenes(sequence: Sequence): Iterator[GeneInformation] =
    for {
      f <- sequence.features.asScala
      // skip features without gene annotation
      g <- f.getAnnotation get "gene"
      if f.getType == "CDS" && (allowedGenes contains g)
      l = f.getLocation
    } yield GeneInformation(g, l.getMin - 1, l.getMax - 1)
Using Python to Import Genbank Stream into Mongo
def main():
    mongo_db_name = sys.argv[1]

    # Assume Mongo is running on localhost at its defaults
    client = MongoClient()
    db = client[mongo_db_name]

    if db.posts.count() > 0:
        print("Mongo database %s is not empty. Please create a new one" % mongo_db_name)
        sys.exit(1)

    for line in sys.stdin:
        text = line.strip()
        (accession, gene, country, date, note, sequence) = text.split("|")[:6]
        document = {
            'accession': clean(accession),
            'gene': clean(gene),
            'country': clean(country),
            'date': clean(date),
            'note': clean(note),
            'sequence': sequence
        }
        db.posts.insert(document)

    print("Wrote %d documents" % db.posts.count())
RESTful Services using Python and Flask Micro Web Framework
RESTful Queries over the Genbank HIV Corpus
• The RESTful architectural style exposes the collection as a discoverable hierarchy of resources:
  – <base>/genbank: Returns the datasets that we've imported (hiv is the only one right now)
  – <base>/genbank/<collection>: Returns the list of discovered genes
  – <base>/genbank/<collection>/<gene>: Returns a FASTA file for all files where <gene> was present
  – <base>/genbank/<collection>/unknown/<thing>: Produces a report of what data were not imported (<thing> is a code indicating country, date, or notes that are needed to support the previous three common queries)
• Self-hosted live sandbox (not OpenStack based yet)
  – http://tirtha.cs.luc.edu:5000/genbank/
Queries (Live)
• Show datasets
  – http://tirtha.cs.luc.edu:5000/genbank/
  – You'll see "hiv", perhaps others (test databases)
• Show genes within the "hiv" collection
  – http://tirtha.cs.luc.edu:5000/genbank/hiv/
  – You'll see gene names, e.g. gag, env, …
• Show a FASTA file for offline alignment/analytics (eventually we'll embed it in our web services)
  – http://tirtha.cs.luc.edu:5000/genbank/hiv/env (gives FASTA for the env gene across ALL datasets)
  – http://tirtha.cs.luc.edu:5000/genbank/hiv/gag (gives FASTA for the gag gene across ALL datasets)
• These queries are all achieved using a RESTful service, written directly with the Python micro web framework (Flask) and MongoDB. We'll look at the code.
@app.route("/genbank")
def get_databases():
    client = MongoClient()
    db_names = client.database_names()
    text = '\n'.join(db_names)
    print("DB Names", text)
    resp = Response(text, status=200, mimetype='text/plain')
    return resp

@app.route("/genbank/<collection>")
def get_collection_gene_names(collection):
    text = '\n'.join(get_collection_genes(collection))
    resp = Response(text, status=200, mimetype='text/plain')
    return resp

def get_collection_genes(collection):
    client = MongoClient()
    db = client[collection]
    return db.posts.distinct('gene')

FASTATEMPLATE = """>%(accession)s|%(gene)s|%(country)s|%(date)s|%(note)s
%(sequence)s
"""

def get_fasta(collection, gene):
    client = MongoClient()
    db = client[collection]
    cursor = db.posts.find({ 'gene' : gene })
    fasta = StringIO.StringIO()
    for item in cursor:
        fasta.write(FASTATEMPLATE % item)
    text = fasta.getvalue()
    fasta.close()
    return text

@app.route("/genbank/<collection>/<gene>")
def get_collection_gene(collection, gene):
    resp = Response(get_fasta(collection, gene), status=200, mimetype='text/plain')
    return resp
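To see what get_fasta emits, here is the template expanded for a single, illustrative document. The newline between the header and the sequence is an assumption based on the FASTA format, since the slide's original line breaks were lost in transcription:

```python
# Expansion of the FASTA template for one Mongo document (illustrative values).
FASTATEMPLATE = """>%(accession)s|%(gene)s|%(country)s|%(date)s|%(note)s
%(sequence)s
"""

item = {"accession": "HM067748", "gene": "gag", "country": "China",
        "date": "16-Oct-2006", "note": "", "sequence": "atgggtgcg"}

record = FASTATEMPLATE % item
print(record)
# >HM067748|gag|China|16-Oct-2006|
# atgggtgcg
```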
Doing Hadoop Map-Reduce Style Processing directly (via Mongo Shell)
Map-Reduce Mongo Style
• JSON is the native storage format of MongoDB.
• JavaScript is the native query language.
• Much of what Hadoop does can be done in JavaScript without writing full Java programs/classes.
• Map function: projects a list of key/value pairs from JSON documents (by selecting attributes of interest)
• Reduce function: iterates over the keys and/or values of interest using JavaScript libraries for aggregate operations (or your own code)
• Fully interactive execution makes it easy to test code without launching executable jobs
  – Doesn't replace Hadoop fully (no job control), but that can be addressed with off-the-shelf job schedulers/load balancers (say, in a cluster).
• The following example shows how to create a map-reduce computation to determine the average length of the nucleotide sequences discovered in our Genbank data set.
• This can run in parallel/distributed mode in a sharded configuration.
Map/Reduce Using MongoDB

var computeAvgSequenceLength = function(accession, sequences) {
  var total = 0;
  for (var i = 0; i < sequences.length; i++) {
    total += sequences[i].length
  }
  return (total + 0.0) / sequences.length;
}

var emitByAccessionSequence = function() {
  emit(this.accession, this.sequence)
}

db.results.remove()
db.posts.mapReduce(
  emitByAccessionSequence,
  computeAvgSequenceLength,
  { out : "results" }
)

var results = db.results.find()

while (results.hasNext()) {
  var result = results.next();
  print("average(", result['_id'], ") = ", result['value'])
}
Output

mongo localhost:27017/hiv mapreduce.js

average( JN235962 ) = 1694.5714285714287
average( JN235963 ) = 1705.857142857143
average( JN235964 ) = 1703.7142857142858
average( JN235965 ) = 1563.7142857142858
average( JN248316 ) = 1443.375
average( JN248317 ) = 1446
average( JN248318 ) = 1570
average( JN248319 ) = 1425
average( JN248320 ) = 1431.857142857143
average( JN248321 ) = 1591.4444444444443
average( JN248322 ) = 1584.3333333333333
average( JN248323 ) = 1441.875
average( JN248324 ) = 1430.142857142857
average( JN248325 ) = 1579
average( JN248326 ) = 1558.8333333333333
average( JN248327 ) = 1439.625
average( JN248328 ) = 1562.5714285714287
average( JN248329 ) = 1558.4444444444443
average( JN248330 ) = 1570.5555555555557
average( JN248331 ) = 1451.375
…
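The grouping that the map and reduce functions perform can be cross-checked in plain Python. The toy documents below are invented, with lengths chosen so the averages line up with two accessions in the output above:

```python
from collections import defaultdict

posts = [
    {"accession": "JN248317", "sequence": "a" * 1440},
    {"accession": "JN248317", "sequence": "a" * 1452},
    {"accession": "JN248318", "sequence": "a" * 1570},
]

groups = defaultdict(list)  # "map" phase: emit (accession, length)
for doc in posts:
    groups[doc["accession"]].append(len(doc["sequence"]))

# "reduce" phase: average the lengths collected for each accession
averages = {acc: sum(ls) / len(ls) for acc, ls in groups.items()}
print(averages)  # {'JN248317': 1446.0, 'JN248318': 1570.0}
```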
Early Visualizations of HIV Evolution (from our warehouse/web services)
• The pipeline presented thus far allows us to use existing tools/services to do the visualization
• The BioEdit workstation tool is used to show a colorized view/alignment of sequence data obtained by gene.
• The Dendroscope library is used to show the hierarchical decomposition (the phylogenetic tree) of how the virus has evolved.
• The details are beyond the scope of this talk, but we know we can get from here to a real-time, longitudinal view of what HIV is doing.
• Future work will embed all of this as RESTful services and use a cluster to render any visualization on demand.
Colorized view of FASTA data (acquired by web service/Mongo collection)
Examples of trees – evolution of the tat gene from HIV isolates taken from the US 1990-2009
Same tree… different view
Same tree… different view
Future Directions
• Deploy to an OpenStack-based private cloud (in progress)
• Import the entire Genbank corpus and other genomics data sets
• Incorporate alignment as embedded analytics to precompute visualizations of interest
• Integrate visualization into the web services
• Work on a new predictive piece to identify emerging threats/mutations
• Early results suggest MongoDB can do most queries on important slices of data (virus/gene) in fractions of a second, including map-reduce style.
  – We're hoping to import the entire Genbank corpus after getting our private cloud established.
Acknowledgments
• Debbie Sims and colleagues at IEEE Computer Society (for the opportunity to give this webinar)
• Catherine Putonti, Loyola University Chicago (Biology and Computer Science)
• Steven Reisman (Graduate Student, Computer Science) for work on the longitudinal visualizations
• Joe Kaylor and Konstantin Läufer, Loyola University Chicago for pairing on Scala and RESTful services work
• Manish Parashar, Rutgers University (Computer Science), for discussions of our shared view of big data
• Rusty Eckman (Northrop-Grumman) for his helpful input and feedback.