Leveraging the MapReduce Application Model to Run Text ...

Leveraging theMapReduce Application Model

to Run Text Analytics in HPC Clusters

Cyril Briquet, McMaster University

Overview

1. Text Analytics with Voyeur Tools

2. HPC burst computing

3. MapReduce

4. Challenges

Text analytics with Voyeur Tools

Voyeur Tools

* interactive web app

* toolbox (word frequencies, collocates, concordances, ...)

* for text corpora (primary scope: Digital Humanities)

http://voyeurtools.org (you can try now)

Text analytics with Voyeur Tools

Overview

3. MapReduce

4. Challenges

HPC burst computing

* next generation of Voyeur Tools:

scale to process large text corpora upon user request,

so-called “burst computing”

* use of HPC resources to process bursts of user requests

so that interactive web app remains responsive

* SHARCNET-sponsored project & Postdoctoral Fellowship

(started: January 2010)

Burst computing opportunities

3 opportunities to use HPC resources in Voyeur Tools:

* initial data importing

* data indexing

* data analysis

=> multithreading “OK”, but limited data parallelism

Overview

3. MapReduce

4. Challenges

MapReduce [Dean et al. 2004]

MapReduce features [Dean et al. 2004]

* automatic parallelization and distribution

* I/O scheduling (+ data-aware scheduling)

* fault-tolerance

* status and monitoring

Using MapReduce in Voyeur Tools

* initial data importing:

import text corpora into MapReduce filesystem

* data indexing: online or (preferably) batch indexing jobs

* data analysis: online analysis jobs

Overview

3. MapReduce

4. Challenges

Challenge: data model

impact of data model

* current Voyeur Tools: file-based

(O.S.-level, “local” filesystem)

* MapReduce: record-based

(application-level, “distributed” filesystem)

Challenge: computation model

impact of computation model

* current Voyeur Tools: multiple nested loops

* MapReduce: 2-phase record-based processing

Challenge: cluster job queue

impact of cluster job queue

* middleware daemons (for data distribution, computation)

should ideally be always-on (to reduce web app latency)

* application malleability (to # available cores):

supplementary challenge...

Challenge: global filesystem

impact of cluster-level global filesystem

* very convenient for many applications...

but we'd rather preposition data to local disks,

in order to maximize parallelism of data access

Challenge: firewalls

impact of firewalls

* can communicate <===> SHARCNET nodes, clusters

* cannot download data from the web

(= limited parallelism of initial data import from web servers)

Challenge: public web front-end

requirement for public web front-end

* scalable servlet container

* DNS entry

Conclusion

* project getting started

* will rely on Open Source software:

. Apache Hadoop

. Apache Tomcat (maybe Eclipse Jetty)

. and of course Voyeur Tools

* importance of flexible data model

Leveraging theMapReduce Application Model

to Run Text Analytics in HPC Clusters

Cyril Briquet, McMaster University

Leveraging the MapReduce Application Model to Run Text ...

Documents

Transcript of Leveraging the MapReduce Application Model to Run Text ...

AdWords Smart-Bidding Leveraging Machine Learning 2017... · Leveraging Machine Learning - AdWords Smart-Bidding Annika Weckner, Display Specialist - Central Europe ... 1+ week Run

Automatically Leveraging MapReduce Frameworks for Data ... · MapReduce is a popular programming paradigm for running large-scale data-intensive computation. Recently, many frameworks

Pipelined-MapReduce an Improved MapReduce

Leveraging Legacy LiabilitiesAIRROC Webinar Series: Introduction to Run-Off ... • Companies with run-off legacy liabilities struggle to find ways to exit this business. • Commutations,

White Paper Leveraging existing structured cabling ... · Leveraging existing structured cabling Infrastructure to deploy security systems ... Data networks run on structured cabling

MapReduce Paradigm

Leveraging the Short-Term Memory of Hardware to …pages.cs.wisc.edu/~shanlu/paper/asplos193-arulraj.pdf · to Diagnose Production-Run Software Failures ... preserving production-run

Chapter 6 - MapReduceHadoop, Chapter 2 – MapReduce MapReduce is a programming model for data processing striking a balance between simplicity and expressional power. Hadoop can run

MapReduce-MPI Library Users Manualmapreduce.sandia.gov/doc/Manual.pdf · MapReduce-MPI WWW Site - MapReduce-MPI Documentation What is a MapReduce? The canonical example of a MapReduce

Hadoop Mapreduce

1. Introduction to MapReduce - UPMlsd.ls.fi.upm.es/.../IntroToMapReduce.pdf · Processing of massive data: MapReduce – 1. Introduction to MapReduce MapReduce has a 'low semantic

Leveraging SUSE Linux to run SAP HANA on the Amazon Web Services Cloud

Run Safer: Leveraging the Enterprise for Response and Recoverymedia.govtech.net/MeganT/Denver_2011/Run_Safer... · SAP Direct Relief International . Run Safer: Leveraging the Enterprise

SystemML: Declarative Machine Learning on MapReduce · a set of MapReduce jobs that can run on a cluster of machines. ... media analytics, web-search, computational advertising and

MapReduce and Hadoop File Systemnsrit.edu.in/admin/img/cms/10096mapreduce.pdf · The Outline Introduction to MapReduce From CS Foundation to MapReduce MapReduce programming model

Leveraging OpenStack to Run Mesos/Marathon at Charter Communications

MapReduce - cse.hcmut.edu.vn

Leveraging smart meter data for electric utilities...1. Leveraging smart meter data[Sample use case for electric utilities] 2. Performance evaluation of MapReduce and Spark 1.6 (using

Google’s MapReduce Programming Model — Revisitedlaemmel/MapReduce/paper.pdf · Google’s MapReduce Programming Model — Revisited Ralf Lammel¨ Data Programmability Team Microsoft

MapReduce Tutorial