Transcript of: 6 Challenges in Developing an Introductory Course in Big Data Programming (Eastern Connecticut State University, 04/17/2015)

Page 1:

6 Challenges in Developing an Introductory Course in Big Data Programming

Eastern Connecticut State University

Roland DePratti

Dr. Garrett Dancik

Dr. Sarah Tasneem

04/17/2015

Page 2:

Initiated September, 2013 to align Data Management and Bioinformatics topics

Hadoop programming arose as the natural synergy topic
◦ It was seen as the natural consolidation of a number of areas in CS
◦ A growing discipline with a concrete theoretical and practical foundation
◦ Great job opportunities for our students
◦ Could result in valuable assets that could be leveraged across university departments

Initial research completed last summer
◦ Development of Big Data Team
◦ Completed summary research on the topic
◦ Identified Cloudera as our Academic partner
◦ Reviewed Cloudera Support materials
◦ Identified grants to support work

Project Background

Presentation url: http://www1.easternct.edu/deprattir/ccscne-2015-content/

Page 3:

Solve the challenges!
Complete team training
Develop course materials
Complete test run with 2 independent study students (Fall 2015)
Kick off as a CS Topics class (Spring 2016)
Develop future goals and roadmap

2015/2016 Tasks

Page 4:

We are halfway through this process
◦ A lot still to learn

We want to share the decisions we face around four of the six identified challenges

We are looking for input from others who are ahead of us or behind us, both during the conference and afterward

We hope the input and collaboration result in better knowledge delivery to our students

We will document our experiences and results for future presentations

Why are We Here?

Page 5:

Selection of course topics (Roland)

Keeping up with the speed of change (Roland)

Ensuring proper prerequisite knowledge (Garrett)

Managing the lab environment (Sarah)

Software platform stability

Developing meaningful lab exercises

6 Big Data Course Design Challenges

Page 6:

Selecting Course Topics, While Keeping Up with Change

Yellow = Active Projects; Red = Non-Active Projects; Orange = Soon to be Sunset

( ) identifies CS Knowledge Areas

Page 7:

Teach the concepts; the technology will change

Teach the future, not the past
◦ Spark vs MapReduce

Show how the platform works together (a driver sketch appears below)
◦ Relational -> Sqoop -> HDFS -> MapReduce/Spark

Build on what they already know
◦ Relational DBMS, Java, SQL

Use lab exercises that tie in other CS topics
◦ Data Mining
◦ Bioinformatics

Guiding Principles
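As a rough illustration of the Relational -> Sqoop -> HDFS -> MapReduce/Spark flow, here is a minimal sketch of a Hadoop job that counts orders per customer after a Sqoop import. The paths, table name, and field position are hypothetical placeholders (not taken from the slides); a Hadoop 2.x / CDH-style environment is assumed.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class OrdersByCustomer {

    // Sqoop writes one comma-delimited record per line; we assume (hypothetically)
    // that the customer id is the second field of the imported "orders" table.
    public static class OrderMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text customerId = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length > 1) {
                customerId.set(fields[1]);
                context.write(customerId, ONE);   // emit (customerId, 1)
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Step 1 happens outside this program: a Sqoop import such as
        //   sqoop import --connect jdbc:mysql://dbhost/sales --table orders \
        //                --target-dir /user/student/orders
        // leaves delimited text files under /user/student/orders in HDFS.

        // Step 2: a standard MapReduce job reads those files from HDFS.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "orders per customer");
        job.setJarByClass(OrdersByCustomer.class);
        job.setMapperClass(OrderMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // library reducer that sums IntWritables
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/student/orders"));
        FileOutputFormat.setOutputPath(job, new Path("/user/student/orders_by_customer"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```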

Page 8:

Our Current Path through the Maze

Yellow = Active Projects; Red = Non-Active Projects; Green = Course Topics

() identifies CS Knowledge Areas

Page 9:

Pre-requisite knowledge for Big Data programming

Topic: Linux operating system
Selected required coverage: directory structure, file management, text editors, core commands
Current coverage: none

Topic: Java
Selected required coverage: basic Java programming, abstract classes and interfaces, serialization, JUnit testing, Log4j framework (a small JUnit/Log4j sketch follows this table)
Current coverage: object-oriented Java programming course

Topic: Eclipse IDE
Selected required coverage: Java programming, generating JAR files, using JUnit, Log4j
Current coverage: object-oriented Java programming course
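Since JUnit testing and the Log4j framework appear in the required coverage above, here is a minimal sketch of the kind of warm-up exercise such a prerequisite lab might use. The class under test (TemperatureParser) and the record layout are hypothetical; JUnit 4 and Log4j 1.x are assumed.

```java
import static org.junit.Assert.assertEquals;

import org.apache.log4j.Logger;
import org.junit.Test;

public class TemperatureParserTest {

    private static final Logger LOG = Logger.getLogger(TemperatureParserTest.class);

    // Hypothetical class under test: pulls a temperature reading out of a
    // fixed-width weather record, the kind of line a MapReduce mapper might parse.
    static class TemperatureParser {
        int parse(String record) {
            return Integer.parseInt(record.substring(10, 14).trim());
        }
    }

    @Test
    public void parsesTemperatureField() {
        LOG.info("Testing TemperatureParser on a sample record");
        TemperatureParser parser = new TemperatureParser();
        assertEquals(72, parser.parse("2015041700  72  CT"));
    }
}
```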

Page 10:

Challenge: students need additional Java/Eclipse experience, may be "rusty", and do not have Linux experience

Possible solutions:
◦ Offer a 1-credit laboratory course as a co-requisite to Big Data programming
◦ Offer a 1-credit "Programming in a Linux Environment" course that would be a pre/co-requisite to Big Data programming and could also be taken by others

Pre-requisite knowledge for Big Data programming

Page 11:

In-House Cluster
Create clusters of computers on campus
- limited size
Establishment and maintenance cost
- University IT

Infrastructure as a Service (IaaS)
(a scalable replacement for local IT)
Access infrastructure resources in the cloud, in the form of virtual machines
No maintenance
Students can use the same tools as professionals use
AWS offers virtualized platforms
- pay-as-you-use
- careful not to waste computing resources

Page 12:

Cloud Computing
• A modern-day, useful problem-solving tool
• Many universities are incorporating cloud computing in the curriculum
• Related knowledge and skills are becoming fundamental for computing professionals
• Will provide students with hands-on cloud computing experience
• Students will experience cutting-edge tools that will help them grow professionally

Page 13:

Selection of course topics

Keeping up with the speed of change

Ensuring proper prerequisite knowledge

Managing the lab environment

Software platform stability

Developing meaningful lab exercises

6 Big Data Course Design Challenges

Page 14:

Additional References and Content

Page 15:

1. Albrecht, J., 2009. Bringing big systems to small schools: distributed systems for undergraduates. SIGCSE '09: Proceedings of the 40th ACM Technical Symposium on Computer Science Education.

2. Garrity et al., 2011. WebMapReduce: an accessible and adaptable tool for teaching map-reduce computing. SIGCSE '11: Proceedings of the 42nd ACM Technical Symposium on Computer Science Education.

3. Lin, J., 2008. Exploring large-data issues in the curriculum: a case study with MapReduce. TeachCL '08: Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics.

4. Mahadev, A. & Wurst, K., 2015. Developing Concentrations in Big Data Analytics and Software Development at a Small Liberal Arts University. Journal of Computing Sciences in Colleges, Volume 30, Issue 3.

5. Brandon, D., 2015. Teaching Data Analytics Across the Computing Curricula. Journal of Computing Sciences in Colleges, Volume 30, Issue 5.

6. Wolffe, G., 2009. Teaching Parallel Computing: New Possibilities. Journal of Computing Sciences in Colleges, Volume 25, Issue 1.

7. Brown, R. et al., 2010. Strategies for Preparing CS Students for the Multicore World. Proceedings of the 2010 ITiCSE Working Group Reports.

8. www.acm.org/education/CS2013-final-report.pdf. Accessed 3/16/2015.

Additional References

Page 16:

Big Data Open Source Projects

File Management

HDFS: Hadoop Distributed File System, a user-defined file system that manages large blocks and provides file management across a distributed system. (A Java FileSystem sketch follows this list.)
Avro: A remote procedure call and data serialization framework developed within Apache's Hadoop project.
LZO: Lempel-Ziv-Oberhumer (LZO), a lossless algorithm that compresses data to ensure high decompression speed.
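As a small illustration of how a client program talks to HDFS, the sketch below lists a directory and reads part of one file through the Hadoop FileSystem API. The directory and file names are hypothetical placeholders; Hadoop 2.x is assumed.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListAndRead {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from the cluster's
        // core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/user/student/orders");   // hypothetical directory
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        // Read the first few lines of one file in the directory.
        Path file = new Path(dir, "part-m-00000");     // typical Sqoop/MapReduce part-file name
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            for (int i = 0; i < 5; i++) {
                String line = reader.readLine();
                if (line == null) {
                    break;
                }
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```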

Programming Models/Frameworks

MapReduce: A programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
Spark: An open-source cluster computing framework, originally developed in the AMPLab at UC Berkeley, that uses in-memory primitives to speed up performance. (A Java word count sketch appears at the end of this page.)
Tez: The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data.
Cascading: A software abstraction layer for Apache Hadoop, used to create and execute complex data processing workflows on a Hadoop cluster from any JVM-based language while hiding the underlying complexity of MapReduce jobs.
Scalding: A Scala library that makes it easy to write MapReduce jobs in Hadoop. It is similar to other MapReduce platforms such as Pig and Hive, but offers a higher level of abstraction by leveraging the full power of Scala and the JVM. Scalding is built on top of Cascading.

All definitions were sourced from Wikipedia or Apache project website
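The following is a minimal sketch of the classic word count written against Spark's Java API, the kind of side-by-side example used to contrast Spark with MapReduce ("teach the future, not the past"). It assumes Spark 1.x with Java 8 lambdas (the flatMap signature changed in Spark 2.x), and the HDFS paths are hypothetical placeholders.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark word count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read lines from HDFS, split them into words, and count each word.
        JavaRDD<String> lines = sc.textFile("hdfs:///user/student/books");
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        counts.saveAsTextFile("hdfs:///user/student/word_counts");
        sc.stop();
    }
}
```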

Page 17:

Big Data Open Source Projects

Data Management

MongoDB: MongoDB (from "humongous") is one of many cross-platform document-oriented databases.
Cassandra: Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
HBase: HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java.
Redis: Redis is a data structure server. It is open-source, networked, in-memory, and stores keys with optional durability.

Data Ingestion

Sqoop: Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.
Pig: Pig is a high-level platform for creating MapReduce programs used with Hadoop.
Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

All definitions were sourced from Wikipedia or Apache project website

Page 18:

Big Data Open Source Projects

Query

SparkSQL: Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data. (A small Java sketch follows this list.)
Hive: Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis; it was initially developed by Facebook.
Impala: Cloudera Impala is Cloudera's open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.
Drill: Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source version of Google's Dremel system, which is available as an infrastructure service called Google BigQuery.
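A minimal Spark SQL sketch in Java follows. It assumes the Spark 1.4-era API (SQLContext and DataFrame, the successor of SchemaRDD; later releases use SparkSession instead), and the JSON file path, table name, and field names are hypothetical placeholders.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class SparkSqlExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark sql example");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load a JSON file of patient records (hypothetical) into a DataFrame;
        // the schema is inferred from the data.
        DataFrame patients = sqlContext.read().json("hdfs:///user/student/patients.json");
        patients.printSchema();

        // Register the DataFrame as a temporary table and query it with SQL.
        patients.registerTempTable("patients");
        DataFrame adults = sqlContext.sql("SELECT name, age FROM patients WHERE age >= 18");
        adults.show();

        sc.stop();
    }
}
```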

Workflow Management

Oozie: Oozie is a workflow scheduler system to manage Hadoop jobs.

All definitions were sourced from Wikipedia or Apache project website

Page 19:

Big Data Open Source Projects

Streaming

Spark Streaming: Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. (A small Java sketch follows this list.)
Storm: Apache Storm is a distributed computation framework to allow batch, distributed processing of streaming data.
Kafka: Apache Kafka is an open-source message broker project which aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Samza: Apache Samza is an open-source project developed by the Apache Software Foundation, written in Scala. The project aims to provide a near-realtime, asynchronous computational framework for stream processing.
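Below is a minimal Spark Streaming sketch in Java: a word count over text arriving on a socket, processed in small micro-batches. Spark 1.x with Java 8 lambdas is assumed, and the host and port are hypothetical placeholders (for a quick local test, something like `nc -lk 9999` can feed the socket).

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        // "local[2]" for local testing: at least one thread for the receiver
        // and one for processing. On a cluster the master comes from spark-submit.
        SparkConf conf = new SparkConf().setAppName("streaming word count").setMaster("local[2]");

        // Micro-batches of 5 seconds: Spark Streaming chops the stream into small RDDs.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(5000));

        // Text lines arriving on a socket.
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
        JavaPairDStream<String, Integer> counts = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        counts.print();          // print each batch's counts to the driver log
        jssc.start();            // start receiving and processing data
        jssc.awaitTermination(); // run until the job is killed
    }
}
```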

Machine Learning

MLlib: A distributed machine learning framework on top of Spark.

All definitions were sourced from Wikipedia or Apache project website