6 Challenges in Developing an Introductory Course in Big Data Programming
Eastern Connecticut State University
Roland DePratti
Dr. Garrett Dancik
Dr. Sarah Tasneem
04/17/2015
Initiated in September 2013 to align Data Management and Bioinformatics topics
Hadoop programming arose as the natural synergy topic
◦ It was seen as the natural consolidation of a number of areas in CS
◦ A growing discipline with a concrete theoretical and practical foundation
◦ Great job opportunities for our students
◦ Could result in valuable assets that could be leveraged across university departments
Initial research completed last summer
◦ Development of Big Data Team
◦ Completed summary research on the topic
◦ Identified Cloudera as our academic partner
◦ Reviewed Cloudera support materials
◦ Identified grants to support work
Project Background
Presentation url: http://www1.easternct.edu/deprattir/ccscne-2015-content/
Solve the challenges!
Complete team training
Develop course materials
Complete test run with 2 independent study students (Fall 2015)
Kick off as a CS Topics class (Spring 2016)
Develop future goals and roadmap
2015/ 2016 Tasks
We are halfway through this process
◦ A lot still to learn
We want to share the decisions we face around four of the six identified challenges
We are looking for input from others, both during the conference and later, who are ahead of or behind us
And hoping that the input and collaboration result in better knowledge delivery to our students
Will document our experiences and results for future presentations
Why are We Here?
Selection of course topics (Roland)
Keeping up with the speed of change (Roland)
Ensuring proper prerequisite knowledge (Garrett)
Managing the lab environment (Sarah)
Software platform stability
Developing meaningful lab exercises
6 Big Data Course Design Challenges
Selecting Course Topics, While Keeping Up with Change
Yellow = Active Projects
Red = Non-Active Projects
Orange = Soon to be Sunset
( ) identifies CS Knowledge Areas
Teach the concepts, the technology will change
Teach the future, not the past
◦ Spark vs. MapReduce
Show how the platform works together
◦ Relational -> Sqoop -> HDFS -> MapReduce/Spark
Build on what they already know
◦ Relational DBMS, Java, SQL
Use lab exercises that tie in other CS topics
◦ Data Mining
◦ Bioinformatics
Guiding Principles
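The "teach the concepts, the technology will change" principle can be illustrated without a cluster at all. A minimal, Hadoop-free Java sketch (class and method names are our own invention, not any Hadoop or Spark API) expresses the same word count twice: once as explicit map/shuffle/reduce phases, and once as chained in-memory transformations in the functional style Spark encourages.

```java
import java.util.*;
import java.util.stream.*;

// Classroom sketch (no Hadoop or Spark required): the same word-count
// problem in a MapReduce style and a Spark-like functional style,
// to show that the concept outlives any one tool.
public class WordCountConcepts {

    // MapReduce style: an explicit map phase emitting (word, 1) pairs,
    // a shuffle that groups by key, and a reduce phase summing counts.
    static Map<String, Integer> mapReduceStyle(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines)                        // map phase
            for (String word : line.toLowerCase().split("\\s+"))
                mapped.add(Map.entry(word, 1));
        Map<String, List<Integer>> shuffled = new HashMap<>();
        for (var pair : mapped)                          // shuffle: group by key
            shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                    .add(pair.getValue());
        Map<String, Integer> reduced = new HashMap<>();
        for (var e : shuffled.entrySet())                // reduce phase
            reduced.put(e.getKey(),
                    e.getValue().stream().mapToInt(Integer::intValue).sum());
        return reduced;
    }

    // Spark-like style: the same pipeline as chained in-memory transformations.
    static Map<String, Long> sparkStyle(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("big data big ideas", "data data data");
        System.out.println(mapReduceStyle(lines)); // e.g. {big=2, data=4, ideas=1}
        System.out.println(sparkStyle(lines));
    }
}
```

Seeing both formulations side by side lets students map the explicit phases onto the chained transformations, which transfers directly once the real frameworks are introduced.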
Our Current Path through the Maze
Yellow = Active Projects
Red = Non-Active Projects
Green = Course Topics
() identifies CS Knowledge Areas
Pre-requisite knowledge for Big Data programming
Topic | Selected required coverage | Current coverage
Linux operating system | Directory structure, file management, text editors, core commands | None
Java | Basic Java programming, abstract classes and interfaces, serialization, JUnit testing, Log4j framework | Object-oriented Java programming course
Eclipse IDE | Java programming, generating JAR files, using JUnit, Log4j | Object-oriented Java programming course
Challenge: Students need additional Java/Eclipse experience, may be "rusty", and do not have Linux experience
Possible solutions:
◦ Offer a 1-credit laboratory course as a co-requisite to Big Data programming
◦ Offer a 1-credit "Programming in a Linux Environment" course that would be a pre-/co-requisite to Big Data programming and could also be taken by others
Pre-requisite knowledge for Big Data programming
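One prerequisite topic listed above, serialization, is worth a concrete picture: Hadoop's Writable types rest on the same round-trip idea shown here with plain java.io object serialization. This is our own minimal sketch (class and field names are hypothetical), not Hadoop code.

```java
import java.io.*;

// Sketch of the serialization prerequisite: write an object to bytes
// and read it back, standing in for a write to disk or a socket.
public class SerializationDemo {

    // A simple serializable value of the kind students would round-trip.
    static class Reading implements Serializable {
        private static final long serialVersionUID = 1L;
        final String sensor;
        final double value;
        Reading(String sensor, double value) { this.sensor = sensor; this.value = value; }
    }

    static Reading roundTrip(Reading r) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(r);                              // serialize
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Reading) in.readObject();                // deserialize
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Reading copy = roundTrip(new Reading("temp-01", 21.5));
        System.out.println(copy.sensor + " " + copy.value); // temp-01 21.5
    }
}
```

An exercise of roughly this size could anchor the proposed 1-credit laboratory course.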
In-House Cluster
- Create clusters of computers on campus (limited size)
- Establishment and maintenance costs fall on university IT

Infrastructure as a Service (IaaS): a scalable replacement for local IT
- Access infrastructure resources in the cloud in the form of virtual machines
- No maintenance; students can use the same tools professionals use
- AWS offers virtualized platforms: pay-as-you-use, so take care not to waste computing resources
Cloud Computing
• A modern-day, useful problem-solving tool
• Many universities are incorporating cloud computing into the curriculum
• Related knowledge and skills are becoming fundamental for computing professionals
• Will provide students with hands-on cloud computing experience
• Students will experience cutting-edge tools that help them grow professionally
Selection of course topics
Keeping up with the speed of change
Ensuring proper prerequisite knowledge
Managing the lab environment
Software platform stability
Developing meaningful lab exercises
6 Big Data Course Design Challenges
Additional References and Content
1. Albrecht, J., 2009. Bringing big systems to small schools: distributed systems for undergraduates. SIGCSE '09: Proceedings of the 40th ACM Technical Symposium on Computer Science Education.
2. Garrity et al., 2011. WebMapReduce: an accessible and adaptable tool for teaching map-reduce computing. SIGCSE '11: Proceedings of the 42nd ACM Technical Symposium on Computer Science Education.
3. Lin, J., 2008. Exploring large-data issues in the curriculum: a case study with MapReduce. TeachCL '08: Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics.
4. Makadev, A. & Wurst, K., 2015. Developing concentrations in Big Data analytics and software development at a small liberal arts university. Journal of Computing Sciences in Colleges, Volume 30, Issue 3.
5. Brandon, D., 2015. Teaching data analytics across the computing curricula. Journal of Computing Sciences in Colleges, Volume 30, Issue 5.
6. Wolffe, G., 2009. Teaching parallel computing: new possibilities. Journal of Computing Sciences in Colleges, Volume 25, Issue 1.
7. Brown, R. et al., 2010. Strategies for preparing CS students for the multicore world. Proceedings of the 2010 ITiCSE Working Group Reports.
8. www.acm.org/education/CS2013-final-report.pdf Accessed 3/16/2015
Additional References
Big Data Open Source Projects
Project Description
HDFS Hadoop Distributed File System: a distributed file system that manages large blocks and provides file management across a cluster of machines
Avro A remote procedure call and data serialization framework developed within Apache's Hadoop project
LZO Lempel-Ziv-Oberhumer (or LZO) is a lossless algorithm that compresses data to ensure high decompression speed.
MapReduce A programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
Spark An open-source cluster computing framework originally developed in the AMPLab at UC Berkeley using in-memory primitives to speed up performance.
Tez The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data
Cascading Cascading is a software abstraction layer for Apache Hadoop. Cascading is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language hiding the underlying complexity of MapReduce jobs.
Scalding Scalding is a Scala library that makes it easy to write MapReduce jobs in Hadoop. It is similar to other MapReduce platforms like Pig and Hive, but offers a higher level of abstraction by leveraging the full power of Scala and the JVM. Scalding is built on top of Cascading.
Programming Models/Frameworks
File Management
All definitions were sourced from Wikipedia or Apache project website
Big Data Open Source Projects
Project Description
MongoDB MongoDB (from humongous) is one of many cross-platform document-oriented databases.
Cassandra Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
Hbase HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java
Redis Redis is a data structure server. It is open-source, networked, in-memory, and stores keys with optional durability.
Data Management
Project Description
Sqoop Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.
Pig Pig is a high-level platform for creating MapReduce programs used with Hadoop.
Flume Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Data Ingestion
Big Data Open Source Projects
Project Description
SparkSQL Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.
Hive Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It was initially developed by Facebook.
Impala Cloudera Impala is Cloudera's open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop
Drill Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source version of Google's Dremel system which is available as an infrastructure service called Google BigQuery.
Query
Project Description
Oozie Oozie is a workflow scheduler system to manage Hadoop jobs.
Workflow Management
Big Data Open Source Projects
Project Description
Spark Streaming Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics.
Storm Apache Storm is a distributed computation framework for distributed, real-time processing of streaming data.
Kafka Apache Kafka is an open-source message broker project, which aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Samza Apache Samza is an open-source project developed by the Apache Software Foundation, written in Scala. The project aims to provide a near-realtime, asynchronous computational framework for stream processing
Streaming
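The streaming tools above differ in APIs but share one core idea: computing incrementally over an unbounded feed, often in fixed time windows. A minimal plain-Java sketch of that idea follows (no Spark Streaming or Storm API is used; class, method, and timestamp values are our own hypothetical example).

```java
import java.util.*;

// Sketch of the core streaming concept: a tumbling-window event count.
// Each event carries a timestamp; its window is timestamp / windowSize.
public class TumblingWindowCount {

    // Count events per fixed-size window over a (simulated) event feed.
    static Map<Long, Integer> countPerWindow(List<Long> eventTimestamps, long windowSize) {
        Map<Long, Integer> counts = new TreeMap<>();   // sorted by window index
        for (long ts : eventTimestamps)
            counts.merge(ts / windowSize, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        // Events arriving at simulated millisecond timestamps.
        List<Long> events = List.of(100L, 250L, 900L, 1100L, 1150L, 2050L);
        // 1-second tumbling windows: window 0 has 3 events, window 1 has 2, window 2 has 1.
        System.out.println(countPerWindow(events, 1000)); // {0=3, 1=2, 2=1}
    }
}
```

In a real framework the feed is unbounded and results are emitted as each window closes, but the per-window aggregation step students must reason about is exactly this one.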
Project Description
MLlib A distributed machine learning framework on top of Spark
Machine Learning