CSE 427 – CLOUD COMPUTING WITH BIG DATA APPLICATIONS
Fall 2016
Marion Neumann
COURSE SYLLABUS
ABOUT
• Marion Neumann
  • email: m dot neumann at wustl dot edu
  • office: Jolley Hall 222
  • office hours: MON 4-5pm
• Course website: http://sites.wustl.edu/neumann/courses/fall-2016/cse-427s/
• Piazza
  • use it for any questions and suggestions about the course! Sign up here: piazza.com/wustl/fall2016/cse427s
  • no anonymous posts
1/19/16 2
You are a real person!
LECTURES AND LABS
• Monday & Wednesday 2:30-4pm
• Lectures in Louderman 458
• Lab sessions in Eads 016
  – ca. 7-8 labs replace respective lectures
  – will be announced in class, on Piazza, and on the course webpage
• Lecture participation is beneficial
  • Black/white board notes
  • Demos/practical examples
  • Quizzes
• Lab participation is beneficial
  • VM debugging with fellow students, TAs, and instructor
  • data preparation for homeworks
  • Quizzes
HOMEWORK ASSIGNMENTS
• Homework assignments (40%)
  • weekly!
    – assigned on MON or as announced in class (after the lecture)
    – due following MON (2:30pm before the lecture) or as indicated on assignment
  • work in groups of 2 (you can use Piazza to find a partner)
  • use SVN repository for submissions → find instructions on how to use SVN on the course webpage
• Final Project (20%)
  • implementation component and conceptual component
  • due 14th of December
• TA office hours
  • TBA
LATE POLICY – NO MAKE UPS
• homework assignments must be turned in on time
• it is your responsibility to commit your work to your SVN repo
I am NOT taking late submissions!
• you get an automatic 1 class extension on every homework → use this with caution:
There is NO extension to this extension for any reason (at all).
• no makeup quizzes or assignments for any reason (this includes grade improvements, failed SVN commits, misinterpreted due dates, …)
COLLABORATION AND ACADEMIC DISHONESTY
• Collaboration Policy
You are encouraged to discuss the course material with other students. Discussing the material and the general form of solutions to the labs is a key part of the class. Since, for many of the assignments, there is no single “right” answer, talking to other students and to the TAs is a good thing. However, everything that you turn in should be your own work, unless we tell you otherwise. If you talk about assignments with another student, then you need to explicitly tell us on the hand-in. You are not allowed to copy answers or parts of answers from anyone else, or from material you find on the Internet. This will be considered willful cheating and will be dealt with according to the official collaboration policy.
Your solutions will be compared to the solutions of other students and solutions available ONLINE!
• Academic Dishonesty
Unless explicitly instructed otherwise, everything that you turn in for this course must be your own work. If you willfully misrepresent someone else’s work as your own, you are guilty of cheating. Cheating, in any form, will not be tolerated in this class. There is zero tolerance of Academic Dishonesty. I will be actively searching for academic dishonesty on all homework assignments, quizzes, and exams. If you are guilty of cheating on any assignment or exam, you will receive an F in the course and be referred to the School of Engineering Discipline Committee. In severe cases, this can lead to expulsion from the University, as well as possible deportation for international students. If you copy from anyone in the class, both parties will be penalized, regardless of which direction the information flowed.
This is your only warning.
IN-CLASS EXAMS AND QUIZZES
• 2 in-class exams
  • Count for 20% of total class performance each
  • Dates:
    – Final: 7 Dec 2016
    – Midterm: TBA
• Quizzes
  • will be given in lectures and labs
  • need Wi-Fi-enabled device (laptop, tablet, smart phone, …)
  • completion and results will be recorded (via student ID)
  • will be used to decide letter grades for borderline scores (less than 1% away from cutoff)
  • >60% quiz participation is required for “grade bump”
  • no makeup for quizzes
GRADING POLICY
• Grading Summary
  • 40% homework assignments
  • 20% final project
  • 20% midterm
  • 20% final
• if borderline: > 60% completed quizzes allow for a better grade
• This is only half a “Systems” class!
  • exams test your conceptual knowledge
  • exams count for 50% of the course performance
[Figure labels: implementation skills; conceptual understanding / critical thinking; CSE 427 S²]
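As a rough illustration of the grading summary, the sketch below combines the four component scores with the stated 40/20/20/20 weights and checks the quiz-based grade bump. The function names, the sample scores, and the 87.0 cutoff are hypothetical: the syllabus does not state letter-grade cutoffs, and reading "less than 1% away from cutoff" as less than one percentage point below it is an assumption.

```python
# Weights in percent, taken from the grading summary.
WEIGHTS = {"homework": 40, "final_project": 20, "midterm": 20, "final": 20}


def weighted_total(scores):
    """Combine component scores (each 0-100) into a course total (0-100)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def qualifies_for_bump(total, cutoff, quiz_participation):
    """Return True if the total is borderline and quiz participation is > 60%.

    Assumption: "borderline" means less than 1 percentage point below the
    cutoff. quiz_participation is the fraction of quizzes completed (0.0-1.0).
    """
    gap = cutoff - total
    return 0 < gap < 1.0 and quiz_participation > 0.60


if __name__ == "__main__":
    # Hypothetical student scores and cutoff, for illustration only.
    scores = {"homework": 90, "final_project": 85, "midterm": 80, "final": 88}
    total = weighted_total(scores)
    print(total, qualifies_for_bump(total, cutoff=87.0, quiz_participation=0.75))
```

Integer percent weights with a single division keep the arithmetic easy to check by hand.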
WARM-UP QUIZ
• go to: https://b.socrative.com/login/student/
• room name will be announced in class
• enter your student ID (6-digit number)
  • NOT your name
COURSE OBJECTIVE
• Understand conceptually
  • what Big Data is
  • what large-scale data management and analysis means
• Understand specifically
  • how MapReduce implements distributed data analysis
  • how a Hadoop cluster achieves parallel computing and data storage
  • the development process to tackle Big Data analysis tasks
  • which Hadoop Big Data tools are useful for which application
• Hands-on practice
  • using Hadoop
  • implementing algorithms in MapReduce (Java) and Spark (Python or Scala)
  • data analysis with Hadoop tools (Pig, Hive, Impala)
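To give a flavor of the MapReduce model named in these objectives, here is a minimal single-process sketch of a word count in plain Python. It only imitates the map, shuffle, and reduce phases in memory; it does not use the Hadoop or Spark APIs, and the function names are illustrative assumptions rather than course code.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop data"]))
```

In a real cluster the map and reduce calls would run in parallel on different machines, with the shuffle moving data between them; here everything runs sequentially so the data flow is easy to follow.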
TOPICS TO BE COVERED (SYLLABUS)
PART I: MapReduce
• Distributed File Systems & MapReduce
  • HDFS
  • Hadoop MapReduce
• Developing Programs in Hadoop MapReduce
• MapReducing Algorithms
• Introduction to Apache Spark
PART II: Big Data Analysis
• Application: Recommendation engines
• Data Analysis
  • Hadoop Pig, Hive, and Impala
• Data Management
  • Hadoop tools (Sqoop)
Contents may be subject to changes!
TOPICS TO BE COVERED (SYLLABUS)
PART III: (More) Big Data Applications
• Large-scale Machine Learning
  • Classification using MapReduce
  • Clustering in Spark
Optional: Structured and High-dimensional Data
• Graph Data
  • Link Analysis using PageRank
  • Social network analysis
• Information Retrieval/Finding Similar Items
  • Big feature spaces
  • Document retrieval
  • Locality-sensitive hashing
Contents may be subject to changes!
BACKGROUND & PREREQUISITES
• Programming
  • Java (← mainly)
  • Python [or Perl, Scala] (some)
  • SQL (← very useful) and relational databases (RDBMS)
• Algorithms
  • sorting
  • hashing
  • CSE 247
• Maths
  • matrices, some linear algebra
  • probabilities
  • graphs
  • machine learning (supervised learning, classification, training/testing)
COURSE MATERIALS
• The content of this class is derived largely from the Cloudera Developer Training for Apache Hadoop, Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop, and the Cloudera Developer Training for Apache Spark, which are made available to Washington University through the Cloudera Academic Partnership program.
• Cloudera Course VM → install before WED 7th of Sept!
• Further materials are adapted from the “Mining of Massive Data Sets” book (http://www.mmds.org/) and the class taught at Stanford by Jure Leskovec
READING
• Required books
  • Hadoop: The Definitive Guide (4th edition) by Tom White
  • Mining of Massive Data Sets by Jure Leskovec, Anand Rajaraman, Jeff Ullman (available for free online: http://mmds.org)
• Optional book
  • Data Algorithms: Recipes for Scaling Up with Hadoop and Spark by Mahmoud Parsian
• Reading and further materials will be posted on the course webpage.
• All readings are considered course material and are exam relevant!
Use CLDR14 to save 40% on O’Reilly books & 50% on ebooks!
BEFORE next lecture
SUMMARY
• All relevant information can be found on the course webpage: http://sites.wustl.edu/neumann/courses/fall-2016/cse-427s/
• Ask all questions on Piazza!
Do you have any questions??
You are a real person!